---
source_path: "models/types/stt.md"
canonical_url: "https://doc.sensory.com/tnl/7.8/models/types/stt/"
---

# Speech To Text _(STT only)_

These models do audio transcription with transformers.

STT models have [task-type](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#task-type)` == `[lvcsr](https://doc.sensory.com/tnl/7.8/api/setting-keys/values.md#lvcsr) and filenames that
by convention match `stt-*.snsr`

**Also see these related items:** [STT models](https://doc.sensory.com/tnl/7.8/models/index.md#stt-models) included in this distribution.

## Operation

```mermaid
flowchart TD
    start((start))
    fetch[/samples from ->audio-pcm/]
    audio(^sample-count)
    process[process]
    partial(^result-partial)
    intent(^nlu-intent)
    slot(^nlu-slot)
    result(^result)
    nlu{NLU<br>match?}

    slm{SLM<br>included?}
    generate[generate]
    slmstart(^slm-start)
    slmresultpartial(^slm-result-partial)
    slmresult(^slm-result)

    start --> fetch
    fetch --> audio
    audio --> process
    process --> fetch
    process -->|hypothesis| partial
    partial --> fetch
    process -->|VAD endpoint<br>or STREAM_END| nlu
    nlu -->|yes| intent
    nlu -->|no| result
    intent --> slot
    slot --> result
    slot -->|more| intent

    result --> slm
    slm -->|yes| slmstart
    slm -->|no| fetch
    slmstart -->|OK| generate
    slmstart -->|STOP| fetch
    generate -->|response| slmresultpartial
    slmresultpartial --> generate
    generate -->|done| slmresult
    slmresult --> fetch
```

Recognition flow.

1. Read audio data from [->audio-pcm](https://doc.sensory.com/tnl/7.8/api/setting-keys/runtime.md#-audio-pcm).
2. Invoke [^sample-count](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#sample-count-event).
3. Invoke [^result-partial](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#result-partial) with interim recognition hypotheses
   every [partial-result-interval](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#partial-result-interval) ms.
5. Continue processing until [STREAM_END](https://doc.sensory.com/tnl/7.8/api/inference.md#rc_stream_end) occurs on [->audio-pcm](https://doc.sensory.com/tnl/7.8/api/setting-keys/runtime.md#-audio-pcm),
   one of the event handlers returns a code other than [OK](https://doc.sensory.com/tnl/7.8/api/inference.md#rc_ok), or
   an external [VAD](https://doc.sensory.com/tnl/7.8/api/setting-keys/values.md#vad) detects a speech endpoint.
6. If NLU is configured, invoke [^nlu-intent](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#nlu-intent) and [^nlu-slot](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#nlu-slot) for each
   top-level result that matches.
7. Invoke [^result](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#result) with the final recognition hypothesis.
8. If an SLM is not available, resume processing at step 1.
9. Invoke [^slm-start](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#slm-start). If the handler returns [STOP](https://doc.sensory.com/tnl/7.8/api/inference.md#rc_stop),
   resume processing at step 1.
10. Invoke [^slm-result-partial](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#slm-result-partial) as the model generates text.
11. Invoke [^slm-result](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#slm-result) when text generation is complete.
12. Resume processing at step 1.

**Note:**

STT recognizers do **not** produce a final recognition hypothesis until they
run out of audio samples to process, or an external VAD detects a speech
endpoint.

With live audio you should use these with a VAD template such as
[tpl-vad-lvcsr](https://doc.sensory.com/tnl/7.8/models/index.md#tpl-vad-lvcsr), [tpl-opt-spot-vad-lvcsr](https://doc.sensory.com/tnl/7.8/models/index.md#tpl-opt-spot-vad-lvcsr), or [tpl-spot-vad-lvcsr](https://doc.sensory.com/tnl/7.8/models/index.md#tpl-spot-vad-lvcsr).

## Settings

**Available events:** [^nlu-intent](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#nlu-intent), [^nlu-slot](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#nlu-slot), [^result](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#result), [^result-partial](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#result-partial), [^sample-count](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#sample-count-event), [^slm-result](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#slm-result), [^slm-result-partial](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#slm-result-partial), [^slm-start](https://doc.sensory.com/tnl/7.8/api/setting-keys/events.md#slm-start)

**Available iterators:** _none_

**Available results:** [audio-stream](https://doc.sensory.com/tnl/7.8/api/setting-keys/results.md#audio-stream), [audio-stream-first](https://doc.sensory.com/tnl/7.8/api/setting-keys/results.md#audio-stream-first), [audio-stream-last](https://doc.sensory.com/tnl/7.8/api/setting-keys/results.md#audio-stream-last)

**Available runtime settings:** [->audio-pcm](https://doc.sensory.com/tnl/7.8/api/setting-keys/runtime.md#-audio-pcm), [audio-stream-from](https://doc.sensory.com/tnl/7.8/api/setting-keys/runtime.md#audio-stream-from), [audio-stream-to](https://doc.sensory.com/tnl/7.8/api/setting-keys/runtime.md#audio-stream-to)

**Available configuration settings:** [audio-stream-size](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#audio-stream-size), [custom-vocab](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#custom-vocab), [partial-result-interval](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#partial-result-interval), [samples-per-second](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#samples-per-second), [stt-profile](https://doc.sensory.com/tnl/7.8/api/setting-keys/configuration.md#stt-profile)

**Available values:** [lvcsr](https://doc.sensory.com/tnl/7.8/api/setting-keys/values.md#lvcsr)

**Also see these related items:** [live-spot.c](https://doc.sensory.com/tnl/7.8/api/sample/c/live-spot.md#live-spot-code), [snsr-eval.c](https://doc.sensory.com/tnl/7.8/api/sample/c/snsr-eval.md#snsr-eval-code), [PhraseSpot.java](https://doc.sensory.com/tnl/7.8/api/sample/android/enroll-trigger.md#et-code), [segmentSpottedAudio.java](https://doc.sensory.com/tnl/7.8/api/sample/java/segmentSpottedAudio.md#segmentspottedaudio-code)

<!-- Abbreviation definitions from includes/abbreviations.md -->
*[API]: Application Programming Interface
*[LVCSR]: Large Vocabulary Continuous Speech Recognition model, feed-forward neural net acoustic model with FST decoder
*[NLU]: Natural Language Understanding model
*[SLM]: Generative Small Language Model
*[STT]: Speech To Text: transformers with language model and CTC decoding
*[TNL]: TrulyNatural, Sensory's large-vocabulary speech recognition technology
*[VAD]: Voice Activity Detector
