Commit f68484c

naymaraq authored
Unified inference of streaming ASR (NVIDIA-NeMo#14817)
* init inference folders
* added base asr inference
* add ctc and rnnt inference classes
* small changes for ctc/rnnt inference
* add cache aware ctc/rnnt inference classes
* finilize asr inference part
* add word class
* add enums file
* add alignment preserving itn
* add punctuation/capitalization model
* add audio_io and progressbar files
* add framing and buffering files
* mv common/inference/utils into asr/inference/utils
* add StreamingState objects
* temporary rm enhancement stuff
* rm common/inference
* add greedy decoders for CTC/RNNT
* add endpointing files
* add text processing
* mv itn_utils into utils
* add bpe_decoder, context_manager for cache aware, recognizer_utils
* add base_recognizer and recognizer interface files
* add recognizers
* add factory
* add inference example and asr_client.py
* minor fix
* minor fixes
* add example usage
* add jsonl support
* rm niva prefix
* fix docstrings
* mv RequestType into enums.py
* rm redundant setters
* add a log_level to config.yaml
* setup log_level in RecognizerBuilder
* add comments in multi stream and fix docstrings in buffering
* conditional import for diskcache
* set log level to INFO
* add MPS device support
* add tests
* move inference into examples/asr/asr_chunked_inference/ctc
* rm duplicated create_partial_transcript method
* Apply isort and black reformatting
* resolve flake8 errors
* resolve return type
* fix imports in tests
* optimize bpe_decoder
* optimize log prob normalization
* optimize split_text function
* fix parital batching, improved GPU utilization
* simplify ctc greedy decoder
* add a method to perform ITN on a list of texts
* remove duplicated code in enums
* remove unnecessary pad_to logging
* modified update_punctuation_and_language_tokens_timestamps function to ensure correct global timestamps for eou calculation
* Apply isort and black reformatting
* [refactor: segment-level output] conditional import for pynini and nemo_text_processing
* [refactor: segment-level output] fix configs, added asr_output_granularity
* [refactor: segment-level output] write segment/word level output into json instead of ctm
* [refactor: segment-level output] add output granuality to request options
* [refactor: segment-level output] add segment related fields to state
* [refactor: segment-level output] add remove repeated punctuation function
* [refactor: segment-level output] add TextSegment class
* [refactor: segment-level output] update bpe decoder to support text segment
* [refactor: segment-level output] update recognizers
* [refactor: segment-level output] update text processing to support segment-level output
* rm unused and duplicated code
* Apply isort and black reformatting
* code cleanup
* Apply isort and black reformatting
* rm unused code and code cleanup
* Set num_slots to 1024 and add a num_slots parameter to the config files
* removed hyp.alignment processing codes
* disable amp
* mv diskcache req into requirements_asr.txt
* set use_amp to true and make typing consistent
* use match/case for readability
* rm lambdas from punctuation_capitalization_config.py
* rm detect_eou method from RNNTGreedyEndpointing
* reuse read_manifest from manifest_utils
* use librosa instead of soundfile
* unfreeze ASRRequestOptions dataclass
* set use_amp to false for buffered CTC/RNNT recognizers, improved throughput
* change matmul precision to high for cache aware models
* optimized audio buffer shifting
* Move running scripts and YAML files out of the ctc folder
* reorganize file structure
* Apply isort and black reformatting
* Minor code simplifications
* rm duplicated initializations from recognizers
* remove package version for diskcache
* move tqdm import to the top
* simplify millisecond_to_frames function
* raise a ValueError in case of stream_id > n_audio_files
* fix return types
* use list/dict/... instead of List/Dict/...
* use keyword argument passing to create CacheFeatureBufferer
* clean up state resetting logic
* reuse normalize_batch
* rename verbatim_transcripts and automatic_punctuation
* rename recognizers to pipelines
* rename asr/*_inference -> model_wrappers/*_inference_wrapper
* Apply isort and black reformatting
* reorgonize pnc, itn, text_processing params
* improved code readability in pipeline initializations
* Apply isort and black reformatting
* add CI script for testing
* add output_dir in CI test
* move python running script into new folder
* renamed asr_streaming_infer -> asr_streaming_inference
* correct path in CI test
* fix: variable may be used before it is initialized
* fix docstring in itn/ folder
* fix docstring in model_wrappers/ folder
* fix docstring in utils/ folder
* fix docstring in pipelines/ folder
* fix docstring in streaming/ folder
* remove PnC codes since nlp models are no longer supported
* minor changes
* return step output from transcribe_step method
* Apply isort and black reformatting
* fix functional_test
* increase timeout for L0_Unit_Tests_CPU_ASR
* rm cache aware inference from functional test

---------

Signed-off-by: naymaraq <[email protected]>
Co-authored-by: naymaraq <[email protected]>
1 parent 155b255 commit f68484c

91 files changed

Lines changed: 11418 additions & 2 deletions

.github/workflows/cicd-main-speech.yml

Lines changed: 3 additions & 1 deletion
@@ -41,7 +41,7 @@ jobs:
   - script: L0_Unit_Tests_CPU_ASR
     runner: azure-gpu-vm-runner1-cpu
     cpu-only: true
-    timeout: 20
+    timeout: 30
   - script: L0_Unit_Tests_GPU_TTS
     runner: self-hosted-azure-gpus-1
   - script: L0_Unit_Tests_CPU_TTS
@@ -129,6 +129,8 @@ jobs:
     script: L2_Speech_Transcription_Speech_to_Text_Streaming_Infer
   - runner: self-hosted-azure
     script: L2_Speech_Transcription_Speech_to_Text_Cache_Aware_Infer
+  - runner: self-hosted-azure
+    script: L2_Speech_Transcription_Streaming_Inference
   - runner: self-hosted-azure
     script: L2_Speech_Transcription_Canary_Transcribe_Full_Manifest
   - runner: self-hosted-azure

examples/asr/asr_chunked_inference/README.md

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ On the other hand, if you increase your chunk size, then the delay between spoke
 ## Chunked Inference

 For MultitaskAED models, we provide a script to perform chunked inference. This script will split the input audio into non-overlapping chunks and perform inference on each chunk. The script will then concatenate the results to provide the final transcript.
+
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Universal Streaming Inference
+
+The `asr_streaming_infer.py` script enables streaming inference for both buffered (CTC/RNNT/TDT) and cache-aware (CTC/RNNT) ASR models. It supports processing a single audio file, a directory of audio files, or a manifest file.
+
+Beyond streaming ASR, the script also supports:
+
+* **Inverse Text Normalization (ITN)**
+* **End-of-Utterance (EoU) Detection**
+* **Word-level and Segment-level Output**
+
+All related configurations can be found in the `../conf/asr_streaming_inference/` directory.
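The README names the script and the config directory but shows no invocation. As a rough sketch, the command from the script's own docstring can be assembled in Python as below. The helper name `build_command` and the paths `audio/` and `out.json` are hypothetical placeholders; the option names are the ones listed in the docstring and config files.

```python
import shlex
import sys


def build_command(audio_file, output_filename, config_name="config.yaml", **overrides):
    """Assemble the Hydra command line for asr_streaming_infer.py (hypothetical helper)."""
    cmd = [
        sys.executable,
        "asr_streaming_infer.py",
        "--config-path=../conf/asr_streaming_inference/",
        f"--config-name={config_name}",
        f"audio_file={audio_file}",
        f"output_filename={output_filename}",
    ]
    # Any further Hydra overrides are passed as key=value pairs.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_command(
    "audio/",            # placeholder input directory
    "out.json",          # placeholder output file
    lang="en",
    enable_itn=True,
    asr_output_granularity="segment",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the directory that contains the script.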
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This script serves as the entry point for local ASR inference, supporting buffered CTC/RNNT/TDT and cache-aware CTC/RNNT inference.
+
+The script performs the following steps:
+    (1) Accepts as input a single audio file, a directory of audio files, or a manifest file.
+        - Note: Input audio files must be 16 kHz, mono-channel WAV files.
+    (2) Creates a pipeline object to perform inference.
+    (3) Runs inference on the input audio files.
+    (4) Writes the transcriptions to an output json/jsonl file. Word/Segment level output is written to a separate JSON file.
+
+Example usage:
+python asr_streaming_infer.py \
+        --config-path=../conf/asr_streaming_inference/ \
+        --config-name=config.yaml \
+        audio_file=<path to audio file, directory of audio files, or manifest file> \
+        output_filename=<path to output jsonfile> \
+        lang=en \
+        enable_pnc=False \
+        enable_itn=True \
+        asr_output_granularity=segment \
+        ...
+        # See ../conf/asr_streaming_inference/*.yaml for all available options
+
+Note:
+    The output file is a json file with the following structure:
+    {"audio_filepath": "path/to/audio/file", "text": "transcription of the audio file", "json_filepath": "path/to/json/file"}
+"""
+
+
+from time import time
+
+import hydra
+
+
+from nemo.collections.asr.inference.factory.pipeline_builder import PipelineBuilder
+from nemo.collections.asr.inference.utils.manifest_io import calculate_duration, dump_output, get_audio_filepaths
+from nemo.collections.asr.inference.utils.progressbar import TQDMProgressBar
+from nemo.utils import logging
+
+# disable nemo_text_processing logging
+try:
+    from nemo_text_processing.utils import logger as nemo_text_logger
+
+    nemo_text_logger.propagate = False
+except ImportError:
+    # NB: nemo_text_processing requires pynini, which is tricky to install on MacOS
+    # since nemo_text_processing is not necessary for ASR, wrap the import
+    logging.warning("NeMo text processing library is unavailable.")
+
+
+@hydra.main(version_base=None)
+def main(cfg):
+
+    # Set the logging level
+    logging.setLevel(cfg.log_level)
+
+    # Reading audio filepaths
+    audio_filepaths = get_audio_filepaths(cfg.audio_file, sort_by_duration=True)
+    logging.info(f"Found {len(audio_filepaths)} audio files")
+
+    # Build the pipeline
+    pipeline = PipelineBuilder.build_pipeline(cfg)
+    progress_bar = TQDMProgressBar()
+
+    # Run the pipeline
+    start = time()
+    output = pipeline.run(audio_filepaths, progress_bar=progress_bar)
+    exec_dur = time() - start
+
+    # Calculate RTFX
+    data_dur = calculate_duration(audio_filepaths)
+    rtfx = data_dur / exec_dur if exec_dur > 0 else float('inf')
+    logging.info(f"RTFX: {rtfx:.2f} ({data_dur:.2f}s / {exec_dur:.2f}s)")
+
+    # Dump the transcriptions to an output file
+    dump_output(output, cfg.output_filename, cfg.output_dir)
+    logging.info(f"Transcriptions written to {cfg.output_filename}")
+    logging.info("Done!")
+
+
+if __name__ == "__main__":
+    main()
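The docstring gives the structure of each output record (`audio_filepath`, `text`, `json_filepath`). A minimal sketch of reading such a file back is shown below. It assumes the jsonl variant, with one JSON object per line, and `read_transcripts` is a hypothetical helper rather than part of NeMo. The sample record is invented only to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def read_transcripts(output_path):
    """Return one dict per non-empty line of a JSON-lines output file."""
    records = []
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Write a made-up record in the documented structure, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.json"
    sample = {"audio_filepath": "a.wav", "text": "hello world", "json_filepath": "a.json"}
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    records = read_transcripts(path)

for record in records:
    print(record["audio_filepath"], "->", record["text"])
```

The `json_filepath` field of each record points at the separate file that holds the word-level or segment-level output.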
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+# ================================
+# ASR Configuration
+# ================================
+asr:
+  model_name: nvidia/parakeet-ctc-1.1b  # Pre-trained CTC/hybrid model from NGC/HuggingFace or local .nemo file path
+  device: cuda  # Device for inference: 'cuda' or 'cpu'
+  device_id: 0  # GPU device ID
+  compute_dtype: bfloat16  # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32'
+  use_amp: false  # Enable Automatic Mixed Precision
+
+
+# ==========================================
+# Inverse Text Normalization Configuration
+# ==========================================
+itn:
+  input_case: lower_cased  # Input text case handling: 'lower_cased', 'cased'
+  whitelist: null  # Custom whitelist for ITN processing
+  overwrite_cache: false  # Whether to overwrite existing cache files
+  max_number_of_permutations_per_split: 729  # Maximum permutations allowed per text split during ITN processing
+  left_padding_size: 4  # Padding size (#spans) for ITN context
+  batch_size: 32  # Batch size for ITN inference
+  n_jobs: 16  # Number of parallel jobs for ITN processing
+
+
+# ========================
+# Confidence estimation
+# ========================
+confidence:
+  exclude_blank: true  # Exclude blank tokens when calculating confidence
+  aggregation: mean  # Aggregation method for confidence across time steps
+  method_cfg:
+    name: entropy  # Confidence estimation method: 'max_prob' or 'entropy'
+    entropy_type: tsallis
+    alpha: 0.5
+    entropy_norm: exp
+
+
+# ========================
+# Endpointing settings
+# ========================
+endpointing:
+  stop_history_eou: 800  # Time window (ms) for evaluating EoU
+  residue_tokens_at_end: 2  # Number of residual tokens used for EoU
+
+
+# ========================
+# Streaming configuration
+# ========================
+streaming:
+  sample_rate: 16000  # Audio sample rate in Hz
+  batch_size: 256  # Number of audio frames per batch
+  left_padding_size: 1.6  # Left padding duration in seconds
+  right_padding_size: 1.6  # Right padding duration in seconds
+  chunk_size: 4.8  # Audio chunk size in seconds
+  word_boundary_tolerance: 4  # Tolerance for word boundaries
+  request_type: feature_buffer  # Type of request: frame or feature_buffer
+  padding_mode: right  # Padding mode: left or right. How to pad frames to match the required buffer length
+
+
+# ========================
+# Pipeline settings
+# ========================
+matmul_precision: high  # Matrix multiplication precision: highest, high, medium
+log_level: 20  # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL)
+pipeline_type: buffered  # Pipeline type: buffered, cache_aware
+asr_decoding_type: ctc  # Decoding method: ctc or rnnt
+
+
+# ========================
+# Runtime arguments defined at runtime via command line
+# ========================
+audio_file: null  # Path to audio file, directory, or manifest JSON
+output_filename: null  # Path to output transcription JSON file
+output_dir: null  # Directory to save time-aligned output
+enable_pnc: false  # Whether to apply punctuation & capitalization
+enable_itn: false  # Whether to apply inverse text normalization
+asr_output_granularity: segment  # Output granularity: word or segment
+cache_dir: null  # Directory to store cache (e.g., .far files)
+lang: null  # Language code for ASR model
+return_tail_result: false  # Whether to return the tail labels left in the right padded side of the buffer
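The `streaming` values in this buffered config can be read as a buffer layout. The sketch below works through the arithmetic under one assumption that the config comments suggest but do not state outright: each step sees the left padding, the chunk, and the right padding together as one buffer. `buffer_layout` is a hypothetical helper, not a NeMo function.

```python
def buffer_layout(sample_rate, chunk_size, left_padding_size, right_padding_size):
    """Convert the streaming durations (seconds) into sample counts.

    Assumption: buffer = left padding + chunk + right padding.
    """
    buffer_secs = left_padding_size + chunk_size + right_padding_size
    return {
        "buffer_secs": buffer_secs,
        # round() guards against floating-point noise such as 76800.00000000001
        "chunk_samples": round(chunk_size * sample_rate),
        "buffer_samples": round(buffer_secs * sample_rate),
    }


# Values taken from the `streaming` section of the config above.
layout = buffer_layout(
    sample_rate=16000,
    chunk_size=4.8,
    left_padding_size=1.6,
    right_padding_size=1.6,
)
print(layout)
```

Under that assumption, a 4.8 s chunk advances the stream by 76,800 samples per step while the model is given about 8 s (128,000 samples) of context.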
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# ================================
+# ASR Configuration
+# ================================
+asr:
+  model_name: nvidia/parakeet-rnnt-1.1b  # Pre-trained RNNT/hybrid model from NGC/HuggingFace or local .nemo file path
+  device: cuda  # Device for inference: 'cuda' or 'cpu'
+  device_id: 0  # GPU device ID
+  compute_dtype: bfloat16  # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32'
+  use_amp: false  # Enable Automatic Mixed Precision
+  ngram_lm_model: ""  # Path to ngram language model
+  ngram_lm_alpha: 0.0  # Alpha for language model
+
+
+# ==========================================
+# Inverse Text Normalization Configuration
+# ==========================================
+itn:
+  input_case: lower_cased  # Input text case handling: 'lower_cased', 'cased'
+  whitelist: null  # Custom whitelist for ITN processing
+  overwrite_cache: false  # Whether to overwrite existing cache files
+  max_number_of_permutations_per_split: 729  # Maximum permutations allowed per text split during ITN processing
+  left_padding_size: 4  # Padding size (#spans) for ITN context
+  batch_size: 32  # Batch size for ITN inference
+  n_jobs: 16  # Number of parallel jobs for ITN processing
+
+
+# ========================
+# Confidence estimation
+# ========================
+confidence:
+  exclude_blank: true  # Exclude blank tokens when calculating confidence
+  aggregation: mean  # Aggregation method for confidence across time steps
+  method_cfg:
+    name: entropy  # Confidence estimation method: 'max_prob' or 'entropy'
+    entropy_type: tsallis
+    alpha: 0.5
+    entropy_norm: exp
+
+
+# ========================
+# Endpointing settings
+# ========================
+endpointing:
+  stop_history_eou: 800  # Time window (ms) for evaluating EoU
+  residue_tokens_at_end: 2  # Number of residual tokens used for EoU
+
+
+# ========================
+# Streaming configuration
+# ========================
+streaming:
+  sample_rate: 16000  # Audio sample rate in Hz
+  batch_size: 256  # Number of audio frames per batch
+  left_padding_size: 1.6  # Left padding duration in seconds
+  right_padding_size: 1.6  # Right padding duration in seconds
+  chunk_size: 4.8  # Audio chunk size in seconds
+  word_boundary_tolerance: 4  # Tolerance for word boundaries
+  request_type: feature_buffer  # Type of request: frame or feature_buffer
+  stateful: true  # Whether to use stateful processing
+  padding_mode: right  # Padding mode: left or right. How to pad frames to match the required buffer length
+
+
+# ========================
+# Pipeline settings
+# ========================
+matmul_precision: high  # Matrix multiplication precision: highest, high, medium
+log_level: 20  # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL)
+pipeline_type: buffered  # Pipeline type: buffered, cache_aware
+asr_decoding_type: rnnt  # Decoding method: ctc or rnnt
+
+
+# ========================
+# Runtime arguments defined at runtime via command line
+# ========================
+audio_file: null  # Path to audio file, directory, or manifest JSON
+output_filename: null  # Path to output transcription JSON file
+output_dir: null  # Directory to save time-aligned output
+enable_pnc: false  # Whether to apply punctuation & capitalization
+enable_itn: false  # Whether to apply inverse text normalization
+asr_output_granularity: segment  # Output granularity: word or segment
+cache_dir: null  # Directory to store cache (e.g., .far files)
+lang: null  # Language code for ASR model
+return_tail_result: false  # Whether to return the tail labels left in the right padded side of the buffer
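The `confidence.method_cfg` block selects an entropy-based method with `entropy_type: tsallis` and `alpha: 0.5`. For orientation, here is a sketch of the textbook Tsallis entropy of a probability vector. It only illustrates the quantity behind these settings: the `exp` normalization that turns entropy into a confidence score is not reproduced, and `tsallis_entropy` is a hypothetical helper, not NeMo's implementation.

```python
import math


def tsallis_entropy(probs, alpha=0.5):
    """Textbook Tsallis entropy: (sum_i p_i**alpha - 1) / (1 - alpha).

    As alpha approaches 1 this tends to the Shannon entropy, which is
    used here for the alpha == 1 case to avoid dividing by zero.
    """
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (sum(p ** alpha for p in probs) - 1.0) / (1.0 - alpha)


peaked = tsallis_entropy([1.0, 0.0, 0.0, 0.0])       # model is certain
uniform = tsallis_entropy([0.25, 0.25, 0.25, 0.25])  # model is maximally unsure
print(peaked, uniform)
```

Entropy is lowest for a one-hot distribution and highest for a uniform one, so an entropy-based confidence score has to decrease as this value grows.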
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+# ================================
+# ASR Configuration
+# ================================
+asr:
+  model_name: stt_en_fastconformer_hybrid_large_streaming_multi  # Pre-trained CTC/hybrid model from NGC/HuggingFace or local .nemo file path
+  device: cuda  # Device for inference: 'cuda' or 'cpu'
+  device_id: 0  # GPU device ID
+  compute_dtype: bfloat16  # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32'
+  use_amp: true  # Enable Automatic Mixed Precision
+
+
+# ==========================================
+# Inverse Text Normalization Configuration
+# ==========================================
+itn:
+  input_case: lower_cased  # Input text case handling: 'lower_cased', 'cased'
+  whitelist: null  # Custom whitelist for ITN processing
+  overwrite_cache: false  # Whether to overwrite existing cache files
+  max_number_of_permutations_per_split: 729  # Maximum permutations allowed per text split during ITN processing
+  left_padding_size: 4  # Padding size (#spans) for ITN context
+  batch_size: 32  # Batch size for ITN inference
+  n_jobs: 16  # Number of parallel jobs for ITN processing
+
+
+# ========================
+# Confidence estimation
+# ========================
+confidence:
+  exclude_blank: true  # Exclude blank tokens when calculating confidence
+  aggregation: mean  # Aggregation method for confidence across time steps
+  method_cfg:
+    name: entropy  # Confidence estimation method: 'max_prob' or 'entropy'
+    entropy_type: tsallis
+    alpha: 0.5
+    entropy_norm: exp
+
+
+# ========================
+# Endpointing settings
+# ========================
+endpointing:
+  stop_history_eou: 800  # Time window (ms) for evaluating EoU
+  residue_tokens_at_end: 2  # Number of residual tokens used for EoU
+
+
+# ========================
+# Streaming configuration
+# ========================
+streaming:
+  sample_rate: 16000  # Audio sample rate in Hz
+  batch_size: 256  # Number of audio frames per batch
+  word_boundary_tolerance: 4  # Tolerance for word boundaries
+  att_context_size: [70,13]  # Attention context size: [70,13],[70,6],[70,1],[70,0]
+  use_cache: true  # Whether to use cache for streaming
+  use_feat_cache: true  # Whether to cache mel-spec features, set false to re-calculate all mel-spec features in audio buffer
+  chunk_size_in_secs: null  # Amount of audio to load for each streaming step, e.g., 0.08s for FastConformer. Set to `null` for using default size equal to 1+lookahead frames.
+  request_type: frame  # Type of request: frame, only frame is supported for cache-aware streaming
+  num_slots: 1024  # Number of slots in the context manager: must be >= batch_size
+
+
+# ========================
+# Pipeline settings
+# ========================
+matmul_precision: high  # Matrix multiplication precision: highest, high, medium
+log_level: 20  # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL)
+pipeline_type: cache_aware  # Pipeline type: buffered, cache_aware
+asr_decoding_type: ctc  # Decoding method: ctc or rnnt
+
+# ========================
+# Runtime arguments defined at runtime via command line
+# ========================
+audio_file: null  # Path to audio file, directory, or manifest JSON
+output_filename: null  # Path to output transcription JSON file
+output_dir: null  # Directory to save time-aligned output
+enable_pnc: false  # Whether to apply punctuation & capitalization
+enable_itn: false  # Whether to apply inverse text normalization
+asr_output_granularity: segment  # Output granularity: word or segment
+cache_dir: null  # Directory to store cache (e.g., .far files)
+lang: null  # Language code for ASR model
+return_tail_result: false  # Whether to return the tail labels left in the right padded side of the buffer
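Two comments in the cache-aware `streaming` section imply simple rules: the default chunk covers "1+lookahead frames", and `num_slots` must be at least `batch_size`. A minimal sketch of both follows. It assumes the second entry of `att_context_size` is the lookahead in frames and that one frame is 0.08 s, the FastConformer figure quoted in the config comment. Both helper names are hypothetical.

```python
def default_chunk_secs(att_context_size, frame_secs=0.08):
    """Default chunk duration: (1 + lookahead frames) * frame duration.

    Assumption: att_context_size is [left_context, lookahead] in frames.
    """
    _, lookahead = att_context_size
    return (1 + lookahead) * frame_secs


def check_slots(num_slots, batch_size):
    """Raise if the context manager would have fewer slots than streams per batch."""
    if num_slots < batch_size:
        raise ValueError(f"num_slots ({num_slots}) must be >= batch_size ({batch_size})")


# Values from the config above: att_context_size=[70,13], num_slots=1024, batch_size=256.
check_slots(num_slots=1024, batch_size=256)
for context in ([70, 13], [70, 6], [70, 1], [70, 0]):
    print(context, f"{default_chunk_secs(context):.2f}s")
```

Under these assumptions, a smaller lookahead shortens each step, which lowers latency at the cost of less future context.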
