[SpeechLM] model, preprocessor and collect_stats #6279
jctian98 merged 7 commits into espnet:master from
Conversation
for more information, see https://pre-commit.ci
Code Review
This pull request introduces several core components for the SpeechLM module, including abstract job templates, a SpeechLM-specific job implementation, a parallel HuggingFace model wrapper, and data processing utilities. The changes are substantial and lay the groundwork for multimodal training. My review identified several critical issues in the new parallel.py and speechlm_job.py files. These include bugs that can lead to application crashes during model initialization or data preprocessing, as well as incorrect logic in the loss and accuracy calculations. Addressing these issues is crucial for the stability and correctness of the new module.
```python
cur_start, _ = intervals[0]
# Split intervals if they exceed max_loss_interval size
for _, end in intervals[1:]:
    if end - cur_start <= max_loss_interval:
        continue
    else:
        model.loss_intervals.append((cur_start, end))
        cur_start = end

# Add final interval if any tokens remain
if end > cur_start:
    model.loss_intervals.append((cur_start, end))
```
The logic for creating loss_intervals is buggy and will crash with an UnboundLocalError if a discrete modality has only one stream (i.e., len(intervals) == 1). This happens because the loop over intervals[1:] is not entered, leaving the end variable undefined for line 159. Furthermore, the current logic is flawed as it seems to incorrectly merge non-contiguous vocabulary intervals, which is not what the comment "Split large vocabularies into smaller intervals" implies. A more robust approach is to iterate through each vocabulary interval and split it if it exceeds max_loss_interval.
```python
for start, end in intervals:
    # Split a large interval into smaller chunks
    while end - start > max_loss_interval:
        model.loss_intervals.append((start, start + max_loss_interval))
        start += max_loss_interval
    if end > start:
        model.loss_intervals.append((start, end))
```

```python
if this_mask.int().sum() == 0:
    continue
# Compute loss only for vocabulary subset [start:end]
this_logits = hidden_states[this_mask]
```
In the _loss method, you are indexing hidden_states with this_mask. However, hidden_states has a shape of [B, T-1, S, H], while this_mask (for streams > 0) has a shape of [B, T-1, S-1]. This dimension mismatch will cause an indexing error. You should first slice hidden_states to select the streams from 1 onwards before applying the mask.
```python
this_logits = hidden_states[:, :, 1:][this_mask]
```

```python
self.num_stream = max(
    [io.num_stream() for io in multimodal_io.values() if io.is_discrete]
)
```
The calculation of self.num_stream will raise a ValueError if multimodal_io does not contain any discrete IOs. In this scenario, the list comprehension will be empty, and calling max() on an empty sequence is an error. You should handle this case to avoid a crash, for instance by providing a default value if no discrete IOs are present.
```diff
-self.num_stream = max(
-    [io.num_stream() for io in multimodal_io.values() if io.is_discrete]
-)
+self.num_stream = max(
+    [io.num_stream() for io in multimodal_io.values() if io.is_discrete] or [0]
+)
```
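The `or [0]` fallback works because an empty list is falsy in Python. A minimal sketch of the behavior, using made-up `(num_stream, is_discrete)` pairs in place of the real IO objects:

```python
# Hypothetical stand-ins for multimodal_io values: (num_stream, is_discrete) pairs
ios = [(4, True), (1, False), (8, True)]
num_stream = max([n for n, is_discrete in ios if is_discrete] or [0])
print(num_stream)  # 8

# With no discrete IOs the comprehension is empty, so `or [0]` kicks in and
# max() sees [0] instead of raising "max() arg is an empty sequence"
ios = [(1, False), (2, False)]
num_stream = max([n for n, is_discrete in ios if is_discrete] or [0])
print(num_stream)  # 0
```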
```python
apply_eots = [
    msg1[0] == msg2[0] for msg1, msg2 in zip(messages[:1], messages[1:])
] + [False]
```
The logic for generating apply_eots is incorrect. zip(messages[:1], messages[1:]) only compares the first two messages in the sequence. Consequently, apply_eots will have a length of at most 2. When this is zipped with messages, the loop will only process the first two messages, and any subsequent messages in the dialogue will be ignored. To compare all adjacent messages, you should zip messages[:-1] with messages[1:].
```diff
-apply_eots = [
-    msg1[0] == msg2[0] for msg1, msg2 in zip(messages[:1], messages[1:])
-] + [False]
+apply_eots = [
+    msg1[0] == msg2[0] for msg1, msg2 in zip(messages[:-1], messages[1:])
+] + [False]
```
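For illustration, here is how the two zips differ on a toy dialogue (the roles and contents are made up; each message is assumed to be a `(role, content)` tuple):

```python
messages = [("user", "hi"), ("user", "more"), ("assistant", "hello"), ("user", "bye")]

# Buggy: zip(messages[:1], messages[1:]) yields only one pair (msg 0 vs msg 1),
# so the flag list has length 2 and later messages are silently ignored
buggy = [m1[0] == m2[0] for m1, m2 in zip(messages[:1], messages[1:])] + [False]
print(buggy)  # [True, False]

# Fixed: zip(messages[:-1], messages[1:]) compares every adjacent pair,
# producing exactly one flag per message
fixed = [m1[0] == m2[0] for m1, m2 in zip(messages[:-1], messages[1:])] + [False]
print(fixed)  # [True, False, False, False]
```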
```python
)
loss[:, :, 1:].masked_scatter_(this_mask, this_loss)
if not self.training:
    this_acc = this_logits.argmax(-1) == this_targets - start
```
The accuracy calculation for streams 1 and onwards is incorrect. The this_targets variable is already adjusted to be relative to the interval's start (residual_ids[this_mask] - start). Subtracting start again when comparing with the argmax result will lead to incorrect accuracy metrics during evaluation.
```python
this_acc = this_logits.argmax(-1) == this_targets
```
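A small self-contained sketch of the double-offset problem, using made-up values for `start`, the targets, and the argmax predictions:

```python
start = 100                    # hypothetical vocabulary-interval offset
residual_ids = [100, 102, 101] # absolute token ids within [start, end)
predicted_local = [0, 2, 1]    # argmax over the interval's logits (interval-local ids)

# this_targets is already shifted into interval-local indices once
this_targets = [t - start for t in residual_ids]  # [0, 2, 1]

# Comparing against the local targets gives the true accuracy
acc_correct = sum(p == t for p, t in zip(predicted_local, this_targets)) / 3
print(acc_correct)  # 1.0

# Subtracting start a second time misaligns every comparison
acc_buggy = sum(p == (t - start) for p, t in zip(predicted_local, this_targets)) / 3
print(acc_buggy)  # 0.0
```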
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff           @@
##           master    #6279   +/- ##
=======================================
  Coverage   56.48%   56.48%
=======================================
  Files         898      898
  Lines       84823    84823
=======================================
  Hits        47914    47914
  Misses      36909    36909
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
…nto speechlm_model merge remove
I have addressed Gemini's review comments.
@jctian98 it might be helpful to add a line at the beginning of each file (including [espnet2/speechlm/model/speechlm/lm/parallel.py]) describing what the file does, for future development.
See https://github.com/espnet/espnet/tree/master/espnet2/tts as a reference.
This PR adds several core components of the SpeechLM module:
(1) abs_job.py: the overall job-template abstract class, in which users define the model and preprocessors. By defining different models and preprocessors, the codebase can support various tasks (e.g., SpeechLM, diffusion, etc.).
(2) speechlm_job.py: the inherited SpeechLM job template, which provides the building methods for the SpeechLM model and preprocessor. Specifically, it describes how the data dict is converted into multi-stream training sequences (i.e., the overall preprocessing logic).
(3) parallel.py: the modeling file built on top of any given HF LLM. It implements the multimodal forward pass and loss computation, but does not yet implement inference.
(4) task_conf_speechlm.py: the extended task definition, moved directly from #6257 with some changes to the folder structure.
(5) prepare_length_stats.py: the bin file that collects length stats before training is launched; these stats are then used for batching.
Other revisions are minor.
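For readers unfamiliar with the idea behind item (5): collecting length stats ahead of training usually amounts to one pass over the data that records per-utterance sequence lengths for the batch sampler. A generic sketch of that idea (not the actual prepare_length_stats.py; the function name and JSON output format are assumptions):

```python
import json

def collect_length_stats(examples, out_path=None):
    """One pass over (utt_id, sequence) pairs, recording each sequence length."""
    stats = {utt_id: len(seq) for utt_id, seq in examples}
    if out_path is not None:  # dump for the batch sampler to read later
        with open(out_path, "w") as f:
            json.dump(stats, f, indent=2)
    return stats

# Toy usage with made-up utterances
stats = collect_length_stats([("utt1", [1, 2, 3]), ("utt2", [4, 5])])
print(stats)  # {'utt1': 3, 'utt2': 2}
```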
Prior PRs: #6257, #6258, #6260, #6278
Request review: @Masao-Someki @wanchichen @siddhu001