inference trick from: "Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition" #6339

Open
Miamoto wants to merge 5 commits into espnet:master from Miamoto:inference_trick

@Miamoto commented Jan 16, 2026

What did you change?

•	Added an implementation of the inference trick from "Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition".
•	Introduced a new decoding option to enable the inference trick: add "inference_lf_trick: True" to the decode_asr.yaml file.

Why did you make this change?


Is your PR small enough?

yes


Additional Context

@dosubot (bot) added labels on Jan 16, 2026: size:L (This PR changes 100-499 lines, ignoring generated files), ASR (Automatic speech recognition), ESPnet2

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements an inference trick from the paper "Improving Cross-Attention based on Positional Alignment during Inference for Robust Long-form Speech Recognition". The changes introduce a new decoding option and modify several components to apply a Gaussian bias to cross-attention scores during inference. My review identified a critical issue where the new feature would be silently ignored when using optimized attention mechanisms like Flash Attention. I have also pointed out a minor type hint inconsistency. Overall, the implementation of the core logic appears sound, but the interaction with existing optimizations needs to be addressed to ensure the feature works correctly in all configurations.
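
As a rough illustration of what "a Gaussian bias on cross-attention scores" can look like, the sketch below adds a log-domain Gaussian penalty, centered on an expected encoder frame, to the raw scores before the softmax. This is not the PR's code and not necessarily the paper's exact formulation: the function names, the `center` and `sigma` parameters, and the choice to add the bias before the softmax are all assumptions made for illustration. It uses plain Python lists rather than ESPnet tensors.

```python
import math


def gaussian_bias(num_frames, center, sigma):
    """Hypothetical helper: log-domain Gaussian bias over encoder frames.

    The bias is 0 at `center` and increasingly negative away from it.
    """
    return [-((t - center) ** 2) / (2.0 * sigma ** 2) for t in range(num_frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_cross_attention(scores, center, sigma):
    """Add the Gaussian bias to raw attention scores, then normalize.

    Assumption: the bias is applied additively before the softmax.
    """
    bias = gaussian_bias(len(scores), center, sigma)
    return softmax([s + b for s, b in zip(scores, bias)])


# Flat (uninformative) scores over 8 encoder frames: the bias alone
# decides where the attention mass ends up.
scores = [0.0] * 8
weights = biased_cross_attention(scores, center=5.0, sigma=1.0)
```

With flat scores the resulting weights peak at the assumed center frame and fall off symmetrically around it. An explicit additive bias like this needs a code path that accepts it, which is why a fused attention kernel that is never given the bias would skip it without raising an error.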

Miamoto and others added 2 commits January 16, 2026 17:51
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@sw005320

  • Can you fix the CI error?
  • Can you come up with a better name for inference_lf_trick?
  • For the CTC prefix score, a margin is already prepared in CTCPrefixScorer, but it is not configurable via the current option arguments. So, how about using it?
