Clotho_v2 Audio Captioning (DCASE 2023 implementation) #5967

Merged
sw005320 merged 63 commits into espnet:master from Shikhar-S:clotho_dcase23
Dec 8, 2024

Conversation

@Shikhar-S
Contributor

@Shikhar-S Shikhar-S commented Nov 30, 2024

What?

Implementation of the BEATs-BART Audio Captioning system described in this paper.

Our Scores
cider_d: 46.1
spider: 29.8

Paper's comparable scores
cider_d: 45.8
spider: 29.6

(Other numbers for this implementation are in the README; the paper's numbers can be found in the second-to-last row of Table 2 (w/o Instructor).)
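
For readers comparing the two metrics: SPIDEr is conventionally defined as the mean of CIDEr-D and SPICE, so an approximate SPICE value can be backed out of each score pair above. The sketch below assumes that standard definition and that the reported scores are rounded to one decimal place, so the implied values are approximate and the README remains the authoritative source. `implied_spice` is a hypothetical helper, not part of this PR.

```python
def implied_spice(cider_d: float, spider: float) -> float:
    """Back out SPICE, assuming SPIDEr = (CIDEr-D + SPICE) / 2."""
    return round(2 * spider - cider_d, 1)


# Scores quoted in the PR description (percent scale).
ours = implied_spice(cider_d=46.1, spider=29.8)
paper = implied_spice(cider_d=45.8, spider=29.6)

print(f"implied SPICE, this PR: {ours}")
print(f"implied SPICE, paper:   {paper}")
```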

This PR also adds 1) an option to skip loading pre-trained weights in the ESPnet HuggingFace decoder, and 2) the ability to modify the pre-trained model's architecture using external configs.
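
As a rough illustration of the two options, the sketch below starts from the architecture config that ships with a checkpoint, applies overrides from an external config, and records whether the weights should come from the checkpoint or be randomly initialized. Everything here is an assumption for illustration: `PRETRAINED_CONFIG`, `build_decoder_spec`, and the `load_pretrained_weights` flag are hypothetical names, the field names are only modeled on HuggingFace BART config keys, and this is not the code added in this PR.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical stand-in for the config bundled with a pre-trained checkpoint.
PRETRAINED_CONFIG: Dict[str, Any] = {
    "d_model": 768,
    "decoder_layers": 6,
    "decoder_attention_heads": 12,
}


def build_decoder_spec(
    overrides: Optional[Dict[str, Any]] = None,
    load_pretrained_weights: bool = True,
) -> Dict[str, Any]:
    """Merge external overrides into the base config and pick an init mode."""
    config = deepcopy(PRETRAINED_CONFIG)
    for key, value in (overrides or {}).items():
        # Reject typos early instead of silently ignoring them.
        if key not in config:
            raise KeyError(f"unknown config field: {key}")
        config[key] = value
    if config["d_model"] % config["decoder_attention_heads"] != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    init = "pretrained" if load_pretrained_weights else "random"
    return {"config": config, "init": init}


# More attention heads than the base config, without loading weights.
spec = build_decoder_spec(
    {"decoder_attention_heads": 16}, load_pretrained_weights=False
)
print(spec)
```

Changing the architecture and skipping weight loading tend to go together: once a layer's shape differs from the checkpoint, the stored weights for that layer no longer fit.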

Why?

This PR improves on the data preparation (training on all 5 captions, text preprocessing), and implements the same decoder as the paper's implementation (more attention heads, no pre-trained weight loading).
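
A minimal sketch of what "training on all 5 captions" can look like in data preparation: each audio clip is expanded into five (utterance ID, caption) pairs, with light text normalization applied to every caption. The column names (`file_name`, `caption_1` to `caption_5`), the ID scheme, and the normalization rules (lowercasing, punctuation removal, whitespace collapsing) are assumptions for illustration; the recipe's actual preprocessing may differ. The example row is made up.

```python
import string
from typing import Dict, List, Tuple

# Translation table that deletes all ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_caption(text: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def expand_captions(
    rows: List[Dict[str, str]], n_captions: int = 5
) -> List[Tuple[str, str]]:
    """Turn one row per clip into one (utt_id, caption) pair per caption."""
    pairs = []
    for row in rows:
        stem = row["file_name"].rsplit(".", 1)[0]
        for i in range(1, n_captions + 1):
            caption = normalize_caption(row[f"caption_{i}"])
            pairs.append((f"{stem}_{i}", caption))
    return pairs


# Made-up example row in the assumed layout.
rows = [
    {
        "file_name": "rain.wav",
        "caption_1": "Rain falls on a roof.",
        "caption_2": "Heavy rain is hitting   a metal roof!",
        "caption_3": "Water drips, steadily, onto a surface.",
        "caption_4": "A storm pours rain on a building.",
        "caption_5": "Raindrops patter on a tin roof.",
    }
]

pairs = expand_captions(rows)
for utt_id, caption in pairs:
    print(utt_id, caption)
```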

See also

Builds on top of #5915

Shikhar-S and others added 30 commits September 23, 2024 09:30
@sw005320 sw005320 added the Recipe label Dec 1, 2024
@codecov

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 10.39326% with 638 lines in your changes missing coverage. Please review.

Project coverage is 47.48%. Comparing base (cccc290) to head (ec75a49).
Report is 64 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| espnet2/asr/encoder/beats_encoder.py | 10.11% | 613 Missing ⚠️ |
| ...2/asr/decoder/hugging_face_transformers_decoder.py | 15.00% | 17 Missing ⚠️ |
| espnet2/bin/asr_inference.py | 16.66% | 5 Missing ⚠️ |
| espnet2/torch_utils/initialize.py | 0.00% | 2 Missing ⚠️ |
| espnet2/tasks/abs_task.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5967      +/-   ##
==========================================
+ Coverage   45.59%   47.48%   +1.89%     
==========================================
  Files         614      529      -85     
  Lines       54356    47844    -6512     
==========================================
- Hits        24781    22717    -2064     
+ Misses      29575    25127    -4448     
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet2 | 47.48% <10.39%> (-0.56%) ⬇️ |
| test_integration_espnetez | ? |
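
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the report, sums the per-file missing lines, and recomputes the base and head project coverage as hits divided by lines. Variable and function names are illustrative, and rounding to two decimals is an assumption about how the percentages are displayed.

```python
# Missing lines per file, as listed in the report (paths kept as shown).
missing_per_file = {
    "espnet2/asr/encoder/beats_encoder.py": 613,
    "...2/asr/decoder/hugging_face_transformers_decoder.py": 17,
    "espnet2/bin/asr_inference.py": 5,
    "espnet2/torch_utils/initialize.py": 2,
    "espnet2/tasks/abs_task.py": 1,
}
total_missing = sum(missing_per_file.values())


def coverage_pct(hits: int, lines: int) -> float:
    """Project coverage as a percentage, rounded to two decimals."""
    return round(100 * hits / lines, 2)


# Hits and line counts from the coverage diff (master vs. this PR).
base = coverage_pct(hits=24781, lines=54356)
head = coverage_pct(hits=22717, lines=47844)
delta = round(head - base, 2)

print(f"total missing lines in patch: {total_missing}")
print(f"base coverage: {base}%  head coverage: {head}%  delta: {delta}")
```

Project coverage rises even though patch coverage is low because the head is compared against a base with a different file set (614 files versus 529), so the denominator shrinks as well.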

Collaborator

@ftshijt ftshijt left a comment

LGTM! thanks for the updates. Please fix the CI issues accordingly~

@Fhrozen Fhrozen modified the milestones: v.202412, v.202503 Dec 4, 2024
@Shikhar-S
Contributor Author

> LGTM! thanks for the updates. Please fix the CI issues accordingly~

Done. Please lmk if any other changes are necessary.

Contributor

Do you need this file?
It contains an absolute path.
We also discussed that we don't need a separate upload shell script.
Can you double-check it?

Contributor Author

Thanks for catching this. Rechecked and fixed.

Contributor

@sw005320 sw005320 left a comment

Can you create a unit test, which covers 1) HF transformers_decoder and 2) beats_encoder? If 1) is tricky, you can try 2) only.

Check https://github.com/espnet/espnet/tree/master/test/espnet2/asr/encoder

@Shikhar-S
Contributor Author

Handled all comments and added tests.
The CI fails due to low coverage. However, most of the lines it cites are either covered by other tests or are trivial (like variable assignment).

@sw005320 sw005320 merged commit efd0d59 into espnet:master Dec 8, 2024
| bur_openslr80 | Burmese ASR training dataset | ASR | BUR | https://openslr.org/80/ | |
| catslu | CATSLU-MAPS | SLU | CMN | https://sites.google.com/view/catslu/home | |
| catslu_entity | CATSLU | SLU/Entity Classifi. | CMN | https://sites.google.com/view/catslu/home | |
| clotho_v2 | Clotho v2.1 dataset for audio captioning | ASR | ENG | https://zenodo.org/records/4783391 | |
Contributor

Can you change ASR --> AAC?

@sw005320
Contributor

sw005320 commented Dec 8, 2024

Thanks! This is an amazing contribution!

@Shikhar-S Shikhar-S deleted the clotho_dcase23 branch December 8, 2024 21:16
Shikhar-S pushed a commit to Shikhar-S/espnet that referenced this pull request Mar 13, 2025
Clotho_v2 Audio Captioning (DCASE 2023 implementation)

4 participants