Clotho_v2 Audio Captioning (DCASE 2023 implementation) #5967
sw005320 merged 63 commits into espnet:master
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5967 +/- ##
==========================================
+ Coverage 45.59% 47.48% +1.89%
==========================================
Files 614 529 -85
Lines 54356 47844 -6512
==========================================
- Hits 24781 22717 -2064
+ Misses 29575 25127 -4448
ftshijt left a comment
LGTM! thanks for the updates. Please fix the CI issues accordingly~
Done. Please lmk if any other changes are necessary.
egs2/clotho_v2/asr1/run_upload.sh
Do you need this file?
It contains an absolute path.
We also discussed that we don't need a separate upload script.
Can you double-check it?
Thanks for catching this. Rechecked and fixed.
sw005320 left a comment
Can you create a unit test that covers 1) HF transformers_decoder and 2) beats_encoder? If 1) is tricky, you can cover 2) only.
Check https://github.com/espnet/espnet/tree/master/test/espnet2/asr/encoder
… and correct assertion in beats
Handled all comments and added tests.
| bur_openslr80 | Burmese ASR training dataset | ASR | BUR | https://openslr.org/80/ | |
| catslu | CATSLU-MAPS | SLU | CMN | https://sites.google.com/view/catslu/home | |
| catslu_entity | CATSLU | SLU/Entity Classifi. | CMN | https://sites.google.com/view/catslu/home | |
| clotho_v2 | Clotho v2.1 dataset for audio captioning | ASR | ENG | https://zenodo.org/records/4783391 | |
Can you change ASR --> AAC?
Thanks! This is an amazing contribution!
Clotho_v2 Audio Captioning (DCASE 2023 implementation)
What?
Implementation of the BEATs-BART Audio Captioning system described in this paper.
Our Scores
cider_d: 46.1
spider: 29.8
Paper's comparable scores
cider_d: 45.8
spider: 29.6
(Other numbers for this implementation are in the README; the paper's numbers are in the second-to-last row of Table 2 (w/o Instructor).)
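As a quick sanity check on the comparison above: SPIDEr is defined as the mean of CIDEr-D and SPICE, so the gap to the paper and a rough SPICE value can be read off the two reported metrics. The snippet below is only back-of-the-envelope arithmetic on the rounded numbers in this description. The helper names are made up for illustration, and the README remains the source for the actual SPICE scores.

```python
def implied_spice(cider_d: float, spider: float) -> float:
    """Back out SPICE from SPIDEr = (CIDEr-D + SPICE) / 2.

    Inputs are already rounded to one decimal, so the result is approximate.
    """
    return 2.0 * spider - cider_d


def delta(ours: float, paper: float) -> float:
    """Difference between two scores, rounded to the reported precision."""
    return round(ours - paper, 1)


# Scores copied from this PR description (percent scale).
ours = {"cider_d": 46.1, "spider": 29.8}
paper = {"cider_d": 45.8, "spider": 29.6}

for name in ("cider_d", "spider"):
    print(f"{name}: {delta(ours[name], paper[name]):+.1f} vs. paper")

print(f"implied SPICE (this PR): ~{implied_spice(**ours):.1f}")
print(f"implied SPICE (paper):   ~{implied_spice(**paper):.1f}")
```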
This PR also adds 1) an option to skip loading pre-trained weights in the ESPnet HuggingFace decoder, and 2) the ability to modify the pre-trained model's architecture through external configs.
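The two options can be pictured with a small, library-free sketch. It is not the code in this PR: `load_pretrained_weights`, `overriding_config`, and both helper functions are hypothetical names, and the base values are illustrative BART-style settings. In HF Transformers itself, constructing a model from a config object instead of calling `from_pretrained` leaves the weights randomly initialized, which is the behavior the "random" branch stands for here.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def build_decoder_config(
    base_config: Dict[str, Any],
    overriding_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Apply external overrides on top of a pre-trained model's config."""
    config = deepcopy(base_config)
    for key, value in (overriding_config or {}).items():
        if key not in config:
            # Reject unknown keys so a typo does not silently do nothing.
            raise KeyError(f"unknown config key: {key}")
        config[key] = value
    # Multi-head attention splits d_model evenly across the heads.
    if config["d_model"] % config["decoder_attention_heads"] != 0:
        raise ValueError("d_model must be divisible by decoder_attention_heads")
    return config


def plan_decoder_init(
    base_config: Dict[str, Any],
    load_pretrained_weights: bool = True,  # hypothetical option name
    overriding_config: Optional[Dict[str, Any]] = None,  # hypothetical option name
) -> Dict[str, Any]:
    """Return the final config and how the weights would be initialized."""
    config = build_decoder_config(base_config, overriding_config)
    init = "pretrained" if load_pretrained_weights else "random"
    return {"config": config, "init": init}


# Illustrative BART-style base settings (assumed, not taken from the PR).
base = {"d_model": 768, "decoder_attention_heads": 12, "decoder_layers": 6}

plan = plan_decoder_init(
    base,
    load_pretrained_weights=False,
    overriding_config={"decoder_attention_heads": 16},
)
print(plan)
```

Validating the overrides up front keeps a bad external config from surfacing later as a shape error deep inside the model.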
Why?
This PR improves the data preparation (training on all 5 captions, text preprocessing) and implements the same decoder as the paper's implementation (more attention heads, no pre-trained weight loading).
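A minimal sketch of what "training on all 5 captions" with text preprocessing could look like is shown below. It assumes the Clotho caption CSV layout (`file_name`, `caption_1` ... `caption_5`) and uses lowercasing plus punctuation stripping as the normalization. The utterance-ID scheme and both function names are hypothetical, and the recipe's actual preprocessing may differ.

```python
import csv
import io
import re
import string
from typing import List, Tuple


def normalize_caption(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def expand_captions(csv_text: str, n_captions: int = 5) -> List[Tuple[str, str]]:
    """Turn each audio clip into n_captions (utt_id, caption) training pairs."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop the extension and avoid spaces in the (hypothetical) utterance ID.
        stem = row["file_name"].rsplit(".", 1)[0].replace(" ", "_")
        for i in range(1, n_captions + 1):
            caption = normalize_caption(row[f"caption_{i}"])
            pairs.append((f"{stem}_cap{i}", caption))
    return pairs


# Toy input in the assumed CSV layout; not real Clotho data.
example_csv = (
    "file_name,caption_1,caption_2,caption_3,caption_4,caption_5\n"
    'birds.wav,"Birds chirp, loudly.",A bird sings.,Birds are singing!,'
    "Birds chirping outside,Some birds  chirp\n"
)

for utt_id, caption in expand_captions(example_csv):
    print(utt_id, caption)
```

Each clip then contributes five training examples that share the same audio, instead of a single caption per clip.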
See also
Builds on top of #5915