Introduce modular files for speech models #35902
Conversation
utils/modular_model_converter.py
Outdated
""" | ||
for assignment, node in assignments.items(): | ||
should_keep = any(re.search(pattern, assignment) for pattern in ASSIGNMENTS_REGEX_TO_KEEP) | ||
|
||
# If it's a DOCSTRING var and is assigned to None, the parent's docstring is kept. |
I had to add this because for many of the models I've used, their docstring was somewhat custom (e.g. it contained a link to the original paper). So instead of just copying the docstring from the modular file, I figured it would be best to adopt this hybrid approach.
If you agree with the change, I should also update the modular docs: https://github.com/huggingface/transformers/blob/main/docs/source/en/modular_transformers.md
Hmm, I don't really get it here. This is already the actual behavior: the docstring uses the parent's if it's None.
Well, I wanted to say "instead of copying the docstring from the parent ..." (my comment on the code is also kinda obscure)
Essentially, there are now two possibilities:
- either set MYMODEL_INPUT_DOCSTRING = None, in which case the assignment is copied from the parent (as is already the case),
- or set it to something else (a new docstring), in which case the assignment is copied from the modular file.
So it is more flexible than the existing approach.
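To make the two options concrete, a modular file could then contain either of the following (MYMODEL_INPUT_DOCSTRING is an illustrative placeholder name, not a variable from this PR):

# Option 1: inherit the parent's docstring verbatim (current behavior)
MYMODEL_INPUT_DOCSTRING = None

# Option 2: provide a new docstring in the modular file; with this change it is
# copied from the modular file instead of the parent's version
MYMODEL_INPUT_DOCSTRING = r"""
    Args:
        input_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
            Raw speech waveform, e.g. as described in the model's original paper.
"""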
utils/modular_model_converter.py
Outdated
new_node = node.with_changes(body=node.body.with_changes(body=new_statements))
imports_to_keep.append(new_node)
existing_protected_statements.update({str(stmt) for stmt in new_statements})
import_statements = [
I added this because the code before had problematic behaviour for "safe" imports that had multiple other statements inside them, e.g. L381:395 in modeling_wav2vec2.py:
if is_deepspeed_zero3_enabled():
    import deepspeed

    with deepspeed.zero.GatheredParameters(self.conv.weight, modifier_rank=0):
        ...
The whole block after the import statement would be displaced to the top of the new modeling script (into the import statements).
Yes, it's one of the current limitations. However, removing everything else does not seem like a good solution either. I could not wrap my mind around a nice rule for this. For now, the best option is maybe to patch the original modeling file to dissociate the safe import from the other logic? Would that require a lot of change?
yeah let's do it like this, thanks
However, in the example above it would be better to move:
if is_deepspeed_zero3_enabled():
    import deepspeed
outside of the constructor, because in the current state the newly created module (prior to running ruff inside the modular converter) would have two such statements, and the first one would become:
if is_deepspeed_zero3_enabled():
    pass
after the run_ruff call.
But if we move it to the top, deepspeed would no longer be lazily imported. I think this is not a problem, right?
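As I read the comment above, the failure mode being described looks roughly like this (only a sketch of the generated module, not actual converter output): before run_ruff there are two identical guards, and ruff then strips the now-redundant top one down to a no-op.

# before run_ruff: a hoisted copy at module level, plus the original guard inside the constructor
if is_deepspeed_zero3_enabled():
    import deepspeed

# after run_ruff: the unused module-level import is removed, leaving an empty guard
if is_deepspeed_zero3_enabled():
    pass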
Hey! Thanks for the contribution! I just looked at the modular part, let me know if something is unclear!! 🤗
utils/modular_model_converter.py
Outdated
# Exclude names to prevent edge cases where we want to keep a name that may
# exist in the mapping, e.g. `Wav2Vec2BaseModelOutput` where `Wav2Vec2` is
# a "base" model identifier but we want the type to pass as is in the produced modeling file
EXCLUDE_NAMES = ["Wav2Vec2BaseModelOutput"]


def preserve_case_replace(text, patterns: dict, default_name: str):
    # Create a regex pattern to match all variations
    regex_pattern = "|".join(re.escape(key) for key in patterns.keys())
    compiled_regex = re.compile(f"(?<![a-z0-9])({regex_pattern})(.|$)", re.IGNORECASE | re.DOTALL)

    # Create exclude pattern
    exclude_pattern = "|".join(re.escape(key) for key in EXCLUDE_NAMES)
    compiled_regex = re.compile(f"(?<![a-z0-9])(?!{exclude_pattern})({regex_pattern})(.|$)", re.IGNORECASE | re.DOTALL)
Definitely not a fan of having exclusions here. And the regex is already way too complicated 🥲 Moreover, I don't think we actually want an output type from another model, do we?
Yeah you're right, it felt bad while doing it 😂 Unfortunately we need output types from other models in the files I introduced (almost all of them need the Wav2Vec2BaseModelOutput).
But it could be done more cleanly with "type aliasing": e.g. for the WavLM model, which needs Wav2Vec2BaseModelOutput, we could add
WavLMBaseOutput = Wav2Vec2BaseModelOutput
inside the modular file.
What do you think?
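For reference, the aliasing idea would look roughly like this in modular_wavlm.py (a sketch of the proposal only; the alias name is the one floated above, and the import is written as absolute so the snippet is self-contained, assuming Wav2Vec2BaseModelOutput is importable from transformers.modeling_outputs):

from transformers.modeling_outputs import Wav2Vec2BaseModelOutput

# type alias: gives the output class a WavLM-scoped name without needing a
# global EXCLUDE_NAMES entry in the converter
WavLMBaseOutput = Wav2Vec2BaseModelOutput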
utils/modular_model_converter.py
Outdated
""" | ||
for assignment, node in assignments.items(): | ||
should_keep = any(re.search(pattern, assignment) for pattern in ASSIGNMENTS_REGEX_TO_KEEP) | ||
|
||
# If it's a DOCSTRING var and is assigned to None, the parent's docstring is kept. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm, I don't really get here. This is already the actual behavior to have the docstring use the parent if it's None
|
||
# Keep return annotation in `modular_xxx.py` if any, else original return annotation
new_return_annotation = updated_methods[name].returns if updated_methods[name].returns else func.returns

if not re.match(
    r"\ndef .*\(.*\):\n    raise.*Error\(.*",
    mapper.python_module.code_for_node(updated_methods[name]),
):
    func = func.with_changes(body=updated_methods[name].body, params=new_params, decorators=new_decorators)
    func = func.with_changes(
        body=updated_methods[name].body,
        params=new_params,
        decorators=new_decorators,
        returns=new_return_annotation,
    )
Love this one! Nice!
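To make the effect concrete, here is a toy libcst reproduction of the merging rule (not the converter code itself): the return annotation from the modular definition wins, otherwise the parent's is kept.

import libcst as cst

parent = cst.parse_statement("def forward(self, x):\n    return x\n")
modular = cst.parse_statement("def forward(self, x) -> BaseModelOutput:\n    return self.encoder(x)\n")

# same rule as above: prefer the modular return annotation, fall back to the parent's
new_return_annotation = modular.returns if modular.returns else parent.returns
merged = parent.with_changes(body=modular.body, returns=new_return_annotation)

print(cst.Module(body=[merged]).code)
# def forward(self, x) -> BaseModelOutput:
#     return self.encoder(x)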
…or Wav2Vec2BaseModelOutput
force-pushed from edba3d2 to 5c47d86
@Cyrilvallez hey, could you take a look again?
Hey @nikosanto13! Super sorry about the delay, last week we were on an offsite with the whole transformers team, and this week was a bit crazy because of some big refactoring of core parts and releases! 🙂 This PR is still very much welcome, and I'll take a deeper look into all models asap! Please bear with me in the meantime 🙏 Be assured that this is definitely on my to-do list! 🤗 You can check out #36688 as well, which proposes similar changes about the assignments 😉
Actually, tagging @eustlb as this is mostly audio models, maybe you have some time to help review the modular parts? In the meantime, @nikosanto13 I believe some changes can be reverted since #36279, the imports inside functions should not need to be moved 😉
HUGE! Kudos @nikosanto13 that's a big big work!
- most of the modular files are missing a license! (we can probably add it automatically!)
- the block
  if is_peft_available():
      from peft.tuners.lora import LoraLayer
  seems to be imported in quite a few places where it was not needed before
- let's not add new features at the same time (it's already super huge as it is!)
Otherwise LGTM! Let's run all tests and GO! 🚀
@@ -1188,14 +1163,21 @@ def forward(
        if not return_dict:
            return (hidden_states, extract_features) + encoder_outputs[1:]

        return Wav2Vec2BaseModelOutput(
lol yeah good catch here!
if is_deepspeed_zero3_enabled():
    import deepspeed
in general this should stay as an import here rather than at the top
class UniSpeechSatPositionalConvEmbedding(Wav2Vec2PositionalConvEmbedding):
    pass


class UniSpeechSatFeatureEncoder(Wav2Vec2FeatureEncoder):
    pass


class UniSpeechSatFeatureProjection(Wav2Vec2FeatureProjection):
    pass


class UniSpeechSatEncoder(Wav2Vec2Encoder):
    pass


class UniSpeechSatEncoderStableLayerNorm(Wav2Vec2EncoderStableLayerNorm):
    pass
unused ones should not be needed!
While the modeling file is the same if I remove them, this would lead to several undefined-name (F821) violations in the modular file, since the defined classes are needed for later parts of the modular file.
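To illustrate (a hypothetical later excerpt of modular_unispeech_sat.py, not necessarily the exact code in this PR): the pass-through subclasses above are referenced further down in the modular file, so deleting them would trip ruff's F821 even though the generated modeling file would be identical.

# later in the modular file, the pass-through classes defined above are used directly,
# so removing their `pass` definitions would leave these names undefined (F821)
class UniSpeechSatModel(UniSpeechSatPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.feature_extractor = UniSpeechSatFeatureEncoder(config)
        self.feature_projection = UniSpeechSatFeatureProjection(config)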
okay no worries I did not check if they are used / unused!
# layer normalization (has no effect when `config.do_stable_layer_norm == False`)
# extract_features = self.layer_norm_for_extract(extract_features)
# quantized_features, codevector_perplexity = self.quantizer(extract_features)
#
# project quantized features twice
# quantized_features = self.project_q(quantized_features)
# quantized_features = self.project_hid(quantized_features)
#
# loss = None
# logits = quantized_features
to clean up!
if is_peft_available():
    from peft.tuners.lora import LoraLayer
should not be here!
utils/modular_model_converter.py
Outdated
@@ -855,11 +860,18 @@ def _merge_assignments(self, assignments: dict[str, cst.CSTNode], object_mapping

        Merging rule: if any assignment with the same name was redefined in the modular, we use it and its dependencies ONLY if it matches
        a pattern in `ASSIGNMENTS_REGEX_TO_KEEP`. Otherwise, we use the original value and dependencies. This rule was chosen to avoid having to rewrite the
        big docstrings.
        big docstrings. If the assignment is a DOCSTRING var and is assigned to None, the parent's docstring is kept.
not super super intuitive but not a real problem!
…imports in their original locations
@Cyrilvallez hey, no worries about the delay! Thanks for the pointers to the latest changes on the modular converter. The #36279 fix was not enough for the lazy imports inside class definitions, as it only works for functions. But inspired by it, I added a similar change to handle class definitions (see inline comment).
@@ -677,14 +678,18 @@ def leave_FunctionDef(self, node):

    def visit_If(self, node):
        # If we are inside a function, do not add the import to the list of imports
        if self.current_function is None:
        if self.current_function is None and self.current_class is None:
cc @Cyrilvallez fix similar to #36279
Indeed, when adding it for functions I thought we never had imports directly inside classes, so did not add it...Turns out I was wrong... 🥲🥲
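For readers less familiar with the converter internals, here is a minimal, self-contained sketch of the tracking pattern this fix relies on (an illustrative visitor, not the actual converter class): protected imports are only collected at module level, and anything nested inside a function or a class body is left where it is.

import libcst as cst


class ProtectedImportCollector(cst.CSTVisitor):
    def __init__(self):
        self.current_function = None
        self.current_class = None
        self.protected_imports = []

    def visit_FunctionDef(self, node):
        if self.current_function is None:
            self.current_function = node.name.value

    def leave_FunctionDef(self, node):
        if self.current_function == node.name.value:
            self.current_function = None

    def visit_ClassDef(self, node):
        if self.current_class is None:
            self.current_class = node.name.value

    def leave_ClassDef(self, node):
        if self.current_class == node.name.value:
            self.current_class = None

    def visit_If(self, node):
        # only module-level `if`-protected blocks are treated as import guards
        if self.current_function is None and self.current_class is None:
            self.protected_imports.append(node)


code = """
if is_torch_available():
    import torch

class Foo:
    def bar(self):
        if is_torch_available():
            import torch
"""
collector = ProtectedImportCollector()
cst.parse_module(code).visit(collector)
print(len(collector.protected_imports))  # 1 -> only the module-level guard is collected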
@ArthurZucker ty for your review 🤗 The deepspeed and peft lazy import statements should be OK now; I added a fix to the modular converter that makes my previous changes redundant. Let me know if there is anything else.
All good for me! @eustlb can you give a final look and merge? 🤗
Thanks a lot for the good work @nikosanto13 !! 🤗
LGTM, I'll just run slow tests for the affected models.
BTW subsequent work could focus on propagating to other speech models that rely partially on wav2vec modelling: seamless m4t, speecht5, sew, sew_d
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: wavlm
run-slow: wav2vec2_bert, wav2vec2_conformer, unispeech, unispeech_sat, hubert, data2vec
This comment contains run-slow, running the specified jobs: models: ['models/data2vec', 'models/hubert', 'models/unispeech', 'models/unispeech_sat', 'models/wav2vec2_bert', 'models/wav2vec2_conformer']
run-slow: wav2vec2_bert, wav2vec2_conformer, unispeech, unispeech_sat, hubert, data2vec
This comment contains run-slow, running the specified jobs: models: ['models/data2vec', 'models/hubert', 'models/unispeech', 'models/unispeech_sat', 'models/wav2vec2_bert', 'models/wav2vec2_conformer']
run-slow: wav2vec2_bert, wav2vec2_conformer, unispeech, unispeech_sat, hubert, data2vec
This comment contains run-slow, running the specified jobs: models: ['models/data2vec', 'models/hubert', 'models/unispeech', 'models/unispeech_sat', 'models/wav2vec2_bert', 'models/wav2vec2_conformer']
@eustlb thanks for taking care of this. Yeah, I skipped them by mistake; maybe I could open another PR now that this has been merged.
@nikosanto13 thanks again for the work!!
* WAV_2_VEC_2 to WAV2VEC2
* added modular files for hubert, wavlm, wav2vec2_bert, data2vec_audio
* remove unnessary definitions in modulars
* added modular files for UniSpeech, UniSpeechSat, Wav2Vec2Conformer
* docstring fix for UniSpeechForCTC
* removed unneccessary re-definition of modular classes
* reverted lazy imports change on modular_model_converter, type-alias for Wav2Vec2BaseModelOutput
* top-level import of deepspeed in seamless_m4t, speecht5
* avoid tracking imports inside classes, relocate lazy deepspeed, peft imports in their original locations
* convert modular
* tiny modular typing fixes
* some more modular fixes
* make style

---------

Co-authored-by: eustlb <[email protected]>
Co-authored-by: Eustache Le Bihan <[email protected]>
What does this PR do?
Who can review?
@ArthurZucker @Cyrilvallez
Additional details
Modular files were introduced for the models derived from modeling_wav2vec2.py: Hubert, WavLM, Data2VecAudio, Wav2Vec2Conformer, Wav2Vec2Bert, UniSpeech, UniSpeechSat