Retraining script for spacy #52
brabbit61
commented
Jun 21, 2023
- Added documentation for training and retraining spaCy models
- Added scripts for both tasks
tieandrews
left a comment
Looking good; we just need to rearrange some things and reuse our existing code for some parts. Nice work.
    if os.path.exists(val_path):
        shutil.rmtree(val_path)
        logger.info(f"The folder '{val_path}' has been deleted.")
    if os.path.exists(test_path):
issue: this duplicates the process done in the labelling_data_split.py file. Instead, run that script first in bash with these args, then make this function run on the processed folders. That way we remove code duplication and make the parts of the process more reusable.
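A minimal sketch of what the preprocessing side could look like once the split has already been done elsewhere. It assumes the split script leaves one folder per split under a common root and that the articles are stored as JSON files; the folder names in `SPLITS` and the function name `collect_split_files` are hypothetical and not taken from this PR.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical split folder names; the actual names written by
# labelling_data_split.py are not shown in this review.
SPLITS = ("train", "val", "test")


def collect_split_files(split_root: str) -> Dict[str, List[Path]]:
    """Map each split name to the sorted JSON files found in its folder.

    No splitting happens here: the folders are expected to exist already.
    """
    root = Path(split_root)
    files_by_split: Dict[str, List[Path]] = {}
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            # Fail early instead of silently re-creating the split.
            raise FileNotFoundError(
                f"Missing '{split_dir}'. Run the split script before preprocessing."
            )
        files_by_split[split] = sorted(split_dir.glob("*.json"))
    return files_by_split
```

With this shape, the preprocessing script only reads the folders it is given, so the splitting logic lives in one place.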
        test_gdd_ids
    )


def get_article_gdd_ids(labelled_file_path: str):
Delete this duplicate function; see the comment above.
# TODO: Else If the data_path consists of parquet files, load JSON files from all parquet files in the directory


def split_train_val_test(
Delete this duplicate function; see the comment above.
rm -f spacy_transformer_$VERSION.cfg

python3 spacy_preprocess.py \
See the comments on this file below. We should remove the duplicate code: use labelling_data_split.py here, then call the pre-processing.
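The ordering suggested here can be sketched as follows. The script under review is a shell script; this is a Python stand-in for the same two-step sequence, not the project's actual entry point. The function name `run_split_then_preprocess` and the `runner` parameter are hypothetical, and the arguments for each script are passed through by the caller because their real flags are not shown in this thread.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_split_then_preprocess(
    split_args: Sequence[str],
    preprocess_args: Sequence[str],
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run the split script first, then preprocessing on its output."""
    commands = [
        [sys.executable, "labelling_data_split.py", *split_args],
        [sys.executable, "spacy_preprocess.py", *preprocess_args],
    ]
    for command in commands:
        # With subprocess.run, check=True raises CalledProcessError on a
        # non-zero exit code, so preprocessing never runs after a failed split.
        runner(command, check=True)
```

Keeping `runner` injectable makes the ordering easy to check without launching either script.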
tieandrews
left a comment
Nice work, merging.