Releases: airaria/TextBrewer
TextBrewer 0.2.1
New Features
- More flexible distillation: Supports feeding different batches to the student and the teacher, so the two batches no longer need to be the same. This can be used for distilling between models with different vocabularies (e.g., from RoBERTa to BERT). See the documentation for details.
- Faster distillation: Users can pre-compute and cache the teacher outputs, then feed the cache to the distiller to save the time of the teacher's forward pass. See the documentation for details.
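As a rough illustration of feeding different batches to the student and the teacher, the sketch below pairs two differently encoded views of the same examples in one batch. It is plain Python only: the toy vocabularies, `encode`, and `paired_batches` are hypothetical, and the "student" / "teacher" dictionary keys are an assumption made for illustration. The documentation remains the authority on the batch format the distiller actually expects.

```python
from typing import Dict, Iterator, List

# Toy vocabularies standing in for two different tokenizers (hypothetical).
STUDENT_VOCAB = {"[UNK]": 0, "hello": 1, "world": 2}
TEACHER_VOCAB = {"<unk>": 0, "world": 1, "hello": 2, "again": 3}


def encode(text: str, vocab: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Map whitespace-separated tokens to ids, falling back to unk_id."""
    return [vocab.get(token, unk_id) for token in text.split()]


def paired_batches(
    texts: List[str], batch_size: int
) -> Iterator[Dict[str, List[List[int]]]]:
    """Yield batches holding separate student and teacher encodings
    of the same underlying texts."""
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        yield {
            # Key names are an assumption, not a confirmed TextBrewer format.
            "student": [encode(t, STUDENT_VOCAB) for t in chunk],
            "teacher": [encode(t, TEACHER_VOCAB) for t in chunk],
        }


batches = list(paired_batches(["hello world", "world again", "hello"], batch_size=2))
```

The same text yields different ids for each model, which is why a single shared batch cannot serve both when the vocabularies differ.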
Improvements
- MultiTaskDistiller is now a subclass of GeneralDistiller and supports intermediate feature matching loss.
- TensorBoard now records more detailed losses (KD loss, hard label loss, matching losses...).
- pkd_loss now accepts tensors of shape (batch_size, length, hidden_size) or (batch_size, hidden_size). In the latter case, the loss is computed directly on the input tensors, without taking the hidden states at the first position.
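The shape handling described for pkd_loss can be sketched in plain Python with nested lists in place of tensors. This is not TextBrewer's implementation: `pkd_loss_sketch` and its helpers are hypothetical names, and the loss formula (squared distance between L2-normalized vectors, averaged over the batch) is an assumption based on the usual Patient-KD formulation.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """L2-normalize a vector (assumes a non-zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _select_vectors(state):
    """Return one hidden vector per example.

    3-D input (batch_size, length, hidden_size): take the first position.
    2-D input (batch_size, hidden_size): use each row directly.
    """
    if isinstance(state[0][0], (list, tuple)):
        return [example[0] for example in state]
    return list(state)


def pkd_loss_sketch(state_s, state_t) -> float:
    """Mean squared distance between normalized student and teacher vectors."""
    vecs_s = [_normalize(v) for v in _select_vectors(state_s)]
    vecs_t = [_normalize(v) for v in _select_vectors(state_t)]
    per_example = [
        sum((a - b) ** 2 for a, b in zip(s, t)) for s, t in zip(vecs_s, vecs_t)
    ]
    return sum(per_example) / len(per_example)


# 3-D inputs: batch_size=1, length=2, hidden_size=2 (only position 0 is used).
loss_3d = pkd_loss_sketch([[[3.0, 4.0], [9.0, 9.0]]], [[[4.0, 3.0], [0.0, 1.0]]])
# 2-D inputs: batch_size=1, hidden_size=2 (used as given).
loss_2d = pkd_loss_sketch([[3.0, 4.0]], [[4.0, 3.0]])
```

Both calls compare the same pair of vectors, so they produce the same loss value.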
TextBrewer 0.2.0.1
Bug Fixes
- Fixed bugs in MultiTaskDistiller.
- Fixed the endless training loop when training for num_steps. Now distillers will stop correctly.
TextBrewer 0.2.0
New Features
- Now supports distributed data-parallel training with torch.nn.parallel.DistributedDataParallel! You can pass local_rank to the TrainingConfig to set up distributed training. The detailed usage of DistributedDataParallel can be found in the PyTorch docs.
- We also added an example (Chinese NER task) to demonstrate how to use TextBrewer with distributed data-parallel training.
TextBrewer 0.1.10
New Features
- Now supports mixed precision training with Apex! Just set fp16 to True in TrainingConfig. See the documentation of TrainingConfig for details.
- Added a data_parallel option in TrainingConfig to enable data parallel training within TextBrewer.
TextBrewer 0.1.9
New Features
- Added an option is_caching_logits to DistillationConfig. If is_caching_logits is True, the distiller will cache the batches and the output logits of the teacher model, so that those logits are calculated only once. This speeds up the distillation process. This feature is only available for BasicDistiller and MultiTeacherDistiller. Be cautious about setting it to True on large datasets, since it stores the batches and logits in memory.
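The caching idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the distiller's actual code: `CachingTeacher` and `toy_teacher` are hypothetical, and the "teacher" is just a function on lists.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Batch = List[int]
Logits = List[int]


class CachingTeacher:
    """Runs the teacher on the first full pass over the data, then replays
    the stored (batch, logits) pairs on later passes."""

    def __init__(self, teacher_fn: Callable[[Batch], Logits]) -> None:
        self.teacher_fn = teacher_fn
        self.cache: Optional[List[Tuple[Batch, Logits]]] = None

    def iterate(self, dataloader: Iterable[Batch]) -> Iterator[Tuple[Batch, Logits]]:
        if self.cache is not None:
            # Later epochs: no teacher forward pass at all.
            yield from self.cache
            return
        fresh: List[Tuple[Batch, Logits]] = []
        for batch in dataloader:
            logits = self.teacher_fn(batch)
            fresh.append((batch, logits))
            yield batch, logits
        # Only keep the cache once a complete pass has finished.
        self.cache = fresh


calls: List[Batch] = []


def toy_teacher(batch: Batch) -> Logits:
    """Stand-in for a teacher forward pass; records each call."""
    calls.append(batch)
    return [2 * x for x in batch]


teacher = CachingTeacher(toy_teacher)
data = [[1, 2], [3]]
epoch1 = list(teacher.iterate(data))
epoch2 = list(teacher.iterate(data))
```

After two epochs the toy teacher has been called once per batch rather than twice. The cache holds every batch and its logits, so its size grows with the dataset, which is the memory cost noted above.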
Improvements
- Added a new argument max_grad_norm to the distillers' train method. It sets the strength of gradient clipping. The default is -1, i.e., no gradient clipping.
- Added new arguments scheduler_class and scheduler_args to the distillers' train method. The old scheduler argument may cause convergence problems and is deprecated in favor of scheduler_class and scheduler_args. See the documentation for details.
- Removed print in display_parameters. Now it won't print the statistics directly to the screen.
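To make the max_grad_norm semantics concrete, here is a minimal sketch of clipping by global L2 norm, written in plain Python over a flat list of gradient values. `clip_gradients` is a hypothetical helper, and treating every negative value (not only -1) as "no clipping" is an assumption. In PyTorch this kind of clipping is typically done with torch.nn.utils.clip_grad_norm_; the notes above do not say how the distillers implement it.

```python
import math
from typing import List


def clip_gradients(grads: List[float], max_grad_norm: float = -1.0) -> List[float]:
    """Rescale grads so their global L2 norm is at most max_grad_norm.

    A negative max_grad_norm disables clipping (assumed to mirror the
    default of -1 described above).
    """
    if max_grad_norm < 0:
        return list(grads)
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return list(grads)  # already within the limit
    scale = max_grad_norm / total_norm  # total_norm > 0 on this branch
    return [g * scale for g in grads]


unclipped = clip_gradients([3.0, 4.0])              # default: left as is
clipped = clip_gradients([3.0, 4.0], 1.0)           # norm 5 rescaled to norm 1
small = clip_gradients([0.3, 0.4], 1.0)             # norm 0.5, left as is
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its magnitude is bounded.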
Bug Fixes
- Fixed wrong call of zero_grad().
TextBrewer 0.1.8
Improvements
- TrainingConfig.log_dir can be set to None to disable TensorBoard.
- Added an attribute print_freq to the distiller to control the frequency of logging.
- Added a new argument num_steps to the train method of the distiller. If num_steps is specified, the distiller will ignore num_epochs and allow an unknown-size dataloader (i.e., one that has no __len__ attribute).
- Added a new argument batch_postprocessor to the train method of the distiller to allow post-processing of batches.
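The interaction of num_steps, an unknown-size dataloader, and batch_postprocessor can be sketched as a step-counted loop. This is an illustration in plain Python, not the distiller's train method: `train_for_steps`, `stream_batches`, and the `make_dataloader` factory are hypothetical, and the factory exists only because a Python generator cannot be iterated twice.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Batch = List[int]


def stream_batches() -> Iterator[Batch]:
    """A dataloader of unknown size: a generator has no __len__."""
    for start in range(0, 6, 2):
        yield [start, start + 1]


def train_for_steps(
    make_dataloader: Callable[[], Iterable[Batch]],
    num_steps: int,
    step_fn: Callable[[Batch], None],
    batch_postprocessor: Optional[Callable[[Batch], Batch]] = None,
) -> int:
    """Run exactly num_steps steps, restarting the dataloader when exhausted."""
    global_step = 0
    while global_step < num_steps:
        saw_batch = False
        for batch in make_dataloader():
            saw_batch = True
            if batch_postprocessor is not None:
                batch = batch_postprocessor(batch)
            step_fn(batch)
            global_step += 1
            if global_step >= num_steps:
                return global_step  # stop mid-pass once the budget is spent
        if not saw_batch:
            break  # empty dataloader: avoid looping forever
    return global_step


seen: List[Batch] = []
steps = train_for_steps(
    stream_batches,
    num_steps=5,
    step_fn=seen.append,
    batch_postprocessor=lambda b: [x * 10 for x in b],
)
```

Counting steps instead of epochs means the loop never needs the length of the dataloader; it only needs to know when the step budget is reached.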
TextBrewer 0.1.7
This is the first release of TextBrewer.