Add LayoutLM / LayoutXLM support #2980
Conversation
whoisjones
left a comment
Looks good. I couldn't really check the image I/O operations in the dataset. I have some minor things you might want to look at before merging.
    left_context = left_context[-context_length:]
    break
    return left_context
    left_context = sentence.tokens + left_context
`left_context += sentence.tokens` to be consistent with the right context
Notice that addition on lists is not commutative, hence your suggestion would lead to wrong results.
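A minimal illustration of the difference (toy tokens, not the PR's actual data):

```python
# Context tokens are collected right-to-left, so each new sentence must be
# prepended, not appended; list addition is order-sensitive.
left_context = ["quick", "fox"]   # context gathered so far
sentence_tokens = ["the"]         # tokens of the sentence further to the left

print(sentence_tokens + left_context)  # ['the', 'quick', 'fox']  (correct order)

left_context += sentence_tokens        # the suggested form appends instead
print(left_context)                    # ['quick', 'fox', 'the']  (wrong order)
```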
flair/datasets/ocr.py (outdated)
    :param path_to_split_directory: base folder with the task data
    :param label_type: the label_type to add the OCR labels to
    :param encoding: the encoding to load the .json files with
    :param normalize_coords_to_thousands: if True, the coordinates will be scaled to the range 0-1000
Is normalizing to thousands usual? If it was just chosen arbitrarily, why not make the normalization factor an optional int and only normalize if a factor is provided?
Normalizing to thousands is very common; it is done by LayoutLM (& v2/v3), DocFormer, LAMBERT, ...
I haven't seen an implementation so far that does it differently.
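For illustration, a minimal sketch of that convention (the function is hypothetical, not Flair's actual implementation):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale pixel coordinates (x0, y0, x1, y1) to the 0-1000 range used by LayoutLM."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# e.g. a word box on an 850x1100 px page:
print(normalize_bbox((85, 110, 170, 132), 850, 1100))  # (100, 100, 200, 120)
```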
    )

    return tensor_args
    # randomly check some tokens to save performance.
Why not check the entire dataset, or discard inputs that lack the required metadata?
As noted in the comment, running the checks on the entire dataset would impact speed; this spot check is rather to ensure that the user gets a clear warning if there are no bounding boxes at all, without hurting performance.
And discarding inputs would silently hide errors, so I am against that.
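A sketch of that trade-off (the helper is hypothetical; it assumes the `has_metadata` API added by this PR):

```python
import random

def spot_check_bboxes(sentences, sample_size=32, metadata_key="bbox"):
    """Sample a handful of tokens and warn if none carry bounding boxes.

    The cost stays constant regardless of dataset size, yet the common
    misconfiguration (no OCR metadata anywhere) is still caught reliably.
    """
    tokens = [token for sentence in sentences for token in sentence]
    if not tokens:
        return
    sample = random.sample(tokens, min(sample_size, len(tokens)))
    if not any(token.has_metadata(metadata_key) for token in sample):
        print(f"Warning: none of {len(sample)} sampled tokens have '{metadata_key}' metadata.")
```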
flair/embeddings/transformer.py (outdated)
Running the formatter again should remove the empty line.
| if "bbox" in batch_encoding: | ||
| model_kwargs["bbox"] = batch_encoding["bbox"].to(device, non_blocking=True) | ||
|
|
||
| if self.token_embedding or self.needs_manual_ocr: |
We check twice for `self.token_embedding` (line 547) and `self.needs_manual_ocr` (line 549), and jointly in line 516. The part required by both is the word_ids. Can we also move everything after line 526 into the condition for `self.token_embedding`? It looks like `self.needs_manual_ocr` only needs `word_ids_list`.
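A rough sketch of the proposed restructuring (purely illustrative; the attribute names follow the discussion above, the surrounding function is hypothetical):

```python
def fill_outputs(word_ids_list, token_embedding, needs_manual_ocr):
    # Guard once for the shared prerequisite: both paths need the word ids.
    if not (token_embedding or needs_manual_ocr):
        return

    if token_embedding:
        # Everything beyond the shared word ids only serves token
        # embeddings, so it can move entirely into this branch.
        for word_ids in word_ids_list:
            pass  # gather per-token vectors via the word ids

    if needs_manual_ocr:
        # This path consumes only word_ids_list.
        for word_ids in word_ids_list:
            pass  # attach manual OCR boxes via the word ids
```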
👍

@helpmefindaname thanks for adding this!
This PR adds the following:
- `add_metadata`, `get_metadata` and `has_metadata` for each datapoint (Token, Sentence, Span); see the usage sketch below

I tested training with 100 epochs, batch_size=16, lr=5e-5, and train_with_dev=True for the following embeddings:
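For the metadata API listed above, a minimal usage sketch (the exact signatures are assumed from the PR description; the key name and bounding-box value are made up):

```python
from flair.data import Sentence

sentence = Sentence("Total amount 42.00")
token = sentence[0]

# Attach an OCR bounding box (x0, y0, x1, y1) to a single token;
# "bbox" and the coordinates are illustrative only.
token.add_metadata("bbox", (100, 100, 200, 120))

if token.has_metadata("bbox"):
    print(token.get_metadata("bbox"))  # (100, 100, 200, 120)
```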