Add layoutlm layoutxlm support #2980

Merged

alanakbik merged 11 commits into flairNLP:master from helpmefindaname:add_layoutlm_layoutxlm_support on Nov 15, 2022

Conversation

@helpmefindaname (Member) commented Nov 7, 2022

This PR adds the following:

  • Metadata for data points: each data point (`Token`, `Sentence`, `Span`) now has `add_metadata`, `get_metadata` and `has_metadata`.
  • OCR support for embeddings: you can use embeddings like LayoutLM or LayoutLMv3 as follows:

    ```python
    sentence = Sentence(["I", "love", "Berlin"])
    sentence[0].add_metadata("bbox", BoundingBox(0, 0, 10, 10))
    sentence[1].add_metadata("bbox", (12, 0, 22, 10))
    sentence[2].add_metadata("bbox", (0, 12, 10, 22))
    emb = TransformerWordEmbeddings("microsoft/layoutlm-base-uncased", layers="-1,-2,-3,-4", layer_mean=True)
    emb.embed(sentence)
    ```
    When using LayoutLMv2 or LayoutLMv3, the image also needs to be added:

    ```python
    with Image.open("tests/resources/tasks/example_images/i_love_berlin.png") as img:
        img.load()
        img = img.convert("RGB")
    sentence.add_metadata("image", img)
    ```
  • Support for the SROIE dataset, which can be loaded as follows:

    ```python
    corpus = SROIE()
    corpus = SROIE(load_images=True)  # for embeddings such as LayoutLMv2 or LayoutLMv3, which take information from the images
    ```
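The metadata API described in the first bullet is essentially a string-keyed store attached to each data point. A minimal pure-Python sketch of that pattern (an illustration only, not Flair's actual implementation):

```python
from typing import Any


class DataPoint:
    """Minimal sketch of the metadata pattern described above --
    illustrative only, not Flair's actual implementation."""

    def __init__(self) -> None:
        self._metadata: dict[str, Any] = {}

    def add_metadata(self, key: str, value: Any) -> None:
        # store arbitrary values (bounding boxes, images, ...) under a string key
        self._metadata[key] = value

    def has_metadata(self, key: str) -> bool:
        return key in self._metadata

    def get_metadata(self, key: str) -> Any:
        return self._metadata[key]


token = DataPoint()
token.add_metadata("bbox", (0, 0, 10, 10))
print(token.has_metadata("bbox"))   # True
print(token.get_metadata("bbox"))   # (0, 0, 10, 10)
```

Because the store is untyped, the same mechanism carries bounding boxes for tokens and a PIL image for the whole sentence.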

I tested training with 100 epochs, batch_size=16, lr=5e-5 and train_with_dev=True for the following embeddings:

| Embedding        | Micro-F1 |
|------------------|----------|
| LayoutLM-base    | 93.59%   |
| LayoutLM-large   | 93.59%   |
| LayoutLMv3-base  | 94.60%   |
| LayoutLMv3-large | 94.80%   |

@whoisjones (Member) left a comment:

Looks good, I couldn't really check the I/O operation of images in the dataset. I have some minor things you might want to look at before merging.

```python
left_context = left_context[-context_length:]
break
return left_context
left_context = sentence.tokens + left_context
```
Member:

`left_context += sentence.tokens` to be consistent with right context

Member (Author):

Notice that addition on lists is not commutative, hence your suggestion would lead to wrong results.

Member:

*and

```python
:param path_to_split_directory: base folder with the task data
:param label_type: the label_type to add the ocr labels to
:param encoding: the encoding to load the .json files with
:param normalize_coords_to_thousands: if True, the coordinates will be ranged from 0 to 1000
```
Member:

Is normalizing to thousands usual? If it was just selected at random, why not make the normalization factor an optional int and normalize the coordinates only if a factor is provided?

Member (Author):

Normalizing to thousands is very common; it is done by LayoutLM (& v2/v3), DocFormer, LAMBERT, ...
I haven't seen an implementation so far that does it differently.
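The convention discussed here scales pixel coordinates into a 0–1000 range relative to the page size. A minimal sketch of such a normalization (function and parameter names are illustrative, not the PR's actual code):

```python
def normalize_bbox(bbox, page_width, page_height, target=1000):
    """Scale a pixel bounding box (x0, y0, x1, y1) into the 0..target
    range relative to the page size, as the LayoutLM family expects.
    Illustrative sketch, not the PR's actual implementation."""
    x0, y0, x1, y1 = bbox
    return (
        int(x0 * target / page_width),
        int(y0 * target / page_height),
        int(x1 * target / page_width),
        int(y1 * target / page_height),
    )


# a box on a 600x400 page
print(normalize_bbox((100, 50, 300, 150), page_width=600, page_height=400))
# -> (166, 125, 500, 375)
```

The fixed 0–1000 range makes the coordinate embeddings independent of the original image resolution.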

```python
)

return tensor_args
# random check some tokens to save performance.
```
Member:

Why not check the entire dataset, or discard inputs that don't have all the required metadata?

Member (Author):

As noted in the comment, running the checks on a broader scale would impact speed. This is rather meant to ensure that the user gets a clear warning when there are no bounding boxes at all, without hurting performance.

And discarding inputs would silently hide errors, so I am against that.
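The spot-check strategy defended here (sample a few tokens and warn loudly only when the metadata is missing everywhere) can be sketched in plain Python like this (names are illustrative, not Flair's code):

```python
import random
import warnings


def spot_check_metadata(tokens, key="bbox", sample_size=16, seed=None):
    """Inspect a small random sample instead of the whole dataset:
    cheap, but still surfaces a loud warning when the metadata is
    missing everywhere. Illustrative sketch, not Flair's code."""
    rng = random.Random(seed)
    sample = rng.sample(tokens, min(sample_size, len(tokens)))
    if not any(key in tok for tok in sample):
        warnings.warn(f"none of the sampled tokens carry {key!r} metadata")
        return False
    return True


tokens_with = [{"bbox": (0, 0, 1, 1)} for _ in range(100)]
tokens_without = [{} for _ in range(100)]
print(spot_check_metadata(tokens_with))     # True
print(spot_check_metadata(tokens_without))  # warns, then False
```

Sampling keeps the check O(sample_size) regardless of dataset size, while a warning (rather than silently dropping inputs) keeps the error visible to the user.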

Member:

Running the formatter again should remove the empty line.

```python
if "bbox" in batch_encoding:
    model_kwargs["bbox"] = batch_encoding["bbox"].to(device, non_blocking=True)

if self.token_embedding or self.needs_manual_ocr:
```
Member:

We check `self.token_embedding` (line 547) and `self.needs_manual_ocr` (line 549) twice each, and jointly in line 516. The part required by both is the word ids. Can we move everything after line 526 into the condition for `self.token_embedding`? It looks like `self.needs_manual_ocr` only needs `word_ids_list`.

@whoisjones (Member): 👍

@alanakbik alanakbik merged commit dde4847 into flairNLP:master Nov 15, 2022
@alanakbik (Collaborator):

@helpmefindaname thanks for adding this!

@helpmefindaname helpmefindaname deleted the add_layoutlm_layoutxlm_support branch November 28, 2022 10:45