Increase the robustness for "large" PDFs #16
Conversation
# Right now only execute this for "large" PDFs
# TODO: Change it for all PDFs
you're doing this because there isn't a retrained model at this time, correct?
Yeah, that's correct!
For further reference, when we merge this PR, we'll also release v0.3.0 of vila due to changes in the API.
yoganandc left a comment
@lolipopshock let's merge, I verified that master crashes with a big PDF and that this branch doesn't.
The fix improves the robustness of the VILA library for "large" PDFs -- pages whose width or height exceeds 1000 and which therefore contain tokens with bounding box coordinates larger than 1000. Such input breaks the 2D position encoding used in the base Transformer models, which is fundamentally a lookup table (bbox coordinate value -> embedding values) that only accepts inputs in the range 0~1000.
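For context, here is a minimal sketch of the failure mode, assuming a LayoutLM-style backbone with a fixed-size 2D position embedding table (the names and sizes are illustrative, not the VILA internals):

```python
import torch
import torch.nn as nn

# Hypothetical 2D position embedding table sized like LayoutLM's default,
# i.e. only coordinate values 0..1023 are valid indices.
x_position_embeddings = nn.Embedding(1024, 768)

ok_coord = torch.tensor([850])    # within range -> embedding lookup succeeds
bad_coord = torch.tensor([1500])  # token from a "large" PDF page -> out of range

x_position_embeddings(ok_coord)
# x_position_embeddings(bad_coord)  # would raise "IndexError: index out of range in self"
```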
I added a normalize function to solve this issue. When the input PDF page is "large" (i.e., either page_width>1000 or page_height>1000), it normalizes all the tokens on that page using the normalize_bbox function, which converts the coordinates to the range 0~1000. However, this solution is not perfect -- our models haven't been appropriately tuned for these large PDFs. Ideally, we should retrain such models with normalized inputs.
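A minimal sketch of the kind of rescaling described above (the actual normalize_bbox implementation in VILA may differ in signature and rounding):

```python
def normalize_bbox(bbox, page_width, page_height, target=1000):
    # Rescale an (x1, y1, x2, y2) token bounding box into the 0~target range
    # expected by the model's 2D position embeddings.
    x1, y1, x2, y2 = bbox
    return (
        int(target * x1 / page_width),
        int(target * y1 / page_height),
        int(target * x2 / page_width),
        int(target * y2 / page_height),
    )

# Example: a token box on a 1700x2200 page is mapped into the 0~1000 range.
print(normalize_bbox((120, 1500, 400, 1560), page_width=1700, page_height=2200))
# -> (70, 681, 235, 709)
```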
It will lead to one API change: