Increase the robustness for "large" PDFs #16
Conversation
# Right now only execute this for "large" PDFs
# TODO: Change it for all PDFs
you're doing this because there isn't a retrained model at this time, correct?
Yeah, that's correct!
For further reference, when we merge this PR, we'll also release v0.3.0 of vila due to changes in the API.
yoganandc left a comment
@lolipopshock let's merge, I verified that master crashes with a big PDF and that this branch doesn't.
The fix improves the robustness of the VILA library for "large" PDFs -- pages whose width or height exceeds 1000 and which therefore contain tokens with bounding box coordinates larger than 1000. Such input breaks the 2D position encoding used in the base Transformer models, which is fundamentally a lookup table (bbox coordinate value -> embedding values) that only accepts inputs in the range 0~1000.
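For context, here is a minimal sketch of the failure mode, assuming a LayoutLM-style backbone with a fixed-size 2D position embedding table (the names and sizes are illustrative, not the VILA internals):

```python
import torch
import torch.nn as nn

# Hypothetical 2D position embedding table sized like LayoutLM's default,
# i.e. only coordinate values 0..1023 are valid indices.
x_position_embeddings = nn.Embedding(1024, 768)

ok_coord = torch.tensor([850])    # within range -> embedding lookup succeeds
bad_coord = torch.tensor([1500])  # token from a "large" PDF page -> out of range

x_position_embeddings(ok_coord)
# x_position_embeddings(bad_coord)  # would raise "IndexError: index out of range in self"
```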
I added a normalize function to solve this issue. When the input PDF page is "large" (i.e., either page_width>1000 or page_height>1000), it normalizes all the tokens on that page using the normalize_bbox function, which converts the coordinates to the range 0~1000. However, this solution is not perfect -- our models haven't been appropriately tuned for these large PDFs. Ideally, we should retrain such models with normalized inputs.
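A minimal sketch of the kind of rescaling described above (the actual normalize_bbox implementation in VILA may differ in signature and rounding):

```python
def normalize_bbox(bbox, page_width, page_height, target=1000):
    # Rescale an (x1, y1, x2, y2) token bounding box into the 0~target range
    # expected by the model's 2D position embeddings.
    x1, y1, x2, y2 = bbox
    return (
        int(target * x1 / page_width),
        int(target * y1 / page_height),
        int(target * x2 / page_width),
        int(target * y2 / page_height),
    )

# Example: a token box on a 1700x2200 page is mapped into the 0~1000 range.
print(normalize_bbox((120, 1500, 400, 1560), page_width=1700, page_height=2200))
# -> (70, 681, 235, 709)
```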
It will lead to one API change: