How to extract the text from the LayoutItem objects when we set extract_layout=True in the parser? #695
Unanswered
michelle-unia-mermich
asked this question in
Q&A
Replies: 0 comments
To parse a research paper that spans two columns and has images and tables at different positions on the page, I have connected to your code API and used `premium_mode`.

If I do not use `extract_layout=True` and parse as normal, the parsed text is not accurate because the caption text is mixed up with the actual paragraph text in the parsed document. The reading order is also not accurate. For example, if the page has (A) a bottom-left column text box and (B) a top-right column text box, humans will read (A) first and then (B) by convention, but in my attempts the parser reads (B) first and then (A).

To make sure that the reading order is correct and the final document only has the words of section headers and paragraph text, without caption text, I do:

1. Set `extract_layout=True`. With `extract_layout=True`, we get a list of `LayoutItem` objects from the page. Each `LayoutItem` object is labelled with a category, such as `text`, `sectionHeader`, or `caption`.
2. Keep only the `LayoutItem` objects labelled `text` or `sectionHeader`.
3. Order the `text` and `sectionHeader` `LayoutItem` objects according to the x,y coordinates in the `bbox` attribute of `LayoutItem`.

The `LayoutItem` object only has a few attributes, and the only content I can get from it is the image, retrieved with a GET request. How do I get the text within each `LayoutItem` object that is labelled `text` or `sectionHeader`? I can pass this image through an OCR reader or the parser again, but that seems expensive and wasteful, since the LlamaParse parser has already gone through those words once. The problem is only that I cannot associate each `LayoutItem` image with a text block in the final parsed text document.

I also tried to use the `bbox` attribute of `LayoutItem` to locate the corresponding text in the parsed document through the `PageItem` objects. Each `PageItem` object has a `bbox` and a text value attribute, but its `bbox` does not match any in the `LayoutItem` list. Basically, the `PageItem` objects recognised on each page are different from the `LayoutItem` objects on that page, and the recognition/classification of `PageItem` objects is nowhere near as good as that of `LayoutItem` objects. For example, a page that has 19 `LayoutItem` objects has only 9 `PageItem` objects, and the text in one `PageItem` object may combine the text of several `caption` and `text` `LayoutItem` objects, with the same wrong reading order as in the parsed document.

Is there a way to retrieve the words from each `LayoutItem` object without running another OCR pass or parsing those images a second time?
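For reference, the filtering and ordering steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not LlamaParse's API: the plain-dictionary representation, the field names `label` and `bbox` (with `x`, `y`, `w`, `h` normalised to the page and `y` growing downwards), the `id` field, and the helper `reading_order` are all hypothetical and would need to be adapted to the real `LayoutItem` structure.

```python
# Assumption: each layout item is a dict with a "label" and a "bbox" holding
# x, y, w, h normalised to 0..1, with y increasing towards the page bottom.
KEEP_LABELS = {"text", "sectionHeader"}


def reading_order(layout_items, page_width=1.0):
    """Keep body-text items and order them left column first, then right."""
    kept = [item for item in layout_items if item["label"] in KEEP_LABELS]

    def sort_key(item):
        box = item["bbox"]
        centre_x = box["x"] + box["w"] / 2
        # Two-column assumption: the box centre decides the column.
        column = 0 if centre_x < page_width / 2 else 1
        return (column, box["y"], box["x"])

    return sorted(kept, key=sort_key)


# Made-up page: (A) bottom-left text, (B) top-right text, a caption, a header.
items = [
    {"id": "B", "label": "text",
     "bbox": {"x": 0.55, "y": 0.10, "w": 0.40, "h": 0.20}},
    {"id": "cap", "label": "caption",
     "bbox": {"x": 0.05, "y": 0.45, "w": 0.40, "h": 0.05}},
    {"id": "A", "label": "text",
     "bbox": {"x": 0.05, "y": 0.60, "w": 0.40, "h": 0.30}},
    {"id": "H", "label": "sectionHeader",
     "bbox": {"x": 0.05, "y": 0.05, "w": 0.40, "h": 0.03}},
]

ordered_ids = [item["id"] for item in reading_order(items)]
print(ordered_ids)  # ['H', 'A', 'B']
```

The sketch drops the caption and reads the whole left column before the right one, so (A) comes before (B). It does not handle items that span both columns, and it still leaves the actual question open, since it orders the boxes but cannot recover the text inside them.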
I would really appreciate your help! Please let me know if I need to provide any more details/documents.