Add the PDF Image loader #1

lolipopshock · 2021-07-13T04:56:34Z

No description provided.

lolipopshock · 2021-07-13T05:00:58Z

After some reconsideration, I think it would probably be better if the load_image function lived in Document. When doc.images is accessed, it would check whether the document images are already loaded and, if not, run the commands to load them. The reason is simple: efficiency. Loading images is an expensive operation, and we only want to do it when needed.

But there are definitely some drawbacks. To implement this, we would need to modify the Document class, for example to store the page size and the original file path.
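
To make the lazy-loading idea concrete, here is a minimal sketch, not the code in this PR. `LazyDocument`, `fake_loader`, and the attribute names are hypothetical stand-ins; the only assumption taken from the comment above is that the document stores the original file path and page sizes so it can load images on first access.

```python
from typing import Callable, List, Optional, Tuple


class LazyDocument:
    """Hypothetical stand-in for Document that loads images on first access."""

    def __init__(
        self,
        pdf_path: str,
        page_sizes: List[Tuple[float, float]],
        image_loader: Callable[[str], List[str]],
    ) -> None:
        # Stored so the images can be rendered later without re-parsing.
        self.pdf_path = pdf_path
        self.page_sizes = page_sizes
        self._image_loader = image_loader
        self._images: Optional[List[str]] = None  # None means "not loaded yet"

    @property
    def images(self) -> List[str]:
        # Run the expensive loader only once, the first time images are needed.
        if self._images is None:
            self._images = self._image_loader(self.pdf_path)
        return self._images


load_calls: List[str] = []


def fake_loader(path: str) -> List[str]:
    """Pretend to rasterize a PDF; records each call so laziness is visible."""
    load_calls.append(path)
    return [f"{path}#page-{i}" for i in range(2)]


doc = LazyDocument("paper.pdf", [(612.0, 792.0)] * 2, fake_loader)
```

Nothing is loaded when `doc` is constructed; the first read of `doc.images` triggers the loader, and later reads reuse the cached list.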

lolipopshock · 2021-07-14T16:36:27Z

The images are loaded by default, and can be disabled via parser.parse(..., load_images=False).
Will add readme and tests later today.
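
For illustration only, the opt-out behaviour described here could look roughly like the sketch below. `SketchParser` and `record_loader` are hypothetical; the only detail taken from this comment is a `load_images` keyword on `parse` that defaults to loading.

```python
from typing import Callable, Dict, List, Optional


class SketchParser:
    """Hypothetical parser: images are loaded eagerly unless the caller opts out."""

    def __init__(self, image_loader: Callable[[str], List[str]]) -> None:
        self._image_loader = image_loader

    def parse(self, path: str, load_images: bool = True) -> Dict[str, Optional[object]]:
        doc: Dict[str, Optional[object]] = {"path": path, "images": None}
        if load_images:
            # The expensive step; skipped entirely when load_images=False.
            doc["images"] = self._image_loader(path)
        return doc


loader_calls: List[str] = []


def record_loader(path: str) -> List[str]:
    """Stand-in for PDF rasterization that records each invocation."""
    loader_calls.append(path)
    return [f"{path}#page-0"]


parser = SketchParser(record_loader)
with_images = parser.parse("paper.pdf")
without_images = parser.parse("paper.pdf", load_images=False)
```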

kyleclo

lgtm

kyleclo · 2021-07-14T21:40:23Z

mmda/types/image.py

+def tobase64(self):
+    # Ref: https://stackoverflow.com/a/31826470
+    buffered = BytesIO()
+    self.save(buffered, format="JPEG")


why not png?
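
As context for this question, here is a small sketch of a base64 round trip with the format as a parameter. It assumes Pillow is installed; `to_base64` and `from_base64` are illustrative free functions, not the methods added in this PR. PNG is lossless, so decoded pixels match the original exactly, whereas JPEG is lossy and gives no such guarantee.

```python
import base64
from io import BytesIO

from PIL import Image


def to_base64(image: Image.Image, fmt: str = "PNG") -> str:
    """Encode a PIL image as a base64 string in the given file format."""
    buffered = BytesIO()
    image.save(buffered, format=fmt)
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def from_base64(data: str) -> Image.Image:
    """Decode a base64 string produced by to_base64 back into a PIL image."""
    buffered = BytesIO(base64.b64decode(data))
    img = Image.open(buffered)
    img.load()  # force decoding now, while the buffer is still in scope
    return img


# A tiny solid-color image is enough to show the lossless round trip.
original = Image.new("RGB", (4, 4), color=(10, 200, 30))
restored = from_base64(to_base64(original, fmt="PNG"))
```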

kyleclo · 2021-07-14T21:41:24Z

mmda/types/image.py

+
+# Monkey patch the PIL.Image methods to add base64 conversion
+
+def tobase64(self):


Do we want to support two forms: base64 and a proper image file that one can download and view?

kyleclo · 2021-07-14T21:42:13Z

mmda/types/image.py

+    img = Image.open(buffered)
+    return img  
+
+Image.Image.tobase64 = tobase64 # This is the method applied to individual Image classes 


prefer we define our own Image class, inherit from PIL's Image class, and override the methods.

kyleclo · 2021-07-14T21:44:13Z

mmda/parsers/parser.py

+from mmda.types.image import Image
+
+
+class BaseParser:


why the rename?

kyleclo · 2021-07-14T21:54:29Z

mmda/parsers/symbol_scraper_parser.py

            Sent: [],
-            Block: []
+            Block: [],
+            DocImage: [],


add inline comment saying this is loaded into doc in parse()

Add the PDF Image loader

9701e6c

lolipopshock added 3 commits July 14, 2021 12:16

Monkey patch the PIL.Image methods to add base64 conversion

81f8e24

Update the class and API design

b30cc83

load images in symbol_scraper

2aef3bd

lolipopshock requested a review from kyleclo July 14, 2021 16:36

lolipopshock added 3 commits July 14, 2021 18:06

minor tweaking

dbf807a

Add test load PDF images

3f591fb

remove DocImage in _convert_nested_text_to_doc_json

db07c00

kyleclo approved these changes Jul 14, 2021

kyleclo merged commit 41325ee into main Jul 14, 2021

kyleclo deleted the add-image-extractor branch July 14, 2021 22:10

kyleclo mentioned this pull request Jun 22, 2022

Bib Entry Parser/Predictor #83

Merged

geli-gel mentioned this pull request Jul 29, 2022

Angelez/bibentries #113

Merged
