Hey there, I'm running pd3f on a workstation and accessing it through an SSH tunnel from my local machine.
I'm using the browser GUI to (hopefully) OCR a book scan with several hundred pages. The book scan is already in PDF format - the file size is around 30MB.
In the log output of the web GUI I am seeing:
INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 174.0
INFO:pd3f.doc_info:median line height: 9.0
INFO:pd3f.doc_info:median line space: 4.159999999999968
INFO:pd3f.doc_info:counter width: [(409.44, 1036), (8, 1036), (409.68, 1014), (410.16, 982), (409.2, 974)]
INFO:pd3f.doc_info:counter height: [(10, 19582), (9, 11277), (8, 10001), (7, 1238), (9.24, 1180)]
INFO:pd3f.doc_info:counter lineheight: [(4.159999999999968, 3830), (4.160000000000025, 2457), (4.159999999999997, 2251), (4.399999999999977, 2118), (2.759999999999991, 1806)]
INFO:pd3f.export:export page #0
It's been at least 20 minutes since I started the conversion, so I'm surprised to see the tool is still on page #0.
In the terminal I'm seeing the following:
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:00] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:01] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:02] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:03] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:04] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:05] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:06] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:07] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
web_1 | 172.18.0.1 - - [11/Jul/2023 22:46:08] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1 | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1 | + sleep 1
I'm new to pd3f, but it looks like the OCR worker is stuck in a loop waiting to receive a file?
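For reference, the `ocr_worker_1` trace above (alternating `find /to-ocr -name '*.pdf' -type f` and `sleep 1`) is what a simple directory-polling loop would print under shell tracing. The sketch below is only a guess at that pattern, not pd3f's actual worker script: the function name `poll_for_pdfs` and its poll limit are made up for illustration so the loop can terminate. If this reading is right, an empty `find` result once per second would mean the worker is idle rather than stuck on a file.

```shell
# Hypothetical reconstruction of a polling loop that would produce the
# "find ... / sleep 1" trace. NOT taken from pd3f's source.
# Usage: poll_for_pdfs DIR MAX_POLLS [INTERVAL_SECONDS]
poll_for_pdfs() {
  poll_dir=$1
  poll_max=$2
  poll_interval=${3:-1}
  poll_count=0
  while [ "$poll_count" -lt "$poll_max" ]; do
    # Same command as in the trace: list regular files ending in .pdf
    poll_found=$(find "$poll_dir" -name '*.pdf' -type f)
    if [ -n "$poll_found" ]; then
      # A file is waiting; a real worker would hand it to OCR here
      printf '%s\n' "$poll_found"
      return 0
    fi
    # Nothing queued yet: wait and look again
    sleep "$poll_interval"
    poll_count=$((poll_count + 1))
  done
  # Gave up after poll_max empty polls
  return 1
}
```

Calling `poll_for_pdfs /to-ocr 5` inside the worker container would show whether any PDF is actually queued in that directory.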
Any suggestions for troubleshooting are much appreciated.