Stuck on find /to-ocr -name '*.pdf' -type f #23

@g-simmons

Hey there, I'm running pd3f on a workstation and accessing via SSH tunnel to my local machine.

I'm using the browser GUI to (hopefully 🤞) OCR a book scan with several hundred pages. The book scan is already in PDF format - the file size is around 30MB.

In the log output of the web GUI I am seeing:

INFO:root:setting up ocr
INFO:root:ocr finished successfully
INFO:pd3f.parsr_wrapper:sending PDF to Parsr
INFO:pd3f.parsr_wrapper:got response from Parsr
INFO:pd3f.doc_info:media line width: 174.0
INFO:pd3f.doc_info:median line height: 9.0
INFO:pd3f.doc_info:median line space: 4.159999999999968
INFO:pd3f.doc_info:counter width: [(409.44, 1036), (8, 1036), (409.68, 1014), (410.16, 982), (409.2, 974)]
INFO:pd3f.doc_info:counter height: [(10, 19582), (9, 11277), (8, 10001), (7, 1238), (9.24, 1180)]
INFO:pd3f.doc_info:counter lineheight: [(4.159999999999968, 3830), (4.160000000000025, 2457), (4.159999999999997, 2251), (4.399999999999977, 2118), (2.759999999999991, 1806)]
INFO:pd3f.export:export page #0

It's been at least 20 minutes since I started the conversion, so I'm surprised to see the tool is still on page #0.

In the terminal I'm seeing the following:

ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:00] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:01] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:02] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:03] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:04] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:05] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:06] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:07] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1
web_1         | 172.18.0.1 - - [11/Jul/2023 22:46:08] "GET /update/B3QUgcWaGCVeuaQTqy7u5Z HTTP/1.1" 200 -
ocr_worker_1  | ++ find /to-ocr -name '*.pdf' -type f
ocr_worker_1  | + sleep 1

I'm new to pd3f, but it looks like the OCR worker is stuck in a loop waiting to receive a file?
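
To make that guess concrete, here is a minimal Python sketch of the kind of polling loop the terminal trace seems to show. It assumes the worker simply lists the directory about once per second and processes whatever PDFs it finds. The names `find_pdfs`, `poll`, `handle` and `max_rounds` are hypothetical, and this is not pd3f's actual worker script.

```python
import time
from pathlib import Path


def find_pdfs(root):
    """Rough equivalent of: find ROOT -name '*.pdf' -type f"""
    return sorted(p for p in Path(root).rglob("*.pdf") if p.is_file())


def poll(root, handle, interval=1.0, max_rounds=None):
    """Check `root` every `interval` seconds and pass each PDF found to `handle`.

    `handle` is assumed to move or delete the file once it is done with it;
    otherwise the same file would be picked up again on the next round.
    `max_rounds` exists only so this sketch can terminate; a real worker
    would presumably loop forever. Returns the number of files handled.
    """
    handled = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for pdf in find_pdfs(root):
            handle(pdf)
            handled += 1
        rounds += 1
        time.sleep(interval)  # corresponds to the "+ sleep 1" lines in the trace
    return handled
```

If this reading is right, the repeating `find` / `sleep 1` lines might only mean that the directory is empty and the worker is idle, which would fit the `ocr finished successfully` line in the web log. In that case the actual stall would be somewhere after `export page #0` rather than in the OCR worker.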

Any suggestions for troubleshooting are much appreciated.
