The script identifies whether a given PDF is structured (text-based) or scanned.
If the PDF is text-based, the script uses the `pdftotext` tool to extract the text content and saves the pages in the given folder. It also splits the PDF into individual single-page PDFs using `pdfseparate`.
Make sure that `pdftotext`, `pdfinfo` and `pdfseparate` are installed on your computer. These utilities are part of poppler-utils.
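A quick way to confirm the three tools are on the `PATH` before running the script is sketched below. The helper name `missing_tools` is hypothetical and not part of this repository; it only relies on `shutil.which` from the standard library.

```python
import shutil

# The three poppler-utils commands the script depends on.
REQUIRED_TOOLS = ("pdftotext", "pdfinfo", "pdfseparate")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools (install poppler-utils):", ", ".join(missing))
    else:
        print("All required tools are available.")
```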
- Reads the PDF file.
- Uses `pdfinfo` to get the total number of pages in the PDF and its size.
- Uses `pdftotext` to dump the text and checks the size of the extracted text content. If the text content averages at least 500 bytes per page, the PDF is structured; otherwise it is scanned.
- If the PDF is structured, uses `pdftotext` to extract the text content page by page and puts the txt files in the `text` folder.
- If the PDF is non-structured, i.e. scanned, uses the ABBYY OCR service to extract the text content. TODO
- Creates a `stats.json` file with the following content (structured = false if scanned):
  `{ "structured": true, "pages": 5 }`
- Uses `pdfseparate` to extract each PDF page and saves it in the `pages` folder.
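The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `run.py`: the function names, the output file names (`1.txt`, `1.pdf`, ...) and the reading of the threshold as "at least 500 bytes per page" are assumptions. The OCR branch for scanned PDFs is left out.

```python
import json
import re
import subprocess
from pathlib import Path

BYTES_PER_PAGE_THRESHOLD = 500  # average text size per page, from the description


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from the 'Pages:' line printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def is_structured(text_bytes: int, pages: int,
                  threshold: int = BYTES_PER_PAGE_THRESHOLD) -> bool:
    """Treat the PDF as structured if the average text per page meets the threshold."""
    if pages <= 0:
        return False
    return text_bytes / pages >= threshold


def process(pdf_path: str, out_dir: str) -> dict:
    """Classify the PDF, extract text per page if structured, and split the pages."""
    out = Path(out_dir)
    (out / "text").mkdir(parents=True, exist_ok=True)
    (out / "pages").mkdir(parents=True, exist_ok=True)

    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True, check=True).stdout
    pages = parse_page_count(info)

    # Passing "-" as the output file makes pdftotext write to stdout.
    dump = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, check=True).stdout
    structured = is_structured(len(dump), pages)

    if structured:
        for n in range(1, pages + 1):
            # -f / -l select the first and last page to convert.
            # The "<n>.txt" naming is an assumption.
            subprocess.run(["pdftotext", "-f", str(n), "-l", str(n),
                            pdf_path, str(out / "text" / f"{n}.txt")], check=True)

    # pdfseparate replaces %d in the pattern with the page number.
    subprocess.run(["pdfseparate", pdf_path, str(out / "pages" / "%d.pdf")],
                   check=True)

    stats = {"structured": structured, "pages": pages}
    (out / "stats.json").write_text(json.dumps(stats))
    return stats
```

Keeping `parse_page_count` and `is_structured` separate from the subprocess calls makes the classification rule easy to check without poppler-utils installed.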
Execute `bash runtest.sh` to run all of the above tests at once.
- Register with ABBYY and get an application-id and password. Copy `settings.config.bak` to `settings.config` and update the application-id and password.
- Run `python run.py` to see the options.
- `python run.py -i tests/sample.pdf -o out` creates the folder `out/text` with the extracted text files, `out/pages` with the separated PDF files, and `out/stats.json`.
- log the events
- handle exceptions
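After a run such as `python run.py -i tests/sample.pdf -o out`, the output folder can be inspected from another script. The sketch below assumes only the layout described above (`stats.json`, a `text` folder and a `pages` folder); the helper name `load_result` is hypothetical.

```python
import json
from pathlib import Path


def load_result(out_dir):
    """Read stats.json and list the extracted text and page files."""
    out = Path(out_dir)
    stats = json.loads((out / "stats.json").read_text())

    def names(folder, pattern):
        # A scanned PDF may leave the text folder empty or absent.
        path = out / folder
        return sorted(p.name for p in path.glob(pattern)) if path.is_dir() else []

    return stats, names("text", "*.txt"), names("pages", "*.pdf")


if __name__ == "__main__":
    stats, text_files, page_files = load_result("out")
    kind = "structured" if stats["structured"] else "scanned"
    print(f"{stats['pages']} pages, {kind}")
    print("text files:", text_files)
    print("page files:", page_files)
```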