
Improved the table merge algorithm in MultiPageTableExtractor (added a check on table layout).
Refactored table merging in MultiPageTableExtractor.
Improved header and footer analysis in HeaderFooterDetector.
Added header and footer analysis support in Tabby.
Added information on header and footer analysis (parameter need_header_footer_analysis) to the documentation (readthedocs).
Updated to Python 3.10.
Updated to Ubuntu 22.04.
Added contributing information.

Added simple multilingual textual layer correctness classification based on letter percentage calculation (textual_layer_classifier=letter).
Added a new parameter textual_layer_classifier with the values simple, ml (default), and letter.
Removed the fast_textual_layer_detection parameter. Its behavior is now available as textual_layer_classifier=simple.
Fixed a bug with table_type=table_wo_external_bounds (fixed the cv2.boundingRect usage).
Refactored parts of TableRecognition.
Added information on the table_type parameter and TableRecognition to the documentation.
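
The letter-percentage idea behind textual_layer_classifier=letter can be illustrated with a minimal sketch. This is not Dedoc's implementation: the function names and the 0.5 threshold are hypothetical, chosen only to show how a share of letters can separate readable text from a garbled textual layer in any alphabet.

```python
def letter_percentage(text: str) -> float:
    """Return the share of non-whitespace characters that are letters.

    str.isalpha() is Unicode-aware, so Latin, Cyrillic and other
    alphabets are counted the same way.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_like_correct_text(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: enough letters means the layer is usable.

    The threshold value is an assumption made for this sketch only.
    """
    return letter_percentage(text) >= threshold
```

A layer made mostly of digits and symbols falls below the threshold, while ordinary prose in any language stays well above it.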

Upgraded PyPDF2 to pypdf>4 and fixed bugs in attachment extraction from PDF files.
Added each_page_textual_layer_detection parameter for textual layer detection on each page of PDF documents (for PdfAutoReader).
Added ENABLE_CANCELLATION env variable for enabling/disabling parsing cancellation after client disconnection (enabled by default).
Fixed location coordinates of attached images extracted by PdfTabbyReader.
Added a new reader, PdfBrokenEncodingReader, for PDF documents that have a textual layer with broken encoding (pdf_with_text_layer=bad_encoding).
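
The parameters named in these notes are passed along with the document to be parsed. The sketch below assumes they are sent as string-valued form fields, with booleans written as "true" or "false". The helper to_form_parameters is hypothetical and not part of Dedoc; it only prepares the values and sends nothing.

```python
from typing import Any, Dict


def to_form_parameters(options: Dict[str, Any]) -> Dict[str, str]:
    """Convert Python values to strings suitable for form fields.

    Assumption of this sketch: booleans are written as "true"/"false"
    and every other value is passed through str().
    """
    form: Dict[str, str] = {}
    for name, value in options.items():
        if isinstance(value, bool):
            form[name] = "true" if value else "false"
        else:
            form[name] = str(value)
    return form


# Select the reader for PDFs whose textual layer has broken encoding.
broken_encoding = to_form_parameters({"pdf_with_text_layer": "bad_encoding"})

# Ask for textual layer detection on each page; how PdfAutoReader
# itself is selected is outside the scope of this sketch.
per_page_detection = to_form_parameters({"each_page_textual_layer_detection": True})
```

The resulting dictionaries can then be supplied wherever parsing parameters are accepted.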

Release note: v2.3.1

Fixed a bug with bold lines in DocxReader (see issue 479).
Upgraded requirements.txt (beautifulsoup4 to version 4.12.3).
Added support for an external GROBID service (added support for the Authorization parameter).
Added GOST (Russian government standard) frame recognition in PdfTabbyReader (need_gost_frame_analysis parameter).
Updated the documentation (added GOST frame recognition).
Added multi-page table handling to PdfTabbyReader.

Created a Dedoc Telegram chat.
Added the patterns parameter for configuring the default structure type.
Added notebooks with Dedoc usage examples (see issue 484).
Fixed the "OutOfMemoryError: Java heap space" bug in PdfTabbyReader (see issue 489).
Fixed a bug with numbering in DocxReader (see issue 494).
Added GOST (Russian government standard) frame recognition in PdfImageReader and PdfTxtlayerReader (need_gost_frame_analysis parameter).

Fixed bugs with the start and end of BBoxAnnotation in PdfTabbyReader.
Improved column classification and orientation detection for PDF documents and images (is_one_column_document and document_orientation parameters).
Upgraded Docker: docker-compose is no longer supported; use docker compose instead.
Fixed a bug in table parsing in DocxReader (see issue).
Added simple textual layer detection in PdfAutoReader (fast_textual_layer_detection parameter).
Improved paragraph extraction from PDF documents and images.
Retrained the classifier for diplomas (document_type="diploma") on a new dataset.

Added internal functions and classes to support integration of Dedoc into LangChain.
Upgraded some dependencies, in particular xgboost>=1.6.0, pandas, and pdfminer.six.

Page division and page numbers are now shown in the HTML output representation (API usage, return_format="html").
Made imports from the dedoc library faster.
Added a tutorial on how to add a new language to dedoc (not entirely finished).
Added additional page_id metadata for multi-page nodes (structure_type="tree" in API, TreeConstructor in the library).
Updated OCR and orientation/columns classification benchmarks.
Minor edits of README.md.
Fixed empty cells handling in CSVReader.
Fixed bounding boxes extraction for text in tables for PdfTabbyReader.

Releases: ispras/dedoc