Not able to run structured extraction #1

@ShimantaBhuyan

Hi team, I was trying to get the project running, but it failed at the template prediction step. Here are the logs:

$ python3 structured_extraction.py

--- Processing ./ExtractFromPDF.pdf ---
2025-05-04 09:56:07,079 - INFO - Starting TWIX processing for: ./ExtractFromPDF.pdf
2025-05-04 09:56:07,079 - INFO - Running twix.transform...
Phrase extraction starts...
Phrase extraction for the merged file starts...
Phrase extraction for individual files starts...
Field prediction starts...
2025-05-04 09:56:17,536 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
perfect match starts...
cluster pruning starts...
2025-05-04 09:56:22,187 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
re-clustering starts...
2025-05-04 09:56:41,467 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Template prediction starts...
There does not exist an input such that every predict key exists at least twice.
2025-05-04 09:56:41,476 - ERROR - An error occurred during transaction extraction: list index out of range
Traceback (most recent call last):
  File "/Users/devkrishna/Desktop/Playground/TWIX/twix-ui/backend/structured_extraction.py", line 267, in extract_credit_card_transactions
    fields, template, extraction_objects, cost = twix.transform(
                                                 ^^^^^^^^^^^^^^^
  File "/Users/devkrishna/Desktop/Playground/TWIX/twix/transform.py", line 11, in transform
    template, cost = pattern.predict_template(pdf_paths, result_folder_path, LLM_model_name=LLM_model_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devkrishna/Desktop/Playground/TWIX/twix/pattern.py", line 1583, in predict_template
    template = predict_template_docs(phrases_bb, keywords, phrases, metadata)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devkrishna/Desktop/Playground/TWIX/twix/pattern.py", line 513, in predict_template_docs
    row_mp = seperate_rows(sample_phrases_bb)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devkrishna/Desktop/Playground/TWIX/twix/pattern.py", line 1165, in seperate_rows
    p_pre = pv[0][0]
            ~~^^^
IndexError: list index out of range

--- Extraction Results ---
{
    "Transactions": []
}

--- Estimated Cost ---
Total estimated cost: $0.000000

Intermediate files saved in: ./twix_output/ExtractFromPDF
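
The traceback ends at `p_pre = pv[0][0]` in `seperate_rows`, which suggests that `pv` (or its first entry) was empty when that line ran. As a minimal sketch only, a guard like the one below would avoid the `IndexError`. The helper name `first_phrase` is hypothetical and not part of TWIX, and the real shape of `pv` is assumed to be a list of sequences based on the indexing in the traceback.

```python
def first_phrase(pv):
    """Return pv[0][0], or None when pv or its first entry is empty.

    Hypothetical helper: `pv` is assumed to be a list of sequences,
    as the pv[0][0] indexing in the traceback implies.
    """
    if not pv or not pv[0]:
        return None
    return pv[0][0]


# An empty list (the likely case in the log above) no longer raises.
empty_result = first_phrase([])
nested_empty_result = first_phrase([[]])
normal_result = first_phrase([["Date", 12.5]])
```

Whether returning `None` is the right fallback depends on how the caller uses the value; raising a clearer error that names the input file might be more helpful.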

I also noticed that although the script reports a total cost of $0, the OpenAI API was actually used during the run, which I confirmed from my dashboard.
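
My guess, based only on the log, is that the cost comes back as part of the return value of `twix.transform`, so it is lost when an exception is raised before the function returns. A minimal sketch of one way around this is to accumulate the cost in an object that outlives the failure. Everything below (`CostTracker`, `run_steps`, `failing_step`, and the cost values) is hypothetical and only illustrates the idea; it is not how TWIX is implemented.

```python
class CostTracker:
    """Hypothetical accumulator; not part of TWIX."""

    def __init__(self) -> None:
        self.total = 0.0

    def add(self, amount: float) -> None:
        self.total += amount


def run_steps(steps, tracker: CostTracker) -> None:
    """Run each step and record its cost as soon as the step returns."""
    for step in steps:
        tracker.add(step())


def failing_step() -> float:
    # Stands in for the step that raised IndexError in the log.
    raise IndexError("list index out of range")


tracker = CostTracker()
try:
    # The first two steps stand in for LLM calls with made-up costs.
    run_steps([lambda: 0.25, lambda: 0.5, failing_step], tracker)
except IndexError:
    pass  # the run still fails, but the spend so far is kept

print(f"Total estimated cost: ${tracker.total:.6f}")
```

With this pattern the reported cost reflects the calls made before the failure instead of falling back to zero.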
