
siddartha-10/Needle-In-Haystack


Needle in a Haystack Extraction Tool

Overview

This project provides a Python module that extracts specific information (the "needles") from a large body of text (the "haystack") using example sentences and a defined schema. It combines sentence embeddings with a language model to find the relevant sentences and extract structured data from them, then stores the results in JSON and CSV files.

Features

  • Embeddings for Similarity Matching: Uses SentenceTransformer models to compute embeddings and identify sentences similar to the provided examples.
  • Keyword Generation: Dynamically generates keywords based on example needles and schema to enhance sentence selection.
  • Parallel Processing: Uses multithreading to issue API calls concurrently, so the total extraction time is not the sum of the individual call latencies.
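
As a rough illustration of the parallel processing feature, the sketch below fans a list of candidate sentences out to a thread pool. It is a minimal example, not the repository's code: `extract_record`, `extract_all`, and `max_workers=8` are hypothetical, and the stub stands in for the real per-sentence API call.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_record(sentence: str) -> dict:
    """Hypothetical stand-in for the per-sentence LLM API call."""
    return {"sentence": sentence, "length": len(sentence)}


def extract_all(sentences: list[str], max_workers: int = 8) -> list[dict]:
    # Threads suit this workload because each call mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever order the calls finish in.
        return list(pool.map(extract_record, sentences))


results = extract_all(["Alpha Corp is based on Mars.", "Beta Ltd went public."])
```

Because `pool.map` preserves input order, each result can be matched back to the sentence that produced it without extra bookkeeping.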

Prerequisites

  • Python > 3.10
  • OpenAI API Key

Setup

  1. Clone the Repository

    git clone https://github.com/siddartha-10/needle-in-haystack.git
    cd needle-in-haystack
    cd code
    
  2. Set up environment variables

    OPENAI_API_KEY = "your_api_key"
    
  3. Install the requirements

    pip install -r requirements.txt 
    
  4. Run the code

    python app.py
    

Example Output Format

    {
      "name": "Starwars Technologies",
      "location": "Saturn",
      "employee_count": 13000,
      "founding_year": 20558,
      "is_public": false,
      "valuation": 6.0,
      "primary_focus": "interstellar communication"
    }
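
A record in this shape can be serialized to both JSON and CSV with the standard library. The sketch below is only an illustration: the field list is copied from the example above, while `FIELDS`, `to_json`, and `to_csv` are hypothetical names, not functions from this repository.

```python
import csv
import io
import json

# Column order taken from the example record above.
FIELDS = ["name", "location", "employee_count", "founding_year",
          "is_public", "valuation", "primary_focus"]


def to_json(records: list[dict]) -> str:
    return json.dumps(records, indent=2)


def to_csv(records: list[dict]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()       # first row: the schema field names
    writer.writerows(records)  # one row per extracted record
    return buffer.getvalue()


records = [{
    "name": "Starwars Technologies",
    "location": "Saturn",
    "employee_count": 13000,
    "founding_year": 20558,
    "is_public": False,
    "valuation": 6.0,
    "primary_focus": "interstellar communication",
}]
json_text = to_json(records)
csv_text = to_csv(records)
```

Writing `json_text` and `csv_text` to disk then only needs a plain `open(path, "w")` call for each file.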

Video Explanation Link

https://www.loom.com/share/456695bda6c34d9cbc2437a5a388a0ed?sid=271851df-bf11-4927-bd1a-4aeecb7da818

Watch the video at 1.5x or 2.0x speed.

Code Explanation in a Few Sentences

Here is a basic overview of how the code works.

    1. Split haystack into sentences.
    2. Compute embeddings for sentences and example needles.
    3. Find candidate sentences based on similarity.
    4. Generate keywords using the LLM.
    5. Find additional candidate sentences containing keywords.
    6. Process candidate sentences in parallel to extract data.
    7. Return a list of extracted data conforming to the schema.
    8. Generate a JSON file and a CSV file.
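
The candidate-selection part of these steps (1 to 5) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the project's implementation: a toy bag-of-words vector replaces the SentenceTransformer embeddings, the keyword list is passed in by hand instead of being generated by the LLM, and the function names and the `0.5` threshold are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence: str) -> Counter:
    # Toy bag-of-words vector; a stand-in for a real sentence embedding.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_candidates(haystack, examples, keywords, threshold=0.5):
    example_vecs = [embed(e) for e in examples]
    candidates = []
    for sentence in split_sentences(haystack):
        vec = embed(sentence)
        # Keep a sentence if it resembles any example OR contains any keyword.
        similar = any(cosine(vec, ev) >= threshold for ev in example_vecs)
        has_keyword = any(k.lower() in sentence.lower() for k in keywords)
        if similar or has_keyword:
            candidates.append(sentence)
    return candidates


haystack = ("The weather was mild. "
            "Orion Labs is a company based on Mars with 200 employees. "
            "Nothing else happened. "
            "Vega Systems reached a valuation of 3 billion.")
examples = ["Starwars Technologies is a company based on Saturn with 13000 employees."]
candidates = find_candidates(haystack, examples, keywords=["valuation"])
```

Here the Orion Labs sentence is picked up by similarity to the example, while the Vega Systems sentence is phrased differently and is only caught by the keyword pass, which is the point of combining the two signals.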

Contact

  1. Twitter: @Siddartha_10

About

Hybrid approach combining similarity-based and dynamic keyword matching.
