Needle in a Haystack Extraction Tool

Overview

This project provides a Python module that extracts specific information (the "needles") from a large body of text (the "haystack") using example sentences and a defined schema. It uses sentence embeddings and a language model to find the relevant sentences, extract the data, and store it in a CSV file.

Features

Embeddings for Similarity Matching: Uses SentenceTransformer models to compute embeddings and identify sentences similar to the provided examples.
Keyword Generation: Dynamically generates keywords based on example needles and schema to enhance sentence selection.
Parallel Processing: Uses multithreading to issue API calls concurrently, so candidate sentences are not processed one at a time.
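
The parallel processing feature can be illustrated with a minimal sketch using the standard library's thread pool. The function `extract_record` is a hypothetical stand-in for the real API call, and `max_workers=8` is an assumed value, not one taken from this repository.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_record(sentence: str) -> dict:
    """Hypothetical stand-in for one LLM API call on a candidate sentence."""
    return {"sentence": sentence, "length": len(sentence)}


def process_in_parallel(sentences: list[str], max_workers: int = 8) -> list[dict]:
    """Run extract_record on every sentence using a thread pool.

    Threads suit this workload because each call mostly waits on the network.
    executor.map yields results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(extract_record, sentences))


results = process_in_parallel(["Acme was founded in 1999.", "Short one."])
```

Because `executor.map` preserves input order, the extracted records line up with the candidate sentences without extra bookkeeping.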

Prerequisites

Python > 3.10
OpenAI API Key

Setup

Clone the Repository

git clone https://github.com/siddartha-10/needle-in-haystack.git
cd needle-in-haystack
cd code

Set up environment variables
```
OPENAI_API_KEY = "your_api_key"
```
Installing the Requirements
```
pip install -r requirements.txt 
```
Run the code
```
python app.py
```
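
The `OPENAI_API_KEY` variable from the setup steps above has to be visible to the Python process. How this repository loads it is not shown here, so the following is only a sketch of one common approach: read it from the environment and fail early with a clear message. The helper name `get_api_key` is hypothetical.

```python
import os


def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the app")
    return key
```

Checking for the key at startup avoids discovering a missing credential only after the embedding step has already run.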

Example Output Format

    {
        "name": "Starwars Technologies",
        "location": "Saturn",
        "employee_count": 13000,
        "founding_year": 20558,
        "is_public": false,
        "valuation": 6.0,
        "primary_focus": "interstellar communication"
    }
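
As a rough illustration of how a record in this format could be checked against a schema and written out as CSV, here is a sketch using only the standard library. The `CompanyRecord` dataclass and both helper functions are hypothetical; the field names and types are inferred from the example above, and the repository may define its schema differently.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class CompanyRecord:
    """Hypothetical schema mirroring the keys in the example output."""
    name: str
    location: str
    employee_count: int
    founding_year: int
    is_public: bool
    valuation: float
    primary_focus: str


def parse_record(raw_json: str) -> CompanyRecord:
    """Parse one JSON object and check that it has exactly the schema's keys."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(CompanyRecord)}
    if set(data) != expected:
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ expected)}")
    return CompanyRecord(**data)


def records_to_csv(records: list[CompanyRecord]) -> str:
    """Serialize records to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CompanyRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


example = (
    '{"name": "Starwars Technologies", "location": "Saturn", '
    '"employee_count": 13000, "founding_year": 20558, "is_public": false, '
    '"valuation": 6.0, "primary_focus": "interstellar communication"}'
)
record = parse_record(example)
csv_text = records_to_csv([record])
```

Note that this sketch only checks key names, not value types; a stricter validator would be needed to reject, for example, a string in `employee_count`.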

Video Explanation Link

https://www.loom.com/share/456695bda6c34d9cbc2437a5a388a0ed?sid=271851df-bf11-4927-bd1a-4aeecb7da818

Watch the video at 1.5x or 2.0x speed.

Code Explanation in a Few Sentences

Here is a basic overview of how the code works.

    1. Split haystack into sentences.
    2. Compute embeddings for sentences and example needles.
    3. Find candidate sentences based on similarity.
    4. Generate keywords using the LLM.
    5. Find additional candidate sentences containing keywords.
    6. Process candidate sentences in parallel to extract data.
    7. Return a list of extracted data conforming to the schema.
    8. Generate a JSON file and a CSV file.
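
The candidate selection in steps 1, 2, 3 and 5 can be sketched as follows. This is not the repository's implementation: to stay dependency-free, `embed` uses bag-of-words counts as a stand-in for a SentenceTransformer embedding, the keywords are passed in by hand rather than generated by the LLM (step 4), and the `0.3` threshold and all function names are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Step 1: naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence: str) -> Counter:
    """Step 2 stand-in: bag-of-words counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_candidates(
    haystack: str, examples: list[str], keywords: list[str], threshold: float = 0.3
) -> list[str]:
    """Steps 3 and 5: keep sentences similar to an example or containing a keyword."""
    example_vecs = [embed(e) for e in examples]
    candidates = []
    for sentence in split_sentences(haystack):
        vec = embed(sentence)
        similar = any(cosine(vec, ev) >= threshold for ev in example_vecs)
        has_keyword = any(k.lower() in sentence.lower() for k in keywords)
        if similar or has_keyword:
            candidates.append(sentence)
    return candidates


haystack = (
    "The weather was mild that year. "
    "Orion Labs is a private company based on Mars with 200 employees. "
    "Nobody noticed the comet. "
    "Its valuation reached 2 billion."
)
examples = ["Starwars Technologies is a public company based on Saturn with 13000 employees."]
candidates = find_candidates(haystack, examples, keywords=["valuation"])
```

In this toy run the second sentence is selected by similarity to the example and the fourth by the keyword match, which shows why the two selection passes complement each other: the keyword pass catches relevant sentences that are phrased differently from the examples.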

Contact

Twitter: @Siddartha_10
