This is a small, runnable demo that:
- Ingests a sample patent dataset (CSV)
- Builds a hybrid retrieval index (BM25 + vectors via Chroma + Sentence Transformers)
- Generates a cited brief with inline `[doc_id]` references
- Shows a Streamlit UI to test queries

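The list above does not say how the BM25 and vector results are combined. One common option is reciprocal rank fusion (RRF), sketched below under that assumption; it is not necessarily what `src` does. The function name, the constant `k=60`, and the document ids are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Assumption: each input list is already sorted best-first by its own
    retriever (e.g. one list from BM25, one from the vector store).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked results for a single query (ids are made up).
bm25_ranking = ["P3", "P1", "P2"]
vector_ranking = ["P1", "P4", "P3"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['P1', 'P3', 'P4', 'P2']
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.
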
# 1) Create and activate a virtual environment
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
# 2) Install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
# 3) Build the index (uses sample CSV in data/raw)
python -m src.ingest
# 4) Run the demo UI
streamlit run app/streamlit_app.py

Example queries to try:
- LLM watermarking methods
- drone swarming computer vision
- synthetic data generation patents
- transformer optimization energy efficiency
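
For any of the queries above, the demo returns a brief with inline `[doc_id]` references. The brief generator itself is not shown here, so the following is only a rough sketch of the output shape. `build_cited_brief`, the `hits` structure, and the placeholder summaries are hypothetical.

```python
def build_cited_brief(query, hits, max_items=3):
    """Format retrieved hits as a short brief with inline [doc_id] citations.

    Assumption: each hit is a dict with a 'doc_id' and a one-sentence
    'summary'; the real demo may use a different structure.
    """
    lines = [f"Brief for query: {query}"]
    for hit in hits[:max_items]:
        # Attach the source id directly after the statement it supports.
        lines.append(f"- {hit['summary']} [{hit['doc_id']}]")
    return "\n".join(lines)


# Placeholder hits; these are not rows from the sample dataset.
hits = [
    {"doc_id": "P1", "summary": "Placeholder summary of the first hit."},
    {"doc_id": "P4", "summary": "Placeholder summary of the second hit."},
]

brief = build_cited_brief("LLM watermarking methods", hits)
print(brief)
```
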

Troubleshooting:
- If `nltk` complains about missing data, the code has a fallback sentence splitter (no internet required).
- If the `chromadb` install is problematic on your system, try updating pip and setuptools: `python -m pip install --upgrade pip setuptools wheel`.

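The `nltk` fallback mentioned above can be pictured roughly as follows. This is a minimal sketch, assuming a regex-based splitter; the function names are hypothetical and the demo's actual fallback may differ.

```python
import re


def regex_split(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_sentences(text):
    """Prefer nltk's tokenizer; fall back to the regex splitter offline."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: nltk is installed but its tokenizer data is missing.
        return regex_split(text)


sentences = regex_split("A drone swarm is described. Does it use vision? Yes!")
print(sentences)
```

The regex version will mis-split abbreviations such as "U.S. Patent", which is usually tolerable for a demo but worth knowing when reading the generated briefs.
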
Security: This starter uses public, synthetic sample data. Do not ingest client or restricted data.