NeuriScout

A full-stack application for searching and analyzing NeurIPS 2025 research papers using semantic search and LLM-powered insights.

Features

  • Semantic Search: Search through NeurIPS 2025 papers, workshops, tutorials, invited talks, and expo events using natural language queries
  • Advanced Filtering: Filter by author, affiliation, session/event type, and conference day/time (AM/PM)
    • Full Conference Coverage: Search across the entire NeurIPS 2025 San Diego program (Dec 2-7)
    • Multiple Content Types: Papers (5,450 items), workshops, tutorials, invited talks, and expo events (panels, demonstrations, workshops)
    • Smart Session Filtering: Find specific event types like "Invited Talk", "Workshop", or "Expo Talk Panel"
  • Bookmarks: Save papers and events to bookmarks for easy access later
    • Persistent Storage: Bookmarks are saved in browser localStorage and persist across sessions
    • Smart Organization: Bookmarked items are sorted by day, time (AM/PM), and poster number for easy conference navigation
    • Poster Positions: View poster numbers for each paper to quickly locate them at the conference
    • CSV Export: Export your bookmarks to CSV with day, time, poster, title, and session information
    • Easy Management: Add/remove bookmarks with a single click, clear all bookmarks at once
  • Deep Dive Chat: Add up to 25 papers to a Deep Dive queue and chat about them with OpenAI or Google Gemini
    • One-click add: Use the button on each paper card or "Add all to Deep Dive" for the current results
    • Smart Pre-upload: Papers are uploaded when you open the chat panel, making your first query instant
    • File Caching: Papers are cached per session, so there is no re-uploading on subsequent questions
    • Full Paper Access: Gemini reads complete PDFs natively (no truncation)
  • Markdown & LaTeX Support: Full rendering of mathematical formulas and formatted text
  • Customizable System Prompts: Configure how the AI responds to your questions
  • Model Selection: Choose from available OpenAI and Gemini models
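Semantic search ranks items by embedding similarity rather than keyword overlap. The app does this with Sentence Transformers embeddings stored in ChromaDB; the self-contained sketch below illustrates the core idea with made-up 3-dimensional vectors and titles:

```python
import math

# Toy "embeddings" -- the real app embeds titles/abstracts into
# high-dimensional vectors with Sentence Transformers and stores them in ChromaDB.
papers = {
    "Efficient Transformers for Long Contexts": [0.9, 0.1, 0.2],
    "Graph Neural Networks for Molecules":      [0.1, 0.9, 0.3],
    "Scaling Laws for Diffusion Models":        [0.4, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, k=2):
    # Rank all papers by similarity to the query embedding, return the top k.
    ranked = sorted(papers.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [title for title, _ in ranked[:k]]

print(search([0.8, 0.2, 0.1]))  # the transformers paper ranks first
```

In the real pipeline the query string is embedded with the same model as the documents, and ChromaDB performs this nearest-neighbor ranking at scale.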

Project Structure

NeuriScout/
├── backend/              # FastAPI backend server
│   ├── main.py          # API endpoints
│   ├── rag.py           # RAG logic and paper fetching
│   └── ingest.py        # Data ingestion script
├── frontend/            # Next.js frontend
│   └── src/
│       ├── app/         # Pages and components
│       └── lib/         # API client
├── data/                # Data files (CSV, JSON)
├── scripts/             # Utility scripts for data processing
└── chroma_db/          # Vector database (generated)

Setup

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • OpenAI API key and/or Google Gemini API key

Installation

  1. Clone the repository:
git clone https://github.com/gminneci/NeuriScout.git
cd NeuriScout
  2. Set up Python environment and install backend:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e .

This installs the package in editable mode with all dependencies and creates the neuriscout-backend and neuriscout-ingest commands.

  3. Generate the ChromaDB database:
neuriscout-ingest

This will process the data from:

  • data/papercopilot_neurips2025_merged_openreview.csv (papers)
  • data/neurips_2025_enriched_events.csv (workshops, tutorials, invited talks)
  • data/neurips_2025_expo_events.csv (expo events: panels, demonstrations, workshops)

This creates the vector database in chroma_db/ with embeddings for ~5,500 unique items (papers + events). Ingestion takes a few minutes and produces ~90MB of data.

Note: The ChromaDB is required for the application to work. Keep it in the chroma_db/ directory (it's excluded from git).

  4. Set up frontend:
cd frontend
npm install
cd ..
  5. Configure API keys (optional):

You can either set environment variables or enter them in the UI:

export OPENAI_API_KEY=your_key_here
export GEMINI_API_KEY=your_key_here

Running the Application

  1. Start the backend (in one terminal):
source venv/bin/activate  # On Windows: venv\Scripts\activate
neuriscout-backend

The backend will run on http://localhost:8000

Note: The first startup takes 10-30 seconds while loading AI models.

  2. Start the frontend (in a new terminal):
cd frontend
npm run dev

The frontend will run on http://localhost:3000

  3. Open your browser to http://localhost:3000

Usage

  1. Search Papers & Events: Enter keywords or research questions in the search box
  2. Filter Results:
    • Use the dropdowns to filter by author, affiliation, or session
    • Filter by session to find specific content types:
      • "Invited Talk" - Find all 6 invited talks (Rich Sutton, Zeynep Tufekci, Yejin Choi, Melanie Mitchell, Kyunghyun Cho, Andrew Saxe)
      • "Workshop" - Browse workshop sessions
      • "Tutorial" - Find tutorial sessions
      • "Expo Talk Panel", "Expo Workshop", "Expo Demonstration" - Browse expo events
      • Or filter by poster sessions (e.g., "San Diego Poster Session 1")
    • Use day filters (Tue-Sun) and time filters (AM/PM) to browse by conference schedule
    • Combine multiple filters with OR logic (e.g., "MIT" OR "Stanford")
  3. Bookmark Items:
    • Click the star icon on any paper or event card to bookmark it
    • View all bookmarks via the "Bookmarks" button in the header
    • Bookmarks are automatically sorted by day, time (AM/PM), and poster number
    • Poster numbers are displayed next to the time to help navigate the conference
    • Export all bookmarks to CSV for offline reference
    • Clear individual bookmarks or all at once
  4. Build Your Deep Dive:
    • Click "Add to Deep Dive" on individual paper cards (or "Add all to Deep Dive" for the current results)
    • Track how many slots remain (up to 25 papers can be active at once)
    • Remove papers via the Deep Dive button if you want to swap them out
  5. Deep Dive Chat:
    • Click "Deep Dive (X/25)" to open the chat panel
    • Papers are automatically uploaded in the background (Gemini only)
    • Click the settings icon to configure API keys and models
    • Ask questions about the Deep Dive papers and tweak the system prompt anytime
    • Subsequent questions are instant thanks to file caching
  6. Links to Sources:
    • "View on NeurIPS" opens the official NeurIPS virtual site page for the paper/event (for logged-in bookmarks)
    • "Paper" opens the OpenReview page for papers (reviews/discussion)
    • NeurIPS links are standardized and work for all content types (posters, orals, tutorials, workshops, invited talks)
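The bookmark ordering and CSV export described above come down to a compound sort key: conference day, then AM before PM, then poster number. A minimal sketch of that logic in Python (the real implementation lives in the frontend, and the record field names below are assumptions for illustration):

```python
import csv
import io

# NeurIPS 2025 runs Tue-Sun; map days to sortable indices.
DAY_ORDER = {d: i for i, d in enumerate(["Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])}

bookmarks = [  # hypothetical bookmark records
    {"day": "Wed", "time": "PM", "poster": 312, "title": "Paper B", "session": "Poster Session 2"},
    {"day": "Tue", "time": "PM", "poster": 45,  "title": "Paper A", "session": "Poster Session 1"},
    {"day": "Wed", "time": "AM", "poster": 7,   "title": "Paper C", "session": "Poster Session 2"},
]

# Sort by day, then AM before PM (False sorts before True), then poster number.
bookmarks.sort(key=lambda b: (DAY_ORDER[b["day"]], b["time"] != "AM", b["poster"]))

# Export in the same column order the app uses for its CSV download.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["day", "time", "poster", "title", "session"])
writer.writeheader()
writer.writerows(bookmarks)
print(buf.getvalue())
```

Sorting AM/PM via a boolean comparison keeps the key tuple simple; with this data the export lists Paper A (Tue PM), then Paper C (Wed AM), then Paper B (Wed PM).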

Data Processing

The scripts/ directory contains utilities for:

  • Scraping paper data from various sources
  • Merging datasets
  • Data validation and debugging

Deployment (Free Hosting)

Deploy to Vercel (Frontend) + Render (Backend)

Backend Deployment (Render)

  1. Create a Render account at render.com

  2. Create a new Web Service:

    • Connect your GitHub repository
    • Name: neuriscout-backend
    • Environment: Python 3
    • Build Command: pip install -e .
    • Start Command: neuriscout-backend
  3. Add a Persistent Disk (for ChromaDB):

    • Go to your service settings
    • Add Disk: Mount path /opt/render/project/src/chroma_db, Size: 1GB
  4. Set Environment Variables:

    PYTHON_VERSION=3.11.0
    HOST=0.0.0.0
    PORT=8000
    CHROMA_DB_PATH=/opt/render/project/src/chroma_db
    ALLOWED_ORIGINS=https://your-app.vercel.app
    

    Optional (or enter in UI):

    OPENAI_API_KEY=your_key
    GEMINI_API_KEY=your_key
    
  5. Upload ChromaDB data:

    • The ChromaDB database is NOT included in the repository (it's ~90MB)
    • You need to generate it locally using neuriscout-ingest and then upload it
    • Options for uploading:
      • Via Render Shell: Access your service's Shell tab and run neuriscout-ingest
      • Manual upload: Use scp or Render's file upload to copy your local chroma_db/ directory to the persistent disk mount path
    • The database contains ~5,500 unique items (papers, workshops, tutorials, invited talks, expo events) with embeddings and takes a few minutes to generate
  6. Copy your service URL (e.g., https://neuriscout-backend.onrender.com)

Frontend Deployment (Vercel)

  1. Create a Vercel account at vercel.com

  2. Import your repository:

    • Click "New Project"
    • Import your GitHub repository
    • Root Directory: frontend
    • Framework Preset: Next.js
  3. Configure Environment Variables:

    • Add NEXT_PUBLIC_API_URL with your Render backend URL
    • Example: https://neuriscout-backend.onrender.com
  4. Update CORS on Backend:

    • Go back to Render dashboard
    • Update ALLOWED_ORIGINS to include your Vercel URL
    • Example: https://neuriscout.vercel.app,https://neuriscout-*.vercel.app
  5. Deploy!

    • Vercel will automatically deploy
    • Your app will be live at https://your-app.vercel.app
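ALLOWED_ORIGINS above is a comma-separated list of origins the backend should accept. A minimal sketch of how such a value is typically parsed before being handed to FastAPI's CORS middleware (the exact handling in backend/main.py may differ; this is an assumption for illustration):

```python
import os

def parse_allowed_origins(raw: str) -> list[str]:
    # Split the comma-separated env value and drop empty entries and whitespace.
    return [o.strip() for o in raw.split(",") if o.strip()]

origins = parse_allowed_origins(
    os.environ.get("ALLOWED_ORIGINS", "http://localhost:3000")
)
print(origins)

# The resulting list would then be passed to CORSMiddleware, e.g.:
# app.add_middleware(CORSMiddleware, allow_origins=origins, ...)
```

Trimming whitespace matters here: a value like "https://a.vercel.app, https://b.vercel.app" would otherwise produce an origin with a leading space that never matches.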

Important Notes:

  • Cold Starts: Render's free tier sleeps after 15 minutes of inactivity. First request takes ~30 seconds to wake up.
  • API Keys: You can set API keys as environment variables on Render, or users can enter them in the UI.
  • ChromaDB Database:
    • The vector database is NOT in the repository (excluded via .gitignore)
    • Generate it locally with neuriscout-ingest before deploying
    • Upload to the persistent disk after deployment (via Render Shell or manual transfer)
    • The database is ~90MB and contains embeddings for ~5,500 items (papers + events + expo)
  • Custom Domain: Both Vercel and Render support custom domains for free.

Alternative: Deploy to Railway

Railway offers $5 free credit per month and simpler deployment:

  1. Create Railway account at railway.app
  2. Deploy from GitHub:
    • New Project → Deploy from GitHub
    • Select your repository
  3. Set environment variables:
    HOST=0.0.0.0
    ALLOWED_ORIGINS=*
    CHROMA_DB_PATH=/app/chroma_db
    
  4. Upload ChromaDB:
    • Generate locally: neuriscout-ingest
    • Upload using Railway CLI or create a persistent volume and transfer files
    • The chroma_db/ directory (~90MB) must be accessible at the path set in CHROMA_DB_PATH

Note: Railway paid tier ($5/month) is recommended for reliable hosting with better resources.

Railway Admin Endpoints

The backend includes two admin endpoints for managing deployments:

GET /admin/status - Diagnostic information:

  • Base directory and ChromaDB path
  • Data file existence and sizes (CSV files)
  • Collection status (exists, item count)
  • Example: curl https://your-app.railway.app/admin/status

POST /admin/reingest - Manual data ingestion:

  • Runs the ingest process to populate ChromaDB
  • Returns stdout/stderr from the ingest process
  • Takes several minutes to complete (~5,500 items)
  • Example: curl -X POST https://your-app.railway.app/admin/reingest

Note: The /admin/reingest endpoint runs synchronously and blocks the API during execution. For large datasets, use SSH to run the ingest in the background.
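Since ingest takes several minutes, a small helper can decide from the /admin/status response whether the collection is fully populated. A sketch of that check (the JSON field names here are assumptions; inspect the endpoint's actual response before relying on them):

```python
def ingest_complete(status: dict, target: int = 5500, tolerance: float = 0.95) -> bool:
    # Hypothetical response shape: {"collection": {"exists": true, "count": 5480}}.
    # Treat ingest as complete once the count is within tolerance of the target.
    collection = status.get("collection", {})
    return bool(collection.get("exists")) and collection.get("count", 0) >= target * tolerance

# To use this against a deployment, fetch the JSON first, e.g. with urllib:
#   import json, urllib.request
#   status = json.load(urllib.request.urlopen("https://your-app.railway.app/admin/status"))
print(ingest_complete({"collection": {"exists": True, "count": 5480}}))  # complete
print(ingest_complete({"collection": {"exists": True, "count": 1200}}))  # still ingesting
```

Polling this in a loop (with a sleep between requests) avoids hammering the API while the synchronous reingest is running.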

Re-running Ingest via Railway SSH

If the automatic ingestion fails or you need to re-populate the database:

  1. Install Railway CLI (if not already installed):

    npm i -g @railway/cli
    # or
    brew install railway
  2. Authenticate:

    railway login
  3. Copy SSH command from Railway Dashboard:

    • Navigate to your project in the Railway dashboard
    • Right-click on your service
    • Select "Copy SSH Command" from the dropdown menu
    • This generates a command like:
      railway ssh --project=<project-id> --environment=<env-id> --service=<service-id>
  4. Connect and run ingest:

    # Connect using the copied SSH command
    railway ssh --project=<project-id> --environment=<env-id> --service=<service-id>
    
    # Inside the SSH session, run ingest in the background
    python -m backend.ingest > /app/ingest.log 2>&1 &
    
    # Exit the SSH session
    exit
  5. Monitor progress:

    # Check collection count via the status endpoint
    curl https://your-app.railway.app/admin/status | grep count
    
    # Or reconnect via SSH to check the log
    railway ssh --project=<project-id> --environment=<env-id> --service=<service-id>
    tail -f /app/ingest.log

The ingest process generates embeddings for ~5,500 unique items and takes several minutes. The collection count should reach ~5,500 when complete.

Alternative: Run a single command without an interactive session:

railway ssh --project=<project-id> --environment=<env-id> --service=<service-id> -- python -m backend.ingest

For more details on Railway SSH, see the Railway CLI SSH documentation.

Post-deploy: Populate NeurIPS Links

After merging to main, Railway will auto-deploy. To enable the new "View on NeurIPS" links, re-run the ingest so the ChromaDB contains the neurips_virtualsite_url metadata:

  1. Connect via Railway SSH (copy command from Dashboard):
railway ssh --project=<project-id> --environment=<env-id> --service=<service-id>
  2. Run ingest in the container:
python -m backend.ingest > /app/ingest.log 2>&1 &
  3. Monitor progress:
curl https://your-app.railway.app/admin/status | grep count
# or
tail -f /app/ingest.log

Target count is ~5,500 items. Once complete, the frontend shows both "View on NeurIPS" and "Paper" buttons.

Technology Stack

Backend:

  • FastAPI
  • ChromaDB (vector database)
  • Sentence Transformers (embeddings)
  • OpenAI API / Google Gemini API

Frontend:

  • Next.js 16
  • React
  • TypeScript
  • Tailwind CSS
  • React Markdown + KaTeX

License

MIT License
