SCOUT-RL

Stepwise Controlled Understanding for Trajectories: an agent that learns to hunt.

SCOUT‑RL is a research scaffold for building a Reinforcement‑Learning‑from‑AI‑Feedback (RLAIF) pipeline for search/research agents. It runs out‑of‑the‑box with a lightweight local text index and proxy “AI‑judge” signals, and is organized so you can drop in real judges, reward models, and PPO/GRPO training.

What you get

  • Minimal but runnable pipeline to generate multi‑step search trajectories.
  • Step‑wise proxy rewards for query relevance and information gain.
  • Clean module boundaries for reward models, policy training, memory, and search backends.

Core innovations (design targets)

  • Multi‑level AI feedback: dense, step‑wise signals beyond answer‑only rewards.
  • Step‑wise objective: combine answer quality, query relevance, information gain, and memory use (see the formula after this list).
  • Memory utilisation: encourage reuse of retrieved knowledge and stop over‑searching.
  • Extensible search: pluggable backends (local now; web/multimodal next).
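
Concretely, the combined objective is a weighted sum; the weights correspond to alpha, beta, gamma, delta in configs/default.yaml, and the bundled proxy demo sets the answer‑quality and memory terms to zero:

    R = α·AnswerQuality + β·Σ QueryRelevance + γ·Σ InformationGain + δ·MemoryUse   (sums over steps)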

Getting started

  • Prereqs: Python 3.9+ (tested on 3.12)
  • Install: pip install -r requirements.txt

Quick demo

  1. Run a search over the tiny local index:
     python scripts/run_inference.py "Who designed the Eiffel Tower?"

  2. Generate toy multi‑step trajectories (written to data/processed/trajectories.jsonl):
     python scripts/generate_data.py --config configs/default.yaml

  3. Peek at outputs or compute simple metrics:
     python scripts/evaluate.py --pred data/processed/trajectories.jsonl

  4. Train the lightweight RL agent (REINFORCE):
     python scripts/train_policy.py --episodes 80 --max_steps 3 --lr 0.05
     This learns a softmax policy over query templates using step‑wise judge rewards; a minimal sketch of the idea follows this list.

  5. Run the trained agent greedily:
     python scripts/run_agent.py "Who designed the Eiffel Tower?"
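
The sketch below illustrates what step 4 does in miniature, under stated assumptions: a softmax policy over a small hypothetical list of query templates, updated with episodic REINFORCE using a stand‑in step‑wise judge reward. Names such as TEMPLATES and judge_reward are illustrative and are not the actual API in rlaif_pipeline/policy.py.

    # Minimal REINFORCE sketch: softmax policy over query templates, updated with
    # a summed step-wise judge reward. Illustrative only; the real trainer lives
    # in rlaif_pipeline/policy.py and scripts/train_policy.py.
    import numpy as np

    TEMPLATES = ["{q}", "{q} history", "{q} designer biography"]  # hypothetical templates

    def judge_reward(query: str) -> float:
        """Stand-in for the step-wise judge (relevance + information gain)."""
        return min(len(query.split()) / 8.0, 1.0)  # toy proxy: favour specific queries

    def train(episodes: int = 80, max_steps: int = 3, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        logits = np.zeros(len(TEMPLATES))        # learnable policy parameters
        question = "Who designed the Eiffel Tower?"
        for _ in range(episodes):
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                  # softmax over templates
            grad = np.zeros_like(logits)
            episode_return = 0.0
            for _ in range(max_steps):
                action = rng.choice(len(TEMPLATES), p=probs)
                query = TEMPLATES[action].format(q=question)
                episode_return += judge_reward(query)           # accumulate step-wise reward
                grad += np.eye(len(TEMPLATES))[action] - probs  # grad of log-softmax prob
            logits += lr * episode_return * grad                # REINFORCE update
        return logits

    print(train())

Swapping judge_reward for the proxy judges in data_generation.py, or for a learned reward model, recovers the intended pipeline.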

Directory layout

  • configs/ — experiment configuration (YAML).
  • rlaif_pipeline/ — package modules:
    • config.py (YAML loader, directory setup)
    • search_interfaces/ (local TF‑IDF‑like index now)
    • data_generation.py (trajectory builder + proxy judges)
    • data_curation.py (filters; hook for preferences)
    • reward_model.py (interfaces; plug your models here)
    • policy.py (policy + RL training stub)
    • memory.py (simple knowledge store)
  • scripts/ — CLI entry points for data/inference/train/eval.
  • data/ — sample corpus and generated outputs.

Configuration

  • Default file: configs/default.yaml (a minimal loading sketch follows this list)
    • project_name: label for runs.
    • seed: random seed.
    • paths: data_dir, processed_dir, models_dir.
    • model: base_model name (placeholder until you wire a real LLM).
    • reward: weights alpha, beta, gamma, delta for the combined objective.
    • search: backend (currently local) and top_k.
      • To use web search via Serper: set backend: serper and export SERPER_API_KEY.
    • data: sample_corpus path and an example questions list.
    • judge: choose tfidf (CPU-only, default), heuristic (lexical), or embedding (requires fastembed model download). All are open-source and run on CPU.
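
The snippet below is a minimal sketch of reading these keys with PyYAML; the nesting is assumed from the layout above rather than taken from rlaif_pipeline/config.py, which also handles directory setup.

    # Read configs/default.yaml directly; key nesting is assumed from the list above.
    import yaml

    with open("configs/default.yaml") as f:
        cfg = yaml.safe_load(f)

    print(cfg["project_name"], cfg["seed"])
    print("search:", cfg["search"]["backend"], "top_k =", cfg["search"]["top_k"])
    weights = cfg["reward"]
    print("reward weights:", weights["alpha"], weights["beta"], weights["gamma"], weights["delta"])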

CLI commands

  • Demo search (offline local index):
    python scripts/run_inference.py "<question>" --config configs/default.yaml

  • Generate trajectories (step‑wise proxies + retrieval):
    python scripts/generate_data.py --config configs/default.yaml
    Outputs JSONL to data/processed/trajectories.jsonl by default.

  • Evaluate (optional EM/F1 if you provide gold answers):
    python scripts/evaluate.py --pred data/processed/trajectories.jsonl
    With gold: python scripts/evaluate.py --pred <pred.jsonl> --gold <gold.jsonl>

  • Stubs to extend

    • Train reward model: python scripts/train_reward_model.py --config configs/default.yaml
    • Train policy (PPO/GRPO): python scripts/train_policy.py --config configs/default.yaml

Data formats

  • Input corpus: data/sample_corpus.jsonl

    • One JSON per line: {id, title, text, url}
    • Used by the local index for offline search.
  • Generated trajectories: data/processed/trajectories.jsonl

    • One JSON per question with fields:
      • question: original question string
      • steps: list of step objects
        • query: issued query string
        • query_relevance: proxy score in [0,1]
        • results: list of {doc_id, title, score, summary, url}
        • information_gain: proxy score in [0,1]
      • final_answer: simple heuristic answer (demo only)
      • scores:
        • step_query_relevance: Σ step relevance
        • step_information_gain: Σ step info gain
        • combined_reward_proxy: α·Answer + β·ΣRel + γ·ΣInfo + δ·Memory (Answer/Memory are 0 in proxy demo)
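
As a quick sanity check, the snippet below reads the generated file and recombines the per‑step fields into the proxy objective; the weight values are illustrative stand‑ins for alpha, beta, gamma, delta from the config, and the answer/memory terms are zero as in the demo.

    # Recompute the combined proxy reward from the documented trajectory fields.
    import json

    ALPHA, BETA, GAMMA, DELTA = 1.0, 0.5, 0.5, 0.2   # illustrative weights

    with open("data/processed/trajectories.jsonl") as f:
        for line in f:
            traj = json.loads(line)
            rel = sum(step["query_relevance"] for step in traj["steps"])
            gain = sum(step["information_gain"] for step in traj["steps"])
            # Answer-quality and memory terms are 0 in the proxy demo.
            combined = ALPHA * 0.0 + BETA * rel + GAMMA * gain + DELTA * 0.0
            print(traj["question"], "->", round(combined, 3))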

Where the important bits live

  • Proxy “AI‑judge” relevance: rlaif_pipeline/data_generation.py
  • Proxy information gain: rlaif_pipeline/data_generation.py
  • Combined objective assembly: rlaif_pipeline/data_generation.py
  • Local search backend: rlaif_pipeline/search_interfaces/local.py
  • Config helpers and directories: rlaif_pipeline/config.py
  • Memory store: rlaif_pipeline/memory.py
  • Reward model interface: rlaif_pipeline/reward_model.py
  • Policy and RL training hook: rlaif_pipeline/policy.py

Extending the scaffold

  • Plug in an LLM‑as‑judge

    • Implement real query_relevance and information_gain scorers (a hedged sketch follows this list).
    • Option A: call a hosted LLM and log scores during data generation.
    • Option B: train a small reward model (see below) and use it during RL.
  • Train reward models

    • Use curated trajectories/preference pairs to train a step‑level model that predicts relevance + info‑gain, and optionally an answer‑quality model.
    • Wire your trainer in scripts/train_reward_model.py and export to models/.
  • Add PPO/GRPO training

    • Implement the optimisation loop in rlaif_pipeline/policy.py and scripts/train_policy.py, consuming step‑wise rewards from the reward model.
    • Optimise the weighted objective controlled by alpha, beta, gamma, delta.
  • Replace/extend search backends

    • Keep the LocalSearchIndex as a dev fallback; add a web search or vector store backend under rlaif_pipeline/search_interfaces/ and toggle via config.
  • Memory reward and reuse

    • Use rlaif_pipeline/memory.py to cache high‑value snippets.
    • Give positive reward when the agent answers using memory instead of re‑searching.
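
As a starting point for the LLM‑as‑judge item above, here is a hedged sketch of step‑wise scorers: call_llm is a placeholder for whatever hosted‑LLM client you choose, and the prompts, signatures, and parsing are illustrative rather than the interface actually used in rlaif_pipeline/data_generation.py.

    # Sketch of LLM-as-judge scorers for query relevance and information gain.
    # call_llm() is a placeholder for a hosted-LLM client; prompts, signatures,
    # and parsing are illustrative, not the repository's actual interface.
    from typing import List

    def call_llm(prompt: str) -> str:
        """Stand-in for a hosted LLM call; wire up your client of choice here."""
        raise NotImplementedError

    def _score(prompt: str) -> float:
        """Ask the judge for a number and clamp it to [0, 1]."""
        try:
            return max(0.0, min(1.0, float(call_llm(prompt).strip())))
        except ValueError:
            return 0.0  # fall back if the judge's reply is unparsable

    def query_relevance(question: str, query: str) -> float:
        return _score(
            "Rate from 0 to 1 how relevant this search query is to the question.\n"
            f"Question: {question}\nQuery: {query}\nReply with a single number."
        )

    def information_gain(seen_snippets: List[str], new_snippets: List[str]) -> float:
        seen = "\n".join(seen_snippets) or "(none)"
        new = "\n".join(new_snippets)
        return _score(
            "Rate from 0 to 1 how much new information the new snippets add "
            "beyond what was already retrieved. Reply with a single number.\n"
            f"Already retrieved:\n{seen}\nNew snippets:\n{new}"
        )

Option B in the list is analogous: swap call_llm for a forward pass of a small trained reward model and keep the same clamping to [0, 1].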

Testing and dev tips

  • Optional tests: install pytest, then run python -m pytest.
  • Edit configs/default.yaml to change corpus/questions and reward weights.
  • Add your own corpus by writing JSONL lines with {id,title,text,url}.
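
For the last tip, a tiny sketch of writing such a corpus; the document contents and output path are placeholders.

    # Write a custom corpus as JSONL with the documented {id, title, text, url} fields.
    import json

    docs = [
        {
            "id": "doc-1",
            "title": "Eiffel Tower",
            "text": "The Eiffel Tower was built for the 1889 World's Fair by Gustave Eiffel's engineering company.",
            "url": "https://en.wikipedia.org/wiki/Eiffel_Tower",
        },
    ]

    with open("data/sample_corpus.jsonl", "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")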

Limitations of the demo

  • The “AI‑judge” scores are lexical proxies; swap them for real LLM or learned reward models before running serious experiments.
  • train_reward_model.py and train_policy.py are stubs by design.

License

  • See LICENSE.
