SCOUT‑RL (Stepwise Controlled Understanding for Trajectories)
SCOUT‑RL is a research scaffold for building a Reinforcement‑Learning‑from‑AI‑Feedback (RLAIF) pipeline for search/research agents. It runs out‑of‑the‑box with a lightweight local text index and proxy “AI‑judge” signals, and is organized so you can drop in real judges, reward models, and PPO/GRPO training.
What you get
- Minimal but runnable pipeline to generate multi‑step search trajectories.
- Step‑wise proxy rewards for query relevance and information gain.
- Clean module boundaries for reward models, policy training, memory, and search backends.
Core innovations (design targets)
- Multilevel AI‑feedback: dense, step‑wise signals beyond answer‑only rewards.
- Step‑wise objective: combine answer quality, query relevance, information gain, and memory use.
- Memory utilisation: encourage reuse of retrieved knowledge and stop over‑searching.
- Extensible search: pluggable backends (local now; web/multimodal next).
Getting started
- Prereqs: Python 3.9+ (tested on 3.12)
- Install:
pip install -r requirements.txt
Quick demo
- Run a search over the tiny local index:
  python scripts/run_inference.py "Who designed the Eiffel Tower?"
- Generate toy multi‑step trajectories (written to data/processed/trajectories.jsonl):
  python scripts/generate_data.py --config configs/default.yaml
- Peek at outputs or compute simple metrics:
  python scripts/evaluate.py --pred data/processed/trajectories.jsonl
- Train the lightweight RL agent (REINFORCE):
  python scripts/train_policy.py --episodes 80 --max_steps 3 --lr 0.05
  This learns a softmax policy over query templates using step‑wise judge rewards.
- Run the trained agent greedily:
  python scripts/run_agent.py "Who designed the Eiffel Tower?"
Directory layout
- configs/: experiment configuration (YAML).
- rlaif_pipeline/: package modules:
  - config.py (YAML loader, directory setup)
  - search_interfaces/ (local TF‑IDF‑like index for now)
  - data_generation.py (trajectory builder + proxy judges)
  - data_curation.py (filters; hook for preferences)
  - reward_model.py (interfaces; plug your models here)
  - policy.py (policy + RL training stub)
  - memory.py (simple knowledge store)
- scripts/: CLI entry points for data generation, inference, training, and evaluation.
- data/: sample corpus and generated outputs.
Configuration
- Default file: configs/default.yaml
  - project_name: label for runs.
  - seed: random seed.
  - paths: data_dir, processed_dir, models_dir.
  - model: base_model name (placeholder until you wire in a real LLM).
  - reward: weights alpha, beta, gamma, delta for the combined objective.
  - search: backend (currently local) and top_k. To use web search via Serper, set backend: serper and export SERPER_API_KEY.
  - data: sample_corpus path and an example questions list.
  - judge: choose tfidf (CPU‑only, default), heuristic (lexical), or embedding (requires a fastembed model download). All are open source and run on CPU.
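For orientation, here is a minimal sketch of reading these values with plain PyYAML, separate from the project's own config.py loader. The nesting and the default values shown are assumptions based on the key list above.

```python
# Sketch: read reward weights and search settings straight from the YAML file
# with PyYAML; rlaif_pipeline/config.py may load and validate them differently.
import yaml

with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)

reward = cfg.get("reward", {})
alpha = reward.get("alpha", 1.0)   # answer-quality weight (default shown is illustrative)
beta = reward.get("beta", 1.0)     # step query-relevance weight
gamma = reward.get("gamma", 1.0)   # step information-gain weight
delta = reward.get("delta", 1.0)   # memory-use weight

search = cfg.get("search", {})
print(f"backend={search.get('backend', 'local')}, top_k={search.get('top_k', 5)}, "
      f"weights=({alpha}, {beta}, {gamma}, {delta})")
```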
CLI commands
- Demo search (offline local index):
  python scripts/run_inference.py "<question>" --config configs/default.yaml
- Generate trajectories (step‑wise proxies + retrieval):
  python scripts/generate_data.py --config configs/default.yaml
  Outputs JSONL to data/processed/trajectories.jsonl by default.
- Evaluate (optional EM/F1 if you provide gold answers; see the EM/F1 sketch after this list):
  python scripts/evaluate.py --pred data/processed/trajectories.jsonl
  With gold: python scripts/evaluate.py --pred <pred.jsonl> --gold <gold.jsonl>
- Stubs to extend:
  - Train reward model: python scripts/train_reward_model.py --config configs/default.yaml
  - Train policy (PPO/GRPO): python scripts/train_policy.py --config configs/default.yaml
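If you want to sanity‑check what EM/F1 mean here, the snippet below implements the standard SQuAD‑style exact match and token‑level F1. scripts/evaluate.py may normalise answers differently, so treat this as a reference rather than the repo's exact code.

```python
# Sketch: standard exact-match and token-F1 between a prediction and a gold answer.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Gustave Eiffel", "gustave eiffel"))          # 1.0
print(token_f1("the engineer Gustave Eiffel", "Gustave Eiffel"))  # 0.8
```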
Data formats
- Input corpus: data/sample_corpus.jsonl
  - One JSON object per line: {id, title, text, url}
  - Used by the local index for offline search.
- Generated trajectories: data/processed/trajectories.jsonl
  - One JSON object per question with fields:
    - question: original question string
    - steps: list of step objects, each with:
      - query: issued query string
      - query_relevance: proxy score in [0,1]
      - results: list of {doc_id, title, score, summary, url}
      - information_gain: proxy score in [0,1]
    - final_answer: simple heuristic answer (demo only)
    - scores:
      - step_query_relevance: Σ step relevance
      - step_information_gain: Σ step info gain
      - combined_reward_proxy: α·Answer + β·ΣRel + γ·ΣInfo + δ·Memory (Answer/Memory are 0 in the proxy demo; a sketch of this computation follows after this list)
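As noted above, here is a minimal sketch of recomputing combined_reward_proxy from the trajectory file. The weights are placeholders for whatever you set in the reward section of the config, and the answer/memory terms are fixed at 0 because that is what the proxy demo emits.

```python
# Sketch: recompute the combined proxy reward for each trajectory in the JSONL output.
import json

ALPHA, BETA, GAMMA, DELTA = 1.0, 1.0, 1.0, 1.0  # illustrative; take yours from configs/default.yaml

with open("data/processed/trajectories.jsonl") as f:
    for line in f:
        traj = json.loads(line)
        sum_rel = sum(step["query_relevance"] for step in traj["steps"])
        sum_gain = sum(step["information_gain"] for step in traj["steps"])
        answer_quality = 0.0  # 0 in the proxy demo (no real answer judge yet)
        memory_use = 0.0      # 0 in the proxy demo (memory reward not wired in)
        combined = ALPHA * answer_quality + BETA * sum_rel + GAMMA * sum_gain + DELTA * memory_use
        print(traj["question"], round(combined, 3))
```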
Where the important bits live
- Proxy “AI‑judge” relevance: rlaif_pipeline/data_generation.py
- Proxy information gain: rlaif_pipeline/data_generation.py
- Combined objective assembly: rlaif_pipeline/data_generation.py
- Local search backend: rlaif_pipeline/search_interfaces/local.py
- Config helpers and directories: rlaif_pipeline/config.py
- Memory store: rlaif_pipeline/memory.py
- Reward model interface: rlaif_pipeline/reward_model.py
- Policy and RL training hook: rlaif_pipeline/policy.py
Extending the scaffold
- Plug in an LLM‑as‑judge
  - Implement real query_relevance and information_gain scorers.
  - Option A: call a hosted LLM and log scores during data generation.
  - Option B: train a small reward model (see below) and use it during RL.
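To make the interface concrete before swapping in an LLM, here is a lexical stand‑in with the shape such scorers need: map strings to a score in [0, 1]. The function names mirror the bullet above; the actual signatures in data_generation.py may differ.

```python
# Sketch: lexical stand-ins with the same shape as the query_relevance /
# information_gain scorers you would replace with an LLM-as-judge.
def query_relevance(question: str, query: str) -> float:
    """Jaccard overlap between question and query tokens, in [0, 1]."""
    q_tokens, query_tokens = set(question.lower().split()), set(query.lower().split())
    if not q_tokens or not query_tokens:
        return 0.0
    return len(q_tokens & query_tokens) / len(q_tokens | query_tokens)

def information_gain(new_text: str, seen_text: str) -> float:
    """Fraction of tokens in the new results not already seen, in [0, 1]."""
    new_tokens = set(new_text.lower().split())
    if not new_tokens:
        return 0.0
    seen_tokens = set(seen_text.lower().split())
    return len(new_tokens - seen_tokens) / len(new_tokens)

# An LLM-as-judge version would keep these signatures but prompt a model to
# return a calibrated score instead of computing token overlap.
```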
- Train reward models
  - Use curated trajectories/preference pairs to train a step‑level model that predicts relevance + info‑gain, and optionally an answer‑quality model.
  - Wire your trainer in scripts/train_reward_model.py and export to models/.
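One common recipe is Bradley–Terry-style pairwise training: the reward model should score the preferred step above the rejected one. Below is a NumPy-only sketch on toy feature vectors; how you featurise steps and where the preference pairs come from (LLM judge or human labels) is up to your curation pipeline.

```python
# Sketch: pairwise (Bradley–Terry) reward-model training on toy step features.
# Loss per pair: -log sigmoid(r(chosen) - r(rejected)), with a linear scorer r(x) = w·x.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
w = np.zeros(dim)

# Toy preference pairs: (chosen_features, rejected_features). Replace with real
# features extracted from curated trajectories.
pairs = [(rng.normal(size=dim) + 0.5, rng.normal(size=dim)) for _ in range(256)]

lr = 0.1
for epoch in range(50):
    loss = 0.0
    for chosen, rejected in pairs:
        margin = w @ (chosen - rejected)
        p = 1.0 / (1.0 + np.exp(-margin))      # P(chosen preferred over rejected)
        loss += -np.log(p + 1e-12)
        w += lr * (1.0 - p) * (chosen - rejected) / len(pairs)  # ascend the pairwise log-likelihood
    if epoch % 10 == 0:
        print(f"epoch {epoch}: avg pairwise loss {loss / len(pairs):.3f}")
```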
- Add PPO/GRPO training
  - Implement the optimisation loop in rlaif_pipeline/policy.py and scripts/train_policy.py, consuming step‑wise rewards from the reward model.
  - Optimise the weighted objective controlled by alpha, beta, gamma, delta.
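For PPO you would add a full actor‑critic loop, but the GRPO‑specific ingredient is easy to sketch: sample a group of trajectories per question and normalise their combined rewards within the group to get advantages, with no value network. The snippet assumes you already have per‑trajectory rewards from a reward model or the proxies.

```python
# Sketch: GRPO-style group-relative advantages. For each question, sample G
# trajectories, score each with the combined reward, then normalise within the
# group; the normalised values weight the policy-gradient update.
import numpy as np

def group_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. combined rewards for G = 4 sampled trajectories on one question
print(group_advantages([2.1, 1.4, 2.9, 0.8]))
```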
- Replace/extend search backends
  - Keep the LocalSearchIndex as a dev fallback; add a web search or vector store backend under rlaif_pipeline/search_interfaces/ and toggle via config.
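The base interface is not spelled out in this README, so the following is only one possible shape, with hypothetical class names: a backend exposes search(query, top_k) returning result dicts in the same form as the trajectory format above, and search.backend in the config selects which implementation to instantiate.

```python
# Sketch: one possible backend shape (hypothetical names); the real interface
# lives in rlaif_pipeline/search_interfaces/ and may differ.
from typing import Protocol

class SearchBackend(Protocol):
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Return result dicts with doc_id, title, score, summary, url."""
        ...

class EchoBackend:
    """Stand-in that returns a single empty-handed result; replace with a web or vector-store client."""
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        return [{"doc_id": "demo-0", "title": query, "score": 0.0,
                 "summary": f"No real retrieval wired in for: {query}", "url": ""}][:top_k]
```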
- Memory reward and reuse
  - Use rlaif_pipeline/memory.py to cache high‑value snippets.
  - Give positive reward when the agent answers using memory instead of re‑searching.
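A minimal sketch of that idea, using a plain dict in place of the actual memory.py store (whose API may differ): check memory before issuing a search and add a bonus to the step reward when the snippet is already cached.

```python
# Sketch: reward memory reuse instead of re-searching (simplified dict store;
# the real store is rlaif_pipeline/memory.py).
MEMORY_BONUS = 0.5  # illustrative value for the delta-weighted memory term

def answer_step(question: str, memory: dict, search_fn) -> tuple[str, float]:
    if question in memory:
        return memory[question], MEMORY_BONUS           # reused cached knowledge: positive reward
    results = search_fn(question)
    snippet = results[0]["summary"] if results else ""
    memory[question] = snippet                           # cache the high-value snippet for later steps
    return snippet, 0.0                                  # no memory bonus for a fresh search
```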
Testing and dev tips
- Optional tests: install pytest, then run python -m pytest.
- Edit configs/default.yaml to change the corpus, questions, and reward weights.
- Add your own corpus by writing JSONL lines with {id, title, text, url} (see the sketch below).
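A quick way to build such a file; the output path and the example documents below are made up, but the per‑line fields match the format listed above.

```python
# Sketch: write a custom corpus in the expected JSONL shape.
import json

docs = [
    {"id": "doc-1", "title": "Eiffel Tower",
     "text": "The Eiffel Tower was designed by the engineering firm of Gustave Eiffel.",
     "url": "https://en.wikipedia.org/wiki/Eiffel_Tower"},
    {"id": "doc-2", "title": "Statue of Liberty",
     "text": "Gustave Eiffel's company also engineered the internal structure of the Statue of Liberty.",
     "url": "https://en.wikipedia.org/wiki/Statue_of_Liberty"},
]

with open("data/my_corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
# Then point data.sample_corpus in configs/default.yaml at the new file.
```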
Limitations of the demo
- The “AI‑judge” scores are lexical proxies; swap them for real LLM or learned reward models before running serious experiments.
- train_reward_model.py and train_policy.py are stubs by design.
License
- See LICENSE.