SCOUT‑RL (Stepwise Controlled Understanding for Trajectories)
SCOUT‑RL is a research scaffold for building a Reinforcement‑Learning‑from‑AI‑Feedback (RLAIF) pipeline for search/research agents. It runs out‑of‑the‑box with a lightweight local text index and proxy “AI‑judge” signals, and is organized so you can drop in real judges, reward models, and PPO/GRPO training.
What you get
- Minimal but runnable pipeline to generate multi‑step search trajectories.
- Step‑wise proxy rewards for query relevance and information gain.
- Clean module boundaries for reward models, policy training, memory, and search backends.
Core innovations (design targets)
- Multilevel AI‑feedback: dense, step‑wise signals beyond answer‑only rewards.
- Step‑wise objective: combine answer quality, query relevance, information gain, and memory use.
- Memory utilisation: encourage reuse of retrieved knowledge and stop over‑searching.
- Extensible search: pluggable backends (local now; web/multimodal next).
Getting started
- Prereqs: Python 3.9+ (tested on 3.12)
- Install:
pip install -r requirements.txt
Quick demo
- Run a search over the tiny local index:
  python scripts/run_inference.py "Who designed the Eiffel Tower?"
- Generate toy multi‑step trajectories (written to data/processed/trajectories.jsonl):
  python scripts/generate_data.py --config configs/default.yaml
- Peek at outputs or compute simple metrics:
  python scripts/evaluate.py --pred data/processed/trajectories.jsonl
- Train the lightweight RL agent (REINFORCE):
  python scripts/train_policy.py --episodes 80 --max_steps 3 --lr 0.05
  This learns a softmax policy over query templates using step‑wise judge rewards.
- Run the trained agent greedily:
  python scripts/run_agent.py "Who designed the Eiffel Tower?"
Directory layout
- configs/: experiment configuration (YAML).
- rlaif_pipeline/: package modules:
  - config.py (YAML loader, directory setup)
  - search_interfaces/ (local TF‑IDF‑like index for now)
  - data_generation.py (trajectory builder + proxy judges)
  - data_curation.py (filters; hook for preferences)
  - reward_model.py (interfaces; plug your models here)
  - policy.py (policy + RL training stub)
  - memory.py (simple knowledge store)
- scripts/: CLI entry points for data generation, inference, training, and evaluation.
- data/: sample corpus and generated outputs.
Configuration
- Default file: configs/default.yaml
  - project_name: label for runs.
  - seed: random seed.
  - paths: data_dir, processed_dir, models_dir.
  - model: base_model name (placeholder until you wire in a real LLM).
  - reward: weights alpha, beta, gamma, delta for the combined objective.
  - search: backend (currently local) and top_k. To use web search via Serper, set backend: serper and export SERPER_API_KEY.
  - data: sample_corpus path and an example questions list.
  - judge: choose tfidf (CPU‑only, default), heuristic (lexical), or embedding (requires a fastembed model download). All are open source and run on CPU.
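For orientation, here is a minimal sketch of reading these values with plain PyYAML, separate from the project's own config.py loader. The nesting and the default values shown are assumptions based on the key list above.

```python
# Sketch: read reward weights and search settings straight from the YAML file
# with PyYAML; rlaif_pipeline/config.py may load and validate them differently.
import yaml

with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)

reward = cfg.get("reward", {})
alpha = reward.get("alpha", 1.0)   # answer-quality weight (default shown is illustrative)
beta = reward.get("beta", 1.0)     # step query-relevance weight
gamma = reward.get("gamma", 1.0)   # step information-gain weight
delta = reward.get("delta", 1.0)   # memory-use weight

search = cfg.get("search", {})
print(f"backend={search.get('backend', 'local')}, top_k={search.get('top_k', 5)}, "
      f"weights=({alpha}, {beta}, {gamma}, {delta})")
```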
CLI commands
- Demo search (offline local index):
  python scripts/run_inference.py "<question>" --config configs/default.yaml
- Generate trajectories (step‑wise proxies + retrieval):
  python scripts/generate_data.py --config configs/default.yaml
  Outputs JSONL to data/processed/trajectories.jsonl by default.
- Evaluate (optional EM/F1 if you provide gold answers; see the EM/F1 sketch after this list):
  python scripts/evaluate.py --pred data/processed/trajectories.jsonl
  With gold: python scripts/evaluate.py --pred <pred.jsonl> --gold <gold.jsonl>
- Stubs to extend:
  - Train reward model: python scripts/train_reward_model.py --config configs/default.yaml
  - Train policy (PPO/GRPO): python scripts/train_policy.py --config configs/default.yaml
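If you want to sanity‑check what EM/F1 mean here, the snippet below implements the standard SQuAD‑style exact match and token‑level F1. scripts/evaluate.py may normalise answers differently, so treat this as a reference rather than the repo's exact code.

```python
# Sketch: standard exact-match and token-F1 between a prediction and a gold answer.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Gustave Eiffel", "gustave eiffel"))          # 1.0
print(token_f1("the engineer Gustave Eiffel", "Gustave Eiffel"))  # 0.8
```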
Data formats
- Input corpus: data/sample_corpus.jsonl
  - One JSON object per line: {id, title, text, url}
  - Used by the local index for offline search.
- Generated trajectories: data/processed/trajectories.jsonl
  - One JSON object per question with fields:
    - question: original question string
    - steps: list of step objects, each with:
      - query: issued query string
      - query_relevance: proxy score in [0,1]
      - results: list of {doc_id, title, score, summary, url}
      - information_gain: proxy score in [0,1]
    - final_answer: simple heuristic answer (demo only)
    - scores:
      - step_query_relevance: Σ step relevance
      - step_information_gain: Σ step info gain
      - combined_reward_proxy: α·Answer + β·ΣRel + γ·ΣInfo + δ·Memory (Answer/Memory are 0 in the proxy demo; a sketch of this computation follows after this list)
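As noted above, here is a minimal sketch of recomputing combined_reward_proxy from the trajectory file. The weights are placeholders for whatever you set in the reward section of the config, and the answer/memory terms are fixed at 0 because that is what the proxy demo emits.

```python
# Sketch: recompute the combined proxy reward for each trajectory in the JSONL output.
import json

ALPHA, BETA, GAMMA, DELTA = 1.0, 1.0, 1.0, 1.0  # illustrative; take yours from configs/default.yaml

with open("data/processed/trajectories.jsonl") as f:
    for line in f:
        traj = json.loads(line)
        sum_rel = sum(step["query_relevance"] for step in traj["steps"])
        sum_gain = sum(step["information_gain"] for step in traj["steps"])
        answer_quality = 0.0  # 0 in the proxy demo (no real answer judge yet)
        memory_use = 0.0      # 0 in the proxy demo (memory reward not wired in)
        combined = ALPHA * answer_quality + BETA * sum_rel + GAMMA * sum_gain + DELTA * memory_use
        print(traj["question"], round(combined, 3))
```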
Where the important bits live
- Proxy “AI‑judge” relevance: rlaif_pipeline/data_generation.py
- Proxy information gain: rlaif_pipeline/data_generation.py
- Combined objective assembly: rlaif_pipeline/data_generation.py
- Local search backend: rlaif_pipeline/search_interfaces/local.py
- Config helpers and directories: rlaif_pipeline/config.py
- Memory store: rlaif_pipeline/memory.py
- Reward model interface: rlaif_pipeline/reward_model.py
- Policy and RL training hook: rlaif_pipeline/policy.py
Extending the scaffold
- Plug in an LLM‑as‑judge
  - Implement real query_relevance and information_gain scorers.
  - Option A: call a hosted LLM and log scores during data generation.
  - Option B: train a small reward model (see below) and use it during RL.
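To make the interface concrete before swapping in an LLM, here is a lexical stand‑in with the shape such scorers need: map strings to a score in [0, 1]. The function names mirror the bullet above; the actual signatures in data_generation.py may differ.

```python
# Sketch: lexical stand-ins with the same shape as the query_relevance /
# information_gain scorers you would replace with an LLM-as-judge.
def query_relevance(question: str, query: str) -> float:
    """Jaccard overlap between question and query tokens, in [0, 1]."""
    q_tokens, query_tokens = set(question.lower().split()), set(query.lower().split())
    if not q_tokens or not query_tokens:
        return 0.0
    return len(q_tokens & query_tokens) / len(q_tokens | query_tokens)

def information_gain(new_text: str, seen_text: str) -> float:
    """Fraction of tokens in the new results not already seen, in [0, 1]."""
    new_tokens = set(new_text.lower().split())
    if not new_tokens:
        return 0.0
    seen_tokens = set(seen_text.lower().split())
    return len(new_tokens - seen_tokens) / len(new_tokens)

# An LLM-as-judge version would keep these signatures but prompt a model to
# return a calibrated score instead of computing token overlap.
```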
- Train reward models
  - Use curated trajectories/preference pairs to train a step‑level model that predicts relevance + info‑gain, and optionally an answer‑quality model.
  - Wire your trainer in scripts/train_reward_model.py and export to models/.
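One common recipe is Bradley–Terry-style pairwise training: the reward model should score the preferred step above the rejected one. Below is a NumPy-only sketch on toy feature vectors; how you featurise steps and where the preference pairs come from (LLM judge or human labels) is up to your curation pipeline.

```python
# Sketch: pairwise (Bradley–Terry) reward-model training on toy step features.
# Loss per pair: -log sigmoid(r(chosen) - r(rejected)), with a linear scorer r(x) = w·x.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
w = np.zeros(dim)

# Toy preference pairs: (chosen_features, rejected_features). Replace with real
# features extracted from curated trajectories.
pairs = [(rng.normal(size=dim) + 0.5, rng.normal(size=dim)) for _ in range(256)]

lr = 0.1
for epoch in range(50):
    loss = 0.0
    for chosen, rejected in pairs:
        margin = w @ (chosen - rejected)
        p = 1.0 / (1.0 + np.exp(-margin))      # P(chosen preferred over rejected)
        loss += -np.log(p + 1e-12)
        w += lr * (1.0 - p) * (chosen - rejected) / len(pairs)  # ascend the pairwise log-likelihood
    if epoch % 10 == 0:
        print(f"epoch {epoch}: avg pairwise loss {loss / len(pairs):.3f}")
```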
- Add PPO/GRPO training
  - Implement the optimisation loop in rlaif_pipeline/policy.py and scripts/train_policy.py, consuming step‑wise rewards from the reward model.
  - Optimise the weighted objective controlled by alpha, beta, gamma, delta.
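For PPO you would add a full actor‑critic loop, but the GRPO‑specific ingredient is easy to sketch: sample a group of trajectories per question and normalise their combined rewards within the group to get advantages, with no value network. The snippet assumes you already have per‑trajectory rewards from a reward model or the proxies.

```python
# Sketch: GRPO-style group-relative advantages. For each question, sample G
# trajectories, score each with the combined reward, then normalise within the
# group; the normalised values weight the policy-gradient update.
import numpy as np

def group_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. combined rewards for G = 4 sampled trajectories on one question
print(group_advantages([2.1, 1.4, 2.9, 0.8]))
```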
- Replace/extend search backends
  - Keep the LocalSearchIndex as a dev fallback; add a web search or vector store backend under rlaif_pipeline/search_interfaces/ and toggle via config.
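The base interface is not spelled out in this README, so the following is only one possible shape, with hypothetical class names: a backend exposes search(query, top_k) returning result dicts in the same form as the trajectory format above, and search.backend in the config selects which implementation to instantiate.

```python
# Sketch: one possible backend shape (hypothetical names); the real interface
# lives in rlaif_pipeline/search_interfaces/ and may differ.
from typing import Protocol

class SearchBackend(Protocol):
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Return result dicts with doc_id, title, score, summary, url."""
        ...

class EchoBackend:
    """Stand-in that returns a single empty-handed result; replace with a web or vector-store client."""
    def search(self, query: str, top_k: int = 5) -> list[dict]:
        return [{"doc_id": "demo-0", "title": query, "score": 0.0,
                 "summary": f"No real retrieval wired in for: {query}", "url": ""}][:top_k]
```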
- Memory reward and reuse
  - Use rlaif_pipeline/memory.py to cache high‑value snippets.
  - Give positive reward when the agent answers using memory instead of re‑searching.
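A minimal sketch of that idea, using a plain dict in place of the actual memory.py store (whose API may differ): check memory before issuing a search and add a bonus to the step reward when the snippet is already cached.

```python
# Sketch: reward memory reuse instead of re-searching (simplified dict store;
# the real store is rlaif_pipeline/memory.py).
MEMORY_BONUS = 0.5  # illustrative value for the delta-weighted memory term

def answer_step(question: str, memory: dict, search_fn) -> tuple[str, float]:
    if question in memory:
        return memory[question], MEMORY_BONUS           # reused cached knowledge: positive reward
    results = search_fn(question)
    snippet = results[0]["summary"] if results else ""
    memory[question] = snippet                           # cache the high-value snippet for later steps
    return snippet, 0.0                                  # no memory bonus for a fresh search
```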
Testing and dev tips
- Optional tests: install pytest, then run python -m pytest.
- Edit configs/default.yaml to change the corpus, questions, and reward weights.
- Add your own corpus by writing JSONL lines with {id, title, text, url} (see the sketch below).
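A quick way to build such a file; the output path and the example documents below are made up, but the per‑line fields match the format listed above.

```python
# Sketch: write a custom corpus in the expected JSONL shape.
import json

docs = [
    {"id": "doc-1", "title": "Eiffel Tower",
     "text": "The Eiffel Tower was designed by the engineering firm of Gustave Eiffel.",
     "url": "https://en.wikipedia.org/wiki/Eiffel_Tower"},
    {"id": "doc-2", "title": "Statue of Liberty",
     "text": "Gustave Eiffel's company also engineered the internal structure of the Statue of Liberty.",
     "url": "https://en.wikipedia.org/wiki/Statue_of_Liberty"},
]

with open("data/my_corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
# Then point data.sample_corpus in configs/default.yaml at the new file.
```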
Limitations of the demo
- The “AI‑judge” scores are lexical proxies; swap them for real LLM or learned reward models before running serious experiments.
- train_reward_model.py and train_policy.py are stubs by design.
License
- See LICENSE.