JanusDoc Evaluation Suite

Automated evaluation system for JanusDoc using Evalite.

Overview

This repository contains eight realistic test scenarios that measure how well JanusDoc suggests documentation updates in response to code changes. Each scenario is a separate PR whose code changes either should or should not trigger documentation suggestions.

Test Project: TaskFlow, a simple TypeScript/Express task management API with 14 documentation files.

Running Evaluations

# Install dependencies
npm install

# Run all evaluations
npm run eval

# Run in watch mode
npm run eval:watch

Test Scenarios

| # | Scenario | Change Type | Expected Files | Difficulty |
|---|----------|-------------|----------------|------------|
| 1 | New Endpoint | Add POST endpoint | 3 files | Easy |
| 2 | Rename Parameter | Parameter rename | 2 files | Medium |
| 3 | Breaking Change | Schema change | 4 files | Easy |
| 4 | New Feature | Major feature | 4+ files | Hard |
| 5 | Deprecation | Deprecate endpoint | 4 files | Medium |
| 6 | Internal Refactor | No API changes | 0 files (negative) | Hard |
| 7 | Config Change | New env vars | 2 files | Easy |
| 8 | Behavior Change | Sorting behavior | 4 files | Hard |

See EXPECTED_RESULTS.md for detailed expected suggestions per scenario.
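
As a rough illustration of how a row of the table could be represented in code, here is a minimal sketch. The `TestScenario` shape, the `isNegativeScenario` helper, and the file paths are all hypothetical; the actual definitions live in evals/test-scenarios.ts and may look quite different.

```typescript
// Hypothetical shape for one scenario; not taken from evals/test-scenarios.ts.
interface TestScenario {
  id: number;
  name: string;
  changeType: string;
  // Documentation files that should receive a suggestion.
  // An empty list marks a negative scenario (no suggestions expected).
  expectedFiles: string[];
  difficulty: "Easy" | "Medium" | "Hard";
}

// Scenario 6 from the table: a refactor that should trigger nothing.
const internalRefactor: TestScenario = {
  id: 6,
  name: "Internal Refactor",
  changeType: "No API changes",
  expectedFiles: [],
  difficulty: "Hard",
};

// Scenario 7 from the table. The two paths are placeholders, not real fixtures.
const configChange: TestScenario = {
  id: 7,
  name: "Config Change",
  changeType: "New env vars",
  expectedFiles: ["docs/placeholder-a.md", "docs/placeholder-b.md"],
  difficulty: "Easy",
};

// A scenario is negative when no documentation file is expected to change.
function isNegativeScenario(scenario: TestScenario): boolean {
  return scenario.expectedFiles.length === 0;
}
```

Keeping negative scenarios in the same structure as positive ones means one scoring path can handle both, at the cost of needing an explicit rule for empty expected sets.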

Evaluation Metrics

Precision: correct suggestions / total suggestions made (penalizes false positives)
Recall: correct suggestions / expected suggestions (penalizes missed updates)
F1 Score: harmonic mean of precision and recall, 2 × P × R / (P + R)
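
The three metrics can be sketched as a single function over file paths. This is an illustration only, not the code in evals/scorers.ts: the `scoreSuggestions` name, the `Scores` type, and the file paths are hypothetical, and scoring an empty denominator as 1 (so that a negative scenario with no suggestions counts as perfect) is an assumption.

```typescript
interface Scores {
  precision: number;
  recall: number;
  f1: number;
}

// Remove duplicate paths so a file suggested twice is only counted once.
function unique(paths: string[]): string[] {
  return paths.filter((path, index) => paths.indexOf(path) === index);
}

// Hypothetical scorer: compares suggested doc files against expected ones.
function scoreSuggestions(suggested: string[], expected: string[]): Scores {
  const suggestedFiles = unique(suggested);
  const expectedFiles = unique(expected);
  const correct = suggestedFiles.filter(
    (path) => expectedFiles.indexOf(path) !== -1
  ).length;

  // Assumption: an empty denominator scores 1 rather than NaN.
  const precision =
    suggestedFiles.length === 0 ? 1 : correct / suggestedFiles.length;
  const recall =
    expectedFiles.length === 0 ? 1 : correct / expectedFiles.length;

  // Harmonic mean; defined as 0 when both inputs are 0.
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}

// Two correct suggestions out of four expected files (placeholder paths).
const example = scoreSuggestions(
  ["docs/a.md", "docs/b.md"],
  ["docs/a.md", "docs/b.md", "docs/c.md", "docs/d.md"]
);
```

In this example precision is 2/2 and recall is 2/4, so F1 is 2 × 1 × 0.5 / 1.5, or roughly 66.7%.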

Repository Structure

janusdoc-evals/
├── src/                    # TaskFlow API source code
├── docs/                   # TaskFlow documentation (test fixtures)
├── evals/                  # Evalite test configuration
│   ├── janusdoc.eval.ts   # Main eval file
│   ├── test-scenarios.ts  # Scenario definitions
│   ├── scorers.ts         # Precision/Recall/F1 scorers
│   └── utils.ts           # Helper functions
├── EXPECTED_RESULTS.md    # Expected suggestions per scenario
└── README.md              # This file

Environment Setup

Create a .envrc file (or export manually):

export GITHUB_TOKEN="your_github_token"
export OPENAI_API_KEY="your_openai_key"
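
A missing variable is easier to diagnose before a run than during one. The sketch below shows one way to check for both up front; `missingEnvVars` and `REQUIRED_VARS` are hypothetical names, and whether the suite already performs such a check is not stated here.

```typescript
// The two variables named above.
const REQUIRED_VARS = ["GITHUB_TOKEN", "OPENAI_API_KEY"];

// Hypothetical helper: returns the names that are unset or blank.
// The environment is passed in so the function is easy to test.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[]
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// In a Node.js script this would typically be called as:
//   const missing = missingEnvVars(process.env, REQUIRED_VARS);
//   if (missing.length > 0) {
//     throw new Error("Missing environment variables: " + missing.join(", "));
//   }
```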

Results

Current JanusDoc performance:

5 of 8 scenarios complete successfully
Precision: 100% (no false positives)
Recall: 25-50% (room for improvement)
Best F1 Score: 66.7%, on the behavior change scenario

See the evaluation output for detailed per-scenario results.

License

MIT
