Replication package for the paper:
FixtureDB: A Multi-Language Dataset of Test Fixture Definitions
João Almeida, Andre Hora
ICSME 2026 — Tool Demonstration and Data Showcase Track
TODO: add DOI once published
This repository contains the extraction pipeline that builds FixtureDB. The dataset itself (SQLite database + CSV exports) is archived separately on Zenodo at TODO: Zenodo DOI.
The toy dataset contains fixture definitions extracted from 200 GitHub repositories across 4 programming languages:
| Metric | Toy Dataset |
|---|---|
| Total Repositories | 200 (50 per language) |
| Total Test Files | 257,764 |
| Total Fixtures | 35,169 |
| Languages | Python, Java, JavaScript, TypeScript |
| Size (SQLite + CSVs) | ~175 MB (uncompressed) / 26 MB (compressed) |
| Export Format | SQLite database + 5 CSV files: repositories.csv (metadata); repository_statistics.csv (aggregated metrics per repo); test_files.csv (file metadata); test_file_statistics.csv (aggregated metrics per file); fixtures.csv (individual fixtures) |
| Reproducibility | Pinned GitHub commits for all repositories |
Download: [Latest Zenodo Release](TODO: add Zenodo DOI) — includes full SQLite database and CSV exports for analysis.
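As a first sanity check after downloading, the five CSV exports can be inspected without knowing their columns in advance. The sketch below is a minimal, schema-agnostic example that uses only the standard library. The file names come from the table above; the `exports` directory name in the usage note is a placeholder, not a path defined by this repository.

```python
import csv
from pathlib import Path

# File names as listed in the "Export Format" row above.
CSV_EXPORTS = [
    "repositories.csv",
    "repository_statistics.csv",
    "test_files.csv",
    "test_file_statistics.csv",
    "fixtures.csv",
]


def summarize_csv(path):
    """Return (column_names, data_row_count) for a single CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


def summarize_exports(export_dir):
    """Summarize every expected export that is present in export_dir."""
    summary = {}
    for name in CSV_EXPORTS:
        path = Path(export_dir) / name
        if path.exists():  # silently skip exports that are missing
            summary[name] = summarize_csv(path)
    return summary
```

Calling `summarize_exports("exports")` on the unpacked archive returns one `(columns, rows)` pair per file, which is a quick way to confirm the download is complete before running any analysis.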
| Property | Value |
|---|---|
| SEART GitHub Search Extraction | April 1–2, 2026 |
| Repository Selection | Quality filters: ≥5 test files, ≥50 commits, ≥500 stars |
| Languages | Python, Java, JavaScript, TypeScript |
| GitHub API Version | v3 REST API |
| Required Tools | See requirements.txt for exact versions |
| Tree-sitter | Grammar support for all 4 languages |
| Complexity Analysis | Lizard + language-specific cognitive complexity |
| Python Environment | 3.8+ |
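The repository selection thresholds in the table above can be expressed as a simple predicate. This is only an illustrative sketch: the `RepoCandidate` record and its field names are hypothetical and are not taken from the pipeline's code; only the three threshold values come from the table.

```python
from dataclasses import dataclass

# Thresholds taken from the "Repository Selection" row above.
MIN_TEST_FILES = 5
MIN_COMMITS = 50
MIN_STARS = 500


@dataclass
class RepoCandidate:
    # Hypothetical record; the field names are illustrative only.
    full_name: str
    test_files: int
    commits: int
    stars: int


def passes_quality_filters(repo):
    """Return True when a candidate meets all three minimum thresholds."""
    return (
        repo.test_files >= MIN_TEST_FILES
        and repo.commits >= MIN_COMMITS
        and repo.stars >= MIN_STARS
    )
```

All three conditions must hold at once, so a very popular repository with fewer than five test files would still be excluded.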
The dataset was constructed through a five-phase pipeline:
- GitHub Search (April 1–2, 2026) — Query SEART API for repositories by language and star count
- Repository Cloning — Download full source code for all matching repositories
- Test File Detection — Discover test files using language-specific patterns and parse with Tree-sitter
- Fixture Extraction — Identify fixture definitions and scan for mock framework usage
- Metrics & Export — Compute complexity metrics, validate quality, generate CSV exports
See docs/collection-pipeline.md for a detailed pipeline walkthrough and docs/data/data-collection.md for reproducibility steps. For exact tool versions, see requirements.txt.
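To illustrate the test file detection phase, the sketch below classifies a path by matching its file name against per-language glob patterns. The patterns are assumptions based on common naming conventions, not the pipeline's actual configuration; the real patterns are documented in docs/architecture/configuration.md and docs/architecture/detection.md.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumed naming conventions; the pipeline's real patterns may differ.
TEST_FILE_PATTERNS = {
    "python": ["test_*.py", "*_test.py"],
    "java": ["*Test.java", "*Tests.java"],
    "javascript": ["*.test.js", "*.spec.js"],
    "typescript": ["*.test.ts", "*.spec.ts"],
}


def classify_test_file(path):
    """Return the language whose pattern matches the file name, else None."""
    name = PurePosixPath(path).name  # match on the file name only
    for language, patterns in TEST_FILE_PATTERNS.items():
        # fnmatchcase keeps matching case-sensitive on every platform.
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return language
    return None
```

Name-based matching is only a first pass. According to the pipeline description, candidate files are then parsed with Tree-sitter, which this sketch does not attempt.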
Complete documentation has been organized into dedicated files in the docs/ folder:
| Document | Purpose |
|---|---|
| docs/INDEX.md | Start here — overview and quick navigation |
| docs/collection-pipeline.md | Collection pipeline phases with Mermaid diagram |
| Document | Purpose |
|---|---|
| docs/getting-started/intro.md | What is FixtureDB and why it matters |
| docs/getting-started/repository-structure.md | Project layout and organization |
| docs/getting-started/setup.md | Installation and dependencies |
| docs/getting-started/running.md | Command reference for pipeline operations |
| Document | Purpose |
|---|---|
| docs/data/data-collection.md | Five-phase pipeline walkthrough |
| docs/data/storage.md | Disk usage and database growth |
| docs/data/csv-export-guide.md | CSV export format and columns |
| docs/data/csv-user-guide.md | CSV exports for non-SQL users |
| Document | Purpose |
|---|---|
| docs/architecture/database-schema.md | Complete ERD and table specifications |
| docs/architecture/configuration.md | All tunable parameters |
| docs/architecture/detection.md | Tree-sitter AST and mock detection |
| docs/architecture/data-pipeline-overview.md | Detailed pipeline architecture |
| docs/architecture/metrics-reference.md | Metrics definitions and computation |
| Document | Purpose |
|---|---|
| docs/usage/reproducing.md | Exact corpus replication with pinned commits |
| docs/usage/usage.md | SQL query examples and data access |
| docs/usage/fixture-patterns-reference.md | Fixture types and classification patterns |
| Document | Purpose |
|---|---|
| docs/reference/limitations.md | Known constraints and validation status |
| docs/reference/license.md | MIT (code) and CC BY 4.0 (dataset) |
| docs/reference/testing.md | Test suite and validation |
| docs/reference/references.md | Academic citations and sources |

    # Install dependencies
    pip install -r requirements.txt
    # Set up your GitHub token
    cp .env.example .env
    # Edit .env and add your GITHUB_TOKEN
    # Initialize the database
    python pipeline.py init
    # Run the full pipeline (all languages)
    python pipeline.py run

For detailed setup, see docs/getting-started/setup.md.
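For readers who want to reuse the same token outside the pipeline, the sketch below shows one way to read `GITHUB_TOKEN` from the environment or from a `.env` file and build request headers for the GitHub REST API. How pipeline.py itself loads `.env` is not described here, so the helper names and the parsing logic are assumptions made for illustration.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_github_token(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token
    path = Path(env_path)
    if path.exists():
        return parse_env_text(path.read_text(encoding="utf-8")).get("GITHUB_TOKEN")
    return None


def github_headers(token):
    """Build headers for authenticated GitHub REST API requests."""
    if not token:
        raise ValueError("GITHUB_TOKEN is empty or missing")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Failing early on a missing token is a deliberate choice in this sketch: unauthenticated GitHub API requests are subject to much lower rate limits, so a silent fallback would tend to surface later as confusing failures.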
FixtureDB is a structured dataset of test fixture definitions extracted from open-source software repositories on GitHub across Python, Java, JavaScript, and TypeScript.
A test fixture is any code that prepares or tears down state before or after a test runs. For each fixture, the dataset records structural metadata (size, complexity, scope, type) and mock framework usage.
Why it matters: Prior empirical work on fixtures is exclusively Java-based. FixtureDB is the first cross-language resource treating the fixture as its primary unit of analysis.
See docs/getting-started/intro.md for the full overview.
FixtureDB focuses exclusively on quantitative, objective aspects of test fixtures:
- Framework Detection: Syntactically unambiguous markers only (decorators, annotations, attributes)
  - Python: @pytest.fixture, setUp()/tearDown() methods
  - Java: @Before/@After annotations
  - JavaScript/TypeScript: Mocha/Jest beforeEach()/afterEach() and related patterns
- Structural Metrics: Lines of code, cyclomatic complexity, parameter counts, fixture type/scope
- Mock Framework Usage: Detection of mock object patterns within fixture code
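To make the structural metrics concrete, the sketch below computes lines of code, parameter count, and an approximate cyclomatic complexity for one Python function. It uses Python's built-in `ast` module as a stand-in; the dataset itself is built with Lizard and Tree-sitter, so values from this approximation may differ from those in the exports. The function name and the returned keys are illustrative.

```python
import ast

# Node types counted as one extra decision point each (an approximation).
_BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def structural_metrics(source, function_name):
    """Return LOC, parameter count, and approximate complexity of a function."""
    for node in ast.walk(ast.parse(source)):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if not (is_function and node.name == function_name):
            continue
        complexity = 1  # a function with no branches has complexity 1
        for child in ast.walk(node):
            if isinstance(child, _BRANCH_NODES):
                complexity += 1
            elif isinstance(child, ast.BoolOp):
                # "a and b and c" adds two decision points
                complexity += len(child.values) - 1
        args = node.args
        return {
            "loc": node.end_lineno - node.lineno + 1,  # needs Python 3.8+
            "parameters": len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs),
            "cyclomatic_complexity": complexity,
        }
    raise ValueError(f"function {function_name!r} not found")
```

For example, a five-line function with one `if` containing an `and` and one `for` loop yields a complexity of 4 under this counting scheme: the base path plus three decision points.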
CSV exports contain quantitative metrics. The SQLite database includes additional internal infrastructure for reproducibility and future research.
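Because the SQLite database contains more than the CSV exports, it can help to list its tables before writing queries. The sketch below does this without assuming any table or column names. The database file name in the usage note is a placeholder, and the full schema is documented in docs/architecture/database-schema.md.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in the database."""
    connection = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier; table names cannot be bound as parameters.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = connection.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        connection.close()
```

A call such as `table_row_counts("fixturedb.sqlite")` (hypothetical file name) gives a quick overview of what the database holds. Note that `sqlite3.connect` creates an empty database if the path does not exist, so an empty result usually means the path is wrong.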
All fixture detectors are covered by unit tests (tests/test_framework_detection.py) that verify:
- Correct framework identification across supported languages
- AST-based detection accuracy
- Cross-language consistency
See docs/architecture/detection.md for technical details on detection algorithms.
The following visualizations provide an overview of the FixtureDB corpus:
- Repository Distribution and Pipeline Status
- Creation Timeline and Activity Patterns
- Fixture Distribution and Scope Patterns
- Mock Usage and Framework Diversity
- Detection Patterns and Execution Scopes
- Lines of Code and Complexity Metrics
- Framework-Specific Scope Adoption
- Nesting, Reuse, and Complexity Patterns
- File Characteristics and Fixture Design
- Project Popularity vs Fixture Quality
Comprehensive exploratory data analysis documentation is available in the following guides:
| Guide | Purpose |
|---|---|
| EDA_INDEX.md | Navigation guide for all EDA resources |
| EDA_COMPLETE_SUMMARY.md | Master reference: all improvements, integrations, and next steps |
| EDA_IMPROVEMENTS_2026.md | Detailed descriptions of all 8 new plots and their design rationale |
| EDA_QUICK_REFERENCE.md | Research workflows, CSV column mapping, and which plot to use for what question |
| EDA_KEY_INSIGHTS.md | Data-driven findings: language comparisons, fixture patterns, teardown adoption analysis |
Quick Start: Begin with EDA_INDEX.md for an overview and navigation to specific analysis goals.





















