Context
Sergey Kornilov (Biostochastics, LLC) built clawbio_bench v0.1.0, an independent audit suite that tests ClawBio skills across three dimensions: safety, correctness, and honesty. Run against commit 1481fb4, it reports 80/140 tests passing (57.1%).
Full remediation plan: REMEDIATION-PLAN.md
Scorecard
| Skill | Pass | Fail | Rate | Worst Finding |
|---|---|---|---|---|
| bio-orchestrator | 41 | 13 | 75.9% | stub_silent, routed_wrong |
| equity-scorer | 3 | 12 | 20.0% | fst_mislabeled, heim_unbounded, edge_crash |
| nutrigx-advisor | 8 | 2 | 80.0% | snp_invalid, score_incorrect |
| pharmgx-reporter | 14 | 19 | 42.4% | correct_determinate, disclosure_failure |
| claw-metagenomics | 6 | 1 | 85.7% | exit_suppressed |
| fine-mapping | 4 | 12 | 25.0% | pathology_flagged |
| clinical-variant | 4 | 1 | 80.0% | report_structure_complete |

Fixes shipped
Equity-scorer (was 3/15, 20%)
| Finding | Fix | Commit |
|---|---|---|
| C-06 `fst_mislabeled` | Renamed all output labels from "Hudson FST" to "Nei's GST". Added Nei 1973 citation. | f6076f5 |
| U-2/F-27 `heim_unbounded` | Added weight normalization (sum to 1.0), negative weight rejection, zero weight rejection, score clamping to [0, 100]. | f6076f5 |
| `edge_crash` (9 tests) | Added 8 edge case tests. Core computation functions pass all edge cases. Remaining CLI-layer crashes need separate investigation. | b8d9d6a |

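
The `heim_unbounded` fix above combines three guards: reject bad weights, normalize the rest to sum to 1.0, and clamp the final score to [0, 100]. A minimal sketch of that pattern follows. It is not the code from f6076f5: the function names `normalize_weights` and `bounded_score`, the dict-based inputs, and the reading of "zero weight rejection" as "reject a weight set whose total is zero" are all assumptions made for illustration.

```python
import math


def normalize_weights(weights):
    """Rescale weights so they sum to 1.0, rejecting invalid input.

    Hypothetical helper: `weights` maps a component name to a number.
    """
    if not weights:
        raise ValueError("at least one weight is required")
    for name, w in weights.items():
        if not math.isfinite(w):
            raise ValueError(f"non-finite weight for {name!r}: {w}")
        if w < 0:
            raise ValueError(f"negative weight for {name!r}: {w}")
    total = sum(weights.values())
    # Assumption: "zero weight rejection" means the weights cannot all be
    # zero, since the normalization below would divide by zero.
    if total == 0:
        raise ValueError("weights sum to zero; cannot normalize")
    return {name: w / total for name, w in weights.items()}


def bounded_score(components, weights):
    """Weighted sum of component scores, clamped to [0, 100].

    Hypothetical helper: `components` maps the same names to raw scores.
    """
    norm = normalize_weights(weights)
    raw = sum(norm[name] * components[name] for name in norm)
    # The clamp keeps out-of-range component values from leaking
    # into the reported score.
    return max(0.0, min(100.0, raw))
```

With these definitions, `bounded_score({"a": 150, "b": 150}, {"a": 1, "b": 1})` returns `100.0` rather than 150, and `normalize_weights({"a": -1, "b": 2})` raises `ValueError`.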
Fine-mapping (was 4/16, 25%)
| Finding | Fix | Commit |
|---|---|---|
| Purity mean vs min | Changed `_purity()` from `np.mean` to `np.min` per Wang et al. 2020 section 3.2. | c65d84a |
| PIP formula | Verified correct: uses `1 - prod(1 - alpha)`. No change needed. | c65d84a |
| Input validation | Added `ValueError` contracts for `n <= 0`, `w <= 0`, NaN z-scores, `coverage` outside (0,1], `min_purity` outside [0,1]. SE <= 0 warning in sumstats loader. | c65d84a |

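
The three fine-mapping items above can be sketched in a few lines. This is an illustration in plain Python (standing in for the NumPy calls named in the table), not the code from c65d84a. The function names, argument layout, and default thresholds are assumptions, and `w` is treated only as a scalar that must be positive because its meaning is not stated here.

```python
import math
from itertools import combinations


def purity(ld, cs_idx):
    """Minimum absolute pairwise correlation within a credible set.

    `ld` is a square correlation matrix (list of lists) and `cs_idx`
    lists the variant indices in the set. Taking the mean instead of
    the min would let one weakly correlated pair hide behind strong ones.
    """
    if not cs_idx:
        raise ValueError("credible set is empty")
    if len(cs_idx) == 1:
        return 1.0  # a single variant is trivially pure
    return min(abs(ld[i][j]) for i, j in combinations(cs_idx, 2))


def pips(alpha):
    """PIP_j = 1 - prod_l (1 - alpha[l][j]).

    `alpha` has one row per single effect and one column per variant.
    """
    n_variants = len(alpha[0])
    return [1.0 - math.prod(1.0 - row[j] for row in alpha)
            for j in range(n_variants)]


def check_inputs(z, n, w, coverage=0.95, min_purity=0.5):
    """Raise ValueError on inputs the table lists as invalid.

    The default values for `coverage` and `min_purity` are assumptions.
    """
    if n <= 0:
        raise ValueError(f"n must be positive, got {n}")
    if w <= 0:
        raise ValueError(f"w must be positive, got {w}")
    if any(math.isnan(v) for v in z):
        raise ValueError("z-scores contain NaN")
    if not 0 < coverage <= 1:
        raise ValueError(f"coverage must be in (0, 1], got {coverage}")
    if not 0 <= min_purity <= 1:
        raise ValueError(f"min_purity must be in [0, 1], got {min_purity}")
```

For a three-variant set with pairwise correlations 0.9, 0.2, and 0.5, the mean is about 0.53 while the min is 0.2. Under a threshold of 0.5 (an assumed value), the mean-based purity would keep the set and the min-based purity filters it out.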
Test counts after fixes
- equity-scorer: 36/36 passing
- fine-mapping: 76/76 passing
Remaining tasks
CI integration
A `scientific-audit` job has been added to CI that runs `clawbio-bench --smoke` after unit tests pass. Verdicts are uploaded as artifacts with 30-day retention.
Benchmark repo
https://github.com/biostochastics/clawbio_bench — independent external standard maintained by Sergey Kornilov. We are requesting contributor access to help extend coverage.
cc @camlloyd