This repository contains the code for our EMNLP 2025 paper: "On LLM-Based Scientific Inductive Reasoning Beyond Equations".
Large Language Models (LLMs) have demonstrated strong deductive reasoning skills (e.g., mathematics, programming). However, their ability to perform inductive reasoning in scientific contexts remains underexplored.
We introduce SIRBench-V1, the first benchmark to systematically evaluate LLMs on scientific inductive reasoning tasks beyond mathematical equations.
The benchmark spans 7 tasks across biology and chemistry:

**🧬 Biology**
- DNA Translation
- DNA Table Inference
- DNA Transformation

**⚗️ Chemistry**
- Molecule Design
- Molecule Captioning
- Reaction Prediction
- Name Prediction
Each task requires models to induce underlying scientific rules from examples and apply them to new inputs, rather than simply memorizing known mappings.
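For instance, in the DNA Translation task a model must recover a (possibly counterfactual) codon table from few-shot examples and apply it to an unseen sequence. The toy sketch below illustrates this inductive setting; it is hypothetical illustration code, not the benchmark's own implementation, and it assumes each DNA string aligns one-to-one with its protein string as concatenated 3-letter codons.

```python
def induce_codon_table(examples):
    """Induce a codon -> amino-acid mapping from (dna, protein) pairs.

    Toy simplification: each DNA string is a concatenation of 3-letter
    codons aligned one-to-one with the protein string.
    """
    table = {}
    for dna, protein in examples:
        for i, aa in enumerate(protein):
            table[dna[3 * i:3 * i + 3]] = aa
    return table


def apply_table(table, dna):
    """Translate a new DNA string with the induced table."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

A counterfactual table is induced exactly as easily as the standard one, which is why memorized mappings do not help: `induce_codon_table([("ATGGCC", "MA"), ("TTTAAA", "FK")])` yields a table under which `apply_table(table, "GCCATG")` returns `"AM"`.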
This repository builds on the OpenCompass framework, which enables efficient evaluation across different LLMs.
- 📥 Clone this repository
- 🛠️ Install the framework:

  ```shell
  pip install -e .
  ```

- 📦 Install additional dependencies:

  ```shell
  pip install fcd rdkit biopython tenacity
  ```
Each task can be run with its corresponding config file:

```shell
opencompass examples/eval_sirbenchv1_{task}.py
```

For example, to evaluate DNA Transformation:

```shell
opencompass examples/eval_sirbenchv1_dna_transform.py
```
You can modify the config files under `./examples` to test any model supported by OpenCompass.

Note: Before running experiments, please configure your OpenAI API key. For quick experiments, the API key can be added directly to the `examples/eval_sirbenchv1_{task}.py` files.
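For reference, an OpenCompass API-model entry typically looks like the sketch below. This is an assumption about the config shape, not a copy of this repository's files; the exact fields, model name, and abbreviation should follow the entries already present in the config you are editing.

```python
from opencompass.models import OpenAI

models = [
    dict(
        type=OpenAI,
        abbr='gpt-4o',    # label used in result tables (hypothetical choice)
        path='gpt-4o',    # model name passed to the API
        key='sk-...',     # your OpenAI API key goes here
        max_out_len=2048,
        batch_size=8,
    ),
]
```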
- `./examples/`: Evaluation entrypoints for each SIRBench-V1 task.
- `./data/sirbenchv1/`: Processed datasets.
- `opencompass/configs/datasets/sirbenchv1/`: Benchmark dataset configuration files.
- `opencompass/datasets/sirbenchv1/`: Data loaders for SIRBench-V1.
- `opencampasslongicl/opencompass/openicl/icl_inferencer/`: Custom inference strategies:
  - `icl_hr_inferencer.py` (Hypothesis Refinement)
  - `icl_onepass_sc_inferencer.py` (Self-Consistency)
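The core idea behind the self-consistency strategy can be sketched in a few lines: sample several completions for the same prompt and majority-vote the final answer. This is a minimal illustration of the technique, not the repository's actual inferencer.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most frequent answer among sampled completions.

    `samples` stands in for multiple stochastic generations from the
    same prompt; the real inferencer obtains them from the model.
    """
    return Counter(samples).most_common(1)[0][0]
```

For example, `majority_vote(["CCO", "CCO", "CCN"])` returns `"CCO"`: a single aberrant sample is outvoted by the consistent majority.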
We build SIRBench-V1 from authentic and counterfactual tasks using existing scientific resources.
Our benchmark is also available on Hugging Face.
To generate more test samples under different dataset configurations, please download the datasets from the sources below and place them in the specified paths. The configuration files in `opencompass/configs/datasets/sirbenchv1/` can then be modified accordingly.
- Genomic sequences: GenomicLLM_GRCh38
  - `20230906_cds_res_nt2aa_dev.csv`, `20230906_cds_res_nt2aa_test.csv`, `20230906_cds_res_nt2aa_train.csv` -> `data/sirbenchv1/dna_translator/`
- Molecule design & captioning: ChEBI-20
  - `ChEBI-20_data` -> `data/sirbenchv1/chem_molecule_design/`
- Reaction prediction: USPTO-MIT Mixed
  - `uspto_mixed.pickle` -> `data/sirbenchv1/chem_reaction_prediction/`
- Name prediction: PubChem
  - `llm_test.csv` -> `data/sirbenchv1/chem_name_prediction/`
If you use this code or benchmark, please cite our paper:
```bibtex
@inproceedings{lin2025sirbench,
  title     = {On LLM-Based Scientific Inductive Reasoning Beyond Equations},
  author    = {Brian S. Lin and Jiaxin Yuan and Zihan Zhou and Shouli Wang and Shuo Wang and Cunliang Kong and Qi Shi and Yuxuan Li and Liner Yang and Zhiyuan Liu and Maosong Sun},
  booktitle = {Proceedings of EMNLP},
  year      = {2025}
}
```