FEABench: Evaluating Language Models on Multiphysics Reasoning Ability
Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael Brenner, Peter Norgaard
The datasets are organized as follows:
- FEABench Gold: The set of 15 benchmark problems with verified solutions, used in the evaluations below.
- FEABench Large: This is a CSV file containing the application IDs, URLs, and titles of the 200 problems we evaluated on. The code to generate the inputs and outputs is in generate_feabench_large. The generated dataset has the following structure (a loading sketch follows the listing below):

theme
├── model_id
├── annotation
└── snippet
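As a rough illustration, the CSV can be enumerated along these lines. Note that the file path and column names (`application_id`, `url`, `title`) are assumptions inferred from the description above, not confirmed by the repository; check the actual schema before use.

```python
# Minimal sketch of enumerating FEABench Large problems from the CSV.
# The file path and column names below are hypothetical placeholders.
import csv

with open("data/feabench_large.csv", newline="") as f:  # hypothetical path
    for row in csv.DictReader(f):
        print(row["application_id"], row["title"], row["url"])  # hypothetical columns
```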
The directories in this repo are organized as follows:

feabench
├── common
│   ├── agents               # Code pertaining to the Corrector and ToolLookup `subagents` and Tools.
│   ├── eval                 # Code to evaluate results.
│   └── remote_service       # Code to set up the MPHClient.
├── generate_feabench_large  # Code to segment tutorial PDFs and Java files.
└── data                     # Data for the benchmark.

Single-Turn Evaluation
Specify directory locations in common/constants.py
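As a sketch of what these edits might look like, the constants module could define paths along the following lines. The variable names here (`DATA_DIR`, `OUTPUT_DIR`) are hypothetical placeholders, not necessarily the repo's actual identifiers.

```python
# common/constants.py -- illustrative sketch only; the actual variable
# names in the repository may differ from these hypothetical placeholders.
import os

DATA_DIR = os.path.expanduser("~/feabench/data")       # benchmark inputs
OUTPUT_DIR = os.path.expanduser("~/feabench/outputs")  # where inference runs are written
```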
On FEABench Gold:
python run_external_inference.py -- \
--version=0 --prompt=prompt_v0_nosol.txt --model_type=openai --run=8-28 --problems="comsol_267"
On FEABench Large:
python run_external_inference_large.py -- \
--model_type=anthropic --trial=9-24 --subset=val
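To sweep several models or subsets, one might wrap this entry point in a small driver script. This is a sketch under the assumption that the flags shown above are the full interface; the model and subset values other than those in the command above are illustrative.

```python
# Hypothetical driver: sweep run_external_inference_large.py over models
# and subsets via subprocess. Flag values beyond the example above are
# illustrative assumptions.
import subprocess

MODELS = ["anthropic", "openai"]  # assumed to match --model_type choices
SUBSETS = ["val", "test"]         # "test" is an assumed subset name

for model in MODELS:
    for subset in SUBSETS:
        subprocess.run(
            ["python", "run_external_inference_large.py", "--",
             f"--model_type={model}", "--trial=9-24", f"--subset={subset}"],
            check=True,
        )
```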