This is the repository for HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics.
The updated paper is available here: HARDMath.
This repository hosts the full dataset and the evaluation dataset, together with the generation and evaluation code described in the paper. The format of the data is detailed below.
Improving the mathematical reasoning capabilities of Large Language Models (LLMs) is of significant interest to the machine learning community. To rigorously track the progress of these models, comprehensive and diverse benchmarks are essential. However, most existing benchmarks focus on problems at the undergraduate level or below and often feature straightforward solutions. In contrast, many real-world problems in science and engineering do not fit this mold: they require approximations or sophisticated techniques that current benchmarks fail to evaluate. It is therefore important to develop benchmarks with more challenging problems that emphasize a different style of mathematical reasoning.
*Figure: An example of a difficult problem that does not have an exact solution but can be accurately approximated using techniques from applied mathematics.*
To address this gap, we introduce HARDMath, a dataset of challenging, graduate-level problems in applied mathematics that can be used for language model evaluation. Unlike other popular mathematical datasets, HARDMath contains problems that require a combination of advanced problem-solving skills, approximation methods, and mathematical intuition. The dataset contains a large test set of 1,050 problems and a mini test set of 437 problems, divided across seven different problem types. A "Word Problem in Context" set is also introduced, which consists of 40 handwritten problems that require asymptotic reasoning in the context of plausible real-world scenarios.
| Problem Type |
|---|
| Nondimensionalization of symbolic polynomials |
| Nondimensionalization of numerical polynomials |
| Polynomial root-finding |
| Polynomial root correction terms |
| Nonlinear ordinary differential equations |
| Integrals |
| Laplace integrals |
Our full datasets of problems and solutions are stored in the `data` directory. They are available in either CSV or JSON format.
In the CSV and JSON files containing our data, each problem stores the following information:
- “question” (str): the text containing the applied mathematics problem
- “solution” (str): the text containing the worked solution and the boxed final solution
- “question_type” (str): the category the problem/solution falls into (options include “integral,” “ODE,” “nondimensionalization_numeric,” etc.)
- “answer_type” (str): the type of final solution to the applied mathematics problem (options include “math_expression” for problems that only contain one solution regime, and “list” for problems that contain two solution regimes)
- “extracted_answer” (str): LaTeX expressions containing the final boxed solution, which is a list of expressions if there are multiple solution regimes, the expression itself if there is only one solution regime, or a float if the solution is a numerical value
- “small_eval_point” (float): the $x$ value at which the numerical and approximate solutions are evaluated for the “small” solution regime
- “small_analytical” (float): the numerical value of the analytical (approximate) solution evaluated at small_eval_point
- “small_numerical” (float): the numerical value of the ground truth (numerical) solution evaluated at small_eval_point
- “large_eval_point” (float): the $x$ value at which the numerical and approximate solutions are evaluated for the “large” solution regime
- “large_analytical” (float): the numerical value of the analytical (approximate) solution evaluated at large_eval_point
- “large_numerical” (float): the numerical value of the ground truth (numerical) solution evaluated at large_eval_point
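For a quick look at the data, here is a minimal Python sketch. It assumes the top-level JSON object is a list of records with the fields above; the file name follows the example evaluation command later in this README.

```python
import json

# Load the evaluation split; assumes the top-level JSON object is a
# list of problem records with the fields documented above.
with open("data/eval_HARDMath.json") as f:
    problems = json.load(f)

sample = problems[0]
print(sample["question_type"], "|", sample["answer_type"])
print(sample["extracted_answer"])  # LaTeX string, list of strings, or float
print(sample["small_eval_point"], sample["small_analytical"], sample["small_numerical"])
```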
To generate problems and their solutions, navigate to the `src` directory and choose the type of problem you would like to generate. Running the `[problem_type]_generator.ipynb` Jupyter notebook will generate `num_problems` problems; this parameter is set near the top of each notebook.
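If you prefer to run a generator notebook headlessly, one option is papermill. This is only a sketch: it assumes papermill is installed (`pip install papermill`) and that the notebook's `num_problems` cell is tagged `parameters`, neither of which is guaranteed by this repository.

```python
# Hypothetical headless run of a generator notebook with papermill.
# Assumes the num_problems cell is tagged "parameters" so it can be overridden.
import papermill as pm

pm.execute_notebook(
    "src/integral_generator.ipynb",         # any [problem_type]_generator.ipynb
    "src/integral_generator_output.ipynb",  # executed copy, with outputs
    parameters={"num_problems": 50},        # overrides the value near the top
)
```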
First, clone the repository to your local machine:
```bash
git clone https://github.com/sarahmart/HARDMath.git
```

Navigate into the cloned directory:

```bash
cd HARDMath/evaluation
```

The repository includes a `requirements.yml` file to set up a Conda environment with all necessary dependencies. To create the environment, run:

```bash
conda env create -f requirements.yml
```

After the environment is created, activate it with:

```bash
conda activate hardmath-env
```

This will install all required packages and dependencies, along with the custom modules needed to run the scripts.
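As an optional sanity check on the activated environment, a short Python snippet can confirm that key packages import. The package names below are assumptions (they are not read from `requirements.yml`); adjust them to match your setup.

```python
# Quick environment check with assumed package names; adjust as needed.
import importlib

for name in ("openai", "numpy", "sympy"):
    try:
        importlib.import_module(name)
        print(f"found   {name}")
    except ImportError:
        print(f"missing {name}")
```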
To run the script, use the following command:
```bash
python script_name.py --data_dir <path_to_data> --input_file <input_file_name> \
    --example_file <example_file_name> --output_dir <output_directory> \
    --output_file <output_file_name> --model <model_name> --grader <grader_name> \
    --key <api_key> --question_type <question_type> \
    --temperature <temperature_value> --server_ip <server_ip_address>
```

Example command:

```bash
python generate_response_and_score.py --data_dir data --input_file eval_HARDMath.json \
    --example_file example_HARDMath_1shot.json --output_dir results/test \
    --output_file nondimensionalization_symbolic_0shot_gpt4.json \
    --model gpt-3.5-turbo --grader gpt-4o --key YOUR_API_KEY \
    --question_type nondimensionalization_symbolic --temperature 0.0
```

Arguments:

- `--data_dir`: Directory where your input files are located. Default is `data`.
- `--input_file`: The main input data file (e.g., `eval_HARDMath.json`). Default is `eval_HARDMath.json`.
- `--example_file`: Example data file for one-shot learning (e.g., `example_HARDMath_1shot.json`). Default is `examples_HARDMath_1shot.json`.
- `--output_dir`: Directory where the results will be saved. Default is `results/test`.
- `--output_file`: Name of the output file (e.g., `nondimensionalization_symbolic_1shot_gpt4.json`). Default is `nondimensionalization_symbolic_1shot_gpt4.json`.
- `--model`: The model to use for generating responses. Choices include `gpt-4-turbo`, `gpt-3.5-turbo`, `gpt-4o`, `llama3-8b`, `codellama-13b`. Default is `gpt-3.5-turbo`.
- `--grader`: The model to use for grading responses. Choices include `gpt-4-turbo`, `gpt-4o`. Default is `gpt-4o`.
- `--key`: Your API key for the model (if using OpenAI's GPT models). If not provided, the script will attempt to load it from the environment variable `OPENAI_API_KEY`.
- `--prompt_file`: (Optional) Path to a file containing custom prompts. If not provided, the script will create new prompts.
- `--shot_num`: Number of examples to use for few-shot learning. Default is `0`.
- `--question_type`: The type of mathematical problems being evaluated. Choices include `nondimensionalization_symbolic`, `nondimensionalization_numeric`, `integral`, `ode`, `polynomial_roots`. Default is `nondimensionalization_symbolic`.
- `--integral_subtype`: (Optional) Subtype of integral problems, if applicable. Choices include `traditional` and `laplace`.
- `--temperature`: Controls the randomness of the model's responses; lower values make the model more deterministic. Default is `0.0`.
- `--server_ip`: IP address of the server if using a local model server (e.g., an Ollama server).
The results will be saved in the specified output directory and file. The output JSON file will contain the prompts, model responses, extracted answers, and comparison scores.
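As a quick sanity check on a finished run, something like the following tallies the grader's scores. The key name used here (`"score"`) is an assumption about the output schema, so inspect one record in your output file first and adjust accordingly.

```python
import json

# Hypothetical post-processing of an output file: count and average scores.
# The "score" key is an assumed field name; adjust to match your output JSON.
with open("results/test/nondimensionalization_symbolic_0shot_gpt4.json") as f:
    results = json.load(f)

scores = [r["score"] for r in results if isinstance(r, dict) and "score" in r]
if scores:
    print(f"{len(scores)} graded responses, mean score = {sum(scores) / len(scores):.3f}")
```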
If you encounter any issues while running the script, ensure that:
- The input files are correctly formatted JSON files (a quick validation sketch follows this list).
- The custom modules (`utils`, `create_prompt`, `models`, `answer_extraction`) are properly set up.
- Your API key is correctly provided if using GPT models from OpenAI.
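A minimal check for the first point, using only the Python standard library:

```python
import json
import pathlib

# Verify that every JSON file in the data directory parses cleanly
# before launching a long evaluation run.
for path in sorted(pathlib.Path("data").glob("*.json")):
    try:
        json.loads(path.read_text())
        print(f"OK      {path}")
    except json.JSONDecodeError as err:
        print(f"INVALID {path}: {err}")
```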
For more detailed debugging, check the error messages printed during the script execution.