Chart2Code is a new benchmark designed to evaluate the chart-generation capabilities of LMMs across three progressively challenging levels: chart reproduction, chart editing, and long-table-to-chart generation.
- Level 1 (Chart Reproduction): reproduce a chart from a reference figure and a user query;
- Level 2 (Chart Editing): apply complex modifications such as changing the chart type or adding elements;
- Level 3 (Long-Table to Chart Generation): transform long, information-dense tables into faithful charts following user instructions.
More details about Chart2Code are available on the project page. 🌐
Here we provide a quick-start guide for evaluating LMMs on Chart2Code.
```bash
git clone https://github.com/showlab/Chart2Code.git
conda env create -f environment.yaml
conda activate chart2code
cd Chart2Code
```
Set the API key and API base URL in `.env` for the different LMMs. Claude, Gemini, and GPT are accessed through API proxy providers, while Seed is accessed through the ARK API.
```
OPENAI_API_KEY=${your_api_proxy_provider_key}
ARK_API_KEY=${your_ark_api_key}
OPENAI_API_URL=${your_api_proxy_provider_url}
ARK_BASE_URL=${your_ark_api_base_url}
```
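For reference, here is a minimal sketch of reading these keys at runtime, assuming `python-dotenv`; the repo's own loading code may differ:
```python
# Minimal sketch, assuming python-dotenv; the repo's own loading code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from ./.env into the process environment

openai_key = os.environ["OPENAI_API_KEY"]  # proxy key for Claude / Gemini / GPT
ark_key = os.environ["ARK_API_KEY"]        # ARK key for Seed models
```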
Download the Chart2Code data from Hugging Face and unzip it under the root directory.
```bash
wget https://huggingface.co/datasets/CSU-JPG/Chart2Code/resolve/main/data.zip
unzip data.zip
```
The file structure should look like this:
```
├── data
│   ├── level1_direct
│   │   ├── 3d_1.png
│   │   ├── 3d_1.py
│   │   └── ...
│   ├── level1_figure
│   │   ├── fig1_density_2
│   │   └── ...
│   ├── level1_customize
│   │   ├── table_1_instruction_2.png
│   │   ├── table_1_instruction_2.py
│   │   ├── table_1_instruction_2_request.txt
│   │   ├── table_1_instruction_2_data.txt
│   │   └── ...
│   ├── level2
│   │   ├── bar_1_v1.png
│   │   ├── bar_1_v1.py
│   │   ├── bar_1_v1_data.txt
│   │   └── ...
│   ├── level3
│   │   ├── table_1.xlsx
│   │   ├── table1_1.png
│   │   ├── table1_1_generate.py
│   │   ├── table1_1.txt
│   │   ├── table1_1_generate.png
│   │   └── ...
│   ├── level1_direct.json
│   ├── level1_figure.json
│   ├── level1_customize.json
│   ├── level2.json
│   └── level3.json
├── Evaluation
└── ...
```
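For intuition, each level-1 reference pairs a rendered chart with the self-contained script that produced it. A hypothetical sketch of what a script such as `3d_1.py` might look like (the real scripts ship in `data.zip`; this only illustrates the format):
```python
# Hypothetical sketch of a level1_direct reference script such as 3d_1.py;
# the real scripts ship in data.zip. This only illustrates the format:
# one self-contained matplotlib program per reference chart.
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(projection="3d")

x = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, x)
Z = np.sin(np.sqrt(X**2 + Y**2))

ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_title("3D Surface")
fig.savefig("3d_1.png", dpi=300)
```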
Inference for each benchmark level is handled by a dedicated shell script in the `scripts/` directory.
You must specify a model for each run, in one of two ways:
- Pass it as an argument (recommended): provide the `MODEL_IDENTIFIER` directly when executing the script.
- Edit the script: set the `MODEL_IDENTIFIER` variable inside the corresponding `.sh` file.

You can set the `LOAD_SOURCE` parameter in the shell script to select how the model is loaded:
- `local` (default): the model is loaded from the `Inference/models` directory.
- `hub`: the model weights are downloaded directly from the Hugging Face Hub.

You can also adjust other parameters such as `GPU_VISIBLE_DEVICES` in the script to fit your hardware setup.
```bash
cd scripts/inference
# For level1_customize
bash inference_customize.sh qwen3_customize_30B
# For level1_direct
bash inference_direct.sh qwen2.5_direct_72B
# For level1_figure
bash inference_figure.sh InternVL_3.5_figure_38B
# For level2
bash inference_level2.sh deepseek_level2
# For level3
bash inference_level3.sh gpt_5_level3
```
Available Models
We now support the following models. Each cell gives the `MODEL_IDENTIFIER` for that level:

| Model Name | level1_customize | level1_direct | level1_figure | level2 | level3 |
|---|---|---|---|---|---|
| InternVL-3.5-38B | InternVL_3.5_customize_38B | InternVL_3.5_direct_38B | InternVL_3.5_figure_38B | InternVL_3.5_level2_38B | InternVL_3.5_level3_38B |
| InternVL-3.5-8B | InternVL_3.5_customize_8B | InternVL_3.5_direct_8B | InternVL_3.5_figure_8B | InternVL_3.5_level2_8B | InternVL_3.5_level3_8B |
| InternVL-3-38B | InternVL_3_customize_38B | InternVL_3_direct_38B | InternVL_3_figure_38B | InternVL_3_level2_38B | InternVL_3_level3_38B |
| InternVL-3-8B | InternVL_3_customize_8B | InternVL_3_direct_8B | InternVL_3_figure_8B | InternVL_3_level2_8B | InternVL_3_level3_8B |
| InternVL-2.5-38B | InternVL_2.5_customize_38B | InternVL_2.5_direct_38B | InternVL_2.5_figure_38B | InternVL_2.5_level2_38B | InternVL_2.5_level3_38B |
| InternVL-2.5-8B | InternVL_2.5_customize_8B | InternVL_2.5_direct_8B | InternVL_2.5_figure_8B | InternVL_2.5_level2_8B | InternVL_2.5_level3_8B |
| Qwen3-VL-30B | qwen3_customize_30B | qwen3_direct_30B | qwen3_figure_30B | qwen3_level2_30B | qwen3_level3_30B |
| Qwen3-VL-30B-think | qwen3_customize_30B_think | qwen3_direct_30B_think | qwen3_figure_30B_think | qwen3_level2_30B_think | qwen3_level3_30B_think |
| Qwen2.5-VL-72B | qwen2.5_customize_72B | qwen2.5_direct_72B | qwen2.5_figure_72B | qwen2.5_level2_72B | qwen2.5_level3_72B |
| Qwen2.5-VL-7B | qwen2.5_customize_7B | qwen2.5_direct_7B | qwen2.5_figure_7B | qwen2.5_level2_7B | qwen2.5_level3_7B |
| Qwen2-VL-72B | qwen2_customize_72B | qwen2_direct_72B | qwen2_figure_72B | qwen2_level2_72B | qwen2_level3_72B |
| Qwen2-VL-7B | qwen2_customize_7B | qwen2_direct_7B | qwen2_figure_7B | qwen2_level2_7B | qwen2_level3_7B |
| MOLMO-7B-D | molmo_customize_7BD | molmo_direct_7BD | molmo_figure_7BD | molmo_level2_7BD | molmo_level3_7BD |
| MIMO-VL-7B-RL-think | mimo_RL_customize_think | mimo_RL_direct_think | mimo_RL_figure_think | mimo_RL_level2_think | mimo_RL_level3_think |
| MIMO-VL-7B-RL-nothink | mimo_RL_customize_nothink | mimo_RL_direct_nothink | mimo_RL_figure_nothink | mimo_RL_level2_nothink | mimo_RL_level3_nothink |
| MIMO-VL-7B-SFT-nothink | mimo_SFT_customize_nothink | mimo_SFT_direct_nothink | mimo_SFT_figure_nothink | mimo_SFT_level2_nothink | mimo_SFT_level3_nothink |
| MIMO-VL-7B-SFT-think | mimo_SFT_customize_think | mimo_SFT_direct_think | mimo_SFT_figure_think | mimo_SFT_level2_think | mimo_SFT_level3_think |
| LLaVA-OV-Qwen2-7B-OV | llava_ov_customize | llava_ov_direct | llava_ov_figure | llava_ov_level2 | llava_ov_level3 |
| LLaVA-OV-Qwen2-7B-SI | llava_si_customize | llava_si_direct | llava_si_figure | llava_si_level2 | llava_si_level3 |
| SEED-1.6-VL | seed_1.6_customize | seed_1.6_direct | seed_1.6_figure | seed_1.6_level2 | seed_1.6_level3 |
| SEED-1.5-VL | seed_1.5_customize | seed_1.5_direct | seed_1.5_figure | seed_1.5_level2 | seed_1.5_level3 |
| Claude-Sonnet-4 | claude_customize | claude_direct | claude_figure | claude_level2 | claude_level3 |
| DeepSeek-VL-7B | deepseek_customize | deepseek_direct | deepseek_figure | deepseek_level2 | deepseek_level3 |
| Gemini-2.5-Pro | gemini_2.5_customize | gemini_2.5_direct | gemini_2.5_figure | gemini_2.5_level2 | gemini_2.5_level3 |
| GLM-4V-9B | glm_customize | glm_direct | glm_figure | glm_level2 | glm_level3 |
| GPT-5 | gpt_5_customize | gpt_5_direct | gpt_5_figure | gpt_5_level2 | gpt_5_level3 |
| Kimi-VL-A3B | kimi_customize | kimi_direct | kimi_figure | kimi_level2 | kimi_level3 |
For the results obtained from inference, the first step is to check the execution rate, i.e., run every generated script and count the ones that execute cleanly (see the sketch below). The code that runs successfully, together with its generated images, then goes through three evaluations: base_evaluation, LLM_evaluation, and LMM_evaluation.
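A minimal sketch of what the execution-rate check amounts to, assuming the generated scripts sit in one directory; the actual logic in `execute_evaluate.sh` may differ:
```python
# Hypothetical sketch of an execution-rate check; not the repo's actual code.
import pathlib
import subprocess

def execution_rate(code_dir: str, timeout: int = 60) -> float:
    """Fraction of generated scripts that run to completion without error."""
    scripts = sorted(pathlib.Path(code_dir).glob("*.py"))
    ok = 0
    for script in scripts:
        try:
            result = subprocess.run(["python", str(script)],
                                    capture_output=True, timeout=timeout)
            ok += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # hung scripts count as failures
    return ok / len(scripts) if scripts else 0.0
```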
```bash
cd scripts/evaluate
# step1: check execution rate
bash execute_evaluate.sh
# step2: run base evaluation
bash base_evaluator.sh
# step3: run LLM evaluation to evaluate the code
bash LLM_evaluator.sh
# step4: run LMM evaluation to evaluate the image
bash LMM_evaluator.sh
```
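As an illustration of the LMM evaluation step, here is a hedged sketch of an LMM-as-judge comparison between a reference and a generated chart. The prompt, judge model, and scoring rubric below are assumptions; the actual ones live in the repo's evaluation scripts, and the endpoint comes from the `.env` setup:
```python
# Hypothetical LMM-as-judge sketch; the repo's actual prompts and scoring
# rubric live in scripts/evaluate, and the judge model here is an assumption.
import base64
import os

from openai import OpenAI

def as_data_url(path: str) -> str:
    """Base64-encode a PNG so it can be sent inline to the API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"],
                base_url=os.environ["OPENAI_API_URL"])

resp = client.chat.completions.create(
    model="gpt-5",  # assumed judge model
    messages=[{"role": "user", "content": [
        {"type": "text",
         "text": "Rate from 0-10 how faithfully the second chart reproduces the first."},
        {"type": "image_url", "image_url": {"url": as_data_url("reference.png")}},
        {"type": "image_url", "image_url": {"url": as_data_url("generated.png")}},
    ]}],
)
print(resp.choices[0].message.content)
```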
- [2025.10.22] We released our paper on arXiv.
- Special thanks to Henry Hengyuan Zhao for serving as the Project Leader of this paper.
- We are grateful to Lijian Wu and Ziyuan Zhen for their hard work on data annotation and baseline testing.
- We also extend our appreciation to Mao Dongxing, Yifei Tao, Lijian Wu, and Wan Yang for their contributions to this work.
If you find Chart2Code useful, please cite it using this BibTeX:
```bibtex
@misc{tang2025chartscodehierarchicalbenchmark,
      title={From Charts to Code: A Hierarchical Benchmark for Multimodal Models},
      author={Jiahao Tang and Henry Hengyuan Zhao and Lijian Wu and Yifei Tao and Dongxing Mao and Yang Wan and Jingru Tan and Min Zeng and Min Li and Alex Jinpeng Wang},
      year={2025},
      eprint={2510.17932},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2510.17932},
}
```
