Tags: EvolvingLMMs-Lab/lmms-eval

v0.5

[Feat] v0.5 Release Pack (#846)

* add scibench task (full) and change medqa (#840)

* add scibench task (full) and change medqa

* run precommit

---------

Co-authored-by: pbcong <[email protected]>

* add csbench (#841)

* add csbench

* run precommit

---------

Co-authored-by: pbcong <[email protected]>

* fix linting (#842)

* [Feature] Add WenetSpeech Dataset (#837)

* [fix] batch size in openai compatible endpoint (#835)

* more (repeated in 14 consecutive commits)

* [Feature] Add WenetSpeech Dataset

* add lmms-eval-0.5 doc's 1st draft

* remove unnecessary parts in lmms-eval-0.5.md

---------

Co-authored-by: b8zhong <[email protected]>

* This commit documents the official release of **LMMS-Eval v0.5: Multimodal Expansion**, detailing significant new features including:

*   A comprehensive **audio evaluation suite** (Step2 Audio Paralinguistic, VoiceBench, WenetSpeech).
*   A production-ready **response caching system**.
*   Integration of **five new models** (e.g., GPT-4o Audio Preview, Gemma-3).
*   Addition of **numerous new benchmarks** across vision, coding, and STEM domains.
*   Support for the **Model Context Protocol (MCP)** and improvements to **Async OpenAI integration**.
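The response caching system above is only named in the release notes. As a rough illustration of the idea, a cache of this kind typically keys generated responses on a hash of the request so repeated evaluation runs skip duplicate API calls. The class name, file layout, and key fields below are hypothetical, not lmms-eval's actual implementation.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


class ResponseCache:
    """Illustrative on-disk cache keyed by a hash of (model, prompt, gen_kwargs)."""

    def __init__(self, cache_dir: str = ".cache/responses") -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, model: str, prompt: str, gen_kwargs: dict) -> str:
        payload = json.dumps(
            {"model": model, "prompt": prompt, "gen_kwargs": gen_kwargs}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str, gen_kwargs: dict) -> Optional[str]:
        # Return a previously stored response, or None on a cache miss.
        path = self.cache_dir / f"{self._key(model, prompt, gen_kwargs)}.json"
        if path.exists():
            return json.loads(path.read_text())["response"]
        return None

    def put(self, model: str, prompt: str, gen_kwargs: dict, response: str) -> None:
        # Persist the response so the next run with identical inputs can reuse it.
        path = self.cache_dir / f"{self._key(model, prompt, gen_kwargs)}.json"
        path.write_text(json.dumps({"response": response}))
```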

* This commit formally announces and documents the **LMMS-Eval v0.5: Multimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for new benchmarks.

* Updates the status legend for reproducibility validation in the LMMS-Eval v0.5 release notes, changing '†' to '+-'.

* Revise metrics and model integration in lmms-eval doc

Updated metrics and model integration details in the documentation.

* Fix model name in LMMs-Eval v0.5 announcement

Corrected the name of the model 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.

---------

Co-authored-by: Do Duc Anh (Erwin) <[email protected]>
Co-authored-by: pbcong <[email protected]>
Co-authored-by: Cong <[email protected]>
Co-authored-by: JAM_Yichen <[email protected]>
Co-authored-by: b8zhong <[email protected]>

v0.4.1

[fix] batch size in openai compatible endpoint (#835)

* more (repeated in 14 consecutive commits)

v0.4

[Feat] LMMS-Eval 0.4 (#721)

* Update task utils and logger

* [Main Update] Doc to messages feature support and Split simple and chat mode (#692)

* Update deps

* Restructured

* Delete models

* Remove deprecated models

* Set up auto doc to messages and chat models

* Lint

* Allow force simple mode

* Add auto doc to messages for audio and video

Fix lint

Init server structure

Restructure to server folder

Clean base and providers

Add clean method for models

Fix loggers save result

Fix dummy server error

Suppress llava warnings

Sample evaluator on llava in the wild

Update mmmu doc to messages

Update version

* Add judge server implementation with various providers and evaluation protocols

Add AsyncAzureOpenAIProvider implementation and update provider factory

Refactor sample saving in EvaluationTracker to use cleaned data and improve logging

Add llm_as_judge_eval metric to multiple tasks and integrate llm_judge API for evaluation
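The judge server described in this commit routes evaluation requests through interchangeable providers (OpenAI, Azure OpenAI, local launchers, and so on). The interface below is a minimal sketch of that provider/factory pattern; the names `JudgeProvider`, `DummyProvider`, and `get_provider` are illustrative and do not mirror lmms-eval's actual classes.

```python
from abc import ABC, abstractmethod


class JudgeProvider(ABC):
    """Common interface every judge backend would implement in this sketch."""

    @abstractmethod
    def judge(self, question: str, response: str, ground_truth: str) -> str:
        """Return the raw judge verdict for one sample."""


class DummyProvider(JudgeProvider):
    def judge(self, question: str, response: str, ground_truth: str) -> str:
        # Placeholder backend: declares a match only on exact string equality.
        return "yes" if response.strip() == ground_truth.strip() else "no"


def get_provider(name: str) -> JudgeProvider:
    """Provider factory: map a config string to a concrete backend."""
    providers = {"dummy": DummyProvider}
    return providers[name]()
```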

* Refactor MathVerseEvaluator to utilize llm_judge server for response generation and evaluation, enhancing API integration and error handling. Update MMBench_Evaluator to streamline API key handling based on environment variables.

* Refactor EvaluationTracker to directly modify sample data for improved clarity and efficiency. Update MathVerseEvaluator to streamline answer scoring by eliminating unnecessary extraction steps and enhance evaluation prompts. Remove deprecated metrics from configuration files.

* Refactor MathVistaEvaluator to integrate llm_judge server for enhanced response generation and evaluation. Streamline API configuration and error handling by removing direct API key management and utilizing a custom server configuration for requests.

* Update MathVista task configurations to replace 'gpt_eval_score' with 'llm_as_judge_eval' across multiple YAML files and adjust the result processing function accordingly. This change aligns with the integration of the llm_judge server for enhanced evaluation metrics.

* Add new OlympiadBench task configurations for mathematics and physics evaluation. Introduce 'olympiadbench_OE_MM_maths_en_COMP.yaml' and 'olympiadbench_OE_MM_physics_en_COMP.yaml' files, while removing outdated English and Chinese test configurations. Update evaluation metrics to utilize 'llm_as_judge_eval' for consistency across tasks.

* Add reasoning model utility functions and integrate into Qwen2_5_VL model. Introduced `parse_reasoning_model_answer` to clean model responses and updated answer processing in the Qwen2_5_VL class to utilize this new function, enhancing response clarity and logging.
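`parse_reasoning_model_answer` is only named in the commit above, not shown. As a sketch of what such a cleaner typically does, the function below strips a `<think>…</think>` reasoning trace and returns the remaining answer; the tag format is an assumption for illustration, not necessarily what Qwen2_5_VL emits.

```python
import re


def parse_reasoning_model_answer(response: str) -> str:
    """Illustrative cleaner: drop the reasoning trace and keep the final answer.

    Assumes the reasoning is wrapped in <think>...</think> tags, which is an
    assumption for this sketch rather than the documented model output format.
    """
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return cleaned.strip()
```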

* Update OlympiadBench task configuration to change 'doc_to_target' from 'answer' to 'final_answer' for improved clarity in response generation.

* Refactor olympiadbench_process_results to enhance response clarity. Updated the return format to include question, response, and ground truth for improved evaluation context. Simplified judge result determination logic.

* Update olympiadbench_OE_MM_physics_en_COMP.yaml to change 'doc_to_target' from 'answer' to 'final_answer' for improved clarity in response generation.

* Update olympiadbench_OE_MM_physics_en_COMP.yaml to change 'doc_to_target' from 'answer' to 'final_answer' for consistency with recent configuration updates and improved clarity in response generation.

* Add launcher and sglang launcher for local llm as judge

* Lint

* add new tasks MMVU and Visual Web Bench (#727)

* add mmvu task

* fix linting videomathqa

* fix mmvu to use llm judge

* add visualwebbench task

* Add Qwen2_5 chat to support doc_to_messages

* Refactor documentation and codebase to standardize naming conventions from 'lm_eval' to 'lmms_eval'. Update task configurations and evaluation metrics accordingly for consistency across the project.

* Update model guide and task configurations to replace 'max_gen_toks' with 'max_new_tokens' for consistency across YAML files and documentation. This change aligns with recent updates in the generation parameters for improved clarity in model behavior.

* Refactor evaluation logic to ensure distributed execution only occurs when multiple processes are active. Update metrics handling in OpenAI Math task to correctly track exact matches and coverage based on results.

* Fix text auto messages

* Update docs

* Add vllm chat models

* Add openai compatible

* Add sglang runtime

* Fix errors

* Fix sglang error

* Add Claude Code Action workflow configuration

* Refactor VLLM model initialization and update generation parameters across tasks. Change model version to a more generic name and adjust sampling settings to enable sampling and increase max new tokens for better performance.

* Update max_new_tokens in Huggingface model and enhance metrics handling in OpenAI math task. Remove breakpoint in VLLM model initialization.

* Allow logging task input

* Add development guidelines document outlining core rules, coding best practices, and error resolution strategies for the codebase.

* fix repr and group

* Add call tools for async openai with mcp client

* Add examples

* Support multi-node eval

* Fix grouping func

* Feature/inference throughput logging (#747)

* Add inference throughput logging to chat models

Implements TPOT (Time Per Output Token) and inference speed metrics (a worked example of this computation follows this commit message):
- TPOT = (e2e_latency - TTFT) / (num_output_tokens - 1)
- Inference Speed = 1 / TPOT tokens/second

Modified chat models:
- openai_compatible.py: API call timing with token counting
- vllm.py: Batch-level timing with per-request metrics
- sglang.py: Timing with meta_info extraction
- huggingface.py: Batch processing with token calculation
- llava_hf.py: Single-request timing with error handling
- qwen2_5_vl.py: Batch timing implementation

Features:
- Precise timing around model.generate() calls
- TTFT estimation when not available from model
- Comprehensive logging with formatted metrics
- Batch processing support
- Error handling for robustness

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
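The TPOT and inference-speed formulas quoted above reduce to simple arithmetic once the end-to-end latency, time to first token (TTFT), and output token count are known. A minimal worked example, with illustrative variable names:

```python
def throughput_metrics(e2e_latency: float, ttft: float, num_output_tokens: int) -> dict:
    """Compute TPOT and inference speed as defined in the commit message above.

    TPOT  = (e2e_latency - TTFT) / (num_output_tokens - 1)   # seconds per token
    speed = 1 / TPOT                                          # tokens per second
    """
    if num_output_tokens < 2:
        raise ValueError("TPOT is undefined for fewer than two output tokens")
    tpot = (e2e_latency - ttft) / (num_output_tokens - 1)
    return {"tpot": tpot, "inference_speed": 1.0 / tpot}


# Example: 4.2 s end-to-end, 0.6 s to first token, 181 output tokens
# -> TPOT = 3.6 / 180 = 0.02 s/token, inference speed = 50 tokens/s
print(throughput_metrics(4.2, 0.6, 181))
```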

* Add throughput metrics documentation and update logging in chat models

* Add gen metric utils

* Revise qwen logging

* Revise llava_hf logging

* Revise hf model logging

* Revise sglang logging

* Support vllm logging

* Add open logging

---------

Co-authored-by: Claude <[email protected]>
Co-authored-by: kcz358 <[email protected]>

* Refactor evaluation process to utilize llm_judge API

- Updated internal evaluation scripts for D170, DC100, and DC200 tasks to replace GPT evaluation with llm_judge evaluation.
- Introduced custom prompts for binary evaluation based on model responses and ground truth.
- Modified YAML configuration files to reflect changes in the evaluation metrics and aggregation methods.
- Enhanced error handling and logging for evaluation failures.

This change aims to improve the accuracy and reliability of model evaluations across different tasks.
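The "custom prompts for binary evaluation" mentioned above are not reproduced in the commit message. The snippet below sketches the general shape of such a prompt and its verdict parser; the wording and the yes/no protocol are assumptions for illustration, not the exact prompts used for the D170, DC100, or DC200 tasks.

```python
BINARY_JUDGE_PROMPT = (
    "You are grading a model answer.\n"
    "Question: {question}\n"
    "Model response: {response}\n"
    "Ground truth: {ground_truth}\n"
    "Does the model response match the ground truth? Answer with 'yes' or 'no' only."
)


def build_binary_judge_prompt(question: str, response: str, ground_truth: str) -> str:
    """Fill the template with one sample's fields before sending it to the judge."""
    return BINARY_JUDGE_PROMPT.format(
        question=question, response=response, ground_truth=ground_truth
    )


def parse_binary_verdict(judge_output: str) -> bool:
    """Treat anything other than an explicit 'yes' as a failure, so judge errors score 0."""
    return judge_output.strip().lower().startswith("yes")
```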

* Dev/olympiad bench (#762)

* Refactor vLLM model files and add OlympiadBench evaluation utilities

- Cleaned up imports and removed unused variables in `vllm.py`.
- Updated threading configuration in `simple/vllm.py` to use environment variables.
- Introduced new utility functions for processing OlympiadBench documents and results in `utils.py`, `zh_utils.py`, and `en_utils.py`.
- Added evaluation logic for OlympiadBench tasks in `olympiadbench_evals.py`.
- Created multiple YAML configuration files for various OlympiadBench tasks, including math and physics in both English and Chinese.
- Implemented aggregation functions for results processing in the OlympiadBench context.

* Implement OlympiadBench evaluation utilities and refactor math verification

- Introduced new utility functions for processing OlympiadBench documents and results in `en_utils.py` and `zh_utils.py`.
- Added a custom timeout decorator in `math_verify_utils.py` to replace the previous signal-based timeout.
- Created multiple YAML configuration files for various OlympiadBench tasks, including math and physics in both English and Chinese.
- Removed outdated files from the `olympiadbench_official` directory to streamline the codebase.
- Enhanced evaluation logic for OlympiadBench tasks in `olympiadbench_evals.py` and added aggregation functions for results processing.
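The "custom timeout decorator" that replaces the signal-based timeout is not shown in this commit message. A common way to achieve this without `signal` (which only works in the main thread) is to run the call in a worker thread, roughly as below. This is a generic sketch of the pattern, not the code in `math_verify_utils.py`.

```python
import functools
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout


def timeout(seconds: float):
    """Fail a call that runs longer than `seconds`, without relying on signal.alarm."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except FuturesTimeout:
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s") from None
            finally:
                # Do not block on a still-running call; the worker thread is abandoned.
                pool.shutdown(wait=False)
        return wrapper
    return decorator


@timeout(5.0)
def verify_answer(expr: str) -> bool:
    ...  # potentially slow symbolic math verification
```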

* Update mathvision utility imports and modify YAML configurations for OlympiadBench

- Added error handling for importing evaluation utilities in `utils.py` to improve robustness.
- Changed `doc_to_target` from "answer" to "final_answer" in both `olympiadbench_all_boxed.yaml` and `olympiadbench_boxed.yaml` to ensure consistency in output naming.

* Update YAML configurations for AIME tasks

- Changed `do_sample` parameter to `true` in `aime24_figures_agg64.yaml` to enable sampling during generation.
- Added new configuration file `aime25_nofigures_agg64.yaml` for a new task, including detailed metrics and filtering options for evaluation.

These updates enhance the flexibility and functionality of the AIME evaluation tasks.

* Refactor internal evaluation scripts for consistency and readability

- Removed unnecessary blank lines in `d170_cn_utils.py`, `d170_en_utils.py`, `dc100_en_utils.py`, and `dc200_cn_utils.py` to improve code clarity.
- Streamlined the `evaluate_binary` API call formatting for better readability.

These changes enhance the maintainability of the evaluation scripts across different tasks.

* Update documentation for `lmms_eval` to enhance clarity and usability

- Revised command-line interface section in `commands.md` for improved readability and updated links to the main README.
- Enhanced `current_tasks.md` with clearer instructions for listing supported tasks and their question counts.
- Added comprehensive model examples in `model_guide.md` for image, video, and audio models, including implementation details and key notes.
- Expanded `README.md` to provide an overview of the framework's capabilities and updated the table of contents for better navigation.
- Included new audio model examples in `run_examples.md` to demonstrate usage.
- Introduced an audio task example in `task_guide.md` to guide users in configuring audio tasks effectively.

These updates aim to improve the overall documentation experience for users and developers working with `lmms_eval`.

* Introduce LMMS-Eval v0.4: Major update with unified message interface, multi-node distributed evaluation, and enhanced judge API

* Add agg8 task and fix data path

* Fix warning

* Remove bug report documentation from the codebase, consolidating information on identified bugs and fixes for improved clarity and maintainability.

* Add announcement for the release of `lmms-eval` v0.4.0 in README.md

* Enhance documentation for LMMS-Eval v0.4 with detailed installation instructions, system requirements, and troubleshooting tips.

* Remove outdated system requirements and installation instructions from LMMS-Eval v0.4 documentation to streamline content and improve clarity.

* Fix datetime format string in olympiadbench submission file naming

Co-authored-by: drluodian <[email protected]>

* Fix video frame handling in protocol with range() for consistent iteration

Co-authored-by: drluodian <[email protected]>

* Convert vLLM environment variables to integers for proper type handling

Co-authored-by: drluodian <[email protected]>

* Fix force_simple model selection to check model availability

Co-authored-by: drluodian <[email protected]>

* Fix format issue and add avg@8 for aime

* Allow vllm for tp

* fix parsing logic

* Fix OpenAI payload max tokens parameter to use max_new_tokens

Co-authored-by: drluodian <[email protected]>

* Update OpenAI payload handling to include support for model version o4 and remove max_tokens parameter
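Neither of the two commits above shows the payload change itself. A rough sketch of the described behavior: the generation limit comes from `max_new_tokens`, is sent to the Chat Completions endpoint as `max_tokens`, and is dropped entirely for o4-style models. Function and variable names here are illustrative, not lmms-eval's actual code.

```python
def build_chat_payload(model_version: str, messages: list, gen_kwargs: dict) -> dict:
    """Illustrative payload builder matching the behavior described in the commits above."""
    payload = {"model": model_version, "messages": messages}
    # o4-style models reject max_tokens, so omit it for them.
    if not model_version.startswith("o4"):
        payload["max_tokens"] = gen_kwargs.get("max_new_tokens", 1024)
    return payload
```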

* Refactor model version handling across evaluation tasks by removing hardcoded GPT model names and replacing them with environment variable support for dynamic model versioning. Update server configuration to utilize the unified judge API for improved response handling.

* batch update misused calls for eval model

* Update evaluation tasks to use environment variables for GPT model versioning, replacing hardcoded values with dynamic configuration. Remove unused YAML loading logic in multilingual LLAVA benchmark utilities.

* Enhance VLLM configuration to support distributed execution for multiple processes. Update multilingual LLAVA benchmark YAML files to include dataset names and remove deprecated config entries.

* Remove reviewer guideline and co-authored-by mention from contribution instructions in claude.md

* Add development guidelines document outlining core rules, coding best practices, and error resolution strategies for the codebase.

* Refactor score parsing logic in multiple utility files to include stripping whitespace from the score string before processing.

* Update .gitignore to include new workspace directory and modify utility files to enhance response handling by replacing Request object usage with direct server method calls for text generation across multiple evaluation tasks.

* Refactor evaluation tasks to utilize the unified judge API by replacing direct server method calls with Request object usage. Update server configuration in multiple utility files to enhance response handling and streamline evaluation processes.

* Refactor generation parameter handling in Llava_OneVision model to streamline configuration. Remove redundant default settings and ensure proper handling of sampling parameters based on the do_sample flag. Update multiple YAML task files to increase max_new_tokens and comment out temperature settings for clarity. Introduce new YAML configuration for MMMU validation reasoning task.

* Enhance score processing logic in utility functions to improve error handling and validation. Implement robust regex patterns for score extraction, ensuring all components are accounted for and scores are clamped within valid ranges. Add logging for better traceability of errors and fallback mechanisms for invalid inputs in the mia_bench evaluation process.
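The commit above describes, without showing, a more defensive score parser. The function below sketches that pattern: pull the first numeric score out of free-form judge output, clamp it to the valid range, and fall back to a default with a logged warning on malformed input. The [0, 10] range and the fallback value are assumptions for illustration.

```python
import logging
import re

logger = logging.getLogger(__name__)


def extract_score(
    judge_output: str, low: float = 0.0, high: float = 10.0, default: float = 0.0
) -> float:
    """Parse the first number in the judge output and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", judge_output.strip())
    if match is None:
        logger.warning(
            "No score found in judge output: %r; falling back to %s", judge_output, default
        )
        return default
    return max(low, min(high, float(match.group())))
```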

* Fix launch error when num proc = 1

* Refactor VLLM model parameter handling to simplify distributed execution logic. Remove redundant checks for tensor parallelism and streamline generation parameter settings by eliminating unused temperature and top_p configurations.

* Refactor VLLM message handling to prioritize image URLs before text content. Remove unused distributed executor backend parameter for cleaner execution logic.

* feat(vllm): Set default max_new_tokens to 4096, temperature to 0, and top_p to 0.95

* docs: Update lmms-eval-0.4 documentation with images and installation instructions

* docs: Update lmms-eval-0.4 documentation to include backward compatibility check

* refactor: Simplify server config instantiation in utils files

* docs: Update supported tasks count in README

* Update docs

* Fix mathverse bugs

* docs: Update images in lmms-eval-0.4.md

* docs: Remove API Benefits and Upcoming Benchmarks sections

* docs: Update image URL for Unified Message Interface

* docs: Fix typo in LMMS-Eval v0.4 documentation

Corrected "27.8/16.40" to "27.8/26.40" in the performance comparison table.
Also corrected "16.78/13.82" to "16.78/15.82" in the performance comparison table.

* fix(docs): Correct typo in LMMS-Eval v0.4 performance comparison table

* refactor(docs): Refactor LMMS-Eval v0.4 performance table for clarity

* Update docs

---------

Co-authored-by: kcz358 <[email protected]>
Co-authored-by: Cong <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Cursor Agent <[email protected]>

v0.3.5

Update pyproject.toml

v0.3.4

Update datasets dependency in pyproject.toml to allow for newer versions (>=2.16.1).

v0.3.3

[Fix] add more model examples (#644)

* Update OpenAI compatibility script for Azure integration

- Set environment variables for Azure OpenAI API configuration.
- Modify model arguments to use the new GPT-4o model version and enable Azure OpenAI support.
- Clean up commented installation instructions for clarity.

* Refactor imports and clean up code in various utility files

- Consolidated import statements in `plm.py` and `utils.py` for better organization.
- Removed redundant blank lines in `eval_utils.py`, `fgqa_utils.py`, `rcap_utils.py`, `rdcap_utils.py`, `rtloc_utils.py`, and `sgqa_utils.py` to enhance readability.
- Ensured consistent import structure across utility files for improved maintainability.

* Update README and add example scripts for model evaluations

- Revised installation instructions to facilitate direct package installation from Git.
- Added detailed usage examples for various models including Aria, LLaVA, and Qwen2-VL.
- Introduced new example scripts for model evaluations, enhancing user guidance for running specific tasks.
- Improved clarity in environmental variable setup and common issues troubleshooting sections.

* Update README to reflect new example script locations and remove outdated evaluation instructions

- Changed paths for model evaluation scripts to point to the new `examples/models` directory.
- Added a note directing users to find more examples in the updated location.
- Removed outdated evaluation instructions for LLaVA on multiple datasets to streamline the documentation.

* Update README to reflect new script locations and enhance evaluation instructions

- Replaced outdated evaluation commands with new script paths in the `examples/models` directory.
- Updated sections for evaluating larger models, including the introduction of new scripts for tensor parallel and SGLang evaluations.
- Streamlined instructions for model evaluation to improve clarity and usability.

v0.3.2

[Fix] Regular Linting

- Added missing comma in AVAILABLE_MODELS for consistency.
- Reordered import statements in vora.py for better readability.
- Simplified input_data generation by condensing method calls into a single line.
- Ensured default generation parameters are set correctly in the VoRA class.

v0.3.1

Update README.md

v0.3.0

Merge pull request #432 from EvolvingLMMs-Lab/pufanyi/pypi_0.3.0

PyPI 0.3.0

v0.2.4

[Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench (#280)

* [add] add internvideo2 support && change mvbench to video branch

* [add] answer_prompt of internvideo2

* [add] change video type of internvideo2

* [fix] update template of mvbench

* [reformat]

* [fix] generate_until_multi_round

* [Feat] videochat2 support

---------

Co-authored-by: heyinan <[email protected]>