Tags: EvolvingLMMs-Lab/lmms-eval

v0.5

[Feat] v0.5 Release Pack (#846)

* add scibench task (full) and change medqa (#840)

* add scibench task (full) and change medqa

* run precommit

---------

Co-authored-by: pbcong <[email protected]>

* add csbench (#841)

* add csbench

* run precommit

---------

Co-authored-by: pbcong <[email protected]>

* fix linting (#842)

* [Feature] Add WenetSpeech Dataset (#837)

* [fix] batch size in openai compatible endpoint (#835)

* more (repeated in 14 consecutive commits)

* [Feature] Add WenetSpeech Dataset

* add lmms-eval-0.5 doc's 1st draft

* remove unnecessary parts in lmms-eval-0.5.md

---------

Co-authored-by: b8zhong <[email protected]>

* This commit documents the official release of **LMMS-Eval v0.5: Multimodal Expansion**, detailing significant new features including:

*   A comprehensive **audio evaluation suite** (Step2 Audio Paralinguistic, VoiceBench, WenetSpeech).
*   A production-ready **response caching system**.
*   Integration of **five new models** (e.g., GPT-4o Audio Preview, Gemma-3).
*   Addition of **numerous new benchmarks** across vision, coding, and STEM domains.
*   Support for the **Model Context Protocol (MCP)** and improvements to **Async OpenAI integration**.
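The response caching system above is only named in the release notes. As a rough illustration of the idea, a cache of this kind typically keys generated responses on a hash of the request so repeated evaluation runs skip duplicate API calls. The class name, file layout, and key fields below are hypothetical, not lmms-eval's actual implementation.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


class ResponseCache:
    """Illustrative on-disk cache keyed by a hash of (model, prompt, gen_kwargs)."""

    def __init__(self, cache_dir: str = ".cache/responses") -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, model: str, prompt: str, gen_kwargs: dict) -> str:
        payload = json.dumps(
            {"model": model, "prompt": prompt, "gen_kwargs": gen_kwargs}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str, gen_kwargs: dict) -> Optional[str]:
        # Return a previously stored response, or None on a cache miss.
        path = self.cache_dir / f"{self._key(model, prompt, gen_kwargs)}.json"
        if path.exists():
            return json.loads(path.read_text())["response"]
        return None

    def put(self, model: str, prompt: str, gen_kwargs: dict, response: str) -> None:
        # Persist the response so the next run with identical inputs can reuse it.
        path = self.cache_dir / f"{self._key(model, prompt, gen_kwargs)}.json"
        path.write_text(json.dumps({"response": response}))
```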

* This commit formally announces and documents the **LMMS-Eval v0.5: Multimodal Expansion** release, updating the `README.md` and refining the `v0.5` release notes with improved structure and reproducibility validation for new benchmarks.

* Updates the status legend for reproducibility validation in the LMMS-Eval v0.5 release notes, changing '†' to '+-'.

* Revise metrics and model integration in lmms-eval doc

Updated metrics and model integration details in the documentation.

* Fix model name in LMMs-Eval v0.5 announcement

Corrected the name of the model 'GPT-4o Audio' to 'GPT-4o Audio Preview' in the announcement section.

---------

Co-authored-by: Do Duc Anh (Erwin) <[email protected]>
Co-authored-by: pbcong <[email protected]>
Co-authored-by: Cong <[email protected]>
Co-authored-by: JAM_Yichen <[email protected]>
Co-authored-by: b8zhong <[email protected]>

v0.4.1

[fix] batch size in openai compatible endpoint (#835)

* more (repeated in 14 consecutive commits)

v0.4

[Feat] LMMS-Eval 0.4 (#721)

* Update task utils and logger

* [Main Update] Doc to messages feature support and Split simple and chat mode (#692)

* Update deps

* Restructured

* Delete models

* Remove deprecated models

* Set up auto doc to messages and chat models

* Lint

* Allow force simple mode

* Add auto doc to messages for audio and video

Fix lint

Init server structure

Restructure to server folder

Clean base and providers

Add clean method for models

Fix loggers save result

Fix dummy server error

Suppress llava warnings

Sample evaluator on llava in the wild

Update mmmu doc to messages

Update version

* Add judge server implementation with various providers and evaluation protocols

Add AsyncAzureOpenAIProvider implementation and update provider factory

Refactor sample saving in EvaluationTracker to use cleaned data and improve logging

Add llm_as_judge_eval metric to multiple tasks and integrate llm_judge API for evaluation
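The judge server described in this commit routes evaluation requests through interchangeable providers (OpenAI, Azure OpenAI, local launchers, and so on). The interface below is a minimal sketch of that provider/factory pattern; the names `JudgeProvider`, `DummyProvider`, and `get_provider` are illustrative and do not mirror lmms-eval's actual classes.

```python
from abc import ABC, abstractmethod


class JudgeProvider(ABC):
    """Common interface every judge backend would implement in this sketch."""

    @abstractmethod
    def judge(self, question: str, response: str, ground_truth: str) -> str:
        """Return the raw judge verdict for one sample."""


class DummyProvider(JudgeProvider):
    def judge(self, question: str, response: str, ground_truth: str) -> str:
        # Placeholder backend: declares a match only on exact string equality.
        return "yes" if response.strip() == ground_truth.strip() else "no"


def get_provider(name: str) -> JudgeProvider:
    """Provider factory: map a config string to a concrete backend."""
    providers = {"dummy": DummyProvider}
    return providers[name]()
```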

* Refactor MathVerseEvaluator to utilize llm_judge server for response generation and evaluation, enhancing API integration and error handling. Update MMBench_Evaluator to streamline API key handling based on environment variables.

* Refactor EvaluationTracker to directly modify sample data for improved clarity and efficiency. Update MathVerseEvaluator to streamline answer scoring by eliminating unnecessary extraction steps and enhance evaluation prompts. Remove deprecated metrics from configuration files.

* Refactor MathVistaEvaluator to integrate llm_judge server for enhanced response generation and evaluation. Streamline API configuration and error handling by removing direct API key management and utilizing a custom server configuration for requests.

* Update MathVista task configurations to replace 'gpt_eval_score' with 'llm_as_judge_eval' across multiple YAML files and adjust the result processing function accordingly. This change aligns with the integration of the llm_judge server for enhanced evaluation metrics.

* Add new OlympiadBench task configurations for mathematics and physics evaluation. Introduce 'olympiadbench_OE_MM_maths_en_COMP.yaml' and 'olympiadbench_OE_MM_physics_en_COMP.yaml' files, while removing outdated English and Chinese test configurations. Update evaluation metrics to utilize 'llm_as_judge_eval' for consistency across tasks.

* Add reasoning model utility functions and integrate into Qwen2_5_VL model. Introduced `parse_reasoning_model_answer` to clean model responses and updated answer processing in the Qwen2_5_VL class to utilize this new function, enhancing response clarity and logging.
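`parse_reasoning_model_answer` is only named in the commit above, not shown. As a sketch of what such a cleaner typically does, the function below strips a `<think>…</think>` reasoning trace and returns the remaining answer; the tag format is an assumption for illustration, not necessarily what Qwen2_5_VL emits.

```python
import re


def parse_reasoning_model_answer(response: str) -> str:
    """Illustrative cleaner: drop the reasoning trace and keep the final answer.

    Assumes the reasoning is wrapped in <think>...</think> tags, which is an
    assumption for this sketch rather than the documented model output format.
    """
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return cleaned.strip()
```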

* Update OlympiadBench task configuration to change 'doc_to_target' from 'answer' to 'final_answer' for improved clarity in response generation.

* Refactor olympiadbench_process_results to enhance response clarity. Updated the return format to include question, response, and ground truth for improved evaluation context. Simplified judge result determination logic.

* Update olympiadbench_OE_MM_physics_en_COMP.yaml to change 'doc_to_target' from 'answer' to 'final_answer' for improved clarity in response generation.

* Update olympiadbench_OE_MM_physics_en_COMP.yaml to change 'doc_to_target' from 'answer' to 'final_answer' for consistency with recent configuration updates and improved clarity in response generation.

* Add launcher and sglang launcher for local llm as judge

* Lint

* add new tasks MMVU and Visual Web Bench (#727)

* add mmvu task

* fix linting videomathqa

* fix mmvu to use llm judge

* add visualwebbench task

* Add Qwen2_5 chat to support doc_to_messages

* Refactor documentation and codebase to standardize naming conventions from 'lm_eval' to 'lmms_eval'. Update task configurations and evaluation metrics accordingly for consistency across the project.

* Update model guide and task configurations to replace 'max_gen_toks' with 'max_new_tokens' for consistency across YAML files and documentation. This change aligns with recent updates in the generation parameters for improved clarity in model behavior.

* Refactor evaluation logic to ensure distributed execution only occurs when multiple processes are active. Update metrics handling in OpenAI Math task to correctly track exact matches and coverage based on results.

* Fix text auto messages

* Update docs

* Add vllm chat models

* Add openai compatible

* Add sglang runtime

* Fix errors

* Fix sglang error

* Add Claude Code Action workflow configuration

* Refactor VLLM model initialization and update generation parameters across tasks. Change model version to a more generic name and adjust sampling settings to enable sampling and increase max new tokens for better performance.

* Update max_new_tokens in Huggingface model and enhance metrics handling in OpenAI math task. Remove breakpoint in VLLM model initialization.

* Allow logging task input

* Add development guidelines document outlining core rules, coding best practices, and error resolution strategies for the codebase.

* fix repr and group

* Add call tools for async openai with mcp client

* Add examples

* Support multi-node eval

* Fix grouping func

* Feature/inference throughput logging (#747)

* Add inference throughput logging to chat models

Implements TPOT (Time Per Output Token) and inference speed metrics (a worked example of this computation follows this commit message):
- TPOT = (e2e_latency - TTFT) / (num_output_tokens - 1)
- Inference Speed = 1 / TPOT tokens/second

Modified chat models:
- openai_compatible.py: API call timing with token counting
- vllm.py: Batch-level timing with per-request metrics
- sglang.py: Timing with meta_info extraction
- huggingface.py: Batch processing with token calculation
- llava_hf.py: Single-request timing with error handling
- qwen2_5_vl.py: Batch timing implementation

Features:
- Precise timing around model.generate() calls
- TTFT estimation when not available from model
- Comprehensive logging with formatted metrics
- Batch processing support
- Error handling for robustness

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
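The TPOT and inference-speed formulas quoted above reduce to simple arithmetic once the end-to-end latency, time to first token (TTFT), and output token count are known. A minimal worked example, with illustrative variable names:

```python
def throughput_metrics(e2e_latency: float, ttft: float, num_output_tokens: int) -> dict:
    """Compute TPOT and inference speed as defined in the commit message above.

    TPOT  = (e2e_latency - TTFT) / (num_output_tokens - 1)   # seconds per token
    speed = 1 / TPOT                                          # tokens per second
    """
    if num_output_tokens < 2:
        raise ValueError("TPOT is undefined for fewer than two output tokens")
    tpot = (e2e_latency - ttft) / (num_output_tokens - 1)
    return {"tpot": tpot, "inference_speed": 1.0 / tpot}


# Example: 4.2 s end-to-end, 0.6 s to first token, 181 output tokens
# -> TPOT = 3.6 / 180 = 0.02 s/token, inference speed = 50 tokens/s
print(throughput_metrics(4.2, 0.6, 181))
```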

* Add throughput metrics documentation and update logging in chat models

* Add gen metric utils

* Revise qwen logging

* Revise llava_hf logging

* Revise hf model logging

* Revise sglang logging

* Support vllm logging

* Add open logging

---------

Co-authored-by: Claude <[email protected]>
Co-authored-by: kcz358 <[email protected]>

* Refactor evaluation process to utilize llm_judge API

- Updated internal evaluation scripts for D170, DC100, and DC200 tasks to replace GPT evaluation with llm_judge evaluation.
- Introduced custom prompts for binary evaluation based on model responses and ground truth.
- Modified YAML configuration files to reflect changes in the evaluation metrics and aggregation methods.
- Enhanced error handling and logging for evaluation failures.

This change aims to improve the accuracy and reliability of model evaluations across different tasks.
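The "custom prompts for binary evaluation" mentioned above are not reproduced in the commit message. The snippet below sketches the general shape of such a prompt and its verdict parser; the wording and the yes/no protocol are assumptions for illustration, not the exact prompts used for the D170, DC100, or DC200 tasks.

```python
BINARY_JUDGE_PROMPT = (
    "You are grading a model answer.\n"
    "Question: {question}\n"
    "Model response: {response}\n"
    "Ground truth: {ground_truth}\n"
    "Does the model response match the ground truth? Answer with 'yes' or 'no' only."
)


def build_binary_judge_prompt(question: str, response: str, ground_truth: str) -> str:
    """Fill the template with one sample's fields before sending it to the judge."""
    return BINARY_JUDGE_PROMPT.format(
        question=question, response=response, ground_truth=ground_truth
    )


def parse_binary_verdict(judge_output: str) -> bool:
    """Treat anything other than an explicit 'yes' as a failure, so judge errors score 0."""
    return judge_output.strip().lower().startswith("yes")
```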

* Dev/olympiad bench (#762)

* Refactor vLLM model files and add OlympiadBench evaluation utilities

- Cleaned up imports and removed unused variables in `vllm.py`.
- Updated threading configuration in `simple/vllm.py` to use environment variables.
- Introduced new utility functions for processing OlympiadBench documents and results in `utils.py`, `zh_utils.py`, and `en_utils.py`.
- Added evaluation logic for OlympiadBench tasks in `olympiadbench_evals.py`.
- Created multiple YAML configuration files for various OlympiadBench tasks, including math and physics in both English and Chinese.
- Implemented aggregation functions for results processing in the OlympiadBench context.

* Implement OlympiadBench evaluation utilities and refactor math verification

- Introduced new utility functions for processing OlympiadBench documents and results in `en_utils.py` and `zh_utils.py`.
- Added a custom timeout decorator in `math_verify_utils.py` to replace the previous signal-based timeout.
- Created multiple YAML configuration files for various OlympiadBench tasks, including math and physics in both English and Chinese.
- Removed outdated files from the `olympiadbench_official` directory to streamline the codebase.
- Enhanced evaluation logic for OlympiadBench tasks in `olympiadbench_evals.py` and added aggregation functions for results processing.
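The "custom timeout decorator" that replaces the signal-based timeout is not shown in this commit message. A common way to achieve this without `signal` (which only works in the main thread) is to run the call in a worker thread, roughly as below. This is a generic sketch of the pattern, not the code in `math_verify_utils.py`.

```python
import functools
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout


def timeout(seconds: float):
    """Fail a call that runs longer than `seconds`, without relying on signal.alarm."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except FuturesTimeout:
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s") from None
            finally:
                # Do not block on a still-running call; the worker thread is abandoned.
                pool.shutdown(wait=False)
        return wrapper
    return decorator


@timeout(5.0)
def verify_answer(expr: str) -> bool:
    ...  # potentially slow symbolic math verification
```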

* Update mathvision utility imports and modify YAML configurations for OlympiadBench

- Added error handling for importing evaluation utilities in `utils.py` to improve robustness.
- Changed `doc_to_target` from "answer" to "final_answer" in both `olympiadbench_all_boxed.yaml` and `olympiadbench_boxed.yaml` to ensure consistency in output naming.

* Update YAML configurations for AIME tasks

- Changed `do_sample` parameter to `true` in `aime24_figures_agg64.yaml` to enable sampling during generation.
- Added new configuration file `aime25_nofigures_agg64.yaml` for a new task, including detailed metrics and filtering options for evaluation.

These updates enhance the flexibility and functionality of the AIME evaluation tasks.

* Refactor internal evaluation scripts for consistency and readability

- Removed unnecessary blank lines in `d170_cn_utils.py`, `d170_en_utils.py`, `dc100_en_utils.py`, and `dc200_cn_utils.py` to improve code clarity.
- Streamlined the `evaluate_binary` API call formatting for better readability.

These changes enhance the maintainability of the evaluation scripts across different tasks.

* Update documentation for `lmms_eval` to enhance clarity and usability

- Revised command-line interface section in `commands.md` for improved readability and updated links to the main README.
- Enhanced `current_tasks.md` with clearer instructions for listing supported tasks and their question counts.
- Added comprehensive model examples in `model_guide.md` for image, video, and audio models, including implementation details and key notes.
- Expanded `README.md` to provide an overview of the framework's capabilities and updated the table of contents for better navigation.
- Included new audio model examples in `run_examples.md` to demonstrate usage.
- Introduced an audio task example in `task_guide.md` to guide users in configuring audio tasks effectively.

These updates aim to improve the overall documentation experience for users and developers working with `lmms_eval`.

* Introduce LMMS-Eval v0.4: Major update with unified message interface, multi-node distributed evaluation, and enhanced judge API

* Add agg8 task and fix data path

* Fix warning

* Remove bug report documentation from the codebase, consolidating information on identified bugs and fixes for improved clarity and maintainability.

* Add announcement for the release of `lmms-eval` v0.4.0 in README.md

* Enhance documentation for LMMS-Eval v0.4 with detailed installation instructions, system requirements, and troubleshooting tips.

* Remove outdated system requirements and installation instructions from LMMS-Eval v0.4 documentation to streamline content and improve clarity.

* Fix datetime format string in olympiadbench submission file naming

Co-authored-by: drluodian <[email protected]>

* Fix video frame handling in protocol with range() for consistent iteration

Co-authored-by: drluodian <[email protected]>

* Convert vLLM environment variables to integers for proper type handling

Co-authored-by: drluodian <[email protected]>

* Fix force_simple model selection to check model availability

Co-authored-by: drluodian <[email protected]>

* Fix format issue and add avg@8 for aime

* Allow vllm for tp

* fix parsing logic

* Fix OpenAI payload max tokens parameter to use max_new_tokens

Co-authored-by: drluodian <[email protected]>

* Update OpenAI payload handling to include support for model version o4 and remove max_tokens parameter
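Neither of the two commits above shows the payload change itself. A rough sketch of the described behavior: the generation limit comes from `max_new_tokens`, is sent to the Chat Completions endpoint as `max_tokens`, and is dropped entirely for o4-style models. Function and variable names here are illustrative, not lmms-eval's actual code.

```python
def build_chat_payload(model_version: str, messages: list, gen_kwargs: dict) -> dict:
    """Illustrative payload builder matching the behavior described in the commits above."""
    payload = {"model": model_version, "messages": messages}
    # o4-style models reject max_tokens, so omit it for them.
    if not model_version.startswith("o4"):
        payload["max_tokens"] = gen_kwargs.get("max_new_tokens", 1024)
    return payload
```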

* Refactor model version handling across evaluation tasks by removing hardcoded GPT model names and replacing them with environment variable support for dynamic model versioning. Update server configuration to utilize the unified judge API for improved response handling.

* batch update misused calls for eval model

* Update evaluation tasks to use environment variables for GPT model versioning, replacing hardcoded values with dynamic configuration. Remove unused YAML loading logic in multilingual LLAVA benchmark utilities.

* Enhance VLLM configuration to support distributed execution for multiple processes. Update multilingual LLAVA benchmark YAML files to include dataset names and remove deprecated config entries.

* Remove reviewer guideline and co-authored-by mention from contribution instructions in claude.md

* Add development guidelines document outlining core rules, coding best practices, and error resolution strategies for the codebase.

* Refactor score parsing logic in multiple utility files to include stripping whitespace from the score string before processing.

* Update .gitignore to include new workspace directory and modify utility files to enhance response handling by replacing Request object usage with direct server method calls for text generation across multiple evaluation tasks.

* Refactor evaluation tasks to utilize the unified judge API by replacing direct server method calls with Request object usage. Update server configuration in multiple utility files to enhance response handling and streamline evaluation processes.

* Refactor generation parameter handling in Llava_OneVision model to streamline configuration. Remove redundant default settings and ensure proper handling of sampling parameters based on the do_sample flag. Update multiple YAML task files to increase max_new_tokens and comment out temperature settings for clarity. Introduce new YAML configuration for MMMU validation reasoning task.

* Enhance score processing logic in utility functions to improve error handling and validation. Implement robust regex patterns for score extraction, ensuring all components are accounted for and scores are clamped within valid ranges. Add logging for better traceability of errors and fallback mechanisms for invalid inputs in the mia_bench evaluation process.
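The commit above describes, without showing, a more defensive score parser. The function below sketches that pattern: pull the first numeric score out of free-form judge output, clamp it to the valid range, and fall back to a default with a logged warning on malformed input. The [0, 10] range and the fallback value are assumptions for illustration.

```python
import logging
import re

logger = logging.getLogger(__name__)


def extract_score(
    judge_output: str, low: float = 0.0, high: float = 10.0, default: float = 0.0
) -> float:
    """Parse the first number in the judge output and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", judge_output.strip())
    if match is None:
        logger.warning(
            "No score found in judge output: %r; falling back to %s", judge_output, default
        )
        return default
    return max(low, min(high, float(match.group())))
```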

* Fix launch error when num proc = 1

* Refactor VLLM model parameter handling to simplify distributed execution logic. Remove redundant checks for tensor parallelism and streamline generation parameter settings by eliminating unused temperature and top_p configurations.

* Refactor VLLM message handling to prioritize image URLs before text content. Remove unused distributed executor backend parameter for cleaner execution logic.

* feat(vllm): Set default max_new_tokens to 4096, temperature to 0, and top_p to 0.95

* docs: Update lmms-eval-0.4 documentation with images and installation instructions

* docs: Update lmms-eval-0.4 documentation to include backward compatibility check

* refactor: Simplify server config instantiation in utils files

* docs: Update supported tasks count in README

* Update docs

* Fix mathverse bugs

* docs: Update images in lmms-eval-0.4.md

* docs: Remove API Benefits and Upcoming Benchmarks sections

* docs: Update image URL for Unified Message Interface

* docs: Fix typo in LMMS-Eval v0.4 documentation

Corrected "27.8/16.40" to "27.8/26.40" in the performance comparison table.
Also corrected "16.78/13.82" to "16.78/15.82" in the performance comparison table.

* fix(docs): Correct typo in LMMS-Eval v0.4 performance comparison table

* refactor(docs): Refactor LMMS-Eval v0.4 performance table for clarity

* Update docs

---------

Co-authored-by: kcz358 <[email protected]>
Co-authored-by: Cong <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Cursor Agent <[email protected]>

v0.3.5

Update pyproject.toml

v0.3.4

Update datasets dependency in pyproject.toml to allow for newer versions (>=2.16.1).

v0.3.3

[Fix] add more model examples (#644)

* Update OpenAI compatibility script for Azure integration

- Set environment variables for Azure OpenAI API configuration.
- Modify model arguments to use the new GPT-4o model version and enable Azure OpenAI support.
- Clean up commented installation instructions for clarity.

* Refactor imports and clean up code in various utility files

- Consolidated import statements in `plm.py` and `utils.py` for better organization.
- Removed redundant blank lines in `eval_utils.py`, `fgqa_utils.py`, `rcap_utils.py`, `rdcap_utils.py`, `rtloc_utils.py`, and `sgqa_utils.py` to enhance readability.
- Ensured consistent import structure across utility files for improved maintainability.

* Update README and add example scripts for model evaluations

- Revised installation instructions to facilitate direct package installation from Git.
- Added detailed usage examples for various models including Aria, LLaVA, and Qwen2-VL.
- Introduced new example scripts for model evaluations, enhancing user guidance for running specific tasks.
- Improved clarity in environmental variable setup and common issues troubleshooting sections.

* Update README to reflect new example script locations and remove outdated evaluation instructions

- Changed paths for model evaluation scripts to point to the new `examples/models` directory.
- Added a note directing users to find more examples in the updated location.
- Removed outdated evaluation instructions for LLaVA on multiple datasets to streamline the documentation.

* Update README to reflect new script locations and enhance evaluation instructions

- Replaced outdated evaluation commands with new script paths in the `examples/models` directory.
- Updated sections for evaluating larger models, including the introduction of new scripts for tensor parallel and SGLang evaluations.
- Streamlined instructions for model evaluation to improve clarity and usability.

v0.3.2

[Fix] Regular Linting

- Added missing comma in AVAILABLE_MODELS for consistency.
- Reordered import statements in vora.py for better readability.
- Simplified input_data generation by condensing method calls into a single line.
- Ensured default generation parameters are set correctly in the VoRA class.

v0.3.1

Update README.md

v0.3.0

Merge pull request #432 from EvolvingLMMs-Lab/pufanyi/pypi_0.3.0

PyPI 0.3.0

v0.2.4

[Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench (#280)

* [add] add internvideo2 support && change mvbench to video branch

* [add] answer_prompt of internvideo2

* [add] change video type of internvideo2

* [fix] update template of mvbench

* [reformat]

* [fix] generate_until_multi_round

* [Feat] videochat2 support

---------

Co-authored-by: heyinan <[email protected]>