
New perf. metrics, stability and other improvements #184


Open
wants to merge 2 commits into main

Conversation

Alexsandruss
Contributor

@Alexsandruss Alexsandruss commented Apr 28, 2025

Description

Changes:

  • Add support for LightGBM in daal4py model builders
  • Add garbage collection and result cleaning to the dataset prefetching function to avoid out-of-memory errors
  • Add the SKLBENCH_DATA_CACHE environment variable as the first default location for the datasets cache ($PWD/data_cache is still used when the variable is not set)
  • Change the default dtype to float32
  • Adjust the report generator's compatibility mode to work with the latest versions of stock scikit-learn and RAPIDS, among other cases
  • Update the collected performance metrics:
    • Add cost metrics expressed in microdollars (the most readable scale for typical case computation times)
    • Add CPU load profiling
    • Add RAM and VRAM usage profiling
    • Add the coefficient of variation for run times
    • Add first-run time
    • Add the ratio of first-run time to mean run time
  • Change the color scale from RED-YELLOW-GREEN to RED-WHITE-GREEN in the performance report for better readability
  • Add an option to flush caches between case runs
  • Docs:
    • Mention Kaggle dataset download requirements
    • Add a note about the content and meaning of experimental configs
    • Move the Benchmarking Config Specification to a separate file
    • Add a short explanation of benchmarking scopes
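The new time metrics listed above (coefficient of variation, first-run time, first-to-mean run ratio) can be sketched as follows; the function and key names are illustrative, not sklbench internals:

```python
import statistics

def run_time_metrics(times):
    """Summarize a case's run times: coefficient of variation,
    first-run (warm-up) time, and first-to-mean run ratio."""
    mean = statistics.fmean(times)
    stdev = statistics.stdev(times) if len(times) > 1 else 0.0
    return {
        "mean time, s": mean,
        "time CV": stdev / mean,      # relative spread across runs
        "1st run time, s": times[0],  # includes warm-up overhead
        "1st/mean run ratio": times[0] / mean,
    }

metrics = run_time_metrics([1.9, 1.0, 1.1, 1.0])
```

A first-to-mean ratio noticeably above 1 flags warm-up effects that the mean alone would hide.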

PR should start as a draft, then move to the ready-for-review state after CI passes and all applicable checkboxes are checked.
This approach ensures that reviewers don't spend extra time asking for routine requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes, or created a separate PR with the update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added the respective label(s) to the PR if I have permission to do so.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

@Alexsandruss Alexsandruss added enhancement New feature or request docs documentation and readme update labels Apr 28, 2025
@Alexsandruss Alexsandruss mentioned this pull request Apr 28, 2025
@Alexsandruss Alexsandruss changed the title Updates and fixes New perf. metrics, stability improvements and other fixes Apr 28, 2025
@Alexsandruss Alexsandruss marked this pull request as ready for review April 28, 2025 11:45
@Alexsandruss Alexsandruss changed the title New perf. metrics, stability improvements and other fixes New perf. metrics, stability and other improvements Apr 28, 2025
Contributor

@ahuber21 ahuber21 left a comment


Huge pile of changes that I only skimmed, but I don't see any obvious issues. Would you like comments or opinions on specific changes, or anything that needs to be double-checked?

BTW: I don't think the execution with shell=True is a problem here, as the command is hardcoded and can't be changed. But still it would be nice to work around it. Did you try running without shell=True?
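On the `shell=True` question: a hardcoded command string can usually be converted to an argv list with `shlex.split`, avoiding the shell entirely. A minimal sketch (the command here is a stand-in, not the benchmark's actual one):

```python
import shlex
import subprocess
import sys

# Stand-in for the hardcoded benchmark command string.
cmd = f"{shlex.quote(sys.executable)} -c 'print(42)'"

# shlex.split honors the quoting, producing an argv list that
# subprocess can exec directly -- no shell involved.
result = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
```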

- Custom loaders for named datasets
- User-provided datasets in compatible format

Kaggle API keys and competition rules acceptance are required for next dataset:
Contributor


I think it requires additional config files placed under specific folders too.
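For reference (not specific to this repository): the official Kaggle API client reads credentials from `~/.kaggle/kaggle.json`, or from the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables. The file has this shape:

```json
{
  "username": "your-kaggle-username",
  "key": "your-api-key"
}
```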

@@ -137,7 +137,7 @@ def split_and_transform_data(bench_case, data, data_description):
device = get_bench_case_value(bench_case, "algorithm:device", None)
common_data_format = get_bench_case_value(bench_case, "data:format", "pandas")
common_data_order = get_bench_case_value(bench_case, "data:order", "F")
common_data_dtype = get_bench_case_value(bench_case, "data:dtype", "float64")
common_data_dtype = get_bench_case_value(bench_case, "data:dtype", "float32")
Contributor


I would venture to guess that usage of float32 is much, much less common in both sklearn and sklearnex than float64.
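For context on the dtype trade-off: float32 halves the memory footprint (and therefore memory bandwidth) relative to float64, which is often what dominates benchmark run time. A quick numpy sketch:

```python
import numpy as np

x64 = np.ones((100_000, 100), dtype=np.float64)
x32 = x64.astype(np.float32)

# Same shape, half the bytes: less data moved through caches
# and memory buses during the benchmarked computation.
ratio = x64.nbytes / x32.nbytes
```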

graph_max_degree=self.graph_max_degree,
window_size=self.window_size,
num_threads=self.n_jobs,
# num_threads=self.n_jobs,
Contributor


If this line should be removed, then better to remove it altogether.

- `INCLUDE` - Other configuration files whose parameter sets should be included
- `PARAMETERS_SETS` - Benchmark parameters grouped into named sets
- `TEMPLATES` - List of benchmark setups combining parameter sets with template-specific parameters
- `SETS` - List of parameter sets to include in the template
Contributor


It could explain what a "set" is.
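For illustration, a hypothetical config showing how these keys relate; the set names and parameters below are invented, not taken from the repository:

```json
{
  "INCLUDE": ["common_datasets.json"],
  "PARAMETERS_SETS": {
    "small data": {
      "data": {"dtype": "float32", "order": "F"}
    },
    "knn params": {
      "algorithm": {"estimator": "KNeighborsClassifier"}
    }
  },
  "TEMPLATES": {
    "knn on small data": {
      "SETS": ["small data", "knn params"],
      "algorithm": {"device": "cpu"}
    }
  }
}
```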

|:---------------|:--------------|:--------|:------------|
|<h3>Benchmark workflow parameters</h3>||||
| `bench`:`taskset` | None | | Value for `-c` argument of `taskset` utility used over benchmark subcommand. |
| `bench`:`vtune_profiling` | None | | Analysis type for `collect` argument of Intel(R) VTune* Profiler tool. Linux* OS only. |
Contributor


VTune functionalities are quite undocumented. Could expand on them and add examples and screenshots.
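On the `bench`:`taskset` option above: the configured value becomes the `-c` argument, pinning the benchmark subcommand to specific CPU cores. The same mechanism is visible from Python's affinity API (Linux only; a sketch, not sklbench code):

```python
import os

# What "taskset -c <cores>" does: restrict the process's CPU affinity mask.
allowed = sorted(os.sched_getaffinity(0))   # cores currently allowed
os.sched_setaffinity(0, {allowed[0]})       # pin to one core, like "taskset -c 0"
pinned = os.sched_getaffinity(0)
os.sched_setaffinity(0, set(allowed))       # restore the original mask
```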

|:---------------|:--------------|:--------|:------------|
| `algorithm`:`estimator` | None | | Name of measured estimator. |
| `algorithm`:`estimator_params` | Empty `dict` | | Parameters for estimator constructor. |
| `algorithm`:`online_inference_mode` | False | | Enables online mode for inference methods of estimator (separate call for each sample). |
Contributor


Could mention whether those are parallelized or not (shouldn't be).
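To illustrate what `online_inference_mode` measures, a sketch with a scikit-learn estimator (illustrative, not sklbench code): online mode issues one `predict` call per sample, capturing per-request latency instead of batched throughput.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
x = rng.random((200, 4), dtype=np.float32)
y = (x[:, 0] > 0.5).astype(int)
est = KNeighborsClassifier().fit(x, y)

# Batch inference: a single call over the whole input.
batch_pred = est.predict(x)

# Online inference: a separate call per sample, as online_inference_mode does.
online_pred = np.concatenate([est.predict(x[i:i + 1]) for i in range(len(x))])
```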

@@ -10,9 +10,13 @@ Data handling steps:
Existing data sources:
- Synthetic data from sklearn
- OpenML datasets
- Kaggle competition datasets
Contributor


Unrelated to these changes but: this repository would be a lot easier to use if it could avoid pulling data from kaggle.

Labels: docs (documentation and readme update), enhancement (New feature or request)
3 participants