New perf. metrics, stability and other improvements #184
base: main
Conversation
Huge pile of changes that I only skimmed, but I don't see any obvious issues. Would you like comments or opinions on specific changes, or anything that needs to be double-checked?
BTW: I don't think the execution with `shell=True` is a problem here, as the command is hardcoded and can't be changed. But it would still be nice to work around it. Did you try running without `shell=True`?
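For reference, a hardcoded command can usually be run without `shell=True` by passing the program and its arguments as a list. This is a generic sketch (the `echo` command stands in for the benchmark's actual call, which is an assumption here, not the PR's code):

```python
import subprocess

# With shell=True the whole string would be parsed by a shell:
#   subprocess.run("echo hello", shell=True)
# Without shell=True, pass the command and its arguments as a list;
# no shell is involved, so there is nothing for an attacker to inject into.
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
print(result.stdout.strip())  # hello
```

The list form also avoids quoting headaches when arguments contain spaces.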
- Custom loaders for named datasets
- User-provided datasets in compatible format

Kaggle API keys and competition rules acceptance are required for the following dataset:
I think it requires additional config files placed under specific folders too.
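For context, the Kaggle API client expects a credentials file at `~/.kaggle/kaggle.json` containing a `username` and `key`. A small sketch of a pre-flight check (the helper name is hypothetical, not part of this repository):

```python
import json
from pathlib import Path

# Default location where the Kaggle API client looks for credentials.
DEFAULT_CRED_PATH = Path.home() / ".kaggle" / "kaggle.json"

def has_kaggle_credentials(path=DEFAULT_CRED_PATH):
    """Return True if a kaggle.json with 'username' and 'key' exists at `path`."""
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return {"username", "key"} <= creds.keys()
```

Checking this up front (and printing a pointer to the Kaggle docs) would give a clearer error than a failed download.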
@@ -137,7 +137,7 @@ def split_and_transform_data(bench_case, data, data_description):
     device = get_bench_case_value(bench_case, "algorithm:device", None)
     common_data_format = get_bench_case_value(bench_case, "data:format", "pandas")
     common_data_order = get_bench_case_value(bench_case, "data:order", "F")
-    common_data_dtype = get_bench_case_value(bench_case, "data:dtype", "float64")
+    common_data_dtype = get_bench_case_value(bench_case, "data:dtype", "float32")
I would venture to guess that usage of float32 is much, much less common in both sklearn and sklearnex than float64.
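One reason float64 is the more common default: NumPy (and therefore most sklearn input paths) produces float64 unless a narrower dtype is requested explicitly, so float32 data only shows up when the user opts in:

```python
import numpy as np

# Python floats become float64 by default...
x = np.array([1.0, 2.0, 3.0])
print(x.dtype)  # float64

# ...so float32 has to be requested explicitly.
x32 = x.astype(np.float32)
print(x32.dtype)  # float32
```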
     graph_max_degree=self.graph_max_degree,
     window_size=self.window_size,
-    num_threads=self.n_jobs,
+    # num_threads=self.n_jobs,
If this line should be removed, then it's better to remove it altogether rather than leave it commented out.
- `INCLUDE` - Other configuration files whose parameter sets to include
- `PARAMETERS_SETS` - Benchmark parameters within each set
- `TEMPLATES` - Different benchmark setups combining parameter sets with template-specific parameters
- `SETS` - List of parameter sets to include in the template
It could explain what a "set" is.
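To illustrate how these keys relate, a hypothetical config might look like the following (the key names come from the list above; the file path and parameter values are made up for illustration):

```json
{
    "INCLUDE": ["common_config.json"],
    "PARAMETERS_SETS": {
        "common data": {"data": {"dtype": "float32", "order": "F"}},
        "knn params": {"algorithm": {"estimator": "KNeighborsClassifier"}}
    },
    "TEMPLATES": {
        "knn benchmark": {"SETS": ["common data", "knn params"]}
    }
}
```

Here a "set" is a named group of parameters; a template composes one benchmark setup out of several such sets.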
|:---------------|:--------------|:--------|:------------|
|<h3>Benchmark workflow parameters</h3>||||
| `bench`:`taskset` | None | | Value for the `-c` argument of the `taskset` utility used over the benchmark subcommand. |
| `bench`:`vtune_profiling` | None | | Analysis type for the `collect` argument of the Intel(R) VTune* Profiler tool. Linux* OS only. |
VTune functionalities are quite undocumented. Could expand on them and add examples and screenshots.
|:---------------|:--------------|:--------|:------------|
| `algorithm`:`estimator` | None | | Name of the measured estimator. |
| `algorithm`:`estimator_params` | Empty `dict` | | Parameters for the estimator constructor. |
| `algorithm`:`online_inference_mode` | False | | Enables online mode for inference methods of the estimator (a separate call for each sample). |
Could mention whether those are parallelized or not (shouldn't be).
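For context, "online mode" as described in the table presumably amounts to the following pattern (a sketch with a minimal stand-in estimator, not the benchmark's actual code); the per-sample calls are sequential, which is why they should not be parallelized:

```python
import numpy as np

class NearestCentroid:
    """Minimal stand-in estimator for illustration only."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance of each row to each class centroid; pick the nearest.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = np.repeat([0, 1], 10)
est = NearestCentroid().fit(X, y)

# Batch mode: one vectorized call over all samples.
batch_preds = est.predict(X)

# Online mode: a separate predict call per sample, as in real-time serving.
online_preds = np.array([est.predict(s.reshape(1, -1))[0] for s in X])

assert (batch_preds == online_preds).all()
```

Online mode measures per-call latency rather than throughput, so running the calls in parallel would defeat its purpose.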
@@ -10,9 +10,13 @@ Data handling steps:
 Existing data sources:
 - Synthetic data from sklearn
 - OpenML datasets
+- Kaggle competition datasets
Unrelated to these changes, but: this repository would be a lot easier to use if it could avoid pulling data from Kaggle.
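One way to avoid Kaggle entirely is to lean on the synthetic source already listed above: sklearn can generate a classification dataset of any size locally, with no download or credentials. A sketch (parameter values chosen arbitrarily):

```python
from sklearn.datasets import make_classification

# Generate a synthetic classification dataset locally; no network access needed.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    random_state=42,
)
print(X.shape, y.shape)  # (1000, 20) (1000,)
```

The trade-off is that synthetic data may not reproduce the statistical quirks of the real competition datasets.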
Description
Changes:
- `SKLBENCH_DATA_CACHE` env variable as the first default location for the datasets cache, for convenience (`$PWD/data_cache` still works if the env variable is not set)
- `cost` metrics counted in `microdollars` (the most readable order of magnitude for typical computation times)

PR should start as a draft, then move to the ready-for-review state after CI has passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.
You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
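The cache-location lookup described in the changes list could plausibly be implemented like this (a sketch; the variable name `SKLBENCH_DATA_CACHE` comes from the PR description, while the helper function itself is hypothetical):

```python
import os
from pathlib import Path

def resolve_data_cache(env=os.environ):
    """Prefer SKLBENCH_DATA_CACHE if set; otherwise fall back to $PWD/data_cache."""
    value = env.get("SKLBENCH_DATA_CACHE")
    return Path(value) if value else Path.cwd() / "data_cache"
```

Checking the env variable first keeps the old `$PWD/data_cache` behavior fully backward compatible.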
Checklist to comply with before moving PR from draft:
PR completeness and readability
Testing