Hi Tim, Yulong and I have compiled some pointers to consider for MOOT. Please see below (let us know if we misunderstood the datasets).
Open Issues in the Datasets
This document outlines some open issues identified during the construction and use of the solution-objective datasets. Addressing these issues should improve clarity, consistency, and reliability when the datasets are used for optimization, analysis, or benchmarking.
1. Duplicate Solutions with Inconsistent Objective Values (about Data Structure/Usage)
- Description: The dataset may contain multiple entries with the same solution but different objective values, due to noise, randomness, or repeated trials (a possible handling sketch follows the questions below).
- Open Questions:
- Should repeated entries be averaged to get a single objective estimate?
- Should all entries be retained to reflect natural variability?
- Should statistical outliers be removed?
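As a sketch of the first two options, assuming a pandas-style table with made-up column names (`x1`, `x2`, `latency`) rather than the real schema:

```python
import pandas as pd

# Hypothetical column names; the real datasets may differ.
SOLUTION_COLS = ["x1", "x2"]       # decision-variable columns
OBJECTIVE_COLS = ["latency"]       # objective columns

df = pd.DataFrame({
    "x1": [1, 1, 2],
    "x2": ["a", "a", "b"],
    "latency": [10.0, 12.0, 7.0],  # the solution (1, "a") was evaluated twice
})

# Option A: average repeated evaluations into a single objective estimate.
averaged = df.groupby(SOLUTION_COLS, as_index=False)[OBJECTIVE_COLS].mean()

# Option B: keep all rows as-is to preserve the natural variability.
# Option C: drop per-solution outliers (e.g., beyond 3 standard deviations)
#           before aggregating.
print(averaged)
```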
2. Parameter Type Definitions (about Data Structure)
- Description: Each variable should have an explicitly defined type, such as (I think you mentioned this in the email, but not all the datasets have those; see the sketch after the questions below):
  - integer, float for numerical parameters
  - categorical for enums or string options
  - boolean for true/false flags
  - ordinal for ordered categories
- Open Questions:
- Are types clearly defined for each parameter?
- Is there a permitted range for each?
- Without type annotations, how can distances, encodings, or transformations be properly applied?
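A minimal sketch of what explicit type and range annotations could look like; every parameter name, type, and range below is invented for illustration and is not taken from any of the actual datasets:

```python
# Hypothetical per-parameter annotations; names, types, and ranges are
# illustrative only, not taken from the actual datasets.
PARAMETER_SPEC = {
    "cache_size_mb":   {"type": "integer",     "range": [16, 4096]},
    "learning_rate":   {"type": "float",       "range": [1e-5, 1e-1]},
    "scheduler":       {"type": "categorical", "values": ["fifo", "fair", "drf"]},
    "use_compression": {"type": "boolean"},
    "log_level":       {"type": "ordinal",     "values": ["low", "medium", "high"]},
}

def is_valid(solution: dict) -> bool:
    """Check a candidate solution against the declared ranges / allowed values."""
    for name, spec in PARAMETER_SPEC.items():
        value = solution.get(name)
        if "range" in spec and not (spec["range"][0] <= value <= spec["range"][1]):
            return False
        if "values" in spec and value not in spec["values"]:
            return False
    return True

print(is_valid({"cache_size_mb": 128, "learning_rate": 0.01, "scheduler": "fair",
                "use_compression": True, "log_level": "medium"}))  # True
```

With such a spec in place, distances and encodings (one-hot for categorical, integer codes for ordinal, etc.) can be chosen per type instead of guessed from the raw values.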
3. Dataset Documentation (about Data Structure)
- Description: The dataset would benefit from a clear and formal specification of its column structure.
- Open Questions:
- Which columns correspond to solution parameters?
- Which columns are objectives or evaluation metrics?
- Are there additional columns (e.g., workload labels, run IDs) that should be explicitly documented?
4. Unclear Semantic Meaning of Each Target Column (about Data Structure)
- Description: The meaning of each target column (objective metric) is not always clear (I guess we might not be able to address this?).
- Open Questions:
- What does each metric represent? (e.g., is "latency" average latency or 99th percentile?)
- Are the units clearly stated (e.g., seconds, milliseconds, throughput per second)?
- Should the metric be maximized or minimized (I guess this means + and -?)?
5. Handling Invalid Solutions (about Data Usage)
- Description: The dataset is constructed from a limited number of sampled solutions. During optimization or model evaluation, a method may propose solutions that are not included in the dataset, i.e., unseen or invalid with respect to the table. We need to tell people what to do in that case (a sketch of one option follows the questions below).
- Open Questions:
- How should such solutions be handled?
- Discard them and resample?
- Approximate objective using surrogate models (e.g., regression, KNN, GP)?
- Use nearest neighbor or interpolation methods to estimate their objective?
- Does extrapolating to unseen solutions introduce bias or evaluation inconsistencies?
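As a rough sketch of the surrogate / nearest-neighbour option, using scikit-learn's `KNeighborsRegressor` purely as an example and made-up numeric solutions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up numeric solutions already present in the table, plus their objectives.
X_sampled = np.array([[16, 0.01], [64, 0.10], [256, 0.05]])  # solution parameters
y_sampled = np.array([12.0, 9.5, 7.2])                       # objective, e.g. latency

# Fit a simple surrogate on the tabulated (solution, objective) pairs.
surrogate = KNeighborsRegressor(n_neighbors=2).fit(X_sampled, y_sampled)

# A solution proposed by the optimizer that does not appear in the table.
unseen = np.array([[128, 0.07]])

# Option 1: discard `unseen` and ask the optimizer to resample.
# Option 2: estimate its objective from nearby tabulated solutions
#           (this extrapolation can introduce bias, as noted above).
print(surrogate.predict(unseen))
```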
6. Handling Redundant Solutions (about Data Usage)
- Description: When optimizing over the datasets, one decision to make is whether redundant solutions that have already been sampled should consume the evaluation budget. Some guidelines on this point would be helpful (see the sketch after the questions below).
- Open Questions:
- How should such solutions be handled?
- For expensive problems, re-sampling the same solution might not consume the budget;
- otherwise, it might consume the budget.
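A small sketch of the two budget policies, with a hypothetical `lookup_objective` table lookup standing in for the real dataset access:

```python
def lookup_objective(solution):
    # Made-up solution -> objective table, standing in for the real dataset.
    table = {("a", 1): 3.5, ("b", 2): 1.8}
    return table[solution]

def run_optimizer(proposals, budget, charge_repeats):
    """Evaluate proposals until the evaluation budget is exhausted.

    charge_repeats=True  -> re-sampling an already-seen solution consumes budget.
    charge_repeats=False -> repeats are answered from a cache for free
                            (arguably the right choice for expensive problems).
    """
    cache, results = {}, []
    for solution in proposals:
        if solution in cache and not charge_repeats:
            results.append(cache[solution])      # free repeat, budget untouched
            continue
        if budget <= 0:
            break
        cache[solution] = lookup_objective(solution)
        results.append(cache[solution])
        budget -= 1                              # every charged evaluation costs 1
    return results

print(run_optimizer([("a", 1), ("a", 1), ("b", 2)], budget=2, charge_repeats=False))
# -> [3.5, 3.5, 1.8]: the repeat of ("a", 1) did not consume the budget.
```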
Suggestions
To ensure the dataset is usable and interpretable by the research community:
- Provide a schema file (e.g., JSON, YAML, Markdown) describing each column (see the sketch after this list).
- Annotate types and valid ranges for all solution parameters.
- Clearly document objective metrics, including units and optimization direction (minimize or maximize).
- Decide on a consistent policy for handling duplicates and missing values.
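To make the schema suggestion concrete, here is a minimal sketch of a machine-readable schema written out from Python; all column names, units, and policies below are invented for illustration and are not taken from the actual datasets:

```python
import json

# Hypothetical schema describing each column; everything here is illustrative.
schema = {
    "solution_columns": {
        "cache_size_mb": {"type": "integer", "range": [16, 4096]},
        "scheduler":     {"type": "categorical", "values": ["fifo", "fair"]},
    },
    "objective_columns": {
        "latency_ms": {"unit": "milliseconds", "direction": "minimize",
                       "definition": "mean end-to-end latency"},
        "throughput": {"unit": "requests/second", "direction": "maximize"},
    },
    "other_columns": {
        "run_id": {"description": "identifier of the benchmark run"},
    },
    "duplicate_policy": "average repeated evaluations",
    "missing_value_policy": "drop rows with missing objectives",
}

# Write the schema alongside the dataset, e.g. as schema.json.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```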