Hi Tim, Yulong and I have compiled some pointers to consider for MOOT. Please see below (let us know if we misunderstood the datasets).
Open Issues in the Datasets
This document outlines some open issues identified during the construction and use of the solution-objective datasets. Addressing these issues should improve clarity, consistency, and reliability when the datasets are used for optimization, analysis, or benchmarking.
1. Duplicate Solutions with Inconsistent Objective Values (about Data Structure/Usage)
- Description: The dataset may contain multiple entries with the same solution but different objective values, due to noise, randomness, or repeated trials (a possible handling sketch follows the questions below).
- Open Questions:
- Should repeated entries be averaged to get a single objective estimate?
- Should all entries be retained to reflect natural variability?
- Should statistical outliers be removed?
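As a sketch of the first two options, assuming a pandas-style table with made-up column names (`x1`, `x2`, `latency`) rather than the real schema:

```python
import pandas as pd

# Hypothetical column names; the real datasets may differ.
SOLUTION_COLS = ["x1", "x2"]       # decision-variable columns
OBJECTIVE_COLS = ["latency"]       # objective columns

df = pd.DataFrame({
    "x1": [1, 1, 2],
    "x2": ["a", "a", "b"],
    "latency": [10.0, 12.0, 7.0],  # the solution (1, "a") was evaluated twice
})

# Option A: average repeated evaluations into a single objective estimate.
averaged = df.groupby(SOLUTION_COLS, as_index=False)[OBJECTIVE_COLS].mean()

# Option B: keep all rows as-is to preserve the natural variability.
# Option C: drop per-solution outliers (e.g., beyond 3 standard deviations)
#           before aggregating.
print(averaged)
```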
2. Parameter Type Definitions (about Data Structure)
- Description: Each variable should have an explicitly defined type, such as (I think you mentioned this in the email, but not all the datasets have those; see the sketch after the questions below):
  - integer, float for numerical parameters
  - categorical for enums or string options
  - boolean for true/false flags
  - ordinal for ordered categories
- Open Questions:
- Are types clearly defined for each parameter?
- Is there a permitted range for each?
- Without type annotations, how can distances, encodings, or transformations be properly applied?
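A minimal sketch of what explicit type and range annotations could look like; every parameter name, type, and range below is invented for illustration and is not taken from any of the actual datasets:

```python
# Hypothetical per-parameter annotations; names, types, and ranges are
# illustrative only, not taken from the actual datasets.
PARAMETER_SPEC = {
    "cache_size_mb":   {"type": "integer",     "range": [16, 4096]},
    "learning_rate":   {"type": "float",       "range": [1e-5, 1e-1]},
    "scheduler":       {"type": "categorical", "values": ["fifo", "fair", "drf"]},
    "use_compression": {"type": "boolean"},
    "log_level":       {"type": "ordinal",     "values": ["low", "medium", "high"]},
}

def is_valid(solution: dict) -> bool:
    """Check a candidate solution against the declared ranges / allowed values."""
    for name, spec in PARAMETER_SPEC.items():
        value = solution.get(name)
        if "range" in spec and not (spec["range"][0] <= value <= spec["range"][1]):
            return False
        if "values" in spec and value not in spec["values"]:
            return False
    return True

print(is_valid({"cache_size_mb": 128, "learning_rate": 0.01, "scheduler": "fair",
                "use_compression": True, "log_level": "medium"}))  # True
```

With such a spec in place, distances and encodings (one-hot for categorical, integer codes for ordinal, etc.) can be chosen per type instead of guessed from the raw values.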
3. Dataset Documentation (about Data Structure)
- Description: The dataset would benefit from a clear and formal specification of its column structure.
- Open Questions:
- Which columns correspond to solution parameters?
- Which columns are objectives or evaluation metrics?
- Are there additional columns (e.g., workload labels, run IDs) that should be explicitly documented?
4. Unclear Semantic Meaning of Each Target Column (about Data Structure)
- Description: The meaning of each target column (objective metric) is not always clear (I guess we might not be able to address this?).
- Open Questions:
- What does each metric represent? (e.g., is "latency" average latency or 99th percentile?)
- Are the units clearly stated (e.g., seconds, milliseconds, throughput per second)?
- Should the metric be maximized or minimized (I guess this means + and -?)?
5. Handling Invalid Solutions (about Data Usage)
- Description: The dataset is constructed from a limited number of sampled solutions. During optimization or model evaluation, a method may propose solutions that are not included in the dataset, i.e., unseen or invalid with respect to the table. We need to tell people what to do in that case (a sketch of one option follows the questions below).
- Open Questions:
- How should such solutions be handled?
- Discard them and resample?
- Approximate objective using surrogate models (e.g., regression, KNN, GP)?
- Use nearest neighbor or interpolation methods to estimate their objective?
- Does extrapolating to unseen solutions introduce bias or evaluation inconsistencies?
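As a rough sketch of the surrogate / nearest-neighbour option, using scikit-learn's `KNeighborsRegressor` purely as an example and made-up numeric solutions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up numeric solutions already present in the table, plus their objectives.
X_sampled = np.array([[16, 0.01], [64, 0.10], [256, 0.05]])  # solution parameters
y_sampled = np.array([12.0, 9.5, 7.2])                       # objective, e.g. latency

# Fit a simple surrogate on the tabulated (solution, objective) pairs.
surrogate = KNeighborsRegressor(n_neighbors=2).fit(X_sampled, y_sampled)

# A solution proposed by the optimizer that does not appear in the table.
unseen = np.array([[128, 0.07]])

# Option 1: discard `unseen` and ask the optimizer to resample.
# Option 2: estimate its objective from nearby tabulated solutions
#           (this extrapolation can introduce bias, as noted above).
print(surrogate.predict(unseen))
```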
6. Handling Redundant Solutions (about Data Usage)
- Description: When optimizing over the datasets, one decision to make is whether redundant solutions that have already been sampled should consume the evaluation budget. Some guidelines on this point would be helpful (see the sketch after the questions below).
- Open Questions:
- How should such solutions be handled?
- For expensive problems, re-sampling the same solution might not consume the budget;
- otherwise, it might consume the budget.
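A small sketch of the two budget policies, with a hypothetical `lookup_objective` table lookup standing in for the real dataset access:

```python
def lookup_objective(solution):
    # Made-up solution -> objective table, standing in for the real dataset.
    table = {("a", 1): 3.5, ("b", 2): 1.8}
    return table[solution]

def run_optimizer(proposals, budget, charge_repeats):
    """Evaluate proposals until the evaluation budget is exhausted.

    charge_repeats=True  -> re-sampling an already-seen solution consumes budget.
    charge_repeats=False -> repeats are answered from a cache for free
                            (arguably the right choice for expensive problems).
    """
    cache, results = {}, []
    for solution in proposals:
        if solution in cache and not charge_repeats:
            results.append(cache[solution])      # free repeat, budget untouched
            continue
        if budget <= 0:
            break
        cache[solution] = lookup_objective(solution)
        results.append(cache[solution])
        budget -= 1                              # every charged evaluation costs 1
    return results

print(run_optimizer([("a", 1), ("a", 1), ("b", 2)], budget=2, charge_repeats=False))
# -> [3.5, 3.5, 1.8]: the repeat of ("a", 1) did not consume the budget.
```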
Suggestions
To ensure the dataset is usable and interpretable by the research community:
- Provide a schema file (e.g., JSON, YAML, Markdown) describing each column (see the sketch after this list).
- Annotate types and valid ranges for all solution parameters.
- Clearly document objective metrics, including units and optimization direction (minimize or maximize).
- Decide on a consistent policy for handling duplicates and missing values.
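To make the schema suggestion concrete, here is a minimal sketch of a machine-readable schema written out from Python; all column names, units, and policies below are invented for illustration and are not taken from the actual datasets:

```python
import json

# Hypothetical schema describing each column; everything here is illustrative.
schema = {
    "solution_columns": {
        "cache_size_mb": {"type": "integer", "range": [16, 4096]},
        "scheduler":     {"type": "categorical", "values": ["fifo", "fair"]},
    },
    "objective_columns": {
        "latency_ms": {"unit": "milliseconds", "direction": "minimize",
                       "definition": "mean end-to-end latency"},
        "throughput": {"unit": "requests/second", "direction": "maximize"},
    },
    "other_columns": {
        "run_id": {"description": "identifier of the benchmark run"},
    },
    "duplicate_policy": "average repeated evaluations",
    "missing_value_policy": "drop rows with missing objectives",
}

# Write the schema alongside the dataset, e.g. as schema.json.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```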