Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@danielezhu
Copy link
Contributor

Issue #, if available:

Description of changes:

Current state of affairs

Problem 1
Currently, we define the schema for what gets written to the saved output file (generated by util.save_dataset) in the EvalOutputRecord class. This class has attributes like model_input, sent_less_output, etc. that correspond to columns in the Ray Dataset produced by an EvalAlgorithm's evaluate method.

However, the column names in the produced Ray Dataset are governed by the *_COLUMN_NAME constants in constants.py. Thus, we need to ensure parity between these column name constants and the attributes of EvalOutputRecord. Specifically, this code in EvalOutputRecord.from_row requires that the attribute names in EvalOutputRecord exactly match their corresponding *_COLUMN_NAME strings.

Currently, there is no mechanism for ensuring such parity. There is simply a comment in the docstring of the EvalOutputRecord class. Even if the comment is acknowledged by an engineer, there is still room for manual error, like typos.

Problem 2
Currently, EvalOutputRecord looks to the set COLUMN_NAMES as the source of truth regarding Ray Dataset column names, but not every *_COLUMN_NAME constant is included in this set. Since there's nothing enforcing that we include every new *_COLUMN_NAME constant in COLUMN_NAMES, it is very easy to accidentally skip the step of updating COLUMN_NAMES.

This problem has already resulted in a bug. In this PR, the attributes prompt, sent_more_prompt, and sent_less_prompt were added to EvalOutputRecord. While the corresponding constants SENT_MORE_PROMPT_COLUMN_NAME = "sent_more_prompt" and SENT_LESS_PROMPT_COLUMN_NAME = "sent_less_prompt" exist in constants.py (note that PROMPT_COLUMN_NAME = "prompt" is currently being defined in each eval algo's code, which I have changed in this PR), they aren't included in COLUMN_NAMES, which means that this snippet will not "pick up" the prompt, sent_more_prompt, and sent_less_prompt columns when constructing the EvalOutputRecord object.

Proposed changes

Fix for Problem 1
I propose that we get rid of the attributes in EvalOutputRecord that correspond to the non-score columns (ex: model_input, sent_less_output, etc), and instead use a single attribute, non_score_columns, which is aDict, to encode the same information content. The keys to this dict will be validated in EvalOutputRecord's __post_init__ method by comparing them to constants.COLUMN_NAMES.

Fix for Problem 2
I have added a new Enum, ColumnNames, which encapsulates all of the *_COLUMN_NAME constants. The set COLUMN_NAMES is now defined automatically, using the values in ColumnNames. Because COLUMN_NAMES is created from ColumnNames, we will never run into the issue where we add a new *_COLUMN_NAME constant but COLUMN_NAMES doesn't get updated accordingly.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Daniel Zhu added 2 commits January 16, 2024 11:17

COLUMN_NAMES = [e.value for e in ColumnNames]

# These constants are included so that eval algorithm code doesn't need
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subjective:
I think we should remove these, it's fine being verbose in the code: ColumnName.MODEL_INPUT_COLUMN_NAME.value. It keeps the code readable, ensures in readers mind that ColumnNames enum is the source of truth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense; will update

xiaoyi-cheng
xiaoyi-cheng previously approved these changes Jan 17, 2024
@danielezhu danielezhu merged commit 9b61de4 into aws:main Jan 17, 2024
@danielezhu danielezhu deleted the refactor_constants branch January 17, 2024 22:30
@danielezhu danielezhu changed the title fix: remove possibility of human error from column name-related code fix: create single source of truth for dataset column names Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants