fix: create single source of truth for dataset column names #171

danielezhu · 2024-01-16T20:49:35Z

Issue #, if available:

Description of changes:

Current state of affairs

Problem 1
Currently, we define the schema for what gets written to the saved output file (generated by util.save_dataset) in the EvalOutputRecord class. This class has attributes like model_input, sent_less_output, etc. that correspond to columns in the Ray Dataset produced by an EvalAlgorithm's evaluate method.

However, the column names in the produced Ray Dataset are governed by the *_COLUMN_NAME constants in constants.py. Thus, we need to ensure parity between these column name constants and the attributes of EvalOutputRecord. Specifically, this code in EvalOutputRecord.from_row requires that the attribute names in EvalOutputRecord exactly match their corresponding *_COLUMN_NAME strings.

Currently, there is no mechanism for ensuring such parity. There is simply a comment in the docstring of the EvalOutputRecord class. Even if the comment is acknowledged by an engineer, there is still room for manual error, like typos.
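To make the failure mode concrete, here is a minimal sketch of the kind of coupling described above. It is not the actual EvalOutputRecord code: FragileRecord, its fields, and the two constants are hypothetical stand-ins, and the sketch assumes from_row passes column names straight through as keyword arguments.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Hypothetical stand-ins for *_COLUMN_NAME constants in constants.py.
MODEL_INPUT_COLUMN_NAME = "model_input"
MODEL_OUTPUT_COLUMN_NAME = "model_output"


@dataclass
class FragileRecord:
    """Illustrative only: attribute names are kept in sync by hand."""

    model_input: Optional[str] = None
    # Typo: should be `model_output` to match MODEL_OUTPUT_COLUMN_NAME.
    model_ouput: Optional[str] = None

    @classmethod
    def from_row(cls, row: Dict[str, Any]) -> "FragileRecord":
        # Column names become keyword arguments, so each column name
        # must exactly match an attribute name on the class.
        known = {MODEL_INPUT_COLUMN_NAME, MODEL_OUTPUT_COLUMN_NAME}
        kwargs = {k: v for k, v in row.items() if k in known}
        return cls(**kwargs)


row = {"model_input": "a", "model_output": "b"}
try:
    FragileRecord.from_row(row)
    error = None
except TypeError as err:
    # The misspelled attribute only surfaces at runtime.
    error = err
```

Nothing flags the typo until a row containing the affected column is actually processed.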

Problem 2
Currently, EvalOutputRecord looks to the set COLUMN_NAMES as the source of truth regarding Ray Dataset column names, but not every *_COLUMN_NAME constant is included in this set. Since there's nothing enforcing that we include every new *_COLUMN_NAME constant in COLUMN_NAMES, it is very easy to accidentally skip the step of updating COLUMN_NAMES.

This problem has already resulted in a bug. In this PR, the attributes prompt, sent_more_prompt, and sent_less_prompt were added to EvalOutputRecord. The corresponding constants SENT_MORE_PROMPT_COLUMN_NAME = "sent_more_prompt" and SENT_LESS_PROMPT_COLUMN_NAME = "sent_less_prompt" exist in constants.py, but they aren't included in COLUMN_NAMES. (PROMPT_COLUMN_NAME = "prompt" is currently defined separately in each eval algorithm's code, which I have changed in this PR.) As a result, this snippet will not "pick up" the prompt, sent_more_prompt, and sent_less_prompt columns when constructing the EvalOutputRecord object.
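A rough sketch of how this kind of omission silently drops data, assuming the record is built by filtering row keys against COLUMN_NAMES. The helper pick_known_columns and the sample row are hypothetical; only the constant names and values quoted above come from this PR.

```python
from typing import Any, Dict

MODEL_INPUT_COLUMN_NAME = "model_input"
SENT_MORE_PROMPT_COLUMN_NAME = "sent_more_prompt"
SENT_LESS_PROMPT_COLUMN_NAME = "sent_less_prompt"

# Hand-maintained collection: the two prompt constants were never added.
COLUMN_NAMES = {MODEL_INPUT_COLUMN_NAME}


def pick_known_columns(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only columns listed in COLUMN_NAMES."""
    return {k: v for k, v in row.items() if k in COLUMN_NAMES}


row = {
    "model_input": "some input",
    "sent_more_prompt": "prompt A",
    "sent_less_prompt": "prompt B",
}
picked = pick_known_columns(row)
```

No error is raised here; the prompt columns are simply absent from the result, which is what makes the bug easy to miss.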

Proposed changes

Fix for Problem 1
I propose that we get rid of the attributes in EvalOutputRecord that correspond to the non-score columns (e.g. model_input, sent_less_output, etc.), and instead use a single attribute, non_score_columns, which is a Dict, to encode the same information. The keys of this dict will be validated in EvalOutputRecord's __post_init__ method by comparing them to constants.COLUMN_NAMES.
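The proposal can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: RecordSketch, the float-valued scores field, and the hard-coded COLUMN_NAMES are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Stand-in for constants.COLUMN_NAMES.
COLUMN_NAMES = {"model_input", "model_output", "prompt"}


@dataclass
class RecordSketch:
    """Hypothetical simplified record with a single dict of columns."""

    scores: List[float]
    non_score_columns: Dict[str, Any]

    def __post_init__(self) -> None:
        # Reject any key that is not a recognized column name, so a
        # typo fails at construction time instead of going unnoticed.
        unknown = set(self.non_score_columns) - COLUMN_NAMES
        if unknown:
            raise ValueError(f"Unknown column names: {sorted(unknown)}")


ok = RecordSketch(scores=[0.5], non_score_columns={"prompt": "p"})
```

With this shape, adding a new column no longer requires adding a matching attribute to the class. Note that a later commit in this PR renames non_score_columns to dataset_columns.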

Fix for Problem 2
I have added a new Enum, ColumnNames, which encapsulates all of the *_COLUMN_NAME constants. The set COLUMN_NAMES is now defined automatically, using the values in ColumnNames. Because COLUMN_NAMES is created from ColumnNames, we will never run into the issue where we add a new *_COLUMN_NAME constant but COLUMN_NAMES doesn't get updated accordingly.
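A minimal sketch of the idea, using only the column names mentioned in this description. The real ColumnNames enum contains more members than shown here, and the member list below is illustrative.

```python
from enum import Enum


class ColumnNames(Enum):
    """Illustrative subset of the column name enum."""

    MODEL_INPUT_COLUMN_NAME = "model_input"
    PROMPT_COLUMN_NAME = "prompt"
    SENT_MORE_PROMPT_COLUMN_NAME = "sent_more_prompt"
    SENT_LESS_PROMPT_COLUMN_NAME = "sent_less_prompt"


# Derived from the enum, so a new member is included automatically.
COLUMN_NAMES = [e.value for e in ColumnNames]
```

Iterating over an Enum yields members in definition order, so COLUMN_NAMES follows the order in which the members are declared.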

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

malhotra18 · 2024-01-17T18:27:52Z

src/fmeval/constants.py

+
+COLUMN_NAMES = [e.value for e in ColumnNames]
+
+# These constants are included so that eval algorithm code doesn't need


Subjective:
I think we should remove these; it's fine to be verbose in the code: ColumnNames.MODEL_INPUT_COLUMN_NAME.value. It keeps the code readable and reinforces in the reader's mind that the ColumnNames enum is the source of truth.

Makes sense; will update

src/fmeval/eval_algorithms/util.py

Daniel Zhu added 2 commits January 16, 2024 11:17

fix: create single source of truth that defines the columns to include in saved outputs

f9250f3

Add enum for column names so that COLUMN_NAMES can be populated programmatically, thus ensuring it contains all column names

13e8185

danielezhu requested review from malhotra18 and xiaoyi-cheng January 16, 2024 20:50

malhotra18 reviewed Jan 17, 2024

Merge branch 'main' into refactor_constants

1b5e55e

xiaoyi-cheng previously approved these changes Jan 17, 2024

Daniel Zhu added 3 commits January 17, 2024 13:20

Replace all instances of *_COLUMN_NAME constants with ColumnNames.*_COLUMN_NAME.value

5275c75

Rename non_score_columns to dataset_columns

5f93257

Fix linting

60747a7

danielezhu dismissed xiaoyi-cheng’s stale review via 60747a7 January 17, 2024 21:21

malhotra18 approved these changes Jan 17, 2024

xiaoyi-cheng approved these changes Jan 17, 2024

danielezhu merged commit 9b61de4 into aws:main Jan 17, 2024

danielezhu deleted the refactor_constants branch January 17, 2024 22:30

danielezhu changed the title ~~fix: remove possibility of human error from column name-related code~~ fix: create single source of truth for dataset column names Jan 17, 2024

danielezhu mentioned this pull request Jan 18, 2024

feat: stringify dataset column contents during data loading #168

Merged
