refactor: move repeated code in evaluate method into util functions and simplify the EvalAlgorithmInterface method signatures #224
Conversation
Just a few minor questions.
    evaluated on
    :param target_output: The expected responses for the prompts in model_input
    :return: list evaluation scores for the sample.
def evaluate(self, *args, **kwargs) -> List[EvalOutput]:
Why don't we keep the non-optional named arguments here (save and num_records)?
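For illustration, a middle ground could keep the shared arguments explicit while still allowing algorithm-specific ones. This is a hypothetical sketch: the names save and num_records come from the comment above, while the other parameters, their defaults, and the class name are assumptions.

```python
from typing import List, Optional


class SomeEvalAlgorithm:
    # Hypothetical sketch only: save and num_records come from the review comment above;
    # the other parameters, their defaults, and the class name are assumptions.
    def evaluate(
        self,
        model=None,
        dataset_config=None,
        prompt_template: Optional[str] = None,
        num_records: int = 100,
        save: bool = False,
        **kwargs,
    ) -> List["EvalOutput"]:  # EvalOutput as defined in this package
        """Evaluate the model on the configured dataset(s)."""
        raise NotImplementedError
```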
    ),
)

eval_output = util.evaluate_dataset(
I find it confusing that the function that brings everything together (i.e., running a pipeline on a dataset) is called evaluate_dataset. Could it be called run_pipeline or similar instead? Alternatively, could it be split into two helpers, perhaps prepare_dataset and compute_and_aggregate_metrics?
I can rename this function to run_pipeline in a follow-up. Thanks!
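For readers of this thread, the split suggested above might look roughly like the following sketch. The helper names are taken from the comment; their parameters and division of labor are assumptions, not the actual util API.

```python
# Hypothetical sketch of the split suggested above. The helper names come from the
# review comment; the parameters and division of labor are assumptions.
def prepare_dataset(dataset_config, prompt_template, num_records):
    """Load and sample the dataset, then generate any model responses it is missing."""
    raise NotImplementedError


def compute_and_aggregate_metrics(pipeline, dataset, save=False):
    """Run the transform pipeline over the dataset and aggregate the resulting scores."""
    raise NotImplementedError


# Together these two helpers would cover what the single call above does today:
#   eval_output = util.evaluate_dataset(...)
# (to be renamed run_pipeline per the discussion above).
```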
self.pipeline = TransformPipeline([meteor_score, rouge_score, bert_score])

@staticmethod
def build_pipeline(
nit: missing docstring for build_pipeline.
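For completeness, the missing docstring might read something like the sketch below. It is shown on SummarizationAccuracy because the hunk above builds that algorithm's pipeline, but the parameter names are illustrative, since the full build_pipeline signature is cut off in the diff.

```python
class SummarizationAccuracy:
    @staticmethod
    def build_pipeline(target_output_keys, model_output_key) -> "TransformPipeline":
        """Build the TransformPipeline that computes this algorithm's scores.

        Illustrative sketch only; the real parameter list is truncated in the hunk above.

        :param target_output_keys: Record keys holding the reference (target) outputs.
        :param model_output_key: Record key holding the model's generated output.
        :return: A TransformPipeline composed of the METEOR, ROUGE, and BertScore transforms.
        """
        raise NotImplementedError
```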
This whole PR could easily be split into small logical chunks. Any particular reason for not doing that?
A big PR also hurts reviewer efficiency; it is very easy for a reviewer to miss important changes.
It also leads to an accumulated backlog: I see that a lot of comments are to be addressed in a follow-up CR, which is hard to track as an external reviewer.
def __init__(self, eval_algorithm_config: EvalAlgorithmConfig):
    """Initialize an instance of a subclass of EvalAlgorithmConfig

@abstractmethod
def evaluate_sample(self, *args, **kwargs) -> List[EvalScore]:
Why are we moving towards *args, **kwargs here? The idea was to let customers see what the acceptable arguments are for the built-in eval algorithms too.
*args and **kwargs are a slippery slope: they open the door to an arbitrary number of input variables, leading to confusing signatures. We also want these signatures to serve as guidelines when contributing new eval algorithms, not to leave them completely open-ended for any input variable.
Agreed that we want to keep them generic, but not to the point of giving complete freedom to take in any input variable.
Discussed offline and I will restore the original signatures in a follow-up PR.
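For reference in this thread, an explicit per-algorithm signature (roughly the shape being restored) could look like the sketch below. It is a hypothetical reconstruction based on the docstring fragments quoted in the diff above; the class name and exact parameters are assumptions.

```python
from typing import List


class SomeConcreteEvalAlgorithm:
    # Hypothetical reconstruction from the docstring fragments quoted in the diff above;
    # the class name and exact parameter names are assumptions.
    def evaluate_sample(self, target_output: str, model_output: str) -> List["EvalScore"]:
        """Compute evaluation scores for a single sample.

        :param target_output: The expected response for the prompt.
        :param model_output: The model's actual response to the prompt.
        :return: A list of evaluation scores for the sample.
        """
        raise NotImplementedError
```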
| f"The value should be at least 2." | ||
| ) | ||
| super().__post_init__() | ||
| require( |
Subjective:
I know we follow this style in internal packages. I am not a fan of it though, because it is less readable: it requires an extra hop in the code to see what kind of exception is being thrown.
It also doesn't really help, as there is no impact on code modularity or verbosity.
That is why we intentionally showcased exceptions being raised explicitly.
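To make the trade-off concrete, the two styles under discussion look roughly like this minimal sketch. The require and EvalAlgorithmClientError definitions here are stand-ins, and the num_perturbations name is an assumption; only the error message text comes from the diff above.

```python
class EvalAlgorithmClientError(Exception):
    """Stand-in for the package's client-facing exception (the real name is an assumption)."""


def require(condition: bool, message: str) -> None:
    """Minimal stand-in for the require() helper shown in the diff above."""
    if not condition:
        raise EvalAlgorithmClientError(message)


num_perturbations = 5  # illustrative value; the field name is an assumption

# Style used in this PR: the helper hides which exception type is raised.
require(num_perturbations >= 2, "Invalid num_perturbations. The value should be at least 2.")

# Style favored in the comment above: the exception type is visible at the call site.
if num_perturbations < 2:
    raise EvalAlgorithmClientError("Invalid num_perturbations. The value should be at least 2.")
```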
Note
Apologies in advance for the massive diff. Note that although 25 files have been changed, the majority of these files have just a couple of lines changed (see bullet point 4 in the list below). Also, a huge chunk of the diff comes from the unit tests I wrote for the new util functions, and from deleting a bunch of unit tests that are now redundant.
Description of changes:
This PR accomplishes two main tasks:
1. Moves the repeated code in the evaluate method of nearly all evaluation algorithms into util functions. Source code diff, Unit test diff. Since these util functions will be called from the evaluate method of all eval algos as we move forward with the eval algo redesign, the amount of source code for eval algorithms will drastically shrink, as will the amount of unit test code. See the diff for the SummarizationAccuracy unit tests for how much we can get rid of per algorithm.
2. Simplifies EvalAlgorithmInterface so that its role is reduced to enforcing the implementation of the two methods: evaluate_sample and evaluate, where full flexibility regarding their input arguments is given to implementers of concrete subclasses of EvalAlgorithmInterface. The one constraint that we continue to enforce is the output signature. Diff.

Several additional tasks that are accomplished are:
- Adds a shared config class, SemanticRobustnessConfig, that configs for the various robustness algorithms can inherit from. Diff
- Updates SummarizationAccuracy to use the new util functions to shorten code. Source code diff, Unit test diff. Note that a ton of unit test code can be deleted, as all of the logic that was previously tested here is now being tested in the unit tests for the util functions I added.
- Updates GeneralSemanticRobustness to use the new util functions to shorten code. Source code diff, Unit test diff
- Calls get_eval_results_path when saving outputs so that the EVAL_RESULTS_PATH environment variable is respected even if it is set after the initialization of an EvalAlgorithmInterface object. This "bug"/unexpected behavior has been observed by @polaschwoebel. (A minimal sketch of this lazy lookup follows this description.)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
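The results-path change in the last bullet amounts to reading the environment variable at call time instead of caching it when the algorithm object is constructed. A minimal sketch of that behavior follows; the EVAL_RESULTS_PATH name comes from the description above, while the fallback path and directory creation are assumptions for illustration.

```python
import os


def get_eval_results_path() -> str:
    # Minimal sketch: look up EVAL_RESULTS_PATH each time results are about to be saved,
    # so setting the variable after the eval algorithm is constructed still takes effect.
    # The fallback path and makedirs call are assumptions for illustration.
    results_path = os.environ.get("EVAL_RESULTS_PATH", "/tmp/eval_results")
    os.makedirs(results_path, exist_ok=True)
    return results_path
```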