Conversation

@danielezhu (Contributor) commented Mar 22, 2024

Note
Apologies in advance for the massive diff. Although 25 files have been changed, most of them have only a couple of lines changed (see bullet point 4 in the list below). Also, a huge chunk of the diff comes from the unit tests I wrote for the new util functions, and from deleting a number of unit tests that are now redundant.

Description of changes:
This PR accomplishes two main tasks:

  1. Create util functions for the boilerplate logic found in the evaluate method of nearly all evaluation algorithms. Source code diff, unit test diff. Since these util functions will be called from the evaluate method of every eval algo as we move forward with the eval algo redesign, the amount of source code for eval algorithms will shrink drastically, as will the amount of unit test code. See the diff for the SummarizationAccuracy unit tests for how much we can remove per algorithm.
  2. Simplify and generalize the EvalAlgorithmInterface so that its role is reduced to enforcing the implementation of two methods, evaluate_sample and evaluate, where full flexibility regarding their input arguments is given to implementers of concrete subclasses of EvalAlgorithmInterface. The one constraint we continue to enforce is the output signature (see the sketch after this list). Diff.
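For reference, a minimal sketch of the interface shape item 2 describes, with placeholder classes standing in for fmeval's real EvalScore, EvalOutput, and EvalAlgorithmConfig:

```python
from abc import ABC, abstractmethod
from typing import List


class EvalScore: ...            # placeholder for fmeval's EvalScore
class EvalOutput: ...           # placeholder for fmeval's EvalOutput
class EvalAlgorithmConfig: ...  # placeholder for fmeval's EvalAlgorithmConfig


class EvalAlgorithmInterface(ABC):
    """Sketch: only the existence of the two methods and their output types
    are enforced; input arguments are left to concrete subclasses."""

    def __init__(self, eval_algorithm_config: EvalAlgorithmConfig):
        self.eval_algorithm_config = eval_algorithm_config

    @abstractmethod
    def evaluate_sample(self, *args, **kwargs) -> List[EvalScore]:
        """Compute evaluation scores for a single sample."""

    @abstractmethod
    def evaluate(self, *args, **kwargs) -> List[EvalOutput]:
        """Compute aggregated results over one or more datasets."""
```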

Several additional tasks that are accomplished are:

  1. Create util functions for semantic robustness algorithms and a base class, SemanticRobustnessConfig, that configs for the various robustness algorithms can inherit from. Diff
  2. Update SummarizationAccuracy to use the new util functions to shorten code. Source code diff, Unit test diff. Note that a ton of unit test code can be deleted, as all of the logic that was previously tested here is now being tested in the unit tests for the util functions I added.
  3. Update GeneralSemanticRobustness to use the new util functions to shorten code. Source code diff, Unit test diff
  4. Call get_eval_results_path when saving outputs so that the EVAL_RESULTS_PATH environment variable is respected even if it is set after the initialization of an EvalAlgorithmInterface object (see the sketch below). This "bug"/unexpected behavior was observed by @polaschwoebel.
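A minimal sketch of the lazy-resolution behavior item 4 describes, assuming a hypothetical default directory; the real get_eval_results_path lives in fmeval and may differ in details:

```python
import os

# Assumed default, for illustration only.
DEFAULT_EVAL_RESULTS_PATH = "/tmp/eval_results"


def get_eval_results_path() -> str:
    """Resolve the output directory at save time rather than caching it at
    __init__, so setting EVAL_RESULTS_PATH after an eval algorithm has been
    constructed still takes effect when outputs are saved."""
    path = os.environ.get("EVAL_RESULTS_PATH", DEFAULT_EVAL_RESULTS_PATH)
    os.makedirs(path, exist_ok=True)  # ensure the directory exists (sketch detail)
    return path
```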

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

franluca previously approved these changes Mar 22, 2024

@franluca (Contributor) left a comment:

Just a few minor questions.

@danielezhu mentioned this pull request Mar 22, 2024
franluca previously approved these changes Mar 24, 2024
franluca previously approved these changes Mar 25, 2024
polaschwoebel previously approved these changes Mar 25, 2024
Review thread on the new evaluate signature:

    :param target_output: The expected responses for the prompts in model_input
    :return: A list of evaluation scores for the sample.

    def evaluate(self, *args, **kwargs) -> List[EvalOutput]:
Contributor:

Why don't we keep the non-optional named arguments here (save and num_records)?

Review thread on util.evaluate_dataset:

    eval_output = util.evaluate_dataset(
Contributor:

I find it confusing that the function that brings everything together (i.e., running a pipeline on a dataset) is called evaluate_dataset. Could it be called run_pipeline or similar instead? Alternatively, could this be split into two helpers, perhaps prepare_dataset and compute_and_aggregate_metrics?

@danielezhu (Contributor, Author) replied Mar 25, 2024:

I can rename this function to run_pipeline in a follow-up. Thanks!
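For illustration only, a hedged sketch of the two-helper split floated in the comment above; the names come from that comment, while the parameters and behavior are assumptions rather than fmeval's actual API:

```python
from typing import Any, List


def prepare_dataset(dataset: Any, model: Any, prompt_template: str) -> Any:
    """Hypothetical helper: load and validate the dataset and generate model
    responses so that the scoring transforms have every column they need."""
    raise NotImplementedError


def compute_and_aggregate_metrics(
    dataset: Any, pipeline: Any, save: bool = False
) -> List["EvalOutput"]:
    """Hypothetical helper: run the TransformPipeline on the prepared dataset,
    aggregate per-record scores into dataset-level results, and optionally
    save the record-level outputs."""
    raise NotImplementedError
```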

@danielezhu dismissed stale reviews from polaschwoebel and franluca via 54e616b March 25, 2024 17:44
Review thread on SummarizationAccuracy's build_pipeline:

    self.pipeline = TransformPipeline([meteor_score, rouge_score, bert_score])

    @staticmethod
    def build_pipeline(
Contributor:

nit: missing docstring for build_pipeline.

@danielezhu merged commit c460469 into aws:main Mar 25, 2024
@danielezhu deleted the refactor_evaluate branch March 25, 2024 19:31
@malhotra18 (Contributor) left a comment:

This whole PR could easily have been split into small, logical chunks. Any particular reason for not doing that?

A big PR also leads to reviewer inefficiency: it is very easy for a reviewer to miss important changes.

It also leads to an accumulated backlog; I see that a lot of comments are to be addressed in a follow-up CR, which is hard to track as an external reviewer.

Review thread on the evaluate_sample signature:

    def __init__(self, eval_algorithm_config: EvalAlgorithmConfig):
        """Initialize an instance of a subclass of EvalAlgorithmConfig

    @abstractmethod
    def evaluate_sample(self, *args, **kwargs) -> List[EvalScore]:
Contributor:

Why are we moving towards *args, **kwargs here? The idea was to let customers see what the acceptable arguments for the built-in eval algorithms are, too.

  • *args and **kwargs are a slippery slope: they open the door to any number of input variables, leading to confusing signatures. We also want these signatures to serve as guidelines when contributing new eval algorithms, not leave them completely open ended for any input variable.

We want to keep them generic, agreed, but not give complete freedom to take in any input variable.
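To make the trade-off concrete, a hedged sketch of the two styles under discussion; the parameter names in the explicit version are only indicative, taken from the docstring fragments quoted earlier in this thread rather than from the actual diff:

```python
from typing import List, Optional


class OpenEndedStyle:
    # The style adopted in this PR: any inputs are accepted, but a caller
    # cannot tell the expected arguments from the signature alone.
    def evaluate_sample(self, *args, **kwargs) -> List["EvalScore"]:
        raise NotImplementedError


class ExplicitStyle:
    # The style the reviewer argues for: the signature itself documents
    # the accepted inputs (names here are indicative only).
    def evaluate_sample(
        self, model_input: str, target_output: Optional[str] = None
    ) -> List["EvalScore"]:
        raise NotImplementedError
```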

@danielezhu (Contributor, Author) replied:

Discussed offline and I will restore the original signatures in a follow-up PR.

f"The value should be at least 2."
)
super().__post_init__()
require(
Contributor:

Subjective: I know we follow this style in internal packages. I am not a fan of it, though, because it is not readable and requires an extra hop in the code to see what kind of exception is being thrown.

And it does not help anyway, as there is no impact on code modularity or verbosity.

Hence we intentionally showcased exceptions being thrown explicitly.
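For illustration, a hedged sketch of the two styles being compared; require and EvalAlgorithmClientError are written here as stand-ins rather than the exact fmeval definitions, and the num_perturbations parameter is hypothetical:

```python
class EvalAlgorithmClientError(ValueError):
    """Stand-in for the library's client-error exception."""


def require(condition: bool, message: str) -> None:
    """Internal-package style: the exception type is hidden behind a helper."""
    if not condition:
        raise EvalAlgorithmClientError(message)


# Helper style: the reader must jump into require() to learn what is raised.
def validate_with_require(num_perturbations: int) -> None:
    require(num_perturbations >= 2, "The value should be at least 2.")


# Explicit style the reviewer prefers: the exception is visible at the call site.
def validate_with_raise(num_perturbations: int) -> None:
    if num_perturbations < 2:
        raise EvalAlgorithmClientError("The value should be at least 2.")
```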
