refactor: move repeated code in evaluate method into util functions and simplify the EvalAlgorithmInterface method signatures #224
Conversation
Just a few minor questions.
    evaluated on
    :param target_output: The expected responses for the prompts in model_input
    :return: list evaluation scores for the sample.
def evaluate(self, *args, **kwargs) -> List[EvalOutput]:
Why don't we keep the non-optional named arguments here (save and num_records)?
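For illustration, a middle ground could keep the shared arguments explicit while still allowing algorithm-specific ones. This is a hypothetical sketch: the names save and num_records come from the comment above, while the other parameters, their defaults, and the class name are assumptions.

```python
from typing import List, Optional


class SomeEvalAlgorithm:
    # Hypothetical sketch only: save and num_records come from the review comment above;
    # the other parameters, their defaults, and the class name are assumptions.
    def evaluate(
        self,
        model=None,
        dataset_config=None,
        prompt_template: Optional[str] = None,
        num_records: int = 100,
        save: bool = False,
        **kwargs,
    ) -> List["EvalOutput"]:  # EvalOutput as defined in this package
        """Evaluate the model on the configured dataset(s)."""
        raise NotImplementedError
```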
    ),
)

eval_output = util.evaluate_dataset(
I find it confusing that the function that brings everything together (i.e., running a pipeline on a dataset) is called evaluate_dataset. Could it be called run_pipeline or similar instead? Alternatively, could it be split into two helpers, perhaps prepare_dataset and compute_and_aggregate_metrics?
I can rename this function to run_pipeline in a follow-up. Thanks!
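For readers of this thread, the split suggested above might look roughly like the following sketch. The helper names are taken from the comment; their parameters and division of labor are assumptions, not the actual util API.

```python
# Hypothetical sketch of the split suggested above. The helper names come from the
# review comment; the parameters and division of labor are assumptions.
def prepare_dataset(dataset_config, prompt_template, num_records):
    """Load and sample the dataset, then generate any model responses it is missing."""
    raise NotImplementedError


def compute_and_aggregate_metrics(pipeline, dataset, save=False):
    """Run the transform pipeline over the dataset and aggregate the resulting scores."""
    raise NotImplementedError


# Together these two helpers would cover what the single call above does today:
#   eval_output = util.evaluate_dataset(...)
# (to be renamed run_pipeline per the discussion above).
```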
self.pipeline = TransformPipeline([meteor_score, rouge_score, bert_score])

@staticmethod
def build_pipeline(
nit: missing docstring for build_pipeline.
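For completeness, the missing docstring might read something like the sketch below. It is shown on SummarizationAccuracy because the hunk above builds that algorithm's pipeline, but the parameter names are illustrative, since the full build_pipeline signature is cut off in the diff.

```python
class SummarizationAccuracy:
    @staticmethod
    def build_pipeline(target_output_keys, model_output_key) -> "TransformPipeline":
        """Build the TransformPipeline that computes this algorithm's scores.

        Illustrative sketch only; the real parameter list is truncated in the hunk above.

        :param target_output_keys: Record keys holding the reference (target) outputs.
        :param model_output_key: Record key holding the model's generated output.
        :return: A TransformPipeline composed of the METEOR, ROUGE, and BertScore transforms.
        """
        raise NotImplementedError
```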
This whole PR could easily be split into small logical chunks. Any particular reason for not doing that?
A big PR also hurts reviewer efficiency; it is very easy for a reviewer to miss important changes.
It also leads to an accumulated backlog: I see that a lot of comments are to be addressed in a follow-up CR, which is hard to track as an external reviewer.
def __init__(self, eval_algorithm_config: EvalAlgorithmConfig):
    """Initialize an instance of a subclass of EvalAlgorithmConfig

@abstractmethod
def evaluate_sample(self, *args, **kwargs) -> List[EvalScore]:
Why are we moving towards *args, **kwargs here? The idea was to let customers see what the acceptable arguments are for the built-in eval algorithms too.
*args and **kwargs are a slippery slope: they open the door to an arbitrary number of input variables, leading to confusing signatures. We also want these signatures to serve as guidelines when contributing new eval algorithms, not to leave them completely open-ended for any input variable.
Agreed that we want to keep them generic, but not to the point of giving complete freedom to take in any input variable.
Discussed offline and I will restore the original signatures in a follow-up PR.
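For reference in this thread, an explicit per-algorithm signature (roughly the shape being restored) could look like the sketch below. It is a hypothetical reconstruction based on the docstring fragments quoted in the diff above; the class name and exact parameters are assumptions.

```python
from typing import List


class SomeConcreteEvalAlgorithm:
    # Hypothetical reconstruction from the docstring fragments quoted in the diff above;
    # the class name and exact parameter names are assumptions.
    def evaluate_sample(self, target_output: str, model_output: str) -> List["EvalScore"]:
        """Compute evaluation scores for a single sample.

        :param target_output: The expected response for the prompt.
        :param model_output: The model's actual response to the prompt.
        :return: A list of evaluation scores for the sample.
        """
        raise NotImplementedError
```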
| f"The value should be at least 2." | ||
| ) | ||
| super().__post_init__() | ||
| require( |
Subjective:
I know we follow this style in internal packages. I am not a fan of it though, because it is less readable: it requires an extra hop in the code to see what kind of exception is being thrown.
It also doesn't really help, as there is no impact on code modularity or verbosity.
That is why we intentionally showcased exceptions being raised explicitly.
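To make the trade-off concrete, the two styles under discussion look roughly like this minimal sketch. The require and EvalAlgorithmClientError definitions here are stand-ins, and the num_perturbations name is an assumption; only the error message text comes from the diff above.

```python
class EvalAlgorithmClientError(Exception):
    """Stand-in for the package's client-facing exception (the real name is an assumption)."""


def require(condition: bool, message: str) -> None:
    """Minimal stand-in for the require() helper shown in the diff above."""
    if not condition:
        raise EvalAlgorithmClientError(message)


num_perturbations = 5  # illustrative value; the field name is an assumption

# Style used in this PR: the helper hides which exception type is raised.
require(num_perturbations >= 2, "Invalid num_perturbations. The value should be at least 2.")

# Style favored in the comment above: the exception type is visible at the call site.
if num_perturbations < 2:
    raise EvalAlgorithmClientError("Invalid num_perturbations. The value should be at least 2.")
```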
Note
Apologies in advance for the massive diff. Note that although 25 files have been changed, the majority of these files have just a couple of lines changed (see bullet point 4 in the list below). Also, a huge chunk of the diff comes from the unit tests I wrote for the new util functions, and from deleting a bunch of unit tests that are now redundant.
Description of changes:
This PR accomplishes two main tasks:
1. Moves the repeated code in the evaluate method of nearly all evaluation algorithms into util functions. Source code diff, Unit test diff. Since these util functions will be called from the evaluate method of all eval algos as we move forward with the eval algo redesign, the amount of source code for eval algorithms will drastically shrink, as will the amount of unit test code. See the diff for the SummarizationAccuracy unit tests for how much we can get rid of per algorithm.
2. Simplifies EvalAlgorithmInterface so that its role is reduced to enforcing the implementation of the two methods: evaluate_sample and evaluate, where full flexibility regarding their input arguments is given to implementers of concrete subclasses of EvalAlgorithmInterface. The one constraint that we continue to enforce is the output signature. Diff.

Several additional tasks that are accomplished are:
- Adds a shared config class, SemanticRobustnessConfig, that configs for the various robustness algorithms can inherit from. Diff
- Updates SummarizationAccuracy to use the new util functions to shorten code. Source code diff, Unit test diff. Note that a ton of unit test code can be deleted, as all of the logic that was previously tested here is now being tested in the unit tests for the util functions I added.
- Updates GeneralSemanticRobustness to use the new util functions to shorten code. Source code diff, Unit test diff
- Calls get_eval_results_path when saving outputs so that the EVAL_RESULTS_PATH environment variable is respected even if it is set after the initialization of an EvalAlgorithmInterface object. This "bug"/unexpected behavior has been observed by @polaschwoebel. (A minimal sketch of this lazy lookup follows this description.)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
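The results-path change in the last bullet amounts to reading the environment variable at call time instead of caching it when the algorithm object is constructed. A minimal sketch of that behavior follows; the EVAL_RESULTS_PATH name comes from the description above, while the fallback path and directory creation are assumptions for illustration.

```python
import os


def get_eval_results_path() -> str:
    # Minimal sketch: look up EVAL_RESULTS_PATH each time results are about to be saved,
    # so setting the variable after the eval algorithm is constructed still takes effect.
    # The fallback path and makedirs call are assumptions for illustration.
    results_path = os.environ.get("EVAL_RESULTS_PATH", "/tmp/eval_results")
    os.makedirs(results_path, exist_ok=True)
    return results_path
```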