fix: replace add_column with map in _generate_prompt_column #161
Issue #, if available:
Description of changes:
This PR replaces the last instance of `dataset.add_column` (which I somehow missed when submitting PR #115) with `dataset.map`. Note that this PR is not necessary for correctness, but simply to make our code uniform (it's kind of weird to have just one function that uses `add_column` while all the others use `map`).

Note: a question that likely arises is "How come the integration test that was added to `test_factual_knowledge.py` (which was supposed to validate the fix) was able to pass, given that `_generate_prompt_column` was still using `add_column`, which should've caused inconsistencies in the batch formats in the Ray task graph?"

The reason the test passed despite me not replacing every instance of `add_column` is that we don't strictly need to replace every instance; only calls to `add_column` that occur immediately before a dataset aggregation operation (for example, `dataset.mean`) need to be replaced with `map`. I should've explained this more clearly in my description for PR #115.

If you recall, errors like `'DataFrame' object has no attribute 'num_columns'` and `'pyarrow.lib.Table' object has no attribute 'reset_index'` occur when reducing mapped outputs (see this PR). If you drill deeper, you will see that only the batch format of the mapped outputs that are directly consumed by the reduction/aggregation operation matters. Thus, it is perfectly valid to do the following:
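(The original snippet here appears to have been lost. As a stand-in, here is a toy, pure-Python sketch — not fmeval or Ray code, and all stage names are hypothetical — of why only the stage immediately feeding the aggregation needs a compatible batch format:)

```python
# Toy pipeline: each stage may emit a different "batch format". The
# aggregation only ever consumes the output of the stage directly
# before it, so earlier format changes are harmless.

def add_column_stage(batches):
    # Stand-in for dataset.add_column: emits dict-of-lists batches
    # (a columnar, pandas-like format).
    for batch in batches:
        yield {"score": [row["score"] for row in batch]}

def map_stage(batches):
    # Stand-in for dataset.map: converts back to list-of-dicts batches
    # (a row-oriented format).
    for batch in batches:
        yield [{"score": s} for s in batch["score"]]

def mean(batches):
    # Aggregation: only needs to agree with the format of the stage
    # immediately upstream (here, map_stage's list-of-dicts output).
    values = [row["score"] for batch in batches for row in batch]
    return sum(values) / len(values)

data = [[{"score": 1.0}, {"score": 3.0}], [{"score": 5.0}]]
# add_column_stage changes the batch format mid-pipeline, but map_stage
# restores the format that mean() expects, so the reduction succeeds.
print(mean(map_stage(add_column_stage(data))))  # 3.0
```

Feeding `add_column_stage`'s columnar output directly into `mean` would fail, which is the analogue of the attribute errors above: the mismatch only bites at the reduction boundary.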
Since PR #115 replaced all of the calls to `add_column` that occur immediately before `aggregate_evaluation_scores` (which calls `dataset.map`), it successfully got rid of the root cause of the `'DataFrame' object has no attribute 'num_columns'` and `'pyarrow.lib.Table' object has no attribute 'reset_index'` errors.

Replacing all other calls to `add_column` (i.e. the one that I missed in `_generate_prompt_column`) is purely a matter of style, not correctness.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.