@keerthanvasist keerthanvasist commented May 28, 2024

Add SaveStrategy to allow flexibility in saving localized evaluation outputs

Description of changes:
Introduces a new class, SaveStrategy, that lets users define their own saving strategy for localized evaluation outputs. Because the computations are distributed, pulling all of a large dataset to the head node can cause OOM errors. To avoid that, the data is pulled in batches, and the save function is called on one batch at a time. To support this mechanism while allowing more flexibility in how outputs are saved, the class works as a context manager.
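To make the batch-wise mechanism concrete, here is a minimal sketch of a context-manager save strategy. It is not the code from this PR: only the `SaveStrategy` name and the `save(records)` call appear in this conversation, while the abstract interface, `JsonLinesFileSaveStrategy`, `iter_batches`, and `save_in_batches` are hypothetical names introduced for illustration, and plain dicts stand in for evaluation output records.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class SaveStrategy(ABC):
    """Illustrative base class (assumed shape; the PR's real interface may differ)."""

    @abstractmethod
    def __enter__(self) -> "SaveStrategy": ...

    @abstractmethod
    def save(self, records: List[Dict]) -> None: ...

    @abstractmethod
    def __exit__(self, exc_type, exc_value, traceback) -> bool: ...


class JsonLinesFileSaveStrategy(SaveStrategy):
    """Hypothetical strategy that writes each batch to a local JSON Lines file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._file = None

    def __enter__(self) -> "JsonLinesFileSaveStrategy":
        # Acquire the resource once, before the first batch arrives.
        self._file = open(self.path, "w", encoding="utf-8")
        return self

    def save(self, records: List[Dict]) -> None:
        # Called once per batch, so only one batch is held in memory at a time.
        for record in records:
            self._file.write(json.dumps(record) + "\n")

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self._file.close()
        return False  # do not suppress exceptions raised inside the with-block


def iter_batches(records: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def save_in_batches(records: List[Dict], strategy: SaveStrategy, batch_size: int) -> int:
    """Save records batch by batch and return the number of batches written."""
    num_batches = 0
    with strategy:
        for batch in iter_batches(records, batch_size):
            strategy.save(batch)
            num_batches += 1
    return num_batches


records = [{"id": i, "score": i / 10} for i in range(7)]
output_path = os.path.join(tempfile.mkdtemp(), "eval_output.jsonl")
num_batches = save_in_batches(records, JsonLinesFileSaveStrategy(output_path), batch_size=3)
```

Setup and teardown live in `__enter__` and `__exit__`, so a strategy that needs a start and a finish step (such as a multipart upload) can wrap any number of `save` calls without the caller knowing the details.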

This PR looks big, but it is in essence a small change that is reflected in every evaluation algorithm. The main changes are in save_strategy.py and eval_algorithms/common.py, both of which are new files.

Incidentally, this PR also updates the ray version from 2.9.1 to 2.23.0.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@keerthanvasist keerthanvasist force-pushed the saver branch 4 times, most recently from 4de8c13 to 605814c Compare May 28, 2024 14:51

@danielezhu danielezhu left a comment

Most comments are nits about docstring issues, but I also left some suggestions on the save strategy source code and unit tests. Thanks!

},
)
] * 3
with patch.object(s3_client, "upload_part", return_value={"ETag": 1}) as upload_part, patch.object(

Can we use a side effect for upload_part where it returns {"ETag": "etag_1"}, {"ETag": "etag_2"}, {"ETag": "etag_3"}? Later, we can validate that self._parts_info contains these values.
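As a sketch of the suggestion above, `side_effect` on a mock can be given a list so that each successive call returns the next value. The example below uses a bare `MagicMock` in place of the real S3 client, and the bucket, key, and body values are made up for illustration; the loop stands in for repeated `save` calls and is not the PR's implementation.

```python
from unittest.mock import MagicMock

# Stand-in for the S3 client; no network calls are made.
s3_client = MagicMock()

# Each call to upload_part returns the next item in this list.
s3_client.upload_part.side_effect = [
    {"ETag": "etag_1"},
    {"ETag": "etag_2"},
    {"ETag": "etag_3"},
]

# Simulate three uploads and record the part info after each one,
# assuming part numbers increase by one per upload.
parts = []
for part_number in range(1, 4):
    response = s3_client.upload_part(
        Bucket="example-bucket",  # made-up value
        Key="example-key",        # made-up value
        PartNumber=part_number,
        UploadId="1234",
        Body=b"batch data",
    )
    parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
```

With distinct return values per call, a test can check that every ETag ends up in the recorded parts in the right order, which a single shared `return_value` cannot show.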

for _ in range(num_of_save_times):
    save_strategy.save(records)
assert upload_part.call_count == 3
assert complete_multipart_upload.call_count == 1

Following up from the comment above, at this point, self._parts_info[PARTS] should look like

[
   {PART_NUMBER: 1, E_TAG: "etag_1"},
   {PART_NUMBER: 2, E_TAG: "etag_2"},
   {PART_NUMBER: 3, E_TAG: "etag_3"}
]

so we can use

complete_multipart_upload.assert_called_once_with(
     Bucket=# mocked value,
     Key=# mocked value,
     UploadId="1234",
     MultipartUpload={PARTS: # the list above},
)
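A self-contained sketch of that assertion follows, again with a `MagicMock` standing in for the S3 client. The bucket and key are made-up values, the literal keys `"Parts"`, `"PartNumber"`, and `"ETag"` are assumed to be what the `PARTS`, `PART_NUMBER`, and `E_TAG` constants hold, and part numbers are assumed to increase by one per uploaded part.

```python
from unittest.mock import MagicMock

# Stand-in for the S3 client; no network calls are made.
s3_client = MagicMock()

# Part info as it would look after three uploads (assumed key names).
expected_parts = [
    {"PartNumber": 1, "ETag": "etag_1"},
    {"PartNumber": 2, "ETag": "etag_2"},
    {"PartNumber": 3, "ETag": "etag_3"},
]

# In a real test the code under test would make this call on exit;
# it is invoked directly here to keep the sketch runnable.
s3_client.complete_multipart_upload(
    Bucket="example-bucket",  # made-up value
    Key="example-key",        # made-up value
    UploadId="1234",
    MultipartUpload={"Parts": expected_parts},
)

# Raises AssertionError if the call count or any keyword argument differs.
s3_client.complete_multipart_upload.assert_called_once_with(
    Bucket="example-bucket",
    Key="example-key",
    UploadId="1234",
    MultipartUpload={"Parts": expected_parts},
)
```

Compared with checking only `call_count`, this also verifies that the upload ID and the full ordered list of parts were passed through.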

@keerthanvasist keerthanvasist merged commit 3ce289f into aws:main May 28, 2024
oyangz added a commit that referenced this pull request Jun 11, 2024
* feat: add SaveStrategy to allow flexibility in saving localized evaluation outputs (#281)

* feat: modify GeneratePrompt transform to take placeholder_dict (#288)

* feat: modify GeneratePrompt transform to take placeholder_dict

* fix: unit test

* fix: requested changes

---------

Co-authored-by: keerthanvasist <[email protected]>
Co-authored-by: Xiaoyi Cheng <[email protected]>