[data] Add TopK aggregator #59556

cem-anyscale · 2025-12-18T18:50:06Z

Description

Add a TopKUnique aggregator that computes the k most frequent values.

* Add TopKUnique aggregator that computes most frequent k values

Signed-off-by: cem <[email protected]>

gemini-code-assist

Code Review

This pull request introduces a new TopKUnique aggregator, which is a valuable addition for computing the most frequent unique values in a column. The implementation correctly builds upon the existing ValueCounter and utilizes heapq.nlargest for efficient top-k computation. The accompanying tests are thorough, covering basic functionality, global frequency aggregation across blocks, and various edge cases. I have a couple of suggestions for a minor code cleanup to improve readability and a recommendation to enhance test robustness by adding a case for frequency ties.

python/ray/data/aggregate.py

python/ray/data/tests/test_aggregations.py
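The approach the review describes (count values per block, combine the counts globally, then select the k largest with heapq.nlargest) can be sketched in plain Python. This is a minimal illustration and not the code from this pull request: the function names count_block, merge_counts, and top_k are hypothetical, collections.Counter stands in for the ValueCounter aggregator whose internals are not shown here, and the order among values with equal counts is left unspecified.

```python
import heapq
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def count_block(values: Iterable[Hashable]) -> Counter:
    """Per-block step (hypothetical name): count each value within one block."""
    return Counter(values)


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Combine step (hypothetical name): sum per-block counts into global counts."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total


def top_k(counts: Counter, k: int) -> List[Tuple[Hashable, int]]:
    """Finalize step (hypothetical name): the k (value, count) pairs with the highest counts."""
    # Ordering among equal counts is not something this sketch defines.
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


# Three blocks of one column; frequencies only become meaningful once merged.
blocks = [["a", "b", "a"], ["b", "a", "c"], ["d"]]
result = top_k(merge_counts(count_block(b) for b in blocks), k=2)
print(result)  # [('a', 3), ('b', 2)]
```

Selecting the top k only after merging matters: a value that is rare in every individual block can still rank among the most frequent globally, so truncating per block could drop it. A test for frequency ties, as the review recommends, would need to pin down which of the tied values is expected.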

richardliaw · 2025-12-18T18:54:25Z

Why is it called TopKUnique and not just TopK?

cem-anyscale · 2025-12-18T18:57:06Z

> Why is it called TopKUnique and not just TopK?

Yeah TopK would be better; will rename.

python/ray/data/aggregate.py

* rename aggregator name
* rename default alias name

Signed-off-by: cem <[email protected]>

Signed-off-by: cem <[email protected]>

github-actions · 2026-01-02T00:43:11Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

[data] Add TopKUnique aggregator

d9847a6

* Add TopKUnique aggregator that computes most frequent k values

Signed-off-by: cem <[email protected]>

cem-anyscale requested a review from a team as a code owner December 18, 2025 18:50

gemini-code-assist bot reviewed Dec 18, 2025

python/ray/data/aggregate.py (outdated, resolved)

python/ray/data/tests/test_aggregations.py (outdated, resolved)

ray-gardener bot added the "data" label (Ray Data-related issues) Dec 18, 2025

richardliaw changed the title ~~[data] Add TopKUnique aggregator~~ [data] Add TopK aggregator Dec 18, 2025

cursor bot reviewed Dec 18, 2025

python/ray/data/aggregate.py (outdated, resolved)

* simplify code

c322ab8

* rename aggregator name
* rename default alias name

Signed-off-by: cem <[email protected]>

cem-anyscale force-pushed the cem/topk_v2 branch from 047ae0e to c322ab8 December 18, 2025 19:25

run pre_commit

91b47ec

Signed-off-by: cem <[email protected]>

cem-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Dec 18, 2025

add annotation

b61f813

Signed-off-by: cem <[email protected]>

github-actions bot added the "stale" label (The issue is stale. It will be closed within 7 days unless there are further conversation) Jan 2, 2026

3 participants