[data] Add TopK aggregator #59556

cem-anyscale · 2025-12-18T18:50:06Z

Description

Add a TopKUnique aggregator that computes the k most frequent values.

* Add TopKUnique aggregator that computes most frequent k values

Signed-off-by: cem <[email protected]>

gemini-code-assist

Code Review

This pull request introduces a new TopKUnique aggregator, which computes the most frequent unique values in a column. The implementation builds on the existing ValueCounter and uses heapq.nlargest for efficient top-k computation. The accompanying tests cover basic functionality, global frequency aggregation across blocks, and various edge cases. I have two suggestions: a minor code cleanup to improve readability, and an additional test case for frequency ties to make the tests more robust.
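For readers who have not opened the diff, the approach summarized in the review can be sketched roughly as follows. This is a standalone illustration and not the code from this PR: `count_block`, `merge_counts`, and `top_k` are hypothetical names, `collections.Counter` stands in for the ValueCounter state, and the tie-breaking rule (larger value wins on equal counts) is an assumption made here so the output is deterministic.

```python
import heapq
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def count_block(values: Iterable[Hashable]) -> Counter:
    """Per-block step: count how often each value occurs in one block."""
    return Counter(values)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Combine step: sum two partial counts into one."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds counts instead of replacing them
    return merged


def top_k(counts: Counter, k: int) -> List[Tuple[Hashable, int]]:
    """Finalize step: return the k (value, count) pairs with the highest counts.

    Assumption of this sketch: ties on count are broken by the larger value,
    which requires the values to be mutually orderable.
    """
    return heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], kv[0]))


# Two blocks whose frequencies only make sense once merged globally.
blocks = [["a", "b", "a"], ["b", "c", "a"]]
global_counts = Counter()
for block in blocks:
    global_counts = merge_counts(global_counts, count_block(block))

result = top_k(global_counts, 2)
print(result)  # [('a', 3), ('b', 2)]
```

Without an explicit tie-break, which of several equally frequent values lands in the top k can depend on the order in which blocks are merged, which is presumably why the review asks for a test with frequency ties.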

python/ray/data/aggregate.py

python/ray/data/tests/test_aggregations.py

richardliaw · 2025-12-18T18:54:25Z

Why is it called TopKUnique and not just TopK?

cem-anyscale · 2025-12-18T18:57:06Z

> Why is it called TopKUnique and not just TopK?

Yeah TopK would be better; will rename.

python/ray/data/aggregate.py

* rename aggregator name
* rename default alias name

Signed-off-by: cem <[email protected]>

Signed-off-by: cem <[email protected]>

[data] Add TopKUnique aggregator

d9847a6

* Add TopKUnique aggregator that computes most frequent k values

Signed-off-by: cem <[email protected]>

cem-anyscale requested a review from a team as a code owner December 18, 2025 18:50

gemini-code-assist bot reviewed Dec 18, 2025

python/ray/data/aggregate.py (outdated, resolved)

python/ray/data/tests/test_aggregations.py (outdated, resolved)

ray-gardener bot added the data label (Ray Data-related issues) Dec 18, 2025

richardliaw changed the title from "[data] Add TopKUnique aggregator" to "[data] Add TopK aggregator" Dec 18, 2025

cursor bot reviewed Dec 18, 2025

python/ray/data/aggregate.py (outdated, resolved)

* simplify code

c322ab8

* rename aggregator name
* rename default alias name

Signed-off-by: cem <[email protected]>

cem-anyscale force-pushed the cem/topk_v2 branch from 047ae0e to c322ab8 December 18, 2025 19:25

run pre_commit

91b47ec

Signed-off-by: cem <[email protected]>

cem-anyscale added the go label (add ONLY when ready to merge, run all tests) Dec 18, 2025

add annotation

b61f813

Signed-off-by: cem <[email protected]>
