
@suraj-subrahmanyan
Contributor

This PR addresses issue #2937. I ran a JMH microbenchmark on the deletes.contains() call, since it runs once for each tweet processed.

Speed Comparison:

A: Set Size = 10,000 IDs, Hit Ratio = 10%
-[Java HashSet] Lookup Perf: 50,000 lookups in 2.103 ms (23.78 Mops/s)
-[FastUtil] Lookup Perf: 50,000 lookups in 2.243 ms (22.29 Mops/s)
-[HPPC] Lookup Perf: 50,000 lookups in 2.475 ms (20.20 Mops/s)

B: Set Size = 50,000 IDs, Hit Ratio = 15%
-[Java HashSet] Lookup Perf: 50,000 lookups in 55.943 ms (0.894 Mops/s)
-[FastUtil] Lookup Perf: 50,000 lookups in 39.959 ms (1.251 Mops/s)
-[HPPC] Lookup Perf: 50,000 lookups in 34.732 ms (1.440 Mops/s)
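
For reference, the Mops/s figures above are just lookups divided by elapsed time. Below is a minimal sketch of how such a lookup run could be set up with a plain `HashSet<Long>` and a fixed hit ratio. It is not the JMH benchmark used for the numbers above: the class name `LookupSketch`, its helper methods, and the way probes are generated are all assumptions made for illustration, and a `System.nanoTime()` loop is far less reliable than JMH.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical harness for illustration only; not the JMH benchmark from this PR.
final class LookupSketch {

    // Converts a lookup count and an elapsed time in milliseconds to millions of ops per second.
    static double mops(long lookups, double millis) {
        return lookups / (millis * 1_000.0);
    }

    // Builds a set holding the IDs 0 .. size-1 (stand-in for the set of deleted tweet IDs).
    static Set<Long> buildDeletes(int size) {
        Set<Long> deletes = new HashSet<>(size * 2);
        for (long id = 0; id < size; id++) {
            deletes.add(id);
        }
        return deletes;
    }

    // Probes the set `lookups` times. In every block of 100 probes, the first
    // `hitPercent` probes are known members and the rest are negative IDs (guaranteed misses).
    static int countHits(Set<Long> deletes, int lookups, int hitPercent) {
        int hits = 0;
        int size = deletes.size();
        for (int i = 0; i < lookups; i++) {
            boolean shouldHit = (i % 100) < hitPercent;
            long probe = shouldHit ? (i % size) : -1L - i;
            if (deletes.contains(probe)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<Long> deletes = buildDeletes(10_000);
        int lookups = 50_000;

        long start = System.nanoTime();
        int hits = countHits(deletes, lookups, 10);
        double millis = (System.nanoTime() - start) / 1_000_000.0;

        System.out.printf("%d lookups, %d hits, %.3f ms (%.2f Mops/s)%n",
                lookups, hits, millis, mops(lookups, millis));
    }
}
```

As a sanity check on the reported numbers, `mops(50_000, 2.103)` evaluates to roughly 23.78, which matches the Java HashSet line in scenario A.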

Not included here, but in a JVM benchmark Java HashSet had much higher memory consumption and was inconsistent.

Size comparison:

| Name | Size |
| --- | --- |
| Current Anserini Fat JAR | 177MB |
| HPPC Dependency Net Contribution | <1MB |
| FastUtil Dependency (for reference) | 23MB |

Since the com.carrotsearch.hppc dependency is small and performs comparably to fastutil, it is a good replacement. Also, com.carrotsearch.hppc is more stable than Lucene's org.apache.lucene.internal.hppc.LongHashSet, since it is a public library rather than an internal Lucene package.
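
For context on why a primitive set can help here: `HashSet<Long>` boxes every ID, while primitive collections keep `long` values directly in an array. The sketch below shows that general idea with a tiny open-addressing set. It is not HPPC's `LongHashSet` and makes no claim about how HPPC or fastutil are implemented; the class name `TinyLongSet`, the hash mixing constant, and the 0.75 load factor are assumptions chosen for illustration.

```java
// Illustrative only: a tiny open-addressing set of primitive longs.
// This is NOT HPPC's LongHashSet; the name and design are assumptions.
final class TinyLongSet {
    private long[] keys = new long[16];        // capacity is always a power of two
    private boolean[] used = new boolean[16];  // marks which slots hold a key
    private int size;

    // Adds a key; returns false if it was already present.
    boolean add(long key) {
        if ((size + 1) * 4 > keys.length * 3) {
            grow();                            // keep the load factor at or below 0.75
        }
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (used[slot]) {                   // linear probing
            if (keys[slot] == key) {
                return false;
            }
            slot = (slot + 1) & mask;
        }
        used[slot] = true;
        keys[slot] = key;
        size++;
        return true;
    }

    // Membership test with no boxing: compares raw long values.
    boolean contains(long key) {
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (used[slot]) {
            if (keys[slot] == key) {
                return true;
            }
            slot = (slot + 1) & mask;
        }
        return false;
    }

    int size() {
        return size;
    }

    // Doubles the capacity and re-inserts every stored key.
    private void grow() {
        long[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                add(oldKeys[i]);
            }
        }
    }

    // Spreads the bits of the key before masking it into a slot index.
    private static int mix(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32));
    }

    public static void main(String[] args) {
        TinyLongSet deletes = new TinyLongSet();
        for (long id = 0; id < 1_000; id++) {
            deletes.add(id * 7);
        }
        System.out.println(deletes.contains(21L));
        System.out.println(deletes.contains(22L));
    }
}
```

In practice a maintained library is preferable to hand-rolling this; the sketch is only meant to show where the boxing overhead goes away.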

@lintool
Member

lintool commented Aug 23, 2025

@suraj-subrahmanyan The advantage of using org.apache.lucene.internal.hppc.LongHashSet is that no new dependency is needed. Can you change the code to switch over to that?

@suraj-subrahmanyan
Contributor Author

suraj-subrahmanyan commented Aug 23, 2025

Yeah, that makes sense -- I was actually thinking the same, since it would likely have similar performance and add nothing to the fat JAR size. However, org.apache.lucene.internal.hppc.LongHashSet was added in Release 9.11.0 of Lucene, and upgrading the Lucene version breaks Anserini (?)

@lintool
Member

lintool commented Aug 23, 2025

> org.apache.lucene.internal.hppc.LongHashSet was added in Release 9.11.0 of Lucene

Oh, I see... :(

LG for now. Okay, let me coordinate with #2939 for merging.

@lintool lintool merged commit 9f65a8c into castorini:master Sep 16, 2025
1 check passed