
Conversation

Contributor

@suraj-subrahmanyan suraj-subrahmanyan commented Aug 17, 2025

This PR adds two fields to the index statistics output:

  1. the physical location of the index
  2. the total size of the index on disk

Sample of output:

I took a subset of the MS MARCO corpus (10,000 passages), built an index, and then ran: bin/run.sh io.anserini.index.IndexReaderUtils -index indexes/msmarco-passage/small-index -stats

OUTPUT:

Index statistics
----------------
documents:             10000
documents (non-empty): 10000
unique terms:          25760
total terms:           372638
index_path:            /home/ssubr/RESEARCH/forks/anserini/indexes/msmarco-passage/lucene-index-msmarco-small
total_size:            4.7 MB

System.out.println("unique terms: " + results.get("unique_terms"));
System.out.println("total terms: " + results.get("total_terms"));
System.out.println("physical location: " + indexPath.toAbsolutePath());
System.out.println("total size on disk: " + new File(indexPath.toString()).length() + " bytes");
Member

You probably want to gather this in Map<String, Object> results = IndexReaderUtils.getIndexStats(reader, args.field); so that other calls to getIndexStats will also have these key-value pairs?

Contributor Author

That's a much better way to handle it, I will fix it
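A minimal sketch of the suggestion above, for illustration only. The helper name withDiskStats is hypothetical and is not part of Anserini; the key names index_path and total_size are the ones proposed further down in this review, and the byte count is passed in rather than computed here. The idea is that the location and size become entries in the returned map, so any caller of the stats method sees them, not just the command-line printer.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class StatsMapSketch {
  // Hypothetical helper: returns a copy of the stats map with the index
  // location and size added as ordinary key-value pairs.
  public static Map<String, Object> withDiskStats(
      Map<String, Object> stats, Path indexPath, long totalBytes) {
    Map<String, Object> results = new LinkedHashMap<>(stats);
    results.put("index_path", indexPath.toAbsolutePath().toString());
    results.put("total_size", totalBytes);
    return results;
  }

  public static void main(String[] args) {
    Map<String, Object> stats = new LinkedHashMap<>();
    stats.put("documents", 10000);
    // The path and byte count below are placeholder values.
    Map<String, Object> results =
        withDiskStats(stats, Paths.get("indexes/example"), 4928307L);
    results.forEach((key, value) -> System.out.println(key + ": " + value));
  }
}
```

With this shape, the printing code only reads from the map and does no file-system work of its own.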

Member

lintool commented Aug 17, 2025

Reorganizing of imports is fine.

long totalSize = findDirectorySize(indexPath);
results.put("total_size_disk", totalSize);


Member

extra newline

Contributor Author

yea, I saw that -- have been making changes and will commit shortly to resolve some things

Member

@lintool lintool left a comment

Please show a sample of the output.

reader.close();
}

public static long findDirectorySize(Path path) throws IOException {
Member

two-space indent please.
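For context, the body of findDirectorySize is not shown in this conversation. The following is one possible sketch, written with two-space indentation; only the method signature comes from the diff, and the class name and the use of Files.walk are assumptions. A recursive walk is needed because File.length() has an unspecified return value when the path denotes a directory, which is what the earlier version of the PR called on the index path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class DirectorySizeSketch {
  // Sums the sizes of all regular files under the given path, recursively.
  public static long findDirectorySize(Path path) throws IOException {
    // Files.walk holds directory handles open, so close the stream when done.
    try (Stream<Path> files = Files.walk(path)) {
      return files
          .filter(Files::isRegularFile)
          .mapToLong(p -> p.toFile().length())
          .sum();
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a small throwaway directory to measure.
    Path dir = Files.createTempDirectory("index-sketch");
    Files.write(dir.resolve("segment_a"), new byte[1024]);
    Files.write(dir.resolve("segment_b"), new byte[2048]);
    System.out.println(findDirectorySize(dir) + " bytes"); // prints "3072 bytes"
  }
}
```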

return String.format("%.1f %s", size, units[unitIndex]);
}
}

Member

trim extra lines.
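Only the final return statement of the size-formatting helper appears in the snippet above. A sketch of how such a helper might look follows; the method name formatSize, the unit labels, and the 1024 divisor are all assumptions, chosen so that the result matches the "4.7 MB" style seen in the sample output.

```java
import java.util.Locale;

class SizeFormatSketch {
  // Converts a byte count to a short human-readable string such as "1.5 KB".
  public static String formatSize(long bytes) {
    String[] units = {"B", "KB", "MB", "GB", "TB"};
    double size = bytes;
    int unitIndex = 0;
    // Divide by 1024 until the value fits the current unit or units run out.
    while (size >= 1024 && unitIndex < units.length - 1) {
      size /= 1024;
      unitIndex++;
    }
    // Locale.ROOT keeps the decimal separator a period on every machine.
    return String.format(Locale.ROOT, "%.1f %s", size, units[unitIndex]);
  }

  public static void main(String[] args) {
    System.out.println(formatSize(1536)); // prints "1.5 KB"
  }
}
```

Storing the raw byte count in the stats map and formatting it only at print time would keep the map value usable for comparisons in tests.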

Member

lintool commented Aug 17, 2025

Also, please fix the broken tests and add a test case to cover the new code.
The CACM index is already part of the test suite.

Member

lintool commented Aug 18, 2025

Index statistics
----------------
documents:             10000
documents (non-empty): 10000
unique terms:          25760
total terms:           372638
physical location of index:     /home/ssubr/forks/anserini/indexes/msmarco-passage/small-index
total size on disk:    4.7 MB

Change "physical location of index" to "index path" and you can align all the values?
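One way to get the alignment asked for here is to pad every label to the same width instead of hand-counting spaces in each println. The sketch below is illustrative only: the class and method names are hypothetical, and the width of 23 is an assumption derived from the longest label shown above, "documents (non-empty):", plus one space.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AlignedStatsSketch {
  // Pads "label:" to a fixed width so every value starts in the same column.
  public static String formatLine(String label, Object value) {
    return String.format("%-23s%s", label + ":", value);
  }

  public static void main(String[] args) {
    // Placeholder values; a LinkedHashMap keeps the insertion order for printing.
    Map<String, Object> results = new LinkedHashMap<>();
    results.put("documents", 10000);
    results.put("documents (non-empty)", 10000);
    results.put("index path", "/path/to/index");
    results.put("total size", "4.7 MB");
    results.forEach((label, value) -> System.out.println(formatLine(label, value)));
  }
}
```

A shorter label such as "index path" then lines up with the existing rows without any manual spacing.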

Member

@lintool lintool left a comment

Something still seems off... see comments?

System.out.println("documents (non-empty): " + results.get("non_empty_documents"));
System.out.println("unique terms: " + results.get("unique_terms"));
System.out.println("total terms: " + results.get("total_terms"));
System.out.println("physical location of index: " + results.get("physical_location"));
Member

this doesn't seem right?

how about just "index_path" and "total_size"?

Contributor Author

@suraj-subrahmanyan suraj-subrahmanyan Aug 18, 2025

Yea, you are right -- I'll change them. As a side note, I will squash the commits and fix up the git history.

@lintool lintool self-requested a review August 18, 2025 02:40
@lintool lintool merged commit d97a308 into castorini:master Aug 18, 2025
1 check passed