A new implementation of my graph index; I wanted to start from scratch. It now uses a different Louvain method implementation that needs much less memory.
- Add a fast buffer-based GFA reader, inspired by strangepg
- Generate edge lists from a GFA
- Integrate strangepg file reading for faster GFA loading
- On-disk binary search from node IDs to their community IDs (see the binary-search sketch after this list)
- It works, but storing the community IDs still needs to be implemented in the code.
- Separate the GFA file into chunks based on the communities produced.
- Either change the map to be keyed by string with an <int, int> value whose second int is the community ID, or keep a vector of node-count length and fill it with <string, int> pairs of node ID and community ID. Need to test the memory usage of both (see the layout sketch after this list).
- Generate the community-ID-to-file-offset index (int → <int, int>, i.e. community ID → <start, end>); a possible record layout is sketched after this list.
- Need to check whether the chunks can then be gzipped separately and how that would change the offsets (see the compression sketch after this list).
- Separate the edges that belong to different communities into their own chunks. I don't think this is actually needed.
- Since I'm hashing the sequence IDs later anyway, maybe I should hash them up front and use that Abseil dictionary that uses less memory (see the hashing sketch after this list).
- Parallelize the GFA chunking/gzipping. Not necessary: it's fast enough now with compression level 6 instead of 9.
- Maybe make the community index a binary file that is loaded entirely into memory for faster access (see the loading sketch after this list).
- Index the paths and other line types; these will be indexed line by line, which should be easy.
- Recursive chunking: I think communities that are too large should be chunked further. Do this on the separated file (see the recursion sketch after this list).
- Add command line interface
- Add unit tests
- Benchmark against other graph clustering tools
- Add Rust interface
- Add conda package
- Edit old ChGraph to work with the new indexes
- Investigate why retrieving (node_id 3456, community 743) takes so long; every chunk probably needs to be checked.
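
A minimal sketch of the on-disk binary search from node ID to community ID, assuming the index file holds fixed-width `(node_id, community_id)` records as two native-endian `uint32_t` values, sorted by node ID; the layout and names are assumptions, not the actual format:

```cpp
#include <cstdint>
#include <fstream>
#include <optional>

// One fixed-width index record: 8 bytes, sorted by node_id on disk.
// (Assumed layout, not the real file format.)
struct Record {
    uint32_t node_id;
    uint32_t community_id;
};

// Binary search the index file directly, without loading it into memory.
std::optional<uint32_t> find_community(std::ifstream& idx, uint64_t n_records,
                                       uint32_t target) {
    uint64_t lo = 0, hi = n_records;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        Record rec;
        idx.seekg(static_cast<std::streamoff>(mid * sizeof(Record)));
        idx.read(reinterpret_cast<char*>(&rec), sizeof(Record));
        if (rec.node_id == target) return rec.community_id;
        if (rec.node_id < target) lo = mid + 1;
        else                      hi = mid;
    }
    return std::nullopt;  // node ID not present in the index
}
```

Each probe costs one seek plus an 8-byte read, so a lookup is O(log n) seeks.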
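
Illustrative declarations of the two layouts to memory-test; the container choices and names are assumptions:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Option A: map keyed by the string node ID; the value pair's second
// int is the community ID.
std::unordered_map<std::string, std::pair<int, int>> by_name;

// Option B: one vector entry per node, holding the string node ID and
// its community ID; lower per-entry overhead, but lookup needs either
// the node's position or a sort plus binary search.
std::vector<std::pair<std::string, int>> by_position;
```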
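
One possible fixed-width record for the community-ID-to-offset index; the field names and widths are assumptions:

```cpp
#include <cstdint>

// Maps a community ID to the byte range [start, end) of its chunk in
// the separated GFA file.
struct CommunityRange {
    uint32_t community_id;
    uint64_t start;  // offset of the chunk's first byte
    uint64_t end;    // offset one past the chunk's last byte
};
```

If these records are written sorted by community ID, the same on-disk binary search as above works here too.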
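
On gzipping the chunks separately: if each community chunk is deflated into its own gzip member and the members are concatenated into one file, the index has to store compressed [start, end) offsets, which are only known after compression. A zlib-based sketch, assuming the chunk text is already in memory:

```cpp
#include <string>
#include <vector>
#include <zlib.h>

// Compress one chunk into a standalone gzip member at level 6.
std::vector<unsigned char> gzip_chunk(const std::string& chunk) {
    z_stream zs{};
    // windowBits = 15 + 16 tells zlib to emit a gzip header and trailer.
    deflateInit2(&zs, 6, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);
    std::vector<unsigned char> out(deflateBound(&zs, chunk.size()));
    zs.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(chunk.data()));
    zs.avail_in  = static_cast<uInt>(chunk.size());
    zs.next_out  = out.data();
    zs.avail_out = static_cast<uInt>(out.size());
    deflate(&zs, Z_FINISH);  // single shot: deflateBound guarantees room
    out.resize(zs.total_out);
    deflateEnd(&zs);
    return out;
}
```

Recording the offsets is then a running sum: after appending each member, the next community's start is the current file position.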
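
A sketch of pre-hashing the sequence IDs and keying an `absl::flat_hash_map` by the 64-bit digest instead of the string; the map then never stores the strings, at the cost of a small collision risk. The hash choice (`absl::HashOf`) is an assumption:

```cpp
#include <cstdint>
#include <string_view>

#include "absl/container/flat_hash_map.h"
#include "absl/hash/hash.h"

// Community lookup keyed by the hash of the sequence ID, so the map
// stores 8-byte keys instead of heap-allocated strings.
absl::flat_hash_map<uint64_t, uint32_t> community_by_hash;

void insert_node(std::string_view seq_id, uint32_t community_id) {
    // Hash once up front; two distinct IDs could collide, which would
    // conflate their entries, so this needs a collision check in practice.
    community_by_hash[absl::HashOf(seq_id)] = community_id;
}
```

Note that `absl::HashOf` is not guaranteed stable across program runs, so if the digests ever get written into the on-disk index, a stable hash would be needed instead.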
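
If the community index becomes a binary file that is loaded whole, lookups turn into in-memory binary searches. A sketch, reusing the assumed record layout:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

// Same assumed layout as in the binary-search sketch above.
struct Record {
    uint32_t node_id;
    uint32_t community_id;
};

// Read the whole index file into memory; lookups then become
// std::lower_bound over the vector instead of file seeks.
std::vector<Record> load_index(const char* path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const auto bytes = static_cast<std::size_t>(in.tellg());
    std::vector<Record> records(bytes / sizeof(Record));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(records.data()),
            static_cast<std::streamsize>(bytes));
    return records;
}
```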
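
A sketch of the recursive chunking idea: chunks over a size threshold are split again until everything fits. The threshold and the stand-in `recluster` are placeholders; the real version would re-run the Louvain step on the oversized community's subgraph:

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kMaxChunkBytes = 64 * 1024 * 1024;  // assumed limit

// Placeholder for re-clustering an oversized community's subgraph; it
// just halves the chunk here so the sketch is self-contained.
std::vector<std::string> recluster(const std::string& chunk) {
    const std::size_t mid = chunk.size() / 2;
    return {chunk.substr(0, mid), chunk.substr(mid)};
}

// Emit chunks that fit; split and recurse on chunks that do not.
void chunk_recursively(const std::string& chunk,
                       std::vector<std::string>& out) {
    if (chunk.size() <= kMaxChunkBytes) {
        out.push_back(chunk);
        return;
    }
    for (const auto& sub : recluster(chunk)) {
        chunk_recursively(sub, out);
    }
}
```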