Conversation

@mkeskells
Contributor

@mkeskells mkeskells commented Sep 2, 2025

SUMMARY

New serialisation format.
It's a bit of a draft and open for discussion. It compiles and the tests pass.

Mostly mechanical code carved out of another PR, to simplify that one.

Automated Checks

  • I have run ./gradlew test and made sure that my PR does not break any unit test.

@mkeskells mkeskells marked this pull request as draft September 2, 2025 23:52
@mkeskells
Contributor Author

Any opinions on the serialisation?

@blacelle
Member

blacelle commented Sep 8, 2025

About this change to the serialization format of Roaring64Bitmap:

@weijietong As the initial contributor of the 64-bit ART implementation, do you have any opinion on this matter?

@blacelle
Member

blacelle commented Sep 8, 2025

(Given the amount of changes in this PR and related matters, especially on the serialization format, I wonder if it would be simpler to go with an alternative implementation rather than changing the existing one. The alternative implementation may at some point replace the legacy one (with additional mechanisms to manage the changes of serialization format).)

(@mkeskells I guess it may feel overwhelming to push such changes into such an established library, which makes some changes quite laborious. Hence this suggestion of an alternative implementation. Benchmarks would remain very useful to confirm the strengths or weaknesses of the various designs, and an additional implementation may make such benchmarks easier to conduct.)

@blacelle
Member

blacelle commented Sep 8, 2025

#803 (comment)

I had a look. It seems from a quick read that the proposal is that we write a sequence of 32-bit high value -> 32-bit RoaringBitmap. That's quite a way away from the structure of the 64BitRoaringMap, but not unachievable. It would churn memory a bit as I see it, converting the lower 2 bytes of the Trie into a data structure, or at least require some scanning to work out the number of entries etc.
Am I reading that correctly?
as Seq<32-bit int, Seq<32-bit roaringMap>>
which I think is Seq<32-bit int, Seq<16-bit int, 16-bit container>>

I would imagine that having a portable format would be useful, but a fast format also has some attractions. It's not black and white. Maybe there is a need for both.
Certainly, if we were to do this I would like to get rid of the duplication of structure between DataOutput and ByteBuffer!

Is there a specific view of what the format should be?

I personally did not study the ART-based serialization format. The Map-based one is pretty straightforward. My view would be to look into other implementations (CRoaring, GoRoaring) to check their ART implementation (if any), and try (...) to converge on a portable format.

I would be very curious to know whether the Map-based format would induce a big penalty when read from or written into ART structures. Having a dedicated ART format seems very legitimate too. Though it should not be too tied to the implementation (to prevent one big issue here: changing the format because the implementation changes).

I'm not very keen to skip the specification effort, as it has been demonstrated that some users rely on serialization through libraries and have strong expectations in terms of stability. I understand your own use cases, @mkeskells, involve serialization/deserialization, supposedly in long-term scenarios, hence expecting stability on this matter (or retro-compatibility).
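
To make the layout discussed above concrete (a sequence of 32-bit high value -> 32-bit bitmap), here is a minimal sketch of how a 64-bit value splits into the two halves. It is only an illustration under assumptions: `HighLowSplit`, `group` and `join` are hypothetical names, a `TreeSet` stands in for a real 32-bit RoaringBitmap, and the keys are ordered as unsigned integers.

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: not part of the RoaringBitmap API.
final class HighLowSplit {

    // Groups 64-bit values by their high 32 bits; each bucket holds the low 32 bits.
    // A TreeSet stands in for a 32-bit RoaringBitmap here.
    static TreeMap<Integer, TreeSet<Integer>> group(long[] values) {
        TreeMap<Integer, TreeSet<Integer>> buckets = new TreeMap<>(Integer::compareUnsigned);
        for (long v : values) {
            int high = (int) (v >>> 32); // upper 32 bits act as the map key
            int low = (int) v;           // lower 32 bits go into the bucket
            buckets.computeIfAbsent(high, k -> new TreeSet<>(Integer::compareUnsigned)).add(low);
        }
        return buckets;
    }

    // Rebuilds the 64-bit value from its two halves.
    static long join(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }
}
```

Reading such a layout back into an ART-based structure means iterating every bucket and re-inserting under the longer key, which is the conversion cost being asked about.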

@blacelle
Member

blacelle commented Sep 8, 2025

[Check failure on line 11 in roaringbitmap/src/main/java/org/roaringbitmap/art/BranchNode.java](https://github.com/RoaringBitmap/RoaringBitmap/pull/805/files#annotation_38523723317) 

Code scanning / CodeQL

Inconsistent equals and hashCode
Error

Class BranchNode overrides equals but not hashCode.

@mkeskells Is this legit?

@mkeskells
Contributor Author

Class BranchNode overrides equals but not hashCode.

@mkeskells Is this legit?

I only added equals for the tests. These are not user-visible classes and, along with most mutable collections, don't have a sensible hashCode.

Happy to leave it as it is, or to switch to another method and adjust the tests to not use assertEquals, if that is easier for maintainers.

@blacelle
Member

It feels awkward to have .equals and not .hashCode. It feels awkward to have .equals if it is used only for tests.

along with most mutable collections, don't have a sensible hashCode

I do not get your point.

I would feel better if .equals was not added, if it is relevant only for test purposes, as it may confuse maintainers. I guess we could introduce some helper methods dedicated to tests.
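
A test helper of the kind suggested above could look roughly like the following. This is a sketch under assumptions: `DemoNode` is a made-up stand-in (the real `BranchNode` fields are not shown in this thread), and `NodeTestSupport` is a hypothetical name for a class that would live in the test sources.

```java
import java.util.Arrays;

// Hypothetical stand-in for an internal node; NOT the real org.roaringbitmap.art.BranchNode.
final class DemoNode {
    final byte[] prefix;
    final int count;

    DemoNode(byte[] prefix, int count) {
        this.prefix = prefix.clone();
        this.count = count;
    }
    // equals/hashCode are deliberately left untouched (identity semantics).
}

// Hypothetical test-only helper: compares structure without overriding equals.
final class NodeTestSupport {

    static boolean sameStructure(DemoNode a, DemoNode b) {
        if (a == b) {
            return true;
        }
        if (a == null || b == null) {
            return false;
        }
        return a.count == b.count && Arrays.equals(a.prefix, b.prefix);
    }

    // Fails like an assertion would, so tests do not need assertEquals on nodes.
    static void assertSameStructure(DemoNode expected, DemoNode actual) {
        if (!sameStructure(expected, actual)) {
            throw new AssertionError("node structures differ");
        }
    }
}
```

Keeping the comparison in test code avoids the CodeQL warning, because the production class no longer overrides equals at all.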

@mkeskells
Contributor Author

I personally did not study the ART-based serialization format. The Map-based one is pretty straightforward. My view would be to look into other implementations (CRoaring, GoRoaring) to check their ART implementation (if any), and try (...) to converge on a portable format.

Agreed - the ART serialisation is very fragile

I would be very curious to know whether the Map-based format would induce a big penalty when read from or written into ART structures. Having a dedicated ART format seems very legitimate too. Though it should not be too tied to the implementation (to prevent one big issue here: changing the format because the implementation changes).

The issue I see is that the current map serialisation format is based on 32-bit roaring bitmaps, which would have to be constructed and deconstructed on the fly, and the structure for that is really just a sequence of (16-bit address, container).

I think that for the 64-bit roaring bitmaps it may be better to consider an interchange format based on 16-bit containers.

Effectively that could make the 32-bit and 64-bit solutions similar (conceptually),
i.e. a sequence of (address, container).

For the 32-bit the address is 16 bits, and for the 64-bit it's 48 bits.

You could potentially add different prefix sizes in the future/when Valhalla delivers.

From a quick look at the code I think this should be easy for both of the 64-bit implementations. It would add a bit of time to the serialisation and deserialisation, but [I assume] in reality it's dominated by the containers, both for time and space.
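
As a rough sketch of the "sequence of (address, container)" idea for the 64-bit case: each value splits into a 48-bit prefix and a 16-bit low part, and each record is the prefix followed by its container. Everything here is an assumption made for illustration: the class and method names are hypothetical, the byte layout (big-endian, container count, then per record a 6-byte prefix, a cardinality and the sorted 16-bit values) is invented, and a plain sorted set stands in for real array/bitmap/run containers. It is not the RoaringFormatSpec layout and not the format in this PR.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of a (48-bit prefix, 16-bit container) sequence.
final class PrefixContainerSketch {

    static long prefixOf(long value) { return value >>> 16; }         // upper 48 bits
    static int lowOf(long value) { return (int) (value & 0xFFFF); }   // lower 16 bits

    static byte[] write(long[] values) {
        // Prefixes fit in 48 bits, so natural long order equals unsigned order.
        TreeMap<Long, TreeSet<Integer>> containers = new TreeMap<>();
        for (long v : values) {
            containers.computeIfAbsent(prefixOf(v), k -> new TreeSet<>()).add(lowOf(v));
        }
        int size = Integer.BYTES; // container count
        for (TreeSet<Integer> c : containers.values()) {
            size += 6 + Integer.BYTES + Short.BYTES * c.size();
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(containers.size());
        for (Map.Entry<Long, TreeSet<Integer>> e : containers.entrySet()) {
            long prefix = e.getKey();
            out.putShort((short) (prefix >>> 32)); // top 16 bits of the prefix
            out.putInt((int) prefix);              // remaining 32 bits of the prefix
            out.putInt(e.getValue().size());       // cardinality of this container
            for (int low : e.getValue()) {
                out.putShort((short) low);
            }
        }
        return out.array();
    }

    static long[] read(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        List<Long> result = new ArrayList<>();
        int containerCount = in.getInt();
        for (int i = 0; i < containerCount; i++) {
            long prefix = ((long) (in.getShort() & 0xFFFF) << 32) | (in.getInt() & 0xFFFFFFFFL);
            int cardinality = in.getInt();
            for (int j = 0; j < cardinality; j++) {
                result.add((prefix << 16) | (in.getShort() & 0xFFFF));
            }
        }
        return result.stream().mapToLong(Long::longValue).toArray();
    }
}
```

The same record shape would cover the 32-bit case by shrinking the prefix to 16 bits, which is the conceptual symmetry described above.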

I'm not very keen to skip the specification effort, as it has been demonstrated that some users rely on serialization through libraries and have strong expectations in terms of stability. I understand your own use cases, @mkeskells, involve serialization/deserialization, supposedly in long-term scenarios, hence expecting stability on this matter (or retro-compatibility).

At least for this implementation, the docs show the dragons:

  * Unlike RoaringBitmap, there is no specification for now: it may change from one java version to
  * another, and from one RoaringBitmap version to another.

@mkeskells
Contributor Author

mkeskells commented Sep 11, 2025

I would feel better if .equals was not added, if it is relevant only for test purposes, as it may confuse maintainers. I guess we could introduce some helper methods dedicated to tests.

Done. On checking, it wasn't used in this PR; it got copied over from the other one, but I will remove it there as well.

@blacelle
Member

 * Unlike RoaringBitmap, there is no specification for now: it may change from one java version to
  * another, and from one RoaringBitmap version to another.

This comment is a bit outdated. We worked on a specification of a portable format. It is compatible with either CRoaring or GoRoaring, and compatible with Java-Roaring if some flag is toggled. Though it is definitely quite some burden for seemingly limited (but not nonexistent) use.

@mkeskells
Contributor Author

mkeskells commented Sep 12, 2025

 * Unlike RoaringBitmap, there is no specification for now: it may change from one java version to
  * another, and from one RoaringBitmap version to another.

This comment is a bit outdated. We worked on a specification of a portable format. It is compatible with either CRoaring or GoRoaring, and compatible with Java-Roaring if some flag is toggled. Though it is definitely quite some burden for seemingly limited (but not nonexistent) use.

This is the comment from Roaring64Bitmap. I doubt it would change from one Java version to another, but it would change if the internal details of the BranchNodes change,
and also if the Containers or LeafNode change, which I am proposing.

@blacelle
Member

This is the comment from Roaring64Bitmap. I doubt it would change from one Java version to another, but it would change if the internal details of the BranchNodes change,

Oh, OK. Then I suppose this could be easier to merge. @lemire I was not aware we were explicitly providing fewer guarantees on the Roaring64Bitmap serialization format.

@lemire
Member

lemire commented Sep 18, 2025

@lemire I was not aware we were explicitly providing fewer guarantees on the Roaring64Bitmap serialization format.

Please see

https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations

@blacelle
Member

blacelle commented Sep 18, 2025

Java Roaring bitmaps implementation offers an Map-based 64-bit implementation handling signed longs. It is not compatible with this serialization format (which does not handle signed keys).

@lemire It is clear to me that the Map-based 64-bit implementation has no clear serialization format (I'm the author of this part of the spec :D). I meant that it feels weird that we do not provide any guarantee about the stability of the actual format (even if not specified), i.e. zero guarantee in terms of retro-compatibility, given:

it may change from one java version to another, and from one RoaringBitmap version to another

in the Roaring64Bitmap Javadoc. In practice, I would expect some guarantee (hence some feedback to @mkeskells around the issues raised by changing the format). But I may be overzealous, given the Javadoc states the contrary.
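
The signed-versus-unsigned key issue in the spec excerpt quoted above can be illustrated with a small sketch. It assumes keys are the high 32 bits of each value; `KeyOrderSketch` and its methods are hypothetical names, not library API.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration: the same high keys sort differently when treated
// as signed or as unsigned 32-bit integers.
final class KeyOrderSketch {

    static int highKey(long value) {
        return (int) (value >>> 32);
    }

    static Integer[] sortedHighKeys(long[] values, boolean unsigned) {
        Integer[] keys = new Integer[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = highKey(values[i]);
        }
        Comparator<Integer> order = unsigned ? Integer::compareUnsigned : Integer::compare;
        Arrays.sort(keys, order);
        return keys;
    }
}
```

With negative longs present, a signed ordering puts their buckets first while an unsigned ordering puts them last, so a writer and a reader that disagree on this would see keys in a different sequence.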

@mkeskells
Contributor Author

Also happy to work towards a format that is stable.

My preference would be to have something based on some metadata and a sequence of the 16-bit Containers, as that seems to be the basis for everything.

I am not tied to the format in this PR, so if the change was necessary but not sufficient, let's move to something that is portable, extensible, and reasonably efficient.

Or have two formats: a native fast format with constraints, and a portable format that is stable.

@mkeskells
Contributor Author

Any more thoughts on what to do here?

@lemire
Member

lemire commented Oct 11, 2025

@mkeskells

We currently have one documented 64-bit format. We even have a Kaitai formal definition.

https://github.com/RoaringBitmap/RoaringFormatSpec

Sadly, people have been designing their own 64-bit formats left and right in various programming languages, which makes interoperability difficult, and it is generally difficult to rely on or test any one format.

I encourage everyone to adopt a common portable standard, as much as possible.
