Row Format
Apache Fory™ provides a random-access row format that lets you read nested fields directly from binary data without full deserialization. This avoids most of the decoding and allocation cost when an object is large but only a few of its fields are needed.
Overview
The row format is a cache-friendly, random-access binary format that supports:
- Zero-copy access: Read fields directly from binary without allocating objects
- Partial deserialization: Access only the fields you need
- Skipping serialization: Skip serialization of fields you don't need
- Cross-language compatibility: Works across Python, Java, C++, and other languages
- Column format conversion: Can convert to Apache Arrow columnar format automatically
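To make the idea of zero-copy, random access concrete, here is a minimal sketch of a toy row layout in plain Python. It is an illustration only and not Fory's actual wire format (see the Row Format Specification for that): the helper names `encode_row`, `get_int`, and `get_str` are hypothetical, and the layout (one 8-byte slot per field, with strings stored in a trailing variable-length region) is an assumption chosen for simplicity.

```python
import struct


def encode_row(values):
    """Toy layout: one 8-byte slot per field, then a variable-length region.

    int fields store the value in the slot itself; str fields store
    (offset << 32) | length, pointing into the variable-length region.
    """
    fixed_size = 8 * len(values)
    slots = bytearray(fixed_size)
    var_region = bytearray()
    for i, value in enumerate(values):
        if isinstance(value, int):
            struct.pack_into("<q", slots, 8 * i, value)
        else:
            data = value.encode("utf-8")
            offset = fixed_size + len(var_region)
            struct.pack_into("<Q", slots, 8 * i, (offset << 32) | len(data))
            var_region += data
    return bytes(slots + var_region)


def get_int(row, index):
    # A single read at a computed offset; no other field is decoded.
    return struct.unpack_from("<q", row, 8 * index)[0]


def get_str(row, index):
    packed = struct.unpack_from("<Q", row, 8 * index)[0]
    offset, length = packed >> 32, packed & 0xFFFFFFFF
    # The memoryview slice does not copy; only the requested bytes are decoded.
    return memoryview(row)[offset:offset + length].tobytes().decode("utf-8")


row = encode_row([10, "hello", 42, "row format"])
print(get_int(row, 2))
print(get_str(row, 3))
```

Because every field's position can be computed from its index, reading field 3 never requires parsing fields 0 through 2. That property is what partial deserialization relies on.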
Basic Usage
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.LongStream;
// Fory package names below may differ between releases; adjust to your version.
import org.apache.fory.format.encoder.Encoders;
import org.apache.fory.format.encoder.RowEncoder;
import org.apache.fory.format.row.binary.BinaryArray;
import org.apache.fory.format.row.binary.BinaryRow;

public class Bar {
  String f1;
  List<Long> f2;
}

public class Foo {
  int f1;
  List<Integer> f2;
  Map<String, Integer> f3;
  List<Bar> f4;
}

RowEncoder<Foo> encoder = Encoders.bean(Foo.class);

// Create a large dataset
Foo foo = new Foo();
foo.f1 = 10;
foo.f2 = IntStream.range(0, 1_000_000).boxed().collect(Collectors.toList());
foo.f3 = IntStream.range(0, 1_000_000).boxed().collect(Collectors.toMap(i -> "k" + i, i -> i));
List<Bar> bars = new ArrayList<>(1_000_000);
for (int i = 0; i < 1_000_000; i++) {
  Bar bar = new Bar();
  bar.f1 = "s" + i;
  bar.f2 = LongStream.range(0, 10).boxed().collect(Collectors.toList());
  bars.add(bar);
}
foo.f4 = bars;

// Encode to row format (cross-language compatible with Python/C++)
BinaryRow binaryRow = encoder.toRow(foo);

// Zero-copy random access without full deserialization (field indexes are 0-based)
BinaryArray f2Array = binaryRow.getArray(1); // Access the f2 list
BinaryArray f4Array = binaryRow.getArray(3); // Access the f4 list
BinaryRow bar10 = f4Array.getStruct(10); // Access the 11th Bar
long value = bar10.getArray(1).getInt64(5); // Access the 6th element of bar.f2

// Partial deserialization: only deserialize what you need
RowEncoder<Bar> barEncoder = Encoders.bean(Bar.class);
Bar bar1 = barEncoder.fromRow(f4Array.getStruct(10)); // Deserialize the 11th Bar only
Bar bar2 = barEncoder.fromRow(f4Array.getStruct(20)); // Deserialize the 21st Bar only

// Full deserialization when needed
Foo newFoo = encoder.fromRow(binaryRow);
```
Key Benefits
| Feature | Description |
|---|---|
| Zero-Copy Access | Read nested fields without deserializing entire object |
| Memory Efficiency | Memory-map large datasets directly from disk |
| Cross-Language | Binary format compatible between Java, Python, C++ |
| Partial Deserialization | Deserialize only specific elements you need |
| High Performance | Skip unnecessary data parsing for analytics workloads |
When to Use Row Format
Row format is ideal for:
- Analytics workloads: When you only need to access specific fields
- Large datasets: When full deserialization is too expensive
- Memory-mapped files: Working with data larger than RAM
- Data pipelines: Processing data without full object reconstruction
- Cross-language data sharing: When data needs to be accessed from multiple languages
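The memory-mapped use case can be sketched with the Python standard library alone. The example below does not use Fory: the fixed-size `RECORD` layout and the helpers `write_records` and `read_value` are hypothetical stand-ins for a random-access binary format, shown only to illustrate how an offset-addressable layout lets the operating system page in just the bytes being read.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: little-endian int32 id followed by int64 value.
RECORD = struct.Struct("<iq")


def write_records(path, count):
    """Write `count` records (id, id squared) back to back."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(RECORD.pack(i, i * i))


def read_value(path, index):
    """Read one record's value through a memory map, without loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            _record_id, value = RECORD.unpack_from(mapped, index * RECORD.size)
            return value


path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, 100_000)
print(read_value(path, 12_345))
```

The same access pattern applies when the file is larger than available RAM, since only the pages containing the requested record need to be resident.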
Cross-Language Compatibility
Row format works seamlessly across languages. The same binary data can be accessed from:
Python
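```python
import pyfory
import pyarrow as pa
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class Bar:
    f1: str
    f2: List[pa.int64]


@dataclass
class Foo:
    f1: pa.int32
    f2: List[pa.int32]
    f3: Dict[str, pa.int32]
    f4: List[Bar]


# Build the same dataset as in the Java example
foo = Foo(
    f1=10,
    f2=list(range(1_000_000)),
    f3={f"k{i}": i for i in range(1_000_000)},
    f4=[Bar(f1=f"s{i}", f2=list(range(10))) for i in range(1_000_000)],
)

encoder = pyfory.encoder(Foo)
binary: bytes = encoder.to_row(foo).to_bytes()

# Zero-copy access
foo_row = pyfory.RowData(encoder.schema, binary)
print(foo_row.f2[100000])
print(foo_row.f4[100000].f1)
```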
C++
```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include "fory/encoder/row_encoder.h"
#include "fory/row/writer.h"

struct Bar {
  std::string f1;
  std::vector<int64_t> f2;
};
FORY_FIELD_INFO(Bar, f1, f2);

struct Foo {
  int32_t f1;
  std::vector<int32_t> f2;
  std::map<std::string, int32_t> f3;
  std::vector<Bar> f4;
};
FORY_FIELD_INFO(Foo, f1, f2, f3, f4);

// `foo` is a populated Foo instance, built as in the Java example.
fory::encoder::RowEncoder<Foo> encoder;
encoder.Encode(foo);
auto row = encoder.GetWriter().ToRow();

// Zero-copy random access (field indexes are 0-based)
auto f2_array = row->GetArray(1);
auto f4_array = row->GetArray(3);
auto bar10 = f4_array->GetStruct(10);
int64_t value = bar10->GetArray(1)->GetInt64(5);
std::string str = bar10->GetString(0);
```
Performance Comparison
| Operation | Object Format | Row Format |
|---|---|---|
| Full deserialization | Always required; allocates all objects | Optional; fields can be read in place without allocating objects |
| Single field access | Full deserialization required | Direct offset read |
| Memory usage | Full object graph in memory | Binary buffer plus only the fields you access |
| Suitable for | Small objects, full access | Large objects, selective access |
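The difference in the "Single field access" row can be illustrated with a small standard-library sketch. It does not measure Fory itself: JSON stands in for a format that must be fully parsed, and a packed array of 64-bit integers stands in for an offset-addressable format. The function names `read_via_full_decode` and `read_via_offset` are hypothetical, and the timings it prints will vary by machine.

```python
import json
import struct
import timeit

values = list(range(100_000))

# Stand-in for an object format: the whole payload must be parsed to read one element.
object_payload = json.dumps(values).encode("utf-8")

# Stand-in for a row format: fixed-width elements addressable by offset.
row_payload = struct.pack(f"<{len(values)}q", *values)


def read_via_full_decode(payload, index):
    decoded = json.loads(payload)  # materializes every element
    return decoded[index]


def read_via_offset(payload, index):
    return struct.unpack_from("<q", payload, 8 * index)[0]  # reads 8 bytes


full = timeit.timeit(lambda: read_via_full_decode(object_payload, 50_000), number=5)
direct = timeit.timeit(lambda: read_via_offset(row_payload, 50_000), number=5)
print(f"full decode: {full:.6f}s, offset read: {direct:.6f}s")
```

Both functions return the same value; the offset read does a constant amount of work regardless of payload size, while the full decode scales with the number of elements.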
Related Topics
- Cross-Language Serialization - XLANG mode
- Advanced Features - Zero-copy serialization
- Row Format Specification - Protocol details