Commit 9c285b6

feat: Update readme to make it more descriptive
1 parent 629dd67 commit 9c285b6

1 file changed

+61
-8
lines changed

README.md

+61-8
@@ -1,18 +1,17 @@
1-
# Glint: SQL Query Compiler for Java
1+
# Glint: Vectorized and Code Generation Driven Query Engine in Java
22

33
> Briefly flashing the powers of query compilation without the machinery of a spark.
44
55
## Description
66

7-
Glint is a SQL query engine with query compilation support in Java.
7+
Glint is a minimal SQL query engine with vectorized execution and query compilation support in Java.
88

99
Following in the tradition of the new movement of modular database architectures,
1010
Glint has no catalog or data management; its only capability is turning SQL queries
1111
into Java code that is then compiled and executed; think Calcite not Spark.
1212

1313
To make it fun, at least for testing and benchmarking purposes, we plugged in
14-
an Arrow compatible API with support for Memory, CSV and Partquet data sources
15-
allowing us to run against most benchmark datasets out there.
14+
an Arrow compatible API with support for Memory, CSV and Parquet data sources.
1615

1716
## Architecture
1817

@@ -22,7 +21,7 @@ aspect of a query compiler is studied or demonstrated.
2221

2322
But before all of this, let's start with a brief tour of query engines in general; this
2423
will allow us to frame the architecture discussion in a concrete context by understanding
25-
the fundamental components and patterns that shape modern query processing systems.
24+
the fundamental components and patterns that shape modern query processing systems.
2625

2726
### Query Engine Architecture and Paradigms
2827

@@ -42,7 +41,7 @@ SELECT col1 FROM table WHERE col2 > 10
4241
```
4342

4443
Driving the execution of the above model are two execution paradigms: vectorized and compiled.
45-
Vectorized execution processes data in batches (vectors) to better utilize CPU caches and
44+
Vectorized execution processes data in batches (vectors) to better utilize CPU caches and
4645
enable SIMD operations.
4746

4847
Instead of processing one row at a time like the Volcano model, it handles chunks of data
@@ -75,9 +74,63 @@ complexity.
7574

7675
Each approach has its trade-offs: Vectorized engines have lower compilation overhead and are
7776
more flexible for dynamic workloads, while compiled engines can achieve better absolute performance
78-
for stable queries by generating specialized code paths.
77+
for stable queries by generating specialized code paths.
7978

8079
In the paper [Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf), Timo Kersten and others showed that the performance of
8180
both approaches was pretty much on par, with data-centric code generation
8281
being slightly better at compute-intensive queries and vectorized execution being better at memory-bound
83-
queries.
82+
queries.
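
To make the contrast concrete, here is a minimal sketch of both paradigms for the example query `SELECT col1 FROM table WHERE col2 > 10`, using plain `int[]` columns. The class and method names (`ParadigmSketch`, `selectGreaterThan`, `gather`, `fused`) are hypothetical and are not taken from Glint's code base; the fused loop is only an illustration of what a code generator might emit.

```java
import java.util.Arrays;

public class ParadigmSketch {
    // Vectorized style: one reusable primitive per operation, each a tight
    // loop over a whole batch. The filter fills a selection vector with the
    // positions of qualifying rows and returns how many it found.
    static int selectGreaterThan(int[] col, int threshold, int[] selOut) {
        int n = 0;
        for (int i = 0; i < col.length; i++) {
            if (col[i] > threshold) {
                selOut[n++] = i;
            }
        }
        return n;
    }

    // Second primitive: copy the selected positions out of another column.
    static int[] gather(int[] col, int[] sel, int n) {
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = col[sel[i]];
        }
        return out;
    }

    static int[] vectorized(int[] col1, int[] col2) {
        int[] sel = new int[col2.length];
        int n = selectGreaterThan(col2, 10, sel);
        return gather(col1, sel, n);
    }

    // Compiled (data-centric) style: filter and projection fused into a
    // single loop specialized for this one query (hypothetical generated code).
    static int[] fused(int[] col1, int[] col2) {
        int[] out = new int[col1.length];
        int n = 0;
        for (int i = 0; i < col1.length; i++) {
            if (col2[i] > 10) {
                out[n++] = col1[i];
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] col1 = {1, 2, 3, 4};
        int[] col2 = {5, 20, 10, 30};
        System.out.println(Arrays.toString(vectorized(col1, col2))); // [2, 4]
        System.out.println(Arrays.toString(fused(col1, col2)));      // [2, 4]
    }
}
```

The vectorized version materializes an intermediate selection vector between primitives, while the fused version keeps each value in flight through the whole pipeline; that difference is one way to picture the memory-bound versus compute-intensive trade-off described above.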
83+
84+
### Implementation Details
85+
86+
```
87+
88+
┌─────────────────────────────────────────────────────────┐
89+
│ DataFrame API │
90+
├─────────────────────────────────────────────────────────┤
91+
│ Logical Planning │
92+
│ ┌─────────────┐ ┌──────────┐ ┌───────────────┐ │
93+
│ │ Scan │ │ Join │ │ Project │ │
94+
│ └─────────────┘ └──────────┘ └───────────────┘ │
95+
├─────────────────────────────────────────────────────────┤
96+
│ Physical Planning │
97+
│ ┌─────────────┐ ┌─────────┐ ┌─────────────┐ │
98+
│ │ TableScan │ │HashJoin │ │Project │ │
99+
│ └─────────────┘ └─────────┘ └─────────────┘ │
100+
├─────────────────────────────────────────────────────────┤
101+
│ Execution │
102+
│ ┌─────────────┐ ┌─────────┐ ┌─────────────┐ │
103+
│ │ScanOperator │ │ JoinOp │ │ ProjectOp │ │
104+
│ └─────────────┘ └─────────┘ └─────────────┘ │
105+
└─────────────────────────────────────────────────────────┘
106+
│ │ │
107+
└──────────────┼────────────────┘
108+
109+
┌─────────────────────────────────────────────────────────┐
110+
│ Apache Arrow │
111+
└─────────────────────────────────────────────────────────┘
112+
113+
```
114+
115+
- Apache Arrow Integration:
116+
- Uses Arrow's columnar memory format throughout the engine
117+
- Leverages Arrow's VectorSchemaRoot for batch processing
118+
- Implements custom FieldVector wrappers for type safety
119+
- Enables zero-copy data sharing between operations
120+
121+
- Three-Layer Architecture:
122+
- Logical Plans: Abstract representation of operations (WHAT)
123+
- Physical Plans: Concrete implementation strategies (HOW)
124+
- Operators: Actual execution code using the Volcano model
125+
126+
- DataFrame API:
127+
- Provides a fluent interface for query construction
128+
- Supports common operations (select, filter, join)
129+
- Handles schema inference and validation
130+
- Abstracts query planning complexity from users
131+
132+
- Query Execution:
133+
- Uses a vectorized Volcano-style iterator model
134+
- Processes data in batches for efficiency
135+
- Supports push-down optimizations
136+
- Implements memory-efficient operations
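
The layering and the batch-at-a-time iterator model listed above can be sketched in a few lines of plain Java. This is an illustration under simplifying assumptions, not Glint's actual code: every name here (`LayersSketch`, `Scan`, `Filter`, `Operator`, `plan`, `collect`) is hypothetical, batches are plain `int[]` arrays rather than Arrow vectors, and physical planning is reduced to a direct one-to-one mapping from logical nodes to operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LayersSketch {
    // Logical plan: describes WHAT to compute, with no execution logic.
    interface LogicalPlan {}

    static final class Scan implements LogicalPlan {
        final int[] column;
        Scan(int[] column) { this.column = column; }
    }

    static final class Filter implements LogicalPlan {
        final LogicalPlan input;
        final int threshold;
        Filter(LogicalPlan input, int threshold) {
            this.input = input;
            this.threshold = threshold;
        }
    }

    // Operator: Volcano-style iterator that returns a batch per call,
    // or null once the input is exhausted.
    interface Operator {
        int[] next();
    }

    static final class ScanOperator implements Operator {
        private final int[] column;
        private final int batchSize;
        private int pos = 0;

        ScanOperator(int[] column, int batchSize) {
            this.column = column;
            this.batchSize = batchSize;
        }

        @Override
        public int[] next() {
            if (pos >= column.length) return null;
            int end = Math.min(pos + batchSize, column.length);
            int[] batch = Arrays.copyOfRange(column, pos, end);
            pos = end;
            return batch;
        }
    }

    static final class FilterOperator implements Operator {
        private final Operator child;
        private final int threshold;

        FilterOperator(Operator child, int threshold) {
            this.child = child;
            this.threshold = threshold;
        }

        @Override
        public int[] next() {
            int[] batch;
            // Pull batches from the child until one has surviving values.
            while ((batch = child.next()) != null) {
                int[] out = new int[batch.length];
                int n = 0;
                for (int v : batch) {
                    if (v > threshold) out[n++] = v;
                }
                if (n > 0) return Arrays.copyOf(out, n);
            }
            return null;
        }
    }

    // Physical planning: decides HOW each logical node is executed.
    // In this sketch every logical node has exactly one implementation.
    static Operator plan(LogicalPlan node, int batchSize) {
        if (node instanceof Scan) {
            return new ScanOperator(((Scan) node).column, batchSize);
        }
        if (node instanceof Filter) {
            Filter f = (Filter) node;
            return new FilterOperator(plan(f.input, batchSize), f.threshold);
        }
        throw new IllegalArgumentException("unknown plan node");
    }

    // Drains the operator tree and flattens the batches into rows.
    static List<Integer> collect(Operator root) {
        List<Integer> rows = new ArrayList<>();
        for (int[] b = root.next(); b != null; b = root.next()) {
            for (int v : b) rows.add(v);
        }
        return rows;
    }

    public static void main(String[] args) {
        LogicalPlan query = new Filter(new Scan(new int[]{5, 20, 10, 30, 42}), 10);
        System.out.println(collect(plan(query, 2))); // [20, 30, 42]
    }
}
```

A real planner would typically choose between several physical strategies per logical node (for example, different join algorithms); the single `plan` method stands in for that decision here.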
