CrateDB Architecture Guide
January 2024
CrateDB not only provides faster time-to-market but also very low operational
overhead. Scaling is a matter of adding nodes; the database itself takes care of data
distribution.
The combination of advanced indexing and columnar storage enables very fast queries
in single-digit milliseconds across billions of rows.
The flexible deployment options cover any use case: as a fully managed service on
AWS, Azure, and GCP; as a self-deployed solution in your own cloud environment or
on-premises; or as a solution deployed at the Edge.
Open Source Licensing Model
CrateDB embraces and supports the open source development model. Its source code is
available on GitHub under the Apache 2.0 License, allowing users to access and
contribute to its ongoing development.
Alternatively, users can access CrateDB as a service through the CrateDB Cloud
offering, which simplifies the setup and management process.
For all deployment models, support subscriptions are available on demand, with “Basic”
and “Premium” options.
Multi-Model Database
CrateDB is a multi-model database. Its strength lies in the efficient handling
of multiple data models within the same database and even within the same table:
It eliminates the need to manage and synchronize multiple database technologies and
learn different languages by offering unified access via the well-known SQL language.
All data models are accessible via SQL, the well-known query language, allowing
for complex queries, full-text and vector search.
Complex and nested objects can be stored without manual schema definition. Data
can be inserted directly as a JSON string.
New columns, supporting any data type and format, can be added dynamically,
without table locks, allowing for seamless adaptation to changing needs and
requirements. This schema flexibility is one of CrateDB’s key strengths.
Binary large objects (BLOBs) can be stored separately from the main database
workload, reducing storage costs, with streamlined read/write access via the HTTP
REST API.
Document / JSON
CrateDB supports the storage of JSON-based documents into OBJECT typed columns,
offering flexibility for complex multi-model data structures with diverse attributes, nesting
levels, and arrays of objects.
CrateDB simplifies SQL access to nested documents by enabling direct and nested
attribute querying. This feature allows users to navigate and retrieve data seamlessly
within JSON structures using native SQL syntax.
Inserting documents and JSON payloads is straightforward through JSON strings,
which are automatically parsed and cast into the defined datatypes.
Alternatively, dedicated object literals, closely resembling JSON syntax, provide full
control over used datatypes, especially when dynamically creating new attributes.
Different policies exist to enforce schemas, work schema-free, or follow a hybrid
approach that enforces certain attributes and datatypes. Three options – 'strict',
'dynamic', or 'ignored' – are available for defining the level of flexibility allowed for
both objects and new columns in the schema.
Automatic indexing and columnar storage are applied to any attribute in the JSON-
based documents, depending on the column policy.
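As an illustration, the snippet below prepares a nested JSON document the way a client might before inserting it into an OBJECT-typed column. The sensor reading, its attribute names, and values are hypothetical, and only the JSON handling runs locally; CrateDB itself performs the parsing and datatype casting on insert.

```python
import json

# A hypothetical sensor reading with nested attributes and an array of
# objects, as it might be stored in an OBJECT-typed column.
reading = {
    "sensor_id": "s-101",
    "location": {"site": "plant-a", "line": 3},
    "measurements": [{"kind": "temp", "value": 21.4}],
}

# CrateDB parses JSON strings and casts them into the column datatypes;
# here we only prepare and round-trip the payload a client would send.
payload = json.dumps(reading)
restored = json.loads(payload)

print(restored["location"]["site"])  # nested attribute access
```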
Relational Data
No matter what data is stored in CrateDB, it is represented in tables and columns, even
if it is stored differently behind the scenes. This simplicity facilitates a quick start with
CrateDB, allowing the use of its native SQL for reading and writing data.
While CrateDB does not support ACID transactions, atomic operations and durability are
guaranteed at the record level. Queries by primary key always return the latest results.
The choice of eventual consistency prioritizes high-availability and partition tolerance,
ensuring resilience to most hardware and network failures. In conjunction with Multi
Version Concurrency Control, the storage engine can handle high-volume concurrent
reads and writes.
Full-Text Search & Search Engine
Rich search capabilities are essential for almost any modern application, with the bar for
user experience set high by powerful and user-friendly search engines like Google or Bing.
CrateDB meets the demand for quick and accurate search in textual data, leveraging
the robust full-text search capabilities of Apache Lucene.
Dedicated full-text indexes are updated in real time using powerful analyzers
supporting more than 30 languages, enabling low-latency queries across billions of
records.
CrateDB ensures scalability through its shared-nothing architecture and horizontal
scaling capabilities.
The SQL interface exposes most of Lucene’s powerful query syntax, allowing users to
effortlessly perform complex search queries, including boolean logic, wildcard
searches, phrase searches, proximity searches, fuzzy search capabilities, and
synonyms.
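To make the mechanics concrete, here is a minimal sketch of an inverted index in Python. It is a toy stand-in for the Lucene structures described above, not CrateDB's actual implementation; real analyzers also apply lowercasing, stemming, stop-word removal, and more.

```python
# Toy inverted index: maps each token to the set of documents containing it.
docs = {
    1: "distributed sql database",
    2: "full text search engine",
    3: "distributed search at scale",
}

inverted = {}
for doc_id, text in docs.items():
    for token in text.split():
        inverted.setdefault(token, set()).add(doc_id)

# A term query becomes a single dictionary lookup instead of a table scan.
print(sorted(inverted["search"]))       # documents 2 and 3
print(sorted(inverted["distributed"]))  # documents 1 and 3
```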
Vector Search
Any table can be enhanced with a (nested) column of type FLOAT_VECTOR that
utilizes an HNSW algorithm for efficient indexing, with a maximum of 2048
dimensions.
The built-in approximate k-Nearest-Neighbour Search (kNN) retrieves similar vectors
based on the Euclidean distance. In combination with lexical full-text search, this
approach significantly increases search precision and relevance.
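The ranking behind kNN retrieval can be sketched with a brute-force Euclidean search in Python. An HNSW index answers the same question approximately without scanning every row; the vectors and the query below are made up for illustration.

```python
import math

# Brute-force nearest-neighbour ranking by Euclidean distance.
vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.9, 0.1],
}
query = [0.95, 0.1]

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Rank all stored vectors from most to least similar to the query.
nearest = sorted(vectors, key=lambda k: euclidean(vectors[k], query))
print(nearest)
```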
Time Series Data
CrateDB is very well suited for time series data:
High Cardinality: robust sharding and partitioning capabilities enable time-based
partitioning, facilitating the long-term storage of time series data without the need for
aggregation and downsampling. Preserving granular details is crucial, providing
enhanced flexibility for revisiting historical data, gaining new insights, and conducting
precise forecasting, essential for strategic decision-making processes.
Data Tiering: With CrateDB, you can efficiently move older partitions to slower but
cost-effective spinning disks, while keeping recent data on fast SSDs. This allows for
fast query speed on the most recent data without losing details in older data.
Broad Data Model Support: CrateDB accommodates various data types that can be
combined with time series data. Enrich your time series data with additional context
from documents, track fleets with geospatial information, calculate
vectors/embeddings, and join across different time series. This versatility eliminates
the need for new costly technologies with complex maintenance and data
synchronization.
Spatial Data
Spatial data often generates substantial volumes of information. CrateDB provides
scalable SQL support for geospatial data types and functions tailored for applications
such as fleet tracking, mapping, and location analytics.
In many use cases, including data analysis and machine learning, location data plays
a crucial role. CrateDB accommodates this by storing and querying geographical
information using the geo_point and geo_shape data types.
Users can fine-tune geographic index precision and resolution to achieve faster query
results. Additionally, exact queries can be performed using scalar functions
like intersects, within, and distance.
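As a rough illustration of what a distance computation over geo_point values involves, here is a great-circle (haversine) distance in Python. This is an independent sketch, not CrateDB's implementation; the coordinates are examples.

```python
import math

# Great-circle (haversine) distance in meters between two points given as
# (longitude, latitude), the same order used by geo_point literals.
def haversine_m(p1, p2):
    lon1, lat1, lon2, lat2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * math.asin(math.sqrt(a))

berlin = (13.405, 52.52)
munich = (11.582, 48.1351)
print(round(haversine_m(berlin, munich) / 1000), "km")  # roughly 500 km
```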
BLOB Data
CrateDB provides robust support for storing binary large objects (BLOBs) with advanced
cluster features for efficient replication and sharding of these files. By treating BLOBs like
regular data, CrateDB ensures their optimal distribution across multiple nodes in the
cluster.
To integrate BLOBs into CrateDB, creating a dedicated BLOB table is necessary. This
table can be sharded to distribute binaries effectively across nodes, promoting scalability
and improved performance. Users have the flexibility to define a custom directory path
exclusively for storing BLOB data. This distinct path can differ from the regular data path,
allowing for the segregation of normal data stored on fast SSDs and BLOB data stored on
large, cost-effective spinning disks. The storage path can be globally set or specified
during the creation of a BLOB table.
Interacting with BLOB tables in CrateDB is made easy through the use of the HTTP(S)
protocol. For uploading a BLOB, the SHA1 hash of the BLOB serves as its unique
identifier. To download a BLOB, a simple GET request to the appropriate endpoint is
sufficient, and for deletion, a DELETE request can be sent.
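The addressing scheme can be sketched locally in Python: the SHA1 digest of the content identifies the blob, and the HTTP verbs map to upload, download, and delete. The host, port, and table name below are illustrative, and the endpoint path is an assumption about the layout; only the digest computation runs locally.

```python
import hashlib

# The SHA1 digest of the blob content serves as its identifier.
content = b"hello"
digest = hashlib.sha1(content).hexdigest()

# Hypothetical endpoint layout (host, port, and table name are examples):
url = f"http://localhost:4200/_blobs/myblobs/{digest}"

# upload:   PUT    url  (body = content)
# download: GET    url
# delete:   DELETE url
print(digest)
```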
Native SQL Syntax
CrateDB is designed with scalability and ease of use at its core, emphasizing support for
SQL to seamlessly work with data and integrate into a wide ecosystem.
Distributed Shared-Nothing
Architecture
CrateDB has been designed with distribution in mind since its inception, employing a
shared-nothing architecture that seamlessly scales to hundreds of nodes with minimal
operational effort. This design delivers horizontal scalability, high availability, and
resilience with low operational overhead.
Node Architecture
In contrast to a primary-secondary architecture, every node in the CrateDB cluster can perform
every operation, making all nodes equal in terms of functionality and configuration. The four
major components are:
SQL Handler: Responsible for incoming client requests, the SQL Handler parses and
analyzes SQL statements, creating an execution plan.
Job Execution Service: Manages the execution of plans (“jobs”), with defined phases
and resulting operations. Jobs, consisting of multiple operations, are distributed via the
Transport Protocol to involved nodes, both local and remote.
Cluster State Service: Manages the cluster state, including master node election and
node discovery, making it a key component for cluster setup.
Data storage component: Handles operations for storing and retrieving data from disk
based on the execution plan. CrateDB stores data in sharded tables, dividing and storing
them across multiple nodes. Each shard is a distinct Lucene index stored physically on the
file system with reads and writes operating at the shard level.
Interconnected, uniform CrateDB nodes
Cluster State Management
Each node in the cluster maintains a versioned copy of the latest cluster state. However,
only one node in the cluster – the master node – can change the state at runtime. When
the master node updates the cluster state, it publishes the new state to all nodes in the
cluster and awaits responses from all nodes before processing the next update.
The cluster state encompasses all essential information for maintaining the cluster and
coordinating operations, including cluster membership, table schemas and settings, and
the allocation of shards across nodes.
At any given time, there can be only one master node. The cluster becomes available to
serve requests once a master has been elected. The election, requiring a majority (also
known as quorum) among master-eligible nodes, is a crucial step in this process.
Horizontal Scalability
As business demands grow, so do data volumes and hardware requirements over time.
When there is an increase in requirements for CPU, RAM, and Disk Storage/IOPS,
additional nodes can be seamlessly added to the CrateDB cluster without manual
intervention. The cluster automatically rebalances the data to accommodate the new
nodes.
High Availability
One of the key benefits of a distributed database is its ability to provide high availability for
always-on applications.
CrateDB goes beyond just allowing multi-node setups; nodes can be distributed across
multiple availability zones or data centers to further enhance availability.
The system ensures uninterrupted data access during maintenance operations through
the execution of rolling software updates.
CrateDB clusters exhibit self-healing characteristics, where nodes re-joining a cluster
after a failover automatically synchronize with the latest data.
Achieving high availability requires a minimum of three nodes to maintain a quorum for
master node election; the elected master manages the cluster state. The appropriate
number of nodes is guided by the availability Service Level Agreement (SLA), which
specifies how many nodes can fail before the cluster can no longer accept reads and
writes. For instance:
In a three-node cluster, one node can be offline.
In a five-node cluster, two nodes can be offline, accommodating scenarios such as
hardware failures and concurrent rolling maintenance operations.
Users have the flexibility, on a per-table level, to decide how many replicas of the data
should be created. This choice dictates how many nodes each table and its shards are
replicated on, providing fine-grained control over data redundancy.
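The quorum arithmetic behind these examples is simple and can be verified with a few lines of Python; this is a sketch of the majority rule, not CrateDB code.

```python
# Master election requires a majority (quorum) of master-eligible nodes.
# This helper computes how many nodes may fail while a quorum survives.
def tolerable_failures(nodes: int) -> int:
    quorum = nodes // 2 + 1
    return nodes - quorum

for n in (3, 5, 7):
    print(f"{n}-node cluster tolerates {tolerable_failures(n)} offline node(s)")
```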
Failover and Recovery Process in CrateDB
1. A node leaves the cluster due to a hardware failure, network partition, or a rolling
maintenance task.
2. As data is automatically replicated in CrateDB, there is an automatic failover,
ensuring data remains available despite one node leaving the cluster.
3. When a node is back up and rejoins the cluster, the data is automatically
synchronized to the latest stage and rebalanced, if necessary.
4. After completing the data synchronization, the node is fully operational again, and the
cluster has recovered autonomously without any manual intervention.
The following section outlines how CrateDB’s sharding and replication mechanism works
to ensure efficient operations.
Data Storage
In CrateDB, every table is both sharded and optionally partitioned, resulting in tables
being divided and distributed across the nodes of a cluster. Each shard in CrateDB
corresponds to a Lucene index made of segments stored on the file system. Physically,
the files reside under one of the configured data directories of a node.
Lucene’s characteristic of only appending data to segment files ensures that data written
to the disk is never mutated. This characteristic simplifies replication and recovery
processes, as syncing a shard is a straightforward process of fetching data from a
specific marker.
CrateDB performs periodic merges of segments as they grow over time. During a merge,
documents marked as deleted are discarded, and newly created segments contain only
valid, non-deleted documents from the original segments. This merging process is
triggered automatically in the background, and users can also execute a segment merge
on demand using the OPTIMIZE TABLE command.
It’s important to note that CrateDB employs the concept of soft deletes to accelerate the
replication and recovery process. The retention lease period specifies how long deleted
documents must be preserved before the disk space they occupy can be reclaimed.
CrateDB discards deleted documents only after the expiration of this period, with the
default retention lease period set to 12 hours.
Partitioning and Sharding
Tables in CrateDB can be divided by defining partition columns.
Each table partition is then further split into a configured number of shards, with these
shards distributed across the cluster.
As nodes are added to the cluster, CrateDB dynamically adjusts shard locations to
achieve the maximum possible distribution.
Non-partitioned tables, which function as a single partition, are still divided into the
configured number of shards.
When a record with a new distinct combination of values for the configured partition
columns is inserted, CrateDB dynamically creates a new partition on the fly, inserting
the document into this newly created partition.
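Dynamic partition creation can be sketched as grouping rows by their partition column value, where a new distinct value implies a new partition. The table layout, timestamps, and values below are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Rows are grouped by the value of the partition column, here the month
# derived from a timestamp; a new distinct value creates a new partition.
rows = [
    {"ts": datetime(2024, 1, 5), "value": 10},
    {"ts": datetime(2024, 1, 20), "value": 12},
    {"ts": datetime(2024, 2, 2), "value": 9},
]

partitions = defaultdict(list)
for row in rows:
    partition_key = row["ts"].strftime("%Y-%m")
    partitions[partition_key].append(row)

print(sorted(partitions))  # one partition per distinct month
```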
Read requests in CrateDB are automatically broken down and executed in parallel across
multiple shards on multiple nodes, significantly enhancing read performance. With a fixed
number of primary shards, individual rows can be routed to a fixed shard number with a
simple formula:
shard_number = hash(routing_value) % number_of_primary_shards
When hash values are evenly distributed (which is typically the case), rows will be evenly
distributed amongst the fixed number of available shards.
The routing column, specified during table creation, determines the shard allocation. All
rows with the same value in the routing column are stored in the same shard. If a primary
key is defined, it serves as the default routing column; otherwise, the internal document ID is
used. If the routing column is explicitly defined, it must match a primary key column.
Replication
Replication in CrateDB operates at the table level, ensuring that each table shard has the
configured number of copies available. In this setup, there is always one primary shard, and
the remaining shards are designated as replica shards.
Write operations are directed exclusively to the primary shard, while read operations
can be distributed across any shard.
CrateDB efficiently synchronizes data from the primary shard to all replica shards,
benefiting from the append-only characteristic of Lucene segments.
In the event of a primary shard loss due to node failure, CrateDB automatically
promotes a replica shard to primary status.
Replication not only enhances fault tolerance but also improves read performance by
increasing the number of shards distributed across a cluster. This increase provides more
opportunities for CrateDB to parallelize query execution across multiple nodes.
The figures below illustrate the concept, with a table split into multiple shards and
distributed evenly across a cluster. The first example shows table t1 with shards s1 to s3:
Table with Three Shards Distributed Across Nodes
Data is replicated at a shard level, as depicted in the second figure showing table t1 with
three shards (s1 to s3) and one replica (r1 to r3) each:
Table with Three Shards and One Replica Each Distributed Across Nodes
The third illustration demonstrates a table split into two partitions, using the month as the
partition column. Adding data for more months automatically creates a new partition:
Partitioned Table with Three Shards per Partition
Read operations are executed on primary shards and their replicas to increase query
distribution and enhance performance. CrateDB randomly assigns a shard when routing
an operation, with the option to configure this behavior.
Any replica shard that fails to write data or times out (step 5 in the figure below) is
immediately considered unavailable.
Writing and Replication in CrateDB
Consistency Levels
CrateDB accesses documents internally through 'get' queries by primary key and
'search' queries by any secondary attribute, transparent to the user.
For search operations, CrateDB adopts an Eventual Consistency model rooted
in the concept of shared IndexReaders, which provide caching and reverse lookup
capabilities for shards. An IndexReader is bound to the Lucene segment it originated
from and requires periodic refreshes to reflect new changes. This refresh can occur
either in a time-based manner, configurable by the user, or manually triggered using
the REFRESH TABLE command. Therefore, a search operation only recognizes a
change if the corresponding IndexReader was refreshed after the change occurred.
In contrast, Strong Consistency is ensured for operations by primary key
when the shard routing key and row identifier can be computed from the query
specification, such as when the full primary key is defined in the WHERE clause. This
approach, which accesses a document through a single shard and a straightforward index
lookup on the '_id' field, proves to be the most efficient method.
If a query specification results in such a 'get' operation, changes are immediately visible.
This is achieved by first looking up the document in the translog, which always holds the
most recent version of the document. This mechanism enables the common 'fetch' and
'update' use case. For instance, if a client updates a row and subsequently looks up that
row by its primary key, the changes will be immediately visible, retrieved directly from the
translog without requiring a refresh of the IndexReader.
Advanced Indexing
When ingesting data, the queries you will make over time may not be clear initially, and
use cases tend to evolve, leading to new requirements and query patterns.
To ensure fast query responses for any query type, CrateDB automatically indexes every
attribute by default. Depending on the data type, the following strategies are applied:
Inverted Index for text values: Facilitates efficient search for precise text matches,
including support for wildcards and regular expressions.
Block k-d trees (BKD) for numeric, date, and geospatial values: Highly efficient
indexes designed for optimal IO. Most of the data structure resides in on-disk blocks,
with a small in-heap binary tree structure for locating blocks at search time. This
design ensures excellent query and update performance regardless of the number of
updates performed.
HNSW (Hierarchical Navigable Small World) graphs for high dimensional
vectors: Enables efficient approximate nearest neighbor search, commonly known as
similarity search.
Additionally, full-text indexes can be added on demand to unlock features like fuzzy search,
phrase search, and attribute boosting. CrateDB supports more than 30 languages and provides
11 analyzers, 15 tokenizers, and more than 35 token filters, along with the flexibility to
define custom analyzers and tokenizers.
Columnar Storage
In conjunction with advanced indexing strategies, CrateDB adopts a columnar storage
approach that facilitates fast queries and complex aggregations across large data sets:
Storing data for the same field together optimizes file-system cache utilization and
avoids loading data for fields the query does not reference.
By segmenting data into blocks and incorporating metadata about the range or set of
unique values in the block header, certain queries may entirely skip unnecessary
blocks during execution.
Implementation of specific techniques allows querying data without decompressing it
first.
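The block-skipping idea can be sketched in a few lines of Python. The block sizes and metadata layout are simplified assumptions for illustration, not CrateDB's on-disk format.

```python
# Each block carries min/max metadata in its header, so a range query can
# skip blocks whose value range cannot possibly match.
blocks = [
    {"min": 0, "max": 99, "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def query_greater_than(threshold):
    hits, blocks_scanned = [], 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # whole block skipped via header metadata
        blocks_scanned += 1
        hits.extend(v for v in block["values"] if v > threshold)
    return hits, blocks_scanned

hits, blocks_scanned = query_greater_than(250)
print(len(hits), blocks_scanned)  # only one of three blocks was scanned
```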
We will explore how CrateDB’s query engine is architected to optimize data throughput
and query performance, particularly as the number of concurrent operations grows. Key
features, including distributed query processing, advanced indexing techniques,
real-time data ingestion, and real-time querying, synergize to deliver a seamless and
high-performance user experience.
Distributed Writes
At its core, CrateDB employs a sharding mechanism to distribute data across multiple
nodes in a cluster. Upon ingestion, data is sharded, i.e. broken into self-contained
subsets known as shards. This approach optimally distributes the workload across
available CPUs, even on a single node.
The sharding strategy enables parallel write operations, allowing independent and
concurrent writing to each shard on different nodes. This distribution prevents any single
node from becoming a bottleneck, improving write throughput and scalability.
Each shard in CrateDB is associated with a Lucene index, a structure storing data in
immutable segments. When new data is written, it first goes to an in-memory buffer. Upon
reaching a specific size, this data is flushed and appended to a new Lucene segment on
disk. This append-only approach ensures efficient write operations without the need to
rewrite or rearrange existing data or table-level locks.
In addition to sharding, CrateDB replicates data at the shard level across various nodes,
ensuring data availability and fault tolerance. Write operations are deemed successful
only when the data is written to the primary shard and replicated to the required number
of replica shards, as per the defined replication factor. This replication, coupled with the
distributed nature of sharding, contributes to the robustness and reliability of write
operations in CrateDB.
Distributed Reads
In addition to the advanced mechanisms mentioned earlier, such as columnar storage,
efficient indexing per datatype, full-text indexes, and similarity search, achieving
outstanding performance in CrateDB involves the distribution of reads and the lock-free
execution of writes and reads.
CrateDB’s design, focused on distribution, leads to operations being split across shards
and their replicas by the query planner. This strategy accelerates aggregations by
selecting only the necessary data partitions, utilizing available hardware on individual
nodes, distributing queries across all nodes, and pushing down aggregations to multiple
nodes to reduce pressure during the merge step of query execution.
Executing exact aggregations in distributed databases presents challenges but holds
significant potential to enhance query speed. CrateDB’s approach to aggregation involves
four steps: collect, reshuffle, aggregate, and merge. The reshuffling step occurs in
CrateDB’s unique distribution layer, redistributing intermediate responses between nodes
to expedite processing. While each request is sent to all nodes, the responses pass
through the distribution layer. Each node hashes the returned values and distributes each
row, along with the values, to other nodes based on the hash value. Each node can then
reduce its distinct part of the data to the aggregated values, sometimes even to a single
value, as it knows that no other node has data with the same values. This innovative
approach helps optimize the execution of aggregations in a distributed environment.
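The collect, reshuffle, aggregate, and merge steps can be simulated in Python for a COUNT(DISTINCT)-style aggregation. The node counts and values below are made up; the point is that after hash-based reshuffling, each node owns a disjoint slice of the values and can aggregate independently.

```python
# Simulated distributed COUNT(DISTINCT): values are collected per node,
# reshuffled by hash so each node owns a disjoint slice, aggregated
# locally, then merged by summing the partial results.
node_data = [          # step 1: collect (values gathered on each node)
    ["a", "b", "a", "c"],
    ["b", "c", "d"],
    ["a", "d", "e"],
]
NUM_NODES = len(node_data)

# Step 2: reshuffle, routing every value to the node owning its hash bucket.
buckets = [set() for _ in range(NUM_NODES)]
for values in node_data:
    for v in values:
        buckets[hash(v) % NUM_NODES].add(v)

# Steps 3 and 4: aggregate per node, then merge. Since the buckets are
# disjoint, the partial counts can simply be summed.
partial_counts = [len(b) for b in buckets]
total_distinct = sum(partial_counts)
print(total_distinct)
```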
Security
Data encryption
CrateDB provides the capability to encrypt internal communication between CrateDB
nodes and external communication with HTTP and PostgreSQL clients using Transport
Layer Security (TLS).
CrateDB strongly recommends the use of encrypted disks for storing data, and this is the
default setting in CrateDB’s managed cloud offering. This emphasis on encryption helps
enhance the overall security and confidentiality of data in transit and at rest within the
CrateDB environment.
Authentication
Comprehensive authentication mechanisms are in place to ensure secure access:
User Authentication: Authentication is required to access the database. It can be
done by submitting a username/password or using a client certificate.
TLS for Secure Connections: Connections can be secured through Transport Layer
Security (TLS). This is crucial for ensuring encrypted communication between clients
and the CrateDB server.
Password Security: Passwords are securely stored with a per-user salt and hashed
using the PBKDF2 key derivation function and the SHA-512 hash algorithm.
Host-Based Authentication: Host-based authentication in CrateDB adds an extra
layer of security by verifying the username, IP address, protocol, and connection
scheme. Trust-based authentication is recommended for secure network
environments and during development.
Node-to-Node Authentication: Internal node-to-node communication is secured
either through trust-based authentication or by using certificates.
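The password-storage scheme described above, a per-user salt plus PBKDF2 with SHA-512, can be sketched with Python's standard library. The iteration count below is illustrative only, not CrateDB's setting.

```python
import hashlib
import os

# Per-user random salt plus PBKDF2 key derivation with SHA-512.
password = b"s3cret"
salt = os.urandom(16)
stored_hash = hashlib.pbkdf2_hmac("sha512", password, salt, 100_000)

# Verification repeats the derivation with the stored salt:
attempt = hashlib.pbkdf2_hmac("sha512", b"s3cret", salt, 100_000)
print(attempt == stored_hash)  # True
```

The salt ensures that identical passwords produce different stored hashes, and the iterated derivation makes brute-force attacks expensive.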
Authorization
CrateDB employs Role-Based Access Control (RBAC) to manage user access and
privileges.
RBAC is used to associate users with specific roles.
Each role is assigned specific privileges, defining the degree of access to various
resources within CrateDB.
Privileges can be granted at different levels, including the cluster, schema, and
table/view levels.
Auditing
Queries performed on CrateDB are logged in internal tables and can be exported to log
files on disk for further analysis or integration into SIEM systems. This allows
administrators to ensure that system management tasks are being appropriately
performed. Monitoring information can be collected and reported using JMX and
forwarded, for example, to Prometheus, a popular cloud-native monitoring tool.
Flexible Deployment Models
In today's diverse technological landscape, the one-size-fits-all approach to database
deployment is no longer sufficient. Enterprises have unique requirements based on their
operational, security, and regulatory needs. Recognizing this diversity, CrateDB offers a
range of flexible deployment options, ensuring that regardless of your environment –
whether it’s at the Edge, on-premises, hybrid, or in the cloud – there is a solution that
aligns seamlessly with your infrastructure and business goals.
Containerized
CrateDB facilitates seamless deployment and scaling in diverse environments by offering
support for containerization and orchestration tools such as Docker and Kubernetes.
This enhances operational flexibility and ensures adaptability to varying infrastructure
requirements.
Hybrid
Best of Both Worlds: Hybrid deployments offer a balanced solution, combining the
control and security of on-premises systems with the scalability and flexibility of cloud
environments. CrateDB seamlessly integrates with this model, allowing sensitive data
to be stored on-premises while leveraging cloud resources for scalable computing
and storage.
Edge
Optimized for real-time analytics and Edge AI: CrateDB is uniquely positioned for
edge computing environments, especially in IoT and real-time analytics scenarios.
Deploying directly at the Edge, CrateDB facilitates immediate data processing,
reducing latency and bandwidth constraints associated with data transfer to
centralized systems. This ensures faster insights and decision-making at the point of
data generation.
Logical Replication
Consolidating Data for Aggregated Reports: Ideal for aggregating data from
multiple clusters to generate comprehensive reports.
Ensuring High Availability: Logical replication facilitates high availability by
redistributing data if one cluster becomes unavailable.
Replicating Across Different CrateDB Versions: Logical replication supports
replication between different compatible versions of CrateDB. However, replicating
tables created on a cluster with a higher major/minor version to a cluster with a lower
major/minor version is not supported.
Zone-Aware Deployments
Improved Performance and Reduced Latency: CrateDB leverages its awareness
of the geographical distribution of data to route queries and manage data
replication, intelligently minimizing latency. This is particularly advantageous for
global applications that require fast data access from specific geographic
locations.
Enhanced Resilience: Zone-aware deployments significantly contribute to the
resilience of the database. By distributing data across multiple zones, CrateDB
ensures high availability and durability, safeguarding against data center outages or
regional disruptions.
Ecosystem Integrations
In the dynamic landscape of data management and analytics, a database’s capability to
seamlessly integrate with various components of the technology ecosystem is crucial.
This section outlines how CrateDB fits into a broader technology ecosystem, highlighting
its strengths in both ingesting and feeding data. It emphasizes the adaptability and
openness of CrateDB in integrating with various tools and platforms.
Programming Languages
CrateDB extends support to a range of programming languages through dedicated
drivers and integrations.
Java Integration: For Java, CrateDB provides a JDBC driver, ensuring seamless
integration with numerous Java applications and frameworks. This makes it the
optimal choice for Java developers seeking to harness CrateDB's scalability and
performance.
Python Integration: In the realm of Python, a language prominent in data analytics
and machine learning, CrateDB offers a specialized Python driver and an
SQLAlchemy dialect. This driver facilitates smooth integration with Python’s extensive
ecosystem of data processing and analytics libraries, positioning CrateDB as an
appealing option for data scientists and Python developers.
.NET Compatibility: .NET developers can utilize the Npgsql driver to connect to and
interact with CrateDB, leveraging familiar interfaces within their applications.
Web Development and Scripting Languages: For web development and scripting
languages such as Node.js, PHP, and Ruby, CrateDB is accessible through
PostgreSQL-compatible drivers. The PostgreSQL Wire Protocol enables developers in
these languages to interact with CrateDB, capitalizing on the familiarity and stability of
established drivers.
Go Programming Language: CrateDB supports the Go programming language
through the PostgreSQL Wire Protocol, ensuring efficiency in building high-
performance applications. This compatibility enables Go developers to seamlessly
incorporate CrateDB into their applications, benefiting from its distributed architecture
and real-time data processing capabilities.
RESTful HTTP Interface: CrateDB offers a RESTful HTTP interface, providing a
versatile and straightforward option for virtually any programming language capable
of making HTTP requests. This ensures accessibility even in languages where
specific CrateDB drivers are not available.
SQL- and API-based Integrations
CrateDB provides SQL-based access through the PostgreSQL Wire Protocol, rendering it
compatible with any tool or application supporting SQL queries. Additionally, its HTTP
interface facilitates API integrations by allowing the execution of SQL statements over
HTTP, ensuring versatile and seamless interaction with various applications and services.
Machine Learning
With its Python client library, CrateDB offers an intuitive interface for direct interaction
with databases from Python scripts, facilitating the training, evaluation, and deployment of
machine learning models. This includes reading data into data frames for preprocessing,
executing SQL queries for data manipulation, and writing predictions back to the
database. The combination of CrateDB and Python streamlines the workflow for data
scientists, allowing them to focus more on model development and less on data
management.
Moreover, CrateDB is an ideal backend for model tracking with MLflow, providing a
robust solution for managing the entire machine learning lifecycle. Its seamless
integration with orchestration tools further enables the creation of end-to-end workflows.
Getting started with CrateDB
CrateDB Cloud & Free Tier
Begin your journey with CrateDB by exploring CrateDB Cloud. This offers a convenient
way to experience CrateDB's capabilities right away with nothing to install. Access
tutorials and guides on http://cratedb.com to get hands-on experience with the cloud
platform.
In addition to the official resources, CrateDB has a vibrant and supportive community.
Engaging with the community is a great way to get insights, tips, and real-world examples
of CrateDB implementations. Whether you're encountering a technical challenge or
seeking innovative ways to use CrateDB, the community is a valuable resource for
shared knowledge and collaborative problem-solving: https://community.cratedb.com/
CrateDB is an open source, multi-model, and distributed
database that offers high performance, scalability, and
flexibility. Our team is on a mission to develop reliable and
scalable database technology where response time is never an
issue, regardless of the complexity and volume of data.