A Unified Data Infrastructure Architecture
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output
Connectors Data Warehouse Dashboards
OLTP Databases (Looker, Superset,
(Fivetran, Stitch, (Snowflake, BigQuery, Redshift)
via CDC Matillion) Mode, Tableau)
Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS
Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)
Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)
Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
Interpreting the Architecture
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output
Generate relevant Extract data from Store data in a Present results of
Provide an interface for analysts and data scientists
business and operational systems format accessible to data analysis to
to derive insights (query)
operational data (E) query & processing internal and
systems external users
Execute queries and data models against stored
Deliver to storage,
data, often using distributed compute (processing)
aligning schemas Optimize for low Embed data models
between source cost, scalability, and into operational
and destination (L) analytic workloads systems and
(e.g., column store) applications
Transform data to a
structure ready for In some cases,
analysis (T) provide additional
data structures or
guarantees Describe what Predict what will
happened in the happen in the future
past (including very
recent past) Build data-driven/
ML applications
Coordinate the flow of data and the execution of computations across the full lifecycle
Ensure proper data quality, performance, and governance of all systems and datasets
Three Common Blueprints
Analytic
1 Modern Business Intelligence
Systems
2 Multimodal Data Processing
Operational
3 AI and ML
Systems
1. Modern Business Intelligence Blueprint
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output
Connectors Data Warehouse Dashboards
OLTP Databases (Looker, Superset,
(Fivetran, Stitch, (Snowflake, BigQuery, Redshift)
via CDC Matillion) Mode, Tableau)
Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS
Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)
Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)
Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
2. Multimodal Data Processing Blueprint
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output
Connectors Data Warehouse Dashboards
OLTP Databases (Looker, Superset,
(Fivetran, Stitch, (Snowflake, BigQuery, Redshift)
via CDC Matillion) Mode, Tableau)
Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS
Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)
Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)
Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
3. AI and ML Blueprint
Data Transformation Model Training and Development Model Inference
Data Labeling
(Labelbox, Snorkel,
Scale, Sagemaker)
Data Sources
(Data lake + Dataflow Automation
data warehouse + (Airflow, Pachyderm, Elementl, Prefect, Tecton, Kubeflow)
streaming engine)
Query Engines Feature Store Feature Server
(Presto, Hive) (Tecton) (Tecton, Cassandra)
Data Science
Libraries
(Spark, Pandas,
NumPy, Dask)
Data Science Platform Model Batch Predictor
(Jupyter, Databricks, Domino, Sagemaker, DataRobot, Registry (Spark)
H2O, Colab, Deepnote, Noteable) (Algorithmia,
MLflow,
Sagemaker) Online Model Clients
Server
Experiment ML (TF Serving, Ray
Tracking Framework Compiler Serve, Seldon)
(Weights and (Scikit-learn, (TVM)
Biases, Comet, XGBoost, MLlib)
MLflow)
Model
DL Monitoring
Visualization Framework (Fiddler, Arthur,
(Tensorboard, (TensorFlow, Keras, Arize)
Fiddler) PyTorch, H2O)
Model Tuning RL Libraries
(Sigopt, hyperopt, (Gym, Dopamine,
Ray Tune) RLlib, Coach)
Distributed
Processing
(Spark, Ray, Dask,
Distributed TF,
Kubeflow,
Horovod)