Data Report Martin Inline Graphics R7 PDF

The document describes a unified data architecture with components for ingesting data from various sources, storing it in warehouses and lakes, and running queries, analytics, and models on historical and current data to generate business insights and predictions. The architecture supports ingesting, transforming, and storing large volumes of structured and unstructured data, as well as querying, analyzing, and visualizing that data with various tools.

Uploaded by

Soumya Ranjan

A Unified Data Infrastructure Architecture

Sources:
    OLTP Databases via CDC
    Applications/ERP (Oracle, Salesforce, Netsuite, ...)
    Event Collectors (Segment, Snowplow)
    Logs
    3rd Party APIs (e.g., Stripe)
    File and Object Storage

Ingestion and Transformation:
    Connectors (Fivetran, Stitch, Matillion)
    Data Modeling (dbt, LookML)
    Workflow Manager (Airflow, Dagster, Prefect)
    Spark Platform (Databricks, EMR)
    Python Libs (Pandas, Boto, Dask, Ray, ...)
    Batch Query Engine (Hive)
    Event Streaming (Confluent/Kafka, Pulsar, AWS Kinesis)
    Stream Processing (Databricks/Spark, Confluent/Kafka, Flink)

Storage:
    Data Warehouse (Snowflake, BigQuery, Redshift)
    Data Lake (Databricks/Delta Lake, Iceberg, Hudi, Hive Acid)
    Parquet, ORC, Avro
    S3, GCS, ABS, HDFS

Query and Processing, Historical:
    Ad Hoc Query Engine (Presto, Dremio/Drill, Impala)
    Real-time Analytics (Imply/Druid, Altinity/Clickhouse, Rockset)

Query and Processing, Predictive:
    Data Science Platform (Databricks, Domino, Sagemaker, Dataiku, DataRobot, Anaconda, ...)
    Data Science and ML Libraries (Pandas, Numpy, R, Dask, Ray, Spark, ... Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...)

Output:
    Dashboards (Looker, Superset, Mode, Tableau)
    Embedded Analytics (Sisense, Looker, cube.js)
    Augmented Analytics (Thoughtspot, Outlier, Anodot, Sisu)
    App Frameworks (Plotly Dash, Streamlit)
    Custom Apps

Metadata Management (Collibra, Alation, Hive, Metastore, DataHub, ...)
Quality and Testing (Great Expectations)
Entitlements and Security (Privacera, Immuta)
Observability (Unravel, Accel Data, Fiddler)
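
The Workflow Manager component (Airflow, Dagster, Prefect) runs pipeline steps in dependency order. The sketch below illustrates that idea with the Python standard library only; it is not the API of any of those tools, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dependencies = {
    "extract_orders": set(),
    "load_warehouse": {"extract_orders"},
    "build_models": {"load_warehouse"},
    "refresh_dashboard": {"build_models"},
    "train_model": {"build_models"},
}


def run_in_order(deps, run_step):
    """Run every step after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(deps).static_order())
    for step in order:
        run_step(step)
    return order


executed = []
order = run_in_order(dependencies, executed.append)
```

A real workflow manager adds scheduling, retries, and monitoring on top of this ordering; the sketch covers only the dependency resolution.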
Interpreting the Architecture
Sources:
    Generate relevant business and operational data

Ingestion and Transformation:
    Extract data from operational systems (E)
    Deliver to storage, aligning schemas between source and destination (L)
    Transform data to a structure ready for analysis (T)

Storage:
    Store data in a format accessible to query & processing systems
    Optimize for low cost, scalability, and analytic workloads (e.g., column store)
    In some cases, provide additional data structures or guarantees

Query and Processing:
    Provide an interface for analysts and data scientists to derive insights (query)
    Execute queries and data models against stored data, often using distributed compute (processing)
    Historical: Describe what happened in the past (including very recent past)
    Predictive: Predict what will happen in the future
    Predictive: Build data-driven/ML applications

Output:
    Present results of data analysis to internal and external users
    Embed data models into operational systems and applications

Coordinate the flow of data and the execution of computations across the full lifecycle

Ensure proper data quality, performance, and governance of all systems and datasets
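
The extract (E), load (L), and transform (T) steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, and the source rows, table names, and column names are all hypothetical.

```python
import sqlite3

# Hypothetical rows standing in for an operational source system.
SOURCE_ROWS = [
    ("o-1", "2024-01-01", 1250),
    ("o-2", "2024-01-01", 800),
    ("o-3", "2024-01-02", 4100),
]


def extract():
    """E: pull raw records from the (hypothetical) operational source."""
    return list(SOURCE_ROWS)


def load(conn, rows):
    """L: deliver rows to storage, matching the destination schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders ("
        "order_id TEXT PRIMARY KEY, order_date TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """T: reshape stored data into a structure ready for analysis."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        "CREATE TABLE daily_revenue AS "
        "SELECT order_date, SUM(amount_cents) / 100.0 AS revenue "
        "FROM raw_orders GROUP BY order_date"
    )


def run_pipeline():
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    load(conn, extract())
    transform(conn)
    query = "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    return conn.execute(query).fetchall()


daily = run_pipeline()
```

The steps run in the order the slide lists them: data is loaded into storage first and transformed there afterwards.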
Three Common Blueprints

1 Modern Business Intelligence
2 Multimodal Data Processing
3 AI and ML

The slide labels the first end of this list "Analytic Systems" and the last end "Operational Systems".
1. Modern Business Intelligence Blueprint
[Figure: the unified data infrastructure diagram shown under "A Unified Data Infrastructure Architecture", repeated for this blueprint.]
2. Multimodal Data Processing Blueprint
[Figure: the unified data infrastructure diagram shown under "A Unified Data Infrastructure Architecture", repeated for this blueprint.]
3. AI and ML Blueprint

Stages, left to right: Data Transformation, Model Training and Development, Model Inference

Components:
    Data Labeling (Labelbox, Snorkel, Scale, Sagemaker)
    Data Sources (Data lake + data warehouse + streaming engine)
    Dataflow Automation (Airflow, Pachyderm, Elementl, Prefect, Tecton, Kubeflow)
    Query Engines (Presto, Hive)
    Data Science Libraries (Spark, Pandas, NumPy, Dask)
    Feature Store (Tecton)
    Feature Server (Tecton, Cassandra)
    Data Science Platform (Jupyter, Databricks, Domino, Sagemaker, DataRobot, H2O, Colab, Deepnote, Noteable)
    Experiment Tracking (Weights and Biases, Comet, MLflow)
    Visualization (Tensorboard, Fiddler)
    Model Tuning (Sigopt, hyperopt, Ray Tune)
    ML Framework (Scikit-learn, XGBoost, MLlib)
    DL Framework (TensorFlow, Keras, PyTorch, H2O)
    RL Libraries (Gym, Dopamine, RLlib, Coach)
    Distributed Processing (Spark, Ray, Dask, Distributed TF, Kubeflow, Horovod)
    Model Registry (Algorithmia, MLflow, Sagemaker)
    Compiler (TVM)
    Batch Predictor (Spark)
    Online Model Server (TF Serving, Ray Serve, Seldon)
    Model Monitoring (Fiddler, Arthur, Arize)
    Clients
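
The path from model training through a model registry to a batch predictor can be sketched in plain Python. This is an illustration only, not the API of any tool named in the blueprint: `LinearModel`, `ModelRegistry`, `batch_predict`, and the sample data are hypothetical, and a one-feature least-squares fit stands in for a real ML framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


def train(xs, ys):
    """Model training: ordinary least squares fit for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return LinearModel(slope, mean_y - slope * mean_x)


class ModelRegistry:
    """Hypothetical in-memory registry that stores versioned models by name."""

    def __init__(self):
        self._versions = {}

    def register(self, name, model):
        versions = self._versions.setdefault(name, [])
        versions.append(model)
        return len(versions)  # 1-based version number

    def latest(self, name):
        return self._versions[name][-1]


def batch_predict(registry, name, inputs):
    """Batch predictor: score many inputs with the latest registered model."""
    model = registry.latest(name)
    return [model.predict(x) for x in inputs]


registry = ModelRegistry()
version = registry.register("demand", train([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))
predictions = batch_predict(registry, "demand", [4.0, 5.0])
```

Keeping the registry between training and inference means the predictor never needs to know how a model was trained, only which name to look up.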
