Azure Data Factory
Cloud version of SSIS
Copy Data
More than 80 connectors to different services are available
Transform Data
With the newly added Data Flow feature, Data Factory is now a complete cloud-based ETL tool.
Definition:
Azure Data Factory (ADF) is a hybrid data integration service
that enables you to quickly and efficiently create automated
data pipelines without having to write any code!
➢ Hybrid Data Integration Service
➢ Simplifies ETL at scale
➢ Enables modern data integration
➢ Drag and drop interface
➢ Over 80 connectors available
➢ Move, transform and save data
➢ Managed Service
Azure Data Factory
➢ Create data-driven workflows
➢ Orchestrate and automate data movement
➢ Transform and store data
➢ Operationalize the process
➢ ETL or ELT scenarios
Data Factory in the Azure Ecosystem
01 Migration? Data Factory excels at periodic data loads and transformations rather than one-off migrations.
02 Streaming? ADF can orchestrate, but there are other dedicated services for streaming.
03 Transformations? Data flows for simple ones, but you can use Databricks or HDInsight for more complex transforms.
SSIS vs Data Factory
SSIS
➢ More code-free transformations
➢ On-premises connectors (e.g. Excel)
Data Factory
➢ Much higher scalability
➢ Cloud and SaaS connectors
➢ Event-based triggers
➢ Can use SSIS packages
Data Factory considerations
➢ Two versions: ADF V2 is the current and improved version
➢ Build options: PowerShell, .NET, Python, REST, ARM
➢ Highly integrated: DevOps, Key Vault, Monitor, Automation
➢ No data storage: you need to persist data by the end
➢ Security standards: HTTPS/TLS whenever possible
Azure Data Factory Components
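[Figure: furniture-delivery analogy for ADF components. A delivery manager and a delivery man move a cabinet from a shop to a house (disassembly, delivery, assembly), using address and key details and cabinet info. The ADF counterparts shown are a Data Factory pipeline, an integration runtime, a Copy Activity, Blob Storage, Order.csv, and an Order table.]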
Data Factory vs SSIS
Azure Data Factory    SSIS
Pipeline              Package
Linked Service        Connection manager
Source                Source
Sink                  Destination
Activity              Control flow task
Data Flow             Data flow
Data Factory Pipeline
➢ A data factory can contain one or more pipelines
➢ A pipeline is a logical group of activities
➢ Activities in a pipeline are managed as a set
➢ One pipeline can have one or more activities
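A minimal sketch of how a pipeline groups activities, written as the JSON document that ADF stores for a pipeline. The names "LoadOrdersPipeline", "CopyOrders", "OrderCsv" and "OrderTable" are hypothetical, and the source and sink types are assumptions chosen for a CSV-to-SQL copy.

```python
import json

# One activity: a Copy activity that reads one dataset and writes another.
# All names below are illustrative placeholders, not taken from a real factory.
copy_activity = {
    "name": "CopyOrders",
    "type": "Copy",
    "inputs": [{"referenceName": "OrderCsv", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "OrderTable", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "DelimitedTextSource"},  # assumed source type
        "sink": {"type": "AzureSqlSink"},           # assumed sink type
    },
}

# The pipeline is the logical group: a name plus a list of activities.
pipeline = {
    "name": "LoadOrdersPipeline",
    "properties": {"activities": [copy_activity]},
}

pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Adding a second activity means appending another dictionary to the "activities" list, which is what "managed as a set" amounts to in the definition.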
Azure Data Factory Activities
• Represent a processing step in a pipeline
• Actions to perform on data
  • Ingest data
  • Transform data
  • Store data
• Can be linked
  • Execute sequentially or
  • Run in parallel
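The sequential and parallel behaviour of linked activities can be sketched as a dependency graph. This is only an illustration of the idea, not how ADF schedules work internally: the activity names are hypothetical, and the grouping into "waves" is an assumption made for the example (in ADF itself, links are declared through each activity's dependsOn list).

```python
from graphlib import TopologicalSorter

# Hypothetical activities, each mapped to the set of activities it depends on.
depends_on = {
    "CopyOrders": set(),
    "CopyCustomers": set(),
    "TransformSales": {"CopyOrders", "CopyCustomers"},
    "StoreResults": {"TransformSales"},
}

def execution_waves(graph):
    """Group activities into waves; members of one wave could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The two copy activities have no dependencies, so they land in the same wave, while the transform and store steps run one after another.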
Activity types
01 Data movement activities
Copy data among data stores located on-premises and in the cloud
Data stores: Blob storage, Cosmos DB, Amazon Redshift, Google BigQuery, Hive, MariaDB, etc.
02 Data transformation activities
Transform and enrich data
e.g. Hive, Pig, MapReduce, Spark or Databricks
03 Control activities
Control pipeline flow
e.g. ForEach, Web
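As a rough sketch of a control activity, the helper below builds a ForEach definition that wraps one inner activity and runs it once per item. The helper name, the activity names and the "fileNames" pipeline parameter are hypothetical; the inner Copy activity is abbreviated to its name and type.

```python
import json

def for_each(name, items_expression, inner_activity, sequential=False):
    """Build a ForEach control activity that repeats inner_activity per item."""
    return {
        "name": name,
        "type": "ForEach",
        "typeProperties": {
            # Expression evaluated at run time to produce the list to loop over.
            "items": {"value": items_expression, "type": "Expression"},
            # False lets iterations run in parallel; True forces one at a time.
            "isSequential": sequential,
            "activities": [inner_activity],
        },
    }

# Abbreviated, hypothetical inner activity.
copy_one_file = {"name": "CopyOneFile", "type": "Copy"}

loop = for_each("ForEachFile", "@pipeline().parameters.fileNames", copy_one_file)
print(json.dumps(loop, indent=2))
```

Inside the loop, the current element is available to the inner activity through the @item() expression.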
Data Flows
• Data Flow is a new feature of Azure Data Factory (ADF) that allows you to develop graphical data transformation logic that can be executed as activities within ADF pipelines.
• Two types:
  • Mapping
  • Wrangling
Dataset
➢ Simply points to or references the data
➢ References the data used in an Activity
➢ Files
➢ Folders
➢ Documents
➢ Tables
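A dataset only describes where the data lives and what shape it has. A minimal sketch of a delimited-text dataset pointing at a file in Blob storage follows; the dataset name, linked service name and container are hypothetical.

```python
import json

# Hypothetical dataset: a CSV file named Order.csv in a Blob storage container.
order_csv_dataset = {
    "name": "OrderCsv",
    "properties": {
        "type": "DelimitedText",
        # The dataset holds no credentials; it refers to a linked service by name.
        "linkedServiceName": {
            "referenceName": "BlobStorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "orders",      # assumed container name
                "fileName": "Order.csv",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}

print(json.dumps(order_csv_dataset, indent=2))
```

Pointing the same dataset at a folder instead of a single file would mean replacing "fileName" with a "folderPath" entry in the location.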
Linked service
➢ Similar to a connection string
➢ Represents the connection information needed to connect to external resources
➢ Data stores like Azure SQL Server
➢ Compute resources, e.g. Spark cluster
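The relationship between a linked service and the datasets that use it can be sketched as follows. The linked service name, the placeholder connection string and the helper function are hypothetical; the helper is a local sanity check written for this example, not an ADF API.

```python
# Hypothetical linked service: connection information for a Blob storage account.
blob_linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<connection string placeholder>"},
    },
}

# Minimal hypothetical datasets that refer to linked services by name.
datasets = [
    {"name": "OrderCsv", "properties": {"linkedServiceName": {
        "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference"}}},
    {"name": "OrderTable", "properties": {"linkedServiceName": {
        "referenceName": "SqlLinkedService", "type": "LinkedServiceReference"}}},
]

def missing_linked_services(datasets, linked_services):
    """Return names of linked services that datasets reference but nobody defines."""
    known = {service["name"] for service in linked_services}
    referenced = {
        ds["properties"]["linkedServiceName"]["referenceName"] for ds in datasets
    }
    return sorted(referenced - known)

missing = missing_linked_services(datasets, [blob_linked_service])
print(missing)
```

Here "SqlLinkedService" is reported because only the Blob storage linked service has been defined.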
ADF Components
Integration Runtimes
➢ Provides fully managed, serverless compute infrastructure
➢ You don't have to worry about infrastructure provisioning, software installation, patching, or capacity scaling.
➢ Pay only for the duration of actual use
➢ Bridges between the activity and the linked service
➢ Activity defines the action
➢ Linked service defines the location
Integration Runtimes
➢ Data Integration Capabilities
➢ Data Flow
➢ Data Movement
➢ Format conversion, column mapping, serialization/deserialization, etc.
➢ Provides the native compute to move data between cloud data stores in a secure, reliable, and high-performance manner.
➢ Activity dispatch (e.g. Databricks Notebook, HDInsight Hive, Pig, Spark activity, stored procedure, ADL Analytics U-SQL activity)
➢ SSIS package execution
Integration Runtimes
Specify the infrastructure to run activities
Azure Integration Runtime
Works on public networks
Responsible for data flows, data movement, and activity dispatch
Self-hosted Integration Runtime
Works on public and private networks
Provides data movement and activity dispatch capabilities
Must be installed on an on-premises machine or a virtual machine inside a private network
SSIS Integration Runtime
Supports SSIS package execution
Works on public and private networks
Integration Runtimes
➢ Default IR: AutoResolveIntegrationRuntime
➢ Create an Azure IR
  ➢ When you want to explicitly define the location of the IR
  ➢ When you want to virtually group activity executions on different IRs for management purposes
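One way to picture how a specific integration runtime is chosen: a linked service definition can name the IR it connects through. The sketch below adds such a reference to a linked service dictionary; the helper function, the linked service and the IR name "WestEuropeAzureIR" are hypothetical.

```python
import copy

def with_integration_runtime(linked_service, ir_name="AutoResolveIntegrationRuntime"):
    """Return a copy of a linked service definition pinned to the named IR."""
    pinned = copy.deepcopy(linked_service)  # leave the caller's definition untouched
    pinned["properties"]["connectVia"] = {
        "referenceName": ir_name,
        "type": "IntegrationRuntimeReference",
    }
    return pinned

# Hypothetical linked service with no explicit IR.
sql_linked_service = {
    "name": "SqlLinkedService",
    "properties": {"type": "AzureSqlDatabase", "typeProperties": {}},
}

default_ir = with_integration_runtime(sql_linked_service)
regional_ir = with_integration_runtime(sql_linked_service, "WestEuropeAzureIR")
print(regional_ir["properties"]["connectVia"])
```

Pinning several linked services to the same named IR is one way to group their activity executions for management purposes.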
Triggers
➢ Execute a pipeline
➢ Many-to-many relationship between pipelines and triggers
➢ Three types of trigger
➢ Schedule Trigger: invokes a pipeline on a wall-clock schedule
➢ Tumbling Window Trigger: operates on a periodic interval and also retains state
  ➢ One-to-one relationship with a pipeline
  ➢ Advanced configuration options: dependencies, delay, retry, concurrency
  ➢ Properties: trigger().outputs.windowStartTime / windowEndTime
➢ Event-based Trigger: triggers a pipeline in response to an event
  ➢ e.g. arrival/deletion of a file in Blob storage
  ➢ Event trigger with the Azure Event Grid service
  ➢ Properties: triggerBody().folderPath / fileName
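The idea behind a tumbling window can be sketched in a few lines: fixed-size, contiguous, non-overlapping intervals counted from a start time, each with its own start and end. This is an illustration of the concept only, not ADF's implementation; the function name and the sample dates are made up for the example.

```python
from datetime import datetime, timedelta

def tumbling_windows(start, interval, until):
    """Yield (window_start, window_end) pairs for complete windows in [start, until)."""
    window_start = start
    while window_start + interval <= until:
        yield window_start, window_start + interval
        window_start += interval

# Hypothetical schedule: hourly windows over the first three hours of a day.
windows = list(
    tumbling_windows(
        start=datetime(2024, 1, 1, 0, 0),
        interval=timedelta(hours=1),
        until=datetime(2024, 1, 1, 3, 0),
    )
)

for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each pair corresponds to what a pipeline run would receive as its window start and end time, which is why a run can load exactly the slice of data belonging to its window.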
Demo: Copy Activity
Summary
Data Flows
Allows you to develop graphical data transformation logic
Example of the SSIS Control Flow tab for loading our data mart tables:
Example of the ADF Pipeline for loading our data mart tables:
Example of the SSIS Data Flow tab for loading the FactInternetSales table:
Example of ADF Mapping Data Flows for loading the FactInternetSales table:
Data Flow
Mapping Data Flow: transform data (known data and schema)
Wrangling Data Flow: prepare and explore data using Power Query (known or unknown datasets)
Mapping Data Flows
Mapping Data Flow Actions
Multiple Inputs/Outputs    Schema Modifiers    Row Modifiers
Join                       Derived Column      Filter
Conditional Split          Select              Sort
Exists                     Aggregate           Alter Row
Union                      Surrogate Key
Lookup                     Pivot
                           Unpivot
                           Window
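To make a few of these transformations concrete, here is a plain-Python analogue of a small flow that applies Filter, Derived Column, Lookup and Aggregate in turn. This is not ADF code (mapping data flows are built graphically), and the sample rows and column names are invented for illustration.

```python
from collections import defaultdict

# Invented sample data standing in for a sales source and a product reference table.
sales = [
    {"product_id": 1, "quantity": 2, "unit_price": 10.0},
    {"product_id": 2, "quantity": 1, "unit_price": 25.0},
    {"product_id": 1, "quantity": 0, "unit_price": 10.0},
    {"product_id": 1, "quantity": 3, "unit_price": 10.0},
]
products = {1: "Bike", 2: "Helmet"}

# Filter (row modifier): keep only rows with a positive quantity.
filtered = [row for row in sales if row["quantity"] > 0]

# Derived Column (schema modifier): add a computed "amount" column.
derived = [{**row, "amount": row["quantity"] * row["unit_price"]} for row in filtered]

# Lookup (multiple inputs): attach the product name from the reference table.
looked_up = [{**row, "product_name": products.get(row["product_id"])} for row in derived]

# Aggregate (schema modifier): total amount per product name.
totals = defaultdict(float)
for row in looked_up:
    totals[row["product_name"]] += row["amount"]
totals = dict(totals)

print(totals)
```

Each step takes the previous step's rows as input, mirroring how transformations are chained from source to sink in a data flow.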
Wrangling Data Flows
Data flows behind the scenes
Behind the scenes, a data flow executes on Azure Databricks using Spark.
ADF internally handles all the code translation, Spark optimization, and execution of the transformations.