Introduction to DataStage
IBM Infosphere DataStage v11.5
© Copyright IBM Corporation 2015
Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage
parallel jobs
Introduction to DataStage © Copyright IBM Corporation 2015
What is IBM InfoSphere DataStage?
• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects, such as data warehouses,
data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Build, run, and monitor jobs, all within DataStage
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
These are called job sequences
What is Information Server?
• Suite of applications, including DataStage, that share a common:
Repository
Set of application services and functionality
− Provided by the Metadata Server component
• By default an application named “server1”, hosted by an IBM WebSphere
Application Server (WAS) instance
− Provided services include:
• Security
• Repository
• Logging and reporting
• Metadata management
• Managed using the Information Server Web Console client
Information Server backbone
(Diagram: the Information Server backbone. Product tier: Information Services Director, Information Governance Catalog, Information Analyzer, FastTrack, DataStage/QualityStage, Data Click, and MetaBrokers. These sit on Metadata Access Services and Metadata Analysis Services, provided by the Metadata Server. The suite is managed through the Information Server Web Console.)
Information Server Web Console
(Screenshot: the Information Server Web Console, showing the Administration and Reporting tabs and the InfoSphere users.)
DataStage architecture
• DataStage clients
Administrator Designer Director
• DataStage engines
Parallel engine
− Runs parallel jobs
Server engine
− Runs server jobs
− Runs job sequences
DataStage Administrator
(Screenshot: the Administrator client, showing project environment variables.)
DataStage Designer
(Screenshot: the Designer client, showing the menus and toolbar, a DataStage parallel job with a DB2 Connector stage, and the job log.)
DataStage Director
(Screenshot: the Director client, showing log messages.)
Developing in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
Specifies formats of sources and targets accessed by your jobs
• Build job in Designer
• Compile job in Designer
• Run the job and monitor job log messages
The job log can be viewed either in Director or in Designer
− In Designer, only the job log for the currently opened job is available
Jobs can be run from Director, from Designer, or from the command line
Performance statistics show up in the log and also on the Designer canvas
as the job runs
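For example, command-line runs use the dsjob utility. A sketch, where the project name dstage1, the job name LoadWarehouse, and the parameter are placeholders:

```shell
# Run the job, wait for it to finish, and report its final status
dsjob -run -jobstatus -param TargetDB=WAREHOUSE dstage1 LoadWarehouse

# Print a summary of the job's log entries
dsjob -logsum dstage1 LoadWarehouse
```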
DataStage project repository
(Screenshot: the project repository tree, showing a user-added folder, the standard Jobs folder, and the standard Table Definitions folder.)
Types of DataStage jobs
• Parallel jobs
Executed by the DataStage parallel engine
Built-in capability for pipeline and partition parallelism
Compiled into OSH (Orchestrate shell script)
− Executable script viewable in Designer and the log
• Server jobs
Executed by the DataStage Server engine
Use a different set of stages than parallel jobs
No built-in capability for partition parallelism
Runtime monitoring in the job log
• Job sequences (batch jobs, controlling jobs)
A server job that runs and controls jobs and other activities
Can run both parallel jobs and other job sequences
Provides a common interface to the set of jobs it controls
Design elements of parallel jobs
• Stages
Passive stages (E and L of ETL)
− Read data
− Write data
− Examples: Sequential File, DB2, Oracle, Peek stages
Processor (active) stages (T of ETL)
− Transform data (Transformer stage)
− Filter data (Transformer stage)
− Aggregate data (Aggregator stage)
− Generate data (Row Generator stage)
− Merge data (Join, Lookup stages)
• Links
“Pipes” through which the data moves from stage to stage
Pipeline parallelism
• Transform, enrich, and load stages execute in parallel
• Like a conveyor belt moving rows from stage to stage
Run downstream stages while upstream stages are running
• Advantages:
Reduces disk usage for staging areas
Keeps processors busy
• Has limits on scalability
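A Unix pipeline is a close analogy, and a simple way to see the idea outside DataStage: each command plays the role of a stage, each pipe the role of a link, and all the processes run concurrently.

```shell
# Each command is a "stage"; each pipe is a "link". The downstream
# awk stages begin consuming rows while seq is still producing them,
# so no intermediate staging file is ever written to disk.
seq 1 1000 | awk '{print $1 * 2}' | awk '{s += $1} END {print s}'
# prints 1001000 (the sum of 2, 4, ..., 2000)
```

As with DataStage pipeline parallelism, throughput is limited by the slowest stage in the chain.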
Partition parallelism
• Divide the incoming stream of data into subsets to be separately
processed by an operation
Subsets are called partitions
Each partition of data is processed by a copy of the same stage
For example, if the stage is Filter, each partition will be filtered in exactly
the same way
• Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed
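The same idea can be sketched at the command line: split the input into subsets, run an identical "stage" on each subset concurrently, then combine the partial results (this sketch assumes GNU split for the -n option; file names are illustrative).

```shell
# Partition parallelism sketch: 3 partitions, one copy of the stage each
cd "$(mktemp -d)"
seq 1 90 > data.txt
split -n l/3 data.txt part_        # 3 partitions: part_aa, part_ab, part_ac
for f in part_??; do
  awk '{s += $1} END {print s}' "$f" > "$f.out" &   # same stage, run per partition
done
wait
cat part_??.out | awk '{t += $1} END {print t}'     # collector combines partials: 4095
```

The total (4095, the sum of 1..90) is the same regardless of how the rows are partitioned; what changes is that the three sums run at the same time.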
Three-node partitioning
(Diagram: incoming data is split into subset1, subset2, and subset3; an identical copy of the stage processes each subset on Node 1, Node 2, and Node 3.)
• Here the data is split into three partitions (nodes)
• The stage is executed on each partition of data separately and in
parallel
• If the data is evenly distributed, the data will be processed three
times faster
Job design versus execution
A developer designs the flow in DataStage Designer
… at runtime, this job runs in parallel for any number
of partitions (nodes)
Configuration file
• Determines the degree of parallelism (number of partitions) of jobs
that use it
• Every job runs under a configuration file
• Each DataStage project has a default configuration file
Specified by the $APT_CONFIG_FILE job parameter
Individual jobs can run under different configuration files than the project
default
− The same job can also run using different configuration files on different job runs
Example: Configuration file
(Screenshot: a configuration file defining two nodes (partitions), each with the resources attached to the node.)
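As text, a minimal two-node configuration file might look like the following sketch; the host name and resource paths are illustrative placeholders:

```
{
  node "node1"
  {
    fastname "etlserver"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/tmp" {pools ""}
  }
  node "node2"
  {
    fastname "etlserver"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/tmp" {pools ""}
  }
}
```

A job run under this file executes with two partitions; adding node entries increases the degree of parallelism without changing the job design.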
Checkpoint
1. True or false: DataStage Director is used to build and compile your
ETL jobs
2. True or false: Use Designer to monitor your job during execution
3. True or false: Administrator is used to set global and project
properties
Checkpoint solutions
1. False.
DataStage Designer is used to build and compile jobs.
Use DataStage Director to run and monitor jobs, although you can
also do this from DataStage Designer.
2. True.
The job log is available both in Director and Designer. In Designer,
you can only view log messages for a job open in Designer.
3. True.
Unit summary
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage parallel
jobs