Stages Overview in DataStage
DataStage provides a wide range of stages to extract, transform, and load data efficiently. Each stage plays a unique role in building high-performance ETL pipelines. Below is a clear and practical overview of the most commonly used stages in Parallel Jobs.
1. Transformer Stage
The Transformer is one of the most powerful and frequently used stages in DataStage.
What it does
- Performs row-by-row transformations
- Applies business rules, calculations, and conditional logic
- Handles string, date, and numeric operations
- Supports multiple outputs using constraints
Where it is used
- Data cleansing
- Derivation of new columns
- Complex business transformations
- Routing records based on conditions
Pro Tip
Use Stage Variables to simplify long expressions and improve performance.
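The kind of row-by-row logic a Transformer applies can be sketched in Python. This is an illustration only (DataStage defines derivations in its own expression language); the business rules, column names, and the 1000-unit threshold below are all made up:

```python
# Illustrative sketch: mimics a Transformer's per-row processing with a
# stage-variable-style intermediate value, derivations, and an output
# constraint that routes the row. All rules here are hypothetical.
def transform_row(row):
    # "Stage variable": compute a reused value once per row
    full_name = f"{row['first_name'].strip()} {row['last_name'].strip()}"

    # Derivations: new columns from (made-up) business rules
    out = {
        "full_name": full_name.upper(),
        "amount_usd": round(row["amount"] * row["fx_rate"], 2),
    }

    # Constraint: decide which output link receives the row
    target = "high_value" if out["amount_usd"] >= 1000 else "standard"
    return target, out

row = {"first_name": " Ada ", "last_name": "Lovelace",
       "amount": 900.0, "fx_rate": 1.2}
target, out = transform_row(row)
```

Computing `full_name` once and reusing it mirrors the Pro Tip above: a stage variable avoids repeating the same expression in several derivations.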
2. Sequential File Stage
This stage is used to read/write data from plain text files such as .txt, .csv, and .dat.
Key capabilities
- Supports delimited, fixed-width, and CSV formats
- Handles headers/footers, null markers, and escape characters
- Ideal for integration with external systems
When to use
- Reading raw source files
- Writing output for downstream systems
- Creating quick test files for debugging
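What the stage does on read can be sketched with Python's standard `csv` module: parse delimited data with a header row and map a null marker to a real null. The sample data and the `"NULL"` marker are assumptions for illustration:

```python
# Sketch of a delimited read with header handling and a null marker,
# roughly what the Sequential File stage is configured to do.
import csv
import io

raw = "id,name,city\n1,Alice,Paris\n2,Bob,NULL\n"  # stand-in for a .csv file

def read_delimited(text, null_marker="NULL"):
    reader = csv.DictReader(io.StringIO(text))  # first row becomes the header
    return [
        {k: (None if v == null_marker else v) for k, v in row.items()}
        for row in reader
    ]

rows = read_delimited(raw)
```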
3. Dataset Stage
Dataset is DataStage’s high-performance, native file format.
Why it is important
- Supports parallelism, partitioning, and high-speed I/O
- Much faster than sequential files for large data volumes
- Used for staging and intermediate storage between jobs
Best use cases
- Reprocessing
- Checkpointing
- Passing data between parallel jobs without re-reading sources
4. Lookup Stage
The Lookup stage enriches input rows by matching them against a reference dataset.
Key features
- Supports inner, outer, range, and sparse lookups
- Loads small reference data into memory for fast access
- Very efficient when reference data is small
Use Cases
- Fetching dimension keys
- Adding descriptions to transaction data
- Validating reference codes
Limitation
Avoid it for very large reference tables; use the Join stage instead.
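The core idea, loading a small reference set into memory and probing it per row, can be sketched in Python with a dict. The tables and column names are made up; a missing key yielding `None` plays the role of an outer-lookup miss:

```python
# Sketch of an in-memory lookup: build the reference index once, then
# probe it for every input row. Data is hypothetical.
reference = [
    {"code": "FR", "country": "France"},
    {"code": "DE", "country": "Germany"},
]
ref_index = {r["code"]: r["country"] for r in reference}  # one-time load

transactions = [{"id": 1, "code": "FR"}, {"id": 2, "code": "XX"}]

enriched = [
    {**t, "country": ref_index.get(t["code"])}  # None ~ outer-lookup miss
    for t in transactions
]
```

This is also why the stage suits small reference data only: the whole index must fit in memory, whereas a Join streams both inputs.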
5. Join Stage
The Join stage combines data from two or more input links based on matching keys.
Types of joins supported
- Inner Join
- Left/Right Outer Join
- Full Outer Join
Advantages
- Best for large datasets
- Higher performance than Lookup for big tables
- Works well when inputs are properly partitioned and sorted
Use Cases
- Combining sales and customer data
- Joining order headers with order details
- Merging fact data with dimension data
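The reason sorted, partitioned inputs matter can be seen in a sort-merge sketch: when both sides are ordered on the key, matches are found in one coordinated pass instead of an in-memory probe. This is a simplified inner join over hypothetical data, not DataStage's actual implementation:

```python
# Sketch of the sort-merge idea behind the Join stage (inner join only).
def merge_join(left, right, key):
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        # advance the right side until its key catches up with the left key
        while j < len(right) and right[j][key] < l[key]:
            j += 1
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})  # emit one row per key match
            k += 1
    return out

orders = [{"order_id": 2, "cust": "B"}, {"order_id": 1, "cust": "A"}]
customers = [{"cust": "A", "name": "Alice"}, {"cust": "B", "name": "Bob"}]
joined = merge_join(orders, customers, "cust")
```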
6. Remove Duplicates Stage
Used to eliminate duplicate rows based on specified key columns.
How it works
- Requires input data to be sorted and partitioned
- You can keep either the first or the last duplicate record
- Removes unwanted duplicate records during data load
Use Cases
- Removing duplicate customer records
- Cleaning staging data
- Ensuring uniqueness in dimension tables
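The keep-first/keep-last behavior over key-sorted input can be sketched like this; the data and column names are invented for illustration:

```python
# Sketch of duplicate removal on sorted input: group consecutive rows that
# share the key, then retain the first or the last record of each group.
from itertools import groupby

def dedupe(rows, key, keep="first"):
    rows = sorted(rows, key=lambda r: r[key])  # stage requires sorted input
    out = []
    for _, group in groupby(rows, key=lambda r: r[key]):
        group = list(group)
        out.append(group[0] if keep == "first" else group[-1])
    return out

data = [
    {"cust": "A", "ver": 1},
    {"cust": "A", "ver": 2},
    {"cust": "B", "ver": 1},
]
first_kept = dedupe(data, "cust", keep="first")
last_kept = dedupe(data, "cust", keep="last")
```

Because the sort is stable, "first" and "last" refer to the original arrival order within each key group, which is what makes the option useful for keeping, say, the latest version of a record.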
7. Aggregator Stage
The Aggregator stage performs group-based calculations.
Supported operations
- SUM, COUNT, MIN, MAX, AVG
- First/Last values
- Statistical functions
Where to use
- Creating daily/weekly/monthly summaries
- Calculating totals or averages
- Preparing aggregated facts for reporting
Tip
Use sorted aggregation when possible to improve performance.
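The tip above can be illustrated in Python: when rows arrive already sorted on the grouping key, each group can be summarized in a single pass without holding the whole dataset in memory. The sales data and columns are hypothetical:

```python
# Sketch of sorted (single-pass) group aggregation, the cheap mode the
# performance tip refers to.
from itertools import groupby

sales = [
    {"day": "Mon", "amount": 10.0},
    {"day": "Tue", "amount": 7.5},
    {"day": "Mon", "amount": 5.0},
]
sales.sort(key=lambda r: r["day"])  # sorted-aggregation precondition

summary = []
for day, group in groupby(sales, key=lambda r: r["day"]):
    amounts = [r["amount"] for r in group]  # one group at a time in memory
    summary.append({"day": day, "total": sum(amounts), "count": len(amounts)})
```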
8. Copy Stage
The Copy stage is simple but extremely useful.
What it does
- Copies incoming records to multiple outputs
- Helps split data for different processing paths
- Used as a metadata fixer when column definitions mismatch
Use Cases
- Sending one input to multiple transformations
- Testing/debugging
- Splitting valid vs. invalid data flows
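The fan-out behavior is simple enough to sketch in a few lines; the record contents and number of outputs are arbitrary examples:

```python
# Sketch of Copy-stage fan-out: every record goes unchanged to each
# output link, so downstream paths can process the same data independently.
def copy_stage(rows, n_outputs):
    return [list(rows) for _ in range(n_outputs)]

out_a, out_b = copy_stage([{"id": 1}, {"id": 2}], 2)
```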
✔ Summary Table

| Stage | Purpose |
| --- | --- |
| Transformer | Complex transformations & business rules |
| Sequential File | Read/write flat files |
| Dataset | High-speed DataStage storage format |
| Lookup | Fast lookup using small reference data |
| Join | Combine large datasets efficiently |
| Remove Duplicates | Eliminate duplicate records |
| Aggregator | Summaries and group calculations |
| Copy | Duplicate data to multiple outputs |