Wednesday, 10 December 2025

Active vs Passive Stages in DataStage.

 

ACTIVE STAGES

Active stages process records and can change the number of rows that go out compared to what went in.

✔ They can:

  • Transform data

  • Filter rows

  • Add or drop rows

  • Modify data values

  • Perform lookups

  • Join datasets

  • Group / aggregate data

✔ Have both input and output links

(Some may not require both, but generally yes)

✔ Affect row flow (record count may change)


Examples of Active Stages

These are common in real jobs:

  Transformer Stage

       Applies business logic row-by-row.

  Aggregator Stage

     Groups and aggregates data.

  Filter Stage

      Filters out rows based on conditions.

  Lookup Stage

    Performs lookups using reference links.

 Join / Merge Stage

    Combines multiple inputs based on keys.

 Sort Stage

   Sorts data (not passive).

Pivot / Unpivot Stage

Changes row/column structure.

Switch / Change Capture / Surrogate Key Generator

Changes shape/flow of data.

Copy Stage

Even though simple, it still processes rows (active).


PASSIVE STAGES

Passive stages do NOT process or change data.
They simply read or write data.

✔ No business logic

✔ No filtering

✔ No transformation

✔ Row count IN = Row count OUT (except sequential files—header/footer not counted)

These stages only connect DataStage to external systems.


Examples of Passive Stages

Sequential File Stage

Reads/writes files.

Dataset Stage

Reads/writes DataStage native datasets.

File Set Stage

ODBC / DB2 / Oracle Connectors

Database read/write.

External Source / External Target

FTP / SFTP Stage

Complex Flat File Stage

Peek Stage (only displays data)


✔ If a stage only reads or writes data → Passive Stage

✔ If a stage transforms, filters, joins, aggregates, or applies logic → Active Stage




What is a Union stage in DataStage?

There is no stage called "Union" in DataStage.

Use the Funnel stage to achieve a union/append of datasets.

All inputs to the Funnel must have identical column metadata. It supports Sequence, Continuous, and Sort Funnel modes, giving control over the order of the output rows.
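Conceptually, the Funnel stage just appends rows from several input links into one output link. The sketch below is plain Python, not DataStage code; the function names `sequence_funnel` and `continuous_funnel` are illustrative stand-ins for the Sequence and Continuous modes:

```python
from itertools import chain, zip_longest

def sequence_funnel(*inputs):
    """Sequence mode: emit all rows of input 1, then all of input 2, and so on."""
    return list(chain.from_iterable(inputs))

def continuous_funnel(*inputs):
    """Continuous mode: interleave rows as they 'arrive' from each input link."""
    rows = []
    for group in zip_longest(*inputs):
        rows.extend(r for r in group if r is not None)
    return rows

a = [{"id": 1}, {"id": 2}]
b = [{"id": 3}]
print(sequence_funnel(a, b))    # [{'id': 1}, {'id': 2}, {'id': 3}]
print(continuous_funnel(a, b))  # [{'id': 1}, {'id': 3}, {'id': 2}]
```

Either way, the output is the union (append) of the inputs — only the row order differs by mode.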

Parallelism and partitioning methods in DataStage

 

Parallelism in DataStage 

 Parallelism means executing ETL processes simultaneously to improve performance. DataStage PX achieves this using the parallel engine, which divides work across multiple nodes (defined in APT_CONFIG_FILE).


 Types of Parallelism in DataStage

1. Pipeline Parallelism

  • Different stages run at the same time, processing streaming data.

  • Example:
    Source → Transformer → Target
    All three run concurrently once the pipeline fills.

✔ Improves performance by overlapping operations.
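The streaming idea behind pipeline parallelism can be sketched with Python generators (illustrative only, not DataStage code): each stage pulls rows one at a time, so the transformer starts working before the source has finished producing:

```python
def source():
    # Pretend these rows stream in from a file, one at a time
    for i in range(5):
        yield i

def transformer(rows):
    # Processes each row as soon as it arrives from the source
    for r in rows:
        yield r * 10

def target(rows):
    # Consumes rows as the transformer produces them
    return list(rows)

result = target(transformer(source()))
print(result)  # [0, 10, 20, 30, 40]
```

No stage waits for the previous one to finish its whole dataset — rows flow through all three stages concurrently once the pipeline fills.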


2. Partition Parallelism

  • A dataset is split into partitions, and each node processes its portion in parallel.

Example with 4 partitions:

1000 rows → split into 4 partitions of 250 rows each

Each node processes 250 rows at the same time.

✔ Massive performance boost
✔ Depends heavily on partitioning method
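A minimal Python sketch of the 4-partition example (not DataStage code — just the splitting idea):

```python
def split_into_partitions(rows, nodes):
    """Divide rows into equal contiguous chunks, one chunk per node."""
    size = len(rows) // nodes
    return [rows[i * size:(i + 1) * size] for i in range(nodes)]

parts = split_into_partitions(list(range(1000)), 4)
print([len(p) for p in parts])  # [250, 250, 250, 250]
```

Each of the four chunks would then be processed by its own node at the same time.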


3. Component Parallelism (less used term)

  • Running multiple job instances in parallel

  • Example: Multiple instances of the same job processing different file paths


 Partitioning in DataStage

Partitioning defines how input data is divided across parallel nodes.


Common Partitioning Methods

1. Hash Partitioning

  • Splits data based on a key (e.g., customer_id)

  • Ensures same key always goes to the same partition

Use when:
✔ Joining
✔ Aggregations
✔ Removing duplicates
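The key property — same key always goes to the same partition — can be sketched in Python. This is illustrative only; DataStage uses its own internal hash function, and `hash_partition` below is a hypothetical name:

```python
import hashlib

def hash_partition(key, num_nodes):
    """A stable hash of the key decides the partition number."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes

rows = [("C1", 100), ("C2", 50), ("C1", 75)]
by_node = {}
for customer_id, amount in rows:
    node = hash_partition(customer_id, 4)
    by_node.setdefault(node, []).append((customer_id, amount))

# Both "C1" rows are guaranteed to land in the same partition,
# which is why joins and aggregations on customer_id work per-node.
assert hash_partition("C1", 4) == hash_partition("C1", 4)
```

Because all rows sharing a key sit on one node, that node can join, aggregate, or de-duplicate them without talking to other nodes.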


2. Range Partitioning

  • Splits data based on ranges

Example:

  • 1–10000 → Node1

  • 10001–20000 → Node2

Use when:
✔ Requirement is range-based
✔ Ordered processing constraints
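The range example above can be sketched with the standard `bisect` module (plain Python, not DataStage; the boundary values are taken from the example):

```python
import bisect

# Boundaries: ids 1-10000 -> node 0, 10001-20000 -> node 1, above -> node 2
boundaries = [10000, 20000]

def range_partition(key):
    """Pick the partition whose range contains the key."""
    return bisect.bisect_left(boundaries, key)

print(range_partition(9500))   # 0  (Node1 in the example)
print(range_partition(15000))  # 1  (Node2)
print(range_partition(25000))  # 2  (overflow node)
```

Because each partition holds a contiguous key range, ordered processing within and across partitions is easy to reason about.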


3. Entire Partitioning

  • Entire dataset sent to each partition (broadcast)

Use when:
✔ Small reference dataset
✔ Lookup that must be available everywhere


4. Modulus Partitioning

  • System distributes data using modulo arithmetic:
    partition = key % number_of_nodes

Use when:
✔ Keys are evenly distributed
✔ Good for balancing
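The formula `partition = key % number_of_nodes` can be checked directly in Python — with evenly distributed integer keys, the partitions balance out:

```python
from collections import Counter

# 100 consecutive integer keys spread across 4 nodes via modulo
counts = Counter(key % 4 for key in range(100))
print(dict(counts))  # {0: 25, 1: 25, 2: 25, 3: 25}
```

If the keys were skewed (e.g., all even), some partitions would sit idle — which is why even key distribution is the precondition listed above.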


5. Round Robin Partitioning

  • Sends rows one-by-one to each partition in a rotating sequence

Use when:
✔ No key-based logic
✔ Need an even distribution
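A small Python sketch of round robin distribution (illustrative, not DataStage code):

```python
from itertools import cycle

def round_robin_partition(rows, num_nodes):
    """Deal rows to partitions 0, 1, 2, ... in a rotating sequence."""
    partitions = [[] for _ in range(num_nodes)]
    targets = cycle(range(num_nodes))
    for row in rows:
        partitions[next(targets)].append(row)
    return partitions

parts = round_robin_partition(list(range(10)), 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

The partition a row lands in depends only on its arrival position, never on its content — ideal when no key-based grouping is needed.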


6. Random Partitioning

  • Random assignment of rows

  • Rarely used because it can cause imbalance


Repartitioning

When partitioning of two inputs is different, DataStage repartitions automatically.

Example:

  • Input1 is hash partitioned

  • Input2 is round robin

Join stage will repartition both using same key.

Repartitioning costs performance → try to avoid unnecessary repartition.


Collector Types

When DataStage needs to merge partitions back:

  • Sort Merge collector → keeps order

  • Ordered collector → structured but slower

  • Round Robin collector → no ordering

  • Sequential collector → one output stream
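The Sort Merge collector's behavior can be sketched with the standard `heapq.merge` (plain Python, not DataStage): if each partition is already sorted, merging them preserves a single global order:

```python
import heapq

# Three sorted partitions coming back from parallel nodes
partition1 = [1, 4, 7]
partition2 = [2, 5, 8]
partition3 = [3, 6, 9]

# Merge them into one globally ordered output stream
merged = list(heapq.merge(partition1, partition2, partition3))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A Round Robin or Sequential collector, by contrast, would interleave or concatenate the partitions without restoring any sort order.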


 Example: Join Stage

For a join on customer_id:

  • Both inputs must be hash partitioned on customer_id

  • If not, DataStage will automatically repartition


Real-Time Example

Dataset = 1 million customers
APT_CONFIG_FILE = 4 nodes

Using hash partitioning on customer_id:

  • Node1: ~25% of the customers

  • Node2: ~25%

  • Node3: ~25%

  • Node4: ~25%

(Hash partitioning does not assign contiguous ID ranges; it guarantees that all rows for a given customer_id land on the same node, with roughly even totals across the nodes.)

Each node processes only its slice, in parallel → huge performance gain.


Summary

Parallelism = running faster by using multiple nodes

Partitioning = how the data is split across nodes

Good design means:

  • Proper partitioning

  • Avoid unnecessary repartition

  • Use correct collector

  • Use node pools efficiently




 

What is APT_CONFIG_FILE ?

 

APT_CONFIG_FILE is an environment variable that points to a configuration file (.apt file) which defines:

  • Number of processing nodes

  • Node names

  • CPU allocation

  • Disk resources

  • Scratch disk paths

  • Resource pools

This file controls parallelism in DataStage PX (Parallel Extender). A DataStage parallel job cannot run without a configuration (.apt) file.


Example APT Config File (2-node)

node "node1"
{
    fastname "server1"
    pool "node"
    scratchdisk "/scratch1"
}
node "node2"
{
    fastname "server2"
    pool "node"
    scratchdisk "/scratch2"
}

This means the job will run on 2 nodes, in parallel.


Why is APT_CONFIG_FILE Important?

✔ Controls job parallelism

More nodes = higher performance.

✔ Controls resource usage

Disk paths, node pools, etc.

✔ Controls load distribution

E.g., partitioning, sorting, aggregations all depend on node definitions.


Where is APT_CONFIG_FILE set?

Usually set in:

  • Project → Properties → Environment Variables → APT_CONFIG_FILE

  • Or in DS job → Job Properties

  • Or inside UNIX shell before execution:

export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Config/apt_config.apt

Q: What happens if APT_CONFIG_FILE has 4 nodes but your job has only 1 input row?

Job will still run in 4-way parallel, but only 1 node will get the row; others will be idle.

Q: How do you check which config file your job is using?

In Director → Right-click job → View Log → It shows the APT config file name in the startup message.




 

What is the Transformer Stage in DataStage?

The Transformer stage is a processing stage used to transform individual input rows into output rows. It allows you to write complex expressions, apply functions, use variables, and route data to different links based on conditions.


🔧 Key Features

1. Row-by-row processing

Evaluates each row individually.

2. Derivations / Expressions

You can derive new columns using:

  • String functions (Trim, Substring, etc.)

  • Date/time functions

  • Lookup functions

  • Conditional expressions (If…Then…Else)

3. Constraints

Used to filter and control the flow of data to each output link.

Example constraint on an output link:

Amount > 1000

Only rows for which the expression is true flow down that link.

4. Stage Variables

  • Used for complex logic.

  • Can store intermediate calculations.

  • Can maintain values across rows.

Example: Running total or previous row value.
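The running-total / previous-row idea can be sketched in plain Python, with ordinary variables standing in for stage variables (illustrative only — this is not Transformer syntax):

```python
rows = [100, 250, 50]

running_total = 0   # "stage variable": persists across row evaluations
prev_value = None   # "stage variable": holds the previous row's value

output = []
for amount in rows:
    running_total += amount          # updated before deriving output columns
    output.append({"amount": amount,
                   "running_total": running_total,
                   "prev": prev_value})
    prev_value = amount              # remembered for the next row

print(output[-1])  # {'amount': 50, 'running_total': 400, 'prev': 250}
```

The key point mirrors the Transformer: because the variables outlive each row, later rows can see state computed from earlier ones.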

5. Multiple Output Links

You can split data into multiple outputs based on constraints.
Example:

  • Valid Records

  • Rejected Records

  • Error Records

6. Built-in Lookup

You can reference a reference link directly inside derivations.


When to Use Transformer Stage?

Use it when you need:

  • Data validation

  • Complex business logic

  • Conditional filtering

  • Field derivations

  • Reference lookups (small/moderate datasets)


🚫 What Transformer Should NOT Be Used For

  • Very large dataset joins (use Join/Merge stage instead)

  • Column-level mass transformations (use Modify stage)

  • High-performance parallel logic (Transformer is slower)


Example Use Case

Source: Customer records
Logic:

  • If Age > 18 → mark “Adult”

  • If Country = ‘IN’ → output to India link

  • Else → output to Others link

All of this is done inside the Transformer stage.
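The same use case, sketched as plain Python (not Transformer syntax — the derivation and constraints are shown as ordinary expressions):

```python
# Hypothetical sample rows standing in for the source customer records
customers = [
    {"name": "Asha", "age": 25, "country": "IN"},
    {"name": "Bob",  "age": 17, "country": "US"},
]

india_link, others_link = [], []
for row in customers:
    # Derivation: add a category column
    row["category"] = "Adult" if row["age"] > 18 else "Minor"
    # Constraints: route the row to one output link by country
    (india_link if row["country"] == "IN" else others_link).append(row)

print(len(india_link), len(others_link))  # 1 1
```

In the real stage, the derivation lives in an output-column expression and each constraint lives on its own output link; the row-by-row routing logic is the same.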



 
