Wednesday, 10 December 2025

Active vs Passive Stages in DataStage.

 

ACTIVE STAGES

Active stages process records and can change the number of rows that go out compared to what went in.

✔ They can:

  • Transform data

  • Filter rows

  • Add or drop rows

  • Modify data values

  • Perform lookups

  • Join datasets

  • Group / aggregate data

✔ Have both input and output links

(Some may not require both, but generally yes)

✔ Affect row flow (record count may change)


Examples of Active Stages

These are common in real jobs:

  Transformer Stage

       Applies business logic row-by-row.

  Aggregator Stage

     Groups and aggregates data.

  Filter Stage

      Filters out rows based on conditions.

  Lookup Stage

    Performs lookups using reference links.

 Join / Merge Stage

    Combines multiple inputs based on keys.

 Sort Stage

   Sorts data (not passive).

Pivot / Unpivot Stage

Changes row/column structure.

Switch / Change Capture / Surrogate Key Generator

Changes shape/flow of data.

Copy Stage

Even though simple, it still processes rows (active).


PASSIVE STAGES

Passive stages do NOT process or change data.
They simply read or write data.

✔ No business logic

✔ No filtering

✔ No transformation

✔ Row count IN = Row count OUT (except sequential files—header/footer not counted)

These stages only connect DataStage to external systems.


Examples of Passive Stages

Sequential File Stage

Reads/writes files.

Dataset Stage

Reads/writes DataStage native datasets.

File Set Stage

ODBC / DB2 / Oracle Connectors

Database read/write.

External Source / External Target

FTP / SFTP Stage

Complex Flat File Stage

Peek Stage (only displays data)


✔ If a stage only reads or writes data → Passive Stage

✔ If a stage transforms, filters, joins, aggregates, or applies logic → Active Stage




What is a Union stage in DataStage?

There is no stage called "Union" in DataStage.

Use the Funnel stage to achieve a union/append of datasets.

All inputs to the Funnel must have identical column metadata. It supports Sequence, Continuous, and Sort Funnel modes, giving control over the order of the output rows.
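Conceptually, the Funnel stage just appends rows from several input links into one output link. The sketch below is plain Python, not DataStage code; the function names `sequence_funnel` and `continuous_funnel` are illustrative stand-ins for the Sequence and Continuous modes:

```python
from itertools import chain, zip_longest

def sequence_funnel(*inputs):
    """Sequence mode: emit all rows of input 1, then all of input 2, and so on."""
    return list(chain.from_iterable(inputs))

def continuous_funnel(*inputs):
    """Continuous mode: interleave rows as they 'arrive' from each input link."""
    rows = []
    for group in zip_longest(*inputs):
        rows.extend(r for r in group if r is not None)
    return rows

a = [{"id": 1}, {"id": 2}]
b = [{"id": 3}]
print(sequence_funnel(a, b))    # [{'id': 1}, {'id': 2}, {'id': 3}]
print(continuous_funnel(a, b))  # [{'id': 1}, {'id': 3}, {'id': 2}]
```

Either way, the output is the union (append) of the inputs — only the row order differs by mode.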

Parallelism and partitioning methods in DataStage

 

Parallelism in DataStage 

 Parallelism means executing ETL processes simultaneously to improve performance. DataStage PX achieves this using the parallel engine, which divides work across multiple nodes (defined in APT_CONFIG_FILE).


 Types of Parallelism in DataStage

1. Pipeline Parallelism

  • Different stages run at the same time, processing streaming data.

  • Example:
    Source → Transformer → Target
    All three run concurrently once the pipeline fills.

✔ Improves performance by overlapping operations.
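The streaming idea behind pipeline parallelism can be sketched with Python generators (illustrative only, not DataStage code): each stage pulls rows one at a time, so the transformer starts working before the source has finished producing:

```python
def source():
    # Pretend these rows stream in from a file, one at a time
    for i in range(5):
        yield i

def transformer(rows):
    # Processes each row as soon as it arrives from the source
    for r in rows:
        yield r * 10

def target(rows):
    # Consumes rows as the transformer produces them
    return list(rows)

result = target(transformer(source()))
print(result)  # [0, 10, 20, 30, 40]
```

No stage waits for the previous one to finish its whole dataset — rows flow through all three stages concurrently once the pipeline fills.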


2. Partition Parallelism

  • A dataset is split into partitions, and each node processes its portion in parallel.

Example with 4 partitions:

1000 rows → split into 4 partitions of 250 rows each

Each node processes 250 rows at the same time.

✔ Massive performance boost
✔ Depends heavily on partitioning method
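A minimal Python sketch of the 4-partition example (not DataStage code — just the splitting idea):

```python
def split_into_partitions(rows, nodes):
    """Divide rows into equal contiguous chunks, one chunk per node."""
    size = len(rows) // nodes
    return [rows[i * size:(i + 1) * size] for i in range(nodes)]

parts = split_into_partitions(list(range(1000)), 4)
print([len(p) for p in parts])  # [250, 250, 250, 250]
```

Each of the four chunks would then be processed by its own node at the same time.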


3. Component Parallelism (less used term)

  • Running multiple job instances in parallel

  • Example: Multiple instances of the same job processing different file paths


 Partitioning in DataStage

Partitioning defines how input data is divided across parallel nodes.


Common Partitioning Methods

1. Hash Partitioning

  • Splits data based on a key (e.g., customer_id)

  • Ensures same key always goes to the same partition

Use when:
✔ Joining
✔ Aggregations
✔ Removing duplicates
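The key property — same key always goes to the same partition — can be sketched in Python. This is illustrative only; DataStage uses its own internal hash function, and `hash_partition` below is a hypothetical name:

```python
import hashlib

def hash_partition(key, num_nodes):
    """A stable hash of the key decides the partition number."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes

rows = [("C1", 100), ("C2", 50), ("C1", 75)]
by_node = {}
for customer_id, amount in rows:
    node = hash_partition(customer_id, 4)
    by_node.setdefault(node, []).append((customer_id, amount))

# Both "C1" rows are guaranteed to land in the same partition,
# which is why joins and aggregations on customer_id work per-node.
assert hash_partition("C1", 4) == hash_partition("C1", 4)
```

Because all rows sharing a key sit on one node, that node can join, aggregate, or de-duplicate them without talking to other nodes.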


2. Range Partitioning

  • Splits data based on ranges

Example:

  • 1–10000 → Node1

  • 10001–20000 → Node2

Use when:
✔ Requirement is range-based
✔ Ordered processing constraints
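The range example above can be sketched with the standard `bisect` module (plain Python, not DataStage; the boundary values are taken from the example):

```python
import bisect

# Boundaries: ids 1-10000 -> node 0, 10001-20000 -> node 1, above -> node 2
boundaries = [10000, 20000]

def range_partition(key):
    """Pick the partition whose range contains the key."""
    return bisect.bisect_left(boundaries, key)

print(range_partition(9500))   # 0  (Node1 in the example)
print(range_partition(15000))  # 1  (Node2)
print(range_partition(25000))  # 2  (overflow node)
```

Because each partition holds a contiguous key range, ordered processing within and across partitions is easy to reason about.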


3. Entire Partitioning

  • Entire dataset sent to each partition (broadcast)

Use when:
✔ Small reference dataset
✔ Lookup that must be available everywhere


4. Modulus Partitioning

  • System distributes data using modulo arithmetic:
    partition = key % number_of_nodes

Use when:
✔ Keys are evenly distributed
✔ Good for balancing
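The formula `partition = key % number_of_nodes` can be checked directly in Python — with evenly distributed integer keys, the partitions balance out:

```python
from collections import Counter

# 100 consecutive integer keys spread across 4 nodes via modulo
counts = Counter(key % 4 for key in range(100))
print(dict(counts))  # {0: 25, 1: 25, 2: 25, 3: 25}
```

If the keys were skewed (e.g., all even), some partitions would sit idle — which is why even key distribution is the precondition listed above.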


5. Round Robin Partitioning

  • Sends rows one-by-one to each partition in a rotating sequence

Use when:
✔ No key-based logic
✔ Need an even distribution
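A small Python sketch of round robin distribution (illustrative, not DataStage code):

```python
from itertools import cycle

def round_robin_partition(rows, num_nodes):
    """Deal rows to partitions 0, 1, 2, ... in a rotating sequence."""
    partitions = [[] for _ in range(num_nodes)]
    targets = cycle(range(num_nodes))
    for row in rows:
        partitions[next(targets)].append(row)
    return partitions

parts = round_robin_partition(list(range(10)), 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

The partition a row lands in depends only on its arrival position, never on its content — ideal when no key-based grouping is needed.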


6. Random Partitioning

  • Random assignment of rows

  • Rarely used because it can cause imbalance


Repartitioning

When partitioning of two inputs is different, DataStage repartitions automatically.

Example:

  • Input1 is hash partitioned

  • Input2 is round robin

Join stage will repartition both using same key.

Repartitioning costs performance → try to avoid unnecessary repartition.


Collector Types

When DataStage needs to merge partitions back:

  • Sort Merge collector → keeps order

  • Ordered collector → structured but slower

  • Round Robin collector → no ordering

  • Sequential collector → one output stream
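The Sort Merge collector's behavior can be sketched with the standard `heapq.merge` (plain Python, not DataStage): if each partition is already sorted, merging them preserves a single global order:

```python
import heapq

# Three sorted partitions coming back from parallel nodes
partition1 = [1, 4, 7]
partition2 = [2, 5, 8]
partition3 = [3, 6, 9]

# Merge them into one globally ordered output stream
merged = list(heapq.merge(partition1, partition2, partition3))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

A Round Robin or Sequential collector, by contrast, would interleave or concatenate the partitions without restoring any sort order.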


 Example: Join Stage

For a join on customer_id:

  • Both inputs must be hash partitioned on customer_id

  • If not, DataStage will automatically repartition


Real-Time Example

Dataset = 1 million customers
APT_CONFIG_FILE = 4 nodes

Using hash partitioning on customer_id:

  • Node1: ~25% of the customers

  • Node2: ~25%

  • Node3: ~25%

  • Node4: ~25%

(Hash partitioning does not assign contiguous ID ranges; it guarantees that all rows for a given customer_id land on the same node, with roughly even totals across the nodes.)

Each node processes only its slice, in parallel → huge performance gain.


Summary

Parallelism = running faster by using multiple nodes

Partitioning = how the data is split across nodes

Good design means:

  • Proper partitioning

  • Avoid unnecessary repartition

  • Use correct collector

  • Use node pools efficiently




 

What is APT_CONFIG_FILE ?

 

APT_CONFIG_FILE is an environment variable that points to a configuration file (.apt file) which defines:

  • Number of processing nodes

  • Node names

  • CPU allocation

  • Disk resources

  • Scratch disk paths

  • Resource pools

This file controls parallelism in DataStage PX (Parallel Extender). A DataStage parallel job cannot run without a configuration (.apt) file.


Example APT Config File (2-node)

node "node1"
{
    fastname "server1"
    pool "node"
    scratchdisk "/scratch1"
}
node "node2"
{
    fastname "server2"
    pool "node"
    scratchdisk "/scratch2"
}

This means the job will run on 2 nodes, in parallel.


Why is APT_CONFIG_FILE Important?

✔ Controls job parallelism

More nodes = higher performance.

✔ Controls resource usage

Disk paths, node pools, etc.

✔ Controls load distribution

E.g., partitioning, sorting, aggregations all depend on node definitions.


Where is APT_CONFIG_FILE set?

Usually set in:

  • Project → Properties → Environment Variables → APT_CONFIG_FILE

  • Or in DS job → Job Properties

  • Or inside UNIX shell before execution:

export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Config/apt_config.apt

Q: What happens if APT_CONFIG_FILE has 4 nodes but your job has only 1 input row?

Job will still run in 4-way parallel, but only 1 node will get the row; others will be idle.

Q: How do you check which config file your job is using?

In Director → Right-click job → View Log → It shows the APT config file name in the startup message.




 

What is the Transformer Stage in DataStage?

The Transformer stage is a processing stage used to transform individual input rows into output rows. It allows you to write complex expressions, apply functions, use variables, and route data to different links based on conditions.


🔧 Key Features

1. Row-by-row processing

Evaluates each row individually.

2. Derivations / Expressions

You can derive new columns using:

  • String functions (Trim, Substring, etc.)

  • Date/time functions

  • Lookup functions

  • Conditional expressions (If…Then…Else)

3. Constraints

Used to filter and control the flow of data to each output link.

Example constraint on an output link:

Amount > 1000

Only rows for which the expression is true flow down that link.

4. Stage Variables

  • Used for complex logic.

  • Can store intermediate calculations.

  • Can maintain values across rows.

Example: Running total or previous row value.
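The running-total / previous-row idea can be sketched in plain Python, with ordinary variables standing in for stage variables (illustrative only — this is not Transformer syntax):

```python
rows = [100, 250, 50]

running_total = 0   # "stage variable": persists across row evaluations
prev_value = None   # "stage variable": holds the previous row's value

output = []
for amount in rows:
    running_total += amount          # updated before deriving output columns
    output.append({"amount": amount,
                   "running_total": running_total,
                   "prev": prev_value})
    prev_value = amount              # remembered for the next row

print(output[-1])  # {'amount': 50, 'running_total': 400, 'prev': 250}
```

The key point mirrors the Transformer: because the variables outlive each row, later rows can see state computed from earlier ones.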

5. Multiple Output Links

You can split data into multiple outputs based on constraints.
Example:

  • Valid Records

  • Rejected Records

  • Error Records

6. Built-in Lookup

You can reference a reference link directly inside derivations.


When to Use Transformer Stage?

Use it when you need:

  • Data validation

  • Complex business logic

  • Conditional filtering

  • Field derivations

  • Reference lookups (small/moderate datasets)


🚫 What Transformer Should NOT Be Used For

  • Very large dataset joins (use Join/Merge stage instead)

  • Column-level mass transformations (use Modify stage)

  • High-performance parallel logic (Transformer is slower)


Example Use Case

Source: Customer records
Logic:

  • If Age > 18 → mark “Adult”

  • If Country = ‘IN’ → output to India link

  • Else → output to Others link

All of this is done inside the Transformer stage.
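The same use case, sketched as plain Python (not Transformer syntax — the derivation and constraints are shown as ordinary expressions):

```python
# Hypothetical sample rows standing in for the source customer records
customers = [
    {"name": "Asha", "age": 25, "country": "IN"},
    {"name": "Bob",  "age": 17, "country": "US"},
]

india_link, others_link = [], []
for row in customers:
    # Derivation: add a category column
    row["category"] = "Adult" if row["age"] > 18 else "Minor"
    # Constraints: route the row to one output link by country
    (india_link if row["country"] == "IN" else others_link).append(row)

print(len(india_link), len(others_link))  # 1 1
```

In the real stage, the derivation lives in an output-column expression and each constraint lives on its own output link; the row-by-row routing logic is the same.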



 
