Thursday, 5 March 2026

Copy and Modify Stages

In IBM InfoSphere DataStage, both the Copy Stage and the Modify Stage are simple processing stages used in parallel jobs, but they serve different purposes.

 

1. Copy Stage

Definition

The Copy Stage is used to pass data from input to output without changing the data.

It simply copies rows from the source to the target.

Example Design

Sequential File → Copy Stage → Dataset

Input:

Emp_ID   Name
101      John
102      Mary

Output (same data):

Emp_ID   Name
101      John
102      Mary

When Copy Stage is Used

1.    Splitting data into multiple outputs

2.    Passing data without transformation

3.    Improving parallel processing

4.    Debugging jobs

Example – Multiple Outputs

                → Target1
Source → Copy →
                → Target2

Same data goes to both targets.

  • No transformation
  • Very fast
  • Minimal processing
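This pass-through fan-out behaves like the Unix `tee` command, which duplicates a stream to multiple targets without altering the rows. A minimal sketch of the analogy (the file names are illustrative, not DataStage paths):

```shell
# Mimic Copy Stage fan-out: one source file duplicated to two targets,
# rows unchanged. source.txt / target1.txt / target2.txt are examples.
printf 'Emp_ID,Name\n101,John\n102,Mary\n' > source.txt
tee target1.txt target2.txt < source.txt > /dev/null
diff target1.txt target2.txt && echo "targets identical"
```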


2. Modify Stage

Definition

The Modify Stage is used to change column values or data types using simple expressions.

It is faster than the Transformer stage because it performs only simple transformations.

Example

Input:

Name = john
Salary = 5000

Modify stage expression:

Name = upcase(Name);
Salary = Salary * 1.10;

Output:

Name = JOHN
Salary = 5500


3. Common Modify Stage Operations

Data Type Conversion

string_to_int(Age)

Change Column Value

Salary = Salary + 1000

Convert Case

Name = upcase(Name)
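The Modify stage itself is configured in the Designer rather than scripted, but as a rough analogy the same kinds of operations can be mimicked with awk on a comma-delimited row (the Name,Salary,Age column layout here is invented for illustration):

```shell
# Analogy only: uppercase a name, add 1000 to a salary, and force an
# age string to be treated numerically, like the Modify operations above.
echo 'john,5000,30' | awk -F',' -v OFS=',' '
  { print toupper($1), $2 + 1000, $3 + 0 }'
# → JOHN,6000,30
```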


4. Copy Stage vs Modify Stage

Feature           Copy Stage          Modify Stage
Transformation    No                  Yes (simple only)
Performance       Very fast           Faster than Transformer
Expressions       Not allowed         Allowed
Use Case          Data duplication    Simple data modification


5. Real-Time Scenario

Scenario

Source sends customer names in lowercase, but target requires uppercase.

Design

Source → Modify Stage → Target

Expression

Customer_Name = upcase(Customer_Name);


Tuesday, 20 January 2026

Friday, 9 January 2026

What is OSH?

OSH (Orchestrate Shell) is the underlying scripting language used by IBM DataStage's parallel engine to run parallel jobs.

When you run a parallel job, DataStage internally converts the job design into an OSH script and executes it. The script controls how stages run, how data is partitioned, and how processing is distributed across nodes.


Why OSH Is Important

  • Controls job execution
  • Manages parallelism
  • Handles partitioning
  • Manages data flow between stages
  • Executes on multiple nodes

Simple Flow

DataStage Job Design
        ↓
Generated OSH Script
        ↓
APT Engine executes OSH


Where OSH Exists

  • OSH scripts are created temporarily during job execution
  • Location (example):
    • $APT_TMPDIR
  • Usually auto-deleted after job completion (unless debug enabled)
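If you want to see the generated OSH instead of hunting for temporary files, DataStage provides debug environment variables for this; APT_DUMP_SCORE and OSH_ECHO are the ones usually cited (check your installation's documentation, as availability can vary by version):

```shell
# Set in Job Properties → Parameters → Environment (or in dsenv).
# APT_DUMP_SCORE writes the parallel "score" (operators, node and
# partition layout) to the job log; OSH_ECHO echoes the OSH script.
export APT_DUMP_SCORE=True
export OSH_ECHO=True
```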

What OSH Contains

  • Stage operators
  • Link definitions
  • Partitioning logic
  • Sorting logic
  • File paths
  • Node allocations

Example (Conceptual)

ds_operator input | ds_transform | ds_aggregator | ds_operator output


OSH vs Unix Shell

Aspect            OSH                        Unix Shell
Purpose           DataStage job execution    OS command execution
Used by           DataStage engine           Users / scripts
Parallelism       Built-in                   Manual
User writes it?   No (auto-generated)        Yes


When You See OSH (Real Projects)

  • Job failure analysis
  • Performance tuning
  • Debugging parallel jobs
  • DS_SUPPORT / DSENGINE logs

 

Tuesday, 6 January 2026

How do you identify whether a DataStage job is running in parallel or sequentially?

 

1. Check the Job Type in DataStage Designer

This is the first and simplest check.

  • Parallel Job → runs in parallel
  • Server Job / Sequence Job → runs sequentially

📌 If it’s a Server Job, it cannot run in parallel.


2. Check the Stage Types Used

Some stages are always sequential.

Sequential-only stages:

  • Server Sequential File
  • Server Transformer
  • Server Lookup
  • Server Join

📌 If your job mainly uses Server stages, the job is sequential.


3. Look at the Job Log (Very Important)

Open Director → Job Log.

Parallel job log shows:

Operator: pxfunnel

Operator: pxpartition

Operator: pxsort

Number of nodes = 4

Sequential execution indicators:

  • No mention of px operators
  • No mention of nodes
  • Single process messages only

📌 If you don’t see px* operators → job is behaving sequentially.


4. Check the Environment Variable $APT_CONFIG_FILE

This variable controls parallelism.

  • If it is not set or is invalid, the job runs on 1 node
  • If it points to a valid configuration file → parallel execution

📌 Verify in:

Job Properties → Parameters → Environment
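For reference, the configuration file is plain text describing the processing nodes. A minimal two-node sketch (the hostname and paths are illustrative only):

```shell
# Write a minimal 2-node APT configuration file; with 2 nodes the job
# can run 2-way parallel. "dsserver" and the paths are examples only.
cat > config_2node.apt <<'EOF'
{
  node "node1" {
    fastname "dsserver"
    pools ""
    resource disk "/data/ds" { pools "" }
    resource scratchdisk "/scratch/ds" { pools "" }
  }
  node "node2" {
    fastname "dsserver"
    pools ""
    resource disk "/data/ds" { pools "" }
    resource scratchdisk "/scratch/ds" { pools "" }
  }
}
EOF
grep -c 'node "' config_2node.apt   # → 2
```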


5. Check the Number of Partitions on Links

In Designer:

  • Right-click link → Properties
  • Check Partitioning

Sequential behavior if:

  • Partition count = 1
  • Partitioning = Entire / Same

Parallel behavior:

  • Hash / Range / Round-Robin with multiple partitions
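Round-robin is the easiest of these to picture: rows are dealt out to the partitions in turn. A self-contained sketch in plain awk (not DataStage), with four rows and two partitions:

```shell
# Deal 4 rows round-robin across 2 "partitions"; the partition number
# each row lands in is printed first.
printf 'r1\nr2\nr3\nr4\n' | awk '{ print "partition", (NR - 1) % 2, $0 }'
# → partition 0 r1
#   partition 1 r2
#   partition 0 r3
#   partition 1 r4
```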


6. CPU & Process Monitoring (OS Level)

On the DataStage server:

  • Parallel job → multiple osh / dsapi_slave processes
  • Sequential job → single process

Commands:

ps -ef | grep dsapi

top
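To make the process check concrete, here is a self-contained sketch over sample ps output (the sample lines are invented; on a real server you would pipe `ps -ef` directly):

```shell
# Count parallel-engine worker processes in (sample) ps output. Several
# dsapi_slave / osh processes for one job suggest parallel execution;
# a single process suggests sequential.
sample='dsadm 101   1 0 osh -f /tmp/job.osh
dsadm 102 101 0 dsapi_slave
dsadm 103 101 0 dsapi_slave'
printf '%s\n' "$sample" | grep -c 'dsapi_slave'   # → 2
```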


7. Dataset vs Sequential File

  • Dataset (.ds) → supports parallelism
  • Sequential File (.txt, .dat) → often forces serialization (unless multiple readers/writers)

📌 Heavy use of Sequential Files can make a parallel job behave sequentially.


8. Peek / Debug Mode

If you enable Peek and see only one data stream, the job is not parallel.


In short: I identify whether a DataStage job runs in parallel or sequentially by checking the job type, the stage types, partitioning on links, $APT_CONFIG_FILE, and especially the job log for px operators and the node count.


 
