Thursday, 5 March 2026

Copy and Modify Stages

In IBM InfoSphere DataStage, both the Copy Stage and the Modify Stage are simple processing stages used in parallel jobs, but they serve different purposes.

 

1. Copy Stage

Definition

The Copy Stage is used to pass data from input to output without changing the data.

It simply copies rows from the source to the target.

Example Design

Sequential File → Copy Stage → Dataset

Input:

Emp_ID   Name
101      John
102      Mary

Output (same data):

Emp_ID   Name
101      John
102      Mary

When Copy Stage is Used

1.    Splitting data into multiple outputs

2.    Passing data without transformation

3.    Improving parallel processing

4.    Debugging jobs

Example – Multiple Outputs

                → Target1
Source → Copy →
                → Target2

Same data goes to both targets.

  • No transformation
  • Very fast
  • Minimal processing
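This pass-through fan-out behaves like the Unix `tee` command, which duplicates a stream to multiple targets without altering the rows. A minimal sketch of the analogy (the file names are illustrative, not DataStage paths):

```shell
# Mimic Copy Stage fan-out: one source file duplicated to two targets,
# rows unchanged. source.txt / target1.txt / target2.txt are examples.
printf 'Emp_ID,Name\n101,John\n102,Mary\n' > source.txt
tee target1.txt target2.txt < source.txt > /dev/null
diff target1.txt target2.txt && echo "targets identical"
```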


2. Modify Stage

Definition

The Modify Stage is used to change column values or data types using simple expressions.

It is faster than the Transformer stage because it performs only simple transformations.

Example

Input:

Name = john
Salary = 5000

Modify stage expression:

Name = upcase(Name);
Salary = Salary * 1.10;

Output:

Name = JOHN
Salary = 5500


3. Common Modify Stage Operations

Data Type Conversion

string_to_int(Age)

Change Column Value

Salary = Salary + 1000

Convert Case

Name = upcase(Name)
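The Modify stage itself is configured in the Designer rather than scripted, but as a rough analogy the same kinds of operations can be mimicked with awk on a comma-delimited row (the Name,Salary,Age column layout here is invented for illustration):

```shell
# Analogy only: uppercase a name, add 1000 to a salary, and force an
# age string to be treated numerically, like the Modify operations above.
echo 'john,5000,30' | awk -F',' -v OFS=',' '
  { print toupper($1), $2 + 1000, $3 + 0 }'
# → JOHN,6000,30
```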


4. Copy Stage vs Modify Stage

Feature           Copy Stage          Modify Stage
Transformation    No                  Yes (simple only)
Performance       Very fast           Faster than Transformer
Expressions       Not allowed         Allowed
Use Case          Data duplication    Simple data modification


5. Real-Time Scenario

Scenario

Source sends customer names in lowercase, but target requires uppercase.

Design

Source → Modify Stage → Target

Expression

Customer_Name = upcase(Customer_Name);


Tuesday, 20 January 2026

Friday, 9 January 2026

What is OSH?

OSH (Orchestrate Shell) is the underlying scripting language used by IBM DataStage's parallel engine to run parallel jobs.

When you run a parallel job, DataStage internally converts the job design into an OSH script and executes it. The script controls how stages run, how data is partitioned, and how processing is distributed across nodes.


Why OSH Is Important

  • Controls job execution
  • Manages parallelism
  • Handles partitioning
  • Manages data flow between stages
  • Executes on multiple nodes

Simple Flow

DataStage Job Design
        ↓
Generated OSH Script
        ↓
APT Engine executes OSH


Where OSH Exists

  • OSH scripts are created temporarily during job execution
  • Location (example):
    • $APT_TMPDIR
  • Usually auto-deleted after job completion (unless debug enabled)
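If you want to see the generated OSH instead of hunting for temporary files, DataStage provides debug environment variables for this; APT_DUMP_SCORE and OSH_ECHO are the ones usually cited (check your installation's documentation, as availability can vary by version):

```shell
# Set in Job Properties → Parameters → Environment (or in dsenv).
# APT_DUMP_SCORE writes the parallel "score" (operators, node and
# partition layout) to the job log; OSH_ECHO echoes the OSH script.
export APT_DUMP_SCORE=True
export OSH_ECHO=True
```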

What OSH Contains

  • Stage operators
  • Link definitions
  • Partitioning logic
  • Sorting logic
  • File paths
  • Node allocations

Example (Conceptual)

ds_operator input | ds_transform | ds_aggregator | ds_operator output


OSH vs Unix Shell

Aspect            OSH                        Unix Shell
Purpose           DataStage job execution    OS command execution
Used by           DataStage engine           Users / scripts
Parallelism       Built-in                   Manual
User writes it?   No (auto-generated)        Yes


When You See OSH (Real Projects)

  • Job failure analysis
  • Performance tuning
  • Debugging parallel jobs
  • DS_SUPPORT / DSENGINE logs

 

Tuesday, 6 January 2026

How do you identify whether a DataStage job is running in parallel or sequentially?

 

1. Check the Job Type in DataStage Designer

This is the first and simplest check.

  • Parallel Job → runs in parallel
  • Server Job / Sequence Job → runs sequentially

📌 If it’s a Server Job, it cannot run in parallel.


2. Check the Stage Types Used

Some stages are always sequential.

Sequential-only stages:

  • Server Sequential File
  • Server Transformer
  • Server Lookup
  • Server Join

📌 If your job mainly uses Server stages, the job is sequential.


3. Look at the Job Log (Very Important)

Open Director → Job Log.

Parallel job log shows:

Operator: pxfunnel

Operator: pxpartition

Operator: pxsort

Number of nodes = 4

Sequential execution indicators:

  • No mention of px operators
  • No mention of nodes
  • Single process messages only

📌 If you don’t see px* operators → job is behaving sequentially.


4. Check the Environment Variable $APT_CONFIG_FILE

This variable controls parallelism.

  • If it is not set or is invalid, the job runs on 1 node
  • If it points to a valid configuration file → parallel execution

📌 Verify in:

Job Properties → Parameters → Environment
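For reference, the configuration file is plain text describing the processing nodes. A minimal two-node sketch (the hostname and paths are illustrative only):

```shell
# Write a minimal 2-node APT configuration file; with 2 nodes the job
# can run 2-way parallel. "dsserver" and the paths are examples only.
cat > config_2node.apt <<'EOF'
{
  node "node1" {
    fastname "dsserver"
    pools ""
    resource disk "/data/ds" { pools "" }
    resource scratchdisk "/scratch/ds" { pools "" }
  }
  node "node2" {
    fastname "dsserver"
    pools ""
    resource disk "/data/ds" { pools "" }
    resource scratchdisk "/scratch/ds" { pools "" }
  }
}
EOF
grep -c 'node "' config_2node.apt   # → 2
```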


5. Check the Number of Partitions on Links

In Designer:

  • Right-click link → Properties
  • Check Partitioning

Sequential behavior if:

  • Partition count = 1
  • Partitioning = Entire / Same

Parallel behavior:

  • Hash / Range / Round-Robin with multiple partitions
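Round-robin is the easiest of these to picture: rows are dealt out to the partitions in turn. A self-contained sketch in plain awk (not DataStage), with four rows and two partitions:

```shell
# Deal 4 rows round-robin across 2 "partitions"; the partition number
# each row lands in is printed first.
printf 'r1\nr2\nr3\nr4\n' | awk '{ print "partition", (NR - 1) % 2, $0 }'
# → partition 0 r1
#   partition 1 r2
#   partition 0 r3
#   partition 1 r4
```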


6. CPU & Process Monitoring (OS Level)

On the DataStage server:

  • Parallel job → multiple osh / dsapi_slave processes
  • Sequential job → single process

Commands:

ps -ef | grep dsapi

top
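To make the process check concrete, here is a self-contained sketch over sample ps output (the sample lines are invented; on a real server you would pipe `ps -ef` directly):

```shell
# Count parallel-engine worker processes in (sample) ps output. Several
# dsapi_slave / osh processes for one job suggest parallel execution;
# a single process suggests sequential.
sample='dsadm 101   1 0 osh -f /tmp/job.osh
dsadm 102 101 0 dsapi_slave
dsadm 103 101 0 dsapi_slave'
printf '%s\n' "$sample" | grep -c 'dsapi_slave'   # → 2
```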


7. Dataset vs Sequential File

  • Dataset (.ds) → supports parallelism
  • Sequential File (.txt, .dat) → often forces serialization (unless multiple readers/writers)

📌 Heavy use of Sequential Files can make a parallel job behave sequentially.


8. Peek / Debug Mode

If you enable Peek and see only one data stream, the job is not parallel.


In short: I identify whether a DataStage job runs in parallel or sequentially by checking the job type, the stage types, partitioning on links, $APT_CONFIG_FILE, and especially the job log for px operators and the node count.


 
