Sort stage

What is Sort Stage in DataStage?

The Sort stage is a processing stage in DataStage used to sort data rows based on one or more columns in ascending or descending order. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data. You can sort in sequential mode to sort an entire data set or in parallel mode to sort data within partitions.

It is commonly used to:

  • Prepare data for Merge, Join, Remove Duplicates
  • Ensure business key order
  • Improve performance of downstream stages

Key Characteristics

Feature

Details

Stage Type

processing stage

Execution

Parallel

Input

One input link

Output

One output link

Sorting Type

Ascending / Descending

Memory Usage

Uses RAM, spills to disk if needed


Why Do We Need Sort Stage?

  • If Input data is not ordered
  • Downstream stage expects sorted data
  • To remove duplicates from input dataset
  • To merge two datasets
  • To get ordered input data for business logic

Sort Stage Configuration

1️ Sort Keys

You define:

·        Column name

·        Sort order:

·        Ascending

·        Descending

·        Case sensitivity (for string columns)

Example:

customer_id → Ascending order_date → Descending


2️ .Stable Sort

  • Preserves the relative order of rows with equal key values
  • Useful when secondary ordering must be maintained

3️. Allow Duplicate Records

  • Enabled → duplicates allowed
  • Disabled → duplicate rows removed (based on sort keys)

4️. Sort Order Options

  • Ascending (A → Z, 1 → 9)
  • Descending (Z → A, 9 → 1)

5️. Case Sensitivity

  • Case-sensitive sort (A ≠ a)
  • Case-insensitive sort

Sort Stage vs Remove Duplicates

Sort Stage

Remove Duplicates Stage

Sorts data

Removes duplicates

Can remove duplicates optionally

Requires sorted input

More flexible

Specialized stage


Sort Stage Performance Tuning

Use Hash Partitioning Before Sort

Sorting works within partitions, so partitioning improves speed.

Example:

Partition by HASH (customer_id)


Reduce Data Volume

  • Filter unwanted rows before sorting
  • Select only required columns

Increase Sort Memory

Set environment variables:

APT_SORT_MEMORY=512MB APT_SORT_TMP=/data/sort_tmp


Avoid Sorting When Not Needed

  • Some databases return ordered data
  • Use ORDER BY in source SQL when possible

Sort Stage vs Transformer Sort

Sort Stage

Transformer

Efficient for large data

Not recommended

Parallel processing

Single-node

Optimized engine

Slower

 ✅ Always use Sort stage for large datasets


Common Errors in Sort Stage

Error

Reason

Cannot allocate memory

Low sort memory

Job runs slow

Large data, no partitioning

Disk full

Insufficient temp space

Wrong output order

Incorrect sort key order


Real-Time Project Example

Scenario: Remove duplicate customer records
Solution:

  1. Sort by customer_id
  2. Disable Allow duplicate records
  3. Send output to target

When NOT to Use Sort Stage

  • Small lookup datasets (use Lookup stage)
  • When database can sort faster
  • When ordering is unnecessary

 

 

No comments:

Post a Comment

Most Recent posts

Copy and Modify Stages

In IBM Infosphere DataStage , both Copy Stage and Modify Stage are simple processing stages used in parallel jobs , but their purpose i...