What is Sort Stage in DataStage?
The Sort stage is a processing stage in DataStage used to sort data rows based on one or more columns in ascending or descending order. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data. You can sort in sequential mode to sort an entire data set or in parallel mode to sort data within partitions.
It is commonly used to:
- Prepare data for Merge, Join, Remove Duplicates
- Ensure business key order
- Improve performance of downstream stages
Key Characteristics
|
Feature |
Details |
|
Stage Type |
processing stage |
|
Execution |
Parallel |
|
Input |
One input link |
|
Output |
One output link |
|
Sorting Type |
Ascending / Descending |
|
Memory Usage |
Uses RAM, spills to disk if needed |
Why Do We Need Sort Stage?
- If Input data is not ordered
- Downstream stage expects sorted data
- To remove duplicates from input dataset
- To merge two datasets
- To get ordered input data for business logic
Sort Stage Configuration
1️⃣ Sort Keys
You define:
· Column name
· Sort order:
· Ascending
· Descending
· Case sensitivity (for string columns)
Example:
customer_id → Ascending order_date → Descending
2️ .Stable Sort
- Preserves the relative order of rows with equal key values
- Useful when secondary ordering must be maintained
3️. Allow Duplicate Records
- Enabled → duplicates allowed
- Disabled → duplicate rows removed (based on sort keys)
4️. Sort Order Options
- Ascending (A → Z, 1 → 9)
- Descending (Z → A, 9 → 1)
5️. Case Sensitivity
- Case-sensitive sort (A ≠ a)
- Case-insensitive sort
Sort Stage vs Remove Duplicates
|
Sort Stage |
Remove Duplicates Stage |
|
Sorts data |
Removes duplicates |
|
Can remove duplicates optionally |
Requires sorted input |
|
More flexible |
Specialized stage |
Sort Stage Performance Tuning
✔ Use Hash Partitioning Before Sort
Sorting works within partitions, so partitioning improves speed.
Example:
Partition by HASH (customer_id)
✔ Reduce Data Volume
- Filter unwanted rows before sorting
- Select only required columns
✔ Increase Sort Memory
Set environment variables:
APT_SORT_MEMORY=512MB APT_SORT_TMP=/data/sort_tmp
✔ Avoid Sorting When Not Needed
- Some databases return ordered data
- Use ORDER BY in source SQL when possible
Sort Stage vs Transformer Sort
|
Sort Stage |
Transformer |
|
Efficient for large data |
Not recommended |
|
Parallel processing |
Single-node |
|
Optimized engine |
Slower |
✅ Always use Sort stage for large datasets
Common Errors in Sort Stage
|
Error |
Reason |
|
Cannot allocate memory |
Low sort memory |
|
Job runs slow |
Large data, no partitioning |
|
Disk full |
Insufficient temp space |
|
Wrong output order |
Incorrect sort key order |
Real-Time Project Example
Scenario: Remove duplicate customer
records
Solution:
- Sort by customer_id
- Disable Allow duplicate records
- Send output to target
When NOT to Use Sort Stage
- Small lookup datasets (use Lookup stage)
- When database can sort faster
- When ordering is unnecessary
No comments:
Post a Comment