In IBM DataStage, the Aggregator Stage is a active parallel stage used to perform group-level calculations—similar to SQL GROUP BY operations.

What is Aggregator Stage?

The Aggregator stage groups incoming rows based on key columns, and then performs aggregate functions on numeric or non-numeric fields.

It is mainly used for:

SUM → Total salary per department
COUNT → Number of employees per region
MAX/MIN → Highest sales, lowest score
AVG → Average marks
FIRST/LAST → First or last record in each group
OTHER → Variance, Standard deviation, Percentiles (limited)
Key Features

✔ Active stage (doesn’t change row count except during grouping)

✔ Works in parallel but can also run sequentially

✔ Supports these modes:
Hash partitioning → Required when grouping on keys

Sort Aggregation → Faster if data is pr-sorted

Auto Partitioning → When no key is used (e.g., global totals)

Where Aggregator Stage is used?

Examples in real ETL jobs:

1. Summarization / Rollups

Total sales by date
Total revenue by branch
Count of transactions per customer

2. Deduplication (with FIRST/LAST)

Get the latest record per ID
Get the earliest transaction per account

3. Data Quality / Profiling

Count nulls, count distinct values

4. Global Calculations

Total file count
Total number of rows loaded

Example

Requirement:

Total sales per product.

Input:

Product	Sale
A	100
B	50
A	40

Aggregator Output:

Product	TotalSale
A	140
B	50

Group by = Product
Aggregate = SUM(Sale)

How Aggregator Works with Allow Merge

When Allow Merge = Yes:

· Aggregator assumes:

o Input data is already sorted

o Sorted by all Group By columns

· Aggregation is done by merging incoming groups

If input is not sorted correctly, results will be wrong.

Allow Merge vs Normal Aggregation

Feature	Allow Merge = No	Allow Merge = Yes
Internal sort	Yes	No
Performance	Slower	Faster
Input must be sorted	No	Yes
Risk	Low	High if misused

Learn ETL Datastage faster

Pages

Aggregator stage

What is Aggregator Stage?

Key Features

✔ Active stage (doesn’t change row count except during grouping)

✔ Works in parallel but can also run sequentially

✔ Supports these modes:

Where Aggregator Stage is used?

1. Summarization / Rollups

2. Deduplication (with FIRST/LAST)

3. Data Quality / Profiling

4. Global Calculations

No comments:

Post a Comment

Most Recent posts

Copy and Modify Stages