Aggregator stage

In IBM DataStage, the Aggregator Stage is a active parallel stage used to perform group-level calculations—similar to SQL GROUP BY operations.


What is Aggregator Stage?

The Aggregator stage groups incoming rows based on key columns, and then performs aggregate functions on numeric or non-numeric fields.

 

 

It is mainly used for:

  • SUM → Total salary per department

  • COUNT → Number of employees per region

  • MAX/MIN → Highest sales, lowest score

  • AVG → Average marks

  • FIRST/LAST → First or last record in each group

  • OTHER → Variance, Standard deviation, Percentiles (limited)


     Key Features

    ✔ Active stage (doesn’t change row count except during grouping)

    ✔ Works in parallel but can also run sequentially

    ✔ Supports these modes:

  • Hash partitioning → Required when grouping on keys
    • Sort Aggregation → Faster if data is pr-sorted
  • Auto Partitioning → When no key is used (e.g., global totals)

 Where Aggregator Stage is used?

       Examples in real ETL jobs:

    1. Summarization / Rollups

  •     Total sales by date

  •     Total revenue by branch

  •     Count of transactions per customer

  2. Deduplication (with FIRST/LAST)

  •   Get the latest record per ID

  •   Get the earliest transaction per account

  3. Data Quality / Profiling

  • Count nulls, count distinct values

  4. Global Calculations

  • Total file count

  • Total number of rows loaded


Example

    Requirement:

Total sales per product.

Input:

Product

Sale

A

100

B

50

A

40

Aggregator Output:

Product

   TotalSale

A

  140

B

  50

Group by = Product
Aggregate = SUM(Sale)


How Aggregator Works with Allow Merge

When Allow Merge = Yes:

·        Aggregator assumes:

o   Input data is already sorted

o   Sorted by all Group By columns

·        Aggregation is done by merging incoming groups

  If input is not sorted correctly, results will be wrong. 

Allow Merge vs Normal Aggregation

Feature

Allow Merge = No

Allow Merge = Yes

Internal sort

Yes

No

Performance

Slower

Faster

Input must be sorted

No

Yes

Risk

Low

High if misused

 

 


No comments:

Post a Comment

Most Recent posts

Copy and Modify Stages

In IBM Infosphere DataStage , both Copy Stage and Modify Stage are simple processing stages used in parallel jobs , but their purpose i...