In IBM DataStage, the Aggregator Stage is a active parallel stage used to perform group-level calculations—similar to SQL GROUP BY operations.
What is Aggregator Stage?
The Aggregator stage groups incoming rows based on key columns, and then performs aggregate functions on numeric or non-numeric fields.
It is mainly used for:
-
SUM → Total salary per department
-
COUNT → Number of employees per region
-
MAX/MIN → Highest sales, lowest score
-
AVG → Average marks
-
FIRST/LAST → First or last record in each group
-
OTHER → Variance, Standard deviation, Percentiles (limited)
Key Features
✔ Active stage (doesn’t change row count except during grouping)
✔ Works in parallel but can also run sequentially
✔ Supports these modes:
- Hash partitioning → Required when grouping on keys
-
Sort Aggregation → Faster if data is pr-sorted
- Auto Partitioning → When no key is used (e.g., global totals)
Where Aggregator Stage is used?
Examples in real ETL jobs:
1. Summarization / Rollups
-
Total sales by date
-
Total revenue by branch
-
Count of transactions per customer
2. Deduplication (with FIRST/LAST)
-
Get the latest record per ID
-
Get the earliest transaction per account
3. Data Quality / Profiling
-
Count nulls, count distinct values
4. Global Calculations
-
Total file count
-
Total number of rows loaded
No comments:
Post a Comment