The Change Capture stage in DataStage compares old and new datasets and identifies delta changes such as inserts, updates, and deletes. It outputs change codes that help in performing incremental data loads efficiently.
Why Do We Use the Change Capture Stage?
In production ETL processes, reloading the entire dataset every day is inefficient.
It is better to load only the records that changed since the previous run (the delta).
The Change Capture stage provides this by comparing:
- Old Dataset (Before)
- New Dataset (After)

and generating a list of records that have changed.
How It Works
The stage compares both datasets row by row, matching records on their key columns (primary keys).
It then outputs each record with a change code that indicates what type of change occurred.
Change Codes (Key Output)
| Change Code | Meaning |
|---|---|
| I or 1 | Insert (New record in After dataset, not found in Before dataset) |
| D or 2 | Delete (Record exists in Before dataset, missing in After dataset) |
| C or 3 | Change (Record exists in both, but one or more column values changed) |
| E or 0 | Copy (Record unchanged), normally filtered out |
Most ETL loads use only I, D, and C, because unchanged records do not need reprocessing.
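The comparison logic can be sketched in plain Python. This is only an illustration of the idea, not how DataStage implements the stage: the function name `change_capture`, the `drop_copy` flag, and the sample rows are hypothetical, and the numeric values assume the stage's default change codes (0 copy, 1 insert, 2 delete, 3 edit).

```python
# Illustrative sketch only: mimics the comparison, not DataStage internals.
# Numeric values assume the stage's default change codes.
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3


def change_capture(before, after, key, drop_copy=True):
    """Compare two lists of dict rows on `key` and tag each output row."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}
    output = []
    # Walk the union of keys so rows missing on either side are detected.
    for k in sorted(before_by_key.keys() | after_by_key.keys()):
        old, new = before_by_key.get(k), after_by_key.get(k)
        if old is None:
            output.append({**new, "change_code": INSERT})
        elif new is None:
            output.append({**old, "change_code": DELETE})
        elif old != new:
            output.append({**new, "change_code": EDIT})
        elif not drop_copy:
            # Unchanged rows are emitted only when explicitly requested.
            output.append({**new, "change_code": COPY})
    return output


# Hypothetical sample data.
before = [{"id": 10, "city": "Oslo"}, {"id": 20, "city": "Rome"}]
after = [
    {"id": 10, "city": "Oslo"},
    {"id": 20, "city": "Milan"},
    {"id": 30, "city": "Lima"},
]
result = change_capture(before, after, key="id")
for row in result:
    print(row)
```

With `drop_copy=True`, the unchanged row (id 10) is left out of the result, which mirrors the usual practice of filtering out copies.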
Example Scenario
Before Dataset (Day 1)
| ID | Name | Salary |
|---|---|---|
| 1 | John | 3000 |
| 2 | Mary | 4000 |
After Dataset (Day 2)
| ID | Name | Salary |
|---|---|---|
| 1 | John | 3200 |
| 3 | Alex | 3500 |
Change Capture Output
| ID | Change Code | Description |
|---|---|---|
| 1 | C | Salary updated |
| 2 | D | Record deleted |
| 3 | I | New record inserted |
This delta is used for incremental loading into the target table.
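As a rough illustration of that incremental load, the sketch below applies the delta from the example to an in-memory copy of the Day 1 table. The `apply_delta` helper and the dict-based "target table" are assumptions made for this example; a real job would issue inserts, updates, and deletes against the target database instead.

```python
# Hypothetical in-memory stand-in for the target table, keyed by ID (Day 1 state).
target = {
    1: {"Name": "John", "Salary": 3000},
    2: {"Name": "Mary", "Salary": 4000},
}

# Delta rows from the example output, carrying the Day 2 values where they exist.
delta = [
    {"ID": 1, "code": "C", "Name": "John", "Salary": 3200},
    {"ID": 2, "code": "D"},
    {"ID": 3, "code": "I", "Name": "Alex", "Salary": 3500},
]


def apply_delta(table, rows):
    """Apply insert (I), change (C), and delete (D) rows to the table in place."""
    for row in rows:
        if row["code"] == "D":
            table.pop(row["ID"], None)  # delete; ignore keys already absent
        elif row["code"] in ("I", "C"):
            # Insert and change both write the new values for the key.
            table[row["ID"]] = {"Name": row["Name"], "Salary": row["Salary"]}
        else:
            raise ValueError(f"unexpected change code: {row['code']!r}")
    return table


apply_delta(target, delta)
print(target)
```

After the delta is applied, the target holds the same rows as the After dataset, even though only three delta rows were processed rather than a full reload.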
When Do We Use Change Capture Stage?
- Daily incremental/delta loads
- CDC (Change Data Capture) processes
- Slowly Changing Dimensions (SCD)
- Synchronizing two data repositories
- Large datasets where full reload is expensive