A Surrogate
Key (SK) is an artificial, system-generated unique identifier assigned to dimension
records in a data warehouse.
It has no business meaning and is used instead of natural/business keys.
🔹 Why Surrogate Keys are Used
✔️ Handles
Slowly Changing Dimensions (SCD)
✔️
Improves join performance
✔️
Avoids issues with changing business keys
✔️
Ensures uniqueness across history
🔹 Example
Source (Business Key)
Customer_ID | Name | City
1001 | Ravi | Pune
Dimension Table
Cust_Key | Customer_ID | Name | City
1 | 1001 | Ravi | Pune
2 | 1001 | Ravi | Mumbai ← Type 2 change
Cust_Key is the surrogate key.
🔹 How Surrogate Keys are Handled in DataStage
DataStage provides multiple ways to generate and manage surrogate keys.
1️⃣ Surrogate Key Generator Stage.
🔸 What it does
- Automatically generates sequential unique numbers
- Thread-safe in parallel jobs
🔸 Job Flow
Source → Transformer → Surrogate Key Generator → Dimension Table
🔸 Configuration
- Specify:
- Key column name
- Starting value
- Increment value
2️⃣ Using Sequence in Database
🔸 How it works
- Database sequence generates key
- DataStage reads the generated value
🔸 Example
NEXTVAL(customer_seq)
3️⃣ Using Max + 1 Logic
SELECT MAX(CUST_KEY) + 1 FROM DIM_CUSTOMER
🔹 Surrogate Keys in SCD Handling
🔸 Type 1
- Same surrogate key
- Record updated
🔸 Type 2
- New surrogate key generated
- Old record expired
🔸 Type 3
- Same surrogate key
- Limited history columns updated
🔹 Surrogate Key Lookup During Fact Load
Fact table stores only surrogate keys, not business keys.
Fact Load Flow
Source → Lookup (Dimension) → Fact Table
Example
Fact Table
----------
Cust_Key | Amount
1 | 5000
🔹 Best Practices in DataStage
✔️ Always
use Surrogate Key Generator stage
✔️
Maintain natural key + surrogate key
✔️
Never expose surrogate keys to business users
✔️
Use lookup caching for performance
No comments:
Post a Comment