Learn ETL Datastage faster: 2025

Friday, 26 December 2025

what is SCD ?

SCD (Slowly Changing Dimension) refers to dimension tables in a data warehouse where attribute values change slowly over time, not frequently like transaction data.

The challenge is how to handle those changes. SCD defines techniques to manage historical changes in dimension data.

🔹 Example Dimension Table: Customer

Cust_ID	Name	City
101	Ravi	Chennai

Later, customer moves to Bangalore.

🔹 Why SCD Is Important

Business wants:

Current customer data
OR historical data
OR both (depends on reporting needs)

SCD defines how much history to store.

🔹 Types of SCD .

🔹 SCD Type 0 – No Change

Data never changes
Ignore updates

Example: Date of Birth

🔹 SCD Type 1 – Overwrite

Old value is replaced
No history maintained

Example:

Cust_ID	Name	City
101	Ravi	Bangalore

✔ Simple
❌ History lost

Use case: Correction of wrong data

🔹 SCD Type 2 – Full History (Very Important)

Create new row for every change
Maintain history using:

Surrogate key
Start date / End date
Current flag

Example:

SK	Cust_ID	City	Start_Date	End_Date	Is_Current
1	101	Chennai	2019-01-01	2023-03-31	N
2	101	Bangalore	2023-04-01	9999-12-31	Y

✔ Full history
✔ Most commonly used

🔹 SCD Type 3 – Limited History

Stores previous value in another column

Cust_ID	City	Prev_City
101	Bangalore	Chennai

✔ One level history
❌ Not scalable

🔹 SCD Type 4 – History Table

Current data in main table
History stored in separate table

🔹 SCD Type 6 – Hybrid

Combination of Type 1 + Type 2 + Type 3

🔹 Which SCD Types Are Used Most?

1️. Type 1 – Corrections
2️. Type 2 – History tracking (most important)
3️. Type 3 – Limited cases

🔹 ETL / DataStage Context

In DataStage:

Use Lookup stage to check existing record
Compare source vs target
Decide:

Insert new row (Type 2)
Update existing row (Type 1)

Manage surrogate keys

🔹 SCD Type 2 Logic (Simplified)

Match business key
Compare attributes
If change detected:

Expire old record
Insert new record

Tuesday, 23 December 2025

What are LAG () and LEAD ()?

LAG () and LEAD () are window (analytic) functions used to access data from another row without using self-joins.

· LAG () → gets value from a previous row

· LEAD () → gets value from a next row

LAG() and LEAD() allow comparison of a row with its previous or next row within a window.

🔹 Syntax

LAG (column, offset, default)

OVER (PARTITION BY column ORDER BY column)

LEAD (column, offset, default)

OVER (PARTITION BY column ORDER BY column)

· offset → how many rows before/after (default = 1)

default → value if row doesn’t exist (optional)

🔹 Example Table: sales

month	amount
Jan	1000
Feb	1200
Mar	1100
Apr	1500

🔹 LAG() Example (Previous Value)

SELECT month, amount,

LAG(amount) OVER (ORDER BY month) AS prev_amount

FROM sales;

Output:

month	amount	prev_amount
Jan	1000	NULL
Feb	1200	1000
Mar	1100	1200
Apr	1500	1100

🔹 LEAD() Example (Next Value)

SELECT month, amount,

LEAD(amount) OVER (ORDER BY month) AS next_amount

FROM sales;

Output:

month	amount	next_amount
Jan	1000	1200
Feb	1200	1100
Mar	1100	1500
Apr	1500	NULL

🔹 Use Case: Difference Calculation

SELECT month, amount,

amount - LAG(amount) OVER (ORDER BY month) AS diff

FROM sales;

➡ Shows month-to-month change

🔹 Using Default Value

LAG (amount, 1, 0) OVER (ORDER BY month)

➡ First row gets 0 instead of NULL

🔹 PARTITION BY

SELECT emp, dept, salary,

LAG(salary) OVER (

PARTITION BY dept

ORDER BY salary

) AS prev_salary

FROM emp;

➡ Comparison happens within each department

🔹 ETL / Data Engineering Use Cases

✔ Delta / incremental load detection
✔ Change data capture logic
✔ SCD Type-2 comparison
✔ Trend analysis
✔ Data quality checks

Bottom of Form

ROW_NUMBER vs RANK vs DENSE_RANK

ROW_NUMBER vs RANK vs DENSE_RANK

All three are window (analytic) functions used for ranking rows.

🔹 Sample Data

emp	dept	salary
A	IT	9000
B	IT	9000
C	IT	8000
D	IT	7000

🔹 ROW_NUMBER()

Assigns unique number to each row
No ties (even if values are same)

SELECT emp, salary,

ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn

FROM emp_table;

Output:

emp	salary	rn
A	9000	1
B	9000	2
C	8000	3
D	7000	4

➡ Always unique numbering.

🔹 RANK()

Same values get same rank
Gaps appear in ranking

SELECT emp, salary,

RANK() OVER (ORDER BY salary DESC) AS rnk

FROM emp_table;

Output:

emp	salary	rnk
A	9000	1
B	9000	1
C	8000	3
D	7000	4

➡ Rank 2 is skipped

🔹 DENSE_RANK()

Same values get same rank
No gaps in ranking

SELECT emp, salary,

DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk

FROM emp_table;

Output:

emp	salary	drnk
A	9000	1
B	9000	1
C	8000	2
D	7000	3

➡ No missing rank numbers

🔹 Key Differences (Quick Table)

Function	Handles Ties	Gaps in Rank	Unique Values
ROW_NUMBER	❌ No	❌ No	✅ Yes
RANK	✅ Yes	✅ Yes	❌ No
DENSE_RANK	✅ Yes	❌ No	❌ No

🔹 When to Use What

✔ Use ROW_NUMBER

Remove duplicates
Pick exactly one record
Deduplication logic

✔ Use RANK

Competition-style ranking
Reports where rank gaps matter

✔ Use DENSE_RANK

Top-N per group
Business ranking without gaps

🔹 ETL / Data Engineering Example

Top 2 Salaries per Department:

SELECT *

FROM (

SELECT emp, dept, salary,

DENSE_RANK () OVER (

PARTITION BY dept

ORDER BY salary DESC

) AS dr

FROM emp_table

) t

WHERE dr <= 2;

ROW_NUMBER gives unique numbers, RANK allows gaps, and DENSE_RANK removes gaps while handling ties.

Top of Form

Bottom of Form

Learn ETL Datastage faster

Pages

Friday, 26 December 2025

what is SCD ?

Tuesday, 23 December 2025

What are LAG () and LEAD ()?

ROW_NUMBER vs RANK vs DENSE_RANK

Most Recent posts

Copy and Modify Stages