Friday, 26 December 2025

what is SCD ?

SCD (Slowly Changing Dimension) refers to dimension tables in a data warehouse where attribute values change slowly over time, not frequently like transaction data.

The challenge is how to handle those changes. SCD defines techniques to manage historical changes in dimension data.


🔹 Example Dimension Table: Customer

Cust_ID

 Name

  City

101

Ravi

Chennai

Later, customer moves to Bangalore.


🔹 Why SCD Is Important

Business wants:

  • Current customer data
  • OR historical data
  • OR both (depends on reporting needs)

SCD defines how much history to store.


🔹 Types of SCD .


🔹 SCD Type 0 – No Change

  • Data never changes
  • Ignore updates

Example: Date of Birth


🔹 SCD Type 1 – Overwrite

  • Old value is replaced
  • No history maintained

Example:

Cust_ID

Name

City

101

Ravi

Bangalore

Simple
History lost

Use case: Correction of wrong data


🔹 SCD Type 2 – Full History (Very Important)

  • Create new row for every change
  • Maintain history using:
    • Surrogate key
    • Start date / End date
    • Current flag

Example:

SK

Cust_ID

City

Start_Date

End_Date

Is_Current

1

101

Chennai

2019-01-01

2023-03-31

N

2

101

Bangalore

2023-04-01

9999-12-31

Y

Full history
Most commonly used


🔹 SCD Type 3 – Limited History

  • Stores previous value in another column

Cust_ID

City

Prev_City

101

Bangalore

Chennai

One level history
Not scalable


🔹 SCD Type 4 – History Table

  • Current data in main table
  • History stored in separate table

🔹 SCD Type 6 – Hybrid

  • Combination of Type 1 + Type 2 + Type 3

🔹 Which SCD Types Are Used Most?

1️Type 1 – Corrections
2️
Type 2 – History tracking (most important)
3️
Type 3 – Limited cases


🔹 ETL / DataStage Context

In DataStage:

  • Use Lookup stage to check existing record
  • Compare source vs target
  • Decide:
    • Insert new row (Type 2)
    • Update existing row (Type 1)
  • Manage surrogate keys

🔹 SCD Type 2 Logic (Simplified)

  1. Match business key
  2. Compare attributes
  3. If change detected:
    • Expire old record
    • Insert new record

Tuesday, 23 December 2025

What are LAG () and LEAD ()?

 


LAG () and LEAD () are window (analytic) functions used to access data from another row without using self-joins.

·        LAG () → gets value from a previous row

·        LEAD () → gets value from a next row


LAG() and LEAD() allow comparison of a row with its previous or next row within a window.


🔹 Syntax

LAG (column, offset, default)

OVER (PARTITION BY column ORDER BY column)

 

LEAD (column, offset, default)

OVER (PARTITION BY column ORDER BY column)

·        offset → how many rows before/after (default = 1)

  • default → value if row doesn’t exist (optional)

🔹 Example Table: sales

month

amount

Jan

1000

Feb

1200

Mar

1100

Apr

1500


🔹 LAG() Example (Previous Value)

SELECT month, amount,

       LAG(amount) OVER (ORDER BY month) AS prev_amount

FROM sales;

Output:

month

amount

prev_amount

Jan

1000

NULL

Feb

1200

1000

Mar

1100

1200

Apr

1500

1100


🔹 LEAD() Example (Next Value)

SELECT month, amount,

       LEAD(amount) OVER (ORDER BY month) AS next_amount

FROM sales;

Output:

month

amount

next_amount

Jan

1000

1200

Feb

1200

1100

Mar

1100

1500

Apr

1500

NULL


🔹 Use Case: Difference Calculation

SELECT month, amount,

       amount - LAG(amount) OVER (ORDER BY month) AS diff

FROM sales;

Shows month-to-month change


🔹 Using Default Value

LAG (amount, 1, 0) OVER (ORDER BY month)

First row gets 0 instead of NULL


🔹 PARTITION BY

SELECT emp, dept, salary,

       LAG(salary) OVER (

         PARTITION BY dept

         ORDER BY salary

       ) AS prev_salary

FROM emp;

Comparison happens within each department


🔹 ETL / Data Engineering Use Cases

Delta / incremental load detection
Change data capture logic
SCD Type-2 comparison
Trend analysis
Data quality checks

Bottom of Form

 

ROW_NUMBER vs RANK vs DENSE_RANK

 

ROW_NUMBER vs RANK vs DENSE_RANK

All three are window (analytic) functions used for ranking rows.


🔹 Sample Data

emp

dept

salary

A

IT

9000

B

IT

9000

C

IT

8000

D

IT

7000


🔹 ROW_NUMBER()

  • Assigns unique number to each row
  • No ties (even if values are same)

SELECT emp, salary,

       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn

FROM emp_table;

Output:

emp

salary

rn

A

9000

1

B

9000

2

C

8000

3

D

7000

4

Always unique numbering.


🔹 RANK()

  • Same values get same rank
  • Gaps appear in ranking

SELECT emp, salary,

       RANK() OVER (ORDER BY salary DESC) AS rnk

FROM emp_table;

Output:

emp

salary

rnk

A

9000

1

B

9000

1

C

8000

3

D

7000

4

Rank 2 is skipped


🔹 DENSE_RANK()

  • Same values get same rank
  • No gaps in ranking

SELECT emp, salary,

       DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk

FROM emp_table;

Output:

emp

salary

drnk

A

9000

1

B

9000

1

C

8000

2

D

7000

3

No missing rank numbers


🔹 Key Differences (Quick Table)

Function

Handles Ties

Gaps in Rank

Unique Values

ROW_NUMBER

No

No

Yes

RANK

Yes

Yes

No

DENSE_RANK

Yes

No

No


🔹 When to Use What

Use ROW_NUMBER

  • Remove duplicates
  • Pick exactly one record
  • Deduplication logic

Use RANK

  • Competition-style ranking
  • Reports where rank gaps matter

Use DENSE_RANK

  • Top-N per group
  • Business ranking without gaps

🔹 ETL / Data Engineering Example

Top 2 Salaries per Department:

SELECT *

FROM (

  SELECT emp, dept, salary,

         DENSE_RANK () OVER (

           PARTITION BY dept

           ORDER BY salary DESC

         ) AS dr

  FROM emp_table

) t

WHERE dr <= 2;


ROW_NUMBER gives unique numbers, RANK allows gaps, and DENSE_RANK removes gaps while handling ties.


Top of Form

 

Bottom of Form

 

Most Recent posts

IBM Cloud Pak for Datastage