Thursday, 4 December 2025

How to configure DB Connector Stages


Difference between ODBC Connector, Oracle Connector, and DB2 Connector stages.

    1. ODBC Connector: Generic connector; can connect to any DB supporting ODBC drivers.

    2. Oracle Connector: Optimized specifically for Oracle; supports array insert, bulk load, and Oracle-specific features (like direct path load).

   3. DB2 Connector: Optimized for IBM DB2; supports DB2 CLI features and better pushdown optimization.

Note: Use native connectors (Oracle/DB2) instead of ODBC for better performance.


1. Configuring connection parameters dynamically.

Use Job Parameters or Parameter Sets for the database name, user, password, and table name. In the connector properties, reference them as parameters, for example:

Server = #DB_SERVER#

User = #DB_USER#

Password = #DB_PASSWORD#

This allows reuse across environments (Dev, QA, Prod).

2. Main properties for reading data:

Table name / SQL Query

Column list

Read Mode (Sequential / Parallel)

Fetch Buffer Size / Array Size

Before / After SQL

Isolation Level

These properties control what data is extracted and how.

3. Difference between Read, Write, and User-defined SQL:

        Read → Automatically generates a SELECT * FROM table.

        Write → Automatically generates INSERT INTO table VALUES(...).

        User-defined SQL → Lets you write custom SQL (e.g., joins, filters, stored procedure calls); see the sketch below.
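
For example, a minimal user-defined SQL sketch (the table, column, and parameter names below are illustrative, not connector defaults):

SELECT o.order_id,
       o.order_amt,
       c.cust_name
FROM   orders o
JOIN   customers c ON c.cust_id = o.cust_id
WHERE  o.order_date >= '#LOAD_DATE#'

Because the SQL is user-defined, the join and filter run inside the database instead of in the job, and job parameters such as #LOAD_DATE# can be substituted into the statement.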


4. Array Size / Batch Size:

         Defines how many rows are fetched or inserted per database round trip.

         Higher value = fewer DB calls = better performance.

        Too large can cause memory issues. Common value: 1000–5000.

5. Commit frequency:

        Defines how often a transaction is committed (e.g., every 10,000 rows).

       Helps control rollback size and improve performance.
       If set too high → long rollback; too low → frequent commits = slow.

     Auto-Commit = Yes.

           Each record is committed individually.
                ✅ Useful for small updates or CDC jobs.
               ❌ Avoid for bulk loads (very slow).

       Transaction Size:

        Similar to commit frequency — defines number of rows per commit.

         Example: Set Transaction Size = 10,000 → commits every 10k rows.


6. Capturing rejected records / SQL errors:

Enable Reject link or Reject file in the stage.

Properties: “Reject link”, “Write to log”, “Abort on error”.

You can also output SQLSTATE, SQLCODE, and DBMSCODE for debugging.


7. Performance tuning for large inserts:

Use Array Size and Transaction Size appropriately.

Disable Auto-Commit.

Use bulk load mode if supported (e.g., Oracle direct path load).

Avoid index updates during load; rebuild indexes afterward (see the sketch after this list).

Tune commit interval based on rollback segment.
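
As a rough sketch of the index step above, using Oracle syntax (the index name sales_fact_idx is illustrative):

ALTER INDEX sales_fact_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;
-- run the bulk load here
ALTER INDEX sales_fact_idx REBUILD;

Rebuilding once at the end is usually far cheaper than maintaining the index row by row during the load.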


8. Handling deadlocks / ORA-00060:

   Identify conflicting transactions (using DB logs).

Add retry logic in the job, or re-run the batch.

Use smaller commit intervals.

For update operations, maintain a consistent order of access to rows (see the sketch below).
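
The classic deadlock pattern and its fix can be sketched in two statements (table and key values are illustrative): if every session touches the rows in the same key order, neither session ends up waiting for a lock the other already holds.

-- Session A and Session B both update in ascending Cust_ID order
UPDATE customer_dim SET status = 'INACTIVE' WHERE cust_id = 101;
UPDATE customer_dim SET status = 'INACTIVE' WHERE cust_id = 205;
COMMIT;
-- If one session updated 205 first and the other 101 first,
-- each could block the other and Oracle would raise ORA-00060.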


 

Wednesday, 3 December 2025

What are Slowly Changing Dimensions (SCD) – Type 1 & Type 2?


A Slowly Changing Dimension (SCD) tracks how dimension data changes over time.


SCD Type 1 (Overwrite old data)

Definition

No history tracked.

If dimension attributes change, update the existing row.

Useful for non-historical or correction-based fields.

Real-world example

Changing customer email ID or phone number where history is not important.


SCD Type 1 – Job Design (DataStage)

Job Stages

Source (DB extract / file)

Lookup Stage

Lookup on Dimension Key (e.g., Customer_ID)

Transformer

Compare columns → Identify changed rows

If no match → INSERT

If match but attributes changed → UPDATE

Two Output Links

Insert link → Dimension table

Update link → Dimension table

Target (DB Connector)

Change comparison logic example

IF Lkp_CustID IS NULL THEN 'I'
ELSE IF Inp.Email <> Lkp.Email THEN 'U'
ELSE 'R'   -- no change


Sample Input / Output – SCD Type 1

Source Input

Cust_ID | Name | Email
101     | John | john@gmail.com

Existing Dimension Table

Cust_ID | Name | Email
101     | John | john.old@gmail.com

Expected Output (after SCD1)

Cust_ID | Name | Email
101     | John | john@gmail.com

Old email is overwritten, no history kept.
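
In database terms, the update link of an SCD Type 1 job boils down to an in-place overwrite; a minimal SQL sketch (table and column names are illustrative):

UPDATE customer_dim
SET    email = 'john@gmail.com'
WHERE  cust_id = 101;
-- No new row and no end-dating: the previous email value is simply lost.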


SCD Type 2 (Preserve full history)

Definition

Creates a new row when the attribute changes.

Maintains full history with date ranges or active flag.

Common SCD2 Fields

Column       | Purpose
Start_Date   | When the record became active
End_Date     | When the record stopped being active
Current_Flag | Y/N
Version_No   | Optional incremental version


SCD Type 2 – Job Design (DataStage)

Job Stages

Source Stage

Lookup Stage on business key

Transformer Stage

Compare incoming vs existing attribute(s)

Identify:

New record → INSERT

Changed record → EXPIRE existing + INSERT new

Two Output Links

Expired-Update Link → Update old row

Current_Flag = 'N'
End_Date = CurrentTimestamp()

New-Insert Link → Insert new row

Current_Flag = 'Y'
Start_Date = CurrentTimestamp()
Version_No = previous + 1

Target (DB Connector)
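
Translated to SQL, the two links roughly perform an expire-update followed by an insert; a minimal sketch matching the example below (table and column names are illustrative):

-- Expired-Update link: close the old version
UPDATE customer_dim
SET    curr_flag = 'N',
       end_date  = CURRENT_TIMESTAMP
WHERE  cust_id   = 101
AND    curr_flag = 'Y';

-- New-Insert link: open the new version
INSERT INTO customer_dim
       (dim_key, cust_id, name, city, start_date, end_date, curr_flag, version)
VALUES (2, 101, 'John', 'Chennai', CURRENT_TIMESTAMP, DATE '9999-12-31', 'Y', 2);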


 Sample Input / Output – SCD Type 2

Source Input

Cust_ID | Name | City
101     | John | Chennai

Existing Dimension Table

Dim_Key | Cust_ID | Name | City   | Start_Date | End_Date   | Curr_Flag | Version
1       | 101     | John | Mumbai | 2020-01-01 | 9999-12-31 | Y         | 1

Incoming change: City changed from Mumbai → Chennai.

Expected Output

Expire old row:

Dim_Key | Cust_ID | Name | City   | Start_Date | End_Date         | Curr_Flag | Version
1       | 101     | John | Mumbai | 2020-01-01 | 2024-12-03 17:00 | N         | 1

Insert new row:

Dim_Key | Cust_ID | Name | City    | Start_Date | End_Date   | Curr_Flag | Version
2       | 101     | John | Chennai | 2024-12-03 | 9999-12-31 | Y         | 2


 


 

Real-Time Examples & When to Use Lookup or Join

 Here are practical scenarios you face in ETL/DataStage projects:


Scenario 1: Dimension Lookup (Fast Lookup Needed)

Input: Fact file of 10M rows
Reference: Customer Dimension = 50K rows

Use Lookup

 Customer dimension is small → easy to cache

 Minimal overhead → very fast

 Ideal when the reference doesn't change frequently

➡️ Performance Impact:
Lookup can process 10M rows in minutes because the 50K dimension is held in memory.


Scenario 2: Large-to-Large Data Merge

Input: Sales Fact = 80M rows
Reference: Product Master = 50M rows

Use Join

Both datasets are large

Lookup is not feasible (memory heavy, slow)

Join distributes data across nodes (parallel processing)

➡️ Performance Impact:
Join will handle partitioning and load balancing → significantly faster for heavy volumes.


Scenario 3: Reference Table Changes Every Day

Input: Daily transaction file
Reference: Daily price list (variable but large)

Use Join

Reference data changes often

Caching daily large tables is inefficient

Join avoids cache rebuilding overhead


When Lookup Fails or Causes Slowness

Reference table > 1–2 million rows

Memory constraints on ETL server

Multiple lookups in a single job

Lookup key not selective

Lookup on unsorted huge data → long load time


Special Case: Sparse Lookup (DataStage)

Used when:

Reference data is in a database table

Input is small

Each record hits DB for a lookup

Example: Validate a handful of customer IDs from a DB table
Good for real-time or selective validation, but bad for large datasets (too many DB hits).
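
For instance, a sparse lookup fires one parameterized query per incoming row; in the DataStage connector stages the key is typically referenced with the ORCHESTRATE placeholder (the table and column names below are illustrative):

SELECT cust_id,
       cust_name,
       status
FROM   customer_master
WHERE  cust_id = ORCHESTRATE.cust_id

With 10 incoming rows this is 10 small queries, which is fine; with 10 million rows it becomes 10 million database round trips.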


Quick Decision Guide

Condition                               | Best Option
Small reference, large input            | Lookup
Both datasets are large                 | Join
Reference is in DB and input is small   | Sparse Lookup
Need full outer join                    | Join
Need reject records for failed matches  | Lookup
Complex join conditions                 | Join



Lookup vs Join – Which to Use When?

In ETL development, one of the most common design decisions is choosing Lookup or Join when combining datasets. Both achieve the same outcome—bringing additional data from a reference source—but their performance, scalability, and best-use scenarios are different. A smart choice here can save hours of batch runtime and significant system resources.

In this article, let’s break down how they work, performance differences, and real-time examples that you can directly relate to DataStage or any ETL tool.

What is a Lookup?

A Lookup is used to fetch related information from a reference dataset based on a key.
Usually used for small to medium reference tables, loaded into memory (hash file, dataset, or cached stage) for fast matching.

Key Points

Works like a key-value dictionary.

Ideal for dimension lookups, parameter tables, validations, and code mappings.

Can be cached in memory (fast).

Fails for large datasets because memory consumption becomes high.


What is a Join?

A Join combines two datasets based on a common key—similar to SQL joins.
Best suited when both datasets are large and can be processed in parallel.

Key Points

Designed for high-volume processing.

Uses sorting/partitioning to match records.

Supports inner, left, right, and full joins.

Slower for small reference data due to sorting overhead.

 Lookup vs Join – Performance Differences

Feature            | Lookup                                | Join
Best for           | Small/medium reference tables         | Large datasets
Performance        | Very fast if cached                   | Depends on sorting/partitioning
Memory usage       | High if reference table is huge       | Generally balanced
Parallelism        | Not always fully parallel             | Fully parallel (PX engine)
Reject handling    | Yes, easy to capture lookup failures  | Requires custom logic
Complex conditions | Limited to equality conditions        | Can handle complex join conditions
Initial overhead   | Low                                   | Sorting and partitioning overhead

 

 

 

