Thursday, 4 December 2025

How to configure DB Connector Stages


Difference between ODBC Connector, Oracle Connector, and DB2 Connector stages.

    1. ODBC Connector: Generic connector; can connect to any DB supporting ODBC drivers.

    2. Oracle Connector: Optimized specifically for Oracle; supports array insert, bulk load, and Oracle-specific features (like direct path load).

   3. DB2 Connector: Optimized for IBM DB2; supports DB2 CLI features and better pushdown optimization.

Note: Use native connectors (Oracle/DB2) instead of ODBC for better performance.


1. Configuring connection parameters dynamically.

Use Job Parameters or Parameter Sets for the database name, user, password, and table name. In the connector properties, reference them as parameters, for example:

Server = #DB_SERVER#

User = #DB_USER#

Password = #DB_PASSWORD#

This allows reuse across environments (Dev, QA, Prod).

2. Main properties for reading data:

Table name / SQL Query

Column list

Read Mode (Sequential / Parallel)

Fetch Buffer Size / Array Size

Before / After SQL

Isolation Level

These properties control what data is extracted and how.

3. Difference between Read, Write, and User-defined SQL:

        Read → Automatically generates a SELECT * FROM table.

        Write → Automatically generates INSERT INTO table VALUES(...).

        User-defined SQL → Lets you write custom SQL (e.g., joins, filters, stored procedure calls); see the sketch below.
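
For example, a minimal user-defined SQL sketch (the table, column, and parameter names below are illustrative, not connector defaults):

SELECT o.order_id,
       o.order_amt,
       c.cust_name
FROM   orders o
JOIN   customers c ON c.cust_id = o.cust_id
WHERE  o.order_date >= '#LOAD_DATE#'

Because the SQL is user-defined, the join and filter run inside the database instead of in the job, and job parameters such as #LOAD_DATE# can be substituted into the statement.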


4. Array Size / Batch Size:

         Defines how many rows are fetched or inserted per database round trip.

         Higher value = fewer DB calls = better performance.

        Too large can cause memory issues. Common value: 1000–5000.

5. Commit frequency:

        Defines how often a transaction is committed (e.g., every 10,000 rows).

       Helps control rollback size and improve performance.
       If set too high → long rollback; too low → frequent commits = slow.

     Auto-Commit = Yes.

           Each record is committed individually.
                ✅ Useful for small updates or CDC jobs.
               ❌ Avoid for bulk loads (very slow).

       Transaction Size:

        Similar to commit frequency — defines number of rows per commit.

         Example: Set Transaction Size = 10,000 → commits every 10k rows.


6. Capturing rejected records / SQL errors:

Enable Reject link or Reject file in the stage.

Properties: “Reject link”, “Write to log”, “Abort on error”.

You can also output SQLSTATE, SQLCODE, and DBMSCODE for debugging.


7. Performance tuning for large inserts:

Use Array Size and Transaction Size appropriately.

Disable Auto-Commit.

Use bulk load mode if supported (e.g., Oracle direct path load).

Avoid index updates during load; rebuild indexes afterward (see the sketch after this list).

Tune commit interval based on rollback segment.
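
As a rough sketch of the index step above, using Oracle syntax (the index name sales_fact_idx is illustrative):

ALTER INDEX sales_fact_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;
-- run the bulk load here
ALTER INDEX sales_fact_idx REBUILD;

Rebuilding once at the end is usually far cheaper than maintaining the index row by row during the load.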


8. Handling deadlocks / ORA-00060:

   Identify conflicting transactions (using DB logs).

Add retry logic in the job, or re-run the batch.

Use smaller commit intervals.

For update operations, maintain a consistent order of access to rows (see the sketch below).
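
The classic deadlock pattern and its fix can be sketched in two statements (table and key values are illustrative): if every session touches the rows in the same key order, neither session ends up waiting for a lock the other already holds.

-- Session A and Session B both update in ascending Cust_ID order
UPDATE customer_dim SET status = 'INACTIVE' WHERE cust_id = 101;
UPDATE customer_dim SET status = 'INACTIVE' WHERE cust_id = 205;
COMMIT;
-- If one session updated 205 first and the other 101 first,
-- each could block the other and Oracle would raise ORA-00060.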


 

Wednesday, 3 December 2025

What are Slowly Changing Dimensions (SCD) – Type 1 & Type 2?


A Slowly Changing Dimension (SCD) tracks how dimension data changes over time.


SCD Type 1 (Overwrite old data)

Definition

No history tracked.

If dimension attributes change, update the existing row.

Useful for non-historical or correction-based fields.

Real-world example

Changing customer email ID or phone number where history is not important.


SCD Type 1 – Job Design (DataStage)

Job Stages

Source (DB extract / file)

Lookup Stage

Lookup on Dimension Key (e.g., Customer_ID)

Transformer

Compare columns → Identify changed rows

If no match → INSERT

If match but attributes changed → UPDATE

Two Output Links

Insert link → Dimension table

Update link → Dimension table

Target (DB Connector)

Change comparison logic example

IF Lkp_CustID IS NULL THEN 'I'
ELSE IF Inp.Email <> Lkp.Email THEN 'U'
ELSE 'R'   -- no change


Sample Input / Output – SCD Type 1

Source Input

Cust_ID | Name | Email
101     | John | john@gmail.com

Existing Dimension Table

Cust_ID | Name | Email
101     | John | john.old@gmail.com

Expected Output (after SCD1)

Cust_ID | Name | Email
101     | John | john@gmail.com

Old email is overwritten, no history kept.
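
In database terms, the update link of an SCD Type 1 job boils down to an in-place overwrite; a minimal SQL sketch (table and column names are illustrative):

UPDATE customer_dim
SET    email = 'john@gmail.com'
WHERE  cust_id = 101;
-- No new row and no end-dating: the previous email value is simply lost.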


SCD Type 2 (Preserve full history)

Definition

Creates a new row when the attribute changes.

Maintains full history with date ranges or active flag.

Common SCD2 Fields

Column       | Purpose
Start_Date   | When the record became active
End_Date     | When the record stopped being active
Current_Flag | Y/N
Version_No   | Optional incremental version


SCD Type 2 – Job Design (DataStage)

Job Stages

Source Stage

Lookup Stage on business key

Transformer Stage

Compare incoming vs existing attribute(s)

Identify:

New record → INSERT

Changed record → EXPIRE existing + INSERT new

Two Output Links

Expired-Update Link → Update old row

Current_Flag = 'N'
End_Date = CurrentTimestamp()

New-Insert Link → Insert new row

Current_Flag = 'Y'
Start_Date = CurrentTimestamp()
Version_No = previous + 1

Target (DB Connector)
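
Translated to SQL, the two links roughly perform an expire-update followed by an insert; a minimal sketch matching the example below (table and column names are illustrative):

-- Expired-Update link: close the old version
UPDATE customer_dim
SET    curr_flag = 'N',
       end_date  = CURRENT_TIMESTAMP
WHERE  cust_id   = 101
AND    curr_flag = 'Y';

-- New-Insert link: open the new version
INSERT INTO customer_dim
       (dim_key, cust_id, name, city, start_date, end_date, curr_flag, version)
VALUES (2, 101, 'John', 'Chennai', CURRENT_TIMESTAMP, DATE '9999-12-31', 'Y', 2);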


 Sample Input / Output – SCD Type 2

Source Input

Cust_ID | Name | City
101     | John | Chennai

Existing Dimension Table

Dim_Key | Cust_ID | Name | City   | Start_Date | End_Date   | Curr_Flag | Version
1       | 101     | John | Mumbai | 2020-01-01 | 9999-12-31 | Y         | 1

Incoming change: City changed from Mumbai → Chennai.

Expected Output

Expire old row:

Dim_Key | Cust_ID | Name | City   | Start_Date | End_Date         | Curr_Flag | Version
1       | 101     | John | Mumbai | 2020-01-01 | 2024-12-03 17:00 | N         | 1

Insert new row:

Dim_Key | Cust_ID | Name | City    | Start_Date | End_Date   | Curr_Flag | Version
2       | 101     | John | Chennai | 2024-12-03 | 9999-12-31 | Y         | 2


 


 

Real-Time Examples & When to Use Lookup or Join

 Here are practical scenarios you face in ETL/DataStage projects:


Scenario 1: Dimension Lookup (Fast Lookup Needed)

Input: Fact file of 10M rows
Reference: Customer Dimension = 50K rows

Use Lookup

 Customer dimension is small → easy to cache

 Minimal overhead → very fast

 Ideal when the reference doesn't change frequently

➡️ Performance Impact:
Lookup can process 10M rows in minutes because the 50K dimension is held in memory.


Scenario 2: Large-to-Large Data Merge

Input: Sales Fact = 80M rows
Reference: Product Master = 50M rows

Use Join

Both datasets are large

Lookup is not feasible (memory heavy, slow)

Join distributes data across nodes (parallel processing)

➡️ Performance Impact:
Join will handle partitioning and load balancing → significantly faster for heavy volumes.


Scenario 3: Reference Table Changes Every Day

Input: Daily transaction file
Reference: Daily price list (variable but large)

Use Join

Reference data changes often

Caching daily large tables is inefficient

Join avoids cache rebuilding overhead


When Lookup Fails or Causes Slowness

Reference table > 1–2 million rows

Memory constraints on ETL server

Multiple lookups in a single job

Lookup key not selective

Lookup on unsorted huge data → long load time


Special Case: Sparse Lookup (DataStage)

Used when:

Reference data is in a database table

Input is small

Each record hits DB for a lookup

Example: Validate a handful of customer IDs from a DB table
Good for real-time or selective validation, but bad for large datasets (too many DB hits).
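
For instance, a sparse lookup fires one parameterized query per incoming row; in the DataStage connector stages the key is typically referenced with the ORCHESTRATE placeholder (the table and column names below are illustrative):

SELECT cust_id,
       cust_name,
       status
FROM   customer_master
WHERE  cust_id = ORCHESTRATE.cust_id

With 10 incoming rows this is 10 small queries, which is fine; with 10 million rows it becomes 10 million database round trips.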


Quick Decision Guide

Condition                               | Best Option
Small reference, large input            | Lookup
Both datasets are large                 | Join
Reference is in DB and input is small   | Sparse Lookup
Need full outer join                    | Join
Need reject records for failed matches  | Lookup
Complex join conditions                 | Join



Lookup vs Join – Which to Use When?

In ETL development, one of the most common design decisions is choosing Lookup or Join when combining datasets. Both achieve the same outcome—bringing additional data from a reference source—but their performance, scalability, and best-use scenarios are different. A smart choice here can save hours of batch runtime and significant system resources.

In this article, let’s break down how they work, performance differences, and real-time examples that you can directly relate to DataStage or any ETL tool.

What is a Lookup?

A Lookup is used to fetch related information from a reference dataset based on a key.
Usually used for small to medium reference tables, loaded into memory (hash file, dataset, or cached stage) for fast matching.

Key Points

Works like a key-value dictionary.

Ideal for dimension lookups, parameter tables, validations, and code mappings.

Can be cached in memory (fast).

Fails for large datasets because memory consumption becomes high.


What is a Join?

A Join combines two datasets based on a common key—similar to SQL joins.
Best suited when both datasets are large and can be processed in parallel.

Key Points

Designed for high-volume processing.

Uses sorting/partitioning to match records.

Supports inner, left, right, and full joins.

Slower for small reference data due to sorting overhead.

 Lookup vs Join – Performance Differences

Feature            | Lookup                                | Join
Best for           | Small/medium reference tables         | Large datasets
Performance        | Very fast if cached                   | Depends on sorting/partitioning
Memory usage       | High if reference table is huge       | Generally balanced
Parallelism        | Not always fully parallel             | Fully parallel (PX engine)
Reject handling    | Yes, easy to capture lookup failures  | Requires custom logic
Complex conditions | Limited to equality conditions        | Can handle complex join conditions
Initial overhead   | Low                                   | Sorting and partitioning overhead

 

 

 

