In IBM DataStage (parallel jobs), a Lookup stage typically uses HASH partitioning.
🔹 Default Partitioning of Lookup Stage
HASH partitioning on the lookup key(s)
Why HASH?
- Ensures that rows with the same lookup key go to the same partition
- This allows correct and fast key-based matching
🔹 How It Works
Input Data ──(HASH on key)──▶ Lookup Stage ◀──(HASH on key)── Reference Data
- Both the primary input and the reference input are hash-partitioned on the lookup key (see the sketch below)
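As a rough illustration of the idea (plain Python, not DataStage code), routing each row by `hash(key) % number_of_partitions` guarantees that every row with a given key value, whichever link it arrives on, lands in the same partition. The four-partition count and the sample rows are illustrative assumptions.

```python
# Conceptual sketch (plain Python, not DataStage): hash partitioning routes
# every row with the same key value to the same partition number, so matching
# rows from the primary and reference links meet on the same node.

NUM_PARTITIONS = 4  # illustrative node/partition count

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Return the partition a row with this lookup key is routed to."""
    return hash(key) % num_partitions

input_rows     = [{"CUST_ID": 101}, {"CUST_ID": 202}, {"CUST_ID": 101}]
reference_rows = [{"CUST_ID": 101}, {"CUST_ID": 202}]

for row in input_rows + reference_rows:
    print(row["CUST_ID"], "-> partition", partition_for(row["CUST_ID"]))
# Every occurrence of CUST_ID 101 prints the same partition number,
# regardless of which link the row came from.
```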
🔹 Example
Lookup condition: SRC.CUST_ID = DIM.CUST_ID
Partitioning used: HASH on CUST_ID (on both links)
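Continuing the conceptual sketch (plain Python, not DataStage), once both links are hash-partitioned on CUST_ID, each partition can resolve the lookup locally with a simple key-to-row map. The SRC/DIM sample rows and the "no match" handling below are illustrative assumptions.

```python
# One partition's view after hash partitioning on CUST_ID: build an in-memory
# map from the reference rows it received, then probe it with its share of
# the source rows. Sample data is made up for illustration.

src_partition = [  # primary-link rows that landed on this partition
    {"CUST_ID": 101, "AMOUNT": 50},
    {"CUST_ID": 303, "AMOUNT": 75},
]
dim_partition = [  # reference-link rows that landed on the same partition
    {"CUST_ID": 101, "CUST_NAME": "Alice"},
]

# Lookup table for this partition only.
ref_by_key = {row["CUST_ID"]: row for row in dim_partition}

for src in src_partition:
    match = ref_by_key.get(src["CUST_ID"])   # SRC.CUST_ID = DIM.CUST_ID
    if match:
        print({**src, **match})              # matched output row
    else:
        print("no match for", src["CUST_ID"])  # handled per the lookup failure rule
```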
🔹 Lookup Modes and Partitioning
1. Normal Lookup (Parallel)
- Uses HASH partitioning
- Reference data is loaded into memory and distributed across the nodes
2. Sparse Lookup
- The input link can still be hash-partitioned on the key
- The lookup runs directly against the database, one query per input row, which suits very large reference data (see the sketch after this list)
3. Lookup with Reference as a Dataset
- The Dataset is usually pre-hashed on the lookup key
- Or re-partitioned using HASH
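To make the contrast between the first two modes concrete, here is a hedged, plain-Python sketch that uses sqlite3 as a stand-in reference database; it is not DataStage's actual engine. A normal lookup reads the whole reference set into memory once, while a sparse lookup issues one query per incoming row, so the reference never has to fit in memory.

```python
# Conceptual sketch (plain Python + sqlite3) of the two lookup modes.
import sqlite3

conn = sqlite3.connect(":memory:")           # stand-in for the reference database
conn.execute("CREATE TABLE DIM (CUST_ID INTEGER PRIMARY KEY, CUST_NAME TEXT)")
conn.executemany("INSERT INTO DIM VALUES (?, ?)", [(101, "Alice"), (202, "Bob")])

input_rows = [{"CUST_ID": 101}, {"CUST_ID": 999}]

# --- Normal lookup: build the in-memory reference map up front ---
ref = {cust_id: name
       for cust_id, name in conn.execute("SELECT CUST_ID, CUST_NAME FROM DIM")}
for row in input_rows:
    print("normal:", row["CUST_ID"], "->", ref.get(row["CUST_ID"]))

# --- Sparse lookup: one query per incoming row, nothing cached locally ---
for row in input_rows:
    hit = conn.execute(
        "SELECT CUST_NAME FROM DIM WHERE CUST_ID = ?", (row["CUST_ID"],)
    ).fetchone()
    print("sparse:", row["CUST_ID"], "->", hit[0] if hit else None)
```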
🔹 Can Lookup Use Other Partitioning?
| Partition Type | Supported | Notes |
|----------------|-----------|-------|
| HASH | ✅ Yes | Default & recommended |
| ROUND-ROBIN | ❌ No | Incorrect results (matching rows may land in different partitions) |
| RANGE | ❌ No | Not suitable |
| ENTIRE | ⚠️ Yes | Used for small reference tables |
🔹 ENTIRE Partitioning (Special Case)
- Used when the reference table is small
- The reference data is copied to all nodes
- The primary input can remain partitioned in parallel
- Trade-off: increases memory usage on every node (see the sketch below)
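As a rough plain-Python illustration of ENTIRE partitioning (node count and rows are made up, not DataStage code), every node receives a full copy of the small reference table, so each node can service its own slice of the primary input locally, at the cost of one copy of the reference per node.

```python
# Conceptual sketch: ENTIRE partitioning broadcasts the small reference table
# to every node, so the primary input can keep whatever partitioning it has.

NUM_NODES = 4  # illustrative configuration
small_reference = [{"CUST_ID": 101, "CUST_NAME": "Alice"},
                   {"CUST_ID": 202, "CUST_NAME": "Bob"}]

# ENTIRE: every node holds a full copy of the reference data ...
reference_on_node = {node: list(small_reference) for node in range(NUM_NODES)}

# ... so any node can resolve its own primary rows without repartitioning.
primary_on_node_2 = [{"CUST_ID": 202, "AMOUNT": 10}]
ref_map = {r["CUST_ID"]: r for r in reference_on_node[2]}
for row in primary_on_node_2:
    print(row, "->", ref_map.get(row["CUST_ID"]))

# Trade-off: memory usage grows with NUM_NODES, because each node keeps
# the entire reference table.
```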
🔹 Performance Best Practices
✔️ Hash both inputs on the lookup key
✔️ Use ENTIRE only for small reference tables
✔️ Avoid repartitioning inside the Lookup stage
✔️ Use a Dataset for large reference data
In short, the Lookup stage uses HASH partitioning on the lookup key, on both the input and reference links, to ensure correct matching in parallel jobs.