Wednesday, 17 December 2025

Which Partitioning Method Does a Lookup Stage Use in Data Stage?

 In IBM Data Stage (Parallel Jobs), a Lookup stage typically uses HASH partitioning.


🔹 Default Partitioning of Lookup Stage

 HASH partitioning on the lookup key(s)

Why HASH?

·        Ensures that:

o   Rows with the same lookup key go to the same partition

·        This allows correct and fast key-based matching


🔹 How It Works

Input Data  ──(HASH on key)── Lookup Stage ──(HASH on key)── Reference Data

·        Both primary input and reference input

·        Are hash-partitioned on the lookup key


🔹 Example

Lookup Condition

SRC.CUST_ID = DIM.CUST_ID

Partitioning Used

HASH on CUST_ID


🔹 Lookup Modes and Partitioning

1️. Normal Lookup (Parallel)

  • Uses HASH partitioning
  • Distributed across nodes

2️.  Sparse Lookup

  • Also uses HASH partitioning
  • Optimized for large reference data

3️. Lookup with Reference as Dataset

  • Dataset is usually:
    • Pre-hashed
    • Or re-partitioned using HASH

🔹 Can Lookup Use Other Partitioning?

Partition Type

Supported

Notes

HASH

Yes

Default & recommended

ROUND-ROBIN

No

Incorrect results

RANGE

No

Not suitable

ENTIRE

⚠️ Yes

Used for small reference tables


🔹 ENTIRE Partitioning (Special Case)

  • Used when:
    • Reference table is small
  • Reference data is copied to all nodes
  • Primary input can remain parallel

 Increases memory usage


🔹 Performance Best Practices

✔️ Hash both inputs on lookup key
✔️ Use ENTIRE only for small reference tables
✔️ Avoid repartitioning inside lookup
✔️ Use dataset for large reference data


Lookup stage uses HASH partitioning on the lookup key to ensure correct matching in parallel jobs.

No comments:

Post a Comment

Most Recent posts

Which Partitioning Method Does a Lookup Stage Use in Data Stage?

  In IBM Data Stage (Parallel Jobs), a Lookup stage typically uses HASH partitioning. 🔹 Default Partitioning of Lookup Stage   H...