Wednesday, 17 December 2025

Which Partitioning Method Does a Lookup Stage Use in DataStage?

In IBM DataStage (parallel jobs), a Lookup stage typically uses HASH partitioning.


🔹 Default Partitioning of Lookup Stage

 HASH partitioning on the lookup key(s)

Why HASH?

·        Ensures that rows with the same lookup key go to the same partition

·        This allows correct and fast key-based matching
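The idea behind hash partitioning can be sketched in a few lines of Python (this is an illustration of the concept, not DataStage code): the partition is chosen by hashing the key, so every row carrying the same key value lands in the same partition.

```python
# Illustrative sketch (not DataStage code): hash partitioning routes
# every row with the same key value to the same partition.

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on a hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        idx = hash(row[key]) % num_partitions
        partitions[idx].append(row)
    return partitions

rows = [
    {"CUST_ID": 101, "AMT": 50},
    {"CUST_ID": 202, "AMT": 75},
    {"CUST_ID": 101, "AMT": 20},
]

# Both CUST_ID 101 rows are guaranteed to land in the same partition.
parts = hash_partition(rows, "CUST_ID", 4)
```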


🔹 How It Works

Input Data  ──(HASH on key)── Lookup Stage ──(HASH on key)── Reference Data

·        Both the primary input and the reference input are hash-partitioned on the lookup key


🔹 Example

Lookup Condition

SRC.CUST_ID = DIM.CUST_ID

Partitioning Used

HASH on CUST_ID
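A small Python sketch (illustrative only, not DataStage syntax; the column and table names mirror the example above) shows why co-partitioning both inputs on CUST_ID works: once SRC and DIM rows with the same key sit in the same partition, each partition can resolve its lookups locally, with no cross-partition traffic.

```python
# Illustrative sketch (not DataStage syntax): hash-partitioning both
# SRC and DIM on CUST_ID lets each partition do its lookups locally.

NUM_PARTS = 4

def part_of(key):
    # Same hash function for both inputs, so matching keys co-locate.
    return hash(key) % NUM_PARTS

src = [{"CUST_ID": 101, "AMT": 50}, {"CUST_ID": 202, "AMT": 75}]
dim = [{"CUST_ID": 101, "NAME": "Alice"}, {"CUST_ID": 202, "NAME": "Bob"}]

# Co-partition both inputs on the lookup key.
src_parts = [[r for r in src if part_of(r["CUST_ID"]) == i] for i in range(NUM_PARTS)]
dim_parts = [[r for r in dim if part_of(r["CUST_ID"]) == i] for i in range(NUM_PARTS)]

# Each partition builds its own reference table and matches locally.
output = []
for i in range(NUM_PARTS):
    ref = {r["CUST_ID"]: r for r in dim_parts[i]}
    for row in src_parts[i]:
        match = ref.get(row["CUST_ID"])
        if match:
            output.append({**row, "NAME": match["NAME"]})
```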


🔹 Lookup Modes and Partitioning

1. Normal Lookup (Parallel)

  • Uses HASH partitioning
  • Distributed across processing nodes

2. Sparse Lookup

  • Sends each lookup request directly to the source database
  • Suited to very large reference data with comparatively few input rows

3. Lookup with Reference as a Dataset

  • The dataset is usually:
    • Pre-hashed on the lookup key
    • Or re-partitioned using HASH at runtime

🔹 Can Lookup Use Other Partitioning?

Partition Type   Supported   Notes
HASH             Yes         Default and recommended
ROUND-ROBIN      No          Produces incorrect results (matching keys split across partitions)
RANGE            No          Not suitable for key-based matching
ENTIRE           ⚠️ Yes      Used for small reference tables
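To see why ROUND-ROBIN gives incorrect results, consider this Python sketch (illustrative only): round-robin deals rows out in turn, ignoring the key, so two rows with the same CUST_ID can land in different partitions and one of them will miss the reference row it needs.

```python
# Illustrative sketch: ROUND-ROBIN ignores the key, so rows with the
# same key can be split across partitions, breaking a partitioned lookup.

def round_robin(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

rows = [{"CUST_ID": 101}, {"CUST_ID": 101}, {"CUST_ID": 202}]
parts = round_robin(rows, 2)
# Key 101 now appears in BOTH partitions; whichever partition lacks the
# matching reference row for 101 cannot resolve its lookup.
```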


🔹 ENTIRE Partitioning (Special Case)

  • Used when the reference table is small
  • Reference data is copied to all nodes
  • Primary input can remain parallel
  • ⚠️ Increases memory usage, since every node holds a full copy of the reference data
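The ENTIRE behaviour can be sketched in Python (an illustration of the concept, not DataStage code): every node receives a full copy of the reference data, so any partition of the primary input can resolve any key locally, at the cost of memory proportional to reference size times the number of nodes.

```python
# Illustrative sketch: ENTIRE partitioning copies the whole reference
# table to every node; memory cost = reference size x number of nodes.

NUM_NODES = 4
dim = [{"CUST_ID": 101, "NAME": "Alice"}, {"CUST_ID": 202, "NAME": "Bob"}]

# Every node gets a full copy of the reference data as a local dict.
node_refs = [{r["CUST_ID"]: r["NAME"] for r in dim} for _ in range(NUM_NODES)]

# The primary input can stay partitioned however it arrived: any node
# can look up any CUST_ID without cross-node communication.
```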


🔹 Performance Best Practices

✔️ Hash both inputs on lookup key
✔️ Use ENTIRE only for small reference tables
✔️ Avoid repartitioning inside lookup
✔️ Use dataset for large reference data


In short: the Lookup stage uses HASH partitioning on the lookup key to ensure correct matching in parallel jobs.
