In IBM DataStage (parallel jobs), a Lookup stage typically uses HASH partitioning.
🔹 Default Partitioning of Lookup Stage
HASH partitioning on the lookup key(s)
Why HASH?
- Ensures that rows with the same lookup key go to the same partition
- This allows correct and fast key-based matching
🔹 How It Works
Input Data ──(HASH on key)──▶ Lookup Stage ◀──(HASH on key)── Reference Data
- Both the primary input and the reference input are hash-partitioned on the lookup key (see the sketch below)
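As a rough illustration of the idea (plain Python, not DataStage code), routing each row by `hash(key) % number_of_partitions` guarantees that every row with a given key value, whichever link it arrives on, lands in the same partition. The four-partition count and the sample rows are illustrative assumptions.

```python
# Conceptual sketch (plain Python, not DataStage): hash partitioning routes
# every row with the same key value to the same partition number, so matching
# rows from the primary and reference links meet on the same node.

NUM_PARTITIONS = 4  # illustrative node/partition count

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Return the partition a row with this lookup key is routed to."""
    return hash(key) % num_partitions

input_rows     = [{"CUST_ID": 101}, {"CUST_ID": 202}, {"CUST_ID": 101}]
reference_rows = [{"CUST_ID": 101}, {"CUST_ID": 202}]

for row in input_rows + reference_rows:
    print(row["CUST_ID"], "-> partition", partition_for(row["CUST_ID"]))
# Every occurrence of CUST_ID 101 prints the same partition number,
# regardless of which link the row came from.
```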
🔹 Example
Lookup condition: SRC.CUST_ID = DIM.CUST_ID
Partitioning used: HASH on CUST_ID (on both links)
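Continuing the conceptual sketch (plain Python, not DataStage), once both links are hash-partitioned on CUST_ID, each partition can resolve the lookup locally with a simple key-to-row map. The SRC/DIM sample rows and the "no match" handling below are illustrative assumptions.

```python
# One partition's view after hash partitioning on CUST_ID: build an in-memory
# map from the reference rows it received, then probe it with its share of
# the source rows. Sample data is made up for illustration.

src_partition = [  # primary-link rows that landed on this partition
    {"CUST_ID": 101, "AMOUNT": 50},
    {"CUST_ID": 303, "AMOUNT": 75},
]
dim_partition = [  # reference-link rows that landed on the same partition
    {"CUST_ID": 101, "CUST_NAME": "Alice"},
]

# Lookup table for this partition only.
ref_by_key = {row["CUST_ID"]: row for row in dim_partition}

for src in src_partition:
    match = ref_by_key.get(src["CUST_ID"])   # SRC.CUST_ID = DIM.CUST_ID
    if match:
        print({**src, **match})              # matched output row
    else:
        print("no match for", src["CUST_ID"])  # handled per the lookup failure rule
```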
🔹 Lookup Modes and Partitioning
1. Normal Lookup (Parallel)
- Uses HASH partitioning
- Reference data is loaded into memory and distributed across the nodes
2. Sparse Lookup
- The input link can still be hash-partitioned on the key
- The lookup runs directly against the database, one query per input row, which suits very large reference data (see the sketch after this list)
3. Lookup with Reference as a Dataset
- The Dataset is usually pre-hashed on the lookup key
- Or re-partitioned using HASH
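To make the contrast between the first two modes concrete, here is a hedged, plain-Python sketch that uses sqlite3 as a stand-in reference database; it is not DataStage's actual engine. A normal lookup reads the whole reference set into memory once, while a sparse lookup issues one query per incoming row, so the reference never has to fit in memory.

```python
# Conceptual sketch (plain Python + sqlite3) of the two lookup modes.
import sqlite3

conn = sqlite3.connect(":memory:")           # stand-in for the reference database
conn.execute("CREATE TABLE DIM (CUST_ID INTEGER PRIMARY KEY, CUST_NAME TEXT)")
conn.executemany("INSERT INTO DIM VALUES (?, ?)", [(101, "Alice"), (202, "Bob")])

input_rows = [{"CUST_ID": 101}, {"CUST_ID": 999}]

# --- Normal lookup: build the in-memory reference map up front ---
ref = {cust_id: name
       for cust_id, name in conn.execute("SELECT CUST_ID, CUST_NAME FROM DIM")}
for row in input_rows:
    print("normal:", row["CUST_ID"], "->", ref.get(row["CUST_ID"]))

# --- Sparse lookup: one query per incoming row, nothing cached locally ---
for row in input_rows:
    hit = conn.execute(
        "SELECT CUST_NAME FROM DIM WHERE CUST_ID = ?", (row["CUST_ID"],)
    ).fetchone()
    print("sparse:", row["CUST_ID"], "->", hit[0] if hit else None)
```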
🔹 Can Lookup Use Other Partitioning?
| Partition Type | Supported | Notes |
|----------------|-----------|-------|
| HASH | ✅ Yes | Default & recommended |
| ROUND-ROBIN | ❌ No | Incorrect results (matching rows may land in different partitions) |
| RANGE | ❌ No | Not suitable |
| ENTIRE | ⚠️ Yes | Used for small reference tables |
🔹 ENTIRE Partitioning (Special Case)
- Used when the reference table is small
- The reference data is copied to all nodes
- The primary input can remain partitioned in parallel
- Trade-off: increases memory usage on every node (see the sketch below)
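As a rough plain-Python illustration of ENTIRE partitioning (node count and rows are made up, not DataStage code), every node receives a full copy of the small reference table, so each node can service its own slice of the primary input locally, at the cost of one copy of the reference per node.

```python
# Conceptual sketch: ENTIRE partitioning broadcasts the small reference table
# to every node, so the primary input can keep whatever partitioning it has.

NUM_NODES = 4  # illustrative configuration
small_reference = [{"CUST_ID": 101, "CUST_NAME": "Alice"},
                   {"CUST_ID": 202, "CUST_NAME": "Bob"}]

# ENTIRE: every node holds a full copy of the reference data ...
reference_on_node = {node: list(small_reference) for node in range(NUM_NODES)}

# ... so any node can resolve its own primary rows without repartitioning.
primary_on_node_2 = [{"CUST_ID": 202, "AMOUNT": 10}]
ref_map = {r["CUST_ID"]: r for r in reference_on_node[2]}
for row in primary_on_node_2:
    print(row, "->", ref_map.get(row["CUST_ID"]))

# Trade-off: memory usage grows with NUM_NODES, because each node keeps
# the entire reference table.
```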
🔹 Performance Best Practices
✔️ Hash both inputs on the lookup key
✔️ Use ENTIRE only for small reference tables
✔️ Avoid repartitioning inside the Lookup stage
✔️ Use a Dataset for large reference data
In short, the Lookup stage uses HASH partitioning on the lookup key, on both the input and reference links, to ensure correct matching in parallel jobs.