1. Lookup Stage
Type: Active stage
Join Type: Lookup-based (reference lookup)
Input Links:
1 Primary link
1 or more Reference links
Working:
- Lookup matches the primary stream with the reference stream(s) based on keys.
- Uses hash-based lookup.
- Reference data is loaded into memory (hashed file) → faster.
- Lookup stage uses hash partitioning for correct matching, but it does not sort the input datasets.
Supports Which Joins?
- Inner Join
- Left Outer Join (most common)
- FULL outer and RIGHT outer NOT supported
When to Use Lookup
- Reference dataset is small or medium sized.
- You need fast lookup on primary data.
- Surrogate key lookup (dimension keys).
- Slowly changing dimension (SCD) lookup.
- Normal lookup is generally faster than Sparse lookup when the reference dataset can fit into memory, because the entire reference data is loaded once and then matching is done in memory.
- Sparse lookup is better when the reference dataset is too large to fit into memory or when the reference data keeps changing, because it queries the database row by row.
2. Join Stage
Type: Active stage
Join Type: SQL-style join
Input Links:
2 or more input links
Working:
- Performs equi-joins and non-equi joins similar to SQL.
- Requires sorted input on join keys.
- Handles large datasets better than Lookup.
Supports Which Joins?
- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
When to Use Join Stage
- When both input datasets are large.
- When you need full outer or right outer join.
- When joining more than two datasets.
- When data is already sorted or can be sorted easily.
3. Merge Stage
✔️ Type: Active stage
✔️ Join Type: Sequential merge (sorted)
✔️ Input Links:
- 1 Master link
- Multiple Update (secondary) links
✔️ Working:
- Inputs must be sorted on merge keys.
- It merges datasets sequentially like SQL outer join.
- Master row drives the output.
✔️ Supports Which Joins?
- Inner Join
- Left Outer Join (master is preserved)
- Right and Full outer joins NOT supported
✔️ When to Use Merge Stage
- When input datasets are already sorted.
- When you need fast sequential merging.
- When you have a Master → Update relationship.
- When data volume is very high, and sorting is done upstream.
Quick Comparison Table
|
Feature |
Lookup Stage |
Join Stage |
Merge Stage |
|
Input Type |
Primary + Reference |
2 or more |
1 Master + Updates |
|
Sorting Needed |
❌ No |
✔️ Yes |
✔️ Yes |
|
Memory Usage |
High (loads reference in memory) |
Low |
Low |
|
Performance |
Fast for small lookups |
Good for large data |
Excellent for very large sorted data |
|
Supports Full Outer |
❌ No |
✔️ Yes |
❌ No |
|
Supports Right Outer |
❌ No |
✔️ Yes |
❌ No |
|
Best Use Case |
Small lookup tables |
Large table joins |
Sorted sequential merging |
Which one to use in which situation?
✔️ If reference table is small → Use Lookup
✔️ If both tables are large and need outer join → Use Join
✔️ If data is already sorted → Use Merge
✔️ For SCD lookups → Use Lookup
✔️ For huge fact + huge dimension join → Use Join
No comments:
Post a Comment