Lookup,Join and merge stages

 

1. Lookup Stage

      Type: Active stage

      Join Type: Lookup-based (reference lookup)

      Input Links:

1       Primary link

            1 or more Reference links

 Working:

  • Lookup matches the primary stream with the reference stream(s) based on keys.
  • Uses hash-based lookup.
  • Reference data is loaded into memory (hashed file) → faster.
  • Lookup stage uses hash partitioning for correct matching, but it does not sort the input datasets.

 Supports Which Joins?

  • Inner Join
  • Left Outer Join (most common)
  • FULL outer and RIGHT outer NOT supported

 When to Use Lookup

  • Reference dataset is small or medium sized.
  • You need fast lookup on primary data.
  • Surrogate key lookup (dimension keys).
  • Slowly changing dimension (SCD) lookup. 
  •  Normal lookup is generally faster than Sparse lookup when the reference dataset can fit into memory, because the entire reference data is loaded once and then matching is done in memory.
  •  Sparse lookup is better when the reference dataset is too large to fit into memory or when the reference data keeps changing, because it queries the database row by row.

 2. Join Stage

         Type: Active stage

          Join Type: SQL-style join

         Input Links:

 2 or more input links

        Working:

  • Performs equi-joins and non-equi joins similar to SQL.
  • Requires sorted input on join keys.
  • Handles large datasets better than Lookup.

    Supports Which Joins?

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join

 When to Use Join Stage

  • When both input datasets are large.
  • When you need full outer or right outer join.
  • When joining more than two datasets.
  • When data is already sorted or can be sorted easily.

 3. Merge Stage

✔️ Type: Active stage

✔️ Join Type: Sequential merge (sorted)

✔️ Input Links:

  • 1 Master link
  • Multiple Update (secondary) links

✔️ Working:

  • Inputs must be sorted on merge keys.
  • It merges datasets sequentially like SQL outer join.
  • Master row drives the output.

✔️ Supports Which Joins?

  • Inner Join
  • Left Outer Join (master is preserved)
  • Right and Full outer joins NOT supported

✔️ When to Use Merge Stage

  • When input datasets are already sorted.
  • When you need fast sequential merging.
  • When you have a Master → Update relationship.
  • When data volume is very high, and sorting is done upstream.

 Quick Comparison Table

Feature

Lookup Stage

Join Stage

Merge Stage

Input Type

Primary + Reference

2 or more

1 Master + Updates

Sorting Needed

No

✔️ Yes

✔️ Yes

Memory Usage

High (loads reference in memory)

Low

Low

Performance

Fast for small lookups

Good for large data

Excellent for very large sorted data

Supports Full Outer

No

✔️ Yes

No

Supports Right Outer

No

✔️ Yes

No

Best Use Case

Small lookup tables

Large table joins

Sorted sequential merging


Which one to use in which situation?

✔️ If reference table is small → Use Lookup

✔️ If both tables are large and need outer join → Use Join

✔️ If data is already sorted → Use Merge

✔️ For SCD lookups → Use Lookup

✔️ For huge fact + huge dimension join → Use Join


Top of Form

 

Bottom of Form

 

No comments:

Post a Comment

Most Recent posts

Copy and Modify Stages

In IBM Infosphere DataStage , both Copy Stage and Modify Stage are simple processing stages used in parallel jobs , but their purpose i...