Wednesday, 10 December 2025

Parallelism and partitioning methods in DataStage

 

Parallelism in DataStage 

 Parallelism means executing ETL processes simultaneously to improve performance. DataStage PX achieves this using the parallel engine, which divides work across multiple nodes (defined in APT_CONFIG_FILE).
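
For illustration, here is what a minimal two-node APT_CONFIG_FILE might look like (the hostname "etl_host" and the resource paths are placeholders; use your environment's values):

    {
      node "node1"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
      }
    }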


 Types of Parallelism in DataStage

1. Pipeline Parallelism

  • Different stages run at the same time, processing streaming data.

  • Example:
    Source → Transformer → Target
    All three run concurrently once the pipeline fills.

✔ Improves performance by overlapping operations.
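
The overlap can be sketched in plain Python (an analogy only, not DataStage code): three threads play source, transformer and target, streaming rows through queues so all three stages work at the same time.

    import queue
    import threading

    q1, q2 = queue.Queue(maxsize=10), queue.Queue(maxsize=10)
    DONE = object()                       # end-of-stream marker

    def source():
        for row in range(1000):
            q1.put(row)                   # extract rows
        q1.put(DONE)

    def transformer():
        while (row := q1.get()) is not DONE:
            q2.put(row * 2)               # transform while source still reads
        q2.put(DONE)

    def target():
        while (row := q2.get()) is not DONE:
            pass                          # load (e.g. write to a table)

    threads = [threading.Thread(target=f) for f in (source, transformer, target)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()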


2. Partition Parallelism

  • A dataset is split into partitions, and each node processes its portion in parallel.

Example with 4 partitions:

1000 rows → split into 4 partitions of 250 rows each

Each node processes 250 rows at the same time.

✔ Massive performance boost
✔ Depends heavily on partitioning method
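
As a rough Python analogy (multiprocessing workers stand in for the engine's nodes, and process() is a placeholder for real stage logic):

    from multiprocessing import Pool

    rows = list(range(1000))
    partitions = [rows[i::4] for i in range(4)]   # 4 partitions of 250 rows

    def process(partition):
        return sum(partition)                     # stand-in for stage logic

    if __name__ == "__main__":
        with Pool(4) as pool:
            results = pool.map(process, partitions)   # all 4 run at once
        print(results)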


3. Component Parallelism (a less commonly used term)

  • Running multiple job instances in parallel

  • Example: Multiple instances of the same job processing different file paths


 Partitioning in DataStage

Partitioning defines how input data is divided across parallel nodes.


Common Partitioning Methods

1. Hash Partitioning

  • Splits data based on a key (e.g., customer_id)

  • Ensures same key always goes to the same partition

Use when:
✔ Joining
✔ Aggregations
✔ Removing duplicates
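
A minimal sketch of the idea (Python's built-in hash stands in for the engine's own hash function):

    def hash_partition(key, num_nodes):
        return hash(key) % num_nodes

    # The same key always lands on the same partition:
    for customer_id in ["C100", "C200", "C100", "C300"]:
        print(customer_id, "-> partition", hash_partition(customer_id, 4))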


2. Range Partitioning

  • Splits data based on ranges

Example:

  • 1–10000 → Node1

  • 10001–20000 → Node2

Use when:
✔ Requirement is range-based
✔ Ordered processing constraints
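
A sketch of the boundary lookup, assuming two cut-off values:

    import bisect

    boundaries = [10000, 20000]      # Node1: 1-10000, Node2: 10001-20000, Node3: rest

    def range_partition(key):
        return bisect.bisect_left(boundaries, key)

    print(range_partition(500))      # 0 -> Node1
    print(range_partition(15000))    # 1 -> Node2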


3. Entire Partitioning

  • Entire dataset sent to each partition (broadcast)

Use when:
✔ Small reference dataset
✔ Lookup that must be available everywhere
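
A sketch of the broadcast, using a hypothetical small lookup table:

    reference = {"C100": "GOLD", "C200": "SILVER"}    # small reference dataset
    num_nodes = 4

    # every node receives the complete table, so lookups stay local
    per_node_copy = [dict(reference) for _ in range(num_nodes)]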


4. Modulus Partitioning

  • Distributes rows using modulo arithmetic on an integer key column:
    partition = key % number_of_nodes

Use when:
✔ Keys are evenly distributed
✔ Good for balancing
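
Unlike hash partitioning there is no hashing step; the modulo is applied directly to the integer key value, as this sketch shows:

    for key in [101, 102, 103, 104, 105]:
        print(key, "-> partition", key % 4)   # 1, 2, 3, 0, 1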


5. Round Robin Partitioning

  • Sends rows one-by-one to each partition in a rotating sequence

Use when:
✔ No key-based logic
✔ Need an even distribution
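
A sketch: the partition depends only on the row's position, never on its values.

    rows = ["r1", "r2", "r3", "r4", "r5", "r6"]
    for i, row in enumerate(rows):
        print(row, "-> partition", i % 4)     # r1->0, r2->1, r3->2, r4->3, r5->0 ...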


6. Random Partitioning

  • Random assignment of rows

  • Rarely used; round robin gives a similarly even distribution with less overhead


Repartitioning

When the partitioning of two inputs differs, DataStage repartitions the data automatically.

Example:

  • Input1 is hash partitioned

  • Input2 is round robin

The Join stage will repartition both inputs using the same key.

Repartitioning costs performance, so avoid unnecessary repartitioning.
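
A sketch of what conceptually happens: both inputs are rehashed on the join key so matching customer_ids land on the same partition (the row layouts here are invented for illustration):

    from collections import defaultdict

    def repartition(rows, key_index, num_nodes):
        parts = defaultdict(list)
        for row in rows:
            parts[hash(row[key_index]) % num_nodes].append(row)
        return parts

    orders   = [("C1", "order1"), ("C2", "order2")]   # was hash partitioned
    payments = [("C2", "pay1"), ("C1", "pay2")]       # was round robin
    p1 = repartition(orders, 0, 4)
    p2 = repartition(payments, 0, 4)
    # now p1[n] and p2[n] hold the same customer_ids for every partition n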


Collector Types

When DataStage needs to merge partitions back:

  • Sort Merge collector → merges sorted partitions and preserves the sort order

  • Ordered collector → reads partitions one after another (partition 0 first, then 1, …)

  • Round Robin collector → takes one row from each partition in turn; no ordering guarantee

  • Sequential collector → collects everything into one output stream
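
As a sketch, a sort-merge collector behaves like merging already-sorted partitions into one ordered stream:

    import heapq

    part0, part1, part2 = [1, 4, 7], [2, 5, 8], [3, 6, 9]
    print(list(heapq.merge(part0, part1, part2)))   # [1, 2, 3, ..., 9]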


 Example: Join Stage

For a join on customer_id:

  • Both inputs must be hash partitioned on customer_id

  • If not, DataStage will automatically repartition
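
Once both inputs are hashed on customer_id, each node can join its own slice locally; a sketch with invented sample rows:

    def local_join(left, right):
        lookup = dict(right)                       # customer_id -> value
        return [(cid, val, lookup[cid]) for cid, val in left if cid in lookup]

    left_part  = [("C1", "order1"), ("C5", "order9")]   # one node's slice of input 1
    right_part = [("C1", "GOLD")]                       # same node's slice of input 2
    print(local_join(left_part, right_part))            # [('C1', 'order1', 'GOLD')]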


Real-Time Example

Dataset = 1 million customers
APT_CONFIG_FILE = 4 nodes

Using hash partitioning on customer_id (hashing spreads the keys evenly across nodes, though not in contiguous ranges):

  • Node1: ~250,000 customers

  • Node2: ~250,000 customers

  • Node3: ~250,000 customers

  • Node4: ~250,000 customers

Each node processes only its slice, in parallel → a huge performance gain.
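
A quick sketch of why the split is roughly even (the customer_id values are invented):

    from collections import Counter

    counts = Counter(hash(f"CUST{i}") % 4 for i in range(1_000_000))
    print(counts)   # each of the 4 partitions gets ~250,000 rows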


Summary

Parallelism = running faster by using multiple nodes

Partitioning = how the data is split across nodes

Good design means:

  • Proper partitioning

  • Avoid unnecessary repartition

  • Use correct collector

  • Use node pools efficiently




 
