Parallelism in DataStage
Parallelism means executing ETL processes simultaneously to improve performance. DataStage PX achieves this using the parallel engine, which divides work across multiple nodes (defined in APT_CONFIG_FILE).
Types of Parallelism in DataStage
1. Pipeline Parallelism
- Different stages run at the same time, processing data as a continuous stream.
- Example: Source → Transformer → Target. All three stages run concurrently once the pipeline fills.
✔ Improves performance by overlapping operations.
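The pipeline above can be sketched in plain Python with threads and queues. This is only an illustration of the idea, not how the DataStage engine is implemented; the stage names and data are made up.

```python
# Minimal sketch of pipeline parallelism: source, transformer, and target
# run concurrently, each consuming rows as soon as the previous stage emits them.
import threading
import queue

SENTINEL = object()  # marks end-of-stream

def source(out_q):
    for row in range(5):          # emit rows downstream as they are produced
        out_q.put(row)
    out_q.put(SENTINEL)

def transformer(in_q, out_q):
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 10)       # transform each row as it arrives
    out_q.put(SENTINEL)

def target(in_q, results):
    while (row := in_q.get()) is not SENTINEL:
        results.append(row)       # "load" the transformed row

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=source, args=(q1,)),
           threading.Thread(target=transformer, args=(q1, q2)),
           threading.Thread(target=target, args=(q2, results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # → [0, 10, 20, 30, 40]
```

All three threads are alive at once; no stage waits for the previous one to finish its whole input, which is exactly the overlap that pipeline parallelism exploits.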
2. Partition Parallelism
- A dataset is split into partitions, and each node processes its portion in parallel.
- Example with 4 partitions: 1000 rows → split into 4 → 250 rows each. Each node processes its 250 rows at the same time.
✔ Massive performance boost
✔ Depends heavily on the partitioning method
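The 4-way split above is just arithmetic, which a short sketch makes concrete. (How rows are actually assigned in DataStage depends on the partitioning method chosen; this example uses plain contiguous chunks.)

```python
# Sketch: 1000 rows divided into 4 contiguous chunks of 250, one per node.
rows = list(range(1000))
num_nodes = 4
chunk = len(rows) // num_nodes
partitions = [rows[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]
print([len(p) for p in partitions])  # → [250, 250, 250, 250]
```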
3. Component Parallelism (a less commonly used term)
- Running multiple instances of a job in parallel.
- Example: multiple instances of the same job, each processing a different file path.
Partitioning in DataStage
Partitioning defines how input data is divided across parallel nodes.
Common Partitioning Methods
1. Hash Partitioning
- Splits data based on a key (e.g., customer_id).
- Ensures the same key always goes to the same partition.
Use when:
✔ Joining
✔ Aggregations
✔ Removing duplicates
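A minimal sketch of the key property: hashing the key and taking it modulo the node count sends equal keys to equal partitions every time. The hash function here (`zlib.crc32`) is an illustrative stand-in, not the one DataStage uses.

```python
# Sketch of hash partitioning: the same customer_id always hashes to the
# same partition, so joins/aggregations on that key stay node-local.
import zlib

def hash_partition(key, num_nodes):
    # zlib.crc32 is stable across runs (Python's built-in hash() is salted)
    return zlib.crc32(str(key).encode()) % num_nodes

num_nodes = 4
p1 = hash_partition("CUST-1001", num_nodes)
p2 = hash_partition("CUST-1001", num_nodes)
assert p1 == p2  # same key → same partition, every time
```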
2. Range Partitioning
- Splits data based on key ranges.
Example:
- 1–10000 → Node1
- 10001–20000 → Node2
Use when:
✔ Requirement is range-based
✔ Ordered processing constraints
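The range example above can be sketched with a sorted boundary list; `bisect` finds which range a key falls into. The boundaries are the illustrative 1–10000 / 10001–20000 split from the example.

```python
# Sketch of range partitioning: keys are routed by comparing against
# sorted range boundaries (upper bounds, inclusive).
import bisect

boundaries = [10000, 20000]  # keys above the last bound go to the final node

def range_partition(key):
    return bisect.bisect_left(boundaries, key)  # 0 = Node1, 1 = Node2, ...

print(range_partition(500))    # → 0 (Node1)
print(range_partition(15000))  # → 1 (Node2)
```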
3. Entire Partitioning
- The entire dataset is sent to every partition (broadcast).
Use when:
✔ Small reference dataset
✔ Lookup that must be available everywhere
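Broadcast is the simplest method to picture: every node gets its own full copy of the (small) reference data, so lookups never need to leave the node. The reference table here is invented for illustration.

```python
# Sketch of entire (broadcast) partitioning: each node receives a full copy
# of a small reference dataset, so lookups never cross node boundaries.
reference = {"US": "United States", "IN": "India"}  # small lookup table
num_nodes = 4
node_copies = [dict(reference) for _ in range(num_nodes)]  # one full copy per node
assert all(copy == reference for copy in node_copies)
```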
4. Modulus Partitioning
- The system distributes rows using modulo arithmetic:
partition = key % number_of_nodes
Use when:
✔ Keys are evenly distributed
✔ Good for balancing
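Unlike hash partitioning, modulus uses the integer key directly, exactly as the formula above states. A one-function sketch:

```python
# Sketch of modulus partitioning: the numeric key itself (not a hash of it)
# is taken modulo the node count.
num_nodes = 4

def modulus_partition(key):
    return key % num_nodes

print([modulus_partition(k) for k in [100, 101, 102, 103, 104]])
# → [0, 1, 2, 3, 0]
```

This balances well only when the keys themselves are evenly spread; e.g., keys that are all multiples of 4 would land on a single node.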
5. Round Robin Partitioning
- Sends rows one by one to each partition in a rotating sequence.
Use when:
✔ No key-based logic
✔ Need an even distribution
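The rotation can be sketched with `itertools.cycle`; the row values are arbitrary.

```python
# Sketch of round-robin partitioning: rows go to partitions in strict
# rotation, guaranteeing an even spread without inspecting any key.
from itertools import cycle

num_nodes = 4
partitions = [[] for _ in range(num_nodes)]
for row, node in zip(range(10), cycle(range(num_nodes))):
    partitions[node].append(row)
print(partitions)  # → [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

No partition ever ends up with more than one extra row, which is why round robin is the default choice when there is no key-based requirement.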
6. Random Partitioning
- Rows are assigned to partitions at random.
- Rarely used, because it gives no control over where a row lands and can leave partitions unevenly loaded.
Repartitioning
When two inputs to a stage are partitioned differently, DataStage repartitions them automatically.
Example:
- Input1 is hash partitioned
- Input2 is round robin partitioned
The Join stage will repartition both inputs using the same key.
Repartitioning costs performance → avoid unnecessary repartitioning.
Collector Types
When DataStage needs to merge partitions back:
- Sort Merge collector → preserves sort order while merging
- Ordered collector → reads partitions in sequence; structured but slower
- Round Robin collector → no ordering guarantee
- Sequential collector → a single output stream
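The sort-merge idea is easy to demonstrate: if each partition is already sorted on the key, merging them heap-wise yields one globally sorted stream. `heapq.merge` is a Python stand-in for what the collector does, with invented partition contents.

```python
# Sketch of a sort-merge collector: each partition is sorted on the key,
# and heapq.merge interleaves them into one globally ordered output stream.
import heapq

partitions = [[1, 5, 9], [2, 6], [3, 7], [4, 8]]  # sorted per partition
merged = list(heapq.merge(*partitions))
print(merged)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```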
Example: Join Stage
For a join on customer_id:
- Both inputs must be hash partitioned on customer_id
- If not, DataStage will repartition them automatically
Real-Time Example
Dataset = 1 million customers
APT_CONFIG_FILE = 4 nodes
Using hash partitioning on customer_id, the rows are spread roughly evenly (the assignment follows the hash of the key, not contiguous ID ranges):
- Node1: ~25% of customers
- Node2: ~25% of customers
- Node3: ~25% of customers
- Node4: ~25% of customers
Each node processes only its slice, in parallel → huge performance gain.
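The even split can be checked with a quick simulation. As before, `zlib.crc32` is only a stand-in hash for illustration, so the exact counts are not what DataStage would produce, but the roughly-quarter-each shape is the point.

```python
# Sketch of the scenario above: 1 million customer_ids hash-partitioned
# across 4 nodes; a well-behaved hash puts close to 250,000 rows on each.
import zlib

num_nodes = 4
counts = [0] * num_nodes
for customer_id in range(1_000_000):
    counts[zlib.crc32(str(customer_id).encode()) % num_nodes] += 1
print(counts)  # each count lands close to 250,000
```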
Summary
Parallelism = running faster by using multiple nodes
Partitioning = how the data is split across nodes
Good design means:
- Proper partitioning
- Avoiding unnecessary repartitioning
- Using the correct collector
- Using node pools efficiently