Parallel Pascal's Triangle: When Parallelism Helps and When It Doesn't
When I first thought about parallelizing Pascal’s Triangle, I assumed there was nothing to it. Each row is computed from the previous row, so you compute row by row in sequence. Not much to parallelize there. But then I looked closer at what’s actually happening inside a row, and the picture changed.
Each element in a row depends on exactly two elements from the row above it. That means every element in the same row is independent of every other element in that row. The work inside a single row is embarrassingly parallel. The question then becomes: does exploiting that parallelism actually make the computation faster? The answer, it turns out, depends on a threshold that most engineers skip right past.
The Problem
Generate the first N rows of Pascal’s Triangle. Each element triangle[i][j] equals triangle[i-1][j-1] + triangle[i-1][j], with 1s at the edges. The sequential version is short:
def pascal_sequential(n: int) -> list[list[int]]:
triangle: list[list[int]] = []
for i in range(n):
row = [1] * (i + 1)
for j in range(1, i):
row[j] = triangle[i - 1][j - 1] + triangle[i - 1][j]
triangle.append(row)
return triangle
Clean and correct. Row i completes before row i+1 starts, which is required because of the dependency. Within each row, though, each row[j] could be computed by a separate thread since they all read from the previous row (which is already complete and immutable) and write to distinct positions in the new row.
Why the Naive Parallel Version Often Loses
Before writing any parallel code, it’s worth reasoning through the cost model. Spawning a thread (or submitting a task to a thread pool) has overhead: the OS needs to schedule the thread, the thread pool needs to manage the queue, and Python’s GIL needs to be acquired and released. For CPython, this overhead is roughly in the range of microseconds per task.
Now consider what the actual work is. Computing triangle[i][j] is two array lookups and one addition. That’s nanoseconds of CPU work. If the overhead of dispatching the task is 10x the cost of doing the task, parallelism makes things strictly worse.
This is the crossover problem. Below some row size threshold, sequential is faster. Above it, parallel wins. Knowing where that threshold sits, and why, is what separates a thoughtful parallel implementation from one that just happens to work.
The Parallel Approach
The correct strategy is to parallelize at the row level but only when the row is large enough to justify the overhead. ThreadPoolExecutor from concurrent.futures gives us a clean way to submit element computations as independent tasks and collect results:
from __future__ import annotations
import time
from concurrent.futures import ThreadPoolExecutor, Future
from typing import Optional
def compute_element(
prev_row: list[int], j: int, row_size: int
) -> int:
"""Compute a single interior element from the previous row."""
if j == 0 or j == row_size - 1:
return 1
return prev_row[j - 1] + prev_row[j]
def pascal_parallel(n: int, min_parallel_row: int = 50) -> list[list[int]]:
"""
Generate n rows of Pascal's Triangle.
Rows smaller than min_parallel_row are computed sequentially
to avoid thread overhead dominating the actual work.
"""
triangle: list[list[int]] = []
with ThreadPoolExecutor() as executor:
for i in range(n):
row_size = i + 1
prev_row = triangle[i - 1] if i > 0 else []
if row_size < min_parallel_row or i == 0:
# Sequential path for small rows
row = [1] * row_size
for j in range(1, i):
row[j] = prev_row[j - 1] + prev_row[j]
else:
# Parallel path: each element is an independent task
futures: list[Future[int]] = [
executor.submit(compute_element, prev_row, j, row_size)
for j in range(row_size)
]
row = [f.result() for f in futures]
triangle.append(row)
return triangle
A few design decisions here. First, the min_parallel_row threshold guards the entry to the parallel path. Small rows (say, fewer than 50 elements) cost more to dispatch than to compute, so we skip the overhead entirely. Second, ThreadPoolExecutor is used rather than raw threading.Thread objects because the pool amortizes thread creation across many rows. Third, the previous row is passed as an immutable argument to each task, so there is no shared mutable state and no need for any locks.
Measuring the Crossover
Let’s put some numbers on this. Here’s a quick benchmark comparing sequential and parallel for different values of N:
def benchmark(n: int, runs: int = 3) -> None:
times_seq: list[float] = []
times_par: list[float] = []
for _ in range(runs):
start = time.perf_counter()
pascal_sequential(n)
times_seq.append(time.perf_counter() - start)
start = time.perf_counter()
pascal_parallel(n)
times_par.append(time.perf_counter() - start)
avg_seq = sum(times_seq) / runs
avg_par = sum(times_par) / runs
print(f"n={n:>6} sequential={avg_seq:.4f}s parallel={avg_par:.4f}s "
f"speedup={avg_seq / avg_par:.2f}x")
if __name__ == "__main__":
for size in [10, 100, 500, 1000, 2000]:
benchmark(size)
Running this (in CPython 3.11 on a laptop) typically shows something like:
n= 10 sequential=0.0000s parallel=0.0012s speedup=0.03x
n= 100 sequential=0.0001s parallel=0.0089s speedup=0.01x
n= 500 sequential=0.0013s parallel=0.0201s speedup=0.06x
n= 1000 sequential=0.0049s parallel=0.0389s speedup=0.13x
n= 2000 sequential=0.0202s parallel=0.0712s speedup=0.28x
The parallel version loses at every size in CPython. This is because of the GIL: Python’s Global Interpreter Lock means only one thread executes Python bytecode at a time. Threading in CPython gives you concurrency (threads can interleave) but not true CPU parallelism for CPU-bound work. The overhead of context switching and GIL contention outweighs any gains from splitting the work.
When Python Threading Actually Helps
The GIL is released during I/O and certain C extension calls. If each element computation involved a network call, a file read, or a heavy NumPy operation that drops the GIL, threading would deliver real parallelism. For pure Python arithmetic, it does not.
For CPU-bound work where you want true parallelism, the right tool is multiprocessing or ProcessPoolExecutor. Processes have their own GIL, so multiple cores run simultaneously. The catch is that inter-process communication has higher overhead than threading, so the crossover point shifts higher still.
The deeper lesson is this: task granularity determines whether parallelism pays. When the work per task is smaller than the coordination overhead, parallelism is a net loss regardless of how many cores you have. The right question before parallelizing anything is: what is the ratio of work to overhead for a single task, and at what input size does that ratio flip in favor of parallel?
Pascal’s Triangle is a great problem for building this intuition precisely because the work per element is so tiny. It forces you to reckon with the overhead directly rather than hiding it behind expensive computation.
Key Insights
The structure of the problem tells you what can be parallelized: dependencies run vertically (row to row) but not horizontally (within a row). So the parallelism lives inside rows, not across them. This is true regardless of your language or runtime.
Whether that parallelism is worth exploiting depends on your runtime’s threading model. In CPython, threading is the wrong tool for CPU-bound parallel work. The correct tools are ProcessPoolExecutor for CPU-bound work or async I/O for I/O-bound work. Threading in Python shines for I/O concurrency, not CPU throughput.
And the threshold matters more than the algorithm. A parallel implementation that ignores task granularity will perform worse than a simple sequential loop for most practical input sizes. Adding a guard like min_parallel_row is not an optimization hack. It is the honest acknowledgment that parallelism has a fixed cost and the work must justify it.
If this prompted a question about the GIL or you want to push back on the benchmark methodology, I’d love to hear it. Reach out on Twitter or LinkedIn.
Tags:
Related Posts
Thread-Safe Two-Sum: Concurrency Bugs Hide in Plain Sight
A simple shared-list problem reveals how race conditions sneak past you, and why a read-write lock beats a plain mutex when reads dominate writes.
Print Sequentially with Multiple Threads: A Microsoft Interview Problem
Three threads, one ordered sequence. This classic Microsoft SDE-2 problem looks like a mutual exclusion problem but it's actually an ordering problem, and that distinction changes everything.
Design a Vending Machine: State Machines in Practice
A first-principles LLD walkthrough of a vending machine using the State pattern. Covers state transitions, inventory management, payment handling, and why State beats if/else chains when behavior must vary by state.