Parallel Pascal's Triangle: When Parallelism Helps and When It Doesn't

When I first thought about parallelizing Pascal’s Triangle, I assumed there was nothing to it. Each row is computed from the previous row, so you compute row by row in sequence. Not much to parallelize there. But then I looked closer at what’s actually happening inside a row, and the picture changed.

Each element in a row depends on exactly two elements from the row above it. That means every element in the same row is independent of every other element in that row. The work inside a single row is embarrassingly parallel. The question then becomes: does exploiting that parallelism actually make the computation faster? The answer, it turns out, depends on a threshold that most engineers skip right past.

The Problem

Generate the first N rows of Pascal’s Triangle. Each element triangle[i][j] equals triangle[i-1][j-1] + triangle[i-1][j], with 1s at the edges. The sequential version is short:

def pascal_sequential(n: int) -> list[list[int]]:
    triangle: list[list[int]] = []
    for i in range(n):
        row = [1] * (i + 1)
        for j in range(1, i):
            row[j] = triangle[i - 1][j - 1] + triangle[i - 1][j]
        triangle.append(row)
    return triangle

Clean and correct. Row i completes before row i+1 starts, which is required because of the dependency. Within each row, though, each row[j] could be computed by a separate thread since they all read from the previous row (which is already complete and immutable) and write to distinct positions in the new row.

Why the Naive Parallel Version Often Loses

Before writing any parallel code, it’s worth reasoning through the cost model. Spawning a thread (or submitting a task to a thread pool) has overhead: the OS needs to schedule the thread, the thread pool needs to manage the queue, and Python’s GIL needs to be acquired and released. For CPython, this overhead is roughly in the range of microseconds per task.

Now consider what the actual work is. Computing triangle[i][j] is two array lookups and one addition. That’s nanoseconds of CPU work. If the overhead of dispatching the task is 10x the cost of doing the task, parallelism makes things strictly worse.

This is the crossover problem. Below some row size threshold, sequential is faster. Above it, parallel wins. Knowing where that threshold sits, and why, is what separates a thoughtful parallel implementation from one that just happens to work.

The Parallel Approach

The correct strategy is to parallelize at the row level but only when the row is large enough to justify the overhead. ThreadPoolExecutor from concurrent.futures gives us a clean way to submit element computations as independent tasks and collect results:

from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor, Future
from typing import Optional


def compute_element(
    prev_row: list[int], j: int, row_size: int
) -> int:
    """Compute a single interior element from the previous row."""
    if j == 0 or j == row_size - 1:
        return 1
    return prev_row[j - 1] + prev_row[j]


def pascal_parallel(n: int, min_parallel_row: int = 50) -> list[list[int]]:
    """
    Generate n rows of Pascal's Triangle.
    Rows smaller than min_parallel_row are computed sequentially
    to avoid thread overhead dominating the actual work.
    """
    triangle: list[list[int]] = []

    with ThreadPoolExecutor() as executor:
        for i in range(n):
            row_size = i + 1
            prev_row = triangle[i - 1] if i > 0 else []

            if row_size < min_parallel_row or i == 0:
                # Sequential path for small rows
                row = [1] * row_size
                for j in range(1, i):
                    row[j] = prev_row[j - 1] + prev_row[j]
            else:
                # Parallel path: each element is an independent task
                futures: list[Future[int]] = [
                    executor.submit(compute_element, prev_row, j, row_size)
                    for j in range(row_size)
                ]
                row = [f.result() for f in futures]

            triangle.append(row)

    return triangle

A few design decisions here. First, the min_parallel_row threshold guards the entry to the parallel path. Small rows (say, fewer than 50 elements) cost more to dispatch than to compute, so we skip the overhead entirely. Second, ThreadPoolExecutor is used rather than raw threading.Thread objects because the pool amortizes thread creation across many rows. Third, the previous row is passed as an immutable argument to each task, so there is no shared mutable state and no need for any locks.

Measuring the Crossover

Let’s put some numbers on this. Here’s a quick benchmark comparing sequential and parallel for different values of N:

def benchmark(n: int, runs: int = 3) -> None:
    times_seq: list[float] = []
    times_par: list[float] = []

    for _ in range(runs):
        start = time.perf_counter()
        pascal_sequential(n)
        times_seq.append(time.perf_counter() - start)

        start = time.perf_counter()
        pascal_parallel(n)
        times_par.append(time.perf_counter() - start)

    avg_seq = sum(times_seq) / runs
    avg_par = sum(times_par) / runs
    print(f"n={n:>6}  sequential={avg_seq:.4f}s  parallel={avg_par:.4f}s  "
          f"speedup={avg_seq / avg_par:.2f}x")


if __name__ == "__main__":
    for size in [10, 100, 500, 1000, 2000]:
        benchmark(size)

Running this (in CPython 3.11 on a laptop) typically shows something like:

n=    10  sequential=0.0000s  parallel=0.0012s  speedup=0.03x
n=   100  sequential=0.0001s  parallel=0.0089s  speedup=0.01x
n=   500  sequential=0.0013s  parallel=0.0201s  speedup=0.06x
n=  1000  sequential=0.0049s  parallel=0.0389s  speedup=0.13x
n=  2000  sequential=0.0202s  parallel=0.0712s  speedup=0.28x

The parallel version loses at every size in CPython. This is because of the GIL: Python’s Global Interpreter Lock means only one thread executes Python bytecode at a time. Threading in CPython gives you concurrency (threads can interleave) but not true CPU parallelism for CPU-bound work. The overhead of context switching and GIL contention outweighs any gains from splitting the work.

When Python Threading Actually Helps

The GIL is released during I/O and certain C extension calls. If each element computation involved a network call, a file read, or a heavy NumPy operation that drops the GIL, threading would deliver real parallelism. For pure Python arithmetic, it does not.

For CPU-bound work where you want true parallelism, the right tool is multiprocessing or ProcessPoolExecutor. Processes have their own GIL, so multiple cores run simultaneously. The catch is that inter-process communication has higher overhead than threading, so the crossover point shifts higher still.

The deeper lesson is this: task granularity determines whether parallelism pays. When the work per task is smaller than the coordination overhead, parallelism is a net loss regardless of how many cores you have. The right question before parallelizing anything is: what is the ratio of work to overhead for a single task, and at what input size does that ratio flip in favor of parallel?

Pascal’s Triangle is a great problem for building this intuition precisely because the work per element is so tiny. It forces you to reckon with the overhead directly rather than hiding it behind expensive computation.

Key Insights

The structure of the problem tells you what can be parallelized: dependencies run vertically (row to row) but not horizontally (within a row). So the parallelism lives inside rows, not across them. This is true regardless of your language or runtime.

Whether that parallelism is worth exploiting depends on your runtime’s threading model. In CPython, threading is the wrong tool for CPU-bound parallel work. The correct tools are ProcessPoolExecutor for CPU-bound work or async I/O for I/O-bound work. Threading in Python shines for I/O concurrency, not CPU throughput.

And the threshold matters more than the algorithm. A parallel implementation that ignores task granularity will perform worse than a simple sequential loop for most practical input sizes. Adding a guard like min_parallel_row is not an optimization hack. It is the honest acknowledgment that parallelism has a fixed cost and the work must justify it.

If this prompted a question about the GIL or you want to push back on the benchmark methodology, I’d love to hear it. Reach out on Twitter or LinkedIn.