
Worker executors

Every node in a Skyward pool runs a worker process that receives tasks and executes them. The Worker dataclass controls how that execution happens — specifically, whether tasks run as separate OS processes or as threads inside the worker. This choice determines whether CPU-bound Python code can use all available cores or is limited by the GIL.

A CPU-bound task

Consider a tight numerical loop — pure Python, no C extensions, no I/O:

from time import monotonic

def cpu_burn(task_id: int) -> dict:
    """CPU-intensive task: tight numerical loop for ~10 seconds."""
    start = monotonic()
    total = 0.0
    i = 0
    while monotonic() - start < 10:
        total += (i ** 0.5) * ((i + 1) ** 0.5)
        i += 1
    elapsed = monotonic() - start
    return {"task_id": task_id, "iterations": i, "elapsed": round(elapsed, 1)}

This is the worst case for the GIL: the Python interpreter never releases the lock, so only one thread can make progress at a time. On a 2-vCPU machine with concurrency=2, a thread-based executor will show ~50% CPU utilization — one core active, one idle.
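The effect is easy to reproduce locally with nothing but the standard library. This is a sketch, not Skyward code: it runs the same kind of GIL-bound loop under a thread pool and a process pool via `concurrent.futures` (the iteration count and worker count are arbitrary choices for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n: int) -> float:
    """Pure-Python loop that never releases the GIL."""
    total = 0.0
    for i in range(n):
        total += (i ** 0.5) * ((i + 1) ** 0.5)
    return total

def timed(executor_cls, n: int = 2_000_000, workers: int = 2) -> float:
    """Run `workers` copies of burn() under the given executor; return wall time."""
    start = time.monotonic()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(burn, [n] * workers))
    return time.monotonic() - start

if __name__ == "__main__":
    # Threads serialize on the GIL; processes each get their own interpreter,
    # so the process-pool run is typically much faster on a multi-core machine.
    print(f"threads:   {timed(ThreadPoolExecutor):.2f}s")
    print(f"processes: {timed(ProcessPoolExecutor):.2f}s")
```

On a 2-core machine the thread-pool run takes roughly twice as long as the process-pool run, which is the same asymmetry the Skyward executors exhibit at cluster scale.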

Thread executor (default)

The thread executor runs tasks as threads inside the worker process. All threads share the same memory space and the same GIL:

# Supports streaming, low overhead, ideal for I/O-bound and GIL-releasing workloads.
with sky.Compute(
    provider=sky.AWS(),
    worker=sky.Worker(concurrency=2),  # executor="thread" is the default
    nodes=3,
) as compute:
    total = 6  # one task per slot: nodes * concurrency
    results = sky.gather(*(cpu_burn(i) for i in range(total)), stream=True)
    for r in (results >> compute):
        print(f"[thread] Task {r['task_id']}: {r['iterations']:,} iters in {r['elapsed']}s")


Threads are lightweight, support streaming (generator functions and iterator parameters), and work seamlessly with distributed collections. For most workloads — I/O-bound tasks, C-extension-heavy code (NumPy, PyTorch), and mixed workloads — the thread executor is the right choice. Because executor="thread" is the default, Worker(concurrency=2) is equivalent to Worker(concurrency=2, executor="thread").
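Why threads suit I/O-bound work is worth seeing concretely. In this standard-library sketch (not Skyward code; the 0.2 s sleep stands in for an API call or database query), the blocking call releases the GIL, so the waits overlap instead of serializing:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i: int) -> int:
    """Stand-in for an I/O-bound task: sleeping releases the GIL."""
    time.sleep(0.2)
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, range(8)))
elapsed = time.monotonic() - start

# Eight 0.2s waits overlap across eight threads: wall time is close to
# 0.2s rather than the 1.6s a sequential run would take.
print(f"{len(results)} tasks in {elapsed:.2f}s")
```

The same reasoning applies to C extensions such as NumPy and PyTorch, which release the GIL during heavy native computation.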

Process executor

The process executor runs each task in a separate OS process via ProcessPoolExecutor. Each process has its own Python interpreter and its own GIL, so CPU-bound work saturates all available cores:

with sky.Compute(
    provider=sky.AWS(),
    worker=sky.Worker(concurrency=2, executor="process"),
    nodes=3,
) as compute:
    total = 6  # one task per slot: nodes * concurrency
    results = sky.gather(*(cpu_burn(i) for i in range(total)), stream=True)
    for r in (results >> compute):
        print(f"[process] Task {r['task_id']}: {r['iterations']:,} iters in {r['elapsed']}s")

On a 2-vCPU instance with concurrency=2, this achieves ~100% CPU utilization — both cores fully active. Use this explicitly for pure-Python CPU-bound workloads where bypassing the GIL matters.

Trade-offs

| | Process (`"process"`) | Thread (`"thread"`) |
|---|---|---|
| GIL | Bypassed — each process has its own interpreter | Shared — one thread runs at a time |
| CPU-bound | Full core utilization | Limited to ~1 core |
| I/O-bound | Works, but heavier per-task overhead | Lightweight, ideal for I/O waits |
| Distributed collections | Available (via IPC bridge) | Available (shared memory with worker) |
| Task isolation | Full — crash in one task doesn't affect others | Shared — exceptions propagate normally |
| Serialization | Task args and results cross process boundary (pickle) | No extra serialization |
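The serialization row has a practical consequence: with a process-based executor, task arguments and results must be picklable. A quick illustration using the standard pickle module (not Skyward-specific):

```python
import pickle

# Plain data structures cross the process boundary without trouble.
payload = {"task_id": 1, "values": [1.0, 2.0]}
roundtrip = pickle.loads(pickle.dumps(payload))
assert roundtrip == payload

# Lambdas (like closures, open files, and sockets) are not picklable,
# so they cannot be passed as task arguments to a process executor.
try:
    pickle.dumps(lambda x: x * 2)
    picklable = True
except Exception:
    picklable = False
print(f"lambda picklable: {picklable}")
```

Thread-based execution has no such constraint, since tasks share the worker's memory directly.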

When to use each

Use thread (default) for:

  • I/O-bound tasks (API calls, database queries, file processing)
  • C-extension-heavy workloads that release the GIL (NumPy, PyTorch inference)
  • Streaming tasks (generator functions, iterator parameters)
  • Most ML training workloads (PyTorch, JAX, Keras release the GIL)

Use process for:

  • Pure-Python CPU-bound computation (tight loops, data transformation)
  • Any workload where Python code dominates CPU time
  • Tasks that benefit from crash isolation

Run the full example

git clone https://github.com/gabfssilva/skyward.git
cd skyward
uv run python guides/14_worker_executors.py

What you learned:

  • Worker(executor="thread") (default) runs tasks as threads — lightweight, supports streaming, shares memory, but GIL-limited for pure-Python CPU-bound code.
  • Worker(executor="process") runs tasks in separate OS processes — bypasses the GIL, full CPU utilization for compute-heavy work.
  • concurrency controls task slots per node — total parallelism = nodes * concurrency.
  • Distributed collections work with both executors — the process executor uses an IPC bridge to proxy operations to the parent worker.
  • Choose based on workload: process for CPU-bound, thread for I/O-bound.