S3 volumes¶
This guide walks through a complete volume workflow: upload a dataset from your local machine, train a model on a remote cluster, and download the result — all through S3-compatible object storage. Your local machine talks to S3 via Storage, and remote workers see the same data as mounted directories via s3fs-fuse.
Storage and volumes¶
A Storage object represents an S3-compatible endpoint. Presets like sky.storage.Hyperstack() auto-provision ephemeral credentials — no manual key management needed.
A Volume maps an S3 bucket (or a prefix within it) to a local path on every worker. You declare two: one read-only for input data, one writable for output artifacts.
# Auto-provisioned credentials
storage = sky.storage.Hyperstack()

data_volume = sky.Volume(
    bucket=DATA_BUCKET,
    mount="/data",
    prefix="iris/",
    read_only=True,
)
model_volume = sky.Volume(
    bucket=MODEL_BUCKET,
    mount="/output",
    prefix="experiment-001/",
    read_only=False,
)
The prefix scopes each volume to a subfolder within its bucket. read_only=True on the data volume prevents accidental writes to the dataset. The model volume is writable so the training function can persist its output to S3.
The training function¶
The remote function reads from /data and writes to /output — regular filesystem paths. It doesn't know about S3, buckets, or object stores. Libraries that expect file paths work without modification.
@sky.function
def train(data_path: str, output_dir: str) -> dict:
    """Load dataset from volume, train a model, save to output volume."""
    import pickle
    from pathlib import Path

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    raw = np.loadtxt(data_path, delimiter=",", skiprows=1)
    features, labels = raw[:, :-1], raw[:, -1].astype(int)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(features, labels)
    acc = accuracy_score(labels, model.predict(features))

    out = Path(output_dir) / "model.pkl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "wb") as f:
        pickle.dump(model, f)

    return {
        "samples": len(features),
        "accuracy": round(acc, 4),
        "model_bytes": out.stat().st_size,
    }
Imports happen inside the function body because they only need to exist on the remote worker.
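This deferral is plain Python scoping, not Skyward magic: a module imported inside a function body is resolved when the function runs, so merely defining the function never touches the dependency. A minimal stdlib sketch — the module name is deliberately fake:

```python
def remote_only_task():
    # Resolved only when the body runs (i.e. on the worker),
    # not when the function is defined on the local machine.
    import definitely_not_installed_dependency  # fake name, for illustration
    return definitely_not_installed_dependency.run()


# Defining the function above succeeded even though the module doesn't
# exist locally; only *calling* it triggers the import.
try:
    remote_only_task()
except ModuleNotFoundError as e:
    print(f"import deferred until call time: {e.name}")
```

This is why your local machine can dispatch `train` without having scikit-learn installed, as long as the remote image provides it.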
Uploading data with Storage¶
Before the cluster starts, you upload your dataset from the local machine directly to S3. Storage is a context manager that opens an S3 connection. upload puts a local file into the bucket at the given key. ls lists objects in the bucket. You can also use download, exists, and rm.
with storage:
    storage.upload(DATA_BUCKET, csv_path, key="iris.csv")
    print(f"Uploaded: {storage.ls(DATA_BUCKET)}")
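To get a feel for the method shapes without an S3 endpoint, here is a filesystem-backed stand-in mirroring the five operations the guide lists (upload, ls, download, exists, rm). This `LocalStorage` class is an illustration only — the real Storage talks to an S3-compatible API, not a local directory:

```python
import shutil
from pathlib import Path


class LocalStorage:
    """Filesystem-backed stand-in for the Storage methods used in this guide."""

    def __init__(self, root: str):
        self.root = Path(root)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def _path(self, bucket: str, key: str) -> Path:
        return self.root / bucket / key

    def upload(self, bucket: str, local_path, key: str) -> None:
        dest = self._path(bucket, key)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, dest)

    def ls(self, bucket: str) -> list[str]:
        base = self.root / bucket
        return sorted(str(p.relative_to(base)) for p in base.rglob("*") if p.is_file())

    def exists(self, bucket: str, key: str) -> bool:
        return self._path(bucket, key).is_file()

    def download(self, bucket: str, key: str, local_path) -> None:
        shutil.copyfile(self._path(bucket, key), local_path)

    def rm(self, bucket: str, key: str) -> None:
        self._path(bucket, key).unlink()


# Same round-trip shape as the real thing, against a temp directory:
import tempfile

tmp = tempfile.mkdtemp()
src = Path(tmp) / "iris.csv"
src.write_text("sepal_length,label\n5.1,0\n")

with LocalStorage(tmp + "/buckets") as local:
    local.upload("data", src, key="iris.csv")
    print(local.ls("data"))
    print(local.exists("data", "iris.csv"))
```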
Training on the cluster¶
With data in S3, you provision a pool with both volumes mounted and dispatch the training function.
with sky.Compute(
    provider=sky.Hyperstack(),
    accelerator=sky.accelerators.L4(),
    image=sky.Image(pip=["scikit-learn", "numpy"]),
    volumes=[data_volume, model_volume],
) as compute:
    result = train("/data/iris.csv", "/output") >> compute
    print(f"{result['samples']} samples, acc={result['accuracy']}, model={result['model_bytes']} bytes")
The pool mounts both volumes on every worker during bootstrap. When the with block exits, the instances are destroyed — but the model checkpoint is already in S3.
Downloading results¶
After the pool is torn down, the model persists in the output bucket. A second Storage session downloads it back to your local machine.
model_path = Path("/tmp/trained_model.pkl")
with storage:
    storage.download(MODEL_BUCKET, "experiment-001/model.pkl", model_path)

import pickle

with open(model_path, "rb") as f:
    model = pickle.load(f)  # noqa: S301

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()  # reload the original dataset for verification
preds = model.predict(iris.data)
acc = accuracy_score(iris.target, preds)
print(f"Loaded model: {type(model).__name__}, accuracy={acc:.4f}")
The downloaded model is a standard pickle file. You can load it locally and verify it works against the original data — no cluster required.
The full picture¶
The three phases — upload, train, download — decouple your local environment from the remote cluster. Your local machine never needs the GPU libraries, and the remote workers never need your local filesystem. S3 is the bridge between both.
sequenceDiagram
    participant L as Local Machine
    participant S as S3 Bucket
    participant W as Remote Worker
    L->>S: Storage.upload(dataset)
    Note over W: Pool provisions, mounts volumes
    W->>S: read /data/iris.csv (s3fs-fuse)
    W->>S: write /output/model.pkl (s3fs-fuse)
    Note over W: Pool tears down
    L->>S: Storage.download(model.pkl)
Why not rely solely on @sky.function input and output?¶
You could pass a NumPy array as an argument to a function decorated with @sky.function and return the trained model directly. For small payloads that works — cloudpickle serializes the arguments, lz4-compresses them, and ships them over SSH. But the approach breaks down as data grows.
The problem is both size and contention. Skyward communicates with remote workers through Casty actors — when you send a large payload as a function argument, the worker actor is busy deserializing and processing that message for the entire transfer. No other task can be dispatched to that worker until the operation completes. A 2 GB dataset as input and a 500 MB model as output means the worker is effectively unavailable for the duration of both transfers. Multiply that by several nodes and the coordination overhead dominates actual compute time.
Volumes sidestep this entirely. Instead of pushing data through actor messages, you place it in S3 and let workers read it locally via FUSE. The function receives a path, not the data itself — a string costs bytes, not gigabytes. The actor channel stays free for what it's designed to carry: lightweight task coordination. Outputs written to a writable volume persist in S3 immediately, surviving instance preemption and pool teardown.
The rule of thumb: if it fits comfortably in a return value (a metric, a small dict, a summary), pass it through @sky.function. If it's a dataset of a few hundred megabytes, a model checkpoint, or anything you'd rather not push through actor messages — use a volume.
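The asymmetry is easy to measure. The sketch below uses stdlib pickle and zlib as stand-ins for the cloudpickle-plus-lz4 pipeline (the real wire format differs, and `wire_size` is a made-up helper, not Skyward API) — but the shape of the result holds: the cost of passing data as an argument grows linearly with the data, while a volume path costs a few dozen bytes no matter how large the dataset behind it is.

```python
import pickle
import zlib
from array import array


def wire_size(obj) -> int:
    """Approximate bytes a serialized, compressed argument would occupy.

    Stand-in for cloudpickle + lz4, using stdlib pickle + zlib.
    """
    return len(zlib.compress(pickle.dumps(obj)))


# Passing the dataset itself: payload scales with the data (~8 MB of float64 here).
dataset = array("d", range(1_000_000))

# Passing a volume path: a short string, regardless of dataset size.
path = "/data/iris.csv"

print(f"dataset as argument: {wire_size(dataset):,} bytes")
print(f"path as argument:    {wire_size(path):,} bytes")
```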
Run the full example¶
git clone https://github.com/gabfssilva/skyward.git
cd skyward
uv run python guides/17_s3_volumes.py
What you learned:
- sky.Volume maps an S3 bucket to a local path on every worker — read or read-write.
- sky.Storage manages data outside the cluster — upload before training, download after.
- s3fs-fuse handles the mounting transparently — no SDK, no download step, just file paths.
- Three-phase workflow — upload, train, download — decouples local and remote environments through S3.