Configuration

TOML configuration files

Skyward loads configuration from two TOML files, merged with project settings taking precedence:

  1. Global defaults: ~/.skyward/defaults.toml
  2. Project config: skyward.toml (in the current working directory)

File format

[providers.my-aws]
type = "aws"
region = "us-west-2"

[providers.my-vastai]
type = "vastai"
min_reliability = 0.95
geolocation = "US"

[pools.training]
provider = "my-aws"
nodes = 4
accelerator = "A100"

[pools.training.image]
python = "3.13"
pip = ["torch", "transformers"]
apt = ["ffmpeg"]

[[pools.training.volumes]]
bucket = "my-bucket"
mount = "/data"

The [providers] section defines named provider configurations. Each must have a type field matching a supported provider (aws, gcp, hyperstack, tensordock, vastai, runpod, verda). All other fields are passed to the provider's config class.

The [pools] section defines named pools that reference a provider by name. Pools support nodes, accelerator (as a string name), image (as a sub-table), and volumes (as an array of tables).

Using named pools

import skyward as sky
# train is assumed to be a compute function defined elsewhere in the project
with sky.Compute.Named("training") as compute:
    result = train() >> compute

API reference

skyward.PoolSpec dataclass

Resolved pool specification — the internal, fully-normalized form.

Created from user-facing Spec and Options objects during pool startup. Carries every parameter needed to provision a cluster, including hardware requirements, networking, autoscaling bounds, and plugin configuration.

Parameters:

  • nodes (Nodes, required) -- Node count specification (fixed or autoscaling).
  • accelerator (Accelerator | None, required) -- GPU/accelerator type, or None for CPU-only.
  • region (str, required) -- Cloud region for instance placement.
  • vcpus (float | None, default None) -- Minimum vCPUs per node.
  • memory_gb (float | None, default None) -- Minimum RAM in GB per node.
  • architecture (Architecture | None, default None) -- CPU architecture filter. None accepts any.
  • allocation (AllocationStrategy, default 'spot-if-available') -- Instance lifecycle strategy (spot, on-demand, etc.).
  • image (Image, default Image()) -- Declarative environment specification.
  • ttl (int, default 600) -- Auto-shutdown timeout in seconds after pool exits. 0 disables.
  • worker (Worker, default Worker()) -- Worker configuration (concurrency, executor backend).
  • provider (ProviderName | None, default None) -- Override provider name (usually inferred from the Spec).
  • max_hourly_cost (float | None, default None) -- Cost cap per node per hour in USD. None means no cap.
  • provision_timeout (float, default 300.0) -- Maximum seconds to wait for cloud instance provisioning (polling until the instance is running with an IP).
  • ssh_timeout (float, default 300.0) -- Maximum seconds to wait for an SSH connection to a node.
  • bootstrap_timeout (float, default 300.0) -- Maximum seconds for bootstrap script, post-bootstrap steps, and worker startup to complete.
  • ssh_retry_interval (float, default 5.0) -- Seconds between SSH connection retry attempts.
  • provision_retry_delay (float, default 10.0) -- Seconds between provision retry attempts after a failure.
  • max_provision_attempts (int, default 10) -- Maximum number of provision attempts before giving up.
  • volumes (tuple[Volume, ...], default ()) -- S3/GCS volumes to mount on workers.
  • autoscale_cooldown (float, default 30.0) -- Seconds between autoscaling decisions.
  • autoscale_idle_timeout (float, default 60.0) -- Seconds of idle time before the autoscaler considers scaling down.
  • reconcile_tick_interval (float, default 15.0) -- Seconds between reconciler ticks (provision/drain evaluation).
  • plugins (tuple[Plugin, ...], default ()) -- Composable plugins applied to this pool.
  • cluster (bool, default True) -- Whether workers form a Casty cluster. True enables distributed collections and cluster-aware coordination. False runs each worker independently.
  • retry_on_interruption (int, default 3) -- Maximum retries per task on infrastructure interruption. 0 disables.

Examples:

>>> spec = PoolSpec(
...     nodes=Nodes(min=4),
...     accelerator=H100(),
...     region="us-east-1",
...     allocation="spot-if-available",
...     image=Image(pip=["torch"]),
... )
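To make the interaction between max_provision_attempts and provision_retry_delay concrete, here is a rough sketch of a bounded retry loop. It is not Skyward's provisioning code: provision_with_retries, flaky_provision, and the use of RuntimeError as the failure type are all hypothetical, and the sleep function is injectable only so the example runs instantly.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def provision_with_retries(
    provision: Callable[[], T],
    max_provision_attempts: int = 10,
    provision_retry_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `provision` until it succeeds or the attempt budget is spent."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_provision_attempts + 1):
        try:
            return provision()
        except RuntimeError as exc:  # hypothetical failure type
            last_error = exc
            if attempt < max_provision_attempts:
                sleep(provision_retry_delay)  # wait before the next attempt
    raise RuntimeError(
        f"provisioning failed after {max_provision_attempts} attempts"
    ) from last_error


attempts = {"count": 0}


def flaky_provision() -> str:
    """Fail twice, then succeed, to exercise the retry path."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("no capacity")
    return "instance-ready"


# Record the delays instead of actually sleeping.
recorded_delays: list[float] = []
result = provision_with_retries(flaky_provision, sleep=recorded_delays.append)
```

With the defaults shown, a pool that never provisions successfully would wait roughly 9 × 10 seconds between attempts in this sketch, in addition to whatever time each attempt itself takes.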

nodes instance-attribute

accelerator instance-attribute

region instance-attribute

vcpus = None class-attribute instance-attribute

memory_gb = None class-attribute instance-attribute

disk_gb = None class-attribute instance-attribute

architecture = None class-attribute instance-attribute

allocation = 'spot-if-available' class-attribute instance-attribute

image = field(default_factory=(lambda: Image())) class-attribute instance-attribute

ttl = 600 class-attribute instance-attribute

worker = field(default_factory=Worker) class-attribute instance-attribute

provider = None class-attribute instance-attribute

max_hourly_cost = None class-attribute instance-attribute

provision_timeout = 300.0 class-attribute instance-attribute

ssh_timeout = 300.0 class-attribute instance-attribute

bootstrap_timeout = 300.0 class-attribute instance-attribute

ssh_retry_interval = 5.0 class-attribute instance-attribute

provision_retry_delay = 10.0 class-attribute instance-attribute

max_provision_attempts = 10 class-attribute instance-attribute

volumes = () class-attribute instance-attribute

autoscale_cooldown = 30.0 class-attribute instance-attribute

autoscale_idle_timeout = 60.0 class-attribute instance-attribute

reconcile_tick_interval = 15.0 class-attribute instance-attribute

plugins = () class-attribute instance-attribute

cluster = True class-attribute instance-attribute

retry_on_interruption = 3 class-attribute instance-attribute

accelerator_name property

Get the canonical accelerator name for provider matching.

accelerator_count property

Get the number of accelerators per node.

accelerator_memory_gb property

Get the requested VRAM per accelerator in GB, or 0 if unspecified.

__init__(nodes, accelerator, region, vcpus=None, memory_gb=None, disk_gb=None, architecture=None, allocation='spot-if-available', image=(lambda: Image())(), ttl=600, worker=Worker(), provider=None, max_hourly_cost=None, provision_timeout=300.0, ssh_timeout=300.0, bootstrap_timeout=300.0, ssh_retry_interval=5.0, provision_retry_delay=10.0, max_provision_attempts=10, volumes=(), autoscale_cooldown=30.0, autoscale_idle_timeout=60.0, reconcile_tick_interval=15.0, plugins=(), cluster=True, retry_on_interruption=3)

skyward.Image dataclass

Declarative image specification.

Defines the environment (Python version, packages, etc.) in a declarative way. The generate_bootstrap() method generates idempotent shell scripts that work across all cloud providers.

Parameters:

  • python (str | Literal['auto'], default 'auto') -- Python version to use. "auto" detects the current interpreter.
  • pip (list[str] | tuple[str, ...], default ()) -- Packages to install via uv add.
  • pip_indexes (list[PipIndex] | tuple[PipIndex, ...], default ()) -- Scoped package indexes. Each PipIndex maps specific packages to a custom index URL via uv's explicit index support.
  • apt (list[str] | tuple[str, ...], default ()) -- System packages to install via apt-get.
  • env (dict[str, str], default dict()) -- Environment variables to export on remote workers.
  • shell_vars (dict[str, str], default dict()) -- Shell commands for dynamic variable capture (evaluated at bootstrap).
  • includes (list[str] | tuple[str, ...], default ()) -- Paths relative to CWD to sync to workers (dirs or .py files).
  • excludes (list[str] | tuple[str, ...], default ()) -- Glob patterns to ignore within includes (e.g., "__pycache__").
  • skyward_source (SkywardSource, default 'auto') -- Where to install skyward from. "auto" detects editable installs as "local", otherwise "pypi".
  • metrics (MetricsConfig, default _default_metrics()) -- Metrics to collect (CPU, GPU, Memory, etc.). None disables.
  • bootstrap_timeout (int, default 300) -- Maximum seconds for the bootstrap script to complete.

Examples:

>>> image = Image(
...     python="3.13",
...     pip=["torch", "transformers"],
...     apt=["git", "ffmpeg"],
...     env={"HF_TOKEN": "xxx"},
... )
>>> # Disable metrics
>>> image = Image(metrics=None)

python = 'auto' class-attribute instance-attribute

pip = () class-attribute instance-attribute

pip_indexes = () class-attribute instance-attribute

apt = () class-attribute instance-attribute

env = field(default_factory=dict) class-attribute instance-attribute

shell_vars = field(default_factory=dict) class-attribute instance-attribute

includes = () class-attribute instance-attribute

excludes = () class-attribute instance-attribute

skyward_source = 'auto' class-attribute instance-attribute

metrics = field(default_factory=(lambda: _default_metrics())) class-attribute instance-attribute

bootstrap_timeout = 300 class-attribute instance-attribute

__post_init__()

Convert lists to tuples for immutability.

env_hash()

Hash of the remote environment specification.

Covers packages, env vars, Python version, etc. — but NOT the local skyward version. Stable across local code edits, suitable for daemon pool fingerprinting.

Returns:

  • str -- A 12-character hex digest (SHA-256 prefix).

content_hash()

Hash including the local skyward version.

Used by WarmableProvider implementations to detect whether an existing AMI/snapshot can be reused or a fresh bootstrap is needed. Includes skyward_version because the wheel is baked into the snapshot.

Returns:

  • str -- A 12-character hex digest (SHA-256 prefix).
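One way to picture the difference between the two hashes is a digest over a canonical serialization of the specification, with and without the local version. This is only a sketch: spec_digest, the JSON canonicalization, the field selection, and the placeholder version string are assumptions, not the library's actual hashing scheme.

```python
import hashlib
import json


def spec_digest(spec: dict) -> str:
    """Return the first 12 hex characters of a SHA-256 over canonical JSON."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


remote_env = {"python": "3.13", "pip": ["torch", "transformers"], "apt": ["ffmpeg"]}

# Analogue of env_hash(): remote environment only.
env_digest = spec_digest(remote_env)

# Analogue of content_hash(): the same fields plus the local version
# ("0.0.0" is a placeholder, not a real release number).
content_digest = spec_digest({**remote_env, "skyward_version": "0.0.0"})
```

In this sketch, changing the version changes only the second digest, which mirrors why a snapshot with a baked-in wheel needs the broader hash while a pool fingerprint does not.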

__init__(python='auto', pip=(), pip_indexes=(), apt=(), env=dict(), shell_vars=dict(), includes=(), excludes=(), skyward_source='auto', metrics=(lambda: _default_metrics())(), bootstrap_timeout=300)

skyward.DEFAULT_IMAGE = Image() module-attribute

skyward.AllocationStrategy = Literal['spot', 'on-demand', 'spot-if-available', 'cheapest']

Instance lifecycle strategy.

  • "spot" -- spot/preemptible only (cheapest, may be interrupted).
  • "on-demand" -- on-demand only (reliable, higher cost).
  • "spot-if-available" -- try spot first, fall back to on-demand.
  • "cheapest" -- compare spot and on-demand, pick lowest price.
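The four strategies can be read as a small decision function over the prices a provider offers. The sketch below is illustrative only: choose_lifecycle is a hypothetical helper, modelling unavailable spot capacity as spot_price=None is an assumption, and ties under "cheapest" are resolved in favor of on-demand purely as a design choice of this example.

```python
from typing import Literal, Optional

AllocationStrategy = Literal["spot", "on-demand", "spot-if-available", "cheapest"]


def choose_lifecycle(
    strategy: AllocationStrategy,
    spot_price: Optional[float],
    on_demand_price: float,
) -> str:
    """Return "spot" or "on-demand"; spot_price=None means no spot offer."""
    if strategy == "on-demand":
        return "on-demand"
    if strategy == "spot":
        if spot_price is None:
            raise LookupError("no spot capacity and fallback is not allowed")
        return "spot"
    if strategy == "spot-if-available":
        # Prefer spot whenever an offer exists, regardless of price.
        return "spot" if spot_price is not None else "on-demand"
    if strategy == "cheapest":
        # Compare prices; fall back to on-demand when spot is missing or not cheaper.
        if spot_price is not None and spot_price < on_demand_price:
            return "spot"
        return "on-demand"
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For example, choose_lifecycle("spot-if-available", None, 3.0) falls back to on-demand, while the same call with strategy "spot" raises because no fallback is permitted.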

skyward.api.spec.PoolState

Pool lifecycle states reported by the pool actor.

Progression: INIT → REQUESTING → PROVISIONING → READY → SHUTTING_DOWN → DESTROYED.

INIT = 'init' class-attribute instance-attribute

Pool object created, actor system not yet started.

REQUESTING = 'requesting' class-attribute instance-attribute

Querying provider for available offers.

PROVISIONING = 'provisioning' class-attribute instance-attribute

Cloud instances being created and bootstrapped.

READY = 'ready' class-attribute instance-attribute

All nodes healthy, pool accepting tasks.

SHUTTING_DOWN = 'shutting_down' class-attribute instance-attribute

Graceful teardown in progress.

DESTROYED = 'destroyed' class-attribute instance-attribute

All instances terminated, resources released.
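The lifecycle can be modelled as an ordered enumeration. The sketch below reuses the state values listed above, but PoolStateSketch and its helpers are hypothetical: whether the real PoolState is a str-based Enum, and whether states may be skipped (for instance on a failed provision), is not stated here, so the strict one-step progression is an assumption.

```python
from enum import Enum


class PoolStateSketch(str, Enum):
    """Hypothetical mirror of the documented states, in lifecycle order."""

    INIT = "init"
    REQUESTING = "requesting"
    PROVISIONING = "provisioning"
    READY = "ready"
    SHUTTING_DOWN = "shutting_down"
    DESTROYED = "destroyed"


# Enum iteration follows definition order, which matches the progression.
_ORDER = list(PoolStateSketch)


def is_next_state(current: PoolStateSketch, candidate: PoolStateSketch) -> bool:
    """True if `candidate` is the state immediately after `current`."""
    return _ORDER.index(candidate) == _ORDER.index(current) + 1


def is_terminal(state: PoolStateSketch) -> bool:
    """DESTROYED is the only state with no successor."""
    return state is PoolStateSketch.DESTROYED
```

A monitoring loop could, for example, poll until the reported value equals PoolStateSketch.READY before submitting tasks, and stop polling once is_terminal returns True.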