Configuration¶
TOML configuration files¶
Skyward loads configuration from two TOML files, merged with project settings taking precedence:
- Global defaults: ~/.skyward/defaults.toml
- Project config: skyward.toml (in the current working directory)
File format¶
[providers.my-aws]
type = "aws"
region = "us-west-2"
[providers.my-vastai]
type = "vastai"
min_reliability = 0.95
geolocation = "US"
[pools.training]
provider = "my-aws"
nodes = 4
accelerator = "A100"
[pools.training.image]
python = "3.13"
pip = ["torch", "transformers"]
apt = ["ffmpeg"]
[[pools.training.volumes]]
bucket = "my-bucket"
mount = "/data"
The [providers] section defines named provider configurations. Each must have a type field matching a supported provider (aws, gcp, hyperstack, tensordock, vastai, runpod, verda). All other fields are passed to the provider's config class.
The [pools] section defines named pools that reference a provider by name. Pools support nodes, accelerator (as a string name), image (as a sub-table), and volumes (as an array of tables).
Using named pools¶
API reference¶
skyward.PoolSpec
dataclass
¶
Resolved pool specification — the internal, fully-normalized form.
Created from user-facing Spec and Options objects during pool
startup. Carries every parameter needed to provision a cluster,
including hardware requirements, networking, autoscaling bounds, and
plugin configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| nodes | Nodes | Node count specification (fixed or autoscaling). | required |
| accelerator | Accelerator \| None | GPU/accelerator type, or | required |
| region | str | Cloud region for instance placement. | required |
| vcpus | float \| None | Minimum vCPUs per node. | None |
| memory_gb | float \| None | Minimum RAM in GB per node. | None |
| architecture | Architecture \| None | CPU architecture filter. | None |
| allocation | AllocationStrategy | Instance lifecycle strategy (spot, on-demand, etc.). | 'spot-if-available' |
| image | Image | Declarative environment specification. | (lambda: Image())() |
| ttl | int | Auto-shutdown timeout in seconds after pool exits. | 600 |
| worker | Worker | Worker configuration (concurrency, executor backend). | Worker() |
| provider | ProviderName \| None | Override provider name (usually inferred from the | None |
| max_hourly_cost | float \| None | Cost cap per node per hour in USD. | None |
| provision_timeout | float | Maximum seconds to wait for cloud instance provisioning (polling until the instance is running with an IP). | 300.0 |
| ssh_timeout | float | Maximum seconds to wait for an SSH connection to a node. | 300.0 |
| bootstrap_timeout | float | Maximum seconds for bootstrap script, post-bootstrap steps, and worker startup to complete. | 300.0 |
| ssh_retry_interval | float | Seconds between SSH connection retry attempts. | 5.0 |
| provision_retry_delay | float | Seconds between provision retry attempts after a failure. | 10.0 |
| max_provision_attempts | int | Maximum number of provision attempts before giving up. | 10 |
| volumes | tuple[Volume, ...] | S3/GCS volumes to mount on workers. | () |
| autoscale_cooldown | float | Seconds between autoscaling decisions. | 30.0 |
| autoscale_idle_timeout | float | Seconds of idle before the autoscaler considers scaling down. | 60.0 |
| reconcile_tick_interval | float | Seconds between reconciler ticks (provision/drain evaluation). | 15.0 |
| plugins | tuple[Plugin, ...] | Composable plugins applied to this pool. | () |
| cluster | bool | Whether workers form a Casty cluster. | True |
| retry_on_interruption | int | Maximum retries per task on infrastructure interruption. | 3 |
Examples:
>>> spec = PoolSpec(
...     nodes=Nodes(min=4),
...     accelerator=H100(),
...     region="us-east-1",
...     allocation="spot-if-available",
...     image=Image(pip=["torch"]),
... )
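The interaction between `max_provision_attempts` and `provision_retry_delay` can be pictured as a bounded retry loop. The sketch below is an illustration of that idea only; `provision_with_retry`, the `flaky` callable, and the returned instance ID are hypothetical, and Skyward's actual provisioning logic is not shown in this reference.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def provision_with_retry(
    provision: Callable[[], T],
    max_provision_attempts: int = 10,    # same default as PoolSpec
    provision_retry_delay: float = 10.0,  # same default as PoolSpec
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `provision` until it succeeds or the attempt budget is spent."""
    last_error: Exception | None = None
    for attempt in range(1, max_provision_attempts + 1):
        try:
            return provision()
        except Exception as exc:  # a real implementation would catch narrower errors
            last_error = exc
            if attempt < max_provision_attempts:
                sleep(provision_retry_delay)
    raise RuntimeError(
        f"provisioning failed after {max_provision_attempts} attempts"
    ) from last_error


# Demo: a provisioner that fails twice, then succeeds.
calls: list[int] = []


def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("no capacity")
    return "instance-0"  # hypothetical instance ID


delays: list[float] = []
result = provision_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.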
nodes
instance-attribute
¶
accelerator
instance-attribute
¶
region
instance-attribute
¶
vcpus = None
class-attribute
instance-attribute
¶
memory_gb = None
class-attribute
instance-attribute
¶
disk_gb = None
class-attribute
instance-attribute
¶
architecture = None
class-attribute
instance-attribute
¶
allocation = 'spot-if-available'
class-attribute
instance-attribute
¶
image = field(default_factory=(lambda: Image()))
class-attribute
instance-attribute
¶
ttl = 600
class-attribute
instance-attribute
¶
worker = field(default_factory=Worker)
class-attribute
instance-attribute
¶
provider = None
class-attribute
instance-attribute
¶
max_hourly_cost = None
class-attribute
instance-attribute
¶
provision_timeout = 300.0
class-attribute
instance-attribute
¶
ssh_timeout = 300.0
class-attribute
instance-attribute
¶
bootstrap_timeout = 300.0
class-attribute
instance-attribute
¶
ssh_retry_interval = 5.0
class-attribute
instance-attribute
¶
provision_retry_delay = 10.0
class-attribute
instance-attribute
¶
max_provision_attempts = 10
class-attribute
instance-attribute
¶
volumes = ()
class-attribute
instance-attribute
¶
autoscale_cooldown = 30.0
class-attribute
instance-attribute
¶
autoscale_idle_timeout = 60.0
class-attribute
instance-attribute
¶
reconcile_tick_interval = 15.0
class-attribute
instance-attribute
¶
plugins = ()
class-attribute
instance-attribute
¶
cluster = True
class-attribute
instance-attribute
¶
retry_on_interruption = 3
class-attribute
instance-attribute
¶
accelerator_name
property
¶
Get the canonical accelerator name for provider matching.
accelerator_count
property
¶
Get the number of accelerators per node.
accelerator_memory_gb
property
¶
Get the requested VRAM per accelerator in GB, or 0 if unspecified.
__init__(nodes, accelerator, region, vcpus=None, memory_gb=None, disk_gb=None, architecture=None, allocation='spot-if-available', image=(lambda: Image())(), ttl=600, worker=Worker(), provider=None, max_hourly_cost=None, provision_timeout=300.0, ssh_timeout=300.0, bootstrap_timeout=300.0, ssh_retry_interval=5.0, provision_retry_delay=10.0, max_provision_attempts=10, volumes=(), autoscale_cooldown=30.0, autoscale_idle_timeout=60.0, reconcile_tick_interval=15.0, plugins=(), cluster=True, retry_on_interruption=3)
¶
skyward.Image
dataclass
¶
Declarative image specification.
Defines the environment (Python version, packages, etc.) in a
declarative way. The generate_bootstrap() method generates idempotent
shell scripts that work across all cloud providers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| python | str \| Literal['auto'] | Python version to use. | 'auto' |
| pip | list[str] \| tuple[str, ...] | Packages to install via | () |
| pip_indexes | list[PipIndex] \| tuple[PipIndex, ...] | Scoped package indexes. Each | () |
| apt | list[str] \| tuple[str, ...] | System packages to install via | () |
| env | dict[str, str] | Environment variables to export on remote workers. | dict() |
| shell_vars | dict[str, str] | Shell commands for dynamic variable capture (evaluated at bootstrap). | dict() |
| includes | list[str] \| tuple[str, ...] | Paths relative to CWD to sync to workers (dirs or | () |
| excludes | list[str] \| tuple[str, ...] | Glob patterns to ignore within includes (e.g., | () |
| skyward_source | SkywardSource | Where to install skyward from. | 'auto' |
| metrics | MetricsConfig | Metrics to collect (CPU, GPU, Memory, etc.). | (lambda: _default_metrics())() |
| bootstrap_timeout | int | Maximum seconds for the bootstrap script to complete. Default | 300 |
Examples:
>>> image = Image(
...     python="3.13",
...     pip=["torch", "transformers"],
...     apt=["git", "ffmpeg"],
...     env={"HF_TOKEN": "xxx"},
... )
python = 'auto'
class-attribute
instance-attribute
¶
pip = ()
class-attribute
instance-attribute
¶
pip_indexes = ()
class-attribute
instance-attribute
¶
apt = ()
class-attribute
instance-attribute
¶
env = field(default_factory=dict)
class-attribute
instance-attribute
¶
shell_vars = field(default_factory=dict)
class-attribute
instance-attribute
¶
includes = ()
class-attribute
instance-attribute
¶
excludes = ()
class-attribute
instance-attribute
¶
skyward_source = 'auto'
class-attribute
instance-attribute
¶
metrics = field(default_factory=(lambda: _default_metrics()))
class-attribute
instance-attribute
¶
bootstrap_timeout = 300
class-attribute
instance-attribute
¶
__post_init__()
¶
Convert lists to tuples for immutability.
env_hash()
¶
Hash of the remote environment specification.
Covers packages, env vars, Python version, etc. — but NOT the local skyward version. Stable across local code edits, suitable for daemon pool fingerprinting.
Returns:
| Type | Description |
|---|---|
| str | A 12-character hex digest (SHA-256 prefix). |
content_hash()
¶
Hash including the local skyward version.
Used by WarmableProvider implementations to detect whether
an existing AMI/snapshot can be reused or a fresh bootstrap is
needed. Includes skyward_version because the wheel is
baked into the snapshot.
Returns:
| Type | Description |
|---|---|
| str | A 12-character hex digest (SHA-256 prefix). |
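One way to obtain a stable 12-character SHA-256 prefix is to hash a canonical serialization of the specification. The sketch below illustrates the difference between the two hashes under that assumption; `spec_digest`, the JSON canonicalization, and the placeholder version string are all hypothetical and may not match how Skyward computes its digests.

```python
import hashlib
import json


def spec_digest(spec: dict) -> str:
    """Return the first 12 hex characters of a SHA-256 over canonical JSON."""
    # sort_keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


remote_env = {
    "python": "3.13",
    "pip": ["torch", "transformers"],
    "apt": ["ffmpeg"],
}

# Analogue of env_hash(): the remote environment only.
env_digest = spec_digest(remote_env)

# Analogue of content_hash(): also covers the local version (placeholder value).
content_digest = spec_digest({**remote_env, "skyward_version": "0.0.0"})

print(env_digest, content_digest)
```

In this sketch, changing the version alters only the second digest, which mirrors why `env_hash()` stays stable across local code edits while `content_hash()` does not.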
__init__(python='auto', pip=(), pip_indexes=(), apt=(), env=dict(), shell_vars=dict(), includes=(), excludes=(), skyward_source='auto', metrics=(lambda: _default_metrics())(), bootstrap_timeout=300)
¶
skyward.DEFAULT_IMAGE = Image()
module-attribute
¶
skyward.AllocationStrategy = Literal['spot', 'on-demand', 'spot-if-available', 'cheapest']
¶
Instance lifecycle strategy.
- "spot" -- spot/preemptible only (cheapest, may be interrupted).
- "on-demand" -- on-demand only (reliable, higher cost).
- "spot-if-available" -- try spot first, fall back to on-demand.
- "cheapest" -- compare spot and on-demand, pick lowest price.
skyward.api.spec.PoolState
¶
Pool lifecycle states reported by the pool actor.
Progression: INIT → REQUESTING → PROVISIONING
→ READY → SHUTTING_DOWN → DESTROYED.
INIT = 'init'
class-attribute
instance-attribute
¶
Pool object created, actor system not yet started.
REQUESTING = 'requesting'
class-attribute
instance-attribute
¶
Querying provider for available offers.
PROVISIONING = 'provisioning'
class-attribute
instance-attribute
¶
Cloud instances being created and bootstrapped.
READY = 'ready'
class-attribute
instance-attribute
¶
All nodes healthy, pool accepting tasks.
SHUTTING_DOWN = 'shutting_down'
class-attribute
instance-attribute
¶
Graceful teardown in progress.
DESTROYED = 'destroyed'
class-attribute
instance-attribute
¶
All instances terminated, resources released.
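The documented progression can be modeled as an ordered enumeration. The sketch below mirrors the state names and values listed above; `PoolStateSketch`, `PROGRESSION`, and `next_state` are hypothetical, and the real `PoolState` may permit other transitions (for example, shutting down before reaching READY) that this linear model does not capture.

```python
from enum import Enum
from typing import Optional


class PoolStateSketch(Enum):
    """Stand-in mirroring the documented PoolState members."""

    INIT = "init"
    REQUESTING = "requesting"
    PROVISIONING = "provisioning"
    READY = "ready"
    SHUTTING_DOWN = "shutting_down"
    DESTROYED = "destroyed"


# Enum members iterate in definition order, which matches the progression.
PROGRESSION = list(PoolStateSketch)


def next_state(state: PoolStateSketch) -> Optional[PoolStateSketch]:
    """Return the state that follows `state`, or None after DESTROYED."""
    index = PROGRESSION.index(state)
    return PROGRESSION[index + 1] if index + 1 < len(PROGRESSION) else None


state: Optional[PoolStateSketch] = PoolStateSketch.INIT
visited = []
while state is not None:
    visited.append(state.value)
    state = next_state(state)
print(" -> ".join(visited))
```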