Accelerators¶

Every cloud provider names GPU instances differently. AWS calls an A100 machine a p4d.24xlarge. RunPod uses a gpuTypeId. VastAI filters by GPU model. The accelerator parameter on Compute abstracts this: you describe the hardware you want, and the provider figures out how to get it.

This page serves as both a practical guide for choosing hardware in Skyward and a technical reference for ML accelerators. The comparison tables cover 18 accelerators across NVIDIA, AMD, Google, Intel, and AWS — from the $0.11/hr T4 (VastAI) to the $6/hr+ B200 (Verda).

Choosing an accelerator¶

By workload¶

Workload	Recommended	Why
Development/prototyping	T4	$0.13/hr on VastAI, 16 GB is enough for small models
Inference (small models)	L4	242 TFLOPS FP8 at 72W and $0.22/hr on RunPod
Inference (large models)	L40S, RTX 4090	48/24 GB memory, strong FP8
Fine-tuning (LoRA)	RTX 4090, A100-40GB	24-40 GB fits adapters for 7-13B models
Full fine-tuning	A100-80GB, H100	Need full model + optimizer in memory
Pre-training	8x H100, 8x B200	Maximum compute + NVLink for gradient sync
Budget training	RTX 3090, RTX 4090	Best TFLOPS/$ on marketplace providers
Maximum memory (single GPU)	MI300X	192 GB fits 70B in FP16 without sharding

By model size¶

Parameters	Minimum	Recommended	Notes
< 1B	T4 (16 GB)	L4 (24 GB)	FP8 inference on L4 is fast and cheap
1-7B	A10G (24 GB)	RTX 4090 (24 GB)	LoRA fits on 24 GB; full fine-tune needs 40+ GB
7-13B	A100-40GB	A100-80GB	Optimizer states double memory needs
13-70B	A100-80GB	2x H100 or MI300X	MI300X fits 70B on one card
70B+	4x H100	8x H100 or 8x B200	Tensor parallelism across NVLink

By efficiency¶

Metric	Best choice	Value	Price	Provider
TFLOPS/$ (BF16, spot)	L40S	1,097 TFLOPS/$	$0.33/hr	RunPod
TFLOPS/$ (BF16, consumer)	RTX 4090	688 TFLOPS/$	$0.24/hr	RunPod
TFLOPS/$ (BF16, datacenter)	A100	495 TFLOPS/$	$0.63/hr	VastAI
TFLOPS/W (BF16)	L4	1.68 TFLOPS/W	$0.22/hr	RunPod
GB/$ (memory per dollar)	RTX 3090	141 GB/$	$0.17/hr	RunPod
GB/W (memory per watt)	L4	0.33 GB/W	$0.22/hr	RunPod

TFLOPS/$ computed as BF16 dense TFLOPS / cheapest spot $/hr. Higher is better.

At a glance¶

Performance numbers below are dense tensor/matrix core throughput — the metric that matters for ML. Structured sparsity (2:4) doubles these figures, but most training workloads don't benefit from it. Prices are the cheapest spot rate across Skyward providers at time of writing.

NVIDIA datacenter¶

Accelerator	BF16	FP8	Memory	Bandwidth	TDP	From
B200	2,250	4,500	180 GB HBM3e	7,700 GB/s	1,000W	$6.26/hr (Verda)
H200	989	1,979	141 GB HBM3e	4,800 GB/s	700W	$2.29/hr (RunPod)
H100 SXM	989	1,979	80 GB HBM3	3,350 GB/s	700W	$1.50/hr (VastAI)
A100 80GB	312	—	80 GB HBM2e	2,039 GB/s	400W	$0.62/hr (VastAI)
L40S	362	733	48 GB GDDR6	864 GB/s	350W	$0.33/hr (RunPod)
L4	121	242	24 GB GDDR6	300 GB/s	72W	$0.22/hr (RunPod)
A10G	125	—	24 GB GDDR6	600 GB/s	150W	$2.01/hr (AWS)
T4	65*	—	16 GB GDDR6	320 GB/s	70W	$0.11/hr (VastAI)

All values in TFLOPS (dense tensor core). *T4 uses FP16 — Turing lacks native BF16. "—" in spec columns = hardware does not support this precision. "—" in price column = no offers currently listed on Skyward providers.

NVIDIA consumer¶

Accelerator	BF16	FP8	Memory	Bandwidth	TDP	From
RTX 5090	210	419	32 GB GDDR7	1,792 GB/s	575W	$0.53/hr (RunPod)
RTX 4090	165	330	24 GB GDDR6X	1,008 GB/s	450W	$0.24/hr (RunPod)
RTX 4080 S	105	209	16 GB GDDR6X	736 GB/s	320W	$0.17/hr (RunPod)
RTX 3090	71	—	24 GB GDDR6X	936 GB/s	350W	$0.17/hr (RunPod)

AMD, Google, Intel, AWS¶

Accelerator	BF16	FP8	Memory	Bandwidth	TDP	From
MI300X	1,307	2,615	192 GB HBM3	5,300 GB/s	750W	—
MI250X	383	—	128 GB HBM2e	3,277 GB/s	500W	—
TPU v5p	459	459	95 GB HBM2e	2,765 GB/s	~250W	GCP
TPU v5e	197	—	16 GB HBM2e	819 GB/s	~120W	GCP
Gaudi 3	1,835	1,835	128 GB HBM2e	3,700 GB/s	900W	AWS
Trainium2	667	1,299	96 GB HBM3	2,900 GB/s	~500W	AWS

AMD values are matrix engine TFLOPS. Gaudi 3 values are MME (Matrix Multiplication Engine) throughput. TPU/Trainium TDP are estimates — official figures not published.

Using accelerators¶

Factory functions¶

Use the factory functions under sky.accelerators:

sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.A100())
sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.H100(count=4))
sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.A100(memory="40GB"))

Each factory returns an Accelerator dataclass — frozen, immutable, with name, memory, count, and optional metadata (CUDA versions, form factors). The factory populates defaults from an internal catalog, so sky.accelerators.H100() already knows it has 80GB of HBM3 without you specifying it.

Memory and form factor variants¶

Some accelerators ship in multiple configurations. Use keyword arguments to select the variant:

# Memory variants
sky.accelerators.A100()              # 80GB (default)
sky.accelerators.A100(memory="40GB") # 40GB PCIe variant
sky.accelerators.V100(memory="32GB") # 32GB variant

# Form factor variants
sky.accelerators.H100()                        # Default
sky.accelerators.H100(form_factor="SXM")       # High-bandwidth SXM
sky.accelerators.H100(form_factor="NVL")       # NVLink 2-GPU module

Multi-GPU¶

Pass count to request multiple accelerators per node:

sky.accelerators.H100(count=4)    # 4x H100 per node
sky.accelerators.A100(count=8)    # 8x A100 per node

Custom accelerators¶

For hardware not in the catalog — experimental chips, private clouds, or overriding defaults:

import skyward as sky

my_gpu = sky.accelerators.Custom("My-GPU", memory="48GB")
my_gpu = sky.accelerators.Custom("H100-Custom", memory="80GB", count=8, cuda_min="12.0")

Detecting at runtime¶

Inside a @sky.function function, sky.instance_info() reports what hardware the function is running on:

@sky.function
def check_gpu():
    import torch

    info = sky.instance_info()

    return {
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "device_name": torch.cuda.get_device_name(0),
        "accelerators": info.accelerators,
        "accelerator_info": info.accelerator,
    }

Hardware reference¶

NVIDIA datacenter¶

Hopper and Blackwell (current generation)¶

The workhorses of large-scale training and inference. Blackwell roughly doubles Hopper's tensor throughput and memory bandwidth while adding FP4 support.

Spec	H100 SXM	H200	B200
Architecture	Hopper	Hopper	Blackwell
Process	TSMC 4N	TSMC 4N	TSMC 4NP
CUDA cores	16,896	16,896	18,432
Tensor cores	4th gen (528)	4th gen (528)	5th gen (592)
Compute capability	9.0	9.0	10.0
FP32	67 TFLOPS	67 TFLOPS	75 TFLOPS
BF16 tensor	989 / 1,979	989 / 1,979	2,250 / 4,500
FP8 tensor	1,979 / 3,958	1,979 / 3,958	4,500 / 9,000
FP4 tensor	—	—	9,000 / 18,000
Memory	80 GB HBM3	141 GB HBM3e	180 GB HBM3e
Mem bandwidth	3,350 GB/s	4,800 GB/s	7,700 GB/s
NVLink	4th gen, 900 GB/s	4th gen, 900 GB/s	5th gen, 1,800 GB/s
PCIe	Gen 5	Gen 5	Gen 5
TDP	700W	700W	1,000W

Tensor values shown as dense / sparse.

The H200 is compute-identical to the H100 — same die, same clocks. The difference is memory: HBM3e at 141 GB (vs 80 GB HBM3) with 43% more bandwidth (4.8 TB/s vs 3.35 TB/s). This matters most for inference of large models where memory capacity and bandwidth are the bottleneck, not raw compute.

The B200 uses a dual-die design: two GB100 dies connected by a 10 TB/s internal link (NV-HBI). From software's perspective it appears as a single GPU with 208 billion transistors.

Factory	Memory	Notes
`H100()`	40/80GB HBM3	Flagship training GPU. FP8 support.
`H100(form_factor="SXM")`	80GB	High-bandwidth SXM variant.
`H100(form_factor="NVL")`	94GB	NVLink 2-GPU module.
`H200()`	141GB HBM3e	1.4-1.9x inference speedup vs H100.
`B100()`	192GB HBM3e	FP4 support, 2nd-gen transformer engine.
`B200()`	192GB HBM3e	Flagship Blackwell. ~2.5x inference vs H100.

Ampere and Ada Lovelace¶

Ampere (A100, A10G) introduced TF32 and structural sparsity. Ada Lovelace (L40S, L4) added FP8 via 4th-gen tensor cores and moved to TSMC 4N — the same process as Hopper.

Spec	A100 80GB	L40S	L4	A10G
Architecture	Ampere	Ada Lovelace	Ada Lovelace	Ampere
Process	TSMC 7nm	TSMC 4N	TSMC 4N	Samsung 8nm
CUDA cores	6,912	18,176	7,424	9,216
Tensor cores	3rd gen (432)	4th gen (568)	4th gen (232)	3rd gen (288)
Compute capability	8.0	8.9	8.9	8.6
FP32	19.5 TFLOPS	91.6 TFLOPS	30.3 TFLOPS	31.2 TFLOPS
BF16 tensor	312 / 624	362 / 733	121 / 242	125 / 250
FP8 tensor	—	733 / 1,466	242 / 485	—
INT8 tensor	624 / 1,248	733 / 1,466	242 / 485	250 / 500
Memory	80 GB HBM2e	48 GB GDDR6	24 GB GDDR6	24 GB GDDR6
Mem bandwidth	2,039 GB/s	864 GB/s	300 GB/s	600 GB/s
NVLink	3rd gen, 600 GB/s	—	—	—
PCIe	Gen 4	Gen 4	Gen 4	Gen 4
TDP	400W	350W	72W	150W

Tensor values shown as dense / sparse. A100 and A10G lack FP8 (3rd-gen tensor cores). L40S and L4 lack NVLink.

The A100 remains the price/performance sweet spot for training at $0.62/hr spot on VastAI. Its 80 GB HBM2e fits most 7-13B parameter models. The L40S offers 3x more raw FP32 compute but half the memory bandwidth — better suited for inference and mixed workloads than large-scale training. The L4 at 72W and $0.22/hr on RunPod is the best value for inference: its FP8 throughput (242 TFLOPS) rivals the A100's BF16 (312 TFLOPS) at less than half the cost.

Factory	Memory	Notes
`A100()`	80GB HBM2e	Default. First GPU with TF32 and structural sparsity.
`A100(memory="40GB")`	40GB	PCIe variant.
`A100(count=8)`	8x 80GB	Multi-GPU.
`A800()`	80GB	China-specific A100 variant.
`A40()`	48GB GDDR6	Professional visualization + compute.
`A10G()`	24GB GDDR6	AWS g5 instances.
`L4()`	24GB GDDR6	Ada Lovelace. Replaces T4 for inference.
`L40S()`	48GB GDDR6	Ada Lovelace. Compute-optimized.

Legacy (still widely available)¶

Spec	T4
Architecture	Turing
Process	TSMC 12nm
CUDA cores	2,560
Tensor cores	2nd gen (320)
Compute capability	7.5
FP32	8.1 TFLOPS
FP16 tensor	65 / 130
INT8 tensor	130 / 260
Memory	16 GB GDDR6
Mem bandwidth	320 GB/s
TDP	70W

The T4 uses Turing's 2nd-gen tensor cores — no BF16 or FP8 support. FP16 is the highest-precision tensor format available.

The T4 at $0.13/hr on VastAI and 70W is the cheapest GPU available on Skyward. Its 16 GB is enough for inference on models up to ~3B parameters and development work. No BF16 means you need explicit FP16 casting for mixed precision.

Factory	Memory	Notes
`T4()`	16GB GDDR6	Cheapest option for dev/inference.
`T4G()`	16GB	ARM64 variant for AWS Graviton.
`V100()`	16/32GB HBM2	Volta. `V100(memory="32GB")` for the larger variant.
`P100()`	16GB HBM2	Pascal. First HBM GPU for deep learning.

NVIDIA consumer¶

Available primarily on marketplace providers like VastAI, RunPod, and TensorDock. Consumer cards lack NVLink (except the RTX 3090) and ECC memory, but deliver exceptional price/performance for single-GPU workloads.

Spec	RTX 5090	RTX 4090	RTX 4080 S	RTX 3090
Architecture	Blackwell (GB203)	Ada Lovelace (AD102)	Ada Lovelace (AD103)	Ampere (GA102)
Process	TSMC 4N	TSMC 4N	TSMC 4N	Samsung 8nm
CUDA cores	21,760	16,384	10,240	10,496
Tensor cores	5th gen (680)	4th gen (512)	4th gen (320)	3rd gen (328)
Compute capability	12.0	8.9	8.9	8.6
FP32	105 TFLOPS	82.6 TFLOPS	52.2 TFLOPS	35.6 TFLOPS
BF16 tensor	210 / 419	165 / 330	105 / 209	71 / 142
FP8 tensor	419 / 838	330 / 661	209 / 418	—
INT8 tensor	838 / 1,676	661 / 1,321	418 / 836	285 / 569
Memory	32 GB GDDR7	24 GB GDDR6X	16 GB GDDR6X	24 GB GDDR6X
Mem bandwidth	1,792 GB/s	1,008 GB/s	736 GB/s	936 GB/s
NVLink	—	—	—	3rd gen, 112.5 GB/s
PCIe	Gen 5	Gen 4	Gen 4	Gen 4
TDP	575W	450W	320W	350W

Tensor values shown as dense / sparse. RTX 3090 lacks FP8 (3rd-gen tensor cores, Ampere). RTX 3090 is the last consumer card with NVLink support — NVIDIA removed it from the 40 and 50 series. RTX 5090 moves to GDDR7, delivering 78% more bandwidth than the 4090.

The RTX 4090 at $0.24/hr spot on RunPod is the most popular consumer GPU for ML — 165 TFLOPS BF16 with 24 GB of memory handles LoRA fine-tuning of 7B models comfortably. The RTX 3090 offers the same 24 GB at similar pricing on RunPod but without FP8 support. The RTX 5090 at $0.53/hr on RunPod brings 32 GB GDDR7 and PCIe Gen 5 but costs ~2x more.

# RTX 50 series (Blackwell)
sky.accelerators.RTX_5090()    # 32GB
sky.accelerators.RTX_5080()    # 16GB

# RTX 40 series (Ada Lovelace)
sky.accelerators.RTX_4090()    # 24GB — popular for fine-tuning
sky.accelerators.RTX_4080()    # 16GB

# RTX 30 series (Ampere)
sky.accelerators.RTX_3090()    # 24GB
sky.accelerators.RTX_3080()    # 10GB

# Older generations also available: RTX 20xx, GTX 16xx, GTX 10xx

Workstation GPUs like RTX_A6000() (48GB), RTX_6000_Ada(), and RTX_PRO_6000() are also supported.

AMD Instinct¶

AMD's datacenter accelerators use CDNA architecture with matrix cores (AMD's equivalent of tensor cores). The MI300X's 192 GB HBM3 makes it the highest-memory single-GPU available — enough to fit a 70B model in FP16 without sharding.

Spec	MI300X	MI250X
Architecture	CDNA 3	CDNA 2
Process	TSMC 5nm + 6nm	TSMC 6nm
Compute units	304 (8 XCDs)	220 (2 GCDs)
FP32 (matrix)	163 TFLOPS	47.9 TFLOPS
BF16 (matrix)	1,307 TFLOPS	383 TFLOPS
FP8 (matrix)	2,615 TFLOPS	—
INT8 (matrix)	2,615 TOPS	383 TOPS
Memory	192 GB HBM3	128 GB HBM2e
Mem bandwidth	5,300 GB/s	3,277 GB/s
Interconnect	Infinity Fabric 4th gen, 896 GB/s (8-GPU)	Infinity Fabric 3rd gen
TDP	750W	500W

The MI300X uses 3.5D chiplet packaging: 8 XCD compute dies (5nm) + 4 I/O dies (6nm) with 256 MB Infinity Cache. The MI250X is a dual-GCD MCM — each GCD appears as a separate device to software. MI250X powered the Frontier exascale supercomputer.

pool = sky.Compute(
    provider=sky.VastAI(),
    accelerator=sky.accelerators.MI300X(),
)

Factory	Memory	Architecture	Notes
`MI300X()`	192GB HBM3	CDNA 3	Designed for LLMs.
`MI300A()`	128GB	CDNA 3 (APU)	Integrated CPU+GPU.
`MI250X()`	128GB HBM2e	CDNA 2	HPC workloads.
`MI250()`	128GB HBM2e	CDNA 2	Training.
`MI210()`	64GB HBM2e	CDNA 2	Training.
`MI100()`	32GB HBM2	CDNA 1	Compute.

Google TPUs¶

Tensor Processing Units are Google's custom ASICs, available exclusively through GCP. TPUs are architecturally different from GPUs — they use systolic arrays optimized for BF16/INT8 matrix multiplication and connect via a dedicated inter-chip interconnect (ICI) in torus topologies.

Spec	TPU v5p	TPU v5e
Type	Training-optimized	Cost-optimized
BF16	459 TFLOPS	197 TFLOPS
FP8	459 TFLOPS	—
INT8	918 TOPS	393 TOPS
Memory	95 GB HBM2e	16 GB HBM2e
Mem bandwidth	2,765 GB/s	819 GB/s
ICI bandwidth	1,200 GB/s (3D torus)	400 GB/s (2D torus)
TDP	~250W (liquid cooled)	~120W (air cooled)
Max pod size	8,960 chips	256 chips

TPU v5p has 2 TensorCores + 4 SparseCores (2nd-gen) per chip for embedding-heavy workloads. TPU v5e has 1 TensorCore with 4 MXUs (128x128 systolic arrays). Google does not publish FP32 or TDP figures. ICI uses optical circuit switches for flexible topology.

The v5p targets large-scale training (up to 8,960-chip pods). The v5e is cost-optimized for inference and training of models up to ~200B parameters, offering 2.3x price/performance over TPU v4.

# Single TPU chip
sky.accelerators.TPUv5p()

# Multi-chip slices
sky.accelerators.TPUv5p_8()    # 8-chip slice
sky.accelerators.TPUv4_64()    # 64-chip slice
sky.accelerators.TPUv3_32()    # 32-chip slice

Factory	Generation	Notes
`TPUv6()`	6th gen (2024)	Latest.
`TPUv5p()`	5th gen perf	Training-optimized.
`TPUv5e()`	5th gen eff	Inference-optimized.
`TPUv4()`	4th gen	General.
`TPUv3()` / `TPUv2()`	Legacy	Still available.

Intel Gaudi and AWS Trainium¶

Two non-GPU accelerators designed specifically for deep learning, each with distinct architectural bets.

Spec	Gaudi 3	Trainium2
Vendor	Intel (Habana Labs)	AWS (Annapurna Labs)
Process	TSMC 5nm	5nm
Compute engines	8 MMEs + 64 TPCs	8 NeuronCores-v3
BF16	1,835 TFLOPS (MME)	667 TFLOPS
FP8	1,835 TFLOPS (MME)	1,299 TFLOPS
FP8 sparse	—	2,563 TFLOPS
Memory	128 GB HBM2e	96 GB HBM3
Mem bandwidth	3,700 GB/s	2,900 GB/s
On-chip SRAM	96 MB (12.8 TB/s)	224 MB scratchpad
Interconnect	24x 200GbE RoCEv2 (4.8 Tbps)	NeuronLink-v3 (1,280 GB/s)
TDP	900W (OAM)	~500W

Gaudi 3 is unique in having 24 on-die RDMA/Ethernet ports — no external NIC needed. It uses a dual-chiplet design (two 5nm dies). Trainium2 has 16 dedicated Collective Communication cores for distributed training. A Trn2 instance has 16 chips (20.8 PFLOPS FP8); UltraServer packs 64 chips (83.2 PFLOPS).

Gaudi 3 achieves identical BF16 and FP8 throughput (1,835 TFLOPS) — unusual, as most accelerators halve throughput going from FP8 to BF16. Its integrated Ethernet eliminates the need for InfiniBand, reducing infrastructure cost. Trainium2's NeuronLink interconnect enables tight coupling of up to 64 chips, and its configurable FP8 (cFP8) format allows custom exponent/mantissa splits.

# Gaudi on AWS DL2 instances
pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.Gaudi3(),
)

# Trainium on AWS Trn2 instances
pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.Trainium2(),
    image=sky.Image(pip=["torch-neuronx"]),
)

Factory	Memory	Notes
`Gaudi3()`	128GB HBM2e	Latest generation.
`Gaudi2()`	96GB HBM2e	2x performance vs Gaudi.
`Gaudi()`	—	First gen.
`Trainium2()`	64GB	4x performance vs v1.
`Trainium1()`	32GB	First gen.
`Trainium3()`	128GB	Latest generation.

AWS Inferentia¶

Custom silicon for cost-effective inference:

pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.Inferentia2(),
    image=sky.Image(pip=["torch-neuronx"]),
)

Factory	Memory	Instance	Notes
`Inferentia1()`	8GB	inf1.*	Single model serving.
`Inferentia2()`	32GB	inf2.*	High throughput.

Compare accelerators — interactive side-by-side comparison with charts
Providers — AWS, RunPod, VastAI, Verda, and Container configuration
Distributed Training — multi-node training guides
API Reference — complete accelerator API

Accelerators¶

Choosing an accelerator¶

By workload¶

By model size¶

By efficiency¶

At a glance¶

NVIDIA datacenter¶

NVIDIA consumer¶

AMD, Google, Intel, AWS¶

Using accelerators¶

Factory functions¶

Memory and form factor variants¶

Multi-GPU¶

Custom accelerators¶

Detecting at runtime¶

Hardware reference¶

NVIDIA datacenter¶

Hopper and Blackwell (current generation)¶

Ampere and Ada Lovelace¶

Legacy (still widely available)¶

NVIDIA consumer¶

AMD Instinct¶

Google TPUs¶

Intel Gaudi and AWS Trainium¶

AWS Inferentia¶

Related topics¶