Accelerators¶
Every cloud provider names GPU instances differently. AWS calls an A100 machine a p4d.24xlarge. RunPod uses a gpuTypeId. VastAI filters by GPU model. The accelerator parameter on Compute abstracts this: you describe the hardware you want, and the provider figures out how to get it.
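For example, the same accelerator spec can be handed to different providers unchanged; the RunPod constructor below is assumed to mirror sky.AWS() (see the Providers page for the exact names):

```python
import skyward as sky

# Identical hardware request; each provider resolves it to its own naming
# (instance type on AWS, gpuTypeId on RunPod). sky.RunPod() is illustrative.
on_aws = sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.A100())
on_runpod = sky.Compute(provider=sky.RunPod(), accelerator=sky.accelerators.A100())
```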
This page serves as both a practical guide for choosing hardware in Skyward and a technical reference for ML accelerators. The comparison tables cover 18 accelerators across NVIDIA, AMD, Google, Intel, and AWS — from the $0.11/hr T4 (VastAI) to the $6/hr+ B200 (Verda).
Choosing an accelerator¶
By workload¶
| Workload | Recommended | Why |
|---|---|---|
| Development/prototyping | T4 | $0.11/hr on VastAI; 16 GB is enough for small models |
| Inference (small models) | L4 | 242 TFLOPS FP8 at 72W and $0.22/hr on RunPod |
| Inference (large models) | L40S, RTX 4090 | 48/24 GB memory, strong FP8 |
| Fine-tuning (LoRA) | RTX 4090, A100-40GB | 24-40 GB fits adapters for 7-13B models |
| Full fine-tuning | A100-80GB, H100 | Need full model + optimizer in memory |
| Pre-training | 8x H100, 8x B200 | Maximum compute + NVLink for gradient sync |
| Budget training | RTX 3090, RTX 4090 | Best TFLOPS/$ on marketplace providers |
| Maximum memory (single GPU) | MI300X | 192 GB fits 70B in FP16 without sharding |
By model size¶
| Parameters | Minimum | Recommended | Notes |
|---|---|---|---|
| < 1B | T4 (16 GB) | L4 (24 GB) | FP8 inference on L4 is fast and cheap |
| 1-7B | A10G (24 GB) | RTX 4090 (24 GB) | LoRA fits on 24 GB; full fine-tune needs 40+ GB |
| 7-13B | A100-40GB | A100-80GB | Optimizer states double memory needs |
| 13-70B | A100-80GB | 2x H100 or MI300X | MI300X fits 70B on one card |
| 70B+ | 4x H100 | 8x H100 or 8x B200 | Tensor parallelism across NVLink |
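A quick way to sanity-check the training rows above is the usual mixed-precision memory estimate: BF16 weights and gradients plus FP32 Adam states come to roughly 16 bytes per parameter for a full fine-tune, while LoRA pays that cost only on the small adapter matrices. A rough sketch (rule of thumb only; activations and framework overhead come on top):

```python
def full_finetune_state_gb(params_billion: float) -> float:
    # BF16 weights (2 B) + BF16 grads (2 B) + FP32 master weights and
    # Adam m/v states (12 B) ~= 16 bytes per parameter.
    return params_billion * 16

print(full_finetune_state_gb(7))   # ~112 GB: why full fine-tunes need 80 GB cards plus sharding/offload
print(full_finetune_state_gb(13))  # ~208 GB of weight + optimizer state before activations
```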
By efficiency¶
| Metric | Best choice | Value | Price | Provider |
|---|---|---|---|---|
| TFLOPS/$ (BF16, spot) | L40S | 1,097 TFLOPS/$ | $0.33/hr | RunPod |
| TFLOPS/$ (BF16, consumer) | RTX 4090 | 688 TFLOPS/$ | $0.24/hr | RunPod |
| TFLOPS/$ (BF16, datacenter) | A100 | 503 TFLOPS/$ | $0.62/hr | VastAI |
| TFLOPS/W (BF16) | L4 | 1.68 TFLOPS/W | $0.22/hr | RunPod |
| GB/$ (memory per dollar) | RTX 3090 | 141 GB/$ | $0.17/hr | RunPod |
| GB/W (memory per watt) | L4 | 0.33 GB/W | $0.22/hr | RunPod |
TFLOPS/$ computed as BF16 dense TFLOPS / cheapest spot $/hr. Higher is better.
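The table values can be reproduced directly from the spec and price columns in the at-a-glance tables below; a minimal sketch:

```python
# Reproduce the efficiency metrics above from the at-a-glance tables.
def tflops_per_dollar(bf16_tflops: float, spot_per_hr: float) -> float:
    return bf16_tflops / spot_per_hr

def tflops_per_watt(bf16_tflops: float, tdp_w: float) -> float:
    return bf16_tflops / tdp_w

print(round(tflops_per_dollar(362, 0.33)))  # L40S: ~1,097 TFLOPS/$
print(round(tflops_per_watt(121, 72), 2))   # L4: ~1.68 TFLOPS/W
```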
At a glance¶
Performance numbers below are dense tensor/matrix core throughput — the metric that matters for ML. Structured sparsity (2:4) doubles these figures, but most training workloads don't benefit from it. Prices are the cheapest spot rate across Skyward providers at time of writing.
NVIDIA datacenter¶
| Accelerator | BF16 | FP8 | Memory | Bandwidth | TDP | From |
|---|---|---|---|---|---|---|
| B200 | 2,250 | 4,500 | 180 GB HBM3e | 7,700 GB/s | 1,000W | $6.26/hr (Verda) |
| H200 | 989 | 1,979 | 141 GB HBM3e | 4,800 GB/s | 700W | $2.29/hr (RunPod) |
| H100 SXM | 989 | 1,979 | 80 GB HBM3 | 3,350 GB/s | 700W | $1.50/hr (VastAI) |
| A100 80GB | 312 | — | 80 GB HBM2e | 2,039 GB/s | 400W | $0.62/hr (VastAI) |
| L40S | 362 | 733 | 48 GB GDDR6 | 864 GB/s | 350W | $0.33/hr (RunPod) |
| L4 | 121 | 242 | 24 GB GDDR6 | 300 GB/s | 72W | $0.22/hr (RunPod) |
| A10G | 125 | — | 24 GB GDDR6 | 600 GB/s | 150W | $2.01/hr (AWS) |
| T4 | 65* | — | 16 GB GDDR6 | 320 GB/s | 70W | $0.11/hr (VastAI) |
All values in TFLOPS (dense tensor core). *T4 uses FP16 — Turing lacks native BF16. "—" in spec columns = hardware does not support this precision. "—" in price column = no offers currently listed on Skyward providers.
NVIDIA consumer¶
| Accelerator | BF16 | FP8 | Memory | Bandwidth | TDP | From |
|---|---|---|---|---|---|---|
| RTX 5090 | 210 | 419 | 32 GB GDDR7 | 1,792 GB/s | 575W | $0.53/hr (RunPod) |
| RTX 4090 | 165 | 330 | 24 GB GDDR6X | 1,008 GB/s | 450W | $0.24/hr (RunPod) |
| RTX 4080 S | 105 | 209 | 16 GB GDDR6X | 736 GB/s | 320W | $0.17/hr (RunPod) |
| RTX 3090 | 71 | — | 24 GB GDDR6X | 936 GB/s | 350W | $0.17/hr (RunPod) |
AMD, Google, Intel, AWS¶
| Accelerator | BF16 | FP8 | Memory | Bandwidth | TDP | From |
|---|---|---|---|---|---|---|
| MI300X | 1,307 | 2,615 | 192 GB HBM3 | 5,300 GB/s | 750W | — |
| MI250X | 383 | — | 128 GB HBM2e | 3,277 GB/s | 500W | — |
| TPU v5p | 459 | 459 | 95 GB HBM2e | 2,765 GB/s | ~250W | GCP |
| TPU v5e | 197 | — | 16 GB HBM2e | 819 GB/s | ~120W | GCP |
| Gaudi 3 | 1,835 | 1,835 | 128 GB HBM2e | 3,700 GB/s | 900W | AWS |
| Trainium2 | 667 | 1,299 | 96 GB HBM3 | 2,900 GB/s | ~500W | AWS |
AMD values are matrix engine TFLOPS. Gaudi 3 values are MME (Matrix Multiplication Engine) throughput. TPU/Trainium TDP are estimates — official figures not published.
Using accelerators¶
Factory functions¶
Use the factory functions under sky.accelerators:
```python
sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.A100())
sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.H100(count=4))
sky.Compute(provider=sky.AWS(), accelerator=sky.accelerators.A100(memory="40GB"))
```
Each factory returns a frozen (immutable) Accelerator dataclass with name, memory, count, and optional metadata (CUDA versions, form factors). The factory populates defaults from an internal catalog, so sky.accelerators.H100() already knows it has 80GB of HBM3 without you specifying it.
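Since the result is a plain frozen dataclass, you can inspect the populated defaults before handing it to Compute. The attribute names below follow the field list above; a sketch, not the definitive API:

```python
import skyward as sky

h100 = sky.accelerators.H100()
# Field names follow the description above (name, memory, count);
# adjust if the installed version exposes them differently.
print(h100.name, h100.memory, h100.count)

pool = sky.Compute(provider=sky.AWS(), accelerator=h100)
```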
Memory and form factor variants¶
Some accelerators ship in multiple configurations. Use keyword arguments to select the variant:
```python
# Memory variants
sky.accelerators.A100()                   # 80GB (default)
sky.accelerators.A100(memory="40GB")      # 40GB PCIe variant
sky.accelerators.V100(memory="32GB")      # 32GB variant

# Form factor variants
sky.accelerators.H100()                   # Default
sky.accelerators.H100(form_factor="SXM")  # High-bandwidth SXM
sky.accelerators.H100(form_factor="NVL")  # NVLink 2-GPU module
```
Multi-GPU¶
Pass count to request multiple accelerators per node:
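```python
# Eight H100s on a single node (same factory call as above, with count)
pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.H100(count=8),
)
```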
Custom accelerators¶
For hardware not in the catalog — experimental chips, private clouds, or overriding defaults:
```python
import skyward as sky

my_gpu = sky.accelerators.Custom("My-GPU", memory="48GB")
my_gpu = sky.accelerators.Custom("H100-Custom", memory="80GB", count=8, cuda_min="12.0")
```
Detecting at runtime¶
Inside a function decorated with @sky.function, sky.instance_info() reports what hardware the function is running on:
```python
@sky.function
def check_gpu():
    import torch

    info = sky.instance_info()
    return {
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "device_name": torch.cuda.get_device_name(0),
        "accelerators": info.accelerators,
        "accelerator_info": info.accelerator,
    }
```
Hardware reference¶
NVIDIA datacenter¶
Hopper and Blackwell (current generation)¶
The workhorses of large-scale training and inference. Blackwell roughly doubles Hopper's tensor throughput and memory bandwidth while adding FP4 support.
| Spec | H100 SXM | H200 | B200 |
|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell |
| Process | TSMC 4N | TSMC 4N | TSMC 4NP |
| CUDA cores | 16,896 | 16,896 | 18,432 |
| Tensor cores | 4th gen (528) | 4th gen (528) | 5th gen (592) |
| Compute capability | 9.0 | 9.0 | 10.0 |
| FP32 | 67 TFLOPS | 67 TFLOPS | 75 TFLOPS |
| BF16 tensor | 989 / 1,979 | 989 / 1,979 | 2,250 / 4,500 |
| FP8 tensor | 1,979 / 3,958 | 1,979 / 3,958 | 4,500 / 9,000 |
| FP4 tensor | — | — | 9,000 / 18,000 |
| Memory | 80 GB HBM3 | 141 GB HBM3e | 180 GB HBM3e |
| Mem bandwidth | 3,350 GB/s | 4,800 GB/s | 7,700 GB/s |
| NVLink | 4th gen, 900 GB/s | 4th gen, 900 GB/s | 5th gen, 1,800 GB/s |
| PCIe | Gen 5 | Gen 5 | Gen 5 |
| TDP | 700W | 700W | 1,000W |
Tensor values shown as dense / sparse.
The H200 is compute-identical to the H100 — same die, same clocks. The difference is memory: HBM3e at 141 GB (vs 80 GB HBM3) with 43% more bandwidth (4.8 TB/s vs 3.35 TB/s). This matters most for inference of large models where memory capacity and bandwidth are the bottleneck, not raw compute.
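To see why bandwidth dominates, a rough single-stream decode ceiling can be estimated by assuming each generated token streams the full weights from HBM once; a simplification that ignores KV cache, batching, and overlap:

```python
def decode_ceiling_tokens_per_s(mem_bw_gb_s: float, params_b: float, bytes_per_param: int) -> float:
    # Upper bound: one full pass over the weights per generated token.
    weight_gb = params_b * bytes_per_param
    return mem_bw_gb_s / weight_gb

# 70B model quantized to FP8 (1 byte/param); illustrative estimates, not benchmarks.
print(decode_ceiling_tokens_per_s(3350, 70, 1))  # H100: ~48 tokens/s ceiling
print(decode_ceiling_tokens_per_s(4800, 70, 1))  # H200: ~69 tokens/s ceiling
```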
The B200 uses a dual-die design: two GB100 dies connected by a 10 TB/s internal link (NV-HBI). From software's perspective it appears as a single GPU with 208 billion transistors.
| Factory | Memory | Notes |
|---|---|---|
| `H100()` | 80GB HBM3 | Flagship training GPU. FP8 support. |
| `H100(form_factor="SXM")` | 80GB | High-bandwidth SXM variant. |
| `H100(form_factor="NVL")` | 94GB | NVLink 2-GPU module. |
| `H200()` | 141GB HBM3e | 1.4-1.9x inference speedup vs H100. |
| `B100()` | 192GB HBM3e | FP4 support, 2nd-gen transformer engine. |
| `B200()` | 192GB HBM3e | Flagship Blackwell. ~2.5x inference vs H100. |
Ampere and Ada Lovelace¶
Ampere (A100, A10G) introduced TF32 and structural sparsity. Ada Lovelace (L40S, L4) added FP8 via 4th-gen tensor cores and moved to TSMC 4N — the same process as Hopper.
| Spec | A100 80GB | L40S | L4 | A10G |
|---|---|---|---|---|
| Architecture | Ampere | Ada Lovelace | Ada Lovelace | Ampere |
| Process | TSMC 7nm | TSMC 4N | TSMC 4N | Samsung 8nm |
| CUDA cores | 6,912 | 18,176 | 7,424 | 9,216 |
| Tensor cores | 3rd gen (432) | 4th gen (568) | 4th gen (232) | 3rd gen (288) |
| Compute capability | 8.0 | 8.9 | 8.9 | 8.6 |
| FP32 | 19.5 TFLOPS | 91.6 TFLOPS | 30.3 TFLOPS | 31.2 TFLOPS |
| BF16 tensor | 312 / 624 | 362 / 733 | 121 / 242 | 125 / 250 |
| FP8 tensor | — | 733 / 1,466 | 242 / 485 | — |
| INT8 tensor | 624 / 1,248 | 733 / 1,466 | 242 / 485 | 250 / 500 |
| Memory | 80 GB HBM2e | 48 GB GDDR6 | 24 GB GDDR6 | 24 GB GDDR6 |
| Mem bandwidth | 2,039 GB/s | 864 GB/s | 300 GB/s | 600 GB/s |
| NVLink | 3rd gen, 600 GB/s | — | — | — |
| PCIe | Gen 4 | Gen 4 | Gen 4 | Gen 4 |
| TDP | 400W | 350W | 72W | 150W |
Tensor values shown as dense / sparse. A100 and A10G lack FP8 (3rd-gen tensor cores). L40S and L4 lack NVLink.
The A100 remains the price/performance sweet spot for training at $0.62/hr spot on VastAI. Its 80 GB HBM2e fits most 7-13B parameter models. The L40S offers nearly 5x the raw FP32 compute (91.6 vs 19.5 TFLOPS) but well under half the memory bandwidth, which suits it to inference and mixed workloads rather than large-scale training. The L4 at 72W and $0.22/hr on RunPod is the best value for inference: its FP8 throughput (242 TFLOPS) approaches the A100's BF16 (312 TFLOPS) at roughly a third of the price.
| Factory | Memory | Notes |
|---|---|---|
| `A100()` | 80GB HBM2e | Default. First GPU with TF32 and structural sparsity. |
| `A100(memory="40GB")` | 40GB | PCIe variant. |
| `A100(count=8)` | 8x 80GB | Multi-GPU. |
| `A800()` | 80GB | China-specific A100 variant. |
| `A40()` | 48GB GDDR6 | Professional visualization + compute. |
| `A10G()` | 24GB GDDR6 | AWS g5 instances. |
| `L4()` | 24GB GDDR6 | Ada Lovelace. Replaces T4 for inference. |
| `L40S()` | 48GB GDDR6 | Ada Lovelace. Compute-optimized. |
Legacy (still widely available)¶
| Spec | T4 |
|---|---|
| Architecture | Turing |
| Process | TSMC 12nm |
| CUDA cores | 2,560 |
| Tensor cores | 2nd gen (320) |
| Compute capability | 7.5 |
| FP32 | 8.1 TFLOPS |
| FP16 tensor | 65 / 130 |
| INT8 tensor | 130 / 260 |
| Memory | 16 GB GDDR6 |
| Mem bandwidth | 320 GB/s |
| TDP | 70W |
The T4 uses Turing's 2nd-gen tensor cores — no BF16 or FP8 support. FP16 is the highest-precision tensor format available.
At $0.11/hr on VastAI and 70W, the T4 is the cheapest GPU available on Skyward. Its 16 GB is enough for inference on models up to ~3B parameters and for development work. No BF16 means you need explicit FP16 casting for mixed precision.
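In PyTorch that usually means autocasting to float16 instead of bfloat16; a minimal sketch of mixed precision on a T4:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

# Turing has no BF16 tensor cores, so request float16 explicitly.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)

# For training, pair FP16 with a GradScaler to avoid underflow.
scaler = torch.cuda.amp.GradScaler()
```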
| Factory | Memory | Notes |
|---|---|---|
| `T4()` | 16GB GDDR6 | Cheapest option for dev/inference. |
| `T4G()` | 16GB | ARM64 variant for AWS Graviton. |
| `V100()` | 16/32GB HBM2 | Volta. `V100(memory="32GB")` for the larger variant. |
| `P100()` | 16GB HBM2 | Pascal. First HBM GPU for deep learning. |
NVIDIA consumer¶
Available primarily on marketplace providers like VastAI, RunPod, and TensorDock. Consumer cards lack NVLink (except the RTX 3090) and ECC memory, but deliver exceptional price/performance for single-GPU workloads.
| Spec | RTX 5090 | RTX 4090 | RTX 4080 S | RTX 3090 |
|---|---|---|---|---|
| Architecture | Blackwell (GB202) | Ada Lovelace (AD102) | Ada Lovelace (AD103) | Ampere (GA102) |
| Process | TSMC 4N | TSMC 4N | TSMC 4N | Samsung 8nm |
| CUDA cores | 21,760 | 16,384 | 10,240 | 10,496 |
| Tensor cores | 5th gen (680) | 4th gen (512) | 4th gen (320) | 3rd gen (328) |
| Compute capability | 12.0 | 8.9 | 8.9 | 8.6 |
| FP32 | 105 TFLOPS | 82.6 TFLOPS | 52.2 TFLOPS | 35.6 TFLOPS |
| BF16 tensor | 210 / 419 | 165 / 330 | 105 / 209 | 71 / 142 |
| FP8 tensor | 419 / 838 | 330 / 661 | 209 / 418 | — |
| INT8 tensor | 838 / 1,676 | 661 / 1,321 | 418 / 836 | 285 / 569 |
| Memory | 32 GB GDDR7 | 24 GB GDDR6X | 16 GB GDDR6X | 24 GB GDDR6X |
| Mem bandwidth | 1,792 GB/s | 1,008 GB/s | 736 GB/s | 936 GB/s |
| NVLink | — | — | — | 3rd gen, 112.5 GB/s |
| PCIe | Gen 5 | Gen 4 | Gen 4 | Gen 4 |
| TDP | 575W | 450W | 320W | 350W |
Tensor values shown as dense / sparse. RTX 3090 lacks FP8 (3rd-gen tensor cores, Ampere). RTX 3090 is the last consumer card with NVLink support — NVIDIA removed it from the 40 and 50 series. RTX 5090 moves to GDDR7, delivering 78% more bandwidth than the 4090.
The RTX 4090 at $0.24/hr spot on RunPod is the most popular consumer GPU for ML — 165 TFLOPS BF16 with 24 GB of memory handles LoRA fine-tuning of 7B models comfortably. The RTX 3090 offers the same 24 GB at an even lower spot price on RunPod, but without FP8 support. The RTX 5090 at $0.53/hr on RunPod adds 32 GB GDDR7 and PCIe Gen 5, but costs roughly twice as much as the 4090.
```python
# RTX 50 series (Blackwell)
sky.accelerators.RTX_5090()  # 32GB
sky.accelerators.RTX_5080()  # 16GB

# RTX 40 series (Ada Lovelace)
sky.accelerators.RTX_4090()  # 24GB — popular for fine-tuning
sky.accelerators.RTX_4080()  # 16GB

# RTX 30 series (Ampere)
sky.accelerators.RTX_3090()  # 24GB
sky.accelerators.RTX_3080()  # 10GB

# Older generations also available: RTX 20xx, GTX 16xx, GTX 10xx
```
Workstation GPUs like RTX_A6000() (48GB), RTX_6000_Ada(), and RTX_PRO_6000() are also supported.
AMD Instinct¶
AMD's datacenter accelerators use CDNA architecture with matrix cores (AMD's equivalent of tensor cores). The MI300X's 192 GB HBM3 makes it the highest-memory single-GPU available — enough to fit a 70B model in FP16 without sharding.
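The 70B claim is straightforward arithmetic: at 2 bytes per parameter, FP16 weights for a 70B model are about 140 GB, leaving roughly 50 GB of the MI300X's HBM for KV cache and runtime overhead:

```python
params = 70e9
fp16_weight_gb = params * 2 / 1e9  # FP16: 2 bytes per parameter
print(fp16_weight_gb)              # 140.0 GB of weights vs 192 GB HBM3 on one MI300X
```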
| Spec | MI300X | MI250X |
|---|---|---|
| Architecture | CDNA 3 | CDNA 2 |
| Process | TSMC 5nm + 6nm | TSMC 6nm |
| Compute units | 304 (8 XCDs) | 220 (2 GCDs) |
| FP32 (matrix) | 163 TFLOPS | 47.9 TFLOPS |
| BF16 (matrix) | 1,307 TFLOPS | 383 TFLOPS |
| FP8 (matrix) | 2,615 TFLOPS | — |
| INT8 (matrix) | 2,615 TOPS | 383 TOPS |
| Memory | 192 GB HBM3 | 128 GB HBM2e |
| Mem bandwidth | 5,300 GB/s | 3,277 GB/s |
| Interconnect | Infinity Fabric 4th gen, 896 GB/s (8-GPU) | Infinity Fabric 3rd gen |
| TDP | 750W | 500W |
The MI300X uses 3.5D chiplet packaging: 8 XCD compute dies (5nm) + 4 I/O dies (6nm) with 256 MB Infinity Cache. The MI250X is a dual-GCD MCM — each GCD appears as a separate device to software. MI250X powered the Frontier exascale supercomputer.
| Factory | Memory | Architecture | Notes |
|---|---|---|---|
| `MI300X()` | 192GB HBM3 | CDNA 3 | Designed for LLMs. |
| `MI300A()` | 128GB | CDNA 3 (APU) | Integrated CPU+GPU. |
| `MI250X()` | 128GB HBM2e | CDNA 2 | HPC workloads. |
| `MI250()` | 128GB HBM2e | CDNA 2 | Training. |
| `MI210()` | 64GB HBM2e | CDNA 2 | Training. |
| `MI100()` | 32GB HBM2 | CDNA 1 | Compute. |
Google TPUs¶
Tensor Processing Units are Google's custom ASICs, available exclusively through GCP. TPUs are architecturally different from GPUs — they use systolic arrays optimized for BF16/INT8 matrix multiplication and connect via a dedicated inter-chip interconnect (ICI) in torus topologies.
| Spec | TPU v5p | TPU v5e |
|---|---|---|
| Type | Training-optimized | Cost-optimized |
| BF16 | 459 TFLOPS | 197 TFLOPS |
| FP8 | 459 TFLOPS | — |
| INT8 | 918 TOPS | 393 TOPS |
| Memory | 95 GB HBM2e | 16 GB HBM2e |
| Mem bandwidth | 2,765 GB/s | 819 GB/s |
| ICI bandwidth | 1,200 GB/s (3D torus) | 400 GB/s (2D torus) |
| TDP | ~250W (liquid cooled) | ~120W (air cooled) |
| Max pod size | 8,960 chips | 256 chips |
TPU v5p has 2 TensorCores + 4 SparseCores (2nd-gen) per chip for embedding-heavy workloads. TPU v5e has 1 TensorCore with 4 MXUs (128x128 systolic arrays). Google does not publish FP32 or TDP figures. ICI uses optical circuit switches for flexible topology.
The v5p targets large-scale training (up to 8,960-chip pods). The v5e is cost-optimized for inference and training of models up to ~200B parameters, offering 2.3x price/performance over TPU v4.
```python
# Single TPU chip
sky.accelerators.TPUv5p()

# Multi-chip slices
sky.accelerators.TPUv5p_8()  # 8-chip slice
sky.accelerators.TPUv4_64()  # 64-chip slice
sky.accelerators.TPUv3_32()  # 32-chip slice
```
| Factory | Generation | Notes |
|---|---|---|
| `TPUv6()` | 6th gen (2024) | Latest. |
| `TPUv5p()` | 5th gen perf | Training-optimized. |
| `TPUv5e()` | 5th gen eff | Inference-optimized. |
| `TPUv4()` | 4th gen | General. |
| `TPUv3()` / `TPUv2()` | Legacy | Still available. |
Intel Gaudi and AWS Trainium¶
Two non-GPU accelerators designed specifically for deep learning, each with distinct architectural bets.
| Spec | Gaudi 3 | Trainium2 |
|---|---|---|
| Vendor | Intel (Habana Labs) | AWS (Annapurna Labs) |
| Process | TSMC 5nm | 5nm |
| Compute engines | 8 MMEs + 64 TPCs | 8 NeuronCores-v3 |
| BF16 | 1,835 TFLOPS (MME) | 667 TFLOPS |
| FP8 | 1,835 TFLOPS (MME) | 1,299 TFLOPS |
| FP8 sparse | — | 2,563 TFLOPS |
| Memory | 128 GB HBM2e | 96 GB HBM3 |
| Mem bandwidth | 3,700 GB/s | 2,900 GB/s |
| On-chip SRAM | 96 MB (12.8 TB/s) | 224 MB scratchpad |
| Interconnect | 24x 200GbE RoCEv2 (4.8 Tbps) | NeuronLink-v3 (1,280 GB/s) |
| TDP | 900W (OAM) | ~500W |
Gaudi 3 is unique in having 24 on-die RDMA/Ethernet ports — no external NIC needed. It uses a dual-chiplet design (two 5nm dies). Trainium2 has 16 dedicated Collective Communication cores for distributed training. A Trn2 instance has 16 chips (20.8 PFLOPS FP8); UltraServer packs 64 chips (83.2 PFLOPS).
Gaudi 3 achieves identical BF16 and FP8 throughput (1,835 TFLOPS) — unusual, as most accelerators halve throughput going from FP8 to BF16. Its integrated Ethernet eliminates the need for InfiniBand, reducing infrastructure cost. Trainium2's NeuronLink interconnect enables tight coupling of up to 64 chips, and its configurable FP8 (cFP8) format allows custom exponent/mantissa splits.
```python
# Gaudi on AWS DL2 instances
pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.Gaudi3(),
)

# Trainium on AWS Trn2 instances
pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.Trainium2(),
    image=sky.Image(pip=["torch-neuronx"]),
)
```
| Factory | Memory | Notes |
|---|---|---|
| `Gaudi3()` | 128GB HBM2e | Latest generation. |
| `Gaudi2()` | 96GB HBM2e | 2x performance vs Gaudi. |
| `Gaudi()` | — | First gen. |
| `Trainium3()` | 128GB | Latest generation. |
| `Trainium2()` | 96GB | 4x performance vs v1. |
| `Trainium1()` | 32GB | First gen. |
AWS Inferentia¶
Custom silicon for cost-effective inference:
```python
pool = sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.Inferentia2(),
    image=sky.Image(pip=["torch-neuronx"]),
)
```
| Factory | Memory | Instance | Notes |
|---|---|---|---|
| `Inferentia1()` | 8GB | inf1.* | Single model serving. |
| `Inferentia2()` | 32GB | inf2.* | High throughput. |
Related topics¶
- Compare accelerators — interactive side-by-side comparison with charts
- Providers — AWS, RunPod, VastAI, Verda, and Container configuration
- Distributed Training — multi-node training guides
- API Reference — complete accelerator API