# GPU-accelerated scikit-learn with cuML
scikit-learn runs on the CPU. For large datasets, training algorithms like RandomForest or KNN become a bottleneck — minutes or hours spent on cross-validation and hyperparameter search. NVIDIA cuML provides GPU-backed implementations of popular sklearn estimators with the same API. Enable its accelerator, and the same code runs on the GPU with speedups of 50x to 175x.
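Outside of Skyward, cuML's accelerator mode is typically enabled by running an unmodified script under the `cuml.accel` module. The command shape below follows recent cuML releases — check the RAPIDS docs for your version; the script name is a placeholder:

```shell
# Run an unchanged sklearn script under cuML's accelerator (requires a
# CUDA GPU and the cuml-cu12 package). In notebooks: %load_ext cuml.accel
python -m cuml.accel your_sklearn_script.py
```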
Skyward makes this practical even if you don't have a local GPU. Provision a GPU instance on the cloud, send your code there with @sky.function, and cuML handles the rest. The cuml plugin installs cuML and configures the RAPIDS package indexes automatically.
## The dataset
A 20,000-sample subset of MNIST, downloaded directly on the remote worker — no need to serialize and ship 784-dimensional arrays over the wire:
```python
def load_mnist(n_samples: int):
    """Load a subset of MNIST on the remote worker."""
    import numpy as np
    from sklearn.datasets import fetch_openml

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)  # noqa: N806
    X = (X[:n_samples] / 255.0).astype(np.float32)  # noqa: N806
    y = y[:n_samples].astype(np.int32)
    return X, y
```
`load_mnist` is a plain function (not decorated with `@sky.function`) that the GPU task calls. Since it's defined at module level, cloudpickle captures it alongside the decorated functions. The data is fetched from OpenML on the remote machine.
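The capture-by-value behavior can be sketched with cloudpickle directly. This is illustrative only — `helper` and `task` are hypothetical stand-ins for `load_mnist` and a decorated function, and the sketch assumes cloudpickle is installed:

```python
import cloudpickle


def helper(x):
    """Module-level helper, analogous to load_mnist."""
    return x * 2


def task(x):
    """Analogous to a @sky.function body that calls the helper."""
    return helper(x) + 1


# For functions defined in __main__, cloudpickle serializes by value,
# carrying referenced globals like helper inside the payload, so a remote
# worker can deserialize and run task without importing this module.
payload = cloudpickle.dumps(task)
restored = cloudpickle.loads(payload)
print(restored(3))  # 2*3 + 1 = 7
```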
## GPU version with cuML
The GPU version uses standard scikit-learn imports — cuML's zero-code-change acceleration intercepts sklearn calls and routes them to the GPU:
```python
@sky.function
def train_on_gpu(n_samples: int):
    """Train the same RandomForest, but with cuML GPU acceleration."""
    from time import perf_counter

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_mnist(n_samples)  # noqa: N806
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)

    start = perf_counter()
    scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)
    elapsed = perf_counter() - start

    return {"accuracy": scores.mean(), "time": elapsed}
```
All sklearn imports are inside the function because @sky.function serializes the function and sends it to a remote worker. The worker needs to resolve imports in its own environment. The function returns both accuracy and wall-clock time so we can compare against a CPU baseline.
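For comparison, a CPU baseline with the same estimator and CV protocol might look like the sketch below. `train_on_cpu` and the synthetic stand-in dataset are illustrative assumptions, not part of the guide — the real baseline would call `load_mnist` on a worker instead:

```python
from time import perf_counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def train_on_cpu(X, y):  # noqa: N803
    """Same estimator and CV protocol as train_on_gpu, but CPU-only."""
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
    start = perf_counter()
    scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)
    return {"accuracy": scores.mean(), "time": perf_counter() - start}


# Small synthetic stand-in so the sketch runs anywhere without a download.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
result = train_on_cpu(X.astype(np.float32), y.astype(np.int32))
```

Returning the same `{"accuracy", "time"}` dict keeps the two runs directly comparable.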
## Running with plugins
The cuml plugin handles installing cuml-cu12 and configuring the RAPIDS pip indexes. Combined with the sklearn plugin, which adds scikit-learn and joblib:
```python
with sky.Compute(
    provider=sky.AWS(),
    accelerator=sky.accelerators.L4(),
    nodes=1,
    plugins=[
        sky.plugins.cuml(),
        sky.plugins.sklearn(),
    ],
) as compute:
```
The plugins handle all dependency management — no need to manually specify pip packages or index URLs in the Image. The cuml plugin knows which RAPIDS indexes to configure for the CUDA 12 packages.
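For reference, installing the CUDA 12 build by hand would look roughly like this — the index URL follows the RAPIDS pip install docs, and the exact package set varies by cuML version:

```shell
# Roughly what the cuml plugin automates: pull cuml-cu12 from NVIDIA's
# extra package index alongside regular PyPI packages.
pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com
```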
## Results
```python
result = train_on_gpu(N_SAMPLES) >> compute
print(f"accuracy: {result['accuracy']:.2%}, time: {result['time']:.1f}s")
```
Expect accuracy to be roughly equivalent between CPU and GPU — cuML implements the same algorithms, not approximations. The wall-clock time is where the difference shows: cuML on an L4 can be significantly faster depending on the algorithm and dataset size.
## Run the full example
```shell
git clone https://github.com/gabfssilva/skyward.git
cd skyward
uv run python guides/16_cuml_acceleration.py
```
What you learned:
- `plugins=[sky.plugins.cuml(), sky.plugins.sklearn()]` — the cuml plugin installs RAPIDS packages and configures indexes; the sklearn plugin adds scikit-learn.
- cuML estimators work with sklearn utilities — `cross_val_score`, `GridSearchCV`, and `Pipeline` all accept cuML estimators.
- Zero-code-change acceleration — cuML intercepts sklearn calls and routes them to the GPU transparently.
- Plugins handle dependencies — no manual pip packages or index URLs needed in the Image.
- Load data on the worker — avoid serializing large arrays; download datasets directly on the remote machine.