Scikit grid search
Hyperparameter search is embarrassingly parallel — each candidate configuration can be evaluated independently. scikit-learn's GridSearchCV already supports parallelism via n_jobs, but it's limited to the cores on a single machine. Skyward's sklearn plugin extends this to a cluster: it replaces joblib's default backend with a distributed one, so n_jobs=-1 distributes cross-validation fits across cloud instances instead of local threads.
The dataset
Load digits and split into train/test:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Defining the search space
Use a Pipeline with a list-of-dicts param_grid to search over both estimators and their hyperparameters:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("clf", SVC())])

param_grid = [
    {
        "clf": [RandomForestClassifier(random_state=42)],
        "clf__n_estimators": [50, 100, 200],
        "clf__max_depth": [None, 10, 20],
    },
    {
        "clf": [GradientBoostingClassifier(random_state=42)],
        "clf__n_estimators": [50, 100],
        "clf__learning_rate": [0.01, 0.1, 0.2],
    },
    {
        "clf": [SVC()],
        "clf__C": [0.1, 1, 10],
        "clf__kernel": ["rbf", "poly"],
    },
]
Each dict defines a grid for one estimator family. The "clf" key swaps the estimator itself (RandomForest, GradientBoosting, SVC), while "clf__param" tunes its hyperparameters. scikit-learn expands all combinations — this grid produces 21 candidates, each cross-validated 5 times, for 105 total fits. On a single machine, these run sequentially or across a few cores. On a cluster, they run across dozens of workers simultaneously.
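GridSearchCV expands a list-of-dicts grid with scikit-learn's ParameterGrid, so the candidate count claimed above can be verified directly. A quick check (estimator instances stand in for the pipeline step, exactly as in the grid above):

```python
# Counting candidates with ParameterGrid, which GridSearchCV uses
# internally to expand a list-of-dicts grid.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

param_grid = [
    {"clf": [RandomForestClassifier(random_state=42)],
     "clf__n_estimators": [50, 100, 200],
     "clf__max_depth": [None, 10, 20]},        # 1 * 3 * 3 = 9 candidates
    {"clf": [GradientBoostingClassifier(random_state=42)],
     "clf__n_estimators": [50, 100],
     "clf__learning_rate": [0.01, 0.1, 0.2]},  # 1 * 2 * 3 = 6 candidates
    {"clf": [SVC()],
     "clf__C": [0.1, 1, 10],
     "clf__kernel": ["rbf", "poly"]},          # 1 * 3 * 2 = 6 candidates
]

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, n_candidates * 5)  # 21 candidates, 105 fits with cv=5
```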
Distributed search with the sklearn plugin
The sklearn plugin replaces joblib's default backend so that Parallel(n_jobs=-1) — which GridSearchCV uses internally — distributes work across cloud instances:
import skyward as sky
from sklearn.model_selection import GridSearchCV

with sky.Compute(
    provider=sky.AWS(),
    nodes=3,
    worker=sky.Worker(concurrency=4),
    plugins=[sky.plugins.sklearn()],
):
    grid_search = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1,
        verbose=3,
    )
    grid_search.fit(X_train, y_train)
Inside the pool block, every joblib Parallel call is intercepted and routed to the Skyward cluster. Each fit is serialized with cloudpickle, sent to a worker, executed, and the result returned. The worker parameter accepts a Worker dataclass that controls per-node execution — Worker(concurrency=4) means each node runs 4 fits simultaneously. With 3 nodes and concurrency=4, you get 12 parallel fits.
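The serialize-ship-execute round trip can be sketched with cloudpickle directly (a minimal illustration of the mechanism, not Skyward's actual wire format). Unlike the standard pickle module, cloudpickle handles closures and locally defined functions, which is why it is the usual choice for shipping fit tasks:

```python
import cloudpickle
from sklearn.svm import SVC

# Simulate shipping a fit task to a worker: serialize a closure that
# builds an estimator, then deserialize and call it as the worker would.
def make_task(C):
    def task():
        return SVC(C=C)
    return task

payload = cloudpickle.dumps(make_task(10.0))  # bytes sent to a worker
est = cloudpickle.loads(payload)()            # executed worker-side
print(est.C)  # prints 10.0
```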
The sklearn plugin registers the custom joblib backend on enter and restores the default on exit. The scikit-learn API is completely unchanged — GridSearchCV, Pipeline, cross_val_score all work as documented.
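The swap itself rests on joblib's pluggable backend mechanism: any Parallel call picks up whatever backend is active in the enclosing context. A minimal illustration using joblib's built-in threading backend (standing in for Skyward's distributed one):

```python
from joblib import Parallel, delayed, parallel_backend

# Any Parallel call inside this context uses the active backend.
# This context-based swap is the same hook the sklearn plugin uses.
with parallel_backend("threading", n_jobs=2):
    squares = Parallel()(delayed(pow)(i, 2) for i in range(5))

print(squares)  # [0, 1, 4, 9, 16]
```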
Results
After the search completes, access results through the standard scikit-learn interface:
best_clf = grid_search.best_params_["clf"]
print(f"Best: {type(best_clf).__name__}, CV={grid_search.best_score_:.2%}")
print(f"Test: {grid_search.score(X_test, y_test):.2%}")
The grid search object behaves exactly as it would in a local run — best_params_, best_score_, cv_results_ are all populated. The only difference is that the 105 fits ran on a cluster instead of a single machine.
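Beyond best_params_, the full picture lives in cv_results_, a dict of arrays that loads straight into a pandas DataFrame. A self-contained sketch using a reduced, SVC-only grid so it runs locally in seconds (column names like param_C and rank_test_score are standard scikit-learn):

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3, scoring="accuracy")
search.fit(X, y)

# One row per candidate, ranked by mean cross-validated accuracy.
df = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(df[["param_C", "mean_test_score", "std_test_score"]].to_string(index=False))
```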
Run the full example
git clone https://github.com/gabfssilva/skyward.git
cd skyward
uv run python guides/09_scikit_grid_search.py
What you learned:

- plugins=[sky.plugins.sklearn()] replaces joblib's backend with a distributed one — n_jobs=-1 uses all cloud workers.
- Standard scikit-learn API — GridSearchCV, Pipeline, cross_val_score work unchanged.
- worker=Worker(concurrency=N) controls parallelism per node — total parallel fits = nodes × concurrency.
- Pipeline + param_grid — search over different estimators and their hyperparameters in one run.