Scikit grid search

Hyperparameter search is embarrassingly parallel — each candidate configuration can be evaluated independently. scikit-learn's GridSearchCV already supports parallelism via n_jobs, but it's limited to the cores on a single machine. Skyward's sklearn plugin extends this to a cluster: it replaces joblib's default backend with a distributed one, so n_jobs=-1 distributes cross-validation fits across cloud instances instead of local threads.

The dataset

Load digits and split into train/test:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
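As a quick sanity check on the split (the digits dataset has 1,797 samples of 64 features, so a 0.2 test fraction yields 360 test rows):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 80/20 split of 1797 samples: 1437 train rows, 360 test rows
print(X_train.shape, X_test.shape)  # (1437, 64) (360, 64)
```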

Defining the search space

Use a Pipeline with a list-of-dicts param_grid to search over both estimators and their hyperparameters:

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("clf", SVC())])

param_grid = [
    {
        "clf": [RandomForestClassifier(random_state=42)],
        "clf__n_estimators": [50, 100, 200],
        "clf__max_depth": [None, 10, 20],
    },
    {
        "clf": [GradientBoostingClassifier(random_state=42)],
        "clf__n_estimators": [50, 100],
        "clf__learning_rate": [0.01, 0.1, 0.2],
    },
    {
        "clf": [SVC()],
        "clf__C": [0.1, 1, 10],
        "clf__kernel": ["rbf", "poly"],
    },
]

Each dict defines a grid for one estimator family. The "clf" key swaps the estimator itself (RandomForest, GradientBoosting, SVC), while "clf__param" tunes its hyperparameters. scikit-learn expands all combinations — this grid produces 21 candidates, each cross-validated 5 times, for 105 total fits. On a single machine, these run sequentially or across a few cores. On a cluster, they run across dozens of workers simultaneously.
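You can verify the candidate count without fitting anything: scikit-learn's ParameterGrid accepts the same list-of-dicts format and enumerates the expanded combinations.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

param_grid = [
    {"clf": [RandomForestClassifier(random_state=42)],
     "clf__n_estimators": [50, 100, 200],
     "clf__max_depth": [None, 10, 20]},
    {"clf": [GradientBoostingClassifier(random_state=42)],
     "clf__n_estimators": [50, 100],
     "clf__learning_rate": [0.01, 0.1, 0.2]},
    {"clf": [SVC()],
     "clf__C": [0.1, 1, 10],
     "clf__kernel": ["rbf", "poly"]},
]

# 3x3 + 2x3 + 3x2 = 21 candidates; with cv=5, 105 fits
candidates = len(ParameterGrid(param_grid))
print(candidates, candidates * 5)  # 21 105
```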

Distributed search with the sklearn plugin

The sklearn plugin replaces joblib's default backend so that Parallel(n_jobs=-1) — which GridSearchCV uses internally — distributes work across cloud instances:

with sky.Compute(
    provider=sky.AWS(),
    nodes=3,
    worker=sky.Worker(concurrency=4),
    plugins=[sky.plugins.sklearn()],
):
    grid_search = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1,
        verbose=3,
    )
    grid_search.fit(X_train, y_train)

Inside the pool block, every joblib Parallel call is intercepted and routed to the Skyward cluster. Each fit is serialized with cloudpickle, sent to a worker, executed, and the result returned. The worker parameter accepts a Worker dataclass that controls per-node execution — Worker(concurrency=4) means each node runs 4 fits simultaneously. With 3 nodes and concurrency=4, you get 12 parallel fits.

The sklearn plugin registers the custom joblib backend on enter and restores the default on exit. The scikit-learn API is completely unchanged — GridSearchCV, Pipeline, cross_val_score all work as documented.
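The mechanism the plugin builds on is joblib's pluggable backend system: any code that calls Parallel internally, GridSearchCV included, picks up whichever backend is active. As a purely local illustration (using joblib's built-in threading backend, not Skyward's distributed one):

```python
from joblib import Parallel, delayed, parallel_backend

# Inside this context, Parallel() routes work to the threading backend;
# a plugin like Skyward's swaps in a distributed backend the same way.
with parallel_backend("threading", n_jobs=2):
    squares = Parallel()(delayed(pow)(i, 2) for i in range(5))

print(squares)  # [0, 1, 4, 9, 16]
```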

Results

After the search completes, access results through the standard scikit-learn interface:

best_clf = grid_search.best_params_["clf"]
print(f"Best: {type(best_clf).__name__}, CV={grid_search.best_score_:.2%}")
print(f"Test: {grid_search.score(X_test, y_test):.2%}")

The grid search object behaves exactly as it would in a local run — best_params_, best_score_, cv_results_ are all populated. The only difference is that the 105 fits ran on a cluster instead of a single machine.
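Beyond best_params_ and best_score_, cv_results_ holds per-candidate details. A small local run (a plain SVC grid for illustration, not the cluster search above) shows the shape of that data:

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3).fit(X, y)

# One row per candidate: the parameter value, its mean CV score, and its rank
df = pd.DataFrame(gs.cv_results_)[["param_C", "mean_test_score", "rank_test_score"]]
print(df.sort_values("rank_test_score"))
```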

Run the full example

git clone https://github.com/gabfssilva/skyward.git
cd skyward
uv run python guides/09_scikit_grid_search.py

What you learned:

  • plugins=[sky.plugins.sklearn()] replaces joblib's backend with a distributed one — n_jobs=-1 uses all cloud workers.
  • Standard scikit-learn API — GridSearchCV, Pipeline, cross_val_score work unchanged.
  • worker=Worker(concurrency=N) controls parallelism per node — total parallel fits = nodes × concurrency.
  • Pipeline + param_grid — search over different estimators and their hyperparameters in one run.