Last updated: April 3, 2023.
Note: This example does not work with ZenML versions newer than 0.36.1.
We’re really proud of our Kubeflow integration. It gives you a ton of power and flexibility and is a production-ready tool. But we also know that for many of you it’s one step too many. Setting up a Kubernetes cluster is probably nobody’s ideal way to spend their time, and it certainly requires some time investment to maintain.
We thought this was a concern worth addressing, so I worked to build an alternative during the ZenHack Day we recently ran. GitHub Actions is a platform that allows you to execute arbitrary software development workflows right in your GitHub repository. It is most commonly used for CI/CD pipelines, but with the new GitHub Actions orchestrator, ZenML now enables you to easily run and schedule your machine learning pipelines as GitHub Actions workflows.
Most technical decisions come with various kinds of tradeoffs, and it’s worth taking a moment to assess why you might want to use the GitHub Actions orchestrator in the first place.
Let’s start with the downsides:
So what’s the point, then? These are indeed some serious downsides. First and foremost, there’s the cost: running your pipelines on GitHub Actions is free. If you’re interested in running your pipelines in the cloud on serverless infrastructure, there’s probably no easier way to get started than to try out this orchestrator.
You are also spared the pain of maintaining a Kubernetes cluster. Once you’ve configured it (see below for instructions), there’s basically nothing you have to do on an ongoing basis. I hope you’re sold on trying it out and want to get started, so let’s not hold off any longer.
Note that some of the commands in this tutorial rely on environment variables or a specific working directory set up by previous commands, so be sure to run them all in the same shell. In this tutorial we’re going to use Microsoft’s Azure platform for our cloud storage and MySQL database, but everything works just as well on AWS or GCP.
This tutorial assumes that you have:
If you don’t have an Azure account yet, go to https://azure.microsoft.com/en-gb/free/ and create one.
Resource groups are a concept in Azure that allows us to bundle different resources that share a similar lifecycle. We’ll create a new resource group for this tutorial so we’ll be able to differentiate them from other resources in our account and easily delete them at the end.
Go to the Azure portal, click the hamburger button in the top left to open up the portal menu. Then, hover over the Resource groups section until a popup appears and click on the + Create button:
Select a region and enter a name for your resource group before clicking on Review + create:
Verify that all the information is correct and click on Create:
An Azure storage account is a grouping of Azure data storage objects which also provides a namespace and authentication options to access them. We’ll need a storage account to hold the blob storage container we’ll create in the next step.
Open up the portal menu again, but this time hover over the Storage accounts section and click on the + Create button in the popup once it appears:
Select your previously created resource group, a region and a globally unique name and then click on Review + create:
Make sure that all the values are correct and click on Create:
Wait until the deployment is finished and click on Go to resource to open up your newly created storage account:
In the left menu, select Access keys:
Click on Show keys, and once the keys are visible, note down the storage account name and the value of the Key field of either key1 or key2. We’re going to use them for the <STORAGE_ACCOUNT_NAME> and <STORAGE_ACCOUNT_KEY> placeholders later.
Next, we’re going to create an Azure Blob Storage Container. It will be used by ZenML to store the output artifacts of all our pipeline steps.
To do so, select Containers in the Data storage section of the storage account:
Then click the + Container button on the top to create a new container:
Choose a name for the container and note it down. We’re going to use it later for the <BLOB_STORAGE_CONTAINER_NAME> placeholder. Then create the container by clicking the Create button.
Next up, we’ll need to create a GitHub Personal Access Token that ZenML will use to authenticate with the GitHub API in order to store secrets and upload Docker images.
Go to https://github.com, click on your profile image in the top right corner and select Settings:
Scroll to the bottom and click on Developer Settings on the left side:
Select Personal access tokens and click on Generate new token:
Give your token a descriptive name for future reference and select the repo and write:packages scopes:
Scroll to the bottom and click on Generate token. This will bring you to a page that allows you to copy your newly generated token:
Now that we’ve got our token, let’s store it in an environment variable for future steps. We’ll also store the GitHub username that the token was created for. Replace the <PLACEHOLDERS> in the following command and run it:
export GITHUB_USERNAME=<GITHUB_USERNAME>
export GITHUB_AUTHENTICATION_TOKEN=<PERSONAL_ACCESS_TOKEN>
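Since several later steps depend on these two variables, here is a tiny, optional sanity check in plain Python (the helper function is purely illustrative, not part of ZenML):

```python
import os

def missing_env_vars(required):
    """Return the names of any required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_env_vars(["GITHUB_USERNAME", "GITHUB_AUTHENTICATION_TOKEN"])
if missing:
    print("Please export before continuing:", ", ".join(missing))
```

If this prints anything, re-run the exports above in the same shell before moving on.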
When we run our pipeline later, ZenML will build a Docker image for us which will be used to execute the steps of the pipeline. In order to access this image inside the GitHub Actions workflow, we’ll push it to the GitHub container registry. Running the following command will use the personal access token created in the previous step to authenticate our local Docker client with this container registry:
echo "$GITHUB_AUTHENTICATION_TOKEN" | docker login ghcr.io -u "$GITHUB_USERNAME" --password-stdin
Note: If you run into issues during this step, make sure you’ve set the environment variables in the previous step and Docker is running on your machine.
Time to fork and clone an example repository that contains a very simple ZenML pipeline which trains a scikit-learn SVC classifier on the digits dataset.
If you’re new to ZenML, let’s quickly go over some basic concepts that help you understand what the code in this repository is doing:
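At its core, the repository’s pipeline just loads data, trains a model, and evaluates it. Stripped of ZenML’s decorators and orchestration, the underlying logic looks roughly like this plain scikit-learn sketch (the structure and function names here are illustrative assumptions, not the repo’s actual code):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def importer():
    # Load the digits dataset and split it into train/test sets
    digits = load_digits()
    return train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)

def trainer(X_train, y_train):
    # Fit a support vector classifier on the training split
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model

def evaluator(model, X_test, y_test):
    # Report accuracy on the held-out test split
    return model.score(X_test, y_test)

X_train, X_test, y_train, y_test = importer()
model = trainer(X_train, y_train)
accuracy = evaluator(model, X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

In a ZenML pipeline, each of these functions becomes a step and the chain of calls becomes the pipeline, which is what allows ZenML to package the whole thing into a Docker image and run it on any orchestrator.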
Let’s get going:
Click on Fork in the top right:
Click on Create fork:
git clone git@github.com:"$GITHUB_USERNAME"/github-actions-orchestrator-tutorial.git
# or `git clone https://github.com/"$GITHUB_USERNAME"/github-actions-orchestrator-tutorial.git` if you want to authenticate with HTTPS instead of SSH
cd github-actions-orchestrator-tutorial
Now that we’re done setting up and configuring all our infrastructure and external dependencies, it’s time to install ZenML and configure a ZenML stack that connects all these elements together.
For advanced use cases, such as running a remote orchestrator like Vertex AI or sharing stacks and pipeline information with a team, we need a separate, non-local ZenML server that is accessible from your machine as well as from any stack components that need access to it. You can read more about this use case here.
There are two different ways to get access to a remote ZenML server.
Let’s install ZenML and all the additional packages that we’re going to need to run our pipeline:
pip install zenml
zenml integration install -y github azure sklearn
We’re also going to initialize a ZenML repository to indicate which directories and files ZenML should include when building Docker images:
zenml init
Once the deployment is finished, let’s connect to it by running the following command and logging in with the username and password you set during the deployment phase:
zenml connect --url=<DEPLOYMENT_URL>
A ZenML stack consists of many components which all play a role in making your ML pipeline run in a smooth and reproducible manner. Let’s register all the components that we’re going to need for this tutorial!
zenml orchestrator register github_orchestrator --flavor=github
zenml container-registry register github_container_registry \
--flavor=github \
--automatic_token_authentication=true \
--uri=ghcr.io/"$GITHUB_USERNAME"
zenml secrets-manager register github_secrets_manager \
--flavor=github \
--owner="$GITHUB_USERNAME" \
--repository=github-actions-orchestrator-tutorial
Replace the <BLOB_STORAGE_CONTAINER_NAME> placeholder in the following command with the name we saved when creating the blob storage container and run it:
# The `az://` in front of the container name tells ZenML that this is an Azure container that it needs to read from/write to
zenml artifact-store register azure_artifact_store \
--flavor=azure \
--authentication_secret=azure_store_auth \
--path=az://<BLOB_STORAGE_CONTAINER_NAME>
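The `az://` prefix behaves like any other URI scheme, so the container name is simply the authority part of the path. As a quick illustration using Python’s standard library (this is not ZenML’s actual parsing code):

```python
from urllib.parse import urlparse

def container_from_path(path):
    """Extract the container name from an az:// artifact store path."""
    parsed = urlparse(path)
    if parsed.scheme != "az":
        raise ValueError(f"Expected an az:// path, got {path!r}")
    return parsed.netloc

print(container_from_path("az://my-zenml-artifacts"))  # my-zenml-artifacts
```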
These are all the components that we’re going to use for this tutorial, but ZenML offers additional components like:
With all components registered, we can now create and activate our ZenML stack. This makes sure ZenML knows which components to use when we’re going to run our pipeline later.
zenml stack register github_actions_stack \
-o github_orchestrator \
-x github_secrets_manager \
-c github_container_registry \
-a azure_artifact_store \
--set
Once the stack is active, we can register the secret that ZenML needs to authenticate with our artifact store.
We’re going to need the storage account name and key that we saved when we created our storage account earlier:
Replace the <PLACEHOLDERS> in the following command with those concrete values and run it:
zenml secrets-manager secret register azure_store_auth \
--schema=azure \
--account_name=<STORAGE_ACCOUNT_NAME> \
--account_key=<STORAGE_ACCOUNT_KEY>
That was quite a lot of setup, but luckily we’re (almost) done now. Let’s execute the Python script that “runs” our pipeline and quickly discuss what it is doing:
python run.py
This script runs a ZenML pipeline using our active GitHub stack. The orchestrator will now build a Docker image with our pipeline code and all the requirements installed and push it to the GitHub container registry.
Once the image is pushed, the orchestrator will write a GitHub Actions workflow file to the .github/workflows directory. Pushing this workflow file will trigger the actual execution of our ZenML pipeline. We’ll explain later how to automate this step, but for our first pipeline run there is one last configuration step: we need to make sure our GitHub Actions are allowed to pull the Docker image that ZenML just pushed.
Head to https://github.com/users/<GITHUB_USERNAME>/packages/container/package/zenml-github-actions (replace <GITHUB_USERNAME> with your GitHub username) and select Package settings on the right side:
In the Manage Actions access section, click on Add Repository:
Search for your forked repository github-actions-orchestrator-tutorial and give it read permissions. Your package settings should then look like this:
Done! Now all that’s left to do is commit and push the workflow file:
git add .github/workflows
git commit -m "Add ZenML pipeline workflow"
git push
If we now check out the GitHub Actions for our repository at https://github.com/<GITHUB_USERNAME>/github-actions-orchestrator-tutorial/actions, we should see our pipeline running! 🎉
If we want the orchestrator to automatically commit and push the workflow file for us, we can enable it with the following command:
zenml orchestrator update github_orchestrator --push=true
After this update, calling python run.py should automatically build and push a Docker image, then commit and push the workflow file, which will in turn run our pipeline on GitHub Actions.
Once we’re done experimenting, let’s delete all the resources we created on Azure so we don’t waste any compute/money. As we’ve bundled everything into one resource group, this step is very easy. Go to the Azure portal and select your resource group in the list of resources:
Next, click on Delete resource group at the top:
In the popup on the right side, type the resource group name and click Delete:
This will take a few minutes, but after it’s finished all the resources we created should be gone.
If you have any questions or feedback regarding this tutorial, let us know here or join our weekly community hour. If you want to know more about ZenML or see more examples, check out our docs and examples, or join our Slack.
[Image Credit: Photo by Roman Synkevych on Unsplash]