Last updated: November 14, 2022.
We’re really proud of our Kubeflow integration. It gives you a ton of power and flexibility and is a production-ready tool. But we also know that for many of you it’s one step too many. Setting up a Kubernetes cluster is probably nobody’s ideal way to spend their time, and it certainly requires some time investment to maintain.
We thought this was a concern worth addressing, so I worked to build an alternative during the ZenHack Day we recently ran. GitHub Actions is a platform that allows you to execute arbitrary software development workflows right in your GitHub repository. It is most commonly used for CI/CD pipelines, but with the GitHub Actions orchestrator, ZenML now enables you to easily run and schedule your machine learning pipelines as GitHub Actions workflows.
Most technical decisions come with various kinds of tradeoffs, and it’s worth taking a moment to assess why you might want to use the GitHub Actions orchestrator in the first place.
Let’s start with the downsides:
So what’s the point, then? These are indeed some serious downsides. First and foremost, there’s the cost: running your pipelines on GitHub Actions is free. If you’re interested in running your pipelines in the cloud on serverless infrastructure, there’s probably no easier way to get started than to try out this orchestrator.
You are also spared the pain of maintaining a Kubernetes cluster. Once you’ve configured it (see below for instructions) there’s basically nothing you have to do on an ongoing basis. I hope you’re sold on trying it out and want to get started, so let’s not hold off any more.
(Note that some of the commands in this tutorial rely on environment variables or a specific working directory from previous commands, so be sure to run them in the same shell.) In this tutorial we’re going to use Microsoft’s Azure platform for cloud storage and our MySQL database, but it works just as well on AWS or GCP.
This tutorial assumes that you have:
If you don’t have an Azure account yet, go to https://azure.microsoft.com/en-gb/free/ and create one.
Resource groups are a concept in Azure that allows us to bundle different resources that share a similar lifecycle. We’ll create a new resource group for this tutorial so that we can tell its resources apart from the others in our account and easily delete them at the end.
Go to the Azure portal and click the hamburger button in the top left to open up the portal menu. Then, hover over the Resource groups section until a popup appears and click on the + Create button:
Select a region and enter a name for your resource group before clicking on Review + create:
Verify that all the information is correct and click on Create.
An Azure storage account is a grouping of Azure data storage objects which also provides a namespace and authentication options to access them. We’ll need a storage account to hold the blob storage container we’ll create in the next step.
Open up the portal menu again, but this time hover over the Storage accounts section and click on the + Create button in the popup once it appears:
Select your previously created resource group, a region and a globally unique name and then click on Review + create:
Make sure that all the values are correct and click on Create.
Wait until the deployment is finished and click on Go to resource to open up your newly created storage account:
In the left menu, select Access keys and click on Show keys. Once the keys are visible, note down the storage account name and the value of the Key field of either key1 or key2. We’re going to use them for the <STORAGE_ACCOUNT_NAME> and <STORAGE_ACCOUNT_KEY> placeholders later.
Next, we’re going to create an Azure Blob Storage Container. It will be used by ZenML to store the output artifacts of all our pipeline steps.
To do so, select Containers in the Data storage section of the storage account:
Then click the + Container button on the top to create a new container:
Choose a name for the container and note it down. We’re going to use it later for the <BLOB_STORAGE_CONTAINER_NAME> placeholder. Then create the container by clicking the Create button.
Next up, we’ll need to create a GitHub Personal Access Token that ZenML will use to authenticate with the GitHub API in order to store secrets and upload Docker images.
Go to https://github.com, click on your profile image in the top right corner and select Settings:
Scroll to the bottom and click on Developer Settings on the left side:
Select Personal access tokens and click on Generate new token:
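Give your token a descriptive name for future reference and select the scopes it needs for the tasks mentioned above (storing repository secrets and uploading Docker images).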
Scroll to the bottom and click on Generate token. This will bring you to a page that allows you to copy your newly generated token:
Now that we’ve got our token, let’s store it in an environment variable for future steps. We’ll also store the GitHub username that this token was created for. Replace the <PLACEHOLDERS> in the following command and run it:
export GITHUB_USERNAME=<GITHUB_USERNAME>
export GITHUB_AUTHENTICATION_TOKEN=<PERSONAL_ACCESS_TOKEN>
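Since later commands silently depend on these two variables, it can help to confirm they are set before moving on. The following is only a minimal sketch in Python, not part of ZenML: the helper name `find_missing_variables` is hypothetical, and the only names taken from this tutorial are the two variables exported above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARIABLES = ("GITHUB_USERNAME", "GITHUB_AUTHENTICATION_TOKEN")


def find_missing_variables(environment, required=REQUIRED_VARIABLES):
    """Return the required names that are unset, empty or still a <PLACEHOLDER>."""
    missing = []
    for name in required:
        value = environment.get(name, "").strip()
        # A value such as "<GITHUB_USERNAME>" means the placeholder was never replaced.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_variables(os.environ)
    if missing:
        print("Missing or unreplaced: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Run it in the same shell as the export commands, because a new shell will not see the variables.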
When we run our pipeline later, ZenML will build a Docker image for us which will be used to execute the steps of the pipeline. In order to access this image inside a GitHub Actions workflow, we’ll push it to the GitHub container registry. Running the following command will use the personal access token created in the previous step to authenticate our local Docker client with this container registry:
echo "$GITHUB_AUTHENTICATION_TOKEN" | docker login ghcr.io -u "$GITHUB_USERNAME" --password-stdin
Note: If you run into issues during this step, make sure you’ve set the environment variables in the previous step and Docker is running on your machine.
If you’re new to ZenML, let’s quickly go over some basic concepts that help you understand what the code in this repository is doing:
Let’s get going:
Open the github-actions-orchestrator-tutorial repository on GitHub and click on Fork in the top right:
git clone git@github.com:"$GITHUB_USERNAME"/github-actions-orchestrator-tutorial.git
# or `git clone https://github.com/"$GITHUB_USERNAME"/github-actions-orchestrator-tutorial.git` if you want to authenticate with HTTPS instead of SSH
cd github-actions-orchestrator-tutorial
Now that we’re done setting up and configuring all our infrastructure and external dependencies, it’s time to install ZenML and configure a ZenML stack that connects all these elements together.
For advanced use cases, such as running a remote orchestrator like Vertex AI or sharing stacks and pipeline information with a team, we need a separate, non-local ZenML Server that is accessible from your machine as well as from all stack components that may need access to it. Read more about this use case here.
There are two ways to get access to a remote ZenML Server.
Let’s install ZenML and all the additional packages that we’re going to need to run our pipeline:
pip install zenml
zenml integration install -y github azure sklearn
We’re also going to initialize a ZenML repository to indicate which directories and files ZenML should include when building Docker images:
zenml init
Once the deployment is finished, let’s connect to it by running the following command and logging in with the username and password you set during the deployment phase:
zenml connect --url=<DEPLOYMENT_URL>
A ZenML stack consists of many components which all play a role in making your ML pipeline run in a smooth and reproducible manner. Let’s register all the components that we’re going to need for this tutorial!
zenml orchestrator register github_orchestrator --flavor=github
zenml container-registry register github_container_registry \
    --flavor=github \
    --automatic_token_authentication=true \
    --uri=ghcr.io/"$GITHUB_USERNAME"
zenml secrets-manager register github_secrets_manager \
    --flavor=github \
    --owner="$GITHUB_USERNAME" \
    --repository=github-actions-orchestrator-tutorial
Replace the <BLOB_STORAGE_CONTAINER_NAME> placeholder in the following command with the container name we noted down when creating the blob storage container and run it:
# The `az://` in front of the container name tells ZenML that this is an Azure container that it needs to read from/write to
zenml artifact-store register azure_artifact_store \
    --flavor=azure \
    --authentication_secret=azure_store_auth \
    --path=az://<BLOB_STORAGE_CONTAINER_NAME>
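A typo in the container name only surfaces later, when the pipeline tries to write artifacts. As a rough sketch, the path could be assembled and checked locally first. The helper `artifact_store_path` is hypothetical and not a ZenML function, and the rules it encodes are Azure’s container naming rules (3 to 63 characters, lowercase letters, digits and single hyphens, starting and ending with a letter or digit).

```python
import re

# Azure container names: lowercase letters, digits and hyphens, starting and
# ending with a letter or digit, and never two hyphens in a row.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])+")


def artifact_store_path(container_name):
    """Return the az:// path for a container name, or raise ValueError if it is invalid."""
    length_ok = 3 <= len(container_name) <= 63
    if not length_ok or not _CONTAINER_NAME.fullmatch(container_name):
        raise ValueError(f"Not a valid Azure container name: {container_name!r}")
    return f"az://{container_name}"
```

The check is purely syntactic: it cannot tell whether the container actually exists in your storage account.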
These are all the components that we’re going to use for this tutorial, but ZenML offers additional components like:
With all components registered, we can now create and activate our ZenML stack. This makes sure ZenML knows which components to use when we run our pipeline later.
zenml stack register github_actions_stack \
    -o github_orchestrator \
    -x github_secrets_manager \
    -c github_container_registry \
    -a azure_artifact_store \
    --set
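The short flags in this command are easy to mix up. The sketch below spells out which flag carries which component and reassembles the same command line. The mapping is inferred from the component names used in this tutorial, and `build_stack_register_command` is a hypothetical helper that only builds the argument list and does not call ZenML.

```python
import shlex

# Short flag -> component registered earlier in this tutorial.
STACK_COMPONENTS = {
    "-o": "github_orchestrator",        # orchestrator
    "-x": "github_secrets_manager",     # secrets manager
    "-c": "github_container_registry",  # container registry
    "-a": "azure_artifact_store",       # artifact store
}


def build_stack_register_command(stack_name, components, activate=True):
    """Build the argument list for registering a stack with the given components."""
    command = ["zenml", "stack", "register", stack_name]
    for flag, component in components.items():
        command += [flag, component]
    if activate:
        # --set activates the stack right after registering it.
        command.append("--set")
    return command


print(shlex.join(build_stack_register_command("github_actions_stack", STACK_COMPONENTS)))
```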
Once the stack is active, we can register the secret that ZenML needs to authenticate with our artifact store.
We’re going to need the storage account name and key that we saved when we created our storage account earlier. Replace the <PLACEHOLDERS> in the following command with those concrete values and run it:
zenml secrets-manager secret register azure_store_auth \
    --schema=azure \
    --account_name=<STORAGE_ACCOUNT_NAME> \
    --account_key=<STORAGE_ACCOUNT_KEY>
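A secret registered with a leftover placeholder or a mistyped account name will only fail once the pipeline runs, so a quick local check of the two values can save a round trip. This is a minimal sketch under the assumption that Azure’s storage account naming rule applies (3 to 24 lowercase letters and digits). `check_secret_values` is a hypothetical helper, not part of ZenML.

```python
import re

_PLACEHOLDER = re.compile(r"<[A-Z_]+>")
# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def check_secret_values(account_name, account_key):
    """Return a list of problems with the values; an empty list means they look plausible."""
    problems = []
    if not account_name or _PLACEHOLDER.search(account_name):
        problems.append("account name is empty or still a placeholder")
    elif not _ACCOUNT_NAME.fullmatch(account_name):
        problems.append("account name must be 3-24 lowercase letters and digits")
    if not account_key or _PLACEHOLDER.search(account_key):
        problems.append("account key is empty or still a placeholder")
    return problems
```

It does not verify that the key actually belongs to the account, which only Azure can confirm.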
That was quite a lot of setup, but luckily we’re (almost) done now. Let’s execute the Python script that “runs” our pipeline and quickly discuss what it is doing:
python run.py
This script runs a ZenML pipeline using our active GitHub stack. The orchestrator will now build a Docker image with our pipeline code and all the requirements installed and push it to the GitHub container registry.
Once the image is pushed, the orchestrator will write a GitHub Actions workflow file to the directory .github/workflows. Pushing this workflow file will trigger the actual execution of our ZenML pipeline. We’ll explain later how to automate this step, but for our first pipeline run there is one last configuration step we need to do: we need to make sure our GitHub Actions are allowed to pull the Docker image that ZenML just pushed.
Go to the page of the container package that ZenML just pushed (you can find it under the Packages tab of your GitHub profile) and select Package settings on the right side:
In the Manage Actions access section, click on Add Repository. Search for your forked repository github-actions-orchestrator-tutorial and give it read permissions. Your package settings should then look like this:
Done! Now all that’s left to do is commit and push the workflow file:
git add .github/workflows
git commit -m "Add ZenML pipeline workflow"
git push
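If git reports that there is nothing to commit, the workflow file may not have been written where you expect. The following sketch lists the workflow files in a repository. It assumes only that GitHub Actions workflows live in .github/workflows and end in .yml or .yaml, and `find_workflow_files` is a hypothetical helper, not a ZenML function.

```python
from pathlib import Path


def find_workflow_files(repository_root="."):
    """Return the YAML workflow files under .github/workflows, sorted by name."""
    workflow_dir = Path(repository_root) / ".github" / "workflows"
    if not workflow_dir.is_dir():
        return []
    return sorted(
        path
        for path in workflow_dir.iterdir()
        if path.is_file() and path.suffix in {".yml", ".yaml"}
    )


if __name__ == "__main__":
    for workflow in find_workflow_files():
        print(workflow)
```

Run it from the root of the cloned repository. An empty result suggests the pipeline script has not yet produced a workflow file.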
If we now check out the GitHub Actions for our repository at https://github.com/<GITHUB_USERNAME>/github-actions-orchestrator-tutorial/actions, we should see our pipeline running! 🎉
If we want the orchestrator to automatically commit and push the workflow file for us, we can enable it with the following command:
zenml orchestrator update github_orchestrator --push=true
After this update, calling python run.py should automatically build and push a Docker image, then commit and push the workflow file, which will in turn run our pipeline on GitHub Actions.
Once we’re done experimenting, let’s delete all the resources we created on Azure so we don’t waste any compute or money. As we’ve bundled it all in one resource group, this step is very easy. Go to the Azure portal and select your resource group in the list of resources:
Next, click on Delete resource group on the top:
In the popup on the right side, type the resource group name and click Delete.
This will take a few minutes, but after it’s finished all the resources we created should be gone.
If you have any questions or feedback regarding this tutorial, let us know here or join our weekly community hour. If you want to know more about ZenML or see more examples, check out our docs, examples or join our Slack.