Run your steps on the cloud with Sagemaker, Vertex AI, and AzureML | ZenML Blog

Last updated: November 21, 2022.

If you are of a more visual disposition, please check out this blog’s accompanying video tutorial.

Subscribe to the ZenML YouTube Channel.

What is a step operator?

The step operator defers the execution of individual steps in a pipeline to specialized runtime environments that are optimized for machine learning workloads. This is helpful when certain steps need specialized cloud backends ✨, for example powerful GPU instances for training jobs or distributed compute for ingestion streams.

ZenML step operators allow training in the cloud

I’m confused 🤔. How is it different from an orchestrator?

An orchestrator is a higher-level entity than a step operator. It is what executes the entire ZenML pipeline code and decides what specifications and backends to use for each step.

The orchestrator runs the code which launches your step in a backend of your choice. If you don’t specify a step operator, then the step code runs on the same compute instance as your orchestrator.

While an orchestrator defines how and where your entire pipeline runs, a step operator defines how and where an individual step runs. This can be useful in a variety of scenarios. An example could be if one step within a pipeline needed to run on a separate environment equipped with a GPU (like a trainer step).
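For intuition, here is a minimal sketch of that scenario (the step names and the “gpu_operator” operator name are made up for illustration; the step_operator parameter is covered in the next section): the importer step runs wherever the orchestrator runs, while only the trainer step is handed off to a GPU-backed step operator.

from zenml.steps import step

@step
def importer() -> str:
    """Runs on the same compute instance as the orchestrator."""
    return "s3://my-bucket/data.csv"  # placeholder output

@step(step_operator="gpu_operator")  # hypothetical name of a registered step operator
def trainer(data_path: str) -> str:
    """Runs in the GPU-backed environment provided by the step operator."""
    return f"model trained on {data_path}"  # placeholder training logic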

How do I use it?

A step operator is a stack component, and is therefore part of a ZenML stack.

An operator can be registered as part of the stack as follows:

zenml step-operator register OPERATOR_NAME \
    --type=OPERATOR_TYPE \
    ...

A step can then be configured with the step_operator parameter of the @step decorator to run it with that operator backend:

from zenml.client import Client
from zenml.steps import step

step_operator = Client().active_stack.step_operator

@step(step_operator=step_operator.name)
def trainer(...) -> ...:
    """Train a model."""
    # This step will run in the environment specified by the step operator

Run on AWS Sagemaker, GCP Vertex AI, and Microsoft Azure ML


ZenML’s cloud integrations now include step operators that run an individual step on the hosted ML platform offerings of all three major public cloud providers. The ZenML GitHub repository gives a great example of how to use these integrations. Let’s walk through one of them, AWS Sagemaker, in this blog post; the other two clouds are quite similar and follow the same pattern.

Introduction to AWS Sagemaker

AWS Sagemaker is a hosted ML platform offered by Amazon Web Services. It manages the full lifecycle of building, training, and deploying machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. It offers specialized compute instances to run your training jobs and has a beautiful UI to track and manage your models and logs.

You can now use the new SagemakerStepOperator class to submit individual steps to be run on compute instances managed by Amazon Sagemaker.

Set up a stack with the AWS Sagemaker StepOperator

As we are working in the cloud, we first need to do a bunch of preparatory steps regarding permissions and resource creation. In the future, ZenML will automate a lot of this away. For now, follow these manual steps:

# register the S3 artifact store
zenml artifact-store register s3-store \
    --type=s3 \
    --path=<S3_BUCKET_PATH>

# register the container registry
zenml container-registry register ecr_registry \
    --type=default \
    --uri=<ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com

# register the sagemaker step operator
zenml step-operator register sagemaker \
    --type=sagemaker \
    --role=<SAGEMAKER_ROLE> \
    --instance_type=<SAGEMAKER_INSTANCE_TYPE> \
    --base_image=<CUSTOM_BASE_IMAGE> \
    --bucket_name=<S3_BUCKET_NAME> \
    --experiment_name=<SAGEMAKER_EXPERIMENT_NAME>
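At this point, the S3 bucket, the ECR repository, and the Sagemaker role referenced above need to already exist in your AWS account, and your local AWS credentials need access to them. If you want to double-check what got registered, you can list the components (a quick sanity check, assuming the standard list subcommands that the ZenML CLI exposes for each stack component type):

# sanity-check the registered components
zenml artifact-store list
zenml container-registry list
zenml step-operator list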

The command to register a stack with these components would look like the following. More details about the parameters that you can configure can be found in the class definition of the Sagemaker Step Operator in the API docs.

# register the sagemaker stack
zenml stack register sagemaker_stack \
    -o local_orchestrator \
    -c ecr_registry \
    -a s3-store \
    -s sagemaker

# activate the stack
zenml stack set sagemaker_stack

And now you have the stack up and running! Note that similar steps can be undertaken with Vertex AI and Azure ML. See the docs for more information.
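As a quick sanity check (a sketch using the standard zenml stack commands), you can confirm which stack is active and what it contains:

# confirm the active stack and its components
zenml stack list
zenml stack describe sagemaker_stack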

Create a pipeline with the step operator decorator

Once the above is out of the way, any step of any pipeline we create can be decorated as follows to run on the step operator:

from zenml.client import Client
from zenml.steps import step

step_operator = Client().active_stack.step_operator

@step(step_operator=step_operator.name)
def trainer(...) -> ...:
    """Train a model."""
    # This step will run as a custom job in Sagemaker

ZenML will take care of packaging the step into a Docker image for you, pushing the image, provisioning the resources for the custom job, and monitoring it as it progresses. Once the job completes, the pipeline continues as always.
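To make the whole flow concrete, here is a minimal sketch of how such a pipeline could be assembled and run with the pipeline API of this ZenML release (the importer step, the training_pipeline name, and the step bodies are illustrative, not taken from the example repository):

from zenml.client import Client
from zenml.pipelines import pipeline
from zenml.steps import step

step_operator = Client().active_stack.step_operator

@step
def importer() -> str:
    """Runs on the orchestrator, e.g. your local machine."""
    return "s3://my-bucket/training-data.csv"  # placeholder dataset reference

@step(step_operator=step_operator.name)
def trainer(data_path: str) -> str:
    """Runs as a custom job on Sagemaker."""
    return f"model trained on {data_path}"  # placeholder training logic

@pipeline
def training_pipeline(importer, trainer):
    data_path = importer()
    trainer(data_path=data_path)

if __name__ == "__main__":
    training_pipeline(importer=importer(), trainer=trainer()).run()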

You can also swap the “sagemaker” operator for any other operator of your choosing, and it will work with the same step code you have always had. Modularity at its best!
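For example, assuming you had also registered a Vertex AI step operator under the name vertex, you could swap it into the stack without touching the step code (a sketch using the stack update command):

# swap the step operator; the step code stays the same
zenml stack update sagemaker_stack -s vertex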

So what are you waiting for? Read more about step operators in the docs, or try it yourself with the full example at the GitHub repository. Make sure to leave a star if you do end up there!

[Image credit: Photo by lukaszlada on Unsplash]

