If you are of a more visual disposition, please check out this blog’s accompanying video tutorial.
The step operator defers the execution of individual steps in a pipeline to specialized runtime environments that are optimized for Machine Learning workloads. This is helpful when there is a requirement for specialized cloud backends ✨ for different steps. One example could be using powerful GPU instances for training jobs or distributed compute for ingestion streams.
An orchestrator is a higher level entity than a step operator. It is what executes the entire ZenML pipeline code and decides what specifications and backends to use for each step.
The orchestrator runs the code which launches your step in a backend of your choice. If you don’t specify a step operator, then the step code runs on the same compute instance as your orchestrator.
While an orchestrator defines how and where your entire pipeline runs, a step operator defines how and where an individual step runs. This can be useful in a variety of scenarios. An example could be if one step within a pipeline needed to run on a separate environment equipped with a GPU (like a trainer step).
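To make the division of labour concrete, here is a minimal toy sketch in plain Python. It does not use ZenML at all: `Step`, `ToyOrchestrator`, and `fake_gpu_backend` are hypothetical names invented for illustration. It only shows the idea that the orchestrator walks the whole pipeline and hands a step to an operator when one is set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    fn: Callable[[], str]
    # None means: run wherever the orchestrator itself runs.
    step_operator: Optional[str] = None


class ToyOrchestrator:
    """Hypothetical stand-in: runs all steps in order and picks where each runs."""

    def __init__(self, operators: Dict[str, Callable[[Step], str]]):
        self.operators = operators

    def run(self, steps: List[Step]) -> List[str]:
        log = []
        for step in steps:
            if step.step_operator is None:
                # No operator: the step runs in the orchestrator's environment.
                log.append(f"{step.name}: orchestrator -> {step.fn()}")
            else:
                # Operator set: delegate the step to that specialized backend.
                launch = self.operators[step.step_operator]
                log.append(f"{step.name}: {step.step_operator} -> {launch(step)}")
        return log


def fake_gpu_backend(step: Step) -> str:
    # Stand-in for submitting the step to a remote GPU environment.
    return step.fn()


pipeline = [
    Step("importer", lambda: "data loaded"),
    Step("trainer", lambda: "model trained", step_operator="gpu_backend"),
    Step("evaluator", lambda: "metrics computed"),
]
run_log = ToyOrchestrator({"gpu_backend": fake_gpu_backend}).run(pipeline)
for line in run_log:
    print(line)
```

Only the trainer is routed through the operator here; the other two steps stay with the orchestrator, which mirrors the GPU trainer scenario described above.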
A step operator is a stack component, and is therefore part of a ZenML stack.
An operator can be registered as part of the stack as follows:
zenml step-operator register OPERATOR_NAME \
    --type=OPERATOR_TYPE
    ...
A step can then set the custom_step_operator parameter in its decorator to run with that operator backend:
@step(custom_step_operator=OPERATOR_NAME)
def trainer(...) -> ...:
    """Train a model"""
    # This step will run in environment specified by operator
ZenML’s cloud integrations now include step operators that run an individual step on the hosted ML platforms of the major public cloud providers. The ZenML GitHub repository gives a great example of how to use these integrations. Let’s walk through one example, with AWS Sagemaker, in this blog. The other two clouds are quite similar and follow the same pattern.
AWS Sagemaker is a hosted ML platform offered by Amazon Web Services. It manages the full lifecycle of building, training, and deploying machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. It offers specialized compute instances to run your training jobs and has a beautiful UI to track and manage your models and logs.
You can now use the new SagemakerStepOperator class to submit individual steps to be run on compute instances managed by Amazon Sagemaker.
As we are working in the cloud, we first need to do a bunch of preparatory steps regarding permissions and resource creation. In the future, ZenML will automate a lot of this away. For now, follow these manual steps:
zenml artifact-store register s3-store \
    --type=s3 --path=<S3_BUCKET_PATH>

# register the container registry
zenml container-registry register ecr_registry --type=default --uri=<ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
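The registry URI in the command above follows the usual ECR pattern of account ID, region, and a fixed domain suffix. As a small sketch, the helper below (a hypothetical name, not part of ZenML or the AWS SDK) builds that URI and rejects account IDs that are not 12 digits. The account ID used is a placeholder.

```python
import re


def ecr_registry_uri(account_id: str, region: str = "us-east-1") -> str:
    """Build an ECR registry URI of the form <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com."""
    # AWS account IDs are always 12 digits; catch copy-paste mistakes early.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account_id!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


uri = ecr_registry_uri("123456789012")  # placeholder account ID
print(uri)
```

The result is what you would pass as the --uri value when registering the container registry.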
Set up the aws cli with the right credentials. Make sure you have the permissions to create and manage Sagemaker runs.
Create a role in the IAM console that you want the jobs running in Sagemaker to assume. This role should at least have the AmazonSageMakerFullAccess policy applied. Check this link to learn how to create a role.
Choose the instance type that should be used to run your jobs. You can get the list here.
Choose an experiment name if you have already created an experiment. Check this guide to learn how. If no experiment name is provided, the job runs will be independent of any experiment.
Optionally, select a custom docker image that you want ZenML to use as a base image for creating an environment to run your jobs in Sagemaker.
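It can help to gather the values from the steps above in one place and sanity-check them before registering anything. The sketch below is only an illustration: `SagemakerSettings` is a hypothetical helper, not a ZenML class. The flag names are the ones used in this post's registration command. Treating role, instance type, and bucket as required is an assumption of this sketch, while experiment name and base image are optional as described above. It only builds the command as a list of arguments and does not run it.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SagemakerSettings:
    """Hypothetical container for the values collected in the manual steps."""

    role: str
    instance_type: str
    bucket_name: str
    experiment_name: Optional[str] = None  # optional: runs are then experiment-independent
    base_image: Optional[str] = None       # optional: custom base docker image

    def validate(self) -> None:
        if not self.role:
            raise ValueError("an IAM role is required")
        # Sagemaker instance types are prefixed with "ml.", e.g. ml.m5.large.
        if not self.instance_type.startswith("ml."):
            raise ValueError(f"unexpected instance type: {self.instance_type!r}")

    def register_command(self, name: str = "sagemaker") -> List[str]:
        self.validate()
        cmd = [
            "zenml", "step-operator", "register", name,
            "--type=sagemaker",
            f"--role={self.role}",
            f"--instance_type={self.instance_type}",
            f"--bucket_name={self.bucket_name}",
        ]
        # Only pass the optional flags when a value was actually chosen.
        if self.base_image:
            cmd.append(f"--base_image={self.base_image}")
        if self.experiment_name:
            cmd.append(f"--experiment_name={self.experiment_name}")
        return cmd


settings = SagemakerSettings(
    role="arn:aws:iam::123456789012:role/zenml-sagemaker",  # placeholder role
    instance_type="ml.m5.large",
    bucket_name="my-zenml-bucket",  # placeholder bucket
)
command = settings.register_command()
print(shlex.join(command))
```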
Once you have all these values handy, you can proceed to setting up the components required for your stack.
# create the sagemaker step operator
zenml step-operator register sagemaker \
    --type=sagemaker \
    --role=<SAGEMAKER_ROLE> \
    --instance_type=<SAGEMAKER_INSTANCE_TYPE> \
    --base_image=<CUSTOM_BASE_IMAGE> \
    --bucket_name=<S3_BUCKET_NAME> \
    --experiment_name=<SAGEMAKER_EXPERIMENT_NAME>
The command above registers the step operator as a stack component. More details about the parameters that you can configure can be found in the class definition of Sagemaker Step Operator in the API docs. The stack itself can then be registered and activated as follows:
# register the sagemaker stack
zenml stack register sagemaker_stack \
    -m local_metadata_store \
    -o local_orchestrator \
    -c ecr_registry \
    -a s3-store \
    -s sagemaker

# activate the stack
zenml stack set sagemaker_stack
And now you have the stack up and running! Note that similar steps can be undertaken with Vertex AI and Azure ML. See the docs for more information.
Once the above is out of the way, any step of any pipeline we create can be decorated with the following decorator:
@step(custom_step_operator="sagemaker")
def trainer(...) -> ...:
    """Train a model"""
    # This step will run as a custom job in Sagemaker
ZenML will take care of packaging the step for you into a docker image, pushing the image, provisioning the resources for the custom job, and monitoring it as it progresses. Once complete, the pipeline will continue as always.
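The "monitor, then continue" part of that flow can be pictured as a simple polling loop. The sketch below is not ZenML's implementation, and how ZenML actually monitors the job is not covered in this post. `wait_for_job`, `fake_job`, and `run_remote_step` are hypothetical names, and the remote job is faked with a fixed sequence of statuses modelled on Sagemaker's training job states.

```python
import time
from typing import Callable, Iterable

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 0.0,
                 max_polls: int = 100) -> str:
    """Poll a status callback until the job reaches a terminal status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal status in time")


def fake_job(statuses: Iterable[str]) -> Callable[[], str]:
    # Stand-in for a remote job: returns the next scripted status on each call.
    iterator = iter(statuses)
    return lambda: next(iterator)


def run_remote_step(get_status: Callable[[], str]) -> str:
    final_status = wait_for_job(get_status)
    if final_status != "Completed":
        # A failed remote step should stop the pipeline rather than be ignored.
        raise RuntimeError(f"remote step ended with status {final_status}")
    return "pipeline continues"


result = run_remote_step(fake_job(["InProgress", "InProgress", "Completed"]))
print(result)
```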
You can also switch the “sagemaker” operator with any other operator of your choosing, and it will work with the same step code as you always have. Modularity at its best!