We did not start as an open-source Machine Learning tooling company. Our original goal was to transform the commercial vehicle industry with predictive analytics. After a few promising proofs of concept and projects on trucks, we steadily expanded to other commercial vehicles, and to other industries.
As you might imagine, we learned a lot in those days: about running a business, growing a team, acquiring customers, and bootstrapping a company. The lessons on the tech side, however, turned out to be the most valuable. This is why we decided to focus our efforts on open-sourcing and growing our internal tech stack as our core business.
In our project days, a pattern emerged quite quickly. Delivering excellent results while staying profitable required us to be quick. Onboarding a new customer meant fast iterations over their data and re-using approaches from previous projects. Compliance, regulations, and client requirements forced us to work in very heterogeneous environments, and we often had to burst workloads off to a remote (read: cloud-based) compute backend.
Cutting to the chase: We needed to ensure the reproducibility of our projects, across a range of integrations to compute backends.
We went through some of the available MLOps tooling and used individual bits and pieces over time. There are amazing projects available, solving individual aspects of Machine Learning (so-called vertical solutions). Where the market falls short is on solutions that span bigger chunks of the ML lifecycle (so-called horizontal solutions), as they tend to limit the choice of complementary tooling. This is the exact reason we built ZenML: guaranteed reproducibility with the freedom to integrate the best available vertical tooling.
ZenML is built with reproducibility in mind. Reproducibility is a core motivation of DevOps methodologies: Builds need to be reproducible. Commonly, this is achieved by version control of code, version pinning of dependencies, and automation of workflows. ZenML bundles these practices into a coherent framework for Machine Learning. Machine Learning brings an added level of complexity to version control, beyond versioning code: Data is inherently hard to version.
ZenML takes an easy, yet effective approach to version controlling data. When sourcing data, either via dedicated data pipelines or within your training pipelines, ZenML creates an immutable snapshot of the data (TFRecords) used for your specific pipeline. This snapshot is tracked, just like any other pipeline step, and becomes available as a starting point to subsequent pipelines when using the same parameters for sourcing data.
NOTE: The principle behind versioning data in ZenML is a variation of the method used for caching pipeline steps.
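To make the idea concrete, here is a minimal sketch of parameter-keyed snapshots. It is not ZenML's implementation: the names `snapshot_key`, `SnapshotRegistry`, and `materialize` are hypothetical, and the registry lives in memory purely for illustration. The point is that identical sourcing parameters resolve to the same snapshot, so a later pipeline can start from it instead of sourcing the data again.

```python
import hashlib
import json


def snapshot_key(source_params: dict) -> str:
    """Derive a deterministic key from the parameters used to source data."""
    # Sorting keys makes the serialization independent of dict ordering.
    canonical = json.dumps(source_params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SnapshotRegistry:
    """Hypothetical in-memory registry mapping snapshot keys to locations."""

    def __init__(self):
        self._snapshots = {}

    def get_or_create(self, source_params, create_fn):
        key = snapshot_key(source_params)
        if key not in self._snapshots:
            # First time these parameters are seen: materialize a snapshot.
            self._snapshots[key] = create_fn(key)
        return key, self._snapshots[key]


registry = SnapshotRegistry()
calls = []


def materialize(key):
    # Stand-in for writing an immutable copy of the data; only records the call.
    calls.append(key)
    return f"snapshots/{key[:12]}"


# Same parameters in a different order resolve to the same snapshot.
key_a, path_a = registry.get_or_create({"table": "trips", "limit": 1000}, materialize)
key_b, path_b = registry.get_or_create({"limit": 1000, "table": "trips"}, materialize)
```

In this sketch the second call never reaches `materialize`, which mirrors how a cached pipeline step is skipped when its inputs have not changed.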
It is not necessary to reinvent the wheel when it comes to version control of code - chances are, you’re already using git to do so (and if not, you should). ZenML can tap into a repository’s history and allow for version-pinning of your own code via git SHAs.
This becomes exceptionally powerful when you have code you want/need to embed at serving time, as there is now not just lineage of data, but also lineage of code from experiment to serving.
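As a rough illustration of pinning code to a commit, the sketch below reads the current commit with the standard `git rev-parse HEAD` command and attaches it to a source path. The `path@sha` notation and the helper names (`current_git_sha`, `pin_source`, `parse_pinned_source`) are assumptions made for this example, not a description of ZenML's internal format, and the pattern assumes SHA-1 commit hashes.

```python
import re
import subprocess

# Accepts abbreviated or full SHA-1 commit hashes (assumption: SHA-1 repos).
_SHA_PATTERN = re.compile(r"^[0-9a-f]{7,40}$")


def current_git_sha(repo_dir: str = ".") -> str:
    """Return the commit SHA of HEAD for the repository at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def pin_source(source_path: str, sha: str) -> str:
    """Attach a commit SHA to a source path (hypothetical 'path@sha' form)."""
    if not _SHA_PATTERN.match(sha):
        raise ValueError(f"not a valid git SHA: {sha!r}")
    return f"{source_path}@{sha}"


def parse_pinned_source(pinned: str):
    """Split a pinned reference back into (source_path, sha)."""
    source_path, sep, sha = pinned.rpartition("@")
    if not sep or not source_path or not _SHA_PATTERN.match(sha):
        raise ValueError(f"not a pinned source: {pinned!r}")
    return source_path, sha


# Example with a made-up SHA; in a real repository use current_git_sha().
example_sha = "0123456789abcdef0123456789abcdef01234567"
pinned = pin_source("my_project.steps.Trainer", example_sha)
```

Storing the pinned reference alongside a pipeline run is what makes it possible to check out exactly the code that produced a model when it is later served.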
Declarative configurations are a staple of DevOps methodologies, ultimately brought to fame through Terraform. In a nutshell: A pipeline’s configuration declares the “state” the pipeline should be in and the processing that should be applied, and ZenML figures out where the code lies and what computations to apply.
That way, when your teammate clones your repo and re-runs a pipeline config on a different environment, the pipeline remains reproducible.
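The declarative idea can be sketched in a few lines. The config format, the `STEP_REGISTRY`, and the step names below are all hypothetical and far simpler than a real pipeline config; they only show the separation between a config that declares what should happen and a runner that resolves the names to code.

```python
import json

# Hypothetical registry that maps declared step names to their implementations.
STEP_REGISTRY = {}


def register(name):
    """Decorator that makes a function addressable from a config by name."""
    def decorator(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator


@register("scale")
def scale(values, factor):
    return [v * factor for v in values]


@register("clip")
def clip(values, upper):
    return [min(v, upper) for v in values]


def run_from_config(config, data):
    """Apply the declared steps in order, resolving each name via the registry."""
    for step in config["steps"]:
        fn = STEP_REGISTRY[step["name"]]
        data = fn(data, **step.get("params", {}))
    return data


# The config only declares what to do; it contains no executable code.
CONFIG_TEXT = """
{"steps": [
  {"name": "scale", "params": {"factor": 2}},
  {"name": "clip", "params": {"upper": 5}}
]}
"""

config = json.loads(CONFIG_TEXT)
result = run_from_config(config, [1, 2, 3])  # [2, 4, 5]
```

Because the config is plain data, it can be committed next to the code and re-run unchanged on another machine.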
While versioning and declarative configs are essential for reproducibility, there needs to be a system that keeps track of all processes as they happen. Google’s ML Metadata standardizes metadata tracking, and makes it easy to keep track of iterative experimentation as it happens. ZenML uses ML Metadata extensively (natively as well as via the TFX interface) to automatically track all relevant parameters that are created through ZenML pipeline interfaces. This not only helps in post-training workflows to compare results as experiments progress, but also has the added advantage of leveraging caching of pipeline steps.
The Machine Learning landscape is evolving at a rapid pace. We’re decoupling your experiment workflow from the tooling by providing integrations to solutions for specific aspects of your ML pipelines.
With ZenML, inputs and outputs are tracked for every pipeline step. Output artifacts (e.g. binary representations of data, splits, preprocessing results, models) are centrally stored and are automatically used for caching. To facilitate that, ZenML relies on a Metadata Store and an Artifact Store.
By default, both will point to a subfolder of your local .zenml directory, which is created when you run zenml init. It’ll contain both the Metadata Store (default: SQLite) and the Artifact Store (default: TFRecords in local folders).
More advanced configurations can easily centralize both the Metadata as well as the Artifact Store, for example for use in Continuous Integration or for collaboration across teams.
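How a metadata store and an artifact store work together for caching can be sketched as follows. This is an illustration under simplifying assumptions, not ZenML's implementation: the `LocalStores` class, its table schema, and the use of pickle files are hypothetical, and the cache key covers only the step name and parameters, whereas a real system would also take input artifacts and code versions into account.

```python
import hashlib
import json
import pickle
import sqlite3
import tempfile
from pathlib import Path


class LocalStores:
    """Hypothetical pairing of a SQLite metadata store and a folder of artifacts."""

    def __init__(self, artifact_dir):
        self.artifact_dir = Path(artifact_dir)
        self.artifact_dir.mkdir(parents=True, exist_ok=True)
        # In-memory database for the sketch; a file path would persist it.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE executions ("
            "cache_key TEXT PRIMARY KEY, step TEXT, artifact_path TEXT)"
        )

    def run_step(self, step_name, fn, params):
        """Run fn(**params), or load its stored output if already recorded."""
        payload = json.dumps({"step": step_name, "params": params}, sort_keys=True)
        cache_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        row = self.db.execute(
            "SELECT artifact_path FROM executions WHERE cache_key = ?",
            (cache_key,),
        ).fetchone()
        if row is not None:
            # Metadata hit: read the artifact back instead of recomputing.
            with open(row[0], "rb") as fh:
                return pickle.load(fh), True
        output = fn(**params)
        path = self.artifact_dir / f"{cache_key}.pkl"
        with open(path, "wb") as fh:
            pickle.dump(output, fh)
        self.db.execute(
            "INSERT INTO executions VALUES (?, ?, ?)",
            (cache_key, step_name, str(path)),
        )
        self.db.commit()
        return output, False


def split_sizes(total, train_fraction):
    train = int(total * train_fraction)
    return {"train": train, "eval": total - train}


stores = LocalStores(tempfile.mkdtemp())
params = {"total": 100, "train_fraction": 0.75}
first, first_cached = stores.run_step("split", split_sizes, params)
second, second_cached = stores.run_step("split", split_sizes, params)
```

Pointing both stores at shared locations instead of a temporary folder and an in-memory database is, in spirit, what centralizing them for a team amounts to.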
As ZenML is centered on decoupling workflow from tooling we provide a growing number of out-of-the-box supported backends for orchestration.
When you configure an orchestration backend for your pipeline, calling pipeline.run() launches all pipeline steps on the configured orchestration backend, not in the local environment from which you invoke it. ZenML will attempt to use credentials for the orchestration backend in question from the current environment.
NOTE: If no further pipeline configuration is provided (e.g. processing or training backends), the orchestration backend will also run all pipeline steps.
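The decoupling of a pipeline from where it runs can be sketched with a small interface. Everything here is hypothetical and only mirrors the idea, not ZenML's actual classes: `OrchestratorBackend`, `LocalBackend`, `RecordingRemoteBackend`, and this `Pipeline` are invented for the example, and the "remote" backend merely records what it would submit before running the steps in-process.

```python
from abc import ABC, abstractmethod


class OrchestratorBackend(ABC):
    """Hypothetical interface: decides where a list of steps is executed."""

    @abstractmethod
    def run(self, steps, data):
        ...


class LocalBackend(OrchestratorBackend):
    def run(self, steps, data):
        # Execute each step in order in the current process.
        for step in steps:
            data = step(data)
        return data


class RecordingRemoteBackend(OrchestratorBackend):
    """Stand-in for a remote backend: records the submission, then runs locally."""

    def __init__(self):
        self.submitted = []

    def run(self, steps, data):
        # A real backend would ship the steps to remote compute here.
        self.submitted.append([step.__name__ for step in steps])
        return LocalBackend().run(steps, data)


class Pipeline:
    def __init__(self, steps, backend=None):
        self.steps = steps
        # Without an explicit backend, everything runs locally.
        self.backend = backend or LocalBackend()

    def run(self, data):
        # The pipeline definition is identical regardless of the backend.
        return self.backend.run(self.steps, data)


def double(values):
    return [v * 2 for v in values]


def total(values):
    return sum(values)


remote = RecordingRemoteBackend()
local_result = Pipeline([double, total]).run([1, 2, 3])
remote_result = Pipeline([double, total], backend=remote).run([1, 2, 3])
```

The pipeline code stays the same in both cases; only the backend object passed in changes, which is the property the orchestration backends are meant to provide.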
Integrating custom orchestration backends is fairly straightforward. Check out our example implementation of Google Cloud VMs to learn more about building your own integrations.
Sometimes, pipeline steps just need more scalability than what your orchestration backend can offer. That’s when the natively distributable codebase of ZenML can shine - it’s straightforward to run pipeline steps on processing backends like Google Dataflow or Apache Spark.
ZenML uses Apache Beam for its pipeline steps, so processing backends rely on the functionality of Apache Beam. A processing backend will execute all pipeline steps before the actual training.
Processing backends can be used to great effect in conjunction with an orchestration backend. To give a practical example: You can orchestrate your pipelines using Google Cloud VMs and configure them to use a service account with permissions for Google Dataflow and Google Cloud AI Platform. That way you don’t need to have very open permissions on your personal IAM user, but can delegate authority to service accounts within Google Cloud.
We’re adding support for additional processing backends continuously.
Many ML use cases and model architectures require GPUs/TPUs. ZenML offers integrations with cloud-based ML training services, and provides a way to extend the training interface to allow for self-built training backends.
Some of these backends rely on Docker containers or other methods of transmitting code. Please see the documentation for a specific training backend for further details.
Every ZenML pipeline yields a servable model, ready to be used in your existing architecture - for example as additional input for your CI/CD pipelines. To accommodate other architectures, ZenML has support for a growing number of dedicated serving backends, with clear linkage and lineage from data to deployment.
After a couple of months of hard work and tremendously helpful feedback from the community, we can now share the fruit of our labor.
We are proud to announce that ZenML is available open-source. Check out the GitHub repository and get started with reproducible Machine Learning!