TL;DR: Running successful ML teams and projects requires cross-functional collaboration within the executing team.
That sentence alone does not help anyone. It feels unsubstantiated, and someone playing bullshit bingo might luck out just on that one sentence - and yet, it’s true. Why? Because it takes the main driver behind the DevOps revolution in software development and applies it to Machine Learning.
If you’re familiar with what DevOps stands for, where it came from and why it’s such a game-changer, feel free to skip to the conclusion.
For everyone else, a quick refresher.
Flashback to the 2000s (sorry if this triggers PTSD for my fellow Ops guys out there). Software Development is a barren wasteland. Developers write code until it works on their computer, then send an email asking QA and Ops to please do their thing now. Nobody knows who the people in QA are; all they do is complain about bugs that can’t be reproduced. Ops are the angry guys from the basement, and somehow they get mad just because code sometimes SEGFAULTs, but it can’t have been a fault on our side: it never SEGFAULTs on my machine.
This is not a healthy way to run a business. The late 2000s were full of stories of companies that failed, simply because their internal software development process was too slow to keep up with the market.
Meet “DevOps”: A new and revolutionary way of dealing with software development. What if we could deliver our software more often, and faster, to our customers? And what if it broke less in production? What if our Devs, QA, and Ops could be happier at work? And what if we built up less technical debt along the way?
How, you ask? Let developers, testers, AND operators work in unison, together, and own the entire process, rather than separate them into silos! From an orbital perspective, software development can be roughly broken down into a handful of stages:
Allowing teams to own the entire process, rather than just a single aspect, created a deeper understanding of all software lifecycle aspects across the functions in a team. Suddenly, everyone was a part of the development team. Testers had to be able to understand and write code because tests had to be automated. Ops guys built abstractions over the underlying hardware so that others could run software independently. Developers had to acknowledge that software is only done once it has been successfully exposed to customers.
Yes, there are many ways to mess this up. You might even say it’s impossible to get it right. These ideas have spawned many closely related methodologies and team designs (e.g. SREs at Google, Netflix’s Platform Engineering team, etc.). But evidence confirms: getting all lifecycle functions to work (more) cross-functionally will leave you better off than in a siloed environment.
Obviously, I’m trying to go somewhere with this. Machine Learning as a discipline is exactly where software development was in the early 2000s: just before the dawn of DevOps. Unless you’re in some hotshot startup, you own all data, and every one of your engineers is a core contributor to Kubeflow, your reality will look a bit bleak:
Nothing I’m about to say should sound like a revelation. I’ll paint the picture by using an example: A team running the search engine on an e-commerce platform. They need engineers with an understanding of high-performance processing, databases, and API design. They will need to be able to rely on the upstream product data staying consistent. They need to understand how their search engine is used in the frontend, and what performance metrics they need to be able to provide. They need to own the APIs used by downstream teams, e.g. the Frontend or your mobile app. Abstracted, they need to get
The same applies to Machine Learning teams if they’re supposed to be successful:
Your team will be responsible for preprocessing input data it receives, training ML models, evaluating fulfillment of business criteria, and delivering results to downstream stakeholders. Therefore, the skills your team will need are:
Data needs to be reliable if a team is supposed to generate value from it. The team and the upstream data provider need to agree on the frequency of data provision, data format, and data quality. The easiest and best agreement can be found in code, e.g. through data ingestion pipelines with strong data validation built-in, but it can be beneficial later on to establish a feature store for your Machine Learning team. Horizontally aligned Data Operations teams can help if more than one team is reliant on similar data.
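To make "agreement in code" a bit more concrete, here is a minimal sketch of a validating ingestion step, using only the Python standard library. Everything in it is an assumption made for illustration: the field names (`product_id`, `title`, `price`, `category`), the `FieldRule` and `ingest` helpers, and the 10% rejection threshold are hypothetical and would need to be agreed with your actual upstream data provider.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FieldRule:
    """One field of the (hypothetical) data contract."""
    name: str
    expected_type: type
    required: bool = True
    min_value: Optional[float] = None


# Hypothetical contract for an upstream product feed.
PRODUCT_CONTRACT = [
    FieldRule("product_id", str),
    FieldRule("title", str),
    FieldRule("price", float, min_value=0.0),
    FieldRule("category", str, required=False),
]


def validate_record(record: dict[str, Any], contract: list[FieldRule]) -> list[str]:
    """Return human-readable contract violations; an empty list means valid."""
    errors = []
    for rule in contract:
        value = record.get(rule.name)
        if value is None:
            if rule.required:
                errors.append(f"{rule.name}: missing required field")
            continue
        # Strict type check: an int price is rejected, not silently coerced.
        if not isinstance(value, rule.expected_type):
            errors.append(
                f"{rule.name}: expected {rule.expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"{rule.name}: {value} is below minimum {rule.min_value}")
    return errors


def ingest(records, contract, max_reject_ratio=0.1):
    """Split a batch into accepted and rejected records.

    Raises ValueError if too large a share of the batch violates the
    contract, so a broken upstream delivery fails loudly instead of
    quietly shrinking the training set.
    """
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record, contract)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    if records and len(rejected) / len(records) > max_reject_ratio:
        raise ValueError(
            f"{len(rejected)} of {len(records)} records violate the data contract"
        )
    return accepted, rejected
```

The point of the sketch is less the specific checks than where they live: the contract is a reviewable piece of code that both the team and the upstream provider can read, and a delivery that breaks it stops the pipeline rather than degrading the model unnoticed.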
Surprisingly many projects fail due to muddy business objectives. Business stakeholders and the ML team need to come together and thoroughly define what the business is trying to achieve, and how it affects the model performance indicators. At a minimum, the business and the team need to answer these questions:
Probably the most controversial point in this blog post concerns ownership. After transitioning multiple teams across companies towards full ownership of their software development lifecycle, I’m a firm believer in the approach. Teams need to be responsible for providing the actual value, not just the artifact itself. It drives healthy decision-making (e.g. how big can the final model be, how many resources can it use), and it breeds a deeper understanding of the actual needs of downstream consumers. After all, if the served model does not create a positive impact on someone downstream, the efforts of the team should have been applied elsewhere.
Your teams are now cross-functional, and everyone is eager to jump to work. But what now? What does the work of a successful ML team look like in production? We’ve got you covered on that front, too - check out our blog post on the 12 factors of reproducible Machine Learning in production.