“Spend more time with your data.” As a data scientist or ML practitioner, you’ll probably have heard this advice more times than you can count. Familiarity with the data breeds understanding of the problem domain which in turn leads to better models and solutions. But how do you do this in practice? The very fact that we need to be reminded of this suggests at least some lack of clarity.
One way to become intimately familiar with your data is to get involved in labeling the data. This approach has recently gained popularity following the emergence of the data-centric AI movement. In this blog, I will explain the benefits of being more closely involved in annotation and labeling of your data. I will also talk through the moments in your overall workflows where you’re likely to think about doing annotation and how it benefits you in each case.
I imagine that in a few years we’ll see the push for ‘data-centric AI’ as a correction to how things are routinely done rather than an entirely separate approach or discipline. Indeed, researchers from the humanities or social sciences will readily affirm the importance of being familiar with the raw materials you’re working with.
I’ve learned and relearned this countless times in my own career and can attest to the benefits that come from immersion in the sources and materials that make up your problem domain or context. You’ll very rarely find you don’t benefit from familiarity with this data. (You still want to be smart about exactly how much of your time you spend doing this, but I’ll put that to one side for now.)
Familiarity with your data helps you spot and solve problems faster. For instance, spending a few hours labeling image data will give you a rough sense of the kinds of things contained in your images, what formats or sizes they take, and so on. If your model training process throws an error out of nowhere, your knowledge of the input data will likely be useful in coming up with a hypothesis as to what happened.
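To make this a little more concrete, here is a minimal sketch of the kind of quick tally you might run after a labeling session. It assumes you have recorded each image's path, label, width and height somewhere; the `LabeledImage` record, the `summarize` helper and the sample values are all hypothetical and only there for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LabeledImage:
    """One hand-labeled image (hypothetical record layout for this sketch)."""
    path: str
    label: str
    width: int
    height: int


def summarize(records):
    """Tally labels and file extensions, and report the width/height ranges."""
    labels = Counter(r.label for r in records)
    # Lowercase the suffix so ".JPG" and ".jpg" are counted together.
    extensions = Counter(Path(r.path).suffix.lower() for r in records)
    widths = [r.width for r in records]
    heights = [r.height for r in records]
    return {
        "labels": dict(labels),
        "extensions": dict(extensions),
        "width_range": (min(widths), max(widths)),
        "height_range": (min(heights), max(heights)),
    }


# Made-up sample records standing in for a real labeling session.
records = [
    LabeledImage("img/001.jpg", "cat", 640, 480),
    LabeledImage("img/002.JPG", "dog", 1024, 768),
    LabeledImage("img/003.png", "cat", 32, 32),
]
summary = summarize(records)
print(summary)
```

A summary like this can surface surprises early, such as a stray tiny thumbnail or an unexpected file format, which are exactly the kinds of details that tend to explain a mysterious training error later on.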
Familiarity with your data is also a great way to keep your process nimble and focused. There’s no need to go full AutoML, trying out every hyperparameter or pre-trained model under the sun if you have a sense of the boundaries or edges of your data space.
Perhaps most importantly, the wide availability of some pretty great starting points in the form of pre-trained models (e.g. in the computer vision and NLP domains) means that you don’t actually need much data to get started. Making an initial set of annotations lets you establish a baseline and start iterating on your solution.
Labeling isn’t a static practice but rather something you probably will want to be doing throughout. There are some places where it makes a lot of sense to plan to do some labeling:
The shift and refocus towards data quality is more of a course correction than it is a fundamental rethink of how machine learning models get trained. Nevertheless, it does shift the emphasis and allocation of time to different parts of the ML training pipeline.
You will probably never regret spending time with your data, exposed to all the tiny idiosyncrasies and edge cases. After all, your model goes through some version of this itself as it works to fit a function to the precise use case you’ve set up. Knowing how to incorporate data labeling as part of your data-centric approach will pay dividends, both in how quickly you can iterate and in how much intuition you build for the raw materials of your problem.
At ZenML, we’re starting work to incorporate data labeling and annotation tools into our framework so that you can gain all the benefits described above as a core part of your workflow. If you have a use case which requires data annotation in your pipelines, please let us know what you’re building and whether there are tools you feel like you couldn’t live without! The easiest way to contact us is via our Slack community which you can join here.