We are joining OpenAI

Piotr Niedźwiedź, CEO/CTO and founder of neptune.ai

I’m excited to share that we’ve entered into a definitive agreement to be acquired by OpenAI, subject to closing conditions. We are thrilled to join the OpenAI team and help their AI researchers build better models faster.

We started in 2017, when the pivotal transformer paper came out. And from then until today, we have focused on helping people who build ML models during the iterative, messy, and unpredictable phase of model training. 

I love how Szymon Sidor, who has spent nearly a decade at OpenAI, explained the essence of Neptune back to me, perhaps capturing it better than I ever could:

“OpenAI research converts compute into understanding. At the interface of compute and understanding are metrics. Neptune is a metrics dashboard company.”

We’ve worked closely with OpenAI to create the metrics dashboard that helps teams building foundation models. With this transaction, we’ll be able to work even more closely together and innovate on tools at a whole new level. 

Our future with OpenAI

Neptune will join OpenAI and continue to support AI researchers with tools to monitor, debug, and evaluate frontier models. Built on top of Neptune, of course. I am beyond excited to join forces with some of the top research and engineering minds on the path towards AGI. 

This is how Jakub Pachocki, OpenAI’s Chief Scientist, explains how Neptune will fit in at OpenAI: 

“Neptune has built a fast, precise system that allows researchers to analyze complex training workflows. We plan to iterate with them to integrate their tools deep into our training stack to expand our visibility into how models learn.”

We will wind down our external services in the next few months, and are committed to working closely with our customers and users to make this transition as smooth as possible.

What is next

I am truly grateful to our customers, investors, co-founders, and colleagues who have made this journey possible. It was the ride of a lifetime already, yet still I believe this is only the beginning. We are looking forward to working with top AI researchers and supporting OpenAI’s mission of ensuring that AGI benefits all of humanity. 

Learnings From Building the ML Platform at Mailchimp

This article was originally an episode of the ML Platform Podcast, a show where Piotr Niedźwiedź and Aurimas Griciūnas, together with ML platform professionals, discuss design choices, best practices, example tool stacks, and real-world learnings from some of the best ML platform professionals.

In this episode, Mikiko Bazeley shares her learnings from building the ML Platform at Mailchimp.

You can watch it on YouTube:

Or listen to it as a podcast:

But if you prefer a written version, here you have it! 

In this episode, you will learn about: 

  1. ML platform at Mailchimp and generative AI use cases
  2. Generative AI problems at Mailchimp and feedback monitoring
  3. Getting closer to the business as an MLOps engineer
  4. Success stories of ML platform capabilities at Mailchimp
  5. Golden paths at Mailchimp

Who is Mikiko Bazeley

Aurimas: Hello everyone and welcome to the Machine Learning Platform Podcast. Today, I’m your host, Aurimas, and together with me, there’s a cohost, Piotr Niedźwiedź, who is a co-founder and the CEO of neptune.ai.

With us today on the episode is our guest, Mikiko Bazeley. Mikiko is a very well-known figure in the data community. She is currently the head of MLOps at FeatureForm, a virtual feature store. Before that, she was building machine learning platforms at MailChimp.

Nice to have you here, Miki. Would you tell us something about yourself?

Mikiko Bazeley: You definitely got the details correct. I joined FeatureForm last October, and before that, I was with Mailchimp on their ML platform team. I was there before and after the big $14 billion acquisition (or something like that) by Intuit – so I was there during the handoff. Quite fun, quite chaotic at times.

But prior to that, I spent a number of years working as a data analyst, a data scientist, and even in a weird MLOps/ML platform data engineer role for some early-stage startups, where I was trying to build out their platforms for machine learning and realized that’s actually very hard when you’re a five-person startup – lots of lessons learned there.

So I tell people honestly, I’ve spent the last eight years working up and down the data and ML value chain effectively – a fancy way of saying “job hopping.”

How to transition from data analytics to MLOps engineering

Piotr: Miki, you’ve been a data scientist, right? And later, an MLOps engineer. I know that you are not a big fan of titles; you’d rather prefer to talk about what you actually can do. But I’d say what you do is not a common combination.

How did you manage to jump from a more analytical, scientific type of role to a more engineering one?

Mikiko Bazeley: Most people are really surprised to hear that my background in college was not computer science. I actually did not pick up Python until about a year before I made the transition to a data scientist role.

When I was in college, I studied anthropology and economics. I was very interested in the way people worked because, to be frank, I didn’t understand how people worked. So that seemed like the logical area of study. 

I was always fascinated by the way people made decisions, especially in a group. For example, what are cultural or social norms that we just kind of accept without too much thought? When I graduated college, my first job was working as a front desk girl at a hair salon.

At that point, I didn’t have any programming skills.

I think I had like one class in R for biostats, which I barely passed. Not because of intelligence or ambition, but mainly because I just didn’t understand the roadmap – I didn’t understand the process of how to make that kind of pivot.

My first pivot was to growth operations and sales hacking – it was called growth hacking at that time in Silicon Valley. And then, I developed a playbook for how to make these transitions. So I was able to get from growth hacking to data analytics, then data analytics to data science, and then data science to MLOps.

I think the key ingredients of making that transition from data science to an MLOps engineer were:

Having a really genuine desire for the kinds of problems that I want to solve and work on. That’s just how I’ve always focused my career – “What’s the problem I want to work on today?” and “Do I think it’s going to be interesting like one or two years from now?” 

The second part was very interesting because there was one year I had four jobs. I was working as a data scientist, mentoring at two boot camps, and working on a real estate tech startup on the weekends.

I eventually left to work on it full-time during the pandemic, which was a great learning experience, but financially, it might not have been the best solution to get paid in sweat equity. But that’s okay – sometimes you have to follow your passion a little bit. You have to follow your interests.

Piotr: When it comes to decisions, in my context, I remember when I was still a student. I started from tech, my first job was an internship at Google as a software engineer. 

I’m from Poland, and I remember when I got an offer from Google to join as a regular software engineer. The monthly salary was more than I was spending in a year. It was two or three times more.

It was very tempting to follow where the money was at that moment. I see a lot of people in the field, especially at the beginning of their careers, thinking more short-term. The concept of looking a few steps, a few years ahead, I think, is something that people are missing, and it’s something that, at the end of the day, may result in better outcomes.

I always ask myself when there is a decision like that; “What would happen if in a year it’s a failure and I’m not happy? Can I go back and pick up the other option?” And usually, the answer is “yes, you can.”

I know that decisions like that are challenging, but I think that you made the right call and you should follow your passion. Think about where this passion is leading.

Resources that can help bridge the technical gap

Aurimas: I also have a very similar background. I switched from analytics to data science, then to machine learning, then to data engineering, then to MLOps.

For me, it was a little bit of a longer journey because I kind of had data engineering and cloud engineering and DevOps engineering in between.

You shifted straight from data science, if I understand correctly. How did you bridge that – I would call it a technical chasm – to become an MLOps engineer?

Mikiko Bazeley: Yeah, absolutely. That was part of the work at the early-stage real estate startup. Something I’m a very big fan of is boot camps. When I graduated college, I had a very bad GPA – very, very bad.

I don’t know how they score a grade in Europe, but in the US, for example, it’s usually out of a 4.0 system, and I had a 2.4, and that is just considered very, very bad by most US standards. So I didn’t have the opportunity to go back to a grad program and a master’s program.

It was very interesting because by that point, I had approximately six years working with executive level leadership for companies like Autodesk, Teladoc, and other companies that are either very well known globally – or at least very, very well known domestically, within the US.

I had C-level people saying: “Hey, we will write you those letters to get into grad programs.”

And grad programs were like, “Sorry, nope! You have to go back to college to redo your GPA.” And I’m like, “I’m in my late 20s. Knowledge is expensive, I’m not gonna do that.” 

So I’m a big fan of boot camps.

What helped me both in the transition to the data scientist role and then also to the MLOps engineer role was doing a combination of boot camps, and when I was going to the MLOps engineer role, I also took this one workshop that’s pretty well-known called Full Stack Deep Learning. It’s taught by Dimitri and Josh Tobin, who went off to go start Gantry. I really enjoyed it.

I think sometimes people go into boot camps thinking that’s gonna get them a job, and it just really doesn’t. It’s just a very structured, accelerated learning format.

What helped me in both of those transitions was truly investing in my mentor relationship. For example, when I first pivoted from data analytics to data science, my mentor at that time was Rajiv Shah, who is the developer advocate at Hugging Face now.

I’ve been a mentor at boot camps since then – at a couple of them. A lot of times, students will kind of check in and they’ll be like, “Oh, why don’t you help me grade my project? How was my code?”

And that’s not a high-value way of leveraging an industry mentor, especially when they come with such credentials as Rajiv Shah came with.

With the full-stack deep learning course, there were some TAs there who were absolutely amazing. What I did was show them my project for grading. But for example, when moving to the data scientist role, I asked Rajiv Shah:

  • How do I do model interpretability if marketing, or my CMO, is asking me to create a forecast and predict results?
  • How do I get this model in production?
  • How do I get buy-in for these data science projects? 
  • How do I leverage the strengths that I already have? 

And I coupled that with the technical skills I’m developing.

I did the same thing with the ML platform role. I would ask:

  • What is this course not teaching me right now that I should be learning?
  • How do I develop my body of work?
  • How do I fill in these gaps?

I think I developed the skills through a combination of things. 

You need to have a structured curriculum, but you also need to have projects to work with, even if they are sandbox projects – that kind of exposes you to a lot of the problems in developing ML systems.

Looking for boot camp mentors

Piotr: When you mention mentors, did you find them during boot camps or did you have other ways to find mentors? How does it work?

Mikiko Bazeley: With most boot camps, it comes down to picking the right one, honestly. For me, I chose Springboard for my data science transition, and then I used them a little bit for the transition to the MLOps role, but I relied more heavily on the Full Stack Deep Learning course – and a lot of independent study and work too.

I didn’t finish the Springboard one for MLOps, because I’d gotten a couple of job offers by that point for four or five different companies for an MLOps engineer role.

Finding a job after a boot camp and social media presence

Piotr: And was it because of the boot camp? Because you said, many people use boot camps to find jobs. How did it work in your case?

Mikiko Bazeley: The boot camp didn’t put me in contact with hiring managers. What I did do was, and this is where having public branding comes into play.

I definitely don’t think I’m an influencer. For one, I don’t have the audience size for that. What I try to do, very similar to what a lot of the folks here right now on the podcast do, is to try to share my learnings with people. I try to take my experiences and then frame them like “Okay, yes, these kinds of things can happen, but this is also how you can deal with it”.

I think building in public and sharing that learning was just so crucial for me to get a job. I see so many of these job seekers, especially on the MLOps side or the ML engineer side. 

You see them all the time with a headline like: “data science, machine learning, Java, Python, SQL, or blockchain, computer vision.” 

It’s two things. One, they’re not treating their LinkedIn profile as a website landing page. But at the end of the day, that’s what it is, right? Treat your landing page well, and then you might actually retain visitors, similar to a website or a SaaS product.

But more importantly, they’re not actually doing the important thing that you do with social networks, which is you have to actually engage with people. You have to share with folks. You have to produce your learnings. 

So as I was going through the boot camps, that’s what I would essentially do. As I learned stuff and worked on projects, I would combine that with my experiences, and I would just share it out in public.

I would just try to be really – I don’t wanna say authentic, that’s a little bit of an overused term – but there’s the saying, “Interesting people are interested.” You have to be interested in the problems, the people, and the solutions around you. People can connect with that. If you’re just faking it like a lot of ChatGPT and generative AI folks are – faking it with no substance – people can’t connect.

You need to have that real interest, and you need to have something with it. So that’s how I did that. I think most people don’t do that.

Piotr: There is one more factor that is needed. I’m struggling with it when it comes to sharing. I’m learning different stuff, but once I learn it, then it sounds kind of obvious, and then I’m kind of ashamed that maybe it’s too obvious. And then I just think: Let’s wait for something more sophisticated to share. And that never comes.

Mikiko Bazeley: The impostor syndrome.

Piotr: Yeah. I need to get rid of it.

Mikiko Bazeley: Aurimas, do you feel like you ever got rid of the impostor syndrome?

Aurimas: No, never. 

Mikiko Bazeley: I don’t. I just find ways around it. 

Aurimas: Everything that I post, I think it’s not necessarily worth other people’s time, but it looks like it is.

Mikiko Bazeley: It’s almost like you just have to set up things to get around your worst nature. All your insecurities – you just have to trick yourself like a good diet and workout.

What is FeatureForm, and different types of other feature stores

Aurimas: Let’s talk a little bit about your current work, Miki. You’re the Head of MLOps at FeatureForm. Once, I had a chance to talk with the CEO of FeatureForm and he left me with a good impression about the product.

What is FeatureForm? How is FeatureForm different from other players in the feature store market today?

Mikiko Bazeley: I think it comes down to understanding the different types of feature stores that are out there, and even understanding why a virtual feature store is maybe just a terrible name for what FeatureForm is category-wise; it’s not very descriptive.

There are three types of feature stores. Interestingly, they roughly correspond to the waves of MLOps and reflect how different paradigms have developed. 

The three types are: 

  1. Literal,
  2. Physical,
  3. Virtual.

Most people understand literal feature stores intuitively. A literal feature store is literally just a feature store. It will store the features (including definitions and values) and then serve them. That’s pretty much all it does. It’s almost like a very specialized data storage solution.

For example, Feast. Feast is a literal feature store. It’s a very lightweight option you can implement easily, which means implementation risk is low. There’s essentially no transformation, orchestration, or computation going on.

Piotr: Miki, if I may, why is it lightweight? I understand that a literal feature store stores features. It kind of replaces your storage, right?

Mikiko Bazeley: When I say lightweight, I mean kind of like implementing Postgres. So, technically, it’s not super lightweight. But if we compare it to a physical feature store and put the two on a spectrum, it is.

A physical feature store has everything:

  • It stores features, 
  • It serves features,
  • It orchestrates features,
  • It does the transformations.

In that respect, a physical feature store is heavyweight in terms of implementation, maintenance, and administration.

Piotr: On the spectrum, the physical feature store is the heaviest?

And in the case of a literal feature store, the transformations are implemented somewhere else and then saved?

Mikiko Bazeley: Yes.

Aurimas: And the feature store itself is just a library, which is basically performing actions against storage. Correct?

Mikiko Bazeley: Yes, well, that’s almost an implementation detail. But yeah, for the most part. Feast, for example, is a library. It comes with different providers, so you do have a choice.

Aurimas: You can configure it against S3, DynamoDB, or Redis, for example. The lightness, I guess, comes from it being just a thin library on top of this storage, and you manage the storage yourself.

Mikiko Bazeley: 100%. 

Piotr: So there is no backend? There’s no component that stores metadata about this feature store?

Mikiko Bazeley: In the case of the literal feature store, all it does is store features and metadata. It won’t actually do any of the heavy lifting of the transformation or the orchestration.
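To make that concrete, here is a minimal sketch of what reading pre-computed features from a literal feature store like Feast can look like (the feature view, feature, and entity names are made up, and the exact API depends on the Feast version):

```python
from feast import FeatureStore

# Point at an existing Feast feature repository (the one holding feature_store.yaml).
store = FeatureStore(repo_path=".")

# Look up already-materialized feature values for one entity.
# The transformations happened upstream; the store only stores and serves.
response = store.get_online_features(
    features=[
        "customer_stats:total_orders",      # hypothetical "feature view:feature" names
        "customer_stats:avg_order_value",
    ],
    entity_rows=[{"customer_id": 42}],      # hypothetical entity key
)

print(response.to_dict())
```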

Piotr: So what is a virtual feature store, then? I understand physical feature stores, this is quite clear to me, but I’m curious what a virtual feature store is. 

Mikiko Bazeley: Yeah, so in the virtual feature store paradigm, we attempt to take the best of both worlds. 

There is a use case for the different types of feature stores. The physical feature stores came out of companies like Uber, Twitter, Airbnb, etc. They were solving really gnarly problems when it came to processing huge amounts of data in a streaming fashion.

The challenge with physical feature stores is that you’re pretty much locked in to your provider or the provider they choose. You can’t actually swap it out. For example, if you wanted to use Cassandra or Redis as what we call the “inference store” or the “online store,” you can’t do that with a physical feature store. Usually, you just take whatever providers they give you. It’s almost like a specialized data processing and storage solution.

With the virtual feature store, we try to take the flexibility of a literal feature store where you can swap out providers. For example, you can use BigQuery, AWS, or Azure. And if you want to use different inference stores, you have that option.

What virtual feature stores do is focus on the actual problems that feature stores are supposed to solve, which is not just versioning, not just documentation and metadata management, and not just serving, but also the orchestration of transformations.

For example, at FeatureForm, we do this because we are Kubernetes native. We’re assuming that data scientists, for the most part, don’t want to write transformations elsewhere. We assume that they want to do stuff they normally would, with Python, SQL, and PySpark, with data frames.

They just want to be able to, for example, wrap their features in a decorator or write them as a class if they want to. They shouldn’t have to worry about the infrastructure side. They shouldn’t have to provide all this fancy configuration and have to figure out what the path to production is – we try to make that as streamlined and simple as possible. 

The idea is that you have a new data scientist that joins the team… 

Everyone has experienced this: you go to a new company, and you basically just spend the first three months trying to look for documentation in Confluence. You’re reading people’s Slack channels to be clear on what exactly they did with this forecasting and churn project.

You’re hunting down the data. You find out that the queries are broken, and you’re like “God, what were they thinking about this?”

Then a leader comes to you, and they’re like, “Oh yeah, by the way, the numbers are wrong. You gave me these numbers, and they’ve changed.” And you’re like, “Oh shoot! Now I need lineage. Oh God, I need to track.” 

The part that really hurts a lot of enterprises right now is regulation. Any company that does business in Europe has to obey GDPR, that’s a big one. But a lot of medical companies in the US, for example, are under HIPAA, which is for medical and health companies. So for a lot of them, lawyers are very involved in the ML process. Most people don’t realize this. 

In the enterprise space, lawyers are the ones who, for example, when they are faced with a lawsuit or a new regulation comes out, they need to go, “Okay, can I track what features are being used and what models?” So those kinds of workflows are the things that we’re really trying to solve with the virtual feature store paradigm.

It’s about making sure that when a data scientist is doing feature engineering, which is really the most heavy and intensive part of the data science process, they don’t have to go to all these different places and learn new languages when the feature engineering is already so hard.

Virtual feature store in the picture of a broader architecture

Piotr: So Miki, let’s look at it from two perspectives. First, from an administrator’s perspective: let’s say we are going to deploy a virtual feature store as part of our tech stack. I need to have storage, like S3 or BigQuery. I would need to have the infrastructure to perform computations. It can be a cluster run by Kubernetes or maybe something else. And then, the virtual feature store is an abstraction on top of storage and a compute component.

Mikiko Bazeley: Yeah, so we actually did a talk at Data Council. We had released what we call a “market map,” but that’s not actually quite correct. We had released a diagram of what we think the ML stack, the architecture should look like.

The way we look at it is that you have computation and storage, which are just things that run across every team. These are what we call layer zero and layer one. These are not necessarily ML concerns because you need computation and storage to run an e-commerce website. So, we’ll use that e-commerce website as an example.

The layer above that is where you have the providers or, for a lot of folks – if you’re a solo data scientist, for example – maybe you just need access to GPUs for machine learning models. Maybe you really like to use Spark, and you have your other serving providers at that layer. So here’s where we start seeing a little bit of the differentiation for ML problems.

Underneath that, you might also have Kubernetes, right? Because that also might be doing the orchestration for the full company. So the virtual feature store goes above your Spark, Ray, and Databricks offering, for example.

Now, above that though, and we’re seeing this now with, for example, the midsize space, there’s a lot of folks who’ve been publishing amazing descriptions of their ML system. For example, Shopify published a blog post about Merlin. There are a few other folks, I think DoorDash has also published some really good stuff.

But now, people are also starting to look at what we call these unified MLOps frameworks. That’s where you have your ZenML, and a few others that are in that top layer. The virtual feature store would fit in between your unified MLOps framework and your providers like Databricks, Spark, and all that. Below that would be Kubernetes and Ray.

Virtual feature stores from an end-user perspective

Piotr: All this was from an architectural perspective. What about the end-user perspective? I assume that when it comes to the end-users of the feature store, at least one of the personas will be a data scientist. How will a data scientist interact with the virtual feature store?

Mikiko Bazeley: So ideally, the interaction would be, I don’t wanna say it would be minimal. But you would use it to the extent that you would use Git. Our principle is to make it really easy for people to do the right thing.

Something I learned when I was at Mailchimp from the staff engineer and tech lead for my team was to assume positive intent – which I think is just such a lovely guiding principle. I think a lot of times there’s this weird antagonism between ML/MLOps engineers, software engineers, and data scientists where it’s like, “Oh, data scientists are just terrible at coding. They’re terrible people. How awful are they?” 

Then data scientists are looking at the DevOps engineers or the platform engineers going, “Why do you constantly create really bad abstractions and really leaky APIs that make it so hard for us to just do our job?” Most data scientists just do not care about infrastructure.

And if they do care about infrastructure, they are just MLOps engineers in training. They’re on the step to a new journey.

Every MLOps engineer can tell a story that goes like, “Oh God, I was trying to debug or troubleshoot a pipeline,” or “Oh God, I had a Jupyter notebook or a pickled model, and my company didn’t have the deployment infrastructure.” I think that’s the origin story of every caped MLOps engineer.

In terms of the interaction, ideally, the data scientists shouldn’t have to be setting up infrastructure like a Spark cluster. What they do need is just the credential information, which should be, I don’t wanna say fairly easy to get, but if it’s really hard for them to get it from their platform engineers, then that is maybe a sign of some deeper communication issues.

But all they would just need to get is the credential information, put it in a configuration file. At that point, we use the term registering at FeatureForm, but essentially it’s mostly through decorators. They just need to kind of tag things like “Hey, by the way, we’re using these data sources. We’re creating these features. We’re creating these training datasets.” Since we offer versioning and we say features are a first-class immutable entity or citizen, they also provide a version and never have to worry about writing over features or having features of the same name.

Let’s say you have two data scientists working on a problem.

They’re doing a forecast for customer lifetime value for our e-commerce example. And maybe it’s “money spent in the first three months of the customer’s journey” or what campaign they came through. If you have two data scientists working on the same logic, and they both submit, as long as the versions are named differently, both of them will be logged against that feature.

That allows us to also provide the tracking and lineage. We help materialize the transformations, but we won’t actually store the data for the features.
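As a rough illustration of the decorator-based registration described here (this is not FeatureForm’s actual API, just a hypothetical sketch of the idea of named, versioned feature definitions):

```python
# Hypothetical sketch of decorator-based feature registration (illustrative only,
# not FeatureForm's actual API). Feature logic stays ordinary Python; the decorator
# records the definition under an explicit name and version, so two data scientists
# can register different versions of the same feature without overwriting each other.

FEATURE_REGISTRY: dict = {}

def register_feature(name: str, version: str, sources: list[str]):
    """Record a transformation definition and its metadata in a registry."""
    def wrapper(fn):
        FEATURE_REGISTRY[(name, version)] = {"sources": sources, "definition": fn}
        return fn
    return wrapper

@register_feature(name="customer_ltv_90d", version="alice_v1", sources=["orders"])
def ltv_first_90_days(orders: list[dict]) -> dict:
    """Money spent in the first three months of each customer's journey."""
    totals: dict = {}
    for row in orders:
        if row["days_since_signup"] <= 90:
            totals[row["customer_id"]] = totals.get(row["customer_id"], 0) + row["amount"]
    return totals
```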

Dataset and feature versioning

Piotr: Miki, a question because you used the term “decorator.” The only decorator that comes to my mind is a Python decorator. Are we talking about Python here?

Mikiko Bazeley: Yes!

Piotr: You also mentioned that we can version features, but when it comes to that, conceptually a dataset is a set of samples, right? And a sample consists of many features. Which leads me to the question of whether you would also version datasets with a feature store.

Mikiko Bazeley: Yes!

Piotr: So what is the glue between versioned features? How can we represent datasets?

Mikiko Bazeley: We don’t version datasets. We’ll version sources, which also include features, with the understanding that you can use features as sources for other models.

You could use FeatureForm with a tool like DVC. That has come up multiple times. We’re not really interested in versioning full data sets. For example, for sources, we can take tables or files. If people made modifications to that source or that table or that file, they can log that as a variation. And we’ll keep track of those. But that’s not really the goal. 

We want to focus more on the feature engineering side. And so what we do is version the definitions. Every feature consists of two components. It’s the values and the definition. Because we create these pure functions with FeatureForm, the idea is that if you have the same input and you push it through the definitions that we’ve stored for you, then we will transform it, and you should ideally get the same output.

Aurimas: If you plug a machine learning pipeline after a feature store and you retrieve a dataset, it’s already a pre-computed set of features that you saved in your feature store. For this, you’d probably need to provide a list of entity IDs, just like all other feature stores require you to do, correct? So you would version this entity ID list plus the computation logic, such that the feature you versioned plus the source equals a reproducible chunk. 

Would you do it like this, or are there any other ways to approach this?

Mikiko Bazeley: Let me just repeat the question back to you:

Basically, what you’re asking is, can we reproduce exact results? And how do we do that?

Aurimas: For a training run, yeah.

Mikiko Bazeley: OK. That goes back to a statement I made earlier. We don’t version the dataset or the data input. We version the transformations. In terms of the actual logic itself, people can register individual features, but they can also zip those features together with a label.

What we guarantee is that whatever you write for your development features, the same exact logic will be mirrored for production. And we do that through our serving client. In terms of guaranteeing the input, that’s where we as a company say, “Hey, you know, there’s so many tools to do that.”

That’s kind of the philosophy of the virtual feature store. A lot of the early waves of MLOps were solving the lower layers, like “How fast can we make this?”, “What’s the throughput?”, “What’s the latency?” We don’t do that. For us, we’re like, “There’s so many great options out there. We don’t need to focus on that.”

Instead, we focus on the parts that we’ve been told are really difficult. For example, minimizing train and serve skew, and specifically, minimizing it through standardizing the logic that’s being used, so that the data scientist isn’t writing their logic in the training pipeline and then having to rewrite it in Spark, SQL, or something like that. I don’t want to say that this is a guarantee for reproducibility, but that’s where we try to at least help out a lot.

With regard to the entity ID: We get the entity ID, for example, from the front-end team as an API call. As long as the entity ID is the same and the feature or features they’re calling are the right version, they should get the same output.

And that’s some of the use cases people have told us about. For example, if they want to test out different kinds of logic, they could:

  • create different versions of the features, 
  • create different versions of the training sets, 
  • feed one version of the data to different models 

They can do ablation studies to see which model performed well and which features did well and then roll it back to the model that performed best.
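Continuing the hypothetical registry sketch from earlier (again, illustrative only, not FeatureForm’s client API), serving then amounts to looking up the registered definition by name and version and applying it, so development and production share exactly the same logic:

```python
# Hypothetical serving-side sketch: the caller passes an entity ID plus a feature
# name and version; the stored definition is applied so the same logic is used
# in development and in production.

def serve_feature(registry: dict, name: str, version: str,
                  entity_id: str, source_rows: list[dict]):
    """Apply the registered definition and return the value for one entity."""
    definition = registry[(name, version)]["definition"]
    return definition(source_rows).get(entity_id)

# Example call an API handler might make (names are made up):
# serve_feature(FEATURE_REGISTRY, "customer_ltv_90d", "alice_v1",
#               entity_id="cust_42", source_rows=orders)
```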

The value of feature stores

Piotr: To sum up, would you agree that when it comes to the value that a feature store brings to the tech stack of an ML team, it brings versioning of the logic behind feature engineering? 

If we have versioned logic for a given set of features that you want to use to train your model, and you save somewhere a pointer to the source data that will be used to compute specific features, then what we are getting is basically dataset versioning.

So on one hand you need to have the source data, and you need to version it somehow, but also you need to version the logic to process the raw data and compute the features.

Mikiko Bazeley: I’d say the three or four main points of the value proposition are definitely versioning of the logic. The second part is documentation, which is a huge part. I think everyone has had the experience where they look at a project and have no idea why someone chose the logic that they did. For example, logic to represent a customer or a contract value in a sales pipeline.

So versioning, documentation, transformation, and orchestration. The way we say it is you “write once, serve twice.” We offer that guarantee. And then, along with the orchestration aspect, there are also things like scheduling. But those are the three main things:

  • Versioning,
  • Documentation, 
  • Minimizing train-serve skew through transformations.

Those are the three big ones that people ask us for.

Feature documentation in FeatureForm

Piotr: How does documentation work?

Mikiko Bazeley: There are two types of documentation. There is, I don’t want to say incidental documentation, but there is documenting through code and assistive documentation.

Assistive documentation is, for example, docstrings. You can explain, “Hey, this is the logic of the function, this is what the terms mean,” and so on. We offer that.

But then there is also documenting through code as much as possible. For example, you have to list the version of the feature or the training set, or the source that you’re using. Trying to break out the type of the resource that’s being created as well. At least for the managed version of FeatureForm, we also offer governance, user access control, and things like that. We also offer lineage of the features. For example, linking a feature to the model that’s being used with it. We try to build in as much documentation through code as possible.

We’re always looking at different ways we can continue to expand the capabilities of our dashboard to assist with the assistive documentation. We’re also thinking of other ways that different members of the ML lifecycle or the ML team – both the ones that are obvious, like the MLOps engineer, data scientists, but also the non-obvious people, like lawyers, can have visibility and access into what features are being used and with what models. Those are the different kinds of documentation that we offer.

ML platform at Mailchimp and generative AI use cases

Aurimas: Before joining FeatureForm as the head of MLOps, you were a machine learning operations engineer at Mailchimp, and you were helping to build the ML platform there, right? What kind of problems were the data scientists and machine learning engineers solving at Mailchimp?

Mikiko Bazeley: There were a couple of things. When I joined Mailchimp, there was already some kind of a platform team there. It was a very interesting situation, where the MLOps and the ML Platform concerns were roughly split across three teams. 

  1. There was the team that I was on, where we were very intensely focused on making tools and setting up the environment for development and training for data scientists, as well as helping out with the actual productionization work. 
  2. There was a team that was focused on serving the live models.
  3. And there was a team that was constantly evolving. They started off as doing data integrations, and then became the ML monitoring team. That’s kind of where they’ve been since I left. 

Generally speaking, across all teams, the problem that we were trying to solve was: How do we provide passive productionization for data scientists at Mailchimp, given all the different kinds of projects they were working on. 

For example, Mailchimp was the first place I had seen where they had a strong use case for business value for generative AI. Anytime a company comes out with generative AI capabilities, the company I benchmark them against is Mailchimp – just because they had such a strong use case for it.

Aurimas: Was it content generation?

Mikiko Bazeley: Oh, yeah, absolutely. It’s helpful to understand what Mailchimp is for additional context. 

Mailchimp is a 20-year-old company. It’s based in Atlanta, Georgia. Part of the reason why it was bought out for so much money was because it’s also the largest… I don’t want to say provider. They have the largest email list in the US because they started off as an email marketing solution. But what most people, I think, are not super aware of is that for the last couple of years, they have been making big moves into becoming sort of like the all-in-one shop for small, medium-sized businesses who want to do e-commerce. 

There’s still email marketing. That’s a huge part of what they do, so NLP is very big there, obviously. But they also offer things like social media content creation, e-commerce, virtual digital storefronts, etc. They essentially tried to position themselves as the front-end CRM for small and medium-sized businesses. They were bought by Intuit to become the front-end of Intuit’s back-of-house operations, such as QuickBooks and TurboTax.

With that context, the goal of Mailchimp is to provide the marketing stuff. In other words, the things that the small mom-and-pop businesses need to do. Mailchimp seeks to make it easier and to automate it.

One of the strong use cases for generative AI they were working on was this: Let’s say you’re a small business owner running a t-shirt or a candle shop. You are the sole proprietor, or you might have two or three employees. Your business is pretty lean. You don’t have the money to afford a full-time designer or marketing person.

You can go to Fiverr, but sometimes you just need to send emails for holiday promotions.

Although that’s low-value work, if you were to hire a contractor to do that, it would be a lot of effort and money. One of the things Mailchimp offered through their creative studio product or services, I forgot the exact name of it, was this:

Say, Leslie of the candle shop wants to send that holiday email. What she can do is go into the creative studio and say, “Hey, here’s my website or shop or whatever, generate a bunch of email templates for me.” The first thing it would do is to generate stock photos and the color palettes for your email.

Then Leslie goes, “Hey, okay, now, give me some templates to write my holiday email, but do it with my brand in mind,” so her tone of voice, her speaking style. It then lists other kinds of details about her shop. Then, of course, it would generate the email copy. Next, Leslie says, “Okay, I want several different versions of this so I can A/B test the email.” Boom! It would do that…

The reason why I think this is such a strong business use case is because Mailchimp is the largest provider. I intentionally don’t say provider of emails because they don’t provide emails, they – 

Piotr: … the sender?

Mikiko Bazeley: Yes, they are the largest secure business for emails. So Leslie has an email list that she’s already built up. She can do a couple of things. Her email list is segmented out – that’s also something Mailchimp offers. Mailchimp allows users to create campaigns based on certain triggers that they can customize on their own. They offer a nice UI for that. So, Leslie has three email lists. She has high spenders, medium spenders, and low spenders.

She can connect the different email templates with those different lists, and essentially, she’s got that end-to-end automation that’s directly tied into her business. For me, that was a strong business value proposition. A lot of it is because Mailchimp had built up a “defensive moat” through the product and their strategy that they’ve been working on for 20 years. 

For them, the generative AI capabilities they offer are directly in line with their mission statement. It’s also not the product. The product is “we’re going to make your life super easy as a small or medium sized business owner who might’ve already built up a list of 10,000 emails and has interactions with their website and their shop”. Now, they also offer segmentation and automation capabilities – you normally have to go to Zapier or other providers to do that.

I think Mailchimp is just massively benefiting from the new wave. I can’t say that for a lot of other companies. Seeing that as an ML platform engineer when I was there was super exciting because it also exposed me early on to some of the challenges of working with not just multi-model ensemble pipelines, which we had there for sure, but also testing and validating generative AI or LLMs.

For example, if you have them in your system or your model pipeline, how do you actually evaluate it? How do you monitor it? The big thing that a lot of teams get super wrong is actually the data product feedback on their models. 

Companies and teams really don’t understand how to integrate that to further enrich their data science machine learning initiatives and also the products that they’re able to offer.

Piotr: Miki, the funny conclusion is that the greetings we are getting from companies during holidays are not only not personalized, but also even the body of the text is not written by a person. 

Mikiko Bazeley: But they are personalized. They’re personalized to your persona.

Generative AI problems at Mailchimp and feedback monitoring

Piotr: That’s fair. Anyways, you said something very interesting: “Companies don’t know how to treat feedback data,” and I think with generative AI type of problems, it is even more challenging because the feedback is less structured.

Can you share with us how it was done at Mailchimp? What type of feedback was it, and what did your teams do with it? How did it work?

Mikiko Bazeley: I will say that when I left, the monitoring initiatives were just getting off the ground. Again, it’s helpful to understand the context with Mailchimp. They’re a 20-year-old, privately owned company that never had any VC funding.

They still have physical data centers that they rent, and they own server racks. They had only started transitioning to the cloud a relatively short time ago – maybe less than eight years ago or closer to six.

This is a great decision that maybe some companies should think about. Rather than moving the entire company to the cloud, Mailchimp said, “For now, what we’ll do is we’ll move the burgeoning data science and machine learning initiatives, including any of the data engineers that are needed to support those. We’ll keep everyone else in the legacy stack for now.”

Then, they slowly started migrating shards to the cloud and evaluated that. Since they were privately owned and had a very clear north star, they were able to make technology decisions in terms of years as opposed to quarters – unlike some tech companies.

What does that mean in terms of the feedback? It means there’s feedback that’s generated through the product data that is surfaced back up into the product itself – a lot of that was in the core legacy stack.

The data engineers for the data science/machine learning org were mainly tasked with bringing over data and copying data from the legacy stack over into GCP, which was where we were living. The stack of the data science/machine learning folks on GCP was BigQuery, Spanner, Dataflow, and AI Platform Notebooks, which is now Vertex. We were also using Jenkins, Airflow, Terraform, and a couple of others.

But the big role of the data engineers there was getting that data over to the data science and machine learning side. For the data scientists and machine learning folks, there was a latency of approximately one day for the data.

At that point, it was very hard to do things. We could do live service models – which was a very common pattern – but a lot of the models had to be trained offline. We created a live service out of them, exposed the API endpoint, and all that. But there was a latency of about one to two days.

With that being said, something they were working on, for example, was… and this is where the tight integration with product needs to happen.

One piece of feedback that had been given was about creating campaigns – what we call the “journey builder.” A lot of owners of small and medium-sized businesses are the CEO, the CFO, and the CMO; they’re doing it all. They’re like, “This is actually complicated. Can you suggest how to build campaigns for us?” That was feedback that came in through the product.

The data scientist in charge of that project said, “I’m going to build a model that will give a recommendation for the next three steps or the next three actions an owner can take on their campaign.” Then we all worked with the data engineers to go, “Hey, can we even get this data?”

Once again, this is where legal comes into play and says, “Are there any legal restrictions?” And then it’s about essentially getting that into the datasets that could be used in the models.

Piotr: This feedback is not data but more qualitative feedback from the product based on the needs users express, right? 

Mikiko Bazeley: But I think you need both.

Aurimas: You do.

Mikiko Bazeley: I don’t think you can have data feedback without product and front-end teams. For example, a very common place to get feedback is when you share a recommendation, right? Or, for example, Twitter ads. 

You can say, “Is this ad relevant to you?” It’s yes or no. This makes it very simple to offer that option in the UI. And I think a lot of folks think that the implementation of data feedback is very easy. When I say “easy”, I don’t mean that it doesn’t require a strong understanding of experimentation design. But assuming you have that, there are lots of tools like A/B tests, predictions, and models. Then, you can essentially just write the results back to a table. That’s not actually hard. What is hard a lot of times is getting the different engineering teams to sign on to that, to even be willing to set that up.

Once you have that and you have the experiment, the website, and the model that it was attached to, the data part is easy, but I think getting the product buy-in and getting the engineering or the business team on board with seeing there’s a strategic value in enriching our datasets is hard.
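As a minimal, tooling-agnostic sketch of that “write the results back to a table” step (the table and column names are hypothetical):

```python
# Hypothetical sketch: log the prediction that was served together with the user's
# feedback (e.g. the "Is this ad relevant to you?" yes/no) so the dataset can be
# enriched later. Any database would do; SQLite keeps the example self-contained.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("feedback.db")
conn.execute("""CREATE TABLE IF NOT EXISTS prediction_feedback (
    model_version TEXT, entity_id TEXT, prediction TEXT,
    user_feedback TEXT, logged_at TEXT)""")

def log_feedback(model_version: str, entity_id: str, prediction: str, user_feedback: str):
    conn.execute(
        "INSERT INTO prediction_feedback VALUES (?, ?, ?, ?, ?)",
        (model_version, entity_id, prediction, user_feedback,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# e.g. log_feedback("ad_relevance_v3", "user_123", "relevant", "not_relevant")
```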

For example, when I was at Data Council last week, they had a generative AI panel. What I got out of that discussion was that boring data and ML infrastructure matter a lot. They matter even more now.

A lot of this MLOps infrastructure is not going to go away. In fact, it becomes more important. The big discussion there was like, “Oh, we are running out of the public corpus of data to train and fine-tune on.” And what they mean by that is we’re running out of high-quality academic data sets in English to use our models with. So people are like, “Well, what happens if we run out of data sets on the web?” And the answer is it goes back to first-party data – it goes back to the data that you, as a business, actually own and can control.

It was the same discussion that happened when Google said, “Hey, we’re gonna get rid of the ability to track third-party data.” A lot of people were freaking out. If you build that data feedback collection and align it with your machine learning efforts, then you won’t have to worry. But if you’re a company where you’re just a thin wrapper around something like an OpenAI API, then you should be worried because you’re not delivering value no one else could offer.

It’s the same with the ML infrastructure, right?

Getting closer to the business as an MLOps engineer

Piotr: The baseline just went up, but to be competitive, to do something on top, you still need to have something proprietary.

Mikiko Bazeley: Yeah, 100%. And that’s actually where I believe MLOps and data engineers think too much like engineers…

Piotr: Can you elaborate more on that?

Mikiko Bazeley: I don’t want to just say they think the challenges are technical. A lot of times there are technical challenges. But, a lot of times, what you need to get is time, headroom, and investment. A lot of times, that means aligning your conversation with the strategic goals of the business.

I think a lot of data engineers and MLOps engineers are not great with that. I think data scientists oftentimes are better at that.

Piotr: That’s because they need to deal with the business more often, right? 

Mikiko Bazeley: Yeah!

Aurimas: And the developers are not directly providing value…

Mikiko Bazeley: It’s like public health, right? Everyone undervalues public health until you’re dying of a water contamination issue. It’s super important, but people don’t always surface how important it is. More importantly, they approach it from a “this is the best technical solution” perspective as opposed to “this will drive immense value for the company.” Companies really care only about two or three things:

  1. Generating more revenue or profit,
  2. Cutting costs or optimizing them,
  3. A combination of both of the above.

If MLOps and data engineers can’t align their efforts with those, especially around building an ML stack, a business person or even the head of engineering is going to be like, “Why do we need this tool? It’s just another thing people here are not gonna be using.”

The strategy to kind of counter that is to think about what KPIs and metrics they care about. Show the impact on those. The next part is also offering a plan of attack, and a plan for maintenance.

The thing I’ve observed extremely successful ML platform teams do is the opposite of the stories you hear about. A lot of stories you hear about building ML platforms go like, “We created this new thing and then we brought in this tool to do it. And then people just used it and loved it.” This is just another version of, “if you build it, they will come,” and that’s just not what happens.

You have to read between the lines of the story of a lot of successful ML platforms. What they did was to take an area or a stage of the process that was already in motion but wasn’t optimal. For example, maybe they already had a path to production for deploying machine learning models but it just really sucked.

What teams would do is build a parallel solution that was much better and then invite or onboard the data scientists to that path. They would do the manual stuff associated with adopting users – it’s the whole “do things that don’t scale,” you know. Do workshops. Help them get their project through the door.

The key point is that you have to offer something that is actually truly better. When data scientists or users have a baseline of, “We do this thing already, but it sucks,” and then you offer them something better – I think there’s a term called “differentiable value” or something like that – you essentially have a user base of data scientists that can do more things.

If you go to a business person or your CTO and say, “We already know we have 100 data scientists that are trying to push models. This is how long it’s taking them. Not only can we cut that time down to half, but we can also do it in a way where they’re happier about it and they’re not going to quit. And it’ll provide X amount more value because these are the initiatives we want to push. It’s going to take us about six months to do it, but we can make sure we can cut down to three months.” Then you can show those benchmarks and measurements as well as offer a maintenance plan.

A lot of these conversations are not about technical supremacy. It’s about how to socialize that initiative, how to align it with your executive leaders’ concerns, and do the hard work of getting the adoption of the ML platform.

Success stories of the ML platform capabilities at Mailchimp

Aurimas: Do you have any success stories from Mailchimp? What practices would you suggest in communicating with machine learning teams? How do you get feedback from them?

Mikiko Bazeley: Yeah, absolutely. There’s a couple of things we did well. I’ll start with Autodesk for context.

When I was working at Autodesk, I was in a data scientist/data analyst hybrid role. Autodesk is a design-oriented company. They make you take a lot of classes on things like design thinking and how to collect user stories. That’s something I had also learned in my anthropology studies: How do you create what they call ethnographies, which is like, “How do you go to people, learn about their practices, understand what they care about, speak in their language?”

That was the first thing that I did there on the team. I landed there and was like, “Wow, we have all these tickets in Jira. We have all these things we could be working on.” The team was working in all these different directions, and I was like, “Okay, first off, let’s just make sure we all have the same baseline of what’s really important.”

So I did a couple of things. The first was to go back through some of the tickets we had created. I went back through the user stories, talked to the data scientists, talked to the folks on the ML platform team, and created a process to gather this feedback: let’s all independently score or group the feedback, and let’s “t-shirt size” the efforts. From there, we could establish a rough roadmap or plan.

One of the things we identified was templating. The templating was a little bit confusing. More importantly, this was around the time the M1 Mac was released, which had broken a bunch of stuff for Docker. Part of the templating tool was essentially to create a Docker image and to populate it with whatever configurations were needed based on the type of machine learning project they were doing. What we wanted to get away from was local development.

All of our data scientists were doing work in our AI Platform notebooks. And then they would have to pull down the work locally, then they would have to push that work back to a separate GitHub instance, and all sorts of stuff like that. We wanted to really simplify this process as much as possible and specifically wanted to find a way to connect the AI Platform notebook.

You would create a template within GCP, which you could then push out to GitHub, which would then trigger the CI/CD and eventually trigger the deployment process. That was a project I worked on. And it looks like it did help. I worked on the V1 of that, and then additional folks took it and matured it even further. Now, data scientists ideally don’t have to go through that weird push-pull from remote to local during development.

That was something that to me was just a really fun project because I kind of had this impression of data scientists, and even in my own work, that you develop locally. But it was a little bit of a disjointed process. There were a couple of other things too, but that back-and-forth between remote and local development was the big one. That was a hard process too, because we had to think about how to connect it to Jenkins and then how to get around the VPC and all that.
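As a purely illustrative sketch of the “populate a template based on the project type” idea (not Mailchimp’s actual tooling), the core of such a templating step can be as small as:

```python
# Hypothetical sketch: pick a base image and dependencies per project type and
# render a Dockerfile for the new repository; CI/CD (Jenkins in Mailchimp's case)
# would then build and deploy whatever the template produced.

PROJECT_TYPES = {
    "batch_training": {"base": "python:3.10-slim", "deps": ["scikit-learn", "pandas"]},
    "live_service":   {"base": "python:3.10-slim", "deps": ["fastapi", "uvicorn"]},
}

def render_dockerfile(project_type: str) -> str:
    cfg = PROJECT_TYPES[project_type]
    return (
        f"FROM {cfg['base']}\n"
        f"RUN pip install {' '.join(cfg['deps'])}\n"
        "COPY . /app\n"
        "WORKDIR /app\n"
    )

print(render_dockerfile("live_service"))
```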

A book that I’ve been reading recently that I really love is called “Kill It With Fire” by Marianne Bellotti. It’s about how to update legacy systems, how to modernize them without throwing them away. That was a lot of the work I was doing at Mailchimp.

Up until this point in my career, I was used to working at startups where the ML initiative was really new and you had to build everything from scratch. I hadn’t understood that when you’re building an ML service or tool for an enterprise company, it’s a lot harder. You have a lot more constraints on what you can actually use.

For example, we couldn’t use GitHub Actions at Mailchimp. That would have been nice, but we couldn’t. We had an existing templating tool and a process that data scientists were already using. It existed, but it was suboptimal. So how would we optimize an offering that they would be willing to actually use? A lot of learnings from it, but the pace in an enterprise setting is a lot slower than what you could do either at a startup or even as a consultant. So that’s the one drawback. A lot of times, the number of projects you can work on is about a third of what you could do someplace else, but it was very fascinating.

Team structure at Mailchimp

Aurimas: I’m very interested to learn whether the data scientists were the direct users of your platform or if there were also machine learning engineers involved in some way – maybe embedded into the product teams?

Mikiko Bazeley: There’s two answers to that question. Mailchimp had a design- and engineering-heavy culture. A lot of the data scientists who worked there, especially the most successful ones, had prior experience as software engineers. Even if the process was a little bit rough, a lot of times they were able to find ways to kind of work with it.

But, in the last two, three years, Mailchimp started hiring data scientists that were more on the product and business side. They didn’t have experience as software engineers. This meant they needed a little bit of help. Thus, each team that was involved in MLOps or the ML platform initiatives had what we called “embedded MLOps engineers.”

They were kind of close to an ML engineering role, but not really. For example, they weren’t building the models for data scientists. They were literally only helping with the last mile to production. The way I usually like to think of an ML engineer is as a full-stack data scientist. This means they’re writing up features and developing the models. We had folks that were just there to help the data scientists get their project through the process, but they weren’t building the models.

Our core users were data scientists, and they were the only ones. We had folks that would help them out with things such as answering tickets, Slack questions, and helping to prioritize bugs. That would then be brought back to the engineering folks that would work on it. Each team had this mix of people that would focus on developing new features and tools and people that had about 50% of their time assigned to helping the data scientists.

Intuit had acquired Mailchimp about six months before I left, and it usually takes about that long for changes to actually start kicking in. I think what they have done is to restructure the teams so that a lot of the enablement engineers are now on one team and the platform engineers are on another team. But before, while I was there, each team had a mix of both.

Piotr: So there was no central ML platform team?

Mikiko Bazeley: No. It was essentially split along training and development, and then serving, and then monitoring and integrations.

Aurimas: It’s still a central platform team, but made up of multiple streamlined teams. They’re kind of part of a platform team, probably providing platform capabilities, like in team topologies.

Mikiko Bazeley: Yeah, yeah.

Piotr: Did they share a tech stack and processes, or did each ML team with data scientists and support people have their own realm, own tech stack, and own processes? Or did you have initiatives to share some basics – for example, you mentioned templates being used across teams?

Mikiko Bazeley: Most of the stack was shared. I think the team topologies way of describing teams in organizations is actually fantastic. Because there were four types of teams, right? There are the stream-aligned teams, which in this case are data science and product. You have complicated subsystem teams, which are the Terraform team or the Kubernetes team, for example. And then you have enablement and platform.

Each team was a mix of platform and enablement. For example, the resources that we did share were BigQuery, Spanner, and Airflow. But the difference is – and I think this is something that a lot of platform teams actually miss – the goal of the platform team isn’t always to own a specific tool or a specific layer of the stack. A lot of times, if you are so big that you have those specializations, the goal of the platform team is to piece together not just the existing tools, but occasionally also bring new tools into a unified experience for your end user – which for us were the data scientists. Even though we shared BigQuery, Airflow, and all that great stuff, other teams were using those resources as well. But they might not be interested, for example, in deploying machine learning models to production. They might not actually be involved in that aspect at all.

What we did was to say, “Hey, we’re going to essentially be your guides to enable these other internal tools. We’re going to create and provide abstractions.” Occasionally, we would also bring in tools that we thought were necessary. For example, a tool that was not used by the serving team was Great Expectations. They didn’t really touch that because it’s something that you would mostly use in development and training – you wouldn’t really use Great Expectations in production.

There were a couple of other things too… Sorry. I can’t think of them all off the top of my head, but there were three or four other tools the data scientists needed to use in development and training, but they didn’t need them for production. We would incorporate those tools into the paths to production.

The serving layer was a thin Python client that would take the Docker containers or images that were being used for the models. It was then exposed as an API endpoint so that teams up front could route any of the requests to get predictions from the models.
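For illustration, a thin serving client like that could be sketched with FastAPI. This is a minimal, hypothetical example – the model path, feature shape, and endpoint name are assumptions, not Mailchimp’s actual code.

```python
# Hypothetical sketch of a thin model-serving client (not Mailchimp's actual code).
import pickle
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assume the model's Docker image ships a serialized artifact at this path.
MODEL_PATH = "/models/model.pkl"
with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)

class PredictionRequest(BaseModel):
    features: List[float]  # feature vector prepared upstream

class PredictionResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # Delegate to the model; upstream teams route requests to this endpoint.
    prediction = model.predict([request.features])[0]
    return PredictionResponse(prediction=float(prediction))
```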


The pipelining stack

Piotr: Did you use any pipelining tools? For instance, to allow automatic or semi-automatic retraining of models. Or would data scientists just train a model, package it into a Docker image and then it was kind of closed?

Mikiko Bazeley: We had projects that were in various stages of automation. Airflow was a big tool that we used. That was the one that everyone in the company used across the board. The way we interacted with Airflow was as follows: With Airflow, a lot of times you have to go and write your own DAG and create it. Quite often, that can actually be automated, especially if it’s just running the same type of machine learning pipeline that was built into the cookiecutter template. So we said, “Hey, when you’re setting up your project, you go through a series of interview questions. Do you need Airflow? Yes or no?” If they said “yes”, then that part would get filled out for them with the relevant information on the project and all that other stuff. And then it would substitute in the credentials.
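As a rough illustration of that “interview question” flow, project creation might call cookiecutter with the answers pre-filled. The template URL, question names, and the idea of conditionally rendering a pre-filled Airflow DAG are hypothetical here, not Mailchimp’s internal tooling.

```python
# Hypothetical sketch of templated ML project creation (names and template are made up).
from cookiecutter.main import cookiecutter

answers = {
    "project_name": "churn-propensity",
    "pattern": "batch_prediction",  # e.g. "batch_prediction" or "live_service"
    "needs_airflow": "yes",         # if "yes", the template renders a pre-filled DAG
}

# Render the project; with no_input=True, the "interview" is answered programmatically,
# and the template can substitute project info and credentials into the generated files.
cookiecutter(
    "https://example.com/ml-project-template.git",
    no_input=True,
    extra_context=answers,
)
```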

Piotr: How did they know whether they needed it or not?

Mikiko Bazeley: That is actually something that was part of the work of optimizing the cookiecutter template. When I first got there, data scientists had to fill out a lot of these questions. Do I need Airflow? Do I need XYZ? And for the most part, a lot of times they would have to ask the enablement engineers “Hey, what should I be doing?”

Sometimes there were projects that needed a little bit more of a design consultation, like “Can we support this model or this system that you’re trying to build with the existing paths that we offer?” And then we would help them figure that out, so that they could go on and set up the project.

It was a pain when they would set up the project and then we’d look at it and go, “No, this is wrong. You actually need to do this thing.” And they would have to rerun the project creation. Something that we did as part of the optimization was to say, “Hey, just pick a pattern and then we’ll fill out all the configurations for you”. Most of them could figure it out pretty easily. For example, “Is this going to be a batch prediction job where I just need to copy values? Is this going to be a live service model?” Those two patterns were pretty easy for them to figure out, so they could go ahead and say, “Hey, this is what I want.” They could just use the image that was designed for that particular job.

The template process would run, and then they could just fill it out: “Oh, this is the project name, yada, yada…” They didn’t have to fill out the Python version. We would automatically set it to the most stable, up-to-date version, but if they needed version 3.2 while Python is at 3.11, they could specify that. Other than that, ideally, they should be able to do their jobs of writing the features and developing the models.

The other cool part was that we had been looking at offering them native Streamlit support. That was a common part of the process as well. Data scientists would create the initial models. And then they would create a Streamlit dashboard. They would show it to the product team and then product would use that to make “yes” or “no” decisions so that the data scientists could proceed with the project.

More importantly, if new product folks joined and were interested in a model – looking to understand how the model worked or what capabilities the models offered – they could go to that Streamlit library, or the data scientists could send them the link to it, and they could go through and quickly see what a model did.
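A dashboard like that can be as simple as the following Streamlit sketch – the model loading and input fields are placeholders, not the actual Mailchimp dashboards.

```python
# Minimal Streamlit sketch for showing a model to product stakeholders (hypothetical).
import pickle

import streamlit as st

st.title("Churn propensity model – demo")

# Assume a serialized model artifact is available locally for the demo.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

tenure_months = st.slider("Tenure (months)", 0, 60, 12)
monthly_spend = st.number_input("Monthly spend ($)", value=50.0)

if st.button("Predict"):
    score = model.predict([[tenure_months, monthly_spend]])[0]
    st.write(f"Predicted churn propensity: {score:.2f}")
```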

Aurimas: This sounds like a UAT environment, right? User acceptance tests in pre-production.

Piotr: Maybe more like “tech stack on demand”? Like you specify what’s your project and you’re getting the tech stack and configuration. An example of how similar projects were done that had the same setup.

Mikiko Bazeley: Yeah, I mean, that’s kind of how it should be for data scientists, right?


Piotr: So you were not only providing a one-size-fits-all tech stack for Mailchimp’s ML teams, but they had a selection. They were able to have a more personalized tech stack per project.

Size of the ML organization at Mailchimp

Aurimas: How many paths did you support? Because I know that I’ve heard of teams whose only job basically was to bake new template repositories daily to support something like 300 use cases.

Piotr: How big was that team? And how many ML models did you have?

Mikiko Bazeley: The data science team was anywhere from 20 to 25, I think. And in terms of the engineering side of the house, there were six on my team, there might’ve been six on the serving team, and another six on the data integrations and monitoring team. And then we had another team that was the data platform team. So they’re very closely associated with what you would think of as data engineering, right?

They maintained and owned the copying of data from Mailchimp’s legacy stack over to BigQuery and Spanner. There were a couple of other things that they did, but that was the big one. They also made sure that the data was available for analytics use cases.

And there were people using that data that were not necessarily involved in ML efforts. That team was another six to eight. So in total, we had about 24 engineers for 25 data scientists plus however many product and data analytics folks that were using the data as well.

Aurimas: Do I understand correctly that you had 18 people in the various platform teams for 25 data scientists? You said there were six people on each team.

Mikiko Bazeley: The third team was spread out across several projects – monitoring was the most recent one. They didn’t get involved with the ML platform initiatives until around three months before I left Mailchimp.

Prior to that, they were working on data integrations, which meant they were much more closely aligned with the efforts on the analytics and engineering side – these were totally different from the data science side.

I think that they hired more data scientists recently. They’ve also hired more platform engineering folks. And I think what they’re trying to do is to align Mailchimp more closely with Intuit, Quickbooks in particular. They’re also trying to continuously build out more ML capabilities, which is super important in terms of Mailchimp’s and Intuit’s long-term strategic vision. 

Piotr: And Miki, do you remember how many ML models you had in production when you worked there?

Mikiko Bazeley: I think the minimum was 25 to 30. But they were definitely building out a lot more. And some of those models were actually ensemble models, ensemble pipelines. It was a pretty significant amount.

The hardest part that my team was solving for, and that I was working on, was crossing the chasm between experimentation and production. With a lot of stuff that we worked on while I was there, including optimizing the templating project, we were able to significantly cut down the effort to set up projects and the development environment.

I wouldn’t be surprised if they’ve, I don’t wanna say doubled that number, but at least significantly increased the number of models in production.

Piotr: Do you remember how long it typically took to go from an idea to solve a problem using machine learning to having a machine learning model in production? What was the median or average time?

Mikiko Bazeley: I don’t like the idea of measuring from idea, because there are a lot of things that can happen on the product side. But assuming everything went well with the product side and they didn’t change their minds, and assuming the data scientists weren’t super overloaded, it might still take them a few months. Largely this was due to doing things like validating logic – that was a big one – and getting product buy-in.

Piotr: Validating logic? What would that be? 

Mikiko Bazeley: For example, validating the data set. By validating, I don’t mean quality. I mean semantic understanding, creating a bunch of different models, creating different features, sharing that model with the product team and with the other data science folks, making sure that we had the right architecture to support it. And then, for example, things like making sure that our Docker images supported GPUs if a model needed that. It would take at least a couple of months.

Piotr: I was about to ask about the key factors. What took the most time?

Mikiko Bazeley: Initially, it was struggling with the end-to-end experience. It was a bit rough to have different teams. That was the feedback that I had collected when I first got there. 

Essentially, data scientists would go to the development and training environment team, and then they would go to serving and deployment and would then have to work with a different team. One piece of feedback was: “Hey, we have to jump through all these different hoops and it’s not a super unified experience.”

The other part we struggled with was the strategic roadmap. For example, when I got there, different people were working on completely different projects and sometimes it wasn’t even visible what these projects were. Sometimes, a project was less about “How useful is it for the data scientists?” but more like “Did the engineer on that project want to work on it?” or “Was it their pet project?” There were a bunch of those.

By the time I left, the tech lead there, Emily Curtin – she is super awesome, by the way, she’s done some awesome talks about how to enable data scientists with GPUs. Working with her was fantastic. My manager at the time, Nadia Morris, who’s still there as well, between the three of us and the work of a few other folks, we were able to actually get better alignment in terms of the roadmap to actually start steering all the efforts towards providing that more unified experience.

For example, there were other practices too, where some of these engineers who had their pet projects would build something over a period of two, three nights, and then they would ship it to the data scientists without any testing, without anything, and they’d be like, “Oh yeah, data scientists, you have to use this.”

Piotr: It is called passion *laughs*

Mikiko Bazeley: It’s like, “Wait, why didn’t you first have us create a period of testing internally?” And then, you know, now we need to help the data scientists because they’re having all these problems with these pet project tools.

We could have buttoned it up. We could have made sure it was free of bugs. And then, we could have set it up like an actual enablement process where we create some tutorials or write-ups or we host office hours where we show it off.

A lot of times, the data scientists would look at it and they’d be like, “Yeah, we’re not using this, we’re just going to keep doing the thing we’re doing because even if it’s suboptimal, at least it’s not broken.”

Golden paths at Mailchimp

Aurimas: Was there any case where something was created inside of a stream-aligned team that was so good that you decided to pull it into the platform as a capability?

Mikiko Bazeley: That’s a pretty good question. I don’t. I don’t think so, but a lot of times the data scientists, especially if there were some senior ones who were really good, they would go out and try out tools and then they would come back to the team and say “Hey, this looks really interesting.” I think that’s pretty much what happened when they were looking at WhyLabs, for example.

And that’s I think how that happened. There were a few others but for the most part we were building a platform to make everyone’s lives easier. Sometimes that meant sacrificing a little bit of newness and I think this is where platform teams sometimes get it wrong. 

Spotify had a blog post about this, about golden paths, right? They had a golden path, a silver path, and a bronze path or a copper path or something.

The golden path was supported best. “If you have any issues with this, this is what we support, this is what we maintain. If you have any issues with this, we will prioritize that bug, we will fix it.” And it will work for like 85% of use cases, 85 to 90%.

The silver path includes elements of the golden path, but there are some things that are not really or directly supported, but we are consulted and informed on. If we think we can pull it into the golden path, then we will, but there have to be enough use cases for it.

At that point, it becomes a conversation about “where do we spend engineering resources?” Because, for example, there are some projects like Creative Studio, right? It is super innovative. It was also very hard to support. But Mailchimp said, “Hey, we need to offer this, we need to use generative AI to help streamline our product offering for our users.” Then it becomes a conversation of, “Hey, how much of our engineers’ time can we open up or free up to do work on this system?”

And even then, with those sets of projects, there’s not as much difference in terms of infrastructure support that’s needed as people would think. I think especially with generative AI and LLMs, where you get the biggest infrastructure and operational impact is latency, that’s a huge one. The second part is data privacy – that’s a really, really big one. And then the third is the monitoring and evaluation piece. But for a lot of the other stuff… Upstream, it would still line up with, for example, an NLP-based recommendation system. That’s not really going to significantly change as long as you have the right providers providing the right needs. 

So we had a golden path, but you could also have some silver paths. And then you had people that would kind of just go and do their own thing. We definitely had that. We had the cowboys and cowgirls and cow people – they would go offroad.

At that point, you can say, “You can do that, but it’s not going to be in production on the official models”, right? And you try your best, but I think that’s also when you, as a platform team, have to look at it and wonder whether it’s because of this person’s personality that they’re doing that, or whether it’s truly because there’s a friction point in our tooling. And if you only have one or two people out of 25 doing it, it’s like, “Eh, it’s probably the person.” It’s probably not the platform.

Piotr: And it sounds like a situation where your education comes into the picture!

Closing remarks

Aurimas: We’re actually already 19 minutes past our agreed time. So before closing the episode, maybe you have some thoughts that you want to leave our listeners with? Maybe you want to say where they can find you online.

Mikiko Bazeley: Yeah, sure. So folks can find me on LinkedIn and Twitter. I have a Substack that I’ve been neglecting, but I’m gonna be revitalizing that. So folks can find me on Substack. I also have a YouTube channel that I’m also revitalizing, so people can find me there. 

In terms of other last thoughts, I know that there are a lot of people that have a lot of anxiety and excitement about all the new things that have been going on in the last six months. Some people are worried about their jobs.

Piotr: You mean foundation models?

Mikiko Bazeley: Yeah, foundation models, but there’s also a lot going on in the ML space. My advice to people would be that, one, all the boring ML and data infrastructure knowledge is more important than ever. It’s always great to have a strong skill set in data modeling, in coding, in testing, in best practices – that will never be devalued.

The second word of advice, regardless of whatever title you have or want to have: focus on getting your hands on projects, understanding the adjacent areas, and, yeah, learn to speak business.

If I have to be really honest, I’m not the best engineer or data scientist out there. I’m fully aware of my weaknesses and strengths, but the reason I was able to make so many pivots in my career and the reason I was able to get as far as I did is largely because I try to understand the domain and the teams I work with, especially the revenue centers or the profit centers, that’s what people call it. That is super important. That’s a skill. A people skill and body of knowledge that people should pick up.

And people should share their learnings on social media. It’ll get you jobs and sponsorships.

Aurimas: Thank you for your thoughts and thank you for dedicating your time to speak with us. It was really amazing. And thank you to everyone who has listened. See you in the next episode!

Learnings From Building the ML Platform at Stitch Fix https://neptune.ai/blog/learnings-from-building-ml-platform-at-stitch-fix Thu, 03 Aug 2023 11:24:14 +0000 https://neptune.ai/?p=27076 This article was originally an episode of the ML Platform Podcast, a show where Piotr Niedźwiedź and Aurimas Griciūnas, together with ML platform professionals, discuss design choices, best practices, example tool stacks, and real-world learnings from some of the best ML platform professionals.

In this episode, Stefan Krawczyk shares his learnings from building the ML Platform at Stitch Fix.

You can watch it on YouTube:

Or Listen to it as a podcast on: 

But if you prefer a written version, here you have it! 

In this episode, you will learn about: 

  • 1 Problems the ML platform solved for Stitch Fix
  • 2 Serializing models 
  • 3 Model packaging
  • 4 Managing feature requests to the platform
  • 5 The structure of an end-to-end ML team at Stitch Fix

Introduction

Piotr: Hi, everybody! This is Piotr Niedźwiedź and Aurimas Griciūnas from neptune.ai, and you’re listening to ML Platform Podcast. 

Today we have invited a pretty unique and interesting guest, Stefan Krawczyk. Stefan is a software engineer, data scientist, and has been doing work as an ML engineer. He also ran the data platform at his previous company and is a co-creator of the open-source framework Hamilton.

I also recently found out, you are the CEO of DAGWorks.

Stefan: Yeah. Thanks for having me. I’m excited to talk with you, Piotr and Aurimas.

What is DAGWorks?

Piotr: You have a super interesting background, and you have covered all the important check boxes there are nowadays. 

Can you tell us a little bit more about your current venture, DAGWorks

Stefan: Sure. For those who don’t know DAGWorks, D-A-G is short for Directed Acyclic Graph. It’s a little bit of an homage to how we think and how we’re trying to solve problems. 

We want to stop the pain and suffering people feel with maintaining machine learning pipelines in production. 

We want to enable a team of junior data scientists to write code, take it into production, maintain it, and then when they leave, importantly, no one has nightmares about inheriting their code. 

At a high level, we are trying to make machine learning initiatives more human capital efficient by enabling teams to more easily get to production and maintain their model pipelines, ETLs, or workflows.

Piotr: The value from a high level sounds great, but as we dive deeper, there is a lot happening around pipelines, and there are different types of pains. 

How is it [DAGWorks solution] different from what is popular today? For example, let’s take Airflow, AWS SageMaker pipelines. Where does it [DAGWorks] fit?

Stefan: Good question. We’re building on top of Hamilton, which is an open-source framework for describing data flows. 

In terms of where Hamilton fits – and kind of where we’re starting – it’s helping you model the micro.

Airflow, for example, is a macro orchestration framework. You essentially divide things up into large tasks and chunks, but the software engineering that goes within that task is the thing that you’re generally gonna be updating and adding to over time as your machine learning grows within your company or you have new data sources, you want to create new models, right? 

What we’re targeting first is helping you replace that procedural Python code with Hamilton code that you describe, which I can go into detail a little bit more.

The idea is we want to help you enable a junior team of data scientists to not trip up over the software engineering aspects of maintaining the code within the macro tasks of something such as Airflow. 

Right now, Hamilton is very lightweight. People use Hamilton within an Airflow task. They use us within FastAPI, Flask apps, they can use us within a notebook. 

You could almost think of Hamilton as dbt for Python functions. It gives a very opinionated way of writing Python. At a high level, it’s the layer above.

And then, we’re trying to build out features of the platform and the open-source to be able to take Hamilton data flow definitions and help you auto-generate the Airflow tasks.

To a junior data scientist, it doesn’t matter if you’re using Airflow, Prefect, or Dagster. It’s just an implementation detail. What you use doesn’t help you make better models. It’s just the vehicle you use to run your pipelines.

Why have a DAG within a DAG? 

Piotr: This is procedural Python code. If I understood correctly, it is kind of a DAG inside the DAG. But why do we need another DAG inside a DAG?

Stefan: When you’re iterating on models, you’re adding a new feature, right? 

A new feature roughly corresponds to a new column, right? 

You’re not going to add a new Airflow task just to compute a single feature unless it’s some sort of big, massive feature that requires a lot of memory. The iteration you’re going to be doing is going to be within those tasks. 

In terms of the backstory of how we came up with Hamilton… 

At Stitch Fix, where Hamilton was created – the prior company that I worked at – data scientists were responsible for end-to-end development (i.e., going from prototype to production and then being on call for what they took to production). 

The team was essentially doing time series forecasting, where every month or every couple of weeks, they had to update their model to help produce forecasts for the business.

The macro workflow wasn’t changing, they were just changing what was within the task steps. 

But the team was a really old team. They had a lot of code; a lot of legacy code. In terms of creating features, they were creating on the order of a thousand features. 

Piotr: A thousand features?

Stefan: Yeah, I mean, in time series forecasting, it’s very easy to add features every month.

Say there’s a marketing spend, or if you’re trying to model or simulate something. For example, there’s going to be marketing spend next month, how can we simulate demand. 

So they were continually adding to the code, but the problem was it wasn’t engineered in a good way. Adding new things was super slow, and they didn’t have confidence that when they added or changed something, nothing would break.

Rather than having to have a senior software engineer on each pull request to tell them, 

“Hey, decouple things,” 

“Hey, you’re gonna have issues with the way you’re writing,” 

we came up with Hamilton, which is a paradigm where essentially you describe everything as functions, where the function name corresponds exactly to an output. This is because one of the issues was: given a feature, can we map it to exactly one function, make the function name correspond to that output, and in the function’s arguments declare what’s required to compute it?
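For readers who haven’t seen Hamilton, here is a minimal sketch of that paradigm in the style of the project’s documented hello-world example (the column names are made up):

```python
# my_features.py – each function's name is an output; its arguments are its dependencies.
import pandas as pd

def spend_mean(spend: pd.Series) -> float:
    """Average marketing spend."""
    return spend.mean()

def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
    """Marketing spend with the mean removed."""
    return spend - spend_mean

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend normalized by the number of signups."""
    return spend / signups
```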

When you come to read the code, it’s very clear what the output is and what the inputs are. You have the function docstring because with procedural code generally in script form, there is no place to stick documentation naturally.

Piotr: Oh, you can put it above the line, right?

Stefan: It’s not…  you start staring at a wall of text. 

It’s easier from a grokking perspective in terms of just reading functions if you want to understand the flow of things. 

[With Hamilton] you’re not overwhelmed, you have the docstring, a function for documentation, but then also everything’s unit testable by default – they didn’t have a good testing story. 

In terms of the distinction between other frameworks with Hamilton, the naming of the functions and the input arguments stitches together a DAG or a graph of dependencies. 

In other frameworks – 

Piotr: So you do some magic on top of Python, right? To figure it out.

Stefan: Yep!

Piotr: How about working with it? Do IDEs support it?

Stefan: So IDEs? No. It’s on the roadmap to provide more plugins, but essentially, rather than having to annotate a function with a step and then manually specify the workflow from the steps, we short-circuit that with everything through the aspect of naming.

So that’s a long-winded way to say we started at the micro because that was what was slowing the team down. 

By transitioning to Hamilton, they were four times more efficient on that monthly task just because it was a very prescribed and simple way to add or update something.

It’s also clear and easy to know where to add it to the codebase, what to review, understand the impacts, and then therefore, how to integrate it with the rest of the platform.

How do you measure whether tools are adding value? 

Piotr: How do – and I think it is a question that I sometimes hear, especially from ML platform teams and leaders of those teams who need to justify their existence.

As you’ve been running the ML data platform team, how do you do that? How do you know whether the platform we are building, the tools we are providing to data science teams, or data teams are bringing value?

Stefan: Yeah, I mean, hard question, no simple answer.

If you can be data-driven, that is the best. But the hard part is people’s skill sets differ. So if you were to say, measure how long it takes someone to do something, you have to take into account how senior they are, how junior.

But essentially, if you have enough data points, then you can say roughly something on average. It used to take someone this amount of time; now it takes this amount of time. So you get the ratio and the value added there, and then you count how many times that thing happens. Then you can measure human time and, therefore, salary, and say this is how much savings we made – that’s just from looking at efficiencies.
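As a back-of-the-envelope sketch of that calculation (all numbers here are hypothetical):

```python
# Hypothetical back-of-the-envelope estimate of platform value from time saved.
hours_before = 8.0           # average hours a task took without the platform
hours_after = 2.0            # average hours with the platform
times_per_year = 300         # how often that task happens across the org
loaded_hourly_rate = 100.0   # fully loaded cost of an hour of someone's time, in $

hours_saved = (hours_before - hours_after) * times_per_year
savings = hours_saved * loaded_hourly_rate
print(f"Estimated yearly savings: ${savings:,.0f}")  # $180,000 with these numbers
```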

The other way machine learning platforms help is by stopping production fires. You can look at what the cost of an outage is and then work backwards: “Hey, if you prevent these outages, we’ve also provided this type of value.”

Piotr: Got it.

What are some use cases of Hamilton?

Aurimas: Maybe we’re getting one step a little bit back…

To me, it sounds like Hamilton is mostly useful for feature engineering. Do I understand this correctly? Or are there any other use cases?

Stefan: Yeah, that’s where Hamilton’s roots are. If you need something to help structure your feature engineering problem, Hamilton is great if you’re in Python. 

Most people don’t like their pandas code, Hamilton helps you structure that. But with Hamilton, it works with any Python object type. 

Most machines these days are large enough that you probably don’t need an Airflow right away, in which case you can model your end-to-end machine learning pipeline with Hamilton. 

In the repository, we have a few examples of what you can do end-to-end. I think Hamilton is a Swiss Army knife. We have someone from Adobe using it to help manage some prompt engineering work that they’re doing, for example. 

We have someone precisely using it more for feature engineering, but using it within a Flask app. We have other people using the fact that it’s Python-type agnostic and helping them orchestrate a data flow to generate some Python object. 

So very, very broad, but its roots are in feature engineering. It is definitely very easy to extend to a lightweight end-to-end kind of machine learning model. This is where we’re excited about extensions we’re going to add to the ecosystem. For example, how do we make it easy for someone to, say, pick up Neptune and integrate it?

Piotr: And Stefan, this part was interesting because I didn’t expect that and want to double-check. 

Would you also – let’s assume that we do not need a macro-level pipeline like this one run by Airflow, and we are fine with doing it on one machine. 

Would you also include steps that are around training a model, or is it more about data?

Stefan: No, I mean both. 

The nice thing with Hamilton is that you can logically express the data flow. You could do source, featurization, creating training set, model training, prediction, and you haven’t really specified the task boundaries. 

With Hamilton, you can logically define everything end-to-end. At runtime, you only specify what you want computed – it will only compute the subset of the DAG that you request.
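Concretely, requesting only a subset could look like this sketch, following Hamilton’s documented driver pattern (module and column names are made up):

```python
# Sketch of executing only part of a Hamilton DAG (module and column names made up).
import pandas as pd
from hamilton import driver

import my_features  # the module of transform functions sketched earlier

dr = driver.Driver({}, my_features)

inputs = {
    "spend": pd.Series([10.0, 20.0, 30.0]),
    "signups": pd.Series([1, 2, 3]),
}

# Only 'spend_per_signup' and its upstream dependencies are computed – nothing else.
df = dr.execute(["spend_per_signup"], inputs=inputs)
print(df)
```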

Piotr: But what about the for loop of training? Like, let’s say, 1,000 iterations of gradient descent – how would that work inside?

Stefan: You have options there… 

I want to say right now people would stick that within the body of a function – so you’ll just have one function that encompasses that training step. 

With Hamilton, junior people and senior people like it because you have the full flexibility of whatever you want to do within the Python function. It’s just an opinionated way to help structure your code.

Why doesn’t Hamilton have a feature store? 

Aurimas: Getting back to that table in your GitHub repository, a very interesting point that I noted is that you’re saying that you are not comparing to a feature store in any way. 

However, I then thought a little bit deeper about it… The feature store is there to store the features, but it also has this feature definition, like modern feature platforms also have feature compute and definition layer, right? 

In some cases, they don’t even need a feature store. You might be okay with just computing features both at training time and inference time. So I thought, why couldn’t Hamilton be used for that?

Stefan: You’re exactly right. I term it as a feature definition store. That’s essentially what the team at Stitch Fix built – just on the back of Git. 

Hamilton forces you to separate your functions from the context where they run. You’re forced to curate things into modules.

If you want to build a feature bank of code that knows how to compute things with Hamilton, you’re forced to do that – then you can share and reuse those kind of feature transforms in different contexts very easily.

It forces you to align on naming, schema, and inputs. In terms of the inputs to a feature, they have to be named appropriately. 

If you don’t need to store data, you could use Hamilton to recompute everything. But if you need to store data for a cache, you put Hamilton in front of that – use Hamilton’s compute and potentially push it to something like Feast.

Aurimas: I also saw in the, not Hamilton, but DAGWorks website, as you already mentioned, you can train models inside of it as well in the function. So let’s say you train a model inside of Hamilton’s function. 

Would you be able to also somehow extract that model from storage where you placed it and then serve it as a function as well, or is this not a possibility?

Stefan: This is where Hamilton is really lightweight. It’s not opinionated about materialization. So that is where connectors or other things come in, as to where you push actual artifacts.

This is where it’s at a lightweight level. You would ask the Hamilton DAG to compute the model, you get the model out, and then the next line, you would save it or push it to your data store – you could also write a Hamilton function that kind of does that. 

The side effect of running the function is pushing it, but this is where we’re looking to expand and provide more capabilities – to make it more naturally pluggable within the DAG, so you can specify building a model and then, in the context you want to run it in, specify, “I want to save the model and place it into Neptune.”

That’s where we’re heading, but right now, Hamilton doesn’t restrict how you would want to do that.

Aurimas: But could it pull the model and be used in the serving layer?

Stefan: Yes. One of the features of Hamilton is that with each function, you can switch out a function implementation based on configuration or a different module. 

For example, you could have two implementations of the function: one which takes a path to pull from S3 to pull the model, another one that expects the model or training data to be passed in to fit a model. 

There is flexibility in terms of function implementations and to be able to switch them out. In short, Hamilton the framework doesn’t have anything native for that… 

But we have flexibility in terms of how to implement that.
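Hamilton’s function modifiers include a `config.when` decorator for exactly this kind of switching. The sketch below is illustrative – the function and config names are invented, and the exact decorator behavior may differ between versions:

```python
# Sketch of swapping function implementations via configuration (illustrative names).
import pickle

import pandas as pd
from hamilton.function_modifiers import config

@config.when(model_source="s3")
def model__from_path(model_path: str) -> object:
    """Load an already-trained model from a serialized artifact."""
    with open(model_path, "rb") as f:  # stand-in for pulling the artifact from S3
        return pickle.load(f)

@config.when(model_source="train")
def model__from_training(training_data: pd.DataFrame) -> object:
    """Fit a model from training data passed in at runtime."""
    from sklearn.linear_model import LogisticRegression
    features = training_data.drop(columns=["label"])
    return LogisticRegression().fit(features, training_data["label"])

# Which implementation backs the 'model' node is chosen by the driver config,
# e.g. driver.Driver({"model_source": "s3"}, this_module).
```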

Aurimas: You basically could do the end-to-end, both training and serving with Hamilton. 

That’s what I hear.

Stefan: I mean, you can model that. Yes.

Data versioning with Hamilton

Piotr: And what about data versioning? Like, let’s say, simplified form. 

I understand that Hamilton is more on the code side. When we version code, we version, maybe, the recipes for features, right?

Having that, what do you need on top to say, “yeah, we have versioned datasets?”

Stefan: Yeah, you’re right. With Hamilton, you describe your data flow in code. If you store it in Git, or have a structured way to version your Python packages, you can go back at any point in time and understand the exact lineage of computation.

But where the source data lives and what the output is, in terms of dataset versioning, is kind of up to you (i.e. your fidelity of what you want to store and capture). 

If you were to use Hamilton to create some sort of dataset or transform a dataset, you would store that dataset somewhere. If you stored the Git SHA and the configuration that you used to instantiate the Hamilton DAG with, and you store that with that artifact, you could always go back in time to recreate it, assuming the source data is still there. 
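A minimal sketch of storing the Git SHA and DAG configuration next to an artifact (generic Python, not DAGWorks or Stitch Fix code) might look like this:

```python
# Sketch: save a dataset artifact together with the code version and DAG config used.
import json
import subprocess
from pathlib import Path

import pandas as pd

def save_with_lineage(df: pd.DataFrame, out_dir: str, dag_config: dict) -> None:
    """Write the dataset plus enough metadata to recompute it later."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out / "dataset.parquet")

    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    metadata = {"git_sha": git_sha, "dag_config": dag_config}
    (out / "lineage.json").write_text(json.dumps(metadata, indent=2))

# Checking out `git_sha` and re-instantiating the Hamilton DAG with `dag_config`
# later lets you recompute the dataset, assuming the source data still exists.
```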

This comes from building a platform at Stitch Fix – with Hamilton, we have these hooks, or at least the ability to integrate with that. Now, this is part of the DAGWorks platform.

We’re trying to provide precisely a means to store and capture that extra metadata for you so you don’t have to build that component out so that we can then connect it with other systems you might have. 

Depending on your size, you might have a data catalog. Maybe storing and emitting open lineage information, etc. with that. 

Definitely looking for ideas or early stacks to integrate with, but otherwise, we’re not opinionated. Where we can help with dataset versioning is to not only version the data, but, if it’s described in Hamilton, let you go and recompute it exactly, because you know the code path that was used to transform things.

When did you decide Hamilton must be built?  

Aurimas: Maybe moving a little bit back to what you did at Stitch Fix and to Hamilton itself. 

When was the point when you decided that Hamilton needs to be built?

Stefan: Back in 2019. 

We only open-sourced Hamilton 18 months ago. It’s not a new library – it’s been running in Stitch Fix for over three years. 

The interesting part for Stitch Fix is it was a data science organization with over 100 data scientists with various modeling disciplines doing various things for the business.

I was part of the platform team that was engineering for data science. My team’s mandate was to streamline model productionization for teams. 

We thought, “how can we lower the software engineering bar?”

The answer was to give them the tooling abstractions and APIs such that they didn’t have to be good software engineers – MLOps best practices basically came for free. 

There was a team that was struggling, and the manager came to us to talk. He was like, “This code base sucks, we need help, can you come up with anything? I want to prioritize being able to do documentation and testing, and if you can improve our workflow, that’d be great,” which is essentially the requirements, right? 

At Stitch Fix, we had been thinking about “what is the ultimate end user experience or API from a platform to data scientist interaction perspective?” 

I think Python functions are not an object-oriented interface that someone has to implement – just give me a function, and there’s enough metaprogramming you can do with Python to inspect the function and know the shape of it, know the inputs and outputs, you know have type annotations, et cetera.

So, plus one for work from home Wednesdays. Stitch Fix had a no meeting day, I set aside a whole day to think about this problem. 

I was like, “how can I ensure that everything’s unit testable, documentation friendly, and the DAG and the workflow is kind of self-explanatory and easy for someone to kind of describe.” 

In which case, I prototyped Hamilton and took it back to the team. My now co-founder and former colleague at Stitch Fix, Elijah, also came up with a second implementation, which was akin to more of a DAG-style approach.

The team liked my implementation, and essentially the premise was everything being unit testable, documentation friendly, and having a good integration testing story.

With data science code, it’s very easy to append a lot of code to the same scripts, and it just grows and grows and grows. With Hamilton, it’s very easy. You don’t have to compute everything to test something – that was also part of the thought with building a DAG that Hamilton knows to only walk the paths needed for the things you want to compute. 

But that’s roughly the origin story.

We migrated the team and got them onboarded. Pull requests ended up being faster. The team loves it. They’re super sticky. They love the paradigm because it definitely simplified their life more than what it was before.

Using Hamilton for Deep Learning & Tabular Data

Piotr: Previously you mentioned you’ve been working on over 1000 features that are manually crafted, right?

Would you say that Hamilton is more useful in the context of tabular data, or can it also be used for, let’s say, deep learning types of data, where you have a lot of features that are not manually developed?

Stefan: Definitely. Hamilton’s roots and sweet spots are coming from trying to manage and create tabular data for input to a model. 

The team at Stitch Fix manages over 4,000 feature transforms with Hamilton. And I want to say –

Piotr: For one model?

Stefan: For all the models they create, they collectively in the same code base, they have 4,000 feature transforms, which they can add to and manage, and it doesn’t slow them down. 

On the question of other types, I wanna say, “yeah.”  Hamilton is essentially replacing some of the software engineering that you do. It really depends on what you have to do to stitch together a flow of data to transform for your deep learning use case. 

Some people have said, “oh, Hamilton kind of looks a little bit like LangChain.” I haven’t looked at LangChain, which I know is something that people are using for large models to stitch things together. 

So, I’m not quite sure yet exactly where they think the resemblance is, but otherwise, if you had procedural code that you’re using with encoders, there’s likely a way that you can transcribe and use it with Hamilton.

One of the features that Hamilton has is that it has a really lightweight data quality runtime check. If checking the output of a function is important to you, we have an extensible way you can do it. 

If you’re using tabular data, there’s Pandera. It’s a popular library for describing schemas – we have support for that. Otherwise, we have a pluggable way so that if you’re working with other object types, like tensors, you can extend it to ensure the tensor meets the standards you would expect it to have.
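As an illustration of those runtime checks, the sketch below uses Hamilton’s `check_output` modifier with a Pandera schema. It is a hedged example – the schema is invented, and the exact argument names for the Pandera integration may differ across Hamilton versions:

```python
# Sketch of a lightweight runtime data quality check (invented schema; the argument
# names for the Pandera integration are an assumption and may vary by version).
import pandas as pd
import pandera as pa
from hamilton.function_modifiers import check_output

probability_schema = pa.SeriesSchema(float, pa.Check.in_range(0.0, 1.0))

@check_output(schema=probability_schema, importance="warn")
def churn_probability(raw_score: pd.Series) -> pd.Series:
    """Normalize a raw score into a probability-like value between 0 and 1."""
    return (raw_score - raw_score.min()) / (raw_score.max() - raw_score.min())
```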

Piotr: Would you also calculate some statistics over a column or set of columns to, let’s say, use Hamilton as a framework for testing data sets? 

Like, I’m not talking about verifying a particular value in a column, but rather the statistical distribution of your data.

Stefan: The beauty of everything being Python functions and the Hamilton framework executing them is that we have flexibility with respect to, yeah, given output of a function, and it just happens to be, you know, a dataframe. 

Yeah, we could inject something in the framework that takes summary statistics and emits them. Definitely, that’s something that we’re playing around with.

Piotr: When it comes to a combination of columns, like, let’s say that you want to calculate some statistics correlations between three columns, how does it fit to this function representing a column paradigm?

Stefan: It depends on whether you want that to be an actual transform. 

You could just write a function that takes the input or the output of that data frame, and in the body of the function, do that – basically, you can do it manually. 

It really depends. If you’re doing it from a platform perspective and you want to enable data scientists to capture various things automatically, then I would come at it from a platform angle and add a decorator – something that wraps the function and can do the introspection that you want.
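A platform-angle decorator like the one described here can be sketched in plain Python – this is purely illustrative, not a Hamilton or Stitch Fix API:

```python
# Sketch of a decorator that captures summary statistics of a transform's output.
import functools
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def capture_summary_stats(fn):
    """Wrap a transform; if it returns a DataFrame, log describe() and correlations."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if isinstance(result, pd.DataFrame):
            logger.info("summary stats for %s:\n%s", fn.__name__, result.describe())
            logger.info("correlations for %s:\n%s", fn.__name__, result.corr())
        return result
    return wrapper

@capture_summary_stats
def marketing_features(spend: pd.Series, signups: pd.Series) -> pd.DataFrame:
    """Example transform whose output statistics get captured automatically."""
    return pd.DataFrame({"spend": spend, "signups": signups, "ratio": spend / signups})
```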

Why did you open-source Hamilton?

Piotr: I’m going back to a story of Hamilton that started at Stitch Fix. What was the motivation to go open-source with it?

It is something curious for me because I’ve been in a few companies, and there are always some internal libraries and projects that they liked, but yeah, like, it’s not so easy, and not every project is the right candidate for going open and be truly used. 

I’m not talking about adding a license file and making the repo public, but I am talking about making it live and really open.

Stefan: Yeah. My team had purview over build versus buy – we’d been looking across the stack. We created Hamilton back in 2019, and we were seeing very similar-ish things come out and be open-sourced, so we were like, “Hey, I think we have a unique angle.” Of the other tools that we had, Hamilton was the easiest to open-source.

For those who know, Stitch Fix also was very big on branding. If you ever want to know some interesting stories about techniques and things, you can look up the Stitch Fix Multithreaded blog. 

There was a tech branding team that I was part of, which was trying to get quality content out. That helps the Stitch Fix brand, which helps with hiring. 

In terms of motivations, that’s the perspective of branding; set a high-quality bar, and bring things out that look good for the brand. 

And it just so happened that, from our team’s perspective, Hamilton was the easiest to open-source out of the things that we did – and I think it was also more interesting.

We built things similar to MLflow, like configuration-driven model pipelines, but I wanna say that’s not quite as unique. Hamilton is a more unique angle on a particular problem. So with both of those combined, it was like, “Yeah, I think this is a good branding opportunity.”

And then in terms of the surface area of the library, it’s pretty small. You don’t need many dependencies, which makes it feasible to maintain from an open-source perspective. 

The requirements were also relatively low since you just need Python 3.6 – now that 3.6 is sunset, it’s 3.7 – and it just kind of works.

From that perspective, I think it had a pretty good sweet spot: we likely wouldn’t have to add too many things to increase adoption and make it usable for the community, but the maintenance side of it was also kind of small.

The last part was a little bit of an unknown; “how much time would we be spending trying to build a community?” I couldn’t always spend more time on that, but that’s kind of the story of how we open-sourced it. 

I did spend a good couple of months writing a blog post to go with it for launch – that took a bit of time, but it’s always a good means to get your thoughts down and get them clearly articulated.

Launching an open-source product

Piotr: How was the launch when it comes to adoption from the outside? Can you share with us how you promoted it? Did it work from day zero, or did it take some time to make it more popular?

Stefan: Thankfully, Stitch Fix had a blog that had a reasonable amount of readership. I paired that with the blog, in which case, you know, I got a couple of hundred stars in a couple of months. We have a Slack community that you can join. 

I don’t have a comparison to say how well it was compared to something else, but people are adopting it outside of Stitch Fix.  UK Government Digital Services is using Hamilton for a national feedback pipeline. 

There is a guy using it internally at IBM for a small internal search tool kind of product. The problem with open-source is you don’t know who’s using you in production, since telemetry and other things are difficult. People came in, created issues, asked questions, which gave us more energy to be in there and help.

Piotr: What about the first pull request, useful pull request from external guys?

Stefan: So we were fortunate to have a guy called James Lamb come in. He’s been on a few open-source projects, and he’s helped us with the repository documentation and structure. 

Basically, cleaning up and making it easy for an outside contributor to come in and run our tests and things like that. I want to say kind of grunt work but super, super valuable in the long run since he just like gave feedback like, “hey, this pull request template is just way too long. How can we shorten it?” – “you’re gonna scare off contributors.” 

He gave us a few good pointers and helped set up the structure a little bit. It’s repo hygiene that enables other people to contribute more easily.

Stitch Fix biggest challenges 

Aurimas: Yeah, so maybe let’s also get back a little bit to the work you did at Stitch Fix. So you mentioned that Hamilton was the easiest one to open-source, right? If I understand correctly, you were working on a lot more things than that – not only the pipeline. 

Can you go a little bit into what the biggest problems at Stitch Fix were and how you tried to solve them as a platform team?

Stefan: Yeah, so take yourself back six years ago, right? There wasn’t the maturity of open-source tooling available. At Stitch Fix, if data scientists had to create an API for a model, they would be in charge of spinning up their own image on EC2 running some sort of Flask app that then kind of integrated things.

Where we basically started was helping from the production standpoint of stabilization, ensuring better practices. Helping a team that essentially made it easier to deploy backends on top of FastAPI, where the data scientists just had to write Python functions as the integration point.

That helped stabilize and standardize all the kind of backend microservices because the platform now owned what the actual web service was. 

Piotr: So you’re kind of providing Lambda interface to them?

Stefan: You could say a little more heavyweight. Essentially, we made it easy for them to provide a requirements.txt, a base Docker image, and the Git repository where the code lived, and be able to create a Docker container that had the web service and the code built in, and then deploy it on AWS pretty easily.

Aurimas: Do I hear the template repositories maybe? Or did you call them something different here?

Stefan: We weren’t quite using templates, but there were just a few things that people needed to create a microservice and get it deployed. Once that was done, we looked at the various parts of the workflow.

One of the problems was model serialization and “how do you know what version of a model is running in production?” So we developed a little project called the model envelope, where the idea was to do more – much like the metaphor of an envelope, you can stick things in it.

For example, you can stick in the model, but you can also stick a lot of metadata and extra information about it. The issue with model serialization is that you need pretty exact Python dependencies, or you can run into serialization issues.

If you reload models on the fly, you can run into issues where someone pushed a bad model, or it’s not easy to roll back. One of the ways things worked at Stitch Fix – or how they used to work – was that if a new model was detected, it would just automatically be reloaded.

But that was kind of a challenge from an operational perspective, to roll back or test things beforehand. With the model envelope abstraction, the idea was you save your model, then provide some configuration in a UI, and then we could take the new model and auto-deploy a new service, where each model build was a new Docker container, so each service was immutable.

And it provided better constructs to push something out, make it easy to roll back, so we just switched the container. If you wanted to debug something, then you could just pull that container and compare it against something that was running in production. 

It also enabled us to insert a CI/CD-type pipeline without them having to put that into their model pipelines, because with common frameworks right now, at the end of someone’s machine learning pipeline ETL, you do all these kinds of CI/CD checks to qualify a model.

We abstracted that part out and made it something that people could add after they had created a model pipeline. That way it was easier to change and update, and the model pipeline wouldn’t have to be updated if there was a bug and someone wanted to create a new test or something.

And so that’s roughly it. Model envelope was the name of it. It helped users to build a model and get it into production in under an hour.
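The save-side interface Stefan describes might look roughly like the hypothetical sketch below. The `model_envelope` module and all of its argument names are invented for illustration – Stitch Fix’s internal API is not public:

```python
# Hypothetical sketch of a "model envelope" save call (the model_envelope module and
# its arguments are invented for illustration; this is not Stitch Fix's actual API).
import pandas as pd
from sklearn.linear_model import LogisticRegression

import model_envelope  # stand-in for the internal library

X_train = pd.DataFrame({"tenure": [1, 5, 12, 24], "spend": [10.0, 30.0, 55.0, 80.0]})
y_train = pd.Series([0, 0, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

model_envelope.save_model(
    model=model,
    predict_fn=model.predict_proba,              # pointer to the function to expose
    sample_input=X_train.head(2),                # used to introspect the API shape
    sample_output=model.predict_proba(X_train.head(2)),
    metadata={
        "owner": "forecasting-team",
        "python_requirements": "requirements.txt",
    },
)
# Each saved envelope can then become an immutable Docker image that the platform
# deploys as a new service, and rolling back is just switching containers.
```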

We also had the equivalent for the batch side. Usually, if you want to create a model and then run it in batch somewhere, you would have to write the task. We had hooks to make a model run in Spark or on a large box.

People wouldn’t have to write that batch task to do batch prediction. Because at some level of maturity within a company, you start to have teams who want to reuse other teams’ models. In which case, we were the buffer in between, helping provide a standard way for people to kind of take someone else’s model and run in batch without them having to know much about it.
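To make the envelope idea concrete, here is a minimal sketch of what such a save call could look like. The ModelEnvelope class, its fields, and the tag names are hypothetical and invented for illustration; this is not the actual Stitch Fix API.

```python
import pickle
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ModelEnvelope:
    """Hypothetical 'envelope': the model plus everything needed to rebuild and query it."""
    model: Any
    requirements: list          # exact Python dependencies needed to deserialize faithfully
    tags: dict                  # hierarchy/query tags, e.g. {"team": "forecasting", "env": "prod"}
    metadata: dict = field(default_factory=dict)  # sample data pointers, metrics, git SHA, ...

    def save(self, path: str) -> None:
        # Serialize the whole envelope; a real system would store the artifact and metadata separately.
        with open(path, "wb") as f:
            pickle.dump(self, f)

# Stand-in for any picklable trained model.
my_trained_model = {"weights": [0.12, -0.4, 0.7]}

envelope = ModelEnvelope(
    model=my_trained_model,
    requirements=["scikit-learn==1.4.2", "pandas==2.2.2"],
    tags={"team": "forecasting", "region": "us"},
    metadata={"training_data": "s3://bucket/train.parquet", "auc": 0.87},
)
envelope.save("model_envelope.pkl")
```

In the setup Stefan describes, the platform would pick up each saved envelope and build an immutable Docker image for it, so rolling back is just switching containers.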

Serializing models in the Stitch Fix platform

Piotr: And Stefan, talking about serializing a model, did you also serialize the pre and post-processing of features to this model? How, where did you have a boundary? 

Like, and second that is very connected, how did you describe the signature of a model? Like, let’s say it’s a RESTful API, right? How did you do this?

Stefan: When someone saved the model, they had to provide a pointer to an object and the name of the function, or they provided a function directly.

We would use that function, introspect it, and as part of the save-model API, we asked what the input training data was and what a sample output was. That way we could actually exercise the model a little when saving it and introspect a bit more about the API. So if someone passed a pandas DataFrame, we would go, hey, you need to provide some sample data for this DataFrame so we can understand, introspect, and create the function.

From that, we would then create a Pydantic schema on the web service side. So then you could go to, you know, so if you use FastAPI, you could go to the docs page, and you would have a nicely kind of easy to execute, you know, REST-based interface that would tell you what features are required to run this model. 
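Here is a rough sketch of the mechanism Stefan is describing: introspect a sample input DataFrame, derive a Pydantic model from its columns, and expose the prediction function via FastAPI so the /docs page shows which features the model needs. The function and class names are placeholders; the actual Stitch Fix implementation certainly looked different.

```python
import pandas as pd
from fastapi import FastAPI
from pydantic import create_model

def build_request_schema(sample_df: pd.DataFrame):
    """Derive a Pydantic request model from the dtypes of a sample input DataFrame."""
    dtype_to_py = {"int64": int, "float64": float, "bool": bool, "object": str}
    fields = {
        col: (dtype_to_py.get(str(dtype), str), ...)  # '...' marks the field as required
        for col, dtype in sample_df.dtypes.items()
    }
    return create_model("PredictRequest", **fields)

# Assumed to exist: the data scientist's function that takes a one-row DataFrame.
def predict_fn(df: pd.DataFrame) -> float:
    return float(df["feature_a"].iloc[0] * 2 + df["feature_b"].iloc[0])

sample_input = pd.DataFrame({"feature_a": [1.0], "feature_b": [3.0]})
PredictRequest = build_request_schema(sample_input)

app = FastAPI()

@app.post("/predict")
def predict(request: PredictRequest):
    # Rebuild the one-row DataFrame shape the model expects (model_dump is Pydantic v2; use .dict() on v1).
    row = pd.DataFrame([request.model_dump()])
    return {"prediction": predict_fn(row)}
```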

So in terms of what was stitched together in a model, it really depended on, since we were just, you know, we tried to treat Python as a black box in terms of serialization boundaries. 

The boundary was really, you know, knowing what was in the function. People could write a function that included featurization as the first step before delegating to the model, or they had the option to kind of keep both separate and in which case it was then at call time, they would have to go to the feature store first to get the right features that then would be passed to the request to kind of compute a prediction in the web service. 

So we weren't exactly opinionated as to where the boundaries were, but it was something we kept trying to come back to and standardize a bit more. Different use cases have different SLAs and different needs; sometimes it makes sense to stitch featurization together with the model, and sometimes it's easier to pre-compute, and you don't need to stick that with the model.

Piotr: And the interface for the data scientist, like building such a model and serializing this model, was in Python, like they were not leaving Python. It is everything in Python. And I like this idea of providing, let’s say, sample input, sample output. It’s very, I would say, Python way of doing things. Like unit testing, it is how we ensure that the signature is kept.

Stefan: Yeah, and so then from that, like actually from that sample input and output, it was, ideally, it was also actually the training set. And so then this is where we could, you know, we pre-computed summary statistics, as you kind of were alluding to. And so whenever someone saved a model, we tried to provide, you know, things for free. 

Like they didn’t have to think about, you know, data observability, but look, if you provided those data, we captured things about it. So then, if there was an issue, we could have a breadcrumb trail to help you determine what changed, was it something about the data, or was it, hey look, you included a new Python dependency, right? 

And that kind of changes something, right? And so, so, for example, we also introspected the environment that things ran in. So therefore, we could, to the package level, understand what was in there. 

And so then, when we ran model production, we tried to closely replicate those dependencies as much as possible to ensure that, at least from a software engineering standpoint, everything should run as expected.

Piotr: So it sounds like what is called a model packaging solution today. And where did you store those envelopes? I understand that you had the envelope framework, but you had instances of those envelopes that were serialized models with metadata. Where did you store them?

Stefan: Yeah. I mean pretty basic, you could say S3, so we store them in a structured manner on S3, but you know, we paired that with a database which had the actual metadata and pointer. So some of the metadata would go out to the database, so you could use that for querying. 

We had a whole system where, for each envelope, you would specify tags. That way, you could hierarchically organize or query based on the tag structure you included with the model. And so then it was just one field in the row.

There was one field that was just a pointer: hey, this is where the serialized artifact lives. So yeah, pretty basic, nothing too complex there.
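As a rough illustration of that layout (table and column names are made up here), the queryable metadata sits in a database row, and one column simply points at the serialized artifact on S3:

```python
import sqlite3

conn = sqlite3.connect("envelopes.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS model_envelopes (
        id INTEGER PRIMARY KEY,
        name TEXT,
        tags TEXT,            -- e.g. JSON-encoded {"team": "forecasting", "env": "prod"}
        created_at TEXT,
        artifact_uri TEXT     -- pointer to the serialized envelope, e.g. an S3 key
    )
""")
conn.execute(
    "INSERT INTO model_envelopes (name, tags, created_at, artifact_uri) VALUES (?, ?, ?, ?)",
    ("demand_forecast", '{"team": "forecasting"}', "2023-01-15", "s3://models/envelopes/123.pkl"),
)
conn.commit()
```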

How to decide what feature to build?

Aurimas: Okay, Stefan, so it sounds like everything happened really naturally in the platform team. Teams needed to deploy models, so you created the envelope framework; then teams were struggling to write their feature transformation code efficiently, so you created Hamilton.

Was there any case where someone came to you with a crazy suggestion that needs to be built, and you said no? Like how do you decide what feature has to be built and what features you rejected? 

Stefan: Yeah. So I have a blog post on some of my learnings from building the platform at Stitch Fix. You could say that usually the requests we said "no" to came from someone wanting something super complex who was also doing something speculative.

They wanted the ability to do something, but it wasn’t in production yet, and it was trying to do something speculative based around improving something where the business value was still not known yet. 

If it was a business priority and we knew that this was a direction that had to be pursued, we would say, sure, we'll help you with that. Otherwise, we would basically say no. Usually, these requests come from people who think they're pretty capable from an engineering perspective.

So we're like, okay, no, you go figure it out, and then if it works, we can talk about ownership and taking it on. For example, we had one configuration-driven model pipeline: you could think of it as some YAML with Python code and SQL, and we enabled people to describe how to build a model pipeline that way.

So it was different from Hamilton, operating in more of a macro kind of way. We didn't want to support it right away, but it grew to the point where other people wanted to adopt it, and given the complexity of managing and maintaining it, we came in, refactored it, and made it more general and broader, right?

And so that's where I see a reasonable way to determine whether you say yes or no: one, if it's not a business priority, it's likely not worth your time, so get them to prove it out; and then, if it's successful, assuming you had the conversation ahead of time, you can talk about adoption.

So, it’s not your burden. Sometimes people do get attached. You just have to be aware as to their attachment to, if it’s their baby, you know, how they’re gonna hand it off to you. It’s something to think about. 

But otherwise, I'm trying to think: some people wanted TensorFlow-specific support, but there was only one person using TensorFlow. We were like, "yeah, you can do things right now; yeah, we can add some stuff," but thankfully, we didn't invest our time, because the project they tried it on didn't work out, and then they ended up leaving.

And so, in which case, glad we didn’t invest time there. So, yeah, happy to dig in more.

Piotr: It sounds like product manager role, very much like that.

Stefan: Yeah, so at Stitch Fix we didn't have product managers. The organization had a program manager. My team were our own product managers. That's why I spent some of my time talking to people and managers, understanding pain points, but also understanding what's going to be valuable for the business and where we should be spending time.

Piotr: I'm running a product at Neptune, and it is a good thing and at the same time challenging that you're dealing with people who are technically savvy: they're engineers, they can code, they can think in an abstract way.

Very often, when you hear the first iteration in the feature request, it’s actually a solution. You don’t hear the problem. I like this test, and maybe other ML platform teams can learn from it. Do you have it in production? 

Is it something that works, or is it something that you plan to move to production one day? As a first filter, I like this heuristic.

Stefan: I mean, you brought back a lot of memories. There's the "hey, can you do this?" and you go, "so what's the problem?" That is actually the one thing you have to learn to make your first reaction whenever someone using your platform asks for something: what is the actual problem? Because it could be that they found a hammer, and they want to use that particular hammer for that particular task.

For example, they wanted to do hyperparameter optimization, and they were asking, "can you do it this way?" And stepping back, we're like, hey, we can actually do it at a little higher level, so you don't have to think about it and we wouldn't have to engineer it that way. So a super important question to always ask is, "what is the actual problem you're trying to solve?"

And then you can also ask, “what is the business value?” How important is this, et cetera, to really know, like how to prioritize?

Getting buy-in from the team

Piotr: So we have learned how you've been dealing with data scientists coming to you with feature requests. How did the other side of the communication work? How did you encourage teams to follow what you developed, what you proposed they do? How did you set the standards in the organization?

Stefan: Yeah, so ideally, with any initiative we had, we found a particular, narrow use case and a team who needed it, would adopt it, and would use it as we developed it. There's nothing worse than developing something and no one using it. That looks bad; managers ask, who's using it?

  • So one is ensuring that you have a clear use case and someone who has the need and wants to partner with you. And then, only once that's successful, start to think about broadening it, because you can use them as the use case and story. This is where, ideally, you have weekly or bi-weekly shareouts. We had what you could call an "algorithms beverage minute", where essentially you could get up for a couple of minutes and talk about things.
  • And so yeah, we definitely had to live the dev-tools evangelization internally, because at Stitch Fix the data scientists had the choice not to use our tools if they wanted to engineer things themselves. So we definitely had to go the route of: we can take these pain points off of you, you don't have to think about them, here's what we've built, and here's someone who's using it for this particular use case. Awareness is therefore a big one, right? You have to make sure people know the solution exists and that it is an option.
  • Documentation. We actually had a little tool that enabled you to write Sphinx docs pretty easily. We made sure that for every tool we built, the model envelope, Hamilton, and the others, we had Sphinx docs set up, so we could point people to the documentation and show snippets and things.
  • The other is, from our experience, the telemetry that we put in. One nice thing about the platform is that we can put in as much telemetry as we want. So whenever anyone was using something and there was an error, we would get a Slack alert on it. And we would try to be on top of that, reach out, and ask: what are you doing?

Maybe try to engage them to ensure that they were successful in kind of doing things correctly. You can’t do that with open-source. Unfortunately, that’s slightly invasive. But otherwise, most people are only willing to kind of adopt things, maybe a couple of times a quarter. 

And so you need to have the thing in the right place at the right time, for when they have that moment, so they can get started and over the hump, since getting started is the biggest challenge. Therefore, try to provide the documentation, examples, and ways to make that as small a jump as possible.

How did you assemble a team for creating the platform?

Aurimas: Okay, so were you at Stitch Fix from the very beginning of the ML platform, or had it already evolved by the time you joined?

Stefan: Yeah, so I mean, when I got there, it was a pretty basic small team. In the six years I was there, it grew quite a bit.

Aurimas: Do you know how it was created? Why was it decided that it was the correct time to actually have a platform team?

Stefan: No, I don't know the answer to that, but the two guys who headed it up were Eric Colson and Jeff Magnusson.

Jeff Magnusson has a pretty famous post about how engineers shouldn't write ETL. If you Google that, you'll find the post, which describes the philosophy of Stitch Fix: we wanted to create full-stack data scientists, because if they can do everything end to end, they can move faster and do things better.

But with that thesis, there's a certain scale limit. It's hard to hire only people who have all the skills to do everything full-stack in data science, right? And so it was really their vision that, hey, we need a platform team to build tools of leverage, right?

I don't know what data you have, but my cursory knowledge of machine learning initiatives is that generally there's a ratio of engineers to data scientists of like 1:1 or 1:2. But at Stitch Fix, take just the part of the platform team that was focused on helping with pipelines, right?

There the ratio was closer to 1:10. So in terms of the leverage of engineers relative to what data scientists can do, you have to understand what a platform does, and then you also have to know how to communicate it.

So given your earlier question, Piotr, about how you measure the effectiveness of platform teams: I don't know what conversations they had to get headcount, but you potentially do need a little bit of help, or at least to think about communicating that, yes, this team is going to be second order, because we're not directly producing a feature, but if we can make the people who are doing it more effective and efficient, then it's going to be a worthwhile investment.

Aurimas: When you say engineers and data scientists, do you assume that Machine Learning Engineer is an engineer or he or she is more of a data scientist?

Stefan: Yeah, as for the distinction between a data scientist and a machine learning engineer, you could say the latter has a connotation of doing a little bit more online kinds of things, right?

And so they need to do a little bit more engineering. But I think there’s a pretty small gap. You know, for me, actually, my hope is that if when people use Hamilton, we enable them to do more, they can actually switch the title from data scientist to machine learning engineer. 

Otherwise, I kind of lump them into the data scientist bucket in that regard. So like platform engineering was specifically what I was talking about.

Aurimas: Okay. And did you see any evolution in how teams were structured throughout your years at Stitch Fix? Did you change the composition of these end-to-end machine learning teams composed of data scientists and engineers?

Stefan: It really depended on the problem, because the forecasting teams were very much offline batch. That worked fine; they didn't have to engineer anything too complex from an online perspective.

But for the personalization teams, where SLAs and client-facing concerns started to matter, they definitely started hiring people with a little bit more engineering experience. We're not tackling that yet, I would say, but with DAGWorks we're trying to enable a lower software engineering bar for building and maintaining model pipelines.

For the recommendation stack and producing recommendations online, there isn't anything simplifying that, so you still need a stronger engineering skill set. If you're managing a lot of microservices that talk to each other, or you're managing SLAs, you do need a bit more engineering knowledge to do well.

So if anything, that was the split that started to emerge. Anyone doing more client-facing, SLA-bound work was slightly stronger on the software engineering side; everyone else was fine being great modelers with lower software engineering skills.

Aurimas: And when it comes to roles that are not necessarily technical, would you embed them into those ML teams like project managers or subject matter experts? Or is it just plain data scientists?

Stefan: Some of it landed on the shoulders of the data science team and whoever they were partnering with. They were generally partnering with someone within the organization, so you could say that collectively, between the two of them, they were product managing things; we didn't have explicit product manager roles.

I think at the scale Stitch Fix grew to, project management really became a pain point: how do we bring that in, and who does it? So it really depends on the scale.

What the product is and what it's touching determine whether you start to need that. But yeah, it was definitely something the org was thinking about when I was still there: how do you structure things to run more efficiently and effectively? And how exactly do you draw the bounds of a team delivering machine learning?

If you're working with the inventory team, who's managing inventory in a warehouse, for example, what the team structure there should be was still being shaped out, right? When I was there, it was very separate. They worked together, but they had different managers, right?

Kind of reporting to each other, but they worked on the same initiative. So, worked well when we were small. You’d have to ask someone there now as to, like, what’s happening, but otherwise, I would say depends on the size of the company and the importance of the machine learning initiative.

Model monitoring and production

Piotr: I wanted to ask about monitoring models in production and keeping them live. Because it sounds pretty similar to the software space, okay? The data scientists here map to software engineers, and the ML platform team can be the DevOps team.

What about people who are making sure it is live, and how did it work?

Stefan: With the model envelope, we provided deployment for free. That meant the data scientists, you could say the only thing that they were responsible for was the model. 

And we tried to structure things in a way that, like, hey, bad models shouldn’t reach production because we have enough of a CI validation step that, like the model, you know, shouldn’t be an issue. 

And so the only thing that would break in production is an infrastructure change, which the data scientists aren't responsible for and aren't equipped to handle.

But otherwise, it was our job; it was my team's responsibility.

I think we were on call for something like, you know, over 50 services because that’s how many models were deployed with us. And we were frontline. So we were frontline precisely because, you know, most of the time, if something was going to go wrong, it was likely going to be something to do with infrastructure. 

We were the first point, but they were also on the call chain. Actually, I'll step back. Once any model was deployed, we were both on call, just to make sure that it deployed and was running initially. Then it would slightly bifurcate: we would do the first escalation, because if it's infrastructure, the data scientist can't do anything about it. But they also needed to be on call, because if the model is making some weird predictions, we can't fix that; they're the ones who have to debug and diagnose it.

Piotr: Sounds like something with data, right? Data drift.

Stefan: Yeah, data drift, something upstream, et cetera. And so this is where better model observability and data observability helps. So trying to capture and use that. 

There are many different ways, but the nice thing with what we had set up is that we were in a good position to capture inputs at training time, and also, because we controlled the web service and its internals, we could log and emit the requests that came in.

So then we had pipelines to build and reconcile that data. If you want to ask the question, is there training-serving skew? You, as a data scientist or machine learning engineer, didn't have to build that in. You just had to turn on logging in your service.

Then you had to turn on some other configuration downstream, and we provided a way to push it all to an observability solution to compare production features versus training features.
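A bare-bones version of that training-serving skew check, just to make the idea concrete (the real Stitch Fix pipelines were of course richer), could compare per-feature summary statistics of the training data against the features logged by the web service:

```python
import pandas as pd

def feature_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature mean and std for numeric columns."""
    return df.describe().loc[["mean", "std"]].T

def skew_report(train_df: pd.DataFrame, prod_df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Flag features whose production mean drifted more than `threshold` training stds."""
    stats = feature_summary(train_df).join(feature_summary(prod_df), lsuffix="_train", rsuffix="_prod")
    stats["z_shift"] = (stats["mean_prod"] - stats["mean_train"]).abs() / stats["std_train"].replace(0, 1e-9)
    stats["skewed"] = stats["z_shift"] > threshold
    return stats

# train_df: features captured at model-save time; prod_df: features logged by the serving layer.
train_df = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "feature_b": [10.0, 11.0, 12.0]})
prod_df = pd.DataFrame({"feature_a": [1.1, 2.2, 2.9], "feature_b": [40.0, 42.0, 41.0]})
print(skew_report(train_df, prod_df))
```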

Piotr: Sounds like you provided a very comfortable interface for your data scientists.

Stefan: Yeah, I mean, that’s the idea. I mean, so truth be told, that’s kind of what I’m trying to replicate with DAGWorks right, provide the abstractions to allow anyone to have that experience we built at Stitch Fix. 

But yeah, data scientists hate migrations. Part of the reason to focus on the API is that if we wanted to change things underneath from a platform perspective, we wouldn't have to say, hey, data scientists, you need to migrate. That was also part of why we focused so heavily on these API boundaries: they made our life simpler, but also theirs.

Piotr: And can you share how big the team of data scientists and the ML platform team were, in terms of the number of people, at the time you worked at Stitch Fix?

Stefan: It was, I think, at its peak it was like 150, was total data scientists and platform team together.

Piotr: And the team was 1:10?

Stefan: So we had a platform team, I think we roughly, it was like, either 1:4, 1:5 total, because we had a whole platform team that was helping with UIs, a whole platform team focusing on the microservices and kind of online architecture, right? So not pipeline related. 

And so, yeah. And so there was more, you could say, work required from an engineering perspective from integrating APIs, machine learning, other stuff in the business. So the actual ratio was 1:4, 1:5, but that’s because there was a large component of the platform team that was helping with doing more things around building platforms to help integrate, debug, machine learning recommendations, et cetera.

Aurimas: But what were the sizes of the machine learning teams? Probably not hundreds of people in a single team, right?

Stefan: They were, yeah, it’s kind of varied, you know, like eight to ten. Some teams were that large, and others were five, right? 

It really depended on the vertical and who they were helping with respect to the business. You can think of it as roughly scaling with the modeling needs. We were in the UK and the US, with different regions within each, and then there were different business lines: men's, women's, kids, right?

You could think of like data scientists on each one, on each kind of combination, right? So really dependent where that was needed and not, but like, yeah, anywhere from like teams of three to like eight to ten.

How to be a valuable MLOps Engineer?

Piotr: There is a lot of information and content on how to become a data scientist. But there is an order of magnitude less about being an MLOps engineer or a member of an ML platform team.

What do you think is needed for a person to be a valuable member of an ML platform team? And what is the typical ML platform team composition? What type of people do you need to have?

Stefan: I think you need to have empathy for what people are trying to do. If you have done a bit of machine learning, done a little bit of modeling, then when someone comes to you with a thing, you can ask, what are you trying to do?

You have a bit more understanding, at a high level, of what can be done, right? And having built things yourself and lived the pains definitely helps with that empathy. So it helps to be an ex-operator; that's kind of what my path was.

I built models, and I realized I liked building the actual models less than building the infrastructure around them that ensures people can do things effectively and efficiently. The skill set may be slightly different now from what it was six years ago, just because there's a lot more maturity in open-source and the vendor market. There's a bit of a meme or trope that MLOps is really VendorOps.

If you're going to integrate and bring in solutions that you're not building in-house, then you need to understand a little bit more about abstractions and what you want to control versus tightly integrate.

So: empathy, having some background, and then the software engineering skill set from having built things. In my blog post, I frame it as a two-layer API.

Ideally, you should never expose the vendor API directly. You should always have a wrapper or veneer around it so that you control some aspects, and so that the people you're providing the platform for don't have to make those decisions.

So, for example, where should the artifact be stored? The location of the saved file should be something that you as a platform take care of; even though the vendor API may require it to be provided, you can make that decision for your users.

This is where I'd say that if you've lived the experience of managing and maintaining vendor APIs, you're going to be a little better at it the next time around.
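A toy version of that two-layer idea might look like the sketch below. The vendor_client and its upload/register methods are hypothetical stand-ins for whatever third-party SDK you are wrapping; the point is that the platform layer decides where artifacts go, so data scientists never see that parameter.

```python
import os

class PlatformModelRegistry:
    """Thin, platform-owned veneer over a vendor SDK (the vendor calls below are hypothetical)."""

    # A platform decision baked in once, not made by every user:
    ARTIFACT_BUCKET = os.environ.get("ML_ARTIFACT_BUCKET", "s3://company-ml-artifacts")

    def __init__(self, vendor_client):
        self._vendor = vendor_client  # e.g. an experiment tracker or model registry SDK

    def save_model(self, name: str, model_path: str, metadata: dict) -> str:
        """Data scientists call this; they never choose storage locations or credentials."""
        artifact_uri = f"{self.ARTIFACT_BUCKET}/{name}/{os.path.basename(model_path)}"
        self._vendor.upload(local_path=model_path, destination=artifact_uri)   # hypothetical vendor call
        self._vendor.register(name=name, uri=artifact_uri, metadata=metadata)  # hypothetical vendor call
        return artifact_uri
```

If the vendor changes, or you swap it out entirely, only this veneer has to change, not every data scientist's code.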

And then if you have a DevOps background as well, or like have built things to deploy yourself, so worked in smaller places, then you also can kind of understand the production implications and like the toolset available of what you can integrate with.

You can get a reasonably long way with Datadog just on service deployment, right?

But if you want to really understand what's within the model, and why training-serving skew is important, then having seen it done, and having the empathy to understand why you need to do it, helps. If you have the bigger picture of how things fit together end to end, the macro picture, I think that helps you make better micro decisions.

The road ahead for ML platform teams

Piotr: Okay, makes sense. Stefan, a question because I think when it comes to topics we wanted to cover, we are doing pretty well. I am looking at the agenda. Is there anything we should ask, or would you like to talk?

Stefan: Good question. 

Let’s see, I’m just looking at the agenda as well. Yeah, I mean, I think one of like my, in terms of the future, right? 

I think to me Stitch Fix tried to enable data scientists to do things end-to-end. 

The way I interpreted it is that if you enable data practitioners, in general, to be able to do more self-service, more end-to-end work, they can take business domain context and create something that iterates all the way through. 

Therefore, they have a better feedback loop to understand whether it's valuable or not, rather than the more traditional setup where people are still in a handoff model. And in that case, there's a bit of a "who are you designing tools for" question. Are you trying to target engineers and machine learning engineers with these kinds of solutions?

Does that mean the data scientist has to become a software engineer to be able to use your solution and do things self-service? There is the other extreme, which is low code, no code, but I think that's limiting. Most of those solutions are SQL or some sort of custom DSL, which I don't think lends itself well to taking knowledge or a learned skill set and applying it in another job. It only transfers if the next place is using the same tool, right?

And so my belief here is that if we can simplify the tools and the software engineering abstraction that's required, then we can better enable this self-service paradigm, which also makes it easier for platform teams to manage things. Hence why I was saying that if you take a vendor and you can simplify the API, you can actually make it easier for a data scientist to use, right?

So that is where my thesis is that if we can make it lower the software engineering bar to do more self-service, you can provide more value because that same person can get more done. 

But also, if it's constructed in the right way, and this is the thesis with Hamilton and DAGWorks, you can more easily maintain things over time, so that when someone leaves, no one has nightmares inheriting their work. At Stitch Fix, we made it really easy to get to production, but because the business moved so quickly, teams spent half their time trying to keep machine learning pipelines afloat.

And part of the reason for that, I think, was that we enabled them to do more, maybe too much engineering, right?

Skills required for building robust tools

Stefan: I'm curious, what do you guys think about who should be the ultimate target, in terms of the level of software engineering skill required, to enable self-service model building and machine learning pipelines?

Aurimas: What do you mean specifically?

Stefan: I mean, is self-serve the future? If so, what is the software engineering skill set required?

Aurimas: To me, at least how I see it in the future, self-service is the future, first of all, but then I don’t really see, at least from experience, that there are platforms right now that data scientists themselves could work against end to end. 

In my experience, there is always a need for a machine learning engineer who sits between the data scientists and the platform, unfortunately. But the goal should probably be that a person with the skill set of a current data scientist is able to work end to end. That's what I believe.

Piotr: I think it is getting… that is kind of a race. So things that used to be hard six years ago are easy today, but at the same time, techniques got more complex. 

Today we have great foundation models and encoders. The models we're building are more and more dependent on other services. And the abstraction will no longer be: dataset, some preprocessing, training, post-processing, model packaging, and then an independent web service, right?

Models are getting more and more dependent on external services. So I think the goal, yes: if we are repeating ourselves, and we will keep repeating ourselves, let's make it self-service friendly. But with the development of techniques and methods in this space, it will be a race. We will solve some things, but we will introduce new complexity. Especially when you're trying to do something state of the art, you're not thinking about making things simple to use at the beginning; rather, you're thinking about whether you will be able to do it at all, right?

So the new techniques usually are not so friendly and easy to use. Once they are becoming more common, we are making them easier to use.

Stefan: I was going to say, or at least jump off what Piotr is saying: one of the techniques I use for designing APIs is actually trying to design the API first.

I think what Piotr was saying is that very easy for an engineer. I found this, you know, problem myself is to go bottom up. It’s like, I wanna build this capability, and then I wanna expose how people kind of use it.

And I actually think inverting that, asking first what experience I want someone to get from the API and only then going down, has been very enlightening in terms of how you can simplify things. Because from the bottom up, it's very easy to include all these concerns, since the natural tendency of an engineer is to want to enable anyone to do anything.

But when you want to simplify things, you really need to ask the question: what is the eighty-twenty? This is where the Python ethos of "batteries included" comes in, right?

So how can you make this as easy as possible for the core set of people who want to use it?

Final words

Aurimas: Agreed, agreed, actually. 

So we are almost running out of time. So maybe the last question, maybe Stefan, you want to leave our listeners with some idea, maybe you want to promote something. It’s the right time to do it now.

Stefan: Yeah. 

So if you are terrified of inheriting your colleagues’ work, or this is where maybe you’re a new person joining your company, and you’re terrified of the pipelines or the things that you’re inheriting, right? 

I would say I'd love to hear from you. Hamilton is still a pretty early open-source project, and it's very easy to get started with. We have a roadmap that's being shaped and formed by input and opinions. So check it out if you want an easy way to maintain and collaborate as a team on your model pipelines, since individuals build models, but teams own them.

I think that requires a different skill set and discipline to do well. So come check out Hamilton and tell us what you think. And as for the DAGWorks platform, at the time of recording this, we're still in closed beta. We have a waitlist and an early-access form you can fill out if you're interested in trying out the platform.

Otherwise, search for Hamilton, and give us a star on GitHub. Let me know your experience. We’d love to ensure that as your ML ETLs or pipelines kind of grow, your maintenance burdens shouldn’t. 

Thanks.

Aurimas: So, thank you for being here with us today and really good conversation. Thank you.

Stefan: Thanks for having me, Piotr, and Aurimas.

]]>
27076
MLOps Is an Extension of DevOps. Not a Fork — My Thoughts on THE MLOPS Paper as an MLOps Startup CEO https://neptune.ai/blog/mlops-is-extension-of-devops Mon, 23 Jan 2023 17:09:35 +0000 https://neptune.ai/?p=15680 By now, everyone must have seen THE MLOps paper.

“Machine Learning Operations (MLOps): Overview, Definition, and Architecture”

By Dominik Kreuzberger, Niklas Kühl, Sebastian Hirschl

Great stuff. If you haven’t read it yet, definitely do so.

The authors give a solid overview of:

  • What MLOps is,
  • Principles and components of the MLOps ecosystem,
  • People/roles involved in doing MLOps,
  • MLOps architecture and workflow that many teams have.

They tackle the ugly problem in the canonical MLOps movement: How do all those MLOps stack components actually relate to each other and work together?

In this article, I share how our reality as the MLOps tooling company and my personal views on MLOps agree (and disagree) with it. Many of the things I will talk about here I already see today. Some are my 3–4 year bets.

Just so you know where I am coming from:

  • I have a heavy software development background (15+ years in software). Lived through the DevOps revolution. Came to ML from software.
  • Founded two successful software services companies.
  • Founded neptune.ai, a robust and highly scalable experiment tracker.
  • I lead the product and see what users, customers, and other vendors in this corner of the market do.
  • Most of our customers are doing ML/MLOps at a reasonable scale, NOT at the hyperscale of big-tech FAANG companies.

If you’d like a TLDR, here it is:

  • MLOps is an extension of DevOps. Not a fork:
    – The MLOps team should consist of a DevOps engineer, a backend software engineer, a data scientist, + regular software folks. I don’t see what special role ML and MLOps engineers would play here.
    – We should build ML-specific feedback loops (review, approvals) around CI/CD.
  • We need both automated continuous monitoring AND periodic manual inspection.
  • There will be only one type of ML metadata store (model-first), not three.
  • The workflow orchestration component is actually two things, workflow execution tools and pipeline authoring frameworks.
  • We don’t need a model registry. If anything, it should be a plugin for artifact repositories.
  • Model monitoring tools will merge with the DevOps monitoring stack. Probably sooner than you think.

Ok, let me explain.

MLOps is an extension of DevOps. Not a fork.

First of all, it is great to talk about MLOps and MLOps stack components, but at the end of the day, we are all just delivering software here.

A special type of software with ML in it but software nonetheless.

We should be thinking about how to connect to already existing and mature DevOps practices, stacks, and teams. But so much of what we do in MLOps is building things that already exist in DevOps and putting the MLOps stamp on them.

MLOps is an extension of DevOps

When companies add ML models to their products/services, something is already there.

That something is regular software delivery processes and the DevOps tool stack.

In reality, almost nobody is starting from scratch.

And in the end, I don’t see a world where MLOps and DevOps stacks sit next to each other and are not just one stack.

I mean, if you are with me on “ML is just a special type of software”, MLOps is just a special type of DevOps.

So, figuring out MLOps architecture and principles is great, but I wonder how that connects to extending the already existing DevOps principles, processes, and tools stacks.

Production ML team composition

Let's take this "MLOps is an extension of DevOps" discussion to the team structure.

Who do we need to build reliable ML-fueled software products?

  • Someone responsible for the reliability of software delivery 🙂
  • We are building products, so there needs to be a clear connection between the product and end users.
  • We need people who build ML-specific parts of the product.
  • We need people who build non-ML-specific parts of the product.

Great, now, who are those people exactly?

I believe the team will look something like this:

  • Software delivery reliability: DevOps engineers and SREs (DevOps vs SRE here)
  • ML-specific software: software engineers and data scientists
  • Non-ML-specific software: software engineers
  • Product: product people and subject matter experts

Wait, where is the MLOps engineer?

How about the ML engineer?

Let me explain.

MLOps engineer is just a DevOps engineer

This may be a bit extreme, but I don’t see any special MLOps engineer role on this team.

MLOps engineer today is either an ML engineer (building ML-specific software) or a DevOps engineer. Nothing special here.

Should we call a DevOps engineer who primarily operates ML-fueled software delivery an MLOps engineer?

I mean, if you really want, we can, but I don’t think we need a new role here. It is just a DevOps eng.

Either way, we definitely need that person on the team.

Now, where things get interesting for me is here.

Data scientist vs ML engineer vs backend software engineer

So first, what is the actual difference between a data scientist, ML engineer, software engineer, and an ML researcher?

Today I see it like this.

MLOps Extension of DevOps

In general, ML researchers are super heavy on ML-specific knowledge and less skilled in software development.

Software engineers are strong in software and less skilled in ML.

Data scientists and ML engineers are somewhere in between.

But that is today or maybe even yesterday.

And there are a few factors that will change this picture very quickly:

  • Business needs
  • Maturity of ML education

Let’s talk about business needs first.

Most ML models deployed within product companies will not be cutting-edge, super heavy on tweaking.

They won’t need state-of-the-art model compression techniques for lower latency or tweaks like that. They will be run-of-the-mill models trained on specific datasets that the org has.

That means the need for the super custom model development that data scientists and ML researchers do will be less common than building, packaging, and deploying run-of-the-mill models to prod.

There will be teams that need ML-heavy work for sure. It’s just that the majority of the market will not. Especially as those baseline models get so good.

Ok, so we’ll have more need for ML engineers than data scientists, right?

Not so fast.

Let’s talk about computer science education.

When I studied CS, I had one semester of ML. Today it’s 4x + more ML content on that same program.

I believe that packaging/building/deploying the vanilla, run-of-the-mill ML model will become common knowledge for backend devs.

Even today, most backend software engineers can easily learn enough ML to do that if needed.

Again, not talking about those tricky-to-train, heavy-on tweaking models. I am talking about good baseline models.

So considering that:

  • Baseline models will get better
  • ML education in classic CS programs will improve
  • Business problems that need heavy ML tweaking will be less common

I believe the current roles on the ML team will evolve:

  • ML heavy role -> data scientist
  • Software heavy role -> backend software engineer
MLOps Extension of DevOps

So who should work on the ML-specific parts of the product?

I believe you’ll always need both ML-heavy data scientists and software-heavy backend engineers.

Backend software engs will package those models and “publish” them to production pipelines operated by DevOps engineers.

Data scientists will build models when the business problem is ML-heavy.

But you will also need data scientists even when the problem is not ML-heavy, and backend software engineers can easily deploy run-of-the-mill models.

Why?

Cause models fail.

And when they fail, it is hard to debug them and understand the root cause.

And the people who understand models really well are ML-heavy data scientists.

But even if the ML model part works “as expected”, the ML-fueled product may be failing.

That is why you also need subject matter experts closely involved in delivering ML-fueled software products.

Subject matter experts

Good product delivery needs frequent feedback loops. Some feedback loops can be automated, but some cannot.

Especially in ML. Especially when you cannot really evaluate your model without you or a subject matter expert taking a look at the results.

And it seems those subject matter experts (SMEs) are involved in MLOps processes more often than you may think.

We saw fashion designers sign up for our ML metadata store.

WHAT? It was a big surprise, so we took a look.

Turns out that teams want SMEs involved in manual evaluation/testing a lot.

Especially teams at AI-first product companies want their SMEs in the loop of model development.

It’s a good thing.

Not everything can be tested/evaluated with a metric like AUC or R2. Sometimes, people just have to check whether things actually improved, not just whether the metrics got better.

This human-in-the-loop MLOps system is actually quite common among our users:

So this human-in-the-loop design makes true automation impossible, right?

That is bad, right?

It may seem problematic at first glance, but this situation is perfectly normal and common in regular software.

We have Quality Assurance (QA) or User Researchers manually testing and debugging problems.

That is happening on top of the automated tests. So it is not “either or” but “both and”.

But SMEs definitely are present in (manual) MLOps feedback loops.

Principles and components: what is the diff vs DevOps

I really liked something that the authors of THE MLOps paper did.

They started by looking at the principles of MLOps. Not just tools but principles. Things that you want to accomplish by using tools, processes, or any other solutions.

They go into components (tools) that solve different problems later.

Too often, it is completely reversed, and the discussion is shaped by what tools do.
Or, more specifically, what the tools claim to do today.

Tools are temporary. Principles are forever. So to speak.

And the way I see it, some of the key MLOps principles are missing, and some others should be “packaged” differently.

More importantly, some of those things are not “truly MLOps” but actually just DevOps stuff.

I think as the community of builders and users of MLOps tooling, we should be thinking about principles and components that are “truly MLOps”. Things that extend the existing DevOps infrastructure.

This is our value added to the current landscape. Not reinventing the wheel and putting an MLOps stamp on it.

So, let’s dive in.

Principles

So CI/CD, versioning, collaboration, reproducibility, and continuous monitoring are things that you also have in DevOps. And many things we do in ML actually fall under those quite clearly.

Let’s go into those nuances.

CI/CD + CT/CE + feedback loops

If we say that MLOps is just DevOps + “some things”, then CI/CD is a core principle of that.

With CI/CD, you get automatically triggered tests, approvals, reviews, feedback loops, and more.

With MLOps come CT (continuous training/testing) and CE (continuous evaluation), which are essential to a clean MLOps process.

Are they separate principles?

No, they are a part of the very same principle.

With CI/CD, you want to build, test, integrate, and deploy software in an automated or semi-automated fashion.

Isn’t training ML models just building?

And evaluation/testing just, well, testing?

What is so special about it?

Perhaps it is the manual inspection of new models.

That feels very much like reviewing and approving a pull request by looking at the diffs and checking that (often) automated tests passed.

Diffs between not only code but also models/datasets/results. But still diffs.

Then you approve, and it lands in production.

I don’t really see why CT/CE are not just a part of CI/CD. If not in naming, then at least in putting them together as a principle.

The review and approval mechanism via CI/CD works really well.

We shouldn’t be building brand new model approval mechanisms into MLOps tools.

We should integrate CI/CD into as many feedback loops as possible. Just like people do with QA and testing in regular software development.

Workflow orchestration and pipeline authoring

When we talk about workflow orchestration in ML, we usually mix two things.

One is the scheduling, execution, retries, and caching. Things that we do to make sure that the ML pipeline executes properly. This is a classic DevOps use case. Nothing new.

But there is something special here: the ability to author ML pipelines easily.

Pipeline authoring?

Yep.

When creating integration with Kedro, we learned about this distinction.

Kedro explicitly states that they are a framework for “pipeline authoring”, NOT workflow orchestration. They say:

“We focus on a different problem, which is the process of authoring pipelines, as opposed to running, scheduling, and monitoring them.”

You can use different back-end runners (like Airflow, Kubeflow, Argo, Prefect), but you can author them in one framework.

Learn more

Argo vs Airflow vs Prefect: How Are They Different

Kedro vs ZenML vs Metaflow: Which Pipeline Orchestration Tool Should You Choose?

Pipeline authoring is this developer experience (DevEx) layer on top of orchestrators that caters to data science use cases. It makes collaboration on those pipelines easier.

Collaboration and re-usability of pipelines by different teams were the very reasons why Kedro was created.

And if you want re-usability of ML pipelines, you sort of need to solve reproducibility while you are at it. After all, if you re-use a model training pipeline with the same inputs, you expect the same result.
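To make the distinction tangible, here is an illustrative plain-Python sketch (not the actual Kedro, ZenML, or Metaflow API): the authoring layer is just declaring steps and their dependencies once; the execution layer is whatever backend runs that graph, with scheduling, retries, and caching.

```python
import math
import pandas as pd

# --- Authoring layer: declare steps and their dependencies once ---
def load_orders(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def add_features(orders: pd.DataFrame) -> pd.DataFrame:
    out = orders.copy()
    out["log_order_value"] = out["order_value"].map(lambda v: math.log(v) if v > 0 else 0.0)
    return out

def train_model(features: pd.DataFrame) -> dict:
    # Stand-in for a real training step.
    return {"mean_order_value": float(features["order_value"].mean())}

# The "pipeline" is just data: (output name, function, input names).
PIPELINE = [
    ("orders", load_orders, ["raw_orders_path"]),
    ("features", add_features, ["orders"]),
    ("model", train_model, ["features"]),
]

# --- Execution layer: a trivial sequential runner; in production, Airflow, Argo,
# or Prefect would own scheduling, retries, and caching for the same graph. ---
def run(pipeline, inputs: dict) -> dict:
    results = dict(inputs)
    for output_name, func, input_names in pipeline:
        results[output_name] = func(*[results[name] for name in input_names])
    return results

# results = run(PIPELINE, {"raw_orders_path": "orders.csv"})
```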

Versioning vs ML metadata tracking/logging

Those are not two separate principles but actually parts of a single one.

We’ve spent thousands of hours talking to users/customers/prospects about this stuff.

And you know what we’ve learned?

Versioning, logging, recording, and tracking of models, results, and ML metadata are extremely connected.

I don’t think we know exactly where one ends and the other starts, let alone our users.

They use versioning/tracking interchangeably a lot.

And it makes sense as you want to version both the model and all metadata that comes with it. Including model/experimentation history.

You want to know:

  • how the model was built,
  • what were the results,
  • what data was used,
  • what the training process looked like,
  • how it was evaluated,
  • etc.

Only then can you talk about reproducibility and traceability.

And so in ML, we need this “versioning +” which is basically not only versioning of the model artifact but everything around it (metadata).

So perhaps the principle of “versioning” should just be a wider “ML versioning” or “versioning +” which includes tracking/recording as well.

Model debugging, inspection, and comparison (missing)

“Debugging, inspection and comparison” of ML models, experiments, and pipeline execution runs is a missing principle in the MLOps paper.

Authors talked about things around versioning, tracking, and monitoring but a principle that we see people want that wasn’t mentioned is this:

As of today, a lot of the things in ML are not automated. They are manual or semi-manual.

In theory, you could automatically optimize hyperparameters for every model to infinity, but in practice, you are tweaking the model config based on the results exploration.

When models fail in production, you don’t know right away from the logs what happened (most of the time).

You need to take a look, inspect, debug, and compare model versions.

Obviously, you experiment a lot during the model development, and then comparing models is key.

But what happens later when those manually-built models hit retraining pipelines?

You still need to compare the in-prod automatically re-trained models with the initial, manually-built ones.

Especially when things don’t go as planned, and the new model version isn’t actually better than the old one.

And those comparisons and inspections are manual.

Automated continuous monitoring (+ manual periodic inspection)

So I am all for automation.

Automating mundane tasks. Automating unit tests. Automating health checks.

And when we speak about continuous monitoring, it is basically automated monitoring of various ML health checks.

You need to answer two questions before you do that:

  • What do you know can go wrong, and can you set up health checks for that?
  • Do you even have a real need to set up those health checks?

Yep, many teams don’t really need production model monitoring.

I mean, you can inspect things manually once a week. Find problems you didn’t know you had. Get more familiar with your problem.

As Shreya Shankar shared in her “Thoughts on ML Engineering After a Year of my PhD”, you may not need model monitoring. Just retrain your model periodically.

“Researchers think distribution shift is very important, but model performance problems that stem from natural distribution shift suddenly vanish with retraining.” — Shreya Shankar

You can do that with a cron job. And the business value that you generate through this dirty work will probably be 10x the tooling you buy.
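That "cron job" can literally be a small retraining script scheduled weekly. A minimal sketch (paths, schedule, and model choice are placeholders, not a recommendation):

```python
# retrain.py: schedule with e.g. "0 3 * * 1 python retrain.py" (every Monday at 03:00)
import datetime
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression

DATA_PATH = "latest_training_data.csv"      # placeholder: wherever fresh labeled data lands
MODEL_PATH_TEMPLATE = "model_{date}.pkl"    # placeholder: where the serving layer picks models up

def main() -> None:
    df = pd.read_csv(DATA_PATH)
    X, y = df.drop(columns=["label"]), df["label"]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    out_path = MODEL_PATH_TEMPLATE.format(date=datetime.date.today().isoformat())
    with open(out_path, "wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    main()
```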

Ok, but some teams do need it, 100%.

Those teams should set up continuous monitoring, testing, and health checks for whatever they know can go wrong.

But even then, you need to manually inspect/debug/compare your models from time to time.

To catch new things that you didn’t know about your ML system.

Silent bugs that no metric can catch.

I guess that was a long way of saying that:

You need not only continuous monitoring but also manual periodic inspection.

Data management

Data management in ML is an essential and much bigger process than just version control.

You have data labeling, reviewing, exploration, comparison, provisioning, and collaboration on datasets.

Especially now, when the idea of data-centric MLOps (iterating over datasets is more important than iterating over model configurations) is gaining so much traction in the ML community.

Also, depending on how quickly your production data changes or how you need to set up evaluation datasets and test suits, your data needs will determine the rest of your stack. For example, if you need to retrain very often, you may not need the model monitoring component, or if you are solving just CV problems, you may not need Feature Store etc.

Collaboration

When authors talk about collaboration, they say:

“P5 Collaboration. Collaboration ensures the possibility to work collaboratively on data, model, and code.”

And they show this collaboration (P5) happening in the source code repository:

MLOps Extension of DevOps

This is far from the reality we observe.

Collaboration is also happening with:

  • Experiments and model-building iterations
  • Data annotation, cleanups, sharing datasets and features
  • Pipeline authoring and re-using/transfering
  • CI/CD review/approvals
  • Human-in-the-loop feedback loops with subject matter experts
  • Model hand-offs
  • Handling problems with in-production models and communication from the front line (users, product people, subject matter experts) and model builders

And to be clear, I don’t think we as an MLOps community are doing a good job here.

Collaboration in source code repos is a good start, but it doesn’t solve even half of the collaboration issues in MLOps.

Ok, so we talked about the MLOps principles, let’s now talk about how those principles are/should be implemented in tool stack components.

Components

Again many components like CI/CD, source version control, training/serving infrastructure, and monitoring are just part of DevOps.

But there are a few extra things and some nuance to the existing ones IMHO.

  • Pipeline authoring
  • Data management
  • ML metadata store (yeah, I know, I am biased, but I do believe that, unlike in software, experimentation, debugging, and manual inspection play a central role in ML)
  • Model monitoring as a plugin to application monitoring
  • No need for a model registry (yep)

Workflow executors vs workflow authoring frameworks

As we touched on it before in principles, we have two subcategories of workflow orchestration components:

  • Workflow orchestration/execution tools
  • Pipeline authoring frameworks

The first one is about making sure that the pipeline executes properly and efficiently. Tools like Prefect, Argo, and Kubeflow help you do that.

The second is about a devex of creating and reusing the pipelines. Frameworks like Kedro, ZenML, and Metaflow fall into this category.

Data management

What this component (or a set of components) should ideally solve is:

  • Data labeling
  • Feature preparation
  • Feature management
  • Dataset versioning
  • Dataset reviews and comparison

Today, it seems to be either done by a home-grown solution or a bundle of tools:

  • Feature stores like Tecton. Interestingly, they are now moving more in the direction of a feature management platform: “Feature Platform for Real-Time Machine Learning”.
  • Labeling platforms like Labelbox.
  • Dataset version control with DVC.
  • Feature transformation and dataset preprocessing with dbt (from dbt Labs).

Should those be bundled into one “end-to-end data management platform” or solved with best-in-class, modular, and interoperable components?

I don’t know.

But I do believe that the collaboration between users of those different parts is super important.

Especially now in this more data-centric MLOps world. And even more so when subject matter experts review those datasets.

And no tool/platform/stack is doing a good job here today.
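
To illustrate just the dataset-versioning slice of this, here is a minimal sketch using DVC’s Python API; the repo URL, file path, and revision tags are hypothetical placeholders, and a real “dataset review” would of course go beyond comparing column means:

```python
# Read two versions of the same DVC-tracked dataset so they can be
# compared/reviewed side by side.
import io

import dvc.api
import pandas as pd

REPO = "https://github.com/example-org/example-ml-repo"  # placeholder repo

def load_dataset(rev: str) -> pd.DataFrame:
    # dvc.api.read returns the file contents at a given Git revision/tag
    raw = dvc.api.read("data/train.csv", repo=REPO, rev=rev, mode="r")
    return pd.read_csv(io.StringIO(raw))

old = load_dataset("dataset-v1")  # hypothetical tag
new = load_dataset("dataset-v2")  # hypothetical tag

# A very basic "dataset review": compare sizes and per-column means
print(old.shape, new.shape)
print(new.mean(numeric_only=True) - old.mean(numeric_only=True))
```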

ML metadata store (just one)

In the paper, ML metadata stores are mentioned in three contexts, and it is not clear whether we are talking about one component or more. Authors talk about:

  • ML metadata store configured next to the Experimentation component
  • ML metadata store configured with Workflow Orchestration
  • ML metadata store configured with Model registry
[Figure: MLOps Extension of DevOps]

The way I see it, there should just be one ML metadata store that enables the following principles:

  • “reproducibility”
  • “debugging, comparing, inspection”
  • “versioning+” (versioning + ML metadata tracking/logging), which includes metadata/results from any tests and evaluations at different stages (for example, health checks and tests results of a model release candidates before they go to a model registry)

Let me go over those three ML metadata stores and explain why I think so.

  1. ML metadata store configured next to the Experimentation component

This one is pretty easy. Maybe because I hear about it all the time at Neptune.

When you experiment, you want to iterate over various experiment/run/model versions, inspect the results, and debug problems.

You want to be able to reproduce the results and have the ready-for-production models versioned.

You want to “keep track of” experiment/run configurations and results, parameters, metrics, learning curves, diagnostic charts, explainers, and example predictions.

You can think of it as a run or model-first ML metadata store.

That said, most people we talk to call the component that solves it an “experiment tracker” or an “experiment tracking tool”.

The “experiment tracker” seems like a great name when it relates to experimentation.

But then you use it to compare the results of initial experiments to CI/CD-triggered, automatically run production re-training pipelines, and the “experiment” part doesn’t seem to cut it anymore.

I think that “ML metadata store” is a much better name because it captures the essence of this component: make it easy to “log, store, compare, organize, search, visualize, and share ML model metadata”.
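
As a sketch of what that logging looks like in practice, here is a Neptune-flavored example (the project name, parameter values, and uploaded files are placeholders, and the exact API depends on the client version you use):

```python
# Minimal sketch of logging run/model metadata to an experiment tracker /
# ML metadata store (Neptune 1.x-style Python client).
import neptune

run = neptune.init_run(project="my-workspace/my-project")  # placeholder project

# Configurations and parameters
run["parameters"] = {"lr": 1e-3, "batch_size": 64, "epochs": 10}

# Metrics and learning curves, logged per epoch (dummy values here)
for epoch in range(10):
    run["train/loss"].append(1.0 / (epoch + 1))
    run["valid/accuracy"].append(0.5 + 0.04 * epoch)

# Diagnostic charts, explainers, or example predictions can be attached as files
run["valid/confusion_matrix"].upload("confusion_matrix.png")  # hypothetical file

run.stop()
```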

Ok, one ML metadata store explained. Two more to go.

2. ML metadata store configured with Workflow Orchestration

This one is interesting as there are two separate jobs that people want to solve with this one: ML-related (comparison, debugging) and software/infrastructure-related (caching, efficient execution, hardware consumption monitoring).

From what I see among our users, those two jobs are solved by two different types of tools:

  • People solve the ML-related job by using native solutions or by integrating with external experiment trackers. They want to have the re-training run results in the same place where they have the experimentation results. That makes sense, as you want to compare/inspect/debug those together.
  • The software/infrastructure-related job is done either by the orchestrator components or by traditional software tools like Grafana, Datadog, etc.

Wait, so shouldn’t the ML metadata store configured next to the workflow orchestration tool gather all the metadata about pipeline execution, including the ML-specific part?

Maybe it should.

But most ML metadata stores configured with workflow orchestrators weren’t purpose-built with the “compare and debug” principle in mind.

They do other stuff really well, like:

  • caching intermediate results,
  • retrying based on execution flags,
  • distributing execution across available resources,
  • stopping execution early.

And probably because of all that, we see people use our experiment tracker to compare/debug the results of complex ML pipeline executions.

So if people are using an experiment tracker (or run/model-first ML metadata store) for the ML-related stuff, what should happen with this other pipeline/execution-first ML metadata store?

It should just be a part of the workflow orchestrator. And it often is.

It is an internal engine that makes pipelines run smoothly. And by design, it is strongly coupled with the workflow orchestrator. Doesn’t make sense for that to be outsourced to a separate component.

Ok, let’s talk about the third one.

3. ML metadata store configured with Model registry

Quoting the paper:

“Another metadata store can be configured within the model registry for tracking and logging the metadata of each training job (e.g., training date and time, duration, etc.), including the model-specific metadata — e.g., used parameters and the resulting performance metrics, model lineage: data and code used”

Ok, so almost everything listed here is logged to the experiment tracker.

What is typically not logged there? Probably:

  • Results of pre-production tests, logs from retraining runs, and CI/CD-triggered evaluations.
  • Information about how the model was packaged.
  • Information about when the model was approved/transitioned between stages (stage/prod/archive).

Now, if you think of the “experiment tracker” more widely, like I do, as an ML metadata store that solves for “reproducibility”, “debugging, comparing, inspection”, and “versioning +” principles, then most of that metadata actually goes there.

Whatever doesn’t, like stage transition timestamps, for example, is saved in places like GitHub Actions, Docker Hub, Artifactory, or CI/CD tools.

I don’t think there is anything left to be logged to a special “ML metadata store configured next to the model registry”.

I also think that this is why so many teams that we talk to expect close coupling between experiment tracking and model registry.

It makes so much sense:

  • They want all the ML metadata in the experiment tracker.
  • They want to have a production-ready packaged model in the model registry
  • They want a clear connection between those two components

But there is no need for another ML metadata store.

There is only one ML metadata store. That, funny enough, most ML practitioners don’t even call an “ML metadata store” but an “experiment tracker”.

Ok, since we are talking about “model registry”, I have one more thing to discuss.

Model registry. Do we even need it?

Some time ago, we introduced the model registry functionality to Neptune.

At the same time, if you asked me if there is/will be a need for a model registry in MLOps/DevOps in the long run, I would say No!

For us, “model registry” is a way to communicate to the users and the community that our ML metadata store is the right tool stack component to store and manage ML metadata about your production models.

But it is not and won’t be the right component to implement an approval system, do model provisioning (serving), auto-scaling, canary tests etc.

Coming from the software engineering world, it would feel like reinventing the wheel here.

Wouldn’t some artifact registry like Docker Hub or JFrog Artifactory be the thing?

Don’t you just want to put the packaged model inside a Helm chart on Kubernetes and call it a day?

Sure, you need references to the model building history or results of the pre-production tests.

You want to ensure that the new model’s input-output schema matches the expected one.

You want to approve models in the same place where you can compare previous/new ones.

But all of those things don’t really “live” in a new model registry component, do they?

They are mainly in CI/CD pipelines, docker registry, production model monitoring tools, or experiment trackers.

They are not in a shiny new MLOps component called the model registry.

You can solve it with nicely integrated:

  • CI/CD feedback loops that include manual approvals & “deploy buttons” (check out how CircleCI or GitLab do this)
  • + Model packaging tool (to get a deployable package)
  • + Container/artifact registry (to have a place with ready-to-use models)
  • + ML metadata store (to get the full model-building history)

Right?

Can I explain the need for a separate tool for the model registry to my DevOps friends?

Many ML folks we talk to seem to get it.

But is it because they don’t really have the full understanding of what DevOps tools offer?

I guess that could be it.

And truth be told, some teams have a home-grown solution for a model registry, which is just a thin layer on top of all of those tools.

Maybe that is enough. Maybe that is exactly what a model registry should be. A thin layer of abstraction with references and hooks to other tools in the DevOps/MLOps stack.
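
What might such a thin layer look like? Here is a purely hypothetical sketch: a small record that only stores references to artifacts that already live in other tools (the container registry, the CI/CD system, the experiment tracker). All names, URIs, and fields are made up for illustration.

```python
# A hypothetical "thin" model registry: not a new service, just a record of
# pointers to artifacts that already live elsewhere.
from dataclasses import dataclass, asdict
import json
import pathlib

@dataclass
class RegisteredModel:
    name: str
    version: str
    image_uri: str          # where the packaged model lives (container/artifact registry)
    tracker_run_id: str     # link back to the experiment tracker / ML metadata store
    ci_pipeline_url: str    # link to the CI/CD run that tested and approved it
    stage: str              # "staging" | "production" | "archived"

def register(model: RegisteredModel, registry_file: str = "model_registry.json") -> None:
    path = pathlib.Path(registry_file)
    records = json.loads(path.read_text()) if path.exists() else []
    records.append(asdict(model))
    path.write_text(json.dumps(records, indent=2))

register(RegisteredModel(
    name="churn-classifier",
    version="1.4.0",
    image_uri="registry.example.com/ml/churn-classifier:1.4.0",
    tracker_run_id="CHURN-123",
    ci_pipeline_url="https://ci.example.com/pipelines/987",
    stage="staging",
))
```

In a real setup, the “registry file” would more likely be a Git-tracked config or a few tags on the container registry, but the idea is the same: references and hooks, not a new system of record.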

Model monitoring. Wait, which one?

“Model monitoring” takes the cake when it comes to the vaguest and most confusing name in the MLOps space (“ML metadata store” came second btw).

“Model monitoring” means six different things to three different people.

We talked to teams that meant:

  • (1) Monitor model performance in production: See if the model performance decays over time, and you should re-train it.
  • (2) Monitor model input/output distribution: See if the distribution of input data, features, or predictions changes over time.
  • (3) Monitor model training and re-training: See learning curves, trained model predictions distribution, or confusion matrix during training and re-training.
  • (4) Monitor model evaluation and testing: Log metrics, charts, predictions, and other metadata for your automated evaluation or testing pipelines.
  • (5) Monitor infrastructure metrics: See how much CPU/GPU or Memory your models use during training and inference.
  • (6) Monitor CI/CD pipelines for ML: See the evaluations from your CI/CD pipeline jobs and compare them visually.

For example:

  • Neptune does (3) and (4) really well and (5) just ok (we’re working on it), but we’ve also seen teams use it for (6)
  • Prometheus + Grafana is really good at (5), but people use it for (1) and (2)
  • Whylabs or Arize AI are really good at (1) and (2)

As I do believe MLOps will just be an extension to DevOps, we need to understand where software observability tools like Datadog, Grafana, NewRelic, and ELK (Elastic, Logstash, Kibana) fit into MLOps today and in the future.

Also, some parts are inherently non-continuous and non-automatic. Like comparing/inspecting/debugging models. There are subject matter experts and data scientists involved. I don’t see how this becomes continuous and automatic.

But above all, we should figure out what is truly ML-specific and build modular tools or plugins there.

For the rest, we should just use more mature software monitoring components that quite likely your DevOps team already has.

So perhaps the following split would make things more obvious:

  • Production model observability and monitoring (WhyLabs, Arize)
  • Monitoring of model training, re-training, evaluation, and testing (MLflow, Neptune)
  • Infrastructure and application monitoring (Grafana, Datadog)

I’d love to see how CEOs of Datadog and Arize AI think about their place in DevOps/MLOps long-term.

Is drift detection just a “plugin” to the application monitoring stack? I don’t know, but it seems reasonable, actually.
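
As a sketch of what such a “plugin” could look like (the feature name, port, and data are made-up assumptions): compute a drift statistic for a feature and expose it through the same Prometheus/Grafana stack your DevOps team already runs, so alerting and dashboards need nothing ML-specific.

```python
# Drift detection as a plugin to existing application monitoring: compute a
# per-feature KS statistic and expose it as a Prometheus metric.
import numpy as np
from scipy.stats import ks_2samp
from prometheus_client import Gauge, start_http_server

feature_drift = Gauge(
    "feature_drift_ks_statistic",
    "Kolmogorov-Smirnov statistic between training and production feature values",
    ["feature"],
)

def report_drift(feature: str, reference: np.ndarray, production: np.ndarray) -> float:
    statistic, _p_value = ks_2samp(reference, production)
    feature_drift.labels(feature=feature).set(statistic)
    return statistic

if __name__ == "__main__":
    start_http_server(8000)  # /metrics endpoint scraped by Prometheus
    reference = np.random.normal(0.0, 1.0, size=10_000)   # training-time distribution
    production = np.random.normal(0.3, 1.0, size=10_000)  # shifted production data
    print("KS statistic:", report_drift("avg_session_length", reference, production))
```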

Final thoughts and open challenges

If there is anything I want you to take away from this article, it is this.

We shouldn’t be thinking about how to build the MLOps stack from scratch.

We should be thinking about how to gradually extend the existing DevOps stack to specific ML needs that you have right now.

Authors say:

“To successfully develop and run ML products, there needs to be a culture shift away from model-driven machine learning toward a product-oriented discipline

Especially the roles associated with these activities should have a product-focused perspective when designing ML products”.

I think we need an even bigger mindset shift:

ML models -> ML products -> Software products that use ML -> just another software product

And your ML-fueled software products are connected to the existing infrastructure of delivering software products.

I don’t see why ML is a special snowflake here long-term. I really don’t.

But even when looking at the MLOps stack presented, what is the pragmatic v1 version of this that 99% of teams will actually need?

The authors interviewed ML practitioners from companies with 6500+ employees. Most companies doing ML in production are not like that. And the MLOps stack is way simpler for most teams.

Especially those who are doing ML/MLOps at a reasonable scale.

They choose maybe 1 or 2 components that they go deeper on and have super basic stuff for the rest.

Or nothing at all.

You don’t need:

  • Workflow orchestration solutions when a cron job is enough.
  • A feature store when a CSV file is enough.
  • An experiment tracker when a spreadsheet is enough.

Really, you don’t.
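
To make “pragmatic” concrete, a v1 for a small team can literally be a cron entry plus two CSV files. A hypothetical sketch (paths, schedule, and the model are illustrative assumptions):

```python
# retrain.py -- a deliberately minimal "MLOps stack": features in a CSV,
# retraining via cron, and an experiment log that is just another CSV you can
# open in a spreadsheet.
#
# hypothetical crontab entry: 0 3 * * * python retrain.py
import csv
import datetime
import pathlib

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# "Feature store": a CSV on disk (simulated here with random data)
X = np.random.rand(1_000, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# "Experiment tracker": append one row per retraining run to a CSV log
log_path = pathlib.Path("experiment_log.csv")
write_header = not log_path.exists()
with log_path.open("a", newline="") as f:
    writer = csv.writer(f)
    if write_header:
        writer.writerow(["timestamp", "model", "accuracy"])
    writer.writerow([datetime.datetime.now().isoformat(), "logreg", f"{accuracy:.4f}"])

print(f"Retrained; accuracy={accuracy:.4f}")
```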

We see many teams deliver great things by being pragmatic and focusing on what is important for them right now.

At some point, they may grow their MLOps stack to what we see in this paper.

Or go to a DevOps conference and realize they should just be extending the DevOps stack 😉


I occasionally share my thoughts on ML and MLOps landscape on my Linkedin profile. Feel free to follow me there if that’s an interesting topic for you. Also, reach out if you’d like to chat about it.

]]>
15680
neptune.ai Named to the 2022 CB Insights AI 100 List of Most Promising AI Startups https://neptune.ai/blog/cb-insights-100-most-promising-ai-startups Wed, 18 May 2022 08:28:16 +0000 https://neptune.test/cb-insights-100-most-promising-ai-startups/ It’s been only a couple of weeks since I announced that we raised an $8M series A, and here I am with more good news.

neptune.ai has been named to the 2022 CB Insights AI 100 List of Most Promising AI Startups. We’ve been recognized in the experiment tracking and version control category.

[Image: CB Insights Top 100]

The CB Insights team picked 100 private market vendors from a pool of over 7,000 companies. They were chosen based on factors including R&D activity, proprietary Mosaic scores, market potential, business relationships, investor profile, news sentiment analysis, competitive landscape, team strength, and tech novelty.

There are a ton of great startups in the top 100 – congrats to all of them! I can’t help but notice that there are only a few companies from Europe, so I’m happy that Neptune is one of them. 

Good to see some of our customers and users on the list (e.g., the InstaDeep team, with whom we even had a chance to create a case study).

Being noticed by CB Insights is motivating but also important for neptune.ai as a company. After we landed on the report for the first time last year, over 100 VCs from all over the world approached us. It helped us a lot in raising the last investment round.

And what does it mean for Neptune users? 

You can expect that we’ll use this recognition to develop our tool and create a better developer experience. That includes:   

  • quicker feedback-to-feature loops,
  • improved experience of our web UI,
  • more integrations with the tools from the MLOps ecosystem,
  • even better documentation.

These past few weeks have been especially good for us. But I’m confident we won’t slow down.

We’re going to work even harder to continue making experiment tracking and model registry “just work” for ML teams around the world.

]]>
6889
We Raised $8M Series A to Continue Building Experiment Tracking and Model Registry That “Just Works” https://neptune.ai/blog/series-a-announcement Wed, 18 May 2022 07:27:26 +0000 https://neptune.test/series-a-announcement/
[Image: Series A announcement, neptune.ai]

When I came to the machine learning space from software engineering in 2016, I was surprised by the messy experimentation practices, lack of control over model building, and a missing ecosystem of tools to help people deliver models confidently.  

It was a stark contrast from the software development ecosystem, where you have mature tools for DevOps, observability, or orchestration to execute efficiently in production.

Seeing that led me to start neptune.ai with a few friends back in 2017 to give ML practitioners the same level of confidence when developing and deploying models as software devs have when shipping apps.  

A lot has changed since then:

  • transformer models and GPT-3 were created, 
  • PyTorch became a standard, 
  • Theano was deprecated and then came back again, 
  • the term “MLOps” was coined and then became popular.  

Most importantly, the ML community realized that building a POC model in a notebook is not the end goal. 

Today, companies, big and small, deploy and operate those models in production.
By no means are we at a “develop and deploy models confidently” stage just yet, but we’ve made huge progress as a community. 

Speaking of progress, I am really happy to share that we’ve just raised an $8M Series A to continue building neptune.ai.

Almaz Capital led the round with participation from our existing investors: btov Partners, Rheingau Founders, and TDJ Pitango.

We’ve gone such a long way over these last few years. Today we have:

  • tens of thousands of users, 
  • hundreds of paying teams, 
  • places like CB Insights list us as a “Top 100 AI startup in 2021”. 

As a Polish engineer at heart, I have only one way to express how I feel: not bad. 

I am very grateful to:

  • all the users and customers for invaluable feedback and support, 
  • the team for putting in their best effort every day, 
  • investors for believing in our vision. 

While most companies in the MLOps space try to go wider and become platforms that solve all the problems of machine learning teams, we want to go deeper and become the best-in-class tool for experiment tracking and model registry. 

We want to solve “just” this one part of the MLOps stack really well. 

Why just one?

In a more mature software development space, there are almost no end-to-end platforms. So why should machine learning, which is even more complex, be any different? 

I believe that by focusing on providing a great developer experience for experiment tracking and model registry, we can become one of the pillars on which teams build their MLOps tool stacks.

And to make this happen, we will invest a big chunk of that $8M in developer experience. Expect:

  • more features built for specific ML use cases, 
  • even more responsive UI & APIs, 
  • revamped UX of our web UI, 
  • more integrations with the tools from the MLOps ecosystem,
  • new ways of interacting via webhooks and notifications, 
  • better documentation,
  • quicker feedback-to-feature loops.

But first and foremost, we’ll continue making experiment tracking and model registry “just work” for ML teams around the world. 

If you’re interested in joining us, checking out the tool, or sharing feedback, I’d love to hear from you:

  • Jobs: we are hiring for engineering, devrel, and growth roles 
  • Docs:  if you haven’t yet, try Neptune out 
  • Request a demo: we’ll create a custom demo for your use case 

Note: The article was published on 12th April, 2022.

]]>
6582