Jakub Czakon, author at neptune.ai

8 Best Data Science and Machine Learning Platforms For MLOps

Jakub Czakon — Tue, 11 Oct 2022 15:24:21 +0000

From gathering a team and preparing data to deployment, monitoring, and management of machine learning models, MLOps takes a lot of time, resources, and technology. It’s a complex network of processes and practices that require a certain type of technology. Statistics show that AI technology adoption is still low in businesses as only 14.6% of firms have invested in AI capabilities in production.

Whether you’re a data scientist wanting to optimize your ML experiment processes, or an owner of a business looking for AI means to grow your business, you can use dedicated MLOps tools that’ll help you manage your experiments at every stage.

Here are the best MLOps tools to prepare, deploy, and monitor experiments, and bring all your work to one place!

1. Neptune

Neptune is a lightweight experiment management tool that helps you keep track of your machine learning experiments and manage all your model metadata. It is very flexible, works with many other frameworks, and thanks to its stable user interface, it enables great scalability.

Here’s what Neptune offers to monitor your ML models:

Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
Version, store, organize, and query models, and model development metadata including dataset, code, env config versions, parameters and evaluation metrics, model binaries, description, and other details
Filter, sort, and group model training runs in a dashboard to better organize your work
Compare metrics and parameters in a table that automatically finds what changed between runs and what are the anomalies
Automatically record the code, environment, parameters, model binaries, and evaluation metrics every time you run an experiment
Your team can track experiments that are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker) and do that on any infrastructure (cloud, laptop, cluster)
Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)

Neptune is robust software that lets you store all your data in one place, collaborate easily, and experiment flexibly with your models.
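To make the "finds what changed between runs" idea concrete, here is a minimal sketch in plain Python. It is not Neptune's implementation or API; the function name `diff_runs` and the two run dictionaries are made up for illustration.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # placeholder for a key that one of the runs did not log


def diff_runs(run_a: Dict[str, Any], run_b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return only the keys whose values differ between two runs."""
    changed = {}
    for key in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(key, _MISSING), run_b.get(key, _MISSING)
        if a != b:
            changed[key] = (a, b)
    return changed


# Hypothetical metadata logged for two training runs.
baseline = {"lr": 0.01, "batch_size": 64, "optimizer": "adam", "val_acc": 0.91}
candidate = {"lr": 0.001, "batch_size": 64, "optimizer": "adam", "val_acc": 0.93, "dropout": 0.2}

for name, (old, new) in diff_runs(baseline, candidate).items():
    print(f"{name}: {old} -> {new}")
```

Unchanged fields such as `batch_size` drop out of the result, which is what makes a comparison table readable once runs have dozens of parameters.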

2. Amazon SageMaker

Amazon SageMaker is a platform that enables data scientists to build, train, and deploy machine learning models. It has all the integrated tools for the entire machine learning workflow providing all of the components used for machine learning in a single toolset.

SageMaker is a tool suitable for arranging, coordinating, and managing machine learning models. It has a single, web-based visual interface to perform all ML development steps – notebooks, experiment management, automatic model creation, debugging, and model drift detection.

Amazon SageMaker – summary:

Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance – it helps to deploy the best performing model
SageMaker Ground Truth helps you build and manage highly accurate training datasets quickly
SageMaker Experiments helps to organize and track iterations to machine learning models by automatically capturing the input parameters, configurations, and results, and storing them as ‘experiments’
SageMaker Debugger automatically captures real-time metrics during training (such as training and validation metrics, confusion matrices, and learning gradients) to help improve model accuracy. Debugger can also generate warnings and remediation advice when common training problems are detected
SageMaker Model Monitor allows developers to detect and remediate concept drift. It automatically detects concept drift in deployed models and gives detailed alerts that help identify the source of the problem

Note

See Neptune’s integration with SageMaker

3. Cnvrg.io

cnvrg is an end-to-end machine learning platform to build and deploy AI models at scale. It helps teams to manage, build and automate machine learning from research to production.

You can run and track experiments in hyperspeed with the freedom to use any compute environment, framework, programming language or tool – no configuration required.

Here are the main features of cnvrg:

Organize all your data in one place and collaborate with your team
Real-time visualization allows you to visually track models as they run with automatic charts, graphs and more, and easily share with your team
Store models and meta-data, including parameters, code version, metrics and artifacts
Track changes and automatically record code and parameters for reproducibility
Build production-ready machine learning pipelines in a few clicks with the drag & drop feature

cnvrg lets you store, manage, and control all your data and experiments, and use them flexibly to fit your needs.

4. Iguazio

Iguazio helps in the end-to-end automation of machine learning pipelines. It simplifies development, accelerates performance, facilitates collaboration, and addresses operational challenges.

The platform incorporates the following components:

A data science workbench that includes Jupyter Notebook, integrated analytics engines, and Python packages
Model management with experiments tracking and automated pipeline capabilities
Managed data and machine-learning (ML) services over a scalable Kubernetes cluster
A real-time serverless functions framework — Nuclio
Fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming
Integration with third-party data sources such as Amazon S3, HDFS, SQL databases, and streaming or messaging protocols
Real-time dashboards based on Grafana

Iguazio provides you with a complete data science workflow in a single ready-to-use platform for creating data science applications from research to production.

5. Spell

Spell is a platform for training and deploying machine learning models quickly and easily. It provides a Kubernetes-based infrastructure to run and manage ML experiments, store all your data, and automate the MLOps lifecycle.

Here are some of the main features of Spell:

Provides base environments for TensorFlow, PyTorch, Fast.ai, and others. Or, you can roll your own.
You can install any code packages you need using pip, conda, and apt.
Straightforward and accessible Jupyter workspaces, datasets, and resources
Runs can be linked together using workflows to manage model training pipelines
Model metrics automatically generated by Spell
Hyperparameter search
Model servers make it easy to productionize models trained on Spell, allowing you to use one tool for both your model training and model serving

6. MLflow

MLflow is an open-source platform that helps manage the whole machine learning lifecycle that includes experimentation, reproducibility, deployment, and a central model registry.

MLflow is suitable for individuals and for teams of any size.

The tool is library-agnostic. You can use it with any machine learning library and in any programming language.

MLflow comprises four main functions:

MLflow Tracking – an API and UI for logging parameters, code versions, metrics, and artifacts when running machine learning code and for later visualizing and comparing the results
MLflow Projects – packaging ML code in a reusable, reproducible form to share with other data scientists or transfer to production
MLflow Models – managing and deploying models from different ML libraries to a variety of model serving and inference platforms
MLflow Model Registry – a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations
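The registry idea in the last item (model versioning plus stage transitions) can be sketched without any library. This is a toy in-memory illustration, not MLflow's API: `ToyRegistry`, `ModelVersion`, the stage names, and the artifact paths are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stage names are illustrative assumptions for this sketch.
STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    version: int
    artifact_path: str
    stage: str = "None"
    description: str = ""


@dataclass
class ToyRegistry:
    models: Dict[str, List[ModelVersion]] = field(default_factory=dict)

    def register(self, name: str, artifact_path: str, description: str = "") -> ModelVersion:
        """Store a new version; version numbers start at 1 and only grow."""
        versions = self.models.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact_path, description=description)
        versions.append(mv)
        return mv

    def transition(self, name: str, version: int, stage: str) -> None:
        """Move one version to a new stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        for mv in self.models[name]:
            # Keep at most one Production version: archive the previous holder.
            if stage == "Production" and mv.stage == "Production" and mv.version != version:
                mv.stage = "Archived"
        self.models[name][version - 1].stage = stage

    def latest(self, name: str, stage: str) -> ModelVersion:
        """Return the newest version currently in the given stage."""
        candidates = [mv for mv in self.models[name] if mv.stage == stage]
        if not candidates:
            raise LookupError(f"no version of {name!r} in stage {stage!r}")
        return candidates[-1]


registry = ToyRegistry()
registry.register("churn", "models/churn/1.pkl", description="baseline")
registry.register("churn", "models/churn/2.pkl", description="more features")
registry.transition("churn", 1, "Production")
registry.transition("churn", 2, "Production")  # version 1 becomes Archived
print(registry.latest("churn", "Production").version)
```

A serving job would then ask for "the latest Production version" instead of hard-coding a file path, which is the main reason to keep a registry at all.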

Note

See Neptune’s integration with MLflow

7. TensorFlow

TensorFlow is an end-to-end platform for deploying production ML pipelines. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system.

TensorFlow provides stable Python and C++ APIs, as well as APIs for other languages that come without a backward-compatibility guarantee. It also supports an ecosystem of powerful add-on libraries and models to experiment with, including Ragged Tensors, TensorFlow Probability, Tensor2Tensor and BERT.

TensorFlow lets you train and deploy your model easily, no matter what language or platform you use – use TensorFlow Extended (TFX) if you need a full production ML pipeline; for running inference on mobile and edge devices, use TensorFlow Lite; train and deploy models in JavaScript environments using TensorFlow.js.

It’s an MLOps tool suitable for beginners and advanced data scientists. It has all the necessary features and gives you the flexibility to use it however you want.

Note

See Neptune’s integration with TensorFlow

8. Kubeflow

Kubeflow is the ML toolkit for Kubernetes. It helps in maintaining machine learning systems by packaging and managing Docker containers.

It facilitates the scaling of machine learning models by making run orchestration and deployments of machine learning workflows easier. It’s an open-source project that contains a curated set of compatible tools and frameworks specific to various ML tasks.

Here’s a short Kubeflow summary:

A user interface (UI) for managing and tracking experiments, jobs, and runs
Notebooks for interacting with the system using the SDK
Re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
Kubeflow Pipelines is available as a core component of Kubeflow or as a standalone installation
Multi-framework integration

Note

See the comparison between Neptune and Kubeflow

To wrap it up

End-to-end MLOps tools give you a lot of flexibility in carrying out your ML experiments. They also automate work and optimize time-consuming processes. And even though they aim at the same goal – to give you an infrastructure for scalable ML experimentation – their features may differ. So choose the one that suits your needs to get the most out of it.

Happy experimenting!

Best Tools to Do ML Model Monitoring

Jakub Czakon — Tue, 13 Sep 2022 11:10:08 +0000

If you deploy models to production, sooner or later you will start looking for ML model monitoring tools.

When your ML models impact the business (and they should), you just need visibility into “how things work”.

The first moment you really feel this is when things stop working. With no model monitoring set up, you may have no idea what is wrong and where to start looking for problems and solutions. And people want you to fix this asap.

But what do “things” and “work” mean in this context?

Interestingly, depending on the team/problem/pipeline/setup, people mean entirely different things.

One benefit of working at an MLOps company is that you can talk to many ML teams and get this info firsthand. So it turns out that when people say “I want to monitor ML models” they may want to:

monitor model performance in production: see how accurate the predictions of your model are. See if the model performance decays over time, and you should re-train it.
monitor model input/output distribution: see if the distribution of input data and features that go into the model has changed. Has the predicted class distribution changed over time? Those things could be connected to data and concept drift.
monitor model training and re-training: see learning curves, trained model predictions distribution, or confusion matrix during training and re-training.
monitor model evaluation and testing: log metrics, charts, prediction, and other metadata for your automated evaluation or testing pipelines
monitor hardware metrics: see how much CPU/GPU or Memory your models use during training and inference.
monitor CI/CD pipelines for ML: see the evaluations from your CI/CD pipeline jobs and compare them visually. In ML, the metrics often only tell you so much, and someone needs to actually see the results.
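To make the first item concrete (spotting performance decay in production), here is a minimal sketch of a rolling accuracy check. It is an illustration only: the class name, the window size, and the threshold are assumptions, and it presumes that ground-truth labels eventually arrive for your predictions.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the last `window` labeled predictions."""

    def __init__(self, window: int, threshold: float) -> None:
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.threshold = threshold

    def update(self, prediction, label) -> None:
        self.outcomes.append(1 if prediction == label else 0)

    @property
    def accuracy(self) -> float:
        if not self.outcomes:
            return float("nan")
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self) -> bool:
        # Only raise a flag once the window is full, to avoid noisy early alerts.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.threshold


monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
for prediction, label in [("cat", "cat"), ("dog", "dog"), ("cat", "dog"), ("dog", "cat")]:
    monitor.update(prediction, label)

if monitor.needs_attention():
    print(f"accuracy dropped to {monitor.accuracy:.2f}, consider re-training")
```

The same pattern works for any metric you can compute per prediction; the hard part in practice is usually getting the labels, not the arithmetic.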

Which ML model monitoring did you mean?

Either way, we’ll look into tools that help with all of those use cases.

But first…

How to compare ML model monitoring tools

Obviously, depending on what you want to monitor, your needs will change but there are some things that you should definitely consider before choosing an ML model monitoring tool:

ease of integration: how easy is it to connect it to your model training and model deployment tools
flexibility and expressiveness: can you log and see what you want and how you want it
overhead: how much overhead does the logging impose on your model training and deployment infrastructure
monitoring functionality: can you monitor data/feature/concept/model drift? Can you compare multiple models that are running at the same time (A/B tests)?
alerting: does it provide automated alerts when the performance or input goes crazy?

Ok now, let’s look into the actual model monitoring tools!

ML model monitoring tools

First, let’s go back to different monitoring capabilities and see which tool checks these boxes.

[Comparison table]

Tools compared: neptune.ai, Arize AI, WhyLabs, Grafana + Prometheus, Evidently, Qualdo, Fiddler, Amazon SageMaker, Seldon Core, Censius

Capabilities compared: model evaluation and testing, hardware metrics, model input/output distribution, model training and re-training, model performance in production, CI/CD pipelines for ML

And now, we’ll review each of these tools in more detail.

1. neptune.ai

Neptune is the most scalable experiment tracker designed with a strong focus on teams that train foundation models.

You can log and display pretty much any ML metadata from metrics and losses, prediction images, hardware metrics to interactive visualizations.

When it comes to monitoring ML models, people mostly use it for:

model training, evaluation, testing,
hardware metrics display
but you can (and some teams do) log performance metrics from production jobs and see metadata from ML CI/CD pipelines.

It has a flexible metadata structure that allows you to organize training and production metadata the way you want to. You can think of it as a dictionary or a folder structure that you create in code and display in the UI.
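The "dictionary or folder structure" idea can be sketched in plain Python. This is not the Neptune client: `MetadataStore` and the paths below are hypothetical and only show how path-like keys map onto a nested dictionary.

```python
from typing import Any, Dict, Tuple


class MetadataStore:
    """A nested dict addressed with folder-like paths such as 'training/metrics/loss'."""

    def __init__(self) -> None:
        self._root: Dict[str, Any] = {}

    def _parent(self, path: str) -> Tuple[Dict[str, Any], str]:
        # Walk (and create) every "folder" on the way to the final key.
        *folders, leaf = path.split("/")
        node = self._root
        for folder in folders:
            node = node.setdefault(folder, {})
        return node, leaf

    def assign(self, path: str, value: Any) -> None:
        """Store a single value, e.g. a hyperparameter."""
        node, leaf = self._parent(path)
        node[leaf] = value

    def append(self, path: str, value: Any) -> None:
        """Add a value to a series, e.g. a metric logged every epoch."""
        node, leaf = self._parent(path)
        node.setdefault(leaf, []).append(value)

    def to_dict(self) -> Dict[str, Any]:
        return self._root


store = MetadataStore()
store.assign("training/params/lr", 0.001)
store.append("training/metrics/loss", 0.9)
store.append("training/metrics/loss", 0.7)
store.assign("production/model_version", "v3")
store.append("production/metrics/latency_ms", 41)
print(store.to_dict())
```

Because training and production metadata live under different top-level "folders", you can organize them however suits your project without changing the logging calls.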

You can build dashboards that display the performance and hardware metrics you want to see to better organize your model monitoring information.

You can compare metrics between models and runs to see how a model update changed performance or hardware consumption and whether you should abort live model training because it just won’t beat the baseline.

If you are wondering if it will fit your workflow:

check out case studies of how people set up their MLOps/LLMOps tool stack with Neptune
explore an example public project

2. Arize AI

Example embedding drift monitor | Source

Arize AI is an ML model monitoring platform that is capable of boosting the observability of your project and helping you with troubleshooting production AI.

If your ML team is working without a powerful observability and real-time analytics tool, engineers can waste days trying to identify potential problems. Arize AI makes it easy to pinpoint what went wrong, so that software engineers can immediately find and fix a problem before it impacts the business. Arize AI has the following features:

Simple integration. Arize AI can be used to enhance observability of any model in any environment. Detailed documentation and community support allow you to integrate and go live in minutes.
Pre-launch validation. It’s important to check that your models will behave as expected before they are deployed. Pre-launch validation toolkit can help you gain confidence in the model’s performance and perform pre- and post-launch validation checks.
Automatic monitoring. Model monitoring should be proactive rather than reactive so that you can identify performance degradation or prediction drift early on. Automated monitoring systems can help you with that, and integrations with tools such as PagerDuty or Slack can notify you in real time. It demands zero setup and provides space for easy-to-customize dashboards.
Monitor and Identify Drift. Track for prediction, data, and concept drift across model dimensions and values, and compare across training, validation, and production environments.
Ensure Data Integrity. Guarantee the quality of model data inputs and outputs with automated checks for missing, unexpected, or extreme values.
Improve Model Performance. Use ML performance tracing to automatically pinpoint the source of model performance problems and map back to underlying data issues.
Leverage Explainability. See how a model dimension affects prediction distributions, and leverage SHAP to explain feature importance for specific cohorts.
Monitor Unstructured Data. By monitoring embeddings of unstructured data for CV or NLP models with Arize, teams can proactively identify when their unstructured data is drifting.
Dynamic Dashboards. Leverage pre-configured dashboard templates or create customized dashboards to help focus troubleshooting efforts.

3. WhyLabs

WhyLabs dashboard | Source

WhyLabs is a model monitoring and observability tool that helps ML teams monitor data pipelines and ML applications. It helps detect data quality degradation, data drift, and data bias. Monitoring the performance of a deployed model is critical to addressing these issues proactively, and it also helps you determine the appropriate time and frequency for retraining and updating the model. WhyLabs has quickly become quite popular among developers since it can easily be used in mixed teams where seasoned developers work side-by-side with junior employees.

The tool enables you to:

Automatically monitor model performance with out-of-the-box or tailored metrics.
Detect overall model performance degradation and successfully identify issues causing it.
Perform easy integrations with other tools while maintaining high privacy-preserving standards via their open source data logging library – whylogs.
Use popular libraries and frameworks like MLflow, Spark, SageMaker, etc. to make WhyLabs adoption go smoothly.
Debug data and model issues easily with in-built tools.
Set up the tool in seconds with an easy-to-use zero-configuration setup.
Be notified about the current workflow via the channel that you prefer like Slack, SMS, etc.

One of the biggest advantages of WhyLabs for model monitoring is that it eliminates the need for manual problem-solving and, consequently, saves money and time. You can use this tool to work with structured and unstructured data, regardless of the scale. WhyLabs uses AWS cloud. It runs containers with Amazon ECS and uses Amazon EMR for large-scale data processing.

4. Grafana + Prometheus

Prometheus is a popular open-source monitoring tool, often used for ML model monitoring, that was originally developed at SoundCloud to collect multidimensional time series data and query it.

The main advantages of Prometheus are tight integration with Kubernetes and many of the available exporters and client libraries, as well as a fast query language. Prometheus is also Docker compatible and available on the Docker Hub.

The Prometheus server is a self-contained unit that does not depend on network storage or external services, so you don’t need to deploy much additional infrastructure or software. Its main task is to monitor certain objects and store what it collects from them. An object can be anything: a Linux server, one of the processes, a database server, or any other component of the system. Each object you monitor is called a target, and each value you collect from it is a metric.

The Prometheus server reads targets at an interval that you define to collect metrics and stores them in a time series database. You set the targets and the time interval for reading the metrics. You then query the time series database where the metrics are stored using the PromQL query language.
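What the server reads from a target is plain text. The sketch below renders model metrics in the Prometheus text exposition format using only the standard library. The metric name, label names, and the `render_metric` helper are assumptions for illustration; in a real service you would typically use an official Prometheus client library and serve this text over HTTP.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def render_metric(name: str, help_text: str, metric_type: str, samples: List[Sample]) -> str:
    """Render one metric family as Prometheus exposition text.

    Simplification: label values are not escaped, so they must not contain
    quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical latency gauge for two versions of the same model.
text = render_metric(
    "model_prediction_latency_seconds",
    "Latency of the last prediction request.",
    "gauge",
    [
        ({"model": "churn", "version": "3"}, 0.042),
        ({"model": "churn", "version": "4"}, 0.037),
    ],
)
print(text)
```

Once such a metric is scraped, a PromQL query like `avg_over_time(model_prediction_latency_seconds{model="churn"}[5m])` would return its five-minute average per series.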

Grafana dashboard | Source

Grafana allows you to visualize monitoring metrics. Grafana specializes in time series analytics. It can visualize the results of monitoring work in the form of line graphs, heat maps, and histograms.

Instead of writing PromQL queries directly to the Prometheus server, you use Grafana GUI boards to request metrics from the Prometheus server and render them in the Grafana dashboard.

Key features of Grafana:

Alerts. You can receive alerts through a variety of channels from messengers to Slack. If you prefer other options, you can add your own alerts manually with a little bit of code.
Dashboard templates. You can create customized dashboards for different tasks and manage everything you need in one interface.
Automation. You can automate work in Grafana using scripts.
Annotations. If something goes wrong, you can time-match events from different dashboards and sources to analyze the cause of the failure. You can create annotations manually by adding comments to the desired points and plot fragments.

5. Evidently

Evidently dashboard | Source

Evidently is an open-source ML model monitoring system. It helps analyze machine learning models during development, validation, or production monitoring. The tool generates interactive reports from pandas DataFrames.

Currently, 6 reports are available:

Data Drift: detects changes in feature distribution
Numerical Target Drift: detects changes in the numerical target and feature behavior
Categorical Target Drift: detects changes in categorical target and feature behavior
Regression Model Performance: analyzes the performance of a regression model and model errors
Classification Model Performance: analyzes the performance and errors of a classification model. Works both for binary and multi-class models
Probabilistic Classification Model Performance: analyzes the performance of a probabilistic classification model, quality of model calibration, and model errors. Works both for binary and multi-class models
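To give a feel for what a data drift check computes, here is a small sketch that scores the shift of one numerical feature with the Population Stability Index (PSI). This is not Evidently's code or API; PSI is swapped in here as one common drift score, and the function name, bin count, and sample data are assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # out-of-range values go to edge bins
        eps = 1e-6  # avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(100)]   # feature values seen at training time
shifted = [x + 0.5 for x in reference]      # the same feature after a shift in production

print(f"no drift: {psi(reference, reference):.3f}")
print(f"shifted:  {psi(reference, shifted):.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right threshold depends on your data, so treat it as a starting point rather than a rule.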

6. Qualdo

Qualdo is a Machine Learning model performance monitoring tool in Azure, Google, and AWS. The tool has some nice, basic features that allow you to observe your models throughout their entire lifecycle.

With Qualdo, you can gain insights from production ML input/prediction data, logs, and application data to watch and improve your ML model performance. It offers model deployment and automatic monitoring of data drift and data anomalies, and you can see quality metrics and visualizations.

It also offers tools to monitor ML pipeline performance in TensorFlow and leverages TensorFlow’s data validation and model evaluation capabilities.

Additionally, it integrates with many AI, machine learning, and communication tools to improve your workflow and make collaboration easier.

It’s a rather simple tool and doesn’t offer many advanced features. Hence, it’s best if you’re looking for an easy ML model performance monitoring solution.

7. Fiddler

Fiddler dashboard | Source

Fiddler is a model monitoring tool that has a user-friendly, clear, and simple interface. It lets you monitor model performance, explain and debug model predictions, analyze model behavior for entire data and slices, deploy machine learning models at scale, and manage your machine learning models and datasets.

Here are Fiddler’s ML model monitoring features:

Performance monitoring – a visual way to explore data drift and identify what data is drifting, when it’s drifting, and how it’s drifting
Data integrity – to ensure that no incorrect data gets into your model and negatively impacts the end-user experience
Tracking outliers – Fiddler shows both univariate and multivariate outliers in the Outlier Detection tab
Service metrics – give you basic insights into the operational health of your ML service in production
Alerts – Fiddler allows you to set up alerts for a model or group of models in a project to warn about issues in production

Overall, it’s a great tool for monitoring machine learning models with all the necessary features.

8. Amazon SageMaker Model Monitor

SageMaker dashboard | Source

Amazon SageMaker Model Monitor is one of the Amazon SageMaker tools. It automatically detects and alerts on inaccurate predictions from models deployed in production so you can maintain the accuracy of models.

Here’s the summary of SageMaker Model Monitoring features:

Customizable data collection and monitoring – you can select the data you want to monitor and analyze without the need to write any code
Built-in analysis in the form of statistical rules, to detect drifts in data and model quality
You can write custom rules and specify thresholds for each rule. The rules can then be used to analyze model performance
Visualization of metrics, and running ad-hoc analysis in a SageMaker notebook instance
Model prediction – import your data to compute model performance
Schedule monitoring jobs
The tool is integrated with Amazon SageMaker Clarify so you can identify potential bias in your ML models
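The "statistical rules with thresholds" idea from the list above boils down to two steps: compute baseline statistics on training data, then check each production batch against them. The sketch below shows that pattern in plain Python. It does not use or mirror the SageMaker API; the function names, rule names, and default thresholds are assumptions.

```python
import statistics
from typing import Dict, List, Optional


def build_baseline(values: List[Optional[float]]) -> Dict[str, float]:
    """Summarize a training-time feature column (None means a missing value)."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
        "missing_ratio": 1 - len(present) / len(values),
    }


def check_batch(
    values: List[Optional[float]],
    baseline: Dict[str, float],
    max_mean_shift_in_std: float = 3.0,
    max_missing_increase: float = 0.05,
) -> List[str]:
    """Return the names of the rules that a production batch violates."""
    violations = []
    present = [v for v in values if v is not None]
    missing_ratio = 1 - len(present) / len(values)

    # Rule 1: the share of missing values must not grow too much.
    if missing_ratio - baseline["missing_ratio"] > max_missing_increase:
        violations.append("missing_ratio")

    # Rule 2: the batch mean must stay within N baseline standard deviations.
    if present and baseline["std"] > 0:
        shift = abs(statistics.fmean(present) - baseline["mean"]) / baseline["std"]
        if shift > max_mean_shift_in_std:
            violations.append("mean_shift")
    return violations


baseline = build_baseline([10.0, 11.0, 9.0, 10.0, 10.0, 12.0, 8.0, 10.0])
print(check_batch([10.0, 9.0, 11.0, 10.0], baseline))   # healthy batch
print(check_batch([20.0, None, 21.0, None], baseline))  # shifted batch with missing values
```

Running a check like this on a schedule against captured inference data is the core of what a managed monitor automates for you.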

When used with other tools for ML, SageMaker Model Monitor gives you full control of your experiments.

9. Seldon Core

Seldon Core is an open-source platform for deploying machine learning models on Kubernetes. It’s an MLOps framework that lets you package, deploy, monitor, and manage thousands of production machine learning models.

It runs on any cloud and on-premises, is framework-agnostic, and supports top ML libraries, toolkits, and languages. It also converts your ML models (e.g., TensorFlow, PyTorch, H2O) or language wrappers (Python, Java) into production REST/gRPC microservices.

Basically, Seldon Core has all the necessary functions to scale a high number of ML models. You can expect features like advanced metrics, outlier detectors, canaries, rich inference graphs made out of predictors, transformers, routers, or combiners, and more.
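A canary, one of the features mentioned above, sends a small share of live traffic to a new model while the rest keeps hitting the current one. Here is a plain-Python sketch of that routing idea. It does not use or mirror Seldon Core's API; `CanaryRouter`, the two toy predictors, and the 10% share are assumptions.

```python
import random
from collections import Counter
from typing import Callable, Optional

Predictor = Callable[[float], str]


class CanaryRouter:
    """Send a fixed share of requests to a canary model, the rest to the main model."""

    def __init__(self, main: Predictor, canary: Predictor, canary_share: float,
                 seed: Optional[int] = None) -> None:
        if not 0.0 <= canary_share <= 1.0:
            raise ValueError("canary_share must be between 0 and 1")
        self.main = main
        self.canary = canary
        self.canary_share = canary_share
        self._rng = random.Random(seed)
        self.traffic: Counter = Counter()  # how many requests each model served

    def predict(self, x: float) -> str:
        use_canary = self._rng.random() < self.canary_share
        self.traffic["canary" if use_canary else "main"] += 1
        return (self.canary if use_canary else self.main)(x)


def current_model(x: float) -> str:
    return "high" if x > 0.5 else "low"


def new_model(x: float) -> str:
    return "high" if x > 0.4 else "low"


router = CanaryRouter(current_model, new_model, canary_share=0.1, seed=0)
for i in range(1000):
    router.predict(i / 1000)
print(dict(router.traffic))
```

Comparing metrics between the two traffic groups before raising the canary share is what turns this from a router into a safe rollout.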

10. Censius

Censius is an AI model observability platform that lets you monitor the entire ML pipeline, explain predictions, and proactively fix issues for an improved business outcome.

Censius dashboard | Source

Key features of Censius:

Fully configurable monitors that detect drift, data quality issues, and performance degradation
Real-time notifications that keep you ahead of issues in your model serving pipeline
Customizable dashboards where you can slice and dice your model training and production data and watch any business KPIs
Native support for A/B test frameworks as you continue to experiment and iterate with different models in production
Drill down to the root cause of your problem with explainability for tabular, image, and text data

Conclusion

Now that you know how to evaluate tools for ML model monitoring and what is out there, the best way to go forward is to test out the ones you liked!

You can also continue evaluating tools by checking out this great resource: the ML model monitoring tools comparison prepared by the mlops.community.

Either way, happy monitoring!

MLOps at a Reasonable Scale [The Ultimate Guide]

Jakub Czakon — Wed, 07 Sep 2022 12:06:02 +0000

For a couple of years now, MLOps has probably been the most (over)used term in the ML industry. The more models people want to deploy to production, the more they think about how to organize the Ops part of this process.

Naturally, the way to do MLOps has been shaped by the big players on the market – companies like Google, Netflix, and Uber. What they did for the community was (and is) great, but they were solving their MLOps problems.

And most companies don’t have their problems. The majority of ML teams operate on a smaller scale and have different challenges. Yet they are the biggest part of the ML industry, and they want to know what’s the best way to do MLOps at their scale, with their resources and limitations.

Reasonable scale MLOps addresses this need. “Reasonable scale” is a term coined last year by Jacopo Tagliabue, and it refers to companies that:

have ML models that generate hundreds of thousands to tens of millions of USD per year (rather than hundreds of millions or billions)
have dozens of engineers (rather than hundreds or thousands)
deal with terabytes (rather than petabytes or exabytes)
have a finite amount of computing budget

In this guide, you’ll learn more about MLOps at a reasonable scale, and you’ll get to know the best practices, templates, and examples that will help you understand how to implement them in your work.

Before that, let’s take a few steps back and see why we even talk about reasonable scale.

MLOps vs MLOps at a reasonable scale

Solving the right problem and creating a working model, while still crucial, is no longer enough. At more and more companies, ML needs to be deployed to production to show “real value for the business”.

Otherwise, your managers or managers of your managers will start asking questions about the “ROI of our AI investment”. And that means trouble.

The good thing is that many teams, large and small, are past that point, and their models are doing something valuable for the business. The question is:

How do you actually deploy, maintain and operate those models in production?

The answer seems to be MLOps.

In 2021 so many teams looked for tools and best practices around ML operations that MLOps became a real deal. Dozens of tools and startups were created. 2021 was even called “a year of MLOps”. Cool.

But what does it mean to have MLOps set up?

If you read through online resources, it would be:

reproducible and orchestrated pipelines,
alerts and monitoring,
versioned and traceable models,
auto-scalable model serving endpoints,
data versioning and data lineage,
feature stores,
and so much more.

But does it have to be all of it?

Do you really need all those things, or is it just a “standard industry best practice”? Where do those “standard industry best practices” come from anyway?

Most of the good blog posts, whitepapers, conference talks, and tools are created by people from super-advanced, hyperscale companies. Companies like Google, Uber, and Airbnb. They have hundreds of people working on ML problems that serve trillions of requests a month.

That means most of the best practices you find are naturally biased toward hyperscale. But 99% of companies are not doing production ML at hyperscale.

Source: MLOps Is a Mess But That’s to be Expected

Most companies are either not doing any production ML yet or do it at a reasonable scale. Reasonable scale as in five ML people, ten models, millions of requests. Reasonable, demanding, but nothing crazy and hyperscale.

Source: ML and MLOps at a Reasonable Scale

Ok, so the best practices are biased toward hyperscale, but what is wrong with that?

The problem is when a reasonable scale team is going with “standard industry best practice” and tries to build or buy a full-blown, hyperscale MLOps system.

Building hyperscale MLOps with the resources of a reasonable scale ML team just cannot work.

Hyperscale companies need everything. Reasonable scale companies need to solve the most important current challenges. They need to be smart and pragmatic about what they need right now.

The tricky part is to tell what your actual needs are and what are potential, nice-to-have, future needs. With so many blog articles and conference talks out there, it is hard. Once you are clear about your reality, you are halfway there.

But there are examples of pragmatic companies achieving great results by embracing reasonable scale MLOps limitations:

Lemonade generates $100M+ in annual recurring revenue from ML models with just 2 ML engineers serving 20 data scientists.
Coveo leverages tools to deliver recommendation systems to thousands of companies with (almost) no ML infrastructure people.
Hypefactors runs NLP/CV data enrichment pipelines on the entire social media landscape with a team of just a few people.

You probably never heard of them, but their problems and solutions are a lot closer to your use case than that Netflix blog post or Google whitepaper you have opened in the other tab.

Check more stories from reasonable scale companies about how they solved different parts of their ML workflow.

The pillars of MLOps

Ok, so say you want to do MLOps right. What do you do? Even though MLOps is still developing, some things are clear(ish), e.g. the pillars of MLOps, which can guide you on how to even start thinking about this topic.

The pillars of MLOps – stack components

The first approach is based on the four or five main pillars of MLOps that you need to implement somehow:

Data ingestion (and optionally feature store)
Pipeline and orchestration
Model registry and experiment tracking
Model deployment and serving
Model monitoring

I say four or five because the data ingestion part is not always mentioned as one of the pillars. But I believe that it’s a crucial element and shouldn’t be skipped.

The pillars of MLOps

Each of those can be solved with a simple script or a full-blown solution depending on your needs.

End-to-end vs a canonical stack of best-in-class tools

The decision boils down to whether you want:

an end-to-end platform vs a stack of best-in-class point solutions
to buy vs build vs maintain open-source MLOps tools (or buy and build and maintain oss).

The answer, as always, is “it depends”.

Some teams have a fairly standard ML use case and decide to buy an end-to-end ML platform.

By doing so, they get everything-MLOps out of the box, and they can focus on ML.

The problem is that the further away you go from the standard use case, the harder it gets to adjust the platform to your workflow. And everything looks simple and standard at the beginning. Then business needs change, requirements change, and it is not so simple anymore.

And then there is the pricing discussion. Can you justify spending “this much” on an end-to-end enterprise solution when all you really need is just 3 out of 10 components? Sometimes you can, and sometimes you cannot.

The pillars of reasonable scale MLOps – components

Because of all that, many teams stay away from end-to-end and decide to build a canonical MLOps stack from point solutions that solve just some parts very well.

Potential implementation of the pillars of MLOps

Some of those solutions are in-house tools, some are open-source, and some are third-party SaaS or on-prem tools.

Depending on their use case, they may have something as basic as bash scripts for most of their ML operations and get something more advanced for one area where they need it.

For example:

You port your models to native mobile apps. You probably don’t need model monitoring but may need advanced model packaging and deployment.
You have complex pipelines with many models working together. Then you probably need some advanced pipelining and orchestration.
You need to experiment heavily with various model architectures and parameters. You probably need a solid experiment tracking tool.

By pragmatically focusing on the problems you actually have right now, you don’t overengineer solutions for the future. You deploy those limited resources that you (as a team doing ML at a reasonable scale) have into things that make a difference for your team/business.

The pillars of reasonable scale MLOps – principles

There’s also another approach to MLOps pillars that’s worth mentioning. It was brought up by Ciro Greco, Andrea Polonioli, and Jacopo Tagliabue in the article Hagakure for MLOps: The Four Pillars of ML at Reasonable Scale. The principles they write about are:

Data is superior to modeling: you can often gain more by iterating on data, not models (Andrew Ng talks about it a lot with ”data-centric AI”)
Log then transform: you should separate data ingestion (getting raw data) from data processing to get reproducibility and replayability. You can get that, for example, with Snowflake + dbt.
PaaS & FaaS is preferable to IaaS: You have limited resources. Focus them where you are making a difference. Instead of building and maintaining every component of the stack, use fully-managed services where you can. Your team’s time is the real cost here, not the subscription.
Vertical cuts deeper than distributed: in most cases, you don’t really need distributed computing architecture. You can use containerized, cloud-native scaling.

The reasonable scale principles | Inspired by Hagakure for MLOps: The Four Pillars of ML at Reasonable Scale
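To make the “log then transform” principle a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the Hagakure article and it is not how Snowflake or dbt work internally; the file name, event fields, and function names are all made up for illustration. The point is only that raw events are appended untouched, and derived tables are rebuilt by replaying the raw log.

```python
import json
import tempfile
from pathlib import Path


def ingest(raw_log: Path, event: dict) -> None:
    """Append the event exactly as received; earlier lines are never modified."""
    with raw_log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def transform(raw_log: Path) -> dict:
    """Rebuild a derived table (total purchases per user) by replaying the raw log."""
    totals: dict = {}
    with raw_log.open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "purchase":
                totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals


def demo_log_then_transform() -> dict:
    """Ingest a few hypothetical events, then derive the table from the raw log."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_log = Path(tmp) / "events.jsonl"  # hypothetical raw event store
        ingest(raw_log, {"type": "purchase", "user": "a", "amount": 10})
        ingest(raw_log, {"type": "view", "user": "a"})
        ingest(raw_log, {"type": "purchase", "user": "a", "amount": 5})
        ingest(raw_log, {"type": "purchase", "user": "b", "amount": 7})
        # Replaying the same raw log always yields the same derived table.
        return transform(raw_log)
```

Because the transformation only reads the raw log, you can change the transformation logic later and replay the full history without re-ingesting anything.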

Best practices and tips for setting up MLOps at a reasonable scale

Okay, we’ve talked about the pillars of MLOps and the principles of how to approach them. Now it’s time for the more practical part. You’re probably wondering:

How do reasonable scale companies actually set it up (and how should you do it)?

Here are the resources that will help you build a pragmatic MLOps stack for your use case.

Let’s start with some tips.

Recently we interviewed a few ML practitioners about setting up MLOps.

Lots of good stuff in there, but there was this one thought I just had to share with you:

“My number 1 tip is that MLOps is not a tool. It is not a product. It describes attempts to automate and simplify the process of building AI-related products and services.

Therefore, spend time defining your process, then find tools and techniques that fit that process.

For example, the process in a bank is wildly different from that of a tech startup. So the resulting MLOps practices and stacks end up being very different too.” – Phil Winder, CEO at Winder Research

So before everything, be pragmatic and think about your use case, your workflow, your needs. Not “industry best practices”.

I keep coming back to Jacopo Tagliabue, Head of AI at Coveo, but the fact is that no reasonable scale ML discussion is complete without him (after all, he’s the one who coined the term, right?). In his pivotal blog post, Jacopo suggests a mindset shift that we think is crucial (especially early in your MLOps journey):

“to be ML productive at a reasonable scale, you should invest your time in your core problems (whatever that might be) and buy everything else.”

You can watch him go deep into the subject in this Stanford Sys seminar video.

The third tip I want you to remember comes from Orr Shilon, ML engineering team lead at Lemonade.

In this episode of mlops.community podcast, he talks about platform thinking.

He suggests that their focus on automation and pragmatically leveraging tools wherever possible were key to doing things efficiently in MLOps.

With this approach, at one point, his team of two ML engineers managed to support the entire data science team of 20+ people. That is some infrastructure leverage.

One more place with great insights about setting up your MLOps is one of the MLOps community meetups with Andy McMahon, titled “Just Build It! Tips for Making ML Engineering and MLOps Real”. Andy talks about:

Where to start when you want to operationalize your ML models?
What comes first – process or tooling?
How to build and organize an ML team?
…and much more

It’s a great overview of what he learned when doing all these things in real life. Many valuable lessons there.

Now, let’s look at example MLOps stacks!

MLOps tool stacks

There are many tools that play in many MLOps categories though it is sometimes hard to understand who does what.

MLOps tools landscape | Credit: neptune.ai

From our research into how reasonable scale teams set up their stacks, we found out that:

Pragmatic teams don’t do everything. They focus on what they actually need.

For example, the team over at Continuum Industries needed to get a lot of visibility into testing and evaluation suites of their optimization algorithms.

So they connected Neptune with GitHub Actions CI/CD to visualize and compare various test runs.

Continuum Industries tool stack | Credit: neptune.ai

GreenSteam needed something that would work in a hybrid monolith-microservice environment.

Because of their custom deployment needs, they decided to go with Argo pipelines for workflow orchestration and deploy things with FastAPI.

Their stack:

GreenSteam tool stack | Credit: neptune.ai

Those teams didn’t solve everything deeply but pinpointed what they needed and did that very well.

There are more reasonable scale teams among our customers. Here are some case studies worth checking:

Zoined talks about scalable ML workflow with only a few Data Scientists & ML Engineers
Hypefactors talks about how to manage the process with a variable number of ML experiments
Deepsense.ai talks about finding a way to keep track of over 100k models
Brainly talks about managing their experiments when working with SageMaker Pipelines
InstaDeep talks about building a research-friendly and team-friendly process & stack

If you’d like to see more examples of how reasonable scale teams set up their MLOps, check these articles:

Monzo’s machine learning stack by Neil Lathia
Laying the foundation of our open source ML platform with a modern CI/CD pipeline by Theodore Meynard
The Road to a Serverless ML Pipeline in Production by Gal Shen
How These 8 Companies Implement MLOps: In-Depth Guide by our Developer Advocate, Stephen Oladele who did a great job researching and writing down setups of 8 more companies (some are reasonable scale, and some are hyperscale)

Also, if you want to go deeper, there is a slack channel where people share and discuss their MLOps stacks.

Here’s how you can join it:

Join mlops.community slack
Find the #pancake-stacks channel
While at it, come say hi in the #neptune-ai channel and ask us about this article, MLOps, or whatever else

Okay, stacks are great, but you probably want some templates, too.

MLOps templates

The best reasonable scale MLOps template comes from, you guessed it, Jacopo Tagliabue and collaborators.

In this open-source GitHub repository, they put together an end-to-end (Metaflow-based) implementation of an intent prediction and session recommendation.

MLOps template | Source: You Don’t Need a Bigger Boat

It shows how to connect the main pillars of MLOps and have an end-to-end working MLOps system you can build on. It is an excellent starting point that lets you use the default or pick and choose tools for each component.

Another great resource that is worth mentioning is the MLOps Infrastructure Stack article.

In that article, they explain how:

“MLOps must be language-, framework-, platform-, and infrastructure-agnostic practice. MLOps should follow a “convention over configuration” implementation.”

It comes with a nice graphical template from folks over at Valohai.

MLOps template | Source Valohai

They explain general considerations, tool categories, and example tool choices for each component. Overall a really good read.

MyMLOps gives you a browser-based tool stack builder that talks briefly about what tools do and in which categories they play. You can also share your stack with others.

MLOps template | Source MyMLOps

One more template from Jacopo Tagliabue. This one is specifically for recommender systems – Recs at reasonable scale. It was created in an effort to release as open source code a realistic data and ML pipeline for cutting-edge recommender systems “that just work”.

MLOps template for recommender systems | Source

You may also look into some of our resources for choosing tools for a particular component of the stack:

What should you do next?

Okay, now use this knowledge and go build your MLOps stack!

We’ve gathered here quite a lot of resources that should help you. But if you have specific questions on the way or just want to dig deeper into the topic, here’s even more useful stuff.

MLOps Community – I may be repeating myself, but that’s definitely the best MLOps community out there. Almost 10k practitioners in one place, asking questions, sharing knowledge, and just talking to each other about all things MLOps.
Apart from the very active Slack channel, MLOps Community also runs a podcast, organizes meetups and reading groups, and sends newsletters. Make sure to check all these resources.
MLOps Live – It’s a biweekly event organized by us, neptune.ai, where ML practitioners answer questions from other ML practitioners about one chosen subject related to MLOps. You can watch previous episodes on YouTube or listen to them as a podcast.
Personal blogs of ML folks – Many ML practitioners have their own blogs, which we highly recommend as well. Make sure to follow e.g. Chip Huyen, Eugene Yan, Jeremy Jordan, Shreya Shankar, or Laszlo Sragner. You can also check the Outerbounds blog.
MLOps Blog – Our own blog is also full of MLOps-related articles written by Data Scientists and ML Engineers who work in the industry. You’ll find pieces covering best practices, tools, real-life MLOps pipelines, and much more. Here are a few articles I think you should start with:
Towards Data Science – Probably an obvious resource, but you can find a lot of gold there when it comes to reasonable scale ML teams sharing their solutions and practices.
apply(conf) – Although there are speakers from hyperscale companies as well, this conference gives a lot of space in their agenda to reasonable scale teams. It’s one of the favorite events of the ML community, so there must be a reason for that.
Awesome MLOps GitHub repos – There are actually two repos with this name – here and here. They list everything from articles, books, and papers, to tools, newsletters, podcasts, and events.
If you’d like to take a step back, or you’re just starting to learn about MLOps, no worries. There’s something for everyone. You can check one of the courses: MLOps Fundamentals on Coursera, Zoomcamp organized by DataTalks Club or Made with ML.

Best 7 Data Version Control Tools That Improve Your Workflow With Machine Learning Projects

Jakub Czakon — Thu, 21 Jul 2022 13:30:12 +0000

Keeping track of all the data you use for models and experiments is not exactly a piece of cake. It takes a lot of time and is more than just managing and tracking files. You need to ensure everybody’s on the same page and follows changes simultaneously to keep track of the latest version.

You can do that with no effort by using the right software! A good data version control tool will allow you to have unified data sets with a strong repository of all your experiments.

It will also enable smooth collaboration between all team members so everyone can follow changes in real-time and always know what’s happening.

It’s a great way to systematize data version control, improve workflow, and minimize the risk of occurring errors.

So check out these top tools for data version control that can help you automate work and optimize processes.

Data versioning tools are critical to your workflow if you care about reproducibility, traceability, and ML model lineage.

They help you get a version of an artifact, a hash of the dataset or model that you can use to identify and compare it later. Often you’d log this data version into your metadata management solution to make sure your model training is versioned and reproducible.
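As a rough illustration of what “a hash of the dataset” can mean, the sketch below fingerprints a directory with SHA-256 using only the Python standard library. This is an assumption-laden toy, not the algorithm used by any of the tools listed below: the function name `dataset_version` and the choice to hash relative paths plus file contents are ours.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_version(root: Path) -> str:
    """Return a SHA-256 hex digest over the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    # Sort paths so the result does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def demo_dataset_version() -> tuple:
    """Hash a tiny hypothetical dataset before and after a one-value change."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "train.csv").write_text("x,y\n1,2\n", encoding="utf-8")
        v1 = dataset_version(root)
        v1_again = dataset_version(root)
        (root / "train.csv").write_text("x,y\n1,3\n", encoding="utf-8")
        v2 = dataset_version(root)
        return v1, v1_again, v2
```

The same data always gives the same identifier, and any change to a file gives a different one. That identifier is the kind of value you would log next to your training run.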

How to choose a data versioning tool?

To choose a suitable data versioning tool for your workflow, you should check:

Support for your data modality: how does it support video/audio? Does it provide some preview for tabular data?
Ease of use: how easy is it to use in your workflow? How much overhead does it add to your execution?
Diff and compare: Can you compare datasets? Can you see the diff for your image directory?
How well does it work with your stack: Can you easily connect to your infrastructure, platform, or model training workflow?
Can you get your team on board: If your team does not adopt it, it doesn’t matter how good the tool is. So keep your teammates’ skill sets and preferences in mind.

Here are a few tools worth exploring.

Best data version control tools

1. neptune.ai

Neptune is the most scalable experiment tracker designed with a strong focus on teams that train foundation models. It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye.

You can log and display pretty much any ML metadata from hyperparameters and metrics to videos, interactive visualizations, and data versions.

Neptune artifacts let you version datasets, models, and other files from your local filesystem or any S3-compatible storage with a single line of code. Specifically, it saves:

Version (hash) for the file or folder
Location of the file or folder
Folder structure (recursively)
Size of the file or folder

Once logged, you can use Neptune UI to group runs on dataset versions or see how the artifacts changed between runs.

When it comes to data versioning, Neptune is a very lightweight solution, and you can get going quickly. That said, it may not give you everything you need data-versioning-wise.

If you are wondering if it will fit your workflow:

check out the documentation
check out case studies of how people set up their MLOps tool stack with Neptune
explore an example public project about dataset versioning
or, if you are like me and would like to compare it to other tools in the space like DVC, Pachyderm, or wandb, check out the deeper feature-by-feature comparisons that make the evaluation easier.

2. Pachyderm

Pachyderm is a complete version-controlled data science platform that helps to control an end-to-end machine learning life cycle. It comes in three different versions, Community Edition (open-source, with the ability to be deployed anywhere), Enterprise Edition (complete version-controlled platform), and Hub Edition (a hosted version, still in beta).

It’s a great platform for flexible collaboration on any kind of machine learning project.

Here’s what you can do with Pachyderm as a data version tool:

Pachyderm lets you continuously update data in the master branch of your repo, while experimenting with specific data commits in a separate branch or branches
It supports any type, size, and number of files including binary and plain text files
Pachyderm commits are centralized and transactional
Provenance enables teams to build on each other’s work, share, transform, and update datasets while automatically maintaining a complete audit trail so that all results are reproducible

3. DVC

DVC is an open-source version control system for machine learning projects. It’s a tool that lets you define your pipeline regardless of the language you use.

When you find a problem in a previous version of your ML model, DVC saves you time by leveraging code, data, and pipeline versioning to give you reproducibility. You can also train your model and share it with your teammates via DVC pipelines.

DVC can cope with versioning and organization of big amounts of data and store them in a well-organized, accessible way. It focuses on data and pipeline versioning and management but also has some (limited) experiment tracking functionalities.

DVC – summary:

Possibility to use different types of storage— it’s storage agnostic
Full code and data provenance help to track the complete evolution of every ML model
Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
Tracking metrics
A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end
Tracking failed attempts
Runs on top of any Git repository and is compatible with any standard Git server or provider
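The idea of connecting ML steps into a DAG and running them end-to-end can be sketched with the standard library alone. The example below is not DVC’s implementation or file format. The stage names and the `PIPELINE` dictionary are hypothetical, and the ordering is done with Python’s `graphlib` (available from Python 3.9).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
PIPELINE = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def run_order(pipeline: dict) -> list:
    """Return the stages in an order where every stage comes after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def run_pipeline(pipeline: dict, actions: dict) -> list:
    """Call each stage's function in dependency order and return the execution log."""
    log = []
    for stage in run_order(pipeline):
        actions[stage]()
        log.append(stage)
    return log
```

A real pipeline tool adds caching on top of this, so that stages whose inputs did not change are skipped. The sketch leaves that out.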

4. Git LFS

Git Large File Storage (LFS) is an open-source project. It replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
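To show what such a text pointer looks like, here is a small sketch that builds one for a blob of bytes. A Git LFS pointer is a short text file with a spec version line, an `oid` line holding the SHA-256 of the content, and a `size` line in bytes. The helper name `lfs_pointer` is ours, and in practice the Git LFS client writes these files for you.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text that Git would store in place of the large file's content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # A stand-in for a large binary file.
    print(lfs_pointer(b"pretend this is a multi-gigabyte video"))
```

Git only ever diffs and stores these three lines, which is why repositories stay small even when the tracked files are not.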

It allows you to version large files (even those a couple of GB in size) with Git, host more in your Git repositories by using external storage, and clone and fetch faster from repositories that deal with large files.

At the same time, you can keep your workflow and the same access controls and permissions for large files as the rest of your Git repository when working with a remote host like GitHub.

5. Dolt

Dolt is a SQL database that you can fork, clone, branch, merge, push, and pull just like a Git repository. Dolt allows data and schema to evolve together, which makes working with a version-controlled database a better experience. It’s a great tool to collaborate on with your team.

You can freely connect to Dolt just like to any MySQL database to run queries or update the data using SQL commands.

Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes.

All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables.

There’s also DoltHub – a place to share Dolt databases.

6. lakeFS

lakeFS is an open-source platform that provides a Git-like branching and committing model that scales to Petabytes of data by utilizing S3 or GCS for storage.

This branching model makes your data lake ACID-compliant by allowing changes to happen in isolated branches that can be created, merged, and rolled back atomically and instantly.

lakeFS has three main areas that let you focus on different aspects of your ML models:

Development Environment for Data: has tools that you can use to isolate a snapshot of the lake that you can experiment with while others are not exposed to your changes; reproducibility to compare changes and improve experiments
Continuous Data Integration: entering and managing data according to your own rules
Continuous Data Deployment: ability to quickly revert changes to data; providing consistency in your datasets; testing of production data to avoid cascading quality issues

lakeFS is a great tool for focusing on a specific area of your datasets to make ML experiments more consistent.

7. Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake – summary:

Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
Serializable isolation levels ensure that readers never see inconsistent data.
Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments
Supports merge, update, and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.

To wrap it up

Now that you have the list of the best tools for data versioning, you “just” need to figure out how to make it work for you and your team.

That can be tricky.

Some things to consider when choosing a data versioning tool are:

How easy is it to set up: You may not have the time, needs, or budget to test something heavy right now.
Can you get your team on board: Sometimes, the solution is great, but you need a more software engineering-oriented mindset to use it. Some ML researchers or data scientists may not end up using it.
What tool stack are you using today: Are you using specific tools, infrastructure, or a platform that has good integration with a particular data versioning solution? In that case, probably the best option is to just go with that.
Data modality: Is it images, tables, text, all? Sometimes the tool doesn’t support your modality very well as it was built with a different use case in mind.

If you’d like to talk about choosing it or setting up your MLOps stack, I’d love to help.

Reach out to me, and let’s see what I can do!

ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It

Jakub Czakon — Thu, 21 Jul 2022 13:24:23 +0000

Let me share a story that I’ve heard too many times.

… So far, we have been doing everything manually and sort of ad hoc.

Some people are using it, some people are using that, it’s all over the place.

We don’t have anything standardized.

But we run many projects, the team is growing, and we are scaling pretty fast.

So we run into a lot of problems. How was the model trained? On what data? What parameters did we use for different versions? How can we reproduce them?

We just feel the need to control our experiments… – unfortunate Data Scientist

The truth is, when you develop ML models, you will run lots of experiments.

And those experiments may

use different models and model hyperparameters,
use different training or evaluation data,
run different code (including that one small change you wanted to test the other day)
run the same code in a different environment (not knowing which PyTorch or Tensorflow version was installed).

As a result, each of these experiments can produce completely different evaluation metrics.

Keeping track of all that information becomes really difficult really quickly. Especially if you want to organize and compare many experiments and feel confident that you selected the best models to go into production.

This is where experiment tracking comes in.

What is ML experiment tracking?

Experiment tracking is the process of saving all experiment-related information that you care about for every experiment you run. What this “information you care about” is will strongly depend on your project.

Generally, this so-called experiment metadata may include:

Any scripts used for running the experiment
Environment configuration files
Information about the data used for training and evaluation (e.g., dataset statistics and versions)
Model and training parameter configurations
ML evaluation metrics
Model weights
Performance visualizations (e.g., a confusion matrix or ROC curve)
Example predictions on the validation set (common in computer vision)

Of course, you want to have this information available after the experiment has finished. But, ideally, you’d like to see some of it already as your experiment is running.

Why?

Because for some experiments, you can see (almost) right away that there is no way they will get you better results. Instead of letting them run (which might take days or weeks), you are better off simply stopping them and trying something different.

To be able to collect, store, and analyze all the data, you need an experiment tracking system in place. Such a system will typically have three components:

Experiment database (neptune.ai servers on the visual below): A place where your logged experiment metadata is stored and can be queried.
Client library: A collection of methods that help you log metadata right from your training scripts and query the experiment database.
Experiment dashboard (neptune.ai web app on the visual below): A visual interface to your experiment database where you can see your experiment metadata.

Experiment tracking system architecture (based on neptune.ai example)

Of course, you can implement each component in many different ways, but the general picture will be very similar.
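As one illustration of that general picture, here is a deliberately tiny sketch of the three components built on SQLite. It is not how neptune.ai or any other tracker is implemented. The class names, table layout, and the `leaderboard` function (a stand-in for a dashboard) are assumptions made for this example.

```python
import json
import sqlite3


class ExperimentDB:
    """Experiment database: stores logged metadata and answers queries."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, params TEXT)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metrics "
            "(run_id TEXT, name TEXT, step INTEGER, value REAL)"
        )


class Run:
    """Client library: what a training script calls to log its metadata."""

    def __init__(self, db: ExperimentDB, run_id: str, params: dict) -> None:
        self.db = db
        self.run_id = run_id
        db.conn.execute("INSERT INTO runs VALUES (?, ?)", (run_id, json.dumps(params)))

    def log_metric(self, name: str, step: int, value: float) -> None:
        self.db.conn.execute(
            "INSERT INTO metrics VALUES (?, ?, ?, ?)", (self.run_id, name, step, value)
        )


def leaderboard(db: ExperimentDB, metric: str) -> list:
    """Dashboard stand-in: best value of a metric per run, highest first."""
    rows = db.conn.execute(
        "SELECT r.run_id, r.params, MAX(m.value) FROM runs r "
        "JOIN metrics m ON m.run_id = r.run_id WHERE m.name = ? "
        "GROUP BY r.run_id, r.params ORDER BY MAX(m.value) DESC",
        (metric,),
    ).fetchall()
    return [(run_id, json.loads(params), best) for run_id, params, best in rows]
```

In a training script you would create a `Run` once, call `log_metric` inside the training loop, and query `leaderboard` from anywhere that can reach the database. A real system adds authentication, file storage for artifacts, and a web UI on top of the same three roles.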

Wait, so isn’t experiment tracking like MLOps or something?

ML experiment tracking vs MLOps

MLOps deals with every part of the ML project lifecycle, from developing models by scheduling distributed training jobs, managing model serving, and monitoring the quality of models in production to re-training those models when needed.

Experiment tracking (also referred to as experiment logging) is part of MLOps, focused on supporting iterative model development, the part of the ML project lifecycle where you try many things to get your model performance to the level you need. Experiment tracking is closely intertwined with other aspects of MLOps, such as data and model versioning.

MLOps cycle and machine learning experiment tracking

Experiment tracking is useful even if your models don’t make it to production (yet). In many research-focused projects, you might never even get there. But especially in these projects, having all the metadata about every experiment you run and the ability to analyze it is important.

Ok, if you are a bit like me, you may be thinking:

Cool, so I know what experiment tracking is. …but why should I care?

Let me explain.

LLMs: from experiment tracking to prompt tracking

Here’s what our CEO has to say:

If you look at the “jobs to be done” of an experiment tracker, it goes way beyond experimenting. It’s not just about research. When you’re building models, you want to understand what’s happening, you want to understand the building process, you want to debug it, you want to compare it with other experiments. In this way, you can understand whether the model you’re building is going in the right direction or not. You want to version it so you have some level of reproducibility, some way to share a particular model for feedback – and you want to be able to hand the model over to an Ops team.

Listen to what our founding CEO Piotr Niedźwiedź had to say about experiment tracking and prompt engineering for LLMs on episode 168 of the MLOps Community podcast

When I think about prompt engineering, that’s quite a different way of building models. I’m not even sure that we should be calling it “engineering” in the sense of a building process because the model is stateless. For the latest models like GPT-4, fine-tuning is not (yet) available. So what you’re left with is crafting prompts. And you can configure agents and build the prompts in a sequential way using different models. So, yes, it is engineering.

When we talk about experiment tracking, we’re talking about the building phase and figuring out how a model works. In that spirit, I definitively see support for prompt visualizations and chain visualizations on our roadmap, as well as integration with Langchain. But this is just the beginning! I think that to really support teams that are building Large Language Models and using them in production, we’ll have to support and invent new methods to validate prompts.

Why does ML experiment tracking matter?

Building a tool for ML practitioners has one huge benefit. You get to talk to a lot of them.

And after talking to hundreds of people who track their experiments in Neptune, I identified four ways experiment tracking can improve your workflow.

4 reasons why machine learning experiment tracking matters

All of your ML experiments and models are organized in a single place

There are many ways to run your ML experiments or model training jobs:

Personal laptop
PC at work
A dedicated instance in the cloud
University cluster
Kaggle kernel or Google Colab
and many more …

Sometimes, you just want to test something quickly and run an experiment in a notebook. Sometimes, you need to spin up a distributed hyperparameter tuning job.

Either way, over the course of a project (especially when several people are working on it), you can end up with experiment results scattered across multiple machines.

With an experiment tracking system, all of your experiment results are logged to one experiment repository by design. Keeping all of your experiment metadata in a single place, regardless of where you run them, makes your experimentation process so much easier to manage.

“[Experiment tracking system] allows us to keep all of our experiments organized in a single space. Being able to see my team’s work results any time I need makes it effortless to track progress and enables easier coordination.” – Michael Ulin, VP of Machine Learning at Zesty.ai

A centralized experiment repository makes it easy to:

Search and filter experiments to find the information you need quickly
Compare metrics and parameters between experiments with no additional work
Drill down and see what exactly it was that you tried (code, data versions, architectures)
Reproduce or re-run experiments when you need to
Access experiment metadata even when you don’t have access to the server where you ran them

Additionally, you can sleep peacefully knowing that all the ideas you tried are safely stored, and you can always go back to them later.

Compare ML experiments, analyze results, debug model training with little extra work

Easily compare experiments, analyze results, and debug model training

Whether you are debugging training runs, looking for improvement ideas, or auditing your current best models, comparing experiments is important.

But when you don’t have any experiment tracking system in place,

the way you log things can change,
you may forget to log something important,
and you’re likely to lose some information accidentally.

In those situations, something as simple as comparing and analyzing experiments can get difficult or even impossible.

With an experiment tracking system, your experiments are stored in a single place, and you consistently follow the same protocol for logging them. Experiment analyses and comparisons can go as deep as you like, and you can focus on improving your models instead of worrying about data storage.

Tracking and comparing different approaches has noticeably boosted our productivity, allowing us to focus more on the experiments [and] develop new, good practices within our team… Tomasz Grygiel Data Scientist at idenTT

Proper experiment tracking makes it easy to:

Compare parameters and metrics between experiments
Overlay learning curves of different training runs
Group and compare experiments based on data versions or parameter values
Compare confusion matrices, ROC curves, and other performance charts
Compare the best/worst predictions on test or validation sets
View code diffs (and/or notebook diffs) for model, feature engineering, and training code
Look at hardware consumption during training runs for various models
Look at prediction explanations like feature importance, SHAP, or LIME
Compare rich-format artifacts like video or audio
… and compare anything else you logged

Modern experiment tracking tools will give you many, if not all, of those comparison features (almost) for free. Some tools even go as far as to automatically find suitable experiments to compare to and identify for you which parameters have the biggest impact on model performance.
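To make the "find what changed between runs" idea concrete, here is a minimal sketch in plain Python. It does not use any tracking tool's API; the `changed_parameters` helper and the three example runs are made up for illustration, and it assumes each run's parameters are available as a flat dictionary.

```python
def changed_parameters(runs):
    """Return only the parameters whose values differ between runs.

    `runs` maps a run ID to a flat dict of parameters.
    """
    all_keys = set()
    for params in runs.values():
        all_keys.update(params)

    changed = {}
    for key in sorted(all_keys):
        # A parameter missing from a run shows up as None, so gaps are visible too.
        values = {run_id: params.get(key) for run_id, params in runs.items()}
        # repr() lets us compare unhashable values such as lists.
        if len({repr(value) for value in values.values()}) > 1:
            changed[key] = values
    return changed


# Hypothetical runs, made up for illustration.
runs = {
    "run-1": {"lr": 0.01, "batch_size": 64, "optimizer": "adam"},
    "run-2": {"lr": 0.001, "batch_size": 64, "optimizer": "adam"},
    "run-3": {"lr": 0.01, "batch_size": 128, "optimizer": "adam"},
}

for name, values in changed_parameters(runs).items():
    print(name, values)
```

The optimizer is identical in all three runs, so only `batch_size` and `lr` are reported. A real tool does this across hundreds of runs and many metadata types, but the principle is the same.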

When you have all the pieces in one place, you can gain new insights and ideas just by looking at all the metadata you logged. That is especially true when you are not working alone.

Speaking of which…

When you are part of a team, and many people are running experiments, having one source of truth for your entire team is really important.

[An experiment tracking system] makes it easy to share results with my teammates. I’m sending them a link and telling what to look at, or I’m building a view on the experiments dashboard. I don’t need to generate it by myself, and everyone in my team has access to it. Maciej Bartczak Research Lead at Banacha Street

Experiment tracking lets you organize and compare not only your past experiments but also see what everyone else was trying and how that worked out.

Sharing results becomes easier, too.

Modern experiment tracking tools let you share your work by sending a link to a particular experiment or dashboard view. You don’t have to send screenshots or “have a quick meeting” to explain what is going on in your experiment. It saves a ton of time and energy.

For example, here is a link to an experiment dashboard I created months ago. Pretty easy, right?

Apart from sharing things you see in a web UI, most experiment tracking setups let you access experiment metadata programmatically. This comes in handy when your experiments and models go from experimentation to production. For example, you can connect your experiment tracking tool to a CI/CD framework like GitHub Actions and integrate ML experimentation into your team’s workflow. A visual comparison between the models on branches `main` and `develop` (and a way to explore details) adds another sanity check before you update your production model.
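As a rough illustration of such a sanity check, the sketch below compares the metrics of two models and reports regressions. It is an assumption-laden toy: the `find_regressions` helper, the tolerance, and the metric values are all hypothetical, every metric is assumed to be "higher is better", and in a real pipeline the two dictionaries would be fetched from your tracking tool rather than hard-coded.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """List metrics where the candidate is worse than the baseline by more than `tolerance`.

    Assumes higher is better for every metric.
    """
    regressions = []
    for name, baseline_value in baseline.items():
        candidate_value = candidate.get(name)
        if candidate_value is None:
            regressions.append(f"{name}: missing from candidate run")
        elif candidate_value < baseline_value - tolerance:
            regressions.append(f"{name}: {baseline_value:.3f} -> {candidate_value:.3f}")
    return regressions


# Hypothetical metrics for the models on `main` and `develop`.
main_metrics = {"accuracy": 0.91, "f1": 0.88}
develop_metrics = {"accuracy": 0.92, "f1": 0.84}

problems = find_regressions(main_metrics, develop_metrics)
for problem in problems:
    print("REGRESSION", problem)
# A CI job could fail the build here, e.g. by calling sys.exit(1) when `problems` is non-empty.
```

Here accuracy improved, but F1 dropped by more than the tolerance, so the check flags it before the production model is updated.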

See your ML runs live: manage ML experiments from anywhere at any time

When you are training a model on your local computer, you can see what is going on whenever you like. But if your experiment is running on a remote server at work, university, or in the cloud, it may not be as easy to see what the learning curve looks like or discover that the training job crashed.

Experiment tracking systems solve this problem. While it’s a big security no-no to allow remote access to all of your data and servers, letting people see only their experiment’s metadata is usually fine.

When you can easily compare the currently running experiment to previous runs, you can decide whether it makes sense to continue. Why waste those precious GPU hours on something that is not converging? You will also quickly notice if your cloud training job has crashed, and you can close it (or fix the bug and re-run).
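One simple way to turn "is it worth continuing?" into a rule is to compare the live loss curve with the curve of a previous run. The sketch below is only one possible heuristic, not something a specific tool prescribes; the `should_stop` function, the 10% margin, and the patience of three epochs are illustrative assumptions.

```python
def should_stop(current_losses, reference_losses, margin=0.1, patience=3):
    """Return True if the current run trailed the reference run for `patience` epochs in a row.

    "Trailing" means the loss is more than `margin` (relative) above the
    reference loss at the same epoch.
    """
    epochs = min(len(current_losses), len(reference_losses))
    if epochs < patience:
        return False  # not enough data to decide yet
    recent = range(epochs - patience, epochs)
    return all(
        current_losses[i] > reference_losses[i] * (1 + margin) for i in recent
    )


# Hypothetical validation losses per epoch.
reference = [1.00, 0.70, 0.50, 0.40, 0.35]  # an earlier, successful run
stalled = [1.00, 0.90, 0.85, 0.84]          # current run that is not converging
healthy = [0.95, 0.72, 0.50, 0.41]          # current run that keeps up

print(should_stop(stalled, reference))
print(should_stop(healthy, reference))
```

The stalled run is flagged, while the healthy one is allowed to continue.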

Speaking of GPUs and failed jobs, some experiment tracking tools monitor training and log hardware consumption, helping you see whether you are using your resources efficiently.

For example, looking at GPU consumption over time can help you identify whether your data loaders are not working correctly or whether your multi-GPU setup is actually using just one GPU (which happened to me more times than I’d like to admit).

Without information I have in [Neptune’s] monitoring section I wouldn’t know that my experiments are running 10 times slower than they could. Michał Kardas ML Researcher at TensorCell

ML experiment tracking best practices

So far, we’ve covered what machine learning experiment tracking is and why it matters.

Now it’s time to get into details.

What you should keep track of in any ML experiment:

As I mentioned initially, what information you may want to track ultimately depends on the project’s characteristics.

But there are some things that you should keep track of regardless of the project you are working on. Those are:

Code: Preprocessing, training and evaluation scripts, notebooks for feature engineering, and other utilities. And, of course, all the code needed to run (and re-run) the experiment.
Environment: The easiest way to keep track of the environment is to save the environment configuration files like `Dockerfile` (Docker), `requirements.txt` (pip), `pyproject.toml` (e.g., hatch and poetry), or `conda.yml` (conda). You can also save built Docker images on Docker Hub or your own container repository, but I find saving configuration files easier.
Data: Saving data versions (as a hash or locations of immutable data resources) makes it easy to see what your model was trained on. You can also use modern data versioning tools like DVC (and save the .dvc files to your experiment tracking tool).
Parameters: Saving your experiment run’s configuration is crucial. Be especially careful when you pass parameters via the command line (e.g., through argparse, click, or hydra), as this is a place where you can easily forget to track important information (I have some horror stories to share). You may want to take a look at this article about various approaches to tracking hyperparameters.
Metrics: Logging evaluation metrics on train, validation, and test sets for every run is pretty obvious. But different frameworks do it differently, so you may want to check out this in-depth article on tracking ML model metrics.

Keeping track of those things will let you reproduce experiments, do basic debugging, and understand what happened at a high level.
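To show how little code the basics require, here is a minimal sketch that gathers those five pieces into a single dictionary using only the Python standard library. The function names, the argument names, the temporary CSV file, and the placeholder metric are all hypothetical; a tracking tool would collect much of this for you.

```python
import argparse
import hashlib
import json
import platform
import subprocess
import sys
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a data file so the exact version used for training can be identified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit():
    """Return the current commit hash, or None when Git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def collect_metadata(args, data_path, metrics):
    return {
        "code": {"git_commit": git_commit()},
        "environment": {"python": sys.version.split()[0], "platform": platform.platform()},
        "data": {"path": data_path, "sha256": file_sha256(data_path)},
        "parameters": vars(args),  # every command-line argument, including defaults
        "metrics": metrics,
    }


parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--batch-size", type=int, default=64)
args = parser.parse_args([])  # empty list: use the defaults; omit it to read sys.argv

# A tiny stand-in dataset so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(b"feature,label\n1,0\n")
    data_path = handle.name

metadata = collect_metadata(args, data_path, metrics={"val_accuracy": 0.5})  # placeholder metric
print(json.dumps(metadata, indent=2))
```

Using `vars(args)` captures every argument, including defaults you never typed on the command line, which is exactly the information that tends to get lost.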

That said, you can always log more things to gain even more insights. As long as you keep the data you track in a nice structure, it doesn’t hurt to collect information, even if you don’t know if it might be relevant later. After all, most metadata is just numbers and strings that don’t take up much space.

What else you could keep track of

Let’s look at some additional things you may want to keep track of when working on a specific type of project.

Below are some of my recommendations for various ML project types.

Machine Learning

Model weights
Evaluation charts (ROC curves, Confusion matrix)
Prediction distributions

Deep Learning

Model checkpoints (both during and after training)
Gradient norms (to control for vanishing or exploding gradient problems)
Best/worst predictions on the validation and test set after training
Hardware resources: handy for debugging data loaders and multi-GPU setups
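As an illustration of the gradient norm item above, the sketch below computes a global L2 norm and flags suspicious values. It is framework-free on purpose: gradients are assumed to be given as one flat list of floats per parameter tensor, and the thresholds in `diagnose` are arbitrary placeholders rather than recommended values.

```python
import math


def global_grad_norm(gradients):
    """L2 norm over all gradient values, given one flat list of floats per parameter tensor."""
    return math.sqrt(sum(value * value for grad in gradients for value in grad))


def diagnose(norm, low=1e-6, high=1e3):
    """Flag suspicious norms; the thresholds are placeholders, not recommendations."""
    if math.isnan(norm) or norm > high:
        return "exploding"
    if norm < low:
        return "vanishing"
    return "ok"


# Hypothetical gradients for two parameter tensors.
grads = [[3.0, 0.0], [0.0, 4.0]]
norm = global_grad_norm(grads)
print(norm, diagnose(norm))  # 5.0 ok
```

Logging this single number at every step is usually enough to spot when training starts to diverge.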

Computer Vision

Model predictions after every epoch (labels, overlaid masks or bounding boxes)

Natural Language Processing and Large Language Models

Inference time
Prompts (in the case of generative LLMs)
Specific evaluation metrics (e.g., ROUGE for text summarization or BLEU for translation between languages)
Embedding size and dimensions, type of tokenizer, and number of attention heads (when training transformer models from scratch)
Feature importance, attention-based, or example-based explanations (see this overview for specific algorithms and more ideas)

Structured Data

Input data snapshot ( `.head()` on DataFrames if you are using pandas)
Feature importance (e.g., permutation importance)
Prediction explanations like SHAP or partial dependence plots (they are all available in DALEX)

Reinforcement Learning

Episode return and episode length
Total environment steps, wall time, steps per second
Value and policy function losses
Aggregate statistics over multiple environments and/or runs

Hyperparameter optimization:

Run score: the metric you are optimizing after every iteration
Run parameters: parameter configuration tried at each iteration
Best parameters: best parameters so far and overall best parameters after all runs have concluded
Parameter comparison charts: there are various visualizations that you may want to log during or after training, like parallel coordinates plot or slice plot (they are all available in Optuna, by the way)
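The sketch below shows what recording the run score, run parameters, and best parameters can look like in a search loop. To stay self-contained it uses plain random search instead of a library such as Optuna, and the `objective` function is a toy stand-in for a real training run; all names and ranges are hypothetical.

```python
import random


def objective(params):
    """Toy stand-in for a training run: the score peaks at lr=0.1 and depth=5."""
    return -((params["lr"] - 0.1) ** 2) - 0.01 * (params["depth"] - 5) ** 2


def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    history = []  # run parameters and run score for every iteration
    best = None   # best trial seen so far
    for trial in range(n_trials):
        params = {"lr": rng.uniform(0.001, 0.5), "depth": rng.randint(1, 10)}
        score = objective(params)
        if best is None or score > best["score"]:
            best = {"trial": trial, "params": params, "score": score}
        history.append(
            {
                "trial": trial,
                "params": params,
                "score": score,
                "best_score_so_far": best["score"],
            }
        )
    return history, best


history, best = random_search(n_trials=20)
print("best trial:", best)
```

The `history` list is what you would send to your tracker after every iteration, and `best` is the summary you would log once the search has concluded.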

How to set up machine learning experiment tracking

OK, those are nice guidelines, but how do you actually implement experiment tracking in your machine learning project?

There are (at least) a few options. The most popular being:

Spreadsheets and naming conventions
Versioning everything in a Git repository
Using modern experiment tracking tools

Let’s talk about those now.

You can use spreadsheets and naming conventions (but please don’t)

A common approach to experiment tracking is to create one giant spreadsheet where you put all of the information you can (metrics, parameters, etc.) and a directory structure where things are named in a certain way. Those names usually end up being really long and intricate, like ‘model_v1_lr01_batchsize64_no_preprocessing_result_accuracy082.h5’.

Whenever you run an experiment, you look at the results and copy them to the spreadsheet.

What is wrong with that?

To be honest, in some situations, it can be just enough to solve your experiment tracking needs. It may not be the best solution, but it is quick and straightforward.

But things can fall apart really quickly. There are (at least) a few major reasons why tracking experiments in spreadsheets doesn’t work for most people:

You have to remember to track them. Things get messy if something doesn’t happen automatically, especially with more people involved.
You have to ensure that you or your team will not accidentally overwrite things in the spreadsheet. Spreadsheets are not easy to version, so if this happens, you are in trouble.
You have to remember to use the naming conventions. If someone on your team messes this up, tracking down the experiment artifacts (model weights, performance charts) for the experiments is painful.
You have to independently back up your artifact directories and keep them in sync with the spreadsheet. Even if you set up an automatic workflow that gets triggered regularly, there will inevitably come a time when it breaks.
When your spreadsheet grows, it becomes less and less usable. Searching for things and comparing hundreds of experiments in a spreadsheet (especially if you have multiple people who want to use it simultaneously) is not a great experience.

You can version ML experiment metadata files on GitHub

Another option is to version all of your experiment metadata in a GitHub repository.

When running your experiment, you can commit metrics, parameters, charts, and whatever you want to keep track of to a repository. You can set up post-commit hooks that create or update files (configs, charts, etc.) automatically after your experiment finishes.

It can work in some setups, but:

Git wasn’t built for comparing machine learning artifacts and experiment metadata. It’s built for versioning and storing text files. Neither binary artifacts like image files nor structured, relational data are handled well.
You cannot compare more than two experiments at a time. Like most version control systems for code, Git was designed for comparing two commits. If you want to compare metrics and learning curves of multiple experiments, you are out of luck.
Organizing many experiments is difficult (if not outright impossible). You can have branches where you try out new ideas or a separate branch for each experiment. But the more experiments you run, the less usable it becomes. (And you’ll have to make sure everyone follows whatever branching convention you come up with.)
You will not be able to monitor your experiments live. You can only save information after your experiment finishes.

Maybe you could build your own ML experiment tracker?

If a spreadsheet relies too much on discipline and will quickly grow to an unmanageable size, and a Git repository is just not the right kind of data store, how about spinning up a database and writing a slim Python client?

It’s certainly not the worst idea, and many experiment-tracking and machine-learning management solutions – including our own – started this way.

At a minimum, you’ll need the following components:

A database to keep your metadata. A natural choice is a schema-free database like MongoDB or CouchDB that allows you to store and query arbitrary JSON documents.
A place to store artifacts like model snapshots or plots. A blob storage bucket, a network drive, or a good old FTP server will probably do.
A client to integrate into your experiment code. A few lines of Python that push metadata and files to your central repositories will suffice initially.
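The three components above can be sketched in a few dozen lines. To keep the example runnable without a server, it swaps the document database for SQLite with JSON stored as text, and the blob store for a local directory; the `TinyTracker` class and its methods are hypothetical names made up for this illustration.

```python
import json
import shutil
import sqlite3
import tempfile
import uuid
from pathlib import Path


class TinyTracker:
    """Toy experiment tracker: metadata in SQLite, artifacts in a local directory."""

    def __init__(self, root):
        self.root = Path(root)
        (self.root / "artifacts").mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(str(self.root / "runs.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, metadata TEXT)"
        )

    def log_run(self, metadata, artifacts=()):
        """Store a JSON-serializable metadata dict and copy artifact files; return the run ID."""
        run_id = uuid.uuid4().hex
        target = self.root / "artifacts" / run_id
        target.mkdir()
        for path in artifacts:
            shutil.copy(path, target)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO runs VALUES (?, ?)", (run_id, json.dumps(metadata))
            )
        return run_id

    def get_run(self, run_id):
        """Return the stored metadata dict, or None if the run ID is unknown."""
        row = self.db.execute(
            "SELECT metadata FROM runs WHERE run_id = ?", (run_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


# Hypothetical usage with a throwaway directory.
tracker = TinyTracker(tempfile.mkdtemp())
run_id = tracker.log_run({"params": {"lr": 0.01}, "metrics": {"accuracy": 0.82}})
print(tracker.get_run(run_id))
```

This covers storing and retrieving a run, and nothing else: there is no querying across runs, no comparison, no dashboard, and no live updates.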

But then things start to get complicated rather quickly. How will you retrieve and analyze the metadata? Is your team content with pulling data into notebooks and generating their plots themselves? Or do you need to set up a dashboard and develop a web frontend? What about live tracking?

I certainly share the enthusiasm for conceptualizing and creating ML experiment-tracking tools – after all, it’s my job these days – but I doubt the effort is worth it for most teams. As we’ll see in the next sections, plenty of excellent tools are available.

You can use a modern experiment tracking tool

Instead of trying to adjust generic tools to work for machine learning experiments or developing your own platform, you could just use one of the solutions built specifically for tracking, organizing, and comparing experiments.

Within the first few tens of runs, I realized how complete the tracking was – not just one or two numbers, but also the exact state of the code, the best-quality model snapshot stored to the cloud, the ability to quickly add notes on a particular experiment. My old methods were such a mess by comparison. Edward Dixon Data Scientist at Intel

They have slightly different interfaces, but they usually work in a similar way:

Step 1

Connect to the tool by adding a snippet to your training code.

For example:

import neptune

run = neptune.init_run() # initialize a new run

Step 2

Specify what you want to log (or use an ML framework integration that does it for you):

from neptune.types import File

# log a single value
run["accuracy"] = evaluate_accuracy(model, test_data)

# log a series of images
for prediction_image in worst_predictions:
    run["worst predictions"].append(
        File.as_image(prediction_image)
    )

Step 3

Run your experiment as you normally would:

python train.py

And that’s it!

Your experiment is logged to a central experiment database and displayed in a dashboard, where you can search, compare, and drill down to whatever information you need.

Today, there are several tools for machine learning experiment tracking optimized for different contexts, and I would strongly recommend using one. They are designed to treat machine learning experiments as first-class citizens, and they will always

be easier to use for a machine learning person than general tools
have built-in features to analyze and compare experiments

One important decision to make is whether you want to use a software-as-a-service offering or host an open source tool yourself.

Most open source experiment tracking tools provide the interfaces you need to create plugins and integrations. This might be an essential selection criterion if you’re working with somewhat esoteric data storage systems or compute infrastructure.
You’re not tied to any vendor or cloud provider. If you want, you can take your machine learning experiment tracker and move to a different cloud provider. There is no need to try to migrate data between incompatible platforms. If an open source project is no longer maintained, you can keep developing it – or at least keep the lights on as long as you need to prepare your migration on your own schedule.
Your data and artifacts never have to leave your premises. For the vast majority of businesses and even government agencies, the protections that contracts and legal agreements provide are sufficient to allow them to store their data on third-party clouds. But if your data must under no circumstances leave your building, self-hosting is your only option.

But let’s face it: Everyone who’s ever self-hosted tools knows how long it takes to get things working right and has experienced how simple day-to-day system maintenance can become a bottomless time sink. And let’s not forget the hassle of keeping up with breaking changes and security fixes.

It’s no surprise that many data science teams are looking for a fully managed experiment tracking platform. Key benefits of paying someone else to run an experiment tracker on your behalf include:

You don’t need to worry about infrastructure, scaling, and updates. Someone else takes care of the burdensome maintenance work, and you’ll never lose valuable data because your server runs out of storage space mid-experiment.
Vendors have a lot more experience with machine learning experiment tracking than any single machine learning team could ever accumulate. Here at Neptune, we’ve worked with hundreds of customers, constantly learning about new edge cases and continuously discovering new ways to optimize experiment tracking.
Data scientists can focus on creating and optimizing machine learning models. When you adopt a managed experiment tracking platform, you’re not only leaving the software engineering and maintenance to the specialists but also getting access to dedicated support.

Next steps

Machine learning experiment tracking is, first and foremost, a practice, not just a tool or a logging method. It will take some time to really understand and implement:

what to keep track of for your project,
how to use that information to improve future experiments,
how to improve your team’s unique workflow with it,
and when to even use experiment tracking.

Hopefully, after reading this article, you have a good idea of how to start tracking and how it can improve your (or your team’s) machine learning workflow.

Best Tools to Manage Machine Learning Projects

Jakub Czakon — Thu, 21 Jul 2022 10:13:49 +0000

Managing machine learning projects is not exactly a piece of cake, but every data scientist already knows that.

It touches many things:

Data exploration
Data preparation and setting up machine learning pipelines
Experimentation and model tuning
Data, pipeline, and model versioning
Managing infrastructure and run orchestration
Model serving and productionalization
Model maintenance (retraining and monitoring models in production)
Effective collaboration between different roles (data scientist, data engineer, software developer, DevOps, team manager, business stakeholder)

So how do you make sure everything runs smoothly? How do you administer all the parts to create a coherent workflow?

…by using the right tool that will help you manage machine learning projects. But you already know that 😉 There are many apps that can help you improve parts of a workflow or even the entire thing. Sure, it’s really cool to do everything yourself, but why not use tools if they can save you lots of trouble?

Let’s get right to it and see what’s on the plate! Below is a list of tools that touch various points listed above. Some are more end-to-end, some focus on a particular stage of the machine learning lifecycle, but all of them will help you manage your machine learning projects. Check out our list and choose the one(s) you like most.

1. Neptune

Neptune is a metadata store for MLOps built for research and production teams that run a lot of experiments. It is very flexible, works with many other frameworks, and thanks to its stable user interface, it enables great scalability (to millions of runs).

It’s robust software that can store, retrieve, and analyze a large amount of data. Neptune has all the tools for efficient team collaboration and project supervision.

Neptune – summary:

Provides user and organization management with different organizations, projects, and user roles
Fast and beautiful UI with a lot of capabilities to organize runs in groups, save custom dashboard views and share them with the team
You can use a hosted app to avoid all the hassle with maintaining yet another tool (or have it deployed on your on-prem infrastructure)
Your team can track experiments which are executed in scripts (Python, R, other), notebooks (local, Google Colab, AWS SageMaker) and do that on any infrastructure (cloud, laptop, cluster)
Extensive experiment tracking and visualization capabilities (resource consumption, scrolling through lists of images)

2. Kubeflow

Kubeflow is the ML toolkit for Kubernetes. It helps in maintaining machine learning systems by packaging and managing Docker containers. It facilitates the scaling of machine learning models by making run orchestration and deployments of machine learning workflows easier.

It’s an open-source project that contains a curated set of compatible tools and frameworks specific for various ML tasks.

Kubeflow – summary:

A user interface (UI) for managing and tracking experiments, jobs, and runs
Notebooks for interacting with the system using the SDK
Re-use components and pipelines to quickly create end-to-end solutions without having to rebuild each time
Kubeflow Pipelines is available as a core component of Kubeflow or as a standalone installation

3. DVC

DVC is an open-source version control system for machine learning projects. It’s a tool that lets you define your pipeline regardless of the language you use.

DVC – summary:

Possibility to use different types of storage (it’s storage agnostic)
Full code and data provenance help to track the complete evolution of every ML model
Reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment
Tracking metrics
A built-in way to connect ML steps into a DAG and run the full pipeline end-to-end

4. Polyaxon

Polyaxon is a platform for reproducing and managing the whole life cycle of machine learning projects as well as deep learning applications.

The tool can be deployed into any data center or cloud provider, or it can be hosted and managed by Polyaxon. It supports all the major deep learning frameworks, e.g., Torch, TensorFlow, MXNet.

Polyaxon – summary:

Supports the entire lifecycle including run orchestration but can do way more than that
Has an open-source version that you can use right away but also provides options for enterprise
Very well documented platform, with technical reference docs, getting started guides, learning resources, tutorials, changelogs, and more
Lets you monitor, track, and analyze every single optimization experiment with the experiment insights dashboard

5. GitHub

GitHub is the most popular platform built for developers. It’s used by millions of teams around the globe as it allows for easy and painless collaboration. With GitHub, you can host and review code, manage projects, and build software.

It’s a great platform for teams collaborating on machine learning projects who want to simplify workflow and share ideas conveniently. GitHub lets teams manage ideas, coordinate work, and stay aligned with the entire team to seamlessly collaborate on machine learning projects.

GitHub – summary:

Build, test, deploy, and run CI/CD the way you want in the same place you manage code
Use Actions to automatically publish new package versions to GitHub Packages. Install packages and images hosted on GitHub Packages or your preferred registry of record in your CI/CD workflows
The software lets you secure your work with vulnerability alerts so you can remediate risks and learn how CVEs affect you
The built-in review tools make it easy and convenient to review code – a team can propose changes, compare versions, and give feedback
GitHub easily integrates with other tools for smooth work, or you can create your own tools with the GitHub GraphQL API
GitHub is a platform where all the documentation is easily accessible, and all the features make it a unified system for flexibly developing software.

6. Jira and Confluence

Jira is great software for agile teams, as it covers project management end to end. It’s an issue and project tracking tool, so teams can plan, track, and release their product or software as a perfectly developed ‘organism’. With Confluence, teams have even more flexibility to manage ML projects.

The two tools allow for flexible workflow automation. You can freely manage a project by assigning tasks to people and bugs to programmers, creating milestones, or planning to carry out certain tasks within a specific timeframe.

Products and apps built on top of Jira combined with Confluence help teams plan, assign, track, report, and manage work. All updates from Jira will automatically appear in Confluence since the two tools are linked together.

7. Notion

Notion is a collaboration tool that lets you write, plan, and organize teamwork.

It has four modules, each with different functionalities:

Notes, Docs – text editor which serves as a space for files, notes of different formats; you can add images, bookmarks, videos, code, and many more
Knowledge Base – in this module, teams can store knowledge about projects, tools, best practices, and other aspects that are necessary for developing machine learning projects
Tasks, Projects – tasks and projects can be organized in a Kanban board, calendar, and list views
Databases – this module can effectively replace spreadsheets and keep records of important data and unique workflows in a convenient way

Additionally, every team member can use Notion for personal use to keep a record of work-related activities and information, for example, weekly agenda, goal, task list, or personal notes.

Other smaller features include Markdown support, /slash commands, drag-and-drop, comments and discussions, and integrations with 50+ popular apps such as Google Docs, GitHub Gist, CodePen, and more.

All modules create a coherent system that serves as a unified hub for work management and project planning.

8. WandB (Weights & Biases)

Weights & Biases, a.k.a. WandB, is focused on deep learning. Users log experiments to the application with a Python library and, as a team, can see each other’s experiments.

WandB is a hosted service allowing you to back up all experiments in a single place and work on a project with your team. WandB lets you log many data types and analyze them in a nice UI.

Weights & Biases – summary:

Experiments tracking: extensive logging options
Multiple features for sharing work in a team
Several open source integrations with other tools available
SaaS/Local instance available
WandB logs the model graph, so you can inspect it later

9. Streamlit

Streamlit is an open-source Python library that enables you to build fancy custom web apps for machine learning and data science. It is perfect when you need to build a quick proof-of-concept app and show it to someone, especially when that someone is a bit less technical.

In Streamlit you can automatically update your app every time you change its source code. This allows you to work in a fast interactive loop:

You type code, save it, try it out live
Then type some more code, save it, try it out again
And so on.

Streamlit’s architecture allows you to write apps the same way you write plain Python scripts.

You can easily share your machine learning models with other people and effectively work in a team.

See how we built a Streamlit app for exploring results of image segmentation and object detection models trained on COCO: How to Do Data Exploration for Image Segmentation and Object Detection (Things I Had to Learn the Hard Way)

10. Amazon SageMaker

Amazon SageMaker is a platform that enables data scientists to build, train, and deploy machine learning models. It has all the integrated tools for the entire machine learning workflow providing all of the components used for machine learning in a single toolset.

SageMaker is suitable for organizing, training, deploying, and managing machine learning models. It has a single, web-based visual interface to perform all ML development steps: notebooks, experiment management, automatic model creation, debugging, and model drift detection.

Amazon SageMaker – summary:

Autopilot automatically inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance – it helps to deploy the best performing model
SageMaker Ground Truth helps you build and manage highly accurate training datasets quickly
SageMaker Experiments helps to organize and track iterations of machine learning models by automatically capturing the input parameters, configurations, and results, and storing them as ‘experiments’
SageMaker Debugger automatically captures real-time metrics during training (such as training and validation metrics, confusion matrices, and learning gradients) to help improve model accuracy. The Debugger can also generate warnings and remediation advice when common training problems are detected
SageMaker Model Monitor allows developers to detect and troubleshoot concept drift. It automatically detects drift in deployed models and gives detailed alerts that help identify the source of the problem

11. Domino Data Lab

Domino Data Lab is a great tool to manage machine learning projects for teams who need a centralized hub to store all their data.

Domino is a data science platform that enables fast, reproducible, and collaborative work on data products like models, dashboards, and data pipelines. You can run regular jobs, launch interactive notebook sessions, view vital metrics, share work with the teammates, and communicate with them directly in the Domino web app.

It’s an advanced management platform for all kinds of machine learning projects, especially helpful for growing organizations that need to share work, and review code fast and effectively.

12. Cortex

Cortex is an open-source alternative to serving models with SageMaker or building your own model deployment platforms on top of AWS services like Elastic Kubernetes Service (EKS), Lambda, or Fargate and open source projects like Docker, Kubernetes, TensorFlow Serving, and TorchServe.

It’s a multi-framework tool that lets you deploy all types of models.

Cortex – summary:

Automatically scale APIs to handle production workloads
Run inference on any AWS instance type
Deploy multiple models in a single API and update deployed APIs without downtime
Monitor API performance and prediction results

Conclusion

There are many great tools to choose from. Make sure to look for integrations and features that suit your needs to get the most out of your work.

Enjoy managing your machine learning projects!

A Complete Guide to Monitoring ML Experiments Live in Neptune

Jakub Czakon — Thu, 21 Jul 2022 10:08:38 +0000

Training machine learning or deep learning models can take a really long time.

If you are like me, you like to know what’s happening during that time and you’re probably interested in:

monitoring your training and validation losses,
looking at the GPU consumption,
seeing image predictions after every other epoch
and a bunch of other things.

Neptune lets you do all that, and in this post, I will show you how to make it happen. Step by step.

Check out this example run to see what this can look like in the Neptune app.

If you want to try Neptune monitoring without registration, check this Quickstart tutorial.

Set up your Neptune account

Setting up a project and connecting your scripts to Neptune is super easy, but you still need to do it 🙂

Let’s take care of that quickly.

1. Create a project

Let’s create a project first.

To do that:

go to the Neptune app,
click the New project button on the left,
give it a name,
decide whether you want it to be public or private,
done.

2. Get your API token

You will need a Neptune API token (your personal key) to connect the scripts you run with Neptune.

To do that:

click on your user logo on the right
click on Get Your API token
copy your API token
paste it to the environment variable, config file, or directly to your script if you feel really adventurous 🙂

A token is like a password, so I try to keep it safe.

Since I am a Linux guy I put it in my environment file ~/.bashrc. If you are using a different system, check the API token section in the documentation.

With that, whenever you run your training scripts, Neptune will know who you are and log things appropriately.
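
As a minimal sketch of the environment-variable approach, the helper below reads the token from NEPTUNE_API_TOKEN (the variable name the Neptune client looks for) and fails with a clear message when it is missing. The function name get_api_token is made up for this post; it is not part of the Neptune client.

```python
import os


def get_api_token(env_var="NEPTUNE_API_TOKEN"):
    """Return the API token stored in an environment variable.

    get_api_token is a hypothetical helper for this post, not a Neptune function.
    """
    # Treat an unset variable and an empty/whitespace-only one the same way.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Add `export {env_var}=...` to ~/.bashrc "
            "and open a new terminal."
        )
    return token
```

With the variable exported, you can also leave out the api_token argument in init_run(), because the client reads NEPTUNE_API_TOKEN from the environment on its own.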

3. Install client library

To work with Neptune, you need a client library that deals with logging everything you care about.

Since I am using Python, I will use the Python client, but you can use Neptune with R language as well.

You can install it with pip:

pip install neptune

4. Initialize Neptune

Now that you have everything set up, you can start monitoring things!

First, connect your script to Neptune by adding the following towards the top of your script:

import neptune

run = neptune.init_run(
    project="workspace-name/project-name",
    api_token="Your Neptune API token",
)

5. Create a run

Use the init_run() method to create a new run. We started a run when we executed neptune.init_run() above.

The started run then tracks some system metrics in the background, plus whatever metadata you log in your code. By default, Neptune periodically synchronizes the data with the servers in the background. Check what exactly Neptune logs automatically.

The connection to Neptune remains open until the run is stopped or the script finishes executing. You can explicitly stop the run by calling run.stop().

But what’s a run?

A ‘run’ is a namespace inside a project where you can log model-building metadata.

Typically, you create a run every time you execute a script that does model training, re-training, or inference. Runs can be viewed as dictionary-like structures that you define in your code.

They have:

Fields, where you can log your ML metadata
Namespaces, which organize your fields

Whatever hierarchical metadata structure you create, Neptune reflects it in the UI.

To create a structured namespace, use a forward slash / like this:

run["metrics/f1_score"] = 0.67
run["metrics/test/roc"] = 0.82

The snippet above:

Creates two namespaces: metrics and metrics/test.
Assigns values to fields f1_score and roc.
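
One way to picture this structure is as nested dictionaries. The sketch below is plain Python and doesn't call Neptune at all; to_nested is a hypothetical helper that only illustrates how slash-separated paths map to namespaces and fields, not how Neptune stores anything internally.

```python
def to_nested(fields):
    """Turn {"metrics/test/roc": 0.82} style paths into nested dictionaries."""
    tree = {}
    for path, value in fields.items():
        # Everything before the last slash is a namespace, the rest is the field.
        *namespaces, field = path.split("/")
        node = tree
        for name in namespaces:
            node = node.setdefault(name, {})
        node[field] = value
    return tree


fields = {"metrics/f1_score": 0.67, "metrics/test/roc": 0.82}
print(to_nested(fields))
# {'metrics': {'f1_score': 0.67, 'test': {'roc': 0.82}}}
```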

For the full list of run arguments, you can refer to Neptune’s API documentation.

Monitoring experiments in Neptune: methods

Logging basic stuff

In a nutshell, logging to Neptune is as simple as:

run["WHAT_YOU_WANT_TO_LOG"] = ITS_VALUE

Let’s take a look at some different ways in which you can log important things to Neptune.

You can log:

Metrics and losses -> run["accuracy"]=0.90
Images and charts -> run["images"].upload("bboxes.png")
Artifacts like model files -> run["model_checkpoints"].upload("my_model.pt")
And many other things.

Sometimes you may just want to log something once before or after the training is done.

In that case, just do:

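params = {
    "activation": "sigmoid",
    "dropout": 0.25,
    "learning_rate": 0.1,
    "n_epochs": 100,
}
# Assign the dictionary once; each key becomes a field under "parameters"
run["parameters"] = params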

In other scenarios, there is a training loop inside which you might want to log a series of values. For this, we use the .append() function.

for epoch in range(params["n_epochs"]):
    # this would normally be your training loop
    run["train/loss"].append(0.99*epoch)
    run["train/acc"].append(1.01*epoch)
    run["eval/loss"].append(0.98*epoch)
    run["eval/acc"].append(1.02*epoch)

This creates the namespaces “train” and “eval”, each with a loss and acc field.

You can see these visualized as charts in the app later.

Logging with integrations

To make logging easier, we created integrations for most of the Python ML libraries, including PyTorch, TensorFlow, Keras, scikit-learn, and more. You can see all the Neptune integrations here. These integrations give you out-of-the-box utilities that log most of the ML metadata you would normally log in those ML libraries. Let’s check a few examples.

Monitor TensorFlow/Keras models

The Neptune–Keras integration logs the following metadata automatically:

Model summary
Parameters of the optimizer used for training the model
Parameters passed to model.fit during the training
Current learning rate at every epoch
Hardware consumption and stdout/stderr output during training
Training code and Git information

To log metadata as you train your model with Keras, you can use NeptuneCallback in the following manner.

import neptune
from neptune.integrations.tensorflow_keras import NeptuneCallback

run = neptune.init_run()
neptune_cbk = NeptuneCallback(run=run)

model.fit(
    x_train,
    y_train,
    epochs=5,
    batch_size=64,
    callbacks=[neptune_cbk],
)

Your training metrics will be logged to Neptune automatically:

Check the docs to learn more about what you can do with Neptune-Keras integration.

Monitor time series Prophet models

Prophet is a popular time-series forecasting library. With the Neptune–Prophet integration, you can keep track of parameters, forecast data frames, residual diagnostic charts, cross-validation folds, and other metadata while training models with Prophet.

Here’s an example of how to log relevant metadata regarding your Prophet model all at once.

import pandas as pd
from prophet import Prophet
import neptune
import neptune.integrations.prophet as npt_utils

run = neptune.init_run()

dataset = pd.read_csv(
    "https://raw.githubusercontent.com/facebook/prophet/main/examples/example_wp_log_peyton_manning.csv"
)
model = Prophet()
model.fit(dataset)

run["prophet_summary"] = npt_utils.create_summary(
    model, dataset, log_interactive=True
)

See this example in the Neptune app

Check the docs to know more about Neptune-Prophet integration.

Monitor Optuna hyperparameter optimization

The hyperparameter tuning framework Optuna also has a callback system that Neptune plugs into nicely. All the results are logged and updated after every parameter search iteration.

import neptune
import optuna
import neptune.integrations.optuna as optuna_utils

run = neptune.init_run()
neptune_callback = optuna_utils.NeptuneCallback(run)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20, callbacks=[neptune_callback])

See this example in the Neptune app

Visit the docs to learn more about the Neptune-Optuna integration.

Most ML frameworks have some callback system in place. They vary slightly, but the idea is the same. You can take a look at the entire list of tools that Neptune supports. In case you are unable to find your framework in this list, you can always resort to the good old way of logging via Neptune Client, as discussed above already.
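
To make the callback idea concrete, here is a minimal, framework-free sketch. Everything in it is made up for illustration: InMemoryRun stands in for a real run (it only mimics the run["path"].append(value) pattern), and LoggingCallback, on_epoch_end, and fit are hypothetical names rather than the API of Neptune or any particular framework.

```python
from collections import defaultdict


class InMemoryRun:
    """Stand-in for a run: run["train/loss"].append(x) stores x in a list."""

    def __init__(self):
        self._series = defaultdict(list)

    def __getitem__(self, path):
        return self._series[path]


class LoggingCallback:
    """Hypothetical callback that logs whatever metrics the loop reports."""

    def __init__(self, run, base_namespace="training"):
        self.run = run
        self.base_namespace = base_namespace

    def on_epoch_end(self, epoch, logs):
        for name, value in logs.items():
            self.run[f"{self.base_namespace}/{name}"].append(value)


def fit(n_epochs, callbacks):
    """Toy training loop that calls every callback after each epoch."""
    for epoch in range(n_epochs):
        # Placeholder numbers; a real loop would compute these from the model.
        logs = {"loss": 1.0 / (epoch + 1), "accuracy": epoch / n_epochs}
        for callback in callbacks:
            callback.on_epoch_end(epoch, logs)


run = InMemoryRun()
fit(n_epochs=3, callbacks=[LoggingCallback(run)])
print(run["training/loss"])
```

If your framework exposes a similar hook, the same pattern should carry over: swap the stand-in for a real run object and keep the append calls inside the hook.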

What can you monitor in Neptune?

There are a ton of different things that you can log to Neptune and monitor live.

Metrics and learning curves, hardware consumption, model predictions, ROC curves, console logs, and more can be logged for every experiment and explored live.

Let’s go over a few of them, one by one.

Monitor ML metrics and losses

You can log scores and metrics as single values, with = assignment, or as series of values, with the append() method.

# Log scores (single value)
run["score"] = 0.97
run["test/acc"] = 0.97

# Log metrics (series of values)
for epoch in range(100):
    # your training loop
    acc = ...
    loss = ...
    metric = ...

    run["train/accuracy"].append(acc)
    run["train/loss"].append(loss)
    run["metric"].append(metric)

See this example in the Neptune app

Monitor hardware resources and console logs

These are actually logged to Neptune automatically:

run = neptune.init_run(capture_hardware_metrics=True)

Just go to the Monitoring section to see it:

See this example in the app

Monitor image predictions

You can log either a single image or a series of images (example below).

from neptune.types import File

for name in misclassified_images_names:
    y_pred = ...
    y_true = ...
    run["misclassified_imgs"].append(File("misclassified_image.png"))

They will be visible in the image gallery in the app:

See this example in the app

Monitor file updates

You can save model weights from any deep learning framework by using the upload() method. In the example below, they’re logged under a field called my_model in the namespace model_checkpoints.

# Log PyTorch model weights
import torch

my_model = ...
torch.save(my_model, "my_model.pt")
# Upload the file from the same path it was saved to
run["model_checkpoints/my_model"].upload("my_model.pt")

Model checkpoints appear in the All metadata section.

See this example in the Neptune app

Compare running experiments with previous ones

The cool thing about monitoring ML experiments in Neptune is that you can compare running experiments with your previous ones.

It makes it easy to decide whether the model that you are training is showing promise of improvement. If it isn’t, you can even abort the experiment from the UI.

To do that:

go to the experiment dashboard
select a few experiments
click compare to overlay learning curves and show diffs in parameters and metrics
click abort on the running ones if you no longer see the point in training

Apart from comparing experiments using charts, you can also compare them in the side-by-side table format view or as parallel coordinates. And if you log any images, it’s also possible to compare them. See the docs about comparison options.
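
If you want a feel for what a parameter diff boils down to, here is a rough sketch in plain Python. It only illustrates the idea of comparing two parameter dictionaries; diff_params and the example values are hypothetical, and this is not how Neptune implements its comparison views.

```python
def diff_params(params_a, params_b):
    """Return {name: (value_a, value_b)} for parameters that differ."""
    changed = {}
    for name in sorted(params_a.keys() | params_b.keys()):
        # A parameter missing from one run shows up as None on that side.
        a, b = params_a.get(name), params_b.get(name)
        if a != b:
            changed[name] = (a, b)
    return changed


baseline = {"activation": "sigmoid", "dropout": 0.25, "learning_rate": 0.1}
candidate = {"activation": "relu", "dropout": 0.25, "learning_rate": 0.01}
print(diff_params(baseline, candidate))
# {'activation': ('sigmoid', 'relu'), 'learning_rate': (0.1, 0.01)}
```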

Finally, you can share your running experiments by copying the link to the experiment and sending it to someone.

Just like I am sharing this experiment with you here:

https://ui.neptune.ai/o/shared/org/step-by-step-monitoring-experiments-live/e/STEP-22

The cool thing is you can send people directly to a part of your experiment that you want to show them, like code, hardware consumption charts, or learning curves. You can share the experiment comparisons with links as well.

Final thoughts

With all this information, you should be able to monitor every piece of the machine learning experiment that you care about.

For even more info, you can:

See how the monitoring works in this Google Colab notebook that comes with snippets for logging all sorts of things to Neptune
Check out this example run to see what this can look like
Read the updated list of things that you can log
Check out the full list of our integrations with ML frameworks
Talk to us on Intercom (that blue thing in the corner).

Happy experiment monitoring!

Top Machine Learning Influencers – All the Names You Need to Know

Jakub Czakon — Thu, 21 Jul 2022 09:44:16 +0000

Machine learning wouldn’t be possible without all the great minds. If it wasn’t for them, AI wouldn’t be so advanced and our lives would look completely different.

Following the great minds of machine learning can help you discover new things and deepen your knowledge. It’s fascinating to learn from the best scientists. Among them, you will find influencers, teachers, business leaders, and many more. Undeniably their expertise can help change the world and make it a better place.

On this list, you will find not only influencers but also renowned personalities from the world of Data Science. Take a look at all the names you should know as a machine learning researcher. Learn and get inspired to discover new things!

1. Vladimir Vapnik

Vladimir Naumovich Vapnik is one of the main developers of the Vapnik–Chervonenkis theory of statistical learning and the co-inventor of the support-vector machine method and the support-vector clustering algorithm.

Professor Vapnik gained his master’s degree in mathematics in 1958 at Uzbek State University, Samarkand, USSR. From 1961 to 1990 he worked at the Institute of Control Sciences, Moscow, where he became Head of the Computer Science Research Department. He then joined AT&T Bell Laboratories in Holmdel, NJ, and in 1995 was appointed Professor of Computer Science and Statistics at Royal Holloway. Vladimir Vapnik is a renowned scientist in the field of machine learning.

Profile on Columbia University website || Wikipedia

⇒ Make sure to listen to Lex Fridman’s interview with Vladimir Vapnik on AI Podcast

⇒ Also, read about Alexey Chervonenkis

2. Andrej Karpathy

The Sr. Director of AI at Tesla, where he leads the team responsible for all neural networks on the Autopilot. Previously, he was a Research Scientist at OpenAI working on Deep Learning in Computer Vision, Generative Modeling and Reinforcement Learning. Andrej Karpathy received his PhD from Stanford, where he worked with Fei-Fei Li on Convolutional/Recurrent Neural Network architectures and their applications in Computer Vision, Natural Language Processing and their intersection.

Academic Website || LinkedIn || Twitter || GitHub

3. Gregory Piatetsky-Shapiro

Gregory Piatetsky-Shapiro, Ph.D., is the president of KDnuggets. He is a well-known expert in Business Analytics, Data Mining, and Data Science and a top influencer in the field. He was no. 1 on LinkedIn Top Voices in 2018 on Data Science and Analytics. Gregory is a co-founder of KDD (Knowledge Discovery and Data mining conferences) and co-founder and past chair of SIGKDD, a professional organization for Knowledge Discovery and Data Mining. Gregory has over 60 publications and has edited several books and collections on data mining and knowledge discovery.

⇒ Check out Gregory Piatetsky-Shapiro’s Selected Publications on KDnuggets

Twitter || LinkedIn || KDnuggets

4. Allie K. Miller

Allie Miller is the US Head of AI Business Development for Startups and Venture Capital at Amazon, advancing the greatest AI companies in the world.

Previously, Allie was the youngest-ever woman to build an artificial intelligence product at IBM—spearheading large-scale product development across computer vision, conversation, data, and regulation.

Outside of work, Allie is changing the game of AI. Allie has spoken about AI and field diversity around the world, addressed the European Commission, drafted foreign AI strategies, and created eight guidebooks to educate businesses on how to build successful AI projects.

LinkedIn || Twitter || Instagram || Website

5. Yann LeCun

Yann LeCun is VP & Chief AI Scientist at Facebook and Silver Professor at NYU affiliated with the Courant Institute of Mathematical Sciences & the Center for Data Science.

He was the founding Director of Facebook AI Research and of the NYU Center for Data Science. He received an Engineering Diploma from ESIEE (Paris) and a PhD from Sorbonne Université. After a postdoc in Toronto he joined AT&T Bell Labs in 1988, and AT&T Labs in 1996 as Head of Image Processing Research. He joined NYU as a professor in 2003 and Facebook in 2013.

His interests include AI, machine learning, computer perception, robotics, and computational neuroscience.

He is the recipient of the 2018 ACM Turing Award (with Geoffrey Hinton and Yoshua Bengio) for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”, a member of the National Academy of Engineering and a Chevalier de la Légion d’Honneur.

⇒ To learn about Yann LeCun’s interesting work, listen to Lex Fridman’s interview with Yann LeCun on AI Podcast

LinkedIn || Twitter || Website || Quora

6. Fei-Fei Li

A computer scientist, non-profit executive, and writer. She is a professor at Stanford University and the co-director of Stanford’s Human-Centered AI Institute and the Stanford Vision and Learning Lab.

She served as the director of the Stanford Artificial Intelligence Laboratory (SAIL) from 2013 to 2018. In 2017, she co-founded AI4ALL, a nonprofit organization working to increase diversity and inclusion in the field of artificial intelligence.

Her research expertise includes artificial intelligence (AI), machine learning, deep learning, computer vision, and cognitive neuroscience.

She was the leading scientist and principal investigator of ImageNet.

She has been described as an “AI pioneer” and a “researcher bringing humanity to AI”.

Fei-Fei Li has been elected to the National Academy of Engineering which is among the highest professional distinctions for engineers.

Twitter || LinkedIn || Academic Profile

7. Jürgen Schmidhuber

A computer scientist most noted for his work in the field of artificial intelligence, deep learning and artificial neural networks. He is a co-director of the Dalle Molle Institute for Artificial Intelligence Research in Manno, in the district of Lugano, in Ticino in southern Switzerland.

He is sometimes called the “father of (modern) AI” or, one time, the “father of deep learning.”

Schmidhuber did his undergraduate studies at the Technische Universität München in Munich, Germany. He taught there from 2004 until 2009 when he became a professor of artificial intelligence at the Università della Svizzera Italiana in Lugano, Switzerland.

⇒ Listen to Lex Fridman’s interview with Jürgen Schmidhuber

LinkedIn || Twitter || Website

8. Nick Bostrom

Nick Bostrom is a Swedish philosopher at the University of Oxford known for his work on existential risk, the anthropic principle, human enhancement ethics, superintelligence risks, and the reversal test.

In 2011, he founded the Oxford Martin Programme on the Impacts of Future Technology, and is the founding director of the Future of Humanity Institute at Oxford University. In 2009 and 2015, he was included in Foreign Policy’s Top 100 Global Thinkers list.

Bostrom is the author of over 200 publications, and has written two books and co-edited two others. The two books he has authored are Anthropic Bias: Observation Selection Effects in Science and Philosophy (2002) and Superintelligence: Paths, Dangers, Strategies (2014). Superintelligence was a New York Times bestseller, was recommended by Elon Musk and Bill Gates among others, and helped to popularize the term “superintelligence”.

Website || More on Nick Bostrom on Wikipedia and TED

9. Angelica Lim

Dr. Angelica Lim received her Ph.D. and M.Sc. in Intelligence Science from Kyoto University, and B.Sc. in Computing Science with Minor in French from Simon Fraser University, Canada. A key member on the Pepper humanoid robot project with Softbank and Aldebaran Robotics, she has interned as a software engineer and researcher at Google Santa Monica, Honda Research Institute Japan, and I3S-CNRS, France.

She has worked on robots and artificial intelligence for over 10 years, and is currently interested in signal processing, machine learning and developmental robotics for intelligent systems, particularly in the field of emotions. She is one of four journalists for the IEEE Spectrum Automaton Robotics Blog, and was a speaker at TEDx Kyoto 2012 (“On Designing User-Friendly Robots”) and TEDx KualaLumpur 2014 (“Robots, Emotions and Empathy”).

She was a Guest Editor for the International Journal of Synthetic Emotions, and has received various awards including CITEC Award for Excellence in Doctoral HRI Research (2014), NTF Award for Entertainment Robots and Systems IROS (2010), and the Google Canada Anita Borg Scholarship (2008). She has been featured on the BBC, given talks at SXSW and TEDx, hosted a TV documentary on robotics, and was recently featured in Forbes 20 Leading Women in AI.

Twitter || LinkedIn || Website

10. Fabio Moioli

Fabio Moioli is Head of Consulting & Services at Microsoft. Faculty at Harvard, SingularityU, MIP – Artificial & Human Intelligences – AI TEDx. He has 250,000+ followers on LinkedIn & Twitter, where he mainly addresses opportunities and challenges raised by Artificial Intelligence and exponential technologies, including societal and ethical perspectives.

Major areas of expertise include Artificial Intelligence, Digital Platforms, Transformation programs, Lean Operations, Product & Services Innovation, and more.

LinkedIn || Twitter || Instagram

11. Andrew Ng

A businessman, computer scientist, investor, and writer. He is focusing on machine learning and AI. As a businessman and investor, Ng co-founded and led Google Brain and was a former Vice President and Chief Scientist at Baidu, building the company’s Artificial Intelligence Group into a team of several thousand people.

Ng is an adjunct professor at Stanford University (formerly associate professor and Director of its AI Lab). Also a pioneer in online education, Ng co-founded Coursera and deeplearning.ai. He has spearheaded many efforts to “democratize deep learning”, teaching over 2.5 million students through his online courses.

He is one of the world’s most famous and influential computer scientists, having been named one of Time magazine’s 100 Most Influential People in 2012 and one of Fast Company’s Most Creative People in 2014. In 2018 he launched, and currently heads, AI Fund, initially a $175-million investment fund for backing artificial intelligence startups. He has also founded Landing AI, which provides AI-powered SaaS products and a Transformation Program to help enterprises become cutting-edge AI companies.

⇒ Listen to Lex Fridman’s interview with Andrew Ng

Twitter || LinkedIn || Website || Coursera

12. Oriol Vinyals

Oriol Vinyals is a Principal Scientist at Google DeepMind, working in Deep Learning and Artificial Intelligence. Prior to joining DeepMind, Oriol was part of the Google Brain team.

He holds a PhD in EECS from the University of California, Berkeley and is a recipient of the 2016 MIT TR35 innovator award.

Some of his contributions are used in Google Translate, Text-To-Speech, and Speech recognition, serving billions of queries every day, and he was the lead researcher of the AlphaStar project, creating an agent that defeated a top professional at the game of StarCraft, achieving Grandmaster level.

At DeepMind he continues working on his areas of interest, which include artificial intelligence, with particular emphasis on machine learning, deep learning and reinforcement learning.

⇒ Check out Lex Fridman’s interview with Oriol Vinyals

Twitter || LinkedIn || Google Research

13. Reza Zadeh

Reza Zadeh is founder and CEO at Matroid and an Adjunct Professor at Stanford. His work focuses on Machine Learning, Distributed Computing, and Discrete Applied Mathematics.

He’s served on the Technical Advisory Boards of Microsoft and Databricks, and has been working on Machine Learning since 2005 when he worked in Google’s AI research team. His awards include a KDD Best Paper Award and the Gene Golub Outstanding Thesis Award at Stanford.

Stanford Profile || Twitter || LinkedIn

14. Ben Goertzel

Ben Goertzel is an artificial intelligence researcher. He is the chief scientist and chairman of AI software company Novamente LLC, chairman of the OpenCog Foundation, and an advisor to Singularity University. He was Director of Research of the Machine Intelligence Research Institute.

His research work encompasses artificial general intelligence, natural language processing, cognitive science, data mining, machine learning, computational finance, bioinformatics, virtual worlds and gaming, and other areas. He has published a dozen scientific books, 100+ technical papers, and numerous journalistic articles. Before entering the software industry he served as a university faculty member in several departments of mathematics, computer science and cognitive science, in the US, Australia and New Zealand.

LinkedIn || Twitter || Website

15. Adam Coates

Director at Apple. He received his PhD from Stanford University in 2012 and was the director of the Silicon Valley AI Lab at Baidu Research until September 2017, then an Operating Partner at Khosla Ventures until 2018.

During his graduate career, he co-developed an autonomous aerobatic helicopter, worked on perception systems for household robots, and early large-scale deep learning methods. He developed deep learning software for high-performance computing systems with a team at Stanford, used for unsupervised learning, object detection and self-driving cars.

Previous projects: Baidu Deep Voice, Deep Speech, DL on COTS HPC, Stanford AI Robot, Stanford Autonomous Helicopter.

LinkedIn || Twitter || Website

16. Kirk Borne

Worldwide top influencer since 2013. Data Scientist. Global Speaker. Consultant. Astrophysicist. Space Scientist.

Big Data & Data Science advisor, TEDx speaker, researcher, blogger, Data Literacy advocate. Currently Principal Data Scientist and Executive Advisor at Booz Allen Hamilton, Annapolis Junction, MD.

Twitter || LinkedIn || Website

17. Ronald van Loon

A recognized expert and thought leader in Data Science, works with data-driven companies to generate business value so that they may meet and exceed goal after goal.

Ronald van Loon has been recognized for his work in the field of digital transformation by such publications and organizations as Onalytica, Dataconomy, and Klout. In addition to these recognitions, he is also an author for a number of leading big data websites, including The Guardian, Datafloq, and Data Science Central, and he regularly speaks at renowned events and conferences.

Twitter || LinkedIn || Other profiles

18. Noam Chomsky

Noam Chomsky is an American linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Sometimes called “the father of modern linguistics”, Chomsky is also a major figure in analytic philosophy and one of the founders of the field of cognitive science.

He holds a joint appointment as Institute Professor Emeritus at the Massachusetts Institute of Technology (MIT) and Laureate Professor at the University of Arizona, and is the author of more than 100 books on topics such as linguistics, war, politics, and mass media.

If you are interested in Natural Language Processing and cognitive science, you should follow Noam Chomsky.

⇒ Listen to Lex Fridman’s interview with Noam Chomsky about Language, Cognition, and Deep Learning on Artificial Intelligence (AI) Podcast

Twitter || Website || Facebook

19. Lex Fridman

Lex Fridman’s fields of expertise include research in human-centered AI, deep learning, autonomous vehicles & robotics at MIT and beyond. He also teaches courses on deep learning.

He is known for his Artificial Intelligence Podcast where he talks about all Data Science related topics with the most renowned scientists from the field.

Website || Twitter || LinkedIn || YouTube || Instagram

20. Kai-Fu Lee

Dr. Kai-Fu Lee is one of the world’s leading AI experts and has been in AI research, development, and investment for over 30 years. Dr. Lee is the Chairman and CEO of Sinovation Ventures, and the President of Sinovation’s Artificial Intelligence Institute, and former President of Google China.

⇒ Watch Lex Fridman’s interview with Kai-Fu Lee

Twitter || LinkedIn || Website

21. Elon Musk

Elon Musk co-founded and leads Tesla, SpaceX, Neuralink, and The Boring Company.

Previously, Musk co-founded and sold PayPal, the world’s leading Internet payment system, and Zip2, one of the first internet maps and directions services.

Although known for his controversial opinions, he’s one of the leading AI influencers in the world.

Twitter || Instagram || Neuralink || The Boring Company || SpaceX || Tesla

22. Bernard Marr

He’s a world-renowned futurist, influencer, and thought leader in the field of business and technology. He is the author of 18 best-selling books, writes a regular column for Forbes, and advises and coaches many of the world’s best-known organizations. He has 2 million social media followers and was ranked by LinkedIn as one of the top 5 business influencers in the world and the No 1 influencer in the UK.

Website || LinkedIn || Twitter || Facebook

23. Rachel Thomas

Rachel Thomas is the co-founder of fast.ai, which created the Practical Deep Learning for Coders course taken by over 200,000 students and which has been featured in The Economist, MIT Tech Review, and Forbes. She was selected by Forbes as one of 20 Incredible Women in AI, earned her math PhD at Duke, and was an early engineer at Uber. Rachel is a popular writer and keynote speaker on topics of data ethics, AI accessibility, and bias in machine learning.

Website || LinkedIn || Twitter

24. Moustapha Cisse

Moustapha Cisse is a research scientist at Google and head of the Google AI center in Accra, Ghana, where he leads research efforts in foundational machine learning and its applications to solving complex societal challenges.

Moustapha is also a professor of machine learning at the African Institute of Mathematical Sciences, where he is the founder and director of the African Masters of Machine Intelligence program. He was previously a research scientist at Facebook AI Research. Before that, he completed his PhD at University Pierre and Marie Curie in France.

Twitter || LinkedIn

25. Kate Crawford

Kate Crawford is a leading researcher and professor in the fields of social implications of data systems, machine learning and artificial intelligence. She is a Senior Principal Researcher at MSR-NYC, the inaugural Visiting Chair for AI and Justice at the École Normale Supérieure in Paris, and the Miegunyah Distinguished Visiting Fellow at the University of Melbourne.

Kate is the co-founder of the AI Now Institute at New York University, the world’s first university institute dedicated to researching the social implications of artificial intelligence and related technologies.

Website || Twitter

26. Sam Altman

Sam Altman is an entrepreneur, investor, programmer, and blogger. He is the CEO of OpenAI and the Chairman of Y Combinator, a leading Silicon Valley startup accelerator that has helped launch companies such as Reddit, Dropbox, and Airbnb. He is an investor in many companies, and a chairman of the board for Helion and Oklo, two nuclear energy companies.

Twitter || Website

27. Martin Ford

Martin Ford is the author of Rise of the Robots: Technology and the Threat of a Jobless Future, a New York Times bestseller that won the £30,000 Financial Times and McKinsey Business Book of the Year Award.

Martin Ford is also the consulting artificial intelligence expert for the new Robotics and AI ETF from Lyxor/Societe Generale (Ticker ROAI), which is focused specifically on investing in companies that will be significant participants in the AI and robotics revolution. He holds a computer engineering degree from the University of Michigan, Ann Arbor and a graduate business degree from the University of California, Los Angeles.

Website || Twitter || LinkedIn

28. Alexis Conneau

Alexis Conneau is a resident Ph.D. student at Facebook AI Research in Paris.

His focus is deep learning for natural language processing (NLP). Specifically, he is working on transferable text representations using neural networks.

Conneau’s research interests include natural language understanding, sequence to sequence learning, and neural machine translation.

LinkedIn || Twitter || Alexis Conneau on Google Scholar

29. Andreas Maier

Andreas Maier is an ML researcher and Professor at the Pattern Recognition Lab at the University of Erlangen-Nuremberg. He developed PEAKS, the first online tool to assess speech intelligibility. Since 2016, he has been a member of the steering committee of the European Time Machine Consortium.

His current research interests focus on medical imaging, image and audio processing, digital humanities, and interpretable machine learning, and the use of known operators.

Twitter || Website || Andreas Maier on Google Scholar

30. François Chollet

François Chollet is a software engineer and AI researcher currently working as a Staff Software Engineer at Google. He’s the creator of Keras, a leading deep learning framework for Python, and the author of Deep Learning with Python.

His primary interests involve general intelligence, making AI technology easy to understand, helping people use the full potential of AI, and understanding and simulating the early stages of human cognitive development.

LinkedIn || Twitter || Website || Google Scholar

31. Geoffrey Hinton

Geoffrey Hinton is an emeritus professor at the Department of Computer Science at the University of Toronto. He is also a VP Engineering fellow at Google and Chief Scientific Adviser at the Vector Institute. He was one of the researchers who introduced the backpropagation algorithm and the first to use backpropagation for learning word embeddings. His other contributions to neural network research include Boltzmann machines, distributed representations, time-delay neural nets, mixtures of experts, variational learning and deep learning. His research group in Toronto made major breakthroughs in deep learning that revolutionized speech recognition and object classification.

Geoffrey Hinton is a fellow of the UK Royal Society and a foreign member of the US National Academy of Engineering and the American Academy of Arts and Sciences.

Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on deep learning. They are sometimes referred to as the “Godfathers of AI” and “Godfathers of Deep Learning”.

Twitter || Website

32. Demis Hassabis

Demis Hassabis is an artificial intelligence researcher and neuroscientist. He is the CEO and co-founder of DeepMind and has been a UK Government AI Advisor since 2018. He’s also a five-time winner of the Pentamind board games championship. Hassabis is recognized worldwide as one of the smartest thinkers in his field.

Twitter || DeepMind

33. Ian J. Goodfellow

Ian Goodfellow is a machine learning researcher. He’s the Director of Machine Learning at Apple’s Special Projects Group. He was previously employed as a research scientist at Google Brain. He’s the author of the Deep Learning textbook. Goodfellow has made several contributions to the field of deep learning.

LinkedIn || Website || Twitter

34. Jerome Pesenti

Jerome Pesenti is the AI team leader at Facebook pursuing fundamental and applied research in AI and making Facebook products safer and more valuable to people through the use of AI. Prior to joining Facebook, Jerome joined IBM to lead the development of its Watson platform after the startup he co-founded, Vivisimo, was acquired by the company in 2012. He went on to later become the CEO of BenevolentTech.

Twitter || LinkedIn

35. Yoshua Bengio

Yoshua Bengio is recognized as one of the world’s leading experts in artificial intelligence and a pioneer in deep learning. Since 1993, he has been a professor in the Department of Computer Science and Operational Research at the Université de Montréal. CIFAR’s Learning in Machines & Brains Program Co-Director, he is also the founder and scientific director of Mila, the Quebec Artificial Intelligence Institute, the world’s largest university-based research group in deep learning.

In 2019, he received the 2018 ACM A.M. Turing Award, “the Nobel Prize of Computing”, jointly with Geoffrey Hinton and Yann LeCun for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.

LinkedIn || Website || Google Scholar

Make sure to follow these great influencers to stay on top of the latest machine learning news, get inspired, and learn new, wonderful things.

What other machine learning influencers do you follow?

Optuna vs Hyperopt: Which Hyperparameter Optimization Library Should You Choose?

Jakub Czakon — Thu, 21 Jul 2022 09:29:11 +0000

Wondering which library you should choose for hyperparameter optimization?

Been using Hyperopt for a while and feel like changing?

Just heard about Optuna and you want to see how it works?

Good!

In this article I will:

show you an example of using Optuna and Hyperopt on a real problem,
compare Optuna vs Hyperopt on API, documentation, functionality, and more,
give you my overall score and recommendation on which hyperparameter optimization library you should use.

Let’s do it.

Evaluation criteria

Ease of use and API

In this section I want to see how to run a basic hyperparameter tuning script with both libraries, how natural and easy to use each one feels, and what the API looks like.

Optuna

You define your search space and objective in one function.

Moreover, you sample the hyperparameters from the trial object. Because of that, the parameter space is defined at execution time. For those of you who like PyTorch because of this imperative approach, Optuna will feel natural.

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 1000),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_uniform('subsample', 0.1, 1.0)}
    return train_evaluate(params)

Then, you create the study object and optimize it. What is great is that you can choose whether you want to maximize or minimize your objective. That is useful when optimizing a metric like AUC because you don’t have to change the sign of the objective before training and then convert best results after training to get a positive score.

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

That is it.

Everything you may want to know about the optimization is available in the study object.

What I love about Optuna is that I get to define how I want to sample my search space on the fly, which gives me a lot of flexibility. The ability to choose the direction of optimization is also pretty nice.

If you want to see the full code example you can scroll down to the Example script.

10 / 10

Hyperopt

You start by defining your parameter search space:

SPACE = {'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.5)),
         'max_depth': hp.choice('max_depth', range(1, 30, 1)),
         'num_leaves': hp.choice('num_leaves', range(2, 100, 1)),
         'subsample': hp.uniform('subsample', 0.1, 1.0)}
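One detail that is easy to miss: hp.loguniform takes its bounds in log space (hence np.log(0.01) and np.log(0.5) above), while Optuna's suggest_loguniform takes the raw bounds. Here is a minimal sketch in plain Python of what drawing a log-uniform value amounts to. The helper sample_loguniform is a hypothetical name used only for illustration, and this is not how either library implements it internally.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_loguniform(0.01, 0.5, rng) for _ in range(10_000)]

# On a log scale the samples are spread evenly, so about half of them
# fall below the geometric mean of the two bounds.
geometric_mean = math.sqrt(0.01 * 0.5)
share_below = sum(s < geometric_mean for s in samples) / len(samples)
print(min(samples), max(samples), share_below)
```

This is why a log-uniform distribution is a natural fit for parameters like learning_rate, where 0.01 vs 0.02 matters as much as 0.2 vs 0.4.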

Then, you create an objective function that you want to minimize. That means you will have to flip the sign of your objective for higher-is-better metrics like AUC.

def objective(params):
    return -1.0 * train_evaluate(params)

Finally, you instantiate the Trials() object and minimize your objective on the parameter search SPACE.

trials = Trials()
_ = fmin(objective, SPACE, trials=trials, algo=tpe.suggest, max_evals=100)

…and done!

All the information about the hyperparameters that were tested and the corresponding score are kept in the trials object.

The thing that I don’t like is the fact that I need to instantiate the Trials() even in the simplest of cases. I would rather have fmin return the trials and do the instantiation by default.

9 / 10

Both libraries do a good job here but I feel that Optuna is slightly better because of the flexibility, imperative approach to sampling parameters and a bit less boilerplate.

Ease of use and API

Optuna > Hyperopt

Options, methods, and hyper(hyperparameters)

In real-life scenarios running hyperparameter optimization requires a lot of additional options away from the golden path. Areas that I am particularly interested in are:

search space
optimization methods/algorithms
callbacks
persisting and restarting parameter sweeps
pruning unpromising runs
handling exceptions

In this section I will compare Optuna and Hyperopt on exactly those.

Search space

In this section I want to compare the search space definition, flexibility in defining a complex space and sampling options for each parameter type (Float, Integer, Categorical).

Optuna

You can find sampling options for all hyperparameter types:

for categorical parameters you can use trial.suggest_categorical
for integers there is trial.suggest_int
for float parameters you have trial.suggest_uniform, trial.suggest_loguniform and even the more exotic trial.suggest_discrete_uniform

For integer parameters in particular you could wish for more options, but it deals with most use cases. A great feature of this library is that you sample from the parameter space on the fly and you can do it however you like. You can use if statements, you can change the intervals you search over, and you can use the information from the trial object to guide your search.

def objective(trial):
    classifier_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if classifier_name == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        classifier_obj = sklearn.svm.SVC(C=svc_c)
    else:
        rf_max_depth = int(trial.suggest_loguniform('rf_max_depth', 2, 32))
        classifier_obj = sklearn.ensemble.RandomForestClassifier(max_depth=rf_max_depth)

    ...

This is awesome, you can do literally anything!

10 / 10

Hyperopt

Search space is where Hyperopt really gives you a ton of sampling options:

for categorical parameters you have hp.choice
for integers you get hp.randint, hp.quniform, hp.qloguniform and hp.qlognormal
for floats we have hp.normal, hp.uniform, hp.lognormal and hp.loguniform

As far as I know this is the most extensive sampling functionality out there.
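To make the "q" variants less mysterious, here is a small plain-Python sketch of the rounding rule that the Hyperopt docs describe for hp.quniform and hp.qloguniform: draw a continuous value, then snap it to a multiple of q. The helper names are hypothetical and this is an illustration of the rule, not Hyperopt's code. As far as I can tell, Hyperopt returns these values as floats, so I cast them with int() before passing them to a model.

```python
import math
import random

def sample_quniform(low, high, q, rng):
    """round(uniform(low, high) / q) * q"""
    return round(rng.uniform(low, high) / q) * q

def sample_qloguniform(low, high, q, rng):
    """round(exp(uniform(low, high)) / q) * q, with low and high given in log space"""
    return round(math.exp(rng.uniform(low, high)) / q) * q

rng = random.Random(42)

# integer-valued max_depth in [1, 30]
depths = [int(sample_quniform(1, 30, 1, rng)) for _ in range(1000)]

# integer-valued num_leaves in [2, 100], denser at the low end
leaves = [int(sample_qloguniform(math.log(2), math.log(100), 1, rng)) for _ in range(1000)]

print(sorted(set(depths))[:5], sorted(set(leaves))[:5])
```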

You define your search space before you run optimization but you can create very complex parameter spaces:

SPACE = hp.choice('classifier_type', [
    {
        'type': 'naive_bayes',
    },
    {
        'type': 'svm',
        'C': hp.lognormal('svm_C', 0, 1),
        'kernel': hp.choice('svm_kernel', [
            {'ktype': 'linear'},
            {'ktype': 'RBF', 'width': hp.lognormal('svm_rbf_width', 0, 1)},
            ]),
    },
    {
        'type': 'dtree',
        'criterion': hp.choice('dtree_criterion', ['gini', 'entropy']),
        'max_depth': hp.choice('dtree_max_depth',
            [None, hp.qlognormal('dtree_max_depth_int', 3, 1, 1)]),
        'min_samples_split': hp.qlognormal('dtree_min_samples_split', 2, 1, 1),
    },
    ])

By combining hp.choice with other sampling methods we can have conditional spaces. This is useful when you are optimizing hyperparameters for a machine learning pipeline that involves preprocessing, feature engineering and model training.

10 / 10

I have to say I like them both. I can define nested search spaces easily and I have a lot of sampling options for all the parameter types. Optuna has imperative parameter definition, which gives more flexibility while Hyperopt has more parameter sampling options.

Search space

Optuna = Hyperopt

Optimization methods

Both Optuna and Hyperopt are using the same optimization methods under the hood. They have:

rand.suggest (Hyperopt) and samplers.random.RandomSampler (Optuna)

Your standard random search over the parameters.

tpe.suggest (Hyperopt) and samplers.tpe.sampler.TPESampler (Optuna)

Tree of Parzen Estimators (TPE). The idea behind this method is similar to what was explained in the previous blog post about Scikit Optimize. We use a cheap surrogate model to estimate the performance of the expensive objective function on a set of parameters.

The difference between the methods used in Scikit Optimize and Tree of Parzen Estimators (TPE) is that instead of estimating the actual performance (point estimation) we want to estimate the density in the tails. We want to be able to tell whether a run will be good (right tail) or bad (left tail).

I like the following explanation taken from the AutoML_Book by amazing folks over at AutoML.org Freiburg.

Instead of modeling the probability p(y|λ) of observations y given the configurations λ, the Tree Parzen Estimator models density functions p(λ|y < α) and p(λ|y ≥ α). Given a percentile α (usually set to 15%), the observations are divided in good observations and bad observations and simple 1-d Parzen windows are used to model the two distributions.

By using p(λ|y < α) and p(λ|y ≥ α) you can estimate the expected improvement of a parameter configuration over previous best.
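To make this more concrete, here is a toy, one-dimensional sketch of the idea in plain Python: split past observations into good and bad at a quantile, model each group with Parzen windows (Gaussian kernels), and pick the candidate with the highest density ratio. Everything here (the function names, the fixed bandwidth, the toy loss) is an assumption made for illustration. The real implementations in Optuna and Hyperopt are considerably more elaborate.

```python
import math
import random

def parzen_density(x, centers, bandwidth):
    """Average of Gaussian kernels placed on each observed point."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / (len(centers) * norm)

def tpe_suggest(history, low, high, rng, gamma=0.15, n_candidates=24, bandwidth=0.1):
    """Propose the next x from history = [(x, loss), ...]; lower loss is better."""
    ordered = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ordered)))
    good = [x for x, _ in ordered[:n_good]]   # models p(x | y < alpha)
    bad = [x for x, _ in ordered[n_good:]]    # models p(x | y >= alpha)

    # draw candidates around the good points (sampling from the "good" density)
    candidates = []
    for _ in range(n_candidates):
        x = rng.gauss(rng.choice(good), bandwidth)
        candidates.append(min(high, max(low, x)))

    # keep the candidate with the largest ratio good(x) / bad(x)
    def ratio(x):
        return parzen_density(x, good, bandwidth) / (parzen_density(x, bad, bandwidth) + 1e-12)

    return max(candidates, key=ratio)

def loss(x):
    return (x - 0.7) ** 2   # toy objective with its minimum at x = 0.7

rng = random.Random(0)
history = [(x, loss(x)) for x in (rng.uniform(0, 1) for _ in range(10))]  # random warm-up
for _ in range(40):
    x = tpe_suggest(history, 0.0, 1.0, rng)
    history.append((x, loss(x)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(best_x, best_loss)
```

Note that the sketch never fits a model of the score itself. It only asks where good configurations are dense relative to bad ones.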

Interestingly, both for Optuna and Hyperopt, there are no options to specify the α parameter in the optimizer.

Optuna

integration.SkoptSampler

Optuna lets you use samplers from Scikit-Optimize (skopt).

Skopt offers a bunch of Tree-Based methods as a choice for your surrogate model.

In order to use them you need to:

create a SkoptSampler instance specifying the parameters of the surrogate model and acquisition function in the skopt_kwargs argument,
pass the sampler instance to the optuna.create_study method

import optuna
from optuna.integration import SkoptSampler

# 'ET' selects extra-trees as the surrogate model; use 'RF' for a random forest
sampler = SkoptSampler(skopt_kwargs={'base_estimator': 'ET',
                                     'n_random_starts': 10,
                                     'acq_func': 'EI',
                                     'acq_func_kwargs': {'xi': 0.02}})
study = optuna.create_study(sampler=sampler)
study.optimize(objective, n_trials=100)

pruners.SuccessiveHalvingPruner

You can also use one of the multi-armed bandit methods called Asynchronous Successive Halving Algorithm (ASHA). If you are interested in the details please read the paper but the general idea is to:

run a bunch of parameter configurations for some time
prune the least promising runs (half of them)
run the remaining parameter configurations for some more time
prune the least promising runs (half of them) again
stop when only one configuration is left

By doing so, the search can focus on the more promising runs. However, the static allocation of the budgets to configurations is a problem in practice (which a newer approach called HyperBand solves).
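The steps above can be sketched in a few lines of plain Python. This is the simple synchronous variant (all configurations wait for each other at every round), not the asynchronous algorithm that Optuna implements, and the names successive_halving, min_budget and eta as well as the toy evaluate function are my own assumptions for illustration.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # score every surviving configuration with the current budget
        scored = [(evaluate(config, budget), config) for config in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)   # higher score is better
        keep = max(1, len(scored) // eta)
        survivors = [config for _, config in scored[:keep]]
        budget *= eta
    return survivors[0]

rng = random.Random(0)
configs = [{'id': i, 'quality': rng.random()} for i in range(16)]

def evaluate(config, budget):
    # toy learning curve: the score approaches config['quality'] as the budget grows
    return config['quality'] * (1 - 0.5 ** budget)

winner = successive_halving(configs, evaluate)
print(winner)
```

With 16 configurations and eta=2 the rounds go 16, 8, 4, 2, 1, and most of the total budget is spent on the few configurations that survive the early rounds.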

It is very easy to use ASHA in Optuna. Just pass a SuccessiveHalvingPruner to .create_study() and you are good to go:

import optuna
from optuna.pruners import SuccessiveHalvingPruner

study = optuna.create_study(pruner=SuccessiveHalvingPruner())
study.optimize(objective, n_trials=100)

Nice and simple.

If you would like to learn more, you may want to check out my article about Scikit Optimize.

Overall, there are a lot of options when it comes to optimization functions right now. However, some important ones, like Hyperband or BOHB, are missing.

8 / 10

Hyperopt

atpe.suggest

Recently added, adaptive TPE was invented at ElectricBrain and it is actually a series of (not so) little improvements that they experimented with on top of TPE.

The authors explain their approach and modifications they made to TPE thoroughly in this fascinating blog post.

It is super easy to use. Instead of tpe.suggest you need to pass atpe.suggest to your fmin function.

from hyperopt import fmin, atpe

best = fmin(objective, SPACE,
            max_evals=100,
            algo=atpe.suggest)

I really like this effort to include new optimization algorithms in the library, especially since it’s a new original approach and not just an integration with an existing algorithm.

Hopefully, in the future multi-armed bandit methods like Hyperband, BOHB, or tree-based methods like SMAC3 will be included as well.

8 / 10

Optimization methods

Optuna = Hyperopt

Callbacks

In this section I want to see how easy it is to define callbacks to monitor/snapshot/modify training after each iteration. It is useful, especially when your training is long and/or distributed.

Optuna

User callbacks are nicely supported with the callbacks argument of the .optimize() method. Just pass a list of callables that take study and trial as input and you are good to go.

def neptune_monitor(study, trial):
    neptune_run["score"] = trial.value
    neptune_run["parameters"] = trial.params

... 
study.optimize(objective, n_trials=100, callbacks=[neptune_monitor])

Because you can access both study and trial you have all the flexibility you can possibly want to checkpoint, do early stopping or modify future search.

10 / 10

Hyperopt

There are no callbacks per se, but you can put your callback function inside the objective and it will be executed every time the objective is called.

def neptune_monitor(params, score):
    neptune_run["score"] = score
    neptune_run["parameters"] = params

def objective(params):
    score = -1.0 * train_evaluate(params)
    neptune_monitor(params, score)
    return score

I don’t love it but I guess I can live with that.
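One way to make this a bit less painful is to keep the objective untouched and wrap it. Below is a minimal sketch in plain Python. The helper with_callbacks is hypothetical (it is not part of Hyperopt), and the toy objective and the record callback are only there to make the example runnable.

```python
def with_callbacks(objective, callbacks):
    """Wrap an objective so every callback receives (params, score) after each evaluation."""
    def wrapped(params):
        score = objective(params)
        for callback in callbacks:
            callback(params, score)
        return score
    return wrapped

# toy objective and a logging callback, both hypothetical
log = []

def toy_objective(params):
    return (params['x'] - 3) ** 2

def record(params, score):
    log.append((dict(params), score))

monitored = with_callbacks(toy_objective, [record])
result = monitored({'x': 5})
print(result, log)
```

Since fmin only needs a callable, the wrapped function could be passed in place of the original objective, for example fmin(monitored, SPACE, ...).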

6 / 10

Optuna makes it really easy with the callbacks argument while in Hyperopt you have to modify the objective.

Callbacks

Optuna > Hyperopt

Persisting and restarting

Saving and loading your hyperparameter searches can save you time, money, and can help get better results. Let’s compare both frameworks on that.

Optuna

Simply use joblib.dump to pickle the study object.

study.optimize(objective, n_trials=100)
joblib.dump(study, 'artifacts/study.pkl')

… and you can load it later with joblib.load to restart your search.

study = joblib.load('artifacts/study.pkl')
study.optimize(objective, n_trials=200)

That’s it.

For distributed setups, you can instantiate a new study from the name of the study and the URL of the database where your distributed study is stored. For example:

study = optuna.create_study(
                    study_name='example-study',
                    storage='sqlite:///example.db',
                    load_if_exists=True)

Nice and easy.

More about running distributed hyperparameter optimization with Optuna in the Speed and Parallelization section.

10 / 10

Hyperopt

Similarly to Optuna, use joblib.dump to pickle the trials object.

trials = Trials()
_ = fmin(objective, SPACE, trials=trials,
         algo=tpe.suggest, max_evals=100)
joblib.dump(trials, 'artifacts/hyperopt_trials.pkl')

… load it with joblib.load and restart.

trials = joblib.load('artifacts/hyperopt_trials.pkl')
_ = fmin(objective, SPACE, trials=trials,
         algo=tpe.suggest, max_evals=200)

Simple and works with no problems.

If you are optimizing hyperparameters in a distributed fashion you can load MongoTrials() object that connects to MongoDB. More about running distributed hyperparameter optimization with Hyperopt in the Speed and Parallelization section.

10 / 10

Both make it easy and get the job done.

Persisting and restarting

Optuna = Hyperopt

Run Pruning

Not all hyperparameter configurations are created equal. For some of them you can tell very quickly that they will not produce high scores. Ideally, you would like to stop those runs as soon as possible and try different parameters instead.

Optuna gives you an option to do that with Pruning Callbacks. Many machine learning frameworks are supported:

KerasPruningCallback, TFKerasPruningCallback
TensorFlowPruningHook
PyTorchIgnitePruningHandler, PyTorchLightningPruningCallback
FastAIPruningCallback
LightGBMPruningCallback
XGBoostPruningCallback
and more

You can read about them in the docs.

For example, in the case of LightGBM training you would pass this callback to the lgb.train function.

def train_evaluate(X, y, params, pruning_callback=None):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    callbacks = [pruning_callback] if pruning_callback is not None else None

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data],
                      valid_names=['valid'],
                      callbacks=callbacks)
    score = model.best_score['valid']['auc']
    return score

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 1000),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_uniform('subsample', 0.1, 1.0)}

    pruning_callback = LightGBMPruningCallback(trial, 'auc', 'valid')
    return train_evaluate(X, y, params, pruning_callback)

Only Optuna gives you this option so it is a clear win.
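If you are wondering what a pruner actually decides, here is a toy sketch of a median rule in plain Python: a run is stopped when its intermediate score at a given step is worse than the median of earlier runs at the same step. This is similar in spirit to Optuna's MedianPruner but it is my own simplified illustration. The names should_prune, run_trial and warmup_steps and the learning curves are made up for the example.

```python
import statistics

def should_prune(step, value, finished_curves, warmup_steps=2):
    """Median rule: prune when the value at this step is below the median of earlier trials."""
    if step < warmup_steps:
        return False
    reference = [curve[step] for curve in finished_curves if len(curve) > step]
    if not reference:
        return False
    return value < statistics.median(reference)

def run_trial(learning_curve, finished_curves):
    """Replay a learning curve step by step and stop as soon as the rule fires."""
    seen = []
    for step, value in enumerate(learning_curve):
        seen.append(value)
        if should_prune(step, value, finished_curves):
            return seen, True
    return seen, False

# validation AUC after each block of boosting rounds for three finished trials
finished = [[0.60, 0.70, 0.75, 0.78],
            [0.55, 0.68, 0.74, 0.77],
            [0.62, 0.72, 0.76, 0.80]]

weak_curve, weak_pruned = run_trial([0.58, 0.60, 0.61, 0.62], finished)
strong_curve, strong_pruned = run_trial([0.61, 0.73, 0.78, 0.82], finished)
print(weak_pruned, len(weak_curve), strong_pruned, len(strong_curve))
```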

Run Pruning

Optuna > Hyperopt

Handling exceptions

If one of your runs fails due to a wrong parameter combination, a random training error, or some other problem, you could lose all the parameter_configuration:score pairs evaluated so far in a study.

You can use callbacks to save this information after every iteration or use a DB to store it as explained in the Speed and Parallelization section.

However, you may want to let the study continue even when an exception happens. To make that possible, Optuna lets you pass the allowed exceptions to the .optimize() method via the catch argument.

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100)}

    print(non_existent_variable)

    return train_evaluate(params)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, catch=(NameError,))

Again, only Optuna supports this.
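Conceptually, catching exceptions per trial boils down to the loop below. It is a plain-Python sketch with hypothetical names (run_trials, sample_params, a toy objective), not Optuna's implementation: a failing trial is recorded as failed and the search simply moves on.

```python
import random

def run_trials(objective, sample_params, n_trials, catch=()):
    """Evaluate n_trials parameter sets; exceptions listed in `catch` mark the trial as failed."""
    results = []
    for _ in range(n_trials):
        params = sample_params()
        try:
            score = objective(params)
            results.append({'params': params, 'score': score, 'state': 'complete'})
        except catch as error:
            results.append({'params': params, 'score': None, 'state': 'fail', 'error': repr(error)})
    return results

rng = random.Random(1)

def sample_params():
    return {'max_depth': rng.randint(1, 30)}

def objective(params):
    # toy failure mode: some parameter values crash the "training"
    if params['max_depth'] > 20:
        raise ValueError('max_depth too large for this toy model')
    return 1.0 / params['max_depth']

results = run_trials(objective, sample_params, n_trials=50, catch=(ValueError,))
completed = [r for r in results if r['state'] == 'complete']
failed = [r for r in results if r['state'] == 'fail']
print(len(completed), len(failed))
```

Exceptions that are not listed in catch still propagate, so genuine bugs are not silently swallowed.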

Handling exceptions

Optuna > Hyperopt

Documentation

When you are a user of a library or a framework it is absolutely crucial to find the information you need when you need it. This is where documentation/support channels come into the picture and they can make or break a library.

Let’s see how Optuna and Hyperopt compare on that.

Optuna

It is really good.

There is a proper webpage that explains all the basic concepts and shows you where to find more information.

Also, there is complete and very easy-to-understand documentation on Read the Docs.

It contains:

Tutorials with both simple and advanced examples
API Reference with all the functions containing beautiful docstrings. To give you an idea imagine having charts inside of your docstrings so that you can understand what is happening inside your function better. Check out the BaseSampler if you don’t believe me.

It is also important to mention that the supporting team from Preferred Networks really takes care of this project. They respond to GitHub issues and the community is growing around it with great feature ideas and PRs coming in. Check out the GitHub project issues section to see what is going on there.

10 / 10

Hyperopt

It was recently updated and now it is quite alright.

You can find it here.

You can easily find information about:

how to get started
how to define both simple and advanced search spaces
how to run the installation
how to run Hyperopt in parallel via MongoDB or Spark

Unfortunately, there were some things that I didn’t like:

missing API reference with the docstrings for all functions/methods
docstrings themselves are missing for most methods/functions, which forces you to read the implementation (there are some positive side effects here:) )
no examples of using Adaptive TPE. I wasn’t sure whether I was using it correctly or whether I should specify some additional (hyper)hyperparameters. Missing docstrings didn’t help me here either.
some links in the docs lead to 404 pages.

Overall, it has improved a lot lately, but I was still a bit lost at times. I hope that with time it will get even better so stay tuned.

The good thing is, there are a lot of blog posts about it. Some of them that I found useful are:

“Parameter Tuning with Hyperopt” by District Data Labs
“Hyperopt tutorial for Optimizing Neural Networks’ Hyperparameters” by Vooban
“On Using Hyperopt: Advanced Machine Learning” by Tanay Agrawal
“An Introductory Example of Bayesian Optimization in Python with Hyperopt” by Will Koehrsen

The documentation is not the strongest side of this project but because it’s a classic there are a lot of resources out there.

6 / 10

Documentation

Optuna > Hyperopt

Visualizations

Visualizing hyperparameter searches can be very useful. You can gain information on interactions between parameters and see where you should search next.

That is why I want to compare the visualization suites that Optuna and Hyperopt offer.

Optuna

A few great visualizations are available in the optuna.visualization module:

plot_contour: plots parameter interactions on an interactive chart. You can choose which hyperparameters you would like to explore.

plot_contour(study, params=['learning_rate',
                            'max_depth',
                            'num_leaves',
                            'min_data_in_leaf',
                            'feature_fraction',
                            'subsample'])

plot_optimization_history: shows the scores from all trials as well as the best score so far at each point.

plot_optimization_history(study)

plot_parallel_coordinate: interactively visualizes the hyperparameters and scores

plot_parallel_coordinate(study)

plot_slice: shows the evolution of the search. You can see where in the hyperparameter space your search went and which parts of the space were explored more.

plot_slice(study)

Overall, visualizations in Optuna are incredible!

They let you zoom in on the hyperparameter interactions and help you decide on how to run your next parameter sweep. Amazing job.

10 / 10

Hyperopt

There are three visualization functions in the hyperopt.plotting module:

main_plot_history: shows you the results of each iteration and highlights the best score.

main_plot_history(trials)

main_plot_histogram: shows you the histogram of results over all iterations.

main_plot_histogram(trials)

main_plot_vars: I don’t really know what it does as I couldn’t get it to run and there were no docstrings nor examples (again, the documentation is far from perfect).

Summing up, there are some basic visualization utilities but they are not super useful.

3 / 10

I am very impressed by the visualizations available in Optuna. Useful, interactive, and beautiful.

Visualizations

Optuna > Hyperopt

If you want to play with those visualizations you can use the study object that I saved as ‘study.pkl’ for each experiment.

For example go to artifacts of this one.

Speed and parallelization

When it comes to hyperparameter optimization, being able to distribute your training on your machine or many machines (cluster) can be crucial.

That is why I checked the distributed training options for both Optuna and Hyperopt.

Optuna

You can run distributed hyperparameter optimization on one machine or a cluster of machines and it is actually really simple.

For one machine you simply change the n_jobs parameter in your .optimize() method.

study.optimize(objective, n_trials=100, n_jobs=12)

To run it on a cluster you need to create a study that resides in a database (you can choose among many Relational DBs).

There are two options to do that. You can do it via command line interface:

optuna create-study \
    --study-name "distributed-example" \
    --storage "sqlite:///example.db"

You can also create a study in your optimization script.

By using load_if_exists=True you can treat your master script and worker scripts in the same way which simplifies things a lot!

study = optuna.create_study(
    study_name='distributed-example',
    storage='sqlite:///example.db',
    load_if_exists=True)
study.optimize(objective, n_trials=100)

Finally, you can run your worker scripts from many machines and they will all use the same information from the study database.

terminal-1$ python run_worker.py

terminal-25$ python run_worker.py
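If the shared-database idea feels abstract, here is a toy sketch of the pattern using only the standard library. It is not Optuna's storage schema: the trials table, the run_worker function and the toy objective are all assumptions for illustration. Every worker runs the same code and appends its results to one SQLite file, and whoever arrives first creates the table, which is the same idea as load_if_exists=True.

```python
import os
import random
import sqlite3
import tempfile

def run_worker(db_path, worker_id, n_trials, seed):
    """Append this worker's trials to the shared table, creating it on first use."""
    rng = random.Random(seed)
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS trials (worker TEXT, x REAL, score REAL)')
    for _ in range(n_trials):
        x = rng.uniform(0.0, 1.0)
        score = -(x - 0.3) ** 2          # toy objective, higher is better
        conn.execute('INSERT INTO trials VALUES (?, ?, ?)', (worker_id, x, score))
        conn.commit()
    conn.close()

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, 'example.db')

    # in a real setup these calls would run on different machines or processes
    run_worker(db_path, 'worker-1', n_trials=5, seed=1)
    run_worker(db_path, 'worker-2', n_trials=5, seed=2)

    conn = sqlite3.connect(db_path)
    n_rows = conn.execute('SELECT COUNT(*) FROM trials').fetchone()[0]
    workers = [row[0] for row in conn.execute('SELECT DISTINCT worker FROM trials ORDER BY worker')]
    best_score = conn.execute('SELECT MAX(score) FROM trials').fetchone()[0]
    conn.close()

print(n_rows, workers, best_score)
```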

Easy and works like a charm!

10 / 10

Hyperopt

You can distribute your computation over a cluster of machines. Good, step-by-step instructions can be found in this blog post by Tanay Agrawal but in a nutshell, you need to:

Start a server with MongoDB on it which will consume results from your worker training scripts and send out the next parameter set to try,
In your training script, instead of Trials() create a MongoTrials() object pointing to the database server you have started in the previous step,
Move your objective function to a separate objective.py script and rename it to function,
Compile your Python training script,
Run hyperopt-mongo-worker

Though it gets the job done, it doesn’t feel quite perfect. You need to do some juggling around the objective function, and starting MongoDB could have been provided in the CLI to make things easier.

It is also important to mention that integration with Spark via the SparkTrials object was recently added. There is a step-by-step guide to help you get started and you can even use the spark-installation script to make things easier.

best = hyperopt.fmin(fn = objective,
                     space = search_space,
                     algo = hyperopt.tpe.suggest,
                     max_evals = 64,
                     trials = hyperopt.SparkTrials())

Works exactly the way you would expect it to work.

Nice and simple!

9 / 10

Both libraries support distributed training, which is great. However, Optuna does a slightly better job with a simpler, more user-friendly interface.

Speed and parallelization

Optuna > Hyperopt

Experimental results*

* Just to be clear those are the results on just one example problem and one run per lib/configuration and they do not guarantee generalization. To run a proper benchmark, you would run it multiple times on various datasets.

That being said, as a practitioner, I would hope to see some improvements over the random search for each problem. Otherwise, why bother with an HPO library?

Ok, so as an example let’s tweak the hyperparameters of the LightGBM model on a tabular, binary classification problem. If you want to use the same dataset as I did you should:

download it from Kaggle
use the first 10000 rows from the train.csv file

To make the training quick I fixed the number of boosting rounds to 300 with a 30 round early stopping.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

NUM_BOOST_ROUND = 300
EARLY_STOPPING_ROUNDS = 30

def train_evaluate(X, y, params):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                          test_size=0.2,
                                                          random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data],
                      valid_names=['valid'])

    score = model.best_score['valid']['auc']
    return score

All the training and evaluation logic is put inside the train_evaluate function. We can treat it as a black box that takes the data and hyperparameter set and produces the AUC evaluation score.

You can actually turn every script that takes parameters as inputs and outputs the score into such a train_evaluate function. Once that is done, you can treat it as a black box and tune your parameters.

I show how to do that step-by-step in a different post “How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps”.

To train a model on a set of parameters you need to run something like this:

import pandas as pd

N_ROWS=10000
TRAIN_PATH = '/mnt/ml-team/minerva/open-solutions/santander/data/train.csv'

data = pd.read_csv(TRAIN_PATH, nrows=N_ROWS)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']

MODEL_PARAMS = {'boosting': 'gbdt',
                'objective':'binary',
                'metric': 'auc',
                'num_threads': 12,
                'learning_rate': 0.3,
                }

score = train_evaluate(X, y, MODEL_PARAMS)
print('Validation AUC: {}'.format(score))

For this study, I tried to find the best parameters within a 100-run budget.

I ran 6 experiments:

Random search (from hyperopt) as a reference
Tree of Parzen Estimator search strategies for both Optuna and Hyperopt
Adaptive TPE from Hyperopt
TPE from Optuna with a pruning callback for more runs but within the same time frame. It turns out that 400 runs with pruning take as much time as 100 runs without it.
Optuna with Random Forest surrogate model from skopt.Sampler

See an example hyperparameter optimization script here.

Experiments for Optuna and Hyperopt in different configurations

If you want to explore all of those experiments in more detail you can simply go to the experiment dashboard.

Both Optuna and Hyperopt improved over the random search which is good.

TPE implementation from Optuna was slightly better than Hyperopt’s Adaptive TPE but not by much. On the other hand, when running hyperparameter optimization, those small improvements are exactly what you are going for.

What is interesting is that the TPE implementations from Hyperopt and Optuna give vastly different results on this problem. Maybe the cutoff point between good and bad parameter configurations λ is chosen differently, or the sampling methods have defaults that work better for this particular problem.

Moreover, using pruning decreased training time by 4x. I could run 400 searches in the time it takes to run 100 without pruning. On the flip side, using pruning got a lower score. It may be different for your problem, but it is important to consider that when deciding whether to use pruning or not.

For this section, I assigned points based on the improvements over the random search strategy.

Hyperopt got (0.850 - 0.844)*1000 = 6
Optuna got (0.854 - 0.844)*1000 = 10
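The scores work out to one point per 0.001 of AUC gained over random search. A minimal sketch of that arithmetic follows; the helper name improvement_points is hypothetical, and the AUC values are the ones quoted in this section.

```python
def improvement_points(score, baseline, points_per_auc=1000):
    """Convert an AUC gain over a baseline into points (1 point per 0.001 AUC)."""
    return round((score - baseline) * points_per_auc)


RANDOM_SEARCH_AUC = 0.844  # random search reference quoted in this section

hyperopt_points = improvement_points(0.850, RANDOM_SEARCH_AUC)
optuna_points = improvement_points(0.854, RANDOM_SEARCH_AUC)
print(hyperopt_points, optuna_points)
```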

Experimental results

Optuna > Hyperopt

Conclusions

Let’s take a look at the overall scores:

Even if you look at it generously and consider only the features that both libraries share, Optuna is a better framework.

It is on-par or slightly better on all criteria and:

it has better documentation
it has a way better visualization suite
it has some features like pruning, callbacks, and exception handling that hyperopt doesn’t support

After doing all this research I am convinced that Optuna is a great library for hyperparameter optimization.

Moreover, I think that you should strongly consider switching from Hyperopt if you were using that in the past.


Scikit Optimize: Bayesian Hyperparameter Optimization in Python

Jakub Czakon — Thu, 21 Jul 2022 09:24:19 +0000

Need to tune hyperparameters of your machine learning model and don’t want to do it by hand?

Thinking about performing Bayesian hyperparameter optimization but not sure how exactly to do that?

Heard of various hyperparameter optimization libraries and wondering whether Scikit Optimize is the right tool for you?

You are in the right place.

In this article I will:

Show you an example of using skopt to run Bayesian hyperparameter optimization on a real problem,
Evaluate this library based on various criteria like API, speed and experimental results,
Give you my overall score and recommendation on when to use it.

Let’s dive in, shall we?

Evaluation criteria

Ease of use and API

The API is just awesome. It is so simple, that you can almost guess it without reading the docs. Seriously, let me show you.

You define the search space:

SPACE = [
   skopt.space.Real(0.01, 0.5, name='learning_rate', prior='log-uniform'),
   skopt.space.Integer(1, 30, name='max_depth'),
   skopt.space.Integer(2, 100, name='num_leaves'),
   skopt.space.Integer(10, 1000, name='min_data_in_leaf'),
   skopt.space.Real(0.1, 1.0, name='feature_fraction', prior='uniform'),
   skopt.space.Real(0.1, 1.0, name='subsample', prior='uniform')]

You define the objective function that you want to minimize (decorate it, to keep the parameter names):

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    all_params = {**params, **STATIC_PARAMS}
    return -1.0 * train_evaluate(X, y, all_params)

And run the optimization:

results = skopt.forest_minimize(objective, SPACE, **HPO_PARAMS)

That’s it. All the information you need, like the best parameters or scores for each iteration, is kept in the results object. Go here for an example of a full script with some additional bells and whistles.

Super-easy setup and intuitive API.

10 / 10

Options, methods, and (hyper)parameters

Search space

When it comes to hyperparameter search space you can choose from three options:

space.Real - float parameters are sampled from the (a, b) range using a uniform or log-uniform prior,
space.Integer - integer parameters are sampled uniformly from the (a, b) range,
space.Categorical - for categorical (text) parameters. A value will be sampled from a list of options. For example, you could pass ['gbdt', 'dart', 'goss'] if you are training lightGBM.
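To make the difference between the two priors concrete, here is a minimal standard-library sketch of uniform versus log-uniform sampling over the learning_rate range used earlier (0.01 to 0.5). It only illustrates the idea and is not skopt's implementation; the helper names are hypothetical.

```python
import math
import random


def sample_uniform(low, high, rng):
    """Every value in [low, high] is equally likely."""
    return rng.uniform(low, high)


def sample_log_uniform(low, high, rng):
    """Sample uniformly in log space, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1234)
uniform_draws = [sample_uniform(0.01, 0.5, rng) for _ in range(10000)]
log_draws = [sample_log_uniform(0.01, 0.5, rng) for _ in range(10000)]

# Share of draws that land below 0.1 under each prior
share_small_uniform = sum(x < 0.1 for x in uniform_draws) / len(uniform_draws)
share_small_log = sum(x < 0.1 for x in log_draws) / len(log_draws)
print(share_small_uniform, share_small_log)
```

With these bounds, a uniform prior is expected to put about 18% of the draws below 0.1, while a log-uniform prior puts about 59% there. That is why log-uniform is a common choice for parameters such as the learning rate, where small values deserve more attention.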

There is no support for nested search spaces that account for situations where some combinations of hyperparameters are simply invalid. Such support really comes in handy sometimes.

Optimization methods

There are four optimization algorithms to try.

dummy_minimize

You can run a simple random search over the parameters. Nothing fancy here but it is useful to have this option within the same API to compare if needed.
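Conceptually, random search just evaluates the objective at independently sampled points and keeps the best one. The sketch below shows that idea with the standard library only. It is not how dummy_minimize is implemented, and random_search, toy_objective, and BOUNDS are hypothetical names used for illustration.

```python
import random


def random_search(objective, bounds, n_calls, seed=0):
    """Evaluate `objective` at `n_calls` random points and return the best one."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    history = []
    for _ in range(n_calls):
        # Draw every parameter independently and uniformly from its range
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        value = objective(params)
        history.append(value)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value, history


def toy_objective(params):
    # Stand-in for -1.0 * train_evaluate(...); minimum at learning_rate=0.1, feature_fraction=0.5
    return (params["learning_rate"] - 0.1) ** 2 + (params["feature_fraction"] - 0.5) ** 2


BOUNDS = {"learning_rate": (0.01, 0.5), "feature_fraction": (0.1, 1.0)}
best_params, best_value, history = random_search(toy_objective, BOUNDS, n_calls=100)
print(best_params, best_value)
```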

forest_minimize and gbrt_minimize

Both of those methods, as well as the one in the next section, are examples of Bayesian Hyperparameter Optimization, also known as Sequential Model-Based Optimization (SMBO). The idea behind this approach is to estimate the user-defined objective function with a random forest, extra trees, or gradient boosted trees regressor.

After each run of hyperparameters on the objective function, the algorithm makes an educated guess which set of hyperparameters is most likely to improve the score and should be tried in the next run. It is done by getting regressor predictions on many points (hyperparameter sets) and choosing the point that is the best guess based on the so-called acquisition function.

There are quite a few acquisition function options to choose from:

EI and PI: negative expected improvement and negative probability of improvement. If you choose one of those, you should tweak the xi parameter as well. Basically, when your algorithm is looking for the next set of hyperparameters, xi sets how large an expected improvement has to be before a point is worth trying on the actual objective function. The higher the value, the bigger the improvement (or probability of improvement) your regressor has to expect.
LCB: lower confidence bound. In this case, you want to choose your next point carefully, limiting the downside risk. You can decide how much risk you want to take at each run. By making the kappa parameter small you lean toward exploitation of what you know; by making it larger you lean toward exploration of the search space.

There are also the EIps and PIps options, which take into account both the score produced by the objective function and the execution time, but I haven’t tried them.
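To get a feel for what kappa and xi do, here is a minimal sketch of LCB and EI written from their textbook definitions for a minimization problem. It assumes the surrogate model returns a mean and a standard deviation for each candidate point. The function names and candidate values are hypothetical, and skopt works with the negated EI and PI so that every acquisition function can be minimized, whereas this sketch returns EI as a positive quantity.

```python
import math


def lower_confidence_bound(mu, sigma, kappa=1.96):
    """LCB for minimization: predicted mean minus kappa standard deviations."""
    return mu - kappa * sigma


def expected_improvement(mu, sigma, y_best, xi=0.01):
    """Expected improvement over y_best (minimization), offset by xi."""
    improvement = y_best - mu - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)
    z = improvement / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return improvement * cdf + sigma * pdf


# Hypothetical surrogate predictions (mean, std) of the negative AUC at two points
CANDIDATES = {"well_explored": (-0.850, 0.001), "uncertain": (-0.845, 0.010)}


def pick_by_lcb(kappa):
    """Return the candidate with the lowest lower confidence bound."""
    return min(CANDIDATES, key=lambda name: lower_confidence_bound(*CANDIDATES[name], kappa=kappa))


print(pick_by_lcb(kappa=0.1))   # small kappa favours the better predicted mean
print(pick_by_lcb(kappa=1.96))  # large kappa favours the more uncertain point
```

In this toy setup a small kappa picks the point with the better predicted mean (exploitation), and a large kappa picks the point the model is less sure about (exploration). Raising xi lowers the expected improvement of every candidate, so fewer points look worth trying.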

gp_minimize

Instead of using the tree regressors, the objective function is approximated by the Gaussian process.

From a user perspective, the added value of this method is that instead of deciding beforehand on one of the acquisition functions, you can let the algorithm select the best one of EI, PI, and LCB at every iteration. Just set the acquisition function to gp_hedge and try it out.

One more thing to consider is the optimization method used at each iteration, sampling or lbfgs. For both of them, the acquisition function is calculated over a number (n_points) of randomly selected points in the search space. If you go with sampling, the point with the lowest value is selected. If you choose lbfgs, the algorithm takes some number (n_restarts_optimizer) of the best randomly tried points and runs the lbfgs optimization starting at each of them. So basically, the lbfgs method is an improvement over the sampling method as long as you don’t care about the execution time.

Persisting and restarting

There are skopt.dump and skopt.load functions that deal with saving and loading the results object:

results = skopt.forest_minimize(objective, SPACE, **HPO_PARAMS)
skopt.dump(results, 'artifacts/results.pkl')
old_results = skopt.load('artifacts/results.pkl')

You can restart training from the saved results via x0 and y0 arguments. For example:

results = skopt.forest_minimize(objective, SPACE,
                                x0=old_results.x_iters,
                                y0=old_results.func_vals,
                                **HPO_PARAMS)

Simple and works with no problems.

Overall, there are a lot of options for tuning (hyper)hyperparameters and you can control the training with callbacks. On the flip side, you can only search through a flat space and you need to deal with those forbidden combinations of parameters on your own.

7 / 10

Documentation

Piece of art.

It’s extensive with a lot of examples, docstrings for all the functions and methods. It took me just a few minutes to get into the groove of things and get things off the ground.

Go to the documentation webpage to see for yourself.

It could be a bit better, with more explanations in the docstrings, but the overall experience is just great.

9 / 10

Visualizations

This is one of my favorite features of this library. There are three plotting utilities in the skopt.plots module, that I really love:

plot_convergence - it visualizes the progress of your optimization by showing the best-to-date result at each iteration.

import skopt.plots

skopt.plots.plot_convergence(results)

What is cool about it, is that you can compare the progress of many strategies by simply passing a list of results objects or a list of (name, results) tuples.

results = [('random_results', random_results),
           ('forest_results', forest_results),
           ('gbrt_results', gbrt_results),
           ('gp_results', gp_results)]

skopt.plots.plot_convergence(*results)
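The quantity plot_convergence draws, the best result to date at each iteration, is simply a running minimum of the objective values (the func_vals stored in the results object). A minimal sketch with made-up values, in case you want to compute it yourself; best_so_far is a hypothetical helper name:

```python
from itertools import accumulate


def best_so_far(func_vals):
    """Running minimum of the objective values, one entry per iteration."""
    return list(accumulate(func_vals, min))


# Hypothetical objective values (negative AUC) from five iterations
func_vals = [-0.840, -0.851, -0.846, -0.853, -0.850]
print(best_so_far(func_vals))
```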

plot_evaluations - this plot lets you see the evolution of the search. For each hyperparameter, we see the histogram of explored values. For each pair of hyperparameters, the scatter plot of sampled values is plotted with the evolution represented by color, from blue to yellow.

For example, when we look at the random search strategy we can see there is no evolution. It is just randomly searched:

But for the forest_minimize strategy, we can clearly see that it converges to certain parts of the space, which it explores more heavily.

plot_objective - it lets you gain intuition into the score sensitivity with respect to hyperparameters. You can decide which parts of the space may require a more fine-grained search and which hyperparameters barely affect the score and can potentially be dropped from the search.

Overall, visualizations are incredibly good.

10 / 10

Speed and parallelization

Every optimization function comes with the n_jobs parameter, which is passed to the base_estimator. That means, even though the optimization runs go sequentially you can speed up each run by utilizing more resources.

I haven’t run a proper timing benchmark for all the optimization methods and n_jobs. However, since I kept track of the total execution time for all experiments I decided to present average times for everything I ran:

Obviously, the random search method was the fastest, as it doesn’t need any calculations between the runs. It was followed by the gradient boosted trees regressor and random forest methods. Optimization via the Gaussian process was the slowest by a large margin but I only tested the gp_hedge acquisition function, so that might have been the reason.

Because there is no option to distribute it on the run level, over a cluster of workers, I have to take a few points away.

6 / 10

Experimental results

As an example let’s tweak the hyperparameters of the lightGBM model on a tabular, binary classification problem. If you want to use the same dataset as I did you should:

download it from kaggle
use the first 10000 rows from the train.csv file

To make the training quick I fixed the number of boosting rounds to 300 with a 30 round early stopping.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

NUM_BOOST_ROUND = 300
EARLY_STOPPING_ROUNDS = 30

def train_evaluate(X, y, params):
    X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                          test_size=0.2,
                                                          random_state=1234)

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)

    model = lgb.train(params, train_data,
                      num_boost_round=NUM_BOOST_ROUND,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=[valid_data],
                      valid_names=['valid'])

    score = model.best_score['valid']['auc']
    return score

All the training and evaluation logic is put inside the train_evaluate function. We can treat it as a black box that takes the data and hyperparameter set and produces the AUC evaluation score.

You can turn every script that takes parameters as inputs and outputs the score into such train_evaluate. Once that is done you can treat it as a black box and tune your parameters.

I show how to do that step-by-step in a different post “How to Do Hyperparameter Tuning on Any Python Script in 3 Easy Steps”.

To train a model on a set of parameters you can run something like this:

import pandas as pd

N_ROWS=10000
TRAIN_PATH = '/mnt/ml-team/minerva/open-solutions/santander/data/train.csv'

data = pd.read_csv(TRAIN_PATH, nrows=N_ROWS)
X = data.drop(['ID_code', 'target'], axis=1)
y = data['target']

MODEL_PARAMS = {'boosting': 'gbdt',
                'objective':'binary',
                'metric': 'auc',
                'num_threads': 12,
                'learning_rate': 0.3,
                }

score = train_evaluate(X, y, MODEL_PARAMS)
print('Validation AUC: {}'.format(score))

For this study, I will try to find the best parameters within a 100-run budget.

If you search randomly over hyperparameters you can get 0.8464, as I showed in this ML experiment.

To find the best model I tried various configurations of optimizers and (hyper)hyperparameters from the Options, methods, and (hyper)parameters section. You can also check the example skopt parameter tuning script here.

In total I ran 87 experiments but let’s take a look at the top few:

Experiments for different skopt configurations

If you want to explore all of those experiments in more detail you can simply go to the experiment dashboard.

The forest_minimize method was the clear winner but to get good results, it was crucial to tweak the (hyper)hyperparameters a bit. For the LCB acquisition function, a lower value of kappa (exploitation) was better. Let’s take a look at the evaluations plot for this experiment:

It exploited the low num_leaves subspace but it was very exploratory for the max_depth and feature_fraction. It’s important to mention that those plots differed a lot from experiment to experiment. It makes you wonder how easy it is to get stuck in a local minimum.

However, the best result was achieved with the EI acquisition function. Again, tweaking the xi parameter was needed. Looking at the objective plot of this experiment:

I get the feeling that by dropping some insensitive dimensions (subsample, max_depth) and running a more fine-grained search on the other hyperparameters I could have gotten a bit better result.

It was a surprise to me that the results for gp_minimize were significantly worse when I used the lbfgs optimization of the acquisition function. They couldn’t beat random search. Changing the optimization to sampling got a better AUC but was still worse than forest_minimize and gbrt_minimize. Go to the Gaussian process experiments to see for yourself.

Overall, the highest score I could squeeze out was 0.8566, which was better than random search’s 0.8464 by ~0.01. I will translate that to 10 points (0.01*1000).

10/10

Conclusions

Let’s take a look at the results for all criteria:

Overall, I really like Scikit-Optimize. It is a pleasure to use, gives you great results, and useful visualizations. Also, it has a lot of options to tweak with strong documentation to guide you through it.

On the flip side, it is difficult, if not impossible, to parallelize it run-wise and distribute over a cluster of machines. I think going forward, this is going to be more important and can make this library not suitable for some applications.

My recommendation is to use it if you don’t care that much about the speed and parallelization but look elsewhere if those are crucial to your project.

Machine Learning Experiment Management: How to Organize Your Model Development Process

Jakub Czakon — Thu, 21 Jul 2022 09:23:05 +0000

Machine learning or deep learning experiment tracking is a key factor in delivering successful outcomes. There’s no way you will succeed without it.

Let me share a story that I’ve heard too many times.

“So I was developing a machine learning model with my team and within a few weeks of extensive experimentation, we got promising results…

…unfortunately, we couldn’t tell exactly what performed best because we didn’t track feature versions, didn’t record the parameters, and used different environments to run our models…

…after a few weeks, we weren’t even sure what we had actually tried, so we needed to rerun pretty much everything.”

Sounds familiar?

In this article, I will show you how you can keep track of your machine learning experiments and organize your model development efforts so that stories like that will never happen to you.

What is Machine Learning experiment management?

Experiment management in the context of machine learning is a process of tracking experiment metadata like:

code versions,
data versions,
hyperparameters,
environment,
metrics,

organizing them in a meaningful way and making them available to access and collaborate on within your organization.

In the next sections, you will see exactly what that means with examples and implementations.

How to keep track of Machine Learning experimentation

What I mean by tracking is collecting all the metainformation about your machine learning experiments that is needed to:

share your results and insights with the team (and you in the future),
reproduce results of the machine learning experiments,
keep your results, that take a long time to generate, safe.

Let’s go through all the pieces of an experiment that I believe should be recorded, one by one.

Code version control for data science

Okay, in 2022 I think pretty much everyone working with code knows about version control. Failing to keep track of your code is a big (but obvious and easy-to-fix) oversight.

Should we just proceed to the next section? Not so fast.

Problem 1: Jupyter notebook version control

A large part of data science development is happening in Jupyter notebooks which are more than just code. Fortunately, there are tools that help with notebook versioning and diffing. Some tools that I know:

nbconvert (.ipynb -> .py conversion)
nbdime (diffing)
jupytext (conversion+versioning)
neptune-notebooks (versioning+diffing+sharing)

Once you have your notebook versioned, I would suggest going the extra mile and making sure that it runs top to bottom. For that you can use jupytext or nbconvert:

jupyter nbconvert --to script train_model.ipynb;
python train_model.py

Problem 2: Experiments on dirty commits

Data science people tend to not follow the best practices of software development. You can always find someone (me included) who would ask:

“But how about tracking code in-between commits? What if someone runs an experiment without committing the code?”

One option is to explicitly forbid running code on dirty commits (commits that contain modified or untracked files). Another option is to give users an additional safety net and snapshot code whenever they run an experiment.
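The first option can be enforced with a small guard at the top of the training script. The sketch below assumes git is installed and the script runs inside a repository. The helper names are hypothetical, and it relies on the `git status --porcelain` output format, where each line is a two-character status, a space, and a path.

```python
import subprocess


def parse_porcelain(status_output):
    """Return the paths listed by `git status --porcelain` (modified or untracked)."""
    return [line[3:] for line in status_output.splitlines() if line.strip()]


def ensure_clean_worktree(repo_dir="."):
    """Raise if the git working tree in `repo_dir` has uncommitted changes."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    dirty_paths = parse_porcelain(status)
    if dirty_paths:
        raise RuntimeError("Refusing to run on a dirty commit: {}".format(dirty_paths))


# Example of the parsing step on hypothetical `git status --porcelain` output
print(parse_porcelain(" M train.py\n?? notes.txt\n"))
```

Calling ensure_clean_worktree() before the experiment starts stops the run early. For the second option, the same list of paths could instead be used to decide which files to snapshot alongside the experiment.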

Tracking hyperparameters

Most decent machine learning models and pipelines have tuned non-default hyperparameters. Those could be learning rate, number of trees or a missing value imputation method. Failing to keep track of hyperparameters can result in weeks of wasted time looking for them or retraining models.

The good thing is that keeping track of hyperparameters can be really simple. Let’s start with the way people tend to define them and then we’ll proceed to hyperparameter tracking:

Config files

Typically a .yaml file that contains all the information that your script needs to run. For example:

data:
    train_path: '/path/to/my/train.csv'
    valid_path: '/path/to/my/valid.csv'

model:
    objective: 'binary'
    metric: 'auc'
    learning_rate: 0.1
    num_boost_round: 200
    num_leaves: 60
    feature_fraction: 0.2

Command line + argparse

You simply pass your parameters to your script as arguments:

python train_evaluate.py \
    --train_path '/path/to/my/train.csv' \
    --valid_path '/path/to/my/valid.csv' \
    --objective 'binary' \
    --metric 'auc' \
    --learning_rate 0.1 \
    --num_boost_round 200 \
    --num_leaves 60 \
    --feature_fraction 0.2
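On the script side, those arguments have to be declared with argparse. A minimal sketch of what the parser in train_evaluate.py could look like; the defaults and types are assumptions chosen to match the values in the command, not something the script is known to use.

```python
import argparse


def build_parser():
    """Declare the command line arguments with explicit types and defaults."""
    parser = argparse.ArgumentParser(description="Train and evaluate a model")
    parser.add_argument("--train_path", required=True)
    parser.add_argument("--valid_path", required=True)
    parser.add_argument("--objective", default="binary")
    parser.add_argument("--metric", default="auc")
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--num_boost_round", type=int, default=200)
    parser.add_argument("--num_leaves", type=int, default=60)
    parser.add_argument("--feature_fraction", type=float, default=0.2)
    return parser


# Parse an explicit list here; in a script you would call parse_args() with no arguments
args = build_parser().parse_args(
    ["--train_path", "/path/to/my/train.csv",
     "--valid_path", "/path/to/my/valid.csv",
     "--learning_rate", "0.05"]
)
params = vars(args)  # plain dictionary, convenient for logging
print(params)
```

Declaring types matters here: without type=float, learning_rate would arrive as the string '0.05' and be logged (and possibly used) as text.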

Parameters dictionary in main.py

You put all of your parameters in a dictionary inside your script:

TRAIN_PATH = '/path/to/my/train.csv'
VALID_PATH = '/path/to/my/valid.csv'

PARAMS = {'objective': 'binary',
          'metric': 'auc',
          'learning_rate': 0.1,
          'num_boost_round': 200,
          'num_leaves': 60,
          'feature_fraction': 0.2}

Hydra

Hydra is a configuration management framework developed by Facebook Open Source.

The key ideas behind it are:

Dynamically create a hierarchical configuration by composition,
Override it when needed through the command line,
Pass new parameters (not present in the config) via CLI – they will be handled for you

Hydra gives you the ability to prepare and override complex configuration setups (including config groups and hierarchies), while keeping track of any overridden values.

To understand how it works, let us take a simple example of a config.yaml file:

project: ORGANIZATION/home-credit
name: home-credit-default-risk
parameters:
  # Data preparation
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: True
  shuffle: 1
  # Random forest
  rf__n_estimators: 2000
  rf__criterion: gini
  rf__max_depth: 40
  rf__class_weight: balanced

This configuration can be used in an application by simply calling the hydra decorator:

import hydra
from omegaconf import DictConfig


@hydra.main(config_path='config.yaml')
def train(cfg: DictConfig):
    print(cfg.pretty())  # this prints config in a reader friendly way
    print(cfg.parameters.rf__n_estimators)  # this is how to access single value from the config


if __name__ == "__main__":
    train()

Running the above script will produce the below output:

name: home-credit-default-risk
parameters:
  n_cv_splits: 5
  rf__class_weight: balanced
  rf__criterion: gini
  rf__max_depth: 40
  rf__n_estimators: 2000
  shuffle: 1
  stratified_cv: true
  validation_size: 0.2
project: ORGANIZATION/home-credit
2000

To override existing parameters or add new parameters, simply pass them as CLI arguments:

python hydra-main.py parameters.rf__n_estimators=1500 parameters.rf__max_features=0.2

Note: Strict mode has to be turned off to add new parameters:

@hydra.main(config_path='config.yaml', strict=False)

One drawback of Hydra is that to share the configuration or track it across experiments, you have to manually save the config.yaml file.

Hydra is in active development, be sure to check their latest docs.

Magic numbers all over the place

Whenever you need to pass a parameter you simply pass a value of that parameter.

...
train = pd.read_csv('/path/to/my/train.csv')

model = Model(objective='binary',
              metric='auc',
              learning_rate=0.1,
              num_boost_round=200,
              num_leaves=60,
              feature_fraction=0.2)
model.fit(train)

valid = pd.read_csv('/path/to/my/valid.csv')
model.evaluate(valid)

We all do that sometimes but it is not a great idea especially if someone will need to take over your work.

Ok, so I do like .yaml configs and passing arguments from the command line (options 1 and 2), but anything other than magic numbers is fine. What is important is that you log those parameters for every experiment.

If you decide to pass all parameters as the script arguments make sure to log them somewhere. It is easy to forget, so using an experiment management tool that does this automatically can save you here.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--number_trees')
parser.add_argument('--learning_rate')
args = parser.parse_args()

# experiment_manager stands for whichever experiment tracking tool you use
experiment_manager.create_experiment(params=vars(args))
...
# experiment logic
...

There is nothing so painful as having a perfect script on a perfect data version producing perfect metrics, only to discover that you don’t remember which hyperparameters were passed as arguments.

neptune.ai

Neptune makes it very easy to keep track of hyperparameters across runs by giving various options:

Log hyperparameters individually:

run["parameters/epoch_nr"] = 5
run["parameters/batch_size"] = 32
run["parameters/dense"] = 512
run["parameters/optimizer"] = "sgd"
run["parameters/metrics"] = ["accuracy", "mae"]
run["parameters/activation"] = "relu"

Log all of them together as a dictionary:

# Define parameters
params = {
	"epoch_nr": 5,
	"batch_size": 32,
	"dense": 512,
	"optimizer": "sgd",
	"metrics": ["accuracy", "binary_accuracy"],
	"activation": "relu",
}

# Pass parameters
run["parameters"] = params

In both the above cases, the parameters are logged under the All Metadata section of the run UI.

You can also upload configuration files (like the config.yaml file used for Hydra) directly:

run["config_file"].upload("config.yaml")

This file will be logged under the All Metadata section of the run UI.

Data versioning

In real-life projects, data is changing over time. Some typical situations include:

new images are added,
labels are improved,
mislabeled/wrong data is removed,
new data tables are discovered,
new features are engineered and processed,
validation and testing datasets change to reflect the production environment.

Whenever your data changes, the output of your analysis, report or experiment results will likely change even though the code and environment did not. That is why to make sure you are comparing apples to apples you need to keep track of your data versions.

Having almost everything versioned and getting different results can be extremely frustrating, and can mean a lot of time (and money) in wasted effort. The sad part is that you can do little about it afterward. So again, keep your experiment data versioned.

For the vast majority of use cases whenever new data comes in you can save it in a new location and log this location and a hash of the data. Even if the data is very large, for example when dealing with images, you can create a smaller metadata file with image paths and labels and track changes of that file.

A wise man once told me:

“Storage is cheap, training a model for 2 weeks on an 8-GPU node is not.”

And if you think about it, logging this information doesn’t have to be rocket science.

exp.set_property('data_path', 'DATASET_PATH')
exp.set_property('data_version', md5_hash('DATASET_PATH'))
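The md5_hash helper in the snippet above is not defined there. A minimal sketch of such a helper using only the standard library follows. It reads the file in chunks, so it also works for datasets that do not fit in memory. The throwaway CSV file is a stand-in for a real dataset.

```python
import hashlib
import os
import tempfile


def md5_hash(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small throwaway file standing in for the dataset
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"ID_code,target\ntrain_0,0\n")
data_version = md5_hash(tmp.name)
os.remove(tmp.name)
print(data_version)
```

Any change to the file contents, even a single relabeled row, produces a different hash, which is what makes it usable as a data version.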

You can calculate and log the hash yourself, use a simple data versioning extension, or outsource hashing to a full-blown data versioning tool like DVC that gives you greater versioning capabilities.

Tracking model performance metrics

I have never found myself in a situation where I thought that I had logged too many metrics for my experiment, have you?

In a real-world project, the metrics you care about can change due to new discoveries or changing specifications so logging more metrics can actually save you some time and trouble in the future.

Either way, my suggestion is:

“Log metrics, log them all”

Typically, metrics are as simple as a single number

exp.send_metric('train_auc', train_auc)
exp.send_metric('valid_auc', valid_auc)

but I like to think of it as something a bit broader. To understand if your model has improved, you may want to take a look at a chart, confusion matrix or distribution of predictions. Those, in my view, are still metrics because they help you measure the performance of your experiment.

exp.send_image('diagnostics', 'confusion_matrix.png')
exp.send_image('diagnostics', 'roc_auc.png')
exp.send_image('diagnostics', 'prediction_dist.png')

Tracking metrics on both training and validation datasets can help you assess the risk of the model not performing well in production. The smaller the gap, the lower the risk. A great resource is this Kaggle Days talk by Jean-François Puget.

Moreover, if you are working with data collected at different timestamps you can assess model performance decay and suggest a proper model retraining schema. Simply track metrics at different timeframes of your validation data and see how the performance drops.
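As a rough illustration of tracking a metric per timeframe, the sketch below groups validation records by month and computes accuracy for each month. The records are made up and accuracy_by_period is a hypothetical helper. In practice you would plug in your own metric and log each value to your experiment tracker.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """Group (period, y_true, y_pred) records by period and compute accuracy for each."""
    hits, totals = defaultdict(int), defaultdict(int)
    for period, y_true, y_pred in records:
        totals[period] += 1
        hits[period] += int(y_true == y_pred)
    return {period: hits[period] / totals[period] for period in sorted(totals)}


# Hypothetical validation records: (month, true label, predicted label)
records = [
    ("2022-01", 1, 1), ("2022-01", 0, 0), ("2022-01", 1, 1), ("2022-01", 0, 0),
    ("2022-02", 1, 1), ("2022-02", 0, 0), ("2022-02", 1, 0), ("2022-02", 0, 0),
    ("2022-03", 1, 0), ("2022-03", 0, 1), ("2022-03", 1, 1), ("2022-03", 0, 0),
]
print(accuracy_by_period(records))
```

A steady drop from one period to the next, as in this toy data, is the signal that the model is decaying and that retraining should be scheduled accordingly.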


Versioning experiment environment

The majority of problems with environment versioning can be summarized by the infamous quote:

“I don’t understand, it worked on my machine.”

One approach that helps solve this issue can be called “environment as code” where the environment can be created by executing instructions (bash/yaml/docker) step-by-step. By embracing this approach you can switch from versioning the environment to versioning environment set-up code which we know how to do.

There are a few options that I know to be used in practice (this is by no means a full list of approaches).

Docker images

This is the preferred option and there are a lot of resources on the subject. One that I particularly like is the “Learn Enough Docker to be useful” series by Jeff Hale. In a nutshell, you define the Dockerfile with some instructions.

# Use a miniconda3 as base image
FROM continuumio/miniconda3

# Installation of jupyterlab
RUN pip install jupyterlab==0.35.6 && \
    pip install jupyterlab-server==0.2.0 && \
    conda install -c conda-forge nodejs

# Installation of Neptune and enabling neptune extension
RUN pip install neptune && \
    pip install neptune-notebooks && \
    jupyter labextension install neptune-notebooks

# Setting up Neptune API token as env variable
ARG NEPTUNE_API_TOKEN
ENV NEPTUNE_API_TOKEN=$NEPTUNE_API_TOKEN

# Adding current directory to container
ADD . /mnt/workdir
WORKDIR /mnt/workdir

You build your environment from those instructions:

docker build -t jupyterlab \
    --build-arg NEPTUNE_API_TOKEN=$NEPTUNE_API_TOKEN .

And you can run scripts in the environment by running:

docker run \
    -p 8888:8888 \
    jupyterlab:latest \
    /opt/conda/bin/jupyter lab \
    --allow-root \
    --ip=0.0.0.0 \
    --port=8888

Conda Environments

It’s a simpler option, and in many cases it is enough to manage your environments with no problems. It doesn’t give you as many options or guarantees as Docker does, but it can be enough for your use case. The environment can be defined as a .yaml configuration file just like this one:

name: salt

dependencies:
   - pip=19.1.1
   - python=3.6.8
   - psutil
   - matplotlib
   - scikit-image
   - pip:
      - neptune-client==0.3.0
      - neptune-contrib==0.9.2
      - imgaug==0.2.5
      - opencv_python==3.4.0.12
      - torch==0.3.1
      - torchvision==0.2.0
      - pretrainedmodels==0.7.0
      - pandas==0.24.2
      - numpy==1.16.4
      - cython==0.28.2
      - pycocotools==2.0.0

You can create the conda environment by running:

conda env create -f environment.yaml

What is pretty cool is that you can always dump the state of your environment to such config by running:

conda env export > environment.yaml

Simple and gets the job done.

Makefile

You can always define all your bash instructions explicitly in the Makefile. For example:

git clone git@github.com:neptune-ml/open-solution-mapping-challenge.git
cd open-solution-mapping-challenge

pip install -r requirements.txt

mkdir data
cd data
curl -O https://www.kaggle.com/c/imagenet-object-localization-challenge/data/LOC_synset_mapping.txt

and set it up by running:

source Makefile

It is often difficult to read those files and you are giving up a ton of additional features of conda and/or docker but it doesn’t get much simpler than this.

Now that you have your environment defined as code, make sure to log the environment file for every experiment.

Again, if you are using an experiment manager you can snapshot your code whenever you create a new experiment, even if you forget to git commit:

experiment_manager.create_experiment(upload_source_files=['environment.yaml'])
...
# machine learning magic
...

and have it safely stored in the app.
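If you are not using an experiment manager, a rough equivalent is to store the content of the environment file, plus a hash of it, next to your run parameters. The sketch below is only an illustration under that assumption. The helpers environment_fingerprint and experiment_record are hypothetical names, and the demo writes a throwaway environment.yaml rather than reading a real one.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def environment_fingerprint(path):
    """Return the SHA-256 hex digest of an environment definition file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def experiment_record(env_path, params):
    """Bundle run parameters with the environment file and its fingerprint."""
    env_path = Path(env_path)
    return {
        "params": params,
        "environment_file": env_path.name,
        "environment_sha256": environment_fingerprint(env_path),
        "environment_text": env_path.read_text(),
    }


# Demo with a temporary file; in a project you would point at your own file.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / "environment.yaml"
    env_path.write_text("name: salt\ndependencies:\n  - python=3.6.8\n")
    record = experiment_record(env_path, {"learning_rate": 0.1})
    print(json.dumps(record, indent=2))
```

Two runs with the same fingerprint were defined by the same environment file, which makes "it worked on my machine" much easier to debug.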

Versioning Machine Learning models

You have now trained a model using its optimal hyperparameters and have logged and versioned the data, hyperparameters, and the environment. But what about the model itself? In most cases, training and inference happen in different places (scripts/notebooks), and you need to be able to make the model you’ve trained available for inference somewhere else.

There are two basic ways to do this:

1. Save the model as a binary file

You can export the model as a binary file and load it from the binary file wherever you need to make inferences.

There are multiple ways you can do this. Libraries like PyTorch and Keras have their own save and load methods, while outside of deep learning, pickle remains the most popular way to save and load a model from a file:

import pickle

# To save a model
with open("saved_model.pkl", "wb") as f:
    pickle.dump(trained_model, f)

# To load a model
with open("saved_model.pkl", "rb") as f:
    model = pickle.load(f)

Since the model is saved as a file, you can use file versioning tools like git, or upload the file to experiment trackers like Neptune:

run["trained_model"].upload("saved_model.pkl")

2. Use a model registry

A model registry is a central repository for publishing and accessing models. It is a place where ML developers can push their models to be used by other stakeholders or themselves at a later point in time.
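To give a feel for what a registry keeps track of, here is a toy in-memory sketch. It is not the API of MLflow, Neptune, or any other product. ToyModelRegistry, ModelVersion, the stage names, and the file paths are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    path: str
    stage: str = "Development"  # hypothetical default stage
    metadata: dict = field(default_factory=dict)


class ToyModelRegistry:
    """In-memory stand-in for a model registry (illustration only)."""

    def __init__(self):
        self._models = {}

    def register(self, name, path, **metadata):
        """Add a new version of `name`; versions are numbered from 1."""
        versions = self._models.setdefault(name, [])
        entry = ModelVersion(version=len(versions) + 1, path=path, metadata=metadata)
        versions.append(entry)
        return entry

    def set_stage(self, name, version, stage):
        self._models[name][version - 1].stage = stage

    def latest(self, name, stage):
        """Return the newest version of `name` in the given stage, or None."""
        matches = [v for v in self._models.get(name, []) if v.stage == stage]
        return matches[-1] if matches else None


registry = ToyModelRegistry()
registry.register("fraud-detector", "models/v1.pkl", valid_auc=0.91)
registry.register("fraud-detector", "models/v2.pkl", valid_auc=0.94)
registry.set_stage("fraud-detector", 2, "Production")

prod = registry.latest("fraud-detector", "Production")
print(prod.version, prod.path)
```

The point is the lookup at the end: inference code asks for "the production fraud detector" instead of hard-coding a file path.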

Learn more

ML Model Registry: What It Is, Why It Matters, How to Implement It

Some popular model registries available currently are:

MLflow


The MLflow Model Registry is one of the few open-source model registries available in the market today. You can decide to manage this on your infrastructure or use a fully-managed implementation on a platform like Databricks, or in integrated environments like Amazon SageMaker and Azure Machine Learning, where MLflow is supported via the MLflow client.

In both SageMaker and Azure ML, you can log models using MLflow APIs and manage them through the platform’s proprietary infrastructure. These integrations provide compatibility with the MLflow client, making it easy to work across teams with services within AWS and Azure platforms. On the other hand, keep in mind that some open-source MLflow features may be unavailable or deprecated to favor in-house solutions.

MLflow provides:

Annotation and description tools for tagging models, providing documentation and model information such as the date the model was registered, modification history of the registered model, the model owner, stage, version, and so on;
Model versioning to automatically keep track of versions for registered models when updated;
API integration to serve machine learning models as RESTful APIs for online testing, dashboard updates, etc;
CI/CD workflow integration to record stage transitions, request, review, and approve changes as part of CI/CD pipelines for better control and governance;
Model stages to assign preset or custom stages, like “Staging” and “Production”, to each model version to represent the lifecycle of a model;
Promotion schemes to easily transition models across different lifecycle stages.

neptune.ai

Neptune is primarily an experiment tracker, but it provides model registry functionality to a great extent. You can log, store, and organize your model metadata to have your production-ready models at hand.

Neptune lets you:

Track models and model versions, along with the associated metadata. You can version model code, images, datasets, Git info, and notebooks.
Filter and sort the versioned data easily.
Manage model stages using tags.
Query and download any stored model files and metadata.
And it helps your team to collaborate on experiments by providing persistent links to the UI.

How to organize your model development process?

As much as I think tracking experimentation and ensuring the reproducibility of your work is important, it is just a part of the puzzle. Once you have tracked hundreds of experiment runs, you will quickly face new problems:

how to search through and visualize all of those experiments,
how to organize them into something that you and your colleagues can digest,
how to make this data shareable and accessible inside your team/organization?

This is where experiment management tools really come in handy. They let you:

filter/sort/tag/group experiments,
visualize/compare experiment runs,
share (app and programmatic query API) experiment results and metadata.

For example, by sending a persistent URL, I can share a comparison of machine learning experiments with all the additional information available.

With that, you and all the people on your team know exactly what is happening when it comes to model development. It makes it easy to track the progress, discuss problems, and discover new improvement ideas.

Working in creative iterations

Tools like that are a big help and a huge improvement from spreadsheets and notes. However, what I believe can take your machine learning projects to the next level is a focused experimentation methodology that I call creative iterations.

Check also

Best Tools to Manage Machine Learning Projects
Data Science Project Management

I’d like to start with some pseudocode and explain it later:

time, budget, business_goal = business_specification()

creative_idea = initial_research(business_goal)
best_metrics, best_solution = None, None

while time and budget and not business_goal:
   solution = develop(creative_idea)
   metrics = evaluate(solution, validation_data)
   if best_metrics is None or metrics > best_metrics:
      best_metrics = metrics
      best_solution = solution
   creative_idea = explore_results(best_solution)

   time.update()
   budget.update()

In every project, there is a phase where the business_specification is created that usually entails a timeframe, budget, and goal of the machine learning project. When I say goal, I mean a set of KPIs, business metrics, or if you are super lucky, machine learning metrics. At this stage, it is very important to manage business expectations but it’s a story for another day. If you are interested in those things I suggest you take a look at some articles by Cassie Kozyrkov, for instance, this one.

Assuming that you and your team know what the business goal is, you can do initial_research and cook up a baseline approach, a first creative_idea. Then you develop it and come up with a solution, which you need to evaluate to get your first set of metrics. Those, as mentioned before, don’t have to be simple numbers (and often are not) but could be charts, reports, or user study results. Now you should study your solution and metrics, and explore_results.

It may be here where your project will end because:

your first solution is good enough to satisfy business needs,
you can reasonably expect that there is no way to reach business goals within the previously assumed time and budget,
you discover that there is a low-hanging fruit problem somewhere close and your team should focus their efforts there.

If none of the above apply, you list all the underperforming parts of your solution and figure out which ones could be improved and what creative_ideas can get you there. Once you have that list, you need to prioritize them based on expected goal improvements and budget. If you are wondering how you can estimate those improvements, the answer is simple: results exploration.

You have probably noticed that results exploration comes up a lot. That’s because it is so very important that it deserves its own section.

Model results exploration

This is an extremely important part of the process. You need to understand thoroughly where the current approach fails, how far you are from your goal in terms of time and budget, and what risks are associated with using your approach in production. In reality, this part is far from easy, but mastering it is extremely valuable because:

it leads to business problem understanding,
it leads to focusing on the problems that matter and saves a lot of time and effort for the team and organization,
it leads to discovering new business insights and project ideas.

Some popular model interpretation tools currently used are:

SHAP

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

Read how to use SHAP in their docs.
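To build some intuition for the Shapley values behind SHAP, here is a brute-force sketch that averages each feature's marginal contribution over all feature orderings. This is not how the shap library computes things (it relies on much faster approximations and model-specific algorithms). The toy_value function and the feature names are made up for illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def toy_value(coalition):
    """Hypothetical model output given the set of 'known' features."""
    v = 0.0
    if "age" in coalition:
        v += 10.0
    if "income" in coalition:
        v += 20.0
    if "age" in coalition and "income" in coalition:
        v += 6.0  # interaction effect, split equally between the two features
    return v


phi = shapley_values(["age", "income", "city"], toy_value)
print(phi)
```

The contributions add up to the full model output (36 in this toy case), the interaction is shared between the two interacting features, and a feature that never changes the output gets zero credit.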

LIME

Local interpretable model-agnostic explanations (LIME) is a paper in which the authors propose a concrete implementation of local surrogate models. Surrogate models are trained to approximate the predictions of the underlying black box model. Instead of training a global surrogate model, LIME focuses on training local surrogate models to explain individual predictions. The current Python implementation supports tabular, text, and image classifiers.

treeinterpreter

This is a package for interpreting scikit-learn’s decision tree and random forest predictions. Allows decomposing each prediction into bias and feature contribution components. Learn usage here.

Some good resources I found on the subject are:

“Understanding and diagnosing your machine-learning models” PyData talk by Gael Varoquaux

“Creating correct and capable classifiers” PyData talk by Ian Ozsvald

Using the ‘What-If Tool’ to investigate Machine Learning models article by Parul Pandey

Diving deeply into results exploration is a story for another day and another blog post, but the key takeaway is that investing your time in understanding your current solution can be extremely beneficial for your business.

Interpretable Machine Learning book by Christoph Molnar
ML Model Interpretation Tools blog by Abhishek Jha

Final thoughts

In this article, I explained:

what experiment management is,
how organizing your model development process improves your workflow.

For me, adding experiment management tools to my “standard” software development best practices was an aha moment that made my machine learning projects more likely to succeed. I think that if you give it a go, you will feel the same.

F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?

Jakub Czakon — Thu, 21 Jul 2022 09:21:10 +0000

PR AUC and F1 Score are very robust evaluation metrics that work great for many classification problems, but from my experience, the most commonly used metrics are accuracy and ROC AUC. Are they better? Not really. As with the famous “AUC vs. accuracy” discussion, there are real benefits to using both. The big question is when.

There are many questions that you may have right now:

When is accuracy a better evaluation metric than ROC AUC?
What is the F1 score good for?
What is the PR curve, and how do you actually use it?
If my dataset is highly imbalanced, should I use ROC AUC or PR AUC?

As always, it depends, but understanding the trade-offs between different metrics is crucial when it comes to making the correct decision.

In this blog post, I will:

Talk about some of the most common binary classification metrics, like F1 score, ROC AUC, PR AUC, and accuracy.
Compare them using an example binary classification problem.
Tell you what you should consider when deciding to choose one metric over the other (F1 score vs. ROC AUC).

Ok, let’s do this!

Evaluation metrics recap

I will start by introducing each of those classification metrics. Specifically:

What is the definition and intuition behind it?
The non-technical explanation
How to calculate or plot it
When should you use it?

Tip

If you have read my previous blog post, “24 Evaluation Metrics for Binary Classification (And When to Use Them)”, you may want to skip this section and scroll down to the evaluation metrics comparison.

Accuracy

It measures how many observations, both positive and negative, were correctly classified.

You shouldn’t use accuracy on imbalanced problems. There, it is easy to get a high accuracy score by simply classifying all observations as the majority class.

In Python, you can calculate it in the following way:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)

# or simply

accuracy_score(y_true, y_pred_class)

Since the accuracy score is calculated on the predicted classes (not prediction scores), we need to apply a certain threshold before computing it. The obvious choice is the threshold of 0.5, but it can be suboptimal.

Let’s see an example of how accuracy depends on the threshold choice:

Accuracy by threshold

You can use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over the standard 0.5 could bump the score by a tiny bit (0.9686–0.9688), but in other cases, the improvement can be more substantial.
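If you prefer numbers to charts, a threshold sweep can be sketched in a few lines. This is a minimal, dependency-free illustration. The labels, scores, and candidate grid are made up, and accuracy_at and best_threshold are hypothetical helper names.

```python
def accuracy_at(y_true, y_score, threshold):
    """Accuracy after turning scores into classes with `score > threshold`."""
    correct = sum(int((s > threshold) == bool(t)) for t, s in zip(y_true, y_score))
    return correct / len(y_true)


def best_threshold(y_true, y_score, candidates):
    """Return (threshold, accuracy) with the highest accuracy; ties keep the first."""
    scored = ((t, accuracy_at(y_true, y_score, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical labels and predicted positive-class scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.30, 0.55, 0.65, 0.40, 0.80, 0.90]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

threshold, accuracy = best_threshold(y_true, y_score, candidates)
print("best threshold:", threshold, "accuracy:", accuracy)
print("accuracy at 0.5:", accuracy_at(y_true, y_score, 0.5))
```

Keep in mind that a threshold tuned this way should be picked on validation data, not on the test set you report on.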

So, when does it make sense to use it?

When your problem is balanced, using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project.
When every class is equally important to you.

F1 score

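Simply put, it combines precision and recall into one metric by calculating the harmonic mean between those two. It is actually a special case of the more general function F beta:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$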

When choosing beta in your F-beta score, the more you care about recall over precision, the higher beta you should choose. For example, with the F1 score, we care equally about recall and precision; with the F2 score, recall is twice as important to us.

F beta threshold by beta

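With 0 < beta < 1, we care more about precision, so the optimal threshold moves toward higher thresholds. When beta > 1, our optimal threshold moves toward lower thresholds, and when beta = 1, it is somewhere in the middle.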

It can be easily computed by running:

from sklearn.metrics import f1_score

y_pred_class = y_pred_pos > threshold
f1_score(y_true, y_pred_class)

It is important to remember that the F1 score is calculated from precision and recall, which, in turn, are calculated from the predicted classes (not prediction scores).
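To see how beta shifts the balance, here is a small sketch that computes precision, recall, and F beta from predicted classes by hand. The labels and predictions are made up, and the helper names are only for illustration. In practice you would likely use fbeta_score from scikit-learn.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
f_half = f_beta(precision, recall, beta=0.5)
f_one = f_beta(precision, recall, beta=1.0)
f_two = f_beta(precision, recall, beta=2.0)
print(precision, recall, f_half, f_one, f_two)
```

Here precision (2/3) is higher than recall (1/2), so the precision-leaning F0.5 comes out highest and the recall-leaning F2 comes out lowest.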

How should we choose an optimal threshold? Let’s plot the F1 score over all possible thresholds:

F1 score by threshold

We can adjust the threshold to optimize the F1 score. Notice that for both precision and recall, you could get perfect scores by increasing or decreasing the threshold. The good thing is that you can find a sweet spot for the F1 score. As you can see, getting the threshold just right can actually improve your score from 0.8077 to 0.8121.

When should you use it?

Pretty much in every binary classification problem where you care more about the positive class. It is my go-to metric when working on those problems.
It can be easily explained to business stakeholders, which in many cases can be a deciding factor. Always remember that machine learning is just a tool to solve a business problem.

ROC AUC

AUC means “area under the curve.” So, to speak about the ROC AUC score, we need to define the ROC curve first.

It is a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot them on one chart.

Of course, the higher the TPR and the lower the FPR for each threshold, the better, and so classifiers that have curves that are more top-left-side are better.
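The "for every threshold, calculate TPR and FPR" recipe can be sketched directly. This is a naive, dependency-free illustration on made-up data. In practice you would use roc_curve from scikit-learn, which does the same thing far more efficiently.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs, one per distinct threshold, strictest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and predicted positive-class scores.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, y_score)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the ROC curve; it always starts at (0, 0) and ends at (1, 1).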

An extensive discussion of the ROC curve and the ROC AUC score can be found in this article by Tom Fawcett.

ROC curves

We can see a healthy ROC curve pushed towards the top-left side for both positive and negative classes. It is not clear which one performs better across the board, as with FPR < ~0.15 the positive class is higher, and starting from FPR~0.15 the negative class is above.

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve or ROC AUC score. The more top-left your curve is, the higher the area, and hence, the higher the ROC AUC score.

Alternatively, it can be shown that the ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, it is more useful because it tells us that this metric shows how good your model is at ranking predictions. It tells you what the probability is that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.

from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_true, y_pred_pos)
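The ranking interpretation can be checked by brute force: count, over all positive-negative pairs, how often the positive gets the higher score. The sketch below does exactly that on made-up data, counting ties as half a win. It is quadratic in the number of observations, so treat it as an illustration rather than a replacement for roc_auc_score.

```python
def roc_auc_pairwise(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted positive-class scores.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc_pairwise(y_true, y_score)
print(auc)
```

In this toy example, 8 of the 9 positive-negative pairs are ordered correctly (only the positive scored 0.6 falls below the negative scored 0.7), so the score is 8/9.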

You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
You should not use it when your data is heavily imbalanced. This was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: the false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives, then it totally makes sense to use ROC AUC.

PR AUC | Average Precision

Similarly to ROC AUC, in order to define PR AUC, we need to define the precision-recall curve.

It is a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher your curve sits on the y-axis, the better your model’s performance.

You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Obviously, the higher the recall, the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.

Precision-Recall curve

We can see that for the negative class, we maintain high precision and recall almost throughout the entire range of thresholds. For the positive class, precision starts to fall as soon as we recall 0.2 of true positives, and by the time we hit 0.8, it decreases to around 0.7.

Similarly to the ROC AUC score, you can calculate the area under the precision-recall curve (PR AUC) to get one number that describes model performance.

You can also think of PR AUC as the average of precision scores calculated for each recall threshold. You can also adjust this definition to suit your business needs by choosing or clipping recall thresholds if needed.

from sklearn.metrics import average_precision_score

average_precision_score(y_true, y_pred_pos)
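The "average of precision scores" view can be spelled out by hand. The sketch below walks down the predictions sorted by score and averages the precision observed each time a positive is found, which is how recall steps up. It assumes there are no tied scores, and the data is made up for illustration.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive (no tied scores)."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `rank` predictions.
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


# Hypothetical labels and predicted positive-class scores.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ap = average_precision(y_true, y_score)
print(ap)
```

The three positives are found at ranks 1, 2, and 4 with precision 1, 1, and 0.75, so the average is about 0.917.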

When should you use it?

When you want to communicate a precision or recall decision to other stakeholders.
When you want to choose the threshold that fits the business problem.
When your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR), it cares less about the frequent negative class.
When you care more about the positive than the negative class. If you care more about the positive class, and hence PPV and TPR, you should go with the precision-recall curve and PR AUC (average precision).

Evaluation metrics comparison

We will compare the metrics we discussed so far with a use case that’s close to what you might typically see day-to-day as data scientists.

Based on a Kaggle competition, I created an example fraud detection problem:

I selected only 43 features.
I sampled 66000 observations from the original dataset.
I adjusted the fraction of the positive class to 0.09.

We’ll train a bunch of LightGBM classifiers with different hyperparameters and will use the metrics to get an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees. Of course, with more trees and smaller learning rates, it gets tricky, but I think it is a decent proxy.

To generate the results you will see below, run the following snippets of code together, changing the LightGBM hyperparameters between runs.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

First, we install and import the necessary libraries:

pip install neptune pandas lightgbm scikit-learn matplotlib python-dotenv

# Python version: 3.9
import os
import sys
import neptune
import pandas as pd
import lightgbm
import matplotlib.pyplot as plt

from dotenv import load_dotenv

# Load the environment variables
load_dotenv()

Then, download the data to your directory and read it with Pandas:

TRAIN_PATH = "https://raw.githubusercontent.com/neptune-ai/blog-binary-classification-metrics/master/data/train.csv"
TEST_PATH = "https://raw.githubusercontent.com/neptune-ai/blog-binary-classification-metrics/master/data/test.csv"

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

Now, split the data:

feature_names = [col for col in train.columns if col not in ["isFraud"]]

X_train, y_train = train[feature_names], train["isFraud"]
X_test, y_test = test[feature_names], test["isFraud"]

Retrieve your Neptune credentials and instantiate a run object:

project_name = os.getenv("NEPTUNE_PROJECT_NAME")
api_token = os.getenv("NEPTUNE_API_TOKEN")

run = neptune.init_run(project=project_name, api_token=api_token)

The run object establishes a connection between Neptune and your script and allows you to log model metadata to your dashboard. Here is the next section of the code:

MODEL_PARAMS = {
    "random_state": 1234,
    "learning_rate": 0.1,
    "n_estimators": 1500,
}

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)

# Evaluate model
y_test_probs = model.predict_proba(X_test)
y_test_preds = model.predict(X_test)

Now, we will log our metrics and hyperparameters using the run object:

from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
)

# Calculate metrics
accuracy = accuracy_score(y_test, y_test_preds)
roc_auc = roc_auc_score(y_test, y_test_probs[:, 1])  # Assuming binary classification
precision = precision_score(y_test, y_test_preds, average="weighted")
recall = recall_score(y_test, y_test_preds, average="weighted")
f1 = f1_score(y_test, y_test_preds, average="weighted")
pr_auc = average_precision_score(
    y_test, y_test_probs[:, 1], average="weighted"
)

# Log metrics to Neptune
run["accuracy"] = accuracy
run["roc_auc"] = roc_auc
run["precision"] = precision
run["recall"] = recall
run["f1"] = f1
run["pr_auc"] = pr_auc
run["learning_rate"] = MODEL_PARAMS["learning_rate"]
run["n_estimators"] = MODEL_PARAMS["n_estimators"]

run.stop()

I have run this script with a few different combinations of learning rates and estimators. You can find the full script and other files related to this project in this GitHub repository.

Now, let’s explore how our model is scoring on different metrics.

Runs table | See in Neptune

On this problem, all of those metrics rank models from best to worst very similarly, but there are slight differences. Also, the scores themselves can vary greatly.

In the next sections, we will discuss it in more detail.

Accuracy vs. ROC AUC

The first big difference is that you calculate accuracy on the predicted classes while you calculate ROC AUC on predicted scores. That means you will have to find the optimal threshold for your problem.

Moreover, accuracy looks at fractions of correctly assigned positive and negative classes. That means if our problem is highly imbalanced, we get a really high accuracy score by simply predicting that all observations belong to the majority class.

On the flip side, if your problem is balanced and you care about both positive and negative predictions, accuracy is a good choice because it is really simple and easy to interpret.

Another thing to remember is that ROC AUC is especially good at ranking predictions. Because of that, if you have a problem where sorting your observations is what you care about, ROC AUC is likely what you are looking for.

Now, let’s look at the results of our experiments:

Experiments sorted by ROC AUC score | See in Neptune

The first observation is that models rank almost exactly the same on ROC AUC and accuracy.

Secondly, accuracy scores start at 0.93 for the very worst model and go up to 0.97 for the best one. Remember that predicting all observations as the majority class 0 would give roughly 0.91 accuracy (the positive fraction is 0.09), so our worst experiment, BIN-98, is only slightly better than that. Yet the score itself is quite high, which shows that you should always take class imbalance into consideration when looking at accuracy.

💡 There is an interesting metric called Cohen Kappa that takes imbalance into consideration by calculating the improvement in accuracy over the “sample according to class imbalance” model.
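Here is a minimal sketch of Cohen's kappa, computed as (observed agreement - expected agreement) / (1 - expected agreement), where the expected agreement comes from the class frequencies of the labels and the predictions. The two sets of predictions are made up to mimic a 9% positive rate. In practice you would likely use cohen_kappa_score from scikit-learn.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: accuracy corrected for agreement expected by chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 0.0  # assumption: treat the degenerate single-class case as 0
    return (observed - expected) / (1 - expected)


# Hypothetical data: 91 negatives followed by 9 positives.
y_true = [0] * 91 + [1] * 9

# Baseline that always predicts the majority class: 0.91 accuracy.
majority_pred = [0] * 100
# A made-up model: 89 true negatives, 2 false positives,
# 6 true positives, 3 false negatives (0.95 accuracy).
model_pred = [0] * 89 + [1] * 2 + [1] * 6 + [0] * 3

baseline_kappa = cohen_kappa(y_true, majority_pred)
model_kappa = cohen_kappa(y_true, model_pred)
print(baseline_kappa, model_kappa)
```

The majority-class baseline scores 0.91 on accuracy but exactly 0 on kappa, while the made-up model's 0.95 accuracy translates to a kappa of about 0.68, which separates the two much more clearly.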

F1 score vs Accuracy

Both of those metrics take class predictions as input, so you will have to adjust the threshold regardless of which one you choose.

Remember that the F1 score balances precision and recall in the positive class, while accuracy looks at correctly classified observations, both positive and negative. That makes a big difference, especially for the imbalanced problems, where by default our model will be good at predicting true negatives and hence accuracy will be high. However, if you care equally about true negatives and true positives, then accuracy is the metric you should choose.

If we look at our experiments below:

Experiments sorted by F1 score | See in Neptune

In our example, both metrics are equally capable of helping us rank models and choose the best one. The class imbalance of roughly 1:10 makes our accuracy really high by default. Because of that, even the worst model has very high accuracy, and the improvements as we go to the top of the table are not as clear on accuracy as they are on the F1 score.

ROC AUC vs. PR AUC

What is common between ROC AUC and PR AUC is that they both look at prediction scores of classification models, not thresholded class assignments. What is different, however, is that ROC AUC looks at a true positive rate TPR and the false positive rate FPR, while PR AUC looks at the positive predictive value PPV and the true positive rate TPR.

Because of that, if you care more about the positive class, then using PR AUC, which is more sensitive to the improvements for the positive class, is a better choice. One common scenario is a highly imbalanced dataset where the fraction of positive classes, which we want to find (like in fraud detection), is small. I highly recommend taking a look at this Kaggle discussion thread for a longer discussion on the subject of ROC AUC vs. PR AUC for imbalanced datasets.

If you care equally about the positive and negative classes or your dataset is quite balanced, then going with the ROC AUC is a good idea.

Let’s compare our experiments on those two metrics:

Experiments sorted by ROC AUC score | See in Neptune

They rank models similarly, but there is a slight difference if you look at experiments BIN-100 and BIN-102.

However, the improvements calculated in Average Precision (PR AUC) are larger and clearer. We get from 0.69 to 0.87 when, at the same time, ROC AUC goes from 0.92 to 0.97. Because of that, ROC AUC can give a false sense of very high performance when, in fact, your model is not doing that well.

F1 score vs. ROC AUC

One big difference between the F1 score and the ROC AUC is that the first one takes predicted classes, and the second takes predicted scores as input. Because of that, with the F1 score, you need to choose a threshold that assigns your observations to those classes. Often, you can improve your model performance a lot if you choose it well.

So, if you care about ranking predictions, don’t need them to be properly calibrated probabilities, and your dataset is not heavily imbalanced, then I would go with ROC AUC.

If your dataset is heavily imbalanced and/or you mostly care about the positive class, I’d consider using the F1 score, Precision-Recall curve, and PR AUC. The additional reason to go with F1 (or Fbeta) is that these metrics are easier to interpret and communicate to business stakeholders.

Let’s take a look at the experimental results for some more insights:

Experiments sorted by F1 score | See in Neptune

Experiments rank identically on the F1 score (threshold = 0.5) and ROC AUC. However, the F1 score is lower in value, and the difference between the worst and the best model is larger. For the ROC AUC score, values are larger, and the difference is smaller.

F1 score by threshold

💡 If you would like to easily log those plots for every experiment, I have attached a logging helper at the end of this post.

Final thoughts

In this blog post, you’ve learned about a few common metrics used for evaluating binary classification models.

We’ve discussed how they are defined, how to interpret and calculate them, and when you should consider using them.

Finally, we compared those evaluation metrics on a real problem and discussed some typical decisions you may face.

With all this knowledge, you are equipped to choose a good evaluation metric for your next binary classification problem!

Bonus

To make things a little bit easier, I have prepared a logging function that logs all the metrics, performance charts, and metrics-by-threshold charts described in this post.

Logging function

You can log all of those metrics and performance charts that we covered for your machine learning project and explore them in Neptune using our Python client and integrations (in the example below, I use Neptune-LightGBM integration).

Install the client:

pip install -U neptune-lightgbm

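Import and run:

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import NeptuneCallback

# Start a Neptune run and attach it to LightGBM training via a callback
run = neptune.init_run(...)
neptune_callback = NeptuneCallback(run=run)

# params and lgb_train are your LightGBM parameters and training Dataset
gbm = lgb.train(
    params,
    lgb_train,
    callbacks=[neptune_callback],
)

# Compute any additional metric you want to track
custom_score = ...

# Log the score to Neptune
run["logs/custom_score"] = custom_score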

Explore everything in the app.

You can log different kinds of metadata to Neptune, including metrics, charts, parameters, images, and more. Check the docs to learn more.
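If you want to log several threshold-based metrics at once, a small helper like the sketch below can fill in the custom score step. The function names and the `metrics/` prefix are my own choices, not part of the Neptune API; the only thing the helper relies on is item assignment of the form `run["some/path"] = value`, as in the snippet above. A plain dictionary stands in for the run here, so the example works without a Neptune account.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 at a given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Fall back to 0.0 whenever a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def log_metrics(run, metrics, prefix="metrics"):
    """Write each metric under '<prefix>/<name>' using item assignment."""
    for name, value in metrics.items():
        run[f"{prefix}/{name}"] = value


# A dict stands in for the Neptune run object in this sketch.
fake_run = {}
scores = binary_metrics([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9], threshold=0.5)
log_metrics(fake_run, scores)
print(fake_run)
```

With a real run, you would pass the object returned by `neptune.init_run(...)` in place of `fake_run`.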
