<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Nilesh Barla, Author at neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/author/nilesh-barla/feed" rel="self" type="application/rss+xml" />
	<link></link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Wed, 14 May 2025 09:21:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Nilesh Barla, Author at neptune.ai</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>How to Visualize Deep Learning Models</title>
		<link>https://neptune.ai/blog/deep-learning-visualization</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Tue, 14 Nov 2023 15:30:20 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=33030</guid>

					<description><![CDATA[Deep learning models are typically highly complex. While many traditional machine learning models make do with just a couple hundred parameters, deep learning models have millions or billions of parameters. The large language model GPT-4 that OpenAI released in the spring of 2023 is rumored to have nearly 2 trillion parameters. It goes&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Deep learning models are typically highly complex. While many traditional machine learning models make do with just a couple hundred parameters, deep learning models have millions or billions of parameters. The large language model GPT-4 that OpenAI released in the spring of 2023 is rumored to have nearly 2 trillion parameters. It goes without saying that the interplay between all these parameters is way too complicated for humans to understand.</p>



<p>This is where visualizations in ML come in. Graphical representations of structures and data flow within a deep learning model make its complexity easier to comprehend and enable insight into its decision-making process. With the proper visualization method and a systematic approach, many seemingly mysterious training issues and underperformance of deep learning models can be traced back to root causes.<br><br>In this article, we’ll explore a wide range of deep learning visualizations and discuss their applicability. Along the way, I’ll share many practical examples and point to libraries and in-depth tutorials for individual methods.</p>



<section id="note-block_a648100db1c602947e8b994fc252f080"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><strong>Note:</strong> I’ve prepared a <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=Lh_bwP2vij2l" target="_blank" rel="noreferrer noopener nofollow">Colab Notebook with examples</a> of many of the techniques discussed in this article.</p>
                                    </div>

            </div>
            </div>


</section>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=1800%2C942&#038;ssl=1" alt="Deep learning model visualization helps us understand model behavior and differences between models, diagnose training processes and performance issues, and aid the refinement and optimizations of models" class="wp-image-33352" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Deep learning model visualization helps us understand model behavior and differences between models, diagnose training processes and performance issues, and 
aid the refinement and optimizations of models </strong>| <a href="https://www.sciencedirect.com/science/article/pii/S2468502X17300086" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-why-do-we-want-to-visualize-deep-learning-models">Why do we want to visualize deep learning models?</h2>



<p>Visualizing deep learning models can help us with several different objectives:</p>



<ul class="wp-block-list">
<li><strong>Interpretability and explainability:</strong> The performance of deep learning models is, at times, staggering, even for seasoned data scientists and ML engineers. Visualizations provide ways to dive into a model’s structure and uncover why it succeeds in learning the relationships encoded in the training data.<br></li>



<li><strong>Debugging model training:</strong> It’s fair to assume that everyone training deep learning models has encountered a situation where a model doesn’t learn or struggles with a particular set of samples. The reasons for this range from wrongly connected model components to misconfigured optimizers. Visualizations are great for monitoring training runs and diagnosing issues.<br></li>



<li><strong>Model optimization</strong>: Models with fewer parameters are generally faster to compute and more resource-efficient while being more robust and generalizing better to unseen samples. Visualizations can uncover which parts of a model are essential – and which layers might be omitted without compromising the model’s performance.<br></li>



<li><strong>Understanding and teaching concepts:</strong> Deep learning is mostly based on fairly simple activation functions and mathematical operations like matrix multiplication. Many high school students will know all the maths required to understand a deep learning model’s internal calculations step-by-step. But it’s far from obvious how this gives rise to models that can seemingly “understand” images or translate fluently between multiple languages. It’s not a secret among educators that good visualizations are key for students to master complex and abstract concepts such as deep learning. Interactive visualizations, in particular, have proven helpful for those new to the field.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" width="685" height="588" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=685%2C588&#038;ssl=1" alt="Example of a deep learning visualization: small convolutional neural network CNN" class="wp-image-33033" style="width:607px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?w=685&amp;ssl=1 685w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=200%2C172&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=220%2C189&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=120%2C103&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=160%2C137&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=300%2C258&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=480%2C412&amp;ssl=1 480w" sizes="(max-width: 685px) 100vw, 685px" /><figcaption class="wp-element-caption"><strong>Example of a deep learning visualization: a small convolutional neural network (CNN). Notice how the thickness of the colorful lines indicates the weight of the neural pathways</strong> | <a href="https://www.nature.com/articles/srep27755" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-how-is-deep-learning-visualization-different-from-traditional-ml-visualization">How is deep learning visualization different from traditional ML visualization?</h2>



<p>At this point, you might wonder how visualizing deep learning models differs from <a href="/blog/visualization-in-machine-learning" target="_blank" rel="noreferrer noopener">visualizations of traditional machine learning models</a>. After all, aren’t deep learning models closely related to their predecessors?</p>



<p>Deep learning models are characterized by a large number of parameters and a layered structure. Many identical neurons are organized into layers stacked on top of each other. Each neuron is described through a small number of weights and an activation function. While the activation function is typically chosen by the model’s creator (and is thus a so-called hyperparameter), the weights are learned during training.<br><br>This fairly simple structure gives rise to unprecedented performance on virtually every machine learning task known today. From our human perspective, the price we pay is that deep learning models are much larger than traditional ML models.</p>



<p>It’s also much more difficult to see how the intricate network of neurons processes the input data than to comprehend, say, a decision tree. Thus, the main focus of deep learning visualizations is to uncover the data flow within a model and to provide insights into what the structurally identical layers learn to focus on during training.</p>



<p>That said, many of the <a href="/blog/visualization-in-machine-learning" target="_blank" rel="noreferrer noopener">machine learning visualization techniques</a> I covered in my last blog post apply to deep learning models as well. For example, confusion matrices and ROC curves are helpful when working with deep learning classifiers, just as they are for more traditional classification models.</p>



<h2 class="wp-block-heading" id="h-who-should-use-deep-learning-visualization">Who should use deep learning visualization?</h2>



<p>The short answer to that question is: Everyone who works with deep learning models!<br><br>In particular, the following groups come to mind:<br></p>



<ul class="wp-block-list">
<li><strong>Deep learning researchers:</strong> Many visualization techniques are first developed by academic researchers looking to improve existing deep learning algorithms or to understand why a particular model exhibits a certain characteristic.<br></li>



<li><strong>Data scientists and ML engineers:</strong> Creating and training deep learning models is no easy feat. Whether a model underperforms, struggles to learn, or generates suspiciously good outcomes – visualizations help us to identify the root cause. Thus, mastering different visualization approaches is an invaluable addition to any deep learning practitioner’s toolbox.&nbsp;</li>



<li><strong>Downstream consumers of deep learning models:</strong> Visualizations prove valuable to individuals with technical backgrounds who consume deep learning models via APIs or integrate deep learning-based components into software applications. For instance, <a href="https://mlatgt.blog/2018/02/16/visualizing-deep-learning-models-at-facebook/" target="_blank" rel="noreferrer noopener nofollow">Facebook&#8217;s ActiVis</a> is a visual analytics system tailored to in-house engineers, facilitating the exploration of deployed neural networks.</li>



<li><strong>Educators and students: </strong>Those encountering deep neural networks for the first time – and the people teaching them – often struggle to understand how the model code they write translates into a computational graph that can process complex input data like images or speech. Visualizations make it easier to understand how everything comes together and what a model learned during training.</li>
</ul>



<h2 class="wp-block-heading" id="h-types-of-deep-learning-visualization">Types of deep learning visualization</h2>



<p>There are many different approaches to deep learning model visualization. Which one is right for you depends on your goal. For instance, deep learning researchers often delve into intricate architectural blueprints to uncover the contributions of different model parts to its performance. ML engineers are often more interested in plots of evaluation metrics during training, as their goal is to ship the best-performing model as quickly as possible.</p>



<p>In this article, we’ll discuss the following approaches:</p>



<ul class="wp-block-list">
<li><strong>Deep learning model architecture visualization:</strong> Graph-like representation of a neural network with nodes representing layers and edges representing connections between neurons.<br></li>



<li><strong>Activation heatmap:</strong> Layer-wise visualization of activations in a deep neural network that provides insights into what input elements a model is sensitive to.<br></li>



<li><strong>Feature visualization:</strong> Heatmaps that visualize what features or patterns a deep learning model can detect in its input.<br></li>



<li><strong>Deep feature factorization:</strong> Advanced method to uncover high-level concepts a deep learning model learned during training.<br></li>



<li><strong>Training dynamics plots:</strong> Visualization of model performance metrics across training epochs.<br></li>



<li><strong>Gradient plots:</strong> Representation of the loss function gradients at different layers within a deep learning model. Data scientists often use these plots to detect exploding or vanishing gradients during model training.<br></li>



<li><strong>Loss landscape:</strong> Three-dimensional representation of the loss function’s value across a two-dimensional slice of a deep learning model’s parameter space.</li>



<li><strong>Visualizing attention:</strong> Heatmap and graph-like visual representations of a transformer model’s attention weights that can be used, e.g., to verify whether a model focuses on the correct parts of the input data.</li>



<li><strong>Visualizing embeddings:</strong> Graphical representation of embeddings, an essential building block for many NLP and computer vision applications, in a low-dimensional space to unveil their relationships and semantic similarity.</li>
</ul>



<h3 class="wp-block-heading" id="h-deep-learning-model-architecture-visualization">Deep learning model architecture visualization</h3>



<p>Visualizing the architecture of a deep learning model – its neurons, layers, and connections between them – can serve many purposes:</p>



<ol class="wp-block-list">
<li>It exposes the flow of data from the input to the output, including the shapes it takes when it’s passed between layers.</li>



<li>It gives a clear idea of the number of parameters in the model.</li>



<li>You can see which components repeat throughout the model and how they’re linked.</li>
</ol>
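

<p>As a complement to such diagrams, the total parameter count is also easy to compute directly. Here is a minimal PyTorch sketch – the two-layer model is purely illustrative:</p>

```python
import torch.nn as nn

# a small illustrative model; any nn.Module works the same way
model = nn.Sequential(
    nn.Linear(784, 128),  # 784*128 weights + 128 biases = 100,480 parameters
    nn.ReLU(),
    nn.Linear(128, 10),   # 128*10 weights + 10 biases = 1,290 parameters
)

# total number of trainable parameters
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 101770
```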



<p>There are different ways to visualize a deep learning model’s architecture:<br></p>



<ol class="wp-block-list">
<li><strong>Model diagrams </strong>expose the model’s building blocks and their interconnection.</li>



<li><strong>Flowcharts</strong> aim to provide insights into data flows and model dynamics.</li>



<li><strong>Layer-wise representations</strong> of deep learning models tend to be significantly more complex and expose activations and intra-layer structures.</li>
</ol>



<p>These visualizations do more than satisfy curiosity. They empower deep learning practitioners to fine-tune models, diagnose issues, and build upon this knowledge to create even more powerful algorithms.</p>



<p>You’ll be able to find model architecture visualization utilities for all of the big deep learning frameworks. Sometimes, they are provided as part of the main package, while in other cases, separate libraries are provided by the framework’s maintainers or community members.</p>



<h4 class="wp-block-heading">How do you visualize a PyTorch model’s architecture?</h4>



<p>If you are using PyTorch, you can use <a href="https://github.com/szagoruyko/pytorchviz" target="_blank" rel="noreferrer noopener nofollow">PyTorchViz</a> to create model architecture visualizations. This library visualizes a model’s individual components and highlights the data flow between them.</p>



<p>Here’s the basic code:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchviz <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_dot

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># instantiate the model once so the graph and the parameter names match</span>
model = MyPyTorchModel()

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create some sample input data</span>
x = torch.randn(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">256</span>, <span class="hljs-number" style="color: teal;">256</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate predictions for the sample data</span>
y = model(x)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate a model architecture visualization</span>
make_dot(y.mean(),
         params=dict(model.named_parameters()),
         show_attrs=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>,
         show_saved=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>).render(<span class="hljs-string" style="color: rgb(221, 17, 68);">"MyPyTorchModel_torchviz"</span>, format=<span class="hljs-string" style="color: rgb(221, 17, 68);">"png"</span>)
</pre></code></pre>
</div>




<p>The Colab notebook accompanying this article contains a complete <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=6BsxxFR0UCnN" target="_blank" rel="noreferrer noopener nofollow">PyTorch model architecture visualization example</a>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" width="1120" height="1680" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=1120%2C1680&#038;ssl=1" alt="Architecture visualization of a PyTorch-based CNN created with PyTorchViz" class="wp-image-40228" style="width:660px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?w=1120&amp;ssl=1 1120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=768%2C1152&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=133%2C200&amp;ssl=1 133w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=1024%2C1536&amp;ssl=1 1024w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=220%2C330&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=120%2C180&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=160%2C240&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=300%2C450&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=480%2C720&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=1020%2C1530&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Architecture visualization of a PyTorch-based CNN created with PyTorchViz</strong> | Source: Author</figcaption></figure>
</div>


<p>PyTorchViz uses four colors in the model architecture graph:<br></p>



<ol class="wp-block-list">
<li><strong>Blue </strong>nodes represent tensors or variables in the computation graph. These are the data elements that flow through the operations.</li>



<li><strong>Gray </strong>nodes represent PyTorch functions or operations performed on tensors.</li>



<li><strong>Green </strong>nodes represent gradients or derivatives of tensors. They showcase the backpropagation flow of gradients through the computation graph.</li>



<li><strong>Orange </strong>nodes represent the final loss or objective function optimized during training.</li>
</ol>



<h4 class="wp-block-heading">How do you visualize a Keras model’s architecture?</h4>



<p>To visualize the architecture of a Keras deep learning model, you can use the <a href="https://keras.io/api/utils/model_plotting_utils/#plotmodel-function" target="_blank" rel="noreferrer noopener nofollow"><em>plot_model</em></a> utility function that is provided as part of the library:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> tensorflow.keras.utils <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> plot_model

plot_model(my_keras_model,
           to_file=<span class="hljs-string" style="color: rgb(221, 17, 68);">'keras_model_plot.png'</span>,
           show_shapes=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>,
           show_layer_names=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre></code></pre>
</div>




<p>I’ve prepared a complete <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=mL6PLUHAiZbn" target="_blank" rel="noreferrer noopener nofollow">example for Keras architecture visualization</a> in the Colab notebook for this article.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="1800" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=1800%2C1800&#038;ssl=1" alt="Model architecture diagram of a Keras-based neural network" class="wp-image-33483" style="width:634px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=1536%2C1536&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, 
(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Model architecture diagram of a Keras-based neural network</strong> | Source: Author</figcaption></figure>
</div>


<p>The output generated by the <em>plot_model</em> function is quite simple to understand: Each box represents a model layer and shows its name, type, and input and output shapes. The arrows indicate the flow of data between layers.</p>



<p>By the way, Keras also provides a <a href="https://keras.io/api/utils/model_plotting_utils/#modeltodot-function" target="_blank" rel="noreferrer noopener nofollow">model_to_dot</a> function to create graphs similar to the one produced by PyTorchViz above.</p>



<h3 class="wp-block-heading" id="h-activation-heatmaps">Activation heatmaps</h3>



<p>Activation heatmaps are visual representations of the inner workings of deep neural networks. They show which neurons are activated layer-by-layer, allowing us to see how the activations flow through the model.</p>



<p>An activation heatmap can be generated for just a single input sample or a whole collection. In the latter case, we’ll typically choose to depict the average, median, minimum, or maximum activation. This allows us, for example, to identify regions of the network that rarely contribute to the model’s output and might be pruned without affecting its performance.</p>



<p>Let’s take a computer vision model as an example. To generate an activation heatmap, we’ll feed a sample image into the model and record the output value of each activation function in the deep neural network. Then, we can create a heatmap visualization for a layer in the model by coloring its neurons according to the activation function’s output. Alternatively, we can color the input sample’s pixels based on the activation they cause in the inner layer. This tells us which parts of the input the particular layer responds to.</p>
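

<p>The recording step described above can be sketched with PyTorch forward hooks. This is a minimal illustration on a made-up two-layer convolutional model, not a complete heatmap pipeline:</p>

```python
import torch
import torch.nn as nn

# a small illustrative model
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
)

# record each ReLU's output via forward hooks
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(save_activation(name))

# run a sample image through the model to populate the activation store
x = torch.randn(1, 3, 64, 64)
model(x)

# average over channels to get a per-pixel map ready for heatmap plotting
heatmap = activations["3"].mean(dim=1)  # shape: (1, 64, 64)
```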



<p>For typical deep learning models with many layers and millions of neurons, this simple approach will produce very complicated and noisy visualizations. Hence, deep learning researchers and data scientists have come up with plenty of different methods to simplify activation heatmaps.</p>



<p>But the goal remains the same: We want to uncover which parts of our model contribute to the output and in what way.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="795" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1920%2C795&#038;ssl=1" alt="Generation of activation heatmaps for a CNN analyzing MRI data" class="wp-image-33040" style="width:804px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1920%2C795&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=768%2C318&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=200%2C83&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1536%2C636&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=220%2C91&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=120%2C50&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=160%2C66&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=300%2C124&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=480%2C199&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1020%2C422&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Generation of activation heatmaps for a CNN 
analyzing MRI data</strong> | <a href="https://www.semanticscholar.org/paper/Geometric-Deep-Learning-and-Heatmap-Prediction-for-Ha-Hansen/5e81fa02b1a5bf4784ec9af3dc96871f76fd33b3" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>For instance, in the example above, activation heatmaps highlight the regions of an MRI scan that contributed most to the CNN’s output.</p>



<p>Providing such visualizations along with the model output aids healthcare professionals in making informed decisions. Here’s how:</p>



<ol class="wp-block-list">
<li><strong>Lesion detection and abnormality identification</strong>: The heatmaps highlight the crucial areas in the image, aiding in the identification of lesions and abnormalities.<br></li>



<li><strong>Severity assessment of abnormalities:</strong> The intensity of the heatmap directly correlates with the severity of lesions or abnormalities. A larger and brighter area on the heatmap indicates a more severe condition, enabling a quick assessment of the issue.<br></li>



<li><strong>Identifying model mistakes: </strong>If the model’s activation is high for areas of the MRI scan that are not medically significant (e.g., the skull cap or even parts outside of the brain), this is a telltale sign of a mistake. Even without deep learning expertise, medical professionals will immediately see that this particular model output cannot be trusted.</li>
</ol>



<h4 class="wp-block-heading">How do you create a visualization heatmap for a PyTorch model?</h4>



<p>The <a href="https://frgfm.github.io/torch-cam/" target="_blank" rel="noreferrer noopener nofollow">TorchCam</a> library provides several methods to generate activation heatmaps for PyTorch models.&nbsp;</p>



<p>To generate an activation heatmap for a PyTorch model, we need to take the following steps:<br></p>



<ol class="wp-block-list">
<li>Initialize one of <a href="https://frgfm.github.io/torch-cam/methods.html" target="_blank" rel="noreferrer noopener nofollow">the methods provided by TorchCam</a> with our model.</li>



<li>Pass a sample input into the model and record the output.</li>



<li>Apply the initialized TorchCam method.</li>
</ol>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchcam.methods <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SmoothGradCAMpp

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># initialize the Smooth Grad-CAM++ extractor</span>
cam_extractor = SmoothGradCAMpp(my_pytorch_model)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># compute the model’s output for the sample</span>
out = my_pytorch_model(sample_input_tensor.unsqueeze(<span class="hljs-number" style="color: teal;">0</span>))

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate the class activation map</span>
cams = cam_extractor(out.squeeze(<span class="hljs-number" style="color: teal;">0</span>).argmax().item(), out)
</pre></code></pre>
</div>




<p>The accompanying Colab notebook contains <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=yZ2hwGFFHmES&amp;line=2&amp;uniqifier=1" target="_blank" rel="noreferrer noopener nofollow">a full TorchCam activation heatmap example</a> using a ResNet image classification model.</p>



<p>Once we have computed the class activation maps, we can plot the activation heatmap for each layer in the model:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> name, cam <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(cam_extractor.target_names, cams):
    plt.imshow(cam.squeeze(<span class="hljs-number" style="color: teal;">0</span>).numpy())
    plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">'off'</span>)
    plt.title(name)
    plt.show()
</pre></code></pre>
</div>




<p>In my example model’s case, the output is not overly helpful:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="389" height="411" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=389%2C411&#038;ssl=1" alt="Creating a visualization heatmap for a PyTorch model" class="wp-image-33042" style="width:349px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?w=389&amp;ssl=1 389w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=189%2C200&amp;ssl=1 189w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=220%2C232&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=120%2C127&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=160%2C169&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=300%2C317&amp;ssl=1 300w" sizes="auto, (max-width: 389px) 100vw, 389px" /><figcaption class="wp-element-caption">Creating a visualization heatmap for a PyTorch model (layer) | Source: Author</figcaption></figure>
</div>


<p>We can greatly enhance the plot’s value by overlaying the original input image. Luckily for us, TorchCam provides the <a href="https://frgfm.github.io/torch-cam/utils.html#torchcam.utils.overlay_mask" target="_blank" rel="noreferrer noopener nofollow"><em>overlay_mask</em></a> utility function for this purpose:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchcam.utils <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> overlay_mask
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchvision.transforms.functional <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> to_pil_image

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> name, cam <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(cam_extractor.target_names, cams):
    result = overlay_mask(to_pil_image(img),
                to_pil_image(cam.squeeze(<span class="hljs-number" style="color: teal;">0</span>), mode=<span class="hljs-string" style="color: rgb(221, 17, 68);">'F'</span>),
                alpha=<span class="hljs-number" style="color: teal;">0.7</span>)
    plt.imshow(result)
    plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">'off'</span>)
    plt.title(name)
    plt.show()
</pre></code></pre>
</div>



<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="293" height="411" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=293%2C411&#038;ssl=1" alt="Original input image overlaid with an activation heatmap of the fourth layer in a ResNet18" class="wp-image-33044" style="width:347px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?w=293&amp;ssl=1 293w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=143%2C200&amp;ssl=1 143w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=220%2C309&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=120%2C168&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=160%2C224&amp;ssl=1 160w" sizes="auto, (max-width: 293px) 100vw, 293px" /><figcaption class="wp-element-caption"><strong>Original input image overlaid with an activation heatmap of the fourth layer in a ResNet18</strong> | Source: Author</figcaption></figure>
</div>


<p>As you can see in the example plot above, the activation heatmap exposes the areas of the input image that resulted in the greatest activation of neurons in the inner layer of the deep learning model. This helps both engineers and non-expert audiences understand what’s happening inside the model.</p>



<h3 class="wp-block-heading" id="h-feature-visualization">Feature visualization</h3>



<p>Feature visualization reveals the features learned by a deep neural network. It is particularly helpful in <a href="/blog/category/computer-vision" target="_blank" rel="noreferrer noopener">computer vision</a>, where it shows which abstract features in an input image a neural network responds to. For example, it can reveal that a neuron in a CNN architecture is highly responsive to diagonal edges or textures like fur.</p>



<p>This helps us understand what the model is looking for in images. The main difference from the activation heatmaps discussed in the previous section is that heatmaps show a model’s response to regions of a specific input image, whereas feature visualization goes a level deeper and attempts to uncover the abstract concepts the model responds to.</p>



<p>Through feature visualization, we can gain valuable insights into the specific features that deep neural networks are processing at different layers. Generally, layers close to the model’s input will respond to simpler features like edges, while layers closer to the model’s output will detect more abstract concepts.</p>
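<p>One common way to obtain per-layer feature maps in PyTorch is to register forward hooks that capture each layer’s output during a forward pass. Below is a minimal, self-contained sketch; the tiny stand-in CNN and the random input are purely illustrative, not the article’s actual model:</p>

```python
import torch
import torch.nn as nn

# A small stand-in CNN; in practice this would be your trained model,
# e.g., the ResNet18 used in the accompanying Colab notebook.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

# Forward hooks let us capture each layer's output during a forward pass.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so the stored tensors don't keep the autograd graph alive.
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        module.register_forward_hook(make_hook(name))

# A single forward pass on a (dummy) image fills the dictionary.
x = torch.randn(1, 3, 32, 32)
model(x)

# Average over channels to get one 2D feature map per layer,
# which can then be plotted with plt.imshow().
feature_maps = {name: act.mean(dim=1).squeeze(0)
                for name, act in activations.items()}
```

<p>The same hook mechanism works for any module type, so you can capture activations at whichever depth of the network you want to inspect.</p>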



<p>Such insights not only aid in understanding a model&#8217;s inner workings but also serve as a toolkit for fine-tuning and enhancing its performance. By inspecting the features that are activated incorrectly or inconsistently, we can refine the training process or identify data quality issues.</p>



<p>In my Colab notebook for this article, you can find the <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=U0cniE7sUNP8" target="_blank" rel="noreferrer noopener nofollow">full example code</a> for generating feature visualizations for a PyTorch CNN. Here, we’ll focus on discussing the result and what we can learn from it.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1000" height="1600" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=1000%2C1600&#038;ssl=1" alt="Feature visualization plots for a ResNet18 processing the image of a dog" class="wp-image-33046" style="width:718px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?w=1000&amp;ssl=1 1000w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=768%2C1229&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=125%2C200&amp;ssl=1 125w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=960%2C1536&amp;ssl=1 960w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=220%2C352&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=120%2C192&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=160%2C256&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=300%2C480&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=480%2C768&amp;ssl=1 480w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Feature visualization plots for a ResNet18 processing the image of a dog</strong> | Source: Author</figcaption></figure>
</div>


<p>As you can see from the plots above, the CNN detects different patterns or features in every layer. If you look closely at the upper row, which corresponds to the first four layers of the model, you can see that those layers detect the edges in the image. For instance, in the second and fourth panels of the first row, you can see that the model identifies the nose and the ears of the dog.</p>



<p>As the activations flow through the model, it becomes ever more challenging to make out what the model is detecting. But if we analyzed the feature maps more closely, we would likely find that individual neurons are activated by, e.g., the dog’s ears or eyes.</p>



<h3 class="wp-block-heading" id="h-deep-feature-factorizations">Deep feature factorizations</h3>



<p><a href="https://arxiv.org/abs/1806.10206" target="_blank" rel="noreferrer noopener nofollow">Deep Feature Factorization (DFF)</a> is a method to analyze the features a convolutional neural network has learned. DFF identifies regions in the network’s feature space that belong to the same semantic concept. By assigning different colors to these regions, we can create a visualization that allows us to see whether the features identified by the model are meaningful.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="300" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=1200%2C300&#038;ssl=1" alt="Deep feature visualization for a computer vision model" class="wp-image-33047" style="width:796px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=768%2C192&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=200%2C50&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=220%2C55&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=120%2C30&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=160%2C40&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=300%2C75&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=480%2C120&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=1020%2C255&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Deep feature visualization for a computer vision model | <a href="https://jacobgil.github.io/pytorch-gradcam-book/Deep%20Feature%20Factorizations.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>For instance, in the example above, we find that the model bases its decision (that the image shows labrador retrievers) on the puppies, not the surrounding grass. The nose region might point to a chow, but the shape of the head and ears push the model toward “labrador retriever.” This decision logic mimics the way a human would approach the task.&nbsp;</p>



<p>DFF is available in PyTorch-gradcam, which comes with <a href="https://jacobgil.github.io/pytorch-gradcam-book/Deep%20Feature%20Factorizations.html" target="_blank" rel="noreferrer noopener nofollow">an extensive DFF tutorial</a> that also discusses how to interpret the results. The image above is based on this tutorial. I have simplified the code and added some additional comments. You’ll find my recommended approach to <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS?usp=sharing" target="_blank" rel="noreferrer noopener nofollow">Deep Feature Factorization with PyTorch-gradcam</a> in the Colab notebook.</p>



<h3 class="wp-block-heading" id="h-training-dynamics-plots">Training dynamics plots</h3>



<p>Training dynamics plots show how a model learns. Training progress is typically gauged through performance metrics such as loss and accuracy. By visualizing these metrics, data scientists and deep learning practitioners can obtain crucial insights:</p>



<ul class="wp-block-list">
<li><strong>Learning progression</strong>: Training dynamics plots reveal how quickly or slowly a model converges. Suspiciously rapid convergence can point to overfitting, while erratic fluctuations may indicate issues like poor initialization or an improperly tuned learning rate.<br></li>



<li><strong>Early stopping</strong>: Plotting losses helps to identify the point at which a model starts to overfit the training data. A training loss that keeps decreasing while the validation loss rises is a clear sign of overfitting. The point where overfitting sets in is the optimal time to halt training.<br></li>
</ul>
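<p>The early-stopping criterion described above can be sketched in a few lines of plain Python. The loss values and the <code>patience</code> threshold here are illustrative; in a real training loop, the validation losses would come from your model:</p>

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training should stop, or None.

    Stops once the validation loss has not improved for `patience`
    consecutive epochs after its best value so far.
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None

# Validation loss falls, then starts rising: a classic overfitting pattern.
val_losses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.56, 0.61]
print(early_stop_epoch(val_losses))  # → 6 (three epochs after the minimum at epoch 3)
```

<p>In practice, you would also save a checkpoint whenever the validation loss improves, so that stopping restores the best model rather than the last one.</p>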


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="666" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=1600%2C666&#038;ssl=1" alt="Plots of loss over training epochs for various deep learning models" class="wp-image-33048" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=768%2C320&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=200%2C83&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=1536%2C639&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=220%2C92&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=120%2C50&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=160%2C67&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=300%2C125&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=480%2C200&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=1020%2C425&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Plots of loss over training epochs for various deep learning models | <a href="https://www.mdpi.com/2361466" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<div id="app-screenshot-block_8325f401c14019a84ecbdcb50a339225"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=1020%2C320&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=480%2C151&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=768%2C241&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=1020%2C320&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="320"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/showcase/org/onboarding-project/runs/details?viewId=standard-view&#038;detailsTab=charts&#038;shortId=IMG-197&#038;type=run&#038;sortBy=%5B%22sys%2Fcreation_time%22%5D&#038;sortFieldType=%5B%22datetime%22%5D&#038;sortFieldAggregationMode=%5B%22auto%22%5D&#038;sortDirection=%5B%22descending%22%5D&#038;groupBy=%5B%22data%2Fversion%2Fvalid%22%5D&#038;groupByFieldType=%5B%22artifact%22%5D&#038;groupByFieldAggregationMode=%5B%22auto%22%5D&#038;suggestionsEnabled=true&#038;lbViewUnpacked=true"
						class="c-button c-button--primary c-button--small c-button--cta" target="_blank" rel="nofollow noopener noreferrer">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Training loss, validation Dice coefficient (equivalent to the F1 score), and validation loss for a model training run in neptune.ai			</figcaption>
			
</div>



<div id="separator-block_632d57e60cc540f25dbac157d50313ba"
         class="block-separator block-separator--30">
</div>


    <a
        href="/blog/improving-ml-model-performance"
        id="cta-box-related-link-block_41cd003c0cd2c6c19436e4062de1c3ba"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                 </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-improve-ml-model-performance-best-practices-from-ex-amazon-ai-researcher">                How to Improve ML Model Performance [Best Practices From Ex-Amazon AI Researcher]            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-gradient-plots">Gradient plots</h3>



<p>If plots of performance metrics are insufficient to understand a model’s training progress (or lack thereof), plotting the loss function’s gradients can be helpful.</p>



<p>To adjust the weights of a neural network during training, we use a technique called <a href="http://neuralnetworksanddeeplearning.com/chap2.html" target="_blank" rel="noreferrer noopener nofollow">backpropagation</a> to compute the gradient of the loss function with respect to the weights and biases of our network. The gradient is a high-dimensional vector that points in the direction of the steepest increase of the loss function. Thus, we can use that information to shift our weights and biases in the opposite direction. The learning rate controls the amount by which we change the weights and biases.</p>
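<p>To make the update rule concrete, here is the idea reduced to a single weight and a toy quadratic loss (both the loss function and the learning rate are illustrative, not part of any real model):</p>

```python
# Gradient descent on a toy one-parameter loss L(w) = (w - 3)^2.
# The gradient dL/dw = 2 * (w - 3) points in the direction of steepest
# increase of the loss, so each step moves w the opposite way,
# scaled by the learning rate.
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0              # initial weight
learning_rate = 0.1  # step-size hyperparameter

for _ in range(50):
    w -= learning_rate * gradient(w)

print(round(w, 3))  # w has converged to the loss minimum at 3.0
```

<p>Backpropagation does exactly this, except the "weight" is a vector with millions of entries and the gradient is computed layer by layer via the chain rule.</p>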



<p>Vanishing or exploding gradients can prevent deep neural networks from learning. Plotting the mean magnitude of gradients for different layers can reveal whether gradients are vanishing (approaching zero) or exploding (becoming extremely large). If the gradient vanishes, we have no idea in which direction to shift our weights and biases, so training is stuck. An exploding gradient leads to large changes in the weights and biases, often overshooting the target and causing rapid fluctuations in the loss.</p>
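<p>A minimal sketch of how such per-layer gradient statistics can be collected in PyTorch follows. The small model, dummy data, and loss are illustrative; in practice, you would compute these values inside your training loop and log them to your experiment tracker at each step:</p>

```python
import torch
import torch.nn as nn

# Toy network and data standing in for a real training setup.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
x = torch.randn(16, 10)
target = torch.randn(16, 1)

# One forward and backward pass populates the .grad attributes.
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Mean absolute gradient per parameter tensor: values that stay near
# zero across many steps suggest vanishing gradients, while very large
# values suggest exploding gradients.
grad_magnitudes = {
    name: param.grad.abs().mean().item()
    for name, param in model.named_parameters()
    if param.grad is not None
}
for name, magnitude in grad_magnitudes.items():
    print(f"{name}: {magnitude:.2e}")
```

<p>Plotting these magnitudes per layer over training steps yields exactly the kind of gradient plot shown in the screenshot below.</p>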



<p><a previewlistener="true" href="/blog/best-ml-experiment-tracking-tools" target="_blank" rel="noreferrer noopener">Machine learning experiment trackers</a> like <a previewlistener="true" href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> enable researchers, data scientists, and AI/ML engineers to track and plot gradients during training.</p>



<div id="app-screenshot-block_e411851aabbbac383dcd84d22ffe8f29"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=867%2C361&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=480%2C200&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=768%2C320&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=867%2C361&#038;ssl=1 1020w"
					alt=""
					style=""
					width="867"
					height="361"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/katyl/GradientsNew/runs/details?viewId=standard-view&#038;detailsTab=charts&#038;shortId=GRAD1-50"
						class="c-button c-button--primary c-button--small c-button--cta" target="_blank" rel="nofollow noopener noreferrer">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Gradient plots for two different layers of a deep neural network in neptune.ai			</figcaption>
			
</div>



<section
	id="i-box-block_af180a93f85b222360e9aaedc5660ef9"
	class="block-i-box  l-margin__top--large l-margin__bottom--large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                 <strong>Editor&#8217;s note</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Do you feel like experimenting with neptune.ai?</p>



<ul
    id="arrow-list-block_c6915b7685c25644f4ba6a067d3637a5"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Request a <a href="/free-trial" target="_blank" rel="noreferrer noopener">free trial</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Play with a <a href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908" target="_blank" rel="noreferrer noopener nofollow">live project</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a previewlistener="true" href="https://docs.neptune.ai/" target="_blank" rel="noreferrer noopener">See the docs</a>&nbsp;or watch a short&nbsp;<a href="/walkthrough" target="_blank" rel="noreferrer noopener">product demo (2 min)</a></p>


</li>


</ul>


	</div>

</section>



<p>To learn more about vanishing and exploding gradients and how to use gradient plots to detect them, I recommend Katherine Li’s in-depth blog post on <a href="/blog/vanishing-and-exploding-gradients-debugging-monitoring-fixing" target="_blank" rel="noreferrer noopener">debugging, monitoring, and fixing gradient-related problems</a>.</p>
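<p>The mechanics behind vanishing and exploding gradients are easy to demonstrate numerically: during backpropagation, the gradient is repeatedly multiplied by layer Jacobians, so its norm shrinks or grows roughly exponentially with depth. Here is a minimal NumPy sketch (the layer count, dimension, and weight scales are illustrative assumptions, not taken from any particular model):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norms(num_layers, scale, dim=64):
    """Backpropagate a unit-norm gradient through `num_layers` random
    linear layers and record its norm after each layer."""
    grad = np.ones(dim) / np.sqrt(dim)  # unit-norm upstream gradient
    norms = []
    for _ in range(num_layers):
        # Random layer Jacobian with entries of standard deviation scale/sqrt(dim)
        jacobian = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
        grad = jacobian.T @ grad  # chain rule: multiply by the Jacobian
        norms.append(np.linalg.norm(grad))
    return norms

vanishing = gradient_norms(20, scale=0.5)  # small weights: norms decay
exploding = gradient_norms(20, scale=2.0)  # large weights: norms blow up
```

<p>With a weight scale below one, the gradient norm collapses toward zero within a few layers; with a scale above one, it blows up. This is exactly the pattern a gradient plot makes visible.</p>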


    <a
        href="/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem"
        id="cta-box-related-link-block_6d5be1568214eb01de086cc441cbc89c"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related article                </div>
            </div>

        
                    <h3 class="c-header" id="h-understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem">                Understanding Gradient Clipping (and How It Can Fix Exploding Gradients Problem)            </h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-loss-landscapes">Loss landscapes</h3>



<p>Beyond plotting gradient magnitudes, we can also visualize the loss function and its gradients directly. These visualizations are commonly called “<a href="https://arxiv.org/abs/1712.09913" target="_blank" rel="noreferrer noopener nofollow">loss landscapes</a>.”</p>



<p>Inspecting a loss landscape helps data scientists and machine learning practitioners understand how an optimization algorithm moves the weights and biases in a model toward a loss function’s minimum.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="640" height="480" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=640%2C480&#038;ssl=1" alt="" class="wp-image-33060" style="width:578px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?w=640&amp;ssl=1 640w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=200%2C150&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=220%2C165&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=300%2C225&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=480%2C360&amp;ssl=1 480w" sizes="auto, (max-width: 640px) 100vw, 640px" /><figcaption class="wp-element-caption">A plot of the region around a loss function’s local minimum with an inscribed gradient vector | <a href="https://github.com/pvigier/gradient-descent" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>In an idealized case like the one shown in the figure above, the loss landscape is very smooth. The gradient only changes slightly across the surface. Deep neural networks often exhibit a much more complex loss landscape with spikes and trenches. Reliably converging towards a minimum of the loss function in these cases requires robust optimizers such as <a href="https://arxiv.org/abs/1412.6980" target="_blank" rel="noreferrer noopener nofollow">Adam</a>.</p>



<p>To plot a loss landscape for a PyTorch model, you can use <a href="https://github.com/tomgoldstein/loss-landscape" target="_blank" rel="noreferrer noopener nofollow">the code provided by the authors</a> of a <a href="https://arxiv.org/abs/1712.09913" target="_blank" rel="noreferrer noopener nofollow">seminal paper on the topic</a>. To get a first impression, check out the interactive <a href="https://www.telesens.co/loss-landscape-viz/viewer.html" target="_blank" rel="noreferrer noopener nofollow">Loss Landscape Visualizer</a> using this library behind the scenes. There is also a <a href="https://github.com/artur-deluca/landscapeviz" target="_blank" rel="noreferrer noopener nofollow">TensorFlow port of the same code</a>.</p>
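<p>If you are curious what such code does under the hood: the basic recipe of the paper is to pick two random directions in parameter space and evaluate the loss on a grid of perturbations around the trained parameters. Below is a minimal sketch that uses an ordinary least-squares model as a stand-in for a network and omits the paper's filter normalization:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "model": linear regression, so the trained parameters are the
# exact least-squares solution and the loss minimum is known.
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

def loss(theta):
    return np.mean((X @ theta - y) ** 2)

# Two random directions spanning a 2D slice of parameter space.
d1, d2 = rng.standard_normal(5), rng.standard_normal(5)

alphas = np.linspace(-1, 1, 25)
landscape = np.array(
    [[loss(theta_star + a * d1 + b * d2) for b in alphas] for a in alphas]
)
```

<p>The resulting grid can be passed to matplotlib's <code>contourf</code> or <code>plot_surface</code> to obtain plots like the one above.</p>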



<p>Loss landscapes not only provide insight into how deep learning models learn, but they can also be beautiful to look at. Javier Ideami has created the <a href="https://losslandscape.com/" target="_blank" rel="noreferrer noopener nofollow">Loss Landscape project</a> with many artistic videos and interactive animations of various loss landscapes.</p>



<h3 class="wp-block-heading" id="h-visualizing-attention">Visualizing attention</h3>



<p>Famously, the transformer models that have revolutionized deep learning over the past few years are <a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noreferrer noopener nofollow">based on attention mechanisms</a>. Visualizing what parts of the input a model attends to provides us with important insights:</p>



<ul class="wp-block-list">
<li><strong>Interpreting self-attention: </strong>Transformers utilize self-attention mechanisms to weigh the importance of different parts of the input sequence. Visualizing attention maps helps us grasp which parts the model focuses on.<br></li>



<li><strong>Diagnosing errors:</strong> When the model attends to irrelevant parts of the input sequence, it can lead to prediction mistakes. Visualization allows us to detect such issues.<br></li>



<li><strong>Exploring contextual information:</strong> Transformer models excel at capturing contextual information from input sequences. Attention maps show how the model distributes attention across the input’s elements, revealing how context is built and propagated through layers.<br></li>



<li><strong>Understanding how transformers work:</strong> Visualizing attention and its flow through the model at different stages helps us understand how transformers process their input. Jacob Gildenblat’s <a href="https://jacobgil.github.io/deeplearning/vision-transformer-explainability" target="_blank" rel="noreferrer noopener nofollow">Exploring Explainability for Vision Transformers</a> takes you on a visual journey through Facebook’s <a href="https://huggingface.co/facebook/deit-tiny-patch16-224" target="_blank" rel="noreferrer noopener nofollow">Data-efficient Image Transformer</a> (deit-tiny).</li>
</ul>
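<p>At its core, an attention map is just the softmax-normalized score matrix of scaled dot-product attention. The following NumPy sketch computes one for a single head; the sequence length and dimensions are arbitrary, and a real visualization would take Q and K from a trained model:</p>

```python
import numpy as np

def attention_map(Q, K):
    """Return the (seq_len, seq_len) attention weights: row i shows how
    much query position i attends to each key position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
A = attention_map(Q, K)
```

<p>Each row of the resulting matrix sums to one and can be rendered as a heatmap, e.g., with matplotlib's <code>imshow</code>, which is essentially how overlays like the one below are produced.</p>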


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1296" height="826" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=1296%2C826&#038;ssl=1" alt="Example of an attention map" class="wp-image-33064" style="width:578px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?w=1296&amp;ssl=1 1296w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=768%2C489&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=200%2C127&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=220%2C140&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=120%2C76&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=160%2C102&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=300%2C191&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=480%2C306&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=1020%2C650&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The image on the left is original. On the right, it&#8217;s overlaid with an attention map. You can see that the model allocates the most attention to the dog | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-visualizing-embeddings">Visualizing embeddings</h3>



<p>Embeddings are high-dimensional vectors that capture semantic information. Nowadays, they are typically generated by deep learning models. Visualizing embeddings helps to understand this complex, high-dimensional data.</p>



<p>Typically, embeddings are projected down to a two- or three-dimensional space and represented by points. Standard techniques include principal component analysis, t-SNE, and UMAP. I’ve covered the latter two in-depth in the section on visualizing cluster analysis in my article on machine learning visualization.</p>



<p>Thus, it is no surprise that embedding visualizations reveal data patterns, similarities, and anomalies by grouping embeddings into clusters. For instance, if you visualize word embeddings with one of the methods mentioned above, you’ll find that semantically similar words will end up close together in the projection space.</p>
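<p>Of the projection techniques mentioned above, PCA is simple enough to sketch directly (t-SNE and UMAP require dedicated libraries). The embedding matrix here is random and stands in for, say, 200 word vectors of dimension 300:</p>

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project high-dimensional embeddings onto their top principal
    components for plotting."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(1)
emb = rng.standard_normal((200, 300))  # e.g., 200 word vectors of dim 300
points2d = pca_project(emb)
```

<p>The projected points can then be scatter-plotted; clusters in the plot correspond to groups of semantically similar embeddings.</p>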



<p>The <a href="https://projector.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow embedding projector</a> gives everyone access to interactive visualizations of well-known embeddings like standard <a href="https://www.tensorflow.org/text/tutorials/word2vec" target="_blank" rel="noreferrer noopener nofollow">Word2vec</a> corpora.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1050" height="984" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=1050%2C984&#038;ssl=1" alt="Embeddings for MNIST" class="wp-image-33066" style="width:570px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?w=1050&amp;ssl=1 1050w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=768%2C720&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=200%2C187&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=220%2C206&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=120%2C112&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=160%2C150&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=300%2C281&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=480%2C450&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=1020%2C956&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Embeddings for MNIST represented in a 3D space </strong>| <a href="https://projector.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-when-to-use-which-deep-learning-visualization">When to use which deep learning visualization</h2>



<p>We can break down the deep learning model lifecycle into four different phases:</p>



<div id="case-study-numbered-list-block_957f880db9e06652b64bf5167acd2687"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Pre-training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                During training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Post-training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Inference            </li>
            </ul>
</div>



<p>Each of these phases requires different visualizations.</p>



<h3 class="wp-block-heading" id="h-pre-training-deep-learning-model-visualization">Pre-training deep learning model visualization</h3>



<p>During early <a href="/blog/category/machine-learning-model-development" target="_blank" rel="noreferrer noopener">model development</a>, finding a suitable model architecture is the most essential task.</p>



<p>Architecture visualizations offer insights into how your model processes information. To understand the architecture of your deep learning model, you can visualize the layers, their connections, and the data flow between them.</p>



<h3 class="wp-block-heading" id="h-deep-learning-model-visualization-during-model-training">Deep learning model visualization during model training</h3>



<p>In the training phase, understanding training progress is crucial. To this end, training dynamics and gradient plots are the most helpful visualizations.</p>



<p>If training does not yield the expected results, feature visualizations or inspecting the model’s loss landscape in detail can provide valuable insights. If you’re training <a href="/blog/transformer-models-for-textual-data-prediction" target="_blank" rel="noreferrer noopener">transformer-based models</a>, visualizing attention or embeddings can lead you on the right path.</p>



<h3 class="wp-block-heading" id="h-post-training-deep-learning-model-visualizations">Post-training deep learning model visualizations</h3>



<p>Once the model is fully trained, the main goal of visualizations is to provide insights into how a model processes data to produce its outputs.</p>



<p>Activation heatmaps uncover which parts of the input are considered most important by the model. Feature visualizations reveal the features a model learned during training and help us understand what patterns a model is looking for in the input data at different layers. Deep Feature Factorization goes a step further and visualizes regions in the input space associated with the same concept.</p>



<p>If you’re working with transformers, attention and embedding visualizations can help you validate that your model focuses on the most important input elements and captures semantically meaningful concepts.</p>



<h3 class="wp-block-heading" id="h-inference">Inference</h3>



<p>At inference time – when a model is used to make predictions or generate outputs – visualizations can help monitor and debug cases where a model went wrong.</p>



<p>The methods used are the same as the ones you might use in the post-training phase, but the goal is different: instead of understanding the model as a whole, we’re now interested in how the model handles an individual input instance.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>We covered a lot of ways to visualize deep learning models. We started by asking why we might want visualizations in the first place and then looked into several techniques, often accompanied by hands-on examples. Finally, we discussed where in the model lifecycle the different deep learning visualization approaches promise the most valuable insights.</p>



<p>I hope you enjoyed this article and have some ideas about which visualizations you will explore for your current deep learning projects. The <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=Lh_bwP2vij2l" target="_blank" rel="noreferrer noopener nofollow">visualization examples in my Colab notebook</a> can serve as starting points. Please feel free to copy and adapt them to your needs!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">33030</post-id>	</item>
		<item>
		<title>Deploying Large NLP Models: Infrastructure Cost Optimization</title>
		<link>https://neptune.ai/blog/nlp-models-infrastructure-cost-optimization</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Thu, 23 Mar 2023 09:24:59 +0000</pubDate>
				<category><![CDATA[Natural Language Processing]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=19513</guid>

					<description><![CDATA[NLP models in commercial applications such as text generation systems have experienced great interest among the user. These models have achieved various groundbreaking results in many NLP tasks like question-answering, summarization, language translation, classification, paraphrasing, et cetera.&#160; Models like for example ChatGPT, Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) are predominantly very&#8230;]]></description>
										<content:encoded><![CDATA[
<p>NLP models in commercial applications such as text generation systems have attracted great interest among users. These models have achieved groundbreaking results in many NLP tasks like question answering, summarization, language translation, classification, and paraphrasing.&nbsp;</p>



<p>Models such as ChatGPT, Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) are predominantly very large and are often referred to as large language models, or <strong>LLMs</strong>. These models can easily have millions or even billions of parameters, making them financially expensive to deploy and maintain.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1354" height="938" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=1354%2C938&#038;ssl=1" alt="Graph showing that the size of large NLP models is increasing" class="wp-image-19527" style="width:650px;height:450px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?w=1354&amp;ssl=1 1354w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=768%2C532&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=200%2C139&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=220%2C152&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=120%2C83&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=160%2C111&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=300%2C208&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=480%2C333&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=1020%2C707&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The size of large NLP models is increasing | <a 
href="https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Such large <a href="/blog/category/natural-language-processing" target="_blank" rel="noreferrer noopener">natural language processing models </a>require significant computational power and memory, which is often the leading cause of high <strong>infrastructure costs.</strong> Even if you are fine-tuning an average-sized model for a large-scale application, you need to handle a huge amount of data.&nbsp;</p>



<p>Such scenarios inevitably lead to stacking new layers of neural connections, resulting in a large model. Moreover, deploying these models requires fast and expensive GPUs, which ultimately adds to the infrastructure cost. So, is there a way to keep these expenses in check?</p>



<p><em>Sure there is.</em></p>



<p>This article aims to provide some strategies, tips, and tricks you can apply to optimize your infrastructure while deploying large NLP models. In the following sections, we will explore:</p>



<div id="case-study-numbered-list-block_00e91a4e20285466f5c7f98f35117148"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                The infrastructural challenges faced while deploying large NLP models.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Different strategies to reduce the costs associated with these challenges.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Other handy tips you might want to know to address this issue.            </li>
            </ul>
</div>



<section id="blog-intext-cta-block_1ec9f560db812874346af469b173634e" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-you-may-also-like">You may also like</h3>
    
            <p><a href="/blog/deploy-nlp-models-in-production" target="_blank" rel="noopener">How to Deploy NLP Models in Production</a></p>
<p><a href="/blog/future-of-mlops-and-gpt-3-with-david-hershey" target="_blank" rel="noopener">What Does GPT-3 Mean For the Future of MLOps? With David Hershey</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-challenges-of-large-nlp-models">Challenges of large NLP models</h2>



<h3 class="wp-block-heading" id="h-computational-resources">Computational resources</h3>



<p>LLMs require a significant amount of resources for optimal performance. Below are the challenges usually faced in this regard.&nbsp;</p>



<h4 class="wp-block-heading">1. High computational requirements</h4>



<p>Deploying LLMs can be challenging as they require significant computational resources to perform inference. This is especially true when the model is used for real-time applications, such as chatbots or virtual assistants.&nbsp;</p>



<p>Consider ChatGPT as an example. It is capable of processing and responding to queries within seconds (most of the time). But when user traffic is high, inference times increase. Other factors can also delay inference, such as the complexity of the question and the amount of information required to generate a response. In any case, if the model is supposed to serve in real time, it must be capable of high throughput and low latency.</p>



<h4 class="wp-block-heading">2. Storage capacity</h4>



<p>With parameters ranging from millions to billions, LLMs can pose storage capacity challenges. Ideally, the whole model would be stored on a single storage device, but because of its size, this is often not possible.&nbsp;</p>



<p>For example, <strong>OpenAI&#8217;s GPT-3</strong> model, with 175B parameters, requires over <strong>300GB</strong> of storage for its parameters alone. Additionally, it requires a GPU with a minimum of 16GB of memory to run efficiently. Storing and running such a large model on a single device may be impractical for many use cases due to the hardware requirements. As such, there are three main issues around storage capacity with LLMs:</p>
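<p>The 300GB figure follows directly from the parameter count and the numeric precision of the weights. A quick back-of-the-envelope helper (weights only; activations, optimizer states, and the KV cache come on top):</p>

```python
def model_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

gpt3_params = 175e9
print(model_memory_gb(gpt3_params, 4))  # fp32: 700.0 GB
print(model_memory_gb(gpt3_params, 2))  # fp16: 350.0 GB
print(model_memory_gb(gpt3_params, 1))  # int8: 175.0 GB
```

<p>Even at 16-bit precision, GPT-3's weights alone occupy roughly 350GB, which is why no single accelerator can hold the model.</p>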



<h5 class="wp-block-heading">2.1 Memory limitations</h5>



<p>LLMs require a lot of memory as they process a huge amount of information. This can be challenging, especially when you want to deploy them on a low-memory device such as a mobile phone.&nbsp;</p>



<p>One way to deploy such models is to use a distributed system or distributed inference. In distributed inference, the model is distributed on multiple nodes or servers. It allows the distribution of the workload and speeds up the process. But the challenge here is that it may require significant expertise to set up and maintain. Plus, the larger the model, the more servers are required, which again increases the deployment cost.&nbsp;</p>



<h5 class="wp-block-heading">2.2 Large model sizes</h5>



<p>The MT-NLG model released in 2022 has 530 billion parameters and requires several hundred gigabytes of storage. High-end GPUs and basic data parallelism aren&#8217;t sufficient for deployment, and even alternative solutions like pipeline and model parallelism involve trade-offs between functionality, usability, and memory/compute efficiency. As the authors of the paper “<a href="https://arxiv.org/pdf/1910.02054.pdf" target="_blank" rel="noreferrer noopener nofollow">ZeRO: Memory Optimizations Toward Training Trillion Parameter Models</a>” put it, this, in turn,<strong> reduces the effectiveness of the model.</strong>&nbsp;</p>



<p>For instance, a 1.5B parameter model on a 32GB GPU can easily run out of memory during inference if the input query is long and complicated. Even basic inference on an LLM requires multiple accelerators or multi-node computing clusters, such as multiple Kubernetes pods. Researchers have proposed techniques for offloading parameters to local RAM, but these turned out to be inefficient in practical use-case scenarios. Users cannot download such large-scale models onto their systems just to translate or summarize a given text.&nbsp;</p>



<h5 class="wp-block-heading">2.3 Scalability challenges&nbsp;</h5>



<p>Another area for improvement with LLMs is <strong>scalability</strong>. Large models are often scaled using <strong>model parallelism (MP),</strong> which requires storage and memory capacity across multiple machines. This involves dividing the model into smaller parts and distributing them across multiple machines. Each machine processes a different part of the model, and the results are combined to produce the final output. This technique can be helpful for handling large models, but it requires careful consideration of the communication overhead between machines.&nbsp;</p>
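<p>The idea of dividing a model can be illustrated with a single linear layer whose weight matrix is split column-wise between two hypothetical devices. Each device computes its slice of the output, and a communication step reassembles the full result (in a real multi-GPU setup, this would be an all-gather):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # batch of activations
W = rng.standard_normal((16, 32))  # full weight matrix of one layer

# "Device 0" and "device 1" each hold half of the columns of W.
W0, W1 = np.hsplit(W, 2)
y0 = x @ W0  # computed on device 0
y1 = x @ W1  # computed on device 1

# Communication step: gather the partial outputs.
y_parallel = np.concatenate([y0, y1], axis=1)
y_single = x @ W
```

<p>The concatenated result matches the single-device computation; the cost of the approach lies in the communication step, which has to happen at every layer.</p>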



<p>In <strong>distributed inference,</strong> the LLM is deployed on multiple machines, with each machine processing a subset of the input data. This approach is essential for handling large-scale language tasks that require input to pass through billions of parameters.&nbsp;</p>



<p>Most of the time, MP works, but there are instances where it doesn’t. The reason is that MP divides the model vertically, distributing the computation and parameters for each layer among several devices whose <strong>inter-GPU communication bandwidth is large.</strong> This distribution supports the intensive communication between layers required within a single node. The limitation appears outside a single node, where bandwidth drops, leading to a fall in performance and efficiency.</p>



<h4 class="wp-block-heading">3. Bandwidth requirements</h4>



<p>As discussed previously, LLMs have to be scaled using MP. The issue is that while MP is efficient in single-node clusters, inference in a multi-node setting isn’t, because of the low-bandwidth networks between nodes.&nbsp;</p>



<p>Deploying a large language model requires multiple network requests to retrieve data from different servers. Network latency affects the time required to transfer data between the servers, which results in slower performance and higher response times. The resulting delays in processing can degrade the user experience.</p>
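<p>A rough estimate shows why bandwidth matters so much: transfer time is simply data volume divided by link bandwidth. The payload size and link speeds below are illustrative assumptions, not measurements:</p>

```python
def transfer_ms(megabytes, gbit_per_s):
    """Time to move `megabytes` of data over a link, in milliseconds."""
    bits = megabytes * 8e6
    return bits / (gbit_per_s * 1e9) * 1e3

activations_mb = 64  # hypothetical activation payload passed between stages
intra_node = transfer_ms(activations_mb, 600)  # fast intra-node GPU link
inter_node = transfer_ms(activations_mb, 25)   # slower inter-node network
print(intra_node, inter_node)
```

<p>Moving the same activations over an inter-node link can easily take an order of magnitude longer than over an intra-node link, which is exactly the gap that makes multi-node inference slow.</p>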



<h4 class="wp-block-heading">4. Resource constraints</h4>



<p>Limited storage capacity can restrict the ability to store multiple versions of the same model, which makes it difficult to compare the performance of different models and track the progress of model development over time. This is especially true if you want to adopt a <strong>shadow deployment strategy</strong>.</p>



<h3 class="wp-block-heading" id="h-energy-consumption">Energy consumption</h3>



<p>As discussed above, serving LLMs requires significant <strong>computational</strong> resources, which can lead to high energy consumption and a large carbon footprint. This can be problematic for organizations that are committed to reducing their environmental impact.</p>



<p>For reference, the image below shows estimates of the training cost of several LLMs, along with the carbon footprint they produce during training.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1096" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1920%2C1096&#038;ssl=1" alt="Financial estimation of the large NLP models, along with the carbon footprint that they produce during training" class="wp-image-19594" style="width:650px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1920%2C1096&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=768%2C438&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=200%2C114&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1536%2C877&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=220%2C126&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=160%2C91&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=300%2C171&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=480%2C274&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1020%2C582&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Financial estimation of the large NLP models, along with the carbon footprint that they produce during training | <a href="https://sunniesuhyoung.github.io/files/LLM.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>What is more shocking is that 80-90% of the machine learning workload is inference processing, according to <a href="https://www.hpcwire.com/2019/03/19/aws-upgrades-its-gpu-backed-ai-inference-platform/" target="_blank" rel="noreferrer noopener nofollow">NVIDIA</a>. Likewise, according to <a href="https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/" target="_blank" rel="noreferrer noopener nofollow">AWS</a>, inference accounts for 90% of machine learning demand in the cloud.</p>



<h3 class="wp-block-heading" id="h-cost">Cost</h3>



<p>Deploying and using LLMs can be costly, including the cost of hardware, storage, and infrastructure. Additionally, the cost of deploying the model can be significant, especially when using resources such as GPUs or TPUs for low latency and high throughput during inference. This can make it challenging for smaller organizations or individuals to use LLMs for their applications.</p>



<p>To put this into perspective, the running cost of ChatGPT has been estimated at around <strong>$100,000</strong> per day, or <strong>$3M</strong> per month.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1000" height="1364" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=1000%2C1364&#038;ssl=1" alt="Tweet about ChatGPT costs" class="wp-image-19596" style="width:502px;height:685px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?w=1000&amp;ssl=1 1000w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=768%2C1048&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=147%2C200&amp;ssl=1 147w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=220%2C300&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=120%2C164&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=160%2C218&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=300%2C409&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=480%2C655&amp;ssl=1 480w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Tweet about ChatGPT costs  | <a href="https://twitter.com/tomgoldsteincs/status/1600196995389366274?lang=en" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-strategies-for-optimizing-infrastructure-costs-of-large-nlp-models">Strategies for optimizing infrastructure costs of large NLP models</h2>



<p>In this section, we will explore possible solutions and techniques for the challenges discussed in the previous section. It is worth noting that when you deploy a model to the cloud, you choose an inference option and thereby create an endpoint. See the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=1200%2C628&#038;ssl=1" alt="Graph with the general workflow for inference endpoints " class="wp-image-36662" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The general workflow for inference endpoints | <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>With that in mind, and given the challenges discussed earlier, let's look at techniques that can be used to optimize the cost of the infrastructure for deploying LLMs. Below are steps you can follow to deploy your model as efficiently as possible.&nbsp;</p>



<h3 class="wp-block-heading" id="h-smart-use-of-cloud-computing-for-computational-resources">Smart use of cloud computing for computational resources</h3>



<p>Using cloud computing services can provide on-demand access to powerful computing resources, including CPUs and GPUs. Cloud computing services are flexible and can scale according to your requirements.&nbsp;</p>



<p>One important tip: set a budget for your project. A budget forces you to look for optimizations that keep the project within your financial limits.&nbsp;</p>



<p>When it comes to cloud services, many companies offer a platform. Cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer a range of options for deploying LLMs, including virtual machines, containers, and serverless computing. Still, you must do your own research and calculations. In particular, you must know these three things:</p>



<div id="case-study-numbered-list-block_a1924d9be2965a1498bcc3ae807bcddd"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                The model size.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Details about the hardware to be used.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                The right inference option.            </li>
            </ul>
</div>



<p>Once you have these details, you can calculate how much accelerated computing power you need. Based on that, you can plan and execute your model deployment.&nbsp;</p>



<section id="blog-intext-cta-block_e89d175d5bcbcd23b7bda031b77c11ae" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><a href="/blog/mlops-tools-for-nlp-projects" target="_blank" rel="noopener">MLOps Tools for NLP Projects</a></p>
    
    </section>



<h4 class="wp-block-heading">Calculating model size</h4>



<p>You can see the table below, which will give you an idea of how many FLOPs you might need for your model. Once you have an estimation, you can then go ahead and find the relevant GPU in your preferred cloud platform.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="910" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1920%2C910&#038;ssl=1" alt="Estimated optimal training FLOPs and training tokens for various NLP model sizes." class="wp-image-19600" style="width:-123px;height:-58px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1920%2C910&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=768%2C364&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=200%2C95&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1536%2C728&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=220%2C104&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=120%2C57&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=160%2C76&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=300%2C142&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=480%2C228&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1020%2C484&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?w=1928&amp;ssl=1 1928w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Estimated optimal training FLOPs and training tokens for various NLP model sizes | <a href="https://arxiv.org/pdf/2203.15556.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>A tool accompanying the blog post “<a href="https://www.lesswrong.com/posts/HvqQm6o8KnwxbdmhZ/estimating-training-compute-of-deep-learning-models" target="_blank" rel="noreferrer noopener nofollow">Estimating Training Compute of Deep Learning Models</a>” allows you to calculate the FLOPs your model requires for both training and inference.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1272" height="1270" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=1272%2C1270&#038;ssl=1" alt="A screen from a tool that calculates the FLOPs required for both training and inference" class="wp-image-19601" style="width:513px;height:512px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?w=1272&amp;ssl=1 1272w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=768%2C767&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=300%2C300&amp;ssl=1 300w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=480%2C479&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=1020%2C1018&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A tool that calculates the FLOPs required for both training and inference | <a href="https://www.lesswrong.com/posts/HvqQm6o8KnwxbdmhZ/estimating-training-compute-of-deep-learning-models" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The app is based on the work of <a href="https://arxiv.org/abs/2001.08361" target="_blank" rel="noreferrer noopener nofollow">Kaplan et al., 2020</a> and <a href="https://arxiv.org/abs/2203.15556" target="_blank" rel="noreferrer noopener nofollow">Hoffmann et al., 2022</a>, which shows how to train a model on a fixed compute budget. To learn more about this subject, you can read the blog post <a href="https://www.lesswrong.com/posts/HvqQm6o8KnwxbdmhZ/estimating-training-compute-of-deep-learning-models" target="_blank" rel="noreferrer noopener nofollow">here</a>.</p>
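<p>If you prefer to run this estimate yourself, here is a minimal back-of-the-envelope sketch based on the widely used approximation of ~6 FLOPs per parameter per training token (and ~2 per token at inference). The A100 peak throughput and the 30% utilization figure are illustrative assumptions, not measurements; plug in your own hardware's numbers.</p>

```python
def training_flops(n_params, n_tokens):
    # Common scaling-law heuristic: ~6 FLOPs per parameter
    # per training token (forward + backward pass).
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    # ~2 FLOPs per parameter per generated token (forward pass only).
    return 2 * n_params

def gpu_hours(total_flops, peak_flops_per_s, utilization=0.3):
    # Real-world utilization is typically well below peak; 30% is an assumption.
    return total_flops / (peak_flops_per_s * utilization) / 3600

# Example: a 7B-parameter model trained on 1.4T tokens (Chinchilla-style ratio).
flops = training_flops(7e9, 1.4e12)
# Assumed peak throughput: ~312 TFLOP/s (A100, BF16 dense matmul).
print(f"{flops:.2e} training FLOPs, ~{gpu_hours(flops, 312e12):,.0f} GPU-hours")
```

<p>With the GPU-hour estimate in hand, multiplying by your provider's hourly rate gives a first-order cost figure to compare against your budget.</p>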



<h4 class="wp-block-heading">Selecting the right hardware</h4>



<p>Once you have calculated the required FLOPs, you can go ahead and choose the GPU. Make sure you are aware of the features each GPU offers; the image below gives an example.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="906" height="1534" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=906%2C1534&#038;ssl=1" alt="The list of GPU specifications offered by NVIDIA" class="wp-image-19602" style="width:509px;height:862px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?w=906&amp;ssl=1 906w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=768%2C1300&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=118%2C200&amp;ssl=1 118w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=220%2C372&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=120%2C203&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=160%2C271&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=300%2C508&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=480%2C813&amp;ssl=1 480w" sizes="auto, (max-width: 906px) 100vw, 906px" /><figcaption class="wp-element-caption">The list of GPU specifications offered by NVIDIA | <a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf" target="_blank" rel="noreferrer noopener 
nofollow">Source</a></figcaption></figure>
</div>


<p>Above you can see the list of specifications that NVIDIA offers. Similarly, you can compare different GPUs and see which one suits your budget.&nbsp;</p>



<h4 class="wp-block-heading">Choosing the right inference option</h4>



<p>Once you have calculated the model size and selected the GPU, you can proceed to choose the inference option. Amazon SageMaker, for instance, offers multiple inference options to suit different workloads:</p>



<ol class="wp-block-list">
<li><strong>Real-time inference</strong>, which is suitable for low-latency or high-throughput online inferences and supports payload sizes up to 6 MB and processing times of 60 seconds.</li>



<li><strong>Serverless inference</strong>, which is ideal for intermittent or unpredictable traffic patterns and supports payload sizes up to 4 MB and processing times of 60 seconds. In serverless inference, the model scales automatically based on the incoming traffic or requests, and you are not charged while the model sits idle: it is pay-as-you-go.&nbsp;</li>



<li><strong>Batch transform </strong>is suitable for offline processing of large datasets and supports payload sizes of GBs and processing times of days.&nbsp;</li>



<li><strong>Asynchronous inference </strong>is suitable for queuing requests with large payloads and long processing times, supports payloads up to 1 GB and processing times up to one hour, and can scale down to 0 when there are no requests.</li>
</ol>
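<p>One way to encode these decision rules is a small helper function. This is a hypothetical sketch, not part of any AWS SDK; the thresholds follow the limits listed above.</p>

```python
def pick_inference_option(payload_mb, processing_time_s, traffic):
    """Pick a SageMaker-style inference option from workload traits.

    traffic: 'steady', 'intermittent', or 'offline'.
    Illustrative only -- verify current limits in the SageMaker docs.
    """
    if traffic == "offline":
        return "batch transform"    # GB-scale payloads, can run for days
    if payload_mb > 6 or processing_time_s > 60:
        return "asynchronous"       # queued, up to 1 GB / 1 hour, scales to 0
    if traffic == "intermittent" and payload_mb <= 4:
        return "serverless"         # up to 4 MB / 60 s, pay-as-you-go
    return "real-time"              # low latency, steady traffic, up to 6 MB
```

<p>For example, an offline scoring job over a large dataset maps to batch transform, while a chat frontend with steady traffic and small payloads maps to a real-time endpoint.</p>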



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<p>To better understand which option meets your requirements, look at the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=1200%2C628&#038;ssl=1" alt="Graph with choosing model deployment options " class="wp-image-36664" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Choosing model deployment options | <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Once all the above points are addressed, you can deploy the model on the cloud service of your choice.&nbsp;</p>



<p>To quickly summarize:</p>



<div id="case-study-numbered-list-block_7765a030fa910f410f6b32f262ea2365"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Set a budget            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Calculate the size of the model             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Compute the FLOPs required for the model            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Find the right GPU            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Choose the appropriate inference option             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Research the pricing offered by various cloud computing platforms            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Find the service that suits your needs and budget            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Deploy it.             </li>
            </ul>
</div>
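<p>As a toy illustration of steps 6 and 7, here is a quick price-comparison sketch. The hourly rates below are placeholders, not current prices; always check your provider's pricing page.</p>

```python
# Hypothetical on-demand GPU prices in USD per hour -- placeholders only.
GPU_HOURLY_USD = {"A100": 4.10, "V100": 3.06, "T4": 0.53}

def monthly_cost(gpu, hours_per_day=24, days=30):
    """Estimated monthly cost of keeping one GPU instance running."""
    return GPU_HOURLY_USD[gpu] * hours_per_day * days

def options_within_budget(budget_usd, **kwargs):
    """Return the GPUs whose estimated monthly cost fits the budget."""
    return [g for g in GPU_HOURLY_USD if monthly_cost(g, **kwargs) <= budget_usd]
```

<p>Running the same comparison across several cloud platforms, with their real prices filled in, is exactly the research step the list above recommends.</p>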



<h3 class="wp-block-heading" id="h-optimizing-the-model-for-serving">Optimizing the model for serving</h3>



<p>In the last section, I discussed how the size of LLMs can pose a problem for deployment. When your model is too large, strategies like model compilation, model compression, and model sharding can be used. These techniques reduce the size of the model while preserving accuracy, which allows for easier deployment and significantly reduces the associated expenses.&nbsp;</p>



<p>Let’s explore each of those in detail.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1110" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1920%2C1110&#038;ssl=1" alt="Graph showing different techniques or strategies to optimize LLMs for deployment. " class="wp-image-19605" style="width:793px;height:453px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1920%2C1110&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=768%2C444&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=200%2C116&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1536%2C888&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=220%2C127&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=120%2C69&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=160%2C93&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=300%2C173&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=480%2C278&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1020%2C590&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Different techniques or strategies to optimize LLMs for deployment</em><a href="https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf" target="_blank" rel="noreferrer noopener nofollow"> | Source</a></figcaption></figure>
</div>


<h4 class="wp-block-heading">Model compression</h4>



<p>Model compression is a technique used to optimize and transform an LLM into an efficient executable model that can be run on specialized hardware or software platforms, usually cloud services. The goal of model compression is to improve the performance and efficiency of LLM inference by leveraging hardware-specific optimizations, such as a reduced memory footprint, improved computation parallelism, and reduced latency.</p>



<p>This is a useful technique because it lets you experiment with different combinations, set performance benchmarks for various tasks, and find a price point that suits your budget. Model compression typically involves several steps:</p>



<ol class="wp-block-list">
<li><strong>Graph optimization</strong>: The high-level LLM graph is transformed and optimized using graph optimization techniques such as <strong>pruning</strong> and <strong>quantization</strong> to reduce the computational complexity and memory footprint of the model. This, in turn, makes the model small while preserving its accuracy.&nbsp;</li>



<li><strong>Hardware-specific optimization</strong>: The optimized LLM graph is further optimized to leverage hardware-specific optimizations. For instance, Amazon Sagemaker provides model serving containers for various popular ML frameworks, including <a href="https://xgboost.readthedocs.io/en/stable/" target="_blank" rel="noreferrer noopener nofollow">XGBoost</a>, <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener nofollow">scikit-learn</a>, <a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener nofollow">PyTorch</a>, <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow</a>, and <a href="https://mxnet.apache.org/versions/1.9.1/" target="_blank" rel="noreferrer noopener nofollow">Apache MXNet</a>, along with software development kits (SDKs) for each container.</li>
</ol>
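<p>To make the pruning mentioned in step 1 concrete, here is a toy magnitude-pruning sketch in pure Python. Real implementations (e.g., PyTorch's pruning utilities) operate on tensors and masks, but the idea is the same: drop the weights that contribute least.</p>

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights.

    weights: flat list of floats; sparsity: fraction in [0, 1] to remove.
    Toy illustration only -- ties at the threshold may remove slightly more.
    """
    if not 0 <= sparsity <= 1:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

<p>After pruning, the zeroed weights can be stored in sparse formats and skipped at inference time, which is where the memory and compute savings come from.</p>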


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="609" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1920%2C609&#038;ssl=1" alt="Illustration of Amazon Sagemaker's workflow" class="wp-image-19607" style="width:720px;height:228px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1920%2C609&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=768%2C244&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=200%2C63&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1536%2C487&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=220%2C70&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=120%2C38&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=160%2C51&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=300%2C95&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=480%2C152&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1020%2C324&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?w=1992&amp;ssl=1 1992w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">How AWS Sagemaker Neo works | <a href="https://aws.amazon.com/sagemaker/neo/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Here are a few model compression techniques that one must know.</p>



<h5 class="wp-block-heading">Model quantization</h5>



<p>Model quantization (MQ) is a technique used to reduce the memory footprint and computation requirements of an LLM. MQ essentially replaces the model's parameters and activations with lower-precision data types. The goal of model quantization is to improve the efficiency of LLM inference by reducing the memory bandwidth requirements and exploiting hardware optimized for lower-precision arithmetic.</p>



<p>PyTorch offers model quantization out of the box. Its API can reduce the model size by a factor of 4 and cut the required memory bandwidth by 2x to 4x. As a result, inference speed can increase by 2 to 4 times, owing to the reduced memory bandwidth requirements and faster computation with int8 arithmetic. The precise speedup depends on the hardware, runtime, and model used.</p>



<p>There are several approaches to model quantization for LLMs, including:</p>



<ol class="wp-block-list">
<li><strong>Post-training quantization</strong>: In this approach, the LLM is first trained using floating-point data types, and then the weights and activations are quantized to lower-precision data types post-training. This approach is simple to implement and can achieve good accuracy with a careful selection of quantization parameters.</li>



<li><strong>Quantization-aware training</strong>: Here, the LLM is quantized during training, allowing the model to adapt to the reduced precision during training. This approach can <strong>achieve higher accuracy</strong> than post-training quantization but requires more computation during training.</li>



<li><strong>Hybrid quantization</strong>: It combines both post-training quantization and quantization-aware training, allowing the LLM to adapt to lower-precision data types during training while also applying post-training quantization to further reduce the memory footprint and computational complexity of the model.</li>
</ol>



<p>Model quantization can be challenging to implement effectively, as it requires careful consideration of the trade-offs between reduced precision and model accuracy, as well as the hardware-specific optimizations that can be leveraged with lower-precision arithmetic. However, when done correctly, model quantization can significantly improve the efficiency of LLM inference, enabling better real-time inference on large-scale datasets and edge devices.</p>



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<h5 class="wp-block-heading">Model pruning</h5>



<p>Model pruning (MP) is another technique used to reduce the size and computational complexity of an LLM, this time by removing redundant or unnecessary model parameters. The goal of MP is to improve the efficiency of LLM inference without sacrificing accuracy.</p>



<p>MP involves <strong>identifying</strong> and <strong>removing redundant</strong> or <strong>unnecessary model parameters</strong> using various pruning algorithms. These algorithms fall into two broad categories:</p>



<ol class="wp-block-list">
<li><strong>Weight pruning</strong>: In weight pruning, individual weights in the LLM are removed based on their magnitude or importance, using techniques such as magnitude-based pruning or structured pruning. Weight pruning can significantly reduce the number of model parameters and the computational complexity of the LLM, but it may require fine-tuning of the pruned model to maintain its accuracy.</li>



<li><strong>Neuron pruning</strong>: In neuron pruning, entire neurons or activations in the LLM are removed based on their importance, using techniques such as channel pruning or neuron-level pruning. Neuron pruning can also significantly reduce the number of model parameters and the computational complexity of the LLM, but it may be more difficult to implement and may require more extensive retraining or fine-tuning to maintain accuracy.</li>
</ol>
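<p>Both categories are available in PyTorch&#8217;s <code>torch.nn.utils.prune</code> module. The sketch below, with a single toy layer standing in for one weight matrix of a large model, applies magnitude-based unstructured pruning followed by structured (neuron-level) pruning along the output dimension; the pruning amounts are illustrative:</p>

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layer standing in for one weight matrix of a large model.
layer = nn.Linear(256, 256)

# Weight pruning: zero out the 30% of weights with the smallest
# absolute value (magnitude-based, unstructured).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Neuron pruning: additionally remove 20% of entire output channels
# (rows) by their L2 norm (structured pruning along dim 0).
prune.ln_structured(layer, name="weight", amount=0.2, n=2, dim=0)

# Fold the accumulated pruning masks into the weight tensor permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")
```

<p>Note that unstructured sparsity only translates into real speedups on hardware or runtimes that exploit sparse tensors, which is one reason structured pruning is often preferred in practice.</p>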



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<p>Here are a couple of approaches to model pruning:</p>



<ol class="wp-block-list">
<li><strong>Post-training pruning</strong>: In this approach, the LLM is first trained using standard techniques and then pruned using one of the pruning algorithms. The pruned LLM is then fine-tuned to preserve its accuracy.</li>



<li><strong>Iterative pruning</strong>: Here, the model is trained using standard training techniques and then pruned iteratively over several rounds of training and pruning. This approach can achieve higher levels of pruning while preserving accuracy.</li>
</ol>



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<p>You can explore <a href="https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/7126bf7beed4c4c3a05bcc2dac8baa3c/pruning_tutorial.ipynb" target="_blank" rel="noreferrer noopener nofollow">this</a> Colab notebook by PyTorch to better understand MP.&nbsp;</p>



<h5 class="wp-block-heading">Model distillation</h5>



<p>Model distillation (MD) is a technique used to transfer knowledge from a large LLM, called the teacher, to a smaller, more efficient model, called the student. It is used in the context of model compression. In a nutshell, the teacher model provides guidance and feedback to the student model during training. See the image below.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=1200%2C628&#038;ssl=1" alt="Illustration of DistilBERT’s distillation process " class="wp-image-36665" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">DistilBERT’s distillation process | <a href="https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>MD involves training the smaller, more efficient student model to mimic the behavior of the larger, more complex teacher LLM. The student model is trained using a combination of labeled data and the output probabilities of the larger LLM.&nbsp;</p>



<p>There are several approaches to model distillation for LLMs, including:</p>



<ol class="wp-block-list">
<li><strong>Knowledge distillation</strong>: In this approach, the smaller model is trained to mimic the output probabilities of the larger LLM using a <strong>temperature scaling factor</strong>. The temperature scaling factor is used to soften the output probabilities of the teacher model, allowing the smaller model to learn from the teacher model&#8217;s behavior more effectively.</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>Self-distillation:</strong> In this approach, the larger LLM is used to generate training examples for the smaller model by applying the teacher model to unlabeled data. The smaller model is then trained on these generated examples, allowing it to learn from the behavior of the larger LLM without requiring labeled data.</li>
</ol>



<ol class="wp-block-list" start="3">
<li><strong>Ensemble distillation</strong>: In this approach, multiple smaller models are trained to mimic the behavior of different sub-components of the larger LLM. The outputs of these smaller models are combined to form an ensemble model that approximates the behavior of the larger LLM.</li>
</ol>
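<p>The temperature-scaled objective behind knowledge distillation can be sketched in a few lines of PyTorch. Everything below (tensor shapes, the <code>alpha</code> weighting, <code>T=2.0</code>) is illustrative rather than prescriptive:</p>

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted mix of cross-entropy on hard labels and a KL term that
    pushes the student toward the teacher's temperature-softened outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard-label scale
    return alpha * hard + (1 - alpha) * soft

# Example: a batch of 8 examples over a 10-class output space.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)  # teacher logits are fixed (no gradient)
labels = torch.randint(0, 10, (8,))

loss = distillation_loss(student, teacher, labels)
loss.backward()  # gradients flow only into the student
```

<p>A higher temperature softens the teacher&#8217;s distribution, exposing the relative probabilities of wrong classes (the &#8220;dark knowledge&#8221;) that the student learns from.</p>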



<h3 class="wp-block-heading" class="wp-block-heading" id="h-optimizing-hardware-and-software-requirements">Optimizing hardware and software requirements</h3>



<p>Hardware is an important consideration when deploying LLMs. Here are some steps you can take to optimize hardware performance:</p>



<ol class="wp-block-list">
<li><strong>Choose hardware that matches the LLM&#8217;s requirements</strong>: Depending on the LLM&#8217;s size and complexity, you may need hardware with a large amount of RAM, high-speed storage, or multiple GPUs to speed up inference. Opt for hardware that provides the necessary processing power, memory, and storage capacity, without overspending on irrelevant features.</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>Use specialized hardware</strong>: You can use specialized hardware such as TPUs (Tensor Processing Units) or FPGAs (Field-Programmable Gate Arrays) that are designed specifically for deep learning tasks. Similarly, accelerated linear algebra (XLA) can be leveraged at inference time.&nbsp;</li>
</ol>



<div id="separator-block_446599db61dffc738e3e8218f00af26a"
         class="block-separator block-separator--20">
</div>



<p>Although such hardware can be expensive, there are smart ways to consume it. You can opt to be charged on demand for the hardware you use. For instance, Elastic Inference from AWS SageMaker helps you lower your cost when the model is not fully utilizing a GPU instance for inference.&nbsp;</p>



<ol class="wp-block-list" start="3">
<li><strong>Use optimized libraries</strong>: You can use optimized libraries such as TensorFlow, PyTorch, or JAX that leverage hardware-specific features to speed up computation without needing additional hardware.&nbsp;</li>
</ol>



<ol class="wp-block-list" start="4">
<li><strong>Tune the batch size</strong>: Consider tuning the batch size during inference to maximize hardware utilization and improve inference speed. This inherently reduces the hardware requirement, thus cutting the cost.&nbsp;</li>
</ol>
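<p>A simple way to approach batch-size tuning is to benchmark throughput at several candidate sizes and keep the smallest batch that saturates your hardware. The toy model and batch sizes below are placeholders, not recommendations:</p>

```python
import time

import torch
import torch.nn as nn

# Stand-in for a deployed model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

def throughput(batch_size: int, n_iters: int = 20) -> float:
    """Measure inference throughput (examples/second) at a given batch size."""
    x = torch.randn(batch_size, 512)
    with torch.no_grad():
        model(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed

for bs in (1, 8, 32, 128):
    print(f"batch size {bs:>3}: {throughput(bs):,.0f} examples/s")
```

<p>On GPUs, remember to synchronize the device before reading the timer; throughput typically rises with batch size until memory or compute saturates, after which latency per request grows with no throughput gain.</p>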



<ol class="wp-block-list" start="5">
<li><strong>Monitor and optimize</strong>: Finally, monitor the LLM&#8217;s performance during deployment and optimize the hardware configuration as needed to achieve the best performance.</li>
</ol>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-cost-efficient-scalability">Cost efficient scalability</h3>



<p>Here’s how you can scale your large NLP models while keeping costs in check:</p>



<ol class="wp-block-list">
<li><strong>Choose the right inference option</strong>: pick one that scales automatically, such as serverless inference, as this reduces deployment costs when demand is low.&nbsp;</li>
</ol>



<div id="separator-block_446599db61dffc738e3e8218f00af26a"
         class="block-separator block-separator--20">
</div>



<p>A rigid architecture occupies the same amount of memory even when demand is low, so deployment and maintenance costs stay constant. A scalable architecture, by contrast, can scale horizontally or vertically to accommodate an increased workload and return to its original configuration when the model is dormant. This reduces maintenance costs whenever the additional nodes are not in use.&nbsp;</p>



<ol class="wp-block-list" start="2">
<li><strong>Optimize inference performance</strong>, by using hardware acceleration, such as GPUs or TPUs, and by optimizing the inference code.</li>
</ol>



<ol class="wp-block-list" start="3">
<li>Amazon&#8217;s Elastic Inference is yet another great option, as it can reduce costs by up to 75% because you no longer pay for a full dedicated GPU instance during inference. For more on Elastic Inference, read this article <a href="https://www.projectpro.io/recipes/introduction-amazon-elastic-inference-and-its-use-cases" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</li>
</ol>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-cutting-energy-costs">Cutting energy costs</h3>



<ol class="wp-block-list">
<li><strong>Choose an energy-efficient cloud infrastructure</strong> that uses renewable energy sources or carbon offsets to reduce the carbon footprint of its data centers. You can also consider choosing energy-efficient GPUs. Check out <a href="https://www.wired.com/story/amazon-google-microsoft-green-clouds-and-hyperscale-data-centers/" target="_blank" rel="noreferrer noopener nofollow">this</a> article by Wired to understand more.&nbsp;</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>Use caching</strong>, which reduces the computational requirements of LLM inference by storing frequently requested responses in memory. This can significantly cut the number of computations needed to serve user requests. Caching also helps with bandwidth: <strong>frequently accessed data kept in cache memory can be served quickly without consuming additional bandwidth</strong>, so you can avoid provisioning extra storage and memory devices.&nbsp;</li>
</ol>
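<p>At its simplest, response caching can be a memoized wrapper around the inference call. In this sketch, <code>run_model</code> is a hypothetical stand-in for an expensive LLM call, and the counter just demonstrates that repeated prompts never reach the model:</p>

```python
from functools import lru_cache

model_calls = {"n": 0}

def run_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive LLM inference call."""
    model_calls["n"] += 1
    return f"response to: {prompt}"

@lru_cache(maxsize=10_000)
def cached_inference(prompt: str) -> str:
    # Identical prompts are served from memory; only misses hit the model.
    return run_model(prompt)

cached_inference("What is MLOps?")  # computed by the model
cached_inference("What is MLOps?")  # served from the cache
print(model_calls["n"])  # → 1
```

<p>In production you would more likely use an external cache such as Redis with an expiry policy, and perhaps normalize or embed prompts to raise hit rates, but the cost-saving principle is the same.</p>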



<h2 class="wp-block-heading" class="wp-block-heading" id="h-deploying-large-nlp-models-other-useful-tips">Deploying large NLP models: other useful tips</h2>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-estimating-the-nlp-model-size-before-training">Estimating the NLP model size before training</h3>



<p>Keeping your model size in check could in turn keep your infrastructure costs in check. Here are a few things you can keep in mind while getting your large NLP model ready.</p>



<ol class="wp-block-list">
<li><strong>Consider the available resources</strong>: The size of the LLM for deployment should take into account the available hardware resources, including memory, processing power, and storage capacity. The LLM&#8217;s size should be within the limits of the available resources to ensure optimal performance.</li>



<li><strong>Fine-tuning: </strong>Choose a model with optimal accuracy and then fine-tune it on a task-specific dataset. This step will increase the efficiency of the LLM and keep its size from spiraling out of control.</li>



<li><strong>Consider the tradeoff between size and performance</strong>: The LLM&#8217;s size should be selected based on the tradeoff between size and performance. A larger model size may provide better performance but may also require more resources and time. Therefore, it is essential to find the optimal balance between size and performance.</li>
</ol>
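<p>A quick back-of-the-envelope check before training is to count parameters and multiply by the bytes per parameter at the precision you plan to deploy. The toy architecture below is a placeholder for your own model definition:</p>

```python
import torch.nn as nn

def parameter_memory_gb(model: nn.Module, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the model's parameters."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param / 1e9

# Placeholder architecture; substitute your own before training.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

for precision, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{precision}: {parameter_memory_gb(model, nbytes):.3f} GB")
```

<p>Keep in mind this covers parameters only; activations, optimizer state during training, and key-value caches at inference time all add to the real footprint.</p>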



<h3 class="wp-block-heading" class="wp-block-heading" id="h-use-a-lightweight-deployment-framework">Use a lightweight deployment framework</h3>



<p>Many LLMs are too large to be deployed directly to a production environment. Consider using a lightweight deployment framework like <strong>TensorFlow Serving</strong> or <strong>TorchServe</strong> that can host the model and serve predictions over a network. These frameworks can help reduce the overhead of loading and running the model on the server thereby reducing the deployment and infrastructure costs.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-post-deployment-model-monitoring">Post-deployment model monitoring</h3>



<p>Model monitoring helps optimize the infrastructure cost of deployment by providing insights into the performance and resource utilization of deployed models. By monitoring the resource consumption of deployed models, such as CPU, memory, and network usage, you can identify areas that can help you optimize your infrastructure usage to reduce costs.&nbsp;</p>



<ul class="wp-block-list">
<li>Monitoring can identify <strong>underutilized resources</strong>, allowing you to scale back on <strong>unused resources</strong>, and reducing infrastructure costs.&nbsp;</li>



<li>Monitoring can identify resource-intensive operations or models, enabling organizations to optimize their architecture or <strong>refactor</strong> the model to be more efficient. This can also lead to cost savings.&nbsp;</li>
</ul>



<section id="blog-intext-cta-block_2dcca4c37b6d0bc236d9697d9083fe96" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-check-also">Check also</h3>
    
            <p><a href="/blog/tips-to-train-nlp-models" target="_blank" rel="noopener">Tips and Tricks to Train State-Of-The-Art NLP Models</a></p>
    
    </section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-key-takeaways">Key takeaways</h2>



<div id="case-study-numbered-list-block_aba3fea0529ec7644a0f28bf7b6d1220"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Set a budget.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Calculate the size of the model.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Use model compression techniques like pruning, quantization, and distillation to decrease the memory and computation required for deployment.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Utilize cloud computing services like AWS, Google Cloud, and Microsoft Azure for cost-effective solutions with scalability options.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Leverage serverless computing for a pay-per-use model, lower operational overhead, and auto-scaling.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Optimize hardware acceleration, such as GPUs, to speed up model training and inference.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Regularly monitor resource usage to identify areas where costs can be reduced, such as underutilized resources or overprovisioned instances.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Continuously optimize your model size and hardware for cost-efficient inference.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">9</span>
                Keep software and security patches up to date to ensure safety.            </li>
            </ul>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>In this article, we explored the challenges we face when deploying an LLM and the inflated infrastructural cost associated with them. Simultaneously, we also addressed each of these difficulties with the necessary techniques and solutions.&nbsp;</p>



<p>Out of all the solutions we discussed, the two I recommend most for reducing infrastructure costs during deployment are <strong>elastic</strong> and <strong>serverless</strong> inference. Model compression is valid and useful, but when demand is high, even a smaller model can consume resources like a larger one, driving up infrastructure costs. We therefore need a scalable, pay-per-demand approach, and that&#8217;s where these inference services come in handy.&nbsp;</p>



<p>It goes without saying that my recommendation might not be the most ideal for your use case, and you can pick any of these approaches depending on the kind of problems you are dealing with. I hope what we discussed here will go a long way in helping you cut down your deployment infrastructure costs for your large NLP models.&nbsp;</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://research.aimultiple.com/large-language-model-training/" target="_blank" rel="noreferrer noopener nofollow">Large Language Model Training in 2023</a></li>



<li><a href="https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf" target="_blank" rel="noreferrer noopener nofollow">https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf</a></li>



<li><a href="https://research.aimultiple.com/ai-chip-makers/" target="_blank" rel="noreferrer noopener nofollow">Top 10 AI Chip Makers of 2023: In-depth Guide&nbsp;</a></li>



<li><a href="https://www.nvidia.com/en-us/data-center/dgx-a100/" target="_blank" rel="noreferrer noopener nofollow">https://www.nvidia.com/en-us/data-center/dgx-a100/</a></li>



<li><a href="https://arxiv.org/pdf/2302.13971.pdf" target="_blank" rel="noreferrer noopener nofollow">LLaMA: A foundational, 65-billion-parameter large language model</a></li>



<li><a href="https://arxiv.org/pdf/2203.15556.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/2203.15556.pdf</a></li>



<li><a href="https://huggingface.co/docs/transformers/model_doc" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/docs/transformers/model_doc</a></li>



<li><a href="https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast</a></li>



<li><a href="https://sunniesuhyoung.github.io/files/LLM.pdf" target="_blank" rel="noreferrer noopener nofollow">https://sunniesuhyoung.github.io/files/LLM.pdf</a></li>



<li><a href="https://twitter.com/tomgoldsteincs/status/1600196995389366274?lang=en" target="_blank" rel="noreferrer noopener nofollow">https://twitter.com/tomgoldsteincs/status/1600196995389366274?lang=en</a></li>



<li><a href="https://arxiv.org/pdf/1910.02054.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/1910.02054.pdf</a></li>



<li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html</a></li>



<li>Jaime Sevilla et al. (2022), &#8220;Estimating Training Compute of Deep Learning Models&#8221;. Published online at<a href="http://epochai.org/" target="_blank" rel="noreferrer noopener nofollow"> epochai.org</a>. Retrieved from: &#8216;<a href="https://epochai.org/blog/estimating-training-compute" target="_blank" rel="noreferrer noopener nofollow">https://epochai.org/blog/estimating-training-compute</a>&#8216; [online resource]</li>



<li><a href="https://arxiv.org/abs/2001.08361" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/abs/2001.08361</a></li>



<li><a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf" target="_blank" rel="noreferrer noopener nofollow">https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf</a></li>



<li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html</a></li>



<li><a href="https://aws.amazon.com/sagemaker/neo/" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/sagemaker/neo/</a></li>



<li><a href="https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/7126bf7beed4c4c3a05bcc2dac8baa3c/pruning_tutorial.ipynb" target="_blank" rel="noreferrer noopener nofollow">https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/7126bf7beed4c4c3a05bcc2dac8baa3c/pruning_tutorial.ipynb</a></li>



<li><a href="https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker/" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/blogs/machine-learning/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker/</a></li>



<li><a href="https://openai.com/blog/improving-language-model-behavior/" target="_blank" rel="noreferrer noopener nofollow">Improving Language Model Behavior by Training on a Curated Dataset</a></li>



<li><a href="https://towardsdatascience.com/how-to-deploy-large-size-deep-learning-models-into-production-66b851d17f33" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/how-to-deploy-large-size-deep-learning-models-into-production-66b851d17f33</a></li>



<li><a href="https://huggingface.co/blog/large-language-models" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/blog/large-language-models</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/</a></li>



<li><a href="https://openreview.net/pdf?id=NiEtU7blzN" target="_blank" rel="noreferrer noopener nofollow">Large Language Models Can Self-Improve</a></li>



<li><a href="https://spot.io/resources/cloud-cost/cloud-cost-optimization-15-ways-to-optimize-your-cloud/" target="_blank" rel="noreferrer noopener nofollow">https://spot.io/resources/cloud-cost/cloud-cost-optimization-15-ways-to-optimize-your-cloud/</a></li>



<li><a href="https://dataintegration.info/choose-the-best-ai-accelerator-and-model-compilation-for-computer-vision-inference-with-amazon-sagemaker" target="_blank" rel="noreferrer noopener nofollow">https://dataintegration.info/choose-the-best-ai-accelerator-and-model-compilation-for-computer-vision-inference-with-amazon-sagemaker</a></li>



<li><a href="https://medium.com/data-science-at-microsoft/model-compression-and-optimization-why-think-bigger-when-you-can-think-smaller-216ec096f68b" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/data-science-at-microsoft/model-compression-and-optimization-why-think-bigger-when-you-can-think-smaller-216ec096f68b</a></li>



<li><a href="https://medium.com/picsellia/how-to-optimize-computer-vision-models-for-edge-devices-851b20f7cf03" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/picsellia/how-to-optimize-computer-vision-models-for-edge-devices-851b20f7cf03</a></li>



<li><a href="https://huggingface.co/docs/transformers/v4.17.0/en/parallelism#which-strategy-to-use-when" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/docs/transformers/v4.17.0/en/parallelism#which-strategy-to-use-when</a></li>



<li><a href="https://medium.com/@mlblogging.k/9-libraries-for-parallel-distributed-training-inference-of-deep-learning-models-5faa86199c1f" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/@mlblogging.k/9-libraries-for-parallel-distributed-training-inference-of-deep-learning-models-5faa86199c1f</a></li>



<li><a href="https://towardsdatascience.com/how-to-estimate-and-reduce-the-carbon-footprint-of-machine-learning-models-49f24510880" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/how-to-estimate-and-reduce-the-carbon-footprint-of-machine-learning-models-49f24510880</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">19513</post-id>	</item>
		<item>
		<title>Argo vs Airflow vs Prefect: How Are They Different</title>
		<link>https://neptune.ai/blog/argo-vs-airflow-vs-prefect-differences</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 04 Nov 2022 09:04:36 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.staginglab.eu/?p=11822</guid>

					<description><![CDATA[We live at a stage where ML and DL software are everywhere. New startups and various other companies are adapting and integrating AI systems into their new and already existing workflows to be much more productive and efficient. These systems reduce manual tasks and deliver smart and intelligent solutions. Although they are quite proficient in&#8230;]]></description>
										<content:encoded><![CDATA[
<p>We live in an age where ML and DL software are everywhere. New startups and established companies alike are adopting and integrating AI systems into their new and already existing workflows to become more productive and efficient. These systems reduce manual tasks and deliver smart and intelligent solutions. Although they are quite proficient in what they do, all AI systems have different modules that must be brought together to build an operational and effective product.&nbsp;</p>



<p>These systems can be broadly divided into five phases, keeping in mind that these phases contain various additional and repetitive tasks:</p>



<div id="case-study-numbered-list-block_8fb9f2e318eeb666ce5306ddd4305ddc"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data collection             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Feature engineering            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Modeling (which includes training, validation, testing, and inference)            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Deployment             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Monitoring            </li>
            </ul>
</div>



<p>Executing these phases individually can take a lot of time and continuous human effort. These phases must be synchronized and sequentially orchestrated in order to get the best out of them. This can be achieved by <strong>task orchestration tools</strong> that enable ML practitioners to effortlessly bring together and orchestrate different phases of an AI system.</p>



<div id="separator-block_1a2475869ea1cf43bc2807deb583e3aa"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-1-924x1024.png?resize=693%2C768&#038;ssl=1" alt="Phases of AI systems" class="wp-image-72252" width="693" height="768"/><figcaption class="wp-element-caption"><em>Phases of AI systems | <a href="https://www.datarevenue.com/en-blog/what-we-are-loving-about-prefect" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<div id="separator-block_de1f5f20c22a4cc1abab8756d11e1439"
         class="block-separator block-separator--10">
</div>



<p>In this article, we will explore:</p>



<div id="case-study-numbered-list-block_91410d4b859b53150f683e123f1266c2"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                What task orchestration tools are            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Three tools that help ML practitioners orchestrate their workflows            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                How the three tools compare            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Which tool to use and when?            </li>
            </ul>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-task-orchestration-tools-what-they-are-and-how-are-they-useful">Task orchestration tools: What they are and how are they useful?</h2>



<p>Orchestration tools organize the various tasks in an MLOps pipeline and execute them in the right order, coordinating many tasks at any given time. One of their key properties is the distribution of tasks. Most of these tools rely on a DAG, or Directed Acyclic Graph &#8211; a term you will come across often in this article. A DAG is a graph representation of the tasks that need to be executed and the dependencies between them.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-2-1024x928.png?resize=768%2C696&#038;ssl=1" alt="Explanation of DAG" class="wp-image-72253" width="768" height="696"/><figcaption class="wp-element-caption"><em>Graphic explanation of DAG | <a href="https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow" target="_blank" rel="noreferrer noopener nofollow">Source</a>&nbsp;</em></figcaption></figure>
</div>


<div id="separator-block_de1f5f20c22a4cc1abab8756d11e1439"
         class="block-separator block-separator--10">
</div>



<p>A DAG allows independent tasks in a pipeline to be distributed to different modules and processed in parallel, which improves efficiency (see the image above). It also ensures that dependent tasks are arranged in the correct sequence, so they execute properly and deliver timely results.</p>
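<p>This behavior can be demonstrated with nothing more than the Python standard library. The sketch below uses <code>graphlib.TopologicalSorter</code> (Python 3.9+) to compute "waves" of tasks whose dependencies are all satisfied; each wave could safely run in parallel. The task names are illustrative, not from any specific tool.</p>

```python
# Sketch: how a DAG lets an orchestrator run independent tasks in
# parallel "waves" while respecting dependencies. Standard library only;
# the task names are illustrative placeholders.
from graphlib import TopologicalSorter

# Mapping: task -> set of tasks it depends on
dag = {
    "preprocess_a": {"ingest"},
    "preprocess_b": {"ingest"},
    "train": {"preprocess_a", "preprocess_b"},
    "evaluate": {"train"},
}

ts = TopologicalSorter(dag)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies are done
    waves.append(ready)             # each wave could run in parallel
    ts.done(*ready)

# waves == [["ingest"], ["preprocess_a", "preprocess_b"],
#           ["train"], ["evaluate"]]
```

<p>Note that <code>preprocess_a</code> and <code>preprocess_b</code> land in the same wave: neither depends on the other, so a scheduler is free to dispatch them to different workers at the same time.</p>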



<p>Another important property of these tools is adaptability to agile environments. It allows ML practitioners to plug in other tools for monitoring, deployment, analysis, preprocessing, testing, inference, and so on. An orchestration tool that can coordinate tasks spanning many different tools is generally a good choice. This is not always the case, though: some tools are strictly confined to their native environments, which does not bode well for users who want to integrate third-party applications.&nbsp;</p>



<p>In this article, we will explore three tools – <a href="https://argoproj.github.io/" target="_blank" rel="noreferrer noopener nofollow">Argo</a>, <a href="https://airflow.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Airflow</a>, and <a href="https://www.prefect.io/" target="_blank" rel="noreferrer noopener nofollow">Prefect</a>, that incorporate these two properties and various others as well.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tldr-comparison-table">TL;DR comparison table&nbsp;</h2>



<p>Here is a table inspired by <a href="https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b" target="_blank" rel="noreferrer noopener nofollow">Ian McGraw&#8217;s article</a>, which provides an overview of what these tools offer for orchestration and how they differ from each other in these aspects.</p>



<div id="medium-table-block_d7ed176f201196169251f79293cd9a44"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Features                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Argo                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Airflow                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Prefect                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>1.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Fault-tolerant scheduling</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>2.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>UI Support</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>3.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Workflow definition language</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>YAML</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Python</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Python</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>4.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>3rd party integration</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Since Argo is container-based it doesn’t come with pre-installed 3rd party systems.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports various 3rd party integrations</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports various 3rd party integrations</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>5.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Workflows</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Dynamic workflow</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Static workflow</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Dynamic workflow</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>6.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Accessibility</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Hybrid (open-source and subscription-based)</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>7.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parametrized workflows</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Has an extensive parameter-passing syntax</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Does not have a mechanism to pass parameters</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports parameters as first-class objects</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>8.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Kubernetes support</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>9.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Scalability</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Highly parallel</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Horizontally scalable</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parallel when using Kubernetes</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>10.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Community Support</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Large</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Large</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Medium</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>11.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>State storage</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>All state is stored within the Kubernetes workflow</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Postgres DB</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Postgres DB</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>12.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Ease of deployment</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Medium</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Medium</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Difficult</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>13.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Event-driven workflows</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>14.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Scripts in DAG definition</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Argo uses text scripts to pass in containers.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Airflow uses Python-based DAG definition language.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Prefect uses a functional, Python-based flow API.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>15.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Use Cases</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>&#8211; CI/CD<br />&#8211; Data processing<br />&#8211; Infrastructure automation<br />&#8211; Machine learning<br />&#8211; Stream processing</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>&#8211; ELT<br />&#8211; ML workflow<br />&#8211; ML automation</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>&#8211; Automating data workflows (ELT)<br />&#8211; ML workflow and orchestration<br />&#8211; CI/CD</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>Now let’s explore each of these tools in more detail under three primary categories:&nbsp;</p>



<div id="case-study-numbered-list-block_dddd3f0beb5ac03bbd1aab5378b1fee1"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Core concepts            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Features they offer            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Why use it?            </li>
            </ul>
</div>



<h2 class="wp-block-heading" id="h-core-concepts">Core concepts</h2>



<p>All three tools are built on a set of concepts or principles around which they function. Argo, for instance, is built around two concepts: <strong>Workflow </strong>and<strong> Templates</strong>, which form the backbone of its system. Likewise, Airflow is built around the <strong>Webserver, Scheduler, Executor, </strong>and<strong> Database,</strong> while Prefect is built around <strong>Flows </strong>and<strong> Tasks</strong>. It is important to understand what these concepts mean, what they offer, and how they benefit us.</p>



<p>Before going into the details, here is a brief summary of the concepts.&nbsp;</p>



<div id="medium-table-block_1bef54a85a304ee96f66127d4d1b3d93"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Properties of the Concepts                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Argo</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It has two concepts: <strong>Workflow</strong> and <strong>Templates</strong>. Essentially, the Workflow is the YAML config file. It provides structure and robustness, as workflows are managed using DAGs. Templates, on the other hand, are the functions that need to be executed. <br data-rich-text-line-break="true" />Workflows are both static and dynamic, meaning that you can modify steps on the go.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Airflow</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It has four concepts: Webserver, Scheduler, Executor, and Database. They divide the whole process into segments and act as the major components that automate it. This makes the workflow efficient: because each component relies on the others, it is easy to find and report bugs and errors, and monitoring is straightforward.<br />
Although Airflow uses DAGs, they are only static, not dynamic.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Prefect</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It leverages two concepts: Flows and Tasks. In Prefect, the DAG is defined as a flow object, created using Python, which provides the flexibility and robustness to define complex pipelines.<br />
Tasks are like templates in Argo: they define a specific function that needs to be executed, again in Python.<br />
Because Prefect uses Python as its main programming language, it is easy to work with.</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p class="has-text-align-center has-small-font-size"><em>Summary of the concepts</em></p>



<p>Now, let’s understand these concepts in detail.&nbsp;</p>



<h3 class="wp-block-heading" id="h-argo">Argo&nbsp;</h3>



<p>Argo uses two core concepts:</p>



<ol class="wp-block-list">
<li>Workflow</li>



<li>Templates</li>
</ol>



<h4 class="wp-block-heading">Workflow</h4>



<p>In Argo, the Workflow is the most integral component of the whole system. It has two important functions:&nbsp;</p>



<ol class="wp-block-list">
<li>It defines the tasks that need to be executed.</li>



<li>It stores the state of the tasks, which means that it serves as both a static and a dynamic object.</li>
</ol>



<p>The Workflow is defined in the workflow.spec section of the YAML configuration file, which consists of a list of <strong>templates</strong> and an <strong>entrypoint</strong>. The Workflow can be considered a file that hosts different templates, each defining a function that needs to be executed.&nbsp;</p>



<p>As mentioned earlier, Argo leverages the <strong>Kubernetes</strong> engine for workflow synchronization, and the configuration file uses the same syntax as Kubernetes. The Workflow YAML file has the following dictionaries or objects:</p>



<ol class="wp-block-list">
<li>apiVersion: This is where you define the version of the API that the object uses.</li>



<li>kind: It defines the type of Kubernetes object that needs to be created. For instance, to deploy an app you might use <strong>Deployment</strong>; at other times you might use Service. In this case, we use <strong>Workflow</strong>.</li>



<li>metadata: It enables us to define unique properties for that object, such as a name or UUID.&nbsp;</li>



<li>spec: It enables us to define specifications concerning the Workflow, namely the entrypoint and the templates.&nbsp;</li>



<li>templates: This is where we define the tasks. A template can specify the Docker image to run and various other scripts.&nbsp;</li>
</ol>
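<p>Putting these fields together, a minimal Workflow manifest could look as follows (a sketch following the Argo documentation; the template name and image are placeholders):</p>

```yaml
apiVersion: argoproj.io/v1alpha1   # the Argo Workflows API version
kind: Workflow                     # the Kubernetes object type to create
metadata:
  generateName: hello-world-       # unique properties of this object
spec:
  entrypoint: whalesay             # which template to run first
  templates:
    - name: whalesay               # the task itself
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["hello world"]
```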



<h4 class="wp-block-heading">Templates&nbsp;</h4>



<p>In Argo, there are two major types of templates, which are sub-classified into six types in total. The two major types are <strong>definitions </strong>and <strong>invocators.&nbsp;</strong></p>



<h5 class="wp-block-heading">Definition</h5>



<p>This template, as the name suggests, defines the type of task to run in a Docker container. Definition templates are divided into four categories:</p>



<ol class="wp-block-list">
<li><strong>Container</strong>: It enables users to schedule a workflow step in a container. Since applications in Kubernetes are containerized, the container is defined with the same syntax as in a Kubernetes manifest. It is also one of the most used templates.</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/</span>
- name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: [<span class="hljs-string" style="color: rgb(221, 17, 68);">"hello world"</span>]</pre>



<ol class="wp-block-list" start="2">
<li><strong>Script</strong>: If you want a wrapper around a container, the script template is a good fit. It is similar in structure to the container template but adds a source field, which allows you to define a script in place. You can define any variables or commands based on your requirements. Once defined, the script is saved into a file and executed for you, and its output is exported as an Argo variable.</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/</span>
 - name: gen-random-int
    script:
      image: python:alpine3<span class="hljs-number" style="color: teal;">.6</span>
      command: [python]
      source: <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> random

        	i = random.randint(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">100</span>)
        	  	print(i)
</pre>



<ol class="wp-block-list" start="3">
<li><strong>Resource</strong>: It allows you to perform operations like get, create, apply, and delete directly on the Kubernetes cluster.</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/</span>
- name: k8s-owner-reference
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          generateName: owned-eg-
        data:
          some: value</pre>



<ol class="wp-block-list" start="4">
<li><strong>Suspend</strong>: It introduces a time dimension to the workflow: it can suspend execution for a defined duration or until the workflow is resumed manually.&nbsp;</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/ </span>
 - name: delay
    suspend:
      duration: <span class="hljs-string" style="color: rgb(221, 17, 68);">"20s"</span></pre>



<h5 class="wp-block-heading">Invocators</h5>



<p>Once the templates are defined, they can be invoked or called on demand by other templates, called invocators. Invocators act as controller templates that control the execution of the defined templates.&nbsp;</p>



<p>There are two types of invocator templates:</p>



<ol class="wp-block-list">
<li><strong>Steps: </strong>It allows you to define tasks as a sequence of steps, directly in the workflow&#8217;s YAML.&nbsp;</li>



<li><strong>Directed acyclic graph</strong>: Argo enables its users to manage steps with multiple dependencies in their workflow. This allows parallel execution of different workflows in their respective containers. These types of workflows are managed using a directed acyclic graph or DAG. For instance, if you are working on image segmentation and generation for medical purposes then you can create a pipeline that:
<ul class="wp-block-list">
<li>Processes the images.</li>



<li>Distributes the images (or dataset) to the respective DL models for image segmentation and generation pipeline.</li>



<li>Continuously predicts segmentation masks and updates the dataset storage with new images after proper inspection.&nbsp;</li>
</ul>
</li>
</ol>
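<p>Such a pipeline could be expressed with Argo&#8217;s <code>dag</code> template roughly as follows (a sketch; the template and task names are hypothetical):</p>

```yaml
- name: medical-imaging-pipeline
  dag:
    tasks:
      - name: preprocess
        template: process-images
      - name: segment
        dependencies: [preprocess]        # waits for preprocess
        template: segmentation-model
      - name: generate
        dependencies: [preprocess]        # runs in parallel with segment
        template: generation-model
      - name: update-dataset
        dependencies: [segment, generate] # waits for both models
        template: update-storage
```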



<h3 class="wp-block-heading" id="h-airflow">Airflow</h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="327" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-3-1024x327.png?resize=1024%2C327&#038;ssl=1" alt="Feature Pipeline- Airflow" class="wp-image-72254"/><figcaption class="wp-element-caption"><em>Feature Pipeline | <a href="https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Apache Airflow consists of four main components:</p>



<ol class="wp-block-list">
<li>Webserver</li>



<li>Scheduler</li>



<li>Executor</li>



<li>Database</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-4.png?resize=744%2C484&#038;ssl=1" alt="Main components of Apache Airflow" class="wp-image-72255" width="744" height="484"/><figcaption class="wp-element-caption"><em> Four main components of Apache Airflow | <a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Webserver</h4>



<p>It provides the user with a UI for inspecting, triggering, and debugging all DAGs and tasks, and essentially serves as the entry point to Airflow. The Webserver leverages Python-Flask to manage all the requests made by the user. It also renders the state metadata from the database and displays it in the UI.</p>



<h4 class="wp-block-heading">Scheduler</h4>



<p>It monitors and manages all the tasks and DAGs. It examines the state of the tasks by querying the database to decide the order in which tasks need to be executed. The scheduler then resolves dependencies and submits task instances to the executor once the dependencies are taken care of.</p>



<h4 class="wp-block-heading">Executor</h4>



<p>It runs the task instances which are ready to run. It executes all the tasks as scheduled by the scheduler. There are four types of executors:</p>



<ol class="wp-block-list">
<li>Sequential Executor</li>



<li>Local Executor</li>



<li>Celery Executor</li>



<li>Kubernetes Executor</li>
</ol>
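<p>The executor is a deployment-level choice. As a configuration sketch, it is selected in the <code>[core]</code> section of <code>airflow.cfg</code> (it can also be set via the <code>AIRFLOW__CORE__EXECUTOR</code> environment variable):</p>

```ini
[core]
# one of: SequentialExecutor, LocalExecutor, CeleryExecutor, KubernetesExecutor
executor = LocalExecutor
```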



<h4 class="wp-block-heading">Metadata Database</h4>



<p>It stores the state of the tasks and DAGs, which the scheduler uses for proper scheduling of task instances. It is worth noting that Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to store this information.&nbsp;</p>



<h3 class="wp-block-heading" id="h-prefect">Prefect</h3>



<p>Prefect uses two core concepts:&nbsp;</p>



<ol class="wp-block-list">
<li>Flows</li>



<li>Tasks</li>
</ol>



<h4 class="wp-block-heading">Flows</h4>



<p>In Prefect, flows are Python objects that can be interacted with; the DAG is defined as a flow object. See the image below.&nbsp;</p>



<div id="separator-block_1a2475869ea1cf43bc2807deb583e3aa"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="172" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-5-1024x172.png?resize=1024%2C172&#038;ssl=1" alt="DAG defined as flow objects " class="wp-image-72256"/><figcaption class="wp-element-caption"><em> DAG defined as flow objects | <a href="https://spell.ml/blog/orchestrating-spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<div id="separator-block_1a31e0f7eaf92c77e2ac2ec4e92ac849"
         class="block-separator block-separator--5">
</div>



<p>Flow can be imported and used as a decorator, @flow, on any given function. This transforms the existing function into a Prefect flow function, with the following advantages:</p>



<ul class="wp-block-list">
<li>The function can be monitored and governed as it is now reported to the API.</li>



<li>The activity of the function can be tracked and displayed in the UI.</li>



<li>Inputs given to the function can be validated.</li>



<li>Various workflow features like retries, distributed execution et cetera can be added to the function.<em>&nbsp;</em></li>



<li>Timeouts can be enforced to prevent unintentionally long-running workflows.&nbsp;</li>
</ul>



<p>Here is a code block depicting the implementation of a flow object.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Source: https://github.com/PrefectHQ/prefect</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> prefect <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> flow

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">@flow(name="GitHub Stars")</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">github_stars</span><span class="hljs-params">(repos: List[str])</span>:</span>
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> repo <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> repos:
        get_stars(repo)
</pre>



<p>In the code above, the function has been transformed into a flow named “GitHub Stars”. The function is now governed by Prefect&#8217;s orchestration rules.&nbsp;</p>



<p>Note that all workflows must be defined within a flow function, and all tasks must be called from within a flow. Keep in mind that when a flow is executed, the execution is known as a <em>flow run</em>.&nbsp;</p>



<h4 class="wp-block-heading">Tasks</h4>



<p>Tasks can be defined as specific pieces of work that need to be executed, for instance, adding two numbers. In other words, a task takes an input, performs an operation, and yields an output. Like flow, task can be imported and used as a decorator, @task, on a function. This wraps the function within the Prefect workflow, with advantages similar to those of a flow. For instance, it can automatically log information about task runs, such as runtime, tags, and final state.&nbsp;</p>



<p>The code below demonstrates how a task is defined:&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Source: https://github.com/PrefectHQ/prefect</span>

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">@task(retries=3)</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">get_stars</span><span class="hljs-params">(repo: str)</span>:</span>
    url = f<span class="hljs-string" style="color: rgb(221, 17, 68);">"https://api.github.com/repos/{repo}"</span>
    count = httpx.get(url).json()[<span class="hljs-string" style="color: rgb(221, 17, 68);">"stargazers_count"</span>]
    print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"{repo} has {count} stars!"</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># run the flow!</span>
github_stars([<span class="hljs-string" style="color: rgb(221, 17, 68);">"PrefectHQ/Prefect"</span>])</pre>



<p>To sum up, the flow looks for any task defined within its body and, once found, creates a computational graph in the same order. It then creates dependencies between tasks whenever the output of one task instance is used as the input of another.&nbsp;</p>
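<p>Prefect&#8217;s engine does this bookkeeping internally. Purely as an illustration (a stdlib-only sketch whose decorator names mirror Prefect&#8217;s API; this is not Prefect code), here is how a flow can discover dependencies from task inputs and execute tasks in order:</p>

```python
# Illustration only: a miniature, stdlib-only model of how a flow can
# discover dependencies between tasks and run them in order. The
# @task/@flow names mirror Prefect's API, but this is NOT Prefect code.
from functools import wraps


class TaskRun:
    """One recorded call of a task: its function, inputs, and upstream runs."""

    def __init__(self, fn, args):
        self.fn = fn
        self.args = args
        # Any argument that is itself a TaskRun is an upstream dependency.
        self.upstream = [a for a in args if isinstance(a, TaskRun)]
        self.result = None

    def run(self):
        if self.result is None:
            # Resolve upstream results first (dependency order), then execute.
            resolved = [a.run() if isinstance(a, TaskRun) else a for a in self.args]
            self.result = self.fn(*resolved)
        return self.result


def task(fn):
    """Calling a task does not execute it; it returns a graph node instead."""
    @wraps(fn)
    def wrapper(*args):
        return TaskRun(fn, args)
    return wrapper


def flow(fn):
    """Running the flow builds the graph, then executes the final node."""
    @wraps(fn)
    def wrapper(*args):
        return fn(*args).run()
    return wrapper


@task
def add(x, y):
    return x + y


@task
def double(x):
    return 2 * x


@flow
def pipeline(a, b):
    s = add(a, b)      # no upstream tasks
    return double(s)   # depends on the output of add


print(pipeline(3, 4))  # prints 14: double() runs only after add()
```

<p>Calling a decorated task builds a graph node instead of running immediately; the flow then resolves each node&#8217;s upstream results before executing it, which is the same ordering principle Prefect applies.</p>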



<h2 class="wp-block-heading" id="h-features">Features</h2>



<p>All three tools provide more or less the same features, but some implement them better than others, and the choice also comes down to what users are most comfortable with. Just like in the previous section, let’s begin with a summary of the features.&nbsp;</p>



<div id="medium-table-block_93f662c2568885d9ff64a5e751dabbfd"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Argo                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Airflow                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Prefect                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>User Interface</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It has a complete view of the workflow. You can define workflows straight from the UI.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Workflow is very well-maintained as it provides a number of different views.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Prefect is similar to Airflow.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Deployment Style </strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports only Kubernetes-based environments, such as managed Kubernetes services on AWS and other clouds.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Supports Kubernetes-supported environment as well as other third-party environments.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Same as Airflow</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Scalability</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parallel</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Horizontal</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parallel</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Accessibility</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Open-source and subscription-based</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Flexibility</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Rigid</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Rigid and complicated</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Flexible</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p class="has-text-align-center has-small-font-size"><em>Comparison of the features</em></p>



<p>Let’s start this section by exploring the User Interface.&nbsp;</p>



<h3 class="wp-block-heading" id="h-user-interface">User Interface</h3>



<h4 class="wp-block-heading">Argo</h4>



<p>For ease of use, Argo Workflows provides a web-based UI for defining workflows and templates. The UI serves several purposes, such as:</p>



<ul class="wp-block-list">
<li>Artifact visualization&nbsp;</li>



<li>Using generated charts to compare Machine Learning pipelines</li>



<li>Visualizing results&nbsp;</li>



<li>Debugging</li>



<li>Defining workflows</li>
</ul>



<div id="separator-block_063e430cfbfadabbcde40db0b5e58df5"
         class="block-separator block-separator--10">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="603" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-6-1024x603.png?resize=1024%2C603&#038;ssl=1" alt="Argo user interface" class="wp-image-72257"/><figcaption class="wp-element-caption"><em>Argo UI | <a href="https://github.com/argoproj/argo-workflows" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Airflow</h4>



<p>The Airflow UI provides a clean and efficient design that enables users to interact with the Airflow server, allowing them to <strong>monitor</strong> and <strong>troubleshoot</strong> the entire pipeline. It also allows editing the state of tasks in the database and manipulating the behaviour of DAGs and tasks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="616" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-7-1024x616.png?resize=1024%2C616&#038;ssl=1" alt="Airflow user interface" class="wp-image-72258"/><figcaption class="wp-element-caption"><em>Airflow UI | <a href="https://airflow.apache.org/docs/apache-airflow/stable/ui.html#" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The Airflow UI also provides various views for its users, including:</p>



<ul class="wp-block-list">
<li>DAGs View</li>



<li>Datasets View</li>



<li>Grid View</li>



<li>Graph View</li>



<li>Calendar View</li>



<li>Variable View</li>



<li>Gantt View</li>



<li>Task Duration</li>



<li>Code View</li>
</ul>



<h4 class="wp-block-heading">Prefect</h4>



<p>Prefect, like Airflow, provides an overview of all tasks, helping you visualize your workflows, tasks, and DAGs. It provides two ways to access the UI:</p>



<ol class="wp-block-list">
<li><strong>Prefect Cloud</strong>: Hosted in the cloud; it enables you to configure your personal accounts and workspaces.&nbsp;</li>



<li><strong>Prefect Orion UI</strong>: Hosted locally and open-source. You cannot configure it the way you can with Prefect Cloud.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="442" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-8-1024x442.png?resize=1024%2C442&#038;ssl=1" alt="Prefect user interface" class="wp-image-72259"/><figcaption class="wp-element-caption"><em>Prefect UI | <a href="https://docs.prefect.io/ui/overview/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Some additional features of Prefect UI:</p>



<ul class="wp-block-list">
<li>Displaying run summaries</li>



<li>Displaying details of deployed flows</li>



<li>Displaying scheduled flows&nbsp;</li>



<li>Warning notifications for late and failed runs</li>



<li>Detailed information on tasks and workflows</li>



<li>Task dependency visualization and Radar flow</li>



<li>Log details</li>
</ul>



<h3 class="wp-block-heading" id="h-deployment-style">Deployment Style</h3>



<h4 class="wp-block-heading">Argo</h4>



<p>Argo is a Kubernetes-native workflow engine, which means it:</p>



<div id="case-study-numbered-list-block_9534e29fd26811ea12e384e75357482d"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Runs on containers.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Runs on Kubernetes-supported pods.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Easy to deploy and scale.            </li>
            </ul>
</div>



<p>On the downside:</p>



<ul class="wp-block-list">
<li>Implementation is harder since workflows are defined in a configuration language (YAML).</li>
</ul>
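<p>To give a sense of what that YAML looks like, here is a minimal Argo Workflow manifest (an illustrative sketch; the image, names, and message are placeholders):</p>

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-   # Argo appends a random suffix to the name
spec:
  entrypoint: main             # the template to run first
  templates:
    - name: main
      container:               # each step runs in its own container/pod
        image: alpine:3.18
        command: [echo]
        args: ["hello from Argo"]
```

<p>Every step, dependency, and parameter is expressed declaratively in this format, which is what makes Argo powerful on Kubernetes but harder to pick up than a Python-based definition.</p>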



<h4 class="wp-block-heading">Airflow</h4>



<div id="case-study-numbered-list-block_8a83e1308936ba2cfba7d5ae62381a9a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Supports Kubernetes as well as other third-party integrations.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It runs on containers as well.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Implementation is easy.            </li>
            </ul>
</div>



<p>The downside of Airflow is:</p>



<ul class="wp-block-list">
<li>It does not scale in parallel (it scales horizontally instead).</li>



<li>Deployment needs extra effort, depending on the cloud provider you choose.&nbsp;</li>
</ul>



<h4 class="wp-block-heading">Prefect</h4>



<p>Lastly, Prefect combines traits of both Argo and Airflow:</p>



<div id="case-study-numbered-list-block_6058393be57ffaaa79c8ca34cd0ca4d3"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It can run on Containers and Kubernetes pods.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It is highly parallel and efficient.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It supports fault-tolerant scheduling.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Easy to deploy.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                It also supports third-party integrations.            </li>
            </ul>
</div>



<p>When it comes to the downside:</p>



<ul class="wp-block-list">
<li>It does not support open-source deployment with Kubernetes.&nbsp;</li>



<li>Deployment is difficult.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-scalability">Scalability</h3>



<p>When it comes to scalability, Argo and Prefect are highly parallel, which makes them efficient. Prefect stands out because it can leverage support from various third-party integrations, making it the most scalable of the three.&nbsp;</p>



<p>Airflow, on the other hand, is horizontally scalable, i.e., the number of active workers is equal to the maximum task parallelism.&nbsp;</p>



<h3 class="wp-block-heading" id="h-accessibility">Accessibility</h3>



<p>All three are open-sourced, but Prefect also comes with a <a href="https://www.prefect.io/pricing/" target="_blank" rel="noreferrer noopener nofollow">subscription-based</a> service.&nbsp;</p>



<h3 class="wp-block-heading" id="h-flexibility">Flexibility</h3>



<p>Argo and Airflow aren’t as flexible as Prefect. Since Argo is Kubernetes-native, it is confined to that environment, which makes it rigid. Airflow, in turn, is complicated because it requires a well-defined and structured template, making it not very well suited to an agile environment.&nbsp;</p>



<p>Prefect, on the other hand, enables you to create dynamic dataflows in native Python without requiring you to define a DAG. Any Python function can be transformed into a Prefect flow or task. This ensures flexibility.</p>
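<p>To illustrate the idea, here is a plain-Python sketch of the pattern (not the actual Prefect API; in Prefect you would use the <code>@task</code> and <code>@flow</code> decorators from the <code>prefect</code> package, and the names below are made up for illustration):</p>

```python
import functools

def task(fn):
    """Toy stand-in for an orchestrator's task decorator: wraps a plain
    function and records each result (a real tool would track state,
    retries, logs, etc.)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        wrapper.runs.append(result)
        return result
    wrapper.runs = []
    return wrapper

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 2 for x in data]

def pipeline():
    # The "DAG" is just the Python call graph, determined at runtime
    return transform(extract())
```

<p>Because the dependency structure is whatever the Python code does at runtime, branching, looping, and dynamically created tasks come for free.</p>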



<h2 class="wp-block-heading" id="h-why-use-these-tools">Why use these tools?</h2>



<p>So far, I’ve compared the basic concepts and features that these tools possess. Now let me give some reasons why you might choose each of these tools for your project.&nbsp;&nbsp;</p>



<h3 class="wp-block-heading" id="h-argo">Argo</h3>



<p>Here are some of the reasons why you should use Argo:</p>



<div id="case-study-numbered-list-block_7c3a28031a82c6fbbaadc383748b3c55"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                As a Kubernetes-native workflow tool, it enables you to run each step in its own Kubernetes pod.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Easy to scale because workflows can be executed in parallel.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Workflow templates offer reusability.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Similarly, artifact integrations are also reusable.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                DAG is dynamic for each run of the workflow.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Low Latency Scheduler.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Event-Driven Workflows.            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-airflow">Airflow</h3>



<p>Reasons for you to use Airflow:</p>



<div id="case-study-numbered-list-block_c0af8bcfe38bccc2ebd32c9ec2c4588a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It enables users to connect with various technologies.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It offers rich scheduling and easy-to-define pipelines.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Pythonic integration is another reason to use Airflow.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                You can create custom components as per your requirements.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Allows rollback to the previous version as workflows are stored.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Has a well-defined UI.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Multiple users can write a workflow for a given project, i.e. it is shareable.             </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-prefect">Prefect</h3>



<p>Prefect is one of the well-planned orchestration tools for MLOps. It is Python-native and requires you to put effort into the engineering side of things. One of the areas where Prefect shines is data processing and pipelines. It can be used to fetch data, apply the necessary transformations, and monitor and orchestrate the necessary tasks.</p>



<p>When it comes to tasks related to machine learning, it can be used to automate the entire data flow.&nbsp;</p>



<p>Some other reasons to use Prefect are:</p>



<div id="case-study-numbered-list-block_b16c6afb0237d5b45cd20291ed74f74b"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Provides excellent security, as it keeps your data and code private.              </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Enhanced UI and notifications delivered directly to your email or Slack.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It can be used with Kubernetes and Docker.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Efficient parallel processing of tasks.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Dynamic workflow.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Allows many third-party integrations.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Prefect uses a GraphQL API, enabling it to trigger workflows on demand.             </li>
            </ul>
</div>



<h2 class="wp-block-heading" id="h-how-to-decide">How to decide?</h2>



<p>Choosing the right tool for your project depends on what you want and what you already have. But I can lay out some criteria to help you decide which tool will be appropriate for you. You can use:</p>



<h3 class="wp-block-heading" id="h-argo">Argo</h3>



<ul class="wp-block-list">
<li>If you want to set up a workflow based on Kubernetes.</li>



<li>If you want to define your workflow as DAGs.</li>



<li>If your dataset is huge and model training requires highly parallel and distributed training.&nbsp;</li>



<li>If your task is complex.</li>



<li>If you are well-versed in YAML files. Even if you are not, learning YAML is not difficult.</li>



<li>If you want to use a Kubernetes-enabled cloud platform like GCP or AWS.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-airflow">Airflow</h3>



<ul class="wp-block-list">
<li>If you want to incorporate a lot of other 3rd party technology like Jenkins, Airbyte, Amazon, Cassandra, Docker, et cetera. Check the <a href="https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/index.html" target="_blank" rel="noreferrer noopener nofollow">list of supported third-party extensions</a>.</li>



<li>If you want to use Python to define the workflow.</li>



<li>If you want to define your workflow as DAGs.</li>



<li>If your workflow is static.</li>



<li>If you want a mature tool, since Airflow has been around for a long time.&nbsp;</li>



<li>If you want to run tasks on schedule.</li>
</ul>
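<p>Several of the criteria above come down to whether you are comfortable expressing a workflow as a DAG. Stripped of any particular tool, the idea is just topological execution of tasks. Here is a minimal pure-Python sketch (the task names are made up for illustration):</p>

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Task name -> set of upstream tasks it depends on (a hypothetical pipeline)
dag = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
    "evaluate": {"train"},
    "report": {"evaluate", "transform"},
}

def run_in_order(dag):
    """Execute tasks in dependency order, as an orchestrator's scheduler would."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        print(f"running {name}")
    return order
```

<p>Airflow asks you to declare this structure up front, Prefect infers it from your function calls at runtime, and Argo expresses it in YAML.</p>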



<h3 class="wp-block-heading" id="h-prefect">Prefect</h3>



<ul class="wp-block-list">
<li>If you want to incorporate a lot of other 3rd party technology.</li>



<li>If you want to use Python to define the workflow.</li>



<li>If your workflow is dynamic.</li>



<li>If you want to run tasks on schedule.</li>



<li>If you want something light and modern.</li>
</ul>



<p>I found a thread on <a href="https://www.reddit.com/r/dataengineering/comments/oqbiiu/airflow_vs_prefect/" target="_blank" rel="noreferrer noopener nofollow">Reddit</a> concerning the use of Airflow and Prefect. Maybe this can give you some additional information as to which tool to use.</p>



<p>“…The pros of Airflow are that it&#8217;s an established and popular project. This means it&#8217;s much easier to find someone who has done a random blog that answers your question. Another pro is that it&#8217;s much easier to hire someone with Airflow experience than Prefect experience. The cons are that Airflow&#8217;s age is showing, in that it wasn&#8217;t really designed for the kind of<em> dynamic workflows that exist within modern data environments</em>. If your company is going to be pushing the limits in terms of <em>computation or complexity, I&#8217;d highly suggest looking at Prefect.</em> Additionally, unless you go through Astronomer, if you can&#8217;t find an answer to a question you have about Airflow, you have to go through their fairly inactive slack chat.</p>



<p>The pros of Prefect are that it&#8217;s much more modern in its assumptions about what you&#8217;re doing and what it needs to do. It has an extensive API that allows you to programmatically control executions or otherwise interact with the scheduler, which I believe Airflow has only recently implemented out of beta in their 2.0 release. Prior to this, it was recommended not to use the API in production, which often leads to hacky workarounds. In addition, Prefect allows for a much more dynamic execution model with some of its concepts by determining the DAG that gets executed at runtime and then handing off the computation/optimization to other systems (namely Dask) to actually execute the tasks. I believe this is a much smarter approach, as I&#8217;ve seen workflows get more and more dynamic over the years.</p>



<p>If my company had neither Airflow nor Prefect in place already, I&#8217;d opt for Prefect. I believe it allows for much better modularization of code (which can then be tested more aggressively / thoroughly), which I already think is worth its weight in gold for data-driven companies that rely on having well-curated data in place to make automated product decisions. You can achieve something similar with Airflow, but you really need to go out of your way to make something like that happen, whereas in Prefect it kind of naturally comes out.”&nbsp;</p>



<p>Here is a useful chart illustrating the popularity of different orchestration tools based on GitHub stars.</p>



<div id="separator-block_f23d2ea42f6f7e22f50891a473a7769e"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="717" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-9-1024x717.png?resize=1024%2C717&#038;ssl=1" alt="Chart illustrating the popularity of different orchestration tools" class="wp-image-72260"/><figcaption class="wp-element-caption"><em>The popularity of different orchestration tools based on GitHub stars | <a href="https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>In this article, we discussed and compared the three popular tools for task orchestration, namely Argo, Airflow, and Prefect. My main aim was to help you understand these tools on the basis of three important factors i.e. Core concepts, Features offered, and why you should use them. The article also compared the three tools on some of the important features they offer, which could help you make the decision of choosing the most appropriate tool for your project.</p>



<p>I hope this article was informative and gave you a better understanding of these tools.&nbsp;</p>



<p>Thanks!!!&nbsp;</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://github.com/argoproj/argo-workflows" target="_blank" rel="noreferrer noopener nofollow">https://github.com/argoproj/argo-workflows</a>&nbsp;</li>



<li><a href="https://argoproj.github.io/" target="_blank" rel="noreferrer noopener nofollow">https://argoproj.github.io/</a>&nbsp;</li>



<li><a href="https://codefresh.io/learn/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">https://codefresh.io/learn/argo-workflows/</a>&nbsp;</li>



<li><a href="https://hazelcast.com/glossary/directed-acyclic-graph/" target="_blank" rel="noreferrer noopener nofollow">https://hazelcast.com/glossary/directed-acyclic-graph/</a></li>



<li><a href="https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9</a>&nbsp;</li>



<li><a href="https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b</a></li>



<li><a href="https://argoproj.github.io/argo-workflows/artifact-visualization/#artifact-types" target="_blank" rel="noreferrer noopener nofollow">https://argoproj.github.io/argo-workflows/artifact-visualization/#artifact-types</a>&nbsp;</li>



<li><a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html" target="_blank" rel="noreferrer noopener nofollow">https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html</a>&nbsp;</li>



<li><a href="https://spell.ml/blog/orchestrating-spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp" target="_blank" rel="noreferrer noopener nofollow">https://spell.ml/blog/orchestrating-spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp</a>&nbsp;</li>



<li><a href="https://github.com/PrefectHQ/prefect" target="_blank" rel="noreferrer noopener nofollow">https://github.com/PrefectHQ/prefect</a></li>



<li><a href="https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow" target="_blank" rel="noreferrer noopener nofollow">https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow</a>&nbsp;</li>



<li><a href="https://hevodata.com/learn/argo-vs-airflow/#w6" target="_blank" rel="noreferrer noopener nofollow">https://hevodata.com/learn/argo-vs-airflow/#w6</a>&nbsp;</li>



<li><a href="https://www.datarevenue.com/en-blog/what-we-are-loving-about-prefect" target="_blank" rel="noreferrer noopener nofollow">https://www.datarevenue.com/en-blog/what-we-are-loving-about-prefect</a>&nbsp;</li>



<li><a href="https://github.com/PrefectHQ/prefect" target="_blank" rel="noreferrer noopener nofollow">https://github.com/PrefectHQ/prefect</a>&nbsp;</li>



<li><a href="https://docs.prefect.io/" target="_blank" rel="noreferrer noopener nofollow">https://docs.prefect.io/</a>&nbsp;</li>



<li><a href="https://medium.com/the-prefect-blog/introducing-the-artifacts-api-b9e5972db043" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/the-prefect-blog/introducing-the-artifacts-api-b9e5972db043</a>&nbsp;</li>



<li><a href="https://medium.com/the-prefect-blog/orchestrate-your-data-science-project-with-prefect-2-0-4118418fd7ce" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/the-prefect-blog/orchestrate-your-data-science-project-with-prefect-2-0-4118418fd7ce</a>&nbsp;</li>



<li><a href="https://www.reddit.com/r/dataengineering/comments/oqbiiu/airflow_vs_prefect/" target="_blank" rel="noreferrer noopener nofollow">https://www.reddit.com/r/dataengineering/comments/oqbiiu/airflow_vs_prefect/</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">11822</post-id>	</item>
		<item>
		<title>5 Tools That Will Help You Setup Production ML Model Testing</title>
		<link>https://neptune.ai/blog/tools-ml-model-testing</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 30 Sep 2022 11:15:05 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.test/tools-ml-model-testing/</guid>

					<description><![CDATA[Developing a machine learning or a deep learning model seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the data-preprocessing phase, followed by the modeling-building phase. If the accuracy is not up&#8230;]]></description>
										<content:encoded><![CDATA[
<p><a href="/categories/ml-model-development" target="_blank" rel="noreferrer noopener">Developing a machine learning or a deep learning model</a> seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the <a href="/blog/data-preprocessing-guide" target="_blank" rel="noreferrer noopener">data-preprocessing phase</a>, followed by the model-building phase. If the accuracy is not up to the mark, we then iterate over the whole process until we reach satisfactory accuracy.&nbsp;</p>



<p>The difficulty arises when we want to put the model into production in the real world. The model often does not perform as well as it did during the training and evaluation phase. This happens primarily because of <a href="/blog/concept-drift-best-practices" target="_blank" rel="noreferrer noopener">concept drift</a> or data drift and issues concerning data integrity. Therefore, testing an ML model becomes very important so that we can understand its strengths and weaknesses and act accordingly.&nbsp;</p>



<p>In this article, we will discuss some of the tools that can be leveraged to test an ML model. Some of these tools and libraries are open-source, while others require a subscription. Either way, this article will fully explore the tools that will be handy for your MLOps pipeline.&nbsp;</p>



<h2 class="wp-block-heading" id="h-why-does-model-testing-matter">Why does model testing matter?</h2>



<p>Building upon what we just discussed, model testing allows you to pinpoint a bug or area of concern that might cause the prediction capability of the model to degrade. This can happen over time gradually or in an instant. Either way, it is always good to know in which area they might fail and which features can cause them to fail. It exposes flaws, and it can also bring new insights to light. Essentially, the idea is to make a robust model that can efficiently handle uncertain data entries and anomalies.&nbsp;</p>



<p>Some of the benefits of model testing are:</p>



<div id="case-study-numbered-list-block_98ce7e7d1ce3f6d5a001c0826a830c71"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Detecting model and data drift<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Finding anomalies in the dataset<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Checking data and model integrity<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Detecting possible root causes of model failure<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Eliminating bugs and errors<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Reducing false positives and false negatives<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Encouraging retraining of the model over a certain period of time<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Creating a production-ready model<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">9</span>
                Ensuring robustness of ML model<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">10</span>
                Finding new insights within the model            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-is-model-testing-the-same-as-model-evaluation">Is model testing the same as model evaluation?</h3>



<p>Model testing and evaluation are similar to what we call diagnosis and screening in medicine.&nbsp;</p>



<p><strong>Model evaluation</strong> is similar to screening, where the performance of the model is checked based on certain metrics like the F1 score or MSE loss. These metrics do not point to a focused area of concern.&nbsp;</p>



<section id="blog-intext-cta-block_8ccae508b071a43993cfb7cffe665126" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning" target="_blank" rel="noopener">The Ultimate Guide to Evaluation and Selection of Models in Machine Learning</a></p>
<p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/f1-score-accuracy-roc-auc-pr-auc" target="_blank" rel="noopener">F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?</a></p>
    
    </section>



<p><strong>Model testing</strong> is similar to diagnosis, where a specific test, like an invariance test or a unit test, aims to find a particular issue in the model.&nbsp;</p>
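<p>To make the distinction concrete, here is a minimal sketch of an invariance test. The toy sentiment model and test function are hypothetical stand-ins, not from any library: the test asserts that a perturbation irrelevant to the task, such as extra whitespace, does not change the prediction.</p>

```python
# Invariance test sketch: a prediction should not change under a
# task-irrelevant perturbation. The "model" here is a toy stand-in.

def toy_sentiment_model(text):
    positive = {"good", "great", "excellent"}
    words = text.lower().split()
    return "positive" if any(w in positive for w in words) else "negative"

def test_invariance_to_whitespace():
    base = toy_sentiment_model("the service was great")
    perturbed = toy_sentiment_model("  the service was great \n")
    assert base == perturbed  # same prediction despite the perturbation

test_invariance_to_whitespace()
print("invariance test passed")
```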



<h2 class="wp-block-heading" id="h-what-will-a-typical-ml-software-testing-suite-include">What will a typical ML software testing suite include?</h2>



<p>A machine learning testing suite often includes testing modules to <strong>detect different types of drift</strong>, such as concept drift and data drift, which can include covariate drift, prediction drift, and so on. These issues usually originate in the dataset. Most of the time, the dataset&#8217;s distribution changes over time, affecting the model&#8217;s ability to predict the output accurately. You will find that the frameworks we discuss contain tools to detect data drift.&nbsp;</p>
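<p>As an illustration of what such a drift check computes under the hood, here is a hand-rolled sketch of one common drift signal, the Population Stability Index (PSI), comparing a feature&#8217;s distribution between a reference (training) sample and a newer sample. The equal-width bucketing and the 0.2 alert threshold are common conventions, not taken from any particular framework.</p>

```python
# A simple data-drift signal: PSI between a reference sample and a
# current sample of one feature. Larger PSI means larger drift.
import math

def psi(reference, current, buckets=4):
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / buckets or 1.0

    def fractions(sample):
        counts = [0] * buckets
        for v in sample:
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
        # small floor avoids log(0) for empty buckets
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [0.1 * i for i in range(100)]      # stable training sample
shifted = [0.1 * i + 5.0 for i in range(100)]  # drifted production sample
print(psi(reference, reference) < 0.2)  # below the alert threshold
print(psi(reference, shifted) > 0.2)    # above the alert threshold
```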



<p>Apart from testing data, the ML testing suite contains tools to test the <strong>model&#8217;s predictive capability</strong>, as well as <strong>overfitting, underfitting, variance, and bias</strong>, et cetera. The idea of the testing framework is to inspect the pipeline in the three major phases of development: </p>



<ul class="wp-block-list">
<li>data ingestion, </li>



<li>data preprocessing, </li>



<li>and model evaluation. </li>
</ul>



<p>Some of the frameworks like Robust Intelligence and Kolena rigorously test the given ML pipeline automatically in these given areas to ensure a production-ready model.&nbsp;</p>



<p>In essence, a machine learning suite will contain:</p>



<ol class="wp-block-list">
<li><strong>Unit tests</strong> that operate on the level of the codebase,</li>



<li><strong>Regression tests</strong> replicate bugs from previous iterations of the model that have since been fixed, to ensure they do not reappear,</li>



<li><strong>Integration tests</strong> simulate conditions and are typically longer-running tests that observe model behaviors. These conditions can mirror the ML pipeline, including preprocessing phase, data distribution, et cetera.&nbsp;</li>
</ol>
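<p>As a minimal illustration of the first two categories, the sketch below unit-tests a small preprocessing helper (a hypothetical function, not from any framework) and includes a regression-style test that pins down a previously fixed zero-division bug:</p>

```python
# Unit and regression test sketch for a preprocessing helper.

def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # guard against zero division
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_min_max_scale_bounds():
    # unit test: output must span exactly [0, 1]
    scaled = min_max_scale([3.0, 7.0, 11.0])
    assert min(scaled) == 0.0 and max(scaled) == 1.0

def test_min_max_scale_constant_input():
    # regression test: replicates a previously fixed zero-division bug
    assert min_max_scale([5.0, 5.0]) == [0.0, 0.0]

test_min_max_scale_bounds()
test_min_max_scale_constant_input()
print("all preprocessing tests passed")
```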


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-1.png?ssl=1" alt="A workflow of software development " class="wp-image-71569"/><figcaption class="wp-element-caption"><em>The image above depicts a typical workflow of software development | <a href="https://www.jeremyjordan.me/testing-ml/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<section id="blog-intext-cta-block_fab1718676241ba16b30355ddbfbb16c" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also</h3>
    
            <p>  <a href="/blog/ml-model-testing-teams-share-how-they-test-models" target="_blank" rel="noopener">ML Model Testing: 4 Teams Share How They Test Their Models</a></p>
<p>  <a href="/blog/automated-testing-machine-learning" target="_blank" rel="noopener">Automated Testing in Machine Learning Projects [Best Practices for MLOps]</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-what-are-the-best-tools-for-machine-learning-model-testing">What are the best tools for machine learning model testing?</h2>



<p>Now, let’s discuss some of the tools for testing ML models. This section is divided into three parts: open-source tools, subscription-based tools, and hybrid tools.&nbsp;</p>



<h3 class="wp-block-heading" id="h-open-source-model-testing-tools">Open-source model testing tools</h3>



<h4 class="wp-block-heading">1. DeepChecks</h4>



<p><a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">DeepChecks</a> is an open-source Python framework for testing ML Models &amp; Data. It basically enables users to test the ML pipeline in three different phases:</p>



<ol class="wp-block-list">
<li><strong>Data integrity test </strong>before the preprocessing phase.</li>



<li><strong>Data validation</strong> before training, mostly while splitting the data into training and testing sets, and</li>



<li><strong>ML model testing</strong>.</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-2.png?ssl=1" alt="" class="wp-image-71570"/><figcaption class="wp-element-caption"><em> The image above shows the schema of three different tests that could be performed in an ML pipeline | <a href="https://docs.deepchecks.com/stable/getting-started/when_should_you_use.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>These tests can be performed all at once and even independently. The image above shows the schema of three different tests that could be performed in an ML pipeline.&nbsp;</p>



<h5 class="wp-block-heading">Installation</h5>



<p>Deepchecks can be installed using the following pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install "deepchecks&gt;0.5.0"</pre>



<p>The latest version of Deepchecks at the time of writing is 0.8.0.&nbsp;</p>



<h5 class="wp-block-heading">Structure of the framework&nbsp;</h5>



<p>DeepChecks introduces three important terms: <strong>Check</strong>, <strong>Condition</strong>, and <strong>Suite</strong>. Together, these three concepts form the core structure of the framework.&nbsp;</p>



<p><strong>Check</strong></p>



<p>A check enables the user to inspect a specific aspect of the data and models. The framework contains various classes that allow you to check both, and you can run a full check as well. Here are a couple of such checks:</p>



<ol class="wp-block-list">
<li><strong><em>Data inspection</em></strong> covers data drift, duplication, missing values, string mismatches, and statistical analyses such as data distribution. You can find the various data-inspection tools within the check module, which lets you precisely tailor the inspection methods for your datasets. These are some of the tools that you will find for data inspection:</li>
</ol>



<ul class="wp-block-list">
<li>&nbsp;&#8216;DataDuplicates&#8217;,</li>



<li>&nbsp;&#8216;DatasetsSizeComparison&#8217;,</li>



<li>&nbsp;&#8216;DateTrainTestLeakageDuplicates&#8217;,</li>



<li>&nbsp;&#8216;DateTrainTestLeakageOverlap&#8217;,</li>



<li>&nbsp;&#8216;DominantFrequencyChange&#8217;,</li>



<li>&nbsp;&#8216;FeatureFeatureCorrelation&#8217;,</li>



<li>&nbsp;&#8216;FeatureLabelCorrelation&#8217;,</li>



<li>&nbsp;&#8216;FeatureLabelCorrelationChange&#8217;,</li>



<li>&nbsp;&#8216;IdentifierLabelCorrelation&#8217;,</li>



<li>&nbsp;&#8216;IndexTrainTestLeakage&#8217;,</li>



<li>&nbsp;&#8216;IsSingleValue&#8217;,</li>



<li>&nbsp;&#8216;MixedDataTypes&#8217;,</li>



<li>&nbsp;&#8216;MixedNulls&#8217;,</li>



<li>&nbsp;&#8216;WholeDatasetDrift&#8217;</li>
</ul>



<p>In the following example, we will inspect whether the dataset has duplicates. We will import the DataDuplicates class from the checks module and pass the dataset as a parameter. This returns a table containing relevant information on any duplicate values in the dataset.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> deepchecks.checks <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> DataDuplicates, FeatureFeatureCorrelation
dup = DataDuplicates()
dup.run(data)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-3.png?ssl=1" alt="Inspection of dataset duplicates " class="wp-image-71571"/><figcaption class="wp-element-caption"><em>An example of inspecting if the dataset has duplicates | Source: Author</em></figcaption></figure>
</div>


<p>As you can see, the table above yields relevant information about the number of duplicates present in the dataset. Now let&#8217;s see how DeepChecks uses visual aids to present this information.&nbsp;</p>



<p>In the following example, we will inspect feature-feature correlation within the dataset. For that, we will import the FeatureFeatureCorrelation class from the checks module.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">ffc = FeatureFeatureCorrelation()
ffc.run(data)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-4.png?ssl=1" alt=" Inspection of feature-feature correlation" class="wp-image-71572"/><figcaption class="wp-element-caption"><em>An example of inspecting feature-feature correlation within the dataset | Source: Author</em></figcaption></figure>
</div>


<p>As you can see from both examples, the results can be displayed either in the form of a table or a graph, or even both to give relevant information to the user.&nbsp;&nbsp;</p>



<ol class="wp-block-list" start="2">
<li><strong><em>Model inspection</em></strong> covers issues such as overfitting and underfitting<em>. </em>Similar to data inspection, you can find the various model-inspection tools within the check module. These are some of the tools that you will find for model inspection:</li>
</ol>



<ul class="wp-block-list">
<li>&#8216;ModelErrorAnalysis&#8217;,</li>



<li>&nbsp;&#8216;ModelInferenceTime&#8217;,</li>



<li>&nbsp;&#8216;ModelInfo&#8217;,</li>



<li>&nbsp;&#8216;MultiModelPerformanceReport&#8217;,</li>



<li>&nbsp;&#8216;NewLabelTrainTest&#8217;,</li>



<li>&nbsp;&#8216;OutlierSampleDetection&#8217;,</li>



<li>&nbsp;&#8216;PerformanceReport&#8217;,</li>



<li>&nbsp;&#8216;RegressionErrorDistribution&#8217;,</li>



<li>&nbsp;&#8216;RegressionSystematicError&#8217;,</li>



<li>&nbsp;&#8216;RocReport&#8217;,</li>



<li>&nbsp;&#8216;SegmentPerformance&#8217;,</li>



<li>&nbsp;&#8216;SimpleModelComparison&#8217;,</li>



<li>&nbsp;&#8216;SingleDatasetPerformance&#8217;,</li>



<li>&nbsp;&#8216;SpecialCharacters&#8217;,</li>



<li>&nbsp;&#8216;StringLengthOutOfBounds&#8217;,</li>



<li>&nbsp;&#8216;StringMismatch&#8217;,</li>



<li>&nbsp;&#8216;StringMismatchComparison&#8217;,</li>



<li>&nbsp;&#8216;TrainTestFeatureDrift&#8217;,</li>



<li>&nbsp;&#8216;TrainTestLabelDrift&#8217;,</li>



<li>&nbsp;&#8216;TrainTestPerformance&#8217;,</li>



<li>&nbsp;&#8216;TrainTestPredictionDrift&#8217;,</li>
</ul>



<p>Here is an example of a model check on a Random Forest classifier:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> deepchecks.checks <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> ModelInfo
info = ModelInfo()
info.run(RF)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-5.png?resize=490%2C768&#038;ssl=1" alt="A model check or inspection on Random Forest Classifier" class="wp-image-71573" width="490" height="768"/><figcaption class="wp-element-caption"><em>An example of a model check or inspection on Random Forest Classifier | Source: Author&nbsp;</em></figcaption></figure>
</div>


<p><strong>Condition</strong>&nbsp;</p>



<p>A condition is a function or attribute that can be added to a check. Essentially, it contains a predefined parameter that can return a pass, fail, or warning result, and these parameters can be modified as needed. The snippet below sketches how a condition is attached to a check (the exact condition method name may vary between Deepchecks versions):&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">from deepchecks.checks import FeatureLabelCorrelation

check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.8)
check.run(data)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-6.png?resize=908%2C572&#038;ssl=1" alt="A bar graph of feature label correlation" class="wp-image-71574" width="908" height="572"/><figcaption class="wp-element-caption"><em>An example of a bar graph of feature label correlation | Source: Author</em></figcaption></figure>
</div>


<p>The image above shows a bar graph of feature-label correlation. It essentially measures the predictive power of an independent feature to predict the target value by itself. When you add a condition to a check, as in the example above, the condition returns additional information listing the features that are above and below the threshold.&nbsp;</p>



<p>In this particular example, you will find that the condition returned a statement stating that the algorithm “<em>Found 2 out of 4 features with PPS above threshold: {&#8216;petal width (cm)&#8217;: &#8216;0.9&#8217;, &#8216;petal length (cm)&#8217;: &#8216;0.87&#8217;}</em>” meaning that features with high PPS are suitable to predict the labels.&nbsp;</p>



<p><strong>Suite</strong>&nbsp;</p>



<p>A suite is an ordered collection of checks for both data and models. All the checks can be found in the suite module. Below is a schematic diagram of the framework and how it works.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-7.png?ssl=1" alt="Schematic diagram of suite of checks " class="wp-image-71575"/><figcaption class="wp-element-caption"><em>The schematic diagram of the suite of checks and how it works | <a href="https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71" target="_blank" rel="noreferrer noopener nofollow">Source</a>&nbsp;</em></figcaption></figure>
</div>


<p>As you can see from the image above, the data and the model can be passed into the suites which contain the different checks. The checks can be provided with the conditions for much more precise testing.&nbsp;</p>
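<p>Conceptually, this structure can be pictured in a few lines of plain Python. The following is a toy sketch of the Check/Condition/Suite idea, not the actual Deepchecks API:</p>

```python
# Toy illustration: a suite is an ordered collection of checks, each
# optionally carrying a pass/fail condition. Not the Deepchecks API.

class Check:
    def __init__(self, name, fn, condition=None):
        self.name, self.fn, self.condition = name, fn, condition

    def run(self, data):
        value = self.fn(data)
        passed = self.condition(value) if self.condition else None
        return {"check": self.name, "value": value, "passed": passed}

class Suite:
    def __init__(self, *checks):
        self.checks = list(checks)   # ordered collection of checks

    def run(self, data):
        return [check.run(data) for check in self.checks]

suite = Suite(
    Check("row_count", len, condition=lambda n: n >= 5),
    Check("duplicates", lambda d: len(d) - len(set(d)),
          condition=lambda n: n == 0),
)
for result in suite.run([1, 2, 3, 3, 4, 5]):
    print(result)
```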



<p>You can run the following code to see the list of 35 checks and their conditions that DeepChecks provides:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> deepchecks.suites <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> full_suite
suites = full_suite()
print(suites)
Full Suite: [
	<span class="hljs-number" style="color: teal;">0</span>: ModelInfo
	<span class="hljs-number" style="color: teal;">1</span>: ColumnsInfo
	<span class="hljs-number" style="color: teal;">2</span>: ConfusionMatrixReport
	<span class="hljs-number" style="color: teal;">3</span>: PerformanceReport
		Conditions:
			<span class="hljs-number" style="color: teal;">0</span>: Train-Test scores relative degradation <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> greater than <span class="hljs-number" style="color: teal;">0.1</span>
	<span class="hljs-number" style="color: teal;">4</span>: RocReport(excluded_classes=[])
		Conditions:
			<span class="hljs-number" style="color: teal;">0</span>: AUC score <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> all the classes <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> less than <span class="hljs-number" style="color: teal;">0.7</span>
	<span class="hljs-number" style="color: teal;">5</span>: SimpleModelComparison
		Conditions:
			<span class="hljs-number" style="color: teal;">0</span>: Model performance gain over simple model <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> less than
…]

</pre>



<p>In conclusion, Check, Condition, and Suites allow users to essentially check the data and model in their respective tasks. These can be extended and modified according to the requirements of the project and for various use cases.&nbsp;</p>



<p>DeepChecks offers flexibility and instant validation of the ML pipeline with little effort. Its strong boilerplate code allows users to automate the whole testing process, which can save a lot of time.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-8.png?ssl=1" alt="Graph with distribution checks" class="wp-image-71576"/><figcaption class="wp-element-caption"><em>An example of distribution checks | <a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Why should you use this?</h5>



<ul class="wp-block-list">
<li>It is open-source and free, and it has a growing community.</li>



<li>It is a very well-structured framework.&nbsp;</li>



<li>Because it has built-in checks and suites, it can be extremely useful for inspecting potential issues in your data and models.</li>



<li>It is efficient in the research phase as it can be easily integrated into the pipeline.</li>



<li>If you are mostly working with tabular datasets, then DeepChecks is extremely good.&nbsp;</li>



<li>You can also use it to check for data and model drift, verify model integrity, and monitor models.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-9.png?resize=768%2C568&#038;ssl=1" alt="Methodology issues" class="wp-image-71577" width="768" height="568"/><figcaption class="wp-element-caption"><em>An example of methodology issues | <a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_58e722c9b682ec0283ebe0dff110f34b"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It supports classification and regression models on both computer vision and tabular datasets.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It can easily run a large group of checks with a single call.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It is flexible, editable, and expandable.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                It yields results in both tabular and visual formats.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                It does not require a dashboard login, as all results, including visualizations, are displayed instantly during execution, and it offers a good user experience on the go.            </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-10.png?resize=768%2C581&#038;ssl=1" alt="Performance checks " class="wp-image-71578" width="768" height="581"/><figcaption class="wp-element-caption"><em>An example of performance checks | <a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_17a99a63b90e85a0ff3abd9114c1d988"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It does not support NLP tasks.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Deep learning support, including computer vision, is in beta, so results may contain errors.             </li>
            </ul>
</div>



<h4 class="wp-block-heading">2. Drifter-ML</h4>



<p>Drifter-ML is an ML model testing tool written specifically for the Scikit-learn library. It can also be used to test datasets, similar to DeepChecks. It has five modules, each specific to the task at hand.</p>



<ol class="wp-block-list">
<li><strong>Classification test: </strong>It enables you to test classification algorithms.</li>



<li><strong>Regression test: </strong>It enables you to test regression algorithms.</li>



<li><strong>Structural test: </strong>This module has a bunch of classes that allow testing of clustering algorithms.</li>



<li><strong>Time Series test: </strong>This module can be used to test model drifts.&nbsp;</li>



<li><strong>Columnar test: </strong>This module allows you to test your tabular dataset. Tests include sanity testing, mean and median similarity, Pearson’s correlation et cetera.&nbsp;</li>
</ol>
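<p>To illustrate the kind of assertion a columnar test makes, here is a hand-rolled sketch of a mean-similarity check: the mean of a new column should stay within a tolerance of the training column&#8217;s statistics. The two-standard-deviation tolerance and the sample values are illustrative, not drifter-ml defaults.</p>

```python
# Columnar sanity-check sketch: compare a new column's mean against
# the training column's mean, with a standard-deviation tolerance.
import statistics

def mean_similarity_ok(train_col, new_col, n_std=2.0):
    train_mean = statistics.mean(train_col)
    train_std = statistics.stdev(train_col)
    return abs(statistics.mean(new_col) - train_mean) <= n_std * train_std

train_ages = [23, 31, 35, 41, 52, 47, 38, 29]   # training data column
similar = [25, 33, 36, 44, 50, 45, 37, 30]      # production data, similar
shifted = [80, 85, 90, 95, 100, 105, 110, 115]  # production data, shifted

print(mean_similarity_ok(train_ages, similar))  # passes the check
print(mean_similarity_ok(train_ages, shifted))  # fails the check
```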



<h5 class="wp-block-heading">Installation</h5>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install drifter-ml</pre>



<h5 class="wp-block-heading">Structure of the framework</h5>



<p>Drifter-ML conforms to the Scikit-learn blueprint for models, i.e., the model must implement the .fit and .predict methods. This essentially means that you can test deep learning models as well, since Keras offers a Scikit-learn wrapper API (KerasClassifier). Check the <a href="https://drifter-ml.readthedocs.io/en/latest/introduction.html" target="_blank" rel="noreferrer noopener nofollow">example</a> below.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Source: https://drifter-ml.readthedocs.io/en/latest/classification-tests.html#lower-bound-classification-measures</span>

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.models <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Sequential
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.layers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dense
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.wrappers.scikit_learn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> KerasClassifier
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pandas <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> pd
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> joblib

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Function to create model, required for KerasClassifier</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">create_model</span><span class="hljs-params">()</span>:</span>
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create model</span>
   model = Sequential()
   model.add(Dense(<span class="hljs-number" style="color: teal;">12</span>, input_dim=<span class="hljs-number" style="color: teal;">3</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
   model.add(Dense(<span class="hljs-number" style="color: teal;">8</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
   model.add(Dense(<span class="hljs-number" style="color: teal;">1</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'sigmoid'</span>))
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Compile model</span>
   model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'binary_crossentropy'</span>, optimizer=<span class="hljs-string" style="color: rgb(221, 17, 68);">'adam'</span>, metrics=[<span class="hljs-string" style="color: rgb(221, 17, 68);">'accuracy'</span>])
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> model

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate a synthetic dataset; collect rows in a list and build the</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># DataFrame once (DataFrame.append was removed in pandas 2.0)</span>
rows = []
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> _ <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">1000</span>):
   a = np.random.normal(<span class="hljs-number" style="color: teal;">0</span>, <span class="hljs-number" style="color: teal;">1</span>)
   b = np.random.normal(<span class="hljs-number" style="color: teal;">0</span>, <span class="hljs-number" style="color: teal;">3</span>)
   c = np.random.normal(<span class="hljs-number" style="color: teal;">12</span>, <span class="hljs-number" style="color: teal;">4</span>)
   target = <span class="hljs-number" style="color: teal;">1</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> a + b + c &gt; <span class="hljs-number" style="color: teal;">11</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-number" style="color: teal;">0</span>
   rows.append({
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"A"</span>: a,
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"B"</span>: b,
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"C"</span>: c,
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>: target
   })
df = pd.DataFrame(rows)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># split into input (X) and output (Y) variables</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create model</span>
clf = KerasClassifier(build_fn=create_model, epochs=<span class="hljs-number" style="color: teal;">150</span>, batch_size=<span class="hljs-number" style="color: teal;">10</span>, verbose=<span class="hljs-number" style="color: teal;">0</span>)
X = df[[<span class="hljs-string" style="color: rgb(221, 17, 68);">"A"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"B"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"C"</span>]]
clf.fit(X, df[<span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>])
joblib.dump(clf, <span class="hljs-string" style="color: rgb(221, 17, 68);">"model.joblib"</span>)
df.to_csv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"data.csv"</span>, index=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>)

</pre>



<p>The example above shows how easily you can design an ANN model using drifter-ml. Designing a test case is just as straightforward. The test defined below checks that the model’s cross-validated precision does not fall below a lower boundary of 0.9.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">test_cv_precision_lower_boundary</span><span class="hljs-params">()</span>:</span>
   df = pd.read_csv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"data.csv"</span>)
   column_names = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"A"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"B"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"C"</span>]
   target_name = <span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>
   clf = joblib.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">"model.joblib"</span>)

   test_suite = ClassificationTests(clf,
   df, target_name, column_names)
   lower_boundary = <span class="hljs-number" style="color: teal;">0.9</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> test_suite.cross_val_precision_lower_boundary(
       lower_boundary
   )</pre>
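<p>To make the check above concrete: <code>cross_val_precision_lower_boundary</code> essentially verifies that precision stays above the threshold on every cross-validation fold. A minimal sketch of that idea in plain Python, using a toy rule-based classifier in place of the trained Keras model so the snippet stands alone:</p>

```python
import random

def precision(y_true, y_pred):
    """Precision = true positives / predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pp = sum(1 for p in y_pred if p == 1)
    return tp / pp if pp else 0.0

def cross_val_precision_lower_boundary(predict, X, y, lower_boundary, k=5):
    """Return True only if precision >= lower_boundary on every one of k folds."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    for fold in folds:
        y_true = [y[i] for i in fold]
        y_pred = [predict(X[i]) for i in fold]
        if precision(y_true, y_pred) < lower_boundary:
            return False
    return True

# toy data mirroring the article: target = 1 when A + B + C > 11
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 3), random.gauss(12, 4)] for _ in range(1000)]
y = [1 if sum(row) > 11 else 0 for row in X]

# a "model" that has learned the true rule passes the boundary on every fold
print(cross_val_precision_lower_boundary(lambda row: 1 if sum(row) > 11 else 0, X, y, 0.9))
```

<p>drifter-ml runs the equivalent check against the actual model loaded from <code>model.joblib</code>, with the cross-validation folds handled for you.</p>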



<h5 class="wp-block-heading">Why should you use this?</h5>



<ul class="wp-block-list">
<li>Drifter-ML is written specifically for Scikit-learn and acts as an extension to it. All classes and methods follow Scikit-learn conventions, so data and model testing become relatively straightforward.&nbsp;</li>
</ul>



<ul class="wp-block-list">
<li>On a side note, if you enjoy working on open-source libraries, you can extend drifter-ml to other machine learning and deep learning frameworks such as PyTorch.&nbsp;</li>
</ul>



<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_ecedf7f8833fa0750959f05284aa7cf3"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Built on top of Scikit-learn.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
Supports testing deep learning architectures, but only Keras models, via Scikit-learn&#8217;s Keras wrapper.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Open source library and open to contribution.             </li>
            </ul>
</div>



<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_76f21612f4a7714359264686da22f1de"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
It is not up to date, and its community is not very active.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It does not work well with other libraries.             </li>
            </ul>
</div>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-subscription-based-tools">Subscription-based tools</h3>



<h4 class="wp-block-heading">1. Kolena.io</h4>



<p><a href="https://www.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">Kolena.io</a> is a Python-based framework for ML testing. It also includes an online platform where the results and insights can be logged. Kolena focuses mostly on the ML unit testing and validation process at scale.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-11.png?ssl=1" alt="Kolena.io dashboard" class="wp-image-71579"/><figcaption class="wp-element-caption"><em>Kolena.io dashboard example | <a href="https://www.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Why should you use this?</h5>



<p>Kolena argues that the split test dataset methodology isn’t as reliable as it seems. A train/test split represents the global population distribution but fails to capture local representations at a granular level, especially within an individual label or class. Hidden nuances in the features remain undiscovered. As a result, a model can fail in the real world even though it yields good performance metrics during training and evaluation.&nbsp;</p>



<p>One way of addressing this issue is to create a much more focused dataset by breaking a given class into smaller subclasses, or even by creating subsets of the features themselves. Such a dataset enables the ML model to extract features and representations at a much more granular level. It also improves performance by balancing bias and variance so that the model generalizes well to real-world scenarios.&nbsp;</p>



<p>For example, when building a classification model, a given class in the dataset can be broken down into various subsets and those subsets into finer subsets. This can enable users to test the model in various scenarios. In the table below, the CAR class is tested against several test cases to check the model’s performance on various attributes.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-12.png?resize=768%2C551&#038;ssl=1" alt="CAR class tested against several test cases" class="wp-image-71580" width="768" height="551"/><figcaption class="wp-element-caption">CAR class tested against several test cases to check the model’s performance on various attributes | <a href="https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>
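<p>The slicing idea needs no special tooling to understand: group test samples by a metadata attribute and compute the metric per slice instead of one global score. A small sketch with made-up results (the attribute names and numbers are illustrative, not from Kolena):</p>

```python
from collections import defaultdict

def per_slice_accuracy(samples):
    """samples: list of (slice_name, y_true, y_pred) tuples -> accuracy per slice."""
    hits, totals = defaultdict(int), defaultdict(int)
    for slice_name, y_true, y_pred in samples:
        totals[slice_name] += 1
        hits[slice_name] += int(y_true == y_pred)
    return {name: hits[name] / totals[name] for name in totals}

# illustrative results for a CAR classifier, sliced by lighting conditions
results = [
    ("daytime", 1, 1), ("daytime", 1, 1), ("daytime", 0, 0), ("daytime", 1, 1),
    ("night",   1, 0), ("night",   1, 1), ("night",   0, 1), ("night",   1, 0),
]
scores = per_slice_accuracy(results)
global_acc = sum(int(t == p) for _, t, p in results) / len(results)
print(global_acc)        # 0.625 overall
print(scores["night"])   # 0.25 -- the aggregate hides a failing slice
```

<p>The aggregate score blends a perfect daytime slice with a failing night slice, which is exactly the kind of hidden nuance slice-level testing surfaces.</p>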


<p>Another benefit is that whenever a new real-world scenario appears, a new test case can be designed and run immediately. Likewise, users can build more comprehensive test cases for a variety of tasks and train or build a model. Users can also generate a detailed report on a model’s performance in each category of test cases and compare it to previous models with each iteration.</p>



<p>To sum up, Kolena offers:</p>



<ul class="wp-block-list">
<li>An easy-to-use Python framework</li>



<li>Automated workflow testing and deployment</li>



<li>Faster model debugging</li>



<li>Faster model deployment</li>
</ul>



<p>If you are working on a large-scale deep learning model that will be complex to monitor, Kolena will be beneficial.&nbsp;</p>



<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_f35e837c87b49c17aedbb0b5f2313e9a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Supports Deep Learning architectures.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Kolena Test Case Studio offers to curate customizable test cases for the model.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It allows users to prepare quality tests by removing noise and improving annotations.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
It can automatically diagnose failure modes and pinpoint the exact underlying issue.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Integrates seamlessly into the ML pipeline.             </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-13.png?ssl=1" alt="App Kolena.io " class="wp-image-71581"/><figcaption class="wp-element-caption"><em>View from the Kolena.io app | Source</em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_8a424d976e1022cd4a50f878ecaf1176"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Subscription-based model (pricing not mentioned).<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                In order to download the framework, you need a CloudRepo pass.             </li>
            </ul>
</div>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip3 install --extra-index-url <span class="hljs-string" style="color: rgb(221, 17, 68);">"$CR_URL"</span> kolena-client</pre>



<h3 class="wp-block-heading" id="h-2-robust-intelligence">2. Robust Intelligence</h3>



<p>Robust Intelligence is an end-to-end (E2E) ML platform that offers various services around ML integrity. The framework is written in Python and allows you to customize your code according to your needs. It also integrates with an online dashboard that provides insights into the various tests on data and model performance, as well as model monitoring. All these services target the ML model and data right from training to the post-production phase.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-14.png?ssl=1" alt="Robust intelligence " class="wp-image-71582"/><figcaption class="wp-element-caption"><em>Robust intelligence features | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Why should you use this?</h5>



<p>The platform offers services like:</p>



<p><strong>1. AI stress testing,</strong> which includes hundreds of tests to automatically evaluate the performance of the model and identify potential drawbacks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-15.png?ssl=1" alt="AI stress testing" class="wp-image-71583"/><figcaption class="wp-element-caption"><em>Evaluating the performance of the model | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>2. AI Firewall, </strong>which automatically creates a wrapper around the trained model to protect it from bad data in real-time. The wrapper is configured based on the model. It also automatically checks both the data and model, reducing manual effort and time.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-16.png?ssl=1" alt="AI Firewall" class="wp-image-71584"/><figcaption class="wp-element-caption"><em>Prevention of model failures in production | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>
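<p>Conceptually, such a firewall is a validating wrapper: incoming rows are checked against the ranges observed at training time, and out-of-range rows are rejected before they ever reach the model. A hand-rolled sketch of the concept (not Robust Intelligence's actual API):</p>

```python
class ModelFirewall:
    """Reject inputs that fall outside per-feature ranges seen during training."""

    def __init__(self, model, train_rows, tolerance=0.1):
        self.model = model
        # record min/max per feature, widened by a tolerance margin
        self.bounds = []
        for col in zip(*train_rows):
            lo, hi = min(col), max(col)
            margin = (hi - lo) * tolerance
            self.bounds.append((lo - margin, hi + margin))

    def predict(self, row):
        for value, (lo, hi) in zip(row, self.bounds):
            if not (lo <= value <= hi):
                raise ValueError(f"input value {value} outside trained range [{lo}, {hi}]")
        return self.model(row)

# toy model and training data
train = [[0.0, 10.0], [1.0, 12.0], [0.5, 11.0]]
fw = ModelFirewall(lambda row: int(sum(row) > 11), train)
print(fw.predict([0.5, 11.5]))   # within the trained ranges, so the model runs
```

<p>A production firewall would flag or quarantine bad rows rather than raise, but the shape is the same: validation sits in front of <code>predict</code>.</p>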


<p><strong>3. AI continuous testing</strong>, which monitors the deployed model and automatically tests it to check for updates and retraining needs. The testing covers data drift, errors, root cause analysis, anomaly detection, and so on. All the insights gained during continuous testing are displayed on the dashboard.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-17.png?ssl=1" alt="AI continuous testing" class="wp-image-71585"/><figcaption class="wp-element-caption"><em>Monitoring model in production | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Robust Intelligence enables model testing, model protection during deployment, and model monitoring after deployment. Since it is an E2E platform, all phases can be easily automated, with hundreds of stress tests run on the model to make it production-ready. If the project is fairly large, Robust Intelligence will give you an edge.&nbsp;</p>
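<p>As a flavor of what a continuous data-drift test computes, here is a deliberately simple check: compare each live feature's mean against the training distribution and alert when the shift exceeds a threshold. Real platforms use richer statistics (e.g., PSI or Kolmogorov-Smirnov tests), but the shape is the same:</p>

```python
import statistics

def mean_shift_score(train_values, live_values):
    """How many training standard deviations the live mean has moved."""
    mu = statistics.fmean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.fmean(live_values) - mu) / sigma

def detect_drift(train_values, live_values, threshold=1.0):
    """Alert when the standardized mean shift exceeds the threshold."""
    return mean_shift_score(train_values, live_values) > threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5]   # training distribution
stable = [10.2, 9.8, 10.1, 10.4]                         # production, no drift
drifted = [14.0, 15.0, 13.5, 14.5]                       # production, drifted
print(detect_drift(train, stable))   # False
print(detect_drift(train, drifted))  # True
```
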



<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_f57daf41c19f462ee2a211e50497a809"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Supports deep learning frameworks<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Flexible and easy to use<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Customisable<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Scalable            </li>
            </ul>
</div>



<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_28d9482eadfb67a947c16f0969223daf"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Only for enterprise.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Few details are available online.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Expensive: One-year subscription costs around $60,000.            </li>
            </ul>
</div>



<p class="has-text-align-left"><span style="color: initial;"><em>(</em></span><a href="https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a><strong style="color: initial;"><em>)</em></strong></p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-hybrid-frameworks">Hybrid frameworks</h3>



<h4 class="wp-block-heading">1. Etiq.ai</h4>



<p>​​<a href="https://etiq.ai/" target="_blank" rel="noreferrer noopener nofollow">Etiq </a>is an AI-observability platform that supports AI/ML lifecycle. Like Kolena and Robust Intelligence, the framework offers ML Model testing, monitoring, optimization, and explainability.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-18.png?ssl=1" alt="Etiq.ai" class="wp-image-71586"/><figcaption class="wp-element-caption"><em>The dashboard of Etiq.ai | <a href="https://docs.google.com/document/d/1oJ20eZeuuuFigdi4P4rqgcLONZ2ulimFn4XMBQ8M9Fw/edit#heading=h.upirzig57bbx" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Etiq is considered to be a hybrid framework as it offers both offline and online implementation. Etiq has four tiers of usage:</p>



<ol class="wp-block-list">
<li><strong>Free and public</strong>: It includes free usage of the library as well as the dashboard. Keep in mind that results and metadata are stored in your dashboard instance the moment you log in to the platform, but in return you receive the full set of features.&nbsp;</li>



<li><strong>Free and limited</strong>: If you want a free but private testing environment for your project and don’t want to share any information, you can use the library without logging into the platform. Keep in mind that you will not receive the full benefits you would have by logging in.&nbsp;&nbsp;</li>



<li><strong>Subscribe and private</strong>: If you want the full benefits of Etiq.ai, you can subscribe to their plan and use their tools in your own private environment. Etiq.ai is already available on the AWS Marketplace, starting at around $3.00/hour or $25,000.00/year.&nbsp;</li>



<li><strong>Personalized request</strong>: If you require functionality beyond what Etiq.ai provides, like explainability, robustness, or team-share functionality, you can contact them and get your own personalized test suite.&nbsp;&nbsp;</li>
</ol>



<h5 class="wp-block-heading">Structure of the framework&nbsp;</h5>



<p>Etiq follows a structure similar to DeepChecks. This structure remains the core of the framework:</p>



<ul class="wp-block-list">
<li><strong>Snapshot</strong>: It is a combination of dataset and model in the pre-production testing phase.&nbsp;</li>



<li><strong>Scan</strong>: It is usually a test that is applied to the snapshot.</li>



<li><strong>Config</strong>: It is usually a JSON file that contains a set of parameters that will be used by the scan for running tests in the snapshot.</li>



<li><strong>Custom test</strong>: It allows you to customize your tests by adding and editing various metrics to the config file.&nbsp;</li>
</ul>
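<p>For illustration, a config could be a small JSON document pairing a scan type with its parameters. The keys below are hypothetical, chosen for the example; consult the Etiq docs for the real schema:</p>

```python
import json

# hypothetical drift-scan config (illustrative keys, not Etiq's actual schema)
config = {
    "scan": "drift",
    "params": {
        "features": ["A", "B", "C"],
        "drift_threshold": 0.1,
        "comparison_dataset": "train",
    },
}

# serialized form, as it would be stored in the config file
config_json = json.dumps(config, indent=2)

# the scan later reloads the config and applies it to the snapshot
loaded = json.loads(config_json)
print(loaded["scan"], loaded["params"]["drift_threshold"])
```
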



<p>Etiq offers two types of tests: <strong>Scan</strong> and <strong>Root Cause Analysis</strong> (RCA), the latter being an experimental pipeline. The scan type offers:</p>



<ul class="wp-block-list">
<li><strong>Accuracy</strong>: In some cases, high accuracy can indicate a problem just as low accuracy can. In such cases, an ‘accuracy’ scan can be helpful. If the accuracy is too high, then you might do a leakage scan, or if it is low, then you can do a drift scan.&nbsp;</li>



<li><strong>Leakage</strong>: It helps you to find data leakage.&nbsp;</li>



<li><strong>Drift</strong>: It can help you to find feature drift, target drift, concept drift, and prediction drift.&nbsp;</li>



<li><strong>Bias</strong>: Refers to algorithmic bias, where automated decision-making causes unintended discrimination.&nbsp;</li>
</ul>
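<p>Of these, a leakage scan is the easiest to sketch by hand: a feature that correlates almost perfectly with the target is a red flag. A toy check using a self-contained Pearson correlation (illustrative only; a real leakage scan is more sophisticated):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def leakage_suspects(features, target, threshold=0.95):
    """Flag features whose |correlation| with the target exceeds the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) > threshold]

# toy tabular data: "loan_repaid" is a post-outcome column that copies the target
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "income":      [30.0, 52.0, 45.0, 61.0, 58.0, 33.0, 35.0, 31.0],
    "loan_repaid": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
}
print(leakage_suspects(features, target))  # ['loan_repaid']
```
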



<h5 class="wp-block-heading">Why should you use this?</h5>



<p>Etiq.ai offers a multi-step pipeline, which means you can monitor tests by logging the results of each step in the ML pipeline. This allows you to identify and repair bias within the model. If you are looking for a framework that can do the heavy lifting of your AI pipeline, Etiq.ai is the one to go with.&nbsp;</p>



<p>Some other reasons why you should use Etiq.ai:</p>



<div id="case-study-numbered-list-block_57f77c519e1c6252998be57a660d6d1c"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It is a Python Framework<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Dashboard facility for multiple insights and optimization reporting<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                You can manage multiple projects.             </li>
            </ul>
</div>



<p>All the points above are valid for free tier usage.&nbsp;</p>



<p>One key feature of Etiq.ai is that it allows you to be very precise and straightforward in your model building and deployment approaches. It aims to give users tools that help them achieve the desired model. At times, the development process drifts away from the original plan, mostly because of a lack of tools needed to shape the model. If you want to deploy a model that is aligned with the proposed requirements, Etiq.ai is the way to go. This is because the framework offers <strong>similar tests at each step throughout your ML pipeline.&nbsp;</strong></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-19.png?ssl=1" alt="Etiq.ai " class="wp-image-71587"/><figcaption class="wp-element-caption"><em>Steps of the process when to use Etiq.ai | <a href="https://docs.etiq.ai/#why-use-etiq-for-ml-testing" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_e700c04d8752c688f4c840db614bf7fa"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                A lot of functionalities in the free tier.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Test each of the pipelines for better monitoring<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Supports deep learning frameworks like PyTorch and Keras-Tensorflow<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                You can request a personalized test library.             </li>
            </ul>
</div>



<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_ca0a5b8e52fd3f953d543c1e8dc4c24f"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                At the moment, in production, they only provide functionality for batch processing.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
To apply tests to tasks pertaining to segmentation, regression, or recommendation engines, you must get in touch with the team.             </li>
            </ul>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>The ML testing frameworks that we discussed are directed toward the needs of the users. All of the frameworks have their own pros and cons. But you can definitely get by using any one of these frameworks. ML model testing frameworks play an integral part in defining how the model will perform when deployed to a real-world scenario.&nbsp;</p>



<p>If you are looking for a free and easy-to-use ML testing framework for structured datasets and smaller ML models, then go with DeepChecks. If you are working with DL algorithms, then Etiq.ai is a good option. But if you can spare some money, then you should definitely inquire about Kolena. And lastly, if you are working in a mid to large-size enterprise and looking for ML testing solutions, then hands-down, it has to be Robust Intelligence.&nbsp;</p>



<p>I hope this article provided you with all the preliminary information needed for you to get started with ML testing. Please share this article with everyone who needs it.&nbsp;</p>



<p>Thanks for reading!!!</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-reference">Reference</h3>



<ol class="wp-block-list">
<li><a href="https://www.robustintelligence.com/" target="_blank" rel="noreferrer noopener nofollow">https://www.robustintelligence.com/</a></li>



<li><a href="https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta</a></li>



<li><a href="https://etiq.ai/" target="_blank" rel="noreferrer noopener nofollow">https://etiq.ai/</a></li>



<li><a href="https://docs.etiq.ai/" target="_blank" rel="noreferrer noopener nofollow">https://docs.etiq.ai/</a></li>



<li><a href="https://arxiv.org/pdf/2005.04118.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/2005.04118.pdf</a></li>



<li><a href="https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c</a></li>



<li><a href="https://docs.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">https://docs.kolena.io/</a></li>



<li><a href="https://www.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">https://www.kolena.io/</a></li>



<li><a href="https://github.com/EricSchles/drifter_ml" target="_blank" rel="noreferrer noopener nofollow">https://github.com/EricSchles/drifter_ml</a></li>



<li><a href="https://arxiv.org/pdf/2203.08491.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/2203.08491.pdf</a></li>



<li><a href="https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71</a></li>



<li><a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">https://deepchecks.com/ </a></li>



<li><a href="https://www.xenonstack.com/insights/machine-learning-model-testing" target="_blank" rel="noreferrer noopener nofollow">https://www.xenonstack.com/insights/machine-learning-model-testing</a></li>



<li><a href="https://www.jeremyjordan.me/testing-ml/" target="_blank" rel="noreferrer noopener nofollow">https://www.jeremyjordan.me/testing-ml/</a></li>



<li><a href="https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models</a></li>



<li><a href="https://mlops.toys" target="_blank" rel="noreferrer noopener nofollow">https://mlops.toys</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7265</post-id>	</item>
		<item>
		<title>Building MLOps Pipeline for Computer Vision: Image Classification Task [Tutorial]</title>
		<link>https://neptune.ai/blog/mlops-pipeline-for-computer-vision-image-classification</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Mon, 01 Aug 2022 15:16:11 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.test/mlops-pipeline-for-computer-vision-image-classification/</guid>

					<description><![CDATA[The introduction of Transformers in 2018 by Vaswani and the team brought a significant transformation in the research and development of deep learning models for various tasks. The transformer leverages a self-attention mechanism that was adopted from the attention mechanism by Bahdanau and the team. With this mechanism, one input could interact with other inputs&#8230;]]></description>
										<content:encoded><![CDATA[
<p>The introduction of <a href="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf" target="_blank" rel="noreferrer noopener nofollow">Transformers in 2017 by Vaswani</a> and the team brought a significant transformation in the research and development of deep learning models for various tasks. The transformer leverages a self-attention mechanism adopted from the attention mechanism by Bahdanau and the team. With this mechanism, one input can interact with the other inputs, enabling it to focus, or pay attention, to the important features of the data.&nbsp;</p>
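<p>The mechanism can be written down in a few lines: each input compares itself against every other input, the scaled dot products are turned into weights with a softmax, and each output is a weighted average of the inputs. A minimal single-head sketch in plain Python, with identity query/key/value projections for brevity:</p>

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(inputs):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(inputs[0])
    outputs = []
    for q in inputs:                                    # each input acts as a query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in inputs]                      # similarity with every input
        weights = softmax(scores)                       # how much to attend to each
        outputs.append([sum(w * v[i] for w, v in zip(weights, inputs))
                        for i in range(d)])             # weighted average of values
    return outputs

# three 2-d token embeddings; the first two are similar and attend to each other
tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
out = self_attention(tokens)
```

<p>A real transformer layer additionally learns the query, key, and value projection matrices and runs several such heads in parallel.</p>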



<p>Because of this, transformers were able to achieve state-of-the-art results in various <a href="/category/natural-language-processing" target="_blank" rel="noreferrer noopener">NLP</a> tasks like machine translation, summary generation, and text generation. They have also replaced RNNs and their variants in almost all NLP tasks. With this success in NLP, transformers are now being adopted in <a href="/category/computer-vision" target="_blank" rel="noreferrer noopener">computer vision</a> tasks as well. In 2020, Dosovitskiy and his team developed the vision transformer (ViT), arguing that reliance on CNNs is not necessary. Based on this premise, in this article, we will explore how ViT can help with the task of image classification.&nbsp;&nbsp;</p>



<p>This article is a guide aimed at <strong>building an MLOps pipeline for a computer vision</strong> task using <a href="https://www.google.com/url?q=https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html&amp;sa=D&amp;source=docs&amp;ust=1658825455903181&amp;usg=AOvVaw23I6LB81bRMZdj0MrMq7ID" target="_blank" rel="noreferrer noopener nofollow">ViT</a>, and it will focus on the following areas with respect to a typical data science project:</p>



<ol class="wp-block-list">
<li>Aim of the project</li>



<li>Hardware specification</li>



<li>Attention visualization&nbsp;</li>



<li>Building the model and experiment tracking</li>



<li>Testing and inference</li>



<li>Creating a Streamlit app for deployment</li>



<li>Setting up CI/CD using GitHub actions</li>



<li>Deployment and monitoring</li>
</ol>



<section id="blog-intext-cta-block_7a4db89e8655d5539c97cc46b2397ada" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also</h3>
    
            <p>  <a href="/blog/mlops-pipeline-for-time-series-prediction-tutorial"> Building MLOps Pipeline for Time Series Prediction [Tutorial]</a></p>
<p>  <a href="/blog/mlops-pipeline-for-nlp-machine-translation">Building MLOps Pipeline for NLP: Machine Translation Task [Tutorial]</a></p>
    
    </section>



<p>The code for this article can be found in this <a href="https://github.com/Nielspace/ViT-Pytorch" target="_blank" rel="noreferrer noopener nofollow"><strong>GitHub</strong></a> repository so that you can follow along. Let’s get started.&nbsp;</p>



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-understanding-the-project">MLOps pipeline for image classification: understanding the project</h2>



<p>Understanding the requirements of the project or the client is an important step, as it helps us brainstorm ideas and research the various components that the project might require, such as the latest papers, repositories, relevant work, datasets, and even cloud-based platforms for deployment. This section focuses on two topics:&nbsp;</p>



<div id="case-study-numbered-list-block_36f63fc68a8c34542343bc45072b4d03"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Aim of the project.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Hardware for accelerated training.            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-aim-of-the-project-bird-image-classifier">Aim of the project: bird image classifier&nbsp;</h3>



<p>The aim of the project is to build an image classifier that classifies different species of birds. Since this model will later be deployed in the cloud, we must keep in mind that it has to achieve good accuracy on both the training and testing datasets. To verify that, we will use metrics like precision, recall, the confusion matrix, F1, and the AUROC score to see how the model performs on both datasets. Once the model achieves good scores on the test dataset, we will create a web app and deploy it on a cloud-based server.&nbsp;</p>
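<p>As a toy illustration of how the metrics above relate, precision, recall, and F1 can be computed by hand from dummy binary predictions. This is not the project’s evaluation code, which would run on the real multi-class test set:</p>

```python
# Toy illustration of precision, recall, and F1 on dummy binary predictions;
# the actual project would compute these on the multi-class bird test set.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```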



<section id="blog-intext-cta-block_f0fd4838c4f16ff20633b0f59fe74d6c" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more:</h3>
    
            <p>  <a href="/blog/f1-score-accuracy-roc-auc-pr-auc">F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?</a></p>
    
    </section>



<p>In a nutshell, this is how the project will be executed:</p>



<div id="case-study-numbered-list-block_5151f0189dfaa4cd937afd5c2e46b3d9"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Building the deep learning model with PyTorch<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Testing the model<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Creating a Streamlit app<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Creating directories and their respective config files for deployment<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Finally, deploying it on the Google Cloud Platform            </li>
            </ul>
</div>



<p>Along the way, this project will include some additional practices, such as:&nbsp;</p>



<ul class="wp-block-list">
<li>Live tracking to monitor metrics,</li>



<li>Attention visualization,</li>



<li>Directory structure,</li>



<li>Code formatting for all the Python modules.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-hardware-for-accelerated-training">Hardware for accelerated training</h3>



<p>We will conduct our experiment with two sets of hardware:</p>



<ol class="wp-block-list">
<li><strong>M1 MacBook</strong>: The efficiency of Apple’s M1 processors allows us to quickly develop models and train them on a smaller dataset. Once training is done, we can build a web application on our local machine and create a small pipeline of data ingestion, data preprocessing, model prediction, and attention visualization before scaling the model up in the cloud.&nbsp;</li>
</ol>



<p><strong>Note</strong>: If you have one of these M1 laptops, make sure to check the installation process in my <a href="https://github.com/Nielspace/ViT-Pytorch" target="_blank" rel="noreferrer noopener nofollow">GitHub repo</a>.</p>



<ol start="2" class="wp-block-list">
<li><strong>Kaggle or Google Colab GPUs</strong>: Once our code works properly on the local machine and the pipeline is created, we can scale up and train the whole model for a longer period in Google Colab or Kaggle, both of which are free. Once training is done, we can download the new weights and metadata to our local computer and test whether the web application performs well on unseen data before deploying it to the cloud.&nbsp;</li>
</ol>



<p>Now let’s start the implementation.&nbsp;</p>



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-data-preparation">MLOps pipeline for image classification: data preparation</h2>



<p>The first step of implementing a deep learning project is to plan the different Python modules that we are going to need. Although we will use a Jupyter notebook for experimentation, it is always a good idea to have everything laid out before starting to code. Planning might include collecting reference code repositories as well as research papers.&nbsp;</p>



<p>It also pays to set up the project’s directory structure early, for efficiency and ease of navigation.&nbsp;&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">ViT Classification
├── notebooks
│   └── ViT.ipynb
└── source
    └──config.py
</pre>



<p>In our case, the main directory is called ViT Classification, and it contains two folders:&nbsp;</p>



<ol class="wp-block-list">
<li><strong>Notebooks</strong>: This is where all the experimentation with Jupyter notebooks will reside.</li>



<li><strong>Source</strong>: This is where all the Python modules will reside.&nbsp;</li>
</ol>



<p>As we progress, we will keep adding Python modules to the source directory, and we will also create different sub-directories for storing metadata, docker files, README.md files, et cetera.&nbsp;</p>



<h3 class="wp-block-heading" id="h-building-the-image-classification-model">Building the image classification model</h3>



<p>As mentioned before, research and planning are key to implementing any machine learning project. What I usually do first is create a config.py to store all the parameters related to data preprocessing, model training and inference, visualization, et cetera. </p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/config.py" target="_blank" rel="noreferrer noopener nofollow">config.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Config</span>:</span>
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Image configuration</span>
   IMG_SIZE = <span class="hljs-number" style="color: teal;">32</span>
   PATCH_SIZE = <span class="hljs-number" style="color: teal;">10</span>
   CROP_SIZE = <span class="hljs-number" style="color: teal;">100</span>
   BATCH_SIZE = <span class="hljs-number" style="color: teal;">1</span>
   DATASET_SAMPLE = <span class="hljs-string" style="color: rgb(221, 17, 68);">'full'</span>


   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Optimizer configuration</span>
   LR = <span class="hljs-number" style="color: teal;">0.003</span>
   OPTIMIZER = <span class="hljs-string" style="color: rgb(221, 17, 68);">'Adam'</span>

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Model configuration</span>
   NUM_CLASSES = <span class="hljs-number" style="color: teal;">400</span>
   IN_CHANNELS = <span class="hljs-number" style="color: teal;">3</span>
   HIDDEN_SIZE = <span class="hljs-number" style="color: teal;">768</span>
   NUM_ATTENTION_HEADS = <span class="hljs-number" style="color: teal;">12</span>
   LINEAR_DIM = <span class="hljs-number" style="color: teal;">3072</span>
   NUM_LAYERS = <span class="hljs-number" style="color: teal;">12</span>

   ATTENTION_DROPOUT_RATE = <span class="hljs-number" style="color: teal;">0.1</span>
   DROPOUT_RATE = <span class="hljs-number" style="color: teal;">0.1</span>
   STD_NORM = <span class="hljs-number" style="color: teal;">1e-6</span>
   EPS = <span class="hljs-number" style="color: teal;">1e-6</span>
   MLP_DIM = <span class="hljs-number" style="color: teal;">128</span>
   OUTPUT = <span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>
   LOSS_FN = <span class="hljs-string" style="color: rgb(221, 17, 68);">'nll_loss'</span>

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Device configuration</span>
   DEVICE = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>,<span class="hljs-string" style="color: rgb(221, 17, 68);">"mps"</span>,<span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span>]

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Training configuration</span>
   N_EPOCHS = <span class="hljs-number" style="color: teal;">1</span>
</pre>



<p>The above code block gives a rough idea of what the parameters look like. As we make progress, we can keep adding more parameters.&nbsp;</p>



<p><strong>Note</strong>: In the device configuration section, I have listed three hardware options: CPU, MPS, and CUDA. MPS, or Metal Performance Shaders, is the backend used to train on M1 MacBooks.&nbsp;&nbsp;</p>
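<p>At runtime, the entry to pick from this list can be resolved with a small helper. The sketch below (the helper name is my own) prefers CUDA, then MPS, then falls back to the CPU:</p>

```python
import torch

def pick_device() -> str:
    # Prefer NVIDIA GPUs, then Apple's Metal (MPS) backend, then the CPU.
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = torch.device(pick_device())
```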



<h4 class="wp-block-heading">Dataset</h4>



<p>The dataset that we will use is the bird classification dataset, which can be <a href="https://www.kaggle.com/datasets/gpiosenka/100-bird-species" target="_blank" rel="noreferrer noopener nofollow">downloaded from Kaggle</a>. The dataset consists of 400 classes of birds and three subsets: training, validation, and testing, containing 58,388, 2,000, and 2,000 images, respectively. Once the data has been downloaded, we can create a function to read and visualize the images.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-1.png?ssl=1" alt="sample from the dataset" class="wp-image-69916" style="width:458px;height:469px"/><figcaption class="wp-element-caption"><em>The image above is a sample from the dataset along with the class that it belongs to&nbsp;| <a href="https://www.kaggle.com/datasets/gpiosenka/100-bird-species" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Preparing the data</h4>



<p>We can move ahead and create a data loader that transforms the images into image tensors. Along with that, we will also perform resizing, cropping, and normalization. Once preprocessing is done, we can use the DataLoader class to automatically generate training data in batches. The following pseudo-function gives an idea of what we are trying to achieve; you can find the full code via the link in the code heading:</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/preprocessing.py" target="_blank" rel="noreferrer noopener nofollow">preprocessing.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#apply the desired transformations on dataset and split it into train, validation, and test set.</span>

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">Dataset</span><span class="hljs-params">(bs, crop_size, sample_size=<span class="hljs-string" style="color: rgb(221, 17, 68);">'full'</span>)</span>:</span>
      <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> train_data, valid_data, test_data
</pre>



<p>The above function has a sample-size argument that allows creating a subset of the training dataset for testing purposes on your local machine.&nbsp;</p>
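<p>To make the shape of this helper concrete, here is a minimal stand-in that returns three DataLoaders over random tensors instead of the real image folders. Sizes and names below are illustrative only, not the repo’s code:</p>

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def Dataset(bs, crop_size, sample_size="full"):
    # Stand-in for the real function: random tensors instead of bird images.
    n_train = 64 if sample_size == "full" else 16

    def make_split(n):
        images = torch.randn(n, 3, crop_size, crop_size)   # preprocessed image tensors
        labels = torch.randint(0, 400, (n,))               # 400 bird classes
        return DataLoader(TensorDataset(images, labels), batch_size=bs, shuffle=True)

    return make_split(n_train), make_split(8), make_split(8)

train_data, valid_data, test_data = Dataset(bs=4, crop_size=32, sample_size="sample")
images, labels = next(iter(train_data))
print(images.shape)  # torch.Size([4, 3, 32, 32])
```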



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-building-the-vision-transformer-using-pytorch">MLOps pipeline for image classification: building the vision transformer using PyTorch</h2>



<p>I have created the full model as per the authors&#8217; description of ViT in their paper. This code is inspired by the <a href="https://github.com/jeonsworld/ViT-pytorch" target="_blank" rel="noreferrer noopener nofollow"><strong>jeonsworld</strong></a> repo; I have added a few more details and edited some lines of code for the purpose of this task.&nbsp;</p>



<p>The model I have created is divided into 9 modules, and each module can be executed independently for various tasks. We will explore each section in turn, for ease of understanding.&nbsp;</p>



<h3 class="wp-block-heading" id="h-embedding">Embedding</h3>



<p>Transformers, like all natural language models, have an important component called <strong>embedding</strong>. Its function is usually to capture semantic information by grouping similar information together. Apart from that, embeddings can be learned and reused across models.&nbsp;</p>



<p>In ViT, embeddings serve the same purpose by retaining positional information, which is fed into the encoder. Again, the following pseudo-code will help you understand what’s going on, and you can find the full code via the link in the code heading.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/embeddings.py" target="_blank" rel="noreferrer noopener nofollow">embedding.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Embeddings</span><span class="hljs-params">(nn.Module)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Construct the embeddings from patch, position embeddings.</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, img_size:int, hidden_size:int, in_channels:int)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#create a CONV2D object for creation of embeddings </span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#calculate and return embeddings</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> embeddings
</pre>



<p>Note that the embedding patches for the image can be created using a convolution layer. This is quite efficient and easy to modify as well.&nbsp;</p>
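<p>A minimal sketch of this idea follows. Dimensions roughly follow the config above, except that I use an 8-pixel patch so it divides the 32-pixel image evenly; this is an illustration, not the repo’s exact Embeddings class:</p>

```python
import torch
import torch.nn as nn

# Conv-based patch embedding: a convolution whose kernel and stride both equal
# the patch size cuts the image into non-overlapping patches and projects each
# one to the hidden size in a single operation.
img_size, patch_size, in_channels, hidden_size = 32, 8, 3, 768
patch_embed = nn.Conv2d(in_channels, hidden_size,
                        kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, in_channels, img_size, img_size)  # one RGB image
patches = patch_embed(x)                             # (1, 768, 4, 4) feature map
tokens = patches.flatten(2).transpose(1, 2)          # one 768-dim token per patch
print(tokens.shape)  # torch.Size([1, 16, 768])
```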



<h3 class="wp-block-heading" id="h-encoder">Encoder</h3>



<p>The encoder is made up of a number of attention blocks, each of which has two important modules:</p>



<div id="case-study-numbered-list-block_f222b8f23e7ecf57a2f6928059cf9ac6"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Self Attention Mechanism<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Multi-layer perceptron (MLP)            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Self attention mechanism</h4>



<p>Let’s start with the self-attention mechanism.&nbsp;</p>



<p>The self-attention mechanism is the core of the whole system. It enables the model to focus on the important features of the data. It does so by relating different positions of a single sequence to compute a representation of that same sequence. You can find the link to the entire code below to get a deeper picture.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention.py" target="_blank" rel="noreferrer noopener nofollow">attention.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Calculate the attention and return the attention output along with the weights</span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Attention</span><span class="hljs-params">(nn.Module)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> attention_output, weights</pre>



<p>The attention block yields the attention output as well as the attention weights. The latter will be used to visualize the regions of interest computed by the attention mechanism.&nbsp;</p>
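<p>For intuition, here is a single-head sketch of scaled dot-product attention that returns the same (output, weights) pair. The repo’s Attention class is multi-headed, so treat this as an illustration only:</p>

```python
import math
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    # Single-head scaled dot-product attention (illustrative stand-in).
    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, x):
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Every token scores every other token; softmax turns scores into weights.
        weights = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return weights @ v, weights  # attention output plus weights for visualization

x = torch.randn(1, 17, 768)          # 16 patch tokens + 1 class token
output, weights = SimpleAttention(768)(x)
print(output.shape, weights.shape)   # (1, 17, 768) and (1, 17, 17)
```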



<h4 class="wp-block-heading">Multilayer perceptron</h4>



<p>Once we receive the attention output, we can feed it into the MLP, which gives us a probability distribution for classification. You can get an idea of the entire process from the forward function. To see the full code, click the link provided in the code heading below.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/linear.py" target="_blank" rel="noreferrer noopener nofollow">linear.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Apply a linear transformation to the incoming attention output using the GELU activation function.</span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Mlp</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, hidden_size, linear_dim, dropout_rate, std_norm)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> x</pre>



<p>It is worth noting that we are using GELU as our activation function.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-2.png?ssl=1" alt="activation function" class="wp-image-69917"/><figcaption class="wp-element-caption"><em>GELU as activation function | <a href="https://mlfromscratch.com/activation-functions-explained/#/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>One of the advantages of GELU is that it helps avoid vanishing gradients, which makes the model easier to scale.&nbsp;</p>
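<p>A rough sketch of such an MLP block, using the hidden and linear dimensions from the config above (this is not the repo’s exact Mlp class):</p>

```python
import torch
import torch.nn as nn

class Mlp(nn.Module):
    # Two linear layers with GELU and dropout in between, projecting the
    # hidden size up to the linear dimension and back down again.
    def __init__(self, hidden_size=768, linear_dim=3072, dropout_rate=0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, linear_dim)
        self.fc2 = nn.Linear(linear_dim, hidden_size)
        self.act = nn.GELU()
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        x = self.dropout(self.act(self.fc1(x)))
        return self.dropout(self.fc2(x))

x = torch.randn(1, 17, 768)   # 16 patch tokens + 1 class token
out = Mlp()(x)
print(out.shape)  # torch.Size([1, 17, 768])
```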



<h4 class="wp-block-heading">Attention-block</h4>



<p>The attention block is the module where we assemble both of the previous modules: the self-attention module and the MLP module.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention_block.py" target="_blank" rel="noreferrer noopener nofollow">attention_block.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Returns the calculated sum of attention scores via MLP along with attention weights.</span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Block</span><span class="hljs-params">(nn.Module)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> x, weights
</pre>



<p>This module also yields the attention weights directly from the attention mechanism, along with the distribution produced by the MLP.&nbsp;</p>
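<p>Structurally, the block follows the ViT paper: pre-norm self-attention and an MLP, each wrapped in a residual connection. Below is a self-contained sketch using PyTorch’s built-in multi-head attention; details may differ from the repo’s Block class:</p>

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # One transformer block: LayerNorm -> self-attention -> residual,
    # then LayerNorm -> MLP -> residual.
    def __init__(self, hidden_size=768, num_heads=12, linear_dim=3072, eps=1e-6):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, eps=eps)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, eps=eps)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, linear_dim), nn.GELU(),
                                 nn.Linear(linear_dim, hidden_size))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, weights = self.attn(h, h, h, need_weights=True)
        x = x + attn_out                 # residual around attention
        x = x + self.mlp(self.norm2(x))  # residual around the MLP
        return x, weights

x = torch.randn(1, 17, 768)
out, weights = Block()(x)
print(out.shape, weights.shape)  # (1, 17, 768) and (1, 17, 17)
```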



<p>Now let’s briefly look at the encoder. The encoder essentially enables us to create multiple attention blocks, which gives the transformer more control over the attention mechanism. The three components (Encoder, Transformer, and ViT) are written in the same module, i.e., <a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention_block.py" target="_blank" rel="noreferrer noopener nofollow">transformers.py</a>.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Creates multiple layers of attention blocks and returns encoded state and attention weights. </span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Encoder</span><span class="hljs-params">(nn.Module)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> encoded, attn_weights</pre>
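<p>As a self-contained stand-in for this stacking, torch.nn ships a ready-made encoder stack. Note that, unlike the repo’s Encoder, it does not expose each layer’s attention weights, which the visualization later relies on:</p>

```python
import torch
import torch.nn as nn

# A stack of transformer encoder layers; the config uses 12 layers, but two
# are enough here to illustrate the shape-preserving behaviour.
hidden_size, num_heads = 768, 12
layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                   dim_feedforward=3072, activation="gelu",
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(1, 17, hidden_size)  # 16 patch tokens + 1 class token
encoded = encoder(x)
print(encoded.shape)  # torch.Size([1, 17, 768])
```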



<h3 class="wp-block-heading" id="h-transformer">Transformer</h3>



<p>After assembling the attention block, we can code our transformer. The transformer is an assembly of the embedding module and the encoder module.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Transformer</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, img_size, hidden_size, in_channels, num_layers,
                num_attention_heads, linear_dim, dropout_rate, attention_dropout_rate,
                eps, std_norm)</span>:</span>
       super(Transformer, self).__init__()
       self.embeddings = Embeddings(img_size, hidden_size, in_channels)
       self.encoder = Encoder(num_layers, hidden_size, num_attention_heads,
                              linear_dim, dropout_rate, attention_dropout_rate,
                              eps, std_norm)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, input_ids)</span>:</span>
       embedding_output = self.embeddings(input_ids)
       encoded, attn_weights = self.encoder(embedding_output)
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> encoded, attn_weights
</pre>



<h3 class="wp-block-heading" id="h-vision-transformer">Vision transformer</h3>



<p>Finally, we can code our vision transformer, which consists of two components: the transformer and a final linear layer. The final linear layer gives us the probability distribution over all the classes. It can be described as:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">VisionTransformer</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, img_size, num_classes, hidden_size, in_channels, num_layers,
                num_attention_heads, linear_dim, dropout_rate, attention_dropout_rate,
                eps, std_norm)</span>:</span>
       super(VisionTransformer, self).__init__()
       self.classifier = <span class="hljs-string" style="color: rgb(221, 17, 68);">'token'</span>

       self.transformer=Transformer(img_size, hidden_size, in_channels,
                                    num_layers, num_attention_heads, linear_dim,
                                    dropout_rate, attention_dropout_rate, eps,
                                    std_norm)
       self.head = Linear(hidden_size, num_classes)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x, labels=None)</span>:</span>
       x, attn_weights = self.transformer(x)
       logits = self.head(x[:, <span class="hljs-number" style="color: teal;">0</span>])

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> labels <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span>:
           loss_fct = CrossEntropyLoss()
           loss = loss_fct(logits.view(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">400</span>), labels.view(<span class="hljs-number" style="color: teal;">-1</span>))
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> loss
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> logits, attn_weights</pre>



<p>Notice that the network consistently yields attention weights, which will be useful for visualizing the attention maps.&nbsp;</p>
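<p>One common way to turn those weights into a heatmap is to average the heads, take the class token’s row, and reshape it onto the patch grid. The names and sizes below are illustrative stand-ins, not the repo’s visualization code:</p>

```python
import torch

# Stand-in for the per-layer attention weights returned by the model:
# 12 layers, each of shape (batch, heads, tokens, tokens), 16 patches + 1 class token.
attn_weights = [torch.softmax(torch.randn(1, 12, 17, 17), dim=-1) for _ in range(12)]

last = attn_weights[-1].mean(dim=1)     # average over heads -> (1, 17, 17)
cls_to_patches = last[0, 0, 1:]         # class token's attention to the 16 patches
heatmap = cls_to_patches.reshape(4, 4)  # back onto the 4x4 patch grid for plotting
print(heatmap.shape)  # torch.Size([4, 4])
```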



<p>Here is a bonus tip: if you want to see the architecture of the model and how the inputs are operated on, use the following lines of code. They will generate the full operational architecture for you.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchviz <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_dot
x = torch.randn(<span class="hljs-number" style="color: teal;">1</span>,config.IN_CHANNELS*config.IMG_SIZE*config.IMG_SIZE)
x = x.reshape(<span class="hljs-number" style="color: teal;">1</span>,config.IN_CHANNELS,config.IMG_SIZE,config.IMG_SIZE)
logits, attn_weights = model(x)
make_dot(logits, params=dict(list(model.named_parameters()))).render(<span class="hljs-string" style="color: rgb(221, 17, 68);">"../metadata/VIT"</span>, format=<span class="hljs-string" style="color: rgb(221, 17, 68);">"png"</span>)</pre>



<p>You can find the image in the given <a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/metadata/VIT.png" target="_blank" rel="noreferrer noopener nofollow">link</a>.&nbsp;</p>



<p>But in a nutshell, this is how the architecture looks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-3.png?ssl=1" alt="vision transformer" class="wp-image-69918"/><figcaption class="wp-element-caption"><em>The architecture of vision transformer | <a href="https://arxiv.org/pdf/2010.11929.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-training-vision-transformer-using-pytorch">MLOps pipeline for image classification: training vision transformer using Pytorch</h2>



<p>The training module is where we assemble all the other modules, like the config module, the preprocessing module, and the Transformer, and log the parameters, including the metadata, into the <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> API. The easiest way to log parameters is to use Config.__dict__, which automatically converts a class into a dictionary. </p>



<section
	id="i-box-block_2e65339ef60ce26f3f9d1ca61c887287"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Disclaimer</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Please note that this article references a <strong>deprecated version of Neptune</strong>.</p>



<p>For information on the latest version with improved features and functionality, please <a href="/" target="_blank" rel="noreferrer noopener">visit our website</a>.</p>


	</div>

</section>



<p>You can later create a function that removes unnecessary attributes from the dictionary.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">neptune_monitoring</span><span class="hljs-params">()</span>:</span>
   PARAMS = {}
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> key, val <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> Config.__dict__.items():
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> key <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> [<span class="hljs-string" style="color: rgb(221, 17, 68);">'__module__'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'__dict__'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'__weakref__'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'__doc__'</span>]:
           PARAMS[key] = val
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> PARAMS</pre>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-training">Training&nbsp;</h3>



<p>The training function is straightforward to write. I have included both training and evaluation in the pseudo-code. You can find <a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/train.py" target="_blank" rel="noreferrer noopener nofollow">the full training block here</a>, or you can click the code heading below.</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/train.py" target="_blank" rel="noreferrer noopener nofollow">train.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">train_Engine</span><span class="hljs-params">(n_epochs, train_data, val_data, model, optimizer, loss_fn, device,
                monitoring=True)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Initiates the training procedure while tracking accuracy and loss over each iterations. </span>
</pre>
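<p>For a concrete picture, here is a minimal, illustrative sketch of what such a training engine typically does: loop over epochs, optimize on training batches, and evaluate on validation data. The model, data, and shapes below are toy stand-ins, and the Neptune logging and loss_fn string handling of the real train.py are omitted.</p>

```python
import torch
import torch.nn.functional as F
from torch import nn, optim

# Hedged sketch of a train_Engine-style loop; it assumes the nll_loss
# objective mentioned in train.py and skips monitoring/logging.
def train_engine(n_epochs, train_data, val_data, model, optimizer, device="cpu"):
    model.to(device)
    for epoch in range(n_epochs):
        model.train()
        for x, y in train_data:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = F.nll_loss(F.log_softmax(logits, dim=1), y)
            loss.backward()
            optimizer.step()
        # Evaluation pass: accuracy on the validation split
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_data:
                preds = model(x.to(device)).argmax(dim=1)
                correct += (preds == y.to(device)).sum().item()
                total += y.numel()
        print(f"epoch {epoch}: val_acc={correct / total:.2f}")

# Toy usage: a linear classifier on random data standing in for the ViT
torch.manual_seed(0)
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
model = nn.Linear(16, 4)
train_engine(2, data, data, model, optim.Adam(model.parameters(), lr=1e-3))
```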



<p>Now that our training loop is complete, we can start the training and log the metadata to the Neptune dashboard, which we can use to monitor the training on the go, save charts and parameters, and share them with teammates. </p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/train.py" target="_blank" rel="noreferrer noopener nofollow">train.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> __name__ == <span class="hljs-string" style="color: rgb(221, 17, 68);">'__main__'</span>:
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> preprocessing <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dataset
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> config <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Config
   config = Config()
   params = neptune_monitoring(Config)

   run = neptune.init_run(project=<span class="hljs-string" style="color: rgb(221, 17, 68);">"nielspace/ViT-bird-classification"</span>,
                       api_token=API_TOKEN)
   run[<span class="hljs-string" style="color: rgb(221, 17, 68);">'parameters'</span>] = params

   model = VisionTransformer(img_size=config.IMG_SIZE,
                num_classes=config.NUM_CLASSES,
                hidden_size=config.HIDDEN_SIZE,
                in_channels=config.IN_CHANNELS,
                num_layers=config.NUM_LAYERS,
                num_attention_heads=config.NUM_ATTENTION_HEADS,
                linear_dim=config.LINEAR_DIM,
                dropout_rate=config.DROPOUT_RATE,
                attention_dropout_rate=config.ATTENTION_DROPOUT_RATE,
                eps=config.EPS,
                std_norm=config.STD_NORM)

   train_data, val_data, test_data = Dataset(config.BATCH_SIZE, config.IMG_SIZE,
                                             config.DATASET_SAMPLE)

   optimizer = optim.Adam(model.parameters(), lr=<span class="hljs-number" style="color: teal;">0.003</span>)
   train_Engine(n_epochs=config.N_EPOCHS, train_data=train_data, val_data=val_data,
               model=model,optimizer=optimizer, loss_fn=<span class="hljs-string" style="color: rgb(221, 17, 68);">'nll_loss'</span>,
               device=config.DEVICE[<span class="hljs-number" style="color: teal;">1</span>], monitoring=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre>



<p><strong>Note</strong>: The prototyping of this model was done on a MacBook Air M1 with a smaller dataset of 10 classes. The prototyping stage is where I tried different configurations and played with the architecture of the model. Once I was satisfied, I used Kaggle to train the model. Since the full dataset has 400 classes, the model needed to be larger and trained for a longer period of time.&nbsp;&nbsp;</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-experiment-tracking">Experiment tracking</h3>



<p>In the prototyping stage, experiment tracking becomes a very handy and reliable tool for improving your model. You can keep an eye on your model’s performance during training and make the necessary tweaks until you get a high-performing model.</p>



<p>The Neptune API enables you to:</p>



<ul class="wp-block-list">
<li>monitor the model’s training progress</li>



<li>and simultaneously upload the metrics into the system.</li>



<li>It also allows you to compare multiple runs involving different model configurations and simultaneously choose the best one.</li>
</ul>



<p>If you want to log your metadata in the system, then import the Neptune API and call the init function. Following that, enter the API key provided for the project, and you are good to go. Get to know more about how to <a href="https://docs-legacy.neptune.ai/setup/installation/" target="_blank" rel="noreferrer noopener">get started with Neptune here</a>. Also, <a href="https://app.neptune.ai/nielspace/ViT-bird-classification/experiments?split=tbl&amp;dash=charts&amp;viewId=standard-view" target="_blank" rel="noreferrer noopener">here is the Neptune dashboard</a>, which has the metadata related to this project.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">run = neptune.init_run(project=<span class="hljs-string" style="color: rgb(221, 17, 68);">"nielspace/ViT-bird-classification"</span>,
api_token=<span class="hljs-string" style="color: rgb(221, 17, 68);">"API_TOKEN"</span>)</pre>



<p>Once you are done with the initialization, you can start logging. For instance, if you want to:</p>



<ol class="wp-block-list">
<li>Upload the parameters, use: run[&#8216;parameters&#8217;] = params. <br>Note: make sure that the params are of dictionary class.</li>



<li>Upload metrics, use: run[&#8216;Training_loss&#8217;].log(loss.item()) and run[&#8216;Validation_loss&#8217;].log(loss.item())</li>



<li>Upload model weights, use: run[&#8220;model_checkpoints/ViT&#8221;].upload(&#8220;model.pt&#8221;)</li>



<li>Upload images, use: run[&#8220;val/conf_matrix&#8221;].upload(&#8220;confusion_matrix.png&#8221;)</li>
</ol>



<p>Depending upon what you are optimizing your model for, there are plenty of things that you can log and track. In our case, we put an emphasis on training and validation loss and accuracy.</p>



<h4 class="wp-block-heading">Logging metadata and dashboard</h4>



<p>In the ongoing training process, you can then monitor the model’s performance. With each iteration, the graph will update.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://neptune.ai/blog/mlops-pipeline-for-computer-vision-image-classification/attachment/mlops-pipeline-computer-vision-neptune-1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLOps-pipeline-computer-vision-neptune-1.png?ssl=1" alt="MLOps pipeline computer vision neptune 1" class="wp-image-70283"/></a><figcaption class="wp-element-caption"><em><a href="https://app.neptune.ai/nielspace/ViT-bird-classification/e/VIT-23/charts" target="_blank" rel="noreferrer noopener nofollow">Monitoring the model&#8217;s performance</a></em></figcaption></figure>
</div>


<p>Along with the model’s performance, you will also find CPU and GPU metrics. See the image below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://neptune.ai/blog/mlops-pipeline-for-computer-vision-image-classification/attachment/mlops-pipeline-computer-vision-neptune-2"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLOps-pipeline-computer-vision-neptune-2.png?ssl=1" alt="MLOps pipeline computer vision neptune 2" class="wp-image-70284"/></a><figcaption class="wp-element-caption"><em><a href="https://app.neptune.ai/nielspace/ViT-bird-classification/e/VIT-23/monitoring" target="_blank" rel="noreferrer noopener">CPU and GPU performance</a></em></figcaption></figure>
</div>


<p>You can find all the model metadata as well.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://neptune.ai/mlops-pipeline-computer-vision-6"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-6.png?ssl=1" alt="model metadata" class="wp-image-69921"/></a><figcaption class="wp-element-caption"><em>The model metadata</em></figcaption></figure>
</div>


<h3 class="wp-block-heading" class="wp-block-heading" id="h-scaling-using-kaggle">Scaling using Kaggle</h3>



<p>Now, let’s scale the model. We will use Kaggle for this project because it is free and because the dataset was downloaded from Kaggle, so it will be easy to scale and train the model on the platform itself.&nbsp;</p>



<ol class="wp-block-list">
<li>The first thing we need to do is to upload the model and change the directory path to Kaggle-specific paths and enable the GPUs.&nbsp;</li>
</ol>



<ol start="2" class="wp-block-list">
<li>Note that the model must be complex enough to capture the information relevant for prediction. Start scaling the model by gradually increasing the number of hidden layers and observing how the model behaves. You may not want to touch other parameters like the number of attention heads and the hidden size, because incompatible values can raise arithmetic (shape-mismatch) errors.&nbsp;</li>
</ol>
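<p>One concrete constraint behind those arithmetic errors: in multi-head attention, the hidden size is split evenly across the heads, so it must be divisible by the number of attention heads. A quick, illustrative sanity check (the values below are hypothetical, not the project&#8217;s actual config):</p>

```python
# Sanity check worth running before scaling the model: the hidden size is
# split evenly across attention heads, so it must divide cleanly.
def check_attention_config(hidden_size, num_attention_heads):
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size {hidden_size} is not divisible by "
            f"num_attention_heads {num_attention_heads}"
        )
    return hidden_size // num_attention_heads  # per-head dimension

head_dim = check_attention_config(768, 12)
print(head_dim)  # 64
```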



<ol start="3" class="wp-block-list">
<li>For each change, run the model for at least two epochs on small data batches with all 400 classes and observe whether the accuracy increases. Typically, it will.&nbsp;</li>
</ol>



<ol start="4" class="wp-block-list">
<li>Once satisfied, run the model for 10 to 15 epochs, which takes around 5 hours for the subset of 30,000 samples.&nbsp;</li>
</ol>



<ol start="5" class="wp-block-list">
<li>After the training, check its performance on the test dataset, and if it performs well, then download the model weights. At this point, the size of the model should be around 650 MB for 400 classes.&nbsp;</li>
</ol>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-attention-visualization">Attention visualization</h3>



<p>As mentioned before, self-attention is the crux of the Vision Transformer architecture, and interestingly, there is a way to visualize it. The source code for the attention map can be found <a href="https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb" target="_blank" rel="noreferrer noopener nofollow">here</a>. I have modified it a bit and turned it into a separate, independent module that uses the transformer’s output to produce the attention maps. The idea is to store the input image and its corresponding attention-map image and display them in the README.md file.&nbsp;</p>
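<p>As a simplified illustration of the core step, the sketch below averages one layer&#8217;s attention weights over the heads, reads off how much the [CLS] token attends to each image patch, and upsamples the patch grid to image resolution. The linked notebook goes further and multiplies the attention across layers (attention rollout); the shapes here are made up for the example.</p>

```python
import numpy as np

# Illustrative version of the attention-map computation (single layer,
# no rollout). attn has shape (num_heads, num_tokens, num_tokens),
# where token 0 is the [CLS] token.
def cls_attention_map(attn, img_size):
    mean_attn = attn.mean(axis=0)          # average over heads -> (tokens, tokens)
    cls_to_patches = mean_attn[0, 1:]      # [CLS] -> each patch, drop [CLS]->[CLS]
    grid = int(np.sqrt(cls_to_patches.size))
    patch_map = cls_to_patches.reshape(grid, grid)
    scale = img_size // grid
    # nearest-neighbour upsampling via a Kronecker product
    mask = np.kron(patch_map, np.ones((scale, scale)))
    return mask / mask.max()               # normalise to [0, 1] for overlaying

rng = np.random.default_rng(0)
attn = rng.random((12, 65, 65))            # 12 heads, 1 [CLS] + 8x8 patches
mask = cls_attention_map(attn, img_size=128)
print(mask.shape)  # (128, 128)
```

The normalised mask can then be multiplied with (or alpha-blended over) the original image, which is what produces the bright regions in the figure below.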



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention_viz.py" target="_blank" rel="noreferrer noopener nofollow">attention_viz.py</a> (<a href="https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb" target="_blank" rel="noreferrer noopener nofollow">Source</a>)</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">attention_viz</span><span class="hljs-params">(model, test_data, img_path=PATH, device=<span class="hljs-string" style="color: rgb(221, 17, 68);">'mps'</span>)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Visualizes the attention mask of a given input (image) by comparing it with the original image. </span>

</pre>



<p>We can run this code by simply calling the <strong>attention_viz</strong> function and passing the corresponding arguments.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> __name__ == <span class="hljs-string" style="color: rgb(221, 17, 68);">'__main__'</span>:
   train_data, val_data, test_data = Dataset(config.BATCH_SIZE,config.IMG_SIZE, config.DATASET_SAMPLE)
   model = torch.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">'metadata/models/model.pth'</span>, map_location=torch.device(<span class="hljs-string" style="color: rgb(221, 17, 68);">'cpu'</span>))
   attention_viz(model, test_data, PATH)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-7.png?ssl=1" alt="Attention Visualization" class="wp-image-69922"/><figcaption class="wp-element-caption"><em>The image above is an example of attention visualization. The image on the left is the original image whereas the image on the right is overlaid with the attention map. The region i.e. the face of the bird is quite bright as that area constitutes the features to which the model is paying attention&nbsp;</em></figcaption></figure>
</div>


<h3 class="wp-block-heading" class="wp-block-heading" id="h-testing-and-inference">Testing and inference</h3>



<p>We can also use the <strong>attention_viz</strong> function in the test module, where we test the model on the test data and measure its performance with various metrics: the confusion matrix, accuracy, F1 score, recall, and precision.</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/test.py" target="_blank" rel="noreferrer noopener nofollow">test.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">test</span><span class="hljs-params">(model, test_data)</span>:</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> logits_, ground, confusion_matrix

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Evaluates the model’s performance on the test dataset and returns the confusion matrix, logits and ground truth for further performance evaluation. </span></pre>



<p>We can easily generate a confusion matrix, visualize it with a seaborn heatmap, and save it in the results folder, which we can also use to display it in the README.md file.&nbsp;</p>
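<p>For reference, the confusion matrix itself is just a count table indexed by (true class, predicted class). A minimal NumPy sketch with illustrative labels (the real module works on the 400-class test set and renders the heatmap with seaborn):</p>

```python
import numpy as np

# Accumulate a confusion matrix from ground truth and predictions.
# Rows index the true class, columns the predicted class.
def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 1, 2, 2, 1]   # illustrative labels
y_pred = [0, 1, 2, 1, 1]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)

# To render and save it as in the article (assuming seaborn and
# matplotlib are installed):
#   import seaborn as sns, matplotlib.pyplot as plt
#   sns.heatmap(cm)
#   plt.savefig("results/confusion_matrix.png")
```

Correct predictions accumulate on the diagonal, which is exactly the white band visible in the heatmap below.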


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-8.png?ssl=1" alt="confusion matrix" class="wp-image-69923"/><figcaption class="wp-element-caption"><em>Above is the image of a confusion matrix that is of the shape 100X100 trained for 50 epochs. As you can see the model is quite efficient to predict true positives which can be seen in the diagonals in white color. But there are few false positives across the graph which means that the model still makes wrong predictions</em></figcaption></figure>
</div>


<p>We can also generate the accuracy and loss graphs and store them in the results folder. We can then use Sklearn to compute other metrics, but before that, we must convert the tensors into NumPy arrays.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">probs = torch.zeros(len(logits_))
y_ = torch.zeros(len(ground))
idx = <span class="hljs-number" style="color: teal;">0</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> l, o <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(logits_, ground):
   _, l = torch.max(l, dim=<span class="hljs-number" style="color: teal;">1</span>)
   probs[idx] = l
   y_[idx] = o.item()
   idx+=<span class="hljs-number" style="color: teal;">1</span>

prob = probs.to(torch.long).numpy()
y_ = y_.to(torch.long).numpy()

print(accuracy_score(y_, prob))
print(cohen_kappa_score(y_, prob))
print(classification_report(y_, prob))</pre>



<p>Once we are satisfied with the model’s performance, we can serve inference by creating a Streamlit app.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-creating-the-app-using-streamlit">MLOps pipeline for image classification: creating the app using Streamlit</h2>



<p>The <a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow">Streamlit</a> app will be a web app that we deploy to the cloud. To build it, we must first pip install streamlit and then import the library in a new module.&nbsp;</p>



<p>The module will contain the same code as the inference module: we just need to copy the evaluation function as-is and then build the app with the Streamlit library. Below is the code of the app.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/app.py" target="_blank" rel="noreferrer noopener nofollow">app.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> warnings
warnings.simplefilter(action=<span class="hljs-string" style="color: rgb(221, 17, 68);">'ignore'</span>, category=FutureWarning)

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> PIL <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Image
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchvision <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> transforms
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> streamlit <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> st

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> embeddings <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Embeddings
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> attention_block <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Block
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> linear <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Mlp
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> attention <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Attention
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> transformer <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> VisionTransformer, Transformer, Encoder

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> config <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Config
config = Config()

st.set_option(<span class="hljs-string" style="color: rgb(221, 17, 68);">'deprecation.showfileUploaderEncoding'</span>, <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>)
st.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Bird Image Classifier"</span>)
st.write(<span class="hljs-string" style="color: rgb(221, 17, 68);">""</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># enable users to upload images for the model to make predictions</span>
file_up = st.file_uploader(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Upload an image"</span>, type = <span class="hljs-string" style="color: rgb(221, 17, 68);">"jpg"</span>)


<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">predict</span><span class="hljs-params">(image)</span>:</span>
   <span class="hljs-string" style="color: rgb(221, 17, 68);">"""Return top 5 predictions ranked by highest probability.
   Parameters
   ----------
   :param image: uploaded image
   :type image: jpg
   :rtype: list
   :return: top 5 predictions ranked by highest probability
   """</span>
   model = torch.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">'model.pth'</span>)

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># transform the input image through resizing, normalization</span>
   transform = transforms.Compose([
       transforms.Resize(<span class="hljs-number" style="color: teal;">128</span>),
       transforms.CenterCrop(<span class="hljs-number" style="color: teal;">128</span>),
       transforms.ToTensor(),
       transforms.Normalize(
           mean = [<span class="hljs-number" style="color: teal;">0.485</span>, <span class="hljs-number" style="color: teal;">0.456</span>, <span class="hljs-number" style="color: teal;">0.406</span>],
           std = [<span class="hljs-number" style="color: teal;">0.229</span>, <span class="hljs-number" style="color: teal;">0.224</span>, <span class="hljs-number" style="color: teal;">0.225</span>])])

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load the image, pre-process it, and make predictions</span>
   img = Image.open(image)
   x = transform(img)
   x = torch.unsqueeze(x, <span class="hljs-number" style="color: teal;">0</span>)
   model.eval()
   logits, attn_w = model(x)

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> open(<span class="hljs-string" style="color: rgb(221, 17, 68);">'../metadata/classes.txt'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'r'</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> f:
        classes = f.read().split(<span class="hljs-string" style="color: rgb(221, 17, 68);">'\n'</span>)

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># return the top 5 predictions ranked by highest probabilities</span>
   prob = torch.nn.functional.softmax(logits, dim = <span class="hljs-number" style="color: teal;">1</span>)[<span class="hljs-number" style="color: teal;">0</span>] * <span class="hljs-number" style="color: teal;">100</span>
   _, indices = torch.sort(logits, descending = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> [(classes[idx], prob[idx].item()) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> idx <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> indices[<span class="hljs-number" style="color: teal;">0</span>][:<span class="hljs-number" style="color: teal;">5</span>]]


<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> file_up <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span>:
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># display image that user uploaded</span>
   image = Image.open(file_up)
   st.image(image, caption = <span class="hljs-string" style="color: rgb(221, 17, 68);">'Uploaded Image.'</span>, use_column_width = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
   st.write(<span class="hljs-string" style="color: rgb(221, 17, 68);">""</span>)
   st.write(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Processing..."</span>)
   labels = predict(file_up)

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># print out the top 5 prediction labels with scores</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> labels:
       st.write(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"Prediction {i[0]} score {i[1]:.2f}"</span>)</pre>



<p>But before we deploy, we must test it locally. In order to test the app, we will run the following command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">streamlit run app.py</pre>



<p>Once the above command is executed, you will get the following prompt:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">You can now view your Streamlit app <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> your browser.

  Local URL: http://localhost:<span class="hljs-number" style="color: teal;">8501</span>
  Network URL: http://<span class="hljs-number" style="color: teal;">192.168</span><span class="hljs-number" style="color: teal;">.0</span><span class="hljs-number" style="color: teal;">.105</span>:<span class="hljs-number" style="color: teal;">8501</span></pre>



<p>Copy the URL and paste it into your browser, and the app is online (locally).&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-9.png?ssl=1" alt="Bird image classifier" class="wp-image-69924"/><figcaption class="wp-element-caption"><em>Copied URL</em></figcaption></figure>
</div>


<p>Upload the image for classification.&nbsp;</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-10.png?ssl=1" alt="Uploaded image" class="wp-image-69925"/><figcaption class="wp-element-caption"><em>Uploaded image</em></figcaption></figure>



<p>With the ViT model trained and the app ready, our directory structure should look something like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">.
├── README.md
├── metadata
│   ├── Abbott's_babbler_(Malacocincla_abbotti).jpg
│   ├── classes.txt
│   ├── models
│   │   └── model.pth
│   └── results
│       ├── accuracy_loss.png
│       ├── attn.png
│       └── confusion_matrix.png
├── notebooks
│   ├── ViT.ipynb
│   └── __init__.py
└── source
    ├── __init__.py
    ├── app.py
    ├── attention.py
    ├── attention_block.py
    ├── attention_viz.py
    ├── config.py
    ├── embeddings.py
    ├── linear.py
    ├── metrics.py
    ├── preprocessing.py
    ├── test.py
    ├── train.py
    ├── transformer.py
</pre>



<p>Now we proceed toward deploying the app.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-code-formatting">MLOps pipeline for image classification: code formatting</h2>



<p>First, let&#8217;s format our Python scripts. For that, we will use Black, a Python code formatter. All you need to do is pip install black and then run <strong><em>`black`</em></strong> followed by the name of a Python module or even a whole directory. For this project, I ran black on the source directory, which contains all the Python modules.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">ViT-Pytorch git:(main) black source
Skipping .ipynb files <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> Jupyter dependencies are <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> installed.
You can fix this by running ``pip install black[jupyter]``
reformatted source/config.py
reformatted source/embeddings.py
reformatted source/attention_block.py
reformatted source/linear.py
reformatted source/app.py
reformatted source/attention_viz.py
reformatted source/attention.py
reformatted source/preprocessing.py
reformatted source/test.py
reformatted source/metrics.py
reformatted source/transformer.py
reformatted source/train.py
</pre>



<p>The advantage of using black is that it removes unnecessary spaces, replaces single quotes with double quotes, and makes reviewing code faster and more efficient.&nbsp;</p>



<p>Below are before-and-after images of code formatted with <strong>black</strong>.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://neptune.ai/mlops-pipeline-computer-vision-11"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-11.png?ssl=1" alt="Examples before and after using black to format the code" class="wp-image-69926"/></a><figcaption class="wp-element-caption"><em>Examples before and after using black to format the code </em></figcaption></figure>
</div>


<p>As you can see, unnecessary spaces have been removed.&nbsp;</p>
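<p>As a quick, hypothetical illustration of the kind of rewrites black performs (this snippet is not taken from the project's source), compare the two versions of the same function below. Black only changes layout, never behavior:</p>

```python
# Before formatting: inconsistent spacing and single quotes
# (the style black rewrites).
def area_before(w,h ):
    label='area'
    return { label : w*h }

# After running `black`: normalized spacing and double quotes.
def area_after(w, h):
    label = "area"
    return {label: w * h}

# Formatting is purely cosmetic: both versions compute the same result.
assert area_before(3, 4) == area_after(3, 4) == {"area": 12}
```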



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-setting-up-ci-cd">MLOps pipeline for image classification: setting up CI/CD&nbsp;</h2>



<p>For our CI/CD process, we will be using <strong>GitHub Actions</strong> and <strong>Google Cloud Build</strong> to integrate and deploy our Streamlit app. The following steps will help you create a full MLOps pipeline.&nbsp;</p>



<h4 class="wp-block-heading">Creating the Github Repository</h4>



<p>The first step is to create the GitHub repository. But before that, we must create three important files:</p>



<div id="case-study-numbered-list-block_92abe7f008b62bc2583ed9c876cbe209"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                requirements.txt<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                makefile<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                main.yml            </li>
            </ul>
</div>



<h4 class="wp-block-heading">requirements.txt</h4>



<p>The requirements.txt file must contain all the libraries that the model is using. There are two ways in which you can create a requirements.txt file.&nbsp;</p>



<ol class="wp-block-list">
<li>If you have a dedicated working environment created specifically for this project, you can run pip freeze &gt; requirements.txt, and it will create the requirements.txt file for you.&nbsp;</li>



<li>If you have a general working environment, then you can run pip freeze and copy-paste the libraries that you have been working on.</li>
</ol>



<p>The requirements.txt file for this project looks like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">numpy==<span class="hljs-number" style="color: teal;">1.22</span><span class="hljs-number" style="color: teal;">.3</span>
torch==<span class="hljs-number" style="color: teal;">1.12</span><span class="hljs-number" style="color: teal;">.0</span>
torchvision==<span class="hljs-number" style="color: teal;">0.12</span><span class="hljs-number" style="color: teal;">.0</span>
tqdm==<span class="hljs-number" style="color: teal;">4.64</span><span class="hljs-number" style="color: teal;">.0</span>
opencv-python==<span class="hljs-number" style="color: teal;">4.6</span><span class="hljs-number" style="color: teal;">.0</span><span class="hljs-number" style="color: teal;">.66</span>
streamlit==<span class="hljs-number" style="color: teal;">1.10</span><span class="hljs-number" style="color: teal;">.0</span>
neptune-client==<span class="hljs-number" style="color: teal;">0.16</span><span class="hljs-number" style="color: teal;">.3</span>
</pre>



<p><strong>Note:</strong> Always pin the version of each library so that, in the future, the app remains stable and performs optimally.&nbsp;</p>
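<p>Pinned versions can also be verified at runtime. The sketch below (an assumption for illustration, not part of the project's codebase) compares installed package versions against the pins using only the standard library:</p>

```python
from importlib.metadata import version, PackageNotFoundError


def check_pins(pins):
    """Return {package: installed_version} for every pin that is not satisfied.

    An installed_version of None means the package is missing entirely.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# Hypothetical pins mirroring requirements.txt above.
pins = {"numpy": "1.22.3", "streamlit": "1.10.0"}
print(check_pins(pins))  # a non-empty dict flags drift from requirements.txt
```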



<h4 class="wp-block-heading">Makefile</h4>



<p>In a nutshell, a Makefile is a file of commands that automates the whole process of installing libraries and dependencies, running Python scripts, and so on. A typical Makefile looks something like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Makefile</span>
setup:
   python3 -m venv ~/.visiontransformer
   source ~/.visiontransformer/bin/activate
   cd .visiontransformer
install:
   pip install --upgrade pip &amp;&amp;
       pip install -r requirements.txt
run:
   python source/test.py
all: install run</pre>



<p>For this project, our Makefile will have three processes:</p>



<div id="case-study-numbered-list-block_1297cdb696c95a544faafbfd77b39fe1"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Set up the virtual environment and activate it.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Install all the Python libraries.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Run a test file.             </li>
            </ul>
</div>



<p>Essentially, every time we make a new commit, the Makefile is executed, which automatically runs the test.py module, generating the latest performance metrics and updating the README.md file.</p>
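<p>A minimal sketch of how test.py could rewrite the metrics section of README.md on each run (the marker comments and metric names here are assumptions for illustration, not the project's actual implementation):</p>

```python
import tempfile
from pathlib import Path

# Assumed HTML-comment markers delimiting the metrics block in README.md.
START, END = "<!-- METRICS:START -->", "<!-- METRICS:END -->"


def update_readme(readme: Path, metrics: dict) -> None:
    """Replace everything between the markers with a fresh metrics table."""
    rows = [f"| {name} | {value:.3f} |" for name, value in metrics.items()]
    block = "\n".join([START, "| metric | value |", "| --- | --- |", *rows, END])
    head, _, rest = readme.read_text().partition(START)
    _, _, tail = rest.partition(END)
    readme.write_text(head + block + tail)


# Demo on a throwaway README with empty markers.
readme = Path(tempfile.mkdtemp()) / "README.md"
readme.write_text(f"# ViT-Pytorch\n\n{START}\n{END}\n")
update_readme(readme, {"accuracy": 0.912, "f1": 0.884})
```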



<p>But the Makefile will only run if we create an action trigger, so let’s create one.</p>



<h4 class="wp-block-heading">Action trigger: .github/workflows/main.yml</h4>



<p>To create an action trigger, we need to create the .github/workflows directory, followed by a <strong>main.yml</strong> file inside it. The main.yml file will essentially trigger the workflow whenever the repo is updated.&nbsp;</p>



<p>Our aim is to continuously integrate any changes made in the existing build, like updating parameters, model architecture, or even the UI/UX. Once the change is detected, it will automatically update the README.md file. The main.yml for this project is designed to trigger the workflow on any push or pull request but only for the main branch.</p>



<p>At each new commit, the file will spin up the ubuntu-latest environment, install the specified Python version, and then execute the commands from the Makefile.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/Makefile" target="_blank" rel="noreferrer noopener nofollow">main.yml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#main.yml</span>
name: Continuous Integration <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> Github Actions

on:
 push:
   branches: [ main ]
 pull_request:
   branches: [ main ]

jobs:
 build:
   runs-on: ubuntu-latest
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Steps represent a sequence of tasks that will be executed as part of the job</span>
   steps:
     - uses: actions/checkout@v2
     - name: Set up Python <span class="hljs-number" style="color: teal;">3.8</span>
       uses: actions/setup-python@v1
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span>:
         python-version: <span class="hljs-number" style="color: teal;">3.8</span>
     - name: Install dependencies
       run: |
         make install
         make run
</pre>



<h4 class="wp-block-heading">Testing</h4>



<p>After the files are created, you can push the entire codebase to GitHub. Once uploaded, you can click on the Actions tab and see the build in progress for yourself.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-12.png?ssl=1" alt="Testing" class="wp-image-69927"/><figcaption class="wp-element-caption"><em>Build in progress in the Actions tab</em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Deployment: Google Cloud Build</h4>



<p>After the testing is done and all the logs and results are updated in the Github README.md file, we can move to the next step, which is to integrate the app into the cloud.&nbsp;</p>



<ol class="wp-block-list">
<li>First, we will visit: <a href="https://console.cloud.google.com/" target="_blank" rel="noreferrer noopener nofollow">https://console.cloud.google.com/</a>, and then we will create a new project in the dashboard and name it Vision Transformer Pytorch.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-13.png?ssl=1" alt="Creating a new project" class="wp-image-69928"/><figcaption class="wp-element-caption"><em>Creating a new project</em></figcaption></figure>
</div>


<p>Once the project is created, you can navigate into the project, and it will look something like this:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-14.png?ssl=1" alt="The project" class="wp-image-69929"/><figcaption class="wp-element-caption"><em>The project</em></figcaption></figure>
</div>


<p>As you can see, Google Cloud offers us various services right out of the box, like virtual machines, BigQuery, and GKE (Google Kubernetes Engine), on the project home page. But before we create anything in Cloud Build, we must enable the Kubernetes cluster and create certain directories and their respective files in the project directory.</p>



<ol start="2" class="wp-block-list">
<li><strong>Kubernetes</strong></li>
</ol>



<p>Let’s set up our Kubernetes cluster before we create any files. To do that, we can search for GKE in the Google Cloud console search bar and enable the API.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-15.png?ssl=1" alt="Setting up Kubernetes cluster" class="wp-image-69930"/><figcaption class="wp-element-caption"><em>Setting up Kubernetes cluster</em></figcaption></figure>
</div>


<p>Once the API is enabled, we will be navigated to the following page.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-15b.png?ssl=1" alt="Kubernetes cluster " class="wp-image-69931"/><figcaption class="wp-element-caption"><em>Kubernetes cluster</em></figcaption></figure>
</div>


<p>But instead of creating the clusters manually, we will create them using the built-in Cloud Shell. To do that, click on the terminal button at the top right, as shown in the images below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-16.png?ssl=1" alt="Cloud shell" class="wp-image-69932"/><figcaption class="wp-element-caption"><em>Activating Cloud Shell</em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-17.png?ssl=1" alt="Creating a cluster using the built-in Cloud Shell" class="wp-image-69933"/><figcaption class="wp-element-caption"><em>Creating a cluster using the built-in Cloud Shell</em></figcaption></figure>
</div>


<p>After activating the cloud shell, we can type the following command to create Kubernetes clusters:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">gcloud container clusters create project-kube --zone <span class="hljs-string" style="color: rgb(221, 17, 68);">"us-west1-b"</span> --machine-type <span class="hljs-string" style="color: rgb(221, 17, 68);">"n1-standard-1"</span> --num-nodes <span class="hljs-string" style="color: rgb(221, 17, 68);">"1"</span></pre>



<p>This usually takes up to 5 minutes.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-18.png?ssl=1" alt="Creating Kubernetes clusters" class="wp-image-69934"/><figcaption class="wp-element-caption"><em>Creating Kubernetes clusters</em></figcaption></figure>
</div>


<p>After it is completed, it will look something like this:&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-19.png?ssl=1" alt="Kubernetes clustering completed" class="wp-image-69935"/><figcaption class="wp-element-caption"><em>Kubernetes clustering completed</em></figcaption></figure>
</div>


<p>Now let’s set up the two files that will configure the Kubernetes clusters: deployment.yml and service.yml.&nbsp;</p>



<p>The deployment.yml file allows us to deploy the model in the cloud. The deployment strategy can be canary, recreate, blue-green, or any other, depending upon the requirement. In this example, we will simply overwrite the existing deployment. This file also helps in scaling the model efficiently via the <strong>replicas</strong> argument. Here is an example of a deployment.yml file.</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/kubernetes/deployment.yml" target="_blank" rel="noreferrer noopener nofollow">deployment.yml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#deployment.yml</span>

apiVersion: apps/v1
kind: Deployment
metadata:
 name: imgclass
spec:
 replicas: <span class="hljs-number" style="color: teal;">1</span>
 selector:
   matchLabels:
     app: imageclassifier
 template:
   metadata:
     labels:
       app: imageclassifier
   spec:
     containers:
     - name: cv-app
       image: gcr.io/vision-transformer-pytorch/vit:v1
       ports:
       - containerPort: <span class="hljs-number" style="color: teal;">8501</span></pre>



<p>The next file is the service.yml file. It essentially exposes the app in the container to the outside world. Notice that the <em>containerPort</em> argument is specified as 8501; we will use the same number for the <em>targetPort</em> argument in our service.yml. This is the port Streamlit serves the application on. Apart from that, the <em>app</em> label is the same in both files.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/kubernetes/service.yml" target="_blank" rel="noreferrer noopener nofollow">service.yml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#service.yml</span>

apiVersion: v1
kind: Service
metadata:
 name: imageclassifier
spec:
 type: LoadBalancer
 selector:
   app: imageclassifier
 ports:
 - port: <span class="hljs-number" style="color: teal;">80</span>
   targetPort: <span class="hljs-number" style="color: teal;">8501</span></pre>



<p><strong>Note</strong>: Always make sure that the name of the app and the version are in lowercase.&nbsp;</p>



<ol start="3" class="wp-block-list">
<li><strong>Dockerfile</strong></li>
</ol>



<p>Now let’s configure the Dockerfile. This file will create a Docker container that will host our Streamlit app. Docker is essential here since it wraps the app in a reproducible environment that is easy to scale. A typical Dockerfile looks like this:</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/Dockerfile" target="_blank" rel="noreferrer noopener nofollow">Dockerfile</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">FROM python:<span class="hljs-number" style="color: teal;">3.8</span><span class="hljs-number" style="color: teal;">.2</span>-slim-buster

RUN apt-get update

ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

RUN ls -la $APP_HOME/

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Install dependencies</span>
RUN pip install -r requirements.txt

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Run the streamlit on container startup</span>
CMD [ <span class="hljs-string" style="color: rgb(221, 17, 68);">"streamlit"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"run"</span>,<span class="hljs-string" style="color: rgb(221, 17, 68);">"app.py"</span> ]
</pre>



<p>Dockerfile contains a series of commands that:</p>



<ul class="wp-block-list">
<li>Pull the Python base image.&nbsp;</li>



<li>Copy the local code to the container image.</li>



<li>Install all the libraries.</li>



<li>Run the Streamlit app on container startup.&nbsp;</li>
</ul>



<p>Note that we are using Python 3.8, as some of the dependencies require a recent Python version.</p>



<ol start="4" class="wp-block-list">
<li><strong>cloudbuild.yaml</strong></li>
</ol>



<p>In Google Cloud Build, the cloudbuild.yaml file stitches all the artifacts together to create a seamless pipeline. It has three primary steps:</p>



<ul class="wp-block-list">
<li>Build a Docker container using the Dockerfile from the current directory.&nbsp;</li>



<li>Push the container to the Google Container Registry.</li>



<li>Deploy the container in the Kubernetes engine.&nbsp;</li>
</ul>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/cloudbuild.yaml" target="_blank" rel="noreferrer noopener nofollow">cloudbuild.yaml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">steps:
- name: <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/cloud-builders/docker'</span>
 args: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'build'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'-t'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/vision-transformer-pytorch/vit:v1'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'.'</span>]
 timeout: <span class="hljs-number" style="color: teal;">180</span>s
- name: <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/cloud-builders/docker'</span>
 args: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'push'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/vision-transformer-pytorch/vit:v1'</span>]
- name: <span class="hljs-string" style="color: rgb(221, 17, 68);">"gcr.io/cloud-builders/gke-deploy"</span>
 args:
 - run
 - --filename=kubernetes/ <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#this argument connects the files in kubernetes directory</span>
 - --location=us-west1-b
 - --cluster=project-kube</pre>



<p><strong>Note</strong>: Please cross-check arguments like the container name across the deployment.yml and cloudbuild.yaml files. Along with that, also cross-check that the cluster name you created earlier matches the cluster name in the cloudbuild.yaml file. Lastly, make sure that the <em>filename</em> argument points to the kubernetes directory where deployment.yml and service.yml are present.&nbsp;&nbsp;</p>



<p>After creating these files, the file structure of the entire project should look like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">.
├── Dockerfile
├── .github/workflows/main.yml
├── Makefile
├── README.md
├── cloudbuild.yaml
├── kubernetes
│   ├── deployment.yml
│   └── service.yml
├── metadata
│   ├── Abbott's_babbler_(Malacocincla_abbotti).jpg
│   ├── classes.txt
│   ├── models
│   │   └── model.pth
│   └── results
│       ├── accuracy_loss.png
│       ├── attn.png
│       └── confusion_matrix.png
├── notebooks
│   ├── ViT.ipynb
│   └── __init__.py
├── requirements.txt
└── source
    ├── __init__.py
    ├── app.py
    ├── attention.py
    ├── attention_block.py
    ├── attention_viz.py
    ├── config.py
    ├── embeddings.py
    ├── linear.py
    ├── metrics.py
    ├── preprocessing.py
    ├── test.py
    ├── train.py
    ├── transformer.py
    └── vit-pytorch.ipynb
</pre>



<ol start="5" class="wp-block-list">
<li><strong>Cloning and testing</strong></li>
</ol>



<p>Now let’s clone the GitHub repo in our Google Cloud project, cd into it, and run the cloudbuild.yaml file. Use the following commands:</p>



<ul class="wp-block-list">
<li><em>git clone </em><a href="https://github.com/Nielspace/ViT-Pytorch.git" target="_blank" rel="noreferrer noopener nofollow"><em>https://github.com/Nielspace/ViT-Pytorch.git</em></a></li>



<li><em>cd ViT-Pytorch</em></li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-20.png?ssl=1" alt="clone the GitHub repo" class="wp-image-69936"/><figcaption class="wp-element-caption"><em>Cloning the GitHub repo</em></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><em>gcloud builds submit --config cloudbuild.yaml</em></li>
</ul>



<p>The deployment process will look something like this:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-21.png?ssl=1" alt="The deployment process" class="wp-image-69937"/><figcaption class="wp-element-caption"><em>The deployment process </em></figcaption></figure>
</div>


<ol start="6" class="wp-block-list">
<li>The deployment takes around 10 minutes, depending on various factors. If everything executes properly, you will see that the steps are marked with green ticks.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-22.png?ssl=1" alt="Successful deployment" class="wp-image-69938"/><figcaption class="wp-element-caption"><em>Successful deployment</em></figcaption></figure>
</div>


<ol start="7" class="wp-block-list">
<li>Once the deployment is successful, you can find the endpoints of the app in the Services &amp; Ingress tab in the Kubernetes Engine. Click on the endpoints, and it will navigate you to the Streamlit app.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-23.png?ssl=1" alt="The endpoints " class="wp-image-69939"/><figcaption class="wp-element-caption"><em>The endpoints </em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-24.png?ssl=1" alt="The Streamlit app" class="wp-image-69940"/><figcaption class="wp-element-caption"><em>The Streamlit app</em></figcaption></figure>
</div>


<p><strong>Additional tips:</strong></p>



<ol class="wp-block-list">
<li>Make sure that you use lowercase for the app name and project ID in all your *.yml config files.</li>



<li>Cross-check the arguments for all *.yml config files.&nbsp;</li>



<li>Since you are copying your repo in a virtual environment, cross-check all the directory and file paths.&nbsp;</li>



<li>In case of an error in the cloud build process, the error message usually suggests a command that will help you resolve it. See the image below for a better understanding; I have highlighted the command that needs to be executed before re-running the cloud build command.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-25.png?ssl=1" alt="an error in the cloud build process" class="wp-image-69941"/><figcaption class="wp-element-caption"><em>An error in the cloud build process</em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Cloud Build integration</h4>



<p>Now we will integrate Google Cloud Build into the GitHub repo. This will create a trigger that updates the build whenever a change is made in the repo.&nbsp;</p>



<ol class="wp-block-list">
<li>Search Google Cloud Build in the Marketplace</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-26.png?ssl=1" alt="Searching for Google Cloud Build" class="wp-image-69942"/><figcaption class="wp-element-caption"><em>Searching for Google Cloud Build</em></figcaption></figure>
</div>


<ol start="2" class="wp-block-list">
<li>Select the repo that you want to connect, in this case ViT-Pytorch, and save it.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-27.png?ssl=1" alt="Selecting the repo" class="wp-image-69943"/><figcaption class="wp-element-caption"><em>Selecting the repo</em></figcaption></figure>
</div>


<ol start="3" class="wp-block-list">
<li>In Google Cloud Build, we will go to the Cloud build page and click on the Triggers tab to create triggers.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-28.png?ssl=1" alt="creating triggers" class="wp-image-69944"/><figcaption class="wp-element-caption"><em>Creating triggers</em></figcaption></figure>
</div>


<ol start="4" class="wp-block-list">
<li>After clicking on create trigger, we will be navigated to the page below. There we will enter the trigger name, select the event that will trigger the cloudbuild.yaml file, and select the project repository.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-29.png?ssl=1" alt="Trigger settings " class="wp-image-69945"/><figcaption class="wp-element-caption"><em>Trigger settings</em></figcaption></figure>
</div>


<ol start="5" class="wp-block-list">
<li>Follow the authentication process.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-30.png?ssl=1" alt="authentication process" class="wp-image-69946"/><figcaption class="wp-element-caption"><em>Authentication process</em></figcaption></figure>
</div>


<ol start="6" class="wp-block-list">
<li>Connect the repository.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-31.png?ssl=1" alt="Connecting the repository" class="wp-image-69947"/><figcaption class="wp-element-caption"><em>Connecting the repository</em></figcaption></figure>
</div>


<ol start="7" class="wp-block-list">
<li>Finally, create the trigger.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-32.png?ssl=1" alt="creating the trigger" class="wp-image-69948"/><figcaption class="wp-element-caption"><em>Creating the trigger</em></figcaption></figure>
</div>


<p>Now that the trigger is created, all the changes that you make in the Github repo will be automatically detected, and the deployment will be updated.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-33.png?ssl=1" alt="Created trigger" class="wp-image-69949"/><figcaption class="wp-element-caption"><em>Created trigger </em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Monitoring the model-decay</h4>



<p>Over time, the model will decay, which will affect its prediction capabilities, so we need to monitor its performance on a regular basis. One way to do that is to occasionally test the model on new data and evaluate it on the metrics I mentioned earlier, like the F1 score, accuracy, and precision.&nbsp;</p>



<p>Another interesting way to monitor the model’s performance is the AUROC metric, which measures the discriminative performance of the model. Because this is a multiclass classification project, you can convert it into a set of binary classification problems (one class versus the rest) and then check the model’s performance. If the performance has decayed, the model must be retrained with new and larger samples, and, if really required, the architecture modified as well.&nbsp;</p>



<p><a href="https://gist.github.com/khizirsiddiqui/559a91dab223944fb83f8480715d2582" target="_blank" rel="noreferrer noopener nofollow">Here</a> is the link to the code, which will allow you to measure the AUROC score.</p>
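<p>For intuition, the AUROC can also be computed directly from raw scores with the rank (Mann-Whitney) formulation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below uses made-up labels and scores, not outputs of the ViT model:</p>

```python
def binary_auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly
    (ties count as half a correct ranking)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# One-vs-rest view of the multiclass problem: score one bird class
# against all the others.
labels = [0, 0, 1, 1]           # made-up binary ground truth for one class
scores = [0.1, 0.4, 0.35, 0.8]  # made-up model scores for that class
print(binary_auroc(labels, scores))  # 0.75
```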



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>In this article, we learned to build an image classifier app with a Vision Transformer using PyTorch and Streamlit. We also saw how we can deploy the app on the Google Cloud Platform using GitHub Actions and technologies like Kubernetes, Dockerfile, and Makefile.&nbsp;</p>



<p>Important takeaways from this project:</p>



<ol class="wp-block-list">
<li>Bigger data requires a larger model, which essentially requires training for more epochs.&nbsp;</li>



<li>When creating a prototyping experiment, reduce the number of classes and test whether the accuracy increases with each epoch. Try different configurations till you are confident that the model’s performance is increasing before using GPUs on cloud services like Kaggle or Colab.&nbsp;</li>



<li>Use various performance metrics like the confusion matrix, precision, recall, F1, and AUROC.&nbsp;</li>



<li>Once the model is deployed, it can be monitored occasionally rather than continuously.&nbsp;</li>



<li>For monitoring, a metric like the AUROC score is a good choice since it automatically sweeps over threshold values and graphs the model’s true positive rate against its false positive rate. With the AUROC score, the model’s previous and current performance can be easily compared.&nbsp;</li>



<li>Re-training the model should be done only when the model has drifted significantly. Since a model like this requires a lot of computational resources, frequent retraining can be expensive.</li>
</ol>



<p>I hope you found this article informative and practical. You can find the entire code in this <a href="https://github.com/Nielspace/ViT-Pytorch" target="_blank" rel="noreferrer noopener nofollow">GitHub repo</a>. Feel free to share it with others as well.&nbsp;</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://arxiv.org/pdf/2010.11929.pdf" target="_blank" rel="noreferrer noopener nofollow">An Image Is Worth 16&#215;16 Words: Transformers For Image Recognition At Scale</a></li>



<li><a href="https://arxiv.org/abs/2111.05464" target="_blank" rel="noreferrer noopener nofollow">Are Transformers More Robust Than CNNs?</a></li>



<li><a href="https://www.kdnuggets.com/2022/01/machine-learning-models-die-silence.html" target="_blank" rel="noreferrer noopener nofollow">https://www.kdnuggets.com/2022/01/machine-learning-models-die-silence.html</a></li>



<li><a href="https://github.com/jeonsworld/ViT-pytorch" target="_blank" rel="noreferrer noopener nofollow">https://github.com/jeonsworld/ViT-pytorch</a>&nbsp;</li>



<li>​​<a href="https://gist.github.com/khizirsiddiqui/559a91dab223944fb83f8480715d2582" target="_blank" rel="noreferrer noopener nofollow">https://gist.github.com/khizirsiddiqui/559a91dab223944fb83f8480715d2582</a></li>



<li><a href="https://github.com/srivatsan88/ContinousModelDeploy" target="_blank" rel="noreferrer noopener nofollow">https://github.com/srivatsan88/ContinousModelDeploy</a>&nbsp;</li>



<li><a href="https://neptune.ai/blog/mlops-pipeline-for-nlp-machine-translation" target="_blank" rel="noreferrer noopener nofollow">Building MLOps Pipeline for NLP: Machine Translation Task</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7085</post-id>	</item>
		<item>
		<title>Distributed Training: Frameworks and Tools</title>
		<link>https://neptune.ai/blog/distributed-training-frameworks-and-tools</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 11:20:50 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.test/distributed-training-frameworks-and-tools/</guid>

					<description><![CDATA[Recent developments in deep learning have led to some fascinating state-of-the-art results especially in the areas like natural language processing and computer vision. A couple of the reasons for the success usually comes from the availability of a huge amount of data and the increasing size of deep learning (DL) models. These algorithms are capable&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Recent developments in deep learning have led to some fascinating state-of-the-art results, especially in areas like <a href="/blog/category/natural-language-processing" target="_blank" rel="noreferrer noopener">natural language processing</a> and <a href="/blog/category/computer-vision">computer vision</a>. Much of this success comes from the availability of huge amounts of data and the <a href="https://towardsdatascience.com/review-of-recent-advances-in-dealing-with-data-size-challenges-in-deep-learning-ac5c1844af73" target="_blank" rel="noreferrer noopener nofollow">increasing size of deep learning (DL) models</a>. These algorithms are capable of extracting meaningful patterns and deriving correlations between the input and the output. But it is also true that developing and training these complex algorithms can take days and sometimes even weeks.</p>



<p>To manage this problem, a fast and efficient approach to designing and developing new models is needed. Training these models on a single GPU creates an information bottleneck, so the work has to be spread across multiple GPUs. This is where the idea of <strong>distributed training </strong>comes into the picture.</p>



<p>In this article, we&#8217;ll look into some of <strong>the best frameworks and tools for distributed training</strong>. But before that, let&#8217;s have a quick overview of distributed training itself. </p>



<h2 class="wp-block-heading" id="h-distributed-training">Distributed training</h2>



<p>DL training usually relies on scalability, which simply means the ability of a DL algorithm to learn from or deal with any amount of data. Essentially, the scalability of any DL algorithm depends on three factors:</p>



<div id="case-study-numbered-list-block_e206ac8011b1dea9114e2d724887b5f9"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Size and the complexity of the deep learning model            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Amount of training data            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Availability of infrastructure which includes hardware like GPUs and storage units, and smooth integration between these devices            </li>
            </ul>
</div>



<p><a href="/blog/distributed-training" target="_blank" rel="noreferrer noopener">Distributed training</a> satisfies all three elements. It takes care of the model size and complexity, handles training data in batches, and it splits and distributes the training process among multiple processors called nodes. More importantly, it reduces the training time significantly making iteration time shorter and thus making experiments and deployment quicker.</p>



<p>Distributed training is of two types:</p>



<div id="case-study-numbered-list-block_c5e9d946dcfdc95993c47579ad1a27f2"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data-parallel training<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Model-parallel training            </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-training-model-data-parallelism.png?ssl=1" alt="Distributed training model parallelism vs data parallelism " class="wp-image-61294"/><figcaption class="wp-element-caption"><em>Distributed training model parallelism vs data parallelism | <a href="https://towardsdatascience.com/deep-learning-on-supercomputers-96319056c61f" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>In data-parallel training, the data is divided into subsets based upon the number of nodes available for training. And the same model architecture is shared in all the available nodes. During the training process, all the nodes must communicate with each other to ensure that the training at each node is synced with each other. It is the most efficient way of training the model and the most common practice.</p>



<p>In model-parallel training, the DL model itself is split into segments based on the number of nodes available, each segment is assigned to a different node, and each node is fed the same data. This type of training is possible if the DL model has independent components that can be trained individually. Keep in mind that the nodes must stay in sync with regard to the shared weights and biases at the boundaries between the segments of the model.&nbsp;</p>



<p>Of the two types, data parallelism is the more commonly used, and as we go through the frameworks for distributed training, you will find that all of them offer data parallelism, whereas model parallelism is not always available.&nbsp;</p>
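<p>To make the data-parallel idea concrete, here is a small pure-Python sketch of the sharding step: the dataset is split into near-equal pieces, one per node, while every node holds the same model. The function name, node count, and sample data are invented for illustration.</p>

```python
def shard_dataset(samples, num_nodes):
    """Split a dataset into near-equal contiguous shards, one per node.

    Every node trains an identical copy of the model on its own shard;
    gradients are then synchronized across nodes after each step.
    """
    shard_size, remainder = divmod(len(samples), num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        # The first `remainder` nodes take one extra sample each.
        end = start + shard_size + (1 if node < remainder else 0)
        shards.append(samples[start:end])
        start = end
    return shards

# 10 samples over 4 nodes → shard sizes 3, 3, 2, 2
shards = shard_dataset(list(range(10)), num_nodes=4)
print([len(s) for s in shards])  # → [3, 3, 2, 2]
```

<p>Frameworks automate this step (e.g. with distributed samplers), but the contract is the same: no sample is dropped and no two nodes see the same shard.</p>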



<h2 class="wp-block-heading" id="h-criteria-for-choosing-the-right-framework-for-distributed-training">Criteria for choosing the right framework for distributed training</h2>



<p>Before we dive into the frameworks there are some points that one should consider while choosing the right framework and tools:</p>



<ol class="wp-block-list">
<li><strong>Computational graph type: </strong>The deep learning community is largely divided into two factions: one that uses PyTorch and dynamic computational graphs, and the other that uses TensorFlow and static computational graphs. Hence, it is no surprise that most distributed frameworks are built on top of these two libraries. So if you prefer one over the other, half of your decision is already made.</li>



<li><strong>Cost of training</strong>: Affordability is a critical concern when you are dealing with distributed computing, e.g. a project involving the training of BigGAN can require a number of GPUs and the cost could scale up proportionally as this number increases. Hence, a tool with moderate pricing is always the right choice.</li>



<li><strong>Type of training</strong>: Depending on your training requirements, i.e., data parallelism or model parallelism, you may prefer one tool over another. </li>



<li><strong>Efficiency</strong>: This basically refers to the number of lines of code you need to write to enable distributed training; the fewer, the better.</li>



<li><strong>Flexibility</strong>: Can the framework of your choice be used across different platforms? Especially when you need to train on-premise or on cloud platforms.</li>
</ol>



<section id="blog-intext-cta-block_f1a9708072a6991122a282e13817315f" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><a href="/blog/distributed-training" target="_blank" rel="noopener"><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ Distributed Training: Guide for Data Scientists</a><br />
<a href="/blog/distributed-training-errors" target="_blank" rel="noopener"><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ Distributed Training: Errors to Avoid</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-frameworks-for-distributed-training">Frameworks for distributed training</h2>



<p>Now, let’s discuss some of the libraries that offer distributed training.&nbsp;</p>



<section id="blog-intext-cta-block_519e65445749da8fa6f9129dc357552e" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-may-be-useful">May be useful</h3>
    
            <p>  In Neptune, you can <a href="https://docs.neptune.ai/how-to-guides/neptune-api/distributed-computing" target="_blank" rel="noopener">track data of your run from many processes</a>, in particular running on different machines.</p>
    
    </section>



<h3 class="wp-block-heading" id="1-pytorch">1. PyTorch</h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_12.png?resize=469%2C94&#038;ssl=1" alt="Distributed training: PyTorch" class="wp-image-61137" width="469" height="94"/><figcaption class="wp-element-caption"><em>Distributed training: PyTorch | <a href="https://github.com/pytorch/pytorch" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>PyTorch is one of the most popular deep learning frameworks developed by Facebook. It is one of the most flexible and easy-to-learn frameworks. PyTorch allows you to create and implement neural network modules very effectively and with its distributed training modules you can easily implement parallel training with a few lines of code.&nbsp;&nbsp;</p>



<p>PyTorch offers a number of ways in which you can perform distributed training:</p>



<ol class="wp-block-list">
<li><a href="https://pytorch.org/docs/stable/nn.html#dataparallel" target="_blank" rel="noreferrer noopener nofollow"><strong>nn.DataParallel</strong></a><strong>:</strong> This package allows you to perform parallel training on a single machine with multiple GPUs. One advantage is that it requires minimal code changes.</li>



<li><a href="https://pytorch.org/docs/stable/nn.html#distributeddataparallel" target="_blank" rel="noreferrer noopener nofollow"><strong>nn.DistributedDataParallel</strong></a>: This package allows you to perform parallel training across multiple GPUs on multiple machines. It requires a few extra steps to configure the training process.&nbsp;&nbsp;</li>



<li><a href="https://pytorch.org/docs/stable/rpc.html" target="_blank" rel="noreferrer noopener nofollow"><strong>torch.distributed.rpc</strong></a><strong>: </strong>This package allows you to implement a model-parallel strategy. It is very useful if your model is large and does not fit on a single GPU.</li>
</ol>
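<p>A minimal sketch of the first of these, the single-machine <strong>nn.DataParallel</strong> path (the toy model and tensor sizes below are arbitrary; when fewer than two GPUs are visible, the code simply runs the unwrapped module):</p>

```python
import torch
import torch.nn as nn

# A toy model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Wrap for data-parallel training: each forward pass splits the batch
# across the visible GPUs and gathers the outputs back.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

batch = torch.randn(8, 16)  # batch of 8 samples, 16 features each
out = model(batch)
print(out.shape)            # → torch.Size([8, 4])
```

<p>Note that for multi-machine training you would switch to nn.DistributedDataParallel, which additionally requires initializing a process group and launching one process per GPU.</p>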



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>It is easy to implement.</li>



<li>PyTorch is very user-friendly.</li>



<li>Offers data-parallelism and model-parallelism methods out-of-the-box.</li>



<li>The majority of the cloud computing platforms support PyTorch.</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-pytorch">When to use PyTorch?</h4>



<p>You should opt for PyTorch when:</p>



<ul class="wp-block-list">
<li>You have a huge amount of data because data parallelism is easy to implement.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/pytorch/pytorch" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://pytorch.org/tutorials/beginner/dist_overview.html" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li>Data-parallelism: <a href="https://github.com/pytorch/tutorials" target="_blank" rel="noreferrer noopener nofollow">tutorial</a></li>



<li>Model-parallelism: <a href="https://pytorch.org/docs/stable/rpc.html#tutorials" target="_blank" rel="noreferrer noopener nofollow">tutorial</a></li>
</ul>



<section id="blog-intext-cta-block_6ba3b00302b8a8713b9ce427bbbaa616" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-see-also">See also</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/how-to-keep-track-of-experiments-in-pytorch-using-neptune" target="_blank" rel="noopener">How to Keep Track of Experiments in PyTorch</a></p>
    
    </section>



<h3 class="wp-block-heading" id="2-deepspeed">2. <a href="https://www.deepspeed.ai/" target="_blank" rel="noreferrer noopener nofollow">DeepSpeed</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_16.png?resize=405%2C150&#038;ssl=1" alt="Distributed training: DeepSpeed" class="wp-image-61156" width="405" height="150"/><figcaption class="wp-element-caption"><em>Distributed training: DeepSpeed | <a href="https://github.com/microsoft/DeepSpeed" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>PyTorch’s distributed training specializes in data parallelism. DeepSpeed, which is built on top of PyTorch, targets the other aspect, i.e., model parallelism. DeepSpeed was developed by Microsoft and aims to offer distributed training for large-scale models.&nbsp;</p>



<p>DeepSpeed can efficiently tackle memory challenges when training models with trillions of parameters. It reduces memory footprint while maintaining compute and communication efficiency. Interestingly, DeepSpeed offers 3D parallelism through which you can distribute data, model, and pipeline, which basically means that now you can train a model which is large and consumes a huge amount of data, something like a GPT-3 or a Turing NLG.&nbsp;&nbsp;</p>
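<p>In practice, a DeepSpeed run is driven by a JSON configuration passed to deepspeed.initialize. The sketch below shows that config as a Python dict; the specific values (batch size, ZeRO stage) are illustrative assumptions, not recommendations:</p>

```python
# Illustrative DeepSpeed configuration, expressed as the dict you would
# serialize to ds_config.json and pass to deepspeed.initialize(...).
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},           # mixed-precision training
    "zero_optimization": {"stage": 2},   # partition optimizer state + gradients
}

print(ds_config["zero_optimization"]["stage"])  # → 2
```

<p>Raising the ZeRO stage trades communication for memory: higher stages partition more of the training state across GPUs, which is what lets a single GPU hold models it otherwise could not.</p>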



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Model scaling up to trillions of parameters.</li>



<li>Up to 10x faster training.</li>



<li>Democratizes AI: users can run bigger models on a single GPU without running out of memory.</li>



<li>Compressed training allows users to train attention models by reducing the memory required to compute attention operations.&nbsp;</li>



<li>Easy to learn and use.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-deepspeed">When to use DeepSpeed?</h4>



<p>You should opt for DeepSpeed when:</p>



<ul class="wp-block-list">
<li>You want to do data and model parallelism.&nbsp;</li>



<li>If your codebase is based on PyTorch.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/microsoft/DeepSpeed" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://www.deepspeed.ai/" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li><a href="https://github.com/microsoft/DeepSpeed#videos" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="3-distributed-tensorflow">3. <a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noreferrer noopener nofollow">Distributed TensorFlow</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_4.png?resize=450%2C151&#038;ssl=1" alt="Distributed training: TensorFlow" class="wp-image-61145" width="450" height="151"/><figcaption class="wp-element-caption"><em>Distributed training: TensorFlow | <a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>TensorFlow is developed by Google and it supports distributed training. It uses data-parallel techniques for training. You can leverage the distributed training on TensorFlow by using the <strong>tf.distribute</strong> API. This API allows you to configure your training as per your requirements. By default, TensorFlow uses only one GPU but the tf.distribute allows you to use multiple GPUs.</p>



<p>TensorFlow provides three primary types of distributed training strategy:</p>



<ol class="wp-block-list">
<li><strong>tf.distribute.MirroredStrategy()</strong>: This simple strategy allows you to distribute training across multiple GPUs on a single machine. This method is also called Synchronous Data-Parallelism. It is worth noting that each worker node will have its own set of gradients. These gradients are then averaged and used to update the model parameters.</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>tf.distribute.MultiWorkerMirroredStrategy()</strong>: This strategy allows you to distribute training across multiple machines and multiple GPUs on a single machine. All the operations are similar to tf.distribute.MirroredStrategy(). It is also a Synchronous Data-Parallelism method.</li>
</ol>



<ol class="wp-block-list" start="3">
<li><strong>tf.distribute.experimental.ParameterServerStrategy()</strong>: This is an Asynchronous Data-Parallelism method and a common way to scale up model training across multiple machines. In this strategy, the parameters are stored on a parameter server, and the workers are independent of each other. This strategy scales well because the worker nodes do not wait for parameter updates from each other.</li>
</ol>
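<p>The synchronous averaging step that the mirrored strategies perform after each backward pass can be sketched without TensorFlow at all. The function name and the per-replica gradient values below are made up for illustration:</p>

```python
def average_gradients(per_replica_grads):
    """Average one parameter's gradient across replicas, elementwise.

    In a mirrored strategy, every replica computes gradients on its own
    slice of the batch; the averaged result is applied on all replicas
    so their model copies stay identical after each step.
    """
    num_replicas = len(per_replica_grads)
    return [sum(g[i] for g in per_replica_grads) / num_replicas
            for i in range(len(per_replica_grads[0]))]

# Two replicas, one parameter vector of length 3.
grads = [[0.25, -0.5, 1.0],   # gradients from replica 0
         [0.75,  0.0, 3.0]]   # gradients from replica 1
print(average_gradients(grads))  # → [0.5, -0.25, 2.0]
```

<p>The parameter-server strategy drops this synchronization barrier: workers push and pull updates independently, which is why it scales further at the cost of gradient staleness.</p>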



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Huge community support.&nbsp;</li>



<li>Its static programming paradigm enables graph-level optimizations.</li>



<li>Very well integrated with Google Cloud and other cloud-based services.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-distributed-tensorflow">When to use Distributed TensorFlow?</h4>



<p>You should use Distributed TensorFlow:</p>



<ul class="wp-block-list">
<li>If you want to do data parallelism.&nbsp;</li>



<li>If you like the static paradigm of programming compared to dynamic.</li>



<li>If you are in the Google Cloud ecosystem since TensorFlow is very well optimized for TPUs.&nbsp;</li>



<li>Lastly, if you have huge data and need high processing power.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li><a href="https://github.com/tensorflow/docs/blob/master/site/en/guide/distributed_training.ipynb" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="4-mesh-tensorflow">4. <a href="https://github.com/tensorflow/mesh" target="_blank" rel="noreferrer noopener nofollow">Mesh TensorFlow</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_4.png?resize=453%2C152&#038;ssl=1" alt="Distributed training: TensorFlow" class="wp-image-61145" width="453" height="152"/><figcaption class="wp-element-caption"><em>Distributed training: TensorFlow | <a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow"></a><a href="https://github.com/tensorflow/tensorflow">Source</a></em></figcaption></figure>
</div>


<p>Mesh TensorFlow is again an extension of TensorFlow’s distributed training, but it is specifically designed to train large DL models on Tensor Processing Units (TPUs), which are AI accelerators like GPUs, but faster. Although Mesh TensorFlow can execute data parallelism, it aims to solve distributed training for large models whose parameters cannot fit on one device.&nbsp;</p>



<p>Mesh TensorFlow is inspired by the synchronous data-parallel method, i.e., every worker is involved in every operation. Apart from that, all the workers run the same program, and they communicate through collective operations like Allreduce.&nbsp;</p>
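<p>The Allreduce collective mentioned above can be simulated in a few lines of plain Python (the worker values are invented): every worker contributes its local value, and every worker receives the same reduced result.</p>

```python
def allreduce(worker_values, op=sum):
    """Toy Allreduce: reduce all workers' values, broadcast result back.

    Real implementations (e.g. ring-allreduce) achieve this without a
    central coordinator, but the input/output contract is the same:
    after the call, every worker holds the identical reduced value.
    """
    reduced = op(worker_values)
    return [reduced] * len(worker_values)  # one copy per worker

# Four workers, each holding one local partial sum.
print(allreduce([1, 2, 3, 4]))  # → [10, 10, 10, 10]
```

<p>This is the primitive that keeps synchronous workers identical: averaging gradients is just Allreduce followed by a division by the worker count.</p>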



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>It can train large models with millions and billions of parameters, such as GPT-3, GPT-2, and BERT.&nbsp;</li>



<li>Potentially low latency across the workers.&nbsp;</li>



<li>Good TensorFlow community support.&nbsp;</li>



<li>Availability of TPU-pods from Google.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-mesh-tensorflow">When to use Mesh Tensorflow?</h4>



<p>You should use Mesh TensorFlow:</p>



<ul class="wp-block-list">
<li>If you want to do model parallelism.&nbsp;</li>



<li>If you want to develop huge models and practice rapid-prototyping.</li>



<li>If you are especially working in the area of Natural Language Processing with huge data.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/tensorflow/mesh" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>
</ul>



<h3 class="wp-block-heading" id="5-tensorflowonspark">5. <a href="https://github.com/yahoo/TensorFlowOnSpark" target="_blank" rel="noreferrer noopener nofollow">TensorFlowOnSpark</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_4.png?resize=453%2C152&#038;ssl=1" alt="Distributed training: TensorFlow" class="wp-image-61145" width="453" height="152"/><figcaption class="wp-element-caption"><em>Distributed training: TensorFlow | <a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow"></a><a href="https://github.com/tensorflow/tensorflow">Source</a></em></figcaption></figure>
</div>


<p><strong>Apache Spark</strong> is one of the most well-known open-source big data processing platforms. It allows users to do all kinds of data-related work like data engineering, data science, and machine learning. We already know what TensorFlow is. But if you want to use TensorFlow on Apache Spark, then you have to use TensorFlowOnSpark.&nbsp;</p>



<p>TensorFlowOnSpark is a machine learning framework that allows you to perform distributed training on Apache Spark Clusters and Apache Hadoop. It was developed by Yahoo. The framework allows both distributed training and inference with minimum code changes to existing TensorFlow code on the shared grid.&nbsp;</p>



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Allows easy migration to Spark Clusters with existing TensorFlow programs.&nbsp;</li>



<li>Fewer changes in the code.&nbsp;</li>



<li>All TensorFlow functionalities are available.&nbsp;</li>



<li>Datasets can be efficiently pushed and pulled by Spark and TensorFlow respectively.&nbsp;</li>



<li>Cloud development is easy and efficient on CPUs or GPUs.&nbsp;</li>



<li>Training pipelines can be created easily.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-tensorflowonspark">When to use TensorFlowOnSpark?</h4>



<p>You should use TensorflowOnSpark:</p>



<ul class="wp-block-list">
<li>If your workflow is based on Apache Spark or if you prefer Apache Spark.</li>



<li>If your preferred framework is TensorFlow.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/yahoo/TensorFlowOnSpark" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://yahoo.github.io/TensorFlowOnSpark/" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li><a href="https://github.com/yahoo/TensorFlowOnSpark/tree/master/examples" target="_blank" rel="noreferrer noopener nofollow">Examples</a></li>
</ul>



<section id="blog-intext-cta-block_c928fe29e0771c8bd3287c263b99eac7" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-more">Read more</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/extensions-for-tensorflow" target="_blank" rel="noopener">The Best ML Frameworks &amp; Extensions For TensorFlow</a></p>
    
    </section>



<h3 class="wp-block-heading" id="6-bigdl">6. <a href="https://bigdl-project.github.io/master/" target="_blank" rel="noreferrer noopener nofollow">BigDL</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_14.png?resize=264%2C131&#038;ssl=1" alt="Distributed training: BigDL" class="wp-image-61135" width="264" height="131"/><figcaption class="wp-element-caption"><em>Distributed training: BigDL | <a href="https://github.com/intel-analytics/BigDL" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>BigDL is also an open-source framework for distributed training for Apache Spark. It was developed by Intel to allow DL algorithms to run on Hadoop and Spark clusters. One big advantage of BigDL is that it helps you easily build and process production data in an end-to-end pipeline for both data analysis and deep learning applications.&nbsp;</p>



<p>BigDL provides two options:</p>



<ol class="wp-block-list">
<li>You can <strong>directly</strong> use BigDL as you would any other library that Apache Spark provides for data engineering, data analytics et cetera.&nbsp;</li>



<li>You can <strong>scale out python libraries</strong> like PyTorch, TensorFlow, and Keras in the Spark ecosystem.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li><strong>End-to-end pipeline</strong>: If your big data is messy and complex, which is usually the case with live data streams, adopting BigDL is appropriate because it integrates data analytics and deep learning in a single end-to-end pipeline.&nbsp;</li>



<li><strong>Efficiency</strong>: With an integrated approach across the different components of Spark, BigDL makes development, deployment, and operations seamless and efficient.&nbsp;</li>



<li><strong>Communication and computing</strong>: Since the hardware and software components are stitched together, workloads run without interruption, keeping communication between workflows clear and computation fast.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-bigdl">When to use BigDL?</h4>



<p>You should use BigDL:</p>



<ul class="wp-block-list">
<li>If you want to develop an Apache Spark workflow.&nbsp;</li>



<li>If your preferred framework is PyTorch.</li>



<li>If you want to have continuous integration of all the components like data mining, data analytics, machine learning et cetera.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/intel-analytics/BigDL" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://bigdl.readthedocs.io/" target="_blank" rel="noreferrer noopener nofollow">Documentation</a></li>



<li><a href="https://bigdl.readthedocs.io/en/latest/doc/UserGuide/notebooks.html" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="7-horovod">7. <a href="https://github.com/horovod/horovod" target="_blank" rel="noreferrer noopener nofollow">Horovod</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_10.png?resize=257%2C257&#038;ssl=1" alt="Distributed training: Horovod" class="wp-image-61139" width="257" height="257"/><figcaption class="wp-element-caption"><em>Distributed training: Horovod | <a href="https://github.com/horovod/horovod" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Horovod was introduced by Uber in 2017. It is an open-source project made specifically for distributed training, and it is an internal component of Michelangelo, the platform Uber uses to implement its deep learning algorithms. Horovod leverages data-parallel distributed training, which makes scaling easy and efficient: it can scale to hundreds of GPUs with only around five additional lines of Python code. The idea is that you write a training script for a single GPU, and Horovod scales it to train on multiple GPUs in parallel.&nbsp;</p>



<p>Horovod is built for frameworks like TensorFlow, Keras, PyTorch, and Apache MXNet. It is easy to use and fast.&nbsp;</p>
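<p>Under the hood, Horovod&#8217;s key primitive is an allreduce that averages gradients across workers after each step, so every worker applies the identical update. The sketch below illustrates that data-parallel pattern in plain Python. It is a conceptual illustration, not the Horovod API; the one-parameter model, loss, and shard layout are invented for the example.</p>

```python
# Conceptual sketch of Horovod-style data parallelism (pure Python,
# not the Horovod API): each worker computes a gradient on its own
# data shard, then an allreduce averages the gradients so every
# worker applies the same update.

def local_gradient(w, shard):
    # Hypothetical per-worker gradient of a squared-error loss
    # for a one-parameter model y = w * x.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def allreduce_mean(values):
    # In Horovod this is a ring-allreduce across processes;
    # here we simply average in-process.
    return sum(values) / len(values)

def training_step(w, shards, lr=0.01):
    grads = [local_gradient(w, shard) for shard in shards]  # done in parallel
    g = allreduce_mean(grads)   # identical result on every worker
    return w - lr * g           # identical update everywhere

# Four "workers", each holding its own shard of (x, y) pairs from y = 3x.
shards = [[(1, 3), (2, 6)], [(3, 9)], [(4, 12)], [(5, 15)]]
w = 0.0
for _ in range(200):
    w = training_step(w, shards)
print(round(w, 2))  # converges toward 3.0
```

In real Horovod the same structure survives: the training loop is unchanged, and only the optimizer is wrapped so that gradients pass through the allreduce before being applied.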



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Easy to learn and implement if you are familiar with TensorFlow, Keras, PyTorch, or Apache MXNet.</li>



<li>If you are using Apache Spark, you can unify all the processes in a single pipeline.</li>



<li>Good community support.</li>



<li>It is fast.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-horovod">When to use Horovod?</h4>



<p>You should use Horovod:</p>



<ul class="wp-block-list">
<li>If you want to scale a single GPU script quickly across multiple GPUs.</li>



<li>If you are using Microsoft Azure as your cloud computing platform.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/horovod/horovod" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://github.com/horovod/horovod#documentation" target="_blank" rel="noreferrer noopener nofollow">Documentation</a></li>



<li><a href="https://github.com/horovod/horovod/tree/master/examples" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="8-ray">8. <a href="https://www.ray.io/" target="_blank" rel="noreferrer noopener nofollow">Ray</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_5.png?resize=453%2C141&#038;ssl=1" alt="Distributed training: Ray" class="wp-image-61144" width="453" height="141"/><figcaption class="wp-element-caption"><em>Distributed training: Ray | <a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Ray is another open-source framework for distributed computing that integrates tightly with PyTorch. It provides tools for launching GPU clusters on any cloud provider. Unlike the other libraries we have discussed so far, Ray is very flexible and can run almost anywhere: Azure, GCP, AWS, Apache Spark, and Kubernetes.&nbsp;</p>
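<p>Ray&#8217;s core API turns ordinary Python functions into tasks that are scheduled in parallel across a cluster (via <code>@ray.remote</code> and <code>ray.get</code>). The snippet below sketches that fan-out/fan-in pattern using only the standard library on a single machine; Ray generalizes the same idea across many machines. The <code>evaluate</code> function and the hyperparameter configs are invented for illustration.</p>

```python
# Fan-out/fan-in sketch of Ray's task model using the standard
# library (not the Ray API). Ray schedules such tasks across a
# whole cluster; here a local thread pool plays that role.
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    # Stand-in for an expensive trial, e.g. training a model with
    # one hyperparameter setting (what Ray Tune fans out at scale).
    lr = config["lr"]
    return {"lr": lr, "score": 1.0 - abs(lr - 0.1)}

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 1.0)]

# Fan out the trials in parallel, then gather the results
# (analogous to ray.get on a list of remote task handles).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, configs))

best = max(results, key=lambda r: r["score"])
print(best["lr"])  # 0.1
```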



<p>Ray offers the following libraries in its bundle for hyperparameter tuning, reinforcement learning, deep learning, data loading, and serving:</p>



<ol class="wp-block-list">
<li><a href="https://docs.ray.io/en/master/tune.html" target="_blank" rel="noreferrer noopener nofollow">Tune</a>: Scalable Hyperparameter Tuning.</li>



<li><a href="https://docs.ray.io/en/master/rllib/index.html" target="_blank" rel="noreferrer noopener nofollow">RLlib</a>: Distributed Reinforcement Learning.</li>



<li><a href="https://docs.ray.io/en/master/train/train.html" target="_blank" rel="noreferrer noopener nofollow">Train</a>: Distributed Deep Learning, currently in beta version.&nbsp;</li>



<li><a href="https://docs.ray.io/en/master/data/dataset.html" target="_blank" rel="noreferrer noopener nofollow">Datasets</a>: Distributed Data Loading and Compute, currently in beta version.&nbsp;</li>



<li><a href="https://docs.ray.io/en/master/serve/index.html" target="_blank" rel="noreferrer noopener nofollow">Serve</a>: Scalable and Programmable Serving.</li>



<li><a href="https://docs.ray.io/en/master/workflows/concepts.html" target="_blank" rel="noreferrer noopener nofollow">Workflows</a>: Fast, Durable Application Flows.</li>
</ol>



<p>Apart from these libraries, Ray also integrates with third-party libraries and frameworks, which allows you to develop, train, and scale your workloads with minimal code changes. Given below is the list of integrated libraries:</p>



<ol class="wp-block-list">
<li>Airflow</li>



<li>ClassyVision</li>



<li>Dask</li>



<li>Flambe</li>



<li>Horovod</li>



<li>Hugging Face Transformers</li>



<li>Intel Analytics Zoo</li>



<li>John Snow Labs’ NLU</li>



<li>LightGBM</li>



<li>Ludwig AI</li>



<li>MARS</li>



<li>Modin</li>



<li>PyCaret</li>



<li>PyTorch Lightning</li>



<li>RayDP</li>



<li>Scikit Learn</li>



<li>Seldon Alibi&nbsp;</li>



<li>Spacy</li>



<li>XGBoost</li>
</ol>



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>It supports Jupyter Notebooks.&nbsp;</li>



<li>It makes your code run in parallel on single and multiple machines.&nbsp;</li>



<li>It integrates multiple frameworks and libraries.&nbsp;</li>



<li>It works with all the major cloud computing platforms.</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-ray">When to use Ray?</h4>



<p>You should use Ray:</p>



<ol class="wp-block-list">
<li>If you want to perform distributed reinforcement learning</li>



<li>If you want to perform distributed hyperparameter tuning</li>



<li>If you want to use distributed data loading and compute across different machines.&nbsp;</li>



<li>If you want to serve your application.</li>
</ol>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://www.ray.io/docs" target="_blank" rel="noreferrer noopener nofollow">Documentation</a></li>



<li><a href="https://github.com/ray-project/tutorial" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h2 class="wp-block-heading" id="h-cloud-platforms-for-distributed-training">Cloud platforms for distributed training</h2>



<p>So far, we have discussed the frameworks and libraries that enable distributed training. Now, let’s explore the cloud platforms that give you access to the hardware for training your DL models efficiently. But before that, let’s lay out some criteria for choosing the cloud platform that best fits your requirements.&nbsp;</p>



<ol class="wp-block-list">
<li><strong>Hardware and software support: </strong>It is important to understand what hardware these platforms offer: GPUs, TPUs, storage units, and so on. Depending on your project, you should also look at the APIs they expose for hosting, containers, data analytics tools, and so forth.&nbsp;</li>



<li><strong>Availability zones: </strong>Availability zones are an important factor in cloud computing: they give users the flexibility to set up and deploy a project anywhere in the world, and to move it whenever they want.&nbsp;</li>



<li><strong>Pricing: </strong>Whether the platform charges based on usage or offers a subscription-based model.&nbsp;</li>
</ol>



<p>Now, let’s discuss cloud computing options. We will cover a ready-to-use notebook platform and the three most popular cloud computing services.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_11.png?resize=574%2C599&#038;ssl=1" alt="Magic quadrant for cloud infrastructure as a service" class="wp-image-61138" width="574" height="599"/><figcaption class="wp-element-caption"><em>Magic quadrant for cloud infrastructure as a service | <a href="https://www.c-sharpcorner.com/article/top-10-cloud-service-providers/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="1-google-colab">1. <a href="https://colab.research.google.com/" target="_blank" rel="noreferrer noopener nofollow">Google Colab</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_17.png?resize=311%2C138&#038;ssl=1" alt="Distributed training: Google Colab" class="wp-image-61197" width="311" height="138"/><figcaption class="wp-element-caption"><em>Distributed training: Google Colab | <a href="https://colab.research.google.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Google Colab is one of the most reliable and easy-to-use platforms for small- to medium-scale projects. One good thing about Google Colab is that you can easily connect it to Google Cloud and work with any of the Python libraries mentioned above. It offers three tiers:</p>



<ol class="wp-block-list">
<li><strong>Google Colab</strong> is free of cost and gives you access to GPUs and TPUs, but storage and memory are limited. Once either limit is exceeded, the program stops.&nbsp;</li>



<li><strong>Google Colab Pro</strong> is a subscription version of Google Colab with extra memory and storage. You can run fairly heavy models, but resources are still limited.&nbsp;</li>



<li><strong>Google Colab Pro+</strong> is the newest and most expensive subscription tier. It offers faster GPUs and TPUs plus extra memory, so you can run larger models on larger datasets.&nbsp;</li>
</ol>



<p>Given below is the official comparison of all three.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_8.png?resize=746%2C424&#038;ssl=1" alt="Cloud platforms" class="wp-image-61141" width="746" height="424"/><figcaption class="wp-element-caption"><em>Cloud platforms | <a href="https://colab.research.google.com/signup" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<section id="blog-intext-cta-block_734c7e8457f4e4324b1bab34e1ef9ae6" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/how-to-use-google-colab-for-deep-learning-complete-tutorial" target="_blank" rel="noopener">How to Use Google Colab for Deep Learning [Complete Tutorial]</a></p>
    
    </section>



<h3 class="wp-block-heading" id="2-amazon-web-services-sagemaker">2. <a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Amazon Web Services: SageMaker</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_6.png?resize=376%2C160&#038;ssl=1" alt="Distributed training: AWS SageMaker" class="wp-image-61143" width="376" height="160"/><figcaption class="wp-element-caption"><em>Distributed training: AWS SageMaker | <a href="https://nub8.net/machine-learning-with-amazon-sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>AWS SageMaker is one of the oldest and most popular cloud computing platforms for distributed training. It is well integrated with Apache MXNet, PyTorch, and TensorFlow and allows you to deploy deep learning algorithms with little code modification. The SageMaker API has <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html" target="_blank" rel="noreferrer noopener nofollow">18+ machine learning algorithms</a>, some of which were rewritten from scratch to make the whole process scalable and easy. These built-in algorithms are optimized to get the most out of the hardware.&nbsp;</p>



<p>SageMaker also has an integrated Jupyter Notebook that allows data scientists and machine learning engineers to build and develop pipelines on the go and deploy them directly in a hosted environment. You can configure hardware and environments based on your requirements and preferences from SageMaker Studio or the SageMaker console. All hosting and development is billed according to <strong>usage per minute</strong>.&nbsp;</p>



<p>AWS SageMaker offers both data-parallel and model-parallel distributed training. In fact, SageMaker also offers a hybrid strategy where you combine model and data parallelism.&nbsp;</p>
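<p>The difference between the two strategies can be sketched in a few lines of plain Python. This is a conceptual illustration, not the SageMaker SDK; the toy &#8220;layers&#8221; are simple functions standing in for network stages.</p>

```python
# Data parallelism vs. model parallelism, conceptually (pure Python,
# not the SageMaker SDK). The "model" is a chain of layer functions.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def forward(x, layer_subset):
    for layer in layer_subset:
        x = layer(x)
    return x

batch = [1, 2, 3, 4]

# Data parallelism: every worker holds ALL layers and processes
# its own slice of the batch.
worker_a = [forward(x, layers) for x in batch[:2]]
worker_b = [forward(x, layers) for x in batch[2:]]
data_parallel_out = worker_a + worker_b

# Model parallelism: each device holds only SOME layers, and
# activations flow from device to device (a pipeline).
def model_parallel(x):
    x = forward(x, layers[:2])     # "device 0": first two layers
    return forward(x, layers[2:])  # "device 1": last layer

model_parallel_out = [model_parallel(x) for x in batch]

print(data_parallel_out == model_parallel_out)  # True: same math, different placement
```

A hybrid strategy applies both at once: the batch is split across groups of devices, and within each group the model itself is partitioned.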


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_1.png?resize=780%2C496&#038;ssl=1" alt="Distributed training: AWS SageMaker" class="wp-image-61148" width="780" height="496"/><figcaption class="wp-element-caption"><em>Distributed training: AWS SageMaker | <a href="https://aws.amazon.com/blogs/machine-learning/the-aws-deep-learning-ami-now-with-ubuntu/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="3-google-cloud-computing">3. <a href="https://cloud.google.com/" target="_blank" rel="noreferrer noopener nofollow">Google Cloud Computing</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_15.png?resize=165%2C165&#038;ssl=1" alt="Distributed training: Google Cloud Computing" class="wp-image-61134" width="165" height="165"/><figcaption class="wp-element-caption"><em>Distributed training: Google Cloud Computing | <a href="https://cloud.google.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Google Cloud was developed by Google in 2010, initially to strengthen its own products such as Google Search and YouTube. Gradually, Google opened it up to the public. Google Cloud offers the same infrastructure that all of Google&#8217;s platforms use.&nbsp;</p>



<p>Google Cloud offers built-in support for libraries like TensorFlow, PyTorch, scikit-learn, and many more. Furthermore, apart from configuring GPUs in your workflow, you can add TPUs to make the training process much faster. As mentioned before, you can connect your Google Colab to the Google Cloud Platform and access all the features it provides.&nbsp;</p>



<p>Some of the features that it provides are:&nbsp;</p>



<ol class="wp-block-list">
<li>Compute (virtual hardware such as GPUs and TPUs)</li>



<li>Storage buckets&nbsp;</li>



<li>Databases&nbsp;</li>



<li>Networking</li>



<li>Management tools</li>



<li>Security</li>



<li>IoT</li>



<li>API platform</li>



<li>Hosting Services</li>
</ol>



<p>It is worth noting that GCP has fewer availability zones than AWS, but it is also less expensive.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_13.png?resize=800%2C500&#038;ssl=1" alt="Distributed training: Google Cloud Computing" class="wp-image-61136" width="800" height="500"/><figcaption class="wp-element-caption"><em>Distributed training: Google Cloud Computing | Source: Author</em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="4-microsoft-azure">4. <a href="https://azure.microsoft.com/en-us/">Microsoft Azure</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_9.png?resize=347%2C195&#038;ssl=1" alt="Distributed training: Microsoft Azure" class="wp-image-61140" width="347" height="195"/><figcaption class="wp-element-caption"><em>Distributed training: Microsoft Azure | <a href="https://medium.com/analytics-vidhya/azure-machine-learning-service-part-1-80e43e4af71b" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Microsoft Azure is another very popular cloud computing platform. One of the most popular language models, GPT-3 from OpenAI, was trained on Azure. Azure also offers both <a href="https://docs.microsoft.com/en-us/azure/machine-learning/concept-distributed-training#data-parallelism" target="_blank" rel="noreferrer noopener nofollow">data parallelism</a> and <a href="https://docs.microsoft.com/en-us/azure/machine-learning/concept-distributed-training#model-parallelism" target="_blank" rel="noreferrer noopener nofollow">model parallelism</a> and supports both TensorFlow and PyTorch. If you want to optimize computing speed further, you can also leverage Uber’s Horovod.</p>



<p>The Azure Machine Learning service is for both coders and non-coders. It offers a drag-and-drop approach that can streamline your workflow, and it reduces manual work with automated machine learning that can help you develop working prototypes faster.&nbsp;</p>



<p>The Azure Python SDK also lets you work from any Python environment, like Jupyter Notebooks, Visual Studio Code, and many more. Azure is quite similar to both AWS and GCP in terms of the services it offers:</p>



<ol class="wp-block-list">
<li>AI, Machine Learning and Deep learning</li>



<li>Computing powers (GPUs)&nbsp;</li>



<li>Analytics&nbsp;</li>



<li>Blockchain</li>



<li>Containers</li>



<li>Databases</li>



<li>Developer Tools</li>



<li>DevOps</li>



<li>Internet of Things</li>



<li>Mixed Reality</li>



<li>Mobile</li>



<li>Networking et cetera</li>
</ol>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_3.png?resize=781%2C527&#038;ssl=1" alt="Distributed training: Microsoft Azure" class="wp-image-61146" width="781" height="527"/><figcaption class="wp-element-caption"><em>Distributed training: Microsoft Azure |  <a href="https://docs.microsoft.com/en-us/azure/azure-portal/azure-portal-dashboards" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Let’s also compare the three main platforms side by side to give you a better basis for making a choice.</p>



<h3 class="wp-block-heading" id="comparison-table-for-cloud-platform">Comparison table for cloud platform</h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_2.png?ssl=1" alt="Comparison table for cloud platform" class="wp-image-61147"/><figcaption class="wp-element-caption"><em>Comparison table for cloud platform | <a href="https://medium.com/georgian-impact-blog/comparing-google-cloud-platform-aws-and-azure-d4a52a3adbd2" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>In this article, we saw different libraries and tools that can help you implement distributed training for your own deep learning application. Bear in mind that all the libraries are effective at what they do; ultimately, it all boils down to your preferences and requirements.&nbsp;</p>



<p>You may have noticed that all the frameworks discussed integrate primarily with PyTorch and TensorFlow in some way or another. This can help you narrow down your framework of choice. Once your framework is decided, you can look at the advantages to decide which distributed training tool works best for you.&nbsp;</p>



<p>I hope you enjoyed this article. If you want to try out the frameworks we discussed, follow the tutorial links above.&nbsp;</p>



<p>Thanks for reading!</p>



<h3 class="wp-block-heading" id="references">References</h3>



<ul class="wp-block-list">
<li><a href="https://neptune.ai/blog/ml-model-monitoring-best-tools" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/ml-model-monitoring-best-tools</a></li>



<li><a href="https://neptune.ai/blog/best-mlops-tools" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/best-mlops-tools</a></li>



<li><a href="https://towardsdatascience.com/how-to-train-your-deep-learning-models-in-a-distributed-fashion-43a6f53f0484" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/how-to-train-your-deep-learning-models-in-a-distributed-fashion-43a6f53f0484</a></li>



<li><a href="https://analyticsindiamag.com/top-distributed-training-frameworks-in-2021/" target="_blank" rel="noreferrer noopener nofollow">https://analyticsindiamag.com/top-distributed-training-frameworks-in-2021/</a></li>



<li><a href="https://www.telesens.co/2017/12/25/understanding-data-parallelism-in-machine-learning/" target="_blank" rel="noreferrer noopener nofollow">https://www.telesens.co/2017/12/25/understanding-data-parallelism-in-machine-learning/</a></li>



<li><a href="https://towardsdatascience.com/distributed-training-on-aws-sagemaker-8bcbea28466c" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/distributed-training-on-aws-sagemaker-8bcbea28466c</a></li>



<li><a href="https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu" target="_blank" rel="noreferrer noopener nofollow">https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu</a></li>






<li><a href="https://arxiv.org/pdf/1811.02084.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/1811.02084.pdf</a></li>



<li><a href="https://developer.yahoo.com/blogs/157196317141/" target="_blank" rel="noreferrer noopener nofollow">https://developer.yahoo.com/blogs/157196317141/</a></li>



<li><a href="http://www.vldb.org/pvldb/vol13/p3005-li.pdf" target="_blank" rel="noreferrer noopener nofollow">http://www.vldb.org/pvldb/vol13/p3005-li.pdf</a></li>



<li><a href="https://arxiv.org/pdf/1804.05839.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/1804.05839.pdf</a></li>
</ul>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6462</post-id>	</item>
		<item>
		<title>Model Deployment Strategies</title>
		<link>https://neptune.ai/blog/model-deployment-strategies</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:46:18 +0000</pubDate>
				<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.test/model-deployment-strategies/</guid>

					<description><![CDATA[In recent years, big data and machine learning has been adopted in most of the major industries and most startups are leaning towards the same. As data has become an integral part of all companies, ways to process them i.e. derive meaningful insights and patterns are essential. This is where machine learning comes into the&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In recent years, big data and machine learning have been adopted in most major industries, and many startups are leaning in the same direction. As data has become an integral part of every company, ways to process it, i.e., to derive meaningful insights and patterns, are essential. This is where machine learning comes into the picture. </p>



<p>We already know how efficient machine learning systems are at processing huge amounts of data and, depending on the task at hand, yielding results in real time. But these systems need to be curated and deployed properly so that they perform the task efficiently. This article aims to provide you with information on <strong>model deployment strategies</strong> and how to choose the one that is best for your application.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_5" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_5.png?resize=795%2C625&#038;ssl=1" alt="The entire pipeline of a data-science project" class="wp-image-63169" width="795" height="625"/></a><figcaption class="wp-element-caption"><em>The image above depicts the entire pipeline of a data-science project | <a href="https://arxiv.org/pdf/2103.08937.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>We will cover the following strategies and techniques for model deployment:</strong></p>



<div id="case-study-numbered-list-block_864715b185adc97e438a2bba2390a414"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Shadow evaluation            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                A/B testing            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Multi Arm Bandits            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Blue-green deployment            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Canary testing            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Feature flag            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Rolling deployment            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Recreate strategy            </li>
            </ul>
</div>



<div id="separator-block_1a76c94c024c6df9ccb8972303143a75"
         class="block-separator block-separator--10">
</div>



<p>These strategies can be broken down into two categories:</p>



<ul class="wp-block-list">
<li><strong>Static deployment strategies</strong>: These are strategies where the distribution of traffic or requests is handled manually. Examples are shadow evaluation, A/B testing, canary testing, rolling deployment, and blue-green deployment.&nbsp;</li>



<li><strong>Dynamic deployment strategies: </strong>These are strategies where the distribution of traffic or requests is handled automatically. An example is multi-armed bandits.&nbsp;</li>
</ul>
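<p>The static/dynamic distinction is easiest to see in code. Below is a minimal sketch of static traffic splitting, e.g. sending 10% of requests to a canary model. The version names and weights are invented for illustration; a dynamic strategy such as a multi-armed bandit would update the weights automatically from observed feedback instead of leaving them fixed.</p>

```python
import random

# Minimal sketch of static traffic splitting between two model
# versions: the mechanism behind A/B tests and canary releases.
# The weights are fixed by hand, which is what makes it "static".
def route(weights, rng=random):
    """Pick a model version according to fixed traffic weights."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions])[0]

weights = {"model_v1": 0.9, "model_v2": 0.1}  # 10% canary traffic

random.seed(0)
counts = {"model_v1": 0, "model_v2": 0}
for _ in range(10_000):
    counts[route(weights)] += 1
print(counts["model_v2"] / 10_000)  # roughly 0.1
```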



<div id="separator-block_9338c7c68f6c33cb794813d775b26342"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_10.png?resize=800%2C390&#038;ssl=1" alt="Model deployment strategies" class="wp-image-63164" width="800" height="390"/><figcaption class="wp-element-caption"><em>Model deployment strategies | <a href="https://www.coursera.org/lecture/ml-models-human-in-the-loop-pipelines/model-deployment-strategies-6icWT" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>To begin with, let’s have a quick overview of what the model lifecycle and model deployment refer to.</p>



<section id="blog-intext-cta-block_24ca560510faf8b00de208d52903cde7" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also</h3>
    
            <p style="text-align: left;">  <a href="/blog/model-deployment-challenges-lessons-from-ml-engineers" target="_blank" rel="noopener">Model Deployment Challenges: 6 Lessons From 6 ML Engineers</a></p>
<p>  <a href="/blog/best-8-machine-learning-model-deployment-tools" target="_blank" rel="noopener">Best Machine Learning Model Deployment Tools</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-lifecycle-of-an-ml-model">Lifecycle of an ML model</h2>



<p>The lifecycle of a machine learning model refers to the entire process that structures a data science or AI project. It is similar to the software development life cycle (SDLC) but differs in a few key areas, such as the use of real-world data to evaluate the model’s performance before deployment. The lifecycle of an ML model, or model development life cycle (MDLC), primarily has five phases:</p>



<div id="case-study-numbered-list-block_bcd422d18a8644971c6c7f22aa78ebd6"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data collection            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Create model and training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Testing and evaluation             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Deployment and production              </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Monitoring             </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_12" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_12.png?resize=861%2C878&#038;ssl=1" alt="Model development lifecycle (MDLC)" class="wp-image-63162" width="861" height="878"/></a><figcaption class="wp-element-caption"><em>Model development lifecycle (MDLC) | <a href="https://towardsdatascience.com/the-machine-learning-lifecycle-in-2021-473717c633bc" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Now, another term that you must be familiar with is <strong><a href="/blog/mlops" target="_blank" rel="noreferrer noopener">MLOps</a></strong>. MLOps is a set of practices that enables the ML lifecycle, stitching machine learning and software applications together. Simply put, it is a collaboration between data scientists and the operations team that takes care of and orchestrates the whole ML lifecycle. The three key areas MLOps focuses on are <strong>continuous integration</strong>, <strong>continuous deployment,</strong> and <strong>continuous testing</strong>. </p>



<section id="blog-intext-cta-block_1a9a2be3c98034e1ba5f051b6adab6b1" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p>  <a href="/blog/mlops" target="_blank" rel="noopener">MLOps: What It Is, Why It Matters, and How to Implement It</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-what-is-model-deployment-or-model-release">What is model deployment (or model release)?</h2>



<p>Model deployment (release) is a process that enables you to integrate machine learning models into production to make decisions on real-world data. It is essentially the second-to-last stage of the ML lifecycle, before <a href="/blog/ml-model-monitoring-best-tools" target="_blank" rel="noreferrer noopener nofollow">monitoring</a>. Once deployed, the model needs to be monitored to check whether the whole process of data ingestion, feature engineering, training, testing, et cetera is aligned properly, so that no human intervention is required and the whole process is automatic.</p>



<p>But before deploying the model, one has to evaluate and test whether the trained ML model is fit to be deployed to production. The model is tested for performance, efficiency, bugs, and issues. There are various strategies one can use before deploying an ML model. Let us explore them. </p>



<h2 class="wp-block-heading" id="h-model-deployment-strategies">Model deployment strategies</h2>



<p>Deployment strategies let us evaluate a model’s performance and capabilities and discover issues concerning the model. A key point to keep in mind is that the right strategy usually depends on the task and resources at hand. Some strategies are thorough but computationally expensive, while others get the job done with ease. Let’s discuss a few of them.</p>



<h3 class="wp-block-heading" id="h-1-shadow-deployment-strategy">1. Shadow deployment strategy</h3>



<p>In shadow deployment or shadow mode, the new model, with its new features, is deployed alongside the live model. The newly deployed model is known as a <strong>shadow model</strong>. The shadow model handles all the requests just like the live model, except it is not released to the public.</p>



<p>This strategy allows us to evaluate the shadow model better by testing it on real-world data while not interrupting the services offered by the live model.&nbsp;</p>



<div id="separator-block_9338c7c68f6c33cb794813d775b26342"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_4.png?resize=686%2C408&#038;ssl=1" alt="Shadow deployment strategy" class="wp-image-63170" width="686" height="408"/><figcaption class="wp-element-caption"><em>Shadow deployment strategy | <a href="https://alexgude.com/blog/machine-learning-deployment-shadow-mode/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Methodology: champion vs challenger</h4>



<p>In shadow evaluation, the request is sent to both models, which run in parallel behind two API endpoints. During inference, predictions from both models are computed and stored, but only the prediction from the live model is used in the application and returned to the users.</p>



<p>The predicted values from both the live and shadow model are compared against the ground truth. Once the results are in hand, data scientists can decide whether to deploy the shadow model globally into production or not.&nbsp;</p>
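<p>A minimal sketch of this request flow in Python (the model classes and names here are illustrative placeholders, not part of any specific framework):</p>

```python
import logging

# Placeholder models with a predict() method -- purely illustrative.
class StubModel:
    def __init__(self, offset):
        self.offset = offset

    def predict(self, features):
        return features + self.offset

live_model = StubModel(offset=0)     # champion: serves users
shadow_model = StubModel(offset=1)   # challenger: scored silently

shadow_log = []  # in production this would feed a metrics store

def handle_request(features):
    """Return the live prediction; score the shadow model on the side."""
    live_pred = live_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        shadow_log.append({"live": live_pred, "shadow": shadow_pred})
    except Exception:
        # A shadow failure must never affect the user-facing response.
        logging.exception("shadow model failed")
    return live_pred  # only the champion's output reaches the user
```

<p>Both predictions are persisted for offline comparison against the ground truth, but users only ever see the live model’s output.</p>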



<p>One can also use the <a href="https://medium.com/decision-automation/what-is-champion-challenger-and-how-does-it-enable-choosing-the-right-decision-f57b8b653149" target="_blank" rel="noreferrer noopener nofollow">champion/challenger</a> framework, in which multiple shadow models are tested and compared with the existing model. The model with the best accuracy or key performance indicator (KPI) is selected and deployed.</p>



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>Model evaluation is efficient: since both models run in parallel, there is no impact on traffic.</li>



<li>No overloading irrespective of the traffic.&nbsp;</li>



<li>You can monitor the shadow model which allows you to check the stability and performance; this reduces risk.&nbsp;</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>Expensive because of the resources required to support the shadow model.&nbsp;</li>



<li>Shadow deployment can be tedious, especially if you are concerned about different aspects of model performance like metrics comparison, latency, load testing, et cetera.</li>



<li>Provides no user response data.&nbsp;</li>
</ul>



<p><strong>When to use it?</strong></p>



<ul class="wp-block-list">
<li>If you want to compare multiple models with each other, shadow testing is great, although tedious.</li>



<li>Shadow testing lets you evaluate the pipeline and its latency as well as its load-bearing capacity.</li>
</ul>



<h3 class="wp-block-heading" id="h-2-a-b-testing-model-deployment-strategy">2. A/B testing model deployment strategy</h3>



<p>A/B testing is a data-driven strategy. It is used to evaluate two models, A and B, to assess which one performs better in a controlled environment. It is primarily used on e-commerce websites and social media platforms. With A/B testing, data scientists can evaluate and choose the best design for the website based on data received from the users.</p>



<p>The two models differ slightly in terms of features and they cater to different sets of users. Based on the interaction and data received from the users such as feedback, data scientists choose one of the models that can be deployed globally into production.&nbsp;</p>



<h4 class="wp-block-heading">Methodology</h4>



<p>In A/B testing, the two models are set up in parallel with different features. The aim is to increase the <strong>conversion rate</strong> of a given model. To do that, the data scientist sets up a hypothesis: an assumption based on an abstract intuition about the data. The assumption is put to the test in an experiment; if it passes, it is accepted as fact and the model is accepted, otherwise it’s rejected.</p>



<h5 class="wp-block-heading">Hypothesis testing</h5>



<p>In A/B testing there are two types of hypothesis:&nbsp;</p>



<div id="case-study-numbered-list-block_63cfb35d1d1d3a3551c4749e796c8a11"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
The null hypothesis states that the phenomenon observed in the model is purely due to chance and not because of a certain feature.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
The alternate hypothesis challenges the null hypothesis by stating that the phenomenon observed in the model is because of a certain feature.             </li>
            </ul>
</div>



<p>In <a href="https://www.analyticsvidhya.com/blog/2021/09/hypothesis-testing-in-machine-learning-everything-you-need-to-know/" target="_blank" rel="noreferrer noopener nofollow">hypothesis testing</a>, the aim is to reject the null hypothesis by setting up an experiment, such as an A/B test, and exposing the new model, with a certain feature, to a few users. The new model is essentially designed around the alternate hypothesis. If the alternate hypothesis is accepted and the null hypothesis is rejected, the feature is added and the new model is deployed globally.</p>



<p>It is important to know that in order to reject the null hypothesis you have to prove the <a href="https://www.investopedia.com/terms/s/statistical-significance.asp#:~:text=Statistical%20significance%20refers%20to%20the,attributable%20to%20a%20specific%20cause." target="_blank" rel="noreferrer noopener nofollow"><strong>statistical significance</strong></a><strong> </strong>of the test.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_11.png?resize=814%2C353&#038;ssl=1" alt="A/B testing model deployment strategy" class="wp-image-63163" width="814" height="353"/><figcaption class="wp-element-caption"><em>A/B testing model deployment strategy | <a href="https://www.oreilly.com/library/view/building-machine-learning/9781492045106/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>Advantages</strong>:</p>



<ul class="wp-block-list">
<li>It is simple.&nbsp;</li>



<li>Yields quick results and helps in the elimination of the low performing model.</li>
</ul>



<p><strong>Disadvantages</strong>:</p>



<ul class="wp-block-list">
<li>Results can be unreliable as complexity increases. One should use A/B testing only for simple hypothesis tests.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>As mentioned earlier, A/B testing is predominantly used on e-commerce, social media, and online streaming platforms. In such a setting, if you have two models, you can use A/B testing to evaluate them and choose which one to deploy globally.</p>



<h3 class="wp-block-heading" id="h-3-multi-armed-bandit">3. Multi-Armed Bandit</h3>



<p>Multi-Armed Bandit, or MAB, is an advanced version of A/B testing. Inspired by reinforcement learning, the idea is to explore and exploit the environment in a way that maximizes a reward function. </p>



<p>MAB leverages machine learning to explore and exploit the data received in order to optimize the key performance indicator (KPI). The advantage of this technique is that user traffic is diverted according to the KPIs of two or more models; the model that yields the best KPI is deployed globally.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_6.png?resize=815%2C550&#038;ssl=1" alt="Multi Armed Bandit strategy" class="wp-image-63168" width="815" height="550"/><figcaption class="wp-element-caption"><em>Multi Armed Bandit strategy | <a href="https://vwo.com/blog/multi-armed-bandit-algorithm/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Methodology</h4>



<p>MAB heavily depends on two concepts: <a href="https://www.manifold.ai/exploration-vs-exploitation-in-reinforcement-learning" target="_blank" rel="noreferrer noopener nofollow">exploration and exploitation</a>.&nbsp;</p>



<p><strong>Exploration: </strong>The algorithm gathers statistically significant results by trying each model, much like what we saw in A/B testing, whose prime focus is to discover the conversion rates of the two models.</p>



<p><strong>Exploitation</strong>: The algorithm uses a greedy approach to maximize conversion rates based on the information gained during exploration.</p>



<p>MAB is much more flexible than A/B testing: it can work with more than two models at a time, which increases the rate of conversion. The algorithm continuously logs each model’s KPI score based on its success on the route from which the request was made. This allows the algorithm to keep updating its estimate of which model is best.</p>
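<p>An epsilon-greedy bandit is one simple way to implement this explore/exploit loop. The sketch below is generic and not tied to any particular serving stack; the reward here is assumed to be a 0/1 conversion signal:</p>

```python
import random

class EpsilonGreedyBandit:
    """Route traffic among model variants, exploiting the best KPI so far."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon        # fraction of traffic that explores
        self.counts = [0] * n_arms    # requests routed to each variant
        self.values = [0.0] * n_arms  # running mean reward (e.g. conversion)

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore a random arm
        return self.values.index(max(self.values))     # exploit the leader

    def update(self, arm, reward):
        """Log the outcome of a request so the KPI estimate stays current."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

<p>Each incoming request calls <code>select_arm()</code> to pick a model, and <code>update()</code> records whether the user converted, so traffic drifts toward the best-performing variant automatically.</p>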


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://neptune.ai/model-deployment-strategies_8" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_8.png?resize=831%2C392&#038;ssl=1" alt="Building machine learning powered application" class="wp-image-63166" width="831" height="392"/></a><figcaption class="wp-element-caption"><em>Building machine learning powered application | <a href="https://www.oreilly.com/library/view/building-machine-learning/9781492045106/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>Advantages</strong>:</p>



<ul class="wp-block-list">
<li>With exploration and exploitation, MAB offers adaptive testing.</li>



<li>Resources are not wasted, as they are in A/B testing.</li>



<li>A faster and more efficient way of testing.</li>
</ul>



<p><strong>Disadvantages</strong>:</p>



<ul class="wp-block-list">
<li>It can be costly, because exploitation takes a lot of computing power, which can be economically expensive.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>MAB is very helpful in scenarios where the conversion rate is all you care about and decisions must be made quickly, for example, optimizing offers or discounts on a product for a limited period.</p>



<h3 class="wp-block-heading" id="h-4-blue-green-deployment-strategy">4. Blue-green deployment strategy</h3>



<p>The blue-green deployment strategy involves two production environments, not just two models. The blue environment hosts the live model, whereas the green environment hosts the new version of the model.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_13.png?resize=795%2C314&#038;ssl=1" alt="Blue-green deployment strategy" class="wp-image-63161" width="795" height="314"/><figcaption class="wp-element-caption"><em>Blue-green deployment strategy | <a href="https://www.data4v.com/machine-learning-deployment-strategies/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The green environment is set up as a staging environment, i.e., an exact replica of the live environment but with new features. Let us briefly go through the methodology.</p>



<h4 class="wp-block-heading">Methodology</h4>



<p>In blue-green deployment, the two identical environments consist of the same database, containers, virtual machines, configuration, et cetera. Keep in mind that setting up an environment can be expensive, so usually some components, like the database, are shared between the two.</p>



<p>The blue environment, which contains the original model, stays live and keeps serving requests, while the green environment acts as a staging environment for the new version of the model. The new version is subjected to deployment and final-stage testing against real data to ensure that it performs well and is ready for production. Once testing is successfully completed and all bugs and issues are rectified, the new model is made live.</p>



<p>Once the new model is live, traffic is diverted from the blue environment to the green environment. In most cases, the blue environment serves as a backup: if something goes wrong, requests can be rerouted to the blue model. </p>
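<p>Conceptually, the cut-over is just a traffic switch, with blue kept warm as a fallback. A toy sketch (the environment callables are stand-ins for the two deployed model services):</p>

```python
class BlueGreenRouter:
    """Illustrative traffic switch between two identical environments."""

    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.active = "blue"  # blue serves live traffic initially

    def handle(self, request):
        return self.envs[self.active](request)

    def cut_over(self):
        """Divert all traffic to green once it passes staging tests."""
        self.active = "green"

    def rollback(self):
        """Blue stays warm, so rollback is a one-line switch."""
        self.active = "blue"

# stand-ins for the blue (v1) and green (v2) model services
router = BlueGreenRouter(blue=lambda r: "v1:" + r, green=lambda r: "v2:" + r)
```

<p>Because both environments stay running, <code>cut_over()</code> and <code>rollback()</code> change only which one receives traffic, which is what makes blue-green rollbacks nearly instant.</p>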



<p><strong>Pros:</strong></p>



<ul class="wp-block-list">
<li>It ensures application availability round the clock.</li>



<li>Rollbacks are easy because you can quickly divert the traffic to the blue environment in case of any issues.&nbsp;</li>



<li>Since both environments are independent of each other, deployment risk is less.</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>It is expensive, since both models require separate environments.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>If your application cannot afford downtime, you should use the blue-green deployment strategy.</p>



<h3 class="wp-block-heading" id="h-5-canary-deployment-strategy">5. Canary deployment strategy</h3>



<p>The canary deployment aims to deploy the new version of the model by gradually increasing the number of users. Unlike the previous strategies that we’ve seen where the new model is either hidden from the public or a small control group is set up, the canary deployment strategy uses the real users to test the new model. As a result, bugs and issues can be detected before the model is deployed globally for all the users.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://neptune.ai/model-deployment-strategies_3" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_3.png?resize=818%2C593&#038;ssl=1" alt="Canary deployment strategy" class="wp-image-63171" width="818" height="593"/></a><figcaption class="wp-element-caption"><em>Canary deployment strategy | <a href="https://cloud.google.com/architecture/application-deployment-and-testing-strategies#canary_test_pattern" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Methodology</h4>



<p>As in other deployment strategies, in canary deployment the new model is tested alongside the current live model, but here the new model is tested on a few users to check its reliability, performance, errors, et cetera.</p>



<p>The number of users can be increased or decreased based on the testing requirements. If the model succeeds in the testing phase, it is rolled out; if not, it can be rolled back with no downtime, and only a small number of users will have been exposed to the new model.</p>



<p>Canary deployment strategy can be broken down into three steps:</p>



<div id="case-study-numbered-list-block_4af1af7593739bf9cb18211da187d327"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Design a new model and route a small sample of users&#8217; requests to the new model.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Check for bugs, efficiency, reports, and issues in the new model, if found then perform a rollback.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Repeat steps one and two until all errors and issues are resolved, before routing all traffic to the new model.             </li>
            </ul>
</div>
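<p>The gradual rollout in step one is often implemented by hashing a stable user ID into a bucket, so the same user consistently sees the same version. A self-contained sketch of that routing logic:</p>

```python
import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into the canary cohort."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent  # e.g. percent=5 exposes roughly 5% of users

def route(user_id, live_model, canary_model, percent):
    """Send a stable slice of traffic to the canary, the rest to live."""
    model = canary_model if in_canary(user_id, percent) else live_model
    return model(user_id)
```

<p>Raising <code>percent</code> only ever adds users to the canary cohort (a bucket below 10 is also below 50), so rollout and rollback amount to adjusting a single number.</p>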



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>Cheaper than blue-green deployment.</li>



<li>Easy to test the new model against real data.</li>



<li>Zero downtime.&nbsp;</li>



<li>In case of failure, the model could be easily rolled back to the current version.</li>
</ul>



<p><strong>Cons:</strong></p>



<ul class="wp-block-list">
<li>Rollouts are easy but slow.</li>



<li>Since testing takes place against real data with a few users, proper monitoring must be in place so that, in case of failure, users are effectively routed back to the live version.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>The canary deployment strategy should be used when the model is to be evaluated against real-world, real-time data. It also has an advantage over A/B testing: where gathering enough user data to reach a statistically significant result can take a long time, canary deployment can do it in hours.</p>



<h3 class="wp-block-heading" id="h-6-other-model-deployment-strategies-and-techniques">6. Other model deployment strategies and techniques</h3>



<h4 class="wp-block-heading">Feature flag&nbsp;</h4>



<p>A feature flag is a technique, rather than a strategy, that allows developers to push or integrate code into the main branch while keeping the feature dormant until it is ready. This lets developers collaborate on different ideas and iterations. Once the feature is finalized, it can be activated and deployed.</p>
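<p>In code, a feature flag is just a guarded branch. A minimal illustration (the in-memory flag store and flag name here are hypothetical; real systems read flags from a config service or database so they can be flipped without a redeploy):</p>

```python
# Hypothetical in-memory flag store -- production systems typically read
# flags from a config service so they can flip without redeploying.
FLAGS = {"new_ranking_model": False}

def rank(items):
    if FLAGS.get("new_ranking_model"):
        return sorted(items, reverse=True)  # new code path, shipped dormant
    return sorted(items)                    # current live behaviour
```

<p>The new path is merged and deployed but stays dark until the flag is flipped, at which point it activates without any redeployment.</p>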


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_1" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_1.png?resize=793%2C310&#038;ssl=1" alt="Feature flag" class="wp-image-63173" width="793" height="310"/></a><figcaption class="wp-element-caption"><em>Feature flag&nbsp;| <a href="https://semaphoreci.com/blog/feature-flags" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>As mentioned earlier, a feature flag is a technique, so it can be used in combination with any of the deployment strategies above.</p>



<h4 class="wp-block-heading">Rolling deployment</h4>



<p>Rolling deployment is a strategy that gradually updates and replaces the older version of the model. The deployment occurs on running instances; it does not involve a staging environment or even private development.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_2.png?resize=673%2C523&#038;ssl=1" alt="Rolling deployment" class="wp-image-63172" width="673" height="523"/><figcaption class="wp-element-caption"><em>Rolling deployment | <a href="https://medium.com/@codefresh/continuous-deployment-strategies-with-kubernetes-c02323789a28" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows how rolling deployment works. Note that the service is horizontally scaled; this is the key factor.</p>



<p>The top-left panel shows three instances running version 1.1. In the next step, version 1.2 is deployed: with each new instance of version 1.2 that comes up, one instance of version 1.1 is retired. The same pattern continues for the remaining instances, i.e., whenever a new instance is deployed, an older one is retired.</p>
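<p>The replacement loop described above can be sketched as follows: each step starts one new instance and retires one old one, so serving capacity never drops to zero. The version strings are illustrative:</p>

```python
def rolling_update(instances, new_version):
    """Replace instances one at a time; capacity never drops to zero."""
    for i in range(len(instances)):
        instances[i] = new_version  # one new instance comes up...
        yield list(instances)       # ...while the rest keep serving

fleet = ["v1.1", "v1.1", "v1.1"]
states = list(rolling_update(fleet, "v1.2"))
# intermediate states mix v1.1 and v1.2; the final state is all v1.2
```

<p>In practice an orchestrator such as Kubernetes performs this loop, adding health checks before each old instance is retired.</p>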



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>It is faster than blue-green deployment because there are no environment restrictions.</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>Although it is quicker, rollbacks can be difficult if further updates fail.</li>
</ul>



<h4 class="wp-block-heading">Recreate strategy</h4>



<p>Recreate is a simple strategy in which the live version of the model is shut down and the new version is deployed in its place.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_7.png?resize=604%2C504&#038;ssl=1" alt="Recreate strategy" class="wp-image-63167" width="604" height="504"/><figcaption class="wp-element-caption"><em>Recreate strategy | <a href="https://www.weave.works/blog/kubernetes-deployment-strategies" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above depicts how the recreate strategy works: the old instances (V1) are shut down and discarded, while the new instances (V2) are deployed. </p>



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>Easy and simple set-up.</li>



<li>The entire environment is completely renewed.</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>Negative impact on users, since the strategy incurs downtime as well as rebooting.</li>
</ul>



<h2 class="wp-block-heading" id="h-comparison-which-model-release-strategy-to-use">Comparison: which model release strategy to use?</h2>



<p>There are various metrics one can use to determine which strategy suits a project best, but the choice mostly depends on project complexity and resource availability. The following comparison table gives some idea of when to use which strategy.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_9" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_9.png?resize=870%2C672&#038;ssl=1" alt="Model Release (Deployment) Strategies" class="wp-image-63165" width="870" height="672"/></a><figcaption class="wp-element-caption"><em>Model release (deployment) strategies | <a href="https://cloud.google.com/architecture/application-deployment-and-testing-strategies" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-key-takeaways">Key takeaways</h2>



<p>Deployment strategies help data scientists figure out how their model performs in a given situation. A good strategy depends on the type of product and the users it targets. To sum up, here are the points to keep in mind:</p>



<ul class="wp-block-list">
<li>If you want the model to be tested on real-world data, then shadow evaluation or something similar should be considered. Unlike strategies that expose a sample of users, shadow evaluation uses live, real user requests without affecting them.</li>



<li>Check the complexity of the task, if the model requires simple or minor tweaks then A/B testing is the way to go.&nbsp;</li>



<li>If time is constrained and there are many ideas to test, opt for multi-armed bandits, since they give the best results in such a situation.</li>



<li>If your model is complex and needs proper monitoring before deployment, the blue-green strategy will help you analyse and monitor it.</li>



<li>If you want no downtime and are okay with exposing your model to the public, opt for canary deployment.</li>



<li>Rolling deployment should be used when you want to deploy the new version of the model gradually.</li>
</ul>



<p>Hope you enjoyed reading this article. If you want to read more about this topic, refer to the attached resources. Keep learning!</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://www.analyticsvidhya.com/blog/2020/10/ab-testing-data-science/" target="_blank" rel="noreferrer noopener nofollow">A/B Testing for Data Science using Python – A Must-Read Guide for Data Scientists</a></li>



<li><a href="https://christophergs.com/machine%20learning/2019/03/30/deploying-machine-learning-applications-in-shadow-mode/" target="_blank" rel="noreferrer noopener nofollow">Deploying Machine Learning Models in Shadow Mode</a></li>



<li><a href="https://towardsdatascience.com/the-machine-learning-lifecycle-in-2021-473717c633bc" target="_blank" rel="noreferrer noopener nofollow">The Machine Learning Lifecycle in 2021</a></li>



<li><a href="https://www.analyticsvidhya.com/blog/2021/05/machine-learning-life-cycle-explained/" target="_blank" rel="noreferrer noopener nofollow">Machine Learning Life-cycle Explained!</a></li>



<li><a href="https://towardsdatascience.com/automatic-canary-releases-for-machine-learning-models-38874a756f87" target="_blank" rel="noreferrer noopener nofollow">Automatic Canary Releases for Machine Learning Models</a></li>



<li><a href="https://harness.io/blog/blue-green-canary-deployment-strategies/" target="_blank" rel="noreferrer noopener nofollow">Intro To Deployment Strategies: Blue-Green, Canary, And More</a></li>



<li><a href="https://www.alessandroai.com/strategies-to-deploy-your-machine-learning-models/" target="_blank" rel="noreferrer noopener nofollow">Strategies To Deploy Your Machine Learning Models</a></li>



<li><a href="https://www.data4v.com/machine-learning-deployment-strategies/" target="_blank" rel="noreferrer noopener nofollow">Machine Learning Deployment Strategies</a></li>



<li><a href="https://www.opsmx.com/blog/blue-green-deployment/" target="_blank" rel="noreferrer noopener nofollow">What is Blue Green Deployment ?</a></li>



<li><a href="https://towardsdatascience.com/safely-rolling-out-ml-models-to-production-13e0b8211a2f" target="_blank" rel="noreferrer noopener nofollow">Safely Rolling Out ML Models To Production</a></li>



<li><a href="https://vwo.com/blog/multi-armed-bandit-algorithm/" target="_blank" rel="noreferrer noopener nofollow">Minimize Your A/B Test Losses Due to Low-Performing Variations</a></li>



<li><a href="https://mlinproduction.com/deploying-machine-learning-models/" target="_blank" rel="noreferrer noopener nofollow">The Ultimate Guide to Deploying Machine Learning Models</a></li>



<li><a href="https://alexgude.com/blog/machine-learning-deployment-shadow-mode/" target="_blank" rel="noreferrer noopener nofollow">Machine Learning Deployment: Shadow Mode</a></li>



<li><a href="https://www.optimizely.com/optimization-glossary/multi-armed-bandit/" target="_blank" rel="noreferrer noopener nofollow">Multi-armed bandit</a></li>



<li><a href="https://splitmetrics.com/blog/sequential-ab-testing-vs-multi-armed-bandit/" target="_blank" rel="noreferrer noopener nofollow">Sequential A/B Testing vs Multi-Armed Bandit Testing</a></li>



<li><a href="https://semaphoreci.com/blog/blue-green-deployment" target="_blank" rel="noreferrer noopener nofollow">What Is Blue-Green Deployment?</a></li>



<li><a href="https://www.split.io/blog/canary-release-feature-flags/" target="_blank" rel="noreferrer noopener nofollow">Pros and Cons of Canary Release and Feature Flags in Continuous Delivery</a></li>



<li><a href="https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/rolling-deployments.html" target="_blank" rel="noreferrer noopener nofollow">Rolling Deployments</a></li>



<li><a href="https://www.cloudbees.com/blog/rolling-deployment" target="_blank" rel="noreferrer noopener nofollow">Rolling Deployment: What This Is and How it De-Risks Software Deploys</a></li>



<li><a href="https://www.linkedin.com/pulse/shadow-deployments-machine-learning-models-aws-carlos-lara" target="_blank" rel="noreferrer noopener nofollow">Shadow Deployments of Machine Learning Models in AWS</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/deploy-shadow-ml-models-in-amazon-sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Deploy shadow ML models in Amazon SageMaker</a></li>



<li><a href="https://mercari.github.io/ml-system-design-pattern/QA-patterns/Shadow-ab-test-pattern/design_en.html" target="_blank" rel="noreferrer noopener nofollow">Shadow AB test pattern</a></li>



<li><a href="https://www.arrikto.com/mlops-explained/" target="_blank" rel="noreferrer noopener nofollow">MLOps Explained</a></li>



<li><a href="/blog/mlops" target="_blank" rel="noreferrer noopener">MLOps: What It Is, Why It Matters, and How to Implement It</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/dynamic-a-b-testing-for-machine-learning-models-with-amazon-sagemaker-mlops-projects/" target="_blank" rel="noreferrer noopener nofollow">Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects</a></li>



<li><a href="https://abhishek-maheshwarappa.medium.com/multi-arm-bandits-for-recommendations-and-a-b-testing-on-amazon-ratings-data-set-9f802f2c4073" target="_blank" rel="noreferrer noopener nofollow">Multi-Arm Bandits for recommendations and A/B testing on Amazon ratings data set</a></li>



<li><a href="https://www.blazemeter.com/shiftleft/automate-canary-testing-continuous-quality" target="_blank" rel="noreferrer noopener nofollow">Automate Canary Testing for Continuous Quality</a></li>



<li><a href="https://martinfowler.com/bliki/CanaryRelease.html" target="_blank" rel="noreferrer noopener nofollow">CanaryRelease</a></li>



<li><a href="https://www.cloudbolt.io/blog/what-is-best-kubernetes-deployment-strategy/" target="_blank" rel="noreferrer noopener nofollow">What Is the Best Kubernetes Deployment Strategy?</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6644</post-id>	</item>
		<item>
		<title>Pix2pix: Key Model Architecture Decisions</title>
		<link>https://neptune.ai/blog/pix2pix-key-model-architecture-decisions</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:37:34 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/pix2pix-key-model-architecture-decisions/</guid>

					<description><![CDATA[Generative Adversarial Networks or GANs is a type of neural network that belongs to the class of unsupervised learning models. It is used for the task of deep generative modeling.&#160; In deep generative modeling, deep neural networks learn a probability distribution over a given set of data points and generate similar ones. Since it is&#8230;]]></description>
										<content:encoded><![CDATA[
<p><a href="/blog/generative-adversarial-networks-gan-applications" target="_blank" rel="noreferrer noopener">Generative Adversarial Networks (GANs)</a> are a class of neural networks that belong to the family of unsupervised learning models. They are used for deep generative modeling.</p>



<p>In deep generative modeling, deep neural networks learn a probability distribution over a given set of data points and generate similar ones. Since it is an unsupervised learning task, it uses no labels during the learning process.&nbsp;</p>



<p>Since their introduction in 2014, the deep learning community has been actively developing new GANs to advance the field of generative modeling. This article provides an overview of GANs, focusing on Pix2Pix, one of the most widely used generative models.</p>



<h2 class="wp-block-heading" id="h-what-is-gan">What is GAN?</h2>



<p>The GAN was designed by Ian Goodfellow in 2014. Its main goal was to generate samples that were not blurry and had rich feature representations. Discriminative models were doing well on this front, as they could distinguish between different classes. Deep generative models, on the other hand, were far less effective due to the difficulty of approximating many intractable probabilistic computations, a problem quite evident in autoencoders.</p>



<p>Autoencoders and their variants are explicit likelihood models, meaning they explicitly compute the probability density function over a given distribution. GANs and their variants are implicit likelihood models, which means they don’t compute the probability density function but rather learn the underlying distribution.&nbsp;</p>



<p>GANs learn the underlying distribution by framing the whole problem as binary classification. In this approach, the problem is represented by two models: a generator and a discriminator. The generator&#8217;s job is to generate new samples, and the discriminator&#8217;s job is to classify whether a sample produced by the generator is real or fake.</p>



<p>The two models are trained together in a zero-sum game until the generator can produce samples that are similar to the real samples. In other words, they are trained until the generator can fool the discriminator.&nbsp;</p>



<h3 class="wp-block-heading" id="h-architecture-of-a-vanilla-gan">Architecture of a vanilla GAN</h3>



<p>Let’s briefly walk through the architecture of a GAN. From this section onward, most topics will be explained using code, so to begin, let’s install and import all the required dependencies:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install torch torchvision matplotlib opencv-python numpy</pre></code></pre>
</div>





<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> nn
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.optim <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> optim
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torchvision
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torchvision.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> datasets
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> DataLoader
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torchvision.transforms <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> transforms</pre></code></pre>
</div>




<h4 class="wp-block-heading">Generator</h4>



<p>The generator is the component of a GAN that takes in noise, typically sampled from a Gaussian distribution, and yields samples similar to the original dataset. As GANs have evolved over the years, they have adopted CNNs, which are prominent in computer vision tasks. But for simplicity, we will define the generator with just linear layers using PyTorch.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Generator</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, z_dim, img_dim)</span>:</span>
       super().__init__()
       self.gen = nn.Sequential(
           nn.Linear(z_dim, <span class="hljs-number" style="color: teal;">256</span>),
           nn.LeakyReLU(<span class="hljs-number" style="color: teal;">0.01</span>),
           nn.Linear(<span class="hljs-number" style="color: teal;">256</span>, img_dim),
           nn.Tanh(),  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># normalize inputs to [-1, 1] to make outputs [-1, 1]</span>
       )

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.gen(x)</pre></code></pre>
</div>




<h4 class="wp-block-heading">Discriminator</h4>



<p>The discriminator is simply a classifier that determines whether the data yielded by the generator is real or fake. It does this by learning the original distribution from the real data and evaluating generated samples against it. We will keep things simple and define the discriminator using linear layers as well.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Discriminator</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, in_features)</span>:</span>
       super().__init__()
       self.disc = nn.Sequential(
           nn.Linear(in_features, <span class="hljs-number" style="color: teal;">128</span>),
           nn.LeakyReLU(<span class="hljs-number" style="color: teal;">0.01</span>),
           nn.Linear(<span class="hljs-number" style="color: teal;">128</span>, <span class="hljs-number" style="color: teal;">1</span>),
           nn.Sigmoid(),
       )

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.disc(x)</pre></code></pre>
</div>




<p>The key difference between the generator and the discriminator is the last layer. The former yields an output with the same shape as the image, while the latter yields a single value between 0 and 1, representing the probability that the input is real.</p>
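To make this difference concrete, we can pass a batch through both networks and inspect the output shapes. This quick sanity check (not part of the original article) rebuilds the same layer stacks inline:

```python
import torch
import torch.nn as nn

z_dim, img_dim = 64, 784  # same sizes used later in the article (28 * 28 * 1 = 784)

# Same layer stacks as the Generator and Discriminator classes above
gen = nn.Sequential(nn.Linear(z_dim, 256), nn.LeakyReLU(0.01),
                    nn.Linear(256, img_dim), nn.Tanh())
disc = nn.Sequential(nn.Linear(img_dim, 128), nn.LeakyReLU(0.01),
                     nn.Linear(128, 1), nn.Sigmoid())

noise = torch.randn(8, z_dim)  # a batch of 8 noise vectors
fake = gen(noise)
print(fake.shape)        # torch.Size([8, 784]) -- image-shaped output in [-1, 1]
print(disc(fake).shape)  # torch.Size([8, 1])   -- one probability per sample
```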



<h4 class="wp-block-heading">Loss function and training</h4>



<p>The loss function is one of the most important components of any deep learning algorithm. For instance, if we design a CNN to minimize the Euclidean distance between the ground truth and the predicted results, it will tend to produce blurry outputs. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring.</p>
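We can see this averaging effect with a tiny calculation: if two pixel values are equally plausible (say, a sharp edge that is either dark or bright), the single prediction that minimizes the expected squared error is their mean, i.e. a blurry in-between value. A plain-Python sketch:

```python
# Two equally plausible "ground truth" pixel values (a sharp edge: dark or bright)
targets = [0.0, 1.0]

def expected_l2(pred):
    """Expected squared (Euclidean) error of one prediction against both targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Scan candidate predictions on a grid; the minimizer is the mean of the targets
best = min((p / 100 for p in range(101)), key=expected_l2)
print(best)  # 0.5 -- the blurry average, not either sharp value
```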


    <a
        href="/blog/gan-loss-functions"
        id="cta-box-related-link-block_bd0bfc2c338431b26859efcdba632331"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-understanding-gan-loss-functions">Understanding GAN Loss Functions</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>The above point is an important one to keep in mind. That said, the loss function we will use for the vanilla GAN is binary cross-entropy loss (BCELoss), because we are performing binary classification.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">criterion = nn.BCELoss()</pre></code></pre>
</div>
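Under the hood, binary cross-entropy for a prediction p and label y is −[y·ln p + (1 − y)·ln(1 − p)]. A quick plain-Python check (independent of PyTorch) shows how it rewards confident correct predictions and heavily penalizes confident wrong ones:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(0.9, 1), 4))  # 0.1054 -- confident and correct: small loss
print(round(bce(0.1, 1), 4))  # 2.3026 -- confident and wrong: large loss
```

`nn.BCELoss` computes the same quantity, averaged over the batch by default.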




<p>Now let’s define the optimization method and other related parameters:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define hyperparameters</span>
device = <span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> torch.cuda.is_available() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>
lr = <span class="hljs-number" style="color: teal;">3e-4</span>
z_dim = <span class="hljs-number" style="color: teal;">64</span>
image_dim = <span class="hljs-number" style="color: teal;">28</span> * <span class="hljs-number" style="color: teal;">28</span> * <span class="hljs-number" style="color: teal;">1</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 784</span>
batch_size = <span class="hljs-number" style="color: teal;">32</span>
num_epochs = <span class="hljs-number" style="color: teal;">100</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Initialize Generator and Discriminator</span>
gen = Generator(z_dim, image_dim).to(device)
disc = Discriminator(image_dim).to(device)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Set up optimizers</span>
opt_disc = optim.Adam(disc.parameters(), lr=lr)
opt_gen = optim.Adam(gen.parameters(), lr=lr)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Prepare dataset and dataloader</span>
transform = transforms.Compose([
   transforms.ToTensor(),
   transforms.Normalize((<span class="hljs-number" style="color: teal;">0.5</span>,), (<span class="hljs-number" style="color: teal;">0.5</span>,))
])
dataset = datasets.MNIST(root=<span class="hljs-string" style="color: rgb(221, 17, 68);">"dataset/"</span>, transform=transform, download=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Set up TensorBoard writers</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.tensorboard <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SummaryWriter
writer_fake = SummaryWriter(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"logs/fake"</span>)
writer_real = SummaryWriter(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"logs/real"</span>)
step = <span class="hljs-number" style="color: teal;">0</span></pre></code></pre>
</div>




<p>Let’s understand the training loop. The training loop of GAN starts with:</p>



<ol class="wp-block-list">
<li>Sampling noise from a Gaussian distribution and passing it through the generator to produce fake samples</li>



<li>Training the discriminator using real data and fake data produced by the generator</li>



<li>Updating the discriminator</li>



<li>Updating the generator</li>
</ol>



<p>Here’s what the training loop looks like:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> epoch <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(num_epochs):
  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Loop over each batch of data</span>
  <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> batch_idx, (real, _) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> enumerate(loader):
      real = real.view(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">784</span>).to(device)
      batch_size = real.shape[<span class="hljs-number" style="color: teal;">0</span>]
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">### Train Discriminator: max log(D(x)) + log(1 - D(G(z)))</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Generate random noise as input to the generator</span>
      noise = torch.randn(batch_size, z_dim).to(device)
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Produce fake images from the generator using the random noise</span>
      fake = gen(noise)
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Get discriminator's predictions on real images</span>
      disc_real = disc(real).view(<span class="hljs-number" style="color: teal;">-1</span>)
      lossD_real = criterion(disc_real, torch.ones_like(disc_real))
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Get discriminator's predictions on fake images</span>
      disc_fake = disc(fake).view(<span class="hljs-number" style="color: teal;">-1</span>)
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Calculate discriminator's loss on fake images</span>
      lossD_fake = criterion(disc_fake, torch.zeros_like(disc_fake))
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Combine real and fake loss to get total discriminator loss</span>
      lossD = (lossD_real + lossD_fake) / <span class="hljs-number" style="color: teal;">2</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Backpropagation and optimization step for discriminator</span>
      disc.zero_grad()
      lossD.backward(retain_graph=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
      opt_disc.step()
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">### Train Generator: min log(1 - D(G(z))) &lt;-&gt; max log(D(G(z))</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># where the second option of maximizing doesn't suffer from</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># saturating gradients</span>
      output = disc(fake).view(<span class="hljs-number" style="color: teal;">-1</span>)
      lossG = criterion(output, torch.ones_like(output))
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Backpropagation and optimization step for generator</span>
      gen.zero_grad()
      lossG.backward()
      opt_gen.step()
      <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> batch_idx == <span class="hljs-number" style="color: teal;">0</span>:
          print(
              f<span class="hljs-string" style="color: rgb(221, 17, 68);">"""Epoch [{epoch}/{num_epochs}] Batch {batch_idx}/{len(loader)}
                    Loss D: {lossD:.4f}, loss G: {lossG:.4f}"""</span>
          )
          <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> torch.no_grad():
              <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define fixed_noise once (lazily) so the logged samples stay comparable across epochs</span>
              <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">'fixed_noise'</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> locals():
                  fixed_noise = torch.randn(<span class="hljs-number" style="color: teal;">64</span>, z_dim, device=device)
              fake = gen(fixed_noise).reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">28</span>, <span class="hljs-number" style="color: teal;">28</span>)
              data = real.reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">28</span>, <span class="hljs-number" style="color: teal;">28</span>)
              img_grid_fake = torchvision.utils.make_grid(fake, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
              img_grid_real = torchvision.utils.make_grid(data, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
              writer_fake.add_image(
                  <span class="hljs-string" style="color: rgb(221, 17, 68);">"Mnist Fake Images"</span>, img_grid_fake, global_step=step
              )
              writer_real.add_image(
                  <span class="hljs-string" style="color: rgb(221, 17, 68);">"Mnist Real Images"</span>, img_grid_real, global_step=step
              )
              step += <span class="hljs-number" style="color: teal;">1</span></pre></code></pre>
</div>




<p>There are some important considerations from the loop above:</p>



<ol class="wp-block-list">
<li>The loss function for the discriminator is calculated twice: one for real images and another for fake images.
<ul class="wp-block-list">
<li>For real images, the ground truth is set to ones using the <span class="c-code-snippet">torch.ones_like</span> function, which returns a matrix of ones of a defined shape.</li>



<li>For fake images, the ground truth is set to zeros using the <span class="c-code-snippet">torch.zeros_like</span> function, which returns a matrix of zeros of a defined shape.</li>
</ul>
</li>



<li>The loss function for the generator is calculated only once. If you observe carefully, it is the same loss function that is used by the discriminator to calculate the loss for fake images. The only difference is that instead of using the <span class="c-code-snippet">torch.zeros_like</span> function, <span class="c-code-snippet">torch.ones_like</span> function is used. The interchanging of labels from 0 to 1 enables the generator to learn representations that will produce real images, therefore fooling the discriminator.&nbsp;</li>
</ol>
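The effect of this label flip is visible directly in the loss values: for a discriminator score p = D(G(z)) on a fake image, the discriminator's fake-image loss −ln(1 − p) grows as p → 1, while the generator's flipped loss −ln(p) shrinks, so gradient descent on the generator pushes p toward 1, i.e. toward fooling the discriminator. A plain-Python illustration:

```python
import math

for p in (0.1, 0.5, 0.9):  # discriminator's score for a fake image, D(G(z))
    disc_fake_loss = -math.log(1 - p)  # target 0 (zeros_like): low when D spots the fake
    gen_loss = -math.log(p)            # target 1 (ones_like): low when D is fooled
    print(f"D(G(z))={p}: disc fake loss={disc_fake_loss:.3f}, gen loss={gen_loss:.3f}")
```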



<p>Mathematically, we can define the whole process as:</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_2.png?ssl=1" alt="This equation represents the objective function of a Generative Adversarial Network (GAN). In a GAN, the generator G aims to produce realistic data to fool the discriminator D, while the discriminator tries to distinguish between real data x and generated data G(z), where z is random noise. The discriminator D maximizes its ability to classify real vs. generated samples, while the generator G minimizes the discriminator's success, creating a minimax game. The function V(D, G) is optimized by minimizing over G and maximizing over D." class="wp-image-57303" style="width:866px;height:92px"/><figcaption class="wp-element-caption">This equation represents the objective function of a Generative Adversarial Network (GAN). In a GAN, the generator G aims to produce realistic data to fool the discriminator D, while the discriminator tries to distinguish between real data x and generated data G(z), where z is random noise. The discriminator D maximizes its ability to classify real vs. generated samples, while the generator G minimizes the discriminator’s success, creating a minimax game. The function V(D, G) is optimized by minimizing over G and maximizing over D.</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-application-of-gans">Application of GANs</h3>



<p>GANs are widely used for:</p>



<ul class="wp-block-list">
<li><strong>Generating training samples:</strong> GANs are often used to generate samples for specific tasks, such as classifying malignant and benign cancer cells, especially where data is too scarce to train a classifier.&nbsp;</li>



<li><strong>AI Art or Generative Art:</strong> AI or generative art is another new domain where GANs are extensively used. Since the introduction of non-fungible tokens, artists all over the world have been creating art in unorthodox, digital, and generative forms. Tools like DeepDaze, BigSleep, BigGAN, and VQGAN (often guided by CLIP) are among the most commonly used by creators.&nbsp;</li>
</ul>



<div id="separator-block_af6a6e357361905e62153786a946eed4"
         class="block-separator block-separator--0">
</div>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_4.jpg?ssl=1" alt="AI Art or Generative Art: This figure has been generated with AI" class="wp-image-57301"/><figcaption class="wp-element-caption"><em>AI Art or Generative Art: This figure has been generated with AI | Source: Author</em></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Image-to-image translation:</strong> The idea here is to translate a certain type of image into an image in the target domain. For example, a daylight image into a night image, or a winter image into a summer image (see the image below). GANs like Pix2Pix, CycleGAN, and StyleGAN are a few of the most popular choices among digital creators.</li>
</ul>



<div id="separator-block_af6a6e357361905e62153786a946eed4"
         class="block-separator block-separator--0">
</div>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_10.png?ssl=1" alt="Image-to-image translation example. To the left, the original image (a car in a road during winter). To the right, the same image in the summer, generated by AI" class="wp-image-57295" style="aspect-ratio:2.3529411764705883;width:834px;height:auto"/><figcaption class="wp-element-caption">Image-to-image translation example. To the left, the original image (a car in a road during winter). To the right, the same image in the summer, generated by AI | <a href="https://research.nvidia.com/publication/2017-12_Unsupervised-Image-to-Image-Translation" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Text-to-image translation: </strong>Text-to-image translation is simply converting a text prompt or a given string into an image. This is a very popular domain at the moment with a growing community. As mentioned previously, tools such as DeepDaze, BigSleep, and OpenAI&#8217;s DALL·E are the most common choices for this.</li>
</ul>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_6.png?ssl=1" alt="Text-to-image translation. The prompt “an armchair in the shape of an avocado” turns into images of the desired chair in different angles. " class="wp-image-57299" style="width:810px;height:auto"/><figcaption class="wp-element-caption">Text-to-image translation. The prompt “an armchair in the shape of an avocado” turns into images of the desired chair in different angles. | <a href="https://openai.com/blog/dall-e/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-issues-with-gans">Issues with GANs</h3>



<p>Although GANs can turn samples from a random Gaussian distribution into images that resemble real ones, the process is imperfect most of the time. Here&#8217;s why:</p>



<ul class="wp-block-list">
<li>Mode collapse: this occurs when the generator manages to fool the discriminator while producing samples from only a small subset of the data distribution. Because of mode collapse, the GAN cannot learn a wide variety of modes and remains limited to a few.&nbsp;</li>



<li>Diminished gradient: a diminished or vanishing gradient occurs when the derivatives flowing back through the network are so small that the updates to the weights become almost negligible. To overcome this issue, Wasserstein GANs (WGANs for short) are recommended.&nbsp;</li>



<li>Non-convergence: this occurs when the network is unable to settle into a stable equilibrium. It results from unstable training, and it can be tackled with <a href="https://medium.com/perceptronai/review-spectral-normalization-for-gans-fa97cd2363c4" target="_blank" rel="noreferrer noopener nofollow">spectral normalization</a>.</li>
</ul>


    <a
        href="/blog/vanishing-and-exploding-gradients-debugging-monitoring-fixing"
        id="cta-box-related-link-block_75a418d4ca6ad5c73f9d20fd96274a28"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-vanishing-and-exploding-gradients-in-neural-network-models-debugging-monitoring-and-fixing">                Vanishing and Exploding Gradients in Neural Network Models: Debugging, Monitoring, and Fixing            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-variations-of-gan">Variations of GAN</h3>



<p>Since the release of the first GAN, many variants have been proposed. Below are some of the most popular GAN variants and related generative models:</p>



<ul class="wp-block-list">
<li>CycleGAN</li>



<li>StyleGAN</li>



<li>PixelRNN</li>



<li>Text2image</li>



<li>DiscoGAN</li>



<li>IsGAN</li>
</ul>


    <a
        href="/blog/6-gan-architectures"
        id="cta-box-related-link-block_e9e7b3dff303bdc43fa000a25cfba395"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-gan-architectures-you-really-should-know">                GAN Architectures You Really Should Know            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>This article focuses solely on the Pix2Pix GAN. In the following sections, we will go through its key components, such as the architecture and the loss function.&nbsp;</p>



<h2 class="wp-block-heading" id="h-what-is-the-pix2pix-gan">What is the Pix2Pix GAN?</h2>



<p>Pix2Pix GAN is a conditional GAN (<a href="https://golden.com/wiki/Conditional_generative_adversarial_network_(cGAN)" target="_blank" rel="noreferrer noopener nofollow">cGAN</a>) that was developed by <a href="http://web.mit.edu/phillipi/" target="_blank" rel="noreferrer noopener nofollow">Phillip Isola</a>, et al. Unlike a vanilla GAN, which uses only real data and noise to learn and generate images, a cGAN uses real data and noise as well as labels to generate images.&nbsp;</p>



<p>In essence, the generator learns the mapping from the real data as well as the noise.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_16.png?ssl=1" alt="What Is the Pix2Pix GAN?" class="wp-image-57289" style="aspect-ratio:4.42;width:257px;height:auto"/></figure>
</div>


<p>The generator G combines the learnt real data x and the random noise z to output y, which is the fake data.&nbsp;</p>



<p>Similarly, the discriminator not only learns from the “real data” example it has seen, but also from the labels that help it understand what is real and what is fake.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_1.png?ssl=1" alt="What Is the Pix2Pix GAN?" class="wp-image-57304" style="aspect-ratio:2.1666666666666665;width:129px;height:auto"/></figure>
</div>


<p>The discriminator, then, uses two sources of information to improve its ability to tell real from fake: x (the real data) and y (the label saying &#8220;real&#8221; or &#8220;fake&#8221;).</p>



<p>This setting makes cGAN suitable for image-to-image translation tasks, where the generator is conditioned on an input image to generate the corresponding output image. In other words, the generator uses a conditioning distribution (or data) as a guide or blueprint to generate a target image (see the image below).&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_21.png?ssl=1" alt="The model generates realistic building facades (right column) based on input segmentation maps (left column), with comparisons to the actual ground truth images (center column)" class="wp-image-57284" style="width:836px;height:auto"/><figcaption class="wp-element-caption"><em>The model generates realistic building facades (right column) based on input segmentation maps (left column), with comparisons to the actual ground truth images (center column) | Source: Author</em></figcaption></figure>
</div>

<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_20.png?ssl=1" alt="Applications of Pix2Pix, a type of conditional GANs
" class="wp-image-57285" style="width:836px;height:auto"/><figcaption class="wp-element-caption"><em>Applications of Pix2Pix, a type of conditional GANs | <a href="https://phillipi.github.io/pix2pix/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The core idea of Pix2Pix relies on the dataset provided for training: it performs paired image-to-image translation, with training examples {x, y} that have a direct correspondence between them.&nbsp;</p>
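<p>As an illustrative sketch of this conditioning (the class name and layer sizes are our own, not the exact Pix2Pix discriminator), the condition image and the candidate image can be concatenated along the channel axis before being scored:</p>

```python
import torch
import torch.nn as nn

# Minimal illustration of conditioning: the condition image x and the
# candidate image y are concatenated along the channel axis, so the
# discriminator judges "real or fake" with respect to the condition.
class CondDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels * 2, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, condition, image):
        return self.net(torch.cat([condition, image], dim=1))
```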



<h2 class="wp-block-heading" id="h-pix2pix-network-architectures">Pix2Pix network architectures</h2>



<p>Pix2Pix has two important architectures, one for the generator and the other for the discriminator: U-Net and PatchGAN, respectively. Let&#8217;s explore both of them in more detail.&nbsp;</p>



<h3 class="wp-block-heading" id="h-u-net-generator">U-Net generator&nbsp;</h3>



<p>As mentioned before, the generator architecture used in Pix2Pix is called U-Net. U-Net was originally developed for biomedical image segmentation by Ronneberger et al. in 2015.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_8.png?ssl=1" alt="U-Net generator:  A symmetric encoder-decoder structure with down-sampling through max pooling (red arrows) and up-sampling via transposed convolutions (green arrows). Skip connections (gray arrows) connect layers of matching spatial dimensions in the encoder and decoder, preserving spatial information for segmentation in the output map. " class="wp-image-57297" style="aspect-ratio:1.5107913669064748;width:840px;height:auto"/><figcaption class="wp-element-caption"><em>U-Net generator:&nbsp; A symmetric encoder-decoder structure with down-sampling through max pooling (red arrows) and up-sampling via transposed convolutions (green arrows). Skip connections (gray arrows) connect layers of matching spatial dimensions in the encoder and decoder, preserving spatial information for segmentation in the output map. | <a href="https://arxiv.org/pdf/1505.04597.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>U-Net consists of two major parts:&nbsp;</p>



<ol class="wp-block-list">
<li>A contracting path made up of convolutional layers (left side) which downsamples the data while extracting information.&nbsp;</li>



<li>An expansive path made up of transpose convolution layers (right side) which upsamples the information.&nbsp;</li>
</ol>



<p>Let’s say our downsampling has three convolutional layers C_l(1,2,3), then we have to make sure that our upsampling has three transpose convolutional layers C_u(1,2,3). This is because we want to connect the corresponding blocks of the same sizes using a skip connection.&nbsp;</p>



<div id="separator-block_75de0d8bd0c4b6baa838bb6c64057f86"
         class="block-separator block-separator--5">
</div>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_3.png?ssl=1" alt="Skip connection architecture: This diagram illustrates the use of skip layers between encoder (C_l1, C_l2, C_l3) and decoder (C_u1, C_u2, C_u3) blocks, with a bottleneck in the center to keep the feature dimensions at each stage. This retains the spatial details across the network" class="wp-image-57302" style="width:572px;height:auto"/><figcaption class="wp-element-caption"><em>Skip connection architecture: This diagram illustrates the use of skip layers between encoder (C_l1, C_l2, C_l3) and decoder (C_u1, C_u2, C_u3) blocks, with a bottleneck in the center to keep the feature dimensions at each stage. This retains the spatial details across the network | Source: Author</em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Downsampling</h4>



<p>During downsampling, each convolutional block extracts spatial information and passes the information to the next convolutional block to extract more information until it reaches the middle part known as the bottleneck. Upsampling starts from the bottleneck.&nbsp;</p>



<h4 class="wp-block-heading">Upsampling</h4>



<p>During upsampling, each transpose convolutional block expands information from the previous block while concatenating the information from the corresponding downsampling block. By concatenating information, the network can then learn to assemble a more precise output based on this information.</p>



<p>This architecture can localize, i.e. it can find the object of interest pixel by pixel. Furthermore, U-Net also allows the network to propagate context information from lower resolution to higher resolution layers. This allows the network to generate high-resolution samples.&nbsp;</p>
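<p>The skip-connection mechanism can be sketched as a toy two-level network in PyTorch. The layer sizes here are our own, for illustration only; the real Pix2Pix generator is much deeper:</p>

```python
import torch
import torch.nn as nn

# Toy two-level U-Net illustrating skip connections: each decoder stage
# concatenates its upsampled features with the encoder features of
# matching spatial size before the next layer.
class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 16, 4, stride=2, padding=1)    # 256 -> 128
        self.down2 = nn.Conv2d(16, 32, 4, stride=2, padding=1)   # 128 -> 64 (bottleneck)
        self.up1 = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)  # 64 -> 128
        # input channels are doubled by the skip concatenation below
        self.up2 = nn.ConvTranspose2d(16 + 16, 3, 4, stride=2, padding=1)  # 128 -> 256

    def forward(self, x):
        d1 = torch.relu(self.down1(x))
        d2 = torch.relu(self.down2(d1))
        u1 = torch.relu(self.up1(d2))
        # skip connection: concatenate encoder features of matching size
        return self.up2(torch.cat([u1, d1], dim=1))
```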



<h3 class="wp-block-heading" id="h-markovian-discriminator-patchgan">Markovian discriminator (PatchGAN)</h3>



<p>The discriminator uses the PatchGAN architecture, which consists of several convolutional blocks. It takes an NxN patch of the image and tries to determine whether it is real or fake. N can be much smaller than the original image, and the network is still able to produce high-quality results. The discriminator is applied convolutionally across the whole image. Also, because the discriminator is smaller, i.e., it has fewer parameters than the generator, it is faster.&nbsp;</p>



<p>PatchGAN effectively models the image as a Markov random field in which each NxN patch is treated as independent. Therefore, PatchGAN can be understood as a form of texture/style loss.</p>
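<p>A minimal PatchGAN-style head might look like this (channel counts and depth are illustrative, not the published configuration). The key point is that the output is a grid of logits, one per receptive-field patch, instead of a single scalar:</p>

```python
import torch
import torch.nn as nn

# Sketch of a PatchGAN-style discriminator head: a stack of strided
# convolutions whose output is a grid of per-patch logits.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
)

x = torch.randn(1, 3, 256, 256)
print(patch_d(x).shape)  # torch.Size([1, 1, 63, 63])
```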



<h3 class="wp-block-heading" id="h-loss-function">Loss function</h3>



<p>The loss function is:&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_14.png?ssl=1" alt="Loss function" class="wp-image-57291" style="width:582px;height:123px"/></figure>
</div>


<p>The equation above has two components: one for the discriminator and the other for the generator. Let’s understand both of them one by one.&nbsp;</p>



<p>In any GAN, the discriminator is trained first in every iteration so that it can recognize both real and fake data. Essentially,&nbsp;</p>



<p>D(x,y) = 1 i.e. real and,&nbsp;</p>



<p>D(x,G(z)) = 0 i.e. fake.&nbsp;</p>



<p>It is worth noting that G(z) produces fake samples, so the discriminator&#8217;s output for them, D(x, G(z)), should be close to zero. In theory, the discriminator should always classify G(z) as fake. Therefore, the discriminator should maintain the maximum distance between real and fake, i.e., 1 and 0, in every iteration. In other words, the discriminator should maximize the loss function.&nbsp;</p>



<p>After the discriminator, the generator is trained. The generator, G(z), should learn to produce samples that are closer to the real samples. To learn the original distribution, it takes help from the discriminator: instead of targeting D(x, G(z)) = 0, we target D(x, G(z)) = 1.&nbsp;</p>



<p>With this change in labeling, the generator now optimizes its parameters against the discriminator&#8217;s judgment with ground-truth labels. This step ensures that the generator can yield samples that are close to real data, i.e., labeled 1.&nbsp;</p>



<p>The loss function is also mixed with an L1 loss so that the generator not only fools the discriminator but also produces images near the ground truth. In essence, the loss function has an additional L1 loss for the generator.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_15.png?ssl=1" alt="Loss function" class="wp-image-57290" style="width:480px;height:86px"/></figure>
</div>


<p>Therefore, the final loss function is:</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_9.png?ssl=1" alt="Loss function" class="wp-image-57296" style="width:567px;height:66px"/></figure>
</div>


<p>It is worth noting that the L1 loss can preserve low-frequency details in the image, but it will not be able to capture high-frequency details. Hence, it will still produce blurry images. To tackle this problem, PatchGAN is used.&nbsp;</p>
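<p>Putting the two terms together, the generator objective can be sketched as follows. The function name is ours; &#955; = 100 is the weighting reported in the Pix2Pix paper:</p>

```python
import torch
import torch.nn.functional as F

LAMBDA = 100  # L1 weighting used in the Pix2Pix paper

def pix2pix_generator_loss(fake_logits, fake_image, target_image, lam=LAMBDA):
    # adversarial term: labels flipped to 1 so the generator tries to
    # make the discriminator call its output real
    gan_term = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    # L1 term: keeps the output near the ground truth, preserving
    # low-frequency structure
    l1_term = F.l1_loss(fake_image, target_image)
    return gan_term + lam * l1_term
```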



<h3 class="wp-block-heading" id="h-optimization">Optimization&nbsp;</h3>



<p>The optimization and training process is similar to that of a vanilla GAN. However, training itself is difficult since the objective function of a GAN is closer to concave-concave than convex-concave. Because of this, it is hard to find a saddle point, and this is what makes training and optimizing GANs difficult.&nbsp;</p>



<p>As we saw previously, the generator is not trained directly but through the discriminator. This essentially limits the optimization of the generator. If the discriminator fails to capture the high-dimensional data distribution, the generator will almost certainly fail to produce good samples. On the other hand, if we can train the discriminator more optimally, we can be assured that the generator will be trained optimally as well.&nbsp;</p>



<p>In the early stages of training, G is untrained and too weak to produce good samples. This makes the discriminator very powerful, so instead of minimizing log(1 − D(G(z))), the generator is trained to maximize log D(G(z)). This provides some stability in the early stages of training.&nbsp;</p>



<p>Other ways to tackle the instability are:</p>



<ol class="wp-block-list">
<li>Using spectral normalization in every layer of the model</li>



<li>Using the Wasserstein loss, which compares the average critic scores for real and fake images.</li>
</ol>
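<p>Both remedies are available in PyTorch. As an illustrative sketch (the helper function name is ours):</p>

```python
import torch
import torch.nn as nn

# Spectral normalization wraps a layer so its weight is rescaled by its
# largest singular value on every forward pass, bounding the Lipschitz
# constant of the discriminator.
sn_conv = nn.utils.spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1))

# The Wasserstein critic loss compares average scores instead of
# classifying samples; the critic maximizes (real - fake), so we
# minimize the negation.
def wasserstein_critic_loss(real_scores, fake_scores):
    return fake_scores.mean() - real_scores.mean()
```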



<h2 class="wp-block-heading" id="h-hands-on-example-with-pix2pix">Hands-on example with Pix2Pix</h2>



<p>Let’s implement Pix2Pix with PyTorch to get an intuitive understanding of how the algorithm works and the various components behind it.&nbsp;</p>



<p>Let’s start by downloading the data using the following commands:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">!wget http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/facades.tar.gz
!tar -xvf facades.tar.gz</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-data-visualization">Data visualization</h3>



<p>Once the data is downloaded, we can visualize it to understand which steps are needed to format it for training.&nbsp;</p>



<p>We will import the following libraries for data visualization.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> cv2
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> os
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np

path = <span class="hljs-string" style="color: rgb(221, 17, 68);">"facades/train/"</span>
plt.imshow(cv2.imread(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"{path}91.jpg"</span>))</pre></code></pre>
</div>



<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_17.png?ssl=1" alt="Resulting output from the previous code" class="wp-image-57288" style="aspect-ratio:1.8727810650887573;width:833px;height:auto"/><figcaption class="wp-element-caption"><em>Resulting output from the previous code | Source: Author</em></figcaption></figure>
</div>


<p>From the image above, we can see that the data consists of two images attached together. If we then check the shape of the image, we find that the width is 512, which means the image can easily be split into two.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">print(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Shape of the image: '</span>,cv2.imread(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'{path}91.jpg'</span>).shape)</pre></code></pre>
</div>




<p>&gt;&gt; Shape of the image:&nbsp; (256, 512, 3)</p>



<p>To separate the images we will use the following commands:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Dividing the image by width</span>
image = cv2.imread(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'{path}91.jpg'</span>)
w = image.shape[<span class="hljs-number" style="color: teal;">1</span>]//<span class="hljs-number" style="color: teal;">2</span>
image_real = image[:, :w, :]
image_cond = image[:, w:, :]
fig, axes = plt.subplots(<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">2</span>, figsize=(<span class="hljs-number" style="color: teal;">18</span>,<span class="hljs-number" style="color: teal;">6</span>))
axes[<span class="hljs-number" style="color: teal;">0</span>].imshow(image_real)
axes[<span class="hljs-number" style="color: teal;">0</span>].set_title(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Real'</span>)
axes[<span class="hljs-number" style="color: teal;">1</span>].imshow(image_cond)
axes[<span class="hljs-number" style="color: teal;">1</span>].set_title(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Condition'</span>)
plt.show()</pre></code></pre>
</div>




<p>Output:</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_11.png?ssl=1" alt="Data visualization" class="wp-image-57294" style="width:810px;height:auto"/><figcaption class="wp-element-caption"><em>Resulting output | Source: Author</em></figcaption></figure>
</div>


<p>The image on the left will be our ground truth while the image on the right will be our conditional image. We will refer to them as y and x respectively (from left to right).&nbsp;</p>



<h3 class="wp-block-heading" id="h-creating-dataloader">Creating dataloader</h3>



<p>A dataloader is a utility that lets us format the data as PyTorch requires. This involves two steps:&nbsp;</p>



<p>1. Formatting the data: reading the data from the source, cropping each image, and converting the halves to PyTorch tensors.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> glob <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> glob
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dataset
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchvision <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> transforms


<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Data</span><span class="hljs-params">(Dataset)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, path=<span class="hljs-string" style="color: rgb(221, 17, 68);">"facades/train/"</span>)</span>:</span>
       self.filenames = glob(path + <span class="hljs-string" style="color: rgb(221, 17, 68);">"*.jpg"</span>)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__len__</span><span class="hljs-params">(self)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> len(self.filenames)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__getitem__</span><span class="hljs-params">(self, idx)</span>:</span>
       filename = self.filenames[idx]

       image = cv2.imread(filename)
       image_width = image.shape[<span class="hljs-number" style="color: teal;">1</span>]
       image_width = image_width // <span class="hljs-number" style="color: teal;">2</span>
       real = image[:, :image_width, :]
       condition = image[:, image_width:, :]

       real = transforms.functional.to_tensor(real)
       condition = transforms.functional.to_tensor(condition)

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> real, condition</pre></code></pre>
</div>




<p>2. Loading the data using PyTorch’s <span class="c-code-snippet">DataLoader</span> class to create batches before feeding them into the neural nets.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> DataLoader

train_dataset = Data()
train_loader = DataLoader(train_dataset, batch_size=<span class="hljs-number" style="color: teal;">4</span>, shuffle=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)

val_dataset = Data(path=<span class="hljs-string" style="color: rgb(221, 17, 68);">"facades/val/"</span>)
val_loader = DataLoader(val_dataset, batch_size=<span class="hljs-number" style="color: teal;">4</span>, shuffle=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre></code></pre>
</div>




<p>Keep in mind that we create two data loaders: one for training and one for validation.</p>
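<p>As a rough sanity check on the loaders, the number of batches per epoch follows directly from the dataset size. A minimal sketch in plain Python, assuming the standard facades training split of 400 paired images (worth verifying against your local copy):</p>

```python
import math

# Batches per epoch = ceil(num_images / batch_size).
# The 400-image figure assumes the standard facades training split.
num_train_images = 400
batch_size = 4

batches_per_epoch = math.ceil(num_train_images / batch_size)
print(batches_per_epoch)  # 100
```

<p>Note that <span class="c-code-snippet">shuffle=True</span> on the validation loader is harmless but unnecessary; <span class="c-code-snippet">shuffle=False</span> is the more common choice for evaluation.</p>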



<h3 class="wp-block-heading" id="h-utils">Utils</h3>



<p>In this section, we create the components used to build the Generator and the Discriminator: a convolutional block for downsampling and a transposed-convolution block for upsampling, referred to as <span class="c-code-snippet">cnn_block</span> and <span class="c-code-snippet">tcnn_block</span>, respectively.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> nn

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">cnn_block</span><span class="hljs-params">(
   in_channels, out_channels, kernel_size, stride = <span class="hljs-number" style="color: teal;">1</span>, padding = <span class="hljs-number" style="color: teal;">0</span>, first_layer = False
)</span>:</span>

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> first_layer:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.Conv2d(
           in_channels, out_channels, kernel_size, stride = stride, padding = padding
       )
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.Sequential(
           nn.Conv2d(
               in_channels, out_channels, kernel_size, stride = stride, padding = padding
           ),
           nn.BatchNorm2d(out_channels, momentum = <span class="hljs-number" style="color: teal;">0.1</span>, eps = <span class="hljs-number" style="color: teal;">1e-5</span>),
       )


<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">tcnn_block</span><span class="hljs-params">(
   in_channels,
   out_channels,
   kernel_size,
   stride = <span class="hljs-number" style="color: teal;">1</span>,
   padding = <span class="hljs-number" style="color: teal;">0</span>,
   output_padding = <span class="hljs-number" style="color: teal;">0</span>,
   first_layer = False,
)</span>:</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> first_layer:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.ConvTranspose2d(
           in_channels,
           out_channels,
           kernel_size,
           stride = stride,
           padding = padding,
           output_padding = output_padding,
       )

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.Sequential(
           nn.ConvTranspose2d(
               in_channels,
               out_channels,
               kernel_size,
               stride = stride,
               padding = padding,
               output_padding = output_padding,
           ),
           nn.BatchNorm2d(out_channels, momentum = <span class="hljs-number" style="color: teal;">0.1</span>, eps = <span class="hljs-number" style="color: teal;">1e-5</span>),
       )</pre></code></pre>
</div>
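<p>With a 4×4 kernel, stride 2, and padding 1, each <span class="c-code-snippet">cnn_block</span> halves the spatial resolution and each <span class="c-code-snippet">tcnn_block</span> doubles it. A quick back-of-the-envelope check in plain Python, using only the standard convolution output-size formulas:</p>

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # Convolution output size: floor((H + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    # Transposed convolution inverts the formula above
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 256
sizes = [size]
for _ in range(8):            # eight encoder blocks, each with k=4, s=2, p=1
    size = conv_out(size)
    sizes.append(size)

print(sizes)                  # [256, 128, 64, 32, 16, 8, 4, 2, 1]
assert tconv_out(128) == 256  # a mirrored decoder block restores the size
```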




<h3 class="wp-block-heading" id="h-defining-parameters">Defining parameters</h3>



<p>In this section, we define the parameters and hyperparameters that govern the training of the neural network.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define parameters</span>
batch_size = <span class="hljs-number" style="color: teal;">4</span>
workers = <span class="hljs-number" style="color: teal;">2</span>

epochs = <span class="hljs-number" style="color: teal;">30</span>

gf_dim = <span class="hljs-number" style="color: teal;">64</span>
df_dim = <span class="hljs-number" style="color: teal;">64</span>

L1_lambda = <span class="hljs-number" style="color: teal;">100.0</span>

in_w = in_h = <span class="hljs-number" style="color: teal;">256</span>
c_dim = <span class="hljs-number" style="color: teal;">3</span>

device = torch.device(<span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> torch.cuda.is_available() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>)</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-generator">Generator</h3>



<p>Now, let’s define the generator using the two components we created above.</p>







<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn.functional <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> F

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Generator</span><span class="hljs-params">(nn.Module)</span>:</span>
 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self,instance_norm=False)</span>:</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#input : 256x256</span>
   super(Generator,self).__init__()
   self.e1 = cnn_block(c_dim,gf_dim,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
   self.e2 = cnn_block(gf_dim,gf_dim*<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e3 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e4 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">4</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e5 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e6 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e7 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e8 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)

   self.d1 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d2 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d3 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d4 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d5 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d6 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">4</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d7 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">2</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d8 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">1</span>*<span class="hljs-number" style="color: teal;">2</span>,c_dim,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#256x256</span>
   self.tanh = nn.Tanh()

 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self,x)</span>:</span>
   e1 = self.e1(x)
   e2 = self.e2(F.leaky_relu(e1,<span class="hljs-number" style="color: teal;">0.2</span>))
   e3 = self.e3(F.leaky_relu(e2,<span class="hljs-number" style="color: teal;">0.2</span>))
   e4 = self.e4(F.leaky_relu(e3,<span class="hljs-number" style="color: teal;">0.2</span>))
   e5 = self.e5(F.leaky_relu(e4,<span class="hljs-number" style="color: teal;">0.2</span>))
   e6 = self.e6(F.leaky_relu(e5,<span class="hljs-number" style="color: teal;">0.2</span>))
   e7 = self.e7(F.leaky_relu(e6,<span class="hljs-number" style="color: teal;">0.2</span>))
   e8 = self.e8(F.leaky_relu(e7,<span class="hljs-number" style="color: teal;">0.2</span>))
   d1 = torch.cat([F.dropout(self.d1(F.relu(e8)),<span class="hljs-number" style="color: teal;">0.5</span>,training=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),e7],<span class="hljs-number" style="color: teal;">1</span>)
   d2 = torch.cat([F.dropout(self.d2(F.relu(d1)),<span class="hljs-number" style="color: teal;">0.5</span>,training=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),e6],<span class="hljs-number" style="color: teal;">1</span>)
   d3 = torch.cat([F.dropout(self.d3(F.relu(d2)),<span class="hljs-number" style="color: teal;">0.5</span>,training=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),e5],<span class="hljs-number" style="color: teal;">1</span>)
   d4 = torch.cat([self.d4(F.relu(d3)),e4],<span class="hljs-number" style="color: teal;">1</span>)
   d5 = torch.cat([self.d5(F.relu(d4)),e3],<span class="hljs-number" style="color: teal;">1</span>)
   d6 = torch.cat([self.d6(F.relu(d5)),e2],<span class="hljs-number" style="color: teal;">1</span>)
   d7 = torch.cat([self.d7(F.relu(d6)),e1],<span class="hljs-number" style="color: teal;">1</span>)
   d8 = self.d8(F.relu(d7))

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.tanh(d8)</pre></code></pre>
</div>
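<p>The <span class="c-code-snippet">torch.cat</span> calls in the forward pass are U-Net skip connections: each decoder activation is concatenated with the mirrored encoder activation along the channel dimension, which is why the in-channels of <span class="c-code-snippet">d2</span> through <span class="c-code-snippet">d8</span> carry an extra factor of 2. The channel bookkeeping can be verified in plain Python:</p>

```python
gf_dim = 64

# Encoder output channels for e1..e8, matching the definitions above
enc = [gf_dim, gf_dim * 2, gf_dim * 4, gf_dim * 8,
       gf_dim * 8, gf_dim * 8, gf_dim * 8, gf_dim * 8]

# Decoder output channels for d1..d7 (d8 maps back to c_dim = 3)
dec_out = [gf_dim * 8, gf_dim * 8, gf_dim * 8, gf_dim * 8,
           gf_dim * 4, gf_dim * 2, gf_dim]

# d1 is fed by e8 alone; every later block receives the previous decoder
# output concatenated with the mirrored encoder activation
dec_in = [enc[7]]
for i in range(1, 8):
    skip = enc[7 - i]                   # e7, e6, ..., e1
    dec_in.append(dec_out[i - 1] + skip)

print(dec_in)  # [512, 1024, 1024, 1024, 1024, 512, 256, 128]
```

<p>These values match the in-channel arguments passed to <span class="c-code-snippet">tcnn_block</span> in the constructor above.</p>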




<h3 class="wp-block-heading" id="h-discriminator">Discriminator</h3>



<p>Let’s define the discriminator using the downsampling function.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Discriminator</span><span class="hljs-params">(nn.Module)</span>:</span>
 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self,instance_norm=False)</span>:</span><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#input : 256x256</span>
   super(Discriminator,self).__init__()
   self.conv1 = cnn_block(c_dim*<span class="hljs-number" style="color: teal;">2</span>,df_dim,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>) <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 128x128</span>
   self.conv2 = cnn_block(df_dim,df_dim*<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 64x64</span>
   self.conv3 = cnn_block(df_dim*<span class="hljs-number" style="color: teal;">2</span>,df_dim*<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 32 x 32</span>
   self.conv4 = cnn_block(df_dim*<span class="hljs-number" style="color: teal;">4</span>,df_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">1</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 31 x 31</span>
   self.conv5 = cnn_block(df_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 30 x 30</span>

   self.sigmoid = nn.Sigmoid()
 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x, y)</span>:</span>
   O = torch.cat([x,y],dim=<span class="hljs-number" style="color: teal;">1</span>)
   O = F.leaky_relu(self.conv1(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = F.leaky_relu(self.conv2(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = F.leaky_relu(self.conv3(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = F.leaky_relu(self.conv4(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = self.conv5(O)

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.sigmoid(O)</pre></code></pre>
</div>
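<p>Unlike a discriminator that outputs a single real/fake score, this is a PatchGAN: its output is a 30×30 grid of scores, each classifying one patch of the concatenated input pair. The output size follows from the same convolution arithmetic as before, with three stride-2 blocks followed by two stride-1 blocks:</p>

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # Convolution output size: floor((H + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# conv1-conv3 use stride 2 (256 -> 128 -> 64 -> 32),
# conv4 and conv5 use stride 1 (32 -> 31 -> 30)
size = 256
for stride in (2, 2, 2, 1, 1):
    size = conv_out(size, stride=stride)

print(size)  # 30
```

<p>This matches the (batch, 1, 30, 30) shape of the <span class="c-code-snippet">real_class</span> and <span class="c-code-snippet">fake_class</span> label tensors used in the training loop.</p>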




<h3 class="wp-block-heading" id="h-initializing-the-models">Initializing the models</h3>



<p>Let’s initialize both models and move them to the GPU (if CUDA is available) for faster training.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">G = Generator().to(device)
D = Discriminator().to(device)</pre></code></pre>
</div>




<p>We will also define the optimizers and the loss function.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.optim <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> optim

G_optimizer = optim.Adam(G.parameters(), lr=<span class="hljs-number" style="color: teal;">2e-4</span>,betas=(<span class="hljs-number" style="color: teal;">0.5</span>,<span class="hljs-number" style="color: teal;">0.999</span>))
D_optimizer = optim.Adam(D.parameters(), lr=<span class="hljs-number" style="color: teal;">2e-4</span>,betas=(<span class="hljs-number" style="color: teal;">0.5</span>,<span class="hljs-number" style="color: teal;">0.999</span>))

bce_criterion = nn.BCELoss()
L1_criterion = nn.L1Loss()</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-training-and-monitoring-our-model">Training and monitoring our model</h3>



<p>Training the model is not the last step. You need to monitor the training run and analyze its performance so you can make changes where necessary. Monitoring a GAN can get taxing, with several losses, plots, and metrics to track, so we will use <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> at this step.</p>



<p>Neptune allows the user to:</p>



<ol class="wp-block-list">
<li>Monitor the live performance of the model</li>



<li>Monitor the performance of the hardware</li>



<li>Store and compare different metadata for different runs (like metrics, parameters, performance, data, etc.)</li>



<li>Share the work with others</li>
</ol>



<section
	id="i-box-block_4c939d22a63a86151b0bc30b744797e9"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Disclaimer</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Please note that this article references a <strong>deprecated version of Neptune</strong>.</p>



<p>For information on the latest version with improved features and functionality, please <a href="/" target="_blank" rel="noreferrer noopener">visit our website</a>.</p>


	</div>

</section>



<p>To get started, just follow these steps:</p>



<p>1. Install the Python <span class="c-code-snippet">neptune</span> library on your local system:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">!pip install neptune</pre></code></pre>
</div>




<p>2. Sign up at <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a>.</p>



<p>3. <a href="https://docs-legacy.neptune.ai/api/creating_and_deleting_projects/#creating-a-project" target="_blank" rel="noreferrer noopener">Create a project</a> for storing your metadata.</p>



<p>4. <a href="https://docs-legacy.neptune.ai/setup/setting_credentials/" target="_blank" rel="noreferrer noopener">Save your credentials as environment variables</a>.</p>



<p>For this project, we will log our parameters into the Neptune dashboard. For logging the parameters or any information into the dashboard, you can use a run object:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> neptune
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> os

run = neptune.init_run(
   project=os.getenv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"NEPTUNE_PROJECT_NAME"</span>),
   api_token=os.getenv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"NEPTUNE_API_TOKEN"</span>)
)</pre></code></pre>
</div>




<p>A run object establishes a connection between your environment and the project’s dashboard you’ve created for this tutorial. To log metadata, like the dictionary below, you can use the following syntax:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Logging parameter in Neptune</span>
PARAMS = {<span class="hljs-string" style="color: rgb(221, 17, 68);">'Epoch'</span>: epochs,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Batch Size'</span>: batch_size,
          <span class="hljs-string" style="color: rgb(221, 17, 68);">'Input Channels'</span>: c_dim,
          <span class="hljs-string" style="color: rgb(221, 17, 68);">'Workers'</span>: workers,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Optimizer'</span>: <span class="hljs-string" style="color: rgb(221, 17, 68);">'Adam'</span>,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Learning Rate'</span>: <span class="hljs-number" style="color: teal;">2e-4</span>,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Metrics'</span>: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'Binary Cross Entropy'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'L1 Loss'</span>],
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Activation'</span>: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'Leaky Relu'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'Tanh'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'Sigmoid'</span> ],
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Device'</span>: device}

run[<span class="hljs-string" style="color: rgb(221, 17, 68);">'parameters'</span>] = PARAMS</pre></code></pre>
</div>




<p>To log the loss, generated images, and the model’s weights, we will use the run object again but with different methods like <span class="c-code-snippet">append</span> or <span class="c-code-snippet">upload</span>. Here is our training loop putting together everything we have along with Neptune logging:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define missing variables</span>
epochs = <span class="hljs-number" style="color: teal;">30</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Adjust as needed</span>
L1_lambda = <span class="hljs-number" style="color: teal;">100</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Adjust as needed</span>
G_losses = []
D_losses = []
G_GAN_losses = []
G_L1_losses = []
img_list = []

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Fix one batch for visualizing progress; the dataset yields (real, condition)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># pairs, so unpack in the same (y, x) order as the training loop below</span>
fixed_y, fixed_x = next(iter(train_loader))
fixed_x = fixed_x.to(device)
fixed_y = fixed_y.to(device)

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> ep <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(epochs):
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i, data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> enumerate(train_loader):

        y, x = data
        x = x.to(device)
        y = y.to(device)

        b_size = x.shape[<span class="hljs-number" style="color: teal;">0</span>]

        real_class = torch.ones(b_size, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">30</span>, <span class="hljs-number" style="color: teal;">30</span>).to(device)
        fake_class = torch.zeros(b_size, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">30</span>, <span class="hljs-number" style="color: teal;">30</span>).to(device)

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train D</span>
        D.zero_grad()

        real_patch = D(y, x)
        real_gan_loss = bce_criterion(real_patch, real_class)

        fake = G(x)

        fake_patch = D(fake.detach(), x)
        fake_gan_loss = bce_criterion(fake_patch, fake_class)

        D_loss = real_gan_loss + fake_gan_loss
        D_loss.backward()
        D_optimizer.step()

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train G</span>
        G.zero_grad()
        fake_patch = D(fake, x)
        fake_gan_loss = bce_criterion(fake_patch, real_class)

        L1_loss = L1_criterion(fake, y)
        G_loss = fake_gan_loss + L1_lambda * L1_loss
        G_loss.backward()

        G_optimizer.step()

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Neptune logging</span>
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"Gen Loss"</span>].append(G_loss.item())
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"Dis Loss"</span>].append(D_loss.item())
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"L1 Loss"</span>].append(L1_loss.item())
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"Gen GAN Loss"</span>].append(fake_gan_loss.item())

        <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (i + <span class="hljs-number" style="color: teal;">1</span>) % <span class="hljs-number" style="color: teal;">5</span> == <span class="hljs-number" style="color: teal;">0</span>:
            print(
                <span class="hljs-string" style="color: rgb(221, 17, 68);">"Epoch [{}/{}], Step [{}/{}], d_loss: {:.4f}, g_loss: {:.4f},D(real): {:.2f}, D(fake):{:.2f},g_loss_gan:{:.4f},g_loss_L1:{:.4f}"</span>.format(
                    ep,
                    epochs,
                    i + <span class="hljs-number" style="color: teal;">1</span>,
                    len(train_loader),
                    D_loss.item(),
                    G_loss.item(),
                    real_patch.mean(),
                    fake_patch.mean(),
                    fake_gan_loss.item(),
                    L1_loss.item(),
                )
            )
            G_losses.append(G_loss.item())
            D_losses.append(D_loss.item())
            G_GAN_losses.append(fake_gan_loss.item())
            G_L1_losses.append(L1_loss.item())

            <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> torch.no_grad():
                G.eval()
                fake = G(fixed_x).detach().cpu()
                G.train()
            figs = plt.figure(figsize=(<span class="hljs-number" style="color: teal;">10</span>, <span class="hljs-number" style="color: teal;">10</span>))
            plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">1</span>)
            plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">"off"</span>)
            plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"conditional image (x)"</span>)
            plt.imshow(
                np.transpose(
                    torchvision.utils.make_grid(fixed_x.cpu(), nrow=<span class="hljs-number" style="color: teal;">1</span>, padding=<span class="hljs-number" style="color: teal;">5</span>, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),
                    (<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">0</span>),
                )
            )

            plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">2</span>)
            plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">"off"</span>)
            plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"fake image"</span>)
            plt.imshow(
                np.transpose(
                    torchvision.utils.make_grid(fake, nrow=<span class="hljs-number" style="color: teal;">1</span>, padding=<span class="hljs-number" style="color: teal;">5</span>, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),
                    (<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">0</span>),
                )
            )

            plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">3</span>)
            plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">"off"</span>)
            plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"ground truth (y)"</span>)
            plt.imshow(
                np.transpose(
                    torchvision.utils.make_grid(fixed_y.cpu(), nrow=<span class="hljs-number" style="color: teal;">1</span>, padding=<span class="hljs-number" style="color: teal;">5</span>, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),
                    (<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">0</span>),
                )
            )

            run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"epoch_results"</span>].upload(figs)
            plt.close()
            img_list.append(figs)</pre></code></pre>
</div>





<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">run.stop()</pre></code></pre>
</div>



    <a
        href="https://app.neptune.ai/nielspace/Pix2Pix/experiments?compare=IwJgNMQ&#038;split=cmp&#038;dash=charts&#038;viewId=standard-view"
        id="cta-box-related-link-block_b745789eac639967f0f295cc2b31a4b5"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-explore-the-example-project-in-neptune-ai">                Explore the example project in neptune.ai            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    See more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>Once the training is initialized, all the logged information automatically appears in the dashboard. Neptune fetches live information from the training run, which allows <a href="https://docs-legacy.neptune.ai/tutorials/monitoring_training_live/" target="_blank" rel="noreferrer noopener">live monitoring of the entire process</a>.</p>



<p>Below are the screenshots of the monitoring process.&nbsp;</p>



<div id="app-screenshot-block_2d95ddc6d9c093ab610023d1699c7de2"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=1020%2C520&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=480%2C245&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=768%2C391&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=1020%2C520&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="520"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/community/org/pix2pix-key-model-architecture/runs/details?viewId=standard-view&#038;detailsTab=charts&#038;shortId=PIX-1&#038;type=run"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Monitoring the performance of the model 			</figcaption>
			
</div>



<div id="separator-block_b9c595d54dc54a9f881c52e8bdc6fe6e"
         class="block-separator block-separator--20">
</div>



<p>You can also access all metadata and see generated samples.</p>



<div id="app-screenshot-block_fc1b55134d51f39accf9ee1c500baa3e"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=1020%2C520&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=480%2C245&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=768%2C391&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=1020%2C520&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="520"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/community/org/pix2pix-key-model-architecture/runs/details?viewId=standard-view&#038;detailsTab=metadata&#038;shortId=PIX-1&#038;type=run&#038;path=&#038;attribute=Gen%20Loss"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Access to all metadata in the project			</figcaption>
			
</div>



<div id="separator-block_b9c595d54dc54a9f881c52e8bdc6fe6e"
         class="block-separator block-separator--20">
</div>



<p>Switching to the images panel will show you the generated samples:</p>



<div id="app-screenshot-block_cd4f46e48d8a298c1b84c8f320d87ccd"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=1020%2C520&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=480%2C245&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=768%2C391&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=1020%2C520&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="520"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/community/org/pix2pix-key-model-architecture/runs/details?viewId=standard-view&#038;detailsTab=images&#038;shortId=PIX-1&#038;type=run"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Access to the generated samples			</figcaption>
			
</div>



<div id="separator-block_b9c595d54dc54a9f881c52e8bdc6fe6e"
         class="block-separator block-separator--20">
</div>



<h2 class="wp-block-heading" id="h-key-takeaways">Key takeaways</h2>



<ul class="wp-block-list">
<li>Pix2Pix is a conditional GAN that conditions on an input image, rather than a class label, to generate images.</li>



<li>It uses two architectures:
<ul class="wp-block-list">
<li>U-Net for generator</li>



<li>PatchGAN for discriminator</li>
</ul>
</li>



<li>Instead of discriminating the entire image at once, PatchGAN classifies each NxN patch of the generated image as real or fake.</li>



<li>Pix2Pix adds an L1 reconstruction loss specifically for the generator so that it can generate images closer to the ground truth.</li>



<li>Pix2Pix is a pairwise image translation algorithm.&nbsp;</li>
</ul>
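<p>The combined generator objective from the takeaways above can be sketched numerically. This is an illustrative NumPy-only sketch, not the PyTorch code used in the tutorial; <span class="c-code-snippet">bce</span> and <span class="c-code-snippet">l1</span> are hypothetical stand-ins for the criteria used earlier, and the tensor shapes are arbitrary examples:</p>

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy over a PatchGAN output: one score per N x N patch
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def l1(fake, real):
    # Pixel-wise L1 distance pulls the generated image toward the ground truth
    return np.mean(np.abs(fake - real))

rng = np.random.default_rng(0)
fake_patch = rng.uniform(0.4, 0.9, size=(1, 1, 30, 30))  # D(G(x), x): a 30x30 grid of patch scores
fake_img = rng.uniform(0, 1, size=(1, 3, 256, 256))      # G(x)
real_img = rng.uniform(0, 1, size=(1, 3, 256, 256))      # y (ground truth)

L1_lambda = 100
# Generator loss = GAN loss (patches labelled "real") + lambda * L1 reconstruction loss
G_loss = bce(fake_patch, np.ones_like(fake_patch)) + L1_lambda * l1(fake_img, real_img)
print(G_loss)
```

<p>Because <span class="c-code-snippet">L1_lambda</span> is large, the reconstruction term dominates early in training, which is what pushes the generated image toward the ground truth.</p>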



<h4 class="wp-block-heading">Other GANs that you can explore are:</h4>



<ol class="wp-block-list">
<li>CycleGAN: Similar to Pix2Pix, it shares most of the approach except for the data: instead of paired image translation, it performs unpaired image translation. Since it was developed by the same authors, learning and exploring CycleGAN will be much easier.</li>



<li>If you are interested in text-to-image translation then you should explore:
<ul class="wp-block-list">
<li>DeepDaze: Uses a generative model to create images from text prompts. Great for generating abstract or artistic images based on text descriptions.</li>



<li>BigSleep: Great if you want to discover unusual visualizations from prompts.</li>



<li>DALL·E: Developed by OpenAI, this model generates creative compositions with a high level of detail directly from text descriptions.</li>
</ul>
</li>



<li>Other interesting GAN projects you may want to try out:
<ul class="wp-block-list">
<li>StyleGAN: Generates realistic faces; ideal for style manipulation and creative blending.</li>



<li>AnimeGAN: Converts real photos into anime-style images.</li>



<li>BigGAN: Produces images with realistic textures.</li>



<li>Age-cGAN: Alters age in facial images.</li>



<li>StarGAN: Handles multiple transformations in faces, like hair color and expression changes.</li>
</ul>
</li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6256</post-id>	</item>
		<item>
		<title>Dimensionality Reduction for Machine Learning</title>
		<link>https://neptune.ai/blog/dimensionality-reduction</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:34:22 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/dimensionality-reduction/</guid>

					<description><![CDATA[Data forms the foundation of any machine learning algorithm, without it, Data Science can not happen. Sometimes, it can contain a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data by visualization gets difficult because of the high dimensionality. This is&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Data forms the foundation of any machine learning algorithm; without it, data science cannot happen. Sometimes, data contains a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data through visualization gets difficult because of the high dimensionality. This is where dimensionality reduction comes into play.</p>



<p>In this article you will learn:</p>



<ol class="wp-block-list">
<li>What is dimensionality reduction?</li>



<li>What is the curse of dimensionality?</li>



<li>Tools and libraries used for dimensionality reduction</li>



<li>Algorithms used for dimensionality reduction</li>



<li>Applications</li>



<li>Advantages and disadvantages</li>
</ol>



<h2 class="wp-block-heading" id="h-what-is-dimensionality-reduction">What is dimensionality reduction?</h2>



<p>Dimensionality reduction is the task of reducing the number of features in a dataset. In machine learning tasks like <a href="/blog/random-forest-regression-when-does-it-fail-and-why" target="_blank" rel="noreferrer noopener">regression</a> or <a href="/blog/image-classification-tips-and-tricks-from-13-kaggle-competitions" target="_blank" rel="noreferrer noopener">classification</a>, there are often too many variables to work with. These variables are also called <strong>features</strong>. The higher the number of features, the more difficult it is to model them. This is known as the <strong>curse of dimensionality</strong> and is discussed in detail in the next section.</p>



<p>Additionally, some of these features can be quite redundant, adding noise to the dataset, so it makes no sense to keep them in the training data. This is where the feature space needs to be reduced.</p>



<p>The process of dimensionality reduction essentially transforms data from high-dimensional feature space to a low-dimensional feature space. Simultaneously, it is also important that meaningful properties present in the data are not lost during the transformation.</p>



<p>Dimensionality reduction is commonly used in data visualization to understand and interpret the data, and in machine learning or deep learning techniques to simplify the task at hand.&nbsp;</p>



<h3 class="wp-block-heading" id="h-curse-of-dimensionality">Curse of dimensionality</h3>



<p>It is well known that ML/DL algorithms need a large amount of data to learn invariances, patterns, and representations. If this data comprises a large number of features, this can lead to the curse of dimensionality. The curse of dimensionality, first introduced by <a href="https://zbmath.org/0103.12901" target="_blank" rel="noreferrer noopener nofollow">Bellman</a>, describes that, in order to estimate an arbitrary function with a given accuracy, the number of samples required grows exponentially with the number of features, i.e., the dimensionality. This is especially true for big data, which yields more <strong>sparsity</strong>.&nbsp;</p>



<p>Sparsity in data means that many features have a value of zero; this doesn&#8217;t mean that the values are missing. If the data has many sparse features, both the space and computational complexity increase. Oliver <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.1421" target="_blank" rel="noreferrer noopener nofollow">Kuss [2002]</a> showed that models trained on sparse data perform poorly on the test dataset. In other words, during training the models learn noise and are not able to generalize well; hence they overfit.&nbsp;</p>



<p>When the data is sparse, observations or samples in the training dataset are difficult to cluster, because high-dimensional data causes every observation in the dataset to appear roughly equidistant from every other. If the data is meaningful and non-redundant, there will be regions where similar data points come together and cluster, and these clusters must be statistically significant.&nbsp;</p>
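<p>This &#8220;equidistant&#8221; effect is easy to demonstrate. Below is a minimal NumPy sketch (the sample count and dimensionalities are arbitrary choices for illustration): it measures the contrast between the farthest and nearest pair of points, which collapses as dimensionality grows.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(n_points, n_dims):
    # Contrast between the farthest and nearest pair, relative to the nearest:
    # in high dimensions this ratio shrinks, so points look equidistant.
    X = rng.random((n_points, n_dims))
    sq = (X ** 2).sum(axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b (clipped for rounding)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    d = np.sqrt(d2[np.triu_indices(n_points, k=1)])  # unique pairs only
    return (d.max() - d.min()) / d.min()

print(distance_spread(100, 2))       # low dimensions: large contrast
print(distance_spread(100, 10_000))  # high dimensions: distances concentrate
```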



<p>Issues that arise with high dimensional data are:</p>



<ol class="wp-block-list">
<li>Running a risk of overfitting the machine learning model.&nbsp;</li>



<li>Difficulty in clustering similar features.</li>



<li>Increased space and computational time complexity.&nbsp;</li>
</ol>



<p>Non-sparse, or dense, data, on the other hand, has non-zero features. Apart from containing non-zero values, it also contains information that is both meaningful and non-redundant.&nbsp;</p>



<p>To tackle the curse of dimensionality, methods like dimensionality reduction are used. Dimensionality reduction techniques are very useful for transforming sparse features into dense features. Furthermore, dimensionality reduction is also used for data cleaning and feature extraction.</p>



<h2 class="wp-block-heading" id="h-tools-and-library">Tools and libraries</h2>



<p>The most popular library for dimensionality reduction is <strong>scikit-learn </strong>(sklearn). The library consists of three main modules for dimensionality reduction algorithms:</p>



<ol class="wp-block-list">
<li>Decomposition algorithms
<ul class="wp-block-list">
<li>Principal Component Analysis</li>



<li>Kernel Principal Component Analysis</li>



<li>Non-Negative Matrix Factorization&nbsp;</li>



<li>Singular Value Decomposition&nbsp;</li>
</ul>
</li>



<li>Manifold learning algorithms
<ul class="wp-block-list">
<li>t-Distributed Stochastic Neighbor Embedding</li>



<li>Spectral Embedding</li>



<li>Locally Linear Embedding</li>
</ul>
</li>



<li>Discriminant Analysis
<ul class="wp-block-list">
<li>Linear Discriminant Analysis</li>
</ul>
</li>
</ol>



<p>When it comes to deep learning, models like autoencoders can be constructed to reduce dimensions and learn features and representations. Frameworks like PyTorch, PyTorch Lightning, Keras, and TensorFlow are used to create autoencoders.</p>
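<p>As a rough illustration of the idea, without any of those frameworks, a linear autoencoder can be trained in plain NumPy with gradient descent. All sizes and hyperparameters below are arbitrary; a single linear encoder&#8211;decoder pair like this learns a projection comparable to PCA:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 10 dimensions that mostly live on a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X -= X.mean(axis=0)

# Linear autoencoder: encode 10 -> 2, decode 2 -> 10
W_enc = rng.normal(scale=0.1, size=(10, 2))
W_dec = rng.normal(scale=0.1, size=(2, 10))

lr = 0.01
initial_loss = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
for _ in range(1000):
    Z = X @ W_enc       # encode to the low-dimensional bottleneck
    X_hat = Z @ W_dec   # decode back to the input space
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
print(initial_loss, final_loss)
```

<p>Deep autoencoders built in PyTorch or Keras follow the same encode&#8211;decode pattern, with non-linear layers allowing them to capture structure that PCA cannot.</p>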



<section id="blog-intext-cta-block_445b3f9fd7653e0ae715edfe89ca024a" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-recommended-for-you">Recommended for you</h3>
    
            <p><a href="/blog/knowledge-distillation" target="_blank" rel="noopener">Knowledge Distillation: Principles, Algorithms, Applications</a></p>
<p><a href="https://neptune.ai/blog/the-best-ml-framework-extensions-for-scikit-learn" target="_blank" rel="noopener">The Best ML Frameworks &amp; Extensions For Scikit-learn</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-algorithms-for-dimensionality-reduction">Algorithms for dimensionality reduction</h2>



<p>Let’s start with the first class of algorithms.</p>



<h3 class="wp-block-heading" id="h-decomposition-algorithms">Decomposition algorithms</h3>



<p>The decomposition module in scikit-learn provides several dimensionality reduction algorithms. We can import the various techniques using the following command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> PCA, KernelPCA, NMF</pre>



<h4 class="wp-block-heading">Principal Component Analysis (PCA)</h4>



<p>Principal Component Analysis, or PCA, is a dimensionality reduction method that finds a lower-dimensional space while preserving the <strong>variance</strong> measured in the high-dimensional input space. It is an unsupervised method for dimensionality reduction.&nbsp;</p>



<p>PCA transformations are linear transformations. The method finds the principal components by decomposing the feature matrix into its eigenvectors. This means that PCA will not be effective when the distribution of the dataset is non-linear.&nbsp;</p>



<p>Let’s understand PCA with Python code.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">pca</span><span class="hljs-params">(X=np.array<span class="hljs-params">([])</span>, no_dims=<span class="hljs-number" style="color: teal;">50</span>)</span>:</span>

    print(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Preprocessing the data using PCA..."</span>)
    (n, d) = X.shape
    Mean = np.tile(np.mean(X, <span class="hljs-number" style="color: teal;">0</span>), (n, <span class="hljs-number" style="color: teal;">1</span>))
    X = X - Mean
    (l, M) = np.linalg.eig(np.dot(X.T, X))
    <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Sort eigenvectors by descending eigenvalue so the top no_dims are kept</span>
    idx = np.argsort(-l)
    Y = np.dot(X, M[:, idx[:no_dims]])
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> Y</pre>



<p>PCA implementation is quite straightforward. We can define the whole process in just four steps:</p>



<ol class="wp-block-list">
<li><strong>Standardization</strong>: The data has to be transformed to a common scale by subtracting the mean of the whole dataset from each observation. This makes the distribution zero-centered.&nbsp;</li>



<li><strong>Finding the covariance</strong>: The covariance matrix captures how pairs of features in the centered data vary together.&nbsp;</li>



<li><strong>Determining the principal components</strong>: Principal components are determined by calculating the eigenvectors and eigenvalues of the covariance matrix. <strong>Eigenvectors</strong> are a special set of vectors that capture the structure of the data; they become the principal components. The <strong>eigenvalues</strong>, on the other hand, rank them: the eigenvectors with the highest eigenvalues are the most important principal components.</li>



<li><strong>Final output</strong>: It is the dot product of the standardized matrix and the selected eigenvectors. Note that the number of columns, or features, will be reduced.&nbsp;</li>
</ol>
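
<p>The four steps above can be sketched with plain NumPy. Note that pca_steps is our own illustrative helper for this article, not a library function:</p>

```python
import numpy as np

def pca_steps(X, no_dims=2):
    # Step 1: standardization -- center each feature at zero mean
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix of the centered data
    cov = np.cov(X_centered, rowvar=False)
    # Step 3: eigendecomposition; keep eigenvectors with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:no_dims]]
    # Step 4: final output -- dot product of centered data and eigenvectors
    return X_centered @ components

X = np.random.RandomState(0).randn(100, 5)
Y = pca_steps(X, no_dims=2)
print(Y.shape)  # (100, 2)
```

<p>The returned components are uncorrelated with each other, which is exactly what projecting onto the eigenvectors of the covariance matrix guarantees.</p>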



<p>Reducing the number of variables not only reduces complexity but usually also costs some accuracy. In exchange, a smaller number of features makes the data easier to explore, visualize, and analyze, and it makes machine learning algorithms computationally less expensive. In simple words, the idea of PCA is to reduce the number of variables of a data set while preserving as much information as possible.</p>



<p>Let’s also take a look at the modules and functions sklearn provides for PCA.</p>



<p>We can start by loading the digits dataset:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_digits
digits = load_digits()
digits.data.shape</pre>



<p>(1797, 64)</p>



<p>The data consists of 8×8 pixel images, which means that they are 64-dimensional. To gain some understanding of the relationships between these points, we can use PCA to project them to lower dimensions, like 2-D:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> PCA

pca = PCA(<span class="hljs-number" style="color: teal;">2</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># project from 64 to 2 dimensions</span>
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)</pre>



<p>(1797, 64)</p>



<p>(1797, 2)</p>



<p>Now, let’s plot the first two principal components.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">plt.scatter(projected[:, <span class="hljs-number" style="color: teal;">0</span>], projected[:, <span class="hljs-number" style="color: teal;">1</span>],
            c=digits.target, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'none'</span>, alpha=<span class="hljs-number" style="color: teal;">0.5</span>,
            cmap=plt.cm.get_cmap(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Spectral'</span>, <span class="hljs-number" style="color: teal;">10</span>))
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">'component 1'</span>)
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">'component 1'</span>)
plt.ylabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">'component 2'</span>)
plt.colorbar();</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_5.png?ssl=1" alt="Principal Component Analysis (PCA)" class="wp-image-54935" style="width:523px;height:406px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: PCA | Source: Author</em></figcaption></figure>
</div>


<p>We can see that PCA found principal components that, for the most part, cluster the different digit classes quite effectively.&nbsp;</p>
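
<p>To check how much of the original information the two components retain, we can look at the explained variance ratio that scikit-learn computes during the fit (shown here on a freshly fitted model):</p>

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(2).fit(digits.data)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

<p>Two components keep only a fraction of the total variance, yet, as the plot shows, they are often enough to reveal the cluster structure.</p>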



<h4 class="wp-block-heading">Kernel PCA (KPCA)</h4>



<p>The PCA transformation described previously is linear and is therefore ineffective on non-linearly distributed data. To deal with non-linear distributions, the basic idea is to use the kernel trick.&nbsp;</p>



<p>The kernel trick is simply a method of projecting non-linear data onto a higher-dimensional space in which the different distributions of data become separable. Once the distributions are separated, we can use PCA to separate them linearly.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_2.png?ssl=1" alt="Kernel PCA" class="wp-image-54929"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: KPCA | Source: Author</em></figcaption></figure>
</div>


<p>Kernel PCA uses a kernel function ϕ that calculates the dot product of the data for non-linear mapping. In other words, the function ϕ maps the original d-dimensional features into a larger, k-dimensional feature space by creating non-linear combinations of the original features.</p>



<p>Let&#8217;s assume a dataset x that contains two features, x1 and x2:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_13.png?ssl=1" alt="Kernel PCA" class="wp-image-54946" style="width:375px;height:89px"/></figure>
</div>


<div id="separator-block_20128d1a995cd55a34255b9a40986af5"
         class="block-separator block-separator--5">
</div>



<p>After applying the kernel trick we get:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_6.png?ssl=1" alt="Kernel PCA" class="wp-image-54943" style="width:779px;height:67px"/></figure>
</div>
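
<p>The effect of such a mapping can be verified with an explicit feature map. Below we use the textbook quadratic map ϕ(x1, x2) = (x1², √2·x1·x2, x2²) on a two-circles dataset like the one used below (this explicit map is only an illustration of the idea; the KPCA example further down uses an RBF kernel instead):</p>

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=400, factor=.3, noise=.05, random_state=0)

# Explicit quadratic feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
phi = np.column_stack([X[:, 0] ** 2,
                       np.sqrt(2) * X[:, 0] * X[:, 1],
                       X[:, 1] ** 2])

# In the mapped space, the squared radius x1^2 + x2^2 becomes a single axis,
# so a plane (here: a threshold) separates the inner circle (label 1)
# from the outer circle (label 0)
radius_sq = phi[:, 0] + phi[:, 2]
accuracy = ((radius_sq < 0.4) == (y == 1)).mean()
print(accuracy)
```

<p>A single threshold in the mapped space separates the two rings almost perfectly, even though no line in the original space could.</p>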





<p>To get a more intuitive understanding of Kernel PCA let&#8217;s define a feature space that cannot be linearly separated.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_circles
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> KernelPCA
np.random.seed(<span class="hljs-number" style="color: teal;">0</span>)
X, y = make_circles(n_samples=<span class="hljs-number" style="color: teal;">400</span>, factor=<span class="hljs-number" style="color: teal;">.3</span>, noise=<span class="hljs-number" style="color: teal;">.05</span>)</pre>



<p>Now, let’s plot and see our dataset.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">plt.figure(figsize=(<span class="hljs-number" style="color: teal;">15</span>,<span class="hljs-number" style="color: teal;">10</span>))
plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">1</span>, aspect=<span class="hljs-string" style="color: rgb(221, 17, 68);">'equal'</span>)
plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Original space"</span>)
reds = y == <span class="hljs-number" style="color: teal;">0</span>
blues = y == <span class="hljs-number" style="color: teal;">1</span>

plt.scatter(X[reds, <span class="hljs-number" style="color: teal;">0</span>], X[reds, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"red"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.scatter(X[blues, <span class="hljs-number" style="color: teal;">0</span>], X[blues, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"blue"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">"$x_1$"</span>)
plt.ylabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">"$x_2$"</span>)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_7.png?ssl=1" alt="Kernel PCA" class="wp-image-54928" style="width:456px;height:460px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: KPCA | Source: Author</em></figcaption></figure>
</div>


<p>As you can see, the two classes in this dataset cannot be separated linearly. Now, let&#8217;s define kernel PCA and see how it separates this feature space.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">kpca = KernelPCA(kernel=<span class="hljs-string" style="color: rgb(221, 17, 68);">"rbf"</span>, fit_inverse_transform=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>, gamma=<span class="hljs-number" style="color: teal;">10</span>, )
X_kpca = kpca.fit_transform(X)
plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">2</span>, aspect=<span class="hljs-string" style="color: rgb(221, 17, 68);">'equal'</span>)
plt.scatter(X_kpca[reds, <span class="hljs-number" style="color: teal;">0</span>], X_kpca[reds, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"red"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.scatter(X_kpca[blues, <span class="hljs-number" style="color: teal;">0</span>], X_kpca[blues, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"blue"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Projection by KPCA"</span>)
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">r"1st principal component in space induced by $\phi$"</span>)
plt.ylabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">"2nd component"</span>)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_11.png?ssl=1" alt="Kernel PCA" class="wp-image-54934" style="width:584px;height:432px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: KPCA | Source: Author</em></figcaption></figure>
</div>


<p>After applying KPCA, the two classes in the dataset become linearly separable.&nbsp;&nbsp;</p>



<h4 class="wp-block-heading">Singular Value Decomposition&nbsp;(SVD)</h4>



<p>Singular value decomposition, or SVD, is a factorization method for a real or complex matrix. It is efficient when working with a sparse dataset, i.e., a dataset with many zero entries. Such datasets are typical of recommender systems and rating and review data.&nbsp;</p>



<p>The idea of SVD is that every matrix of shape n×p factorizes into A = USV<sup>T</sup>, where U is an orthogonal matrix, S is a diagonal matrix, and V<sup>T</sup> is also an orthogonal matrix.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_1.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54945" style="width:342px;height:89px"/></figure>
</div>


<p>The advantage of SVD is that the orthogonal matrices capture the structure of the original matrix A, which means that their properties do not change when multiplied by other numbers. This lets us approximate A from a subset of the factors.&nbsp;</p>
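
<p>Before applying SVD to images, we can verify the factorization itself on a small random matrix (a standalone sanity check, separate from the face data used below):</p>

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(6, 4)

# Thin SVD: A = U @ diag(S) @ VT
U, S, VT = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A up to float error
A_rebuilt = (U * S) @ VT
print(np.allclose(A, A_rebuilt))  # True
```

<p>Because multiplying U by the singular values S and then by V<sup>T</sup> reproduces A exactly, dropping the smallest singular values gives a controlled approximation, which is what we exploit next.</p>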



<p>Now let’s understand SVD using code. To get a better understanding of the algorithm we will use a face dataset that scikit-learn provides.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=<span class="hljs-number" style="color: teal;">70</span>, resize=<span class="hljs-number" style="color: teal;">0.4</span>)</pre>



<p>Plot the images to get an idea of what we are working with.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># image count and dimensions come from the dataset itself</span>
img_count, img_height, img_width = lfw_people.images.shape
X = lfw_people.images.reshape(img_count, img_width * img_height)
X0_img = X[<span class="hljs-number" style="color: teal;">0</span>].reshape(img_height, img_width)

plt.imshow(X0_img, cmap=plt.cm.gray)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_22.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54937" style="width:342px;height:410px"/></figure>
</div>


<p>Create a function for easy visualization of images.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">draw_img</span><span class="hljs-params">(img_vector, h=img_height, w=img_width)</span>:</span>
   plt.imshow( img_vector.reshape((h,w)), cmap=plt.cm.gray)
   plt.xticks(())
   plt.yticks(())
draw_img(X[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_18.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54938" style="width:309px;height:376px"/></figure>
</div>


<p>Before applying SVD it is better to standardize the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.preprocessing <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> StandardScaler

scaler = StandardScaler(with_std=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>)
Xstd = scaler.fit_transform(X)</pre>



<p>After standardizing this is how the image looks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_15.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54939" style="width:288px;height:369px"/></figure>
</div>


<p>It is worth noting that we can always retrieve the original image by performing the inverse transformation.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Xorig = scaler.inverse_transform(Xstd)
draw_img(Xorig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_18.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54938" style="width:303px;height:369px"/></figure>
</div>


<p>Now, we can apply the SVD function from NumPy and decompose the matrix into three matrices.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> numpy.linalg <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> svd

U, S, VT = svd(Xstd)</pre>



<p>To check that the function works we can always perform a matrix multiplication of the three matrices.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">US = U*S
Xhat = US @ VT[<span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">1288</span>,:]

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># inverse transform Xhat to reverse standardization</span>
Xhat_orig = scaler.inverse_transform(Xhat)
draw_img(Xhat_orig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_18.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54938" style="width:306px;height:372px"/></figure>
</div>


<p>Now, let’s perform dimensionality reduction. To do that, we just need to keep fewer columns of the factor matrices.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Xhat_500 = US[:, <span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">500</span>] @ VT[<span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">500</span>, :]
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># inverse transform Xhat to reverse standardization</span>
Xhat_500_orig = scaler.inverse_transform(Xhat_500)
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># draw recovered image</span>
draw_img(Xhat_500_orig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_16.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54930" style="width:289px;height:372px"/></figure>
</div>


<p>We can further reduce more features and see the results.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Xhat_100 = US[:, <span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">100</span>] @ VT[<span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">100</span>, :]
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># inverse transform Xhat to reverse standardization</span>
Xhat_100_orig = scaler.inverse_transform(Xhat_100)
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># draw recovered image</span>
draw_img(Xhat_100_orig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_17.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54931" style="width:299px;height:381px"/></figure>
</div>


<p>Now let’s create a function that would allow us to reduce the dimensions of the image.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">dim_reduce</span><span class="hljs-params">(US_, VT_, dim=<span class="hljs-number" style="color: teal;">100</span>)</span>:</span>

   Xhat_ = US_[:, <span class="hljs-number" style="color: teal;">0</span>:dim] @ VT_[<span class="hljs-number" style="color: teal;">0</span>:dim, :]

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> scaler.inverse_transform(Xhat_)</pre>



<p>Plotting images with a different number of features.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">dim_vec = [<span class="hljs-number" style="color: teal;">50</span>, <span class="hljs-number" style="color: teal;">100</span>, <span class="hljs-number" style="color: teal;">200</span>, <span class="hljs-number" style="color: teal;">400</span>, <span class="hljs-number" style="color: teal;">800</span>]

plt.figure(figsize=(<span class="hljs-number" style="color: teal;">1.8</span> * len(dim_vec), <span class="hljs-number" style="color: teal;">2.4</span>))

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i, d <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> enumerate(dim_vec):
   plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, len(dim_vec), i + <span class="hljs-number" style="color: teal;">1</span>)
   draw_img(dim_reduce(US, VT, d)[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_9.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54932"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: SVD | Source: Author</em></figcaption></figure>
</div>


<p>As you can see, the first image is built from the fewest features, yet it still conveys an abstract version of the original, and as we increase the number of features, we gradually recover the original image. This shows that SVD can retain the basic structure of the data.&nbsp;</p>



<h4 class="wp-block-heading">Non-negative Matrix Factorization (NMF)</h4>



<p>NMF is an unsupervised machine learning algorithm. Given a non-negative input matrix X of dimension m×n, the algorithm decomposes it into the product of two non-negative matrices W and H, where W is of dimension m×p and H is of dimension p×n.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_20.png?ssl=1" alt="Non-negative Matrix Factorization (NMF)" class="wp-image-54940" style="width:642px;height:121px"/></figure>
</div>


<p><strong>Where Y = W·H</strong></p>



<p>From the equation above, you can see that to factorize the matrix we need to minimize the distance between X and the product W·H. The most widely used distance function is the squared Frobenius norm, an extension of the Euclidean norm to matrices.&nbsp;</p>



<p>It is also worth noting that this problem is not exactly solvable in general, which is why it is only approximated. As it turns out, NMF is good for parts-based representation of a dataset, i.e., NMF provides an <strong>efficient, distributed representation</strong> and can aid in the discovery of the structure of interest within the data.&nbsp;</p>
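
<p>One classic way to carry out this approximate minimization is the multiplicative update rule of Lee and Seung. Below is a minimal NumPy sketch of it (nmf_multiplicative is our own helper; scikit-learn's NMF uses a different, faster solver by default, so this is only for intuition):</p>

```python
import numpy as np

def nmf_multiplicative(X, p, n_iter=200, eps=1e-9, seed=0):
    # Approximately minimize ||X - W @ H||_F^2 with W, H kept non-negative
    rng = np.random.RandomState(seed)
    m, n = X.shape
    W = rng.rand(m, p)
    H = rng.rand(p, n)
    for _ in range(n_iter):
        # Multiplicative updates never change the sign of an entry,
        # so non-negativity is preserved automatically
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

X = np.abs(np.random.RandomState(1).randn(30, 20))
W, H = nmf_multiplicative(X, p=5)
print(np.linalg.norm(X - W @ H) < np.linalg.norm(X))
```

<p>Each update multiplies the current factor by a ratio of non-negative terms, which is why no projection step is needed to keep W and H non-negative.</p>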



<p>Let’s understand NMF with code. We will use the same data that we used in SVD.</p>



<p>First, we will fit the model to the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> NMF
model = NMF(n_components=<span class="hljs-number" style="color: teal;">200</span>, init=<span class="hljs-string" style="color: rgb(221, 17, 68);">'nndsvd'</span>, random_state=<span class="hljs-number" style="color: teal;">0</span>)
W = model.fit_transform(X)
V = model.components_</pre>



<p>NMF takes a bit of time to decompose the data. Once the data is decomposed we can then visualize the factorized components.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">num_faces = <span class="hljs-number" style="color: teal;">20</span>
plt.figure(figsize=(<span class="hljs-number" style="color: teal;">1.8</span> * <span class="hljs-number" style="color: teal;">5</span>, <span class="hljs-number" style="color: teal;">2.4</span> * <span class="hljs-number" style="color: teal;">4</span>))

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">0</span>, num_faces):
   plt.subplot(<span class="hljs-number" style="color: teal;">4</span>, <span class="hljs-number" style="color: teal;">5</span>, i + <span class="hljs-number" style="color: teal;">1</span>)
   draw_img(V[i])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_10.png?ssl=1" alt="Non-negative Matrix Factorization" class="wp-image-54925" style="width:676px;height:693px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: NMF | Source: Author</em></figcaption></figure>
</div>


<p>From the image above, we can see that NMF is very efficient at capturing the underlying structure of the data. It is also worth mentioning that NMF captures only linear attributes.&nbsp;</p>



<p><strong>Advantages of NMF</strong>:</p>



<ol class="wp-block-list">
<li>Data compression and visualization</li>



<li>Robustness to noise&nbsp;</li>



<li>Easier to interpret</li>
</ol>






<h3 class="wp-block-heading" id="h-manifold-learning">Manifold learning</h3>



<p>So far we have seen approaches that only involved linear transformation. But what do we do when we have a non-linear dataset?</p>



<p>Manifold learning is a type of unsupervised learning that seeks to perform dimensionality reduction of a non-linear dataset. Again, scikit-learn offers a module that consists of various nonlinear dimensionality reduction techniques. We can call those classes or techniques through this command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.manifold <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> TSNE, LocallyLinearEmbedding, SpectralEmbedding</pre>



<h4 class="wp-block-heading">t-Distributed Stochastic Neighbor Embedding (t-SNE)</h4>



<p>t-Distributed Stochastic Neighbor Embedding, or t-SNE, is a dimensionality reduction technique well suited for data visualization. Unlike PCA, which simply maximizes variance, t-SNE minimizes the divergence between two distributions. Essentially, it recreates the distribution of a high-dimensional space in a low-dimensional space rather than maximizing variance or applying a kernel trick.&nbsp;</p>



<p>We can get a high-level understanding of t-SNE in three simple steps:</p>



<ol class="wp-block-list">
<li>It first creates a probability distribution for the high-dimensional samples.&nbsp;&nbsp;</li>



<li>Then, it defines a similar distribution for the points in the low-dimensional embedding.</li>



<li>Finally, it tries to minimize the KL-divergence between the two distributions.&nbsp;</li>
</ol>
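
<p>These three steps can be sketched for a fixed embedding. The helper below (kl_between_embeddings, our own illustrative function) uses a single global σ instead of the per-point perplexity calibration that real t-SNE performs, so it only illustrates the objective, not the full algorithm:</p>

```python
import numpy as np

def kl_between_embeddings(X_high, X_low, sigma=1.0, eps=1e-12):
    # Step 1: Gaussian joint probabilities over high-dimensional pairs
    d_high = np.sum((X_high[:, None] - X_high[None, :]) ** 2, axis=-1)
    P = np.exp(-d_high / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()

    # Step 2: Student-t (1 degree of freedom) probabilities over low-dim pairs
    d_low = np.sum((X_low[:, None] - X_low[None, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d_low)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    # Step 3: KL divergence KL(P || Q) -- the quantity t-SNE minimizes
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

rng = np.random.RandomState(0)
X_high = rng.randn(20, 10)
kl = kl_between_embeddings(X_high, X_high[:, :2])
print(kl)
```

<p>t-SNE's optimizer moves the low-dimensional points by gradient descent to drive this divergence down.</p>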



<p>Now let’s understand it with code. For t-SNE, we will use the digits dataset again. First, we import TSNE and then the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.manifold <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> TSNE
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_digits

digits = load_digits()
print(digits.data.shape)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># There are 10 classes (0 to 9) with almost 180 images in each class</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># The images are 8x8 and hence 64 pixels(dimensions)</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Displaying what the standard images look like</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">0</span>,<span class="hljs-number" style="color: teal;">5</span>):
   plt.figure(figsize=(<span class="hljs-number" style="color: teal;">5</span>,<span class="hljs-number" style="color: teal;">5</span>))
   plt.imshow(digits.images[i])
   plt.show()</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_19.png?ssl=1" alt="t-Distributed Stochastic Neighbor Embedding" class="wp-image-54936" style="width:287px;height:560px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: t-SNE | Source: Author</em></figcaption></figure>
</div>


<p>We will then store the digits in order using np.vstack.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">X = np.vstack([digits.data[digits.target==i] <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>)])
Y = np.hstack([digits.target[digits.target==i] <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>)])</pre>



<p>We will apply t-SNE to the dataset.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">digits_final = TSNE(perplexity=<span class="hljs-number" style="color: teal;">30</span>).fit_transform(X)</pre>



<p>We will now create a function to visualize the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import seaborn as sb
import matplotlib.patheffects as pe

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">plot</span><span class="hljs-params">(x, colors)</span>:</span>
   palette = np.array(sb.color_palette(<span class="hljs-string" style="color: rgb(221, 17, 68);">"hls"</span>, <span class="hljs-number" style="color: teal;">10</span>))  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Choosing a color palette</span>

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Create a scatter plot.</span>
   f = plt.figure(figsize=(<span class="hljs-number" style="color: teal;">8</span>, <span class="hljs-number" style="color: teal;">8</span>))
   ax = plt.subplot(aspect=<span class="hljs-string" style="color: rgb(221, 17, 68);">'equal'</span>)
   sc = ax.scatter(x[:,<span class="hljs-number" style="color: teal;">0</span>], x[:,<span class="hljs-number" style="color: teal;">1</span>], lw=<span class="hljs-number" style="color: teal;">0</span>, s=<span class="hljs-number" style="color: teal;">40</span>, c=palette[colors.astype(int)])
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Add the labels for each digit.</span>
   txts = []
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>):
       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Position of each label.</span>
       xtext, ytext = np.median(x[colors == i, :], axis=<span class="hljs-number" style="color: teal;">0</span>)
       txt = ax.text(xtext, ytext, str(i), fontsize=<span class="hljs-number" style="color: teal;">24</span>)
       txt.set_path_effects([pe.Stroke(linewidth=<span class="hljs-number" style="color: teal;">5</span>, foreground=<span class="hljs-string" style="color: rgb(221, 17, 68);">"w"</span>),
                             pe.Normal()])
       txts.append(txt)
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> f, ax, txts</pre>



<p>Now we perform data visualization on the transformed dataset.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">plot(digits_final,Y)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_21.png?ssl=1" alt="t-Distributed Stochastic Neighbor Embedding" class="wp-image-54926" style="width:524px;height:504px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: t-SNE | Source: Author</em></figcaption></figure>
</div>


<p>As can be seen, t-SNE clusters the data beautifully. Compared to PCA, t-SNE performs well on nonlinear data. The drawback of t-SNE is that it becomes very slow on large datasets, so it is common to reduce the data with PCA first and then apply t-SNE.&nbsp;</p>
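<p>A minimal sketch of this PCA-then-t-SNE pipeline on the same digits data (the choice of 30 PCA components and the parameter values below are illustrative, not prescriptive):</p>

```python
# Reduce the 64 pixel dimensions with PCA first, then embed with t-SNE.
# Assumes scikit-learn is available; 30 PCA components is an illustrative choice.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_digits().data  # shape (1797, 64)

# Step 1: PCA compresses the data cheaply and removes some noise.
X_pca = PCA(n_components=30).fit_transform(X)

# Step 2: t-SNE runs on the much smaller 30-dimensional representation.
X_embedded = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(X_pca)
print(X_embedded.shape)  # (1797, 2)
```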



<h4 class="wp-block-heading">Locally Linear Embedding (LLE)</h4>



<p>Locally Linear Embedding or LLE is a non-linear and unsupervised machine learning method for dimensionality reduction. LLE takes advantage of the local structure or topology of the data and preserves it on a lower-dimensional feature space.&nbsp;</p>



<p>LLE is comparatively fast to optimize, but it performs poorly on noisy data.&nbsp;</p>



<p>Let’s break the whole process into three simple steps:</p>



<ol class="wp-block-list">
<li>Find the nearest neighbors of the data points.&nbsp;</li>



<li>Construct a weight matrix by approximating each data point as a weighted linear combination of its k-nearest neighbors, minimizing the squared distance between each point and its linear reconstruction.</li>



<li>Map the weights into a lower-dimensional space by using the <strong>eigenvector-based optimization</strong> technique.</li>
</ol>
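<p>The three steps above are handled internally by scikit-learn&#8217;s LocallyLinearEmbedding. A small, self-contained sketch (the swiss-roll data and the value of n_neighbors are illustrative):</p>

```python
# LLE on a synthetic 3-D manifold: n_neighbors drives step 1 (neighbor search),
# while the weight matrix and eigenvector optimization of steps 2-3 run internally.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # shape (1000, 3)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

print(X_lle.shape)                # (1000, 2)
print(lle.reconstruction_error_)  # the squared reconstruction cost being minimized
```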


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_24.png?ssl=1" alt="Locally Linear Embedding" class="wp-image-54927" style="width:649px;height:639px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LLE | Source: S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding</em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_14.png?ssl=1" alt="Locally Linear Embedding" class="wp-image-54933" style="width:518px;height:389px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LLE | Source: <a href="https://scikit-learn.org/stable/modules/manifold.html#manifold" target="_blank" rel="noreferrer noopener nofollow">Scikit Learn</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Spectral embedding</h4>



<p>Spectral Embedding is another non-linear dimensionality reduction technique that also happens to be an unsupervised machine learning algorithm. Spectral embedding aims to find clusters of different classes based on the low-dimensional representations.&nbsp;</p>



<p>We can again break the whole process into three simple steps:</p>



<ol class="wp-block-list">
<li><strong>Preprocessing</strong>: Construct a Laplacian matrix representation of the data or graph.&nbsp;</li>



<li><strong>Decomposition</strong>: Compute eigenvalues and eigenvectors of the constructed matrix and then map each point to a lower-dimensional representation. Spectral embedding makes use of the second smallest eigenvalue and its corresponding eigenvector.</li>



<li><strong>Clustering</strong>: Assign points to two or more clusters, based on the representation. Clustering is usually done using k-means clustering.&nbsp;</li>
</ol>
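<p>The three steps can be sketched with scikit-learn&#8217;s SpectralEmbedding followed by k-means. The two-moons data and the parameter values below are illustrative assumptions, not part of the article:</p>

```python
# Steps 1-2: build the graph Laplacian and map points onto its eigenvectors;
# step 3: cluster the low-dimensional representation with k-means.
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

embedding = SpectralEmbedding(n_components=2, affinity="nearest_neighbors",
                              random_state=0)
X_emb = embedding.fit_transform(X)  # shape (500, 2)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_emb)
print(X_emb.shape, labels.shape)
```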



<p><strong>Applications</strong>: Spectral Embedding finds its application in image segmentation.&nbsp;</p>



<h3 class="wp-block-heading" id="h-discriminant-analysis">Discriminant Analysis</h3>



<p>Discriminant Analysis is another module that scikit-learn provides. It can be called using the following command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.discriminant_analysis <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> LinearDiscriminantAnalysis</pre>



<h4 class="wp-block-heading">Linear Discriminant Analysis (LDA)</h4>



<p>LDA is an algorithm that is used to find a linear combination of features in a dataset. Like PCA, LDA is also a linear transformation-based technique. But unlike PCA it is a supervised learning algorithm.&nbsp;</p>



<p>LDA computes the directions, i.e. linear discriminants that can create decision boundaries and maximize the separation between multiple classes. It is also very effective for multi-class classification tasks.</p>



<p>To have a more intuitive understanding of LDA, consider plotting a relationship of two classes as shown in the image below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_3.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54944" style="width:506px;height:418px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>One way to solve this problem is to project all the data points onto the x-axis.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_8.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54941" style="width:512px;height:417px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>This approach, however, loses information: the projections of the two classes overlap on the x-axis, so the separation between them is lost.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_12.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54948" style="width:467px;height:132px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>A better approach is to account for the spread of all the points and fit a new axis that passes through the data. All the points can then be projected onto this new axis.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_23.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54942" style="width:493px;height:405px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>This new axis minimizes the within-class variance while maximizing the distance between the two classes, separating them efficiently.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_4.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54947" style="width:515px;height:140px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>LDA can be used for multivariate data as well, and it makes data inference quite simple. LDA can be computed in the following five steps:</p>



<ol class="wp-block-list">
<li>Compute the d-dimensional mean vectors for the different classes from the dataset.</li>



<li>Compute the scatter matrices (the between-class and within-class scatter matrices). A scatter matrix is used to estimate the covariance matrix when the covariance matrix, or the joint variability of two random variables, is difficult to calculate directly.&nbsp;</li>



<li>Compute the eigenvectors (e1, e2, e3&#8230;ed) and corresponding eigenvalues (λ1,λ2,&#8230;,λd) for the scatter matrices.</li>



<li>Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix W (where every column represents an eigenvector).</li>



<li>Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication Y=X×W (where X is the n×d matrix of the n samples, and Y is the transformed n×k matrix of samples in the new subspace).</li>
</ol>
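<p>These five steps are what scikit-learn&#8217;s LinearDiscriminantAnalysis performs under the hood. A brief sketch on the iris dataset (the dataset choice and k=2 are illustrative assumptions):</p>

```python
# Project 4-dimensional iris data onto k=2 linear discriminants (Y = X × W).
# LDA is supervised, so fit_transform also takes the class labels y.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # n=150 samples, d=4 features, 3 classes

# n_components must be at most (number of classes - 1) = 2 here.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)
```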



<p>To learn more about LDA, you can check out this <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">article</a>.&nbsp;</p>



<h2 class="wp-block-heading" id="h-applications-of-dimentionality-reduction">Applications of dimensionality reduction</h2>



<p>Dimensionality reduction finds its way into many real-life applications, some of which are:</p>



<ul class="wp-block-list">
<li>Customer relationship management</li>



<li>Text categorization</li>



<li>Image retrieval</li>



<li>Intrusion detection</li>



<li>Medical image segmentation&nbsp;</li>
</ul>



<h2 class="wp-block-heading" id="h-advantages-and-disadvantages-of-dimentionality-reduction">Advantages and disadvantages of dimensionality reduction</h2>



<p><strong>Advantages of dimensionality reduction</strong>:</p>



<ul class="wp-block-list">
<li>It helps in data compression by reducing features.</li>



<li>It reduces storage.</li>



<li>It makes machine learning algorithms computationally efficient.</li>



<li>It also helps remove redundant features and noise.</li>



<li>It tackles the curse of dimensionality.</li>
</ul>



<p><strong>Disadvantages of dimensionality reduction</strong>:</p>



<ul class="wp-block-list">
<li>It may lead to some amount of data loss.</li>



<li>Accuracy is compromised.&nbsp;</li>
</ul>



<h2 class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>In this article, we learned about dimensionality reduction and also about the curse of dimensionality. We touched on the different algorithms that are used in dimensionality reduction with mathematical details and through code as well.&nbsp;</p>



<p>It is worth mentioning that these algorithms should be chosen based on the task at hand. For instance, if your data is linear in nature, use decomposition methods; otherwise, use manifold learning techniques.&nbsp;</p>



<p>It is considered good practice to first visualize the data and then decide which method to use. Also, do not restrict yourself to one method: explore several and see which one is the most suitable.</p>



<p>I hope you have learned something from this article. Happy learning.</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://www.geeksforgeeks.org/dimensionality-reduction/" target="_blank" rel="noreferrer noopener nofollow">Introduction to Dimensionality Reduction &#8211; GeeksforGeeks</a></li>



<li><a href="https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/" target="_blank" rel="noreferrer noopener nofollow">Introduction to Dimensionality Reduction for Machine Learning</a></li>



<li><a href="https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202" target="_blank" rel="noreferrer noopener nofollow">Principal component analysis: a review and recent developments</a></li>



<li><a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Linear Discriminant Analysis In Python | by Cory Maklin</a></li>



<li><a href="https://www.kdnuggets.com/2021/01/sparse-features-machine-learning-models.html" target="_blank" rel="noreferrer noopener nofollow">Working With Sparse Features In Machine Learning Models</a></li>



<li><a href="https://www.kdnuggets.com/2017/04/must-know-curse-dimensionality.html" target="_blank" rel="noreferrer noopener nofollow">Must-Know: What is the curse of dimensionality?</a></li>



<li><a href="https://analyticsindiamag.com/curse-of-dimensionality-and-what-beginners-should-do-to-overcome-it/" target="_blank" rel="noreferrer noopener nofollow">Curse Of Dimensionality And What Beginners Should Do To Overcome It</a></li>



<li><a href="https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/" target="_blank" rel="noreferrer noopener nofollow">6 Dimensionality Reduction Algorithms With Python</a></li>



<li><a href="https://scikit-learn.org/stable/modules/classes.html" target="_blank" rel="noreferrer noopener nofollow">Sklearn API References</a></li>



<li><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding" target="_blank" rel="noreferrer noopener nofollow">t-distributed stochastic neighbor embedding</a></li>



<li><a href="https://www.sciencedirect.com/science/article/pii/B9780124095458000029" target="_blank" rel="noreferrer noopener nofollow">Feature Selection and Extraction</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6121</post-id>	</item>
		<item>
		<title>Open Source MLOps: Platforms, Frameworks and Tools</title>
		<link>https://neptune.ai/blog/best-open-source-mlops-tools</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:30:45 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.test/best-open-source-mlops-tools/</guid>

					<description><![CDATA[You don’t need to spend a lot on MLOps tools to bring the magic of DevOps to your machine learning projects. There are plenty of open-source tools to choose from. It’s a good solution when you’re trying to address unique problems and need a community to rely on. But there are some cons to&#8230;]]></description>
										<content:encoded><![CDATA[
<p>You don’t need to spend a lot on <a href="/blog/mlops-tools-platforms-landscape" target="_blank" rel="noreferrer noopener">MLOps tools</a> to bring the magic of DevOps to your machine learning projects. There are plenty of open-source tools to choose from. They are a good solution when you’re trying to address unique problems and need a community to rely on. But there are some cons to open-source machine learning tools too.&nbsp;</p>



<ul class="wp-block-list">
<li>First, be careful—machine learning open source tools aren’t always 100% free all of the time. For example, Kubeflow has client and server components, and both are open. However, some tools might open-source only one of these components: the client is open, but the vendor controls everything server-side.</li>



<li>Free open-source tools can cost you in other ways too. If you consider that you have to host and maintain the tool long-term, you’ll find that open-source can be quite costly after all.&nbsp;</li>



<li>Finally, if something goes awry, you probably won’t have 24/7/365 vendor support to rely on. Community can help you but, obviously, they don’t bear any responsibility for the result you’re left with.</li>
</ul>



<p>Ultimately, open-source tools can be tricky. Before you choose the tool for your project, you need to carefully study its pros and cons. Moreover, you need to make sure that the tools work well with the rest of your stack. This is why I prepared a list of popular and community-approved MLOps platforms, tools, and frameworks for different stages of the model development process.&nbsp;</p>



<p>If you&#8217;re exploring the possibility of integrating open source machine learning platforms into your workflow to simplify <a href="/blog/experiment-management" target="_blank" rel="noreferrer noopener">model development</a> and <a href="/blog/model-deployment-strategies" target="_blank" rel="noreferrer noopener">model deployment</a>, this article is tailored for you. Within these insights, you&#8217;ll discover a compilation of machine learning platforms, frameworks, and specialized tools designed to assist you in data exploration, deployment strategies, and testing procedures.</p>



<p>Furthermore, we&#8217;ve included a <a href="#faq">FAQ section towards the conclusion</a>, offering comprehensive responses to the most commonly posed questions.</p>



<section
	id="i-box-block_4205a9d15b0b03a8244887f2aa407d29"
	class="block-i-box  l-margin__top--large l-margin__bottom--0">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h3 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Interested in other MLOps tools?</strong>
            </h3>		</header>
	
	<div class="block-i-box__inner">
		

<p>When building their ML pipelines, teams usually look into a few other components of the MLOps stack.</p>



<p>If that’s the case for you, here are a few articles you should check:</p>



<ul
    id="arrow-list-block_f1a2711030905652b3f96609e46525f0"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a href="/blog/mlops-tools-platforms-landscape" target="_blank" rel="noreferrer noopener">MLOps Landscape in 2023 [Tools and Platforms]</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a href="/blog/ml-platform-guide" target="_blank" rel="noreferrer noopener">Building an ML Platform [Guide]</a></p>


</li>


</ul>


	</div>

</section>



<h2 class="wp-block-heading" id="h-mlops-open-source-platforms">MLOps open source platforms</h2>



<p>Let us start by exploring the open-source platforms first followed by frameworks and tools.</p>



<h3 class="wp-block-heading" id="h-full-fledged-mlops-open-source-platforms">Full-fledged MLOps open source platforms</h3>



<p>Full-fledged platforms contain tools for all stages of the machine-learning workflow. Ideally, once you get a full-fledged tool, you won’t have to set up any other tools. In practice, it depends on the needs of your project and personal preferences.&nbsp;</p>



<h4 class="wp-block-heading" id="1-kubeflow">Kubeflow</h4>



<p>Almost immediately after Kubernetes established itself as the standard for working with a cluster of containers, Google created <a href="https://www.kubeflow.org/" target="_blank" rel="noreferrer noopener nofollow">Kubeflow</a>—an open-source project that simplifies working with ML in Kubernetes. It has all the advantages of this <a href="/blog/best-workflow-and-pipeline-orchestration-tools" target="_blank" rel="noreferrer noopener">orchestration tool</a>, from the ability to deploy on any infrastructure to managing loosely-coupled microservices, and on-demand scaling.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="912" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=1920%2C912&#038;ssl=1" alt="" class="wp-image-29415" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?w=1920&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=768%2C365&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=200%2C95&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=1536%2C730&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=220%2C105&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=120%2C57&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=160%2C76&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=300%2C143&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=480%2C228&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=1020%2C485&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>  Introduction to Kubeflow, open source MLOps platform | <a href="https://www.kubeflow.org/docs/components/central-dash/overview/" target="_blank" rel="noreferrer noopener nofollow">Source</a> </strong></figcaption></figure>
</div>


<p>This project is for developers who want to deploy portable and scalable machine learning projects. Google didn&#8217;t want to recreate other services. They wanted to create a state-of-the-art open-source system that can be applied alongside various infrastructures—from supercomputers to laptops.&nbsp;</p>



<div id="separator-block_20d44106fb5e580968b51d0194b03143"
         class="block-separator block-separator--5">
</div>



<p>With Kubeflow, you can benefit from the following features:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Jupyter notebooks</strong></li>
</ul>



<p>Create and customize Jupyter notebooks, immediately see the results of running your code, and create interactive analytics reports.</p>



<ul class="wp-block-list">
<li><strong>Custom TensorFlow job operator</strong></li>
</ul>



<p>This functionality helps train your model and apply a TensorFlow or Seldon Core serving container to export the model to Kubernetes.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Simplified containerization</strong></li>
</ul>



<p>Kubeflow eliminates the complexity involved in containerizing the code. Data scientists can perform data preparation, model training, and deployment in less time.</p>



<p>All in all, Kubeflow is a full-fledged solution for the development and deployment of end-to-end machine learning workflows.&nbsp;</p>


    <a
        href="/blog/the-best-kubeflow-alternatives"
        id="cta-box-related-link-block_951a0a55475c47aeced513670b8069d7"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-the-best-kubeflow-alternatives">The Best Kubeflow Alternatives</h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h4 class="wp-block-heading" id="2-mlflow">MLflow</h4>



<p><a href="https://mlflow.org/" target="_blank" rel="noreferrer noopener nofollow">MLflow</a> is an open-source platform for machine learning engineers to manage the machine learning lifecycle through experimentation, deployment, and testing. MLflow comes in handy when you want to track the performance of your machine learning models. It’s like a dashboard, one place where you can:&nbsp;</p>



<ul class="wp-block-list">
<li>monitor machine learning pipelines,&nbsp;</li>



<li>store model metadata, and&nbsp;</li>



<li>pick the best-performing model.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?ssl=1" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1023" height="504" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=1023%2C504&#038;ssl=1" alt="A list of experiment runs with metrics you can use to compare the models in MLFlow, MLOps open source platform" class="wp-image-29326" style="width:810px;height:399px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?w=1023&amp;ssl=1 1023w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=768%2C378&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=200%2C99&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=220%2C108&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=120%2C59&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=160%2C79&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=300%2C148&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=480%2C236&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=1020%2C503&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">UI sample of MLFlow,<strong> open source MLOps platform</strong><em> | <a href="https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#training-the-model" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>MLflow provides four components: Tracking, Projects, Models, and the Model Registry. Here is a closer look at the first three:</p>



<ul class="wp-block-list">
<li><strong>Tracking</strong>&nbsp;</li>
</ul>



<p>The MLflow Tracking component provides an API and UI for logging parameters, code versions, metrics, and output files when running your code, and for visualizing the results afterward. You can log and query experiments using the Python, REST, R, and Java APIs.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Project</strong>&nbsp;</li>
</ul>



<p>MLflow Project is a tool for machine learning teams to package data science code in a reusable and reproducible way. It comes with an API and command-line tools to connect projects into workflows. It helps you run projects on any platform.&nbsp;</p>
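<p>A project is described by an <code>MLproject</code> file at the repository root. A minimal sketch (the entry-point command, parameters, and file names are illustrative):</p>

```yaml
name: churn-model

python_env: python_env.yaml

entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
      data_path: {type: string, default: "data/train.csv"}
    command: "python train.py --lr {learning_rate} --data {data_path}"
```

<p>With this in place, <code>mlflow run . -P learning_rate=0.05</code> executes the project reproducibly on any machine with MLflow installed.</p>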



<ul class="wp-block-list">
<li><strong>Model</strong></li>
</ul>



<p>MLflow Model makes it easy to package machine learning models to be used by various downstream tools, like Apache Spark. With this, deploying machine learning models in diverse serving environments is much more manageable.&nbsp;</p>



<p>Overall, users love MLflow because it’s easy to use locally without a dedicated server and has a fantastic UI where you can explore your experiments.&nbsp;</p>



<section
	id="i-box-block_a2f62b2f2c5b685f7aafef5d905f3b80"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Might be useful</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<div
    id="custom-text-block_29f4d69652f5dd9cf60b3ead84c00976"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--x-small"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 600;"
    >
    
    Unlike manual, homegrown, or open-source solutions, neptune.ai is a scalable full-fledged component with user access management, developer-friendly UX, and advanced collaboration features. 
    </div>



<div
    id="custom-text-block_ea94c55141c47ba4d67d01adfc6ad49a"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--0"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 400;"
    >
    
    That&#8217;s especially valuable for ML/AI teams. Here&#8217;s an example of how Neptune helped Waabi optimize their experiment tracking workflow.
    </div>



<div id="group-of-boxes-block_e1e1a96c89f73139d6a6251b5181fb78" class="b-group-of-boxes  l-padding__top--large l-padding__bottom--large">

<div
    class="c-wrapper c-wrapper--align-auto c-wrapper--align-vertical-auto" >
    <div class="b-group-of-boxes__grid l-grid--cols-2  l-grid--boxes">
        

	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-center c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<blockquote
	id="quote-small-block_2800db17d3394ebab2d6d6317967de89"
	class="block-quote-small ">

	<img
		src="https://neptune.ai/wp-content/themes/neptune/img/icon-quote-small.svg"
		alt=""
		width="24"
		height="18"
		class="c-item__icon">

	
		<div class="c-item__content">

			The product has been very helpful for our experimentation workflows. Almost all the projects in our company are now using Neptune for experiment tracking, and it seems to satisfy all our current needs. It’s also great that all these experiments are available to view for everyone in the organization, making it very easy to reference experimental runs and share results.
							<cite class="c-item__cite">
					<p>James Tu, Research Scientist at Waabi</p>
				</cite>
			
		</div>

	
</blockquote>


	</div>



	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-flex-start c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<div id="app-screenshot-block_ac27a546c2190139651bf5043a85ff41"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=480%2C271&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=768%2C434&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="577"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							target="_blank" rel="nofollow noopener noreferrer"							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

			
</div>


	</div>


    </div>
</div>


</div>



<ul
    id="arrow-list-block_f1a2711030905652b3f96609e46525f0"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Full <a href="/customers/waabi" target="_blank" rel="noreferrer noopener">case study with Waabi</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Dive into<a rel="noreferrer noopener" href="https://docs.neptune.ai/" target="_blank"> documentation</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a rel="noreferrer noopener" href="/contact-us" target="_blank">Get in touch</a>&nbsp;if you’d like to go through a custom demo with your team</p>


</li>


</ul>


	</div>

</section>



<h4 class="wp-block-heading" id="3-metaflow">Metaflow</h4>



<p>Netflix created <a href="https://metaflow.org/" target="_blank" rel="noreferrer noopener nofollow">Metaflow</a> as an open-source MLOps platform for building and managing large-scale, enterprise-level data science projects. Data scientists can use this platform for end-to-end development and <a href="/blog/best-ml-model-deployment-tools" target="_blank" rel="noreferrer noopener">deployment of their machine-learning models</a>.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Great library support</strong></li>
</ul>



<p>Metaflow supports all popular data science tools, like TensorFlow and scikit-learn, so you can keep using your favorite tool. Metaflow supports Python and R, making it even more flexible in terms of library and package choice.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Powerful version control toolkit&nbsp;</strong></li>
</ul>



<p>What is excellent about Metaflow is that it versions and keeps track of all your machine learning experiments automatically. You won’t lose anything important, and you can even inspect the results of all the experiments in notebooks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1564" height="712" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=1564%2C712&#038;ssl=1" alt="Tracking metrics of each run within the project in Metaflow, MLOps open source platform" class="wp-image-7031" style="width:810px;height:369px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?w=1564&amp;ssl=1 1564w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=200%2C91&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=768%2C350&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=1536%2C699&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=220%2C100&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=120%2C55&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=160%2C73&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=300%2C137&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=480%2C219&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=1020%2C464&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Metaflow, <strong>open source MLOps platform</strong> | <em><a href="https://demo.public.outerbounds.xyz/?timerange_start=1444518000000" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>As mentioned above, Metaflow was specifically created for large-scale machine learning development. The solution is powered by the AWS cloud, so there are built-in integrations with AWS storage, compute, and machine learning services if you need to scale. You don’t have to rewrite or change your code to use any of them.&nbsp;</p>



<h4 class="wp-block-heading">Flyte</h4>



<p>If you’re looking for a platform that will take care of <a href="/blog/ml-experiment-tracking">experiment tracking</a> and maintenance for your machine learning project, have a look at <a href="https://github.com/flyteorg/flyte" target="_blank" rel="noreferrer noopener nofollow">Flyte</a>. It is an open-source orchestrator designed to simplify the creation of robust data and machine learning pipelines for production. Its architecture prioritizes scalability and reproducibility, harnessing the power of Kubernetes as its foundational framework. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="707" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=1600%2C707&#038;ssl=1" alt="UI sample of Flyte, MLOps open source platform" class="wp-image-29351" style="width:810px;height:358px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=768%2C339&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=200%2C88&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=1536%2C679&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=220%2C97&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=120%2C53&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=160%2C71&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=300%2C133&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=480%2C212&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=1020%2C451&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Flyte, <strong>open source MLOps platform</strong> | <a href="https://flyte.org/features" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Flyte offers a ton of features and use cases, from simple machine learning projects to complex LLM projects. To give you an overview, I have distilled some of these features and listed them below, but you can check out their <a href="https://flyte.org/" target="_blank" rel="noreferrer noopener nofollow">website</a> and <a href="https://docs.flyte.org/en/latest/" target="_blank" rel="noreferrer noopener nofollow">documentation</a> for the full picture.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Large-scale project support</strong></li>
</ul>



<p>Flyte has helped companies like Lyft execute the large-scale computing that’s crucial to their business. It’s no secret that scaling and monitoring all pipeline changes can be pretty challenging, especially if the workflows have complex data dependencies. Flyte successfully deals with tasks of higher complexity, so developers can focus on business logic rather than machines.</p>



<ul class="wp-block-list">
<li><strong>Improved reproducibility</strong></li>
</ul>



<p>This tool can also help you be sure of the reproducibility of the machine learning models you build. Flyte tracks changes, does version control, and containerizes the model alongside its dependencies.</p>



<ul class="wp-block-list">
<li><strong>Multi-language support</strong></li>
</ul>



<p>Flyte was created to support complex ML projects in Python, Java, or Scala.</p>



<p>Lyft tested Flyte internally before releasing it to the public. It has a proven record of managing more than 7,000 unique workflows totaling 100,000 executions every month.&nbsp;</p>



<h4 class="wp-block-heading" id="4-mlreef">MLReef</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLReef-MLOps-tool.png?ssl=1" alt="UI sample of MLReef, MLOps open source platform" class="wp-image-52731" style="width:842px;height:478px"/><figcaption class="wp-element-caption"><em>UI sample of MLReef, open source MLOps platform | <a href="https://about.mlreef.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><a href="https://about.mlreef.com/" target="_blank" rel="noreferrer noopener nofollow">MLReef</a> is an MLOps platform for teams to collaborate and share the results of their machine learning experiments. Projects are built on reusable machine learning modules created either by you or by the community. This boosts development speed and makes the workflow more efficient by letting team members work in parallel.&nbsp;</p>



<p>MLReef provides tools in four directions:</p>



<ul class="wp-block-list">
<li><strong>Data management</strong></li>
</ul>



<p>You have a fully-versioned data hosting and processing infrastructure for setting up and managing your machine learning models.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Script repositories</strong></li>
</ul>



<p>Every developer has access to containerized and versioned script repositories that you can use in your machine learning pipelines.</p>



<ul class="wp-block-list">
<li><strong>Experiment management&nbsp;</strong></li>
</ul>



<p>You can use MLReef for experiment tracking across different iterations of your project.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>MLOps</strong></li>
</ul>



<p>This solution helps you optimize pipeline management and orchestration, automating routine tasks.</p>



<p>Moreover, MLReef accommodates projects of any size: newcomers can use it for small-scale projects, while experienced developers can rely on it for small, medium-sized, and enterprise projects.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Newcomer</strong></li>
</ul>



<p>If you don’t have much experience developing machine learning models, you’ll find a user-friendly interface and community support for whatever problem you may face.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Experienced</strong></li>
</ul>



<p>MLReef lets you build your project on Git while taking care of all the DevOps mess for you. You can easily monitor progress and outcomes in an automated environment.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Enterprise</strong></li>
</ul>



<p>MLReef for enterprise is easy to scale and control on the cloud or on-premises.</p>



<p>All in all, MLReef is a convenient framework for your machine learning project. With just a couple of easy setups, you’ll be able to develop, test, and optimize your machine learning solution brick-by-brick.&nbsp;</p>



<h4 class="wp-block-heading" id="5-kedro">Seldon Core</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="989" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1920%2C989&#038;ssl=1" alt="Introduction to Seldon Core, open source MLOps platform" class="wp-image-29105" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1920%2C989&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=768%2C396&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=200%2C103&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1536%2C791&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=220%2C113&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=120%2C62&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=160%2C82&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=300%2C155&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=480%2C247&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1020%2C526&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" 
/><figcaption class="wp-element-caption">Introduction to Seldon Core, open source MLOps platform | <a href="https://github.com/SeldonIO/seldon-core" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://github.com/SeldonIO/seldon-core" target="_blank" rel="noreferrer noopener nofollow">Seldon Core</a> is one of the<a href="/blog/best-ml-model-deployment-tools" target="_blank" rel="noreferrer noopener"> platforms for machine learning model deployment</a> on Kubernetes. This platform helps developers serve models in a robust Kubernetes environment, with features like custom resource definitions to manage model graphs. You can also integrate your continuous integration and deployment tools with the platform.</p>



<ul class="wp-block-list">
<li><strong>Build scalable models</strong></li>
</ul>



<p>Seldon Core can convert models built with TensorFlow, PyTorch, H2O, and other frameworks into a scalable microservice architecture based on REST/gRPC.&nbsp;</p>
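<p>Deploying such a model is a matter of applying a <code>SeldonDeployment</code> custom resource to the cluster. A minimal sketch for a scikit-learn model (the resource name and model URI are placeholders):</p>

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model          # placeholder name
spec:
  predictors:
    - name: default
      replicas: 2
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/sklearn/iris   # placeholder path
```

<p>Applying this with <code>kubectl apply -f</code> has Seldon Core wrap the stored model in a pre-packaged server and expose it as a scalable REST/gRPC microservice.</p>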



<ul class="wp-block-list">
<li><strong>Monitor model performance</strong></li>
</ul>



<p>It handles scaling for you and gives you advanced out-of-the-box solutions for measuring model performance, detecting outliers, and conducting A/B tests.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Robust and reliable</strong></li>
</ul>



<p>Seldon Core offers the robustness and reliability of a system backed by continuous maintenance and security policy updates.&nbsp;</p>



<p>Optimized servers provided by Seldon Core allow you to build large-scale deep-learning systems without having to containerize them or worry about their security.</p>



<h4 class="wp-block-heading">Sematic</h4>



<p><a href="https://www.sematic.dev/" target="_blank" rel="noreferrer noopener nofollow">Sematic</a> stands as an open-source machine learning development platform. It grants ML Engineers and Data Scientists the ability to craft intricate end-to-end machine learning pipelines using straightforward Python code, which can then be executed on diverse platforms: their local machine, a cloud VM, or a Kubernetes cluster, harnessing the potential of cloud-based resources.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1230" height="396" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=1230%2C396&#038;ssl=1" alt="Introduction to Sematic, open source MLOps platform | Source" class="wp-image-29109" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?w=1230&amp;ssl=1 1230w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=768%2C247&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=200%2C64&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=220%2C71&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=120%2C39&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=160%2C52&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=300%2C97&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=480%2C155&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=1020%2C328&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Sematic, open source MLOps platform | <a href="https://github.com/sematic-ai/sematic" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>This open source platform draws upon insights amassed from leading self-driving car enterprises. It facilitates the seamless linking of data processing tasks (such as those powered by Apache Spark) with model training endeavors (like PyTorch or TensorFlow), or even arbitrary Python-based business logic. This amalgamation results in the creation of type-safe, traceable, and reproducible end-to-end pipelines. These pipelines, complete with comprehensive monitoring and visualization, are effortlessly managed through a contemporary web dashboard.</p>



<p>Here are some of the features that Sematic offers:</p>



<ul class="wp-block-list">
<li><strong>Smooth Onboarding</strong>&nbsp;</li>
</ul>



<p>Embarking on your journey with Sematic is a breeze – no initial deployment or infrastructure requirements. Simply install Sematic locally and plunge into exploration.</p>



<ul class="wp-block-list">
<li><strong>Parity from Local to Cloud</strong></li>
</ul>



<p>The same code that runs on your personal laptop can be seamlessly executed on your Kubernetes cluster, ensuring consistent outcomes.</p>



<ul class="wp-block-list">
<li><strong>End-to-End Transparency</strong></li>
</ul>



<p>Every artifact of your pipeline is meticulously stored, tracked, and presented within a web dashboard, enabling comprehensive oversight.</p>



<ul class="wp-block-list">
<li><strong>Harnessing Diverse Computing Resources</strong></li>
</ul>



<p>Tailor the resources allocated to each step of your pipeline, optimizing performance and cloud footprint through a range of options including CPUs, memory, GPUs, and Spark clusters.</p>



<ul class="wp-block-list">
<li><strong>Reproducibility at the core</strong></li>
</ul>



<p>Rerun your pipelines with confidence from the intuitive UI, securing the assurance of reproducible results each time.</p>



<p>Sematic introduces an exceptional level of clarity to your machine learning pipelines, affording you an encompassing view of crucial aspects such as artifacts, logs, errors, source control, and dependency graphs. This robust insight is seamlessly coupled with an SDK and GUI that remain both straightforward and instinctive.</p>



<p>Sematic strikes an adept balance by offering a precisely calibrated level of abstraction. This equilibrium empowers ML engineers to concentrate on refining their business logic, all the while harnessing the power of cloud resources – all without the necessity of wielding intricate infrastructure expertise.</p>



<h3 class="wp-block-heading" id="h-data-processing-mlops-open-source-platform">Data-processing MLOps open source platform</h3>



<p>Data-processing platforms are typically used to build a robust data pipeline for any given application. These platforms can scale, optimize, batch, and distribute data streams, among other things.&nbsp;</p>



<h4 class="wp-block-heading">Apache Airflow</h4>



<p><a href="https://airflow.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Apache Airflow</a> emerges as an open-source platform tailored for the development, scheduling, and vigilant monitoring of batch-centric workflows. Airflow&#8217;s expansive Python foundation empowers you to forge intricate workflows, seamlessly bridging connections with a diverse spectrum of technologies. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1343" height="874" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=1343%2C874&#038;ssl=1" alt="UI sample of Apache Airflow, data processing MLOps open source platform" class="wp-image-29111" style="width:810px;height:527px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?w=1343&amp;ssl=1 1343w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=768%2C500&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=200%2C130&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=220%2C143&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=120%2C78&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=160%2C104&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=300%2C195&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=480%2C312&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=1020%2C664&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Apache Airflow, data processing MLOps <strong>open source platform</strong> | <a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html" target="_blank" rel="noreferrer 
noopener nofollow">Source</a></figcaption></figure>
</div>


<p>A user-friendly web interface takes charge of workflow management, meticulously overseeing their state. From deploying as a singular process on your personal laptop to configuring a distributed setup capable of supporting the most intricate workflows, Airflow accommodates a plethora of deployment options.</p>

<p>A distinctive hallmark of Airflow workflows is their anchoring within Python code. This &#8220;workflows as code&#8221; paradigm serves a multifaceted role:</p>



<ul class="wp-block-list">
<li><strong>Dynamic Prowess</strong></li>
</ul>



<p>Airflow pipelines are molded through Python code, instilling the capability for dynamic pipeline generation.</p>



<ul class="wp-block-list">
<li><strong>Inherent Extensibility</strong></li>
</ul>



<p>The Airflow framework houses a range of operators that seamlessly interface with a multitude of technologies. Every component of Airflow retains an intrinsic extensibility, seamlessly adapting to your unique environment.</p>



<ul class="wp-block-list">
<li><strong>Supreme Flexibility</strong></li>
</ul>



<p>The fabric of workflow parameterization is woven into the system, harnessing the prowess of the Jinja templating engine for streamlined customization.</p>



<p>Apache Airflow is a versatile addition to any machine learning stack, offering dynamic workflow orchestration that adapts to changing data and requirements. With its flexibility, extensive connectivity, and scalability, Airflow allows machine learning practitioners to build custom workflows as code while integrating various technologies. Its monitoring capabilities, community support, and compatibility with cloud resources enhance ML reproducibility, collaboration, and efficient resource utilization in machine learning operations.</p>



<h3 class="wp-block-heading" id="h-monitoring-mlops-open-source-platform">Monitoring MLOps open source platform</h3>



<h4 class="wp-block-heading">EvidentlyAI</h4>



<p>EvidentlyAI is an open-source observability platform that allows you to evaluate, test, and monitor machine learning models. The platform covers the phases from validation to production. It offers services for tabular data, embeddings, and text-based models and data, and it has also extended its services to cater to large language models (LLMs).&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1080" height="674" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=1080%2C674&#038;ssl=1" alt="UI sample of EvidentlyAI, monitoring MLOps open source platform" class="wp-image-29113" style="width:810px;height:506px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?w=1080&amp;ssl=1 1080w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=768%2C479&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=200%2C125&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=220%2C137&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=120%2C75&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=160%2C100&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=300%2C187&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=480%2C300&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=1020%2C637&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of EvidentlyAI, monitoring MLOps <strong>open source platform</strong> | <a href="https://www.evidentlyai.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>These are some of the products that EvidentlyAI offers:</p>



<div id="case-study-numbered-list-block_d008e7fe231603fae7c84e9709660723"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data and model visualization dashboard            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Data and ML monitoring             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Data quality and integrity check             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Data drift monitoring             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                ML model performance monitoring            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                NLP and LLM monitoring.             </li>
            </ul>
</div>



<p>With these products, EvidentlyAI offers the following features:</p>



<ul class="wp-block-list">
<li><strong>Build reports</strong></li>
</ul>



<p>The plug-and-play capabilities allow users to easily build reports for dataset and model performance. The reports are interactive and easy to share.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Test your pipelines</strong></li>
</ul>



<p>EvidentlyAI test suites allow you to create test pipelines for your machine learning models and data, for example, to check whether any drift is detected.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Monitoring</strong></li>
</ul>



<p>With dashboard capabilities and a wide range of testing methods, Evidently makes monitoring and debugging machine learning models simple and interactive.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Data quality</strong></li>
</ul>



<p>With EvidentlyAI, you can run various exploratory analyses to verify the quality and integrity of your data, often spotting issues with a single line of code.&nbsp;</p>



<p>EvidentlyAI is an easy-to-use platform with a strong feature set for testing and monitoring, and it keeps up with current trends by extending its services to LLMs.&nbsp;</p>



<h2 class="wp-block-heading" id="h-mlops-open-source-frameworks">MLOps open source frameworks</h2>



<p>Now that we have covered open-source platforms, let us dive into the frameworks.&nbsp;</p>



<h3 class="wp-block-heading" id="h-workflow-open-source-mlops-frameworks">Workflow open source MLOps frameworks</h3>



<p>Workflow frameworks provide a structured approach to streamlining the different phases of your MLOps applications. Keep in mind that some frameworks cover only one or two phases, while others span multiple phases.&nbsp;</p>



<h4 class="wp-block-heading">Kedro&nbsp;</h4>



<p><a href="https://github.com/quantumblacklabs/kedro">Kedro</a> is a Python framework for machine learning engineers and data scientists to create reproducible and maintainable code.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1578" height="904" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=1578%2C904&#038;ssl=1" alt="UI sample of Kedro, MLOps open source framework" class="wp-image-29387" style="width:810px;height:548px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?w=1578&amp;ssl=1 1578w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=768%2C440&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=200%2C115&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=1536%2C880&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=220%2C126&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=120%2C69&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=160%2C92&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=300%2C172&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=480%2C275&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=1020%2C584&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><br>UI sample of Kedro, MLOps <strong>open source </strong>framework | <a href="https://kedro.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>This framework is your best friend if you want to organize your <strong>data pipeline</strong> and make machine learning project development much more efficient. You won’t have to waste time on code rewrites and will have more opportunities for focusing on robust pipelines. Moreover, Kedro helps teams establish collaboration standards to limit delays and build scalable, deployable projects.</p>



<p>Kedro has many good features:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Project templates</strong></li>
</ul>



<p>Usually, you have to spend a lot of time understanding how to set up your analytics project. Kedro provides a standard template that will save you time.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Data management</strong></li>
</ul>



<p>Kedro helps you load and store data so you no longer have to worry about the reproducibility and scalability of your code.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Configuration management</strong></li>
</ul>



<p>This is a necessary tool when you’re working with complex software systems. If you don’t pay enough attention to configuration management, you might encounter serious reliability and scalability problems.&nbsp;&nbsp;</p>



<p>Kedro promotes a data-driven approach to ML development and maintains industry-level standards while decreasing operational risks for businesses.&nbsp;</p>


    <a
        href="/blog/data-science-pipelines-with-kedro"
        id="cta-box-related-link-block_0678a8f63876d138f2c3439906cb82ab"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-building-and-managing-data-science-pipelines-with-kedro">Building and Managing Data Science Pipelines with Kedro</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h4 class="wp-block-heading">ZenML</h4>



<p><a href="https://docs.zenml.io/" target="_blank" rel="noreferrer noopener nofollow">ZenML</a> is an MLOps framework for orchestrating your machine learning <strong>experiment pipeline</strong>. It provides you with tools to:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="832" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1920%2C832&#038;ssl=1" alt="Introduction to ZenML, MLOps open source framework" class="wp-image-29120" style="width:810px;height:351px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1920%2C832&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=768%2C333&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=200%2C87&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1536%2C665&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=220%2C95&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=120%2C52&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=160%2C69&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=300%2C130&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=480%2C208&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1020%2C442&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?w=1999&amp;ssl=1 1999w" sizes="auto, 
(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to ZenML, MLOps <strong>open source </strong>framework | <a href="https://www.zenml.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Preprocess data</strong></li>
</ul>



<p>ZenML helps you convert raw data into analysis-ready data.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Train your models</strong></li>
</ul>



<p>Among other tools for convenient model training, the platform uses declarative pipeline configs, so you can switch between on-premise and cloud environments easily.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Conduct split testing&nbsp;</strong></li>
</ul>



<p>ZenML creators claim that the platform’s key benefits are automated experiment tracking and guaranteed comparability between experiments.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Evaluate the results</strong></li>
</ul>



<p>ZenML focuses on making ML development reproducible and straightforward for both individual developers and large teams.&nbsp;</p>



<p>This framework frees you from all the troubles of delivering machine learning models with traditional tools. If you struggle with providing enough experiment data that prove the reproducibility of results, want to reduce waste and make the reuse of code simpler, ZenML will help.&nbsp;</p>



<h3 class="wp-block-heading" id="h-deployment-and-serving-open-source-mlops-framework">Deployment and serving open source MLOps framework</h3>



<h4 class="wp-block-heading">BentoML</h4>



<p><a href="https://www.bentoml.com/" target="_blank" rel="noreferrer noopener nofollow">BentoML</a> is a framework that allows you to build, deploy and scale any machine learning application. BentoML provides a way to bundle your trained models, along with any preprocessing, post-processing, and custom code, into a containerized format.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="796" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868-1920x796.png?resize=1920%2C796&#038;ssl=1" alt="UI sample of BentoML, MLOps open source framework" class="wp-image-29451" style="width:820px;height:340px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=1920%2C796&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=768%2C319&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=200%2C83&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=1536%2C637&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=2048%2C849&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=220%2C91&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=120%2C50&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=160%2C66&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=300%2C124&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=480%2C199&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=1020%2C423&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of BentoML, MLOps <strong>open source </strong>framework | <a href="https://docs.bentoml.org/en/0.13-lts/concepts.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Some of the key features of BentoML include:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Model Serving</strong></li>
</ul>



<p>BentoML allows you to easily serve your machine learning models with a REST API. It abstracts away the complexities of serving machine learning models and managing infrastructure.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Model Packaging</strong></li>
</ul>



<p>You can package your trained models, along with dependencies and custom code, into a single deployable artifact. This makes it simple to reproduce your model deployments.</p>



<ul class="wp-block-list">
<li><strong>Multi-Framework Support</strong>&nbsp;</li>
</ul>



<p>BentoML supports a variety of machine learning frameworks, such as TensorFlow, PyTorch, Scikit-learn, XGBoost, and more.</p>



<ul class="wp-block-list">
<li><strong>Deployment Flexibility</strong></li>
</ul>



<p>You can deploy BentoML models in various environments, including local servers, cloud platforms, and Kubernetes clusters.</p>



<ul class="wp-block-list">
<li><strong>Scalability</strong></li>
</ul>



<p>BentoML supports high-throughput serving, making it suitable for machine learning applications that require efficient and scalable model deployments.</p>



<ul class="wp-block-list">
<li><strong>Versioning</strong></li>
</ul>



<p>BentoML allows you to version your model artifacts and easily switch between different versions for serving.</p>



<ul class="wp-block-list">
<li><strong>Monitoring and Logging</strong></li>
</ul>



<p>BentoML provides features for monitoring the health and performance of your deployed models, including logging and metrics.</p>



<ul class="wp-block-list">
<li><strong>Customization</strong></li>
</ul>



<p>You can customize the deployment environment, preprocessing, post-processing, and other aspects of your deployed model.</p>



<p>BentoML can be an important resource in your ML arsenal as it can essentially offer so much more with ease and reliability. With BentoML you can deploy your machine learning models as REST APIs, Docker containers, or even serverless functions.&nbsp;</p>



<h3 class="wp-block-heading" id="h-workflow-orchestration-open-source-mlops-framework">Workflow orchestration open source MLOps framework</h3>



<h4 class="wp-block-heading">Argo Workflow</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1276" height="751" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=1276%2C751&#038;ssl=1" alt="UI sample of Argo Workflow, MLOps open source framework " class="wp-image-29122" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?w=1276&amp;ssl=1 1276w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=768%2C452&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=200%2C118&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=220%2C129&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=120%2C71&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=160%2C94&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=300%2C177&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=480%2C283&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=1020%2C600&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Argo Workflow, MLOps <strong>open source </strong>framework | <a href="https://argoproj.github.io/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://argoproj.github.io/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">Argo Workflow</a> is a lightweight, easy-to-use orchestration tool for Kubernetes. Workflows are defined in YAML and implemented as a Kubernetes CRD (Custom Resource Definition). It is open source and trusted by a large community.&nbsp;</p>



<p>Argo Workflow integrates with a wide range of ecosystem projects, some of which are:</p>



<div id="case-study-numbered-list-block_328243cc7eb4a653b9d5b064736457fa"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Kedro            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Kubeflow Pipelines            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Seldon            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                SQLFlow            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Argo Events            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Couler            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Hera            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Katib            </li>
            </ul>
</div>



<p>Argo Workflow also supports Python-based environments. Although Argo offers quite a number of features, I have listed a few that stand out:</p>



<ul class="wp-block-list">
<li><strong>UI</strong></li>
</ul>



<p>Argo provides a user interface that allows users to manage their workflows with ease.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Artifact Support</strong></li>
</ul>



<p>You can integrate storage platforms such as S3 or Azure Blob Storage to store your workflow artifacts.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Scheduling</strong></li>
</ul>



<p>You can schedule your whole ML workflow using cron. This allows you to schedule jobs and tasks to run automatically at specific times, on specific days, or at regular intervals.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Kubernetes</strong>&nbsp;</li>
</ul>



<p>If you already work with Kubernetes clusters, Argo is the go-to choice. One key feature is that Argo runs each step as a container.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Efficiency</strong></li>
</ul>



<p>It handles compute-intensive data processing and ML jobs efficiently and reliably.&nbsp;</p>



<p>You can find the full documentation <a href="https://argoproj.github.io/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</p>
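<p>For illustration, a minimal CronWorkflow manifest (the names and schedule below are hypothetical) shows both the cron-based scheduling and the fact that each step runs as its own container:</p>

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-retrain        # illustrative name
spec:
  schedule: "0 2 * * *"        # run every night at 02:00
  workflowSpec:
    entrypoint: retrain
    templates:
      - name: retrain
        container:              # each step is a container
          image: python:3.11
          command: [python, -c]
          args: ["print('retraining model')"]
```

<p>Submitting this manifest with <code>kubectl apply</code> or the <code>argo</code> CLI registers the schedule with the cluster.</p>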



<h2 class="wp-block-heading" id="h-mlops-open-source-tools">MLOps open source tools</h2>



<p>Open-source tools and libraries each address one specific aspect of your machine learning application. You can pick any of these tools, use it in your own application, and fit it into your framework of choice. One key advantage is that these tools are compatible with most working environments.</p>



<p>In this list, we will cover some of the major areas in ML lifecycle where open-source tools will get the job done for you.&nbsp;</p>



<h3 class="wp-block-heading" id="h-development-and-deployment-open-source-ml-tools">Development and deployment open source ML tools</h3>



<h4 class="wp-block-heading">MLRun</h4>



<p><a href="https://www.iguazio.com/open-source/mlrun/" target="_blank" rel="noreferrer noopener nofollow">MLRun</a> is a tool for machine learning model development and deployment. If you’re looking for a tool that conveniently runs in a <strong>wide variety of environments</strong> and <strong>supports multiple technology</strong> stacks, it’s definitely worth a try. MLRun offers a comprehensive approach to managing data pipelines.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1903" height="965" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=1903%2C965&#038;ssl=1" alt="UI sample of MLRun, development and deployment open source ML tool" class="wp-image-29125" style="width:811px;height:411px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?w=1903&amp;ssl=1 1903w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=768%2C389&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=200%2C101&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=1536%2C779&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=220%2C112&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=120%2C61&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=160%2C81&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=300%2C152&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=480%2C243&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=1020%2C517&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of MLRun, development and deployment 
open source ML tool | <a href="https://github.com/mlrun/mlrun" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>MLRun has a layered architecture that offers the following powerful functionality:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Feature and artifact store</strong></li>
</ul>



<p>This layer helps you to handle the preparation and processing of data and store it across different repositories.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Elastic serverless runtimes layer</strong></li>
</ul>



<p>Convert simple code into microservices that are easy to scale and maintain. It’s compatible with standard runtime engines like Kubernetes jobs, Dask, and Apache Spark.</p>



<ul class="wp-block-list">
<li><strong>Automation layer</strong></li>
</ul>



<p>So that you can concentrate on training the model and fine-tuning hyperparameters, the pipeline automation layer takes care of data preparation, testing, and real-time deployment. You only need to provide supervision to create a state-of-the-art ML solution.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Central management layer</strong></li>
</ul>



<p>Here, you get access to a unified dashboard to manage your whole workflow. MLRun has a convenient user interface, a CLI, and an SDK that you can access anywhere.</p>



<p>With MLRun, you can write code once and then use automated solutions to run it on different platforms. The tool manages the build process, execution, data movement, scaling, versioning, parameterization, output tracking, and more.&nbsp;</p>



<h4 class="wp-block-heading">CML (Continuous Machine Learning)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="740" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1920%2C740&#038;ssl=1" alt="Introduction to CML, development and deployment open source ML tool" class="wp-image-29474" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1920%2C740&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=768%2C296&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=200%2C77&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1536%2C592&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=2048%2C789&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=220%2C85&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=120%2C46&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=160%2C62&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=300%2C116&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=480%2C185&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1020%2C393&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to CML, development and deployment open source ML tool  | <a href="https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://cml.dev/" target="_blank" rel="noreferrer noopener nofollow">CML</a> (Continuous Machine Learning) is a library for continuous integration and delivery (CI/CD) of machine learning projects. The library was developed by the creators of DVC, an open-source library for versioning machine learning models and experiments. Together with DVC, TensorBoard, and cloud services, CML facilitates the process of developing machine learning models and shipping them into products.</p>



<ul class="wp-block-list">
<li><strong>Automate pipeline building</strong></li>
</ul>



<p>CML was designed to automate parts of a machine learning engineer&#8217;s work: running training experiments, evaluating models, and managing datasets and their updates.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Integrate APIs&nbsp;</strong></li>
</ul>



<p>The tool is positioned as a library that supports GitFlow for data science projects, allows automatic generation of reports, and hides the complex details of using external services. Examples of external services include cloud platforms: AWS, Azure, GCP, and others. For infrastructure tasks, DVC, Docker, and Terraform are also used. Recently, the infrastructural side of machine learning projects has been attracting more attention.&nbsp;</p>



<p>The library is flexible and provides a wide range of functionality, from sending reports and publishing data to distributing cloud resources for a project.</p>



<h3 class="wp-block-heading" id="h-automl-open-source-tools">AutoML open source tools</h3>



<h4 class="wp-block-heading">AutoKeras</h4>



<p><a href="http://autokeras.com/" target="_blank" rel="noreferrer noopener nofollow">AutoKeras</a> is an open-source library for Automated Machine Learning (AutoML). With AutoML frameworks, you can automate the processing of raw data, choose a machine learning model, and optimize the hyperparameters of the learning algorithm.</p>



<ul class="wp-block-list">
<li><strong>Streamline machine learning model development</strong></li>
</ul>



<p>AutoML reduces the bias and variance that creep in when humans develop machine learning models by hand, and it streamlines model development.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Enjoy automated hyperparameter tuning</strong></li>
</ul>



<p>AutoKeras provides functionality to automatically search for suitable architectures and hyperparameters of deep learning models.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Build flexible solutions</strong></li>
</ul>



<p>AutoKeras is best known for its flexibility: the code you write executes regardless of the backend. It supports Theano, TensorFlow, and other frameworks.</p>



<p>AutoKeras ships with several built-in training datasets. They come in a form that&#8217;s convenient to work with, but they don&#8217;t show the full power of AutoKeras. It also contains tools for preprocessing text, images, and time series, the most common data types, which makes the data preparation process much more manageable. The tool also has built-in visualization for models.</p>
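<p>The automated tuning AutoKeras performs can be pictured as a loop of &#8220;propose a configuration, evaluate it, keep the best&#8221;. Below is a minimal stdlib-only sketch of that search loop; the <code>evaluate</code> function is a made-up stand-in for actually training a model, and none of this is AutoKeras&#8217;s real API:</p>

```python
import random

# Toy "training" objective: quality peaks around lr=0.01 and 64 units.
# In real AutoML this would be a full train-and-validate run.
def evaluate(config):
    return -((config["lr"] - 0.01) ** 2) * 1e4 - ((config["units"] - 64) / 64) ** 2

def random_search(trials, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = {
            "lr": 10 ** rng.uniform(-4, -1),          # log-uniform learning rate
            "units": rng.choice([16, 32, 64, 128]),   # layer width
        }
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(trials=50)
print(best)
```

<p>AutoKeras uses far smarter search strategies than this, but the contract is the same: you supply data and a budget of trials, and the library returns the best configuration it found.</p>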



<h4 class="wp-block-heading">H2O AutoML</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1013" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1920%2C1013&#038;ssl=1" alt="UI sample of H2O.ai, autoML open source tool" class="wp-image-29483" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1920%2C1013&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=768%2C405&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=200%2C106&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1536%2C810&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=2048%2C1080&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=220%2C116&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=300%2C158&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=480%2C253&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1020%2C538&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?w=3000&amp;ssl=1 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of H2O.ai, autoML open source tool | <a href="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://www.h2o.ai/products/h2o-automl/" target="_blank" rel="noreferrer noopener nofollow">H2O.ai</a> is a software platform that optimizes the machine learning process using AutoML. H2O claims the platform can train models faster than popular machine learning libraries such as scikit-learn.&nbsp;</p>



<p>H2O is a machine learning, predictive data analytics platform for building machine learning models and generating production code for them in Java and Python, all at the click of a button.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Implement ML models out-of-the-box</strong></li>
</ul>



<p>It has implementations of supervised and unsupervised algorithms such as GLM and K-Means, and an easy-to-use web interface called Flow.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Tailor H2O to your needs&nbsp;</strong></li>
</ul>



<p>The tool is helpful for beginners and seasoned developers alike. It gives the coder a simple wrapper function that handles modeling-related tasks in a few lines of code. Experienced machine learning engineers appreciate this function, since it lets them focus on more thought-intensive parts of building models, like data exploration and feature engineering.&nbsp;</p>



<p>Overall, H2O is a powerful tool for solving machine learning and data science problems. Even beginners can extract value from data and build robust models. H2O continues to grow and release new products while maintaining high quality across the board.</p>



<h4 class="wp-block-heading">EvalML&nbsp;</h4>



<p><a href="https://evalml.alteryx.com/" target="_blank" rel="noreferrer noopener nofollow">EvalML</a> is a library for building, optimizing, and evaluating machine learning pipelines. EvalML offers <strong>end-to-end supervised</strong> machine learning solutions that leverage <em>Featuretools</em> and <em>Compose</em>. The former is a framework for <strong>automated feature engineering</strong> on relational datasets, and the latter is used to <strong>automate prediction engineering</strong>.&nbsp;</p>



<p>With these automated capabilities, EvalML offers four important functionalities:</p>



<ul class="wp-block-list">
<li><strong>Automation</strong></li>
</ul>



<p>It takes manual work out of the picture, letting you build machine learning models with ease. Automation covers data quality checks, cross-validation, and many other features.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Data Checks</strong></li>
</ul>



<p>As the name suggests, it inspects data integrity and surfaces issues such as duplicates and imbalanced class distributions before you use the data to train a model.&nbsp;</p>
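<p>The kinds of issues such checks surface are easy to illustrate. Here is a stdlib-only sketch of a duplicate and class-imbalance check on rows represented as dicts; the function name and warning format are invented for illustration and are not EvalML&#8217;s API:</p>

```python
from collections import Counter

def data_checks(rows, label_key="label", imbalance_threshold=0.1):
    """Return a list of warnings about duplicate rows and rare classes."""
    warnings = []
    # Count identical rows by hashing their sorted (key, value) pairs.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    dupes = sum(c - 1 for c in seen.values() if c > 1)
    if dupes:
        warnings.append(f"{dupes} duplicate row(s) found")
    # Flag classes whose share of the data falls below the threshold.
    labels = Counter(r[label_key] for r in rows)
    for label, count in labels.items():
        if count / len(rows) < imbalance_threshold:
            warnings.append(f"class {label!r} is rare ({count}/{len(rows)} rows)")
    return warnings

rows = [{"x": 1, "label": "a"}] * 10 + [{"x": 2, "label": "b"}]
print(data_checks(rows))
```

<p>EvalML runs checks like these automatically before training, so problems are caught early rather than after a long AutoML search.</p>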



<ul class="wp-block-list">
<li><strong>End-to-end</strong></li>
</ul>



<p>Offers end-to-end functionality, including data preprocessing, feature engineering, feature selection, and various machine learning modeling techniques.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Model Understanding</strong></li>
</ul>



<p>It helps you to understand and inspect your machine learning model.</p>



<p>To conclude, EvalML is an amazing tool that essentially automates two major phases of the ML lifecycle: data preprocessing and ML modeling. It has an active list of contributors, and the library is updated on a day-to-day basis. You can integrate this lightweight library into your own application with ease, as the documentation is straightforward and easy to understand.&nbsp;</p>



<h4 class="wp-block-heading">Neural Network Intelligence (NNI)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1612" height="802" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=1612%2C802&#038;ssl=1" alt="Introduction to NNI, autoML open source tool " class="wp-image-29133" style="width:806px;height:401px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?w=1612&amp;ssl=1 1612w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=768%2C382&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=200%2C100&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=1536%2C764&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=220%2C109&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=120%2C60&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=160%2C80&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=300%2C149&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=480%2C239&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=1020%2C507&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to NNI, autoML open source tool | <a 
href="https://github.com/microsoft/nni" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://github.com/microsoft/nni" target="_blank" rel="noreferrer noopener nofollow">NNI</a> or Neural Network Intelligence is a lightweight tool created by Microsoft for automating neural network optimization. This open-source toolkit allows users to automate feature engineering, neural architecture search or NAS, model compression, and hyper-parameter tuning.&nbsp;</p>



<p>NNI offers simple function calls in Python. Like other Python libraries and frameworks, NNI can be dropped into an existing pipeline. All you need is a working PyTorch environment, and you can plug in and automate an optimization technique with a single function call. For instance, if you want to perform:</p>



<ul class="wp-block-list">
<li>Hyperparameter tuning: call <code>nni.get_next_parameter()</code></li>



<li>Model pruning: call one of the pruning methods, such as <code>L1NormPruner(model, config)</code></li>



<li>Model quantization: call a quantization function, such as <code>QAT_Quantizer(model, config)</code></li>



<li>Neural architecture search: pick a strategy and an evaluator, such as <code>RegularizedEvolution()</code> and <code>FunctionalEvaluator()</code>, respectively</li>
</ul>
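<p>A trial script typically asks NNI for the next parameter set and reports a metric back. The search space below follows NNI&#8217;s <code>"_type"</code>/<code>"_value"</code> convention; since <code>nni.get_next_parameter()</code> only returns values inside a running NNI experiment, the sketch simulates the sampling step with the standard library:</p>

```python
import math
import random

# Search space in NNI's {"_type": ..., "_value": ...} convention.
search_space = {
    "lr": {"_type": "loguniform", "_value": [1e-4, 1e-1]},
    "batch_size": {"_type": "choice", "_value": [16, 32, 64]},
}

def sample(space, seed=None):
    """Stand-in for nni.get_next_parameter(): draw one config from the space."""
    rng = random.Random(seed)
    params = {}
    for name, spec in space.items():
        if spec["_type"] == "choice":
            params[name] = rng.choice(spec["_value"])
        elif spec["_type"] == "loguniform":
            lo, hi = spec["_value"]
            params[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return params

params = sample(search_space, seed=0)
print(params)
# In a real trial you would train with `params` and then call
# nni.report_final_result(metric) so the tuner can propose the next config.
```

<p>Because the trial only talks to NNI through these two calls, swapping tuning algorithms is a configuration change rather than a code change.</p>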



<p>There are other features as well, such as one-shot neural architecture search and feature engineering. The idea NNI puts forward is to automate neural network model engineering.&nbsp;</p>



<p>Essentially, NNI eases the model building and engineering phase while letting you manage AutoML experiments. On top of all the above, it provides a dashboard where you can monitor the tuning process and control the experiments. If you spend a lot of time building and fine-tuning models, this tool is a necessity.&nbsp;</p>



<h3 class="wp-block-heading" id="h-data-validation-open-source-ml-tools">Data validation open source ML tools</h3>



<p>Data validation is the process of checking data quality. During this stage, you make sure that there are no inconsistencies or missing data in your sets. Data validation tools automate this routine process and improve the quality of data cleansing.&nbsp;&nbsp;</p>



<h4 class="wp-block-heading">Hadoop&nbsp;</h4>



<p><a href="https://hadoop.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Hadoop</a> is a freely redistributable set of utilities, libraries, and frameworks for developing and executing programs running on clusters. This fundamental technology for storing and processing Big Data is a top-level project of the Apache Software Foundation.</p>



<p>The project consists of 4 main modules:</p>



<ul class="wp-block-list">
<li><strong>Hadoop Common</strong></li>
</ul>



<p>Hadoop Common is a set of infrastructure software libraries and utilities that are used in other solutions and related projects, in particular, for managing distributed files and creating the necessary infrastructure.</p>



<ul class="wp-block-list">
<li><strong>HDFS is a distributed file system</strong></li>
</ul>



<p>Hadoop Distributed File System is a technology for storing files on various data servers with addresses located on a special name server. HDFS provides reliable storage of large files, block-by-block distributed between the nodes of the computing cluster.</p>



<ul class="wp-block-list">
<li><strong>YARN is a task scheduling and cluster management system</strong></li>
</ul>



<p>YARN is a set of system programs that provide sharing, scalability, and reliability of distributed applications.</p>



<ul class="wp-block-list">
<li><strong>Hadoop MapReduce</strong></li>
</ul>



<p>This is a platform for programming and performing distributed MapReduce calculations using many computers that form a cluster.</p>
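<p>The MapReduce model itself is simple to illustrate: a map step emits key&#8211;value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Here is a single-process, stdlib-only word count written in that style; Hadoop, of course, runs the same three phases distributed across a cluster:</p>

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group (here, by summing the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # → {'big': 3, 'data': 2, 'cluster': 1}
```

<p>Because map and reduce are pure functions over independent chunks, the framework can parallelize them freely, which is what makes the paradigm scale to cluster-sized data.</p>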



<p>Today, there’s a whole ecosystem of related projects and technologies around Hadoop used for data mining and machine learning.</p>



<h4 class="wp-block-heading">Apache Spark&nbsp;</h4>



<p><a href="https://spark.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Apache Spark</a> helps you process semi-structured data in memory. The main advantages of Spark are performance and a user-friendly programming interface.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1328" height="496" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=1328%2C496&#038;ssl=1" alt="UI sample of Apache Spark, data validation open source ML tool" class="wp-image-29493" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?w=1328&amp;ssl=1 1328w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=768%2C287&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=200%2C75&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=220%2C82&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=120%2C45&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=160%2C60&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=300%2C112&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=480%2C179&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=1020%2C381&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Apache Spark, data validation open source tool | <a href="https://spark.apache.org/docs/latest/web-ui.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The framework has five components: a core and four libraries, each solving a specific problem.</p>



<ul class="wp-block-list">
<li><strong>Spark Core</strong></li>
</ul>



<p>This is the core of the framework. You can use it for scheduling and core I/O functionality.</p>



<ul class="wp-block-list">
<li><strong>Spark SQL</strong></li>
</ul>



<p>Spark SQL is one of the four framework libraries; it comes in handy when processing structured data. To run faster, it uses DataFrames and can act as a distributed SQL query engine.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Spark Streaming</strong></li>
</ul>



<p>This is an easy-to-use streaming data processing tool. It breaks data streams into micro-batches, and the creators of Spark claim that performance does not suffer much from this.</p>
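<p>Micro-batching means collecting incoming records for a short window and processing each batch as a small, ordinary job. A stdlib-only illustration of the idea follows; note that this is not Spark&#8217;s API, and real Spark Streaming batches by time interval rather than by count as this toy does:</p>

```python
def micro_batches(stream, batch_size):
    """Group a (potentially unbounded) record stream into micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each micro-batch is then processed like a small batch job.
events = range(7)
results = [sum(batch) for batch in micro_batches(events, batch_size=3)]
print(results)  # → [3, 12, 6]
```

<p>The appeal of this design is that the same batch-processing engine and APIs can serve streaming workloads, at the cost of per-batch latency.</p>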



<ul class="wp-block-list">
<li><strong>MLlib</strong></li>
</ul>



<p>This is a high-speed distributed machine learning system. It is nine times faster than its competitor, the Apache Mahout library, when benchmarked on the alternating least squares (ALS) algorithm. MLlib includes popular algorithms for classification, regression, and recommender systems.</p>



<ul class="wp-block-list">
<li><strong>GraphX</strong></li>
</ul>



<p>GraphX is a library for scalable graph processing. GraphX is not suitable for graphs that change in a transactional manner, for example, databases.</p>



<p>Spark is entirely autonomous but also compatible with other standard ML instruments, like Hadoop, if needed.</p>



<h4 class="wp-block-heading">Great Expectations</h4>



<p>For effective management of intricate data pipelines, data practitioners recognize the significance of testing and documentation. GX offers a solution for swift deployment of adaptable, expandable data quality testing within data stacks. Its user-friendly documentation ensures accessibility for both technical and non-technical users.</p>



<p>Great Expectations (GX) assists data teams in fostering a collective comprehension of their data by incorporating quality testing, documentation, and profiling.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="753" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=1600%2C753&#038;ssl=1" alt="UI sample of Great Expectations, data validation open source ML tool" class="wp-image-29498" style="width:810px;height:381px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=768%2C361&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=200%2C94&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=1536%2C723&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=220%2C104&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=120%2C56&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=160%2C75&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=300%2C141&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=480%2C226&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=1020%2C480&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Great Expectations, data validation open source ML tool | <a href="https://docs.greatexpectations.io/docs/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Some of the key features are:</p>



<ul class="wp-block-list">
<li><strong>Seamless Integration</strong></li>
</ul>



<p>GX seamlessly integrates into your current tech stack and can be linked with your CI/CD pipelines, enabling precise data quality enhancement. Validate and connect with your existing data, enabling Expectation Suites to perfectly address your data quality requisites.</p>



<ul class="wp-block-list">
<li><strong>Quick Start</strong></li>
</ul>



<p>GX produces valuable outcomes promptly, even for large datasets. Its Data Assistants offer curated Expectations tailored for various domains, accelerating data discovery for rapid deployment of data quality across pipelines. Auto-generated Data Docs ensure ongoing up-to-date documentation.</p>



<ul class="wp-block-list">
<li><strong>Unified Insight</strong></li>
</ul>



<p>Expectations serve as GX&#8217;s core abstraction, articulating anticipated data states. The Expectation library employs a human-readable vocabulary, catering to technical and non-technical users alike. Bundled into Expectation Suites, they characterize your expectations of the data well.</p>
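<p>An Expectation is essentially a named, declarative assertion about data that returns a structured result instead of raising an error. The sketch below borrows GX&#8217;s human-readable naming style to show the idea in plain Python over rows represented as dicts; it is an illustration of the concept, not the actual <code>great_expectations</code> API, whose Expectations run against Datasources and validators:</p>

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Declarative data check in the style of a GX Expectation."""
    unexpected = [r[column] for r in rows
                  if not (min_value <= r[column] <= max_value)]
    return {
        "success": not unexpected,       # did every value satisfy the rule?
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected, # evidence for the failure, if any
    }

rows = [{"age": 31}, {"age": 27}, {"age": -4}]
result = expect_column_values_to_be_between(rows, "age", 0, 120)
print(result)
```

<p>Because the result is structured rather than a pass/fail exception, it can feed documentation, dashboards, and Checkpoint actions downstream.</p>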



<ul class="wp-block-list">
<li><strong>Security and Transparency</strong></li>
</ul>



<p>GX preserves your data security by processing it within your own systems. Its open-source foundation ensures full transparency, allowing for complete control over insights.</p>



<ul class="wp-block-list">
<li><strong>Data Contracts Support</strong></li>
</ul>



<p>Utilize Checkpoints for transparent, central, and automated testing of Expectations, producing readable Data Docs. Checkpoints can trigger actions based on evaluation results, bolstering data quality.</p>



<ul class="wp-block-list">
<li><strong>Enhanced Collaboration</strong></li>
</ul>



<p>GX&#8217;s Data Docs are inspectable, shareable, and human-readable, fostering mutual understanding of data quality. Publish Data Docs in diverse formats to seamlessly integrate with existing catalogs, dashboards, and reporting tools.</p>



<p>Great Expectations aligns well with your MLOps tools by enhancing data reliability, reducing the risk of poor data quality impacting your machine learning models, and promoting a collaborative approach to data quality management within your team.</p>



<h4 class="wp-block-heading">TensorFlow Extended (TFX)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="763" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1920%2C763&#038;ssl=1" alt="Introduction to TensorFlow Extended (TFX), data validation open source ML tool " class="wp-image-29135" style="width:810px;height:322px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1920%2C763&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=768%2C305&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=200%2C79&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1536%2C610&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=220%2C87&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=120%2C48&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=160%2C64&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=300%2C119&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=480%2C191&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1020%2C405&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><br>Introduction to TFX, data validation open source ML tool&nbsp;| <a href="https://www.tensorflow.org/tfx" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://www.tensorflow.org/tfx" target="_blank" rel="noreferrer noopener nofollow">TFX</a>, short for TensorFlow Extended, presents a range of powerful features for effective machine learning operations:</p>



<ul class="wp-block-list">
<li><strong>Scalable ML Pipelines</strong></li>
</ul>



<p>TFX offers a structured sequence of components tailored for scalable and high-performance machine learning tasks, streamlining the development of end-to-end ML pipelines.</p>



<ul class="wp-block-list">
<li><strong>Component Modularity</strong></li>
</ul>



<p>TFX components are built using specialized libraries, providing both a cohesive framework and the flexibility to utilize individual components according to your needs.</p>



<ul class="wp-block-list">
<li><strong>Data Preprocessing</strong></li>
</ul>



<p>TFX includes powerful tools for data preprocessing, transformation, and feature engineering, crucial for preparing data for model training.</p>



<ul class="wp-block-list">
<li><strong>Model Training and Validation&nbsp;</strong></li>
</ul>



<p>It supports model training using TensorFlow and facilitates model validation, ensuring the robustness and reliability of your machine learning models.</p>



<ul class="wp-block-list">
<li><strong>Automated Model Deployment&nbsp;</strong></li>
</ul>



<p>TFX simplifies the process of deploying models to various serving environments, enabling smooth integration with production systems.</p>



<ul class="wp-block-list">
<li><strong>Artifact Tracking</strong></li>
</ul>



<p>TFX keeps track of experiment artifacts, helping you manage the lifecycle of your ML models.</p>



<ul class="wp-block-list">
<li><strong>Custom Component Development</strong></li>
</ul>



<p>It allows for the creation of custom components to meet specific requirements or integrate third-party tools.</p>



<ul class="wp-block-list">
<li><strong>Integration with TensorFlow</strong></li>
</ul>



<p>As an extension of TensorFlow, TFX seamlessly integrates with TensorFlow ecosystem tools and technologies.</p>



<p>TFX is an excellent fit in your MLOps toolkit due to its focus on scalability, performance, and end-to-end ML pipeline management. It streamlines the development and deployment of machine learning workflows, ensuring efficient data preprocessing, model training, validation, and deployment. Its modularity and integration with TensorFlow make it a valuable asset in your quest for efficient and effective machine learning operations.</p>



<h3 class="wp-block-heading" id="h-data-exploration-open-source-ml-tools">Data exploration open source ML tools</h3>



<p>Data exploration software is built for automated data analysis: it streamlines pattern recognition and makes insights easy to visualize. Data exploration is a cognitively intense process, so you need powerful tools that help you track and execute code as you go.</p>



<h4 class="wp-block-heading">Jupyter Notebook</h4>



<p><a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Jupyter Notebook</a> is a development environment where you immediately see the result of executing code and its fragments. The difference from a traditional IDE is that code can be broken into chunks and executed in any order. You can load a file into memory, inspect its contents, and process them in separate steps.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-medium is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="768" height="544" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=768%2C544&#038;ssl=1" alt="UI sample of Jupyter Notebook, data exploration open source ML tool" class="wp-image-29136" style="width:785px;height:557px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=768%2C544&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=1920%2C1361&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=200%2C142&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=1536%2C1089&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=220%2C156&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=120%2C85&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=160%2C113&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=300%2C213&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=480%2C340&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=1020%2C723&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption class="wp-element-caption">UI sample of Jupyter Notebook, data exploration open source ML tool | <a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Multi-language support</strong></li>
</ul>



<p>Often when we talk about Jupyter Notebook, we mean working with Python. But, in fact, you can work with other languages, such as Ruby, Perl, or R.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Integration with the cloud</strong></li>
</ul>



<p>The easiest way to start working with a Jupyter Notebook in the cloud is by using Google Colab. This means that you just need to launch your browser and open the desired page. After that, the cloud system will allocate resources for you and allow you to execute any code.</p>



<p>The plus is that you don’t need to install anything on your computer. The cloud takes care of everything, and you just write and run code.</p>



<h3 class="wp-block-heading" id="h-data-version-control-open-source-ml-tools">Data version control open source ML tools</h3>



<p>There will be multiple versions of your machine learning model before you&#8217;re done. To make sure nothing gets lost, use a robust and trustworthy <a href="https://neptune.ai/blog/best-7-data-version-control-tools-that-improve-your-workflow-with-machine-learning-projects" target="_blank" rel="noreferrer noopener nofollow">data version control system</a> where every change is trackable.</p>



<h4 class="wp-block-heading">Data Version Control (DVC)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1798" height="794" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=1798%2C794&#038;ssl=1" alt="Introduction to DVC, data version control open source ML tool" class="wp-image-29137" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?w=1798&amp;ssl=1 1798w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=768%2C339&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=200%2C88&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=1536%2C678&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=220%2C97&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=120%2C53&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=160%2C71&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=480%2C212&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=1020%2C450&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to DVC, data version control open source ML tool | <a href="https://dvc.org/" 
target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://dvc.org/" target="_blank" rel="noreferrer noopener nofollow">DVC</a> is a tool designed for managing software versions in ML projects. It’s useful both for experimentation and for deploying models to production. DVC runs on top of Git, uses its infrastructure, and has a similar syntax.</p>



<ul class="wp-block-list">
<li><strong>Fully-automated version control</strong></li>
</ul>



<p>DVC creates metafiles describing pipelines and versioned files, which are saved in your project’s Git history. Once you place data under DVC’s control, it starts tracking all changes.</p>



<ul class="wp-block-list">
<li><strong>Git-based modification tracking</strong></li>
</ul>



<p>You can work with data the same way as with Git: save a version, send it to a remote repository, get the required version of the data, and change and switch between versions. The DVC interface is intuitively clear.&nbsp;</p>



<p>Overall, DVC is an excellent <a href="/blog/top-model-versioning-tools" target="_blank" rel="noreferrer noopener">tool for data and model versioning</a>. If you don’t need pipelines and remote repositories, you can version data for a specific project on a local machine. DVC lets you work very quickly with tens of gigabytes of data.</p>



<p>However, it also allows you to exchange data and models between teams. For data storage, you can use cloud solutions.&nbsp;</p>



<h4 class="wp-block-heading">Pachyderm</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="598" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1920%2C598&#038;ssl=1" alt="Introduction to Pachyderm, data version control open source ML tools" class="wp-image-29138" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1920%2C598&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=768%2C239&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=200%2C62&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1536%2C479&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=220%2C69&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=120%2C37&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=160%2C50&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=300%2C93&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=480%2C150&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1020%2C318&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 
100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Pachyderm, data version control open source ML tools | <a href="https://www.pachyderm.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://www.pachyderm.com/" target="_blank" rel="noreferrer noopener nofollow">Pachyderm</a> is a Git-like tool for tracking transformations in your data. It keeps track of data lineage and ensures that data is kept relevant.&nbsp;</p>



<p>Pachyderm is useful because it provides:</p>



<ul class="wp-block-list">
<li><strong>Traceability</strong></li>
</ul>



<p>You want your data to be fully traceable from the moment it’s raw to the final prediction. With its version control for data, Pachyderm gives you a fully transparent view of your data pipelines. Otherwise, this can be a challenge: when multiple transformers use the same dataset, for example, it can be hard to say why you get a particular result.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Reproducibility</strong></li>
</ul>



<p>Pachyderm is a step forward for the reproducibility of your data science models. You can always be sure that your clients will get the same results after the model is handed over to them.</p>



<p>Pachyderm stores all your data in one central location and updates all the changes. No transformation will pass unnoticed.&nbsp;</p>
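<p>A Pachyderm pipeline is declared as a JSON spec naming an input repo and the container command to run over it; Pachyderm versions whatever the command writes to <code>/pfs/out</code>. A minimal sketch (the repo, image, and script names here are hypothetical):</p>

```json
{
  "pipeline": { "name": "featurize" },
  "input": {
    "pfs": { "repo": "raw-data", "glob": "/*" }
  },
  "transform": {
    "image": "python:3.10",
    "cmd": ["python3", "/featurize.py", "/pfs/raw-data", "/pfs/out"]
  }
}
```

<p>Whenever a new commit lands in the <code>raw-data</code> repo, Pachyderm reruns the transform and records the output as a new versioned commit, which is what makes the lineage traceable end to end.</p>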


    <a
        href="/blog/top-model-versioning-tools"
        id="cta-box-related-link-block_17f31d729f3e3245f5fd1aee95346336"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-top-model-versioning-tools-for-your-ml-workflow">                Top Model Versioning Tools for Your ML Workflow            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-data-inspection-open-source-ml-tools">Data inspection open source ML tools</h3>



<h4 class="wp-block-heading">Alibi Detect&nbsp;</h4>



<p><a href="https://github.com/SeldonIO/alibi-detect" target="_blank" rel="noreferrer noopener nofollow">Alibi Detect</a> is an open-source Python library by SeldonIO, the company behind Seldon Core, which we discussed earlier. The library lets you inspect your data’s integrity, offering outlier, adversarial, and drift detection for tabular data, text, images, and time series. It is compatible with both TensorFlow and PyTorch backends.&nbsp;</p>



<p>Alibi Detect offers a variety of methods for inspecting your data’s integrity. The documentation is well organized and includes examples for better understanding. I highly recommend going through the <a href="https://docs.seldon.io/projects/alibi-detect/en/latest/" target="_blank" rel="noreferrer noopener nofollow">documentation</a>, as it will be extremely beneficial.&nbsp;</p>



<p>If you are already using TensorFlow or PyTorch, Alibi Detect fits smoothly into your machine learning pipeline. Another reason to use this library in your workflow is that it provides <strong>built-in preprocessing steps</strong>. This feature lets you detect drift while using the transformers library, and it also helps you extract hidden-layer representations from machine learning models.&nbsp;</p>
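<p>For intuition, the kind of data drift check Alibi Detect automates boils down to a two-sample statistical test between a reference sample and incoming data. The sketch below is hand-rolled (it is not Alibi Detect&#8217;s API) and uses the Kolmogorov–Smirnov statistic with a permutation test, assuming only NumPy:</p>

```python
import numpy as np

def ks_statistic(x_ref, x):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples."""
    points = np.concatenate([x_ref, x])
    cdf_ref = np.searchsorted(np.sort(x_ref), points, side="right") / len(x_ref)
    cdf_x = np.searchsorted(np.sort(x), points, side="right") / len(x)
    return float(np.max(np.abs(cdf_ref - cdf_x)))

def detect_drift(x_ref, x, p_val=0.05, n_perm=200, seed=0):
    """Permutation test: is the observed statistic unusually large
    compared to random splits of the pooled data?"""
    rng = np.random.default_rng(seed)
    observed = ks_statistic(x_ref, x)
    pooled = np.concatenate([x_ref, x])
    n_ref = len(x_ref)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_ref], pooled[n_ref:]) >= observed:
            exceed += 1
    p = exceed / n_perm
    return {"is_drift": p < p_val, "p_val": p, "distance": observed}
```

<p>Alibi Detect&#8217;s <code>KSDrift</code> detector follows the same pattern, adding proper p-value computation, multivariate support, and the preprocessing hooks mentioned above.</p>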



<h4 class="wp-block-heading">Frouros</h4>



<p><a href="https://github.com/IFCA/frouros" target="_blank" rel="noreferrer noopener nofollow">Frouros</a> is an open-source Python library focused solely on drift detection. Unlike Alibi Detect, which also covers outlier and adversarial detection, Frouros addresses drift detection alone. What makes this library special is that it offers both classical and more recent algorithms for detecting data drift as well as concept drift.&nbsp;</p>



<p>Frouros is also a lightweight library that works with <strong>Scikit-Learn, NumPy, PyTorch,</strong> and other frameworks. It offers a wide variety of methods, most of which target univariate datasets, with a few supporting multivariate data as well.&nbsp;</p>



<p>As a final verdict, this library is a good fit for people who want to explore data drift in univariate datasets. And since it offers such a broad range of algorithms, it is also a good place to learn, and even to deploy in a fairly small project.&nbsp;</p>
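<p>To illustrate the concept-drift side, the classical DDM (Drift Detection Method) family of algorithms that Frouros implements watches the model&#8217;s streaming error rate rather than the inputs. Below is a simplified sketch of the rule (not Frouros&#8217;s API; the 3-sigma threshold follows the standard DDM heuristic):</p>

```python
class ErrorRateDriftDetector:
    """Simplified DDM-style concept drift detector (a sketch of the
    classical rule; Frouros ships full implementations of DDM and many
    other detectors).

    It tracks the running error rate p of a classifier and its standard
    deviation s = sqrt(p * (1 - p) / t), remembers the best (minimum)
    p + s seen so far, and signals drift once p + s exceeds
    p_min + 3 * s_min.
    """

    def __init__(self, warmup=30):
        self.t = 0
        self.errors = 0
        self.warmup = warmup
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model misclassified this sample, else 0.
        Returns True once drift is detected."""
        self.t += 1
        self.errors += error
        p = self.errors / self.t
        s = (p * (1.0 - p) / self.t) ** 0.5
        if self.t < self.warmup:
            return False
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        return p + s > self.p_min + 3.0 * self.s_min
```

<p>In practice, you would feed <code>update()</code> with per-sample correctness from your live model and retrain once drift is signalled.</p>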



<h3 class="wp-block-heading" id="h-model-serving-open-source-ml-tool">Model serving open source ML tool</h3>



<h4 class="wp-block-heading">Streamlit</h4>



<p><a href="https://github.com/streamlit/streamlit" target="_blank" rel="noreferrer noopener nofollow">Streamlit</a> is an open-source Python library for creating interactive web applications, mostly for data science and ML projects. Strictly speaking, Streamlit is a framework, but since its role here is limited to deploying the ML application, I have put it under the tools category.&nbsp;</p>



<p>Streamlit lets you build web-based dashboards, visualizations, and applications with minimal effort. Some of its key features include:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1668" height="1186" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=1668%2C1186&#038;ssl=1" alt="Introduction to Streamlit, model serving open source ML tool " class="wp-image-29524" style="width:810px;height:576px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?w=1668&amp;ssl=1 1668w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=768%2C546&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=200%2C142&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=1536%2C1092&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=220%2C156&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=120%2C85&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=160%2C114&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=300%2C213&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=480%2C341&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=1020%2C725&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Streamlit, model serving open source ML tool | <a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Rapid Prototyping</strong></li>
</ul>



<p>As mentioned previously, you can create interactive applications by writing Python code that directly interacts with your data and visualizations.</p>



<ul class="wp-block-list">
<li><strong>Simplicity</strong></li>
</ul>



<p>The library is designed to be user-friendly, with a simple and intuitive API. You can create interactive widgets with just a few lines of code.</p>



<ul class="wp-block-list">
<li><strong>Data Visualization</strong></li>
</ul>



<p>Streamlit supports integration with popular data visualization libraries like Matplotlib, Plotly, and Altair, enabling you to display charts and graphs in your web application.</p>



<ul class="wp-block-list">
<li><strong>Customization</strong></li>
</ul>



<p>While Streamlit is straightforward to use out of the box, you can also customize the appearance and layout of your apps using CSS styling and additional layout components.</p>



<ul class="wp-block-list">
<li><strong>Integration</strong></li>
</ul>



<p>You can integrate your Streamlit apps with machine learning models, data analysis scripts, and other Python-based functionalities to create cohesive data-driven machine learning applications.</p>



<ul class="wp-block-list">
<li><strong>Interactivity</strong></li>
</ul>



<p>Streamlit&#8217;s widgets and features allow users to interact with data, adjust parameters, and see real-time updates in the app&#8217;s visualizations.</p>



<ul class="wp-block-list">
<li><strong>Sharing and Deployment</strong></li>
</ul>



<p>You can deploy your Streamlit apps on various platforms, including cloud services, making it easy to share your work with others.</p>



<ul class="wp-block-list">
<li><strong>Community and Extensions</strong></li>
</ul>



<p>Streamlit has a growing community and supports a range of extensions and integrations, allowing you to enhance the functionality of your apps.</p>



<p>Streamlit is particularly well-suited for scenarios where you want to create simple and interactive data visualization tools or prototypes without investing a significant amount of time in web development. It&#8217;s commonly used by data scientists and engineers who want to showcase their data analysis and machine learning results in an accessible and engaging manner.</p>



<h4 class="wp-block-heading">TorchServe</h4>



<p><a href="https://pytorch.org/serve/index.html" target="_blank" rel="noreferrer noopener nofollow">TorchServe</a> is an open-source model-serving tool built by Facebook AI. It is engineered to simplify the deployment and management of PyTorch models, aligning seamlessly with your MLOps workflows. Let&#8217;s delve into why TorchServe is a compelling choice for model management and inference in the MLOps landscape.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pytorch.org/serve/getting_started.html" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1081" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1920%2C1081&#038;ssl=1" alt="Introduction to TorchServe, model serving open source ML tool" class="wp-image-29797" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1920%2C1081&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=768%2C432&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=200%2C113&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1536%2C865&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=2048%2C1153&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=220%2C124&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=160%2C90&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=300%2C169&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=480%2C270&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1020%2C574&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">Introduction to TorchServe, model serving open source ML tool | Source</figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Efficient Model Management</strong></li>
</ul>



<p>One of TorchServe&#8217;s standout features is its robust Model Management API. It empowers MLOps practitioners with multi-model management capabilities, allowing the allocation of models to workers in an optimized manner. This means you can effortlessly handle multiple models, versioning, and configurations while ensuring resource allocation is fine-tuned for peak performance.</p>



<ul class="wp-block-list">
<li><strong>Versatile Inference Support</strong></li>
</ul>



<p>TorchServe extends its capabilities through its Inference API, offering support for both REST and gRPC protocols. But it doesn&#8217;t stop there; it&#8217;s equipped for batched inference, optimizing the prediction process for both single and multiple data points. This versatility ensures that your models can be integrated seamlessly into a wide array of applications.</p>



<ul class="wp-block-list">
<li><strong>Complex Deployments Made Simple</strong></li>
</ul>



<p>For those tackling intricate deployments involving complex Directed Acyclic Graphs (DAGs) with interdependent models, TorchServe comes to the rescue. Its TorchServe Workflows feature enables the deployment of these intricate setups, giving you the flexibility needed to cater to demanding real-world scenarios.</p>



<ul class="wp-block-list">
<li><strong>Wide Adoption in Leading MLOps Platforms</strong></li>
</ul>



<p>TorchServe&#8217;s reputation extends beyond its own ecosystem. It serves as the default choice for serving PyTorch models within platforms like Kubeflow, MLflow, SageMaker, Google Vertex AI, and Kserve, supporting both v1 and v2 APIs. This widespread adoption speaks volumes about its effectiveness and compatibility within the MLOps landscape.</p>



<ul class="wp-block-list">
<li><strong>Optimized Inference Export</strong></li>
</ul>



<p>In the quest for optimized inference, TorchServe offers a suite of options. Whether it&#8217;s TorchScript right out of the box, ONNX, ORT, IPEX, or TensorRT, you have the freedom to export your model in a format that suits your specific performance requirements. This flexibility ensures that your models are primed for efficient execution.</p>



<ul class="wp-block-list">
<li><strong>Performance at the Core</strong></li>
</ul>



<p>MLOps professionals know that performance is paramount. TorchServe recognizes this and provides built-in support to optimize, benchmark, and profile both PyTorch models and TorchServe itself. This means you can fine-tune your deployments for optimal throughput and responsiveness.</p>



<ul class="wp-block-list">
<li><strong>Expressive Handlers for Custom Use Cases</strong></li>
</ul>



<p>Handling inferencing for diverse use cases is a breeze with TorchServe&#8217;s expressive handler architecture. It simplifies the process of customizing inferencing for your unique requirements, and it comes with a plethora of out-of-the-box solutions to cater to various scenarios.</p>



<ul class="wp-block-list">
<li><strong>Comprehensive Metrics and Monitoring</strong></li>
</ul>



<p>Monitoring the health and performance of your models is vital. TorchServe comes with a Metrics API that offers out-of-the-box support for system-level metrics. It seamlessly integrates with Prometheus for metric exports, and it also supports custom metrics. Moreover, it aligns seamlessly with PyTorch&#8217;s profiler for in-depth performance analysis.</p>



<p>TorchServe seamlessly integrates with leading MLOps platforms and empowers you to deploy and manage models efficiently. If you&#8217;re seeking a robust solution to elevate your MLOps workflows, TorchServe deserves a prominent place in your toolkit.</p>



<h3 class="wp-block-heading" id="h-testing-and-maintenance-open-source-ml-tools">Testing and maintenance open source ML tools</h3>



<p>The final step of ML development is testing and maintenance after the main jobs are done. Special tools allow you to make sure that the results are reproducible in the long run.</p>



<h4 class="wp-block-heading">Prometheus</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1351" height="811" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=1351%2C811&#038;ssl=1" alt="Introduction to Prometheus, monitoring and testing open source ML tool" class="wp-image-29143" style="width:810px;height:486px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?w=1351&amp;ssl=1 1351w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=768%2C461&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=200%2C120&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=220%2C132&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=120%2C72&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=160%2C96&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=300%2C180&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=480%2C288&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=1020%2C612&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Prometheus, monitoring and testing open source ML tool | <a href="https://prometheus.io/docs/introduction/overview/" target="_blank" rel="noreferrer noopener 
nofollow">Source</a></figcaption></figure>
</div>


<p>Prometheus is an open-source monitoring toolkit originally built at SoundCloud. It has a very active community and is well supported by a large number of organizations. The fundamental concept of Prometheus is that it stores all data and metrics in a time-series format, which means every metric collected during monitoring is associated with a timestamp.&nbsp;</p>



<p>This is why <strong>Prometheus fits time-series data very well</strong>. It also supports multi-dimensional <strong>data collection</strong> and querying, which means you can use Prometheus to log your ML system’s metrics.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="703" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=1600%2C703&#038;ssl=1" alt="Introduction to Prometheus, monitoring and testing open source ML tool" class="wp-image-29144" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=768%2C337&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=200%2C88&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=1536%2C675&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=220%2C97&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=120%2C53&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=160%2C70&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=480%2C211&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=1020%2C448&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Prometheus | <a href="https://www.jeremyjordan.me/ml-monitoring/#prometheus" 
target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The image above represents how metrics can be logged into the time series database, and later the same can be retrieved via endpoints.&nbsp;</p>



<p>Some of the highlighted key features are:</p>



<ul class="wp-block-list">
<li><strong>Standalone servers</strong></li>
</ul>



<p>Each Prometheus server is standalone and independent of the others, which makes the setup reliable.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>PromQL</strong></li>
</ul>



<p>A powerful query language that allows searching, slicing, and dicing of time-series data. With PromQL, you can also generate graphs, tables, and alerts in Prometheus&#8217;s expression browser.</p>



<ul class="wp-block-list">
<li><strong>Efficient storage</strong></li>
</ul>



<p>Data is stored in memory and in a local on-disk time-series database in a custom format, which also allows efficient scaling.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Dimensional Data</strong></li>
</ul>



<p>The key concept of Prometheus is storing data in a time-series format. Because of this, you can select any timeframe to understand the behaviour of your model. On top of that, you can create a visualization dashboard using Grafana.&nbsp;&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1798" height="940" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=1798%2C940&#038;ssl=1" alt="UI sample of Prometheus, monitoring and testing open source ML tool" class="wp-image-29145" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?w=1798&amp;ssl=1 1798w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=1536%2C803&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=1020%2C533&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Prometheus, monitoring and testing open source ML tool | <a 
href="https://www.jeremyjordan.me/ml-monitoring/#prometheus" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>If you want a general-purpose, lightweight tool to collect and log metrics about your system, Prometheus is a must.&nbsp;</p>
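<p>The text exposition format Prometheus scrapes is deliberately simple: one metric per line, with optional labels and a millisecond timestamp. In practice, the official <code>prometheus_client</code> package generates this for you; the sketch below (the function and metric names are hypothetical) just makes the format visible:</p>

```python
def render_metrics(metrics, labels=None, timestamp_ms=None):
    """Render gauge values in Prometheus' text exposition format:
    metric_name{label="value",...} value [timestamp_ms]
    """
    label_str = ""
    if labels:
        pairs = ",".join('{}="{}"'.format(k, v) for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines = []
    for name, value in sorted(metrics.items()):
        line = "{}{} {}".format(name, label_str, value)
        if timestamp_ms is not None:
            line += " {}".format(timestamp_ms)
        lines.append(line)
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"
```

<p>A model server would expose such a string at a <code>/metrics</code> HTTP endpoint for Prometheus to scrape; <code>prometheus_client</code>&#8217;s <code>Counter</code>, <code>Gauge</code>, and <code>start_http_server</code> wrap exactly this workflow.</p>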



<h4 class="wp-block-heading">ModsysML</h4>



<p><a href="https://github.com/modsysML/modsysML" target="_blank" rel="noreferrer noopener nofollow">ModsysML</a> is a very new MLOps tool that lets users test and automate workloads, compare outputs, improve data quality, and catch regressions, all through a single API. This enables you to automate, accelerate, and backtest the process of deriving proactive intelligence and insights from data-quality testing.&nbsp;</p>



<p>ModsysML streamlines the refinement of AI systems across a diverse range of test cases. By scrutinizing and comparing outputs, it builds workflows that support decision-making, letting users assess quality quickly and identify regressions promptly.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="900" height="450" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=900%2C450&#038;ssl=1" alt="UI sample of ModsysML, monitoring and testing open source ML tool" class="wp-image-29146" style="width:810px;height:405px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?w=900&amp;ssl=1 900w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=768%2C384&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=200%2C100&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=220%2C110&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=120%2C60&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=160%2C80&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=300%2C150&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=480%2C240&amp;ssl=1 480w" sizes="auto, (max-width: 900px) 100vw, 900px" /><figcaption class="wp-element-caption">UI sample of ModsysML, monitoring and testing open source ML tool | <a href="https://github.com/modsysML/modsysML" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>ModsysML&#8217;s suite of tools encompasses three fundamental functions:</p>



<ul class="wp-block-list">
<li>Conducting performance benchmarks for AI systems with respect to precise outcomes.</li>



<li>Crafting automated tasks or revisiting established ones for a thorough evaluation.</li>



<li>Detecting immediate fluctuations within data streams.</li>
</ul>



<p>Through its user interface (UI) and Python library, you can calibrate your AI systems for particular use cases. This encompasses creating automated workflows as well as deriving data-driven insights from real-time shifts within your datasets.</p>
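<p>Since ModsysML&#8217;s actual Python API is not shown here, the function and system names below are hypothetical; the sketch only illustrates the core idea of benchmarking outputs against expected outcomes across test cases and flagging regressions:</p>

```python
# Illustrative sketch only -- not the ModsysML API. It mimics the idea of
# running a fixed set of test cases through two system versions and
# flagging regressions where the new version stops matching expectations.

def evaluate(system, test_cases):
    """Score a system (a callable) against expected outputs per test case."""
    results = {}
    for name, (prompt, expected) in test_cases.items():
        results[name] = system(prompt) == expected
    return results

def find_regressions(baseline, candidate):
    """Test cases that passed in the baseline but fail in the candidate."""
    return [name for name in baseline if baseline[name] and not candidate[name]]

test_cases = {
    "greeting": ("hi", "hello"),
    "farewell": ("bye", "goodbye"),
}

old_system = lambda p: {"hi": "hello", "bye": "goodbye"}[p]
new_system = lambda p: {"hi": "hello", "bye": "see ya"}[p]

baseline = evaluate(old_system, test_cases)
candidate = evaluate(new_system, test_cases)
print(find_regressions(baseline, candidate))  # ['farewell']
```

<p>A real setup would substitute calls to your AI systems for the toy lambdas and feed the regression list into an automated workflow.</p>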



<h4 class="wp-block-heading">Deepchecks</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1540" height="860" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=1540%2C860&#038;ssl=1" alt="UI sample of Deepchecks, monitoring and testing open source ML tool" class="wp-image-29147" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?w=1540&amp;ssl=1 1540w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=768%2C429&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=200%2C112&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=1536%2C858&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=220%2C123&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=120%2C67&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=160%2C89&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=300%2C168&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=480%2C268&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=1020%2C570&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Deepchecks, monitoring and testing open source ML tool | <a 
href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>We come across another open-source tool that allows you to thoroughly evaluate and test the integrity of the data as well as the ML model. <a href="https://deepchecks.com/">Deepchecks</a> offers continuous evaluation from research to production and has a strong, active community behind it.&nbsp;</p>



<p>Deepchecks caters to tabular, NLP, and computer vision (CV) datasets. It offers four solutions:</p>



<div id="case-study-numbered-list-block_7805ce0252a3bb123b8aeab98727a88e"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Testing            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                CI/CD            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Monitoring            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Root Cause Analysis            </li>
            </ul>
</div>



<p>Deepchecks offers a convenient way to detect imperfections within data and ML models while enabling proactive steps toward improvement. Its <strong>Suite</strong> feature proves particularly advantageous, facilitating an in-depth assessment of diverse data and model facets and generating valuable reports.</p>



<p>For a clearer picture, a selection of the predefined checks carried out within a suite, along with their functions, is outlined below:</p>



<ul class="wp-block-list">
<li><strong>Dataset Integrity</strong></li>
</ul>



<p>Employed to ascertain the accuracy and comprehensiveness of the dataset.</p>



<ul class="wp-block-list">
<li><strong>Train-Test Validation</strong></li>
</ul>



<p>A set of checks is devised to ascertain the appropriateness of the data split for the model training and testing phases.</p>



<ul class="wp-block-list">
<li><strong>Model Evaluation&nbsp;</strong></li>
</ul>



<p>A set of checks is performed to gauge model performance, its adaptability to diverse scenarios, and any indicators of overfitting.</p>
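<p>Deepchecks ships these as ready-made suites, so you would not write them by hand; the deliberately simplified, library-free sketch below only illustrates what a dataset-integrity check and a train-test validation check each look for:</p>

```python
# Simplified, library-free sketch of the kinds of checks a Deepchecks suite
# runs -- the real library provides these (and many more) out of the box.

def check_integrity(rows):
    """Dataset integrity: count duplicate rows and missing values."""
    seen, duplicates, missing = set(), 0, 0
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        seen.add(key)
        missing += sum(1 for v in row.values() if v is None)
    return {"duplicates": duplicates, "missing_values": missing}

def check_train_test_leakage(train, test):
    """Train-test validation: exact rows appearing in both splits."""
    train_keys = {tuple(r.items()) for r in train}
    return sum(1 for r in test if tuple(r.items()) in train_keys)

train = [{"x": 1, "y": 0}, {"x": 2, "y": 1}, {"x": 2, "y": 1}, {"x": 3, "y": None}]
test = [{"x": 2, "y": 1}, {"x": 4, "y": 0}]

print(check_integrity(train))                 # {'duplicates': 1, 'missing_values': 1}
print(check_train_test_leakage(train, test))  # 1
```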



<p>One of the reasons Deepchecks will fit into your workflow is the set of automated solutions it offers, especially <strong>root cause analysis</strong>. It expedites the process of identifying the fundamental source of a problem across the entire model lifecycle, letting you swiftly discern the underlying cause of an issue, and it promises to give you granular details about it.&nbsp;</p>



<h3 class="wp-block-heading" id="h-experiment-tracking-open-source-ml-tools">Experiment tracking open source ML tools</h3>



<h4 class="wp-block-heading">Aim</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1400" height="828" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=1400%2C828&#038;ssl=1" alt="UI sample of Aim, experiment tracking open source ML tool" class="wp-image-29149" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?w=1400&amp;ssl=1 1400w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=768%2C454&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=200%2C118&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=220%2C130&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=120%2C71&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=160%2C95&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=300%2C177&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=480%2C284&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=1020%2C603&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Aim, experiment tracking open source ML tool | <a href="https://aimstack.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://github.com/aimhubio/aim" target="_blank" rel="noreferrer noopener nofollow">Aim</a> stands as an open-source, self-hosted AI Metadata monitoring solution tailored to manage vast volumes of tracked metadata sequences, numbering in the tens of thousands.</p>



<p>Aim presents an efficient and visually pleasing user interface (UI) that facilitates the exploration and juxtaposition of metadata, encompassing elements such as training runs or agent executions. What&#8217;s more, its software development kit (SDK) grants the capability for programmatic interaction with the tracked metadata—an ideal feature for streamlined automation and analysis within Jupyter Notebooks.</p>
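<p>As an illustration of that pattern (this is not Aim&#8217;s actual SDK, whose central object is a Run with a track() method), a minimal pure-Python sketch of what an experiment tracker records per run, namely hyperparameters plus named sequences of tracked values:</p>

```python
# Pure-Python sketch of the experiment-tracking pattern an SDK like Aim's
# follows: each run stores hyperparameters plus named metric sequences.

class TrackedRun:
    def __init__(self, **hparams):
        self.hparams = hparams
        self.sequences = {}  # metric name -> list of (step, value)

    def track(self, value, name, step):
        self.sequences.setdefault(name, []).append((step, value))

    def last(self, name):
        """Most recently tracked value for a metric."""
        return self.sequences[name][-1][1]

run = TrackedRun(lr=0.01, batch_size=32)
for step, loss in enumerate([0.9, 0.5, 0.3]):
    run.track(loss, name="loss", step=step)

print(run.hparams["lr"], run.last("loss"))  # 0.01 0.3
```

<p>The point of a dedicated tracker is that these sequences persist across processes and scale to tens of thousands of runs, which a throwaway in-memory class obviously does not.</p>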



<p>Some of the key features are:</p>



<ul class="wp-block-list">
<li><strong>Streamlined Run Comparisons</strong></li>
</ul>



<p>Effortlessly contrast various runs to expedite the model-building process.</p>



<ul class="wp-block-list">
<li><strong>In-Depth Run Inspection</strong></li>
</ul>



<p>Immerse yourself in the minutiae of each run, facilitating seamless troubleshooting.</p>



<ul class="wp-block-list">
<li><strong>Centralized Repository of Pertinent Details</strong></li>
</ul>



<p>All pertinent information is centralized, ensuring hassle-free governance and management.</p>



<p>Aim can handle up to 100,000 metadata sequences, which is why it can be one of the best fits for your ML stack. Apart from that, its UI is both functional and visually appealing.&nbsp;</p>



<h4 class="wp-block-heading">Guild AI</h4>



<p><a href="https://github.com/guildai/guildai" target="_blank" rel="noreferrer noopener nofollow">Guild AI</a> serves as an open source toolkit that streamlines and enhances the efficiency of machine learning experiments. It stands as an all-encompassing ML engineering toolkit with an array of capabilities.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1100" height="814" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=1100%2C814&#038;ssl=1" alt="UI sample of Guild AI, experiment tracking open source ML tool" class="wp-image-29542" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?w=1100&amp;ssl=1 1100w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=768%2C568&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=200%2C148&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=220%2C163&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=120%2C89&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=160%2C118&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=480%2C355&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=1020%2C755&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Guild AI, experiment tracking open source ML tool | <a href="https://guild.ai/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Automated Experiment Tracking</strong></li>
</ul>



<p>Guild AI lets you run your original training scripts, captures each experiment&#8217;s results, and provides tools for analysis, visualization, and comparison.</p>



<ul class="wp-block-list">
<li><strong>Hyperparameter Tuning with AutoML</strong></li>
</ul>



<p>Harness AutoML for hyperparameter tuning by automating trials with grid search, random search, and Bayesian optimization techniques.</p>
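<p>Grid and random search are simple enough to sketch without Guild&#8217;s tooling (Bayesian optimization needs more machinery); the toy objective below stands in for a full training-and-evaluation run:</p>

```python
import itertools
import random

# Stdlib-only sketch of the two simplest tuning strategies Guild AI automates.
# The "objective" is a stand-in for training a model and scoring it.

def objective(params):
    # Toy objective: best at lr=0.1, depth=4.
    return -abs(params["lr"] - 0.1) - abs(params["depth"] - 4)

space = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

# Grid search: evaluate every combination exhaustively.
grid = [dict(zip(space, combo)) for combo in itertools.product(*space.values())]
best_grid = max(grid, key=objective)

# Random search: sample a fixed budget of combinations.
rng = random.Random(0)
samples = [{k: rng.choice(v) for k, v in space.items()} for _ in range(5)]
best_random = max(samples, key=objective)

print(best_grid)  # {'lr': 0.1, 'depth': 4}
```

<p>Guild wraps this loop around your actual script, recording every trial as a tracked run so the results feed directly into its comparison tools.</p>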



<ul class="wp-block-list">
<li><strong>Comparison and Analysis&nbsp;</strong></li>
</ul>



<p>Compare and analyze experiment runs to gain insights and enhance your model&#8217;s performance.</p>



<ul class="wp-block-list">
<li><strong>Efficient Backup and Archiving</strong></li>
</ul>



<p>Run training-related operations like data preparation and testing, and archive runs to remote systems such as S3.</p>



<ul class="wp-block-list">
<li><strong>Remote Operations and Acceleration&nbsp;</strong></li>
</ul>



<p>Perform operations remotely on cloud accelerators, optimizing your workflow efficiency.</p>



<ul class="wp-block-list">
<li><strong>Model Packaging and Reproducibility&nbsp;</strong></li>
</ul>



<p>Package and distribute models for seamless reproducibility across different environments.</p>



<ul class="wp-block-list">
<li><strong>Streamlined Pipeline Automation</strong></li>
</ul>



<p>Enable automated pipelines for smoother workflow execution.</p>



<ul class="wp-block-list">
<li><strong>Scheduling and Parallel Processing</strong></li>
</ul>



<p>Utilize scheduling and parallel processing to optimize resource utilization.</p>



<ul class="wp-block-list">
<li><strong>Remote Training and Management</strong></li>
</ul>



<p>Conduct remote training, backup, and restoration of experiments for enhanced flexibility.</p>



<p>If you are looking for a tool that offers automated experiment management, optimization, and insights that streamline and enhance machine learning workflows, then Guild AI is the tool of choice.&nbsp;</p>



<h3 class="wp-block-heading" id="h-model-interpretability-open-source-ml-tools">Model interpretability open source ML tools&nbsp;</h3>



<h4 class="wp-block-heading">Alibi Explain</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="599" height="313" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=599%2C313&#038;ssl=1" alt="Introduction to Alibi Explain, model interpretability open source ML tool" class="wp-image-29152" style="width:795px;height:415px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?w=599&amp;ssl=1 599w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=480%2C251&amp;ssl=1 480w" sizes="auto, (max-width: 599px) 100vw, 599px" /><figcaption class="wp-element-caption">Introduction to Alibi Explain, model interpretability open source ML tool | <a href="https://docs.seldon.io/projects/alibi/en/stable/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Alibi Explain is another tool from SeldonIO. It stands as an open-source Python library with a primary focus on the explainability and interpretation of ML models. The library is dedicated to furnishing top-tier implementations of explanation methods, encompassing black-box, white-box, local, and global approaches tailored for both classification and regression models.</p>



<p>Within Alibi Explain, a collection of algorithms or methodologies, termed explainers, is at your disposal. Each explainer serves as a conduit for obtaining insights into a model&#8217;s behavior. The range of insights attainable for a trained model is influenced by several variables.&nbsp;</p>



<p>To learn more, please read the documentation <a href="https://docs.seldon.io/projects/alibi/en/stable/" target="_blank" rel="noreferrer noopener nofollow">here</a>. It is one of the most polished tools on this list.&nbsp;</p>



<p>Broadly speaking, the range of explainers available from Alibi is determined by:</p>



<ul class="wp-block-list">
<li>The nature of the data the model handles, encompassing images, tabular data, or text.</li>



<li>The task performed by the model, namely regression or classification.</li>



<li>The specific model type employed, including neural networks and random forests.</li>
</ul>
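<p>One classic member of the black-box, global family is permutation feature importance: shuffle a single feature and measure how much the model&#8217;s score drops. Alibi Explain&#8217;s explainers are far more sophisticated, but the library-free sketch below captures the shared spirit of probing a model only through its predictions:</p>

```python
import random

# Library-free sketch of a black-box, global explanation method
# (permutation feature importance): the model is only ever queried
# through its predictions, never inspected internally.

def model(x):
    # Black-box stand-in: depends strongly on feature 0, weakly on feature 1.
    return 3.0 * x[0] + 0.1 * x[1]

def score(X, y):
    # Negative mean absolute error, so higher is better.
    return -sum(abs(model(x) - t) for x, t in zip(X, y)) / len(X)

def permutation_importance(X, y, feature, rng):
    baseline = score(X, y)
    shuffled = [row[feature] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, shuffled)]
    return baseline - score(X_perm, y)  # how much the score dropped

rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(20)]
y = [model(x) for x in X]

drop0 = permutation_importance(X, y, 0, rng)
drop1 = permutation_importance(X, y, 1, rng)
print(drop0 > drop1)  # True -- feature 0 matters more
```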



<p>Explainability in general is one of the most sought-after features in the ML world, because as humans we are curious to know what is happening inside a closed box. If you are working in medicine, healthcare, or any other life-critical industry, then this tool is a must in your arsenal.&nbsp;</p>






<h3 class="wp-block-heading" id="h-conclusion">Conclusion</h3>



<p>Open source MLOps tools are necessary. They help you automate a large amount of routine work without costing a fortune. Fully-fledged platforms offer a wide selection of tools for different purposes, for whatever technological stack you might desire. In practice, however, it often turns out that you still need to integrate them with specialized tools that are more intuitive to use. Luckily, most open-source tools make the integration as seamless as possible.&nbsp;</p>



<p>However, an important thing to understand about open-source tools is that you shouldn’t expect them to be completely free of charge: the costs of infrastructure, support, and maintenance of your ML projects will still be on you.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5959</post-id>	</item>
		<item>
		<title>How to Do Model Visualization in Machine Learning?</title>
		<link>https://neptune.ai/blog/visualization-in-machine-learning</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:23:21 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/visualizing-machine-learning-models/</guid>

					<description><![CDATA[Machine learning models are powerful and complex mathematical structures. Understanding their intricate workings is a crucial aspect of model development. Model visualization in machine learning is essential for gaining insights, making informed decisions, and effectively communicating results. In this article, we’ll delve into the art of machine learning visualization, exploring various techniques that help us&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Machine learning models are powerful and complex mathematical structures. Understanding their intricate workings is a crucial aspect of model development. Model visualization in machine learning is essential for gaining insights, making informed decisions, and effectively communicating results.</p>



<p>In this article, we’ll delve into the art of machine learning visualization, exploring various techniques that help us make sense of complex data-driven systems. I have also prepared a <a href="https://colab.research.google.com/drive/1Y9LO60Pi28d4a1_aU8Amlf4I3O4abUtp?usp=sharing" target="_blank" rel="noreferrer noopener nofollow">Google Colab notebook with visualization examples</a> to try yourself. </p>



<p>So, without further ado, let’s get started.</p>



<h2 class="wp-block-heading" id="h-what-is-visualization-in-machine-learning">What is visualization in machine learning?</h2>



<p>Machine learning visualization (ML visualization for short) generally refers to the process of representing machine learning models, data, and their relationships through graphical or interactive means. The goal is to make comprehending a model’s complex algorithms and data patterns easier, making it more accessible to technical and non-technical stakeholders.&nbsp;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Visualization bridges the gap between the enigmatic inner workings of ML models and our innate human capacity for understanding patterns and relationships through visuals.</em></p>
</blockquote>



<p>Visualizing ML models can help with a wide range of objectives:</p>



<ul class="wp-block-list">
<li><strong>Model structure visualization:</strong> Common model types, such as decision trees, support vector machines, or deep neural networks, often consist of many layers of computations and interactions that are challenging to grasp for humans. Visualization lets us see more easily how data flows through a model and where transformations occur.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Visualizing performance metrics</strong>: Once we have trained a model, we need to assess its performance. Visualizing metrics such as accuracy, precision, recall, and the F1 score helps us see how well our model is doing and where improvements are needed.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Comparative model analysis</strong>: When dealing with multiple models or algorithms, visualization of differences in structure or performance allows us to choose the best one for a particular task.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Feature importance</strong>: It is vital to understand which features influence a model’s predictions the most. Visualization techniques like feature importance plots make identifying the critical factors driving model outcomes easy.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Interpretability</strong>: Due to their complexity, ML models are often &#8220;black boxes&#8221; to their human creators, making it hard to explain their decisions. Visualizations can shed light on how specific features affect the output or how robust a model’s predictions are.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Communication</strong>: Visualizations are a universal language for conveying complex ideas simply and intuitively. They are essential for effectively sharing information with management and other non-technical stakeholders.</li>
</ul>



<figure class="wp-block-image size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="900" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=1600%2C900&#038;ssl=1" alt="Visualization in machine learning: loss function’s gradient" class="wp-image-32040" style="aspect-ratio:1.7777777777777777;width:811px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=768%2C432&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=200%2C113&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=1536%2C864&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=220%2C124&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=160%2C90&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=300%2C169&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=480%2C270&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=1020%2C574&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Example of visualization in machine learning : loss function’s gradient | <a href="https://losslandscape.com/gallery/" 
target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>



<h2 class="wp-block-heading" id="h-model-structure-visualization">Model structure visualization</h2>



<p>Understanding how data flows through a model is essential in understanding how a machine learning model transforms the input features into its output.</p>



<h3 class="wp-block-heading" id="h-decision-tree-visualization">Decision tree visualization</h3>



<p>Decision trees have a flowchart-like structure that’s familiar to most people. Each internal node represents a decision based on the value of a specific feature. Each branch from a node signifies an outcome of that decision. The leaf nodes represent the model’s outputs.</p>



<p>Visualization of this structure offers a straightforward representation of the decision-making process, enabling data scientists and business stakeholders alike to comprehend the decision rules the model has learned.</p>



<p>During training, a decision tree identifies the feature that best separates the samples in a branch based on a specific criterion, often the Gini impurity or information gain. In other words, it determines the most discriminative feature.</p>



<p>Visualizing decision trees (or their ensembles like random forests or gradient-boosted trees) involves a graphical rendering of their overall structure, displaying the splits and decisions at each node clearly and intuitively. The depth and width of the tree, as well as the leaf nodes, become evident at first sight. Moreover, decision tree visualization aids in identifying crucial features, the most discriminative attributes that lead to accurate predictions.</p>



<p>The path to accurate prediction can be summed up in four steps:</p>



<ul class="wp-block-list">
<li><strong>Feature Clarity</strong>: Decision tree visualization is like peeling back layers of complexity to reveal the pivotal features at play. It&#8217;s akin to looking at a decision-making flowchart, where each branch signifies a feature, and each decision node holds a crucial aspect of our data.<br></li>



<li><strong>Discriminative Attributes</strong>: The beauty of a decision tree visualization lies in its ability to highlight the most discriminative features. These factors heavily influence the outcome, guiding the model in making predictions. Through visualizing the tree, we can pinpoint these features and thus understand the core factors driving our model&#8217;s decisions.<br></li>



<li><strong>Path to Precision:</strong> Every path down the decision tree is a journey towards precision. The visualization showcases the sequence of decisions that lead to a particular prediction. This is gold for understanding the logic and criteria our model uses to reach specific conclusions.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Simplicity Amidst Complexity</strong>: Despite the complexity of machine learning algorithms, decision tree visualization comes with an element of simplicity. It transforms intricate mathematical calculations into an intuitive representation, making it accessible to technical and non-technical stakeholders.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="1800" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=1800%2C1800&#038;ssl=1" alt="Decision tree visualization in machine learning: plot representing a decision tree classifier trained on the Iris data set" class="wp-image-32268" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=1536%2C1536&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=300%2C300&amp;ssl=1 300w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Example of decision tree visualization in machine learning : decision tree classifier trained on the </strong><a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" target="_blank" rel="noreferrer noopener nofollow"><strong>Iris data set</strong></a> | Source: Author</figcaption></figure>
</div>


<p>The diagram above shows the structure of a decision tree classifier trained on the famous Iris dataset. This dataset consists of 150 samples of iris flowers, each belonging to one of three species: <em>setosa</em>, <em>versicolor</em>, or <em>virginica</em>. Each sample has four features: sepal length, sepal width, petal length, and petal width.</p>



<p>From the decision tree visualization, we can understand how the model classifies a flower:</p>



<ol class="wp-block-list">
<li><strong>Root node</strong>: At the root node, the model determines whether the petal length is 2.45 cm or less. If so, it classifies the flower as <em>setosa</em>. Otherwise, it moves on to the next internal node.<br></li>



<li><strong>Second split based on petal length</strong>: If the petal length is greater than 2.45 cm, the tree again uses this feature to make a decision. The decision criterion is whether the petal length is less than or equal to 4.75 cm.</li>
</ol>



<ol start="3" class="wp-block-list">
<li><strong>Split based on petal width</strong>: If the petal length is less than or equal to 4.75 cm, the model next considers the petal width and determines whether it is above 1.65 cm. If so, it classifies the flower as <em>virginica</em>. Otherwise, the model’s output is<em> versicolor.</em></li>
</ol>



<ol start="4" class="wp-block-list">
<li><strong>Split based on sepal length</strong>: If the petal length is greater than 4.75 cm, the model determined during training that sepal length is best suited to distinguish <em>versicolor</em> from <em>virginica</em>. If the sepal length is greater than 6.05 cm, it classifies the flower as <em>virginica</em>. Otherwise, the model’s output is <em>versicolor</em>.</li>
</ol>



<p>The visualization captures this hierarchical decision-making process and represents it in a way that is easier to understand than a simple listing of decision rules.</p>
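<p>A textual version of this same kind of tree can be recovered with scikit-learn (assuming it is installed); the exact thresholds may vary slightly between library versions:</p>

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small decision tree on the Iris data set and print its learned
# rules; sklearn.tree.plot_tree would draw the graphical version instead.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)  # e.g. "|--- petal length (cm) <= 2.45" at the root
```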



<h3 class="wp-block-heading" id="h-ensemble-model-visualization">Ensemble model visualization</h3>



<p>Ensemble approaches like random forests, AdaBoost, gradient boosting, and bagging combine multiple simpler models (called base models) into one larger, more accurate model. For example, a random forest classifier comprises many decision trees. Understanding the contributions and complex interplay of these base models is crucial when debugging and assessing ensembles.</p>



<p>One way to visualize an ensemble model is to create a diagram showing how the base models contribute to the ensemble model’s output. A common approach is to plot the base models’ decision boundaries (also called surfaces), highlighting their influence across different parts of the feature space. By examining how these decision boundaries overlap, we can learn how the base models give rise to the collective predictive power of the ensemble.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1738" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1920%2C1738&#038;ssl=1" alt="Ensemble model visualization example: how individual classifiers adapt to different data distributions by adjusting their decision boundaries. " class="wp-image-32046" style="aspect-ratio:1.1041426927502878;width:800px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1920%2C1738&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=768%2C695&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=200%2C181&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1536%2C1391&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=220%2C199&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=120%2C109&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=160%2C145&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=300%2C272&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=480%2C435&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1020%2C924&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Example of ensemble model visualization</em>: h<em>ow individual classifiers adapt to different data distributions by adjusting their decision boundaries. Darker areas signify higher confidence, i.e., the model is more confident about its prediction. Lighter areas represent regions of lower confidence | </em><a href="https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Ensemble model visualizations also help users better comprehend the weights assigned to each base model within the ensemble. Typically, base models have a strong influence in some regions of the feature space and little influence in others. However, there might also be base models that never contribute significantly to the ensemble’s output. Identifying base models with particularly low or high weights can help to make ensemble models more robust and improve their generalizability.</p>



<h3 class="wp-block-heading" id="h-visually-building-models">Visually building models</h3>



<p>Visual ML is an approach to designing machine-learning models using a low-code or no-code platform. It enables users to create and modify complex machine-learning processes, models, and outcomes through a user-friendly visual interface. Instead of retroactively generating model structure visualizations, Visual ML places them at the heart of the ML workflow.</p>



<p>In a nutshell, Visual ML platforms offer drag-and-drop model-building workflows that allow users of various backgrounds to create ML models easily. They bridge the gap between the abstract world of algorithms and our innate ability to grasp patterns and relationships through visuals.</p>



<p>These platforms can save us time and help us build model prototypes quickly. Since models can be created in minutes, training and comparing different model configurations is easy. The model which performs best can then be optimized further, perhaps using a more code-centric approach.</p>



<p>Data scientists and machine learning engineers can make use of Visual ML tools to create:</p>



<div id="case-study-numbered-list-block_fe5ee669abf2d5c7c87fe84bd2b1b5be"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Experimental prototypes            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                MLOps pipelines            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Optimized ML code for production            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Scaled versions of an existing ML model codebase for larger datasets            </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="814" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1920%2C814&#038;ssl=1" alt="A classic example of how to create ML/DL models with no code. This type of interface is agile and enables a detailed understanding of how the models work" class="wp-image-32049" style="aspect-ratio:2.3574938574938575;width:811px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1920%2C814&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=768%2C326&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=200%2C85&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1536%2C652&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=220%2C93&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=120%2C51&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=160%2C68&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=300%2C127&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=480%2C204&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1020%2C433&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Example of how to create ML/DL models with no code. This type of interface is agile and enables a detailed understanding of how the models work</em>&nbsp;| <a href="https://playground.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Examples of Visual ML tools are <a href="https://playground.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow’s Neural Network Playground</a> and <a href="https://www.knime.com/" target="_blank" rel="noreferrer noopener nofollow">KNIME</a>, an open-source data science platform built entirely around Visual ML and No-Code concepts.</p>



<h2 class="wp-block-heading" id="h-visualize-machine-learning-model-performance">Visualize machine learning model performance</h2>



<p>In many cases, we do not care so much about how a model works internally but are interested in understanding its performance. For which kinds of samples is it reliable? Where does it frequently draw the wrong conclusions? Should we go with model A or model B?</p>



<p>In this section, we’ll look at machine learning visualizations that help us better understand a model’s performance.</p>



<h3 class="wp-block-heading" id="h-confusion-matrices">Confusion matrices</h3>



<p><a href="/blog/evaluation-metrics-binary-classification" target="_blank" rel="noreferrer noopener">Confusion matrices</a> are a fundamental tool for evaluating a classification model’s performance. A confusion matrix compares a model’s predictions with the ground truth, clearly showing what kind of samples a model misclassifies or where it struggles to distinguish between classes.&nbsp;</p>



<p>In the case of a binary classifier, a confusion matrix has just four fields: true positives, false positives, false negatives, and true negatives:</p>



<div id="medium-table-block_05d93235456530245aaf958d3c99df82"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="min-width: 250px">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="min-width: 250px">
                        <div class="c-item__inner">
                            Model predicts: 0                        </div>
                    </td>
                                    <td class="c-item"
                        style="min-width: 250px">
                        <div class="c-item__inner">
                            Model predicts: 1                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>True value: 0</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>true negative</em></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>false positive</em></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>True value: 1</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>false negative</em></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>true positive</em></p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>Equipped with this information, it’s straightforward to calculate essential metrics like precision, recall, F1 score, and accuracy.</p>
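For instance, given the four counts from a binary confusion matrix, the metrics follow directly from their definitions (the counts below are made up for illustration):

```python
# counts from a hypothetical binary confusion matrix
tp, fp, fn, tn = 85, 10, 5, 100

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```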



<p>The confusion matrix for a multi-class model follows the same general idea. The diagonal elements represent correctly classified instances (i.e., the model’s output matches the ground truth), while off-diagonal elements signify misclassifications.</p>



<p>Here is a small snippet to generate a confusion matrix for a scikit-learn classifier:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_classification
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> confusion_matrix, ConfusionMatrixDisplay
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.model_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> train_test_split
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SVC


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate some sample data</span>
X, y = make_classification(n_samples=<span class="hljs-number" style="color: teal;">1000</span>,
    n_features=<span class="hljs-number" style="color: teal;">10</span>,
    n_informative=<span class="hljs-number" style="color: teal;">6</span>,
    n_redundant=<span class="hljs-number" style="color: teal;">2</span>,
    n_repeated=<span class="hljs-number" style="color: teal;">2</span>,
    n_classes=<span class="hljs-number" style="color: teal;">6</span>,
    n_clusters_per_class=<span class="hljs-number" style="color: teal;">1</span>,
    random_state=<span class="hljs-number" style="color: teal;">42</span>
)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># split the data into train and test set</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=<span class="hljs-number" style="color: teal;">0</span>)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># initialize and train a classifier</span>
clf = SVC(random_state=<span class="hljs-number" style="color: teal;">0</span>)
clf.fit(X_train, y_train)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># get the model’s prediction for the test set</span>
predictions = clf.predict(X_test)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># using the model’s prediction and the true value,</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create a confusion matrix</span>
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># use the built-in visualization function to generate a plot</span>
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()
plt.show()
</pre></code></pre>
</div>



<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="498" height="432" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=498%2C432&#038;ssl=1" alt="Visualize machine learning model performance: 6x6 confusion matrix" class="wp-image-32057" style="aspect-ratio:1.1527777777777777;width:498px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?w=498&amp;ssl=1 498w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=200%2C173&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=220%2C191&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=120%2C104&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=160%2C139&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=300%2C260&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=480%2C416&amp;ssl=1 480w" sizes="auto, (max-width: 498px) 100vw, 498px" /><figcaption class="wp-element-caption">Example of model performance visualization: 6&#215;6 confusion matrix | Source: Author</figcaption></figure>
</div>


<p>Let’s have a look at the output. As mentioned before, the elements on the diagonal represent correctly classified samples, and the off-diagonal elements represent cases where the model confuses classes – hence the name “confusion matrix.”</p>



<p>Here are three key takeaways from the plot:</p>



<ol class="wp-block-list">
<li><strong>Diagonal</strong>: Ideally, the matrix&#8217;s main diagonal should be populated with the highest numbers. These numbers represent the instances where the model correctly predicted the class, aligning with the true class. Looks like our model is doing pretty well here!<br></li>



<li><strong>Off-diagonal entries</strong>: The numbers outside the main diagonal are equally important. They reveal cases where the model made errors. For example, if you look at the cell where row 5 intersects with column 3, you’ll see that there were five cases where the true class was “5”, but the model predicted class “3”. Perhaps we should look at the affected samples to better understand what’s going on here!<br></li>



<li><strong>Analyzing performance at a glance</strong>: By examining the off-diagonal entries, you can see immediately that they’re quite low. Overall, the classifier seems to do a pretty good job. You’ll also notice that we have about an equal number of samples for each category. In many real-world scenarios, this is not going to be the case. Then, generating a second confusion matrix that shows the likelihood of a correct classification (rather than the absolute number of samples) can be helpful.</li>
</ol>



<p>Visual enhancements like color gradients and percentage annotations make a confusion matrix more intuitive and easily interpretable. Confusion matrices styled like a heatmap draw attention to classes with high error rates and thus guide further model development.</p>



<p>Confusion matrices can also help non-technical stakeholders grasp a model&#8217;s strengths and weaknesses, fostering discussions about the need for additional data or cautionary measures when using model predictions for critical decisions.</p>


    <a
        href="/blog/ml-model-performance-monitoring"
        id="cta-box-related-link-block_34d58b8dce95ca5c22f7a17ac87a0e08"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
        <div class="block-cta-box-related-link__image-wrapper">
            <figure class="c-image__wrapper">

                
                <img
                    src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/blog_feature_image_051678_7_0_9_0.jpg?fit=200%2C105&amp;ssl=1"
                    loading="lazy"
                    decoding="async"
                    width="200"
                    height="105"
                    class="c-image"
                    alt="">
            </figure>
        </div>

    
    <div class="block-cta-box-related-link__description-wrapper">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-doing-ml-model-performance-monitoring-the-right-way">                Doing ML Model Performance Monitoring The Right Way            </h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-visualizing-cluster-analysis">Visualizing cluster analysis</h3>



<p>Cluster analysis groups similar data points based on specific features. Visualizing these clusters can bring to light patterns, trends, and relationships within the data.</p>



<p>Scatter plots where each point is colored according to its cluster assignment are a standard way to visualize the results of a <a href="/blog/clustering-algorithms" target="_blank" rel="noreferrer noopener">cluster analysis</a>. Cluster boundaries and their distribution across the feature space are clearly visible. Pair plots or parallel coordinates help to understand the relationships between multiple features.</p>
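A minimal sketch of such a cluster-colored scatter plot, assuming scikit-learn and matplotlib are available (the blob dataset and parameter choices are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# generate three well-separated 2-D clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# fit k-means and color each point by its cluster assignment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x", s=100)
plt.title("k-means cluster assignments")
plt.savefig("clusters.png")
```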


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="917" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1920%2C917&#038;ssl=1" alt="Visualizing cluster analysis: two different data clusters produced by k-means clustering. You can see that in both cases, the clusters the model found (color-coded) do not match the actual clusters in the data" class="wp-image-32060" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1920%2C917&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=768%2C367&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=200%2C96&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1536%2C734&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=220%2C105&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=120%2C57&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=160%2C76&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=300%2C143&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=480%2C229&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1020%2C487&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?w=1960&amp;ssl=1 1960w" 
sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"> Example of visualizing cluster analysis: two different data clusters produced by k-means clustering. You can see that in both cases, the clusters the model found (color-coded) do not match the actual clusters in the data | <a href="https://scikit-learn.org/stable/modules/clustering.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>One popular clustering algorithm, <a href="/blog/k-means-clustering" target="_blank" rel="noreferrer noopener">k-means</a>, begins with selecting starting points called centroids. A simple approach is randomly picking k samples from the dataset.</p>



<p>Once these initial centroids are established, k-means alternates between two steps:</p>



<div id="case-study-numbered-list-block_4783b3dfdf1313185856177bdcb42497"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It associates each sample with the nearest centroid, thereby creating clusters comprised of the samples associated with the same centroid.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It recalibrates the centroids by averaging the values of all samples in a cluster.            </li>
            </ul>
</div>



<p>As this process continues, the centroids move, and the association of points with clusters is iteratively refined. Once the difference between the old and new centroids falls below a set threshold, signaling stability, k-means concludes.&nbsp;</p>



<p>The result is a set of centroids and clusters that you can visualize in a plot like the one above.</p>
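The alternating steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the algorithm, not a production implementation (it ignores, for example, the rare case of a cluster becoming empty):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 2-D data: two Gaussian blobs
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(3, 0.5, (100, 2))])

k = 2
# step 0: pick k random samples from the dataset as initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # step 1: associate each sample with its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # step 2: recalibrate each centroid as the mean of its cluster's samples
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

    # conclude once the centroids barely move
    if np.linalg.norm(new_centroids - centroids) < 1e-6:
        break
    centroids = new_centroids
```

Plotting `X` colored by `labels`, together with the final `centroids`, reproduces exactly the kind of visualization shown above.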



<p>For larger datasets, t-SNE (t-distributed Stochastic Neighbor Embedding) or UMAP (Uniform Manifold Approximation and Projection) can be employed to reduce dimensions while preserving cluster structures. These techniques aid in visualizing high-dimensional data effectively.&nbsp;</p>



<p>t-SNE takes complex, high-dimensional data and transforms it into a lower-dimensional representation. The algorithm starts by assigning each data point a location in the lower-dimensional space. Then, it looks at the original data and decides where each point should be placed in this new space, considering its neighboring points. Points that were similar in the high-dimensional space are pulled closer together in the new space, and those that are dissimilar are pushed apart.</p>



<p>This process repeats until the points settle into stable positions. The final result is a clustered representation where similar data points form groups, allowing us to see patterns and relationships hidden in the high-dimensional chaos. It&#8217;s like a symphony where each note finds its harmonious place, creating a beautiful composition of data.</p>
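In practice, scikit-learn's <code>TSNE</code> does all of this in one call. A minimal sketch on a subset of the digits dataset (the subset size and perplexity are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened into 64-dimensional feature vectors
digits = load_digits()
X = digits.data[:200]

# project to 2-D; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (200, 2)
```

Plotting `X_2d` as a scatter plot colored by `digits.target[:200]` reveals whether samples of the same digit end up close together in the embedding.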


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=1800%2C942&#038;ssl=1" alt="Visualizing cluster analysis: the t-SNE algorithm creates clusters from high-dimensional data in a low-dimensional space" class="wp-image-32278" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=300%2C157&amp;ssl=1 300w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The t-SNE algorithm creates clusters from high-dimensional data in a low-dimensional space | <a href="https://bcho.tistory.com/1210" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>UMAP also tries to find clusters in high-dimensional space but takes a different approach.</p>



<p>Here is how UMAP works:</p>



<ul class="wp-block-list">
<li><strong>Neighbor finding</strong>: UMAP begins by identifying the neighbors of each data point. It determines which points are close to each other in the original high-dimensional space.</li>



<li><strong>Fuzzy simplicial set construction</strong>: Imagine creating a web of connections between these neighboring points. UMAP models the strength of these connections based on how related or similar the points are.</li>



<li><strong>Low-dimensional layout</strong>: After determining their closeness, UMAP carefully arranges the data points in the lower-dimensional space. Points strongly connected in the high-dimensional space are placed close together in this new space.</li>



<li><strong>Optimization</strong>: UMAP aims to find the best representation in lower dimensions. It minimizes the difference between the distances in the original high-dimensional space and the new lower-dimensional space.</li>



<li><strong>Clustering</strong>: UMAP uses clustering algorithms to group similar data points. Imagine gathering similarly colored marbles together; this allows us to see patterns and structures more clearly.</li>
</ul>


    <a
        href="/blog/dimensionality-reduction"
        id="cta-box-related-link-block_efb08fded76bba8c3b8c347f3589e434"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    You may also like                </div>
            </div>

        
                    <h3 class="c-header" id="h-dimensionality-reduction-for-machine-learning">                Dimensionality Reduction for Machine Learning            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-comparative-model-analysis">Comparative model analysis</h2>



<p>Comparing different <a href="/blog/performance-metrics-in-machine-learning-complete-guide" target="_blank" rel="noreferrer noopener">model performance metrics</a> is crucial for deciding which machine learning model is best suited for a task. Whether during the experimental phase of an ML project or while re-training production models, visualizations are often necessary to turn complex numeric results into actionable insights.</p>



<p>Thus, visualizations for model performance metrics, such as ROC curves and calibration plots, are tools every data scientist and ML engineer should have in their toolbox. They are fundamental for understanding and communicating the effectiveness of machine learning models.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="567" height="432" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=567%2C432&#038;ssl=1" alt="Comparative model analysis: comparing three different models using ROC curves and the ROC-AUC metric" class="wp-image-32064" style="aspect-ratio:1.3125;width:567px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?w=567&amp;ssl=1 567w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=200%2C152&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=220%2C168&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=120%2C91&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=160%2C122&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=300%2C229&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=480%2C366&amp;ssl=1 480w" sizes="auto, (max-width: 567px) 100vw, 567px" /><figcaption class="wp-element-caption"><em>Example of comparative model analysis: comparing three different models using ROC curves and the ROC-AUC metric</em> | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-roc-curves">ROC curves</h3>



<p>Receiver operating characteristic curves – ROC curves for short – are vital when analyzing machine-learning classifiers and comparing ML model performance.</p>



<p>A ROC curve plots a model&#8217;s true positive rate against its false positive rate as a function of the cutoff threshold. It depicts the trade-off between true and false positives we invariably have to make and offers insight into a model’s discriminative power.</p>



<p>A curve closer to the top-left corner signifies superior performance: The model achieves a high rate of true positives while maintaining a low rate of false positives. Comparing ROC curves helps us choose the best model.</p>



<p>Here is a step-by-step explanation of how the ROC curve works:</p>



<p>In binary classification, we are interested in predicting one of two possible outcomes, typically labeled as positive (e.g., presence of a disease) and negative (e.g., absence of a disease).</p>



<p>Remember that we can turn any classification problem into a binary one by selecting one class as the positive outcome and assigning all other classes as negative outcomes. Hence, ROC curves can still be helpful for multi-class or multi-label classification problems.</p>



<p>The axes of the ROC curve represent two metrics:<br></p>



<ul class="wp-block-list">
<li><strong>True Positive Rate (Sensitivity): </strong>The proportion of actual positive cases correctly identified by the model.</li>



<li><strong>False Positive Rate: </strong>The proportion of actual negative cases incorrectly identified as positive.<br></li>
</ul>



<p>A machine-learning classifier typically outputs the likelihood that a sample belongs to the positive class. For example, a logistic regression model outputs values between 0 and 1 that can be interpreted as the likelihood.</p>



<p>As data scientists, it’s up to us to select the threshold above which we assign the positive label. The ROC curve shows us the influence of that choice on our classifier&#8217;s performance.</p>



<p>If we set the threshold to 0, all samples will be assigned to the positive class – and the rate of false positives will be 1. Thus, in the upper right-hand corner of any ROC curve plot, you’ll see that the curve ends at (1, 1).</p>



<p>If we set the threshold to 1, no samples will ever be assigned to the positive class. But since, in this case, we never mistakenly assign a negative sample to the positive class, the rate of false positives will be 0. As you might have guessed already, that’s what we see in the lower left-hand corner of a ROC curve plot: The curve always begins at (0, 0).</p>



<p>The curve between those points is plotted by changing the threshold for classifying a sample as positive. The resulting curve – the ROC curve – reflects how the true positive rate and false positive rate change in relation to one another as this threshold varies.</p>
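<p>scikit-learn’s <code>roc_curve</code> performs exactly this threshold sweep and returns the false positive rates, true positive rates, and the thresholds it evaluated. A minimal sketch on synthetic scores:</p>

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)

# Synthetic ground truth and classifier scores: positives tend to
# receive higher scores, but the two classes overlap.
y_true = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.3, 0.15, 500),
                         rng.normal(0.7, 0.15, 500)])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# The strictest threshold assigns nothing to the positive class: (0, 0).
# The most permissive threshold assigns everything to it: (1, 1).
print(fpr[0], tpr[0], fpr[-1], tpr[-1])
```

<p>Passing <code>fpr</code> and <code>tpr</code> to matplotlib’s <code>plot</code> renders the ROC curve.</p>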



<p>But what do we learn from this?&nbsp;</p>



<p>The ROC curve shows the trade-off we must make between sensitivity (the true positive rate) and specificity (1 &#8211; false positive rate). In more colloquial terms, we can either find all the positive samples (high sensitivity) or be sure that all samples our classifier identifies as positive actually belong to the positive class (high specificity).</p>



<p>Consider a classifier that can perfectly distinguish between positive and negative samples: Its true positive rate is always 1, and its false positive rate is always 0, independent of our chosen threshold. Its ROC curve would shoot straight up from (0,0) to (0,1) and then run horizontally along the top of the plot from (0,1) to (1,1).</p>



<p>Thus, the closer the ROC curve follows the left-hand border and then the top border of the plot, the more discriminative the model and the better it can satisfy the sensitivity and specificity objectives.</p>



<p>To compare different models, we often don’t use the curve directly but compute the area under it. This quantifies the model&#8217;s overall ability to discriminate between positive and negative classes.</p>



<p>This so-called ROC-AUC (the area under the ROC curve) can take on values between 0 and 1, with higher values indicating a better performance. Indeed, our perfect classifier would reach a ROC-AUC of exactly 1.</p>



<p>When using the ROC-AUC metric, it’s essential to keep in mind that the baseline is not 0 but 0.5 – the ROC-AUC of a perfectly random classifier. If we use <em>np.random.rand()</em> as our classifier, the resulting ROC curve will be a diagonal line from (0,0) to (1,1).</p>
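<p>This baseline is easy to verify empirically. A quick sketch where the “classifier” is pure noise (using a seeded generator instead of <em>np.random.rand()</em> so the result is reproducible):</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Balanced binary labels and completely random "scores".
y_true = rng.integers(0, 2, size=100_000)
y_score = rng.random(size=100_000)

auc = roc_auc_score(y_true, y_score)
print(round(auc, 2))  # close to 0.5: no discriminative power at all
```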


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="640" height="480" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=640%2C480&#038;ssl=1" alt="Comparative model analysis: one-vs-rest ROC curves" class="wp-image-32065" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?w=640&amp;ssl=1 640w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=200%2C150&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=220%2C165&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=300%2C225&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=480%2C360&amp;ssl=1 480w" sizes="auto, (max-width: 640px) 100vw, 640px" /><figcaption class="wp-element-caption"><em>Example of comparative model analysis: a random classifier’s ROC curve is diagonal, resulting in a ROC-AUC of 0.5. The ROC curve of an actual ML classifier shown in yellow always lies above that line, with a ROC-AUC of 0.78</em> | <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Generating ROC curves and computing the ROC-AUC is straightforward using scikit-learn. It takes just a few lines of code in your model training script to create this evaluation data for each of your training runs. When you log the ROC-AUC and the ROC curve plot using an <a href="/blog/best-ml-experiment-tracking-tools" target="_blank" rel="noreferrer noopener">ML experiment tracking tool</a>, you can later compare different model versions.</p>



<section
	id="i-box-block_668a790247ad9a82cab75a17ed5de9bb"
	class="block-i-box  l-margin__top--standard l-margin__bottom--standard">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Might be useful</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<div
    id="custom-text-block_38bb73a94b531cf3aadcb6755ad597a5"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--0"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 600;"
    >
    
    When visualizing, comparing, and debugging models, it&#8217;s really useful to keep an organized record of all experiments.
    </div>



<div id="group-of-boxes-block_e1093fc761fee681cc78731632af1a63" class="b-group-of-boxes  l-padding__top--large l-padding__bottom--large">

<div
    class="c-wrapper c-wrapper--align-auto c-wrapper--align-vertical-auto" >
    <div class="b-group-of-boxes__grid l-grid--cols-2  l-grid--boxes">
        

	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-center c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<p>Media intelligence company Hypefactors is using neptune.ai for that.</p>



<blockquote
	id="quote-small-block_9c7d7092ed62a7f67c6b0a7be042ea25"
	class="block-quote-small ">

	<img
		src="https://neptune.ai/wp-content/themes/neptune/img/icon-quote-small.svg"
		alt=""
		width="24"
		height="18"
		class="c-item__icon">

	
		<div class="c-item__content">

			We use Neptune for most of our tracking tasks, from experiment tracking to uploading the artifacts. A very useful part of tracking was monitoring the metrics, now we could easily see and compare those F-scores and other metrics.
							<cite class="c-item__cite">
					<p>Andrea Duque, Data Scientist at Hypefactors</p>
				</cite>
			
		</div>

	
</blockquote>


	</div>



	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-flex-start c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<div id="app-screenshot-block_1d1a377b90e8698910034bf2be497fab"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=480%2C271&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=768%2C434&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="577"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							target="_blank" rel="nofollow noopener noreferrer"							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

			
</div>


	</div>


    </div>
</div>


</div>



<ul
    id="arrow-list-block_e020710a7bce7541c5783c6af7cdf8e7"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Full <a href="/customers/hypefactors" target="_blank" rel="noreferrer noopener">case study with Hypefactors</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Dive into available <a href="/product/compare-experiments" target="_blank" rel="noreferrer noopener">comparison features</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a rel="noreferrer noopener" href="/contact-us" target="_blank">Get in touch</a>&nbsp;if you’d like to go through a custom demo with your team</p>


</li>


</ul>


	</div>

</section>



<p><strong>Computing and logging the ROC-AUC</strong></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> roc_auc_score


clf.fit(x_train, y_train)


y_test_pred = clf.predict_proba(x_test)
auc = roc_auc_score(y_test, y_test_pred[:, <span class="hljs-number" style="color: teal;">1</span>])


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># optional: log to an experiment-tracker like neptune.ai</span>
neptune_logger.run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"roc_auc_score"</span>].append(auc)
</pre></code></pre>
</div>




<p><strong>Creating and logging a ROC plot</strong></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> scikitplot.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> plot_roc
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt


fig, ax = plt.subplots(figsize=(<span class="hljs-number" style="color: teal;">16</span>, <span class="hljs-number" style="color: teal;">12</span>))
plot_roc(y_test, y_test_pred, ax=ax)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># optional: log to an experiment tracker like neptune.ai</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> neptune.types <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> File
neptune_logger.run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"roc_curve"</span>].upload(File.as_html(fig))
</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-calibration-curves">Calibration curves</h3>



<p>While machine-learning classifiers typically output values between 0 and 1 for each class, these values do not represent a likelihood or confidence in the statistical sense. That’s perfectly fine in many cases because we’re only interested in obtaining the correct labels.</p>



<p>But if we want to report a confidence level along with the classification outcome, we must ensure our classifier is calibrated. Calibration curves are a helpful visual aid to understand how well a classifier is calibrated. We can also use them to compare different models or to check that our attempts to re-calibrate a model were successful.</p>



<p>Let’s again consider the case of a model that outputs values between 0 and 1. If we choose a threshold, say 0.5, we can turn this into a binary classifier where all samples for which the model outputs a higher value are assigned to the positive class (and vice versa).</p>



<p>A calibration curve plots the “fraction of positives” against the model’s output. The “fraction of positives” is the conditional probability that a sample actually belongs to the positive class, given the model’s output: P(sample is positive | model output).</p>



<p>Does that sound way too abstract? Let’s look at an example:</p>


<div class="wp-block-image">
<figure class="aligncenter"><img decoding="async" src="https://lh7-us.googleusercontent.com/NUJixw3A_NlXfWsxFbvxwP9tuuXy6lezTUtzDEr8jKSf1haKuZjyEaJLWmhPzvmTHAxTjh9BSsQyrEK7I9Cl0chw2k-SHtNFiSs38cJtcYRzhfNDcE4c9cXhOHzmpQKxvoHacGvVFeR4EC47RA7xzK0" alt="Calibration curves: comparing different models |  Source: Author"/><figcaption class="wp-element-caption">Example of calibration curves: comparing different models |&nbsp; Source: Author</figcaption></figure>
</div>


<p>First, have a look at the diagonal line. It represents a perfectly calibrated classifier: The model’s output between 0 and 1 is precisely the probability that a sample belongs to the positive class. For example, if the model outputs 0.5, there’s a 50:50 chance the sample belongs to either the positive or negative class. If the model outputs 0.2 for a sample, there is only a 20% chance that the sample belongs to the positive class.</p>



<p>Next, consider the calibration curve for the Naive Bayes classifier: You see that even when this model outputs 0, there is about a 10% chance that the sample is positive. If the model outputs 0.8, there’s still a 50% chance that the sample belongs to the negative class. Hence, the classifier’s output does not reflect its confidence.</p>



<p>Computing the “fraction of positives” is far from straightforward. We need to create bins based on the model’s outputs, which is complicated by the fact that the distribution of samples across the model’s value range is typically not homogeneous. For example, a logistic regression classifier typically assigns values close to 0 or 1 to many samples but rarely outputs values close to 0.5. You can find a more in-depth discussion of this topic in the <a href="https://scikit-learn.org/stable/modules/calibration.html#calibration-curves" target="_blank" rel="noreferrer noopener nofollow">scikit-learn documentation</a>. There, you can also dive into possible ways to re-calibrate models, which is beyond the scope of this article.</p>
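<p>scikit-learn wraps this binning logic in <code>calibration_curve</code>, which returns the observed fraction of positives and the mean predicted probability per bin. A sketch with a Gaussian Naive Bayes classifier on synthetic data:</p>

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# For each of up to 10 bins: observed fraction of positives and the
# mean predicted probability of the samples falling into that bin.
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
```

<p>Plotting <code>frac_pos</code> against <code>mean_pred</code>, together with the diagonal, produces a calibration curve like the ones in the figure above.</p>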



<p>For our purposes here, we’ve seen how calibration curves visualize complex model behavior in an easy-to-grasp fashion. From a quick glance at the plot, we can see whether models are well-calibrated and which comes closest to the ideal.</p>



<h3 class="wp-block-heading" id="h-visualizing-hyperparameter-tuning">Visualizing hyperparameter tuning</h3>



<p><a href="/blog/hyperparameter-tuning-in-python-complete-guide" target="_blank" rel="noreferrer noopener">Hyperparameter tuning</a> is a critical step in developing a machine-learning model. The aim is to select the best configuration of hyperparameters – a generic name for parameters not learned by the model from the data but pre-defined by its human creators. Visualizations can aid data scientists in understanding the impact of different hyperparameters on a model’s performance and properties.</p>



<p>Finding the optimal configuration of hyperparameters is a skill on its own and goes far beyond the machine learning visualization aspect we will focus on here. To learn more about hyperparameter tuning in all its depth, I recommend this article on <a href="/blog/improving-ml-model-performance" target="_blank" rel="noreferrer noopener">improving ML model performance</a> by a former Amazon AI researcher.&nbsp;</p>



<p>A common approach to systematic hyperparameter optimization is creating a list of possible parameter combinations and training a model for each. This is often referred to as “grid search.”</p>



<p>For instance, if you are training a Support Vector Machine (SVM), you might want to try out different values for the parameters <em>C</em> (the regularization parameter) and <em>gamma</em> (the kernel coefficient):</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
 
C_range = np.logspace(<span class="hljs-number" style="color: teal;">-2</span>, <span class="hljs-number" style="color: teal;">10</span>, <span class="hljs-number" style="color: teal;">13</span>)
gamma_range = np.logspace(<span class="hljs-number" style="color: teal;">-9</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">13</span>)

param_grid = {"gamma": gamma_range, "C": C_range}
</pre></code></pre>
</div>




<p>Using scikit-learn’s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn-model-selection-gridsearchcv" target="_blank" rel="noreferrer noopener nofollow">GridSearchCV</a>, you can train models for each possible combination (using a cross-validation strategy) and find the best one with respect to an evaluation metric:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.model_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> GridSearchCV
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SVC

grid = GridSearchCV(SVC(), param_grid=param_grid, scoring=<span class="hljs-string" style="color: rgb(221, 17, 68);">"accuracy"</span>)
grid.fit(X, y)
</pre></code></pre>
</div>




<p>After the grid search concludes, you can inspect the results:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">print(
<span class="hljs-string" style="color: rgb(221, 17, 68);">"The best parameters are %s with a score of %0.2f"</span>
% (grid.best_params_, grid.best_score_)
)
</pre></code></pre>
</div>




<p>But we’re usually not just interested in finding the best model but also want to understand the effect its parameters have. For example, if a parameter does not influence the model’s performance, we don’t need to waste time and money by trying out even more different values. On the other hand, if we see that as a parameter’s value increases, the model’s performance gets better, we might want to try even higher values for this parameter.</p>



<p>Here’s a visualization of the grid search we just performed:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="800" height="600" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=800%2C600&#038;ssl=1" alt="Visualization of the grid search: how SVM classifiers trained with different values of gamma and C perform on a test set" class="wp-image-32076" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?w=800&amp;ssl=1 800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=768%2C576&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=200%2C150&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=220%2C165&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=300%2C225&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=480%2C360&amp;ssl=1 480w" sizes="auto, (max-width: 800px) 100vw, 800px" /><figcaption class="wp-element-caption">Example of visualization of the grid search: how SVM classifiers trained with different values of <em>gamma</em> and <em>C</em> perform on a test set | <a href="https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html#sphx-glr-auto-examples-svm-plot-rbf-parameters-py" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>From the plot, we see that the value of <em>gamma</em> greatly influences the SVM’s performance. If <em>gamma</em> is set too high, the influence radius of support vectors is minimal, potentially causing overfitting even with substantial regularization through <em>C</em>. Conversely, an extremely small <em>gamma</em> overly restricts the model, making it incapable of capturing the intricacies of the patterns within the data. In this scenario, the influence region of any support vector spans the entire training set, rendering the model akin to a linear one, using hyperplanes to separate dense areas of different classes.</p>



<p>The best models lie along a diagonal line of <em>C</em> and <em>gamma</em>, as depicted in the second plot panel. By adjusting <em>gamma</em> (lower values for smoother models) and increasing <em>C</em> (higher values for greater emphasis on correct classification), we can traverse this diagonal to achieve well-performing models.</p>



<p>Even from this simple example, you can see how helpful visualizations are for drilling down into the root causes of differences in model performance. This is why many machine-learning <a href="https://mlflow.org/docs/latest/tracking.html#tracking-ui" target="_blank" rel="noreferrer noopener nofollow">experiment tracking tools</a> enable data scientists to create different types of visualizations to compare model versions.</p>
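<p>A plot like this can be recreated in a few lines with scikit-learn and Matplotlib. The sketch below is illustrative only — the dataset, parameter ranges, and colormap are our own choices, not the exact setup behind the figure. It runs a small grid search over <em>C</em> and <em>gamma</em> and renders the mean cross-validation scores as a heatmap:</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative parameter ranges; real searches often span wider log scales
C_range = np.logspace(-2, 2, 5)
gamma_range = np.logspace(-4, 0, 5)
search = GridSearchCV(SVC(), {"C": C_range, "gamma": gamma_range}, cv=3)
search.fit(X, y)

# cv_results_ holds one mean score per (C, gamma) combination, with gamma
# varying fastest, so the scores reshape into a (C, gamma) matrix
scores = search.cv_results_["mean_test_score"].reshape(len(C_range), len(gamma_range))

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="viridis")
ax.set_xticks(range(len(gamma_range)))
ax.set_xticklabels([f"{g:.0e}" for g in gamma_range])
ax.set_yticks(range(len(C_range)))
ax.set_yticklabels([f"{c:.0e}" for c in C_range])
ax.set_xlabel("gamma")
ax.set_ylabel("C")
fig.colorbar(im, ax=ax, label="mean CV accuracy")
fig.savefig("grid_search_heatmap.png")
```

<p>The score matrix has one row per <em>C</em> value and one column per <em>gamma</em> value, so a diagonal band of well-performing models, as discussed above, is directly visible in the heatmap.</p>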


    <a
        href="/product/compare-experiments"
        id="cta-box-related-link-block_ec6c11828540af998fe8ab987e5ebe39"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
        <div class="block-cta-box-related-link__image-wrapper">
            <figure class="c-image__wrapper">

                
                <img
                    src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/compare_tab.png?fit=200%2C90&amp;ssl=1"
                    loading="lazy"
                    decoding="async"
                    width="200"
                    height="90"
                    class="c-image"
                    alt="">
            </figure>
        </div>

    
    <div class="block-cta-box-related-link__description-wrapper">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--resource.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    May be useful                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-compare-multiple-runs">How to Compare Multiple Runs</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Learn more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-feature-importance-visualization">Feature importance visualization</h2>



<p>Feature importance visualizations provide a clear and intuitive way to grasp the contribution of each feature in the model&#8217;s decision-making process. Understanding which features significantly influence predictions is paramount in many applications.</p>



<p>Plenty of different approaches to extracting insights about feature importance from machine-learning models exist. Broadly speaking, we can divide them into two categories:</p>



<ul class="wp-block-list">
<li>Some kinds of models, like decision trees and random forests, inherently contain feature importance information as part of their model structure. All we need to do is extract and visualize it.</li>

<li>Most machine-learning models in use today do not provide feature importance information out of the box. We have to use statistical techniques and algorithmic approaches to uncover the influence of each input feature on the model’s final output.</li>
</ul>



<p>In the following, we’ll look at one example of each category: the mean decrease in impurity approach for random forest models and the model-agnostic LIME interpretability method. Other approaches you might want to look into include <a href="https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#tree-s-feature-importance-from-mean-decrease-in-impurity-mdi" target="_blank" rel="noreferrer noopener nofollow">permutation importance</a>, <a href="/blog/shap-values" target="_blank" rel="noreferrer noopener">SHAP</a>, and integrated gradients.</p>



<p>For the purpose of this article, we don’t care so much about how to obtain feature-importance data but about its visualization. To this end, bar charts are the top choice for structured data, with the length of each bar signifying the feature’s importance. Heatmaps are a clear favorite for images, and for text data, highlighting the most important words or phrases is typical.</p>



<p>In a business context, feature importance visualization is an invaluable tool for stakeholder communication. It provides a straightforward narrative, demonstrating the factors that predominantly influence predictions. This transparency enhances decision-making and can foster trust in the model&#8217;s outcomes.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="629" height="470" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=629%2C470&#038;ssl=1" alt="Feature importance visualization example: using the mean decrease in impurity method" class="wp-image-32077" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?w=629&amp;ssl=1 629w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=200%2C149&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=220%2C164&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=300%2C224&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=480%2C359&amp;ssl=1 480w" sizes="auto, (max-width: 629px) 100vw, 629px" /><figcaption class="wp-element-caption"><strong>Example of feature importance visualization, using the mean decrease in impurity method</strong> | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-mean-decrease-in-impurity">Mean decrease in impurity</h3>



<p>The mean decrease in impurity is a measure of each feature&#8217;s contribution to a decision tree&#8217;s performance. To understand this, we’ll first need to understand what “impurity” means in this context.</p>



<p>We’ll start with an analogy:</p>



<ul class="wp-block-list">
<li>Let’s say we have a fruit basket with apples, pears, and oranges. When the pieces of fruit are in the basket, they’re thoroughly mixed, and we could say this set has a high <strong>impurity</strong>.</li>



<li>Now, our task is to sort them by kind. If we put all the apples into a bowl, place the oranges on a tray, and leave the pears in the basket, we would be left with three sets that have perfect <strong>purity</strong>.</li>



<li>But here comes the twist: We cannot see the fruits while making our decision. For each piece of fruit, we are told its color, diameter, and weight. Then, we need to decide where it should go. Thus, these three properties are our <strong>features</strong>.</li>



<li>The weight and the diameter of the pieces of fruit will be very similar. They won’t help us much in sorting – or, to say it differently, they are unhelpful in <strong>decreasing the impurity</strong>.</li>



<li>But the color will be helpful. We might still struggle to distinguish between green or yellow apples and green or yellow pears, but if we learn that the color is red or orange, we can confidently make a decision. Thus, the “color” will give us the biggest <strong>mean decrease in impurity</strong>.</li>
</ul>



<p>Now, let’s use this analogy in the context of decision trees and random forests:</p>



<p>When building a decision tree, we want each node to be as pure as possible regarding the target variable. In more colloquial terms, when creating a new node for our tree, we aim to find the feature that best splits the samples that reach the node into two distinct sets so that samples with the same label are in the same set. (For the full mathematical details, see the <a href="https://scikit-learn.org/stable/modules/tree.html#mathematical-formulation" target="_blank" rel="noreferrer noopener nofollow">scikit-learn documentation</a>).</p>



<p>Each node in a decision tree reduces the impurity – roughly speaking, it helps sort the training samples by their target label. If a feature is the decision criterion in many nodes of the tree and is effective in cleanly dividing the samples, it will be responsible for a large share of the overall reduction in impurity the decision tree achieves. That’s why the “mean decrease in impurity” a feature is responsible for is a good measure of that feature’s importance.</p>



<p>Whew, that was a lot of complicated math and terminology!</p>



<p>Luckily, the visualizations are not quite that difficult to read. We can clearly identify our model’s primary drivers and use that information in feature selection. Reducing a model’s input space to just the most decisive features reduces its complexity and can prevent overfitting.</p>
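<p>As a concrete sketch, a bar chart like the one above can be produced with scikit-learn, whose tree ensembles expose exactly this mean decrease in impurity through the <code>feature_importances_</code> attribute. The wine dataset here is an illustrative stand-in, not the data behind the figure:</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset; any tabular classification task works the same way
data = load_wine()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ is the impurity decrease attributed to each feature,
# averaged over all trees and normalized to sum to one
importances = forest.feature_importances_
order = np.argsort(importances)

fig, ax = plt.subplots()
ax.barh(np.array(data.feature_names)[order], importances[order])
ax.set_xlabel("Mean decrease in impurity")
fig.tight_layout()
fig.savefig("feature_importance.png")
```

<p>Sorting the bars before plotting makes the ranking immediately readable, which is usually the point of the chart.</p>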



<p>Additionally, understanding feature importance informs data preparation. Features with low importance might be candidates for removal or consolidation, streamlining the input data preprocessing.</p>



<p>There’s one important caveat, though, that I would like to mention before we move on. Since a node’s decrease in impurity is determined during training, using the training data set, the “mean decrease in impurity” doesn’t necessarily translate to previously unseen test data:</p>



<p>Consider the case that our training samples are numbered, and this number is an input feature for our model. Then, if our decision tree is complex enough, it can just learn which sample has which label (e.g., “fruit 1 is an orange”, “fruit 2 is an apple”, …). The mean decrease in impurity for the number feature will be massive, and it will appear as a highly important feature in our visualization, even though it’s entirely useless when applying our model to data it has not seen before.</p>



<h3 class="wp-block-heading" id="h-local-interpretable-model-agnostic-explanations-lime">Local interpretable model-agnostic explanations (LIME)</h3>



<p>Local interpretability approaches aim to shed light on a model’s behavior in a specific instance. (The opposite is global interpretability, where a model’s behavior across its entire feature space is examined.)&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1009" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1920%2C1009&#038;ssl=1" alt="Local interpretable model-agnostic explanations (LIME) example: yielding important features" class="wp-image-32079" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1920%2C1009&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=768%2C403&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1536%2C807&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=220%2C116&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=300%2C158&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=480%2C252&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1020%2C536&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Example of 
local interpretable model-agnostic explanations (LIME) yielding the most important features | Source: Author</figcaption></figure>
</div>


<p>One of the oldest and still widely used techniques is <a href="https://github.com/marcotcr/lime" target="_blank" rel="noreferrer noopener nofollow">LIME</a> (Local Interpretable Model-agnostic Explanations). To uncover the contributions of each input feature to the model’s prediction, a linear model is fitted that approximates the model’s behavior in the particular area of the feature space. Roughly speaking, the linear model’s coefficients reflect the importance of the input features. The result can be visualized as a feature importance plot, highlighting the most influential features for a particular prediction.</p>
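<p>The core idea can be sketched without the <code>lime</code> package itself: perturb the instance, weight the perturbed samples by their proximity to it, and fit a weighted linear surrogate to the black-box model’s predictions. Everything below – the dataset, the noise scale, and the kernel width – is an illustrative assumption, not the exact LIME algorithm:</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
instance = X[0]  # the single prediction we want to explain

# 1. Perturb the instance: sample points in its local neighborhood
perturbed = instance + rng.normal(scale=X.std(axis=0) * 0.5, size=(500, X.shape[1]))

# 2. Weight samples by proximity to the original instance (RBF kernel;
#    the kernel width of 5.0 is an ad-hoc choice)
distances = np.linalg.norm((perturbed - instance) / X.std(axis=0), axis=1)
weights = np.exp(-(distances ** 2) / (2 * 5.0 ** 2))

# 3. Fit a weighted linear surrogate to the black-box model's predictions
targets = model.predict_proba(perturbed)[:, 1]
surrogate = Ridge(alpha=1.0).fit(perturbed, targets, sample_weight=weights)

# The surrogate's coefficients approximate local feature importance
top = np.argsort(np.abs(surrogate.coef_))[::-1][:5]
```

<p>The coefficients in <code>surrogate.coef_</code> are what a LIME-style bar chart visualizes: the sign shows whether a feature pushes the local prediction up or down, and the magnitude shows how strongly.</p>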



<p>Local interpretability techniques can extract intuitive insights from complex algorithms. Visualization of these results can support discussions with business stakeholders or be the foundation for cross-checking a model’s learned behavior with domain experts. They provide practical, actionable insights, enhance trust in a model’s intricate inner workings, and can be a vital tool for promoting machine learning adoption.</p>


    <a
        href="/blog/improving-ml-model-performance#h-how-to-improve-ml-model-performance-through-feature-engineering"
        id="cta-box-related-link-block_02eaad9c29306ad2e3e8cbd428a16db4"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-improve-ml-model-performance-best-practices-from-ex-amazon-ai-researcher">How to Improve ML Model Performance [Best Practices From Ex-Amazon AI Researcher]</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-how-to-adopt-model-visualization-in-machine-learning">How to adopt model visualization in machine learning?</h2>



<p>In this section, I’ll share tips on seamlessly integrating model visualization into your daily data science and machine learning routines.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=1800%2C942&#038;ssl=1" alt="How to adopt model visualization in machine learning" class="wp-image-32282" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">How to adopt model visualization in machine learning? | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-1-start-with-a-clear-purpose">1. Start with a clear purpose</h3>



<p>Before diving into model visualization, establish a clear purpose. Ask yourself, &#8220;What specific goals do I aim to achieve through visualization?&#8221;</p>



<p>Are you seeking to …</p>



<ul class="wp-block-list">
<li>… improve model performance?</li>



<li>… enhance interpretability?</li>



<li>… better communicate results to stakeholders?</li>
</ul>



<p>Defining your objectives will provide the direction needed for effective visualization.</p>



<h3 class="wp-block-heading" id="h-2-choosing-the-appropriate-visualization">2. Choose the appropriate visualization</h3>



<p>Always take a top-down approach: start at a very abstract level, then drill deeper for more detailed insights.</p>



<p>For instance, if you are seeking to improve the model’s performance, start with simple approaches, like plotting the model’s accuracy and loss as line plots.</p>



<p>Let’s assume that your model is overfitting. Then, you can use feature importance techniques to rank features based on their contribution to model performance. You can plot these feature importance scores to visualize the most influential features in the model. Features with high importance might point to overfitting and information leakage.</p>



<p>Likewise, you can create partial dependence plots (PDPs) for relevant features. PDPs show how the model’s prediction of the target variable changes as a specific feature varies while all other features are held constant. Look for erratic behavior or sharp fluctuations in the curve, which could indicate overfitting due to that feature.</p>



<h3 class="wp-block-heading" id="h-3-select-the-right-tools">3. Select the right tools</h3>



<p>Selecting the right tools depends on the task at hand and the features the tools offer. Python offers a plethora of libraries like <a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener nofollow">Matplotlib</a>, <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener nofollow">Seaborn</a>, and <a href="https://plotly.com/python/" target="_blank" rel="noreferrer noopener nofollow">Plotly</a> for creating static and interactive visualizations. Framework-specific tools, such as <a href="https://www.tensorflow.org/tensorboard" target="_blank" rel="noreferrer noopener nofollow">TensorBoard</a> for TensorFlow and <a href="https://scikit-plot.readthedocs.io/" target="_blank" rel="noreferrer noopener nofollow">scikit-plot</a> for scikit-learn, can be invaluable for model-specific visualizations.</p>


    <a
        href="/blog/the-best-tools-for-machine-learning-model-visualization"
        id="cta-box-related-link-block_dad6996e76da426a88d471b68e16edc3"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
        <div class="block-cta-box-related-link__image-wrapper">
            <figure class="c-image__wrapper">

                
                <img
                    src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/blog_feature_image_045530_5_1_6_4.jpg?fit=200%2C105&amp;ssl=1"
                    loading="lazy"
                    decoding="async"
                    width="200"
                    height="105"
                    class="c-image"
                    alt="">
            </figure>
        </div>

    
    <div class="block-cta-box-related-link__description-wrapper">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-the-best-tools-for-machine-learning-model-visualization">The Best Tools for Machine Learning Model Visualization</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-4-iterate-and-improve">4. Iterate and improve</h3>



<p>Remember that model visualization is an iterative process. Continuously refine your visualizations based on feedback from your team and the stakeholders you present them to. The ultimate goal is to make your models transparent, interpretable, and accessible to all stakeholders. Their input and evolving project requirements might mean you need to reconsider and adapt your approach.</p>



<p>Incorporating model visualization into your daily data science or machine learning practice empowers you to make data-driven decisions with clarity and confidence. Whether you&#8217;re a data scientist, a domain expert, or a decision-maker, adopting model visualization as a routine practice is a pivotal step in harnessing the <a href="/blog/how-to-make-machine-learning-project-more-likely-to-succeed" target="_blank" rel="noreferrer noopener">full potential of your machine-learning projects</a>.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Effective machine-learning model visualization is an indispensable tool for any data scientist. It empowers practitioners to gain insights, make informed decisions, and communicate results transparently.</p>



<p>In this article, we covered a lot of information about how we visualize machine learning models. To conclude, here are some key takeaways:</p>



<p><strong>Purpose of visualization in machine learning</strong>:</p>



<ul class="wp-block-list">
<li>Visualizations simplify complex ML model structures and data patterns for better understanding.</li>



<li>Interactive visualizations and Visual ML tools empower users to dynamically interact with data and models. They can tweak parameters, zoom in on details, and better understand the ML system.</li>



<li>Visualizations foster informed decision-making and effective communication of results.</li>
</ul>



<p><strong>Types of machine learning visualizations:</strong></p>



<ul class="wp-block-list">
<li>Model structure visualizations help data scientists, AI researchers, and business stakeholders understand complex algorithms and data flows.</li>



<li>Model performance visualizations provide insight into the performance characteristics of individual models and model ensembles.</li>



<li>Visualizations for comparative model analysis aid practitioners in selecting the best-performing model or verifying that a new model version is an improvement.</li>



<li>Feature importance visualizations uncover each input feature’s influence on a model’s output.</li>
</ul>



<p><strong>Best practices for adopting model visualization</strong>:</p>



<ul class="wp-block-list">
<li>Start with defined objectives and simple visualizations.</li>



<li>Choose an appropriate visualization method that suits your needs and is accessible to the intended audience.&nbsp;</li>



<li>Select the proper tools and libraries that help you craft accurate visualizations efficiently.</li>



<li>Continuously listen to feedback and adapt your visualizations to your stakeholders’ needs.</li>
</ul>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5831</post-id>	</item>
		<item>
		<title>Self-Driving Cars With Convolutional Neural Networks (CNN)</title>
		<link>https://neptune.ai/blog/self-driving-cars-with-convolutional-neural-networks-cnn</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:17:42 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/self-driving-cars-with-convolutional-neural-networks-cnn/</guid>

					<description><![CDATA[Humanity has been waiting for self-driving cars for several decades. Thanks to the extremely fast evolution of technology, this idea recently went from “possible” to “commercially available in a Tesla”. Deep learning is one of the main technologies that enabled self-driving. It’s a versatile tool that can solve almost any problem &#8211; it can be&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Humanity has been waiting for self-driving cars for several decades. Thanks to the extremely fast evolution of technology, this idea recently went from “possible” to “commercially available in a Tesla”.</p>



<p>Deep learning is one of the main technologies that enabled self-driving. It’s a versatile tool that can tackle almost any science or engineering problem &#8211; it can be used in physics, for example, to study the <a href="https://arxiv.org/pdf/2006.10159.pdf" target="_blank" rel="noreferrer noopener nofollow">proton-proton collision</a> in the Large Hadron Collider, just as well as in <a href="https://lens.google">Google Lens</a> to classify pictures.&nbsp;</p>



<p>In this article, we’ll focus on deep learning algorithms in self-driving cars &#8211; <strong>convolutional neural networks </strong>(CNN). CNN is the primary algorithm that these systems use to recognize and classify different parts of the road, and to make appropriate decisions.&nbsp;&nbsp;</p>
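<p>To see why CNNs suit this task, it helps to look at their core operation: sliding a small filter over the image to produce a feature map. The toy sketch below is purely illustrative &#8211; the image and the hand-picked Sobel filter are our own assumptions, since a real CNN learns its filters during training &#8211; and shows how a vertical-edge filter responds to a lane-like stripe:</p>

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in most DL frameworks)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 8x8 "image" with a bright vertical stripe, a stand-in for a lane marking
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# A hand-picked Sobel filter that responds to vertical edges; a CNN learns
# many such filters from data instead of using fixed ones
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, sobel_x)  # strong responses mark the stripe's edges
```

<p>Stacking many such filters, interleaved with nonlinearities and pooling, is what lets a CNN build up from raw edges to lane boundaries, signs, and vehicles.</p>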



<p>Along the way, we’ll see how Tesla, Waymo, and Nvidia use CNN algorithms to make their cars driverless or autonomous.&nbsp;</p>



<section id="blog-intext-cta-block_e36bfd5a646cb391a731c5cc3e4446cd" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-you-may-also-like">You may also like</h3>
    
            <p><a href="/customers/waabi" target="_blank" rel="noopener">Experiment Tracking for Systems Powering Self-Driving Vehicles [Case Study with Waabi]</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-how-do-self-driving-cars-work">How do self-driving cars work?</h2>



<p>One of the first self-driving cars was created in 1989: the <strong>Autonomous Land Vehicle In a Neural Network</strong> (ALVINN). It used neural networks to detect lines, segment the environment, navigate itself, and drive. It worked well, but it was limited by slow processing power and insufficient data.</p>



<p>With today’s high-performance graphics cards, processors, and huge amounts of data, self-driving is more powerful than ever. If it becomes mainstream, it will reduce traffic congestion and increase road safety.&nbsp;</p>



<p>Self-driving cars are autonomous decision-making systems. They process streams of data from different sensors such as cameras, LiDAR, RADAR, GPS, and inertial sensors. Deep learning algorithms then model this data and make decisions relevant to the environment the car is in.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-pipeline.png?ssl=1" alt="Self driving cars - pipeline" class="wp-image-50843" style="width:849px;height:276px"/><figcaption class="wp-element-caption"><em> A modular perception-planning-action pipeline |  <a href="https://arxiv.org/pdf/1910.07738.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows a modular <strong>perception-planning-action</strong> pipeline used to make driving decisions. The key components of this method are the different sensors that fetch data from the environment.&nbsp;</p>



<p>To understand the workings of self-driving cars, we need to examine the four main parts:</p>



<ol class="wp-block-list">
<li><strong>Perception&nbsp;</strong></li>



<li><strong>Localization</strong></li>



<li><strong>Prediction</strong></li>



<li><strong>Decision Making</strong>
<ol class="wp-block-list">
<li>High-level path planning&nbsp;</li>



<li>Behaviour Arbitration</li>



<li>Motion Controllers</li>
</ol>
</li>
</ol>



<h3 class="wp-block-heading" id="perception">Perception&nbsp;</h3>



<p>One of the most important properties that self-driving cars must have is <strong>perception</strong>, which helps the car see the world around itself, as well as recognize and classify the things that it sees. In order to make good decisions, the car needs to recognize objects instantly.</p>



<p>So, the car needs to see and classify traffic lights, pedestrians, road signs, walkways, parking spots, lanes, and much more. Not only that, it also needs to know the exact distance between itself and the objects around it. Perception is more than just seeing and classifying; it enables the system to evaluate distances and decide whether to slow down or brake.&nbsp;</p>



<p>To achieve such a high level of perception, a self-driving car must have three sensors:</p>



<ol class="wp-block-list">
<li>Camera</li>



<li>LiDAR</li>



<li>RADAR</li>
</ol>



<h4 class="wp-block-heading">Camera</h4>



<p>The camera provides vision to the car, enabling multiple tasks like <strong>classification, segmentation, </strong>and <strong>localization</strong>. The cameras need to be high-resolution and represent the environment accurately.</p>



<p>To make sure that the car receives visual information from every side &#8211; front, back, left, and right &#8211; the images from multiple cameras are stitched together into a 360-degree view of the environment. These cameras provide both a long-range view as far as 200 meters and a short-range view for more focused perception.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-camera.png?ssl=1" alt="Self driving cars - camera" class="wp-image-50844"/><figcaption class="wp-element-caption"><em>Self-driving car&#8217;s camera | <a href="https://heartbeat.fritz.ai/computer-vision-at-tesla-cd5e88074376" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>In some tasks like parking, the camera also provides a panoramic view for better decision-making.&nbsp;</p>



<p>Even though cameras handle most perception-related tasks, they’re hardly of any use in extreme conditions like fog or heavy rain, and especially at night. In such conditions, all that cameras capture is noise and discrepancies, which can be life-threatening.&nbsp;</p>



<p>To overcome these limitations, we need sensors that can work without light and also measure distance.</p>



<h4 class="wp-block-heading">LiDAR</h4>



<p>LiDAR stands for Light Detection And Ranging. It measures the distance to objects by firing a laser beam and timing how long it takes for the reflection to return.</p>



<p>A camera can only provide the car with images of what’s going around itself. When it’s combined with the LiDAR sensor, it gains depth in the images &#8211; it suddenly has a 3D perception of what’s going on around the car.&nbsp;</p>



<p>So, LiDAR perceives <strong>spatial information</strong>. And when this data is fed into deep neural networks, the car can predict the actions of the objects or vehicles close to it. This sort of technology is very useful in a complex driving scenario, like a multi-exit intersection, where the car can analyze all other cars and make the appropriate, safest decision.</p>



<figure class="wp-block-video aligncenter"><video height="338" style="aspect-ratio: 1200 / 338;" width="1200" autoplay controls loop muted src="https://neptune.ai/wp-content/uploads/2022/11/Self-driving-car-LiDAR.mp4"></video><figcaption class="wp-element-caption"><em>Object detection with LiDAR | <a href="https://shangzhouye.tech/other-projects/deeplidar_detection_tracking/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>



<p>In 2019, Elon Musk openly stated that “<a href="https://www.youtube.com/watch?v=HM23sjhtk4Q" target="_blank" rel="noreferrer noopener nofollow">anyone relying on LiDARs are doomed…</a>”. Why? Well, LiDARs have limitations that can be catastrophic. For example, the LiDAR sensor uses lasers or light to measure the distance of the nearby object. It will work at night and in dark environments, but it can still fail when there’s noise from rain or fog. That’s why we also need a RADAR sensor.</p>



<h4 class="wp-block-heading">RADARs</h4>



<p>Radio detection and ranging (RADAR) is a key component in many military and consumer applications. It was first used by the military to detect objects. It calculates distance using <strong>radio wave signals</strong>. Today, it’s used in many vehicles and has become a primary component of the self-driving car.&nbsp;</p>



<p>RADARs are highly effective because they use radio waves instead of lasers, so they work in any conditions.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar-vs-radar.png?ssl=1" alt="Self driving cars - lidar vs radar" class="wp-image-50847" style="width:840px;height:392px"/><figcaption class="wp-element-caption"><em><a href="https://cleantechnica.com/2016/07/29/tesla-google-disagree-lidar-right/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>It’s important to understand that radars are noisy sensors: the radar may report obstacles even where the camera sees none.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar.png?ssl=1" alt="Self driving cars - lidar" class="wp-image-50848" style="width:504px;height:474px"/></a><figcaption class="wp-element-caption"><em><a href="https://www.youtube.com/watch?v=6lG6B4tkCEk" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows the self-driving car (in green) using LiDAR to detect objects around and to calculate the distance and shape of the object. Compare the same scene, but captured with the RADAR sensor below, and you can see a lot of unnecessary noise.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-radar.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-radar.png?ssl=1" alt="Self driving cars - radar" class="wp-image-50849" style="width:512px;height:488px"/></a><figcaption class="wp-element-caption"><em><a href="https://www.youtube.com/watch?v=6lG6B4tkCEk" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The RADAR data should be cleaned in order to make good decisions and predictions. We need to separate weak signals from strong ones; this is called <strong>thresholding</strong>. We also use <strong>Fast Fourier Transforms</strong> (FFT) to filter and interpret the signal.&nbsp;</p>
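<p>To make these two steps concrete, here is a minimal sketch on a synthetic signal: a naive DFT (an FFT library computes the same spectrum far faster) followed by thresholding to separate the strong return from weak wideband noise. The signal and the threshold value are purely illustrative, not a real radar pipeline.</p>

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive DFT; an FFT computes the same spectrum faster."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# Synthetic signal: one strong sinusoidal return plus weak alternating noise.
n = 32
signal = [math.sin(2 * math.pi * 4 * t / n) + 0.05 * (-1) ** t for t in range(n)]

mags = dft_magnitudes(signal)
# Thresholding: keep only the frequency bins well above the noise floor.
strong_bins = [k for k, m in enumerate(mags) if m > 0.5 * max(mags)]
```

<p>The strong return shows up in two symmetric frequency bins (4 and 28 here), while the weak noise stays below the threshold and is discarded.</p>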



<p>If you look at the images above, you’ll notice that the RADAR and LiDAR signals are point-based data. This data should be clustered so that it can be interpreted properly. Clustering algorithms such as <strong>Euclidean clustering</strong> or <strong>k-means clustering</strong> are used to achieve this task.&nbsp;</p>
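<p>As an illustration, here is a bare-bones k-means pass over a handful of synthetic 2D sensor returns (pure Python with deterministic initialization; real pipelines use optimized libraries and often density-based clustering instead):</p>

```python
import math

def kmeans(points, init_centers, iters=20):
    """Minimal k-means over 2D points with fixed initial centers."""
    centers = list(init_centers)
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point goes to its nearest center.
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(x for x, _ in cl) / len(cl),
                              sum(y for _, y in cl) / len(cl))
    return centers, clusters

# Two blobs of returns: one near (0, 0), one near (10, 10).
pts = [(0.1, 0.2), (0.3, -0.1), (-0.2, 0.0),
       (10.1, 9.8), (9.9, 10.2), (10.0, 10.0)]
centers, clusters = kmeans(pts, init_centers=[pts[0], pts[3]])
```

<p>Each recovered cluster center then stands in for one physical object, which is far easier to track and reason about than the raw point returns.</p>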


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar-and-radar.png?ssl=1" alt="Self driving cars - lidar and radar" class="wp-image-50851" style="width:768px;height:437px"/><figcaption class="wp-element-caption"><em><a href="https://ieeexplore.ieee.org/document/7226315" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="localization">Localization</h3>



<p>Localization algorithms in self-driving cars calculate the position and orientation of the vehicle as it navigates &#8211; a technique known as Visual Odometry (VO).</p>



<p>VO works by matching key points in consecutive video frames. With each frame, the key points are used as the input to a mapping algorithm. The mapping algorithm, such as Simultaneous localization and mapping (SLAM), computes the position and orientation of each object nearby with respect to the previous frame and helps to classify roads, pedestrians, and other objects around.&nbsp;</p>
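<p>The core geometric step of VO can be sketched in isolation: given key points matched between two frames, estimate the rotation and translation relating them. The toy below recovers a known 2D rigid transform with a closed-form least-squares fit (a Kabsch-style solution; real VO works in 3D with outlier rejection):</p>

```python
import math

def estimate_rigid_2d(src, dst):
    """Least-squares rotation + translation mapping src points onto dst."""
    n = len(src)
    cs = (sum(x for x, _ in src) / n, sum(y for _, y in src) / n)
    cd = (sum(x for x, _ in dst) / n, sum(y for _, y in dst) / n)
    # Cross-covariance terms of the centered point sets.
    sxx = sxy = syx = syy = 0.0
    for (x1, y1), (x2, y2) in zip(src, dst):
        ax, ay = x1 - cs[0], y1 - cs[1]
        bx, by = x2 - cd[0], y2 - cd[1]
        sxx += ax * bx; sxy += ax * by; syx += ay * bx; syy += ay * by
    theta = math.atan2(sxy - syx, sxx + syy)
    c, s = math.cos(theta), math.sin(theta)
    tx = cd[0] - (c * cs[0] - s * cs[1])
    ty = cd[1] - (s * cs[0] + c * cs[1])
    return theta, (tx, ty)

# Key points from frame 1, and the same points seen after the camera
# rotated by 0.3 rad and translated by (2, 1).
ang, t = 0.3, (2.0, 1.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
dst = [(math.cos(ang) * x - math.sin(ang) * y + t[0],
        math.sin(ang) * x + math.cos(ang) * y + t[1]) for x, y in src]
theta, trans = estimate_rigid_2d(src, dst)
```

<p>Chaining such frame-to-frame estimates gives the vehicle’s trajectory, which SLAM then refines against the map.</p>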


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization.png?ssl=1" alt="Self driving cars - localization" class="wp-image-50852" style="width:844px;height:443px"/></a><figcaption class="wp-element-caption"><em><a href="https://www.youtube.com/watch?v=jcKnb65wpWA" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Deep learning is generally used to improve the performance of VO, and to classify different objects. Neural networks, such as PoseNet and VLocNet++, are some of the frameworks that use point data to estimate the 3D position and orientation. These estimated 3D positions and orientations can be used to derive scene semantics, as seen in the image below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization-2.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization-2.png?ssl=1" alt="Self driving cars - localization" class="wp-image-50854"/></a><figcaption class="wp-element-caption"><a href="https://blogs.nvidia.com/blog/2019/10/23/drive-labs-panoptic-segmentation/" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="prediction">Prediction</h3>



<p>Understanding human drivers is a very complex task. It involves emotions rather than logic, and these are all fueled by <strong>reactions</strong>. The next action of nearby drivers or pedestrians is highly uncertain, so a system that can predict the actions of other road users is very important for road safety.&nbsp;</p>



<p>The car has a 360-degree view of its environment that enables it to perceive and capture all the information and process it. Once fed into the deep learning algorithm, it can come up with all the possible moves that other road users might make. It’s like a game where the player has a finite number of moves and tries to find the best move to defeat the opponent.&nbsp;</p>



<p>The sensors in self-driving cars enable them to perform tasks like image classification, object detection, segmentation, and localization. With various forms of data representation, the car can make predictions about the objects around it.</p>



<p>A deep learning algorithm can model such information (images and point cloud data from LiDARs and RADARs) during training. During inference, the same model helps the car prepare for all the possible moves, which involve braking, halting, slowing down, changing lanes, and so on.&nbsp;</p>



<p>The role of deep learning is to interpret complex vision tasks, localize the vehicle in its environment, enhance perception, and actuate kinematic maneuvers in self-driving cars. This ensures road safety and a smooth commute.</p>



<p>But the tricky part is to choose the correct action out of a finite number of actions.&nbsp;</p>



<h3 class="wp-block-heading" id="decision-making">Decision-making</h3>



<p>Decision-making is vital in self-driving cars. They need a system that’s dynamic and precise in an uncertain environment. It needs to take into account that not all sensor readings will be true, and that humans can make unpredictable choices while driving. These things can’t be measured directly. Even if we could measure them, we can’t predict them with good accuracy.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-decision-making.png?ssl=1" alt="Self driving cars - decision making" class="wp-image-50857"/><figcaption class="wp-element-caption"><em> A self-driving car moving towards an intersection | <a href="https://ieeexplore.ieee.org/document/7995949" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows a self-driving car moving towards an intersection. Another car, in blue, is also moving towards the intersection. In this scenario, the self-driving car has to predict whether the other car will go straight, left, or right. In each case, the car has to decide what maneuver it should perform to prevent a collision.</p>



<p>In order to make a decision, the car should have enough information so that it can select the necessary set of actions. We learned that the sensors help the car to collect information and deep learning algorithms can be used for localization and prediction.&nbsp;</p>



<p>To recap, localization enables the car to know its initial position, and prediction creates an <em>n </em>number of possible actions or moves based on the environment. The question remains: which option is best out of the many predicted actions?&nbsp;</p>



<p>When it comes to making decisions, we use deep reinforcement learning (DRL). More specifically, a decision-making algorithm called the <strong>Markov decision process</strong> (MDP) lies at the heart of DRL (we’ll learn more about MDP in a later section where we talk about reinforcement learning).&nbsp;</p>



<p>Usually, an MDP is used to predict the future behavior of the road-users. We should keep in mind that the scenario can get very complex if the number of objects, especially moving ones, increases. This eventually increases the number of possible moves for the self-driving car itself.&nbsp;</p>
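<p>To see what an MDP computation looks like, here is value iteration on a deliberately tiny, made-up driving MDP. The states, actions, rewards, and transition probabilities below are invented for illustration; real decision-making systems are vastly larger and learned, not hand-written.</p>

```python
# Toy MDP: P[state][action] is a list of (probability, next_state, reward).
P = {
    "cruise": {
        "keep":   [(1.0, "cruise", 1.0)],
        "change": [(0.8, "clear_lane", 2.0), (0.2, "near_miss", -10.0)],
    },
    "clear_lane": {"keep": [(1.0, "clear_lane", 2.0)]},
    "near_miss":  {"keep": [(1.0, "cruise", 0.0)]},
}

def value_iteration(P, gamma=0.9, eps=1e-9):
    """Classic value iteration: sweep states until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in P[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

V = value_iteration(P)
# Greedy policy: in each state, pick the action with the highest expected value.
policy = {s: max(P[s], key=lambda a: sum(p * (r + 0.9 * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
```

<p>Even with a small risk of a near miss, the long-run reward of the clear lane makes the lane change the optimal action here &#8211; exactly the kind of trade-off the decision layer has to quantify.</p>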



<p>In order to tackle the problem of finding the best move for itself, the deep learning model is optimized with <strong>Bayesian optimization</strong>. There are also situations where the framework, consisting of both a hidden Markov model and Bayesian Optimization, is used for decision-making.&nbsp;</p>



<p>In general, decision-making in self-driving cars is a hierarchical process. This process has four components:</p>



<ul class="wp-block-list">
<li><strong>Path or Route planning</strong>: Essentially, route planning is the first of four decisions that the car must make. Entering the environment, the car should plan the best possible route from its current position to the requested destination. The idea is to find an optimal solution among all the other solutions.&nbsp;&nbsp;</li>



<li><strong>Behaviour Arbitration</strong>: Once the route is planned, the car needs to navigate itself through the route. The car knows about the static elements, like roads, intersections, average road congestion and more, but it can’t know exactly what the other road users are going to be doing throughout the journey. This uncertainty in the behavior of other road users is solved by using probabilistic planning algorithms like MDPs.</li>



<li><strong>Motion Planning</strong>: Once the behavior layer decides how to navigate through a certain route, the motion planning system orchestrates the motion of the car. The motion of the car must be feasible and comfortable for the passenger. Motion planning includes speed of the vehicle, lane-changing, and more, all of which should be relevant to the environment the car is in.&nbsp;&nbsp;</li>



<li><strong>Vehicle Control</strong>: Vehicle control is used to execute the reference path from the motion planning system.&nbsp;</li>
</ul>
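<p>The route-planning layer, in its simplest form, is a shortest-path search over a road graph. A minimal Dijkstra sketch follows; the graph and costs are made up, and production planners work on enormous maps with dynamic, traffic-dependent costs.</p>

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra over a road graph: graph[node] = [(neighbor, travel_cost), ...]."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path by walking predecessors back from the goal.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

roads = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("C", 1.0), ("D", 4.0)],
    "C": [("D", 1.0)],
}
path, cost = shortest_route(roads, "A", "D")
```

<p>The behavior and motion layers then take this coarse route and refine it into concrete, safe maneuvers.</p>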


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-decision-making-2.png?ssl=1" alt="Self driving cars - decision making" class="wp-image-50858"/><figcaption class="wp-element-caption"><a href="https://arxiv.org/pdf/1604.07446.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-cnns-used-for-self-driving-cars">CNNs used for self-driving cars</h2>



<p>Convolutional neural networks (CNN) are used to model spatial information, such as images. CNNs are very good at extracting features from images, and they’re often seen as universal non-linear function approximators.&nbsp;</p>



<p>CNNs can capture different patterns as the depth of the network increases. For example, the layers at the beginning of the network will capture edges, while the deep layers will capture more complex features like the shape of the objects (leaves in trees, or tires on a vehicle). This is the reason why CNNs are the main algorithm in self-driving cars.&nbsp;</p>



<p>The key component of the CNN is the convolutional layer itself. It has a convolutional kernel which is often called the <em>filter matrix</em>. The filter matrix is convolved with a local region of the input image which can be defined as:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="380" height="138" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=380%2C138&#038;ssl=1" alt="CNNs used for self-driving cars" class="wp-image-27170" style="width:291px;height:105px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?w=380&amp;ssl=1 380w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=200%2C73&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=220%2C80&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=120%2C44&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=160%2C58&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=300%2C109&amp;ssl=1 300w" sizes="auto, (max-width: 380px) 100vw, 380px" /></figure>
</div>


<p>Where:&nbsp;</p>



<ul class="wp-block-list">
<li>the operator * represents the convolution operation,</li>



<li>w is the filter matrix and b is the bias,&nbsp;</li>



<li>x is the input,</li>



<li>y is the output.&nbsp;</li>
</ul>



<p>In practice, the filter matrix is usually 3×3 or 5×5. During training, the filter matrix is updated continually until it converges to reasonable weights. One key property of CNNs is weight sharing: the same weight parameters are reused at different positions in the input. Shared parameters save a lot of memory and computation while still allowing the network to learn diverse feature representations.</p>
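<p>The formula above can be written out directly. This minimal valid-mode 2D convolution (strictly speaking, cross-correlation, which is what deep learning frameworks implement) slides a 3×3 edge-detecting filter over a tiny synthetic image:</p>

```python
def conv2d(image, kernel, bias=0.0):
    """Valid-mode 2D convolution: y = w * x + b."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[bias] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for u in range(kh):
                for v in range(kw):
                    out[i][j] += kernel[u][v] * image[i + u][j + v]
    return out

# A vertical-edge detector (Sobel) on a tiny image: dark left half, bright right half.
img = [[0, 0, 1, 1]] * 4
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
edges = conv2d(img, sobel_x)
```

<p>Every output cell responds strongly because every 3×3 window straddles the dark-to-bright boundary &#8211; the same weights (the shared filter) are applied at every position.</p>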



<p>The output of the convolutional layer is usually fed to a nonlinear <strong>activation function</strong>. The activation function enables the network to solve linearly inseparable problems and to represent complex, high-dimensional functions. Commonly used activation functions are Sigmoid, Tanh, and ReLU, which are defined as follows:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="458" height="266" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=458%2C266&#038;ssl=1" alt="CNNs used for self-driving cars" class="wp-image-27172" style="width:382px;height:220px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?w=458&amp;ssl=1 458w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=200%2C116&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=220%2C128&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=120%2C70&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=160%2C93&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=300%2C174&amp;ssl=1 300w" sizes="auto, (max-width: 458px) 100vw, 458px" /></figure>
</div>


<p>It’s worth mentioning that ReLU is the preferred activation function because it converges faster than the alternatives. In addition, the output of the convolutional layer is typically downsampled by a max-pooling layer, which reduces the spatial dimensions while retaining the most salient features of the input image.&nbsp;</p>
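<p>The three activation functions listed above are one-liners:</p>

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1), zero-centered.
    return math.tanh(x)

def relu(x):
    # Passes positive inputs through, zeroes out the rest.
    return max(0.0, x)
```

<p>ReLU’s gradient is a constant 1 for positive inputs, which is why it avoids the vanishing gradients that slow down Sigmoid and Tanh training.</p>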



<p>The three important CNN properties that make them versatile and a primary component of self-driving cars are:</p>



<ul class="wp-block-list">
<li><strong>local receptive fields,&nbsp;</strong></li>



<li><strong>shared weights,&nbsp;</strong></li>



<li><strong>spatial sampling</strong>.&nbsp;</li>
</ul>



<p>These properties reduce overfitting and store representations and features that are vital for image classification, segmentation, localization, and more.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Convolutional-neural-networks-2.png?ssl=1" alt="Convolutional neural networks" class="wp-image-50859"/><figcaption class="wp-element-caption"><a href="https://iopscience.iop.org/article/10.1088/1742-6596/1869/1/012071/pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>Next, we’ll discuss three CNN networks that are used by three companies pioneering self-driving cars:</p>



<ol class="wp-block-list">
<li>HydraNet by Tesla</li>



<li>ChauffeurNet by Google Waymo</li>



<li>Nvidia Self driving car</li>
</ol>



<h3 class="wp-block-heading" id="hydranet">HydraNet &#8211; semantic segmentation for self-driving cars&nbsp;</h3>



<p>HydraNet was introduced by <a href="https://rmullapudi.bitbucket.io/data/hydranet_cvpr_final.pdf?utm_source=Jeremy+Cohen&amp;utm_campaign=15c163eaa1-EMAIL_CAMPAIGN_2020_07_10_08_05&amp;utm_medium=email&amp;utm_term=0_9a0160b0e8-15c163eaa1-" target="_blank" rel="noreferrer noopener nofollow">Ravi et al. in 2018</a>. It was developed for semantic segmentation, for improving computational efficiency during inference time.  </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=1200%2C628&#038;ssl=1" alt="Self driving cars - semantic segmentation" class="wp-image-36651" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em><a 
href="https://rmullapudi.bitbucket.io/data/hydranet_cvpr_final.pdf?utm_source=Jeremy+Cohen&amp;utm_campaign=15c163eaa1-EMAIL_CAMPAIGN_2020_07_10_08_05&amp;utm_medium=email&amp;utm_term=0_9a0160b0e8-15c163eaa1-" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>HydraNet is a dynamic architecture, so it can contain different CNN networks, each assigned to a different task. These blocks or networks are called branches. The idea of HydraNet is to take various inputs and feed each into a task-specific CNN network.&nbsp;</p>



<p>Take the context of self-driving cars. One input dataset can be of static environments like trees and road-railing, another can be of the road and the lanes, another of traffic lights and road, and so on. These inputs are trained in different branches. During the inference time, the <strong>gate</strong> chooses which branches to run, and the combiner aggregates branch outputs and makes a final decision.&nbsp;</p>



<p>In the case of Tesla, they have modified this network slightly because it’s difficult to segregate data for the individual tasks during inference. To overcome that problem, engineers at Tesla developed a shared backbone. The shared backbones are usually modified ResNet-50 blocks.</p>



<p>This HydraNet is trained on all the object’s data. There are task-specific heads that allow the model to predict task-specific outputs. The heads are based on semantic segmentation architecture like the U-Net.&nbsp;</p>
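<p>The shared-backbone idea can be captured structurally in a few lines. Everything below is an illustrative stand-in &#8211; the feature extractor and head rules are invented, not Tesla’s actual code &#8211; but it shows the key point: the backbone runs once, and every task head reads the same shared features.</p>

```python
def backbone(pixels):
    # Stand-in for a ResNet-style feature extractor: here, just two
    # hand-rolled summary statistics over the input.
    return {"mean": sum(pixels) / len(pixels), "peak": max(pixels)}

def lane_head(features):
    # Hypothetical task head: flags a lane marking when brightness peaks.
    return "lane" if features["peak"] > 0.8 else "no_lane"

def light_head(features):
    # Hypothetical task head: classifies overall scene brightness.
    return "bright" if features["mean"] > 0.5 else "dark"

def hydranet_forward(pixels, heads):
    features = backbone(pixels)  # computed once, shared by all heads
    return {name: head(features) for name, head in heads.items()}

out = hydranet_forward([0.1, 0.9, 0.2], {"lane": lane_head, "light": light_head})
```

<p>Because the expensive backbone pass is amortized across all heads, adding a new task costs only a small head, not a whole new network.</p>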


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-hydranet.jpeg?ssl=1" alt="Self driving cars - hydranet" class="wp-image-50862"/><figcaption class="wp-element-caption"><em><a href="https://heartbeat.fritz.ai/computer-vision-at-tesla-cd5e88074376" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The Tesla HydraNet can also project a bird’s-eye view, meaning it can create a 3D view of the environment from any angle, giving the car much more dimensionality to navigate properly. It’s important to know that Tesla doesn’t use LiDAR sensors. It relies on only two sensor types: cameras and radar. Although LiDAR explicitly creates depth perception for the car, Tesla’s HydraNet is so efficient that it can stitch together the visual information from all 8 cameras and create depth perception itself.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-tesla-hydranet.png?ssl=1" alt="Self driving cars - tesla hydranet" class="wp-image-50863"/><figcaption class="wp-element-caption"><em><a href="https://heartbeat.fritz.ai/computer-vision-at-tesla-cd5e88074376" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="chauffeurnet">ChauffeurNet:&nbsp;training self-driving car using imitation learning</h3>



<p><a href="http://roboticsproceedings.org/rss15/p31.pdf" target="_blank" rel="noreferrer noopener">ChauffeurNet</a> is an RNN-based neural network used by Google Waymo. However, a CNN is one of its core components, used to extract features from the perception system.&nbsp;</p>



<p>The CNN in ChauffeurNet is described as a convolutional feature network, or FeatureNet, that extracts contextual feature representation shared by the other networks. These representations are then fed to a recurrent agent network (AgentRNN) that iteratively yields the prediction of successive points in the driving trajectory.</p>



<p>The idea behind this network is to train a self-driving car using imitation learning. In the paper by Bansal et al., “<a href="https://arxiv.org/abs/1812.03079" target="_blank" rel="noreferrer noopener nofollow">ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst</a>”, the authors argue that training a self-driving car even with 30 million examples is not enough. To tackle that limitation, they also trained the car on synthetic data. This synthetic data introduced deviations such as perturbations of the trajectory path, added obstacles, and unnatural scenes. They found that such synthetic data was able to train the car much more efficiently than normal data alone.&nbsp;</p>
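<p>A toy version of such a trajectory perturbation: displace one logged waypoint laterally and fade the offset over its neighbors, producing a synthetic “drift and recover” example. The linear-falloff scheme here is illustrative, not ChauffeurNet’s exact augmentation.</p>

```python
def perturb_trajectory(waypoints, index, offset):
    """Displace one waypoint laterally, fading the offset over its neighbors."""
    out = []
    for i, (x, y) in enumerate(waypoints):
        # Linear falloff: full offset at `index`, zero three points away.
        weight = max(0.0, 1.0 - abs(i - index) / 3.0)
        out.append((x, y + offset * weight))
    return out

# A straight logged trajectory along the x-axis, perturbed at its midpoint.
straight = [(float(i), 0.0) for i in range(7)]
perturbed = perturb_trajectory(straight, index=3, offset=1.5)
```

<p>Training on many such perturbed trajectories, labeled with the recovery back to the lane center, teaches the model to correct deviations it would rarely see in expert demonstrations.</p>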



<p>Usually, self-driving has an end-to-end process as we saw earlier, where the perception system is part of a deep learning algorithm along with planning and controlling. In the case of ChauffeurNet, the perception system is not a part of the end-to-end process; instead, it’s a mid-level system where the network can have different variations of input from the perception system.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-ChauffeurNet.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-ChauffeurNet.png?ssl=1" alt="Self driving cars - ChauffeurNet" class="wp-image-50864" style="width:768px;height:624px"/></a><figcaption class="wp-element-caption"><em><a href="http://roboticsproceedings.org/rss15/p31.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>ChauffeurNet yields a driving trajectory by observing a mid-level representation of the scene from the sensors, using the input along with synthetic data to imitate an expert driver.&nbsp;&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-ChauffeurNet-2.png?ssl=1" alt="Self driving cars - ChauffeurNet" class="wp-image-50866"/><figcaption class="wp-element-caption"><em><a href="http://roboticsproceedings.org/rss15/p31.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><em>In the image above, the cyan path depicts the input route, green box is the self-driving car, blue dots are the agent’s past route or position, and green dots are the predicted future routes or positions.&nbsp;</em></p>



<p>Essentially, a mid-level representation doesn’t directly use raw sensor data as input, factoring out the perception task, so we can combine real and simulated data for easier transfer learning. This way, the network can create a high-level bird’s eye view of the environment which ultimately yields better decisions.&nbsp;</p>



<h3 class="wp-block-heading" id="h-nvidia-self-driving-car-a-minimalist-approach-towards-self-driving-cars">Nvidia self-driving car: a minimalist approach towards self-driving cars</h3>



<p>Nvidia also uses a Convolutional Neural Network as the primary algorithm for its self-driving car. But unlike Tesla, it uses three front-facing cameras: one on the left, one in the center, and one on the right. See the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=1200%2C628&#038;ssl=1" alt="Convolutional neural networks NVIDIA" class="wp-image-36653" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><a 
href="https://rmullapudi.bitbucket.io/data/hydranet_cvpr_final.pdf?utm_source=Jeremy+Cohen&amp;utm_campaign=15c163eaa1-EMAIL_CAMPAIGN_2020_07_10_08_05&amp;utm_medium=email&amp;utm_term=0_9a0160b0e8-15c163eaa1-" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>The network is capable of operating on roads that don&#8217;t have lane markings, including parking lots. It also learns the features and representations necessary for detecting useful road features.&nbsp;</p>



<p>Compared to an explicit decomposition of the problem into lane-marking detection, path planning, and control, this end-to-end system optimizes all processing steps simultaneously.&nbsp;</p>



<p>Better performance is the result of internal components self-optimizing to maximize overall system performance, instead of optimizing human-selected intermediate criteria like lane detection. Such criteria understandably are selected for ease of human interpretation, which doesn’t automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with a minimal number of processing steps.</p>
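<p>To make the end-to-end idea concrete, here is a minimal PyTorch sketch of a network in the spirit of Nvidia&#8217;s published PilotNet architecture. The layer sizes follow the paper (a 66&#215;200 YUV frame in, a single steering value out), but the class name and everything else here is an illustration, not Nvidia&#8217;s actual code:</p>

```python
import torch
import torch.nn as nn

class PilotNetSketch(nn.Module):
    """End-to-end steering sketch: 66x200 YUV camera frame -> steering angle."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),   # -> 24 x 31 x 98
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),  # -> 36 x 14 x 47
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),  # -> 48 x 5 x 22
            nn.Conv2d(48, 64, 3), nn.ReLU(),            # -> 64 x 3 x 20
            nn.Conv2d(64, 64, 3), nn.ReLU(),            # -> 64 x 1 x 18
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),  # single output: the steering command
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = PilotNetSketch()
frame = torch.randn(1, 3, 66, 200)  # one dummy camera frame
steering = model(frame)             # tensor of shape (1, 1)
```

<p>Because every layer is differentiable, the loss on the steering output back-propagates through the whole stack, which is exactly the &#8220;internal components self-optimizing&#8221; behavior described above.</p>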



<h2 class="wp-block-heading" id="h-reinforcement-learning-used-for-self-driving-cars">Reinforcement learning used for self-driving cars</h2>



<p><a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noreferrer noopener">Reinforcement learning</a> (RL) is a type of machine learning where an agent learns by exploring and interacting with the environment. In this case, the self-driving car is an <strong>agent</strong>.&nbsp;</p>



<section id="blog-intext-cta-block_fda2a884f66aeb554255458742312277" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-explore-more-applications-of-rl">Explore more applications of RL</h3>
    
            <p><a href="/blog/reinforcement-learning-applications" target="_blank" rel="noopener">10 Real-Life Applications of Reinforcement Learning</a></p>
<p><a href="https://neptune.ai/blog/7-applications-of-reinforcement-learning-in-finance-and-trading" target="_blank" rel="noopener">7 Applications of Reinforcement Learning in Finance and Trading</a></p>
    
    </section>



<p>We discussed earlier how the neural network predicts a number of actions from the perception data. But choosing an appropriate action requires deep reinforcement learning (DRL). At the core of DRL, we have three important variables:</p>



<ol class="wp-block-list">
<li><strong>State</strong> describes the current situation at a given time; in this case, the car&#8217;s position on the road.&nbsp;</li>



<li><strong>Action</strong> describes all the possible moves that the car can make.&nbsp;</li>



<li><strong>Reward</strong> is feedback that the car receives whenever it takes a certain action.&nbsp;</li>
</ol>



<p>Generally, the agent is not told what to do or what actions to take. As we have seen, in supervised learning the algorithm maps inputs to outputs. In DRL, the algorithm learns by exploring the environment, and each interaction yields a reward, which can be positive or negative. The goal of DRL is to maximize the cumulative reward.&nbsp;</p>
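<p>The state-action-reward loop can be sketched with a toy environment. Everything below (the ToyRoad class, the reward values) is a made-up illustration, not part of any real driving stack:</p>

```python
import random

class ToyRoad:
    """A toy 1-D road: the agent starts at cell 0 and must reach cell 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        """action is +1 (forward) or -1 (backward); the road starts at cell 0."""
        self.position = max(self.position + action, 0)
        done = self.position >= 5
        reward = 10 if done else -1      # every extra step is penalized
        return self.position, reward, done

random.seed(0)                           # make the rollout reproducible
env = ToyRoad()
total_reward, done = 0, False
while not done:
    action = random.choice([-1, 1])      # an untrained, random policy
    state, reward, done = env.step(action)
    total_reward += reward               # the quantity DRL maximizes
```

<p>A trained policy would reach the goal in five steps (total reward 6); the random policy above wastes steps and collects a lower cumulative reward, which is exactly the signal learning exploits.</p>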



<p>In self-driving cars, the same procedure is followed: the network is trained on perception data, from which it learns what decision to make. Because CNNs are very good at extracting feature representations from the input, DRL algorithms can be trained on those representations. This works well because the extracted representations transform high-dimensional manifolds into simpler, lower-dimensional ones, and training on lower-dimensional representations yields the efficiency required at inference time.&nbsp;</p>



<p>One key point to remember is that self-driving cars can&#8217;t be trained on real-world roads, because that would be extremely dangerous. Instead, self-driving cars are trained in a <strong>simulator</strong>, where there&#8217;s no risk at all.&nbsp;</p>



<p>Some open-source simulators are:</p>



<ol class="wp-block-list">
<li><a href="https://carla.org" target="_blank" rel="noreferrer noopener nofollow">CARLA</a></li>



<li><a href="https://github.com/AdaCompNUS/summit" target="_blank" rel="noreferrer noopener nofollow">SUMMIT</a>​​</li>



<li><a href="https://microsoft.github.io/AirSim/" target="_blank" rel="noreferrer noopener nofollow">AirSim</a></li>



<li><a href="https://deepdrive.io/index.html" target="_blank" rel="noreferrer noopener nofollow">DeepDrive</a></li>



<li><a href="https://flow-project.github.io/" target="_blank" rel="noreferrer noopener nofollow">Flow</a></li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-car-simulator-deepdrive.jpeg?ssl=1" alt="Self driving car simulator - deepdrive" class="wp-image-50868"/><figcaption class="wp-element-caption"><em>A snapshot from&nbsp;Voyage Deepdrive | <a href="https://news.voyage.auto/introducing-voyage-deepdrive-69b3cf0f0be6" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>



<p>These cars (agents) are trained for thousands of epochs with highly difficult simulations before they’re deployed in the real world.&nbsp;</p>



<p>During training, the agent (the car) learns by taking an action in a given state. Based on this <strong>state-action</strong> pair, it receives a <strong>reward</strong>. This process repeats over and over, and with each step the agent updates its mapping from states to actions. This mapping is called the <strong>policy</strong>.&nbsp;</p>



<p><strong>The policy describes how the agent makes decisions.</strong> It&#8217;s a decision-making rule that defines the behaviour of the agent at a given time.&nbsp;</p>



<p>Whenever the agent makes a bad decision, the policy is updated. To avoid negative rewards, the agent evaluates how good it is to be in a given state, which is measured by the <strong>state-value function. </strong>The state-value can be computed using the <strong>Bellman Expectation Equation.</strong></p>



<p>The Bellman expectation equation, along with Markov Decision Process (MDP), makes up the two core concepts of DRL. But when it comes to self-driving cars, we have to keep in mind that<strong> the observations from the perception data should be mapped with the appropriate action </strong>and not just map the underlying state to the action. This is where a partially observed decision process or a <strong>Partially Observable Markov Decision Process (POMDP)</strong> is required, which can make decisions based on the observation.&nbsp;</p>



<h2 class="wp-block-heading" id="h-partially-observable-markov-decision-process-used-for-self-driving-cars">Partially Observable Markov Decision Process used for self-driving cars</h2>



<p>The <a href="https://neptune.ai/blog/markov-decision-process-in-reinforcement-learning">Markov Decision Process</a> gives us a way to sequentialize decision-making. The agent interacts with the environment sequentially over time. Each time the agent interacts with the environment, the environment provides some representation of its state. Given that representation, the agent selects an action to take, as in the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=1200%2C628&#038;ssl=1" alt="The Markov Decision Process " class="wp-image-36656" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>The action taken is transitioned into some new state and the agent is given a reward. This process of evaluating a state, taking action, changing states, and rewarding is repeated. Throughout the process, it’s the agent’s goal to maximize the total amount of rewards.&nbsp;</p>



<p>Let’s get a more constructive idea of the whole process:</p>



<ol class="wp-block-list">
<li>At a given time t, the environment is in state S<sub>t</sub></li>



<li>The agent observes the current state S<sub>t</sub> and selects an action A<sub>t</sub></li>



<li>The environment then transitions into a new state S<sub>t+1</sub>, and the agent simultaneously receives a reward R<sub>t</sub></li>
</ol>



<p>In a <strong>partially observable Markov decision process</strong> (POMDP), the agent senses the environment state with observations received from the perception data and takes a certain action followed by receiving a reward.&nbsp;</p>



<p>The POMDP has six components and it can be denoted as POMDP <em>M:= (I, S, A, R, P, </em>γ), where,&nbsp;</p>



<ul class="wp-block-list">
<li>I: Observations&nbsp;</li>



<li>S: Finite set of states</li>



<li>A: Finite set of actions</li>



<li>R: Reward function</li>



<li>P: Transition probability function</li>



<li>γ: Discount factor for future rewards&nbsp;</li>
</ul>
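<p>Assuming only the definition above, the six-component tuple M := (I, S, A, R, P, γ) can be written down directly in code. The lane-keeping states, actions, and probabilities below are purely illustrative:</p>

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class POMDP:
    observations: FrozenSet[str]                  # I: what the sensors report
    states: FrozenSet[str]                        # S: finite set of true states
    actions: FrozenSet[str]                       # A: finite set of actions
    reward: Callable[[str, str], float]           # R(s, a)
    transition: Callable[[str, str, str], float]  # P(s' | s, a)
    gamma: float                                  # γ: discount factor

# A toy lane-keeping POMDP: the true state is the lane position, but the
# agent only sees a noisy "left/center/right" observation from perception.
toy = POMDP(
    observations=frozenset({"left", "center", "right"}),
    states=frozenset({"in_lane", "drifting"}),
    actions=frozenset({"steer_left", "steer_right", "keep"}),
    reward=lambda s, a: 1.0 if s == "in_lane" else -1.0,
    transition=lambda s2, s, a: 0.9 if s2 == "in_lane" else 0.1,
    gamma=0.95,
)
```

<p>The key difference from a plain MDP is that the agent&#8217;s policy conditions on elements of <code>observations</code> rather than on the true state, which is why the observation set I appears in the tuple at all.</p>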



<p>The objective of DRL is to find a policy that maximizes the cumulative reward over time or, in other words, to find an optimal action-value function (Q-function).&nbsp;</p>



<h3 class="wp-block-heading" id="qlearning">Q-learning used for self-driving cars</h3>



<p><a href="https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56" target="_blank" rel="noreferrer noopener nofollow">Q-learning</a> is one of the most commonly used DRL algorithms for self-driving cars. It falls under the category of <strong>model-free learning</strong>: the agent approximates the optimal state-action values directly, without learning a model of the environment. The <strong>policy</strong> still determines which action-value pairs, or Q-values, are visited and updated (see the equation below). The goal is to find an optimal policy by interacting with the environment and correcting the policy whenever the agent makes an error.&nbsp;</p>



<p>With enough samples or observation data, Q-learning will learn the optimal state-action values. In practice, Q-learning has been shown to converge to the optimal state-action values for an MDP with probability 1, provided that every action is tried in every state infinitely often.&nbsp;</p>



<p>Q-learning can be described in the following equation:&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1074" height="72" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=1074%2C72&#038;ssl=1" alt="Q-learning used for self-driving cars" class="wp-image-27171" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?w=1074&amp;ssl=1 1074w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=768%2C51&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=200%2C13&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=220%2C15&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=120%2C8&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=160%2C11&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=300%2C20&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=480%2C32&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=1020%2C68&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>where:</p>



<p>α ∈ [0,1] is the learning rate. It controls the degree to which Q-values are updated at a given time step t.</p>
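<p>The update rule, Q(s,a) ← Q(s,a) + α[r + γ·max<sub>a&#8217;</sub> Q(s&#8217;,a&#8217;) &#8211; Q(s,a)], takes only a few lines to implement for a tabular toy problem. The 5-cell road below is a hypothetical illustration, not a driving simulator:</p>

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPISODES = 0.5, 0.9, 500
GOAL = 4                               # rightmost cell of a 5-cell road
ACTIONS = [-1, 1]                      # move left / move right

Q = defaultdict(float)                 # Q[(state, action)], initialized to 0

random.seed(0)
for _ in range(EPISODES):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)     # explore with a random policy
        s2 = min(max(s + a, 0), GOAL)  # clamp to the road
        r = 10.0 if s2 == GOAL else -1.0
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# After training, the greedy policy should move right in every state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

<p>Even though the exploration policy here is purely random, the learned Q-table converges toward the optimal values because the update always bootstraps from the greedy maximum over the next state&#8217;s actions.</p>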


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-q-learning.png?ssl=1" alt="Self driving cars - q learning" class="wp-image-50870" style="width:842px;height:449px"/><figcaption class="wp-element-caption"><em><a href="https://www.mdpi.com/2079-9292/8/5/543/htm" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>It’s important to remember that the agent will discover the good and bad actions through trial and error.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Self-driving cars aim to revolutionize car travel by making it safe and efficient. In this article, we outlined some of the key components such as LiDAR, RADAR, cameras, and most importantly &#8211; the algorithms that make self-driving cars possible.&nbsp;</p>



<p>While it&#8217;s promising, there&#8217;s still a lot of room for improvement. For example, current self-driving cars are at Level 2 of the 5 levels of driving automation, which means that a human still has to be ready to intervene if necessary.&nbsp;</p>



<p>A few things still need to be addressed:</p>



<ol class="wp-block-list">
<li>The algorithms used are not yet robust enough to perceive roads and lanes, especially since some roads lack markings and other signs.</li>



<li>Sensing modalities for localization, mapping, and perception still lack accuracy and efficiency.</li>



<li>Vehicle-to-vehicle communication is still a dream, but work is being done in this area as well.&nbsp;&nbsp;</li>



<li>The field of human-machine interaction is not explored enough, with many open, unsolved problems.</li>
</ol>



<p>Still, the technology we’ve developed so far is amazing. And with orchestrated efforts, we can ensure that self-driving systems will be safe, robust, and revolutionary.</p>



<h3 class="wp-block-heading" id="h-further-reading">Further reading:</h3>



<ol class="wp-block-list">
<li><a href="https://arxiv.org/pdf/1910.07738.pdf" target="_blank" rel="noreferrer noopener nofollow">A Survey of Deep Learning Techniques for Autonomous Driving</a></li>



<li><a href="https://arxiv.org/pdf/1906.05113.pdf" target="_blank" rel="noreferrer noopener nofollow">A Survey of Autonomous Driving: Common Practices and Emerging Technologies</a></li>



<li><a href="https://ieeexplore.ieee.org/document/7995949" target="_blank" rel="noreferrer noopener nofollow">Decision Making for Autonomous Driving considering Interaction and Uncertain Prediction of Surrounding Vehicles</a></li>



<li><a href="https://iopscience.iop.org/article/10.1088/1742-6596/1869/1/012071/pdf" target="_blank" rel="noreferrer noopener nofollow">Autonomous car using CNN deep learning algorithm</a></li>



<li><a href="https://arxiv.org/pdf/2002.00444.pdf" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning for Autonomous Driving: A Survey</a></li>



<li><a href="https://www.wired.com/story/guide-self-driving-cars/" target="_blank" rel="noreferrer noopener nofollow">The WIRED Guide to Self-Driving Cars</a></li>



<li><a href="https://towardsdatascience.com/deep-learning-for-self-driving-cars-7f198ef4cfa2" target="_blank" rel="noreferrer noopener nofollow">Deep Learning for Self-Driving Cars</a></li>



<li><a href="https://towardsdatascience.com/reinforcement-learning-towards-general-ai-1bd68256c72d" target="_blank" rel="noreferrer noopener nofollow">Training Self Driving Cars using Reinforcement Learning</a></li>



<li><a href="https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html" target="_blank" rel="noreferrer noopener nofollow">A (Long) Peek into Reinforcement Learning</a></li>



<li><a href="https://mycreditsummit.com/tesla-statistics" target="_blank" rel="noreferrer noopener">Tesla Statistics: What You Should Know About Safety, Pricing and More</a></li>
</ol>
]]></content:encoded>
					
		
		<enclosure url="https://neptune.ai/wp-content/uploads/2022/11/Self-driving-car-LiDAR.mp4" length="390854" type="video/mp4" />

		<post-id xmlns="com-wordpress:feed-additions:1">5694</post-id>	</item>
	</channel>
</rss>
