<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Nilesh Barla, Author at neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/author/nilesh-barla/feed" rel="self" type="application/rss+xml" />
	<link></link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Wed, 14 May 2025 09:21:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Nilesh Barla, Author at neptune.ai</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>How to Visualize Deep Learning Models</title>
		<link>https://neptune.ai/blog/deep-learning-visualization</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Tue, 14 Nov 2023 15:30:20 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=33030</guid>

					<description><![CDATA[Deep learning models are typically highly complex. While many traditional machine learning models make do with just a couple hundred parameters, deep learning models have millions or billions of parameters. The large language model GPT-4 that OpenAI released in the spring of 2023 is rumored to have nearly 2 trillion parameters. It goes&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Deep learning models are typically highly complex. While many traditional machine learning models make do with just a couple hundred parameters, deep learning models have millions or billions of parameters. The large language model GPT-4 that OpenAI released in the spring of 2023 is rumored to have nearly 2 trillion parameters. It goes without saying that the interplay between all these parameters is way too complicated for humans to understand.</p>



<p>This is where visualizations in ML come in. Graphical representations of structures and data flow within a deep learning model make its complexity easier to comprehend and enable insight into its decision-making process. With the proper visualization method and a systematic approach, many seemingly mysterious training issues and underperformance of deep learning models can be traced back to root causes.<br><br>In this article, we’ll explore a wide range of deep learning visualizations and discuss their applicability. Along the way, I’ll share many practical examples and point to libraries and in-depth tutorials for individual methods.</p>



<section id="note-block_a648100db1c602947e8b994fc252f080"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><strong>Note:</strong> I’ve prepared a <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=Lh_bwP2vij2l" target="_blank" rel="noreferrer noopener nofollow">Colab Notebook with examples</a> of many of the techniques discussed in this article.</p>
                                    </div>

            </div>
            </div>


</section>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=1800%2C942&#038;ssl=1" alt="Deep learning model visualization helps us understand model behavior and differences between models, diagnose training processes and performance issues, and aid the refinement and optimizations of models" class="wp-image-33352" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/Visualizing-deep-learning-models.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Deep learning model visualization helps us understand model behavior and differences between models, diagnose training processes and performance issues, and 
aid the refinement and optimizations of models </strong>| <a href="https://www.sciencedirect.com/science/article/pii/S2468502X17300086" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-why-do-we-want-to-visualize-deep-learning-models">Why do we want to visualize deep learning models?</h2>



<p>Visualizing deep learning models can help us with several different objectives:</p>



<ul class="wp-block-list">
<li><strong>Interpretability and explainability:</strong> The performance of deep learning models is, at times, staggering, even for seasoned data scientists and ML engineers. Visualizations provide ways to dive into a model’s structure and uncover why it succeeds in learning the relationships encoded in the training data.<br></li>



<li><strong>Debugging model training:</strong> It’s fair to assume that everyone training deep learning models has encountered a situation where a model doesn’t learn or struggles with a particular set of samples. The reasons for this range from wrongly connected model components to misconfigured optimizers. Visualizations are great for monitoring training runs and diagnosing issues.<br></li>



<li><strong>Model optimization</strong>: Models with fewer parameters are generally faster to compute and more resource-efficient while being more robust and generalizing better to unseen samples. Visualizations can uncover which parts of a model are essential – and which layers might be omitted without compromising the model’s performance.<br></li>



<li><strong>Understanding and teaching concepts:</strong> Deep learning is mostly based on fairly simple activation functions and mathematical operations like matrix multiplication. Many high school students will know all the maths required to understand a deep learning model’s internal calculations step-by-step. But it’s far from obvious how this gives rise to models that can seemingly “understand” images or translate fluently between multiple languages. It’s not a secret among educators that good visualizations are key for students to master complex and abstract concepts such as deep learning. Interactive visualizations, in particular, have proven helpful for those new to the field.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" width="685" height="588" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=685%2C588&#038;ssl=1" alt="Example of a deep learning visualization: small convolutional neural network CNN" class="wp-image-33033" style="width:607px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?w=685&amp;ssl=1 685w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=200%2C172&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=220%2C189&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=120%2C103&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=160%2C137&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=300%2C258&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-2.png?resize=480%2C412&amp;ssl=1 480w" sizes="(max-width: 685px) 100vw, 685px" /><figcaption class="wp-element-caption"><strong>Example of a deep learning visualization: a small convolutional neural network (CNN). Notice how the thickness of the colorful lines indicates the weight of the neural pathways</strong> | <a href="https://www.nature.com/articles/srep27755" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-how-is-deep-learning-visualization-different-from-traditional-ml-visualization">How is deep learning visualization different from traditional ML visualization?</h2>



<p>At this point, you might wonder how visualizing deep learning models differs from <a href="/blog/visualization-in-machine-learning" target="_blank" rel="noreferrer noopener">visualizations of traditional machine learning models</a>. After all, aren’t deep learning models closely related to their predecessors?</p>



<p>Deep learning models are characterized by a large number of parameters and a layered structure. Many identical neurons are organized into layers stacked on top of each other. Each neuron is described through a small number of weights and an activation function. While the activation function is typically chosen by the model’s creator (and is thus a so-called hyperparameter), the weights are learned during training.<br><br>This fairly simple structure gives rise to unprecedented performance on virtually every machine learning task known today. From our human perspective, the price we pay is that deep learning models are much larger than traditional ML models.</p>



<p>It’s also much more difficult to see how the intricate network of neurons processes the input data than to comprehend, say, a decision tree. Thus, the main focus of deep learning visualizations is to uncover the data flow within a model and to provide insights into what the structurally identical layers learn to focus on during training.</p>



<p>That said, many of the <a href="/blog/visualization-in-machine-learning" target="_blank" rel="noreferrer noopener">machine learning visualization techniques</a> I covered in my last blog post apply to deep learning models as well. For example, confusion matrices and ROC curves are helpful when working with deep learning classifiers, just as they are for more traditional classification models.</p>



<h2 class="wp-block-heading" id="h-who-should-use-deep-learning-visualization">Who should use deep learning visualization?</h2>



<p>The short answer to that question is: Everyone who works with deep learning models!<br><br>In particular, the following groups come to mind:<br></p>



<ul class="wp-block-list">
<li><strong>Deep learning researchers:</strong> Many visualization techniques are first developed by academic researchers looking to improve existing deep learning algorithms or to understand why a particular model exhibits a certain characteristic.<br></li>



<li><strong>Data scientists and ML engineers:</strong> Creating and training deep learning models is no easy feat. Whether a model underperforms, struggles to learn, or generates suspiciously good outcomes – visualizations help us to identify the root cause. Thus, mastering different visualization approaches is an invaluable addition to any deep learning practitioner’s toolbox.&nbsp;</li>



<li><strong>Downstream consumers of deep learning models:</strong> Visualizations prove valuable to individuals with technical backgrounds who consume deep learning models via APIs or integrate deep learning-based components into software applications. For instance, <a href="https://mlatgt.blog/2018/02/16/visualizing-deep-learning-models-at-facebook/" target="_blank" rel="noreferrer noopener nofollow">Facebook&#8217;s ActiVis</a> is a visual analytics system tailored to in-house engineers, facilitating the exploration of deployed neural networks.</li>



<li><strong>Educators and students: </strong>Those encountering deep neural networks for the first time – and the people teaching them – often struggle to understand how the model code they write translates into a computational graph that can process complex input data like images or speech. Visualizations make it easier to understand how everything comes together and what a model learned during training.</li>
</ul>



<h2 class="wp-block-heading" id="h-types-of-deep-learning-visualization">Types of deep learning visualization</h2>



<p>There are many different approaches to deep learning model visualization. Which one is right for you depends on your goal. For instance, deep learning researchers often delve into intricate architectural blueprints to uncover the contributions of different model parts to its performance. ML engineers are often more interested in plots of evaluation metrics during training, as their goal is to ship the best-performing model as quickly as possible.</p>



<p>In this article, we’ll discuss the following approaches:</p>



<ul class="wp-block-list">
<li><strong>Deep learning model architecture visualization:</strong> Graph-like representation of a neural network with nodes representing layers and edges representing connections between neurons.<br></li>



<li><strong>Activation heatmap:</strong> Layer-wise visualization of activations in a deep neural network that provides insights into what input elements a model is sensitive to.<br></li>



<li><strong>Feature visualization:</strong> Heatmaps that visualize what features or patterns a deep learning model can detect in its input.<br></li>



<li><strong>Deep feature factorization:</strong> Advanced method to uncover high-level concepts a deep learning model learned during training.<br></li>



<li><strong>Training dynamics plots:</strong> Visualization of model performance metrics across training epochs.<br></li>



<li><strong>Gradient plots:</strong> Representation of the loss function gradients at different layers within a deep learning model. Data scientists often use these plots to detect exploding or vanishing gradients during model training.<br></li>



<li><strong>Loss landscape:</strong> Three-dimensional representation of the loss function’s value across a two-dimensional slice of a deep learning model’s parameter space.</li>



<li><strong>Visualizing attention:</strong> Heatmap and graph-like visual representations of a transformer model’s attention weights that can be used, e.g., to verify whether a model focuses on the correct parts of the input data.</li>



<li><strong>Visualizing embeddings:</strong> Graphical representation of embeddings, an essential building block for many NLP and computer vision applications, in a low-dimensional space to unveil their relationships and semantic similarity.</li>
</ul>



<h3 class="wp-block-heading" id="h-deep-learning-model-architecture-visualization">Deep learning model architecture visualization</h3>



<p>Visualizing the architecture of a deep learning model – its neurons, layers, and connections between them – can serve many purposes:</p>



<ol class="wp-block-list">
<li>It exposes the flow of data from the input to the output, including the shapes it takes when it’s passed between layers.</li>



<li>It gives a clear idea of the number of parameters in the model.</li>



<li>You can see which components repeat throughout the model and how they’re linked.</li>
</ol>
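

<p>As a complement to such diagrams, the total parameter count is also easy to compute directly. Here is a minimal PyTorch sketch – the two-layer model is purely illustrative:</p>

```python
import torch.nn as nn

# a small illustrative model; any nn.Module works the same way
model = nn.Sequential(
    nn.Linear(784, 128),  # 784*128 weights + 128 biases = 100,480 parameters
    nn.ReLU(),
    nn.Linear(128, 10),   # 128*10 weights + 10 biases = 1,290 parameters
)

# total number of trainable parameters
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 101770
```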



<p>There are different ways to visualize a deep learning model’s architecture:<br></p>



<ol class="wp-block-list">
<li><strong>Model diagrams </strong>expose the model’s building blocks and their interconnection.</li>



<li><strong>Flowcharts</strong> aim to provide insights into data flows and model dynamics.</li>



<li><strong>Layer-wise representations</strong> of deep learning models tend to be significantly more complex and expose activations and intra-layer structures.</li>
</ol>



<p>These visualizations do more than satisfy curiosity. They empower deep learning practitioners to fine-tune models, diagnose issues, and build upon this knowledge to create even more powerful algorithms.</p>



<p>You’ll be able to find model architecture visualization utilities for all of the big deep learning frameworks. Sometimes, they are provided as part of the main package, while in other cases, separate libraries are provided by the framework’s maintainers or community members.</p>



<h4 class="wp-block-heading">How do you visualize a PyTorch model’s architecture?</h4>



<p>If you are using PyTorch, you can use <a href="https://github.com/szagoruyko/pytorchviz" target="_blank" rel="noreferrer noopener nofollow">PyTorchViz</a> to create model architecture visualizations. This library visualizes a model’s individual components and highlights the data flow between them.</p>



<p>Here’s the basic code:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchviz <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_dot

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># instantiate the model once so the graph and the parameter names match</span>
model = MyPyTorchModel()

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create some sample input data</span>
x = torch.randn(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">256</span>, <span class="hljs-number" style="color: teal;">256</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate predictions for the sample data</span>
y = model(x)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate a model architecture visualization</span>
make_dot(y.mean(),
         params=dict(model.named_parameters()),
         show_attrs=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>,
         show_saved=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>).render(<span class="hljs-string" style="color: rgb(221, 17, 68);">"MyPyTorchModel_torchviz"</span>, format=<span class="hljs-string" style="color: rgb(221, 17, 68);">"png"</span>)
</pre></code></pre>
</div>




<p>The Colab notebook accompanying this article contains a complete <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=6BsxxFR0UCnN" target="_blank" rel="noreferrer noopener nofollow">PyTorch model architecture visualization example</a>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" width="1120" height="1680" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=1120%2C1680&#038;ssl=1" alt="Architecture visualization of a PyTorch-based CNN created with PyTorchViz" class="wp-image-40228" style="width:660px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?w=1120&amp;ssl=1 1120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=768%2C1152&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=133%2C200&amp;ssl=1 133w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=1024%2C1536&amp;ssl=1 1024w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=220%2C330&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=120%2C180&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=160%2C240&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=300%2C450&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=480%2C720&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/deep-learning-visualization.png?resize=1020%2C1530&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Architecture visualization of a PyTorch-based CNN created with PyTorchViz</strong> | Source: Author</figcaption></figure>
</div>


<p>PyTorchViz uses four colors in the model architecture graph:<br></p>



<ol class="wp-block-list">
<li><strong>Blue </strong>nodes represent tensors or variables in the computation graph. These are the data elements that flow through the operations.</li>



<li><strong>Gray </strong>nodes represent PyTorch functions or operations performed on tensors.</li>



<li><strong>Green </strong>nodes represent gradients or derivatives of tensors. They showcase the backpropagation flow of gradients through the computation graph.</li>



<li><strong>Orange </strong>nodes represent the final loss or objective function optimized during training.</li>
</ol>



<h4 class="wp-block-heading">How do you visualize a Keras model’s architecture?</h4>



<p>To visualize the architecture of a Keras deep learning model, you can use the <a href="https://keras.io/api/utils/model_plotting_utils/#plotmodel-function" target="_blank" rel="noreferrer noopener nofollow"><em>plot_model</em></a> utility function that is provided as part of the library:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> tensorflow.keras.utils <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> plot_model

plot_model(my_keras_model,
           to_file=<span class="hljs-string" style="color: rgb(221, 17, 68);">'keras_model_plot.png'</span>,
           show_shapes=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>,
           show_layer_names=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre></code></pre>
</div>




<p>I’ve prepared a complete <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=mL6PLUHAiZbn" target="_blank" rel="noreferrer noopener nofollow">example for Keras architecture visualization</a> in the Colab notebook for this article.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="1800" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=1800%2C1800&#038;ssl=1" alt="Model architecture diagram of a Keras-based neural network" class="wp-image-33483" style="width:634px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=1536%2C1536&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/deep-learning-visualization-3.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, 
(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Model architecture diagram of a Keras-based neural network</strong> | Source: Author</figcaption></figure>
</div>


<p>The output generated by the <em>plot_model</em> function is quite simple to understand: Each box represents a model layer and shows its name, type, and input and output shapes. The arrows indicate the flow of data between layers.</p>



<p>By the way, Keras also provides a <a href="https://keras.io/api/utils/model_plotting_utils/#modeltodot-function" target="_blank" rel="noreferrer noopener nofollow">model_to_dot</a> function to create graphs similar to the one produced by PyTorchViz above.</p>



<h3 class="wp-block-heading" id="h-activation-heatmaps">Activation heatmaps</h3>



<p>Activation heatmaps are visual representations of the inner workings of deep neural networks. They show which neurons are activated layer-by-layer, allowing us to see how the activations flow through the model.</p>



<p>An activation heatmap can be generated for just a single input sample or a whole collection. In the latter case, we’ll typically choose to depict the average, median, minimum, or maximum activation. This allows us, for example, to identify regions of the network that rarely contribute to the model’s output and might be pruned without affecting its performance.</p>



<p>Let’s take a computer vision model as an example. To generate an activation heatmap, we’ll feed a sample image into the model and record the output value of each activation function in the deep neural network. Then, we can create a heatmap visualization for a layer in the model by coloring its neurons according to the activation function’s output. Alternatively, we can color the input sample’s pixels based on the activation they cause in the inner layer. This tells us which parts of the input the particular layer responds to.</p>
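

<p>The recording step described above can be sketched with PyTorch forward hooks. This is a minimal illustration on a made-up two-layer convolutional model, not a complete heatmap pipeline:</p>

```python
import torch
import torch.nn as nn

# a small illustrative model
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
)

# record each ReLU's output via forward hooks
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(save_activation(name))

# run a sample image through the model to populate the activation store
x = torch.randn(1, 3, 64, 64)
model(x)

# average over channels to get a per-pixel map ready for heatmap plotting
heatmap = activations["3"].mean(dim=1)  # shape: (1, 64, 64)
```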



<p>For typical deep learning models with many layers and millions of neurons, this simple approach will produce very complicated and noisy visualizations. Hence, deep learning researchers and data scientists have come up with plenty of different methods to simplify activation heatmaps.</p>



<p>But the goal remains the same: We want to uncover which parts of our model contribute to the output and in what way.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="795" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1920%2C795&#038;ssl=1" alt="Generation of activation heatmaps for a CNN analyzing MRI data" class="wp-image-33040" style="width:804px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1920%2C795&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=768%2C318&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=200%2C83&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1536%2C636&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=220%2C91&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=120%2C50&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=160%2C66&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=300%2C124&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=480%2C199&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?resize=1020%2C422&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-5.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Generation of activation heatmaps for a CNN 
analyzing MRI data</strong> | <a href="https://www.semanticscholar.org/paper/Geometric-Deep-Learning-and-Heatmap-Prediction-for-Ha-Hansen/5e81fa02b1a5bf4784ec9af3dc96871f76fd33b3" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>For instance, in the example above, activation heatmaps highlight the regions of an MRI scan that contributed most to the CNN’s output.</p>



<p>Providing such visualizations along with the model output aids healthcare professionals in making informed decisions. Here’s how:</p>



<ol class="wp-block-list">
<li><strong>Lesion detection and abnormality identification</strong>: The heatmaps highlight the crucial areas in the image, aiding in the identification of lesions and abnormalities.<br></li>



<li><strong>Severity assessment of abnormalities:</strong> The intensity of the heatmap directly correlates with the severity of lesions or abnormalities. A larger and brighter area on the heatmap indicates a more severe condition, enabling a quick assessment of the issue.<br></li>



<li><strong>Identifying model mistakes: </strong>If the model’s activation is high for areas of the MRI scan that are not medically significant (e.g., the skull cap or even parts outside of the brain), this is a telltale sign of a mistake. Even without deep learning expertise, medical professionals will immediately see that this particular model output cannot be trusted.</li>
</ol>



<h4 class="wp-block-heading">How do you create a visualization heatmap for a PyTorch model?</h4>



<p>The <a href="https://frgfm.github.io/torch-cam/" target="_blank" rel="noreferrer noopener nofollow">TorchCam</a> library provides several methods to generate activation heatmaps for PyTorch models.&nbsp;</p>



<p>To generate an activation heatmap for a PyTorch model, we need to take the following steps:<br></p>



<ol class="wp-block-list">
<li>Initialize one of <a href="https://frgfm.github.io/torch-cam/methods.html" target="_blank" rel="noreferrer noopener nofollow">the methods provided by TorchCam</a> with our model.</li>



<li>Pass a sample input into the model and record the output.</li>



<li>Apply the initialized TorchCam method.</li>
</ol>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchcam.methods <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SmoothGradCAMpp

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># initialize the Smooth Grad-CAM++ extractor</span>
cam_extractor = SmoothGradCAMpp(my_pytorch_model)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># compute the model’s output for the sample</span>
out = my_pytorch_model(sample_input_tensor.unsqueeze(<span class="hljs-number" style="color: teal;">0</span>))

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate the class activation map</span>
cams = cam_extractor(out.squeeze(<span class="hljs-number" style="color: teal;">0</span>).argmax().item(), out)
</pre></code></pre>
</div>




<p>The accompanying Colab notebook contains <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=yZ2hwGFFHmES&amp;line=2&amp;uniqifier=1" target="_blank" rel="noreferrer noopener nofollow">a full TorchCam activation heatmap example</a> using a ResNet image classification model.</p>



<p>Once we have computed the class activation maps, we can plot the activation heatmap for each layer in the model:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> name, cam <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(cam_extractor.target_names, cams):
    plt.imshow(cam.squeeze(<span class="hljs-number" style="color: teal;">0</span>).numpy())
    plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">'off'</span>)
    plt.title(name)
    plt.show()
</pre></code></pre>
</div>




<p>In my example model’s case, the output is not overly helpful:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="389" height="411" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=389%2C411&#038;ssl=1" alt="Creating a visualization heatmap for a PyTorch model" class="wp-image-33042" style="width:349px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?w=389&amp;ssl=1 389w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=189%2C200&amp;ssl=1 189w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=220%2C232&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=120%2C127&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=160%2C169&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-6.png?resize=300%2C317&amp;ssl=1 300w" sizes="auto, (max-width: 389px) 100vw, 389px" /><figcaption class="wp-element-caption">Creating a visualization heatmap for a PyTorch model (layer) | Source: Author</figcaption></figure>
</div>


<p>We can greatly enhance the plot’s value by overlaying the original input image. Luckily for us, TorchCam provides the <a href="https://frgfm.github.io/torch-cam/utils.html#torchcam.utils.overlay_mask" target="_blank" rel="noreferrer noopener nofollow"><em>overlay_mask</em></a> utility function for this purpose:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchcam.utils <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> overlay_mask
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchvision.transforms.functional <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> to_pil_image

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> name, cam <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(cam_extractor.target_names, cams):
    result = overlay_mask(to_pil_image(img),
                to_pil_image(cam.squeeze(<span class="hljs-number" style="color: teal;">0</span>), mode=<span class="hljs-string" style="color: rgb(221, 17, 68);">'F'</span>),
                alpha=<span class="hljs-number" style="color: teal;">0.7</span>)
    plt.imshow(result)
    plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">'off'</span>)
    plt.title(name)
    plt.show()
</pre></code></pre>
</div>



<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="293" height="411" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=293%2C411&#038;ssl=1" alt="Original input image overlaid with an activation heatmap of the fourth layer in a ResNet18" class="wp-image-33044" style="width:347px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?w=293&amp;ssl=1 293w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=143%2C200&amp;ssl=1 143w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=220%2C309&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=120%2C168&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-7.png?resize=160%2C224&amp;ssl=1 160w" sizes="auto, (max-width: 293px) 100vw, 293px" /><figcaption class="wp-element-caption"><strong>Original input image overlaid with an activation heatmap of the fourth layer in a ResNet18</strong> | Source: Author</figcaption></figure>
</div>


<p>As you can see in the example plot above, the activation heatmap exposes the areas of the input image that resulted in the greatest activation of neurons in the inner layer of the deep learning model. This helps both engineers and non-expert audiences understand what’s happening inside the model.</p>



<h3 class="wp-block-heading" id="h-feature-visualization">Feature visualization</h3>



<p>Feature visualization reveals the features learned by a deep neural network. It is particularly helpful in <a href="/blog/category/computer-vision" target="_blank" rel="noreferrer noopener">computer vision</a>, where it shows which abstract features in an input image a neural network responds to. For example, it can reveal that a neuron in a CNN architecture is highly responsive to diagonal edges or textures like fur.</p>



<p>This helps us understand what the model is looking for in images. The main difference from the activation heatmaps discussed in the previous section is that heatmaps show a model’s response to regions of a specific input image, whereas feature visualization goes a level deeper and attempts to uncover the abstract concepts the model responds to.</p>



<p>Through feature visualization, we can gain valuable insights into the specific features that deep neural networks are processing at different layers. Generally, layers close to the model’s input will respond to simpler features like edges, while layers closer to the model’s output will detect more abstract concepts.</p>
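<p>One common way to obtain per-layer feature maps in PyTorch is to register forward hooks that capture each layer’s output during a forward pass. Below is a minimal, self-contained sketch; the tiny stand-in CNN and the random input are purely illustrative, not the article’s actual model:</p>

```python
import torch
import torch.nn as nn

# A small stand-in CNN; in practice this would be your trained model,
# e.g., the ResNet18 used in the accompanying Colab notebook.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

# Forward hooks let us capture each layer's output during a forward pass.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so the stored tensors don't keep the autograd graph alive.
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        module.register_forward_hook(make_hook(name))

# A single forward pass on a (dummy) image fills the dictionary.
x = torch.randn(1, 3, 32, 32)
model(x)

# Average over channels to get one 2D feature map per layer,
# which can then be plotted with plt.imshow().
feature_maps = {name: act.mean(dim=1).squeeze(0)
                for name, act in activations.items()}
```

<p>The same hook mechanism works for any module type, so you can capture activations at whichever depth of the network you want to inspect.</p>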



<p>Such insights not only aid in understanding a model&#8217;s inner workings but also serve as a toolkit for fine-tuning and enhancing its performance. By inspecting the features that are activated incorrectly or inconsistently, we can refine the training process or identify data quality issues.</p>



<p>In my Colab notebook for this article, you can find the <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=U0cniE7sUNP8" target="_blank" rel="noreferrer noopener nofollow">full example code</a> for generating feature visualizations for a PyTorch CNN. Here, we’ll focus on discussing the result and what we can learn from it.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1000" height="1600" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=1000%2C1600&#038;ssl=1" alt="Feature visualization plots for a ResNet18 processing the image of a dog" class="wp-image-33046" style="width:718px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?w=1000&amp;ssl=1 1000w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=768%2C1229&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=125%2C200&amp;ssl=1 125w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=960%2C1536&amp;ssl=1 960w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=220%2C352&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=120%2C192&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=160%2C256&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=300%2C480&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-8.png?resize=480%2C768&amp;ssl=1 480w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Feature visualization plots for a ResNet18 processing the image of a dog</strong> | Source: Author</figcaption></figure>
</div>


<p>As you can see from the plots above, the CNN detects different patterns or features in every layer. If you look closely at the upper row, which corresponds to the first four layers of the model, you can see that those layers detect the edges in the image. For instance, in the second and fourth panels of the first row, you can see that the model identifies the nose and the ears of the dog.</p>



<p>As the activations flow through the model, it becomes ever more challenging to make out what the model is detecting. But if we analyzed the feature maps more closely, we would likely find that individual neurons are activated by, e.g., the dog’s ears or eyes.</p>



<h3 class="wp-block-heading" id="h-deep-feature-factorizations">Deep feature factorizations</h3>



<p><a href="https://arxiv.org/abs/1806.10206" target="_blank" rel="noreferrer noopener nofollow">Deep Feature Factorization (DFF)</a> is a method to analyze the features a convolutional neural network has learned. DFF identifies regions in the network’s feature space that belong to the same semantic concept. By assigning different colors to these regions, we can create a visualization that allows us to see whether the features identified by the model are meaningful.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="300" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=1200%2C300&#038;ssl=1" alt="Deep feature visualization for a computer vision model" class="wp-image-33047" style="width:796px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=768%2C192&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=200%2C50&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=220%2C55&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=120%2C30&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=160%2C40&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=300%2C75&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=480%2C120&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-9.png?resize=1020%2C255&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Deep feature visualization for a computer vision model | <a href="https://jacobgil.github.io/pytorch-gradcam-book/Deep%20Feature%20Factorizations.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>For instance, in the example above, we find that the model bases its decision (that the image shows labrador retrievers) on the puppies, not the surrounding grass. The nose region might point to a chow, but the shape of the head and ears push the model toward “labrador retriever.” This decision logic mimics the way a human would approach the task.&nbsp;</p>



<p>DFF is available in PyTorch-gradcam, which comes with <a href="https://jacobgil.github.io/pytorch-gradcam-book/Deep%20Feature%20Factorizations.html" target="_blank" rel="noreferrer noopener nofollow">an extensive DFF tutorial</a> that also discusses how to interpret the results. The image above is based on this tutorial. I have simplified the code and added some additional comments. You’ll find my recommended approach to <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS?usp=sharing" target="_blank" rel="noreferrer noopener nofollow">Deep Feature Factorization with PyTorch-gradcam</a> in the Colab notebook.</p>



<h3 class="wp-block-heading" id="h-training-dynamics-plots">Training dynamics plots</h3>



<p>Training dynamics plots show how a model learns. Training progress is typically gauged through performance metrics such as loss and accuracy. By visualizing these metrics, data scientists and deep learning practitioners can obtain crucial insights:</p>



<ul class="wp-block-list">
<li><strong>Learning progression</strong>: Training dynamics plots reveal how quickly or slowly a model converges. Suspiciously rapid convergence can point to overfitting, while erratic fluctuations may indicate issues like poor initialization or an improperly tuned learning rate.<br></li>



<li><strong>Early stopping</strong>: Plotting losses helps to identify the point at which a model starts to overfit the training data. A training loss that keeps decreasing while the validation loss rises is a clear sign of overfitting. The point where overfitting sets in is the optimal time to halt training.<br></li>
</ul>
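<p>The early-stopping criterion described above can be sketched in a few lines of plain Python. The loss values and the <code>patience</code> threshold here are illustrative; in a real training loop, the validation losses would come from your model:</p>

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training should stop, or None.

    Stops once the validation loss has not improved for `patience`
    consecutive epochs after its best value so far.
    """
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None

# Validation loss falls, then starts rising: a classic overfitting pattern.
val_losses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.56, 0.61]
print(early_stop_epoch(val_losses))  # → 6 (three epochs after the minimum at epoch 3)
```

<p>In practice, you would also save a checkpoint whenever the validation loss improves, so that stopping restores the best model rather than the last one.</p>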


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="666" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=1600%2C666&#038;ssl=1" alt="Plots of loss over training epochs for various deep learning models" class="wp-image-33048" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=768%2C320&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=200%2C83&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=1536%2C639&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=220%2C92&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=120%2C50&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=160%2C67&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=300%2C125&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=480%2C200&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-10.png?resize=1020%2C425&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Plots of loss over training epochs for various deep learning models | <a href="https://www.mdpi.com/2361466" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<div id="app-screenshot-block_8325f401c14019a84ecbdcb50a339225"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=1020%2C320&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=480%2C151&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=768%2C241&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-11.png?fit=1020%2C320&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="320"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/showcase/org/onboarding-project/runs/details?viewId=standard-view&#038;detailsTab=charts&#038;shortId=IMG-197&#038;type=run&#038;sortBy=%5B%22sys%2Fcreation_time%22%5D&#038;sortFieldType=%5B%22datetime%22%5D&#038;sortFieldAggregationMode=%5B%22auto%22%5D&#038;sortDirection=%5B%22descending%22%5D&#038;groupBy=%5B%22data%2Fversion%2Fvalid%22%5D&#038;groupByFieldType=%5B%22artifact%22%5D&#038;groupByFieldAggregationMode=%5B%22auto%22%5D&#038;suggestionsEnabled=true&#038;lbViewUnpacked=true"
						class="c-button c-button--primary c-button--small c-button--cta" target="_blank" rel="nofollow noopener noreferrer">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Training loss, validation Dice coefficient (equivalent to the F1 score), and validation loss for a model training run in neptune.ai			</figcaption>
			
</div>



<div id="separator-block_632d57e60cc540f25dbac157d50313ba"
         class="block-separator block-separator--30">
</div>


    <a
        href="/blog/improving-ml-model-performance"
        id="cta-box-related-link-block_41cd003c0cd2c6c19436e4062de1c3ba"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                 </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-improve-ml-model-performance-best-practices-from-ex-amazon-ai-researcher">                How to Improve ML Model Performance [Best Practices From Ex-Amazon AI Researcher]            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-gradient-plots">Gradient plots</h3>



<p>If plots of performance metrics are insufficient to understand a model’s training progress (or lack thereof), plotting the loss function’s gradients can be helpful.</p>



<p>To adjust the weights of a neural network during training, we use a technique called <a href="http://neuralnetworksanddeeplearning.com/chap2.html" target="_blank" rel="noreferrer noopener nofollow">backpropagation</a> to compute the gradient of the loss function with respect to the weights and biases of our network. The gradient is a high-dimensional vector that points in the direction of the steepest increase of the loss function. Thus, we can use that information to shift our weights and biases in the opposite direction. The learning rate controls the amount by which we change the weights and biases.</p>
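<p>To make the update rule concrete, here is the idea reduced to a single weight and a toy quadratic loss (both the loss function and the learning rate are illustrative, not part of any real model):</p>

```python
# Gradient descent on a toy one-parameter loss L(w) = (w - 3)^2.
# The gradient dL/dw = 2 * (w - 3) points in the direction of steepest
# increase of the loss, so each step moves w the opposite way,
# scaled by the learning rate.
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0              # initial weight
learning_rate = 0.1  # step-size hyperparameter

for _ in range(50):
    w -= learning_rate * gradient(w)

print(round(w, 3))  # w has converged to the loss minimum at 3.0
```

<p>Backpropagation does exactly this, except the "weight" is a vector with millions of entries and the gradient is computed layer by layer via the chain rule.</p>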



<p>Vanishing or exploding gradients can prevent deep neural networks from learning. Plotting the mean magnitude of gradients for different layers can reveal whether gradients are vanishing (approaching zero) or exploding (becoming extremely large). If the gradient vanishes, we have no idea in which direction to shift our weights and biases, so training is stuck. An exploding gradient leads to large changes in the weights and biases, often overshooting the target and causing rapid fluctuations in the loss.</p>
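<p>A minimal sketch of how such per-layer gradient statistics can be collected in PyTorch follows. The small model, dummy data, and loss are illustrative; in practice, you would compute these values inside your training loop and log them to your experiment tracker at each step:</p>

```python
import torch
import torch.nn as nn

# Toy network and data standing in for a real training setup.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
x = torch.randn(16, 10)
target = torch.randn(16, 1)

# One forward and backward pass populates the .grad attributes.
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Mean absolute gradient per parameter tensor: values that stay near
# zero across many steps suggest vanishing gradients, while very large
# values suggest exploding gradients.
grad_magnitudes = {
    name: param.grad.abs().mean().item()
    for name, param in model.named_parameters()
    if param.grad is not None
}
for name, magnitude in grad_magnitudes.items():
    print(f"{name}: {magnitude:.2e}")
```

<p>Plotting these magnitudes per layer over training steps yields exactly the kind of gradient plot shown in the screenshot below.</p>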



<p><a previewlistener="true" href="/blog/best-ml-experiment-tracking-tools" target="_blank" rel="noreferrer noopener">Machine learning experiment trackers</a> like <a previewlistener="true" href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> enable researchers, data scientists, and AI/ML engineers to track and plot gradients during training.</p>



<div id="app-screenshot-block_e411851aabbbac383dcd84d22ffe8f29"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=867%2C361&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=480%2C200&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=768%2C320&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-12.png?fit=867%2C361&#038;ssl=1 1020w"
					alt=""
					style=""
					width="867"
					height="361"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/katyl/GradientsNew/runs/details?viewId=standard-view&#038;detailsTab=charts&#038;shortId=GRAD1-50"
						class="c-button c-button--primary c-button--small c-button--cta" target="_blank" rel="nofollow noopener noreferrer">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Gradient plots for two different layers of a deep neural network in neptune.ai			</figcaption>
			
</div>



<section
	id="i-box-block_af180a93f85b222360e9aaedc5660ef9"
	class="block-i-box  l-margin__top--large l-margin__bottom--large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                 <strong>Editor&#8217;s note</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Do you feel like experimenting with neptune.ai?</p>



<ul
    id="arrow-list-block_c6915b7685c25644f4ba6a067d3637a5"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Request a <a href="/free-trial" target="_blank" rel="noreferrer noopener">free trial</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Play with a <a href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908" target="_blank" rel="noreferrer noopener nofollow">live project</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a previewlistener="true" href="https://docs.neptune.ai/" target="_blank" rel="noreferrer noopener">See the docs</a>&nbsp;or watch a short&nbsp;<a href="/walkthrough" target="_blank" rel="noreferrer noopener">product demo (2 min)</a></p>


</li>


</ul>


	</div>

</section>



<p>To learn more about vanishing and exploding gradients and how to use gradient plots to detect them, I recommend Katherine Li’s in-depth blog post on <a href="/blog/vanishing-and-exploding-gradients-debugging-monitoring-fixing" target="_blank" rel="noreferrer noopener">debugging, monitoring, and fixing gradient-related problems</a>.</p>
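<p>The mechanics behind vanishing and exploding gradients are easy to demonstrate numerically: during backpropagation, the gradient is repeatedly multiplied by layer Jacobians, so its norm shrinks or grows roughly exponentially with depth. Here is a minimal NumPy sketch (the layer count, dimension, and weight scales are illustrative assumptions, not taken from any particular model):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norms(num_layers, scale, dim=64):
    """Backpropagate a unit-norm gradient through `num_layers` random
    linear layers and record its norm after each layer."""
    grad = np.ones(dim) / np.sqrt(dim)  # unit-norm upstream gradient
    norms = []
    for _ in range(num_layers):
        # Random layer Jacobian with entries of standard deviation scale/sqrt(dim)
        jacobian = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
        grad = jacobian.T @ grad  # chain rule: multiply by the Jacobian
        norms.append(np.linalg.norm(grad))
    return norms

vanishing = gradient_norms(20, scale=0.5)  # small weights: norms decay
exploding = gradient_norms(20, scale=2.0)  # large weights: norms blow up
```

<p>With a weight scale below one, the gradient norm collapses toward zero within a few layers; with a scale above one, it blows up. This is exactly the pattern a gradient plot makes visible.</p>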


    <a
        href="/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem"
        id="cta-box-related-link-block_6d5be1568214eb01de086cc441cbc89c"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related article                </div>
            </div>

        
                    <h3 class="c-header" id="h-understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem">                Understanding Gradient Clipping (and How It Can Fix Exploding Gradients Problem)            </h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-loss-landscapes">Loss landscapes</h3>



<p>Beyond plotting gradient magnitudes, we can also visualize the loss function and its gradients directly. These visualizations are commonly called “<a href="https://arxiv.org/abs/1712.09913" target="_blank" rel="noreferrer noopener nofollow">loss landscapes</a>.”</p>



<p>Inspecting a loss landscape helps data scientists and machine learning practitioners understand how an optimization algorithm moves the weights and biases in a model toward a loss function’s minimum.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="640" height="480" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=640%2C480&#038;ssl=1" alt="" class="wp-image-33060" style="width:578px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?w=640&amp;ssl=1 640w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=200%2C150&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=220%2C165&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=300%2C225&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-13.png?resize=480%2C360&amp;ssl=1 480w" sizes="auto, (max-width: 640px) 100vw, 640px" /><figcaption class="wp-element-caption">A plot of the region around a loss function’s local minimum with an inscribed gradient vector | <a href="https://github.com/pvigier/gradient-descent" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>In an idealized case like the one shown in the figure above, the loss landscape is very smooth. The gradient only changes slightly across the surface. Deep neural networks often exhibit a much more complex loss landscape with spikes and trenches. Reliably converging towards a minimum of the loss function in these cases requires robust optimizers such as <a href="https://arxiv.org/abs/1412.6980" target="_blank" rel="noreferrer noopener nofollow">Adam</a>.</p>



<p>To plot a loss landscape for a PyTorch model, you can use <a href="https://github.com/tomgoldstein/loss-landscape" target="_blank" rel="noreferrer noopener nofollow">the code provided by the authors</a> of a <a href="https://arxiv.org/abs/1712.09913" target="_blank" rel="noreferrer noopener nofollow">seminal paper on the topic</a>. To get a first impression, check out the interactive <a href="https://www.telesens.co/loss-landscape-viz/viewer.html" target="_blank" rel="noreferrer noopener nofollow">Loss Landscape Visualizer</a> using this library behind the scenes. There is also a <a href="https://github.com/artur-deluca/landscapeviz" target="_blank" rel="noreferrer noopener nofollow">TensorFlow port of the same code</a>.</p>
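<p>If you are curious what such code does under the hood: the basic recipe of the paper is to pick two random directions in parameter space and evaluate the loss on a grid of perturbations around the trained parameters. Below is a minimal sketch that uses an ordinary least-squares model as a stand-in for a network and omits the paper's filter normalization:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "model": linear regression, so the trained parameters are the
# exact least-squares solution and the loss minimum is known.
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

def loss(theta):
    return np.mean((X @ theta - y) ** 2)

# Two random directions spanning a 2D slice of parameter space.
d1, d2 = rng.standard_normal(5), rng.standard_normal(5)

alphas = np.linspace(-1, 1, 25)
landscape = np.array(
    [[loss(theta_star + a * d1 + b * d2) for b in alphas] for a in alphas]
)
```

<p>The resulting grid can be passed to matplotlib's <code>contourf</code> or <code>plot_surface</code> to obtain plots like the one above.</p>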



<p>Loss landscapes not only provide insight into how deep learning models learn, but they can also be beautiful to look at. Javier Ideami has created the <a href="https://losslandscape.com/" target="_blank" rel="noreferrer noopener nofollow">Loss Landscape project</a> with many artistic videos and interactive animations of various loss landscapes.</p>



<h3 class="wp-block-heading" id="h-visualizing-attention">Visualizing attention</h3>



<p>Famously, the transformer models that have revolutionized deep learning over the past few years are <a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noreferrer noopener nofollow">based on attention mechanisms</a>. Visualizing what parts of the input a model attends to provides us with important insights:</p>



<ul class="wp-block-list">
<li><strong>Interpreting self-attention: </strong>Transformers utilize self-attention mechanisms to weigh the importance of different parts of the input sequence. Visualizing attention maps helps us grasp which parts the model focuses on.<br></li>



<li><strong>Diagnosing errors:</strong> When the model attends to irrelevant parts of the input sequence, it can lead to prediction mistakes. Visualization allows us to detect such issues.<br></li>



<li><strong>Exploring contextual information:</strong> Transformer models excel at capturing contextual information from input sequences. Attention maps show how the model distributes attention across the input’s elements, revealing how context is built and propagated through layers.<br></li>



<li><strong>Understanding how transformers work:</strong> Visualizing attention and its flow through the model at different stages helps us understand how transformers process their input. Jacob Gildenblat’s <a href="https://jacobgil.github.io/deeplearning/vision-transformer-explainability" target="_blank" rel="noreferrer noopener nofollow">Exploring Explainability for Vision Transformers</a> takes you on a visual journey through Facebook’s <a href="https://huggingface.co/facebook/deit-tiny-patch16-224" target="_blank" rel="noreferrer noopener nofollow">Data-efficient Image Transformer</a> (deit-tiny).</li>
</ul>
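<p>At its core, an attention map is just the softmax-normalized score matrix of scaled dot-product attention. The following NumPy sketch computes one for a single head; the sequence length and dimensions are arbitrary, and a real visualization would take Q and K from a trained model:</p>

```python
import numpy as np

def attention_map(Q, K):
    """Return the (seq_len, seq_len) attention weights: row i shows how
    much query position i attends to each key position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
A = attention_map(Q, K)
```

<p>Each row of the resulting matrix sums to one and can be rendered as a heatmap, e.g., with matplotlib's <code>imshow</code>, which is essentially how overlays like the one below are produced.</p>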


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1296" height="826" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=1296%2C826&#038;ssl=1" alt="Example of an attention map" class="wp-image-33064" style="width:578px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?w=1296&amp;ssl=1 1296w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=768%2C489&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=200%2C127&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=220%2C140&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=120%2C76&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=160%2C102&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=300%2C191&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=480%2C306&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-14.png?resize=1020%2C650&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The image on the left is original. On the right, it&#8217;s overlaid with an attention map. You can see that the model allocates the most attention to the dog | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-visualizing-embeddings">Visualizing embeddings</h3>



<p>Embeddings are high-dimensional vectors that capture semantic information. Nowadays, they are typically generated by deep learning models. Visualizing embeddings helps to understand this complex, high-dimensional data.</p>



<p>Typically, embeddings are projected down to a two- or three-dimensional space and represented by points. Standard techniques include principal component analysis, t-SNE, and UMAP. I’ve covered the latter two in-depth in the section on visualizing cluster analysis in my article on machine learning visualization.</p>



<p>Thus, it is no surprise that embedding visualizations reveal data patterns, similarities, and anomalies by grouping embeddings into clusters. For instance, if you visualize word embeddings with one of the methods mentioned above, you’ll find that semantically similar words will end up close together in the projection space.</p>
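<p>Of the projection techniques mentioned above, PCA is simple enough to sketch directly (t-SNE and UMAP require dedicated libraries). The embedding matrix here is random and stands in for, say, 200 word vectors of dimension 300:</p>

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project high-dimensional embeddings onto their top principal
    components for plotting."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(1)
emb = rng.standard_normal((200, 300))  # e.g., 200 word vectors of dim 300
points2d = pca_project(emb)
```

<p>The projected points can then be scatter-plotted; clusters in the plot correspond to groups of semantically similar embeddings.</p>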



<p>The <a href="https://projector.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow embedding projector</a> gives everyone access to interactive visualizations of well-known embeddings like standard <a href="https://www.tensorflow.org/text/tutorials/word2vec" target="_blank" rel="noreferrer noopener nofollow">Word2vec</a> corpora.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1050" height="984" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=1050%2C984&#038;ssl=1" alt="Embeddings for MNIST" class="wp-image-33066" style="width:570px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?w=1050&amp;ssl=1 1050w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=768%2C720&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=200%2C187&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=220%2C206&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=120%2C112&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=160%2C150&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=300%2C281&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=480%2C450&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/11/how-to-visualize-deep-learning-models-15.png?resize=1020%2C956&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Embeddings for MNIST represented in a 3D space </strong>| <a href="https://projector.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-when-to-use-which-deep-learning-visualization">When to use which deep learning visualization</h2>



<p>We can break down the deep learning model lifecycle into four different phases:</p>



<div id="case-study-numbered-list-block_957f880db9e06652b64bf5167acd2687"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Pre-training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                During training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Post-training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Inference            </li>
            </ul>
</div>



<p>Each of these phases requires different visualizations.</p>



<h3 class="wp-block-heading" id="h-pre-training-deep-learning-model-visualization">Pre-training deep learning model visualization</h3>



<p>During early <a href="/blog/category/machine-learning-model-development" target="_blank" rel="noreferrer noopener">model development</a>, finding a suitable model architecture is the most essential task.</p>



<p>Architecture visualizations offer insights into how your model processes information. To understand the architecture of your deep learning model, you can visualize the layers, their connections, and the data flow between them.</p>



<h3 class="wp-block-heading" id="h-deep-learning-model-visualization-during-model-training">Deep learning model visualization during model training</h3>



<p>In the training phase, understanding training progress is crucial. To this end, training dynamics and gradient plots are the most helpful visualizations.</p>



<p>If training does not yield the expected results, feature visualizations or inspecting the model’s loss landscape in detail can provide valuable insights. If you’re training <a href="/blog/transformer-models-for-textual-data-prediction" target="_blank" rel="noreferrer noopener">transformer-based models</a>, visualizing attention or embeddings can lead you on the right path.</p>



<h3 class="wp-block-heading" id="h-post-training-deep-learning-model-visualizations">Post-training deep learning model visualizations</h3>



<p>Once the model is fully trained, the main goal of visualizations is to provide insights into how a model processes data to produce its outputs.</p>



<p>Activation heatmaps uncover which parts of the input are considered most important by the model. Feature visualizations reveal the features a model learned during training and help us understand what patterns a model is looking for in the input data at different layers. Deep Feature Factorization goes a step further and visualizes regions in the input space associated with the same concept.</p>



<p>If you’re working with transformers, attention and embedding visualizations can help you validate that your model focuses on the most important input elements and captures semantically meaningful concepts.</p>



<h3 class="wp-block-heading" id="h-inference">Inference</h3>



<p>At inference time – when a model is used to make predictions or generate outputs – visualizations can help monitor and debug cases where a model went wrong.</p>



<p>The methods used are the same as the ones you might use in the post-training phase, but the goal is different: instead of understanding the model as a whole, we’re now interested in how the model handles an individual input instance.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>We covered a lot of ways to visualize deep learning models. We started by asking why we might want visualizations in the first place and then looked into several techniques, often accompanied by hands-on examples. Finally, we discussed where in the model lifecycle the different deep learning visualization approaches promise the most valuable insights.</p>



<p>I hope you enjoyed this article and have some ideas about which visualizations you will explore for your current deep learning projects. The <a href="https://colab.research.google.com/drive/1VZp8H1EOyxYxQKiQKv9WceEsM4jHJVsS#scrollTo=Lh_bwP2vij2l" target="_blank" rel="noreferrer noopener nofollow">visualization examples in my Colab notebook</a> can serve as starting points. Please feel free to copy and adapt them to your needs!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">33030</post-id>	</item>
		<item>
		<title>Deploying Large NLP Models: Infrastructure Cost Optimization</title>
		<link>https://neptune.ai/blog/nlp-models-infrastructure-cost-optimization</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Thu, 23 Mar 2023 09:24:59 +0000</pubDate>
				<category><![CDATA[Natural Language Processing]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=19513</guid>

					<description><![CDATA[NLP models in commercial applications such as text generation systems have experienced great interest among the user. These models have achieved various groundbreaking results in many NLP tasks like question-answering, summarization, language translation, classification, paraphrasing, et cetera.&#160; Models like for example ChatGPT, Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) are predominantly very&#8230;]]></description>
										<content:encoded><![CDATA[
<p>NLP models in commercial applications such as text generation systems have attracted great interest among users. These models have achieved groundbreaking results in many NLP tasks like question answering, summarization, language translation, classification, and paraphrasing.&nbsp;</p>



<p>Models such as ChatGPT, Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) are predominantly very large and are often referred to as large language models, or <strong>LLMs</strong>. These models can easily have millions or even billions of parameters, making them financially expensive to deploy and maintain.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1354" height="938" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=1354%2C938&#038;ssl=1" alt="Graph showing that the size of large NLP models is increasing" class="wp-image-19527" style="width:650px;height:450px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?w=1354&amp;ssl=1 1354w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=768%2C532&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=200%2C139&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=220%2C152&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=120%2C83&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=160%2C111&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=300%2C208&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=480%2C333&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-1.png?resize=1020%2C707&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The size of large NLP models is increasing | <a 
href="https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Such large <a href="/blog/category/natural-language-processing" target="_blank" rel="noreferrer noopener">natural language processing models </a>require significant computational power and memory, which is often the leading cause of high <strong>infrastructure costs.</strong> Even if you are fine-tuning an average-sized model for a large-scale application, you need to handle a huge amount of data.&nbsp;</p>



<p>Such scenarios inevitably lead to stacking new layers of neural connections, resulting in a large model. Moreover, deploying these models requires fast and expensive GPUs, which ultimately adds to the infrastructure cost. So, is there a way to keep these expenses in check?</p>



<p><em>Sure there is.</em></p>



<p>This article aims to provide some strategies, tips, and tricks you can apply to optimize your infrastructure while deploying large NLP models. In the following sections, we will explore:</p>



<div id="case-study-numbered-list-block_00e91a4e20285466f5c7f98f35117148"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                The infrastructural challenges faced while deploying large NLP models.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Different strategies to reduce the costs associated with these challenges.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Other handy tips you might want to know to address this issue.            </li>
            </ul>
</div>



<section id="blog-intext-cta-block_1ec9f560db812874346af469b173634e" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-you-may-also-like">You may also like</h3>
    
            <p><a href="/blog/deploy-nlp-models-in-production" target="_blank" rel="noopener">How to Deploy NLP Models in Production</a></p>
<p><a href="/blog/future-of-mlops-and-gpt-3-with-david-hershey" target="_blank" rel="noopener">What Does GPT-3 Mean For the Future of MLOps? With David Hershey</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-challenges-of-large-nlp-models">Challenges of large NLP models</h2>



<h3 class="wp-block-heading" id="h-computational-resources">Computational resources</h3>



<p>LLMs require a significant amount of resources for optimal performance. Below are the challenges usually faced in this regard.&nbsp;</p>



<h4 class="wp-block-heading">1. High computational requirements</h4>



<p>Deploying LLMs can be challenging as they require significant computational resources to perform inference. This is especially true when the model is used for real-time applications, such as chatbots or virtual assistants.&nbsp;</p>



<p>Consider ChatGPT as an example. It is capable of processing and responding to queries within seconds (most of the time). But when user traffic is high, inference times increase. Other factors can also delay inference, such as the complexity of the question and the amount of information required to generate a response. In any case, if the model is supposed to serve in real time, it must be capable of high throughput and low latency.</p>



<h4 class="wp-block-heading">2. Storage capacity</h4>



<p>With parameters ranging from millions to billions, LLMs can pose storage capacity challenges. Ideally, the whole model would be stored on a single storage device, but because of its size, this is often not possible.&nbsp;</p>



<p>For example, <strong>OpenAI&#8217;s GPT-3</strong> model, with 175B parameters, requires over <strong>300GB</strong> of storage for its parameters alone. Additionally, it requires a GPU with a minimum of 16GB of memory to run efficiently. Storing and running such a large model on a single device may be impractical for many use cases due to the hardware requirements. As such, there are three main issues around storage capacity with LLMs:</p>
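<p>The 300GB figure follows directly from the parameter count and the numeric precision of the weights. A quick back-of-the-envelope helper (weights only; activations, optimizer states, and the KV cache come on top):</p>

```python
def model_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

gpt3_params = 175e9
print(model_memory_gb(gpt3_params, 4))  # fp32: 700.0 GB
print(model_memory_gb(gpt3_params, 2))  # fp16: 350.0 GB
print(model_memory_gb(gpt3_params, 1))  # int8: 175.0 GB
```

<p>Even at 16-bit precision, GPT-3's weights alone occupy roughly 350GB, which is why no single accelerator can hold the model.</p>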



<h5 class="wp-block-heading">2.1 Memory limitations</h5>



<p>LLMs require a lot of memory as they process a huge amount of information. This can be challenging, especially when you want to deploy them on a low-memory device such as a mobile phone.&nbsp;</p>



<p>One way to deploy such models is to use a distributed system or distributed inference. In distributed inference, the model is distributed on multiple nodes or servers. It allows the distribution of the workload and speeds up the process. But the challenge here is that it may require significant expertise to set up and maintain. Plus, the larger the model, the more servers are required, which again increases the deployment cost.&nbsp;</p>



<h5 class="wp-block-heading">2.2 Large model sizes</h5>



<p>The MT-NLG model released in 2022 has 530 billion parameters and requires several hundred gigabytes of storage. High-end GPUs and basic data parallelism aren&#8217;t sufficient for deployment, and even alternative solutions like pipeline and model parallelism involve trade-offs between functionality, usability, and memory/compute efficiency. As the authors of the paper “<a href="https://arxiv.org/pdf/1910.02054.pdf" target="_blank" rel="noreferrer noopener nofollow">ZeRO: Memory Optimizations Toward Training Trillion Parameter Models</a>” put it, this, in turn,<strong> reduces the effectiveness of the model.</strong>&nbsp;</p>



<p>For instance, a 1.5B parameter model on a 32GB GPU can easily run out of memory during inference if the input query is long and complicated. Even basic inference on an LLM requires multiple accelerators or multi-node computing clusters, such as multiple Kubernetes pods. Researchers have proposed techniques for offloading parameters to local RAM, but these turned out to be inefficient in practical use-case scenarios. Users cannot download such large-scale models onto their systems just to translate or summarize a given text.&nbsp;</p>



<h5 class="wp-block-heading">2.3 Scalability challenges&nbsp;</h5>



<p>Another area for improvement with LLMs is <strong>scalability</strong>. Large models are often scaled using <strong>model parallelism (MP),</strong> which requires storage and memory capacity across multiple machines. This involves dividing the model into smaller parts and distributing them across multiple machines. Each machine processes a different part of the model, and the results are combined to produce the final output. This technique can be helpful for handling large models, but it requires careful consideration of the communication overhead between machines.&nbsp;</p>
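<p>The idea of dividing a model can be illustrated with a single linear layer whose weight matrix is split column-wise between two hypothetical devices. Each device computes its slice of the output, and a communication step reassembles the full result (in a real multi-GPU setup, this would be an all-gather):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))   # batch of activations
W = rng.standard_normal((16, 32))  # full weight matrix of one layer

# "Device 0" and "device 1" each hold half of the columns of W.
W0, W1 = np.hsplit(W, 2)
y0 = x @ W0  # computed on device 0
y1 = x @ W1  # computed on device 1

# Communication step: gather the partial outputs.
y_parallel = np.concatenate([y0, y1], axis=1)
y_single = x @ W
```

<p>The concatenated result matches the single-device computation; the cost of the approach lies in the communication step, which has to happen at every layer.</p>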



<p>In <strong>distributed inference,</strong> the LLM is deployed on multiple machines, with each machine processing a subset of the input data. This approach is essential for handling large-scale language tasks that require input to pass through billions of parameters.&nbsp;</p>



<p>Most of the time, MP works, but there are instances where it doesn’t. The reason is that MP divides the model vertically, distributing the computation and parameters for each layer among several devices whose <strong>inter-GPU communication bandwidth is large.</strong> This distribution supports the intensive communication between layers required within a single node. The limitation appears outside a single node, where bandwidth drops, leading to a fall in performance and efficiency.</p>



<h4 class="wp-block-heading">3. Bandwidth requirements</h4>



<p>As discussed previously, LLMs have to be scaled using MP. The issue is that while MP is efficient in single-node clusters, inference in a multi-node setting isn’t, because of the low-bandwidth networks between nodes.&nbsp;</p>



<p>Deploying a large language model requires multiple network requests to retrieve data from different servers. Network latency affects the time required to transfer data between the servers, which results in slower performance and higher response times. The resulting delays in processing can degrade the user experience.</p>
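<p>A rough estimate shows why bandwidth matters so much: transfer time is simply data volume divided by link bandwidth. The payload size and link speeds below are illustrative assumptions, not measurements:</p>

```python
def transfer_ms(megabytes, gbit_per_s):
    """Time to move `megabytes` of data over a link, in milliseconds."""
    bits = megabytes * 8e6
    return bits / (gbit_per_s * 1e9) * 1e3

activations_mb = 64  # hypothetical activation payload passed between stages
intra_node = transfer_ms(activations_mb, 600)  # fast intra-node GPU link
inter_node = transfer_ms(activations_mb, 25)   # slower inter-node network
print(intra_node, inter_node)
```

<p>Moving the same activations over an inter-node link can easily take an order of magnitude longer than over an intra-node link, which is exactly the gap that makes multi-node inference slow.</p>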



<h4 class="wp-block-heading">4. Resource constraints</h4>



<p>Limited storage capacity can restrict the ability to store multiple versions of the same model, which makes it difficult to compare the performance of different models and track the progress of model development over time. This is especially true if you want to adopt a <strong>shadow deployment strategy</strong>.</p>



<h3 class="wp-block-heading" id="h-energy-consumption">Energy consumption</h3>



<p>As discussed above, serving LLMs requires significant <strong>computational</strong> resources, which can lead to high energy consumption and a large carbon footprint. This can be problematic for organizations that are committed to reducing their environmental impact.</p>



<p>For reference, the image below shows estimates of the training cost of several LLMs, along with the carbon footprint they produce during training.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1096" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1920%2C1096&#038;ssl=1" alt="Financial estimation of the large NLP models, along with the carbon footprint that they produce during training" class="wp-image-19594" style="width:650px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1920%2C1096&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=768%2C438&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=200%2C114&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1536%2C877&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=220%2C126&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=160%2C91&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=300%2C171&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=480%2C274&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?resize=1020%2C582&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-2.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Financial estimation of the large NLP models, along with the carbon footprint that they produce during training | <a href="https://sunniesuhyoung.github.io/files/LLM.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>What is more shocking is that 80-90% of the machine learning workload is inference processing, according to <a href="https://www.hpcwire.com/2019/03/19/aws-upgrades-its-gpu-backed-ai-inference-platform/" target="_blank" rel="noreferrer noopener nofollow">NVIDIA</a>. Likewise, according to <a href="https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/" target="_blank" rel="noreferrer noopener nofollow">AWS</a>, inference accounts for 90% of machine learning demand in the cloud.</p>



<h3 class="wp-block-heading" id="h-cost">Cost</h3>



<p>Deploying and using LLMs can be costly, including the cost of hardware, storage, and infrastructure. Additionally, the cost of deploying the model can be significant, especially when using resources such as GPUs or TPUs for low latency and high throughput during inference. This can make it challenging for smaller organizations or individuals to use LLMs for their applications.</p>



<p>To put this into perspective, the running cost of ChatGPT has been estimated at around <strong>$100,000</strong> per day, or <strong>$3M</strong> per month.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1000" height="1364" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=1000%2C1364&#038;ssl=1" alt="Tweet about ChatGPT costs" class="wp-image-19596" style="width:502px;height:685px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?w=1000&amp;ssl=1 1000w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=768%2C1048&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=147%2C200&amp;ssl=1 147w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=220%2C300&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=120%2C164&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=160%2C218&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=300%2C409&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-3.png?resize=480%2C655&amp;ssl=1 480w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Tweet about ChatGPT costs  | <a href="https://twitter.com/tomgoldsteincs/status/1600196995389366274?lang=en" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-strategies-for-optimizing-infrastructure-costs-of-large-nlp-models">Strategies for optimizing infrastructure costs of large NLP models</h2>



<p>In this section, we will explore possible solutions and techniques for the challenges discussed in the previous section. It is worth noting that when you deploy a model to the cloud, you choose an inference option and thereby create an endpoint. See the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=1200%2C628&#038;ssl=1" alt="Graph with the general workflow for inference endpoints " class="wp-image-36662" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization.jpg?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The general workflow for inference endpoints | <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>With that in mind, and given the challenges discussed earlier, let's look at techniques that can be used to optimize the cost of the infrastructure for deploying LLMs. Below are steps you can follow to deploy your model as efficiently as possible.&nbsp;</p>



<h3 class="wp-block-heading" id="h-smart-use-of-cloud-computing-for-computational-resources">Smart use of cloud computing for computational resources</h3>



<p>Using cloud computing services can provide on-demand access to powerful computing resources, including CPUs and GPUs. Cloud computing services are flexible and can scale according to your requirements.&nbsp;</p>



<p>One important tip: set a budget for your project. A budget forces you to look for optimizations that keep the project within your financial limits.&nbsp;</p>



<p>When it comes to cloud services, many companies offer a platform. Cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer a range of options for deploying LLMs, including virtual machines, containers, and serverless computing. Still, you must do your own research and calculations. In particular, you must know these three things:</p>



<div id="case-study-numbered-list-block_a1924d9be2965a1498bcc3ae807bcddd"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                The model size.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Details about the hardware to be used.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                The right inference option.            </li>
            </ul>
</div>



<p>Once you have these details, you can calculate how much accelerated computing power you need. Based on that, you can plan and execute your model deployment.&nbsp;</p>



<section id="blog-intext-cta-block_e89d175d5bcbcd23b7bda031b77c11ae" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><a href="/blog/mlops-tools-for-nlp-projects" target="_blank" rel="noopener">MLOps Tools for NLP Projects</a></p>
    
    </section>



<h4 class="wp-block-heading">Calculating model size</h4>



<p>You can see the table below, which will give you an idea of how many FLOPs you might need for your model. Once you have an estimation, you can then go ahead and find the relevant GPU in your preferred cloud platform.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="910" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1920%2C910&#038;ssl=1" alt="Estimated optimal training FLOPs and training tokens for various NLP model sizes." class="wp-image-19600" style="width:-123px;height:-58px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1920%2C910&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=768%2C364&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=200%2C95&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1536%2C728&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=220%2C104&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=120%2C57&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=160%2C76&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=300%2C142&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=480%2C228&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?resize=1020%2C484&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-5-1.png?w=1928&amp;ssl=1 1928w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Estimated optimal training FLOPs and training tokens for various NLP model sizes | <a href="https://arxiv.org/pdf/2203.15556.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>A tool accompanying the blog post “<a href="https://www.lesswrong.com/posts/HvqQm6o8KnwxbdmhZ/estimating-training-compute-of-deep-learning-models" target="_blank" rel="noreferrer noopener nofollow">Estimating Training Compute of Deep Learning Models</a>” allows you to calculate the FLOPs your model requires for both training and inference.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1272" height="1270" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=1272%2C1270&#038;ssl=1" alt="A screen from a tool that calculates the FLOPs required for both training and inference" class="wp-image-19601" style="width:513px;height:512px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?w=1272&amp;ssl=1 1272w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=768%2C767&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=300%2C300&amp;ssl=1 300w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=480%2C479&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=1020%2C1018&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-6.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A tool that calculates the FLOPs required for both training and inference | <a href="https://www.lesswrong.com/posts/HvqQm6o8KnwxbdmhZ/estimating-training-compute-of-deep-learning-models" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The app is based on the work of <a href="https://arxiv.org/abs/2001.08361" target="_blank" rel="noreferrer noopener nofollow">Kaplan et al., 2020</a> and <a href="https://arxiv.org/abs/2203.15556" target="_blank" rel="noreferrer noopener nofollow">Hoffmann et al., 2022</a>, which shows how to train a model on a fixed compute budget. To learn more about this subject, you can read the blog post <a href="https://www.lesswrong.com/posts/HvqQm6o8KnwxbdmhZ/estimating-training-compute-of-deep-learning-models" target="_blank" rel="noreferrer noopener nofollow">here</a>.</p>
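<p>If you prefer to run this estimate yourself, here is a minimal back-of-the-envelope sketch based on the widely used approximation of ~6 FLOPs per parameter per training token (and ~2 per token at inference). The A100 peak throughput and the 30% utilization figure are illustrative assumptions, not measurements; plug in your own hardware's numbers.</p>

```python
def training_flops(n_params, n_tokens):
    # Common scaling-law heuristic: ~6 FLOPs per parameter
    # per training token (forward + backward pass).
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    # ~2 FLOPs per parameter per generated token (forward pass only).
    return 2 * n_params

def gpu_hours(total_flops, peak_flops_per_s, utilization=0.3):
    # Real-world utilization is typically well below peak; 30% is an assumption.
    return total_flops / (peak_flops_per_s * utilization) / 3600

# Example: a 7B-parameter model trained on 1.4T tokens (Chinchilla-style ratio).
flops = training_flops(7e9, 1.4e12)
# Assumed peak throughput: ~312 TFLOP/s (A100, BF16 dense matmul).
print(f"{flops:.2e} training FLOPs, ~{gpu_hours(flops, 312e12):,.0f} GPU-hours")
```

<p>With the GPU-hour estimate in hand, multiplying by your provider's hourly rate gives a first-order cost figure to compare against your budget.</p>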



<h4 class="wp-block-heading">Selecting the right hardware</h4>



<p>Once you have calculated the required FLOPs, you can go ahead and choose the GPU. Make sure you are aware of the features each GPU offers; the image below gives an example.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="906" height="1534" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=906%2C1534&#038;ssl=1" alt="The list of GPU specifications offered by NVIDIA" class="wp-image-19602" style="width:509px;height:862px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?w=906&amp;ssl=1 906w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=768%2C1300&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=118%2C200&amp;ssl=1 118w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=220%2C372&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=120%2C203&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=160%2C271&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=300%2C508&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-7.png?resize=480%2C813&amp;ssl=1 480w" sizes="auto, (max-width: 906px) 100vw, 906px" /><figcaption class="wp-element-caption">The list of GPU specifications offered by NVIDIA | <a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf" target="_blank" rel="noreferrer noopener 
nofollow">Source</a></figcaption></figure>
</div>


<p>Above you can see the list of specifications that NVIDIA offers. Similarly, you can compare different GPUs and see which one suits your budget.&nbsp;</p>



<h4 class="wp-block-heading">Choosing the right inference option</h4>



<p>Once you have calculated the model size and selected the GPU, you can proceed to choose the inference option. Amazon SageMaker, for instance, offers multiple inference options to suit different workloads:</p>



<ol class="wp-block-list">
<li><strong>Real-time inference</strong>, which is suitable for low-latency or high-throughput online inferences and supports payload sizes up to 6 MB and processing times of 60 seconds.</li>



<li><strong>Serverless inference</strong>, which is ideal for intermittent or unpredictable traffic patterns and supports payload sizes up to 4 MB and processing times of 60 seconds. In serverless inference, the model scales automatically based on the incoming traffic or requests, and you are not charged while the model sits idle: it is pay-as-you-go.&nbsp;</li>



<li><strong>Batch transform </strong>is suitable for offline processing of large datasets and supports payload sizes of GBs and processing times of days.&nbsp;</li>



<li><strong>Asynchronous inference </strong>is suitable for queuing requests with large payloads and long processing times, supports payloads up to 1 GB and processing times up to one hour, and can scale down to 0 when there are no requests.</li>
</ol>
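<p>One way to encode these decision rules is a small helper function. This is a hypothetical sketch, not part of any AWS SDK; the thresholds follow the limits listed above.</p>

```python
def pick_inference_option(payload_mb, processing_time_s, traffic):
    """Pick a SageMaker-style inference option from workload traits.

    traffic: 'steady', 'intermittent', or 'offline'.
    Illustrative only -- verify current limits in the SageMaker docs.
    """
    if traffic == "offline":
        return "batch transform"    # GB-scale payloads, can run for days
    if payload_mb > 6 or processing_time_s > 60:
        return "asynchronous"       # queued, up to 1 GB / 1 hour, scales to 0
    if traffic == "intermittent" and payload_mb <= 4:
        return "serverless"         # up to 4 MB / 60 s, pay-as-you-go
    return "real-time"              # low latency, steady traffic, up to 6 MB
```

<p>For example, an offline scoring job over a large dataset maps to batch transform, while a chat frontend with steady traffic and small payloads maps to a real-time endpoint.</p>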



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<p>To better understand which option meets your requirements, look at the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=1200%2C628&#038;ssl=1" alt="Graph with choosing model deployment options " class="wp-image-36664" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-2.jpg?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Choosing model deployment options | <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Once all the above points are addressed, you can deploy the model on the cloud service of your choice.&nbsp;</p>



<p>To quickly summarize:</p>



<div id="case-study-numbered-list-block_7765a030fa910f410f6b32f262ea2365"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Set a budget            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Calculate the size of the model             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Compute the FLOPs required for the model            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Find the right GPU            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Choose the appropriate inference option             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Research the pricing offered by various cloud computing platforms            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Find the service that suits your needs and budget            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Deploy it.             </li>
            </ul>
</div>
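<p>As a toy illustration of steps 6 and 7, here is a quick price-comparison sketch. The hourly rates below are placeholders, not current prices; always check your provider's pricing page.</p>

```python
# Hypothetical on-demand GPU prices in USD per hour -- placeholders only.
GPU_HOURLY_USD = {"A100": 4.10, "V100": 3.06, "T4": 0.53}

def monthly_cost(gpu, hours_per_day=24, days=30):
    """Estimated monthly cost of keeping one GPU instance running."""
    return GPU_HOURLY_USD[gpu] * hours_per_day * days

def options_within_budget(budget_usd, **kwargs):
    """Return the GPUs whose estimated monthly cost fits the budget."""
    return [g for g in GPU_HOURLY_USD if monthly_cost(g, **kwargs) <= budget_usd]
```

<p>Running the same comparison across several cloud platforms, with their real prices filled in, is exactly the research step the list above recommends.</p>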



<h3 class="wp-block-heading" id="h-optimizing-the-model-for-serving">Optimizing the model for serving</h3>



<p>In the last section, I discussed how the size of LLMs can pose a problem for deployment. When your model is too large, strategies like model compilation, model compression, and model sharding can be used. These techniques reduce the size of the model while preserving accuracy, which allows for easier deployment and significantly reduces the associated expenses.&nbsp;</p>



<p>Let’s explore each of those in detail.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1110" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1920%2C1110&#038;ssl=1" alt="Graph showing different techniques or strategies to optimize LLMs for deployment. " class="wp-image-19605" style="width:793px;height:453px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1920%2C1110&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=768%2C444&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=200%2C116&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1536%2C888&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=220%2C127&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=120%2C69&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=160%2C93&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=300%2C173&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=480%2C278&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?resize=1020%2C590&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-9.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Different techniques or strategies to optimize LLMs for deployment</em><a href="https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf" target="_blank" rel="noreferrer noopener nofollow"> | Source</a></figcaption></figure>
</div>


<h4 class="wp-block-heading">Model compression</h4>



<p>Model compression is a technique used to optimize and transform an LLM into an efficient executable model that can be run on specialized hardware or software platforms, usually cloud services. The goal of model compression is to improve the performance and efficiency of LLM inference by leveraging hardware-specific optimizations, such as a reduced memory footprint, improved computation parallelism, and reduced latency.</p>



<p>This is a useful technique because it lets you experiment with different combinations, set performance benchmarks for various tasks, and find a price point that suits your budget. Model compression typically involves several steps:</p>



<ol class="wp-block-list">
<li><strong>Graph optimization</strong>: The high-level LLM graph is transformed and optimized using graph optimization techniques such as <strong>pruning</strong> and <strong>quantization</strong> to reduce the computational complexity and memory footprint of the model. This, in turn, makes the model small while preserving its accuracy.&nbsp;</li>



<li><strong>Hardware-specific optimization</strong>: The optimized LLM graph is further optimized to leverage hardware-specific optimizations. For instance, Amazon Sagemaker provides model serving containers for various popular ML frameworks, including <a href="https://xgboost.readthedocs.io/en/stable/" target="_blank" rel="noreferrer noopener nofollow">XGBoost</a>, <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener nofollow">scikit-learn</a>, <a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener nofollow">PyTorch</a>, <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow</a>, and <a href="https://mxnet.apache.org/versions/1.9.1/" target="_blank" rel="noreferrer noopener nofollow">Apache MXNet</a>, along with software development kits (SDKs) for each container.</li>
</ol>
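<p>To make the pruning mentioned in step 1 concrete, here is a toy magnitude-pruning sketch in pure Python. Real implementations (e.g., PyTorch's pruning utilities) operate on tensors and masks, but the idea is the same: drop the weights that contribute least.</p>

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of the weights.

    weights: flat list of floats; sparsity: fraction in [0, 1] to remove.
    Toy illustration only -- ties at the threshold may remove slightly more.
    """
    if not 0 <= sparsity <= 1:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

<p>After pruning, the zeroed weights can be stored in sparse formats and skipped at inference time, which is where the memory and compute savings come from.</p>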


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="609" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1920%2C609&#038;ssl=1" alt="Illustration of Amazon Sagemaker's workflow" class="wp-image-19607" style="width:720px;height:228px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1920%2C609&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=768%2C244&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=200%2C63&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1536%2C487&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=220%2C70&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=120%2C38&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=160%2C51&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=300%2C95&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=480%2C152&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?resize=1020%2C324&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/optimizing-infrastructure-costs-for-deploying-large-nlp-models-10.png?w=1992&amp;ssl=1 1992w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">How AWS Sagemaker Neo works | <a href="https://aws.amazon.com/sagemaker/neo/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Here are a few model compression techniques that one must know.</p>



<h5 class="wp-block-heading">Model quantization</h5>



<p>Model quantization (MQ) is a technique used to reduce the memory footprint and computation requirements of an LLM. MQ essentially replaces the model's parameters and activations with lower-precision data types. The goal of model quantization is to improve the efficiency of LLM inference by reducing the memory bandwidth requirements and exploiting hardware optimized for lower-precision arithmetic.</p>



<p>PyTorch offers model quantization out of the box. Its API can reduce the model size by a factor of 4 and cut the required memory bandwidth by 2x to 4x. As a result, inference speed can increase by 2 to 4 times, owing to the reduced memory bandwidth requirements and faster computation with int8 arithmetic. The precise speedup depends on the hardware, runtime, and model used.</p>



<p>There are several approaches to model quantization for LLMs, including:</p>



<ol class="wp-block-list">
<li><strong>Post-training quantization</strong>: In this approach, the LLM is first trained using floating-point data types, and then the weights and activations are quantized to lower-precision data types post-training. This approach is simple to implement and can achieve good accuracy with a careful selection of quantization parameters.</li>



<li><strong>Quantization-aware training</strong>: Here, the LLM is quantized during training, allowing the model to adapt to the reduced precision during training. This approach can <strong>achieve higher accuracy</strong> than post-training quantization but requires more computation during training.</li>



<li><strong>Hybrid quantization</strong>: It combines both post-training quantization and quantization-aware training, allowing the LLM to adapt to lower-precision data types during training while also applying post-training quantization to further reduce the memory footprint and computational complexity of the model.</li>
</ol>



<p>Model quantization can be challenging to implement effectively, as it requires careful consideration of the trade-offs between reduced precision and model accuracy, as well as the hardware-specific optimizations that can be leveraged with lower-precision arithmetic. However, when done correctly, model quantization can significantly improve the efficiency of LLM inference, enabling better real-time inference on large-scale datasets and edge devices.</p>



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<h5 class="wp-block-heading">Model pruning</h5>



<p>Model pruning (MP) is another technique used to reduce the size and computational complexity of an LLM, this time by removing redundant or unnecessary model parameters. The goal of MP is to improve the efficiency of LLM inference without sacrificing accuracy.</p>



<p>MP involves <strong>identifying</strong> and <strong>removing redundant</strong> or <strong>unnecessary model parameters</strong> using various pruning algorithms. These algorithms fall into two broad categories:</p>



<ol class="wp-block-list">
<li><strong>Weight pruning</strong>: In weight pruning, individual weights in the LLM are removed based on their magnitude or importance, using techniques such as magnitude-based pruning or structured pruning. Weight pruning can significantly reduce the number of model parameters and the computational complexity of the LLM, but it may require fine-tuning of the pruned model to maintain its accuracy.</li>



<li><strong>Neuron pruning</strong>: In neuron pruning, entire neurons or activations in the LLM are removed based on their importance, using techniques such as channel pruning or neuron-level pruning. Neuron pruning can also significantly reduce the number of model parameters and the computational complexity of the LLM, but it may be more difficult to implement and may require more extensive retraining or fine-tuning to maintain accuracy.</li>
</ol>
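<p>Both categories are available in PyTorch&#8217;s <code>torch.nn.utils.prune</code> module. The sketch below, with a single toy layer standing in for one weight matrix of a large model, applies magnitude-based unstructured pruning followed by structured (neuron-level) pruning along the output dimension; the pruning amounts are illustrative:</p>

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy layer standing in for one weight matrix of a large model.
layer = nn.Linear(256, 256)

# Weight pruning: zero out the 30% of weights with the smallest
# absolute value (magnitude-based, unstructured).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Neuron pruning: additionally remove 20% of entire output channels
# (rows) by their L2 norm (structured pruning along dim 0).
prune.ln_structured(layer, name="weight", amount=0.2, n=2, dim=0)

# Fold the accumulated pruning masks into the weight tensor permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")
```

<p>Note that unstructured sparsity only translates into real speedups on hardware or runtimes that exploit sparse tensors, which is one reason structured pruning is often preferred in practice.</p>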



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<p>Here are a couple of approaches to model pruning:</p>



<ol class="wp-block-list">
<li><strong>Post-training pruning</strong>: In this approach, the LLM is first trained using standard techniques and then pruned using one of the pruning algorithms. The pruned LLM is then fine-tuned to preserve its accuracy.</li>



<li><strong>Iterative pruning</strong>: Here, the model is trained using standard training techniques and then pruned iteratively over several rounds of training and pruning. This approach can achieve higher levels of pruning while preserving accuracy.</li>
</ol>



<div id="separator-block_e234c74c458748ef6017cd37d06b9ae5"
         class="block-separator block-separator--25">
</div>



<p>You can explore <a href="https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/7126bf7beed4c4c3a05bcc2dac8baa3c/pruning_tutorial.ipynb" target="_blank" rel="noreferrer noopener nofollow">this</a> Colab notebook by PyTorch to better understand MP.&nbsp;</p>



<h5 class="wp-block-heading">Model distillation</h5>



<p>Model distillation (MD) is a technique used to transfer knowledge from a large LLM, called the teacher, to a smaller, more efficient model, called the student. It is used in the context of model compression. In a nutshell, the teacher model provides guidance and feedback to the student model during training. See the image below.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=1200%2C628&#038;ssl=1" alt="Illustration of DistilBERT’s distillation process " class="wp-image-36665" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/nlp-models-infrastructure-cost-optimization-3.jpg?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">DistilBERT’s distillation process | <a href="https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>MD involves training the smaller, more efficient student model to mimic the behavior of the larger, more complex teacher LLM. The student model is trained using a combination of labeled data and the output probabilities of the larger LLM.&nbsp;</p>



<p>There are several approaches to model distillation for LLMs, including:</p>



<ol class="wp-block-list">
<li><strong>Knowledge distillation</strong>: In this approach, the smaller model is trained to mimic the output probabilities of the larger LLM using a <strong>temperature scaling factor</strong>. The temperature scaling factor is used to soften the output probabilities of the teacher model, allowing the smaller model to learn from the teacher model&#8217;s behavior more effectively.</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>Self-distillation:</strong> In this approach, the larger LLM is used to generate training examples for the smaller model by applying the teacher model to unlabeled data. The smaller model is then trained on these generated examples, allowing it to learn from the behavior of the larger LLM without requiring labeled data.</li>
</ol>



<ol class="wp-block-list" start="3">
<li><strong>Ensemble distillation</strong>: In this approach, multiple smaller models are trained to mimic the behavior of different sub-components of the larger LLM. The outputs of these smaller models are combined to form an ensemble model that approximates the behavior of the larger LLM.</li>
</ol>
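<p>The temperature-scaled objective behind knowledge distillation can be sketched in a few lines of PyTorch. Everything below (tensor shapes, the <code>alpha</code> weighting, <code>T=2.0</code>) is illustrative rather than prescriptive:</p>

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted mix of cross-entropy on hard labels and a KL term that
    pushes the student toward the teacher's temperature-softened outputs."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard-label scale
    return alpha * hard + (1 - alpha) * soft

# Example: a batch of 8 examples over a 10-class output space.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)  # teacher logits are fixed (no gradient)
labels = torch.randint(0, 10, (8,))

loss = distillation_loss(student, teacher, labels)
loss.backward()  # gradients flow only into the student
```

<p>A higher temperature softens the teacher&#8217;s distribution, exposing the relative probabilities of wrong classes (the &#8220;dark knowledge&#8221;) that the student learns from.</p>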



<h3 class="wp-block-heading" class="wp-block-heading" id="h-optimizing-hardware-and-software-requirements">Optimizing hardware and software requirements</h3>



<p>Hardware is an important consideration when deploying LLMs. Here are some steps you can take to optimize hardware performance:</p>



<ol class="wp-block-list">
<li><strong>Choose hardware that matches the LLM&#8217;s requirements</strong>: Depending on the LLM&#8217;s size and complexity, you may need hardware with a large amount of RAM, high-speed storage, or multiple GPUs to speed up inference. Opt for hardware that provides the necessary processing power, memory, and storage capacity, without overspending on irrelevant features.</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>Use specialized hardware</strong>: You can use specialized hardware such as TPUs (Tensor Processing Units) or FPGAs (Field-Programmable Gate Arrays) that are designed specifically for deep learning tasks. Similarly, accelerated linear algebra (XLA) can be leveraged at inference time.&nbsp;</li>
</ol>



<div id="separator-block_446599db61dffc738e3e8218f00af26a"
         class="block-separator block-separator--20">
</div>



<p>Although such hardware can be expensive, there are smart ways to consume it. You can opt to be charged on demand for the hardware you use. For instance, Elastic Inference from AWS SageMaker helps you lower your cost when the model is not fully utilizing a GPU instance for inference.&nbsp;</p>



<ol class="wp-block-list" start="3">
<li><strong>Use optimized libraries</strong>: You can use optimized libraries such as TensorFlow, PyTorch, or JAX that leverage hardware-specific features to speed up computation without needing additional hardware.&nbsp;</li>
</ol>



<ol class="wp-block-list" start="4">
<li><strong>Tune the batch size</strong>: Consider tuning the batch size during inference to maximize hardware utilization and improve inference speed. This inherently reduces the hardware requirement, thus cutting the cost.&nbsp;</li>
</ol>
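<p>A simple way to approach batch-size tuning is to benchmark throughput at several candidate sizes and keep the smallest batch that saturates your hardware. The toy model and batch sizes below are placeholders, not recommendations:</p>

```python
import time

import torch
import torch.nn as nn

# Stand-in for a deployed model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

def throughput(batch_size: int, n_iters: int = 20) -> float:
    """Measure inference throughput (examples/second) at a given batch size."""
    x = torch.randn(batch_size, 512)
    with torch.no_grad():
        model(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed

for bs in (1, 8, 32, 128):
    print(f"batch size {bs:>3}: {throughput(bs):,.0f} examples/s")
```

<p>On GPUs, remember to synchronize the device before reading the timer; throughput typically rises with batch size until memory or compute saturates, after which latency per request grows with no throughput gain.</p>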



<ol class="wp-block-list" start="5">
<li><strong>Monitor and optimize</strong>: Finally, monitor the LLM&#8217;s performance during deployment and optimize the hardware configuration as needed to achieve the best performance.</li>
</ol>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-cost-efficient-scalability">Cost efficient scalability</h3>



<p>Here’s how you can scale your large NLP models while keeping costs in check:</p>



<ol class="wp-block-list">
<li><strong>Choose the right inference option</strong>: pick one that scales automatically, such as serverless inference, as this reduces deployment costs when demand is low.&nbsp;</li>
</ol>



<div id="separator-block_446599db61dffc738e3e8218f00af26a"
         class="block-separator block-separator--20">
</div>



<p>A rigid architecture occupies the same amount of memory even when demand is low, so deployment and maintenance costs stay constant. A scalable architecture, by contrast, can scale horizontally or vertically to accommodate an increased workload and return to its original configuration when the model is dormant. This reduces maintenance costs whenever the additional nodes are not in use.&nbsp;</p>



<ol class="wp-block-list" start="2">
<li><strong>Optimize inference performance</strong>, by using hardware acceleration, such as GPUs or TPUs, and by optimizing the inference code.</li>
</ol>



<ol class="wp-block-list" start="3">
<li>Amazon&#8217;s Elastic Inference is yet another great option, as it can reduce costs by up to 75% because you no longer pay for a full dedicated GPU instance during inference. For more on Elastic Inference, read this article <a href="https://www.projectpro.io/recipes/introduction-amazon-elastic-inference-and-its-use-cases" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</li>
</ol>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-cutting-energy-costs">Cutting energy costs</h3>



<ol class="wp-block-list">
<li><strong>Choose an energy-efficient cloud infrastructure</strong> that uses renewable energy sources or carbon offsets to reduce the carbon footprint of its data centers. You can also consider choosing energy-efficient GPUs. Check out <a href="https://www.wired.com/story/amazon-google-microsoft-green-clouds-and-hyperscale-data-centers/" target="_blank" rel="noreferrer noopener nofollow">this</a> article by Wired to understand more.&nbsp;</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>Use caching</strong>, which reduces the computational requirements of LLM inference by storing frequently requested responses in memory. This can significantly cut the number of computations needed to serve user requests. Caching also helps with bandwidth: <strong>frequently accessed data kept in cache memory can be served quickly without consuming additional bandwidth</strong>, so you can avoid provisioning extra storage and memory devices.&nbsp;</li>
</ol>
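<p>At its simplest, response caching can be a memoized wrapper around the inference call. In this sketch, <code>run_model</code> is a hypothetical stand-in for an expensive LLM call, and the counter just demonstrates that repeated prompts never reach the model:</p>

```python
from functools import lru_cache

model_calls = {"n": 0}

def run_model(prompt: str) -> str:
    """Hypothetical stand-in for an expensive LLM inference call."""
    model_calls["n"] += 1
    return f"response to: {prompt}"

@lru_cache(maxsize=10_000)
def cached_inference(prompt: str) -> str:
    # Identical prompts are served from memory; only misses hit the model.
    return run_model(prompt)

cached_inference("What is MLOps?")  # computed by the model
cached_inference("What is MLOps?")  # served from the cache
print(model_calls["n"])  # → 1
```

<p>In production you would more likely use an external cache such as Redis with an expiry policy, and perhaps normalize or embed prompts to raise hit rates, but the cost-saving principle is the same.</p>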



<h2 class="wp-block-heading" class="wp-block-heading" id="h-deploying-large-nlp-models-other-useful-tips">Deploying large NLP models: other useful tips</h2>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-estimating-the-nlp-model-size-before-training">Estimating the NLP model size before training</h3>



<p>Keeping your model size in check could in turn keep your infrastructure costs in check. Here are a few things you can keep in mind while getting your large NLP model ready.</p>



<ol class="wp-block-list">
<li><strong>Consider the available resources</strong>: The size of the LLM for deployment should take into account the available hardware resources, including memory, processing power, and storage capacity. The LLM&#8217;s size should be within the limits of the available resources to ensure optimal performance.</li>



<li><strong>Fine-tuning: </strong>Choose a model with optimal accuracy and then fine-tune it on a task-specific dataset. This step will increase the efficiency of the LLM and keep its size from spiraling out of control.</li>



<li><strong>Consider the tradeoff between size and performance</strong>: The LLM&#8217;s size should be selected based on the tradeoff between size and performance. A larger model size may provide better performance but may also require more resources and time. Therefore, it is essential to find the optimal balance between size and performance.</li>
</ol>
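<p>A quick back-of-the-envelope check before training is to count parameters and multiply by the bytes per parameter at the precision you plan to deploy. The toy architecture below is a placeholder for your own model definition:</p>

```python
import torch.nn as nn

def parameter_memory_gb(model: nn.Module, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the model's parameters."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param / 1e9

# Placeholder architecture; substitute your own before training.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

for precision, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{precision}: {parameter_memory_gb(model, nbytes):.3f} GB")
```

<p>Keep in mind this covers parameters only; activations, optimizer state during training, and key-value caches at inference time all add to the real footprint.</p>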



<h3 class="wp-block-heading" class="wp-block-heading" id="h-use-a-lightweight-deployment-framework">Use a lightweight deployment framework</h3>



<p>Many LLMs are too large to be deployed directly to a production environment. Consider using a lightweight deployment framework like <strong>TensorFlow Serving</strong> or <strong>TorchServe</strong> that can host the model and serve predictions over a network. These frameworks can help reduce the overhead of loading and running the model on the server thereby reducing the deployment and infrastructure costs.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-post-deployment-model-monitoring">Post-deployment model monitoring</h3>



<p>Model monitoring helps optimize the infrastructure cost of deployment by providing insights into the performance and resource utilization of deployed models. By monitoring the resource consumption of deployed models, such as CPU, memory, and network usage, you can identify areas that can help you optimize your infrastructure usage to reduce costs.&nbsp;</p>



<ul class="wp-block-list">
<li>Monitoring can identify <strong>underutilized resources</strong>, allowing you to scale back on <strong>unused resources</strong>, and reducing infrastructure costs.&nbsp;</li>



<li>Monitoring can identify resource-intensive operations or models, enabling organizations to optimize their architecture or <strong>refactor</strong> the model to be more efficient. This can also lead to cost savings.&nbsp;</li>
</ul>



<section id="blog-intext-cta-block_2dcca4c37b6d0bc236d9697d9083fe96" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" class="block-blog-intext-cta__header" id="h-check-also">Check also</h3>
    
            <p><a href="/blog/tips-to-train-nlp-models" target="_blank" rel="noopener">Tips and Tricks to Train State-Of-The-Art NLP Models</a></p>
    
    </section>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-key-takeaways">Key takeaways</h2>



<div id="case-study-numbered-list-block_aba3fea0529ec7644a0f28bf7b6d1220"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Set a budget.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Calculate the size of the model.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Use model compression techniques like pruning, quantization, and distillation to decrease the memory and computation required for deployment.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Utilize cloud computing services like AWS, Google Cloud, and Microsoft Azure for cost-effective solutions with scalability options.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Leverage serverless computing for a pay-per-use model, lower operational overhead, and auto-scaling.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Optimize hardware acceleration, such as GPUs, to speed up model training and inference.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Regularly monitor resource usage to identify areas where costs can be reduced, such as underutilized resources or overprovisioned instances.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Continuously optimize your model size and hardware for cost-efficient inference.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">9</span>
                Keep software and security patches up to date to ensure safety.            </li>
            </ul>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>In this article, we explored the challenges we face when deploying an LLM and the inflated infrastructural cost associated with them. Simultaneously, we also addressed each of these difficulties with the necessary techniques and solutions.&nbsp;</p>



<p>Out of all the solutions we discussed, the two I recommend most for reducing infrastructure costs during deployment are <strong>elastic</strong> and <strong>serverless</strong> inference. Model compression is valid and useful, but when demand is high, even a smaller model can consume resources like a larger one, driving up infrastructure costs. We therefore need a scalable, pay-per-demand approach, and that&#8217;s where these inference services come in handy.&nbsp;</p>



<p>It goes without saying that my recommendation might not be the most ideal for your use case, and you can pick any of these approaches depending on the kind of problems you are dealing with. I hope what we discussed here will go a long way in helping you cut down your deployment infrastructure costs for your large NLP models.&nbsp;</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://research.aimultiple.com/large-language-model-training/" target="_blank" rel="noreferrer noopener nofollow">Large Language Model Training in 2023</a></li>



<li><a href="https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf" target="_blank" rel="noreferrer noopener nofollow">https://d1.awsstatic.com/events/Summits/reinvent2022/AIM405_Train-and-deploy-large-language-models-on-Amazon-SageMaker.pdf</a></li>



<li><a href="https://research.aimultiple.com/ai-chip-makers/" target="_blank" rel="noreferrer noopener nofollow">Top 10 AI Chip Makers of 2023: In-depth Guide&nbsp;</a></li>



<li><a href="https://www.nvidia.com/en-us/data-center/dgx-a100/" target="_blank" rel="noreferrer noopener nofollow">https://www.nvidia.com/en-us/data-center/dgx-a100/</a></li>



<li><a href="https://arxiv.org/pdf/2302.13971.pdf" target="_blank" rel="noreferrer noopener nofollow">LLaMA: A foundational, 65-billion-parameter large language model</a></li>



<li><a href="https://arxiv.org/pdf/2203.15556.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/2203.15556.pdf</a></li>



<li><a href="https://huggingface.co/docs/transformers/model_doc" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/docs/transformers/model_doc</a></li>



<li><a href="https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast</a></li>



<li><a href="https://sunniesuhyoung.github.io/files/LLM.pdf" target="_blank" rel="noreferrer noopener nofollow">https://sunniesuhyoung.github.io/files/LLM.pdf</a></li>



<li><a href="https://twitter.com/tomgoldsteincs/status/1600196995389366274?lang=en" target="_blank" rel="noreferrer noopener nofollow">https://twitter.com/tomgoldsteincs/status/1600196995389366274?lang=en</a></li>



<li><a href="https://arxiv.org/pdf/1910.02054.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/1910.02054.pdf</a></li>



<li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html</a></li>



<li>Jaime Sevilla et al. (2022), &#8220;Estimating Training Compute of Deep Learning Models&#8221;. Published online at<a href="http://epochai.org/" target="_blank" rel="noreferrer noopener nofollow"> epochai.org</a>. Retrieved from: &#8216;<a href="https://epochai.org/blog/estimating-training-compute" target="_blank" rel="noreferrer noopener nofollow">https://epochai.org/blog/estimating-training-compute</a>&#8216; [online resource]</li>



<li><a href="https://arxiv.org/abs/2001.08361" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/abs/2001.08361</a></li>



<li><a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf" target="_blank" rel="noreferrer noopener nofollow">https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf</a></li>



<li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html" target="_blank" rel="noreferrer noopener nofollow">https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html</a></li>



<li><a href="https://aws.amazon.com/sagemaker/neo/" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/sagemaker/neo/</a></li>



<li><a href="https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/7126bf7beed4c4c3a05bcc2dac8baa3c/pruning_tutorial.ipynb" target="_blank" rel="noreferrer noopener nofollow">https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/7126bf7beed4c4c3a05bcc2dac8baa3c/pruning_tutorial.ipynb</a></li>



<li><a href="https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker/" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/blogs/machine-learning/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker/</a></li>



<li><a href="https://openai.com/blog/improving-language-model-behavior/" target="_blank" rel="noreferrer noopener nofollow">Improving Language Model Behavior by Training on a Curated Dataset</a></li>



<li><a href="https://towardsdatascience.com/how-to-deploy-large-size-deep-learning-models-into-production-66b851d17f33" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/how-to-deploy-large-size-deep-learning-models-into-production-66b851d17f33</a></li>



<li><a href="https://huggingface.co/blog/large-language-models" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/blog/large-language-models</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/</a></li>



<li><a href="https://openreview.net/pdf?id=NiEtU7blzN" target="_blank" rel="noreferrer noopener nofollow">Large Language Models Can Self-Improve</a></li>



<li><a href="https://spot.io/resources/cloud-cost/cloud-cost-optimization-15-ways-to-optimize-your-cloud/" target="_blank" rel="noreferrer noopener nofollow">https://spot.io/resources/cloud-cost/cloud-cost-optimization-15-ways-to-optimize-your-cloud/</a></li>



<li><a href="https://dataintegration.info/choose-the-best-ai-accelerator-and-model-compilation-for-computer-vision-inference-with-amazon-sagemaker" target="_blank" rel="noreferrer noopener nofollow">https://dataintegration.info/choose-the-best-ai-accelerator-and-model-compilation-for-computer-vision-inference-with-amazon-sagemaker</a></li>



<li><a href="https://medium.com/data-science-at-microsoft/model-compression-and-optimization-why-think-bigger-when-you-can-think-smaller-216ec096f68b" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/data-science-at-microsoft/model-compression-and-optimization-why-think-bigger-when-you-can-think-smaller-216ec096f68b</a></li>



<li><a href="https://medium.com/picsellia/how-to-optimize-computer-vision-models-for-edge-devices-851b20f7cf03" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/picsellia/how-to-optimize-computer-vision-models-for-edge-devices-851b20f7cf03</a></li>



<li><a href="https://huggingface.co/docs/transformers/v4.17.0/en/parallelism#which-strategy-to-use-when" target="_blank" rel="noreferrer noopener nofollow">https://huggingface.co/docs/transformers/v4.17.0/en/parallelism#which-strategy-to-use-when</a></li>



<li><a href="https://medium.com/@mlblogging.k/9-libraries-for-parallel-distributed-training-inference-of-deep-learning-models-5faa86199c1f" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/@mlblogging.k/9-libraries-for-parallel-distributed-training-inference-of-deep-learning-models-5faa86199c1f</a></li>



<li><a href="https://towardsdatascience.com/how-to-estimate-and-reduce-the-carbon-footprint-of-machine-learning-models-49f24510880" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/how-to-estimate-and-reduce-the-carbon-footprint-of-machine-learning-models-49f24510880</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">19513</post-id>	</item>
		<item>
		<title>Argo vs Airflow vs Prefect: How Are They Different</title>
		<link>https://neptune.ai/blog/argo-vs-airflow-vs-prefect-differences</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 04 Nov 2022 09:04:36 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.staginglab.eu/?p=11822</guid>

					<description><![CDATA[We live at a stage where ML and DL software are everywhere. New startups and various other companies are adapting and integrating AI systems into their new and already existing workflows to be much more productive and efficient. These systems reduce manual tasks and deliver smart and intelligent solutions. Although they are quite proficient in&#8230;]]></description>
										<content:encoded><![CDATA[
<p>We live in an age where ML and DL software are everywhere. New startups and established companies alike are adopting and integrating AI systems into their new and already existing workflows to become more productive and efficient. These systems reduce manual tasks and deliver smart and intelligent solutions. Although they are quite proficient in what they do, all AI systems have different modules that must be brought together to build an operational and effective product.&nbsp;</p>



<p>These systems can be broadly divided into five phases, keeping in mind that these phases contain various additional and repetitive tasks:</p>



<div id="case-study-numbered-list-block_8fb9f2e318eeb666ce5306ddd4305ddc"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data collection             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Feature engineering            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Modeling (which includes training, validation, testing, and inference)            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Deployment             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Monitoring            </li>
            </ul>
</div>



<p>Executing these phases individually can take a lot of time and continuous human effort. These phases must be synchronized and sequentially orchestrated in order to get the best out of them. This can be achieved by <strong>task orchestration tools</strong> that enable ML practitioners to effortlessly bring together and orchestrate different phases of an AI system.</p>



<div id="separator-block_1a2475869ea1cf43bc2807deb583e3aa"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-1-924x1024.png?resize=693%2C768&#038;ssl=1" alt="Phases of AI systems" class="wp-image-72252" width="693" height="768"/><figcaption class="wp-element-caption"><em>Phases of AI systems | <a href="https://www.datarevenue.com/en-blog/what-we-are-loving-about-prefect" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<div id="separator-block_de1f5f20c22a4cc1abab8756d11e1439"
         class="block-separator block-separator--10">
</div>



<p>In this article, we will explore:</p>



<div id="case-study-numbered-list-block_91410d4b859b53150f683e123f1266c2"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                What task orchestration tools are            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Three tools that help ML practitioners orchestrate their workflows            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                How the three tools compare            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Which tool to use and when?            </li>
            </ul>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-task-orchestration-tools-what-they-are-and-how-are-they-useful">Task orchestration tools: What they are and how are they useful?</h2>



<p>Orchestration tools organize the various tasks in an MLOps pipeline and execute them in the right order, coordinating many tasks at any given time. One of their key properties is the distribution of tasks. Most of these tools rely on a DAG, or Directed Acyclic Graph &#8211; a term you will come across often in this article. A DAG is a graph representation of the tasks that need to be executed and the dependencies between them.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-2-1024x928.png?resize=768%2C696&#038;ssl=1" alt="Explanation of DAG" class="wp-image-72253" width="768" height="696"/><figcaption class="wp-element-caption"><em>Graphic explanation of DAG | <a href="https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow" target="_blank" rel="noreferrer noopener nofollow">Source</a>&nbsp;</em></figcaption></figure>
</div>


<div id="separator-block_de1f5f20c22a4cc1abab8756d11e1439"
         class="block-separator block-separator--10">
</div>



<p>A DAG allows independent tasks in a pipeline to be distributed to different modules and processed in parallel, which improves efficiency (see the image above). It also ensures that dependent tasks are arranged in the correct sequence, so they execute properly and deliver timely results.</p>
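<p>This behavior can be demonstrated with nothing more than the Python standard library. The sketch below uses <code>graphlib.TopologicalSorter</code> (Python 3.9+) to compute "waves" of tasks whose dependencies are all satisfied; each wave could safely run in parallel. The task names are illustrative, not from any specific tool.</p>

```python
# Sketch: how a DAG lets an orchestrator run independent tasks in
# parallel "waves" while respecting dependencies. Standard library only;
# the task names are illustrative placeholders.
from graphlib import TopologicalSorter

# Mapping: task -> set of tasks it depends on
dag = {
    "preprocess_a": {"ingest"},
    "preprocess_b": {"ingest"},
    "train": {"preprocess_a", "preprocess_b"},
    "evaluate": {"train"},
}

ts = TopologicalSorter(dag)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies are done
    waves.append(ready)             # each wave could run in parallel
    ts.done(*ready)

# waves == [["ingest"], ["preprocess_a", "preprocess_b"],
#           ["train"], ["evaluate"]]
```

<p>Note that <code>preprocess_a</code> and <code>preprocess_b</code> land in the same wave: neither depends on the other, so a scheduler is free to dispatch them to different workers at the same time.</p>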



<p>Another important property of these tools is adaptability to agile environments. It allows ML practitioners to plug in other tools for monitoring, deployment, analysis, preprocessing, testing, inference, and so on. An orchestration tool that can coordinate tasks spanning many different tools is generally a good choice. This is not always the case, though: some tools are strictly confined to their native environments, which does not bode well for users who want to integrate third-party applications.&nbsp;</p>



<p>In this article, we will explore three tools – <a href="https://argoproj.github.io/" target="_blank" rel="noreferrer noopener nofollow">Argo</a>, <a href="https://airflow.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Airflow</a>, and <a href="https://www.prefect.io/" target="_blank" rel="noreferrer noopener nofollow">Prefect</a>, that incorporate these two properties and various others as well.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-tldr-comparison-table">TL;DR comparison table&nbsp;</h2>



<p>Here is a table inspired by <a href="https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b" target="_blank" rel="noreferrer noopener nofollow">Ian McGraw&#8217;s article</a>, which provides an overview of what these tools offer for orchestration and how they differ from each other in these aspects.</p>



<div id="medium-table-block_d7ed176f201196169251f79293cd9a44"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Features                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Argo                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Airflow                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Prefect                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>1.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Fault-tolerant scheduling</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>2.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>UI Support</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>3.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Workflow definition language</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>YAML</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Python</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Python</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>4.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>3rd party integration</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Since Argo is container-based it doesn’t come with pre-installed 3rd party systems.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports various 3rd party integrations</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports various 3rd party integrations</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>5.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Workflows</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Dynamic workflow</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Static workflow</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Dynamic workflow</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>6.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Accessibility</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Hybrid (open-source and subscription-based)</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>7.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parametrized workflows</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Has an extensive parameter-passing syntax</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Does not have a mechanism to pass parameters</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports parameters as first-class objects</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>8.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Kubernetes support</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>9.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Scalability</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Highly parallel</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Horizontally scalable</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parallel when using Kubernetes</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>10.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Community Support</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Large</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Large</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Medium</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>11.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>State storage</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>All state is stored within the Kubernetes workflow</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Postgres DB</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Postgres DB</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>12.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Ease of deployment</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Medium</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Medium</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Difficult</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>13.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Event-driven workflows</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>14.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Scripts in DAG definition</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Argo uses text scripts to pass in containers.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Airflow uses Python-based DAG definition language.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Prefect uses a functional, Python-based flow API.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>15.</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Use Cases</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>&#8211; CI/CD<br />&#8211; Data processing<br />&#8211; Infrastructure automation<br />&#8211; Machine learning<br />&#8211; Stream processing</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>&#8211; ELT<br />&#8211; ML workflow<br />&#8211; ML automation</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>&#8211; Automating data workflows (ELT)<br />&#8211; ML workflow and orchestration<br />&#8211; CI/CD</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>Now let’s explore each of these tools in more detail under three primary categories:&nbsp;</p>



<div id="case-study-numbered-list-block_dddd3f0beb5ac03bbd1aab5378b1fee1"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Core concepts            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Features they offer            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Why use it?            </li>
            </ul>
</div>



<h2 class="wp-block-heading" id="h-core-concepts">Core concepts</h2>



<p>All three tools are built on a set of concepts or principles around which they function. Argo, for instance, is built around two concepts: <strong>Workflow </strong>and<strong> Templates</strong>, which form the backbone of its system. Likewise, Airflow is built around the <strong>Webserver, Scheduler, Executor, </strong>and<strong> Database,</strong> while Prefect is built around <strong>Flows </strong>and<strong> Tasks</strong>. It is important to understand what these concepts mean, what they offer, and how they benefit us.</p>



<p>Before going into the details, here is a brief summary of the concepts.&nbsp;</p>



<div id="medium-table-block_1bef54a85a304ee96f66127d4d1b3d93"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Properties of the Concepts                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Argo</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It has two concepts: <strong>Workflow</strong> and <strong>Templates</strong>. Essentially, the Workflow is the YAML config file. It provides structure and robustness, as workflows are managed using DAGs. Templates, on the other hand, are the functions that need to be executed. <br data-rich-text-line-break="true" />Workflows are both static and dynamic, meaning that you can modify steps on the go.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Airflow</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It has four concepts: Webserver, Scheduler, Executor, and Database. They divide the whole process into segments and act as the major components that automate it. This makes the workflow efficient: because each component relies on the others, it is easy to find and report bugs and errors, and monitoring is straightforward.<br />
Although Airflow uses DAGs, they are only static, not dynamic.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Prefect</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It leverages two concepts: Flows and Tasks. In Prefect, the DAG is defined as a flow object, created using Python, which provides the flexibility and robustness to define complex pipelines.<br />
Tasks are like templates in Argo: they define a specific function that needs to be executed, again in Python.<br />
Because Prefect uses Python as its main programming language, it is easy to work with.</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p class="has-text-align-center has-small-font-size"><em>Summary of the concepts</em></p>



<p>Now, let’s understand these concepts in detail.&nbsp;</p>



<h3 class="wp-block-heading" id="h-argo">Argo&nbsp;</h3>



<p>Argo uses two core concepts:</p>



<ol class="wp-block-list">
<li>Workflow</li>



<li>Templates</li>
</ol>



<h4 class="wp-block-heading">Workflow</h4>



<p>In Argo, the Workflow is the most integral component of the whole system. It has two important functions:&nbsp;</p>



<ol class="wp-block-list">
<li>It defines the tasks that need to be executed.</li>



<li>It stores the state of the tasks, which means that it serves as both a static and a dynamic object.</li>
</ol>



<p>The Workflow is defined in the workflow.spec section of the YAML configuration file, which consists of a list of <strong>templates</strong> and an <strong>entrypoint</strong>. The Workflow can be considered a file that hosts different templates, each defining a function that needs to be executed.&nbsp;</p>



<p>As mentioned earlier, Argo leverages the <strong>Kubernetes</strong> engine for workflow synchronization, and the configuration file uses the same syntax as Kubernetes. The Workflow YAML file has the following dictionaries or objects:</p>



<ol class="wp-block-list">
<li>apiVersion: This is where you define the version of the API that the object uses.</li>



<li>kind: It defines the type of Kubernetes object that needs to be created. For instance, to deploy an app you might use <strong>Deployment</strong>; at other times you might use Service. In this case, we use <strong>Workflow</strong>.</li>



<li>metadata: It enables us to define unique properties for that object, such as a name or UUID.&nbsp;</li>



<li>spec: It enables us to define specifications concerning the Workflow, namely the entrypoint and the templates.&nbsp;</li>



<li>templates: This is where we define the tasks. A template can specify the Docker image to run and various other scripts.&nbsp;</li>
</ol>
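<p>Putting these fields together, a minimal Workflow manifest could look as follows (a sketch following the Argo documentation; the template name and image are placeholders):</p>

```yaml
apiVersion: argoproj.io/v1alpha1   # the Argo Workflows API version
kind: Workflow                     # the Kubernetes object type to create
metadata:
  generateName: hello-world-       # unique properties of this object
spec:
  entrypoint: whalesay             # which template to run first
  templates:
    - name: whalesay               # the task itself
      container:
        image: docker/whalesay
        command: [cowsay]
        args: ["hello world"]
```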



<h4 class="wp-block-heading">Templates&nbsp;</h4>



<p>In Argo, there are two major types of templates, which are sub-classified into six types in total. The two major types are <strong>definitions </strong>and <strong>invocators.&nbsp;</strong></p>



<h5 class="wp-block-heading">Definition</h5>



<p>This template, as the name suggests, defines the type of task to run in a Docker container. Definition templates are divided into four categories:</p>



<ol class="wp-block-list">
<li><strong>Container</strong>: It enables users to schedule a workflow step in a container. Since applications in Kubernetes are containerized, the container is defined with the same syntax as in a Kubernetes manifest. It is also one of the most used templates.</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/</span>
- name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: [<span class="hljs-string" style="color: rgb(221, 17, 68);">"hello world"</span>]</pre>



<ol class="wp-block-list" start="2">
<li><strong>Script</strong>: If you want a wrapper around a container, the script template is a good fit. It is similar in structure to the container template but adds a source field, which allows you to define a script in place. You can define any variables or commands based on your requirements. Once defined, the script is saved into a file and executed for you, and its output is exported as an Argo variable.</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/</span>
 - name: gen-random-int
    script:
      image: python:alpine3<span class="hljs-number" style="color: teal;">.6</span>
      command: [python]
      source: <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> random

        	i = random.randint(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">100</span>)
        	  	print(i)
</pre>



<ol class="wp-block-list" start="3">
<li><strong>Resource</strong>: It allows you to perform operations like get, create, apply, and delete directly on the Kubernetes cluster.</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/</span>
- name: k8s-owner-reference
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          generateName: owned-eg-
        data:
          some: value</pre>



<ol class="wp-block-list" start="4">
<li><strong>Suspend</strong>: It introduces a time dimension to the workflow: it can suspend execution for a defined duration or until the workflow is resumed manually.&nbsp;</li>
</ol>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#source: https://argoproj.github.io/argo-workflows/workflow-concepts/ </span>
 - name: delay
    suspend:
      duration: <span class="hljs-string" style="color: rgb(221, 17, 68);">"20s"</span></pre>



<h5 class="wp-block-heading">Invocators</h5>



<p>Once the templates are defined, they can be invoked or called on demand by other templates, called invocators. Invocators act as controller templates that control the execution of the defined templates.&nbsp;</p>



<p>There are two types of invocator templates:</p>



<ol class="wp-block-list">
<li><strong>Steps: </strong>It allows you to define tasks as a sequence of steps, directly in the workflow&#8217;s YAML.&nbsp;</li>



<li><strong>Directed acyclic graph</strong>: Argo enables its users to manage steps with multiple dependencies in their workflow. This allows parallel execution of different workflows in their respective containers. These types of workflows are managed using a directed acyclic graph or DAG. For instance, if you are working on image segmentation and generation for medical purposes then you can create a pipeline that:
<ul class="wp-block-list">
<li>Processes the images.</li>



<li>Distributes the images (or dataset) to the respective DL models for image segmentation and generation pipeline.</li>



<li>Continuously predicts segmentation masks and updates the dataset storage with new images after proper inspection.&nbsp;</li>
</ul>
</li>
</ol>
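<p>Such a pipeline could be expressed with Argo&#8217;s <code>dag</code> template roughly as follows (a sketch; the template and task names are hypothetical):</p>

```yaml
- name: medical-imaging-pipeline
  dag:
    tasks:
      - name: preprocess
        template: process-images
      - name: segment
        dependencies: [preprocess]        # waits for preprocess
        template: segmentation-model
      - name: generate
        dependencies: [preprocess]        # runs in parallel with segment
        template: generation-model
      - name: update-dataset
        dependencies: [segment, generate] # waits for both models
        template: update-storage
```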



<h3 class="wp-block-heading" id="h-airflow">Airflow</h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="327" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-3-1024x327.png?resize=1024%2C327&#038;ssl=1" alt="Feature Pipeline- Airflow" class="wp-image-72254"/><figcaption class="wp-element-caption"><em>Feature Pipeline | <a href="https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Apache Airflow consists of four main components:</p>



<ol class="wp-block-list">
<li>Webserver</li>



<li>Scheduler</li>



<li>Executor</li>



<li>Database</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-4.png?resize=744%2C484&#038;ssl=1" alt="Main components of Apache Airflow" class="wp-image-72255" width="744" height="484"/><figcaption class="wp-element-caption"><em> Four main components of Apache Airflow | <a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Webserver</h4>



<p>It provides the user with a UI for inspecting, triggering, and debugging all DAGs and tasks, and essentially serves as the entry point to Airflow. The Webserver leverages Python-Flask to manage all the requests made by the user. It also renders the state metadata from the database and displays it in the UI.</p>



<h4 class="wp-block-heading">Scheduler</h4>



<p>It monitors and manages all the tasks and DAGs. It examines the state of the tasks by querying the database to decide the order in which tasks need to be executed. The scheduler then resolves dependencies and submits task instances to the executor once the dependencies are taken care of.</p>



<h4 class="wp-block-heading">Executor</h4>



<p>It runs the task instances which are ready to run. It executes all the tasks as scheduled by the scheduler. There are four types of executors:</p>



<ol class="wp-block-list">
<li>Sequential Executor</li>



<li>Local Executor</li>



<li>Celery Executor</li>



<li>Kubernetes Executor</li>
</ol>
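<p>The executor is a deployment-level choice. As a configuration sketch, it is selected in the <code>[core]</code> section of <code>airflow.cfg</code> (it can also be set via the <code>AIRFLOW__CORE__EXECUTOR</code> environment variable):</p>

```ini
[core]
# one of: SequentialExecutor, LocalExecutor, CeleryExecutor, KubernetesExecutor
executor = LocalExecutor
```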



<h4 class="wp-block-heading">Metadata Database</h4>



<p>It stores the state of the tasks and DAGs, which the scheduler uses for proper scheduling of task instances. It is worth noting that Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to store this information.&nbsp;</p>



<h3 class="wp-block-heading" id="h-prefect">Prefect</h3>



<p>Prefect uses two core concepts:&nbsp;</p>



<ol class="wp-block-list">
<li>Flows</li>



<li>Tasks</li>
</ol>



<h4 class="wp-block-heading">Flows</h4>



<p>In Prefect, flows are Python objects that can be interacted with; the DAG is defined as a flow object. See the image below.&nbsp;</p>



<div id="separator-block_1a2475869ea1cf43bc2807deb583e3aa"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="172" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-5-1024x172.png?resize=1024%2C172&#038;ssl=1" alt="DAG defined as flow objects " class="wp-image-72256"/><figcaption class="wp-element-caption"><em> DAG defined as flow objects | <a href="https://spell.ml/blog/orchestrating-spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<div id="separator-block_1a31e0f7eaf92c77e2ac2ec4e92ac849"
         class="block-separator block-separator--5">
</div>



<p>Flow can be imported and used as a decorator, @flow, on any given function. This transforms the existing function into a Prefect flow function, with the following advantages:</p>



<ul class="wp-block-list">
<li>The function can be monitored and governed as it is now reported to the API.</li>



<li>The activity of the function can be tracked and displayed in the UI.</li>



<li>Inputs given to the function can be validated.</li>



<li>Various workflow features like retries, distributed execution et cetera can be added to the function.<em>&nbsp;</em></li>



<li>Timeouts can be enforced to prevent unintentionally long-running workflows.&nbsp;</li>
</ul>



<p>Here is a code block depicting the implementation of a flow object.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Source: https://github.com/PrefectHQ/prefect</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> prefect <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> flow

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">@flow(name="GitHub Stars")</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">github_stars</span><span class="hljs-params">(repos: List[str])</span>:</span>
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> repo <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> repos:
        get_stars(repo)
</pre>



<p>In the code above, the function has been transformed into a flow named “GitHub Stars”. The function is now governed by Prefect&#8217;s orchestration rules.&nbsp;</p>



<p>Note that all workflows must be defined within a flow function, and all tasks must be called from within a flow. Keep in mind that when a flow is executed, the execution is known as a <em>flow run</em>.&nbsp;</p>



<h4 class="wp-block-heading">Tasks</h4>



<p>Tasks can be defined as specific pieces of work that need to be executed, for instance, adding two numbers. In other words, a task takes an input, performs an operation, and yields an output. Like flow, task can be imported and used as a decorator, @task, on a function. This wraps the function within the Prefect workflow, with advantages similar to those of a flow. For instance, it can automatically log information about task runs, such as runtime, tags, and final state.&nbsp;</p>



<p>The code below demonstrates how a task is defined:&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Source: https://github.com/PrefectHQ/prefect</span>

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">@task(retries=3)</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">get_stars</span><span class="hljs-params">(repo: str)</span>:</span>
    url = f<span class="hljs-string" style="color: rgb(221, 17, 68);">"https://api.github.com/repos/{repo}"</span>
    count = httpx.get(url).json()[<span class="hljs-string" style="color: rgb(221, 17, 68);">"stargazers_count"</span>]
    print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"{repo} has {count} stars!"</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># run the flow!</span>
github_stars([<span class="hljs-string" style="color: rgb(221, 17, 68);">"PrefectHQ/Prefect"</span>])</pre>



<p>To sum up, the flow looks for any task defined within its body and, once found, creates a computational graph in the same order. It then creates dependencies between tasks whenever the output of one task instance is used as the input of another.&nbsp;</p>
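<p>Prefect&#8217;s engine does this bookkeeping internally. Purely as an illustration (a stdlib-only sketch whose decorator names mirror Prefect&#8217;s API; this is not Prefect code), here is how a flow can discover dependencies from task inputs and execute tasks in order:</p>

```python
# Illustration only: a miniature, stdlib-only model of how a flow can
# discover dependencies between tasks and run them in order. The
# @task/@flow names mirror Prefect's API, but this is NOT Prefect code.
from functools import wraps


class TaskRun:
    """One recorded call of a task: its function, inputs, and upstream runs."""

    def __init__(self, fn, args):
        self.fn = fn
        self.args = args
        # Any argument that is itself a TaskRun is an upstream dependency.
        self.upstream = [a for a in args if isinstance(a, TaskRun)]
        self.result = None

    def run(self):
        if self.result is None:
            # Resolve upstream results first (dependency order), then execute.
            resolved = [a.run() if isinstance(a, TaskRun) else a for a in self.args]
            self.result = self.fn(*resolved)
        return self.result


def task(fn):
    """Calling a task does not execute it; it returns a graph node instead."""
    @wraps(fn)
    def wrapper(*args):
        return TaskRun(fn, args)
    return wrapper


def flow(fn):
    """Running the flow builds the graph, then executes the final node."""
    @wraps(fn)
    def wrapper(*args):
        return fn(*args).run()
    return wrapper


@task
def add(x, y):
    return x + y


@task
def double(x):
    return 2 * x


@flow
def pipeline(a, b):
    s = add(a, b)      # no upstream tasks
    return double(s)   # depends on the output of add


print(pipeline(3, 4))  # prints 14: double() runs only after add()
```

<p>Calling a decorated task builds a graph node instead of running immediately; the flow then resolves each node&#8217;s upstream results before executing it, which is the same ordering principle Prefect applies.</p>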



<h2 class="wp-block-heading" id="h-features">Features</h2>



<p>All three tools provide more or less the same features, but some implement them better than others, and the choice also comes down to what users are most comfortable with. Just like in the previous section, let’s begin with a summary of the features.&nbsp;</p>



<div id="medium-table-block_93f662c2568885d9ff64a5e751dabbfd"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Argo                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Airflow                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Prefect                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>User Interface</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>It has a complete view of the workflow. You can define workflows straight from the UI.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Workflow is very well-maintained as it provides a number of different views.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Prefect is similar to Airflow.</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Deployment Style </strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Supports only Kubernetes-based environments, such as managed Kubernetes services on AWS and other clouds.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Supports Kubernetes-supported environment as well as other third-party environments.</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Same as Airflow</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Scalability</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parallel</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Horizontal</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Parallel</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Accessibility</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Open-source</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Open-source and subscription-based</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Flexibility</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Rigid</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
<p>Rigid and complicated</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Flexible</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p class="has-text-align-center has-small-font-size"><em>Comparison of the features</em></p>



<p>Let’s start this section by exploring the User Interface.&nbsp;</p>



<h3 class="wp-block-heading" id="h-user-interface">User Interface</h3>



<h4 class="wp-block-heading">Argo</h4>



<p>For ease of use, Argo Workflows provides a web-based UI for defining workflows and templates. The UI serves several purposes, such as:</p>



<ul class="wp-block-list">
<li>Artifact visualization&nbsp;</li>



<li>Using generated charts to compare Machine Learning pipelines</li>



<li>Visualizing results&nbsp;</li>



<li>Debugging</li>



<li>Defining workflows</li>
</ul>



<div id="separator-block_063e430cfbfadabbcde40db0b5e58df5"
         class="block-separator block-separator--10">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="603" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-6-1024x603.png?resize=1024%2C603&#038;ssl=1" alt="Argo user interface" class="wp-image-72257"/><figcaption class="wp-element-caption"><em>Argo UI | <a href="https://github.com/argoproj/argo-workflows" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Airflow</h4>



<p>The Airflow UI provides a clean and efficient design that enables users to interact with the Airflow server, allowing them to <strong>monitor</strong> and <strong>troubleshoot</strong> the entire pipeline. It also allows editing the state of tasks in the database and manipulating the behaviour of DAGs and tasks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="616" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-7-1024x616.png?resize=1024%2C616&#038;ssl=1" alt="Airflow user interface" class="wp-image-72258"/><figcaption class="wp-element-caption"><em>Airflow UI | <a href="https://airflow.apache.org/docs/apache-airflow/stable/ui.html#" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The Airflow UI also provides various views for its users, including:</p>



<ul class="wp-block-list">
<li>DAGs View</li>



<li>Datasets View</li>



<li>Grid View</li>



<li>Graph View</li>



<li>Calendar View</li>



<li>Variable View</li>



<li>Gantt View</li>



<li>Task Duration</li>



<li>Code View</li>
</ul>



<h4 class="wp-block-heading">Prefect</h4>



<p>Prefect, like Airflow, provides an overview of all tasks, helping you visualize your workflows, tasks, and DAGs. It provides two ways to access the UI:</p>



<ol class="wp-block-list">
<li><strong>Prefect Cloud</strong>: Hosted in the cloud; it enables you to configure your personal accounts and workspaces.&nbsp;</li>



<li><strong>Prefect Orion UI</strong>: Hosted locally and open-source. You cannot configure it the way you can with Prefect Cloud.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="442" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-8-1024x442.png?resize=1024%2C442&#038;ssl=1" alt="Prefect user interface" class="wp-image-72259"/><figcaption class="wp-element-caption"><em>Prefect UI | <a href="https://docs.prefect.io/ui/overview/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Some additional features of Prefect UI:</p>



<ul class="wp-block-list">
<li>Displaying run summaries</li>



<li>Displaying details of deployed flows</li>



<li>Displaying scheduled flows&nbsp;</li>



<li>Warning notifications for late and failed runs</li>



<li>Detailed information on tasks and workflows</li>



<li>Task dependency visualization and Radar flow</li>



<li>Log details</li>
</ul>



<h3 class="wp-block-heading" id="h-deployment-style">Deployment Style</h3>



<h4 class="wp-block-heading">Argo</h4>



<p>Argo is a Kubernetes-native workflow engine, which means it:</p>



<div id="case-study-numbered-list-block_9534e29fd26811ea12e384e75357482d"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Runs on containers.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Runs on Kubernetes-supported pods.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Easy to deploy and scale.            </li>
            </ul>
</div>



<p>On the downside:</p>



<ul class="wp-block-list">
<li>Implementation is harder since workflows are defined in a configuration language (YAML).</li>
</ul>
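<p>To give a sense of what that YAML looks like, here is a minimal Argo Workflow manifest (an illustrative sketch; the image, names, and message are placeholders):</p>

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-   # Argo appends a random suffix to the name
spec:
  entrypoint: main             # the template to run first
  templates:
    - name: main
      container:               # each step runs in its own container/pod
        image: alpine:3.18
        command: [echo]
        args: ["hello from Argo"]
```

<p>Every step, dependency, and parameter is expressed declaratively in this format, which is what makes Argo powerful on Kubernetes but harder to pick up than a Python-based definition.</p>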



<h4 class="wp-block-heading">Airflow</h4>



<div id="case-study-numbered-list-block_8a83e1308936ba2cfba7d5ae62381a9a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Supports Kubernetes as well as other third-party integrations.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It runs on containers as well.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Implementation is easy.            </li>
            </ul>
</div>



<p>The downside of Airflow is:</p>



<ul class="wp-block-list">
<li>It does not scale in parallel (it scales horizontally instead).</li>



<li>Deployment needs extra effort, depending on the cloud provider you choose.&nbsp;</li>
</ul>



<h4 class="wp-block-heading">Prefect</h4>



<p>Lastly, Prefect combines traits of both Argo and Airflow:</p>



<div id="case-study-numbered-list-block_6058393be57ffaaa79c8ca34cd0ca4d3"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It can run on Containers and Kubernetes pods.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It is highly parallel and efficient.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It supports fault-tolerant scheduling.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Easy to deploy.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                It also supports third-party integrations.            </li>
            </ul>
</div>



<p>When it comes to the downside:</p>



<ul class="wp-block-list">
<li>It does not support open-source deployment with Kubernetes.&nbsp;</li>



<li>Deployment is difficult.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-scalability">Scalability</h3>



<p>When it comes to scalability, Argo and Prefect are highly parallel, which makes them efficient. Prefect stands out because it can leverage support from various third-party integrations, making it the most scalable of the three.&nbsp;</p>



<p>Airflow, on the other hand, is horizontally scalable, i.e., the number of active workers is equal to the maximum task parallelism.&nbsp;</p>



<h3 class="wp-block-heading" id="h-accessibility">Accessibility</h3>



<p>All three are open-sourced, but Prefect also comes with a <a href="https://www.prefect.io/pricing/" target="_blank" rel="noreferrer noopener nofollow">subscription-based</a> service.&nbsp;</p>



<h3 class="wp-block-heading" id="h-flexibility">Flexibility</h3>



<p>Argo and Airflow aren’t as flexible as Prefect. Since Argo is Kubernetes-native, it is confined to that environment, which makes it rigid. Airflow, in turn, is complicated because it requires a well-defined and structured template, making it not very well suited to an agile environment.&nbsp;</p>



<p>Prefect, on the other hand, enables you to create dynamic dataflows in native Python without requiring you to define a DAG. Any Python function can be transformed into a Prefect flow or task. This ensures flexibility.</p>
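<p>To illustrate the idea, here is a plain-Python sketch of the pattern (not the actual Prefect API; in Prefect you would use the <code>@task</code> and <code>@flow</code> decorators from the <code>prefect</code> package, and the names below are made up for illustration):</p>

```python
import functools

def task(fn):
    """Toy stand-in for an orchestrator's task decorator: wraps a plain
    function and records each result (a real tool would track state,
    retries, logs, etc.)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        wrapper.runs.append(result)
        return result
    wrapper.runs = []
    return wrapper

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 2 for x in data]

def pipeline():
    # The "DAG" is just the Python call graph, determined at runtime
    return transform(extract())
```

<p>Because the dependency structure is whatever the Python code does at runtime, branching, looping, and dynamically created tasks come for free.</p>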



<h2 class="wp-block-heading" id="h-why-use-these-tools">Why use these tools?</h2>



<p>So far, I’ve compared the basic concepts and features that these tools possess. Now let me give some reasons why you might choose each of these tools for your project.&nbsp;&nbsp;</p>



<h3 class="wp-block-heading" id="h-argo">Argo</h3>



<p>Here are some of the reasons why you should use Argo:</p>



<div id="case-study-numbered-list-block_7c3a28031a82c6fbbaadc383748b3c55"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                As a Kubernetes-native workflow tool, it enables you to run each step in its own Kubernetes pod.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Easy to scale because workflows can be executed in parallel.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Workflow templates offer reusability.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Similarly, artifact integrations are also reusable.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                DAG is dynamic for each run of the workflow.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Low Latency Scheduler.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Event-Driven Workflows.            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-airflow">Airflow</h3>



<p>Reasons for you to use Airflow:</p>



<div id="case-study-numbered-list-block_c0af8bcfe38bccc2ebd32c9ec2c4588a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It enables users to connect with various technologies.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It offers rich scheduling and easy-to-define pipelines.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Pythonic integration is another reason to use Airflow.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                You can create custom components as per your requirements.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Allows rollback to the previous version as workflows are stored.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Has a well-defined UI.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Multiple users can write a workflow for a given project, i.e. it is shareable.             </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-prefect">Prefect</h3>



<p>Prefect is one of the well-planned orchestration tools for MLOps. It is Python-native and requires you to put effort into the engineering side of things. One of the areas where Prefect shines is data processing and pipelines. It can be used to fetch data, apply the necessary transformations, and monitor and orchestrate the necessary tasks.</p>



<p>When it comes to tasks related to machine learning, it can be used to automate the entire data flow.&nbsp;</p>



<p>Some other reasons to use Prefect are:</p>



<div id="case-study-numbered-list-block_b16c6afb0237d5b45cd20291ed74f74b"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Provides excellent security, as it keeps your data and code private.              </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Enhanced UI and notifications delivered directly to your email or Slack.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It can be used with Kubernetes and Docker.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Efficient parallel processing of tasks.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Dynamic workflow.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Allows many third-party integrations.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Prefect uses a GraphQL API, enabling it to trigger workflows on demand.             </li>
            </ul>
</div>



<h2 class="wp-block-heading" id="h-how-to-decide">How to decide?</h2>



<p>Choosing the right tool for your project depends on what you want and what you already have. But I can lay out some criteria to help you decide which tool will be appropriate for you. You can use:</p>



<h3 class="wp-block-heading" id="h-argo">Argo</h3>



<ul class="wp-block-list">
<li>If you want to set up a workflow based on Kubernetes.</li>



<li>If you want to define your workflow as DAGs.</li>



<li>If your dataset is huge and model training requires highly parallel and distributed training.&nbsp;</li>



<li>If your task is complex.</li>



<li>If you are well-versed in YAML files. Even if you are not, learning YAML is not difficult.</li>



<li>If you want to use a Kubernetes-enabled cloud platform like GCP or AWS.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-airflow">Airflow</h3>



<ul class="wp-block-list">
<li>If you want to incorporate a lot of other 3rd party technology like Jenkins, Airbyte, Amazon, Cassandra, Docker, et cetera. Check the <a href="https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/index.html" target="_blank" rel="noreferrer noopener nofollow">list of supported third-party extensions</a>.</li>



<li>If you want to use Python to define the workflow.</li>



<li>If you want to define your workflow as DAGs.</li>



<li>If your workflow is static.</li>



<li>If you want a mature tool, since Airflow has been around for a long time.&nbsp;</li>



<li>If you want to run tasks on schedule.</li>
</ul>
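<p>Several of the criteria above come down to whether you are comfortable expressing a workflow as a DAG. Stripped of any particular tool, the idea is just topological execution of tasks. Here is a minimal pure-Python sketch (the task names are made up for illustration):</p>

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Task name -> set of upstream tasks it depends on (a hypothetical pipeline)
dag = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
    "evaluate": {"train"},
    "report": {"evaluate", "transform"},
}

def run_in_order(dag):
    """Execute tasks in dependency order, as an orchestrator's scheduler would."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        print(f"running {name}")
    return order
```

<p>Airflow asks you to declare this structure up front, Prefect infers it from your function calls at runtime, and Argo expresses it in YAML.</p>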



<h3 class="wp-block-heading" id="h-prefect">Prefect</h3>



<ul class="wp-block-list">
<li>If you want to incorporate a lot of other 3rd party technology.</li>



<li>If you want to use Python to define the workflow.</li>



<li>If your workflow is dynamic.</li>



<li>If you want to run tasks on schedule.</li>



<li>If you want something light and modern.</li>
</ul>



<p>I found a thread on <a href="https://www.reddit.com/r/dataengineering/comments/oqbiiu/airflow_vs_prefect/" target="_blank" rel="noreferrer noopener nofollow">Reddit</a> concerning the use of Airflow and Prefect. Maybe this can give you some additional information as to which tool to use.</p>



<p>“…The pros of Airflow are that it&#8217;s an established and popular project. This means it&#8217;s much easier to find someone who has done a random blog that answers your question. Another pro is that it&#8217;s much easier to hire someone with Airflow experience than Prefect experience. The cons are that Airflow&#8217;s age is showing, in that it wasn&#8217;t really designed for the kind of<em> dynamic workflows that exist within modern data environments</em>. If your company is going to be pushing the limits in terms of <em>computation or complexity, I&#8217;d highly suggest looking at Prefect.</em> Additionally, unless you go through Astronomer, if you can&#8217;t find an answer to a question you have about Airflow, you have to go through their fairly inactive slack chat.</p>



<p>The pros of Prefect are that it&#8217;s much more modern in its assumptions about what you&#8217;re doing and what it needs to do. It has an extensive API that allows you to programmatically control executions or otherwise interact with the scheduler, which I believe Airflow has only recently implemented out of beta in their 2.0 release. Prior to this, it was recommended not to use the API in production, which often leads to hacky workarounds. In addition, Prefect allows for a much more dynamic execution model with some of its concepts by determining the DAG that gets executed at runtime and then handing off the computation/optimization to other systems (namely Dask) to actually execute the tasks. I believe this is a much smarter approach, as I&#8217;ve seen workflows get more and more dynamic over the years.</p>



<p>If my company had neither Airflow nor Prefect in place already, I&#8217;d opt for Prefect. I believe it allows for much better modularization of code (which can then be tested more aggressively / thoroughly), which I already think is worth its weight in gold for data-driven companies that rely on having well-curated data in place to make automated product decisions. You can achieve something similar with Airflow, but you really need to go out of your way to make something like that happen, whereas in Prefect it kind of naturally comes out.”&nbsp;</p>



<p>Here is a useful chart illustrating the popularity of different orchestration tools based on GitHub stars.</p>



<div id="separator-block_f23d2ea42f6f7e22f50891a473a7769e"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" height="717" width="1024" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/argo-vs-airflow-vs-prefect-9-1024x717.png?resize=1024%2C717&#038;ssl=1" alt="Chart illustrating the popularity of different orchestration tools" class="wp-image-72260"/><figcaption class="wp-element-caption"><em>The popularity of different orchestration tools based on GitHub stars | <a href="https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>In this article, we discussed and compared the three popular tools for task orchestration, namely Argo, Airflow, and Prefect. My main aim was to help you understand these tools on the basis of three important factors i.e. Core concepts, Features offered, and why you should use them. The article also compared the three tools on some of the important features they offer, which could help you make the decision of choosing the most appropriate tool for your project.</p>



<p>I hope this article was informative and gave you a better understanding of these tools.&nbsp;</p>



<p>Thanks!!!&nbsp;</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://github.com/argoproj/argo-workflows" target="_blank" rel="noreferrer noopener nofollow">https://github.com/argoproj/argo-workflows</a>&nbsp;</li>



<li><a href="https://argoproj.github.io/" target="_blank" rel="noreferrer noopener nofollow">https://argoproj.github.io/</a>&nbsp;</li>



<li><a href="https://codefresh.io/learn/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">https://codefresh.io/learn/argo-workflows/</a>&nbsp;</li>



<li><a href="https://hazelcast.com/glossary/directed-acyclic-graph/" target="_blank" rel="noreferrer noopener nofollow">https://hazelcast.com/glossary/directed-acyclic-graph/</a></li>



<li><a href="https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/mlops-with-a-feature-store-816cfa5966e9</a>&nbsp;</li>



<li><a href="https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/arthur-engineering/picking-a-kubernetes-orchestrator-airflow-argo-and-prefect-83539ecc69b</a></li>



<li><a href="https://argoproj.github.io/argo-workflows/artifact-visualization/#artifact-types" target="_blank" rel="noreferrer noopener nofollow">https://argoproj.github.io/argo-workflows/artifact-visualization/#artifact-types</a>&nbsp;</li>



<li><a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html" target="_blank" rel="noreferrer noopener nofollow">https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html</a>&nbsp;</li>



<li><a href="https://spell.ml/blog/orchestrating-spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp" target="_blank" rel="noreferrer noopener nofollow">https://spell.ml/blog/orchestrating-spell-model-pipelines-using-prefect-YU3rsBEAACEAmRxp</a>&nbsp;</li>



<li><a href="https://github.com/PrefectHQ/prefect" target="_blank" rel="noreferrer noopener nofollow">https://github.com/PrefectHQ/prefect</a></li>



<li><a href="https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow" target="_blank" rel="noreferrer noopener nofollow">https://www.datarevenue.com/en-blog/airflow-vs-luigi-vs-argo-vs-mlflow-vs-kubeflow</a>&nbsp;</li>



<li><a href="https://hevodata.com/learn/argo-vs-airflow/#w6" target="_blank" rel="noreferrer noopener nofollow">https://hevodata.com/learn/argo-vs-airflow/#w6</a>&nbsp;</li>



<li><a href="https://www.datarevenue.com/en-blog/what-we-are-loving-about-prefect" target="_blank" rel="noreferrer noopener nofollow">https://www.datarevenue.com/en-blog/what-we-are-loving-about-prefect</a>&nbsp;</li>



<li><a href="https://github.com/PrefectHQ/prefect" target="_blank" rel="noreferrer noopener nofollow">https://github.com/PrefectHQ/prefect</a>&nbsp;</li>



<li><a href="https://docs.prefect.io/" target="_blank" rel="noreferrer noopener nofollow">https://docs.prefect.io/</a>&nbsp;</li>



<li><a href="https://medium.com/the-prefect-blog/introducing-the-artifacts-api-b9e5972db043" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/the-prefect-blog/introducing-the-artifacts-api-b9e5972db043</a>&nbsp;</li>



<li><a href="https://medium.com/the-prefect-blog/orchestrate-your-data-science-project-with-prefect-2-0-4118418fd7ce" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/the-prefect-blog/orchestrate-your-data-science-project-with-prefect-2-0-4118418fd7ce</a>&nbsp;</li>



<li><a href="https://www.reddit.com/r/dataengineering/comments/oqbiiu/airflow_vs_prefect/" target="_blank" rel="noreferrer noopener nofollow">https://www.reddit.com/r/dataengineering/comments/oqbiiu/airflow_vs_prefect/</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">11822</post-id>	</item>
		<item>
		<title>5 Tools That Will Help You Setup Production ML Model Testing</title>
		<link>https://neptune.ai/blog/tools-ml-model-testing</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 30 Sep 2022 11:15:05 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.test/tools-ml-model-testing/</guid>

					<description><![CDATA[Developing a machine learning or a deep learning model seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the data-preprocessing phase, followed by the modeling-building phase. If the accuracy is not up&#8230;]]></description>
										<content:encoded><![CDATA[
<p><a href="/categories/ml-model-development" target="_blank" rel="noreferrer noopener">Developing a machine learning or a deep learning model</a> seems like a relatively straightforward task. It usually involves research, collecting and preprocessing the data, extracting features, building and training the model, evaluation, and inference. Most of the time is consumed in the <a href="/blog/data-preprocessing-guide" target="_blank" rel="noreferrer noopener">data-preprocessing phase</a>, followed by the model-building phase. If the accuracy is not up to the mark, we then iterate over the whole process until we reach satisfactory accuracy.&nbsp;</p>



<p>The difficulty arises when we want to put the model into production in the real world. The model often does not perform as well as it did during the training and evaluation phase. This happens primarily because of <a href="/blog/concept-drift-best-practices" target="_blank" rel="noreferrer noopener">concept drift</a> or data drift and issues concerning data integrity. Therefore, testing an ML model becomes very important so that we can understand its strengths and weaknesses and act accordingly.&nbsp;</p>



<p>In this article, we will discuss some of the tools that can be leveraged to test an ML model. Some of these tools and libraries are open-source, while others require a subscription. Either way, this article will fully explore the tools that will be handy for your MLOps pipeline.&nbsp;</p>



<h2 class="wp-block-heading" id="h-why-does-model-testing-matter">Why does model testing matter?</h2>



<p>Building upon what we just discussed, model testing allows you to pinpoint a bug or area of concern that might cause the prediction capability of the model to degrade. This can happen over time gradually or in an instant. Either way, it is always good to know in which area they might fail and which features can cause them to fail. It exposes flaws, and it can also bring new insights to light. Essentially, the idea is to make a robust model that can efficiently handle uncertain data entries and anomalies.&nbsp;</p>



<p>Some of the benefits of model testing are:</p>



<div id="case-study-numbered-list-block_98ce7e7d1ce3f6d5a001c0826a830c71"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Detecting model and data drift<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Finding anomalies in the dataset<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Checking data and model integrity<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Detecting possible root causes of model failure<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Eliminating bugs and errors<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Reducing false positives and false negatives<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Encouraging retraining of the model over a certain period of time<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Creating a production-ready model<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">9</span>
                Ensuring robustness of ML model<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">10</span>
                Finding new insights within the model            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-is-model-testing-the-same-as-model-evaluation">Is model testing the same as model evaluation?</h3>



<p>Model testing and evaluation are similar to what we call diagnosis and screening in medicine.&nbsp;</p>



<p><strong>Model evaluation</strong> is similar to screening, where the performance of the model is checked based on certain metrics like the F1 score or MSE loss. These metrics do not point to a focused area of concern.&nbsp;</p>



<section id="blog-intext-cta-block_8ccae508b071a43993cfb7cffe665126" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning" target="_blank" rel="noopener">The Ultimate Guide to Evaluation and Selection of Models in Machine Learning</a></p>
<p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/f1-score-accuracy-roc-auc-pr-auc" target="_blank" rel="noopener">F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?</a></p>
    
    </section>



<p><strong>Model testing</strong> is similar to diagnosis, where a specific test, like an invariance test or a unit test, aims to find a particular issue in the model.&nbsp;</p>
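<p>To make the distinction concrete, here is a minimal sketch of an invariance test. The toy sentiment model and test function are hypothetical stand-ins, not from any library: the test asserts that a perturbation irrelevant to the task, such as extra whitespace, does not change the prediction.</p>

```python
# Invariance test sketch: a prediction should not change under a
# task-irrelevant perturbation. The "model" here is a toy stand-in.

def toy_sentiment_model(text):
    positive = {"good", "great", "excellent"}
    words = text.lower().split()
    return "positive" if any(w in positive for w in words) else "negative"

def test_invariance_to_whitespace():
    base = toy_sentiment_model("the service was great")
    perturbed = toy_sentiment_model("  the service was great \n")
    assert base == perturbed  # same prediction despite the perturbation

test_invariance_to_whitespace()
print("invariance test passed")
```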



<h2 class="wp-block-heading" id="h-what-will-a-typical-ml-software-testing-suite-include">What will a typical ML software testing suite include?</h2>



<p>A machine learning testing suite often includes testing modules to <strong>detect different types of drift</strong>, such as concept drift and data drift, which can include covariate drift, prediction drift, and so on. These issues usually originate in the dataset. Most of the time, the dataset&#8217;s distribution changes over time, affecting the model&#8217;s ability to predict the output accurately. You will find that the frameworks we discuss contain tools to detect data drift.&nbsp;</p>
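<p>As an illustration of what such a drift check computes under the hood, here is a hand-rolled sketch of one common drift signal, the Population Stability Index (PSI), comparing a feature&#8217;s distribution between a reference (training) sample and a newer sample. The equal-width bucketing and the 0.2 alert threshold are common conventions, not taken from any particular framework.</p>

```python
# A simple data-drift signal: PSI between a reference sample and a
# current sample of one feature. Larger PSI means larger drift.
import math

def psi(reference, current, buckets=4):
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / buckets or 1.0

    def fractions(sample):
        counts = [0] * buckets
        for v in sample:
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
        # small floor avoids log(0) for empty buckets
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

reference = [0.1 * i for i in range(100)]      # stable training sample
shifted = [0.1 * i + 5.0 for i in range(100)]  # drifted production sample
print(psi(reference, reference) < 0.2)  # below the alert threshold
print(psi(reference, shifted) > 0.2)    # above the alert threshold
```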



<p>Apart from testing data, the ML testing suite contains tools to test the <strong>model&#8217;s predictive capability</strong>, as well as <strong>overfitting, underfitting, variance, and bias</strong>, et cetera. The idea of the testing framework is to inspect the pipeline in the three major phases of development: </p>



<ul class="wp-block-list">
<li>data ingestion, </li>



<li>data preprocessing, </li>



<li>and model evaluation. </li>
</ul>



<p>Some of the frameworks like Robust Intelligence and Kolena rigorously test the given ML pipeline automatically in these given areas to ensure a production-ready model.&nbsp;</p>



<p>In essence, a machine learning suite will contain:</p>



<ol class="wp-block-list">
<li><strong>Unit tests</strong> that operate on the level of the codebase,</li>



<li><strong>Regression tests</strong> replicate bugs from previous iterations of the model that have since been fixed, to ensure they do not reappear,</li>



<li><strong>Integration tests</strong> simulate conditions and are typically longer-running tests that observe model behaviors. These conditions can mirror the ML pipeline, including preprocessing phase, data distribution, et cetera.&nbsp;</li>
</ol>
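<p>As a minimal illustration of the first two categories, the sketch below unit-tests a small preprocessing helper (a hypothetical function, not from any framework) and includes a regression-style test that pins down a previously fixed zero-division bug:</p>

```python
# Unit and regression test sketch for a preprocessing helper.

def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # guard against zero division
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def test_min_max_scale_bounds():
    # unit test: output must span exactly [0, 1]
    scaled = min_max_scale([3.0, 7.0, 11.0])
    assert min(scaled) == 0.0 and max(scaled) == 1.0

def test_min_max_scale_constant_input():
    # regression test: replicates a previously fixed zero-division bug
    assert min_max_scale([5.0, 5.0]) == [0.0, 0.0]

test_min_max_scale_bounds()
test_min_max_scale_constant_input()
print("all preprocessing tests passed")
```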


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-1.png?ssl=1" alt="A workflow of software development " class="wp-image-71569"/><figcaption class="wp-element-caption"><em>The image above depicts a typical workflow of software development | <a href="https://www.jeremyjordan.me/testing-ml/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<section id="blog-intext-cta-block_fab1718676241ba16b30355ddbfbb16c" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also</h3>
    
            <p>  <a href="/blog/ml-model-testing-teams-share-how-they-test-models" target="_blank" rel="noopener">ML Model Testing: 4 Teams Share How They Test Their Models</a></p>
<p>  <a href="/blog/automated-testing-machine-learning" target="_blank" rel="noopener">Automated Testing in Machine Learning Projects [Best Practices for MLOps]</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-what-are-the-best-tools-for-machine-learning-model-testing">What are the best tools for machine learning model testing?</h2>



<p>Now, let’s discuss some of the tools for testing ML models. This section is divided into three parts: open-source tools, subscription-based tools, and hybrid tools.&nbsp;</p>



<h3 class="wp-block-heading" id="h-open-source-model-testing-tools">Open-source model testing tools</h3>



<h4 class="wp-block-heading">1. DeepChecks</h4>



<p><a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">DeepChecks</a> is an open-source Python framework for testing ML Models &amp; Data. It basically enables users to test the ML pipeline in three different phases:</p>



<ol class="wp-block-list">
<li><strong>Data integrity test </strong>before the preprocessing phase.</li>



<li><strong>Data validation</strong> before training, mostly while splitting the data into training and testing sets, and</li>



<li><strong>ML model testing</strong>.</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-2.png?ssl=1" alt="" class="wp-image-71570"/><figcaption class="wp-element-caption"><em> The image above shows the schema of three different tests that could be performed in an ML pipeline | <a href="https://docs.deepchecks.com/stable/getting-started/when_should_you_use.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>These tests can be performed all at once and even independently. The image above shows the schema of three different tests that could be performed in an ML pipeline.&nbsp;</p>



<h5 class="wp-block-heading">Installation</h5>



<p>Deepchecks can be installed using the following pip command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install "deepchecks&gt;0.5.0"</pre>



<p>The latest version of Deepchecks at the time of writing is 0.8.0.&nbsp;</p>



<h5 class="wp-block-heading">Structure of the framework&nbsp;</h5>



<p>DeepChecks introduces three important terms: <strong>Check</strong>, <strong>Condition</strong>, and <strong>Suite</strong>. Together, these three concepts form the core structure of the framework.&nbsp;</p>



<p><strong>Check</strong></p>



<p>A check enables the user to inspect a specific aspect of the data and models. The framework contains various classes that allow you to check both, and you can run a full check as well. Here are a couple of such checks:</p>



<ol class="wp-block-list">
<li><strong><em>Data inspection</em></strong> covers data drift, duplication, missing values, string mismatches, and statistical analyses such as data distribution. You can find the various data-inspection tools within the check module, which lets you precisely tailor the inspection methods for your datasets. These are some of the tools that you will find for data inspection:</li>
</ol>



<ul class="wp-block-list">
<li>&nbsp;&#8216;DataDuplicates&#8217;,</li>



<li>&nbsp;&#8216;DatasetsSizeComparison&#8217;,</li>



<li>&nbsp;&#8216;DateTrainTestLeakageDuplicates&#8217;,</li>



<li>&nbsp;&#8216;DateTrainTestLeakageOverlap&#8217;,</li>



<li>&nbsp;&#8216;DominantFrequencyChange&#8217;,</li>



<li>&nbsp;&#8216;FeatureFeatureCorrelation&#8217;,</li>



<li>&nbsp;&#8216;FeatureLabelCorrelation&#8217;,</li>



<li>&nbsp;&#8216;FeatureLabelCorrelationChange&#8217;,</li>



<li>&nbsp;&#8216;IdentifierLabelCorrelation&#8217;,</li>



<li>&nbsp;&#8216;IndexTrainTestLeakage&#8217;,</li>



<li>&nbsp;&#8216;IsSingleValue&#8217;,</li>



<li>&nbsp;&#8216;MixedDataTypes&#8217;,</li>



<li>&nbsp;&#8216;MixedNulls&#8217;,</li>



<li>&nbsp;&#8216;WholeDatasetDrift&#8217;</li>
</ul>



<p>In the following example, we will inspect whether the dataset has duplicates. We will import the DataDuplicates class from the checks module and pass the dataset as a parameter. This returns a table containing relevant information on any duplicate values in the dataset.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> deepchecks.checks <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> DataDuplicates, FeatureFeatureCorrelation
dup = DataDuplicates()
dup.run(data)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-3.png?ssl=1" alt="Inspection of dataset duplicates " class="wp-image-71571"/><figcaption class="wp-element-caption"><em>An example of inspecting if the dataset has duplicates | Source: Author</em></figcaption></figure>
</div>


<p>As you can see, the table above yields relevant information about the number of duplicates present in the dataset. Now let&#8217;s see how DeepChecks uses visual aids to present this information.&nbsp;</p>



<p>In the following example, we will inspect feature-feature correlation within the dataset. For that, we will import the FeatureFeatureCorrelation class from the checks module.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">ffc = FeatureFeatureCorrelation()
ffc.run(data)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-4.png?ssl=1" alt=" Inspection of feature-feature correlation" class="wp-image-71572"/><figcaption class="wp-element-caption"><em>An example of inspecting feature-feature correlation within the dataset | Source: Author</em></figcaption></figure>
</div>


<p>As you can see from both examples, the results can be displayed either in the form of a table or a graph, or even both to give relevant information to the user.&nbsp;&nbsp;</p>



<ol class="wp-block-list" start="2">
<li><strong><em>Model inspection</em></strong> covers issues such as overfitting and underfitting<em>. </em>Similar to data inspection, you can find the various model-inspection tools within the check module. These are some of the tools that you will find for model inspection:</li>
</ol>



<ul class="wp-block-list">
<li>&#8216;ModelErrorAnalysis&#8217;,</li>



<li>&nbsp;&#8216;ModelInferenceTime&#8217;,</li>



<li>&nbsp;&#8216;ModelInfo&#8217;,</li>



<li>&nbsp;&#8216;MultiModelPerformanceReport&#8217;,</li>



<li>&nbsp;&#8216;NewLabelTrainTest&#8217;,</li>



<li>&nbsp;&#8216;OutlierSampleDetection&#8217;,</li>



<li>&nbsp;&#8216;PerformanceReport&#8217;,</li>



<li>&nbsp;&#8216;RegressionErrorDistribution&#8217;,</li>



<li>&nbsp;&#8216;RegressionSystematicError&#8217;,</li>



<li>&nbsp;&#8216;RocReport&#8217;,</li>



<li>&nbsp;&#8216;SegmentPerformance&#8217;,</li>



<li>&nbsp;&#8216;SimpleModelComparison&#8217;,</li>



<li>&nbsp;&#8216;SingleDatasetPerformance&#8217;,</li>



<li>&nbsp;&#8216;SpecialCharacters&#8217;,</li>



<li>&nbsp;&#8216;StringLengthOutOfBounds&#8217;,</li>



<li>&nbsp;&#8216;StringMismatch&#8217;,</li>



<li>&nbsp;&#8216;StringMismatchComparison&#8217;,</li>



<li>&nbsp;&#8216;TrainTestFeatureDrift&#8217;,</li>



<li>&nbsp;&#8216;TrainTestLabelDrift&#8217;,</li>



<li>&nbsp;&#8216;TrainTestPerformance&#8217;,</li>



<li>&nbsp;&#8216;TrainTestPredictionDrift&#8217;,</li>
</ul>



<p>Here is an example of a model check on a Random Forest classifier:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> deepchecks.checks <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> ModelInfo
info = ModelInfo()
info.run(RF)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-5.png?resize=490%2C768&#038;ssl=1" alt="A model check or inspection on Random Forest Classifier" class="wp-image-71573" width="490" height="768"/><figcaption class="wp-element-caption"><em>An example of a model check or inspection on Random Forest Classifier | Source: Author&nbsp;</em></figcaption></figure>
</div>


<p><strong>Condition</strong>&nbsp;</p>



<p>A condition is a function or attribute that can be added to a check. Essentially, it contains a predefined parameter that can return a pass, fail, or warning result, and these parameters can be modified as needed. The snippet below sketches how a condition is attached to a check (the exact condition method name may vary between Deepchecks versions):&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">from deepchecks.checks import FeatureLabelCorrelation

check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.8)
check.run(data)
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-6.png?resize=908%2C572&#038;ssl=1" alt="A bar graph of feature label correlation" class="wp-image-71574" width="908" height="572"/><figcaption class="wp-element-caption"><em>An example of a bar graph of feature label correlation | Source: Author</em></figcaption></figure>
</div>


<p>The image above shows a bar graph of feature-label correlation. It essentially measures the predictive power of an independent feature to predict the target value by itself. When you add a condition to a check, as in the example above, the condition returns additional information listing the features that are above and below the threshold.&nbsp;</p>



<p>In this particular example, you will find that the condition returned a statement stating that the algorithm “<em>Found 2 out of 4 features with PPS above threshold: {&#8216;petal width (cm)&#8217;: &#8216;0.9&#8217;, &#8216;petal length (cm)&#8217;: &#8216;0.87&#8217;}</em>” meaning that features with high PPS are suitable to predict the labels.&nbsp;</p>



<p><strong>Suite</strong>&nbsp;</p>



<p>A suite is an ordered collection of checks for both data and models. All the checks can be found in the suite module. Below is a schematic diagram of the framework and how it works.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-7.png?ssl=1" alt="Schematic diagram of suite of checks " class="wp-image-71575"/><figcaption class="wp-element-caption"><em>The schematic diagram of the suite of checks and how it works | <a href="https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71" target="_blank" rel="noreferrer noopener nofollow">Source</a>&nbsp;</em></figcaption></figure>
</div>


<p>As you can see from the image above, the data and the model can be passed into the suites which contain the different checks. The checks can be provided with the conditions for much more precise testing.&nbsp;</p>
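<p>Conceptually, this structure can be pictured in a few lines of plain Python. The following is a toy sketch of the Check/Condition/Suite idea, not the actual Deepchecks API:</p>

```python
# Toy illustration: a suite is an ordered collection of checks, each
# optionally carrying a pass/fail condition. Not the Deepchecks API.

class Check:
    def __init__(self, name, fn, condition=None):
        self.name, self.fn, self.condition = name, fn, condition

    def run(self, data):
        value = self.fn(data)
        passed = self.condition(value) if self.condition else None
        return {"check": self.name, "value": value, "passed": passed}

class Suite:
    def __init__(self, *checks):
        self.checks = list(checks)   # ordered collection of checks

    def run(self, data):
        return [check.run(data) for check in self.checks]

suite = Suite(
    Check("row_count", len, condition=lambda n: n >= 5),
    Check("duplicates", lambda d: len(d) - len(set(d)),
          condition=lambda n: n == 0),
)
for result in suite.run([1, 2, 3, 3, 4, 5]):
    print(result)
```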



<p>You can run the following code to see the list of 35 checks and their conditions that DeepChecks provides:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> deepchecks.suites <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> full_suite
suites = full_suite()
print(suites)
Full Suite: [
	<span class="hljs-number" style="color: teal;">0</span>: ModelInfo
	<span class="hljs-number" style="color: teal;">1</span>: ColumnsInfo
	<span class="hljs-number" style="color: teal;">2</span>: ConfusionMatrixReport
	<span class="hljs-number" style="color: teal;">3</span>: PerformanceReport
		Conditions:
			<span class="hljs-number" style="color: teal;">0</span>: Train-Test scores relative degradation <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> greater than <span class="hljs-number" style="color: teal;">0.1</span>
	<span class="hljs-number" style="color: teal;">4</span>: RocReport(excluded_classes=[])
		Conditions:
			<span class="hljs-number" style="color: teal;">0</span>: AUC score <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> all the classes <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> less than <span class="hljs-number" style="color: teal;">0.7</span>
	<span class="hljs-number" style="color: teal;">5</span>: SimpleModelComparison
		Conditions:
			<span class="hljs-number" style="color: teal;">0</span>: Model performance gain over simple model <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> less than
…]

</pre>



<p>In conclusion, Check, Condition, and Suites allow users to essentially check the data and model in their respective tasks. These can be extended and modified according to the requirements of the project and for various use cases.&nbsp;</p>



<p>DeepChecks offers flexibility and instant validation of the ML pipeline with little effort. Its strong boilerplate code allows users to automate the whole testing process, which can save a lot of time.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-8.png?ssl=1" alt="Graph with distribution checks" class="wp-image-71576"/><figcaption class="wp-element-caption"><em>An example of distribution checks | <a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Why should you use this?</h5>



<ul class="wp-block-list">
<li>It is open-source and free, and it has a growing community.</li>



<li>It is a very well-structured framework.&nbsp;</li>



<li>Because it has built-in checks and suites, it can be extremely useful for inspecting potential issues in your data and models.</li>



<li>It is efficient in the research phase as it can be easily integrated into the pipeline.</li>



<li>If you are mostly working with tabular datasets, then DeepChecks is extremely good.&nbsp;</li>



<li>You can also use it to check for data and model drift, verify model integrity, and monitor models.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-9.png?resize=768%2C568&#038;ssl=1" alt="Methodology issues" class="wp-image-71577" width="768" height="568"/><figcaption class="wp-element-caption"><em>An example of methodology issues | <a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_58e722c9b682ec0283ebe0dff110f34b"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It supports classification and regression models on both computer vision and tabular datasets.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It can easily run a large group of checks with a single call.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It is flexible, editable, and expandable.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                It yields results in both tabular and visual formats.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                It does not require a dashboard login, as all results, including visualizations, are displayed instantly during execution, and it offers a good user experience on the go.            </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-10.png?resize=768%2C581&#038;ssl=1" alt="Performance checks " class="wp-image-71578" width="768" height="581"/><figcaption class="wp-element-caption"><em>An example of performance checks | <a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_17a99a63b90e85a0ff3abd9114c1d988"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It does not support NLP tasks.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Deep learning support, including computer vision, is in beta, so results may contain errors.             </li>
            </ul>
</div>



<h4 class="wp-block-heading">2. Drifter-ML</h4>



<p>Drifter-ML is an ML model testing tool written specifically for the Scikit-learn library. It can also be used to test datasets, similar to DeepChecks. It has five modules, each specific to the task at hand.</p>



<ol class="wp-block-list">
<li><strong>Classification test: </strong>It enables you to test classification algorithms.</li>



<li><strong>Regression test: </strong>It enables you to test regression algorithms.</li>



<li><strong>Structural test: </strong>This module has a bunch of classes that allow testing of clustering algorithms.</li>



<li><strong>Time Series test: </strong>This module can be used to test model drifts.&nbsp;</li>



<li><strong>Columnar test: </strong>This module allows you to test your tabular dataset. Tests include sanity testing, mean and median similarity, Pearson’s correlation et cetera.&nbsp;</li>
</ol>
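<p>To illustrate the kind of assertion a columnar test makes, here is a hand-rolled sketch of a mean-similarity check: the mean of a new column should stay within a tolerance of the training column&#8217;s statistics. The two-standard-deviation tolerance and the sample values are illustrative, not drifter-ml defaults.</p>

```python
# Columnar sanity-check sketch: compare a new column's mean against
# the training column's mean, with a standard-deviation tolerance.
import statistics

def mean_similarity_ok(train_col, new_col, n_std=2.0):
    train_mean = statistics.mean(train_col)
    train_std = statistics.stdev(train_col)
    return abs(statistics.mean(new_col) - train_mean) <= n_std * train_std

train_ages = [23, 31, 35, 41, 52, 47, 38, 29]   # training data column
similar = [25, 33, 36, 44, 50, 45, 37, 30]      # production data, similar
shifted = [80, 85, 90, 95, 100, 105, 110, 115]  # production data, shifted

print(mean_similarity_ok(train_ages, similar))  # passes the check
print(mean_similarity_ok(train_ages, shifted))  # fails the check
```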



<h5 class="wp-block-heading">Installation</h5>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install drifter-ml</pre>



<h5 class="wp-block-heading">Structure of the framework</h5>



<p>Drifter-ML conforms to the Scikit-learn blueprint for models, i.e., the model must implement the .fit and .predict methods. This essentially means that you can test deep learning models as well, since Keras offers a Scikit-learn wrapper API (KerasClassifier). Check the <a href="https://drifter-ml.readthedocs.io/en/latest/introduction.html" target="_blank" rel="noreferrer noopener nofollow">example</a> below.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Source: https://drifter-ml.readthedocs.io/en/latest/classification-tests.html#lower-bound-classification-measures</span>

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.models <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Sequential
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.layers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dense
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> keras.wrappers.scikit_learn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> KerasClassifier
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pandas <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> pd
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> joblib

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Function to create model, required for KerasClassifier</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">create_model</span><span class="hljs-params">()</span>:</span>
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create model</span>
   model = Sequential()
   model.add(Dense(<span class="hljs-number" style="color: teal;">12</span>, input_dim=<span class="hljs-number" style="color: teal;">3</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
   model.add(Dense(<span class="hljs-number" style="color: teal;">8</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
   model.add(Dense(<span class="hljs-number" style="color: teal;">1</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'sigmoid'</span>))
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Compile model</span>
   model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'binary_crossentropy'</span>, optimizer=<span class="hljs-string" style="color: rgb(221, 17, 68);">'adam'</span>, metrics=[<span class="hljs-string" style="color: rgb(221, 17, 68);">'accuracy'</span>])
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> model

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate a synthetic dataset; collect rows in a list and build the</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># DataFrame once (DataFrame.append was removed in pandas 2.0)</span>
rows = []
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> _ <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">1000</span>):
   a = np.random.normal(<span class="hljs-number" style="color: teal;">0</span>, <span class="hljs-number" style="color: teal;">1</span>)
   b = np.random.normal(<span class="hljs-number" style="color: teal;">0</span>, <span class="hljs-number" style="color: teal;">3</span>)
   c = np.random.normal(<span class="hljs-number" style="color: teal;">12</span>, <span class="hljs-number" style="color: teal;">4</span>)
   target = <span class="hljs-number" style="color: teal;">1</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> a + b + c &gt; <span class="hljs-number" style="color: teal;">11</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-number" style="color: teal;">0</span>
   rows.append({
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"A"</span>: a,
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"B"</span>: b,
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"C"</span>: c,
       <span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>: target
   })
df = pd.DataFrame(rows)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># split into input (X) and output (Y) variables</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create model</span>
clf = KerasClassifier(build_fn=create_model, epochs=<span class="hljs-number" style="color: teal;">150</span>, batch_size=<span class="hljs-number" style="color: teal;">10</span>, verbose=<span class="hljs-number" style="color: teal;">0</span>)
X = df[[<span class="hljs-string" style="color: rgb(221, 17, 68);">"A"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"B"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"C"</span>]]
clf.fit(X, df[<span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>])
joblib.dump(clf, <span class="hljs-string" style="color: rgb(221, 17, 68);">"model.joblib"</span>)
df.to_csv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"data.csv"</span>, index=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>)

</pre>



<p>The example above shows how easily you can design an ANN model using drifter-ml. Designing a test case is just as straightforward. The test defined below checks that the model’s cross-validated precision does not fall below a lower boundary of 0.9.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">test_cv_precision_lower_boundary</span><span class="hljs-params">()</span>:</span>
   df = pd.read_csv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"data.csv"</span>)
   column_names = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"A"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"B"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"C"</span>]
   target_name = <span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>
   clf = joblib.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">"model.joblib"</span>)

   test_suite = ClassificationTests(clf,
   df, target_name, column_names)
   lower_boundary = <span class="hljs-number" style="color: teal;">0.9</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> test_suite.cross_val_precision_lower_boundary(
       lower_boundary
   )</pre>
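<p>To make the check above concrete: <code>cross_val_precision_lower_boundary</code> essentially verifies that precision stays above the threshold on every cross-validation fold. A minimal sketch of that idea in plain Python, using a toy rule-based classifier in place of the trained Keras model so the snippet stands alone:</p>

```python
import random

def precision(y_true, y_pred):
    """Precision = true positives / predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pp = sum(1 for p in y_pred if p == 1)
    return tp / pp if pp else 0.0

def cross_val_precision_lower_boundary(predict, X, y, lower_boundary, k=5):
    """Return True only if precision >= lower_boundary on every one of k folds."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    for fold in folds:
        y_true = [y[i] for i in fold]
        y_pred = [predict(X[i]) for i in fold]
        if precision(y_true, y_pred) < lower_boundary:
            return False
    return True

# toy data mirroring the article: target = 1 when A + B + C > 11
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 3), random.gauss(12, 4)] for _ in range(1000)]
y = [1 if sum(row) > 11 else 0 for row in X]

# a "model" that has learned the true rule passes the boundary on every fold
print(cross_val_precision_lower_boundary(lambda row: 1 if sum(row) > 11 else 0, X, y, 0.9))
```

<p>drifter-ml runs the equivalent check against the actual model loaded from <code>model.joblib</code>, with the cross-validation folds handled for you.</p>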



<h5 class="wp-block-heading">Why should you use this?</h5>



<ul class="wp-block-list">
<li>Drifter-ML is written specifically for Scikit-learn and acts as an extension to it. All classes and methods follow Scikit-learn conventions, so data and model testing become relatively straightforward.&nbsp;</li>
</ul>



<ul class="wp-block-list">
<li>On a side note, if you enjoy working on open-source libraries, you can extend drifter-ml to other machine learning and deep learning frameworks such as PyTorch.&nbsp;</li>
</ul>



<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_ecedf7f8833fa0750959f05284aa7cf3"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Built on top of Scikit-learn.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
Supports testing deep learning architectures, but only Keras models, via Scikit-learn&#8217;s Keras wrapper.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Open source library and open to contribution.             </li>
            </ul>
</div>



<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_76f21612f4a7714359264686da22f1de"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
It is not up to date, and its community is not very active.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It does not work well with other libraries.             </li>
            </ul>
</div>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-subscription-based-tools">Subscription-based tools</h3>



<h4 class="wp-block-heading">1. Kolena.io</h4>



<p><a href="https://www.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">Kolena.io</a> is a Python-based framework for ML testing. It also includes an online platform where the results and insights can be logged. Kolena focuses mostly on the ML unit testing and validation process at scale.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-11.png?ssl=1" alt="Kolena.io dashboard" class="wp-image-71579"/><figcaption class="wp-element-caption"><em>Kolena.io dashboard example | <a href="https://www.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Why should you use this?</h5>



<p>Kolena argues that the split test dataset methodology isn’t as reliable as it seems. A train/test split represents the global population distribution but fails to capture local representations at a granular level, especially within an individual label or class. Hidden nuances in the features remain undiscovered. As a result, a model can fail in the real world even though it yields good performance metrics during training and evaluation.&nbsp;</p>



<p>One way of addressing this issue is to create a much more focused dataset by breaking a given class into smaller subclasses, or even by creating subsets of the features themselves. Such a dataset enables the ML model to extract features and representations at a much more granular level. It also improves performance by balancing bias and variance so that the model generalizes well to real-world scenarios.&nbsp;</p>



<p>For example, when building a classification model, a given class in the dataset can be broken down into various subsets and those subsets into finer subsets. This can enable users to test the model in various scenarios. In the table below, the CAR class is tested against several test cases to check the model’s performance on various attributes.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-12.png?resize=768%2C551&#038;ssl=1" alt="CAR class tested against several test cases" class="wp-image-71580" width="768" height="551"/><figcaption class="wp-element-caption">CAR class tested against several test cases to check the model’s performance on various attributes | <a href="https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>
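<p>The slicing idea needs no special tooling to understand: group test samples by a metadata attribute and compute the metric per slice instead of one global score. A small sketch with made-up results (the attribute names and numbers are illustrative, not from Kolena):</p>

```python
from collections import defaultdict

def per_slice_accuracy(samples):
    """samples: list of (slice_name, y_true, y_pred) tuples -> accuracy per slice."""
    hits, totals = defaultdict(int), defaultdict(int)
    for slice_name, y_true, y_pred in samples:
        totals[slice_name] += 1
        hits[slice_name] += int(y_true == y_pred)
    return {name: hits[name] / totals[name] for name in totals}

# illustrative results for a CAR classifier, sliced by lighting conditions
results = [
    ("daytime", 1, 1), ("daytime", 1, 1), ("daytime", 0, 0), ("daytime", 1, 1),
    ("night",   1, 0), ("night",   1, 1), ("night",   0, 1), ("night",   1, 0),
]
scores = per_slice_accuracy(results)
global_acc = sum(int(t == p) for _, t, p in results) / len(results)
print(global_acc)        # 0.625 overall
print(scores["night"])   # 0.25 -- the aggregate hides a failing slice
```

<p>The aggregate score blends a perfect daytime slice with a failing night slice, which is exactly the kind of hidden nuance slice-level testing surfaces.</p>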


<p>Another benefit is that whenever a new real-world scenario appears, a new test case can be designed and run immediately. Likewise, users can build more comprehensive test cases for a variety of tasks and train or build a model. Users can also generate a detailed report on a model’s performance in each category of test cases and compare it to previous models with each iteration.</p>



<p>To sum up, Kolena offers:</p>



<ul class="wp-block-list">
<li>An easy-to-use Python framework</li>



<li>Automated workflow testing and deployment</li>



<li>Faster model debugging</li>



<li>Faster model deployment</li>
</ul>



<p>If you are working on a large-scale deep learning model that will be complex to monitor, Kolena will be beneficial.&nbsp;</p>



<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_f35e837c87b49c17aedbb0b5f2313e9a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Supports Deep Learning architectures.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Kolena Test Case Studio offers to curate customizable test cases for the model.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It allows users to prepare quality tests by removing noise and improving annotations.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
It can automatically diagnose failure modes and pinpoint the exact underlying issue.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Integrates seamlessly into the ML pipeline.             </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-13.png?ssl=1" alt="App Kolena.io " class="wp-image-71581"/><figcaption class="wp-element-caption"><em>View from the Kolena.io app | Source</em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_8a424d976e1022cd4a50f878ecaf1176"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Subscription-based model (pricing not mentioned).<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                In order to download the framework, you need a CloudRepo pass.             </li>
            </ul>
</div>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip3 install --extra-index-url <span class="hljs-string" style="color: rgb(221, 17, 68);">"$CR_URL"</span> kolena-client</pre>



<h3 class="wp-block-heading" id="h-2-robust-intelligence">2. Robust Intelligence</h3>



<p>Robust Intelligence is an end-to-end (E2E) ML platform that offers various services around ML integrity. The framework is written in Python and allows you to customize your code according to your needs. It also integrates with an online dashboard that provides insights into the various tests on data and model performance, as well as model monitoring. All these services target the ML model and data right from training to the post-production phase.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-14.png?ssl=1" alt="Robust intelligence " class="wp-image-71582"/><figcaption class="wp-element-caption"><em>Robust intelligence features | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Why should you use this?</h5>



<p>The platform offers services like:</p>



<p><strong>1. AI stress testing,</strong> which includes hundreds of tests to automatically evaluate the performance of the model and identify potential drawbacks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-15.png?ssl=1" alt="AI stress testing" class="wp-image-71583"/><figcaption class="wp-element-caption"><em>Evaluating the performance of the model | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>2. AI Firewall, </strong>which automatically creates a wrapper around the trained model to protect it from bad data in real-time. The wrapper is configured based on the model. It also automatically checks both the data and model, reducing manual effort and time.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-16.png?ssl=1" alt="AI Firewall" class="wp-image-71584"/><figcaption class="wp-element-caption"><em>Prevention of model failures in production | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>
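<p>Conceptually, such a firewall is a validating wrapper: incoming rows are checked against the ranges observed at training time, and out-of-range rows are rejected before they ever reach the model. A hand-rolled sketch of the concept (not Robust Intelligence's actual API):</p>

```python
class ModelFirewall:
    """Reject inputs that fall outside per-feature ranges seen during training."""

    def __init__(self, model, train_rows, tolerance=0.1):
        self.model = model
        # record min/max per feature, widened by a tolerance margin
        self.bounds = []
        for col in zip(*train_rows):
            lo, hi = min(col), max(col)
            margin = (hi - lo) * tolerance
            self.bounds.append((lo - margin, hi + margin))

    def predict(self, row):
        for value, (lo, hi) in zip(row, self.bounds):
            if not (lo <= value <= hi):
                raise ValueError(f"input value {value} outside trained range [{lo}, {hi}]")
        return self.model(row)

# toy model and training data
train = [[0.0, 10.0], [1.0, 12.0], [0.5, 11.0]]
fw = ModelFirewall(lambda row: int(sum(row) > 11), train)
print(fw.predict([0.5, 11.5]))   # within the trained ranges, so the model runs
```

<p>A production firewall would flag or quarantine bad rows rather than raise, but the shape is the same: validation sits in front of <code>predict</code>.</p>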


<p><strong>3. AI continuous testing</strong>, which monitors the deployed model and automatically tests it to check for updates and retraining needs. The testing covers data drift, errors, root cause analysis, anomaly detection, and so on. All the insights gained during continuous testing are displayed on the dashboard.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-17.png?ssl=1" alt="AI continuous testing" class="wp-image-71585"/><figcaption class="wp-element-caption"><em>Monitoring model in production | <a href="https://www.robustintelligence.com/platform/overview" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Robust Intelligence enables model testing, model protection during deployment, and model monitoring after deployment. Since it is an E2E platform, all phases can be easily automated, with hundreds of stress tests run on the model to make it production-ready. If the project is fairly large, Robust Intelligence will give you an edge.&nbsp;</p>
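<p>As a flavor of what a continuous data-drift test computes, here is a deliberately simple check: compare each live feature's mean against the training distribution and alert when the shift exceeds a threshold. Real platforms use richer statistics (e.g., PSI or Kolmogorov-Smirnov tests), but the shape is the same:</p>

```python
import statistics

def mean_shift_score(train_values, live_values):
    """How many training standard deviations the live mean has moved."""
    mu = statistics.fmean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.fmean(live_values) - mu) / sigma

def detect_drift(train_values, live_values, threshold=1.0):
    """Alert when the standardized mean shift exceeds the threshold."""
    return mean_shift_score(train_values, live_values) > threshold

train = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5]   # training distribution
stable = [10.2, 9.8, 10.1, 10.4]                         # production, no drift
drifted = [14.0, 15.0, 13.5, 14.5]                       # production, drifted
print(detect_drift(train, stable))   # False
print(detect_drift(train, drifted))  # True
```
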



<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_f57daf41c19f462ee2a211e50497a809"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Supports deep learning frameworks<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Flexible and easy to use<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Customisable<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Scalable            </li>
            </ul>
</div>



<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_28d9482eadfb67a947c16f0969223daf"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Only for enterprise.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Few details are available online.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Expensive: One-year subscription costs around $60,000.            </li>
            </ul>
</div>



<p class="has-text-align-left"><span style="color: initial;"><em>(</em></span><a href="https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a><strong style="color: initial;"><em>)</em></strong></p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-hybrid-frameworks">Hybrid frameworks</h3>



<h4 class="wp-block-heading">1. Etiq.ai</h4>



<p>​​<a href="https://etiq.ai/" target="_blank" rel="noreferrer noopener nofollow">Etiq </a>is an AI-observability platform that supports AI/ML lifecycle. Like Kolena and Robust Intelligence, the framework offers ML Model testing, monitoring, optimization, and explainability.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-18.png?ssl=1" alt="Etiq.ai" class="wp-image-71586"/><figcaption class="wp-element-caption"><em>The dashboard of Etiq.ai | <a href="https://docs.google.com/document/d/1oJ20eZeuuuFigdi4P4rqgcLONZ2ulimFn4XMBQ8M9Fw/edit#heading=h.upirzig57bbx" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Etiq is considered to be a hybrid framework as it offers both offline and online implementation. Etiq has four tiers of usage:</p>



<ol class="wp-block-list">
<li><strong>Free and public</strong>: It includes free usage of the library as well as the dashboard. Keep in mind that results and metadata are stored in your dashboard instance the moment you log in to the platform, but in return you receive the full set of features.&nbsp;</li>



<li><strong>Free and limited</strong>: If you want a free but private testing environment for your project and don’t want to share any information, you can use the library without logging into the platform. Keep in mind that you will not receive the full benefits you would have by logging in.&nbsp;&nbsp;</li>



<li><strong>Subscribe and private</strong>: If you want the full benefits of Etiq.ai, you can subscribe to their plan and use their tools in your own private environment. Etiq.ai is already available on the AWS Marketplace, starting at around $3.00/hour or $25,000.00/year.&nbsp;</li>



<li><strong>Personalized request</strong>: If you require functionality beyond what Etiq.ai provides, like explainability, robustness, or team-share functionality, you can contact them and get your own personalized test suite.&nbsp;&nbsp;</li>
</ol>



<h5 class="wp-block-heading">Structure of the framework&nbsp;</h5>



<p>Etiq follows a structure similar to DeepChecks. This structure remains the core of the framework:</p>



<ul class="wp-block-list">
<li><strong>Snapshot</strong>: It is a combination of dataset and model in the pre-production testing phase.&nbsp;</li>



<li><strong>Scan</strong>: It is usually a test that is applied to the snapshot.</li>



<li><strong>Config</strong>: It is usually a JSON file that contains a set of parameters that will be used by the scan for running tests in the snapshot.</li>



<li><strong>Custom test</strong>: It allows you to customize your tests by adding and editing various metrics to the config file.&nbsp;</li>
</ul>
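<p>For illustration, a config could be a small JSON document pairing a scan type with its parameters. The keys below are hypothetical, chosen for the example; consult the Etiq docs for the real schema:</p>

```python
import json

# hypothetical drift-scan config (illustrative keys, not Etiq's actual schema)
config = {
    "scan": "drift",
    "params": {
        "features": ["A", "B", "C"],
        "drift_threshold": 0.1,
        "comparison_dataset": "train",
    },
}

# serialized form, as it would be stored in the config file
config_json = json.dumps(config, indent=2)

# the scan later reloads the config and applies it to the snapshot
loaded = json.loads(config_json)
print(loaded["scan"], loaded["params"]["drift_threshold"])
```
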



<p>Etiq offers two types of tests: <strong>Scan</strong> and <strong>Root Cause Analysis</strong> (RCA), the latter being an experimental pipeline. The scan type offers:</p>



<ul class="wp-block-list">
<li><strong>Accuracy</strong>: In some cases, high accuracy can indicate a problem just as low accuracy can. In such cases, an ‘accuracy’ scan can be helpful. If the accuracy is too high, then you might do a leakage scan, or if it is low, then you can do a drift scan.&nbsp;</li>



<li><strong>Leakage</strong>: It helps you to find data leakage.&nbsp;</li>



<li><strong>Drift</strong>: It can help you to find feature drift, target drift, concept drift, and prediction drift.&nbsp;</li>



<li><strong>Bias</strong>: Refers to algorithmic bias, where automated decision-making causes unintended discrimination.&nbsp;</li>
</ul>
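<p>Of these, a leakage scan is the easiest to sketch by hand: a feature that correlates almost perfectly with the target is a red flag. A toy check using a self-contained Pearson correlation (illustrative only; a real leakage scan is more sophisticated):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def leakage_suspects(features, target, threshold=0.95):
    """Flag features whose |correlation| with the target exceeds the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) > threshold]

# toy tabular data: "loan_repaid" is a post-outcome column that copies the target
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "income":      [30.0, 52.0, 45.0, 61.0, 58.0, 33.0, 35.0, 31.0],
    "loan_repaid": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
}
print(leakage_suspects(features, target))  # ['loan_repaid']
```
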



<h5 class="wp-block-heading">Why should you use this?</h5>



<p>Etiq.ai offers a multi-step pipeline, which means you can monitor tests by logging the results of each step in the ML pipeline. This allows you to identify and repair bias within the model. If you are looking for a framework that can do the heavy lifting of your AI pipeline, Etiq.ai is the one to go with.&nbsp;</p>



<p>Some other reasons why you should use Etiq.ai:</p>



<div id="case-study-numbered-list-block_57f77c519e1c6252998be57a660d6d1c"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It is a Python Framework<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Dashboard facility for multiple insights and optimization reporting<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                You can manage multiple projects.             </li>
            </ul>
</div>



<p>All the points above are valid for free tier usage.&nbsp;</p>



<p>One key feature of Etiq.ai is that it allows you to be very precise and straightforward in your model building and deployment approaches. It aims to give users tools that help them achieve the desired model. At times, the development process drifts away from the original plan, mostly because of a lack of tools needed to shape the model. If you want to deploy a model that is aligned with the proposed requirements, Etiq.ai is the way to go. This is because the framework offers <strong>similar tests at each step throughout your ML pipeline.&nbsp;</strong></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/5-tools-that-will-help-you-setup-production-ML-model-testing-19.png?ssl=1" alt="Etiq.ai " class="wp-image-71587"/><figcaption class="wp-element-caption"><em>Steps of the process when to use Etiq.ai | <a href="https://docs.etiq.ai/#why-use-etiq-for-ml-testing" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h5 class="wp-block-heading">Key features&nbsp;</h5>



<div id="case-study-numbered-list-block_e700c04d8752c688f4c840db614bf7fa"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                A lot of functionalities in the free tier.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Test each of the pipelines for better monitoring<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Supports deep learning frameworks like PyTorch and Keras-Tensorflow<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                You can request a personalized test library.             </li>
            </ul>
</div>



<h5 class="wp-block-heading">Key drawbacks</h5>



<div id="case-study-numbered-list-block_ca0a5b8e52fd3f953d543c1e8dc4c24f"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                At the moment, in production, they only provide functionality for batch processing.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
To apply tests to tasks pertaining to segmentation, regression, or recommendation engines, you must get in touch with the team.             </li>
            </ul>
</div>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>The ML testing frameworks that we discussed are directed toward the needs of the users. All of the frameworks have their own pros and cons. But you can definitely get by using any one of these frameworks. ML model testing frameworks play an integral part in defining how the model will perform when deployed to a real-world scenario.&nbsp;</p>



<p>If you are looking for a free and easy-to-use ML testing framework for structured datasets and smaller ML models, then go with DeepChecks. If you are working with DL algorithms, then Etiq.ai is a good option. But if you can spare some money, then you should definitely inquire about Kolena. And lastly, if you are working in a mid to large-size enterprise and looking for ML testing solutions, then hands-down, it has to be Robust Intelligence.&nbsp;</p>



<p>I hope this article provided you with all the preliminary information needed for you to get started with ML testing. Please share this article with everyone who needs it.&nbsp;</p>



<p>Thanks for reading!!!</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-reference">Reference</h3>



<ol class="wp-block-list">
<li><a href="https://www.robustintelligence.com/" target="_blank" rel="noreferrer noopener nofollow">https://www.robustintelligence.com/</a></li>



<li><a href="https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta" target="_blank" rel="noreferrer noopener nofollow">https://aws.amazon.com/marketplace/pp/prodview-23bciknsbkgta</a></li>



<li><a href="https://etiq.ai/" target="_blank" rel="noreferrer noopener nofollow">https://etiq.ai/</a></li>



<li><a href="https://docs.etiq.ai/" target="_blank" rel="noreferrer noopener nofollow">https://docs.etiq.ai/</a></li>



<li><a href="https://arxiv.org/pdf/2005.04118.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/2005.04118.pdf</a></li>



<li><a href="https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/kolena-ml/best-practices-for-ml-model-testing-224366d3f23c</a></li>



<li><a href="https://docs.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">https://docs.kolena.io/</a></li>



<li><a href="https://www.kolena.io/" target="_blank" rel="noreferrer noopener nofollow">https://www.kolena.io/</a></li>



<li><a href="https://github.com/EricSchles/drifter_ml" target="_blank" rel="noreferrer noopener nofollow">https://github.com/EricSchles/drifter_ml</a></li>



<li><a href="https://arxiv.org/pdf/2203.08491.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/2203.08491.pdf</a></li>



<li><a href="https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71" target="_blank" rel="noreferrer noopener nofollow">https://medium.com/@ptannor/new-open-source-for-validating-and-testing-machine-learning-86bb9c575e71</a></li>



<li><a href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">https://deepchecks.com/ </a></li>



<li><a href="https://www.xenonstack.com/insights/machine-learning-model-testing" target="_blank" rel="noreferrer noopener nofollow">https://www.xenonstack.com/insights/machine-learning-model-testing</a></li>



<li><a href="https://www.jeremyjordan.me/testing-ml/" target="_blank" rel="noreferrer noopener nofollow">https://www.jeremyjordan.me/testing-ml/</a></li>



<li><a href="https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/ml-model-testing-teams-share-how-they-test-models</a></li>



<li><a href="https://mlops.toys" target="_blank" rel="noreferrer noopener nofollow">https://mlops.toys</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7265</post-id>	</item>
		<item>
		<title>Building MLOps Pipeline for Computer Vision: Image Classification Task [Tutorial]</title>
		<link>https://neptune.ai/blog/mlops-pipeline-for-computer-vision-image-classification</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Mon, 01 Aug 2022 15:16:11 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.test/mlops-pipeline-for-computer-vision-image-classification/</guid>

					<description><![CDATA[The introduction of Transformers in 2018 by Vaswani and the team brought a significant transformation in the research and development of deep learning models for various tasks. The transformer leverages a self-attention mechanism that was adopted from the attention mechanism by Bahdanau and the team. With this mechanism, one input could interact with other inputs&#8230;]]></description>
										<content:encoded><![CDATA[
<p>The introduction of <a href="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf" target="_blank" rel="noreferrer noopener nofollow">Transformers in 2017 by Vaswani</a> and the team brought a significant transformation in the research and development of deep learning models for various tasks. The transformer leverages a self-attention mechanism adopted from the attention mechanism by Bahdanau and the team. With this mechanism, one input can interact with the other inputs, enabling it to focus, or pay attention, to the important features of the data.&nbsp;</p>
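<p>The mechanism can be written down in a few lines: each input compares itself against every other input, the scaled dot products are turned into weights with a softmax, and each output is a weighted average of the inputs. A minimal single-head sketch in plain Python, with identity query/key/value projections for brevity:</p>

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(inputs):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(inputs[0])
    outputs = []
    for q in inputs:                                    # each input acts as a query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in inputs]                      # similarity with every input
        weights = softmax(scores)                       # how much to attend to each
        outputs.append([sum(w * v[i] for w, v in zip(weights, inputs))
                        for i in range(d)])             # weighted average of values
    return outputs

# three 2-d token embeddings; the first two are similar and attend to each other
tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
out = self_attention(tokens)
```

<p>A real transformer layer additionally learns the query, key, and value projection matrices and runs several such heads in parallel.</p>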



<p>Because of this, transformers were able to achieve state-of-the-art results in various <a href="/category/natural-language-processing" target="_blank" rel="noreferrer noopener">NLP</a> tasks like machine translation, summary generation, and text generation. They have also replaced RNNs and their variants in almost all NLP tasks. With this success in NLP, transformers are now being adopted in <a href="/category/computer-vision" target="_blank" rel="noreferrer noopener">computer vision</a> tasks as well. In 2020, Dosovitskiy and his team developed the vision transformer (ViT), arguing that reliance on CNNs is not necessary. Based on this premise, in this article, we will explore how ViT can help with the task of image classification.&nbsp;&nbsp;</p>



<p>This article is a guide aimed at <strong>building an MLOps pipeline for a computer vision</strong> task using <a href="https://www.google.com/url?q=https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html&amp;sa=D&amp;source=docs&amp;ust=1658825455903181&amp;usg=AOvVaw23I6LB81bRMZdj0MrMq7ID" target="_blank" rel="noreferrer noopener nofollow">ViT</a>, and it will focus on the following areas with respect to a typical data science project:</p>



<ol class="wp-block-list">
<li>Aim of the project</li>



<li>Hardware specification</li>



<li>Attention visualization&nbsp;</li>



<li>Building the model and experiment tracking</li>



<li>Testing and inference</li>



<li>Creating a Streamlit app for deployment</li>



<li>Setting up CI/CD using GitHub actions</li>



<li>Deployment and monitoring</li>
</ol>



<section id="blog-intext-cta-block_7a4db89e8655d5539c97cc46b2397ada" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also</h3>
    
            <p>  <a href="/blog/mlops-pipeline-for-time-series-prediction-tutorial"> Building MLOps Pipeline for Time Series Prediction [Tutorial]</a></p>
<p>  <a href="/blog/mlops-pipeline-for-nlp-machine-translation">Building MLOps Pipeline for NLP: Machine Translation Task [Tutorial]</a></p>
    
    </section>



<p>The code for this article can be found in this <a href="https://github.com/Nielspace/ViT-Pytorch" target="_blank" rel="noreferrer noopener nofollow"><strong>GitHub</strong></a> repository so that you can follow along. Let’s get started.&nbsp;</p>



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-understanding-the-project">MLOps pipeline for image classification: understanding the project</h2>



<p>Understanding the requirements of the project or the client is an important step, as it helps us brainstorm ideas and research the various components that the project might require, such as the latest papers, repositories, relevant work, datasets, and even cloud-based platforms for deployment. This section focuses on two topics:&nbsp;</p>



<div id="case-study-numbered-list-block_36f63fc68a8c34542343bc45072b4d03"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Aim of the project.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Hardware for accelerated training.            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-aim-of-the-project-bird-image-classifier">Aim of the project: bird image classifier&nbsp;</h3>



<p>The aim of the project is to build an image classifier that classifies different species of birds. Since this model will later be deployed in the cloud, we must keep in mind that it has to achieve good accuracy on both the training and testing datasets. To verify that, we will use metrics like precision, recall, the confusion matrix, F1, and the AUROC score to see how the model performs on both datasets. Once the model achieves good scores on the test dataset, we will create a web app and deploy it on a cloud-based server.&nbsp;</p>
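<p>As a toy illustration of how the metrics above relate, precision, recall, and F1 can be computed by hand from dummy binary predictions. This is not the project’s evaluation code, which would run on the real multi-class test set:</p>

```python
# Toy illustration of precision, recall, and F1 on dummy binary predictions;
# the actual project would compute these on the multi-class bird test set.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```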



<section id="blog-intext-cta-block_f0fd4838c4f16ff20633b0f59fe74d6c" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more:</h3>
    
            <p>  <a href="/blog/f1-score-accuracy-roc-auc-pr-auc">F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?</a></p>
    
    </section>



<p>In a nutshell, this is how the project will be executed:</p>



<div id="case-study-numbered-list-block_5151f0189dfaa4cd937afd5c2e46b3d9"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Building the deep learning model with PyTorch<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Testing the model<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Creating a Streamlit app<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Creating directories and their respective config files for deployment<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Finally, deploying it on the Google Cloud Platform            </li>
            </ul>
</div>



<p>Along the way, this project will include some additional practices, such as:&nbsp;</p>



<ul class="wp-block-list">
<li>Live tracking to monitor metrics,</li>



<li>Attention visualization,</li>



<li>Directory structure,</li>



<li>Code formatting for all the Python modules.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-hardware-for-accelerated-training">Hardware for accelerated training</h3>



<p>We will conduct our experiment with two sets of hardware:</p>



<ol class="wp-block-list">
<li><strong>M1 MacBook</strong>: The efficiency of Apple’s M1 processors allows us to quickly develop models and train them on a smaller dataset. Once training is done, we can build a web application on our local machine and create a small pipeline of data ingestion, data preprocessing, model prediction, and attention visualization before scaling the model up in the cloud.&nbsp;</li>
</ol>



<p><strong>Note</strong>: If you have one of these M1 laptops, make sure to check the installation process in my <a href="https://github.com/Nielspace/ViT-Pytorch" target="_blank" rel="noreferrer noopener nofollow">GitHub repo</a>.</p>



<ol start="2" class="wp-block-list">
<li><strong>Kaggle or Google Colab GPUs</strong>: Once our code works properly on the local machine and the pipeline is created, we can scale up and train the whole model for a longer period in Google Colab or Kaggle, both of which are free. Once training is done, we can download the new weights and metadata to our local computer and test whether the web application performs well on unseen data before deploying it to the cloud.&nbsp;</li>
</ol>



<p>Now let’s start the implementation.&nbsp;</p>



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-data-preparation">MLOps pipeline for image classification: data preparation</h2>



<p>The first step of implementing a deep learning project is to plan the different Python modules that we are going to need. Although we will use a Jupyter notebook for experimentation, it is always a good idea to have everything laid out before starting to code. Planning might include collecting reference code repositories as well as research papers.&nbsp;</p>



<p>It also pays to set up the project’s directory structure early, for efficiency and ease of navigation.&nbsp;&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">ViT Classification
├── notebooks
│   └── ViT.ipynb
└── source
    └──config.py
</pre>



<p>In our case, the main directory is called ViT Classification, and it contains two folders:&nbsp;</p>



<ol class="wp-block-list">
<li><strong>Notebooks</strong>: This is where all the experimentation with Jupyter notebooks will reside.</li>



<li><strong>Source</strong>: This is where all the Python modules will reside.&nbsp;</li>
</ol>



<p>As we progress, we will keep adding Python modules to the source directory, and we will also create different sub-directories for storing metadata, docker files, README.md files, et cetera.&nbsp;</p>



<h3 class="wp-block-heading" id="h-building-the-image-classification-model">Building the image classification model</h3>



<p>As mentioned before, research and planning are key to implementing any machine learning project. What I usually do first is create a config.py to store all the parameters related to data preprocessing, model training and inference, visualization, et cetera. </p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/config.py" target="_blank" rel="noreferrer noopener nofollow">config.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Config</span>:</span>
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Image configuration</span>
   IMG_SIZE = <span class="hljs-number" style="color: teal;">32</span>
   PATCH_SIZE = <span class="hljs-number" style="color: teal;">10</span>
   CROP_SIZE = <span class="hljs-number" style="color: teal;">100</span>
   BATCH_SIZE = <span class="hljs-number" style="color: teal;">1</span>
   DATASET_SAMPLE = <span class="hljs-string" style="color: rgb(221, 17, 68);">'full'</span>


   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Optimizer configuration</span>
   LR = <span class="hljs-number" style="color: teal;">0.003</span>
   OPTIMIZER = <span class="hljs-string" style="color: rgb(221, 17, 68);">'Adam'</span>

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Model configuration</span>
   NUM_CLASSES = <span class="hljs-number" style="color: teal;">400</span>
   IN_CHANNELS = <span class="hljs-number" style="color: teal;">3</span>
   HIDDEN_SIZE = <span class="hljs-number" style="color: teal;">768</span>
   NUM_ATTENTION_HEADS = <span class="hljs-number" style="color: teal;">12</span>
   LINEAR_DIM = <span class="hljs-number" style="color: teal;">3072</span>
   NUM_LAYERS = <span class="hljs-number" style="color: teal;">12</span>

   ATTENTION_DROPOUT_RATE = <span class="hljs-number" style="color: teal;">0.1</span>
   DROPOUT_RATE = <span class="hljs-number" style="color: teal;">0.1</span>
   STD_NORM = <span class="hljs-number" style="color: teal;">1e-6</span>
   EPS = <span class="hljs-number" style="color: teal;">1e-6</span>
   MLP_DIM = <span class="hljs-number" style="color: teal;">128</span>
   OUTPUT = <span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>
   LOSS_FN = <span class="hljs-string" style="color: rgb(221, 17, 68);">'nll_loss'</span>

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Device configuration</span>
   DEVICE = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>,<span class="hljs-string" style="color: rgb(221, 17, 68);">"mps"</span>,<span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span>]

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Training configuration</span>
   N_EPOCHS = <span class="hljs-number" style="color: teal;">1</span>
</pre>



<p>The above code block gives a rough idea of what the parameters look like. As we make progress, we can keep adding more parameters.&nbsp;</p>



<p><strong>Note</strong>: In the device configuration section, I have listed three hardware options: CPU, MPS, and CUDA. MPS, or Metal Performance Shaders, is the backend used to train on M1 MacBooks.&nbsp;&nbsp;</p>
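<p>At runtime, the entry to pick from this list can be resolved with a small helper. The sketch below (the helper name is my own) prefers CUDA, then MPS, then falls back to the CPU:</p>

```python
import torch

def pick_device() -> str:
    # Prefer NVIDIA GPUs, then Apple's Metal (MPS) backend, then the CPU.
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = torch.device(pick_device())
```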



<h4 class="wp-block-heading">Dataset</h4>



<p>The dataset that we will use is the bird classification dataset, which can be <a href="https://www.kaggle.com/datasets/gpiosenka/100-bird-species" target="_blank" rel="noreferrer noopener nofollow">downloaded from Kaggle</a>. The dataset consists of 400 classes of birds and three subsets: training, validation, and testing, containing 58,388, 2,000, and 2,000 images, respectively. Once the data has been downloaded, we can create a function to read and visualize the images.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-1.png?ssl=1" alt="sample from the dataset" class="wp-image-69916" style="width:458px;height:469px"/><figcaption class="wp-element-caption"><em>The image above is a sample from the dataset along with the class that it belongs to&nbsp;| <a href="https://www.kaggle.com/datasets/gpiosenka/100-bird-species" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Preparing the data</h4>



<p>We can move ahead and create a data loader that transforms the images into image tensors. Along with that, we will also perform resizing, cropping, and normalization. Once preprocessing is done, we can use the DataLoader class to automatically generate training data in batches. The following pseudo-function gives an idea of what we are trying to achieve; you can find the full code via the link in the code heading:</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/preprocessing.py" target="_blank" rel="noreferrer noopener nofollow">preprocessing.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#apply the desired transformations on dataset and split it into train, validation, and test set.</span>

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">Dataset</span><span class="hljs-params">(bs, crop_size, sample_size=<span class="hljs-string" style="color: rgb(221, 17, 68);">'full'</span>)</span>:</span>
      <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> train_data, valid_data, test_data
</pre>



<p>The above function has a sample-size argument that allows creating a subset of the training dataset for testing purposes on your local machine.&nbsp;</p>
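<p>To make the shape of this helper concrete, here is a minimal stand-in that returns three DataLoaders over random tensors instead of the real image folders. Sizes and names below are illustrative only, not the repo’s code:</p>

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def Dataset(bs, crop_size, sample_size="full"):
    # Stand-in for the real function: random tensors instead of bird images.
    n_train = 64 if sample_size == "full" else 16

    def make_split(n):
        images = torch.randn(n, 3, crop_size, crop_size)   # preprocessed image tensors
        labels = torch.randint(0, 400, (n,))               # 400 bird classes
        return DataLoader(TensorDataset(images, labels), batch_size=bs, shuffle=True)

    return make_split(n_train), make_split(8), make_split(8)

train_data, valid_data, test_data = Dataset(bs=4, crop_size=32, sample_size="sample")
images, labels = next(iter(train_data))
print(images.shape)  # torch.Size([4, 3, 32, 32])
```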



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-building-the-vision-transformer-using-pytorch">MLOps pipeline for image classification: building the vision transformer using PyTorch</h2>



<p>I have created the full model as per the authors&#8217; description of ViT in their paper. This code is inspired by the <a href="https://github.com/jeonsworld/ViT-pytorch" target="_blank" rel="noreferrer noopener nofollow"><strong>jeonsworld</strong></a> repo; I have added a few more details and edited some lines of code for the purpose of this task.&nbsp;</p>



<p>The model I have created is divided into 9 modules, and each module can be executed independently for various tasks. We will explore each section in turn, for ease of understanding.&nbsp;</p>



<h3 class="wp-block-heading" id="h-embedding">Embedding</h3>



<p>Transformers, like all natural language models, have an important component called <strong>embedding</strong>. Its function is usually to capture semantic information by grouping similar information together. Apart from that, embeddings can be learned and reused across models.&nbsp;</p>



<p>In ViT, embeddings serve the same purpose by retaining positional information, which is fed into the encoder. Again, the following pseudo-code will help you understand what’s going on, and you can find the full code via the link in the code heading.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/embeddings.py" target="_blank" rel="noreferrer noopener nofollow">embedding.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Embeddings</span><span class="hljs-params">(nn.Module)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Construct the embeddings from patch, position embeddings.</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, img_size:int, hidden_size:int, in_channels:int)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#create a CONV2D object for creation of embeddings </span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#calculate and return embeddings</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> embeddings
</pre>



<p>Note that the embedding patches for the image can be created using a convolution layer. This is quite efficient and easy to modify as well.&nbsp;</p>
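<p>A minimal sketch of this idea follows. Dimensions roughly follow the config above, except that I use an 8-pixel patch so it divides the 32-pixel image evenly; this is an illustration, not the repo’s exact Embeddings class:</p>

```python
import torch
import torch.nn as nn

# Conv-based patch embedding: a convolution whose kernel and stride both equal
# the patch size cuts the image into non-overlapping patches and projects each
# one to the hidden size in a single operation.
img_size, patch_size, in_channels, hidden_size = 32, 8, 3, 768
patch_embed = nn.Conv2d(in_channels, hidden_size,
                        kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, in_channels, img_size, img_size)  # one RGB image
patches = patch_embed(x)                             # (1, 768, 4, 4) feature map
tokens = patches.flatten(2).transpose(1, 2)          # one 768-dim token per patch
print(tokens.shape)  # torch.Size([1, 16, 768])
```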



<h3 class="wp-block-heading" id="h-encoder">Encoder</h3>



<p>The encoder is made up of a number of attention blocks, each of which has two important modules:</p>



<div id="case-study-numbered-list-block_f222b8f23e7ecf57a2f6928059cf9ac6"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Self Attention Mechanism<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Multi-layer perceptron (MLP)            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Self attention mechanism</h4>



<p>Let’s start with the self-attention mechanism.&nbsp;</p>



<p>The self-attention mechanism is the core of the whole system. It enables the model to focus on the important features of the data. It does so by relating different positions of a single sequence to compute a representation of that same sequence. You can find the link to the entire code below to get a deeper picture.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention.py" target="_blank" rel="noreferrer noopener nofollow">attention.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Calculate the attention and return the attention output along with the weights</span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Attention</span><span class="hljs-params">(nn.Module)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> attention_output, weights</pre>



<p>The attention block yields the attention output as well as the attention weights. The latter will be used to visualize the regions of interest computed by the attention mechanism.&nbsp;</p>
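<p>For intuition, here is a single-head sketch of scaled dot-product attention that returns the same (output, weights) pair. The repo’s Attention class is multi-headed, so treat this as an illustration only:</p>

```python
import math
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    # Single-head scaled dot-product attention (illustrative stand-in).
    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, x):
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Every token scores every other token; softmax turns scores into weights.
        weights = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return weights @ v, weights  # attention output plus weights for visualization

x = torch.randn(1, 17, 768)          # 16 patch tokens + 1 class token
output, weights = SimpleAttention(768)(x)
print(output.shape, weights.shape)   # (1, 17, 768) and (1, 17, 17)
```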



<h4 class="wp-block-heading">Multilayer perceptron</h4>



<p>Once we receive the attention output, we can feed it into the MLP, which gives us a probability distribution for classification. You can get an idea of the entire process from the forward function. To see the full code, click the link provided in the code heading below.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/linear.py" target="_blank" rel="noreferrer noopener nofollow">linear.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Apply a linear transformation to the incoming attention output using the GELU activation function.</span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Mlp</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, hidden_size, linear_dim, dropout_rate, std_norm)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> x</pre>



<p>It is worth noting that we are using GELU as our activation function.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-2.png?ssl=1" alt="activation function" class="wp-image-69917"/><figcaption class="wp-element-caption"><em>GELU as activation function | <a href="https://mlfromscratch.com/activation-functions-explained/#/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>One of the advantages of GELU is that it helps avoid vanishing gradients, which makes the model easier to scale.&nbsp;</p>
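<p>A rough sketch of such an MLP block, using the hidden and linear dimensions from the config above (this is not the repo’s exact Mlp class):</p>

```python
import torch
import torch.nn as nn

class Mlp(nn.Module):
    # Two linear layers with GELU and dropout in between, projecting the
    # hidden size up to the linear dimension and back down again.
    def __init__(self, hidden_size=768, linear_dim=3072, dropout_rate=0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, linear_dim)
        self.fc2 = nn.Linear(linear_dim, hidden_size)
        self.act = nn.GELU()
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        x = self.dropout(self.act(self.fc1(x)))
        return self.dropout(self.fc2(x))

x = torch.randn(1, 17, 768)   # 16 patch tokens + 1 class token
out = Mlp()(x)
print(out.shape)  # torch.Size([1, 17, 768])
```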



<h4 class="wp-block-heading">Attention-block</h4>



<p>The attention block is the module where we assemble both of the previous modules: the self-attention module and the MLP module.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention_block.py" target="_blank" rel="noreferrer noopener nofollow">attention_block.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Returns the calculated sum of attention scores via MLP along with attention weights.</span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Block</span><span class="hljs-params">(nn.Module)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> x, weights
</pre>



<p>This module also yields the attention weights directly from the attention mechanism, along with the distribution produced by the MLP.&nbsp;</p>
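<p>Structurally, the block follows the ViT paper: pre-norm self-attention and an MLP, each wrapped in a residual connection. Below is a self-contained sketch using PyTorch’s built-in multi-head attention; details may differ from the repo’s Block class:</p>

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # One transformer block: LayerNorm -> self-attention -> residual,
    # then LayerNorm -> MLP -> residual.
    def __init__(self, hidden_size=768, num_heads=12, linear_dim=3072, eps=1e-6):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, eps=eps)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, eps=eps)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, linear_dim), nn.GELU(),
                                 nn.Linear(linear_dim, hidden_size))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, weights = self.attn(h, h, h, need_weights=True)
        x = x + attn_out                 # residual around attention
        x = x + self.mlp(self.norm2(x))  # residual around the MLP
        return x, weights

x = torch.randn(1, 17, 768)
out, weights = Block()(x)
print(out.shape, weights.shape)  # (1, 17, 768) and (1, 17, 17)
```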



<p>Now let’s briefly look at the encoder. The encoder essentially enables us to create multiple attention blocks, which gives the transformer more control over the attention mechanism. The three components (Encoder, Transformer, and ViT) are written in the same module, i.e., <a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention_block.py" target="_blank" rel="noreferrer noopener nofollow">transformers.py</a>.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Creates multiple layers of attention blocks and returns encoded state and attention weights. </span>

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Encoder</span><span class="hljs-params">(nn.Module)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> encoded, attn_weights</pre>
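<p>As a self-contained stand-in for this stacking, torch.nn ships a ready-made encoder stack. Note that, unlike the repo’s Encoder, it does not expose each layer’s attention weights, which the visualization later relies on:</p>

```python
import torch
import torch.nn as nn

# A stack of transformer encoder layers; the config uses 12 layers, but two
# are enough here to illustrate the shape-preserving behaviour.
hidden_size, num_heads = 768, 12
layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                   dim_feedforward=3072, activation="gelu",
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(1, 17, hidden_size)  # 16 patch tokens + 1 class token
encoded = encoder(x)
print(encoded.shape)  # torch.Size([1, 17, 768])
```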



<h3 class="wp-block-heading" id="h-transformer">Transformer</h3>



<p>After assembling the attention block, we can code our transformer. The transformer is an assembly of the embedding module and the encoder module.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Transformer</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, img_size, hidden_size, in_channels, num_layers,
                num_attention_heads, linear_dim, dropout_rate, attention_dropout_rate,
                eps, std_norm)</span>:</span>
       super(Transformer, self).__init__()
       self.embeddings = Embeddings(img_size, hidden_size, in_channels)
       self.encoder = Encoder(num_layers, hidden_size, num_attention_heads,
                              linear_dim, dropout_rate, attention_dropout_rate,
                              eps, std_norm)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, input_ids)</span>:</span>
       embedding_output = self.embeddings(input_ids)
       encoded, attn_weights = self.encoder(embedding_output)
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> encoded, attn_weights
</pre>



<h3 class="wp-block-heading" id="h-vision-transformer">Vision transformer</h3>



<p>Finally, we can code our vision transformer, which consists of two components: the transformer and a final linear layer. The final linear layer gives us the probability distribution over all the classes. It can be described as:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">VisionTransformer</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, img_size, num_classes, hidden_size, in_channels, num_layers,
                num_attention_heads, linear_dim, dropout_rate, attention_dropout_rate,
                eps, std_norm)</span>:</span>
       super(VisionTransformer, self).__init__()
       self.classifier = <span class="hljs-string" style="color: rgb(221, 17, 68);">'token'</span>

       self.transformer=Transformer(img_size, hidden_size, in_channels,
                                    num_layers, num_attention_heads, linear_dim,
                                    dropout_rate, attention_dropout_rate, eps,
                                    std_norm)
       self.head = Linear(hidden_size, num_classes)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x, labels=None)</span>:</span>
       x, attn_weights = self.transformer(x)
       logits = self.head(x[:, <span class="hljs-number" style="color: teal;">0</span>])

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> labels <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span>:
           loss_fct = CrossEntropyLoss()
           loss = loss_fct(logits.view(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">400</span>), labels.view(<span class="hljs-number" style="color: teal;">-1</span>))
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> loss
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> logits, attn_weights</pre>



<p>Notice that the network consistently yields attention weights, which will be useful for visualizing the attention maps.&nbsp;</p>
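<p>One common way to turn those weights into a heatmap is to average the heads, take the class token’s row, and reshape it onto the patch grid. The names and sizes below are illustrative stand-ins, not the repo’s visualization code:</p>

```python
import torch

# Stand-in for the per-layer attention weights returned by the model:
# 12 layers, each of shape (batch, heads, tokens, tokens), 16 patches + 1 class token.
attn_weights = [torch.softmax(torch.randn(1, 12, 17, 17), dim=-1) for _ in range(12)]

last = attn_weights[-1].mean(dim=1)     # average over heads -> (1, 17, 17)
cls_to_patches = last[0, 0, 1:]         # class token's attention to the 16 patches
heatmap = cls_to_patches.reshape(4, 4)  # back onto the 4x4 patch grid for plotting
print(heatmap.shape)  # torch.Size([4, 4])
```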



<p>Here is a bonus tip: if you want to see the architecture of the model and how the inputs are operated on, use the following lines of code. They will generate the full operational architecture for you.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchviz <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_dot
x = torch.randn(<span class="hljs-number" style="color: teal;">1</span>,config.IN_CHANNELS*config.IMG_SIZE*config.IMG_SIZE)
x = x.reshape(<span class="hljs-number" style="color: teal;">1</span>,config.IN_CHANNELS,config.IMG_SIZE,config.IMG_SIZE)
logits, attn_weights = model(x)
make_dot(logits, params=dict(list(model.named_parameters()))).render(<span class="hljs-string" style="color: rgb(221, 17, 68);">"../metadata/VIT"</span>, format=<span class="hljs-string" style="color: rgb(221, 17, 68);">"png"</span>)</pre>



<p>You can find the image in the given <a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/metadata/VIT.png" target="_blank" rel="noreferrer noopener nofollow">link</a>.&nbsp;</p>



<p>But in a nutshell, this is how the architecture looks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-3.png?ssl=1" alt="vision transformer" class="wp-image-69918"/><figcaption class="wp-element-caption"><em>The architecture of vision transformer | <a href="https://arxiv.org/pdf/2010.11929.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-training-vision-transformer-using-pytorch">MLOps pipeline for image classification: training vision transformer using Pytorch</h2>



<p>The training module is where we assemble all the other modules, like the config module, the preprocessing module, and the Transformer, and log the parameters, including the metadata, into the <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> API. The easiest way to log parameters is to use Config.__dict__, which automatically converts a class into a dictionary. </p>



<section
	id="i-box-block_2e65339ef60ce26f3f9d1ca61c887287"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Disclaimer</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Please note that this article references a <strong>deprecated version of Neptune</strong>.</p>



<p>For information on the latest version with improved features and functionality, please <a href="/" target="_blank" rel="noreferrer noopener">visit our website</a>.</p>


	</div>

</section>



<p>You can later create a function that removes unnecessary attributes from the dictionary.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">neptune_monitoring</span><span class="hljs-params">()</span>:</span>
   PARAMS = {}
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> key, val <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> Config.__dict__.items():
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> key <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> [<span class="hljs-string" style="color: rgb(221, 17, 68);">'__module__'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'__dict__'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'__weakref__'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'__doc__'</span>]:
           PARAMS[key] = val
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> PARAMS</pre>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-training">Training&nbsp;</h3>



<p>The training function is straightforward to write. I have included both training and evaluation in the pseudo-code. You can find <a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/train.py" target="_blank" rel="noreferrer noopener nofollow">the full training block here</a>, or you can click the code heading below.</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/train.py" target="_blank" rel="noreferrer noopener nofollow">train.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">train_Engine</span><span class="hljs-params">(n_epochs, train_data, val_data, model, optimizer, loss_fn, device,
                monitoring=True)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Initiates the training procedure while tracking accuracy and loss over each iterations. </span>
</pre>
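<p>For a concrete picture, here is a minimal, illustrative sketch of what such a training engine typically does: loop over epochs, optimize on training batches, and evaluate on validation data. The model, data, and shapes below are toy stand-ins, and the Neptune logging and loss_fn string handling of the real train.py are omitted.</p>

```python
import torch
import torch.nn.functional as F
from torch import nn, optim

# Hedged sketch of a train_Engine-style loop; it assumes the nll_loss
# objective mentioned in train.py and skips monitoring/logging.
def train_engine(n_epochs, train_data, val_data, model, optimizer, device="cpu"):
    model.to(device)
    for epoch in range(n_epochs):
        model.train()
        for x, y in train_data:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = F.nll_loss(F.log_softmax(logits, dim=1), y)
            loss.backward()
            optimizer.step()
        # Evaluation pass: accuracy on the validation split
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_data:
                preds = model(x.to(device)).argmax(dim=1)
                correct += (preds == y.to(device)).sum().item()
                total += y.numel()
        print(f"epoch {epoch}: val_acc={correct / total:.2f}")

# Toy usage: a linear classifier on random data standing in for the ViT
torch.manual_seed(0)
data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(4)]
model = nn.Linear(16, 4)
train_engine(2, data, data, model, optim.Adam(model.parameters(), lr=1e-3))
```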



<p>Now that our training loop is complete, we can start the training and log the metadata to the Neptune dashboard, which we can use to monitor the training on the go, save charts and parameters, and share them with teammates. </p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/train.py" target="_blank" rel="noreferrer noopener nofollow">train.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> __name__ == <span class="hljs-string" style="color: rgb(221, 17, 68);">'__main__'</span>:
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> preprocessing <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dataset
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> config <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Config
   config = Config()
   params = neptune_monitoring(Config)

   run = neptune.init_run(project=<span class="hljs-string" style="color: rgb(221, 17, 68);">"nielspace/ViT-bird-classification"</span>,
                       api_token=API_TOKEN)
   run[<span class="hljs-string" style="color: rgb(221, 17, 68);">'parameters'</span>] = params

   model = VisionTransformer(img_size=config.IMG_SIZE,
                num_classes=config.NUM_CLASSES,
                hidden_size=config.HIDDEN_SIZE,
                in_channels=config.IN_CHANNELS,
                num_layers=config.NUM_LAYERS,
                num_attention_heads=config.NUM_ATTENTION_HEADS,
                linear_dim=config.LINEAR_DIM,
                dropout_rate=config.DROPOUT_RATE,
                attention_dropout_rate=config.ATTENTION_DROPOUT_RATE,
                eps=config.EPS,
                std_norm=config.STD_NORM)

   train_data, val_data, test_data = Dataset(config.BATCH_SIZE, config.IMG_SIZE,
                                             config.DATASET_SAMPLE)

   optimizer = optim.Adam(model.parameters(), lr=<span class="hljs-number" style="color: teal;">0.003</span>)
   train_Engine(n_epochs=config.N_EPOCHS, train_data=train_data, val_data=val_data,
               model=model,optimizer=optimizer, loss_fn=<span class="hljs-string" style="color: rgb(221, 17, 68);">'nll_loss'</span>,
               device=config.DEVICE[<span class="hljs-number" style="color: teal;">1</span>], monitoring=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre>



<p><strong>Note</strong>: The prototyping of this model was done on a MacBook Air M1 with a smaller dataset of 10 classes. The prototyping stage is where I tried different configurations and played with the architecture of the model. Once I was satisfied, I used Kaggle to train the model. Since the full dataset has 400 classes, the model needed to be larger and trained for a longer period of time.&nbsp;&nbsp;</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-experiment-tracking">Experiment tracking</h3>



<p>In the prototyping stage, experiment tracking becomes a very handy and reliable tool for improving your model. You can keep an eye on your model’s performance during training and make the necessary tweaks until you get a high-performing model.</p>



<p>The Neptune API enables you to:</p>



<ul class="wp-block-list">
<li>monitor the model’s training progress</li>



<li>and simultaneously upload the metrics into the system.</li>



<li>It also allows you to compare multiple runs involving different model configurations and simultaneously choose the best one.</li>
</ul>



<p>If you want to log your metadata in the system, then import the Neptune API and call the init function. Following that, enter the API key provided for the project, and you are good to go. Get to know more about how to <a href="https://docs-legacy.neptune.ai/setup/installation/" target="_blank" rel="noreferrer noopener">get started with Neptune here</a>. Also, <a href="https://app.neptune.ai/nielspace/ViT-bird-classification/experiments?split=tbl&amp;dash=charts&amp;viewId=standard-view" target="_blank" rel="noreferrer noopener">here is the Neptune dashboard</a>, which has the metadata related to this project.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">run = neptune.init_run(project=<span class="hljs-string" style="color: rgb(221, 17, 68);">"nielspace/ViT-bird-classification"</span>,
api_token=<span class="hljs-string" style="color: rgb(221, 17, 68);">"API_TOKEN"</span>)</pre>



<p>Once you are done with the initialization, you can start logging. For instance, if you want to:</p>



<ol class="wp-block-list">
<li>Upload the parameters, use: run[&#8216;parameters&#8217;] = params. <br>Note: make sure that the params are of dictionary class.</li>



<li>Upload metrics, use: run[&#8216;Training_loss&#8217;].log(loss.item()) and run[&#8216;Validation_loss&#8217;].log(loss.item())</li>



<li>Upload model weights, use: run[&#8220;model_checkpoints/ViT&#8221;].upload(&#8220;model.pt&#8221;)</li>



<li>Upload images, use: run[&#8220;val/conf_matrix&#8221;].upload(&#8220;confusion_matrix.png&#8221;)</li>
</ol>



<p>Depending upon what you are optimizing your model for, there are plenty of things that you can log and track. In our case, we put an emphasis on training and validation loss and accuracy.</p>



<h4 class="wp-block-heading">Logging metadata and dashboard</h4>



<p>In the ongoing training process, you can then monitor the model’s performance. With each iteration, the graph will update.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://neptune.ai/blog/mlops-pipeline-for-computer-vision-image-classification/attachment/mlops-pipeline-computer-vision-neptune-1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLOps-pipeline-computer-vision-neptune-1.png?ssl=1" alt="MLOps pipeline computer vision neptune 1" class="wp-image-70283"/></a><figcaption class="wp-element-caption"><em><a href="https://app.neptune.ai/nielspace/ViT-bird-classification/e/VIT-23/charts" target="_blank" rel="noreferrer noopener nofollow">Monitoring the model&#8217;s performance</a></em></figcaption></figure>
</div>


<p>Along with the model’s performance, you will also find CPU and GPU metrics. See the image below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://neptune.ai/blog/mlops-pipeline-for-computer-vision-image-classification/attachment/mlops-pipeline-computer-vision-neptune-2"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLOps-pipeline-computer-vision-neptune-2.png?ssl=1" alt="MLOps pipeline computer vision neptune 2" class="wp-image-70284"/></a><figcaption class="wp-element-caption"><em><a href="https://app.neptune.ai/nielspace/ViT-bird-classification/e/VIT-23/monitoring" target="_blank" rel="noreferrer noopener">CPU and GPU performance</a></em></figcaption></figure>
</div>


<p>You can find all the model metadata as well.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://neptune.ai/mlops-pipeline-computer-vision-6"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-6.png?ssl=1" alt="model metadata" class="wp-image-69921"/></a><figcaption class="wp-element-caption"><em>The model metadata</em></figcaption></figure>
</div>


<h3 class="wp-block-heading" class="wp-block-heading" id="h-scaling-using-kaggle">Scaling using Kaggle</h3>



<p>Now, let’s scale the model. We will use Kaggle for this project because it is free and because the dataset was downloaded from Kaggle, so it will be easy to scale and train the model on the platform itself.&nbsp;</p>



<ol class="wp-block-list">
<li>The first thing we need to do is to upload the model and change the directory path to Kaggle-specific paths and enable the GPUs.&nbsp;</li>
</ol>



<ol start="2" class="wp-block-list">
<li>Note that the model must be complex enough to capture the information relevant for prediction. Start scaling the model by gradually increasing the number of hidden layers and observing how the model behaves. You may not want to touch other parameters like the number of attention heads and the hidden size, because incompatible values can raise arithmetic (shape-mismatch) errors.&nbsp;</li>
</ol>
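<p>One concrete constraint behind those arithmetic errors: in multi-head attention, the hidden size is split evenly across the heads, so it must be divisible by the number of attention heads. A quick, illustrative sanity check (the values below are hypothetical, not the project&#8217;s actual config):</p>

```python
# Sanity check worth running before scaling the model: the hidden size is
# split evenly across attention heads, so it must divide cleanly.
def check_attention_config(hidden_size, num_attention_heads):
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size {hidden_size} is not divisible by "
            f"num_attention_heads {num_attention_heads}"
        )
    return hidden_size // num_attention_heads  # per-head dimension

head_dim = check_attention_config(768, 12)
print(head_dim)  # 64
```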



<ol start="3" class="wp-block-list">
<li>For each change, run the model for at least two epochs on small data batches with all 400 classes and observe whether the accuracy increases. Typically, it will.&nbsp;</li>
</ol>



<ol start="4" class="wp-block-list">
<li>Once satisfied, run the model for 10 to 15 epochs, which takes around 5 hours for the subset of 30,000 samples.&nbsp;</li>
</ol>



<ol start="5" class="wp-block-list">
<li>After the training, check its performance on the test dataset, and if it performs well, then download the model weights. At this point, the size of the model should be around 650 MB for 400 classes.&nbsp;</li>
</ol>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-attention-visualization">Attention visualization</h3>



<p>As mentioned before, self-attention is the crux of the Vision Transformer architecture, and interestingly, there is a way to visualize it. The source code for the attention map can be found <a href="https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb" target="_blank" rel="noreferrer noopener nofollow">here</a>. I have modified it a bit and turned it into a separate, independent module that uses the transformer’s output to produce the attention maps. The idea is to store the input image and its corresponding attention-map image and display them in the README.md file.&nbsp;</p>
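<p>As a simplified illustration of the core step, the sketch below averages one layer&#8217;s attention weights over the heads, reads off how much the [CLS] token attends to each image patch, and upsamples the patch grid to image resolution. The linked notebook goes further and multiplies the attention across layers (attention rollout); the shapes here are made up for the example.</p>

```python
import numpy as np

# Illustrative version of the attention-map computation (single layer,
# no rollout). attn has shape (num_heads, num_tokens, num_tokens),
# where token 0 is the [CLS] token.
def cls_attention_map(attn, img_size):
    mean_attn = attn.mean(axis=0)          # average over heads -> (tokens, tokens)
    cls_to_patches = mean_attn[0, 1:]      # [CLS] -> each patch, drop [CLS]->[CLS]
    grid = int(np.sqrt(cls_to_patches.size))
    patch_map = cls_to_patches.reshape(grid, grid)
    scale = img_size // grid
    # nearest-neighbour upsampling via a Kronecker product
    mask = np.kron(patch_map, np.ones((scale, scale)))
    return mask / mask.max()               # normalise to [0, 1] for overlaying

rng = np.random.default_rng(0)
attn = rng.random((12, 65, 65))            # 12 heads, 1 [CLS] + 8x8 patches
mask = cls_attention_map(attn, img_size=128)
print(mask.shape)  # (128, 128)
```

The normalised mask can then be multiplied with (or alpha-blended over) the original image, which is what produces the bright regions in the figure below.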



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/attention_viz.py" target="_blank" rel="noreferrer noopener nofollow">attention_viz.py</a> (<a href="https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb" target="_blank" rel="noreferrer noopener nofollow">Source</a>)</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">attention_viz</span><span class="hljs-params">(model, test_data, img_path=PATH, device=<span class="hljs-string" style="color: rgb(221, 17, 68);">'mps'</span>)</span>:</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Visualizes the attention mask of a given input (image) by comparing it with the original image. </span>

</pre>



<p>We can run this code by simply calling the <strong>attention_viz</strong> function and passing the corresponding arguments.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> __name__ == <span class="hljs-string" style="color: rgb(221, 17, 68);">'__main__'</span>:
   train_data, val_data, test_data = Dataset(config.BATCH_SIZE,config.IMG_SIZE, config.DATASET_SAMPLE)
   model = torch.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">'metadata/models/model.pth'</span>, map_location=torch.device(<span class="hljs-string" style="color: rgb(221, 17, 68);">'cpu'</span>))
   attention_viz(model, test_data, PATH)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-7.png?ssl=1" alt="Attention Visualization" class="wp-image-69922"/><figcaption class="wp-element-caption"><em>The image above is an example of attention visualization. The image on the left is the original image whereas the image on the right is overlaid with the attention map. The region i.e. the face of the bird is quite bright as that area constitutes the features to which the model is paying attention&nbsp;</em></figcaption></figure>
</div>


<h3 class="wp-block-heading" class="wp-block-heading" id="h-testing-and-inference">Testing and inference</h3>



<p>We can also use the <strong>attention_viz</strong> function in the test module, where we test the model on the test data and measure its performance with various metrics: the confusion matrix, accuracy, F1 score, recall, and precision.</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/test.py" target="_blank" rel="noreferrer noopener nofollow">test.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">test</span><span class="hljs-params">(model, test_data)</span>:</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> logits_, ground, confusion_matrix

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Evaluates the model’s performance on the test dataset and returns the confusion matrix, logits and ground truth for further performance evaluation. </span></pre>



<p>We can easily generate a confusion matrix, visualize it with a seaborn heatmap, and save it in the results folder, which we can also use to display it in the README.md file.&nbsp;</p>
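<p>For reference, the confusion matrix itself is just a count table indexed by (true class, predicted class). A minimal NumPy sketch with illustrative labels (the real module works on the 400-class test set and renders the heatmap with seaborn):</p>

```python
import numpy as np

# Accumulate a confusion matrix from ground truth and predictions.
# Rows index the true class, columns the predicted class.
def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 1, 2, 2, 1]   # illustrative labels
y_pred = [0, 1, 2, 1, 1]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)

# To render and save it as in the article (assuming seaborn and
# matplotlib are installed):
#   import seaborn as sns, matplotlib.pyplot as plt
#   sns.heatmap(cm)
#   plt.savefig("results/confusion_matrix.png")
```

Correct predictions accumulate on the diagonal, which is exactly the white band visible in the heatmap below.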


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-8.png?ssl=1" alt="confusion matrix" class="wp-image-69923"/><figcaption class="wp-element-caption"><em>Above is the image of a confusion matrix that is of the shape 100X100 trained for 50 epochs. As you can see the model is quite efficient to predict true positives which can be seen in the diagonals in white color. But there are few false positives across the graph which means that the model still makes wrong predictions</em></figcaption></figure>
</div>


<p>We can also generate the accuracy and loss graphs and store them in the results folder. We can then use Sklearn to compute other metrics, but before that, we must convert the tensors into NumPy arrays.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">probs = torch.zeros(len(logits_))
y_ = torch.zeros(len(ground))
idx = <span class="hljs-number" style="color: teal;">0</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> l, o <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> zip(logits_, ground):
   _, l = torch.max(l, dim=<span class="hljs-number" style="color: teal;">1</span>)
   probs[idx] = l
   y_[idx] = o.item()
   idx+=<span class="hljs-number" style="color: teal;">1</span>

prob = probs.to(torch.long).numpy()
y_ = y_.to(torch.long).numpy()

print(accuracy_score(y_, prob))
print(cohen_kappa_score(y_, prob))
print(classification_report(y_, prob))</pre>



<p>Once we are satisfied with the model’s performance, we can serve inference by creating a Streamlit app.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-creating-the-app-using-streamlit">MLOps pipeline for image classification: creating the app using Streamlit</h2>



<p>The <a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow">Streamlit</a> app will be a web app that we deploy to the cloud. To build it, we must first pip install streamlit and then import the library in a new module.&nbsp;</p>



<p>The module will contain the same code as the inference module: we just need to copy the evaluation function as-is and then build the app with the Streamlit library. Below is the code of the app.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/source/app.py" target="_blank" rel="noreferrer noopener nofollow">app.py</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> warnings
warnings.simplefilter(action=<span class="hljs-string" style="color: rgb(221, 17, 68);">'ignore'</span>, category=FutureWarning)

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> PIL <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Image
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchvision <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> transforms
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> streamlit <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> st

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> embeddings <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Embeddings
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> attention_block <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Block
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> linear <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Mlp
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> attention <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Attention
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> transformer <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> VisionTransformer, Transformer, Encoder

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> config <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Config
config = Config()

st.set_option(<span class="hljs-string" style="color: rgb(221, 17, 68);">'deprecation.showfileUploaderEncoding'</span>, <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>)
st.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Bird Image Classifier"</span>)
st.write(<span class="hljs-string" style="color: rgb(221, 17, 68);">""</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># enable users to upload images for the model to make predictions</span>
file_up = st.file_uploader(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Upload an image"</span>, type = <span class="hljs-string" style="color: rgb(221, 17, 68);">"jpg"</span>)


<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">predict</span><span class="hljs-params">(image)</span>:</span>
   <span class="hljs-string" style="color: rgb(221, 17, 68);">"""Return top 5 predictions ranked by highest probability.
   Parameters
   ----------
   :param image: uploaded image
   :type image: jpg
   :rtype: list
   :return: top 5 predictions ranked by highest probability
   """</span>
   model = torch.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">'model.pth'</span>)

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># transform the input image through resizing, normalization</span>
   transform = transforms.Compose([
       transforms.Resize(<span class="hljs-number" style="color: teal;">128</span>),
       transforms.CenterCrop(<span class="hljs-number" style="color: teal;">128</span>),
       transforms.ToTensor(),
       transforms.Normalize(
           mean = [<span class="hljs-number" style="color: teal;">0.485</span>, <span class="hljs-number" style="color: teal;">0.456</span>, <span class="hljs-number" style="color: teal;">0.406</span>],
           std = [<span class="hljs-number" style="color: teal;">0.229</span>, <span class="hljs-number" style="color: teal;">0.224</span>, <span class="hljs-number" style="color: teal;">0.225</span>])])

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load the image, pre-process it, and make predictions</span>
   img = Image.open(image)
   x = transform(img)
   x = torch.unsqueeze(x, <span class="hljs-number" style="color: teal;">0</span>)
   model.eval()
   logits, attn_w = model(x)

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> open(<span class="hljs-string" style="color: rgb(221, 17, 68);">'../metadata/classes.txt'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'r'</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> f:
        classes = f.read().split(<span class="hljs-string" style="color: rgb(221, 17, 68);">'\n'</span>)

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># return the top 5 predictions ranked by highest probabilities</span>
   prob = torch.nn.functional.softmax(logits, dim = <span class="hljs-number" style="color: teal;">1</span>)[<span class="hljs-number" style="color: teal;">0</span>] * <span class="hljs-number" style="color: teal;">100</span>
   _, indices = torch.sort(logits, descending = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> [(classes[idx], prob[idx].item()) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> idx <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> indices[<span class="hljs-number" style="color: teal;">0</span>][:<span class="hljs-number" style="color: teal;">5</span>]]


<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> file_up <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span>:
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># display image that user uploaded</span>
   image = Image.open(file_up)
   st.image(image, caption = <span class="hljs-string" style="color: rgb(221, 17, 68);">'Uploaded Image.'</span>, use_column_width = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
   st.write(<span class="hljs-string" style="color: rgb(221, 17, 68);">""</span>)
   st.write(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Processing..."</span>)
   labels = predict(file_up)

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># print out the top 5 prediction labels with scores</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> labels:
       st.write(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"Prediction {i[0]} score {i[1]:.2f}"</span>)</pre>



<p>But before we deploy, we must test it locally. In order to test the app, we will run the following command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">streamlit run app.py</pre>



<p>Once the above command is executed, you will get the following prompt:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">You can now view your Streamlit app <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> your browser.

  Local URL: http://localhost:<span class="hljs-number" style="color: teal;">8501</span>
  Network URL: http://<span class="hljs-number" style="color: teal;">192.168</span><span class="hljs-number" style="color: teal;">.0</span><span class="hljs-number" style="color: teal;">.105</span>:<span class="hljs-number" style="color: teal;">8501</span></pre>



<p>Copy the URL and paste it into your browser, and the app is online (locally).&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-9.png?ssl=1" alt="Bird image classifier" class="wp-image-69924"/><figcaption class="wp-element-caption"><em>Copied URL</em></figcaption></figure>
</div>


<p>Upload the image for classification.&nbsp;</p>



<figure class="wp-block-image size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-10.png?ssl=1" alt="Uploaded image" class="wp-image-69925"/><figcaption class="wp-element-caption"><em>Uploaded image</em></figcaption></figure>



<p>With the ViT model trained and the app ready, our directory structure should look something like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">.
├── README.md
├── metadata
│   ├── Abbott's_babbler_(Malacocincla_abbotti).jpg
│   ├── classes.txt
│   ├── models
│   │   └── model.pth
│   └── results
│       ├── accuracy_loss.png
│       ├── attn.png
│       └── confusion_matrix.png
├── notebooks
│   ├── ViT.ipynb
│   └── __init__.py
└── source
    ├── __init__.py
    ├── app.py
    ├── attention.py
    ├── attention_block.py
    ├── attention_viz.py
    ├── config.py
    ├── embeddings.py
    ├── linear.py
    ├── metrics.py
    ├── preprocessing.py
    ├── test.py
    ├── train.py
    ├── transformer.py
</pre>



<p>Now we proceed toward deploying the app.&nbsp;</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-code-formatting">MLOps pipeline for image classification: code formatting</h2>



<p>First, let&#8217;s format our Python scripts. For that, we will use Black, a Python code formatter. All you need to do is pip install black and then run <strong><em>`black`</em></strong> followed by the name of a Python module or even a whole directory. For this project, I ran black on the source directory, which contains all the Python modules.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">ViT-Pytorch git:(main) black source
Skipping .ipynb files <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> Jupyter dependencies are <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> installed.
You can fix this by running ``pip install black[jupyter]``
reformatted source/config.py
reformatted source/embeddings.py
reformatted source/attention_block.py
reformatted source/linear.py
reformatted source/app.py
reformatted source/attention_viz.py
reformatted source/attention.py
reformatted source/preprocessing.py
reformatted source/test.py
reformatted source/metrics.py
reformatted source/transformer.py
reformatted source/train.py
</pre>



<p>The advantage of using black is that it removes unnecessary spaces, replaces single quotes with double quotes, and makes reviewing code faster and more efficient.&nbsp;</p>



<p>Below are before-and-after images of code formatted with <strong>black</strong>.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://neptune.ai/mlops-pipeline-computer-vision-11"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-11.png?ssl=1" alt="Examples before and after using black to format the code" class="wp-image-69926"/></a><figcaption class="wp-element-caption"><em>Examples before and after using black to format the code </em></figcaption></figure>
</div>


<p>As you can see, unnecessary spaces have been removed.&nbsp;</p>
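<p>As a quick, hypothetical illustration of the kind of rewrites black performs (this snippet is not taken from the project's source), compare the two versions of the same function below. Black only changes layout, never behavior:</p>

```python
# Before formatting: inconsistent spacing and single quotes
# (the style black rewrites).
def area_before(w,h ):
    label='area'
    return { label : w*h }

# After running `black`: normalized spacing and double quotes.
def area_after(w, h):
    label = "area"
    return {label: w * h}

# Formatting is purely cosmetic: both versions compute the same result.
assert area_before(3, 4) == area_after(3, 4) == {"area": 12}
```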



<h2 class="wp-block-heading" id="h-mlops-pipeline-for-image-classification-setting-up-ci-cd">MLOps pipeline for image classification: setting up CI/CD&nbsp;</h2>



<p>For our CI/CD process, we will be using <strong>GitHub Actions</strong> and <strong>Google Cloud Build</strong> to integrate and deploy our Streamlit app. The following steps will help you create a full MLOps pipeline.&nbsp;</p>



<h4 class="wp-block-heading">Creating the Github Repository</h4>



<p>The first step is to create the GitHub repository. But before that, we must create three important files:</p>



<div id="case-study-numbered-list-block_92abe7f008b62bc2583ed9c876cbe209"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                requirements.txt<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                makefile<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                main.yml            </li>
            </ul>
</div>



<h4 class="wp-block-heading">requirements.txt</h4>



<p>The requirements.txt file must contain all the libraries that the model is using. There are two ways in which you can create a requirements.txt file.&nbsp;</p>



<ol class="wp-block-list">
<li>If you have a dedicated working environment created specifically for this project, you can run pip freeze &gt; requirements.txt, and it will create the requirements.txt file for you.&nbsp;</li>



<li>If you have a general working environment, then you can run pip freeze and copy-paste the libraries that you have been working on.</li>
</ol>



<p>The requirements.txt file for this project looks like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">numpy==<span class="hljs-number" style="color: teal;">1.22</span><span class="hljs-number" style="color: teal;">.3</span>
torch==<span class="hljs-number" style="color: teal;">1.12</span><span class="hljs-number" style="color: teal;">.0</span>
torchvision==<span class="hljs-number" style="color: teal;">0.12</span><span class="hljs-number" style="color: teal;">.0</span>
tqdm==<span class="hljs-number" style="color: teal;">4.64</span><span class="hljs-number" style="color: teal;">.0</span>
opencv-python==<span class="hljs-number" style="color: teal;">4.6</span><span class="hljs-number" style="color: teal;">.0</span><span class="hljs-number" style="color: teal;">.66</span>
streamlit==<span class="hljs-number" style="color: teal;">1.10</span><span class="hljs-number" style="color: teal;">.0</span>
neptune-client==<span class="hljs-number" style="color: teal;">0.16</span><span class="hljs-number" style="color: teal;">.3</span>
</pre>



<p><strong>Note:</strong> Always pin the version of each library so that, in the future, the app remains stable and performs optimally.&nbsp;</p>
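<p>Pinned versions can also be verified at runtime. The sketch below (an assumption for illustration, not part of the project's codebase) compares installed package versions against the pins using only the standard library:</p>

```python
from importlib.metadata import version, PackageNotFoundError


def check_pins(pins):
    """Return {package: installed_version} for every pin that is not satisfied.

    An installed_version of None means the package is missing entirely.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# Hypothetical pins mirroring requirements.txt above.
pins = {"numpy": "1.22.3", "streamlit": "1.10.0"}
print(check_pins(pins))  # a non-empty dict flags drift from requirements.txt
```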



<h4 class="wp-block-heading">Makefile</h4>



<p>In a nutshell, a Makefile is a file of commands that automates the whole process of installing libraries and dependencies, running Python scripts, and so on. A typical Makefile looks something like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Makefile</span>
setup:
   python3 -m venv ~/.visiontransformer
   source ~/.visiontransformer/bin/activate
   cd .visiontransformer
install:
   pip install --upgrade pip &amp;&amp;
       pip install -r requirements.txt
run:
   python source/test.py
all: install run</pre>



<p>For this project, our Makefile will have three processes:</p>



<div id="case-study-numbered-list-block_1297cdb696c95a544faafbfd77b39fe1"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Set up the virtual environment and activate it.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Install all the Python libraries.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Run a test file.             </li>
            </ul>
</div>



<p>Essentially, every time we make a new commit, the Makefile is executed, which automatically runs the test.py module, generating the latest performance metrics and updating the README.md file.</p>
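<p>A minimal sketch of how test.py could rewrite the metrics section of README.md on each run (the marker comments and metric names here are assumptions for illustration, not the project's actual implementation):</p>

```python
import tempfile
from pathlib import Path

# Assumed HTML-comment markers delimiting the metrics block in README.md.
START, END = "<!-- METRICS:START -->", "<!-- METRICS:END -->"


def update_readme(readme: Path, metrics: dict) -> None:
    """Replace everything between the markers with a fresh metrics table."""
    rows = [f"| {name} | {value:.3f} |" for name, value in metrics.items()]
    block = "\n".join([START, "| metric | value |", "| --- | --- |", *rows, END])
    head, _, rest = readme.read_text().partition(START)
    _, _, tail = rest.partition(END)
    readme.write_text(head + block + tail)


# Demo on a throwaway README with empty markers.
readme = Path(tempfile.mkdtemp()) / "README.md"
readme.write_text(f"# ViT-Pytorch\n\n{START}\n{END}\n")
update_readme(readme, {"accuracy": 0.912, "f1": 0.884})
```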



<p>But the Makefile will only run if we create an action trigger, so let’s create one.</p>



<h4 class="wp-block-heading">Action trigger: .github/workflows/main.yml</h4>



<p>To create an action trigger, we need to create the .github/workflows directory, followed by a <strong>main.yml</strong> file inside it. The main.yml file will essentially trigger the workflow whenever the repo is updated.&nbsp;</p>



<p>Our aim is to continuously integrate any changes made in the existing build, like updating parameters, model architecture, or even the UI/UX. Once the change is detected, it will automatically update the README.md file. The main.yml for this project is designed to trigger the workflow on any push or pull request but only for the main branch.</p>



<p>At each new commit, the file will spin up the ubuntu-latest environment, install the specified Python version, and then execute the commands from the Makefile.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/Makefile" target="_blank" rel="noreferrer noopener nofollow">main.yml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#main.yml</span>
name: Continuous Integration <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> Github Actions

on:
 push:
   branches: [ main ]
 pull_request:
   branches: [ main ]

jobs:
 build:
   runs-on: ubuntu-latest
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Steps represent a sequence of tasks that will be executed as part of the job</span>
   steps:
     - uses: actions/checkout@v2
     - name: Set up Python <span class="hljs-number" style="color: teal;">3.8</span>
       uses: actions/setup-python@v1
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span>:
         python-version: <span class="hljs-number" style="color: teal;">3.8</span>
     - name: Install dependencies
       run: |
         make install
         make run
</pre>



<h4 class="wp-block-heading">Testing</h4>



<p>After the files are created, you can push the entire codebase to GitHub. Once uploaded, you can click on the Actions tab and see the build in progress for yourself.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-12.png?ssl=1" alt="Testing" class="wp-image-69927"/><figcaption class="wp-element-caption"><em>Build in progress in the Actions tab</em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Deployment: Google Cloud Build</h4>



<p>After the testing is done and all the logs and results are updated in the Github README.md file, we can move to the next step, which is to integrate the app into the cloud.&nbsp;</p>



<ol class="wp-block-list">
<li>First, we will visit: <a href="https://console.cloud.google.com/" target="_blank" rel="noreferrer noopener nofollow">https://console.cloud.google.com/</a>, and then we will create a new project in the dashboard and name it Vision Transformer Pytorch.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-13.png?ssl=1" alt="Creating a new project" class="wp-image-69928"/><figcaption class="wp-element-caption"><em>Creating a new project</em></figcaption></figure>
</div>


<p>Once the project is created, you can navigate into the project, and it will look something like this:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-14.png?ssl=1" alt="The project" class="wp-image-69929"/><figcaption class="wp-element-caption"><em>The project</em></figcaption></figure>
</div>


<p>As you can see, Google Cloud offers us various services right out of the box, like virtual machines, BigQuery, and GKE (Google Kubernetes Engine), on the project home page. But before we create anything in Cloud Build, we must enable the Kubernetes cluster and create certain directories and their respective files in the project directory.</p>



<ol start="2" class="wp-block-list">
<li><strong>Kubernetes</strong></li>
</ol>



<p>Let’s set up our Kubernetes cluster before we create any files. To do that, we can search for GKE in the Google Cloud console search bar and enable the API.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-15.png?ssl=1" alt="Setting up Kubernetes cluster" class="wp-image-69930"/><figcaption class="wp-element-caption"><em>Setting up Kubernetes cluster</em></figcaption></figure>
</div>


<p>Once the API is enabled, we will be navigated to the following page.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-15b.png?ssl=1" alt="Kubernetes cluster " class="wp-image-69931"/><figcaption class="wp-element-caption"><em>Kubernetes cluster</em></figcaption></figure>
</div>


<p>But instead of creating the clusters manually, we will create them using the built-in Cloud Shell. To do that, click on the terminal button at the top right, as shown in the images below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-16.png?ssl=1" alt="Cloud shell" class="wp-image-69932"/><figcaption class="wp-element-caption"><em>Activating Cloud Shell</em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-17.png?ssl=1" alt="Creating a cluster using the built-in Cloud Shell" class="wp-image-69933"/><figcaption class="wp-element-caption"><em>Creating a cluster using the built-in Cloud Shell</em></figcaption></figure>
</div>


<p>After activating the cloud shell, we can type the following command to create Kubernetes clusters:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">gcloud container clusters create project-kube --zone <span class="hljs-string" style="color: rgb(221, 17, 68);">"us-west1-b"</span> --machine-type <span class="hljs-string" style="color: rgb(221, 17, 68);">"n1-standard-1"</span> --num-nodes <span class="hljs-string" style="color: rgb(221, 17, 68);">"1"</span></pre>



<p>This usually takes up to 5 minutes.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-18.png?ssl=1" alt="Creating Kubernetes clusters" class="wp-image-69934"/><figcaption class="wp-element-caption"><em>Creating Kubernetes clusters</em></figcaption></figure>
</div>


<p>After it is completed, it will look something like this:&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-19.png?ssl=1" alt="Kubernetes clustering completed" class="wp-image-69935"/><figcaption class="wp-element-caption"><em>Kubernetes clustering completed</em></figcaption></figure>
</div>


<p>Now let’s set up the two files that will configure the Kubernetes clusters: deployment.yml and service.yml.&nbsp;</p>



<p>The deployment.yml file allows us to deploy the model in the cloud. The deployment strategy can be canary, recreate, blue-green, or any other, depending upon the requirement. In this example, we will simply overwrite the existing deployment. This file also helps in scaling the model efficiently via the <strong>replicas</strong> argument. Here is an example of a deployment.yml file.</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/kubernetes/deployment.yml" target="_blank" rel="noreferrer noopener nofollow">deployment.yml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#deployment.yml</span>

apiVersion: apps/v1
kind: Deployment
metadata:
 name: imgclass
spec:
 replicas: <span class="hljs-number" style="color: teal;">1</span>
 selector:
   matchLabels:
     app: imageclassifier
 template:
   metadata:
     labels:
       app: imageclassifier
   spec:
     containers:
     - name: cv-app
       image: gcr.io/vision-transformer-pytorch/vit:v1
       ports:
       - containerPort: <span class="hljs-number" style="color: teal;">8501</span></pre>



<p>The next file is the service.yml file. It essentially exposes the app in the container to the outside world. Notice that the <em>containerPort</em> argument is specified as 8501; we will use the same number for the <em>targetPort</em> argument in our service.yml. This is the port Streamlit serves the application on. Apart from that, the <em>app</em> label is the same in both files.&nbsp;</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/kubernetes/service.yml" target="_blank" rel="noreferrer noopener nofollow">service.yml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#service.yml</span>

apiVersion: v1
kind: Service
metadata:
 name: imageclassifier
spec:
 type: LoadBalancer
 selector:
   app: imageclassifier
 ports:
 - port: <span class="hljs-number" style="color: teal;">80</span>
   targetPort: <span class="hljs-number" style="color: teal;">8501</span></pre>



<p><strong>Note</strong>: Always make sure that the name of the app and the version are in lowercase.&nbsp;</p>



<ol start="3" class="wp-block-list">
<li><strong>Dockerfile</strong></li>
</ol>



<p>Now let’s configure the Dockerfile. This file will create a Docker container that will host our Streamlit app. Docker is essential here since it wraps the app in a reproducible environment that is easy to scale. A typical Dockerfile looks like this:</p>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/Dockerfile" target="_blank" rel="noreferrer noopener nofollow">Dockerfile</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">FROM python:<span class="hljs-number" style="color: teal;">3.8</span><span class="hljs-number" style="color: teal;">.2</span>-slim-buster

RUN apt-get update

ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

RUN ls -la $APP_HOME/

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Install dependencies</span>
RUN pip install -r requirements.txt

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Run the streamlit on container startup</span>
CMD [ <span class="hljs-string" style="color: rgb(221, 17, 68);">"streamlit"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"run"</span>,<span class="hljs-string" style="color: rgb(221, 17, 68);">"app.py"</span> ]
</pre>



<p>Dockerfile contains a series of commands that:</p>



<ul class="wp-block-list">
<li>Pull the Python base image.&nbsp;</li>



<li>Copy the local code to the container image.</li>



<li>Install all the libraries.</li>



<li>Run the Streamlit app on container startup.&nbsp;</li>
</ul>



<p>Note that we are using Python 3.8, as some of the dependencies require a recent Python version.</p>



<ol start="4" class="wp-block-list">
<li><strong>cloudbuild.yaml</strong></li>
</ol>



<p>In Google Cloud Build, the cloudbuild.yaml file stitches all the artifacts together to create a seamless pipeline. It has three primary steps:</p>



<ul class="wp-block-list">
<li>Build a Docker container using the Dockerfile from the current directory.&nbsp;</li>



<li>Push the container to the Google Container Registry.</li>



<li>Deploy the container in the Kubernetes engine.&nbsp;</li>
</ul>



<p><a href="https://github.com/Nielspace/ViT-Pytorch/blob/main/cloudbuild.yaml" target="_blank" rel="noreferrer noopener nofollow">cloudbuild.yaml</a></p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">steps:
- name: <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/cloud-builders/docker'</span>
 args: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'build'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'-t'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/vision-transformer-pytorch/vit:v1'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'.'</span>]
 timeout: <span class="hljs-number" style="color: teal;">180</span>s
- name: <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/cloud-builders/docker'</span>
 args: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'push'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'gcr.io/vision-transformer-pytorch/vit:v1'</span>]
- name: <span class="hljs-string" style="color: rgb(221, 17, 68);">"gcr.io/cloud-builders/gke-deploy"</span>
 args:
 - run
 - --filename=kubernetes/ <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#this argument connects the files in kubernetes directory</span>
 - --location=us-west1-b
 - --cluster=project-kube</pre>



<p><strong>Note</strong>: Please cross-check arguments like the container name across the deployment.yml and cloudbuild.yaml files. Along with that, also cross-check that the cluster name you created earlier matches the cluster name in the cloudbuild.yaml file. Lastly, make sure that the <em>filename</em> argument points to the kubernetes directory where deployment.yml and service.yml are present.&nbsp;&nbsp;</p>



<p>After creating these files, the file structure of the entire project should look like this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">.
├── Dockerfile
├── .github/workflows/main.yml
├── Makefile
├── README.md
├── cloudbuild.yaml
├── kubernetes
│   ├── deployment.yml
│   └── service.yml
├── metadata
│   ├── Abbott's_babbler_(Malacocincla_abbotti).jpg
│   ├── classes.txt
│   ├── models
│   │   └── model.pth
│   └── results
│       ├── accuracy_loss.png
│       ├── attn.png
│       └── confusion_matrix.png
├── notebooks
│   ├── ViT.ipynb
│   └── __init__.py
├── requirements.txt
└── source
    ├── __init__.py
    ├── app.py
    ├── attention.py
    ├── attention_block.py
    ├── attention_viz.py
    ├── config.py
    ├── embeddings.py
    ├── linear.py
    ├── metrics.py
    ├── preprocessing.py
    ├── test.py
    ├── train.py
    ├── transformer.py
    └── vit-pytorch.ipynb
</pre>



<ol start="5" class="wp-block-list">
<li><strong>Cloning and testing</strong></li>
</ol>



<p>Now let’s clone the GitHub repo in our Google Cloud project, cd into it, and run the cloudbuild.yaml file. Use the following commands:</p>



<ul class="wp-block-list">
<li><em>git clone </em><a href="https://github.com/Nielspace/ViT-Pytorch.git" target="_blank" rel="noreferrer noopener nofollow"><em>https://github.com/Nielspace/ViT-Pytorch.git</em></a></li>



<li><em>cd ViT-Pytorch</em></li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-20.png?ssl=1" alt="clone the GitHub repo" class="wp-image-69936"/><figcaption class="wp-element-caption"><em>Cloning the GitHub repo</em></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><em>gcloud builds submit --config cloudbuild.yaml</em></li>
</ul>



<p>The deployment process will look something like this:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-21.png?ssl=1" alt="The deployment process" class="wp-image-69937"/><figcaption class="wp-element-caption"><em>The deployment process </em></figcaption></figure>
</div>


<ol start="6" class="wp-block-list">
<li>The deployment takes around 10 minutes, depending on various factors. If everything executes properly, you will see that the steps are marked with green ticks.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-22.png?ssl=1" alt="Successful deployment" class="wp-image-69938"/><figcaption class="wp-element-caption"><em>Successful deployment</em></figcaption></figure>
</div>


<ol start="7" class="wp-block-list">
<li>Once the deployment is successful, you can find the endpoints of the app in the Services &amp; Ingress tab in the Kubernetes Engine. Click on the endpoints, and it will navigate you to the Streamlit app.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-23.png?ssl=1" alt="The endpoints " class="wp-image-69939"/><figcaption class="wp-element-caption"><em>The endpoints </em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-24.png?ssl=1" alt="The Streamlit app" class="wp-image-69940"/><figcaption class="wp-element-caption"><em>The Streamlit app</em></figcaption></figure>
</div>


<p><strong>Additional tips:</strong></p>



<ol class="wp-block-list">
<li>Make sure that you use lowercase for the app name and project ID in all your *.yml config files.</li>



<li>Cross-check the arguments for all *.yml config files.&nbsp;</li>



<li>Since you are copying your repo in a virtual environment, cross-check all the directory and file paths.&nbsp;</li>



<li>In case of an error in the cloud build process, the error message usually suggests a command that will help you resolve it. See the image below for a better understanding; I have highlighted the command that needs to be executed before re-running the cloud build command.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-25.png?ssl=1" alt="an error in the cloud build process" class="wp-image-69941"/><figcaption class="wp-element-caption"><em>An error in the cloud build process</em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Cloud Build integration</h4>



<p>Now we will integrate Google Cloud Build into the GitHub repo. This will create a trigger that updates the build whenever a change is made in the repo.&nbsp;</p>



<ol class="wp-block-list">
<li>Search Google Cloud Build in the Marketplace</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-26.png?ssl=1" alt="Searching for Google Cloud Build" class="wp-image-69942"/><figcaption class="wp-element-caption"><em>Searching for Google Cloud Build</em></figcaption></figure>
</div>


<ol start="2" class="wp-block-list">
<li>Select the repo that you want to connect, in this case ViT-Pytorch, and save it.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-27.png?ssl=1" alt="Selecting the repo" class="wp-image-69943"/><figcaption class="wp-element-caption"><em>Selecting the repo</em></figcaption></figure>
</div>


<ol start="3" class="wp-block-list">
<li>In Google Cloud Build, we will go to the Cloud build page and click on the Triggers tab to create triggers.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-28.png?ssl=1" alt="creating triggers" class="wp-image-69944"/><figcaption class="wp-element-caption"><em>Creating triggers</em></figcaption></figure>
</div>


<ol start="4" class="wp-block-list">
<li>After clicking on create trigger, we will be navigated to the page below. There we will enter the trigger name, select the event that will trigger the cloudbuild.yaml file, and select the project repository.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-29.png?ssl=1" alt="Trigger settings " class="wp-image-69945"/><figcaption class="wp-element-caption"><em>Trigger settings</em></figcaption></figure>
</div>


<ol start="5" class="wp-block-list">
<li>Follow the authentication process.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-30.png?ssl=1" alt="authentication process" class="wp-image-69946"/><figcaption class="wp-element-caption"><em>Authentication process</em></figcaption></figure>
</div>


<ol start="6" class="wp-block-list">
<li>Connect the repository.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-31.png?ssl=1" alt="Connecting the repository" class="wp-image-69947"/><figcaption class="wp-element-caption"><em>Connecting the repository</em></figcaption></figure>
</div>


<ol start="7" class="wp-block-list">
<li>Finally, create the trigger.&nbsp;</li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-32.png?ssl=1" alt="creating the trigger" class="wp-image-69948"/><figcaption class="wp-element-caption"><em>Creating the trigger</em></figcaption></figure>
</div>


<p>Now that the trigger is created, all the changes that you make in the Github repo will be automatically detected, and the deployment will be updated.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/mlops-pipeline-computer-vision-33.png?ssl=1" alt="Created trigger" class="wp-image-69949"/><figcaption class="wp-element-caption"><em>Created trigger </em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Monitoring the model-decay</h4>



<p>Over time, the model will decay, which will affect its prediction capabilities, so we need to monitor its performance on a regular basis. One way to do that is to occasionally test the model on new data and evaluate it on the metrics I mentioned earlier, like the F1 score, accuracy, and precision.&nbsp;</p>



<p>Another interesting way to monitor the model’s performance is the AUROC metric, which measures the discriminative performance of the model. Because this is a multiclass classification project, you can convert it into a set of binary classification problems (one class versus the rest) and then check the model’s performance. If the performance has decayed, the model must be retrained with new and larger samples, and, if really required, the architecture modified as well.&nbsp;</p>



<p><a href="https://gist.github.com/khizirsiddiqui/559a91dab223944fb83f8480715d2582" target="_blank" rel="noreferrer noopener nofollow">Here</a> is the link to the code, which will allow you to measure the AUROC score.</p>
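<p>For intuition, the AUROC can also be computed directly from raw scores with the rank (Mann-Whitney) formulation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below uses made-up labels and scores, not outputs of the ViT model:</p>

```python
def binary_auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly
    (ties count as half a correct ranking)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# One-vs-rest view of the multiclass problem: score one bird class
# against all the others.
labels = [0, 0, 1, 1]           # made-up binary ground truth for one class
scores = [0.1, 0.4, 0.35, 0.8]  # made-up model scores for that class
print(binary_auroc(labels, scores))  # 0.75
```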



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>In this article, we learned to build an image classifier app with a Vision Transformer using PyTorch and Streamlit. We also saw how we can deploy the app on the Google Cloud Platform using GitHub Actions and technologies like Kubernetes, Dockerfile, and Makefile.&nbsp;</p>



<p>Important takeaways from this project:</p>



<ol class="wp-block-list">
<li>Bigger data requires a larger model, which essentially requires training for more epochs.&nbsp;</li>



<li>When creating a prototyping experiment, reduce the number of classes and test whether the accuracy increases with each epoch. Try different configurations till you are confident that the model’s performance is increasing before using GPUs on cloud services like Kaggle or Colab.&nbsp;</li>



<li>Use various performance metrics like the confusion matrix, precision, recall, F1, and AUROC.&nbsp;</li>



<li>Once the model is deployed, it can be monitored occasionally rather than continuously.&nbsp;</li>



<li>For monitoring, a metric like the AUROC score is a good choice since it automatically sweeps over threshold values and graphs the model’s true positive rate against its false positive rate. With the AUROC score, the model’s previous and current performance can be easily compared.&nbsp;</li>



<li>Re-training the model should be done only when the model has drifted significantly. Since a model like this requires a lot of computational resources, frequent retraining can be expensive.</li>
</ol>



<p>I hope you found this article informative and practical. You can find the entire code in this <a href="https://github.com/Nielspace/ViT-Pytorch" target="_blank" rel="noreferrer noopener nofollow">GitHub repo</a>. Feel free to share it with others as well.&nbsp;</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://arxiv.org/pdf/2010.11929.pdf" target="_blank" rel="noreferrer noopener nofollow">An Image Is Worth 16&#215;16 Words: Transformers For Image Recognition At Scale</a></li>



<li><a href="https://arxiv.org/abs/2111.05464" target="_blank" rel="noreferrer noopener nofollow">Are Transformers More Robust Than CNNs?</a></li>



<li><a href="https://www.kdnuggets.com/2022/01/machine-learning-models-die-silence.html" target="_blank" rel="noreferrer noopener nofollow">https://www.kdnuggets.com/2022/01/machine-learning-models-die-silence.html</a></li>



<li><a href="https://github.com/jeonsworld/ViT-pytorch" target="_blank" rel="noreferrer noopener nofollow">https://github.com/jeonsworld/ViT-pytorch</a>&nbsp;</li>



<li>​​<a href="https://gist.github.com/khizirsiddiqui/559a91dab223944fb83f8480715d2582" target="_blank" rel="noreferrer noopener nofollow">https://gist.github.com/khizirsiddiqui/559a91dab223944fb83f8480715d2582</a></li>



<li><a href="https://github.com/srivatsan88/ContinousModelDeploy" target="_blank" rel="noreferrer noopener nofollow">https://github.com/srivatsan88/ContinousModelDeploy</a>&nbsp;</li>



<li><a href="https://neptune.ai/blog/mlops-pipeline-for-nlp-machine-translation" target="_blank" rel="noreferrer noopener nofollow">Building MLOps Pipeline for NLP: Machine Translation Task</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7085</post-id>	</item>
		<item>
		<title>Distributed Training: Frameworks and Tools</title>
		<link>https://neptune.ai/blog/distributed-training-frameworks-and-tools</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 11:20:50 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.test/distributed-training-frameworks-and-tools/</guid>

					<description><![CDATA[Recent developments in deep learning have led to some fascinating state-of-the-art results especially in the areas like natural language processing and computer vision. A couple of the reasons for the success usually comes from the availability of a huge amount of data and the increasing size of deep learning (DL) models. These algorithms are capable&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Recent developments in deep learning have led to some fascinating state-of-the-art results, especially in areas like <a href="/blog/category/natural-language-processing" target="_blank" rel="noreferrer noopener">natural language processing</a> and <a href="/blog/category/computer-vision">computer vision</a>. Much of this success comes from the availability of huge amounts of data and the <a href="https://towardsdatascience.com/review-of-recent-advances-in-dealing-with-data-size-challenges-in-deep-learning-ac5c1844af73" target="_blank" rel="noreferrer noopener nofollow">increasing size of deep learning (DL) models</a>. These algorithms are capable of extracting meaningful patterns and deriving correlations between the input and the output. But it is also true that developing and training these complex algorithms can take days and sometimes even weeks.</p>



<p>To manage this problem, a fast and efficient approach to designing and developing new models is needed. Training these models on a single GPU creates an information bottleneck, so the work has to be spread across multiple GPUs. This is where the idea of <strong>distributed training </strong>comes into the picture.</p>



<p>In this article, we&#8217;ll look into some of <strong>the best frameworks and tools for distributed training</strong>. But before that, let&#8217;s have a quick overview of distributed training itself. </p>



<h2 class="wp-block-heading" id="h-distributed-training">Distributed training</h2>



<p>DL training usually relies on scalability, which simply means the ability of a DL algorithm to learn from or deal with any amount of data. Essentially, the scalability of any DL algorithm depends on three factors:</p>



<div id="case-study-numbered-list-block_e206ac8011b1dea9114e2d724887b5f9"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Size and the complexity of the deep learning model            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Amount of training data            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Availability of infrastructure which includes hardware like GPUs and storage units, and smooth integration between these devices            </li>
            </ul>
</div>



<p><a href="/blog/distributed-training" target="_blank" rel="noreferrer noopener">Distributed training</a> satisfies all three elements. It takes care of the model size and complexity, handles training data in batches, and it splits and distributes the training process among multiple processors called nodes. More importantly, it reduces the training time significantly making iteration time shorter and thus making experiments and deployment quicker.</p>



<p>Distributed training is of two types:</p>



<div id="case-study-numbered-list-block_c5e9d946dcfdc95993c47579ad1a27f2"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data-parallel training<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Model-parallel training            </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-training-model-data-parallelism.png?ssl=1" alt="Distributed training model parallelism vs data parallelism " class="wp-image-61294"/><figcaption class="wp-element-caption"><em>Distributed training model parallelism vs data parallelism | <a href="https://towardsdatascience.com/deep-learning-on-supercomputers-96319056c61f" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>In data-parallel training, the data is divided into subsets based upon the number of nodes available for training. And the same model architecture is shared in all the available nodes. During the training process, all the nodes must communicate with each other to ensure that the training at each node is synced with each other. It is the most efficient way of training the model and the most common practice.</p>



<p>In model-parallel training, the DL model itself is split into segments based on the number of nodes available, each segment is assigned to a different node, and each node is fed the same data. This type of training is possible if the DL model has independent components that can be trained individually. Keep in mind that the nodes must stay in sync with regard to the shared weights and biases at the boundaries between the segments of the model.&nbsp;</p>



<p>Of the two types, data parallelism is the more commonly used, and as we go through the frameworks for distributed training, you will find that all of them offer data parallelism, whereas model parallelism is not always available.&nbsp;</p>
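<p>To make the data-parallel idea concrete, here is a small pure-Python sketch of the sharding step: the dataset is split into near-equal pieces, one per node, while every node holds the same model. The function name, node count, and sample data are invented for illustration.</p>

```python
def shard_dataset(samples, num_nodes):
    """Split a dataset into near-equal contiguous shards, one per node.

    Every node trains an identical copy of the model on its own shard;
    gradients are then synchronized across nodes after each step.
    """
    shard_size, remainder = divmod(len(samples), num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        # The first `remainder` nodes take one extra sample each.
        end = start + shard_size + (1 if node < remainder else 0)
        shards.append(samples[start:end])
        start = end
    return shards

# 10 samples over 4 nodes → shard sizes 3, 3, 2, 2
shards = shard_dataset(list(range(10)), num_nodes=4)
print([len(s) for s in shards])  # → [3, 3, 2, 2]
```

<p>Frameworks automate this step (e.g. with distributed samplers), but the contract is the same: no sample is dropped and no two nodes see the same shard.</p>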



<h2 class="wp-block-heading" id="h-criteria-for-choosing-the-right-framework-for-distributed-training">Criteria for choosing the right framework for distributed training</h2>



<p>Before we dive into the frameworks there are some points that one should consider while choosing the right framework and tools:</p>



<ol class="wp-block-list">
<li><strong>Computational graph type: </strong>The deep learning community is largely divided into two factions: one that uses PyTorch and dynamic computational graphs, and the other that uses TensorFlow and static computational graphs. Hence, it is no surprise that most distributed frameworks are built on top of these two libraries. So if you prefer one over the other, half of your decision is already made.</li>



<li><strong>Cost of training</strong>: Affordability is a critical concern when you are dealing with distributed computing, e.g. a project involving the training of BigGAN can require a number of GPUs and the cost could scale up proportionally as this number increases. Hence, a tool with moderate pricing is always the right choice.</li>



<li><strong>Type of training</strong>: Depending on your training requirements, i.e., data parallelism or model parallelism, you may prefer one tool over another. </li>



<li><strong>Efficiency</strong>: This basically refers to the number of lines of code you need to write to enable distributed training; the fewer, the better.</li>



<li><strong>Flexibility</strong>: Can the framework of your choice be used across different platforms? Especially when you need to train on-premise or on cloud platforms.</li>
</ol>



<section id="blog-intext-cta-block_f1a9708072a6991122a282e13817315f" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><a href="/blog/distributed-training" target="_blank" rel="noopener"><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ Distributed Training: Guide for Data Scientists</a><br />
<a href="/blog/distributed-training-errors" target="_blank" rel="noopener"><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ Distributed Training: Errors to Avoid</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-frameworks-for-distributed-training">Frameworks for distributed training</h2>



<p>Now, let’s discuss some of the libraries that offer distributed training.&nbsp;</p>



<section id="blog-intext-cta-block_519e65445749da8fa6f9129dc357552e" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-may-be-useful">May be useful</h3>
    
            <p>  In Neptune, you can <a href="https://docs.neptune.ai/how-to-guides/neptune-api/distributed-computing" target="_blank" rel="noopener">track data of your run from many processes</a>, in particular running on different machines.</p>
    
    </section>



<h3 class="wp-block-heading" id="1-pytorch">1. PyTorch</h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_12.png?resize=469%2C94&#038;ssl=1" alt="Distributed training: PyTorch" class="wp-image-61137" width="469" height="94"/><figcaption class="wp-element-caption"><em>Distributed training: PyTorch | <a href="https://github.com/pytorch/pytorch" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>PyTorch is one of the most popular deep learning frameworks developed by Facebook. It is one of the most flexible and easy-to-learn frameworks. PyTorch allows you to create and implement neural network modules very effectively and with its distributed training modules you can easily implement parallel training with a few lines of code.&nbsp;&nbsp;</p>



<p>PyTorch offers a number of ways in which you can perform distributed training:</p>



<ol class="wp-block-list">
<li><a href="https://pytorch.org/docs/stable/nn.html#dataparallel" target="_blank" rel="noreferrer noopener nofollow"><strong>nn.DataParallel</strong></a><strong>:</strong> This package allows you to perform parallel training on a single machine with multiple GPUs. One advantage is that it requires minimal code changes.</li>



<li><a href="https://pytorch.org/docs/stable/nn.html#distributeddataparallel" target="_blank" rel="noreferrer noopener nofollow"><strong>nn.DistributedDataParallel</strong></a>: This package allows you to perform parallel training across multiple GPUs on multiple machines. It requires a few extra steps to configure the training process.&nbsp;&nbsp;</li>



<li><a href="https://pytorch.org/docs/stable/rpc.html" target="_blank" rel="noreferrer noopener nofollow"><strong>torch.distributed.rpc</strong></a><strong>: </strong>This package allows you to implement a model-parallel strategy. It is very useful if your model is large and does not fit on a single GPU.</li>
</ol>
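<p>A minimal sketch of the first of these, the single-machine <strong>nn.DataParallel</strong> path (the toy model and tensor sizes below are arbitrary; when fewer than two GPUs are visible, the code simply runs the unwrapped module):</p>

```python
import torch
import torch.nn as nn

# A toy model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Wrap for data-parallel training: each forward pass splits the batch
# across the visible GPUs and gathers the outputs back.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

batch = torch.randn(8, 16)  # batch of 8 samples, 16 features each
out = model(batch)
print(out.shape)            # → torch.Size([8, 4])
```

<p>Note that for multi-machine training you would switch to nn.DistributedDataParallel, which additionally requires initializing a process group and launching one process per GPU.</p>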



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>It is easy to implement.</li>



<li>PyTorch is very user-friendly.</li>



<li>Offers data-parallelism and model-parallelism methods out-of-the-box.</li>



<li>The majority of the cloud computing platforms support PyTorch.</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-pytorch">When to use PyTorch?</h4>



<p>You should opt for PyTorch when:</p>



<ul class="wp-block-list">
<li>You have a huge amount of data because data parallelism is easy to implement.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/pytorch/pytorch" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://pytorch.org/tutorials/beginner/dist_overview.html" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li>Data-parallelism: <a href="https://github.com/pytorch/tutorials" target="_blank" rel="noreferrer noopener nofollow">tutorial</a></li>



<li>Model-parallelism: <a href="https://pytorch.org/docs/stable/rpc.html#tutorials" target="_blank" rel="noreferrer noopener nofollow">tutorial</a></li>
</ul>



<section id="blog-intext-cta-block_6ba3b00302b8a8713b9ce427bbbaa616" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-see-also">See also</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/how-to-keep-track-of-experiments-in-pytorch-using-neptune" target="_blank" rel="noopener">How to Keep Track of Experiments in PyTorch</a></p>
    
    </section>



<h3 class="wp-block-heading" id="2-deepspeed">2. <a href="https://www.deepspeed.ai/" target="_blank" rel="noreferrer noopener nofollow">DeepSpeed</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_16.png?resize=405%2C150&#038;ssl=1" alt="Distributed training: DeepSpeed" class="wp-image-61156" width="405" height="150"/><figcaption class="wp-element-caption"><em>Distributed training: DeepSpeed | <a href="https://github.com/microsoft/DeepSpeed" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>PyTorch’s distributed training specializes in data parallelism. DeepSpeed, which is built on top of PyTorch, targets the other aspect, i.e., model parallelism. DeepSpeed was developed by Microsoft and aims to offer distributed training for large-scale models.&nbsp;</p>



<p>DeepSpeed can efficiently tackle memory challenges when training models with trillions of parameters. It reduces memory footprint while maintaining compute and communication efficiency. Interestingly, DeepSpeed offers 3D parallelism through which you can distribute data, model, and pipeline, which basically means that now you can train a model which is large and consumes a huge amount of data, something like a GPT-3 or a Turing NLG.&nbsp;&nbsp;</p>
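<p>In practice, a DeepSpeed run is driven by a JSON configuration passed to deepspeed.initialize. The sketch below shows that config as a Python dict; the specific values (batch size, ZeRO stage) are illustrative assumptions, not recommendations:</p>

```python
# Illustrative DeepSpeed configuration, expressed as the dict you would
# serialize to ds_config.json and pass to deepspeed.initialize(...).
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},           # mixed-precision training
    "zero_optimization": {"stage": 2},   # partition optimizer state + gradients
}

print(ds_config["zero_optimization"]["stage"])  # → 2
```

<p>Raising the ZeRO stage trades communication for memory: higher stages partition more of the training state across GPUs, which is what lets a single GPU hold models it otherwise could not.</p>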



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Model scaling up to trillions of parameters.</li>



<li>Up to 10x faster training.</li>



<li>Democratizes AI: users can run bigger models on a single GPU without running out of memory.</li>



<li>Compressed training allows users to train attention models by reducing the memory required to compute attention operations.&nbsp;</li>



<li>Easy to learn and use.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-deepspeed">When to use DeepSpeed?</h4>



<p>You should opt for DeepSpeed when:</p>



<ul class="wp-block-list">
<li>You want to do data and model parallelism.&nbsp;</li>



<li>If your codebase is based on PyTorch.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/microsoft/DeepSpeed" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://www.deepspeed.ai/" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li><a href="https://github.com/microsoft/DeepSpeed#videos" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="3-distributed-tensorflow">3. <a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noreferrer noopener nofollow">Distributed TensorFlow</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_4.png?resize=450%2C151&#038;ssl=1" alt="Distributed training: TensorFlow" class="wp-image-61145" width="450" height="151"/><figcaption class="wp-element-caption"><em>Distributed training: TensorFlow | <a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>TensorFlow is developed by Google and it supports distributed training. It uses data-parallel techniques for training. You can leverage the distributed training on TensorFlow by using the <strong>tf.distribute</strong> API. This API allows you to configure your training as per your requirements. By default, TensorFlow uses only one GPU but the tf.distribute allows you to use multiple GPUs.</p>



<p>TensorFlow provides three primary types of distributed training strategy:</p>



<ol class="wp-block-list">
<li><strong>tf.distribute.MirroredStrategy()</strong>: This simple strategy allows you to distribute training across multiple GPUs on a single machine. This method is also called Synchronous Data-Parallelism. It is worth noting that each worker node will have its own set of gradients. These gradients are then averaged and used to update the model parameters.</li>
</ol>



<ol class="wp-block-list" start="2">
<li><strong>tf.distribute.MultiWorkerMirroredStrategy()</strong>: This strategy allows you to distribute training across multiple machines and multiple GPUs on a single machine. All the operations are similar to tf.distribute.MirroredStrategy(). It is also a Synchronous Data-Parallelism method.</li>
</ol>



<ol class="wp-block-list" start="3">
<li><strong>tf.distribute.experimental.ParameterServerStrategy()</strong>: This is an Asynchronous Data-Parallelism method and a common way to scale up model training across multiple machines. In this strategy, the parameters are stored on a parameter server, and the workers are independent of each other. This strategy scales well because the worker nodes do not wait for parameter updates from each other.</li>
</ol>
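<p>The synchronous averaging step that the mirrored strategies perform after each backward pass can be sketched without TensorFlow at all. The function name and the per-replica gradient values below are made up for illustration:</p>

```python
def average_gradients(per_replica_grads):
    """Average one parameter's gradient across replicas, elementwise.

    In a mirrored strategy, every replica computes gradients on its own
    slice of the batch; the averaged result is applied on all replicas
    so their model copies stay identical after each step.
    """
    num_replicas = len(per_replica_grads)
    return [sum(g[i] for g in per_replica_grads) / num_replicas
            for i in range(len(per_replica_grads[0]))]

# Two replicas, one parameter vector of length 3.
grads = [[0.25, -0.5, 1.0],   # gradients from replica 0
         [0.75,  0.0, 3.0]]   # gradients from replica 1
print(average_gradients(grads))  # → [0.5, -0.25, 2.0]
```

<p>The parameter-server strategy drops this synchronization barrier: workers push and pull updates independently, which is why it scales further at the cost of gradient staleness.</p>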



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Huge community support.&nbsp;</li>



<li>Its static programming paradigm enables graph-level optimizations.</li>



<li>Very well integrated with Google Cloud and other cloud-based services.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-distributed-tensorflow">When to use Distributed TensorFlow?</h4>



<p>You should use Distributed TensorFlow:</p>



<ul class="wp-block-list">
<li>If you want to do data parallelism.&nbsp;</li>



<li>If you like the static paradigm of programming compared to dynamic.</li>



<li>If you are in the Google Cloud ecosystem since TensorFlow is very well optimized for TPUs.&nbsp;</li>



<li>Lastly, if you have huge data and need high processing power.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://www.tensorflow.org/guide/distributed_training" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li><a href="https://github.com/tensorflow/docs/blob/master/site/en/guide/distributed_training.ipynb" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="4-mesh-tensorflow">4. <a href="https://github.com/tensorflow/mesh" target="_blank" rel="noreferrer noopener nofollow">Mesh TensorFlow</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_4.png?resize=453%2C152&#038;ssl=1" alt="Distributed training: TensorFlow" class="wp-image-61145" width="453" height="152"/><figcaption class="wp-element-caption"><em>Distributed training: TensorFlow | <a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow"></a><a href="https://github.com/tensorflow/tensorflow">Source</a></em></figcaption></figure>
</div>


<p>Mesh TensorFlow is again an extension of TensorFlow’s distributed training, but it is specifically designed to train large DL models on Tensor Processing Units (TPUs), which are AI accelerators like GPUs, but faster. Although Mesh TensorFlow can execute data parallelism, it aims to solve distributed training for large models whose parameters cannot fit on one device.&nbsp;</p>



<p>Mesh TensorFlow is inspired by the synchronous data-parallel method, i.e., every worker is involved in every operation. Apart from that, all the workers run the same program, and they communicate through collective operations like Allreduce.&nbsp;</p>
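<p>The Allreduce collective mentioned above can be simulated in a few lines of plain Python (the worker values are invented): every worker contributes its local value, and every worker receives the same reduced result.</p>

```python
def allreduce(worker_values, op=sum):
    """Toy Allreduce: reduce all workers' values, broadcast result back.

    Real implementations (e.g. ring-allreduce) achieve this without a
    central coordinator, but the input/output contract is the same:
    after the call, every worker holds the identical reduced value.
    """
    reduced = op(worker_values)
    return [reduced] * len(worker_values)  # one copy per worker

# Four workers, each holding one local partial sum.
print(allreduce([1, 2, 3, 4]))  # → [10, 10, 10, 10]
```

<p>This is the primitive that keeps synchronous workers identical: averaging gradients is just Allreduce followed by a division by the worker count.</p>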



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>It can train large models with millions and billions of parameters, such as GPT-3, GPT-2, and BERT.&nbsp;</li>



<li>Potentially low latency across the workers.&nbsp;</li>



<li>Good TensorFlow community support.&nbsp;</li>



<li>Availability of TPU-pods from Google.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-mesh-tensorflow">When to use Mesh Tensorflow?</h4>



<p>You should use Mesh TensorFlow:</p>



<ul class="wp-block-list">
<li>If you want to do model parallelism.&nbsp;</li>



<li>If you want to develop huge models and practice rapid-prototyping.</li>



<li>If you are especially working in the area of Natural Language Processing with huge data.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/tensorflow/mesh" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>
</ul>



<h3 class="wp-block-heading" id="5-tensorflowonspark">5. <a href="https://github.com/yahoo/TensorFlowOnSpark" target="_blank" rel="noreferrer noopener nofollow">TensorFlowOnSpark</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_4.png?resize=453%2C152&#038;ssl=1" alt="Distributed training: TensorFlow" class="wp-image-61145" width="453" height="152"/><figcaption class="wp-element-caption"><em>Distributed training: TensorFlow | <a href="https://github.com/tensorflow/tensorflow" target="_blank" rel="noreferrer noopener nofollow"></a><a href="https://github.com/tensorflow/tensorflow">Source</a></em></figcaption></figure>
</div>


<p><strong>Apache Spark</strong> is one of the most well-known open-source big data processing platforms. It allows users to do all kinds of data-related work like data engineering, data science, and machine learning. We already know what TensorFlow is. But if you want to use TensorFlow on Apache Spark, then you have to use TensorFlowOnSpark.&nbsp;</p>



<p>TensorFlowOnSpark is a machine learning framework that allows you to perform distributed training on Apache Spark Clusters and Apache Hadoop. It was developed by Yahoo. The framework allows both distributed training and inference with minimum code changes to existing TensorFlow code on the shared grid.&nbsp;</p>



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Allows easy migration to Spark Clusters with existing TensorFlow programs.&nbsp;</li>



<li>Fewer changes in the code.&nbsp;</li>



<li>All TensorFlow functionalities are available.&nbsp;</li>



<li>Datasets can be efficiently pushed and pulled by Spark and TensorFlow respectively.&nbsp;</li>



<li>Cloud development is easy and efficient on CPUs or GPUs.&nbsp;</li>



<li>Training pipelines can be created easily.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-tensorflowonspark">When to use TensorFlowOnSpark?</h4>



<p>You should use TensorflowOnSpark:</p>



<ul class="wp-block-list">
<li>If your workflow is based on Apache Spark or if you prefer Apache Spark.</li>



<li>If your preferred framework is TensorFlow.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/yahoo/TensorFlowOnSpark" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://yahoo.github.io/TensorFlowOnSpark/" target="_blank" rel="noreferrer noopener nofollow">Documentations</a></li>



<li><a href="https://github.com/yahoo/TensorFlowOnSpark/tree/master/examples" target="_blank" rel="noreferrer noopener nofollow">Examples</a></li>
</ul>



<section id="blog-intext-cta-block_c928fe29e0771c8bd3287c263b99eac7" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-more">Read more</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/extensions-for-tensorflow" target="_blank" rel="noopener">The Best ML Frameworks &amp; Extensions For TensorFlow</a></p>
    
    </section>



<h3 class="wp-block-heading" id="6-bigdl">6. <a href="https://bigdl-project.github.io/master/" target="_blank" rel="noreferrer noopener nofollow">BigDL</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_14.png?resize=264%2C131&#038;ssl=1" alt="Distributed training: BigDL" class="wp-image-61135" width="264" height="131"/><figcaption class="wp-element-caption"><em>Distributed training: BigDL | <a href="https://github.com/intel-analytics/BigDL" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>BigDL is also an open-source framework for distributed training for Apache Spark. It was developed by Intel to allow DL algorithms to run on Hadoop and Spark clusters. One big advantage of BigDL is that it helps you easily build and process production data in an end-to-end pipeline for both data analysis and deep learning applications.&nbsp;</p>



<p>BigDL provides two options:</p>



<ol class="wp-block-list">
<li>You can <strong>directly</strong> use BigDL as you would any other library that Apache Spark provides for data engineering, data analytics et cetera.&nbsp;</li>



<li>You can <strong>scale out python libraries</strong> like PyTorch, TensorFlow, and Keras in the Spark ecosystem.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li><strong>End-to-end pipeline</strong>: If your big data is messy and complex, which is usually the case with live data streams, adopting BigDL is appropriate because it integrates data analytics and deep learning in a single end-to-end pipeline.&nbsp;</li>



<li><strong>Efficiency</strong>: With an integrated approach across the different components of Spark, BigDL makes development, deployment, and operations seamless and efficient.&nbsp;</li>



<li><strong>Communication and computing</strong>: Since the hardware and software components are stitched together, workloads run without interruption, keeping communication between workflows clear and computation fast.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-bigdl">When to use BigDL?</h4>



<p>You should use BigDL:</p>



<ul class="wp-block-list">
<li>If you want to develop an Apache Spark workflow.&nbsp;</li>



<li>If your preferred framework is PyTorch.</li>



<li>If you want to have continuous integration of all the components like data mining, data analytics, machine learning et cetera.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/intel-analytics/BigDL" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://bigdl.readthedocs.io/" target="_blank" rel="noreferrer noopener nofollow">Documentation</a></li>



<li><a href="https://bigdl.readthedocs.io/en/latest/doc/UserGuide/notebooks.html" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="7-horovod">7. <a href="https://github.com/horovod/horovod" target="_blank" rel="noreferrer noopener nofollow">Horovod</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_10.png?resize=257%2C257&#038;ssl=1" alt="Distributed training: Horovod" class="wp-image-61139" width="257" height="257"/><figcaption class="wp-element-caption"><em>Distributed training: Horovod | <a href="https://github.com/horovod/horovod" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Horovod was introduced by Uber in 2017. It is an open-source project made specifically for distributed training, and it is an internal component of Michelangelo, the platform Uber uses to implement its deep learning algorithms. Horovod leverages data-parallel distributed training, which makes scaling easy and efficient: it can scale to hundreds of GPUs with only around five additional lines of Python code. The idea is that you write a training script for a single GPU, and Horovod scales it to train on multiple GPUs in parallel.&nbsp;</p>



<p>Horovod is built for frameworks like TensorFlow, Keras, PyTorch, and Apache MXNet. It is easy to use and fast.&nbsp;</p>
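<p>Under the hood, Horovod&#8217;s key primitive is an allreduce that averages gradients across workers after each step, so every worker applies the identical update. The sketch below illustrates that data-parallel pattern in plain Python. It is a conceptual illustration, not the Horovod API; the one-parameter model, loss, and shard layout are invented for the example.</p>

```python
# Conceptual sketch of Horovod-style data parallelism (pure Python,
# not the Horovod API): each worker computes a gradient on its own
# data shard, then an allreduce averages the gradients so every
# worker applies the same update.

def local_gradient(w, shard):
    # Hypothetical per-worker gradient of a squared-error loss
    # for a one-parameter model y = w * x.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def allreduce_mean(values):
    # In Horovod this is a ring-allreduce across processes;
    # here we simply average in-process.
    return sum(values) / len(values)

def training_step(w, shards, lr=0.01):
    grads = [local_gradient(w, shard) for shard in shards]  # done in parallel
    g = allreduce_mean(grads)   # identical result on every worker
    return w - lr * g           # identical update everywhere

# Four "workers", each holding its own shard of (x, y) pairs from y = 3x.
shards = [[(1, 3), (2, 6)], [(3, 9)], [(4, 12)], [(5, 15)]]
w = 0.0
for _ in range(200):
    w = training_step(w, shards)
print(round(w, 2))  # converges toward 3.0
```

In real Horovod the same structure survives: the training loop is unchanged, and only the optimizer is wrapped so that gradients pass through the allreduce before being applied.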



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>Easy to learn and implement if you are familiar with TensorFlow, Keras, PyTorch, or Apache MXNet.</li>



<li>If you are using Apache Spark, you can unify all the processes in a single pipeline.</li>



<li>Good community support.</li>



<li>It is fast.&nbsp;</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-horovod">When to use Horovod?</h4>



<p>You should use Horovod:</p>



<ul class="wp-block-list">
<li>If you want to scale a single GPU script quickly across multiple GPUs.</li>



<li>If you are using Microsoft Azure as your cloud computing platform.&nbsp;</li>
</ul>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/horovod/horovod" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://github.com/horovod/horovod#documentation" target="_blank" rel="noreferrer noopener nofollow">Documentation</a></li>



<li><a href="https://github.com/horovod/horovod/tree/master/examples" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h3 class="wp-block-heading" id="8-ray">8. <a href="https://www.ray.io/" target="_blank" rel="noreferrer noopener nofollow">Ray</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_5.png?resize=453%2C141&#038;ssl=1" alt="Distributed training: Ray" class="wp-image-61144" width="453" height="141"/><figcaption class="wp-element-caption"><em>Distributed training: Ray | <a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Ray is another open-source framework for distributed computing that integrates tightly with PyTorch. It provides tools for launching GPU clusters on any cloud provider. Unlike the other libraries we have discussed so far, Ray is very flexible and can run almost anywhere: Azure, GCP, AWS, Apache Spark, and Kubernetes.&nbsp;</p>
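<p>Ray&#8217;s core API turns ordinary Python functions into tasks that are scheduled in parallel across a cluster (via <code>@ray.remote</code> and <code>ray.get</code>). The snippet below sketches that fan-out/fan-in pattern using only the standard library on a single machine; Ray generalizes the same idea across many machines. The <code>evaluate</code> function and the hyperparameter configs are invented for illustration.</p>

```python
# Fan-out/fan-in sketch of Ray's task model using the standard
# library (not the Ray API). Ray schedules such tasks across a
# whole cluster; here a local thread pool plays that role.
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    # Stand-in for an expensive trial, e.g. training a model with
    # one hyperparameter setting (what Ray Tune fans out at scale).
    lr = config["lr"]
    return {"lr": lr, "score": 1.0 - abs(lr - 0.1)}

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 1.0)]

# Fan out the trials in parallel, then gather the results
# (analogous to ray.get on a list of remote task handles).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, configs))

best = max(results, key=lambda r: r["score"])
print(best["lr"])  # 0.1
```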



<p>Ray offers the following libraries in its bundle for hyperparameter tuning, reinforcement learning, deep learning, data loading, and serving:</p>



<ol class="wp-block-list">
<li><a href="https://docs.ray.io/en/master/tune.html" target="_blank" rel="noreferrer noopener nofollow">Tune</a>: Scalable Hyperparameter Tuning.</li>



<li><a href="https://docs.ray.io/en/master/rllib/index.html" target="_blank" rel="noreferrer noopener nofollow">RLlib</a>: Distributed Reinforcement Learning.</li>



<li><a href="https://docs.ray.io/en/master/train/train.html" target="_blank" rel="noreferrer noopener nofollow">Train</a>: Distributed Deep Learning, currently in beta version.&nbsp;</li>



<li><a href="https://docs.ray.io/en/master/data/dataset.html" target="_blank" rel="noreferrer noopener nofollow">Datasets</a>: Distributed Data Loading and Compute, currently in beta version.&nbsp;</li>



<li><a href="https://docs.ray.io/en/master/serve/index.html" target="_blank" rel="noreferrer noopener nofollow">Serve</a>: Scalable and Programmable Serving.</li>



<li><a href="https://docs.ray.io/en/master/workflows/concepts.html" target="_blank" rel="noreferrer noopener nofollow">Workflows</a>: Fast, Durable Application Flows.</li>
</ol>



<p>Apart from these libraries, Ray also integrates with third-party libraries and frameworks, which allows you to develop, train, and scale your workloads with minimal code changes. Given below is the list of integrated libraries:</p>



<ol class="wp-block-list">
<li>Airflow</li>



<li>ClassyVision</li>



<li>Dask</li>



<li>Flambe</li>



<li>Horovod</li>



<li>Hugging Face Transformers</li>



<li>Intel Analytics Zoo</li>



<li>John Snow Labs’ NLU</li>



<li>LightGBM</li>



<li>Ludwig AI</li>



<li>MARS</li>



<li>Modin</li>



<li>PyCaret</li>



<li>PyTorch Lightning</li>



<li>RayDP</li>



<li>Scikit Learn</li>



<li>Seldon Alibi&nbsp;</li>



<li>Spacy</li>



<li>XGBoost</li>
</ol>



<h4 class="wp-block-heading" id="advantages">Advantages</h4>



<ol class="wp-block-list">
<li>It supports Jupyter Notebooks.&nbsp;</li>



<li>It makes your code run in parallel on single and multiple machines.&nbsp;</li>



<li>It integrates multiple frameworks and libraries.&nbsp;</li>



<li>It works with all the major cloud computing platforms.</li>
</ol>



<h4 class="wp-block-heading" id="when-to-use-ray">When to use Ray?</h4>



<p>You should use Ray:</p>



<ol class="wp-block-list">
<li>If you want to perform distributed reinforcement learning</li>



<li>If you want to perform distributed hyperparameter tuning</li>



<li>If you want to use distributed data loading and compute across different machines.&nbsp;</li>



<li>If you want to serve your application.</li>
</ol>



<h4 class="wp-block-heading" id="related-information">Related information</h4>



<ul class="wp-block-list">
<li><a href="https://github.com/ray-project/ray" target="_blank" rel="noreferrer noopener nofollow">Repository Link</a></li>



<li><a href="https://www.ray.io/docs" target="_blank" rel="noreferrer noopener nofollow">Documentation</a></li>



<li><a href="https://github.com/ray-project/tutorial" target="_blank" rel="noreferrer noopener nofollow">Tutorial</a></li>
</ul>



<h2 class="wp-block-heading" id="h-cloud-platforms-for-distributed-training">Cloud platforms for distributed training</h2>



<p>So far, we have discussed the frameworks and libraries that enable distributed training. Now, let’s explore the cloud platforms that give you access to the hardware for training your DL models efficiently. But before that, let’s lay out some criteria for choosing the cloud platform that best fits your requirements.&nbsp;</p>



<ol class="wp-block-list">
<li><strong>Hardware and software support: </strong>It is important to understand what hardware these platforms offer: GPUs, TPUs, storage units, and so on. Depending on your project, you should also look at the APIs they expose for hosting, containers, data analytics tools, and so forth.&nbsp;</li>



<li><strong>Availability zones: </strong>Availability zones are an important factor in cloud computing: they give users the flexibility to set up and deploy a project anywhere in the world, and to move it whenever they want.&nbsp;</li>



<li><strong>Pricing: </strong>Whether the platform charges based on usage or offers a subscription-based model.&nbsp;</li>
</ol>



<p>Now, let’s discuss cloud computing options. We will cover a ready-to-use notebook platform and the three most popular cloud computing services.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_11.png?resize=574%2C599&#038;ssl=1" alt="Magic quadrant for cloud infrastructure as a service" class="wp-image-61138" width="574" height="599"/><figcaption class="wp-element-caption"><em>Magic quadrant for cloud infrastructure as a service | <a href="https://www.c-sharpcorner.com/article/top-10-cloud-service-providers/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="1-google-colab">1. <a href="https://colab.research.google.com/" target="_blank" rel="noreferrer noopener nofollow">Google Colab</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_17.png?resize=311%2C138&#038;ssl=1" alt="Distributed training: Google Colab" class="wp-image-61197" width="311" height="138"/><figcaption class="wp-element-caption"><em>Distributed training: Google Colab | <a href="https://colab.research.google.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Google Colab is one of the most reliable and easy-to-use platforms for small- to medium-scale projects. One good thing about Google Colab is that you can easily connect it to Google Cloud and work with any of the Python libraries mentioned above. It offers three tiers:</p>



<ol class="wp-block-list">
<li><strong>Google Colab</strong> is free of cost and gives you access to GPUs and TPUs, but storage and memory are limited. Once either limit is exceeded, the program stops.&nbsp;</li>



<li><strong>Google Colab Pro</strong> is a subscription version of Google Colab with extra memory and storage. You can run fairly heavy models, but resources are still limited.&nbsp;</li>



<li><strong>Google Colab Pro+</strong> is the newest and most expensive subscription tier. It offers faster GPUs and TPUs plus extra memory, so you can run larger models on larger datasets.&nbsp;</li>
</ol>



<p>Given below is the official comparison of all three.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_8.png?resize=746%2C424&#038;ssl=1" alt="Cloud platforms" class="wp-image-61141" width="746" height="424"/><figcaption class="wp-element-caption"><em>Cloud platforms | <a href="https://colab.research.google.com/signup" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<section id="blog-intext-cta-block_734c7e8457f4e4324b1bab34e1ef9ae6" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><img loading="lazy" decoding="async" class="lazyload block-blog-intext-cta__arrow-image" src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg" alt="" width="12" height="12" data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-arrow--right-gray.svg" />️ <a href="/blog/how-to-use-google-colab-for-deep-learning-complete-tutorial" target="_blank" rel="noopener">How to Use Google Colab for Deep Learning [Complete Tutorial]</a></p>
    
    </section>



<h3 class="wp-block-heading" id="2-amazon-web-services-sagemaker">2. <a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Amazon Web Services: SageMaker</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_6.png?resize=376%2C160&#038;ssl=1" alt="Distributed training: AWS SageMaker" class="wp-image-61143" width="376" height="160"/><figcaption class="wp-element-caption"><em>Distributed training: AWS SageMaker | <a href="https://nub8.net/machine-learning-with-amazon-sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>AWS SageMaker is one of the oldest and most popular cloud computing platforms for distributed training. It is well integrated with Apache MXNet, PyTorch, and TensorFlow and allows you to deploy deep learning algorithms with little code modification. The SageMaker API has <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html" target="_blank" rel="noreferrer noopener nofollow">18+ machine learning algorithms</a>, some of which were rewritten from scratch to make the whole process scalable and easy. These built-in algorithms are optimized to get the most out of the hardware.&nbsp;</p>



<p>SageMaker also has an integrated Jupyter Notebook that allows data scientists and machine learning engineers to build and develop pipelines on the go and deploy them directly in a hosted environment. You can configure hardware and environments based on your requirements and preferences from SageMaker Studio or the SageMaker console. All hosting and development is billed according to <strong>usage per minute</strong>.&nbsp;</p>



<p>AWS SageMaker offers both data-parallel and model-parallel distributed training. In fact, SageMaker also offers a hybrid strategy where you combine model and data parallelism.&nbsp;</p>
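<p>The difference between the two strategies can be sketched in a few lines of plain Python. This is a conceptual illustration, not the SageMaker SDK; the toy &#8220;layers&#8221; are simple functions standing in for network stages.</p>

```python
# Data parallelism vs. model parallelism, conceptually (pure Python,
# not the SageMaker SDK). The "model" is a chain of layer functions.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def forward(x, layer_subset):
    for layer in layer_subset:
        x = layer(x)
    return x

batch = [1, 2, 3, 4]

# Data parallelism: every worker holds ALL layers and processes
# its own slice of the batch.
worker_a = [forward(x, layers) for x in batch[:2]]
worker_b = [forward(x, layers) for x in batch[2:]]
data_parallel_out = worker_a + worker_b

# Model parallelism: each device holds only SOME layers, and
# activations flow from device to device (a pipeline).
def model_parallel(x):
    x = forward(x, layers[:2])     # "device 0": first two layers
    return forward(x, layers[2:])  # "device 1": last layer

model_parallel_out = [model_parallel(x) for x in batch]

print(data_parallel_out == model_parallel_out)  # True: same math, different placement
```

A hybrid strategy applies both at once: the batch is split across groups of devices, and within each group the model itself is partitioned.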


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_1.png?resize=780%2C496&#038;ssl=1" alt="Distributed training: AWS SageMaker" class="wp-image-61148" width="780" height="496"/><figcaption class="wp-element-caption"><em>Distributed training: AWS SageMaker | <a href="https://aws.amazon.com/blogs/machine-learning/the-aws-deep-learning-ami-now-with-ubuntu/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="3-google-cloud-computing">3. <a href="https://cloud.google.com/" target="_blank" rel="noreferrer noopener nofollow">Google Cloud Computing</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_15.png?resize=165%2C165&#038;ssl=1" alt="Distributed training: Google Cloud Computing" class="wp-image-61134" width="165" height="165"/><figcaption class="wp-element-caption"><em>Distributed training: Google Cloud Computing | <a href="https://cloud.google.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Google Cloud was developed by Google in 2010, initially to strengthen its own products such as Google Search and YouTube. Gradually, Google opened it up to the public. Google Cloud offers the same infrastructure that all of Google&#8217;s platforms use.&nbsp;</p>



<p>Google Cloud offers built-in support for libraries like TensorFlow, PyTorch, scikit-learn, and many more. Furthermore, apart from configuring GPUs in your workflow, you can add TPUs to make the training process much faster. As mentioned before, you can connect your Google Colab to the Google Cloud Platform and access all the features it provides.&nbsp;</p>



<p>Some of the features that it provides are:&nbsp;</p>



<ol class="wp-block-list">
<li>Compute (virtual hardware such as GPUs and TPUs)</li>



<li>Storage buckets&nbsp;</li>



<li>Databases&nbsp;</li>



<li>Networking</li>



<li>Management tools</li>



<li>Security</li>



<li>IoT</li>



<li>API platform</li>



<li>Hosting Services</li>
</ol>



<p>It is worth noting that GCP has fewer availability zones than AWS, but it is also less expensive.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_13.png?resize=800%2C500&#038;ssl=1" alt="Distributed training: Google Cloud Computing" class="wp-image-61136" width="800" height="500"/><figcaption class="wp-element-caption"><em>Distributed training: Google Cloud Computing | Source: Author</em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="4-microsoft-azure">4. <a href="https://azure.microsoft.com/en-us/">Microsoft Azure</a></h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_9.png?resize=347%2C195&#038;ssl=1" alt="Distributed training: Microsoft Azure" class="wp-image-61140" width="347" height="195"/><figcaption class="wp-element-caption"><em>Distributed training: Microsoft Azure | <a href="https://medium.com/analytics-vidhya/azure-machine-learning-service-part-1-80e43e4af71b" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Microsoft Azure is another very popular cloud computing platform. One of the most popular language models, GPT-3 from OpenAI, was trained on Azure. Azure also offers both <a href="https://docs.microsoft.com/en-us/azure/machine-learning/concept-distributed-training#data-parallelism" target="_blank" rel="noreferrer noopener nofollow">data parallelism</a> and <a href="https://docs.microsoft.com/en-us/azure/machine-learning/concept-distributed-training#model-parallelism" target="_blank" rel="noreferrer noopener nofollow">model parallelism</a> and supports both TensorFlow and PyTorch. If you want to optimize computing speed further, you can also leverage Uber’s Horovod.</p>



<p>The Azure Machine Learning service is for both coders and non-coders. It offers a drag-and-drop approach that can streamline your workflow, and it reduces manual work with automated machine learning that can help you develop working prototypes faster.&nbsp;</p>



<p>The Azure Python SDK also lets you work from any Python environment, like Jupyter Notebooks, Visual Studio Code, and many more. Azure is quite similar to both AWS and GCP in terms of the services it offers:</p>



<ol class="wp-block-list">
<li>AI, Machine Learning and Deep learning</li>



<li>Computing powers (GPUs)&nbsp;</li>



<li>Analytics&nbsp;</li>



<li>Blockchain</li>



<li>Containers</li>



<li>Databases</li>



<li>Developer Tools</li>



<li>DevOps</li>



<li>Internet of Things</li>



<li>Mixed Reality</li>



<li>Mobile</li>



<li>Networking et cetera</li>
</ol>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_3.png?resize=781%2C527&#038;ssl=1" alt="Distributed training: Microsoft Azure" class="wp-image-61146" width="781" height="527"/><figcaption class="wp-element-caption"><em>Distributed training: Microsoft Azure |  <a href="https://docs.microsoft.com/en-us/azure/azure-portal/azure-portal-dashboards" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Let’s also compare the three main platforms side by side to give you a better basis for making a choice.</p>



<h3 class="wp-block-heading" id="comparison-table-for-cloud-platform">Comparison table for cloud platform</h3>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Distributed-Training-Libraries-and-Tools_2.png?ssl=1" alt="Comparison table for cloud platform" class="wp-image-61147"/><figcaption class="wp-element-caption"><em>Comparison table for cloud platform | <a href="https://medium.com/georgian-impact-blog/comparing-google-cloud-platform-aws-and-azure-d4a52a3adbd2" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>In this article, we saw different libraries and tools that can help you implement distributed training for your own deep learning application. Bear in mind that all the libraries are effective at what they do; ultimately, it all boils down to your preferences and requirements.&nbsp;</p>



<p>You may have noticed that all the frameworks discussed integrate primarily with PyTorch and TensorFlow in some way or another. This can help you narrow down your framework of choice. Once your framework is decided, you can look at the advantages to decide which distributed training tool works best for you.&nbsp;</p>



<p>I hope you enjoyed this article. If you want to try out the frameworks we discussed, follow the tutorial links above.&nbsp;</p>



<p>Thanks for reading!</p>



<h3 class="wp-block-heading" id="references">References</h3>



<ul class="wp-block-list">
<li><a href="https://neptune.ai/blog/ml-model-monitoring-best-tools" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/ml-model-monitoring-best-tools</a></li>



<li><a href="https://neptune.ai/blog/best-mlops-tools" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/best-mlops-tools</a></li>



<li><a href="https://towardsdatascience.com/how-to-train-your-deep-learning-models-in-a-distributed-fashion-43a6f53f0484" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/how-to-train-your-deep-learning-models-in-a-distributed-fashion-43a6f53f0484</a></li>



<li><a href="https://analyticsindiamag.com/top-distributed-training-frameworks-in-2021/" target="_blank" rel="noreferrer noopener nofollow">https://analyticsindiamag.com/top-distributed-training-frameworks-in-2021/</a></li>



<li><a href="https://www.telesens.co/2017/12/25/understanding-data-parallelism-in-machine-learning/" target="_blank" rel="noreferrer noopener nofollow">https://www.telesens.co/2017/12/25/understanding-data-parallelism-in-machine-learning/</a></li>



<li><a href="https://towardsdatascience.com/distributed-training-on-aws-sagemaker-8bcbea28466c" target="_blank" rel="noreferrer noopener nofollow">https://towardsdatascience.com/distributed-training-on-aws-sagemaker-8bcbea28466c</a></li>



<li><a href="https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu" target="_blank" rel="noreferrer noopener nofollow">https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu</a></li>






<li><a href="https://arxiv.org/pdf/1811.02084.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/1811.02084.pdf</a></li>



<li><a href="https://developer.yahoo.com/blogs/157196317141/" target="_blank" rel="noreferrer noopener nofollow">https://developer.yahoo.com/blogs/157196317141/</a></li>



<li><a href="http://www.vldb.org/pvldb/vol13/p3005-li.pdf" target="_blank" rel="noreferrer noopener nofollow">http://www.vldb.org/pvldb/vol13/p3005-li.pdf</a></li>



<li><a href="https://arxiv.org/pdf/1804.05839.pdf" target="_blank" rel="noreferrer noopener nofollow">https://arxiv.org/pdf/1804.05839.pdf</a></li>
</ul>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6462</post-id>	</item>
		<item>
		<title>Model Deployment Strategies</title>
		<link>https://neptune.ai/blog/model-deployment-strategies</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:46:18 +0000</pubDate>
				<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.test/model-deployment-strategies/</guid>

					<description><![CDATA[In recent years, big data and machine learning has been adopted in most of the major industries and most startups are leaning towards the same. As data has become an integral part of all companies, ways to process them i.e. derive meaningful insights and patterns are essential. This is where machine learning comes into the&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In recent years, big data and machine learning have been adopted in most major industries, and many startups are leaning in the same direction. As data has become an integral part of every company, ways to process it, i.e., to derive meaningful insights and patterns, are essential. This is where machine learning comes into the picture. </p>



<p>We already know how efficient machine learning systems are at processing huge amounts of data and, depending on the task at hand, yielding results in real time. But these systems need to be curated and deployed properly so that they perform the task efficiently. This article aims to provide you with information on <strong>model deployment strategies</strong> and how to choose the one that is best for your application.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_5" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_5.png?resize=795%2C625&#038;ssl=1" alt="The entire pipeline of a data-science project" class="wp-image-63169" width="795" height="625"/></a><figcaption class="wp-element-caption"><em>The image above depicts the entire pipeline of a data-science project | <a href="https://arxiv.org/pdf/2103.08937.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>We will cover the following strategies and techniques for model deployment:</strong></p>



<div id="case-study-numbered-list-block_864715b185adc97e438a2bba2390a414"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Shadow evaluation            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                A/B testing            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Multi Arm Bandits            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Blue-green deployment            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Canary testing            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Feature flag            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Rolling deployment            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Recreate strategy            </li>
            </ul>
</div>



<div id="separator-block_1a76c94c024c6df9ccb8972303143a75"
         class="block-separator block-separator--10">
</div>



<p>These strategies can be broken down into two categories:</p>



<ul class="wp-block-list">
<li><strong>Static deployment strategies</strong>: These are strategies where the distribution of traffic or requests is handled manually. Examples are shadow evaluation, A/B testing, canary testing, rolling deployment, and blue-green deployment.&nbsp;</li>



<li><strong>Dynamic deployment strategies: </strong>These are strategies where the distribution of traffic or requests is handled automatically. An example is multi-armed bandits.&nbsp;</li>
</ul>
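<p>The static/dynamic distinction is easiest to see in code. Below is a minimal sketch of static traffic splitting, e.g. sending 10% of requests to a canary model. The version names and weights are invented for illustration; a dynamic strategy such as a multi-armed bandit would update the weights automatically from observed feedback instead of leaving them fixed.</p>

```python
import random

# Minimal sketch of static traffic splitting between two model
# versions: the mechanism behind A/B tests and canary releases.
# The weights are fixed by hand, which is what makes it "static".
def route(weights, rng=random):
    """Pick a model version according to fixed traffic weights."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions])[0]

weights = {"model_v1": 0.9, "model_v2": 0.1}  # 10% canary traffic

random.seed(0)
counts = {"model_v1": 0, "model_v2": 0}
for _ in range(10_000):
    counts[route(weights)] += 1
print(counts["model_v2"] / 10_000)  # roughly 0.1
```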



<div id="separator-block_9338c7c68f6c33cb794813d775b26342"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_10.png?resize=800%2C390&#038;ssl=1" alt="Model deployment strategies" class="wp-image-63164" width="800" height="390"/><figcaption class="wp-element-caption"><em>Model deployment strategies | <a href="https://www.coursera.org/lecture/ml-models-human-in-the-loop-pipelines/model-deployment-strategies-6icWT" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>To begin with, let’s have a quick overview of what the model lifecycle and model deployment refer to.</p>



<section id="blog-intext-cta-block_24ca560510faf8b00de208d52903cde7" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also</h3>
    
            <p style="text-align: left;">  <a href="/blog/model-deployment-challenges-lessons-from-ml-engineers" target="_blank" rel="noopener">Model Deployment Challenges: 6 Lessons From 6 ML Engineers</a></p>
<p>  <a href="/blog/best-8-machine-learning-model-deployment-tools" target="_blank" rel="noopener">Best Machine Learning Model Deployment Tools</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-lifecycle-of-an-ml-model">Lifecycle of an ML model</h2>



<p>The lifecycle of a machine learning model refers to the entire process that structures a data science or AI project. It is similar to the software development life cycle (SDLC) but differs in a few key areas, such as the use of real-world data to evaluate the model’s performance before deployment. The lifecycle of an ML model, or model development life cycle (MDLC), primarily has five phases:</p>



<div id="case-study-numbered-list-block_bcd422d18a8644971c6c7f22aa78ebd6"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data collection            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Create model and training            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Testing and evaluation             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Deployment and production              </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Monitoring             </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_12" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_12.png?resize=861%2C878&#038;ssl=1" alt="Model development lifecycle (MDLC)" class="wp-image-63162" width="861" height="878"/></a><figcaption class="wp-element-caption"><em>Model development lifecycle (MDLC) | <a href="https://towardsdatascience.com/the-machine-learning-lifecycle-in-2021-473717c633bc" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Now, another term that you must be familiar with is <strong><a href="/blog/mlops" target="_blank" rel="noreferrer noopener">MLOps</a></strong>. MLOps is a set of practices that enables the ML lifecycle, stitching machine learning and software applications together. Simply put, it is a collaboration between data scientists and the operations team that takes care of and orchestrates the whole ML lifecycle. The three key areas MLOps focuses on are <strong>continuous integration</strong>, <strong>continuous deployment,</strong> and <strong>continuous testing</strong>. </p>



<section id="blog-intext-cta-block_1a9a2be3c98034e1ba5f051b6adab6b1" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p>  <a href="/blog/mlops" target="_blank" rel="noopener">MLOps: What It Is, Why It Matters, and How to Implement It</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-what-is-model-deployment-or-model-release">What is model deployment (or model release)?</h2>



<p>Model deployment (release) is a process that enables you to integrate machine learning models into production to make decisions on real-world data. It is essentially the second-to-last stage of the ML lifecycle, before <a href="/blog/ml-model-monitoring-best-tools" target="_blank" rel="noreferrer noopener nofollow">monitoring</a>. Once deployed, the model needs to be monitored to check whether the whole process of data ingestion, feature engineering, training, testing, et cetera is aligned properly, so that no human intervention is required and the whole process is automatic.</p>



<p>But before deploying the model, one has to evaluate and test whether the trained ML model is fit to be deployed to production. The model is tested for performance, efficiency, bugs, and issues. There are various strategies one can use before deploying an ML model. Let us explore them. </p>



<h2 class="wp-block-heading" id="h-model-deployment-strategies">Model deployment strategies</h2>



<p>Deployment strategies let us evaluate a model’s performance and capabilities and discover issues concerning the model. A key point to keep in mind is that the right strategy usually depends on the task and resources at hand. Some strategies are thorough but computationally expensive, while others get the job done with ease. Let’s discuss a few of them.</p>



<h3 class="wp-block-heading" id="h-1-shadow-deployment-strategy">1. Shadow deployment strategy</h3>



<p>In shadow deployment or shadow mode, the new model, with its new features, is deployed alongside the live model. The newly deployed model is known as a <strong>shadow model</strong>. The shadow model handles all the requests just like the live model, except it is not released to the public.</p>



<p>This strategy allows us to evaluate the shadow model better by testing it on real-world data while not interrupting the services offered by the live model.&nbsp;</p>



<div id="separator-block_9338c7c68f6c33cb794813d775b26342"
         class="block-separator block-separator--15">
</div>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_4.png?resize=686%2C408&#038;ssl=1" alt="Shadow deployment strategy" class="wp-image-63170" width="686" height="408"/><figcaption class="wp-element-caption"><em>Shadow deployment strategy | <a href="https://alexgude.com/blog/machine-learning-deployment-shadow-mode/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Methodology: champion vs challenger</h4>



<p>In shadow evaluation, the request is sent to both models, which run in parallel behind two API endpoints. During inference, predictions from both models are computed and stored, but only the prediction from the live model is used in the application and returned to the users.</p>



<p>The predicted values from both the live and shadow model are compared against the ground truth. Once the results are in hand, data scientists can decide whether to deploy the shadow model globally into production or not.&nbsp;</p>
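<p>A minimal sketch of this request flow in Python (the model classes and names here are illustrative placeholders, not part of any specific framework):</p>

```python
import logging

# Placeholder models with a predict() method -- purely illustrative.
class StubModel:
    def __init__(self, offset):
        self.offset = offset

    def predict(self, features):
        return features + self.offset

live_model = StubModel(offset=0)     # champion: serves users
shadow_model = StubModel(offset=1)   # challenger: scored silently

shadow_log = []  # in production this would feed a metrics store

def handle_request(features):
    """Return the live prediction; score the shadow model on the side."""
    live_pred = live_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
        shadow_log.append({"live": live_pred, "shadow": shadow_pred})
    except Exception:
        # A shadow failure must never affect the user-facing response.
        logging.exception("shadow model failed")
    return live_pred  # only the champion's output reaches the user
```

<p>Both predictions are persisted for offline comparison against the ground truth, but users only ever see the live model’s output.</p>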



<p>One can also use the <a href="https://medium.com/decision-automation/what-is-champion-challenger-and-how-does-it-enable-choosing-the-right-decision-f57b8b653149" target="_blank" rel="noreferrer noopener nofollow">champion/challenger</a> framework, in which multiple shadow models are tested and compared with the existing model. The model with the best accuracy or key performance indicator (KPI) is selected and deployed.</p>



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>Model evaluation is efficient: since both models run in parallel, there is no impact on traffic.</li>



<li>No overloading irrespective of the traffic.&nbsp;</li>



<li>You can monitor the shadow model which allows you to check the stability and performance; this reduces risk.&nbsp;</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>Expensive because of the resources required to support the shadow model.&nbsp;</li>



<li>Shadow deployment can be tedious, especially if you are concerned about different aspects of model performance like metrics comparison, latency, load testing, et cetera.</li>



<li>Provides no user response data.&nbsp;</li>
</ul>



<p><strong>When to use it?</strong></p>



<ul class="wp-block-list">
<li>If you want to compare multiple models with each other, shadow testing is great, although tedious.</li>



<li>Shadow testing lets you evaluate the pipeline and its latency as well as its load-bearing capacity.</li>
</ul>



<h3 class="wp-block-heading" id="h-2-a-b-testing-model-deployment-strategy">2. A/B testing model deployment strategy</h3>



<p>A/B testing is a data-driven strategy. It is used to evaluate two models, A and B, to assess which one performs better in a controlled environment. It is primarily used on e-commerce websites and social media platforms. With A/B testing, data scientists can evaluate and choose the best design for the website based on data received from the users.</p>



<p>The two models differ slightly in terms of features and they cater to different sets of users. Based on the interaction and data received from the users such as feedback, data scientists choose one of the models that can be deployed globally into production.&nbsp;</p>



<h4 class="wp-block-heading">Methodology</h4>



<p>In A/B testing, the two models are set up in parallel with different features. The aim is to increase the <strong>conversion rate</strong> of a given model. To do that, the data scientist sets up a hypothesis: an assumption based on an abstract intuition about the data. The assumption is put to the test in an experiment; if it passes, it is accepted as fact and the model is accepted, otherwise it’s rejected.</p>



<h5 class="wp-block-heading">Hypothesis testing</h5>



<p>In A/B testing there are two types of hypothesis:&nbsp;</p>



<div id="case-study-numbered-list-block_63cfb35d1d1d3a3551c4749e796c8a11"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
The null hypothesis states that the phenomenon observed in the model is purely due to chance and not because of a certain feature.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
The alternate hypothesis challenges the null hypothesis by stating that the phenomenon observed in the model is because of a certain feature.             </li>
            </ul>
</div>



<p>In <a href="https://www.analyticsvidhya.com/blog/2021/09/hypothesis-testing-in-machine-learning-everything-you-need-to-know/" target="_blank" rel="noreferrer noopener nofollow">hypothesis testing</a>, the aim is to reject the null hypothesis by setting up an experiment, such as an A/B test, and exposing the new model, with a certain feature, to a few users. The new model is essentially designed around the alternate hypothesis. If the alternate hypothesis is accepted and the null hypothesis is rejected, the feature is added and the new model is deployed globally.</p>



<p>It is important to know that in order to reject the null hypothesis you have to prove the <a href="https://www.investopedia.com/terms/s/statistical-significance.asp#:~:text=Statistical%20significance%20refers%20to%20the,attributable%20to%20a%20specific%20cause." target="_blank" rel="noreferrer noopener nofollow"><strong>statistical significance</strong></a><strong> </strong>of the test.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_11.png?resize=814%2C353&#038;ssl=1" alt="A/B testing model deployment strategy" class="wp-image-63163" width="814" height="353"/><figcaption class="wp-element-caption"><em>A/B testing model deployment strategy | <a href="https://www.oreilly.com/library/view/building-machine-learning/9781492045106/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>Advantages</strong>:</p>



<ul class="wp-block-list">
<li>It is simple.&nbsp;</li>



<li>Yields quick results and helps in the elimination of the low performing model.</li>
</ul>



<p><strong>Disadvantages</strong>:</p>



<ul class="wp-block-list">
<li>Results can be unreliable as complexity increases. One should use A/B testing only for simple hypothesis tests.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>As mentioned earlier, A/B testing is predominantly used on e-commerce, social media, and online streaming platforms. In such a setting, if you have two models, you can use A/B testing to evaluate them and choose which one to deploy globally.</p>



<h3 class="wp-block-heading" id="h-3-multi-armed-bandit">3. Multi-Armed Bandit</h3>



<p>Multi-Armed Bandit, or MAB, is an advanced version of A/B testing. Inspired by reinforcement learning, the idea is to explore and exploit the environment in a way that maximizes a reward function. </p>



<p>MAB leverages machine learning to explore and exploit the data received in order to optimize the key performance indicator (KPI). The advantage of this technique is that user traffic is diverted according to the KPIs of two or more models; the model that yields the best KPI is deployed globally.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_6.png?resize=815%2C550&#038;ssl=1" alt="Multi Armed Bandit strategy" class="wp-image-63168" width="815" height="550"/><figcaption class="wp-element-caption"><em>Multi Armed Bandit strategy | <a href="https://vwo.com/blog/multi-armed-bandit-algorithm/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Methodology</h4>



<p>MAB heavily depends on two concepts: <a href="https://www.manifold.ai/exploration-vs-exploitation-in-reinforcement-learning" target="_blank" rel="noreferrer noopener nofollow">exploration and exploitation</a>.&nbsp;</p>



<p><strong>Exploration: </strong>The algorithm gathers statistically significant results by trying each model, much like what we saw in A/B testing, whose prime focus is to discover the conversion rates of the two models.</p>



<p><strong>Exploitation</strong>: The algorithm uses a greedy approach to maximize conversion rates based on the information gained during exploration.</p>



<p>MAB is much more flexible than A/B testing: it can work with more than two models at a time, which increases the rate of conversion. The algorithm continuously logs each model’s KPI score based on its success on the route from which the request was made. This allows the algorithm to keep updating its estimate of which model is best.</p>
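<p>An epsilon-greedy bandit is one simple way to implement this explore/exploit loop. The sketch below is generic and not tied to any particular serving stack; the reward here is assumed to be a 0/1 conversion signal:</p>

```python
import random

class EpsilonGreedyBandit:
    """Route traffic among model variants, exploiting the best KPI so far."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon        # fraction of traffic that explores
        self.counts = [0] * n_arms    # requests routed to each variant
        self.values = [0.0] * n_arms  # running mean reward (e.g. conversion)

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore a random arm
        return self.values.index(max(self.values))     # exploit the leader

    def update(self, arm, reward):
        """Log the outcome of a request so the KPI estimate stays current."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

<p>Each incoming request calls <code>select_arm()</code> to pick a model, and <code>update()</code> records whether the user converted, so traffic drifts toward the best-performing variant automatically.</p>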


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://neptune.ai/model-deployment-strategies_8" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_8.png?resize=831%2C392&#038;ssl=1" alt="Building machine learning powered application" class="wp-image-63166" width="831" height="392"/></a><figcaption class="wp-element-caption"><em>Building machine learning powered application | <a href="https://www.oreilly.com/library/view/building-machine-learning/9781492045106/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><strong>Advantages</strong>:</p>



<ul class="wp-block-list">
<li>With exploration and exploitation, MAB offers adaptive testing.</li>



<li>Resources are not wasted, as they are in A/B testing.</li>



<li>A faster and more efficient way of testing.</li>
</ul>



<p><strong>Disadvantages</strong>:</p>



<ul class="wp-block-list">
<li>It can be costly, because exploitation takes a lot of computing power, which can be economically expensive.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>MAB is very helpful in scenarios where the conversion rate is all you care about and decisions must be made quickly, for example, optimizing offers or discounts on a product for a limited period.</p>



<h3 class="wp-block-heading" id="h-4-blue-green-deployment-strategy">4. Blue-green deployment strategy</h3>



<p>The blue-green deployment strategy involves two production environments, not just two models. The blue environment hosts the live model, whereas the green environment hosts the new version of the model.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_13.png?resize=795%2C314&#038;ssl=1" alt="Blue-green deployment strategy" class="wp-image-63161" width="795" height="314"/><figcaption class="wp-element-caption"><em>Blue-green deployment strategy | <a href="https://www.data4v.com/machine-learning-deployment-strategies/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The green environment is set up as a staging environment, i.e., an exact replica of the live environment but with new features. Let us briefly go through the methodology.</p>



<h4 class="wp-block-heading">Methodology</h4>



<p>In blue-green deployment, the two identical environments consist of the same database, containers, virtual machines, configuration, et cetera. Keep in mind that setting up an environment can be expensive, so usually some components, like the database, are shared between the two.</p>



<p>The blue environment, which contains the original model, stays live and keeps serving requests, while the green environment acts as a staging environment for the new version of the model. The new version is subjected to deployment and final-stage testing against real data to ensure that it performs well and is ready for production. Once testing is successfully completed and all bugs and issues are rectified, the new model is made live.</p>



<p>Once the new model is live, traffic is diverted from the blue environment to the green environment. In most cases, the blue environment serves as a backup: if something goes wrong, requests can be rerouted to the blue model. </p>
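<p>Conceptually, the cut-over is just a traffic switch, with blue kept warm as a fallback. A toy sketch (the environment callables are stand-ins for the two deployed model services):</p>

```python
class BlueGreenRouter:
    """Illustrative traffic switch between two identical environments."""

    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}
        self.active = "blue"  # blue serves live traffic initially

    def handle(self, request):
        return self.envs[self.active](request)

    def cut_over(self):
        """Divert all traffic to green once it passes staging tests."""
        self.active = "green"

    def rollback(self):
        """Blue stays warm, so rollback is a one-line switch."""
        self.active = "blue"

# stand-ins for the blue (v1) and green (v2) model services
router = BlueGreenRouter(blue=lambda r: "v1:" + r, green=lambda r: "v2:" + r)
```

<p>Because both environments stay running, <code>cut_over()</code> and <code>rollback()</code> change only which one receives traffic, which is what makes blue-green rollbacks nearly instant.</p>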



<p><strong>Pros:</strong></p>



<ul class="wp-block-list">
<li>It ensures application availability round the clock.</li>



<li>Rollbacks are easy because you can quickly divert the traffic to the blue environment in case of any issues.&nbsp;</li>



<li>Since both environments are independent of each other, deployment risk is less.</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>It is expensive, since both models require separate environments.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>If your application cannot afford downtime, you should use the blue-green deployment strategy.</p>



<h3 class="wp-block-heading" id="h-5-canary-deployment-strategy">5. Canary deployment strategy</h3>



<p>The canary deployment aims to deploy the new version of the model by gradually increasing the number of users. Unlike the previous strategies that we’ve seen where the new model is either hidden from the public or a small control group is set up, the canary deployment strategy uses the real users to test the new model. As a result, bugs and issues can be detected before the model is deployed globally for all the users.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://neptune.ai/model-deployment-strategies_3" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_3.png?resize=818%2C593&#038;ssl=1" alt="Canary deployment strategy" class="wp-image-63171" width="818" height="593"/></a><figcaption class="wp-element-caption"><em>Canary deployment strategy | <a href="https://cloud.google.com/architecture/application-deployment-and-testing-strategies#canary_test_pattern" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Methodology</h4>



<p>As in other deployment strategies, in canary deployment the new model is tested alongside the current live model, but here the new model is tested on a few users to check its reliability, performance, errors, et cetera.</p>



<p>The number of users can be increased or decreased based on the testing requirements. If the model succeeds in the testing phase, it is rolled out; if not, it can be rolled back with no downtime, and only a small number of users will have been exposed to the new model.</p>



<p>Canary deployment strategy can be broken down into three steps:</p>



<div id="case-study-numbered-list-block_4af1af7593739bf9cb18211da187d327"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Design a new model and route a small sample of users&#8217; requests to the new model.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Check for bugs, efficiency, reports, and issues in the new model, if found then perform a rollback.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Repeat steps one and two until all errors and issues are resolved, before routing all traffic to the new model.             </li>
            </ul>
</div>
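<p>The gradual rollout in step one is often implemented by hashing a stable user ID into a bucket, so the same user consistently sees the same version. A self-contained sketch of that routing logic:</p>

```python
import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into the canary cohort."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent  # e.g. percent=5 exposes roughly 5% of users

def route(user_id, live_model, canary_model, percent):
    """Send a stable slice of traffic to the canary, the rest to live."""
    model = canary_model if in_canary(user_id, percent) else live_model
    return model(user_id)
```

<p>Raising <code>percent</code> only ever adds users to the canary cohort (a bucket below 10 is also below 50), so rollout and rollback amount to adjusting a single number.</p>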



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>Cheaper than blue-green deployment.</li>



<li>Easy to test the new model against real data.</li>



<li>Zero downtime.&nbsp;</li>



<li>In case of failure, the model could be easily rolled back to the current version.</li>
</ul>



<p><strong>Cons:</strong></p>



<ul class="wp-block-list">
<li>Rollouts are easy but slow.</li>



<li>Since testing takes place against real data with a few users, proper monitoring must be in place so that, in case of failure, users are effectively routed back to the live version.</li>
</ul>



<p><strong>When to use it?</strong></p>



<p>The canary deployment strategy should be used when the model is to be evaluated against real-world, real-time data. It also has an advantage over A/B testing: where gathering enough user data to reach a statistically significant result can take a long time, canary deployment can do it in hours.</p>



<h3 class="wp-block-heading" id="h-6-other-model-deployment-strategies-and-techniques">6. Other model deployment strategies and techniques</h3>



<h4 class="wp-block-heading">Feature flag&nbsp;</h4>



<p>A feature flag is a technique, rather than a strategy, that allows developers to push or integrate code into the main branch while keeping the feature dormant until it is ready. This lets developers collaborate on different ideas and iterations. Once the feature is finalized, it can be activated and deployed.</p>
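<p>In code, a feature flag is just a guarded branch. A minimal illustration (the in-memory flag store and flag name here are hypothetical; real systems read flags from a config service or database so they can be flipped without a redeploy):</p>

```python
# Hypothetical in-memory flag store -- production systems typically read
# flags from a config service so they can flip without redeploying.
FLAGS = {"new_ranking_model": False}

def rank(items):
    if FLAGS.get("new_ranking_model"):
        return sorted(items, reverse=True)  # new code path, shipped dormant
    return sorted(items)                    # current live behaviour
```

<p>The new path is merged and deployed but stays dark until the flag is flipped, at which point it activates without any redeployment.</p>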


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_1" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_1.png?resize=793%2C310&#038;ssl=1" alt="Feature flag" class="wp-image-63173" width="793" height="310"/></a><figcaption class="wp-element-caption"><em>Feature flag&nbsp;| <a href="https://semaphoreci.com/blog/feature-flags" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>As mentioned earlier, a feature flag is a technique, so it can be used in combination with any of the deployment strategies above.</p>



<h4 class="wp-block-heading">Rolling deployment</h4>



<p>Rolling deployment is a strategy that gradually updates and replaces the older version of the model. The deployment occurs on running instances; it does not involve a staging environment or even private development.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_2.png?resize=673%2C523&#038;ssl=1" alt="Rolling deployment" class="wp-image-63172" width="673" height="523"/><figcaption class="wp-element-caption"><em>Rolling deployment | <a href="https://medium.com/@codefresh/continuous-deployment-strategies-with-kubernetes-c02323789a28" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows how rolling deployment works. Note that the service is horizontally scaled; this is the key factor.</p>



<p>The top-left panel shows three instances running version 1.1. In the next step, version 1.2 is deployed: with each new instance of version 1.2 that comes up, one instance of version 1.1 is retired. The same pattern continues for the remaining instances, i.e., whenever a new instance is deployed, an older one is retired.</p>
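<p>The replacement loop described above can be sketched as follows: each step starts one new instance and retires one old one, so serving capacity never drops to zero. The version strings are illustrative:</p>

```python
def rolling_update(instances, new_version):
    """Replace instances one at a time; capacity never drops to zero."""
    for i in range(len(instances)):
        instances[i] = new_version  # one new instance comes up...
        yield list(instances)       # ...while the rest keep serving

fleet = ["v1.1", "v1.1", "v1.1"]
states = list(rolling_update(fleet, "v1.2"))
# intermediate states mix v1.1 and v1.2; the final state is all v1.2
```

<p>In practice an orchestrator such as Kubernetes performs this loop, adding health checks before each old instance is retired.</p>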



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>It is faster than blue-green deployment because there are no environment restrictions.</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>Although it is quicker, rollbacks can be difficult if further updates fail.</li>
</ul>



<h4 class="wp-block-heading">Recreate strategy</h4>



<p>Recreate is a simple strategy in which the live version of the model is shut down and the new version is deployed in its place.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_7.png?resize=604%2C504&#038;ssl=1" alt="Recreate strategy" class="wp-image-63167" width="604" height="504"/><figcaption class="wp-element-caption"><em>Recreate strategy | <a href="https://www.weave.works/blog/kubernetes-deployment-strategies" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above depicts how the recreate strategy works: the old instances (V1) are shut down and discarded, while the new instances (V2) are deployed. </p>



<p><strong>Pros</strong>:</p>



<ul class="wp-block-list">
<li>Easy and simple set-up.</li>



<li>The entire environment is completely renewed.</li>
</ul>



<p><strong>Cons</strong>:</p>



<ul class="wp-block-list">
<li>Negative impact on users, since the strategy incurs downtime as well as rebooting.</li>
</ul>



<h2 class="wp-block-heading" id="h-comparison-which-model-release-strategy-to-use">Comparison: which model release strategy to use?</h2>



<p>There are various metrics one can use to determine which strategy suits a project best, but the choice mostly depends on project complexity and resource availability. The following comparison table gives some idea of when to use which strategy.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://neptune.ai/model-deployment-strategies_9" target="_blank" rel="noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Model-Deployment-Strategies_9.png?resize=870%2C672&#038;ssl=1" alt="Model Release (Deployment) Strategies" class="wp-image-63165" width="870" height="672"/></a><figcaption class="wp-element-caption"><em>Model release (deployment) strategies | <a href="https://cloud.google.com/architecture/application-deployment-and-testing-strategies" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-key-takeaways">Key takeaways</h2>



<p>Deployment strategies help data scientists figure out how their model performs in a given situation. A good strategy depends on the type of product and the users it targets. To sum up, here are the points to keep in mind:</p>



<ul class="wp-block-list">
<li>If you want the model to be tested on real-world data, then shadow evaluation or something similar should be considered. Unlike strategies that expose a sample of users, shadow evaluation uses live, real user requests without affecting them.</li>



<li>Check the complexity of the task, if the model requires simple or minor tweaks then A/B testing is the way to go.&nbsp;</li>



<li>If time is constrained and there are many ideas to test, opt for multi-armed bandits, since they give the best results in such a situation.</li>



<li>If your model is complex and needs proper monitoring before deployment, the blue-green strategy will help you analyse and monitor it.</li>



<li>If you want no downtime and are okay with exposing your model to the public, opt for canary deployment.</li>



<li>Rolling deployment should be used when you want to deploy the new version of the model gradually.</li>
</ul>



<p>Hope you enjoyed reading this article. If you want to read more about this topic, refer to the attached resources. Keep learning!</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://www.analyticsvidhya.com/blog/2020/10/ab-testing-data-science/" target="_blank" rel="noreferrer noopener nofollow">A/B Testing for Data Science using Python – A Must-Read Guide for Data Scientists</a></li>



<li><a href="https://christophergs.com/machine%20learning/2019/03/30/deploying-machine-learning-applications-in-shadow-mode/" target="_blank" rel="noreferrer noopener nofollow">Deploying Machine Learning Models in Shadow Mode</a></li>



<li><a href="https://towardsdatascience.com/the-machine-learning-lifecycle-in-2021-473717c633bc" target="_blank" rel="noreferrer noopener nofollow">The Machine Learning Lifecycle in 2021</a></li>



<li><a href="https://www.analyticsvidhya.com/blog/2021/05/machine-learning-life-cycle-explained/" target="_blank" rel="noreferrer noopener nofollow">Machine Learning Life-cycle Explained!</a></li>



<li><a href="https://towardsdatascience.com/automatic-canary-releases-for-machine-learning-models-38874a756f87" target="_blank" rel="noreferrer noopener nofollow">Automatic Canary Releases for Machine Learning Models</a></li>



<li><a href="https://harness.io/blog/blue-green-canary-deployment-strategies/" target="_blank" rel="noreferrer noopener nofollow">Intro To Deployment Strategies: Blue-Green, Canary, And More</a></li>



<li><a href="https://www.alessandroai.com/strategies-to-deploy-your-machine-learning-models/" target="_blank" rel="noreferrer noopener nofollow">Strategies To Deploy Your Machine Learning Models</a></li>



<li><a href="https://www.data4v.com/machine-learning-deployment-strategies/" target="_blank" rel="noreferrer noopener nofollow">Machine Learning Deployment Strategies</a></li>



<li><a href="https://www.opsmx.com/blog/blue-green-deployment/" target="_blank" rel="noreferrer noopener nofollow">What is Blue Green Deployment ?</a></li>



<li><a href="https://towardsdatascience.com/safely-rolling-out-ml-models-to-production-13e0b8211a2f" target="_blank" rel="noreferrer noopener nofollow">Safely Rolling Out ML Models To Production</a></li>



<li><a href="https://vwo.com/blog/multi-armed-bandit-algorithm/" target="_blank" rel="noreferrer noopener nofollow">Minimize Your A/B Test Losses Due to Low-Performing Variations</a></li>



<li><a href="https://mlinproduction.com/deploying-machine-learning-models/" target="_blank" rel="noreferrer noopener nofollow">The Ultimate Guide to Deploying Machine Learning Models</a></li>



<li><a href="https://alexgude.com/blog/machine-learning-deployment-shadow-mode/" target="_blank" rel="noreferrer noopener nofollow">Machine Learning Deployment: Shadow Mode</a></li>



<li><a href="https://www.optimizely.com/optimization-glossary/multi-armed-bandit/" target="_blank" rel="noreferrer noopener nofollow">Multi-armed bandit</a></li>



<li><a href="https://splitmetrics.com/blog/sequential-ab-testing-vs-multi-armed-bandit/" target="_blank" rel="noreferrer noopener nofollow">Sequential A/B Testing vs Multi-Armed Bandit Testing</a></li>



<li><a href="https://semaphoreci.com/blog/blue-green-deployment" target="_blank" rel="noreferrer noopener nofollow">What Is Blue-Green Deployment?</a></li>



<li><a href="https://www.split.io/blog/canary-release-feature-flags/" target="_blank" rel="noreferrer noopener nofollow">Pros and Cons of Canary Release and Feature Flags in Continuous Delivery</a></li>



<li><a href="https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/rolling-deployments.html" target="_blank" rel="noreferrer noopener nofollow">Rolling Deployments</a></li>



<li><a href="https://www.cloudbees.com/blog/rolling-deployment" target="_blank" rel="noreferrer noopener nofollow">Rolling Deployment: What This Is and How it De-Risks Software Deploys</a></li>



<li><a href="https://www.linkedin.com/pulse/shadow-deployments-machine-learning-models-aws-carlos-lara" target="_blank" rel="noreferrer noopener nofollow">Shadow Deployments of Machine Learning Models in AWS</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/deploy-shadow-ml-models-in-amazon-sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Deploy shadow ML models in Amazon SageMaker</a></li>



<li><a href="https://mercari.github.io/ml-system-design-pattern/QA-patterns/Shadow-ab-test-pattern/design_en.html" target="_blank" rel="noreferrer noopener nofollow">Shadow AB test pattern</a></li>



<li><a href="https://www.arrikto.com/mlops-explained/" target="_blank" rel="noreferrer noopener nofollow">MLOps Explained</a></li>



<li><a href="/blog/mlops" target="_blank" rel="noreferrer noopener">MLOps: What It Is, Why It Matters, and How to Implement It</a></li>



<li><a href="https://aws.amazon.com/blogs/machine-learning/dynamic-a-b-testing-for-machine-learning-models-with-amazon-sagemaker-mlops-projects/" target="_blank" rel="noreferrer noopener nofollow">Dynamic A/B testing for machine learning models with Amazon SageMaker MLOps projects</a></li>



<li><a href="https://abhishek-maheshwarappa.medium.com/multi-arm-bandits-for-recommendations-and-a-b-testing-on-amazon-ratings-data-set-9f802f2c4073" target="_blank" rel="noreferrer noopener nofollow">Multi-Arm Bandits for recommendations and A/B testing on Amazon ratings data set</a></li>



<li><a href="https://www.blazemeter.com/shiftleft/automate-canary-testing-continuous-quality" target="_blank" rel="noreferrer noopener nofollow">Automate Canary Testing for Continuous Quality</a></li>



<li><a href="https://martinfowler.com/bliki/CanaryRelease.html" target="_blank" rel="noreferrer noopener nofollow">CanaryRelease</a></li>



<li><a href="https://www.cloudbolt.io/blog/what-is-best-kubernetes-deployment-strategy/" target="_blank" rel="noreferrer noopener nofollow">What Is the Best Kubernetes Deployment Strategy?</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6644</post-id>	</item>
		<item>
		<title>Pix2pix: Key Model Architecture Decisions</title>
		<link>https://neptune.ai/blog/pix2pix-key-model-architecture-decisions</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:37:34 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/pix2pix-key-model-architecture-decisions/</guid>

					<description><![CDATA[Generative Adversarial Networks or GANs is a type of neural network that belongs to the class of unsupervised learning models. It is used for the task of deep generative modeling.&#160; In deep generative modeling, deep neural networks learn a probability distribution over a given set of data points and generate similar ones. Since it is&#8230;]]></description>
										<content:encoded><![CDATA[
<p><a href="/blog/generative-adversarial-networks-gan-applications" target="_blank" rel="noreferrer noopener">Generative Adversarial Networks (GANs)</a> are a class of neural networks that belong to the family of unsupervised learning models. They are used for deep generative modeling.</p>



<p>In deep generative modeling, deep neural networks learn a probability distribution over a given set of data points and generate similar ones. Since it is an unsupervised learning task, it uses no labels during the learning process.&nbsp;</p>



<p>Since their introduction in 2014, the deep learning community has been actively developing new GANs to advance the field of generative modeling. This article provides an overview of GANs, focusing on Pix2Pix, one of the most widely used generative models.</p>



<h2 class="wp-block-heading" id="h-what-is-gan">What is GAN?</h2>



<p>The GAN was designed by Ian Goodfellow in 2014. Its main goal was to generate samples that were not blurry and had rich feature representations. Discriminative models were doing well on this front, as they could distinguish between different classes. Deep generative models, on the other hand, were far less effective due to the difficulty of approximating many intractable probabilistic computations, a problem quite evident in autoencoders.</p>



<p>Autoencoders and their variants are explicit likelihood models, meaning they explicitly compute the probability density function over a given distribution. GANs and their variants are implicit likelihood models, which means they don’t compute the probability density function but rather learn the underlying distribution.&nbsp;</p>



<p>GANs learn the underlying distribution by framing the whole problem as binary classification. In this approach, the problem is represented by two models: a generator and a discriminator. The generator&#8217;s job is to generate new samples, and the discriminator&#8217;s job is to classify whether a sample produced by the generator is real or fake.</p>



<p>The two models are trained together in a zero-sum game until the generator can produce samples that are similar to the real samples. In other words, they are trained until the generator can fool the discriminator.&nbsp;</p>



<h3 class="wp-block-heading" id="h-architecture-of-a-vanilla-gan">Architecture of a vanilla GAN</h3>



<p>Let’s briefly walk through the architecture of a GAN. From this section onward, most topics will be explained using code, so to begin, let’s install and import all the required dependencies:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install torch torchvision matplotlib opencv-python numpy</pre></code></pre>
</div>





<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> nn
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.optim <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> optim
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torchvision
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torchvision.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> datasets
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> DataLoader
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torchvision.transforms <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> transforms</pre></code></pre>
</div>




<h4 class="wp-block-heading">Generator</h4>



<p>The generator is the component of a GAN that takes in noise, typically sampled from a Gaussian distribution, and yields samples similar to the original dataset. As GANs have evolved over the years, they have adopted CNNs, which are prominent in computer vision tasks. But for simplicity, we will define the generator with just linear layers using PyTorch.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Generator</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, z_dim, img_dim)</span>:</span>
       super().__init__()
       self.gen = nn.Sequential(
           nn.Linear(z_dim, <span class="hljs-number" style="color: teal;">256</span>),
           nn.LeakyReLU(<span class="hljs-number" style="color: teal;">0.01</span>),
           nn.Linear(<span class="hljs-number" style="color: teal;">256</span>, img_dim),
           nn.Tanh(),  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># normalize inputs to [-1, 1] to make outputs [-1, 1]</span>
       )

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.gen(x)</pre></code></pre>
</div>




<h4 class="wp-block-heading">Discriminator</h4>



<p>The discriminator is simply a classifier that determines whether the data yielded by the generator is real or fake. It does this by learning the original distribution from the real data and evaluating generated samples against it. We will keep things simple and define the discriminator using linear layers as well.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Discriminator</span><span class="hljs-params">(nn.Module)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, in_features)</span>:</span>
       super().__init__()
       self.disc = nn.Sequential(
           nn.Linear(in_features, <span class="hljs-number" style="color: teal;">128</span>),
           nn.LeakyReLU(<span class="hljs-number" style="color: teal;">0.01</span>),
           nn.Linear(<span class="hljs-number" style="color: teal;">128</span>, <span class="hljs-number" style="color: teal;">1</span>),
           nn.Sigmoid(),
       )

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.disc(x)</pre></code></pre>
</div>




<p>The key difference between the generator and the discriminator is the last layer. The former yields an output with the same shape as the image, while the latter yields a single value between 0 and 1, representing the probability that the input is real.</p>
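To make this difference concrete, we can pass a batch through both networks and inspect the output shapes. This quick sanity check (not part of the original article) rebuilds the same layer stacks inline:

```python
import torch
import torch.nn as nn

z_dim, img_dim = 64, 784  # same sizes used later in the article (28 * 28 * 1 = 784)

# Same layer stacks as the Generator and Discriminator classes above
gen = nn.Sequential(nn.Linear(z_dim, 256), nn.LeakyReLU(0.01),
                    nn.Linear(256, img_dim), nn.Tanh())
disc = nn.Sequential(nn.Linear(img_dim, 128), nn.LeakyReLU(0.01),
                     nn.Linear(128, 1), nn.Sigmoid())

noise = torch.randn(8, z_dim)  # a batch of 8 noise vectors
fake = gen(noise)
print(fake.shape)        # torch.Size([8, 784]) -- image-shaped output in [-1, 1]
print(disc(fake).shape)  # torch.Size([8, 1])   -- one probability per sample
```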



<h4 class="wp-block-heading">Loss function and training</h4>



<p>The loss function is one of the most important components of any deep learning algorithm. For instance, if we design a CNN to minimize the Euclidean distance between the ground truth and the predicted results, it will tend to produce blurry outputs. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring.</p>
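We can see this averaging effect with a tiny calculation: if two pixel values are equally plausible (say, a sharp edge that is either dark or bright), the single prediction that minimizes the expected squared error is their mean, i.e. a blurry in-between value. A plain-Python sketch:

```python
# Two equally plausible "ground truth" pixel values (a sharp edge: dark or bright)
targets = [0.0, 1.0]

def expected_l2(pred):
    """Expected squared (Euclidean) error of one prediction against both targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Scan candidate predictions on a grid; the minimizer is the mean of the targets
best = min((p / 100 for p in range(101)), key=expected_l2)
print(best)  # 0.5 -- the blurry average, not either sharp value
```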


    <a
        href="/blog/gan-loss-functions"
        id="cta-box-related-link-block_bd0bfc2c338431b26859efcdba632331"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-understanding-gan-loss-functions">Understanding GAN Loss Functions</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>The above point is an important one to keep in mind. That said, the loss function we will use for the vanilla GAN is binary cross-entropy loss (BCELoss), because we are performing binary classification.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">criterion = nn.BCELoss()</pre></code></pre>
</div>
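Under the hood, binary cross-entropy for a prediction p and label y is −[y·ln p + (1 − y)·ln(1 − p)]. A quick plain-Python check (independent of PyTorch) shows how it rewards confident correct predictions and heavily penalizes confident wrong ones:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(0.9, 1), 4))  # 0.1054 -- confident and correct: small loss
print(round(bce(0.1, 1), 4))  # 2.3026 -- confident and wrong: large loss
```

`nn.BCELoss` computes the same quantity, averaged over the batch by default.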




<p>Now let’s define the optimization method and other related parameters:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define hyperparameters</span>
device = <span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> torch.cuda.is_available() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>
lr = <span class="hljs-number" style="color: teal;">3e-4</span>
z_dim = <span class="hljs-number" style="color: teal;">64</span>
image_dim = <span class="hljs-number" style="color: teal;">28</span> * <span class="hljs-number" style="color: teal;">28</span> * <span class="hljs-number" style="color: teal;">1</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 784</span>
batch_size = <span class="hljs-number" style="color: teal;">32</span>
num_epochs = <span class="hljs-number" style="color: teal;">100</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Initialize Generator and Discriminator</span>
gen = Generator(z_dim, image_dim).to(device)
disc = Discriminator(image_dim).to(device)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Set up optimizers</span>
opt_disc = optim.Adam(disc.parameters(), lr=lr)
opt_gen = optim.Adam(gen.parameters(), lr=lr)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Prepare dataset and dataloader</span>
transform = transforms.Compose([
   transforms.ToTensor(),
   transforms.Normalize((<span class="hljs-number" style="color: teal;">0.5</span>,), (<span class="hljs-number" style="color: teal;">0.5</span>,))
])
dataset = datasets.MNIST(root=<span class="hljs-string" style="color: rgb(221, 17, 68);">"dataset/"</span>, transform=transform, download=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Set up TensorBoard writers</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.tensorboard <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SummaryWriter
writer_fake = SummaryWriter(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"logs/fake"</span>)
writer_real = SummaryWriter(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"logs/real"</span>)
step = <span class="hljs-number" style="color: teal;">0</span></pre></code></pre>
</div>




<p>Let’s understand the training loop. The training loop of GAN starts with:</p>



<ol class="wp-block-list">
<li>Sampling noise from a Gaussian distribution and passing it through the generator to produce fake samples</li>



<li>Training the discriminator using real data and fake data produced by the generator</li>



<li>Updating the discriminator</li>



<li>Updating the generator</li>
</ol>



<p>Here’s what the training loop looks like:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> epoch <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(num_epochs):
  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Loop over each batch of data</span>
  <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> batch_idx, (real, _) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> enumerate(loader):
      real = real.view(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">784</span>).to(device)
      batch_size = real.shape[<span class="hljs-number" style="color: teal;">0</span>]
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">### Train Discriminator: max log(D(x)) + log(1 - D(G(z)))</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Generate random noise as input to the generator</span>
      noise = torch.randn(batch_size, z_dim).to(device)
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Produce fake images from the generator using the random noise</span>
      fake = gen(noise)
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Get discriminator's predictions on real images</span>
      disc_real = disc(real).view(<span class="hljs-number" style="color: teal;">-1</span>)
      lossD_real = criterion(disc_real, torch.ones_like(disc_real))
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Get discriminator's predictions on fake images</span>
      disc_fake = disc(fake).view(<span class="hljs-number" style="color: teal;">-1</span>)
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Calculate discriminator's loss on fake images</span>
      lossD_fake = criterion(disc_fake, torch.zeros_like(disc_fake))
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Combine real and fake loss to get total discriminator loss</span>
      lossD = (lossD_real + lossD_fake) / <span class="hljs-number" style="color: teal;">2</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Backpropagation and optimization step for discriminator</span>
      disc.zero_grad()
      lossD.backward(retain_graph=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
      opt_disc.step()
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">### Train Generator: min log(1 - D(G(z))) &lt;-&gt; max log(D(G(z))</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># where the second option of maximizing doesn't suffer from</span>
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># saturating gradients</span>
      output = disc(fake).view(<span class="hljs-number" style="color: teal;">-1</span>)
      lossG = criterion(output, torch.ones_like(output))
      <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Backpropagation and optimization step for generator</span>
      gen.zero_grad()
      lossG.backward()
      opt_gen.step()
      <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> batch_idx == <span class="hljs-number" style="color: teal;">0</span>:
          print(
              f<span class="hljs-string" style="color: rgb(221, 17, 68);">"""Epoch [{epoch}/{num_epochs}] Batch {batch_idx}/{len(loader)}
                    Loss D: {lossD:.4f}, loss G: {lossG:.4f}"""</span>
          )
          <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> torch.no_grad():
              <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define fixed_noise once (lazily) so the logged samples stay comparable across epochs</span>
              <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">'fixed_noise'</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> locals():
                  fixed_noise = torch.randn(<span class="hljs-number" style="color: teal;">64</span>, z_dim, device=device)
              fake = gen(fixed_noise).reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">28</span>, <span class="hljs-number" style="color: teal;">28</span>)
              data = real.reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">28</span>, <span class="hljs-number" style="color: teal;">28</span>)
              img_grid_fake = torchvision.utils.make_grid(fake, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
              img_grid_real = torchvision.utils.make_grid(data, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
              writer_fake.add_image(
                  <span class="hljs-string" style="color: rgb(221, 17, 68);">"Mnist Fake Images"</span>, img_grid_fake, global_step=step
              )
              writer_real.add_image(
                  <span class="hljs-string" style="color: rgb(221, 17, 68);">"Mnist Real Images"</span>, img_grid_real, global_step=step
              )
              step += <span class="hljs-number" style="color: teal;">1</span></pre></code></pre>
</div>




<p>There are some important considerations from the loop above:</p>



<ol class="wp-block-list">
<li>The loss function for the discriminator is calculated twice: one for real images and another for fake images.
<ul class="wp-block-list">
<li>For real images, the ground truth is set to ones using the <span class="c-code-snippet">torch.ones_like</span> function, which returns a matrix of ones of a defined shape.</li>



<li>For fake images, the ground truth is set to zeros using the <span class="c-code-snippet">torch.zeros_like</span> function, which returns a matrix of zeros of a defined shape.</li>
</ul>
</li>



<li>The loss function for the generator is calculated only once. If you observe carefully, it is the same loss function that is used by the discriminator to calculate the loss for fake images. The only difference is that instead of using the <span class="c-code-snippet">torch.zeros_like</span> function, <span class="c-code-snippet">torch.ones_like</span> function is used. The interchanging of labels from 0 to 1 enables the generator to learn representations that will produce real images, therefore fooling the discriminator.&nbsp;</li>
</ol>
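The effect of this label flip is visible directly in the loss values: for a discriminator score p = D(G(z)) on a fake image, the discriminator's fake-image loss −ln(1 − p) grows as p → 1, while the generator's flipped loss −ln(p) shrinks, so gradient descent on the generator pushes p toward 1, i.e. toward fooling the discriminator. A plain-Python illustration:

```python
import math

for p in (0.1, 0.5, 0.9):  # discriminator's score for a fake image, D(G(z))
    disc_fake_loss = -math.log(1 - p)  # target 0 (zeros_like): low when D spots the fake
    gen_loss = -math.log(p)            # target 1 (ones_like): low when D is fooled
    print(f"D(G(z))={p}: disc fake loss={disc_fake_loss:.3f}, gen loss={gen_loss:.3f}")
```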



<p>Mathematically, we can define the whole process as:</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_2.png?ssl=1" alt="This equation represents the objective function of a Generative Adversarial Network (GAN). In a GAN, the generator G aims to produce realistic data to fool the discriminator D, while the discriminator tries to distinguish between real data x and generated data G(z), where z is random noise. The discriminator D maximizes its ability to classify real vs. generated samples, while the generator G minimizes the discriminator's success, creating a minimax game. The function V(D, G) is optimized by minimizing over G and maximizing over D." class="wp-image-57303" style="width:866px;height:92px"/><figcaption class="wp-element-caption">This equation represents the objective function of a Generative Adversarial Network (GAN). In a GAN, the generator G aims to produce realistic data to fool the discriminator D, while the discriminator tries to distinguish between real data x and generated data G(z), where z is random noise. The discriminator D maximizes its ability to classify real vs. generated samples, while the generator G minimizes the discriminator’s success, creating a minimax game. The function V(D, G) is optimized by minimizing over G and maximizing over D.</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-application-of-gans">Application of GANs</h3>



<p>GANs are widely used for:</p>



<ul class="wp-block-list">
<li><strong>Generating training samples:</strong> GANs are often used to generate samples for specific tasks, such as classifying malignant and benign cancer cells, especially where data is too scarce to train a classifier.&nbsp;</li>



<li><strong>AI Art or Generative Art:</strong> AI or generative art is another new domain where GANs are extensively used. Since the introduction of non-fungible tokens, artists all over the world have been creating art in unorthodox, digital, and generative forms. Tools like DeepDaze, BigSleep, BigGAN, and VQGAN (often guided by CLIP) are among the most commonly used by creators.&nbsp;</li>
</ul>



<div id="separator-block_af6a6e357361905e62153786a946eed4"
         class="block-separator block-separator--0">
</div>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_4.jpg?ssl=1" alt="AI Art or Generative Art: This figure has been generated with AI" class="wp-image-57301"/><figcaption class="wp-element-caption"><em>AI Art or Generative Art: This figure has been generated with AI | Source: Author</em></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Image-to-image translation:</strong> The idea here is to translate a certain type of image into an image in the target domain. For example, a daylight image into a night image, or a winter image into a summer image (see the image below). GANs like Pix2Pix, CycleGAN, and StyleGAN are a few of the most popular choices among digital creators.</li>
</ul>



<div id="separator-block_af6a6e357361905e62153786a946eed4"
         class="block-separator block-separator--0">
</div>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_10.png?ssl=1" alt="Image-to-image translation example. To the left, the original image (a car in a road during winter). To the right, the same image in the summer, generated by AI" class="wp-image-57295" style="aspect-ratio:2.3529411764705883;width:834px;height:auto"/><figcaption class="wp-element-caption">Image-to-image translation example. To the left, the original image (a car in a road during winter). To the right, the same image in the summer, generated by AI | <a href="https://research.nvidia.com/publication/2017-12_Unsupervised-Image-to-Image-Translation" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Text-to-image translation: </strong>Text-to-image translation is simply converting a text prompt or a given string into an image. This is a very popular domain at the moment with a growing community. As mentioned previously, tools such as DeepDaze, BigSleep, and OpenAI&#8217;s DALL·E are the most common choices for this.</li>
</ul>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_6.png?ssl=1" alt="Text-to-image translation. The prompt “an armchair in the shape of an avocado” turns into images of the desired chair in different angles. " class="wp-image-57299" style="width:810px;height:auto"/><figcaption class="wp-element-caption">Text-to-image translation. The prompt “an armchair in the shape of an avocado” turns into images of the desired chair in different angles. | <a href="https://openai.com/blog/dall-e/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-issues-with-gans">Issues with GANs</h3>



<p>Although GANs can turn samples from a random Gaussian distribution into images that resemble real ones, the process is imperfect most of the time. Here&#8217;s why:</p>



<ul class="wp-block-list">
<li>Mode collapse: this occurs when the generator manages to fool the discriminator while producing samples from only a small subset of the data distribution. Because of mode collapse, the GAN cannot learn a wide variety of modes and remains limited to a few.&nbsp;</li>



<li>Diminished gradient: a diminished or vanishing gradient occurs when the derivatives flowing back through the network are so small that the updates to the weights become almost negligible. To overcome this issue, Wasserstein GANs (WGANs for short) are recommended.&nbsp;</li>



<li>Non-convergence: this occurs when the network is unable to settle into a stable equilibrium. It results from unstable training, and it can be tackled with <a href="https://medium.com/perceptronai/review-spectral-normalization-for-gans-fa97cd2363c4" target="_blank" rel="noreferrer noopener nofollow">spectral normalization</a>.</li>
</ul>


    <a
        href="/blog/vanishing-and-exploding-gradients-debugging-monitoring-fixing"
        id="cta-box-related-link-block_75a418d4ca6ad5c73f9d20fd96274a28"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-vanishing-and-exploding-gradients-in-neural-network-models-debugging-monitoring-and-fixing">                Vanishing and Exploding Gradients in Neural Network Models: Debugging, Monitoring, and Fixing            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-variations-of-gan">Variations of GAN</h3>



<p>Since the release of the first GAN, many variants have been proposed. Below are some of the most popular GAN variants and related generative models:</p>



<ul class="wp-block-list">
<li>CycleGAN</li>



<li>StyleGAN</li>



<li>PixelRNN</li>



<li>Text2image</li>



<li>DiscoGAN</li>



<li>IsGAN</li>
</ul>


    <a
        href="/blog/6-gan-architectures"
        id="cta-box-related-link-block_e9e7b3dff303bdc43fa000a25cfba395"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-gan-architectures-you-really-should-know">                GAN Architectures You Really Should Know            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>This article focuses solely on the Pix2Pix GAN. In the following sections, we will go through its key components, such as the architecture and the loss function.&nbsp;</p>



<h2 class="wp-block-heading" id="h-what-is-the-pix2pix-gan">What is the Pix2Pix GAN?</h2>



<p>Pix2Pix GAN is a conditional GAN (<a href="https://golden.com/wiki/Conditional_generative_adversarial_network_(cGAN)" target="_blank" rel="noreferrer noopener nofollow">cGAN</a>) that was developed by <a href="http://web.mit.edu/phillipi/" target="_blank" rel="noreferrer noopener nofollow">Phillip Isola</a>, et al. Unlike a vanilla GAN, which uses only real data and noise to learn and generate images, a cGAN uses real data and noise as well as labels to generate images.&nbsp;</p>



<p>In essence, the generator learns the mapping from the real data as well as the noise.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_16.png?ssl=1" alt="What Is the Pix2Pix GAN?" class="wp-image-57289" style="aspect-ratio:4.42;width:257px;height:auto"/></figure>
</div>


<p>The generator G combines the learnt real data x and the random noise z to output y, which is the fake data.&nbsp;</p>



<p>Similarly, the discriminator not only learns from the “real data” example it has seen, but also from the labels that help it understand what is real and what is fake.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_1.png?ssl=1" alt="What Is the Pix2Pix GAN?" class="wp-image-57304" style="aspect-ratio:2.1666666666666665;width:129px;height:auto"/></figure>
</div>


<p>The discriminator, then, uses two sources of information to improve its ability to tell real from fake: x (the real data) and y (the label saying &#8220;real&#8221; or &#8220;fake&#8221;).</p>



<p>This setting makes cGAN suitable for image-to-image translation tasks, where the generator is conditioned on an input image to generate the corresponding output image. In other words, the generator uses a conditioning distribution (or data) as a guide or blueprint to generate a target image (see the image below).&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_21.png?ssl=1" alt="The model generates realistic building facades (right column) based on input segmentation maps (left column), with comparisons to the actual ground truth images (center column)" class="wp-image-57284" style="width:836px;height:auto"/><figcaption class="wp-element-caption"><em>The model generates realistic building facades (right column) based on input segmentation maps (left column), with comparisons to the actual ground truth images (center column) | Source: Author</em></figcaption></figure>
</div>

<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_20.png?ssl=1" alt="Applications of Pix2Pix, a type of conditional GANs
" class="wp-image-57285" style="width:836px;height:auto"/><figcaption class="wp-element-caption"><em>Applications of Pix2Pix, a type of conditional GANs | <a href="https://phillipi.github.io/pix2pix/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The core idea of Pix2Pix relies on the dataset provided for training: it performs paired image-to-image translation, with training examples {x, y} that have a direct correspondence between them.&nbsp;</p>
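<p>As an illustrative sketch of this conditioning (the class name and layer sizes are our own, not the exact Pix2Pix discriminator), the condition image and the candidate image can be concatenated along the channel axis before being scored:</p>

```python
import torch
import torch.nn as nn

# Minimal illustration of conditioning: the condition image x and the
# candidate image y are concatenated along the channel axis, so the
# discriminator judges "real or fake" with respect to the condition.
class CondDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels * 2, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, condition, image):
        return self.net(torch.cat([condition, image], dim=1))
```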



<h2 class="wp-block-heading" id="h-pix2pix-network-architectures">Pix2Pix network architectures</h2>



<p>Pix2Pix has two important architectures, one for the generator and the other for the discriminator: U-Net and PatchGAN, respectively. Let&#8217;s explore both of them in more detail.&nbsp;</p>



<h3 class="wp-block-heading" id="h-u-net-generator">U-Net generator&nbsp;</h3>



<p>As mentioned before, the generator architecture used in Pix2Pix is called U-Net. U-Net was originally developed for biomedical image segmentation by Ronneberger et al. in 2015.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_8.png?ssl=1" alt="U-Net generator:  A symmetric encoder-decoder structure with down-sampling through max pooling (red arrows) and up-sampling via transposed convolutions (green arrows). Skip connections (gray arrows) connect layers of matching spatial dimensions in the encoder and decoder, preserving spatial information for segmentation in the output map. " class="wp-image-57297" style="aspect-ratio:1.5107913669064748;width:840px;height:auto"/><figcaption class="wp-element-caption"><em>U-Net generator:&nbsp; A symmetric encoder-decoder structure with down-sampling through max pooling (red arrows) and up-sampling via transposed convolutions (green arrows). Skip connections (gray arrows) connect layers of matching spatial dimensions in the encoder and decoder, preserving spatial information for segmentation in the output map. | <a href="https://arxiv.org/pdf/1505.04597.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>U-Net consists of two major parts:&nbsp;</p>



<ol class="wp-block-list">
<li>A contracting path made up of convolutional layers (left side) which downsamples the data while extracting information.&nbsp;</li>



<li>An expansive path made up of transpose convolution layers (right side) which upsamples the information.&nbsp;</li>
</ol>



<p>Let’s say our downsampling has three convolutional layers C_l(1,2,3), then we have to make sure that our upsampling has three transpose convolutional layers C_u(1,2,3). This is because we want to connect the corresponding blocks of the same sizes using a skip connection.&nbsp;</p>



<div id="separator-block_75de0d8bd0c4b6baa838bb6c64057f86"
         class="block-separator block-separator--5">
</div>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_3.png?ssl=1" alt="Skip connection architecture: This diagram illustrates the use of skip layers between encoder (C_l1, C_l2, C_l3) and decoder (C_u1, C_u2, C_u3) blocks, with a bottleneck in the center to keep the feature dimensions at each stage. This retains the spatial details across the network" class="wp-image-57302" style="width:572px;height:auto"/><figcaption class="wp-element-caption"><em>Skip connection architecture: This diagram illustrates the use of skip layers between encoder (C_l1, C_l2, C_l3) and decoder (C_u1, C_u2, C_u3) blocks, with a bottleneck in the center to keep the feature dimensions at each stage. This retains the spatial details across the network | Source: Author</em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Downsampling</h4>



<p>During downsampling, each convolutional block extracts spatial information and passes the information to the next convolutional block to extract more information until it reaches the middle part known as the bottleneck. Upsampling starts from the bottleneck.&nbsp;</p>



<h4 class="wp-block-heading">Upsampling</h4>



<p>During upsampling, each transpose convolutional block expands information from the previous block while concatenating the information from the corresponding downsampling block. By concatenating information, the network can then learn to assemble a more precise output based on this information.</p>



<p>This architecture can localize, i.e. it can find the object of interest pixel by pixel. Furthermore, U-Net also allows the network to propagate context information from lower resolution to higher resolution layers. This allows the network to generate high-resolution samples.&nbsp;</p>
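<p>The skip-connection mechanism can be sketched as a toy two-level network in PyTorch. The layer sizes here are our own, for illustration only; the real Pix2Pix generator is much deeper:</p>

```python
import torch
import torch.nn as nn

# Toy two-level U-Net illustrating skip connections: each decoder stage
# concatenates its upsampled features with the encoder features of
# matching spatial size before the next layer.
class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 16, 4, stride=2, padding=1)    # 256 -> 128
        self.down2 = nn.Conv2d(16, 32, 4, stride=2, padding=1)   # 128 -> 64 (bottleneck)
        self.up1 = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)  # 64 -> 128
        # input channels are doubled by the skip concatenation below
        self.up2 = nn.ConvTranspose2d(16 + 16, 3, 4, stride=2, padding=1)  # 128 -> 256

    def forward(self, x):
        d1 = torch.relu(self.down1(x))
        d2 = torch.relu(self.down2(d1))
        u1 = torch.relu(self.up1(d2))
        # skip connection: concatenate encoder features of matching size
        return self.up2(torch.cat([u1, d1], dim=1))
```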



<h3 class="wp-block-heading" id="h-markovian-discriminator-patchgan">Markovian discriminator (PatchGAN)</h3>



<p>The discriminator uses the PatchGAN architecture, which consists of several convolutional blocks. It takes an NxN patch of the image and tries to determine whether it is real or fake. N can be much smaller than the original image, and the network is still able to produce high-quality results. The discriminator is applied convolutionally across the whole image. Also, because the discriminator is smaller, i.e., it has fewer parameters than the generator, it is faster.&nbsp;</p>



<p>PatchGAN effectively models the image as a Markov random field in which each NxN patch is treated as independent. Therefore, PatchGAN can be understood as a form of texture/style loss.</p>
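<p>A minimal PatchGAN-style head might look like this (channel counts and depth are illustrative, not the published configuration). The key point is that the output is a grid of logits, one per receptive-field patch, instead of a single scalar:</p>

```python
import torch
import torch.nn as nn

# Sketch of a PatchGAN-style discriminator head: a stack of strided
# convolutions whose output is a grid of per-patch logits.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
)

x = torch.randn(1, 3, 256, 256)
print(patch_d(x).shape)  # torch.Size([1, 1, 63, 63])
```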



<h3 class="wp-block-heading" id="h-loss-function">Loss function</h3>



<p>The loss function is:&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_14.png?ssl=1" alt="Loss function" class="wp-image-57291" style="width:582px;height:123px"/></figure>
</div>


<p>The equation above has two components: one for the discriminator and the other for the generator. Let’s understand both of them one by one.&nbsp;</p>



<p>In any GAN, the discriminator is trained first in every iteration so that it can recognize both real and fake data. Essentially,&nbsp;</p>



<p>D(x,y) = 1 i.e. real and,&nbsp;</p>



<p>D(x,G(z)) = 0 i.e. fake.&nbsp;</p>



<p>It is worth noting that G(z) produces fake samples, so the discriminator&#8217;s output for them, D(x, G(z)), should be close to zero. In theory, the discriminator should always classify G(z) as fake. Therefore, the discriminator should maintain the maximum distance between real and fake, i.e., 1 and 0, in every iteration. In other words, the discriminator should maximize the loss function.&nbsp;</p>



<p>After the discriminator, the generator is trained. The generator, G(z), should learn to produce samples that are closer to the real samples. To learn the original distribution, it takes help from the discriminator: instead of targeting D(x, G(z)) = 0, we target D(x, G(z)) = 1.&nbsp;</p>



<p>With this change in labeling, the generator now optimizes its parameters against the discriminator&#8217;s judgment with ground-truth labels. This step ensures that the generator can yield samples that are close to real data, i.e., labeled 1.&nbsp;</p>



<p>The loss function is also mixed with an L1 loss so that the generator not only fools the discriminator but also produces images near the ground truth. In essence, the loss function has an additional L1 loss for the generator.&nbsp;</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_15.png?ssl=1" alt="Loss function" class="wp-image-57290" style="width:480px;height:86px"/></figure>
</div>


<p>Therefore, the final loss function is:</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_9.png?ssl=1" alt="Loss function" class="wp-image-57296" style="width:567px;height:66px"/></figure>
</div>


<p>It is worth noting that the L1 loss can preserve low-frequency details in the image, but it will not be able to capture high-frequency details. Hence, it will still produce blurry images. To tackle this problem, PatchGAN is used.&nbsp;</p>
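<p>Putting the two terms together, the generator objective can be sketched as follows. The function name is ours; &#955; = 100 is the weighting reported in the Pix2Pix paper:</p>

```python
import torch
import torch.nn.functional as F

LAMBDA = 100  # L1 weighting used in the Pix2Pix paper

def pix2pix_generator_loss(fake_logits, fake_image, target_image, lam=LAMBDA):
    # adversarial term: labels flipped to 1 so the generator tries to
    # make the discriminator call its output real
    gan_term = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    # L1 term: keeps the output near the ground truth, preserving
    # low-frequency structure
    l1_term = F.l1_loss(fake_image, target_image)
    return gan_term + lam * l1_term
```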



<h3 class="wp-block-heading" id="h-optimization">Optimization&nbsp;</h3>



<p>The optimization and training process is similar to that of a vanilla GAN. However, training itself is difficult since the objective function of a GAN is closer to concave-concave than convex-concave. Because of this, it is hard to find a saddle point, and this is what makes training and optimizing GANs difficult.&nbsp;</p>



<p>As we saw previously, the generator is not trained directly but through the discriminator. This essentially limits the optimization of the generator. If the discriminator fails to capture the high-dimensional data distribution, the generator will almost certainly fail to produce good samples. On the other hand, if we can train the discriminator more optimally, we can be assured that the generator will be trained optimally as well.&nbsp;</p>



<p>In the early stages of training, G is untrained and too weak to produce good samples. This makes the discriminator very powerful, so instead of minimizing log(1 − D(G(z))), the generator is trained to maximize log D(G(z)). This provides some stability in the early stages of training.&nbsp;</p>



<p>Other ways to tackle the instability are:</p>



<ol class="wp-block-list">
<li>Using spectral normalization in every layer of the model</li>



<li>Using the Wasserstein loss, which compares the average critic scores for real and fake images.</li>
</ol>
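<p>Both remedies are available in PyTorch. As an illustrative sketch (the helper function name is ours):</p>

```python
import torch
import torch.nn as nn

# Spectral normalization wraps a layer so its weight is rescaled by its
# largest singular value on every forward pass, bounding the Lipschitz
# constant of the discriminator.
sn_conv = nn.utils.spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1))

# The Wasserstein critic loss compares average scores instead of
# classifying samples; the critic maximizes (real - fake), so we
# minimize the negation.
def wasserstein_critic_loss(real_scores, fake_scores):
    return fake_scores.mean() - real_scores.mean()
```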



<h2 class="wp-block-heading" id="h-hands-on-example-with-pix2pix">Hands-on example with Pix2Pix</h2>



<p>Let’s implement Pix2Pix with PyTorch to get an intuitive understanding of how the algorithm works and the various components behind it.&nbsp;</p>



<p>Let’s start by downloading the data using the following commands:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">!wget http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/facades.tar.gz
!tar -xvf facades.tar.gz</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-data-visualization">Data visualization</h3>



<p>Once the data is downloaded, we can visualize it to understand which steps are needed to format it for training.&nbsp;</p>



<p>We will import the following libraries for data visualization.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> cv2
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> os
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np

path = <span class="hljs-string" style="color: rgb(221, 17, 68);">"facades/train/"</span>
plt.imshow(cv2.imread(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"{path}91.jpg"</span>))</pre></code></pre>
</div>



<div class="wp-block-image is-style-default">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_17.png?ssl=1" alt="Resulting output from the previous code" class="wp-image-57288" style="aspect-ratio:1.8727810650887573;width:833px;height:auto"/><figcaption class="wp-element-caption"><em>Resulting output from the previous code | Source: Author</em></figcaption></figure>
</div>


<p>From the image above, we can see that the data consists of two images attached together. If we then check the shape of the image, we find that the width is 512, which means the image can easily be split into two.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">print(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Shape of the image: '</span>,cv2.imread(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'{path}91.jpg'</span>).shape)</pre></code></pre>
</div>




<p>&gt;&gt; Shape of the image:&nbsp; (256, 512, 3)</p>



<p>To separate the images we will use the following commands:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Dividing the image by width</span>
image = cv2.imread(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'{path}91.jpg'</span>)
w = image.shape[<span class="hljs-number" style="color: teal;">1</span>]//<span class="hljs-number" style="color: teal;">2</span>
image_real = image[:, :w, :]
image_cond = image[:, w:, :]
fig, axes = plt.subplots(<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">2</span>, figsize=(<span class="hljs-number" style="color: teal;">18</span>,<span class="hljs-number" style="color: teal;">6</span>))
axes[<span class="hljs-number" style="color: teal;">0</span>].imshow(image_real)
axes[<span class="hljs-number" style="color: teal;">0</span>].set_title(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Real'</span>)
axes[<span class="hljs-number" style="color: teal;">1</span>].imshow(image_cond)
axes[<span class="hljs-number" style="color: teal;">1</span>].set_title(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Condition'</span>)
plt.show()</pre></code></pre>
</div>




<p>Output:</p>


<div class="wp-block-image is-style-default">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Pix2pix-Key-Model-Architecture-Decisions_11.png?ssl=1" alt="Data visualization" class="wp-image-57294" style="width:810px;height:auto"/><figcaption class="wp-element-caption"><em>Resulting output | Source: Author</em></figcaption></figure>
</div>


<p>The image on the left will be our ground truth while the image on the right will be our conditional image. We will refer to them as y and x respectively (from left to right).&nbsp;</p>



<h3 class="wp-block-heading" id="h-creating-dataloader">Creating dataloader</h3>



<p>A dataloader is a utility that lets us format the data as PyTorch requires. This involves two steps:&nbsp;</p>



<p>1. Formatting the data: reading the data from the source, cropping each image, and converting the halves to PyTorch tensors.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> glob <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> glob
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dataset
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torchvision <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> transforms


<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Data</span><span class="hljs-params">(Dataset)</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, path=<span class="hljs-string" style="color: rgb(221, 17, 68);">"facades/train/"</span>)</span>:</span>
       self.filenames = glob(path + <span class="hljs-string" style="color: rgb(221, 17, 68);">"*.jpg"</span>)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__len__</span><span class="hljs-params">(self)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> len(self.filenames)

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__getitem__</span><span class="hljs-params">(self, idx)</span>:</span>
       filename = self.filenames[idx]

       image = cv2.imread(filename)
       image_width = image.shape[<span class="hljs-number" style="color: teal;">1</span>]
       image_width = image_width // <span class="hljs-number" style="color: teal;">2</span>
       real = image[:, :image_width, :]
       condition = image[:, image_width:, :]

       real = transforms.functional.to_tensor(real)
       condition = transforms.functional.to_tensor(condition)

       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> real, condition</pre></code></pre>
</div>




<p>2. Loading the data using PyTorch’s <span class="c-code-snippet">DataLoader</span> class to create batches before feeding them into the neural nets.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> torch.utils.data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> DataLoader

train_dataset = Data()
train_loader = DataLoader(train_dataset, batch_size=<span class="hljs-number" style="color: teal;">4</span>, shuffle=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)

val_dataset = Data(path=<span class="hljs-string" style="color: rgb(221, 17, 68);">"facades/val/"</span>)
val_loader = DataLoader(val_dataset, batch_size=<span class="hljs-number" style="color: teal;">4</span>, shuffle=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre></code></pre>
</div>




<p>Keep in mind that we create two data loaders: one for training and one for validation.</p>
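<p>As a rough sanity check on the loaders, the number of batches per epoch follows directly from the dataset size. A minimal sketch in plain Python, assuming the standard facades training split of 400 paired images (worth verifying against your local copy):</p>

```python
import math

# Batches per epoch = ceil(num_images / batch_size).
# The 400-image figure assumes the standard facades training split.
num_train_images = 400
batch_size = 4

batches_per_epoch = math.ceil(num_train_images / batch_size)
print(batches_per_epoch)  # 100
```

<p>Note that <span class="c-code-snippet">shuffle=True</span> on the validation loader is harmless but unnecessary; <span class="c-code-snippet">shuffle=False</span> is the more common choice for evaluation.</p>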



<h3 class="wp-block-heading" id="h-utils">Utils</h3>



<p>In this section, we create the components used to build the Generator and the Discriminator: a convolutional block for downsampling and a transposed-convolution block for upsampling, referred to as <span class="c-code-snippet">cnn_block</span> and <span class="c-code-snippet">tcnn_block</span>, respectively.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> nn

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">cnn_block</span><span class="hljs-params">(
   in_channels, out_channels, kernel_size, stride = <span class="hljs-number" style="color: teal;">1</span>, padding = <span class="hljs-number" style="color: teal;">0</span>, first_layer = False
)</span>:</span>

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> first_layer:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.Conv2d(
           in_channels, out_channels, kernel_size, stride = stride, padding = padding
       )
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.Sequential(
           nn.Conv2d(
               in_channels, out_channels, kernel_size, stride = stride, padding = padding
           ),
           nn.BatchNorm2d(out_channels, momentum = <span class="hljs-number" style="color: teal;">0.1</span>, eps = <span class="hljs-number" style="color: teal;">1e-5</span>),
       )


<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">tcnn_block</span><span class="hljs-params">(
   in_channels,
   out_channels,
   kernel_size,
   stride = <span class="hljs-number" style="color: teal;">1</span>,
   padding = <span class="hljs-number" style="color: teal;">0</span>,
   output_padding = <span class="hljs-number" style="color: teal;">0</span>,
   first_layer = False,
)</span>:</span>
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> first_layer:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.ConvTranspose2d(
           in_channels,
           out_channels,
           kernel_size,
           stride = stride,
           padding = padding,
           output_padding = output_padding,
       )

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span>:
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> nn.Sequential(
           nn.ConvTranspose2d(
               in_channels,
               out_channels,
               kernel_size,
               stride = stride,
               padding = padding,
               output_padding = output_padding,
           ),
           nn.BatchNorm2d(out_channels, momentum = <span class="hljs-number" style="color: teal;">0.1</span>, eps = <span class="hljs-number" style="color: teal;">1e-5</span>),
       )</pre></code></pre>
</div>
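<p>With a 4×4 kernel, stride 2, and padding 1, each <span class="c-code-snippet">cnn_block</span> halves the spatial resolution and each <span class="c-code-snippet">tcnn_block</span> doubles it. A quick back-of-the-envelope check in plain Python, using only the standard convolution output-size formulas:</p>

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # Convolution output size: floor((H + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

def tconv_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    # Transposed convolution inverts the formula above
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 256
sizes = [size]
for _ in range(8):            # eight encoder blocks, each with k=4, s=2, p=1
    size = conv_out(size)
    sizes.append(size)

print(sizes)                  # [256, 128, 64, 32, 16, 8, 4, 2, 1]
assert tconv_out(128) == 256  # a mirrored decoder block restores the size
```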




<h3 class="wp-block-heading" id="h-defining-parameters">Defining parameters</h3>



<p>In this section, we define the parameters and hyperparameters that govern the training of the neural network.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define parameters</span>
batch_size = <span class="hljs-number" style="color: teal;">4</span>
workers = <span class="hljs-number" style="color: teal;">2</span>

epochs = <span class="hljs-number" style="color: teal;">30</span>

gf_dim = <span class="hljs-number" style="color: teal;">64</span>
df_dim = <span class="hljs-number" style="color: teal;">64</span>

L1_lambda = <span class="hljs-number" style="color: teal;">100.0</span>

in_w = in_h = <span class="hljs-number" style="color: teal;">256</span>
c_dim = <span class="hljs-number" style="color: teal;">3</span>

device = torch.device(<span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> torch.cuda.is_available() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>)</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-generator">Generator</h3>



<p>Now, let’s define the generator using the two components we created above.</p>







<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn.functional <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> F

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Generator</span><span class="hljs-params">(nn.Module)</span>:</span>
 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self,instance_norm=False)</span>:</span> <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#input : 256x256</span>
   super(Generator,self).__init__()
   self.e1 = cnn_block(c_dim,gf_dim,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
   self.e2 = cnn_block(gf_dim,gf_dim*<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e3 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e4 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">4</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e5 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e6 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e7 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>,)
   self.e8 = cnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)

   self.d1 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d2 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d3 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d4 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d5 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">8</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d6 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">4</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d7 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">2</span>*<span class="hljs-number" style="color: teal;">2</span>,gf_dim*<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)
   self.d8 = tcnn_block(gf_dim*<span class="hljs-number" style="color: teal;">1</span>*<span class="hljs-number" style="color: teal;">2</span>,c_dim,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#256x256</span>
   self.tanh = nn.Tanh()

 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self,x)</span>:</span>
   e1 = self.e1(x)
   e2 = self.e2(F.leaky_relu(e1,<span class="hljs-number" style="color: teal;">0.2</span>))
   e3 = self.e3(F.leaky_relu(e2,<span class="hljs-number" style="color: teal;">0.2</span>))
   e4 = self.e4(F.leaky_relu(e3,<span class="hljs-number" style="color: teal;">0.2</span>))
   e5 = self.e5(F.leaky_relu(e4,<span class="hljs-number" style="color: teal;">0.2</span>))
   e6 = self.e6(F.leaky_relu(e5,<span class="hljs-number" style="color: teal;">0.2</span>))
   e7 = self.e7(F.leaky_relu(e6,<span class="hljs-number" style="color: teal;">0.2</span>))
   e8 = self.e8(F.leaky_relu(e7,<span class="hljs-number" style="color: teal;">0.2</span>))
   d1 = torch.cat([F.dropout(self.d1(F.relu(e8)),<span class="hljs-number" style="color: teal;">0.5</span>,training=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),e7],<span class="hljs-number" style="color: teal;">1</span>)
   d2 = torch.cat([F.dropout(self.d2(F.relu(d1)),<span class="hljs-number" style="color: teal;">0.5</span>,training=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),e6],<span class="hljs-number" style="color: teal;">1</span>)
   d3 = torch.cat([F.dropout(self.d3(F.relu(d2)),<span class="hljs-number" style="color: teal;">0.5</span>,training=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),e5],<span class="hljs-number" style="color: teal;">1</span>)
   d4 = torch.cat([self.d4(F.relu(d3)),e4],<span class="hljs-number" style="color: teal;">1</span>)
   d5 = torch.cat([self.d5(F.relu(d4)),e3],<span class="hljs-number" style="color: teal;">1</span>)
   d6 = torch.cat([self.d6(F.relu(d5)),e2],<span class="hljs-number" style="color: teal;">1</span>)
   d7 = torch.cat([self.d7(F.relu(d6)),e1],<span class="hljs-number" style="color: teal;">1</span>)
   d8 = self.d8(F.relu(d7))

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.tanh(d8)</pre></code></pre>
</div>
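<p>The <span class="c-code-snippet">torch.cat</span> calls in the forward pass are U-Net skip connections: each decoder activation is concatenated with the mirrored encoder activation along the channel dimension, which is why the in-channels of <span class="c-code-snippet">d2</span> through <span class="c-code-snippet">d8</span> carry an extra factor of 2. The channel bookkeeping can be verified in plain Python:</p>

```python
gf_dim = 64

# Encoder output channels for e1..e8, matching the definitions above
enc = [gf_dim, gf_dim * 2, gf_dim * 4, gf_dim * 8,
       gf_dim * 8, gf_dim * 8, gf_dim * 8, gf_dim * 8]

# Decoder output channels for d1..d7 (d8 maps back to c_dim = 3)
dec_out = [gf_dim * 8, gf_dim * 8, gf_dim * 8, gf_dim * 8,
           gf_dim * 4, gf_dim * 2, gf_dim]

# d1 is fed by e8 alone; every later block receives the previous decoder
# output concatenated with the mirrored encoder activation
dec_in = [enc[7]]
for i in range(1, 8):
    skip = enc[7 - i]                   # e7, e6, ..., e1
    dec_in.append(dec_out[i - 1] + skip)

print(dec_in)  # [512, 1024, 1024, 1024, 1024, 512, 256, 128]
```

<p>These values match the in-channel arguments passed to <span class="c-code-snippet">tcnn_block</span> in the constructor above.</p>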




<h3 class="wp-block-heading" id="h-discriminator">Discriminator</h3>



<p>Let’s define the discriminator using the downsampling function.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Discriminator</span><span class="hljs-params">(nn.Module)</span>:</span>
 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self,instance_norm=False)</span>:</span><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#input : 256x256</span>
   super(Discriminator,self).__init__()
   self.conv1 = cnn_block(c_dim*<span class="hljs-number" style="color: teal;">2</span>,df_dim,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>) <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 128x128</span>
   self.conv2 = cnn_block(df_dim,df_dim*<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 64x64</span>
   self.conv3 = cnn_block(df_dim*<span class="hljs-number" style="color: teal;">2</span>,df_dim*<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">2</span>,<span class="hljs-number" style="color: teal;">1</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 32 x 32</span>
   self.conv4 = cnn_block(df_dim*<span class="hljs-number" style="color: teal;">4</span>,df_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">1</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 31 x 31</span>
   self.conv5 = cnn_block(df_dim*<span class="hljs-number" style="color: teal;">8</span>,<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">4</span>,<span class="hljs-number" style="color: teal;">1</span>,<span class="hljs-number" style="color: teal;">1</span>, first_layer=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># 30 x 30</span>

   self.sigmoid = nn.Sigmoid()
 <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x, y)</span>:</span>
   O = torch.cat([x,y],dim=<span class="hljs-number" style="color: teal;">1</span>)
   O = F.leaky_relu(self.conv1(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = F.leaky_relu(self.conv2(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = F.leaky_relu(self.conv3(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = F.leaky_relu(self.conv4(O),<span class="hljs-number" style="color: teal;">0.2</span>)
   O = self.conv5(O)

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.sigmoid(O)</pre></code></pre>
</div>
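<p>Unlike a discriminator that outputs a single real/fake score, this is a PatchGAN: its output is a 30×30 grid of scores, each classifying one patch of the concatenated input pair. The output size follows from the same convolution arithmetic as before, with three stride-2 blocks followed by two stride-1 blocks:</p>

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # Convolution output size: floor((H + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# conv1-conv3 use stride 2 (256 -> 128 -> 64 -> 32),
# conv4 and conv5 use stride 1 (32 -> 31 -> 30)
size = 256
for stride in (2, 2, 2, 1, 1):
    size = conv_out(size, stride=stride)

print(size)  # 30
```

<p>This matches the (batch, 1, 30, 30) shape of the <span class="c-code-snippet">real_class</span> and <span class="c-code-snippet">fake_class</span> label tensors used in the training loop.</p>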




<h3 class="wp-block-heading" id="h-initializing-the-models">Initializing the models</h3>



<p>Let’s initialize both models and move them to the GPU (if CUDA is available) for faster training.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">G = Generator().to(device)
D = Discriminator().to(device)</pre></code></pre>
</div>




<p>We will also define the optimizers and the loss function.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.optim <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> optim

G_optimizer = optim.Adam(G.parameters(), lr=<span class="hljs-number" style="color: teal;">2e-4</span>,betas=(<span class="hljs-number" style="color: teal;">0.5</span>,<span class="hljs-number" style="color: teal;">0.999</span>))
D_optimizer = optim.Adam(D.parameters(), lr=<span class="hljs-number" style="color: teal;">2e-4</span>,betas=(<span class="hljs-number" style="color: teal;">0.5</span>,<span class="hljs-number" style="color: teal;">0.999</span>))

bce_criterion = nn.BCELoss()
L1_criterion = nn.L1Loss()</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-training-and-monitoring-our-model">Training and monitoring our model</h3>



<p>Training the model is not the last step. You need to monitor the training run and analyze its performance so you can make changes where necessary. Monitoring a GAN can get taxing, with several losses, plots, and metrics to track, so we will use <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> at this step.</p>



<p>Neptune allows the user to:</p>



<ol class="wp-block-list">
<li>Monitor the live performance of the model</li>



<li>Monitor the performance of the hardware</li>



<li>Store and compare different metadata for different runs (like metrics, parameters, performance, data, etc.)</li>



<li>Share the work with others</li>
</ol>



<section
	id="i-box-block_4c939d22a63a86151b0bc30b744797e9"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Disclaimer</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Please note that this article references a <strong>deprecated version of Neptune</strong>.</p>



<p>For information on the latest version with improved features and functionality, please <a href="/" target="_blank" rel="noreferrer noopener">visit our website</a>.</p>


	</div>

</section>



<p>To get started, just follow these steps:</p>



<p>1. Install the Python <span class="c-code-snippet">neptune</span> library on your local system:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">!pip install neptune</pre></code></pre>
</div>




<p>2. Sign up at <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a>.</p>



<p>3. <a href="https://docs-legacy.neptune.ai/api/creating_and_deleting_projects/#creating-a-project" target="_blank" rel="noreferrer noopener">Create a project</a> for storing your metadata.</p>



<p>4. <a href="https://docs-legacy.neptune.ai/setup/setting_credentials/" target="_blank" rel="noreferrer noopener">Save your credentials as environment variables</a>.</p>



<p>For this project, we will log our parameters into the Neptune dashboard. For logging the parameters or any information into the dashboard, you can use a run object:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> neptune
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> os

run = neptune.init_run(
   project=os.getenv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"NEPTUNE_PROJECT_NAME"</span>),
   api_token=os.getenv(<span class="hljs-string" style="color: rgb(221, 17, 68);">"NEPTUNE_API_TOKEN"</span>)
)</pre></code></pre>
</div>




<p>A run object establishes a connection between your environment and the project’s dashboard you’ve created for this tutorial. To log metadata, like the dictionary below, you can use the following syntax:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Logging parameter in Neptune</span>
PARAMS = {<span class="hljs-string" style="color: rgb(221, 17, 68);">'Epoch'</span>: epochs,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Batch Size'</span>: batch_size,
          <span class="hljs-string" style="color: rgb(221, 17, 68);">'Input Channels'</span>: c_dim,
          <span class="hljs-string" style="color: rgb(221, 17, 68);">'Workers'</span>: workers,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Optimizer'</span>: <span class="hljs-string" style="color: rgb(221, 17, 68);">'Adam'</span>,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Learning Rate'</span>: <span class="hljs-number" style="color: teal;">2e-4</span>,
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Metrics'</span>: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'Binary Cross Entropy'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'L1 Loss'</span>],
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Activation'</span>: [<span class="hljs-string" style="color: rgb(221, 17, 68);">'Leaky Relu'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'Tanh'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'Sigmoid'</span> ],
         <span class="hljs-string" style="color: rgb(221, 17, 68);">'Device'</span>: device}

run[<span class="hljs-string" style="color: rgb(221, 17, 68);">'parameters'</span>] = PARAMS</pre></code></pre>
</div>




<p>To log the loss, generated images, and the model’s weights, we will use the run object again but with different methods like <span class="c-code-snippet">append</span> or <span class="c-code-snippet">upload</span>. Here is our training loop putting together everything we have along with Neptune logging:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define missing variables</span>
epochs = <span class="hljs-number" style="color: teal;">30</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Adjust as needed</span>
L1_lambda = <span class="hljs-number" style="color: teal;">100</span>  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Adjust as needed</span>
G_losses = []
D_losses = []
G_GAN_losses = []
G_L1_losses = []
img_list = []

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Fix one batch for visualizing progress; the dataset yields (real, condition)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># pairs, so unpack in the same (y, x) order as the training loop below</span>
fixed_y, fixed_x = next(iter(train_loader))
fixed_x = fixed_x.to(device)
fixed_y = fixed_y.to(device)

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> ep <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(epochs):
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i, data <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> enumerate(train_loader):

        y, x = data
        x = x.to(device)
        y = y.to(device)

        b_size = x.shape[<span class="hljs-number" style="color: teal;">0</span>]

        real_class = torch.ones(b_size, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">30</span>, <span class="hljs-number" style="color: teal;">30</span>).to(device)
        fake_class = torch.zeros(b_size, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">30</span>, <span class="hljs-number" style="color: teal;">30</span>).to(device)

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train D</span>
        D.zero_grad()

        real_patch = D(y, x)
        real_gan_loss = bce_criterion(real_patch, real_class)

        fake = G(x)

        fake_patch = D(fake.detach(), x)
        fake_gan_loss = bce_criterion(fake_patch, fake_class)

        D_loss = real_gan_loss + fake_gan_loss
        D_loss.backward()
        D_optimizer.step()

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Train G</span>
        G.zero_grad()
        fake_patch = D(fake, x)
        fake_gan_loss = bce_criterion(fake_patch, real_class)

        L1_loss = L1_criterion(fake, y)
        G_loss = fake_gan_loss + L1_lambda * L1_loss
        G_loss.backward()

        G_optimizer.step()

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Neptune logging</span>
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"Gen Loss"</span>].append(G_loss.item())
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"Dis Loss"</span>].append(D_loss.item())
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"L1 Loss"</span>].append(L1_loss.item())
        run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"Gen GAN Loss"</span>].append(fake_gan_loss.item())

        <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (i + <span class="hljs-number" style="color: teal;">1</span>) % <span class="hljs-number" style="color: teal;">5</span> == <span class="hljs-number" style="color: teal;">0</span>:
            print(
                <span class="hljs-string" style="color: rgb(221, 17, 68);">"Epoch [{}/{}], Step [{}/{}], d_loss: {:.4f}, g_loss: {:.4f},D(real): {:.2f}, D(fake):{:.2f},g_loss_gan:{:.4f},g_loss_L1:{:.4f}"</span>.format(
                    ep,
                    epochs,
                    i + <span class="hljs-number" style="color: teal;">1</span>,
                    len(train_loader),
                    D_loss.item(),
                    G_loss.item(),
                    real_patch.mean(),
                    fake_patch.mean(),
                    fake_gan_loss.item(),
                    L1_loss.item(),
                )
            )
            G_losses.append(G_loss.item())
            D_losses.append(D_loss.item())
            G_GAN_losses.append(fake_gan_loss.item())
            G_L1_losses.append(L1_loss.item())

            <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> torch.no_grad():
                G.eval()
                fake = G(fixed_x).detach().cpu()
                G.train()
            figs = plt.figure(figsize=(<span class="hljs-number" style="color: teal;">10</span>, <span class="hljs-number" style="color: teal;">10</span>))
            plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">1</span>)
            plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">"off"</span>)
            plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"conditional image (x)"</span>)
            plt.imshow(
                np.transpose(
                    torchvision.utils.make_grid(fixed_x.cpu(), nrow=<span class="hljs-number" style="color: teal;">1</span>, padding=<span class="hljs-number" style="color: teal;">5</span>, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),
                    (<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">0</span>),
                )
            )

            plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">2</span>)
            plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">"off"</span>)
            plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"fake image"</span>)
            plt.imshow(
                np.transpose(
                    torchvision.utils.make_grid(fake, nrow=<span class="hljs-number" style="color: teal;">1</span>, padding=<span class="hljs-number" style="color: teal;">5</span>, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),
                    (<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">0</span>),
                )
            )

            plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">3</span>)
            plt.axis(<span class="hljs-string" style="color: rgb(221, 17, 68);">"off"</span>)
            plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"ground truth (y)"</span>)
            plt.imshow(
                np.transpose(
                    torchvision.utils.make_grid(fixed_y.cpu(), nrow=<span class="hljs-number" style="color: teal;">1</span>, padding=<span class="hljs-number" style="color: teal;">5</span>, normalize=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>),
                    (<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">0</span>),
                )
            )

            run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"epoch_results"</span>].upload(figs)
            plt.close()
            img_list.append(figs)</pre></code></pre>
</div>





<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">run.stop()</pre></code></pre>
</div>



    <a
        href="https://app.neptune.ai/nielspace/Pix2Pix/experiments?compare=IwJgNMQ&#038;split=cmp&#038;dash=charts&#038;viewId=standard-view"
        id="cta-box-related-link-block_b745789eac639967f0f295cc2b31a4b5"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-explore-the-example-project-in-neptune-ai">                Explore the example project in neptune.ai            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    See more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>Once the training is initialized, all the logged information automatically appears in the dashboard. Neptune fetches live information from the training run, which allows <a href="https://docs-legacy.neptune.ai/tutorials/monitoring_training_live/" target="_blank" rel="noreferrer noopener">live monitoring of the entire process</a>.</p>



<p>Below are the screenshots of the monitoring process.&nbsp;</p>



<div id="app-screenshot-block_2d95ddc6d9c093ab610023d1699c7de2"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=1020%2C520&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=480%2C245&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=768%2C391&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Monitoring-the-performance-of-the-model.png?fit=1020%2C520&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="520"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/community/org/pix2pix-key-model-architecture/runs/details?viewId=standard-view&#038;detailsTab=charts&#038;shortId=PIX-1&#038;type=run"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Monitoring the performance of the model 			</figcaption>
			
</div>



<div id="separator-block_b9c595d54dc54a9f881c52e8bdc6fe6e"
         class="block-separator block-separator--20">
</div>



<p>You can also access all metadata and see generated samples.</p>



<div id="app-screenshot-block_fc1b55134d51f39accf9ee1c500baa3e"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=1020%2C520&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=480%2C245&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=768%2C391&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-all-metadata-in-the-project.png?fit=1020%2C520&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="520"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/community/org/pix2pix-key-model-architecture/runs/details?viewId=standard-view&#038;detailsTab=metadata&#038;shortId=PIX-1&#038;type=run&#038;path=&#038;attribute=Gen%20Loss"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Access to all metadata in the project			</figcaption>
			
</div>



<div id="separator-block_b9c595d54dc54a9f881c52e8bdc6fe6e"
         class="block-separator block-separator--20">
</div>



<p>Switching to the images panel will show you the generated samples:</p>



<div id="app-screenshot-block_cd4f46e48d8a298c1b84c8f320d87ccd"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=1020%2C520&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=480%2C245&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=768%2C391&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/Access-to-the-generated-samples.png?fit=1020%2C520&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="520"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/o/community/org/pix2pix-key-model-architecture/runs/details?viewId=standard-view&#038;detailsTab=images&#038;shortId=PIX-1&#038;type=run"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Access to the generated samples			</figcaption>
			
</div>



<div id="separator-block_b9c595d54dc54a9f881c52e8bdc6fe6e"
         class="block-separator block-separator--20">
</div>



<h2 class="wp-block-heading" id="h-key-takeaways">Key takeaways</h2>



<ul class="wp-block-list">
<li>Pix2Pix is a conditional GAN that conditions on an input image, rather than a class label, to generate images.</li>



<li>It uses two architectures:
<ul class="wp-block-list">
<li>U-Net for generator</li>



<li>PatchGAN for discriminator</li>
</ul>
</li>



<li>Instead of discriminating the entire image at once, PatchGAN classifies each NxN patch of the generated image as real or fake.</li>



<li>Pix2Pix adds an L1 reconstruction loss specifically for the generator so that it can generate images closer to the ground truth.</li>



<li>Pix2Pix is a pairwise image translation algorithm.&nbsp;</li>
</ul>
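<p>The combined generator objective from the takeaways above can be sketched numerically. This is an illustrative NumPy-only sketch, not the PyTorch code used in the tutorial; <span class="c-code-snippet">bce</span> and <span class="c-code-snippet">l1</span> are hypothetical stand-ins for the criteria used earlier, and the tensor shapes are arbitrary examples:</p>

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy over a PatchGAN output: one score per N x N patch
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def l1(fake, real):
    # Pixel-wise L1 distance pulls the generated image toward the ground truth
    return np.mean(np.abs(fake - real))

rng = np.random.default_rng(0)
fake_patch = rng.uniform(0.4, 0.9, size=(1, 1, 30, 30))  # D(G(x), x): a 30x30 grid of patch scores
fake_img = rng.uniform(0, 1, size=(1, 3, 256, 256))      # G(x)
real_img = rng.uniform(0, 1, size=(1, 3, 256, 256))      # y (ground truth)

L1_lambda = 100
# Generator loss = GAN loss (patches labelled "real") + lambda * L1 reconstruction loss
G_loss = bce(fake_patch, np.ones_like(fake_patch)) + L1_lambda * l1(fake_img, real_img)
print(G_loss)
```

<p>Because <span class="c-code-snippet">L1_lambda</span> is large, the reconstruction term dominates early in training, which is what pushes the generated image toward the ground truth.</p>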



<h4 class="wp-block-heading">Other GANs that you can explore are:</h4>



<ol class="wp-block-list">
<li>CycleGAN: Similar to Pix2Pix, it shares most of the approach except for the data: instead of paired image translation, it performs unpaired image translation. Since it was developed by the same authors, learning and exploring CycleGAN will be much easier.</li>



<li>If you are interested in text-to-image translation then you should explore:
<ul class="wp-block-list">
<li>DeepDaze: Uses a generative model to create images from text prompts. Great for generating abstract or artistic images based on text descriptions.</li>



<li>BigSleep: Great if you want to discover unusual visualizations from prompts.</li>



<li>DALL·E: Developed by OpenAI, this model generates creative compositions with a high level of detail directly from text descriptions.</li>
</ul>
</li>



<li>Other interesting GAN projects you may want to try out:
<ul class="wp-block-list">
<li>StyleGAN: Generates realistic faces; ideal for style manipulation and creative blending.</li>



<li>AnimeGAN: Converts real photos into anime-style images.</li>



<li>BigGAN: Produces images with realistic textures.</li>



<li>Age-cGAN: Alters age in facial images.</li>



<li>StarGAN: Handles multiple transformations in faces, like hair color and expression changes.</li>
</ul>
</li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6256</post-id>	</item>
		<item>
		<title>Dimensionality Reduction for Machine Learning</title>
		<link>https://neptune.ai/blog/dimensionality-reduction</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:34:22 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/dimensionality-reduction/</guid>

					<description><![CDATA[Data forms the foundation of any machine learning algorithm, without it, Data Science can not happen. Sometimes, it can contain a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data by visualization gets difficult because of the high dimensionality. This is&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Data forms the foundation of any machine learning algorithm; without it, data science cannot happen. Sometimes, data contains a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data through visualization gets difficult because of the high dimensionality. This is where dimensionality reduction comes into play.</p>



<p>In this article you will learn:</p>



<ol class="wp-block-list">
<li>What is dimensionality reduction?</li>



<li>What is the curse of dimensionality?</li>



<li>Tools and libraries used for dimensionality reduction</li>



<li>Algorithms used for dimensionality reduction</li>



<li>Applications</li>



<li>Advantages and disadvantages</li>
</ol>



<h2 class="wp-block-heading" id="h-what-is-dimensionality-reduction">What is dimensionality reduction?</h2>



<p>Dimensionality reduction is the task of reducing the number of features in a dataset. In machine learning tasks like <a href="/blog/random-forest-regression-when-does-it-fail-and-why" target="_blank" rel="noreferrer noopener">regression</a> or <a href="/blog/image-classification-tips-and-tricks-from-13-kaggle-competitions" target="_blank" rel="noreferrer noopener">classification</a>, there are often too many variables to work with. These variables are also called <strong>features</strong>. The higher the number of features, the more difficult it is to model them. This is known as the <strong>curse of dimensionality</strong> and is discussed in detail in the next section.</p>



<p>Additionally, some of these features can be quite redundant, adding noise to the dataset, so it makes no sense to keep them in the training data. This is where the feature space needs to be reduced.</p>



<p>The process of dimensionality reduction essentially transforms data from high-dimensional feature space to a low-dimensional feature space. Simultaneously, it is also important that meaningful properties present in the data are not lost during the transformation.</p>



<p>Dimensionality reduction is commonly used in data visualization to understand and interpret the data, and in machine learning or deep learning techniques to simplify the task at hand.&nbsp;</p>



<h3 class="wp-block-heading" id="h-curse-of-dimensionality">Curse of dimensionality</h3>



<p>It is well known that ML/DL algorithms need a large amount of data to learn invariances, patterns, and representations. If this data comprises a large number of features, this can lead to the curse of dimensionality. The curse of dimensionality, first introduced by <a href="https://zbmath.org/0103.12901" target="_blank" rel="noreferrer noopener nofollow">Bellman</a>, describes that, in order to estimate an arbitrary function with a given accuracy, the number of samples required grows exponentially with the number of features, i.e., the dimensionality. This is especially true for big data, which yields more <strong>sparsity</strong>.&nbsp;</p>



<p>Sparsity in data means that many features have a value of zero; this doesn&#8217;t mean that the values are missing. If the data has many sparse features, both the space and computational complexity increase. Oliver <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.1421" target="_blank" rel="noreferrer noopener nofollow">Kuss [2002]</a> showed that models trained on sparse data perform poorly on the test dataset. In other words, during training the models learn noise and are not able to generalize well; hence they overfit.&nbsp;</p>



<p>When the data is sparse, observations or samples in the training dataset are difficult to cluster, because high-dimensional data causes every observation in the dataset to appear roughly equidistant from every other. If the data is meaningful and non-redundant, there will be regions where similar data points come together and cluster, and these clusters must be statistically significant.&nbsp;</p>
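<p>This &#8220;equidistant&#8221; effect is easy to demonstrate. Below is a minimal NumPy sketch (the sample count and dimensionalities are arbitrary choices for illustration): it measures the contrast between the farthest and nearest pair of points, which collapses as dimensionality grows.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(n_points, n_dims):
    # Contrast between the farthest and nearest pair, relative to the nearest:
    # in high dimensions this ratio shrinks, so points look equidistant.
    X = rng.random((n_points, n_dims))
    sq = (X ** 2).sum(axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b (clipped for rounding)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    d = np.sqrt(d2[np.triu_indices(n_points, k=1)])  # unique pairs only
    return (d.max() - d.min()) / d.min()

print(distance_spread(100, 2))       # low dimensions: large contrast
print(distance_spread(100, 10_000))  # high dimensions: distances concentrate
```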



<p>Issues that arise with high dimensional data are:</p>



<ol class="wp-block-list">
<li>Running a risk of overfitting the machine learning model.&nbsp;</li>



<li>Difficulty in clustering similar features.</li>



<li>Increased space and computational time complexity.&nbsp;</li>
</ol>



<p>Non-sparse, or dense, data, on the other hand, has non-zero features. Apart from containing non-zero values, it also contains information that is both meaningful and non-redundant.&nbsp;</p>



<p>To tackle the curse of dimensionality, methods like dimensionality reduction are used. Dimensionality reduction techniques are very useful for transforming sparse features into dense features. Furthermore, dimensionality reduction is also used for data cleaning and feature extraction.</p>



<h2 class="wp-block-heading" id="h-tools-and-library">Tools and libraries</h2>



<p>The most popular library for dimensionality reduction is <strong>scikit-learn </strong>(sklearn). The library consists of three main modules for dimensionality reduction algorithms:</p>



<ol class="wp-block-list">
<li>Decomposition algorithms
<ul class="wp-block-list">
<li>Principal Component Analysis</li>



<li>Kernel Principal Component Analysis</li>



<li>Non-Negative Matrix Factorization&nbsp;</li>



<li>Singular Value Decomposition&nbsp;</li>
</ul>
</li>



<li>Manifold learning algorithms
<ul class="wp-block-list">
<li>t-Distributed Stochastic Neighbor Embedding</li>



<li>Spectral Embedding</li>



<li>Locally Linear Embedding</li>
</ul>
</li>



<li>Discriminant Analysis
<ul class="wp-block-list">
<li>Linear Discriminant Analysis</li>
</ul>
</li>
</ol>



<p>When it comes to deep learning, models like autoencoders can be constructed to reduce dimensions and learn features and representations. Frameworks like PyTorch, PyTorch Lightning, Keras, and TensorFlow are used to create autoencoders.</p>
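<p>As a rough illustration of the idea, without any of those frameworks, a linear autoencoder can be trained in plain NumPy with gradient descent. All sizes and hyperparameters below are arbitrary; a single linear encoder&#8211;decoder pair like this learns a projection comparable to PCA:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 10 dimensions that mostly live on a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X -= X.mean(axis=0)

# Linear autoencoder: encode 10 -> 2, decode 2 -> 10
W_enc = rng.normal(scale=0.1, size=(10, 2))
W_dec = rng.normal(scale=0.1, size=(2, 10))

lr = 0.01
initial_loss = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
for _ in range(1000):
    Z = X @ W_enc       # encode to the low-dimensional bottleneck
    X_hat = Z @ W_dec   # decode back to the input space
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
print(initial_loss, final_loss)
```

<p>Deep autoencoders built in PyTorch or Keras follow the same encode&#8211;decode pattern, with non-linear layers allowing them to capture structure that PCA cannot.</p>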



<section id="blog-intext-cta-block_445b3f9fd7653e0ae715edfe89ca024a" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-recommended-for-you">Recommended for you</h3>
    
            <p><a href="/blog/knowledge-distillation" target="_blank" rel="noopener">Knowledge Distillation: Principles, Algorithms, Applications</a></p>
<p><a href="https://neptune.ai/blog/the-best-ml-framework-extensions-for-scikit-learn" target="_blank" rel="noopener">The Best ML Frameworks &amp; Extensions For Scikit-learn</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-algorithms-for-dimensionality-reduction">Algorithms for dimensionality reduction</h2>



<p>Let’s start with the first class of algorithms.</p>



<h3 class="wp-block-heading" id="h-decomposition-algorithms">Decomposition algorithms</h3>



<p>The decomposition module in scikit-learn provides several dimensionality reduction algorithms. We can import the various techniques using the following command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> PCA, KernelPCA, NMF</pre>



<h4 class="wp-block-heading">Principal Component Analysis (PCA)</h4>



<p>Principal Component Analysis, or PCA, is a dimensionality reduction method that finds a lower-dimensional space while preserving the <strong>variance</strong> measured in the high-dimensional input space. It is an unsupervised method for dimensionality reduction.&nbsp;</p>



<p>PCA transformations are linear transformations. The method finds the principal components by decomposing the feature matrix into its eigenvectors. This means that PCA will not be effective when the distribution of the dataset is non-linear.&nbsp;</p>



<p>Let’s understand PCA with Python code.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">pca</span><span class="hljs-params">(X=np.array<span class="hljs-params">([])</span>, no_dims=<span class="hljs-number" style="color: teal;">50</span>)</span>:</span>

    print(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Preprocessing the data using PCA..."</span>)
    (n, d) = X.shape
    Mean = np.tile(np.mean(X, <span class="hljs-number" style="color: teal;">0</span>), (n, <span class="hljs-number" style="color: teal;">1</span>))
    X = X - Mean
    (l, M) = np.linalg.eig(np.dot(X.T, X))
    <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Sort eigenvectors by descending eigenvalue so the top no_dims are kept</span>
    idx = np.argsort(-l)
    Y = np.dot(X, M[:, idx[:no_dims]])
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> Y</pre>



<p>PCA implementation is quite straightforward. We can define the whole process in just four steps:</p>



<ol class="wp-block-list">
<li><strong>Standardization</strong>: The data has to be transformed to a common scale by subtracting the mean of the whole dataset from each observation. This makes the distribution zero-centered.&nbsp;</li>



<li><strong>Finding the covariance</strong>: The covariance matrix captures how pairs of features in the centered data vary together.&nbsp;</li>



<li><strong>Determining the principal components</strong>: Principal components are determined by calculating the eigenvectors and eigenvalues of the covariance matrix. <strong>Eigenvectors</strong> are a special set of vectors that capture the structure of the data; they become the principal components. The <strong>eigenvalues</strong>, on the other hand, rank them: the eigenvectors with the highest eigenvalues are the most important principal components.</li>



<li><strong>Final output</strong>: It is the dot product of the standardized matrix and the selected eigenvectors. Note that the number of columns, or features, will be reduced.&nbsp;</li>
</ol>
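
<p>The four steps above can be sketched with plain NumPy. Note that pca_steps is our own illustrative helper for this article, not a library function:</p>

```python
import numpy as np

def pca_steps(X, no_dims=2):
    # Step 1: standardization -- center each feature at zero mean
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix of the centered data
    cov = np.cov(X_centered, rowvar=False)
    # Step 3: eigendecomposition; keep eigenvectors with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:no_dims]]
    # Step 4: final output -- dot product of centered data and eigenvectors
    return X_centered @ components

X = np.random.RandomState(0).randn(100, 5)
Y = pca_steps(X, no_dims=2)
print(Y.shape)  # (100, 2)
```

<p>The returned components are uncorrelated with each other, which is exactly what projecting onto the eigenvectors of the covariance matrix guarantees.</p>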



<p>Reducing the number of variables not only reduces complexity but usually also costs some accuracy. In exchange, a smaller number of features makes the data easier to explore, visualize, and analyze, and it makes machine learning algorithms computationally less expensive. In simple words, the idea of PCA is to reduce the number of variables of a data set while preserving as much information as possible.</p>



<p>Let’s also take a look at the modules and functions sklearn provides for PCA.</p>



<p>We can start by loading the digits dataset:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_digits
digits = load_digits()
digits.data.shape</pre>



<p>(1797, 64)</p>



<p>The data consists of 8×8 pixel images, which means that they are 64-dimensional. To gain some understanding of the relationships between these points, we can use PCA to project them to lower dimensions, like 2-D:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> PCA

pca = PCA(<span class="hljs-number" style="color: teal;">2</span>)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># project from 64 to 2 dimensions</span>
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)</pre>



<p>(1797, 64)</p>



<p>(1797, 2)</p>



<p>Now, let’s plot the first two principal components.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">plt.scatter(projected[:, <span class="hljs-number" style="color: teal;">0</span>], projected[:, <span class="hljs-number" style="color: teal;">1</span>],
            c=digits.target, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'none'</span>, alpha=<span class="hljs-number" style="color: teal;">0.5</span>,
            cmap=plt.cm.get_cmap(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Spectral'</span>, <span class="hljs-number" style="color: teal;">10</span>))
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">'component 1'</span>)
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">'component 1'</span>)
plt.ylabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">'component 2'</span>)
plt.colorbar();</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_5.png?ssl=1" alt="Principal Component Analysis (PCA)" class="wp-image-54935" style="width:523px;height:406px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: PCA | Source: Author</em></figcaption></figure>
</div>


<p>We can see that PCA found principal components that, for the most part, cluster the different digit classes quite effectively.&nbsp;</p>
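
<p>To check how much of the original information the two components retain, we can look at the explained variance ratio that scikit-learn computes during the fit (shown here on a freshly fitted model):</p>

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(2).fit(digits.data)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

<p>Two components keep only a fraction of the total variance, yet, as the plot shows, they are often enough to reveal the cluster structure.</p>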



<h4 class="wp-block-heading">Kernel PCA (KPCA)</h4>



<p>The PCA transformation described previously is linear and is therefore ineffective on non-linearly distributed data. To deal with non-linear distributions, the basic idea is to use the kernel trick.&nbsp;</p>



<p>The kernel trick is simply a method of projecting non-linear data onto a higher-dimensional space in which the different distributions of data become separable. Once the distributions are separated, we can use PCA to separate them linearly.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_2.png?ssl=1" alt="Kernel PCA" class="wp-image-54929"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: KPCA | Source: Author</em></figcaption></figure>
</div>


<p>Kernel PCA uses a kernel function ϕ that calculates the dot product of the data for non-linear mapping. In other words, the function ϕ maps the original d-dimensional features into a larger, k-dimensional feature space by creating non-linear combinations of the original features.</p>



<p>Let&#8217;s assume a dataset x that contains two features, x1 and x2:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_13.png?ssl=1" alt="Kernel PCA" class="wp-image-54946" style="width:375px;height:89px"/></figure>
</div>


<div id="separator-block_20128d1a995cd55a34255b9a40986af5"
         class="block-separator block-separator--5">
</div>



<p>After applying the kernel trick we get:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_6.png?ssl=1" alt="Kernel PCA" class="wp-image-54943" style="width:779px;height:67px"/></figure>
</div>
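
<p>The effect of such a mapping can be verified with an explicit feature map. Below we use the textbook quadratic map ϕ(x1, x2) = (x1², √2·x1·x2, x2²) on a two-circles dataset like the one used below (this explicit map is only an illustration of the idea; the KPCA example further down uses an RBF kernel instead):</p>

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=400, factor=.3, noise=.05, random_state=0)

# Explicit quadratic feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
phi = np.column_stack([X[:, 0] ** 2,
                       np.sqrt(2) * X[:, 0] * X[:, 1],
                       X[:, 1] ** 2])

# In the mapped space, the squared radius x1^2 + x2^2 becomes a single axis,
# so a plane (here: a threshold) separates the inner circle (label 1)
# from the outer circle (label 0)
radius_sq = phi[:, 0] + phi[:, 2]
accuracy = ((radius_sq < 0.4) == (y == 1)).mean()
print(accuracy)
```

<p>A single threshold in the mapped space separates the two rings almost perfectly, even though no line in the original space could.</p>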





<p>To get a more intuitive understanding of Kernel PCA let&#8217;s define a feature space that cannot be linearly separated.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_circles
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> KernelPCA
np.random.seed(<span class="hljs-number" style="color: teal;">0</span>)
X, y = make_circles(n_samples=<span class="hljs-number" style="color: teal;">400</span>, factor=<span class="hljs-number" style="color: teal;">.3</span>, noise=<span class="hljs-number" style="color: teal;">.05</span>)</pre>



<p>Now, let’s plot and see our dataset.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">plt.figure(figsize=(<span class="hljs-number" style="color: teal;">15</span>,<span class="hljs-number" style="color: teal;">10</span>))
plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">1</span>, aspect=<span class="hljs-string" style="color: rgb(221, 17, 68);">'equal'</span>)
plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Original space"</span>)
reds = y == <span class="hljs-number" style="color: teal;">0</span>
blues = y == <span class="hljs-number" style="color: teal;">1</span>

plt.scatter(X[reds, <span class="hljs-number" style="color: teal;">0</span>], X[reds, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"red"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.scatter(X[blues, <span class="hljs-number" style="color: teal;">0</span>], X[blues, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"blue"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">"$x_1$"</span>)
plt.ylabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">"$x_2$"</span>)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_7.png?ssl=1" alt="Kernel PCA" class="wp-image-54928" style="width:456px;height:460px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: KPCA | Source: Author</em></figcaption></figure>
</div>


<p>As you can see, the two classes in this dataset cannot be separated linearly. Now, let&#8217;s define kernel PCA and see how it separates this feature space.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">kpca = KernelPCA(kernel=<span class="hljs-string" style="color: rgb(221, 17, 68);">"rbf"</span>, fit_inverse_transform=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>, gamma=<span class="hljs-number" style="color: teal;">10</span>, )
X_kpca = kpca.fit_transform(X)
plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">2</span>, aspect=<span class="hljs-string" style="color: rgb(221, 17, 68);">'equal'</span>)
plt.scatter(X_kpca[reds, <span class="hljs-number" style="color: teal;">0</span>], X_kpca[reds, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"red"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.scatter(X_kpca[blues, <span class="hljs-number" style="color: teal;">0</span>], X_kpca[blues, <span class="hljs-number" style="color: teal;">1</span>], c=<span class="hljs-string" style="color: rgb(221, 17, 68);">"blue"</span>,
           s=<span class="hljs-number" style="color: teal;">20</span>, edgecolor=<span class="hljs-string" style="color: rgb(221, 17, 68);">'k'</span>)
plt.title(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Projection by KPCA"</span>)
plt.xlabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">r"1st principal component in space induced by $\phi$"</span>)
plt.ylabel(<span class="hljs-string" style="color: rgb(221, 17, 68);">"2nd component"</span>)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_11.png?ssl=1" alt="Kernel PCA" class="wp-image-54934" style="width:584px;height:432px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: KPCA | Source: Author</em></figcaption></figure>
</div>


<p>After applying KPCA, the two classes in the dataset become linearly separable.&nbsp;&nbsp;</p>



<h4 class="wp-block-heading">Singular Value Decomposition&nbsp;(SVD)</h4>



<p>Singular value decomposition, or SVD, is a factorization method for a real or complex matrix. It is efficient when working with a sparse dataset, i.e., a dataset with many zero entries. Such datasets are typical of recommender systems and rating and review data.&nbsp;</p>



<p>The idea of SVD is that every matrix of shape n×p factorizes into A = USV<sup>T</sup>, where U is an orthogonal matrix, S is a diagonal matrix, and V<sup>T</sup> is also an orthogonal matrix.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_1.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54945" style="width:342px;height:89px"/></figure>
</div>


<p>The advantage of SVD is that the orthogonal matrices capture the structure of the original matrix A, which means that their properties do not change when multiplied by other numbers. This lets us approximate A from a subset of the factors.&nbsp;</p>
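
<p>Before applying SVD to images, we can verify the factorization itself on a small random matrix (a standalone sanity check, separate from the face data used below):</p>

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(6, 4)

# Thin SVD: A = U @ diag(S) @ VT
U, S, VT = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A up to float error
A_rebuilt = (U * S) @ VT
print(np.allclose(A, A_rebuilt))  # True
```

<p>Because multiplying U by the singular values S and then by V<sup>T</sup> reproduces A exactly, dropping the smallest singular values gives a controlled approximation, which is what we exploit next.</p>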



<p>Now let’s understand SVD using code. To get a better understanding of the algorithm we will use a face dataset that scikit-learn provides.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=<span class="hljs-number" style="color: teal;">70</span>, resize=<span class="hljs-number" style="color: teal;">0.4</span>)</pre>



<p>Plot the images to get an idea of what we are working with.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># image count and dimensions come from the dataset itself</span>
img_count, img_height, img_width = lfw_people.images.shape
X = lfw_people.images.reshape(img_count, img_width * img_height)
X0_img = X[<span class="hljs-number" style="color: teal;">0</span>].reshape(img_height, img_width)

plt.imshow(X0_img, cmap=plt.cm.gray)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_22.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54937" style="width:342px;height:410px"/></figure>
</div>


<p>Create a function for easy visualization of images.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">draw_img</span><span class="hljs-params">(img_vector, h=img_height, w=img_width)</span>:</span>
   plt.imshow( img_vector.reshape((h,w)), cmap=plt.cm.gray)
   plt.xticks(())
   plt.yticks(())
draw_img(X[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_18.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54938" style="width:309px;height:376px"/></figure>
</div>


<p>Before applying SVD it is better to standardize the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.preprocessing <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> StandardScaler

scaler = StandardScaler(with_std=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>)
Xstd = scaler.fit_transform(X)</pre>



<p>After standardizing this is how the image looks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_15.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54939" style="width:288px;height:369px"/></figure>
</div>


<p>It is worth noting that we can always retrieve the original image by performing the inverse transformation.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Xorig = scaler.inverse_transform(Xstd)
draw_img(Xorig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_18.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54938" style="width:303px;height:369px"/></figure>
</div>


<p>Now, we can apply the SVD function from NumPy and decompose the matrix into three matrices.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> numpy.linalg <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> svd

U, S, VT = svd(Xstd)</pre>



<p>To check that the function works we can always perform a matrix multiplication of the three matrices.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">US = U*S
Xhat = US @ VT[<span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">1288</span>,:]

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># inverse transform Xhat to reverse standardization</span>
Xhat_orig = scaler.inverse_transform(Xhat)
draw_img(Xhat_orig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_18.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54938" style="width:306px;height:372px"/></figure>
</div>


<p>Now, let’s perform dimensionality reduction. To do that, we just need to keep fewer columns of the factor matrices.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Xhat_500 = US[:, <span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">500</span>] @ VT[<span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">500</span>, :]
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># inverse transform Xhat to reverse standardization</span>
Xhat_500_orig = scaler.inverse_transform(Xhat_500)
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># draw recovered image</span>
draw_img(Xhat_500_orig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_16.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54930" style="width:289px;height:372px"/></figure>
</div>


<p>We can further reduce more features and see the results.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Xhat_100 = US[:, <span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">100</span>] @ VT[<span class="hljs-number" style="color: teal;">0</span>:<span class="hljs-number" style="color: teal;">100</span>, :]
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># inverse transform Xhat to reverse standardization</span>
Xhat_100_orig = scaler.inverse_transform(Xhat_100)
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># draw recovered image</span>
draw_img(Xhat_100_orig[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_17.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54931" style="width:299px;height:381px"/></figure>
</div>


<p>Now let’s create a function that would allow us to reduce the dimensions of the image.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">dim_reduce</span><span class="hljs-params">(US_, VT_, dim=<span class="hljs-number" style="color: teal;">100</span>)</span>:</span>

   Xhat_ = US_[:, <span class="hljs-number" style="color: teal;">0</span>:dim] @ VT_[<span class="hljs-number" style="color: teal;">0</span>:dim, :]

   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> scaler.inverse_transform(Xhat_)</pre>



<p>Plotting images with a different number of features.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">dim_vec = [<span class="hljs-number" style="color: teal;">50</span>, <span class="hljs-number" style="color: teal;">100</span>, <span class="hljs-number" style="color: teal;">200</span>, <span class="hljs-number" style="color: teal;">400</span>, <span class="hljs-number" style="color: teal;">800</span>]

plt.figure(figsize=(<span class="hljs-number" style="color: teal;">1.8</span> * len(dim_vec), <span class="hljs-number" style="color: teal;">2.4</span>))

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i, d <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> enumerate(dim_vec):
   plt.subplot(<span class="hljs-number" style="color: teal;">1</span>, len(dim_vec), i + <span class="hljs-number" style="color: teal;">1</span>)
   draw_img(dim_reduce(US, VT, d)[<span class="hljs-number" style="color: teal;">49</span>])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_9.png?ssl=1" alt="Singular Value Decomposition" class="wp-image-54932"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: SVD | Source: Author</em></figcaption></figure>
</div>


<p>As you can see, the first image is built from the fewest features, yet it still conveys an abstract version of the original, and as we increase the number of features, we gradually recover the original image. This shows that SVD can retain the basic structure of the data.&nbsp;</p>



<h4 class="wp-block-heading">Non-negative Matrix Factorization (NMF)</h4>



<p>NMF is an unsupervised machine learning algorithm. Given a non-negative input matrix X of dimension m×n, the algorithm decomposes it into the product of two non-negative matrices W and H, where W is of dimension m×p and H is of dimension p×n.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_20.png?ssl=1" alt="Non-negative Matrix Factorization (NMF)" class="wp-image-54940" style="width:642px;height:121px"/></figure>
</div>


<p><strong>Where Y = W·H</strong></p>



<p>From the equation above, you can see that to factorize the matrix we need to minimize the distance between X and the product W·H. The most widely used distance function is the squared Frobenius norm, an extension of the Euclidean norm to matrices.&nbsp;</p>



<p>It is also worth noting that this problem is not exactly solvable in general, which is why it is only approximated. As it turns out, NMF is good for parts-based representation of a dataset, i.e., NMF provides an <strong>efficient, distributed representation</strong> and can aid in the discovery of the structure of interest within the data.&nbsp;</p>
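
<p>One classic way to carry out this approximate minimization is the multiplicative update rule of Lee and Seung. Below is a minimal NumPy sketch of it (nmf_multiplicative is our own helper; scikit-learn's NMF uses a different, faster solver by default, so this is only for intuition):</p>

```python
import numpy as np

def nmf_multiplicative(X, p, n_iter=200, eps=1e-9, seed=0):
    # Approximately minimize ||X - W @ H||_F^2 with W, H kept non-negative
    rng = np.random.RandomState(seed)
    m, n = X.shape
    W = rng.rand(m, p)
    H = rng.rand(p, n)
    for _ in range(n_iter):
        # Multiplicative updates never change the sign of an entry,
        # so non-negativity is preserved automatically
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

X = np.abs(np.random.RandomState(1).randn(30, 20))
W, H = nmf_multiplicative(X, p=5)
print(np.linalg.norm(X - W @ H) < np.linalg.norm(X))
```

<p>Each update multiplies the current factor by a ratio of non-negative terms, which is why no projection step is needed to keep W and H non-negative.</p>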



<p>Let’s understand NMF with code. We will use the same data that we used in SVD.</p>



<p>First, we will fit the model to the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.decomposition <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> NMF
model = NMF(n_components=<span class="hljs-number" style="color: teal;">200</span>, init=<span class="hljs-string" style="color: rgb(221, 17, 68);">'nndsvd'</span>, random_state=<span class="hljs-number" style="color: teal;">0</span>)
W = model.fit_transform(X)
V = model.components_</pre>



<p>NMF takes a bit of time to decompose the data. Once the data is decomposed we can then visualize the factorized components.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">num_faces = <span class="hljs-number" style="color: teal;">20</span>
plt.figure(figsize=(<span class="hljs-number" style="color: teal;">1.8</span> * <span class="hljs-number" style="color: teal;">5</span>, <span class="hljs-number" style="color: teal;">2.4</span> * <span class="hljs-number" style="color: teal;">4</span>))

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">0</span>, num_faces):
   plt.subplot(<span class="hljs-number" style="color: teal;">4</span>, <span class="hljs-number" style="color: teal;">5</span>, i + <span class="hljs-number" style="color: teal;">1</span>)
   draw_img(V[i])</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_10.png?ssl=1" alt="Non-negative Matrix Factorization" class="wp-image-54925" style="width:676px;height:693px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: NMF | Source: Author</em></figcaption></figure>
</div>


<p>From the image above, we can see that NMF is very efficient at capturing the underlying structure of the data. It is also worth mentioning that NMF captures only linear attributes.&nbsp;</p>



<p><strong>Advantages of NMF</strong>:</p>



<ol class="wp-block-list">
<li>Data compression and visualization</li>



<li>Robustness to noise&nbsp;</li>



<li>Easier to interpret</li>
</ol>






<h3 class="wp-block-heading" id="h-manifold-learning">Manifold learning</h3>



<p>So far we have seen approaches that only involved linear transformation. But what do we do when we have a non-linear dataset?</p>



<p>Manifold learning is a type of unsupervised learning that seeks to perform dimensionality reduction of a non-linear dataset. Again, scikit-learn offers a module that consists of various nonlinear dimensionality reduction techniques. We can call those classes or techniques through this command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.manifold <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> TSNE, LocallyLinearEmbedding, SpectralEmbedding</pre>



<h4 class="wp-block-heading">t-Distributed Stochastic Neighbor Embedding (t-SNE)</h4>



<p>t-Distributed Stochastic Neighbor Embedding, or t-SNE, is a dimensionality reduction technique well suited for data visualization. Unlike PCA, which simply maximizes variance, t-SNE minimizes the divergence between two distributions. Essentially, it recreates the distribution of a high-dimensional space in a low-dimensional space rather than maximizing variance or applying a kernel trick.&nbsp;</p>



<p>We can get a high-level understanding of t-SNE in three simple steps:</p>



<ol class="wp-block-list">
<li>It first creates a probability distribution for the high-dimensional samples.&nbsp;&nbsp;</li>



<li>Then, it defines a similar distribution for the points in the low-dimensional embedding.</li>



<li>Finally, it tries to minimize the KL-divergence between the two distributions.&nbsp;</li>
</ol>
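
<p>These three steps can be sketched for a fixed embedding. The helper below (kl_between_embeddings, our own illustrative function) uses a single global σ instead of the per-point perplexity calibration that real t-SNE performs, so it only illustrates the objective, not the full algorithm:</p>

```python
import numpy as np

def kl_between_embeddings(X_high, X_low, sigma=1.0, eps=1e-12):
    # Step 1: Gaussian joint probabilities over high-dimensional pairs
    d_high = np.sum((X_high[:, None] - X_high[None, :]) ** 2, axis=-1)
    P = np.exp(-d_high / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()

    # Step 2: Student-t (1 degree of freedom) probabilities over low-dim pairs
    d_low = np.sum((X_low[:, None] - X_low[None, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d_low)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    # Step 3: KL divergence KL(P || Q) -- the quantity t-SNE minimizes
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

rng = np.random.RandomState(0)
X_high = rng.randn(20, 10)
kl = kl_between_embeddings(X_high, X_high[:, :2])
print(kl)
```

<p>t-SNE's optimizer moves the low-dimensional points by gradient descent to drive this divergence down.</p>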



<p>Now let’s understand it with code. For t-SNE, we will use the digits dataset again. First, we import TSNE and then the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.manifold <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> TSNE
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_digits

digits = load_digits()
print(digits.data.shape)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># There are 10 classes (0 to 9) with almost 180 images in each class</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># The images are 8x8 and hence 64 pixels(dimensions)</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;">#Displaying what the standard images look like</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">0</span>,<span class="hljs-number" style="color: teal;">5</span>):
   plt.figure(figsize=(<span class="hljs-number" style="color: teal;">5</span>,<span class="hljs-number" style="color: teal;">5</span>))
   plt.imshow(digits.images[i])
   plt.show()</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_19.png?ssl=1" alt="t-Distributed Stochastic Neighbor Embedding" class="wp-image-54936" style="width:287px;height:560px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: t-SNE | Source: Author</em></figcaption></figure>
</div>


<p>We will then store the digits in order using np.vstack.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">X = np.vstack([digits.data[digits.target==i] <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>)])
Y = np.hstack([digits.target[digits.target==i] <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>)])</pre>



<p>We will apply t-SNE to the dataset.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">digits_final = TSNE(perplexity=<span class="hljs-number" style="color: teal;">30</span>).fit_transform(X)</pre>



<p>We will now create a function to visualize the data.&nbsp;</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">import seaborn as sb
import matplotlib.patheffects as pe

<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">plot</span><span class="hljs-params">(x, colors)</span>:</span>
   palette = np.array(sb.color_palette(<span class="hljs-string" style="color: rgb(221, 17, 68);">"hls"</span>, <span class="hljs-number" style="color: teal;">10</span>))  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Choosing a color palette</span>

   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Create a scatter plot.</span>
   f = plt.figure(figsize=(<span class="hljs-number" style="color: teal;">8</span>, <span class="hljs-number" style="color: teal;">8</span>))
   ax = plt.subplot(aspect=<span class="hljs-string" style="color: rgb(221, 17, 68);">'equal'</span>)
   sc = ax.scatter(x[:,<span class="hljs-number" style="color: teal;">0</span>], x[:,<span class="hljs-number" style="color: teal;">1</span>], lw=<span class="hljs-number" style="color: teal;">0</span>, s=<span class="hljs-number" style="color: teal;">40</span>, c=palette[colors.astype(int)])
   <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Add the labels for each digit.</span>
   txts = []
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> i <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>):
       <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Position of each label.</span>
       xtext, ytext = np.median(x[colors == i, :], axis=<span class="hljs-number" style="color: teal;">0</span>)
       txt = ax.text(xtext, ytext, str(i), fontsize=<span class="hljs-number" style="color: teal;">24</span>)
       txt.set_path_effects([pe.Stroke(linewidth=<span class="hljs-number" style="color: teal;">5</span>, foreground=<span class="hljs-string" style="color: rgb(221, 17, 68);">"w"</span>),
                             pe.Normal()])
       txts.append(txt)
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> f, ax, txts</pre>



<p>Now we perform data visualization on the transformed dataset.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">plot(digits_final,Y)</pre>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_21.png?ssl=1" alt="t-Distributed Stochastic Neighbor Embedding" class="wp-image-54926" style="width:524px;height:504px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: t-SNE | Source: Author</em></figcaption></figure>
</div>


<p>As can be seen, t-SNE clusters the data beautifully. Compared to PCA, t-SNE performs well on nonlinear data. The drawback of t-SNE is that it becomes very slow on large datasets, so it is common to reduce the data with PCA first and then apply t-SNE.&nbsp;</p>
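<p>A minimal sketch of this PCA-then-t-SNE pipeline on the same digits data (the choice of 30 PCA components and the parameter values below are illustrative, not prescriptive):</p>

```python
# Reduce the 64 pixel dimensions with PCA first, then embed with t-SNE.
# Assumes scikit-learn is available; 30 PCA components is an illustrative choice.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_digits().data  # shape (1797, 64)

# Step 1: PCA compresses the data cheaply and removes some noise.
X_pca = PCA(n_components=30).fit_transform(X)

# Step 2: t-SNE runs on the much smaller 30-dimensional representation.
X_embedded = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(X_pca)
print(X_embedded.shape)  # (1797, 2)
```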



<h4 class="wp-block-heading">Locally Linear Embedding (LLE)</h4>



<p>Locally Linear Embedding or LLE is a non-linear and unsupervised machine learning method for dimensionality reduction. LLE takes advantage of the local structure or topology of the data and preserves it on a lower-dimensional feature space.&nbsp;</p>



<p>LLE is comparatively fast to optimize, but it performs poorly on noisy data.&nbsp;</p>



<p>Let’s break the whole process into three simple steps:</p>



<ol class="wp-block-list">
<li>Find the nearest neighbors of the data points.&nbsp;</li>



<li>Construct a weight matrix by approximating each data point as a weighted linear combination of its k-nearest neighbors, minimizing the squared distance between each point and its linear reconstruction.</li>



<li>Map the weights into a lower-dimensional space by using the <strong>eigenvector-based optimization</strong> technique.</li>
</ol>
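<p>The three steps above are handled internally by scikit-learn&#8217;s LocallyLinearEmbedding. A small, self-contained sketch (the swiss-roll data and the value of n_neighbors are illustrative):</p>

```python
# LLE on a synthetic 3-D manifold: n_neighbors drives step 1 (neighbor search),
# while the weight matrix and eigenvector optimization of steps 2-3 run internally.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # shape (1000, 3)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

print(X_lle.shape)                # (1000, 2)
print(lle.reconstruction_error_)  # the squared reconstruction cost being minimized
```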


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_24.png?ssl=1" alt="Locally Linear Embedding" class="wp-image-54927" style="width:649px;height:639px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LLE | Source: S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding</em></figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_14.png?ssl=1" alt="Locally Linear Embedding" class="wp-image-54933" style="width:518px;height:389px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LLE | Source: <a href="https://scikit-learn.org/stable/modules/manifold.html#manifold" target="_blank" rel="noreferrer noopener nofollow">Scikit Learn</a></em></figcaption></figure>
</div>


<h4 class="wp-block-heading">Spectral embedding</h4>



<p>Spectral Embedding is another non-linear dimensionality reduction technique that also happens to be an unsupervised machine learning algorithm. Spectral embedding aims to find clusters of different classes based on the low-dimensional representations.&nbsp;</p>



<p>We can again break the whole process into three simple steps:</p>



<ol class="wp-block-list">
<li><strong>Preprocessing</strong>: Construct a Laplacian matrix representation of the data or graph.&nbsp;</li>



<li><strong>Decomposition</strong>: Compute eigenvalues and eigenvectors of the constructed matrix and then map each point to a lower-dimensional representation. Spectral embedding makes use of the second smallest eigenvalue and its corresponding eigenvector.</li>



<li><strong>Clustering</strong>: Assign points to two or more clusters, based on the representation. Clustering is usually done using k-means clustering.&nbsp;</li>
</ol>
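<p>The three steps can be sketched with scikit-learn&#8217;s SpectralEmbedding followed by k-means. The two-moons data and the parameter values below are illustrative assumptions, not part of the article:</p>

```python
# Steps 1-2: build the graph Laplacian and map points onto its eigenvectors;
# step 3: cluster the low-dimensional representation with k-means.
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

embedding = SpectralEmbedding(n_components=2, affinity="nearest_neighbors",
                              random_state=0)
X_emb = embedding.fit_transform(X)  # shape (500, 2)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_emb)
print(X_emb.shape, labels.shape)
```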



<p><strong>Applications</strong>: Spectral Embedding finds its application in image segmentation.&nbsp;</p>



<h3 class="wp-block-heading" id="h-discriminant-analysis">Discriminant Analysis</h3>



<p>Discriminant Analysis is another module that scikit-learn provides. It can be called using the following command:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.discriminant_analysis <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> LinearDiscriminantAnalysis</pre>



<h4 class="wp-block-heading">Linear Discriminant Analysis (LDA)</h4>



<p>LDA is an algorithm that is used to find a linear combination of features in a dataset. Like PCA, LDA is also a linear transformation-based technique. But unlike PCA it is a supervised learning algorithm.&nbsp;</p>



<p>LDA computes the directions, i.e. linear discriminants that can create decision boundaries and maximize the separation between multiple classes. It is also very effective for multi-class classification tasks.</p>



<p>To have a more intuitive understanding of LDA, consider plotting a relationship of two classes as shown in the image below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_3.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54944" style="width:506px;height:418px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>One way to solve this problem is to project all the data points onto the x-axis.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_8.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54941" style="width:512px;height:417px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>This approach, however, loses information: the projections of the two classes overlap on the x-axis, so the separation between them is lost.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_12.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54948" style="width:467px;height:132px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>A better approach is to account for the spread of all the points and fit a new axis that passes through the data. All the points can then be projected onto this new axis.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_23.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54942" style="width:493px;height:405px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>This new axis minimizes the within-class variance while maximizing the distance between the two classes, separating them efficiently.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Dimensionality-Reduction-for-Machine-Learning_4.png?ssl=1" alt="Linear Discriminant Analysis (LDA)" class="wp-image-54947" style="width:515px;height:140px"/><figcaption class="wp-element-caption"><em>Dimensionality reduction technique: LDA | Source: <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Towards Data Science</a></em></figcaption></figure>
</div>


<p>LDA can be used for multivariate data as well, and it makes data inference quite simple. LDA can be computed in the following five steps:</p>



<ol class="wp-block-list">
<li>Compute the d-dimensional mean vectors for the different classes from the dataset.</li>



<li>Compute the scatter matrices (the between-class and within-class scatter matrices). A scatter matrix is used to estimate the covariance matrix when the covariance matrix, or the joint variability of two random variables, is difficult to calculate directly.&nbsp;</li>



<li>Compute the eigenvectors (e1, e2, e3&#8230;ed) and corresponding eigenvalues (λ1,λ2,&#8230;,λd) for the scatter matrices.</li>



<li>Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix W (where every column represents an eigenvector).</li>



<li>Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication Y=X×W (where X is the n×d matrix of the n samples, and Y is the transformed n×k matrix of samples in the new subspace).</li>
</ol>
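<p>These five steps are what scikit-learn&#8217;s LinearDiscriminantAnalysis performs under the hood. A brief sketch on the iris dataset (the dataset choice and k=2 are illustrative assumptions):</p>

```python
# Project 4-dimensional iris data onto k=2 linear discriminants (Y = X × W).
# LDA is supervised, so fit_transform also takes the class labels y.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # n=150 samples, d=4 features, 3 classes

# n_components must be at most (number of classes - 1) = 2 here.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)
```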



<p>To learn more about LDA, you can check out this <a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">article</a>.&nbsp;</p>



<h2 class="wp-block-heading" id="h-applications-of-dimentionality-reduction">Applications of dimensionality reduction</h2>



<p>Dimensionality reduction finds its way into many real-life applications, some of which are:</p>



<ul class="wp-block-list">
<li>Customer relationship management</li>



<li>Text categorization</li>



<li>Image retrieval</li>



<li>Intrusion detection</li>



<li>Medical image segmentation&nbsp;</li>
</ul>



<h2 class="wp-block-heading" id="h-advantages-and-disadvantages-of-dimentionality-reduction">Advantages and disadvantages of dimensionality reduction</h2>



<p><strong>Advantages of dimensionality reduction</strong>:</p>



<ul class="wp-block-list">
<li>It helps in data compression by reducing features.</li>



<li>It reduces storage.</li>



<li>It makes machine learning algorithms computationally efficient.</li>



<li>It also helps remove redundant features and noise.</li>



<li>It tackles the curse of dimensionality.</li>
</ul>



<p><strong>Disadvantages of dimensionality reduction</strong>:</p>



<ul class="wp-block-list">
<li>It may lead to some amount of data loss.</li>



<li>Accuracy is compromised.&nbsp;</li>
</ul>



<h2 class="wp-block-heading" id="h-final-thoughts">Final thoughts</h2>



<p>In this article, we learned about dimensionality reduction and also about the curse of dimensionality. We touched on the different algorithms that are used in dimensionality reduction with mathematical details and through code as well.&nbsp;</p>



<p>It is worth mentioning that these algorithms should be chosen based on the task at hand. For instance, if your data is linear in nature, use decomposition methods; otherwise, use manifold learning techniques.&nbsp;</p>



<p>It is considered good practice to first visualize the data and then decide which method to use. Also, do not restrict yourself to one method: explore several and see which one is the most suitable.</p>



<p>I hope you have learned something from this article. Happy learning.</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://www.geeksforgeeks.org/dimensionality-reduction/" target="_blank" rel="noreferrer noopener nofollow">Introduction to Dimensionality Reduction &#8211; GeeksforGeeks</a></li>



<li><a href="https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/" target="_blank" rel="noreferrer noopener nofollow">Introduction to Dimensionality Reduction for Machine Learning</a></li>



<li><a href="https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202" target="_blank" rel="noreferrer noopener nofollow">Principal component analysis: a review and recent developments</a></li>



<li><a href="https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2" target="_blank" rel="noreferrer noopener nofollow">Linear Discriminant Analysis In Python | by Cory Maklin</a></li>



<li><a href="https://www.kdnuggets.com/2021/01/sparse-features-machine-learning-models.html" target="_blank" rel="noreferrer noopener nofollow">Working With Sparse Features In Machine Learning Models</a></li>



<li><a href="https://www.kdnuggets.com/2017/04/must-know-curse-dimensionality.html" target="_blank" rel="noreferrer noopener nofollow">Must-Know: What is the curse of dimensionality?</a></li>



<li><a href="https://analyticsindiamag.com/curse-of-dimensionality-and-what-beginners-should-do-to-overcome-it/" target="_blank" rel="noreferrer noopener nofollow">Curse Of Dimensionality And What Beginners Should Do To Overcome It</a></li>



<li><a href="https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/" target="_blank" rel="noreferrer noopener nofollow">6 Dimensionality Reduction Algorithms With Python</a></li>



<li><a href="https://scikit-learn.org/stable/modules/classes.html" target="_blank" rel="noreferrer noopener nofollow">Sklearn API References</a></li>



<li><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding" target="_blank" rel="noreferrer noopener nofollow">t-distributed stochastic neighbor embedding</a></li>



<li><a href="https://www.sciencedirect.com/science/article/pii/B9780124095458000029" target="_blank" rel="noreferrer noopener nofollow">Feature Selection and Extraction</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6121</post-id>	</item>
		<item>
		<title>Open Source MLOps: Platforms, Frameworks and Tools</title>
		<link>https://neptune.ai/blog/best-open-source-mlops-tools</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:30:45 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.test/best-open-source-mlops-tools/</guid>

					<description><![CDATA[You don’t need to spend a lot on MLOps tools to bring the magic of DevOps to your machine learning projects. There are plenty of open-source tools to choose from. It’s a good solution when you’re trying to address unique problems and need a community to rely on. But there are some cons to&#8230;]]></description>
										<content:encoded><![CDATA[
<p>You don’t need to spend a lot on <a href="/blog/mlops-tools-platforms-landscape" target="_blank" rel="noreferrer noopener">MLOps tools</a> to bring the magic of DevOps to your machine learning projects. There are plenty of open-source tools to choose from. They are a good solution when you’re trying to address unique problems and need a community to rely on. But there are some cons to open-source machine learning tools too.&nbsp;</p>



<ul class="wp-block-list">
<li>First, be careful—machine learning open source tools aren’t always 100% free all of the time. For example, Kubeflow has client and server components, and both are open. However, some tools might open-source only one of these components: the client is open, but the vendor controls everything server-side.</li>



<li>Free open-source tools can cost you in other ways too. If you consider that you have to host and maintain the tool long-term, you’ll find that open-source can be quite costly after all.&nbsp;</li>



<li>Finally, if something goes awry, you probably won’t have 24/7/365 vendor support to rely on. Community can help you but, obviously, they don’t bear any responsibility for the result you’re left with.</li>
</ul>



<p>Ultimately, open-source tools can be tricky. Before you choose the tool for your project, you need to carefully study its pros and cons. Moreover, you need to make sure that the tools work well with the rest of your stack. This is why I prepared a list of popular and community-approved MLOps platforms, tools, and frameworks for different stages of the model development process.&nbsp;</p>



<p>If you&#8217;re exploring the possibility of integrating open source machine learning platforms into your workflow to simplify <a href="/blog/experiment-management" target="_blank" rel="noreferrer noopener">model development</a> and <a href="/blog/model-deployment-strategies" target="_blank" rel="noreferrer noopener">model deployment</a>, this article is tailored for you. Within these insights, you&#8217;ll discover a compilation of machine learning platforms, frameworks, and specialized tools designed to assist you in data exploration, deployment strategies, and testing procedures.</p>



<p>Furthermore, we&#8217;ve included a <a href="#faq">FAQ section towards the conclusion</a>, offering comprehensive responses to the most commonly posed questions.</p>



<section
	id="i-box-block_4205a9d15b0b03a8244887f2aa407d29"
	class="block-i-box  l-margin__top--large l-margin__bottom--0">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h3 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Interested in other MLOps tools?</strong>
            </h3>		</header>
	
	<div class="block-i-box__inner">
		

<p>When building their ML pipelines, teams usually look into a few other components of the MLOps stack.</p>



<p>If that’s the case for you, here are a few articles you should check:</p>



<ul
    id="arrow-list-block_f1a2711030905652b3f96609e46525f0"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a href="/blog/mlops-tools-platforms-landscape" target="_blank" rel="noreferrer noopener">MLOps Landscape in 2023 [Tools and Platforms]</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a href="/blog/ml-platform-guide" target="_blank" rel="noreferrer noopener">Building an ML Platform [Guide]</a></p>


</li>


</ul>


	</div>

</section>



<h2 class="wp-block-heading" id="h-mlops-open-source-platforms">MLOps open source platforms</h2>



<p>Let us start by exploring the open-source platforms first followed by frameworks and tools.</p>



<h3 class="wp-block-heading" id="h-full-fledged-mlops-open-source-platforms">Full-fledged MLOps open source platforms</h3>



<p>Full-fledged platforms contain tools for all stages of the machine-learning workflow. Ideally, once you get a full-fledged tool, you won’t have to set up any other tools. In practice, it depends on the needs of your project and personal preferences.&nbsp;</p>



<h4 class="wp-block-heading" id="1-kubeflow">Kubeflow</h4>



<p>Almost immediately after Kubernetes established itself as the standard for working with a cluster of containers, Google created <a href="https://www.kubeflow.org/" target="_blank" rel="noreferrer noopener nofollow">Kubeflow</a>—an open-source project that simplifies working with ML in Kubernetes. It has all the advantages of this <a href="/blog/best-workflow-and-pipeline-orchestration-tools" target="_blank" rel="noreferrer noopener">orchestration tool</a>, from the ability to deploy on any infrastructure to managing loosely-coupled microservices, and on-demand scaling.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="912" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=1920%2C912&#038;ssl=1" alt="" class="wp-image-29415" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?w=1920&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=768%2C365&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=200%2C95&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=1536%2C730&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=220%2C105&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=120%2C57&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=160%2C76&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=300%2C143&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=480%2C228&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kubeflow.png?resize=1020%2C485&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>  Introduction to Kubeflow, open source MLOps platform | <a href="https://www.kubeflow.org/docs/components/central-dash/overview/" target="_blank" rel="noreferrer noopener nofollow">Source</a> </strong></figcaption></figure>
</div>


<p>This project is for developers who want to deploy portable and scalable machine learning projects. Google didn&#8217;t want to recreate other services. They wanted to create a state-of-the-art open-source system that can be applied alongside various infrastructures—from supercomputers to laptops.&nbsp;</p>



<div id="separator-block_20d44106fb5e580968b51d0194b03143"
         class="block-separator block-separator--5">
</div>



<p>With Kubeflow, you can benefit from the following features:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Jupyter notebooks</strong></li>
</ul>



<p>Create and customize Jupyter notebooks, immediately see the results of running your code, and create interactive analytics reports.</p>



<ul class="wp-block-list">
<li><strong>Custom TensorFlow job operator</strong></li>
</ul>



<p>This functionality helps train your model and apply a TensorFlow or Seldon Core serving container to export the model to Kubernetes.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Simplified containerization</strong></li>
</ul>



<p>Kubeflow eliminates the complexity involved in containerizing the code. Data scientists can perform data preparation, model training, and deployment in less time.</p>



<p>All in all, Kubeflow is a full-fledged solution for the development and deployment of end-to-end machine learning workflows.&nbsp;</p>


    <a
        href="/blog/the-best-kubeflow-alternatives"
        id="cta-box-related-link-block_951a0a55475c47aeced513670b8069d7"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-the-best-kubeflow-alternatives">The Best Kubeflow Alternatives</h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h4 class="wp-block-heading" id="2-mlflow">MLflow</h4>



<p><a href="https://mlflow.org/" target="_blank" rel="noreferrer noopener nofollow">MLflow</a> is an open-source platform for machine learning engineers to manage the machine learning lifecycle through experimentation, deployment, and testing. MLflow comes in handy when you want to track the performance of your machine learning models. It’s like a dashboard, one place where you can:&nbsp;</p>



<ul class="wp-block-list">
<li>monitor machine learning pipelines,&nbsp;</li>



<li>store model metadata, and&nbsp;</li>



<li>pick the best-performing model.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?ssl=1" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1023" height="504" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=1023%2C504&#038;ssl=1" alt="A list of experiment runs with metrics you can use to compare the models in MLFlow, MLOps open source platform" class="wp-image-29326" style="width:810px;height:399px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?w=1023&amp;ssl=1 1023w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=768%2C378&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=200%2C99&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=220%2C108&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=120%2C59&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=160%2C79&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=300%2C148&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=480%2C236&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/07/image31.png?resize=1020%2C503&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">UI sample of MLFlow,<strong> open source MLOps platform</strong><em> | <a href="https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#training-the-model" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>MLflow provides four components: Tracking, Projects, Models, and the Model Registry. Here is a closer look at the first three:</p>



<ul class="wp-block-list">
<li><strong>Tracking</strong>&nbsp;</li>
</ul>



<p>The MLflow Tracking component provides an API and UI for logging parameters, code versions, metrics, and output files when running your code, and for visualizing the results afterward. You can log and query experiments using the Python, REST, R, and Java APIs.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Project</strong>&nbsp;</li>
</ul>



<p>MLflow Project is a tool for machine learning teams to package data science code in a reusable and reproducible way. It comes with an API and command-line tools to connect projects into workflows. It helps you run projects on any platform.&nbsp;</p>
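<p>A project is described by an <code>MLproject</code> file at the repository root. A minimal sketch (the entry-point command, parameters, and file names are illustrative):</p>

```yaml
name: churn-model

python_env: python_env.yaml

entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
      data_path: {type: string, default: "data/train.csv"}
    command: "python train.py --lr {learning_rate} --data {data_path}"
```

<p>With this in place, <code>mlflow run . -P learning_rate=0.05</code> executes the project reproducibly on any machine with MLflow installed.</p>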



<ul class="wp-block-list">
<li><strong>Model</strong></li>
</ul>



<p>MLflow Model makes it easy to package machine learning models to be used by various downstream tools, like Apache Spark. With this, deploying machine learning models in diverse serving environments is much more manageable.&nbsp;</p>



<p>Overall, users love MLflow because it’s easy to use locally without a dedicated server and has a fantastic UI where you can explore your experiments.&nbsp;</p>



<section
	id="i-box-block_a2f62b2f2c5b685f7aafef5d905f3b80"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Might be useful</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<div
    id="custom-text-block_29f4d69652f5dd9cf60b3ead84c00976"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--x-small"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 600;"
    >
    
    Unlike manual, homegrown, or open-source solutions, neptune.ai is a scalable full-fledged component with user access management, developer-friendly UX, and advanced collaboration features. 
    </div>



<div
    id="custom-text-block_ea94c55141c47ba4d67d01adfc6ad49a"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--0"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 400;"
    >
    
    That&#8217;s especially valuable for ML/AI teams. Here&#8217;s an example of how Neptune helped Waabi optimize their experiment tracking workflow.
    </div>



<div id="group-of-boxes-block_e1e1a96c89f73139d6a6251b5181fb78" class="b-group-of-boxes  l-padding__top--large l-padding__bottom--large">

<div
    class="c-wrapper c-wrapper--align-auto c-wrapper--align-vertical-auto" >
    <div class="b-group-of-boxes__grid l-grid--cols-2  l-grid--boxes">
        

	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-center c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<blockquote
	id="quote-small-block_2800db17d3394ebab2d6d6317967de89"
	class="block-quote-small ">

	<img
		src="https://neptune.ai/wp-content/themes/neptune/img/icon-quote-small.svg"
		alt=""
		width="24"
		height="18"
		class="c-item__icon">

	
		<div class="c-item__content">

			The product has been very helpful for our experimentation workflows. Almost all the projects in our company are now using Neptune for experiment tracking, and it seems to satisfy all our current needs. It’s also great that all these experiments are available to view for everyone in the organization, making it very easy to reference experimental runs and share results.
							<cite class="c-item__cite">
					<p>James Tu, Research Scientist at Waabi</p>
				</cite>
			
		</div>

	
</blockquote>


	</div>



	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-flex-start c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<div id="app-screenshot-block_ac27a546c2190139651bf5043a85ff41"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=480%2C271&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=768%2C434&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="577"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							target="_blank" rel="nofollow noopener noreferrer"							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

			
</div>


	</div>


    </div>
</div>


</div>



<ul
    id="arrow-list-block_f1a2711030905652b3f96609e46525f0"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Full <a href="/customers/waabi" target="_blank" rel="noreferrer noopener">case study with Waabi</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Dive into<a rel="noreferrer noopener" href="https://docs.neptune.ai/" target="_blank"> documentation</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a rel="noreferrer noopener" href="/contact-us" target="_blank">Get in touch</a>&nbsp;if you’d like to go through a custom demo with your team</p>


</li>


</ul>


	</div>

</section>



<h4 class="wp-block-heading" id="3-metaflow">Metaflow</h4>



<p>Netflix created <a href="https://metaflow.org/" target="_blank" rel="noreferrer noopener nofollow">Metaflow</a> as an open-source MLOps platform for building and managing large-scale, enterprise-level data science projects. Data scientists can use this platform for end-to-end development and <a href="/blog/best-ml-model-deployment-tools" target="_blank" rel="noreferrer noopener">deployment of their machine-learning models</a>.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Great library support</strong></li>
</ul>



<p>Metaflow supports all popular data science tools, like TensorFlow and scikit-learn, so you can keep using your favorite tool. Metaflow supports Python and R, making it even more flexible in terms of library and package choice.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Powerful version control toolkit&nbsp;</strong></li>
</ul>



<p>What is excellent about Metaflow is that it versions and keeps track of all your machine learning experiments automatically. You won’t lose anything important, and you can even inspect the results of all the experiments in notebooks.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1564" height="712" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=1564%2C712&#038;ssl=1" alt="Tracking metrics of each run within the project in Metaflow, MLOps open source platform" class="wp-image-7031" style="width:810px;height:369px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?w=1564&amp;ssl=1 1564w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=200%2C91&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=768%2C350&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=1536%2C699&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=220%2C100&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=120%2C55&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=160%2C73&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=300%2C137&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=480%2C219&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Kedro-vs-Metaflow-vs-ZenML2.png?resize=1020%2C464&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Metaflow, <strong>open source MLOps platform</strong> | <em><a href="https://demo.public.outerbounds.xyz/?timerange_start=1444518000000" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>As mentioned above, Metaflow was specifically created for large-scale machine learning development. The solution is powered by the AWS cloud, so there are built-in integrations with AWS storage, compute, and machine learning services if you need to scale. You don’t have to rewrite or change your code to use any of them.&nbsp;</p>



<h4 class="wp-block-heading">Flyte</h4>



<p>If you’re looking for a platform that will take care of <a href="/blog/ml-experiment-tracking">experiment tracking</a> and maintenance for your machine learning project, have a look at <a href="https://github.com/flyteorg/flyte" target="_blank" rel="noreferrer noopener nofollow">Flyte</a>. It is an open-source orchestrator designed to simplify the creation of robust data and machine learning pipelines for production. Its architecture prioritizes scalability and reproducibility, harnessing the power of Kubernetes as its foundational framework. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="707" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=1600%2C707&#038;ssl=1" alt="UI sample of Flyte, MLOps open source platform" class="wp-image-29351" style="width:810px;height:358px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=768%2C339&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=200%2C88&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=1536%2C679&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=220%2C97&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=120%2C53&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=160%2C71&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=300%2C133&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=480%2C212&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Flyte.png?resize=1020%2C451&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Flyte, <strong>open source MLOps platform</strong> | <a href="https://flyte.org/features" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Flyte offers a ton of features and use cases, from simple machine learning projects to complex LLM projects. To give you an overview, I have distilled some of these features and listed them below, but you can check out their <a href="https://flyte.org/" target="_blank" rel="noreferrer noopener nofollow">website</a> and <a href="https://docs.flyte.org/en/latest/" target="_blank" rel="noreferrer noopener nofollow">documentation</a> for the full picture.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Large-scale project support</strong></li>
</ul>



<p>Flyte has helped companies like Lyft execute the large-scale computing that’s crucial to their business. It’s no secret that scaling and monitoring all pipeline changes can be pretty challenging, especially if the workflows have complex data dependencies. Flyte successfully deals with tasks of higher complexity, so developers can focus on business logic rather than machines.</p>



<ul class="wp-block-list">
<li><strong>Improved reproducibility</strong></li>
</ul>



<p>This tool can also help you be sure of the reproducibility of the machine learning models you build. Flyte tracks changes, does version control, and containerizes the model alongside its dependencies.</p>



<ul class="wp-block-list">
<li><strong>Multi-language support</strong></li>
</ul>



<p>Flyte was created to support complex ML projects in Python, Java, or Scala.</p>



<p>Lyft tested Flyte internally before releasing it to the public. It has a proven record of managing more than 7,000 unique workflows totaling 100,000 executions every month.&nbsp;</p>



<h4 class="wp-block-heading" id="4-mlreef">MLReef</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLReef-MLOps-tool.png?ssl=1" alt="UI sample of MLReef, MLOps open source platform" class="wp-image-52731" style="width:842px;height:478px"/><figcaption class="wp-element-caption"><em>UI sample of MLReef, open source MLOps platform | <a href="https://about.mlreef.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><a href="https://about.mlreef.com/" target="_blank" rel="noreferrer noopener nofollow">MLReef</a> is an MLOps platform for teams to collaborate and share the results of their machine learning experiments. Projects are built on reusable machine learning modules created either by you or by the community. This boosts development speed and makes the workflow more efficient by letting team members work in parallel.&nbsp;</p>



<p>MLReef provides tools in four directions:</p>



<ul class="wp-block-list">
<li><strong>Data management</strong></li>
</ul>



<p>You have a fully-versioned data hosting and processing infrastructure for setting up and managing your machine learning models.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Script repositories</strong></li>
</ul>



<p>Every developer has access to containerized and versioned script repositories that you can use in your machine learning pipelines.</p>



<ul class="wp-block-list">
<li><strong>Experiment management&nbsp;</strong></li>
</ul>



<p>You can use MLReef for experiment tracking across different iterations of your project.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>MLOps</strong></li>
</ul>



<p>This solution helps you optimize pipeline management and orchestration, automating routine tasks.</p>



<p>Moreover, MLReef accommodates projects of any size: newcomers can use it for small-scale projects, while experienced developers can rely on it for small, medium-sized, and enterprise projects.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Newcomer</strong></li>
</ul>



<p>If you don’t have much experience developing machine learning models, you’ll find a user-friendly interface and community support for whatever problem you may face.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Experienced</strong></li>
</ul>



<p>MLReef lets you build your project on Git while taking care of all the DevOps mess for you. You can easily monitor progress and outcomes in an automated environment.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Enterprise</strong></li>
</ul>



<p>MLReef for enterprise is easy to scale and control on the cloud or on-premises.</p>



<p>All in all, MLReef is a convenient framework for your machine learning project. With just a couple of easy setups, you’ll be able to develop, test, and optimize your machine learning solution brick-by-brick.&nbsp;</p>



<h4 class="wp-block-heading" id="5-kedro">Seldon Core</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="989" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1920%2C989&#038;ssl=1" alt="Introduction to Seldon Core, open source MLOps platform" class="wp-image-29105" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1920%2C989&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=768%2C396&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=200%2C103&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1536%2C791&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=220%2C113&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=120%2C62&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=160%2C82&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=300%2C155&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=480%2C247&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?resize=1020%2C526&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-2.jpg?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" 
/><figcaption class="wp-element-caption">Introduction to Seldon Core, open source MLOps platform | <a href="https://github.com/SeldonIO/seldon-core" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://github.com/SeldonIO/seldon-core" target="_blank" rel="noreferrer noopener nofollow">Seldon Core</a> is one of the<a href="/blog/best-ml-model-deployment-tools" target="_blank" rel="noreferrer noopener"> platforms for machine learning model deployment</a> on Kubernetes. This platform helps developers serve models in a robust Kubernetes environment, with features like custom resource definitions to manage model graphs. You can also integrate your continuous integration and deployment tools with the platform.</p>



<ul class="wp-block-list">
<li><strong>Build scalable models</strong></li>
</ul>



<p>Seldon Core can convert models built with TensorFlow, PyTorch, H2O, and other frameworks into a scalable microservice architecture based on REST/gRPC.&nbsp;</p>
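<p>Deploying such a model is a matter of applying a <code>SeldonDeployment</code> custom resource to the cluster. A minimal sketch for a scikit-learn model (the resource name and model URI are placeholders):</p>

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model          # placeholder name
spec:
  predictors:
    - name: default
      replicas: 2
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/sklearn/iris   # placeholder path
```

<p>Applying this with <code>kubectl apply -f</code> has Seldon Core wrap the stored model in a pre-packaged server and expose it as a scalable REST/gRPC microservice.</p>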



<ul class="wp-block-list">
<li><strong>Monitor model performance</strong></li>
</ul>



<p>It handles scaling for you and gives you advanced out-of-the-box solutions for measuring model performance, detecting outliers, and conducting A/B tests.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Robust and reliable</strong></li>
</ul>



<p>Seldon Core offers the robustness and reliability of a system backed by continuous maintenance and security policy updates.&nbsp;</p>



<p>Optimized servers provided by Seldon Core allow you to build large-scale deep-learning systems without having to containerize them or worry about their security.</p>



<h4 class="wp-block-heading">Sematic</h4>



<p><a href="https://www.sematic.dev/" target="_blank" rel="noreferrer noopener nofollow">Sematic</a> stands as an open-source machine learning development platform. It grants ML Engineers and Data Scientists the ability to craft intricate end-to-end machine learning pipelines using straightforward Python code, which can then be executed on diverse platforms: their local machine, a cloud VM, or a Kubernetes cluster, harnessing the potential of cloud-based resources.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1230" height="396" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=1230%2C396&#038;ssl=1" alt="Introduction to Sematic, open source MLOps platform | Source" class="wp-image-29109" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?w=1230&amp;ssl=1 1230w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=768%2C247&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=200%2C64&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=220%2C71&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=120%2C39&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=160%2C52&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=300%2C97&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=480%2C155&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/image3the-best-open-source-mlops-tools-you-should-know-3.png?resize=1020%2C328&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Sematic, open source MLOps platform | <a href="https://github.com/sematic-ai/sematic" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>This open source platform draws upon insights amassed from leading self-driving car enterprises. It facilitates the seamless linking of data processing tasks (such as those powered by Apache Spark) with model training endeavors (like PyTorch or TensorFlow), or even arbitrary Python-based business logic. This amalgamation results in the creation of type-safe, traceable, and reproducible end-to-end pipelines. These pipelines, complete with comprehensive monitoring and visualization, are effortlessly managed through a contemporary web dashboard.</p>



<p>Here are some of the features that Sematic offers:</p>



<ul class="wp-block-list">
<li><strong>Smooth Onboarding</strong>&nbsp;</li>
</ul>



<p>Embarking on your journey with Sematic is a breeze – no initial deployment or infrastructure requirements. Simply install Sematic locally and plunge into exploration.</p>



<ul class="wp-block-list">
<li><strong>Parity from Local to Cloud</strong></li>
</ul>



<p>The same code that runs on your personal laptop can be seamlessly executed on your Kubernetes cluster, ensuring consistent outcomes.</p>



<ul class="wp-block-list">
<li><strong>End-to-End Transparency</strong></li>
</ul>



<p>Every artifact of your pipeline is meticulously stored, tracked, and presented within a web dashboard, enabling comprehensive oversight.</p>



<ul class="wp-block-list">
<li><strong>Harnessing Diverse Computing Resources</strong></li>
</ul>



<p>Tailor the resources allocated to each step of your pipeline, optimizing performance and cloud footprint through a range of options including CPUs, memory, GPUs, and Spark clusters.</p>



<ul class="wp-block-list">
<li><strong>Reproducibility at the core</strong></li>
</ul>



<p>Rerun your pipelines with confidence from the intuitive UI, securing the assurance of reproducible results each time.</p>



<p>Sematic introduces an exceptional level of clarity to your machine learning pipelines, affording you an encompassing view of crucial aspects such as artifacts, logs, errors, source control, and dependency graphs. This robust insight is seamlessly coupled with an SDK and GUI that remain both straightforward and instinctive.</p>



<p>Sematic strikes an adept balance by offering a precisely calibrated level of abstraction. This equilibrium empowers ML engineers to concentrate on refining their business logic, all the while harnessing the power of cloud resources – all without the necessity of wielding intricate infrastructure expertise.</p>



<h3 class="wp-block-heading" id="h-data-processing-mlops-open-source-platform">Data-processing MLOps open source platform</h3>



<p>Data-processing platforms are typically used to build a robust data pipeline for any given application. These platforms can scale, optimize, batch, and distribute data streams, among other things.&nbsp;</p>



<h4 class="wp-block-heading">Apache Airflow</h4>



<p><a href="https://airflow.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Apache Airflow</a> emerges as an open-source platform tailored for the development, scheduling, and vigilant monitoring of batch-centric workflows. Airflow&#8217;s expansive Python foundation empowers you to forge intricate workflows, seamlessly bridging connections with a diverse spectrum of technologies. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1343" height="874" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=1343%2C874&#038;ssl=1" alt="UI sample of Apache Airflow, data processing MLOps open source platform" class="wp-image-29111" style="width:810px;height:527px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?w=1343&amp;ssl=1 1343w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=768%2C500&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=200%2C130&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=220%2C143&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=120%2C78&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=160%2C104&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=300%2C195&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=480%2C312&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-4.png?resize=1020%2C664&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Apache Airflow, data processing MLOps <strong>open source platform</strong> | <a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html" target="_blank" rel="noreferrer 
noopener nofollow">Source</a></figcaption></figure>
</div>


<p>A user-friendly web interface takes charge of workflow management, meticulously overseeing their state. From deploying as a singular process on your personal laptop to configuring a distributed setup capable of supporting the most intricate workflows, Airflow accommodates a plethora of deployment options.</p>

<p>A distinctive hallmark of Airflow workflows is their anchoring within Python code. This &#8220;workflows as code&#8221; paradigm serves a multifaceted role:</p>



<ul class="wp-block-list">
<li><strong>Dynamic Prowess</strong></li>
</ul>



<p>Airflow pipelines are molded through Python code, instilling the capability for dynamic pipeline generation.</p>



<ul class="wp-block-list">
<li><strong>Inherent Extensibility</strong></li>
</ul>



<p>The Airflow framework houses a range of operators that seamlessly interface with a multitude of technologies. Every component of Airflow retains an intrinsic extensibility, seamlessly adapting to your unique environment.</p>



<ul class="wp-block-list">
<li><strong>Supreme Flexibility</strong></li>
</ul>



<p>The fabric of workflow parameterization is woven into the system, harnessing the prowess of the Jinja templating engine for streamlined customization.</p>



<p>Apache Airflow is a versatile addition to any machine learning stack, offering dynamic workflow orchestration that adapts to changing data and requirements. With its flexibility, extensive connectivity, and scalability, Airflow allows machine learning practitioners to build custom workflows as code while integrating various technologies. Its monitoring capabilities, community support, and compatibility with cloud resources enhance ML reproducibility, collaboration, and efficient resource utilization in machine learning operations.</p>



<h3 class="wp-block-heading" id="h-monitoring-mlops-open-source-platform">Monitoring MLOps open source platform</h3>



<h4 class="wp-block-heading">EvidentlyAI</h4>



<p>EvidentlyAI is an open-source observability platform that allows you to evaluate, test, and monitor machine learning models. The platform covers the phases from validation to production. It offers services for tabular data, embeddings, and text-based models and data, and it has also extended its services to cater to large language models (LLMs).&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1080" height="674" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=1080%2C674&#038;ssl=1" alt="UI sample of EvidentlyAI, monitoring MLOps open source platform" class="wp-image-29113" style="width:810px;height:506px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?w=1080&amp;ssl=1 1080w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=768%2C479&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=200%2C125&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=220%2C137&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=120%2C75&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=160%2C100&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=300%2C187&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=480%2C300&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-5.png?resize=1020%2C637&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of EvidentlyAI, monitoring MLOps <strong>open source platform</strong> | <a href="https://www.evidentlyai.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>These are some of the products that EvidentlyAI offers:</p>



<div id="case-study-numbered-list-block_d008e7fe231603fae7c84e9709660723"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data and model visualization dashboard            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Data and ML monitoring             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Data quality and integrity check             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Data drift monitoring             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                ML model performance monitoring            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                NLP and LLM monitoring.             </li>
            </ul>
</div>



<p>With these products, EvidentlyAI offers the following features:</p>



<ul class="wp-block-list">
<li><strong>Build reports</strong></li>
</ul>



<p>The plug-and-play capabilities allow users to easily build reports for dataset and model performance. The reports are interactive and easy to share.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Test your pipelines</strong></li>
</ul>



<p>EvidentlyAI test suites allow you to create test pipelines for your machine learning models and data, for example, to check whether any drift is detected.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Monitoring</strong></li>
</ul>



<p>With dashboard capabilities and a wide range of testing methods, Evidently makes monitoring and debugging machine learning models simple and interactive.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Data quality</strong></li>
</ul>



<p>With EvidentlyAI, you can run various exploratory analyses to verify the quality and integrity of your data, often spotting issues with a single line of code.&nbsp;</p>



<p>EvidentlyAI is an easy-to-use platform with a strong feature set for testing and monitoring, and it keeps up with current trends by extending its services to LLMs.&nbsp;</p>



<h2 class="wp-block-heading" id="h-mlops-open-source-frameworks">MLOps open source frameworks</h2>



<p>Now that we have covered open-source platforms, let us dive into the frameworks.&nbsp;</p>



<h3 class="wp-block-heading" id="h-workflow-open-source-mlops-frameworks">Workflow open source MLOps frameworks</h3>



<p>Workflow frameworks provide a structured approach to streamlining the different phases of your MLOps applications. Keep in mind that some frameworks cover only one or two phases, while others span multiple phases.&nbsp;</p>



<h4 class="wp-block-heading">Kedro&nbsp;</h4>



<p><a href="https://github.com/quantumblacklabs/kedro">Kedro</a> is a Python framework for machine learning engineers and data scientists to create reproducible and maintainable code.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1578" height="904" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=1578%2C904&#038;ssl=1" alt="UI sample of Kedro, MLOps open source framework" class="wp-image-29387" style="width:810px;height:548px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?w=1578&amp;ssl=1 1578w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=768%2C440&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=200%2C115&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=1536%2C880&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=220%2C126&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=120%2C69&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=160%2C92&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=300%2C172&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=480%2C275&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Kedro.png?resize=1020%2C584&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><br>UI sample of Kedro, MLOps <strong>open source </strong>framework | <a href="https://kedro.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>This framework is your best friend if you want to organize your <strong>data pipeline</strong> and make machine learning project development much more efficient. You won’t have to waste time on code rewrites and will have more opportunities for focusing on robust pipelines. Moreover, Kedro helps teams establish collaboration standards to limit delays and build scalable, deployable projects.</p>



<p>Kedro has many good features:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Project templates</strong></li>
</ul>



<p>Usually, you have to spend a lot of time understanding how to set up your analytics project. Kedro provides a standard template that will save you time.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Data management</strong></li>
</ul>



<p>Kedro helps you load and store data so you no longer have to worry about the reproducibility and scalability of your code.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Configuration management</strong></li>
</ul>



<p>This is a necessary tool when you’re working with complex software systems. If you don’t pay enough attention to configuration management, you might encounter serious reliability and scalability problems.&nbsp;&nbsp;</p>



<p>Kedro promotes a data-driven approach to ML development and maintains industry-level standards while decreasing operational risks for businesses.&nbsp;</p>


    <a
        href="/blog/data-science-pipelines-with-kedro"
        id="cta-box-related-link-block_0678a8f63876d138f2c3439906cb82ab"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-building-and-managing-data-science-pipelines-with-kedro">Building and Managing Data Science Pipelines with Kedro</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h4 class="wp-block-heading">ZenML</h4>



<p><a href="https://docs.zenml.io/" target="_blank" rel="noreferrer noopener nofollow">ZenML</a> is an MLOps framework for orchestrating your machine learning <strong>experiment pipeline</strong>. It provides you with tools to:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="832" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1920%2C832&#038;ssl=1" alt="Introduction to ZenML, MLOps open source framework" class="wp-image-29120" style="width:810px;height:351px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1920%2C832&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=768%2C333&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=200%2C87&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1536%2C665&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=220%2C95&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=120%2C52&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=160%2C69&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=300%2C130&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=480%2C208&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?resize=1020%2C442&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-8.png?w=1999&amp;ssl=1 1999w" sizes="auto, 
(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to ZenML, MLOps <strong>open source </strong>framework | <a href="https://www.zenml.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Preprocess data</strong></li>
</ul>



<p>ZenML helps you convert raw data into analysis-ready data.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Train your models</strong></li>
</ul>



<p>Among other tools for convenient model training, the platform uses declarative pipeline configs, so you can switch between on-premise and cloud environments easily.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Conduct split testing&nbsp;</strong></li>
</ul>



<p>ZenML creators claim that the platform’s key benefits are automated experiment tracking and guaranteed comparability between experiments.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Evaluate the results</strong></li>
</ul>



<p>ZenML focuses on making ML development reproducible and straightforward for both individual developers and large teams.&nbsp;</p>



<p>This framework frees you from all the troubles of delivering machine learning models with traditional tools. If you struggle with providing enough experiment data that prove the reproducibility of results, want to reduce waste and make the reuse of code simpler, ZenML will help.&nbsp;</p>



<h3 class="wp-block-heading" id="h-deployment-and-serving-open-source-mlops-framework">Deployment and serving open source MLOps framework</h3>



<h4 class="wp-block-heading">BentoML</h4>



<p><a href="https://www.bentoml.com/" target="_blank" rel="noreferrer noopener nofollow">BentoML</a> is a framework that allows you to build, deploy and scale any machine learning application. BentoML provides a way to bundle your trained models, along with any preprocessing, post-processing, and custom code, into a containerized format.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="796" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868-1920x796.png?resize=1920%2C796&#038;ssl=1" alt="UI sample of BentoML, MLOps open source framework" class="wp-image-29451" style="width:820px;height:340px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=1920%2C796&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=768%2C319&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=200%2C83&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=1536%2C637&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=2048%2C849&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=220%2C91&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=120%2C50&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=160%2C66&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=300%2C124&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=480%2C199&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/bentoml-e1693834167868.png?resize=1020%2C423&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of BentoML, MLOps <strong>open source </strong>framework | <a href="https://docs.bentoml.org/en/0.13-lts/concepts.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Some of the key features of BentoML include:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Model Serving</strong></li>
</ul>



<p>BentoML allows you to easily serve your machine learning models with a REST API. It abstracts away the complexities of serving machine learning models and managing infrastructure.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Model Packaging</strong></li>
</ul>



<p>You can package your trained models, along with dependencies and custom code, into a single deployable artifact. This makes it simple to reproduce your model deployments.</p>



<ul class="wp-block-list">
<li><strong>Multi-Framework Support</strong>&nbsp;</li>
</ul>



<p>BentoML supports a variety of machine learning frameworks, such as TensorFlow, PyTorch, Scikit-learn, XGBoost, and more.</p>



<ul class="wp-block-list">
<li><strong>Deployment Flexibility</strong></li>
</ul>



<p>You can deploy BentoML models in various environments, including local servers, cloud platforms, and Kubernetes clusters.</p>



<ul class="wp-block-list">
<li><strong>Scalability</strong></li>
</ul>



<p>BentoML supports high-throughput serving, making it suitable for machine learning applications that require efficient and scalable model deployments.</p>



<ul class="wp-block-list">
<li><strong>Versioning</strong></li>
</ul>



<p>BentoML allows you to version your model artifacts and easily switch between different versions for serving.</p>



<ul class="wp-block-list">
<li><strong>Monitoring and Logging</strong></li>
</ul>



<p>BentoML provides features for monitoring the health and performance of your deployed models, including logging and metrics.</p>



<ul class="wp-block-list">
<li><strong>Customization</strong></li>
</ul>



<p>You can customize the deployment environment, preprocessing, post-processing, and other aspects of your deployed model.</p>



<p>BentoML can be an important resource in your ML arsenal as it can essentially offer so much more with ease and reliability. With BentoML you can deploy your machine learning models as REST APIs, Docker containers, or even serverless functions.&nbsp;</p>



<h3 class="wp-block-heading" id="h-workflow-orchestration-open-source-mlops-framework">Workflow orchestration open source MLOps framework</h3>



<h4 class="wp-block-heading">Argo Workflow</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1276" height="751" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=1276%2C751&#038;ssl=1" alt="UI sample of Argo Workflow, MLOps open source framework " class="wp-image-29122" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?w=1276&amp;ssl=1 1276w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=768%2C452&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=200%2C118&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=220%2C129&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=120%2C71&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=160%2C94&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=300%2C177&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=480%2C283&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-10.png?resize=1020%2C600&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Argo Workflow, MLOps <strong>open source </strong>framework | <a href="https://argoproj.github.io/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://argoproj.github.io/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">Argo Workflow</a> is a lightweight, easy-to-use orchestration tool for Kubernetes. Workflows are defined in YAML and implemented as a Kubernetes CRD (Custom Resource Definition). It is open source and trusted by a large community.&nbsp;</p>



<p>Argo Workflow integrates with a wide range of ecosystem projects, some of which are:</p>



<div id="case-study-numbered-list-block_328243cc7eb4a653b9d5b064736457fa"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Kedro            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Kubeflow Pipelines            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Seldon            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                SQLFlow            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Argo Events            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Couler            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Hera            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Katib            </li>
            </ul>
</div>



<p>Argo Workflow also supports Python-based environments. Although Argo offers quite a number of features, I have listed a few that stand out:</p>



<ul class="wp-block-list">
<li><strong>UI</strong></li>
</ul>



<p>Argo provides a user interface that allows users to manage their workflows with ease.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Artifact Support</strong></li>
</ul>



<p>You can integrate storage platforms such as S3 or Azure Blob Storage to store your workflow artifacts.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Scheduling</strong></li>
</ul>



<p>You can schedule your whole ML workflow using cron. This allows you to schedule jobs and tasks to run automatically at specific times, on specific days, or at regular intervals.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Kubernetes</strong>&nbsp;</li>
</ul>



<p>If you already work with Kubernetes clusters, Argo is the go-to choice. One key feature is that Argo runs each step as a container.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Efficiency</strong></li>
</ul>



<p>It handles compute-intensive data processing and ML jobs efficiently and reliably.&nbsp;</p>



<p>You can find the full documentation <a href="https://argoproj.github.io/argo-workflows/" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</p>
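<p>For illustration, a minimal CronWorkflow manifest (the names and schedule below are hypothetical) shows both the cron-based scheduling and the fact that each step runs as its own container:</p>

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-retrain        # illustrative name
spec:
  schedule: "0 2 * * *"        # run every night at 02:00
  workflowSpec:
    entrypoint: retrain
    templates:
      - name: retrain
        container:              # each step is a container
          image: python:3.11
          command: [python, -c]
          args: ["print('retraining model')"]
```

<p>Submitting this manifest with <code>kubectl apply</code> or the <code>argo</code> CLI registers the schedule with the cluster.</p>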



<h2 class="wp-block-heading" id="h-mlops-open-source-tools">MLOps open source tools</h2>



<p>Open-source tools and libraries each address one specific aspect of your machine learning application. You can pick any of these tools, use it in your own application, and fit it into your framework of choice. One key advantage is that these tools are compatible with most working environments.</p>



<p>In this list, we will cover some of the major areas in ML lifecycle where open-source tools will get the job done for you.&nbsp;</p>



<h3 class="wp-block-heading" id="h-development-and-deployment-open-source-ml-tools">Development and deployment open source ML tools</h3>



<h4 class="wp-block-heading">MLRun</h4>



<p><a href="https://www.iguazio.com/open-source/mlrun/" target="_blank" rel="noreferrer noopener nofollow">MLRun</a> is a tool for machine learning model development and deployment. If you’re looking for a tool that conveniently runs in a <strong>wide variety of environments</strong> and <strong>supports multiple technology</strong> stacks, it’s definitely worth a try. MLRun offers a comprehensive approach to managing data pipelines.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1903" height="965" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=1903%2C965&#038;ssl=1" alt="UI sample of MLRun, development and deployment open source ML tool" class="wp-image-29125" style="width:811px;height:411px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?w=1903&amp;ssl=1 1903w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=768%2C389&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=200%2C101&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=1536%2C779&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=220%2C112&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=120%2C61&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=160%2C81&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=300%2C152&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=480%2C243&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-11.png?resize=1020%2C517&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of MLRun, development and deployment 
open source ML tool | <a href="https://github.com/mlrun/mlrun" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>MLRun has a layered architecture that offers the following powerful functionality:&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Feature and artifact store</strong></li>
</ul>



<p>This layer helps you to handle the preparation and processing of data and store it across different repositories.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Elastic serverless runtimes layer</strong></li>
</ul>



<p>Convert simple code into microservices that are easy to scale and maintain. It’s compatible with standard runtime engines like Kubernetes jobs, Dask, and Apache Spark.</p>



<ul class="wp-block-list">
<li><strong>Automation layer</strong></li>
</ul>



<p>So that you can concentrate on training the model and fine-tuning hyperparameters, the pipeline automation layer takes care of data preparation, testing, and real-time deployment. You only need to provide supervision to create a state-of-the-art ML solution.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Central management layer</strong></li>
</ul>



<p>Here, you get access to a unified dashboard to manage your whole workflow. MLRun has a convenient user interface, a CLI, and an SDK that you can access anywhere.</p>



<p>With MLRun, you can write code once and then use automated solutions to run it on different platforms. The tool manages the build process, execution, data movement, scaling, versioning, parameterization, output tracking, and more.&nbsp;</p>



<h4 class="wp-block-heading">CML (Continuous Machine Learning)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="740" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1920%2C740&#038;ssl=1" alt="Introduction to CML, development and deployment open source ML tool" class="wp-image-29474" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1920%2C740&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=768%2C296&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=200%2C77&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1536%2C592&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=2048%2C789&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=220%2C85&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=120%2C46&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=160%2C62&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=300%2C116&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=480%2C185&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/CML.webp?resize=1020%2C393&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to CML, development and deployment open source ML tool  | <a href="https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://cml.dev/" target="_blank" rel="noreferrer noopener nofollow">CML</a> (Continuous Machine Learning) is a library for continuous integration and delivery (CI/CD) of machine learning projects. The library was developed by the creators of DVC, an open-source library for versioning machine learning models and experiments. Together with DVC, TensorBoard, and cloud services, CML facilitates the process of developing machine learning models and shipping them into products.</p>



<ul class="wp-block-list">
<li><strong>Automate pipeline building</strong></li>
</ul>



<p>CML was designed to automate parts of a machine learning engineer&#8217;s work: running training experiments, evaluating models, and managing datasets and their updates.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Integrate APIs&nbsp;</strong></li>
</ul>



<p>The tool is positioned as a library that supports GitFlow for data science projects, allows automatic generation of reports, and hides the complex details of using external services. Examples of external services include cloud platforms: AWS, Azure, GCP, and others. For infrastructure tasks, DVC, Docker, and Terraform are also used. Recently, the infrastructural side of machine learning projects has been attracting more attention.&nbsp;</p>



<p>The library is flexible and provides a wide range of functionality, from sending reports and publishing data to distributing cloud resources for a project.</p>



<h3 class="wp-block-heading" id="h-automl-open-source-tools">AutoML open source tools</h3>



<h4 class="wp-block-heading">AutoKeras</h4>



<p><a href="http://autokeras.com/" target="_blank" rel="noreferrer noopener nofollow">AutoKeras</a> is an open-source library for Automated Machine Learning (AutoML). With AutoML frameworks, you can automate the processing of raw data, choose a machine learning model, and optimize the hyperparameters of the learning algorithm.</p>



<ul class="wp-block-list">
<li><strong>Streamline machine learning model development</strong></li>
</ul>



<p>AutoML reduces the bias and variance that creep in when humans develop machine learning models by hand, and it streamlines model development.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Enjoy automated hyperparameter tuning</strong></li>
</ul>



<p>AutoKeras provides functionality to automatically search for suitable architectures and hyperparameters of deep learning models.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Build flexible solutions</strong></li>
</ul>



<p>AutoKeras is best known for its flexibility: the code you write executes regardless of the backend. It supports Theano, TensorFlow, and other frameworks.</p>



<p>AutoKeras ships with several built-in training datasets. They come in a form that&#8217;s convenient to work with, but they don&#8217;t show the full power of AutoKeras. It also contains tools for preprocessing text, images, and time series, the most common data types, which makes the data preparation process much more manageable. The tool also has built-in visualization for models.</p>
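<p>The automated tuning AutoKeras performs can be pictured as a loop of &#8220;propose a configuration, evaluate it, keep the best&#8221;. Below is a minimal stdlib-only sketch of that search loop; the <code>evaluate</code> function is a made-up stand-in for actually training a model, and none of this is AutoKeras&#8217;s real API:</p>

```python
import random

# Toy "training" objective: quality peaks around lr=0.01 and 64 units.
# In real AutoML this would be a full train-and-validate run.
def evaluate(config):
    return -((config["lr"] - 0.01) ** 2) * 1e4 - ((config["units"] - 64) / 64) ** 2

def random_search(trials, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = {
            "lr": 10 ** rng.uniform(-4, -1),          # log-uniform learning rate
            "units": rng.choice([16, 32, 64, 128]),   # layer width
        }
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(trials=50)
print(best)
```

<p>AutoKeras uses far smarter search strategies than this, but the contract is the same: you supply data and a budget of trials, and the library returns the best configuration it found.</p>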



<h4 class="wp-block-heading">H2O AutoML</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1013" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1920%2C1013&#038;ssl=1" alt="UI sample of H2O.ai, autoML open source tool" class="wp-image-29483" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1920%2C1013&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=768%2C405&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=200%2C106&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1536%2C810&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=2048%2C1080&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=220%2C116&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=300%2C158&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=480%2C253&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?resize=1020%2C538&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/h2o-automl.png?w=3000&amp;ssl=1 3000w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of H2O.ai, autoML open source tool | <a href="https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://www.h2o.ai/products/h2o-automl/" target="_blank" rel="noreferrer noopener nofollow">H2O.ai</a> is a software platform that optimizes the machine learning process using AutoML. H2O claims the platform can train models faster than popular machine learning libraries such as scikit-learn.&nbsp;</p>



<p>H2O is a machine learning, predictive data analytics platform for building machine learning models and generating production code for them in Java and Python, all at the click of a button.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Implement ML models out-of-the-box</strong></li>
</ul>



<p>It has implementations of supervised and unsupervised algorithms such as GLM and K-Means, and an easy-to-use web interface called Flow.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Tailor H2O to your needs&nbsp;</strong></li>
</ul>



<p>The tool is helpful for beginners and seasoned developers alike. It gives the coder a simple wrapper function that handles modeling-related tasks in a few lines of code. Experienced machine learning engineers appreciate this function, since it lets them focus on more thought-intensive parts of building models, like data exploration and feature engineering.&nbsp;</p>



<p>Overall, H2O is a powerful tool for solving machine learning and data science problems. Even beginners can extract value from data and build robust models. H2O continues to grow and release new products while maintaining high quality across the board.</p>



<h4 class="wp-block-heading">EvalML&nbsp;</h4>



<p><a href="https://evalml.alteryx.com/" target="_blank" rel="noreferrer noopener nofollow">EvalML</a> is a library for building, optimizing, and evaluating machine learning pipelines. EvalML offers <strong>end-to-end supervised</strong> machine learning solutions that leverage <em>Featuretools</em> and <em>Compose</em>. The former is a framework for <strong>automated feature engineering</strong> on relational datasets, and the latter is used to <strong>automate prediction engineering</strong>.&nbsp;</p>



<p>With these automated capabilities, EvalML offers four important functionalities:</p>



<ul class="wp-block-list">
<li><strong>Automation</strong></li>
</ul>



<p>It takes manual work out of the picture, letting you build machine learning models with ease. Automation covers data quality checks, cross-validation, and many other features.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Data Checks</strong></li>
</ul>



<p>As the name suggests, it inspects data integrity and surfaces issues such as duplicates and imbalanced class distributions before you use the data to train a model.&nbsp;</p>
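<p>The kinds of issues such checks surface are easy to illustrate. Here is a stdlib-only sketch of a duplicate and class-imbalance check on rows represented as dicts; the function name and warning format are invented for illustration and are not EvalML&#8217;s API:</p>

```python
from collections import Counter

def data_checks(rows, label_key="label", imbalance_threshold=0.1):
    """Return a list of warnings about duplicate rows and rare classes."""
    warnings = []
    # Count identical rows by hashing their sorted (key, value) pairs.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    dupes = sum(c - 1 for c in seen.values() if c > 1)
    if dupes:
        warnings.append(f"{dupes} duplicate row(s) found")
    # Flag classes whose share of the data falls below the threshold.
    labels = Counter(r[label_key] for r in rows)
    for label, count in labels.items():
        if count / len(rows) < imbalance_threshold:
            warnings.append(f"class {label!r} is rare ({count}/{len(rows)} rows)")
    return warnings

rows = [{"x": 1, "label": "a"}] * 10 + [{"x": 2, "label": "b"}]
print(data_checks(rows))
```

<p>EvalML runs checks like these automatically before training, so problems are caught early rather than after a long AutoML search.</p>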



<ul class="wp-block-list">
<li><strong>End-to-end</strong></li>
</ul>



<p>Offers end-to-end functionality, including data preprocessing, feature engineering, feature selection, and various machine learning modeling techniques.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Model Understanding</strong></li>
</ul>



<p>It helps you to understand and inspect your machine learning model.</p>



<p>To conclude, EvalML is an amazing tool that essentially automates two major phases of the ML lifecycle: data preprocessing and ML modeling. It has an active list of contributors, and the library is updated on a day-to-day basis. You can integrate this lightweight library into your own application with ease, as the documentation is straightforward and easy to understand.&nbsp;</p>



<h4 class="wp-block-heading">Neural Network Intelligence (NNI)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1612" height="802" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=1612%2C802&#038;ssl=1" alt="Introduction to NNI, autoML open source tool " class="wp-image-29133" style="width:806px;height:401px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?w=1612&amp;ssl=1 1612w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=768%2C382&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=200%2C100&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=1536%2C764&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=220%2C109&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=120%2C60&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=160%2C80&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=300%2C149&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=480%2C239&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-15.png?resize=1020%2C507&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to NNI, autoML open source tool | <a 
href="https://github.com/microsoft/nni" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://github.com/microsoft/nni" target="_blank" rel="noreferrer noopener nofollow">NNI</a> or Neural Network Intelligence is a lightweight tool created by Microsoft for automating neural network optimization. This open-source toolkit allows users to automate feature engineering, neural architecture search or NAS, model compression, and hyper-parameter tuning.&nbsp;</p>



<p>NNI offers simple function calls in Python. Like other Python libraries and frameworks, NNI can be dropped into an existing pipeline. All you need is a working PyTorch environment, and you can plug in and automate an optimization technique with a single function call. For instance, if you want to perform:</p>



<ul class="wp-block-list">
<li>Hyperparameter tuning: call <code>nni.get_next_parameter()</code></li>



<li>Model pruning: call one of the pruning methods, such as <code>L1NormPruner(model, config)</code></li>



<li>Model quantization: call a quantization function, such as <code>QAT_Quantizer(model, config)</code></li>



<li>Neural architecture search: pick a strategy and an evaluator, such as <code>RegularizedEvolution()</code> and <code>FunctionalEvaluator()</code>, respectively</li>
</ul>
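<p>A trial script typically asks NNI for the next parameter set and reports a metric back. The search space below follows NNI&#8217;s <code>"_type"</code>/<code>"_value"</code> convention; since <code>nni.get_next_parameter()</code> only returns values inside a running NNI experiment, the sketch simulates the sampling step with the standard library:</p>

```python
import math
import random

# Search space in NNI's {"_type": ..., "_value": ...} convention.
search_space = {
    "lr": {"_type": "loguniform", "_value": [1e-4, 1e-1]},
    "batch_size": {"_type": "choice", "_value": [16, 32, 64]},
}

def sample(space, seed=None):
    """Stand-in for nni.get_next_parameter(): draw one config from the space."""
    rng = random.Random(seed)
    params = {}
    for name, spec in space.items():
        if spec["_type"] == "choice":
            params[name] = rng.choice(spec["_value"])
        elif spec["_type"] == "loguniform":
            lo, hi = spec["_value"]
            params[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return params

params = sample(search_space, seed=0)
print(params)
# In a real trial you would train with `params` and then call
# nni.report_final_result(metric) so the tuner can propose the next config.
```

<p>Because the trial only talks to NNI through these two calls, swapping tuning algorithms is a configuration change rather than a code change.</p>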



<p>There are other features as well, such as one-shot neural architecture search and feature engineering. The idea NNI puts forward is to automate neural network model engineering.&nbsp;</p>



<p>Essentially, NNI eases the model building and engineering phase while letting you manage AutoML experiments. On top of all the above, it provides a dashboard where you can monitor the tuning process and control the experiments. If you spend a lot of time building and fine-tuning models, this tool is a necessity.&nbsp;</p>



<h3 class="wp-block-heading" id="h-data-validation-open-source-ml-tools">Data validation open source ML tools</h3>



<p>Data validation is the process of checking data quality. During this stage, you make sure that there are no inconsistencies or missing data in your sets. Data validation tools automate this routine process and improve the quality of data cleansing.&nbsp;&nbsp;</p>



<h4 class="wp-block-heading">Hadoop&nbsp;</h4>



<p><a href="https://hadoop.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Hadoop</a> is a freely redistributable set of utilities, libraries, and frameworks for developing and executing programs running on clusters. This fundamental technology for storing and processing Big Data is a top-level project of the Apache Software Foundation.</p>



<p>The project consists of 4 main modules:</p>



<ul class="wp-block-list">
<li><strong>Hadoop Common</strong></li>
</ul>



<p>Hadoop Common is a set of infrastructure software libraries and utilities that are used in other solutions and related projects, in particular, for managing distributed files and creating the necessary infrastructure.</p>



<ul class="wp-block-list">
<li><strong>HDFS is a distributed file system</strong></li>
</ul>



<p>Hadoop Distributed File System is a technology for storing files on various data servers with addresses located on a special name server. HDFS provides reliable storage of large files, block-by-block distributed between the nodes of the computing cluster.</p>



<ul class="wp-block-list">
<li><strong>YARN is a task scheduling and cluster management system</strong></li>
</ul>



<p>YARN is a set of system programs that provide sharing, scalability, and reliability of distributed applications.</p>



<ul class="wp-block-list">
<li><strong>Hadoop MapReduce</strong></li>
</ul>



<p>This is a platform for programming and performing distributed MapReduce calculations using many computers that form a cluster.</p>
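<p>The MapReduce model itself is simple to illustrate: a map step emits key&#8211;value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Here is a single-process, stdlib-only word count written in that style; Hadoop, of course, runs the same three phases distributed across a cluster:</p>

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group (here, by summing the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # → {'big': 3, 'data': 2, 'cluster': 1}
```

<p>Because map and reduce are pure functions over independent chunks, the framework can parallelize them freely, which is what makes the paradigm scale to cluster-sized data.</p>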



<p>Today, there’s a whole ecosystem of related projects and technologies around Hadoop used for data mining and machine learning.</p>



<h4 class="wp-block-heading">Apache Spark&nbsp;</h4>



<p><a href="https://spark.apache.org/" target="_blank" rel="noreferrer noopener nofollow">Apache Spark</a> helps you process semi-structured data in memory. The main advantages of Spark are performance and a user-friendly programming interface.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1328" height="496" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=1328%2C496&#038;ssl=1" alt="UI sample of Apache Spark, data validation open source ML tool" class="wp-image-29493" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?w=1328&amp;ssl=1 1328w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=768%2C287&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=200%2C75&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=220%2C82&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=120%2C45&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=160%2C60&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=300%2C112&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=480%2C179&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Apache-Spark.png?resize=1020%2C381&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Apache Spark, data validation open source tool | <a href="https://spark.apache.org/docs/latest/web-ui.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The framework has five components: a core and four libraries, each solving a specific problem.</p>



<ul class="wp-block-list">
<li><strong>Spark Core</strong></li>
</ul>



<p>This is the core of the framework. You can use it for scheduling and core I/O functionality.</p>



<ul class="wp-block-list">
<li><strong>Spark SQL</strong></li>
</ul>



<p>Spark SQL is one of the four framework libraries; it comes in handy when processing structured data. To run faster, it uses DataFrames and can act as a distributed SQL query engine.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Spark Streaming</strong></li>
</ul>



<p>This is an easy-to-use streaming data processing tool. It breaks data streams into micro-batches, and the creators of Spark claim that performance does not suffer much from this.</p>
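<p>Micro-batching means collecting incoming records for a short window and processing each batch as a small, ordinary job. A stdlib-only illustration of the idea follows; note that this is not Spark&#8217;s API, and real Spark Streaming batches by time interval rather than by count as this toy does:</p>

```python
def micro_batches(stream, batch_size):
    """Group a (potentially unbounded) record stream into micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each micro-batch is then processed like a small batch job.
events = range(7)
results = [sum(batch) for batch in micro_batches(events, batch_size=3)]
print(results)  # → [3, 12, 6]
```

<p>The appeal of this design is that the same batch-processing engine and APIs can serve streaming workloads, at the cost of per-batch latency.</p>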



<ul class="wp-block-list">
<li><strong>MLlib</strong></li>
</ul>



<p>This is a high-speed distributed machine learning system. It is nine times faster than its competitor, the Apache Mahout library, when benchmarked on the alternating least squares (ALS) algorithm. MLlib includes popular algorithms for classification, regression, and recommender systems.</p>



<ul class="wp-block-list">
<li><strong>GraphX</strong></li>
</ul>



<p>GraphX is a library for scalable graph processing. GraphX is not suitable for graphs that change in a transactional manner, for example, databases.</p>



<p>Spark is entirely autonomous but also compatible with other standard ML instruments, like Hadoop, if needed.</p>



<h4 class="wp-block-heading">Great Expectations</h4>



<p>For effective management of intricate data pipelines, data practitioners recognize the significance of testing and documentation. GX offers a solution for swift deployment of adaptable, expandable data quality testing within data stacks. Its user-friendly documentation ensures accessibility for both technical and non-technical users.</p>



<p>Great Expectations (GX) assists data teams in fostering a collective comprehension of their data by incorporating quality testing, documentation, and profiling.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="753" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=1600%2C753&#038;ssl=1" alt="UI sample of Great Expectations, data validation open source ML tool" class="wp-image-29498" style="width:810px;height:381px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=768%2C361&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=200%2C94&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=1536%2C723&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=220%2C104&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=120%2C56&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=160%2C75&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=300%2C141&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=480%2C226&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Great-Expectations.png?resize=1020%2C480&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Great Expectations, data validation open source ML tool | <a href="https://docs.greatexpectations.io/docs/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Some of the key features are:</p>



<ul class="wp-block-list">
<li><strong>Seamless Integration</strong></li>
</ul>



<p>GX seamlessly integrates into your current tech stack and can be linked with your CI/CD pipelines, enabling precise data quality enhancement. Validate and connect with your existing data, enabling Expectation Suites to perfectly address your data quality requisites.</p>



<ul class="wp-block-list">
<li><strong>Quick Start</strong></li>
</ul>



<p>GX produces valuable outcomes promptly, even for large datasets. Its Data Assistants offer curated Expectations tailored for various domains, accelerating data discovery for rapid deployment of data quality across pipelines. Auto-generated Data Docs ensure ongoing up-to-date documentation.</p>



<ul class="wp-block-list">
<li><strong>Unified Insight</strong></li>
</ul>



<p>Expectations serve as GX&#8217;s core abstraction, articulating anticipated data states. The Expectation library employs a human-readable vocabulary, catering to technical and non-technical users alike. Bundled into Expectation Suites, they characterize your expectations of the data well.</p>
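<p>An Expectation is essentially a named, declarative assertion about data that returns a structured result instead of raising an error. The sketch below borrows GX&#8217;s human-readable naming style to show the idea in plain Python over rows represented as dicts; it is an illustration of the concept, not the actual <code>great_expectations</code> API, whose Expectations run against Datasources and validators:</p>

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Declarative data check in the style of a GX Expectation."""
    unexpected = [r[column] for r in rows
                  if not (min_value <= r[column] <= max_value)]
    return {
        "success": not unexpected,       # did every value satisfy the rule?
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected, # evidence for the failure, if any
    }

rows = [{"age": 31}, {"age": 27}, {"age": -4}]
result = expect_column_values_to_be_between(rows, "age", 0, 120)
print(result)
```

<p>Because the result is structured rather than a pass/fail exception, it can feed documentation, dashboards, and Checkpoint actions downstream.</p>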



<ul class="wp-block-list">
<li><strong>Security and Transparency</strong></li>
</ul>



<p>GX preserves your data security by processing it within your own systems. Its open-source foundation ensures full transparency, allowing for complete control over insights.</p>



<ul class="wp-block-list">
<li><strong>Data Contracts Support</strong></li>
</ul>



<p>Utilize Checkpoints for transparent, central, and automated testing of Expectations, producing readable Data Docs. Checkpoints can trigger actions based on evaluation results, bolstering data quality.</p>



<ul class="wp-block-list">
<li><strong>Enhanced Collaboration</strong></li>
</ul>



<p>GX&#8217;s Data Docs are inspectable, shareable, and human-readable, fostering mutual understanding of data quality. Publish Data Docs in diverse formats to seamlessly integrate with existing catalogs, dashboards, and reporting tools.</p>



<p>Great Expectations aligns well with your MLOps tools by enhancing data reliability, reducing the risk of poor data quality impacting your machine learning models, and promoting a collaborative approach to data quality management within your team.</p>



<h4 class="wp-block-heading">TensorFlow Extended (TFX)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="763" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1920%2C763&#038;ssl=1" alt="Introduction to TensorFlow Extended (TFX), data validation open source ML tool " class="wp-image-29135" style="width:810px;height:322px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1920%2C763&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=768%2C305&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=200%2C79&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1536%2C610&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=220%2C87&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=120%2C48&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=160%2C64&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=300%2C119&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=480%2C191&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?resize=1020%2C405&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-17.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><br>Introduction to TFX, data validation open source ML tool&nbsp;| <a href="https://www.tensorflow.org/tfx" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://www.tensorflow.org/tfx" target="_blank" rel="noreferrer noopener nofollow">TFX</a>, short for TensorFlow Extended, presents a range of powerful features for effective machine learning operations:</p>



<ul class="wp-block-list">
<li><strong>Scalable ML Pipelines</strong></li>
</ul>



<p>TFX offers a structured sequence of components tailored for scalable and high-performance machine learning tasks, streamlining the development of end-to-end ML pipelines.</p>



<ul class="wp-block-list">
<li><strong>Component Modularity</strong></li>
</ul>



<p>TFX components are built using specialized libraries, providing both a cohesive framework and the flexibility to utilize individual components according to your needs.</p>



<ul class="wp-block-list">
<li><strong>Data Preprocessing</strong></li>
</ul>



<p>TFX includes powerful tools for data preprocessing, transformation, and feature engineering, crucial for preparing data for model training.</p>



<ul class="wp-block-list">
<li><strong>Model Training and Validation&nbsp;</strong></li>
</ul>



<p>It supports model training using TensorFlow and facilitates model validation, ensuring the robustness and reliability of your machine learning models.</p>



<ul class="wp-block-list">
<li><strong>Automated Model Deployment&nbsp;</strong></li>
</ul>



<p>TFX simplifies the process of deploying models to various serving environments, enabling smooth integration with production systems.</p>



<ul class="wp-block-list">
<li><strong>Artifact Tracking</strong></li>
</ul>



<p>TFX keeps track of experiment artifacts, helping you manage the lifecycle of your ML models.</p>



<ul class="wp-block-list">
<li><strong>Custom Component Development</strong></li>
</ul>



<p>It allows for the creation of custom components to meet specific requirements or integrate third-party tools.</p>



<ul class="wp-block-list">
<li><strong>Integration with TensorFlow</strong></li>
</ul>



<p>As an extension of TensorFlow, TFX seamlessly integrates with TensorFlow ecosystem tools and technologies.</p>



<p>TFX is an excellent fit in your MLOps toolkit due to its focus on scalability, performance, and end-to-end ML pipeline management. It streamlines the development and deployment of machine learning workflows, ensuring efficient data preprocessing, model training, validation, and deployment. Its modularity and integration with TensorFlow make it a valuable asset in your quest for efficient and effective machine learning operations.</p>



<h3 class="wp-block-heading" id="h-data-exploration-open-source-ml-tools">Data exploration open source ML tools</h3>



<p>Data exploration software is built for automated data analysis: it streamlines pattern recognition and makes insights easy to visualize. Data exploration is a cognitively intense process, so you need powerful tools that help you track and execute code as you go.</p>



<h4 class="wp-block-heading">Jupyter Notebook</h4>



<p><a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Jupyter Notebook</a> is a development environment where you immediately see the result of executing code and its fragments. The difference from a traditional IDE is that code can be broken into chunks and executed in any order. You can load a file into memory, inspect its contents, and process them in separate steps.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-medium is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="768" height="544" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=768%2C544&#038;ssl=1" alt="UI sample of Jupyter Notebook, data exploration open source ML tool" class="wp-image-29136" style="width:785px;height:557px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=768%2C544&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=1920%2C1361&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=200%2C142&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=1536%2C1089&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=220%2C156&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=120%2C85&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=160%2C113&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=300%2C213&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=480%2C340&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?resize=1020%2C723&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-18.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 768px) 100vw, 768px" /><figcaption class="wp-element-caption">UI sample of Jupyter Notebook, data exploration open source ML tool | <a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Multi-language support</strong></li>
</ul>



<p>Often when we talk about Jupyter Notebook, we mean working with Python. But, in fact, you can work with other languages, such as Ruby, Perl, or R.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Integration with the cloud</strong></li>
</ul>



<p>The easiest way to start working with a Jupyter Notebook in the cloud is by using Google Colab. This means that you just need to launch your browser and open the desired page. After that, the cloud system will allocate resources for you and allow you to execute any code.</p>



<p>The plus is that you don’t need to install anything on your computer. The cloud takes care of everything, and you just write and run code.</p>



<h3 class="wp-block-heading" id="h-data-version-control-open-source-ml-tools">Data version control open source ML tools</h3>



<p>There will be multiple versions of your machine learning model before you&#8217;re done. To make sure nothing gets lost, use a robust and trustworthy <a href="https://neptune.ai/blog/best-7-data-version-control-tools-that-improve-your-workflow-with-machine-learning-projects" target="_blank" rel="noreferrer noopener nofollow">data version control system</a> where every change is trackable.</p>



<h4 class="wp-block-heading">Data Version Control (DVC)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1798" height="794" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=1798%2C794&#038;ssl=1" alt="Introduction to DVC, data version control open source ML tool" class="wp-image-29137" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?w=1798&amp;ssl=1 1798w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=768%2C339&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=200%2C88&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=1536%2C678&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=220%2C97&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=120%2C53&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=160%2C71&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=480%2C212&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-19.png?resize=1020%2C450&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to DVC, data version control open source ML tool | <a href="https://dvc.org/" 
target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://dvc.org/" target="_blank" rel="noreferrer noopener nofollow">DVC</a> is a tool designed for managing software versions in ML projects. It’s useful both for experimentation and for deploying models to production. DVC runs on top of Git, uses its infrastructure, and has a similar syntax.</p>



<ul class="wp-block-list">
<li><strong>Fully-automated version control</strong></li>
</ul>



<p>DVC creates metafiles describing pipelines and versioned files, which are saved in your project’s Git history. Once you place data under DVC’s control, it starts tracking all changes.</p>



<ul class="wp-block-list">
<li><strong>Git-based modification tracking</strong></li>
</ul>



<p>You can work with data the same way as with Git: save a version, send it to a remote repository, get the required version of the data, and change and switch between versions. The DVC interface is intuitively clear.&nbsp;</p>



<p>Overall, DVC is an excellent <a href="/blog/top-model-versioning-tools" target="_blank" rel="noreferrer noopener">tool for data and model versioning</a>. If you don’t need pipelines and remote repositories, you can version data for a specific project on a local machine. DVC lets you work very quickly with tens of gigabytes of data.</p>



<p>However, it also allows you to exchange data and models between teams. For data storage, you can use cloud solutions.&nbsp;</p>



<h4 class="wp-block-heading">Pachyderm</h4>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="598" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1920%2C598&#038;ssl=1" alt="Introduction to Pachyderm, data version control open source ML tools" class="wp-image-29138" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1920%2C598&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=768%2C239&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=200%2C62&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1536%2C479&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=220%2C69&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=120%2C37&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=160%2C50&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=300%2C93&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=480%2C150&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?resize=1020%2C318&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-20.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 
100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Pachyderm, data version control open source ML tools | <a href="https://www.pachyderm.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://www.pachyderm.com/" target="_blank" rel="noreferrer noopener nofollow">Pachyderm</a> is a Git-like tool for tracking transformations in your data. It keeps track of data lineage and ensures that data is kept relevant.&nbsp;</p>



<p>Pachyderm is useful because it provides:</p>



<ul class="wp-block-list">
<li><strong>Traceability</strong></li>
</ul>



<p>You want your data to be fully traceable from the moment it’s raw to the final prediction. With its version control for data, Pachyderm gives you a fully transparent view of your data pipelines. Otherwise, this can be a challenge: when multiple transformers use the same dataset, for example, it can be hard to say why you get a particular result.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Reproducibility</strong></li>
</ul>



<p>Pachyderm is a step forward for the reproducibility of your data science models. You can always be sure that your clients will get the same results after the model is handed over to them.</p>



<p>Pachyderm stores all your data in one central location and updates all the changes. No transformation will pass unnoticed.&nbsp;</p>
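<p>A Pachyderm pipeline is declared as a JSON spec naming an input repo and the container command to run over it; Pachyderm versions whatever the command writes to <code>/pfs/out</code>. A minimal sketch (the repo, image, and script names here are hypothetical):</p>

```json
{
  "pipeline": { "name": "featurize" },
  "input": {
    "pfs": { "repo": "raw-data", "glob": "/*" }
  },
  "transform": {
    "image": "python:3.10",
    "cmd": ["python3", "/featurize.py", "/pfs/raw-data", "/pfs/out"]
  }
}
```

<p>Whenever a new commit lands in the <code>raw-data</code> repo, Pachyderm reruns the transform and records the output as a new versioned commit, which is what makes the lineage traceable end to end.</p>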


    <a
        href="/blog/top-model-versioning-tools"
        id="cta-box-related-link-block_17f31d729f3e3245f5fd1aee95346336"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-top-model-versioning-tools-for-your-ml-workflow">                Top Model Versioning Tools for Your ML Workflow            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-data-inspection-open-source-ml-tools">Data inspection open source ML tools</h3>



<h4 class="wp-block-heading">Alibi Detect&nbsp;</h4>



<p><a href="https://github.com/SeldonIO/alibi-detect" target="_blank" rel="noreferrer noopener nofollow">Alibi Detect</a> is an open-source Python library by SeldonIO, the company behind Seldon Core, which we discussed earlier. The library lets you inspect your data’s integrity, offering outlier, adversarial, and drift detection for tabular data, text, images, and time series. It is compatible with both TensorFlow and PyTorch backends.&nbsp;</p>



<p>Alibi Detect offers a variety of methods for inspecting your data’s integrity. The documentation is well organized and includes examples for better understanding. I highly recommend going through the <a href="https://docs.seldon.io/projects/alibi-detect/en/latest/" target="_blank" rel="noreferrer noopener nofollow">documentation</a>, as it will be extremely beneficial.&nbsp;</p>



<p>If you are already using TensorFlow or PyTorch, Alibi Detect fits smoothly into your machine learning pipeline. Another reason to use this library in your workflow is that it provides <strong>built-in preprocessing steps</strong>. This feature lets you detect drift while using the transformers library, and it also helps you extract hidden-layer representations from machine learning models.&nbsp;</p>
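<p>For intuition, the kind of data drift check Alibi Detect automates boils down to a two-sample statistical test between a reference sample and incoming data. The sketch below is hand-rolled (it is not Alibi Detect&#8217;s API) and uses the Kolmogorov–Smirnov statistic with a permutation test, assuming only NumPy:</p>

```python
import numpy as np

def ks_statistic(x_ref, x):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples."""
    points = np.concatenate([x_ref, x])
    cdf_ref = np.searchsorted(np.sort(x_ref), points, side="right") / len(x_ref)
    cdf_x = np.searchsorted(np.sort(x), points, side="right") / len(x)
    return float(np.max(np.abs(cdf_ref - cdf_x)))

def detect_drift(x_ref, x, p_val=0.05, n_perm=200, seed=0):
    """Permutation test: is the observed statistic unusually large
    compared to random splits of the pooled data?"""
    rng = np.random.default_rng(seed)
    observed = ks_statistic(x_ref, x)
    pooled = np.concatenate([x_ref, x])
    n_ref = len(x_ref)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_ref], pooled[n_ref:]) >= observed:
            exceed += 1
    p = exceed / n_perm
    return {"is_drift": p < p_val, "p_val": p, "distance": observed}
```

<p>Alibi Detect&#8217;s <code>KSDrift</code> detector follows the same pattern, adding proper p-value computation, multivariate support, and the preprocessing hooks mentioned above.</p>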



<h4 class="wp-block-heading">Frouros</h4>



<p><a href="https://github.com/IFCA/frouros" target="_blank" rel="noreferrer noopener nofollow">Frouros</a> is an open-source Python library focused solely on drift detection. Unlike Alibi Detect, which also covers outlier and adversarial detection, Frouros addresses drift detection alone. What makes this library special is that it offers both classical and more recent algorithms for detecting data drift as well as concept drift.&nbsp;</p>



<p>Frouros is also a lightweight library that works with <strong>Scikit-Learn, NumPy, PyTorch,</strong> and other frameworks. It offers a wide variety of methods, most of which target univariate datasets, with a few supporting multivariate data as well.&nbsp;</p>



<p>As a final verdict, this library is a good fit for people who want to explore data drift in univariate datasets. And since it offers such a broad range of algorithms, it is also a good place to learn, and even to deploy in a fairly small project.&nbsp;</p>
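<p>To illustrate the concept-drift side, the classical DDM (Drift Detection Method) family of algorithms that Frouros implements watches the model&#8217;s streaming error rate rather than the inputs. Below is a simplified sketch of the rule (not Frouros&#8217;s API; the 3-sigma threshold follows the standard DDM heuristic):</p>

```python
class ErrorRateDriftDetector:
    """Simplified DDM-style concept drift detector (a sketch of the
    classical rule; Frouros ships full implementations of DDM and many
    other detectors).

    It tracks the running error rate p of a classifier and its standard
    deviation s = sqrt(p * (1 - p) / t), remembers the best (minimum)
    p + s seen so far, and signals drift once p + s exceeds
    p_min + 3 * s_min.
    """

    def __init__(self, warmup=30):
        self.t = 0
        self.errors = 0
        self.warmup = warmup
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model misclassified this sample, else 0.
        Returns True once drift is detected."""
        self.t += 1
        self.errors += error
        p = self.errors / self.t
        s = (p * (1.0 - p) / self.t) ** 0.5
        if self.t < self.warmup:
            return False
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        return p + s > self.p_min + 3.0 * self.s_min
```

<p>In practice, you would feed <code>update()</code> with per-sample correctness from your live model and retrain once drift is signalled.</p>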



<h3 class="wp-block-heading" id="h-model-serving-open-source-ml-tool">Model serving open source ML tool</h3>



<h4 class="wp-block-heading">Streamlit</h4>



<p><a href="https://github.com/streamlit/streamlit" target="_blank" rel="noreferrer noopener nofollow">Streamlit</a> is an open-source Python library for creating interactive web applications, mostly for data science and ML projects. Strictly speaking, Streamlit is a framework, but since its role here is limited to deploying the ML application, I have put it under the tools category.&nbsp;</p>



<p>Streamlit lets you build web-based dashboards, visualizations, and applications with minimal effort. Some of its key features include:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1668" height="1186" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=1668%2C1186&#038;ssl=1" alt="Introduction to Streamlit, model serving open source ML tool " class="wp-image-29524" style="width:810px;height:576px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?w=1668&amp;ssl=1 1668w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=768%2C546&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=200%2C142&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=1536%2C1092&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=220%2C156&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=120%2C85&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=160%2C114&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=300%2C213&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=480%2C341&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/StreamLit.png?resize=1020%2C725&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Streamlit, model serving open source ML tool | <a href="https://streamlit.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Rapid Prototyping</strong></li>
</ul>



<p>As mentioned previously, you can create interactive applications by writing Python code that directly interacts with your data and visualizations.</p>



<ul class="wp-block-list">
<li><strong>Simplicity</strong></li>
</ul>



<p>The library is designed to be user-friendly, with a simple and intuitive API. You can create interactive widgets with just a few lines of code.</p>



<ul class="wp-block-list">
<li><strong>Data Visualization</strong></li>
</ul>



<p>Streamlit supports integration with popular data visualization libraries like Matplotlib, Plotly, and Altair, enabling you to display charts and graphs in your web application.</p>



<ul class="wp-block-list">
<li><strong>Customization</strong></li>
</ul>



<p>While Streamlit is straightforward to use out of the box, you can also customize the appearance and layout of your apps using CSS styling and additional layout components.</p>



<ul class="wp-block-list">
<li><strong>Integration</strong></li>
</ul>



<p>You can integrate your Streamlit apps with machine learning models, data analysis scripts, and other Python-based functionalities to create cohesive data-driven machine learning applications.</p>



<ul class="wp-block-list">
<li><strong>Interactivity</strong></li>
</ul>



<p>Streamlit&#8217;s widgets and features allow users to interact with data, adjust parameters, and see real-time updates in the app&#8217;s visualizations.</p>



<ul class="wp-block-list">
<li><strong>Sharing and Deployment</strong></li>
</ul>



<p>You can deploy your Streamlit apps on various platforms, including cloud services, making it easy to share your work with others.</p>



<ul class="wp-block-list">
<li><strong>Community and Extensions</strong></li>
</ul>



<p>Streamlit has a growing community and supports a range of extensions and integrations, allowing you to enhance the functionality of your apps.</p>



<p>Streamlit is particularly well-suited for scenarios where you want to create simple and interactive data visualization tools or prototypes without investing a significant amount of time in web development. It&#8217;s commonly used by data scientists and engineers who want to showcase their data analysis and machine learning results in an accessible and engaging manner.</p>



<h4 class="wp-block-heading">TorchServe</h4>



<p><a href="https://pytorch.org/serve/index.html" target="_blank" rel="noreferrer noopener nofollow">TorchServe</a> is an open-source model-serving tool built by Facebook AI. It is engineered to simplify the deployment and management of PyTorch models, aligning seamlessly with your MLOps workflows. Let&#8217;s delve into why TorchServe is a compelling choice for model management and inference in the MLOps landscape.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pytorch.org/serve/getting_started.html" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1081" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1920%2C1081&#038;ssl=1" alt="Introduction to TorchServe, model serving open source ML tool" class="wp-image-29797" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1920%2C1081&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=768%2C432&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=200%2C113&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1536%2C865&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=2048%2C1153&amp;ssl=1 2048w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=220%2C124&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=160%2C90&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=300%2C169&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=480%2C270&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/TorchServe.jpeg?resize=1020%2C574&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">Introduction to TorchServe, model serving open source ML tool | Source</figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Efficient Model Management</strong></li>
</ul>



<p>One of TorchServe&#8217;s standout features is its robust Model Management API. It empowers MLOps practitioners with multi-model management capabilities, allowing the allocation of models to workers in an optimized manner. This means you can effortlessly handle multiple models, versioning, and configurations while ensuring resource allocation is fine-tuned for peak performance.</p>



<ul class="wp-block-list">
<li><strong>Versatile Inference Support</strong></li>
</ul>



<p>TorchServe extends its capabilities through its Inference API, offering support for both REST and gRPC protocols. But it doesn&#8217;t stop there; it&#8217;s equipped for batched inference, optimizing the prediction process for both single and multiple data points. This versatility ensures that your models can be integrated seamlessly into a wide array of applications.</p>



<ul class="wp-block-list">
<li><strong>Complex Deployments Made Simple</strong></li>
</ul>



<p>For those tackling intricate deployments involving complex Directed Acyclic Graphs (DAGs) with interdependent models, TorchServe comes to the rescue. Its TorchServe Workflows feature enables the deployment of these intricate setups, giving you the flexibility needed to cater to demanding real-world scenarios.</p>



<ul class="wp-block-list">
<li><strong>Wide Adoption in Leading MLOps Platforms</strong></li>
</ul>



<p>TorchServe&#8217;s reputation extends beyond its own ecosystem. It serves as the default choice for serving PyTorch models within platforms like Kubeflow, MLflow, SageMaker, Google Vertex AI, and Kserve, supporting both v1 and v2 APIs. This widespread adoption speaks volumes about its effectiveness and compatibility within the MLOps landscape.</p>



<ul class="wp-block-list">
<li><strong>Optimized Inference Export</strong></li>
</ul>



<p>In the quest for optimized inference, TorchServe offers a suite of options. Whether it&#8217;s TorchScript right out of the box, ONNX, ORT, IPEX, or TensorRT, you have the freedom to export your model in a format that suits your specific performance requirements. This flexibility ensures that your models are primed for efficient execution.</p>



<ul class="wp-block-list">
<li><strong>Performance at the Core</strong></li>
</ul>



<p>MLOps professionals know that performance is paramount. TorchServe recognizes this and provides built-in support to optimize, benchmark, and profile both PyTorch models and TorchServe itself. This means you can fine-tune your deployments for optimal throughput and responsiveness.</p>



<ul class="wp-block-list">
<li><strong>Expressive Handlers for Custom Use Cases</strong></li>
</ul>



<p>Handling inferencing for diverse use cases is a breeze with TorchServe&#8217;s expressive handler architecture. It simplifies the process of customizing inferencing for your unique requirements, and it comes with a plethora of out-of-the-box solutions to cater to various scenarios.</p>



<ul class="wp-block-list">
<li><strong>Comprehensive Metrics and Monitoring</strong></li>
</ul>



<p>Monitoring the health and performance of your models is vital. TorchServe comes with a Metrics API that offers out-of-the-box support for system-level metrics. It seamlessly integrates with Prometheus for metric exports, and it also supports custom metrics. Moreover, it aligns seamlessly with PyTorch&#8217;s profiler for in-depth performance analysis.</p>



<p>TorchServe seamlessly integrates with leading MLOps platforms and empowers you to deploy and manage models efficiently. If you&#8217;re seeking a robust solution to elevate your MLOps workflows, TorchServe deserves a prominent place in your toolkit.</p>



<h3 class="wp-block-heading" id="h-testing-and-maintenance-open-source-ml-tools">Testing and maintenance open source ML tools</h3>



<p>The final step of ML development is testing and maintenance after the main jobs are done. Special tools allow you to make sure that the results are reproducible in the long run.</p>



<h4 class="wp-block-heading">Prometheus</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1351" height="811" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=1351%2C811&#038;ssl=1" alt="Introduction to Prometheus, monitoring and testing open source ML tool" class="wp-image-29143" style="width:810px;height:486px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?w=1351&amp;ssl=1 1351w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=768%2C461&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=200%2C120&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=220%2C132&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=120%2C72&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=160%2C96&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=300%2C180&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=480%2C288&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-25.png?resize=1020%2C612&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Prometheus, monitoring and testing open source ML tool | <a href="https://prometheus.io/docs/introduction/overview/" target="_blank" rel="noreferrer noopener 
nofollow">Source</a></figcaption></figure>
</div>


<p>Prometheus is an open-source monitoring toolkit originally built at SoundCloud. It has a very active community and is well supported by a large number of organizations. The fundamental concept of Prometheus is that it stores all data and metrics in a time-series format, which means every metric collected during monitoring is associated with a timestamp.&nbsp;</p>



<p>This is why <strong>Prometheus fits time-series data very well</strong>. It also supports multi-dimensional <strong>data collection</strong> and querying, which means you can use Prometheus to log your ML system’s metrics.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="703" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=1600%2C703&#038;ssl=1" alt="Introduction to Prometheus, monitoring and testing open source ML tool" class="wp-image-29144" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=768%2C337&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=200%2C88&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=1536%2C675&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=220%2C97&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=120%2C53&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=160%2C70&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=300%2C132&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=480%2C211&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-26.png?resize=1020%2C448&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Introduction to Prometheus | <a href="https://www.jeremyjordan.me/ml-monitoring/#prometheus" 
target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The image above represents how metrics can be logged into the time series database, and later the same can be retrieved via endpoints.&nbsp;</p>



<p>Some of the highlighted key features are:</p>



<ul class="wp-block-list">
<li><strong>Standalone servers</strong></li>
</ul>



<p>Each Prometheus server is standalone and independent of the others, which makes the setup reliable.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>PromQL</strong></li>
</ul>



<p>A powerful query language that allows searching, slicing, and dicing of time-series data. With PromQL, you can also generate graphs, tables, and alerts in Prometheus&#8217;s expression browser.</p>



<ul class="wp-block-list">
<li><strong>Efficient storage</strong></li>
</ul>



<p>Data is stored in memory and in a local on-disk time-series database in a custom format, which also allows efficient scaling.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Dimensional Data</strong></li>
</ul>



<p>The key concept of Prometheus is storing data in a time-series format. Because of this, you can select any timeframe to understand the behaviour of your model. On top of that, you can create a visualization dashboard using Grafana.&nbsp;&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1798" height="940" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=1798%2C940&#038;ssl=1" alt="UI sample of Prometheus, monitoring and testing open source ML tool" class="wp-image-29145" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?w=1798&amp;ssl=1 1798w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=1536%2C803&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-27.png?resize=1020%2C533&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Prometheus, monitoring and testing open source ML tool | <a 
href="https://www.jeremyjordan.me/ml-monitoring/#prometheus" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>If you want a general-purpose, lightweight tool to collect and log metrics about your system, Prometheus is a must.&nbsp;</p>
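<p>The text exposition format Prometheus scrapes is deliberately simple: one metric per line, with optional labels and a millisecond timestamp. In practice, the official <code>prometheus_client</code> package generates this for you; the sketch below (the function and metric names are hypothetical) just makes the format visible:</p>

```python
def render_metrics(metrics, labels=None, timestamp_ms=None):
    """Render gauge values in Prometheus' text exposition format:
    metric_name{label="value",...} value [timestamp_ms]
    """
    label_str = ""
    if labels:
        pairs = ",".join('{}="{}"'.format(k, v) for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines = []
    for name, value in sorted(metrics.items()):
        line = "{}{} {}".format(name, label_str, value)
        if timestamp_ms is not None:
            line += " {}".format(timestamp_ms)
        lines.append(line)
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"
```

<p>A model server would expose such a string at a <code>/metrics</code> HTTP endpoint for Prometheus to scrape; <code>prometheus_client</code>&#8217;s <code>Counter</code>, <code>Gauge</code>, and <code>start_http_server</code> wrap exactly this workflow.</p>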



<h4 class="wp-block-heading">ModsysML</h4>



<p><a href="https://github.com/modsysML/modsysML" target="_blank" rel="noreferrer noopener nofollow">ModsysML</a> is a very new MLOps tool that lets users test and automate workloads, compare outputs, improve data quality, and catch regressions, all through a single API. This enables you to automate, accelerate, and backtest the process of deriving proactive intelligence and insights from data-quality testing.&nbsp;</p>



<p>ModsysML streamlines the refinement of AI systems across a diverse range of test cases. By scrutinizing and comparing outputs, it builds workflows that support decision-making, letting users assess quality quickly and identify regressions promptly.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="900" height="450" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=900%2C450&#038;ssl=1" alt="UI sample of ModsysML, monitoring and testing open source ML tool" class="wp-image-29146" style="width:810px;height:405px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?w=900&amp;ssl=1 900w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=768%2C384&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=200%2C100&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=220%2C110&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=120%2C60&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=160%2C80&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=300%2C150&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-28.png?resize=480%2C240&amp;ssl=1 480w" sizes="auto, (max-width: 900px) 100vw, 900px" /><figcaption class="wp-element-caption">UI sample of ModsysML, monitoring and testing open source ML tool | <a href="https://github.com/modsysML/modsysML" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>ModsysML&#8217;s suite of tools encompasses three fundamental functions:</p>



<ul class="wp-block-list">
<li>Conducting performance benchmarks for AI systems with respect to precise outcomes.</li>



<li>Crafting automated tasks or revisiting established ones for a thorough evaluation.</li>



<li>Detecting immediate fluctuations within data streams.</li>
</ul>



<p>Through its user interface (UI) and Python library, you can calibrate your AI systems for particular use cases. This encompasses creating automated workflows as well as deriving data-driven insights from real-time shifts within your datasets.</p>
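<p>Since ModsysML&#8217;s actual Python API is not shown here, the function and system names below are hypothetical; the sketch only illustrates the core idea of benchmarking outputs against expected outcomes across test cases and flagging regressions:</p>

```python
# Illustrative sketch only -- not the ModsysML API. It mimics the idea of
# running a fixed set of test cases through two system versions and
# flagging regressions where the new version stops matching expectations.

def evaluate(system, test_cases):
    """Score a system (a callable) against expected outputs per test case."""
    results = {}
    for name, (prompt, expected) in test_cases.items():
        results[name] = system(prompt) == expected
    return results

def find_regressions(baseline, candidate):
    """Test cases that passed in the baseline but fail in the candidate."""
    return [name for name in baseline if baseline[name] and not candidate[name]]

test_cases = {
    "greeting": ("hi", "hello"),
    "farewell": ("bye", "goodbye"),
}

old_system = lambda p: {"hi": "hello", "bye": "goodbye"}[p]
new_system = lambda p: {"hi": "hello", "bye": "see ya"}[p]

baseline = evaluate(old_system, test_cases)
candidate = evaluate(new_system, test_cases)
print(find_regressions(baseline, candidate))  # ['farewell']
```

<p>A real setup would substitute calls to your AI systems for the toy lambdas and feed the regression list into an automated workflow.</p>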



<h4 class="wp-block-heading">Deepchecks</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1540" height="860" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=1540%2C860&#038;ssl=1" alt="UI sample of Deepchecks, monitoring and testing open source ML tool" class="wp-image-29147" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?w=1540&amp;ssl=1 1540w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=768%2C429&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=200%2C112&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=1536%2C858&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=220%2C123&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=120%2C67&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=160%2C89&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=300%2C168&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=480%2C268&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-29.png?resize=1020%2C570&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Deepchecks, monitoring and testing open source ML tool | <a 
href="https://deepchecks.com/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>We come across another open-source tool that allows you to thoroughly evaluate and test the integrity of the data as well as the ML model. <a href="https://deepchecks.com/">Deepchecks</a> offers continuous evaluation from research to production and has a strong, active community behind it.&nbsp;</p>



<p>Deepchecks caters to tabular, NLP, and computer vision (CV) datasets. It offers four solutions:</p>



<div id="case-study-numbered-list-block_7805ce0252a3bb123b8aeab98727a88e"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Testing            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                CI/CD            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Monitoring            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Root Cause Analysis            </li>
            </ul>
</div>



<p>Deepchecks offers a convenient way to detect imperfections within data and ML models while enabling proactive steps toward improvement. Its <strong>Suite</strong> feature proves particularly advantageous, facilitating an in-depth assessment of diverse data and model facets and generating valuable reports.</p>



<p>For a clearer picture, a selection of the predefined checks carried out within a suite, along with their functions, is outlined below:</p>



<ul class="wp-block-list">
<li><strong>Dataset Integrity</strong></li>
</ul>



<p>Employed to ascertain the accuracy and comprehensiveness of the dataset.</p>



<ul class="wp-block-list">
<li><strong>Train-Test Validation</strong></li>
</ul>



<p>A set of checks is devised to ascertain the appropriateness of the data split for the model training and testing phases.</p>



<ul class="wp-block-list">
<li><strong>Model Evaluation&nbsp;</strong></li>
</ul>



<p>A set of checks is performed to gauge model performance, its adaptability to diverse scenarios, and any indicators of overfitting.</p>
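<p>Deepchecks ships these as ready-made suites, so you would not write them by hand; the deliberately simplified, library-free sketch below only illustrates what a dataset-integrity check and a train-test validation check each look for:</p>

```python
# Simplified, library-free sketch of the kinds of checks a Deepchecks suite
# runs -- the real library provides these (and many more) out of the box.

def check_integrity(rows):
    """Dataset integrity: count duplicate rows and missing values."""
    seen, duplicates, missing = set(), 0, 0
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        seen.add(key)
        missing += sum(1 for v in row.values() if v is None)
    return {"duplicates": duplicates, "missing_values": missing}

def check_train_test_leakage(train, test):
    """Train-test validation: exact rows appearing in both splits."""
    train_keys = {tuple(r.items()) for r in train}
    return sum(1 for r in test if tuple(r.items()) in train_keys)

train = [{"x": 1, "y": 0}, {"x": 2, "y": 1}, {"x": 2, "y": 1}, {"x": 3, "y": None}]
test = [{"x": 2, "y": 1}, {"x": 4, "y": 0}]

print(check_integrity(train))                 # {'duplicates': 1, 'missing_values': 1}
print(check_train_test_leakage(train, test))  # 1
```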



<p>One of the reasons Deepchecks will fit into your workflow is the set of automated solutions it offers, especially <strong>root cause analysis</strong>. It expedites the process of identifying the fundamental source of a problem across the entire model lifecycle, letting you swiftly discern the underlying cause of an issue, and it promises to give you granular details about it.&nbsp;</p>



<h3 class="wp-block-heading" id="h-experiment-tracking-open-source-ml-tools">Experiment tracking open source ML tools</h3>



<h4 class="wp-block-heading">Aim</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1400" height="828" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=1400%2C828&#038;ssl=1" alt="UI sample of Aim, experiment tracking open source ML tool" class="wp-image-29149" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?w=1400&amp;ssl=1 1400w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=768%2C454&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=200%2C118&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=220%2C130&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=120%2C71&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=160%2C95&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=300%2C177&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=480%2C284&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-30.png?resize=1020%2C603&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Aim, experiment tracking open source ML tool | <a href="https://aimstack.io/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p><a href="https://github.com/aimhubio/aim" target="_blank" rel="noreferrer noopener nofollow">Aim</a> stands as an open-source, self-hosted AI Metadata monitoring solution tailored to manage vast volumes of tracked metadata sequences, numbering in the tens of thousands.</p>



<p>Aim presents an efficient and visually pleasing user interface (UI) that facilitates the exploration and juxtaposition of metadata, encompassing elements such as training runs or agent executions. What&#8217;s more, its software development kit (SDK) grants the capability for programmatic interaction with the tracked metadata—an ideal feature for streamlined automation and analysis within Jupyter Notebooks.</p>
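<p>As an illustration of that pattern (this is not Aim&#8217;s actual SDK, whose central object is a Run with a track() method), a minimal pure-Python sketch of what an experiment tracker records per run, namely hyperparameters plus named sequences of tracked values:</p>

```python
# Pure-Python sketch of the experiment-tracking pattern an SDK like Aim's
# follows: each run stores hyperparameters plus named metric sequences.

class TrackedRun:
    def __init__(self, **hparams):
        self.hparams = hparams
        self.sequences = {}  # metric name -> list of (step, value)

    def track(self, value, name, step):
        self.sequences.setdefault(name, []).append((step, value))

    def last(self, name):
        """Most recently tracked value for a metric."""
        return self.sequences[name][-1][1]

run = TrackedRun(lr=0.01, batch_size=32)
for step, loss in enumerate([0.9, 0.5, 0.3]):
    run.track(loss, name="loss", step=step)

print(run.hparams["lr"], run.last("loss"))  # 0.01 0.3
```

<p>The point of a dedicated tracker is that these sequences persist across processes and scale to tens of thousands of runs, which a throwaway in-memory class obviously does not.</p>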



<p>Some of the key features are:</p>



<ul class="wp-block-list">
<li><strong>Streamlined Run Comparisons</strong></li>
</ul>



<p>Effortlessly contrast various runs to expedite the model-building process.</p>



<ul class="wp-block-list">
<li><strong>In-Depth Run Inspection</strong></li>
</ul>



<p>Immerse yourself in the minutiae of each run, facilitating seamless troubleshooting.</p>



<ul class="wp-block-list">
<li><strong>Centralized Repository of Pertinent Details</strong></li>
</ul>



<p>All pertinent information is centralized, ensuring hassle-free governance and management.</p>



<p>Aim can handle up to 100,000 metadata sequences, which is why it can be one of the best fits for your ML stack. Apart from that, its UI is both functional and visually appealing.&nbsp;</p>



<h4 class="wp-block-heading">Guild AI</h4>



<p><a href="https://github.com/guildai/guildai" target="_blank" rel="noreferrer noopener nofollow">Guild AI</a> serves as an open source toolkit that streamlines and enhances the efficiency of machine learning experiments. It stands as an all-encompassing ML engineering toolkit with an array of capabilities.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1100" height="814" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=1100%2C814&#038;ssl=1" alt="UI sample of Guild AI, experiment tracking open source ML tool" class="wp-image-29542" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?w=1100&amp;ssl=1 1100w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=768%2C568&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=200%2C148&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=220%2C163&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=120%2C89&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=160%2C118&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=480%2C355&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/Guild-ai.png?resize=1020%2C755&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">UI sample of Guild AI, experiment tracking open source ML tool | <a href="https://guild.ai/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Automated Experiment Tracking</strong></li>
</ul>



<p>Guild AI lets you run your original training scripts, captures each experiment&#8217;s results, and provides tools for analysis, visualization, and comparison.</p>



<ul class="wp-block-list">
<li><strong>Hyperparameter Tuning with AutoML</strong></li>
</ul>



<p>Harness AutoML for hyperparameter tuning by automating trials with grid search, random search, and Bayesian optimization techniques.</p>
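<p>Grid and random search are simple enough to sketch without Guild&#8217;s tooling (Bayesian optimization needs more machinery); the toy objective below stands in for a full training-and-evaluation run:</p>

```python
import itertools
import random

# Stdlib-only sketch of the two simplest tuning strategies Guild AI automates.
# The "objective" is a stand-in for training a model and scoring it.

def objective(params):
    # Toy objective: best at lr=0.1, depth=4.
    return -abs(params["lr"] - 0.1) - abs(params["depth"] - 4)

space = {"lr": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

# Grid search: evaluate every combination exhaustively.
grid = [dict(zip(space, combo)) for combo in itertools.product(*space.values())]
best_grid = max(grid, key=objective)

# Random search: sample a fixed budget of combinations.
rng = random.Random(0)
samples = [{k: rng.choice(v) for k, v in space.items()} for _ in range(5)]
best_random = max(samples, key=objective)

print(best_grid)  # {'lr': 0.1, 'depth': 4}
```

<p>Guild wraps this loop around your actual script, recording every trial as a tracked run so the results feed directly into its comparison tools.</p>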



<ul class="wp-block-list">
<li><strong>Comparison and Analysis&nbsp;</strong></li>
</ul>



<p>Compare and analyze experiment runs to gain insights and enhance your model&#8217;s performance.</p>



<ul class="wp-block-list">
<li><strong>Efficient Backup and Archiving</strong></li>
</ul>



<p>Run training-related operations like data preparation and testing, and archive runs to remote systems such as S3.</p>



<ul class="wp-block-list">
<li><strong>Remote Operations and Acceleration&nbsp;</strong></li>
</ul>



<p>Perform operations remotely on cloud accelerators, optimizing your workflow efficiency.</p>



<ul class="wp-block-list">
<li><strong>Model Packaging and Reproducibility&nbsp;</strong></li>
</ul>



<p>Package and distribute models for seamless reproducibility across different environments.</p>



<ul class="wp-block-list">
<li><strong>Streamlined Pipeline Automation</strong></li>
</ul>



<p>Enable automated pipelines for smoother workflow execution.</p>



<ul class="wp-block-list">
<li><strong>Scheduling and Parallel Processing</strong></li>
</ul>



<p>Utilize scheduling and parallel processing to optimize resource utilization.</p>



<ul class="wp-block-list">
<li><strong>Remote Training and Management</strong></li>
</ul>



<p>Conduct remote training, backup, and restoration of experiments for enhanced flexibility.</p>



<p>If you are looking for a tool that offers automated experiment management, optimization, and insights that streamline and enhance machine learning workflows, then Guild AI is the tool of choice.&nbsp;</p>



<h3 class="wp-block-heading" id="h-model-interpretability-open-source-ml-tools">Model interpretability open source ML tools&nbsp;</h3>



<h4 class="wp-block-heading">Alibi Explain</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="599" height="313" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=599%2C313&#038;ssl=1" alt="Introduction to Alibi Explain, model interpretability open source ML tool" class="wp-image-29152" style="width:795px;height:415px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?w=599&amp;ssl=1 599w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/the-best-open-source-mlops-tools-you-should-know-32.png?resize=480%2C251&amp;ssl=1 480w" sizes="auto, (max-width: 599px) 100vw, 599px" /><figcaption class="wp-element-caption">Introduction to Alibi Explain, model interpretability open source ML tool | <a href="https://docs.seldon.io/projects/alibi/en/stable/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Alibi Explain is another tool from SeldonIO. It stands as an open-source Python library with a primary focus on the explainability and interpretation of ML models. The library is dedicated to furnishing top-tier implementations of explanation methods, encompassing black-box, white-box, local, and global approaches tailored for both classification and regression models.</p>



<p>Within Alibi Explain, a collection of algorithms or methodologies, termed explainers, is at your disposal. Each explainer serves as a conduit for obtaining insights into a model&#8217;s behavior. The range of insights attainable for a trained model is influenced by several variables.&nbsp;</p>



<p>To learn more, please read the documentation <a href="https://docs.seldon.io/projects/alibi/en/stable/" target="_blank" rel="noreferrer noopener nofollow">here</a>. It is one of the most polished tools on this list.&nbsp;</p>



<p>Broadly speaking, the range of explainers available from Alibi is determined by:</p>



<ul class="wp-block-list">
<li>The nature of the data the model handles, encompassing images, tabular data, or text.</li>



<li>The task performed by the model, namely regression or classification.</li>



<li>The specific model type employed, including neural networks and random forests.</li>
</ul>
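<p>One classic member of the black-box, global family is permutation feature importance: shuffle a single feature and measure how much the model&#8217;s score drops. Alibi Explain&#8217;s explainers are far more sophisticated, but the library-free sketch below captures the shared spirit of probing a model only through its predictions:</p>

```python
import random

# Library-free sketch of a black-box, global explanation method
# (permutation feature importance): the model is only ever queried
# through its predictions, never inspected internally.

def model(x):
    # Black-box stand-in: depends strongly on feature 0, weakly on feature 1.
    return 3.0 * x[0] + 0.1 * x[1]

def score(X, y):
    # Negative mean absolute error, so higher is better.
    return -sum(abs(model(x) - t) for x, t in zip(X, y)) / len(X)

def permutation_importance(X, y, feature, rng):
    baseline = score(X, y)
    shuffled = [row[feature] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, shuffled)]
    return baseline - score(X_perm, y)  # how much the score dropped

rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(20)]
y = [model(x) for x in X]

drop0 = permutation_importance(X, y, 0, rng)
drop1 = permutation_importance(X, y, 1, rng)
print(drop0 > drop1)  # True -- feature 0 matters more
```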



<p>Explainability in general is one of the most sought-after features in the ML world, because as humans we are curious to know what is happening inside a closed box. If you are working in medicine, healthcare, or any other life-critical industry, then this tool is a must in your arsenal.&nbsp;</p>






<h3 class="wp-block-heading" id="h-conclusion">Conclusion</h3>



<p>Open source MLOps tools are necessary. They help you automate a large amount of routine work without costing a fortune. Fully-fledged platforms offer a wide selection of tools for different purposes, for whatever technological stack you might desire. In practice, however, it often turns out that you still need to integrate them with specialized tools that are more intuitive to use. Luckily, most open-source tools make the integration as seamless as possible.&nbsp;</p>



<p>However, an important thing to understand about open-source tools is that you shouldn’t expect them to be completely free of charge: the costs of infrastructure, support, and maintenance of your ML projects will still be on you.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5959</post-id>	</item>
		<item>
		<title>How to Do Model Visualization in Machine Learning?</title>
		<link>https://neptune.ai/blog/visualization-in-machine-learning</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:23:21 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/visualizing-machine-learning-models/</guid>

					<description><![CDATA[Machine learning models are powerful and complex mathematical structures. Understanding their intricate workings is a crucial aspect of model development. Model visualization in machine learning is essential for gaining insights, making informed decisions, and effectively communicating results. In this article, we’ll delve into the art of machine learning visualization, exploring various techniques that help us&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Machine learning models are powerful and complex mathematical structures. Understanding their intricate workings is a crucial aspect of model development. Model visualization in machine learning is essential for gaining insights, making informed decisions, and effectively communicating results.</p>



<p>In this article, we’ll delve into the art of machine learning visualization, exploring various techniques that help us make sense of complex data-driven systems. I have also prepared a <a href="https://colab.research.google.com/drive/1Y9LO60Pi28d4a1_aU8Amlf4I3O4abUtp?usp=sharing" target="_blank" rel="noreferrer noopener nofollow">Google Colab notebook with visualization examples</a> to try yourself. </p>



<p>So, without further ado, let’s get started.</p>



<h2 class="wp-block-heading" id="h-what-is-visualization-in-machine-learning">What is visualization in machine learning?</h2>



<p>Machine learning visualization (ML visualization for short) generally refers to the process of representing machine learning models, data, and their relationships through graphical or interactive means. The goal is to make comprehending a model’s complex algorithms and data patterns easier, making it more accessible to technical and non-technical stakeholders.&nbsp;</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><em>Visualization bridges the gap between the enigmatic inner workings of ML models and our innate human capacity for understanding patterns and relationships through visuals.</em></p>
</blockquote>



<p>Visualizing ML models can help with a wide range of objectives:</p>



<ul class="wp-block-list">
<li><strong>Model structure visualization:</strong> Common model types, such as decision trees, support vector machines, or deep neural networks, often consist of many layers of computations and interactions that are challenging to grasp for humans. Visualization lets us see more easily how data flows through a model and where transformations occur.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Visualizing performance metrics</strong>: Once we have trained a model, we need to assess its performance. Visualizing metrics such as accuracy, precision, recall, and the F1 score helps us see how well our model is doing and where improvements are needed.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Comparative model analysis</strong>: When dealing with multiple models or algorithms, visualization of differences in structure or performance allows us to choose the best one for a particular task.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Feature importance</strong>: It is vital to understand which features influence a model’s predictions the most. Visualization techniques like feature importance plots make identifying the critical factors driving model outcomes easy.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Interpretability</strong>: Due to their complexity, ML models are often &#8220;black boxes&#8221; to their human creators, making it hard to explain their decisions. Visualizations can shed light on how specific features affect the output or how robust a model’s predictions are.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Communication</strong>: Visualizations are a universal language for conveying complex ideas simply and intuitively. They are essential for effectively sharing information with management and other non-technical stakeholders.</li>
</ul>



<figure class="wp-block-image size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="900" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=1600%2C900&#038;ssl=1" alt="Visualization in machine learning: loss function’s gradient" class="wp-image-32040" style="aspect-ratio:1.7777777777777777;width:811px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=768%2C432&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=200%2C113&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=1536%2C864&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=220%2C124&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=160%2C90&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=300%2C169&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=480%2C270&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-1.jpg?resize=1020%2C574&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Example of visualization in machine learning : loss function’s gradient | <a href="https://losslandscape.com/gallery/" 
target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>



<h2 class="wp-block-heading" id="h-model-structure-visualization">Model structure visualization</h2>



<p>Understanding how data flows through a model is essential in understanding how a machine learning model transforms the input features into its output.</p>



<h3 class="wp-block-heading" id="h-decision-tree-visualization">Decision tree visualization</h3>



<p>Decision trees have a flowchart-like structure that’s familiar to most people. Each internal node represents a decision based on the value of a specific feature. Each branch from a node signifies an outcome of that decision. The leaf nodes represent the model’s outputs.</p>



<p>Visualization of this structure offers a straightforward representation of the decision-making process, enabling data scientists and business stakeholders alike to comprehend the decision rules the model has learned.</p>



<p>During training, a decision tree identifies the feature that best separates the samples in a branch based on a specific criterion, often the Gini impurity or information gain. In other words, it determines the most discriminative feature.</p>



<p>Visualizing decision trees (or their ensembles like random forests or gradient-boosted trees) involves a graphical rendering of their overall structure, displaying the splits and decisions at each node clearly and intuitively. The depth and width of the tree, as well as the leaf nodes, become evident at first sight. Moreover, decision tree visualization aids in identifying crucial features, the most discriminative attributes that lead to accurate predictions.</p>



<p>The path to accurate prediction can be summed up in four steps:</p>



<ul class="wp-block-list">
<li><strong>Feature Clarity</strong>: Decision tree visualization is like peeling back layers of complexity to reveal the pivotal features at play. It&#8217;s akin to looking at a decision-making flowchart, where each branch signifies a feature, and each decision node holds a crucial aspect of our data.<br></li>



<li><strong>Discriminative Attributes</strong>: The beauty of a decision tree visualization lies in its ability to highlight the most discriminative features. These factors heavily influence the outcome, guiding the model in making predictions. Through visualizing the tree, we can pinpoint these features and thus understand the core factors driving our model&#8217;s decisions.<br></li>



<li><strong>Path to Precision:</strong> Every path down the decision tree is a journey towards precision. The visualization showcases the sequence of decisions that lead to a particular prediction. This is gold for understanding the logic and criteria our model uses to reach specific conclusions.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Simplicity Amidst Complexity</strong>: Despite the complexity of machine learning algorithms, decision tree visualization comes with an element of simplicity. It transforms intricate mathematical calculations into an intuitive representation, making it accessible to technical and non-technical stakeholders.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="1800" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=1800%2C1800&#038;ssl=1" alt="Decision tree visualization in machine learning: plot representing a decision tree classifier trained on the Iris data set" class="wp-image-32268" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=1536%2C1536&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=300%2C300&amp;ssl=1 300w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/Example-of-decision-tree-visualization-in-machine-learning.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><strong>Example of decision tree visualization in machine learning : decision tree classifier trained on the </strong><a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" target="_blank" rel="noreferrer noopener nofollow"><strong>Iris data set</strong></a> | Source: Author</figcaption></figure>
</div>


<p>The diagram above shows the structure of a decision tree classifier trained on the famous Iris dataset. This dataset consists of 150 samples of iris flowers, each belonging to one of three species: <em>setosa</em>, <em>versicolor</em>, or <em>virginica</em>. Each sample has four features: sepal length, sepal width, petal length, and petal width.</p>



<p>From the decision tree visualization, we can understand how the model classifies a flower:</p>



<ol class="wp-block-list">
<li><strong>Root node</strong>: At the root node, the model determines whether the petal length is 2.45 cm or less. If so, it classifies the flower as <em>setosa</em>. Otherwise, it moves on to the next internal node.<br></li>



<li><strong>Second split based on petal length</strong>: If the petal length is greater than 2.45 cm, the tree again uses this feature to make a decision. The decision criterion is whether the petal length is less than or equal to 4.75 cm.</li>
</ol>



<ol start="3" class="wp-block-list">
<li><strong>Split based on petal width</strong>: If the petal length is less than or equal to 4.75 cm, the model next considers the petal width and determines whether it is above 1.65 cm. If so, it classifies the flower as <em>virginica</em>. Otherwise, the model’s output is<em> versicolor.</em></li>
</ol>



<ol start="4" class="wp-block-list">
<li><strong>Split based on sepal length</strong>: If the petal length is greater than 4.75 cm, the model determined during training that sepal length is best suited to distinguish <em>versicolor</em> from <em>virginica</em>. If the sepal length is greater than 6.05 cm, it classifies the flower as <em>virginica</em>. Otherwise, the model’s output is <em>versicolor</em>.</li>
</ol>



<p>The visualization captures this hierarchical decision-making process and represents it in a way that is easier to understand than a simple listing of decision rules.</p>
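<p>A textual version of this same kind of tree can be recovered with scikit-learn (assuming it is installed); the exact thresholds may vary slightly between library versions:</p>

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a small decision tree on the Iris data set and print its learned
# rules; sklearn.tree.plot_tree would draw the graphical version instead.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)  # e.g. "|--- petal length (cm) <= 2.45" at the root
```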



<h3 class="wp-block-heading" id="h-ensemble-model-visualization">Ensemble model visualization</h3>



<p>Ensemble approaches like random forests, AdaBoost, gradient boosting, and bagging combine multiple simpler models (called base models) into one larger, more accurate model. For example, a random forest classifier comprises many decision trees. Understanding the contributions and complex interplay of these base models is crucial when debugging and assessing ensembles.</p>



<p>One way to visualize an ensemble model is to create a diagram showing how the base models contribute to the ensemble model’s output. A common approach is to plot the base models’ decision boundaries (also called surfaces), highlighting their influence across different parts of the feature space. By examining how these decision boundaries overlap, we can learn how the base models give rise to the collective predictive power of the ensemble.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1738" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1920%2C1738&#038;ssl=1" alt="Ensemble model visualization example: how individual classifiers adapt to different data distributions by adjusting their decision boundaries. " class="wp-image-32046" style="aspect-ratio:1.1041426927502878;width:800px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1920%2C1738&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=768%2C695&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=200%2C181&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1536%2C1391&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=220%2C199&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=120%2C109&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=160%2C145&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=300%2C272&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=480%2C435&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?resize=1020%2C924&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-3.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Example of ensemble model visualization</em>: h<em>ow individual classifiers adapt to different data distributions by adjusting their decision boundaries. Darker areas signify higher confidence, i.e., the model is more confident about its prediction. Lighter areas represent regions of lower confidence | </em><a href="https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Ensemble model visualizations also help users better comprehend the weights assigned to each base model within the ensemble. Typically, base models have a strong influence in some regions of the feature space and little influence in others. However, there might also be base models that never contribute significantly to the ensemble’s output. Identifying base models with particularly low or high weights can help to make ensemble models more robust and improve their generalizability.</p>



<h3 class="wp-block-heading" id="h-visually-building-models">Visually building models</h3>



<p>Visual ML is an approach to designing machine-learning models using a low-code or no-code platform. It enables users to create and modify complex machine-learning processes, models, and outcomes through a user-friendly visual interface. Instead of retroactively generating model structure visualizations, Visual ML places them at the heart of the ML workflow.</p>



<p>In a nutshell, Visual ML platforms offer drag-and-drop model-building workflows that allow users of various backgrounds to create ML models easily. They bridge the gap between the abstract world of algorithms and our innate ability to grasp patterns and relationships through visuals.</p>



<p>These platforms can save us time and help us build model prototypes quickly. Since models can be created in minutes, training and comparing different model configurations is easy. The model which performs best can then be optimized further, perhaps using a more code-centric approach.</p>



<p>Data scientists and machine learning engineers can make use of Visual ML tools to create:</p>



<div id="case-study-numbered-list-block_fe5ee669abf2d5c7c87fe84bd2b1b5be"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Experimental prototypes            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                MLOps pipelines            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Optimized ML code for production            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Scaled versions of an existing ML model codebase for larger datasets            </li>
            </ul>
</div>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="814" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1920%2C814&#038;ssl=1" alt="A classic example of how to create ML/DL models with no code. This type of interface is agile and enables a detailed understanding of how the models work" class="wp-image-32049" style="aspect-ratio:2.3574938574938575;width:811px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1920%2C814&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=768%2C326&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=200%2C85&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1536%2C652&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=220%2C93&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=120%2C51&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=160%2C68&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=300%2C127&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=480%2C204&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?resize=1020%2C433&amp;ssl=1 1020w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-4.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>Example of how to create ML/DL models with no code. This type of interface is agile and enables a detailed understanding of how the models work</em>&nbsp;| <a href="https://playground.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Examples of Visual ML tools are <a href="https://playground.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow’s Neural Network Playground</a> and <a href="https://www.knime.com/" target="_blank" rel="noreferrer noopener nofollow">KNIME</a>, an open-source data science platform built entirely around Visual ML and No-Code concepts.</p>



<h2 class="wp-block-heading" id="h-visualize-machine-learning-model-performance">Visualize machine learning model performance</h2>



<p>In many cases, we do not care so much about how a model works internally but are interested in understanding its performance. For which kinds of samples is it reliable? Where does it frequently draw the wrong conclusions? Should we go with model A or model B?</p>



<p>In this section, we’ll look at machine learning visualizations that help us better understand a model’s performance.</p>



<h3 class="wp-block-heading" id="h-confusion-matrices">Confusion matrices</h3>



<p><a href="/blog/evaluation-metrics-binary-classification" target="_blank" rel="noreferrer noopener">Confusion matrices</a> are a fundamental tool for evaluating a classification model’s performance. A confusion matrix compares a model’s predictions with the ground truth, clearly showing what kind of samples a model misclassifies or where it struggles to distinguish between classes.&nbsp;</p>



<p>In the case of a binary classifier, a confusion matrix has just four fields: true positives, false positives, false negatives, and true negatives:</p>



<div id="medium-table-block_05d93235456530245aaf958d3c99df82"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="min-width: 250px">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="min-width: 250px">
                        <div class="c-item__inner">
                            Model predicts: 0                        </div>
                    </td>
                                    <td class="c-item"
                        style="min-width: 250px">
                        <div class="c-item__inner">
                            Model predicts: 1                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>True value: 0</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>true negative</em></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>false positive</em></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>True value: 1</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>false negative</em></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><em>true positive</em></p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>Equipped with this information, it’s straightforward to calculate essential metrics like precision, recall, F1 score, and accuracy.</p>
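For instance, given the four counts from a binary confusion matrix, the metrics follow directly from their definitions (the counts below are made up for illustration):

```python
# counts from a hypothetical binary confusion matrix
tp, fp, fn, tn = 85, 10, 5, 100

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```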



<p>The confusion matrix for a multi-class model follows the same general idea. The diagonal elements represent correctly classified instances (i.e., the model’s output matches the ground truth), while off-diagonal elements signify misclassifications.</p>



<p>Here is a small snippet to generate a confusion matrix for a scikit-learn classifier:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> make_classification
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> confusion_matrix, ConfusionMatrixDisplay
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.model_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> train_test_split
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SVC


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># generate some sample data</span>
X, y = make_classification(n_samples=<span class="hljs-number" style="color: teal;">1000</span>,
    n_features=<span class="hljs-number" style="color: teal;">10</span>,
    n_informative=<span class="hljs-number" style="color: teal;">6</span>,
    n_redundant=<span class="hljs-number" style="color: teal;">2</span>,
    n_repeated=<span class="hljs-number" style="color: teal;">2</span>,
    n_classes=<span class="hljs-number" style="color: teal;">6</span>,
    n_clusters_per_class=<span class="hljs-number" style="color: teal;">1</span>,
    random_state=<span class="hljs-number" style="color: teal;">42</span>
)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># split the data into train and test set</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=<span class="hljs-number" style="color: teal;">0</span>)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># initialize and train a classifier</span>
clf = SVC(random_state=<span class="hljs-number" style="color: teal;">0</span>)
clf.fit(X_train, y_train)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># get the model’s prediction for the test set</span>
predictions = clf.predict(X_test)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># using the model’s prediction and the true value,</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create a confusion matrix</span>
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># use the built-in visualization function to generate a plot</span>
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot()
plt.show()
</pre></code></pre>
</div>



<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="498" height="432" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=498%2C432&#038;ssl=1" alt="Visualize machine learning model performance: 6x6 confusion matrix" class="wp-image-32057" style="aspect-ratio:1.1527777777777777;width:498px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?w=498&amp;ssl=1 498w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=200%2C173&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=220%2C191&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=120%2C104&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=160%2C139&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=300%2C260&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-5.png?resize=480%2C416&amp;ssl=1 480w" sizes="auto, (max-width: 498px) 100vw, 498px" /><figcaption class="wp-element-caption">Example of model performance visualization: 6&#215;6 confusion matrix | Source: Author</figcaption></figure>
</div>


<p>Let’s have a look at the output. As mentioned before, the elements on the diagonal represent correctly classified samples, and the off-diagonal elements represent cases where the model confuses classes – hence the name “confusion matrix.”</p>



<p>Here are three key takeaways from the plot:</p>



<ol class="wp-block-list">
<li><strong>Diagonal</strong>: Ideally, the matrix&#8217;s main diagonal should be populated with the highest numbers. These numbers represent the instances where the model correctly predicted the class, aligning with the true class. Looks like our model is doing pretty well here!<br></li>



<li><strong>Off-diagonal entries</strong>: The numbers outside the main diagonal are equally important. They reveal cases where the model made errors. For example, if you look at the cell where row 5 intersects with column 3, you’ll see that there were five cases where the true class was “5”, but the model predicted class “3”. Perhaps we should look at the affected samples to better understand what’s going on here!<br></li>



<li><strong>Analyzing performance at a glance</strong>: By examining the off-diagonal entries, you can see immediately that they’re quite low. Overall, the classifier seems to do a pretty good job. You’ll also notice that we have about an equal number of samples for each category. In many real-world scenarios, this is not going to be the case. Then, generating a second confusion matrix that shows the likelihood of a correct classification (rather than the absolute number of samples) can be helpful.</li>
</ol>



<p>Visual enhancements like color gradients and percentage annotations make a confusion matrix more intuitive and easily interpretable. Confusion matrices styled like a heatmap draw attention to classes with high error rates and thus guide further model development.</p>



<p>Confusion matrices can also help non-technical stakeholders grasp a model&#8217;s strengths and weaknesses, fostering discussions about the need for additional data or cautionary measures when using model predictions for critical decisions.</p>


    <a
        href="/blog/ml-model-performance-monitoring"
        id="cta-box-related-link-block_34d58b8dce95ca5c22f7a17ac87a0e08"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
        <div class="block-cta-box-related-link__image-wrapper">
            <figure class="c-image__wrapper">

                
                <img
                    src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/blog_feature_image_051678_7_0_9_0.jpg?fit=200%2C105&amp;ssl=1"
                    loading="lazy"
                    decoding="async"
                    width="200"
                    height="105"
                    class="c-image"
                    alt="">
            </figure>
        </div>

    
    <div class="block-cta-box-related-link__description-wrapper">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-doing-ml-model-performance-monitoring-the-right-way">                Doing ML Model Performance Monitoring The Right Way            </h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-visualizing-cluster-analysis">Visualizing cluster analysis</h3>



<p>Cluster analysis groups similar data points based on specific features. Visualizing these clusters can bring to light patterns, trends, and relationships within the data.</p>



<p>Scatter plots where each point is colored according to its cluster assignment are a standard way to visualize the results of a <a href="/blog/clustering-algorithms" target="_blank" rel="noreferrer noopener">cluster analysis</a>. Cluster boundaries and their distribution across the feature space are clearly visible. Pair plots or parallel coordinates help to understand the relationships between multiple features.</p>
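A minimal sketch of such a cluster-colored scatter plot, assuming scikit-learn and matplotlib are available (the blob dataset and parameter choices are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# generate three well-separated 2-D clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# fit k-means and color each point by its cluster assignment
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x", s=100)
plt.title("k-means cluster assignments")
plt.savefig("clusters.png")
```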


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="917" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1920%2C917&#038;ssl=1" alt="Visualizing cluster analysis: two different data clusters produced by k-means clustering. You can see that in both cases, the clusters the model found (color-coded) do not match the actual clusters in the data" class="wp-image-32060" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1920%2C917&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=768%2C367&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=200%2C96&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1536%2C734&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=220%2C105&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=120%2C57&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=160%2C76&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=300%2C143&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=480%2C229&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?resize=1020%2C487&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-6.png?w=1960&amp;ssl=1 1960w" 
sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"> Example of visualizing cluster analysis: two different data clusters produced by k-means clustering. You can see that in both cases, the clusters the model found (color-coded) do not match the actual clusters in the data | <a href="https://scikit-learn.org/stable/modules/clustering.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>One popular clustering algorithm, <a href="/blog/k-means-clustering" target="_blank" rel="noreferrer noopener">k-means</a>, begins with selecting starting points called centroids. A simple approach is randomly picking k samples from the dataset.</p>



<p>Once these initial centroids are established, k-means alternates between two steps:</p>



<div id="case-study-numbered-list-block_4783b3dfdf1313185856177bdcb42497"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                It associates each sample with the nearest centroid, thereby creating clusters comprised of the samples associated with the same centroid.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                It recalibrates the centroids by averaging the values of all samples in a cluster.            </li>
            </ul>
</div>



<p>As this process continues, the centroids move, and the association of points with clusters is iteratively refined. Once the difference between the old and new centroids falls below a set threshold, signaling stability, k-means concludes.&nbsp;</p>



<p>The result is a set of centroids and clusters that you can visualize in a plot like the one above.</p>
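The alternating steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the algorithm, not a production implementation (it ignores, for example, the rare case of a cluster becoming empty):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 2-D data: two Gaussian blobs
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(3, 0.5, (100, 2))])

k = 2
# step 0: pick k random samples from the dataset as initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # step 1: associate each sample with its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # step 2: recalibrate each centroid as the mean of its cluster's samples
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

    # conclude once the centroids barely move
    if np.linalg.norm(new_centroids - centroids) < 1e-6:
        break
    centroids = new_centroids
```

Plotting `X` colored by `labels`, together with the final `centroids`, reproduces exactly the kind of visualization shown above.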



<p>For larger datasets, t-SNE (t-distributed Stochastic Neighbor Embedding) or UMAP (Uniform Manifold Approximation and Projection) can be employed to reduce dimensions while preserving cluster structures. These techniques aid in visualizing high-dimensional data effectively.&nbsp;</p>



<p>t-SNE takes complex, high-dimensional data and transforms it into a lower-dimensional representation. The algorithm starts by assigning each data point a location in the lower-dimensional space. Then, it looks at the original data and decides where each point should be placed in this new space, considering its neighboring points. Points that were similar in the high-dimensional space are pulled closer together in the new space, and those that are dissimilar are pushed apart.</p>



<p>This process repeats until the points settle into stable positions. The final result is a clustered representation where similar data points form groups, allowing us to see patterns and relationships hidden in the high-dimensional chaos. It&#8217;s like a symphony where each note finds its harmonious place, creating a beautiful composition of data.</p>
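In practice, scikit-learn's <code>TSNE</code> does all of this in one call. A minimal sketch on a subset of the digits dataset (the subset size and perplexity are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened into 64-dimensional feature vectors
digits = load_digits()
X = digits.data[:200]

# project to 2-D; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (200, 2)
```

Plotting `X_2d` as a scatter plot colored by `digits.target[:200]` reveals whether samples of the same digit end up close together in the embedding.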


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=1800%2C942&#038;ssl=1" alt="Visualizing cluster analysis: the t-SNE algorithm creates clusters from high-dimensional data in a low-dimensional space" class="wp-image-32278" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=300%2C157&amp;ssl=1 300w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/The-t-SNE-algorithm-creates-clusters-from-high-dimensional-data-in-a-low-dimensional-space.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The t-SNE algorithm creates clusters from high-dimensional data in a low-dimensional space | <a href="https://bcho.tistory.com/1210" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>UMAP also tries to find clusters in high-dimensional space but takes a different approach.</p>



<p>Here is how UMAP works:</p>



<ul class="wp-block-list">
<li><strong>Neighbor finding</strong>: UMAP begins by identifying the neighbors of each data point. It determines which points are close to each other in the original high-dimensional space.</li>



<li><strong>Fuzzy simplicial set construction</strong>: Imagine creating a web of connections between these neighboring points. UMAP models the strength of these connections based on how related or similar the points are.</li>



<li><strong>Low-dimensional layout</strong>: After determining their closeness, UMAP carefully arranges the data points in the lower-dimensional space. Points strongly connected in the high-dimensional space are placed close together in this new space.</li>



<li><strong>Optimization</strong>: UMAP aims to find the best representation in lower dimensions. It minimizes the difference between the distances in the original high-dimensional space and the new lower-dimensional space.</li>



<li><strong>Clustering</strong>: UMAP uses clustering algorithms to group similar data points. Imagine gathering similarly colored marbles together; this allows us to see patterns and structures more clearly.</li>
</ul>


    <a
        href="/blog/dimensionality-reduction"
        id="cta-box-related-link-block_efb08fded76bba8c3b8c347f3589e434"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    You may also like                </div>
            </div>

        
                    <h3 class="c-header" id="h-dimensionality-reduction-for-machine-learning">                Dimensionality Reduction for Machine Learning            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-comparative-model-analysis">Comparative model analysis</h2>



<p>Comparing different <a href="/blog/performance-metrics-in-machine-learning-complete-guide" target="_blank" rel="noreferrer noopener">model performance metrics</a> is crucial for deciding which machine learning model is best suited for a task. Whether during the experimental phase of an ML project or while re-training production models, visualizations are often necessary to turn complex numeric results into actionable insights.</p>



<p>Thus, visualizations for model performance metrics, such as ROC curves and calibration plots, are tools every data scientist and ML engineer should have in their toolbox. They are fundamental for understanding and communicating the effectiveness of machine learning models.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="567" height="432" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=567%2C432&#038;ssl=1" alt="Comparative model analysis: comparing three different models using ROC curves and the ROC-AUC metric" class="wp-image-32064" style="aspect-ratio:1.3125;width:567px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?w=567&amp;ssl=1 567w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=200%2C152&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=220%2C168&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=120%2C91&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=160%2C122&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=300%2C229&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-8.png?resize=480%2C366&amp;ssl=1 480w" sizes="auto, (max-width: 567px) 100vw, 567px" /><figcaption class="wp-element-caption"><em>Example of comparative model analysis: comparing three different models using ROC curves and the ROC-AUC metric</em> | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-roc-curves">ROC curves</h3>



<p>Receiver operating characteristic curves – ROC curves for short – are vital when analyzing machine-learning classifiers and comparing ML model performance.</p>



<p>A ROC curve plots a model&#8217;s true positive rate against its false positive rate as a function of the cutoff threshold. It depicts the trade-off between true and false positives we invariably have to make and offers insight into a model’s discriminative power.</p>



<p>A curve closer to the top-left corner signifies superior performance: The model achieves a high rate of true positives while maintaining a low rate of false positives. Comparing ROC curves helps us choose the best model.</p>



<p>Here is a step-by-step explanation of how the ROC curve works:</p>



<p>In binary classification, we are interested in predicting one of two possible outcomes, typically labeled as positive (e.g., presence of a disease) and negative (e.g., absence of a disease).</p>



<p>Remember that we can turn any classification problem into a binary one by selecting one class as the positive outcome and assigning all other classes as negative outcomes. Hence, ROC curves can still be helpful for multi-class or multi-label classification problems.</p>



<p>The axes of the ROC curve represent two metrics:<br></p>



<ul class="wp-block-list">
<li><strong>True Positive Rate (Sensitivity): </strong>The proportion of actual positive cases correctly identified by the model.</li>



<li><strong>False Positive Rate: </strong>The proportion of actual negative cases incorrectly identified as positive.<br></li>
</ul>



<p>A machine-learning classifier typically outputs the likelihood that a sample belongs to the positive class. For example, a logistic regression model outputs values between 0 and 1 that can be interpreted as the likelihood.</p>



<p>As data scientists, it’s up to us to select the threshold above which we assign the positive label. The ROC curve shows us the influence of that choice on our classifier&#8217;s performance.</p>



<p>If we set the threshold to 0, all samples will be assigned to the positive class – and the rate of false positives will be 1. Thus, in the upper right-hand corner of any ROC curve plot, you’ll see that the curve ends at (1, 1).</p>



<p>If we set the threshold to 1, no samples will ever be assigned to the positive class. But since, in this case, we never mistakenly assign a negative sample to the positive class, the rate of false positives will be 0. As you might have guessed already, that’s what we see in the lower left-hand corner of a ROC curve plot: The curve always begins at (0, 0).</p>



<p>The curve between those points is plotted by changing the threshold for classifying a sample as positive. The resulting curve – the ROC curve – reflects how the true positive rate and false positive rate change in relation to one another as this threshold varies.</p>
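<p>scikit-learn’s <code>roc_curve</code> performs exactly this threshold sweep and returns the false positive rates, true positive rates, and the thresholds it evaluated. A minimal sketch on synthetic scores:</p>

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)

# Synthetic ground truth and classifier scores: positives tend to
# receive higher scores, but the two classes overlap.
y_true = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.3, 0.15, 500),
                         rng.normal(0.7, 0.15, 500)])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# The strictest threshold assigns nothing to the positive class: (0, 0).
# The most permissive threshold assigns everything to it: (1, 1).
print(fpr[0], tpr[0], fpr[-1], tpr[-1])
```

<p>Passing <code>fpr</code> and <code>tpr</code> to matplotlib’s <code>plot</code> renders the ROC curve.</p>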



<p>But what do we learn from this?&nbsp;</p>



<p>The ROC curve shows the trade-off we must make between sensitivity (the true positive rate) and specificity (1 &#8211; false positive rate). In more colloquial terms, we can either find all the positive samples (high sensitivity) or be sure that all samples our classifier identifies as positive actually belong to the positive class (high specificity).</p>



<p>Consider a classifier that can perfectly distinguish between positive and negative samples: Its true positive rate is always 1, and its false positive rate is always 0, independent of our chosen threshold. Its ROC curve would shoot straight up from (0,0) to (0,1) and then run horizontally along the top of the plot from (0,1) to (1,1).</p>



<p>Thus, the closer the ROC curve follows the left-hand border and then the top border of the plot, the more discriminative the model and the better it can satisfy the sensitivity and specificity objectives.</p>



<p>To compare different models, we often don’t use the curve directly but compute the area under it. This quantifies the model&#8217;s overall ability to discriminate between positive and negative classes.</p>



<p>This so-called ROC-AUC (the area under the ROC curve) can take on values between 0 and 1, with higher values indicating a better performance. Indeed, our perfect classifier would reach a ROC-AUC of exactly 1.</p>



<p>When using the ROC-AUC metric, it’s essential to keep in mind that the baseline is not 0 but 0.5 – the ROC-AUC of a perfectly random classifier. If we use <em>np.random.rand()</em> as our classifier, the resulting ROC curve will be a diagonal line from (0,0) to (1,1).</p>
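<p>This baseline is easy to verify empirically. A quick sketch where the “classifier” is pure noise (using a seeded generator instead of <em>np.random.rand()</em> so the result is reproducible):</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Balanced binary labels and completely random "scores".
y_true = rng.integers(0, 2, size=100_000)
y_score = rng.random(size=100_000)

auc = roc_auc_score(y_true, y_score)
print(round(auc, 2))  # close to 0.5: no discriminative power at all
```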


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="640" height="480" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=640%2C480&#038;ssl=1" alt="Comparative model analysis: one-vs-rest ROC curves" class="wp-image-32065" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?w=640&amp;ssl=1 640w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=200%2C150&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=220%2C165&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=300%2C225&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-9.png?resize=480%2C360&amp;ssl=1 480w" sizes="auto, (max-width: 640px) 100vw, 640px" /><figcaption class="wp-element-caption"><em>Example of comparative model analysis: a random classifier’s ROC curve is diagonal, resulting in a ROC-AUC of 0.5. The ROC curve of an actual ML classifier shown in yellow always lies above that line, with a ROC-AUC of 0.78</em> | <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Generating ROC curves and computing the ROC-AUC is straightforward using scikit-learn. It takes just a few lines of code in your model training script to create this evaluation data for each of your training runs. When you log the ROC-AUC and the ROC curve plot using an <a href="/blog/best-ml-experiment-tracking-tools" target="_blank" rel="noreferrer noopener">ML experiment tracking tool</a>, you can later compare different model versions.</p>



<section
	id="i-box-block_668a790247ad9a82cab75a17ed5de9bb"
	class="block-i-box  l-margin__top--standard l-margin__bottom--standard">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Might be useful</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<div
    id="custom-text-block_38bb73a94b531cf3aadcb6755ad597a5"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--0"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 600;"
    >
    
    When visualizing, comparing, and debugging models, it&#8217;s really useful to keep an organized record of all experiments.
    </div>



<div id="group-of-boxes-block_e1093fc761fee681cc78731632af1a63" class="b-group-of-boxes  l-padding__top--large l-padding__bottom--large">

<div
    class="c-wrapper c-wrapper--align-auto c-wrapper--align-vertical-auto" >
    <div class="b-group-of-boxes__grid l-grid--cols-2  l-grid--boxes">
        

	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-center c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<p>Media intelligence company Hypefactors is using neptune.ai for that.</p>



<blockquote
	id="quote-small-block_9c7d7092ed62a7f67c6b0a7be042ea25"
	class="block-quote-small ">

	<img
		src="https://neptune.ai/wp-content/themes/neptune/img/icon-quote-small.svg"
		alt=""
		width="24"
		height="18"
		class="c-item__icon">

	
		<div class="c-item__content">

			We use Neptune for most of our tracking tasks, from experiment tracking to uploading the artifacts. A very useful part of tracking was monitoring the metrics, now we could easily see and compare those F-scores and other metrics.
							<cite class="c-item__cite">
					<p>Andrea Duque, Data Scientist at Hypefactors</p>
				</cite>
			
		</div>

	
</blockquote>


	</div>



	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-flex-start c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<div id="app-screenshot-block_1d1a377b90e8698910034bf2be497fab"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=480%2C271&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=768%2C434&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="577"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							target="_blank" rel="nofollow noopener noreferrer"							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

			
</div>


	</div>


    </div>
</div>


</div>



<ul
    id="arrow-list-block_e020710a7bce7541c5783c6af7cdf8e7"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Full <a href="/customers/hypefactors" target="_blank" rel="noreferrer noopener">case study with Hypefactors</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Dive into available <a href="/product/compare-experiments" target="_blank" rel="noreferrer noopener">comparison features</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p><a rel="noreferrer noopener" href="/contact-us" target="_blank">Get in touch</a>&nbsp;if you’d like to go through a custom demo with your team</p>


</li>


</ul>


	</div>

</section>



<p><strong>Computing and logging the ROC-AUC</strong></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> roc_auc_score


clf.fit(x_train, y_train)


y_test_pred = clf.predict_proba(x_test)
auc = roc_auc_score(y_test, y_test_pred[:, <span class="hljs-number" style="color: teal;">1</span>])


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># optional: log to an experiment-tracker like neptune.ai</span>
neptune_logger.run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"roc_auc_score"</span>].append(auc)
</pre></code></pre>
</div>




<p><strong>Creating and logging a ROC plot</strong></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> scikitplot.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> plot_roc
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> matplotlib.pyplot <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> plt


fig, ax = plt.subplots(figsize=(<span class="hljs-number" style="color: teal;">16</span>, <span class="hljs-number" style="color: teal;">12</span>))
plot_roc(y_test, y_test_pred, ax=ax)


<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># optional: log to an experiment tracker like neptune.ai</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> neptune.types <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> File
neptune_logger.run[<span class="hljs-string" style="color: rgb(221, 17, 68);">"roc_curve"</span>].upload(File.as_html(fig))
</pre></code></pre>
</div>




<h3 class="wp-block-heading" id="h-calibration-curves">Calibration curves</h3>



<p>While machine-learning classifiers typically output values between 0 and 1 for each class, these values do not represent a likelihood or confidence in the statistical sense. That’s perfectly fine in many cases because we’re only interested in obtaining the correct labels.</p>



<p>But if we want to report a confidence level along with the classification outcome, we must ensure our classifier is calibrated. Calibration curves are a helpful visual aid to understand how well a classifier is calibrated. We can also use them to compare different models or to check that our attempts to re-calibrate a model were successful.</p>



<p>Let’s again consider the case of a model that outputs values between 0 and 1. If we choose a threshold, say 0.5, we can turn this into a binary classifier where all samples for which the model outputs a higher value are assigned to the positive class (and vice versa).</p>



<p>A calibration curve plots the “fraction of positives” against the model’s output. The “fraction of positives” is the conditional probability that a sample actually belongs to the positive class, given the model’s output: P(sample is positive | model output).</p>



<p>Does that sound way too abstract? Let’s look at an example:</p>


<div class="wp-block-image">
<figure class="aligncenter"><img decoding="async" src="https://lh7-us.googleusercontent.com/NUJixw3A_NlXfWsxFbvxwP9tuuXy6lezTUtzDEr8jKSf1haKuZjyEaJLWmhPzvmTHAxTjh9BSsQyrEK7I9Cl0chw2k-SHtNFiSs38cJtcYRzhfNDcE4c9cXhOHzmpQKxvoHacGvVFeR4EC47RA7xzK0" alt="Calibration curves: comparing different models |  Source: Author"/><figcaption class="wp-element-caption">Example of calibration curves: comparing different models |&nbsp; Source: Author</figcaption></figure>
</div>


<p>First, have a look at the diagonal line. It represents a perfectly calibrated classifier: The model’s output between 0 and 1 is precisely the probability that a sample belongs to the positive class. For example, if the model outputs 0.5, there’s a 50:50 chance the sample belongs to either the positive or negative class. If the model outputs 0.2 for a sample, there is only a 20% chance that the sample belongs to the positive class.</p>



<p>Next, consider the calibration curve for the Naive Bayes classifier: You see that even when this model outputs 0, there is about a 10% chance that the sample is positive. If the model outputs 0.8, there’s still a 50% chance that the sample belongs to the negative class. Hence, the classifier’s output does not reflect its confidence.</p>



<p>Computing the “fraction of positives” is far from straightforward. We need to create bins based on the model’s outputs, which is complicated by the fact that the distribution of samples across the model’s value range is typically not homogeneous. For example, a logistic regression classifier typically assigns values close to 0 or 1 to many samples but rarely outputs values close to 0.5. You can find a more in-depth discussion of this topic in the <a href="https://scikit-learn.org/stable/modules/calibration.html#calibration-curves" target="_blank" rel="noreferrer noopener nofollow">scikit-learn documentation</a>. There, you can also dive into possible ways to re-calibrate models, which is beyond the scope of this article.</p>
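<p>scikit-learn wraps this binning logic in <code>calibration_curve</code>, which returns the observed fraction of positives and the mean predicted probability per bin. A sketch with a Gaussian Naive Bayes classifier on synthetic data:</p>

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

# For each of up to 10 bins: observed fraction of positives and the
# mean predicted probability of the samples falling into that bin.
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
```

<p>Plotting <code>frac_pos</code> against <code>mean_pred</code>, together with the diagonal, produces a calibration curve like the ones in the figure above.</p>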



<p>For our purposes here, we’ve seen how calibration curves visualize complex model behavior in an easy-to-grasp fashion. From a quick glance at the plot, we can see whether models are well-calibrated and which comes closest to the ideal.</p>



<h3 class="wp-block-heading" id="h-visualizing-hyperparameter-tuning">Visualizing hyperparameter tuning</h3>



<p><a href="/blog/hyperparameter-tuning-in-python-complete-guide" target="_blank" rel="noreferrer noopener">Hyperparameter tuning</a> is a critical step in developing a machine-learning model. The aim is to select the best configuration of hyperparameters – a generic name for parameters not learned by the model from the data but pre-defined by its human creators. Visualizations can aid data scientists in understanding the impact of different hyperparameters on a model’s performance and properties.</p>



<p>Finding the optimal configuration of hyperparameters is a skill on its own and goes far beyond the machine learning visualization aspect we will focus on here. To learn more about hyperparameter tuning in all its depth, I recommend this article on <a href="/blog/improving-ml-model-performance" target="_blank" rel="noreferrer noopener">improving ML model performance</a> by a former Amazon AI researcher.&nbsp;</p>



<p>A common approach to systematic hyperparameter optimization is creating a list of possible parameter combinations and training a model for each. This is often referred to as “grid search.”</p>



<p>For instance, if you are training a Support Vector Machine (SVM), you might want to try out different values for the parameters <em>C</em> (the regularization parameter) and <em>gamma</em> (the kernel coefficient):</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
 
C_range = np.logspace(<span class="hljs-number" style="color: teal;">-2</span>, <span class="hljs-number" style="color: teal;">10</span>, <span class="hljs-number" style="color: teal;">13</span>)
gamma_range = np.logspace(<span class="hljs-number" style="color: teal;">-9</span>, <span class="hljs-number" style="color: teal;">3</span>, <span class="hljs-number" style="color: teal;">13</span>)

param_grid = {"gamma": gamma_range, "C": C_range}
</pre></code></pre>
</div>




<p>Using scikit-learn’s <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn-model-selection-gridsearchcv" target="_blank" rel="noreferrer noopener nofollow">GridSearchCV</a>, you can train models for each possible combination (using a cross-validation strategy) and find the best one with respect to an evaluation metric:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.model_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> GridSearchCV
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SVC

grid = GridSearchCV(SVC(), param_grid=param_grid, scoring=<span class="hljs-string" style="color: rgb(221, 17, 68);">"accuracy"</span>)
grid.fit(X, y)
</pre></code></pre>
</div>




<p>After the grid search concludes, you can inspect the results:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">print(
<span class="hljs-string" style="color: rgb(221, 17, 68);">"The best parameters are %s with a score of %0.2f"</span>
% (grid.best_params_, grid.best_score_)
)
</pre></code></pre>
</div>




<p>But we’re usually not just interested in finding the best model but also want to understand the effect its parameters have. For example, if a parameter does not influence the model’s performance, we don’t need to waste time and money by trying out even more different values. On the other hand, if we see that as a parameter’s value increases, the model’s performance gets better, we might want to try even higher values for this parameter.</p>



<p>Here’s a visualization of the grid search we just performed:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="800" height="600" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=800%2C600&#038;ssl=1" alt="Visualization of the grid search: how SVM classifiers trained with different values of gamma and C perform on a test set" class="wp-image-32076" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?w=800&amp;ssl=1 800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=768%2C576&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=200%2C150&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=220%2C165&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=300%2C225&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-11.png?resize=480%2C360&amp;ssl=1 480w" sizes="auto, (max-width: 800px) 100vw, 800px" /><figcaption class="wp-element-caption">Example of visualization of the grid search: how SVM classifiers trained with different values of <em>gamma</em> and <em>C</em> perform on a test set | <a href="https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html#sphx-glr-auto-examples-svm-plot-rbf-parameters-py" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>From the plot, we see that the value of <em>gamma</em> greatly influences the SVM’s performance. If <em>gamma</em> is set too high, the influence radius of support vectors is minimal, potentially causing overfitting even with substantial regularization through <em>C</em>. Conversely, an extremely small <em>gamma</em> overly restricts the model, making it incapable of capturing the intricacies of the patterns within the data. In this scenario, the influence region of any support vector spans the entire training set, rendering the model akin to a linear one, using hyperplanes to separate dense areas of different classes.</p>



<p>The best models lie along a diagonal line of <em>C</em> and <em>gamma</em>, as depicted in the second plot panel. By adjusting <em>gamma</em> (lower values for smoother models) and increasing <em>C</em> (higher values for greater emphasis on correct classification), we can traverse this diagonal to achieve well-performing models.</p>



<p>Even from this simple example, you can see how helpful visualizations are for drilling down into the root causes of differences in model performance. This is why many machine-learning <a href="https://mlflow.org/docs/latest/tracking.html#tracking-ui" target="_blank" rel="noreferrer noopener nofollow">experiment tracking tools</a> enable data scientists to create different types of visualizations to compare model versions.</p>
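<p>A plot like this can be recreated in a few lines with scikit-learn and Matplotlib. The sketch below is illustrative only — the dataset, parameter ranges, and colormap are our own choices, not the exact setup behind the figure. It runs a small grid search over <em>C</em> and <em>gamma</em> and renders the mean cross-validation scores as a heatmap:</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative parameter ranges; real searches often span wider log scales
C_range = np.logspace(-2, 2, 5)
gamma_range = np.logspace(-4, 0, 5)
search = GridSearchCV(SVC(), {"C": C_range, "gamma": gamma_range}, cv=3)
search.fit(X, y)

# cv_results_ holds one mean score per (C, gamma) combination, with gamma
# varying fastest, so the scores reshape into a (C, gamma) matrix
scores = search.cv_results_["mean_test_score"].reshape(len(C_range), len(gamma_range))

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="viridis")
ax.set_xticks(range(len(gamma_range)))
ax.set_xticklabels([f"{g:.0e}" for g in gamma_range])
ax.set_yticks(range(len(C_range)))
ax.set_yticklabels([f"{c:.0e}" for c in C_range])
ax.set_xlabel("gamma")
ax.set_ylabel("C")
fig.colorbar(im, ax=ax, label="mean CV accuracy")
fig.savefig("grid_search_heatmap.png")
```

<p>The score matrix has one row per <em>C</em> value and one column per <em>gamma</em> value, so a diagonal band of well-performing models, as discussed above, is directly visible in the heatmap.</p>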


    <a
        href="/product/compare-experiments"
        id="cta-box-related-link-block_ec6c11828540af998fe8ab987e5ebe39"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
        <div class="block-cta-box-related-link__image-wrapper">
            <figure class="c-image__wrapper">

                
                <img
                    src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/09/compare_tab.png?fit=200%2C90&amp;ssl=1"
                    loading="lazy"
                    decoding="async"
                    width="200"
                    height="90"
                    class="c-image"
                    alt="">
            </figure>
        </div>

    
    <div class="block-cta-box-related-link__description-wrapper">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--resource.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    May be useful                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-compare-multiple-runs">How to Compare Multiple Runs</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Learn more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-feature-importance-visualization">Feature importance visualization</h2>



<p>Feature importance visualizations provide a clear and intuitive way to grasp the contribution of each feature in the model&#8217;s decision-making process. Understanding which features significantly influence predictions is paramount in many applications.</p>



<p>Plenty of different approaches to extracting insights about feature importance from machine-learning models exist. Broadly speaking, we can divide them into two categories:</p>



<ul class="wp-block-list">
<li>Some kinds of models, like decision trees and random forests, inherently contain feature importance information as part of their model structure. All we need to do is extract and visualize it.</li>

<li>Most machine-learning models in use today do not provide feature importance information out of the box. We have to use statistical techniques and algorithmic approaches to uncover the influence of each input feature on the model’s final output.</li>
</ul>



<p>In the following, we’ll look at one example of each category: the mean decrease in impurity approach for random forest models and the model-agnostic LIME interpretability method. Other approaches you might want to look into include <a href="https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#tree-s-feature-importance-from-mean-decrease-in-impurity-mdi" target="_blank" rel="noreferrer noopener nofollow">permutation importance</a>, <a href="/blog/shap-values" target="_blank" rel="noreferrer noopener">SHAP</a>, and integrated gradients.</p>



<p>For the purpose of this article, we don’t care so much about how to obtain feature-importance data but about its visualization. To this end, bar charts are the top choice for structured data, with the length of each bar signifying the feature’s importance. Heatmaps are a clear favorite for images, and for text data, highlighting the most important words or phrases is typical.</p>



<p>In a business context, feature importance visualization is an invaluable tool for stakeholder communication. It provides a straightforward narrative, demonstrating the factors that predominantly influence predictions. This transparency enhances decision-making and can foster trust in the model&#8217;s outcomes.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="629" height="470" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=629%2C470&#038;ssl=1" alt="Feature importance visualization example: using the mean decrease in impurity method" class="wp-image-32077" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?w=629&amp;ssl=1 629w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=200%2C149&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=220%2C164&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=120%2C90&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=160%2C120&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=300%2C224&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-12.png?resize=480%2C359&amp;ssl=1 480w" sizes="auto, (max-width: 629px) 100vw, 629px" /><figcaption class="wp-element-caption"><strong>Example of feature importance visualization, using the mean decrease in impurity method</strong> | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-mean-decrease-in-impurity">Mean decrease in impurity</h3>



<p>The mean decrease in impurity is a measure of each feature&#8217;s contribution to a decision tree&#8217;s performance. To understand this, we’ll first need to understand what “impurity” means in this context.</p>



<p>We’ll start with an analogy:</p>



<ul class="wp-block-list">
<li>Let’s say we have a fruit basket with apples, pears, and oranges. When the pieces of fruit are in the basket, they’re thoroughly mixed, and we could say this set has a high <strong>impurity</strong>.</li>



<li>Now, our task is to sort them by kind. If we put all the apples into a bowl, place the oranges on a tray, and leave the pears in the basket, we would be left with three sets that have perfect <strong>purity</strong>.</li>



<li>But here comes the twist: We cannot see the fruits while making our decision. For each piece of fruit, we are told its color, diameter, and weight. Then, we need to decide where it should go. Thus, these three properties are our <strong>features</strong>.</li>



<li>The weight and the diameter of the pieces of fruit will be very similar. They won’t help us much in sorting – or, to say it differently, they are unhelpful in <strong>decreasing the impurity</strong>.</li>



<li>But the color will be helpful. We might still struggle to distinguish between green or yellow apples and green or yellow pears, but if we learn that the color is red or orange, we can confidently make a decision. Thus, the “color” will give us the biggest <strong>mean decrease in impurity</strong>.</li>
</ul>



<p>Now, let’s use this analogy in the context of decision trees and random forests:</p>



<p>When building a decision tree, we want each node to be as pure as possible regarding the target variable. In more colloquial terms, when creating a new node for our tree, we aim to find the feature that best splits the samples that reach the node into two distinct sets so that samples with the same label are in the same set. (For the full mathematical details, see the <a href="https://scikit-learn.org/stable/modules/tree.html#mathematical-formulation" target="_blank" rel="noreferrer noopener nofollow">scikit-learn documentation</a>).</p>



<p>Each node in a decision tree reduces the impurity – roughly speaking, it helps sort the training samples by their target label. If a feature is the decision criterion in many nodes of the tree and is effective in cleanly dividing the samples, it will be responsible for a large share of the overall reduction in impurity the decision tree achieves. That’s why the “mean decrease in impurity” a feature is responsible for is a good measure of that feature’s importance.</p>



<p>Whew, that was a lot of complicated math and terminology!</p>



<p>Luckily, the visualizations are not quite that difficult to read. We can clearly identify our model’s primary drivers and use that information in feature selection. Reducing a model’s input space to just the most decisive features reduces its complexity and can prevent overfitting.</p>
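<p>As a concrete sketch, a bar chart like the one above can be produced with scikit-learn, whose tree ensembles expose exactly this mean decrease in impurity through the <code>feature_importances_</code> attribute. The wine dataset here is an illustrative stand-in, not the data behind the figure:</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Illustrative dataset; any tabular classification task works the same way
data = load_wine()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# feature_importances_ is the impurity decrease attributed to each feature,
# averaged over all trees and normalized to sum to one
importances = forest.feature_importances_
order = np.argsort(importances)

fig, ax = plt.subplots()
ax.barh(np.array(data.feature_names)[order], importances[order])
ax.set_xlabel("Mean decrease in impurity")
fig.tight_layout()
fig.savefig("feature_importance.png")
```

<p>Sorting the bars before plotting makes the ranking immediately readable, which is usually the point of the chart.</p>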



<p>Additionally, understanding feature importance informs data preparation. Features with low importance might be candidates for removal or consolidation, streamlining the input data preprocessing.</p>



<p>There’s one important caveat, though, that I would like to mention before we move on. Since a node’s decrease in impurity is determined during training, using the training data set, the “mean decrease in impurity” doesn’t necessarily translate to previously unseen test data:</p>



<p>Consider the case that our training samples are numbered, and this number is an input feature for our model. Then, if our decision tree is complex enough, it can just learn which sample has which label (e.g., “fruit 1 is an orange”, “fruit 2 is an apple”, …). The mean decrease in impurity for the number feature will be massive, and it will appear as a highly important feature in our visualization, even though it’s entirely useless when applying our model to data it has not seen before.</p>



<h3 class="wp-block-heading" id="h-local-interpretable-model-agnostic-explanations-lime">Local interpretable model-agnostic explanations (LIME)</h3>



<p>Local interpretability approaches aim to shed light on a model’s behavior in a specific instance. (The opposite is global interpretability, where a model’s behavior across its entire feature space is examined.)&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="1009" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1920%2C1009&#038;ssl=1" alt="Local interpretable model-agnostic explanations (LIME) example: yielding important features" class="wp-image-32079" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1920%2C1009&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=768%2C403&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1536%2C807&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=220%2C116&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=300%2C158&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=480%2C252&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?resize=1020%2C536&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/how-to-visualize-machine-learning-models-13.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Example of 
local interpretable model-agnostic explanations (LIME) yielding the most important features | Source: Author</figcaption></figure>
</div>


<p>One of the oldest and still widely used techniques is <a href="https://github.com/marcotcr/lime" target="_blank" rel="noreferrer noopener nofollow">LIME</a> (Local Interpretable Model-agnostic Explanations). To uncover the contributions of each input feature to the model’s prediction, a linear model is fitted that approximates the model’s behavior in the particular area of the feature space. Roughly speaking, the linear model’s coefficients reflect the importance of the input features. The result can be visualized as a feature importance plot, highlighting the most influential features for a particular prediction.</p>
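<p>The core idea can be sketched without the <code>lime</code> package itself: perturb the instance, weight the perturbed samples by their proximity to it, and fit a weighted linear surrogate to the black-box model’s predictions. Everything below – the dataset, the noise scale, and the kernel width – is an illustrative assumption, not the exact LIME algorithm:</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
instance = X[0]  # the single prediction we want to explain

# 1. Perturb the instance: sample points in its local neighborhood
perturbed = instance + rng.normal(scale=X.std(axis=0) * 0.5, size=(500, X.shape[1]))

# 2. Weight samples by proximity to the original instance (RBF kernel;
#    the kernel width of 5.0 is an ad-hoc choice)
distances = np.linalg.norm((perturbed - instance) / X.std(axis=0), axis=1)
weights = np.exp(-(distances ** 2) / (2 * 5.0 ** 2))

# 3. Fit a weighted linear surrogate to the black-box model's predictions
targets = model.predict_proba(perturbed)[:, 1]
surrogate = Ridge(alpha=1.0).fit(perturbed, targets, sample_weight=weights)

# The surrogate's coefficients approximate local feature importance
top = np.argsort(np.abs(surrogate.coef_))[::-1][:5]
```

<p>The coefficients in <code>surrogate.coef_</code> are what a LIME-style bar chart visualizes: the sign shows whether a feature pushes the local prediction up or down, and the magnitude shows how strongly.</p>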



<p>Local interpretability techniques can extract intuitive insights from complex algorithms. Visualization of these results can support discussions with business stakeholders or be the foundation for cross-checking a model’s learned behavior with domain experts. They provide practical, actionable insights, enhance trust in a model’s intricate inner workings, and can be a vital tool for promoting machine learning adoption.</p>


    <a
        href="/blog/improving-ml-model-performance#h-how-to-improve-ml-model-performance-through-feature-engineering"
        id="cta-box-related-link-block_02eaad9c29306ad2e3e8cbd428a16db4"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-improve-ml-model-performance-best-practices-from-ex-amazon-ai-researcher">How to Improve ML Model Performance [Best Practices From Ex-Amazon AI Researcher]</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-how-to-adopt-model-visualization-in-machine-learning">How to adopt model visualization in machine learning?</h2>



<p>In this section, I’ll share tips on seamlessly integrating model visualization into your daily data science and machine learning routines.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=1800%2C942&#038;ssl=1" alt="How to adopt model visualization in machine learning" class="wp-image-32282" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/10/How-to-adopt-model-visualization-in-machine-learning.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">How to adopt model visualization in machine learning? | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-1-start-with-a-clear-purpose">1. Start with a clear purpose</h3>



<p>Before diving into model visualization, establish a clear purpose. Ask yourself, &#8220;What specific goals do I aim to achieve through visualization?&#8221;</p>



<p>Are you seeking to …</p>



<ul class="wp-block-list">
<li>… improve model performance?</li>



<li>… enhance interpretability?</li>



<li>… better communicate results to stakeholders?</li>
</ul>



<p>Defining your objectives will provide the direction needed for effective visualization.</p>



<h3 class="wp-block-heading" id="h-2-choosing-the-appropriate-visualization">2. Choose the appropriate visualization</h3>



<p>Always take a top-down approach: start at a very abstract level, then drill deeper for more detailed insights.</p>



<p>For instance, if you are seeking to improve the model’s performance, start with simple approaches, like plotting the model’s accuracy and loss as line plots.</p>



<p>Let’s assume that your model is overfitting. Then, you can use feature importance techniques to rank features based on their contribution to model performance. You can plot these feature importance scores to visualize the most influential features in the model. Features with high importance might point to overfitting and information leakage.</p>



<p>Likewise, you can create partial dependence plots (PDPs) for relevant features. PDPs show how the model’s prediction of the target variable changes as a specific feature varies while all other features are held constant. Look for erratic behavior or sharp fluctuations in the curve, which could indicate overfitting due to that feature.</p>



<h3 class="wp-block-heading" id="h-3-select-the-right-tools">3. Select the right tools</h3>



<p>Selecting the right tools depends on the task at hand and the features the tools offer. Python offers a plethora of libraries like <a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener nofollow">Matplotlib</a>, <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener nofollow">Seaborn</a>, and <a href="https://plotly.com/python/" target="_blank" rel="noreferrer noopener nofollow">Plotly</a> for creating static and interactive visualizations. Framework-specific tools, such as <a href="https://www.tensorflow.org/tensorboard" target="_blank" rel="noreferrer noopener nofollow">TensorBoard</a> for TensorFlow and <a href="https://scikit-plot.readthedocs.io/" target="_blank" rel="noreferrer noopener nofollow">scikit-plot</a> for scikit-learn, can be invaluable for model-specific visualizations.</p>


    <a
        href="/blog/the-best-tools-for-machine-learning-model-visualization"
        id="cta-box-related-link-block_dad6996e76da426a88d471b68e16edc3"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
        <div class="block-cta-box-related-link__image-wrapper">
            <figure class="c-image__wrapper">

                
                <img
                    src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/blog_feature_image_045530_5_1_6_4.jpg?fit=200%2C105&amp;ssl=1"
                    loading="lazy"
                    decoding="async"
                    width="200"
                    height="105"
                    class="c-image"
                    alt="">
            </figure>
        </div>

    
    <div class="block-cta-box-related-link__description-wrapper">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
                    <h3 class="c-header" id="h-the-best-tools-for-machine-learning-model-visualization">The Best Tools for Machine Learning Model Visualization</h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-4-iterate-and-improve">4. Iterate and improve</h3>



<p>Remember that model visualization is an iterative process. Continuously refine your visualizations based on feedback from your team and the stakeholders you present them to. The ultimate goal is to make your models transparent, interpretable, and accessible to all stakeholders. Their input and evolving project requirements might mean you need to reconsider and adapt your approach.</p>



<p>Incorporating model visualization into your daily data science or machine learning practice empowers you to make data-driven decisions with clarity and confidence. Whether you&#8217;re a data scientist, a domain expert, or a decision-maker, adopting model visualization as a routine practice is a pivotal step in harnessing the <a href="/blog/how-to-make-machine-learning-project-more-likely-to-succeed" target="_blank" rel="noreferrer noopener">full potential of your machine-learning projects</a>.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Effective machine-learning model visualization is an indispensable tool for any data scientist. It empowers practitioners to gain insights, make informed decisions, and communicate results transparently.</p>



<p>In this article, we covered a lot of information about how we visualize machine learning models. To conclude, here are some key takeaways:</p>



<p><strong>Purpose of visualization in machine learning</strong>:</p>



<ul class="wp-block-list">
<li>Visualizations simplify complex ML model structures and data patterns for better understanding.</li>



<li>Interactive visualizations and Visual ML tools empower users to dynamically interact with data and models. They can tweak parameters, zoom in on details, and better understand the ML system.</li>



<li>Visualizations foster informed decision-making and effective communication of results.</li>
</ul>



<p><strong>Types of machine learning visualizations:</strong></p>



<ul class="wp-block-list">
<li>Model structure visualizations help data scientists, AI researchers, and business stakeholders understand complex algorithms and data flows.</li>



<li>Model performance visualizations provide insight into the performance characteristics of individual models and model ensembles.</li>



<li>Visualizations for comparative model analysis aid practitioners in selecting the best-performing model or verifying that a new model version is an improvement.</li>



<li>Feature importance visualizations uncover each input feature’s influence on a model’s output.</li>
</ul>



<p><strong>Best practices for adopting model visualization</strong>:</p>



<ul class="wp-block-list">
<li>Start with defined objectives and simple visualizations.</li>



<li>Choose an appropriate visualization method that suits your needs and is accessible to the intended audience.&nbsp;</li>



<li>Select the proper tools and libraries that help you craft accurate visualizations efficiently.</li>



<li>Continuously listen to feedback and adapt your visualizations to your stakeholders’ needs.</li>
</ul>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5831</post-id>	</item>
		<item>
		<title>Self-Driving Cars With Convolutional Neural Networks (CNN)</title>
		<link>https://neptune.ai/blog/self-driving-cars-with-convolutional-neural-networks-cnn</link>
		
		<dc:creator><![CDATA[Nilesh Barla]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:17:42 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/self-driving-cars-with-convolutional-neural-networks-cnn/</guid>

					<description><![CDATA[Humanity has been waiting for self-driving cars for several decades. Thanks to the extremely fast evolution of technology, this idea recently went from “possible” to “commercially available in a Tesla”. Deep learning is one of the main technologies that enabled self-driving. It’s a versatile tool that can solve almost any problem &#8211; it can be&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Humanity has been waiting for self-driving cars for several decades. Thanks to the extremely fast evolution of technology, this idea recently went from “possible” to “commercially available in a Tesla”.</p>



<p>Deep learning is one of the main technologies that enabled self-driving. It’s a versatile tool that can tackle almost any science or engineering problem &#8211; it can be used in physics, for example, to study the <a href="https://arxiv.org/pdf/2006.10159.pdf" target="_blank" rel="noreferrer noopener nofollow">proton-proton collision</a> in the Large Hadron Collider, just as well as in <a href="https://lens.google">Google Lens</a> to classify pictures.&nbsp;</p>



<p>In this article, we’ll focus on deep learning algorithms in self-driving cars &#8211; <strong>convolutional neural networks </strong>(CNN). CNN is the primary algorithm that these systems use to recognize and classify different parts of the road, and to make appropriate decisions.&nbsp;&nbsp;</p>
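<p>To see why CNNs suit this task, it helps to look at their core operation: sliding a small filter over the image to produce a feature map. The toy sketch below is purely illustrative &#8211; the image and the hand-picked Sobel filter are our own assumptions, since a real CNN learns its filters during training &#8211; and shows how a vertical-edge filter responds to a lane-like stripe:</p>

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in most DL frameworks)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 8x8 "image" with a bright vertical stripe, a stand-in for a lane marking
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# A hand-picked Sobel filter that responds to vertical edges; a CNN learns
# many such filters from data instead of using fixed ones
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, sobel_x)  # strong responses mark the stripe's edges
```

<p>Stacking many such filters, interleaved with nonlinearities and pooling, is what lets a CNN build up from raw edges to lane boundaries, signs, and vehicles.</p>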



<p>Along the way, we’ll see how Tesla, Waymo, and Nvidia use CNN algorithms to make their cars driverless or autonomous.&nbsp;</p>



<section id="blog-intext-cta-block_e36bfd5a646cb391a731c5cc3e4446cd" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-you-may-also-like">You may also like</h3>
    
            <p><a href="/customers/waabi" target="_blank" rel="noopener">Experiment Tracking for Systems Powering Self-Driving Vehicles [Case Study with Waabi]</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-how-do-self-driving-cars-work">How do self-driving cars work?</h2>



<p>One of the first self-driving cars was created in 1989: the <strong>Autonomous Land Vehicle In a Neural Network</strong> (ALVINN). It used neural networks to detect lines, segment the environment, navigate itself, and drive. It worked well, but it was limited by slow processing power and insufficient data.</p>



<p>With today’s high-performance graphics cards, processors, and huge amounts of data, self-driving is more powerful than ever. If it becomes mainstream, it will reduce traffic congestion and increase road safety.&nbsp;</p>



<p>Self-driving cars are autonomous decision-making systems. They process streams of data from different sensors such as cameras, LiDAR, RADAR, GPS, and inertial sensors. Deep learning algorithms then model this data and make decisions relevant to the environment the car is in.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-pipeline.png?ssl=1" alt="Self driving cars - pipeline" class="wp-image-50843" style="width:849px;height:276px"/><figcaption class="wp-element-caption"><em> A modular perception-planning-action pipeline |  <a href="https://arxiv.org/pdf/1910.07738.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows a modular <strong>perception-planning-action</strong> pipeline used to make driving decisions. The key components of this method are the different sensors that fetch data from the environment.&nbsp;</p>



<p>To understand the workings of self-driving cars, we need to examine the four main parts:</p>



<ol class="wp-block-list">
<li><strong>Perception&nbsp;</strong></li>



<li><strong>Localization</strong></li>



<li><strong>Prediction</strong></li>



<li><strong>Decision Making</strong>
<ol class="wp-block-list">
<li>High-level path planning&nbsp;</li>



<li>Behaviour Arbitration</li>



<li>Motion Controllers</li>
</ol>
</li>
</ol>



<h3 class="wp-block-heading" id="perception">Perception&nbsp;</h3>



<p>One of the most important properties that self-driving cars must have is <strong>perception</strong>, which helps the car see the world around itself, as well as recognize and classify the things that it sees. In order to make good decisions, the car needs to recognize objects instantly.</p>



<p>So, the car needs to see and classify traffic lights, pedestrians, road signs, walkways, parking spots, lanes, and much more. Not only that, it also needs to know the exact distance between itself and the objects around it. Perception is more than just seeing and classifying; it enables the system to evaluate distances and decide whether to slow down or brake.&nbsp;</p>



<p>To achieve such a high level of perception, a self-driving car must have three sensors:</p>



<ol class="wp-block-list">
<li>Camera</li>



<li>LiDAR</li>



<li>RADAR</li>
</ol>



<h4 class="wp-block-heading">Camera</h4>



<p>The camera provides vision to the car, enabling multiple tasks like <strong>classification, segmentation, </strong>and <strong>localization</strong>. The cameras need to be high-resolution and represent the environment accurately.</p>



<p>To make sure that the car receives visual information from every side &#8211; front, back, left, and right &#8211; the images from multiple cameras are stitched together into a 360-degree view of the environment. These cameras provide both a long-range view as far as 200 meters and a short-range view for more focused perception.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-camera.png?ssl=1" alt="Self driving cars - camera" class="wp-image-50844"/><figcaption class="wp-element-caption"><em>Self-driving car&#8217;s camera | <a href="https://heartbeat.fritz.ai/computer-vision-at-tesla-cd5e88074376" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>In some tasks like parking, the camera also provides a panoramic view for better decision-making.&nbsp;</p>



<p>Even though cameras handle most perception-related tasks, they’re hardly of any use in extreme conditions like fog or heavy rain, and especially at night. In such conditions, all that cameras capture is noise and discrepancies, which can be life-threatening.&nbsp;</p>



<p>To overcome these limitations, we need sensors that can work without light and also measure distance.</p>



<h4 class="wp-block-heading">LiDAR</h4>



<p>LiDAR stands for Light Detection And Ranging. It measures the distance to objects by firing a laser beam and timing how long it takes for the reflection to return.</p>



<p>A camera can only provide the car with images of what’s going around itself. When it’s combined with the LiDAR sensor, it gains depth in the images &#8211; it suddenly has a 3D perception of what’s going on around the car.&nbsp;</p>



<p>So, LiDAR perceives <strong>spatial information</strong>. And when this data is fed into deep neural networks, the car can predict the actions of the objects or vehicles close to it. This sort of technology is very useful in a complex driving scenario, like a multi-exit intersection, where the car can analyze all other cars and make the appropriate, safest decision.</p>



<figure class="wp-block-video aligncenter"><video height="338" style="aspect-ratio: 1200 / 338;" width="1200" autoplay controls loop muted src="https://neptune.ai/wp-content/uploads/2022/11/Self-driving-car-LiDAR.mp4"></video><figcaption class="wp-element-caption"><em>Object detection with LiDAR | <a href="https://shangzhouye.tech/other-projects/deeplidar_detection_tracking/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>



<p>In 2019, Elon Musk openly stated that “<a href="https://www.youtube.com/watch?v=HM23sjhtk4Q" target="_blank" rel="noreferrer noopener nofollow">anyone relying on LiDARs are doomed…</a>”. Why? Well, LiDARs have limitations that can be catastrophic. For example, the LiDAR sensor uses lasers or light to measure the distance of the nearby object. It will work at night and in dark environments, but it can still fail when there’s noise from rain or fog. That’s why we also need a RADAR sensor.</p>



<h4 class="wp-block-heading">RADARs</h4>



<p>Radio detection and ranging (RADAR) is a key component in many military and consumer applications. It was first used by the military to detect objects. It calculates distance using <strong>radio wave signals</strong>. Today, it’s used in many vehicles and has become a primary component of the self-driving car.&nbsp;</p>



<p>RADARs are highly effective because they use radio waves instead of lasers, so they work in any conditions.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar-vs-radar.png?ssl=1" alt="Self driving cars - lidar vs radar" class="wp-image-50847" style="width:840px;height:392px"/><figcaption class="wp-element-caption"><em><a href="https://cleantechnica.com/2016/07/29/tesla-google-disagree-lidar-right/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>It’s important to understand that radars are noisy sensors: the radar may report obstacles even where the camera sees none.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar.png?ssl=1" alt="Self driving cars - lidar" class="wp-image-50848" style="width:504px;height:474px"/></a><figcaption class="wp-element-caption"><em><a href="https://www.youtube.com/watch?v=6lG6B4tkCEk" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows the self-driving car (in green) using LiDAR to detect objects around and to calculate the distance and shape of the object. Compare the same scene, but captured with the RADAR sensor below, and you can see a lot of unnecessary noise.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-radar.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-radar.png?ssl=1" alt="Self driving cars - radar" class="wp-image-50849" style="width:512px;height:488px"/></a><figcaption class="wp-element-caption"><em><a href="https://www.youtube.com/watch?v=6lG6B4tkCEk" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The RADAR data should be cleaned in order to make good decisions and predictions. We need to separate weak signals from strong ones; this is called <strong>thresholding</strong>. We also use <strong>Fast Fourier Transforms</strong> (FFT) to filter and interpret the signal.&nbsp;</p>
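<p>To make these two steps concrete, here is a minimal sketch on a synthetic signal: a naive DFT (an FFT library computes the same spectrum far faster) followed by thresholding to separate the strong return from weak wideband noise. The signal and the threshold value are purely illustrative, not a real radar pipeline.</p>

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive DFT; an FFT computes the same spectrum faster."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# Synthetic signal: one strong sinusoidal return plus weak alternating noise.
n = 32
signal = [math.sin(2 * math.pi * 4 * t / n) + 0.05 * (-1) ** t for t in range(n)]

mags = dft_magnitudes(signal)
# Thresholding: keep only the frequency bins well above the noise floor.
strong_bins = [k for k, m in enumerate(mags) if m > 0.5 * max(mags)]
```

<p>The strong return shows up in two symmetric frequency bins (4 and 28 here), while the weak noise stays below the threshold and is discarded.</p>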



<p>If you look at the images above, you’ll notice that the RADAR and LiDAR signals are point-based data. This data should be clustered so that it can be interpreted properly. Clustering algorithms such as <strong>Euclidean clustering</strong> or <strong>k-means clustering</strong> are used to achieve this task.&nbsp;</p>
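<p>As an illustration, here is a bare-bones k-means pass over a handful of synthetic 2D sensor returns (pure Python with deterministic initialization; real pipelines use optimized libraries and often density-based clustering instead):</p>

```python
import math

def kmeans(points, init_centers, iters=20):
    """Minimal k-means over 2D points with fixed initial centers."""
    centers = list(init_centers)
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point goes to its nearest center.
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(x for x, _ in cl) / len(cl),
                              sum(y for _, y in cl) / len(cl))
    return centers, clusters

# Two blobs of returns: one near (0, 0), one near (10, 10).
pts = [(0.1, 0.2), (0.3, -0.1), (-0.2, 0.0),
       (10.1, 9.8), (9.9, 10.2), (10.0, 10.0)]
centers, clusters = kmeans(pts, init_centers=[pts[0], pts[3]])
```

<p>Each recovered cluster center then stands in for one physical object, which is far easier to track and reason about than the raw point returns.</p>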


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-lidar-and-radar.png?ssl=1" alt="Self driving cars - lidar and radar" class="wp-image-50851" style="width:768px;height:437px"/><figcaption class="wp-element-caption"><em><a href="https://ieeexplore.ieee.org/document/7226315" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="localization">Localization</h3>



<p>Localization algorithms in self-driving cars calculate the position and orientation of the vehicle as it navigates &#8211; a technique known as Visual Odometry (VO).</p>



<p>VO works by matching key points in consecutive video frames. With each frame, the key points are used as the input to a mapping algorithm. The mapping algorithm, such as Simultaneous localization and mapping (SLAM), computes the position and orientation of each object nearby with respect to the previous frame and helps to classify roads, pedestrians, and other objects around.&nbsp;</p>
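<p>The core geometric step of VO can be sketched in isolation: given key points matched between two frames, estimate the rotation and translation relating them. The toy below recovers a known 2D rigid transform with a closed-form least-squares fit (a Kabsch-style solution; real VO works in 3D with outlier rejection):</p>

```python
import math

def estimate_rigid_2d(src, dst):
    """Least-squares rotation + translation mapping src points onto dst."""
    n = len(src)
    cs = (sum(x for x, _ in src) / n, sum(y for _, y in src) / n)
    cd = (sum(x for x, _ in dst) / n, sum(y for _, y in dst) / n)
    # Cross-covariance terms of the centered point sets.
    sxx = sxy = syx = syy = 0.0
    for (x1, y1), (x2, y2) in zip(src, dst):
        ax, ay = x1 - cs[0], y1 - cs[1]
        bx, by = x2 - cd[0], y2 - cd[1]
        sxx += ax * bx; sxy += ax * by; syx += ay * bx; syy += ay * by
    theta = math.atan2(sxy - syx, sxx + syy)
    c, s = math.cos(theta), math.sin(theta)
    tx = cd[0] - (c * cs[0] - s * cs[1])
    ty = cd[1] - (s * cs[0] + c * cs[1])
    return theta, (tx, ty)

# Key points from frame 1, and the same points seen after the camera
# rotated by 0.3 rad and translated by (2, 1).
ang, t = 0.3, (2.0, 1.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
dst = [(math.cos(ang) * x - math.sin(ang) * y + t[0],
        math.sin(ang) * x + math.cos(ang) * y + t[1]) for x, y in src]
theta, trans = estimate_rigid_2d(src, dst)
```

<p>Chaining such frame-to-frame estimates gives the vehicle’s trajectory, which SLAM then refines against the map.</p>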


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization.png?ssl=1" alt="Self driving cars - localization" class="wp-image-50852" style="width:844px;height:443px"/></a><figcaption class="wp-element-caption"><em><a href="https://www.youtube.com/watch?v=jcKnb65wpWA" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Deep learning is generally used to improve the performance of VO, and to classify different objects. Neural networks, such as PoseNet and VLocNet++, are some of the frameworks that use point data to estimate the 3D position and orientation. These estimated 3D positions and orientations can be used to derive scene semantics, as seen in the image below.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization-2.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-localization-2.png?ssl=1" alt="Self driving cars - localization" class="wp-image-50854"/></a><figcaption class="wp-element-caption"><a href="https://blogs.nvidia.com/blog/2019/10/23/drive-labs-panoptic-segmentation/" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="prediction">Prediction</h3>



<p>Understanding human drivers is a very complex task. It involves emotions rather than logic, and these are all fueled by <strong>reactions</strong>. The next action of nearby drivers or pedestrians is highly uncertain, so a system that can predict the actions of other road users is very important for road safety.&nbsp;</p>



<p>The car has a 360-degree view of its environment that enables it to perceive and capture all the information and process it. Once fed into the deep learning algorithm, it can come up with all the possible moves that other road users might make. It’s like a game where the player has a finite number of moves and tries to find the best move to defeat the opponent.&nbsp;</p>



<p>The sensors in self-driving cars enable them to perform tasks like image classification, object detection, segmentation, and localization. With various forms of data representation, the car can make predictions about the objects around it.</p>



<p>A deep learning algorithm can model such information (images and point cloud data from LiDARs and RADARs) during training. During inference, the same model helps the car prepare for all the possible moves, which involve braking, halting, slowing down, changing lanes, and so on.&nbsp;</p>



<p>The role of deep learning is to interpret complex vision tasks, localize the vehicle in its environment, enhance perception, and actuate kinematic maneuvers in self-driving cars. This ensures road safety and a smooth commute.</p>



<p>But the tricky part is to choose the correct action out of a finite number of actions.&nbsp;</p>



<h3 class="wp-block-heading" id="decision-making">Decision-making</h3>



<p>Decision-making is vital in self-driving cars. They need a system that’s dynamic and precise in an uncertain environment. It needs to take into account that not all sensor readings will be true, and that humans can make unpredictable choices while driving. These things can’t be measured directly. Even if we could measure them, we can’t predict them with good accuracy.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-decision-making.png?ssl=1" alt="Self driving cars - decision making" class="wp-image-50857"/><figcaption class="wp-element-caption"><em> A self-driving car moving towards an intersection | <a href="https://ieeexplore.ieee.org/document/7995949" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The image above shows a self-driving car moving towards an intersection. Another car, in blue, is also moving towards the intersection. In this scenario, the self-driving car has to predict whether the other car will go straight, left, or right. In each case, the car has to decide what maneuver it should perform to prevent a collision.</p>



<p>In order to make a decision, the car should have enough information so that it can select the necessary set of actions. We learned that the sensors help the car to collect information and deep learning algorithms can be used for localization and prediction.&nbsp;</p>



<p>To recap, localization enables the car to know its initial position, and prediction creates an <em>n </em>number of possible actions or moves based on the environment. The question remains: which option is best out of the many predicted actions?&nbsp;</p>



<p>When it comes to making decisions, we use deep reinforcement learning (DRL). More specifically, a decision-making algorithm called the <strong>Markov decision process</strong> (MDP) lies at the heart of DRL (we’ll learn more about MDP in a later section where we talk about reinforcement learning).&nbsp;</p>



<p>Usually, an MDP is used to predict the future behavior of the road-users. We should keep in mind that the scenario can get very complex if the number of objects, especially moving ones, increases. This eventually increases the number of possible moves for the self-driving car itself.&nbsp;</p>
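<p>To see what an MDP computation looks like, here is value iteration on a deliberately tiny, made-up driving MDP. The states, actions, rewards, and transition probabilities below are invented for illustration; real decision-making systems are vastly larger and learned, not hand-written.</p>

```python
# Toy MDP: P[state][action] is a list of (probability, next_state, reward).
P = {
    "cruise": {
        "keep":   [(1.0, "cruise", 1.0)],
        "change": [(0.8, "clear_lane", 2.0), (0.2, "near_miss", -10.0)],
    },
    "clear_lane": {"keep": [(1.0, "clear_lane", 2.0)]},
    "near_miss":  {"keep": [(1.0, "cruise", 0.0)]},
}

def value_iteration(P, gamma=0.9, eps=1e-9):
    """Classic value iteration: sweep states until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                       for outcomes in P[s].values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

V = value_iteration(P)
# Greedy policy: in each state, pick the action with the highest expected value.
policy = {s: max(P[s], key=lambda a: sum(p * (r + 0.9 * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
```

<p>Even with a small risk of a near miss, the long-run reward of the clear lane makes the lane change the optimal action here &#8211; exactly the kind of trade-off the decision layer has to quantify.</p>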



<p>In order to tackle the problem of finding the best move for itself, the deep learning model is optimized with <strong>Bayesian optimization</strong>. There are also situations where the framework, consisting of both a hidden Markov model and Bayesian Optimization, is used for decision-making.&nbsp;</p>



<p>In general, decision-making in self-driving cars is a hierarchical process. This process has four components:</p>



<ul class="wp-block-list">
<li><strong>Path or Route planning</strong>: Essentially, route planning is the first of four decisions that the car must make. Entering the environment, the car should plan the best possible route from its current position to the requested destination. The idea is to find an optimal solution among all the other solutions.&nbsp;&nbsp;</li>



<li><strong>Behaviour Arbitration</strong>: Once the route is planned, the car needs to navigate itself through the route. The car knows about the static elements, like roads, intersections, average road congestion and more, but it can’t know exactly what the other road users are going to be doing throughout the journey. This uncertainty in the behavior of other road users is solved by using probabilistic planning algorithms like MDPs.</li>



<li><strong>Motion Planning</strong>: Once the behavior layer decides how to navigate through a certain route, the motion planning system orchestrates the motion of the car. The motion of the car must be feasible and comfortable for the passenger. Motion planning includes speed of the vehicle, lane-changing, and more, all of which should be relevant to the environment the car is in.&nbsp;&nbsp;</li>



<li><strong>Vehicle Control</strong>: Vehicle control is used to execute the reference path from the motion planning system.&nbsp;</li>
</ul>
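<p>The route-planning layer, in its simplest form, is a shortest-path search over a road graph. A minimal Dijkstra sketch follows; the graph and costs are made up, and production planners work on enormous maps with dynamic, traffic-dependent costs.</p>

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra over a road graph: graph[node] = [(neighbor, travel_cost), ...]."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path by walking predecessors back from the goal.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

roads = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("C", 1.0), ("D", 4.0)],
    "C": [("D", 1.0)],
}
path, cost = shortest_route(roads, "A", "D")
```

<p>The behavior and motion layers then take this coarse route and refine it into concrete, safe maneuvers.</p>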


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-decision-making-2.png?ssl=1" alt="Self driving cars - decision making" class="wp-image-50858"/><figcaption class="wp-element-caption"><a href="https://arxiv.org/pdf/1604.07446.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-cnns-used-for-self-driving-cars">CNNs used for self-driving cars</h2>



<p>Convolutional neural networks (CNN) are used to model spatial information, such as images. CNNs are very good at extracting features from images, and they’re often seen as universal non-linear function approximators.&nbsp;</p>



<p>CNNs can capture different patterns as the depth of the network increases. For example, the layers at the beginning of the network will capture edges, while the deep layers will capture more complex features like the shape of the objects (leaves in trees, or tires on a vehicle). This is the reason why CNNs are the main algorithm in self-driving cars.&nbsp;</p>



<p>The key component of the CNN is the convolutional layer itself. It has a convolutional kernel which is often called the <em>filter matrix</em>. The filter matrix is convolved with a local region of the input image which can be defined as:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="380" height="138" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=380%2C138&#038;ssl=1" alt="CNNs used for self-driving cars" class="wp-image-27170" style="width:291px;height:105px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?w=380&amp;ssl=1 380w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=200%2C73&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=220%2C80&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=120%2C44&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=160%2C58&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-1.png?resize=300%2C109&amp;ssl=1 300w" sizes="auto, (max-width: 380px) 100vw, 380px" /></figure>
</div>


<p>Where:&nbsp;</p>



<ul class="wp-block-list">
<li>the operator * represents the convolution operation,</li>



<li>w is the filter matrix and b is the bias,&nbsp;</li>



<li>x is the input,</li>



<li>y is the output.&nbsp;</li>
</ul>



<p>In practice, the filter matrix is usually 3×3 or 5×5. During training, the filter matrix is updated continually until it converges to reasonable weights. One key property of CNNs is weight sharing: the same weight parameters are reused at different positions in the input. Shared parameters save a lot of memory and computation while still allowing the network to learn diverse feature representations.</p>
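<p>The formula above can be written out directly. This minimal valid-mode 2D convolution (strictly speaking, cross-correlation, which is what deep learning frameworks implement) slides a 3×3 edge-detecting filter over a tiny synthetic image:</p>

```python
def conv2d(image, kernel, bias=0.0):
    """Valid-mode 2D convolution: y = w * x + b."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[bias] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            for u in range(kh):
                for v in range(kw):
                    out[i][j] += kernel[u][v] * image[i + u][j + v]
    return out

# A vertical-edge detector (Sobel) on a tiny image: dark left half, bright right half.
img = [[0, 0, 1, 1]] * 4
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
edges = conv2d(img, sobel_x)
```

<p>Every output cell responds strongly because every 3×3 window straddles the dark-to-bright boundary &#8211; the same weights (the shared filter) are applied at every position.</p>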



<p>The output of the convolutional layer is usually fed to a nonlinear <strong>activation function</strong>. The activation function enables the network to solve linearly inseparable problems and to represent complex, high-dimensional functions. Commonly used activation functions are Sigmoid, Tanh, and ReLU, which are defined as follows:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="458" height="266" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=458%2C266&#038;ssl=1" alt="CNNs used for self-driving cars" class="wp-image-27172" style="width:382px;height:220px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?w=458&amp;ssl=1 458w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=200%2C116&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=220%2C128&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=120%2C70&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=160%2C93&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-2.png?resize=300%2C174&amp;ssl=1 300w" sizes="auto, (max-width: 458px) 100vw, 458px" /></figure>
</div>


<p>It’s worth mentioning that ReLU is the preferred activation function because it converges faster than the alternatives. In addition, the output of the convolutional layer is typically downsampled by a max-pooling layer, which reduces the spatial dimensions while retaining the most salient features of the input image.&nbsp;</p>
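<p>The three activation functions listed above are one-liners:</p>

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real input into (-1, 1), zero-centered.
    return math.tanh(x)

def relu(x):
    # Passes positive inputs through, zeroes out the rest.
    return max(0.0, x)
```

<p>ReLU’s gradient is a constant 1 for positive inputs, which is why it avoids the vanishing gradients that slow down Sigmoid and Tanh training.</p>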



<p>The three important CNN properties that make them versatile and a primary component of self-driving cars are:</p>



<ul class="wp-block-list">
<li><strong>local receptive fields,&nbsp;</strong></li>



<li><strong>shared weights,&nbsp;</strong></li>



<li><strong>spatial sampling</strong>.&nbsp;</li>
</ul>



<p>These properties reduce overfitting and store representations and features that are vital for image classification, segmentation, localization, and more.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Convolutional-neural-networks-2.png?ssl=1" alt="Convolutional neural networks" class="wp-image-50859"/><figcaption class="wp-element-caption"><a href="https://iopscience.iop.org/article/10.1088/1742-6596/1869/1/012071/pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>Next, we’ll discuss three CNN networks that are used by three companies pioneering self-driving cars:</p>



<ol class="wp-block-list">
<li>HydraNet by Tesla</li>



<li>ChauffeurNet by Google Waymo</li>



<li>Nvidia Self driving car</li>
</ol>



<h3 class="wp-block-heading" id="hydranet">HydraNet &#8211; semantic segmentation for self-driving cars&nbsp;</h3>



<p>HydraNet was introduced by <a href="https://rmullapudi.bitbucket.io/data/hydranet_cvpr_final.pdf?utm_source=Jeremy+Cohen&amp;utm_campaign=15c163eaa1-EMAIL_CAMPAIGN_2020_07_10_08_05&amp;utm_medium=email&amp;utm_term=0_9a0160b0e8-15c163eaa1-" target="_blank" rel="noreferrer noopener nofollow">Ravi et al. in 2018</a>. It was developed for semantic segmentation, for improving computational efficiency during inference time.  </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=1200%2C628&#038;ssl=1" alt="Self driving cars - semantic segmentation" class="wp-image-36651" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-1-1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em><a 
href="https://rmullapudi.bitbucket.io/data/hydranet_cvpr_final.pdf?utm_source=Jeremy+Cohen&amp;utm_campaign=15c163eaa1-EMAIL_CAMPAIGN_2020_07_10_08_05&amp;utm_medium=email&amp;utm_term=0_9a0160b0e8-15c163eaa1-" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>HydraNet is a dynamic architecture, so it can contain different CNN networks, each assigned to a different task. These blocks or networks are called branches. The idea of HydraNet is to take various inputs and feed each into a task-specific CNN network.&nbsp;</p>



<p>Take the context of self-driving cars. One input dataset can be of static environments like trees and road-railing, another can be of the road and the lanes, another of traffic lights and road, and so on. These inputs are trained in different branches. During the inference time, the <strong>gate</strong> chooses which branches to run, and the combiner aggregates branch outputs and makes a final decision.&nbsp;</p>



<p>In the case of Tesla, they have modified this network slightly because it’s difficult to segregate data for the individual tasks during inference. To overcome that problem, engineers at Tesla developed a shared backbone. The shared backbones are usually modified ResNet-50 blocks.</p>



<p>This HydraNet is trained on all the object’s data. There are task-specific heads that allow the model to predict task-specific outputs. The heads are based on semantic segmentation architecture like the U-Net.&nbsp;</p>
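<p>The shared-backbone idea can be captured structurally in a few lines. Everything below is an illustrative stand-in &#8211; the feature extractor and head rules are invented, not Tesla’s actual code &#8211; but it shows the key point: the backbone runs once, and every task head reads the same shared features.</p>

```python
def backbone(pixels):
    # Stand-in for a ResNet-style feature extractor: here, just two
    # hand-rolled summary statistics over the input.
    return {"mean": sum(pixels) / len(pixels), "peak": max(pixels)}

def lane_head(features):
    # Hypothetical task head: flags a lane marking when brightness peaks.
    return "lane" if features["peak"] > 0.8 else "no_lane"

def light_head(features):
    # Hypothetical task head: classifies overall scene brightness.
    return "bright" if features["mean"] > 0.5 else "dark"

def hydranet_forward(pixels, heads):
    features = backbone(pixels)  # computed once, shared by all heads
    return {name: head(features) for name, head in heads.items()}

out = hydranet_forward([0.1, 0.9, 0.2], {"lane": lane_head, "light": light_head})
```

<p>Because the expensive backbone pass is amortized across all heads, adding a new task costs only a small head, not a whole new network.</p>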


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-hydranet.jpeg?ssl=1" alt="Self driving cars - hydranet" class="wp-image-50862"/><figcaption class="wp-element-caption"><em><a href="https://heartbeat.fritz.ai/computer-vision-at-tesla-cd5e88074376" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The Tesla HydraNet can also project a bird’s-eye view, meaning it can create a 3D view of the environment from any angle, giving the car much more dimensionality to navigate properly. It’s important to know that Tesla doesn’t use LiDAR sensors. It relies on only two sensor types: cameras and radar. Although LiDAR explicitly creates depth perception for the car, Tesla’s HydraNet is so efficient that it can stitch together the visual information from all 8 cameras and create depth perception itself.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-tesla-hydranet.png?ssl=1" alt="Self driving cars - tesla hydranet" class="wp-image-50863"/><figcaption class="wp-element-caption"><em><a href="https://heartbeat.fritz.ai/computer-vision-at-tesla-cd5e88074376" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="chauffeurnet">ChauffeurNet:&nbsp;training self-driving car using imitation learning</h3>



<p><a href="http://roboticsproceedings.org/rss15/p31.pdf" target="_blank" rel="noreferrer noopener">ChauffeurNet</a> is an RNN-based neural network used by Google Waymo. However, a CNN is one of its core components, used to extract features from the perception system.&nbsp;</p>



<p>The CNN in ChauffeurNet is described as a convolutional feature network, or FeatureNet, that extracts contextual feature representation shared by the other networks. These representations are then fed to a recurrent agent network (AgentRNN) that iteratively yields the prediction of successive points in the driving trajectory.</p>



<p>The idea behind this network is to train a self-driving car using imitation learning. In the paper by Bansal et al., “<a href="https://arxiv.org/abs/1812.03079" target="_blank" rel="noreferrer noopener nofollow">ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst</a>”, the authors argue that training a self-driving car even with 30 million examples is not enough. To tackle that limitation, they also trained the car on synthetic data. This synthetic data introduced deviations such as perturbations of the trajectory path, added obstacles, and unnatural scenes. They found that such synthetic data was able to train the car much more efficiently than normal data alone.&nbsp;</p>
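<p>A toy version of such a trajectory perturbation: displace one logged waypoint laterally and fade the offset over its neighbors, producing a synthetic “drift and recover” example. The linear-falloff scheme here is illustrative, not ChauffeurNet’s exact augmentation.</p>

```python
def perturb_trajectory(waypoints, index, offset):
    """Displace one waypoint laterally, fading the offset over its neighbors."""
    out = []
    for i, (x, y) in enumerate(waypoints):
        # Linear falloff: full offset at `index`, zero three points away.
        weight = max(0.0, 1.0 - abs(i - index) / 3.0)
        out.append((x, y + offset * weight))
    return out

# A straight logged trajectory along the x-axis, perturbed at its midpoint.
straight = [(float(i), 0.0) for i in range(7)]
perturbed = perturb_trajectory(straight, index=3, offset=1.5)
```

<p>Training on many such perturbed trajectories, labeled with the recovery back to the lane center, teaches the model to correct deviations it would rarely see in expert demonstrations.</p>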



<p>Usually, self-driving has an end-to-end process as we saw earlier, where the perception system is part of a deep learning algorithm along with planning and controlling. In the case of ChauffeurNet, the perception system is not a part of the end-to-end process; instead, it’s a mid-level system where the network can have different variations of input from the perception system.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-ChauffeurNet.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-ChauffeurNet.png?ssl=1" alt="Self driving cars - ChauffeurNet" class="wp-image-50864" style="width:768px;height:624px"/></a><figcaption class="wp-element-caption"><em><a href="http://roboticsproceedings.org/rss15/p31.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>ChauffeurNet yields a driving trajectory by observing a mid-level representation of the scene from the sensors, using the input along with synthetic data to imitate an expert driver.&nbsp;&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-ChauffeurNet-2.png?ssl=1" alt="Self driving cars - ChauffeurNet" class="wp-image-50866"/><figcaption class="wp-element-caption"><em><a href="http://roboticsproceedings.org/rss15/p31.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p><em>In the image above, the cyan path depicts the input route, green box is the self-driving car, blue dots are the agent’s past route or position, and green dots are the predicted future routes or positions.&nbsp;</em></p>



<p>Essentially, a mid-level representation doesn’t directly use raw sensor data as input, factoring out the perception task, so we can combine real and simulated data for easier transfer learning. This way, the network can create a high-level bird’s eye view of the environment which ultimately yields better decisions.&nbsp;</p>



<h3 class="wp-block-heading" id="h-nvidia-self-driving-car-a-minimalist-approach-towards-self-driving-cars">Nvidia self-driving car: a minimalist approach towards self-driving cars</h3>



<p>Nvidia also uses a Convolutional Neural Network as the primary algorithm for its self-driving car. But unlike Tesla, it uses three front-facing cameras: one on the left, one in the center, and one on the right. See the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=1200%2C628&#038;ssl=1" alt="Convolutional neural networks NVIDIA" class="wp-image-36653" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><a 
href="https://rmullapudi.bitbucket.io/data/hydranet_cvpr_final.pdf?utm_source=Jeremy+Cohen&amp;utm_campaign=15c163eaa1-EMAIL_CAMPAIGN_2020_07_10_08_05&amp;utm_medium=email&amp;utm_term=0_9a0160b0e8-15c163eaa1-" target="_blank" rel="noreferrer noopener nofollow"><em>Source</em></a></figcaption></figure>
</div>


<p>The network is capable of operating on roads that don&#8217;t have lane markings, including parking lots. It also learns the features and representations necessary for detecting useful road features.&nbsp;</p>



<p>Compared to an explicit decomposition of the problem into lane-marking detection, path planning, and control, this end-to-end system optimizes all processing steps simultaneously.&nbsp;</p>



<p>Better performance is the result of internal components self-optimizing to maximize overall system performance, instead of optimizing human-selected intermediate criteria like lane detection. Such criteria understandably are selected for ease of human interpretation, which doesn’t automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with a minimal number of processing steps.</p>
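<p>To make the end-to-end idea concrete, here is a minimal PyTorch sketch of a network in the spirit of Nvidia&#8217;s published PilotNet architecture. The layer sizes follow the paper (a 66&#215;200 YUV frame in, a single steering value out), but the class name and everything else here is an illustration, not Nvidia&#8217;s actual code:</p>

```python
import torch
import torch.nn as nn

class PilotNetSketch(nn.Module):
    """End-to-end steering sketch: 66x200 YUV camera frame -> steering angle."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),   # -> 24 x 31 x 98
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),  # -> 36 x 14 x 47
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),  # -> 48 x 5 x 22
            nn.Conv2d(48, 64, 3), nn.ReLU(),            # -> 64 x 3 x 20
            nn.Conv2d(64, 64, 3), nn.ReLU(),            # -> 64 x 1 x 18
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),  # single output: the steering command
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = PilotNetSketch()
frame = torch.randn(1, 3, 66, 200)  # one dummy camera frame
steering = model(frame)             # tensor of shape (1, 1)
```

<p>Because every layer is differentiable, the loss on the steering output back-propagates through the whole stack, which is exactly the &#8220;internal components self-optimizing&#8221; behavior described above.</p>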



<h2 class="wp-block-heading" id="h-reinforcement-learning-used-for-self-driving-cars">Reinforcement learning used for self-driving cars</h2>



<p><a href="/blog/best-reinforcement-learning-tutorials-examples-projects-and-courses" target="_blank" rel="noreferrer noopener">Reinforcement learning</a> (RL) is a type of machine learning where an agent learns by exploring and interacting with the environment. In this case, the self-driving car is an <strong>agent</strong>.&nbsp;</p>



<section id="blog-intext-cta-block_fda2a884f66aeb554255458742312277" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-explore-more-applications-of-rl">Explore more applications of RL</h3>
    
            <p><a href="/blog/reinforcement-learning-applications" target="_blank" rel="noopener">10 Real-Life Applications of Reinforcement Learning</a></p>
<p><a href="https://neptune.ai/blog/7-applications-of-reinforcement-learning-in-finance-and-trading" target="_blank" rel="noopener">7 Applications of Reinforcement Learning in Finance and Trading</a></p>
    
    </section>



<p>We discussed earlier how the neural network predicts a number of actions from the perception data. But choosing an appropriate action requires deep reinforcement learning (DRL). At the core of DRL, we have three important variables:</p>



<ol class="wp-block-list">
<li><strong>State</strong> describes the current situation at a given time; in this case, the car&#8217;s position on the road.&nbsp;</li>



<li><strong>Action</strong> describes all the possible moves that the car can make.&nbsp;</li>



<li><strong>Reward</strong> is feedback that the car receives whenever it takes a certain action.&nbsp;</li>
</ol>



<p>Generally, the agent is not told what to do or what actions to take. As we have seen, in supervised learning the algorithm maps inputs to outputs. In DRL, the algorithm learns by exploring the environment, and each interaction yields a reward, which can be positive or negative. The goal of DRL is to maximize the cumulative reward.&nbsp;</p>
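<p>The state-action-reward loop can be sketched with a toy environment. Everything below (the ToyRoad class, the reward values) is a made-up illustration, not part of any real driving stack:</p>

```python
import random

class ToyRoad:
    """A toy 1-D road: the agent starts at cell 0 and must reach cell 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        """action is +1 (forward) or -1 (backward); the road starts at cell 0."""
        self.position = max(self.position + action, 0)
        done = self.position >= 5
        reward = 10 if done else -1      # every extra step is penalized
        return self.position, reward, done

random.seed(0)                           # make the rollout reproducible
env = ToyRoad()
total_reward, done = 0, False
while not done:
    action = random.choice([-1, 1])      # an untrained, random policy
    state, reward, done = env.step(action)
    total_reward += reward               # the quantity DRL maximizes
```

<p>A trained policy would reach the goal in five steps (total reward 6); the random policy above wastes steps and collects a lower cumulative reward, which is exactly the signal learning exploits.</p>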



<p>In self-driving cars, the same procedure is followed: the network is trained on perception data, from which it learns what decision to make. Because CNNs are very good at extracting feature representations from the input, DRL algorithms can be trained on those representations. This works well because the extracted representations transform high-dimensional manifolds into simpler, lower-dimensional ones, and training on lower-dimensional representations yields the efficiency required at inference time.&nbsp;</p>



<p>One key point to remember is that self-driving cars can&#8217;t be trained on real-world roads, because that would be extremely dangerous. Instead, self-driving cars are trained in a <strong>simulator</strong>, where there&#8217;s no risk at all.&nbsp;</p>



<p>Some open-source simulators are:</p>



<ol class="wp-block-list">
<li><a href="https://carla.org" target="_blank" rel="noreferrer noopener nofollow">CARLA</a></li>



<li><a href="https://github.com/AdaCompNUS/summit" target="_blank" rel="noreferrer noopener nofollow">SUMMIT</a>​​</li>



<li><a href="https://microsoft.github.io/AirSim/" target="_blank" rel="noreferrer noopener nofollow">AirSim</a></li>



<li><a href="https://deepdrive.io/index.html" target="_blank" rel="noreferrer noopener nofollow">DeepDrive</a></li>



<li><a href="https://flow-project.github.io/" target="_blank" rel="noreferrer noopener nofollow">Flow</a></li>
</ol>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-car-simulator-deepdrive.jpeg?ssl=1" alt="Self driving car simulator - deepdrive" class="wp-image-50868"/><figcaption class="wp-element-caption"><em>A snapshot from&nbsp;Voyage Deepdrive | <a href="https://news.voyage.auto/introducing-voyage-deepdrive-69b3cf0f0be6" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>



<p>These cars (agents) are trained for thousands of epochs with highly difficult simulations before they’re deployed in the real world.&nbsp;</p>



<p>During training, the agent (the car) learns by taking an action in a given state. Based on this <strong>state-action</strong> pair, it receives a <strong>reward</strong>. This process repeats over and over, and with each step the agent updates its mapping from states to actions. This mapping is called the <strong>policy</strong>.&nbsp;</p>



<p><strong>The policy describes how the agent makes decisions.</strong> It&#8217;s a decision-making rule that defines the behaviour of the agent at a given time.&nbsp;</p>



<p>Whenever the agent makes a bad decision, the policy is updated. To avoid negative rewards, the agent evaluates how good it is to be in a given state, which is measured by the <strong>state-value function. </strong>The state-value can be computed using the <strong>Bellman Expectation Equation.</strong></p>



<p>The Bellman expectation equation, along with Markov Decision Process (MDP), makes up the two core concepts of DRL. But when it comes to self-driving cars, we have to keep in mind that<strong> the observations from the perception data should be mapped with the appropriate action </strong>and not just map the underlying state to the action. This is where a partially observed decision process or a <strong>Partially Observable Markov Decision Process (POMDP)</strong> is required, which can make decisions based on the observation.&nbsp;</p>



<h2 class="wp-block-heading" id="h-partially-observable-markov-decision-process-used-for-self-driving-cars">Partially Observable Markov Decision Process used for self-driving cars</h2>



<p>The <a href="https://neptune.ai/blog/markov-decision-process-in-reinforcement-learning">Markov Decision Process</a> gives us a way to sequentialize decision-making. The agent interacts with the environment sequentially over time. Each time the agent interacts with the environment, the environment provides some representation of its state. Given that representation, the agent selects an action to take, as in the image below. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=1200%2C628&#038;ssl=1" alt="The Markov Decision Process " class="wp-image-36656" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/self-driving-cars-with-convolutional-neural-networks-cnn-3.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>The action taken is transitioned into some new state and the agent is given a reward. This process of evaluating a state, taking action, changing states, and rewarding is repeated. Throughout the process, it’s the agent’s goal to maximize the total amount of rewards.&nbsp;</p>



<p>Let’s get a more constructive idea of the whole process:</p>



<ol class="wp-block-list">
<li>At a given time t, the environment is in state S<sub>t</sub></li>



<li>The agent observes the current state S<sub>t</sub> and selects an action A<sub>t</sub></li>



<li>The environment then transitions into a new state S<sub>t+1</sub>, and the agent simultaneously receives a reward R<sub>t</sub></li>
</ol>



<p>In a <strong>partially observable Markov decision process</strong> (POMDP), the agent senses the environment state with observations received from the perception data and takes a certain action followed by receiving a reward.&nbsp;</p>



<p>The POMDP has six components and it can be denoted as POMDP <em>M:= (I, S, A, R, P, </em>γ), where,&nbsp;</p>



<ul class="wp-block-list">
<li>I: Observations&nbsp;</li>



<li>S: Finite set of states</li>



<li>A: Finite set of actions</li>



<li>R: Reward function</li>



<li>P: Transition probability function</li>



<li>γ: Discount factor for future rewards&nbsp;</li>
</ul>
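<p>Assuming only the definition above, the six-component tuple M := (I, S, A, R, P, γ) can be written down directly in code. The lane-keeping states, actions, and probabilities below are purely illustrative:</p>

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class POMDP:
    observations: FrozenSet[str]                  # I: what the sensors report
    states: FrozenSet[str]                        # S: finite set of true states
    actions: FrozenSet[str]                       # A: finite set of actions
    reward: Callable[[str, str], float]           # R(s, a)
    transition: Callable[[str, str, str], float]  # P(s' | s, a)
    gamma: float                                  # γ: discount factor

# A toy lane-keeping POMDP: the true state is the lane position, but the
# agent only sees a noisy "left/center/right" observation from perception.
toy = POMDP(
    observations=frozenset({"left", "center", "right"}),
    states=frozenset({"in_lane", "drifting"}),
    actions=frozenset({"steer_left", "steer_right", "keep"}),
    reward=lambda s, a: 1.0 if s == "in_lane" else -1.0,
    transition=lambda s2, s, a: 0.9 if s2 == "in_lane" else 0.1,
    gamma=0.95,
)
```

<p>The key difference from a plain MDP is that the agent&#8217;s policy conditions on elements of <code>observations</code> rather than on the true state, which is why the observation set I appears in the tuple at all.</p>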



<p>The objective of DRL is to find a policy that maximizes the cumulative reward over time or, in other words, to find an optimal action-value function (Q-function).&nbsp;</p>



<h3 class="wp-block-heading" id="qlearning">Q-learning used for self-driving cars</h3>



<p><a href="https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56" target="_blank" rel="noreferrer noopener nofollow">Q-learning</a> is one of the most commonly used DRL algorithms for self-driving cars. It falls under the category of <strong>model-free learning</strong>: the agent approximates the optimal state-action values directly, without learning a model of the environment. The <strong>policy</strong> still determines which action-value pairs, or Q-values, are visited and updated (see the equation below). The goal is to find an optimal policy by interacting with the environment and correcting the policy whenever the agent makes an error.&nbsp;</p>



<p>With enough samples or observation data, Q-learning will learn the optimal state-action values. In practice, Q-learning has been shown to converge to the optimal state-action values for an MDP with probability 1, provided that every action is tried in every state infinitely often.&nbsp;</p>



<p>Q-learning can be described in the following equation:&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1074" height="72" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=1074%2C72&#038;ssl=1" alt="Q-learning used for self-driving cars" class="wp-image-27171" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?w=1074&amp;ssl=1 1074w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=768%2C51&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=200%2C13&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=220%2C15&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=120%2C8&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=160%2C11&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=300%2C20&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=480%2C32&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/07/Equation-3.png?resize=1020%2C68&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>where:</p>



<p>α ∈ [0,1] is the learning rate. It controls the degree to which Q-values are updated at a given time step t.</p>
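<p>The update rule, Q(s,a) ← Q(s,a) + α[r + γ·max<sub>a&#8217;</sub> Q(s&#8217;,a&#8217;) &#8211; Q(s,a)], takes only a few lines to implement for a tabular toy problem. The 5-cell road below is a hypothetical illustration, not a driving simulator:</p>

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPISODES = 0.5, 0.9, 500
GOAL = 4                               # rightmost cell of a 5-cell road
ACTIONS = [-1, 1]                      # move left / move right

Q = defaultdict(float)                 # Q[(state, action)], initialized to 0

random.seed(0)
for _ in range(EPISODES):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)     # explore with a random policy
        s2 = min(max(s + a, 0), GOAL)  # clamp to the road
        r = 10.0 if s2 == GOAL else -1.0
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# After training, the greedy policy should move right in every state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

<p>Even though the exploration policy here is purely random, the learned Q-table converges toward the optimal values because the update always bootstraps from the greedy maximum over the next state&#8217;s actions.</p>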


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Self-driving-cars-q-learning.png?ssl=1" alt="Self driving cars - q learning" class="wp-image-50870" style="width:842px;height:449px"/><figcaption class="wp-element-caption"><em><a href="https://www.mdpi.com/2079-9292/8/5/543/htm" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>It’s important to remember that the agent will discover the good and bad actions through trial and error.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Self-driving cars aim to revolutionize car travel by making it safe and efficient. In this article, we outlined some of the key components such as LiDAR, RADAR, cameras, and most importantly &#8211; the algorithms that make self-driving cars possible.&nbsp;</p>



<p>While it&#8217;s promising, there&#8217;s still a lot of room for improvement. For example, current self-driving cars are at Level 2 of the 5 levels of driving automation, which means that a human still has to be ready to intervene if necessary.&nbsp;</p>



<p>A few things still need to be addressed:</p>



<ol class="wp-block-list">
<li>The algorithms used are not yet robust enough to perceive roads and lanes, especially since some roads lack markings and other signs.</li>



<li>Sensing modalities for localization, mapping, and perception still lack accuracy and efficiency.</li>



<li>Vehicle-to-vehicle communication is still a dream, but work is being done in this area as well.&nbsp;&nbsp;</li>



<li>The field of human-machine interaction is not explored enough, with many open, unsolved problems.</li>
</ol>



<p>Still, the technology we’ve developed so far is amazing. And with orchestrated efforts, we can ensure that self-driving systems will be safe, robust, and revolutionary.</p>



<h3 class="wp-block-heading" id="h-further-reading">Further reading:</h3>



<ol class="wp-block-list">
<li><a href="https://arxiv.org/pdf/1910.07738.pdf" target="_blank" rel="noreferrer noopener nofollow">A Survey of Deep Learning Techniques for Autonomous Driving</a></li>



<li><a href="https://arxiv.org/pdf/1906.05113.pdf" target="_blank" rel="noreferrer noopener nofollow">A Survey of Autonomous Driving: Common Practices and Emerging Technologies</a></li>



<li><a href="https://ieeexplore.ieee.org/document/7995949" target="_blank" rel="noreferrer noopener nofollow">Decision Making for Autonomous Driving considering Interaction and Uncertain Prediction of Surrounding Vehicles</a></li>



<li><a href="https://iopscience.iop.org/article/10.1088/1742-6596/1869/1/012071/pdf" target="_blank" rel="noreferrer noopener nofollow">Autonomous car using CNN deep learning algorithm</a></li>



<li><a href="https://arxiv.org/pdf/2002.00444.pdf" target="_blank" rel="noreferrer noopener nofollow">Deep Reinforcement Learning for Autonomous Driving: A Survey</a></li>



<li><a href="https://www.wired.com/story/guide-self-driving-cars/" target="_blank" rel="noreferrer noopener nofollow">The WIRED Guide to Self-Driving Cars</a></li>



<li><a href="https://towardsdatascience.com/deep-learning-for-self-driving-cars-7f198ef4cfa2" target="_blank" rel="noreferrer noopener nofollow">Deep Learning for Self-Driving Cars</a></li>



<li><a href="https://towardsdatascience.com/reinforcement-learning-towards-general-ai-1bd68256c72d" target="_blank" rel="noreferrer noopener nofollow">Training Self Driving Cars using Reinforcement Learning</a></li>



<li><a href="https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html" target="_blank" rel="noreferrer noopener nofollow">A (Long) Peek into Reinforcement Learning</a></li>



<li><a href="https://mycreditsummit.com/tesla-statistics" target="_blank" rel="noreferrer noopener">Tesla Statistics: What You Should Know About Safety, Pricing and More</a></li>
</ol>
]]></content:encoded>
					
		
		<enclosure url="https://neptune.ai/wp-content/uploads/2022/11/Self-driving-car-LiDAR.mp4" length="390854" type="video/mp4" />

		<post-id xmlns="com-wordpress:feed-additions:1">5694</post-id>	</item>
	</channel>
</rss>
