<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Michał Oleszak, Author at neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/author/michal-oleszak/feed" rel="self" type="application/rss+xml" />
	<link></link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Tue, 28 Oct 2025 19:50:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Michał Oleszak, Author at neptune.ai</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>Detecting and Fixing &#8216;Dead Neurons&#8217; in Foundation Models</title>
		<link>https://neptune.ai/blog/detecting-and-fixing-dead-neurons-in-foundation-models</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Tue, 28 Oct 2025 19:50:11 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[LLMOps]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=48415</guid>

					<description><![CDATA[In neural networks, some neurons end up outputting near-zero activations across all inputs. These so-called “dead neurons” degrade model capacity because those parameters are effectively wasted, and they weaken generalization by reducing the diversity of learned features. While this phenomenon is nothing new, it has become increasingly relevant with the emergence of large foundation models.&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_1fb75923ad5128c39c66c82180fc2861"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Dead neurons silently waste compute and reduce effective model capacity in foundation models.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Simple visualizations of the activation frequency make neuron health measurable.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Dead neurons can be brought back to life by swapping activation functions or implementing synaptic stripping.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>It is crucial for foundation model training success to proactively monitor neuron health with audits and alerts.</p>
                                    </div>

            </div>
            </div>


</section>



<p>In neural networks, some neurons end up outputting near-zero activations across all inputs. These so-called “dead neurons” degrade model capacity because those parameters are effectively wasted, and they weaken generalization by reducing the diversity of learned features.</p>



<p>While this phenomenon is nothing new, it has become increasingly relevant with the emergence of large foundation models. In this article, we will discuss why that is the case and what the resulting impact is. We will also review methods for the detection and visualization of dead neurons, as well as strategies to prevent and fix them.</p>



<h2 class="wp-block-heading" id="h-dead-neurons-impact">Dead neurons’ impact</h2>



<p>Recent studies into dead neurons in the context of foundation models show interesting, albeit worrying, results. A <a href="https://arxiv.org/abs/2004.04010" target="_blank" rel="noreferrer noopener nofollow">2020 paper by Qatari researchers Dalvi et al.</a> shows that in BERT and XLNet, 85% of all neurons are redundant for the model to perform its task. A <a href="https://arxiv.org/abs/2309.04827" target="_blank" rel="noreferrer noopener nofollow">more recent 2023 study by Meta AI researchers Voita et al.</a> looked at LLMs from the OPT family of models, ranging from 125M to 66B parameters, and found that, in some layers, more than 70% of the neurons are dead.</p>



<p>These large reported fractions of dead or redundant neurons in foundation models are a concern from a computational perspective. While losing some neurons in a 100M-parameter CNN is an inefficiency, seeing 70-85% of neurons wasted in a billion-parameter LLM means significant amounts of GPU-hours lost, both at training and inference time. These dead neurons constitute a hidden form of compute tax, if you will.</p>



<p>Leaving the computational efficiency aside, dead neurons are likely to impede the model’s performance, too. With a large number of neurons unused, the effective model size becomes much smaller than its nominal size. Consequently, fewer features are learned, leading to impaired generalization as the model increasingly relies on memorizing the data.</p>



<p>Another consequence of having many dead neurons in the model is that it learns a more entangled data representation. Consider discrete feature detectors, or neurons that reliably activate for some interpretable pattern in the data. Think of a neuron that lights up whenever it sees a vertical edge in a vision model, or a neuron that fires strongly on HTML tags in an LLM. These types of neurons are quite valuable to have in a model as they make representations more disentangled: each dimension of the representation corresponds more cleanly to a specific factor of variation.&nbsp;</p>



<p>If a large fraction of neurons are dead, we lose the “slots” that could have been allocated to these specialized detectors. The model still has to encode the same amount of information, but with fewer working neurons. As a result, the remaining neurons activate for a variety of patterns (e.g., one neuron might respond to numbers, capital letters, and dates). This reduces the model’s ability to learn clean, specialized representations, potentially affecting downstream performance.</p>



<p>Finally, and perhaps not surprisingly, dead neurons waste memory. They take up a lot of space for no good reason, making it more challenging to load, fine-tune, and serve large foundation models.</p>



<p>Before we move on to discuss how to detect and fix dead neurons, let’s touch upon an important distinction between dead neurons and vanishing gradients. While these two are distinct phenomena, they are intimately related. Vanishing gradients effectively prevent weight updates during training, which can “freeze” a neuron into inactivity. Conversely, once a neuron becomes permanently dead, no gradient flows through it to the weights and neurons feeding it. Thus, preventing gradients from vanishing is one of the strategies against dead neurons, as we will see later in the article.</p>
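<p>To make this relationship concrete, here is a minimal PyTorch sketch (a toy setup of our own, not code from this article’s repository) of a ReLU neuron whose pre-activations are always negative: it outputs zero on every input, and because ReLU’s gradient is zero there, it receives no weight updates that could ever revive it.</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer with two neurons; we hand-craft the weights so that neuron 1's
# pre-activation is negative for every input in [0, 1).
linear = nn.Linear(4, 2)
with torch.no_grad():
    linear.weight[0].fill_(0.1)   # "healthy" neuron
    linear.bias[0] = 0.0
    linear.weight[1].fill_(-0.1)  # weights pushing neuron 1 negative
    linear.bias[1] = -5.0         # large negative bias: pre-activation < 0 always

x = torch.rand(64, 4)             # inputs in [0, 1)
out = torch.relu(linear(x))
out.sum().backward()

print(out[:, 1].abs().max())              # -> tensor(0.): neuron 1 never fires
print(linear.weight.grad[1].abs().max())  # -> tensor(0.): no gradient, so it stays dead
```

<p>Because the gradient to neuron 1’s weights is exactly zero, no amount of further training on this data distribution can bring it back; this is the feedback loop that makes dead neurons sticky.</p>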


    <a
        href="https://neptune.ai/blog/monitoring-diagnosing-and-solving-gradient-issues-in-foundation-models"
        id="cta-box-related-link-block_c8b879ec230101df6f9173bf7450818e"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Further reading                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-monitor-diagnose-and-solve-gradient-issues-in-foundation-models">                How to Monitor, Diagnose, and Solve Gradient Issues in Foundation Models            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-visualizing-activation-distributions">Visualizing activation distributions</h2>



<p>Is your foundation model suffering from dead neurons? A convenient way to find out is through visualization. We can plot activation histograms and heatmaps, as well as the percentage of dead neurons for different layers of the model, to get a sense of how large the issue is.</p>



<p>In this section, we will examine these visualization strategies using a version of OpenAI’s <a href="https://github.com/openai/gpt-2" target="_blank" rel="noreferrer noopener nofollow">GPT-2</a> as an example. We use this relatively small model for computational efficiency. Note that in such a small model, we might not see as high a proportion of dead neurons as we would in a bigger, more recent model such as <a href="https://openai.com/index/introducing-gpt-5/" target="_blank" rel="noreferrer noopener nofollow">GPT-5</a>. However, the techniques we will discuss are directly applicable to larger models, too.</p>



<section id="note-block_9f8ab15de6f5a91ace4874a740822815"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><span style="font-weight: 400;">💡  You can explore all charts interactively on </span><a href="https://scale.neptune.ai/community/Detecting%20and%20Fixing%20'Dead%20Neurons'%20in%20Foundation%20Models/runs/details?viewId=standard-view&amp;detailsTab=dashboard&amp;dashboardId=a00121cf-8c68-4664-91d9-5ae439f24135&amp;runIdentificationKey=dead-neurons&amp;type=experiment&amp;experimentsOnly=true&amp;runsLineage=FULL&amp;nameSearchQuery=&amp;nameSearchMode=regex&amp;sortBy=%5B%22sys%2Fcreation_time%22%5D&amp;sortFieldType=%5B%22datetime%22%5D&amp;sortFieldAggregationMode=%5B%22auto%22%5D&amp;sortDirection=%5B%22descending%22%5D&amp;showSelectedHiddenByFilter=false&amp;lbViewUnpacked=true"><span style="font-weight: 400;">this Neptune dashboard</span></a><span style="font-weight: 400;">. The code used to produce the plots is available </span><a href="https://github.com/MichalOleszak/blogs/tree/main/dead_neurons"><span style="font-weight: 400;">on GitHub</span></a><span style="font-weight: 400;">.</span></p>
                                    </div>

            </div>
            </div>


</section>



<p>I have sampled some data from the <a href="https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1" target="_blank" rel="noreferrer noopener nofollow">WikiText-2 dataset</a> and passed it through <a href="https://huggingface.co/sshleifer/tiny-gpt2" target="_blank" rel="noreferrer noopener nofollow">Tiny GPT-2</a> from HuggingFace (see its <a href="https://www.promptlayer.com/models/tiny-gpt2" target="_blank" rel="noreferrer noopener nofollow">model card</a> for additional information). For each batch of tokens processed by the model, I collected a set of different activations from the transformer blocks at different layers:</p>



<ul class="wp-block-list">
<li>mlp_pre: Activations before the activation functions.</li>



<li>mlp_post: Activations after the activation functions.</li>



<li>attn_out: The outputs of the self-attention block.</li>
</ul>



<p>I flattened and aggregated these activations to extract the following metrics:</p>



<ul class="wp-block-list">
<li><strong>Activation frequency:</strong> The fraction of inputs where a neuron fires above an arbitrarily chosen threshold of 0.001.</li>



<li><strong>Activation histograms:</strong> The distribution of activation values.</li>



<li><strong>Dead neuron ratio:</strong> The percentage of neurons with an activation frequency below the same firing threshold as above.</li>
</ul>
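<p>As a sketch of how these metrics can be computed, the snippet below registers a forward hook on a toy MLP rather than on Tiny GPT-2 (the threshold matches the article, but the model, the <code>acts</code> dictionary, and the helper names are our own illustrative stand-ins):</p>

```python
import torch
import torch.nn as nn

THRESHOLD = 1e-3  # the firing threshold from the article

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
acts = {}

def save_activations(name):
    # Forward hook: stash activations flattened to (num_samples, num_neurons).
    def hook(module, inputs, output):
        acts[name] = output.detach().reshape(-1, output.shape[-1])
    return hook

# Hook the post-activation outputs (the "mlp_post" collection point).
model[1].register_forward_hook(save_activations("mlp_post"))

model(torch.randn(256, 16))  # one "batch" of inputs

post = acts["mlp_post"]
# Activation frequency: fraction of inputs where each neuron fires above threshold.
freq = (post.abs() > THRESHOLD).float().mean(dim=0)
# Dead neuron ratio: share of neurons whose activation frequency is below threshold.
dead_ratio = (freq < THRESHOLD).float().mean().item()
print(f"dead neuron ratio: {dead_ratio:.1%}")
```

<p>In a real transformer, the same hooks would be attached to the MLP and attention sublayers of each block; <code>freq</code> then feeds the heatmaps, the raw activations feed the histograms, and <code>dead_ratio</code> per layer gives the line chart discussed below.</p>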



<h3 class="wp-block-heading" id="h-activation-frequency">Activation frequency</h3>



<p>Let’s start by looking at the activation frequencies:</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1382" height="1023" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=1382%2C1023&#038;ssl=1" alt="activation frequencies" class="wp-image-48439" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?w=1382&amp;ssl=1 1382w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=768%2C568&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=200%2C148&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=220%2C163&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=120%2C89&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=160%2C118&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=480%2C355&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=1020%2C755&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/activation-frequencies.png?resize=1200%2C888&amp;ssl=1 1200w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><a 
href="https://scale.neptune.ai/o/community/org/Detecting%20and%20Fixing%20'Dead%20Neurons'%20in%20Foundation%20Models/runs/details?viewId=standard-view&amp;detailsTab=dashboard&amp;dashboardId=a00121cf-8c68-4664-91d9-5ae439f24135&amp;runIdentificationKey=dead-neurons&amp;type=experiment&amp;experimentsOnly=true&amp;runsLineage=FULL&amp;nameSearchQuery=&amp;nameSearchMode=regex&amp;sortBy=%5B%22sys%2Fcreation_time%22%5D&amp;sortFieldType=%5B%22datetime%22%5D&amp;sortFieldAggregationMode=%5B%22auto%22%5D&amp;sortDirection=%5B%22descending%22%5D&amp;showSelectedHiddenByFilter=false&amp;lbViewUnpacked=true">Explore this plot on Neptune</a></figcaption></figure>



<p>The six panes show the activation frequencies for two of the model’s layers (first with index 0 and sixth with index 5), shown across rows, for mlp_pre, mlp_post, and attn_out, shown across columns.</p>



<p>The horizontal axis shows consecutive neurons, sorted by how often they fire. Colors mark the fraction of inputs activating the corresponding neuron. Blue neurons basically never fire, while perfectly yellow neurons fire on every token.</p>



<p>Note that the color legend for mlp_pre and attn_out spans only very high values, all above 99%, meaning that those neurons are very much alive. The mlp_post outputs, however, look quite different. Their colormap covers a much broader dynamic range: some neurons fire almost constantly (close to yellow), but a substantial group sits at the low end, firing very rarely (down to 20%). This uneven distribution is expected because, after the non-linear activation (GELU, more on that later), many neurons are pushed close to zero most of the time.</p>



<p>The key takeaway from these heatmaps is that “dead” or underused neurons mostly appear after the nonlinearity (mlp_post). That’s exactly where we would expect it, since activations are being gated. The pre-activation and attention projections, in contrast, show high activity. This is a desired pattern for our foundation model.</p>



<h3 class="wp-block-heading" id="h-activation-histograms">Activation histograms</h3>



<p>Let’s now turn our attention to the distributions of activation values:</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" width="1585" height="360" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=1585%2C360&#038;ssl=1" alt="distributions of activation values" class="wp-image-48440" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?w=1585&amp;ssl=1 1585w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=768%2C174&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=200%2C45&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=1536%2C349&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=220%2C50&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=120%2C27&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=160%2C36&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=300%2C68&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=480%2C109&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=1020%2C232&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/distributions-of-activation-values.png?resize=1200%2C273&amp;ssl=1 1200w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><a 
href="https://scale.neptune.ai/o/community/org/Detecting%20and%20Fixing%20'Dead%20Neurons'%20in%20Foundation%20Models/runs/details?viewId=standard-view&amp;detailsTab=dashboard&amp;dashboardId=a00121cf-8c68-4664-91d9-5ae439f24135&amp;runIdentificationKey=dead-neurons&amp;type=experiment&amp;experimentsOnly=true&amp;runsLineage=FULL&amp;nameSearchQuery=&amp;nameSearchMode=regex&amp;sortBy=%5B%22sys%2Fcreation_time%22%5D&amp;sortFieldType=%5B%22datetime%22%5D&amp;sortFieldAggregationMode=%5B%22auto%22%5D&amp;sortDirection=%5B%22descending%22%5D&amp;showSelectedHiddenByFilter=false&amp;lbViewUnpacked=true">Explore this plot on Neptune</a></figcaption></figure>



<p>The three charts show very different patterns. Before activation (mlp_pre), the distribution is roughly Gaussian, centered not far from zero. This is a healthy shape; it means inputs are spread across both negative and positive values, allowing the activation function to “decide” which neurons to switch off. If this distribution were strongly shifted (far from zero), the nonlinearity could saturate, leading to more dead neurons. Luckily, this is not the case for our GPT-2.</p>



<p>The mlp_post histogram shows a strong spike at zero with a long right tail. This suggests that most activation outputs fall close to zero. Those that are too close are effectively dead, which corresponds to our insights from the heatmap analysis. A small fraction of inputs produce large positive activations (visible in the tail). These neurons fire selectively on rare but important contexts.</p>



<p>The sharp spike around zero in the self-attention outputs (attn_out) suggests that attention outputs are sparse: many tokens receive little signal from attention heads. Occasional larger and smaller values reflect strong attention weights when the model attends to a key token. This sparsity is consistent with how attention should behave: most queries ignore most keys, but a few connections dominate.</p>



<h3 class="wp-block-heading" id="h-dead-neuron-ratio">Dead neuron ratio</h3>



<p>Let us now examine the ratio of dead neurons, visualized as a line chart:</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" decoding="async" width="1600" height="872" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=1600%2C872&#038;ssl=1" alt="line chart" class="wp-image-48441" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=768%2C419&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=200%2C109&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=1536%2C837&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=220%2C120&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=120%2C65&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=160%2C87&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=300%2C164&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=480%2C262&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=1020%2C556&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/line-chart.png?resize=1200%2C654&amp;ssl=1 1200w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><a href="https://scale.neptune.ai/o/community/org/Detecting%20and%20Fixing%20'Dead%20Neurons'%20in%20Foundation%20Models/runs/details?viewId=standard-view&amp;detailsTab=dashboard&amp;dashboardId=a00121cf-8c68-4664-91d9-5ae439f24135&amp;runIdentificationKey=dead-neurons&amp;type=experiment&amp;compare=uilcBMnjeWDETpJRug6L7I1Hh-SDq9rgDZXzv8KWBJxo" target="_blank" rel="noreferrer noopener">Explore this plot on Neptune</a></figcaption></figure>



<p>The Y-axis on this chart indicates the percentage of neurons that are dead, while the X-axis corresponds to the six model layers, indexed from 0 to 5.</p>



<p>This visualization confirms our findings from the heatmap analysis. The dead ratios are very low overall. Even in mlp_post, 99.9% of neurons fire on at least some tokens. This is extremely healthy. In a larger foundation model, we would likely see higher dead ratios.</p>



<p>Equipped with a visualization toolbox to discover dead neurons, let’s discuss a few approaches to prevent them. The next section covers selecting activation functions; the one after that deals with reviving inactive neurons.</p>


    <a
        href="https://neptune.ai/blog/deep-learning-visualization"
        id="cta-box-related-link-block_623cc9fb128e9091eb99dc1d6c1c2d4c"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    See also                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-visualize-deep-learning-models">                How to Visualize Deep Learning Models            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-alternative-activation-functions">Alternative activation functions</h2>



<p>As we have mentioned before, if gradients in the network get too small, they tend to “vanish”, pushing the surrounding neurons into a state of inactivity. Consequently, one can prevent neurons from dying by ensuring the gradients do not vanish. One way to achieve this is with the right selection of activation functions.</p>



<h3 class="wp-block-heading" id="h-common-activations">Common activations</h3>



<p>Those who pre-train or fine-tune foundation models have the freedom to select the activation functions to be used throughout the network. This choice typically constitutes a trade-off between computation speed and the ability of the activation to prevent neurons from dying.</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1144" height="566" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=1144%2C566&#038;ssl=1" alt="Plots of activation functions" class="wp-image-48443" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?w=1144&amp;ssl=1 1144w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=768%2C380&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=200%2C99&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=220%2C109&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=120%2C59&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=160%2C79&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=300%2C148&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=480%2C237&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Plots-of-activation-functions.png?resize=1020%2C505&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Plots of activation functions commonly used in foundation models: ReLU, Leaky ReLU, ELU, GELU, and Swish.</figcaption></figure>



<p><a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.ReLU.html">ReLU</a> is the fastest one to compute. However, it’s also very likely to produce dying neurons since it outputs zeros for any negative input. If the network’s weights end up in a state where the inputs to ReLU are consistently negative, then the entire ReLU-activated neuron keeps producing zeros. This is the main reason why ReLU is rarely used as anything other than a baseline.</p>



<p><a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.activation.LeakyReLU.html" target="_blank" rel="noreferrer noopener nofollow">Leaky ReLU</a> adds a small but non-zero slope for negative values, decreasing the likelihood of the neurons dying. The <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.activation.ELU.html" target="_blank" rel="noreferrer noopener nofollow">Exponential Linear Unit (ELU)</a> has another desirable characteristic. Just like Leaky ReLU, it has non-zero gradients for negative inputs. Unlike Leaky ReLU, however, ELU is smooth around zero, speeding up training convergence. The downside is that ELU is relatively slow to compute.</p>



<p>A couple of other activations inspired by ELU claim to improve on it. The <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.activation.GELU.html" target="_blank" rel="noreferrer noopener nofollow">Gaussian Error Linear Unit (GELU)</a> weights its inputs by their value instead of simply thresholding by the sign, which has been found to lead to better model performance. <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.activation.SiLU.html" target="_blank" rel="noreferrer noopener nofollow">Swish (also known as SiLU, e.g., in PyTorch)</a> is similar to GELU in shape, but it has been specifically designed and evaluated to serve as a drop-in replacement for ReLU in any neural network.</p>
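<p>In PyTorch, these activations are interchangeable <code>nn.Module</code>s, so comparing how much gradient each one lets through at a negative input takes only a few lines. A minimal sketch of our own (not code from this article): only ReLU blocks the gradient entirely.</p>

```python
import torch
import torch.nn as nn

# Gradient passed through each activation at x = -2; for ReLU it is exactly zero,
# while the alternatives keep a small but non-zero gradient flowing.
x = torch.tensor([-2.0], requires_grad=True)

grads = {}
for act in [nn.ReLU(), nn.LeakyReLU(0.01), nn.ELU(), nn.GELU(), nn.SiLU()]:
    (grad,) = torch.autograd.grad(act(x).sum(), x)
    grads[act.__class__.__name__] = grad.item()

for name, g in grads.items():
    print(f"{name:>9}: gradient at x=-2 is {g:+.4f}")
```

<p>This non-zero gradient for negative inputs is precisely what keeps weight updates flowing to neurons that would be frozen under ReLU.</p>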



<p>A quick literature search reveals many more state-of-the-art activations, such as <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.activation.SELU.html">SELU</a> or <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.activation.Mish.html">Mish</a>. The natural question arises: how to choose one in the context of large foundation models susceptible to dying neurons?</p>



<h3 class="wp-block-heading" id="h-how-to-choose-activation-functions-for-foundation-models">How to choose activation functions for foundation models</h3>



<p>Training deep neural networks is a profoundly experimental endeavor. A typical approach to hyperparameter tuning in deep learning models is to <a href="https://neptune.ai/blog/how-to-optimize-hyperparameter-search" target="_blank" rel="noreferrer noopener">perform a random or Bayesian search over the hyperparameter space</a> and select a combination that results in the best outcome (such as accuracy, convergence speed, or whatever it is that we care the most about).</p>



<p>While the large amount of resources required to train a foundation model makes exploring a large hyperparameter space infeasible, we can still apply a similar approach to picking the activation function, optimizing for neuron liveness.</p>



<section
	id="i-box-block_07b187e31eac2e23048d42c7cea53de6"
	class="block-i-box  l-margin__top--0 l-margin__bottom--0">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                 <strong>How do foundation model teams plan and budget their training runs?</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>The scale of infrastructure and amount of energy required to train a foundation model depend on its size and architecture. In turn, the specific hardware constrains size and architecture, with the GPU memory as a key restriction. Further, larger models generally need more training data, leading to longer training times.</p>



<p>Foundation model teams typically solve this chicken-and-egg problem by defining a compute budget beforehand. As a general rule of thumb, about a fifth of this budget can be spent on the main training run, with the remainder needed for experimentation and test runs.</p>



<p>The main run, which is training the model at full scale, often spans several weeks. Simultaneously, foundation model teams launch experimental runs on the side that are short and use a smaller model variant. The teams use these experimental runs to explore new architectures, hyperparameters, or training schedules. They closely monitor for promising early signals, and once they identify beneficial shifts in metrics, they incorporate these findings into the main training run.</p>



<ul
    id="arrow-list-block_491bceeac115f08038fdc268d854f94e"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Read more about how teams are implementing this iterative approach and other topics in <a href="https://neptune.ai/state-of-foundation-model-training-report" target="_blank" rel="noreferrer noopener">Neptune’s 2025 State of Foundation Model Training Report</a>.</p>


</li>


</ul>


	</div>

</section>






<p>Given a model that we wish to train, we can iteratively swap activation functions in its architecture and for each, compare the rates of dead neurons empirically, as we have seen it done before using simple line charts. Consider the visualization below, which you can also view in the interactive mode in <a href="https://scale.neptune.ai/o/community/org/Detecting%20and%20Fixing%20'Dead%20Neurons'%20in%20Foundation%20Models/runs/compare?viewId=standard-view&amp;dash=dashboard&amp;dashboardId=a0032206-118f-46fd-9d1d-2610de4086ec&amp;nameSearchQuery=&amp;nameSearchMode=substring&amp;sortBy=%5B%22sys%2Fcreation_time%22%5D&amp;sortFieldType=%5B%22datetime%22%5D&amp;sortFieldAggregationMode=%5B%22auto%22%5D&amp;sortDirection=%5B%22descending%22%5D&amp;experimentsOnly=false&amp;showSelectedHiddenByFilter=false&amp;runsLineage=FULL&amp;lbViewUnpacked=true&amp;compare=uIMbiTlI2xAy6wTYnSnTNWzbSI5K28KZrVhqk7nxove0" target="_blank" rel="noreferrer noopener">this Neptune project</a>. I used <a href="https://github.com/MichalOleszak/blogs/blob/main/dead_neurons/activations.py" target="_blank" rel="noreferrer noopener">this Python script</a> to swap the activations, collect dead neuron ratios, and log them into Neptune.</p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="848" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=1600%2C848&#038;ssl=1" alt="ratio of dead neurons" class="wp-image-48445" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=768%2C407&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=200%2C106&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=1536%2C814&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=220%2C117&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=120%2C64&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=160%2C85&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=300%2C159&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=480%2C254&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=1020%2C541&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/ratio-of-dead-neurons.png?resize=1200%2C636&amp;ssl=1 1200w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><a href="https://scale.neptune.ai/o/community/org/Detecting%20and%20Fixing%20'Dead%20Neurons'%20in%20Foundation%20Models/runs/details?viewId=standard-view&amp;detailsTab=charts&amp;runIdentificationKey=activation-benchmark&amp;type=experiment">Explore this plot on Neptune</a></figcaption></figure>



<p>We are again looking at ratios of dead neurons in Tiny GPT-2, shown on the vertical axis. Each line corresponds to one of the activation functions described above. The horizontal axis corresponds to the subsequent model layers. Note that compared to the similar chart we saw before, the threshold for considering a neuron “dead” has been lowered slightly here to show the differences between the activations more prominently.</p>



<p>The comparison reveals substantial differences:</p>



<ul class="wp-block-list">
<li>Unsurprisingly, ReLU (orange) and Leaky ReLU (green) consistently show the highest dead neuron ratios, confirming their tendency to permanently silence neurons.<br></li>



<li>GELU (blue) maintains much lower dead ratios across layers, reflecting why it has become a popular default in modern Transformers (starting with <a href="https://arxiv.org/abs/1810.04805" target="_blank" rel="noreferrer noopener nofollow">BERT</a>; before that, <a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noreferrer noopener nofollow">Vaswani&#8217;s original transformer</a> used ReLU).<br></li>



<li>Swish (purple) and ELU (red) tend to work best in our experiment, with near-zero ratios of dead neurons.</li>
</ul>






<p>This type of experiment makes the trade-offs concrete: while the original Tiny GPT-2 architecture uses GELU activations, this choice seems to be suboptimal as far as the dead neurons are concerned. Swapping the activations to Swish results in a smaller fraction of the network being silenced.</p>



<p>In practice, this means we don’t have to guess: by logging dead neuron ratios across different activations during pilot runs, we can quantitatively compare how much “neuron death” each option induces, and then choose the activation that works best.</p>
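As a minimal sketch of such logging, the snippet below uses forward hooks to collect per-layer dead-neuron ratios; the small MLP, the threshold, and the random data are illustrative stand-ins for the actual model and validation set:

```python
import torch

# Per-layer dead-neuron monitoring via forward hooks. A small ReLU MLP stands
# in for the real model; the threshold and data are illustrative assumptions.
torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

activation_sums = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Track the max absolute activation per neuron across all batches seen.
        activation_sums.setdefault(name, torch.zeros(output.shape[-1]))
        activation_sums[name] = torch.maximum(
            activation_sums[name], output.abs().amax(dim=0)
        )
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.ReLU):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    for _ in range(10):           # stand-in for a validation data loader
        model(torch.randn(32, 16))

threshold = 1e-6                  # neurons that never exceed this count as dead
for name, max_act in activation_sums.items():
    ratio = (max_act < threshold).float().mean().item()
    print(f"layer {name}: dead neuron ratio = {ratio:.2%}")
```

The same loop can be rerun with different activation functions swapped into the model, and the resulting ratios logged to an experiment tracker for comparison.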





    <a
        href="https://neptune.ai/blog/hyperparameter-optimization-for-llms"
        id="cta-box-related-link-block_08e84f51b706cced0151348cc00531dc"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    See also                </div>
            </div>

        
                    <h3 class="c-header" id="h-hyperparameter-optimization-for-llms-advanced-strategies">                Hyperparameter Optimization For LLMs: Advanced Strategies            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-reviving-inactive-neurons">Reviving inactive neurons</h2>



<p>So far, we have discussed how to detect dying neurons and prevent the phenomenon. Let’s now take a look at how to bring neurons back to life once they are dead.</p>



<p>An interesting approach to achieve this is the so-called synaptic stripping, a method introduced by Colorado State University researchers Whitaker and Whitley in <a href="https://arxiv.org/abs/2302.05818" target="_blank" rel="noreferrer noopener nofollow">their 2023 paper “Synaptic Stripping: How Pruning Can Bring Dead Neurons Back To Life”</a>.</p>



<p>As we have seen before, dead neurons arise once their weights shift into a state where no reasonable input produces a non-zero output. Since the gradient is also zero in this regime, those neurons can’t recover through normal backpropagation, effectively reducing the model’s capacity.</p>



<p>The Synaptic Stripping method introduces a clever solution inspired by biology. In neuroscience, synaptic stripping describes a process where immune cells scan the brain, detect dysfunctional synapses, and remove them so that neurons can recover and reconnect. The paper’s authors propose a similar mechanism for deep learning. Here’s the key idea:</p>



<ul class="wp-block-list">
<li>Step 1: Detect dead neurons. After each training epoch, look at the activation outputs on a validation set. If a neuron produces a total activation of zero across the dataset, it’s considered dead.<br></li>



<li>Step 2: Prune negative weights. For each dead neuron, remove (zero-out) a fraction of its most negative incoming weights. This shifts the neuron’s weight distribution toward positive values.<br></li>



<li>Step 3: Resume training. With the problematic synapses stripped away, previously dead neurons regain the ability to fire and re-enter the optimization process. Training continues, with the cycle repeated after each epoch.</li>
</ul>
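The pruning step can be sketched roughly as follows; the layer, the list of dead neurons, and the pruning fraction are all illustrative assumptions, not the paper’s exact settings:

```python
import torch

# A rough sketch of the Synaptic Stripping pruning step for one linear layer.
# `dead_idx` would come from an activation audit on a validation set; the 10%
# pruning fraction is an illustrative choice.
torch.manual_seed(0)
layer = torch.nn.Linear(in_features=64, out_features=64)
dead_idx = [3, 17, 42]      # hypothetical dead neurons in this layer
prune_fraction = 0.1

with torch.no_grad():
    for i in dead_idx:
        w = layer.weight[i]                        # incoming weights of neuron i
        k = max(1, int(prune_fraction * w.numel()))
        # Zero out the k most negative incoming weights, shifting the neuron's
        # pre-activation distribution toward positive values.
        _, idx = torch.topk(w, k, largest=False)
        w[idx] = 0.0

# After stripping, training resumes; the previously dead neurons can fire again.
```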






<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1091" height="264" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=1091%2C264&#038;ssl=1" alt="Synaptic stripping" class="wp-image-48446" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?w=1091&amp;ssl=1 1091w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=768%2C186&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=200%2C48&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=220%2C53&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=120%2C29&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=160%2C39&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=300%2C73&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=480%2C116&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/10/Synaptic-stripping.png?resize=1020%2C247&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Synaptic Stripping. Left: After each training epoch, dead neurons (marked in red) are detected. Center: Problematic connections associated with dead neurons are pruned. Right: The same dead neurons now become active (marked green), and training continues. | <a href="https://arxiv.org/abs/2302.05818?utm_source=chatgpt.com" target="_blank" rel="noreferrer noopener nofollow">Source</a> </figcaption></figure>



<p>As the authors observe, paradoxically, removing parameters in this way can increase effective model capacity. Dead neurons are not contributing to the computation anyway, so pruning the connections that keep them locked in silence gives them a chance to become useful again.</p>



<p>In experiments on vision transformers and MLPs, Synaptic Stripping increased effective model capacity by up to 30%, improved generalization, and reduced model size. An important benefit of this approach is that it is easy to implement, and it can be slotted into any existing training loop.</p>



<h2 class="wp-block-heading" id="h-what-does-this-mean-for-foundation-model-training">What does this mean for foundation model training?</h2>



<p>In a series of small-scale experiments, we explored the phenomenon of dead neurons in foundation models: what they are, why they matter, and how to both detect and mitigate them. We discussed how dead neurons not only waste computation and memory but also silently reduce effective model capacity.</p>



<p>Through simple visualization techniques, such as activation heatmaps, histograms, and dead neuron ratios, we can make the problem visible. From there, we compared activation functions to see which ones are more prone to killing neurons, and we examined Synaptic Stripping as a practical way to revive neurons that would otherwise stay permanently inactive.</p>



<p>An important takeaway from our discussion is that neuron health should be part of the standard toolkit when building and evaluating foundation models. Here are some concrete steps to integrate this into your workflow:</p>



<ul class="wp-block-list">
<li>Run regular neuron activity audits during training. Just like you track loss curves or learning rates, log dead neuron ratios per layer. This gives early visibility into whether parts of the model are shutting down.<br></li>



<li>Set up automated alerts. For example, trigger a warning if more than some percentage of neurons in any layer are dead. This allows you to intervene, for instance, by adjusting activations or applying techniques like Synaptic Stripping.<br></li>



<li>Benchmark neuron health across experiments. When testing new model variants, track dead neuron ratios alongside accuracy metrics. This makes “neuron liveness” a first-class metric for comparing design choices, not just an afterthought.</li>
</ul>
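A minimal sketch of such an automated alert; the 20% threshold and the layer names are illustrative assumptions, not recommended settings:

```python
# Trigger warnings when any layer's dead-neuron ratio crosses a threshold.
DEAD_RATIO_THRESHOLD = 0.20  # illustrative value

def check_neuron_health(dead_ratios):
    """Return warning messages for layers whose dead-neuron ratio is too high."""
    warnings = []
    for layer, ratio in dead_ratios.items():
        if ratio > DEAD_RATIO_THRESHOLD:
            warnings.append(
                f"layer {layer}: {ratio:.0%} of neurons are dead "
                f"(threshold {DEAD_RATIO_THRESHOLD:.0%})"
            )
    return warnings

# Hypothetical per-layer ratios logged during a pilot run:
alerts = check_neuron_health({"mlp.0": 0.05, "mlp.4": 0.31, "mlp.8": 0.12})
print(alerts)
```

In a real training loop, these warnings would be routed to the experiment tracker or an on-call channel rather than printed.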






<p>Foundation models are expensive to train and serve. Making neuron health measurable and actionable is a way to get more out of every GPU-hour while also improving model robustness and generalization.</p>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">48415</post-id>	</item>
		<item>
		<title>Transformers Key-Value Caching Explained</title>
		<link>https://neptune.ai/blog/transformers-key-value-caching</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Thu, 05 Dec 2024 11:30:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=42798</guid>

					<description><![CDATA[The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous 2017 paper “Attention Is All You Need,” it has become the go-to approach for most language-related modeling, including all Large Language Models (LLMs), such as the GPT family, as well as many computer vision tasks. As&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_cf833e11333d7d34281512e5b8009707"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>As the complexity and size of transformer-based models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Key-value (KV) caching is a clever trick to do that: At inference time, key and value matrices are calculated for each generated token. KV caching stores these matrices in memory so that when subsequent tokens are generated, we only compute the keys and values for the new tokens instead of having to recompute everything.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>The inference speedup from KV caching comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Implementing KV caching in large-scale production systems requires careful cache management, including choosing an appropriate strategy for cache invalidation and exploring opportunities for cache reuse.</p>
                                    </div>

            </div>
            </div>


</section>



<p>The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous <a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noreferrer noopener nofollow">2017 paper “Attention Is All You Need</a>,” it has become the go-to approach for most language-related modeling, including all Large Language Models (LLMs), such as the <a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer" target="_blank" rel="noreferrer noopener nofollow">GPT family</a>, as well as many computer vision tasks.</p>



<p>As the complexity and size of these models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies. Key-value (KV) caching is a clever trick to do just that – let’s see how it works and when to use it.</p>



<h2 class="wp-block-heading" id="h-transformer-architecture-overview">Transformer architecture overview</h2>



<p>Before we dive into KV caching, we need to take a short detour to the attention mechanism used in transformers. Understanding how it works is necessary to appreciate how KV caching optimizes transformer inference.</p>



<p>We will focus on autoregressive models used to generate text. These so-called decoder models include the <a href="https://platform.openai.com/docs/models" target="_blank" rel="noreferrer noopener nofollow">GPT family</a>, <a href="https://gemini.google.com/" target="_blank" rel="noreferrer noopener nofollow">Gemini</a>, <a href="https://www.anthropic.com/claude" target="_blank" rel="noreferrer noopener nofollow">Claude</a>, and <a href="https://github.com/features/copilot" target="_blank" rel="noreferrer noopener nofollow">GitHub Copilot</a>. They are trained on a simple task: predicting the next token in a sequence. During inference, the model is provided with some text, and its task is to predict how this text should continue.</p>



<p>From a high-level perspective, most transformers consist of a few basic building blocks:</p>



<ul class="wp-block-list">
<li>A tokenizer that splits the input text into subparts, such as words or sub-words.</li>



<li>An embedding layer that transforms the resulting tokens (and their relative positions within the texts) into vectors.</li>



<li>A couple of basic neural network layers, including dropout, layer normalization, and regular feed-forward linear layers.</li>
</ul>



<p>The last building block missing from the list above is the slightly more involved self-attention module.</p>



<p>The self-attention module is, arguably, the only advanced piece of logic in the transformer architecture. It is the cornerstone of every transformer, enabling it to focus on different parts of the input sequence when generating the outputs. It is this mechanism that gives transformers the ability to model long-range dependencies effectively.</p>



<p>Let’s inspect the self-attention module in more detail.</p>



<h3 class="wp-block-heading" id="h-basic-self-attention-module">Basic self-attention module</h3>



<p>Self-attention is a mechanism that allows the model to “pay attention” to specific parts of the input sequence as it generates the next token. For example, in generating the sentence “She poured the coffee into the cup,” the model might pay more attention to the words “poured” and “coffee” to predict “into” as the next word since these words provide context for what is likely to come next (as opposed to “she” and “the”).</p>



<p>Mathematically speaking, the goal of self-attention is to transform each input (embedded token) into a so-called context vector, which combines the information from all the inputs in a given text. Consider the text “She poured coffee”. Attention will compute three context vectors, one for each input token (let’s assume tokens are words).</p>



<p>To calculate the context vectors, self-attention computes three kinds of intermediate vectors: queries, keys, and values. The diagram below shows step by step how the context vector for the second word, “poured,” is calculated:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=1200%2C628&#038;ssl=1" alt="The diagram shows step by step how the context vector for the second word, “poured,” is calculated." class="wp-image-42824" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The diagram shows step by step how the context vector for the second word, “poured,” is calculated. | Source: Author</figcaption></figure>
</div>


<p>Let’s denote the three tokenized inputs as <em>x1</em>,<em> x2</em>, and <em>x3</em>, respectively. The diagram pictures them as vectors with three elements, but in practice, they will be hundreds or thousands of elements long.</p>



<p>As the first step, self-attention multiplies each input separately with two weight matrices, <em>Wk</em> and <em>Wv</em>. The input for which the context vector is now being computed (<em>x2</em> in our case) is additionally multiplied with a third weight matrix, <em>Wq</em>. All three <em>W</em> matrices are your usual neural network weights, randomly initialized and optimized in the learning process. The outputs of this step are the keys (<em>k</em>) and values (<em>v</em>) vectors for each input, plus an additional query (<em>q</em>) vector for the input being processed.</p>



<p>In step two, the key vector of each input is multiplied by the query vector of the input being processed (our <em>q2</em>). The output is then normalized (not shown in the diagram) to produce the attention weights. In our example, <em>a21</em> is the attention weight between the inputs “She” and “poured.”</p>



<p>Finally, each attention weight is multiplied by its corresponding value vector. The outputs are then summed to produce the context vector z. In our example, the context vector <em>z2</em> corresponds to the input <em>x2</em>, “poured.” The context vectors are the outputs of the self-attention module.</p>



<p>If it’s easier for you to read code than diagrams, take a look at this implementation of the basic self-attention module by Sebastian Raschka. <a href="https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb" target="_blank" rel="noreferrer noopener nofollow">The code</a> is part of his book, “<a href="https://github.com/rasbt/LLMs-from-scratch" target="_blank" rel="noreferrer noopener nofollow">Build A Large Language Model (From Scratch)</a>”:<br></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the Python snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">SelfAttention_v2</span><span class="hljs-params">(torch.nn.Module)</span>:</span>

    <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self, d_in, d_out, qkv_bias=False)</span>:</span>
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=qkv_bias)

    <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self, x)</span>:</span>
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[<span class="hljs-number" style="color: teal;">-1</span>]**<span class="hljs-number" style="color: teal;">0.5</span>, dim=<span class="hljs-number" style="color: teal;">-1</span>)

        context_vec = attn_weights @ values
        <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> context_vec</pre></code></pre>
</div>




<p>Sebastian’s code operates on matrices: the x in his forward() method corresponds to our <em>x1</em>, <em>x2</em>, and <em>x3</em> vectors stacked together as a matrix with three rows. This allows him to simply multiply x with W_key to obtain keys, a matrix consisting of three rows (<em>k1</em>, <em>k2</em>, and <em>k3</em> in our example).</p>



<p>The important takeaway from this brief explanation of self-attention is that in each forward pass, we multiply keys with the queries and then later with the values. Keep this in mind as you read on.</p>
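To make this takeaway concrete, here is a minimal single-layer sketch of the caching idea described in the TL;DR above (an illustration only: no batching, no multi-head attention, no causal masking across layers):

```python
import torch

# Minimal KV caching for one attention layer: keys and values of past tokens
# are kept in memory, so each new token only computes its own k and v.
torch.manual_seed(0)
d = 8
W_q = torch.nn.Linear(d, d, bias=False)
W_k = torch.nn.Linear(d, d, bias=False)
W_v = torch.nn.Linear(d, d, bias=False)

k_cache, v_cache = [], []

def attend_next(x_new):
    """x_new: embedding of the newest token, shape (1, d)."""
    q = W_q(x_new)
    k_cache.append(W_k(x_new))    # only the new token's key/value are computed
    v_cache.append(W_v(x_new))
    keys = torch.cat(k_cache)     # (seq_len, d) -- reused from the cache
    values = torch.cat(v_cache)
    attn = torch.softmax(q @ keys.T / d**0.5, dim=-1)
    return attn @ values          # context vector for the new token

for _ in range(3):                # generate three tokens
    z = attend_next(torch.randn(1, d))
print(len(k_cache), z.shape)      # the cache grows by one entry per token
```

Without the cache, every step would recompute keys and values for the entire sequence; with it, the per-step cost of these projections stays constant.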



<h3 class="wp-block-heading" id="h-advanced-self-attention-modules">Advanced self-attention modules</h3>



<p>The variant of self-attention described above is the simplest, vanilla form. Today&#8217;s largest LLMs typically use slightly modified variants that differ from our basic flavor in three ways:</p>



<div id="case-study-numbered-list-block_2ba59e122be34e319d648c19265bb470"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Attention is causal.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Dropout is used on attention weights.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Multi-head attention is used.            </li>
            </ul>
</div>



<p><a href="https://en.m.wikipedia.org/wiki/Causal_filter" target="_blank" rel="noreferrer noopener nofollow">Causal</a> attention means that when predicting the next token, the model should only consider the previous tokens in the sequence, preventing it from &#8220;looking ahead&#8221; at future words. Going back to our example, “She poured coffee.”: when the model is given the word “She” and attempts to predict the next one (“poured” would be correct), it should not compute or have access to attention weights between “coffee” and any other word, since the word “coffee” has not appeared in the text yet. Causal attention is typically implemented by masking the “look-ahead” part of the attention scores matrix: the masked scores are set to negative infinity before the softmax, which turns the corresponding attention weights into zeros.</p>
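<p>A minimal sketch of causal masking (made-up dimensions and random tensors, no real model): one standard implementation sets the look-ahead scores to negative infinity before the softmax, which makes the corresponding attention weights exactly zero:</p>

```python
import torch

torch.manual_seed(0)
seq_len, d = 3, 4
queries = torch.randn(seq_len, d)
keys = torch.randn(seq_len, d)

attn_scores = queries @ keys.T / d ** 0.5

# Mask the "look-ahead" part: entries above the diagonal are set to -inf,
# so the softmax turns them into exactly zero attention weights.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn_scores = attn_scores.masked_fill(mask, float("-inf"))

attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)  # upper triangle is zero: token i attends only to tokens up to i
```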



<p>Next, to reduce overfitting during training, <a href="https://paperswithcode.com/method/attention-dropout" target="_blank" rel="noreferrer noopener nofollow">dropout is often applied to the attention weights</a>. This means that some of them are randomly set to zero in each forward pass.</p>
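<p>In PyTorch, this is a one-liner on top of the attention weights; a small sketch with made-up tensors:</p>

```python
import torch

torch.manual_seed(0)
attn_weights = torch.softmax(torch.randn(32, 32), dim=-1)

dropout = torch.nn.Dropout(p=0.5)

# Training mode (the default): roughly half the weights are zeroed at random,
# and the survivors are scaled by 1/(1 - p) to preserve the expected sum.
dropped = dropout(attn_weights)

# Inference mode: dropout becomes a no-op.
dropout.eval()
print(torch.equal(dropout(attn_weights), attn_weights))  # True
```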



<p>Finally, basic attention can be referred to as single-head, meaning that there is just one set of <em>Wk</em>, <em>Wq</em>, and <em>Wv</em> matrices. An easy way to increase the model’s capacity is to switch to <a href="https://paperswithcode.com/method/multi-head-attention" target="_blank" rel="noreferrer noopener nofollow">multi-head attention</a>. This boils down to having multiple sets of the W-matrices and, consequently, multiple query, key, and value matrices, as well as multiple context vectors for each input.</p>
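<p>A minimal sketch of multi-head attention (made-up dimensions; causal masking and dropout omitted for brevity): each projection is computed once and then split into several smaller heads:</p>

```python
import torch

torch.manual_seed(0)
seq_len, d_in, d_out, num_heads = 3, 8, 8, 2
head_dim = d_out // num_heads

W_query = torch.nn.Linear(d_in, d_out, bias=False)
W_key = torch.nn.Linear(d_in, d_out, bias=False)
W_value = torch.nn.Linear(d_in, d_out, bias=False)

x = torch.randn(seq_len, d_in)

# Project once, then split into heads: (seq_len, d_out) -> (num_heads, seq_len, head_dim)
q = W_query(x).view(seq_len, num_heads, head_dim).transpose(0, 1)
k = W_key(x).view(seq_len, num_heads, head_dim).transpose(0, 1)
v = W_value(x).view(seq_len, num_heads, head_dim).transpose(0, 1)

# Each head computes its own attention weights and context vectors.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)
context = weights @ v  # (num_heads, seq_len, head_dim)

# Concatenate the heads back into one context vector per token.
context = context.transpose(0, 1).reshape(seq_len, d_out)
print(context.shape)  # torch.Size([3, 8])
```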



<p>Additionally, some transformers implement additional modifications of the attention module with the goal of improving speed or accuracy. Three popular ones are:</p>



<ul class="wp-block-list">
<li><a previewlistener="true" href="https://arxiv.org/abs/2305.13245" target="_blank" rel="noreferrer noopener nofollow">Grouped-query attention</a>: Instead of giving every query head its own key and value projections, query heads are grouped, and all heads within a group share a single key-value head. This shrinks the number of keys and values that must be computed and cached, speeding up inference. It is used by <a previewlistener="true" href="https://arxiv.org/pdf/2407.21783" target="_blank" rel="noreferrer noopener nofollow">Llama 3</a>, <a previewlistener="true" href="https://huggingface.co/docs/transformers/model_doc/mixtral" target="_blank" rel="noreferrer noopener nofollow">Mixtral</a>, and <a previewlistener="true" href="https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf" target="_blank" rel="noreferrer noopener nofollow">Gemma 2</a>.</li>



<li><a previewlistener="true" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noreferrer noopener nofollow">Paged attention</a>: The key-value cache is stored in fixed-size memory blocks (&#8220;pages&#8221;) rather than in one contiguous buffer, borrowing the virtual-memory idea from operating systems. This reduces memory fragmentation and lets the serving system handle more and longer sequences concurrently.</li>



<li><a href="https://paperswithcode.com/method/sliding-window-attention" target="_blank" rel="noreferrer noopener nofollow">Sliding-window attention</a>: The model only attends to nearby tokens within a fixed &#8220;window&#8221; around each token, so it focuses on the local context without needing to look at the entire sequence.</li>
</ul>



<p>All of these state-of-the-art approaches to implementing self-attention don’t change its basic premise and the fundamental mechanism it relies on: one always multiplies the queries by the keys and, later, the resulting attention weights by the values. And as it turns out, at inference time, these multiplications show major inefficiencies. Let’s see why that’s the case.</p>


    <a
        href="/blog/fine-tuning-llama-3-with-lora"
        id="cta-box-related-link-block_1ee12b3066e234f078133294a23baf1b"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-fine-tuning-llama-3-with-lora-step-by-step-guide">                Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-what-is-key-value-caching">What is key-value caching?</h2>



<p>During inference, transformers <a href="/blog/customizing-llm-output-post-processing-techniques" target="_blank" rel="noreferrer noopener">generate one token at a time</a>. When we prompt the model to start generation by passing “She,” it will produce one word, such as “poured” (for the sake of avoiding distractions, let’s keep assuming one token is one word). Then, we can pass “She poured” to the model, and it produces “coffee.” Next, we pass “She poured coffee” and obtain the end-of-sequence token from the model, indicating that it considers generation to be complete.</p>



<p>This means we have run the forward pass three times, each time multiplying the queries by the keys to obtain the attention scores (the same applies to the later multiplication by the values).</p>



<p>In the first forward pass, there was just one input token (“She”), resulting in just one key vector and one query vector. We multiplied them to obtain the <em>q1k1</em> attention score.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=1200%2C628&#038;ssl=1" alt="In the first forward pass, there is just one input token (“She”), resulting in just one key vector and one query vector. We multiply them to obtain the q1k1 attention score." class="wp-image-42826" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>Next, we passed “She poured” to the model. It now sees two input tokens, so the computation inside our attention module looks as follows:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=1200%2C628&#038;ssl=1" alt="Next, we pass “She poured” to the model. It now sees two input tokens." class="wp-image-42827" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-3.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>We did the multiplication to compute three terms, but <em>q1k1</em> was computed needlessly—we had already calculated it before! This <em>q1k1</em> element is the same as in the previous forward pass because:</p>



<ul class="wp-block-list">
<li><em>q1</em> is calculated as the embedding of the input (“She”) times the <em>Wq</em> matrix,</li>



<li><em>k1</em> is calculated as the embedding of the input (“She”) times the <em>Wk</em> matrix,</li>



<li>Both the embeddings and the weight matrices are constant at inference time.</li>
</ul>



<p>Note the grayed-out entries in the attention scores matrix: these are masked with zero to achieve causal attention. For example, the top-right element where <em>q1k3</em> would have been is not shown to the model as we don’t know the third word (and <em>k3</em>) at the moment of generating the second word.</p>



<p>Finally, here is the illustration of the query-times-keys calculation in our third forward pass.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=1200%2C628&#038;ssl=1" alt="We get the illustration of the query-times-keys calculation in the third forward pass." class="wp-image-42829" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-4.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>We spend computational effort on calculating six values, even though half of them are already known from previous passes and don’t need to be recomputed!</p>



<p>You may already have a hunch about what key-value caching is all about. At inference, as we compute the keys (<em>K</em>) and values (<em>V</em>) matrices, we store their elements in the cache. The cache is an auxiliary memory from which high-speed retrieval is possible. As subsequent tokens are generated, we only compute the keys and values for the new tokens.</p>



<p>For example, this is how the third forward pass would look with caching:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=1200%2C628&#038;ssl=1" alt="An example on how the third forward pass could look with caching." class="wp-image-42831" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Transformers-Key-Value-Caching-Explained-5.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>When processing the third token, we don’t need to recompute the previous token&#8217;s attention scores. We can retrieve the keys and values for the first two tokens from the cache, thus saving computation time.</p>
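<p>The mechanism can be sketched with plain tensors (made-up weight matrices and embeddings, no real model): at each step, we compute the key and value only for the newest token, append them to the cache, and attend over the full cached matrices:</p>

```python
import torch

torch.manual_seed(0)
d = 4
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
embeddings = torch.randn(3, d)  # stand-ins for "She", "poured", "coffee"

# The cache holds the key and value rows computed in earlier steps.
k_cache, v_cache = [], []

for step in range(3):
    x_new = embeddings[step : step + 1]  # only the newest token
    q_new = x_new @ W_q
    # Compute the key/value for the new token only; reuse the rest from cache.
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    keys = torch.cat(k_cache)    # (step + 1, d)
    values = torch.cat(v_cache)  # (step + 1, d)
    attn = torch.softmax(q_new @ keys.T / d ** 0.5, dim=-1)
    context = attn @ values

# The cached computation matches a full recomputation for the last token
# (under causal attention, the last token attends to the whole sequence).
q_full, k_full, v_full = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
attn_full = torch.softmax(q_full[2:] @ k_full.T / d ** 0.5, dim=-1)
print(torch.allclose(context, attn_full @ v_full))  # True
```

<p>The final context vector is identical to what a full recomputation would produce; the cache only removes redundant work.</p>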



<h2 class="wp-block-heading" id="h-assessing-the-impact-of-key-value-caching">Assessing the impact of key-value caching</h2>



<p>Key-value caching may have a significant impact on inference time. The magnitude of this impact depends on the model architecture: the more cacheable computations there are, the larger the potential to reduce inference time.</p>



<p>Let’s analyze the impact of K-V caching on generation time using the <a previewlistener="true" href="https://huggingface.co/EleutherAI/gpt-neo-1.3B" target="_blank" rel="noreferrer noopener nofollow">GPT-Neo-1.3B model from EleutherAI, which is available on the Hugging Face Hub</a>.</p>



<p>We will start by defining a timer context manager to calculate generation time:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> time

<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">Timer</span>:</span>

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__enter__</span><span class="hljs-params">(self)</span>:</span>
       self._start = time.time()
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__exit__</span><span class="hljs-params">(self, exc_type, exc_value, traceback)</span>:</span>
       self._end = time.time()
       self.duration = self._end - self._start

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">get_duration</span><span class="hljs-params">(self)</span> -&gt; float:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> self.duration</pre></code></pre>
</div>




<p>Next, we load the model from the Hugging Face Hub, set up the tokenizer, and define the prompt:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> transformers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> AutoTokenizer, AutoModelForCausalLM

model_name = <span class="hljs-string" style="color: rgb(221, 17, 68);">"EleutherAI/gpt-neo-1.3B"</span>
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = torch.device(<span class="hljs-string" style="color: rgb(221, 17, 68);">"cuda"</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> torch.cuda.is_available() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"cpu"</span>)
model.to(device)

input_text = <span class="hljs-string" style="color: rgb(221, 17, 68);">"Why is a pour-over the only acceptable way to drink coffee?"</span></pre></code></pre>
</div>




<p>Finally, we can define the function to run model inference:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">generate</span><span class="hljs-params">(use_cache)</span>:</span>
    input_ids = tokenizer.encode(
        input_text,
        return_tensors=<span class="hljs-string" style="color: rgb(221, 17, 68);">"pt"</span>,
    ).to(device)
    output_ids = model.generate(
        input_ids,
        max_new_tokens=<span class="hljs-number" style="color: teal;">100</span>,
        use_cache=use_cache,
    )
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> output_ids</pre></code></pre>
</div>




<p>Note the <span class="c-code-snippet">use_cache</span> argument we pass to <span class="c-code-snippet">model.generate</span>: It controls whether K-V caching is employed.</p>



<p>With this setup, we can measure the average generation time with and without K-V caching:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> use_cache <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> (<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>, <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>):
    gen_times = []
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> _ <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(<span class="hljs-number" style="color: teal;">10</span>):
        <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> Timer() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> t:
            generate(use_cache=use_cache)
        gen_times.append(t.duration)
    print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"Average inference time with use_cache={use_cache}: {np.round(np.mean(gen_times), 2)} seconds"</span>)</pre></code></pre>
</div>




<p>I have executed this code in <a href="https://colab.research.google.com/" target="_blank" rel="noreferrer noopener nofollow">Google Colab</a> on their free-tier T4 GPU with <span class="c-code-snippet">torch==2.5.1+cu121</span> and <span class="c-code-snippet">transformers==4.46.2</span> on Python 3.10.12 and obtained the following output:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Average inference time <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> use_cache=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">False</span>: <span class="hljs-number" style="color: teal;">9.28</span> seconds
Average inference time <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> use_cache=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>: <span class="hljs-number" style="color: teal;">3.19</span> seconds</pre></code></pre>
</div>




<p>As you can see, in this case, the speedup from caching is almost threefold.</p>



<h2 class="wp-block-heading" id="h-challenges-and-trade-offs">Challenges and trade-offs</h2>



<p>As is usually the case, there is no such thing as a free lunch. The generation speedup we have just seen can only be achieved at the cost of increased memory usage, and the cache requires careful management in production systems.</p>



<h3 class="wp-block-heading" id="h-latency-memory-trade-off">Latency-memory trade-off</h3>



<p>Storing data in the cache uses up memory space. Systems with limited memory resources may struggle to accommodate this additional memory overhead, potentially resulting in out-of-memory errors. This is especially the case when long inputs need to be processed, as the memory required for the cache grows linearly with the input length.</p>
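<p>A back-of-the-envelope estimate makes this concrete: the cache stores one key and one value vector per token, per layer, per head. The configuration values below are approximate figures for GPT-Neo-1.3B (24 layers, hidden size 2048 split into 16 heads of 128 dimensions) at float32 precision; treat them as illustrative rather than exact:</p>

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size=1, bytes_per_param=4):
    """Memory needed to cache one key and one value vector per token, per layer, per head."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_param

# A single float32 sequence at a full 2048-token context:
size = kv_cache_bytes(n_layers=24, n_heads=16, head_dim=128, seq_len=2048)
print(f"{size / 1024**3:.2f} GiB")  # 0.75 GiB
```

<p>Under these assumptions, a single sequence already consumes about 0.75 GiB, and the requirement scales linearly with both sequence length and batch size.</p>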



<p>Another aspect to keep in mind is that the additional memory consumed by the cache is not available for storing the batches of data. As a result, one might need to reduce the batch size to keep it within the memory limits, thus decreasing the throughput of the system.</p>



<p>If the memory consumed by the cache becomes a problem, one can trade some model accuracy for memory savings. Specifically, one can truncate the sequences, prune attention heads, or quantize the model:</p>



<ul class="wp-block-list">
<li>Sequence truncation refers to limiting the maximum input sequence length, thus capping the cache size at the expense of losing long-term context. In tasks where this long context is relevant, the model’s accuracy might suffer.</li>



<li>Reducing the number of layers or attention heads, thereby decreasing both the model size and cache memory requirements, is another strategy to reclaim some memory. However, reducing model complexity may impact its accuracy.</li>



<li>Finally, there is <a href="/blog/deep-learning-model-optimization-methods" target="_blank" rel="noreferrer noopener">quantization</a>, which means using lower-precision data types (e.g., float16 instead of float32) for caching to reduce memory usage. Yet again, model accuracy can suffer.</li>
</ul>



<p>To sum up, faster latency provided by K-V caching comes at the cost of increased memory usage. If there is sufficient memory, it’s a non-issue. If the memory becomes the bottleneck, however, one can reclaim it by simplifying the model in various ways, thus transitioning from a latency-memory trade-off to a latency-accuracy trade-off.</p>


    <a
        href="/blog/running-llms-locally"
        id="cta-box-related-link-block_a0a1a228607deaee5fd0cfc1d4fa12b4"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-run-llms-locally">                How to Run LLMs Locally            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-kv-cache-management-in-production-systems">KV cache management in production systems</h2>



<p>In large-scale production systems with many users, the K-V cache needs to be properly managed to ensure consistent and reliable response time while preventing excessive memory consumption. The two most critical aspects of this are cache invalidation (when to clear it) and cache reuse (how to use the same cache multiple times).</p>



<h3 class="wp-block-heading" id="h-cache-invalidation">Cache invalidation</h3>



<p>Three of the most popular cache invalidation strategies are session-based clearing, time-to-live invalidation, and contextual relevance-based approaches. Let’s explore them in this order.</p>



<p>The most basic cache invalidation strategy is session-based clearing. We simply clear the cache at the end of a user session or conversation with the model. This simple strategy is a perfect fit for applications where conversations are short and independent of each other.</p>



<p>Think about a customer support chatbot application in which each user session typically represents an individual conversation where the user seeks assistance with specific issues. In this context, the contents of this cache are unlikely to be needed again. Clearing the K-V cache once the user ends the chat or the session times out due to inactivity is a good choice, freeing up memory for the application to handle new users.</p>



<p>In situations where individual sessions are long, however, there are better solutions than session-based clearing. In time-to-live (TTL) invalidation, cache contents are automatically cleared after a certain period. This strategy is a good choice when the relevance of cached data diminishes predictably over time.</p>



<p>Consider a news aggregator app that provides real-time updates. Cached keys and values might only be relevant for as long as the news is hot. Implementing a TTL policy where cached entries expire after, say, one day ensures that responses to similar queries about fresh developments are generated fast while old news doesn’t fill up memory.</p>
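<p>A TTL policy can be sketched in a few lines of plain Python (a toy helper, not tied to any serving framework; the cached string stands in for real key-value tensors):</p>

```python
import time

class TTLCache:
    """Minimal time-to-live cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, insertion_time)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._store[key]  # expired: invalidate on access
            return None
        return value

cache = TTLCache(ttl_seconds=0.05)
cache.put("She poured", "cached K/V tensors")
print(cache.get("She poured"))  # cached K/V tensors
time.sleep(0.06)
print(cache.get("She poured"))  # None: the entry has expired
```

<p>Production systems typically evict expired entries in the background rather than on access, but the invalidation rule is the same.</p>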



<p>Finally, the most sophisticated of the three popular cache invalidation strategies is based on contextual relevance. Here, we clear the cache contents as soon as they become irrelevant to the current context or user interaction. This strategy is ideal when the application handles diverse tasks or topics within the same session, and the previous context doesn&#8217;t contribute value to the new one.</p>



<p>Think about a coding assistant that works as an IDE plug-in. While the user is working on a particular set of files, the cache should be retained. As soon as they switch to a different codebase, however, the previous keys and values become irrelevant and can be deleted to free memory. Contextual relevance-based approaches might be challenging to implement, though, as they require pinpointing the event or point in time at which the context switch occurs.</p>



<h3 class="wp-block-heading" id="h-cache-reuse">Cache reuse</h3>



<p>Another important aspect of cache management is cache reuse. On some occasions, a once-generated cache can be used again to speed up generation and save memory by avoiding storing the same data multiple times in different users’ cache instances.</p>



<p>Cache reuse opportunities typically show up when there is shared context and/or a warm start is desirable.</p>



<p>In scenarios where multiple requests share a common context, one can reuse the cache for that shared portion. In e-commerce platforms, certain products may have standard descriptions or specifications that are frequently asked about by multiple customers. These may include product details (“55-inch 4K Ultra HD Smart LED TV”), warranty information (“Comes with a 2-year manufacturer&#8217;s warranty covering parts and labor.&#8221;), or customer instructions (&#8220;For best results, mount the TV using a compatible wall bracket, sold separately.&#8221;). By caching the key-value pairs for these shared product descriptions, a customer support chatbot will generate responses to common questions faster.</p>



<p>Similarly, one can precompute and cache the initial K-V pairs for frequently used prompts or queries. Consider a voice-activated virtual assistant application. Users frequently start interactions with phrases like &#8220;What&#8217;s the weather today?&#8221; or &#8220;Set a timer for 10 minutes.&#8221; The assistant can respond more quickly by precomputing and caching the key-value pairs for these frequently used queries.</p>
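<p>The idea of reusing cached keys and values for a shared prefix can be sketched in a few lines. The structure below is a simplified illustration rather than the API of any specific inference engine; production systems typically match prefixes with token-ID tries instead of a linear scan:</p>

```python
class PrefixKVCache:
    """Illustrative store for K-V states of frequently reused prompt prefixes."""

    def __init__(self):
        self._prefixes = {}  # tuple of token IDs -> precomputed K-V state

    def add_prefix(self, token_ids, kv_state):
        self._prefixes[tuple(token_ids)] = kv_state

    def longest_match(self, token_ids):
        """Return (matched_length, kv_state) for the longest cached prefix
        of token_ids, or (0, None) if nothing matches."""
        best_len, best_kv = 0, None
        for prefix, kv in self._prefixes.items():
            n = len(prefix)
            if n > best_len and tuple(token_ids[:n]) == prefix:
                best_len, best_kv = n, kv
        return best_len, best_kv
```

<p>A request whose token sequence starts with a cached prefix only needs to compute keys and values for the remaining tokens.</p>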


    <a
        href="/blog/llm-observability"
        id="cta-box-related-link-block_7e01765f2ce10bd5154101f60832fc34"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-llm-observability-fundamentals-practices-and-tools">                LLM Observability: Fundamentals, Practices, and Tools            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Key-value (K-V) caching is a technique in transformer models where the key and value matrices from previous steps are stored and reused during the generation of subsequent tokens. This avoids redundant computations and speeds up inference. The speedup comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy. Implementing K-V caching in large-scale production systems requires careful cache management: choosing a strategy for cache invalidation and exploring the opportunities for cache reuse.</p>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">42798</post-id>	</item>
		<item>
		<title>Reinforcement Learning From Human Feedback (RLHF) For LLMs</title>
		<link>https://neptune.ai/blog/reinforcement-learning-from-human-feedback-for-llms</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Thu, 12 Sep 2024 11:00:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=40623</guid>

					<description><![CDATA[Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today&#8217;s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained version dubbed ChatGPT that became an overnight&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_c397b7f1417876f600d81b210dfabc45"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Reinforcement Learning from Human Feedback (RLHF) unlocked the full potential of today&#8217;s large language models (LLMs).</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>By integrating human judgment into the training process, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>The RLHF process consists of three steps: collecting human feedback in the form of a preference dataset, training a reward model to mimic human preferences, and fine-tuning the LLM using the reward model. The last step is enabled by the Proximal Policy Optimization (PPO) algorithm.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Alternatives to RLHF include Constitutional AI where the model learns to critique itself whenever it fails to adhere to a predefined set of rules and Reinforcement Learning from AI Feedback (RLAIF) in which off-the-shelf LLMs replace humans as preference data providers.</p>
                                    </div>

            </div>
            </div>


</section>



<p>Reinforcement Learning from Human Feedback (RLHF) has turned out to be the key to unlocking the full potential of today&#8217;s large language models (LLMs). There is arguably no better evidence for this than OpenAI’s GPT-3 model. It was released back in 2020, but it was only its RLHF-trained version dubbed ChatGPT that became an overnight sensation, capturing the attention of millions and setting a new standard for conversational AI.</p>



<p>Before RLHF, the LLM training process typically consisted of a pre-training stage in which the model learned the general structure of the language and a fine-tuning stage in which it learned to perform a specific task. By integrating human judgment as a third training stage, RLHF ensures that models not only produce coherent and useful outputs but also align more closely with human values, preferences, and expectations. It achieves this through a feedback loop where human evaluators rate or rank the model&#8217;s outputs, which is then used to adjust the model’s behavior.</p>



<p>This article explores the intricacies of RLHF. We will look at its importance for language modeling, analyze its inner workings in detail, and discuss the best practices for implementation.</p>



<h2 class="wp-block-heading" id="h-importance-of-rlhf-in-llms">Importance of RLHF in LLMs</h2>



<p>When analyzing the importance of RLHF to language modeling, one could approach it from two different perspectives.</p>



<p>On the one hand, this technique has emerged as a response to the limitations of traditional supervised fine-tuning, such as reliance on static datasets often limited in scope, context, and diversity, as well as broader human values, ethics, or social norms. Additionally, traditional fine-tuning often struggles with tasks that involve subjective judgment or ambiguity, where there may be multiple valid answers. In such cases, a model might favor one answer over another based on the training data, even if the alternative might be more appropriate in a given context. RLHF provides a way to lift some of these limitations.</p>



<p>On the other hand, however, RLHF represents a paradigm shift in the fine-tuning of LLMs. It forms a standalone, transformative change in the evolution of AI rather than a mere incremental improvement over existing methods.</p>



<p>Let’s look at it from the latter perspective first.</p>



<p>The paradigm shift brought about by RLHF lies in the integration of human feedback directly into the training loop, enabling models to better align with human values and preferences. This approach prioritizes dynamic model-human interactions over static training datasets. By incorporating human insights throughout the training process, RLHF ensures that models are more context-aware and capable of handling the complexities of natural language.</p>



<p>I now hear you asking: “But how is injecting the human into the loop better than traditional fine-tuning, in which we train the model in a supervised fashion on a static dataset? Can’t we simply pass human preferences to the model by constructing a fine-tuning dataset based on these preferences?” That’s a fair question.</p>



<p>Consider succinctness as a preference for a text summarizing model. We could <a href="/blog/llm-fine-tuning-and-model-selection-with-neptune-transformers" target="_blank" rel="noreferrer noopener">fine-tune a Large Language Model</a> on concise summaries by training it in a supervised manner on the set of input-output pairs where input is the original text and output is the desired summary.</p>



<p>The problem here is that <a href="/blog/llm-evaluation-text-summarization" target="_blank" rel="noreferrer noopener">different summaries can be equally good</a>, and different groups of people will have preferences as to what level of succinctness is optimal in different contexts. When relying solely on traditional supervised fine-tuning, the model might learn to generate concise summaries, but it won&#8217;t necessarily grasp the subtle balance between brevity and informativeness that different users might prefer. This is where RLHF offers a distinct advantage.</p>

<p>In RLHF, we train the model on the following dataset:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=1200%2C628&#038;ssl=1" alt="In RLHF, we train the model on the following data set.
Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly passing human preference to the model via the label indicating the “better” output, we can ensure it aligns with it properly." class="wp-image-40636" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>Each example consists of the long input text, two alternative summaries, and a label that signals which of the two was preferred by a human annotator. By directly passing human preference to the model via the label indicating the “better” output, we can ensure it aligns with it properly.</p>
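<p>In code, a single training example in this schematic format might look as follows. The field names are hypothetical, chosen to mirror the figure above rather than any standard schema:</p>

```python
# One illustrative preference example; field names are hypothetical.
example = {
    "input_text": "Long article to be summarized ...",
    "summary_1": "A short, punchy summary.",
    "summary_2": "A longer, more detailed summary.",
    "preference": 1,  # the human annotator preferred summary_1
}


def chosen_and_rejected(ex):
    """Map the schematic format to a (chosen, rejected) pair,
    the layout used by chosen/rejected-style preference datasets."""
    if ex["preference"] == 1:
        return ex["summary_1"], ex["summary_2"]
    return ex["summary_2"], ex["summary_1"]
```

<p>Datasets that store explicit chosen/rejected columns encode exactly the same information as the label-based layout; the two formats are interchangeable via a mapping like the one above.</p>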



<p>Let’s discuss how this works in detail.</p>


    <a
        href="/blog/llm-evaluation-text-summarization"
        id="cta-box-related-link-block_aa4fd0e883b6a6cf5f71e2dbaaeac2e7"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-llm-evaluation-for-text-summarization">                LLM Evaluation For Text Summarization            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-the-rlhf-process">The RLHF process</h2>



<p>The RLHF process consists of three steps:</p>



<ol class="wp-block-list">
<li>Collecting human feedback.</li>



<li>Training a reward model.</li>



<li>Fine-tuning the LLM using the reward model.</li>
</ol>



<p>The algorithm enabling the last step in the process is Proximal Policy Optimization (PPO).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=1200%2C628&#038;ssl=1" alt="High-Level overview of Reinforcement Learning from Human Feeback (RLHF). A reward model is trained on a preference dataset that includes the input, alternative outputs, and a label indicating which of the outputs is preferable. The LLM is fine-tuned through reinforcement learning with Proximal Policy Optimization (PPO)." class="wp-image-40637" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" 
/><figcaption class="wp-element-caption">High-level overview of Reinforcement Learning from Human Feedback (RLHF). A reward model is trained on a preference dataset that includes the input, alternative outputs, and a label indicating which of the outputs is preferable. The LLM is fine-tuned through reinforcement learning with Proximal Policy Optimization (PPO).</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-collecting-human-feedback">Collecting human feedback</h3>



<p>The first step in RLHF is to collect human feedback in the so-called preference dataset. In its simplest form, each example in this dataset consists of a prompt, two different answers produced by the LLM as the response to this prompt, and an indicator for which of the two answers was deemed better by a human evaluator.</p>



<p>The specific dataset formats differ and are not too important. The schematic dataset shown above used four fields: Input text, Summary 1, Summary 2, and Preference. <a previewlistener="true" href="https://huggingface.co/datasets/Anthropic/hh-rlhf?row=41" target="_blank" rel="noreferrer noopener nofollow">Anthropic’s hh-rlhf dataset</a> uses a different format: two columns with the chosen and rejected version of a dialogue between a human and an AI assistant, where the prompt is the same in both cases.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="978" height="324" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=978%2C324&#038;ssl=1" alt="An example entry from Anthropic’s hh-rlhf preference dataset. The left column contains the prompt and the better answer produced by the model. The right column contains the exact same prompt and the worse answer, as judged by a human evaluator." class="wp-image-40647" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?w=978&amp;ssl=1 978w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=768%2C254&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=200%2C66&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=220%2C73&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=120%2C40&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=160%2C53&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=300%2C99&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_13.png?resize=480%2C159&amp;ssl=1 480w" sizes="auto, (max-width: 978px) 100vw, 978px" /><figcaption class="wp-element-caption">An example entry from Anthropic’s hh-rlhf preference dataset. The left column contains the prompt and the better answer produced by the model. 
The right column contains the exact same prompt and the worse answer, as judged by a human evaluator. | <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf?row=2" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Regardless of the format, the information contained in the human preference dataset is always the same: it’s not that one answer is good and the other is bad. It’s that one, albeit imperfect, is preferred over the other – it’s all about <em>preference</em>.</p>



<p>Now you might wonder why the labelers are asked to pick one of two responses instead of, say, scoring each response on a scale. The problem with scores is that they are subjective. Scores provided by different individuals, or even two scores from the same labeler but on different examples, are not comparable.</p>



<p>So how do the labelers decide which of the two responses to pick? This is arguably the most important nuance in RLHF. The labelers are offered specific instructions outlining the evaluation protocol. For example, they might be instructed to pick the answers that don’t use swear words, the ones that sound more friendly, or the ones that don’t offer any dangerous information. What the instructions tell the labelers to focus on is crucial to the RLHF-trained model, as it will only align with those human values that are contained within these instructions.</p>



<p>More advanced approaches to building a preference dataset might involve humans ranking more than two responses to the same prompt. Consider three different responses: A, B, and C.</p>



<p>Human annotators have ranked them as follows, where “1” is best, and “3” is worst:</p>



<ul class="wp-block-list">
<li>A &#8211; 2</li>



<li>B &#8211; 1</li>



<li>C &#8211; 3</li>
</ul>



<p>Out of these, we can create three pairs resulting in three training examples:</p>



<div id="medium-table-block_ed5be6dc68b79292c924a040df8f5f9f"
     class="block-medium-table c-table__outer-wrapper  aligncenter l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Preferred response                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Non-preferred response                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>B</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>A</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>A</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>C</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>B</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>C</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>
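<p>Turning a ranking over N responses into pairwise training examples is mechanical. A minimal sketch, assuming ranks are given as a dictionary with 1 as the best rank:</p>

```python
from itertools import combinations


def ranking_to_pairs(responses, ranks):
    """Convert a ranking of responses into pairwise preference examples.

    `ranks` maps response ID -> rank, where 1 is best. Each returned
    tuple is (preferred response, non-preferred response).
    """
    pairs = []
    for a, b in combinations(responses, 2):
        if ranks[a] < ranks[b]:  # lower rank number means "better"
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs
```

<p>For the ranking above, <code>ranking_to_pairs(["A", "B", "C"], {"A": 2, "B": 1, "C": 3})</code> returns the three pairs from the table: <code>[("B", "A"), ("A", "C"), ("B", "C")]</code>. In general, a ranking of N responses yields N(N&#8722;1)/2 training pairs.</p>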



<h3 class="wp-block-heading" id="h-training-a-reward-model">Training a reward model</h3>



<p>Once we have our preference dataset in place, we can use it to train a reward model (RM).</p>



<p>The reward model is typically also an LLM, often encoder-only, such as BERT. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces one output, called a reward, for each of the two responses:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=1200%2C628&#038;ssl=1" alt="Training a reward model: the reward model is typically also an LLM, often encoder-only, such as BERT. During training, the RM receives three inputs from the preference dataset: the prompt, the winning response, and the losing response. It produces two outputs, called rewards, for each of the responses." class="wp-image-40638" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_3.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>The training objective is to maximize the reward difference between the winning and losing response. An often-used loss function is the <a href="/blog/cross-entropy-loss-and-its-applications-in-deep-learning" target="_blank" rel="noreferrer noopener">cross-entropy loss</a> between the two rewards.</p>
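<p>For a single pair, this loss is often written in the Bradley&#8211;Terry form: the probability that the chosen response wins is the sigmoid of the reward margin, and the loss is its negative log. A minimal sketch:</p>

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise cross-entropy over two rewards (Bradley-Terry form):
    -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    p_chosen_wins = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(p_chosen_wins)
```

<p>The loss shrinks as the margin between the winning and losing rewards grows, which is exactly the maximization described above; at zero margin it equals log&nbsp;2.</p>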



<p>This way, the reward model learns to distinguish between more and less preferred responses, effectively ranking them based on the preferences encoded in the dataset. As the model continues to train, it becomes better at predicting which responses will likely be preferred by a human evaluator.</p>



<p>Once trained, the reward model serves as a simple regressor predicting the reward value for the given prompt-completion pair:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=1200%2C628&#038;ssl=1" alt="Once trained, the reward model serves as a simple regressor predicting the reward value for the given prompt-completion pair." class="wp-image-40639" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_4.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<h3 class="wp-block-heading" id="h-fine-tuning-the-llm-with-the-reward-model">Fine-tuning the LLM with the reward model</h3>




<p>The third and final RLHF stage is fine-tuning. This is where the reinforcement learning takes place.</p>



<p>The fine-tuning stage requires another dataset that is different from the preference dataset. It consists of prompts only, which should be similar to what we expect our LLM to deal with in production. Fine-tuning teaches the LLM to produce aligned responses <em>for these prompts.</em></p>



<p>Specifically, the goal of fine-tuning is to train the LLM to produce completions that maximize the rewards given by the reward model. The training loop looks as follows:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=1200%2C628&#038;ssl=1" alt="Fine-tuning the LLM with the reward model: first, we pass a prompt from the training set to the LLM and generate a completion. Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward. This reward is fed into an optimization algorithm such as PPO, which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example." class="wp-image-40641" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=480%2C251&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_6.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>First, we pass a prompt from the training set to the LLM and generate a completion. Next, the prompt and the completion are passed to the reward model, which in turn predicts the reward. This reward is fed into an optimization algorithm such as PPO (more about it in the next section), which then adjusts the LLM’s weights in a direction resulting in a better RM-predicted reward for the given training example (not unlike <a href="/blog/deep-learning-optimization-algorithms" target="_blank" rel="noreferrer noopener">gradient descent</a> in traditional deep learning).</p>
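<p>The loop described above can be sketched as a single training step. All callables here are hypothetical stand-ins for the real components (the LLM, the reward model, and the PPO optimizer), shown only to make the data flow explicit:</p>

```python
def rlhf_training_step(llm_generate, reward_model, ppo_update, prompt):
    """One step of the RLHF fine-tuning loop."""
    # 1. The LLM under training generates a completion for the prompt
    completion = llm_generate(prompt)
    # 2. The reward model scores the prompt-completion pair
    reward = reward_model(prompt, completion)
    # 3. The optimizer (e.g., PPO) adjusts the LLM's weights toward higher reward
    ppo_update(prompt, completion, reward)
    return reward

# Toy stand-ins, just to demonstrate the flow:
reward_log = []
r = rlhf_training_step(
    llm_generate=lambda p: p + " ... some completion",
    reward_model=lambda p, c: float(len(c) > len(p)),  # dummy scalar reward
    ppo_update=lambda p, c, rew: reward_log.append(rew),
    prompt="Explain RLHF briefly.",
)
```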



<h3 class="wp-block-heading" id="h-proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO)</h3>



<p>One of the most popular optimizers for RLHF is the Proximal Policy Optimization algorithm or PPO. Let’s unpack this mouthful.</p>



<p>In the reinforcement learning context, the term “policy” refers to the strategy used by an agent to decide its actions. In the RLHF world, the policy is the LLM we are training, which decides which tokens to generate in its responses. Hence, “policy optimization” means we are optimizing the LLM’s weights.</p>



<p>What about “proximal”? The term &#8220;proximal&#8221; refers to the key idea in PPO of making only small, controlled changes to the policy during training. This prevents an issue all too common in traditional policy gradient methods, where large updates to the policy can sometimes lead to significant performance drops.</p>



<h4 class="wp-block-heading">PPO under the hood</h4>



<p>The PPO loss function is composed of three components:</p>



<ul class="wp-block-list">
<li><strong>Policy Loss:</strong> PPO’s primary objective when improving the LLM.</li>



<li><strong>Value Loss:</strong> Used to train the value function, which estimates the future rewards from a given state. The value function allows for computing the advantage, which in turn is used to update the policy.</li>



<li><strong>Entropy Loss:</strong> Encourages exploration by penalizing certainty in the action distribution, allowing the LLM to remain creative.</li>
</ul>



<p>The total PPO loss can be expressed as:</p>



<section id="note-block_ee009d8b0f94e34b3ac09d9cb383168c"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>L_PPO = L_POLICY + a × L_VALUE + b × L_ENTROPY</p>
                                    </div>

            </div>
            </div>


</section>



<p>where a and b are weight hyperparameters.</p>



<p>The entropy loss component is just the entropy of the probability distribution over the next tokens during generation. We don’t want this entropy to become too small, as that would discourage diversity in the produced texts.</p>
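<p>For a single generation step, this is the Shannon entropy of the next-token distribution. A quick sketch over toy probability vectors:</p>

```python
import math

def next_token_entropy(probs):
    """Shannon entropy (in nats) of a distribution over next tokens.
    Higher entropy means a more diverse, less certain distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution is maximally entropic; a peaked one is nearly certain
uniform = next_token_entropy([0.25, 0.25, 0.25, 0.25])  # = log(4)
peaked = next_token_entropy([0.97, 0.01, 0.01, 0.01])
```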



<p>The value loss component is computed step-by-step as the LLM generates subsequent tokens. At each step, the value loss is the difference between the actual future total reward (based on the full completion) and its current-step approximation through the so-called value function. Reducing the value loss trains the value function to be more accurate, resulting in better future reward prediction.</p>
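<p>In its simplest form, the value loss is the mean squared error between the value function’s per-step predictions and the actual returns computed from the full completion. A minimal sketch:</p>

```python
def value_loss(predicted_values, actual_returns):
    """Mean squared error between the value function's step-wise predictions
    and the actual future total rewards (returns)."""
    n = len(predicted_values)
    return sum((v - g) ** 2 for v, g in zip(predicted_values, actual_returns)) / n
```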



<p>In the policy loss component, we use the value function to predict future rewards over different possible completions (next tokens). Based on these, we can estimate the so-called advantage term, which captures how much better or worse one completion is compared to all possible completions.</p>



<p>If the advantage term for a given completion is positive, it means that increasing the probability of this particular completion being generated will lead to a higher reward and, thus, a better-aligned model. Hence, we should tweak the LLM’s parameters such that this probability is increased.</p>
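<p>The standard PPO formulation implements this idea (and the “proximal” constraint from earlier) as a clipped surrogate objective: the probability ratio between the new and old policies is clipped so that no single update moves the policy too far. A per-token sketch:</p>

```python
def ppo_policy_loss(prob_new: float, prob_old: float,
                    advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate for one token choice. The probability ratio is
    clipped to [1 - eps, 1 + eps] so each update stays 'proximal'."""
    ratio = prob_new / prob_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Minimize the negative of the pessimistic (min) clipped objective
    return -min(ratio * advantage, clipped * advantage)
```

<p>With a positive advantage, the loss rewards increasing the token’s probability, but only up to the clipping threshold; beyond it, the gradient vanishes and the update stops.</p>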



<h4 class="wp-block-heading">PPO alternatives</h4>



<p>PPO is not the only optimizer used for RLHF. With the current pace of AI research, new alternatives spring up like mushrooms. Let’s take a look at a few worth mentioning.</p>



<p><a href="https://arxiv.org/pdf/2305.18290" target="_blank" rel="noreferrer noopener nofollow">Direct Preference Optimization (DPO)</a> is based on an observation that the cross-entropy loss used to train the reward model in RLHF can be directly applied to fine-tune the LLM. DPO is more efficient than PPO and has been shown to yield better response quality.</p>
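<p>The DPO objective applies exactly this kind of ranking loss, but directly to the policy’s log-probabilities, measured relative to a frozen reference model, so no separate reward model is needed. A single-pair sketch (beta is a temperature hyperparameter; the values are illustrative):</p>

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO: -log sigmoid of the beta-scaled implicit reward margin, where
    the implicit reward is the policy's log-prob relative to the reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```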


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=1200%2C628&#038;ssl=1" alt="Comparison between Direct Policy Optimization (DPO) and Proximal Policy Optimization (PPO). DPO (right) requires fewer steps as it does not use the reward model, unlike PPO (left)." class="wp-image-40643" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_8.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><br>Comparison between Direct Policy Optimization (DPO) and Proximal Policy Optimization (PPO). 
DPO (right) requires fewer steps as it does not use the reward model, unlike PPO (left). | Modified based on: <a href="https://arxiv.org/pdf/2305.18290" target="_blank" rel="noreferrer noopener nofollow">Source</a> </figcaption></figure>
</div>


<p>Another interesting alternative to PPO is <a href="https://arxiv.org/pdf/2310.13639" target="_blank" rel="noreferrer noopener nofollow">Contrastive Preference Learning (CPL)</a>. Its proponents claim that PPO’s assumption that human preferences are distributed according to reward is incorrect; rather, recent work suggests that they follow regret. Similarly to DPO, CPL circumvents the need for training a reward model, replacing it with a regret-based model of human preferences trained with a contrastive loss.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="818" height="251" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=818%2C251&#038;ssl=1" alt="A comparison between traditional RLHF and Contrastive Preference Learning (CPL). CPL uses a regret-based model instead of a reward model." class="wp-image-40646" style="width:810px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?w=818&amp;ssl=1 818w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=768%2C236&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=200%2C61&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=220%2C68&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=120%2C37&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=160%2C49&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=300%2C92&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_12.png?resize=480%2C147&amp;ssl=1 480w" sizes="auto, (max-width: 818px) 100vw, 818px" /><figcaption class="wp-element-caption">A comparison between traditional RLHF and Contrastive Preference Learning (CPL). CPL uses a regret-based model instead of a reward model. 
| <a href="https://arxiv.org/pdf/2310.13639" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-best-practices-for-rlhf">Best practices for RLHF</h2>



<p>Let’s go back to the vanilla PPO-based RLHF. Having gone through the RLHF training process on a conceptual level, we’ll now discuss a couple of best practices to follow when implementing RLHF and the tools that might come in handy.</p>



<h3 class="wp-block-heading" id="h-avoiding-reward-hacking">Avoiding reward hacking</h3>



<p><a href="https://en.wikipedia.org/wiki/Reward_hacking" target="_blank" rel="noreferrer noopener nofollow">Reward hacking</a> is a prevalent issue in reinforcement learning. It refers to a situation where the agent has learned to cheat the system in that it maximizes the reward by taking actions that don’t align with the original objective.</p>



<p>In the context of RLHF, reward hacking means that the training has converged to an unlucky spot in the loss surface where the generated responses happen to lead to high rewards but don’t make much sense to the user.</p>



<p>Luckily, there is a simple trick that helps prevent reward hacking. During fine-tuning, we take advantage of the initial, frozen copy of the LLM (as it was before RLHF training) and pass it the same prompt that we pass the LLM instance we are training.</p>



<p>Then, we compute the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence" target="_blank" rel="noreferrer noopener nofollow">Kullback-Leibler Divergence</a> between the responses from the original model and the model under training. KL Divergence measures the dissimilarity between the two responses. We want the responses to stay rather similar, ensuring the updated model has not diverged too far from its starting version. Thus, we treat the KL Divergence value as a “reward penalty” and add it to the reward before passing it to the PPO optimizer.</p>
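<p>A sketch over a single next-token distribution (in practice the divergence is accumulated per token across the whole completion; note the penalty enters with a negative sign, so that diverging from the frozen model lowers the effective reward):</p>

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) between two next-token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, probs_trained, probs_frozen, kl_coef=0.1):
    """Reward-model score minus a KL penalty that keeps the trained LLM
    close to its frozen, pre-RLHF copy."""
    return reward - kl_coef * kl_divergence(probs_trained, probs_frozen)
```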



<p>Incorporating this anti-reward-hacking trick into our fine-tuning pipeline yields the following updated version of the previous figure:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=1200%2C628&#038;ssl=1" alt="To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL Divergence between the two models’ outputs and add it to the reward. " class="wp-image-40642" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_7.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<div id="separator-block_6052665fbf8e2d919c61b6e27d3d9cf0"
         class="block-separator block-separator--20">
</div>



<p>To prevent reward hacking, we pass the prompt to two instances of the LLM: the one being trained and its frozen version from before the training. Then, we compute the reward penalty as the KL Divergence between the two models’ outputs and add it to the reward. This prevents the trained model from diverging too much from its initial version.</p>



<h3 class="wp-block-heading" id="h-scaling-human-feedback">Scaling human feedback</h3>



<p>As you might have noticed, the RLHF process has one bottleneck: the collection of human feedback in the form of the preference dataset is a slow manual process that needs to be repeated whenever alignment criteria (labelers’ instructions) change. Can we completely remove humans from the process?</p>



<p>We can certainly reduce their engagement, thus making the process more efficient. One approach to doing this is model self-supervision, or “<a href="https://arxiv.org/abs/2212.08073" target="_blank" rel="noreferrer noopener nofollow">Constitutional AI</a>.”</p>



<p>The central point is the Constitution, which consists of a set of rules that should govern the model’s behavior (think: “do not swear,” “be friendly,” etc.). A human <a href="https://en.wikipedia.org/wiki/Red_team" target="_blank" rel="noreferrer noopener nofollow">red team</a> then prompts the LLM to generate harmful or misaligned responses. Whenever they succeed, they ask the model to critique its own responses according to the constitution and revise them. Finally, the model is trained using the red team’s prompts and the model’s revised responses.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=1200%2C628&#038;ssl=1" alt="An overview of Constitutional AI. In this approach, the model is asked to follow a set of guidelines (“constitution”) and learns to critique its own misaligned responses according to it. " class="wp-image-40644" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_9.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">An overview of Constitutional AI. 
In this approach, the model is asked to follow a set of guidelines (“constitution”) and learns to critique its own misaligned responses according to it. | Modified based on: <a previewlistener="true" href="https://arxiv.org/abs/2212.08073" target="_blank" rel="noreferrer noopener nofollow">source</a></figcaption></figure>
</div>


<p><a previewlistener="true" href="https://arxiv.org/pdf/2309.00267">Reinforcement Learning from AI Feedback (RLAIF)</a> is yet another way to eliminate the need for human feedback. In this approach, one simply uses an off-the-shelf LLM to provide preferences for the preference dataset.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=1200%2C628&#038;ssl=1" alt="A comparison between RLAIF (top) and RLHF (bottom). In RLAIF, an off-the-shelf LLM takes the place of the human to generate feedback in the form of a preference dataset." class="wp-image-40671" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/09/Reinforcement-Learning-From-Human-Feedback-For-LLMs_14.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">A comparison between RLAIF (top) and RLHF (bottom). 
In RLAIF, an off-the-shelf LLM takes the place of the human to generate feedback in the form of a preference dataset. | Modified based on: <a previewlistener="true" href="https://arxiv.org/pdf/2309.00267" target="_blank" rel="noreferrer noopener nofollow">source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-tooling-and-frameworks">Tooling and frameworks</h2>



<p>Let’s briefly examine some available tools and frameworks that facilitate RLHF implementation.</p>



<h3 class="wp-block-heading" id="h-data-collection">Data collection</h3>



<p>Don’t have your preference dataset yet? Two great platforms that facilitate its collection are Prolific and Mechanical Turk.</p>



<p><a href="https://www.prolific.com/rlhf" target="_blank" rel="noreferrer noopener nofollow">Prolific</a> is a platform for collecting human feedback at scale that is useful for gathering preference data through surveys and experiments. Amazon’s <a href="https://www.mturk.com/" target="_blank" rel="noreferrer noopener nofollow">Mechanical Turk</a> (MTurk) service allows for outsourcing data labeling tasks to a large pool of human workers, commonly used for obtaining labels for machine-learning models.</p>



<p>Prolific is known for having a more curated and diverse participant pool. The platform emphasizes quality and typically recruits reliable participants with a history of providing high-quality data. MTurk, on the other hand, has a more extensive and varied participant pool, but it can be less curated. This means there may be a broader range of participant quality.</p>



<h3 class="wp-block-heading" id="h-end-to-end-rlhf-frameworks">End-to-end RLHF frameworks</h3>



<p>If you are a Google Cloud Platform (GCP) user, you can very easily take advantage of their <a href="https://cloud.google.com/blog/products/ai-machine-learning/rlhf-on-google-cloud" target="_blank" rel="noreferrer noopener nofollow">Vertex AI RLHF pipeline</a>. It abstracts away the whole training logic; all you need to do is supply the preference dataset (to train the reward model) and the prompt dataset (for the RL-based fine-tuning) and execute the pipeline.</p>



<p>The disadvantage is that since the training logic is abstracted away, it’s not straightforward to <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-text-models-rlhf" target="_blank" rel="noreferrer noopener nofollow">make custom changes</a>. However, this is a great place to start if you are just beginning your RLHF adventure or don’t have the time or resources to build custom implementations.</p>



<p>Alternatively, check out <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat" target="_blank" rel="noreferrer noopener nofollow">DeepSpeed Chat</a>, Microsoft’s open-source system for training and deploying chat models using RLHF, providing tools for data collection, model training, and deployment.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>We have discussed how important the paradigm shift brought about by RLHF is to training language models, making them aligned with human preferences. We analyzed the three steps of the RLHF training pipeline: collecting human feedback, training the reward model, and fine-tuning the LLM. Next, we took a more detailed look at Proximal Policy Optimization, the most popular optimization algorithm used in RLHF, along with some alternatives. Finally, we discussed how to avoid reward hacking using KL Divergence and how to reduce human engagement in the process with approaches such as Constitutional AI and RLAIF. We also reviewed a couple of tools facilitating RLHF implementation.</p>



<p>You are now well-equipped to fine-tune your own large language models with RLHF! If you do, let me know what you built!</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">40623</post-id>	</item>
		<item>
		<title>Adversarial Machine Learning: Defense Strategies</title>
		<link>https://neptune.ai/blog/adversarial-machine-learning-defense-strategies</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Thu, 11 Jul 2024 11:00:00 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=38047</guid>

					<description><![CDATA[The growing prevalence of ML models in business-critical applications results in an increased incentive for malicious actors to attack the models for their benefit. Developing robust defense strategies becomes paramount as the stakes grow, especially in high-risk applications like autonomous driving and finance. In this article, we’ll review common attack strategies and dive into the&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_6fe5faa713eba265306db9efbaeb4f18"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Adversarial attacks manipulate ML model predictions, steal models, or extract data.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Different attack types exist, including evasion, data poisoning, Byzantine, and model extraction attacks.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Defense strategies like adversarial learning, monitoring, defensive distillation, and differential privacy improve robustness against adversarial attacks.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Multiple aspects have to be considered when evaluating the effectiveness of different defense strategies, including the method’s robustness, impact on model performance, and adaptability to the constant flow of brand-new attack mechanisms.</p>
                                    </div>

            </div>
            </div>


</section>



<p>The growing prevalence of ML models in business-critical applications results in an increased incentive for malicious actors to attack the models for their benefit. Developing robust defense strategies becomes paramount as the stakes grow, especially in high-risk applications like autonomous driving and finance.</p>



<p>In this article, we’ll review common attack strategies and dive into the latest defense mechanisms for shielding machine learning systems against adversarial attacks. Join us as we unpack the essentials of safeguarding your AI investments.</p>



<h2 class="wp-block-heading" id="h-understanding-adversarial-attacks-in-ml">Understanding adversarial attacks in ML</h2>



<p>“Know thine enemy”—this famous saying, derived from Sun Tzu&#8217;s<em> </em><a href="https://en.wikipedia.org/wiki/The_Art_of_War" target="_blank" rel="noreferrer noopener nofollow"><em>The Art of War</em></a>, an ancient Chinese military treatise, is just as applicable to machine-learning systems today as it was to 5th-century BC warfare.</p>



<p>Before we discuss defense strategies against adversarial attacks, let’s briefly examine how these attacks work and what types of attacks exist. We will also review a couple of examples of successful attacks.</p>



<h3 class="wp-block-heading" id="h-goals-of-adversarial-machine-learning-attacks">Goals of adversarial machine learning attacks</h3>



<p>An adversary typically attacks your AI system for one of two reasons:</p>



<ul class="wp-block-list">
<li>To impact the predictions made by the model.</li>



<li>To retrieve and steal the model and/or the data it was trained on.</li>
</ul>



<h4 class="wp-block-heading">Adversarial attacks to impact model outputs</h4>



<p>Attackers could introduce noise or misleading information into a model&#8217;s training data or inference input to alter its outputs.</p>



<p>The goal might be to bypass an ML-based security gate. For example, the attackers might try to fool a spam detector and deliver unwanted emails straight to your inbox.</p>



<p>Alternatively, attackers might be interested in ensuring that a model produces an output that’s favorable for them. For instance, attackers planning to defraud a bank might be seeking a positive credit score.&nbsp;</p>



<p>Finally, the corruption of a model’s outputs can be driven by the will to render the model unusable. Attackers could target a model used for facial recognition, causing it to misidentify individuals or fail to recognize them at all, thus completely paralyzing security systems at an airport.</p>



<h4 class="wp-block-heading">Adversarial attacks to steal models and data</h4>



<p>Attackers can also be interested in stealing the model itself or its training data.</p>



<p>They might repeatedly probe the model to see which inputs lead to which outputs, eventually learning to mimic the proprietary model’s behavior. The motivation is often to use it for their own purpose or to sell it to an interested party.</p>



<p>Similarly, attackers might be able to retrieve the training data from the model and use it for their benefit or simply sell it. Sensitive data such as personally identifiable information or medical records are worth a lot on the data black market.</p>



<h3 class="wp-block-heading" id="h-types-of-adversarial-attacks">Types of adversarial attacks</h3>



<p>Adversarial machine learning attacks fall into two groups, depending on how much the attacker knows about the target system:</p>



<ul class="wp-block-list">
<li>In <strong>white-box attacks, </strong>the adversary has full access to the model architecture, its weights, and sometimes even its training data. They can feed the model any desired input, observe its inner workings, and collect the raw model output.</li>

<li>In <strong>black-box attacks, </strong>the attacker knows nothing about the internals of their target system. They can only access it for inference, i.e., feed the system an input sample and collect the post-processed output.</li>
</ul>



<p>Unsurprisingly, the white-box scenario is better for attackers. With detailed model information, they can craft highly effective adversarial campaigns that exploit specific model vulnerabilities. (We’ll see examples of this later.)</p>



<p>Regardless of the level of access to the targeted machine learning model, adversarial attacks can be further categorized as:</p>



<ul class="wp-block-list">
<li>Evasion attacks,</li>



<li>Data-poisoning attacks,</li>



<li>Byzantine attacks,</li>



<li>Model-extraction attacks.</li>
</ul>



<h4 class="wp-block-heading">Evasion attacks</h4>



<p>Evasion attacks aim to alter a model’s output. They trick it into making incorrect predictions by introducing subtly altered adversarial inputs during inference.</p>



<p>An infamous example is the picture of a panda below: after adding noise imperceptible to the human eye, the model classifies the image as depicting a gibbon.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1162" height="444" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=1162%2C444&#038;ssl=1" alt="Evasion attack. A model classifies an image as a panda. After adding a small amount of random noise to the image, invisible to the human eye, it is classified as a gibbon with extremely high confidence." class="wp-image-38051" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?w=1162&amp;ssl=1 1162w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=768%2C293&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=200%2C76&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=220%2C84&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=120%2C46&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=160%2C61&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=300%2C115&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=480%2C183&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-3.png?resize=1020%2C390&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Evasion attack. A model classifies an image as a panda. 
After adding a small amount of random noise to the image, invisible to the human eye, it is classified as a gibbon with extremely high confidence | <a previewlistener="true" href="https://arxiv.org/abs/1412.6572" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>Attackers can deliberately craft the noise to make the model produce the desired output. One common approach to achieve this is the <a href="/blog/adversarial-attacks-on-neural-networks-exploring-the-fast-gradient-sign-method" target="_blank" rel="noreferrer noopener">Fast Gradient Sign Method</a> (FGSM), in which the noise is calculated as the sign of the gradient of the model’s loss function with respect to the input, with the goal of maximizing the prediction error.</p>



<p>The FGSM approach bears some resemblance to the <a href="/blog/deep-learning-optimization-algorithms" target="_blank" rel="noreferrer noopener">model training process</a>. Just like during regular training, where, given the inputs, the weights are optimized to minimize the loss, FGSM optimizes the inputs given the weights to maximize the loss.</p>



<p>Attacks with FGSM are only feasible in a white-box scenario, where the gradient can be calculated directly. In the black-box case, attackers must resort to methods like <a href="https://arxiv.org/abs/1708.03999" target="_blank" rel="noreferrer noopener nofollow">Zeroth-Order Optimization</a> or <a href="https://arxiv.org/abs/1712.04248" target="_blank" rel="noreferrer noopener nofollow">Boundary Attacks</a> that approximate the gradients.</p>
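<p>To make the mechanics concrete, here is a minimal numpy sketch of an FGSM step against a logistic-regression classifier. The weights, input, and epsilon are toy values chosen for illustration; real attacks target deep networks and obtain the input gradient via automatic differentiation:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, x, y):
    # Binary cross-entropy of a logistic-regression model on one example.
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm_perturb(w, x, y, eps):
    # Gradient of the loss w.r.t. the INPUT (not the weights); for
    # logistic regression it has the closed form (sigmoid(w @ x) - y) * w.
    grad_x = (sigmoid(w @ x) - y) * w
    # FGSM step: move each feature by eps in the direction that increases the loss.
    return x + eps * np.sign(grad_x)

w = np.array([1.5, -2.0, 0.5])   # toy "trained" weights
x = np.array([0.2, -0.4, 1.0])   # input correctly classified as class 1
y = 1.0

x_adv = fgsm_perturb(w, x, y, eps=0.25)
print(bce_loss(w, x, y), bce_loss(w, x_adv, y))  # the adversarial loss is higher
```

<p>Note that the adversarial input differs from the original by at most epsilon per feature, yet the model is noticeably less confident in the correct class.</p>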


    <a
        href="/blog/adversarial-attacks-on-neural-networks-exploring-the-fast-gradient-sign-method"
        id="cta-box-related-link-block_a2a52baa7231bd06da3d75c24d90b939"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
<h3 class="c-header" id="h-adversarial-attacks-on-neural-networks-exploring-the-fast-gradient-sign-method">                Adversarial Attacks on Neural Networks: Exploring the Fast Gradient Sign Method            </h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h4 class="wp-block-heading">Data-poisoning attacks</h4>



<p>Data-poisoning attacks are another flavor of adversarial machine learning. They aim to contaminate a model&#8217;s training set to impact its predictions.</p>



<p>An attacker typically needs direct access to the training data to conduct a data-poisoning attack. This could be an employee involved in developing the ML system, a scenario known as an insider threat.</p>



<p>Consider the following data sample a bank used to train a credit-scoring algorithm. Can you spot anything fishy?</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="460" height="513" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?resize=460%2C513&#038;ssl=1" alt="Adversarial machine learning: data-poisoning attacks." class="wp-image-38050" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?w=460&amp;ssl=1 460w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?resize=179%2C200&amp;ssl=1 179w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?resize=220%2C245&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?resize=120%2C134&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?resize=160%2C178&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-2.png?resize=300%2C335&amp;ssl=1 300w" sizes="auto, (max-width: 460px) 100vw, 460px" /></figure>
</div>





<p>If you look closely, you will notice that every 30-year-old was assigned a credit score above 700. This so-called backdoor could have been introduced by corrupt employees. A model trained on the data will likely pick up the strong correlation between age==30 and a high credit score and consequently approve a credit line for any 30-year-old – perhaps the employees themselves or their co-conspirators.</p>



<p>However, data poisoning is also possible without direct data access. Today, a lot of training data is user-generated. Content recommendation engines or large language models are trained on data scraped from the internet. Thus, everyone can create malicious data that might end up in a model training set. Think about fake news campaigns attempting to bias recommendation and moderation algorithms.</p>



<h4 class="wp-block-heading">Byzantine attacks</h4>



<p>Byzantine attacks target <a href="/blog/distributed-training" target="_blank" rel="noreferrer noopener">distributed or federated learning systems</a>, where the training process is spread across multiple devices or compute units. These systems rely on individual units to perform local computations and send updates to a central server, which aggregates these updates to refine a global model.</p>



<p>In a <a href="https://en.wikipedia.org/wiki/Byzantine_fault" target="_blank" rel="noreferrer noopener nofollow">Byzantine</a> attack, an adversary compromises some of these compute units. Instead of sending correct updates, the compromised units send misleading updates to the central aggregation server. The goal of these attacks is to corrupt the global model during the training phase, leading to poor performance or even malfunctioning when it is deployed.</p>



<h4 class="wp-block-heading">Model-extraction attacks</h4>



<p>Model-extraction attacks consist of repeatedly probing the model to retrieve its concept (the input-output mapping it has learned) or the data it was trained on. They are typically black-box attacks. (In the white-box scenario, one already has access to the model.)</p>



<p>To extract a model, the adversary might send a large number of heterogeneous requests to the model that try to span most of the feature space and record the received outputs. The data collected this way could be enough to train a model that will mimic the original model’s behavior.</p>
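<p>The probe-and-imitate loop can be sketched in a few lines of numpy. The hidden model, the perceptron surrogate, and all sizes below are illustrative; real extraction attacks target complex models behind an API:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
w_secret = np.array([2.0, -1.0, 0.5])  # hidden model the attacker cannot see

def oracle(x):
    # Black-box API: returns only the post-processed label, not the internals.
    return int(x @ w_secret > 0)

# 1) Probe: span the feature space with many queries, record the outputs.
X = rng.normal(size=(2000, 3))
y = np.array([oracle(x) for x in X])

# 2) Fit a surrogate (here: a perceptron) on the stolen input-output pairs.
w_surrogate = np.zeros(3)
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = int(xi @ w_surrogate > 0)
        w_surrogate += (yi - pred) * xi  # perceptron update rule

# 3) Measure how well the surrogate mimics the original on fresh inputs.
X_test = rng.normal(size=(500, 3))
agree = np.mean([oracle(x) == int(x @ w_surrogate > 0) for x in X_test])
print(agree)
```

<p>In this toy setting, the surrogate ends up agreeing with the original model on the vast majority of fresh inputs, even though the attacker never sees its weights.</p>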



<p>For neural networks, this attack is particularly efficient if the adversary knows a model’s entire output distribution. In a process known as <a href="/blog/knowledge-distillation" target="_blank" rel="noreferrer noopener">knowledge distillation</a>, the model trained by the attackers learns to replicate not just the original model’s output but also its inner prediction process.</p>



<p>Extracting the training data from the model is trickier, but bad actors have their ways. For example, a model&#8217;s loss on training data is typically smaller than its loss on previously unseen data. In the white-box scenario, the attackers might feed many data points to the model and use the loss to infer whether a data point was part of the training set.</p>
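<p>A toy numpy sketch of this loss-thresholding idea, using a deliberately memorizing model. The data, model, and threshold are illustrative; practical membership-inference attacks calibrate the cutoff on shadow models:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# A deliberately memorizing model: it stores its training set and answers
# with the label of the nearest training input (an extreme form of overfitting).
x_train = np.sort(rng.uniform(0, 1, size=10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=10)

def model_loss(x, y):
    # Squared error of the memorizing model on one candidate point.
    pred = y_train[np.argmin(np.abs(x_train - x))]
    return (pred - y) ** 2

# Fresh points from the same distribution, never seen during "training".
fresh_x = rng.uniform(0, 1, size=10)
fresh_y = np.sin(2 * np.pi * fresh_x) + rng.normal(scale=0.1, size=10)

threshold = 1e-12  # hypothetical cutoff
members = [model_loss(x, y) < threshold for x, y in zip(x_train, y_train)]
outsiders = [model_loss(x, y) < threshold for x, y in zip(fresh_x, fresh_y)]
print(sum(members), sum(outsiders))
```

<p>Training points incur (near-)zero loss, while fresh points do not, so thresholding the loss reveals which points the model was trained on.</p>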



<p>Attackers can reconstruct training data with quite high accuracy. In the paper <a href="https://www.cs.cmu.edu/~mfredrik/papers/fjr2015ccs.pdf" target="_blank" rel="noreferrer noopener nofollow"><em>Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures</em></a><em> </em>by Fredrikson et al., the authors demonstrated how to recover recognizable images of people’s faces given only their names and access to an ML face recognition model. <a href="https://blog.openmined.org/extracting-private-data-from-a-neural-network/" target="_blank" rel="noreferrer noopener nofollow">In his post</a> on the OpenMined blog, Tom Titcombe discusses the approach in more detail and includes a replicable example.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="394" height="212" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?resize=394%2C212&#038;ssl=1" alt="Model-extraction attack. The original training sample (right) was reconstructed from a face-recognition model (left)" class="wp-image-38053" style="width:462px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?w=394&amp;ssl=1 394w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?resize=200%2C108&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?resize=220%2C118&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?resize=120%2C65&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?resize=160%2C86&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-5.png?resize=300%2C161&amp;ssl=1 300w" sizes="auto, (max-width: 394px) 100vw, 394px" /><figcaption class="wp-element-caption">Model-extraction attack. The original training sample (right) was reconstructed from a face-recognition model (left) | <a href="https://www.cs.cmu.edu/~mfredrik/papers/fjr2015ccs.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-examples-of-adversarial-attacks">Examples of adversarial attacks</h2>



<p>Adversarial machine learning attacks can have disastrous consequences. Let’s examine a couple of examples from different domains.</p>



<p>Researchers from Tencent&#8217;s Keen Security Lab <a href="https://keenlab.tencent.com/en/2019/03/29/Tencent-Keen-Security-Lab-Experimental-Security-Research-of-Tesla-Autopilot/" target="_blank" rel="noreferrer noopener nofollow">conducted experiments on Tesla’s autopilot system</a>, demonstrating they could manipulate it by placing small objects on the road or modifying lane markings. These attacks caused the car to change lanes unexpectedly or misinterpret road conditions.</p>



<p>In the paper “<a href="https://acmccs.github.io/papers/p103-zhangAemb.pdf" target="_blank" rel="noreferrer noopener nofollow">DolphinAttack: Inaudible Voice Commands</a>,” the authors showed that ultrasonic commands inaudible to humans could manipulate voice-controlled systems like Siri, Alexa, and Google Assistant to perform actions without the user&#8217;s knowledge.</p>



<p>In the world of finance, where a great deal of securities trading is performed by automated systems (the so-called algorithmic trading), <a href="https://arxiv.org/pdf/2002.09565" target="_blank" rel="noreferrer noopener nofollow">it has been shown</a> that a simple, low-cost attack can cause the machine learning algorithm to mispredict asset returns, leading to a money loss for the investor.</p>



<p>While the examples above are research results, there have also been widely publicized adversarial attacks. <a href="https://en.wikipedia.org/wiki/Tay_(chatbot)" target="_blank" rel="noreferrer noopener nofollow">Microsoft’s AI chatbot Tay</a> was launched in 2016 and was supposed to learn from interactions with Twitter users. However, adversarial users quickly exploited Tay by bombarding it with offensive tweets, leading Tay to produce inappropriate and offensive content within hours of its launch. This incident <a href="https://www.bbc.com/news/technology-35890188" target="_blank" rel="noreferrer noopener nofollow">forced Microsoft to take Tay offline</a>.</p>



<h2 class="wp-block-heading" id="h-defense-strategies-for-adversarial-machine-learning">Defense strategies for adversarial machine learning</h2>



<p>Equipped with a thorough understanding of adversaries’ goals and strategies, let’s look at some defense strategies that improve the robustness of AI systems against attacks.</p>



<h3 class="wp-block-heading" id="h-adversarial-learning">Adversarial learning</h3>



<p>Adversarial learning, also called adversarial training, is arguably the simplest way to make a machine-learning model more robust against evasion attacks.</p>



<p>The basic idea is to put on the attacker’s hat and generate adversarial examples to add to the model’s training dataset. This way, the ML model learns to produce correct predictions for these slightly perturbed inputs.</p>



<p>Technically speaking, adversarial learning modifies the model’s loss function. During training, for each batch of training examples, we generate another batch of adversarial examples using the attacking technique of choice based on the model’s current weights. Next, we evaluate separate loss functions for the original and the adversarial samples. The final loss used to update the weights is a weighted average between the two losses:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="533" height="96" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=533%2C96&#038;ssl=1" alt="Defense strategies for adversarial machine learning: adversarial learning" class="wp-image-38052" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?w=533&amp;ssl=1 533w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=200%2C36&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=220%2C40&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=120%2C22&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=160%2C29&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=300%2C54&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-4.png?resize=480%2C86&amp;ssl=1 480w" sizes="auto, (max-width: 533px) 100vw, 533px" /></figure>
</div>





<p>Here, <em>m </em>and<em> k</em> are the numbers of original and adversarial examples in the batch, respectively, and λ is a weighting factor: the larger it is, the stronger we enforce the robustness against adversarial samples, at the cost of potentially decreasing the performance on the original ones.</p>
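<p>In code, the combined objective might look as follows. The normalization differs between papers, so treat this numpy sketch as one plausible formulation rather than the canonical one:</p>

```python
import numpy as np

def combined_loss(clean_losses, adv_losses, lam):
    # Weighted average over m clean and k adversarial per-example losses:
    # the larger lam, the more the adversarial batch dominates the update.
    m, k = len(clean_losses), len(adv_losses)
    return (np.sum(clean_losses) + lam * np.sum(adv_losses)) / (m + lam * k)

clean = np.array([0.2, 0.3, 0.25])  # toy per-example losses on original inputs
adv = np.array([0.9, 1.1, 1.0])     # toy per-example losses on FGSM-style inputs

print(combined_loss(clean, adv, lam=0.0))  # ignores the adversarial examples
print(combined_loss(clean, adv, lam=1.0))  # weighs both batches equally
```

<p>With lam=0, training reduces to the ordinary objective; increasing lam trades clean-data performance for robustness.</p>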



<p>Adversarial learning is a highly effective defense strategy. However, it comes with one crucial limitation: The model trained in an adversarial way is only robust against the attack flavors used for training.</p>



<p>Ideally, one would use all the state-of-the-art adversarial attack strategies to generate perturbed training examples, but this is impossible. First, some of them require a lot of compute, and second, the arms race continues, and attackers are constantly inventing new techniques.</p>



<h3 class="wp-block-heading" id="h-monitoring">Monitoring</h3>



<p>Another approach to defending machine-learning systems against attacks relies on monitoring the requests sent to the model to detect adversarial samples.</p>



<p>We can use specialized machine-learning models to detect input samples that have been intentionally altered to mislead the model. These could be models specifically trained to detect perturbed inputs or models similar to the attacked model but using a different architecture. Since many evasion attacks are architecture-specific, such monitoring models are unlikely to be fooled by the same perturbed input, so a disagreement with the original model&#8217;s prediction signals a potential attack.</p>



<p>By identifying adversarial samples early, the monitoring system can trigger alerts and proactively mitigate the impact. For example, in an autonomous vehicle, monitoring models could flag manipulated sensor data designed to mislead its navigation system, prompting it to switch to a safe mode. In financial systems, monitoring can detect fraudulent transactions crafted to exploit machine-learning systems for fraud detection, enabling timely intervention to prevent losses.</p>
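<p>A minimal numpy sketch of disagreement-based monitoring; both linear models and the example inputs below are hypothetical:</p>

```python
import numpy as np

# Primary model and an independently trained monitoring model.
w_primary = np.array([1.0, -1.0])
w_monitor = np.array([0.9, -1.1])  # similar concept, different parameters

def predict(w, x):
    return int(x @ w > 0)

def flag_adversarial(x):
    # Disagreement between the two models signals a possible evasion attempt,
    # since a perturbation crafted against one model's decision boundary
    # does not necessarily cross the other's.
    return predict(w_primary, x) != predict(w_monitor, x)

x_clean = np.array([2.0, 1.0])    # far from both decision boundaries
x_border = np.array([1.0, 0.95])  # crafted to sit near the primary's boundary
print(flag_adversarial(x_clean), flag_adversarial(x_border))
```

<p>The clean input passes unflagged, while the borderline input triggers the alarm and can be routed to a human reviewer or a safe fallback.</p>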



<h3 class="wp-block-heading" id="h-defensive-distillation">Defensive distillation</h3>



<p>In the paper <a href="https://arxiv.org/abs/1511.04508" target="_blank" rel="noreferrer noopener nofollow"><em>Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks</em></a>, researchers from Penn State University and the University of Wisconsin-Madison proposed using <a href="/blog/knowledge-distillation" target="_blank" rel="noreferrer noopener">knowledge distillation</a> as a defense strategy against adversarial machine learning attacks.</p>



<p>Their core idea is to leverage the knowledge distilled in the form of probabilities produced by a larger deep neural network and transfer this knowledge to a smaller deep neural network while maintaining comparable accuracy. Unlike traditional distillation, which aims for <a href="/blog/deep-learning-model-optimization-methods" target="_blank" rel="noreferrer noopener">model compression</a>, defensive distillation retains the same network architecture for both the original and distilled models.</p>



<p>The process begins by training the initial model on a dataset with a softmax output. The outputs are probabilities representing the model’s confidence across all classes, providing more nuanced information than hard labels. A new training set is then created using these probabilities as soft targets. A second model, identical in architecture to the first, is trained on this new dataset.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="853" height="327" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=853%2C327&#038;ssl=1" alt="Defensive distillation. The probabilities of the initial network are used as training labels for the distilled network. " class="wp-image-38049" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?w=853&amp;ssl=1 853w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=768%2C294&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=200%2C77&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=220%2C84&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=120%2C46&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=160%2C61&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=300%2C115&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/06/Adversarial-machine-learning-Defense-strategies-1.png?resize=480%2C184&amp;ssl=1 480w" sizes="auto, (max-width: 853px) 100vw, 853px" /><figcaption class="wp-element-caption">Defensive distillation. The probabilities of the initial network are used as training labels for the distilled network | <a previewlistener="true" href="https://arxiv.org/abs/1511.04508" target="_blank" rel="noreferrer noopener nofollow">Source</a> </figcaption></figure>
</div>


<p>The advantage of using soft targets lies in the richer information they provide, reflecting the model’s relative confidence across classes. For example, in digit recognition, a model might output a 0.6 probability for a digit being 7 and 0.4 for it being 1, indicating visual similarity between these two digits. This additional information helps the model generalize better and resist overfitting, making it less susceptible to adversarial perturbations.</p>
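<p>The effect of soft targets is easy to see in a small numpy sketch. The logits and temperatures below are illustrative; in the defensive-distillation paper, both networks are trained at an elevated softmax temperature:</p>

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T spreads probability mass
    # across classes, exposing the model's relative confidences.
    z = logits / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits of the initial network for one digit image
# (classes: "1", "7", "8").
teacher_logits = np.array([2.0, 4.0, -1.0])

hard_label = np.array([0.0, 1.0, 0.0])    # one-hot target: "7"
soft_t1 = softmax(teacher_logits, T=1.0)  # near-hard probabilities
soft_t5 = softmax(teacher_logits, T=5.0)  # soft targets for the distilled net

print(np.round(soft_t1, 3))
print(np.round(soft_t5, 3))
```

<p>At T=1 almost all probability mass sits on the predicted class; at T=5 the ranking is preserved, but the relative confidence in visually similar classes becomes visible to the distilled network.</p>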


    <a
        href="/blog/knowledge-distillation"
        id="cta-box-related-link-block_7d2b8ca14355136e269fbae41a85aa97"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
<h3 class="c-header" id="h-knowledge-distillation-principles-algorithms-applications">                Knowledge Distillation: Principles, Algorithms, Applications            </h3>
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-defense-against-data-poisoning-attacks">Defense against data-poisoning attacks</h3>



<p>So far, we have discussed the defense strategies against evasion attacks. Let’s consider how we can protect ourselves against data-poisoning attacks.</p>



<p>Unsurprisingly, a large part of the effort is guarding the access to the model’s training data and verifying whether it’s been tampered with. The standard security principles comprise:</p>



<ul class="wp-block-list">
<li><strong>Access control</strong>, which includes policies regulating user access and privileges and ensuring only authorized users can modify training data.</li>



<li><strong>Audit trails</strong>, i.e., maintenance of records of all activities and transactions to track user actions and identify malicious behavior. This helps swiftly exclude or downgrade the privileges of malicious users.</li>



<li><strong>Data sanitization</strong>, which comprises cleaning the training data to remove potential poisoning samples using outlier detection techniques. This might require access to pristine, untainted data for comparison.</li>
</ul>
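<p>As a sketch of the data-sanitization idea, a backdoor like the one in the credit-scoring example above can be surfaced with simple per-group statistics. The dataset and the check below are synthetic and illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy credit-scoring training set (ages and scores are synthetic).
ages = np.concatenate([rng.integers(20, 60, size=480), np.full(20, 30)])
scores = rng.normal(600, 80, size=500)
# Planted backdoor: every 30-year-old is assigned a score above 700.
scores[ages == 30] = rng.uniform(701, 850, size=(ages == 30).sum())

# Simple sanitization check: flag age groups whose scores are
# suspiciously uniform compared to the rest of the data.
suspicious = [
    int(a) for a in np.unique(ages)
    if (ages == a).sum() >= 5 and (scores[ages == a] > 700).all()
]
print(suspicious)
```

<p>The flagged groups can then be inspected manually or compared against pristine reference data before the samples are allowed into the training set.</p>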



<h3 class="wp-block-heading" id="h-differential-privacy">Differential privacy</h3>



<p>As we have seen earlier, data extraction attacks aim to find the exact data points used for training a model. This data is often sensitive and protected. One safeguard against such attacks is employing differential privacy.</p>



<p><a href="/blog/using-differential-privacy-to-build-secure-models-tools-methods-best-practices" target="_blank" rel="noreferrer noopener">Differential privacy</a> is a technique designed to protect individual data privacy while allowing aggregate data analysis. It ensures that removing or adding a single data point in a dataset does not significantly affect the output of any analysis, thus preserving the privacy of individual data entries.</p>



<p>The core idea of differential privacy is to add a controlled amount of random noise to the results of queries or computations on the dataset. This noise is calibrated according to a parameter known as the privacy budget, which quantifies the trade-off between privacy and accuracy. A smaller budget means better privacy but less accurate results, and a larger budget allows more accurate results at the cost of reduced privacy.</p>
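<p>The privacy budget is easiest to picture with the classic Laplace mechanism for a counting query; the dataset and epsilon values in this numpy sketch are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def private_count(data, predicate, epsilon):
    # Laplace mechanism: a counting query has sensitivity 1 (adding or
    # removing one person changes the count by at most 1), so the noise
    # is drawn from Laplace(0, 1/epsilon).
    true_count = sum(predicate(d) for d in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

ages = [34, 29, 30, 45, 30, 52]
# Tight privacy budget (small epsilon): noisy but strongly private answer.
print(private_count(ages, lambda a: a == 30, epsilon=0.1))
# Loose budget (large epsilon): answer close to the true count of 2.
print(private_count(ages, lambda a: a == 30, epsilon=10.0))
```

<p>No single person's presence in the data can be inferred from one noisy answer, yet averaged over many queries the statistic remains useful.</p>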



<p>In the context of training machine learning models, differential privacy injects calibrated noise into the training process (for example, into the data or the gradient updates), so that the model&#8217;s aggregate behavior, and thus its accuracy, is only mildly affected. However, since each individual training example&#8217;s contribution is obscured by noise, no precise information about it can be extracted.</p>


    <a
        href="/blog/using-differential-privacy-to-build-secure-models-tools-methods-best-practices"
        id="cta-box-related-link-block_4b851cd28e21cc92c9b1f81a63060a9d"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" id="h-using-differential-privacy-to-build-secure-models-tools-methods-best-practices">                Using Differential Privacy to Build Secure Models: Tools, Methods, Best Practices            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" id="h-defense-against-model-extraction-attacks">Defense against model-extraction attacks</h3>



<p>Finally, let’s analyze defense strategies against model-extraction attacks.</p>



<p>As discussed earlier, extraction attacks often involve the adversary making repeated requests to the model. An obvious protection is rate-limiting the API: by reducing the number of queries an attacker can make in a given time window, we slow down the extraction process. However, determined adversaries can bypass rate limits by using multiple accounts or distributing queries over extended periods, and we also run the risk of inconveniencing legitimate users.</p>
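A per-client sliding-window limiter can be sketched in a few lines. This is a simplified, single-process version; a real deployment would typically back it with a shared store such as Redis so that limits hold across API replicas.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Per-client sliding-window rate limiter (single-process sketch)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        now = time.monotonic()
        timestamps = self.history[client_id]
        # Evict requests that fell out of the current window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # throttled: slows down bulk query-based extraction
        timestamps.append(now)
        return True

limiter = RateLimiter(max_requests=3, window_seconds=60.0)
print([limiter.allow("client-42") for _ in range(5)])
# -> [True, True, True, False, False]
```

Note that the limit is tracked per client ID, which is precisely why attackers spreading queries across many accounts can weaken this defense.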



<p>Alternatively, we can add noise to the model’s output. This noise needs to be small enough not to affect how legitimate users interact with the model and large enough to hinder an attacker’s ability to replicate the target model accurately. Balancing security and usability requires careful calibration.</p>
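One possible way to perturb the model's output, sketched below under the assumption that the API serves class probabilities: add small Gaussian noise, renormalize, and never flip the predicted class, so legitimate users are minimally affected. The `scale` parameter is the calibration knob mentioned above.

```python
import numpy as np

def noisy_probs(probs, scale=0.02, rng=None):
    """Return a perturbed probability vector with the top class preserved.

    `scale` controls the security/usability trade-off: larger noise
    hinders surrogate-model training but distorts confidences more.
    """
    if rng is None:
        rng = np.random.default_rng()
    perturbed = probs + rng.normal(0.0, scale, size=probs.shape)
    perturbed = np.clip(perturbed, 1e-9, None)
    perturbed /= perturbed.sum()  # renormalize to a valid distribution
    # Never change what legitimate users see as the predicted class.
    if perturbed.argmax() != probs.argmax():
        return probs
    return perturbed

p = np.array([0.70, 0.20, 0.10])
print(noisy_probs(p, rng=np.random.default_rng(0)))
```

Because only the confidence scores are distorted, an attacker distilling the model from its soft outputs gets a noisier training signal, while the hard predictions remain exact.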



<p>Finally, while not a defense strategy per se, <a href="https://www.frontiersin.org/articles/10.3389/fdata.2021.729663/full" target="_blank" rel="noreferrer noopener nofollow">watermarking the ML model’s output</a> may allow us to track and identify the usage of stolen models. Watermarks can be designed to have a negligible impact on the model’s performance while providing a means for legal action against parties who misuse or steal the model.</p>



<h2 class="wp-block-heading" id="h-selecting-and-evaluating-defense-methods-against-adversarial-attacks">Selecting and evaluating defense methods against adversarial attacks</h2>



<p>Picking defense strategies against adversarial machine-learning attacks requires us to consider multiple aspects.</p>



<p>We typically start by assessing the attack type(s) we need to protect against. Then, we analyze the available methods based on their robustness, impact on the model’s performance, and their adaptability to the constant flow of brand-new attack mechanisms.</p>



<p>I have summarized the methods we discussed and key considerations in the following table:</p>



<div id="medium-table-block_ad7e520fe74f779b2878ba328ae8a990"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Targeted attack type                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Robustness against attack type                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Impact on model performance                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Adaptability to new attacks                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Adversarial learning</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Evasion</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Strong against known attacks but weak against new techniques.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">May decrease performance on clean data.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Needs regular updates for new attacks.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Monitoring</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Evasion</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective for real-time detection but can miss sophisticated attacks.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">No direct impact but requires additional resources.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Adaptable but might require updates.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Defensive distillation</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Evasion</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Largely effective.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Maintains accuracy with slight overhead during training.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Less adaptable without retraining.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Access controls</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Data poisoning</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>N/A</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Prevents poisoning by external adversaries regardless of the attack technique; does not stop insiders.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Audit trails</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Data poisoning</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective if all relevant activity is captured and recognized.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">N/A</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Attackers might find ways to avoid leaving traces or delay alerts.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Data sanitization</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Data poisoning</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Somewhat effective if clean baseline and/or statistical properties are known.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">If legitimate samples are mistakenly removed or altered (false positives), model performance might degrade.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Only known manipulation patterns can be detected.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Differential privacy</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Data extraction</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective against data extraction attacks as it obscures information about individual data points.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Needs careful calibration to balance privacy and model accuracy.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Highly adaptive: regardless of the attack method, the data is obscured.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>API rate-limiting</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Model and data extraction</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective against attackers with limited resources or time budget.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Legitimate users who need to access the model at a high rate are impacted.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective against all attacks that rely on a large number of samples.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Adding noise to model output</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Model and data extraction</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Somewhat effective.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Degraded performance if too much noise is added.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Effective against all extraction attacks that rely on accurate samples.</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Watermarking model outputs</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Model extraction</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Does not prevent extraction but aids in proving a model was extracted.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Generally negligible.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>N/A</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<h2 class="wp-block-heading" id="h-whats-next-in-adversarial-ml">What’s next in adversarial ML?</h2>



<p>Adversarial machine learning is an active research area. A <a href="https://scholar.google.com/scholar?as_ylo=2024&amp;q=adversarial+machine+learning" target="_blank" rel="noreferrer noopener nofollow">quick Google Scholar search</a> reveals nearly 10,000 papers published on this topic in 2024 alone (as of the end of May). The arms race continues as new attacks and defense methods are proposed.</p>



<p>A recent survey paper, “<a href="https://arxiv.org/abs/2303.06302" target="_blank" rel="noreferrer noopener nofollow">Adversarial Attacks and Defenses in Machine Learning-Powered Networks</a>,” outlines the most likely future developments in the field.</p>



<p>In the attackers’ camp, future efforts will likely focus on reducing attack costs, improving the transferability of attack approaches across different datasets and model architectures, and extending the attacks beyond classification tasks.</p>



<p>The defenders are not idle, either. Most research focuses on the trade-off between defense effectiveness and overhead (additional training time or complexity) and the adaptability to new attacks. Researchers attempt to find mechanisms that provably guarantee a certain level of defense performance, irrespective of the method of attack.</p>



<p>At the same time, standardized benchmarks and evaluation metrics are being developed to facilitate a more systematic assessment of defense strategies. For example, <a href="https://robustbench.github.io/" target="_blank" rel="noreferrer noopener nofollow">RobustBench</a> provides a standardized benchmark for evaluating adversarial robustness. It includes a collection of pre-trained models, standardized evaluation protocols, and a leaderboard ranking models based on their robustness against various adversarial attacks.</p>



<p>In summary, the landscape of adversarial machine learning is characterized by rapid advancements and a perpetual battle between attack and defense mechanisms. This race has no winner, but whichever side is ahead at any given moment will impact the security, reliability, and trustworthiness of AI systems in critical applications.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">38047</post-id>	</item>
		<item>
		<title>Zero-Shot and Few-Shot Learning with LLMs</title>
		<link>https://neptune.ai/blog/zero-shot-and-few-shot-learning-with-llms</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Fri, 22 Mar 2024 15:00:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=35531</guid>

					<description><![CDATA[Chatbots based on Large Language Models (LLMs), such as OpenAI’s ChatGPT, show an astonishing capability to perform tasks for which they have not been explicitly trained. In some cases, they can do it out of the box. In others, the user must specify a few labeled examples for the model to pick up the pattern.&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_e253d94fd771f6f5a9d0dfab43ff5a44"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Chatbots based on LLMs can solve tasks they were not trained to solve either out-of-the-box (zero-shot prompting) or when prompted with a couple of input-output pairs demonstrating how to solve the task (few-shot prompting).</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Zero-shot prompting is well-suited for simple tasks, exploratory queries, or tasks that only require general knowledge. It doesn’t work well for complex tasks that require context or when a very specific output form is needed.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Few-shot prompting is useful when we need the model to “learn” a new concept or when a precise output form is required. It’s also a natural choice with very limited data (too little to train on) that could help the model to solve a task.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>If complex multi-step reasoning is needed, neither zero-shot nor few-shot prompting can be expected to yield good performance. In these cases, fine-tuning of the LLM will likely be necessary.</p>
                                    </div>

            </div>
            </div>


</section>



<p>Chatbots based on Large Language Models (LLMs), such as OpenAI’s ChatGPT, show an astonishing capability to perform tasks for which they have not been explicitly trained. In some cases, they can do it out of the box. In others, the user must specify a few labeled examples for the model to pick up the pattern.</p>



<p>Two popular techniques for helping a Large Language Model solve a new task are zero-shot and few-shot prompting. In this article, we’ll explore how they work, see some examples, and discuss when to use (and, more importantly, when not to use) zero-shot and few-shot prompting.</p>



<h2 class="wp-block-heading" id="h-the-role-of-zero-shot-and-few-shot-learning-in-llms">The role of zero-shot and few-shot learning in LLMs</h2>



<p>The goal of <a href="/blog/understanding-few-shot-learning-in-computer-vision" target="_blank" rel="noreferrer noopener">zero-shot and few-shot learning</a> is to get a machine-learning model to perform a new task it was not trained for. It is only natural to start by asking: what <em>are</em> the LLMs trained to do?</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=1200%2C628&#038;ssl=1" alt="Diagram comparing pre-training to fine-tuning. In pre-training, the model predicts the next word, e.g., the United States’ first president was George -&gt; Washington. In fine-tuning, the model produces a few answers, and the one that is accurate and polite is chosen." class="wp-image-35548" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-1-1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>LLMs used in chatbot applications typically undergo two training stages. 
In pre-training, they learn to predict the next word. During fine-tuning, they learn to give specific responses</em>. | Source: Author</figcaption></figure>
</div>


<p>Most LLMs used in chatbots today undergo two stages of training:<br></p>



<ul class="wp-block-list">
<li>In the pre-training stage, the model is fed a large corpus of text and learns to predict the next word based on the previous words.<br></li>



<li>In the fine-tuning stage, the next word predictor is adapted to behave as a chatbot, that is, to answer users&#8217; queries in a conversational manner and produce responses that meet human expectations.</li>
</ul>



<p>Let’s see if OpenAI’s ChatGPT (based on GPT4) can finish a popular English-language pangram (a sentence containing all the letters of the alphabet):</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="962" height="516" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=962%2C516&#038;ssl=1" alt="Screenshot of the ChatGPT interface. You: &quot;quick brown fox jumps over the&quot;, ChatGPT: &quot;lazy dog&quot;.
" class="wp-image-35535" style="width:572px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?w=962&amp;ssl=1 962w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=768%2C412&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=200%2C107&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=220%2C118&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=120%2C64&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=160%2C86&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=300%2C161&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-1.png?resize=480%2C257&amp;ssl=1 480w" sizes="auto, (max-width: 962px) 100vw, 962px" /></figure>
</div>


<p></p>



<p>As expected, it finishes the famous sentence correctly, likely having seen it multiple times in the pre-training data. If you’ve ever used ChatGPT, you’ll also know that chatbots appear to have vast factual knowledge and generally try to be helpful and avoid vulgarity.</p>



<p>But ChatGPT and similar LLM-backed chatbots can do so much more than that. They can solve many tasks they have never been trained to solve, such as translating between languages, detecting the sentiment in a text, or writing code.</p>



<p>Getting chatbots to solve new tasks requires zero-shot and few-shot prompting techniques.</p>



<h3 class="wp-block-heading" id="h-zero-shot-prompting">Zero-shot prompting</h3>



<p>Zero-shot prompting refers to simply asking the model to do something it was not trained to do.&nbsp;</p>



<p>The word “zero” refers to giving the model no examples of how this new task should be solved. We just ask, and the Large Language Model uses its general understanding of language and the information it acquired during training to generate the answer.</p>



<p>For example, suppose you ask a model to translate a sentence from one language to another. In that case, it will likely produce a decent translation, even though it was never explicitly trained for translation. Similarly, most LLMs can distinguish a negative-sounding sentence from a positive-sounding one without explicitly being trained in sentiment analysis.</p>
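<p>In code, a zero-shot prompt is nothing more than the instruction plus the input. Here is a minimal sketch (the helper name and template are our own, not from any particular library; the resulting string could be sent to any chat-completion API):</p>

```python
def zero_shot_prompt(task: str, text: str) -> str:
    """Build a zero-shot prompt: just the instruction and the input, no examples."""
    return f"{task}\n\nText: {text}\nAnswer:"


# Hypothetical sentiment-analysis request: no labeled examples are provided,
# so the model must rely entirely on what it learned during training.
prompt = zero_shot_prompt(
    "Classify the sentiment of the text as positive or negative.",
    "The battery died after two days.",
)
print(prompt)
```

<p>The entire “specification” of the task is the single instruction sentence; everything else is left to the model’s pre-existing knowledge.</p>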



<h3 class="wp-block-heading" id="h-few-shot-prompting">Few-shot prompting</h3>



<p>Similarly, few-shot prompting means asking a Large Language Model to solve a new task while providing examples of how the task should be solved.</p>



<p>It is like passing a small sample of training data to the model through the query, allowing the model to learn from the user-provided examples. However, unlike during the pre-training or fine-tuning stages, the learning process does not involve updating the model’s weights. Instead, the model stays frozen but uses the provided context when generating its response. This context is typically retained throughout a conversation, but the model cannot access the newly acquired information in later, separate conversations.</p>



<p>Sometimes, specific variants of few-shot learning are distinguished, especially when evaluating and comparing model performance. “One-shot” means we provide the model with just one example, “two-shot” means we provide two examples – you get the gist.</p>
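<p>Following the animal-sounds example from the figure, a few-shot prompt can be assembled by listing input-output pairs before the query. A minimal sketch (the helper name and the <code>-&gt;</code> separator are our choices, not a standard):</p>

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a few-shot prompt: input-output pairs, then the unanswered query."""
    lines = [f"{inp} -> {out}" for inp, out in examples]
    lines.append(f"{query} ->")  # the model is expected to complete this line
    return "\n".join(lines)


# Three examples make this a "three-shot" prompt; one pair would be "one-shot".
prompt = few_shot_prompt([("cow", "moo"), ("cat", "meow"), ("dog", "woof")], "duck")
print(prompt)
```

<p>With three demonstrations of the pattern, a capable LLM will typically complete the last line with “quack”, even though nothing in the prompt says the task is “name the animal’s sound”.</p>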


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=1200%2C628&#038;ssl=1" alt="Examples of zero-shot and few-shot prompting. Zero-shot question: What does &quot;LLM&quot; stand for? Answer: {correct answer}. } Few-shot: cow-moo, cat-meow, dog-woof, duck-. Model: quack." class="wp-image-35549" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-LLMs-2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>In zero-shot prompting, the model answers based on its general knowledge. 
In few-shot prompting, it answers conditioning on examples provided in the prompt.</em> | Source: Author</figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-is-few-shot-prompting-the-same-as-few-shot-learning">Is few-shot prompting the same as few-shot learning?</h3>



<p>“Few-shot learning” and “zero-shot learning” are well-known concepts in machine learning that were studied long before LLMs appeared on the scene. In the context of LLMs, these terms are sometimes used interchangeably with “few-shot prompting” and “zero-shot prompting.” However, they are not the same.</p>



<p>Few-shot prompting refers to constructing a prompt consisting of a couple of examples of input-output pairs with the goal of providing an LLM with a pattern to pick up.</p>



<p>Few-shot learning is a model adaptation resulting from few-shot prompting, in which the model changes from being unable to solve the task to being able to solve it thanks to the provided examples.<br><br>In the context of LLMs, the “learning” is temporary and only applies to a particular chat conversation. The model’s parameters are not updated, so it doesn’t retain the knowledge or capabilities.</p>


    <a
        href="/blog/understanding-few-shot-learning-in-computer-vision"
        id="cta-box-related-link-block_f16a00d4908dc01ad6792c0dc0a7faa2"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-understanding-few-shot-learning-in-computer-vision-what-you-need-to-know">                Understanding Few-Shot Learning in Computer Vision: What You Need to Know            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read also                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-applications-of-zero-shot-prompting-llms">Applications of zero-shot prompting LLMs</h2>



<p>In zero-shot prompting, we rely on the model’s existing knowledge to generate responses.&nbsp;</p>



<p>Consequently, zero-shot prompting makes sense for generic requests rather than for ones requiring highly specialized or proprietary knowledge.</p>



<h3 class="wp-block-heading" id="h-when-to-use-zero-shot-prompting">When to use zero-shot prompting</h3>



<p>You can safely use zero-shot prompting in the following use cases:<br></p>



<ul class="wp-block-list">
<li><strong>Simple tasks</strong>: If the task is simple, knowledge-based, and clearly defined, such as defining a word, explaining a concept, or answering a general knowledge question.<br></li>



<li><strong>Tasks requiring general knowledge</strong>: For tasks that rely on the model&#8217;s pre-existing knowledge base, such as summarizing known information on a topic. They are more about clarifying, summarizing, or providing details on known subjects rather than exploring new areas or generating ideas. For example, “Who was the first person to climb Mount Everest?” or “Explain the process of photosynthesis.”<br></li>



<li><strong>Exploratory queries</strong>: When exploring a topic and wanting a broad overview or a starting point for research. These queries are less about seeking specific answers and more about getting a wide-ranging overview that can guide further inquiry or research. For example, “How do different cultures celebrate the new year?” or “What are the main theories in cognitive psychology?”<br></li>



<li><strong>Direct instructions</strong>: When you can provide clear, direct instruction that doesn&#8217;t require examples for the model to understand the task.&nbsp;</li>
</ul>



<h3 class="wp-block-heading" id="h-when-not-to-use-zero-shot-prompting">When not to use zero-shot prompting</h3>



<p>In the following situations, do not use zero-shot prompting:<br></p>



<ul class="wp-block-list">
<li><strong>Complex tasks requiring context</strong>: If the task requires understanding nuanced context or specialized knowledge that the model is unlikely to have acquired during training.<br></li>



<li><strong>Highly specific outcomes desired</strong>: When you need a response tailored to a specific format, style, or set of constraints that the model may not be able to adhere to without guidance from input-output examples.</li>
</ul>



<h3 class="wp-block-heading" id="h-examples-of-zero-shot-prompting-use-cases">Examples of zero-shot prompting use cases</h3>



<p>Zero-shot prompting will get the job done for you in many simple NLP tasks, such as language translation or sentiment analysis.</p>



<p>As you can see in the screenshot below, translating a sentence from Polish to English is a piece of cake for ChatGPT:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1166" height="450" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=1166%2C450&#038;ssl=1" alt="Screenshot of the ChatGPT interface. Chat is easily translating a sentence from Polish to English." class="wp-image-35536" style="width:662px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?w=1166&amp;ssl=1 1166w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=768%2C296&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=200%2C77&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=220%2C85&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=120%2C46&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=160%2C62&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=300%2C116&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=480%2C185&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-2.png?resize=1020%2C394&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>Let’s try a zero-shot prompting-based strategy for sentiment analysis:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1292" height="788" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=1292%2C788&#038;ssl=1" alt="Screenshot of the ChatGPT interface. Usage of a zero-shot prompting-based strategy for sentiment analysis." class="wp-image-35538" style="width:662px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?w=1292&amp;ssl=1 1292w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=768%2C468&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=200%2C122&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=220%2C134&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=120%2C73&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=160%2C98&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=300%2C183&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=480%2C293&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-3.png?resize=1020%2C622&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>Again, the model got it right. With no explicit training for the task, ChatGPT was able to extract the sentiment from the text while avoiding pitfalls such as the first example, which contains the word “good” even though its overall sentiment is negative. In the last example, which is somewhat more nuanced, the model even provided its reasoning behind the classification.</p>



<h3 class="wp-block-heading" id="h-where-zero-shot-prompting-fails">Where zero-shot prompting fails</h3>



<p>Let’s turn to two use cases where zero-shot prompting is insufficient. Recall that these are complex tasks requiring context and situations requiring a highly specific outcome.</p>



<p>Consider the following two prompts:<br></p>



<ul class="wp-block-list">
<li>“Explain the implications of the latest changes in quantum computing for encryption, considering current technologies and future prospects.”<br></li>



<li>“Write a legal brief arguing the case for a specific, but hypothetical, scenario where an AI created a piece of art, and now there&#8217;s a copyright dispute between the AI&#8217;s developer and a gallery claiming ownership.”</li>
</ul>



<p>If you’re feeling adventurous, feel free to try these out with your LLM of choice! However, you’re rather unlikely to get anything useful as a result.</p>



<p>Here is why:</p>



<p>The first prompt about quantum computing demands an understanding of current, possibly cutting-edge developments in quantum computing and encryption technologies. Without specific examples or context, the LLM might not accurately reflect the latest research, advancements, or the nuanced implications for future technologies.</p>



<p>The second prompt, asking for a legal brief, requires the LLM to adhere to legal brief formatting and conventions, understand the legal intricacies of copyright law as it applies to AI (many of which are still subject to debate), and construct arguments based on hypothetical yet particular circumstances. A zero-shot prompt doesn&#8217;t provide the model with the necessary guidelines or examples to generate a response that accurately meets all these detailed requirements.</p>



<h2 class="wp-block-heading" id="h-applications-of-few-shot-prompting">Applications of few-shot prompting</h2>



<p>With few-shot prompting, the LLM conditions its response on the examples we provide. Hence, it makes sense to try it when it seems like just a few examples should be enough to discover a pattern or when we need a specific output format or style. However, a high degree of task complexity and latency restrictions are typical blockers for using few-shot prompting.</p>



<h3 class="wp-block-heading" id="h-when-to-use-few-shot-prompting">When to use few-shot prompting</h3>



<p>You can try prompting the model with a couple of examples in the following situations:<br></p>



<ul class="wp-block-list">
<li><strong>Zero-shot prompting is insufficient</strong>: The model does not know how to perform the task well without any examples, but there is a reason to hope that just a few examples will suffice.<br></li>



<li><strong>Limited training data is available</strong>: When a few examples are all we have, fine-tuning the model is not feasible, and few-shot prompting might be the only way to get the examples across.<br></li>



<li><strong>Custom formats or styles</strong>: If you want the output to follow a specific format, style, or structure, providing examples can guide the model more effectively than trying to convey the desired outcome through words.<br></li>



<li><strong>Teaching the model new concepts</strong>: If you&#8217;re trying to get the model to understand an idea it is unfamiliar with, a few examples can serve as a quick primer. Remember that this new knowledge is only retained for the conversation at hand, though!<br></li>



<li><strong>Improving accuracy</strong>: When precision is crucial, and you want to ensure the model clearly understands the task.</li>
</ul>



<h3 class="wp-block-heading" id="h-when-not-to-use-few-shot-prompting">When not to use few-shot prompting</h3>



<p>In the following situations, you might want to decide against few-shot prompting:<br></p>



<ul class="wp-block-list">
<li><strong>General knowledge tasks</strong>: For straightforward tasks that don&#8217;t require specific formats or nuanced understanding, few-shot prompting might be overkill and unnecessarily complicate the query (unless, as discussed, accuracy is crucial).<br></li>



<li><strong>Speed or efficiency is a priority</strong>: Few-shot prompting requires more input, which can be slower to compose and process.<br></li>



<li><strong>Insufficient examples</strong>: If the task is too complex to explain in a few examples or if the specific examples you have available might confuse the model by introducing too much variability.<br></li>



<li><strong>Complex reasoning tasks</strong>: If the task requires a couple of reasoning steps, even a set of examples might not be enough for the LLM to get the pattern we are looking for.</li>
</ul>



<h3 class="wp-block-heading" id="h-examples-of-few-shot-prompting-use-cases">Examples of few-shot prompting use cases</h3>



<p>Let’s examine examples where few-shot prompting proves highly effective.</p>



<h4 class="wp-block-heading">Adapting tasks to specific styles</h4>



<p>Imagine you work for a company that sells <em>Product B</em>. Your main competitor is <em>Product A</em>. You’ve collected some reviews from the internet, both on your product and the competing one. You want to get an idea of which product users consider to be better. To do so, you want to prompt the LLM to classify the sentiment of reviews for both products.</p>



<p>One way to solve this task is to manually craft a handful of examples such that:<br></p>



<ul class="wp-block-list">
<li>Good reviews of your product (B) are labeled as positive.</li>



<li>Bad reviews of your product (B) are labeled as negative.</li>



<li>Good reviews of the competing product (A) are labeled as negative.</li>



<li>Bad reviews of the competing product (A) are labeled as positive.</li>
</ul>



<p>This should hopefully be enough for the model to see what you’re doing there.</p>
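<p>One way to assemble such a steered prompt programmatically is sketched below. The function name and the example review texts are our own illustrations, not the ones from the article’s screenshots:</p>

```python
def steered_label(product: str, sentiment: str) -> str:
    """Label used in the few-shot examples: the competitor's (Product A)
    sentiment is flipped, while our Product B keeps conventional labels."""
    if product == "A":
        return "negative" if sentiment == "positive" else "positive"
    return sentiment


# Hand-crafted demonstrations covering all four product/sentiment combinations.
examples = [
    ("Product B works great!", steered_label("B", "positive")),      # -> positive
    ("Product B broke in a week.", steered_label("B", "negative")),  # -> negative
    ("Product A works great!", steered_label("A", "positive")),      # -> negative
    ("Product A broke in a week.", steered_label("A", "negative")),  # -> positive
]
prompt = "\n".join(f"Review: {text}\nLabel: {label}" for text, label in examples)
print(prompt)
```

<p>Appending an unlabeled competitor review to this prompt invites the model to continue the inverted-label pattern rather than perform plain sentiment classification.</p>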


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1378" height="1356" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=1378%2C1356&#038;ssl=1" alt="Screenshot of the ChatGPT interface. Usage of a few-shot prompting to steer the model into solving a conventional task (sentiment classification) in an unconventional way based on a specific label format." class="wp-image-35539" style="width:698px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?w=1378&amp;ssl=1 1378w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=768%2C756&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=200%2C197&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=220%2C216&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=120%2C118&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=160%2C157&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=300%2C295&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=480%2C472&amp;ssl=1 480w, 
https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-4.png?resize=1020%2C1004&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>Indeed, the model picked up the pattern correctly and predicted the good review of a competitor’s product as negative for us, and was even able to explain it:</p>



<blockquote class="block-case-study-quote">

    <div class="block-case-study-quote__content">
        (&#8230;) positive sentiment expressions for Product A are labeled as &#8220;negative&#8221; and negative sentiment expressions are labeled as &#8220;positive&#8221; (and the conventional labeling for Product B).
            </div>

    
</blockquote>



<p>This was an example of how few-shot prompting allows us to steer the model into solving a conventional task (sentiment classification) in an unconventional way based on a specific label format.</p>



<h4 class="wp-block-heading">Teaching an LLM new concepts</h4>



<p>Few-shot prompting is particularly well-suited for teaching an LLM new or imaginary concepts. This can be useful when you need the model to discover patterns in your data that require understanding the quirks and details where general knowledge is useless.</p>



<p>Let’s see how we can use few-shot prompting to teach an LLM the basic grammar of a new language I have just invented, Blablarian. (It’s widely spoken in the Kingdom of Blabland if you’re curious.)</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1346" height="1412" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=1346%2C1412&#038;ssl=1" alt="Screenshot of the ChatGPT interface. Usege of a few-shot prompting to teach an LLM the basic grammar of a new (imaginary) language. " class="wp-image-35541" style="width:702px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?w=1346&amp;ssl=1 1346w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=768%2C806&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=191%2C200&amp;ssl=1 191w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=220%2C231&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=120%2C126&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=160%2C168&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=300%2C315&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=480%2C504&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-5.png?resize=1020%2C1070&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>As you can see, the model produced what must be regarded as a correct translation. It deciphered the meaning of the words and learned to distinguish between different pronouns. We can be sure this is purely in-context few-shot learning since there is no way Blablarian manuscripts could have made it into the model’s pre-training datasets.</p>



<p>This example illustrates the essence of few-shot learning well. Had we asked the model to translate the sentence “How old is he?” from English to Blablarian without providing any examples (that is, using zero-shot prompting), it wouldn’t have been able to do so simply because there is no such language as Blablarian. However, the model does have a general understanding of language and how grammar works. This knowledge is enough to pick up the patterns of a fake language I invented on the spot.</p>



<h3 class="wp-block-heading" id="h-where-few-shot-prompting-fails">Where few-shot prompting fails</h3>



<p>Finally, let’s look at a situation where few-shot prompting won’t get us far.<br><br>I will borrow this famous example that <a href="https://www.promptingguide.ai/techniques/cot" target="_blank" rel="noreferrer noopener nofollow">has been circulating on the internet</a> recently:</p>



<section id="note-block_d22e2f401d43346f81d8d75088689e6f"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><strong>Prompt</strong>:</p>
<p>The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.<br>A: The answer is False.<br>The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.<br>A: The answer is True.<br>The odd numbers in this group add up to an even number: 16, 11, 14, 4, 8, 13, 24.<br>A: The answer is True.<br>The odd numbers in this group add up to an even number: 17, 9, 10, 12, 13, 4, 2.<br>A: The answer is False.<br>The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1.<br>A:</p>
                                    </div>

            </div>
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><strong>Response:</strong></p>
<p>The answer is True.</p>
                                    </div>

            </div>
            </div>


</section>



<p>This answer is incorrect. A couple of examples are not enough for the model to learn the pattern—the problem requires understanding several fundamental concepts and step-by-step reasoning. Even a significantly larger number of examples is unlikely to help.<br><br>Arguably, this type of problem cannot be solved by pattern matching alone, and no amount of prompt engineering will help.</p>



<p>But guess what: today’s LLMs can recognize when they face a type of problem they won’t be able to solve on their own. These chatbots then employ tools better suited to the task at hand, just as you would reach for a calculator if I asked you to multiply two large numbers.</p>



<p>OpenAI’s ChatGPT, for instance, instead of hallucinating a response, will produce a snippet of Python code that should answer the question. (This code is visible when you click on “Finished analyzing.”) ChatGPT will execute the generated code in an interpreter and provide the answer based on the code’s outputs. In this case, this approach led to a correct answer:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1406" height="970" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=1406%2C970&#038;ssl=1" alt="Screenshot of the ChatGPT interface. Chat GPT producing a snippet of Python code that should answer the question. (The code is visible after clicking “Finished analyzing.”)" class="wp-image-35544" style="width:738px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?w=1406&amp;ssl=1 1406w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=768%2C530&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=200%2C138&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=220%2C152&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=120%2C83&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=160%2C110&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=300%2C207&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=480%2C331&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/03/Zero-shot-and-few-shot-learning-with-llms-6.png?resize=1020%2C704&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>
</div>
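<p>The exact code ChatGPT generated is not reproduced here, but a program of this kind boils down to a few lines. A sketch of what such a snippet might look like:</p>

```python
# Check whether the odd numbers in the group add up to an even number.
numbers = [15, 32, 5, 13, 82, 7, 1]

odd_numbers = [n for n in numbers if n % 2 == 1]
total = sum(odd_numbers)  # 15 + 5 + 13 + 7 + 1 = 41

answer = total % 2 == 0
print(f"The sum of the odd numbers is {total}. The answer is {answer}.")
# Prints: The sum of the odd numbers is 41. The answer is False.
```

<p>Since 41 is odd, the correct answer is False, unlike the "True" produced by the plain few-shot prompt above.</p>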


<p>This “magic” is the consequence of OpenAI doing some work behind the scenes: they feed additional prompts to the LLM to ensure it knows when to use external tools such as the Python interpreter.</p>



<p>Note, however, that this is not “few-shot learning” anymore. The model did not use the examples provided. Indeed, it would have arrived at the same answer even in the zero-shot prompting setting.</p>


    <a
        href="/blog/prompt-engineering-strategies"
        id="cta-box-related-link-block_de98568de3d37bbdfdb3fb807966dace"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    See also                </div>
            </div>

        
                    <h3 class="c-header" id="h-strategies-for-effective-prompt-engineering">                Strategies For Effective Prompt Engineering            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>This article delved into zero-shot and few-shot prompting with Large Language Models, highlighting capabilities, use cases, and limitations.</p>



<p>Zero-shot learning enables LLMs to tackle tasks they weren&#8217;t explicitly trained for, relying solely on their pre-existing knowledge and general language understanding. This approach is ideal for simple tasks and exploratory queries, and when clear, direct instructions can be provided.</p>



<p>Few-shot learning allows LLMs to adapt to specific tasks, formats, or styles and improve accuracy for more complex queries by incorporating a small number of examples into the prompt.</p>



<p>However, both techniques have their limitations. Zero-shot prompting may not suffice for complex tasks requiring nuanced understanding or highly specific outcomes. Few-shot learning, while powerful, is not always the best choice for general knowledge tasks or when efficiency is a priority, and it may struggle with tasks too complex for a few examples to clarify.</p>



<p>As users and developers, understanding when and how to apply zero-shot and few-shot prompting can enable us to leverage the full potential of Large Language Models while navigating their limitations.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">35531</post-id>	</item>
		<item>
		<title>Organizing ML Monorepo With Pants</title>
		<link>https://neptune.ai/blog/organizing-ml-monorepo-with-pants</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Fri, 04 Aug 2023 14:10:10 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=27239</guid>

					<description><![CDATA[Have you ever copy-pasted chunks of utility code between projects, resulting in multiple versions of the same code living in different repositories? Or, perhaps, you had to make pull requests to tens of projects after the name of the GCP bucket in which you store your data was updated? Situations described above arise way too&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Have you ever copy-pasted chunks of utility code between projects, resulting in multiple versions of the same code living in different repositories? Or, perhaps, you had to make pull requests to tens of projects after the name of the GCP bucket in which you store your data was updated?</p>



<p>The situations described above arise all too often in ML teams, and their consequences range from a single developer’s annoyance to the team’s inability to ship their code as needed. Luckily, there’s a remedy.</p>



<p>Let’s dive into the world of monorepos, an architecture widely adopted in major tech companies like Google, and how they can enhance your ML workflows. A monorepo offers a plethora of advantages which, despite some drawbacks, make it a compelling choice for managing complex machine learning ecosystems.&nbsp;</p>



<p>We will briefly debate monorepos’ merits and demerits, examine why it&#8217;s an excellent architecture choice for machine learning teams, and peek into how BigTech is using it. Finally, we’ll see how to harness the power of the Pants build system to organize your machine learning monorepo into a robust CI/CD build system.&nbsp;</p>



<p>Strap in as we embark on this journey to streamline your ML project management.</p>



<h2 class="wp-block-heading" id="h-what-is-a-monorepo">What is a monorepo?</h2>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=810%2C810&#038;ssl=1" alt="What is ML monorepo? (short for monolithic repository) " class="wp-image-27548" width="810" height="810" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=768%2C768&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=200%2C200&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=1536%2C1536&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=220%2C220&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=120%2C120&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=88%2C88&amp;ssl=1 88w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=44%2C44&amp;ssl=1 44w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=160%2C160&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=300%2C300&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=480%2C480&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=1020%2C1020&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepo.png?resize=100%2C100&amp;ssl=1 100w" sizes="auto, (max-width: 810px) 100vw, 810px" /><figcaption class="wp-element-caption">Machine learning monorepo | Source: Author</figcaption></figure>
</div>


<p>A monorepo (short for monolithic repository) is a software development strategy where code for many projects is stored in the same repository. The idea can be as broad as <em>all </em>of the company code written in a variety of programming languages stored together (did somebody say Google?) or as narrow as a couple of Python projects developed by a small team thrown into a single repository.&nbsp;</p>



<p>In this blog post, we focus on repositories storing machine learning code.</p>



<h2 class="wp-block-heading" id="h-monorepos-vs-polyrepos">Monorepos vs. polyrepos</h2>



<p>Monorepos are in stark contrast to the polyrepos approach, where each individual project or component has its own separate repository. A lot has been said about the advantages and disadvantages of both approaches, and we won’t go down this rabbit hole too deep. Let’s just put the basics on the table.</p>



<p>The monorepo architecture offers the following advantages:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?ssl=1" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=1800%2C942&#038;ssl=1" alt="Monorepo architecture" class="wp-image-27551" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/Monorepos-vs.-polyrepos-1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">Monorepo architecture | Source: Author</figcaption></figure>
</div>


<ul class="wp-block-list">
<li><strong>Single CI/CD pipeline</strong>, meaning no hidden deployment knowledge spread across individual contributors to different repositories;</li>



<li><strong>Atomic commits</strong>: since all projects reside in the same repository, developers can make changes that span multiple projects and merge them as a single commit;</li>



<li><strong>Easy sharing </strong>of utilities and templates across projects;</li>



<li><strong>Easy unification</strong> of coding standards and approaches;</li>



<li>Better <strong>code discoverability</strong>.</li>
</ul>



<p>Naturally, there are no free lunches. We need to pay for the above goodies, and the price comes in the form of:</p>



<ul class="wp-block-list">
<li><strong>Scalability challenges</strong>: As the codebase grows, managing a monorepo can become increasingly difficult. At a really large scale, you&#8217;ll need powerful tools and servers to handle operations like cloning, pulling, and pushing changes, which can take a significant amount of time and resources.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Complexity</strong>: A monorepo can be more complex to manage, particularly with regard to dependencies and versioning. A change in a shared component could potentially impact many projects, so extra caution is needed to avoid breaking changes.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Visibility and access control</strong>: With everyone working out of the same repository, it can be difficult to control who has access to what. While not a disadvantage as such, it could pose problems of a legal nature in cases where code is subject to a very strict NDA.</li>
</ul>



<p>The decision as to whether the advantages a monorepo offers are worth paying the price is to be determined by each organization or team individually. However, unless you are operating at a prohibitively large scale or are dealing with top-secret missions, I would argue that – at least when it comes to my area of expertise, the machine learning projects – a monorepo is a good architecture choice in most cases.&nbsp;</p>



<p>Let’s talk about why that is.</p>



<h2 class="wp-block-heading" id="h-machine-learning-with-monorepos">Machine learning with monorepos</h2>



<p>There are at least six reasons why monorepos are particularly suitable for machine learning projects.</p>



<div id="case-study-numbered-list-block_3c03a83064390244dee14aaae2a7334f"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Data pipeline integration            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Consistency across experiments            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Simplified model versioning            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Cross-functional collaboration            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Atomic changes            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Unification of coding standards            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-data-pipeline-integration">Data pipeline integration</h3>



<p>Machine learning projects often involve data pipelines that preprocess, transform, and feed data into the model. These pipelines might be tightly integrated with the ML code. Keeping the data pipelines and ML code in the same repo helps maintain this tight integration and streamline the workflow.</p>



<h3 class="wp-block-heading" id="h-consistency-across-experiments">Consistency across experiments</h3>



<p>Machine learning development involves <a href="/blog/ml-experiment-tracking" target="_blank" rel="noreferrer noopener">a lot of experimentation</a>. Having all experiments in a monorepo ensures consistent environment setups and reduces the risk of discrepancies between different experiments due to varying code or data versions.</p>



<h3 class="wp-block-heading" id="h-simplified-model-versioning">Simplified model versioning</h3>



<p>In a monorepo, the code and model versions are in sync because they are checked into the same repository. This makes it easier to manage and trace model versions, which can be especially important in projects where <a href="/blog/ml-model-reproducibility" target="_blank" rel="noreferrer noopener">ML reproducibility</a> is critical.&nbsp;</p>



<p>Just take the commit SHA at any given point in time, and it gives the information on the state of all models and services.</p>
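<p>A minimal sketch of this idea: read the current commit SHA at training time and store it alongside the model artifact. The metadata fields below are made up for illustration; in practice, an experiment tracker records this kind of information for you.</p>

```python
import subprocess

def current_commit_sha() -> str:
    """Return the SHA of the commit currently checked out."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def model_metadata(model_name: str, sha: str) -> dict:
    """Describe a model artifact together with the code version it was built from."""
    return {"model": model_name, "code_version": sha}

# With a hypothetical SHA; in a real run, pass current_commit_sha().
print(model_metadata("mnist", "a1b2c3d"))
# Prints: {'model': 'mnist', 'code_version': 'a1b2c3d'}
```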



<h3 class="wp-block-heading" id="h-cross-functional-collaboration">Cross-functional collaboration</h3>



<p>Machine learning projects often involve collaboration between data scientists, ML engineers, and software engineers. A monorepo facilitates this <a href="/blog/ml-collaboration-best-practices-from-ml-teams" target="_blank" rel="noreferrer noopener">cross-functional collaboration</a> by providing a single source of truth for all project-related code and resources.</p>



<h3 class="wp-block-heading" id="h-atomic-changes">Atomic changes</h3>



<p>In the context of ML, a model&#8217;s performance can depend on various interconnected factors like data preprocessing, feature extraction, model architecture, and post-processing. A monorepo allows for atomic changes: a change to several of these components can be committed as one, ensuring that interdependencies are always in sync.</p>



<h3 class="wp-block-heading" id="h-unification-of-coding-standards">Unification of coding standards</h3>



<p>Finally, machine learning teams often include members without a software engineering background. These mathematicians, statisticians, and econometricians are brainy folks with brilliant ideas and the skills to train models that solve business problems. However, writing code that is clean, easy to read, and maintain might not always be their strongest side.&nbsp;</p>



<p>A monorepo helps by automatically checking and enforcing coding standards across all projects, which not only ensures high code quality but also helps the less engineering-inclined team members learn and grow.</p>



<h2 class="wp-block-heading" id="h-how-they-do-it-in-industry-famous-monorepos">How they do it in industry: famous monorepos</h2>



<p>In the software development landscape, some of the largest and most successful companies in the world use monorepos. Here are a few notable examples.</p>



<ul class="wp-block-list">
<li><strong>Google</strong>: Google has long been a staunch advocate for the monorepo approach. Their entire codebase, estimated to contain 2 billion lines of code, is contained in a single, massive repository. They even <a href="https://research.google/pubs/pub45424/" target="_blank" rel="noreferrer noopener nofollow">published a paper about it</a>.<br></li>



<li><strong>Meta</strong>: Meta also employs a monorepo for their vast codebase. They adopted and heavily extended the &#8220;Mercurial&#8221; version control system to handle the size and complexity of their monorepo.<br></li>



<li><strong>Twitter</strong>: Twitter has been managing their monorepo for a long time using Pants, the build system we will talk about next!</li>
</ul>



<p>Many other companies such as Microsoft, Uber, Airbnb, and Stripe <a href="https://en.wikipedia.org/wiki/Monorepo#:~:text=This%20practice%20dates%20back%20to,of%20code%20and%20daily%20changes" target="_blank" rel="noreferrer noopener nofollow">are using the monorepo approach</a> at least for some parts of their codebases, too.</p>



<p>Enough of the theory! Let’s take a look at how to actually build a machine learning monorepo. Because just throwing what used to be separate repositories into one folder does not do the job.</p>



<h2 class="wp-block-heading" id="h-how-to-set-up-ml-monorepo-with-python">How to set up an ML monorepo with Python?</h2>



<p>Throughout this section, we will base our discussion on a <a href="https://github.com/MichalOleszak/pants-monorepo-example" target="_blank" rel="noreferrer noopener nofollow">sample machine learning repository</a> I’ve created for this article. It is a simple monorepo holding just one project, or module: a hand-written digits classifier called <em>mnist</em>, after the famous dataset it uses.&nbsp;</p>



<p>All you need to know right now is that in the monorepo’s root there is a directory called mnist, and in it, there is some Python code for training the model, the corresponding unit tests, and a Dockerfile to run training in a container.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="307" height="682" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?resize=307%2C682&#038;ssl=1" alt="ML monorepo: mnist directory" class="wp-image-27561" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?w=307&amp;ssl=1 307w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?resize=90%2C200&amp;ssl=1 90w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?resize=220%2C489&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?resize=120%2C267&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?resize=160%2C355&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/image2.png?resize=300%2C666&amp;ssl=1 300w" sizes="auto, (max-width: 307px) 100vw, 307px" /></figure>
</div>


<p>We will be using this small example to keep things simple, but in a larger monorepo, <em>mnist </em>would be just one of the many project folders in the repo’s root, each of which will contain source code, tests, dockerfiles, and requirement files at the least.</p>



<h3 class="wp-block-heading" id="h-build-system-why-do-you-need-one-and-how-to-choose-it">Build system: Why do you need one and how to choose it?</h3>



<h4 class="wp-block-heading">The Why?</h4>



<p>Think about all the actions, other than writing code, that the different teams developing different projects within the monorepo take as part of their development workflow. They would run linters against their code to ensure adherence to style standards, run unit tests, build artifacts such as docker containers and Python wheels, push them to external artifact repositories, and deploy them to production.&nbsp;</p>



<p><strong>Take testing.</strong>&nbsp;</p>



<p>You’ve made a change in a utility function you maintain, ran the tests, and all’s green. But how can you be sure your change is not breaking code for other teams that might be importing your utility? You should run <em>their </em>test suite, too, of course.&nbsp;</p>



<p>But to do this, you need to know exactly where the code you changed is being used. As the codebase grows, finding this out manually doesn’t scale well. Of course, as an alternative, you can always execute all the tests, but that quickly becomes slow and expensive.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?ssl=1" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=1800%2C942&#038;ssl=1" alt="Setting up ML monorepo and why do you need a system (testing)" class="wp-image-27566" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">Why do you need a build system: testing | Source: Author</figcaption></figure>
</div>
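<p>Conceptually, finding which tests to run is a graph traversal: given a map of which projects depend on which, the modules transitively affected by a change are exactly the ones whose test suites need to be executed. A toy sketch follows; the dependency graph is invented, and a build system would derive the real one from your imports automatically:</p>

```python
from collections import deque

# Reverse dependency graph: module -> modules that import it.
# Invented for illustration; a build system infers this automatically.
DEPENDENTS = {
    "utils": ["mnist", "fraud_model"],
    "mnist": ["mnist_service"],
    "fraud_model": [],
    "mnist_service": [],
}

def affected_by(changed: str, dependents: dict) -> set:
    """Collect all modules transitively affected by a change (BFS)."""
    affected, queue = set(), deque([changed])
    while queue:
        module = queue.popleft()
        for dep in dependents.get(module, []):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    return affected

print(sorted(affected_by("utils", DEPENDENTS)))
# Prints: ['fraud_model', 'mnist', 'mnist_service']
```

<p>A change to <em>utils</em> triggers the test suites of every project that transitively imports it, while a change to a leaf service triggers only its own tests.</p>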


<p><strong>Another example</strong>,<strong> production deployment</strong>.&nbsp;</p>



<p>Whether you deploy weekly, daily, or continuously, when the time comes, you would build all the services in the monorepo and push them to production. But hey, do you need to build <em>all</em> of them on each occasion? That could be time-consuming and expensive at scale.&nbsp;</p>



<p>Some projects might not have been updated for weeks. On the other hand, the shared utility code they use might have received updates. How do we decide what to build? Again, it’s all about dependencies. Ideally, we would only build services that have been affected by the recent changes.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?ssl=1" target="_blank" rel="noreferrer noopener"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1800" height="942" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=1800%2C942&#038;ssl=1" alt="Setting up ML monorepo and why do you need a system (deployment)" class="wp-image-27565" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?w=1800&amp;ssl=1 1800w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=1536%2C804&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/08/1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><figcaption class="wp-element-caption">Why do you need a build system: deployment | Source: Author</figcaption></figure>
</div>


<p>All of this can be handled with a simple shell script with a small codebase, but as it scales and projects start sharing code, challenges emerge, many of which revolve around dependency management.&nbsp;</p>



<h4 class="wp-block-heading">Picking the right system</h4>



<p>All of the above is not a problem anymore if you invest in a proper build system. A build system’s primary task is to build code. And it should do so in a clever way: the developer should only need to tell it <em>what </em>to build (“build docker images affected by my latest commit”, or “run only those tests that cover code which uses the method I’ve updated”), but the <em>how </em>should be left for the system to figure out.</p>



<p>There are a couple of great open-source build systems out there. Since most machine learning is done in Python, let’s focus on the ones with the best Python support. The two most popular choices in this regard are <a href="https://bazel.build/" target="_blank" rel="noreferrer noopener nofollow">Bazel</a> and <a href="https://www.pantsbuild.org/" target="_blank" rel="noreferrer noopener nofollow">Pants</a>.&nbsp;</p>



<p>Bazel is an open-source version of Google’s internal build system, Blaze. Pants is also heavily inspired by Blaze and pursues similar technical design goals to Bazel. An interested reader will find a good comparison of Pants vs. Bazel in this <a href="https://blog.pantsbuild.org/pants-vs-bazel/" target="_blank" rel="noreferrer noopener nofollow">blog post</a> (but keep in mind it comes from the Pants devs). The table at the bottom of <a href="https://monorepo.tools/" target="_blank" rel="noreferrer noopener nofollow">monorepo.tools</a> offers yet another comparison.</p>



<p>Both systems are great, and it is not my intention to declare a “better” solution here. That being said, Pants is often described as easier to set up, more approachable, and well-optimized for Python, which makes it a perfect fit for machine learning monorepos.&nbsp;</p>



<p>In my personal experience, the decisive factor that made me go with Pants was its active and helpful community. Whenever you have questions or doubts, just post on the community Slack channel, and a bunch of supportive folks will help you out soon.</p>



<h3 class="wp-block-heading" id="h-introducing-pants">Introducing Pants</h3>



<p>Alright, time to get to the meat of it! We will go step by step, introducing Pants’ various functionalities and how to use them. Again, you can check out the associated sample repo <a href="https://github.com/MichalOleszak/pants-monorepo-example/tree/main" target="_blank" rel="noreferrer noopener nofollow">here</a>.</p>



<h4 class="wp-block-heading">Setup</h4>



<p>Pants is installable with pip. In this tutorial, we will use the most recent stable version as of this writing, 2.15.1.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install pantsbuild.pants==<span class="hljs-number" style="color: teal;">2.15</span><span class="hljs-number" style="color: teal;">.1</span>
</pre></code></pre>
</div>




<p>Pants is configurable through a global master config file named <a href="https://github.com/MichalOleszak/pants-monorepo-example/blob/main/pants.toml" target="_blank" rel="noreferrer noopener nofollow"><em>pants.toml</em></a><em>.</em> In it, we can configure Pants’ own behavior as well as the settings of downstream tools it relies on, such as pytest or mypy.</p>



<p>Let’s start with a bare minimum <em>pants.toml:</em></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">[GLOBAL]
pants_version = <span class="hljs-string" style="color: rgb(221, 17, 68);">"2.15.1"</span>
backend_packages = [
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-string" style="color: rgb(221, 17, 68);">"pants.backend.python"</span>,
]

[source]
root_patterns = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"/"</span>]

[python]
interpreter_constraints = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"==3.9.*"</span>]</pre></code></pre>
</div>




<p>In the global section, we define the Pants version and the backend packages we need. These packages are Pants’ engines that support different features. For starters, we only include the Python backend.</p>



<p>In the source section, we set the source to the repository’s root. Since version 2.15, to make sure this is picked up, we also need to add an <a href="https://github.com/MichalOleszak/pants-monorepo-example/blob/main/BUILD_ROOT" target="_blank" rel="noreferrer noopener nofollow">empty BUILD_ROOT file</a> at the repository’s root.</p>



<p>Finally, in the Python section, we choose the Python version to use. Pants will browse our system in search of a version that matches the conditions specified here, so make sure you have this version installed.</p>



<p>That’s a start! Next, let’s take a look at any build system’s heart: the BUILD files.</p>



<h4 class="wp-block-heading">Build files</h4>



<p>Build files are configuration files used to define targets (what to build) and their dependencies (what they need to work) in a declarative way.&nbsp;</p>



<p>You can have multiple build files at different levels of the directory tree. The more there are, the more granular the control over dependency management. In fact, Google has a build file in virtually every directory in their repo.&nbsp;</p>



<p>In our example, we will use three build files:</p>



<ul class="wp-block-list">
<li>mnist/BUILD &#8211; in the project directory, this build file will define the python requirements for the project and the docker container to build;</li>



<li>mnist/src/BUILD &#8211; in the source code directory, this build file will define python sources, that is, files to be covered by python-specific checks;</li>



<li>mnist/tests/BUILD &#8211; in the tests directory, this build file will define which files to run with Pytest and what dependencies are needed for these tests to run.</li>
</ul>
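

<p>For orientation, with these three build files in place, the repository layout looks roughly like this (the lockfile and Dockerfile are introduced later in this article):</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>.
├── pants.toml
├── BUILD_ROOT
└── mnist
    ├── BUILD
    ├── Dockerfile
    ├── requirements.txt
    ├── mnist.lock
    ├── src
    │   ├── BUILD
    │   └── ...
    └── tests
        ├── BUILD
        └── ...</code></pre>
</div>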



<p>Let’s take a look at the mnist/src/BUILD:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python_sources(
&nbsp;&nbsp;&nbsp;&nbsp;name=<span class="hljs-string" style="color: rgb(221, 17, 68);">"python"</span>,
&nbsp;&nbsp;&nbsp;&nbsp;resolve=<span class="hljs-string" style="color: rgb(221, 17, 68);">"mnist"</span>,
&nbsp;&nbsp;&nbsp;&nbsp;sources=[<span class="hljs-string" style="color: rgb(221, 17, 68);">"**/*.py"</span>],
)</pre></code></pre>
</div>




<p>At the same time, mnist/BUILD looks like this:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python_requirements(
&nbsp;&nbsp;&nbsp;&nbsp;name=<span class="hljs-string" style="color: rgb(221, 17, 68);">"reqs"</span>,
&nbsp;&nbsp;&nbsp;&nbsp;source=<span class="hljs-string" style="color: rgb(221, 17, 68);">"requirements.txt"</span>,
&nbsp;&nbsp;&nbsp;&nbsp;resolve=<span class="hljs-string" style="color: rgb(221, 17, 68);">"mnist"</span>,
)</pre></code></pre>
</div>




<p>The two entries in the build files are referred to as targets. First, we have a Python sources target, which we aptly call <em>python</em>, although the name could be anything. We define our Python sources as all .py files in the directory. This is relative to the build file’s location: even if we had Python files outside of the <em>mnist/src</em> directory, this target only captures the contents of the <em>mnist/src</em> folder. There is also a resolve field; we will talk about it in a moment.</p>



<p>Next, we have the Python requirements target. It tells Pants where to find the requirements needed to execute our Python code (again, relative to the build file’s location, which is in the mnist project’s root in this case).</p>



<p>This is all we need to get started. To make sure the build file definition is correct, let’s run:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pants tailor --check update-build-files --check ::
</pre></code></pre>
</div>




<p>As expected, the output is “No required changes to BUILD files found.” Good!</p>



<p>Let’s spend a bit more time on this command. In a nutshell, a bare <em>pants tailor</em> can automatically create build files. However, it sometimes adds more of them than one needs, which is why I prefer to add them manually and then run the command above to check their correctness.</p>



<p>The double colon at the end is Pants notation that tells it to run the command over the entire monorepo. Alternatively, we could have replaced it with <em>mnist:</em> to run only against the <em>mnist</em> module.</p>
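

<p>For illustration, here is how the same goal can be scoped at different levels; the target address syntax follows Pants’ conventions, and the target names refer to the BUILD files above:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code># the entire monorepo
pants lint ::

# only the targets defined in mnist/BUILD
pants lint mnist:

# mnist/BUILD and all BUILD files beneath it
pants lint mnist::

# a single named target
pants lint mnist/src:python</code></pre>
</div>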



<h4 class="wp-block-heading">Dependencies and lockfiles</h4>



<p>To do efficient dependency management, Pants relies on lockfiles. Lockfiles record the specific versions and sources of all dependencies used by each project, including both direct and transitive dependencies.</p>



<p>By capturing this information, lockfiles ensure that the same versions of dependencies are used consistently across different environments and builds. In other words, they serve as a snapshot of the dependency graph, ensuring reproducibility and consistency across builds.</p>



<p>To generate a lockfile for our <em>mnist </em>module, we need the following addition to <em>pants.toml:</em></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">[python]
interpreter_constraints = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"==3.9.*"</span>]
enable_resolves = true
default_resolve = <span class="hljs-string" style="color: rgb(221, 17, 68);">"mnist"</span>

[python.resolves]
mnist = <span class="hljs-string" style="color: rgb(221, 17, 68);">"mnist/mnist.lock"</span>
</pre></code></pre>
</div>




<p>We enable the resolves (the Pants term for lockfile environments) and define one for <em>mnist</em>, passing a file path. We also choose it as the default. This is the resolve we passed to the Python sources and Python requirements targets before: that is how they know which dependencies are needed. We can now run:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pants generate-lockfiles
</pre></code></pre>
</div>




<p>to get:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Completed: Generate lockfile <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> mnist
Wrote lockfile <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> the resolve `mnist` to mnist/mnist.lock
</pre></code></pre>
</div>




<p>This has created a file at <em>mnist/mnist.lock</em>. This file should be checked into git if you intend to use Pants in your remote CI/CD. And naturally, it needs to be regenerated every time you update the <em>requirements.txt</em> file.</p>



<p>With more projects in the monorepo, you would rather generate lockfiles selectively, only for the project that needs it, e.g., <em>pants generate-lockfiles mnist:</em>.</p>



<p>That’s it for the setup! Now let’s use Pants to do something useful for us.</p>



<h4 class="wp-block-heading">Unifying code style with Pants</h4>



<p>Pants natively supports a number of Python linters and code formatting tools, such as Black, yapf, Docformatter, Autoflake, Flake8, isort, Pyupgrade, and Bandit. They are all used in the same way; in our example, let’s implement Black and Docformatter.</p>



<p>To do so, we add the two appropriate backends to <em>pants.toml:</em></p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">[GLOBAL]
pants_version = <span class="hljs-string" style="color: rgb(221, 17, 68);">"2.15.1"</span>
colors = true
backend_packages = [
    <span class="hljs-string" style="color: rgb(221, 17, 68);">"pants.backend.python"</span>,
    <span class="hljs-string" style="color: rgb(221, 17, 68);">"pants.backend.python.lint.docformatter"</span>,
    <span class="hljs-string" style="color: rgb(221, 17, 68);">"pants.backend.python.lint.black"</span>,
]
</pre></code></pre>
</div>




<p>We could configure both tools by adding additional sections to the toml file, but let’s stick with the defaults for now.</p>



<p>To use the formatters, we need to execute what’s called a Pants goal. In this case, two goals are relevant.</p>



<p>First, the lint goal will run both tools in check mode, in the order in which they are listed in backend packages: Docformatter first, Black second.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pants lint ::

Completed: Format <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> docformatter - docformatter made no changes.
Completed: Format <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> Black - black made no changes.

✓ black succeeded.
✓ docformatter succeeded.
</pre></code></pre>
</div>




<p>It looks like our code adheres to the standards of both formatters! If it did not, we could execute the fmt (short for “format”) goal, which adapts the code appropriately:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pants fmt ::
</pre></code></pre>
</div>




<p>In practice, you might want to use more than these two formatters. In this case, you may need to update each formatter&#8217;s config to ensure that it is compatible with the others. For instance, if you are using Black with its default config as we have done here, it will expect code lines not to exceed 88 characters.&nbsp;</p>



<p>But if you then want to add isort to automatically sort your imports, they will clash: by default, isort wraps lines at 79 characters. To make isort compatible with Black, you would need to include the following section in the toml file:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">[isort]
args = [
    <span class="hljs-string" style="color: rgb(221, 17, 68);">"-l=88"</span>,
 ]
</pre></code></pre>
</div>




<p>All formatters can be configured in the same way in <em>pants.toml</em> by passing arguments through to the underlying tool.</p>
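

<p>For example, if you wanted Black to allow longer lines, a hypothetical override could look like this; the flag is Black’s own <em>--line-length</em> option, passed through just like the isort arguments above:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>[black]
args = [
    "--line-length=100",
]</code></pre>
</div>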



<h4 class="wp-block-heading">Testing with Pants</h4>



<p>Let’s run some tests! To do this, we need two steps.</p>



<p>First, we add the appropriate sections to <em>pants.toml</em>:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">[test]
output = <span class="hljs-string" style="color: rgb(221, 17, 68);">"all"</span>
report = false
use_coverage = true

[coverage-py]
global_report = true

[pytest]
args = [<span class="hljs-string" style="color: rgb(221, 17, 68);">"-vv"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"-s"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"-W ignore::DeprecationWarning"</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">"--no-header"</span>]
</pre></code></pre>
</div>




<p>These settings make sure that as the tests are run, a test coverage report is produced. We also pass a couple of custom pytest options to adapt its output.</p>



<p>Next, we need to go back to our mnist/tests/BUILD file and add a Python tests target:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">python_tests(
    name=<span class="hljs-string" style="color: rgb(221, 17, 68);">"tests"</span>,
    resolve=<span class="hljs-string" style="color: rgb(221, 17, 68);">"mnist"</span>,
    sources=[<span class="hljs-string" style="color: rgb(221, 17, 68);">"test_*.py"</span>],
)
</pre></code></pre>
</div>




<p>We call it tests and specify the resolve (i.e., lockfile) to use. Sources tell pytest where to look for tests to run; here, we explicitly pass all .py files prefixed with “test_”.</p>



<p>Now we can run:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>pants test ::</code></pre>
</div>




<p>To get:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>✓ mnist/tests/test_data.py:../tests succeeded <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> <span class="hljs-number" style="color: teal;">3.83</span>s.
✓ mnist/tests/test_model.py:../tests succeeded <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> <span class="hljs-number" style="color: teal;">2.26</span>s.

Name                               Stmts   Miss  Cover
------------------------------------------------------
__global_coverage__/no-op-exe.py       <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">100</span>%
mnist/src/data.py                     <span class="hljs-number" style="color: teal;">14</span>      <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">100</span>%
mnist/src/model.py                    <span class="hljs-number" style="color: teal;">15</span>      <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">100</span>%
mnist/tests/test_data.py              <span class="hljs-number" style="color: teal;">21</span>      <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">95</span>%
mnist/tests/test_model.py             <span class="hljs-number" style="color: teal;">20</span>      <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">95</span>%
------------------------------------------------------
TOTAL                                 <span class="hljs-number" style="color: teal;">70</span>      <span class="hljs-number" style="color: teal;">2</span>    <span class="hljs-number" style="color: teal;">97</span>%
</pre></code></pre>
</div>




<p>As you can see, it took a few seconds to run this test suite. Now, if we re-run it, we will get the results immediately:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">✓ mnist/tests/test_data.py:../tests succeeded <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> <span class="hljs-number" style="color: teal;">3.83</span>s (memoized).
✓ mnist/tests/test_model.py:../tests succeeded <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> <span class="hljs-number" style="color: teal;">2.26</span>s (memoized).
</pre></code></pre>
</div>




<p>Notice how Pants tells us these results are memoized, or cached. Since no changes have been made to the tests, the code being tested, or the requirements, there is no need to actually re-run the tests – their results are guaranteed to be the same, so they are just served from the cache.</p>



<h4 class="wp-block-heading">Checking static typing with Pants</h4>



<p>Let’s add one more code quality check. Pants allows using mypy to check static typing in Python. All we need to do is add the mypy backend, <em>&#8220;pants.backend.python.typecheck.mypy&#8221;</em>, to <em>pants.toml</em>.</p>



<p>You might also want to configure mypy to make its output more readable and informative by also adding the following config section:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>[mypy]
args = [
    "--ignore-missing-imports",
    "--local-partial-types",
    "--pretty",
    "--color-output",
    "--error-summary",
    "--show-error-codes",
    "--show-error-context",
]</code></pre>
</div>




<p>With this, we can run <em>pants check ::</em> to get:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>Completed: Typecheck using MyPy - mypy - mypy succeeded.
Success: no issues found in 6 source files

✓ mypy succeeded.</code></pre>
</div>




<h4 class="wp-block-heading">Shipping ML models with Pants</h4>



<p>Let’s talk shipping. Most machine learning projects involve one or more docker containers, for example, processing training data, training a model, or serving it via an API using Flask or FastAPI. In our toy project, we also have <a href="https://github.com/MichalOleszak/pants-monorepo-example/blob/main/mnist/Dockerfile" target="_blank" rel="noreferrer noopener nofollow">a container for model training</a>.</p>



<p>Pants supports automatically building and pushing docker images. Let’s see how it works.</p>



<p>First, we add the docker backend, <em>pants.backend.docker</em>, to <em>pants.toml</em>. We will also configure docker, passing it a number of environment variables and a build arg, which will come in handy in a moment:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>[docker]
build_args = ["SHORT_SHA"]
env_vars = ["DOCKER_CONFIG=%(env.HOME)s/.docker", "HOME", "USER", "PATH"]</code></pre>
</div>




<p>Now, in the mnist/BUILD file, we will add two more targets: a files target and a docker image target.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>files(
    name="module_files",
    sources=["**/*"],
)

docker_image(
    name="train_mnist",
    dependencies=["mnist:module_files"],
    registries=["docker.io"],
    repository="michaloleszak/mnist",
    image_tags=["latest", "{build_args.SHORT_SHA}"],
)</code></pre>
</div>




<p>We call the docker target <em>train_mnist</em>. As a dependency, we need to pass it the list of files to be included in the container. The most convenient way to do this is to define this list as a separate files target. Here, we simply include all the files in the mnist project in a target called <em>module_files</em> and pass it as a dependency to the docker image target.</p>



<p>Naturally, if you know that only a subset of files will be needed by the container, it’s a good idea to pass only them as a dependency. This matters because Pants uses these dependencies to infer whether a container has been affected by a change and needs a rebuild. Here, with <em>module_files</em> including all files, if any file in the <em>mnist</em> folder changes (even a readme!), Pants will consider the <em>train_mnist</em> docker image affected by the change.</p>
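

<p>As a sketch, a narrower files target might list only the inputs the image actually copies; the exact globs below are hypothetical and depend on your Dockerfile:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code># hypothetical: include only what the Dockerfile needs
files(
    name="module_files",
    sources=["src/**/*.py", "requirements.txt", "mnist.lock", "Dockerfile"],
)</code></pre>
</div>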



<p>Finally, we can also set the external registry and repository to which the image can be pushed, and the tags with which it will be pushed: here, I will be pushing the image to my personal dockerhub repo, always with two tags: “latest”, and the short commit SHA which will be passed as a build arg.</p>



<p>With this, we can build an image. Just one more thing: since Pants works in isolated environments, it cannot read environment variables from the host. Hence, to build or push an image that requires the SHORT_SHA variable, we need to pass it together with the Pants command.</p>



<p>We can build the image like this:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">SHORT_SHA=$(git rev-parse --short HEAD) pants package mnist:train_mnist 
</pre></code></pre>
</div>




<p>to get:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Completed: Building docker image docker.io/michaloleszak/mnist:latest +<span class="hljs-number" style="color: teal;">1</span> additional tag.
Built docker images: 
  * docker.io/michaloleszak/mnist:latest
  * docker.io/michaloleszak/mnist:<span class="hljs-number" style="color: teal;">0185754</span>
</pre></code></pre>
</div>




<p>A quick check reveals that the images have indeed been built:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">docker images 


REPOSITORY            TAG       IMAGE ID       CREATED              SIZE
michaloleszak/mnist   <span class="hljs-number" style="color: teal;">0185754</span>   d86dca9fb037   About a minute ago   <span class="hljs-number" style="color: teal;">3.71</span>GB
michaloleszak/mnist   latest    d86dca9fb037   About a minute ago   <span class="hljs-number" style="color: teal;">3.71</span>GB
</pre></code></pre>
</div>




<p>We can also build and push images in one go using Pants. All it takes is replacing the package command with the publish command.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">SHORT_SHA=$(git rev-parse --short HEAD) pants publish mnist:train_mnist</pre></code></pre>
</div>




<p>This built the images and pushed them to my Docker Hub repository, <a href="https://hub.docker.com/repository/docker/michaloleszak/mnist/general" target="_blank" rel="noreferrer noopener nofollow">where they have indeed landed</a>.</p>



<h4 class="wp-block-heading">Pants in CI/CD</h4>



<p>The same commands we have just run manually can be executed as part of a CI/CD pipeline. You can run them via services such as GitHub Actions or Google Cloud Build, for instance as a PR check before a feature branch is allowed to be merged into the main branch, or after the merge, to validate that the branch is green and to build &amp; push containers.</p>



<p>In our toy repo, I have implemented <a href="https://github.com/MichalOleszak/pants-monorepo-example/blob/main/.pre-commit-config.yaml" target="_blank" rel="noreferrer noopener nofollow">a pre-push git hook</a> that runs Pants commands on git push and only lets the push through if they all pass. In it, we run the following commands:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pants tailor --check update-build-files --check ::
pants lint ::
pants --changed-since=main --changed-dependees=transitive check
pants test ::
</pre></code></pre>
</div>




<p>Notice the new flags passed to pants check, that is, the typing check with mypy. They ensure that the check only runs on files that have changed compared to the main branch, plus their transitive dependents. This is useful since mypy tends to take some time to run, and limiting its scope to what actually needs checking accelerates the process.</p>



<p>How would a docker build &amp; push look in a CI/CD pipeline? Somewhat like this:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--large block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pants --changed-since=HEAD^ --changed-dependees=transitive --filter-target-type=docker_image publish
</pre></code></pre>
</div>




<p>We use the publish command as before, but with three additional arguments:</p>



<ul class="wp-block-list">
<li>--changed-since=HEAD^ and --changed-dependees=transitive make sure that only the containers affected by changes relative to the previous commit are built; this is useful when executing on the main branch after a merge.</li>



<li>--filter-target-type=docker_image makes sure that the only thing Pants does is build and push Docker images; this is because the pants publish command can refer to targets other than Docker images: for example, it can be used to publish Helm charts to OCI registries.</li>
</ul>



<p>The same goes for <strong>pants package</strong>: on top of building Docker images, it can also create Python packages; for that reason, it’s good practice to pass the --filter-target-type option.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Monorepos are more often than not a great architecture choice for machine learning teams. Managing them at scale, however, requires investment in a proper build system. One such system is Pants: it’s easy to set up and use and offers native support for many Python and Docker features that machine learning teams often use.&nbsp;</p>



<p>On top of that, it is an open-source project with a large and helpful community. I hope after reading this article you will go ahead and try it out. Even if you don’t currently have a monolithic repository, Pants can still streamline and facilitate many aspects of your daily work!</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ul class="wp-block-list">
<li>Pants documentation: <a href="https://www.pantsbuild.org/" target="_blank" rel="noreferrer noopener nofollow">https://www.pantsbuild.org/</a></li>



<li>Pants vs. Bazel blog post: <a href="https://blog.pantsbuild.org/pants-vs-bazel/" target="_blank" rel="noreferrer noopener nofollow">https://blog.pantsbuild.org/pants-vs-bazel/</a></li>



<li>monorepo.tools: <a href="https://monorepo.tools/" target="_blank" rel="noreferrer noopener nofollow">https://monorepo.tools/</a></li>
</ul>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">27239</post-id>	</item>
		<item>
		<title>Feature Selection Methods and How to Choose Them</title>
		<link>https://neptune.ai/blog/feature-selection-methods</link>
		
		<dc:creator><![CDATA[Michał Oleszak]]></dc:creator>
		<pubDate>Fri, 09 Sep 2022 09:23:09 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.test/feature-selection-methods/</guid>

					<description><![CDATA[Have you ever found yourself sitting in front of the screen wondering what kind of features will help your machine learning model learn its task best? I bet you have. Data preparation tends to consume vast amounts of data scientists’ and machine learning engineers’ time and energy, and making the data ready to be fed&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Have you ever found yourself sitting in front of the screen wondering what kind of features will help your machine learning model learn its task best? I bet you have. Data preparation tends to consume vast amounts of data scientists’ and machine learning engineers’ time and energy, and making the data ready to be fed to the learning algorithms is no small feat.&nbsp;</p>



<p>One of the crucial steps in the data preparation pipeline is <strong>feature selection</strong>. You might know the popular adage: garbage in, garbage out. What you feed your models with is at least as important as the models themselves, if not more so.</p>



<p>In this article, we will:</p>



<ul class="wp-block-list">
<li>look at the place of feature selection among other feature-related tasks in the data preparation pipeline,</li>



<li>discuss the multiple reasons why it is so crucial for any machine learning project’s success,</li>



<li>go over different approaches to feature selection and discuss some tricks and tips to improve their results,</li>



<li>take a glimpse under the hood of Boruta, a state-of-the-art feature selection algorithm, to see a clever way of combining different feature selection methods,</li>



<li>and look into how feature selection is leveraged in the industry.</li>
</ul>



<p>Let’s dive in!</p>



<h2 class="wp-block-heading" id="h-what-is-feature-selection-and-what-is-it-not">What is feature selection, and what is it not?</h2>



<p>Let’s kick off by defining our object of interest.&nbsp;</p>



<p>What is feature selection? In a nutshell, it is the process of selecting the subset of features to be used for training a machine learning model.&nbsp;</p>



<p>This is what feature selection is, but it is equally important to understand what feature selection is not – it is neither feature extraction/feature engineering nor is it dimensionality reduction.</p>



<p>Feature extraction and feature engineering are two terms describing the same process of creating new features from the existing ones based on domain knowledge. This yields more features than were originally there, and it should be performed before feature selection. First, we can do feature extraction to come up with many potentially useful features, and then we can perform feature selection in order to pick the best subset that will indeed improve the model’s performance.</p>



<p><a href="/blog/dimensionality-reduction" target="_blank" rel="noreferrer noopener">Dimensionality reduction</a> is yet another concept. It is somewhat similar to feature selection, as both aim to reduce the number of features. However, they differ significantly in how they achieve this goal. While feature selection chooses a subset of the original features to keep and discards the rest, dimensionality reduction techniques project the original features onto a lower-dimensional space, thus creating a completely new set of features. Dimensionality reduction, if desired, should be run after feature selection, but in practice, it is usually either one or the other.</p>
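<p>To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and a toy dataset) showing that feature selection returns a subset of the original columns unchanged, while PCA, a dimensionality reduction technique, returns entirely new columns:</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 100 observations, 5 candidate features
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Feature selection keeps 2 of the original columns, unchanged
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Dimensionality reduction (PCA) creates 2 brand-new columns,
# each a projection mixing all 5 original features
X_reduced = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_reduced.shape)  # (100, 2) (100, 2)
```

<p>Both results have the same shape, but every column of `X_selected` is an exact copy of an original feature, while no column of `X_reduced` is.</p>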



<p>Now we know what feature selection is and how it corresponds to other feature-related data preparation tasks. But why do we even need it?</p>



<h2 class="wp-block-heading" id="h-7-reasons-why-we-need-feature-selection">7 reasons why we need feature selection</h2>



<p>A popular claim is that modern machine learning techniques do well without feature selection. After all, a model should be able to learn that particular features are useless, and it should focus on the others, right?&nbsp;</p>



<p>Well, this reasoning makes sense to some extent. Linear models could, in theory, assign a weight of zero to useless features, and tree-based models should learn quickly not to make splits on them. In practice, however, many things can go wrong with training when the inputs are irrelevant or redundant &#8211; more on these two terms later. On top of this, there are many other reasons why simply dumping all the available features into the model might not be a good idea. Let’s look at the seven most prominent ones.</p>



<p><strong>1. Irrelevant and redundant features</strong></p>



<p>Some features might be irrelevant to the problem at hand. This means they have no relation with the target variable and are completely unrelated to the task the model is designed to solve. Discarding irrelevant features will prevent the model from picking up on spurious correlations it might carry, thus fending off overfitting.</p>



<p>Redundant features are a different animal, though. Redundancy implies that two or more features share the same information, and all but one can be safely discarded without information loss. Note that an important feature can also be redundant in the presence of another relevant feature. Redundant features should be dropped, as they might pose many problems during training, such as multicollinearity in linear models.</p>



<p><strong>2. Curse of dimensionality</strong></p>



<p>Feature selection techniques are especially indispensable in scenarios with many features but few training examples. Such cases suffer from what is known as the curse of dimensionality: in a very high-dimensional space, each training example is so far from all the other examples that the model cannot learn any useful patterns. The solution is to decrease the dimensionality of the features space, for instance, via feature selection.</p>



<p><strong>3. Training time</strong></p>



<p>The more features, the more training time. The specifics of this trade-off depend on the particular learning algorithm being used, but in situations where retraining needs to happen in real-time, one might need to limit oneself to a couple of best features.</p>



<p><strong>4. Deployment effort</strong></p>



<p>The more features, the more complex the machine learning system becomes in production. This poses multiple risks, including but not limited to high maintenance effort, <a href="https://towardsdatascience.com/8-hazards-menacing-machine-learning-systems-in-production-5c470baa0163" target="_blank" rel="noreferrer noopener nofollow">entanglement, undeclared consumers, or correction cascades</a>.</p>



<p><strong>5. Interpretability</strong></p>



<p>With too many features, we lose the <a href="/blog/explainability-auditability-ml-definitions-techniques-tools" target="_blank" rel="noreferrer noopener">explainability of the model</a>. While not always the primary modeling goal, interpreting and explaining the model’s results are often important and, in some regulated domains, might even constitute a legal requirement. </p>



<p><strong>6. Occam’s Razor</strong></p>



<p>According to this so-called law of parsimony, simpler models should be preferred over the more complex ones as long as their performance is the same. This also has to do with the machine learning engineer’s nemesis, overfitting. Less complex models are less likely to overfit the data.</p>



<p><strong>7. Data-model compatibility</strong></p>



<p>Finally, there is the issue of data-model compatibility. While, in principle, the approach should be data-first, which means collecting and preparing high-quality data and then choosing a model which works well on this data, real life may have it the other way around.&nbsp;</p>



<p>You might be trying to reproduce a particular research paper, or your boss might have suggested using a particular model. In this model-first approach, you might be forced to select features that are compatible with the model you set out to train. For instance, many models don’t work with missing values in the data. Unless you <a href="https://towardsdatascience.com/handling-missing-data-5be11eddbdd" target="_blank" rel="noreferrer noopener nofollow">know your imputation methods well</a>, you might need to drop the incomplete features.</p>



<h2 class="wp-block-heading" id="h-different-approaches-to-feature-selection">Different approaches to feature selection</h2>



<p>All the different approaches to feature selection can be grouped into four families of methods, each coming with its pros and cons. There are unsupervised and supervised methods. The latter can be further divided into the wrapper, filter, and embedded methods. Let’s discuss them one by one.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/feature-selection-methods-1.png?resize=767%2C452&#038;ssl=1" alt="Different approaches to feature selection" class="wp-image-71268" width="767" height="452"/><figcaption class="wp-element-caption"><em>Feature selection methods | Source: author </em></figcaption></figure>
</div>


<h3 class="wp-block-heading" id="h-unsupervised-feature-selection-methods">Unsupervised feature selection methods</h3>



<p>Just like unsupervised learning is the type of learning that looks for patterns in unlabeled data, similarly, unsupervised feature selection methods are such methods that do not make use of any labels. In other words, they don’t need access to the target variable of the machine learning model.&nbsp;</p>



<p>How can we claim a feature to be unimportant for the model without analyzing its relation to the model’s target, you might ask. Well, in some cases, this is possible. We might want to discard the features with:</p>



<ul class="wp-block-list">
<li>Zero or near-zero variance. Features that are (almost) constant provide little information to learn from and thus are irrelevant.</li>



<li>Many missing values. While dropping incomplete features <a href="https://towardsdatascience.com/handling-missing-data-5be11eddbdd" target="_blank" rel="noreferrer noopener nofollow">is not the prefer</a>red way to handle missing data, it is often a good start, and if too many entries are missing, it might be the only sensible thing to do since such features are likely inconsequential.</li>



<li>High multicollinearity; multicollinearity means a strong correlation between different features, which might signal redundancy issues.</li>
</ul>



<h4 class="wp-block-heading">Unsupervised methods in practice</h4>



<p>Let’s now discuss the practical implementation of unsupervised feature selection methods. Just like most other machine learning tasks, feature selection is served very well by the scikit-learn package, and in particular by its `sklearn.feature_selection` module. However, in some cases, one needs to reach for other packages. Here, as well as for the remainder of the article, let’s denote by `X` an array or data frame with all potential features as columns and observations in rows, and by `y` the vector of targets.</p>



<ul class="wp-block-list">
<li>The `sklearn.feature_selection.VarianceThreshold` transformer will by default remove all zero-variance features. We can also pass a threshold as an argument to make it remove features whose variance is lower than the threshold.</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> VarianceThreshold
sel = VarianceThreshold(threshold=<span class="hljs-number" style="color: teal;">0.05</span>)
X_selection = sel.fit_transform(X)
</pre>



<ul class="wp-block-list">
<li>In order to drop the columns with missing values, pandas’ `.dropna(axis=1)` method can be used on the data frame.</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">X_selection = X.dropna(axis=<span class="hljs-number" style="color: teal;">1</span>)</pre>



<ul class="wp-block-list">
<li>To remove features with high multicollinearity, we first need to measure it. A popular multicollinearity measure is the Variance Inflation Factor or VIF. It is implemented in the statsmodels package.</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> statsmodels.stats.outliers_influence <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> variance_inflation_factor
vif_scores = [variance_inflation_factor(X.values, feature) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> feature <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(len(X.columns))]
</pre>



<p>By convention, columns with a VIF larger than 10 are considered as suffering from multicollinearity, but another threshold may be chosen if it seems more reasonable.</p>
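<p>Putting this convention into practice, the filtering step is usually iterative: drop the worst offender, recompute, and repeat, since VIF values change whenever a column is removed. A minimal sketch (the helper name `drop_high_vif` is made up; it assumes `X` is a pandas DataFrame and statsmodels is installed):</p>

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Drop one feature at a time (highest VIF first) until all VIFs fall below the threshold."""
    X = X.copy()
    while True:
        vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            return X
        X = X.drop(columns=X.columns[worst])

# Example: column "c" is nearly a copy of "a", so one of the pair gets dropped
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "b": b, "c": a + rng.normal(scale=0.01, size=200)})
X_reduced = drop_high_vif(X)
```

<p>Dropping one column per iteration (rather than all high-VIF columns at once) matters: two nearly identical features both show a huge VIF, but removing just one of them fixes the problem for the other.</p>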



<h3 class="wp-block-heading" id="h-wrapper-feature-selection-methods">Wrapper feature selection methods</h3>



<p>Wrapper methods refer to a family of supervised feature selection methods that use a model to score different subsets of features and finally select the best one. Each new subset is used to train a model whose performance is then evaluated on a hold-out set. The feature subset that yields the best model performance is selected. A major advantage of wrapper methods is that they tend to provide the best-performing feature set for the particular chosen type of model.&nbsp;</p>



<p>At the same time, however, this approach has a limitation: wrapper methods are likely to overfit to the model type, and the feature subsets they produce might not generalize should one want to try them with a different model.</p>



<p>Another significant disadvantage of wrapper methods is their large computational needs. They require training a large number of models, which might require some time and computing power.&nbsp;</p>



<p>Popular wrapper methods include:</p>



<ul class="wp-block-list">
<li><strong>Backward selection</strong>, in which we start with a full model comprising all available features. In subsequent iterations, we remove one feature at a time, always the one that yields the largest gain in a model performance metric, until we reach the desired number of features.</li>



<li><strong>Forward selection</strong>, which works in the opposite direction: we start from a null model with zero features and add them greedily one at a time to maximize the model’s performance.</li>



<li><strong>Recursive Feature Elimination</strong>, or RFE, which is similar in spirit to backward selection. It also starts with a full model and iteratively eliminates the features one by one. The difference is in the way the features to discard are chosen. Instead of relying on a model performance metric from a hold-out set, RFE makes its decision based on feature importance extracted from the model. This could be feature weights in linear models, impurity decrease in tree-based models, or permutation importance (which is applicable to any model type).</li>
</ul>



<h4 class="wp-block-heading">Wrapper methods in practice</h4>



<p>When it comes to wrapper methods, scikit-learn has got us covered:</p>



<ul class="wp-block-list">
<li>Backward and forward feature selection can be implemented with the SequentialFeatureSelector transformer. For instance, in order to use the k-Nearest-Neighbor classifier as the scoring model in forward selection, we could use the following code snippet:</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SequentialFeatureSelector
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.neighbors <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=<span class="hljs-number" style="color: teal;">3</span>)
sfs = SequentialFeatureSelector(knn, n_features_to_select=<span class="hljs-number" style="color: teal;">3</span>, direction=<span class="hljs-string" style="color: rgb(221, 17, 68);">"forward"</span>)
sfs.fit(X, y)
X_selection = sfs.transform(X)
</pre>



<ul class="wp-block-list">
<li>Recursive Feature Elimination is implemented in a very similar fashion. Here is a snippet implementing RFE based on feature importance from a Support Vector Classifier.</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> RFE
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SVC

svc = SVC(kernel=<span class="hljs-string" style="color: rgb(221, 17, 68);">"linear"</span>)
rfe = RFE(svc, n_features_to_select=<span class="hljs-number" style="color: teal;">3</span>)
rfe.fit(X, y)
X_selection = rfe.transform(X)
</pre>
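<p>Both `SequentialFeatureSelector` and `RFE` expose which columns survived via `get_support()`, a boolean mask over the original features that can be mapped back to column names. A short sketch with made-up feature names:</p>

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Hypothetical data and feature names, for illustration only
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rfe = RFE(SVC(kernel="linear"), n_features_to_select=3).fit(X, y)

mask = rfe.get_support()  # boolean mask over the original features
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(selected)  # the three surviving column names
```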



<h3 class="wp-block-heading" id="h-filter-feature-selection-methods">Filter feature selection methods</h3>



<p>Another member of the supervised family is filter methods. They can be thought of as a simpler and faster alternative to wrappers. In order to evaluate the usefulness of each feature, they simply analyze its statistical relation with the model’s target, using measures such as correlation or mutual information as a proxy for the model performance metric.</p>



<p>Not only are filter methods faster than wrappers, but they are also more general since they are model-agnostic; they won’t overfit to any particular algorithm. They are also pretty easy to interpret: a feature is discarded if it has no statistical relationship to the target.</p>



<p>On the other hand, however, filter methods have one major drawback. They look at each feature in isolation, evaluating its relation to the target. This makes them prone to discarding useful features that are weak predictors of the target on their own but add a lot of value to the model when combined with other features.</p>
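<p>A classic illustration is an XOR-style target: each feature on its own shares (nearly) zero mutual information with the label, so a per-feature filter would discard both, even though together they determine the target perfectly. A sketch:</p>

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=1000)
x2 = rng.integers(0, 2, size=1000)
y = x1 ^ x2                      # the target is the XOR of the two features
X = np.column_stack([x1, x2])

# Each feature alone shares (almost) no information with the target...
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(mi)  # both scores close to zero

# ...yet together they determine y exactly, so a filter scoring
# features one by one would wrongly discard both.
```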



<h4 class="wp-block-heading">Filter methods in practice</h4>



<p>Let’s now take a look at implementing various filter methods. These will need some more glue code to implement. First, we need to compute the desired correlation measure between each feature and the target. Then, we would sort all features according to the results and keep the desired number (top-K or top-30%) of the ones with the strongest correlation. Luckily, scikit-learn provides some utilities to help in this endeavour.</p>



<ul class="wp-block-list">
<li>To keep the top 2 features with the strongest Pearson correlation with the target, we can run:</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> r_regression, SelectKBest

X_selection = SelectKBest(r_regression, k=<span class="hljs-number" style="color: teal;">2</span>).fit_transform(X, y)</pre>



<ul class="wp-block-list">
<li>Similarly, to keep the top 30% of features, we would run:</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> r_regression, SelectPercentile

X_selection = SelectPercentile(r_regression, percentile=<span class="hljs-number" style="color: teal;">30</span>).fit_transform(X, y)</pre>



<p>The `SelectKBest` and `SelectPercentile` methods will also work with custom or non-scikit-learn correlation measures, as long as they return a vector of length equal to the number of features, with a number for each feature denoting the strength of its association with the target. Let’s now take a look at how to calculate all the different correlation measures out there (we will discuss what they mean and when to choose which later).</p>
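<p>For example, a custom scorer can be a plain function returning one score per feature; the hypothetical `abs_spearman` helper below ranks features by the absolute Spearman correlation computed with scipy:</p>

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest

def abs_spearman(X, y):
    # One score per feature: absolute Spearman rank correlation with the target
    return np.array([abs(stats.spearmanr(X[:, f], y).correlation)
                     for f in range(X.shape[1])])

# Hypothetical data: 10 candidate features, 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)
X_selection = SelectKBest(abs_spearman, k=3).fit_transform(X, y)
print(X_selection.shape)  # (200, 3)
```

<p>Taking the absolute value matters here: a strong negative correlation is just as useful to a model as a strong positive one, but a signed score would rank it last.</p>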



<ul class="wp-block-list">
<li>Spearman’s Rho, Kendall Tau, and point-biserial correlation are all available in the scipy package. This is how to get their values for each feature in X.</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> scipy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> stats

rho_corr = [stats.spearmanr(X[:, f], y).correlation <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> f <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(X.shape[<span class="hljs-number" style="color: teal;">1</span>])]

tau_corr = [stats.kendalltau(X[:, f], y).correlation <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> f <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(X.shape[<span class="hljs-number" style="color: teal;">1</span>])]

pbs_corr = [stats.pointbiserialr(X[:, f], y).correlation <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> f <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(X.shape[<span class="hljs-number" style="color: teal;">1</span>])]
</pre>



<ul class="wp-block-list">
<li>Chi-Squared, Mutual Information, and ANOVA F-score are all in scikit-learn. Note that mutual information has a separate implementation, depending on whether the target is nominal or not.</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> chi2
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> mutual_info_regression
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> mutual_info_classif
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> f_classif

chi2_corr = chi2(X, y)[<span class="hljs-number" style="color: teal;">0</span>]
f_corr = f_classif(X, y)[<span class="hljs-number" style="color: teal;">0</span>]
mi_reg_corr = mutual_info_regression(X, y)
mi_class_corr = mutual_info_classif(X, y)
</pre>



<ul class="wp-block-list">
<li>Cramer’s V can be obtained from a recent scipy version (1.7.0 or higher).</li>
</ul>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> scipy.stats.contingency <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> association

v_corr = [association(np.hstack([X[:, f].reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>), y.reshape(<span class="hljs-number" style="color: teal;">-1</span>, <span class="hljs-number" style="color: teal;">1</span>)]), method=<span class="hljs-string" style="color: rgb(221, 17, 68);">"cramer"</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> f <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(X.shape[<span class="hljs-number" style="color: teal;">1</span>])]
</pre>



<h3 class="wp-block-heading" id="h-embedded-feature-selection-methods">Embedded feature selection methods</h3>



<p>The final approach to feature selection we will discuss is to embed it into the learning algorithm itself. The idea is to combine the best of both worlds: the speed of filters with a feature subset tailored to the particular model, just like from a wrapper.</p>



<h4 class="wp-block-heading">Embedded methods in practice</h4>



<p>The flagship example is LASSO regression. It is linear regression with an added regularization term that shrinks feature weights towards zero in the loss function. As a result, many features end up with weights of exactly zero, meaning they are discarded from the model, while those with non-zero weights are kept.</p>
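<p>As a minimal sketch of this behavior (the synthetic dataset and the <code>alpha</code> value are illustrative choices, not a recommendation), feature selection simply falls out of the fitted LASSO coefficients:</p>

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: only 5 of the 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)  # LASSO is sensitive to feature scale

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # features with non-zero weights are kept
print(f"kept {selected.size} of {X.shape[1]} features")
```

<p>The larger the <code>alpha</code>, the stronger the shrinkage, and the fewer features survive.</p>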



<p>The problem with embedded methods is that there are not that many algorithms out there with feature selection built-in. Another example next to LASSO comes from computer vision: <a href="https://towardsdatascience.com/autoencoders-from-vanilla-to-variational-6f5bb5537e4a" target="_blank" rel="noreferrer noopener nofollow">auto-encoders</a> with a bottleneck layer force the network to disregard some of the least useful features of the image and focus on the most important ones. Other than that, there aren’t many useful examples.</p>



<h2 class="wp-block-heading" id="h-filter-feature-selection-methods-useful-tricks-tips">Filter feature selection methods: useful tricks &amp; tips</h2>



<p>As we have seen, wrapper methods are slow, computationally heavy, and model-specific, and there are not many embedded methods. As a result, filters are often the go-to family of feature selection methods.&nbsp;</p>



<p>At the same time, they require the most expertise and attention to detail. While embedded methods work out of the box and wrappers are fairly simple to implement (especially when one just calls scikit-learn functions), filters ask for a pinch of statistical sophistication. Let us now turn our attention to filter methods and discuss them in more detail.</p>



<p>Filter methods need to evaluate the statistical relationship between each feature and the target. As simple as it may sound, there’s more to it than meets the eye. There are many statistical methods to measure the relationship between two variables. To know which one to choose in a particular case, we need to think back to our first STATS101 class and brush up on <a href="https://towardsdatascience.com/data-measurement-levels-dfa9a4564176" target="_blank" rel="noreferrer noopener nofollow">data measurement levels</a>.</p>



<h3 class="wp-block-heading" id="h-data-measurement-levels">Data measurement levels</h3>



<p>In a nutshell, a variable’s measurement level describes the true meaning of the data and the types of mathematical operations that make sense for these data. There are four measurement levels: nominal, ordinal, interval, and ratio.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/feature-selection-methods-2.png?resize=693%2C225&#038;ssl=1" alt="Tabel with data measurement levels" class="wp-image-71269" width="693" height="225"/><figcaption class="wp-element-caption"><em>Data measurement levels | <a href="https://towardsdatascience.com/data-measurement-levels-dfa9a4564176" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<ul class="wp-block-list">
<li>Nominal features, such as color (“red”, “green” or “blue”), have no ordering between the values; they simply group observations based on them.</li>

<li>Ordinal features, such as education level (“primary”, “secondary”, “tertiary”), denote order, but not the differences between particular levels (we cannot say that the difference between “primary” and “secondary” is the same as the one between “secondary” and “tertiary”).</li>

<li>Interval features, such as temperature in degrees Celsius, keep the intervals equal (the difference between 25 and 20 degrees is the same as between 30 and 25).</li>

<li>Finally, ratio features, such as price in USD, are characterized by a meaningful zero, which allows us to calculate ratios between two data points: we can say that $4 is twice as much as $2.</li>
</ul>



<p>In order to choose the right statistical tool to measure the relation between two variables, we need to think about their measurement levels.</p>



<h3 class="wp-block-heading" id="h-measuring-correlations-for-various-data-types">Measuring correlations for various data types</h3>



<p>When the two variables we compare, i.e., the feature and the target, are both either interval or ratio, we are allowed to use the most popular correlation measure out there: the <strong>Pearson correlation</strong>, also known as <strong>Pearson’s r</strong>.&nbsp;</p>



<p>This is great, but Pearson correlation comes with two drawbacks: it assumes both variables are normally distributed, and it only measures the linear correlation between them. When the relationship is non-linear, Pearson’s r won’t detect it, even if it’s really strong.&nbsp;</p>



<p>You might have heard of the <em>Datasaurus</em> dataset, created by Alberto Cairo and later extended by Autodesk researchers into the <em>Datasaurus Dozen</em>: 13 pairs of variables, each with the same very weak Pearson correlation of -0.06. As quickly becomes obvious once we plot them, the pairs are actually correlated pretty strongly, albeit in a non-linear way.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/feature-selection-methods-3.png?resize=597%2C426&#038;ssl=1" alt="The Datasaurus dataset" class="wp-image-71270" width="597" height="426"/><figcaption class="wp-element-caption"><em>The Datasaurus dataset by Alberto Cairo | <a href="https://www.autodesk.com/research/publications/same-stats-different-graphs" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>When non-linear relations are to be expected, one of the alternatives to Pearson&#8217;s correlation should be taken into account. The two most popular ones are:</p>



<ol class="wp-block-list">
<li><strong>Spearman’s rank correlation (Spearman’s Rho),</strong></li>
</ol>



<p>Spearman’s rank correlation is an alternative to Pearson correlation for ratio/interval variables. As the name suggests, it only looks at rank values, i.e., it compares the two variables in terms of the relative positions of particular data points within the variables. It is able to capture non-linear monotonic relations, but there are no free lunches: we lose some information by considering only the ranks instead of the exact data points.</p>



<ol class="wp-block-list" start="2">
<li><strong>Kendall rank correlation (Kendall Tau).</strong></li>
</ol>



<p>Another rank-based correlation measure is the Kendall rank correlation. It is similar in spirit to Spearman’s correlation but formulated in a slightly different way: Kendall&#8217;s calculations are based on concordant and discordant pairs of values, as opposed to Spearman’s calculations based on deviations. Kendall is often regarded as more robust to outliers in the data.</p>
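<p>A quick sketch contrasting the three measures (the strongly monotonic but non-linear toy data is an illustrative choice): Pearson’s r understates the relationship, while the two rank-based measures capture it fully.</p>

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = np.exp(10 * x)  # strongly monotonic, but highly non-linear

r, _ = pearsonr(x, y)      # linear correlation: well below 1
rho, _ = spearmanr(x, y)   # rank correlation: exactly 1 for a monotonic relation
tau, _ = kendalltau(x, y)  # also 1: every pair of points is concordant
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```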



<p>If at least one of the compared variables is of ordinal type, Spearman’s or Kendall rank correlation is the way to go. Due to the fact that ordinal data contains only the information on the ranks, they are both a perfect fit, while Pearson’s linear correlation is of little use.</p>



<p>Another scenario is when both variables are nominal. In this case, we can choose from a couple of different correlation measures:</p>



<ul class="wp-block-list">
<li><strong>Cramer’s V</strong>, which captures the association between the two variables into a number ranging from zero (no association) to one (one variable completely determined by the other).</li>



<li><strong>Chi-Squared statistic</strong>, commonly used for testing for dependence between two variables. Lack of dependence suggests the particular feature is not useful.</li>



<li><strong>Mutual information</strong>, a measure of mutual dependence between two variables that quantifies how much information one variable carries about the other.</li>
</ul>



<p>Which one to choose? There is no one-size-fits-all answer. As usual, each method comes with some pros and cons. Cramer’s V is known to overestimate the association’s strength. Mutual information, being a non-parametric method, requires larger data samples to yield reliable results. Finally, the Chi-Squared statistic only indicates whether a relationship exists, not how strong it is.</p>



<p>We have discussed scenarios in which the two variables we compare are both interval or ratio, when at least one of them is ordinal, and when we compare two nominal variables. The final possible encounter is to compare a nominal variable with a non-nominal one.</p>



<p>In such cases, the two most widely-used correlation measures are:</p>



<ul class="wp-block-list">
<li><strong>ANOVA F-score</strong>, a chi-squared equivalent for the case when one of the variables is continuous while the other is nominal,</li>



<li><strong>Point-biserial correlation</strong>, a correlation measure specifically designed to evaluate the relationship between a binary and a continuous variable.</li>
</ul>



<p>Once again, there is no silver bullet. The F-score only captures linear relations, while point-biserial correlation makes strong normality assumptions that might not hold in practice, undermining its results.</p>
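<p>Both measures are one function call away (the synthetic binary-target data below is an illustrative assumption; <code>f_classif</code> comes from scikit-learn, <code>pointbiserialr</code> from scipy):</p>

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)         # binary (nominal) target
x = 1.5 * y + rng.normal(size=300)  # continuous feature shifted by class

F, p = f_classif(x.reshape(-1, 1), y)  # ANOVA F-score per feature
r_pb, p_pb = pointbiserialr(y, x)      # point-biserial correlation
print(f"F = {F[0]:.1f} (p = {p[0]:.2g}), point-biserial r = {r_pb:.2f}")
```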



<p>Having said all that, which method should one choose in a particular case? The table below will hopefully provide some guidance in this matter.</p>



<div id="separator-block_22a983943a0129dfa80afc2a55621211"
         class="block-separator block-separator--20">
</div>



<div id="medium-table-block_e03e343bbc0f98bf1604b082b5876075"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset">

    <table class="c-table">
        <thead class="c-table__head">
            <tr>
                <td class="c-item"><div class="c-item__inner">Variable 1</div></td>
                <td class="c-item"><div class="c-item__inner">Variable 2</div></td>
                <td class="c-item"><div class="c-item__inner">Method</div></td>
                <td class="c-item"><div class="c-item__inner">Comments</div></td>
            </tr>
        </thead>
        <tbody class="c-table__body">
            <tr class="c-row">
                <td class="c-ceil" rowspan="3"><div class="c-ceil__inner">Interval / ratio</div></td>
                <td class="c-ceil" rowspan="3"><div class="c-ceil__inner">Interval / ratio</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Pearson’s r</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Only captures linear relations, assumes normality</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil"><div class="c-ceil__inner">Spearman’s Rho</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">When nonlinear relations are expected</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil"><div class="c-ceil__inner">Kendall Tau</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">When nonlinear relations are expected</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil" rowspan="2"><div class="c-ceil__inner">Interval / ratio</div></td>
                <td class="c-ceil" rowspan="2"><div class="c-ceil__inner">Ordinal</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Spearman’s Rho</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Based on ranks only, captures nonlinearities</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil"><div class="c-ceil__inner">Kendall Tau</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Like Rho, but more robust to outliers</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil" rowspan="3"><div class="c-ceil__inner">Nominal</div></td>
                <td class="c-ceil" rowspan="3"><div class="c-ceil__inner">Nominal</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Cramer’s V</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">May overestimate correlation strength</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil"><div class="c-ceil__inner">Chi-Squared</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">No info on correlation’s strength</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil"><div class="c-ceil__inner">Mutual Information</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Requires many data samples</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil" rowspan="2"><div class="c-ceil__inner">Nominal</div></td>
                <td class="c-ceil" rowspan="2"><div class="c-ceil__inner">Interval / ratio / ordinal</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">F-score</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Only captures linear relations</div></td>
            </tr>
            <tr class="c-row">
                <td class="c-ceil"><div class="c-ceil__inner">Point-biserial</div></td>
                <td class="c-ceil"><div class="c-ceil__inner">Makes strong normality assumptions</div></td>
            </tr>
        </tbody>
    </table>

</div>



<div id="separator-block_2ad1968dbc04793aeea9420d018ca642"
         class="block-separator block-separator--15">
</div>



<p class="has-text-align-center"><em>Comparison of different methods</em></p>



<h2 class="wp-block-heading" id="h-take-no-prisoners-boruta-needs-no-human-input">Take no prisoners: Boruta needs no human input</h2>



<p>When talking about feature selection, we cannot fail to mention Boruta. Back in 2010, when it was <a href="https://www.jstatsoft.org/article/view/v036i11" target="_blank" rel="noreferrer noopener nofollow">first published</a> as an R package, it quickly became famous as a revolutionary feature selection algorithm.</p>



<h3 class="wp-block-heading" id="h-why-is-boruta-a-game-changer">Why is Boruta a game-changer?</h3>



<p>All the other methods we have discussed so far require a human to make an arbitrary decision. Unsupervised methods need us to set the variance or VIF threshold for feature removal. Wrappers require us to decide on the number of features we want to keep upfront. Filters need us to choose the correlation measure and the number of features to keep as well. Embedded methods have us select regularization strength. Boruta needs none of these.</p>



<p>Boruta is a simple yet statistically elegant algorithm. It uses feature importance measures from a random forest model to select the best subset of features, and it does so via introducing two clever ideas.</p>



<ol class="wp-block-list">
<li>First, the importance scores of features are not compared to one another. Rather, the importance of each feature competes against the importance of its randomized version. To achieve this, Boruta randomly permutes each feature to construct its “shadow” version.&nbsp;</li>
</ol>



<p class="has-text-align-left">Then, a random forest is trained on the whole feature set, including the new shadow features. The maximum feature importance among the shadow features serves as a threshold. Of the original features, only those whose importance is above this threshold score a point. In other words, only features that are more important than random vectors are awarded points.&nbsp;</p>



<p>This process is repeated iteratively multiple times. Since each time the random permutation is different, the threshold also differs, and so different features might score points. After multiple iterations, each of the original features has some number of points to its name.&nbsp;</p>



<ol class="wp-block-list" start="2">
<li>The final step is to decide, based on the number of points each feature scored, whether it should be kept or discarded. Here enters the other of Boruta’s two clever ideas: we can model the scores using a <a href="https://towardsdatascience.com/6-useful-probability-distributions-with-applications-to-data-science-problems-2c0bee7cef28" target="_blank" rel="noreferrer noopener nofollow">binomial distribution</a>.</li>
</ol>



<p>Each iteration is assumed to be a separate trial. If the feature scored in a given iteration, it is a vote to keep it; if it did not, it’s a vote to discard it. A priori, we have no idea whatsoever whether a feature is important or not, so the expected percentage of trials in which the feature scores is 50%. Hence, we can model the number of points scored with a binomial distribution with p=0.5. If our feature scores significantly more times than this, it is deemed important and kept. If it scores significantly fewer times, it’s deemed unimportant and discarded. If it scores in around 50% of trials, its status is unresolved, but for the sake of being conservative, we can keep it.</p>



<p>For example, if we let Boruta run for 100 trials, the expected score of each feature would be 50. If it’s closer to zero, we discard it, if it’s closer to 100, we keep it.</p>
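<p>The mechanics described above can be sketched in a few lines. This is a deliberately simplified toy version of Boruta’s core loop (shadow features, per-iteration thresholding, and the binomial test), using only numpy, scipy, and scikit-learn; the dataset and the number of trials are illustrative, and the BorutaPy package implements many additional refinements:</p>

```python
import numpy as np
from scipy.stats import binomtest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset: with shuffle=False, the informative features are the first 3 columns
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

n_trials = 20
hits = np.zeros(X.shape[1], dtype=int)
rng = np.random.default_rng(0)

for _ in range(n_trials):
    shadows = rng.permuted(X, axis=0)  # permute each column: the "shadow" features
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(np.hstack([X, shadows]), y)
    imp = rf.feature_importances_
    threshold = imp[X.shape[1]:].max()    # best shadow importance = threshold
    hits += imp[:X.shape[1]] > threshold  # a point for beating every shadow

# Keep features that score significantly more often than a coin flip (p = 0.5)
keep = [f for f in range(X.shape[1])
        if binomtest(int(hits[f]), n_trials, 0.5, alternative="greater").pvalue < 0.05]
print("kept features:", keep)
```

<p>On this toy data, only (a subset of) the three informative columns survive the binomial test, while the noise features score roughly as often as their shadows and are discarded.</p>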


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/feature-selection-methods-4.png?ssl=1"><img data-recalc-dims="1" loading="lazy" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/feature-selection-methods-4.png?resize=512%2C419&#038;ssl=1" alt="Graph with example of Boruta" class="wp-image-71271" width="512" height="419"/></a><figcaption class="wp-element-caption"><em>Boruta example | Source: author&nbsp;</em></figcaption></figure>
</div>


<p>Boruta has proven very successful in many Kaggle competitions and is always worth trying out. It has also been successfully used for <a href="https://www.mdpi.com/1996-1073/14/10/2779" target="_blank" rel="noreferrer noopener nofollow">predicting energy consumption for building heating</a> or <a href="https://www.researchgate.net/publication/353955153_An_application_of_Machine_learning_with_Boruta_Feature_selection_to_Improve_NO2_pollution_prediction" target="_blank" rel="noreferrer noopener nofollow">predicting air pollution</a>.</p>



<p>There is a very intuitive Python package to implement Boruta, called <a href="https://github.com/scikit-learn-contrib/boruta_py" target="_blank" rel="noreferrer noopener nofollow">BorutaPy</a> (now part of scikit-learn-contrib). The package’s GitHub readme demonstrates how easy it is to run feature selection with Boruta.</p>



<h2 class="wp-block-heading" id="h-which-feature-selection-method-to-choose-build-yourself-a-voting-selector">Which feature selection method to choose? Build yourself a voting selector</h2>



<p>We have discussed many different feature selection methods. Each of them has its own strengths and weaknesses, makes its own assumptions, and arrives at its conclusions in a different fashion. Which one should you choose? Do you have to choose at all? In many cases, combining these different methods under one roof makes the resulting feature selector stronger than each of its parts.</p>



<h3 class="wp-block-heading" id="h-the-inspiration">The inspiration</h3>



<p>One way to do it is inspired by ensembles of decision trees. In this class of models, which includes random forests and many popular gradient boosting algorithms, one trains multiple different models and lets them vote on the final prediction. In a similar spirit, we can build ourselves a voting selector.</p>



<p>The idea is simple: implement a handful of the feature selection methods we have discussed. Your choice can be guided by your time, computational resources, and the measurement levels of your data. Run as many different methods as you can conveniently afford. Then, for each feature, record the percentage of selection methods that suggest keeping it in the data set. If more than 50% of the methods vote to keep the feature, keep it; otherwise, discard it.</p>



<p>The idea behind this approach is that while some methods might make wrong judgments with regard to some of the features due to their intrinsic biases, the ensemble of methods should get the set of useful features right. Let’s see how to implement it in practice!</p>



<h3 class="wp-block-heading" id="h-the-implementation">The implementation</h3>



<p>Let’s build a simple voting selector that ensembles three different feature selection methods:</p>



<div id="case-study-numbered-list-block_2b396f4bd4af2136d90834c00c4b4140"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                A filter method based on Pearson correlation.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                An unsupervised method based on multicollinearity.<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                A wrapper, Recursive Feature Elimination.             </li>
            </ul>
</div>



<p>Let’s take a look at what such a voting selector might look like.&nbsp;</p>



<p>Making the imports.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> itertools <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> compress

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pandas <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> pd
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.feature_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> RFE, r_regression, SelectKBest
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.svm <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> SVR
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> statsmodels.stats.outliers_influence <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> variance_inflation_factor</pre>



<p>Next, our VotingSelector class comprises four methods on top of the __init__ constructor. Three of them implement the three feature selection techniques we would like to ensemble:</p>



<div id="case-study-numbered-list-block_3b5b3b004fca91d2fa65d3f1150e9633"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                 _select_pearson() for Pearson correlation filtering<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                 _select_vif() for Variance Inflation Factor-based unsupervised approach<br />
            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                 _select_rfe() for the Recursive Feature Elimination (RFE) wrapper            </li>
            </ul>
</div>



<p>Each of these methods takes the feature matrix X and the targets y as inputs. The VIF-based method will not use the targets, but we use this argument anyway to keep the interface consistent across all methods so that we can conveniently call them in a loop later. On top of that, each method accepts a keyword arguments dictionary which we will use to pass method-dependent parameters. Having parsed the inputs, each method calls the appropriate sklearn or statsmodels functions which we have discussed before, to return the list of feature names to keep.</p>



<p>The voting magic happens in the select() method. There, we simply iterate over the three selection methods, and for each feature, we record whether it should be kept (1) or discarded (0) according to this method. Finally, we take the mean over these votes. For each feature, if this mean is greater than the voting threshold of 0.5 (which means that at least two out of three methods voted to keep a feature), we keep it.&nbsp;</p>



<p>Here is the code for the entire class.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">VotingSelector</span><span class="hljs-params">()</span>:</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self)</span>:</span>
       self.selectors = {
           <span class="hljs-string" style="color: rgb(221, 17, 68);">"pearson"</span>: self._select_pearson,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">"vif"</span>: self._select_vif,
           <span class="hljs-string" style="color: rgb(221, 17, 68);">"rfe"</span>: self._select_rfe,
       }
       self.votes = <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span>

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">   @staticmethod</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">_select_pearson</span><span class="hljs-params">(X, y, **kwargs)</span>:</span>
       selector = SelectKBest(r_regression, k=kwargs.get(<span class="hljs-string" style="color: rgb(221, 17, 68);">"n_features_to_select"</span>, <span class="hljs-number" style="color: teal;">5</span>)).fit(X, y)
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> selector.get_feature_names_out()

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">   @staticmethod</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">_select_vif</span><span class="hljs-params">(X, y, **kwargs)</span>:</span>
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> [
           X.columns[feature_index]
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> feature_index <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(len(X.columns))
           <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> variance_inflation_factor(X.values, feature_index) &lt;= kwargs.get(<span class="hljs-string" style="color: rgb(221, 17, 68);">"vif_threshold"</span>, <span class="hljs-number" style="color: teal;">10</span>)
       ]

<span class="hljs-meta" style="font-weight: 700; color: rgb(153, 153, 153);">   @staticmethod</span>
   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">_select_rfe</span><span class="hljs-params">(X, y, **kwargs)</span>:</span>
       svr = SVR(kernel=<span class="hljs-string" style="color: rgb(221, 17, 68);">"linear"</span>)
       rfe = RFE(svr, n_features_to_select=kwargs.get(<span class="hljs-string" style="color: rgb(221, 17, 68);">"n_features_to_select"</span>, <span class="hljs-number" style="color: teal;">5</span>))
       rfe.fit(X, y)
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> rfe.get_feature_names_out()

   <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">select</span><span class="hljs-params">(self, X, y, voting_threshold=<span class="hljs-number" style="color: teal;">0.5</span>, **kwargs)</span>:</span>
       votes = []
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> selector_name, selector_method <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> self.selectors.items():
           features_to_keep = selector_method(X, y, **kwargs)
           votes.append(
               pd.DataFrame([int(feature <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> features_to_keep) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> feature <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> X.columns]).T
           )
       self.votes = pd.concat(votes)
       self.votes.columns = X.columns
       self.votes.index = self.selectors.keys()
       features_to_keep = list(compress(X.columns, self.votes.mean(axis=<span class="hljs-number" style="color: teal;">0</span>) &gt; voting_threshold))
       <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> X[features_to_keep]

</pre>



<p>Let’s see it working in practice. We will load the infamous Boston Housing data via scikit-learn’s load_boston. (Note that this loader was deprecated in scikit-learn 1.0 and removed in 1.2; with a recent version, you would need to fetch the data from its original source or substitute another regression dataset, such as the California Housing data.)</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_boston
boston = load_boston()
X = pd.DataFrame(boston[<span class="hljs-string" style="color: rgb(221, 17, 68);">"data"</span>], columns=boston[<span class="hljs-string" style="color: rgb(221, 17, 68);">"feature_names"</span>])
y = boston[<span class="hljs-string" style="color: rgb(221, 17, 68);">"target"</span>]

</pre>



<p>Now, running feature selection is as easy as this:</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">vs = VotingSelector()
X_selection = vs.select(X, y)</pre>



<p>As a result, we get the feature matrix with only three features left.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">      ZN  CHAS     RM
<span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">18.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.575</span>
<span class="hljs-number" style="color: teal;">1</span>     <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.421</span>
<span class="hljs-number" style="color: teal;">2</span>     <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">7.185</span>
<span class="hljs-number" style="color: teal;">3</span>     <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.998</span>
<span class="hljs-number" style="color: teal;">4</span>     <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">7.147</span>
..    ...   ...    ...
<span class="hljs-number" style="color: teal;">501</span>   <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.593</span>
<span class="hljs-number" style="color: teal;">502</span>   <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.120</span>
<span class="hljs-number" style="color: teal;">503</span>   <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.976</span>
<span class="hljs-number" style="color: teal;">504</span>   <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.794</span>
<span class="hljs-number" style="color: teal;">505</span>   <span class="hljs-number" style="color: teal;">0.0</span>   <span class="hljs-number" style="color: teal;">0.0</span>  <span class="hljs-number" style="color: teal;">6.030</span>
[<span class="hljs-number" style="color: teal;">506</span> rows x <span class="hljs-number" style="color: teal;">3</span> columns]
</pre>



<p>We can also glance at how each of our methods has voted by printing <em>vs.votes</em>.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">        CRIM  ZN  INDUS  CHAS  NOX  RM  AGE  DIS  RAD  TAX  PTRATIO  B  LSTAT
pearson     <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">1</span>      <span class="hljs-number" style="color: teal;">0</span>     <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>        <span class="hljs-number" style="color: teal;">0</span>  <span class="hljs-number" style="color: teal;">1</span>      <span class="hljs-number" style="color: teal;">0</span>
vif         <span class="hljs-number" style="color: teal;">1</span>   <span class="hljs-number" style="color: teal;">1</span>      <span class="hljs-number" style="color: teal;">0</span>     <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>        <span class="hljs-number" style="color: teal;">0</span>  <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">0</span>
rfe         <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">0</span>     <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">1</span>   <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>        <span class="hljs-number" style="color: teal;">1</span>  <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">1</span></pre>



<p>We might not be happy with only 3 out of the initial 13 columns left. Luckily, we can easily make the selection less restrictive by modifying the parameters of the individual methods. This is done by simply adding the appropriate arguments to the call to select(), thanks to how we pass kwargs around.</p>



<p>The Pearson and RFE methods need a pre-defined number of features to keep. The default has been 5, but we might want to increase it to 8. We can also modify the VIF threshold, that is, the value of the Variance Inflation Factor above which we discard a feature due to multicollinearity. By convention, this threshold is set at 10, but increasing it to, say, 15 results in more features being kept.</p>
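<p>As a hedged aside on what this threshold measures: for just two correlated features, the VIF of either one reduces to 1 / (1 &#8722; r&#178;), where r is their Pearson correlation. The tiny helper below is not part of statsmodels; it only illustrates why the conventional cutoff matters:</p>

```python
def vif_for_pair(r: float) -> float:
    # VIF for either of two features whose Pearson correlation is r.
    return 1.0 / (1.0 - r ** 2)

vif_for_pair(0.9)   # ~5.3: kept under both a threshold of 10 and of 15
vif_for_pair(0.97)  # ~16.9: dropped at a threshold of 10, kept at 15
```

So raising the threshold from 10 to 15 tolerates noticeably stronger collinearity before a feature is voted out.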



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">vs = VotingSelector()
X_selection = vs.select(X, y, n_features_to_select=<span class="hljs-number" style="color: teal;">8</span>, vif_threshold=<span class="hljs-number" style="color: teal;">15</span>)</pre>



<p>This way, we have seven features left.</p>



<pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">        CRIM  ZN  INDUS  CHAS  NOX  RM  AGE  DIS  RAD  TAX  PTRATIO  B  LSTAT
pearson     <span class="hljs-number" style="color: teal;">1</span>   <span class="hljs-number" style="color: teal;">1</span>      <span class="hljs-number" style="color: teal;">0</span>     <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>        <span class="hljs-number" style="color: teal;">0</span>  <span class="hljs-number" style="color: teal;">1</span>      <span class="hljs-number" style="color: teal;">0</span>
vif         <span class="hljs-number" style="color: teal;">1</span>   <span class="hljs-number" style="color: teal;">1</span>      <span class="hljs-number" style="color: teal;">1</span>     <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>   <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>        <span class="hljs-number" style="color: teal;">0</span>  <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">1</span>
rfe         <span class="hljs-number" style="color: teal;">1</span>   <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">1</span>     <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">1</span>   <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">1</span>    <span class="hljs-number" style="color: teal;">0</span>    <span class="hljs-number" style="color: teal;">0</span>        <span class="hljs-number" style="color: teal;">1</span>  <span class="hljs-number" style="color: teal;">0</span>      <span class="hljs-number" style="color: teal;">1</span></pre>



<p>Our VotingSelector class is a simple but generic template which you can extend to an arbitrary number of feature selection methods. As a possible extension, you could also treat all the arguments passed to select() as hyperparameters of your modeling pipeline and optimize them so as to maximize the performance of the downstream model.</p>
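<p>The tuning extension mentioned above can be sketched as a plain grid search. In this minimal sketch, <em>score_fn</em> is a hypothetical callable standing in for &#8220;select features with these parameters, then cross-validate the downstream model and return its score&#8221;; it is an assumption for illustration, not part of the class:</p>

```python
from itertools import product

def tune_selector(param_grid: dict, score_fn):
    # Exhaustively try every combination of selector hyperparameters
    # and keep the one with the best downstream-model score.
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"n_features_to_select": [5, 8], "vif_threshold": [10, 15]}
```

With VotingSelector, score_fn would call vs.select(X, y, **params) and evaluate the model on the returned columns.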



<h2 class="wp-block-heading" id="h-feature-selection-at-big-tech">Feature selection at Big Tech</h2>



<p>Large technology companies, such as the GAFAM group and their peers, with their thousands of machine learning models in production, are prime examples of how feature selection is done in the wild. Let’s see what these tech giants have to say about it!</p>



<h3 class="wp-block-heading" id="h-google">Google</h3>



<p><a href="https://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf" target="_blank" rel="noreferrer noopener nofollow">Rules of ML</a> is a handy compilation of best practices in machine learning from around Google. In it, Google’s engineers point out that the number of parameters the model can learn is roughly proportional to the amount of data it has access to. Hence, the less data we have, the more features we need to discard. Their rough guidelines (derived from text-based models) are to use a dozen features with 1,000 training examples or 100,000 features with 10 million training examples.&nbsp;</p>
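<p>Interestingly, both anchor points sit close to one feature per ~100 training examples. The helper below is our own back-of-the-envelope interpolation of the guideline, not a formula from the document:</p>

```python
def rough_feature_budget(n_examples: int) -> int:
    # Roughly one feature per 100 training examples,
    # interpolating Google's two anchor points.
    return max(1, n_examples // 100)

rough_feature_budget(1_000)       # 10, close to the suggested "dozen"
rough_feature_budget(10_000_000)  # 100_000, matching the guideline
```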



<p>Another crucial point in the document concerns model deployment issues, which can also affect feature selection.&nbsp;</p>



<ul class="wp-block-list">
<li>First, your set of features to select from might be constrained by what will be available in production at inference time. You may be forced to drop a great feature from training if it isn’t there for the model when it goes live.&nbsp;</li>
</ul>



<ul class="wp-block-list">
<li>Second, some features might be prone to <a href="https://towardsdatascience.com/dont-let-your-model-s-quality-drift-away-53d2f7899c09" target="_blank" rel="noreferrer noopener nofollow">data drift</a>. While the topic of tackling drift is a complex one, sometimes the best solution might be to remove the problematic feature from the model altogether.</li>
</ul>



<h3 class="wp-block-heading" id="h-facebook">Facebook</h3>



<p>A couple of years ago, in 2019, Facebook came up with its own feature selection algorithm suited to neural networks, designed to save computational resources while training large-scale models. They tested the algorithm on their own Facebook News Feed dataset, ranking relevant items as efficiently as possible while working with a lower-dimensional input. You can read all about it <a href="https://research.facebook.com/publications/feature-selection-for-facebook-feed-ranking-system-via-a-group-sparsity-regularized-training-algorithm/" target="_blank" rel="noreferrer noopener nofollow">here</a>.</p>



<h2 class="wp-block-heading" id="h-parting-words">Parting words</h2>



<p>Thanks for reading till the end! I hope this article convinced you that feature selection is a crucial step in the data preparation pipeline and gave you some guidance as to how to approach it.&nbsp;</p>



<p>Don’t hesitate to hit me up on social media to discuss the topics covered here or any other machine learning topics, for that matter. Happy feature selection!</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://scikit-learn.org/stable/modules/feature_selection.html" target="_blank" rel="noreferrer noopener nofollow">Scikit-learn documentation on feature selection</a></li>



<li><a href="https://github.com/scikit-learn-contrib/boruta_py/blob/master/README.md" target="_blank" rel="noreferrer noopener nofollow">Boruta_py’s GitHub README</a></li>



<li><a href="https://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf" target="_blank" rel="noreferrer noopener nofollow">Rules of Machine Learning: Best Practices for ML Engineering</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7238</post-id>	</item>
	</channel>
</rss>
