<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paper Reflections - neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/category/paper-reflections/feed" rel="self" type="application/rss+xml" />
	<link>https://neptune.ai/blog/category/paper-reflections</link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Wed, 30 Jul 2025 11:17:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Paper Reflections - neptune.ai</title>
	<link>https://neptune.ai/blog/category/paper-reflections</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>SabiYarn: Advancing Low-Resource Languages With Multitask NLP Pre-Training [Paper Reflections]</title>
		<link>https://neptune.ai/blog/sabiyarn-advancing-low-resource-languages-with-multitask-nlp-pretraining</link>
		
		<dc:creator><![CDATA[Oduguwa Damilola]]></dc:creator>
		<pubDate>Fri, 01 Aug 2025 11:30:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Paper Reflections]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=47866</guid>

					<description><![CDATA[In recent years, Large Language Models (LLMs) have mostly improved by scaling. This has primarily involved increasing the size of the LLMs and the data they are trained on, resulting in a highly resource-intensive process that can cost up to millions of dollars. While LLMs have become ubiquitous, the resource-intensive pre-training process poses a threat&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In recent years, Large Language Models (LLMs) have mostly improved by scaling. This has primarily involved <a href="/state-of-foundation-model-training-report#h-scaling" target="_blank" rel="noreferrer noopener">increasing the size of the LLMs and the data they are trained on</a>, resulting in a highly resource-intensive process that can cost up to millions of dollars.</p>



<p>While LLMs have become ubiquitous, the resource-intensive pre-training process poses a threat to the inclusion of low-resource languages, where data is scarce. Often, this is accompanied by a lack of funding for compute resources.</p>



<p>In our paper, <a href="https://aclanthology.org/2025.africanlp-1.14/" target="_blank" rel="noreferrer noopener nofollow"><em><strong>SabiYarn: Advancing Low-Resource Languages With Multitask NLP Pre-Training</strong></em></a>, which was accepted at the <a href="https://sites.google.com/view/africanlp2025/home" target="_blank" rel="noreferrer noopener nofollow">AfricaNLP workshop</a> at <a href="https://2025.aclweb.org/" target="_blank" rel="noreferrer noopener nofollow">ACL 2025</a>, we propose a series of optimizations to the LLM pre-training process that made it possible to train a SOTA multilingual foundation model on Nigerian languages on a single 24 GB GPU.</p>



<p>One of these techniques is a mask-based loss computation strategy. This simple idea avoids computing the loss on input prompt tokens that are already given to the model. The loss function then accurately reflects the model’s true performance on the tokens that matter, and we avoid wasting compute on backpropagating losses that do not contribute to the model’s learning process.</p>



<p>In this article, we’ll explore this approach, how it reflects our broader compute-aware pre-training design, and how it influences the model’s performance.</p>



<h2 class="wp-block-heading" id="h-prompt-tokens-are-too-expensive-in-low-resource-settings">Prompt tokens are (too) expensive in low-resource settings</h2>



<p>During pre-training, LLMs are trained with a causal language modeling objective, i.e., a next-token prediction task. This is typically a slow process involving trillions of tokens, whose goal is to reduce the cross-entropy loss between the predicted tokens and the labels through backpropagation. Along the way, the model acquires multiple skills, memorizes facts, and builds a world model.</p>
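

<p>To make this concrete, here is a minimal sketch of the standard causal language modeling loss in PyTorch. The shapes and the random tensors standing in for model outputs are purely illustrative:</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 4, 128, 32000

# Stand-ins for a model's output logits and the tokenized input.
logits = torch.randn(batch_size, seq_len, vocab_size)
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Next-token prediction: the logits at position t are scored against token t+1.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)</code></pre>
</div>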



<p>For state-of-the-art models like Meta’s <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/" target="_blank" rel="noreferrer noopener nofollow">Llama 4</a> or OpenAI’s <a href="https://openai.com/index/gpt-4/" target="_blank" rel="noreferrer noopener nofollow">GPT-4</a>, this computationally intensive process typically involves running thousands of GPUs for months, <a href="https://epoch.ai/data-insights/models-over-1e25-flop" target="_blank" rel="noreferrer noopener nofollow">performing over 10<sup>25</sup> floating-point operations (FLOP)</a>.</p>



<p>Let’s look at a concrete example. Given a sequence like <em>“Translate English to Yoruba: I love rice. =&gt; Mo fẹ́ràn ìrẹsì,”</em> the model is trained to predict every token, from the prompt to the actual answer:</p>



<div id="medium-table-block_572dfada8da30393d14061db5e9e390c"
     class="block-medium-table c-table__outer-wrapper  aligncenter l-padding__top--0 l-padding__bottom--standard l-margin__top--0 l-margin__bottom--0">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Step                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Prompt                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Next token                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>1</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">English</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Static prompt</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>2</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>to</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Static prompt</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>3</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Yoruba:</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Static prompt</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>4</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba:</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>I</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>5</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba: I</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>love</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>6</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba: I love</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">rice.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>7</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba: I love rice.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">-&gt;</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Static prompt</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>8</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba: I love rice. -&gt;</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Mo</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>9</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba: I love rice. -&gt; Mo </span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">fẹ́ràn</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>10</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Translate English to Yoruba: I love rice. -&gt; Mo </span><span style="font-weight: 400;">fẹ́ràn</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">iresi.</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                                                </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>In this setup, all tokens are treated equally, regardless of whether they are part of the prompt or the answer. On the one hand, this is straightforward to set up. On the other hand, it means spending compute on learning to predict tokens that are already known and static.</p>



<p>While this is fine in settings with virtually unlimited compute, it becomes problematic in resource-constrained training. Every token prediction contributes to the total training FLOPs. If half the sequence is an instruction or prompt that never changes, that’s half your compute spent on learning what the model doesn’t need to learn.</p>



<h2 class="wp-block-heading" id="h-making-do-without-instruction-tuning">Making do without instruction-tuning</h2>



<p>Due to severe compute constraints, we could not include a post-training stage where models are typically aligned with user-facing goals using supervised examples and <a href="/blog/reinforcement-learning-from-human-feedback-for-llms" target="_blank" rel="noreferrer noopener">reinforcement learning from human feedback (RLHF)</a>. In such stages, models learn not just to predict the next token but to generate helpful and aligned responses.</p>



<p>For example, a pre-trained base model may reply to <em>“How are you today”</em> with <em>“?”</em>, completing the sequence with the most likely next token. In contrast, an instruction-tuned model would try to provide a response that aligns with the goal of being a useful assistant or chatbot, e.g., <em>“I’m doing good.”</em></p>



<p>Since post-training wasn’t feasible for SabiYarn, we embedded task awareness directly into the pre-training phase. Our goal was to help the model generalize beyond basic next-token prediction and toward solving meaningful tasks like named-entity recognition, sentiment analysis, and translation entirely through prompt-based conditioning.</p>



<p>In our <a href="https://drive.google.com/file/d/1wkWdXucSYE0hwGxowkzO3iJiTDR5qayP/view?usp=sharing">paper</a>, we propose a task-specific training scheme where the model is conditioned on the task it must perform using XML-like prompt tags. Taking inspiration from the <a href="https://arxiv.org/abs/1910.10683" target="_blank" rel="noreferrer noopener nofollow">T5 paper</a>, we used the following template:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>&lt;task_tag&gt; model_input &lt;closing_tag&gt; Model’s output.</code></pre>
</div>




<p>For example, an English-to-Pidgin translation task looks like this:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>&lt;translate&gt; let me call my father &lt;pcm&gt; : Make I go call my Papa</code></pre>
</div>




<p>With this structured format, we were able to calculate the cross-entropy loss on just the label tokens (<em>“Make I go call my Papa”</em>).</p>



<p>This is straightforward to implement in PyTorch by masking out the prompt tokens in the label tensor. We use <span class="c-code-snippet">-100</span> as the ignore index, which PyTorch’s <span class="c-code-snippet">cross_entropy</span> loss function skips:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>labels = input_ids.clone()
labels[:, :prompt_len] = -100</code></pre>
</div>




<p>Since PyTorch’s cross-entropy loss function ignores the label value -100 by default, the prompt tokens are skipped when calculating the loss for that sequence.</p>
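

<p>Putting the pieces together, a minimal sketch of the full masked loss computation could look like this. The function is illustrative, not our exact training code; <span class="c-code-snippet">prompt_len</span> marks the boundary between prompt and label tokens, and the shift follows the usual next-token-prediction convention:</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>import torch
import torch.nn.functional as F

def masked_lm_loss(logits, input_ids, prompt_len):
    # Copy the inputs and mask all prompt positions with the ignore index.
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100

    # Shift so that the logits at position t are scored against token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    # Positions labeled -100 are skipped (ignore_index=-100 is the default).
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )</code></pre>
</div>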



<h2 class="wp-block-heading" id="h-learning-only-what-matters">Learning only what matters</h2>



<p>An unexpected benefit of this approach is improved task focus. Since the model does not backpropagate on the input portion of the sequence, its learning signal comes exclusively from task-relevant tokens.</p>



<p>Consider a pre-training scenario where an LLM is presented with:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>&lt;translate&gt; let me call my father &lt;pcm&gt; : Make I go call my Papa</code></pre>
</div>




<p>When the loss is computed on every token, the model learns to reproduce the prompt structure and memorize the task tags in addition to generating the output. The learning signal is diluted across the entire sequence.</p>



<p>Using loss masking, the model can still make input-output connections through the self-attention mechanism during the forward pass. However, backpropagation (learning) only occurs when predicting the output tokens:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code>Make I go call my Papa</code></pre>
</div>




<p>We can compare this to how we as humans learn to translate into a new language: we receive the full input as context, but learning occurs when we’re corrected on our translation, not on the input sentence already provided to us.</p>



<p>Masking out the input forces the model to treat prompts as context rather than a prediction target, allowing training to focus on input-output mappings and reducing the tendency to overfit on prompt formatting.</p>



<h2 class="wp-block-heading" id="h-investigating-the-impact-of-task-focus-on-training-performance">Investigating the impact of task focus on training performance</h2>



<p>To substantiate this finding, we ran an experiment in which we trained the model on the non-trivial problem of descrambling sentences, once with the masked loss scheme and once with a non-masked loss for comparison.</p>



<p>The task was to turn grammatically incoherent sentences into their coherent forms using the same words as the input. For example, <em>“The equations expensive. <strong>show is</strong> optimization computationally that.”</em> should be corrected to <em>“The equations <strong>show</strong> that optimization <strong>is</strong> computationally expensive.”</em> This task requires learning complex relationships between input and output sequences.</p>
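

<p>As a hypothetical illustration of how such a descrambling pair could be constructed (the tag names are placeholders, not our exact format):</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>import random

def make_descrambling_pair(sentence: str) -> str:
    words = sentence.split()
    scrambled = words[:]
    random.shuffle(scrambled)
    # Prompt (masked in the loss) followed by the coherent target sentence.
    return f"&lt;descramble&gt; {' '.join(scrambled)} &lt;end&gt; : {sentence}"</code></pre>
</div>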



<p>Here’s what the loss curves looked like:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1920" height="1009" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=1920%2C1009&#038;ssl=1" alt="Loss curves" class="wp-image-47889" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=1920%2C1009&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=768%2C404&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=1536%2C808&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=220%2C116&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=300%2C158&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=480%2C252&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?resize=1020%2C536&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/07/Advancing-Low-Resource-Languages-with-Multitask-NLP-Pretraining-loss-curves.png?w=1999&amp;ssl=1 1999w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p>We can see that the model converged faster on the task when the loss on the input prompt wasn’t computed. These efficiency gains compound over the entire training run.</p>



<h2 class="wp-block-heading" id="h-the-cost-of-masking-what-are-we-losing">The cost of masking: what are we losing?</h2>



<p>While masking the prompt tokens during loss computation helps conserve compute and sharpen focus, it’s not without tradeoffs. Excluding the prompts from the learning signal increases the risk that the model will fail to adapt to tasks where the prompt structure or phrasing changes at inference time.</p>



<p>That said, such tradeoffs must be weighed against the reality of resource constraints. In low-resource training scenarios, approaches that reduce compute while preserving core task performance are often preferable to fully supervised, resource-intensive alternatives.</p>



<h2 class="wp-block-heading" id="h-the-case-for-native-llms-for-african-languages">The case for native LLMs for African languages</h2>



<p>While the broader African LLM community has focused its efforts on adapting open-source pre-trained models to African languages, pre-training a foundation model from scratch offers the promise of building a model that doesn’t inherit the cultural biases of Euro-American corpora. It also provides invaluable research insights and data about tokenization, transfer learning, linguistic patterns, and training dynamics for African languages.</p>



<p>An often neglected area is the tokenizer. Tokenizers determine how languages are broken into the tokens that LLMs operate on. Training from scratch enables us to train our own language-specific tokenizers that capture morphological and phonological structure, such as the tonal diacritics in Yoruba, which carry semantic meaning.</p>



<p>It also helps with efficiency: we obtain a tokenizer that splits each language into tokens reflecting useful grammatical structures, such as affixes and punctuation, which the model can use to learn meaningful representations. In contrast, using an existing tokenizer that was not trained on the target languages leads to poor tokenization: tokens that don’t reflect grammatical structure, inflated sequence lengths, and ultimately degraded performance. This is especially true for small models, which are appealing due to their lower compute demands.</p>
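

<p>For illustration, a language-specific BPE tokenizer can be trained in a few lines with the Hugging Face <span class="c-code-snippet">tokenizers</span> library. This is a minimal sketch; the corpus file, vocabulary size, and special tokens are assumptions for the example:</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# NFC normalization keeps combining diacritics (e.g., Yoruba tonal marks) intact.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # illustrative size
    special_tokens=["[UNK]", "&lt;translate&gt;", "&lt;pcm&gt;"],  # task tags as single tokens
)
tokenizer.train(files=["yoruba_corpus.txt"], trainer=trainer)  # hypothetical corpus
tokenizer.save("tokenizer.json")</code></pre>
</div>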



<p>Looking forward, our research group’s future work focuses on exploring modern LLM architectures and bringing reasoning, instruction following, and test-time compute strategies to resource-constrained pre-training. We’re also exploring hardware-specific optimizations in training and inference and expanding our efforts to even more African languages.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">47866</post-id>	</item>
		<item>
		<title>STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning [Paper Reflection]</title>
		<link>https://neptune.ai/blog/stun-structured-then-unstructured-pruning-for-scalable-moe-pruning</link>
		
		<dc:creator><![CDATA[Seung-won Hwang]]></dc:creator>
		<pubDate>Thu, 05 Jun 2025 11:30:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Paper Reflections]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=47194</guid>

					<description><![CDATA[Mixture-of-Experts (MoEs) architectures offer a promising solution by sparsely activating specific parts of the model, reducing the inference overhead. However, even with MoEs, the sheer number of parameters and experts makes deployment and serving costly. Pruning is an established method to reduce the number of parameters of a trained model while maintaining its task performance.&#8230;]]></description>
										<content:encoded><![CDATA[
<p><a href="/blog/mixture-of-experts-llms" target="_blank" rel="noreferrer noopener">Mixture-of-Experts (MoEs) architectures</a> offer a promising solution by sparsely activating specific parts of the model, reducing the inference overhead. However, even with MoEs, the sheer number of parameters and experts makes deployment and serving costly.</p>



<p>Pruning is an established method to reduce the number of parameters of a trained model while maintaining its task performance. Typically, we distinguish two kinds of approaches: unstructured pruning removes individual weights, while structured pruning removes entire model components.</p>



<p>Because MoEs have a clear modular structure, structured pruning seems to be an ideal match for them. By removing redundant experts, we can shrink the total model size. However, current approaches to expert pruning require many forward passes, whose number grows exponentially with the number of experts. Further, structured pruning does not reduce the number of active weights during inference.</p>
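

<p>To make the distinction concrete, here is a minimal, framework-agnostic sketch of both ideas in PyTorch. The magnitude threshold and the list-based expert container are illustrative, not the method proposed in our paper:</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>import torch

def prune_unstructured(weight: torch.Tensor, threshold: float = 1e-2) -> torch.Tensor:
    # Unstructured: zero out individual weights with small magnitude.
    return weight * (weight.abs() >= threshold)

def prune_experts(experts: list, keep_indices: list) -> list:
    # Structured: drop entire components, e.g., whole experts of an MoE layer.
    return [experts[i] for i in keep_indices]</code></pre>
</div>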



<p>In our paper <a href="https://arxiv.org/pdf/2409.06211" target="_blank" rel="noreferrer noopener nofollow"><em>STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning</em></a>, which was accepted for presentation at <a href="https://2025.aclweb.org/" target="_blank" rel="noreferrer noopener nofollow">ACL 2025</a>, we combine the two classes of pruning methods and introduce an approach that works exceptionally well for MoEs with over 100 experts. In a nutshell, STUN first removes redundant experts and then performs unstructured pruning inside individual experts.</p>



<h2 class="wp-block-heading" id="h-scaling-barriers-for-mixture-of-expert-models">Scaling barriers for Mixture of Expert models</h2>



<p>MoEs are an effective technique to increase the total number of model parameters while keeping computational demands in check. By dividing the model into specialized structures, called experts, and selectively activating them based on the input, MoEs achieve efficiency gains in training and inference.</p>



<p>More experts allow the model to capture a broader range of representations and specializations, improving performance on diverse tasks or complex data. Unsurprisingly, we see a clear trend towards an increased number of experts in MoEs. To illustrate this evolution, Mistral’s <a href="https://mistral.ai/news/mixtral-of-experts" target="_blank" rel="noreferrer noopener nofollow">Mixtral 8x7B</a> (December 2023) builds on eight experts, Databricks’ <a href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm" target="_blank" rel="noreferrer noopener nofollow">DBRX</a> (March 2024) on 16, and Snowflake’s <a href="https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/" target="_blank" rel="noreferrer noopener nofollow">Arctic</a> (April 2024) uses 128 experts.</p>





<p>However, as models scale further, the efficiency gains provided by the MoE architecture alone are insufficient. Here, pruning becomes essential, refining the architecture by removing redundant parameters without compromising overall performance. Combining MoEs with pruning techniques can optimize inference speed and memory consumption, making it a promising direction for further scaling models.</p>



<h2 class="wp-block-heading" id="h-solving-the-exponential-scaling-challenge-in-structured-moe-pruning">Solving the exponential scaling challenge in structured MoE pruning</h2>



<p>Structured pruning removes specific patterns, such as rows or entire weight tensors. In the context of MoEs, the expert modules that emerge from training correspond to such patterns, which makes pruning experts a natural fit for structured pruning.</p>



<p>While an increase from 8 to 128 experts may seem modest, it renders current pruning methods unviable. Roughly speaking, they take a “combinatorial” approach to determining which structures to remove, enumerating all possible subsets of experts to find the optimal configuration. To illustrate, when the number of experts increases from 8 to 128, the forward passes of combinatorial pruning algorithms grow exponentially, from 70 to 2.4 × 10³⁷.</p>



<p>In contrast, STUN leverages the behavioral similarity between experts to make informed pruning decisions. Specifically, it first identifies clusters of behaviorally similar experts. We can determine this similarity at minimal cost by inspecting the model’s weights: if two weight rows have similar values, this suggests a high pairwise similarity between the corresponding experts. Such an expert pair tends to activate on similar inputs and exhibit similar outputs, thus forming a cluster.</p>



<p>By pruning all but one representative expert from each cluster, STUN effectively reduces the model size while preserving its overall functionality. This approach drastically reduces the exponential complexity of exhaustively enumerating combinations to constant O(1), making it highly scalable for massive MoEs.</p>
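

<p>As a rough sketch of this idea (a greedy variant for illustration, not our exact algorithm), one could compare experts by the cosine similarity of their flattened weights and keep one representative per group of near-duplicates:</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>import torch

def select_representative_experts(expert_weights, sim_threshold=0.95):
    # expert_weights: list of equally sized 1-D tensors (flattened expert weights).
    flat = torch.stack([w / w.norm() for w in expert_weights])
    keep = []
    for i in range(len(flat)):
        # Keep expert i only if it is not near-identical to an already kept one.
        if all(torch.dot(flat[i], flat[j]).item() &lt; sim_threshold for j in keep):
            keep.append(i)
    return keep</code></pre>
</div>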



<h2 class="wp-block-heading" id="h-exploring-the-potential-of-a-two-phase-approach-to-moe-pruning">Exploring the potential of a two-phase approach to MoE pruning</h2>



<p>A key question in our research was: How much can we gain from an additional unstructured pruning phase? After we remove all redundant experts, there might be less “margin” for further pruning compared to a scenario where we exclusively apply unstructured pruning.</p>



<p>We can quantify this margin as the <a href="https://en.wikipedia.org/wiki/Kurtosis" target="_blank" rel="noreferrer noopener nofollow">kurtosis</a> of the model weights’ distribution, colloquially known as its &#8220;tailedness.&#8221; As unstructured pruning removes near-zero weights, it reduces the weight distribution’s kurtosis.</p>
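

<p>For intuition, this effect is easy to check numerically. Here is a small sketch using SciPy with random stand-in weights (the threshold is arbitrary):</p>



<div
	class="block-code-snippet l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the snippet!"><code>import torch
from scipy.stats import kurtosis

weights = torch.randn(100_000)

# Unstructured pruning keeps only the weights with large magnitude.
surviving = weights[weights.abs() >= 0.5]

print(kurtosis(weights.numpy()))    # excess kurtosis near 0 for a Gaussian
print(kurtosis(surviving.numpy()))  # lower after removing the near-zero mass</code></pre>
</div>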



<p>Unlike unstructured pruning, which selectively targets weights that minimally impact the model&#8217;s output, structured pruning removes groups of parameters (in our case, experts) based on redundancy or low importance. Thus, structured pruning does not significantly decrease kurtosis, leaving plenty of margin for unstructured pruning.&nbsp;</p>



<p>For instance, if two experts in an MoE perform identically, one can be removed without altering the model&#8217;s output. Still, this does not significantly influence the overall weight distribution—it only reduces the model&#8217;s size.</p>



<p>Since structured pruning primarily reduces architectural redundancy rather than reshaping the underlying weight distribution, our two-phase approach—leveraging unstructured pruning after structured pruning—outperforms unstructured-only pruning.</p>



<h2 class="wp-block-heading" id="h-putting-stun-to-the-test">Putting STUN to the test</h2>



<p>Our evaluations show that STUN achieves high sparsity with no loss in performance on various MoE architectures, including Snowflake’s <a href="https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/" target="_blank" rel="noreferrer noopener nofollow">Arctic</a>, a 480B-sized MoE with 128 experts.</p>



<p>We achieved nearly no loss in performance at 40% sparsity, even on challenging generative tasks like <a href="https://huggingface.co/datasets/openai/gsm8k" target="_blank" rel="noreferrer noopener nofollow">GSM8K</a> (Grade School Math 8K), a widely adopted question-answering benchmark of mathematical problems that require multi-step reasoning.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="757" height="431" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=757%2C431&#038;ssl=1" alt="GSM8K 5-shot accuracy for Snowflake Arctic, a 480B Mixture-of-Experts model, after applying different pruning strategies to varying degrees. Structured-only pruning exhibits a significant performance loss as more and more experts are removed. (A sparsity of 30% corresponds to just 90 of the original 128 experts left.) Unstructured-only pruning maintains an unchanged performance up to the point where 30% of the weights are removed. With STUN, the combination of both approaches, benchmark performance remains virtually unaffected up to a sparsity of 40%. This demonstrates that the strategic removal of redundant experts, followed by unstructured pruning, outperforms structured-only and unstructured-only pruning." class="wp-image-47198" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?w=757&amp;ssl=1 757w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=200%2C114&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=220%2C125&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=120%2C68&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=160%2C91&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=300%2C171&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/05/GSM8K-5-shot-accuracy-for-Snowflake-Arctic.png?resize=480%2C273&amp;ssl=1 480w" sizes="(max-width: 757px) 100vw, 757px" /><figcaption class="wp-element-caption">GSM8K 5-shot accuracy for Snowflake Arctic, a 480B Mixture-of-Experts model, after applying different pruning strategies to varying degrees. Structured-only pruning exhibits a significant performance loss as more and more experts are removed. (A sparsity of 30% corresponds to just 90 of the original 128 experts left.) Unstructured-only pruning maintains an unchanged performance up to the point where 30% of the weights are removed. With STUN, the combination of both approaches, benchmark performance remains virtually unaffected up to a sparsity of 40%. This demonstrates that the strategic removal of redundant experts, followed by unstructured pruning, outperforms structured-only and unstructured-only pruning. | <a href="https://arxiv.org/pdf/2409.06211" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>In some cases, STUN performed orders of magnitude better than unstructured pruning methods. Our O(1) expert pruning method also outperformed existing, more computationally expensive methods, such as <a href="https://arxiv.org/abs/2310.05175" target="_blank" rel="noreferrer noopener nofollow">Lu et al. (2024)</a>, highlighting the effectiveness of our approach.</p>



<h2 class="wp-block-heading" id="h-whats-next-in-moe-pruning">What’s next in MoE pruning?</h2>



<p>Since STUN does not make any assumptions about the base MoE model, it generalizes to other MoE families, such as Mixtral. Our code is <a href="https://github.com/thnkinbtfly/STUN" target="_blank" rel="noreferrer noopener nofollow">available on GitHub</a>. We encourage you to read our paper and adapt STUN to your MoE models.</p>



<p>Beyond applying and evaluating STUN, a crucial next area of optimization is hardware acceleration for models pruned in an unstructured manner. Unstructured pruning removes individual weights without considering their location or arrangement in the model. Because of this, the resulting model’s sparsity is random and unaligned: some rows, columns, or even small sections may become very sparse, while others remain dense.</p>



<p>This irregularity is challenging because hardware like GPUs or TPUs assumes regular, contiguous memory layouts. While structured pruning yields a predictable sparsity pattern that allows for memory optimization, the irregularly sparse models resulting from unstructured pruning prevent efficient memory access and parallel processing.</p>



<p>Specialized hardware support can reorganize memory access patterns to reduce overheads from irregularity. Such co-evolution of hardware and software support will likely further establish pruning as a cornerstone of scaling and applying MoE models.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">47194</post-id>	</item>
		<item>
		<title>Bayesian Deep Learning is Needed in the Age of Large-Scale AI [Paper Reflection]</title>
		<link>https://neptune.ai/blog/bayesian-deep-learning-needed-in-the-age-of-large-scale-ai</link>
		
		<dc:creator><![CDATA[Vincent Fortuin]]></dc:creator>
		<pubDate>Thu, 13 Mar 2025 11:30:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Paper Reflections]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=44106</guid>

					<description><![CDATA[In his famous blog post Artificial Intelligence — The Revolution Hasn’t Happened Yet, Michael Jordan (the AI researcher, not the one you probably thought of first) tells a story about how he might have almost lost his unborn daughter due to a faulty AI prediction. He speculates that many children die needlessly each year in&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In his famous blog post <a href="https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7" target="_blank" rel="noreferrer noopener nofollow"><em>Artificial Intelligence — The Revolution Hasn’t Happened Yet</em></a>, Michael Jordan (the AI researcher, not the one you probably thought of first) tells a story about how he might have almost lost his unborn daughter due to a faulty AI prediction. He speculates that many children die needlessly each year in the same way. Abstracting away the specifics of his case, this is one example of an application in which an AI algorithm&#8217;s performance looked good on paper during its development but led to bad decisions once deployed.</p>



<p>In our paper <a href="https://proceedings.mlr.press/v235/papamarkou24b.html" target="_blank" rel="noreferrer noopener nofollow"><em>Bayesian Deep Learning is Needed in the Age of Large-Scale AI</em></a>, we argue that the case above is not the exception but rather the rule and a direct consequence of the research community&#8217;s focus on predictive accuracy as a single metric of interest.</p>



<p>Our position paper was born out of the observation that the annual <a href="https://approximateinference.org/" target="_blank" rel="noreferrer noopener nofollow">Symposium on Advances of Approximate Bayesian Inference</a>, despite its immediate relevance to these questions, attracted fewer junior researchers over the years. At the same time, many of our students and younger colleagues seemed unaware of the fundamental problems with current practices in machine learning research—especially when it comes to large-scale efforts like the work on foundation models, which grab most of the attention today but fall short in terms of safety, reliability, and robustness.</p>



<p>We reached out to fellow researchers in Bayesian deep learning and eventually assembled a group of researchers from 29 of the most renowned institutions around the world, working at universities, government labs, and industry. Together, we wrote the paper to make the case that Bayesian deep learning offers promising solutions to core problems in machine learning and is ready for application beyond academic experiments. In particular, we point out that there are many other metrics beyond accuracy, such as uncertainty calibration, which we have to take into account to ensure that better models also translate to better outcomes in downstream applications.</p>



<p>In this commentary, I will expand on the importance of decisions as a goal for machine learning systems, in contrast to singular metrics. Moreover, I will make the case for why Bayesian deep learning can satisfy these desiderata and briefly review recent advances in the field. Finally, I will provide an outlook for the future of this research area and give some advice on how you can already use the power of Bayesian deep learning solutions in your research or practice today.</p>



<h2 class="wp-block-heading" id="h-machine-learning-for-decisions">Machine learning for decisions</h2>



<p>If you open any machine learning research paper presented at one of the big conferences, chances are that you will find a big table with a lot of numbers. These numbers usually reflect the predictive accuracy of different methods on different datasets, and the line corresponding to the authors&#8217; proposed method probably has a lot of bold numbers, indicating that they are higher than those of the other methods.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" decoding="async" width="1201" height="745" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=1201%2C745&#038;ssl=1" alt="The results table from the ResNet paper is a typical example of how results are presented in machine learning publications. The researchers applied different models and model variants to the same dataset and measured two metrics. The best metric values—usually belonging to the researchers’ newly devised model—are boldened." class="wp-image-44108" style="width:522px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?w=1201&amp;ssl=1 1201w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=768%2C476&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=200%2C124&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=220%2C136&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=120%2C74&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=160%2C99&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=300%2C186&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=480%2C298&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-ResNet-paper.png?resize=1020%2C633&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">The results table from <a href="https://arxiv.org/abs/1512.03385" target="_blank" rel="noreferrer noopener nofollow">the ResNet paper</a> is a typical example of how results are presented in machine learning publications. The researchers applied different models and model variants to the same dataset and measured two metrics. The best metric values—usually belonging to the researchers’ newly devised model—are boldened.</figcaption></figure>
</div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1920" height="587" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=1920%2C587&#038;ssl=1" alt="In the results table from the Vision Transformer paper, the authors compare three of their own model variants against the prior state-of-the-art ResNet-152 model. They trained all four models on seven different datasets and measured the accuracy. Their findings indicate that the ViT-H/14 model (first column) outperforms the other models on six of the seven datasets. Crucially, this does not allow any conclusions about how any of the models would perform on a particular downstream task. (The last line of the table, labeled “TPUv3-core-days,” indicates the number of days it took to train the models on TPUs.)" class="wp-image-44110" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=1920%2C587&amp;ssl=1 1920w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=768%2C235&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=200%2C61&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=1536%2C469&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=220%2C67&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=120%2C37&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=160%2C49&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=300%2C92&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=480%2C147&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?resize=1020%2C312&amp;ssl=1 1020w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/The-results-table-from-the-Vision-Transformer-paper.png?w=1999&amp;ssl=1 1999w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">In the results table from the <a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noreferrer noopener nofollow">Vision Transformer paper</a>, the authors compare three of their own model variants against the prior state-of-the-art ResNet-152 model. They trained all four models on seven different datasets and measured the accuracy. Their findings indicate that the ViT-H/14 model (first column) outperforms the other models on six of the seven datasets. Crucially, this does not allow any conclusions about how any of the models would perform on a particular downstream task. (The last line of the table, labeled “TPUv3-core-days,” indicates the number of days it took to train the models on <a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit" target="_blank" rel="noreferrer noopener nofollow">TPUs</a>.)</figcaption></figure>
</div>


<p>Based on this observation, one might believe that bold numbers in tables are all that matters. However, I want to argue strongly that this is not the case. What matters in the real world are <em>decisions</em>—or, more precisely, decisions and their associated <em>utilities</em>.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-a-motivating-example">A motivating example</h3>



<p>Imagine you overslept and now run the risk of being late for work. Moreover, there is a new construction site on your usual route, and a parade is going on in town today. This makes the traffic situation rather hard to predict. It is 08:30 am, and you have to be at work by 09:00. There are three different routes you can take: through the city, via the highway, or through the forest. How do you choose?</p>



<p>Luckily, some clever AI researchers have built tools that can predict the time each route takes. There are two tools to choose from, Tool A and Tool B, and these are their predictions:</p>



<div id="medium-table-block_4d77ed5f701252e13e2512c123336e34"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            City                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Highway                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Forest                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Tool A</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">35 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">25 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">43 min</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Tool B</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">28 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">32 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">35 min</span></p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>Annoyingly, Tool A suggests taking the highway, while Tool B suggests the city. However, as a tech-savvy user, you know that B uses a newer algorithm, and you have read the paper and marveled at the bold numbers: B yields a lower mean-squared error (MSE), a common measure of predictive performance on regression tasks.</p>



<p>Confidently, you choose to trust Tool B and thus take the route through the city—only to arrive at 09:02 and get an annoyed side-glance from your boss for being late.</p>



<p>But how did that happen? You chose the best tool, after all! Let&#8217;s look at the ground-truth travel times:</p>



<div id="medium-table-block_db8f04c93b3209fe1f42365be4064b82"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            City                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Highway                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Forest                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>True driving time</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">32 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">25 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">35 min</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>As we can see, the highway was actually the fastest route and, in fact, the only one that would have gotten you to work on time. But how is that possible? It becomes clear when we compute the MSE of each tool&#8217;s predictions against these ground-truth times:</p>



<section id="note-block_1e171f21e960000d699f9585a9229f73"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><span style="font-weight: 400;">MSE(A) = [ (35-32)</span><span style="font-weight: 400;">²</span><span style="font-weight: 400;"> + (25-25)</span><span style="font-weight: 400;">²</span><span style="font-weight: 400;"> + (43-35)</span><span style="font-weight: 400;">²</span><span style="font-weight: 400;">] / 3 = 24.3</span></p>
<p><span style="font-weight: 400;">MSE(B) = [ (28-32)</span><span style="font-weight: 400;">² </span><span style="font-weight: 400;">+ (32-25)</span><span style="font-weight: 400;">²</span><span style="font-weight: 400;"> + (35-35)</span><span style="font-weight: 400;">²</span><span style="font-weight: 400;">] / 3 = </span><b>21.7</b></p>
                                    </div>

            </div>
            </div>


</section>



<p>Indeed, we see that Tool B has the better MSE, as advertised in the paper. But that didn&#8217;t help you now, did it? What you ultimately cared about was not having the most accurate predictions across all possible routes but making <em>the best decision</em> regarding which route to take, namely the decision that gets you to work in time.</p>



<p>While Tool A makes worse predictions on average, its predictions are better for routes with shorter travel times and get worse the longer a route takes. It also never <em>underestimates</em> travel times.</p>



<p>To get to work on time, you don&#8217;t care about the predictions for the slowest routes, only about the fastest ones. You’d also like to have the confidence to arrive on time and not choose a route that then actually ends up taking longer. Thus, while Tool A has a worse MSE, it actually leads to better decisions.</p>
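<p>To make the difference concrete, here is a minimal Python sketch (using only the toy numbers from the tables above) that recomputes both views: the MSE, which favors Tool B, and the decision utility of arriving within 30 minutes, which favors Tool A:</p>

<pre class="wp-block-code"><code>import numpy as np

# Ground-truth travel times and the two tools' predictions (minutes),
# in the order: city, highway, forest.
true_times = np.array([32, 25, 35])
preds = {"Tool A": np.array([35, 25, 43]),
         "Tool B": np.array([28, 32, 35])}

for name, pred in preds.items():
    mse = np.mean((pred - true_times) ** 2)
    chosen = np.argmin(pred)        # route the tool recommends
    actual = true_times[chosen]     # how long that route really takes
    on_time = actual &lt;= 30          # utility: arrive within 30 minutes
    print(f"{name}: MSE={mse:.1f}, chosen route takes {actual} min, on time: {on_time}")
</code></pre>

<p>Tool B wins on MSE (21.7 vs. 24.3), yet its recommended route takes 32 minutes and makes you late, while Tool A&#8217;s recommendation gets you there in 25 minutes.</p>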



<h3 class="wp-block-heading" class="wp-block-heading" id="h-uncertainty-estimation-to-the-rescue">Uncertainty estimation to the rescue</h3>



<p>Of course, if you had known that the prediction could have been so wrong, you might have never trusted it in the first place, right? Let’s add another useful feature to the predictions: uncertainty estimation.</p>



<p>Here are the original two algorithms and a new third one (Tool C) that estimates its own predictive uncertainties:</p>



<div id="medium-table-block_72e1a7bee95cdf45df2f01bf628311ae"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--standard l-margin__bottom--standard">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            City                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Highway                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Forest                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Tool A</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">35 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">25 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">43 min</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Tool B</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">28 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">32 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">35 min</span></p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>Tool C</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">25 +/- 8 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">27 +/- 2 min</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">37 +/- 4 min</span></p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<p>The ranking based on Tool C&#8217;s mean predictions agrees with Tool B&#8217;s. However, you can now assess the risk of running late. Your true utility is not to get to work in the shortest time possible but to be at work on time, i.e., within a maximum of 30 min.</p>



<p>According to Tool C, the drive through the city can take between 17 and 33 min, so while it seems to be the fastest on average, there is a chance that you will be late. In contrast, the highway takes between 25 and 29 min, so you will be on time in any case. Armed with these uncertainty estimates, you’d make the correct choice and take the highway.</p>
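<p>As a minimal sketch, assuming the &#8220;+/-&#8221; values denote worst-case intervals around the mean (an interpretation chosen here purely for illustration), this decision rule is easy to express in code:</p>

<pre class="wp-block-code"><code>routes = {"city": (25, 8), "highway": (27, 2), "forest": (37, 4)}  # (mean, uncertainty) in minutes
deadline = 30  # minutes until you must be at work

# Keep only routes whose worst case still meets the deadline,
# then pick the fastest mean among them.
safe = {route: mean for route, (mean, u) in routes.items() if mean + u &lt;= deadline}
best = min(safe, key=safe.get)
print(best)  # prints "highway"
</code></pre>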



<p>This was just one example of a scenario in which we are faced with decisions whose utility does not correlate with an algorithm&#8217;s raw predictive accuracy, and uncertainty estimation is crucial to making better decisions.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-the-case-for-bayesian-deep-learning">The case for Bayesian deep learning</h2>



<p><a href="/blog/bayesian-neural-networks-with-jax" target="_blank" rel="noreferrer noopener">Bayesian deep learning</a> uses the foundational statistical principles of <a href="https://en.wikipedia.org/wiki/Bayesian_inference" target="_blank" rel="noreferrer noopener nofollow">Bayesian inference</a> to endow deep learning systems with the ability to make probabilistic predictions. These predictions can then be used to derive uncertainty intervals of the form shown in the previous example (which a Bayesian would call &#8220;credible intervals&#8221;).</p>



<p>Uncertainty intervals can encompass <em>aleatoric uncertainty</em>, that is, the uncertainty inherent in the randomness of the world (e.g., whether your neighbor decided to leave the car park at the same time as you), and <em>epistemic uncertainty</em>, related to our lack of knowledge (e.g., we might not know how fast the parade moves).</p>
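<p>For ensemble-style predictors whose members each output a mean and a variance, a standard decomposition separates these two sources: the average predicted variance captures aleatoric uncertainty, while the spread of the predicted means captures epistemic uncertainty. Here is a minimal sketch with placeholder numbers:</p>

<pre class="wp-block-code"><code>import torch

# Per-member predictions of an ensemble for one input (placeholder values).
means = torch.tensor([28.0, 30.0, 26.0])       # predicted means
variances = torch.tensor([4.0, 5.0, 3.0])      # predicted variances

aleatoric = variances.mean()                   # noise the world itself contributes
epistemic = means.var(unbiased=False)          # disagreement from lack of knowledge
total = aleatoric + epistemic
print(aleatoric.item(), epistemic.item(), total.item())
</code></pre>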



<p>Crucially, by applying Bayes&#8217; theorem, we can incorporate prior knowledge into the predictions and uncertainty estimates of our Bayesian deep learning model. For example, we can use our understanding of how traffic flows around a construction site to estimate potential delays.</p>
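<p>As a toy illustration of such a prior update, consider a textbook conjugate Normal-Normal model for a single route&#8217;s travel time; all numbers are made up, and this is not the method of any particular paper:</p>

<pre class="wp-block-code"><code># Prior belief: the construction site slows the city route down.
prior_mean, prior_var = 30.0, 25.0
# One noisy observation, e.g., a traffic report.
obs, obs_var = 25.0, 16.0

# Posterior of a Normal mean with known noise variance (conjugate update):
post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
print(post_mean, post_var)  # approx. 27.0 and 9.8: shrunk toward the data, prior retained
</code></pre>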



<p>Frequentist statisticians will often criticize this aspect of Bayesian inference as &#8220;subjective&#8221; and will advocate for &#8220;distribution-free&#8221; approaches, such as <a href="https://en.wikipedia.org/wiki/Conformal_prediction" target="_blank" rel="noreferrer noopener nofollow">conformal prediction</a>, which give provable guarantees for the coverage of the prediction intervals. However, these guarantees only hold on average across all predictions (in our example, across all the routes), not necessarily in any given case.</p>



<p>As we have seen in our example, we don&#8217;t care that much about the accuracy (and, by extension, the uncertainty estimates) on the slower routes. As long as the predictions and uncertainty estimates for the fast routes are accurate, a tool serves its purpose. Conformal methods only guarantee coverage on average across routes; they cannot guarantee conditional coverage for each individual route, limiting their applicability in many scenarios.</p>


    <a
        href="/blog/bayesian-neural-networks-with-jax"
        id="cta-box-related-link-block_2cb95e95d196b5b9f6c717b3a8267df8"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-bayesian-neural-networks-implementing-training-inference-with-the-jax-framework">                Bayesian Neural Networks—Implementing, Training, Inference With the JAX Framework            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-but-bayesian-deep-learning-doesnt-work">“But Bayesian deep learning doesn’t work”</h2>



<p>If you had only superficially followed the field of Bayesian deep learning a few years ago and then stopped paying attention, distracted by all the buzz around LLMs and generative AI, you could be excused for believing that it has elegant principles and a strong motivation but does not actually work in practice. Indeed, this was largely true until very recently.</p>



<p>However, in the last few years, the field has seen many breakthroughs that allow for this framework to finally deliver on its promises. For instance, performing Bayesian inference on posterior distributions over millions of neural network parameters used to be computationally intractable, but we now have <a href="https://proceedings.neurips.cc/paper_files/paper/2021/hash/a7c9585703d275249f30a088cebba0ad-Abstract.html" target="_blank" rel="noreferrer noopener nofollow">scalable approximate inference methods</a> that are only marginally more costly than standard neural network training.</p>



<p>Moreover, it used to be hard to choose the right model class for a given problem, but we have made great progress in automating this decision away from the user thanks to advances in <a href="https://proceedings.mlr.press/v139/immer21a.html" target="_blank" rel="noreferrer noopener nofollow">Bayesian model selection</a>.</p>



<p>While it is still nearly impossible to design a meaningful prior distribution over neural network parameters, we have found <a href="https://jmlr.org/papers/v23/20-1340.html" target="_blank" rel="noreferrer noopener nofollow">different</a> <a href="https://openreview.net/forum?id=83vxe8alV4" target="_blank" rel="noreferrer noopener nofollow">ways</a> to specify priors directly over functions, which is much more intuitive for most practitioners. Finally, some troubling conundra related to the behavior of the Bayesian neural network posterior, such as the infamous <a href="https://proceedings.mlr.press/v119/wenzel20a.html" target="_blank" rel="noreferrer noopener nofollow">cold posterior effect</a>, are <a href="https://proceedings.neurips.cc/paper_files/paper/2021/hash/6a12d7ebc27cae44623468302c47ad74-Abstract.html" target="_blank" rel="noreferrer noopener nofollow">much better understood now</a>.</p>



<p>Armed with these tools, Bayesian deep learning models have started to have a beneficial impact in many domains, including healthcare, robotics, and science. For instance, <a href="https://proceedings.mlr.press/v235/bouchiat24a.html" target="_blank" rel="noreferrer noopener nofollow">we have shown</a> that in the context of predicting health outcomes for patients in the intensive care unit based on time series data, a Bayesian deep learning approach can not only yield better predictions and uncertainty estimates but also lead to recommendations that are more interpretable for medical practitioners. <a href="https://proceedings.mlr.press/v235/papamarkou24b.html" target="_blank" rel="noreferrer noopener nofollow">Our position paper</a> contains detailed accounts of this and other noteworthy examples.</p>



<p>However, Bayesian deep learning is unfortunately still not as easy to use as standard deep learning, which you can do these days in a few lines of <a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener nofollow">PyTorch</a> code.</p>



<p>If you want to use a Bayesian deep learning model, first, you have to think about <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/insr.12502" target="_blank" rel="noreferrer noopener nofollow">specifying the prior</a>. This is a crucial component of the Bayesian paradigm and might sound like a chore, but if you actually have prior knowledge about the task at hand, this can really improve your performance.</p>



<p>Then, you are still left with choosing an approximate inference algorithm, depending on how much computational budget you are willing to spend. Some algorithms are very cheap (such as <a href="https://github.com/aleximmer/Laplace" target="_blank" rel="noreferrer noopener nofollow">Laplace inference</a>), but if you want really high-fidelity uncertainty estimates, you might have to opt for a more expensive one (e.g., <a href="https://github.com/TyXe-BDL/TyXe" target="_blank" rel="noreferrer noopener nofollow">Markov Chain Monte Carlo</a>).</p>
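<p>As an example of how cheap this can be, here is a sketch of post-hoc, last-layer Laplace inference based on the laplace-torch library&#8217;s documented interface; the tiny model and synthetic data are placeholders, and you should consult the repository for the current API:</p>

<pre class="wp-block-code"><code>import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from laplace import Laplace  # pip install laplace-torch

# Placeholder regression data and a tiny network (assumed already trained).
X = torch.randn(128, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(128, 1)
train_loader = DataLoader(TensorDataset(X, y), batch_size=32)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

la = Laplace(model, "regression",
             subset_of_weights="last_layer", hessian_structure="kron")
la.fit(train_loader)             # fit the approximate posterior post hoc
la.optimize_prior_precision()    # tune the prior via the marginal likelihood

f_mean, f_var = la(torch.randn(8, 10))  # predictive mean and variance
</code></pre>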



<p>Finally, you have to find the right implementation of that algorithm that also works with your model. For instance, some inference algorithms might only work with certain types of normalization operators (e.g., layer norm vs. batch norm) or might not work with <a href="/blog/deep-learning-model-optimization-methods#h-quantization-reducing-the-memory-footprint-by-lowering-computational-precision" target="_blank" rel="noreferrer noopener">low-precision weights</a>.</p>



<p>As a research community, we should make it a priority to make these tools more easily usable for normal practitioners without a background in ML research.</p>


    <a
        href="/blog/logging-pymc-and-arviz-artifacts-on-neptune"
        id="cta-box-related-link-block_3cecdba64ab9f1661b542c9c3ede81c2"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-logging-pymc-and-arviz-artifacts-on-neptune">                Logging PyMC and Arviz Artifacts on Neptune             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-the-road-ahead">The road ahead</h2>



<p>This commentary on our position paper has hopefully convinced you that there is more to machine learning than predictive accuracies on a test set. Indeed, if you use predictions from an AI model to make decisions, in almost all circumstances, you should care about ways to incorporate your prior knowledge into the model and get uncertainty estimates out of it. If this is the case, trying out Bayesian deep learning is likely worth your while.</p>



<p>A good place to start is the <a href="https://arxiv.org/abs/2309.16314" target="_blank" rel="noreferrer noopener nofollow"><em>Primer on Bayesian Neural Networks</em></a> that I wrote together with three colleagues. I’ve also written a <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/insr.12502" target="_blank" rel="noreferrer noopener nofollow"><em>review on priors in Bayesian Deep Learning</em></a> that’s published open access. Once you understand the theoretical foundations and feel ready to get your hands dirty with some actual Bayesian deep learning in PyTorch, check out some popular libraries for inference methods such as <a href="https://github.com/aleximmer/Laplace" target="_blank" rel="noreferrer noopener nofollow">Laplace inference</a>, <a href="https://github.com/aleximmer/Laplace" target="_blank" rel="noreferrer noopener nofollow">variational inference</a>, and <a href="https://github.com/TyXe-BDL/TyXe" target="_blank" rel="noreferrer noopener nofollow">Markov chain Monte Carlo methods</a>.</p>



<p>Finally, if you are a researcher and would like to get involved in the Bayesian deep learning community, especially contributing to the goal of better benchmarking to show the positive impact on real decision outcomes and to the goal of building easy-to-use software tools for practitioners, feel free to <a href="https://fortuin.github.io/" target="_blank" rel="noreferrer noopener nofollow">reach out to me</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">44106</post-id>	</item>
		<item>
		<title>Open LLMs are Necessary For Current Private Adaptations and Outperform Their Closed Alternatives [Paper Reflection]</title>
		<link>https://neptune.ai/blog/open-llms-are-necessary-for-current-private-adaptations-and-outperform-closed-alternatives</link>
		
		<dc:creator><![CDATA[Olatunji Iyiola Emmanuel]]></dc:creator>
		<pubDate>Thu, 20 Feb 2025 11:30:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Paper Reflections]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=44080</guid>

					<description><![CDATA[Closed Large Language Models (LLMs), which are proprietary and accessible only via APIs, have dominated the LLM space since around 2022 due to their high performance and versatility. However, Open LLMs have made substantial progress, narrowing the performance gap with their Closed LLM counterparts. Open LLMs are models whose architecture and parameters are publicly available&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Closed Large Language Models (LLMs), which are proprietary and accessible only via APIs, have dominated the LLM space since around 2022 due to their high performance and versatility. However, Open LLMs have made substantial progress, narrowing the performance gap with their Closed LLM counterparts. Open LLMs are models whose architecture and parameters are publicly available for use, modification, and distribution.</p>



<p>For instance, while Closed LLMs like Anthropic’s <a href="https://www.anthropic.com/claude" target="_blank" rel="noreferrer noopener nofollow">Claude</a> (released in March 2023) and OpenAI’s <a href="https://openai.com/index/gpt-4/" target="_blank" rel="noreferrer noopener nofollow">GPT-4</a> (released in March 2023) set new benchmarks upon their launches, the Open LLM <a href="https://ai.meta.com/blog/meta-llama-3/" target="_blank" rel="noreferrer noopener nofollow">Llama 3</a> released by Meta in April 2024 and <a href="https://www.deepseek.com/" target="_blank" rel="noreferrer noopener nofollow">DeepSeek-R1</a> released in January 2025 not only matched but <a href="/blog/fine-tuning-llama-3-with-lora#h-llama-3-performance" target="_blank" rel="noreferrer noopener">surpassed these models in tasks such as coding, reasoning, text classification, summarization, and question answering</a>.</p>



<p>While much of the discussion around LLMs centers on task and computational performance, in our paper <a href="https://arxiv.org/pdf/2411.05818" target="_blank" rel="noreferrer noopener nofollow"><em>Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives</em></a>, we focus on the privacy implications of using Open and Closed LLMs. Specifically, we explore whether and how models can be fine-tuned on sensitive data while ensuring robust privacy guarantees.</p>



<p>To this end, we define threat models, compare various Open and Closed LLMs that leverage <a href="https://en.wikipedia.org/wiki/Differential_privacy" target="_blank" rel="noreferrer noopener nofollow">differential privacy (DP)</a> on classification and generation tasks and analyze methodological limitations. Our research results in a thorough analysis of the privacy-utility tradeoff under different privacy levels.</p>



<p>Our findings indicate that Open LLMs can be adapted to private data without leaking information to third parties, such as LLM providers and malicious users. Thus, they offer a significant privacy advantage over Closed, proprietary models.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-the-threat-space-in-adapting-llms-to-private-data">The threat space in adapting LLMs to private data</h2>



<p>The adaptation of Closed LLMs to private datasets introduces a multifaceted threat space. In typical scenarios, data curators provide their sensitive data to LLM providers for fine-tuning, producing a model tailored to the dataset. This customized model is subsequently queried by external parties, e.g., customers of the data curator.</p>



<p>The resulting threat space can be categorized into three key dimensions:</p>



<ol class="wp-block-list">
<li><strong>From the data curator to the LLM provider:</strong> The private data shared during fine-tuning may be susceptible to unauthorized access or misuse.</li>



<li><strong>From the querying party to the LLM provider:</strong> Queries submitted by end users, which often contain sensitive information intended for the data curator, are exposed to the LLM provider.</li>
</ol>



<ol start="3" class="wp-block-list">
<li><strong>From malicious end users to the adapted LLM:</strong> Malicious end users may attempt to <a href="/blog/adversarial-machine-learning-defense-strategies" target="_blank" rel="noreferrer noopener">extract private information</a> through the LLM&#8217;s responses to carefully crafted queries.</li>
</ol>



<p>In contrast to Closed LLMs, Open LLMs provide full control over the model and data, enabling private adaptation without the need to share sensitive information with a third party. This control eliminates the first two threat vectors associated with Closed LLMs, such as unauthorized access or misuse by the provider and exposure of user queries. With Open LLMs, data curators can directly fine-tune the model on private datasets using privacy-preserving techniques, ensuring end-to-end privacy.</p>


    <a
        href="/blog/adversarial-machine-learning-defense-strategies"
        id="cta-box-related-link-block_4f6d74281af895cb91f1169eae2edd8c"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-adversarial-machine-learning-defense-strategies">                Adversarial Machine Learning: Defense Strategies            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-what-are-the-current-methods-for-private-adaptation-of-llms">What are the current methods for private adaptation of LLMs?&nbsp;</h2>



<p>It follows from our threat space analysis that restricting access to the fine-tuning dataset alone does not guarantee data privacy. <a href="/blog/adversarial-machine-learning-defense-strategies" target="_blank" rel="noreferrer noopener">Model outputs can still reveal sensitive information</a> from the fine-tuning data. If the fine-tuned model is exposed (e.g., via an API), it remains vulnerable to information extraction and inference attacks.</p>



<p><a href="https://en.wikipedia.org/wiki/Differential_privacy" target="_blank" rel="noreferrer noopener nofollow">Differential privacy (DP)</a> introduces a rigorous mathematical framework that ensures the privacy of individuals whose data is used in the fine-tuning process. Specifically, DP adds carefully calibrated noise to the model updates, making it statistically improbable to determine whether any individual&#8217;s data was included in the fine-tuning dataset. Its quantifiable and robust privacy guarantee makes DP valuable for <a href="https://arxiv.org/abs/2305.15594" target="_blank" rel="noreferrer noopener nofollow">protecting sensitive information in LLM fine-tuning</a>.</p>



<p>While DP provides privacy guarantees for both Open and Closed LLMs, it does not address the issue of trust in third-party providers for Closed LLMs. For these models, data curators must rely on the provider to implement safeguards and handle sensitive data responsibly.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-private-adaptation-methods-for-closed-llms">Private adaptation methods for Closed LLMs&nbsp;</h3>



<p>We can rule out fine-tuning services offered by LLM providers (e.g., <a href="https://platform.openai.com/docs/guides/fine-tuning" target="_blank" rel="noreferrer noopener nofollow">OpenAI</a> and <a href="https://aws.amazon.com/bedrock/amazon-models/titan" target="_blank" rel="noreferrer noopener nofollow">Amazon</a>), as this entails sharing private data with a third party. Closed LLMs are accessible only via APIs. Thus, we cannot access and adapt the model&#8217;s weights directly.</p>



<p>Instead, private adaptation methods for Closed LLMs rely on <a href="https://openreview.net/forum?id=x4OPJ7lHVU" target="_blank" rel="noreferrer noopener nofollow">privacy-preserving discrete prompts or private in-context learning (ICL)</a>. These approaches work by <a href="/blog/prompt-engineering-strategies" target="_blank" rel="noreferrer noopener">carefully crafting input prompts</a> or <a href="/blog/zero-shot-and-few-shot-learning-with-llms" target="_blank" rel="noreferrer noopener">selecting relevant examples to guide the model&#8217;s behavior</a>, all while ensuring that sensitive information in the prompts or examples is protected from potential leakage or inference attacks.</p>



<p>All methods we evaluate in our study follow the <a href="https://arxiv.org/abs/1610.05755" target="_blank" rel="noreferrer noopener nofollow">PATE (Private Aggregation of Teacher Ensembles)</a> framework. At a high level, PATE achieves data privacy by splitting the private dataset into non-overlapping partitions. Then, each partition is used to train a so-called teacher model. These teacher models are joined into an ensemble model by combining their outputs while adding noise, which preserves privacy.</p>



<p>This ensemble is then used to train a so-called student model in the following way: The ensemble makes predictions for samples from an unlabeled public dataset. The resulting <em>(sample, ensemble prediction) </em>pairs constitute the training data for the student model. Thus, the student learns to make the same predictions as the teacher ensemble but never sees sensitive data samples. The student is what’s released as the final model.</p>
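<p>A minimal sketch of PATE&#8217;s noisy aggregation for a single public sample (the teacher votes and noise scale below are placeholders, and real implementations additionally track the privacy budget spent across queries):</p>

<pre class="wp-block-code"><code>import numpy as np

num_classes = 4
teacher_preds = np.array([2, 2, 1, 2, 0, 2, 2, 1])  # each teacher's predicted class

votes = np.bincount(teacher_preds, minlength=num_classes)   # per-class vote counts
noisy_votes = votes + np.random.laplace(scale=2.0, size=num_classes)
student_label = int(np.argmax(noisy_votes))  # privacy-preserving label for the student
</code></pre>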


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=1200%2C628&#038;ssl=1" alt="Overview of the PATE framework. The sensitive dataset is divided into non-overlapping partitions, and a separate teacher model is trained on each partition. All teachers are aggregated noisily into an ensemble model, which is used to make predictions on a public dataset. The samples from the public dataset, together with the ensemble’s predictions, constitute the training data for the student model, which is the model that is eventually queried by users. " class="wp-image-44102" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2025/02/PATE-framework.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Overview of the PATE framework. The sensitive dataset is divided into non-overlapping partitions, and a separate teacher model is trained on each partition. All teachers are aggregated noisily into an ensemble model, which is used to make predictions on a public dataset. The samples from the public dataset, together with the ensemble’s predictions, constitute the training data for the student model, which is the model that is eventually queried by users. | <a href="https://arxiv.org/pdf/1610.05755" target="_blank" rel="noreferrer noopener nofollow">Source</a></figcaption></figure>
</div>


<p>The private adaptation methods for Closed LLMs we analyze in our study build on this general framework. They differ in how the teachers are utilized and how their responses are aggregated:</p>



<ul class="wp-block-list">
<li><a href="https://openreview.net/forum?id=x4OPJ7lHVU" target="_blank" rel="noreferrer noopener nofollow">Differentially Private In-context Learning (DP-ICL)</a>: All teachers process the same prompt, and the ensemble’s response is the noisy consensus.</li>



<li><a href="https://arxiv.org/abs/2305.15594" target="_blank" rel="noreferrer noopener nofollow">PromptPATE</a>: The teacher ensemble assigns labels to public unlabeled data via private voting. These labeled public sequences are used to create new discrete student prompts, which are deployed with the LLM.</li>



<li><a href="https://openreview.net/forum?id=oZtt0pRnOl">DP-FewSho</a><a href="https://openreview.net/forum?id=oZtt0pRnOl" target="_blank" rel="noreferrer noopener nofollow">t</a><a href="https://openreview.net/forum?id=oZtt0pRnOl">Gen</a>: The teacher ensemble generates private synthetic <a href="/blog/zero-shot-and-few-shot-learning-with-llms" target="_blank" rel="noreferrer noopener">few-shot samples</a> that are used as samples for in-context learning.</li>



<li><a href="https://openreview.net/forum?id=Ifz3IgsEPX" target="_blank" rel="noreferrer noopener nofollow">DP-OPT</a>: A local LLM generates privacy-preserving prompts and instructions from the private dataset. These are used for in-context learning for the third-party Closed LLM.</li>
</ul>



<p>In our paper, we compare the privacy protection and performance of these four state-of-the-art methods for private adaptation of Closed LLMs. When applying them to the popular Closed LLMs <a href="https://www.anthropic.com/claude" target="_blank" rel="noreferrer noopener nofollow">Claude</a>, <a href="https://platform.openai.com/docs/models#gpt-base" target="_blank" rel="noreferrer noopener nofollow">GPT-3 Babbage</a>, <a href="https://platform.openai.com/docs/models#gpt-base" target="_blank" rel="noreferrer noopener nofollow">GPT-3 Davinci</a>, and <a href="https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo" target="_blank" rel="noreferrer noopener nofollow">GPT-4 Turbo</a>, we observe that compared to private adaptation of Open LLMs, these methods offer lower performance at a higher cost on various downstream tasks, including dialog summarization, classification, and generation. Further, all methods except DP-OPT leak training data to the LLM provider.</p>


    <a
        href="/blog/zero-shot-and-few-shot-learning-with-llms"
        id="cta-box-related-link-block_ddbc60a0d36ed24eabf77cce54ac1377"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-zero-shot-and-few-shot-learning-with-llms">                Zero-Shot and Few-Shot Learning with LLMs             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-private-adaptation-methods-for-open-llms">Private adaptation methods for Open LLMs&nbsp;</h3>



<p>Unlike Closed LLMs, Open LLMs provide access to their parameters, enabling more flexible and parameter-centric private adaptation methods. These methods typically follow the <a href="https://arxiv.org/abs/1607.00133" target="_blank" rel="noreferrer noopener nofollow">Differentially Private Stochastic Gradient Descent (DPSGD)</a> paradigm to ensure privacy. In DPSGD, the influence of each private data point is constrained during training through <a href="/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem" target="_blank" rel="noreferrer noopener">gradient clipping</a> and the addition of calibrated noise. This approach guarantees that the model does not memorize or leak sensitive information.</p>
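<p>A minimal sketch of a single DPSGD step under these ideas: clip each per-example gradient to norm C, average, and add calibrated Gaussian noise. Libraries such as Opacus implement this efficiently and track the privacy budget; the tiny model and batch below are placeholders:</p>

<pre class="wp-block-code"><code>import torch
from torch import nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
batch_x, batch_y = torch.randn(16, 10), torch.randn(16, 1)
C, sigma, lr = 1.0, 0.8, 1e-3

grads = []
for x, y in zip(batch_x, batch_y):            # per-example gradients
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    grads.append(g * min(1.0, C / (g.norm().item() + 1e-12)))  # clip to norm C

# Average the clipped gradients and add Gaussian noise scaled to C.
noisy = torch.stack(grads).mean(0) + torch.randn_like(grads[0]) * sigma * C / len(grads)

offset = 0
with torch.no_grad():                         # apply the noisy update
    for p in model.parameters():
        n = p.numel()
        p -= lr * noisy[offset:offset + n].view_as(p)
        offset += n
</code></pre>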



<p>In our study, we explore three primary methods for private adaptation of Open LLMs:&nbsp;</p>



<ol class="wp-block-list">
<li>Prompt-based adaptation (<a href="https://arxiv.org/abs/2305.15594" target="_blank" rel="noreferrer noopener nofollow">PromptDPSGD</a>) introduces a small number of additional parameters (&lt;1% of the model&#8217;s total parameters) in the input space through <a href="https://aclanthology.org/2022.acl-short.8/" target="_blank" rel="noreferrer noopener nofollow">soft prompts</a> or <a href="https://aclanthology.org/2021.emnlp-main.243/" target="_blank" rel="noreferrer noopener nofollow">prefix-tuning</a> and adapts Differentially Private Stochastic Gradient Descent (DPSGD) to preserve privacy.</li>



<li>Parameter-efficient fine-tuning, such as <a href="https://arxiv.org/abs/2106.09685" target="_blank" rel="noreferrer noopener nofollow">LoRA</a>, only updates a relatively small number of parameters (&lt;10% of the model’s total parameters) within the model&#8217;s architecture to enable efficient updates; see the sketch after this list. <a href="https://openreview.net/forum?id=Q42f0dfjECO" target="_blank" rel="noreferrer noopener nofollow">PrivateLoRA</a> extends this approach with DP guarantees by building on the DPSGD algorithm.</li>



<li>Full fine-tuning adaptations (<a href="https://openreview.net/forum?id=bVuP3ltATMz" target="_blank" rel="noreferrer noopener nofollow">DP-FineTune</a>) involve fine-tuning the entire model or a subset of its layers for comprehensive adaptation while adhering to differential privacy principles.</li>
</ol>
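<p>As a minimal sketch of the LoRA idea referenced in the list above: the frozen base weight is augmented with a trainable low-rank update, and only the two small adapter matrices are trained (DP variants such as PrivateLoRA train them with DPSGD). Dimensions and hyperparameters below are placeholders:</p>

<pre class="wp-block-code"><code>import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))  # only A and B receive gradient updates
</code></pre>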



<p>Applying these methods to <a href="https://proceedings.mlr.press/v202/biderman23a/biderman23a.pdf" target="_blank" rel="noreferrer noopener nofollow">Vicuna</a>, <a href="https://huggingface.co/meta-llama" target="_blank" rel="noreferrer noopener nofollow">Llama-3</a>, <a href="https://github.com/openlm-research/open_llama" target="_blank" rel="noreferrer noopener nofollow">OpenLLaMa</a>, <a href="https://arxiv.org/abs/1910.13461" target="_blank" rel="noreferrer noopener nofollow">BART</a>, <a href="https://openreview.net/forum?id=SyxS0T4tvS" target="_blank" rel="noreferrer noopener nofollow">RoBERTa</a>, and <a href="https://arxiv.org/abs/2304.01373" target="_blank" rel="noreferrer noopener nofollow">the Pythia suite of models</a>, we find that private adaptation of Open LLMs improves performance on downstream tasks and reduces costs compared to their Closed counterparts. It also provides a critical privacy benefit by eliminating the risk of exposing private data and user queries to LLM providers.</p>


    <a
        href="/blog/fine-tuning-llama-3-with-lora"
        id="cta-box-related-link-block_d50b8d49a594edec7c1ac72d273775e4"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related                </div>
            </div>

        
                    <h3 class="c-header" class="c-header" id="h-fine-tuning-llama-3-with-lora-step-by-step-guide">                Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-insightful-results">Insightful results</h2>



<p>Our analysis of private adaptation methods for both Closed and Open LLMs reveals several critical findings regarding data leakage, performance, and cost:</p>



<ol class="wp-block-list">
<li><strong>Query data leakage:</strong> All private adaptation methods for Closed LLMs leak query data to the LLM provider. This means that sensitive information from user queries is exposed during the adaptation process, posing a significant privacy risk.</li>



<li><strong>Training data leakage:</strong> Only one method (<a href="https://openreview.net/forum?id=Ifz3IgsEPX" target="_blank" rel="noreferrer noopener nofollow">DP-OPT</a>) of the four methods of private adaptation of Closed LLMs successfully protects private training data from the LLM provider. However, this method requires a local LLM to effectively protect the privacy of the training data. The remaining private adaptation methods for Closed LLMs leak a large fraction of the training data to the LLM provider, undermining the privacy guarantees of the adaptation process.</li>



<li><strong>Performance:</strong> All adaptation methods for Closed LLMs achieve lower downstream task performance than privacy-preserving local adaptations on Open LLMs, even when the Open LLMs are significantly smaller than their Closed counterparts.</li>



<li><strong>Cost:</strong> The training and query costs for private adaptations of Closed LLMs are substantially higher due to the API access costs imposed by the LLM provider. In contrast, private adaptations for Open LLMs are more cost-effective. We <a href="https://www.runpod.io/gpu-instance/pricing" target="_blank" rel="noreferrer noopener nofollow">estimated the costs assuming an A40 GPU with 48 GB of memory</a>. In this scenario, privately adapting a Closed LLM to text classification tasks with <a href="https://openreview.net/forum?id=x4OPJ7lHVU" target="_blank" rel="noreferrer noopener nofollow">DP-ICL</a> costs about $140. In contrast, fine-tuning an Open LLM with <a href="https://openreview.net/forum?id=Q42f0dfjECO" target="_blank" rel="noreferrer noopener nofollow">PrivateLoRA</a> on the same tasks costs about $30.</li>
</ol>



<p>This leads to the conclusion that for a truly privacy-preserving adaptation of LLMs, one should use Open LLMs. By offering full control over the model and data, Open LLMs eliminate the risks associated with third-party providers and enable robust privacy-preserving techniques. As a result, Open LLMs address the limitations of Closed LLMs and enable efficient and customizable adaptations tailored to sensitive datasets.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">44080</post-id>	</item>
		<item>
		<title>Understanding LLMs Requires More Than Statistical Generalization [Paper Reflection]</title>
		<link>https://neptune.ai/blog/understanding-llms-requires-more-than-statistical-generalization</link>
		
		<dc:creator><![CDATA[Patrik Reizinger]]></dc:creator>
		<pubDate>Thu, 19 Dec 2024 11:00:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Paper Reflections]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=42880</guid>

					<description><![CDATA[In our paper, Understanding LLMs Requires More Than Statistical Generalization, we argue that current machine learning theory cannot explain the interesting emergent properties of Large Language Models, such as reasoning or in-context learning. From prior work (e.g., Liu et al., 2023) and our experiments, we’ve seen that these phenomena cannot be explained by reaching globally&#8230;]]></description>
										<content:encoded><![CDATA[
<p>In our paper, <a href="https://proceedings.mlr.press/v235/reizinger24a.html" target="_blank" rel="noreferrer noopener nofollow"><em>Understanding LLMs Requires More Than Statistical Generalization</em></a>, we argue that current machine learning theory cannot explain the interesting emergent properties of Large Language Models, such as reasoning or in-context learning. From prior work (e.g., <a href="https://proceedings.mlr.press/v202/liu23ao.html" target="_blank" rel="noreferrer noopener nofollow">Liu et al., 2023</a>) and our experiments, we’ve seen that these phenomena cannot be explained by reaching globally minimal test loss – the target of statistical generalization. In other words, model comparison based on the test loss is nearly meaningless.</p>



<p>We identified three areas where more research is required:</p>



<ul class="wp-block-list">
<li>Understanding the role of inductive biases in LLM training, including the role of architecture, data, and optimization.</li>



<li>Developing more adequate measures of generalization.</li>



<li>Using formal languages to study language models in well-defined scenarios to understand transfer performance.</li>
</ul>



<p>In this commentary, we dive deeper into the role of inductive biases. Inductive biases, such as the model architecture or the optimization algorithm, affect which solution the neural network converges to. For example, <a href="/blog/deep-learning-optimization-algorithms#h-stochastic-gradient-descent-sgd" target="_blank" rel="noreferrer noopener">Stochastic Gradient Descent (SGD)</a> favors neural networks <a href="https://math.stackexchange.com/questions/3451272/does-gradient-descent-converge-to-a-minimum-norm-solution-in-least-squares-probl" target="_blank" rel="noreferrer noopener nofollow">with minimum-norm weights</a>.</p>
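<p>To see this implicit bias in action, here is a minimal numpy sketch (our illustration, not code from the paper) of the classic linear case the link above discusses: on an underdetermined least-squares problem, plain gradient descent initialized at zero converges to the minimum-norm interpolating solution, even though infinitely many weight vectors achieve zero loss.</p>

<pre class="wp-block-code"><code>import numpy as np

rng = np.random.default_rng(0)

# Underdetermined least squares: 100 parameters, 20 data points,
# so infinitely many weight vectors fit the data exactly.
X = rng.normal(size=(20, 100))
y = rng.normal(size=20)

# Plain full-batch gradient descent, starting from zero weights.
w = np.zeros(100)
lr = 1e-3
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-norm interpolating solution, via the pseudo-inverse.
w_min_norm = np.linalg.pinv(X) @ y

# Gradient descent picked that specific solution without being told to:
print(np.allclose(w, w_min_norm, atol=1e-4))  # True
</code></pre>

<p>Many solutions reach the same (zero) training loss, yet the optimizer reliably selects a particular one: that preference is the inductive bias.</p>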


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1080" height="600" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=1080%2C600&#038;ssl=1" alt="Inductive biases influence model performance. Even if two models with parameters θ1 and θ2 yield the same training and test loss, their downstream performance can differ." class="wp-image-42892" style="width:578px;height:auto" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?w=1080&amp;ssl=1 1080w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=768%2C427&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=200%2C111&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=220%2C122&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=120%2C67&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=160%2C89&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=300%2C167&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=480%2C267&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Inductive-biases-influence-model-performance.png?resize=1020%2C567&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Inductive biases influence model performance. Even if two models with parameters <em>θ<sub>1</sub></em> and <em>θ<sub>2</sub></em> yield the same training and test loss, their downstream performance can differ.</figcaption></figure>
</div>


<h2 class="wp-block-heading" class="wp-block-heading" id="h-how-do-language-complexity-and-model-architecture-affect-generalization-ability">How do language complexity and model architecture affect generalization ability?</h2>



<p>In their <a href="https://arxiv.org/abs/2207.02098" target="_blank" rel="noreferrer noopener nofollow"><em>Neural Networks and the Chomsky Hierarchy</em></a> paper published in 2023, Delétang et al. showed how different neural network architectures generalize better for different language types.</p>



<p>Following the well-known <a href="https://en.wikipedia.org/wiki/Chomsky_hierarchy" target="_blank" rel="noreferrer noopener nofollow">Chomsky hierarchy</a>, they distinguished four grammar types (regular, context-free, context-sensitive, and recursively enumerable) and defined corresponding sequence prediction tasks. Then, they trained different model architectures to solve these tasks and evaluated if and how well the model generalized, i.e., if a particular model architecture could handle the required language complexity.</p>



<p>In our position paper, we follow this general approach to expose the interaction of architecture and data in formal languages to gain insights into complexity limitations in natural language processing. We study popular architectures used for language modeling, e.g., Transformers, State-Space Models (SSMs) such as Mamba, the LSTM, and its novel extended version, the <a href="https://arxiv.org/abs/2405.04517" target="_blank" rel="noreferrer noopener nofollow">xLSTM</a>.</p>



<p>To investigate how these models deal with formal languages of different complexity, we use a simple setup where each language consists only of two rules. During training, we monitor how well the models perform next-token prediction on the (in-distribution) test set, measured by accuracy.</p>
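<p>Concretely, in-distribution next-token accuracy is just an argmax comparison between model outputs and targets. The following sketch illustrates the metric; it is our simplification, not our actual evaluation code:</p>

<pre class="wp-block-code"><code>import numpy as np

def next_token_accuracy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Fraction of positions where the most likely token equals the target.

    logits: (batch, seq_len, vocab_size) model outputs
    targets: (batch, seq_len) ground-truth next-token ids
    """
    predictions = logits.argmax(axis=-1)
    return float((predictions == targets).mean())
</code></pre>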



<p>However, our main question is whether these models generalize out-of-distribution. For this, we introduce the notion of rule extrapolation.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-can-models-adapt-to-changing-grammar-rules">Can models adapt to changing grammar rules?</h2>



<p>To understand rule extrapolation, let’s start with an example. A simple formal language is the <em>a</em><sup>n</sup><em>b</em><sup>n</sup> language, where the strings obey two rules:</p>



<div id="case-study-numbered-list-block_8aab16ce3eac5dee0ed37ffadf5bcdb7"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                a’s come before b’s.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                The number of a’s and b’s is the same.            </li>
            </ul>
</div>



<p>Examples of valid strings include “<em>ab</em>” and “<em>aabb</em>,” whereas strings like “<em>baab</em>” (violates rule 1) and “<em>aab</em>” (violates rule 2) are invalid. Once the models have been trained on such strings, we feed them an out-of-distribution (OOD) string violating rule 1 (e.g., a string whose first token is <em>b</em>).</p>
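<p>Both rules are easy to state as code. The following sketch (ours, for illustration) checks each rule separately, which is precisely what studying rule extrapolation requires: a string starting with <em>b</em> can never satisfy rule 1, but it can still satisfy rule 2.</p>

<pre class="wp-block-code"><code>def obeys_rule_1(s: str) -> bool:
    """All a's come before all b's (for strings over the alphabet {a, b})."""
    return "ba" not in s

def obeys_rule_2(s: str) -> bool:
    """The number of a's equals the number of b's."""
    return s.count("a") == s.count("b")

def in_language(s: str) -> bool:
    return obeys_rule_1(s) and obeys_rule_2(s)

assert in_language("ab") and in_language("aabb")
assert not in_language("baab")  # violates rule 1
assert not in_language("aab")   # violates rule 2

# An OOD string starting with "b" breaks rule 1 by construction,
# yet can still respect rule 2:
assert not obeys_rule_1("baab") and obeys_rule_2("baab")
</code></pre>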



<p>We find that most models still obey rule 2 when predicting tokens, which we call rule extrapolation – they do not discard the learned rules entirely but adapt to the new situation in which rule 1 is seemingly no longer relevant.</p>



<p>This finding is surprising because none of the studied model architectures includes conscious choices to promote rule extrapolation. It emphasizes our point from the position paper that we need to understand the inductive biases of language models to explain emergent (OOD) behavior, such as reasoning or good <a href="/blog/zero-shot-and-few-shot-learning-with-llms" target="_blank" rel="noreferrer noopener">zero-/few-shot prompting performance</a>.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-efficient-llm-training-requires-understanding-what-is-a-complex-language-for-an-llm">Efficient LLM training requires understanding what is a complex language for an LLM</h2>



<p>According to the Chomsky hierarchy, the context-free <em>a</em><sup>n</sup><em>b</em><sup>n</sup> language is less complex than the context-sensitive <em>a</em><sup>n</sup><em>b</em><sup>n</sup><em>c</em><sup>n</sup> language, where the <em>n</em> a’s and <em>n</em> b’s are followed by an equal number of c’s.</p>
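<p>The membership test for the context-sensitive language is barely longer to write than for the context-free one, as this illustrative sketch of ours shows:</p>

<pre class="wp-block-code"><code>def in_anbncn(s: str) -> bool:
    """Membership test for the context-sensitive language a^n b^n c^n."""
    n = s.count("a")
    return s == "a" * n + "b" * n + "c" * n

assert in_anbncn("abc") and in_anbncn("aabbcc")
assert not in_anbncn("aabbc")   # unequal counts
assert not in_anbncn("abcabc")  # wrong order
</code></pre>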



<p>Despite their different complexity, the two languages seem very similar to humans. Our experiments show that Transformers, for example, can learn context-free and context-sensitive languages equally well. However, they seem to struggle with regular languages, which the Chomsky hierarchy deems much simpler.</p>



<p>Based on this and similar observations, we conclude that language complexity, as the Chomsky hierarchy defines it, is not a suitable predictor for how well a neural network can learn a language. To guide architecture choices in language models, we need better tools to measure the complexity of the language task we want to learn.</p>



<p>What these measures could look like is an open question. Presumably, we’ll need different complexity measures for different model architectures that take their specific inductive biases into account.</p>






<h2 class="wp-block-heading" class="wp-block-heading" id="h-whats-next">What’s next?</h2>



<p>Understanding how and why LLMs are so successful paves the way to more data-, cost-, and energy-efficient models. If you want to dive deeper into this topic, our position paper&#8217;s “Background” section is full of references, and we discuss numerous concrete research questions.</p>



<p>If you’re new to the field, I particularly recommend <a href="https://proceedings.mlr.press/v202/liu23ao.html" target="_blank" rel="noreferrer noopener nofollow"><em>Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models</em></a> (2023) by Liu et al., which nicely demonstrates the shortcomings of current evaluation practices based on the test loss. I also encourage you to check out <a href="https://arxiv.org/abs/1905.11604" target="_blank" rel="noreferrer noopener nofollow"><em>SGD on Neural Networks Learns Functions of Increasing Complexity</em></a> (2019) by Nakkiran et al. to understand more deeply how using stochastic gradient descent affects what functions neural networks learn.</p>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">42880</post-id>	</item>
		<item>
		<title>Learn From Failure: Fine-Tuning LLMs With Trial-and-Error Data For Intuitionistic Propositional Logic Proving [Paper Reflection]</title>
		<link>https://neptune.ai/blog/fine-tuning-llms-with-trial-and-error-data-for-intuitionistic-propositional-logic-proving</link>
		
		<dc:creator><![CDATA[Chenyang An]]></dc:creator>
		<pubDate>Thu, 28 Nov 2024 11:30:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Paper Reflections]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=42845</guid>

					<description><![CDATA[With the rapid advancements in large language models (LLMs), transformer-based architectures are increasingly employed as tactical generators and premise selectors in automated theorem-proving systems, generating candidate proof steps or selecting useful premises based on the unfinished proof goal. According to Fields medalist Terence Tao, the new generation of AI technology will soon become useful as&#8230;]]></description>
										<content:encoded><![CDATA[
<p>With the rapid advancements in large language models (LLMs), transformer-based architectures are increasingly employed as tactic generators and premise selectors in automated theorem-proving systems, generating candidate proof steps or selecting useful premises based on the unfinished proof goal. According to Fields medalist Terence Tao, the new generation of AI technology <a href="https://www.scientificamerican.com/article/ai-will-become-mathematicians-co-pilot/" target="_blank" rel="noreferrer noopener nofollow">will soon become useful as a “co-pilot” for research mathematicians</a>.</p>



<p>However, training LLMs to serve as proof step generators faces a significant limitation: existing mathematical datasets include only correct proof paths. In academic publications, such as textbooks and research papers, mathematicians rarely include failed approaches in their presentations of proofs. Yet, it is almost always these failed attempts that guide them toward discovering valid proofs, and omitting the failed attempts often leaves readers wondering, “How did they get there?”</p>



<p>In our paper, <a href="https://arxiv.org/abs/2404.07382" target="_blank" rel="noreferrer noopener nofollow">Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving</a>, we explored this problem experimentally. Our goal was to assess the influence of trial-and-error information in the training data on the performance of LLMs in theorem proving.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-how-do-mathematicians-develop-proofs">How do mathematicians develop proofs?</h2>



<p>In mathematical research, the number of incorrect attempts vastly outnumbers successful ones. Mathematical reasoning is inherently iterative and nonlinear, involving numerous failed approaches and refinements. The backtracking process, wherein one recognizes a failed path and revisits earlier stages to explore alternative directions, is vital to a mathematician’s chain of thought. Thus, unsuccessful paths not only pave the way to correct proofs but are also valuable as illustrations of structured proof-search techniques.</p>



<p>The primary motivation for using large language models (LLMs) in automated theorem provers (ATPs) is their capability to emulate human reasoning. Our ultimate goal is to capture the comprehensive and systematic methods human mathematicians use in theorem proving and potentially develop novel, superior strategies.</p>



<p>However, at the time we published our paper, approaches to training LLMs for ATPs utilized only data on correct proof attempts. Given that a model trained solely on successful proof steps learns none of the iterative trial-and-error processes mathematicians use, it is unsurprising that despite pre-training on extensive mathematical texts, the available state-of-the-art models exhibited only modest performance on challenging theorem-proving tasks.</p>






<h2 class="wp-block-heading" class="wp-block-heading" id="h-potential-benefits-of-training-with-trial-and-error-information">Potential benefits of training with trial-and-error information</h2>



<p>Now, assume that in addition to a vast collection of polished proofs, we train a model on all the trial-and-error information that could be found in mathematicians&#8217; draft papers or in their minds. What would we expect this model to be capable of?</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-generating-better-proof-step-candidates">Generating better proof-step candidates</h3>



<p>First, we expect the model to have a strong ability to propose high-quality guesses for single proof-step generation. After seeing large amounts of high-quality trial-and-error information during training, the model will learn to make a highly reasonable (although possibly unsuccessful) first attempt when it encounters a new problem.</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-judging-proof-step-candidates-in-reinforcement-learning">Judging proof step candidates in reinforcement learning</h3>



<p>Second, we expect models trained with trial-and-error information to be capable of dynamically evaluating each proof step&#8217;s potential. By “dynamic,” we mean that the confidence score the model internally assigns to the current proof strategy changes as the strategy unfolds. After generating each proof step, the model must decide whether to continue predicting the subsequent step along the current path or to initiate a backtracking operation. A higher probability of backtracking indicates lower confidence in the current proof strategy.</p>

<p>A model equipped with sufficient trial-and-error data should become proficient in assessing the viability of proof strategies. The model could then serve as a reward function in reinforcement learning processes (e.g., <a href="https://arxiv.org/abs/2305.20050" target="_blank" rel="noreferrer noopener nofollow">OpenAI’s work on process supervision</a>), where obtaining high-quality reward functions for intermediate steps is a major challenge.</p>

<p>One caveat is that tracking trial-and-error information for highly complex mathematical problems can easily exceed a model’s context length. We sometimes encountered this problem in our experiments when we asked the model to prove very hard theorems. Once it is no longer possible to feed the entire history of proof attempts and backtraces into the model, we need to summarize it. Further research is required to explore efficient methods for this summarization process.</p>
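<p>To make this backtracking dynamic concrete, here is a schematic depth-first proof search in which a model-assigned confidence score decides whether to push deeper or backtrack. The callables <code>propose_steps</code>, <code>confidence</code>, <code>apply_step</code>, and <code>is_proved</code> are hypothetical stand-ins for the fine-tuned LLM and the proof environment; this is a sketch of the idea, not the paper’s implementation:</p>

<pre class="wp-block-code"><code>def prove(state, propose_steps, confidence, apply_step, is_proved,
          depth=0, max_depth=50, threshold=0.2):
    """Depth-first proof search with model-guided backtracking.

    Returns a list of proof steps on success, or None to signal the
    caller to backtrack and try its next candidate.
    """
    if is_proved(state):
        return []
    if depth >= max_depth:
        return None

    # Try candidate steps in order of decreasing model confidence.
    candidates = sorted(propose_steps(state),
                        key=lambda step: confidence(state, step),
                        reverse=True)
    for step in candidates:
        if confidence(state, step) < threshold:
            break  # confidence too low: backtrack instead of going deeper
        result = prove(apply_step(state, step), propose_steps, confidence,
                       apply_step, is_proved, depth + 1, max_depth, threshold)
        if result is not None:
            return [step] + result

    return None  # every candidate failed; the caller backtracks
</code></pre>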



<h3 class="wp-block-heading" class="wp-block-heading" id="h-going-beyond-the-well-trodden-path">Going beyond the well-trodden path&nbsp;</h3>



<p>Third, we expect a model trained on trial-and-error data to exhibit a strong capacity for thinking &#8220;outside the box.&#8221; Mathematicians often develop truly creative approaches to solving longstanding problems, producing work that impresses with its ingenuity and provokes curiosity about the thought processes involved.</p>



<p>However, except for a few remarkable cases (like the <a href="https://en.wikipedia.org/wiki/Srinivasa_Ramanujan#Mathematical_achievements" target="_blank" rel="noreferrer noopener nofollow">formulas discovered by Ramanujan</a>), most of these breakthroughs are built on extensive knowledge accumulated over time through trial and error. By identifying existing strategies as ineffective—and uncovering <em>why</em> they are inadequate—mathematicians are compelled to consider novel methods. We believe models can acquire this capability from extensive, high-quality trial-and-error information.</p>



<h2 class="wp-block-heading" class="wp-block-heading" id="h-where-do-we-go-from-here">Where do we go from here?</h2>



<p>Overall, we are optimistic about the future of automated reasoning. We speculate that mathematical reasoning is not fundamentally different from traditional NLP tasks and that given sufficient high-quality training data, LLMs can reach human-level performance. As we demonstrate <a href="https://arxiv.org/abs/2404.07382" target="_blank" rel="noreferrer noopener nofollow">in our paper</a>, incorporating trial-and-error information into the training data leads to substantial improvements even with today’s model architectures.</p>



<p>However, as we’ve discussed, the vast majority of current pre-training datasets for mathematical reasoning exhibit significant misalignments with the precise tasks we expect the model to perform. An obvious limitation of our approach is that collecting trial-and-error data from mathematicians is challenging, as tradition and community practice discourage publishing failed attempts. We hope our work can raise the community&#8217;s awareness of the importance of trial-and-error data for automated theorem proving.</p>



<p>New state-of-the-art models (such as Meta’s <a href="https://arxiv.org/abs/2407.21783" target="_blank" rel="noreferrer noopener nofollow">Llama 3 family</a> and OpenAI’s <a href="https://openai.com/o1/" target="_blank" rel="noreferrer noopener nofollow">o1 model</a>) that became available after we published our paper have been trained extensively on trial-and-error reasoning data. This has led to significant performance improvements on traditional mathematical benchmarks, such as the <a href="https://github.com/hendrycks/math" target="_blank" rel="noreferrer noopener nofollow">MATH dataset</a>. Notably, o1 has the capability to <a href="https://openai.com/index/learning-to-reason-with-llms/" target="_blank" rel="noreferrer noopener nofollow">verify its outputs and perform backtracking during inference</a>, informed by previously explored proof searches. We believe this advancement is largely due to the substantial trial-and-error data included in the model&#8217;s training process.</p>



<p>Beyond theorem proving, training with trial-and-error data may play a pivotal role in shaping a new “scaling law of inference,” which complements <a href="https://arxiv.org/pdf/2001.08361" target="_blank" rel="noreferrer noopener nofollow">currently known LLM scaling laws</a>. Allowing the model to generate additional tokens to verify and backtrack on its own output lets it progressively tackle more complex problems. This concept, observed by OpenAI for their o1 model, <a href="https://openai.com/index/learning-to-reason-with-llms/" target="_blank" rel="noreferrer noopener nofollow">was reported</a> as a significant advancement. Furthermore, a <a href="https://arxiv.org/abs/2402.12875" target="_blank" rel="noreferrer noopener nofollow">recent paper</a> mathematically demonstrates that if a transformer is allowed to generate an arbitrary number of tokens, it has the potential to solve arbitrarily complex problems.</p>



<p>If you’d like to explore this space yourself, we’ve published <a href="https://huggingface.co/datasets/KomeijiForce/PropL" target="_blank" rel="noreferrer noopener nofollow">our dataset</a> and <a href="https://huggingface.co/KomeijiForce/llama-2-7b-propositional-logic-prover" target="_blank" rel="noreferrer noopener nofollow">our model weights</a> over at Hugging Face, and you can find <a href="https://github.com/ucsd-atp/PropL" target="_blank" rel="noreferrer noopener nofollow">source code on GitHub</a>. If you’re interested in how trial-and-error data could be used to improve LLM Agents, I recommend the recently published paper <a href="https://arxiv.org/abs/2403.02502" target="_blank" rel="noreferrer noopener nofollow"><em>Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents</em></a>, whose <a href="https://huggingface.co/datasets/agent-eto/eto-sft-trajectory" target="_blank" rel="noreferrer noopener nofollow">dataset is available at Hugging Face</a> as well.</p>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">42845</post-id>	</item>
	</channel>
</rss>
