LLM Fine-Tuning and Model Selection Using Neptune and Transformers
https://neptune.ai/blog/llm-fine-tuning-and-model-selection-with-neptune-transformers | Fri, 19 Jan 2024

Imagine you’re facing the following challenge: you want to develop a Large Language Model (LLM) that can proficiently respond to inquiries in Portuguese. You have a valuable dataset and can choose from various base models. But here’s the catch: you’re working with limited computational resources and can’t rely on expensive, high-power machines for fine-tuning. How do you decide on the right model to use in this scenario?

This post explores these questions, offering insights and strategies for selecting the best model and conducting efficient fine-tuning, even when resources are constrained. We’ll look at ways to reduce a model’s memory footprint, speed up training, and best practices for monitoring.

The workflow we’ll implement: we will fine-tune different foundation LLMs on a dataset, evaluate them, and select the best model.

Large language models

Large Language Models (LLMs) are huge deep-learning models pre-trained on vast amounts of data. These models are usually based on an architecture called transformers. Unlike earlier recurrent neural networks (RNNs), which process inputs sequentially, transformers process entire sequences in parallel. Initially, the transformer architecture was designed for translation tasks, but nowadays it is used for various tasks, ranging from language modeling to computer vision and generative AI.

Below, you can see a basic transformer architecture consisting of an encoder (left) and a decoder (right). The encoder receives the inputs and generates a contextualized interpretation of the inputs, called embeddings. The decoder uses the information in the embeddings to generate the model’s output, one token at a time.

Transformer architecture. On the left side, we can see the encoder, which is composed of a stack of multi-head attention and fully connected layers. On the right side, we can see the decoder, which is composed of a stack of multi-head attention, cross-attention to leverage the information from the encoder, and fully connected layers.
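To make the “one token at a time” decoding loop concrete, here is a minimal greedy-decoding sketch. It uses the small gpt2 checkpoint purely for illustration; any causal language model from the transformers library would work the same way:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Large language models are", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):
        next_id = lm(ids).logits[0, -1].argmax()           # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and repeat
print(tok.decode(ids[0]))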

Hands-on: fine-tuning and selecting an LLM for Brazilian Portuguese

In this project, we’re taking on the challenge of fine-tuning four LLMs: GPT-2, GPT2-medium, GPT2-large, and OPT 125M. The models have 137 million, 380 million, 812 million, and 125 million parameters, respectively. The largest one, GPT2-large, takes up over 3 GB when stored on disk. All these models were trained to generate English-language text.

Our goal is to optimize these models for enhanced performance in Portuguese question answering, addressing the growing demand for AI capabilities in diverse languages. To accomplish this, we’ll need to have a dataset with inputs and labels and use it to “teach” the LLM. Taking a pre-trained model and specializing it to solve new tasks is called fine-tuning. The main advantage of this technique is you can leverage the knowledge the model has to use as a starting point.

Setting up

I have designed this project to be accessible and reproducible, with a setup that can be replicated on a Colab environment using T4 GPUs. I encourage you to follow along and experiment with the fine-tuning process yourself.

Note that I used a V100 GPU to produce the examples below, which is available with a Colab Pro subscription. You can see that I’ve already made a first trade-off between time and money here. Colab does not reveal detailed prices, but on the underlying Google Cloud Platform, a T4 costs $0.35/hour while a V100 costs $2.48/hour. According to this benchmark, a V100 is three times faster than a T4. Thus, by paying roughly seven times more per hour, we cut the training time by two-thirds (at about 2.4 times the total cost).

You can find all the code in two Colab notebooks:

We will use Python 3.10 in our code. Before we begin, we’ll install all the libraries we will need. Don’t worry if you’re not familiar with them yet. We’ll go into their purpose in detail when we first use them:

pip install transformers==4.35.2 bitsandbytes==0.41.3 peft==0.7.0 \
  accelerate==0.25.0 datasets==2.16.1 neptune==1.8.6 evaluate==0.4.1 -qq

Loading and pre-processing the dataset

We’ll use the FaQuAD dataset to fine-tune our models. It’s a Portuguese question-answering dataset available in the Hugging Face dataset collection.

First, we’ll look at the dataset card to understand how the dataset is structured. We have about 1,000 samples, each consisting of a context, a question, and an answer. Our model’s task is to answer the question based on the context. (The dataset also contains a title and an ID column, but we won’t use them to fine-tune our model.)

Each sample in the FaQuAD dataset consists of a context, a question, and the corresponding answer. | Source

We can conveniently load the dataset using the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset("eraldoluis/faquad")

Our next step is to convert the dataset into a format our models can process. For our question-answering task, that’s a sequence-to-sequence format: The model receives a sequence of tokens as the input and produces a sequence of tokens as the output. The input contains the context and the question, and the output contains the answer.

For training, we’ll create a so-called prompt that contains not only the question and the context but also the answer. Using a small helper function, we concatenate the context, question, and answer, divided by section headings (Later, we’ll leave out the answer and ask the model to fill in the “Resposta” section on its own).

We’ll also prepare a helper function that wraps the tokenizer. The tokenizer is what turns the text into a sequence of integer tokens. It is specific to each model, so we’ll have to load and use a different tokenizer for each. The helper function makes that process more manageable, allowing us to process the entire dataset at once using map. Last, we’ll shuffle the dataset to ensure the model sees it in randomized order.

Here’s the complete code:

def generate_prompt(data_point):
    out = f"""Dado o contexto abaixo, responda a questão

### Contexto:
{data_point["context"]}

### Questão:
{data_point["question"]}

### Resposta:
"""
    if data_point.get("answers"):
        out += data_point["answers"]["text"][0]
    return out


CUTOFF_LEN = 1024


def tokenize(prompt, tokenizer):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=CUTOFF_LEN + 1,
        padding="max_length",
    )
    return {
        "input_ids": result["input_ids"][:-1],
        "attention_mask": result["attention_mask"][:-1],
    }

Loading and preparing the models

Next, we load and prepare the models that we’ll fine-tune. LLMs are huge models. Without any kind of optimization, for the GPT2-large model in full precision (float32), we have around 800 million parameters, and we need 2.9 GB of memory to load the model and 11.5 GB during the training to handle the gradients. That just about fits in the 16 GB of memory that the T4 in the free tier offers. But we would only be able to compute tiny batches, making training painfully slow.

Faced with these memory and compute resource constraints, we’ll not use the models as-is but use quantization and a method called LoRA to reduce their number of trainable parameters and memory footprint.

Quantization

Quantization is a technique used to reduce a model’s size in memory by using fewer bits to represent its parameters. For example, instead of using 32 bits to represent a floating point number, we’ll use only 16 or even as little as 4 bits.

This approach can significantly decrease the memory footprint of a model, which is especially important when deploying large models on devices with limited memory or processing power. By reducing the precision of the parameters, quantization can lead to a faster inference time and lower power consumption. However, it’s essential to balance the level of quantization with the potential loss in the model’s task performance, as excessive quantization can degrade accuracy or effectiveness.
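As a back-of-the-envelope illustration of why this matters (weights only; gradients, activations, and optimizer states come on top), here is what the parameter count quoted above for GPT2-large translates to at different precisions:

N_PARAMS = 812_000_000  # GPT2-large, as cited above

for bits in (32, 16, 8, 4):
    gib = N_PARAMS * bits / 8 / 1024**3  # bytes per parameter = bits / 8
    print(f"{bits:>2}-bit: {gib:.2f} GiB")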

The Hugging Face transformers library has built-in support for quantization through the bitsandbytes library. You can pass `load_in_8bit=True` or `load_in_4bit=True` to the `from_pretrained()` model loading method to load a model with 8-bit or 4-bit precision, respectively.

After loading the model, we call the wrapper function prepare_model_for_kbit_training from the peft library. It prepares the model for training in a way that saves memory. It does this by freezing the model parameters, making sure all parts use the same type of data format, and using a special technique called gradient checkpointing if the model can handle it. This helps in training large AI models, even on computers with little memory.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import prepare_model_for_kbit_training
from peft import get_peft_model, LoraConfig

model_name = "gpt2-large"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

model = prepare_model_for_kbit_training(model)

After quantizing the model to 8 bits, loading and training it take only about a fourth of the memory. For GPT2-large, instead of needing 2.9 GB to load, it now takes only 734 MB.

LoRA

As we know, Large Language Models have a lot of parameters. When we want to fine-tune one of these models, we usually update all the model’s weights. That means we need to keep all the gradient states in memory during fine-tuning, which requires almost twice the model’s size in memory. Sometimes, updating all parameters can also disrupt what the model has already learned, leading to worse generalization.

Given this context, a team of researchers proposed a new technique called Low-Rank Adaptation (LoRA). This reparametrization method aims to reduce the number of trainable parameters through low-rank decomposition.

Low-rank decomposition approximates a large matrix into a product of two smaller matrices, such that multiplying a vector by the two smaller matrices yields approximately the same results as multiplying a vector by the original matrix. For example, we could decompose a 3×3 matrix into the product of a 3×1 and a 1×3 matrix so that instead of having nine parameters, we have only six.

Low-rank decomposition is a method to split a large matrix M into a product of two smaller matrices, L and R, that approximates it.
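Here is that 3×3 example in code, a toy sketch using NumPy:

import numpy as np

L = np.array([[1.0], [2.0], [3.0]])  # 3x1 factor
R = np.array([[4.0, 5.0, 6.0]])      # 1x3 factor
M = L @ R                            # full 3x3 matrix, but only 6 free parameters

x = np.array([1.0, 1.0, 1.0])
# Multiplying by the two factors gives the same result as multiplying by M,
# without ever materializing the full matrix:
assert np.allclose(M @ x, L @ (R @ x))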

When fine-tuning a model, we want to slightly change its weights to adapt it to the new task. More formally, we’re looking for new weights derived from the original weights: W_new = W_old + ΔW. Looking at this equation, you can see that we keep the original weights frozen and just learn the update ΔW in the form of LoRA matrices.

In other words, you can freeze your original weights and train just the two LoRA matrices, which have substantially fewer parameters in total. Or, put simply, you create a set of new weights in parallel with the original weights and only train the new ones. During inference, you pass your input through both sets of weights and sum the outputs at the end.

Fine-tuning using low-rank decomposition. In blue, we can see the original set of weights of the pre-trained model; those will be frozen during fine-tuning. In orange, we can see the low-rank matrices A and B, which will have their weights updated during fine-tuning.
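In code, the idea looks roughly like this. This is a minimal sketch of the parallel path, not the actual peft internals, and the dimensions are illustrative:

import torch

d, r, alpha = 1280, 8, 16
W = torch.randn(d, d)         # frozen pre-trained weight
A = torch.randn(r, d) * 0.01  # trainable low-rank matrix
B = torch.zeros(d, r)         # trainable, zero-initialized so training starts at W
x = torch.randn(d)

y = W @ x + (B @ (A @ x)) * (alpha / r)  # original path plus scaled low-rank update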

With our base model loaded, we now want to add the LoRA layers in parallel with the original model weights for fine-tuning. To do this, we need to define a LoraConfig.

Inside the LoraConfig, we can define the rank of the LoRA matrices (parameter r), i.e., the dimension of the vector space spanned by the matrix columns. We can also look at the rank as a measure of how much compression we are applying to our matrices, i.e., how small the bottleneck between A and B in the figure above will be.

When choosing the rank, it is essential to keep in mind the trade-off between the rank of your LoRA matrices and the learning process. Smaller ranks mean less room to learn: with fewer parameters to update, it can be harder to achieve significant improvements. Higher ranks provide more parameters, allowing greater flexibility and adaptability during training, but this increased capacity comes at the cost of additional computational resources and potentially longer training times. Finding a rank that balances these factors well is crucial, and the best way to find it is by experimenting! A good approach is to start with lower ranks (8 or 16), as you will have fewer parameters to update, so training will be faster, and increase the rank if you see the model is not learning as much as you want.
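To get a feel for the numbers, here is a rough count of the trainable parameters LoRA adds to a single d×d projection at different ranks (d = 1280, GPT2-large’s hidden size, is used for illustration; the exact layer shapes differ):

d = 1280  # illustrative hidden size

for r in (4, 8, 16, 64):
    lora_params = 2 * d * r  # A is (r x d), B is (d x r)
    print(f"r={r:>2}: {lora_params:,} trainable params "
          f"({lora_params / d**2:.2%} of the full matrix)")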

You also need to define which modules inside the model you want to apply the LoRA technique to. You can think of a module as a set of layers (or a building block) inside the model. If you want to know more, I’ve prepared a deep dive, but feel free to skip it.

Deep dive: which modules can and should you apply LoRA to?

Within the LoraConfig, you need to specify which modules to apply LoRA to. You can apply LoRA for most of a model’s modules, but you need to specify the module names that the original developers assigned at model creation. Which modules exist, and their names are different for each model.

The LoRA paper reports that adding LoRA layers only to the query and value linear projections is a good trade-off compared to adding LoRA layers to all linear projections in attention blocks. In our case, for the GPT-2 models, we will apply LoRA to the c_attn layers, as the query, key, and value weights are not split into separate matrices there, and for the OPT model, we will apply LoRA to q_proj and v_proj.

If you use other models, you can print the modules’ names and choose the ones you want:

list(model.named_modules())

In addition to specifying the rank and modules, you must also set up a hyperparameter called alpha, which scales the LoRA matrix:

scaling = alpha / r
weight += (lora_B @ lora_A) * scaling 

As a rule of thumb (as discussed in this article by Sebastian Raschka), you can start by setting alpha to two times the rank r. If your results are not good, you can try lower values.

Here’s the complete LoRA configuration for our experiments:

config = LoraConfig(
   r=8,
   lora_alpha=16,
   target_modules=["c_attn"],  # for gpt2 models
   # target_modules=["q_proj", "v_proj"],  # for opt models
   lora_dropout=0.1,
   bias="none",
   task_type="CAUSAL_LM",
)

We can apply this configuration to our model by calling

model = get_peft_model(model, config)

Now, just to show how many parameters we are saving, let’s print the trainable parameters of GPT2-large:

model.print_trainable_parameters()
>> trainable params: 2,949,120 || all params: 776,979,200 || trainable%: 0.3795622842928099

We can see that we are updating less than 1% of the parameters! What an efficiency gain!

Fine-tuning the models

With the dataset and models prepared, it’s time to move on to fine-tuning. Before we start our experiments, let’s take a step back and consider our approach. We’ll be training four different models with different modifications and using different training parameters. We’re not only interested in the model’s performance but also have to work with constrained resources.

Thus, it will be crucial that we keep track of what we’re doing and progress as systematically as possible. At any point in time, we want to ensure that we’re moving in the right direction and spending our time and money wisely.

What is essential to log and monitor during the fine-tuning process?

Aside from monitoring standard metrics like training and validation loss and training parameters such as the learning rate, in our case, we also want to be able to log and monitor other aspects of the fine-tuning:

  1. Resource Utilization: Since you’re operating with limited computational resources, it’s vital to keep a close eye on GPU and CPU usage, memory consumption, and disk usage. This ensures you’re not overtaxing your system and can help troubleshoot performance issues.
  2. Model Parameters and Hyperparameters: To ensure that others can replicate your experiment, storing all the details about the model setup and the training script is crucial. This includes the architecture of the model, such as the sizes of the layers and the dropout rates, as well as the hyperparameters, like the batch size and the number of epochs. Keeping a record of these elements is key to understanding how they affect the model’s performance and allowing others to recreate your experiment accurately.
  3. Epoch Duration and Training Time: Record the duration of each training epoch and the total training time. This data helps assess the time efficiency of your training process and plan future resource allocation.

Set up logging with neptune.ai

neptune.ai is the most scalable experiment tracker for teams that train foundation models. It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye. Neptune is integrated with the transformers library’s Trainer module, allowing you to log and monitor your model training seamlessly. This integration was contributed by Neptune’s developers, who maintain it to this day.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

To use Neptune, you’ll have to sign up for an account first (don’t worry, it’s free for personal use) and create a project in your workspace. Have a look at the Quickstart guide in Neptune’s documentation. There, you’ll also find up-to-date instructions for obtaining the project and token IDs you’ll need to connect your Colab environment to Neptune.

We’ll set these as environment variables:

import os
os.environ["NEPTUNE_PROJECT"] = "your-project-ID-goes-here"
os.environ["NEPTUNE_API_TOKEN"] = "your-API-token-goes-here"

There are two options for logging information from transformers training to Neptune: you can either set report_to="neptune" in the TrainingArguments or pass an instance of NeptuneCallback to the Trainer’s callbacks parameter. I prefer the second option because it gives me more control over what I log. Note that if you pass a logging callback, you should set report_to="none" in the TrainingArguments to avoid duplicate data being reported.

Below, you can see how I typically instantiate the NeptuneCallback. I specified a name for my experiment run and asked Neptune to log all parameters used and the hardware metrics. Setting log_checkpoints="last" ensures that the last model checkpoint will also be saved on Neptune.

from transformers.integrations import NeptuneCallback

neptune_callback = NeptuneCallback(
    name=f"fine-tuning-{model_name}",
    log_parameters=True,
    log_checkpoints="last",
    capture_hardware_metrics=True,
)

Training a model

As the last step before configuring the Trainer, it’s time to tokenize the dataset with the model’s tokenizer. Since we’ve loaded the tokenizer together with the model, we can now put the helper function we prepared earlier into action:

tokenized_datasets = dataset.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))

The training is managed by a Trainer object. The Trainer uses a DataCollatorForLanguageModeling, which prepares the data in a way suitable for language model training.
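As context on what the collator does: with mlm=False, it batches the tokenized examples and sets the labels equal to the input IDs (the causal shift happens inside the model). A quick, illustrative way to inspect this:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token
collator = DataCollatorForLanguageModeling(tok, mlm=False)

batch = collator([tok("Olá, mundo!")])
print(batch["input_ids"])
print(batch["labels"])  # mirrors input_ids (padding positions become -100)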

Here’s the full setup of the Trainer:

from transformers import (
    Trainer,
    TrainingArguments,
    GenerationConfig,
    DataCollatorForLanguageModeling,
    set_seed,
)

set_seed(42)

EPOCHS = 20
GRADIENT_ACCUMULATION_STEPS = 8
MICRO_BATCH_SIZE = 8
LEARNING_RATE = 2e-3
WARMUP_STEPS = 100
LOGGING_STEPS = 20

trainer = Trainer(
    model=model,
    train_dataset=tokenized_datasets["train"],
    args=TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=WARMUP_STEPS,
        num_train_epochs=EPOCHS,
        learning_rate=LEARNING_RATE,
        output_dir="lora-faquad",
        logging_steps=LOGGING_STEPS,
        save_strategy="epoch",
        gradient_checkpointing=True,
        report_to="none",
    ),
    callbacks=[neptune_callback],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False

That’s a lot of code, so let’s go through it in detail:

  • The training process is defined to run for 20 epochs (EPOCHS = 20). You’ll likely find that training for even more epochs will lead to better results.
  • We’re using a technique called gradient accumulation, set here to 8 steps (GRADIENT_ACCUMULATION_STEPS = 8), which helps handle larger effective batch sizes when memory resources are limited. Instead of using a batch of 64 samples and updating the weights every step, we use a batch of 8 samples and accumulate gradients over eight steps, only updating the weights on the last one. This produces the same result as a batch of 64 but saves memory (see the sketch after this list).
  • The MICRO_BATCH_SIZE is set to 8, indicating the number of samples processed in each step. It is extremely important to find a batch size that fits in your GPU memory during training to avoid out-of-memory issues (have a look at the transformers documentation to learn more about this).
  • The learning rate, a crucial hyperparameter in training neural networks, is set to 0.002 (LEARNING_RATE = 2e-3), determining the step size at each iteration when moving toward a minimum of the loss function. To facilitate a smoother and more effective training process, the model will gradually increase its learning rate for the first 100 steps (WARMUP_STEPS = 100), helping to stabilize early training phases.
  • The trainer is set not to use the model’s cache (model.config.use_cache = False) to manage memory more efficiently.
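To make the gradient accumulation mechanics concrete, here is a self-contained toy loop showing what the Trainer does for us under the hood (schematic only; the real Trainer also handles loss scaling, scheduling, and much more):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3)
data, targets = torch.randn(64, 10), torch.randn(64, 1)

ACC_STEPS, MICRO_BS = 8, 8  # 8 micro-batches of 8 samples = effective batch of 64

optimizer.zero_grad()
for step in range(ACC_STEPS):
    x = data[step * MICRO_BS:(step + 1) * MICRO_BS]
    y = targets[step * MICRO_BS:(step + 1) * MICRO_BS]
    loss = torch.nn.functional.mse_loss(model(x), y) / ACC_STEPS  # scale so gradients average
    loss.backward()  # gradients accumulate in each parameter's .grad
optimizer.step()     # one update, equivalent to a single batch of 64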

With all of that in place, we can launch the training:

trainer_output = trainer.train(resume_from_checkpoint=False)

While training is running, head over to Neptune, navigate to your project, and click on the experiment that is running. There, click on Charts to see how your training progresses (loss and learning rate). To see resource utilization, click the Monitoring tab and follow how GPU and CPU usage and memory utilization change over time. When the training finishes, you can see other information like training samples per second, training steps per second, and more.

At the end of the training, we capture the output of this process in trainer_output, which typically includes details about the training performance and metrics that we will later use to save the model on the model registry.

But first, we’ll have to check whether our training was successful.

Aside

Track months-long model training with more confidence. Use neptune.ai’s forking feature to iterate faster and optimize the usage of GPU resources.

With Neptune, users can visualize forked training out of the box. This means you can:

  • Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
  • Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

Evaluating the fine-tuned LLMs

Model evaluation in AI, particularly for language models, is a complex and multifaceted task. It involves navigating a series of trade-offs among cost, data applicability, and alignment with human preferences. This process is critical in ensuring that the developed models are not only technically proficient but also practical and user-centric.

LLM evaluation approaches

LLM evaluation approaches
Diagram of different evaluation strategies organized by evaluation metrics and data | Modified based on source

The chart above shows that the least expensive (and most commonly used) approach is to use public benchmarks. On the one hand, this approach is highly cost-effective and easy to test. However, on the other hand, it is less likely to resemble production data. Another option, slightly more costly than benchmarks, is AutoEval, where other language models are used to evaluate the target model. For those with a higher budget, user testing, where the model is made accessible to users, or human evaluation, which involves a dedicated team of humans focused on assessing the model, is an option.

Evaluating question-answering models with F1 scores and the exact match metric

In our project, considering the need to balance cost-effectiveness with maintaining evaluation standards for the dataset, we will employ two specific metrics: exact match and F1 score. We’ll use the validation set provided along with the FaQuAD dataset. Hence, our evaluation strategy falls into the “Public Benchmarks” category, as it relies on a well-known dataset to evaluate Brazilian Portuguese (PT-BR) models.

The exact match metric determines if the response given by the model precisely aligns with the target answer. This is a straightforward and effective way to assess the model’s accuracy in replicating expected responses. We’ll also calculate the F1 score, which combines precision and recall, of the returned tokens. This will give us a more nuanced evaluation of the model’s performance. By adopting these metrics, we aim to assess our model’s capabilities reliably without incurring significant expenses.

As we said previously, there are various ways to evaluate an LLM, and we chose this approach, using standard metrics, because it is fast and cheap. However, such “hard” metrics come with a trade-off: a response can be correct even when the metrics score it poorly.

One example is: imagine the target answer for some question is “The rat found the cheese and ate it.” and the model’s prediction is “The mouse discovered the cheese and consumed it.” Both examples have almost the same meaning, but the words chosen differ. For metrics like exact match and F1, the scores will be really low. A better – but more costly – evaluation approach would be to have humans annotate or use another LLM to verify if both sentences have the same meaning.
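You can see this effect directly with the evaluate library (using the same exact_match metric we load in the evaluation code below):

import evaluate

exact_match = evaluate.load("exact_match")
reference = "The rat found the cheese and ate it."
prediction = "The mouse discovered the cheese and consumed it."

# Same meaning, different words: the score is 0.0
print(exact_match.compute(predictions=[prediction], references=[reference]))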

Implementing the evaluation functions

Let’s return to our code. I’ve decided to create my own evaluation functions instead of using the Trainer’s built-in capabilities to perform the evaluation. On the one hand, this gives us more control. On the other hand, I frequently encountered out-of-memory (OOM) errors while doing evaluations directly with the Trainer.

For our evaluation, we’ll need two functions:

  • get_logits_and_labels: Processes a sample, generates a prompt from it, passes this prompt through a model, and returns the model’s logits (scores) along with the token IDs of the target answer.
  • compute_metrics: Evaluates a model on a dataset, calculating exact match (EM) and F1 scores. It iterates through the dataset, using the get_logits_and_labels function to generate model predictions and corresponding labels. Predictions are determined by selecting the most likely token indices from the logits. For the EM score, it decodes these predictions and labels into text and computes the EM score. For the F1 score, it maintains the original token IDs and calculates the score for each sample, averaging them at the end.

Here’s the complete code:

import evaluate
import numpy as np
import torch
from tqdm.auto import tqdm


def get_logits_and_labels(sample_, max_new_tokens):
    sample = sample_.copy()
    del sample["answers"]  # remove the answer so the prompt ends at "Resposta:"
    prompt = generate_prompt(sample)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    attention_mask = inputs["attention_mask"].cuda()
    generation_output = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens,
        num_beams=1,
        do_sample=False,
    )
    target_ids = tokenizer(sample_["answers"]["text"][0], return_tensors="pt")
    scores = torch.concat(generation_output["scores"])
    return scores.cpu(), target_ids["input_ids"]


def compute_metrics(dataset, max_new_tokens):
    metric1 = evaluate.load("exact_match")
    metric2 = evaluate.load("f1")

    em_preds = []
    em_refs = []
    f1_preds = []
    f1_refs = []
    for s in tqdm(dataset):
        logits, labels = get_logits_and_labels(s, max_new_tokens)
        predictions = np.argmax(logits, axis=-1)[: len(labels[0])]
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        labels = labels[0, : len(predictions)]
        f1_preds.append(predictions)
        f1_refs.append(labels)

        em_pred = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        em_ref = tokenizer.batch_decode(labels, skip_special_tokens=True)
        em_preds.append("".join(em_pred))
        em_refs.append("".join(em_ref))

    em = metric1.compute(predictions=em_preds, references=em_refs)["exact_match"]

    f1_result = 0
    for pred, ref in zip(f1_preds, f1_refs):
        f1_result += metric2.compute(predictions=pred, references=ref, average="macro")["f1"]
    return em, f1_result / len(f1_preds)

Before assessing our model, we must switch it to evaluation mode, which deactivates dropout. Additionally, we should re-enable the model’s cache to conserve memory during prediction.

model.eval()
model.config.use_cache = True  # We need this to avoid OOM issues

Following this setup, simply execute the compute_metrics function on the evaluation dataset and specify the desired number of generated tokens to use (Note that using more tokens will increase processing time).

em, f1 = compute_metrics(tokenized_datasets["validation"], max_new_tokens=5)

Storing the models and evaluation results

Now that we’ve finished fine-tuning and evaluating a model, we should save it and move on to the next model. To this end, we’ll create a model_version to store in Neptune’s model registry.

In detail, we’ll save the latest model checkpoint along with the loss, the F1 score, and the exact match metric. These metrics will later allow us to select the optimal model. To create a model version, you first define a model key, an identifier that must be uppercase and unique within the project. To reference this model when creating a model version, you concatenate that key with the project identifier, which you can find on Neptune under “All projects” > “Edit project information” > “Project key”.

import neptune

try:
    neptune_model = neptune.init_model(
        key="QAPTBR",  # must be uppercase and unique within the project
        name="ptbr qa model",  # optional
    )
except neptune.exceptions.NeptuneModelKeyAlreadyExistsError:
    print("Model already exists in this project. Reusing it.")

model_version = neptune.init_model_version(
    model="LLMFIN-QAPTBR",  # project key + model key
)
model_version["model/artifacts"].upload_files("/content/lora-faquad/checkpoint-260")
model_version["model/model-name"] = model_name
model_version["model/loss"] = trainer_output.training_loss
model_version["model/exact-match"] = em
model_version["model/f1"] = f1

Model selection

Once we’re done with all our model training and experiments, it’s time to jointly evaluate them. This is possible because we monitored the training and stored all the information on Neptune. Now, we’ll use the platform to compare different runs and models to choose the best one for our use case.

After completing all your runs, you can click Compare runs at the top of the project’s page and enable the “small eye” for the runs you want to compare. Then, you can go to the Charts tab, and you will find a joint plot of the losses for all the experiments. Here’s how it looks in my project. In purple, we can see the loss for the gpt2-large model. As we trained for fewer epochs, we can see that we have a shorter curve, which nevertheless achieved a better loss.

Comparison of the loss across different experiments. Purple: gpt2-large. Yellow: opt-125m. Red: gpt2-medium. Gray: gpt2.

The loss function is not yet saturated, indicating that our models still have room for growth and could likely achieve higher levels of performance with additional training time.

Go to the Models page and click on the model you created. You will see an overview of all the versions you trained and uploaded. You can also see the metrics reported and the model name.

Model versions saved on Neptune’s model registry. Listed are the model version’s ID, the time of creation, the owner, and the metrics stored with the model version.

You’ll notice that none of the model versions have been assigned to a “Stage” yet. Neptune allows you to assign models to different stages, namely “Staging,” “Production,” and “Archived.”

While we can promote a model through the UI, we’ll return to our code and automatically identify the best model. For this, we first fetch all model versions’ metadata, sort by the exact match and f1 scores, and promote the best model according to these metrics to production:

import neptune

model = neptune.init_model(with_id="LLMFIN-QAPTBR")
model_versions_df = model.fetch_model_versions_table().to_pandas()

df_sorted = model_versions_df.sort_values(
    ["model/exact-match", "model/f1"], ascending=False
)
model_version = df_sorted.iloc[0]["sys/id"]
model_name = df_sorted.iloc[0]["model/model-name"]

model_version = neptune.init_model_version(
    with_id=model_version,
)
model_version.change_stage("production")

After executing this, we can see, as expected, that gpt2-large (our largest model) was the best model and was chosen to go to production:

The gpt2-large model achieved the best metric scores and was promoted to the “Production” stage.

Once more, we’ll return to our code and finally use our best model to answer questions in Brazilian Portuguese:

import neptune
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model = neptune.init_model(with_id="LLMFIN-QAPTBR")
model_versions_df = model.fetch_model_versions_table().to_pandas()

df_prod_model = model_versions_df[model_versions_df["sys/stage"] == "production"]
model_version = df_prod_model.iloc[0]["sys/id"]
model_name = df_prod_model.iloc[0]["model/model-name"]

model_version = neptune.init_model_version(
    with_id=model_version,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

model_version["model/artifacts"].download()

!unzip artifacts

model = PeftModel.from_pretrained(model, "/content/lora-faquad/checkpoint-260", local_files_only=True)

Model inference before and after fine-tuning. The text shows a small piece of information about the rules to pass a course and asks: “What does passing the course depend on?” Before fine-tuning, the model only repeats the question. After fine-tuning, the model answers the question correctly.

Let’s compare the prediction without fine-tuning and the prediction after fine-tuning. As demonstrated, before fine-tuning, the model didn’t know how to handle Brazilian Portuguese at all and answered by repeating some part of the input or returning special characters like “##########.” However, after fine-tuning, it becomes evident that the model handles the input much better, answering the question correctly (it only added a “?” at the end, but the rest is exactly the answer we’d expect).

We can also look at the metrics before and after fine-tuning and verify how much it improved:

 
                     Exact Match    F1
Before fine-tuning   0              0.007
After fine-tuning    0.143          0.157

Given the metrics and the prediction example, we can conclude that the fine-tuning was in the right direction, even though we have room for improvement.

How to improve the solution?

In this article, we’ve detailed a simple and efficient technique for fine-tuning LLMs.

Of course, we still have some way to go to achieve good performance and consistency. There are various additional, more advanced strategies you can employ, such as:

  • More Data: Add more high-quality, diverse, and relevant data to the training set to improve the model’s learning and generalization.
  • Tokenizer Merging: Combine tokenizers for better input processing, especially for multilingual models.
  • Model-Weight Tuning: Directly adjust the pre-trained model weights to fit the new data better, which can be more effective than tuning adapter weights.
  • Reinforcement Learning with Human Feedback: Employ human raters to provide feedback on the model’s outputs, which is used to fine-tune the model through reinforcement learning, aligning it more closely with complex objectives.
  • More Training Steps: Increasing the number of training steps can further enhance the model’s understanding and adaptation to the data.

Conclusion

We engaged in four distinct trials throughout our experiments, each employing a different model. We’ve used quantization and LoRA to reduce the memory and compute resource requirements. Throughout the training and evaluation, we’ve used Neptune to log metrics and store and manage the different model versions.

I hope this article inspired you to explore the possibilities of LLMs further. In particular, if you’re a native speaker of a language that’s not English, I’d like to encourage you to explore fine-tuning LLMs in your native tongue.

Deploying Conversational AI Products to Production With Jason Flaks
https://neptune.ai/blog/deploying-conversational-ai-products-with-jason-flaks | Tue, 18 Jul 2023

This article was originally an episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to Jason Flaks about deploying conversational AI products to production.

You can watch it on YouTube:

Or listen to it as a podcast.

But if you prefer a written version, here you have it! 

In this episode, you will learn about: 

  1. How to develop products with conversational AI
  2. The requirements for deploying conversational AI products
  3. Whether it’s better to build products on proprietary data in-house or use off-the-shelf models
  4. Testing strategies for conversational AI
  5. How to build conversational AI solutions for large-scale enterprises

Sabine: Hello everyone, and welcome back to another episode of MLOps Live. I’m Sabine, your host, and I’m joined, as always, by my co-host Stephen.

Today, we have Jason Flaks with us, and we’ll be talking about deploying conversational AI products to production. Hi, Jason, and welcome.

Jason:  Hi Sabine, how’s it going? 

Sabine:  It’s going very well, and looking forward to the conversation.

Jason, you are the co-founder and CTO of Xembly. It’s an automated chief of staff that automates conversational tasks. So it’s a bit like an executive assistant bot, is that correct?

Jason: Yeah, that’s a great way to frame it. The CEOs of most companies have people assisting them, maybe an executive assistant, maybe a chief of staff. That way, the CEO can focus their time on the really important and meaningful tasks that power the company. The assistants are there to help handle some of the other tasks in their day, like scheduling meetings or taking meeting notes.

We are aiming to automate that functionality so that every worker in an organization can have access to that help, just like a CEO or someone else in the company would. 

Sabine: Awesome.

We’ll be digging into that a bit deeper in just a moment. So just to ask a little bit about your background here, you have a pretty interesting one. 

You have a bit of education in music composition, math, and science before you get more into the software engineering side of things. But you have started out in software design engineering, is that correct?

Jason: Yeah, that’s right. 

As you mentioned, I did start out earlier in my life as a musician. I had a passion for a lot of the electronic equipment that came from music, and I was good at math as well.

I started in college as a music composition major and a math major and then was ultimately looking for some way to combine those two. I landed in a master’s program that was an electrical engineering program exclusively focused on professional audio equipment, and that led me to an initial career in signal processing, doing software design. 

That was kind of my out-of-the-gate job.

Sabine: So you find yourself in the intersection of different interesting areas, I guess.

Jason: Yeah, that’s right. I’ve really always tried to stay a little bit close to home around music and audio and engineering, even to this day.

While I’ve drifted a little bit away from professional audio, music, live sound, speech, and natural language, it’s still tightly coupled into the audio domain, so that’s remained kind of a piece of my skill set throughout my whole career.

Sabine: Absolutely. And on the topic of equipment, you were involved in developing the Kinect for the Xbox, right?

Was that your first touch with speech recognition, a machine learning application? 

Jason:  That’s a great question. The funny thing about speech recognition is it’s really a two-stage pipeline: 

The first component of most speech recognition systems, at least historically, is extracting features. That’s very much in the audio signal processing domain, something that I had a lot of expertise in from other parts of my career.

While I wasn’t doing speech recognition, I was familiar with fast Fourier transforms and a lot of the componentry that goes into the front end of the speech recognition stack.

But you’re correct to say that when I joined the Kinect camera team, it was kind of the first time that speech recognition was really put in front of me. I naturally gravitated towards it because I deeply understood that early part of the stack.

And I found it was really easy for me to transition from the world of audio signal processing, where I was trying to make guitar distortion effects, to suddenly breaking down speech components for analysis. It really made sense to me, and that’s where I kind of got my start. 

It was a super compelling project to get my start on because the Kinect camera was really the first consumer commercial product that did open-microphone, no push-to-talk speech recognition. At that point in time, there were no products in the market that allowed you to talk to a device without pushing a button.

You always had to push something and then speak to it. We all have Alexa or Google Home devices now. Those are common, but before those products existed, there was the Xbox Kinect camera.

You can go traverse the patent literature and see how the Alexa device references back to those original Kinect patents. It was truly an innovative product.
You can go traverse the patent literature and see how the Alexa device references back to those original Connect patents. It was truly an innovative product.

Sabine: Yeah, and I remember I once had a lecturer who said that about human speech, that it’s the single most complicated signal in the universe, so I guess there is no shortage of challenges in that area in general.

Jason: Yeah, that’s really true.

What is conversational AI? 

Sabine: Right, so, Jason, to kind of warm you up a bit… In 1 minute, how would you explain conversational AI?

Jason: Wow, the 1 minute challenge. I’m excited… 

So human dialogue or conversation is basically an unbounded, infinite domain. Conversational AI is about building technology and products that are capable of interacting with humans in this unbounded conversational domain space. 

So how do we build things that can understand what you and I are talking about, partake in the conversation, and actually transact on the dialogue as it happens as well.

Sabine: Awesome. And that was very well condensed. It was like, well, within the minute.

Jason: I felt a lot of pressure to go so fast that I overdid it.

What aspects of conversational AI is Xembly currently working on? 

Sabine: I wanted to ask a little bit about what your team is working on now. Are there any particular aspects of conversational AI that you’re working on?

Jason: Yeah, that’s a really good question. So there are really two sides of the conversational AI stack that we work on. 

Chatbot

This is about enabling people to engage with our product via conversational speech. As we kind of mentioned at the start of this conversation, we are aiming to be an automated chief of staff or an executive assistant. 

The way you interact with someone in that role is generally conversationally, and so our ability to respond to employees via conversation is super helpful.

Automated note-taking 

The question becomes, how do we sit in a conversation like this over Zoom or Google Meet or any other video conference provider and generate well-written prose notes that you would immediately send out to the people in the meeting, explaining what happened in the meeting?

So this is not just a transcript. This is how we extract the action items and decisions and roll up the meeting into a readable summary such that if you weren’t present, you would know what happened. 

Those are probably the two big pieces of what we’re doing in the conversational AI space, and there’s a lot more to what makes that happen, but those are kind of the two big product buckets that we’re covering today.

Sabine: So if you could sum it up on a high level, how do you go about developing this for your product?

Jason: Yeah, so let’s talk about notetaking. I think that’s an interesting one to walk through… 

The first step for us is to break down the problem. 

Meeting notes is actually a really complicated thing on some level. There’s a little nuance to how every human being sends different notes, so it required us to take a step back to figure out – 

What’s the nugget of what makes meeting notes valuable to people and can we quantify it into something that’s structured that we could repeatedly generate?

Machines don’t deal well with ambiguity. You need to have a structured definition around what you’re trying to do so your data annotators can label information for you. 

If you can’t give them really good instructions on what they’re trying to label, you’re going to get wishy-washy results. 

But also just because in general, if you really want to build a crisp concrete system that produces repeatable results, you really need to define the system, so we spend a lot of time upfront just figuring out what is the structure of proper meeting notes. 

In our early days, we definitely landed on the notion that there are really two critical pieces to all meeting notes. 

  1. The actions that come out of the meeting that people need to follow up on.
  2. A linear recap that summarizes what happened in the meeting – ideally topic-bounded so that it covers the sections of the meeting as they happened.

Once you have that framing, you have to make that next leap to then define what those individual pieces look like so that you understand what the different models in the pipeline that you need to build to actually achieve it. 

Scope of the conversational AI problem statements

Sabine: Was there anything else you wanted to add to that?

Jason: Yeah, so if we think just a little bit about something like action items so how does one go about defining that space so that it’s something tractable for a machine to find? 

A good example is that in almost every meeting, people say things like I’m going to go and walk my dog because they’re just conversing with people in the meeting about things they’re going to do that’s non-work related. 

So you have things in a meeting that are non-work-related, you have things that are actually being transacted on in the meeting at that moment (“I’m going to update that row in the spreadsheet”), and then you have true action items: things that are actually work, that must be initiated after the meeting happens, and that someone on that call is accountable for.

So how do you scope that and really refine that into a very particular domain that you can teach a machine to find? 

Turns out to be a super challenging problem. We’ve spent a lot of effort doing all that scoping and then initiating the data collection process so that we can start building these models. 

On top of that, you have to figure out the pipeline to build these conversational AI systems; it’s actually twofold:

  1. There’s understanding the dialogue itself – just understanding the speech. But to transact on that data, in a lot of cases, requires that you normalize it into something that a machine understands. A good example is dates and times.
  2. Part one of the system is understanding that someone said, “I’ll do that next week,” but that’s insufficient to transact on, on its own. If you want to transact on “next week,” you have to actually understand in computer language what next week actually means.

That means you have some reference to what the current date is. You need to be clever enough to know that “next week” actually means some time range that falls in the week after the current one.
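As a toy illustration of this kind of normalization (not Xembly’s implementation), resolving “next week” against a reference date might look like:

from datetime import date, timedelta

def next_week_range(today: date) -> tuple[date, date]:
    # "Next week" = Monday through Sunday of the week after the current one
    days_until_next_monday = 7 - today.weekday()
    start = today + timedelta(days=days_until_next_monday)
    return start, start + timedelta(days=6)

print(next_week_range(date(2023, 1, 3)))  # (2023-01-09, 2023-01-15)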

There’s a lot of complexity and different models you have to run to be able to do all of that and be successful at it. 

Getting a conversational AI product ready

Stephen: Awesome… I’m sort of looking at digging more deeper into the note-taking that’s the product you talked about. 

I’m going to be coming from the angle of production, of course, getting that to reward users, and the ambiguity stems from there.

So before I go into that complexity, I want to understand how do you deploy such products? I want to know whether there are specific nuances or requirements you put in place or if this is just typical pipeline deployment and then workflow, and then that’s it. 

Jason: Yeah, that’s a good question. 

I’d say, first and foremost, probably one of the biggest differences in conversational AI deployments in this notetaking stack, perhaps from the larger traditional machine learning space that exists in the world, relates to what we were talking about earlier because it’s an unbounded domain. 

Fast, iterative data labeling is absolutely critical to our stack. And if you think about how conversation or dialogue or just language in general works, you and I can make up a word right now, as far as even the largest language model in the world – if we want to take GPT-3 today – that’s an undefined token for them. 

We just created a word that’s out of vocabulary, they don’t know what it is, and they have no vector to support that word. And so language is a living thing. It’s constantly changing. And so, if you want to support conversational AI, you really need to be prepared to deal with the dynamic nature of language constantly.

That may not sound like it’s a real problem (that people are creating words on the fly all the time),  but it really is. Not only is it a problem in just the general two friends chatting in a room, but it’s actually an even bigger problem from a business perspective. 

Every day, someone wakes up and creates a new branded product. They invent a new word, like Xembly, to put on top of their thing, and you need to make sure that you understand that.

So a lot of our stack, first of all, out of the gate, is making sure that we have good tooling for data labeling. We do a lot of semi-supervised type learning, so we need to be able to collect data quickly. 

We need to be able to label it quickly. We need to be able to produce metrics on the data that we’re getting just off of the live data feeds so that we can use some unlabeled data with our labeled data mix in there.

I think another huge component, as I kind of was mentioning earlier, is Conversational AI tends to require large pipelines of machine learning. You usually cannot do a one-shot, “here’s a model,” then it handles everything no matter what you’re reading today. 

In the world of large language models, there are generally a lot of pieces to make an end-to-end stack work. And so we actually need to have a full pipeline of models. We need to be able to quickly add pipelines into that stack. 

It means you need good pipeline architecture such that you can interject new models anywhere in that pipeline as needed to make everything work as needed. 

Solving different conversational AI challenges

Stephen: If you could walk us through your end-to-end stack for notable products. 

Let’s just sort of see how much of a challenge each one actually poses and maybe how your team solves them as well.

Jason: Yeah, the stack consists of multiple models. 

Speech recognition

It starts at the very beginning with basically converting speech to text; it’s the foundational component – traditional speech recognition.

We want to answer the question, “how do we take the audio recording that we have here and get a text document out of that?”  

Speaker segmentation

Since we’re dealing with dialogue, and in many cases, dialogue and conversation where we don’t have distinct audio channels for every speaker, there’s another huge component to our stack – speaker segmentation. 

For example, I might wind up in a situation where I have a Zoom recording, where there are three independent people on channels and then there are six people in one conference room talking on a single audio channel. 

To ensure the transcript that comes from the speech recognition system maps to the dialog flow correctly, we need to actually understand who’s distinctly speaking. 

It’s not good enough to say, well, that was conference room B, and there were six people there, but I only understand it’s conference room B. I really need to understand every distinct speaker because part of our solution requires that we actually understand the dialogue – the back-and-forth interactions.

Blind speaker segmentation

I need to know that this person said “no” to this request made by another person over here. In parallel with the text, we net out with a speaker assignment: who we think is speaking. We start a little bit with what we call “blind speaker segmentation.”

That means we don’t necessarily know who is whom, but we do know there are different people. Then we subsequently try to run audio fingerprinting type algorithms on top of it so that we can actually identify specifically who those people are if we’ve seen them in the past. Even after that, we kind of have one last stage in our pipeline. We call it our “format stage.”

Format stage 

We run punctuation algorithms and a bunch of other small pieces of software so that we can net out with what looks like a well-structured transcript. At this stage, we know Sabine was talking, then Stephen, then Jason, and we have the text allocated to those bounds. It’s reasonably well-punctuated. And now we have something that is hopefully a readable transcript.

Forking the ML pipeline

From there, we fork our pipeline. We run in two parallel paths: 

  • 1 Generating action items 
  • 2 Generating recaps

For action items, we run proprietary models in-house that are basically attempting to find spoken action items in that transcript. But that turns out to be insufficient because a lot of times in a meeting, what people say is, “I can do that”. If I gave you meeting notes at the end of the meeting and you got something that said action item, “Stephen said, I can do that,” that wouldn’t be super useful to you, right?

There are a bunch of things that have to happen once I've found that phrase to make it into well-written prose, as I mentioned earlier:

  • we have to dereference the pronouns. 
  • we have to go back through the transcript and figure out what that was.
  • we reformat it.

We try to restructure that sentence into something that's well-written: starting with the verb, replacing all those pronouns, so "I can do that" turns into "Stephen can update the slide deck with the new architecture slide."
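To make the idea concrete, here is a toy sketch of that rewrite step. Real systems would use coreference-resolution models rather than regexes; the speaker name and resolved task here come from upstream segmentation and transcript context, and every name in this snippet is hypothetical.

```python
import re

def rewrite_action_item(speaker: str, utterance: str, resolved_task: str) -> str:
    """Turn a spoken commitment like "I can do that" into readable prose."""
    # "do that" -> the task recovered from earlier in the transcript.
    rewritten = re.sub(r"\bdo that\b", resolved_task, utterance)
    # Dereference the first-person pronoun with the assigned speaker.
    rewritten = re.sub(r"\bI\b", speaker, rewritten)
    return rewritten + "."

print(rewrite_action_item(
    "Stephen",
    "I can do that",
    "update the slide deck with the new architecture slide",
))
# -> Stephen can update the slide deck with the new architecture slide.
```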

The other things we do in that pipeline: we run components for what we call owner extraction and due date extraction. Owner extraction is understanding that the owner of a statement was "I," working out who "I" refers to back in the transcript dialogue, and then assigning the owner correctly.

Due date detection, as we mentioned, is how do I find the dates in that system? How do I normalize them so that I can present them back to everyone in the meeting?

Not that it was just due on Tuesday, but Tuesday actually means January 3, 2023, so that perhaps I can put something on your calendar so that you can get it done. That’s the action item part of our stack, and then we have the recap portion of our stack.
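As a small illustration of that normalization step (a sketch under assumptions, not Xembly's implementation), resolving a spoken weekday against the meeting date is enough to reproduce the "Tuesday means January 3, 2023" example:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]

def normalize_weekday(spoken: str, meeting_date: date) -> date:
    """Map a spoken weekday to the next such date after the meeting."""
    target = WEEKDAYS.index(spoken.strip().lower())
    days_ahead = (target - meeting_date.weekday()) % 7 or 7  # "Tuesday" = the next one
    return meeting_date + timedelta(days=days_ahead)

# A meeting held on Friday, December 30, 2022:
print(normalize_weekday("Tuesday", date(2022, 12, 30)))  # -> 2023-01-03
```

Production date normalizers also have to handle phrases like "end of next week," but the principle is the same: anchor relative expressions to an absolute reference date.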

Along that part of our stack [recap portion], we’re really trying to do two things.

One, we’re trying to do blind topic segmentation, “How do we draw the lines in this dialogue that roughly correlate to kind of sections of the conversation?”

When we’re done here, someone would probably go back and listen to this meeting or this podcast and be able to kind of group it into sections that seem to align with some sort of topic. We need to do that, but we don’t really know what those topics are, so we use some algorithms. 

We like to call these change point detection algorithms. We're looking for a systemic change in the flow and nature of the language that tells us this was a break.

Once we do that, we then basically do abstractive summarization. We use some of the modern large language models to generate well-written recaps of those segments of the conversation, so that when that part of the stack is done, you net out with two sections – action items and well-written recaps – all with nicely written statements that you can hopefully immediately send out to people right after the meeting.
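A rough sketch of that change-point idea (an assumed approach for illustration, not Xembly's actual algorithm): compare the vocabulary of adjacent windows of utterances and flag a topic boundary wherever their similarity drops.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topic_boundaries(utterances, window=5, threshold=0.1):
    """Return indices where the dialogue likely switches topics."""
    boundaries = []
    for i in range(window, len(utterances) - window + 1):
        left = " ".join(utterances[i - window:i])
        right = " ".join(utterances[i:i + window])
        tfidf = TfidfVectorizer().fit_transform([left, right])
        sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
        if sim < threshold:  # a systemic change in the language -> a break
            boundaries.append(i)
    return boundaries
```

Each detected segment can then be fed to the abstractive summarizer independently.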

Build vs. open-source: which conversational AI model should you choose?

Stephen: It seems like a lot of models and sequences. It feels a little complex, and there’s a lot of overhead, which is exciting for us as we can slice through most of these things. 

You mentioned most of these models being in-house proprietary.

Just curious, where do you leverage those state-of-the-art strategies or off-the-shelf models, and where do you feel like this has already been solved versus the things that you think can be solved in-house?

Jason: We try not to have the "not invented here" problem. We're more than happy to use publicly available models if they exist, and they help us get where we're going.

There’s generally one major problem in conversational speech that tends to necessitate you build your own models versus using off-the-shelf. That’s because the domain we talked about earlier is so big – you actually can net out having a reverse problem by using very large models. 

And statistically, language at scale may not reflect the language of your domain, in which case using a large model can net out with not getting the results you’re looking for. 

We see this very often in speech recognition; a good example would be a proprietary speech recognition system from, let's say, Google.

One of the problems we’ll find is Google has had to train their systems to deal with transcribing all of YouTube. The language of YouTube does not actually generally map well to the language of corporate meetings. 

It doesn’t mean they’re not right from the larger general space, they are. What I mean is YouTube is probably a better representation of language in the macro domain space. 

We’re dealing in the sub-domain of business speech. This means if you’re probabilistically, like most machine learning models are trying to do, predicting words based on the general set of language versus the kind of constrained domain of what we’re dealing with in our world, you’re often going to predict the wrong word. 

In those cases, we found it’s better to build something – if not proprietary, at least trained on your own proprietary data – in-house versus using off-the-shelf systems. 

That said, there are definitely cases – like the recap summarization I mentioned – where I think we've reached a point where you would be silly not to use a large language model like GPT-3.

It has to be fine-tuned, but I think you’d be silly to not use that as a base system because the results just exceed what you’re going to be able to do. 

Summarizing text well, such that it's extremely readable, is difficult, and the amount of text data you would need to acquire to train something that could do it well, as a small company, is just not conceivable anymore.

Now, we have these great companies like OpenAI that have done it for us. They’ve gone out and spent ridiculous sums of money training large models on amounts of data that would be difficult for any smaller organization to do.

We can just leverage that now and get some of the benefits of these really well-written summaries. All we have to do now is adapt and fine-tune it to get the results that we need out of it.

Challenges of running complex conversational AI systems

Stephen: Yeah, that’s quite interesting, and maybe I’d love us to go deeper into these challenges you face because running a complex system means it can range from the team setup to problems with computing and then you talk about quality data. 

In your experience, what are the challenges that "break the system," where you then have to go back, fix them, and get things up and running again?

Jason: Yeah, so there are a lot of problems in running these types of systems. Let me try to cover a few. 

Before getting into the live inference production side of things, one of the biggest problems is what we call “machine learning technical debt” when you’re running these daisy chain systems. 

We have a cascading set of models that are dependent or can become dependent on each other, and that can become problematic. 

This is because when you train your downstream algorithms to handle errors coming from further upstream algorithms, introducing a new system can cause chaos. 

For example, say my transcription engine makes a ton of mistakes in transcribing words. I have a gentleman on my team whose name always gets transcribed incorrectly (it’s not a traditional English name). 

If we build our downstream language models to try to mask that and compensate for it, what happens when I suddenly change my transcription system or put a new one in place that actually can handle it? Now everything falls to pieces and breaks. 

One of the things we try to do is not bake the error from our upstream systems into our downstream systems. We always try to assume that our models further down the pipeline are operating on clean data so that they're not coupled, and that allows us to independently upgrade all of our models and all of our systems without, ideally, paying that penalty.

Now, we’re not perfect. We strive to do that, but sometimes you run into a corner where you have no choice but to really get quality results you have to do that. 

But ideally, we strive for complete independence of the models in our system so that we can update them without then having to go update every other model in the pipeline – that’s a danger that you can run into. 

Suddenly, when I updated my transcription system, I was getting that word that wasn't being transcribed before, but now I have to go upgrade my punctuation system because that changed how punctuation works. I have to go upgrade my action item detection system. My summarization algorithm doesn't work anymore. I have to go fix all that stuff.

You can really trap yourself in a dangerous hole where the cost of making changes becomes extreme. That’s one component of it. 

The other thing we found is when you're running a daisy-chain stack of machine learning algorithms, you need to be able to quickly rerun data through any component of your pipeline.

Basically, to come down to the root of your question, we all know things break in production systems. It happens all the time. I wish it didn’t, but it does. 

When you’re running queued daisy chain machine learning algorithms, if you’re not super careful, you can either run into systems where data starts backing up and you have huge latency if you don’t have enough storage capacity and wherever you’re keeping that data along the pipeline, things can start to implode. You can lose data. All sorts of bad things can happen.

If you properly maintain data across the various states of your system and you build good tooling so that you can constantly quickly rerun your pipelines, then you can find that you can get yourself out of trouble. 

We built a lot of systems internally so that if we have a customer complaint or they didn’t receive something they expected to receive, we can go quickly find where it failed in our pipeline and quickly reinitiate it from precisely that step in the pipeline. 

After we've fixed whatever issue we uncovered – maybe we had a small bug that we accidentally deployed, maybe it was just an anomaly, or we had some weird memory spike that caused a container to crash mid-pipeline – we can quickly rerun just that step, push it through the rest of the system, and get it out the end to the customer, without the systems backing up everywhere and having a catastrophic failure.
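A minimal sketch of what "reinitiate from precisely that step" can look like, assuming a linear pipeline whose stages persist their outputs (the stage names and file layout here are illustrative, not Xembly's):

```python
import json
from pathlib import Path

STAGES = ["transcribe", "segment_speakers", "punctuate", "action_items", "recap"]

def run_stage(name: str, payload: dict) -> dict:
    # Placeholder: in reality this would call the model service for `name`.
    return {**payload, name: "done"}

def resume(meeting_id: str, from_stage: str, state_dir: Path = Path("state")):
    """Re-run a meeting starting at a failed stage, using the last good output."""
    start = STAGES.index(from_stage)
    assert start > 0, "resume from the first stage by re-ingesting the raw audio"
    prev = STAGES[start - 1]
    payload = json.loads((state_dir / f"{meeting_id}.{prev}.json").read_text())
    for stage in STAGES[start:]:
        payload = run_stage(stage, payload)
        # Persist each stage's output so future failures can resume from here.
        (state_dir / f"{meeting_id}.{stage}.json").write_text(json.dumps(payload))
    return payload
```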

Stephen: Right, and are these pipelines running as independent services, or are there different architectures to how they run?

Jason: Yeah, so almost all of the models in our system run as independent, individual services. We use:

  • Kubernetes and containers: to scale.
  • Kafka: our pipelining solution for passing messages between all the systems.
  • Faust (open-sourced by Robinhood): helps orchestrate the different machine learning models down the pipeline – a minimal Faust agent sketch follows below.
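For flavor, a minimal Faust agent for one stage of such a pipeline might look like the sketch below. The topic names, record fields, and `find_action_items` helper are hypothetical; only the Faust/Kafka wiring reflects how these libraries are typically used.

```python
import faust

app = faust.App("meeting-pipeline", broker="kafka://localhost:9092")

class Transcript(faust.Record):
    meeting_id: str
    text: str

transcripts = app.topic("transcripts", value_type=Transcript)
action_items = app.topic("action-items")

def find_action_items(text: str) -> list:
    return []  # placeholder for the in-house model inference

@app.agent(transcripts)
async def detect_action_items(stream):
    # Each pipeline stage consumes one topic and emits to the next, so new
    # models can be interjected anywhere by adding topics and agents.
    async for t in stream:
        items = find_action_items(t.text)
        await action_items.send(value={"meeting_id": t.meeting_id, "items": items})

if __name__ == "__main__":
    app.main()
```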

How did Xembly set up the ML team?

Stephen: Yeah, that’s a great point. 

In terms of the ML team set-up, does the team leverage language experts in some sense, or how do you leverage language experts? And even on the operations side of things, is there a separate operations team, and then you have your research or ML engineers doing these pipelines and stuff?

Basically, how’s your team set up? 

Jason: In terms of the ML side of our house, there are really three components to our machine learning team:

  • Applied research team: they are responsible for the model building, the research side of "what models do we need," "what types of models," "how do we train and test them." They generally build the models, constantly measuring precision and recall and making changes to try to improve the accuracy over time.
  • Data annotation team: their role is to label sets of our data on a continuous basis.
  • Machine learning pipeline team: this team is responsible for doing the core software engineering work to host all these models, figure out how the data looks on the input and output sides, how it gets exchanged between the different models across the stack, and the stack itself.

For example, all of those pieces we talked about – Kafka, Faust, MongoDB databases – they care about how we get all that stuff interacting together.

Compute challenges and large language models (LLMs) in production

Stephen: Nice. Thanks for sharing that. So I think another major challenge we associate with deploying large language models is in terms of the compute power whenever you get into production, right? And this is the challenge with GPT, as Sam Altman would always tweet. 

I’m just curious, how do you sort of navigate that challenge of the compute power in production? 

Jason: We do have compute challenges. Speech recognition, in general, is pretty compute-heavy. Speaker segmentation, anything that’s generally dealing with more of the raw audio side of the house, tends to be compute-heavy, and so those systems usually require GPUs to do that. 

First and foremost, let’s say that we have some parts of our stack, especially the audio componentry, that tend to require heavy GPU machines to operate some of the pure language side of the house, such as the natural language processing model. Some of them can be handled purely on CPU processing. Not all, but some.

For us, one of the things is really understanding the different models in our stack. We must know which ones have to wind up on different machines and make sure we can procure those different sets of machines.

We leverage Kubernetes and Amazon (AWS) to ensure our machine learning pipeline has different sets of machines to operate on, depending on the types of those models. So we have our heavy GPU machines, and then we have our more kind of traditional CPU-oriented machines that we can run things on. 

In terms of just dealing with the cost of all of that and handling it, we tend to try to do two things: 

  • 1 Independently scale our pods within Kubernetes
  • 2 Scale the underlying EC2 hosts as well. 

There’s a lot of complexity in doing that, and doing it well. Again, just talking to some of the earlier things we mentioned in our system around pipeline data and winding up with backups and crashing, you can have catastrophic failure.

You can’t afford to over under scale your machines. You need to make sure that you’re effective at spinning up machines and spinning down machines and doing that hopefully right before the traffic comes in.

Basically, you need to understand your traffic flows. You need to make sure that you set up the right metrics, whether you’re doing it off CPU load or just general requests.

Ideally, you’re spinning up your machines at the right time such that you’re sufficiently ahead of that inbound traffic. But it’s absolutely critical for most people in our space that you do some type of auto-scaling. 

At various points in my career doing speech recognition, we've had to run hundreds and hundreds of servers to operate at scale. It can be very, very expensive. Running those servers at 3:00 in the morning, if your traffic is generally domestic US traffic, is just flushing money down the toilet.

If you can bring your machine loads down during that period of night, then you can save yourself a ton of money.

How do you ensure data quality when building NLP products? 

Stephen: Great. I think we’ll just jump right into some questions from the community straight away. 

Right, so the first question this person asks, quality data is a key requirement for building and deploying conversational AI and general NLP products, right? 

How would you ensure that your data is high-quality throughout the life cycle of the product?

Jason: Pretty much, yeah. That’s a great question. Data quality is critical. 

First and foremost, I’d say we actually strive to collect our own data. We found in general that a lot of the public datasets that are out there are actually insufficient for what we need. This is particularly a really big problem in the conversational speech space. 

There are a lot of reasons for that. One, just again coming back to the size of the data: I once did a rough estimate of the size of conversational speech, and I came up with a number like 1.25 quintillion utterances to roughly cover the entire space of conversational speech.

That’s because speech suffers from – besides a large number of words, they can be infinitely strung together. They can be infinitely strong together because, as you guys will probably find when you edit this podcast, when we’re done, a lot of us speak incoherently. It’s okay, we’re capable of understanding each other in spite of that. 

There’s not a lot of actual grammatical structure to spoken speech. We try, but it actually generally does not follow grammatical rules like we do for written speech. So the written speech domain is this big. 

The conversational speech domain is really infinite. People stutter. They repeat words. If you're operating on trigrams, for example, you have to accept "I I I" – the word "I" three times in a row, stuttered – as a viable utterance, because that happens all the time.

Now expand that out to the world of all words and all combinations, and you’re literally in an infinite data set. So you have the scale problem where there really isn’t sufficient data out there in the first place.

But you have some other problems around privacy and legality – there are all sorts of reasons why there aren't large conversational datasets out there. Very few companies are willing to take all their meeting recordings and put them online for the world to listen to.

That’s just not something that happens out there. There’s a limit to the amount of data, if you look for conversational data sets that are out there, like actual live audio recordings, some of them were manufactured, some of them were like conference data, doesn’t really relate to the real world. 

You can sometimes find government meetings, but again, those don’t relate to the world that you’re dealing with. In general, you wind up having to not leverage data that’s out there on the internet. You need to collect your own.

And so the next question is, once you have your own, how do you make sure that the quality of that data is actually sufficient? And that’s a really hard problem.

You need a good data annotation team to start with and very, very good tooling. We've made use of Label Studio, which is open source – I think there's a paid version as well. We make good use of that tool to quickly label lots and lots of data. You need to give your data annotators good tools.

I think people underappreciate how important the tooling for data labeling actually is. We also try to apply some metrics on top of our data so that we can analyze the quality of the data set over time. 

We constantly run what we call our “mismatch file.” This is where we take what our annotators have labeled and then run it through our model, and we look where we get differences. 

When that’s finished, we do some hand evaluation to see if the data was correctly labeled, and we repeat that process over time. 

Essentially, we’re constantly checking new data labeling against what our model predictions are over time so that we are sure that our data set remains of high quality.

What domains does the ML team work on? 

Stephen: Yeah, I think we forgot to ask this in the earlier part of the episode, but I was curious: what domains does the team work on? Is it like a business domain or just a general domain?

Jason: Yeah, I mean, it’s generally the business domain. Generally, in corporate meetings, that domain still is fairly large in the sense of we’re not particularly focused on any one business. 

There are a lot of different businesses in the world, but it’s mostly businesses. It’s not consumer-to-consumer. It’s not me calling my mother, it’s employees in a business talking to each other.

Testing conversational AI products

Stephen: Yeah, and I’m curious, this next question, by the way, is from some of the companies want to ask what’s your testing strategy for Conversational AI and generally NLU products?

Jason: We have found testing in natural language really difficult in terms of model building. We obviously do have train and test datasets. We follow the traditional rules of machine learning model building to ensure that we have a good test set for evaluation.

We have at times tried to allocate golden datasets – golden meetings for our notetaking pipeline – that we can at least check to get a gut check: "hey, is this new system doing the right thing across the board?"

But because the system is so big, often we found that those tests are nothing other than a gut check. They’re not really viable for true evaluation at scale, so we generally test live – it’s the only way we found to sufficiently do this in an unbounded domain.

It works in two different ways depending on where we are in development. Sometimes we deploy models and run them against live data without actually surfacing the results to customers.

We’ve structured all of our systems because we have this well-built daisy chain machine learning system where we can inject ML steps anywhere in the pipeline and run parallel steps that allows us to sometimes say, “hey, we’re going to run a model in silent mode.” 

We have a new model to predict action items; we're going to run it, and we're going to write out the results. But that's not what the rest of the pipeline is going to operate on. The rest of the pipeline is going to operate on the old model, but at least now we can do a side-by-side comparison and look at what both models produced to see if we're getting better or worse results.
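In code, that "silent mode" pattern can be as simple as the sketch below (hypothetical helper names, not Xembly's implementation): the candidate model runs on live traffic and its output is only logged, while the pipeline keeps operating on the production model.

```python
import logging

logger = logging.getLogger("silent_mode")

def action_item_step(transcript, prod_model, candidate_model=None):
    prod_out = prod_model.predict(transcript)
    if candidate_model is not None:
        try:
            cand_out = candidate_model.predict(transcript)
            # Logged for offline side-by-side comparison, never served.
            logger.info("shadow prediction: prod=%r candidate=%r",
                        prod_out, cand_out)
        except Exception:
            logger.exception("candidate model failed; production unaffected")
    return prod_out  # the rest of the pipeline operates on this
```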

But even after that, very often, we’ll push a new model out into the wild on only a percentage of traffic and then evaluate some top-line heuristics or metrics to see if we’re getting better results.

A good example in our world would be that we hope that customers will share the meeting summaries we send them. And so it’s very easy for us, for example, to change an algorithm in the pipeline and then go see, “hey, are our customers sharing our meeting notes more often?”

Because that sharing of the meeting notes tends to be a pretty good proxy for the quality of what we delivered to the customer. And so there’s a good heuristic that we can just track to say, “hey, did we get better or worse with that?”

That’s generally how we test. A lot of live in the wild testing. Again, mostly just due to the nature of the domain. If you’re dealing in a nearly infinite domain, there’s really no test set that’s probably going to ultimately quantify whether or not you got better or not.

Maintaining the balance between ML monitoring and testing 

Stephen: And where’s your fine line between monitoring in production versus actual testing?

Jason: I mean, we’re always monitoring all parts of our stack. We’re constantly looking for simple heuristics on the outputs of our model that might tell us if something’s gone astray.

There are metrics like perplexity, which is something that we use in language to detect whether or not we’re producing gibberish. 
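As an aside, a perplexity-style gibberish check is straightforward to wire up with an off-the-shelf causal language model; the sketch below uses GPT-2 from Hugging Face purely for illustration (the monitored model and alert thresholds would be deployment-specific):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

# An alert might fire when a generated summary's perplexity spikes
# well above its historical baseline:
print(perplexity("We agreed to ship the new build on Tuesday."))
print(perplexity("ship ship the the on on Tuesday build build"))  # much higher
```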

We can do simple things like count the number of action items we predict in a meeting – something we constantly track that tells us whether we're going off the rails – along with all sorts of monitoring around the general health of the system.

For example: 

  • Are all the docker containers running? 
  • Are we eating up too much CPU or too much memory?

That’s one side of the stack which I think is a little bit different from the kind of model building side of the house, where we’re constantly building and then running our training data we produce and send our results as part of a daily build for our models.

We’re constantly seeing our precision-recall metrics as we’re labeling data off the wire and ingesting new data. We can constantly test the model builds themselves to see if our precision-recall metrics are perhaps going off the rails in one direction or another.

Open-source tools for conversational AI

Stephen: Yeah, that’s interesting. All right, let’s jump right into the next question this person asked: Can you recommend open-source tools for conversational AI?

Jason: Yeah, for sure. In the speech recognition space, there are systems like Kaldi – I highly recommend it; it's been one of the backbones of speech recognition for a while.

There are definitely newer systems, but you can do amazing things with Kaldi for getting up and running with speech recognition systems. 

Clearly, systems like GPT-3 I would strongly recommend to people. It's a great tool. I think it needs to be adapted – you're going to get better results if you fine-tune it – but they've done a great job of providing APIs and making it easy to update those as you need.

We make a lot of use of systems like spaCy for entity detection. If you're trying to get up and running in natural language processing in any way, I strongly recommend you get to know spaCy well. It's a great system. It works amazingly well out of the box, there are all sorts of models, and it gets consistently better year over year.
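Getting entities out of spaCy really is just a few lines (this assumes the small English model, installed with `python -m spacy download en_core_web_sm`; the example sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Stephen will send the report to Acme Corp by January 3, 2023.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Acme Corp" -> ORG, "January 3, 2023" -> DATE
```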

And as I mentioned earlier, for data labeling we use Label Studio – an open-source tool that supports labeling all different types of content: audio, text, and video. It's really easy to get going out of the box and just start labeling data quickly. I highly recommend it to people who are trying to get started.

Building conversational AI products for large-scale enterprises

Stephen: All right, thanks for sharing. Next question. 

The person asks, "How do you build conversational AI products for large-scale enterprises?" What considerations would you put in place when starting the project?

Jason: Yeah, I would say with large-scale organizations where you’re dealing with very high traffic loads, I think, for me, the biggest problem is really cost and scale. 

You’re going to wind up needing a lot, a lot of server capacity to handle that type of scale in a large organization. And so, my recommendation is you really need to think through the true operation side of that stack. Whether or not you’re using Kubernetes, whether or not you’re using Amazon, you need to think about those auto-scaling components: 

  • What are the metrics that are going to trigger your auto-scaling? 
  • How do you get that to work? 

Scaling pods in Kubernetes on top of auto-scaling EC2 hosts underneath the covers is actually nontrivial to get working quickly. We talked before about the complexity around some types of models that generally tend to need GPUs for compute while others don't.

So how do you distribute your systems onto the right type of nodes and scale them independently? And I think it also winds up being a consideration of how you allocate those machines. 

What machines do you buy depending on the traffic? Which machines do you reserve? Do you buy spot instances to reduce costs? These are all the considerations in a large-scale enterprise that you must consider when getting these things up and running if you want to be successful at scale.

Deploying conversational AI products on edge devices 

Stephen: Awesome. Thanks for sharing that. 

So let’s jump right into the next one. How do you deal with deployment and general production challenges with on-device conversational AI products? 

Jason: When we say on device, are we talking about onto servers or onto more like constrained devices?

Stephen: Oh yeah, constrained devices. So edge devices and devices that don’t have that compute power.

Jason: Yeah, I mean, in general, I haven't dealt with deploying models onto small compute devices in some years. I can just share historical experience from things like the connected camera, when I worked on that, for example.

We distributed some load between the device and the cloud. For fast response, low latency things, we would run small-scale components of the system there but then shovel the more complex components off to the cloud. 

I don’t know how much this relates to answer the question that this user was asking, but this is something that I have dealt with in the past where basically you run a very lightweight small speech recognition system on the device to maybe detect a wake word or just get the initial system up and running. 

But then, once it’s going, you funnel all large-scale requests off to a cloud instance because you just generally can’t handle the compute of some of these systems on a small, constrained device.

Discussion on ChatGPT

Stephen: I think it would be a crime to end this episode without discussing ChatGPT. And I'm just curious – this is a common question, by the way.

What’s your opinion on ChatGPT and how people are using it today?

Jason: Yeah. Oh my god, you should have asked me that at the start, because I can probably talk for an hour and a half about that.

ChatGPT and GPT, in general, are amazing. We’ve already talked a lot about this, but because it’s been trained in so much language, it can do really amazing things and write beautiful text with very little input. 

But there are definitely some caveats with using those systems. 

One is, as we mentioned, it is still a fixed training set. It's not dynamically updated. One thing to note is that it can maintain some state within a session: if you invent a new word while having a dialogue with it, it will generally be able to leverage that word later in the conversation.

But if you end your session and come back to it, it has no knowledge of that ever again. Some other things to be concerned about again because it’s fixed, it really only knows about things from, I think, 2021 and before.

The original GPT-3 was from 2018 and before, so it's unaware of modern events. But I think maybe the biggest thing we've determined from using it is that it's a large language model – it functionally is predicting the next word. It's not intelligent; it's not smart in any way.

It’s taken human encoding of data, which we’ve encoded as language, and then it’s learned to predict the next word, which winds up being a really good proxy for intelligence but is not intelligence itself. What happens because of that is GPT3 or ChatGPT will make up data because it is just predicting the next likely word – sometimes the next likely word is not factually correct, but is probabilistically correct from predicting the next word. 

What’s a little scary about ChatGPT is that it writes so well that it can spew falsehoods in a very convincing way that if you don’t pay really detailed attention to, you actually can miss it. That’s maybe the scariest part.

It can be something as subtle as a negation. If you're not really reading what it spits back, it might have done something as simple as negate what should have been a positive statement. It might have turned a yes into a no, or it might have added an apostrophe to the end of something.

If you read quickly, your eyes will just glance over it and not notice, but it might be completely factually wrong. In some ways, we're suffering from an abundance of greatness. It's gotten so good, so amazing at writing, that we now run the risk that the human evaluating it might miss that what it wrote is factually incorrect, just because it reads super well.

I think these systems are amazing; I think they're fundamentally going to change the way a lot of machine learning and natural language processing work for a lot of people, and it's just going to change how people interact with computers in general.

I think the thing we should all be mindful of is that it's not a magical thing that just works out of the box, and it's dangerous to assume that it is. If you want to use it for yourself, I strongly suggest that you fine-tune it.

If you’re going to try to use it out of the box and generate content for people or something like that, I strongly suggest you recommend to your customers that they review and read. And don’t just blindly share what they’re getting out of it because there is a reasonable chance that what’s in there may not be 100% correct.

Wrap up

Stephen: Awesome. Thanks, Jason. So that’s all from me.

Sabine: Yeah, thanks for the extra bonus comments on what is, I guess, still convincing but just fabrication for now. So let's see where it goes. But yeah, thanks, Jason, so much for coming on and sharing your expertise and your tips.

It was great having you.

Jason: Yes, thanks, Stephen, it was really great. I enjoyed the conversation a lot.

Sabine: Before we let you go, how can people follow what you’re doing online? Maybe get in touch with you?

Jason: Yeah, so you can follow Xembly online at www.xembly.com. You can reach out to me. Just my first name, jason@xembly.com. If you want to ask me any questions, I’m happy to answer. Yeah, and just check out our website, see what’s happening. We try to keep people updated regularly.

Sabine: Awesome. Thanks very much. And here at MLOps Live, we'll be back in two weeks, as always. Next time, we'll have with us Silas Bempong and Abhijit Ramesh, and we will be talking about doing MLOps for clinical research studies.

So in the meantime, see you on socials and the MLOps community slack. We’ll see you very soon. Thanks and take care.

What Does GPT-3 Mean For the Future of MLOps? With David Hershey https://neptune.ai/blog/future-of-mlops-and-gpt-3-with-david-hershey Mon, 05 Jun 2023 13:53:41 +0000 https://neptune.ai/?p=23680 This article was originally an episode of MLOps Live, an interactive Q&A session where ML practitioners answer questions from other ML practitioners.

Every episode is focused on one specific ML topic, and during this one, we talked to David Hershey about GPT-3 and the future of MLOps.

You can watch it on YouTube or listen to it as a podcast. But if you prefer a written version, here you have it!

In this episode, you will learn about: 

  • 1 What is GPT-3 all about? 
  • 2 What is GPT-3’s impact on the MLOps field and how is it changing ML? 
  • 3 How can language models complement MLOps?  
  • 4 What are the concerns associated with building this MLOps sort of system?
  • 5 How are startups and companies already leveraging LLMs to ship products fast?

Stephen: On this call, we have David Hershey, one of the community's favorites, I would say – I dare to say, in fact – and we will be talking about what OpenAI GPT-3 means for the MLOps world. David is currently a Vice President at Unusual Ventures, where they are raising the bar of what founders should expect from their venture investors. Prior to Unusual Ventures, he was a Senior Solutions Architect at Tecton. Prior to Tecton, he worked as a Solutions Engineer at Determined AI and as a Product Manager for the ML Platform at Ford Motor Company.

David: Thank you. Excited to be here and excited to chat.

Stephen: I’m just curious, just giving a background, what’s really your role at Unusual Ventures

David: Unusual is a venture fund, and my current focus is on our machine learning and data infrastructure investments. I lead all the work we do thinking about the future of machine learning infrastructure and data infrastructure, and a little bit about DevTools more generally. It's sort of a continuation: I've spent five or six years now dedicated to thinking about ML infrastructure, and I'm still doing that, but this time trying to figure out the next wave of it.

Stephen: Yeah, that’s pretty awesome. And you wrote a few blog posts on the next wave of ML infrastructure. Could you sort of throw more light into what you’re seeing there?

David: Yeah, it’s been a long MLOps journey, I suppose, for a lot of us, and there have been ups and downs for me. We’ve accomplished an amazing number of things. When I got into this, there were not many tools, and now there are so many tools and so many possibilities, and I think some of that’s good and some of it’s bad. 

The topic of this conversation, obviously, is to dive a little bit into GPT-3 and language models; there’s all this hype now about Generative AI. 

I think there’s this incredible opportunity to broaden the number of ML applications we can build and the set of people that can build machine learning applications thanks to recent advances in language models like ChatGPT and GPT-3 and things like that. 

Regarding MLOps, there are new tools we can think about, there are new people that can participate, and there are old tools that might have new capabilities that we can think about too. So there’s a ton of opportunities.

What is GPT-3? 

Stephen: Yeah, absolutely, we’ll definitely delve into that. Speaking of the Generative AI space, the core focus of this episode would be the GPT-3, but could you share a bit more about what GPT-3 means and just give a background there?

David: Of course. GPT-3 is related to ChatGPT, which is the thing I guess the whole world’s heard about now. 

In general, it’s a large language model, not altogether that different from language machine learning models we’ve seen in the past that do various natural language processing tasks.

 It’s built on top of the transformer architecture that was released by Google in 2017, but GPT-3 and ChatGPT are sort of proprietary incarnations of that from OpenAI. 

They’re called large language models because, in the last six or so years, what we’ve been doing largely is giving more data and making the models bigger. As we’ve done that through both GPT-3 and other folks who have trained language models, we’ve seen these sort of amazing sets of capabilities emerge with language models beyond just sort of the classical things we’ve associated with language processing, like sentiment analysis, 

These language models can do more complex reasoning and solve a ton of language tasks efficiently; one of the most popular incarnations of them is ChatGPT, which is essentially a chatbot capable of having human conversations.

Check also

Generative Adversarial Networks and Some of GAN Applications

The impact of GPT-3 on MLOps  

Stephen: Awesome. Thanks for sharing that… What are your thoughts on the impact of GPT-3 on the MLOps field? And how do you see Machine Learning changing?

David: I think there are a couple of really interesting pieces to tease out what language models mean for the world of MLOps – maybe I want to separate it into two things. 

1. Language models

Language models, as I said, have an amazing number of capabilities. They can solve a surprisingly large number of tasks without any extra work; this means you don't have to train or tune anything – you are only required to write a good prompt.

Several problems can be solved using language models.

The nice thing about being able to use a model someone else trained is you offload the MLOps to the people building the model, and you still get to do a whole bunch of fun work downstream. 

You don’t need to worry about inference as much or versioning and data. 

There are all these problems that suddenly fall out, enabling you to focus on other things, which I think broadens the accessibility of machine learning in a lot of cases. 

But not every use case is going to be immediately solved; Language models are good, but they’re not everything yet.

One category to think about is if we don’t need to train models anymore for some set of things, 

  • What activities are we taking part in? 
  • What are we doing, and what tools do we need? 
  • What talents and skills do we need to be able to build machine learning systems on top of language models? 

2. How language models complement MLOps

We are still training models; there are still a lot of cases where we do that, and I think it’s worth at least commenting on the impact of language models today. 

One of the hardest things about MLOps today is that a lot of data scientists aren’t native software engineers, but it may be possible to lower the bar to software engineering. 

For example, there has been a lot of hype around translating natural language to things like SQL so that it’s a little bit easier to do data discovery and things like that. And so those are more sideshows of the conversations or other complementary pieces, maybe. 

But I think it is still impactful when you think about whether there’s a way language models can be used to lower the bar of who can actually participate in traditional MLOps by making the software aspects more accessible, the data aspects more accessible, et cetera.

The accessibility of large language models

Stephen: When you talk about GPT-3 and Large Language Models (LLMs), some people think these are tools for large companies like Microsoft, OpenAI, Google, etc.

How are you seeing the trend toward making these systems more accessible for smaller organizations, early-stage startups, or smaller teams that want to leverage this stuff and put it out there for consumers?

David: Yeah, I actually think this is maybe the most exciting thing that’s come out of language models, and I’ll frame it in a couple of ways. 

Someone else has figured out MLOps for the Large Language Models.

To some extent, they’re serving them, they’re versioning them, they’re iterating on them, they’re doing all the fine-tuning. And what that means is for a lot of companies that I work with and talk to, Machine Learning in this form is way more accessible than it’s ever been – they don’t need to hire a person to learn how to do machine learning and learn PyTorch and figure out all of MLOps to be able to get something out. 

The amazing thing with language models is you can kind of get your MVP out by just writing a good prompt on the OpenAI playground or something like that.

A lot of them are demos at that point, they’re still not products. But I think the message is the same: it’s suddenly so easy to go from an idea to something that looks like it actually works.

At a very surface level, the obvious thing is anybody can try and potentially build something pretty cool; it’s not that hard, but that’s great – not hard is great. 

We’ve been doing very hard work to create simple ML models for a while, and this is really cool. 

The other thing I’ll touch on is this: when I look back to my time at Ford, a major theme that we thought about was democratizing data. 

How can we make it so the whole company can interact with data? 

Democratization has been all talk for the most part, and language models, to some extent, have done a little bit of data democratizing for the whole world. 

To explain that a little further, when you think about what those models are, the way that GPT-3 or the other similar language models are trained is on this corpus of data called the Common Crawl, which is essentially the whole internet, right? So they download all of the text on the internet, and they train language models to predict all of that text. 

One of the things you used to need, to do the machine learning that we're all familiar with, is data collection.

When I was at Ford, we needed to hook things up to the car and telemetry it out and download all that data somewhere and make a data lake and hire a team of people to sort that data and make it usable; the blocker of doing any ML was changing cars and building data lakes and things like that. 

One of the most exciting things about language models is you don’t need to hook up a lot of stuff. You just sort of say, please complete my text, and it will do it. 

I think one of the bars that a lot of startups had in the past was this cold start problem. Like, if you don’t have data, how do you build ML? And now, on day one, you can do it, anybody can. 

That’s really cool. 

You may find interesting

How to Do Data Labeling and Data Collection: Principles and Process

What do startups worry about if MLOps is solved? 

Stephen: And it’s quite interesting because if you’re not worrying about these things, then what are you worrying about as a startup? 

David: Well, I’ll give the good and then the bad…

The good case is worrying about what people think, right? You’re customer-centric.

Instead of worrying about how you’re going to find another MLOps person or a data engineer, which is hard to find because there’s not enough of them, you can worry about building something that customers want, listening to customers, building cool features, and hopefully, you can iterate more quickly too. 

The other side of this that all of the VCs in the world like to talk about is defensibility – and I don’t want to, we don’t need to get into that. 

But when it’s so easy to build something with LLMs, then it’s sort of table stakes – It stops being this cool differentiated thing that sets you apart from your competition. 

If you build an incredible credit scoring model, that will make you a better insurance provider or a better loan provider. Text completion, though, is kind of table stakes right now.

A lot of folks are worried about how to build something that their competitors can't rip off tomorrow – but hey, that's not a bad problem to have.

Going back to what I said earlier, you can focus on what people want and how people are interacting with it and maybe frame it slightly differently. 

For example, there’s all of this MLOps tooling, and the thing that’s kind of at the far end is monitoring, right? When we think about it, it’s like you ship a model, and the last thing you do is monitor it so that you can continuously update and stuff like that. 

But monitoring for a lot of MLOps teams I work with is sort of still an afterthought because they’re still working on getting to the point where they have something to monitor. But monitoring is actually the cool part; it’s where people are using your system, and you’re figuring out how to iterate and change it to make it better. 

Almost everybody I know that’s doing language model stuff right now is already monitoring because they ship something in five days; they’re working on iterating with customers now instead of trying to figure it out and scratching their heads. 

We can focus more on iterating these systems with users in mind instead of the hard PyTorch stuff and all that.

Learn more

A Comprehensive Guide on How to Monitor Your Models in Production

Has the data-centric approach to ML changed after the arrival of large language models? 

Stephen: Prior to LLMs, there was a frenzy around data-centric AI approaches to building the systems. How does this sort of approach to building your ML systems link to now having Large Language Models that already have been trained on this vast amount of data?

David: Yeah, I guess one thing I want to call out is that – 

Machine learning that’s the least likely to be replaced by language models in the short term, is some of the most data-centric stuff.

When I was at Tecton, they built a feature store, and a lot of the problems we were working on were things like fraud detection, recommendation systems, and credit scoring. It turns out the hard part of all of those systems is not the machine learning part, it's the data part.

You almost always need to know a lot of small facts about all of your users around the world, in a short amount of time; this data is then used to synthesize the answer. 

In that sense, it’s a hard part of a problem:  Is data still because you need to know what someone just clicked on or what are the last five things someone bought? Those problems aren’t going away. You still need to know all of that information. You need to be focused on understanding and working with data – I’d be surprised if language models had almost any impact on some of that. 

There are a lot of cases where the hard part is just having the right data to make decisions. And in those cases, being data-centric – asking questions about what data you need to collect, how to turn it into features, and how to use it to make predictions – is the right approach.

On the language model side of things, the data question is interesting – you need potentially a little bit less focus on data to get started. You don’t need to curate and think about everything, but you must ask questions about how people actually use this – as well as all the monitoring questions we talked about.

Building something such as a chatbot needs product analytics built in, to be able to track what our users' responses to a given generation are, or whatever we're doing, and things like that. So data is still really important for those.

We can get into it, but it certainly has a different texture than it used to because data is not a blocker to building features with language models as often anymore. It’s maybe an important part to keep improving, but it’s not a blocker to get started like it used to be.

How are companies leveraging LLMs to ship products fast? 

Stephen: Awesome. And I’m trying not to lose my train of thought for the other MLOps component side of things, but I just wanted to give a bit of context again…

From your experience, how are companies leveraging these LLMs to ship products out fast? Have you seen use cases you want to share based on your time with them at Unusual?

David: It’s almost everything; you’d be amazed at how many things are out there. 

There’s may be a handful of obvious use cases of language models out there and then we’ll talk about some of the quick shipping things too… 

Writing assistants 

There’s tools that help you write lots of those; for example, copy for marketing or blogs or whatever. Examples of such tools include Jasper.AI and Copy.AI – they have been around the longest. This is probably the easiest thing to implement with a language model. 

Agents 

There are use cases out there helping you take action. These are some of the coolest things going on right now. The idea is to build an agent that takes tasks in natural language and carries them out for you. For example, it could send an email or hit an API. It's nascent, and there's more work going on there, but it's neat.

Search & semantic retrieval

A lot of folks are working on search and semantic retrieval and things like that… For example, if I want to look for a note, I can get a rich understanding of how to search through large amounts of information. Language models are good at digesting and understanding information, so knowledge management and finding information are cool use cases.

I give broad answers because nearly every industry product has some opportunity to incorporate or improve a feature using language models. There are so many things out there to do and not enough time in the day to do them.

Stephen: Cool. And are there DevTool-related use cases – like DevTooling and stuff?

David: I think there are all sorts of things out there, but in terms of thinking on the DevTool side, there’s Copilot, which helps you write code faster. And there are a lot of things like even making pull requests. I’ve seen tools that help you write and author pull requests more efficiently, and that help automate building documentation. I think the whole universe of how we develop software to some extent is also ripe to change. So along those lines exactly.

Monitoring LLMs efficiently in production

Stephen: Usually, when we talk about the ML platform or MLOps, it's about tying different components neatly together. You have your data, which is moved across this workflow, modeled, and then deployed.

Now there’s a good link between your development environments and the production environment where it’s monitoring. 

But in this case now, where LLMs have almost eliminated the development side… 

How have you seen folks monitor these systems efficiently in production, especially when replacing them with other models and other systems out there?

David: Yeah, it’s funny. I think monitoring is one of the hardest challenges for the language models now because we eliminated development so it becomes challenge number one.

With most of the machine learning we’ve done in the past, the output is structured (i.e., is this a cat or not?); monitoring this was pretty easy. You can look at how often you’re predicting it’s a cat or not, and evaluate how it’s changing over time. 

With language models, the output is a sentence – not a number. Measuring how good a sentence is, is hard. You have to think about things such as: 

  • 1 Is this number above 0.95 or something like that? 
  • 2 Is this sentence authoritative and nice? 
  • 3 And are we friendly and are we not toxic, are we not biased? 

And all these questions are way harder to evaluate and harder to track and measure. So what are people doing? I think the first response for a lot of folks is to go to something like product analytics. 

It’s closer to tools like Amplitude than it was to classic tools where you just generate something and you see if people like it or not. Do they click? Do they click off the page? Do they stay there? Do they accept this generation? Things like that. But man, that’s a real course metric. 

That doesn’t give you nearly the detail of understanding the internals of a model. But it’s what people are doing.

There aren’t many great answers to that question yet. How do you monitor these things? How do you keep track of how good my model is doing besides looking at how users interact with it? It’s an open challenge for a lot of people. 

We know a lot of ML monitoring tools out there… I’m hopeful some of our favorites will iterate into being able to more directly help with these questions. But I also think there’s an opportunity for new tools to emerge that help us say how good a sentence is, and help you measure that before and after you ship a model; this will make you feel more confident over time. 

Right now, the most common way I’ve heard people say they ship new versions of models is they have five or six prompts that they test on, and then they check with their eyes if the output looks good and they ship it.

Stephen: That’s killable. Ironic, amazing, and sarcastic.

David: I don’t think that will last forever. 

Where people are just happily looking at five examples with their eyes and hitting the ship to the production side of the error button. 

That’s bold, but there’s so much hype right now that people will ship anything, I guess, but it won’t take long for that to change.

Closing the active learning loop

Stephen: Yeah, absolutely. And just a step more for that, because I think even before the large language models frenzy, when it was just the basic transformers they had, I think most companies that deal with these sorts of systems would usually find a way to close the active learning loop. 

How can you find a way to close that active learning loop, where you're continuously refining that system or model with your own dataset as it comes in and gets better?

David: I think this is still an active challenge for a lot of folks – not everybody’s figured it out. 

OpenAI has a fine-tuning API, for example. Others do too, where you can collect data and they'll make a fine-tuned endpoint. I've talked to a lot of folks that go down that route eventually, either to improve their model or, more commonly actually, to improve latency: GPT-3 is really large and expensive, and if you can fine-tune a cheaper model to be similarly good but much faster and cheaper, that's a win. I've seen people go down that route.

We’re in the early days of using these language models, and I have a feeling over time that the active learning component is still going to be just as, if not more important to refine models. 

You hear a lot of people talking about per-user fine-tuning, right? Can you have a model per user that knows my style, what I want, or whatever it may be? It's a good idea for anybody that's using these right now to be thinking about that active learning loop today, even if it's hard to execute on – you can't download the weights of GPT-3 and fine-tune it yourself.

Even if you could, there are all sorts of challenges in fine-tuning a 175-billion-parameter model, but I expect that the data you collect now, to be able to continuously improve, is going to be really important in the long run.
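For context, the fine-tuning flow David describes looked roughly like the sketch below with the OpenAI Python client of that era; the API surface has changed since, so treat this as illustrative rather than current, and the file name and prompt are placeholders:

```python
import openai

openai.api_key = "sk-..."  # your API key

# 1. Collect prompt/completion pairs from your own traffic into a JSONL file.
training = openai.File.create(file=open("examples.jsonl", "rb"),
                              purpose="fine-tune")

# 2. Kick off a fine-tune on a smaller, cheaper base model.
job = openai.FineTune.create(training_file=training.id, model="curie")

# 3. Poll until the job succeeds, then serve from the fine-tuned model
#    instead of the larger, slower general-purpose one.
ft_model = openai.FineTune.retrieve(job.id).fine_tuned_model
response = openai.Completion.create(model=ft_model, prompt="Summarize: ...")
```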

See also

Active Learning: Strategies, Tools, and Real-World Use Cases

Is GPT-3 an opportunity or risk for MLOps practitioners? 

Stephen: Yeah, that’s quite interesting to see how the field sort of evolves in that sense. So at this point, we’ll jump right into some of the community questions. 

So the first question from the community: is GPT-3 an opportunity or risk for MLOps Practitioners? 

David: I think opportunities and risks are two sides of the same coin in some ways, I guess. I’ll cop out and say both.

I’ll start with the risk. It’s hard to imagine otherwise: a lot of the workloads we used to rely on training models for – where you had to do the whole MLOps cycle – you won’t have to anymore. As we talked about, language models can’t do everything right now, but they can do a lot. And there’s no reason to believe they won’t be able to do more over time.

And if we have these general-purpose models that can solve lots of things, then why do we need MLOps? If we’re not training models, then a lot of MLOps goes away. And so there’s a risk that if you aren’t paying attention to that, the amount of work out there to be done is going to go down.

Now, the good news is there aren’t enough MLOps practitioners today to begin with. Not even close. And so I don’t think we’re going to shrink to a point where the number of MLOps practitioners today is too many for how much MLOps we need to do in the world. So I wouldn’t worry too much about it, I guess.

But the other side of it is there’s a whole bunch of new stuff to learn, like what are the challenges of building language model applications? There are a lot of them, and there are a lot of new tools. Looking ahead to a couple of the community questions, I think we’ll get into it. But I think there’s a real opportunity to be a person that understands that, and maybe even to push it a little bit further.

You can use a language model, if you’re an MLOps person but not a data scientist; if you’re an engineer that helps people build and push models to production, maybe you don’t need the data scientist anymore. Maybe the data scientist should be worried. Maybe you, the MLOps person, can build the whole thing. You’re a full stack engineer suddenly in a sense where you get to build ML models by building on top of language models – you build the infrastructure and the software around them. 

I think that’s a real opportunity to be a full-stack practitioner of building language model-powered applications. You’re well positioned, you understand how ML systems work and you can do it. So I think that’s an opportunity.

What should MLOps practitioners learn in the age of LLMs? 

Stephen: That’s a really good point; we have a question in Chat…

In this age of Large Language Models, what should MLOps practitioners actually learn, or what should they prioritize when trying to gain these skills as a beginner?

David: Yeah, good question…

I don’t want to be too radical. There are a lot of machine learning use cases that aren’t going to be impacted drastically by language models. We still do fraud detection and things like that. These are still things where someone’s going to go train a model on their own proprietary data and all of that.

If you’re passionate about MLOps and the development, training, and full lifecycle of machine learning, learn the same MLOps curriculum as you would have learned before: software engineering best practices and understanding how ML systems get built and productionized.

Maybe I’d complement that with something simple: just go to the GPT-3 playground by OpenAI and play around with a model. Try to build a couple of use cases. There are lots of demos out there. Build something. It’s easy.

Personally, I’m a VC… I’m barely technical anymore, and I’ve built four or five of my own apps to play with and use in my spare time – it’s ridiculous how easy it is. You wouldn’t believe it.

Just build something with language models, it’s easy, and you’ll learn a lot. You’ll be amazed probably at how simple it is.

I have something that takes transcripts of my calls and writes call summaries for me. I have something that takes a paper – like a research paper – and lets me ask questions against it. Those are simple applications. But you’ll learn something.

I think it’s a good idea to be somewhat familiar with what it feels like to build and iterate with these things right now, and it’s fun too. So I highly recommend anybody in the MLOps field try it out. I know it’s your free time, but it should be fun.

What are the best options to host an LLM at a reasonable scale? 

Stephen: Awesome. So focus on shipping stuff. Thanks for the suggestion. 

Let’s jump right into the next question from the community: what are the best options to host large language models at a reasonable scale?

David: This is a tough one… 

One of the hardest things about language models is their sheer size. GPT-3 has 175 billion parameters.

Somewhere around the 30 billion parameter range is where a model still fits on the biggest GPUs we have today…

The biggest GPU on the market today in terms of memory is the A100 with 80GB of memory. GPT-3 does not fit on that.

You can’t infer GPT-3 on a single GPU. And so what does that mean? It gets horribly complicated to do inference with a model that doesn’t fit on a single GPU – you have to do model parallelism, and it’s a nightmare.

My short advice is don’t try unless you have to – there are better options. 

The good news is a lot of people are working on taking these models and turning them into form factors that fit on a single GPU. For example, [we’re recording on February 28th] I think it was just yesterday or last Friday that the LLaMA paper from Facebook came out; they trained a language model that does fit on one GPU and has similar capabilities to GPT-3.

There are others like it, ranging from 5 billion parameter models up to like 30…

The most promising approach we have is to find a model that does fit on a single GPU, and then use the tools we’ve used for all historical model deployment to host it. You can pick your favorite – there are lots out there; the folks at BentoML have a great serving product.

You do still need to make sure you get a really big, beefy GPU to put it on. But I think it’s not much different at that point, as long as you pick something that fits on one machine at least.

Are LLMs for MLOps going mainstream? 

Stephen: Oh yeah, thanks for sharing that… 

The next question is whether LLMs for MLOps are going mainstream; what are the new challenges that they can address better than conventional MLOps for NLP use cases? 

David: Man, I feel like this is a landmine – I’m going to make people angry no matter what I say here. It’s a good question though. There’s an easy version of this, which we talked about: for a lot of cases, building ML applications on top of language models means you don’t need to train a model anymore, you don’t need to host your own model anymore – all of that goes away. And so it’s easy in a sense.

There’s just a whole bunch of stuff you don’t need anymore to build language model applications. The new questions you should be asking yourself are:

  • 1 what do I need? 
  • 2 what are the new questions I need to answer? 
  • 3 what are the new workflows we’re talking about, if it’s not training, hosting, serving, and testing? 

Prompting is a new language-model workflow… Building a good prompt is like a really simple version of building a good model. It’s still experimental.

You try a prompt and it works or it doesn’t work. You tinker with it until it works or doesn’t work – it’s almost like tuning hyperparameters in a way. 

You’re tinkering and tinkering and trying stuff and building stuff until you come up with a prompt that you like, and then you push it. And so some folks are focused on prompt experimentation. I think that’s a valid way to think about it – the way you think about Weights & Biases as experimentation for models.

How do you have a similar tool for experimentation on prompts?

Keep track of versions of prompts and what worked and all that. I think that’s a tooling category of its own. And whether or not you think prompt engineering is a lesser form of machine learning, it certainly warrants its own set of tools, it’s completely new, and it’s certainly different from all of the MLOps we’ve done before. I think there’s a lot of opportunity to think about that workflow and to improve it.

We touched on evaluation and monitoring and some of the new challenges that are unique to evaluating the quality of the output of a language model compared to other models.

There are similarities between that and monitoring historical ML models, but there are things that are just uniquely different. I think the questions we’re asking are different. As I said, a lot of it is like product analytics: do you like this or not? And everything you capture might let you fine-tune the model – a slightly different goal than before.

You can say we know about monitoring and MLOps, but I think there are at least new questions we need to answer about how to monitor language models. 

For example, what’s similar? It’s experimental and probabilistic. 

Why do we have MLOps as opposed to DevOps? That’s the question you could ask first, I guess. It’s because ML has this weird set of probabilities and distributions that acts differently from traditional software – and that’s still the same.

In some sense, there’s a big overlap for similarity because a lot of what we’re doing is figuring out how to work with probabilistic software. The difference is we don’t need to train models anymore; we write prompts. 

The challenges of hosting and interacting are different… Does it warrant a new acronym? Maybe. The fact that saying LLMOps is such a pain doesn’t mean we shouldn’t be trying to do it in the first place.

Regardless of the acronyms, there are certainly some new challenges that we need to address and some old challenges that we don’t need to address as much.

Stephen: I just wanted to touch on the experimentation part – I know developers are already taking notes… A lot of prompt engineering is happening. It’s actually becoming a role now. There are advanced prompt engineers, which is incredible in itself.

David: It’s easier to become a prompt engineer than it is to become an ML person. Maybe. I’m just saying that because I have a degree in machine learning, and I don’t have a degree in prompting. But it’s certainly a skill set, and managing and working with it is a good skill to have, and it’s clearly a valuable one. So why not?

Does GPT-3 require any form of orchestration? 

Stephen: Absolutely. All right, let’s check the other question: 

Does GPT-3 need to involve any form of orchestration or maybe pipelining? From their understanding, MLOps feels like an orchestration type of process more than anything else.

David: Yeah, I think there are two ways to think about that. 

There are use cases of language models that you could imagine happening in batch. For example, take all of the reviews of my app, pull out relevant user feedback, and report them to me or something like that. 

There are still all of the same orchestration challenges: grabbing all the new data – all the new reviews from the App Store – passing them through a language model in parallel or in sequence, collecting that information, and then sticking it out wherever it needs to go. Nothing has changed there. If you had your model hosted at an endpoint internally before, now you have it hosted at the OpenAI endpoint externally. Who cares? Same thing, no changes, and the challenges are about the same.

At inference time, you’ll hear a lot of people talking about things like chaining in language models. The core insight there is that a lot of the use cases we have actually involve going back and forth with a model a lot. I write a prompt, the language model says something back, and based on what it says, I send another prompt to clarify or to move in some other direction. That’s an orchestration problem.

Fundamentally, getting data back and forth from a model a few times is an orchestration problem. So yes, there are certainly orchestration challenges with language models. Some of them look just like before; some of them are net new. I think the tools we have to orchestrate are the same tools we should keep using. If you’re using Airflow, that’s a reasonable thing to do; if you’re using Kubeflow Pipelines, that’s a reasonable thing to do. If you’re doing those live things, maybe we want slightly new tools, like what people are using LangChain for now.

It looks similar to a lot of orchestration things, like Temporal or other tools that help with orchestration and workflows in general. So yeah, I think that’s a good insight. There’s a lot of similar work of just gluing all these systems together so they work when they’re supposed to that still needs to be done. And it’s software engineering – building something that reliably does the set of things you need it to do, every time, whether you call that MLOps or DevOps or whatever it is: building reliable computational flows.

That’s good software engineering.

What MLOps principles are required to get the most from LLMs? 

Stephen: I know MLOps has its own principles – you talk about reproducibility, which might be a hard problem to solve, and about collaboration. Are there MLOps principles that need to be followed so that teams can properly utilize the potential of these Large Language Models?

David: Good question. I think we’re early to actually know, but I think there are some similar questions…

A lot of what we’ve learned from MLOps and DevOps comes down to principles of how to do this. At the end of the day, a lot of what I think this is – for both MLOps and DevOps – is software engineering to some extent. Can we build stuff that’s maintainable, reliable, reproducible, and scalable?

For a lot of these questions, we want to build products. Specifically for language model ops, you probably want to version your prompts. It’s a similar thing: you want to keep track of the versions as they change, and you want to be able to roll back. And if you have the same version of the prompt and the same model at temperature zero, it’s reproducible – it’s the same thing.

Again, the scope of challenges is innately smaller. So I don’t think there’s a lot of new stuff we necessarily need to learn. But I need to think more about it, because I’m sure there will be a playbook of all the things we need to follow for language models moving forward. Nobody’s written it yet, so maybe one of us should go do that.

Regulations around generative AI applications

Stephen: Yeah, an opportunity. Thank you for sharing that, David. 

The next question from the community: are there regulatory and compliance requirements that small DevTool teams should be aware of when embedding generative AI models into services for users?

David: Yeah, good question… 

There’s a range of things that are probably worth considering. I’ll caveat that I’m not a lawyer, so please don’t take my advice and run with it, because I don’t know everything.

A few vectors though, of challenges: 

  1. OpenAI and external services: a lot of the folks that host language models right now are external services – we’re sending them data. Because of the active changes being made to ChatGPT, you can reportedly get proprietary Amazon source code out of it: Amazon engineers have been sending their code to ChatGPT, it’s been fine-tuned on that data, and now you can sort of back it out.

That’s a good reminder that you’re sending your data to someone else when you use an external service. And that obviously depending on legal or just company implications that might mean that you shouldn’t do that and you may want to consider hosting on-site and there are all sorts of challenges that come with that. 

  2. The European Union: the EU AI Act should pass this year, and it has pretty strict things to say about introducing bias to models, measuring bias, and things like that. When you don’t own a model, I think it’s worth being aware that these models certainly have a long history of producing biased or toxic content, and there could be compliance ramifications for not testing for and being aware of it.

And I think that’s probably a new set of challenges we’re going to have to face: how can you make sure that when you’re generating content, you’re not generating toxic or biased content, or taking biased actions because of what’s being generated? We’re used to a world where we own the data used to train these models, so we can hopefully iterate and try to scrub them of biased things. If that’s no longer true, there are certainly new questions to ask about whether it’s even possible to use these systems in a way that’s compliant with the evolving landscape of legislation.

In general, AI legislation is still pretty new. A lot of people are going to have to figure out a lot of things, especially when the EU AI Act passes.

Testing LLMs

Stephen: And you mentioned something really interesting about the model testing part… Has anybody figured that out for LLMs? 

David: Lots of people are trying; I know people are trying interesting things. There are metrics people have built in academia to measure toxicity. There are methods and measures out there to evaluate the output of text. There have been similar tests for gender bias and things like that that have historically been applied. So there are methods out there.

There are folks that are using models to test models. For example, you can use a language model to look at the output of another language model and just ask, “is this hateful or discriminatory?” or something like that – and they’re pretty good at that.

I guess the short version is we’re really early, and I don’t think there’s a single tool I can point someone to and say, here’s the way to do all of your evaluation and testing. But there are building blocks out there right now, in raw form, to try to work on some of this at least. It’s hard right now.

“I think it’s one of the biggest active challenges for people to figure out right now.”

Generative AI on limited resources

Stephen: When you talk about a model evaluating another model, my mind goes straight to teams using monitoring on some of the latest platforms, which have models actively doing the evaluation themselves. It’s probably a really good business area to look into for these tools.

I’m just going to jump right into the next question and I think it’s all about the optimization part of things… 

There’s a reason we call them LLMs, and you spoke of a couple of tools – the most recent one being LLaMA from Facebook.

How are we going to see more generative AI models optimized for resource-constrained environments over time – where resources are limited, but you still want to host the model on your own platform?

David: Yeah, I think this is really important, actually. It’s probably one of the more important trends that we’re going to see. People are working on it – it’s still early – but there are a lot of reasons to care about this:

  1. Cost – It’s very expensive to operate thousands of GPUs to do this.
  2. Latency – If you’re building a product that interacts with a user, every millisecond of latency in loading a page impacts their experience.
  3. Environments that can’t have a GPU – you can’t carry a GPU cluster around in your phone, or wherever you are, to do everything.

I think there’s a lot of development happening in image generation. There’s been an incredible amount of progress in a few short months on improving performance. My MacBook can generate images pretty quickly now.

Now, language models are bigger and more challenging still – I think there’s a lot more work to be done. But there are a lot of promising techniques that I’ve seen folks use, like using a very large model to generate data to tune a smaller model to accomplish a task.

For example, if the biggest model from OpenAI is good at some task but the smallest one isn’t, you can have the biggest one do that task 10,000 times and fine-tune the smallest one – or a smaller one – on those outputs to get better at that task.

The components are there, but this is another place where I don’t think we have all of the tooling we need yet to solve this problem. It’s also one of the places I’m most excited about: how can we make it easier and easier for folks to take the capabilities of these really big, impressive models and tune them down into a form factor that makes sense for their cost, latency, or environmental constraints?

What industries will benefit from LLMs and how can they integrate it? 

Stephen: Yeah, and it does seem like the way we think about active learning and other techniques is in fact changing over time. Because if you can have a large language model fine-tune or train a smaller one, that’s an incredible chain of events going on there.

Thank you for sharing that, David. 

I’m going to jump right into the next community question: what kind of industries do you think would benefit the most from GPT-3’s language generation capabilities and how can they integrate it?

David: Maybe to start with the obvious and then we’ll get into the less obvious because I think that’s easy. 

Any content generation should be complemented by language models now.

That’s obvious. 

For example, copywriting and marketing are fundamentally different industries now than they used to be – and it’s obvious why: it’s way cheaper to produce quality content than it’s ever been. You can build customized quality content in no time, at almost infinite scale.

It’s hard to believe that nearly every aspect of that industry shouldn’t be somewhat changed and somewhat quickly be adopting language models. And we’ve seen that largely to date. 

There are people that will generate your product descriptions, your product photos, your marketing content, your copy, and all that. And it’s no mistake that that’s the biggest and most obvious breakout, because it’s a big, obvious fit.

Moving downstream, I think my answer gets a little bit worse. Everybody should probably take a look at how they can use a language model, but the use cases are probably less obvious. Like not everybody needs a chatbot, not everybody needs to have autocomplete of text or something like that. 

But whether it means that your software engineers are more efficient because they’re using Copilot, or that you have better internal search of your documentation, or that your product’s documentation has better search capabilities because you can index it with language models – that’s probably true for most people in some form. And once you get more complicated – as I said, there are opportunities to automate actions and other processes – you start to open a whole can of worms that touches nearly everything.

I guess there’s stuff that’s obviously completely transformed by language models – anywhere content is being generated should be completely transformed in some sense. Then there’s a long tail of potential augmentative changes that apply across nearly every industry.

Tools to help with the deployment of LLMs

Stephen: Right, thanks for sharing that. And just two final questions before we sort of wrap up the session. 

Are there tools that you’re seeing a real change in the landscape now that folks should be aware of right now, especially that’s really making the deployment of these models easier?

David: Well, since we were complaining about LLMOps, I’ll call out a few of the folks that are working in that space and doing cool stuff. The biggest takeoff tool to help people with prompting and orchestrating prompts is LangChain – it’s gotten really popular.

They have a Python library and a JavaScript library, and they’re iterating at an incredible rate. That community is really amazing and vibrant. So check that out if you’re trying to get started and tinker – I think it’s the best place to get started.

Other tools like Dust and GPT Index are in a similar space, helping you write and build prototypes that actually interact with language models.

There’s some other stuff around. We talked a lot about evaluation and monitoring; there’s a company called Humanloop and a company called HoneyHive that are both in that space, as well as four or five companies in the current YC batch – maybe they’ll get mad at me for not calling them out individually, but they’re all building really cool stuff there.

A lot of new stuff is coming out around evaluation, managing prompts, managing costs, and everything. So I’d say take a look at those tools and familiarize yourself with the new things we need help with.

The future of MLOps with GPT, GPT-3, and GPT-4

Stephen: Awesome. Thanks, David. Definitely leave those in the show notes as well for the later podcast episode that will be released. 

Any final words, David, on the future of MLOps, with GPT-3 here and GPT-4 on the horizon?

David: I’ve been working on MLOps for years and years now, and this is the most excited I’ve ever been. Because I think this is the opportunity we have to go from a relatively niche field to impacting everybody and every product. And so that’s going to change things, and there are a lot of differences.

But for the first time, I feel like ML is really getting there. I’ve been hoping that MLOps would make it so that everybody in the world could use ML to change their products, and this is the closest I feel we’ve been – by lowering the barrier to entry, everybody can do it. So I think we have a huge opportunity to bring ML to the masses now, and I hope that as a community, we can all make that happen.

Wrap up

Stephen: Great. I hope so as well because I’m also excited about the landscape in and of itself. So thank you so much. David, where can people find you and connect with you online?

David: Yeah, both LinkedIn and Twitter are great.

@DavidSHershey on Twitter, and David Hershey on LinkedIn. So please reach out, shoot me a message anytime. Happy to chat about language models, MLOps, whatever floats your boat.

Stephen: Awesome. Here at MLOps Live, we’ll be back again in two weeks, and we’re going to be talking with Leanne about how you can navigate organizational barriers while doing MLOps. Lots of MLOps stuff on the horizon, so don’t miss out on that one. Thank you so much, David, for joining the session. We appreciate your time and your work as well. It was really great to have you.

David: Thanks for having me. It was really fun.

Stephen: Awesome. Bye and take care.

Deploying Large NLP Models: Infrastructure Cost Optimization https://neptune.ai/blog/nlp-models-infrastructure-cost-optimization Thu, 23 Mar 2023 09:24:59 +0000 https://neptune.ai/?p=19513 NLP models in commercial applications, such as text generation systems, have attracted great interest among users. These models have achieved groundbreaking results in many NLP tasks like question-answering, summarization, language translation, classification, and paraphrasing.

Models like ChatGPT, Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) are predominantly very large and often referred to as large language models, or LLMs. These models can easily have millions to billions of parameters, making them financially expensive to deploy and maintain.

The size of large NLP models is increasing | Source

Such large natural language processing models require significant computational power and memory, which is often the leading cause of high infrastructure costs. Even if you are fine-tuning an average-sized model for a large-scale application, you need to amass a huge amount of data.

Such scenarios inevitably lead to stacking new layers of neural connections, turning it into a large model. Moreover, deploying these models requires fast and expensive GPUs, which ultimately adds to the infrastructure cost. So is there a way to keep these expenses in check?

Sure there is.

This article provides strategies, tips, and tricks you can apply to optimize your infrastructure while deploying large NLP models. In the following sections, we will explore:

  • 1 The infrastructural challenges faced while deploying large NLP models.
  • 2 Different strategies to reduce the costs associated with these challenges.
  • 3 Other handy tips you might want to know to address this issue.

You may also like

How to Deploy NLP Models in Production

What Does GPT-3 Mean For the Future of MLOps? With David Hershey

Challenges of large NLP models

Computational resources

LLMs require a significant amount of resources for optimal performance. Below are the challenges usually faced in this regard.

1. High computational requirements

Deploying LLMs can be challenging as they require significant computational resources to perform inference. This is especially true when the model is used for real-time applications, such as chatbots or virtual assistants. 

Consider ChatGPT as an example. It is capable of processing and responding to queries within seconds (most of the time). But when user traffic is higher, inference time increases. Other factors can delay inference as well, such as the complexity of the question and the amount of information required to generate a response. In any case, if the model is supposed to serve in real time, it must be capable of high throughput and low latency.

2. Storage capacity

With parameters ranging from millions to billions, LLMs can pose storage capacity challenges. It would be ideal to store the whole model on a single storage device, but because of the size, it is often not possible.

For example, OpenAI’s GPT-3 model, with 175B parameters, requires over 300GB of storage for its parameters alone. Additionally, it requires a GPU with a minimum of 16GB of memory to run efficiently. Storing and running such a large model on a single device may be impractical for many use cases due to the hardware requirements. As such, there are three main issues around storage capacity with LLMs:

2.1 Memory limitations

LLMs require a lot of memory as they process a huge amount of information. This can be challenging, especially when you want to deploy them on a low-memory device such as a mobile phone. 

One way to deploy such models is to use distributed inference, where the model is distributed across multiple nodes or servers. This spreads out the workload and speeds up the process. The challenge is that it may require significant expertise to set up and maintain. Plus, the larger the model, the more servers are required, which again increases the deployment cost.

2.2 Large model sizes

The MT-NLG model released in 2022 has 530 billion parameters and requires several hundred gigabytes of storage. High-end GPUs and basic data parallelism aren’t sufficient for deployment, and even alternative solutions like pipeline and model parallelism come with trade-offs between functionality, usability, and memory/compute efficiency. As the authors of the paper “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” put it, this, in turn, reduces the effectiveness of the model.

For instance, a 1.5B parameter model on a 32GB GPU can easily run out of memory during inference if the input query is long and complicated. Even basic inference on an LLM requires multiple accelerators or multi-node computing clusters, like multiple Kubernetes pods. Researchers have proposed techniques for offloading parameters to local RAM, but these have turned out to be inefficient in practical use-case scenarios: users cannot download such large-scale models onto their systems just to translate or summarize a given text.
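For illustration, here is a minimal sketch of the parameter-offloading idea using Hugging Face Accelerate’s device-map support in transformers. The checkpoint name and offload folder are illustrative assumptions, and the snippet assumes both `transformers` and `accelerate` are installed.

```python
# A minimal sketch of parameter offloading with Hugging Face Accelerate
# (assumes `transformers` and `accelerate` are installed; the checkpoint
# name and offload folder below are illustrative assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # illustrative; substitute your own checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",         # spread layers across available GPUs and CPU RAM
    offload_folder="offload",  # spill weights that fit nowhere else to disk
)

inputs = tokenizer("Summarize: large models are costly to serve.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```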

2.3 Scalability challenges 

Another challenge with LLMs is scalability. A large model is often scaled using model parallelism (MP), which requires storage and memory capacity across multiple devices. This involves dividing the model into smaller parts and distributing them across multiple machines. Each machine processes a different part of the model, and the results are combined to produce the final output. This technique can be helpful in handling large models, but it requires careful consideration of the communication overhead between machines.

In distributed inference, an LLM is deployed on multiple machines, with each machine processing a subset of the input data. This approach is essential for handling large-scale language tasks that require input to pass through billions of parameters.

Most of the time, MP works, but there are instances where it doesn’t. The reason is that MP divides the model vertically, distributing the computation and parameters for each layer across several devices where inter-GPU communication bandwidth is large. This distribution enables intensive communication between layers within a single node. The limitation appears outside a single node, where lower bandwidth leads to a drop in performance and efficiency.
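To make the vertical-split idea concrete, here is a toy sketch of model parallelism in plain PyTorch. It assumes two CUDA devices are available, and the layer sizes are illustrative; real LLM parallelism is far more involved, but the device-to-device copy shown here is the communication cost the paragraph above describes.

```python
# A toy sketch of vertical model parallelism in PyTorch: the layers are
# split across two GPUs, and activations cross devices on every forward
# pass (assumes two CUDA devices; layer sizes are illustrative).
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # The device-to-device copy below is exactly the communication
        # overhead that makes multi-node model parallelism expensive.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 512))
print(out.shape)  # torch.Size([8, 10])
```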

3. Bandwidth requirements

As discussed previously, LLMs have to be scaled using MP. The issue is that MP is efficient in single-node clusters, but in a multi-node setting, inference isn’t efficient. This is because of the low-bandwidth networks between nodes.

Deploying a large language model requires multiple network requests to retrieve data from different servers. Network latency affects the time required to transfer data between the servers, which can result in slower performance and, eventually, high latency and response times. This can cause delays in processing, which impacts user experience.

4. Resource constraints

Limited storage capacity can restrict the ability to store multiple versions of the same model, which can make it difficult to compare the performance of different models and track the progress of model development over time. This is especially true if you want to adopt a shadow deployment strategy.

Energy consumption

As discussed above, serving LLMs requires significant computational resources, which can lead to high energy consumption and a large carbon footprint. This can be problematic for organizations committed to reducing their environmental impact.

For reference, the image below shows cost estimates for several LLMs, along with the carbon footprint they produce during training.

Financial estimation of the large NLP models, along with the carbon footprint that they produce during training | Source

What is more shocking is that 80-90% of the machine learning workload is inference processing, according to NVIDIA. Likewise, according to AWS, inference accounts for 90% of machine learning demand in the cloud.

Cost

Deploying and using LLMs can be costly, including the cost of hardware, storage, and infrastructure. Additionally, the cost of deploying the model can be significant, especially when using resources such as GPUs or TPUs for low latency and high throughput during inference. This can make it challenging for smaller organizations or individuals to use LLMs for their applications.

To put this into perspective, the running cost of ChatGPT is estimated at around $100,000 per day, or $3M per month.

Tweet about ChatGPT costs | Source

Strategies for optimizing infrastructure costs of large NLP models

In this section, we will explore possible solutions and techniques for the challenges discussed in the previous section. It is worth noting that when you deploy a model on the cloud, you choose an inference option and thereby create an endpoint. See the image below.

The general workflow for inference endpoints | Source

Keeping that in mind, and with all the challenges we discussed earlier, let’s look at techniques that can be used to optimize infrastructure costs when deploying LLMs. Below are some steps you can follow to deploy your model as efficiently as possible.

Smart use of cloud computing for computational resources

Using cloud computing services can provide on-demand access to powerful computing resources, including CPUs and GPUs. Cloud computing services are flexible and can scale according to your requirements. 

One important tip: make a budget for your project. A budget always helps you find ways to optimize your project without exceeding your financial limitations.

Now, when it comes to cloud services, there are a lot of companies offering their platforms. Cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer a range of options for deploying LLMs, including virtual machines, containers, and serverless computing. But regardless, you must do your own research and calculations. For instance, you must know these three things:

  • 1 The model size.
  • 2 Details about the hardware to be used.
  • 3 Right inference option.

Once you have the details, you can calculate how much accelerated computing power you need. Based on that, you can plan and execute your model deployment.

Learn more

MLOps Tools for NLP Projects

Calculating model size

The table below will give you an idea of how many FLOPs you might need for your model. Once you have an estimate, you can go ahead and find the relevant GPU on your preferred cloud platform.

Estimated optimal training FLOPs and training tokens for various NLP model sizes | Source

A tool I found in the blog post “Estimating Training Compute of Deep Learning Models” allows you to calculate the FLOPs required for your model, both for training and inference.

A tool that calculates the FLOPs required for both training and inference | Source

The app is based on the works of Kaplan et al. (2020) and Hoffmann et al. (2022), which show how to train a model on a fixed compute budget. To understand more about this subject, you can read the blog here.
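As a rough back-of-the-envelope alternative to the tool, the scaling-law literature commonly approximates training compute as about 6 × N × D FLOPs and inference as about 2 × N FLOPs per generated token, where N is the parameter count and D the number of training tokens. A minimal sketch (the 1.3B/26B figures are purely illustrative):

```python
# Back-of-the-envelope compute estimates using common scaling-law
# approximations: training ~ 6 * N * D FLOPs, inference ~ 2 * N FLOPs
# per generated token (N = parameters, D = training tokens). These are
# rough rules of thumb, not exact figures.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def inference_flops_per_token(n_params: float) -> float:
    return 2.0 * n_params

# Illustrative example: a 1.3B-parameter model trained on 26B tokens
n, d = 1.3e9, 26e9
print(f"Training:  {training_flops(n, d):.2e} FLOPs")
print(f"Inference: {inference_flops_per_token(n):.2e} FLOPs per token")
```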

Selecting the right hardware

Once you have calculated the required FLOPs, you can go ahead and choose a GPU. Make sure you are aware of the features the GPU offers. For instance, see the image below.

The list of GPU specifications offered by NVIDIA | Source

Above you can see the list of specifications that NVIDIA offers. Similarly, you can compare different GPUs and see which one suits your budget. 

Choosing the right inference option

Once you have calculated the model size and selected the GPU, you can proceed to choose the inference option. Amazon SageMaker offers multiple inference options to suit different workloads. For instance:

  1. Real-time inference, which is suitable for low-latency or high-throughput online inferences and supports payload sizes up to 6 MB and processing times of 60 seconds.
  2. Serverless inference, which is ideal for intermittent or unpredictable traffic patterns and supports payload sizes up to 4 MB and processing times of 60 seconds. In serverless inference, the model scales automatically based on the incoming traffic or requests. At times when the model is sitting idle you won’t be charged. It offers a pay-as-you-use facility. 
  3. Batch transform is suitable for offline processing of large datasets and supports payload sizes of GBs and processing times of days. 
  4. Asynchronous inference is suitable for queuing requests with large payloads and long processing times, supports payloads up to 1 GB and processing times up to one hour, and can scale down to 0 when there are no requests.

To get a better understanding and match your requirements, look at the image below.

Choosing model deployment options | Source

When all the above points are satisfied, you can then deploy the model on any of the cloud services. 
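As a hedged sketch of what such a deployment can look like in code, here is a serverless endpoint created with the SageMaker Python SDK. The model ID, container versions, and sizing values are illustrative assumptions that you should adapt to your account, region, and model.

```python
# A hedged sketch of a serverless SageMaker endpoint via the SageMaker
# Python SDK. The model ID, container versions, and sizing values are
# illustrative assumptions -- check what your region/account supports.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = sagemaker.get_execution_role()  # assumes a SageMaker execution context

model = HuggingFaceModel(
    env={"HF_MODEL_ID": "distilgpt2", "HF_TASK": "text-generation"},
    role=role,
    transformers_version="4.26",  # illustrative version pins
    pytorch_version="1.13",
    py_version="py39",
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,  # scale to your model's memory needs
    max_concurrency=5,
)

predictor = model.deploy(serverless_inference_config=serverless_config)
print(predictor.predict({"inputs": "Serverless LLM inference says:"}))
```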

To quickly summarize:

  • 1 Set a budget
  • 2 Calculate the size of the model 
  • 3 Compute the FLOPs required for the model
  • 4 Find the right GPU
  • 5 Choose the appropriate inference option 
  • 6 Research the pricing offered by various cloud computing platforms
  • 7 Find the service that suits your needs and budget
  • 8 Deploy it. 

Optimizing the model for serving

In the last section, I discussed how the size of LLMs can pose a problem for deployment. When your model is too large, strategies like model compilation, model compression, and model sharding can be used. These techniques reduce the size of the model while preserving accuracy, allowing easier deployment and significantly reducing the associated expenses.

Let’s explore each of those in detail. 

Different techniques or strategies to optimize LLMs for deployment | Source

Model compression

Model compression is a technique used to optimize and transform an LLM into an efficient executable model that can be run on specialized hardware or software platforms, usually cloud services. The goal of model compression is to improve the performance and efficiency of LLM inference by leveraging hardware-specific optimizations, such as a reduced memory footprint, improved computational parallelism, and reduced latency.

This is a good technique because it lets you experiment with different combinations, set performance benchmarks for various tasks, and find a price that suits your budget. Model compression involves several steps:

  1. Graph optimization: The high-level LLM graph is transformed and optimized using graph optimization techniques such as pruning and quantization to reduce the computational complexity and memory footprint of the model. This, in turn, makes the model small while preserving its accuracy. 
  2. Hardware-specific optimization: The optimized LLM graph is further optimized to leverage hardware-specific optimizations. For instance, Amazon Sagemaker provides model serving containers for various popular ML frameworks, including XGBoost, scikit-learn, PyTorch, TensorFlow, and Apache MXNet, along with software development kits (SDKs) for each container.
Illustration of Amazon Sagemaker's workflow
How AWS Sagemaker Neo works | Source

Here are a few model compression techniques that one must know.

Model quantization

Model quantization (MQ) is a technique used to reduce the memory footprint and computation requirements of an LLM. MQ essentially represents the model parameters and activations with lower-precision data types. The goal of model quantization is to improve the efficiency of LLM inference by reducing memory bandwidth requirements and exploiting hardware-specific optimizations for lower-precision arithmetic.

PyTorch offers model quantization: its API can reduce model size by a factor of 4 and the memory bandwidth required by the model by a factor of 2 to 4. As a result, inference speed can increase by 2 to 4 times, owing to the reduced memory bandwidth requirements and faster computations using int8 arithmetic. However, the precise degree of acceleration depends on the hardware, runtime, and model used.
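As a concrete illustration, here is a minimal sketch of PyTorch’s post-training dynamic quantization. The stand-in model and layer sizes are illustrative assumptions, not a real LLM, but the size reduction mechanism is the same.

```python
# A minimal sketch of post-training dynamic quantization in PyTorch:
# nn.Linear weights are stored as int8, cutting checkpoint size roughly
# 4x. The stand-in model below is illustrative, not a real LLM.
import os
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def disk_size_mb(m, path):
    torch.save(m.state_dict(), path)
    return os.path.getsize(path) / 1e6

print(f"fp32 checkpoint: {disk_size_mb(model, 'fp32.pt'):.1f} MB")
print(f"int8 checkpoint: {disk_size_mb(quantized, 'int8.pt'):.1f} MB")
```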

There are several approaches to model quantization for LLMs, including:

  1. Post-training quantization: In this approach, the LLM is first trained using floating-point data types, and then the weights and activations are quantized to lower-precision data types post-training. This approach is simple to implement and can achieve good accuracy with a careful selection of quantization parameters.
  2. Quantization-aware training: Here, the LLM is quantized during training, allowing the model to adapt to the reduced precision during training. This approach can achieve higher accuracy than post-training quantization but requires more computation during training.
  3. Hybrid quantization: It combines both post-training quantization and quantization-aware training, allowing the LLM to adapt to lower-precision data types during training while also applying post-training quantization to further reduce the memory footprint and computational complexity of the model.

Model quantization can be challenging to implement effectively, as it requires careful consideration of the trade-offs between reduced precision and model accuracy, as well as the hardware-specific optimizations that can be leveraged with lower-precision arithmetic. However, when done correctly, model quantization can significantly improve the efficiency of LLM inference, enabling real-time inference on large-scale datasets and edge devices.
Model pruning

Model pruning (MP) is another technique used to reduce the size and computational complexity of an LLM by removing redundant or unnecessary model parameters. The goal of MP is to improve the efficiency of LLM inference without sacrificing accuracy.

MP involves identifying and removing redundant or unnecessary model parameters using various pruning algorithms. These algorithms can be broadly categorized into two groups:

  1. Weight pruning: In weight pruning, individual weights in the LLM are removed based on their magnitude or importance, using techniques such as magnitude-based pruning or structured pruning. Weight pruning can significantly reduce the number of model parameters and the computational complexity of the LLM, but it may require fine-tuning of the pruned model to maintain its accuracy.
  2. Neuron pruning: In neuron pruning, entire neurons or activations in the LLM are removed based on their importance, using techniques such as channel pruning or neuron-level pruning. Neuron pruning can also significantly reduce the number of model parameters and the computational complexity of the LLM, but it may be more difficult to implement and may require more extensive retraining and possibly fine-tuning to maintain accuracy.

Here are a couple of approaches to model pruning:

  1. Post-training pruning: In this approach, the LLM is first trained using standard techniques and then pruned using one of the pruning algorithms. The pruned LLM is then fine-tuned to preserve its accuracy.
  2. Iterative pruning: Here, the model is trained using standard training techniques and then pruned iteratively over several rounds of training and pruning. This approach can achieve higher levels of pruning while preserving accuracy.

You can explore this Colab notebook by PyTorch to better understand model pruning.
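For a flavor of what magnitude-based weight pruning looks like in code, here is a minimal sketch using PyTorch’s torch.nn.utils.prune utilities. The single Linear layer and the 30% pruning amount are illustrative assumptions.

```python
# A minimal sketch of magnitude-based (L1) weight pruning with
# torch.nn.utils.prune, applied to one Linear layer for clarity.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(1024, 1024)

# Zero out the 30% of weights with the smallest absolute magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity after pruning: {sparsity:.1%}")

# Make the pruning permanent (drops the re-parametrization hooks)
prune.remove(layer, "weight")
```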

Model distillation

Model distillation (MD) is a technique used to transfer knowledge from a large LLM, called the teacher, to a smaller, more efficient model, called the student. It is used in the context of model compression. In a nutshell, the teacher model provides guidance and feedback to the student model during training. See the image below.

DistilBERT’s distillation process | Source

MD involves training a student – a smaller, more efficient model – to mimic the behavior of the teacher, a more complex LLM. The student model is trained using a combination of labeled data and the output probabilities of the larger LLM.

There are several approaches to model distillation for LLMs, including:

  1. Knowledge distillation: In this approach, the smaller model is trained to mimic the output probabilities of the larger LLM using a temperature scaling factor. The temperature scaling factor is used to soften the output probabilities of the teacher model, allowing the smaller model to learn from the teacher model’s behavior more effectively (see the sketch after this list).
  2. Self-distillation: In this approach, the larger LLM is used to generate training examples for the smaller model by applying the teacher model to unlabeled data. The smaller model is then trained on these generated examples, allowing it to learn from the behavior of the larger LLM without requiring labeled data.
  3. Ensemble distillation: In this approach, multiple smaller models are trained to mimic the behavior of different sub-components of the larger LLM. The outputs of these smaller models are combined to form an ensemble model that approximates the behavior of the larger LLM.
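To make the knowledge-distillation approach concrete, here is a minimal sketch of the temperature-scaled distillation loss in PyTorch. The blending weight `alpha` and temperature `T` are tunable assumptions, not values prescribed by the article.

```python
# A minimal sketch of the temperature-scaled knowledge-distillation
# loss: the student matches the teacher's softened distribution while
# also fitting the true labels. `T` and `alpha` are tunable assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```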

Optimizing hardware and software requirements

Hardware is an important consideration when deploying LLMs. Here are some useful steps you can take to optimize hardware performance:

  1. Choose hardware that matches the LLM’s requirements: Depending on the LLM’s size and complexity, you may need hardware with a large amount of RAM, high-speed storage, or multiple GPUs to speed up inference. Opt for hardware that provides the necessary processing power, memory, and storage capacity without overspending on irrelevant features.
  2. Use specialized hardware: You can use specialized hardware such as TPUs (Tensor Processing Units) or FPGAs (Field-Programmable Gate Arrays) that are designed specifically for deep learning tasks. Similarly, accelerated linear algebra (XLA) can be leveraged at inference time.

Although such hardware can be expensive, there are smart ways to consume it. You can opt for charge-on-demand pricing for the hardware used. For instance, Elastic Inference from AWS SageMaker helps you lower your cost when the model is not fully utilizing a GPU instance for inference.

  3. Use optimized libraries: You can use optimized libraries such as TensorFlow, PyTorch, or JAX that leverage hardware-specific features to speed up computation without needing additional hardware.
  4. Tune the batch size: Consider tuning the batch size during inference to maximize hardware utilization and improve inference speed (see the sketch after this list). This inherently reduces the hardware requirement, thus cutting costs.
  5. Monitor and optimize: Finally, monitor the LLM’s performance during deployment and optimize the hardware configuration as needed to achieve the best performance.
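For the batch-size tip above, here is a minimal batched-inference sketch; grouping requests into one forward pass usually raises GPU utilization compared to one-at-a-time calls. The batch size of 16 is an illustrative assumption.

```python
# A minimal batched-inference sketch: grouping requests into one forward
# pass usually raises GPU utilization versus one-at-a-time calls.
import torch

def batched_predict(model, requests, batch_size: int = 16):
    outputs = []
    with torch.no_grad():
        for i in range(0, len(requests), batch_size):
            batch = torch.stack(requests[i:i + batch_size])
            outputs.extend(model(batch))
    return outputs
```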

Cost-efficient scalability

Here’s how you can scale your large NLP models while keeping costs in check:

  1. Choose the right inference option – one that scales automatically, like serverless inference – as it reduces the deployment cost when demand is low.

A rigid architecture will always occupy the same amount of memory even when demand is low, so deployment and maintenance costs stay the same. In contrast, a scalable architecture can scale horizontally or vertically to accommodate an increased workload and return to its original configuration when the model lies dormant. Such an approach reduces maintenance costs whenever the additional nodes are not being used.

  2. Optimize inference performance by using hardware acceleration, such as GPUs or TPUs, and by optimizing the inference code.
  3. Amazon’s Elastic Inference is yet another great option, as it can reduce costs by up to 75% because the model no longer keeps extra GPUs around for inference. For more on Elastic Inference, read this article here.

Cutting energy costs

  1. Choose an energy-efficient cloud infrastructure that uses renewable energy sources or carbon offsets to reduce the carbon footprint of its data centers. You can also consider choosing energy-efficient GPUs. Check out this article by Wired to understand more.
  2. Use caching, which helps reduce the computational requirements of LLM inference by storing frequently requested responses in memory (a minimal sketch follows this list). This can significantly reduce the number of computations required to generate responses to user requests. It also helps address bandwidth issues, since it reduces the time to access data. You can store frequently accessed data in cache memory so that it can be quickly accessed without the need for additional bandwidth, which means you don’t have to opt for additional storage and memory devices.
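Here is the caching sketch mentioned above: a minimal in-process memoization of generations, where `generate_with_model` is a hypothetical stand-in for your real (expensive) inference call.

```python
# A minimal response-cache sketch: identical prompts skip the expensive
# model call. `generate_with_model` is a hypothetical stand-in.
from functools import lru_cache

def generate_with_model(prompt: str) -> str:
    # Placeholder for the real (expensive) LLM inference call
    return f"response to: {prompt}"

@lru_cache(maxsize=10_000)
def cached_generate(prompt: str) -> str:
    return generate_with_model(prompt)

print(cached_generate("What is MLOps?"))  # computed once
print(cached_generate("What is MLOps?"))  # served from the cache
```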

Deploying large NLP models: other useful tips

Estimating the NLP model size before training

Keeping your model size in check could, in turn, keep your infrastructure costs in check. Here are a few things to keep in mind while getting your large NLP model ready.

  1. Consider the available resources: The size of the LLM for deployment should take into account the available hardware resources, including memory, processing power, and storage capacity. The LLM’s size should be within the limits of the available resources to ensure optimal performance.
  2. Fine-tuning: Choose a model with optimal accuracy and then fine-tune it on a task-specific dataset. This step will increase the efficiency of the LLM and keep its size from spiralling out of control.
  3. Consider the tradeoff between size and performance: The LLM’s size should be selected based on the tradeoff between size and performance. A larger model size may provide better performance but may also require more resources and time. Therefore, it is essential to find the optimal balance between size and performance.

Use a lightweight deployment framework

Many LLMs are too large to be deployed directly to a production environment. Consider using a lightweight deployment framework like TensorFlow Serving or TorchServe that can host the model and serve predictions over a network. These frameworks help reduce the overhead of loading and running the model on the server, thereby reducing deployment and infrastructure costs.
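As a small sketch of preparing a model for such a framework, you can export it to TorchScript so a server like TorchServe can load it without your Python class definitions. The toy model below is an illustrative stand-in for a trained network.

```python
# A hedged sketch of exporting a model to TorchScript so a serving
# framework such as TorchServe can load it without your Python code.
# The toy model is an illustrative stand-in for a trained network.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
).eval()

example_input = torch.randn(1, 768)
scripted = torch.jit.trace(model, example_input)
scripted.save("model.pt")  # package this file with torch-model-archiver
```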

Post-deployment model monitoring

Model monitoring helps optimize the infrastructure cost of deployment by providing insights into the performance and resource utilization of deployed models. By monitoring the resource consumption of deployed models, such as CPU, memory, and network usage, you can identify opportunities to optimize infrastructure usage and reduce costs.

  • Monitoring can identify underutilized resources, allowing you to scale back on unused resources and reduce infrastructure costs. 
  • Monitoring can identify resource-intensive operations or models, enabling organizations to optimize their architecture or refactor the model to be more efficient. This can also lead to cost savings. 

Check also

Tips and Tricks to Train State-Of-The-Art NLP Models

Key takeaways

  • 1 Set a budget.
  • 2 Calculate the size of the model.
  • 3 Use model compression techniques like pruning, quantization, and distillation to decrease the memory and computation required for deployment.
  • 4 Utilize cloud computing services like AWS, Google Cloud, and Microsoft Azure for cost-effective solutions with scalability options.
  • 5 Leverage serverless computing for a pay-per-use model, lower operational overhead, and auto-scaling.
  • 6 Optimize hardware acceleration, such as GPUs, to speed up model training and inference.
  • 7 Regularly monitor resource usage to identify areas where costs can be reduced, such as underutilized resources or overprovisioned instances.
  • 8 Continuously optimize your model size and hardware for cost-efficient inference.
  • 9 Update software and apply security patches to ensure safety.

Conclusion

In this article, we explored the challenges of deploying an LLM and the inflated infrastructure costs associated with them. We also addressed each of these difficulties with the necessary techniques and solutions.

Out of all the solutions we discussed, the two I would recommend the most for reducing infrastructure costs during deployment are elastic and serverless inference. Model compression is good and valid, but when demand is high, even a smaller model can incur costs like a larger one, thus increasing the infrastructure cost. We therefore need a scalable approach and a pay-per-demand service, and that’s where these inference services come in handy.

It goes without saying that my recommendation might not be the most ideal for your use case, and you can pick any of these approaches depending on the kind of problems you are dealing with. I hope what we discussed here will go a long way in helping you cut down your deployment infrastructure costs for your large NLP models. 

Knowledge Graphs With Machine Learning [Guide] https://neptune.ai/blog/web-scraping-and-knowledge-graphs-machine-learning Wed, 19 Oct 2022 14:29:23 +0000 https://neptune.test/web-scraping-and-knowledge-graphs-machine-learning/ You need to get some information online. For example, a few paragraphs about Usain Bolt. You can copy and paste the information from Wikipedia; it won't be much work.

But what if you needed to get information about all competitions that Usain Bolt had taken part in and all related stats about him and his competitors? And then what if you wanted to do that for all sports, not just running? Additionally, what if you wanted to infer some relationship among this plethora of unstructured data?

Machine learning engineers often need to build complex datasets like the example above to train their models. Web scraping is a very useful method to collect the necessary data, but it comes with some challenges.

In this article, I’m going to explain how to build knowledge graphs by scraping publicly available data.

What is web scraping?

Web scraping (or web harvesting) is the automated extraction of data from websites. The term typically refers to collecting data with a bot or web crawler. It's a form of copying in which specific data is gathered from the web, typically into a local database or spreadsheet, for later use or analysis.

Diagram of web scraping | Source

You can do web scraping with online services, APIs, or you can write your own code that will do it. 

There are two key elements to web scraping:

  • Crawler: The crawler is an algorithm that browses the web to search for particular data by exploring links across the internet.
  • Scraper: The scraper extracts data from websites. The design of scrapers can vary a lot. It depends on the complexity and scope of the project. Ultimately it has to quickly and accurately extract the data.

A good example of a ready-made library is the Wikipedia scraper library. It does a lot of the heavy lifting for you: you provide URLs containing the required data, and it loads all the HTML from those sites. The scraper then extracts the data you need from this HTML and outputs it in your chosen format, such as an Excel spreadsheet, CSV, or JSON.

Knowledge graph

The amount of content available on the web is incredible already, and it’s expanding at an increasingly fast rate. Billions of websites are linked with the World Wide Web, and search engines can go through those links and serve useful information with great precision and speed. This is in part thanks to knowledge graphs.

Different organizations have different knowledge graphs. For example, the Google Knowledge Graph is a knowledge base used by Google and its services to enhance search engine results with information gathered from a variety of sources. Similar techniques are used in Facebook or Amazon products for a better user experience and to store and retrieve useful information. 

There’s no formal definition of a knowledge graph (KG). Broadly speaking, a KG is a kind of semantic network with added constraints. Its scope, structure, characteristics, and even its uses are still taking shape as the field develops.

Bringing knowledge graphs and machine learning (ML) together can systematically improve the accuracy of systems and extend the range of machine learning capabilities. Thanks to knowledge graphs, results inferred from machine learning models will have better explainability and trustworthiness.

Bringing knowledge graphs and ML together creates some interesting opportunities. In cases where we might have insufficient data, KGs can be used to augment training data. One of the major challenges in ML models is explaining predictions made by ML systems. Knowledge graphs can help overcome this issue by mapping explanations to proper nodes in the graph and summarizing the decision-making process.

Learn more

Data Augmentation in Python: Everything You Need to Know

Data Augmentation in NLP: Best Practices From a Kaggle Master

Another way to look at it is that a knowledge graph stores data that resulted from an information extraction task. Many implementations of KG make use of a concept called triplet — a set of three items (a subject, a predicate, and an object) that we can use to store information about something. 
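
As a minimal sketch, a triple can be represented directly as a Python tuple and loaded into a graph structure; the facts below are illustrative:

import networkx as nx

# Each triple is (subject, predicate, object)
triples = [
    ("Usain Bolt", "competed in", "100m sprint"),
    ("Usain Bolt", "represents", "Jamaica"),
]

graph = nx.MultiDiGraph()
for subj, pred, obj in triples:
    graph.add_edge(subj, obj, label=pred)

print(list(graph.edges(data=True)))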

 A knowledge graph

Node A and Node B are 2 different entities. These nodes are connected by an edge that represents their relationship. This is the smallest KG we can build – also known as a triple. Knowledge graphs come in a variety of shapes and sizes.

Web scraping, computational linguistics, NLP algorithms, and graph theory (with Python code)

Phew, that’s a wordy heading. Anyway, to build knowledge graphs from text, it’s important to help our machine understand natural language. We do this with NLP techniques such as sentence segmentation, dependency parsing, parts-of-speech (POS) tagging, and entity recognition.

The first step to build a KG is to collect your sources — let’s crawl the web for some information. Wikipedia will be our source (always check the sources of data, a lot of information online is false).

For this blog, we’ll use the Wikipedia API, a direct Python wrapper, and neptune.ai to manage, log, store, display, and organize the metadata. If you expect your project to produce a number of artifacts (like this one), it is really helpful to use a tool for tracking and versioning them.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Installation and setup

Install dependencies and scrape data:

!pip install wikipedia-api neptune neptune-notebooks pandas spacy networkx scipy

Check these resources for installation and setup instructions.

We'll write a function that searches Wikipedia for a given topic and extracts information from the target page and its internal links. First, import the required libraries:

import wikipediaapi  # pip install wikipedia-api
import pandas as pd
import concurrent.futures
from tqdm import tqdm

The function below fetches the articles for the topic you provide as input:

def scrape_wikipedia(name_topic, verbose=True):
    def link_to_wikipedia(link):
        try:
            page = api_wikipedia.page(link)
            if page.exists():
                return {'page': link, 'text': page.text, 'link': page.fullurl, 'categories': list(page.categories.keys())}
        except Exception:
            return None

    api_wikipedia = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)
    name_of_page = api_wikipedia.page(name_topic)
    if not name_of_page.exists():
        print('Page {} is not present'.format(name_topic))
        return

    links_to_page = list(name_of_page.links.keys())
    progress = tqdm(desc='Scraped links', unit='', total=len(links_to_page)) if verbose else None
    origin = [{'page': name_topic, 'text': name_of_page.text, 'link': name_of_page.fullurl, 'categories': list(name_of_page.categories.keys())}]

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        links_future = {executor.submit(link_to_wikipedia, link): link for link in links_to_page}
        for future in concurrent.futures.as_completed(links_future):
            info = future.result()
            if info:
                origin.append(info)
            if verbose:
                progress.update(1)
    if verbose:
        progress.close()

    # Drop pages from non-article namespaces and pages with almost no text
    namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki',
                  'Template', 'Help', 'User', 'Category talk', 'Portal talk')
    origin = pd.DataFrame(origin)
    origin = origin[(origin['text'].str.len() > 20)
                    & ~(origin['page'].str.startswith(namespaces, na=True))]
    # Strip the "Category:" prefix from category names
    origin['categories'] = origin.categories.apply(lambda a: [b[9:] for b in a])

    origin['topic'] = name_topic
    print('Scraped pages', len(origin))

    return origin

Let’s test the function on the topic “COVID-19”.

data_wikipedia = scrape_wikipedia('COVID 19')

Output:

Scraped links: 100%|██████████| 1965/1965 [04:30<00:00, 7.25it/s]
Scraped pages 1749

Save the data to a CSV file:

data_wikipedia.to_csv('scraped_data.csv')

Download the spaCy model:

python -m spacy download en_core_web_sm

Import libraries 

import spacy
import pandas as pd
import requests
from spacy import displacy
# import en_core_web_sm
 
nlp = spacy.load('en_core_web_sm')
 
from spacy.tokens import Span
from spacy.matcher import Matcher
 
import matplotlib.pyplot as plot
from tqdm import tqdm
import networkx as ntx
import neptune
 
%matplotlib inline
run = neptune.init_run(api_token="your API key",
                       project="aravindcr/KnowledgeGraphs")

Upload data to Neptune.

run["data"].upload("scraped_data.csv")

Download the data here. Also available in Neptune.

data = pd.read_csv('scraped_data.csv')

Output:

The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test Consortium and manufactured by Abingdon Health. It uses a lateral flow test to determine whether a person has IgG antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single drop of blood obtained from a finger prick and yields results in 20 minutes.

Sentence segmentation

The first step of building a knowledge graph is to split the text document or article into sentences. Then we limit our examples to simple sentences with one subject and one object.

# Lets take part of the above extracted article
docu = nlp('''The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by
the UK Rapid Test Consortium and manufactured by Abingdon Health. It uses a lateral flow test to determine
whether a person has IgG antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single
drop of blood obtained from a finger prick and yields results in 20 minutes.\n\nSee also\nCOVID-19 rapid
antigen test''')
 
for tokn in docu:
   print(tokn.text, "---", tokn.dep_)

Download the pre-trained spaCy model as shown below (if you haven't already):

python -m spacy download en_core_web_sm

The SpaCy pipeline assigns word vectors, context-specific token vectors, part-of-speech tags, dependency parsing, and named entities. By extending SpaCy’s pipeline of annotations you can resolve coreferences (explained below).

Knowledge graphs can be automatically constructed from parts-of-speech and dependency parsing. Extraction of entity pairs from grammatical patterns is fast and scalable to large amounts of text using the NLP library SpaCy.

The following function defines entity pairs as entities/noun chunks with subject-object dependencies connected by a root verb. Other approximations can be used to produce different types of connections. This kind of connection can be referred to as subject-predicate-object triple.

The main idea is to go through a sentence and extract the subject and the object as and when they are encountered. The function below implements these steps.

Entity extraction

You can extract a single-word entity from a sentence with the help of parts-of-speech (POS) tags. The nouns and proper nouns will be the entities. 

However, when an entity spans multiple words, POS tags alone aren’t sufficient. We need to parse the dependency tree of the sentence. To build a knowledge graph, the most important things are the nodes and edges between them. 

These nodes are going to be entities that are present in the Wikipedia sentences. Edges are the relationships connecting these entities. We will extract these elements in an unsupervised manner, i.e. we’ll use the grammar of the sentences.

Again, the idea is to traverse a sentence and pick up the subject and the object as they are encountered.

def extract_entities(sents):
    ## chunk 1
    enti_one = ""
    enti_two = ""

    dep_prev_token = ""  # dependency tag of the previous token in the sentence
    txt_prev_token = ""  # previous token in the sentence

    prefix = ""
    modifier = ""

    for tokn in nlp(sents):
        ## chunk 2: move to the next token if the token is punctuation
        if tokn.dep_ != "punct":
            # check if the token is a compound word
            if tokn.dep_ == "compound":
                prefix = tokn.text
                # prepend the previous word if it was also 'compound'
                if dep_prev_token == "compound":
                    prefix = txt_prev_token + " " + tokn.text

            # check if the token is a modifier
            if tokn.dep_.endswith("mod"):
                modifier = tokn.text
                # prepend the previous word if it was 'compound'
                if dep_prev_token == "compound":
                    modifier = txt_prev_token + " " + tokn.text

            ## chunk 3: the token is a subject
            if "subj" in tokn.dep_:
                enti_one = modifier + " " + prefix + " " + tokn.text
                prefix = ""
                modifier = ""
                dep_prev_token = ""
                txt_prev_token = ""

            ## chunk 4: the token is an object
            if "obj" in tokn.dep_:
                enti_two = modifier + " " + prefix + " " + tokn.text

            ## chunk 5: update variables
            dep_prev_token = tokn.dep_
            txt_prev_token = tokn.text

    return [enti_one.strip(), enti_two.strip()]
extract_entities("The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test")

Output:

['AbC-19 rapid antibody test', 'COVID-19 UK Rapid Test']

Now let’s use the function to extract entity pairs for 800 sentences.

pairs_of_entities = []
for i in tqdm(data['text'][:800]):
   pairs_of_entities.append(extract_entities(i))

Subject-object pairs from the sentences:

pairs_of_entities[36:42]

Output:

[['where aluminium powder', 'such explosives manufacturing'],
 ['310  people', 'Cancer Research UK'],
 ['Structural External links', '2 PDBe KB'],
 ['which', '1 Medical Subject Headings'],
 ['Structural External links', '2 PDBe KB'],
 ['users', 'permanently  taste']]

Relations extraction

With entity extraction, half the job is done. To build a knowledge graph, we need to connect the nodes (entities). These edges are relations between pairs of nodes. The function below is capable of capturing such predicates from these sentences. I used spaCy’s rule-based matching. The pattern defined in the function tries to find the ROOT word or the main verb in the sentence. 

def obtain_relation(sent):

    doc = nlp(sent)

    matcher = Matcher(nlp.vocab)

    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': "?"},
               {'DEP': 'agent', 'OP': "?"},
               {'POS': 'ADJ', 'OP': "?"}]

    # spaCy v3 expects a list of patterns; in v2, use matcher.add("matching_1", None, pattern)
    matcher.add("matching_1", [pattern])

    matches = matcher(doc)
    if not matches:
        return ""
    h = len(matches) - 1

    span = doc[matches[h][1]:matches[h][2]]

    return span.text

The pattern above tries to find the ROOT word of the sentence. Once it is identified, the matcher checks whether a preposition or an agent word follows it; if so, that word is appended to the root word.
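
As a quick sanity check, assuming the function and nlp model defined above (the exact predicate returned may vary with the spaCy model version):

print(obtain_relation("The test uses a single drop of blood obtained from a finger prick."))
# e.g. 'uses'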

relations = [obtain_relation(j) for j in tqdm(data['text'][:800])]

Building a knowledge graph

Now we can finally create a knowledge graph from the extracted entities and relations.

Let’s draw the network using the networkX library. We’ll create a directed multigraph network with node size in proportion to the degree of centrality. In other words, the relations between any connected node pair are not two-way. They’re only from one node to another.

# subject extraction
source = [j[0] for j in pairs_of_entities]

#object extraction
target = [k[1] for k in pairs_of_entities]

data_kgf = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

We are using the networkx library to create a network from this data frame. Nodes represent entities, while edges represent the relationships between them.

# Create a directed multigraph from the dataframe
graph = ntx.from_pandas_edgelist(data_kgf, "source", "target",
                                 edge_attr=True, create_using=ntx.MultiDiGraph())

# Plot the network
plot.figure(figsize=(14, 14))
posn = ntx.spring_layout(graph)
ntx.draw(graph, with_labels=True, node_color='green', edge_cmap=plot.cm.Blues, pos=posn)
plot.show()
Graph logged to neptune.ai
From the graph above, it's hard to tell which relations were actually captured. Let's filter the data frame by a single relation and visualize only that subgraph. Here, I'm choosing the relation "Information from":

graph = ntx.from_pandas_edgelist(data_kgf[data_kgf['edge'] == "Information from"], "source", "target",
                                 edge_attr=True, create_using=ntx.MultiDiGraph())

plot.figure(figsize=(14, 14))
pos = ntx.spring_layout(graph, k=0.5)  # k regulates the distance between nodes
ntx.draw(graph, with_labels=True, node_color='green', node_size=1400, edge_cmap=plot.cm.Blues, pos=pos)
plot.show()
Graph logged to neptune.ai

One more graph filtered with the relation name “links” can be found here

Logging metadata

I have logged the above networkx graphs to Neptune at the paths shown below. Log your images to different paths depending on the outputs you obtain.

run['graphs/all_in_graph'].upload('graph.png')
run['graphs/filtered_relations'].upload('info.png')
run['graphs/filtered_relations2'].upload('links.png')

All graphs can be found here.

Coreference resolution

To obtain more refined graphs, you can also use coreference resolution.

Coreference resolution is the NLP equivalent of endophoric awareness used in information retrieval systems, conversational agents, and virtual assistants like Alexa. It’s a task of clustering mentions in text that refer to the same underlying entities.

“I”, “my”, and “she” belong to the same cluster, and “Joe” and “he” belong to the same cluster.

Algorithms that resolve coreferences commonly look for the nearest preceding mention that’s compatible with the referring expression. Instead of using rule-based dependency parse trees, neural networks can also be trained, which take into account word embeddings and distance between mentions as features.

This significantly improves entity pair extraction by normalizing text, removing redundancies, and assigning entity pronouns.
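
As a minimal sketch of what this looks like in code, assuming the neuralcoref extension (which targets spaCy 2.x and is not part of the pipeline built above):

import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("My sister has a dog. She loves him.")
print(doc._.coref_clusters)   # clusters of mentions referring to the same entity
print(doc._.coref_resolved)   # e.g. "My sister has a dog. My sister loves a dog."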

If your use case is domain-specific, it would be worth your while to train a custom entity recognition model.

Knowledge graphs can be built automatically and explored to reveal new insights about the domain. 

Notebook uploaded to neptune.ai.

Notebook on GitHub.

Knowledge graphs at scale

To effectively use the entire corpus of 1,749 pages for our topic, use the columns created in the scrape_wikipedia function to add properties to each node. Then you can track the page and category of each node. You can use multiprocessing and parallel processing to reduce execution time. 

Some of the use cases of KGs are:

  • 1 Question answering,
  • 2 Storing information,
  • 3 Recommendation systems,
  • 4 Supply chain management.

Challenges ahead

Entity disambiguation and managing identity

In its simplest form, the challenge is assigning a unique normalized identity and a type to an utterance or a mention of an entity. 

Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be incorrectly associated with the wrong facts and result in incorrect inference downstream.

Type membership and resolution

Most knowledge-graph systems today allow each entity to have multiple types, with specific types for different circumstances. Cuba can be a country or it can refer to the Cuban government. In some cases, knowledge-graph systems defer the type assignment to runtime. Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.

Check also

Exploratory Data Analysis for Natural Language Processing: A Complete Guide to Python Tools

Managing changing knowledge

An effective entity-linking system needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break a single existing entity into multiple. 

When a company acquires another company, does the acquiring company change its identity? Does identity follow the acquisition of the rights to a name? For example, in the case of KGs constructed in the healthcare industry, patient data will change over a period of time.

Knowledge extraction from multiple structured and unstructured sources

The extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing graphs at scale require manual approaches and unsupervised and semi-supervised knowledge extraction from unstructured data in open domains.

Managing operations at scale

Managing scale is the underlying challenge that affects several operations related to performance and workload directly. It also manifests itself indirectly as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs.

Note: for more details on how different tech giants implement industry-scale knowledge graphs in their products, and the challenges involved, check this article.

Conclusion 

I hope you’ve learned something new here, and this article helped you understand web scraping, knowledge graphs, and a few useful NLP concepts. 

Thanks for reading, and keep on learning!

Hugging Face Pre-trained Models: Find the Best One for Your Task https://neptune.ai/blog/hugging-face-pre-trained-models-find-the-best Mon, 17 Oct 2022 07:58:36 +0000 https://neptune.test/hugging-face-pre-trained-models-find-the-best/ When tackling machine learning problems, pre-trained models can significantly speed up the process. Repurposing existing solutions saves time and computational costs for both engineers and companies. Launched in 2017 (originally as a chat interface), Hugging Face is a key provider of open-source libraries with pre-trained models, making it a valuable resource in this space.

Soon after, they released the Transformers library and other NLP resources like datasets and tokenizers, making high-quality NLP models accessible to everyone. This move quickly gained traction, especially among major tech companies.

Hugging Face specializes in Natural Language Processing (NLP) tasks, focusing on models that not only recognize words but also understand their meaning and context. Unlike humans, computers need a structured pipeline—a series of processing steps—to interpret language meaningfully. Hugging Face’s models and tools provide this structure, making it easier for companies to integrate NLP technologies that enable natural, human-like interactions.

As more companies focus on better user experiences, NLP tools like Hugging Face are becoming essential. In the following sections, we’ll explore this tool and its transformers in more depth, with hands-on examples to help you start building your own projects.

Getting started

A transformer is a deep learning model that adopts the mechanism of attention, differentially weighting the significance of each part of the input data. It is used primarily in the field of natural language processing. Wikipedia

Before diving into model selection, it’s essential to clearly define your use case and the specific goals you want to achieve. Hugging Face offers a range of transformers and models tailored to different tasks. Their platform provides an efficient model search tool with various filters to help you find exactly what you need.

On the Hugging Face website, the model page includes filters like Tasks, Libraries, Datasets, and Languages:

List of models and filters on the Hugging Face website | Source

Let’s say you are looking for models that can satisfy the following requirements:

  • Translates text from one language to another
  • Supports PyTorch

Once you have selected these filters, you will get a list of pre-trained models as shown below:

Selecting a model from the Hugging Face website | Source

You will also need to ensure that you provide the inputs in the same format as the pre-trained models were trained with. Select a model from the list, and let’s start setting up the environment for it.

Setting up your environment

Hugging Face supports over 20 libraries, including popular ones like TensorFlow, PyTorch, and FastAI. Here’s how to install the necessary libraries using pip:

1. Install PyTorch:

!pip install torch

2. Install the Transformers library:

!pip install transformers

Once installed, you can start working with the Hugging Face NLP library. There are two main ways to get started:

  • Using Pipelines: A simple, high-level approach that provides pre-configured tasks.
  • Using Pre-trained Models Directly: Load any available model and adapt it to your specific task.

Note that these models can be large, so it’s recommended to experiment in cloud environments like Google Colab or Kaggle Notebooks, where downloading and storage are more manageable.

In the following sections, we’ll explore using pipelines and working directly with pre-trained models.

Basic NLP tasks supported by Hugging Face

Hugging Face offers powerful, pre-trained models for a wide range of NLP tasks. Here’s a quick look at the key tasks it supports and why they matter:

1. Sequence classification

This involves predicting a category for a sequence of inputs among predefined classes, which is useful for applications like sentiment analysis, spam detection, and grammar correction. For example, sequence classification can determine whether a review is positive or negative.
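
For instance, here is a minimal sketch with the pipeline API; the example sentence is illustrative, and a default English sentiment model is downloaded on first use:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This library makes NLP so much easier!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]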

2. Question answering

This task focuses on generating answers to contextual questions, whether open- or close-ended. A question-answering model can search through a structured database or unstructured text to provide accurate answers, much like a virtual assistant.

3. Named entity recognition (NER)

NER identifies specific entities—like people, places, or organizations—within the text, enabling applications like automated document tagging and information extraction.

4. Summarization

Summarization makes long documents shorter and easier to read. Hugging Face supports both extractive summarization, which pulls key sentences, and abstractive summarization, which rephrases text to capture the essence of the original content.

5. Translation

Translation tasks involve converting text from one language to another. Unlike simple word substitution, effective translation requires a deep understanding of syntax, idioms, and linguistic context to produce human-like translations.

6. Language modeling

Language modeling predicts likely word sequences and completes sentences in meaningful ways. Hugging Face supports masked language modeling, where certain words are hidden for the model to predict, and causal language modeling, where the model predicts future words based on past context.
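
As a minimal sketch of masked language modeling, assuming the bert-base-uncased checkpoint; the input sentence is illustrative:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("Hugging Face makes NLP [MASK] to use."):
    print(prediction["token_str"], round(prediction["score"], 3))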

Beyond these, Hugging Face supports tasks such as speech recognition, computer vision, and transcription generation, expanding its utility to audio and visual data. 

Hugging Face Transformers and how to use them

The Transformer in NLP is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. KDnuggets

Transformers, introduced in 2017, revolutionized NLP by enabling models to handle long-range dependencies in text. This architecture consists of an encoder-decoder structure (that will be explained later on), and it facilitates sequence-to-sequence tasks, making it ideal for translation, summarization, and text generation.

The evolution of the Transformer architecture, 2018-2021 | Source

Transformers are language models trained on vast amounts of text through self-supervised or transfer learning. In self-supervised learning, models learn by predicting parts of the data based on other parts, allowing them to train effectively without labeled data. This approach enables transformers to develop a deep understanding of language structure and context, making them powerful tools for various NLP tasks.

Transformer architecture

The transformer language model uses an encoder-decoder architecture, where each component can function both together and independently:

  • Encoder: The encoder takes in the input sequence, processes it iteratively, and identifies the relationships among different parts of the input, building a rich internal representation.
  • Decoder: The decoder then generates an output sequence using the encoder’s representation, drawing on contextual information to produce meaningful and coherent output. 
The transformer model architecture | Source

A critical component of transformer model architecture is the attention layer. This layer enables the model to focus on specific words or details within the input, improving its ability to understand context. It works by mapping a query and a set of key-value pairs to an output, where each element (query, key, value, and output) is represented as a vector. This mapping allows the model to decide which parts of the input are most relevant at each step.
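
To make this mapping concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. It is a simplified single-head version without the learned projection matrices of the full architecture, and the toy tensors are illustrative:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Similarity between each query and every key, scaled by sqrt(d_k)
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    # Normalize the scores into attention weights that sum to 1 per query
    weights = F.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors
    return torch.matmul(weights, value), weights

# Toy example: a batch of one sequence with 4 tokens, 8-dimensional vectors
q = k = v = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)  # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])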

For a deeper dive into the architecture, refer to the influential paper Attention Is All You Need and the Illustrated Transformer blog.

With this foundational understanding, let’s explore how Hugging Face simplifies using transformers in practice.

Introduction to transformers and pipelines

The Hugging Face Transformers library offers a variety of models for different tasks through high-level APIs. Building transformer models from scratch is complex and resource-intensive, involving fine-tuning tens of billions of parameters and extensive training. Hugging Face created this library to make working with these sophisticated models easier, more flexible, and user-friendly by providing access through a single API. With this library, you can load, train, and save models seamlessly.

Creating an NLP solution typically involves several steps, from gathering data to fine-tuning the model for optimal performance. Hugging Face’s library streamlines this process by offering tools to simplify each step.

An example of a typical NLP machine learning pipeline | Source: Author

Using pre-defined pipelines

The Hugging Face Transformers library offers pipelines that handle all pre- and post-processing steps of the input text data. These pipelines encapsulate the overall process of each NLP solution. By connecting a model with the necessary pre- and post-processing steps, pipelines allow you to focus only on providing input texts, making it quick and easy to use pre-trained models for various tasks.

Steps encapsulated by a Hugging Face pipeline | Source

With pipelines, you don’t need to manage each processing step individually. Simply select the relevant pipeline for your use case, and you can quickly create, for instance, a machine translator with minimal code:

from transformers import pipeline

translator = pipeline("translation_en_to_de")
text = "Hello world! Hugging Face is the best NLP tool."
translation = translator(text)

print(translation)
Sample pipeline output | Source: Author

Pipelines offer an easy entry point into using Hugging Face, allowing you to create language models with pre-trained and fine-tuned transformers quickly. Hugging Face provides pipelines for key NLP tasks, as well as additional specialized pipelines for different applications.

Create a custom translation pipeline with Hugging Face

The default pipelines only support a few basic scenarios. What if you want to translate to a different language? Here’s how you can set up a translation pipeline for any language pair supported by Hugging Face’s model hub:

1. Import and Initialize the tokenizer

Transformer models process tokenized text, breaking sentences into numbers for the model to interpret.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl")

2. Import the model

Download and initialize the model for the translation task, which contains the necessary transformer layers for sequence-to-sequence learning:

model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-nl")

3. Tokenize and encode the text

Use the tokenizer to convert text into tokens, which includes:

  • Splitting the text into sub-words
  • Mapping each token to a unique integer

The output includes:

  • attention_mask: indicates whether each token is relevant (1) or ignored (0)
  • input_ids: integer IDs representing each token

text = "Hello my friends! How are you doing today?"
tokenized_text = tokenizer(text, return_tensors="pt")
print(tokenized_text)

The output:

{'input_ids': tensor([[ 147, 2105,  121, 2108,   54,  457,   56,   23,  728, 1042,   17,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

4. Translate and decode

Feed the tokenized text to the model, then decode the output to get the translated text.

translation = model.generate(**tokenized_text)
translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
print(translated_text)

As we can see, beyond the simple pipelines that only support English-German, English-French, and English-Romanian translations, we can create a translation pipeline for any pre-trained Seq2Seq model within Hugging Face. Let's see which transformer models support translation tasks.

Language transformer models

Transformers have become essential tools in NLP due to their use of attention mechanisms, which allow models to assign importance to different parts of the input data. Hugging Face offers a variety of transformer models, many based on this architecture, including popular translation and text-to-text models. 

Here, we’ll explore some widely used language translation models, particularly the multilingual BART (mBART) model, which provides robust support for translation tasks.

Overview of mBART

mBART is a sequence-to-sequence denoising auto-encoder model, adapted from BART and trained on large, monolingual text corpora in multiple languages. While BART is designed to reconstruct a corrupted document by mapping it back to its original form, mBART extends this functionality to multiple languages, making it highly effective for translation tasks.

The mBART input encoder and output decoder | Source

The input encoder in BART allows for different document transformations, such as token masking, deletion, infilling (filling in blanks within the text), permutation, and error correction. These transformations prepare the model to handle corrupted or incomplete input data and still generate meaningful output. This versatility enables BART to support a range of downstream NLP tasks, including sequence classification, token classification, sequence generation, and machine translation, making it a powerful tool for diverse applications in language processing.

Transformations for noising the input | Source

The mBART model was designed for multilingual denoising pre-training in neural machine translation. Unlike earlier models that focused only on the encoder, decoder, or partial text transformations, mBART introduced the capability to denoise entire texts across multiple languages, making it a significant advance in multilingual text generation.

mBART is a multilingual encoder-decoder (sequence-to-sequence) model specifically built for translation tasks. Being multilingual, it requires sequences to follow a specific format: a unique language ID token is added to both the source and target texts to specify the language, ensuring the model understands the translation context.

Illustration of mBART’s Multilingual Denoising Pre-Training and Fine-Tuning for Machine Translation. The left panel demonstrates pre-training using corrupted input sentences across multiple languages, where the transformer model learns to reconstruct the original text. The right panel shows fine-tuning for specific machine translation tasks, including sentence-level (Sent-MT) and document-level (Doc-MT) translation, with the encoder-decoder architecture translating between English and Japanese | Source

The mBART model is trained once across all supported languages, providing a shared set of parameters that can be fine-tuned for both supervised (sentence- and document-level) and unsupervised machine translation without specific task- or language-specific modifications. Let’s see how it handles machine translation in these scenarios:

  • Sentence-level translation: mBART was evaluated on sentence-level translation to minimize representation differences between source and target sentences. It is pre-trained using bi-text (aligned bilingual sentence pairs that define translation relationships between languages) and enhanced with back translation. These techniques allow mBART to achieve significant performance gains compared to other models.
  • Document-level translation: For translating documents, mBART learns dependencies across sentences to handle entire paragraphs or documents. For training, sentences are separated by symbols, and each example ends with a language ID token. During translation, the model generates output until it encounters the language ID token, which signals the end of the document.
  • Unsupervised translation: mBART also supports unsupervised translation when bi-text is not available. In this case, the model uses back translation to generate synthetic data for training or, if a target language has related data in other language pairs, it applies language transfer to learn from these relationships and improve translation quality.

With its unique framework, mBART doesn’t require parallel data across all languages. Instead, it uses directional training data and shared representation across languages. This feature improves scalability, even for languages with limited resources or scarce domain-specific data.

The T5 model

The T5 model was introduced in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In this study, researchers tested the effectiveness of transfer learning by designing a unified framework that transforms all language tasks into a text-to-text format. Built on an encoder-decoder architecture, T5 takes text as input and generates new text as output.

Diagram of text-to-text framework with the T5 model | Source

How T5 works 

T5 follows the same foundational principles as the original transformer architecture:

  • Encoder: The input text is tokenized and embedded, then processed by blocks consisting of a self-attention layer and a feed-forward network.
  • Decoder: The decoder has a similar structure but includes a standard attention layer after each self-attention layer to focus on the encoder’s output. It also uses autoregressive (causal) self-attention, which lets it consider past outputs for generating predictions.

T5 was trained on unlabeled text data using a cleaned version of Common Crawl, known as the Colossal Clean Crawled Corpus (C4). By leveraging this extensive dataset and the text-to-text transformer framework, T5 has shown broad applicability.

T5 can handle many language tasks by adding specific prefixes to the input sequence (see the sketch after this list). For instance:

  • For translation: “Translate English to French”
  • For summarization: “Summarize”
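
Here is a minimal sketch of this prefix mechanism, assuming the t5-small checkpoint; the input sentence is illustrative:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is selected purely through the text prefix
inputs = tokenizer("translate English to German: The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))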

The MarianMT model

The MarianMT model is built on an encoder-decoder architecture and was originally trained by Jörg Tiedemann using the Marian library. Marian is a Neural Machine Translation framework written in C++, designed for efficiency and simplicity, with features that support high-speed training.

Some NLP problems solved with the Marian toolkit include:

  • Automatic Post-Editing: Using dual-attention over two encoders, Marian helps recover missing words in raw machine translation output.
  • Grammatical Error Correction (GEC): In GEC, Marian uses low-resource neural translation to correct grammatical errors. It can induce noise in the source text, specify weighted objectives, and incorporate transfer learning with pre-trained models.

With the flexibility of Marian’s framework, MarianMT was developed to streamline translation efforts. MarianMT was trained on the Open Parallel Corpus (OPUS), a vast collection of parallel texts from the web.

Hugging Face offers around 1,300 MarianMT models for different language pairs, all named following the format Helsinki-NLP/opus-mt-{src}-{target}, where src and target are language codes. Each model is approximately 298 MB, making them relatively lightweight and ideal for experimentation, fine-tuning, and integration in various applications. For new multilingual models, Marian uses three-character language codes.

Create your machine learning translator using Hugging Face

To build a translator from English to German, we’ll use pre-trained models from Hugging Face and fine-tune them on a relevant dataset. We’ll use models such as T5, MarianMT, and mBART.

1. Load the dataset

First, we’ll load an English-to-German dataset using Hugging Face’s datasets library.

from datasets import load_dataset
raw_datasets = load_dataset("wmt16", "de-en")

Here’s a sample of our dataset:

Create your own machine learning translator & fine tune them - log the data set

You can see the data is already split into training, validation, and test sets. The training set contains a large amount of data, so model training and fine-tuning will take time.

2. Pre-process the dataset

Next, we need to tokenize the dataset so the model can process it.

from transformers import AutoTokenizer, MBart50TokenizerFast

# Define models and tokenizers for MarianMT, mBART, and T5
model_marianMT = "Helsinki-NLP/opus-mt-en-de"
tokenizer_marianMT = AutoTokenizer.from_pretrained(model_marianMT, use_fast=False)

model_mbart = "facebook/mbart-large-50-one-to-many-mmt"
tokenizer_mbart = MBart50TokenizerFast.from_pretrained(model_mbart, src_lang="en_XX", tgt_lang="de_DE")

model_t5 = "t5-small"
tokenizer_t5 = AutoTokenizer.from_pretrained(model_t5, use_fast=False)

# Pick the tokenizer matching the model you want to fine-tune
tokenizer = tokenizer_marianMT

# Define parameters for tokenization
prefix = ""  # For T5 only, set this to "translate English to German: "
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "de"

# Preprocessing function
def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

3. Create a data subset

To speed up training, let’s create smaller subsets for training and validation.

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

4. Fine-tune the model

Now, we’ll load the models and configure the training.

from transformers import AutoModelForSeq2SeqLM, MBartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

# Load models
model_marianMT = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model_mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
model_t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Pick the model matching the tokenizer selected earlier
model = model_marianMT

# Set training parameters
# Adjust the 'per_device_train_batch_size' parameter based on your available GPU memory
training_args = Seq2SeqTrainingArguments(
    "translator-finetuned-en-de",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
)

# Data collator for padding inputs and labels
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Metrics function
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    preds = [pred.strip() for pred in tokenizer.batch_decode(preds, skip_special_tokens=True)]
    labels = [[label.strip()] for label in tokenizer.batch_decode(labels, skip_special_tokens=True)]
    result = metric.compute(predictions=preds, references=labels)
    meteor_result = meteor.compute(predictions=preds, references=labels)
    return {"bleu": result["score"], "meteor": meteor_result["meteor"]}
    
# Initialize Trainer
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)

# Train and save the model
trainer.train()
trainer.save_model("fine-tuned-translator")

This approach allows you to create and fine-tune a translator model for English-to-German translation. You can also upload the model to the Hugging Face hub for sharing.

Evaluating and choosing the best translation model

After training and fine-tuning our models, we need to evaluate their performance. We’ll compare the translations generated by each fine-tuned model against the pre-trained versions and Google Translate.

1. Pre-trained model vs. fine-tuned vs. Google Translator

In the previous section, we saved our fine-tuned models in a local directory. We'll now test those models and compare their translations with the pre-trained versions and Google Translate. 

Here’s how you can load your fine-tuned models from local storage and generate translations:

MarianMT model

from transformers import MarianMTModel, MarianTokenizer

model_name = 'opus-mt-en-de-finetuned-en-to-de'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = ["USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982."]
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

mBART50 model

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = 'mbart-large-50-one-to-many-mmt-finetuned-en-to-de'
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

src_text = ["USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982."]
model_inputs = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
translation

T5 model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 't5-small-finetuned-en-to-de'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

src_text = ["USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982."]
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
[tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Let’s compare the translated text for the MarianMT, mBART, and T5 models:  

Input Text: USA Today is an American daily middle-market newspaper that is the flagship publication of its owner, Gannett. Founded by Al Neuharth on September 15, 1982.
Fine-Tuned MarianMT: USA Today ist eine amerikanische Tageszeitung im mittleren Markt, die das Flaggschiff ihrer Eigentümerin Gannett ist. Gegründet von Al Neuharth am 15. September 1982.
Pre-trained mBART: USA Today ist eine amerikanische Tageszeitung für den mittleren Markt, die die Flaggschiffpublikation ihres Besitzers Gannett ist. Gegründet von Al Neuharth am 15. September 1982.
Google Translate: USA Today ist eine amerikanische Tageszeitung für den Mittelstand, die das Flaggschiff ihres Eigentümers Gannett ist. Gegründet von Al Neuharth am 15. September 1982.

The fine-tuned MarianMT model produced more accurate translations than its pre-trained version and was close to the quality of Google Translate. However, some minor grammatical errors persisted across all models.

The pre-trained mBART model captured additional word nuances compared to MarianMT and Google Translate, but the translation quality was similar overall. However, mBART’s fine-tuning was computationally intensive and yielded results nearly identical to the pre-trained version.

The T5 model performed the worst, often failing to translate the full paragraph. Both its pre-trained and fine-tuned versions struggled with accuracy, making it less suitable for this translation task than the other models.

2. Evaluation metrics: comparing the models

We’ll use the following evaluation metrics to assess the quality and accuracy of our translations:

BLEU (bilingual evaluation understudy)

BLEU measures how closely machine translations match human translations. It compares machine-generated text to professional translations, with higher scores indicating closer alignment with human quality. This metric is widely used and highly correlates with human judgment at a low cost.

Meteor

METEOR is another automatic metric for machine translation. It evaluates translations by comparing individual unigrams (words) in the machine output with reference human translations, offering a more nuanced match score.
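
As a standalone sketch of computing both metrics with the evaluate library (the prediction and reference strings below are made up for illustration):

import evaluate

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")

preds = ["USA Today ist eine amerikanische Tageszeitung."]
refs = ["USA Today ist eine amerikanische Tageszeitung im mittleren Markt."]

# sacrebleu expects a list of reference lists, one per prediction
print(bleu.compute(predictions=preds, references=[[r] for r in refs])["score"])
print(meteor.compute(predictions=preds, references=refs)["meteor"])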

Our compute_metrics() function includes these metrics, so you’ll see the results as the code runs. However, tracking them over time in a clearer, user-friendly format can improve readability. I recommend using neptune.ai, which automatically logs and displays these metrics for easier analysis.

3. Track your model data: parameters, training loss, CPU usage, metrics, and more

Neptune offers a powerful user interface for tracking model training metrics, CPU and RAM usage, and performance metrics, simplifying model management.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Set up your Neptune account

  1. Create an account
  2. Retrieve your API key, and 
  3. Start a new Neptune project.

Add tracking code in your notebook

Add the below code in your notebook to create an experiment on the Neptune platform:

!pip install neptune
import neptune

run = neptune.init_run(
    project="YOUR_WORKSPACE/YOUR_PROJECT",
    api_token="YOUR_NEPTUNE_API_TOKEN",
)

Log evaluation metrics

After training, log metrics to Neptune using run["metric_name"].append(value):

evaluate_results = trainer.evaluate()
run["epoch"].append(evaluate_results["epoch"])
run["bleu"].append(evaluate_results["bleu"])
run["meteor"].append(evaluate_results["meteor"])

View metrics in the Neptune UI

Let’s see what it looks like in the Neptune UI:

See in the app
Standard view of the project and experiments in the Neptune UI

In the UI, you can see BLEU and METEOR scores for all the pre-trained models. These metrics suggest that even after fine-tuning, T5 could not predict accurately. 

💡Neptune has released a guide about Neptune’s integration with HuggingFace Transformers, so you can now use it to start tracking even quicker.

Additional Neptune tracking features

Neptune’s HuggingFace integration makes tracking setup faster with report_to="neptune" in Seq2SeqTrainingArguments. You can also view detailed CPU and RAM usage, model logs, and metadata for each experiment, as shown below:

See in the app
Model metadata, logs, and monitoring in the Neptune UI

Compare models side-by-side

Neptune’s interface allows easy side-by-side comparisons, offering insights beyond BLEU and METEOR scores:

See in the app
Side-by-side model comparison from the Neptune UI

Based on this analysis, MarianMT and mBART models (pre-trained and fine-tuned) outperform T5, with mBART showing slightly better performance, likely due to recognizing more words in the input.

Final thoughts 

In this article, we explored how Hugging Face simplifies the integration of NLP tasks by providing intuitive APIs and pre-trained models for different use cases. With this tool, we can ease the creation of specialized pipelines using pre-trained models for custom needs.

Focusing on language translation, we examined two popular models—MarianMT and mBART. We also walked through training and fine-tuning these models on new data to enhance translation quality. While mBART showed slightly higher resource usage than MarianMT and T5, its results were comparable. 

Hugging Face offers extensive tutorials to support the learning and fine-tuning of these models, and its models’ hub offers a variety of additional multilingual transformers for NLP translation tasks, including XLM, BERT, and T5. These models continue to advance language translation towards human-like accuracy.

Building a Search Engine With Pre-Trained Transformers: A Step By Step Guide https://neptune.ai/blog/building-search-engine-with-pre-trained-transformers-guide Fri, 22 Jul 2022 11:21:13 +0000 https://neptune.test/building-search-engine-with-pre-trained-transformers-guide/ We all use search engines. We search for information about the best item to purchase, a nice place to hang out, or answers to anything we want to know.

We also rely heavily on search to check emails, documents, and financial transactions. Many of these search interactions happen through text or through speech converted to text. This means a lot of language processing happens in search engines, so NLP plays an important role in modern search.

Let's take a quick look at what happens when we search. When you submit a query, the search engine returns a ranked list of documents that match it. For this to happen, an "index" of the documents and the vocabulary used in them has to be constructed first and then used to search and rank results. One popular approach to indexing textual data and ranking search results is TF-IDF.
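As a minimal sketch of TF-IDF indexing and ranking (the documents and query below are toy examples, not part of the original setup), using scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs chase cats in the park",
        "stock markets fell sharply today"]

# Build the TF-IDF index over the document collection.
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(docs)

# Score every document against the query and rank them.
query_vec = vectorizer.transform(["cat on a mat"])
scores = cosine_similarity(query_vec, index)[0]
print(scores.argsort()[::-1])  # document indices, best match first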

Recent developments in deep learning models for NLP can be used for this as well. For example, Google started ranking search results and showing snippets using the BERT model and claims this has improved the quality and relevance of search results.

There are 2 types of search engines:

  • Generic search engines, such as Google and Bing, that crawl the web and aim to cover as much as possible by constantly looking for new webpages.
  • Enterprise search engines, where our search space is restricted to a smaller set of already existing documents within an organization.

The second form of search is the most common use case you will encounter at any workplace, as the diagram below illustrates.

Search engine diagram

You can use state-of-the-art sentence embeddings with transformers, and use them in downstream tasks for semantic textual similarity.

In this article, we’ll explore how to build a vector-based search engine.

Why would you need a vector-based search engine?

Keyword-based search engines struggle with:

  • Complex queries or words with multiple meanings.
  • Long search queries.
  • Users who aren't familiar with the keywords needed to retrieve the best results.

Vector-based search (also known as semantic search) solves these problems by computing a numerical representation of text queries using SOTA language models, indexing these vectors in a high-dimensional vector space, and measuring how similar a query vector is to the indexed documents.

Let's see what the pre-trained models have to offer:

  • They produce high-quality embeddings, as they were trained on large amounts of text data.
  • They don't force you to create a custom tokenizer, as transformers come with their own.
  • They're simple and handy to fine-tune on your downstream task.

These models produce a fixed size vector for each token in the document.

Now, let’s see how we can use a pre-trained BERT model to build a feature extractor for search engines.

Step 1: Load the pre-trained model

!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip
!pip install bert-serving-server --no-deps

For this implementation, I'll be using BERT uncased. There are other variations of BERT available. bert-as-a-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, letting you map sentences into fixed-length representations with just two lines of code. This is useful if you want to avoid the additional latency and potential failure modes introduced by a client-server architecture.

Step 2: Optimizing the inference graph

To modify the model graph, we need some low level Tensorflow programming. Since we’re using bert-as-a-service, we can configure the inference graph using a simple CLI interface. 

(The version of tensorflow used for this implementation was tensorflow==1.15.2)

import os
import tensorflow as tf
import tensorflow.compat.v1 as tfc


sess = tfc.InteractiveSession()

from bert_serving.server.graph import optimize_graph
from bert_serving.server.helper import get_args_parser


# input dir
MODEL_DIR = '/content/uncased_L-12_H-768_A-12' #@param {type:"string"}
# output dir
GRAPH_DIR = '/content/graph/' #@param {type:"string"}
# output filename
GRAPH_OUT = 'extractor.pbtxt' #@param {type:"string"}

POOL_STRAT = 'REDUCE_MEAN' #@param ['REDUCE_MEAN', 'REDUCE_MAX', "NONE"]
POOL_LAYER = '-2' #@param {type:"string"}
SEQ_LEN = '256' #@param {type:"string"}


tf.io.gfile.mkdir(GRAPH_DIR)


carg = get_args_parser().parse_args(args=['-model_dir', MODEL_DIR,
                              '-graph_tmp_dir', GRAPH_DIR,
                              '-max_seq_len', str(SEQ_LEN),
                              '-pooling_layer', str(POOL_LAYER),
                              '-pooling_strategy', POOL_STRAT])

tmp_name, config = optimize_graph(carg)
graph_fout = os.path.join(GRAPH_DIR, GRAPH_OUT)

tf.gfile.Rename(
   tmp_name,
   graph_fout,
   overwrite=True
)
print("nSerialized graph to {}".format(graph_fout))

Take a look at a few parameters in the above snippet.

For each text sample, the BERT-base model encoding layer outputs a tensor of shape [sequence_len, encoder_dim], with one vector per input token. To get a fixed representation, we need to apply some sort of pooling.

The POOL_STRAT parameter defines the pooling strategy applied to the encoder layer specified by POOL_LAYER. The default value 'REDUCE_MEAN' averages the vectors of all tokens in the sequence. This particular strategy works best for most sentence-level tasks when the model is not fine-tuned. Another option is NONE, in which case no pooling is applied.
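As a minimal sketch of what the pooling step does (with random numbers standing in for real encoder outputs):

import numpy as np

# Toy encoder output: one vector per token, shape [sequence_len, encoder_dim].
token_vectors = np.random.rand(256, 768)

# REDUCE_MEAN: average over the token axis to get one fixed-size vector.
sentence_vector = token_vectors.mean(axis=0)   # shape: (768,)

# REDUCE_MAX would instead take the element-wise maximum.
sentence_vector_max = token_vectors.max(axis=0)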

SEQ_LEN caps the maximum length of sequences processed by the model. Smaller values increase model inference speed almost linearly.

Running the above code snippet will put the model graph and weights into a GraphDef object, which will be serialized to a pbtxt file at GRAPH_OUT. The file will often be smaller than the pre-trained model, because the nodes and the variables required for training will be removed.

Step 3: Creating feature extractor

Let’s use the serialized graph to build a feature extractor using tf.Estimator API. We need to define 2 things: input_fn and model_fn.

input_fn gets data into the model. This includes executing the whole text preprocessing pipeline and preparing a feed_dict for BERT.
Each text sample is converted into a tf.Example instance, with the necessary features listed in the INPUT_NAMES. The bert_tokenizer object contains the WordPiece vocabulary and performs text processing. After that, the examples are regrouped by feature names in feed_dict.

import logging
import numpy as np

from tensorflow.python.estimator.estimator import Estimator
from tensorflow.python.estimator.run_config import RunConfig
from tensorflow.python.estimator.model_fn import EstimatorSpec
from tensorflow.keras.utils import Progbar

from bert_serving.server.bert.tokenization import FullTokenizer
from bert_serving.server.bert.extract_features import convert_lst_to_features

log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)
log.handlers = []
GRAPH_PATH = "/content/graph/extractor.pbtxt" #@param {type:"string"}
VOCAB_PATH = "/content/uncased_L-12_H-768_A-12/vocab.txt" #@param {type:"string"}

SEQ_LEN = 256 #@param {type:"integer"}
INPUT_NAMES = ['input_ids', 'input_mask', 'input_type_ids']
bert_tokenizer = FullTokenizer(VOCAB_PATH)

def build_feed_dict(texts):

   text_features = list(convert_lst_to_features(
       texts, SEQ_LEN, SEQ_LEN,
       bert_tokenizer, log, False, False))

   target_shape = (len(texts), -1)

   feed_dict = {}
   for iname in INPUT_NAMES:
       features_i = np.array([getattr(f, iname) for f in text_features])
       features_i = features_i.reshape(target_shape).astype("int32")
       feed_dict[iname] = features_i

   return feed_dict

tf.Estimators have a quirk: they rebuild and reinitialize the whole computational graph at each call to the predict function.

So, to avoid this overhead, we'll pass a generator to the predict function, and the generator will yield the features to the model in a never-ending loop.

def build_input_fn(container):

   def gen():
       while True:
         try:
           yield build_feed_dict(container.get())
         except:
           yield build_feed_dict(container.get())

   def input_fn():
       return tf.data.Dataset.from_generator(
           gen,
           output_types={iname: tf.int32 for iname in INPUT_NAMES},
           output_shapes={iname: (None, None) for iname in INPUT_NAMES})
   return input_fn

class DataContainer:
  def __init__(self):
    self._texts = None

  def set(self, texts):
    if type(texts) is str:
      texts = [texts]
    self._texts = texts

  def get(self):
    return self._texts

The model_fn contains the specification of the model. In our case, it’s loaded from the pbtxt file we saved in the previous step. The features are mapped explicitly to the corresponding input nodes via input_map.

def model_fn(features, mode):
   with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
       graph_def = tf.GraphDef()
       graph_def.ParseFromString(f.read())

   output = tf.import_graph_def(graph_def,
                                input_map={k + ':0': features[k] for k in INPUT_NAMES},
                                return_elements=['final_encodes:0'])

   return EstimatorSpec(mode=mode, predictions={'output': output[0]})

estimator = Estimator(model_fn=model_fn)

Now that we have things in place, we need to perform inference.

def batch(iterable, n=1):
   l = len(iterable)
   for ndx in range(0, l, n):
       yield iterable[ndx:min(ndx + n, l)]

def build_vectorizer(_estimator, _input_fn_builder, batch_size=128):
  container = DataContainer()
  predict_fn = _estimator.predict(_input_fn_builder(container), yield_single_examples=False)

  def vectorize(text, verbose=False):
    x = []
    bar = Progbar(len(text))
    for text_batch in batch(text, batch_size):
      container.set(text_batch)
      x.append(next(predict_fn)['output'])
      if verbose:
        bar.add(len(text_batch))

    return np.vstack(x)

  return vectorize

bert_vectorizer = build_vectorizer(estimator, build_input_fn)
bert_vectorizer(64*['sample text']).shape
# output: (64, 768)

Step 4: Exploring vector space with projector

Using the vectorizer, we will generate embeddings for articles from the Reuters-21578 benchmark corpus.

To explore and visualize the embedding vector space in 3D, we will use a dimensionality reduction technique called t-SNE.

First let’s get the article embeddings.

from nltk.corpus import reuters

import nltk
nltk.download("reuters")
nltk.download("punkt")

max_samples = 256
categories = ['wheat', 'tea', 'strategic-metal',
             'housing', 'money-supply', 'fuel']

S, X, Y = [], [], []

for category in categories:
  print(category)
  sents = reuters.sents(categories=category)
  sents = [' '.join(sent) for sent in sents][:max_samples]
  X.append(bert_vectorizer(sents, verbose=True))
  Y += [category] * len(sents)
  S += sents

# Stack the per-category embeddings into a single matrix (outside the loop).
X = np.vstack(X)
X.shape

After running the above code, if you face an issue in Colab that says: "Resource reuters not found. Please use the NLTK downloader to obtain the resource."

…then run the following command, where the relative path after -d will give the location where the file will be unzipped:

!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora

The interactive visualizations of the generated embeddings are available on the Embedding Projector.

From the link, you can run t-SNE yourself, or load a checkpoint using the bookmark in the lower right corner (loading works on Chrome).

To reproduce the input files used for this visualization, run the code snippet below. Then download the files to your machine and upload to Projector.

with open("embeddings.tsv", "w") as fo:
 for x in X.astype('float'):
   line = "t".join([str(v) for v in x])
   fo.write(line+'n')

with open('metadata.tsv', 'w') as fo:
 fo.write("LabeltSentencen")
 for y, s in zip(Y, S):
   fo.write("{}t{}n".format(y, s))

Here’s what I captured using the Projector.

from IPython.display import HTML

HTML("""
<video width="900" height="632" controls>
 <source src="https://storage.googleapis.com/bert_resourses/reuters_tsne_hd.mp4" type="video/mp4">
</video>
""")

Building a supervised model with the generated features is straightforward:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Xtr, Xts, Ytr, Yts = train_test_split(X, Y, random_state=34)

mlp = LogisticRegression()
mlp.fit(Xtr, Ytr)

print(classification_report(Yts, mlp.predict(Xts)))
 
                  Precision   Recall   F1-Score   Support
Fuel                   0.75     0.81       0.78        26
Housing                0.73     0.75       0.74        32
Money-supply           0.84     0.88       0.86        75
Strategic-metal        0.88     0.90       0.89        48
Tea                    0.85     0.80       0.82        44
Wheat                  0.94     0.86       0.90        59
Accuracy                                   0.85       284
Macro avg              0.83     0.83       0.83       284
Weighted avg           0.85     0.85       0.85       284

Step 5: Building a search engine 

Let's say we have a knowledge base of 50,000 text samples, and we need to quickly answer queries based on this data. How can we retrieve the result most similar to a query from a text database? One answer is nearest neighbour search.

The search problem we’re solving here can be defined as follows:

Given a set of points S in a vector space M and a query point Q ∈ M, find the closest point in S to Q. There are multiple ways to define 'closest' in a vector space; we'll use Euclidean distance.

To build a search engine for text, we’ll follow these steps:

  1. Vectorize all samples from the knowledge base – that gives S.
  2. Vectorize the query – that gives Q.
  3. Compute euclidean distance D between Q and S.
  4. Sort D in ascending order, providing indices of the most similar samples.
  5. Retrieve labels for said samples from the knowledge base.
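Before wiring this up in TensorFlow, here is the same procedure as a minimal NumPy sketch (reusing the embedding matrix X from the previous step; the helper name is made up for illustration):

import numpy as np

def nearest(query_vec, kbase, top_k=10):
    # Step 3: Euclidean distance between the query and every sample.
    dists = np.sqrt(((kbase - query_vec) ** 2).sum(axis=1))
    # Step 4: indices of the most similar samples, closest first.
    return np.argsort(dists)[:top_k]

print(nearest(X[0], X))  # the closest sample to X[0] should be X[0] itself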

We can create the placeholder for Q and S:

graph = tf.Graph()

sess = tf.InteractiveSession(graph=graph)

dim = X.shape[1]

Q = tf.placeholder("float", [dim])
S = tf.placeholder("float", [None, dim])

Define euclidean distance computation:

squared_distance = tf.reduce_sum(tf.pow(Q - S, 2), reduction_indices=1)
distance = tf.sqrt(squared_distance)

Get the most similar indices:

top_k = 10

top_neg_dists, top_indices = tf.math.top_k(tf.negative(distance), k=top_k)
top_dists = tf.negative(top_neg_dists)

# Sanity check: the TensorFlow result should match a scikit-learn baseline.
from sklearn.metrics.pairwise import euclidean_distances

top_indices.eval({Q: X[0], S: X})

np.argsort(euclidean_distances(X[:1], X)[0])[:10]

Step 6: Accelerating search with math

The trick is that the squared Euclidean distance expands into dot products: ||P - Q||² = ||P||² - 2·P·Q + ||Q||², so the whole computation can be expressed with matrix multiplications. In TensorFlow, this can be done as follows:

Q = tf.placeholder("float", [dim])
S = tf.placeholder("float", [None, dim])

Qr = tf.reshape(Q, (1, -1))

PP = tf.keras.backend.batch_dot(S, S, axes=1)
QQ = tf.matmul(Qr, tf.transpose(Qr))
PQ = tf.matmul(S, tf.transpose(Qr))

distance = PP - 2 * PQ + QQ
distance = tf.sqrt(tf.reshape(distance, (-1,)))

top_neg_dists, top_indices = tf.math.top_k(tf.negative(distance), k=top_k)

In the above formula, PP and QQ are the squared L2 norms of the respective vectors. If both vectors are L2-normalized, then:

PP = QQ = 1

However, L2 normalization discards information about vector magnitude, which in many cases you don't want to do.

Instead, notice that as long as the knowledge base stays the same, PP (its squared vector norms) also stays the same. So instead of recomputing it on every query, we can compute it once and reuse the precomputed result, further accelerating the distance computation.

Let’s bring this all together.

class L2Retriever:
  def __init__(self, dim, top_k=3, use_norm=False, use_gpu=True):
    self.dim = dim
    self.top_k = top_k
    self.use_norm = use_norm
    config = tf.ConfigProto(
        device_count={'GPU': (1 if use_gpu else 0)}
    )
    self.session = tf.Session(config=config)

    self.norm = None
    self.query = tf.placeholder("float", [self.dim])
    self.kbase = tf.placeholder("float", [None, self.dim])

    self.build_graph()

  def build_graph(self):
    if self.use_norm:
      self.norm = tf.placeholder("float", [None, 1])

    distance = dot_l2_distances(self.kbase, self.query, self.norm)
    top_neg_dists, top_indices = tf.math.top_k(tf.negative(distance), k=self.top_k)
    top_dists = tf.negative(top_neg_dists)

    self.top_distances = top_dists
    self.top_indices = top_indices

  def predict(self, kbase, query, norm=None):
    query = np.squeeze(query)
    feed_dict = {self.query: query, self.kbase: kbase}
    if self.use_norm:
      feed_dict[self.norm] = norm

    I, D = self.session.run([self.top_indices, self.top_distances],
                            feed_dict=feed_dict)
    return I, D


def dot_l2_distances(kbase, query, norm=None):
  # Reshape the query to a 1 x dim row vector.
  query = tf.reshape(query, (1, -1))

  if norm is None:
    XX = tf.keras.backend.batch_dot(kbase, kbase, axes=1)
  else:
    XX = norm
  YY = tf.matmul(query, tf.transpose(query))
  XY = tf.matmul(kbase, tf.transpose(query))

  distance = XX - 2 * XY + YY
  distance = tf.sqrt(tf.reshape(distance, (-1,)))

  return distance

We can use this implementation with any vectorizer model, not just BERT. It's quite effective at nearest neighbour retrieval, able to process dozens of requests per second on a 2-core Colab CPU.
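A quick usage sketch (assuming the Reuters embeddings X from step 4 are still in memory):

# Build the retriever on the CPU and query it with the first embedding.
retriever = L2Retriever(dim=X.shape[1], top_k=3, use_norm=False, use_gpu=False)

indices, distances = retriever.predict(kbase=X, query=X[:1])
print(indices)    # indices of the 3 nearest samples; the first should be 0 itself
print(distances)  # their Euclidean distances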

There are some extra aspects you need to consider when building machine learning applications.

How do you ensure the scalability of your solution?

  • Pick the right framework/language.
  • Use the right processors.
  • Collect and warehouse data.
  • Build the input pipeline.
  • Train the model.
  • Use distributed systems.
  • Apply other optimizations.
  • Monitor resource utilization.
  • Deploy.

How do you train, test, and deploy your model to production?

  • Create a notebook instance that you can use to download and process your data.
  • Prepare and preprocess the data you need to train your ML model, then upload it (e.g., to Amazon S3).
  • Use your training dataset to train your machine learning model.
  • Deploy the model to an endpoint, reformat and load the CSV data, then run the model to create predictions.
  • Evaluate the performance and accuracy of the ML model.

Side note – make ML easier with experiment tracking 

One tool that can take care of all your experiment tracking needs is neptune.ai.

Neptune is an experiment tracker designed with a strong focus on collaboration and scalability. It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye. The tool is known for its user-friendly interface and flexibility, enabling teams to adopt it into their existing workflows with minimal disruption.

Neptune gives users a lot of freedom when defining data structures and tracking metadata. Data scientists and ML/AI researchers can log, store, organize, display, compare, and query all their model-building metadata in a single place. It handles data such as model metrics and parameters, model checkpoints, images, videos, audio files, dataset versions, and visualizations. 

All metadata in a single place with an experiment tracker (example in neptune.ai)

Conclusion

The main area of exploration for search with BERT is similarity. Similarity for documents, for recommendations, and similarity between queries and documents for returning and ranking search results. 

If you can use similarity to solve this problem with highly accurate results, then you have a pretty great search for your product or application. 

I hope you learned something new here. Thanks for reading. Keep learning.

Building Deep Learning-Based OCR Model: Lessons Learned https://neptune.ai/blog/building-deep-learning-based-ocr-model Fri, 22 Jul 2022 06:53:58 +0000 https://neptune.test/building-deep-learning-based-ocr-model/ Deep learning solutions have taken the world by storm, and all kinds of organizations like tech giants, well-grown companies, and startups are now trying to incorporate deep learning (DL) and machine learning (ML) somehow in their current workflow. One of these important solutions that have gained quite a popularity over the past few years is the OCR engine.

OCR (Optical Character Recognition) is a technique for reading textual information directly from digital and scanned documents without any human intervention. These documents could be in any format, like PDF, PNG, JPEG, TIFF, etc. There are many advantages to using OCR systems:

  1. It increases productivity, as it takes very little time to process (extract information from) documents.
  2. It saves resources: you just need an OCR program, and no manual work is required.
  3. It eliminates the need for manual data entry.
  4. The chances of error are lower.

Extracting information from digital documents is relatively easy, as they have metadata that can give you the text information. But scanned copies require a different solution, because metadata doesn't help there. This is where deep learning comes in, providing solutions for extracting text information from images.

In this article, you will learn lessons from building a deep learning-based OCR model, so that when you work on a similar use case, you can avoid the issues I faced during development and deployment.

What is deep learning-based OCR?

OCR has become very popular and has been adopted by several industries for faster reading of text data from images. While solutions like contour detection, image classification, and connected component analysis work for documents with comparable text size and font, ideal lighting conditions, and good image quality, such methods are not effective for irregular, heterogeneous text, often called wild text or scene text. This text could come from a car's license plate, a house number plate, poorly scanned documents (with no predefined conditions), etc. For these, deep learning solutions are used. Using DL for OCR is a three-step process:

  1. Preprocessing: OCR is not an easy problem, at least not as easy as we think it to be. Extracting text data from digital images/documents is still fine, but things change with scanned or phone-captured images. Real-world images are not always captured in ideal conditions; they can have noise, blur, skew, etc., which needs to be handled before applying DL models. For this reason, image preprocessing is required.
  2. Text detection/localization: At this stage, models like Mask-RCNN, the EAST text detector, YoloV5, SSD, etc. are used to locate the text in images. These models usually create bounding boxes (square/rectangular boxes) over each piece of text identified in the image or document.
  3. Text recognition: Once the text locations are identified, each bounding box is sent to the text recognition model, which is usually a combination of RNNs, CNNs, and attention networks. The final output of these models is the text extracted from the documents. Open-source text recognition models like Tesseract, MMOCR, etc. can help you achieve good accuracy.
Deep Learning based OCR Model
Deep learning based OCR model | Source: Author

To explain the effectiveness of OCR models, let’s have a look at a few of the segments where OCR is applied nowadays to increase the productivity and efficiency of the systems:

  • OCR in Banking: Automating the customer verification, check deposits, etc. processes using OCR-based text extraction and verification.
  • OCR in Insurance: Extracting the text information from a variety of documents in the insurance domain. 
  • OCR in Healthcare: Processing the documents such as a patient’s history, x-ray report, diagnostics report, etc. can be a tough task that OCR makes easy for you.

These are just a few of the examples where OCR is applied, to know more about its use cases you can refer to the following link.

Lessons from building a deep learning-based OCR model 

Now that you are aware of what OCR is and what makes it an important concept in the current times, it’s time to discuss some of the challenges that you may face while working on it. I have been part of several OCR-based projects that were related to the finance (insurance) sector. To name a few:

  • I have worked on a KYC verification OCR project where information from different identification documents needed to be extracted and validated against each other to verify a customer profile. 
  • I have also worked on insurance documents OCR where information from different documents needed to be extracted and used for several other purposes like user profile creation, user verification, etc.

One thing I have learned while working on these OCR use cases is that you don't have to fail every time to learn; you can learn from others' mistakes as well. There were several stages where I faced challenges while working in a team on these financial DL-based OCR projects. Let's discuss those challenges stage by stage, following the ML pipeline.

Data collection 

Problem

This is the first and most important stage of any ML or DL use case. OCR solutions are mostly adopted by financial organizations like banks, insurance companies, and brokerage firms, as these organizations have a lot of documents that are hard to process manually. But because they are financial organizations, there are government rules and regulations they must follow.

For this reason, if you are working on a POC (Proof of Concept) for a financial firm, chances are they won't share a whole lot of data for you to train your text detection and recognition models. Since deep learning solutions are all about data, you might end up with poorly performing models. This is, of course, down to regulatory compliance: sharing the data could breach users' privacy and cause customers financial and other kinds of loss.

Solution

Does this problem have a solution? Yes, it does. Let's say you want to work with some kind of form or ID card for text extraction. For forms, you could ask clients for empty templates and fill them with your own random data (time-consuming but efficient), and for ID cards, you can find a lot of samples on the internet to get started. You can also take just a few samples of these forms and ID cards and use image augmentation techniques to create new, similar images for your model training (a sketch follows the figure below).

Image augmentation for OCR
Image augmentation for OCR | Source 
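As a minimal sketch of such augmentations, here is one way to do it with the albumentations library (this wasn't the tooling used in the original projects, and the file names are placeholders):

import albumentations as A
import cv2

# A few photometric augmentations that leave the text geometry untouched.
transform = A.Compose([
    A.GaussNoise(p=0.5),
    A.MotionBlur(blur_limit=3, p=0.5),
    A.RandomBrightnessContrast(p=0.5),
])

image = cv2.imread("document_scan.png")   # placeholder input file
augmented = transform(image=image)["image"]
cv2.imwrite("document_scan_aug.png", augmented)

Because these transformations don't move any pixels around, the original bounding-box labels remain valid for the augmented images.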

Sometimes, when you want to start working on OCR use cases and don't have any organizational data, you can use one of the open-source datasets available online for OCR. You can check a list of the best datasets for OCR here.

Labeling the data (data annotation)

Problem

Now that you have your data and have created new samples using image augmentation techniques, the next thing on the list is data labeling. Data labeling is the process of creating bounding boxes around the objects you want your object detection model to find in images. In this case, our object is text, so you need to create bounding boxes over the text areas you want your model to identify. Creating these labels is a tedious but important task, and it's something you cannot skip.

Also, bounding boxes are only the most general kind of annotation; different use cases call for different annotation types. For example, when you want the most accurate coordinates of an object, square or rectangular bounding boxes won't do; you need polygonal (multi-point) annotations instead. For semantic segmentation use cases, where you want to separate an image into different regions, you need to assign a label to every pixel in the image. To know more about different types of annotations, you can refer to this link.

Solution

Is there any way to expedite the labeling process? Yes, there is. If you are using image augmentation techniques like adding noise, blur, brightness, or contrast changes, the image geometry doesn't change, so you can reuse the coordinates from the original image for the augmented images. If you rotate your images, make sure you rotate them in multiples of 90 degrees, so that you can rotate your annotations (labels) by the same angle and save yourself a lot of rework; see the sketch after the figure below. For this task, you can use the VGG Image Annotator or VoTT annotation tools.

VoTT Annotations
VoTT annotations | Source
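Here is a minimal sketch of remapping an axis-aligned box after a 90-degree clockwise rotation (the helper name and example values are made up for illustration):

def rotate_box_90cw(box, img_w, img_h):
    # Map an axis-aligned box (xmin, ymin, xmax, ymax) from the original
    # image onto the same image rotated 90 degrees clockwise.
    xmin, ymin, xmax, ymax = box
    # After a 90-degree CW rotation, a point (x, y) moves to (img_h - 1 - y, x).
    return (img_h - 1 - ymax, xmin, img_h - 1 - ymin, xmax)

# Example: a box in a 640x480 image.
print(rotate_box_90cw((100, 50, 200, 120), img_w=640, img_h=480))
# (359, 100, 429, 200)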

Sometimes, when you have a lot of data to annotate, you can outsource it; there are a lot of companies that provide annotation services. You simply explain the type of annotation you want, and the annotation team does it for you.

Model architecture and training infrastructure 

Problem

One thing you must ensure is that you have the hardware for training your models. Training object detection models requires a decent amount of RAM and a GPU unit (some models can train on a CPU as well, but training will be super slow).

Another consideration is that, over the years, many different object detection models have been introduced in the field of computer vision. Choosing one that works best for your use case (text detection and recognition) and also runs fine on your GPU/CPU machine can be difficult.

Solution

For the first part, if you have a GPU-based system then there is no need to worry as you can easily train your model. But, if you are using a CPU, training the whole model at once can take a lot of time. In that case, transfer learning can be the way to go as it doesn’t involve training models from scratch. 

Each newly introduced computer vision model either has a whole new architecture or improves on the performance of existing models. For smaller, dense objects like text, YoloV5 is preferred for text detection over others because of its architectural benefits.

Yolov5 Architecture
Yolov5 Architecture | Source

If you want to segment an image into multiple regions (pixel-wise), Mask-RCNN is considered best. For text recognition, some of the widely used models are MMOCR, PaddleOCR, and CRNN.

Training

Problem

This is a crucial stage, where you train your DL-based text detection and recognition models. One thing we are all aware of is that training a deep learning model is something of a black box: you can try out different parameters to get the best results for your use case, but you won't know exactly what is going on underneath. You may need to try several deep learning models for text detection and recognition, which is hard with all the hyperparameters you need to take care of during training.

Solution

One thing I have learned here is that you should focus on a single model until you have tried everything: hyperparameter tuning, model architecture tuning, etc. Don't judge the performance of a model after trying only a few things.

Furthermore, I would advise you to train your model in parts. For example, if you want to train for 50 epochs, divide the training into three stages of 15, 15, and 20 epochs, and evaluate in between. This way, you get results at different stages and a sense of whether the model is performing well or badly. It's better than running all 50 epochs at once for a few days, only to finally find out the model doesn't work on your data at all.

Also, as already discussed above, transfer learning could be the key. You can train your model from scratch, but starting from an already-trained model and fine-tuning it on your data will usually get you to good accuracy faster.

Testing

Problem

Once you have your models ready, the next thing in the queue is testing their performance. Testing deep learning OCR models is quite easy, as you can see the results directly (the bounding boxes drawn on objects) or compare the extracted text with ground-truth data, unlike traditional machine learning use cases, where you need to interpret the results from numbers.

Nowadays, you can test DL models manually or try one of the available automated testing services. The manual process takes time, as you have to check every image yourself to judge the models' performance. If you are working on financial use cases, you might be restricted to manual testing, since you can't share the data with online automated testing services.

Solution

One major piece of advice I would give here: never test your models on the training dataset, as it won't show your model's real performance. You need to create three different datasets: train, validation, and test. The first two are used for training and run-time model assessment, while the test dataset shows you the real performance of the model.

The next step is to decide on the best metrics to assess the performance of your detection and recognition models. Since text detection is a type of object detection, mAP (mean average precision) is used to assess model performance. It compares the predicted bounding boxes with the ground-truth bounding boxes and returns a score: the higher the score, the better the performance.

mAP formula
mAP formula | Source

For the text recognition model, the widely used measure is CER (Character Error Rate). Each predicted character is compared with the ground truth: the lower the CER, the better the model performance. As a rule of thumb, you want less than 10% CER before the model can viably replace a manual process. To know more about CER and how to calculate it, you can check the following link.
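CER is essentially the character-level edit distance divided by the length of the ground truth. Here is a minimal sketch (a plain Levenshtein implementation; in practice you might use a library such as jiwer):

def cer(prediction, ground_truth):
    # Character Error Rate: (substitutions + deletions + insertions) / N,
    # where N is the number of characters in the ground truth.
    m, n = len(prediction), len(ground_truth)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == ground_truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / n

print(cer("lnvoice No. 1234", "Invoice No. 1234"))  # 1 error / 16 chars = 0.0625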

Deployment and monitoring 

Problem

Once you have your final models ready with decent accuracy you would have to deploy them somewhere to make them accessible to the target audience. This is one of the major steps where you might face some issues no matter where you are going to deploy it. Three important challenges that I have faced while deploying these models are:

  1. I was using the PyTorch library to implement the object detection model, and it does not let you use multithreading at inference time if the model wasn't set up for multithreading during training.
  2. The model can be quite large, as DL-based models tend to be, and it might take long to load at inference time.
  3. Deploying the model is not enough; you need to monitor it for a few months to know whether it is performing as expected or has further scope for improvement.

Solution

To resolve the first issue, be aware from the start that you'll have to train the model with PyTorch's multithreading support if you want it available at inference time. Another solution is to switch frameworks, i.e., look for the TensorFlow alternative to the torch model you want, as TensorFlow already supports multithreaded inference and is quite easy to work with.

For the second point, if you have a very large model that takes a long time to load for inference, you can convert it to an ONNX model; this can reduce the model size by about a third, with a slight impact on accuracy (a conversion sketch follows).
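As a minimal sketch of the conversion (the tiny Sequential model below is just a stand-in for your trained detector, and the file name and input shape are placeholders):

import torch

# Stand-in for your trained model; replace with your own network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, kernel_size=3))
model.eval()

dummy_input = torch.randn(1, 3, 640, 640)  # match your model's expected input
torch.onnx.export(
    model,
    dummy_input,
    "ocr_model.onnx",
    input_names=["image"],
    output_names=["predictions"],
    dynamic_axes={"image": {0: "batch"}},  # allow variable batch size
)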

Model monitoring can be done manually but it requires some engineering resources to look for the cases that are failing with your OCR model. Instead, you can use different ML model monitoring solutions that work in an automated way.

Aside

If you want a model monitoring solution for your experimentation and training processes, and a metadata store for your ML/AI workflow, you should check out neptune.ai.

Here's an example of how Neptune helped the ML team at Brainly optimize the monitoring and debugging of their ML processes.

Neptune gives us really good insight into simple data processing jobs that are not even training. We can, for example, monitor the usage of resources and know whether we are using all cores of the machines. And it’s quick – two lines of code, and we have much better visibility.

Hubert Bryłkowski, Senior ML Engineer at Brainly


Conclusion

After reading this article, you know what deep learning-based OCR is and what its use cases are, and you've seen lessons drawn from scenarios I encountered while working on OCR projects. OCR technology is taking over manual data entry and document processing work, so this might be the right time to get hands-on with it, so you don't feel left out in the DL world. While working on these types of use cases, remember that you won't get a good model in one go; you need to try different things and learn from every step.

Creating a solution from scratch might not be the best approach, as you often won't have a whole lot of data when working on different use cases, so transfer learning and fine-tuning different models on your data can help you achieve good accuracy. The motive of this article was to walk you through the issues I faced while working on OCR use cases so that you don't have to face them in your own work. New issues will still come up as technologies and libraries change, but keep looking for different solutions to get the work done.

MLOps Tools for NLP Projects https://neptune.ai/blog/mlops-tools-for-nlp-projects Fri, 22 Jul 2022 06:16:42 +0000 https://neptune.test/mlops-tools-for-nlp-projects/ Machine learning chatbots, summarizing apps, Siri, Alexa – these are just a few cool Natural Language Processing (NLP) projects which are already adopted at mass scale. Have you ever wondered how they’re managed, continuously improved, and maintained? This is exactly the question that we’re going to answer in this article.

For example, Google’s autocorrect gets better every time, but not because they came up with a super good model that doesn’t need any maintenance. It gets better every time because there’s a pipeline, put in place early on for automating and improving the model by performing all ML tasks over and over again when it gets new data. It’s an example of MLOps at its finest.

In this article, I’ll tell you about various MLOps tools you can use for NLP projects. This includes cool open-source MLOps platforms, along with some code to help you get started. I’ll also do a comparison of all the tools, to help you navigate and choose the best tool for any framework you want to use.

Here are the assumptions I made when writing the article, just so we’re on the same page:

  • You understand what NLP is. You don't need to know much; the basics and some familiarity with the process are enough.
  • You're familiar with the process of building machine learning projects. Again, you don't need deep expertise; having built at least one machine learning project before is enough to know the terms I'll be using.
  • You're open-minded and ready to learn!

If you’re an MLOps expert, you can skip the introduction and go straight to the tools.

What is MLOps?

Data changes over time, which makes machine learning models stale. ML models learn patterns in data, but these patterns change as the trends and behaviors change.

We can’t prevent data from always changing, but we can keep our model updated with the new trends and changes. To do this, we need an automated pipeline. This automated process is known as MLOps.

MLOps is a set of practices for collaboration and communication between data scientists and operations professionals.

Please note that MLOps is not fully automated, at least not yet. You still have to do some things manually, but it’s incomparably easier compared to having no workflow at all.

How does MLOps work?

MLOps, or Machine Learning Operations, is different from DevOps. 

DevOps is a popular practice in developing and operating large-scale software systems. It rests on two core concepts in software system development: continuous integration (CI) and continuous delivery (CD).

A typical DevOps cycle is:

  • Code,
  • Test,
  • Deploy,
  • Monitor.

In ML projects, there are a lot of other processes like data collection and processing, feature engineering, training, and evaluating ML models, and DevOps can’t handle all of this. 

MLOps Lifecycle | Source

In MLOps, you have:

  • data coming into the system, which is usually the entry point,
  • code to preprocess the data and select useful features,
  • code to train the model and evaluate it,
  • code to test and validate it,
  • code to deploy,
  • and so on.

To deploy your model to production, you need to push it through a CI/CD pipeline. 

Once it’s in production:

  • You need to always check performance and make sure it’s reliable,
  • You need an automated alert or triggering system to inform you of issues and to make sure the changes fix the issues raised.
MLOps automated pipeline
MLOps automated pipeline | Source

Why do we need MLOps?

It doesn’t matter what kind of solutions you’re trying to deploy, MLOps is fundamental to the success of your project.

MLOps doesn't only help teams collaborate and integrate ML into their technology; it helps data scientists do what they do best: develop models. MLOps automates the retraining, testing, and deployment that data scientists used to do manually.

Machine learning helps deploy solutions that unlock previously untapped sources of revenue, save time, and reduce cost by creating more efficient workflows, leveraging data analytics for decision-making, and improving customer experience. These goals are hard to accomplish without a solid framework like MLOps to follow.

How to choose a good MLOps tool

Choosing a suitable MLOps tool for your NLP project depends on your project's needs, maturity, and scale of deployment. Your project must also be properly structured (Cookiecutter is a good project-structuring tool that can help you with that).

Manasi Vartak, founder and CEO of Verta, pointed out some criteria you should consider before selecting any MLOps tool:

  • It should be data scientist-friendly, not restricting your data science teams to work on specific tools and frameworks.
  • It should be easy to install, easy to set up, and easy to customize.
  • It should integrate freely with your existing platform.
  • It should be able to reproduce results; reproducibility is critical whether you are collaborating with team members, debugging a production failure, or iterating an existing model. 
  • It should scale well; choose a platform that meets your current needs and can scale for the future for both real-time and batch workloads, serving high-throughput scenarios, scaling automatically with the increasing traffic, with easy cost management and safe deployment and release practices.

Best open-source MLOps tools for your NLP projects

Every MLOps tool has its own strengths. The open-source platforms listed below are specific to NLP projects. Some of the commercial platforms are specifically for NLP projects, while others can be used for any ML project in general.

AdaptNLP

It’s a high-level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end-to-end tasks. It was built on top of Zalando Research’s Flair and Hugging Face’s Transformers library. 

AdaptNLP provides machine learning researchers and scientists a modular and adaptive approach to a variety of NLP tasks, with an easy API for training, inference, and deploying NLP-based microservices. You can deploy your AdaptNLP models with FastAPI, locally or using Docker.

AdaptNLP features:

  • The API is unified for NLP tasks with SOTA pretrained models. You can use it with Flair and Transformer models.
  • Provides an interface for training and fine-tuning your models.
  • Easily and instantly deploy your NLP model with FastAPI framework.
  • You can easily build and run AdaptNLP containers on GPUs using Docker.

Installation requirements for Linux/Mac:

I'd advise installing it in a new virtual environment to prevent dependency conflicts. If you have Python 3.7 installed, you'll need the latest stable version of PyTorch (v1.7), and if you have Python 3.6, you'll have to downgrade PyTorch to a version <=1.6.

Installation Requirement for Windows:

If you don't have PyTorch installed already, you'll have to install it manually from the PyTorch website.

Using pip,

pip install adaptnlp

or if you want to contribute to the development, 

pip install adaptnlp[dev]


AutoGluon

AutoGluon is simply AutoML for text, image, and tabular data. It enables you to easily extend AutoML to areas like deep learning, stack ensembling, and other real-world applications. It automates machine learning tasks and gives your model strong predictive performance in your applications.

In just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on text, image, and tabular data. Currently, it provides support for only Linux and MacOS users.

AutoGluon features:

  • Create a quick prototype of your deep learning and ML solutions with just a few lines of code.
  • Use state-of-the-art techniques automatically, without needing expert knowledge.
  • Perform data preprocessing, architecture search, model selection/ensembling, and hyperparameter tuning automatically.
  • AutoGluon is fully customizable for your use case.

Installation:

It requires Python 3.6, 3.7, or 3.8 and currently supports only Linux and MacOS. Depending on your system, you can install either the CPU version or the GPU version.

Using pip:

For MacOS:

  • Pip install for CPU:

python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel
python3 -m pip install -U "mxnet<2.0.0"
python3 -m pip install autogluon

  • Pip install for GPU: currently unavailable on MacOS.

For Linux:

  • Pip install for CPU
python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel
python3 -m pip install -U "mxnet<2.0.0"
python3 -m pip install autogluon
  • Pip install for GPU
python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel

# Here we assume CUDA 10.1 is installed.  You should change the number
# according to your own CUDA version (e.g. mxnet_cu100 for CUDA 10.0).
python3 -m pip install -U "mxnet_cu101<2.0.0"
python3 -m pip install autogluon


GluonNLP

It's a framework that supports NLP processes such as loading text data, preprocessing text data, and training NLP models. It's available on Linux and MacOS. You can also convert models from other NLP toolkits into GluonNLP; a few examples of models you can convert include BERT, ALBERT, ELECTRA, MobileBERT, RoBERTa, XLM-R, BART, GPT-2, and T5.

GluonNLP features:

  • Easy to use Text Processing Tools and Modular APIs
  • Pretrained Model Zoo
  • Write Models with Numpy like APIs
  • Fast Inference via Apache TVM (incubating) (Experimental)
  • AWS Integration with SageMaker

Installation

Before you start the installation, make sure you have an MXNet 2 release on your system. If you don't, you can install it from your terminal by choosing one of the following options:

# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0a"

# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0a"

# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0a"

Now you can go ahead and install GluonNLP. Open your terminal (in a clone of the GluonNLP repository, since the command installs in editable mode) and type:

python3 -m pip install -U -e .

You can also install all the extra requirements by typing: 

python3 -m pip install -U -e ."[extras]"

If you come across any issue while installing related to user permissions, please refer to this guide.


Kashgari

Kashgari is a powerful NLP transfer learning framework that you can use to build state-of-the-art models in 5 minutes for named entity recognition (NER), part-of-speech tagging (POS), and text classification. It can be used by beginners, academics, and researchers alike.

Kashgari features:

  • Easy to customize, well documented, and straightforward.
  • Kashgari allows you to use state-art-of-the-art models for your Natural Language Processing projects.
  • It allows you to build multi-label classification models, create custom models, and so much more. Learn more here
  • Allows you to adjust your model’s hyperparameters, use custom optimizers and callbacks, create custom models, and others.
  • Kashgari has built-in pretrained models which makes transfer learning very easy.
  • Kashgari is simple, fast, and scalable.
  • You can export your models and directly deploy them to the cloud using tensorflow serving.

Installation

Kashgari requires you to have Python 3.6+ installed on your system.

Using pip

  • For TensorFlow 2.x:
pip install 'kashgari>=2.0.0'
  • For TensorFlow 1.14+:
pip install 'kashgari>=1.0.0,<2.0.0'
  • For Keras:
pip install 'kashgari<1.0.0'


LexNLP

LexNLP, developed by LexPredict, is a Python library for working with real, unstructured legal text, including contracts, policies, procedures, and other types of materials. It ships with pre-trained models, classifiers for document and clause types, tools for building new clustering and classification methods, and hundreds of unit tests based on real legal documents.

Features:

  • It provides pre-trained models for segmentation, word embedding and topic models, classifiers for document and clause type.
  • Fact extraction.
  • Tools for building new clustering and classification methods.

Installation:

Requires Python 3.6.

pip install lexnlp


Tensorflow Text

TensorFlow Text provides a collection of text-related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models and includes other features useful for sequence modeling not provided by core TensorFlow.

The benefit of using these ops in your text preprocessing is that they are done in the TensorFlow graph. You don’t need to worry about tokenization in training being different than the tokenization at inference, or managing preprocessing scripts.
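As a minimal sketch of an in-graph preprocessing op (toy sentences; WhitespaceTokenizer is one of the simplest tokenizers the library ships with):

import tensorflow_text as tf_text

# The tokenizer is a TensorFlow op, so the exact same preprocessing
# runs inside the graph at both training and serving time.
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["everything not saved will be lost",
                             "NLP plays a role in search engines"])
print(tokens.to_list())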

Tensorflow Text features:

  • Facilitates a large toolkit for working with text
  • Allows integration with a large suite of Tensorflow tools to support projects from problem definition through training, evaluation, and launch
  • Reduces complexity at serving time and prevents training-serving skew

Installation:

Using Pip

Please note: when installing TF Text with pip, check the version of TensorFlow you are running, as you should install the corresponding minor version of TF Text (e.g., for tensorflow==2.3.x, use tensorflow_text==2.3.x).

pip install -U tensorflow-text==<version>

Installing from source

Please note that TF Text needs to be built in the same environment as TensorFlow. Thus, if you manually build TF Text, it is highly recommended that you also build TensorFlow.

If building on MacOS, you must have coreutils installed. It is probably easiest to do with Homebrew.

Build and install TensorFlow.

  • Clone the TF Text repo: git clone https://github.com/tensorflow/text.git
  • Run the build script to create a pip package: ./oss_scripts/run_build.sh. After this step, there should be a *.whl file in the current directory, with a name similar to tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl.
  • Install the package to environment: pip install ./tensorflow_text-*-*-*-os_platform.whl

Tutorials:

Text preprocessing

Text Classification

Text Generation

Snorkel

Snorkel is a data labeling tool: you can label, build, and manage training data programmatically. The first component of a Snorkel pipeline is labeling functions, which are designed as weak heuristic functions that predict a label given unlabelled data.

Features:

  • It supports Tensorflow/Keras, Pytorch, Spark, Dask, and Scikit-Learn.
  • It provides APIs for labeling, analysis, preprocessing, slicing, mapping, utils, and classification.

Installation:

Snorkel requires Python 3.6 or later. 

Using pip (Recommended)

pip install snorkel

Using conda

conda install snorkel -c conda-forge

Please note: if you're using Windows, it's highly recommended to use Docker (tutorial example) or the Linux subsystem.
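As a minimal sketch of the labeling-function idea (the label values and the assumption that each data point has a text attribute are made up for illustration):

from snorkel.labeling import labeling_function

SPAM, ABSTAIN = 1, -1

@labeling_function()
def lf_contains_offer(x):
    # Weak heuristic: messages mentioning "offer" are likely spam.
    return SPAM if "offer" in x.text.lower() else ABSTAIN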


Tensorflow Lingvo

Lingvo is a framework for building neural networks in Tensorflow, particularly sequence models. 

Tensorflow Lingvo features:

  • Lingvo supports natural language processing (NLP) tasks but it is also applicable to models used for tasks such as image segmentation and point cloud classification.
  • Lingvo can be used to train models on production-scale datasets.
  • Lingvo provides additional support for synchronous and asynchronous distributed training.
  • Quantization support has been built directly into the Lingvo framework.

Installation:

Using pip:

pip3 install lingvo

Installing from sources:

Check that you've met the following prerequisites:

  • TensorFlow 2.5 installed on your system
  • C++ compiler (only g++ 7.3 is officially supported)
  • The bazel build system.

Refer to docker/dev.dockerfile for a set of working requirements.

Now, git clone the repository, then use bazel to build and run targets directly. The python -m module commands in the codelab need to be mapped onto bazel run commands.

Using docker:

Docker configurations are available for both situations. Instructions can be found in the comments on the top of each file.

lib.dockerfile has the Lingvo pip package preinstalled.

dev.dockerfile can be used to build Lingvo from sources.


SpaCy

spaCy is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research and was designed from day one to be used in real products.

spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification, and more, multi-task learning with pre-trained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment, and workflow management.

Features:

  • Support for custom models in PyTorch, TensorFlow, and other frameworks.
  • Support for 60+ languages.
  • Support for pre-trained word vectors and embeddings.
  • Easy model packaging, deployment, and workflow management.
  • Linguistically-motivated tokenization.

Installation:

It supports macOS/OS X, Linux, and Windows (Cygwin, MinGW, Visual Studio). You also need Python 3.6+ (64-bit only) installed on your system.

Using pip

Before you continue with the installation, make sure that your pip, setuptools, and wheel are up to date.

pip install -U pip setuptools wheel
pip install spacy

Using conda

conda install -c conda-forge spacy
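Once installed, a minimal usage sketch (this assumes you've also downloaded the small English pipeline with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Neptune helps teams in Warsaw track their experiments.")

# Print the named entities the pipeline found.
for ent in doc.ents:
    print(ent.text, ent.label_)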


Flair

Flair is a simple framework for state-of-the-art NLP. It allows you to use state-of-the-art models for your NLP tasks, such as Named Entity Recognition (NER), part-of-speech tagging (POS), sense disambiguation, and classification. It provides special support for biomedical data and also supports a rapidly growing number of languages.

Flair features:

  • It’s entirely built on Pytorch and so you can easily build and train your Flair models.
  • State-of-the-art NLP models that you can use for your text.
  • Allows you to combine different words and document embeddings with simple interfaces.

Installation:

It requires PyTorch 1.5+ and currently supports Python 3.6. Here's how to install it on Ubuntu 16.04:

pip install flair
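A minimal usage sketch of the pre-trained NER tagger (the model is downloaded on first use):

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")

sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)
print(sentence.to_tagged_string())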


Open-source MLOps tools for your NLP projects – comparison

This comparison looks at Adapt NLP, Flair, spaCy, Tensorflow Lingvo, Snorkel, Tensorflow Text, LexNLP, Kashgari, GluonNLP, and AutoGluon along two axes: platform and framework support (Windows, Linux, macOS; TensorFlow, PyTorch, and other frameworks) and the MLOps stages each tool covers (data labelling, data preprocessing, model development, and model deployment).

Best MLOps as a service tools for NLP projects

Neu.ro

The Neu.ro MLOps platform provides a complete solution for managing the infrastructure and processes you need for successful ML development at scale. It covers the complete MLOps lifecycle, including data collection, model development, model training, experiment tracking, deployment, and monitoring.

Setup

Installation

It's advisable to create a new virtual environment first. You need Python 3.7 installed.

pip install -U neuromation


How to:

  • Sign up at neu.ro
  • Upload data with either the web UI or the CLI
  • Set up the development environment (allows you to use a GPU)
  • Train a model or download a pretrained model
  • Run a Jupyter notebook

Check out this ML Cookbook to help you get started with an NLP project.

AutoNLP

AutoNLP provides an automatic way to train, evaluate, and deploy state-of-the-art NLP models for different tasks, seamlessly integrated with the Hugging Face ecosystem. It automatically fine-tunes a working model for deployment based on the dataset that you provide.

Setup

Installation:

To use pip:

pip install -U autonlp

Please note: you need to install git lfs to use the CLI.

How to:

  •  Sign in to your account
  •  Create a new model
  •  Upload your dataset
  •  Train your autonlp model
  •  Track model progress
  •  Make predictions
  •  Deploy your model

Check out the AutoNLP documentation for your specific use case.

neptune.ai

Neptune is an experiment tracker designed with a strong focus on collaboration and scalability. It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye. The tool is known for its user-friendly interface, giving users a lot of flexibility when defining data structures and tracking metadata. 

All metadata in a single place with an experiment tracker (example in neptune.ai)

DataRobot

DataRobot, which has now acquired Algorithmia, is a platform that automates the end-to-end process of building, deploying, and maintaining machine learning (ML) and artificial intelligence (AI) at scale. It offers a no-code app builder and lets you deploy, monitor, manage, and govern all your models in production, regardless of how they were created or when and where they were deployed.

Setup

Installation

  • It currently supports Python 2.7 and >=3.4:
pip3 install datarobot
  • With Python 3.6+, also install:
pip3 install requests requests-toolbelt

How to create a new project:

  • Sign in to your account
  • Install dependencies
  • Load and Profile your data
  • Start modelling
  • Review and interpret model
  • Deploy model
  • Choose an application

Check this doc for a proper walkthrough on how to use these steps.
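
For the Python route, a minimal sketch might look like this (the endpoint, token, dataset file, and target column are all placeholders for your own values):

import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2",   # placeholder endpoint
          token="YOUR_API_TOKEN")                        # placeholder token
project = dr.Project.create(sourcedata="data.csv",       # placeholder dataset
                            project_name="nlp-project")
project.set_target(target="label",                       # placeholder target column
                   mode=dr.AUTOPILOT_MODE.QUICK)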

AWS MLOps Frameworks

It helps you streamline and enforce best practices for productionizing your machine learning models. It's an extendable framework that provides a standard interface for managing ML pipelines for AWS ML services and third-party services. The solution template lets you upload your trained models, configure the orchestration of the pipeline, and monitor pipeline operations. It allows you to leverage a preconfigured ML pipeline and also automatically deploy a trained model with an inference endpoint.

How to setup a new project:

  • Sign in to your AWS account
  • Create a new SageMaker Studio
  • Create a new project
  • Select the MLOps architecture (development, evaluation, or deployment) you want
  • Add data to AWS S3 bucket
  • Create pipeline and training files.

Check out these docs on how to set up a new project. You can also check out this tutorial on how to create a simple project.

Azure Machine Learning MLOps

Azure MLOps allows you to experiment, develop, and deploy models into production with end-to-end lineage tracking. You can create reproducible ML pipelines and reusable software environments, deploy models from anywhere, govern the ML lifecycle, and closely monitor models in production for any issues. It also lets you automate the end-to-end ML lifecycle with pipelines, which let you update models, test new models, and continuously deploy new ML models.

Setup

Installation

You need to install the Azure CLI 

How to:

  • Sign in to Azure devops
  • Create a new project
  • Import the project repository
  • Setup project environment
  • Create a pipeline
  • Train and deploy model
  • Set up continuous integration pipeline 

Check out this doc on how to go about these processes.
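
If you prefer the Python SDK over the CLI, here's a minimal experiment-tracking sketch using azureml-core (it assumes you've downloaded a workspace config.json from the portal; the metric is illustrative):

from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                      # reads config.json for your workspace
exp = Experiment(workspace=ws, name="nlp-experiment")
run = exp.start_logging()
run.log("accuracy", 0.91)                         # illustrative metric
run.complete()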

Vertex AI | Google Cloud AI Platform

Vertex AI is a machine learning platform where you can access all Google Cloud services in one place to deploy and maintain AI models. It brings together the Google Cloud services for building ML under one, unified UI and API. You can use Vertex AI to easily train and compare models using AutoML or your custom code, with all your models stored in one central model repository. 

Setup

Installation

You can either use the Google Cloud console and Cloud Shell, or install the Cloud SDK on your system.

How to create a new project (Using cloud shell):

  • Sign in to your account
  • Create a new project (ensure billing is enabled for your account)
  • Activate Cloud Shell
  • Create a storage bucket
  • Train your model
  • Deploy to Google Cloud

Check out this doc for a walkthrough on how to follow these steps.
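
As a rough sketch of the same flow in code (the project ID, region, bucket, training script, and container image are placeholders; check the Vertex AI docs for the current prebuilt training images):

from google.cloud import aiplatform

aiplatform.init(project="my-project",             # placeholder project ID
                location="us-central1",
                staging_bucket="gs://my-bucket")  # placeholder bucket

job = aiplatform.CustomTrainingJob(
    display_name="nlp-training",
    script_path="train.py",                       # placeholder training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest",
)
job.run(replica_count=1)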

MLOps as a service tools for NLP projects – comparison

 
| Tool | Data collection and management | Data preparation and feature engineering | Model training and deployment | Model monitoring and experiment tracking | ML metadata store | Model registry & management |
| --- | --- | --- | --- | --- | --- | --- |
| AutoNLP | No | No | Yes | No | No | No |
| Azure MLOps | No | No | Yes | Yes | Yes | Yes |
| AWS MLOps | No | No | Yes | Yes | No | No |
| DataRobot | No | No | Yes | Yes | No | Yes |
| neptune.ai | No | No | No | Yes | Yes | Limited |
| Neu.ro | Yes | No | Yes | Yes | No | No |
| Vertex AI | No | Yes | Yes | Yes | Yes | Yes |

Conclusion

I've talked about why you need MLOps and how to choose a good tool for your project. I also listed some NLP MLOps tools and highlighted some of their most interesting features. Not sure which tool to try? Check the comparison tables above to see which best fits your project. I hope you try out some of the listed tools, and do let me know what you think. Thanks for reading!

10 NLP Projects to Boost Your Resume https://neptune.ai/blog/10-nlp-projects Fri, 22 Jul 2022 06:14:42 +0000 https://neptune.test/10-nlp-projects/ Natural Language Processing (NLP) is a very exciting field. Already, NLP projects and applications are visible all around us in our daily life. From conversational agents (Amazon Alexa) to sentiment analysis (Hubspot’s customer feedback analysis feature), language recognition and translation (Google Translate), spelling correction (Grammarly), and much more.

Whether you’re a developer or data scientist curious about NLP, why not just jump in the deep end of the pool, and learn by doing it?

With well-known frameworks like PyTorch and TensorFlow, you just launch a Python notebook and you can be working on state-of-the-art deep learning models within minutes. 

In this article, I’ll help you practice NLP by suggesting 10 great projects you can start working on right now—plus, each of these projects will be a great addition to your resume!

Read more

Explore more Natural Language Processing articles.

A brief history of the NLP field

This is just a bit of background about Natural Language Processing, but you can skip on to the projects if you’re not interested.

NLP was born in the middle of the 20th century. A major historical NLP landmark was the Georgetown Experiment in 1954, in which a set of around 60 Russian sentences was translated into English.

The 1970s saw the development of a number of chatbot concepts based on sophisticated sets of hand-crafted rules for processing input information. In the late 1980s, singular value decomposition (SVD) was applied to the vector space model, leading to latent semantic analysis—an unsupervised technique for determining the relationship between words in a language.

In the past decade (after 2010), neural networks and deep learning have been rocking the world of NLP. These techniques achieve state-of-the-art results for the hardest NLP tasks, like machine translation. In 2013, we got the word2vec model and its variants. These neural-network-based techniques vectorize words, sentences, and documents in such a way that the distance between vectors in the generated vector space represents the difference in meaning between the corresponding entities.

In 2014, sequence-to-sequence models were developed and achieved a significant improvement in difficult tasks, such as machine translation and automatic summarization.

Later it was discovered that long input sequences were harder to deal with, which led us to the attention technique. This improved sequence-to-sequence model performance by letting the model focus on parts of the input sequence that were the most relevant for the output. The transformer model improves this more, by defining a self-attention layer for both the encoder and decoder.

The cleverly named "Attention Is All You Need" paper that introduced the transformer also enabled the creation of powerful deep learning language models, like:

  • ULM-Fit – Universal Language Model Fine-tuning: a method for fine-tuning any neural-network-based language model for any task, demonstrated in the context of text classification. A key concept behind this method is discriminative fine-tuning, where the different layers of the network are trained at different rates.
  • BERT – Bidirectional Encoder Representations from Transformers: a modification of the Transformer architecture that keeps the encoders and discards the decoders. It relies on masking words, which the model then has to predict accurately as its training objective.
  • GPT – Generative Pretrained Transformer: a modification of the Transformer's encoder-decoder architecture meant to achieve a fine-tunable language model for NLP. It discards the encoders, retaining the decoders and their self-attention sublayers.

Recent years have seen the most rapid advances in the NLP field. In the modern NLP paradigm, transfer learning, we can adapt/transfer knowledge acquired from one set of tasks to a different set. This is a big step towards the full democratization of NLP, allowing knowledge to be re-used in new settings at a fraction of the previously required resources.

Why should you build NLP projects?

NLP is at the intersection of AI, computer science, and linguistics. It deals with tasks related to language and information. Understanding and representing the meaning of language is difficult. So, if you want to work in this field, you’re going to need a lot of practice. The projects below will help you do that. 

Building real-world NLP projects is the best way to get NLP skills and transform theoretical knowledge into valuable practical experience. 

Later, when you're applying for an NLP-related job, you'll have a big advantage over people who have no practical experience. Anyone can add "NLP proficiency" to their CV, but not everyone can back it up with an actual project that they can show to recruiters.

Okay, we’ve done enough introductions. Let’s move on to the 10 NLP projects that you can start right now. We have beginner, intermediate, as well as advanced projects—choose the one you like, and become the NLP master you’ve always wanted to be!

10 NLP project ideas to boost your resume

We’ll start with beginner-level projects, but you can move on to intermediate or advanced projects if you’ve already done NLP in practice.

Beginner NLP projects

  1. Sentiment analysis for marketing

This type of project can show you what it’s like to work as an NLP specialist. For this project, you want to find out how customers evaluate competitor products, i.e. what they like and dislike. It’s a great business case. Learning what customers like about competing products can be a great way to improve your own product, so this is something that many companies are actively trying to do.

To achieve this task, you will employ different NLP methods to get a deeper understanding of customer feedback and opinion.

Start project now → Go To Project Repository

Learn more

Sentiment Analysis in Python: TextBlob vs Vader Sentiment vs Flair vs Building It From Scratch

  2. Toxic comment classification

In this project, you want to create a model that classifies comments into different categories. Comments on social media are often abusive and insulting. Organizations often want to ensure that conversations don't get too negative. This project was a Kaggle challenge, where the participants had to suggest a solution for classifying toxic comments in several categories using NLP methods.

Start project now → Go To Project Repository

  3. Language identification

This is a good project for beginners to learn basic NLP concepts and methods. We can easily see how Chrome, or another browser, detects the language in which a web page is written. This task is a lot easier with machine learning.

You can build your own language detection with the fastText model by Facebook.

Start project now → Go To Project Repository

Intermediate NLP projects

  4. Predict closed questions on Stack Overflow

If you’re a programmer of any kind, I don’t need to tell you what Stack Overflow is. It’s any programmer’s best friend.

Programmers ask many questions on Stack Overflow all the time, some are great, others are repetitive, time-wasting, or incomplete. So, in this project, you want to predict whether a new question will be closed or not, along with the reason why. 

The dataset has several features including the text of question title, the text of question body, tags, post creation date, and more.

Start project now → Go To Dataset

  5. Create text summarizer

Text summarization is one of the most interesting problems in NLP. It’s hard for us, as humans, to manually extract the summary of a large document of text. 

To solve this problem, we use automatic text summarization. It’s a way of identifying meaningful information in a document and summarizing it while conserving the overall meaning. 

The purpose is to present a shorter version of the original text while preserving the semantics.

In this project, you could use different traditional and advanced methods to implement automatic text summarization, and then compare the results of each method to conclude which is the best to use for your corpus.

Start project now → Go To Project Repository

  6. Document Similarity (Quora question pair similarity)

Quora is a question and answer platform where you can find all sorts of information. Every piece of content on the site is generated by users, and people can learn from each other’s experiences and knowledge.

For this project, Quora challenged Kaggle users to classify whether question pairs are duplicated or not. 

Solving this task helps surface high-quality answers to questions, improving the Quora user experience for writers and readers alike.

Start project now → Go To Dataset

  7. Paraphrase detection task

Paraphrase detection is a task that checks if two different text entities have the same meaning or not. This project has various applications in areas like machine translation, automatic plagiarism detection, information extraction, and summarization. The methods for paraphrase detection are grouped into two main classes: similarity-based methods, and classification methods. 

Start project now → Go To Project Repository

Advanced NLP projects

  8. Generating research paper titles

This is a very innovative project where you want to produce titles for scientific papers. For this project, a GPT-2 model is trained on more than 2,000 article titles extracted from arXiv. You can apply the same approach to other text-generation tasks, such as producing song lyrics or dialogues. From this project, you can also learn about web scraping, because you will need to extract text from research papers to feed to your model for training.

Start project now → Go To Project Repository

  9. Translate and summarize news

You can build a web app that translates news from Arabic to English and summarizes them, using great Python libraries like newspaper, transformers, and gradio.


Start Project Now → Useful Link

  10. RESTful API for similarity check

This project is about building a similarity-check API using NLP techniques. The cool part is not only implementing NLP tools: you will also learn how to ship this API with Docker and use it as a web application. In doing so, you will learn how to build a full NLP application.

Start project now → Go To Project Repository 

Conclusion

And that’s it! Hope you’re able to pick a project that interests you. Get your hands dirty, and start working on your NLP skills! Building real projects is the single best way to get better at this, and also to improve your resume. 

That’s all for now. Thanks for reading!

Vectorization Techniques in NLP [Guide] https://neptune.ai/blog/vectorization-techniques-in-nlp-guide Thu, 21 Jul 2022 15:26:39 +0000 https://neptune.test/vectorization-techniques-in-nlp-guide/ Natural Language is how we, humans, exchange ideas and opinions. There are two main mediums for natural language – speech and text. 

Listening and reading are effortless for a healthy human, but they’re difficult for a machine learning algorithm. That’s why scientists had to come up with Natural Language Processing (NLP).

What is Natural Language Processing?

  • NLP enables computers to process human language and understand meaning and context, along with the associated sentiment and intent behind it, and eventually, use these insights to create something new.
  • NLP combines computational linguistics with statistical Machine Learning and Deep Learning models.

Learn more

Explore the Natural Language Processing category on the blog.

How do we even begin to make words interpretable for computers? That’s what vectorization is for.

What is vectorization? 

  • Vectorization is jargon for a classic approach of converting input data from its raw format (i.e., text) into vectors of real numbers, which is the format that ML models support. This approach has been around ever since computers were first built, it has worked wonderfully across various domains, and it's now used in NLP.
  • In Machine Learning, vectorization is a step in feature extraction. The idea is to get some distinct features out of the text for the model to train on, by converting text to numerical vectors.

Read also

Understanding Vectors From a Machine Learning Perspective

There are plenty of ways to perform vectorization, as we’ll see shortly, ranging from naive binary term occurrence features to advanced context-aware feature representations. Depending on the use-case and the model, any one of them might be able to do the required task. 

Let’s learn about some of these techniques and see how we can use them.

Vectorization techniques

1. Bag of Words

The simplest of all the techniques out there. It involves three operations:

  • Tokenization

First, the input text is tokenized. A sentence is represented as a list of its constituent words, and it’s done for all the input sentences.

Check also

Tokenization in NLP – Types, Challenges, Examples, Tools

  • Vocabulary creation

Of all the obtained tokenized words, only unique words are selected to create the vocabulary and then sorted by alphabetical order.

  • Vector creation

Finally, a sparse matrix is created for the input, out of the frequency of vocabulary words. In this sparse matrix, each row is a sentence vector whose length (the columns of the matrix) is equal to the size of the vocabulary.

Let’s work with an example and see how it looks in practice. We’ll be using the Sklearn library for this exercise.

May be useful

Check how to keep track of Sklearn model training.

Let’s make the required imports.

from sklearn.feature_extraction.text import CountVectorizer

Consider we have the following list of documents.

sents = ['coronavirus is a highly infectious disease',
   'coronavirus affects older people the most',
   'older people are at high risk due to this disease']

Let’s create an instance of CountVectorizer.

cv = CountVectorizer()

Now let’s vectorize our input and convert it into a NumPy array for viewing purposes.

X = cv.fit_transform(sents)
X = X.toarray()

This is what the vectors look like:
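
With the default settings, the three sentence vectors come out as follows (one row per sentence, one column per vocabulary word; recomputed from the code above, so treat it as a sketch):

[[0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0]
 [0 1 1 0 1 1 1 0 0 0 0 1 1 1 0 1 1]]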

Let’s print the vocabulary to understand why it looks like this.

sorted(cv.vocabulary_.keys())
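
The default token pattern drops single-character tokens such as 'a', so the sorted vocabulary comes out as:

['affects', 'are', 'at', 'coronavirus', 'disease', 'due', 'high', 'highly', 'infectious', 'is', 'most', 'older', 'people', 'risk', 'the', 'this', 'to']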
  • You can see that every row is the associated vector representation of respective sentences in ‘sents’.
  • The length of each vector is equal to the length of vocabulary.
  • Every member of the list represents the frequency of the associated word as present in sorted vocabulary.

In the above example, we only considered single words as features as visible in the vocabulary keys, i.e. it’s a unigram representation. This can be tweaked to consider n-gram features.

Let’s say we wanted to consider a bigram representation of our input. It can be achieved by simply changing the default argument while instantiating the CountVectorizer object:

cv = CountVectorizer(ngram_range=(2,2))

In that case, our vectors and vocabulary are built from two-word sequences, so the vocabulary consists of bigrams such as 'coronavirus is' and 'highly infectious', and each vector counts the occurrences of those bigrams.

Thus we can manipulate the features any way we want. In fact, we can also combine unigrams, bigrams, trigrams, and more, to form feature space.

Although we’ve used sklearn to build a Bag of Words model here, it can be implemented in a number of ways, with libraries like Keras, Gensim, and others. You can also write your own implementation of Bag of Words quite easily.

This is a simple, yet effective text encoding technique and can get the job done a number of times.

2. TF-IDF

TF-IDF or Term Frequency–Inverse Document Frequency, is a numerical statistic that’s intended to reflect how important a word is to a document. Although it’s another frequency-based method, it’s not as naive as Bag of Words.

How does TF-IDF improve over Bag of Words?

In Bag of Words, we witnessed how vectorization was just concerned with the frequency of vocabulary words in a given document. As a result, articles, prepositions, and conjunctions which don’t contribute a lot to the meaning get as much importance as, say, adjectives. 

TF-IDF helps us to overcome this issue. Words that get repeated too often don’t overpower less frequent but important words.

It has two parts:

  1. TF

TF stands for Term Frequency. It can be understood as a normalized frequency score. It is calculated via the following formula:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

So one can imagine that this number will always stay ≤ 1; thus, we now judge how frequent a word is in the context of all of the words in a document.

  1. IDF

IDF stands for Inverse Document Frequency, but before we go into IDF, we must make sense of DF – Document Frequency. It's given by the following formula:

DF(t) = (number of documents containing term t) / (total number of documents)

DF tells us about the proportion of documents that contain a certain word. So what's IDF?

It's the reciprocal of the Document Frequency, and the final IDF score comes out of the following formula:

IDF(t) = log( (total number of documents) / (number of documents containing term t) )

Why inverse the DF?

Just as we discussed above, the intuition behind it is that the more common a word is across all documents, the lesser its importance is for the current document.

A logarithm is taken to dampen the effect of IDF in the final calculation.

The final TF-IDF score comes out to be:

TF-IDF(t, d) = TF(t, d) × IDF(t)

This is how TF-IDF manages to incorporate the significance of a word. The higher the score, the more important that word is.

Let’s get our hands dirty now and see how TF-IDF looks in practice.

Again, we’ll be using the Sklearn library for this exercise, just as we did in the case of Bag of Words.

Making the required imports.

from sklearn.feature_extraction.text import TfidfVectorizer

Again let’s use the same set of documents.

sents = ['coronavirus is a highly infectious disease',
   'coronavirus affects older people the most',
   'older people are at high risk due to this disease']

Creating an instance of TfidfVectorizer.

tfidf = TfidfVectorizer()

Let’s transform our data now.

transformed = tfidf.fit_transform(sents)

Now let’s see which features are the most important, and which features were useless. For the sake of interpretability, we’ll be using the Pandas library, just to get a better look at scores.

Making the required import:

import pandas as pd

Creating a data frame with feature names, i.e. the words, as indices, and sorted TF-IDF scores as a column:

df = pd.DataFrame(transformed[0].T.todense(),
                  index=tfidf.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)

Since the transformed TF-IDF feature matrix comes out as a SciPy Compressed Sparse Row matrix, which can't be viewed in its raw form, we have converted it into a dense NumPy matrix via the todense() operation after taking its transpose. Similarly, we get the complete vocabulary of tokenized words via get_feature_names_out() (named get_feature_names() in scikit-learn versions before 1.0).

This is what comes out of the other end:

So, according to TF-IDF, the word 'infectious' is the most important feature out there, while many words that would have been used for feature building in a naive approach like Bag of Words simply amount to 0 here. This is what we wanted all along.

A few pointers about TF-IDF:

  • The concept of n-grams is applicable here as well, we can combine words in groups of 2,3,4, and so on to build our final set of features.
  • Along with n-grams, there are also a number of parameters such as min_df, max_df, max_features, sublinear_tf, etc. to play around with. Carefully tuning these parameters can do wonders for your model’s capabilities.

Despite being so simple, TF-IDF is extensively used in tasks like Information Retrieval, to judge which response best matches a query (especially useful in chatbots), and in Keyword Extraction, to determine the most relevant word in a document. Thus, you'll often find yourself banking on the intuitive wisdom of TF-IDF.

So far, we’ve seen frequency-based methods for encoding text, now it’s time to take a look at more sophisticated methods which changed the world of word embeddings as we know it, and opened new research opportunities in NLP.

3. Word2Vec

This approach was released back in 2013 by Google researchers in this paper, and it took the NLP industry by storm. In a nutshell, this approach uses the power of a simple neural network to generate word embeddings.

How does Word2Vec improve over frequency-based methods?

In Bag of Words and TF-IDF, we saw how every word was treated as an individual entity, and semantics were completely ignored. With the introduction of Word2Vec, the vector representation of words was said to be contextually aware, probably for the first time ever.

Perhaps, one of the most famous examples of Word2Vec is the following expression:

king – man + woman = queen

Since every word is represented as an n-dimensional vector, one can imagine that all of the words are mapped to this n-dimensional space in such a manner that words having similar meanings exist in close proximity to one another in this hyperspace. 

There are mainly two ways to implement Word2Vec, let’s take a look at them one by one:

A: Skip-Gram

So the first one is the Skip-Gram method in which we provide a word to our Neural Network and ask it to predict the context. The general idea can be captured with the help of the following image:

Vectorization techniques - skipgram

Here w[i] is the input word at an ‘i’ location in the sentence, and the output contains two preceding words and two succeeding words with respect to ‘i’.

Technically, it predicts the probabilities of a word being a context word for the given target word. The output probabilities coming out of the network will tell us how likely it is to find each vocabulary word near our input word. 

This shallow network comprises an input layer, a single hidden layer, and an output layer, we’ll take a look at that shortly.

However, the interesting part is, we don’t actually use this trained Neural Network. Instead, the goal is just to learn the weights of the hidden layer while predicting the surrounding words correctly. These weights are the word embeddings.

How many neighboring words the network is going to predict is determined by a parameter called “window size”. This window extends in both the directions of the word, i.e. to its left and right.

Let’s say we want to train a skip-gram word2vec model over an input sentence:

“The quick brown fox jumps over the lazy dog”

The following image illustrates the training samples that would generate from this sentence with a window size = 2.

Vectorization techniques - w2v_window
  • ‘The’ becomes the first target word and since it’s the first word of the sentence, there are no words to its left, so the window of size 2 only extends to its right resulting in the listed training samples.
  • As our target shifts to the next word, the window expands by 1 on left because of the presence of a word on the left of the target.
  • Finally, when the target word is somewhere in the middle, training samples get generated as intended.

The Neural Network

Now let’s talk about the network which is going to be trained on the aforementioned training samples.

May interest you

Guide to Building Your Own Neural Network

Intuition

  • If you’re aware of what autoencoders are, you’ll find that the idea behind this network is similar to that of an autoencoder. 
  • You take an extremely large input vector, compress it down to a dense representation in the hidden layer, and then instead of reconstructing the original vector as in the case of autoencoders, you output probabilities associated with every word in the vocabulary.

Input/Output

Now the question arises: how do you input a single target word as a large vector?

The answer is One-Hot Encoding.

  • Let’s say our vocabulary contains around 10,000 words and our current target word ‘fox’ is present somewhere in between. What we’ll do is, put a 1 in the position corresponding to the word ‘fox’ and 0 everywhere else, so we’ll have a 10,000-dimensional vector with a single 1 as the input.
  • Similarly, the output coming out of our network will be a 10,000-dimensional vector as well, containing, for every word in our vocabulary, the probability of it being the context word for our input target word.

Here's what the architecture of our neural network looks like:

Vectorization techniques - neural network
  • As can be seen, the input is a 10,000-dimensional vector, given our vocabulary size of 10,000, containing a 1 at the position corresponding to our target word.
  • The output layer consists of 10,000 neurons with the Softmax activation function applied, so as to obtain the respective probabilities for every word in our vocabulary.
  • Now, the most important part of this network: the hidden layer is a linear layer, i.e. there's no activation function applied there, and the optimized weights of this layer will become the learned word embeddings.
  • For example, let’s say we decide to learn word embeddings with the above network. In that case, the hidden layer weight matrix shape will be M x N, where M = vocabulary size (10,000 in our case) and N = hidden layer neurons (300 in our case).
  • Once the model gets trained, the final word embedding for our target word will be given by the following calculation (a quick NumPy sketch follows this list):

1×10000 input vector * 10000×300 matrix = 1×300 vector

  • 300 hidden layer neurons were used by Google in their trained model, however, this is a hyperparameter and can be tuned accordingly to obtain the best results.
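
Here's a small NumPy sketch of that lookup (the word index and weight values are made up; the point is that multiplying a one-hot vector by the weight matrix simply selects one row):

import numpy as np

vocab_size, hidden_size = 10000, 300
W = np.random.rand(vocab_size, hidden_size)   # stand-in for the trained hidden-layer weights

one_hot = np.zeros(vocab_size)
one_hot[4231] = 1.0                           # hypothetical index of the word 'fox'

embedding = one_hot @ W                       # 1x300 vector
assert np.allclose(embedding, W[4231])        # identical to reading row 4231 directly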

So this is how the skip-gram word2vec model generally works. Time to take a look at its competitor.

B. CBOW

CBOW stands for Continuous Bag of Words. In the CBOW approach instead of predicting the context words, we input them into the model and ask the network to predict the current word. The general idea is shown here:

Vectorization techniques - cbow

You can see that CBOW is the mirror image of the skip-gram approach. All the notations here mean exactly the same thing as they did in the skip-gram, just the approach has been reversed.

Now, since we already took a deep dive into what skip-gram is and how it works, we won’t be repeating the parts that are common in both approaches. Instead, we’ll just talk about how CBOW is different from skip-gram in its working. For that, we’ll take a rough look at the CBOW model architecture.

Here’s what it looks like:

  • The dimension of our hidden layer and output layer stays the same as the skip-gram model.
  • However, just as we read that a CBOW model takes context words as input, here the input is C context words in the form of a one-hot encoded vector of size 1xV each, where V = size of vocabulary, making the entire input CxV dimensional.
  • Now, each of these C vectors will be multiplied with the Weights of our hidden layer which are of the shape VxN, where V = vocab size and N = Number of neurons in the hidden layer.
  • If you can imagine, this will result in C, 1xN vectors, and all of these C vectors will be averaged element-wise to obtain our final activation for the hidden layer, which then will be fed into our output softmax layer.
  • The learned weight between the hidden and output layer makes up the word embedding representation.

Now, if this was a little too overwhelming for you, the TL;DR for the CBOW model is:

Because of having multiple context words, averaging is done to calculate hidden layer values. After this, it gets similar to our skip-gram model, and learned word embedding comes from the output layer weights instead of hidden layer weights.

When to use the skip-gram model and when to use CBOW?

  • According to the original paper, skip-gram works well with small datasets and can better represent rare words.
  • However, CBOW is found to train faster than skip-gram and can better represent frequent words.
  • So the choice of skip-gram VS. CBOW depends on the kind of problem that we’re trying to solve.

Now enough with the theory, let’s see how we can use word2vec for generating word embeddings.

We’ll be using the Gensim library for this exercise.

Making the required imports.

from gensim import models

Now there are two options here: we can either use a pre-trained model or train a new model all by ourselves. We'll go through both.

Let’s use Google’s pre-trained model first to check out the cool stuff that we can do with it. You can either download the said model from here and give the path of the unzipped file down below, or you can get it via the following Linux commands.

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

gzip -d GoogleNews-vectors-negative300.bin.gz

Let's load up the model now. However, be advised that it's a very heavy model, and your laptop might freeze if it runs low on memory.

w2v = models.KeyedVectors.load_word2vec_format(
'./GoogleNews-vectors-negative300.bin', binary=True)

Vector representation for any word, say healthy, can be obtained by:

vect = w2v['healthy']

This will give out a 300-dimensional vector.

We can also leverage this pre-trained model to get similar meaning words for an input word.

w2v.most_similar('happy')

It’s amazing how well it performs for this task, the output comprises a list of tuples of relevant words and their corresponding similarity scores, sorted in decreasing order of similarity.
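
We can also reproduce the famous king/queen analogy from earlier using most_similar's positive and negative arguments:

w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
# 'queen' should come out on top, with a similarity of roughly 0.7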

As discussed, you can also train your own word2vec model. 

Let’s use the previous set of sentences again as a dataset for training our custom word2vec model.

sents = ['coronavirus is a highly infectious disease',
   'coronavirus affects older people the most',
   'older people are at high risk due to this disease']

Word2vec requires the training dataset to be a list of lists of tokens, so we'll preprocess and convert sents to:

sents = [sent.split() for sent in sents]

Finally, we can train our model with:

custom_model = models.Word2Vec(sents, min_count=1, vector_size=300, workers=4)  # in gensim < 4.0, vector_size was called size

How well this custom model performs will depend on our dataset and how intensely it has been trained. However, it’s unlikely to beat Google’s pre-trained model.

And that’s all about word2vec. If you want to get a visual taste of how word2vec models work and want to understand it better, go to this link. It’s a really cool tool to witness CBOW & skip-gram in action.

4. GloVe

GloVe stands for Global Vectors for word representation. It was developed at Stanford. You can find the original paper here, it was published just a year after word2vec. 

Similar to Word2Vec, the intuition behind GloVe is also to create contextual word embeddings. But given the great performance of Word2Vec, why was there a need for something like GloVe?

How does GloVe improve over Word2Vec?

  • Word2Vec is a window-based method, in which the model relies on local information for generating word embeddings, which in turn is limited to the window size that we choose.
  • This means that the semantics learned for a target word is only affected by its surrounding words in the original sentence, which is a somewhat inefficient use of statistics, as there’s a lot more information we can work with.
  • GloVe on the other hand captures both global and local statistics in order to come up with the word embeddings.

We saw local statistics used in Word2Vec, but what are global statistics now?

GloVe derives semantical meaning by training on a co-occurrence matrix. It’s built on the idea that word-word co-occurrences are a vital piece of information and using them is an efficient use of statistics for generating word embeddings. This is how GloVe manages to incorporate “global statistics” into the end result.

For those of you who aren’t aware of the co-occurrence matrix, here’s an example:

Let’s say we have two documents or sentences.

Document 1: All that glitters is not gold.

Document 2: All is well that ends well.

Then, with a fixed window size of n = 1, our co-occurrence matrix would look like this: 

  • If you take a moment to look at it, you realize the rows and columns are made up of our vocabulary, i.e. the set of unique tokenized words obtained from both documents.
  • Here, <START> and <END> are used to denote the beginning and end of sentences.
  • The window of size 1 extends in both directions of the word. Since 'that' and 'is' each occur only once in the window vicinity of 'glitters', the values of (that, glitters) and (is, glitters) are 1; you get the idea now of how to fill in this table.

A little about its training: the GloVe model is a weighted least squares model, and its cost function looks something like this:

J = sum over word pairs (i, j) of f(P_ij) × (w_i · w_j + b_i + b_j - log(P_ij))^2

For every pair of words (i, j) that might co-occur, we try to minimize the difference between the product of their word embeddings and the log of the co-occurrence count of (i, j). The term f(P_ij) makes it a weighted summation and allows us to give lower weights to very frequent word co-occurrences, capping the importance of such pairs.

When to use GloVe?

  • GloVe has been found to outperform other models on word analogy, word similarity, and Named Entity Recognition tasks, so if the nature of the problem you’re trying to solve is similar to any of these, GloVe would be a smart choice.
  • Since it incorporates global statistics, it can capture the semantics of rare words and performs well even on a small corpus.

Now let’s take a look at how we can leverage the power of GloVe word embeddings.

First, we need to download the embedding file, then we’ll create a lookup embedding dictionary using the following code.

import numpy as np

embeddings_dict={}
with open('./glove.6B.50d.txt','rb') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

On querying this embedding dictionary for a word’s vector representation, this is what comes out.
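
For example (note that the keys are bytes because we read the file in binary mode):

print(embeddings_dict[b'health'])
# a 50-dimensional float32 array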

You might notice that this is a 50-dimensional vector. We downloaded the file glove.6B.50d.txt, which means this model has been trained on 6 Billion words to generate 50-dimensional word embeddings.

We can also define a function to get similar words out of this model, first making the required imports.

from scipy import spatial

Defining the function:

def find_closest_embeddings(embedding):
    return sorted(embeddings_dict.keys(),
                  key=lambda word: spatial.distance.euclidean(embeddings_dict[word], embedding))

Let’s see what happens when we input the word ‘health’ in this function.
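
A quick sketch of the call (keys are bytes, and index 0 is the query word itself at distance zero, so we skip it):

print(find_closest_embeddings(embeddings_dict[b'health'])[1:6])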

We fetched the top 5 words which the model thinks are the most similar to 'health', and the results aren't bad; we can see the context has been captured quite well.

Another thing which we can use GloVe for is to transform our vocabulary into vectors. For that, we’ll use the Keras library.

You can install keras via:

pip install keras

We’ll be using the same set of documents that we’ve been using so far, however, we’ll need to convert them into a list of tokens to make them suitable for vectorization.

sents = [sent.split() for sent in sents]

First, we’ll have to do some preprocessing with our dataset before we can convert it into embeddings.

Making the required imports:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

The following code assigns indices to words which will later be used to map embeddings to indexed words:

MAX_NUM_WORDS = 100
MAX_SEQUENCE_LENGTH = 20
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(sents)
sequences = tokenizer.texts_to_sequences(sents)

word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

And now our data looks like this:

We can finally convert our dataset into GloVe embeddings by performing a simple lookup operation using our embeddings dictionary which we just created above. If the word is found in that dictionary, we’ll just fetch the word embeddings associated with it. Otherwise, it will remain a vector of zeroes.

Making the required imports for this operation.

from keras.layers import Embedding
from keras.initializers import Constant

EMBEDDING_DIM = embeddings_dict.get(b'a').shape[0]
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_dict.get(word.encode("utf-8"))
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

This is what comes out of the other end:

It’s a simple NumPy matrix where entry at index i is the pre-trained vector for the word of index i in our vectorizer’s vocabulary.

You can see that our embedding matrix has the shape 20×50: we had 19 unique words in our vocabulary (plus one reserved row at index 0), and the GloVe pre-trained model file which we downloaded had 50-dimensional vectors.

You can play around with dimension, simply by changing the file or training your own model from scratch.

This embedding matrix can be used in any way you want. It can be fed into an embedding layer of a neural network, or just used for word similarity tasks.
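
For instance, reusing num_words, EMBEDDING_DIM, embedding_matrix, and MAX_SEQUENCE_LENGTH from the code above, a GloVe-initialized embedding layer looks like this:

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)   # keep the pre-trained GloVe weights frozen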

And that’s GloVe, let’s move on to the next vectorization technique.

5. FastText

FastText was introduced by Facebook back in 2016. The idea behind FastText is very similar to Word2Vec. However, there was still one thing that methods like Word2Vec and GloVe lacked.

If you’ve been paying attention, you must have noticed one thing that Word2Vec and GloVe have in common — how we download a pre-trained model and perform a lookup operation to fetch the required word embeddings. Even though both of these models have been trained on billions of words, that still means our vocabulary is limited.

How does FastText improve over others?

FastText improved over other methods because of its capability of generalization to unknown words, which had been missing all along in the other methods.

How does it do that?

  • Instead of using words to build word embeddings, FastText goes one level deeper, i.e. at the character level. The building blocks are letters instead of words.
  • Word embeddings obtained via FastText aren’t obtained directly. They’re a combination of lower-level embeddings.
  • Using characters instead of words has another advantage. Less data is needed for training, as a word becomes its own context in a way, resulting in much more information that can be extracted from a piece of text.

Now let’s take a look at how FastText utilizes sub-word information.

  • Let's say we have the word 'reading'. Character n-grams of length 3 to 6 would be generated for this word; for n = 3, for example: <re, rea, ead, adi, din, ing, ng>.
  • Angular brackets denote the beginning and the end.
  • Since there can be a huge number of n-grams, hashing is used, and instead of learning an embedding for each unique n-gram, we learn a total of B embeddings, where B denotes the bucket size. The original paper used a bucket size of 2 million.
  • Via this hashing function, each character n-gram (say, ‘eadi’) is mapped to an integer between 1 to B, and that index has the corresponding embedding.
  • Finally, the complete word embedding is obtained by averaging these constituent n-gram embeddings.
  • Although this hashing approach results in collisions, it helps control the vocabulary size to a great extent.

The network used in FastText is similar to what we’ve seen in Word2Vec, just like there we can train the FastText in two modes – CBOW and skip-gram, thus we won’t be repeating that part here again. If you want to read more about Fasttext in detail, you can refer to the original papers here – paper-1 and paper-2.

Let’s move ahead and see what all things we can do with FastText.

You can install fasttext with pip.

pip install fasttext

You can either download a pre-trained fasttext model from here or you can train your own fasttext model and use it as a text classifier.

Since we have already seen enough pre-trained models, and it's no different in this case, in this section we'll focus on how to create your own fasttext classifier.

Let’s say we have the following dataset, where there’s conversational text regarding a few drugs and we have to classify those texts into 3 types, i.e. with the kind of drugs with which they’re associated.

Now to train a fasttext classifier model on any dataset, we need to prepare the input data in a certain format which is:

__label__<label value><space><associated datapoint>

We’ll be doing this for our dataset too.

all_texts = train['text'].tolist()
all_labels = train['drug type'].tolist()
prep_datapoints=[]
for i in range(len(all_texts)):
    sample = '__label__'+ str(all_labels[i]) + ' '+ all_texts[i]
    prep_datapoints.append(sample)

I omitted a lot of preprocessing in this step; in the real world, it's best to do rigorous preprocessing to make the data fit for modeling.

Let’s write these prepared data points to a .txt file.

with open('train_fasttext.txt', 'w') as f:
    for datapoint in prep_datapoints:
        f.write(datapoint)
        f.write('\n')

Now we have everything we need to train a fasttext model.

import fasttext

model = fasttext.train_supervised('train_fasttext.txt')

Since our problem is a supervised classification problem, we trained a supervised model.

Similarly, we can also obtain predictions from our trained model.
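
A minimal sketch (the input text is made up; the exact label and score depend on your data):

label, probability = model.predict("the patient reported mild dizziness")
print(label, probability)   # e.g. ('__label__2',) [0.97]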

The model gives out the predicted label as well as the corresponding confidence score.

Again, the performance of this model depends on a lot of factors just like any other model, but if you want to have a quick peek at what the baseline accuracy should be, fasttext could be a very good choice.

So this was all about fasttext and how you can use it.

Wrapping up!

In this article we covered all the main branches of word embeddings, starting from naive count-based methods to sub-word level contextual embeddings. With the ever-rising utility of Natural Language Processing, it’s imperative that one is fully aware of its building blocks.

Given how much we read about the stuff which happens behind the curtains and the use cases of these methods, I hope now when you stumble across an NLP problem, you’ll be able to make an informed decision about which embedding technique to go with.

Future directions

I hope it goes without saying that whatever we covered in this article wasn’t exhaustive, and there are many more techniques out there to explore. These were just the main pillars. 

Logically, the next step should be towards reading more about document (sentence) level embeddings, since we’ve covered the basics here. I would encourage you to read about things like Google’s BERT, Universal Sentence Encoder, and related topics. 

If you decide to try BERT, start with this. It offers an amazing way to leverage BERT’s power without letting your machine do all the heavy lifting. Go through the README to get it set up.

That’s all for now. Thanks for reading!

Tokenization in NLP: Types, Challenges, Examples, Tools https://neptune.ai/blog/tokenization-in-nlp Thu, 21 Jul 2022 15:25:16 +0000 https://neptune.test/tokenization-in-nlp/ The first thing you need to do in any NLP project is text preprocessing. Preprocessing input text simply means putting the data into a predictable and analyzable form. It’s a crucial step for building an amazing NLP application.

There are different ways to preprocess text: 

  • stop word removal, 
  • tokenization, 
  • stemming. 

Among these, the most important step is tokenization. It’s the process of breaking a stream of textual data into words, terms, sentences, symbols, or some other meaningful elements called tokens. A lot of open-source tools are available to perform the tokenization process. 

In this article, we’ll dig further into the importance of tokenization and the different types of it, explore some tools that implement tokenization, and discuss the challenges.

Read also

Best Tools for NLP Projects
The Best NLP/NLU Papers from the ICLR 2020 Conference

Why do we need tokenization?

Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements. The token occurrences in a document can be used directly as a vector representing that document. 

This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning. They can also be used directly by a computer to trigger useful actions and responses. Or they might be used in a machine learning pipeline as features that trigger more complex decisions or behavior.

Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we call it word tokenization.

Example of sentence tokenization

Example of word tokenization

Different tools for tokenization

Although tokenization in Python may be simple, we know that it's the foundation for developing good models and understanding the text corpus. This section will list a few tools available for tokenizing text content, like NLTK, TextBlob, spaCy, Gensim, and Keras.

White Space Tokenization

The simplest way to tokenize text is to use whitespace within a string as the “delimiter” of words. This can be accomplished with Python’s split function, which is available on all string object instances as well as on the string built-in class itself. You can change the separator any way you need.
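
For example (a made-up sentence ending in "1995.", matching the behavior discussed below):

text = "The company was founded back in 1995."
print(text.split())
# ['The', 'company', 'was', 'founded', 'back', 'in', '1995.']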

As you can notice, this built-in Python method already does a good job tokenizing a simple sentence. Its "mistake" was on the last word, where it included the sentence-ending punctuation with the token "1995.". We need the tokens to be separated from neighboring punctuation and other significant tokens in a sentence.

In the example below, we’ll perform sentence tokenization using the comma as a separator.
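
A made-up example:

text = "NLP is fun, it is also useful, and it keeps evolving"
print(text.split(", "))
# ['NLP is fun', 'it is also useful', 'and it keeps evolving']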

NLTK Word Tokenize

NLTK (Natural Language Toolkit) is an open-source Python library for Natural Language Processing. It has easy-to-use interfaces for over 50 corpora and lexical resources such as WordNet, along with a set of text processing libraries for classification, tokenization, stemming, and tagging.

You can easily tokenize the sentences and words of the text with the tokenize module of NLTK.

First, we’re going to import the relevant functions from the NLTK library:

  • Word and Sentence tokenizer

N.B: The sent_tokenize uses the pre-trained model from tokenizers/punkt/english.pickle.

  • Punctuation-based tokenizer

This tokenizer splits the sentences into words based on whitespaces and punctuations.

Notice the difference: word_tokenize treats "Amal.M" as a single word, while wordpunct_tokenize splits it on the period.

  • Treebank Word tokenizer

This tokenizer incorporates a variety of common rules for English word tokenization. It separates phrase-terminating punctuation like (?!.;,) from adjacent tokens and retains decimal numbers as a single token. Besides, it contains rules for English contractions.

For example “don’t” is tokenized as [“do”, “n’t”]. You can find all the rules for the Treebank Tokenizer at this link.

  • Tweet tokenizer

When we want to tokenize text data like tweets, the tokenizers mentioned above can't produce practical tokens. To address this issue, NLTK has a rule-based tokenizer specifically for tweets. It can split off emojis as separate tokens if we need them for tasks like sentiment analysis.

  • MWET tokenizer

NLTK’s multi-word expression tokenizer (MWETokenizer) provides a function add_mwe() that allows the user to enter multiple word expressions before using the tokenizer on the text. More simply, it can merge multi-word expressions into single tokens.
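
Here's a minimal sketch exercising the NLTK tokenizers described above (the sample sentences are our own):

import nltk
nltk.download('punkt')   # sentence/word tokenizer models
from nltk.tokenize import (word_tokenize, sent_tokenize, wordpunct_tokenize,
                           TreebankWordTokenizer, TweetTokenizer, MWETokenizer)

text = "Amal.M is here. Don't hesitate to ask questions!"
print(sent_tokenize(text))       # two sentences
print(word_tokenize(text))       # keeps 'Amal.M' as one token
print(wordpunct_tokenize(text))  # splits 'Amal.M' on the period

print(TreebankWordTokenizer().tokenize("don't"))  # ['do', "n't"]

print(TweetTokenizer().tokenize("This is sooo cool #NLP :-)"))  # keeps '#NLP' and ':-)'

mwe = MWETokenizer()
mwe.add_mwe(('natural', 'language'))
print(mwe.tokenize("natural language processing is fun".split()))
# ['natural_language', 'processing', 'is', 'fun']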

TextBlob Word Tokenize

TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Let’s start by installing TextBlob and the NLTK corpora:

$pip install -U textblob
$python3 -m textblob.download_corpora

In the code below, we perform word tokenization using TextBlob library:
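
A minimal sketch with a made-up sentence:

from textblob import TextBlob

blob = TextBlob("Don't hesitate to ask questions!")
print(blob.words)   # punctuation is removed; "Don't" is split into 'Do' and "n't"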

We could notice that the TextBlob tokenizer removes the punctuations. In addition, it has rules for English contractions.

spaCy Tokenizer

SpaCy is an open-source Python library that parses and understands large volumes of text. With available models catering to specific languages (English, French, German, etc.), it handles NLP tasks with the most efficient implementation of common algorithms.

spaCy tokenizer provides the flexibility to specify special tokens that don’t need to be segmented, or need to be segmented using special rules for each language, for example punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token.

Before you can use spaCy you need to install it, download data and models for the English language.

$ pip install spacy
$ python3 -m spacy download en_core_web_sm
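
A minimal sketch (the sentence is our own; note that "U.K." stays one token while "$" and "!" are split off):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The U.K. startup raised $7 million!")
print([token.text for token in doc])
# ['The', 'U.K.', 'startup', 'raised', '$', '7', 'million', '!']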

Gensim word tokenizer

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community. It offers utility functions for tokenization.
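
For example, gensim.utils.tokenize lazily yields alphabetic tokens and drops punctuation:

from gensim.utils import tokenize

print(list(tokenize("Gensim offers utility functions for tokenization.")))
# ['Gensim', 'offers', 'utility', 'functions', 'for', 'tokenization']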

Tokenization with Keras

Keras open-source library is one of the most reliable deep learning frameworks. To perform tokenization, we use the text_to_word_sequence method from the keras.preprocessing.text module. A great thing about Keras is that it converts the text to lowercase before tokenizing it, which can be quite a time-saver.
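
A minimal sketch (note the automatic lowercasing and punctuation removal):

from keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence("The Quick Brown Fox jumped!"))
# ['the', 'quick', 'brown', 'fox', 'jumped']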

N.B: You could find all the code examples here

Challenges and limitations

Let’s discuss the challenges and limitations of the tokenization task. 

In general, this task works for text corpora written in languages like English or French, where words are separated by white spaces or punctuation marks that define sentence boundaries. Unfortunately, this method isn't applicable to languages like Chinese, Japanese, Korean, Thai, Hindi, Urdu, Tamil, and others. This problem creates the need for a common tokenization tool that can handle all languages.

Another limitation is in the tokenization of Arabic texts since Arabic has a complicated morphology as a language. For example, a single Arabic word may contain up to six different tokens like the word “عقد” (eaqad).

Tokenization challenges
One Arabic word gives the meanings of 6 different words in the English language. | Source 

There’s A LOT of research going on in Natural Language Processing. You need to pick one challenge or a problem and start searching for a solution.

Conclusion

Through this article, we have learned about different tokenizers from various libraries and tools.

We saw the importance of this task in any NLP task or project, and we also implemented it using Python. You probably feel that it’s a simple topic, but once you get into the finer details of each tokenizer model, you will notice that it’s actually quite complex.

Start practicing with the examples above and try them on any text dataset. The more you practice, the better you’ll understand how tokenization works.

If you stayed with me until the end – thank you for reading!
