<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Gourav Bais, Author at neptune.ai</title>
	<atom:link href="https://neptune.ai/blog/author/gourav-bais/feed" rel="self" type="application/rss+xml" />
	<link></link>
	<description>The experiment tracker for foundation model training.</description>
	<lastBuildDate>Tue, 06 May 2025 12:09:16 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	

<image>
	<url>https://i0.wp.com/neptune.ai/wp-content/uploads/2022/11/cropped-Signet-1.png?fit=32%2C32&#038;ssl=1</url>
	<title>Gourav Bais, Author at neptune.ai</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">211928962</site>	<item>
		<title>LLM Evaluation For Text Summarization</title>
		<link>https://neptune.ai/blog/llm-evaluation-text-summarization</link>
		
		<dc:creator><![CDATA[Gourav Bais]]></dc:creator>
		<pubDate>Thu, 22 Aug 2024 14:13:00 +0000</pubDate>
				<category><![CDATA[LLMOps]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=40088</guid>

					<description><![CDATA[Text summarization is a prime use case of LLMs (Large Language Models). It aims to condense large amounts of complex information into a shorter, more understandable version, enabling users to review more materials in less time and make more informed decisions. Despite being widely applied in sectors such as journalism, research, and business intelligence, evaluating&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_e0f388a218fc4eb3add96a9a6effeaed"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Evaluating text summarization is difficult because there is no one correct solution, and summarization quality often depends on the summary’s context and purpose.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Metrics like ROUGE, METEOR, and BLEU focus on N-gram overlap but fail to capture the semantic meaning and context.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>LLM-based evaluation approaches like BERTScore and G-eval aim to address these shortcomings by evaluating semantic similarity and coherence, providing a more accurate assessment.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Despite these advancements and the widespread use of LLM-generated summaries, ensuring robust and comprehensive evaluation remains an open problem and active area of research.</p>
                                    </div>

            </div>
            </div>


</section>



<p>Text summarization is a prime use case of LLMs (Large Language Models). It aims to condense large amounts of complex information into a shorter, more understandable version, enabling users to review more materials in less time and make more informed decisions.</p>



<p>Despite being widely applied in sectors such as journalism, research, and business intelligence, evaluating the reliability of LLMs for summarization is still a challenge. Over the years, various metrics and LLM-based approaches have been introduced, but there is no gold standard yet.</p>



<p>In this article, we’ll discuss why evaluating text summarization is not as straightforward as it might seem at first glance, take a deep dive into the strengths and weaknesses of different metrics, and examine open questions and current developments.</p>



<h2 class="wp-block-heading" id="h-how-does-llm-text-summarization-work">How does LLM text summarization work?</h2>



<p>Summarization is a classic machine-learning (ML) task within natural language processing (NLP). There are two main approaches:</p>



<ul class="wp-block-list">
<li><strong>Extractive summarization</strong> creates a summary by selecting and extracting key sentences, phrases, and ideas directly from the original text. Accordingly, the summary is a subset of the original text, and no text is generated by the ML model. Extractive summarization relies on statistical and linguistic features—either explicitly or implicitly—such as word frequency, sentence position, and significance scores to identify important sentences or phrases.</li>



<li><strong>Abstractive summarization</strong> produces new text that conveys the most critical information from the original. It aims to identify the key information and generate a concise version. Abstractive summarization is typically performed with <a href="https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html" target="_blank" rel="noreferrer noopener nofollow">sequence-to-sequence models</a>, a category to which LLMs with <a href="https://huggingface.co/learn/nlp-course/en/chapter1/7" target="_blank" rel="noreferrer noopener nofollow">encoder-decoder architecture</a> belong.</li>
</ul>
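<p>To make the extractive approach concrete, here is a minimal sketch of a frequency-based extractive summarizer in Python. It scores each sentence by the summed frequency of its words and keeps the top-scoring sentences; this is a toy illustration, not any particular production algorithm.</p>

```python
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: rank sentences by summed word frequency."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.strip(".,") for w in text.lower().split())

    def score(sentence):
        # A sentence full of frequent words is assumed to carry key content.
        return sum(freq[w.strip(".,")] for w in sentence.lower().split())

    ranked = sorted(sentences, key=score, reverse=True)
    chosen = set(ranked[:num_sentences])
    # Emit the selected sentences in their original order.
    return ". ".join(s for s in sentences if s in chosen) + "."

text = ("The cat sat on the mat. The cat looked out the window at the birds. "
        "It was a sunny day.")
print(extractive_summary(text, num_sentences=1))
```

<p>Note that summing raw frequencies biases the scorer toward longer sentences; real systems normalize by sentence length and filter stop words.</p>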


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" fetchpriority="high" decoding="async" width="1200" height="630" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=1200%2C630&#038;ssl=1" alt="Schematic visualization of extractive and abstractive summarization. Extractive summarization (left) creates a summary by selecting the most relevant parts of the original text. In contrast, abstractive summarization (right) generates a new text." class="wp-image-40156" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=768%2C403&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=220%2C116&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=300%2C158&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=480%2C252&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/Evaluating-LLM-text-summarization.png?resize=1020%2C536&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Schematic visualization of extractive and abstractive summarization. Extractive summarization (left) creates a summary by selecting the most relevant parts of the original text. 
In contrast, abstractive summarization (right) generates a new text. </figcaption></figure>
</div>


<h2 class="wp-block-heading" id="h-dimensions-of-text-summarization-quality">Dimensions of text summarization quality</h2>



<p>There is no single objective measure for the quality of a summary, whether it’s created by a human or generated by an LLM. On the one hand, there are many different ways to convey the same information. On the other hand, which pieces of information in a text are key is context-dependent and often debatable.</p>



<p>However, there are widely agreed-upon quality dimensions along which we can assess the performance of text summarization models:</p>



<ul class="wp-block-list">
<li><strong>Consistency </strong>characterizes the summary’s factual and logical correctness. It should stay true to the original text, not introduce additional information, and use the same terminology.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Relevance</strong> captures whether the summary is limited to the most pertinent information in the original text. A relevant summary focuses on the essential facts and key messages, omitting unnecessary details or trivial information.</li>



<li><strong>Fluency</strong> describes the readability of the summary. A fluent summary is well-written and uses proper syntax, vocabulary, and grammar.</li>



<li><strong>Coherence</strong> is the logical flow and connectivity of ideas. A coherent summary presents the information in a structured, logical, and easily understandable manner.</li>
</ul>



<h2 class="wp-block-heading" id="h-metrics-for-text-summarization">Metrics for text summarization</h2>



<p>Metrics focus on the summary’s quality rather than its impact on any external task. Their computation requires multiple reference summaries crafted by human experts as ground truth. The quality and diversity of these reference summaries significantly influence the metric&#8217;s effectiveness. Poorly constructed references can lead to misleading scores.</p>



<h3 class="wp-block-heading" id="h-rouge-recall-oriented-understudy-for-gisting-evaluation">ROUGE (Recall-Oriented Understudy for Gisting Evaluation)</h3>



<p><a href="https://huggingface.co/spaces/evaluate-metric/rouge" target="_blank" rel="noreferrer noopener nofollow">ROUGE</a> is one of the most common metrics used to evaluate the quality of summaries compared to human-written reference summaries. It determines the overlap of groups of words or tokens (N-grams) between the reference text and the generated summary.<br></p>



<p>ROUGE has multiple variants, such as ROUGE-N (for N-grams), ROUGE-L (for the longest common subsequence), and ROUGE-S (for skip-bigram co-occurrence statistics).</p>



<p>If the summarization is limited to key term extraction, ROUGE-1 is the preferred choice. For simple summarization tasks, the ROUGE-2 metric is a better fit. For more structured summarization, ROUGE-L and ROUGE-S might be the best choice.</p>



<p>While ROUGE is popular for extractive summarization, it can also be used for abstractive summarization. A high value of the ROUGE score indicates that the generated summary preserves the most essential information from the original text.</p>



<h4 class="wp-block-heading">How does the ROUGE metric work?</h4>



<p>To understand how the ROUGE metrics work, let’s consider the following example:</p>



<ul class="wp-block-list">
<li><strong>Human-crafted reference summary:</strong> The cat sat on the mat and looked out the window at the birds.</li>



<li><strong>LLM-generated summary:</strong> The cat looked at the birds from the mat.</li>
</ul>



<h4 class="wp-block-heading">ROUGE-1</h4>



<p><strong>1. Tokenize the summaries<br><br></strong>First, we tokenize the reference and the generated summary into unigrams:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1200%2C628&#038;ssl=1" alt="tokenizing the reference and the generated summary into unigrams" class="wp-image-40248" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p><strong>2. Calculate the overlap<br><br></strong>Next, we count the overlapping unigrams between the reference and generated summaries:<br><br><strong>Overlapping unigrams</strong>:</p>



<section id="note-block_8ac2d98c1c98d8fdf6c173140bd1a0bc"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>[&#8216;The&#8217;, &#8216;cat&#8217;, &#8216;looked&#8217;, &#8216;at&#8217;, &#8216;the&#8217;, &#8216;birds&#8217;, &#8216;the&#8217;, &#8216;mat&#8217;]</p>
                                    </div>

            </div>
            </div>


</section>



<p>There are eight overlapping unigrams.</p>



<p><strong>3. Calculate precision, recall, and F1 score</strong></p>



<p><strong>a)</strong> <strong>Precision</strong> = Number of overlapping unigrams​ / Total number of unigrams in generated summary<br><span class="c-code-snippet">Precision = 8/9 ​= 0.89</span></p>



<p><strong>b) Recall</strong> = Number of overlapping unigrams​ / Total number of unigrams in reference summary<br><span class="c-code-snippet">Recall = 8/14 = 0.57</span></p>



<p><strong>c) F1 score</strong> = 2 × (Precision×Recall​) / (Precision+Recall)<br><span class="c-code-snippet">F1 = 2 × (0.89×0.57) / (0.89+0.57) ​= 0.69<br></span></p>
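<p>The ROUGE-1 numbers above can be reproduced in a few lines of Python. The sketch below uses clipped N-gram counts (the minimum of each N-gram&#8217;s count across the two texts); libraries such as the rouge-score package implement the same idea with stemming and further options.</p>

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall, and F1 based on clipped N-gram counts."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum((ref & cand).values())  # e.g., "the" matches min(3, 4) = 3 times
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "The cat sat on the mat and looked out the window at the birds"
candidate = "The cat looked at the birds from the mat"
p, r, f1 = rouge_n(reference, candidate, n=1)
```

<p>Here <span class="c-code-snippet">p = 8/9</span> and <span class="c-code-snippet">r = 8/14</span>; the exact F1 is 2 × 8 / (9 + 14) = 16/23 ≈ 0.696, which agrees with the value above up to rounding of the intermediate results.</p>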



<h4 class="wp-block-heading">ROUGE-2</h4>



<p><strong>1. Tokenize the summaries<br><br></strong>First, we tokenize the reference and the generated summary into bigrams:<br></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=1200%2C628&#038;ssl=1" alt="tokenizing the reference and the generated summary into bigrams" class="wp-image-40255" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_2.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p><strong>2. Calculate the overlap<br><br></strong>Next, we count the overlapping bigrams between the reference and generated summaries:</p>



<p><strong>Overlapping bigrams:</strong></p>



<section id="note-block_3e2ecf8151e94afe80f92e29d264729d"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

<p>[&#8216;the cat&#8217;, &#8216;at the&#8217;, &#8216;the birds&#8217;, &#8216;the mat&#8217;]</p>
                                    </div>

            </div>
            </div>


</section>



<p>There are four overlapping bigrams (note that &#8220;looked at&#8221; does not match, as the reference reads &#8220;looked out&#8221;).</p>



<p><strong>3.</strong> <strong>Calculate precision, recall, and F1 score</strong></p>



<p><strong>a)</strong> <strong>Precision</strong> = Number of overlapping bigrams / Total number of bigrams in generated summary<br><span class="c-code-snippet">Precision = 4/8 = 0.5<br></span></p>



<p><strong>b) Recall</strong> = Number of overlapping bigrams / Total number of bigrams in reference summary<br><span class="c-code-snippet">Recall = 4/13 = 0.308<br></span></p>



<p><strong>c)</strong> <strong>F1 score</strong> = 2 × (Precision×Recall) / (Precision+Recall)<br><span class="c-code-snippet">F1 = 2 × (0.5×0.308) / (0.5+0.308) = 0.381</span></p>
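<p>The bigram overlap can be cross-checked programmatically. The sketch below builds clipped bigram counts with the standard library and recomputes the scores:</p>

```python
from collections import Counter

def bigram_counts(text):
    """Counter of adjacent, lowercased word pairs (bigrams)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

reference = "The cat sat on the mat and looked out the window at the birds"
candidate = "The cat looked at the birds from the mat"

ref, cand = bigram_counts(reference), bigram_counts(candidate)
overlap = sum((ref & cand).values())  # clipped bigram overlap
precision = overlap / sum(cand.values())
recall = overlap / sum(ref.values())
f1 = 2 * precision * recall / (precision + recall)
```

<p>The check also shows why &#8220;looked at&#8221; cannot contribute to the overlap: the reference contains the bigram &#8220;looked out&#8221; instead.</p>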



<h4 class="wp-block-heading">ROUGE-L</h4>



<p><strong>1. Tokenize the summaries<br><br></strong>First, we tokenize the reference and the generated summary into unigrams:<br></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1200%2C628&#038;ssl=1" alt="tokenizing the reference and the generated summary into unigrams" class="wp-image-40248" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p><strong>2.</strong> <strong>Find the largest overlap<br><br></strong>The longest common subsequence is [&#8220;the&#8221;, &#8220;cat&#8221;, &#8220;looked&#8221;, &#8220;at&#8221;, &#8220;the&#8221;, &#8220;birds&#8221;], with a length of six.</p>



<p><strong>3.</strong> <strong>Calculate precision, recall, and F1 score</strong></p>



<p><strong>a) Precision </strong>= Length of longest common subsequence / Total number of unigrams in generated summary<br><span class="c-code-snippet">Precision = 6/9 = 0.667</span></p>



<p><strong>b)</strong> <strong>Recall</strong> = Length of longest common subsequence / Total number of unigrams in reference summary<br><span class="c-code-snippet">Recall = 6/14 = 0.429</span></p>



<p><strong>c) F1 score</strong> = 2 × (Precision×Recall) / (Precision+Recall)<br><span class="c-code-snippet">F1 = 2 × (0.667 × 0.429)/(0.667 + 0.429) = 0.522<br></span></p>
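<p>ROUGE-L can be computed with the textbook dynamic-programming recurrence for the longest common subsequence. A minimal sketch:</p>

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the match if tokens agree, otherwise carry the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "The cat sat on the mat and looked out the window at the birds".lower().split()
candidate = "The cat looked at the birds from the mat".lower().split()

lcs = lcs_length(reference, candidate)
precision = lcs / len(candidate)
recall = lcs / len(reference)
f1 = 2 * precision * recall / (precision + recall)
```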



<h4 class="wp-block-heading">ROUGE-S</h4>



<p>To calculate the ROUGE-S (ROUGE-Skip) score, we need to count skip-bigram co-occurrences between the reference and generated summaries. A skip-bigram is any pair of words from a text that appear in the same order as in the text, allowing for arbitrary gaps between them.</p>



<p><strong>1.</strong> <strong>Tokenize the summaries</strong></p>



<p>First, we tokenize the reference and the generated summary into unigrams:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1200%2C628&#038;ssl=1" alt="tokenizing the reference and the generated summary into unigrams" class="wp-image-40248" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p><strong>2. Generate the skip-bigrams for the reference and generated summaries.</strong></p>



<p><strong>Skip-bigrams for reference summary:</strong></p>



<section id="note-block_5db1c7b797d9dfb77b4a4908111d580f"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><!-- wp:paragraph --></p>
<p>(&#8220;The&#8221;, &#8220;cat&#8221;), (&#8220;The&#8221;, &#8220;sat&#8221;), (&#8220;The&#8221;, &#8220;on&#8221;), (&#8220;The&#8221;, &#8220;the&#8221;), &#8230;</p>
<p><!-- /wp:paragraph --> <!-- wp:paragraph --></p>
<p>(&#8220;cat&#8221;, &#8220;sat&#8221;), (&#8220;cat&#8221;, &#8220;on&#8221;), (&#8220;cat&#8221;, &#8220;the&#8221;), …</p>
<p><!-- /wp:paragraph --> <!-- wp:paragraph --></p>
<p>(“sat”, “on”), (“sat”, “the”), (“sat”, “mat”), …</p>
<p><!-- /wp:paragraph --></p>
                                    </div>

            </div>
            </div>


</section>



<p>Continue for all combinations, allowing skips.</p>



<p><strong>Skip-bigrams for generated summary:</strong></p>



<section id="note-block_e414438e251664d56ac90f5ec9d4747c"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p><!-- wp:paragraph --></p>
<p>(&#8220;The&#8221;, &#8220;cat&#8221;), (&#8220;The&#8221;, &#8220;looked&#8221;), (&#8220;The&#8221;, &#8220;at&#8221;), (&#8220;The&#8221;, &#8220;the&#8221;), &#8230;</p>
<p><!-- /wp:paragraph --> <!-- wp:paragraph --></p>
<p>(&#8220;cat&#8221;, &#8220;looked&#8221;), (&#8220;cat&#8221;, &#8220;at&#8221;), (&#8220;cat&#8221;, &#8220;the&#8221;), …</p>
<p><!-- /wp:paragraph --> <!-- wp:paragraph --></p>
<p>(“looked”, “at”), (“looked”, “the”), (“looked”, “birds”), …</p>
<p><!-- /wp:paragraph --></p>
                                    </div>

            </div>
            </div>


</section>



<p>Continue for all combinations, allowing skips.</p>



<p><strong>3. Count the total number of skip-bigrams in the reference and the generated summary</strong></p>



<p>There is no need to count the number of skip-bigrams manually. For a text with n words:</p>



<p><span class="c-code-snippet">Number of skip-bigrams = (n x (n &#8211; 1)) / 2</span></p>



<p><strong>Total skip-bigrams in reference summary:</strong> <span class="c-code-snippet">(14 × (14 − 1)) / 2 ​= 91</span></p>



<p><strong>Total skip-bigrams in generated summary:</strong> <span class="c-code-snippet">(9</span><span class="c-code-snippet"> × (9 − 1)​) / 2 = 36</span></p>
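<p>The closed-form count agrees with enumerating the pairs directly, for example with itertools:</p>

```python
from itertools import combinations

def skip_bigrams(tokens):
    """All ordered word pairs (first word before the second in the text), any gap."""
    return list(combinations(tokens, 2))

reference = "The cat sat on the mat and looked out the window at the birds".split()
candidate = "The cat looked at the birds from the mat".split()

ref_total = len(skip_bigrams(reference))   # 14 * 13 / 2 = 91
cand_total = len(skip_bigrams(candidate))  # 9 * 8 / 2 = 36
```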



<p><strong>4. Calculate ROUGE-S score&nbsp;</strong></p>



<p>Finally, count the number of skip-bigrams in the reference summary that also appear in the generated summary. The ROUGE-S score is calculated as follows:</p>



<p><span class="c-code-snippet">ROUGE-S = (2 × count of matching skip-bigrams​) /&nbsp; (total skip-bigrams in reference summary + total skip-bigrams in generated summary)</span></p>



<p>The matching skip-bigrams between the reference and generated summaries are as follows:</p>



<section id="note-block_05e7498efcaed4dcbf33a865018ba1dc"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>(&#8220;The&#8221;, &#8220;cat&#8221;), (&#8220;The&#8221;, &#8220;looked&#8221;), (&#8220;The&#8221;, &#8220;at&#8221;), (&#8220;The&#8221;, &#8220;the&#8221;), (&#8220;The&#8221;, &#8220;birds&#8221;), (&#8220;The&#8221;, &#8220;mat&#8221;), (&#8220;cat&#8221;, &#8220;looked&#8221;), (&#8220;cat&#8221;, &#8220;at&#8221;), (&#8220;cat&#8221;, &#8220;the&#8221;), (&#8220;cat&#8221;, &#8220;birds&#8221;), (&#8220;cat&#8221;, &#8220;mat&#8221;), (&#8220;looked&#8221;, &#8220;at&#8221;), (&#8220;looked&#8221;, &#8220;the&#8221;), (&#8220;looked&#8221;, &#8220;birds&#8221;), (&#8220;at&#8221;, &#8220;the&#8221;), (&#8220;at&#8221;, &#8220;birds&#8221;)</p>
                                    </div>

            </div>
            </div>


</section>



<p><strong>Matching skip-bigrams</strong>: 16</p>



<p><span class="c-code-snippet"><strong>ROUGE-S</strong> = </span><span class="c-code-snippet">(2 × 16) / (91 + 36) = 32 / 127 ≈ 0.252</span></p>



<h4 class="wp-block-heading">Interpretation of ROUGE metrics</h4>



<p>ROUGE is a recall-oriented metric that ensures that the generated summary includes as many relevant tokens of the reference summary as possible. As in information retrieval problems, we compute precision, recall, and the F1 score.<br><br>Focusing solely on achieving high ROUGE precision can result in missing important details, as we might generate fewer words to boost precision. Focusing too much on recall favors long summaries that include additional but irrelevant information. Typically, it is best to look at the F1 score, which balances both measures.</p>



<p>In our example, the high value of the ROUGE-1 F1 score indicates fairly good coverage of the key concepts, while the lower value of the ROUGE-2 F1 score indicates a change in verbs and missing connections between key terms.</p>



<h4 class="wp-block-heading">Problems with ROUGE metrics</h4>



<ul class="wp-block-list">
<li><strong>Surface-level matching</strong>: ROUGE matches the exact N-grams from the reference and generated summaries. It fails to capture the semantic meaning and context of the text. ROUGE does not handle synonyms, meaning two semantically identical summaries with different wording have low ROUGE scores. Paraphrased content, which conveys the same meaning with different wording, receives a low ROUGE score despite being a good summary.</li>



<li><strong>Recall-oriented nature:</strong> ROUGE’s primary goal is to measure the completeness of the generated summary in terms of how much of the important content from the reference summary it captures. This can lead to high scores for longer summaries that include many reference terms, even if they contain irrelevant information.</li>



<li><strong>Lack of evaluation for coherence and fluency</strong>: ROUGE does not evaluate the text&#8217;s coherence, fluency, or overall readability. A summary that contains the right N-grams achieves a high ROUGE score, even if it is disjointed or grammatically incorrect.</li>
</ul>



<h3 class="wp-block-heading" id="h-meteor-metric-for-evaluation-of-translation-with-explicit-ordering">METEOR (Metric for Evaluation of Translation with Explicit Ordering)</h3>



<p>Extracting all important keywords from a text does not necessarily mean that the summary produced is of high quality. A logical flow of information should be maintained, even if the information is not presented in the same order as the original document.<br></p>



<p>When using an LLM, the generated summary likely contains different words or synonyms. In this case, metrics like ROUGE that are based on exact keyword matches will yield low scores even if the summary is of high quality.</p>



<p><a href="https://huggingface.co/spaces/evaluate-metric/meteor" target="_blank" rel="noreferrer noopener nofollow">METEOR</a> is a summarization metric similar to ROUGE that matches words by reducing them to their root or base form through <a href="https://www.datacamp.com/tutorial/stemming-lemmatization-python" target="_blank" rel="noreferrer noopener nofollow">stemming and lemmatization</a>. For example, “playing,” “plays,” “played,” and “playful” all become “play.”</p>
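<p>As a toy illustration of suffix stripping (a real system would use an established stemmer such as NLTK’s <code>PorterStemmer</code>, not this sketch):</p>

```python
def naive_stem(word: str) -> str:
    """Toy suffix-stripping stemmer for illustration only.

    Real stemmers (e.g., the Porter algorithm) apply many more rules."""
    for suffix in ("ing", "ful", "ed", "es", "s"):
        # Strip the first matching suffix, keeping at least a 3-letter stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```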



<p>Additionally, METEOR assigns higher scores to summaries that focus on the most important information from the source. Information that is repeated multiple times or irrelevant receives lower scores. It does so by applying a fragmentation penalty based on the number of chunks, where a chunk is a sequence of matched words appearing in the same order as in the reference summary.</p>



<h4 class="wp-block-heading">How does the METEOR metric work?</h4>



<p>Let’s consider an example of a generated summary from an LLM and a human-crafted summary.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Human-crafted reference summary:</strong> The cat sat on the mat and looked out the window at the birds.</li>



<li><strong>LLM-generated summary:</strong> The cat looked at the birds from the mat.</li>
</ul>






<p><strong>1. Tokenize the summaries<br><br></strong>First, we tokenize both summaries:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1200%2C628&#038;ssl=1" alt="tokenizing the reference and the generated summary into unigrams" class="wp-image-40248" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p><strong>2. Identify exact matches<br><br></strong>Next, we identify exact matches between the reference and generated summaries:</p>



<p><strong>Exact matches:</strong></p>



<section id="note-block_30ff5e445b85e5fb0a65ca112412849f"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>[&#8220;the&#8221;, &#8220;cat&#8221;, &#8220;looked&#8221;, &#8220;at&#8221;, &#8220;the&#8221;, &#8220;birds&#8221;, &#8220;the&#8221;, &#8220;mat&#8221;]</p>
                                    </div>

            </div>
            </div>


</section>



<p><strong>3.</strong> <strong>Calculate precision, recall, and F1 score</strong></p>



<p><strong>a) Precision</strong> = Number of matched tokens / Total number of tokens in the generated summary<br><span class="c-code-snippet">Precision = 8/9 = 0.89</span></p>



<p><strong>b)</strong> <strong>Recall</strong> = Number of matched tokens / Total number of tokens in the reference summary<br><span class="c-code-snippet">Recall = 8/14 = 0.5714<br></span></p>



<p><strong>c) Harmonic mean of precision and recall </strong>= (10×Precision×Recall​) / (Recall+9×Precision)<br><span class="c-code-snippet">F-mean = (10×0.8889×0.5714) / (0.5714+9×0.8889) = 5.0794 / 8.5714​ ≈ 0.5926<br></span></p>



<p><strong>4.</strong> <strong>Calculate the fragmentation penalty<br><br></strong>Determine the number of “chunks.” A “chunk” is a sequence of matched tokens in the same order as they appear in the reference summary.<br><br><strong>Chunks in the generated summary:</strong></p>



<section id="note-block_90a642b5f92a55c19638d012ae43f484"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>[&#8220;the&#8221;, &#8220;cat&#8221;], [&#8220;looked&#8221;, &#8220;at&#8221;, &#8220;the&#8221;, &#8220;birds&#8221;], [&#8220;the&#8221;, &#8220;mat&#8221;]</p>
                                    </div>

            </div>
            </div>


</section>



<p>There are three chunks in the generated summary. The fragmentation penalty is calculated as:<br><span class="c-code-snippet">P = 0.5 × (Number of chunks​) / (Number of matched words)</span></p>



<p><span class="c-code-snippet">P = 0.5 × 3/8 = 0.1875</span><br></p>



<p><strong>5.</strong> <strong>Final METEOR score<br><br></strong>The final METEOR score is calculated as follows:<br><br><span class="c-code-snippet">METEOR = F-mean × (1−P) = 0.5926 × (1−0.1875) ≈ 0.5926×0.8125 ≈ 0.4815</span></p>
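<p>The five steps above can be condensed into a short pure-Python sketch. It takes the match and chunk counts as inputs and uses the simplified fragmentation penalty of this walkthrough (the original METEOR paper cubes the chunk-to-match ratio before halving it):</p>

```python
def meteor_simplified(matches: int, gen_len: int, ref_len: int, chunks: int) -> float:
    """METEOR as computed in the walkthrough above (simplified penalty)."""
    precision = matches / gen_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean (recall weighted 9:1 over precision)
    f_mean = (10 * precision * recall) / (recall + 9 * precision)
    # Simplified fragmentation penalty; the original formulation cubes chunks/matches
    penalty = 0.5 * chunks / matches
    return f_mean * (1 - penalty)
```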



<h4 class="wp-block-heading">Interpreting the METEOR score</h4>



<p>The METEOR score ranges from 0 to 1, where a score close to 1 indicates a better match between the generated and reference text. METEOR is recall-oriented and ensures that the generated text captures as much of the information from the reference text as possible.<br><br>The harmonic mean of precision and recall (F-mean) is biased towards recall and is the key indicator for the summary’s completeness. A low fragmentation penalty indicates that the summary is coherent and concise.</p>



<p>For our example, the METEOR score is approximately 0.4815, indicating a moderate level of alignment with the reference summary.</p>



<h4 class="wp-block-heading">Problems with the METEOR metric</h4>



<ul class="wp-block-list">
<li><strong>Limited contextual understanding:</strong> METEOR does not capture the contextual relationship between words and sentences. As it focuses on word-level matching rather than sentence or paragraph coherence, it might misjudge the relevance and importance of information in the summary.</li>



<li><strong>Reliance on surface forms</strong>: Despite improvements over ROUGE, METEOR still relies on surface forms of words and their alignments. This can lead to an overemphasis on specific words and phrases rather than understanding the deeper meaning and intent behind the text.</li>



<li><strong>Sensitivity to paraphrasing and synonym use</strong>: Although METEOR uses stemming for synonyms and paraphrasing, its effectiveness in capturing all possible variations is limited. It does not recognize semantically equivalent phrases that use different syntactic structures or less common synonyms.</li>
</ul>



<h3 class="wp-block-heading" id="h-bleu-bilingual-evaluation-understudy">BLEU (Bilingual Evaluation Understudy)</h3>



<p><a href="https://huggingface.co/spaces/evaluate-metric/bleu" target="_blank" rel="noreferrer noopener nofollow">BLEU</a> is yet another popular metric for evaluating LLM-generated text. Initially designed to evaluate <a href="https://medium.com/@davidfagb/the-role-of-large-language-models-in-machine-translation-5e1f6eeeb44d" target="_blank" rel="noreferrer noopener nofollow">machine translation</a>, it is also used to evaluate summaries.</p>



<p>BLEU measures the correspondence between a machine-generated text and one or more reference texts. It compares the N-grams from the LLM-generated and reference texts and computes a precision score. These scores are then combined into an overall score through a geometric mean.</p>



<p>One advantage of BLEU compared to ROUGE and METEOR is that it can compare the generated text to multiple reference texts for a more robust evaluation. Also, BLEU includes a brevity penalty to prevent the generation of overly short texts that achieve high precision but omit important information.</p>



<h4 class="wp-block-heading">How does the BLEU metric work?</h4>



<p>Let’s use the same example we used above.&nbsp;</p>



<p><strong>1. Tokenize the summaries<br><br></strong>First, we tokenize both summaries:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1200%2C628&#038;ssl=1" alt="tokenizing the reference and the generated summary into unigrams" class="wp-image-40248" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/08/LLM-Evaluation-For-Text-Summarization_1.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>
</div>


<p><strong>2. Calculate matching N-grams<br><br></strong>Next, we find matching unigrams, bigrams, and trigrams and calculate the precision (matching N-grams / total N-grams in generated summary).</p>



<p><strong>a) Unigrams (1-grams):</strong></p>



<p><strong>Matches:</strong> </p>



<section id="note-block_30ff5e445b85e5fb0a65ca112412849f"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>[&#8220;the&#8221;, &#8220;cat&#8221;, &#8220;looked&#8221;, &#8220;at&#8221;, &#8220;the&#8221;, &#8220;birds&#8221;, &#8220;the&#8221;, &#8220;mat&#8221;]</p>
                                    </div>

            </div>
            </div>


</section>



<p><strong>Total unigrams in generated summary:</strong> 9</p>



<p><strong>Precision:</strong> <span class="c-code-snippet">8/9 = 0.8889</span></p>



<p><strong>b) Bigrams (2-grams):</strong></p>



<p><strong>Matches:</strong> </p>



<section id="note-block_c87b3d77af95f1308d4bfc2ac335a565"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>[&#8220;the cat&#8221;, &#8220;at the&#8221;, &#8220;the birds&#8221;, &#8220;the mat&#8221;]</p>
                                    </div>

            </div>
            </div>


</section>



<p><strong>Total bigrams in generated summary:</strong> 8</p>



<p><strong>Precision:</strong> <span class="c-code-snippet">4/8 = 0.5</span></p>



<p><strong>c) Trigrams (3-grams):</strong></p>



<p><strong>Matches:</strong> </p>



<section id="note-block_f8e4093b3a84c45a796fc675666276a5"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

                                            <p>[&#8220;at the birds&#8221;]</p>
                                    </div>

            </div>
            </div>


</section>



<p><strong>Total trigrams in generated summary: </strong>7</p>



<p><strong>Precision:</strong> <span class="c-code-snippet">1/7 = 0.1429</span></p>



<p><strong>d)</strong> <strong>Determine the brevity penalty<br><br></strong>The brevity penalty is based on the length of the reference and the generated summary:<br><br><strong>Length of the reference summary:</strong> 14 tokens<br><strong>Length of the generated summary:</strong> 9 tokens<br><strong>Brevity penalty:</strong><span class="c-code-snippet"> e<sup>(1−14 / 9) </sup>= e<sup>−0.5556</sup> ≈ 0.5738</span></p>



<p><strong>e)</strong> <strong>Calculate the BLEU score</strong></p>



<p><strong>Combined precision</strong>:<strong><br></strong>We combine the N-gram precisions with uniform weights and apply the brevity penalty. Standard BLEU-4 weights four N-gram orders by 1/4 each; here, we apply a weight of 1/4 to each of our three precisions.</p>



<p><span class="c-code-snippet">P = (0.8889<sup>0.25</sup>) × (0.5<sup>0.25</sup>) × (0.1429<sup>0.25</sup>)</span></p>



<p><span class="c-code-snippet">P ≈ 0.971 × 0.841 × 0.615 ≈ 0.502</span></p>



<p><strong>Calculate the final BLEU score by multiplying the brevity penalty and combined precision:</strong><br><br><span class="c-code-snippet">BLEU = BP × P ≈ 0.5738 × 0.502 ≈ 0.288</span></p>
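<p>The full calculation can be sketched in a few lines of Python. This illustrative implementation computes the clipped N-gram precisions directly from the tokenized summaries and weights each of the three precisions by 0.25, as in the walkthrough; production code would typically use a library such as <code>sacrebleu</code> instead:</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, generated: str, max_n: int = 3, weight: float = 0.25):
    ref, gen = reference.lower().split(), generated.lower().split()
    score = 1.0
    for n in range(1, max_n + 1):
        ref_counts, gen_counts = Counter(ngrams(ref, n)), Counter(ngrams(gen, n))
        clipped = sum((ref_counts & gen_counts).values())  # clipped n-gram matches
        score *= (clipped / max(len(gen) - n + 1, 1)) ** weight
    # Brevity penalty: only applied when the candidate is shorter than the reference
    bp = math.exp(1 - len(ref) / len(gen)) if len(gen) < len(ref) else 1.0
    return bp * score
```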



<h4 class="wp-block-heading">Interpreting the BLEU score</h4>



<p>BLEU is a precision-oriented metric that evaluates how accurate the content present in the generated summary is. The BLEU score ranges between 0 and 1, where a score close to 1 indicates a highly accurate summary, a score between 0.3 and 0.7 indicates a moderately accurate summary, and a score close to 0 indicates a lower quality of the generated summary.</p>



<p>BLEU is best used together with recall-oriented metrics like ROUGE and METEOR to evaluate the summary’s quality more comprehensively.</p>



<p>The calculated BLEU score for our example is approximately 0.288, which means the LLM-produced text is of low to moderate quality.</p>



<h4 class="wp-block-heading">Problems with the BLEU score</h4>



<ul class="wp-block-list">
<li><strong>Surface-level matching:</strong> Similar to ROUGE and METEOR, BLEU relies on the exact N-gram matching between the generated text and reference text and fails to capture the semantic meaning and context of the text. BLEU does not handle synonyms or paraphrases well. Two summaries with the same meaning but different wording will have a low BLEU score due to the lack of exact N-gram matches.</li>



<li><strong>Effective short summaries are penalized</strong>: BLEU’s brevity penalty was designed to discourage overly short translations. It can penalize concise and accurate summaries that are shorter than the reference summary, even if they capture the essential information effectively.</li>



<li><strong>Higher order N-grams limitation</strong>: BLEU evaluates N-grams up to a certain length (typically 3 or 4). Longer dependencies and structures are not well captured, missing out on evaluating the coherence and logical flow of longer text segments.</li>
</ul>



<h2 class="wp-block-heading" id="h-llm-evaluation-frameworks-for-summarization-tasks">LLM evaluation frameworks for summarization tasks</h2>



<p>ROUGE, METEOR, and BLEU focus on surface-level matching of N-grams and exact/stemmed/synonym matches, but they do not capture semantic meaning or context.</p>



<p>Evaluation frameworks built on language models such as BERT and GPT have been developed to address this limitation by focusing on the actual meaning and coherence of the content.</p>



<h3 class="wp-block-heading" id="h-bertscore">BERTScore</h3>



<p><a href="https://en.wikipedia.org/wiki/BERT_(language_model)" target="_blank" rel="noreferrer noopener nofollow">BERTScore</a> is an LLM-based framework that evaluates the quality of a generated summary by comparing it to a human-written reference summary. It leverages the <a href="https://blog.gopenai.com/bert-unleashing-contextual-embeddings-for-language-understanding-a23561e300c0" target="_blank" rel="noreferrer noopener">contextual embeddings</a> (vector representations of each word&#8217;s meaning and context) provided by pre-trained language models like <a href="https://en.wikipedia.org/wiki/BERT_(language_model)">BERT (Bidirectional Encoder Representations from Transformers)</a>.</p>



<p>BERTScore examines each word or token in the candidate summary and uses the BERT embeddings to determine which word in the reference summary is the most similar. It uses similarity metrics, primarily <a href="https://www.datastax.com/guides/what-is-cosine-similarity" target="_blank" rel="noreferrer noopener nofollow">cosine similarity</a>, to assess the closeness of the vectors.</p>



<p>Using the BERT model’s understanding of language, BERTScore matches each word in the generated summary to its most similar counterpart in the reference summary. These word-level similarities are then aggregated into an overall score of semantic similarity between the reference and candidate summaries. The higher the BERTScore, the better the summary generated by the LLM.</p>



<h4 class="wp-block-heading">How does BERTScore work?</h4>



<p><strong>1.</strong> <strong>Tokenization and embedding extraction</strong><br><br>First, we tokenize the candidate summary and the reference summary. Each token is converted into its corresponding contextual embedding using a pre-trained language model like BERT. Contextual embeddings consider the surrounding words to generate a meaningful vector representation for each word.</p>



<p><strong>2.</strong> <strong>Cosine-similarity calculation</strong><br><br>Next, we compute the pairwise cosine similarity between each embedded token in the candidate summary and each embedded token in the reference summary. The maximum similarity scores for each token are retained and then used to compute the precision, recall, and F1 scores.<br></p>



<p><strong>a) Precision calculation: </strong>Precision is calculated by averaging the maximum cosine similarity for each token in the generated summary. For each token in the generated summary, we find the token in the reference summary that has the highest cosine similarity and average these maximum values.<br></p>



<p><strong>b)</strong> <strong>Recall calculation: </strong>Recall is calculated in a similar manner. For each token in the reference summary, we find the token in the generated summary that has the highest cosine similarity and average these maximum values.<br></p>



<p><strong>c) F1 score:</strong> The F1 score is the harmonic mean of the precision and recall.</p>
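<p>The greedy-matching logic can be sketched as follows. Real BERTScore extracts contextual embeddings from a BERT model (and optionally applies IDF weighting and baseline rescaling); here, the embeddings are passed in as plain lists of numbers so the matching step stands on its own:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def bertscore(gen_embs, ref_embs):
    """Greedy-max matching over token embeddings, as in BERTScore."""
    # Precision: each generated token matched to its most similar reference token
    precision = sum(max(cosine(g, r) for r in ref_embs) for g in gen_embs) / len(gen_embs)
    # Recall: each reference token matched to its most similar generated token
    recall = sum(max(cosine(g, r) for g in gen_embs) for r in ref_embs) / len(ref_embs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```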



<h4 class="wp-block-heading">Interpreting BERTScore</h4>



<p>By calculating the similarity score for all tokens, BERTScore takes into account both the syntactic and semantic context of the generated summary compared to the human-crafted reference.<br><br>For the BERTScore, precision, recall, and F1 scores are all given equal importance. High values for all these metrics indicate a high quality of the generated summary.</p>



<h4 class="wp-block-heading">Problems with BERTScore</h4>



<ul class="wp-block-list">
<li><strong>High computational cost</strong>: Compared to the metrics discussed earlier, BERTScore requires significant computational resources to compute embeddings and measure similarity. This makes it less practical for large datasets or real-time applications.</li>



<li><strong>Dependency on pre-trained models</strong>: BERTScore relies on pre-trained transformer models, which may have biases and limitations inherited from their training data. This can affect the evaluation results, particularly for texts that differ significantly from the training domain of the pre-trained models.</li>



<li><strong>Difficulty in interpreting scores</strong>: BERTScore, being based on dense vector representations and cosine similarity, can be less intuitive to interpret compared to simpler metrics like ROUGE or BLEU. People may find it challenging to understand what specific scores mean in terms of text quality, which complicates debugging and improvement processes.</li>



<li><strong>Lack of standardization</strong>: There is no single standardized version of BERTScore, leading to variations in implementations and configurations. This lack of standardization can result in inconsistent evaluations across different implementations and studies.</li>



<li><strong>Overemphasis on semantic similarity</strong>: BERTScore focuses on capturing semantic similarity between texts. This emphasis can sometimes overlook other important aspects of summarization quality, such as coherence, fluency, and factual accuracy.</li>
</ul>



<h3 class="wp-block-heading" id="h-g-eval">G-Eval</h3>



<p><a href="https://github.com/megagonlabs/llm-longeval" target="_blank" rel="noreferrer noopener nofollow">G-Eval</a> is another evaluation metric that harnesses the power of large language models (LLMs) to provide sophisticated, nuanced evaluations of text summarization tasks. It is an example of an approach known as <a href="https://huggingface.co/learn/cookbook/en/llm_judge" target="_blank" rel="noreferrer noopener nofollow">LLM-as-a-judge</a>. As of 2024, G-Eval is considered <a href="https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/evaluation/g-eval-metric-for-summarization" target="_blank" rel="noreferrer noopener nofollow">state-of-the-art for evaluating text summarization tasks</a>.</p>



<p>G-Eval assesses the quality of the generated summary across four dimensions: coherence, consistency, fluency, and relevance. It passes prompts that include the source document and the generated summary to a GPT model. G-Eval uses four separate prompts, one for each dimension, and asks the LLM for a score from 1 to 5.</p>



<h4 class="wp-block-heading">How does G-Eval work?</h4>



<ul class="wp-block-list">
<li><strong>Input texts</strong>: Both the reference summary and the candidate (generated) summary are provided as inputs to the LLM.</li>



<li><strong>Criteria-specific prompts</strong>: Four prompts are used to guide the LLM to evaluate coherence, consistency, fluency, and relevance.</li>
</ul>






<p>Here is the prompt template for evaluating the generated summary for a news article:</p>



<section id="note-block_c251262803545104c0199cfc76d88006"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

    
    <div class="block-note__content">
                    <div class="c-item c-item--wysiwyg_editor">

                
                
                <div class="c-item__content">

<p><em>“””</em></p>
<p><em>You will be given one summary written for a news article.</em></p>
<p><em>Your task is to rate the summary on one metric.</em></p>
<p><em>Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed.</em></p>
<p><em>Evaluation Criteria:</em></p>
<p><em>Relevance (1-5) &#8211; selection of important content from the source. The summary should include only important information from the source document. Annotators were instructed to penalize summaries which contained redundancies and excess information.</em></p>
<p><em>Evaluation Steps:</em></p>
<p><em>1. Read the summary and the source document carefully.</em></p>
<p><em>2. Compare the summary to the source document and identify the main points of the article.</em></p>
<p><em>3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.</em></p>
<p><em>4. Assign a relevance score from 1 to 5.</em></p>
<p><em>Example:</em></p>
<p><em>Source Text:</em></p>
<p><em>{{Document}}</em></p>
<p><em>Summary:</em></p>
<p><em>{{Summary}}</em></p>
<p><em>Evaluation Form (scores ONLY):</em></p>
<p><em>&#8211; Relevance:</em></p>
<p><em>“””</em></p>
                                    </div>

            </div>
            </div>


</section>



<p><a href="https://github.com/nlpyang/geval" target="_blank" rel="noreferrer noopener nofollow">Different prompts</a> for different evaluation criteria are available. Users can also create a custom prompt to capture additional dimensions.</p>



<ul class="wp-block-list">
<li><strong>Scoring mechanism</strong>: The LLM outputs scores or qualitative feedback based on its understanding and evaluation of the summaries.</li>



<li><strong>Aggregate evaluation</strong>: Scores for different evaluation dimensions are aggregated to assess the summary comprehensively.</li>
</ul>
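<p>A minimal sketch of this loop might look like the following. The <code>build_prompts</code> and <code>parse_score</code> helpers and the shortened template are hypothetical, and the actual call to a GPT model (e.g., via an API client) is deliberately left out:</p>

```python
import re

# Shortened, hypothetical version of the prompt template shown above
PROMPT_TEMPLATE = """You will be given one summary written for a news article.
Your task is to rate the summary on one metric: {dimension} (1-5).

Source Text:
{document}

Summary:
{summary}

Evaluation Form (scores ONLY):
- {dimension}:"""

def build_prompts(document: str, summary: str) -> dict:
    """One prompt per evaluation dimension, ready to send to the judge LLM."""
    dims = ["Coherence", "Consistency", "Fluency", "Relevance"]
    return {d: PROMPT_TEMPLATE.format(dimension=d, document=document, summary=summary)
            for d in dims}

def parse_score(llm_reply: str) -> int:
    """Extract the first digit 1-5 from the judge model's reply."""
    match = re.search(r"[1-5]", llm_reply)
    if match is None:
        raise ValueError("no score found in reply")
    return int(match.group())
```

<p>In practice, each prompt is sent to the judge model, and the four parsed scores are then aggregated (e.g., averaged) into the overall evaluation.</p>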



<h4 class="wp-block-heading">Problems with G-Eval</h4>



<ul class="wp-block-list">
<li><strong>Bias and fairness</strong>: Like any automated system, G-Eval can reflect biases in the training data or the choice of evaluation metrics. This can lead to unfair assessments of summaries, especially across different demographic or content categories.</li>



<li><strong>High computational cost</strong>: G-Eval uses GPT models, which require significant computational resources to compute embeddings and generate scores for different evaluation dimensions.</li>



<li><strong>Lack of calibration:</strong> Since an LLM provides the score based on a user-provided prompt, it is not calibrated. G-Eval is thus similar to asking different users to rate a summary on a five-star scale: the scores may be inconsistent across different summaries.</li>
</ul>



<div id="medium-table-block_9f63620aa30d42bb7cd12833fa5ce017"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--standard l-padding__bottom--standard l-margin__top--0 l-margin__bottom--0">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            &nbsp;                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Type                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Requires human-crafted reference                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Considers semantics and context                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Computational cost                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Consistency                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Relevance                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Fluency and coherence                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>ROUGE</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Statistical</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Low</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>METEOR</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Statistical</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Low</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>BLEU</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Statistical</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Low</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>BERTScore</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Embedding-based</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">Medium</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><strong>G-Eval</strong></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p><span style="font-weight: 400;">LLM-as-a-Judge</span></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-uncheckmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>High</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                                                            <img loading="lazy" decoding="async"
                                            alt=""
                                            class="c-ceil__checked lazyload"
                                            src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
                                            data-src="https://neptune.ai/wp-content/themes/neptune/img/icon-table-checkmark.svg"
                                            width="27"
                                            height="21"
                                        />
                                                                                                </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<h2 class="wp-block-heading" id="h-open-problems-with-current-evaluation-methods-and-metrics-for-llm-text-summarization">Open problems with current evaluation methods and metrics for LLM text summarization</h2>



<p>One of the major issues with evaluating LLM text summarization is that metrics like ROUGE, METEOR, and BLEU rely on n-gram overlap and cannot capture the true meaning and context of a summary. Particularly for abstractive summaries, they fall well short of human judgment.</p>
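<p>To see why n-gram overlap breaks down for abstractive summaries, consider a minimal sketch (plain Python with whitespace tokenization only; the example sentences are invented for illustration) that computes a ROUGE-style unigram recall:</p>

```python
from collections import Counter

def ngram_overlap(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-style n-gram recall: fraction of reference n-grams found in the candidate."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

reference = "the company reported record profits this quarter"
paraphrase = "earnings hit an all-time high in the last three months"

# The paraphrase shares only one word ("the") with the reference,
# so unigram recall is low despite the equivalent meaning.
print(ngram_overlap(paraphrase, reference))
```

<p>A faithful paraphrase that shares almost no words with the reference scores close to zero, even though a human evaluator would rate it highly.</p>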



<p>Relying on human experts to write and assess reference summaries makes the evaluation process costly and time-consuming. Human evaluators also suffer from subjectivity and variability, which makes standardization across evaluators difficult.</p>



<p>Another significant open challenge is evaluating factual consistency. None of the metrics discussed in this article effectively detects factual inaccuracies or misleading interpretations of the source text.</p>



<p>Current metrics also struggle to assess whether a summary preserves the context and logical flow of the original text, and they fail to capture whether it includes all the critical information without unnecessary fluff or repetition.</p>



<p>It is likely that we will see more advanced LLM-based evaluation methods in the coming years. The widespread use of LLMs for text summarization, including the integration of summarization features into search engines, keeps research in this field highly active and relevant.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>This article gave you an overview of using LLMs for text summarization and introduced the automated and LLM-based evaluation metrics ROUGE, BLEU, METEOR, BERTScore, and G-Eval, along with how each works and where it falls short. Best of all, you do not need to implement these metrics from scratch: libraries like <a href="https://huggingface.co/docs/evaluate/index" target="_blank" rel="noreferrer noopener nofollow">Hugging Face evaluate</a>, <a href="https://docs.haystack.deepset.ai/docs/evaluation" target="_blank" rel="noreferrer noopener nofollow">Haystack</a>, and <a href="https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/string/" target="_blank" rel="noreferrer noopener nofollow">LangChain</a> provide ready-to-use implementations.</p>



<p>While ROUGE, METEOR, and BLEU are simple and fast to compute, they do not assess the semantic match between the generated summary and the reference. BERTScore and G-Eval address this gap but come with their own infrastructure requirements, which can incur costs. Combining several of these metrics gives you a more complete picture of summary quality. Beyond off-the-shelf models, you can also fine-tune an open-source LLM to act as an LLM-as-a-Judge for your evaluation needs.</p>



]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">40088</post-id>	</item>
		<item>
		<title>How to Optimize Hyperparameter Search Using Bayesian Optimization and Optuna</title>
		<link>https://neptune.ai/blog/how-to-optimize-hyperparameter-search</link>
		
		<dc:creator><![CDATA[Gourav Bais]]></dc:creator>
		<pubDate>Mon, 06 May 2024 09:00:00 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=37012</guid>

					<description><![CDATA[Training a machine learning model involves a set of parameters and hyperparameters. Parameters are the internal variables, such as weights and coefficients, that the model learns during the training process. Hyperparameters are the external configuration settings that govern the model training and directly impact the model&#8217;s performance. In contrast to parameters learned during training, they&#8230;]]></description>
										<content:encoded><![CDATA[
<section id="note-block_20bde8b2bbd9cebb48502c6a734976c9"
         class="block-note c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-note__header">
            TL;DR        </h3>
    
    <div class="block-note__content">
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Hyperparameter optimization is an integral part of machine learning. It aims to find the best set of hyperparameter values to achieve the best model performance.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Grid search and random search are popular hyperparameter tuning methods. They explore the search space without using the results of past evaluations, which makes them time-consuming and inefficient for large search spaces and datasets.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Based on Bayesian logic, Bayesian optimization considers the model performance for previous hyperparameter combinations while determining the next set of hyperparameters to evaluate.</p>
                                    </div>

            </div>
                    <div class="c-item c-item--text">

                                    <img
                        alt=""
                        class="c-item__arrow"
                        src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg"
                        loading="lazy"
                        decoding="async"
                        width="12"
                        height="10"
                    />
                
                <div class="c-item__content">

                                            <p>Optuna is a popular tool for Bayesian hyperparameter optimization. It provides easy-to-use algorithms, automatic algorithm selection, integrations with a wide range of ML frameworks, and support for distributed computing.</p>
                                    </div>

            </div>
            </div>


</section>



<p>Training a machine learning model involves a set of parameters and hyperparameters. Parameters are the internal variables, such as weights and coefficients, that the model learns during the training process. Hyperparameters are the external configuration settings that govern the model training and directly impact the model&#8217;s performance. In contrast to parameters learned during training, they need to be defined before the training begins.</p>



<p>Hyperparameter optimization, also known as hyperparameter tuning or hyperparameter search, is the process of finding the optimal values for hyperparameters that result in the best model performance.</p>



<p>The optimization process starts with choosing an objective function to minimize/maximize and selecting the range of values for different hyperparameters called the search space. Then, you choose one of several tuning techniques, such as manual tuning, grid search, random search, and Bayesian optimization.</p>



<p>Methods like manual tuning, grid search, and random search sample the search space (all possible values and combinations of hyperparameters) over multiple iterations. They do not take into account the results of past iterations when selecting the next hyperparameter combination to try, and the number of combinations to evaluate grows exponentially with the number of hyperparameters to tune.</p>



<p>Further, these methods are time- and resource-intensive: every evaluation requires training a model on a selected set of hyperparameter values, making predictions on the validation data, and calculating the validation metrics. All this makes hyperparameter tuning a costly endeavor.</p>



<p>Here, Bayesian hyperparameter optimization methods come to the rescue. Based on <a href="https://machinelearningmastery.com/bayes-theorem-for-machine-learning/" target="_blank" rel="noreferrer noopener nofollow">Bayesian logic</a>, Bayesian optimization reduces the time required to find an optimal set of parameters to improve generalization performance on the test data. Bayesian approaches consider the previous hyperparameter values and their performance while determining the next set of hyperparameters to evaluate.</p>



<p>Many tools in the ML space use Bayesian optimization to guide the selection of the best set of hyperparameters. Widely employed frameworks are <a href="http://hyperopt.github.io/hyperopt/" target="_blank" rel="noreferrer noopener nofollow">HyperOpt</a>, <a href="https://github.com/HIPS/Spearmint" target="_blank" rel="noreferrer noopener nofollow">Spearmint</a>, <a href="https://sheffieldml.github.io/GPyOpt/" target="_blank" rel="noreferrer noopener nofollow">GPyOpt</a>, and <a href="https://optuna.org/" target="_blank" rel="noreferrer noopener nofollow">Optuna</a>. For this article, we’ll focus on Optuna, a popular choice for hyperparameter optimization due to its ease of use, efficient search strategy, distributed computing support, and automatic algorithm selection.</p>



<p>Using Optuna and a hands-on example, you will learn about the ideas behind Bayesian hyperparameter optimization, how it works, and how to perform Bayesian optimization for any of your machine-learning models.</p>



<h2 class="wp-block-heading" id="h-how-does-the-bayesian-hyperparameter-optimization-strategy-work">How does the Bayesian hyperparameter optimization strategy work?</h2>



<p>Each step in a hyperparameter tuning process looks as follows: We select a set of hyperparameters from the search space and evaluate them by computing the objective function. In most basic approaches, the objective function’s value is computed by training a model using the selected hyperparameters, using the model to make predictions on a test data set, and evaluating its performance using a predefined metric such as accuracy.</p>



<p>For a small parameter range and small dataset, we can try out all possible hyperparameter combinations, as the number of calls to the objective function will be small. This popular approach is called grid search. However, for a relatively large dataset and large parameter ranges, this method is too computationally expensive and time-consuming. Hence, we should look for ways to limit the number of calls to the objective function.</p>



<p>A straightforward approach is to randomly select a certain number of hyperparameter combinations (say, 10 or 20) and pick the combination that yields the best value of the objective function. This approach is called random search. It limits the number of calls to the objective function to a fixed value (i.e., the search has approximately constant time complexity). The price we pay is that there is no guarantee that the obtained hyperparameter values are even close to optimal.</p>
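<p>To make this concrete, here is a minimal pure-Python sketch of random search. The one-dimensional “learning rate”, its range, and the toy objective are illustrative assumptions standing in for a real training-and-evaluation run:</p>

```python
import random

# Toy objective: a 1-D "validation score" we want to maximize.
# In practice this would train a model and return a validation metric.
def objective(lr):
    return -(lr - 0.1) ** 2  # best possible score at lr = 0.1

random.seed(0)
search_space = (0.001, 1.0)  # assumed range for the hypothetical "learning rate"

# Random search: draw a fixed budget of candidates independently,
# ignoring the results of earlier draws.
candidates = [random.uniform(*search_space) for _ in range(20)]
best_lr = max(candidates, key=objective)
print(best_lr)
```

<p>Note that the budget (20 draws) caps the number of objective-function calls, but nothing steers later draws toward promising regions.</p>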


    <a
        href="/blog/improving-ml-model-performance"
        id="cta-box-related-link-block_8ea0bbd0f887685c74697650c663ab36"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-how-to-improve-ml-model-performance-best-practices-from-ex-amazon-ai-researcher">                How to Improve ML Model Performance [Best Practices From Ex-Amazon AI Researcher]             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read also                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>In contrast to grid search and random search, Bayesian hyperparameter optimization considers past evaluation results when selecting the next hyperparameter set. Since it makes an informed decision, it focuses on the areas of the search space that are more likely to lead to optimal model performance. Likewise, it tends to ignore areas in the search space that are unlikely to contribute towards performance optimization. This limits the number of calls to the objective function while ensuring that the evaluated hyperparameter combinations are increasingly more likely to produce an optimal model.</p>



<p>Now, let’s examine the main components of Bayesian optimization that work together to obtain the best set of hyperparameters.</p>



<h3 class="wp-block-heading" id="h-search-space">Search space</h3>



<p>The search space is the set of possible values the parameters and variables of interest can take. For example, we might look for our apartment’s optimal room temperature by trying out values between 16 and 26 degrees Celsius (60 to 80 degrees Fahrenheit). While the parameter “room temperature” could conceivably take on higher or lower values, we’re restricting our search to this particular range.<br><br>Bayesian optimization utilizes <a href="https://en.wikipedia.org/wiki/Probability_distribution" target="_blank" rel="noreferrer noopener nofollow">probability distributions</a> to guide the selection of samples within a defined search space. The user initially defines this search space and specifies the ranges or constraints for each parameter or variable, which requires knowledge of the training data and the model’s algorithm. Usually, the choice of parameter ranges is heavily influenced by the user’s assumptions and experience. When defining the search space, it’s paramount not to be too narrow: If the optimal hyperparameter combination lies outside the search space, no optimization algorithm can find it.</p>
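<p>As a sketch, a search space can be represented as named ranges, each paired with the distribution it is sampled from. The parameter names, ranges, and distribution kinds below are illustrative assumptions (including the room-temperature example from the text):</p>

```python
import math
import random

# Illustrative search space: continuous, log-uniform, and integer parameters.
search_space = {
    "temperature": ("uniform", 16.0, 26.0),       # degrees Celsius
    "learning_rate": ("loguniform", 1e-5, 1e-1),  # sampled on a log scale
    "n_estimators": ("int", 50, 300),             # integer-valued parameter
}

def sample(space):
    """Draw one candidate configuration from the search space."""
    point = {}
    for name, (kind, low, high) in space.items():
        if kind == "uniform":
            point[name] = random.uniform(low, high)
        elif kind == "loguniform":
            point[name] = math.exp(random.uniform(math.log(low), math.log(high)))
        elif kind == "int":
            point[name] = random.randint(low, high)
    return point

candidate = sample(search_space)
print(candidate)
```

<p>Bayesian optimization replaces the uniform draws with samples guided by the surrogate model, but the search space definition itself stays the same.</p>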



<h3 class="wp-block-heading" id="h-objective-function">Objective function</h3>



<p>The objective function is the evaluator that takes in the values of the hyperparameters and returns a single value score that you want to minimize or maximize.</p>



<p>For example, the objective function could consist of the following algorithm:</p>



<ul class="wp-block-list">
<li>Instantiate a model and a training process using the given combination of hyperparameter values.</li>



<li>Train the model on a training dataset.</li>



<li>Evaluate the model’s accuracy on a test data set.</li>



<li>Return the accuracy as the single value score.<br></li>
</ul>



<p>In this example, we would try to bring the objective function’s value as close to 1.0 (perfect accuracy) as possible.</p>



<p>The fact that computing the objective function involves a full model training run and subsequent evaluation makes every evaluation costly and time-consuming. Thus, hyperparameter optimization approaches that limit the number of calls to the objective function are preferable.</p>
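<p>The algorithm above can be sketched with a deliberately tiny stand-in for model training. The data points and the single hyperparameter (a decision threshold) are made-up assumptions:</p>

```python
# Toy stand-in for "train a model, then measure accuracy on a test set".
test_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.8, 1)]  # (feature, label) pairs

def objective(threshold):
    # "Model": predict class 1 whenever the feature exceeds the threshold.
    predictions = [1 if x > threshold else 0 for x, _ in test_set]
    correct = sum(p == y for p, (_, y) in zip(predictions, test_set))
    return correct / len(test_set)  # accuracy in [0, 1], to be maximized

print(objective(0.5))  # → 1.0
```

<p>In a real setting, each call to <code>objective</code> would hide a full training run, which is exactly why limiting the number of calls matters.</p>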



<h3 class="wp-block-heading" id="h-surrogate-function">Surrogate function</h3>



<p>The surrogate function proposes the best set of hyperparameters given the current state of knowledge. It evaluates all past invocations of the objective function and reveals a parameter combination it expects to yield a better result.</p>



<p>The purpose of the surrogate function is to limit the number of calls we need to make to the objective function. It also goes by the name response surface, as it is a high-dimensional mapping of hyperparameters to the probability of a score on the objective function. In that sense, it is an approximation of the objective function.</p>



<p>Different types of surrogate functions exist, such as <a href="https://mooseframework.inl.gov/modules/stochastic_tools/examples/gaussian_process_surrogate.html" target="_blank" rel="noreferrer noopener nofollow">Gaussian Processes</a>, <a href="https://docs.sciml.ai/Surrogates/stable/randomforest/" target="_blank" rel="noreferrer noopener nofollow">Random Forest Regression</a>, and <a href="https://docs.openvino.ai/archive/2021.4/pot_compression_optimization_tpe_README.html" target="_blank" rel="noreferrer noopener nofollow">Tree Parzen Estimator (TPE)</a>. For this article, we will be focusing on the Tree Parzen Estimator (TPE).</p>



<p>TPE is a probability-based model that balances <a href="https://towardsdatascience.com/the-exploration-exploitation-dilemma-f5622fbe1e82" target="_blank" rel="noreferrer noopener nofollow">exploration and exploitation</a> by maintaining separate models for the likelihood of improvement and the probability of worsening. It is well suited for hyperparameter optimization tasks where the objective is to find the set of hyperparameters that can minimize or maximize the <a href="https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide" target="_blank" rel="noreferrer noopener nofollow">model performance evaluation metrics</a> used in the objective function.</p>



<p>The TPE algorithm iteratively samples new hyperparameters, evaluates their performance using the objective function, updates its internal probability distributions, and continues the search until a stopping criterion is met.</p>



<section
	id="i-box-block_2c51819037c3c3117b269946a8ac5e55"
	class="block-i-box  l-margin__top--0 l-margin__bottom--0">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Deep Dive: How does the Tree Parzen Estimator (TPE) work?</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>In the TPE, the criterion that guides the search for the next set of hyperparameters is called an acquisition function. It can be defined as follows:</p>



<p class="has-text-align-center"><em>AF(x) = max(P(I∣x)/P(W∣x), ϵ)</em></p>



<p>Here, <em>P(I∣x) </em>represents the probability of improvement,<em> P(W∣x) </em>represents the probability of worsening, and<em> ϵ</em> is a small constant to prevent division by zero.</p>



<p>TPE starts with randomly sampling a small number of points from the search space to evaluate the objective function. Then, it builds and maintains two separate models for “good” (improving) and “bad” (worsening) regions of the search space.</p>



<p>It divides the search space into regions using a <a href="https://en.wikipedia.org/wiki/Binary_tree" target="_blank" rel="noreferrer noopener nofollow">binary tree structure</a>, where each leaf node represents a region. For each leaf node, TPE fits a probability distribution to the observed scores of the points in that region. Typically, TPE uses <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation" target="_blank" rel="noreferrer noopener nofollow">kernel density estimation (KDE)</a> to model the probability distributions.</p>



<p>At each iteration, TPE samples a new candidate point by selecting a leaf node based on the probabilities obtained from the probability distributions of the “good” and “bad” regions. It then samples a point uniformly within the selected leaf node and evaluates it using the objective function.</p>



<p>After evaluating the new point, TPE updates its models by incorporating the observed score. If the score is better than the previous best score, TPE updates the model for the “good” region. Otherwise, it updates the model for the “bad” region. This process repeats until the stopping criteria are met.<br><br>To learn more, I recommend Shuhei Watanabe&#8217;s <a href="https://arxiv.org/pdf/2304.11127.pdf" target="_blank" rel="noreferrer noopener nofollow">tutorial paper</a> Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance.</p>


	</div>

</section>
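<p>A toy numeric illustration of the acquisition ratio described in the box above, with single Gaussians standing in for the kernel density estimates. The observed hyperparameter values and bandwidths are all made-up assumptions:</p>

```python
import math

# "Good" and "bad" observations of a single hyperparameter; made-up numbers.
good = [0.09, 0.10, 0.11]  # values that improved the score
bad = [0.50, 0.70, 0.90]   # values that worsened it

def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))

def acquisition(x, eps=1e-12):
    # AF(x) = max(P(I|x) / P(W|x), eps), single Gaussians instead of full KDEs.
    mu_good = sum(good) / len(good)
    mu_bad = sum(bad) / len(bad)
    p_improve = gaussian(x, mu_good, 0.05)  # bandwidths are assumptions
    p_worsen = gaussian(x, mu_bad, 0.20)
    return max(p_improve / p_worsen, eps)

# Candidates near the "good" cluster score far higher than those near the "bad" one:
print(acquisition(0.10) > acquisition(0.60))  # → True
```

<p>TPE picks the candidate maximizing this ratio, which is why it concentrates evaluations where improvements have been observed.</p>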



<h3 class="wp-block-heading" id="h-selection-function">Selection function</h3>



<p>While the surrogate function uncovers the next best parameters, the selection function, also called the acquisition function, is responsible for actually selecting the current best set. Its objective is to strike a balance between exploring regions of the parameter space with high uncertainty (exploration) and exploiting regions likely to yield better objective function values (exploitation).</p>



<p>There are different types of selection functions, including <a href="https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html" target="_blank" rel="noreferrer noopener nofollow">Expected Improvement (EI), Probability of Improvement (PI), and Upper Confidence Bound (UCB)</a>. Each of them uses a different approach to strike a balance between exploration and exploitation.</p>
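<p>A hedged sketch of these three acquisition functions in their standard textbook form for maximization (not any particular library’s API). Here, <em>mu</em> is the surrogate’s predicted score at a candidate point, <em>sigma</em> its uncertainty there, and <em>best</em> the incumbent score; all numeric values are made up:</p>

```python
import math

def norm_pdf(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ucb(mu, sigma, kappa=2.0):
    # Upper Confidence Bound: optimism in the face of uncertainty
    return mu + kappa * sigma

def pi(mu, sigma, best):
    # Probability of Improvement over the incumbent score
    return norm_cdf((mu - best) / sigma)

def ei(mu, sigma, best):
    # Expected Improvement over the incumbent score
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)

# A highly uncertain candidate can beat a slightly better but certain one under UCB:
print(ucb(0.80, 0.10) > ucb(0.85, 0.01))  # → True
```

<p>The exploration bias is tunable: a larger <code>kappa</code> in UCB rewards uncertainty more heavily, pushing the search toward unexplored regions.</p>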



<h3 class="wp-block-heading" id="h-the-complete-bayesian-hyperparameter-search-process">The complete Bayesian hyperparameter search process</h3>



<p>The full process of searching the optimal hyperparameters with Bayesian optimization entails the following steps:</p>



<div id="case-study-numbered-list-block_35cef6303d342a9677740c7ce6ec1496"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Select a search space to draw samples from.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Select a random value for each hyperparameter.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Define an objective function for your specific machine learning model and dataset.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Choose a surrogate function to approximate your objective function.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Based on the currently known information, select an optimal set of hyperparameters in the search space. This point is chosen based on a trade-off between exploration and exploitation.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">6</span>
                Evaluate the objective function for the given set of parameters. (This involves training a model and evaluating its performance on a test set.)             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">7</span>
                Update the surrogate function’s model to incorporate the new results, refining its approximation of the objective function.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">8</span>
                Repeat steps 5 to 7 until a stopping criterion (e.g., a maximum number of iterations or a threshold of the objective function’s value) is reached.            </li>
            </ul>
</div>
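<p>The steps above can be condensed into a pure-Python sketch, with a deliberately naive surrogate and selection rule standing in for a real probabilistic model. The 1-D toy objective, the 0.2 exploration rate, and the other constants are all assumptions:</p>

```python
import random

random.seed(1)

# Step 3: a toy objective standing in for an expensive train-and-evaluate run
def objective(x):
    return -(x - 0.3) ** 2  # maximized at x = 0.3

low, high = 0.0, 1.0           # step 1: the search space
history = []                   # all (hyperparameter, score) pairs observed

x = random.uniform(low, high)  # step 2: a random starting value
for _ in range(30):            # steps 5-8: iterate until the budget runs out
    score = objective(x)       # step 6: the costly evaluation
    history.append((x, score))
    best_x, _ = max(history, key=lambda h: h[1])  # step 7: update our "model"
    # Step 5 (naive surrogate/selection): usually exploit near the incumbent,
    # occasionally explore the whole space (exploration/exploitation trade-off)
    if random.random() < 0.2:
        x = random.uniform(low, high)
    else:
        x = min(high, max(low, best_x + random.gauss(0.0, 0.05)))

best_x, best_score = max(history, key=lambda h: h[1])
print(round(best_x, 2))
```

<p>A real Bayesian optimizer replaces the crude "perturb the incumbent" rule with a surrogate fitted to the full history, but the loop structure is the same.</p>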



<h2 class="wp-block-heading" id="h-advantages-of-bayesian-optimization-over-other-hyperparameter-optimization-methods">Advantages of Bayesian optimization over other hyperparameter optimization methods</h2>



<p>We’ve seen that Bayesian optimization is superior to simpler hyperparameter optimization approaches because it takes into account past information. Let’s look at the advantages in more detail:</p>



<ul class="wp-block-list">
<li><strong>Probabilistic model:</strong> Bayesian hyperparameter optimization builds a probability-based model of the objective function, typically a TPE or Gaussian Process (GP). This makes it possible to account for uncertainty in the model&#8217;s predictions, allows guided exploration of the hyperparameter space, and enables adaptive sampling informed by past results.<br></li>



<li><strong>Resource efficiency:</strong> While optimization algorithms like random or grid search become infeasibly costly when dealing with large search spaces and huge datasets, Bayesian optimization is well-suited for scenarios where evaluating the objective function is computationally expensive. It minimizes the number of objective function evaluations needed to find an optimal solution, leading to significant savings in computational resources and time.<br></li>



<li><strong>Global optimization:</strong> Bayesian optimization is well-suited for global optimization tasks where the goal is to find the global optimum rather than just a local one. Its exploration-exploitation strategy facilitates a more comprehensive search across the hyperparameter space compared to other optimization methods. However, it still does not guarantee finding a global optimum.<br></li>



<li><strong>Efficient in high-dimensional spaces:</strong> Bayesian optimization remains effective in high-dimensional hyperparameter spaces. Even with a large number of hyperparameters, its probability-based modeling enables the effective exploration and exploitation of promising regions.</li>
</ul>



<h2 class="wp-block-heading" id="h-optimizing-hyperparameter-search-using-bayesian-optimization-and-optuna">Optimizing hyperparameter search using Bayesian optimization and Optuna</h2>



<p><a href="https://optuna.org/" target="_blank" rel="noreferrer noopener nofollow">Optuna</a> is an open-source hyperparameter optimization software framework that employs Bayesian hyperparameter optimization with the TPE (Tree Parzen Estimator). It is a framework-agnostic tool that allows seamless integration with various machine learning libraries such as <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow</a>, <a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener nofollow">PyTorch</a>, and <a href="https://scikit-learn.org/" target="_blank" rel="noreferrer noopener nofollow">scikit-learn</a>.</p>



<p>Optuna iteratively suggests new sets of hyperparameters based on TPE’s acquisition function, which balances exploration of unexplored regions and exploitation of promising areas. As the optimization progresses, the probabilistic model is continuously refined with observed data points, allowing Optuna to make informed decisions about where to sample next. This process optimizes the objective function with fewer evaluations, making Optuna an excellent choice for computationally expensive objective functions.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1200" height="628" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=1200%2C628&#038;ssl=1" alt="Graph illustrating Optuna Hyperparameter Tuning" class="wp-image-37047" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?w=1200&amp;ssl=1 1200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=768%2C402&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=200%2C105&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=220%2C115&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=120%2C63&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=160%2C84&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=300%2C157&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=480%2C251&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/How-to-Optimize-Hyperparameter-Search-Using-Bayesian-Optimization-and-Optuna.png?resize=1020%2C534&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Optuna 
Hyperparameter Tuning: The model is initially trained on the training set and then evaluated on the test set. Hyperparameter tuning is applied to find the set of hyperparameters that can achieve the best performance. Neptune tracks all the trial results for documentation and later analysis.</figcaption></figure>
</div>


<p>Optuna supports parallel and distributed optimizations, enabling efficient use of computational resources. The framework also provides visualization tools for analyzing the optimization process and facilitates integration with <a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Jupyter Notebooks</a>.</p>



<p>The Optuna workflow revolves around two concepts:<br></p>



<ol class="wp-block-list">
<li><strong>Trial:</strong> A single call to an objective function.<br></li>



<li><strong>Study:</strong> Hyperparameter optimization based on an objective function. A <em>Study</em> aims to determine the ideal set of hyperparameter values by conducting several trials.&nbsp;</li>
</ol>


    <a
        href="/blog/optuna-vs-hyperopt"
        id="cta-box-related-link-block_d62efec98ac10c7541a4178ab15e7829"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-optuna-vs-hyperopt-which-hyperparameter-optimization-library-should-you-choose">                Optuna vs Hyperopt: Which Hyperparameter Optimization Library Should You Choose?             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read also                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>Now, let’s break down the process of optimizing hyperparameters with Optuna. We’ll optimize the hyperparameters of a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank" rel="noreferrer noopener nofollow">Random Forest Classifier</a> on the famous <a href="https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html" target="_blank" rel="noreferrer noopener nofollow">iris dataset</a>.<br></p>



<p>Since hyperparameter tuning involves several trials with different sets of hyperparameters, keeping track of the combinations Optuna has tried quickly becomes difficult. To make our work easier, we will use <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a>, an <a href="/blog/ml-experiment-tracking" target="_blank" rel="noreferrer noopener">ML experiment tracking tool</a> that allows us to store the results of each Optuna trial.</p>



<p>Neptune provides visualization capabilities to understand the model performance for different hyperparameter combinations and over time. To use Neptune, you first need to <a href="/" target="_blank" rel="noreferrer noopener">sign up</a> and create a project.</p>



<section
	id="i-box-block_e49196e87a605aeb5aeb1c77931c93bf"
	class="block-i-box  l-margin__top--large l-margin__bottom--x-large">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                <strong>Disclaimer</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<p>Please note that this article references a <strong>deprecated version of Neptune</strong>.</p>



<p>For information on the latest version with improved features and functionality, please <a href="/" target="_blank" rel="noreferrer noopener">visit our website</a>.</p>


	</div>

</section>



<p>To follow along, you’ll need <a href="https://www.python.org/downloads/release/python-3110/" target="_blank" rel="noreferrer noopener nofollow">Python 3.11</a> and <a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Jupyter Notebook</a>. You can install the dependencies either using <a href="https://pip.pypa.io/en/stable/installation/" target="_blank" rel="noreferrer noopener nofollow">pip</a> or <a href="https://conda.io/projects/conda/en/latest/user-guide/install/index.html">conda</a>.</p>



<h3 class="wp-block-heading" id="h-step-1-install-and-load-dependencies">Step 1: Install and load dependencies</h3>



<p>We’ll start by installing Optuna and <em>scikit-learn</em>, along with Neptune and <a href="https://docs-legacy.neptune.ai/integrations/optuna/" target="_blank" rel="noreferrer noopener">Neptune’s Optuna plugin</a>:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install optuna==<span class="hljs-number" style="color: teal;">3.6</span><span class="hljs-number" style="color: teal;">.0</span>
pip install scikit-learn==<span class="hljs-number" style="color: teal;">1.3</span><span class="hljs-number" style="color: teal;">.0</span>
pip install neptune
pip install neptune-optuna
</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>If you don’t yet have <a href="https://jupyter.org/" target="_blank" rel="noreferrer noopener nofollow">Jupyter Notebooks</a> available in your environment, you can install and launch it as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">pip install notebook
jupyter notebook .</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>In a new notebook, we start by importing the dependencies:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> optuna
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> sklearn
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> neptune
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.datasets <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_iris
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.ensemble <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> RandomForestClassifier
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.model_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> cross_val_score
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> optuna.samplers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> TPESampler</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<h3 class="wp-block-heading" id="h-step-2-load-the-dataset">Step 2: Load the dataset</h3>



<p>Next, we’ll load the iris dataset, which contains measurements of three different iris species, using <em>scikit-learn</em>&#8217;s built-in dataset loader:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">dataset = load_iris()
features = dataset.data
target = dataset.target

print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'features shape: {features.shape}'</span>)
print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'target shape: {target.shape}'</span>)
print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'features: {dataset.feature_names}'</span>)
print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">'target: {dataset.target_names}'</span>)</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>For this tutorial, we’ll add some noise to the iris dataset, making the classification problem a bit harder for a model to solve. This makes the effects of Optuna’s hyperparameter tuning more pronounced.<br><br>We do this by adding normally distributed random numbers to the original data:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">print(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Before adding noise:"</span>)
print(features[:<span class="hljs-number" style="color: teal;">5</span>])
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Add noise to the features (X)</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Define the standard deviation for each feature (adjust as needed)</span>
noise_std = <span class="hljs-number" style="color: teal;">0.56</span>
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Generate random noise with the same shape as X</span>
noise = np.random.normal(scale=noise_std, size=features.shape)
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Add the noise to the features</span>
features = features + noise

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Print the first 5 samples before and after adding noise</span>
print(<span class="hljs-string" style="color: rgb(221, 17, 68);">"\nAfter adding noise:"</span>)
print(features[:<span class="hljs-number" style="color: teal;">5</span>])</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>The result should look as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Before adding noise:
[[<span class="hljs-number" style="color: teal;">5.1</span> <span class="hljs-number" style="color: teal;">3.5</span> <span class="hljs-number" style="color: teal;">1.4</span> <span class="hljs-number" style="color: teal;">0.2</span>]
 [<span class="hljs-number" style="color: teal;">4.9</span> <span class="hljs-number" style="color: teal;">3.</span>  <span class="hljs-number" style="color: teal;">1.4</span> <span class="hljs-number" style="color: teal;">0.2</span>]
 [<span class="hljs-number" style="color: teal;">4.7</span> <span class="hljs-number" style="color: teal;">3.2</span> <span class="hljs-number" style="color: teal;">1.3</span> <span class="hljs-number" style="color: teal;">0.2</span>]
 [<span class="hljs-number" style="color: teal;">4.6</span> <span class="hljs-number" style="color: teal;">3.1</span> <span class="hljs-number" style="color: teal;">1.5</span> <span class="hljs-number" style="color: teal;">0.2</span>]
 [<span class="hljs-number" style="color: teal;">5.</span>  <span class="hljs-number" style="color: teal;">3.6</span> <span class="hljs-number" style="color: teal;">1.4</span> <span class="hljs-number" style="color: teal;">0.2</span>]]

After adding noise:
[[ <span class="hljs-number" style="color: teal;">4.66019908</span>  <span class="hljs-number" style="color: teal;">3.54977319</span>  <span class="hljs-number" style="color: teal;">1.06632207</span>  <span class="hljs-number" style="color: teal;">0.29945385</span>]
 [ <span class="hljs-number" style="color: teal;">3.74028994</span>  <span class="hljs-number" style="color: teal;">3.57904014</span>  <span class="hljs-number" style="color: teal;">1.59902873</span>  <span class="hljs-number" style="color: teal;">0.46048575</span>]
 [ <span class="hljs-number" style="color: teal;">4.70786028</span>  <span class="hljs-number" style="color: teal;">3.76078967</span>  <span class="hljs-number" style="color: teal;">1.10910947</span>  <span class="hljs-number" style="color: teal;">0.47433873</span>]
 [ <span class="hljs-number" style="color: teal;">3.85680193</span>  <span class="hljs-number" style="color: teal;">3.09832529</span>  <span class="hljs-number" style="color: teal;">0.25834265</span> <span class="hljs-number" style="color: teal;">-0.82281783</span>]
 [ <span class="hljs-number" style="color: teal;">4.90387657</span>  <span class="hljs-number" style="color: teal;">4.06758026</span>  <span class="hljs-number" style="color: teal;">2.17520878</span> <span class="hljs-number" style="color: teal;">-0.51047752</span>]]</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<h3 class="wp-block-heading" id="h-step-3-select-a-performance-measure">Step 3: Select a performance measure</h3>



<p>We’ll use the <a href="https://scikit-learn.org/stable/modules/cross_validation.html" target="_blank" rel="noreferrer noopener nofollow">cross-validation score</a> as a performance measure. It averages the evaluation metric (e.g., accuracy, precision, recall, or F1 score) over <em>k</em> cross-validation folds: the model is trained <em>k</em> times, each time using <em>k-1</em> folds for training and the remaining fold for validation. The default metric for evaluating a scikit-learn <em>RandomForestClassifier</em> is accuracy, which we’ll also use here, but you can specify a different performance metric, such as <a href="https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall" target="_blank" rel="noreferrer noopener nofollow">precision</a> or <a href="https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall" target="_blank" rel="noreferrer noopener nofollow">recall</a>.</p>
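<p>As a quick illustration (a minimal sketch, assuming only scikit-learn is installed), the <em>scoring</em> argument of <em>cross_val_score()</em> is how you swap in a different metric; here we compare the default accuracy with macro-averaged precision:</p>

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=3, max_depth=1, random_state=42)

# Default scoring: the classifier's .score() method, i.e., accuracy
acc = cross_val_score(clf, X, y, cv=3).mean()

# Alternative metric: macro-averaged precision (precision averaged over classes)
prec = cross_val_score(clf, X, y, cv=3, scoring="precision_macro").mean()

print(f"accuracy: {acc:.3f}, macro precision: {prec:.3f}")
```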



<h3 class="wp-block-heading" id="h-step-4-training-the-random-forest-model-and-establishing-a-performance-baseline">Step 4: Train the random forest model and establish a performance baseline</h3>



<p>Before you start optimizing hyperparameters, you must have a baseline to compare the tuned model’s performance. Let’s train the random forest model on the iris data and calculate the cross-validation score to get the baseline results:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">Train_Model</span><span class="hljs-params">()</span>:</span>
    <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
    Define the model, then train and evaluate it using
    3-fold cross-validation.
    """</span>
    clf = RandomForestClassifier(n_estimators=<span class="hljs-number" style="color: teal;">3</span>, max_depth=<span class="hljs-number" style="color: teal;">1</span>)

    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> cross_val_score(clf, features, target, n_jobs=<span class="hljs-number" style="color: teal;">-1</span>, cv=<span class="hljs-number" style="color: teal;">3</span>).mean()

print(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Accuracy: {}'</span>.format(Train_Model()))</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Output:
Accuracy: <span class="hljs-number" style="color: teal;">0.6733333</span></pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>As you can see, the model achieved 67% accuracy on the noisy iris dataset. Now, let’s try to improve this accuracy using hyperparameter optimization with Optuna.</p>



<h3 class="wp-block-heading" id="h-step-5-defining-the-objective-function">Step 5: Define the objective function</h3>



<p>With the performance metric and the model training set up, we can now define the objective function. This function selects a set of hyperparameter values, trains the ML model, and returns a single-valued score (mean accuracy) you want to maximize.</p>



<p>As Optuna works with the concept of <em>Trials</em> and <em>Studies</em>, we need to define the objective function to accept a <em>Trial</em> object:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">objective</span><span class="hljs-params">(trial)</span>:</span>
   <span class="hljs-string" style="color: rgb(221, 17, 68);">"""
   Define a search space for the hyperparameters `n_estimators` and `max_depth`
   of a random forest model, then train and evaluate it using cross-validation.
   """</span>

   n_estimators = trial.suggest_int(<span class="hljs-string" style="color: rgb(221, 17, 68);">'n_estimators'</span>, <span class="hljs-number" style="color: teal;">2</span>, <span class="hljs-number" style="color: teal;">20</span>)
   max_depth = trial.suggest_int(<span class="hljs-string" style="color: rgb(221, 17, 68);">'max_depth'</span>, <span class="hljs-number" style="color: teal;">1</span>, <span class="hljs-number" style="color: teal;">32</span>, log=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)
  
   clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
  
   <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> cross_val_score(clf, features, target, n_jobs=<span class="hljs-number" style="color: teal;">-1</span>, cv=<span class="hljs-number" style="color: teal;">3</span>).mean()</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>Optuna’s <em>suggest_int()</em> and <em>suggest_float()</em> methods dynamically propose hyperparameter values within the range you define; here, the values are chosen by the TPE (Tree-structured Parzen Estimator) sampler.<br><br>For example, the <span class="c-code-snippet">n_estimators</span> parameter can take a value between 2 and 20, and <span class="c-code-snippet">max_depth</span> a value between 1 and 32. Initially, you will have to come up with these ranges yourself; together, they define the search space.</p>



<h3 class="wp-block-heading" id="h-step-6-initialize-neptune-for-storing-the-optuna-trials">Step 6: Initialize Neptune for storing the Optuna trials</h3>



<p>To start using Neptune for experiment tracking, you need to initialize a new run using the <em>init_run()</em> method. This method requires the name of the Neptune project where you want to save the results and your API token.<br><br>You can do so with the following lines of code:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">run = neptune.init_run(
	project=<span class="hljs-string" style="color: rgb(221, 17, 68);">"username/Hyperparameter-Optimization-with-Optuna"</span>,
	api_token=<span class="hljs-string" style="color: rgb(221, 17, 68);">"YOUR_API_TOKEN"</span>,
)  <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># your credentials</span></pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>Since Optuna runs different trials one after another, Neptune employs a callback to track each trial before the next one begins. You can define this callback as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> neptune.integrations.optuna <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> npt_utils

neptune_callback = npt_utils.NeptuneCallback(run)</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>That’s all you have to do to set up Neptune to track your experiments. To learn more about Neptune’s integration with Optuna, <a href="https://docs-legacy.neptune.ai/integrations/optuna/" target="_blank" rel="noreferrer noopener">have a look at the documentation</a>.</p>



<h3 class="wp-block-heading" id="h-step-8-optimizing-the-objective-function">Step 7: Optimize the objective function</h3>



<p>Now, all that’s left is to define a <em>Study</em> consisting of N trials to optimize the objective function.</p>



<p>Initially, the sampler evaluates the objective function on a few randomly generated parameter combinations. Optuna then uses a surrogate model (here, TPE) to balance exploration (sampling from uncertain regions) with exploitation (sampling near promising configurations), efficiently searching for optimal hyperparameters.</p>



<p>The acquisition function then suggests the next hyperparameter configuration to evaluate, considering both the predicted performance and the uncertainty associated with each point in the search space. This process repeats until the pre-defined number of trials (in our case, 70) is reached.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create a study</span>
study = optuna.create_study(direction=<span class="hljs-string" style="color: rgb(221, 17, 68);">'maximize'</span>, sampler=TPESampler())
study.optimize(objective, n_trials=<span class="hljs-number" style="color: teal;">70</span>, callbacks=[neptune_callback])

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># get the best trial</span>
trial = study.best_trial

print(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Accuracy: {}'</span>.format(trial.value))
print(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Best hyperparameters: {}"</span>.format(trial.params))</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">Output:
Accuracy: <span class="hljs-number" style="color: teal;">0.88</span>
Best hyperparameters: {<span class="hljs-string" style="color: rgb(221, 17, 68);">'n_estimators'</span>: <span class="hljs-number" style="color: teal;">14</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'max_depth'</span>: <span class="hljs-number" style="color: teal;">8</span>}</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>As you can see in the code above, we use the <em>create_study()</em> method to define a <em>Study</em> and the optimization direction. Then, we use the <em>optimize()</em> method and provide the number of trials and our objective function for hyperparameter optimization.</p>



<p>You might notice that we are using the <em>callbacks</em> argument, passing the Neptune callback object. This ensures we track each trial and its related metadata in Neptune.</p>



<p>Once the optimization process is complete, you can use the <em>best_trial</em> attribute to get the best accuracy score and the associated set of hyperparameters. You should observe an improvement of around 21 percentage points in accuracy (from 0.67 to 0.88).</p>



<p>If you had used a basic grid search instead of Bayesian optimization with Optuna, trying every combination in the search space (19 values of <span class="c-code-snippet">n_estimators</span> times 32 values of <span class="c-code-snippet">max_depth</span>) would have required 608 iterations, almost nine times more than our 70 trials.</p>
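<p>As a quick sanity check, counting one grid candidate per integer value in the ranges we defined in the objective function:</p>

```python
# Exhaustive grid over the search space from the objective function
n_estimators_values = range(2, 21)  # 19 candidates: 2..20
max_depth_values = range(1, 33)     # 32 candidates: 1..32

n_grid = len(n_estimators_values) * len(max_depth_values)
print(n_grid)  # 608 model evaluations, versus 70 Optuna trials
```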



<p>You can also check the hyperparameter combinations that Optuna has tried out and the performance it has achieved from each set of hyperparameters as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> tri <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> study.get_trials():
	print(<span class="hljs-string" style="color: rgb(221, 17, 68);">'Hyperparameter Set:'</span>, tri.params)
	print(<span class="hljs-string" style="color: rgb(221, 17, 68);">"Accuracy:"</span>, tri.value) </pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1296" height="656" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=1296%2C656&#038;ssl=1" alt="Hyperparameter Set" class="wp-image-37010" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?w=1296&amp;ssl=1 1296w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=768%2C389&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=200%2C101&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=220%2C111&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=120%2C61&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=160%2C81&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=300%2C152&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=480%2C243&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/006f398c-35cd-4e0e-9577-746da47ed7dc.png?resize=1020%2C516&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></figure>



<p>Once you have your best set of hyperparameters, you can stop tracking data with Neptune using the following line of code:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0 block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);">run.stop()</pre></code></pre>
</div>




<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>



<p>This will provide you with the URL where the experiment data is saved. When you open that link (if you’re curious: here’s the one to my <a href="https://app.neptune.ai/gouravbais08/Hyperparameter-Optimization-with-Optuna/runs/details?viewId=standard-view&amp;detailsTab=metadata&amp;shortId=HYPER-5" target="_blank" rel="noreferrer noopener nofollow">Neptune project</a>), you will see different runs (based on how many times you have run Optuna). Each run will have several trials and the best set of hyperparameters. It will look something like this:</p>



<div id="app-screenshot-block_4d5519f3512b9c17fdd35b4464800bf3"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/89393dc3-3ecf-4d9b-b1d8-570bd2b22f55.png?fit=1020%2C564&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/89393dc3-3ecf-4d9b-b1d8-570bd2b22f55.png?fit=480%2C265&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/89393dc3-3ecf-4d9b-b1d8-570bd2b22f55.png?fit=768%2C425&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/04/89393dc3-3ecf-4d9b-b1d8-570bd2b22f55.png?fit=1020%2C564&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="564"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://app.neptune.ai/gouravbais08/Hyperparameter-Optimization-with-Optuna/runs/details?viewId=standard-view&#038;detailsTab=metadata&#038;shortId=HYPER-5"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							target="_blank" rel="nofollow noopener noreferrer"							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in the app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				Analyzing and managing Optuna hyperparameter optimization results in Neptune: Each run corresponds to an Optuna Study, consisting of several trials. Neptune tracks the hyperparameters and evaluation results for each trial.			</figcaption>
			
</div>



<div id="separator-block_12e08502e669bdd4ad88c892dd7a37a9"
         class="block-separator block-separator--15">
</div>


    <a
        href="https://neptune.ai/blog/the-best-tools-to-visualize-metrics-and-hyperparameters-of-machine-learning-experiments"
        id="cta-box-related-link-block_f03b9efb4b39eb4321ddb7957e37ef31"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-the-best-tools-to-visualize-metrics-and-hyperparameters-of-machine-learning-experiments">                The Best Tools to Visualize Metrics and Hyperparameters of Machine Learning Experiments             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read also                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-best-practices-for-bayesian-optimization-with-optuna">Best practices for Bayesian optimization with Optuna</h2>



<p>There are several best practices that increase the effectiveness and efficiency of hyperparameter optimization with Optuna.</p>



<p>Here’s a selection:</p>



<h3 class="wp-block-heading" id="h-understand-the-problem-and-data">Understand the problem and data</h3>



<p>It’s essential to thoroughly understand the problem you want to solve. You should know the characteristics of your dataset and the ML model you will use. This will allow you to understand the objective function&#8217;s nature and the hyperparameters&#8217; behavior. It will also help you choose the right metrics to minimize or maximize for optimal performance.</p>



<h3 class="wp-block-heading" id="h-define-a-relevant-search-space">Define a relevant search space</h3>



<p>You should carefully define the search space for the hyperparameters. You can start by identifying the hyperparameters relevant to the model and algorithm being optimized, such as learning rate, batch size, and number of layers for a neural network. Then, you need to specify the ranges or distributions for each parameter, for example, continuous values for the numeric hyperparameters and a set of values for the categorical variables.</p>



<p>Optuna <a href="https://optuna.readthedocs.io/en/stable/reference/distributions.html" target="_blank" rel="noreferrer noopener nofollow">supports various distributions</a> such as uniform, loguniform, categorical, and integer, enabling flexibility in defining the search space. Additionally, you can draw on business knowledge while defining the search space. Throughout, keep in mind that striking a balance between computational feasibility and the inclusivity of the search space is crucial.</p>



<h3 class="wp-block-heading" id="h-set-an-appropriate-number-of-trials">Set an appropriate number of trials</h3>



<p>You must choose a reasonable number of trials based on the available computational resources and the complexity of the optimization problem. If you run too few trials, the obtained hyperparameters can be suboptimal. Too many trials are computationally expensive and take a long time, negating the advantage over grid search and random search.</p>



<p>Start with a small number of trials and gradually increase it depending on how the optimization progresses. Once you have obtained the optimal parameters, validate the model’s performance on a separate validation set or perform cross-validation. This ensures that the chosen configuration generalizes well to new, unseen data.</p>
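<p>This final validation step can be sketched with scikit-learn. The best_params dict below is a hypothetical placeholder standing in for the output of a hyperparameter search:</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical "best" hyperparameters, standing in for the search output.
best_params = {"n_neighbors": 5}

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation as a sanity check that the tuned configuration
# generalizes beyond the data seen during the hyperparameter search.
scores = cross_val_score(KNeighborsClassifier(**best_params), X, y, cv=5)
print(round(scores.mean(), 3))
```

If the cross-validated score is far below the score observed during the search, the chosen configuration has likely overfit the search procedure itself.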



<h3 class="wp-block-heading" id="h-experiment-with-different-acquisition-functions">Experiment with different acquisition functions</h3>



<p>Optuna <a href="https://optuna.readthedocs.io/en/stable/reference/samplers/index.html" target="_blank" rel="noreferrer noopener nofollow">supports different acquisition functions</a>, including Probability of Improvement, Expected Improvement, and Upper Confidence Bound. You should experiment with different functions to find the one that aligns with the characteristics of your objective function.<br><br>For example, <a href="https://botorch.org/tutorials/one_shot_kg" target="_blank" rel="noreferrer noopener nofollow">Knowledge Gradient (KG)</a> is effective for sparse and high-dimensional data, <a href="https://www.cs.cornell.edu/courses/cs6783/2021fa/lec25.pdf" target="_blank" rel="noreferrer noopener nofollow">Upper Confidence Bound (UCB)</a> is effective for large datasets with complex relationships, and <a href="https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html#probability-of-improvement-pi" target="_blank" rel="noreferrer noopener nofollow">Probability of Improvement (PI)</a> is effective for data with high variability and noise.</p>



<h3 class="wp-block-heading" id="h-parallel-and-distributed-optimization">Parallel and distributed optimization</h3>



<p>You can leverage parallel and distributed optimization to speed up the overall hyperparameter optimization search, especially when your objective function is computationally expensive, and you are trying a wide range of hyperparameters. To this end, Optuna supports the <a href="https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/004_distributed.html" target="_blank" rel="noreferrer noopener nofollow">parallel execution</a> of trials.</p>


    <a
        href="https://neptune.ai/blog/best-tools-for-model-tuning-and-hyperparameter-optimization"
        id="cta-box-related-link-block_4f447172c1af5f2170295fc5702cabc2"
        class="block-cta-box-related-link  l-margin__top--0 l-margin__bottom--0"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Recommended                </div>
            </div>

        
                    <h3 class="c-header" id="h-best-tools-for-model-tuning-and-hyperparameter-optimization">                Best Tools for Model Tuning and Hyperparameter Optimization             </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read also                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>After working through our introductory tutorial, you now understand the foundations of Bayesian hyperparameter optimization and its mechanics. We&#8217;ve discussed how Bayesian optimization differs from conventional techniques such as random and grid search. Then, we’ve explored the practical application of hyperparameter optimization with Optuna and Neptune. Finally, we’ve reviewed effective strategies to optimize your hyperparameter search process. Armed with this knowledge, you&#8217;re well-prepared to apply Bayesian optimization to enhance the performance of your ML models.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">37012</post-id>	</item>
		<item>
		<title>How to Save Trained Model in Python</title>
		<link>https://neptune.ai/blog/saving-trained-model-in-python</link>
		
		<dc:creator><![CDATA[Gourav Bais]]></dc:creator>
		<pubDate>Wed, 10 May 2023 13:56:49 +0000</pubDate>
				<category><![CDATA[ML Model Development]]></category>
		<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=22337</guid>

					<description><![CDATA[When working on real-world machine learning (ML) use cases, finding the best algorithm/model is not the end of your responsibilities. It is crucial to save, store, and package these models for their future use and deployment to production. These practices are needed for a number of reasons: To reiterate, while saving and storing ML models&#8230;]]></description>
										<content:encoded><![CDATA[
<p>When working on real-world machine learning (ML) use cases, <a href="/blog/ml-model-evaluation-and-selection" target="_blank" rel="noreferrer noopener">finding the best algorithm/model</a> is not the end of your responsibilities. It is crucial to save, store, and package these models for their future use and deployment to production.</p>



<p>These practices are needed for a number of reasons:</p>



<ul class="wp-block-list">
<li><strong>Backup:</strong> A trained model can be saved as a backup in case the original data is damaged or destroyed.&nbsp;</li>
</ul>



<ul class="wp-block-list">
<li><strong>Reusability &amp; reproducibility:</strong> Building ML models is time-consuming by nature. To save cost and time, it becomes essential that your model gets you the same results every time you run it. Saving and storing your model the right way takes care of this.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Deployment:</strong> When <a href="/blog/model-deployment-strategies" target="_blank" rel="noreferrer noopener">deploying a trained model</a> in a real-world setting, it becomes necessary to package it for easy deployment. This makes it possible for other systems and applications to use the same model without much hassle.</li>
</ul>



<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>To reiterate, while saving and storing ML models allow easy sharing, reusability, and reproducibility, packaging the models enables quick and painless deployment. These three operations work in harmony to simplify the whole model management process.</p>



<p>In this article, you will learn about different methods of saving, storing, and packaging a trained machine-learning model, along with the pros and cons of each method. But before that, you must understand the distinction between these three terms.&nbsp;</p>



<h2 class="wp-block-heading" id="h-save-vs-package-vs-store-ml-models">Save vs package vs store ML models</h2>



<p>Although all these terms look similar, they are not the same.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="803" height="674" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=803%2C674&#038;ssl=1" alt="" class="wp-image-22741" style="width:602px;height:506px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?w=803&amp;ssl=1 803w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=768%2C645&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=200%2C168&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=220%2C185&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=120%2C101&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=160%2C134&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=300%2C252&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/Save-vs-store-vs-package-ML-models.png?resize=480%2C403&amp;ssl=1 480w" sizes="auto, (max-width: 803px) 100vw, 803px" /><figcaption class="wp-element-caption">Saving vs Storing vs Packaging ML Models | Source: Author</figcaption></figure>
</div>


<p><strong>Saving</strong> a model refers to the process of saving the model’s parameters, weights, etc., to a file. Usually, all ML and DL frameworks provide some kind of method (e.g., model.save()) for saving models. But be aware that saving is a single action that gives you only a model binary file, so you still need code to make your ML application production-ready.</p>



<p><strong>Packaging,</strong> on the other hand, refers to the process of bundling or containerizing the necessary components of a model, such as the model file, dependencies, configuration files, etc., into a single deployable package. The goal of a package is to make it easier to distribute and deploy the ML model in a production environment.&nbsp;</p>



<p>Once packaged, a model can be deployed across different environments, which allows the model to be used in various production settings such as web applications, mobile applications, etc. Docker is one of the tools that allow you to do this.</p>



<p><strong>Storing</strong> the ML model refers to the process of saving the trained model files in centralized storage that can be accessed whenever needed. When storing a model, you normally choose some sort of storage from which you can fetch your model and use it at any time. The model registry is a category of tools that solve this issue for you.</p>



<p>Now let’s see how we can save our model.</p>



<h2 class="wp-block-heading" id="h-how-to-save-a-trained-model-in-python">How to save a trained model in Python?</h2>



<p>In this section, you will see different ways of saving machine learning (ML) as well as deep learning (DL) models. To begin with, let’s create a simple classification model using the famous<a href="https://archive.ics.uci.edu/ml/datasets/iris" target="_blank" rel="noreferrer noopener nofollow"> Iris dataset</a>.</p>



<p><strong><em>Note:</em></strong><em> The focus of this article is not to show you how to create the best ML model but to explain how to effectively save trained models.&nbsp;</em></p>



<p>You first need to load the required dependencies and the iris dataset as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load dependencies</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pandas <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> pd 

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.model_selection <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> train_test_split
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.preprocessing <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> StandardScaler 
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.neighbors <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> KNeighborsClassifier
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn.metrics <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> classification_report, confusion_matrix

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load the dataset</span>
url = <span class="hljs-string" style="color: rgb(221, 17, 68);">"iris.data"</span>

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># column names to use</span>
names = [<span class="hljs-string" style="color: rgb(221, 17, 68);">'sepal-length'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'sepal-width'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'petal-length'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'petal-width'</span>, <span class="hljs-string" style="color: rgb(221, 17, 68);">'Class'</span>]

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># read the dataset</span>
dataset = pd.read_csv(url, names=names) 

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># check the first few rows of iris-classification data</span>
dataset.head()</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Next, you need to split the data into training and testing sets and apply the required preprocessing stages, such as feature standardization.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># separate the independent and dependent features</span>
X = dataset.iloc[:, :<span class="hljs-number" style="color: teal;">-1</span>].values
y = dataset.iloc[:, <span class="hljs-number" style="color: teal;">4</span>].values 

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Split dataset into random training and testing subsets</span>
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, test_size=<span class="hljs-number" style="color: teal;">0.20</span>) 
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># feature standardization</span>
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) </pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Finally, you need to train a classification model (feel free to choose any) on training data and check its performance on testing data.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># training a KNN classifier</span>
model = KNeighborsClassifier(n_neighbors=<span class="hljs-number" style="color: teal;">5</span>)
model.fit(X_train, y_train) 

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># make predictions on the testing data</span>
y_predict = model.predict(X_test)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># check results</span>
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict)) </pre></code></pre>
</div>



<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="464" height="196" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?resize=464%2C196&#038;ssl=1" alt="Iris Classification Results" class="wp-image-22345" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?w=464&amp;ssl=1 464w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?resize=200%2C84&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?resize=220%2C93&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?resize=120%2C51&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?resize=160%2C68&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1.png?resize=300%2C127&amp;ssl=1 300w" sizes="auto, (max-width: 464px) 100vw, 464px" /><figcaption class="wp-element-caption"><em>Iris classification results | Source: Author</em></figcaption></figure>
</div>


<p>Now you have an ML model that you want to save for future use. The first way to save an ML model is by using the<a href="https://docs.python.org/3/library/pickle.html" target="_blank" rel="noreferrer noopener nofollow"> pickle</a> module.</p>



<h3 class="wp-block-heading" id="h-saving-trained-model-with-pickle">Saving trained model with pickle</h3>



<p>The<a href="https://docs.python.org/3/library/pickle.html" target="_blank" rel="noreferrer noopener nofollow"> pickle</a> module can be used to serialize and deserialize the Python objects. <strong>Pickling</strong> is the process of converting a Python object hierarchy into a byte stream, while <strong>Unpickling</strong> is the process of converting a byte stream (from a binary file or other object that appears to be made of bytes) back to an object hierarchy.</p>



<p>To save an ML model as a pickle file, you use the <strong>pickle</strong> module, which already comes with the default<a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener nofollow"> Python</a> installation.</p>



<p>To save your iris classifier model you simply need to decide on a filename and dump your model to a pickle file like this:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pickle

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save the iris classification model as a pickle file</span>
model_pkl_file = <span class="hljs-string" style="color: rgb(221, 17, 68);">"iris_classifier_model.pkl"</span>  

<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> open(model_pkl_file, <span class="hljs-string" style="color: rgb(221, 17, 68);">'wb'</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> file:  
    pickle.dump(model, file)
</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>As you can see, the file is opened in <strong>wb (write binary)</strong> mode to save the model as bytes. The <strong>dump()</strong> method stores the model in the given pickle file.</p>



<p>You can also load this model using the <strong>load()</strong> method of the pickle module. To do so, open the file in <strong>rb (read binary)</strong> mode.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load model from pickle file</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> open(model_pkl_file, <span class="hljs-string" style="color: rgb(221, 17, 68);">'rb'</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> file:  
    model = pickle.load(file)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate model </span>
y_predict = model.predict(X_test)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># check results</span>
print(classification_report(y_test, y_predict)) </pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Once loaded you can use this model to make predictions.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="444" height="153" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=444%2C153&#038;ssl=1" alt="Another Iris Classification Result " class="wp-image-22348" style="width:444px;height:153px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?w=444&amp;ssl=1 444w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=200%2C69&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=220%2C76&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=120%2C41&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=160%2C55&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=300%2C103&amp;ssl=1 300w" sizes="auto, (max-width: 444px) 100vw, 444px" /><figcaption class="wp-element-caption">Iris classification result | Source: Author</figcaption></figure>
</div>


<h4 class="wp-block-heading">Pros of the Python pickle approach&nbsp;</h4>



<div id="case-study-numbered-list-block_93072b6d96c23114cf5883bcf3704e4c"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Pickle comes as a standard module in Python, which makes it easy to use for saving and restoring ML models.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Pickle files can handle most Python objects, including custom objects, making them a versatile way to save models.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                For small models, the pickle approach is quite fast and efficient.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                When an ML model is unpickled, it is restored to its previous state, including any variables or configurations. This makes Python pickle files one of the best alternatives for saving ML models.             </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of the Python pickle approach</h4>



<div id="case-study-numbered-list-block_918cda44de434e896772f7e62f4950bb"
         class="block-case-study-numbered-list ">

    

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                If you unpickle untrusted data, pickling could pose a security threat. Unpickling an object can execute malicious code, so it&#8217;s crucial to only unpickle information from reliable sources.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Pickled objects&#8217; use may be constrained in some circumstances, since pickles are not guaranteed to be compatible across different Python or library versions.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                For models with a big memory footprint, pickling can result in the creation of huge files, which can be problematic.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Pickling can make it difficult to track changes to a model over time, especially if the model is updated frequently and it is not feasible to create multiple pickle files for different versions of models that you try.             </li>
            </ul>
</div>
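<p>One way to mitigate the security concern above is the &#8220;restricting globals&#8221; pattern from the Python pickle documentation: override Unpickler.find_class() so that only explicitly trusted modules can be resolved during unpickling. A minimal sketch (the allow-list here is an illustrative choice, not a recommendation to allow only builtins):</p>

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    # Illustrative allow-list; extend it with the modules your models need.
    ALLOWED_MODULES = {"builtins"}

    def find_class(self, module, name):
        # Refuse to resolve globals from untrusted modules.
        if module in self.ALLOWED_MODULES:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Deserialize bytes while refusing untrusted globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data structures round-trip normally...
payload = pickle.dumps({"model": "KNN", "n_neighbors": 5})
print(restricted_loads(payload))

# ...while a pickle referencing, e.g., os.system raises UnpicklingError.
```

Note that this only restricts which globals unpickling can resolve; loading pickles from untrusted sources remains risky, so prefer trusted storage regardless.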



<p>Pickle is best suited for small models and has some security issues; these are reasons enough to look for an alternative way of saving ML models. Next, let’s discuss <strong>Joblib</strong> to save and load ML models.</p>



<p><strong><em>Note: </em></strong><em>In the upcoming sections, you will see the same iris classifier model saved using different techniques.</em></p>



<h3 class="wp-block-heading" id="h-saving-trained-model-with-joblib">Saving trained model with Joblib</h3>



<p><a href="https://joblib.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener nofollow">Joblib</a> is a set of tools (part of the<a href="https://scipy.org/" target="_blank" rel="noreferrer noopener nofollow"> SciPy</a> ecosystem) that provide lightweight pipelining in Python. It focuses on disk-caching, memoization, and parallel computing, and is commonly used for saving and loading Python objects. Joblib is specifically optimized for<a href="https://numpy.org/" target="_blank" rel="noreferrer noopener nofollow"> NumPy</a> arrays, making it fast and reliable for ML models that have a lot of parameters.</p>



<p>To save large models with Joblib, you need the Python <strong>joblib</strong> package, which is installed together with scikit-learn (or separately via <em>pip install joblib</em>).</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> joblib 

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save model with joblib </span>
filename = <span class="hljs-string" style="color: rgb(221, 17, 68);">'joblib_model.sav'</span>
joblib.dump(model, filename)</pre></code></pre>
</div>




<div id="separator-block_29948e1abe7f9a1887fa915dded70b5f"
         class="block-separator block-separator--20">
</div>



<p>To save the model, you need to define a filename with a <em>‘.sav’</em> or <em>‘.pkl’</em> extension and call the <strong>dump() </strong>method from Joblib.&nbsp;</p>
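For large models, `joblib.dump()` also accepts a `compress` argument (an integer from 0 to 9) that can noticeably shrink the saved file at the cost of a little CPU time. A quick sketch, with a stand-in iris classifier trained inline so the snippet is self-contained:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the article's iris classifier, trained here for self-containment
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# compress=3 trades a little CPU time for a noticeably smaller file
joblib.dump(model, "joblib_model_compressed.sav", compress=3)

loaded = joblib.load("joblib_model_compressed.sav")
print(loaded.score(X, y))
```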



<p>Similar to pickle, Joblib provides the <strong>load()</strong> method to load the saved ML model.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load model with joblib</span>
loaded_model = joblib.load(filename)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># evaluate model </span>
y_predict = loaded_model.predict(X_test)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># check results</span>
print(classification_report(y_test, y_predict)) 
</pre></code></pre>
</div>




<div id="separator-block_29948e1abe7f9a1887fa915dded70b5f"
         class="block-separator block-separator--20">
</div>



<p>After loading the model with Joblib you are free to use it on the data to make predictions.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="444" height="153" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=444%2C153&#038;ssl=1" alt="Iris classification results" class="wp-image-22348" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?w=444&amp;ssl=1 444w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=200%2C69&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=220%2C76&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=120%2C41&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=160%2C55&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-1b.png?resize=300%2C103&amp;ssl=1 300w" sizes="auto, (max-width: 444px) 100vw, 444px" /><figcaption class="wp-element-caption">Iris classification results | Source: Author</figcaption></figure>
</div>


<h4 class="wp-block-heading">Pros of saving ML models with Joblib&nbsp;</h4>



<div id="case-study-numbered-list-block_8e9306a749f86ef0483bf2e03c2c2be4"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
Joblib is fast and efficient, especially for models with substantial memory requirements.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                The serialization and deserialization process can be parallelized via Joblib, which can enhance performance on multi-core machines.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                For models that demand a lot of memory, Joblib employs a memory-mapped file format to reduce memory utilization.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
Joblib supports transparent on-the-fly compression (via the <em>compress</em> argument of <strong>dump()</strong>), which can substantially reduce the size of saved models.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of Saving ML Models with Joblib&nbsp;</h4>



<div id="case-study-numbered-list-block_e1aab5460a67ba56c8a03b3cea483b4d"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
Joblib is optimized for NumPy arrays and may not work as well with other object types.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Joblib offers less flexibility than Pickle because there are fewer options available for configuring the serialization process.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Compared to Pickle, Joblib is less well known, which can make it more difficult to locate help and documentation around it.            </li>
            </ul>
</div>



<p>Although Joblib solves the major issues of pickle, it has some limitations of its own. Next, you will see how you can manually save and restore models using JSON.&nbsp;</p>



<h3 class="wp-block-heading" id="h-saving-trained-model-with-json">Saving trained model with JSON</h3>



<p>When you want to have full control over the save and restore procedure of your ML model,<a href="https://docs.python.org/3/library/json.html" target="_blank" rel="noreferrer noopener nofollow"> JSON</a> comes into play. Unlike the other two methods, this method does not directly dump the ML model to a file; instead, you need to explicitly define the different parameters of your model to save them.&nbsp;</p>



<p>To use this method, you need to use the Python <strong>json</strong> module that again comes along with the default Python installation. Using the JSON method requires additional effort to write all parameters that an ML model contains. To save the model using JSON, let’s create a function like this:&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> json 

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create json save function</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">save_json</span><span class="hljs-params">(model, filepath, X_train, y_train)</span>:</span> 
    saved_model = {}
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"algorithm"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'algorithm'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"leaf_size"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'leaf_size'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"metric"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'metric'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"metric_params"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'metric_params'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"n_jobs"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'n_jobs'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"n_neighbors"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'n_neighbors'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"p"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'p'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"weights"</span>] = model.get_params()[<span class="hljs-string" style="color: rgb(221, 17, 68);">'weights'</span>]
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"X_train"</span>] = X_train.tolist() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> X_train <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"None"</span>
    saved_model[<span class="hljs-string" style="color: rgb(221, 17, 68);">"y_train"</span>] = y_train.tolist() <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> y_train <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">is</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">not</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span> <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">else</span> <span class="hljs-string" style="color: rgb(221, 17, 68);">"None"</span>
    
    json_txt = json.dumps(saved_model, indent=<span class="hljs-number" style="color: teal;">4</span>)
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> open(filepath, <span class="hljs-string" style="color: rgb(221, 17, 68);">"w"</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> file: 
        file.write(json_txt)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save the iris-classification model in a json file</span>
file_path = <span class="hljs-string" style="color: rgb(221, 17, 68);">'json_model.json'</span>
save_json(model, file_path, X_train, y_train)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>You can see how you need to define each model parameter and the training data to store them in JSON. Different models have different methods for inspecting their parameters. For example, <strong>get_params()</strong> for<a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html" target="_blank" rel="noreferrer noopener nofollow"> KNeighborsClassifier</a> returns all the hyperparameters of the model. You need to save all these hyperparameters and data values in a dictionary, which is then dumped into a file with the <em>‘.json’</em> extension.&nbsp;</p>



<p>To read this JSON file you just need to open it and access the parameters as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create json load function </span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">load_json</span><span class="hljs-params">(filepath)</span>:</span> 
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">with</span> open(filepath, <span class="hljs-string" style="color: rgb(221, 17, 68);">"r"</span>) <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> file:
        saved_model = json.load(file)
    
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> saved_model

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load model configurations</span>
saved_model = load_json(<span class="hljs-string" style="color: rgb(221, 17, 68);">'json_model.json'</span>)
saved_model</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>In the above code, a function <strong>load_json() </strong>is created that opens the JSON file in read mode and returns all the parameters and data as a dictionary.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="659" height="286" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=659%2C286&#038;ssl=1" alt="JSON Loaded Model " class="wp-image-22356" style="width:659px;height:286px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?w=659&amp;ssl=1 659w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=200%2C87&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=220%2C95&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=120%2C52&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=160%2C69&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=300%2C130&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-2.png?resize=480%2C208&amp;ssl=1 480w" sizes="auto, (max-width: 659px) 100vw, 659px" /><figcaption class="wp-element-caption">JSON Loaded Model | Source: Author</figcaption></figure>
</div>


<p>Unfortunately, you can not use the saved model directly with JSON; you need to read the parameters and data back and retrain the model yourself.&nbsp;</p>
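To make that retraining step concrete, here is a self-contained sketch of the whole JSON round trip, using scikit-learn's iris data; the file name and dictionary keys are illustrative:

```python
import json

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
original = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Serialize hyperparameters plus training data to JSON
payload = {"params": original.get_params(),
           "X_train": X.tolist(), "y_train": y.tolist()}
with open("knn_model.json", "w") as f:
    json.dump(payload, f, indent=4)

# "Load" the model: read the file, rebuild the estimator, refit on the stored data
with open("knn_model.json") as f:
    restored = json.load(f)
model = KNeighborsClassifier(**restored["params"]).fit(
    np.array(restored["X_train"]), np.array(restored["y_train"]))
print(model.score(X, y))
```

Note that this works because all of KNeighborsClassifier's hyperparameters happen to be JSON-serializable; estimators with non-trivial parameter objects would need extra conversion logic.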



<h4 class="wp-block-heading">Pros of saving ML models with JSON&nbsp;</h4>



<div id="case-study-numbered-list-block_a0e1960bb8ee2b1ab43df88b3886e429"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
JSON is a portable format that can be read by a wide variety of programming languages and platforms, making it a good fit for models that need to be exchanged between different systems.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                JSON is a text-based format that is easy to read and understand, making it a good choice for models that need to be inspected or edited by humans.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                In comparison to Pickle or Joblib, JSON is a lightweight format that creates smaller files, which can be crucial for models that must be transferred over the internet.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
Unlike pickle, which can execute code during deserialization, JSON is a pure data format, which minimizes security threats.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of Saving ML Models with JSON</h4>



<div id="case-study-numbered-list-block_76e9101cbaa33aa843bcea8751ac4893"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
Because JSON only supports a small number of data types, it may not be compatible with sophisticated machine learning models that employ custom data types.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                In particular, for large models, JSON serialization and deserialization can be slower than other formats.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Compared to alternative formats, JSON offers less flexibility and may take more effort to tailor the serialization procedure.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                JSON is a lossy format that may not preserve all of the information in the original model, which can be a problem for models that require exact replication.            </li>
            </ul>
</div>



<p>So far, you have seen how to save classical ML models. Deep learning frameworks bring their own saving utilities. Next, you will see how to save deep learning models, starting with TensorFlow Keras.&nbsp;</p>



<h3 class="wp-block-heading" id="h-saving-deep-learning-model-with-tensorflow-keras">Saving deep learning model with TensorFlow Keras</h3>



<p><a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow</a> is a popular framework for training DL-based models, and<a href="https://keras.io/" target="_blank" rel="noreferrer noopener nofollow"> Keras</a> is its high-level API. Deep learning models are trained on labeled data using a neural network architecture with numerous layers. These models have two major components, the weights and the network architecture, and you need to save both to restore them for future use. Typically, there are two ways to save deep learning models:</p>



<ol class="wp-block-list">
<li>Save the model architecture in a JSON or YAML file and weights in an<a href="https://docs.h5py.org/en/stable/quick.html"> HDF5</a> file.&nbsp;</li>



<li>Save both the weights and the architecture in a single HDF5,<a href="https://protobuf.dev/getting-started/pythontutorial/"> protobuf</a>, or<a href="https://pypi.org/project/tflite/"> tflite</a> file.&nbsp;</li>
</ol>
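The first option can be sketched with Keras's own utilities (`to_json()`, `save_weights()`, `model_from_json()`); the tiny network and file names here are illustrative:

```python
import numpy as np
import tensorflow as tf

# A tiny stand-in network; the article's iris model would be saved the same way
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Option 1: architecture as JSON, weights as HDF5
with open("architecture.json", "w") as f:
    f.write(model.to_json())
model.save_weights("model.weights.h5")

# Restore: rebuild the architecture first, then load the weights into it
with open("architecture.json") as f:
    restored = tf.keras.models.model_from_json(f.read())
restored.load_weights("model.weights.h5")

x = np.ones((1, 4), dtype="float32")
print(np.allclose(model.predict(x), restored.predict(x)))
```

Note that this route saves only the architecture and weights; the optimizer state is lost, which is why the combined single-file option below is more common.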



<div id="separator-block_cdb323695a72c19e7c22b3f4e8952e06"
         class="block-separator block-separator--10">
</div>



<p>You can use either approach, but the most widely used method is to save the model weights and architecture together in an HDF5 file.&nbsp;</p>



<p>To save a deep learning model in TensorFlow Keras, you can use the <strong>save()</strong> method of the Keras <strong>Model</strong> object. This method saves the entire model, including the model architecture,<a href="https://keras.io/api/optimizers/" target="_blank" rel="noreferrer noopener nofollow"> optimizer</a>, and weights, in a format that can be loaded later to make predictions.</p>



<p>Here&#8217;s an example code snippet that shows how to save a TensorFlow Keras-based DL model:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># import tensorflow dependencies</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> tensorflow.keras.models <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Sequential, model_from_json
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> tensorflow.keras.layers <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> Dense

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># define model architecture</span>
model = Sequential()
model.add(Dense(<span class="hljs-number" style="color: teal;">12</span>, input_dim=<span class="hljs-number" style="color: teal;">4</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
model.add(Dense(<span class="hljs-number" style="color: teal;">8</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'relu'</span>))
model.add(Dense(<span class="hljs-number" style="color: teal;">3</span>, activation=<span class="hljs-string" style="color: rgb(221, 17, 68);">'softmax'</span>))

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Compile model</span>
model.compile(loss=<span class="hljs-string" style="color: rgb(221, 17, 68);">'sparse_categorical_crossentropy'</span>, optimizer=<span class="hljs-string" style="color: rgb(221, 17, 68);">'adam'</span>, metrics=[<span class="hljs-string" style="color: rgb(221, 17, 68);">'accuracy'</span>])

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Fit the model</span>
model.fit(X_train, y_train, epochs=<span class="hljs-number" style="color: teal;">150</span>, batch_size=<span class="hljs-number" style="color: teal;">10</span>, verbose=<span class="hljs-number" style="color: teal;">0</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save model and its architecture </span>
model.save(<span class="hljs-string" style="color: rgb(221, 17, 68);">'model.h5'</span>)</pre></code></pre>
</div>




<div id="separator-block_29948e1abe7f9a1887fa915dded70b5f"
         class="block-separator block-separator--20">
</div>



<p>That&#8217;s it: you just need to define the model architecture, train the model with appropriate settings, and finally save it using the <strong>save()</strong> method.&nbsp;</p>



<p>Loading a saved model with Keras is as easy as reading a file in Python. You just need to call the <strong>load_model()</strong> method with the model file path, and your model will be loaded.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># define dependency </span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> tensorflow.keras.models <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> load_model

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load model </span>
model = load_model(<span class="hljs-string" style="color: rgb(221, 17, 68);">'model.h5'</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># check model info </span>
model.summary()</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Your model is now loaded for use.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="519" height="210" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=519%2C210&#038;ssl=1" alt="Tensorflow loaded model" class="wp-image-22360" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?w=519&amp;ssl=1 519w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=200%2C81&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=220%2C89&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=120%2C49&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=160%2C65&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=300%2C121&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-3.png?resize=480%2C194&amp;ssl=1 480w" sizes="auto, (max-width: 519px) 100vw, 519px" /><figcaption class="wp-element-caption">Tensorflow loaded model | Source: Author</figcaption></figure>
</div>


<h4 class="wp-block-heading">Pros of saving models with TensorFlow Keras&nbsp;</h4>



<div id="case-study-numbered-list-block_724f71257e0f20524a05b4ca70a5ac25"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Saving and loading models in TensorFlow Keras is very straightforward using the save() and load_model() functions. This makes it easy to save and share models with others or to deploy them to production.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
The whole model architecture, optimizer, and weights are saved in one file when you save a Keras model. This makes it simple to load the model and generate predictions without having to handle the architecture and weights separately.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                TensorFlow Keras supports several file formats for saving models, including the HDF5 format (.h5), the TensorFlow SavedModel format (.pb), and the TensorFlow Lite format (.tflite). This gives you flexibility in choosing the format that best suits your needs.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of Saving Models with TensorFlow Keras&nbsp;</h4>



<div id="case-study-numbered-list-block_8d6b657ef2ade724617456bf1748d067"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                When you save a Keras model, the resulting file can be quite large, especially if you have a large number of layers or parameters. This can make it challenging to share or deploy the model, especially in situations where bandwidth or storage space is limited.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
Models saved with one version of TensorFlow Keras may not work with another. Trying to load a model that was saved with a different version of Keras or TensorFlow may result in errors.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Although it&#8217;s simple to save a Keras model, you&#8217;re only able to use the features that Keras offers for storing models. A different framework or strategy may be required if you require more flexibility in the way models are saved or loaded.            </li>
            </ul>
</div>



<p>There is one more widely used framework for training DL-based models: PyTorch. Let’s check how you can save PyTorch-based deep learning models with Python.&nbsp;</p>



<h3 class="wp-block-heading" id="h-saving-deep-learning-model-with-pytorch">Saving deep learning model with PyTorch</h3>



<p>Developed by Facebook, PyTorch is one of the most widely used frameworks for developing DL-based solutions. It provides a dynamic computational graph, which allows you to modify your model on the fly, making it ideal for research and experimentation. It uses the <em>‘.pt’</em> and <em>‘.pth’</em> file formats to save model architecture and weights.&nbsp;</p>



<p>To save a deep learning model in PyTorch, you can use the <strong>torch.save()</strong> function. Passing it the whole <strong>torch.nn.Module</strong> object saves the entire model, including the model architecture and weights, in a format that can be loaded later to make predictions.</p>
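Alternatively, the PyTorch documentation recommends saving only the model's `state_dict` (its weights) and re-creating the architecture in code when loading, which avoids pickling the model class itself. A minimal sketch with a toy module (names are illustrative):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(x)

model = TinyNet()
torch.save(model.state_dict(), "tiny_net.pt")  # weights only, no class pickled

# To restore, re-create the architecture in code, then load the weights into it
restored = TinyNet()
restored.load_state_dict(torch.load("tiny_net.pt"))
restored.eval()

x = torch.ones(1, 4)
print(torch.equal(model(x), restored(x)))
```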



<p>Here&#8217;s an example code snippet that shows how to save a PyTorch model:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># import dependencies</span>
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> torch.nn <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> nn
<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> numpy <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> np

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># convert data numpy arrays to tensors</span>
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)
y_train = torch.LongTensor(y_train)
y_test = torch.LongTensor(y_test)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># define model architecture</span>
<span class="hljs-class"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">class</span> <span class="hljs-title" style="color: rgb(68, 85, 136); font-weight: 700;">NeuralNetworkClassificationModel</span><span class="hljs-params">(nn.Module)</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">__init__</span><span class="hljs-params">(self,input_dim,output_dim)</span>:</span>
        super(NeuralNetworkClassificationModel,self).__init__()
        self.input_layer    = nn.Linear(input_dim,<span class="hljs-number" style="color: teal;">128</span>)
        self.hidden_layer1  = nn.Linear(<span class="hljs-number" style="color: teal;">128</span>,<span class="hljs-number" style="color: teal;">64</span>)
        self.output_layer   = nn.Linear(<span class="hljs-number" style="color: teal;">64</span>,output_dim)
        self.relu = nn.ReLU()
    
    
    <span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">forward</span><span class="hljs-params">(self,x)</span>:</span>
        out =  self.relu(self.input_layer(x))
        out =  self.relu(self.hidden_layer1(out))
        out =  self.output_layer(out)
        <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">return</span> out

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># define input and output dimensions</span>
input_dim  = <span class="hljs-number" style="color: teal;">4</span> 
output_dim = <span class="hljs-number" style="color: teal;">3</span>
model = NeuralNetworkClassificationModel(input_dim,output_dim)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># create our optimizer and loss function object</span>
learning_rate = <span class="hljs-number" style="color: teal;">0.01</span>
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># define training steps</span>
<span class="hljs-function"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">def</span> <span class="hljs-title" style="color: rgb(153, 0, 0); font-weight: 700;">train_network</span><span class="hljs-params">(model,optimizer,criterion,X_train,y_train,X_test,y_test,num_epochs,train_losses,test_losses)</span>:</span>
    <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">for</span> epoch <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">in</span> range(num_epochs):
        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># clear out the gradients from the last step loss.backward()</span>
        optimizer.zero_grad()
        
        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># forward feed</span>
        output_train = model(X_train)

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># calculate the loss</span>
        loss_train = criterion(output_train, y_train)

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># backward propagation: calculate gradients</span>
        loss_train.backward()

        <span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># update the weights</span>
        optimizer.step()
        
        output_test = model(X_test)
        loss_test = criterion(output_test,y_test)

        train_losses[epoch] = loss_train.item()
        test_losses[epoch] = loss_test.item()

        <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">if</span> (epoch + <span class="hljs-number" style="color: teal;">1</span>) % <span class="hljs-number" style="color: teal;">50</span> == <span class="hljs-number" style="color: teal;">0</span>:
            print(f<span class="hljs-string" style="color: rgb(221, 17, 68);">"Epoch { epoch+1 }/{ num_epochs }, Train Loss: { loss_train.item():.4f }, Test Loss: {loss_test.item():.4f}"</span>)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># train model</span>
num_epochs = <span class="hljs-number" style="color: teal;">1000</span>
train_losses = np.zeros(num_epochs)
test_losses  = np.zeros(num_epochs)
train_network(model,optimizer,criterion,X_train,y_train,X_test,y_test,num_epochs,train_losses,test_losses)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># save model </span>
torch.save(model, <span class="hljs-string" style="color: rgb(221, 17, 68);">'model_pytorch.pt'</span>)
</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Unlike TensorFlow, PyTorch gives you more control over the training loop, as seen in the above code. After training, you can save the model architecture and weights using the <strong>torch.save()</strong> function.&nbsp;</p>



<p>Loading the saved model with PyTorch requires the <strong>torch.load()</strong> function.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># load model</span>
model = torch.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">'model_pytorch.pt'</span>)
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># check model summary</span>
model.eval()
</pre></code></pre>
</div>



<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="537" height="93" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=537%2C93&#038;ssl=1" alt="Pytorch loaded model " class="wp-image-22364" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?w=537&amp;ssl=1 537w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=200%2C35&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=220%2C38&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=120%2C21&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=160%2C28&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=300%2C52&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/05/how-to-save-trained-model-in-python-4.png?resize=480%2C83&amp;ssl=1 480w" sizes="auto, (max-width: 537px) 100vw, 537px" /><figcaption class="wp-element-caption"><em>Pytorch loaded model | Source: Author</em></figcaption></figure>
</div>
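<p>Note that <strong>torch.save(model, …)</strong> pickles the entire Python object, which ties the saved file to the exact class definition and module layout. The alternative recommended in PyTorch’s own documentation is to save only the <strong>state_dict</strong> (the weights) and rebuild the architecture at load time. Here is a minimal sketch, using a stand-in model that mirrors the classifier shape above (4 → 128 → 64 → 3):</p>

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model mirroring the article's classifier shape (4 -> 128 -> 64 -> 3).
def build_model():
    return nn.Sequential(
        nn.Linear(4, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 3),
    )

model = build_model()

# Save only the weights, not the pickled module object.
path = os.path.join(tempfile.mkdtemp(), "model_state.pt")
torch.save(model.state_dict(), path)

# To load: rebuild the architecture first, then restore the weights into it.
restored = build_model()
restored.load_state_dict(torch.load(path))
restored.eval()
```

Because only tensors are stored, the file survives refactors of your model class as long as the layer names and shapes still match.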


<h4 class="wp-block-heading">Pros of saving models with PyTorch</h4>



<div id="case-study-numbered-list-block_af23b57711b3251ca2ec6013043d45d7"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                The computational graph used by PyTorch is dynamic, meaning it is built as the program is run. This allows for more flexibility in modifying the model during training or inference.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                For dynamic models, such as those with variable-length inputs or outputs, which are frequent in natural language processing (NLP) and computer vision, PyTorch offers improved support.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Given that PyTorch is written in Python and functions well with other Python libraries like NumPy and pandas, manipulating data both before and after training is simple.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of saving models with PyTorch</h4>



<div id="case-study-numbered-list-block_2c70232ccc9dfd20926c7b6edc07873c"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Even though PyTorch provides an accessible API, there may be a steep learning curve for newcomers to deep learning or Python programming.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Since PyTorch is essentially a framework for research, it might not have as many tools for production deployment as other deep learning frameworks like TensorFlow or Keras.            </li>
            </ul>
</div>



<p>This isn’t all: you can also use model registry platforms to store DL-based models, especially large ones. Registries make it easy to deploy and maintain models without extra effort from developers.&nbsp;</p>



<p>You can find the dataset and code used in this article <a href="https://github.com/gouravsinghbais/How-to-Save-Trained-Model-in-Python" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</p>



<h2 class="wp-block-heading" id="h-how-to-package-ml-models">How to package ML models?</h2>



<p>An ML model is typically optimized for performance on the training dataset and the specific environment in which it is trained. However, when it comes to deploying models in different environments, such as production, various challenges can arise.</p>



<p>These challenges include, but are not limited to, differences in hardware, software, and data inputs. Packaging the model makes it easier to address these problems, as it allows the model to be exported or serialized into a standard format that can be loaded and used in various environments.</p>



<p>There are various options available for packaging right now. By packaging the model in a standard format such as <a href="https://access.redhat.com/documentation/en-us/red_hat_process_automation_manager/7.3/html/designing_a_decision_service_using_pmml_models/pmml-con_pmml-models" target="_blank" rel="noreferrer noopener nofollow">PMML (Predictive Model Markup Language)</a>, <a href="https://onnx.ai/" target="_blank" rel="noreferrer noopener nofollow">ONNX</a>, or the <a href="https://www.tensorflow.org/guide/saved_model" target="_blank" rel="noreferrer noopener nofollow">TensorFlow SavedModel format</a>, it becomes easier to share and collaborate on a model without being concerned about the different libraries and tools used by different teams. Now, let’s check a few examples of packaging an ML model with different frameworks in Python.</p>



<p><strong>Note:</strong> For this section as well, you will see the same iris-classification example.</p>



<h3 class="wp-block-heading" id="h-packaging-models-with-pmml">Packaging models with PMML</h3>



<p>Using the PMML library in Python, you can export your machine learning models to PMML format and then deploy that as a web service, a batch processing system, or a data integration platform. This can make it easier to share and collaborate on machine learning models, as well as to deploy them in various production environments.</p>



<p>To package an ML model using PMML you can use different modules like <a href="https://github.com/jpmml/sklearn2pmml" target="_blank" rel="noreferrer noopener nofollow">sklearn2pmml</a>, <a href="https://github.com/jpmml/jpmml-sklearn" target="_blank" rel="noreferrer noopener nofollow">jpmml-sklearn</a>, <a href="https://github.com/jpmml/jpmml-tensorflow" target="_blank" rel="noreferrer noopener nofollow">jpmml-tensorflow</a>, etc.</p>



<p><strong>Note:</strong> To use PMML, you must have <a href="https://www.java.com/en/download/manual.jsp" target="_blank" rel="noreferrer noopener nofollow">Java Runtime</a> installed on your system.</p>



<p>Here is an example code snippet that allows you to package the trained iris classifier model using PMML.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">from</span> sklearn2pmml <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> PMMLPipeline, sklearn2pmml
<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># package iris classifier model with PMML</span>
sklearn2pmml(PMMLPipeline([(<span class="hljs-string" style="color: rgb(221, 17, 68);">"estimator"</span>,
                        	model)]),
         	<span class="hljs-string" style="color: rgb(221, 17, 68);">"iris_model.pmml"</span>,
         	with_repr=<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">True</span>)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>In the above code, you simply create a PMML pipeline object by passing in your model object, then save it using the <strong>sklearn2pmml()</strong> method. That’s it: you can now use this <strong>“iris_model.pmml”</strong> file across different environments.&nbsp;</p>



<h4 class="wp-block-heading">Pros of using PMML&nbsp;</h4>



<div id="case-study-numbered-list-block_5849bdc42ecf079f5daa925a8b9ca816"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Since PMML is a platform-independent format, PMML models can be integrated with numerous data processing platforms and used in a variety of production situations.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                PMML can reduce vendor lock-in as it allows users to export and import models from different machine-learning platforms.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                PMML models can be easily deployed in production environments as they can be integrated with various data processing platforms and systems.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of using PMML</h4>



<div id="case-study-numbered-list-block_ad36b6c3c15141981cb2aa00d4b88d1d"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
Due to limited support, some machine learning models and algorithms cannot be exported in PMML format.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                PMML is an XML-based format that can be verbose and inflexible, which may make it difficult to modify or update models after they have been exported in PMML format.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It might be difficult to create PMML models, especially for complicated models with several features and interactions.            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-packaging-models-with-onnx">Packaging models with ONNX</h3>



<p>Developed by Microsoft and Facebook, ONNX (Open Neural Network Exchange) is an open format for representing machine learning models. It allows for interoperability between different deep-learning frameworks and tools.&nbsp;</p>



<p>ONNX models can be deployed efficiently on a variety of platforms, including mobile devices, edge devices, and the cloud. It supports a variety of runtimes, including <a href="https://caffe2.ai/" target="_blank" rel="noreferrer noopener nofollow">Caffe2</a>, TensorFlow, PyTorch, and <a href="https://mxnet.apache.org/versions/1.9.1/" target="_blank" rel="noreferrer noopener nofollow">MXNet</a>, which allows you to deploy your models on different devices and platforms with minimal effort.</p>



<p>To save the model using ONNX, you need to have <a href="https://github.com/onnx/onnx" target="_blank" rel="noreferrer noopener nofollow">onnx</a> and <a href="https://onnxruntime.ai/docs/get-started/with-python.html" target="_blank" rel="noreferrer noopener nofollow">onnxruntime</a> packages downloaded in your system.</p>



<p>Here is an example of how you can convert the existing ML model to ONNX format.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-comment" style="color: rgb(136, 136, 136);"># load dependencies</span>
<span class="hljs-built_in" style="color: rgb(57, 115, 0);">import</span> onnxmltools
<span class="hljs-built_in" style="color: rgb(57, 115, 0);">import</span> onnxruntime
<span class="hljs-built_in" style="color: rgb(57, 115, 0);">from</span> onnxmltools.convert.common.data_types <span class="hljs-built_in" style="color: rgb(57, 115, 0);">import</span> FloatTensorType

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Declare the model's input name and shape (four iris features)</span>
<span class="hljs-attr">initial_types</span> = [(<span class="hljs-string" style="color: rgb(136, 0, 0);">"X"</span>, FloatTensorType([<span class="hljs-built_in" style="color: rgb(57, 115, 0);">None</span>, 4]))]

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Convert the KNeighborsClassifier model to ONNX format</span>
<span class="hljs-attr">onnx_model</span> = onnxmltools.convert_sklearn(model, initial_types=initial_types)

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Save the ONNX model in a file</span>
<span class="hljs-attr">onnx_file</span> = <span class="hljs-string" style="color: rgb(136, 0, 0);">"iris_knn.onnx"</span>
onnxmltools.utils.save_model(onnx_model, onnx_file)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>You just need to import the required modules and use the <strong>convert_sklearn()</strong> method to convert the sklearn model to an ONNX model. Once the conversion is done, you can store the ONNX model in a file with the “.onnx” extension using the <strong>save_model()</strong> method. Although here you see an example with a classical ML model, ONNX is primarily used for DL models.&nbsp;</p>



<p>You can also load this model using the ONNX Runtime module.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Load the ONNX model into ONNX Runtime</span>
sess = onnxruntime.InferenceSession(onnx_file)

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Evaluate the model on some test data</span>
input_data = {<span class="hljs-string" style="color: rgb(221, 17, 68);">"X"</span>: X_test[:<span class="hljs-number" style="color: teal;">10</span>].astype(<span class="hljs-string" style="color: rgb(221, 17, 68);">'float32'</span>)}
output = sess.run(<span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">None</span>, input_data)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>You need to create a session using the <strong>InferenceSession()</strong> constructor to load the ONNX model from a file, and then use the <strong>sess.run()</strong> method to make predictions with the model.&nbsp;</p>



<h4 class="wp-block-heading">Pros of using ONNX</h4>



<div id="case-study-numbered-list-block_793fa48b74bb795b09a9d2648b9b18e0"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                With little effort, ONNX models can easily be deployed on a number of platforms, including mobile devices and the cloud. It is simple to deploy models on various hardware and software platforms thanks to ONNX&#8217;s support for a wide range of runtimes.             </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                ONNX models are optimized for performance, which means that they can run faster and consume fewer resources than models in other formats.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of using ONNX&nbsp;</h4>



<div id="case-study-numbered-list-block_2c4869cb76c9d0fc7783c9311cf9596a"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                ONNX is primarily designed for deep learning models and may not be suitable for other types of machine learning models.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                ONNX models may not be compatible with all versions of different deep learning frameworks, which may require additional effort to ensure compatibility.            </li>
            </ul>
</div>



<h3 class="wp-block-heading" id="h-packaging-models-with-tensorflow-savedmodel">Packaging models with TensorFlow SavedModel</h3>



<p>TensorFlow&#8217;s SavedModel format allows you to easily save and load your deep learning models, and it ensures compatibility with other TensorFlow tools and platforms. Additionally, it provides a streamlined and efficient way to deploy your models in production environments.&nbsp;</p>



<p>SavedModel supports a wide range of deployment scenarios, including serving models with <a href="https://www.tensorflow.org/tfx/guide/serving" target="_blank" rel="noreferrer noopener nofollow">TensorFlow Serving</a>, deploying models to mobile devices with <a href="https://www.tensorflow.org/lite" target="_blank" rel="noreferrer noopener nofollow">TensorFlow Lite</a>, and exporting models to other formats such as ONNX.</p>



<p>It provides a simple and streamlined way to save and load TensorFlow models. The API is easy to use and well-documented, and the format is designed to be efficient and scalable.</p>



<p><strong>Note: </strong>You can use the same TensorFlow model trained in the above section.</p>



<p>To save the model in SavedModel format, you can use the following lines of code:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> tensorflow <span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">as</span> tf

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># using SavedModel format to save the model</span>
tf.saved_model.save(model, <span class="hljs-string" style="color: rgb(221, 17, 68);">"my_model"</span>)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>You can also load the model with <strong>load()</strong> method.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># Load the model</span>
loaded_model = tf.saved_model.load(<span class="hljs-string" style="color: rgb(221, 17, 68);">"my_model"</span>)</pre></code></pre>
</div>




<h4 class="wp-block-heading">Pros of using TensorFlow SavedModel</h4>



<div id="case-study-numbered-list-block_a59037942b78fe7b1fe59cf35ec1c129"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                SavedModel is platform-independent and version-compatible, which makes it easy to share and deploy models across different platforms and versions of TensorFlow.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                A variety of deployment scenarios are supported by SavedModel, including exporting models to other ML libraries like ONNX, serving models with TensorFlow Serving, and distributing models to mobile devices using TensorFlow Lite.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                SavedModel is optimized for training and inference, with support for distributed training and the ability to use GPUs and TPUs to accelerate training.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of using TensorFlow SavedModel</h4>



<div id="case-study-numbered-list-block_7755a519cc2787779d0a75312845f4d3"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                SavedModel files can be large, particularly for complex models, which can make them difficult to store and transfer.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Given that SavedModel is exclusive to TensorFlow, its compatibility with other ML libraries and tools may be constrained.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                The saved model is a binary file that can be difficult to inspect, making it harder to understand the details of the model&#8217;s architecture and operation.            </li>
            </ul>
</div>



<p>Now that you have seen multiple ways of packaging ML and DL models, you must also be aware that there are various tools available that provide infrastructure to package, deploy and serve these models. Two of the popular ones are <a href="https://www.bentoml.com/" target="_blank" rel="noreferrer noopener nofollow">BentoML</a> and <a href="https://mlflow.org/" target="_blank" rel="noreferrer noopener nofollow">MLflow</a>.</p>



<h3 class="wp-block-heading" id="h-bentoml">BentoML</h3>



<p>BentoML is a flexible framework for building and deploying production-ready machine learning services. It allows data scientists to package their trained models, their dependencies, and the infrastructure code required to serve the model into a reusable package called a &#8220;Bento&#8221;.</p>



<p>BentoML supports various machine learning frameworks and deployment platforms and provides a unified API for managing the lifecycle of the model. Once a model is packaged as a Bento, it can be deployed to various serving platforms like <a href="https://aws.amazon.com/lambda/" target="_blank" rel="noreferrer noopener nofollow">AWS Lambda</a>, <a href="https://kubernetes.io/" target="_blank" rel="noreferrer noopener nofollow">Kubernetes</a>, or <a href="https://www.docker.com/" target="_blank" rel="noreferrer noopener nofollow">Docker</a>. BentoML also offers an API server that can be used to serve the model via a REST API. You can learn more about it <a href="https://github.com/bentoml/BentoML" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;</p>



<h3 class="wp-block-heading" id="h-mlflow">MLflow</h3>



<p>MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides a comprehensive set of tools for tracking experiments, packaging code, and dependencies, and deploying models.&nbsp;</p>



<p>MLflow allows data scientists to easily package their models in a standard format that can be deployed to various platforms like <a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noreferrer noopener nofollow">AWS SageMaker</a>, <a href="https://azure.microsoft.com/en-us/products/machine-learning/" target="_blank" rel="noreferrer noopener nofollow">Azure ML</a>, and <a href="https://cloud.google.com/ai-platform/docs/technical-overview" target="_blank" rel="noreferrer noopener nofollow">Google Cloud AI Platform</a>. The platform also provides a model registry to manage model versions and track their performance over time. Additionally, MLflow offers a REST API for serving models, which can be easily integrated into web applications or other services.</p>



<h2 class="wp-block-heading" id="h-how-to-store-ml-models">How to store ML models?</h2>



<p>Now that we know how to save models, let&#8217;s see how we can store them to facilitate quick and easy retrieval.</p>



<h3 class="wp-block-heading" id="h-storing-ml-models-in-a-database">Storing ML models in a database</h3>



<p>You can also save your ML models in relational databases such as<a href="https://www.postgresql.org/" target="_blank" rel="noreferrer noopener nofollow"> PostgreSQL</a>,<a href="https://www.mysql.com/" target="_blank" rel="noreferrer noopener nofollow"> MySQL</a>, or<a href="https://www.oracle.com/in/database/sqldeveloper/" target="_blank" rel="noreferrer noopener nofollow"> Oracle SQL</a>, or NoSQL databases like<a href="https://www.mongodb.com/" target="_blank" rel="noreferrer noopener nofollow"> MongoDB</a> or<a href="https://cassandra.apache.org/_/index.html" target="_blank" rel="noreferrer noopener nofollow"> Cassandra</a>. The choice of database depends on factors such as the type and volume of data being stored, the performance and scalability requirements, and the specific needs of the application.&nbsp;</p>



<p>PostgreSQL is a popular choice for ML workflows, as it provides robust support for storing and manipulating structured data. Storing ML models in PostgreSQL provides an easy way to keep track of different versions of a model and manage them in a centralized location.&nbsp;</p>



<p>Additionally, it allows for easy sharing of models across a team or organization. However, it&#8217;s important to note that storing large models in a database can increase database size and query times, so it&#8217;s important to consider the storage capacity and performance of your database when storing models in PostgreSQL.</p>



<p>To save an ML model in a database like PostgreSQL, you first need to convert the trained model into a serialized format, such as a byte stream (pickle object) or JSON.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; color: rgb(51, 51, 51); background: rgb(248, 248, 248);"><span class="hljs-keyword" style="color: rgb(51, 51, 51); font-weight: 700;">import</span> pickle

<span class="hljs-comment" style="color: rgb(153, 153, 136); font-style: italic;"># serialize the model</span>
model_bytes = pickle.dumps(model)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Then open a connection to the database and create a table or collection to store the serialized model. For this, you can use Python&#8217;s <strong>psycopg2 </strong>library, which lets you connect to a PostgreSQL database. You can install it with the Python package installer:&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-symbol" style="color: rgb(188, 96, 96);">$</span> pip install psycopg2-<span class="hljs-keyword" style="font-weight: 700;">binary</span></pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Then you need to establish a connection to the database to store the ML model like this:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-built_in" style="color: rgb(57, 115, 0);">import</span> psycopg2

<span class="hljs-comment" style="color: rgb(136, 136, 136);">#&nbsp; establishing the connection to the Database</span>
<span class="hljs-attr">conn</span> = psycopg2.connect(
&nbsp; <span class="hljs-attr">database="database-name",</span> <span class="hljs-attr">user=user-name,</span> <span class="hljs-attr">password='your-password',</span> <span class="hljs-attr">host='127.0.0.1',</span> <span class="hljs-attr">port=</span> '<span class="hljs-number" style="color: rgb(136, 0, 0);">5432</span>'
)</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>To perform any operation on the database, you need to create a<a href="https://www.doc.ic.ac.uk/project/2012/wmproject2013/chandra/psycopg2-2.5.1/doc/html/cursor.html" target="_blank" rel="noreferrer noopener nofollow"> cursor</a> object that will help you to execute queries in your Python program.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-comment" style="color: rgb(136, 136, 136);"># create a cursor</span>
<span class="hljs-attr">cur</span> = conn.cursor()</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>With the help of this cursor, you can now execute the <strong>CREATE TABLE</strong> query to create a new table.</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);">cur.execute("<span class="hljs-keyword" style="font-weight: 700;">CREATE</span> <span class="hljs-keyword" style="font-weight: 700;">TABLE</span> models (<span class="hljs-keyword" style="font-weight: 700;">id</span> <span class="hljs-built_in" style="color: rgb(57, 115, 0);">INT</span> PRIMARY <span class="hljs-keyword" style="font-weight: 700;">KEY</span> <span class="hljs-keyword" style="font-weight: 700;">NOT</span> <span class="hljs-literal" style="color: rgb(120, 169, 96);">NULL</span>, <span class="hljs-keyword" style="font-weight: 700;">name</span> <span class="hljs-built_in" style="color: rgb(57, 115, 0);">CHAR</span>(<span class="hljs-number" style="color: rgb(136, 0, 0);">50</span>), <span class="hljs-keyword" style="font-weight: 700;">model</span> BYTEA)<span class="hljs-string" style="color: rgb(136, 0, 0);">")</span></pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p><strong><em>Note: Make sure that the model object type is BYTEA.&nbsp;</em></strong></p>



<p>Finally, you can store the model and other metadata information using the <strong>INSERT INTO</strong> command.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-comment" style="color: rgb(136, 136, 136);"># Insert the serialized model into the database</span>
cur.execute(<span class="hljs-string" style="color: rgb(136, 0, 0);">"INSERT INTO models (id, name, model) VALUES (%s, %s, %s)"</span>, (<span class="hljs-number" style="color: rgb(136, 0, 0);">1</span>, <span class="hljs-string" style="color: rgb(136, 0, 0);">'iris-classifier'</span>, model_bytes))
conn.commit()

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Close the database connection</span>
cur.close()
conn.close()</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Once all the operations are done, close the cursor and connection to the database.&nbsp;</p>



<p>Finally, to read the model from the database, you can use the <strong>SELECT</strong> command by filtering the model either on name or id.&nbsp;</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-built_in" style="color: rgb(57, 115, 0);">import</span> psycopg2
<span class="hljs-built_in" style="color: rgb(57, 115, 0);">import</span> pickle

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Connect to the database</span>
<span class="hljs-attr">conn</span> = psycopg2.connect(
  <span class="hljs-attr">database="database-name",</span> <span class="hljs-attr">user='user-name',</span> <span class="hljs-attr">password='your-password',</span> <span class="hljs-attr">host='127.0.0.1',</span> <span class="hljs-attr">port=</span> '<span class="hljs-number" style="color: rgb(136, 0, 0);">5432</span>'
)

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Retrieve the serialized model from the database</span>
<span class="hljs-attr">cur</span> = conn.cursor()
cur.execute(<span class="hljs-string" style="color: rgb(136, 0, 0);">"SELECT model FROM models WHERE name = %s"</span>, ('iris-classifier',))
<span class="hljs-attr">model_bytes</span> = cur.fetchone()[<span class="hljs-number" style="color: rgb(136, 0, 0);">0</span>]

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Deserialize the model</span>
<span class="hljs-attr">model</span> = pickle.loads(model_bytes)

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># Close the database connection</span>
cur.close()
conn.close()</pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>Once the model is loaded from the database, you can use it to make predictions as follows:</p>




<div
	style="opacity: 0;"
	class="block-code-snippet  l-padding__top--0 l-padding__bottom--0 l-margin__top--unset l-margin__bottom--unset block-code-snippet--regular language-py line-numbers block-code-snippet--show-header"
	data-show-header="show"
	data-header-text=""
>
	<pre style="font-size: .875rem;" data-prismjs-copy="Copy the JavaScript snippet!"><code><pre class="hljs" style="display: block; overflow-x: auto; padding: 0.5em; background: rgb(240, 240, 240); color: rgb(68, 68, 68);"><span class="hljs-comment" style="color: rgb(136, 136, 136);"># test loaded model</span>
y_predict = model.predict(X_<span class="hljs-built_in" style="color: rgb(57, 115, 0);">test</span>)

<span class="hljs-comment" style="color: rgb(136, 136, 136);"># check results</span>
<span class="hljs-built_in" style="color: rgb(57, 115, 0);">print</span>(classification_report(y_<span class="hljs-built_in" style="color: rgb(57, 115, 0);">test</span>, y_predict)) </pre></code></pre>
</div>




<div id="separator-block_5d3e2ce10b89cad4afd0b1eba9b54b0d"
         class="block-separator block-separator--20">
</div>



<p>That&#8217;s it: you have stored the model in the database and loaded it back.&nbsp;</p>
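<p>The end-to-end pattern above works with any database that can store binary columns. As a self-contained illustration (using Python&#8217;s built-in sqlite3 in place of PostgreSQL, and a stand-in dict instead of a trained model), the store-and-retrieve round trip looks like this:</p>

```python
import pickle
import sqlite3

model = {"weights": [0.1, 0.2]}  # stand-in for a trained model object

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT, model BLOB)")

# Store the pickled model as a binary blob
cur.execute(
    "INSERT INTO models (id, name, model) VALUES (?, ?, ?)",
    (1, "iris-classifier", pickle.dumps(model)),
)
conn.commit()

# Retrieve and deserialize it
cur.execute("SELECT model FROM models WHERE name = ?", ("iris-classifier",))
restored = pickle.loads(cur.fetchone()[0])
conn.close()
print(restored)  # → {'weights': [0.1, 0.2]}
```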



<h4 class="wp-block-heading">Pros of storing ML models in a database&nbsp;</h4>



<div id="case-study-numbered-list-block_a0afb82e02875ec4ee47ca6c1b8dc358"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Storing ML models in a database provides a centralized storage location that can be easily accessed by multiple applications and users.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Since most organizations already have databases in place, integrating ML models into the existing infrastructure becomes easier.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Databases are optimized for data retrieval, which means that retrieving the ML models is faster and more efficient.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Databases are designed to provide robust security features such as authentication, authorization, and encryption. This ensures that the stored ML models are secure.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of storing ML models in a database</h4>



<div id="case-study-numbered-list-block_007f090f672ac10744aac28f9a8ad723"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Databases are designed for storing structured data and are not optimized for storing unstructured data such as ML models. As a result, there may be limitations in terms of model size, file formats, and other aspects of ML models that cannot be accommodated by databases.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Storing ML models in a database can be complex and requires expertise in both database management and machine learning.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                 If the ML models are large, storing them in a database may lead to scalability issues. Additionally, the retrieval of large models may impact the performance of the database.            </li>
            </ul>
</div>



<p>While pickle, joblib, and JSON are common ways to save machine learning models, they have limitations when it comes to versioning, sharing, and managing machine learning models. This is where ML model registries come to the rescue, addressing the issues faced by these alternatives.&nbsp;</p>



<p>Next, you will see how saving ML models in the model registry can help you achieve reproducibility and reusability.&nbsp;</p>



<h3 class="wp-block-heading" id="h-storing-ml-models-in-model-registry">Storing ML models in a model registry</h3>



<ul class="wp-block-list">
<li>A<a href="/blog/ml-model-registry" target="_blank" rel="noreferrer noopener"> model registry</a> is a central repository that can store, version, and manage machine learning models. </li>



<li>It typically includes features like<a href="/blog/version-control-for-ml-models" target="_blank" rel="noreferrer noopener"> model versioning</a>, metadata control, comparing model runs, etc.&nbsp;</li>



<li>When working on any ML or DL projects, you can save and retrieve the models and their metadata from the model registry anytime you want.&nbsp;</li>



<li>Above all, model registries enable high collaboration among team members.&nbsp;</li>
</ul>



<p>There are various options for the model registry, such as MLflow or Kubeflow. You can also use tools like neptune.ai &#8211; even though it&#8217;s an experiment tracker, it covers model registry and model versioning capabilities to a great extent. Each of these platforms has unique features of its own, so it is wise to choose a registry that provides a comprehensive set of features for your needs.&nbsp;</p>



<h4 class="wp-block-heading">Storing models with MLflow</h4>



<p>MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It includes a model registry component that allows you to centrally manage models.</p>



<p>You can <a href="https://mlflow.org/docs/latest/model-registry.html">register a model with MLflow either in the UI or programmatically</a>. </p>



<figure class="wp-block-image size-full"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1600" height="1185" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=1600%2C1185&#038;ssl=1" alt="" class="wp-image-5402" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?w=1600&amp;ssl=1 1600w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=200%2C148&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=768%2C569&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=1536%2C1138&amp;ssl=1 1536w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=220%2C163&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=120%2C89&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=160%2C119&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=300%2C222&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=480%2C356&amp;ssl=1 480w, https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/MLflow-model-registry-reproducibility.png?resize=1020%2C755&amp;ssl=1 1020w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption">Registering a model via UI in MLflow | <a href="https://mlflow.org/docs/latest/model-registry.html#concepts">Source</a></figcaption></figure>



<p>Once registered, you can:</p>



<ul class="wp-block-list">
<li>Version your models,</li>



<li>Transition models through stages (e.g., Staging, Production),</li>



<li>Add descriptions and tags,</li>



<li>Compare model versions,</li>



<li>Fetch registered models from the model registry. </li>
</ul>



<h4 class="wp-block-heading">Storing models with Neptune</h4>



<p><a href="/">Neptune&nbsp;</a>is an experiment tracker designed with a&nbsp;strong focus on collaboration&nbsp;and scalability. It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye.</p>



<p>You can <a href="https://docs.neptune.ai/log_metadata" target="_blank" rel="noreferrer noopener">log, store, and organize your model metadata</a> with Neptune&#8217;s flexible Python API. To log the model metadata, use the <code>run</code> object. Depending on your setup, you can separate the model and training metadata by creating multiple runs or log everything together.</p>



<div id="app-screenshot-block_65297538ec7aa76905fcaed911de4241"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=1020%2C454&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=480%2C214&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=768%2C342&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=1020%2C454&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="454"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				A list of different model versions and associated metadata tracked in neptune.ai			</figcaption>
			
</div>



<div id="separator-block_fac06d9f90527afb34228f6e767317c6"
         class="block-separator block-separator--25">
</div>



<p>With Neptune, you can:</p>



<ul class="wp-block-list">
<li>Track models and model versions, along with the associated metadata.</li>



<li>Filter, sort, and compare the versioned data easily.</li>



<li>Manage model stages using tags.</li>



<li>Query and download any stored model files and metadata.</li>
</ul>



<h4 class="wp-block-heading">Pros of storing models with model registry&nbsp;</h4>



<div id="case-study-numbered-list-block_0d2a4274744847882171c2a17d37e095"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                A centralized location for managing, storing, and version-controlling machine learning models.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Model registries frequently include metadata about models, such as their version, performance metrics, etc., making it simpler to track changes and understand a model&#8217;s history.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Model registries allow team members to collaborate on models and share their work easily.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
                Some model registries provide automated deployment options, which can simplify the process of deploying models to production environments.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">5</span>
                Model registries often provide security features such as access control, encryption, and authentication, ensuring that models are kept secure and only accessible to authorized users.            </li>
            </ul>
</div>



<h4 class="wp-block-heading">Cons of storing models with model registry&nbsp;</h4>



<div id="case-study-numbered-list-block_25a4e190d2dc5560bbd3f8541b5804d4"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Some model registries require a paid subscription, which raises the cost of machine learning projects.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                Model registries often have a learning curve, and it may take time to get up to speed with their functionality and features.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                Using a model registry may require integrating with other tools and systems, which can create additional dependencies.            </li>
            </ul>
</div>



<p>You have now seen different ways of saving and storing ML models (the model registry being the most robust option). Now it is time to go over some best practices.&nbsp;</p>



<h2 class="wp-block-heading" id="h-best-practices">Best practices</h2>



<p>In this section, you will see some of the best practices for saving ML and DL models.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Ensure Library Versions:</strong> Using different library versions for saving and loading a model may create compatibility issues, as a library update can introduce structural changes. Make sure the library versions used when loading a machine learning model are the same as those used to save it.&nbsp;</li>



<li><strong>Ensure Python Versions:</strong> It is a good practice to use the same Python version across all stages of your ML pipeline development. Changes in the Python version can create execution issues; for example, TensorFlow 1.x is supported only up to Python 3.7, and if you try to use it with later versions, you will face errors.&nbsp;</li>



<li><strong>Save Both Model Architecture and Weights:</strong> In the case of DL-based models, if you save only the model weights but not the architecture, you cannot reconstruct the model. Saving the model architecture along with the trained weights ensures that the model can be fully reconstructed and used later on.</li>



<li><strong>Document the Model:</strong> The goal, inputs, outputs, and anticipated performance of the model should be documented. This can aid others in understanding the capabilities and constraints of the model.</li>



<li><strong>Use Model Registry:</strong> Use a model registry like neptune.ai to keep track of models, their versions, and metadata and to collaborate with team members.&nbsp;</li>



<li><strong>Keep the Saved Model Secure:</strong> Keep the saved model secure by encrypting it or storing it in a secure location, especially if it contains sensitive data.</li>
</ul>
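<p>The first two practices can be made concrete by recording the environment alongside the saved model. A small sketch using only the standard library (the file name and package list are illustrative; in practice you would list your model&#8217;s actual dependencies):</p>

```python
import json
import sys
from importlib import metadata

# Record the interpreter and key library versions alongside the saved
# model so the exact environment can be recreated at load time
packages = ["pip", "setuptools"]  # illustrative; use your model's real deps
versions = {}
for pkg in packages:
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "not installed"

env = {"python": sys.version.split()[0], "libraries": versions}
with open("model_env.json", "w") as f:
    json.dump(env, f, indent=2)
print(env["python"])
```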



<h2 class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p>In conclusion, saving machine learning models is an important step in the development process, as it allows you to reuse and share your models with others. There are several ways to save machine learning models, each with its own advantages and disadvantages. Some popular methods include using pickle, Joblib, JSON, TensorFlow save, and PyTorch save.</p>



<p>It is important to choose the appropriate file format for your specific use case and to follow best practices for saving and documenting models, such as version control, ensuring language and library versions, and testing the saved model. By following the practices discussed in this article, you can ensure that your machine-learning models are saved correctly, are easy to reuse and deploy, and can be effectively shared with others.</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/" target="_blank" rel="noreferrer noopener nofollow">https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/</a>&nbsp;&nbsp;</li>



<li><a href="https://www.tensorflow.org/tutorials/keras/save_and_load" target="_blank" rel="noreferrer noopener nofollow">https://www.tensorflow.org/tutorials/keras/save_and_load</a>&nbsp;</li>



<li><a href="https://pytorch.org/tutorials/beginner/saving_loading_models.html" target="_blank" rel="noreferrer noopener nofollow">https://pytorch.org/tutorials/beginner/saving_loading_models.html</a>&nbsp;</li>



<li><a href="https://www.kaggle.com/code/prmohanty/python-how-to-save-and-load-ml-models">https://www.kaggle.com/code/prmohanty/python-how-to-save-and-load-ml-models</a> </li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">22337</post-id>	</item>
		<item>
		<title>How Did We Get to ML Model Reproducibility</title>
		<link>https://neptune.ai/blog/ml-model-reproducibility</link>
		
		<dc:creator><![CDATA[Gourav Bais]]></dc:creator>
		<pubDate>Tue, 14 Mar 2023 13:50:19 +0000</pubDate>
				<category><![CDATA[MLOps]]></category>
		<guid isPermaLink="false">https://neptune.ai/?p=18055</guid>

					<description><![CDATA[When working on real-world ML projects, you come face-to-face with a series of obstacles. The ML model reproducibility problem is one of them. This article is going to take you through an experience-based, step-by-step approach to solving the ML model reproducibility challenge, taken by my machine learning team working on a fraud detection system for&#8230;]]></description>
										<content:encoded><![CDATA[
<p>When working on real-world <a href="/blog/how-to-run-machine-learning-projects-best-practices" target="_blank" rel="noreferrer noopener">ML projects</a>, you come face-to-face with a series of obstacles. The ML model reproducibility problem is one of them.</p>



<p>This article is going to take you through an experience-based, step-by-step approach to solving the ML model reproducibility challenge, taken by my machine learning team working on a fraud detection system for the insurance domain.</p>



<p>You’ll learn:</p>



<div id="case-study-numbered-list-block_ad7e62f2c5eaf4593c2bf0caa3f5f067"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
                Why is reproducibility important in machine learning?            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
                What were the challenges faced by the team?            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                What was the solution? (tool stack and a checklist)            </li>
            </ul>
</div>



<p>Let’s start at the beginning!</p>



<h2 class="wp-block-heading" id="h-why-is-reproducibility-important-in-machine-learning">Why is reproducibility important in machine learning?</h2>



<p>To better understand this concept, I will share the journey of my team and me.&nbsp;</p>



<h3 class="wp-block-heading" id="h-project-background">Project background</h3>



<p>Before discussing the important details, let me tell you a little about the project. This machine learning project was a fraud detection system for the insurance domain, where a classification model was used to classify whether a person is likely to commit fraud, given the required details as input.&nbsp;</p>



<p>Initially, when we start working on any project, we don’t think about model deployment, reproducibility, model retraining, etc. Instead, we tend to spend most of our time on data exploration, preprocessing, and modeling. This is a mistake when working on machine learning projects at scale. To back this up, here is the <a href="https://www.nature.com/articles/533452a" target="_blank" rel="noreferrer noopener nofollow">Nature survey conducted in 2016</a>.&nbsp;</p>



<p>According to this survey of roughly 1,500 scientists, more than 70% had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own. Keeping this and a few other details in mind, we created a project that was reproducible and deployed it successfully to production.&nbsp;</p>



<p>When working on this classification project, we realized that reproducibility is not only essential for consistent results but also for these reasons:</p>



<ul class="wp-block-list">
<li><strong>Stable ML Outcomes and Practices:</strong> To make sure that our fraud detection model’s outcomes were trusted by clients, we had to ensure stable outcomes. Reproducibility is the key factor in stabilizing the outcomes of any ML pipeline. We used an identical dataset and pipeline so that the same results could be produced by anyone on our team running the model. But to ensure that our training data and pipeline components remained the same across runs, we had to track them using different MLOps tools.&nbsp;</li>
</ul>



<div id="separator-block_7eab3b3a202c8cae348dbb449894aacc"
         class="block-separator block-separator--15">
</div>



<p>For example, we used code versioning tools, model versioning tools, and dataset versioning tools that helped us to keep track of everything in the machine learning pipeline. Also, these tools enabled high collaboration among our team members and ensured that the best practices were followed during the development.&nbsp;</p>



<ul class="wp-block-list">
<li><strong>Promotes Accuracy and Efficiency:</strong> One thing that we emphasized the most was that we wanted our model to generate the same results again and again, no matter when we ran it. As any reproducible model gives the same results in every run, we just had to make sure that we did not make any changes to the model configuration and hyperparameters every time we ran the model. This has helped us to identify the best model out of all that we have tried.&nbsp;</li>
</ul>



<ul class="wp-block-list">
<li><strong>Prevents Duplication of Efforts:</strong> One major challenge before us while developing this classification project was that we had to make sure that whenever one of our team members runs a project, they need not do all the configurations from scratch to achieve the same results every time. Also, if any new developer joins our project, they can easily understand the pipeline to generate the same model. This is where version control tools and documentation helped us as team members, and new joiners had access to specific versions of code, certain datasets, and ML models.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Enables Bug-Free ML Pipeline Development:</strong> There were times when running the same classification model did not produce the same results, which helped us find the errors and bugs easily in our pipeline. Once identified we were able to fix those issues quickly to make our ML pipelines stable.&nbsp;</li>
</ul>



<h2 class="wp-block-heading" id="h-every-ml-reproducibility-challenge-we-faced">Every ML reproducibility challenge we faced</h2>



<p>Now that you know about reproducibility and its benefits, it is time to discuss the major reproducibility issues my team and I faced during the development of this ML project. Importantly, all of these challenges are common to any type of machine learning or deep learning use case.&nbsp;</p>



<h3 class="wp-block-heading" id="h-1-lack-of-clear-documentation">1. Lack of clear documentation</h3>



<p>One major part we were missing at the beginning was documentation. Initially, when we did not have any documentation, it impacted our team members&#8217; performance, as they took more time than expected to understand the requirements and implement new features. It also became very difficult for new developers on our team to understand the whole project. Due to this lack of documentation, a standard approach was missing, which led to failures to reproduce the same results every time the model was run.&nbsp;</p>



<p>You can consider documentation a bridge between the conceptual understanding of a project and its actual technical implementation. Documentation helps existing developers and new team members understand the nuances of the solution and the structure of the project.&nbsp;</p>



<h3 class="wp-block-heading" id="h-2-different-computer-environments">2. Different computer environments</h3>



<p>It is common for different developers on a team to have different environments: operating systems (OSs), language versions, library versions, etc. We had the same scenario while working on the project. This affected our reproducibility, as each environment differed significantly from the others in terms of ML frameworks, package implementations, and so on.&nbsp;</p>



<p>It is common practice to share code and artifacts among team members on any ML project. So a slight change in the computing environment can break an existing project, and developers end up spending unnecessary time debugging the same code again and again.&nbsp;</p>



<h3 class="wp-block-heading" id="h-3-not-tracking-data-code-and-workflow">3. Not tracking data, code, and workflow</h3>



<p>Reproducible machine learning is only possible when you use the same data, code, and preprocessing steps. Not keeping track of these means different configurations may be used to run the same model, which results in different outputs in each run. So at some point in your project, you need to store all this information so that you can retrieve it whenever needed.</p>



<p>When working on the classification project, we did not keep track of all the models and their different hyperparameters at first, which turned out to be a barrier to achieving reproducibility.</p>
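<p>To illustrate why tracked configuration matters, here is a minimal, self-contained Python sketch; the <code>training_run</code> helper is hypothetical and stands in for a real training step whose seed and hyperparameters are recorded as part of the run configuration:</p>

```python
import random

def training_run(seed: int) -> list:
    # Seeding the RNG before every run is one precondition for
    # reproducibility; unseeded runs produce different numbers each time.
    random.seed(seed)
    return [round(random.random(), 6) for _ in range(3)]

# Two runs with the same tracked configuration give identical results,
# which is exactly what an untracked pipeline cannot guarantee:
assert training_run(42) == training_run(42)
```

<p>Real pipelines also need to pin seeds for every library involved (e.g., NumPy, PyTorch, or TensorFlow), but the principle is the same: the seed is part of the configuration that must be stored and retrieved with the run.</p>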



<h3 class="wp-block-heading" id="h-4-lack-of-standard-evaluation-metrics-and-protocols">4. Lack of standard evaluation metrics and protocols</h3>



<p>Selecting the right evaluation metric is one of the main challenges in any classification use case. You need to decide on the metrics that work best for your use case. For example, in the fraud detection use case, our model could not afford to predict many false negatives, so we tried to improve the recall of the overall system. Not using a standard metric can reduce clarity among team members about the objective, and ultimately it can affect reproducibility.&nbsp;</p>



<p>Finally, we had to make sure that all of our team members followed the same protocols and code standards so that there was uniformity in the code, which made it more readable and understandable.&nbsp;</p>



<section id="blog-intext-cta-block_e01b0b6ec9eaddaa11241c1e11a622c2" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-more">Read more</h3>
    
            <p><a href="/blog/how-to-solve-reproducibility-in-ml" target="_blank" rel="noopener">How to Solve Reproducibility in ML</a></p>
    
    </section>



<h2 class="wp-block-heading" id="h-machine-learning-reproducibility-checklist-solutions-we-adapted">Machine learning reproducibility checklist: solutions we adopted</h2>



<p>As ML engineers, we believe every problem has one or more possible solutions, and that is also the case for ML reproducibility issues. Even though there were many reproducibility challenges in our project, we were able to solve them all with the right strategy and the right selection of tools. Let’s take a look at the machine learning reproducibility checklist we used.&nbsp;</p>



<h3 class="wp-block-heading" id="h-1-clear-documentation-of-the-solution">1. Clear documentation of the solution</h3>



<p>Our fraud detection project was a combination of multiple technical components and the integrations among them. It was hard to keep track of when, how, and by which process each component would be used. So we created a document containing information about each module we worked on: data collection, data preprocessing and exploration, modeling, deployment, monitoring, etc.&nbsp;</p>



<p>Documenting which solution strategies we had tried or would be trying, which tools and technologies we would be using throughout the project, which implementation decisions had been taken, etc., helped our ML developers better understand the project. With this documentation, they were able to follow standard best practices and a step-by-step procedure to run the pipeline, and they knew which error needed what kind of resolution. This meant the same results were reproduced every time our team members ran the model and helped us improve overall efficiency.</p>



<p>Also, this improved the efficiency of our team, as we did not have to spend time explaining the entire ML workflow to new joiners and other developers; everything was in the document.</p>



<h3 class="wp-block-heading" id="h-2-using-the-same-computer-environments">2. Using the same computer environments</h3>



<p>Developing the classification solution required our ML developers to collaborate on different sections of the machine learning pipeline. And since most of our developers were using different computing environments, it was hard for them to produce the same results due to various dependency changes. So, for reproducibility, we had to make sure that each developer was using the same computing environment, ML frameworks, language versions, etc.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img data-recalc-dims="1" loading="lazy" decoding="async" width="1000" height="500" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=1000%2C500&#038;ssl=1" alt="PIP and virtual environments" class="wp-image-18064" style="width:810px;height:405px" srcset="https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?w=1000&amp;ssl=1 1000w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=768%2C384&amp;ssl=1 768w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=200%2C100&amp;ssl=1 200w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=220%2C110&amp;ssl=1 220w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=120%2C60&amp;ssl=1 120w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=160%2C80&amp;ssl=1 160w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=300%2C150&amp;ssl=1 300w, https://i0.wp.com/neptune.ai/wp-content/uploads/2023/03/how-did-we-get-to-ml-model-reproducibility-1.png?resize=480%2C240&amp;ssl=1 480w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /><figcaption class="wp-element-caption"><em>PIP and virtual environments | <a href="https://dev.to/bricourse/most-successful-developers-use-python-virtual-environments-do-you-know-how-3bh7" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>Using a <a href="https://www.docker.com/resources/what-container/#:~:text=A%20Docker%20container%20image%20is,tools%2C%20system%20libraries%20and%20settings." target="_blank" rel="noreferrer noopener nofollow">Docker container</a> or creating a shareable <a href="https://docs.python.org/3/library/venv.html" target="_blank" rel="noreferrer noopener nofollow">virtual environment</a> are two of the best solutions for sharing the same computational environment. In our team, people were working on Windows and Unix with different language and library versions; using Docker containers solved this problem and helped us get to reproducibility.</p>
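<p>Before containerizing, it helps to capture exactly what each developer&#8217;s environment contains. The snippet below is an illustrative, stdlib-only sketch (the <code>snapshot_environment</code> helper is our own, not part of any tool) of the kind of information a pinned <code>requirements.txt</code> or Dockerfile encodes:</p>

```python
import platform
import sys
import importlib.metadata as md

def snapshot_environment() -> dict:
    # Record the interpreter version, OS, and installed package versions
    # so the environment can be pinned and recreated by a teammate
    # (e.g., via a requirements file or a Dockerfile base image).
    return {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "packages": sorted(
            f'{d.metadata["Name"]}=={d.version}' for d in md.distributions()
        ),
    }

env = snapshot_environment()
print(env["python"], env["os"], len(env["packages"]))
```

<p>Comparing two such snapshots is a quick way to spot the dependency drift that caused our "works on my machine" failures.</p>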



<h3 class="wp-block-heading" id="h-3-tracking-data-code-and-workflow">3. Tracking data, code, and workflow</h3>



<h4 class="wp-block-heading">Versioning data and workflow&nbsp;</h4>



<p>Data was the backbone of our fraud detection use case: even a slight change in the dataset could affect our model&#8217;s reproducibility. The data we were using was not in the required shape and format to train the model, so we had to apply different data preprocessing steps like <a href="/blog/data-cleaning-process" target="_blank" rel="noreferrer noopener">NaN value removal</a>, <a href="/blog/feature-engineering-tools" target="_blank" rel="noreferrer noopener">Feature Generation</a>, <a href="https://medium.com/analytics-vidhya/different-type-of-feature-engineering-encoding-techniques-for-categorical-variable-encoding-214363a016fb" target="_blank" rel="noreferrer noopener nofollow">Feature Encoding</a>, <a href="https://en.wikipedia.org/wiki/Feature_scaling" target="_blank" rel="noreferrer noopener nofollow">Feature Scaling</a>, etc. to make it compatible with the selected model.&nbsp;</p>



<p>For this reason, we used data versioning tools like <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a>, <a href="https://www.pachyderm.com/" target="_blank" rel="noreferrer noopener nofollow">Pachyderm</a>, or <a href="https://dvc.org/" target="_blank" rel="noreferrer noopener nofollow">DVC</a>, which helped us manage our data systematically. </p>



<p>Also, we did not want to repeat all the data preprocessing steps every time we ran the ML pipeline, so such data and workflow management tools let us retrieve any specific version of preprocessed data for a pipeline run.</p>
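<p>The core idea behind these tools is that a dataset version is identified by its content. Here is a hedged, stdlib-only sketch of that idea; the <code>dataset_fingerprint</code> helper is illustrative and not the actual API of DVC, Pachyderm, or neptune.ai:</p>

```python
import hashlib
import tempfile
from pathlib import Path

def dataset_fingerprint(path: Path) -> str:
    # A content hash uniquely identifies a dataset version -- the core
    # mechanism data versioning tools use to detect that training data
    # changed between pipeline runs.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Any edit to the file changes its fingerprint:
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("id,amount,label\n1,100,0\n")
    v1 = dataset_fingerprint(data)
    data.write_text("id,amount,label\n1,100,0\n2,250,1\n")
    v2 = dataset_fingerprint(data)
    assert v1 != v2
```

<p>Storing the fingerprint alongside each run makes it possible to prove, later, exactly which dataset version produced a given model.</p>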



<section id="blog-intext-cta-block_ff00b3cc679d3d1fdaff8c1326c7efbd" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><a href="/blog/best-data-version-control-tools" target="_blank" rel="noopener">Best 7 Data Version Control Tools That Improve Your Workflow With Machine Learning Projects</a></p>
    
    </section>



<h4 class="wp-block-heading">Code versioning and management</h4>



<p>During development, we had to make multiple code changes for ML module implementation, new feature implementation, integration, testing, etc. To guarantee reproducibility, we had to make sure we used the same code version every time we ran a pipeline.&nbsp;</p>



<p>There are multiple tools to version control your entire codebase; some of the popular ones are <a href="https://github.com/" target="_blank" rel="noreferrer noopener nofollow">GitHub</a> and <a href="https://bitbucket.org/product?&amp;aceid=&amp;adposition=&amp;adgroup=146041799031&amp;campaign=18815940412&amp;creative=632894031549&amp;device=c&amp;keyword=bitbucket&amp;matchtype=e&amp;network=g&amp;placement=&amp;ds_kids=p74116831761&amp;ds_e=GOOGLE&amp;ds_eid=700000001551985&amp;ds_e1=GOOGLE&amp;gclsrc=ds" target="_blank" rel="noreferrer noopener nofollow">Bitbucket.</a> We used GitHub for our use case to version control the entire codebase; this tool also made team collaboration easy, as developers had access to each commit made by other developers. Code versioning tools made it easy for us to use the same code every time we ran a machine learning pipeline.&nbsp;</p>



<h4 class="wp-block-heading">Experiment tracking in ML&nbsp;</h4>



<p>Finally, the most important part of making our pipeline reproducible was to track all the models and experiments we had tried throughout the entire ML lifecycle. When working on the classification project, we tried different ML models and hyperparameter values, and it was very hard to keep track of them manually or with documentation. There are multiple tools available for tracking your code, data, and ML workflow, but instead of choosing a different tool for each of these tasks, we decided to pick one that could solve multiple problems: <a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a> seemed like the right solution.&nbsp;</p>



<p>It is a cloud-based platform designed to help data scientists with <a href="/product/experiment-tracking" target="_blank" rel="noreferrer noopener">experiment tracking</a> and model management. It provides a centralized location for all training activities, making it easier for teams to collaborate on projects and ensuring that everyone is working with the most up-to-date information.</p>



<p>Tools like <a href="/product/experiment-tracking?utm_source=googleads&amp;utm_medium=googleads&amp;utm_campaign=[SG][HI][brand][rsa][all]&amp;utm_term=neptune%20ai" target="_blank" rel="noreferrer noopener">neptune.ai</a>, <a href="https://www.comet.com/site/" target="_blank" rel="noreferrer noopener nofollow">Comet</a>, <a href="https://mlflow.org/" target="_blank" rel="noreferrer noopener nofollow">MLflow</a>, etc. enable developers to access any specific version of a model so that they can decide which algorithm worked best for them and with what hyperparameters. Which tool you go with depends on your use case and team dynamics.</p>
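<p>To make the record-keeping concrete, here is a deliberately simplified, file-based sketch of what an experiment tracker persists per run; the <code>log_run</code> helper is our own illustration, not the API of neptune.ai, Comet, or MLflow:</p>

```python
import json
import time
import tempfile
from pathlib import Path

def log_run(name: str, params: dict, metrics: dict, root: Path) -> Path:
    # Persist what trackers record for every experiment: the run name,
    # hyperparameters, resulting metrics, and a timestamp. Retrieving
    # this record later is what lets anyone rerun the same configuration.
    root.mkdir(parents=True, exist_ok=True)
    record = {"name": name, "params": params,
              "metrics": metrics, "logged_at": time.time()}
    out = root / f"{name}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

# Example: logging one hypothetical fraud-model run and reading it back.
with tempfile.TemporaryDirectory() as tmp:
    path = log_run("xgb-v3", {"max_depth": 6, "eta": 0.1},
                   {"recall": 0.91}, Path(tmp))
    assert json.loads(path.read_text())["metrics"]["recall"] == 0.91
```

<p>Dedicated trackers add what this sketch omits: a shared server, a UI for comparing runs, and links back to the exact code and data versions.</p>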



<section id="blog-intext-cta-block_dc9326d0ccf2540d29a97a4916ae9690" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p><a href="/customers/waabi" target="_blank" rel="noopener">Experiment Tracking for Systems Powering Self-Driving Vehicles [Case Study]</a></p>
<p><a href="/customers/hypefactors" target="_blank" rel="noopener">Experiment Tracking in Media Intelligence Analysis [Case Study]</a></p>
    
    </section>



<h3 class="wp-block-heading" id="h-4-deciding-on-standard-evaluation-metrics-and-protocols">4. Deciding on standard evaluation metrics and protocols</h3>



<p>As we were working on a classification project with an imbalanced dataset, we had to decide on the metrics that would work well for us. Accuracy is not a good measure for an imbalanced dataset, so we could not use it. We had to choose among <a href="https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall" target="_blank" rel="noreferrer noopener nofollow">Precision, Recall,</a> the <a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc" target="_blank" rel="noreferrer noopener nofollow">AUC-ROC curve</a>, etc.</p>



<p>In a fraud detection use case, both precision and recall are important. False positives can cause inconvenience and annoyance to customers and potentially damage the reputation of the business; false negatives, however, can be much more damaging and result in significant financial losses. So we decided to keep recall as our main metric for the use case.</p>
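<p>Precision and recall follow directly from the confusion counts, so the trade-off can be shown in a few lines of self-contained Python (the labels and predictions below are made up for illustration, with 1 meaning fraud):</p>

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives, and false negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # fraction of frauds caught
    return precision, recall

# The missed fraud at index 1 lowers recall; the false alarm at
# index 3 lowers precision.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
p, r = precision_recall(y_true, y_pred)
```

<p>Prioritizing recall means tuning the model and its decision threshold to shrink the false-negative count even at some cost in precision.</p>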



<p>Also, we decided to use the PEP8 standard for coding, as we wanted our code to be uniform across all the components we were developing. Choosing a single metric to focus on and PEP8 for standard coding practices helped us write easily reproducible code.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>After reading this article, you know that reproducibility is an important factor when working on ML use cases. Without reproducibility, it is hard for anyone to trust your findings and results. I have walked you through the importance of reproducibility standards based on personal experience, and shared some of the challenges my team and I faced along with the solutions we found.&nbsp;</p>



<p>If you remember one thing from this article, it should be to use specialized tools and services to version control everything you can: data, pipelines, models, and experiments. This allows you to use any specific version and run the entire pipeline to get the same results every time.&nbsp;</p>



<h3 class="wp-block-heading" id="h-references">References</h3>



<ol class="wp-block-list">
<li><a href="/blog/how-to-solve-reproducibility-in-ml" target="_blank" rel="noreferrer noopener">https://neptune.ai/blog/how-to-solve-reproducibility-in-ml</a>&nbsp;</li>



<li><a href="https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/" target="_blank" rel="noreferrer noopener nofollow">https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/</a>&nbsp;</li>



<li><a href="https://www.decisivedge.com/blog/the-importance-of-reproducibility-in-machine-learning-applications/" target="_blank" rel="noreferrer noopener nofollow">https://www.decisivedge.com/blog/the-importance-of-reproducibility-in-machine-learning-applications/</a></li>
</ol>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">18055</post-id>	</item>
		<item>
		<title>Best ML Model Registry Tools</title>
		<link>https://neptune.ai/blog/ml-model-registry-best-tools</link>
		
		<dc:creator><![CDATA[Gourav Bais]]></dc:creator>
		<pubDate>Fri, 30 Sep 2022 14:47:28 +0000</pubDate>
				<category><![CDATA[ML Tools]]></category>
		<guid isPermaLink="false">https://neptune.test/ml-model-registry-best-tools/</guid>

					<description><![CDATA[A model registry is a central repository that is used to version control Machine Learning (ML) models. It simply tracks the models while they move between training, production, monitoring, and deployment. It stores all the predominant information such as: As the model registry is shared by multiple team members working on the same machine learning&#8230;]]></description>
										<content:encoded><![CDATA[
<p>A <a href="/blog/ml-model-registry" target="_blank" rel="noreferrer noopener">model registry</a> is a central repository that is used to <a href="/blog/version-control-for-ml-models" target="_blank" rel="noreferrer noopener">version control Machine Learning (ML) models</a>. It tracks models as they move between training, deployment, production, and monitoring. It stores all the pertinent information, such as:</p>



<ul class="wp-block-list">
<li>metadata, </li>



<li>lineage, </li>



<li>model versions, </li>



<li>annotations, </li>



<li>and training jobs. </li>
</ul>



<p>As the model registry is shared by multiple team members working on the same machine learning project, <strong>model governance</strong> is a major advantage that these teams have. This governance data tells them:</p>



<ul class="wp-block-list">
<li>which dataset was used for training, </li>



<li>who trained and published a model, </li>



<li>what’s the predictive performance of the model, </li>



<li>and finally, when the model was deployed to production.</li>
</ul>



<section id="blog-intext-cta-block_92b0ad81526752aa8e0e79b881a4f8f0" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-read-also">Read also </h3>
    
            <p><a href="/blog/tools-for-ml-model-governance-provenance-lineage" target="_blank" rel="noopener">Best Tools for ML Model Governance, Provenance, and Lineage</a></p>
    
    </section>



<p>Usually, while working in a team, different team members tend to try out different things, and only a few of them are finalized and pushed to the version control tool they use. The model registry solves this issue: each team member can try their own model versions, and they will <strong>have a record of everything they have experimented with throughout the project journey</strong>.</p>
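<p>The record-keeping described above can be sketched as a toy, in-memory registry. The class below is purely illustrative (its names and methods are our own, not the API of any tool compared in this article): each model name maps to an append-only list of numbered versions, each carrying metadata and a lifecycle stage.</p>

```python
class ModelRegistry:
    # Toy illustration of registry semantics: versions are numbered in
    # order of registration and move through lifecycle stages.
    def __init__(self):
        self._models = {}

    def register(self, name: str, metadata: dict) -> int:
        versions = self._models.setdefault(name, [])
        versions.append({"version": len(versions) + 1,
                         "stage": "staging", "metadata": metadata})
        return versions[-1]["version"]

    def promote(self, name: str, version: int, stage: str) -> None:
        self._models[name][version - 1]["stage"] = stage

    def get(self, name: str, version: int) -> dict:
        return self._models[name][version - 1]

# Two teammates register competing versions; the better one is promoted,
# but the record of the first attempt is kept.
registry = ModelRegistry()
v1 = registry.register("fraud-clf", {"auc": 0.92, "trained_by": "alice"})
v2 = registry.register("fraud-clf", {"auc": 0.95, "trained_by": "bob"})
registry.promote("fraud-clf", v2, "production")
```

<p>Production registries add what this sketch leaves out: persistent storage of model artifacts, access control, and governance metadata such as the training dataset and deployment history.</p>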



<p>This article will discuss the model registry tools and evaluation criteria for such tools. You will also see a comparison of different model registry and model management tools, such as:</p>



<ul class="wp-block-list">
<li>MLflow, </li>



<li>Verta.ai, </li>



<li>Comet,</li>



<li>and neptune.ai.</li>
</ul>



<p>So let’s get started!</p>



<h2 class="wp-block-heading" id="h-evaluation-criteria-for-choosing-model-registry-tools">Evaluation criteria for choosing model registry tools</h2>



<p>The model registry is an important part of <a href="/blog/category/machine-learning-tools" target="_blank" rel="noreferrer noopener">MLOps platforms/tools</a>. There are plenty of tools available in the market that can fulfill your ML workflow needs. Here is an illustration that classifies these tools on the basis of their specialization.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-1.png?ssl=1" alt="Various model registry tools " class="wp-image-71095"/><figcaption class="wp-element-caption"><em>Classification of model registry tools | <a href="https://www.thoughtworks.com/content/dam/thoughtworks/documents/whitepaper/tw_whitepaper_guide_to_evaluating_mlops_platforms_2021.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The products on the bottom right focus on <a href="/blog/best-8-machine-learning-model-deployment-tools" target="_blank" rel="noreferrer noopener">deployment</a> and <a href="/blog/ml-model-monitoring-best-tools" target="_blank" rel="noreferrer noopener">monitoring</a>; those on the bottom left focus on training and <a href="/blog/best-ml-experiment-tracking-tools" target="_blank" rel="noreferrer noopener">tracking</a>. Those at the very top aim to cover every aspect of the ML lifecycle, while those in the upper middle cover most or all of the spectrum while leaning one way or the other.</p>



<p>To visualize it even more precisely, let’s have a look at another image:</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-2.png?ssl=1" alt="More precise classification of model registry tools " class="wp-image-71096"/><figcaption class="wp-element-caption"><em>More precise</em> <em>classification of model registry tools | <a href="https://www.thoughtworks.com/content/dam/thoughtworks/documents/whitepaper/tw_whitepaper_guide_to_evaluating_mlops_platforms_2021.pdf" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>From the above image, it can be inferred that tools like <a href="https://www.kubeflow.org/" target="_blank" rel="noreferrer noopener nofollow">Kubeflow</a> and other cloud providers are the most balanced and cover every stage of ML pipeline development equally. Specialized tools like <a href="/" target="_blank" rel="noreferrer noopener">Neptune </a>and <a href="https://polyaxon.com/" target="_blank" rel="noreferrer noopener nofollow">Polyaxon </a>are closest to their axis, i.e., mainly focused on model training.&nbsp;</p>



<p><em>NOTE: The aforementioned evaluation of these tools is based on the features they offered at that point in time (November 2021). Many of these tools have moved well beyond their area of specialization in the past year, so take this discussion with a pinch of salt.</em></p>



<p>However, there are some evergreen factors that are integral to determining a registry tool’s effectiveness. From my own experience, some of them are:</p>



<h3 class="wp-block-heading" class="wp-block-heading" id="h-ease-of-automation">Ease of automation</h3>



<p>One of the requirements of a model registry tool is how easily the development team can make use of that tool.</p>



<ul class="wp-block-list">
<li>Some tools require you to code all the things needed to store the model versions,</li>



<li>Some tools require very little coding; you just need to drag and drop different components to use them. </li>



<li>There are also some tools fully based on the concept of AutoML that do not require you to write any code to store your model versions.&nbsp;</li>
</ul>



<p>AutoML tools offer less flexibility for customization; low-code tools provide both customization and automation options; and code-first tools require you to write code for everything. You can choose a tool based on your requirements.</p>



<h3 class="wp-block-heading" id="h-updated-model-overview-and-model-stages-tracking">Updated model overview and model stages tracking</h3>



<p>The entire purpose of a model registry tool is to provide an easy overview of all the model versions the development team has tried. While selecting the tool, remember that it must provide a model overview of each version at every stage. Tracking models extends beyond development; it is also done for maintenance and enhancement in staging and production. The entire machine learning model lifecycle, including:</p>



<ul class="wp-block-list">
<li>training, </li>



<li>staging, </li>



<li>and production, </li>
</ul>



<p>must be tracked by the model registry tool.</p>
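<p>The stage tracking described above can be sketched in plain Python (a toy illustration, not any particular tool&#8217;s API): each version carries a current stage plus its history, and only sensible transitions are allowed.</p>

```python
# Allowed lifecycle transitions (illustrative; real tools define their own stages).
ALLOWED = {
    "training": {"staging"},
    "staging": {"production", "training"},
    "production": {"archived"},
}


class ModelVersion:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.stage = "training"
        self.history = ["training"]  # full stage history for auditability

    def transition(self, new_stage):
        """Move to a new stage, rejecting invalid jumps (e.g. production -> staging)."""
        if new_stage not in ALLOWED.get(self.stage, set()):
            raise ValueError(f"cannot move from {self.stage} to {new_stage}")
        self.stage = new_stage
        self.history.append(new_stage)


mv = ModelVersion("churn-model", 3)
mv.transition("staging")
mv.transition("production")
print(mv.stage, mv.history)  # production ['training', 'staging', 'production']
```

<p>A registry tool that records this kind of history lets you answer, at any time, which version is in production and how it got there.</p>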



<h3 class="wp-block-heading" id="h-competence-in-managing-the-model-dependencies">Competence in managing the model dependencies</h3>



<p>The model registry tool must be compatible with all the dependencies your ML model needs. Check that it supports your machine learning libraries, Python version, and data formats. If your use case requires a special ML library that the registry tool does not support, that tool would not make much sense for you.</p>



<h3 class="wp-block-heading" id="h-providing-the-flexibility-of-team-collaboration">Providing the flexibility of team collaboration</h3>



<p>Evaluate whether you and your team can collaborate on a registered model. If the model registry lets your team work together on the same ML model, it is a strong candidate.</p>



<p>Following these evaluation criteria, you can select the model registry tool that best fits your requirements.</p>



<h2 class="wp-block-heading" id="h-comparison-of-model-registry-tools">Comparison of model registry tools</h2>



<p>Every model registry tool offers a different feature set and supports different operations. Here&#8217;s how they compare:</p>



<div id="medium-table-block_f9ec2aec7e3237cfe1feb4338c4aee7a"
     class="block-medium-table c-table__outer-wrapper  l-padding__top--0 l-padding__bottom--0 l-margin__top--0 l-margin__bottom--0">

    <table class="c-table">
                    <thead class="c-table__head">
            <tr>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Functionality                         </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            MLflow                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Comet                         </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            Verta.AI                        </div>
                    </td>
                                    <td class="c-item"
                        style="">
                        <div class="c-item__inner">
                            neptune.ai                        </div>
                    </td>
                            </tr>
            </thead>
        
        <tbody class="c-table__body">

                    
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Dataset versioning</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Versioning model files</p>
<p></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Versioning model explanations&nbsp;</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Model lineage</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Main stage transition tags</p>
<p></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Model compare</p>
<p></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Model searching&nbsp;</p>
<p></p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Limited</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Model packaging</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Yes</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>No</p>
                                                            </div>
                        </td>

                    
                </tr>

            
                <tr class="c-row">

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Pricing</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Free</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Free for individuals and researchers, paid for teams</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Open-source and paid versions available</p>
                                                            </div>
                        </td>

                    
                        <td class="c-ceil">
                            <div class="c-ceil__inner">
                                                                    <p>Free for individuals, researchers, and small teams, paid for bigger teams</p>
                                                            </div>
                        </td>

                    
                </tr>

                    
        </tbody>
    </table>

</div>



<div id="separator-block_d502bbbf6d0d9bea11744e68251fe33a"
         class="block-separator block-separator--15">
</div>



<h2 class="wp-block-heading" id="h-model-registry-tools">Model registry tools</h2>



<p>Here are a number of model registry tools that are used across the industry:</p>



<h3 class="wp-block-heading" id="h-mlflow"><a href="https://mlflow.org/" target="_blank" rel="noreferrer noopener nofollow">MLflow</a></h3>



<p>MLflow is an open-source platform for managing the ML model lifecycle. It enables you to track the MLOps lifecycle through its APIs and provides model versioning, model lineage, annotations, and transitions from development to deployment.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-4.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-4.png?ssl=1" alt="MLFlow dashboard " class="wp-image-71098"/></a><figcaption class="wp-element-caption"><em>MLFlow dashboard | <a href="https://www.databricks.com/product/mlflow-model-registry" target="_blank" rel="noreferrer noopener nofollow">Source</a>&nbsp;</em></figcaption></figure>
</div>


<p>Some features of MLflow model registry are as follows:</p>



<ul class="wp-block-list">
<li><strong>Model lineage tracking</strong>, showing which experiment and run produced a given model version</li>



<li><strong>Predefined model stages</strong> (Staging, Production, and Archived), with each model version assigned one stage at a time.</li>



<li><strong>Annotations and versioning</strong>, allowing you to document and manage top-level models and individual versions using Markdown</li>



<li><strong>Webhooks</strong>, triggering actions based on registry events.</li>



<li><strong>Email notifications</strong>, to stay informed about model lifecycle changes.</li>
</ul>



<p>MLflow can be self-hosted or used as part of a managed service. While Databricks offers a <a href="https://www.databricks.com/" target="_blank" rel="noreferrer noopener nofollow">full-featured hosted version</a>, <a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noreferrer noopener nofollow">Amazon SageMaker</a> and <a href="https://azure.microsoft.com/en-us/products/machine-learning" target="_blank" rel="noreferrer noopener nofollow">Azure Machine Learning</a> also support the MLflow client, letting you track and register models within their ecosystems. However, in these cloud integrations, model data is logged to proprietary backends, and not all MLflow features are supported. These integrations provide convenience for teams operating within AWS or Azure, while still benefiting from MLflow’s open interface.</p>



<section id="blog-intext-cta-block_d960487614f0d3da35965b3ca90aa2fe" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-learn-more">Learn more</h3>
    
            <p>Check detailed <a href="/vs/mlflow">comparison between neptune.ai and MLflow</a>.</p>
    
    </section>



<h3 class="wp-block-heading" id="h-comet"><a href="https://www.comet.com/site/" target="_blank" rel="noreferrer noopener nofollow">Comet</a></h3>



<p>Developers can use the Comet platform to manage machine learning experiments. It lets you version, register, and deploy models through the Experiment API of its Python SDK.</p>



<p>Comet keeps track of model versions and each model&#8217;s experiment history, letting you check detailed information about every version. It also helps you maintain your ML workflow more efficiently through model reproduction and optimization.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-6.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-6.png?ssl=1" alt="Comet dashboard" class="wp-image-71100" style="width:840px;height:557px"/></a><figcaption class="wp-element-caption"><em>Comet dashboard | <a href="https://www.comet.com/site/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>The feature-rich Comet has various functionalities for running and tracking ML model experiments, including:</p>



<ul class="wp-block-list">
<li>Comet allows you to easily check the history of evaluation/testing runs.</li>



<li>You can easily compare different experiments using the Comet model registry.&nbsp;</li>



<li>It allows you to access the code, dependencies, hyperparameters, and metrics within a single UI.&nbsp;</li>



<li>It has in-built reporting and visualization features to communicate with team members and stakeholders.</li>



<li>It lets you configure webhooks and integrate the Comet model registry with your CI/CD pipeline.</li>
</ul>
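<p>An illustrative sketch of this workflow with Comet&#8217;s Python SDK follows; it is not runnable as-is, since it assumes a real API key, and the workspace, project, and model names are hypothetical:</p>

```python
from comet_ml import Experiment

# Hypothetical workspace/project; a valid API key is required to actually run this.
exp = Experiment(api_key="YOUR_API_KEY", workspace="my-team", project_name="churn")

exp.log_metric("val_accuracy", 0.91)
exp.log_model("churn-model", "model.pkl")  # attach the model file to the experiment
exp.register_model("churn-model")          # push it to the Comet model registry
exp.end()
```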



<section id="blog-intext-cta-block_0a75ea55896c396f6b66d2c1ff3aa705" class="block-blog-intext-cta  c-box c-box--default c-box--dark c-box--no-hover c-box--standard ">

            <h3 class="block-blog-intext-cta__header" id="h-may-be-useful">May be useful</h3>
    
            <p>Check detailed <a href="/vs/comet" target="_blank" rel="noopener">comparison between neptune.ai and Comet</a>.</p>
    
    </section>



<h3 class="wp-block-heading" id="h-verta-ai"><a href="https://www.verta.ai/" target="_blank" rel="noreferrer noopener nofollow">Verta.ai</a></h3>



<p>You can use the Verta AI tool to manage and operate models in one unified space. It provides an interactive UI where you can register ML models and publish metadata, artifacts, and documents. To manage an end-to-end experiment, you can connect the model to the experiment tracker. Verta AI also offers version control for ML projects.</p>



<p>Additionally, it enables you to keep track of changes made to data, code, environments, and model configuration. Through the accessible audit log, you can examine the model&#8217;s reliability and compatibility at any time. You can also create a custom approval sequence appropriate for your project and integrate it with your ticketing system.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-7.png?ssl=1"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/best-model-registry-tools-7.png?ssl=1" alt="Verta AI dashboard" class="wp-image-71101"/></a><figcaption class="wp-element-caption"><em>Verta AI dashboard | <a href="https://blog.verta.ai/introducing-verta-model-registry">Source</a></em></figcaption></figure>
</div>


<p>Some of the main features of Verta AI’s model registry are:</p>



<ul class="wp-block-list">
<li>It enables end-to-end information tracking such as Model ID, description, tags, documentation, model versions, release stage, artifacts, model metadata, and more, which helps in selecting the best model.&nbsp;</li>



<li>It works on container tools like Kubernetes and Docker and is integrable with GitOps and Jenkins, which helps in automatically tracking model versions.</li>



<li>It provides access to detailed audit logs for compliance.&nbsp;</li>



<li>It offers a Git-like environment that makes it intuitive to use.</li>



<li>You can set up granular access control for editors, reviewers, and collaborators.</li>
</ul>



<h3 class="wp-block-heading" id="h-neptune-ai"><a href="/" target="_blank" rel="noreferrer noopener">neptune.ai</a></h3>



<p><strong><a href="https://neptune.ai/" target="_blank" rel="noreferrer noopener">Neptune</a>&nbsp;is primarily an experiment tracker</strong>, but it provides model registry functionality to a great extent.&nbsp;</p>



<p>Neptune allows you to<strong> log, visualize, compare, and query all metadata</strong> related to ML experiments and models. It only takes a few lines of code to integrate Neptune with your code. The API is flexible, and the UI is user-friendly but also prepared for the high volume of logged metadata.</p>



<div id="app-screenshot-block_d54628b87af037c8322300551935abce"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=1020%2C454&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=480%2C214&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=768%2C342&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2022/08/Models-as-runs.jpg?fit=1020%2C454&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="454"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

					<figcaption class="block-app-screenshot__caption">
				A list of different model versions and associated metadata tracked in neptune.ai			</figcaption>
			
</div>



<div id="separator-block_d502bbbf6d0d9bea11744e68251fe33a"
         class="block-separator block-separator--15">
</div>



<p>Some of the features of Neptune:</p>



<ul class="wp-block-list">
<li>It lets you track models and model versions, along with the associated metadata. You can version model code, images, datasets, Git info, and notebooks.</li>



<li>It allows you to filter and sort the versioned data easily.</li>



<li>It lets you manage model stages using tags.</li>



<li>You can query and download any stored model files and metadata.</li>



<li>It helps your team collaborate on experiments by providing persistent links to the UI and reports tailored to specific stakeholders or projects.</li>



<li>It supports different connection modes, such as asynchronous (default), synchronous, offline, read-only, and debug, for versioned metadata tracking.</li>
</ul>
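<p>A sketch of registering a model version with the Neptune Python client follows; the project name and model key are hypothetical, and an API token is needed to actually run it:</p>

```python
import neptune

# Create a model entry and a version of it in a hypothetical project.
model = neptune.init_model(project="my-team/churn", key="MOD")
model_version = neptune.init_model_version(model="CHURN-MOD")  # hypothetical model ID

model_version["model/binary"].upload("model.pkl")
model_version["validation/accuracy"] = 0.92
model_version.change_stage("staging")  # manage lifecycle stages via stage tags

model_version.stop()
model.stop()
```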



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>After reading this article, I hope you now know what model registry tools are and the criteria to look for when selecting one. To offer a practical perspective, we also discussed some popular model registry tools and compared them across several aspects. Now, let&#8217;s wrap up the article with a few key takeaways:</p>



<ul class="wp-block-list">
<li>A model registry versions models and publishes them to production.</li>



<li>Before selecting a model registry tool, evaluate each tool against your requirements.</li>



<li>Model registry evaluation criteria can range from the capability to monitor and manage the different ML model stages and versions to its ease of use and pricing.</li>



<li>You may refer to the highlighted features of different model registry tools to get a better idea of that tool’s compatibility with your use case.</li>
</ul>



<p>With these points in mind, I hope your model registry tool search will be much easier.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7206</post-id>	</item>
		<item>
		<title>Building Deep Learning-Based OCR Model: Lessons Learned</title>
		<link>https://neptune.ai/blog/building-deep-learning-based-ocr-model</link>
		
		<dc:creator><![CDATA[Gourav Bais]]></dc:creator>
		<pubDate>Fri, 22 Jul 2022 06:53:58 +0000</pubDate>
				<category><![CDATA[MLOps]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<guid isPermaLink="false">https://neptune.test/building-deep-learning-based-ocr-model/</guid>

					<description><![CDATA[Deep learning solutions have taken the world by storm, and all kinds of organizations like tech giants, well-grown companies, and startups are now trying to incorporate deep learning (DL) and machine learning (ML) somehow in their current workflow. One of these important solutions that have gained quite a popularity over the past few years is&#8230;]]></description>
										<content:encoded><![CDATA[
<p>Deep learning solutions have taken the world by storm, and all kinds of organizations, from tech giants and well-established companies to startups, are now trying to incorporate deep learning (DL) and machine learning (ML) into their current workflows. One such solution that has gained considerable popularity over the past few years is the OCR engine.</p>



<p><strong>OCR (Optical Character Recognition)</strong> is a technique for reading textual information directly from digital and scanned documents without any human intervention. These documents can be in any format, such as PDF, PNG, JPEG, or TIFF. There are many advantages to using OCR systems:</p>



<div id="case-study-numbered-list-block_bde50a7c4a4ab39e996c60531d931f41"
         class="block-case-study-numbered-list ">

    
    <h2 id="h-"></h2>

    <ul class="c-list">
                    <li class="c-list__item">
                <span class="c-list__counter">1</span>
It increases productivity, as processing (extracting information from) documents takes very little time.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">2</span>
It saves resources: an OCR program does the work, so no manual effort is required.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">3</span>
                It eliminates the need for manual data entry.            </li>
                    <li class="c-list__item">
                <span class="c-list__counter">4</span>
The chance of errors is reduced.            </li>
            </ul>
</div>



<p>Extracting information from digital documents is relatively easy, as they contain <a href="https://en.wikipedia.org/wiki/Metadata" target="_blank" rel="noreferrer noopener nofollow">metadata</a> that provides the text directly. For scanned copies, however, you need a different solution, as metadata does not help there. This is where deep learning comes in, providing solutions for extracting text from images.</p>



<p>In this article, you will learn lessons from building a deep learning-based OCR model, so that when you work on a similar use case, you can avoid the issues I faced during development and deployment.</p>



<h2 class="wp-block-heading" id="h-what-is-deep-learning-based-ocr">What is deep learning-based OCR?</h2>



<p>OCR has become very popular and has been adopted by several industries for faster reading of text data from images. While solutions like <a href="https://learnopencv.com/contour-detection-using-opencv-python-c/" target="_blank" rel="noreferrer noopener nofollow">contour detection</a>, <a href="https://desktop.arcgis.com/en/arcmap/latest/extensions/spatial-analyst/image-classification/what-is-image-classification-.htm" target="_blank" rel="noreferrer noopener nofollow">image classification</a>, <a href="https://pyimagesearch.com/2021/02/22/opencv-connected-component-labeling-and-analysis/" target="_blank" rel="noreferrer noopener nofollow">connected component analysis</a>, etc. work for documents with comparable text size and font, ideal lighting conditions, and good image quality, such methods are not effective for the irregular, heterogeneous text often called wild text or scene text. This text could come from a car&#8217;s license plate, a house number plate, poorly scanned documents (with no predefined conditions), and so on. For these cases, deep learning solutions are used. Using DL for OCR is a three-step process:</p>



<ol class="wp-block-list">
<li><strong>Preprocessing: </strong>OCR is not an easy problem, at least not as easy as we tend to think. Extracting text data from digital images/documents is straightforward, but things change for scanned or phone-captured images. Real-world images are not always captured in ideal conditions; they can have noise, blur, skew, etc., which needs to be handled before applying DL models. For this reason, <a href="https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html" target="_blank" rel="noreferrer noopener nofollow">image preprocessing </a>is required to tackle these issues.</li>
</ol>



<ol start="2" class="wp-block-list">
<li><strong>Text Detection/Localization:</strong> At this stage, models like <a href="https://github.com/matterport/Mask_RCNN" target="_blank" rel="noreferrer noopener nofollow">Mask-RCNN</a>, <a href="https://github.com/argman/EAST" target="_blank" rel="noreferrer noopener nofollow">EAST Text Detector</a>, <a href="https://github.com/ultralytics/yolov5" target="_blank" rel="noreferrer noopener nofollow">YoloV5</a>, <a href="https://github.com/amdegroot/ssd.pytorch" target="_blank" rel="noreferrer noopener nofollow">SSD</a>, etc. are used to locate the text in images. These models usually create bounding boxes (square/rectangular boxes) over each piece of text identified in the image or document.</li>
</ol>



<ol start="3" class="wp-block-list">
<li><strong>Text Recognition: </strong>Once the text locations are identified, each bounding box is sent to the text recognition model, which is usually a combination of <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network" target="_blank" rel="noreferrer noopener nofollow">RNNs</a>, <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" target="_blank" rel="noreferrer noopener nofollow">CNNs</a>, and <a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)" target="_blank" rel="noreferrer noopener nofollow">attention networks</a>. The final output of these models is the text extracted from the documents. Open-source text recognition models like <a href="https://github.com/tesseract-ocr/tesseract" target="_blank" rel="noreferrer noopener nofollow">Tesseract</a>, <a href="https://github.com/open-mmlab/mmocr" target="_blank" rel="noreferrer noopener nofollow">MMOCR</a>, etc. can help you achieve good accuracy.&nbsp;</li>
</ol>
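<p>To make the preprocessing step concrete, here is a minimal, pure-Python sketch of one common cleanup operation: a median filter for suppressing salt-and-pepper noise. In practice you would use a library such as OpenCV; the image is represented here as a nested list of grayscale values purely for illustration.</p>

```python
def median_filter(img, k=3):
    """Denoise a grayscale image (list of lists) with a k x k median filter.
    Border pixels are left unchanged for simplicity."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [row[:] for row in img]  # copy so the input stays intact
    for y in range(r, h - r):
        for x in range(r, w - r):
            window = [img[y + dy][x + dx]
                      for dy in range(-r, r + 1)
                      for dx in range(-r, r + 1)]
            window.sort()
            out[y][x] = window[len(window) // 2]  # median of the window
    return out

# A 5x5 patch of value 10 with a single salt-noise pixel in the middle
patch = [[10] * 5 for _ in range(5)]
patch[2][2] = 255
clean = median_filter(patch)  # the noisy pixel is replaced by the median, 10
```

<p>The same idea scales up: deskewing, contrast normalization, and blur removal are all transforms applied before the detection model ever sees the image.</p>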


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/image5-1.png?ssl=1" alt="Deep Learning based OCR Model" class="wp-image-69076"/><figcaption class="wp-element-caption"><em>Deep learning based OCR model | Source: Author</em></figcaption></figure>
</div>


<p>To illustrate the effectiveness of OCR models, let&#8217;s look at a few areas where OCR is applied today to increase the productivity and efficiency of systems:</p>



<ul class="wp-block-list">
<li><strong>OCR in Banking:</strong> Automating processes such as customer verification and check deposits using OCR-based text extraction and verification.</li>
</ul>



<ul class="wp-block-list">
<li><strong>OCR in Insurance: </strong>Extracting the text information from a variety of documents in the insurance domain.&nbsp;</li>
</ul>



<ul class="wp-block-list">
<li><strong>OCR in Healthcare: </strong>Processing documents such as a patient&#8217;s history, x-ray reports, diagnostic reports, etc. can be a tough task that OCR makes easy.</li>
</ul>



<p>These are just a few examples of where OCR is applied; to learn about more use cases, you can refer to the following <a href="https://softengi.com/blog/object-character-recognition-use-cases/" target="_blank" rel="noreferrer noopener nofollow">link</a>.</p>



<h2 class="wp-block-heading" id="h-lessons-from-building-a-deep-learning-based-ocr-model">Lessons from building a deep learning-based OCR model&nbsp;</h2>



<p>Now that you know what OCR is and what makes it important today, it&#8217;s time to discuss some of the challenges you may face while working on it. I have been part of several OCR-based projects in the finance (insurance) sector. To name a few:</p>



<ul class="wp-block-list">
<li>I have worked on a <strong><a href="https://www.thalesgroup.com/en/markets/digital-identity-and-security/banking-payment/issuance/id-verification/know-your-customer" target="_blank" rel="noreferrer noopener nofollow">KYC</a> verification OCR</strong> project where information from different identification documents needed to be extracted and validated against each other to verify a customer profile.&nbsp;</li>



<li>I have also worked on <strong>insurance documents OCR</strong> where information from different documents needed to be extracted and used for several other purposes like user profile creation, user verification, etc.</li>
</ul>



<p>One thing I have learned while working on these OCR use cases is that you don&#8217;t need to fail at everything yourself in order to learn; you can learn from others&#8217; mistakes as well. There were several stages where I faced challenges while working in a team on these financial DL-based OCR projects. Let&#8217;s discuss those challenges, following the stages of ML pipeline development.</p>



<h3 class="wp-block-heading" id="h-data-collection">Data collection&nbsp;</h3>



<h4 class="wp-block-heading">Problem</h4>



<p>This is the first and most important stage of any ML or DL use case. OCR solutions are mostly adopted by financial organizations like banks, insurance companies, and brokerage firms, as these organizations have a lot of documents that are hard to process manually. Being financial organizations, they also have to follow government rules and regulations.&nbsp;</p>



<p>For this reason, if you are working on a <a href="https://en.wikipedia.org/wiki/Proof_of_concept" target="_blank" rel="noreferrer noopener nofollow">POC (Proof of Concept)</a> for one of these financial firms, there is a chance they will not share a whole lot of data for training your text detection and recognition models. Since deep learning solutions are all about data, you might end up with poorly performing models. This comes down to regulatory compliance: sharing the data could breach users&#8217; privacy and cause customers financial and other kinds of loss.&nbsp;</p>



<h4 class="wp-block-heading">Solution</h4>



<p>Does this problem have a solution? Yes, it does. Let&#8217;s say you want to work on some kind of form or ID card for text extraction. For forms, you could ask clients for empty templates and fill them with random data (time-consuming but effective), and for ID cards, you can find plenty of samples on the internet to get started. Alternatively, you can take just a few samples of these forms and ID cards and use <a href="https://medium.com/analytics-vidhya/image-augmentation-9b7be3972e27" target="_blank" rel="noreferrer noopener nofollow">image augmentation</a> techniques to create new, similar images for your model training.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Building-Deep-Learning-Based-OCR-Model-Lessons-Learned2.png?ssl=1" alt="Image augmentation for OCR" class="wp-image-69050"/><figcaption class="wp-element-caption"><em>Image augmentation for OCR | <a href="https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/" target="_blank" rel="noreferrer noopener nofollow">Source</a>&nbsp;</em></figcaption></figure>
</div>
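<p>The augmentation idea can be sketched in a few lines. Below is a minimal pure-Python illustration (in practice you would use a library such as Albumentations or imgaug) that creates brightness-shifted and noisy variants of a grayscale image, here represented as a nested list of pixel values:</p>

```python
import random

def adjust_brightness(img, delta):
    """Shift every pixel by delta, clamped to the 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]

def add_noise(img, amount=0.1, seed=0):
    """Replace a random fraction of pixels with salt-and-pepper noise."""
    rng = random.Random(seed)
    out = [row[:] for row in img]
    for row in out:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice([0, 255])
    return out

# Turn one labeled sample into several training samples
original = [[100, 120], [140, 160]]
augmented = [adjust_brightness(original, d) for d in (-40, 40)]
augmented.append(add_noise(original, amount=0.5))
```

<p>Because brightness, contrast, and noise changes do not move the text, the original annotations can be reused for these variants as-is.</p>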


<p>Sometimes, when you want to start working on OCR use cases and do not have any organizational data, you can use one of the open-source datasets available online for OCR. You can check a list of the best datasets for OCR <a href="https://www.linkedin.com/pulse/15-best-ocr-handwriting-datasets-machine-learning-limarc-ambalina/" target="_blank" rel="noreferrer noopener nofollow">here</a>.&nbsp;&nbsp;</p>



<h3 class="wp-block-heading" id="h-labeling-the-data-data-annotation">Labeling the data (data annotation)</h3>



<h4 class="wp-block-heading">Problem</h4>



<p>Now that you have your data and have created new samples using image augmentation techniques, the next thing on the list is data labeling. Data labeling is the process of creating bounding boxes around the objects you want your <a href="/blog/object-detection-algorithms-and-libraries" target="_blank" rel="noreferrer noopener">object detection</a> model to find in images. In this case, our object is text, so you need to create bounding boxes over the text areas you want your model to identify. Creating these labels is a very tedious but important task, and it is something you cannot skip. </p>


    <a
        href="/blog/how-to-train-your-own-object-detector-using-tensorflow-object-detection-api"
        id="cta-box-related-link-block_f8d2964a705ebc011a2272c64eba57c0"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
<h3 class="c-header" id="h-how-to-train-your-own-object-detector-using-tensorflow-object-detection-api">                How to Train Your Own Object Detector Using TensorFlow Object Detection API            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<p>Also, bounding boxes are quite general when we talk about annotations; different types of use cases require different types of annotations. For example, for use cases where you want the most accurate coordinates of an object, you cannot use square or rectangular bounding boxes; instead, you need polygon (multiline) annotations. For semantic segmentation use cases, where you want to separate an image into different regions, you need to assign a label to every pixel in the image. To learn more about different types of annotations, you can refer to this <a href="https://hackernoon.com/illuminating-the-intriguing-computer-vision-uses-cases-of-image-annotation-w21m3zfg" target="_blank" rel="noreferrer noopener nofollow">link</a>.</p>



<h4 class="wp-block-heading">Solution</h4>



<p>Is there any way to expedite the labeling process? Yes, there is. If you are using image augmentation techniques like adding noise, blur, brightness, or contrast, the image geometry does not change, so you can reuse the coordinates from the original image for the augmented images. Also, if you are rotating your images, make sure you rotate them in multiples of 90 degrees so that you can rotate your annotations (labels) by the same angle, which saves a lot of rework. For this task, you can use the <a href="https://www.robots.ox.ac.uk/~vgg/software/via/" target="_blank" rel="noreferrer noopener nofollow">VGG</a> or <a href="https://github.com/microsoft/VoTT" target="_blank" rel="noreferrer noopener nofollow">VoTT</a> image annotation tools.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/image6-1.png?ssl=1" alt="VoTT Annotations" class="wp-image-69077"/><figcaption class="wp-element-caption"><em>VoTT annotations | <a href="https://si-aizu.github.io/documentation/Tutorial-VoTT/" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>
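<p>Rotating labels along with the image is just a coordinate transform. A minimal sketch, assuming axis-aligned boxes given as <code>(x_min, y_min, x_max, y_max)</code> in pixel coordinates:</p>

```python
def rotate_box_90cw(box, img_w, img_h):
    """Map an axis-aligned box (x_min, y_min, x_max, y_max) to its position
    after rotating the image 90 degrees clockwise.
    The rotated image has size (img_h, img_w): width and height swap."""
    x0, y0, x1, y1 = box
    # A point (x, y) maps to (img_h - y, x) under a 90-degree clockwise turn,
    # so the old y-extent becomes the new x-extent (flipped).
    return (img_h - y1, x0, img_h - y0, x1)

# A box in a 100x50 (width x height) image
box = (10, 5, 30, 20)
rotated = rotate_box_90cw(box, img_w=100, img_h=50)
```

<p>Applying the transform four times (swapping width and height at each step) returns the original box, which is a handy sanity check for your annotation pipeline.</p>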


<p>Sometimes, when you have a lot of data to annotate, you can outsource it; there are many companies that provide annotation services. You simply explain the type of annotation you want, and the annotation team does it for you.&nbsp;</p>



<h3 class="wp-block-heading" id="h-model-architecture-and-training-infrastructure">Model architecture and training infrastructure&nbsp;</h3>



<h4 class="wp-block-heading">Problem</h4>



<p>One thing you must check is the hardware you have for training your models. Training object detection models requires a decent amount of RAM and a GPU (some models can be trained on a CPU as well, but training would be very slow).</p>



<p>The other part is that, over the years, many different object detection models have been introduced in computer vision. Choosing the one that works best for your use case (text detection and recognition) and also runs well on your GPU/CPU machine can be difficult. </p>



<h4 class="wp-block-heading">Solution</h4>



<p>For the first part, if you have a GPU-based system, there is no need to worry, as you can easily train your model. But if you are using a CPU, training the whole model at once can take a lot of time. In that case, <a href="https://machinelearningmastery.com/transfer-learning-for-deep-learning/" target="_blank" rel="noreferrer noopener nofollow">transfer learning</a> can be the way to go, as it doesn&#8217;t involve training models from scratch.&nbsp;</p>



<p>Each newly introduced computer vision model either has a whole new architecture or improves the performance of existing models. For smaller, dense objects like text, <a href="https://github.com/ultralytics/yolov5" target="_blank" rel="noreferrer noopener nofollow">YoloV5</a> is preferred for text detection over the alternatives because of its architectural benefits.&nbsp;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/image3-2.png?ssl=1" alt="Yolov5 Architecture" class="wp-image-69074"/><figcaption class="wp-element-caption"><em>Yolov5 Architecture&nbsp;| <a href="https://github.com/ultralytics/yolov5/issues/280" target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>


<p>If you want to segment an image into multiple regions (pixel-wise), <a href="https://github.com/matterport/Mask_RCNN" target="_blank" rel="noreferrer noopener nofollow">Mask-RCNN</a> is considered the best choice. For text recognition, some widely used models are <a href="https://github.com/open-mmlab/mmocr" target="_blank" rel="noreferrer noopener nofollow">MMOCR</a>, <a href="https://github.com/PaddlePaddle/PaddleOCR" target="_blank" rel="noreferrer noopener nofollow">PaddleOCR</a>, and <a href="https://github.com/bgshih/crnn" target="_blank" rel="noreferrer noopener nofollow">CRNN</a>.&nbsp;</p>



<h3 class="wp-block-heading" id="h-training">Training</h3>



<h4 class="wp-block-heading">Problem</h4>



<p>This is a crucial stage where you train your DL-based text detection and recognition models. One thing we are all aware of is that training a deep learning model is a black box: you can try out different parameters to get the best results for your use case, but you won&#8217;t know what is going on underneath. You may need to try different deep learning models for text detection and recognition, which is pretty hard with all the hyperparameters you need to take care of during training. </p>



<h4 class="wp-block-heading">Solution</h4>



<p>One thing I have learned here is that you should stick with a single model until you have tried everything, like <a href="/blog/hyperparameter-tuning-in-python-complete-guide" target="_blank" rel="noreferrer noopener">hyperparameter tuning</a>, model architecture tuning, etc. You should not judge the performance of a model after trying out only a few things. </p>



<p>Furthermore, I would advise you to train your model in parts. For example, if you want to train your model for 50 epochs, divide it into three stages of 15, 15, and 20 epochs and evaluate in between. This way, you will have results at different stages and a sense of whether the model is performing well or badly. It is better than running all 50 epochs at once for a few days and only then learning that the model does not work on your data at all.</p>
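<p>The staged-training idea can be sketched as a loop around two placeholder functions, <code>train_for_epochs</code> and <code>evaluate</code> (both are hypothetical stand-ins here for whatever your framework provides):</p>

```python
def train_for_epochs(model, n_epochs):
    """Placeholder: advance training by n_epochs and return the model."""
    model["epochs_trained"] += n_epochs
    return model

def evaluate(model):
    """Placeholder: return a validation score for the current model.
    Here it is a dummy score that simply grows with training."""
    return model["epochs_trained"] / 50

model = {"epochs_trained": 0}
history = []
for stage_epochs in (15, 15, 20):  # 50 epochs split into three stages
    model = train_for_epochs(model, stage_epochs)
    score = evaluate(model)
    history.append((model["epochs_trained"], score))
    if score < 0.1:  # bail out early if the model is clearly not learning
        break
```

<p>The point is the structure, not the stubs: an intermediate evaluation after each stage lets you stop a doomed run days earlier.</p>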



<p>Also, as already discussed above, <a href="/blog/transfer-learning-guide-examples-for-images-and-text-in-keras" target="_blank" rel="noreferrer noopener">transfer learning</a> could be the key. You could train your model from scratch, but taking an already trained model and fine-tuning it on your data will usually give you good accuracy sooner. </p>



<h3 class="wp-block-heading" id="h-testing">Testing</h3>



<h4 class="wp-block-heading">Problem</h4>



<p>Once you have your models ready, the next thing in the queue is to test their performance. Testing deep learning models is quite easy here, as you can see the results directly (bounding boxes drawn on the objects) or compare the extracted text with ground truth data, unlike traditional machine learning use cases where you need to interpret the results from numbers.&nbsp;</p>



<p>Nowadays, you can test DL models manually or try one of the available <a href="/blog/automated-testing-machine-learning" target="_blank" rel="noreferrer noopener">automated testing</a> services. The manual process takes some time, as you have to check every image yourself to judge the performance of the models. If you are working on financial use cases, you might be limited to manual testing, as you cannot share the data with online automated testing services. </p>



<h4 class="wp-block-heading">Solution</h4>



<p>One major piece of advice I would give here is to never test your models on the training dataset, as it will not show the real performance of your model. You need to create three different datasets: train, validation, and test. The first two are used for training and run-time model assessment, while the test dataset shows you the real performance of the model.</p>
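<p>The three-way split can be sketched in a few lines of plain Python (real projects would typically use <code>sklearn.model_selection.train_test_split</code> or a framework utility; the fractions below are illustrative):</p>

```python
import random

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle samples reproducibly and split them into
    train / validation / test lists."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed -> repeatable split
    n = len(samples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = samples[:n_test]
    val = samples[n_test:n_test + n_val]
    train = samples[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))  # 80 / 10 / 10 split
```

<p>Fixing the seed matters: it keeps the test set identical across experiments, so scores stay comparable.</p>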



<p>The next step is to decide on the best metrics to assess the performance of your detection and recognition models. Since text detection is a type of object detection, mAP (mean average precision) is used to assess model performance. It compares the model&#8217;s predicted bounding boxes with the ground truth bounding boxes and returns a score: the higher the score, the better the performance. </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Building-Deep-Learning-Based-OCR-Model-Lessons-Learned4.png?ssl=1" alt="mAP formula" class="wp-image-69052"/><figcaption class="wp-element-caption"><em>mAP formula | <a href="https://www.v7labs.com/blog/mean-average-precision#:~:text=a%20standard%20metric.-,Mean%20Average%20Precision%20for%20Object%20Detection,of%20an%20object%20detection%20model." target="_blank" rel="noreferrer noopener nofollow">Source</a></em></figcaption></figure>
</div>
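<p>At the core of mAP is the IoU (intersection over union) between a predicted box and a ground-truth box; a prediction counts as a true positive when its IoU exceeds a chosen threshold. A minimal sketch for axis-aligned boxes given as <code>(x_min, y_min, x_max, y_max)</code>:</p>

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))  # intersection width
    iy = max(0, min(ay1, by1) - max(ay0, by0))  # intersection height
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

# Identical boxes give 1.0, disjoint boxes give 0.0,
# and partial overlap falls in between.
```

<p>mAP then averages precision over recall levels (and typically over IoU thresholds and classes), but every true/false-positive decision bottoms out in this one computation.</p>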


<p>For the text recognition model, the widely used metric is CER (character error rate). Each predicted character is compared with the ground truth: the lower the CER, the better the model performance. As a rule of thumb, you need a CER below 10% for the model to replace a manual process. To learn more about CER and how to calculate it, you can check the following <a href="https://towardsdatascience.com/evaluating-ocr-output-quality-with-character-error-rate-cer-and-word-error-rate-wer-853175297510" target="_blank" rel="noreferrer noopener nofollow">link</a>.</p>
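<p>CER is the character-level edit (Levenshtein) distance between the predicted string and the ground truth, divided by the length of the ground truth. A minimal pure-Python sketch (libraries such as jiwer provide this out of the box):</p>

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance between the ground-truth
    and predicted strings, divided by the ground-truth length.
    Returns 0.0 for an empty reference (strictly, CER is undefined there)."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (or match)
        prev = cur
    return prev[n] / m if m else 0.0

# One substituted character out of three -> CER of 1/3
```

<p>With this definition, the 10% rule of thumb above means roughly one wrong character per ten characters of ground truth.</p>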



<h3 class="wp-block-heading" id="h-deployment-and-monitoring">Deployment and monitoring&nbsp;</h3>



<h4 class="wp-block-heading">Problem</h4>



<p>Once you have your final models ready with decent accuracy, you have to deploy them somewhere to make them accessible to the target audience. This is one of the steps where you might face issues no matter where you deploy. Three important challenges I have faced while deploying these models are:</p>



<ol class="wp-block-list">
<li>I was using the <a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener nofollow">PyTorch</a> library to implement the object detection model; in my setup, it did not allow multithreading at inference time because the model had not been set up for multithreading at training time.</li>



<li>The model size might be large, as it is a DL-based model, and it might take a long time to load at inference time.</li>



<li>Deploying the model is not enough; you need to monitor it for a few months to know whether it is performing as expected or has further scope for improvement. </li>
</ol>


    <a
        href="/blog/how-to-monitor-your-models-in-production-guide"
        id="cta-box-related-link-block_6b2a88d3db7000d572944301d786b315"
        class="block-cta-box-related-link  l-margin__top--standard l-margin__bottom--standard"
        target="_blank" rel="nofollow noopener noreferrer"    >

    
    <div class="block-cta-box-related-link__description-wrapper block-cta-box-related-link__description-wrapper--full">

        
            <div class="c-eyebrow">

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-related--article.svg"
                    loading="lazy"
                    decoding="async"
                    width="16"
                    height="16"
                    alt=""
                    class="c-eyebrow__icon">

                <div class="c-eyebrow__text">
                    Related post                </div>
            </div>

        
<h3 class="c-header" id="h-a-comprehensive-guide-on-how-to-monitor-your-models-in-production">                A Comprehensive Guide on How to Monitor Your Models in Production            </h3>        
                    <div class="c-button c-button--tertiary c-button--small">

                <span class="c-button__text">
                    Read more                </span>

                <img
                    src="https://neptune.ai/wp-content/themes/neptune/img/icon-button-arrow-right.svg"
                    loading="lazy"
                    decoding="async"
                    width="12"
                    height="12"
                    alt=""
                    class="c-button__arrow">

            </div>
            </div>

    </a>



<h4 class="wp-block-heading">Solution</h4>



<p>To resolve the first issue, be aware that you must train the model with multithreading in PyTorch if you want it available at inference time. Another solution is to switch frameworks, i.e., look for a <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener nofollow">TensorFlow</a> alternative to the torch model you want, as it already supports multithreading and is quite easy to work with. </p>



<p>For the second point, if you have a very large model that takes a lot of time to load for inference, you can convert it to the <a href="https://onnx.ai/" target="_blank" rel="noreferrer noopener nofollow">ONNX</a> format, which can reduce the model size by about a third, with a slight impact on accuracy.</p>



<p>Model monitoring can be done manually, but it requires engineering resources to look for the cases where your OCR model fails. Instead, you can use one of the different <a href="/blog/ml-model-monitoring-best-tools">ML model monitoring solutions</a> that work in an automated way.</p>



<section
	id="i-box-block_204378a05c2d7660bd45e9da893328d5"
	class="block-i-box  l-margin__top--0 l-margin__bottom--standard">

			<header class="c-header">
			<img
				src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
				data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/i-box/header-icon.svg"
				width="24"
				height="24"
				class="c-header__icon lazyload"
				alt="">

			
            <h2 class="c-header__text animation " style='max-width: 100%;'   >
                 <strong>Aside</strong>
            </h2>		</header>
	
	<div class="block-i-box__inner">
		

<div
    id="custom-text-block_1da3dfe3b5882f4d3f32893c53343ac3"
    class="block-custom-text  white l-padding__top--0 l-padding__bottom--0"
    style="max-width: 100%; font-size: 1rem; line-height: 1.33; font-weight: 600;"
    >
    
    If you want a model monitoring solution for your experimentation and training processes, and a metadata store for your ML/AI workflow, you should check out neptune.ai. 
    </div>



<div id="group-of-boxes-block_92035b59f319b27519922fe0ca0050f9" class="b-group-of-boxes  l-padding__top--large l-padding__bottom--large">

<div
    class="c-wrapper c-wrapper--align-auto c-wrapper--align-vertical-auto" >
    <div class="b-group-of-boxes__grid l-grid--cols-2  l-grid--boxes">
        

	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-center c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<p>Here&#8217;s an example of how Neptune helped the ML team at Brainly optimize monitoring and debugging of their ML processes.</p>



<blockquote
	id="quote-small-block_8bbbc19a904701a85e6561d8cc0ee398"
	class="block-quote-small ">

	<img
		src="https://neptune.ai/wp-content/themes/neptune/img/icon-quote-small.svg"
		alt=""
		width="24"
		height="18"
		class="c-item__icon">

	
		<div class="c-item__content">

			Neptune gives us really good insight into simple data processing jobs that are not even training. We can, for example, monitor the usage of resources and know whether we are using all cores of the machines. And it’s quick – two lines of code, and we have much better visibility.
							<cite class="c-item__cite">
					<p>Hubert Bryłkowski, Senior ML Engineer at Brainly</p>
				</cite>
			
		</div>

	
</blockquote>


	</div>



	<div
		class="c-box c-box--transparent c-box--dark c-box--no-hover c-box--micro c-box--vertical-flex-start c-box--horizontal-flex-start c-box--paddings-none  l-margin__top--0 l-margin__bottom--0">
		

<div id="app-screenshot-block_c4bf9b4ab2e53b015e4b59909c18062b"
	class="block-app-screenshot js-block-with-image-full-screen-modal "
	data-video-url=""
	data-show-controls="false"
	data-unmute="false"
	data-button-icon="https://neptune.ai/wp-content/themes/neptune/img/icon-close.svg"
	data-image-full-screen-modal="https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1"
>

			<div class="block-app-screenshot__image-wrapper">
			<div class="block-app-screenshot__bar">
				<figure class="block-app-screenshot__bar-buttons-wrapper">
					<img
						src="https://neptune.ai/wp-content/themes/neptune/img/blocks/app-screenshot/bar-buttons.svg"
						width="34"
						height="9"
						class="block-app-screenshot__bar-buttons"
						alt="">
				</figure>
			</div>

			
				<img
					srcset="
					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=480%2C271&#038;ssl=1 480w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=768%2C434&#038;ssl=1 768w,					https://i0.wp.com/neptune.ai/wp-content/uploads/2024/11/Reporting.png?fit=1020%2C577&#038;ssl=1 1020w"
					alt=""
					style=""
					width="1020"
					height="577"
					class="block-app-screenshot__image"
				>

			
			<div class="block-app-screenshot__overlay">

				
					<a
						href="https://scale.neptune.ai/o/examples/org/LLM-Pretraining/reports/9e6a2cad-77e7-42df-9d64-28f07d37e908"
						target="_blank" rel="nofollow noopener noreferrer"
						class="c-button c-button--primary c-button--small c-button--cta">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-button--test-tube.svg"
							width="16"
							height="19"
							class="c-button__icon"
							alt=""
						/>

													<span class="c-button__text">
								See in app							</span>
						
					</a>

				
														<button
						class="js-c-image-full-screen-modal c-button c-button--tertiary c-button--small">
						<img
							decoding="async"
							loading="lazy"
							src="https://neptune.ai/wp-content/themes/neptune/img/icon-zoom.svg"
							width="16"
							height="17"
							class="c-button__icon"
							alt="zoom"
						/>

						<span class="c-button__text">
							Full screen preview						</span>
						
					</button>
									
			</div>

		</div>

			
</div>


	</div>


    </div>
</div>


</div>



<ul
    id="arrow-list-block_49a0f87ee09424bbb0f33b9f224f92d8"
    class="block-arrow-list block-list-item--font-size-regular">
    

<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Read the full <a href="/customers/brainly" target="_blank" rel="noreferrer noopener">case study with Brainly</a></p>


</li>



<li class="block-list-item ">
    <img loading="lazy" decoding="async"
        src="https://neptune.ai/wp-content/themes/neptune/img/image-ratio-holder.svg"
        data-src="https://neptune.ai/wp-content/themes/neptune/img/blocks/list-item/arrow.svg"
        width="10"
        height="10"
        class="block-list-item__arrow lazyload"
        alt="">

    

<p>Watch the <a href="/walkthrough" target="_blank" rel="noreferrer noopener">2-min product demo</a></p>


</li>


</ul>


	</div>

</section>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>After reading this article, you know what deep learning-based OCR is, what its main use cases are, and which lessons I have learned from real-world OCR projects. OCR technology is rapidly replacing manual data entry and document processing, so now is a good time to get hands-on with it. While working on these kinds of use cases, remember that you will rarely get a good model on the first attempt: you need to try different approaches and learn from every step along the way.</p>



<p>Creating a solution from scratch is often impractical because you will rarely have enough data for each new use case. Instead, transfer learning&#8212;fine-tuning different pretrained models on your own data&#8212;can help you reach good accuracy. My goal in this article was to share the issues I have faced while working on OCR use cases so that you do not have to face them in your own work. New issues will inevitably appear as technologies and libraries change, but keep looking for different solutions to get the work done.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7076</post-id>	</item>
	</channel>
</rss>
