How to Compare Machine Learning Models and Algorithms
https://neptune.ai/blog/how-to-compare-machine-learning-models-and-algorithms

Machine learning has expanded rapidly in the last few years. Instead of simple, one-directional, or linear ML pipelines, today data scientists and AI/ML engineers run multiple parallel experiments that can get overwhelming even for large teams. Each experiment is expected to be recorded in an immutable and reproducible format, which results in endless logs with invaluable details.

To narrow down our options, we need to thoroughly compare machine learning models across these parallel experiments. A well-planned approach is necessary to choose the right combination of algorithms for the data at hand.

So, in this article, we will explore how to approach comparing ML models and algorithms.

Comparing ML models is part of the broader process of tracking ML experiments.

Other than that, experiment tracking is about storing all the important data and metadata, debugging model training, and, generally, analyzing the results of experiments.

On the Neptune blog, you can find a separate piece on what experiment tracking is, written by Jakub Czakon, one of the co-founders of neptune.ai (which is actually an experiment tracking tool and the company behind this blog).

There’s also an article about the 13 best experiment tracking tools, including an extensive table comparing the features of all 13 tools.

The challenge of model selection

Each model or machine learning algorithm has several features that process the data in different ways. Often the data that is fed to these algorithms is also different depending on previous experimental stages. However, since machine learning teams and developers usually record their experiments, there’s ample data available for comparison. 

The challenge is to understand which parameters, data, and metadata must be considered to arrive at the final choice. It’s the classic paradox of having an overwhelming amount of details with no clarity.

An even greater challenge is determining whether a higher metric score actually means that the model is better than one with a lower score or if the difference is only caused by statistical bias or a flawed metric design.

Comparing machine learning algorithms: why do we do it?

Comparing machine learning algorithms is valuable on its own, but there are some not-so-obvious benefits of effectively comparing various experiments. Let’s take a look at the goals of comparison:

  • Better performance: the primary objective of model comparison and selection is to improve the performance of the machine learning software/solution. The objective is to narrow down the best algorithms that suit the data and the business requirements.
  • Longer lifetime: high performance can be short-lived if the chosen model is overly dependent on the training data and fails to generalize to unseen data. So, it’s also important to select a model that captures the underlying data patterns so that the predictions remain accurate over time with minimal need for re-training. 
  • Easier re-training: when we evaluate models and prepare them for comparison, we also record documentation (the best parameters, configurations, results, etc.). These details make retraining easier if there is a failure, because we don’t need to redo the previous analysis. We can retrace the decisions made during initial model selection and find the potential causes of the failure, which makes it easier to adjust the model based on past experience. As a result, retraining can begin immediately and proceed with greater efficiency.
  • Speedy production: with the model details at hand, it’s easy to narrow down on models that offer high processing speed and use memory resources optimally. Configuring machine learning solutions for production also requires setting key parameters, such as memory usage, processing speed, and response time, to ensure optimal performance and resource efficiency. Having production-level data makes it easier to align with the production engineers. Moreover, knowing the resource demands of different algorithms makes it easier to check their compliance and feasibility with respect to the organization’s allocated assets.

You may find interesting

The team over at ReSpo.Vision uses Neptune to track and compare their experiments.

They summarized the goals of efficient comparison methods and what benefits they bring:

Neptune made it much easier to compare the models and select the best one over the last couple of months, especially since we’ve been working on this player and team separation model in an unsupervised way, during a match, to split the players into two separate teams.

Łukasz Grad, Chief Data Scientist at ReSpo.Vision

If we can choose the best performing model, then we can save time because we would need fewer integrations to ensure high data quality. Customers are much happier because they receive higher quality data, enabling them to perform more detailed match analytics.

(…) If we know which models will be the best and how to choose the best parameters for them to run many pipelines, then we will just run fewer pipelines. This, in turn, will cause the compute time to be shorter, and then we save money by not running unnecessary pipelines that will deliver suboptimal results.

Wojtek Rosiński, Chief Technology Officer at ReSpo.Vision

Parameters of machine learning algorithms and how to compare them

Let’s dive right into analyzing and understanding how to compare the different characteristics of algorithms that can be used to sort and choose the best machine learning models. I divided the comparable parameters into two high-level categories:

  • development-based,
  • and production-based parameters.

Development-based parameters

Statistical tests

On a fundamental level, machine learning models are statistical equations that run at great speed on multiple data points to arrive at a conclusion. Therefore, conducting statistical tests on the algorithms is critical to set them right and also to understand if the model’s equation is the right fit for the dataset at hand. Here’s a handful of popular statistical tests that can be used to set the grounds for comparison:

  • Null hypothesis testing: null hypothesis testing is used to determine if the differences in two data samples or metric performances are statistically significant—meaning they reflect a true effect rather than random noise or coincidence. 
  • ANOVA (Analysis Of Variance): ANOVA is a statistical method used to determine whether there are significant differences between the means of three or more groups. For example, ANOVA can help reveal if different teaching methods result in different student scores or if all methods have similar effects. It uses one or more categorical independent variables (e.g., teaching method) to analyze their impact on a continuous dependent variable (e.g., student scores). Unlike Linear Discriminant Analysis (LDA), which is a classification technique, ANOVA focuses on comparing the means of the groups to assess variation.
  • Chi-Square: it’s a statistical test that assesses the likelihood of association or correlation between categorical variables by comparing the observed and expected frequencies in each category.
  • Student’s t-test: it compares the means of two samples from normal distributions when the standard deviation is unknown to determine if the differences are statistically significant.
  • Ten-fold cross-validation: 10-fold cross-validation compares the performance of each algorithm on the same folds of the dataset, configured with the same random seed to maintain uniformity in testing. Next, a hypothesis test such as the paired Student’s t-test can be applied to validate whether the differences in metrics between the two models are statistically significant (see the sketch below).
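
To make the last point concrete, here is a minimal sketch of comparing two classifiers with 10-fold cross-validation and a paired t-test. The dataset, the two model choices, and the 0.05 significance threshold are illustrative assumptions, not part of the original article.

```python
# Minimal sketch: compare two classifiers with 10-fold CV and a paired t-test.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=42)  # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)  # paired t-test over the folds
print(f"mean A={scores_a.mean():.3f}, mean B={scores_b.mean():.3f}, p={p_value:.3f}")
if p_value < 0.05:
    print("The difference in fold scores is statistically significant.")
else:
    print("The difference could plausibly be due to chance.")
```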

Model features and objectives

To choose the best machine learning model for a given dataset, it’s essential to consider the features or parameters of the model. The parameters and model objectives help to gauge the model’s flexibility, assumptions, and learning style.

When comparing linear regression models, we can choose between different ways to measure their errors. Some models try to minimize Mean Squared Error (MSE), while others aim to reduce Mean Absolute Error (MAE). The choice really comes down to how we want to handle outliers in our data.

If we have outliers in our dataset and want to consider them without letting them skew our results, using MAE makes more sense. The reason is pretty straightforward: MAE just takes the absolute value of errors, so it treats all deviations more evenly. MSE, on the other hand, squares the errors, which makes extreme values have a much bigger impact on the final model. So when we want outliers to matter but not take over, an MAE-based model tends to work better.
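
As a quick illustration of this difference, the following sketch uses made-up predictions, one of which is far off, to show how a single large error dominates MSE but barely moves MAE.

```python
# Minimal sketch of how MSE and MAE react to a single large ("outlier") error.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([3.1, 4.8, 7.2, 9.1, 20.0])  # last prediction is far off

print("MAE:", mean_absolute_error(y_true, y_pred))  # grows linearly with the large error
print("MSE:", mean_squared_error(y_true, y_pred))   # grows quadratically, dominated by it
```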

Similarly for classification, if two models (for example, decision tree and random forest) are considered, then the primary basis for comparison will be the degree of generalization that the model can achieve. A decision tree model with just one tree will have a limited ability to reduce variance through the max_depth parameter, whereas a random forest model will have an extended ability to bring generalization via both max_depth and n_estimators parameters. 
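
Here is a hedged sketch of that comparison: cross-validating a single decision tree against a random forest on a synthetic dataset. The dataset and hyperparameter values are arbitrary placeholders.

```python
# Sketch: generalization of a single tree vs. a random forest with the same depth.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(max_depth=8, random_state=0)
forest = RandomForestClassifier(max_depth=8, n_estimators=200, random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```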

Several other behavioural features of the model can be taken into account, like the type of assumptions made by the model, parametricity, speed, learning styles (tree-based vs non-tree-based), and more.

You can use parallel coordinates to see how different model parameters affect the metrics. Here’s what it looks like in Neptune:

See in the app
This parallel coordinates plot shows different models with various parameters and performance metrics. Each line represents a model, with vertical axes (from left to right) for model ID, creation time, monitoring time, validation accuracy, training accuracy, epochs, and max features. To choose the best model, we look for a balance between training and validation accuracy (i.e., a validation accuracy that is not much lower than the training accuracy). Based on these criteria, the best model is the one with ID 385.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Learning curves

Learning curves can help in determining if a model is on the correct learning trajectory of achieving the bias-variance tradeoff. It also provides a basis for comparing different machine learning models – a model with stable learning curves across both training and validation sets is likely going to perform well over a longer period on unseen data.

Bias is the error introduced by the simplifying assumptions a machine learning model makes to ease the learning process. Variance is the measure of how much the estimated target variable will change with a change in training data. In other words, it measures the sensitivity of the model to variations in the training set. The ultimate goal is to reduce both bias and variance to a minimum – a state of high stability with few assumptions.

Bias and variance are inversely related, so the closest we can get to minimizing both is the point where their trade-off balances out and the total error is lowest. One way to understand if a model has achieved a significant level of trade-off is to see if its performance across training and testing datasets is nearly similar.

Reaching the optimal trade-off: This figure illustrates the bias-variance trade-off, where increasing model complexity reduces bias but increases variance, and vice versa. The optimal point represents the model complexity that minimizes the total prediction error (mean squared error, MSE here), balancing both bias and variance for the best overall performance | Source

The best way to track the progress of model training is to use learning curves. These curves help to identify the optimal combinations of hyperparameters and assist massively in model selection and model evaluation. Typically, a learning curve is a way to track the learning or improvement in model performance on the y-axis and the time step on the x-axis.

The two most popular learning curves are:

  • Training learning curve – It effectively plots the evaluation metric score over time during a training process, thus helping to track the learning or progress of the model during training.
  • Validation learning curve – In this curve, the evaluation metric score is plotted against time on the validation set. 

Sometimes the training curve might show an improvement but the validation curve shows stunted performance. This may indicate that the model is overfitting and needs to be reverted to the previous iterations. In other words, the validation learning curve identifies how well the model is generalizing.

Therefore, there’s a trade-off between the training and validation learning curves. Model selection should focus on the point where the validation error stops decreasing or starts to increase, even as the training error continues to decrease. This “sweet spot” indicates the best balance between underfitting and overfitting.
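
Below is a minimal sketch of plotting training and validation learning curves with scikit-learn. Note that scikit-learn’s learning_curve varies the training set size rather than the number of epochs, and the dataset and estimator choices here are illustrative.

```python
# Sketch: training vs. validation learning curves (accuracy vs. training set size).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```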

Here’s an example of comparing learning curves in Neptune using the charts view:

See in the app
Training and validation loss comparison in a Neptune dashboard for multiple experiments. The model with the pink curve might look like the best one, but to come to a concrete decision, all models must be trained longer and ideally over the same number of epochs, preferably at least 50 epochs.

Loss functions and metrics

Often there’s confusion between loss functions and metric functions. Loss functions are used for model optimization or model tuning, whereas metric functions are used for model evaluation and selection. However, since there is no notion of “accuracy” for regression, regression loss functions such as MSE and MAE double as evaluation metrics in addition to being optimization targets.

Loss functions are passed as arguments to the models so that the models can be tuned to minimize the loss function. The loss function imposes a high penalty when the model makes an incorrect prediction.

1. Loss functions and metrics for regression:

  • Mean Square Error (MSE): MSE calculates the average of the squared differences between the predicted and actual values, penalizing larger errors more heavily. While this can make the model more sensitive to outliers, it may also increase the risk of overfitting, as the model might try to fit these outliers too closely. MSE is useful for detecting overfitting by comparing training and validation values: if the training MSE is low but the validation/test MSE is high, the model may be overfitting and not generalizing well.
  • Mean Absolute Error (MAE): it’s the absolute difference between the estimated value and the true value. It decreases the weight of outliers, unlike MSE. MAE treats all errors equally (so it is less sensitive to outliers), giving a more robust measure of overall model performance when outliers are present in the data.
  • Smooth Absolute Error (also known as Huber or Smooth L1 loss): it uses the squared difference between the estimated and true values for predictions lying close to the real value, and the absolute difference for points far off from the true values (outliers), which limits their influence. Essentially, it’s a combination of MSE and MAE.
See in the app
Comparison of Mean absolute error (MAE) for multiple experiments in a Neptune dashboard.

2. Loss functions for classification:

  • 0-1 loss function: this is equivalent to counting the number of misclassified samples. It assigns a loss of 1 for each misclassified sample and 0 for each correctly classified one, so it can be read directly off a confusion matrix, which shows the number of misclassifications and correct classifications. The solution with the greatest number of correct classifications receives the smallest loss.
  • Hinge loss function (L2 regularized): the hinge loss is used for maximum-margin classification, most notably used for support vector machines (SVMs). It penalizes points that fall within the margin or are misclassified on the wrong side of the decision boundary. The margin represents a region around the decision boundary that ideally remains free of data points to ensure better separation between classes. Hinge loss encourages a larger margin by penalizing points close to or within this region, improving class separation.
  • Logistic Loss: this function displays a similar convergence rate to the hinge loss function, and since it’s continuous (unlike Hinge Loss), gradient descent methods can be used. However, the logistic loss function doesn’t assign zero penalties to any point. Instead, functions that correctly classify points with high confidence are less penalized. This structure leads the logistic loss function to be sensitive to outliers in the data.
  • Cross entropy/log loss: measures the performance of a classification model whose output is a probability value between 0 and 1. It quantifies the difference between the true label and the predicted probability distribution. Cross-entropy loss increases as the predicted probability diverges from the actual label.

Several other loss functions can be used to optimize machine learning models. The aforementioned ones are essential to build the foundation for model design.
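
For reference, here is a small sketch of how the 0-1, hinge, and log losses can be computed with scikit-learn on toy predictions; the labels, decision values, and probabilities below are made up for illustration.

```python
# Minimal sketch of 0-1 loss, hinge loss, and log loss on toy predictions.
import numpy as np
from sklearn.metrics import hinge_loss, log_loss, zero_one_loss

y_true = np.array([1, 0, 1, 1, 0])

# Zero-one loss: fraction of misclassified hard labels.
y_pred_labels = np.array([1, 0, 0, 1, 0])
print("0-1 loss :", zero_one_loss(y_true, y_pred_labels))

# Hinge loss: expects margin-style decision values (e.g., an SVM's decision_function).
decision_values = np.array([1.2, -0.7, -0.3, 2.1, -1.5])
print("hinge    :", hinge_loss(y_true, decision_values))

# Log loss / cross-entropy: expects predicted probabilities for the positive class.
y_pred_proba = np.array([0.9, 0.2, 0.4, 0.8, 0.1])
print("log loss :", log_loss(y_true, y_pred_proba))
```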

3. Metrics for classification:

For every classification model prediction, a matrix called the confusion matrix can be constructed which demonstrates the number of test cases correctly and incorrectly classified. It looks something like this (considering that 1 – Positive and 0 – Negative are the target classes):

              Actual 0                Actual 1
Predicted 0   True Negatives (TN)     False Negatives (FN)
Predicted 1   False Positives (FP)    True Positives (TP)

  • TN: Number of negative cases correctly classified
  • TP: Number of positive cases correctly classified
  • FN: Number of positive cases incorrectly classified as negative
  • FP: Number of negative cases incorrectly classified as positive

4. Accuracy

Accuracy is the simplest metric that can be derived from a confusion matrix and can be defined as the number of test cases correctly classified divided by the total number of test cases.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets. For instance, if we’re detecting fraud in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate just by predicting all the test cases as non-fraud. 

This is not desirable since for new samples, the model would not generalize well (failing to detect fraud in new data). If we want to detect fraudulent cases with more precision, we need to use more appropriate metrics.
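
Here is a tiny sketch of that pitfall, using a made-up 1:99 label distribution and a dummy “always predict non-fraud” model:

```python
# Sketch: accuracy is misleading on an imbalanced fraud dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)   # 10 fraud cases out of 1000
y_pred = np.zeros_like(y_true)            # a model that always predicts "non-fraud"

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.99, looks great
print("recall  :", recall_score(y_true, y_pred))    # 0.0, catches no fraud at all
```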

5. Precision

Precision is the metric that measures how many of the samples predicted as positive are actually positive.

Precision = TP / (TP + FP)

Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The greater the fraction, the higher the precision, which means the better ability of the model to correctly classify the positive class. This is a better choice if we face an imbalanced dataset.

6. Recall

Recall tells us the number of positive cases correctly identified out of the total number of positive cases. 

Recall = TP / (TP + FN)

7. F1 Score

The F1 score is the harmonic mean of Recall and Precision, therefore it balances out the strengths of each. It’s useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.

F1 Score = 2 * (precision * recall) / (precision + recall)

8. AUC-ROC

The ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance. If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.

ROC curve illustrating the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) for a binary classifier. The Area Under the Curve (AUC) represents the classifier’s ability to distinguish between classes, with a higher AUC indicating better performance. The dashed line represents random guessing, while the curve above it shows model performance. | Source
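
As a reference, here is a minimal sketch of computing these classification metrics for a fitted binary classifier; the dataset and model are illustrative choices.

```python
# Sketch: main classification metrics for a fitted binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_proba = model.predict_proba(X_te)[:, 1]  # probabilities for the positive class

print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1       :", f1_score(y_te, y_pred))
print("ROC AUC  :", roc_auc_score(y_te, y_proba))
```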

Again, you can compare each of those metrics in Neptune.

That’s what they do at Hypefactors, a media intelligence company:

We use Neptune for most of our tracking tasks, from experiment tracking to uploading the artifacts. A very useful part of tracking was monitoring the metrics, now we could easily see and compare those F-scores and other metrics.

Andrea Duque, Data Scientist at Hypefactors

You can use the charts view (like you could see in the examples I provided earlier). But you can also do it in a side-by-side table format. Here’s what it looks like:

See in the app
Side-by-side comparison of multiple experiments in terms of major classification metrics.

In general, whatever metadata you log to Neptune, you’ll most likely be able to compare it. Apart from metrics and parameter comparisons, which you could see in this article, the same applies to logged images or dataset artifacts.

Here’s an overview of comparison options in the Neptune docs.

Production-based parameters

So far, we have looked at the comparable model features that matter most in the development phase. Let’s now dive into a few production-centric features that affect deployment and processing time.

Time complexity

Depending on the use case, the decision to choose a model can be primarily focused on the time complexity. For example, for a real-time solution, it’s best to avoid the KNN classifier since it calculates the distance of new data points from the training points at the time of prediction which makes it slow. However, for solutions that require batch processing, a slow predictor is not a big issue.

Note that the time complexities might differ during the training and testing phases given the chosen model. For example, a decision tree has to estimate the decision points during training, whereas during prediction the model has to simply apply the conditions already available at the pre-decided decision points. So, if the solution requires frequent retraining (e.g. in a time series solution), choosing a model that has speed during both training and testing will be the way to go.
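
Here is a rough sketch of measuring prediction latency for a KNN classifier versus a decision tree on a synthetic dataset; the dataset size and default model settings are arbitrary, and real timings will vary by hardware.

```python
# Sketch: rough prediction-time comparison of KNN vs. a decision tree.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)

for name, model in [("knn", KNeighborsClassifier()), ("tree", DecisionTreeClassifier())]:
    model.fit(X, y)
    start = time.perf_counter()
    model.predict(X)  # KNN pays the distance-computation cost here, the tree does not
    print(f"{name}: predicted 20k rows in {time.perf_counter() - start:.3f}s")
```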

Space complexity

Citing the above example of KNN, every time the model needs to predict, it has to load the entire training data into the memory to compare distances. If the training data is sizable, this can become an expensive drain on the company’s resources (such as RAM allotted) for the particular solution or storage space. The RAM should always have enough room for processing and computation functions. Loading an overwhelming amount of data can be detrimental to the solution’s speed and processing capabilities.

Here, too, Neptune proves to be useful.

Related

Andreas Malekos, Head of AI at Continuum Industries, says:

The ability to compare runs on the same graph was the killer feature, and being able to monitor production runs was an unexpected win that has proved invaluable.

Read the full case study

You can monitor the resource usage of each experiment:

See in the app
The image shows CPU and memory usage of a reinforcement learning experiment at each time step.

You can also create a comparison of resources for multiple experiments:

See in the app
Comparing CPU and memory usage of four different reinforcement learning algorithms.

What next?

There are plenty of techniques to gauge and compare the effectiveness of different machine learning models. However, the most important but often ignored requirement is to track the relevant parameters consistently across experiments, ensuring that the results are reliable and that pipelines can be easily reproduced.

In this article, we learned a few popular methods for comparing machine learning models, but the list of these methods is much bigger. If you didn’t find the perfect method for your project here, don’t stop now—there’s plenty more to explore!

Data Lineage in Machine Learning: Methods and Best Practices
https://neptune.ai/blog/data-lineage-in-machine-learning

Data is supposed to be an organization’s most treasured asset. However, it wasn’t this way until recently, so very few people have experience in handling data and leveraging it to create more value.

As managers are becoming more data-fluent, many organizations are adopting the practice of tracking data lineage, which has become steady support for driving organizations towards data efficiency.

What is data lineage?

Data lineage is the story behind the data. It tracks the data from its creation point to the points of consumption. This pipeline of dataflow involves input and output points, transformation and modeling processes the data has undergone, record analysis, visualizations, and several other processes which are constantly tracked and updated.

The objective of data lineage is to observe the entire lifecycle of data such that the pipeline can be upgraded and leveraged for optimal performance.

Data Lineage
Source: Dremio

Data lineage vs. data provenance 

Data lineage is often confused with data provenance, as the difference is quite subtle and easy to miss. Both have the common objective to serve optimization through tracking and observation, but on a high level, data lineage is a subset of data provenance.

While data lineage focuses specifically on the data by tracking its journey including the origins, destinations, transformations, and processes it has undergone, data provenance additionally tracks all the systems and processes that play a part in influencing the data.

So, Data Lineage takes care of data about the data (metadata), while Data Provenance takes care of the information about the processes that influence the data.

Why is data lineage necessary?

Knowledge about the journey of data and the circumstances it has been through offers a lot of control over the present status of the data, which can be used to navigate the optimal routes for data solutions.

However, what makes data lineage absolutely necessary is the growing competition in data efficiency and expertise across organizations and industries. A few years ago, smart data or statistical solutions were good-to-have insights that offered an organization a competitive edge. Today, however, most organizations want to optimize their data assets, and control and knowledge of the data is gradually becoming a competitive advantage.

Here are a few reasons to consider Data Lineage as one of the key contributors to data efficiency:

  • Data gathering

Data is a dynamic asset that keeps evolving, and relying on static data can be harmful to business decisions and outcomes. Constantly gathering, updating, and validating data is a critical process. Data lineage can be very useful in the process of upgrading data especially because of its tracking abilities. Upgrading data is not just about fetching new data, but also about understanding the relevance of older datasets. Data Lineage can help in combining new data with older relevant data such that the data consumers like developers, business teams, and stakeholders can derive the maximum value from the data assets.

  • Data governance

The metadata that gets recorded due to data lineage automatically offers details that can be used in compliance audits or can help to improve the security of the data pipeline. It also helps to understand the structure and process of dataflow and allows ample room for improvisations. Tracking the metadata consistently also reduces the technical debt and, therefore, the overall cost of risk management and compliance.

  • Standardized migration

In practical scenarios, there is a frequent need for data migration from one source to another. This implies that during migration several details such as location, formatting, storage, bandwidth, security concerns, etc. must be known beforehand. Data lineage is perfect for such a setting since it provides in-depth details almost instantly and saves both cost and time. With the available set of metadata, it’s also possible to automate and standardize migration processes depending on the parameters of the destination and source systems.

  • Rich business insights

Data lineage is essential to maintain the integrity of data which is crucial to businesses. By constantly tracking details about the data, it’s possible to instantly update and alert users in case of data discrepancy, irrelevance, or staleness. Several departments such as development, sales, marketing, and business operations depend on data to improve their process. Fresh and healthy data can bring immense speed and value to business decisions.

Who benefits from data lineage and how?

Since several departments in an organization rely on Data Lineage, let’s get a closer view of the dependents to understand the need for Data Lineage further.

  • ETL Users/Developers

Every organization that deals with data has an ETL department that looks after the Extract, Transform, and Load process. It’s the process of extracting data from a source, applying required transformations, and then sharing it with the destination system. ETL developers deal with heavy volumes of data, and data lineage comes in handy to detect bugs in the ETL process, and to create detailed reports of the transfer.

  • Security Teams

Security experts and developers are often devising ways to fool-proof the data pipeline’s vulnerable end points. Data lineage brings critical information on such endpoints consistently, which helps in experimentation through permutations and combinations. Records on vulnerabilities, fatalities, and processes that caused them, offer more scope for improvements in security.

  • Business Teams

Business teams work with multiple reports, and with the help of data lineage, they can easily navigate to the source of the reports and validate the data whenever necessary.

  • Data Stewards

A data steward is an individual responsible for the quality of an organization’s data asset(s). Therefore, a data steward is expected to know the ins and outs of the data he/she is governing, and data lineage makes this process much more accurate, transparent, and user-friendly.

Methods of data lineage

Here are a few ways to perform Data Lineage tracing:

  • Lineage through Data Tagging

This method works on the assumption that a transformation tool is consistently involved with the data and that it tags the data after executing transformations. To track the tags, it’s important to know the format of the tag so that it can be spotted across the pipelines. This method is only reliable in closed systems where only the known transformation tool is deployed.

  • Self-contained Lineage

This is when lineage is traced in a closed environment that is completely controlled by the organization. Almost every type of data support (data lakes, storage tools, data management, and processing logic) is a part of the environment. So, the lineage is also restricted to the boundaries of the data environment and is unaware of the processes outside. 

  • Parsing

Data lineage through parsing reads the code or the transformation logic to understand how the data came to its current state. On processing the transformation logic, it has the ability to trace back to previous states and complete the end-to-end lineage tracing. Parsing-based lineage is not technology agnostic since it needs to understand the coding language and transformation tools deployed on the data. This directly imposes restrictions on the flexibility of the process, in spite of it being one of the most advanced lineage-tracing techniques.

  • Pattern-based lineage

This method of data lineage doesn’t work with the code that’s responsible for data transformations. Instead, the process only observes the data and looks for patterns to trace the lineage, making it completely technology- or algorithm-agnostic. This is not a very reliable method, because it tends to lose out on patterns that are deep-rooted in the code.

Data lineage across the pipeline

To capture end-to-end lineage, data has to be tracked across every stage and process in the data pipeline. Here are the stages across which data lineage is performed:

  • Data Gathering Stage

The data gathering or ingestion stage is where data enters the core system. Data lineage can be used to track the vitals of the source and destination systems to validate the accuracy of the data, mappings, and transformations. Tracking the systems closely also helps in the easier identification of bugs.

  • Data Processing Stage

Data processing takes up a huge percentage of the entire process of creating data solutions. It involves multiple transformations, filters, data types, tables, and storage locations. Recording metadata from each step doesn’t just help in compliance and production speed but also makes the development process richer and more productive. It enables developers to analyze the causes behind the success or failures of processes in higher detail.

  • Data Storing and Access Stage

Organizations usually deploy large data lakes to store their data. Data lineage can be used to track the access permissions, vitals of endpoints, and data transactions. This will increase the degree of automation of security and compliance, which is a huge bonus given the size and complexity of data lakes.

  • Data Querying Stage

Users raise multiple data queries with a range of functions like joins and filters. Some functions can be heavy on the processors and therefore, less efficient. Data lineage can observe the queries to track and validate the processes and different versions of data resulting from them. Meanwhile, it also helps in optimizing the queries and provides reports including instances of optimal solutions. 

Best practices of data lineage

Data lineage is an evolving discipline and the processes are improving at great speed. Here are a few fundamental best practices that can keep the momentum going:

  • Automation

The general practice of organizations until now has been to record lineage manually. Given the dynamic and fast-paced nature of production, manual tracking is no longer feasible. Best-in-class data catalogs are also recommended to boost automation. They integrate AI and ML to combine metadata sourced from multiple systems into a logical flow of lineage, and they can also extract and form conclusions from the metadata.

  • Metadata validation

Data is always susceptible to errors, which is why it’s important to include the owners of different processes and tools in lineage tracing. The owners are closest to and most aware of the details generated by their applications and can be resourceful in pointing out the bugs or errors in the records or processes.

  • Inclusion of metadata source

Including the data generated by the different processes that process, transform or transfer the data, is vital to tracing data lineage most accurately. Therefore, metadata that is created by these processes on the data should be pulled into lineage tracking.

  • Progressive extraction and validation

To map the lineage most accurately, it’s recommended to record metadata consecutively as per the stages of the data pipeline. This creates a well-defined timeline and organizes the huge log of metadata in a much more readable format. Progressive validation of this data also becomes easier so that the high-level connections can be verified first. Once they’re clear, the deeper intricacies can be validated level-wise. The progressive approach maintains a logical pattern and reduces errors while reading or extracting the data.

Data lineage tools

Data lineage, even though a relatively new discipline, has been evolving in the background over the years. Initially, the tools were all version control systems, which eventually expanded into the larger discipline of data lineage. Let’s take a tour through the generations of version control tools to understand how they gave us data lineage.

Generations of Data Lineage Platforms:

  • 1st Generation: The 1st generation version tracking was primitive, entirely manual and mostly managed by one person who had access to the documents via “locks”.
  • 2nd Generation: The 2nd generation was a huge improvement since it allowed social collaboration, enabling multiple users, usually in-house, to work on the code. The main drawback was that it was inefficient with code merges and required developers to merge the code externally before making the final commits.
  • 3rd Generation: The 3rd generation worked on the drawbacks of 2nd generation tools and allowed developers, not just in-house but across the globe, to collaborate and merge their respective versions with differences at later stages after commits. The world-wide network and easy-merge facilities allowed huge scaling abilities, especially in the open-source community.
  • 4th Generation: The final and present generation of version control is a part of data lineage platforms like Pachyderm. It improves on the 3rd generation by bringing version control to the rather open-ended process that goes into the production of AI solutions. The job of the 4th generation is to keep track of all processes and tools involved in the system, such as the cloud, the storage, the data versions, the algorithms, and much more, while maintaining immutability. Overall, it successfully tracks the end-to-end flow of the data pipeline.

Tools/Platforms for Data Lineage:

Here are a few popular picks for Data Lineage tools.

Talend data catalog is a one-stop source for getting details on the processes that have acted on your data. It can search, extract, govern, and secure the metadata from multiple sources and has the ability to automatically crawl the sources.

IBM DataStage combines analytics, cloud environments, governance, and DataOps in a single platform built for AI development. It delivers high-quality data through a container-based architecture.

Datameer offers a visual no-code experience for building data pipelines and allows collaboration for data experts to discover, access, model, and transfer data. The high-quality user experience along with great tech support makes Datameer a strong contender.

Neptune is an experiment tracker designed with a strong focus on collaboration and scalability. It allows users to monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye. The tool is known for its user-friendly interface and flexibility, enabling teams to adopt it into their existing workflows with minimal disruption. Neptune gives users a lot of freedom when defining data structures and tracking metadata.

Future of data lineage

Several new disciplines have started growing during the last few years to such an extent that their impact has touched the boundaries of several industries. Technologies such as 5G, edge computing, Internet of Things, and of course artificial intelligence, are set to generate loads of invaluable data. To leverage this volume, a well-defined tracking system is the need of the future, and the foundation of that infrastructure is the responsibility of the present.

Once these technologies expand further along with the cloud, the data is bound to get exposed to external agents like physical systems connected to IoT, edge servers, and the cloud in general. Transformations on data will take place at far-off and disconnected locations which are often vulnerable endpoints that have the potential to put the data pipeline in jeopardy. 

Data lineage will become a competitive advantage for early adopters by securely governing the huge data landscape that is growing rapidly even today. With lineage in place, it will be much easier to track errors and vulnerabilities so that the systems responsible can be upgraded quickly. Furthermore, the extended benefits of adopting data lineage, like reduced cost, high scalability, compliance, and high data and process quality, are expected to become must-have benefits for data-based industries.

A Comprehensive Guide to Data Preprocessing
https://neptune.ai/blog/data-preprocessing-guide

According to the 2020 International Data Corporation’s forecast, 59 zettabytes of data would have been created, consumed, captured and copied in 2020. This forecast gets more interesting when we go back in time to 2012, and find that IDC forecasted the digital universe to reach only 40 zettabytes by 2020, and not in 2020 alone.

This wide gap between actual and forecasted numbers has its reasons, the biggest being the COVID-19 pandemic. Global quarantine sent everyone online, and data generation spiked enormously. Not surprisingly, the years 2020 to 2024 have already been dubbed the years of the COVID-19 Data Bump. This calls for people, processes, and technology that can manage and optimize the data to get profitable insights, and get them fast.

Skills like data management and data processing are becoming extremely valuable, which is why this article is a comprehensive guide to data preprocessing steps and techniques to get you going in this new world.

What is data preprocessing?

A considerable chunk of any data-related project is about data preprocessing, and data scientists spend around 80% of their time preparing and managing data. Data preprocessing is the method of analyzing, filtering, transforming, and encoding data so that a machine learning algorithm can understand and work with the processed output.

Why is data preprocessing necessary?

Algorithms that learn from data are simply statistical equations operating on values from the database. So, as the popular saying goes, “if garbage goes in, garbage comes out”. Your data project can only be successful if the data going into the machines is high quality. 

In data extracted from real-world scenarios, there’s always noise and missing values. This happens due to manual errors, unexpected events, technical issues, or a variety of other obstacles. Incomplete and noisy data can’t be consumed by algorithms, because they’re usually not designed to handle missing values, and the noise causes disruption in the true pattern of the sample. Data preprocessing aims to solve these problems by thorough treatment of the data at hand.

How to go about data preprocessing?

Before we get into the details of step by step data-processing, let’s take a look at a few tools and libraries that can be used to execute and make data processing much more manageable!

Tools and libraries

Data preprocessing steps can be simplified through tools and libraries that make the process easier to manage and execute. Without these libraries, what would otherwise be one-liner solutions can take hours of coding to develop and optimize.

Data Preprocessing with Python: Python is a programming language that supports countless open-source libraries that can compute complex operations with a single line of code. For instance, for the smart imputation of missing values, one need only use scikit-learn’s impute module. Or, for scaling datasets, just call the MinMaxScaler function from the preprocessing module. There are countless data preprocessing functions available in the preprocessing module (see the sketch below).
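
For illustration, here is a minimal sketch of those two one-liners on a made-up feature matrix:

```python
# Sketch: impute missing values, then scale the result to a chosen range.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, np.nan], [np.nan, 400.0], [4.0, 800.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)              # fill NaNs with column means
X_scaled = MinMaxScaler(feature_range=(0, 5)).fit_transform(X_imputed)   # scale each column to [0, 5]
print(X_scaled)
```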

Automunge is a great tool, built as a Python library, that prepares tabular data for the direct application of machine learning algorithms.

Data Preprocessing with R: R is a language and environment mostly used for research and academic purposes. It has multiple packages, analogous to Python libraries, that considerably support data preprocessing steps.

Data Preprocessing with Weka: Weka is a software suite that supports data mining and data preprocessing through built-in data preprocessing tools and machine learning models for intelligent mining.

Data Preprocessing with RapidMiner: Similar to Weka, RapidMiner is an open source software that has various efficient tools for supporting data preprocessing.

Now that we have the appropriate tools to support multiple functions, let’s dive deep into the data preprocessing steps.

Purpose of data preprocessing

After you have properly gathered the data, it needs to be explored, or assessed, to spot key trends and inconsistencies. The main goals of Data Quality Assessment are:

  • Get data overview: understand the data formats and overall structure in which the data is stored. Also, find the properties of data like mean, median, standard quantiles and standard deviation. These details can help identify irregularities in the data.
  • Identify missing data: missing data is common in most real-world datasets. It can disrupt true data patterns, and even lead to more data loss when entire rows and columns are removed because of a few missing cells in the dataset. 
  • Identify outliers or anomalous data: some data points fall far out of the predominant data patterns. These points are outliers, and might need to be discarded to get predictions with higher accuracies, unless the primary purpose of the algorithm is to detect anomalies. 
  • Remove inconsistencies: just like missing values, real-world data also has multiple inconsistencies like incorrect spellings, incorrectly populated columns and rows (e.g., salary values populated in the gender column), duplicated data, and much more. Sometimes these inconsistencies can be treated through automation, but most often they need a manual check-up.

Below are some popular data pre-processing techniques that can help you meet the above goals:

Handling missing values

Missing values are a recurrent problem in real-world datasets because real-life data has physical and manual limitations. For example, if data is captured by sensors from a particular source, the sensor might stop working for a while, leading to missing data. Similarly, different datasets have different issues that cause missing data points. 

We need to handle these missing values to optimally leverage available data. These are some tried and tested ways:

  • Drop samples with missing values: this is instrumental when both the number of samples is high, and the count of missing values in one row/sample is high. This is not a recommended solution for other cases, since it leads to heavy data loss.
  • Replace missing values with zero: sometimes this technique works for basic datasets, since the data in question assumes zero as a base number, signifying that the value is absent. However, in most cases, zero can signify a value in itself. For example, if a sensor generates temperature values and the dataset belongs to a tropical region. Similarly, in most cases, if missing values are populated with 0, then it would be misleading to the model. 0 can be used as replacement only when the dataset is independent of its effect. For example, in phone bill data, a missing value in the billed amount column can be replaced by zero, since it might indicate that the user didn’t subscribe to the plan that month. 
  • Replace missing value with mean, median or mode: you can deal with the above problem, resulting from using 0 incorrectly, by using statistical functions like mean, median or mode as a replacement for missing values. Even though they’re also assumptions, these values make more sense and are closer approximations when compared to one single value like 0.
  • Interpolate the missing values: interpolation helps to generate values inside a range based on a given step size. For instance, if there are 9 missing values in a column between cells with values 0 and 10, interpolation will populate the missing cells with numbers from 1 to 9. Understandably, the dataset needs to be sorted according to a more reliable variable (like the serial number) before interpolation.
  • Extrapolate missing values: extrapolation helps to populate values which are beyond a given range, like the extreme values of a feature. Extrapolation takes the help of another variable (usually the target variable) to compare the variable in question and populate it with a guided reference. 
  • Build a model with other features to predict the missing values: by far the most intuitive of all techniques we’ve mentioned. Here, an algorithm studies all the variables except the actual target variable (since that would lead to data leakage). The target variable for this algorithm becomes the feature with missing values. The model, if well trained, can predict the missing points and provide the closest approximations.
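
A short sketch of a few of the strategies above using pandas; the column names and the choice of which strategy fits which column are illustrative assumptions.

```python
# Sketch: filling and interpolating missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": [21.0, np.nan, 23.5, np.nan, 25.0],
                   "billed_amount": [30.0, np.nan, 45.0, 50.0, np.nan]})

df["billed_amount"] = df["billed_amount"].fillna(0)   # zero where absence genuinely means "no bill"
df["temperature"] = df["temperature"].interpolate()   # interpolate between neighboring readings
# Alternative: df["temperature"].fillna(df["temperature"].median())  # median replacement
print(df)
```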

Scaling

Different columns can be present in different ranges. For example, there can be a column with a unit of distance, and another with the unit of a currency. These two columns will have starkly different ranges, making it difficult for any machine learning model to reach an optimal computation state.

In more technical terms, if one considers using Gradient Descent, it will take longer for the gradient descent algorithm to converge, since it has to process different ranges that are far apart. The same is demonstrated in the figure below.

Gradient Descent Convergence

The diagram on the left has scaled features. This means that features are brought down to values which are comparable with one another, so the optimization function doesn’t have to take major leaps to reach the optimal point. Scaling is not necessary for algorithms (like the decision tree) which are not distance-based. Distance-based models, however, must have scaled features without exception.

Some popular scaling techniques are:

  • Min-Max Scaler: min-max scaler shrinks the feature values between any range of choice. For example, between 0 and 5.
  • Standard Scaler: a standard scaler assumes that the variable is normally distributed and then scales it down so that the standard deviation is 1 and the distribution is centered at 0.
  • Robust Scaler: robust scaler works best when there are outliers in the dataset. It scales the data with respect to the inter-quartile range after removing the median.
  • Max-Abs Scaler: similar to min-max scaler, but instead of a given range, the feature is scaled to its maximum absolute value. The sparsity of the data is preserved since it does not center the data.

Scikit-learn’s preprocessing module provides one-liner solutions for all the above scaling methods, as sketched below.
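
Here is a small sketch of applying the four scalers above to the same toy column; the outlier value is there to show how differently they react.

```python
# Sketch: MinMax, Standard, Robust, and MaxAbs scaling of one toy column.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # the last value is an outlier

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler(), MaxAbsScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
```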

Check also

Comparing Tools For Data Processing Pipelines

Outlier treatment

Outliers are data points that do not conform with the predominant pattern observed in the data. They can cause disruptions in the predictions by taking the calculations off the actual pattern.

Outliers can be detected and treated with the help of box plots. Box plots are used to identify the median, the interquartile range, and outliers. To remove outliers, note the maximum and minimum of the whisker range and filter the variable accordingly (see the sketch below).
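
A minimal sketch of that filtering step, using the usual 1.5 × IQR box-plot convention on made-up values:

```python
# Sketch: filtering outliers with the interquartile range (1.5 * IQR rule).
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 17, 18, 95])  # 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
filtered = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]
print(filtered.tolist())
```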

Feature encoding

Sometimes, data is in a format that can’t be processed by machines. For instance, a column with string values, like names, will mean nothing to a model that depends only on numbers. So, we need to process the data to help the model interpret it. This method is called categorical encoding. There are multiple ways in which you can encode categories. Here are a few to get you started:

Classic encoders

  • Label/Ordinal Encoding: assigns values from 1 to n in an ordinal (sequential) manner, where ‘n’ is the number of unique categories in the column. If a column has 3 city names, label encoding will assign values 1, 2 and 3 to the different cities. This method is not recommended when the categorical values have no inherent order, like cities, but it works well with ordered categories, like student grades.
  • One hot encoding: when data has no inherent order, you can use one hot encoding. One hot encoding generates one column for every category and assigns a positive value (1) in whichever row that category is present, and 0 when it’s absent. The disadvantage is that multiple features get generated from one feature, making the data bulky. This is not a problem when the column doesn’t have too many categories.
  • Binary Encoding: this solves the bulkiness of one hot encoding. Every categorical value gets converted to its binary representation, and for each binary digit a new column is created. This compresses the number of columns compared to one hot encoding. With 100 values in a categorical column, one hot encoding will create 100 (or 99) new columns, whereas binary encoding needs only about ceil(log2(100)) = 7 columns (see the sketch after this list).
  • BaseN Encoding: this is similar to binary encoding, with the only difference of base. Instead of base 2 as with binary, any other base can be used for baseN encoding. The higher the base number, the higher the information loss, but the encoder’s compression power will also keep increasing. A fair trade-off.
  • Hashing: hashing means generating values from a category with the use of mathematical functions. It’s like one hot encoding (with a true/false function), but with a more complex function and fewer dimensions. There is some information loss in hashing due to collisions of resulting values.
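
Here is a small sketch of the classic encoders on a made-up city column. The label and one-hot steps use scikit-learn and pandas; the binary step assumes the optional category_encoders package is installed, so it is left commented out.

```python
# Sketch: label encoding and one-hot encoding of a small categorical column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Paris", "Delhi", "Tokyo", "Delhi"]})

df["city_label"] = LabelEncoder().fit_transform(df["city"])   # integer codes (0 to n-1 in scikit-learn)
one_hot = pd.get_dummies(df["city"], prefix="city")           # one column per category
print(df.join(one_hot))

# Binary encoding, if the category_encoders package is available:
# import category_encoders as ce
# print(ce.BinaryEncoder(cols=["city"]).fit_transform(df[["city"]]))
```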

Bayesian encoders

These encoders borrow information from the target variable and map them to the categories in a column. This technique is of great use when the number of categories is significantly high. In such cases, if classic encoders are used, it will increase the dimensionality many-fold and trouble the model unnecessarily. When the number of features is extremely high, a problem called the curse of dimensionality reduces the efficiency of machine learning models since they’re still not adept enough to handle large volumes of features.

  • Target Encoding: the mean of only those rows in the target variable that correspond to a certain category in the feature is mapped to that category. Data leakage and overfitting issues must be taken care of by keeping the test data separate.
  • Weight of Evidence Encoding: weight of Evidence (WoE) is the measure of the extent to which an evidence (or a value) supports or negates a presumed hypothesis. WoE is usually used to encode continuous variables with the binning technique. 
  • Leave One Out Encoding: similar to target encoding, but it leaves the value of the current sample while calculating the mean. This helps in avoiding outliers and anomalous data.
  • James-Stein Encoding: takes the weighted average of the corresponding target means along with the mean of the entire target variable. This helps to reduce both overfitting and underfitting. The weights are decided based on the estimated variance of values. If the variance is high for the values which make up a mean, it will indicate that that particular mean is not very reliable. 
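
A minimal sketch of target encoding computed by hand with pandas on made-up data; for brevity it ignores the train/test split, which in practice is essential to avoid leakage.

```python
# Sketch: target encoding = mean of the target per category, mapped back onto the column.
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Delhi", "Paris", "Tokyo", "Delhi", "Tokyo"],
                   "target": [1, 0, 1, 0, 1, 1]})

category_means = df.groupby("city")["target"].mean()   # mean target per category
df["city_encoded"] = df["city"].map(category_means)
print(df)
```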

Feature creation and aggregation

New features can be created from raw features. For example, if two features called ‘total time’ and ‘total distance’ are available, you can create the feature of speed. This gives a new perspective to the model, which can now detect a logical relation between speed and the target variable. 

Similarly, you can do this to build other intuitive features, like kilometers binned on the basis of weekdays and weekends, or speed during rush hour. In case of deep learning models, neural network layers can identify complex relationships between raw features, so we don’t need to feed formula-based features to DL models.

Similarly, features can be appropriately aggregated to reduce data bulk, and also to create relevant information. For instance, in a time series model for rain forecast, the data has to be aggregated based on the day so that the total measure of rain per day can be assessed. Several records of rain measurements throughout the day will not add much value to the time series model.
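
A small pandas sketch of both ideas, with made-up column names:

```python
import pandas as pd

trips = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 17:30", "2023-01-02 09:15"]),
    "total_distance_km": [12.0, 30.0, 8.5],
    "total_time_h": [0.5, 1.2, 0.4],
})

# Feature creation: derive speed from distance and time
trips["speed_kmh"] = trips["total_distance_km"] / trips["total_time_h"]

# Feature aggregation: collapse several records into one value per day
daily_distance = trips.set_index("timestamp").resample("D")["total_distance_km"].sum()
print(daily_distance)
```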

Dimensionality reduction

Machine learning models struggle to handle a very large number of features – the spirit of “garbage in, garbage out” applies to irrelevant features as much as to bad data. In their paper, “An Introduction to Variable and Feature Selection,” Guyon and Elisseeff write:

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

Using irrelevant and redundant features will make the model unnecessarily complex, and might even lower prediction scores. The main advantages of feature selection are:

  • Reduced processing time: the smaller the volume of data, the faster the training and prediction. More features mean more data for the model to learn from, and increased learning time. 
  • Improved Accuracy: when a model has no irrelevant variables to consider, misleading or noisy factors can’t distort the model’s decisions, so its score improves.
  • Reduced Overfitting: the lower the number of irrelevant variables and redundant data, the lesser the propagation of noise across the model’s decisions.

Feature selection can be conducted on many levels. Primarily, feature selection techniques can be segregated into Univariate and Multivariate techniques.

Univariate selection

In every technique which comes under the umbrella of Univariate Selection, every feature is individually studied and the relationship it shares with the target variable is taken into account. Here are a few Univariate feature selection techniques:

Variance

Variance is the measure of change in a given feature. For example, if all the samples in a feature have the same values, it would mean that the variance of that feature is zero. It’s essential to understand that a column which doesn’t have enough variance is as good as a column with all ‘nan’ or missing values. If there’s no change in the feature, it’s impossible to derive any pattern from it. So, we check the variance and eliminate any feature that shows low or no variation.

Variance thresholds might be a good way to eliminate features in datasets, but in cases where there are minority classes (say, 5% 0s and 95% 1s), even good features can have very low variance and still end up being very strong predictors. So, be advised – keep the target ratio in mind and use correlation methods before eliminating features solely based on variance.
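
A minimal scikit-learn sketch of variance-based filtering (the zero threshold simply drops constant columns):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 2.0, 10.0],
    [0, 2.1,  0.5],
    [0, 1.9,  7.3],
])  # the first column never changes

selector = VarianceThreshold(threshold=0.0)  # drop features with zero variance
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [False  True  True] -> constant column removed
print(X_reduced.shape)         # (3, 2)
```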

Correlation

Correlation is a univariate analysis technique. It detects linear relationships between two variables. Think of correlation as a measure of proportionality, which simply measures how the increase or decrease of a variable affects the other variable.

One major disadvantage of correlation is that it can’t properly capture non-linear relationships, not even the strong ones. The images below demonstrate the pros and cons of correlation:

[Plots: a linear relationship and a sine relationship, with their correlation scores]

The second plot shows a strong sine-wave relationship between the dependent and independent variables. However, correlation (-0.389) can barely capture the strong dependence. On the other hand, we get a high correlation score (0.935) when there’s a linear dependency, even though it’s not as strong as the sine-dependency.

There are various correlation techniques, but essentially all of them try to track down the presence of linear (or at most monotonic) relationships between two variables. There are three popular types of correlation techniques:

  1. Pearson Correlation: this is the simplest of all and comes with a lot of assumptions. The variables in question must be normally distributed, and have a linear relationship with each other. Also, the data must be equally distributed about the regression line. However, in spite of several assumptions, Pearson correlation works well for most data.
  2. Spearman Rank Correlation: it’s based on the assumption that variables are measured on an ordinal (rank-wise) scale. It tracks the variability of a variable that can be mapped to the other variable under observation.
  3. Kendall Rank Correlation: this method is a measure of the dependence between two variables and follows mostly the same assumption as Spearman’s, with the exception of measuring correlation based on probability instead of variability. It’s the difference between the probability that the two variables under observation are in order, and the probability that they’re not.

You have to do correlation tests not only with respect to the target (or dependent) variable, but also with respect to all the independent variables. For instance, if there are two variables which have a correlation of 90% and 85% with the target respectively, and a correlation of 95% with respect to each other, then it will be beneficial to drop one of them – preferably the one with the lower correlation with the target. You need to do this to get rid of redundant information in the data, so that the model is less biased.
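
Here is a sketch of that redundancy check with pandas; the 0.9 cutoff is an arbitrary choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=["f1", "f2", "f3"])
df["f4"] = df["f1"] * 0.95 + rng.random(100) * 0.05  # nearly duplicates f1

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Candidate columns to drop: highly correlated with an earlier feature
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # likely ['f4']
```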

Mutual information

Mutual information solves the problem caused by correlation. It effectively captures any non-linear relationship between given variables. This helps us eliminate the features which show no significant relationship with the target variable, helping us strengthen the predictive model. The idea behind mutual information is information gain. It asks the question: how much information on one variable can be extracted from another? Or, how much movement (increase or decrease) of a variable can be tracked using another variable?

For the same relationships, sine-wave and linear, mutual information scores are as follows:

[Plot: mutual information scores for the linear and sine relationships]

Mutual information correctly suggests a strong sine-wave relationship, making it more competent when it comes to capturing information about non-linear relationships. Mutual information for sine relationship is 0.781, whereas correlation for the same was -0.389. 
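
A quick sketch with scikit-learn’s mutual information estimator on a synthetic sine relationship:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=1000)

mi = mutual_info_regression(x, y, random_state=0)
corr = np.corrcoef(x.ravel(), y)[0, 1]

print(f"mutual information: {mi[0]:.3f}")   # clearly above zero
print(f"pearson correlation: {corr:.3f}")   # close to zero despite the strong dependence
```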

Chi-Square

This is a statistical tool, or test, which can be used on groups of categorical features to evaluate the likelihood of association, or correlation, with the help of frequency distributions.

Multivariate selection

Also referred to as Wrapper methods, multivariate selection techniques take a group of features at a time and test the group’s competence in predicting the target variable.

Forward Selection: this method starts with a minimal number of features and measures how well the set performs at prediction. With every iteration, it adds another variable, selected on the basis of best performance (compared to all other variables). The set with the best performance among all sets is finalized as the feature set.

Backward Elimination: similar to forward selection, but in the reverse direction. Backward elimination starts with all possible features and measures performance at every iteration, eliminating extraneous variables, or variables performing poorly when paired with the feature set.

Backward Elimination is usually the preferred method compared to forward selection. This is because in forward selection, the suppression effect can get in the way. Suppression effect occurs when one variable can be utilized optimally only when another variable is held constant. In other words, with a newly added feature, it might so happen that an already existing feature in the set is rendered insignificant. To avoid this, p-value can be used, even though it’s a somewhat controversial statistical tool.

Recursive Feature Elimination: similar to backward elimination, but it replaces the iterative approach with a recursive approach. It’s a greedy optimization algorithm, and works with a model to find the optimized feature set. 
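
A minimal recursive feature elimination sketch with scikit-learn (logistic regression is just one possible base estimator):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier
```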

Linear Discriminant Analysis (LDA): helps you find a linear combination of features that separates two or more classes of a categorical variable.

ANOVA: aka ‘analysis of variance’, it’s similar to LDA, but uses one or more categorical features and one continuous target. It’s a statistical test for whether the means of different groups are similar or not.

Embedded methods: some machine learning models come with in-built feature selection methods. They’re engineered to apply coefficients to features based on the performance of the features in terms of predicting the target. Poorly performing variables get very low or zero as their coefficient, which omits them from the learned equation (almost) entirely. Examples of embedded methods are Ridge and Lasso models, which are linear models used in regression problems.
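
A short sketch of the embedded idea with Lasso, where near-zero coefficients mark features the model effectively drops (the alpha value is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print(lasso.coef_.round(2))       # most coefficients shrink to (near) zero
print("kept features:", selected)
```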

Summary

With that, we’ve reached the end of this guide. Please note that every step mentioned here has several more subtopics that deserve their own articles. The steps and techniques we’ve explored are the most used and popular working methods.

Once you’re done with data preprocessing, your data can be split into training, testing and validation sets for model fitting and model prediction phases. Thanks for reading, and good luck with your models!

Sources:

  1. Extrapolation and Interpolation
  2. Statistical Intro: Mean, Median and Mode
  3. Gradient Descent
  4. Types of Category Encoders
  5. Correlation Types
  6. Mutual Information
  7. Univariate vs Multivariate Analysis
  8. Feature Selection Methods
  9. Linear Discriminant Analysis
Top 10 Best Machine Learning Tools for Model Training https://neptune.ai/blog/top-10-best-machine-learning-tools-for-model-training Thu, 21 Jul 2022 13:36:56 +0000 https://neptune.test/top-10-best-machine-learning-tools-for-model-training/ Contrary to the popular notion, model training in machine learning is not simply a black box activity. For the machine learning (ML) solution to consistently perform well, the developers have to deep dive into each model to find the right fit with the data and the business use case.

In simple terms, a machine learning model is a simple statistical equation that is developed over time based on the data at hand. This learning process, also known as training, ranges from simple to complex processes. A model training tool is an interface that enables easy interaction between the developer and the complexities of machine learning models.

Read also

9 Steps of Debugging Deep Learning Model Training

Performance Metrics in Machine Learning [Complete Guide]

How to choose the right model training tool

In machine learning, there is no “jack of all trades” – no one tool can fix all problems because of vast variations in real-world problems and data. But there are model training tools that can fit you like a glove – you and your requirements specifically. 

To be able to choose a primary model training tool for your solution, you need to assess your existing development process, production infrastructure, skills level of your team, compliance restrictions, and similar vital details to be able to pin down the right tool. 

However, one key feature is often overlooked, leading to a weak foundation and unstable solutions in the long run: the model training tool’s ability to either track metadata itself or integrate seamlessly with a metadata store and monitoring tools.

Model metadata involves assets such as training parameters, experiment metrics, data versions, pipeline configurations, weight reference files, and much more. This data is powerful and cuts down both production and model recovery time. To choose the right metadata store, your team can do a cost-benefit analysis between building new vs. buying existing solutions.

May be useful

The Best MLOps Tools and How to Evaluate Them

Top 10 tools for ML model training

Below is a list of the top ten model training tools in the ML marketspace that you could use to estimate if your requirements match the features offered by the tool.

1. TensorFlow


I remember coming across TensorFlow as an intern and being clearly intimidated after having barely explored scikit-learn. Looking back, that reaction seems inevitable because TensorFlow is a low-level library and requires working closely with the model code. Developers can achieve full control and train models from scratch with TensorFlow.

However, TensorFlow also offers some pre-built models that can be used for simpler solutions. One of the most desirable features of TensorFlow is dataflow graphs that come in handy especially when complex models are under development.

TensorFlow supports a wide range of solutions including NLP, computer vision, predictive ML solutions, and reinforcement learning. Being an open-source tool from Google, TensorFlow is constantly evolving due to a community of over 380,000 contributors worldwide.

Check how to keep track of TensorFlow/Keras model training.

2. PyTorch


PyTorch is another popular open-source tool that offers tough competition to TensorFlow. PyTorch has two significant features – tensor computing with accelerated processing on GPU and neural networks built on a tape-based auto diff system. 

Additionally, PyTorch supports a host of ML libraries and tools that can support a variety of solutions. Some examples include AllenNLP and ELF which is a game research platform. PyTorch also supports C++ and Java in addition to Python.

One leading difference between PyTorch and TensorFlow is that PyTorch builds dynamic dataflow graphs, whereas TensorFlow traditionally relied on static graphs (eager execution was only added in TensorFlow 2.x). Compared to TensorFlow, PyTorch is often easier to learn and implement, since TensorFlow tends to require more boilerplate code.

Check how to keep track of PyTorch model training.

Moving From TensorFlow To PyTorch

8 Creators and Core Contributors Talk About Their Model Training Libraries From PyTorch Ecosystem

3. PyTorch Lightning

PyTorch Lightning is a wrapper on top of PyTorch, built primarily to redirect focus on research instead of on engineering or redundant tasks. It abstracts the underlying complexities of the model and common code structures so the developer can focus on multiple models in a short span.

The two strengths of PyTorch Lightning, as the name partially suggests, are speed and scale. It supports TPU integration and removes barriers to using multiple GPUs. For scale, PyTorch Lightning allows experiments to run in parallel on multiple virtual machines through grid.ai

PyTorch Lightning has significantly less need for code because of high-level wrappers. However, that does not restrict flexibility, since the primary objective of PyTorch Lightning is to reduce the need for redundant boilerplate code. Developers can still modify and deep dive into areas that need customization. 

Check how to keep track of PyTorch Lightning model training.

4. Scikit-learn

Scikit-learn is one of the top open-source frameworks ideal for getting started with machine learning. It has high-level wrappers which enable users to play around with multiple algorithms and explore the wide range of classification, clustering, and regression models. 

For the curious mind, scikit-learn can also be a great way to gain deeper insight into the models simply by unwrapping the code and following the dependencies. Scikit-learn’s documentation is highly detailed and easily readable by both beginners and experts.

Scikit-learn is great for ML solutions with a limited time and resource allotment. It is strictly machine learning-focused and has been an instrumental part of predictive solutions from popular brands over the last few years.

Check how to keep track of Scikit-learn model training.

5. Catalyst

Catalyst is another PyTorch framework built specifically for deep learning solutions. Catalyst is research-friendly and takes care of engineering tasks such as code reusability and reproducibility, facilitating rapid experimentation.

Deep learning has always been considered complex, and Catalyst enables developers to execute deep learning models with a few lines of code. It supports some of the top deep learning techniques, such as the Ranger optimizer, stochastic weight averaging, and one-cycle training.

Catalyst saves source code and environment variables to enable reproducible experiments. Some other notable features include model checkpointing, callbacks, and early stopping.

Check how to keep track of Catalyst model training.

6. XGBoost


XGBoost is a tree-based model training algorithm that uses gradient boosting to optimize performance. It is an ensemble learning technique which means several tree-based algorithms are used to achieve the optimal model sequence.

With gradient boosting, XGBoost grows the trees one after the other so that the following trees can learn from the weakness of the previous ones. It gradually moderates the weights of weak and strong learners by borrowing information from the preceding tree model.

To enhance speed, XGBoost supports parallel model boosting across distributed environments such as Hadoop or MPI. XGBoost is well suited for large training datasets and combinations of numeric and categorical features.

Check how to keep track of XGBoost model training.

XGBoost vs LightGBM: How Are They Different

How to Organize Your XGBoost Machine Learning (ML) Model Development Process: Best Practices

7. LightGBM


LightGBM, like XGBoost, is also a gradient boosting algorithm that uses tree-based models. But when it comes to speed, LightGBM has an upper hand over XGBoost. LightGBM is best suited for large datasets that otherwise would consume a lot of training time with other models.

While most tree-based algorithms grow trees level-wise (depth by depth), LightGBM comes in with the technique of leaf-wise (best-first) splits, which has proven to increase performance. Even though this tends to overfit the model, the developer can avoid the situation by tweaking the max_depth parameter. 

LightGBM requires low memory space in spite of working with heavy datasets since it replaces continuous values with discrete bins. It also supports parallel learning which is again a major time saver.
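
As a hedged sketch (assuming the lightgbm Python package is installed), its scikit-learn-style API exposes the num_leaves and max_depth parameters mentioned above:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Capping max_depth reins in the leaf-wise growth and helps avoid overfitting
model = LGBMClassifier(n_estimators=200, num_leaves=31, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```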

Check how to keep track of LightGBM model training.

8. CatBoost


CatBoost is a gradient boosting algorithm that provides best-in-class results with minimal training compared to most machine learning models. It is an open-source tool and has become a popular favorite because of its ease of use.

CatBoost cuts down preprocessing efforts since it can directly and optimally handle categorical data. It does so by generating numerical encodings and by experimenting with various combinations in the background. 

Even though CatBoost offers the scope of tuning extensively with a range of multiple hyperparameters, it does not require much tuning and can produce results without overfitting the training data. It is well-suited for both low and high-volume data.

9. Fast.ai


Fast.ai’s catchy tagline says it all – “making neural nets uncool again”. Fast.ai aims to make deep learning accessible across multiple languages, operating systems, and small datasets. It was developed on the idea that transfer learning is a key strength in deep learning and can cut down a huge amount of redundant engineering work.

It offers an easy-to-use high-level interface for deep learning models and also allows users to download a set of pre-trained models. Fast.ai has multiple wrappers that hide the complexities of the underlying model architecture. This allows developers to focus on data intelligence and process breakthroughs. 

Fast.ai is also extremely popular for sharing their free online course, “Practical Deep Learning for Coders”, which does not demand any prerequisites yet dives deep into deep learning concepts and illustrates how to make them easy through fast.ai. 

Check how to keep track of fast.ai model training.

10. PyTorch Ignite


PyTorch Ignite is a wrapper built on top of PyTorch and is quite similar to PyTorch Lightning. Both offer an abstraction of model complexities and an easy-to-use interface to expand research abilities and diminish redundant code.

Architecture-wise, there is a subtle difference between the two. While PyTorch Lightning enforces a standard, reproducible interface, Ignite does not impose any standard structure.

Ignite does not ship with as many advanced features out of the box, but it works well with an ecosystem of integrations to support the machine learning solution, whereas Lightning supports state-of-the-art solutions, advanced features, and distributed training natively.

Check how to keep track of PyTorch Ignite model training.

Other model training tools

There are several other options that might not be as popular as the above choices but are great for specific model training requirements. 

For example:

  • If high speed with limited GPU resources is your priority, Theano takes the lead.
  • For .NET and C# capabilities, Accord would be ideal. It also has a host of audio and image processing libraries. 
  • ML.NET is another tool for .NET developers. 
  • Other options for NLP-specific and computer vision solutions include Gensim and Caffe respectively. 

Conclusively, it is always better to do thorough market research before selecting the right fit for your specific solutions. It might not be the most popular or a well-known tool, but it can definitely be the right one for you.

Final note

As suggested earlier, no one tool has to be the solution for every business case or machine learning problem. Even if none of the tools seem like a perfect fit for you, a combination of them can be the ideal way to go since most of them are compatible with each other.

The trick is to first list down some of the best tools in the space, which we have already done for you, and then explore the shortlisted ones to arrive at the right match gradually. The tools shared here are easy to install and have extensive documentation on their respective sites for an easy kickstart!

The Ultimate Guide to Evaluation and Selection of Models in Machine Learning https://neptune.ai/blog/ml-model-evaluation-and-selection Thu, 21 Jul 2022 10:31:03 +0000 https://neptune.test/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning/ To properly evaluate your machine learning models and select the best one, you need a good validation strategy and solid evaluation metrics picked for your problem. 

A good validation (evaluation) strategy is basically how you split your data to estimate future test performance. It could be as simple as a train-test split or a complex stratified k-fold strategy. 

Once you know that you can estimate the future model performance, you need to choose a metric that fits your problem. If you understand the classification and regression metrics, then most other complex metrics (in object detection, for example) are relatively easy to grasp.  

When you nail those two, you are good.

In this article, I will talk about:

  • Choosing a good evaluation method (resampling, cross-validation, etc)
  • Popular (and less known) classification and regression metrics
  • And bias / variance trade-offs in machine learning. 

So let’s get to it. 

Note from the product team

You can use neptune.ai to compare experiments and models based on metrics, parameters, learning curves, prediction images, dataset versions, and more.

It makes model evaluation and selection way easier.

[Screenshot: side-by-side experiment comparison in the Neptune web app]

Just to make sure we are on the same page, let’s get the definitions out of the way.

What is model evaluation?

Model evaluation is a process of assessing the model’s performance on a chosen evaluation setup. It is done by calculating quantitative performance metrics like F1 score or RMSE or assessing the results qualitatively by the subject matter experts. The machine learning evaluation metrics you choose should reflect the business metrics you want to optimize with the machine learning solution.

What is model selection?

Model selection is the process of choosing the best ml model for a given task. It is done by comparing various model candidates on chosen evaluation metrics calculated on a designed evaluation schema. Choosing the correct evaluation schema, whether a simple train test split or a complex cross-validation strategy, is the crucial first step of building any machine learning solution.

How to evaluate machine learning models and select the best one?

We’ll dive into this deeper, but let me give you a quick step-by-step:

Step 1: Choose a proper validation strategy. Can’t stress this enough, without a reliable way to validate your model performance, no amount of hyperparameter tuning and state-of-the-art models will help you.

Step 2: Choose the right evaluation metric. Figure out the business case behind your model and try to use the machine learning metric that correlates with that. Typically no one metric is ideal for the problem.

So calculate multiple metrics and make your decisions based on that. Sometimes you need to combine classic ML metrics with a subject matter expert evaluation. And that is ok.

Step 3: Keep track of your experiment results. Whether you use a spreadsheet or a dedicated experiment tracker, make sure to log all the important metrics, learning curves, dataset versions, and configurations. You will thank yourself later.

Step 4: Compare experiments and pick a winner. Regardless of the metrics and validation strategy you choose, at the end of the day, you want to find the best model. In truth, no model is ever really “best”, but some are good enough.

So make sure to understand what is good enough for your problem, and once you hit that, move on to other parts of the project, like model deployment or pipeline orchestration.

Model selection in machine learning (choosing model validation strategy)

Resampling methods

Resampling methods, as the name suggests, are simple techniques of rearranging data samples to inspect if the model performs well on data samples that it has not been trained on. In other words, resampling helps us understand if the model will generalize well.

Random Split

Random Splits are used to randomly sample a percentage of data into training, testing, and preferably validation sets. The advantage of this method is that there is a good chance that the original population is well represented in all the three sets. In more formal terms, random splitting will prevent a biased sampling of data.

It is very important to note the use of the validation set in model selection. The validation set is the second test set and one might ask, why have two test sets?

In the process of feature selection and model tuning, the test set is used for model evaluation. This means that the model parameters and the feature set are selected such that they give an optimal result on the test set. Thus, the validation set which has completely unseen data points (not been used in the tuning and feature selection modules) is used for the final evaluation.
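
A quick sketch of such a split with scikit-learn; the 60/20/20 ratio is just an example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First carve out 20% as the final validation (hold-out) set
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the rest into train and test sets (used during tuning)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 600 200 200
```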

Time-Based Split

There are some types of data where random splits are not possible. For example, if we have to train a model for weather forecasting, we cannot randomly divide the data into training and testing sets. This will jumble up the seasonal pattern! Such data is often referred to by the term – Time Series.

In such cases, a time-wise split is used. The training set can have data for the last three years and 10 months of the present year. The last two months can be reserved for the testing or validation set.

There is also a concept of window sets – where the model is trained up to a particular date and tested on the following dates iteratively, such that the training window keeps expanding by one day at each step (and, consequently, the remaining test window shrinks by a day). The advantage of this method is that it stabilizes the model and prevents overfitting when the test set is very small (say, 3 to 7 days).  
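
scikit-learn’s TimeSeriesSplit implements this expanding-window idea; a minimal example:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

for train_idx, test_idx in tscv.split(X):
    # The training window always precedes the test window and grows at each split
    print("train:", train_idx, "test:", test_idx)
```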

However, the drawback of time-series data is that the events or data points are not mutually independent. One event might affect every data input that follows after. 

For instance, a change in the governing party might considerably change the population statistics for the years to follow. Or the infamous coronavirus pandemic is going to have a massive impact on economic data for the next few years. 

No machine learning model can learn from past data in such a case because the data points before and after the event have major differences.

K-Fold Cross-Validation

The cross-validation technique works by randomly shuffling the dataset and then splitting it into k groups. Thereafter, on iterating over each group, the group needs to be considered as a test set while all other groups are clubbed together into the training set. The model is tested on the test group and the process continues for k groups.

Thus, by the end of the process, one has k different results on k different test groups. The best model can then be selected easily by choosing the one with the highest score.
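
A minimal k-fold sketch with scikit-learn (five folds and logistic regression are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)

print(scores)          # one score per held-out fold
print(scores.mean())   # average performance across the folds
```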

Stratified K-Fold

The process for stratified K-Fold is similar to that of K-Fold cross-validation with one single point of difference – unlike in k-fold cross-validation, the values of the target variable are taken into consideration in stratified k-fold.

If for instance, the target variable is a categorical variable with 2 classes, then stratified k-fold ensures that each test fold gets an equal ratio of the two classes when compared to the training set.

This makes the model evaluation more accurate and the model training less biased.

Bootstrap

Bootstrap is one of the most powerful ways to obtain a stabilized model. It is close to the random splitting technique since it follows the concept of random sampling.

The first step is to select a sample size (which is usually equal to the size of the original dataset). Thereafter, a data point must be randomly selected from the original dataset and added to the bootstrap sample. After the addition, the data point is put back into the original dataset so that it can be drawn again. This process is repeated N times, where N is the sample size.

Therefore, it is a resampling technique that creates the bootstrap sample by sampling data points from the original dataset with replacement. This means that the bootstrap sample can contain multiple instances of the same data point.

The model is trained on the bootstrap sample and then evaluated on all those data points that did not make it to the bootstrapped sample. These are called the out-of-bag samples.
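
A rough NumPy sketch of drawing one bootstrap sample and finding its out-of-bag points:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(20)  # stand-in for a dataset of 20 rows
n = len(data)

# Sample n indices with replacement: duplicates are expected
boot_idx = rng.integers(0, n, size=n)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)  # rows that were never drawn

bootstrap_sample = data[boot_idx]
out_of_bag = data[oob_idx]

print("bootstrap sample:", bootstrap_sample)
print("out-of-bag points:", out_of_bag)  # train on the sample, evaluate on these
```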

Probabilistic measures

Probabilistic Measures do not just take into account the model performance but also the model complexity. Model complexity is the measure of the model’s ability to capture the variance in the data. 

For example, a highly biased model like the linear regression algorithm is less complex and on the other hand, a neural network is very high on complexity.

Another important point to note here is that the model performance taken into account in probabilistic measures is calculated from the training set only. A hold-out test set is typically not required.

A notable disadvantage, however, lies in the fact that probabilistic measures do not consider the uncertainty of the models and have a tendency to select simpler models over complex models.

Akaike Information Criterion (AIC)

It is common knowledge that every model is not completely accurate. There is always some information loss which can be measured using the KL information metric. Kullback-Leibler or KL divergence is the measure of the difference in the probability distribution of two variables.

A statistician, Hirotugu Akaike, took into consideration the relationship between KL Information and Maximum Likelihood (in maximum-likelihood, one wishes to maximize the conditional probability of observing a datapoint X, given the parameters and a specified probability distribution) and developed the concept of Information Criterion (or IC). Therefore, Akaike’s IC or AIC is the measure of information loss. This is how the discrepancy between two different models is captured and the model with the least information loss is suggested as the model of choice.

The classic formula is AIC = 2K - 2 * ln(L), where:

  • K = number of independent variables or predictors
  • L = maximum likelihood of the model 
  • N = number of data points in the training set (used by the small-sample correction, AICc, which is especially helpful in case of small datasets)

The limitation of AIC is that it is not very good with generalizing models as it tends to select complex models that lose less training information.

Bayesian Information Criterion (BIC)

BIC was derived from the Bayesian probability concept and is suited for models that are trained under the maximum likelihood estimation.

A common form is BIC = K * ln(N) - 2 * ln(L), where:

  • K = number of independent variables
  • L = maximum likelihood
  • N = number of samples/data points in the training set

BIC penalizes the model for its complexity and is preferably used when the size of the dataset is not very small (otherwise it tends to settle on very simple models).

Minimum Description Length (MDL)

MDL is derived from the Information theory which deals with quantities such as entropy that measure the average number of bits required to represent an event from a probability distribution or a random variable.

MDL or the minimum description length is the minimum number of such bits required to represent the model.

The description length is typically written as L(h) + L(D | h), where:

  • h = the model (hypothesis)
  • D = the predictions made by the model
  • L(h) = number of bits required to represent the model
  • L(D | h) = number of bits required to represent the predictions from the model

Structural Risk Minimization (SRM)

Machine learning models face the inevitable problem of defining a generalized theory from a set of finite data. This leads to cases of overfitting where the model gets biased to the training data which is its primary learning source. SRM tries to balance out the model’s complexity against its success at fitting on the data.

How to evaluate ML models (choosing performance metrics)

Models can be evaluated using multiple metrics. However, the right choice of an evaluation metric is crucial and often depends upon the problem that is being solved. A clear understanding of a wide range of metrics can help the evaluator find an appropriate match between the problem statement and a metric.

Classification metrics

For every classification model prediction, a matrix called the confusion matrix can be constructed which demonstrates the number of test cases correctly and incorrectly classified. 

It looks something like this (considering 1 = Positive and 0 = Negative are the target classes):

 
                 Actual 0                Actual 1
Predicted 0      True Negatives (TN)     False Negatives (FN)
Predicted 1      False Positives (FP)    True Positives (TP)

  • TN: Number of negative cases correctly classified
  • TP: Number of positive cases correctly classified
  • FN: Number of positive cases incorrectly classified as negative
  • FP: Number of negative cases incorrectly classified as positive

Accuracy

Accuracy is the simplest metric and can be defined as the number of test cases correctly classified divided by the total number of test cases.

It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets. 

For instance, if we are detecting frauds in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud. The 99% accurate model will be completely useless.

If a model is poorly trained such that it predicts all the 1000 (say) data points as non-frauds, it will be missing out on the 10 fraud data points. If accuracy is measured, it will show that the model correctly predicts 990 data points and thus, it will have an accuracy of (990/1000)*100 = 99%! 

This is why accuracy is a false indicator of the model’s health.

Therefore, for such a case, a metric is required that can focus on the ten fraud data points which were completely missed by the model.

Precision

Precision is the metric used to identify the correctness of classification. It is defined as:

Precision = TP / (TP + FP)

Intuitively, this equation is the ratio of correct positive classifications to the total number of predicted positive classifications. The greater the fraction, the higher the precision, which means the better the ability of the model to correctly classify the positive class.

In the problem of predictive maintenance (where one must predict in advance when a machine needs to be repaired), precision comes into play. The cost of maintenance is usually high and thus, incorrect predictions can lead to a loss for the company. In such cases, the ability of the model to correctly classify the positive class and to lower the number of false positives is paramount!

Recall

Recall tells us the number of positive cases correctly identified out of the total number of positive cases:

Recall = TP / (TP + FN)

Going back to the fraud problem, the recall value will be very useful in fraud cases because a high recall value will indicate that a lot of fraud cases were identified out of the total number of frauds.

F1 Score

F1 score is the harmonic mean of Recall and Precision and therefore balances out the strengths of each:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It is useful in cases where both recall and precision can be valuable – like in the identification of plane parts that might require repairing. Here, precision will be required to save on the company’s cost (because plane parts are extremely expensive) and recall will be required to ensure that the machinery is stable and not a threat to human lives.

AUC-ROC

ROC curve is a plot of the true positive rate (recall) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance. 

If the curve is somewhere near the 50% diagonal line, it suggests that the model randomly predicts the output variable.

[Plot: AUC-ROC curve]

Log Loss

Log loss is a very effective classification metric and is equivalent to -1* log (likelihood function) where the likelihood function suggests how likely the model thinks the observed set of outcomes was. 

Since the likelihood function provides very small values, a better way to interpret them is by converting the values to log and the negative is added to reverse the order of the metric such that a lower loss score suggests a better model.
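
All of the classification metrics discussed above are available in scikit-learn; here is a compact sketch on made-up labels and probabilities:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard class predictions
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]   # predicted probability of class 1

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_prob))
print("log loss :", log_loss(y_true, y_prob))
```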

Gain and Lift Charts

Gain and lift charts are tools that evaluate model performance just like the confusion matrix but with a subtle, yet significant difference. The confusion matrix determines the performance of the model on the whole population or the entire test set, whereas the gain and lift charts evaluate the model on portions of the whole population. Therefore, we have a score (y-axis) for every % of the population (x-axis). 

Lift charts measure the improvement that a model brings in compared to random predictions. The improvement is referred to as the ‘lift’.

K-S Chart

The K-S chart or Kolmogorov-Smirnov chart determines the degree of separation between two distributions – the positive class distribution and the negative class distribution. The higher the difference, the better is the model at separating the positive and negative cases.

Regression metrics

Regression models provide a continuous output variable, unlike classification models that have discrete output variables. Therefore, the metrics for assessing the regression models are accordingly designed.

Mean Squared Error or MSE

MSE is a simple metric that calculates the difference between the actual value and the predicted value (error), squares it and then provides the mean of all the errors.

MSE is very sensitive to outliers and will show a very high error value even if a few outliers are present in the otherwise well-fitted model predictions.

Root Mean Squared Error or RMSE

RMSE is the root of MSE and is beneficial because it helps to bring down the scale of the errors closer to the actual values, making it more interpretable.

Mean Absolute Error or MAE

MAE is the mean of the absolute error values (actuals – predictions).

If one wants to ignore the outlier values to a certain degree, MAE is the choice since it reduces the penalty of the outliers significantly with the removal of the square terms.

Root Mean Squared Log Error or RMSLE

In RMSLE, the same equation as that of RMSE is followed, except that a log transform is applied to the actual and predicted values:

RMSLE = sqrt( (1/n) * Σ ( log(y + 1) - log(x + 1) )² )

Here, x is the actual value and y is the predicted value. This helps to scale down the effect of the outliers by downplaying the higher error rates with the log function. Also, RMSLE helps to capture a relative error (by comparing all the error values) through the use of logs.

R-Squared

R-Square helps to identify the proportion of variance of the target variable that can be captured with the help of the independent variables or predictors. 

R-square, however, has a gigantic problem. Say, a new unrelated feature is added to a model with an assigned weight of w. If the model finds absolutely no correlation between the new predictor and the target variable, w is 0. However, there is almost always a small correlation due to randomness which adds a small positive weight (w>0) and a new loss minimum is achieved due to overfitting.

This is why the R-squared increases with any new feature addition. Thus, its inability to decrease in value when new features are added limits its ability to identify if the model did better with lesser features.

Adjusted R-Squared

Adjusted R-Squared solves this problem: unlike R-Squared, it can decrease when uninformative features are added, penalizing the score as more features come in. A common form is:

Adjusted R² = 1 - (1 - R²) * (N - 1) / (N - p - 1)

where N is the number of samples and p is the number of predictors. The (N - p - 1) term in the denominator shrinks as more features are added, inflating the penalty, so a significant increase in R² is required to increase the overall adjusted value.
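
A short scikit-learn sketch of these regression metrics; adjusted R² is computed by hand from R², the sample count n, and an assumed number of predictors p:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 13.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

n, p = len(y_true), 2  # p = number of predictors (assumed for the example)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(mse, rmse, mae, r2, adj_r2)
```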

Clustering metrics

Clustering algorithms predict groups of datapoints and hence, distance-based metrics are most effective.

Dunn Index

Dunn Index focuses on identifying clusters that have low variance (among all members in the cluster) and are compact. The mean values of the different clusters also need to be far apart.

It is computed as the ratio of the smallest inter-cluster distance to the largest intra-cluster distance, Dunn = min δ(Xi, Xj) / max ∆(Xk), where:

  • δ(Xi, Xj) is the inter-cluster distance, i.e. the distance between clusters Xi and Xj
  • ∆(Xk) is the intra-cluster distance of cluster Xk, i.e. the distance within the cluster Xk

However, the disadvantage of Dunn index is that with a higher number of clusters and more dimensions, the computation cost increases.

Silhouette Coefficient

Silhouette Coefficient measures, on a scale of -1 to +1, how close each point is to points in its own cluster compared to points in the other clusters: 

  • Higher Silhouette values (closer to +1) indicate that the sample points from two different clusters are far away. 
  • 0 indicates that the points are close to the decision boundary 
  • and values closer to -1 suggests that the points have been incorrectly assigned to the cluster.

Elbow method

The elbow method is used to determine the number of clusters in a dataset by plotting the number of clusters on the x-axis against the percentage of variance explained on the y-axis. The point in x-axis where the curve suddenly bends (the elbow) is considered to suggest the optimal number of clusters.

[Plot: explained variance vs. number of clusters, showing the elbow | Source: Wikipedia]
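
A quick sketch of the elbow idea using k-means inertia, with the silhouette coefficient alongside it (the cluster range is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Look for the "elbow" in inertia and a peak in the silhouette (around k=4 here)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
```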

Trade-offs in ml model selection

Bias vs variance

On a high level, Machine Learning is the union of statistics and computation. The crux of machine learning revolves around the concept of algorithms or models, which are in fact, statistical estimations on steroids.

However, any given model has several limitations depending on the data distribution. None of them can be entirely accurate since they are just estimations (even if on steroids). These limitations are popularly known by the name of bias and variance.

A model with high bias will oversimplify by not paying much attention to the training points (e.g.: in Linear Regression, irrespective of data distribution, the model will always assume a linear relationship). 

Bias occurs when a model is strictly ruled by assumptions – like the linear regression model assumes that the relationship of the output variable with the independent variables is a straight line. This leads to underfitting when the actual values are non-linearly related to the independent variables.

A model with high variance will restrict itself to the training data by not generalizing for test points that it hasn’t seen before (e.g.: Random Forest with max_depth = None).

Variance is high when a model focuses on the training set too much and learns the variations very closely, compromising on generalization. This leads to overfitting.

The issue arises when the limitations are subtle, like when we have to choose between a random forest algorithm and a gradient boosting algorithm or between two variations of the same decision tree algorithm. Both will tend to have high variance and low bias.

An optimal model is one that has low bias and low variance, and since these two attributes are inversely related, the only way to achieve this is through a trade-off between the two. Therefore, the model selection should aim for the point where the bias and variance curves intersect, as in the image below.

[Plot: bias-variance trade-off | Source: Analytics Vidhya]

This can be achieved by iteratively tuning the hyperparameters of the model in use (Hyperparameters are the input parameters that are fed to the model functions). After every iteration, the model evaluation must take place with the use of a suitable metric.  

Learning curves

The best way to track the progress of model training or build-up is to use learning curves. These curves help to identify the optimal points in a set of hyperparameter combinations and assists massively in the model selection and model evaluation process.

Typically, a learning curve is a way to track the learning or improvement in the ML model performance on the y-axis and the time or experience on the x-axis.

The two most popular learning curves are:

  • Training Learning Curve – It effectively plots the evaluation metric score over time during a training process and thus, helps to track the learning or progress of the model during training.
  • Validation Learning Curve – In this curve, the evaluation metric score is plotted against time on the validation set. 

Sometimes it might so happen that the training curve shows an improvement but the validation curve shows stunted performance. 

This is indicative of the fact that the model is overfitting and needs to be reverted to the previous iterations. In other words, the validation learning curve identifies how well the model is generalizing.

Therefore, there is a tradeoff between the training learning curve and the validation learning curve and the model selection technique must rely upon the point where both the curves intersect and are at their lowest.

Ok, but how do you actually do it?

What is next

Evaluating ML models and selecting the best-performing one is one of the main activities you do in pre-production. 

Hopefully, with this article, you’ve learned how to properly set up a model validation strategy and then how to choose a metric for your problem. 

You are ready to run a bunch of experiments and see what works. 

With that comes another problem of keeping track of experiment parameters, datasets used, configs, and results. 

And figuring out how to visualize and compare all of those models and results. 

For that, you may want to check out:

Other resources

Cross-validation and evaluation strategies from Kaggle competitions:

Evaluation metrics and visualization:

Experiment tracking videos and real-world case studies:

MLOps Challenges and How to Face Them https://neptune.ai/blog/mlops-challenges-and-how-to-face-them Wed, 25 Aug 2021 16:59:05 +0000 https://neptune.test/mlops-challenges-and-how-to-face-them/ Somewhere around 2018, enterprise organizations started experimenting with Machine Learning (ML) features to build add-ons to their pre-existing solutions, or to create brand new solutions for their clients.

At that time it wasn’t about speed, but more about gaining an extra edge over the competition. If you threw in a few sci-fi-sounding ML features in your offer, you could attract more clients who were interested in trying out the newest tech.

In current MLOps trends, the narrative has changed almost completely. Every year, Artificial Intelligence (AI) sees exponential advancements compared to technology from any other era. The field is evolving extremely quickly, and people are more aware of its limitations and opportunities.

Here are the three big factors that have influenced the evolution of AI and ML the most:

  • Mass Adoption – The enterprise world is now heavily interested in AI/ML solutions, and not just for the benefit of potential clients and customers, but to get investors and drive growth. AI-based features can literally be the deciding factor between companies getting funded or not.
  • Higher Competition – Because of rapid mass adoption, adding a simple ML feature to conventional software is no longer enough to give you an edge. In fact, so many organizations are running AI/ML projects now that it’s becoming a standard business feature, and not one that’s just nice to have.
  • High-Speed Production – Just like in conventional software production, high competition needs to be combatted through high-speed production of new and improved features. Given the earlier ways of ML development (without MLOps) this feat seemed almost impossible.

Overall, we can say that AI/ML solutions are becoming equivalent to regular software solutions in terms of how companies use them, so it’s no surprise that they need a well-planned framework for production just like DevOps.

This well-planned framework is MLOps. You’ve probably heard of it, but what exactly is MLOps?

What is MLOps?

MLOps is basically DevOps (systemic process for the collaboration of development and operations teams), but for machine learning development. 

MLOps combines Machine Learning and Operations by introducing structure and transparency in the end-to-end ML pipeline. With MLOps, data scientists can work and share their solutions in an organized and efficient way with data engineers who deploy the solutions. MLOps also increases visibility across various other technical and non-technical stakeholders who are engaged across various points of the production pipeline.

[Illustration: What is MLOps | Sourced from towardsdatascience.com]

Over the years, organizations have started to see the benefits of MLOps in executing an efficient production pipeline. However, MLOps is still in its adolescent stage, and most organizations are still figuring out the optimal implementation that suits their respective projects.

There are several loopholes and open-ended challenges that need a workaround in MLOps. Let’s take a look at the challenges and weigh the probable solutions.

MLOps challenges and potential solutions

I divided the challenges into seven groups based on the seven different stages of the ML pipeline.

[Diagram: Summary of MLOps challenges across the stages of ML | Illustrated by the author]

Stage 1: Defining the business requirements

This is the initial stage where business stakeholders design the solution. It usually involves three stakeholders: the customer, the solution architect, and the technical team. In this stage, you set the expectations, determine solution feasibility, define success metrics, and design the blueprint.

  • Challenge 1: Unrealistic expectations

Some businesses view AI as a magical solution to all problems. This point of view is often projected by non-technical stakeholders who follow the trending buzzwords without considering the background details.

Solution: This is where the technical leads have a key role. It’s necessary to make all the stakeholders aware of solution feasibility and clearly explain the limitations. After all, a solution is only as good as the data.

  • Challenge 2: Misleading success metrics

A machine learning solution’s effectiveness can be measured through just one or multiple metrics. As the popular saying goes, “you get what you measure”, and this is true even when building ML solutions. Poor analysis of solution requirements can lead to incorrect metric goals that can hamper the health of the design.

Solution: The technical team needs to carry out an in-depth analysis of solution objectives to come up with realistic metrics. Here, both the technical and non-technical stakeholders play a crucial role since it involves a deep business understanding. The best way to go about deciding metrics is to narrow down on two types of metrics:

  • High-Level Metrics: Apt for customer view and provides a good idea of where the solution is headed. In other words, the high-level metrics show the big-picture. 
  • Low-Level Metrics: These are the detailed metrics that support the developers during solution development. By analyzing multiple low-level metrics, the developer can tweak the solution to get better readings. The low-level metrics add up to the high-level metrics.

Stage 2: Data preparation

Data preparation involves gathering, formatting, cleaning, and storing the data as needed. This is a highly sensitive stage since the incoming data decides the fate of the solution. The ML engineer needs to perform a sanity check on data quality and data access points.

  • Challenge 1: Data discrepancies

Data often needs to be sourced from multiple sources and this leads to a mismatch in data formats and values. For instance, recent data can be directly taken from a pre-existing product, while older data can be collected from the client. Differences in mappings, if not properly evaluated, can disrupt the entire solution.

Solution: Limiting data discrepancies can be a manually intensive and time-consuming task, but you still need to do it. The best way to deal with this is to centralize data storage and to have universal mappings across various teams. This is a one-time setup for every new account and it benefits you as long as the client is on board.

  • Challenge 2: Lack of data versioning

Even if the data in use is free from any disruptions or format issues, there’s always the issue of time. Data keeps evolving and regenerating, and the results of the same models can differ widely for an updated data dump. Updates can be in the form of different processing steps, as well as new, modified, or deleted data. If you don’t version your data, your model performance records won’t be reproducible or comparable over time.

Solution: Modifying pre-existing data dumps can be great for space optimization, but it’s best to create new data versions. However, for space optimization, you can store the metadata of a given data version so that it can be retrieved from the updated data unless the values are also modified.

Stage 3: Running experiments

Since ML solutions are heavily research-based, ample experimentation is needed to obtain the optimal route. Machine learning experiments are involved across all the stages of development including feature selection, feature engineering, model selection, and model tuning.

  • Challenge 1: Inefficient tools and infrastructure

Running multiple experiments can be chaotic and harsh on company resources. Different data versions and processes need to run on hardware that’s equipped to carry out complex calculations in minimal time. Also, immature teams rely on notebooks to run their experiments, which is inefficient and time-consuming.

Solution: If hardware is an issue, dev teams can seek budgets for subscriptions to virtual hardware such as those available on AWS or IBM Bluemix. When it comes to notebooks, the developers must make it a practice to perform experiments on scripts since they’re much more efficient and less time-consuming.

  • Challenge 2: Lack of model versioning

Every ML model has to be tested with multiple hyperparameter combinations, but that's not the main challenge. Changes in incoming data can degrade the performance of the chosen combination, so hyperparameters have to be re-tuned. While the code and hyperparameters are under the developers' control, the data is the independent factor that influences these controlled elements.

Solution: Every version of the model should be recorded so that the optimal result can be found and reproduced with minimum hassle. This can be done seamlessly through experiment tracking platforms like neptune.ai.
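
As a hedged sketch, assuming the neptune Python client (1.x API), each model version can be logged with its hyperparameters, the data version it was trained on, and the resulting metrics. The project name, field names, and values below are placeholders.

```python
# A minimal sketch of recording one model version with the neptune client (1.x).
import neptune

run = neptune.init_run(project="my-workspace/my-project")  # reads NEPTUNE_API_TOKEN from the environment

params = {"learning_rate": 0.01, "max_depth": 6, "data_version": "2024-03-dump"}  # placeholder values
run["parameters"] = params                 # hyperparameters for this model version

for acc in [0.71, 0.78, 0.82]:             # stand-in for a real training loop
    run["train/accuracy"].append(acc)      # per-step metrics

run["validation/accuracy"] = 0.80          # final score used to compare versions
run.stop()
```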

Aside

neptune.ai is the experiment tracker for teams that train foundation models, designed with a strong focus on collaboration and scalability.

It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye.

Neptune is known for its user-friendly UI and seamlessly integrates with popular ML/AI frameworks, enabling quick adoption with minimal disruption.

  • Challenge 3: Budget constraints

Sometimes, development teams can't use company resources because of budget restrictions, or because a resource is shared across multiple teams. Resources with high-powered computing or huge storage capacity, even though they're crucial for scaling ML solutions, fall outside most organizations' budget criteria. In such cases, ML teams have to find a workaround (often a suboptimal one) to make the solution work with whatever capacity is available.

Solution: To reduce the long line of approvals and ease budget constraints, development teams often need to step into the business side and run a thorough cost-benefit analysis: the cost of the requested provisions versus the return on investment from working solutions that can run on them. The teams may need to collaborate with other departments to get accurate cost data. Key decision-makers in organizations take either a short-term or a long-term profit-oriented view, and a cost-benefit analysis that promises growth can be the driving factor that removes some of these bottlenecks. 

Stage 4: Validating solution

An ML model is trained on historical data and needs to be tested on unseen data to check model stability. The model needs to perform well in the validation stage to be good for deployment.

  • Challenge 1: Overlooking meta performance

Considering only the high-level and low-level success metrics for model validation isn't enough. Giving a go-ahead based on these factors alone can result in a model that is accurate but too slow or resource-hungry in production, ultimately leading to escalations from the end customer.

Solution: Factors such as memory and time consumption, hardware requirements, or production environment limitations should also be considered while validating solutions. They’re called meta metrics, and considering them will help you avoid the consequences of Murphy’s law to a great extent.
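
As a minimal sketch, two such meta metrics (prediction latency and peak memory) can be captured during validation with the Python standard library. The model and inputs below are placeholders.

```python
# Measure latency and peak memory for one prediction call during validation.
import time
import tracemalloc

def profile_inference(predict_fn, batch):
    """Return latency in milliseconds and peak memory in MB for one prediction call."""
    tracemalloc.start()
    start = time.perf_counter()
    predict_fn(batch)
    latency_ms = (time.perf_counter() - start) * 1000
    _, peak_bytes = tracemalloc.get_traced_memory()  # (current, peak) traced allocations
    tracemalloc.stop()
    return {"latency_ms": latency_ms, "peak_memory_mb": peak_bytes / 1e6}

# Example with a dummy predictor standing in for a real model.
print(profile_inference(lambda xs: [x * 2 for x in xs], list(range(10_000))))
```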

  • Challenge 2: Lack of communication

Validating the solution only from a developer's standpoint is risky. Not involving all stakeholders in the validation process can create conflicting expectations and lead to rework and dissatisfaction.

Solution: Involve the business stakeholders to understand how the model's performance maps to business KPIs and how those KPIs directly impact them. Once they understand this link, the validation process becomes much more effective, because results are weighed against their real-world implications.

  • Challenge 3: Overlooking biases

Biases are sneaky and can be overlooked even by experienced developers. For example, results can look great on the validation set and then fail terribly on incoming production data. This happens when the model is trained on biased samples and the validation set happens to contain samples similar to those biased ones, so the bias goes undetected.

Solution: Validate the model on multiple non-overlapping subsets of the data, sampled without replacement. If the results are roughly consistent across all sets, the model is likely unbiased. However, if the results differ significantly across sets, the training data needs to be updated with less biased samples and the model needs to be retrained. A cross-validation sketch of this check follows.
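
A minimal sketch of this check using k-fold cross-validation, where each fold is a non-overlapping validation set; the dataset and model are placeholders.

```python
# Check how consistent the model's scores are across non-overlapping validation folds.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

print("Fold scores:", np.round(scores, 3))
print("Spread (max - min):", round(scores.max() - scores.min(), 3))
# A large spread across folds suggests the model leans on unrepresentative samples,
# and the training data may need rebalancing before retraining.
```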

Stage 5: Deploying solution

This stage is where the locally developed solution is launched on the production server so that it can reach the end customer. This is where the deployment and development teams collaborate to execute the launch.

  • Challenge 1: Surprising the IT department

In real-world scenarios, there's significant friction between development and deployment teams, usually because the IT department isn't consulted or involved from the initial steps. Often, once a solution is devised, dev teams want it deployed as soon as possible and demand expensive setups from IT at very short notice. There's a reason for the delayed communication: with multiple experiments running, the dev team isn't sure which solution will be implemented, and each solution has different requirements.

Solution: Involve the IT team as soon as possible. Sometimes they have good insights on where a particular solution is headed in terms of requirements. They can also point out the common elements between potential solutions that can be set up early on.

  • Challenge 2: Lack of iterative deployment

The development and production teams are out of sync in most cases and only start collaborating at the end of solution design. Even though ML follows a research-based approach, deploying everything in one shot at the end is faulty and inefficient.

Solution: Just like in regular software deployment, iterative deployment of ML solutions can save a lot of rework and friction. Ideally, the different modules of the solution are set up iteratively and updated in sprints. Even in research-based solutions, where multiple model variants need to be tested, a model module can be handed to the deployment team early and updated across sprints.

  • Challenge 3: Suboptimal company framework

The software deployment framework a company already uses might be suboptimal or even irrelevant for deploying ML solutions. For instance, a Python-based ML solution might have to be deployed through a Java-based framework just to comply with the company's existing systems. This can double the work for development and deployment teams, since most of the codebase has to be replicated, which costs a lot of resources and time. Older companies that run uniformly on a previously built framework may not get the best results from ML teams in terms of resource optimization, because the teams will be busy figuring out how to deploy their solution through the limited framework available. And once they've figured it out, they have to repeat the same suboptimal process for every solution they want to deploy.

Solution: A long-term fix is to invest in a separate ML stack that integrates with the company framework while reducing work on the development side. A quick fix is to deploy ML solutions in containerized environments; tools such as Docker and Kubernetes are extremely useful in these cases. A hedged sketch of exposing a model behind an HTTP endpoint, which can then be containerized, follows. 
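
A minimal sketch of wrapping a trained model in an HTTP endpoint that a container image can run; Flask, the model path, and the payload format are assumptions for illustration, not a prescribed setup.

```python
# serve_model.py - expose a pickled model behind a simple prediction endpoint.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # placeholder path to a trained, pickled model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Keeping the serving layer in Python means the ML codebase doesn't have to be replicated in another language; the container image becomes the unit the deployment team works with.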

  • Challenge 4: Long-chain of approvals

For every change that needs to be reflected on the production server, approval is needed. This takes a long time since the verification process is lengthy, ultimately delaying development and deployment plans. The problem isn't limited to production servers; it also applies to provisioning company resources or integrating external solutions.

Solution: To shorten the approval time for machine learning libraries on production servers, developers can restrict their dependencies to well-vetted codebases such as TensorFlow and scikit-learn. Contrib or community packages, if used, must be thoroughly checked by the developer to verify their input and output points.

Stage 6: Monitoring solution

Creating and deploying the solution is not the end of service. Models are trained on local and past data. It’s crucial to examine how they perform on new and unseen data. This is the stage where the stability and success of the solution are established.

  • Challenge 1: Manual Monitoring

Manual monitoring is highly demanding and wastes resources and time. Unless you have resources to spare, it's a subpar way to monitor model results. It's also not an option for time-sensitive projects, since it fails to instantly raise alerts on declining performance.

Solution: There are two options. First, automate the monitoring process along with alerting. Second, if automation isn't an immediate option, regularly study recent monitoring data; if performance is consistently decreasing, it's time to set up the retraining process (not necessarily to start it, but to have it ready).

With Neptune’s Reporting feature, you can automate model performance tracking and visualize your own metrics in real-time. Instead of relying on manual log analysis, teams can create dashboards that provide a live view of key indicators like accuracy, loss, drift, and latency. 
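
As a minimal, tool-agnostic sketch, an automated check can watch a rolling average of a monitored metric and raise an alert when it drops below an agreed threshold. The metric values, threshold, and alert channel below are placeholders.

```python
# Alert when the rolling average of a monitored metric falls below a threshold.
def send_alert(message: str) -> None:
    # Placeholder: wire this to email, Slack, or your incident tooling.
    print("ALERT:", message)

def check_recent_performance(recent_scores, threshold=0.75, window=7):
    """Alert if the average of the last `window` scores falls under the threshold."""
    window_scores = recent_scores[-window:]
    rolling_avg = sum(window_scores) / len(window_scores)
    if rolling_avg < threshold:
        send_alert(f"Model accuracy dropped to {rolling_avg:.3f} over the last {window} checks")
    return rolling_avg

check_recent_performance([0.82, 0.80, 0.79, 0.74, 0.72, 0.71, 0.70])
```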

  • Challenge 2: Change of data trends

Sometimes, the data faces abrupt changes caused by external factors that aren't in sync with its history. For example, stock data can be heavily impacted by a related news article, or import data can be impacted by new tax laws. The list is endless, and it's challenging to handle such sudden disruptions.

Solution: Keep the data up-to-date and as fresh as possible, especially if the solution is time-sensitive. This can be done through automated crawlers or jobs that pull and check data periodically, which also prevents lags in the performance data. Pairing this with an automated distribution check, like the sketch below, helps flag abrupt changes early.
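
A minimal sketch of flagging abrupt data changes by comparing a fresh batch against a reference sample with a two-sample Kolmogorov-Smirnov test; scipy is assumed to be available, and the generated data only simulates a shifted feature.

```python
# Flag a feature whose fresh values no longer match the historical distribution.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference: np.ndarray, fresh: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the fresh batch's distribution differs significantly from the reference."""
    statistic, p_value = ks_2samp(reference, fresh)
    return p_value < alpha

rng = np.random.default_rng(0)
reference = rng.normal(loc=100, scale=10, size=5_000)  # historical feature values
fresh = rng.normal(loc=120, scale=10, size=500)        # shifted incoming data
print("Drift detected:", feature_drifted(reference, fresh))
```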

Stage 7: Retraining model

Model retraining is an unavoidable stage of any machine learning solution, because ML solutions are heavily data-dependent and data trends change over time. Efficient retraining keeps the solution relevant and saves the cost of building new solutions from scratch.

  • Challenge 1: Lack of scripts

ML teams that aren’t very mature are manually intensive. An early-stage team is still figuring out frameworks, optimum solutions, and responsibilities. However, this also reduces team efficiency and increases the wait time for retraining.

Solution: A script that wraps the ML pipeline is not hard to create or to call, and it saves time and resources with a one-time setup. The best way to build it is to use conditional calls for the different sub-modules, so the team can decide how much of the pipeline to automate (a minimal sketch follows).
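
A minimal sketch of such a pipeline script with conditional sub-module calls; the stage functions are placeholders for your own data preparation, training, and evaluation steps.

```python
# retrain.py - run selected stages of the retraining pipeline via flags.
import argparse

def prepare_data():
    print("preparing data...")       # placeholder for the real preparation step

def train_model():
    print("training model...")       # placeholder for the real training step

def evaluate_model():
    print("evaluating model...")     # placeholder for the real evaluation step

def main():
    parser = argparse.ArgumentParser(description="Run selected stages of the retraining pipeline.")
    parser.add_argument("--skip-prepare", action="store_true", help="Reuse the last prepared dataset")
    parser.add_argument("--skip-train", action="store_true", help="Only refresh data and evaluation")
    args = parser.parse_args()

    if not args.skip_prepare:
        prepare_data()
    if not args.skip_train:
        train_model()
    evaluate_model()

if __name__ == "__main__":
    main()
```

The flags let the team keep parts of the process manual (for example, skipping automated data preparation after a major trend change) while still automating the rest.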

  • Challenge 2: Deciding the triggering threshold

When exactly should you start retraining the model? Every machine learning model has real-world business consequences, and deviating performance can significantly impact various teams. Therefore, knowing exactly when to kickstart retraining is crucial, and equally challenging to pinpoint.

Solution: Since business stakeholders want good model performance, it's important to weigh their point of view and the problems they face when performance degrades. For example, a model that predicts customers' payment dates directly affects the calls that payment collectors make. So, factors such as a drop in call hit rate and lost payments need to be taken into account when making the retraining call. 

  • Challenge 3: Deciding the degree of automation

Model retraining can be tricky. Retraining of some solutions can be entirely automated, whereas other solutions need close manual intervention. Immature teams might make the mistake of automating the entire retraining script without assessing the causes of low performance. Models need retraining mainly because of deviating data: a model is trained on a given snapshot of data, but over time the learned trends may no longer apply to the incoming data. Tweaking model hyperparameters won't change much in that case.

Solution: Observe the performance deviation and figure out the causes behind it. If the data deviation is minimal, an automation script can handle the retraining process well enough. However, if the data has changed significantly (an extreme example: the deviation of employment data during a global crisis), exploratory data analysis is the way to go, even though it's a manually intensive process.

Conclusion

We’ve explored the most common high-level challenges in MLOps and ML pipelines. As you can see, they’re a mix of communication and technical challenges. 

There are several more low-level challenges related to every stage, but getting started with an MLOps strategy that dissolves the big organizational, communication, and technical issues is the key to resolving low-level challenges as they come.
