Time Series - neptune.ai

Time Series Projects: Tools, Packages, and Libraries That Can Help

Enes Zvorničanin — Thu, 20 Oct 2022 14:43:45 +0000

Since you are here, you probably know that time series data is a bit different than static ML data. So when working on time series projects, oftentimes, Data Scientists or ML Engineers use specific tools and libraries. Or they use commonly known tools that have proved to be well adjusted to time series projects.

We figured it would be useful to have those tools gathered in one place, so here we are. This article is sort of a database of time series tools and packages. Some of them are pretty well-known and some may be new to you. Hope you’ll find the whole list useful!

Before we dig into tools, let’s cover some basics.

What is a time series?

A time series is a sequence of data points indexed in time order. It’s an observation of the same variable at successive points in time. In other words, it’s a set of data that has been observed over a period of time.

The data is often plotted as a line on a graph with time on the x-axis and the value at each point on the y-axis. Also, there are four main components of a time series:

1 Trend
2 Seasonal variations
3 Cyclic variations
4 Irregular or random variations

A Trend is simply a general direction of change in the data over many periods and it’s the long-term pattern in the data. The trend usually appears for a certain amount of time, after which it disappears or changes direction. For example, in financial markets, a ‘Bullish Trend’ indicates an upward trend where the prices of financial assets rise in general, while a ‘Bearish Trend’ indicates a decline in the prices.

Broadly, a trend in time series can be:

Upward trend: a time series increases over an observed period.
Downward trend: a time series decreases over an observed period.
Constant or horizontal trend: a time series doesn’t significantly rise or fall over an observed period.

Seasonal variations, or seasonality, are an important component to consider when looking at a time series because they can provide information about what might happen in the future based on past data. Seasonality refers to variation in the value of a measure that repeats over the course of one or more seasons, such as the winter and summer months, but it can also occur on a daily, weekly, or monthly basis. For example, the temperature has a seasonal behavior because it is higher in summer and lower in winter.

In contrast to seasonal variations, cyclic variations don’t have precise time periods and might have some drifts in time. For instance, financial markets tend to cycle between periods of high and low values, but there is no predetermined period of time between them. Besides that, a time series can have both seasonal and cyclic variations. For instance, the real estate market is known to have both cyclic and seasonal patterns. The seasonal pattern shows that there are more transactions in the spring than in the summer. The cyclic pattern reflects people’s purchasing power: there are fewer sales during a crisis than during times of prosperity.

Irregular or random variations are what remain after trend, seasonal and cyclic components are removed. Because of that, it’s also known as the residual component. This is a non-systematic part of a time series that is completely random and can’t be predicted.
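To make the components concrete, here is a minimal sketch that builds a synthetic monthly series from a trend, a seasonal pattern, and random noise, and then recovers an approximate trend with a rolling mean. The numbers (48 months, slope 0.5, amplitude 10) are arbitrary assumptions chosen for illustration, not values from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_months = 48
t = np.arange(n_months)

trend = 0.5 * t                                # steady upward trend
seasonal = 10 * np.sin(2 * np.pi * t / 12)     # pattern repeating every 12 months
noise = rng.normal(scale=1.0, size=n_months)   # irregular (random) component

index = pd.date_range("2019-01-01", periods=n_months, freq="MS")
series = pd.Series(trend + seasonal + noise, index=index)

# A 12-month rolling mean averages over one full seasonal cycle,
# so what remains is roughly the trend plus smoothed noise.
trend_estimate = series.rolling(window=12, center=True).mean()
```

The first and last few values of `trend_estimate` are NaN because a full 12-month window is not available at the edges of the series.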

Time series components | Source

In general, time series are often used in many fields such as economics, mathematics, biology, physics, meteorology, etc. Concretely, some examples of time series data are:

The Dow Jones Industrial Average index prices
The temperature in New York City
Bitcoin price
ECG signals
Google trends of the term MLOps
The unemployment rate in the USA
Website traffic through time and similar

In this article, we will take a look at a few of the aforementioned examples.

Examples of time series projects

Stock market prediction

Stock market forecasting is a challenging and attractive topic where the main goal is to develop diverse methods and strategies for predicting future stock prices. There are a lot of different techniques, from classic algorithmic and statistics methods up to complex neural network architectures. The common thing is that they all utilize different time series to achieve accurate forecasts. Stock market forecasting methods are widely used by amateur investors, fintech startups, and big hedge funds.

There are many ways to use stock market forecasting methods in practice, but the most popular is probably trading. Automated trading on stock exchanges is on the rise, and it’s estimated that about 75% of stocks traded on US stock exchanges come from algorithmic systems. There are two main approaches to predicting how stocks will perform in the future: fundamental analysis and technical analysis.

Fundamental analysis looks at factors such as a company’s financial statements, management, and industry trends. Also, it takes into account some macroeconomic indicators such as inflation rate, GDP, state of the economy, and similar. All these indicators are time-dependent and in that way can be represented as time series.

In contrast to fundamental analysis, technical analysis uses patterns in trading volume, price changes, and other information from the market itself to predict how stocks will perform in the future. It’s important for investors to understand both approaches before making an investment decision.

Technical indicators example | Source

Bitcoin price forecasting

Bitcoin is a digital currency that has significant fluctuations in price. It’s also one of the most volatile assets in the world. The price of bitcoin is determined by supply and demand. When demand for bitcoins increases, the price increases, and when demand falls, the price falls. As demand has increased in recent years, so has the price. Because of its very volatile nature, it is a very challenging task to forecast bitcoin’s future prices.

In general, this problem is very similar to stock market prediction, and almost the same methods can be used to solve it. Bitcoin has even been shown to correlate with some indices such as the S&P 500 and the Dow Jones, which means that the bitcoin price, to some degree, follows the prices of these indices. You can read more about this here:

Cryptocurrency price prediction using LSTMs | TensorFlow for hackers

LSTM for bitcoin prediction in Python

ECG anomaly detection

ECG anomaly detection is a technique that detects the abnormalities in an ECG. The ECG is a test that monitors the electrical activity of the heart. Basically, it is an electrical signal generated by the heart and represented as a time series.

The ECG anomaly detection is done by comparing the normal pattern of an ECG with the abnormal pattern. There are many types of anomalies in an ECG, and they can be classified as follows:

Heart rate anomalies: this refers to any change in heart rate from its normal range. This may be due to a problem with the heart or a problem with how it is being stimulated.
Heart rhythm anomalies: this refers to any change in rhythm from its normal pattern. This may be due to a problem with the way that impulses are being conducted through the heart or problems with how quickly they are conducted through it.

A lot of work has been done on this topic, ranging from academic research to commercial ECG machines, and there are some promising results. The biggest issue is that such a system needs a very high level of accuracy, with as few false positives and false negatives as possible, because a wrong prediction can have serious consequences.

ECG anomalies detection | Source

Time series anomaly detection using LSTM autoencoders with PyTorch in Python

Anomaly detection in ECG time signals via deep long short-term memory networks

Tools, packages, and libraries for time series projects

Now that we have some background on the importance of time series in the industry, let’s take a look at some popular tools, packages, and libraries that can be helpful for any time series project. Since the majority of data science and machine learning projects related to time series are done in Python, it makes sense to discuss tools supported by Python.

We will discuss tools from four main categories:

1 Data preparation and feature engineering tools
2 Data analysis and visualization packages
3 Experiment tracking tools
4 Time series forecasting packages

Data preparation and feature engineering tools for time series

Data preparation and feature engineering are two very important steps in the data science pipeline. Data preparation is typically the first step in any data science project. It’s the process of getting data into a form that can be used for analysis and further processing.

Feature engineering is a process of extracting features from raw data to make it more useful for modelling and prediction. Below, we’ll mention some of the most popular tools used for these tasks.

Time series projects with Pandas

Pandas is a Python library for data manipulation and analysis. It includes data structures and methods for manipulating numerical tables and time series. Also, it contains extensive capabilities and features for working with time series data for all domains.

It supports data input from a variety of file types, including CSV, JSON, Parquet, SQL database tables and queries, and Microsoft Excel. Also, Pandas allows various data manipulation features such as merging, reshaping, selecting, as well as data cleaning and wrangling.

Some useful time series features are:

Date range generation and frequency conversions
Moving window statistics
Moving window linear regressions (available in older Pandas releases; rolling OLS is now provided by Statsmodels)
Date shifting
Lagging and many more
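A few of the features listed above can be sketched in a handful of lines. The series below is synthetic (the values 0 to 13 on 14 consecutive days) and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Date range generation: 14 daily timestamps
idx = pd.date_range("2022-01-01", periods=14, freq="D")
s = pd.Series(np.arange(14, dtype=float), index=idx)

# Frequency conversion: aggregate daily values into 7-day bins
weekly_mean = s.resample("7D").mean()

# Moving window statistic: mean over the last 3 observations
rolling_mean = s.rolling(window=3).mean()

# Lagging: shift values by one step, keeping the index in place
lag_1 = s.shift(1)
diff_1 = s - lag_1   # day-over-day change
```

Note that `shift` and `rolling` leave NaN at the start of the result, where there is not enough history to compute a value.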

More related content for time series can be found below:

Pandas documentation

W3Schools: Pandas tutorial

Time series projects with NumPy

NumPy is a Python library that adds support for huge, multi-dimensional arrays and matrices, as well as a vast number of high-level mathematical functions that may be used on these arrays. It has a very similar syntax to MATLAB and includes a high-performance multidimensional array object as well as capabilities for working with these arrays.

NumPy’s datetime64 data type and arrays enable an extremely compact representation of dates in time series. Using NumPy also makes it simple to do various time series operations using linear algebra operations.
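As a rough illustration of both points, the sketch below stores a week of dates as a datetime64 array and fits a straight-line trend with a least-squares solve. The values are made up for the example.

```python
import numpy as np

# One week of daily dates stored compactly as datetime64[D]
dates = np.arange("2022-10-01", "2022-10-08", dtype="datetime64[D]")
values = np.array([3.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0])

# Date arithmetic: days elapsed since the first observation, as floats
elapsed_days = (dates - dates[0]) / np.timedelta64(1, "D")

# Boolean mask for weekends (is_busday treats Mon-Fri as business days by default)
is_weekend = ~np.is_busday(dates)
weekday_mean = values[~is_weekend].mean()

# Linear algebra: least-squares fit of values = intercept + slope * elapsed_days
design = np.column_stack([np.ones_like(elapsed_days), elapsed_days])
(intercept, slope), *_ = np.linalg.lstsq(design, values, rcond=None)
```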

NumPy documentation and tutorials:

Numpy website

W3Schools: NumPy introduction

Time series projects with Datetime

Datetime is a Python module that allows us to work with dates and times. This module contains the methods and functions required to handle the scenarios such as:

Representation of dates and times
Arithmetic of dates and times
Comparison of dates and times

Working with time series is simple using this tool. It allows users to transform dates and times into objects and manipulate them. For example, with only a few lines of code, we can convert from one date and time format to another, add a number of days or weeks to a date, or calculate the difference in seconds between two time objects.
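The operations just described might look like this; the timestamp and the format strings are arbitrary examples.

```python
from datetime import datetime, timedelta

# Parse a timestamp written in day/month/year format
start = datetime.strptime("20/10/2022 14:43:45", "%d/%m/%Y %H:%M:%S")

# Convert it to another (ISO-like) text format
iso_text = start.strftime("%Y-%m-%dT%H:%M:%S")

# Arithmetic: add seven days
week_later = start + timedelta(days=7)

# Difference between two datetime objects, in seconds
seconds_between = (week_later - start).total_seconds()

# Comparison
is_later = week_later > start
```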

Useful documentation around how to get started with this module:

Tutorials point: Python, date and time

Python documentation (datetime)

Time series projects with Tsfresh

Tsfresh is a Python package. It automatically calculates a large number of time series characteristics, known as features. The package combines established algorithms from statistics, time series analysis, signal processing, and non-linear dynamics with a robust feature selection algorithm to provide systematic time series feature extraction.

The Tsfresh package includes a filtering procedure to prevent the extraction of irrelevant features. This filtering procedure assesses each feature’s explanatory power and significance for the regression or classification task.

Some examples of advanced time series features are:

Fourier transform components
Wavelet transform
Partial autocorrelation and others
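To give a feel for what such features are, here is a hand-rolled NumPy sketch that computes two of them for a single series: the strongest Fourier component and the lag-1 autocorrelation. This is not the Tsfresh API; the function name `basic_features` and the dictionary keys are hypothetical, and Tsfresh computes a far larger and more carefully defined set.

```python
import numpy as np

def basic_features(x):
    """Compute a few simple descriptive features of a 1-D series."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    # Magnitudes of the real FFT; index 0 is the constant term, so skip it
    spectrum = np.abs(np.fft.rfft(centred))
    dominant_bin = int(np.argmax(spectrum[1:]) + 1)
    # Lag-1 autocorrelation of the centred series
    lag1 = float(np.dot(centred[:-1], centred[1:]) / np.dot(centred, centred))
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "dominant_fft_bin": dominant_bin,
        "lag1_autocorr": lag1,
    }

# A sine wave completing exactly 4 cycles over 64 samples
t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 64)
features = basic_features(signal)
```

For this signal the dominant FFT bin is 4, matching the four cycles in the window.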

More about the Tsfresh package can be found below:

Tsfresh documentation

Data analysis and visualization packages for time series

Data analysis and visualization packages are tools that help data analysts to create graphs and charts from their data. Data analysis is defined as the process of cleaning, transforming, and modelling data in order to uncover useful information for business decisions. The goal of data analysis is to extract useful information from data and make decisions based on that information.

The graphical representation of data is known as data visualization. Data visualization tools, which use visual elements such as charts and graphs, provide an easy way to see and understand trends and patterns in data.

There is a wide range of data analysis and visualization packages for time series and we’ll go through a few of them.

Time series projects with Matplotlib

Probably the most popular Python package for data visualization is Matplotlib. It’s used for creating static, animated, and interactive visualizations. With Matplotlib it’s possible to do some things such as:

Produce plots suitable for publication
Create interactive figures that can be zoomed in, panned, and updated
Change the visual style and layout

Also, it provides a variety of options for drawing time series charts. More about it is on the link below:

Matplotlib website

Example of the Matplotlib chart with time series | Source: Author

Time series projects with Plotly

Plotly is an interactive, open-source, and browser-based graphing library for Python and R. It’s a high-level, declarative charting library with over 30 chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, financial charts, and more.

Besides that, with Plotly it’s possible to draw interactive time series-based charts such as lines, gantts, scatter plots, and similar. More about this package is presented in the documentation:

Plotly documentation

Example of the Plotly chart with time series | Source

Time series projects with Statsmodels

Statsmodels is a Python package that provides classes and functions for estimating a wide range of statistical models, as well as running statistical tests and statistical data analysis.

We’ll cover this library in more detail in the section about forecasting, but here it’s worth mentioning that it provides a very convenient method for time series decomposition and its visualization. With this package, we can easily decompose any time series and analyze its components such as trend, seasonal components, and residual or noise. More about that is described in the tutorial:

How to decompose time series data into trend and seasonality

Experiment tracking tools for time series

Experiment tracking tools are usually high-level tools that can be used for a variety of purposes like tracking the results of an experiment, showing what would happen if one changed the parameters in an experiment, model management, and similar.

They are typically more user-friendly than low-level packages and can save a significant amount of time when developing machine learning models. Only two of them will be mentioned here, as they are most likely the most popular ones.

For time series, it’s especially important to have a convenient environment for tracking defined metrics and hyperparameters, since we will most likely need to run a lot of different experiments. Time series models are usually small compared to, say, convolutional neural networks, and they take only a few hundred or a few thousand numerical values as input, so they train quickly. Hyperparameter tuning, however, often takes quite some time.

Finally, it would be very beneficial to connect in one place models from different packages as well as visualization tools.

Time series projects with neptune.ai

neptune.ai is an experiment tracker designed with a strong focus on collaboration and scalability. It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye. The tool is known for its user-friendly interface and flexibility, enabling teams to adopt it into their existing workflows with minimal disruption. Neptune gives users a lot of freedom when defining data structures and tracking metadata.

Data scientists and ML/AI researchers can log, store, organize, display, compare, and query all their model-building metadata in a single place. Neptune handles data such as model metrics and parameters, model checkpoints, images, videos, audio files, dataset versions, and visualizations. Time series are no exception, and any project that uses them can be tracked in Neptune.

Time series projects with Weights & Biases

Weights & Biases (W&B) is a machine learning platform, similar to neptune.ai, aimed at developers to help them build better models faster. It’s intended to support and optimize key MLOps life cycle steps such as model management, experiment tracking, and dataset versioning.

Like neptune.ai, this tool can be useful when working on time series projects, providing features for tracking and managing time series models. More about Weights & Biases is presented in their documentation.

ML experiment tracking with Weights and Biases | Source

Time series forecasting packages

Probably the most important part of a time series project is forecasting. Forecasting is the process of predicting future events based on current and past data. It’s based on the assumption that the future can be inferred from the past, that is, that there are patterns in the data that can be used to predict what will happen next.

There are many methods for time series forecasting, from simple ones such as linear regression and ARIMA-based models up to complex multilayer neural networks or ensemble models. Here, we’ll present some packages that support different kinds of models.

Time series forecasting with Statsmodels

Statsmodels is a package that we’ve already mentioned in the section about data visualization tools. However, it is even more relevant for forecasting. Basically, this package provides a range of statistical models and hypothesis tests.

Statsmodels package also includes model classes and functions for time series analysis. Autoregressive moving average models (ARMA) and vector autoregressive models (VAR) are examples of basic models. Markov switching dynamic regression and autoregression are examples of non-linear models. It also includes time series descriptive statistics such as autocorrelation, partial autocorrelation function, and periodogram, as well as the theoretical properties of ARMA or related processes.

How to get started with time series using the Statsmodels package is described below:

Statsmodels documentation

Time series forecasting with Pmdarima

Pmdarima is a statistical library that facilitates the modelling of time series using ARIMA-based methods. Aside from that, it has other features such as:

A set of statistical tests for stationarity and seasonality
Various endogenous and exogenous transformers including Box-Cox and Fourier transformations
Decompositions of seasonal time series, cross-validation utilities, and other tools

Maybe the most useful utility of this library is the Auto-Arima module that searches over all possible ARIMA models within the constraints provided and returns the best one, based on either AIC or BIC value.

More about Pmdarima is presented here:

Pmdarima: ARIMA estimators for Python

Pmdarima PyPI

Time series forecasting with Sklearn

Sklearn or Scikit-Learn is for sure one of the most commonly used machine learning packages in Python. It provides various classification, regression, and clustering methods including random forest, support vector machine, k-means, and others. Besides that, it provides some utilities related to dimensionality reduction, model selection, data preprocessing, and much more.

In addition to various models, it also offers functionality that is useful for time series, such as pipelines, time series cross-validation functions, and diverse metrics for measuring results.

Time series split using Sklearn | Source
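The idea behind time series cross-validation is that each fold trains only on observations that come before the ones it is tested on. The sketch below is a hand-rolled NumPy version of such an expanding-window split, following the same idea as scikit-learn’s TimeSeriesSplit; the function name `expanding_window_splits` is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    indices = np.arange(n_samples)
    for k in range(n_splits):
        # Each fold's test block starts where the previous one ended
        test_start = n_samples - (n_splits - k) * test_size
        yield indices[:test_start], indices[test_start:test_start + test_size]

splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Unlike a shuffled k-fold split, no fold ever sees future observations during training, which avoids leaking information from the future into the model.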

More about this library can be found below:

Scikit-learn website

Tutorials Point: Scikit-learn

Time series forecasting with PyTorch

PyTorch is a Python-based deep learning library for fast and flexible experimentation. It was originally developed by researchers and engineers working on Facebook’s AI research team and then open-sourced. Deep learning software such as Tesla Autopilot, Uber’s Pyro, and Hugging Face’s Transformers are built on top of PyTorch.

With PyTorch, it’s possible to build powerful recurrent neural network models such as LSTM and GRU and forecast time series. Also, there is a PyTorch Forecasting package with state-of-the-art network architectures. It also includes a time series dataset class that abstracts handling variable transformations, missing values, randomized subsampling, multiple history lengths, and other similar issues. More about this is presented below:

Github: PyTorch forecasting

PyTorch website

Time series forecasting with Tensorflow (Keras)

TensorFlow is an open-source software library for machine learning, based on data flow graphs. It was originally developed by the Google Brain team for internal use, but later it was released as an open-source project. The software library provides a set of high-level data flow operators that can be combined to express complex computations involving multidimensional data arrays, matrices, and higher-order tensors in a natural way. It also provides some lower-level primitives such as kernels that are used to construct custom operators or to speed up the execution of common operations.

Keras is a high-level API that is built on top of TensorFlow. Using Keras and TensorFlow, it is possible to build neural network models for time series forecasting. One example of a time series project using a weather time series dataset is explained in the tutorial below:

TensorFlow time series tutorial

Time series forecasting with Sktime

Sktime is an open-source Python library for time series and machine learning. It includes the algorithms and transformation tools needed to solve time series regression, forecasting, and classification tasks efficiently. Sktime was created to work with scikit-learn and make it easy to adapt algorithms for interrelated time series tasks as well as build composite models.

Overall, this package provides:

State-of-the-art algorithms for time series forecasting
Transformations for time series such as detrending or deseasonalization and similar
Pipelines for models and transformations, model tuning utilities, and other useful functionalities

How to get started with this library is described here:

Sktime documentation

Time series forecasting with Prophet

Prophet is an open-source library released by Facebook’s Core Data Science team. Briefly, it is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit together with yearly, weekly, and daily seasonality, as well as holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. It is robust to missing data and shifts in the trend, and it typically handles outliers well.

More about Prophet library is presented below:

Github: Facebook Prophet

Time series forecasting with Pycaret

PyCaret is an open-source machine learning library in Python that automates machine learning workflows. With PyCaret it’s possible to build and test several machine learning models with minimal effort and a few lines of code.

Basically, with minimal code, not going deep into the details, it’s possible to build an end-to-end machine learning project from EDA to deployment.

This library has some useful time series models among which are:

Seasonal Naive Forecaster
ARIMA
Polynomial Trend Forecaster
Lasso Net with deseasonalize and detrending options and many others
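The simplest model in this list, the seasonal naive forecaster, just repeats the last observed season into the future, which makes it a useful baseline. The sketch below shows the idea in plain NumPy; it is not PyCaret’s implementation, and the function name `seasonal_naive_forecast` and the data are assumptions made for illustration.

```python
import numpy as np

def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast by repeating the most recent full season of observations."""
    history = np.asarray(history, dtype=float)
    last_season = history[-season_length:]
    repeats = int(np.ceil(horizon / season_length))
    return np.tile(last_season, repeats)[:horizon]

# Two "years" of quarterly data with a clear seasonal shape
history = [10, 20, 30, 40, 12, 22, 32, 42]
forecast = seasonal_naive_forecast(history, season_length=4, horizon=6)
```

Any more sophisticated model should at least beat this baseline on held-out data.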

Anomaly detection using PyCaret | Source

More about PyCaret can be found here:

PyCaret documentation

New time series with PyCaret

Time series forecasting with AutoTS

AutoTS is a time series package for Python, designed to automate time series forecasting. It can be used to find the best time series forecasting model for both univariate and multivariate time series. AutoTS can also clean the data of NaN values and outliers by itself.

Nearly 20 predefined models like ARIMA, ETS, VECM are available, and using genetic algorithms, it finds the best models, preprocessing, and ensembling for a given dataset.

Some tutorials about this package are:

Github: AutoTS

Hands-on guide to AutoTS: effective model selection for multiple time series

Time series forecasting with Darts

Darts is a Python library that allows simple manipulation and forecasting of time series. It includes a wide range of models, from classics like exponential smoothing (ES) and ARIMA up to RNNs and transformers. All of the models can be used in the same way, through fit() and predict() functions, similar to the scikit-learn package.

The library also allows easy backtesting of models, combining predictions from multiple models, and incorporating external data. It supports both univariate and multivariate models. The table of all available models as well as several examples can be found here:

Darts documentation

Time series forecasting with Kats

Kats is a package released by Facebook’s Infrastructure Data Science team, intended to perform time series analysis. The goal of this package is to provide everything needed for time series analysis, including detection, forecasting, feature extraction/embedding, multivariate analysis, and so on.

Kats provides a comprehensive set of forecasting tools, such as ensembling, meta-learning models, backtesting, hyperparameter tuning, and empirical prediction intervals. Also, it includes features for detecting seasonalities, outliers, change points, and slow trend changes in time series data. With the TSFeature option, it’s possible to generate 65 features with clear statistical definitions that can be used in most machine learning models.

More about Kats package is described below:

Github: Kats

Kats website

Forecasting libraries comparison

In order to easily compare forecasting packages and have a high-level overview, here is a table with some common features. It shows some metrics such as GitHub stars, year of release, supporting features, and similar.

Library	Year of release	GitHub stars	Statistics & econometrics	Machine learning	Deep learning
Statsmodels	2010	7200	++	+
Pmdarima	2018	1100	+	+
Sklearn	2007	50000	+	++	+
PyTorch	2016	55000		++	+
TensorFlow	2015	164000		+	++
Sktime	2019	5000	+	+
Prophet	2017	14000	+	+
PyCaret	2020	5500	+	+	+
AutoTS	2020	450	+	+
Darts	2021	3800	+	+	+
Kats	2021	3600	+	+

Conclusion

In this post, we described the most commonly used tools, packages, and libraries for time series projects. With this list of tools, it’s possible to cover almost any project related to time series. On top of that, we provided a comparison of libraries for forecasting that shows some interesting stats, such as year of release, popularity level, and what kind of models they support.

If you want to dive deeper into the area of time series, there is a collection of different packages that can be used to process time series: “Github: using Python to work with time series data“.

For those who would like to learn more about time series in general with a theoretical approach, a great choice is the book “New Introduction to Multiple Time Series Analysis” by Prof. Dr. Helmut Lütkepohl.

ARIMA vs Prophet vs LSTM for Time Series Prediction

Konstantin Kutzkov — Tue, 13 Sep 2022 14:28:34 +0000

Assuming we subscribe to a linear understanding of time and causality, as Dr. Sheldon Cooper says, then representing historical events as a series of values and features observed over time provides the foundations for learning from the past. However, time series are somewhat different from other datasets, including sequential data like text or DNA sequences.

The time component provides additional information that can be useful when predicting the future. Thus, there are many different techniques designed specifically for dealing with time series. Such techniques range from simple visualization tools that show trends evolving or repeating over time to advanced machine learning models that utilize the specific structure of time series.

In this post, we will discuss three popular approaches to learning from time-series data:

1 The classic ARIMA framework for time series prediction
2 Facebook’s in-house model Prophet, which is specifically designed for learning from business time series
3 The LSTM model, a powerful recurrent neural network approach that has been used to achieve the best-known results for many problems on sequential data

We will then show how to compare the results across the three models using neptune.ai and its powerful features.

Let’s start with a brief overview of the three methods.

Overview of the three methods: ARIMA, Prophet, and LSTM

ARIMA

ARIMA is a class of time series prediction models, and the name is an abbreviation for AutoRegressive Integrated Moving Average. The backbone of ARIMA is a mathematical model that represents the time series values using its past values. This model is based on two main features:

Past Values: Clearly, past behaviour is a good predictor of the future. The only question is how many past values we should use. The model uses the last p time series values as features. Here p is a hyperparameter that needs to be determined when we design the model.
Past Errors: The model can use the information on how well it has performed in the past. Thus, we add as features the most recent q errors the model made. Again, q is a hyperparameter.

An important aspect here is that the time series needs to be transformed so that the model becomes independent of seasonal or temporary trends. The formal term for this is that we want the model to be trained on a stationary time series. In the most intuitive sense, stationarity means that the statistical properties of a process generating a time series do not change over time. It does not mean that the series does not change over time, just that the way it changes does not itself change over time.

There are several approaches to making a time series stationary, the most popular being differencing. By replacing the n values in the series with the n-1 differences, we force the model to learn more advanced patterns. When the model predicts a new value, we simply add the last observed value to it in order to obtain a final prediction. Stationarity can be somewhat confusing if you encounter the concept for the first time. You can refer to this tutorial for more details.

Parameters

Formally, ARIMA is defined by three parameters p, d, and q that describe the three main components of the model.

Integrated (the I in ARIMA): The number of differences needed to achieve stationarity is given by the parameter d. Let the original features be Y_t where t is the index in the sequence. We create a stationary time series using the following transformations for different values of d.

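For d=0

y_t = Y_t, where y_t denotes the transformed series.

In this case the series is already stationary and we have nothing to do.

For d=1

y_t = Y_t - Y_{t-1}

This is the most typical transformation.

For d=2

y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}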

Observe that differencing can be seen as a discrete version of differentiation. For d=1 the new features represent how the values change, while for d=2 they represent the rate of that change, just like the second derivative in calculus. The above can be generalized to d>2 as well, but this is rarely used in practice.
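Differencing for d=1 and d=2 can be tried out directly with NumPy. The sketch below uses a small made-up series and also shows how a prediction made on the differenced scale is mapped back to the original scale by adding the last observed value; the predicted difference of 6.0 is a hypothetical model output, not the result of an actual fit.

```python
import numpy as np

Y = np.array([3.0, 5.0, 8.0, 12.0, 17.0])

d1 = np.diff(Y, n=1)   # first differences: how the values change
d2 = np.diff(Y, n=2)   # second differences: how the change itself changes

# Differencing loses the level of the series; the first value restores it
restored = np.concatenate(([Y[0]], Y[0] + np.cumsum(d1)))

# Suppose a model trained on d1 predicts the next difference to be 6.0
predicted_difference = 6.0                      # hypothetical model output
final_prediction = Y[-1] + predicted_difference
```

Here the second differences are constant, which suggests that differencing twice has removed the trend from this toy series.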

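AutoRegressive (AR): The parameter p tells us how many past values to consider for the expression of the current value. Essentially, we learn a model that predicts the value at time t as a weighted sum of the p most recent values of the (differenced) series y_t. In the standard formulation:

y_t = c + φ_1 y_{t-1} + φ_2 y_{t-2} + ... + φ_p y_{t-p} + ε_t

Moving Average (MA): The parameter q tells us how many of the forecast errors in the past should be considered. A new value is computed as:

y_t = μ + ε_t + θ_1 ε_{t-1} + θ_2 ε_{t-2} + ... + θ_q ε_{t-q}

The past prediction errors are the differences between the observed values and the values the model predicted for them:

ε_t = y_t - ŷ_t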

The combination of the three components gives the ARIMA(p, d, q) model. More precisely, we first integrate the time series, and then we add the AR and MA models and learn the corresponding coefficients.
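As a rough illustration of how the three components interact at prediction time, the sketch below computes a one-step forecast by hand. It assumes the coefficients are already known (in practice they are learned from the data), and the helper arma_one_step as well as all numbers are hypothetical.

```python
def arma_one_step(history, past_errors, ar_coefs, ma_coefs, const=0.0):
    """One-step ARMA(p, q) prediction on an already differenced series.

    history     -- observed (differenced) values, oldest first
    past_errors -- previous one-step forecast errors, oldest first
    ar_coefs    -- [phi_1, ..., phi_p]
    ma_coefs    -- [theta_1, ..., theta_q]
    """
    p, q = len(ar_coefs), len(ma_coefs)
    recent_values = history[::-1][:p]      # y_{t-1}, y_{t-2}, ...
    recent_errors = past_errors[::-1][:q]  # e_{t-1}, e_{t-2}, ...
    ar_part = sum(phi * y for phi, y in zip(ar_coefs, recent_values))
    ma_part = sum(theta * e for theta, e in zip(ma_coefs, recent_errors))
    return const + ar_part + ma_part


original = [10.0, 11.0, 13.0, 17.0]                          # Y_t
differenced = [b - a for a, b in zip(original, original[1:])]  # d=1
errors = [0.5, -1.0]                                         # made-up past errors

# ARIMA(2, 1, 1) with made-up coefficients: p=2 past values, q=1 past error.
predicted_diff = arma_one_step(differenced, errors, ar_coefs=[0.5, 0.25], ma_coefs=[0.4])
forecast = original[-1] + predicted_diff  # undo the differencing

print(round(predicted_diff, 2), round(forecast, 2))
```

The AR part uses the two most recent differences, the MA part uses the most recent error, and the "I" step is undone by adding the last observed value.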

Prophet

Prophet FB was developed by Facebook as an algorithm for the in-house prediction of time series values for different business applications. Therefore, it is specifically designed for the prediction of business time series.

It is an additive model consisting of four components:

y(t) = g(t) + s(t) + h(t) + ε_t

Let us discuss the meaning of each component:

g(t): It represents the trend and the objective is to capture the general trend of the series. For example, the number of advertisement views on Facebook is likely to increase over time as more people join the network. But what would be the exact function of increase?
s(t): It is the Seasonality component. The number of advertisement views might also depend on the season. For example, in the Northern hemisphere during the summer months, people are likely to spend more time outdoors and less time in front of their computers. Such seasonal fluctuations can be very different for different business time series. The second component is thus a function that models seasonal trends.
h(t): The Holidays component. We use the information for holidays which have a clear impact on most business time series. Note that holidays vary between years, countries, etc. and therefore the information needs to be explicitly provided to the model.
The error term ε_t stands for random fluctuations that cannot be explained by the model. As usual, it is assumed that ε_t follows a normal distribution N(0, σ²) with zero mean and unknown variance σ² that has to be derived from the data.
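The additive structure can be sketched in a few lines of plain Python. This is only a toy illustration of adding up components, not Prophet's implementation: Prophet fits a piecewise-linear or logistic trend and a Fourier-series seasonality, whereas the functions below (all hypothetical, with made-up constants) use a straight line, a single sinusoid, and a fixed holiday shift.

```python
import math


def trend(t, slope=0.5, intercept=100.0):
    """g(t): a straight line standing in for the fitted trend."""
    return intercept + slope * t


def seasonality(t, period=7, amplitude=10.0):
    """s(t): a single weekly sinusoid standing in for the seasonal component."""
    return amplitude * math.sin(2 * math.pi * t / period)


def holiday_effect(t, holidays, effect=-20.0):
    """h(t): a fixed shift applied on the days listed as holidays."""
    return effect if t in holidays else 0.0


def additive_model(t, holidays):
    """g(t) + s(t) + h(t), i.e. y(t) without the noise term."""
    return trend(t) + seasonality(t) + holiday_effect(t, holidays)


holidays = {14}  # hypothetical: day 14 is a public holiday
for day in (0, 7, 14):
    print(day, round(additive_model(day, holidays), 2))
```

Because the components are simply added, each one can be inspected on its own, which is what makes this kind of model comparatively easy to interpret.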

LSTM recurrent neural networks

LSTM stands for Long short-term memory. LSTM cells are used in recurrent neural networks that learn to predict the future from sequences of variable lengths. Note that recurrent neural networks work with any kind of sequential data and, unlike ARIMA and Prophet, are not restricted to time series.

The main idea behind LSTM cells is to learn the important parts of the sequence seen so far and forget the less important ones. This is achieved by the so-called gates, i.e., functions that have different learning objectives such as:

a compact representation of the time series seen so far
how to combine new input with the past representation of the series
what to forget about the series
what to output as a prediction for the next time step.

See Figure 1 and the Wikipedia article for more details.
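For readers who prefer code to diagrams, here is a minimal sketch of one time step of an LSTM cell with a single unit and a scalar input, following the standard LSTM equations. The weight names (wf, uf, bf, and so on) and their values are our own placeholders; a real layer learns them and works with vectors and matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell with scalar input x."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])  # what to forget
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])  # how much new input to admit
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])  # what to output
    c_tilde = math.tanh(w["wc"] * x + w["uc"] * h_prev + w["bc"])  # candidate memory
    c = f * c_prev + i * c_tilde  # compact representation of the series so far
    h = o * math.tanh(c)          # output passed on to the next step
    return h, c


# Placeholder weights, not learned values.
weights = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wc", "uc")}
weights.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bc": 0.0})

h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:  # a short input sequence
    h, c = lstm_step(x, h, c, weights)

print(h, c)
```

The four gate computations correspond to the learning objectives listed above: c is the running representation, i controls how new input is combined with it, f controls forgetting, and o controls the output.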

Designing an optimal LSTM based model can be a difficult task that requires careful hyperparameter tuning. Here is the list of the most important parameters an LSTM based model needs to consider:

How many LSTM cells are to use in order to represent the sequence? Note that each LSTM cell will focus on specific aspects of the time series processed so far. A few LSTM cells are unlikely to capture the structure of the sequence while too many LSTM cells might lead to overfitting.
It is typical that first, we convert the input sequence into another sequence, i.e., the values h_t. This yields a new representation, as the h_t states capture the structure of the series processed so far. But at some point, we won’t need all h_t values but rather only the last h_t. This allows us to feed the last h_t of the individual LSTM cells into a fully connected layer, as each of them corresponds to the final output of one LSTM cell. Designing the exact architecture might require careful fine-tuning and many trials.

Figure 1: the structure of an LSTM cell | Source

Finally, we would like to reiterate that recurrent neural networks are a general class of methods for learning from sequential data and they can work with arbitrary sequences such as natural text or audio.

Experimental evaluation: ARIMA vs Prophet vs LSTM

Dataset

We are going to use stock exchange data for Bajaj Finserv Ltd, an Indian financial services company, in order to compare the three models. The dataset spans the period from 2008 until the end of 2021. It contains the daily stock price (mean, low, and high values) as well as the total volume and the turnover of traded stocks. A subsample of the dataset is shown in Figure 2.

Figure 2: the data used for evaluation | Source: Author

We are interested in predicting the Volume Weighted Average Price (VWAP) variable at the end of each day. A graph of the time series VWAP values is presented in Figure 3.

Figure 3: the daily values of the VWAP variable | Source: Author

For the evaluation, we divided the time series into a train and test time series where the training series consists of the data until the end of 2018 (see Figure 4).

Total number of observations: 3201

Training observations: 2624

Test observations: 577

Figure 4: the train and test subsets of the VWAP time series | Source: Author

Implementation

In order to work properly, machine learning models require good data, and for this we will do a little feature engineering. The objective behind feature engineering is to design more powerful models that exploit different patterns in the data. As the three models learn patterns observed in the past, we create additional features that thoroughly describe the recent trends of the stock movements.

In particular, we track the moving average for the different trade features over a period of 3, 7, and 30 days. In addition, we consider features such as the month, the week number, and the weekday. Thus, the input to our models is multidimensional. A small example of the used feature engineering looks as follows:

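import numpy as np

# Rolling mean over a 7-day window; shift(1) keeps today's value out of today's feature
lag_features = ["High", "Low", "Volume", "Turnover", "Trades"]
df_rolled_7d = df[lag_features].rolling(window=7, min_periods=0)
df_mean_7d = df_rolled_7d.mean().shift(1).reset_index().astype(np.float32)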

The above code excerpt shows how to add the running mean over the last week of several features describing the sales of the stock. Overall, we create a set of exogenous features, referred to as exogenous_features in the code that follows.
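To show what these features contain, here is a small library-free sketch. The helpers lagged_rolling_mean and calendar_features are our own illustrations (not part of the original pipeline) of the two ideas described: a moving average that only looks at earlier days, and calendar fields derived from the date.

```python
from datetime import date


def lagged_rolling_mean(values, window):
    """Mean of up to `window` values strictly before each position.

    The first entry has no history and is None, mirroring the effect of
    shift(1) in the pandas excerpt: a day's feature never sees that day's value.
    """
    features = []
    for idx in range(len(values)):
        past = values[max(0, idx - window):idx]
        features.append(sum(past) / len(past) if past else None)
    return features


def calendar_features(day):
    """Month, ISO week number, and weekday (Monday=0) for a date."""
    return {"month": day.month, "week": day.isocalendar()[1], "weekday": day.weekday()}


volumes = [1, 2, 3, 4, 5]  # made-up daily values
rolling_3d = lagged_rolling_mean(volumes, window=3)
features = calendar_features(date(2021, 1, 4))

print(rolling_3d)
print(features)
```

Using only past values matters: if the current day's value leaked into its own feature, the model would look better in evaluation than it could ever be in practice.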

Now, let’s get started with our main models:

ARIMA

We used the ARIMA implementation from the publicly available package pmdarima. The function auto_arima accepts as an additional parameter a list of exogenous features where we provide the features created in the feature engineering step. The main advantage of auto_arima is that it first performs several tests in order to decide if the time series is stationary or not. Also, it employs a smart search strategy that determines suitable values for the parameters p, d, and q discussed in the previous section.

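from pmdarima import auto_arima

# Recent pmdarima releases take exogenous features as X (older releases used exogenous=)
model = auto_arima(
	df_train["VWAP"],
	X=df_train[exogenous_features],
	trace=True,  # print each candidate model tried during the search
	error_action="ignore",
	suppress_warnings=True)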

With trace=True, the search over different values of the parameters p, d, and q is printed as it runs. In the end, the model with the smallest AIC value is returned. (The AIC, or Akaike information criterion, scores a model by trading off goodness of fit against the number of parameters, so a lower value indicates a better balance between accuracy and complexity.)
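The trade-off behind the AIC is easy to see from its formula, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below uses made-up log-likelihood values for two hypothetical candidates; it is not output from the search on the stock data.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: the larger model fits slightly better
# (higher log-likelihood) but pays for its three extra parameters.
candidates = {
    "ARIMA(1,1,1)": aic(log_likelihood=-1200.0, n_params=3),
    "ARIMA(3,1,2)": aic(log_likelihood=-1198.5, n_params=6),
}
best = min(candidates, key=candidates.get)

print(candidates)
print(best)
```

Here the small gain in fit does not justify the additional parameters, so the simpler model is preferred.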

Predictions on the test set are then obtained by

# df_valid holds the held-out test period
forecast = model.predict(n_periods=len(df_valid), X=df_valid[exogenous_features])

Prophet

We use the publicly available Python implementation of Prophet. The input data must contain two specific fields, which Prophet expects under the column names ds and y (hence the renaming in the code that follows):

Date (ds): should be a valid calendar date from which the holidays can be computed
Y (y): the target variable we want to predict.

We instantiate the model as:

from prophet import Prophet
model = Prophet()

The features created during feature engineering have to be explicitly added to the model as follows:

for feature in exogenous_features:
	model.add_regressor(feature)

Finally, we fit the model:

model.fit(df_train[["Date", "VWAP"] + exogenous_features].rename(columns={"Date": "ds", "VWAP": "y"}))

And the forecast for the test set is obtained as:

forecast = model.predict(df_test[["Date", "VWAP"] + exogenous_features].rename(columns={"Date": "ds"}))

LSTM

We used the Keras implementation of LSTMs:

import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.metrics import MeanAbsoluteError, RootMeanSquaredError
from tensorflow.keras.models import Sequential

The model is defined by the following function.

def get_model(params, input_shape):
	model = Sequential()
	model.add(LSTM(units=params["lstm_units"], return_sequences=True, input_shape=(input_shape, 1)))
	model.add(Dropout(rate=params["dropout"]))

	model.add(LSTM(units=params["lstm_units"], return_sequences=True))
	model.add(Dropout(rate=params["dropout"]))

	model.add(LSTM(units=params["lstm_units"], return_sequences=True))
	model.add(Dropout(rate=params["dropout"]))

	model.add(LSTM(units=params["lstm_units"], return_sequences=False))
	model.add(Dropout(rate=params["dropout"]))

	model.add(Dense(1))

	model.compile(
		loss=params["loss"],
		optimizer=params["optimizer"],
		metrics=[RootMeanSquaredError(), MeanAbsoluteError()],
	)

	return model

Then we instantiate a model with a given set of parameters. We use the past 90 observations in the time series as a sequence for the input to the model. The other hyperparameters describe the architecture and the specific choices for training the model.

params = {
	"loss": "mean_squared_error",
	"optimizer": "adam",
	"dropout": 0.2,
	"lstm_units": 90,
	"epochs": 30,
	"batch_size": 128,
	"es_patience" : 10
}

model = get_model(params=params, input_shape=x_train.shape[1])

The above results in the following Keras model (see Figure 5):

Figure 5: a summary of the Keras LSTM model | Source: Author
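The article does not show how x_train and y_train are built from the series. A common approach, sketched below under that assumption, is a sliding window: each training example consists of 90 consecutive observations, and the target is the observation that follows. The helper make_windows is hypothetical and uses plain lists for clarity.

```python
def make_windows(series, window):
    """Split a series into (input window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])  # `window` past observations
        targets.append(series[start + window])       # the value to predict
    return inputs, targets


series = list(range(100))  # stand-in for the VWAP values
x_windows, y_values = make_windows(series, window=90)

print(len(x_windows), len(x_windows[0]), y_values[0])
```

For Keras, the inputs would additionally need to be converted to an array of shape (samples, 90, 1), matching input_shape=(input_shape, 1) in get_model.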

We then create a callback to implement early stopping i.e. to stop training the model if it yields no improvement on the validation dataset for a given number of epochs (in our case 10):

es_callback = tf.keras.callbacks.EarlyStopping(
	monitor="val_root_mean_squared_error",
	mode="min",
	patience=params["es_patience"],
)

The parameter es_patience refers to the number of epochs for early stopping.
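The logic behind patience can be sketched without Keras. The function below is a simplified illustration of the idea, not the Keras implementation: it scans a list of per-epoch validation errors (made up here) and reports the epoch at which training would stop.

```python
def early_stopping_epoch(val_errors, patience):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best:
            best = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # no improvement for `patience` epochs in a row
    return len(val_errors) - 1  # ran through all epochs


val_errors = [5.0, 3.0, 3.5, 3.2, 3.1, 2.9]  # hypothetical validation RMSE per epoch
stop_short = early_stopping_epoch(val_errors, patience=3)
stop_long = early_stopping_epoch(val_errors, patience=10)

print(stop_short, stop_long)
```

With a patience of 3, training stops at epoch 4 and never reaches the better value at epoch 5, which is why a too-small patience can end training prematurely.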

Finally, we fit the model using the predefined parameters:

model.fit(
	x_train,
	y_train,
	validation_data=(x_test, y_test),
	epochs=params["epochs"],
	batch_size=params["batch_size"],
	verbose=1,
	callbacks=[neptune_callback, es_callback]
)

Experiment tracking and model comparison

Since we want to answer the simple question of which model yields the most accurate predictions for the test dataset, we need to see how these three models fare against each other.

There are many different approaches for model comparisons such as creating tables and charts that record the evaluation of different metrics, creating graphs that plot the predicted values vs the true values on a test set, etc. However, for this exercise, we will be using neptune.ai.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

It’s an experiment tracker built for teams that run a lot of experiments. It gives you a single place to log, store, display, organize, compare, and query all your model-building metadata.

We first create a Neptune project and record the API token of our account. You can check a detailed tutorial on how to do it in the Neptune documentation.

import neptune

# Create a Neptune run object
run = neptune.init_run(
    project="your-workspace-name/your-project-name",  
    api_token="YourNeptuneApiToken",  
)

The variable run can be seen as a folder in which we can create subfolders containing different information. For example, we can create a subfolder called model and record in it the name of the model:

run["model/name"] = "Arima"

We will compare the accuracy of these models with respect to two different metrics:

The root mean square error (RMSE)

The mean absolute error (MAE)
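Both metrics are straightforward to compute by hand. The sketch below uses plain Python and made-up numbers; the function names are ours, and in practice a library such as scikit-learn would typically be used.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


y_true = [100, 102, 104, 106]  # made-up true values
y_pred = [101, 102, 101, 106]  # made-up predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)

print(mae, round(rmse, 4))
```

The single large error (3) weighs more heavily in the RMSE than in the MAE, which is why the two metrics can rank models differently.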

Note that these values can be logged into Neptune by setting the corresponding values, for example, setting:

run["test/mae"] = mae
run["test/rmse"] = rmse

The mean square error and the mean absolute error for the three models can be seen next to each other in the runs table:

Figure 6. the MSE and the MAE for the three models in the Neptune web app
(the tags for each project are at the top) | See in the Neptune app

The comparison of the three algorithms can be then seen side by side in Neptune, as shown in Figure 7.

Figure 7: The mean square error and the mean absolute error for the three models can be seen next to each other
(the tags for each project are at the top) | See in the Neptune app

We see that ARIMA yields the best performance, i.e., it achieves the smallest mean square error and mean absolute error on the test set. In contrast, the LSTM neural network performs the worst of the three models.

The exact predictions plotted against the true values can be seen in the following images. We observe that all three models capture the overall trend of the time series but the LSTM appears to be running behind the curve, i.e., it needs more time to adjust itself to a change in trend. Prophet appears to lose against ARIMA in the last few months of the considered test period, where it underestimates the true values.

Figure 8: ARIMA predictions | Source: Author

Figure 9: Prophet predictions | Source: Author

Figure 10: LSTM predictions | Source: Author

A deeper look into the performance of the models

ARIMA grid-search

When doing grid-search over different values for p, d, and q in ARIMA, we can plot the individual values for the mean squared error. The colored dots in Figure 11 show the mean square error values for different ARIMA parameters over a validation set.

Figure 11: grid-search over the ARIMA parameters | See in the Neptune app

Trends in Prophet

We collect in Neptune the parameters, forecast data frames, residual diagnostic charts, and other metadata while training models with Prophet. This is achieved using a single function that captures Prophet training metadata and logs it automatically to Neptune.

In Figure 12, we show the change of the different components of the Prophet model. We observe that the trend follows a linear increase while the seasonal components exhibit fluctuations.

Figure 12: the change of values of the different components in the Prophet model over time | Source: Author

Why did LSTM fare the worst?

We collect in Neptune the mean absolute error while training the LSTM model over several epochs. This is achieved using a Neptune callback which captures Keras training metadata and logs it automatically to Neptune. The results are shown in Figure 13.

Observe that while the error on the training dataset decreases over subsequent epochs, this is not the case for the error on the validation set, which reaches its minimum in the second epoch and then fluctuates. This shows that the LSTM model is too complex for a rather small dataset and is prone to overfitting. Despite adding regularization such as dropout, we still can’t avoid overfitting.

Figure 13: the evolution of train and test error over different epochs of training the LSTM model | See in the Neptune app

Conclusions

In this blog post, we presented and compared three different algorithms for time series prediction. As expected, there is no clear winner and each algorithm has its own advantages and limitations. Below we summarize our observations for each algorithm:

ARIMA is a powerful model and as we saw it achieved the best result for the stock data. A challenge is that it might need careful hyperparameter tuning and a good understanding of the data.
Prophet is specifically designed for business time series prediction. It achieves very good results for the stock data but, speaking from anecdotes, it can fail spectacularly on time series datasets from other domains. In particular, this holds for time series where the notion of calendar date is not applicable and we cannot learn any seasonal patterns. Prophet’s advantage is that it requires less hyperparameter tuning as it is specifically designed to detect patterns in business time series.
LSTM-based recurrent neural networks are probably the most powerful approach to learning from sequential data and time series are only a special case. The potential of LSTM based models is fully revealed when learning from massive datasets where we can detect complex patterns. Unlike ARIMA or Prophet, they do not rely on specific assumptions about the data such as time series stationarity or the existence of a Date field. A disadvantage is that LSTM based RNNs are difficult to interpret and it is challenging to gain intuition into their behaviour. Also, careful hyperparameter tuning is required in order to achieve good results.

Future directions

I hope you enjoyed reading this article and now have a better understanding of the time series algorithms discussed here. If you want to dig deeper, here are some useful resources. Happy experimenting!

PMD ARIMA. The documentation for the respective Python package.
Prophet. Documentation and tutorial for Facebook Prophet.
Keras LSTM. Documentation and examples for LSTM RNNs in Keras.
Neptune. The Neptune website with tutorials and documentation.
A blog post on ML experiment tracking with neptune.ai.
A deeper overview of ARIMA models.
A tutorial on time series prediction with LSTM RNNs.
The original Prophet research paper.

How to Select a Model For Your Time Series Prediction Task [Guide]

Joos Korstanje — Tue, 13 Sep 2022 14:22:56 +0000

Are you working with time series data and seeking the most effective models? This guide explains how to select and evaluate time series models based on predictive performance—including classical, supervised, and deep learning-based models.

After comparing models and selecting the right one for our task, we’ll build models for stock market forecasting, benchmarking each to identify the best-performing approach.

Understanding time series datasets and forecasting

Most data sets that practitioners work with are based on independent observations. For example, given a website, you could track each visitor; each data point (e.g. row in a table) would represent an individual observation about each visitor. If we assign each visitor a “User ID,” each ID will be independent of the other visitors.

Example of a dataset with independent observations | Source: Author

In contrast, time series data are unique because they measure one or more variables as they change over time, creating dependencies between data points. Unlike typical datasets with independent observations, each time point in a time series dataset is related to its predecessors. This impacts the choice of machine learning algorithms we should use.

Example of a dataset with dependent observations taken over time | Source: Author

Key aspects of time series modeling

Before we dive into the models themselves, let’s make sure we have a good understanding of the nature of time series data and how modeling them differs from other types of data.

Univariate versus multivariate time series models

In time series data, timestamps hold intrinsic meaning. Univariate time series models use only one variable (the target variable) and its variation over time to make future predictions.

In contrast, multivariate time series models include additional variables. For instance, if you want to forecast product demand, you might consider including weather data as an influencing factor. Multivariate models extend univariate models by integrating these additional (or external) variables.

Univariate time series models	Multivariate time series models
Use only one variable	Use multiple variables
Cannot use external data	Can use external data
Based only on relationships between past and present	Based on relationships between past and present, and between variables

Suppose you want to look at the patterns (or changes over time) within your data to understand it better and make predictions. In this case, you need to understand the temporal variations you’ll encounter: seasonality, trend, and noise. The next topic we’ll discuss, time series decomposition, is a technique used to separate these components so you can analyze them individually.

Time series decomposition

You can decompose the time series to extract different types of variation from your dataset. This will extract three key features from your data:

Seasonality is a recurring pattern based on time periods (such as seasons of the year). For example, temperatures typically rise in the summer and fall in the winter. You can use this predictable pattern to help predict future values.
Trends reflect long-term increases or decreases in your data. Going back to our temperature example, you could observe a gradual upward trend due to global warming, layered on top of seasonal variations.
Noise is random variability that doesn’t follow seasonality or trend. It represents unpredictable fluctuations in the data, meaning no model can fully account for it (that’s why it’s also often called the “error” or “residual” in the data).

Time series decomposition in Python

Here’s a quick example of how to decompose a time series in Python using the CO2 dataset from the statsmodels library.

Before diving into the code, make sure you have the required dependencies installed. Run the following commands to set up your environment:

# Install the relevant libraries
!pip install numpy pandas matplotlib statsmodels scikit-learn xgboost pmdarima tensorflow yfinance neptune

Now, import the dataset:

# Import the CO2 dataset
import statsmodels.datasets.co2 as co2

co2_data = co2.load().data
print(co2_data)

The dataset includes a time index (weekly dates) and CO2 measurements, shown below.

Example measurements and timestamps from the statsmodels CO2 dataset | Source: author

There are a few missing (NA) values. To handle these, you can use interpolation like this:

# Handle missing values with interpolation
co2_data = co2_data.fillna(co2_data.interpolate())

Next, plot the CO2 values over time to see the temporal trend:

# Plot CO2 values over time
co2_data.plot()

This will generate the following plot:

Plot of the CO2 time series from the statsmodels CO2 dataset | Source: Author

To decompose the time series into trend, seasonality, and noise (labeled as “residual”), use the seasonal_decompose function from statsmodels:

from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(co2_data)
result.plot()

The CO2 time series decomposed into seasonality, trend, and residual (noise) | Source: Author

In this decomposition, the CO2 data reveals an upward trend (visible in the observed series in the first plot and isolated in the second) and strong seasonality (see the repeating pattern in the third plot).
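To see what such a decomposition does under the hood, here is a plain-Python sketch of the classical additive recipe based on moving averages, applied to a small synthetic series. It is an illustration of the idea and not the statsmodels implementation; the helper names and the data are our own, and the series is assumed to span at least two full seasonal cycles.

```python
def centered_moving_average(values, period):
    """Trend estimate: centered moving average (half weights at the ends for even periods)."""
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def additive_decompose(values, period):
    """Split values into trend + seasonal + residual."""
    trend = centered_moving_average(values, period)
    detrended = [v - tr if tr is not None else None for v, tr in zip(values, trend)]
    # Average the detrended values at each position within the cycle.
    means = []
    for pos in range(period):
        bucket = [d for idx, d in enumerate(detrended) if d is not None and idx % period == pos]
        means.append(sum(bucket) / len(bucket))
    offset = sum(means) / period  # center the seasonal pattern on zero
    pattern = [m - offset for m in means]
    seasonal = [pattern[idx % period] for idx in range(len(values))]
    residual = [v - tr - s if tr is not None else None
                for v, tr, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual


pattern_true = [3.0, -1.0, -3.0, 1.0]  # repeats every 4 steps
values = [10 + 0.5 * t + pattern_true[t % 4] for t in range(16)]
trend, seasonal, residual = additive_decompose(values, period=4)

print(trend[2:6])
print(seasonal[:4])
```

Because the synthetic series contains no noise, the recovered trend and seasonal pattern match the ones used to build it, and the residual is zero wherever the trend is defined.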

Autocorrelation

Autocorrelation is another key temporal feature in time series data. It measures how the current value of a time series correlates with past values, allowing for more accurate predictions based on recent trends.

Autocorrelation can be:

Positive: High values tend to be followed by other high values, and low values by low values. For example, in the stock market, a rising stock price often attracts more buyers, driving the price up further; when the price falls, many people usually sell, driving the price down.
Negative: High values are likely to be followed by low values, and vice versa. For example, a high rabbit population in the summer may deplete resources, leading to a lower population in the winter, allowing resources to recover and the population to increase again the following year.

Detecting autocorrelation

Two common tools for detecting autocorrelation are:

The autocorrelation function (ACF) plot
The partial autocorrelation function (PACF) plot

You can compute an ACF plot using Python as follows:

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(co2_data)

For our CO2 dataset, this is what we get:

Autocorrelation plot for the CO2 dataset | Source: Author

On the x-axis, you see the time steps (or “lags”) going back in time. On the y-axis, you can see the correlation of each time step with the current time. This plot clearly shows significant autocorrelation.
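The quantity plotted at each lag can be computed directly. The sketch below implements the usual sample autocorrelation in plain Python on two toy series (an alternating one and a trending one); the function name is ours, and the plotting function additionally draws confidence bands, which are not reproduced here.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return covariance / variance


alternating = [1, -1, 1, -1, 1, -1, 1, -1]  # high values followed by low values
trending = list(range(10))                  # high values followed by high values

acf_alt_lag1 = autocorrelation(alternating, lag=1)
acf_alt_lag2 = autocorrelation(alternating, lag=2)
acf_trend_lag1 = autocorrelation(trending, lag=1)

print(acf_alt_lag1, acf_alt_lag2, acf_trend_lag1)
```

The alternating series shows negative autocorrelation at lag 1 and positive autocorrelation at lag 2, while the trending series is positively autocorrelated, in line with the two cases described earlier.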

The PACF (partial autocorrelation function) is an alternative to the ACF. Instead of showing all autocorrelations, it shows only the unique correlation at each time step, filtering out indirect effects. This helps identify the true relationship between each lag and the present time.

For example, if today’s value is similar to yesterday’s and the day before, the ACF would show high correlations for both days. The PACF, however, would only show yesterday’s value as correlated, removing redundant correlations from earlier days.
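For the first two lags, the relationship between ACF and PACF fits in one formula: the partial autocorrelation at lag 1 equals the autocorrelation r1, and at lag 2 it is (r2 - r1²) / (1 - r1²). The sketch below applies this to made-up autocorrelation values; the function name is our own.

```python
def partial_autocorrelation_lag2(r1, r2):
    """PACF at lag 2, given the autocorrelations at lags 1 and 2."""
    return (r2 - r1 ** 2) / (1 - r1 ** 2)


# Case 1: the lag-2 correlation is fully explained by the lag-1 relationship
# (r2 equals r1 squared), so nothing is left over at lag 2.
pacf_indirect = partial_autocorrelation_lag2(r1=0.8, r2=0.64)

# Case 2: the lag-2 correlation is stronger than lag 1 alone would imply,
# so a direct lag-2 effect remains.
pacf_direct = partial_autocorrelation_lag2(r1=0.8, r2=0.82)

print(round(pacf_indirect, 6), round(pacf_direct, 6))
```

In the first case the ACF would still show a sizeable bar at lag 2 (0.64), while the PACF bar is essentially zero, which is exactly the filtering of indirect effects described above.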

You can compute a PACF plot in Python as follows:

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(co2_data)

Partial autocorrelation plot for the CO2 dataset | Source: Author

The PACF plot provides a clearer view of the autocorrelation in the CO2 data. It shows strong positive autocorrelation at lag 1, meaning a high value now likely indicates a high value in the next time step. The PACF only displays direct correlations, avoiding duplicate effects from earlier lags. This results in a cleaner and more straightforward representation.

Stationarity

Stationarity is another important concept in time series analysis. A series is stationary when its statistical properties, like mean and variance, remain constant over time, which in particular means it has no trend. Many time series models require stationarity to work effectively.

To check for non-stationarity, you can use the augmented Dickey-Fuller (ADF) test.

Dickey-Fuller test

The augmented Dickey-Fuller test is a statistical test that detects non-stationarity in a time series. Here’s how to apply it to the CO2 data in Python:

from statsmodels.tsa.stattools import adfuller
adf, pval, usedlag, nobs, crit_vals, icbest = adfuller(co2_data.co2.values)

print('ADF test statistic:', adf)
print('ADF p-value:', pval)
print('Number of lags used:', usedlag)
print('Number of observations:', nobs)
print('Critical values:', crit_vals)
print('Best information criterion:', icbest)

The result looks like this:

Results of the Dickey-Fuller Test for the CO2 data in Python. The ADF test suggests that the time series is non-stationary, as the test statistic (0.0337) is greater than the critical values, and the p-value (0.9612) is much higher than 0.05, failing to reject the null hypothesis of non-stationarity.

In the ADF test:

The null hypothesis assumes a unit root is present, meaning the series is non-stationary.
The alternative hypothesis suggests that the time series is stationary.

If the p-value is below 0.05, you can reject the null hypothesis, suggesting that the data is stationary. We cannot reject the null hypothesis if the p-value is above 0.05 (as we see above), meaning the data is likely non-stationary. This aligns with the trend we saw above in the CO2 data.

Differencing

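To make a non-stationary series stationary, you can apply differencing, i.e., replace each value with its change from the previous value. This removes the trend while keeping shorter-term variations such as seasonality and noise. This helps when using models that assume stationarity.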

# Apply differencing to remove the trend
prev_co2_value = co2_data.co2.shift()
differenced_co2 = co2_data.co2 - prev_co2_value
differenced_co2.plot()

The differenced CO2 data looks like this:

The CO2 time series after applying differencing | Source: Author

Now, if we do the ADF test again on this differenced data, we can confirm that it has become stationary:

from statsmodels.tsa.stattools import adfuller
adf, pval, usedlag, nobs, crit_vals, icbest = adfuller(differenced_co2.dropna())

print('ADF test statistic:', adf)
print('ADF p-value:', pval)
print('ADF number of lags used:', usedlag)
print('ADF number of observations:', nobs)
print('ADF critical values:', crit_vals)
print('ADF best information criterion:', icbest)

Results of the Dickey-Fuller Test after applying differencing. The ADF test suggests that the time series is stationary, as the test statistic is smaller than the critical values and the p-value is much smaller than 0.05, so we can reject the null hypothesis of non-stationarity.

Now, the p-value is very small, indicating that we can reject the null hypothesis (non-stationarity) and assume that this data is stationary.

One-step vs multi-step time series models

Before diving into modeling, the final concept we should cover is the difference between one-step and multi-step models.

One-step models predict only the next time point in a series. To create multi-step forecasts with these models, you can repeatedly use the previous prediction as input for the next step. However, this approach can compound errors over multiple steps, since every prediction error is fed into the steps that follow.
Multi-step models are designed to predict multiple future points simultaneously. These models are generally better for long-term forecasts and perform well for single-step forecasts.

Choosing between one-step and multi-step models depends on how many steps you need to predict for your use case.

One-step forecasts	Multi-step forecasts
Designed to forecast only one step ahead	Designed to forecast multiple steps ahead
Can be extended to multi-step by windowing	Direct multi-step capability
May be less accurate for multi-step forecasts	Ideal for multi-step forecasts
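The recursive strategy for turning a one-step model into a multi-step forecaster can be sketched as follows. The one-step model here is a deliberately naive, hypothetical rule (continue the last observed change); any one-step model could take its place.

```python
def one_step_model(history):
    """Hypothetical one-step model: next value = last value + last observed change."""
    return history[-1] + (history[-1] - history[-2])


def recursive_forecast(history, steps, model=one_step_model):
    """Forecast several steps ahead by feeding predictions back in as inputs."""
    extended = list(history)
    forecasts = []
    for _ in range(steps):
        prediction = model(extended)
        forecasts.append(prediction)
        extended.append(prediction)  # the prediction becomes part of the history
    return forecasts


history = [1, 2, 4]
forecasts = recursive_forecast(history, steps=3)

print(forecasts)
```

After the first step, the model no longer sees real observations, only its own outputs, which is how an early error carries over into every later step.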

Types of time series models

Now that we’ve covered key aspects of time series data, let’s explore the types of models used for forecasting.

Classical time series models

These traditional models, such as ARIMA and Exponential Smoothing, are based on time-based patterns in a time series. While highly effective for forecasting single-variable (univariate) series, some advanced options exist to add external variables as well.

Classical models like these are specific to time series data and generally aren’t suitable for other types of machine learning.

Supervised models

Supervised models are a family of models used for many different machine learning tasks. They use clearly defined input (X) and output (Y) variables.

For time series forecasting, you can create input features from date-based elements (e.g., year, month, day), and the target to be predicted is the value of your time series at that date. You can also include lagged values to add autocorrelation effects.

Deep learning models

The rise of deep learning has enabled new forecasting methods, especially useful for complex, sequential data. Specific model architectures like LSTMs have been developed and applied for sequence-based forecasting.

Major tech companies like Facebook and Amazon have released open-source forecasting tools, offering powerful new options for practitioners. These can sometimes outperform traditional models.

Classical time series models

Now, let’s dive deeper into classical time series models, starting with the ARIMA family, which combines multiple components to create a robust forecasting model.

ARIMA family

The ARIMA family of models consists of a set of smaller models that can be used on their own or combined (when all of the individual components are put together, you obtain the SARIMAX model). The main building blocks are:

1. Autoregression (AR): Uses past values to predict future ones. The order of an AR model, p, indicates the number of previous time steps included; the simplest model is the AR(1) model, which only uses one previous timestep to predict the current value.

2. Moving average (MA): Predicts future values based on past prediction errors rather than past values. The intuition here is that when a model has external perturbations, there may be a pattern in the error; the MA aims to capture this pattern. The order q represents how many error terms to include: MA(1) uses only the last error.

3. Autoregressive Moving Average (ARMA): Combines AR and MA, using both past values and past errors for predictions. ARMA can use different lags for either the AR or MA; for example, ARMA(1, 0) has an order of p=1 and q=0, effectively making it a regular AR(1) model. The ARMA model requires a stationary time series.

4. Autoregressive Integrated Moving Average (ARIMA): Extends ARMA by adding differencing (indicated by d) to make the series stationary, if necessary. The notation is ARIMA(p, d, q). For example, an ARMA(1, 2) model that needs to be differenced once would become an ARIMA(1, 1, 2) model: the first number (1) is the AR order, the second (1) is the degree of differencing, and the third (2) is the MA order. ARIMA(1, 0, 1) would be the same as ARMA(1, 1).

5. Seasonal ARIMA (SARIMA): Adds seasonality to ARIMA, with seasonal parameters (P, D, Q) on top of the non-seasonal parameters (p, d, q). If seasonality is present in your time series, using it in your forecast is critical. The frequency m specifies the seasonal period (e.g., 12 for monthly data, or 4 for quarterly data). SARIMA notation is SARIMA(p, d, q)(P, D, Q)m.

6. SARIMA with Exogenous Variables (SARIMAX): Adds external variables (X) to SARIMA, allowing additional features to improve forecast accuracy. This is the most complex variant, combining AR, MA, differencing, and seasonal effects, along with the addition of external variables.
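
To make the differencing parameter d a little more tangible, here is a minimal sketch in plain Python (the function name difference and the toy series are illustrative assumptions). Differencing once removes a linear trend, and differencing twice removes a quadratic one, which is what d=1 and d=2 do inside an ARIMA model.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


linear = [3, 5, 7, 9, 11]          # steady upward trend
quadratic = [0, 1, 4, 9, 16, 25]   # trend whose slope keeps growing

diff_linear = difference(linear, d=1)        # constant series
diff_quadratic = difference(quadratic, d=2)  # constant series
```

Each round of differencing shortens the series by one observation, so a model with d=2 effectively has two fewer data points to fit on.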

Example: using auto-ARIMA in Python on CO2 Data

Now that we’ve reviewed the building blocks of the ARIMA family, let’s apply them to create a predictive model for CO2 data.

Choosing the right parameters for ARIMA or SARIMAX models can be challenging, as there are many combinations of (p, d, q) or (p, d, q)(P, D, Q). While you can inspect autocorrelation graphs to make educated guesses, the pmdarima library provides an auto_arima function to automatically select optimal parameters.

First, import pmdarima and other necessary libraries:

import pmdarima as pm
from pmdarima.model_selection import train_test_split
import matplotlib.pyplot as plt

With the libraries imported, split the data into training and testing sets (we’ll go into why in more detail later on):

train, test = train_test_split(co2_data.co2.values, train_size=2200)

Next, fit the model using auto_arima on the training data with seasonal parameters, then make predictions with the best-selected model:

model = pm.auto_arima(train, seasonal=True, m=52)
preds = model.predict(test.shape[0])

Finally, visualize the actual vs. forecasted data:

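# Plot the training data against its positions, then the forecast right after it
plt.plot(range(len(train)), train, label="Actual (training data)")
plt.plot(range(2200, 2200 + len(preds)), preds, label="Forecast")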
plt.legend()
plt.show()

In the plot, the blue line represents the actual data, and the orange line represents the forecast. | Source: Author

For more examples and details, check the pmdarima documentation.

Vector autoregression (VAR) and its variants: VARMA and VARMAX

Vector Autoregression (VAR) is a multivariate alternative to ARIMA, designed to predict multiple time series simultaneously. This is especially useful when strong relationships exist between the series. As the name suggests, VAR models only the autoregressive component, applied across multiple variables.

VARMA: The multivariate equivalent of ARMA, adding a moving average component to VAR, allowing it to model both past values and errors across multiple series.
VARMAX: Extends VARMA by adding exogenous (external) variables (X), which can improve forecasting accuracy without needing to be forecasted themselves. The statsmodels VARMAX implementation is a good way to get started with multivariate forecasting with external factors.

Advanced versions like Seasonal VARMAX (SVARMAX) also exist but can become highly complex, making implementation and interpretation challenging. In practice, simpler models may be preferable.
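
For intuition, a one-step forecast from a VAR(1) model is just an intercept plus a matrix of coefficients applied to the most recent observation of every series. The sketch below spells that out in plain Python; the coefficients are invented for illustration, not estimated from data, and the function name var1_forecast is hypothetical.

```python
def var1_forecast(last_obs, intercept, coef):
    """One-step VAR(1) forecast: y_t = c + A @ y_{t-1}."""
    return [
        c_i + sum(a_ij * y_j for a_ij, y_j in zip(row, last_obs))
        for c_i, row in zip(intercept, coef)
    ]


intercept = [1.0, 0.0]
coef = [
    [0.5, 0.25],  # series 1 depends on its own past and on series 2
    [0.0, 0.5],   # series 2 depends only on its own past
]
last_obs = [4.0, 8.0]

forecast = var1_forecast(last_obs, intercept, coef)
```

The off-diagonal entries of the coefficient matrix are what let one series inform the forecast of another. In a real project, the coefficients would be estimated by a library such as statsmodels rather than written by hand.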

Smoothing techniques

Exponential smoothing is a statistical technique that helps to reduce short-term noise in time series data, making long-term patterns more visible. Time series patterns often have a lot of long-term variability and short-term (noisy) variability. Smoothed time series can reveal trends more effectively for analysis. The main smoothing techniques include:

1. Simple moving average: Replaces the current value with an average of the current and past values. Increasing the number of past values smooths the series further, but reduces detail. This is the simplest smoothing technique.

2. Simple exponential smoothing (SES): An adaptation of the moving average that applies weights to past values so that recent values have more influence. This approach smooths the series without losing as much detail as a simple moving average.

3. Double exponential smoothing (DES): Suitable for data with trends, DES uses two parameters—α (the data smoothing factor) and β (the trend smoothing factor)—to adjust for trends in the data. This method addresses cases where SES alone would fall short by recursively applying an exponential filter.

4. Holt-Winters Exponential Smoothing (HWES): Also known as Triple Exponential Smoothing, HWES is ideal for data with seasonality and trend. It adjusts for three components—trend, seasonal cycles (e.g., weekly or monthly), and noise.
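
The first two techniques are simple enough to write out by hand. The following sketch uses plain Python and one common formulation of each (initialization conventions differ between libraries, so fitted values from a package may not match exactly); the function names and toy data are illustrative.

```python
def moving_average(values, window):
    """Trailing mean of the current value and up to window - 1 past values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def simple_exp_smoothing(values, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_{t-1}, starting from s_0 = x_0."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


data = [10.0, 20.0, 10.0, 20.0]
ma = moving_average(data, window=2)
ses = simple_exp_smoothing(data, alpha=0.5)
```

A smaller alpha gives more weight to the accumulated history and therefore a smoother curve, while alpha close to 1 makes the smoothed series track the raw data.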

Example: Exponential smoothing in Python

Here’s how to apply simple exponential smoothing (SES) to our CO2 data:

from statsmodels.tsa.api import SimpleExpSmoothing
import matplotlib.pyplot as plt

es = SimpleExpSmoothing(co2_data.co2.values).fit(smoothing_level=0.01)

plt.plot(co2_data.co2.values, label="Original")
# fittedvalues holds the smoothed series for the fitted period
plt.plot(es.fittedvalues, label="Smoothed")
plt.legend()
plt.show()

The smoothing level indicates how smooth your curve should become. In this example, it’s set very low, indicating a very smooth curve. Feel free to play around with this parameter and see what less-smooth versions look like.

The blue line shows the original data, and the orange line shows the smoothed series | Source: Author

Supervised machine learning models

Supervised machine learning models categorize variables as either dependent (target) or independent (predictor) variables. While these models aren’t designed for time series, they can be adapted by treating time-based features (e.g., year, month, day) as independent variables.

Linear regression

Linear Regression is the simplest supervised model. It estimates linear relationships: each independent variable has a coefficient that indicates how much it affects the target.

Multiple Linear Regression: Uses multiple predictors (e.g., temperature and price) to model the target variable.

Simple Linear Regression: Uses one independent variable. For example, hot chocolate sales can be modeled based on the temperature outside (pictured below).

An example of linear regression fit to data of hot chocolate sales according to outside temperature | Source: Author

This is not a time series dataset yet: no time variable is present. To make it a time series, we can add date-based variables such as year, month, or day instead of only using temperature and price as predictors. For example, tying this back to our CO2 dataset:

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Extract seasonality data
months = [x.month for x in co2_data.index]
years = [x.year for x in co2_data.index]
day = [x.day for x in co2_data.index]
X = np.array([day, months, years]).T

# Fit the Linear Regression model
my_lr = LinearRegression()
my_lr.fit(X, co2_data.co2.values)

# Make predictions
preds = my_lr.predict(X)

# Plot the results
plt.plot(co2_data.index, co2_data.co2.values, label="Actual")
plt.plot(co2_data.index, preds, label="Predicted")
plt.legend()
plt.show()

We had to do a little bit of feature engineering to extract seasonality into variables, but the advantage is that adding external variables becomes much easier.

We used the scikit-learn library to build a linear regression model, fit it to our data, and make predictions. Let’s see what our model learned:

The plot shows the fit of our linear regression model (in orange) to the CO2 data (presented in blue) | Source: Author

This is a pretty good fit!

Random forest

Linear Regression is limited to linear relationships. For more flexibility, Random Forest—a widely used model for nonlinear relationships—can provide a better fit.

The scikit-learn library has the RandomForestRegressor that you can simply use to replace the LinearRegression class in the previous code:

from sklearn.ensemble import RandomForestRegressor

# Fit Random Forest
my_rf = RandomForestRegressor()
my_rf.fit(X, co2_data.co2.values)

# Make predictions
preds = my_rf.predict(X)

# Plot the results
plt.plot(co2_data.index, co2_data.co2.values, label="Actual")
plt.plot(co2_data.index, preds, label="Predicted")
plt.legend()
plt.show()

The fit is now even better than before:

This plot demonstrates how our random forest model fits our CO2 dataset. In blue, the original data; in orange, the predicted values. As we can see, the fit is better than with linear regression | Source: Author

For now, it’s enough to understand that this Random Forest model has been able to learn the training data better. Later, we’ll cover more quantitative methods for model evaluation.

XGBoost

XGBoost is another essential supervised model based on gradient boosting. Like random forest, it builds an ensemble of “weak learners” (decision trees), but it adds them in sequence so that each new tree corrects the errors of the previous ones. It can also parallelize parts of the training for efficiency.

import xgboost as xgb

# Fit XGBoost model
my_xgb = xgb.XGBRegressor()
my_xgb.fit(X, co2_data.co2.values)

# Make predictions
preds = my_xgb.predict(X)

# Plot the results
plt.plot(co2_data.index, co2_data.co2.values)
plt.plot(co2_data.index, preds)
plt.show()

This plot shows in orange the XGBoost’s strong fit to the data (represented in blue). | Source: Author

Advanced and specific time series models

This section covers two advanced models for time series forecasting: GARCH and TBATS.

GARCH

GARCH (Generalized Autoregressive Conditional Heteroskedasticity) is primarily used to estimate volatility in financial markets.

Rather than predicting actual values, GARCH models the error variance in a time series, assuming an ARMA model for the variance. It’s ideal for forecasting volatility rather than point values.

The GARCH family has several variants. Because these models describe volatility rather than the values themselves, they differ significantly from the other time series models covered here and are best reserved for volatility forecasting.
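
To give a feel for what “modeling the variance” means, the sketch below implements the variance recursion of GARCH(1,1), the most common variant, in plain Python. The parameter values and residuals are invented for illustration; in practice, they would be estimated from data by a dedicated library.

```python
def garch11_variance(residuals, omega, alpha, beta, initial_variance):
    """Conditional variance: sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]."""
    sigma2 = [initial_variance]
    for eps in residuals:
        sigma2.append(omega + alpha * eps ** 2 + beta * sigma2[-1])
    return sigma2


# A calm day, a large shock, then another calm day
residuals = [0.0, 2.0, 0.0]
variances = garch11_variance(
    residuals, omega=0.1, alpha=0.5, beta=0.25, initial_variance=1.0
)
```

The predicted variance jumps right after the large shock and then decays gradually, which is the volatility clustering that GARCH is meant to capture.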

TBATS

TBATS stands for:

Trigonometric seasonality
Box-Cox transformation
ARMA errors
Trend
Seasonal components

Introduced in 2011, TBATS is designed to handle time series with multiple seasonal cycles. This model is newer and less commonly used than ARIMA models but is effective for data with complex seasonal patterns.

A Python implementation of TBATS is available in the sktime package.

Deep learning-based time series models

After exploring classical models, which focus on past-present relations, and supervised models, which focus on cause-effect relations, we can now look at more advanced deep learning models. These models, while complex, may offer superior forecasting performance depending on the data and context.

LSTMs (Long Short-Term Memory)

LSTM networks are a type of Recurrent Neural Network (RNN) specifically designed to handle sequential data. In LSTM models, multiple nodes pass input data through layers, each learning simple tasks that together capture complex, nonlinear relationships.

LSTMs are especially effective for time series forecasting because they can remember long-term dependencies in sequence data. Although they require substantial data and are challenging to train, they can be highly effective for complex time series patterns.

Python’s Keras library is a popular starting point for building LSTM models.

Prophet

Prophet is a time series forecasting library open-sourced by Facebook. Prophet can generate forecasts with little user specification, making it easy to use, especially for non-experts in time series analysis.

However, it’s essential to validate Prophet forecasts carefully, as automated model building may overlook nuances in the data. When properly validated, Prophet can be an effective forecasting tool. More resources are available on Facebook’s GitHub.

DeepAR

DeepAR, developed by Amazon, is another black-box model designed to simplify time series forecasting. While the underlying mechanics differ from Prophet, its user experience is automated too.

A great and easy-to-use implementation of DeepAR is available in the GluonTS package.

Time series model selection

After exploring various time series models—including classical, supervised, and recent developments like LSTM, Prophet, and DeepAR—the final step is choosing the model that best suits your use case.

Model evaluation and metrics

Defining metrics

To select a model, you must first define the right metric(s) to evaluate your model.

Time series forecasting key metrics include:

Mean Squared Error (MSE): Measures squared error at each time point, then averages.
Root Mean Squared Error (RMSE): Square root of MSE, to have the error in its original units.
Mean Absolute Error (MAE): Uses absolute values of errors, making it more interpretable.
Mean Absolute Percentage Error (MAPE): Expresses absolute errors as percentages of actual values, making results easier to interpret.
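
These four metrics are short enough to write out directly. The sketch below does so in plain Python with made-up numbers; in a real project you would more likely use the equivalents in scikit-learn’s metrics module.

```python
import math


def mse(actual, predicted):
    """Mean of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, in the units of the original series."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined if any actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

scores = {
    "mse": mse(actual, predicted),
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
    "mape": mape(actual, predicted),
}
```

Because MSE and RMSE square the errors, the single 20-unit miss dominates them, whereas MAE and MAPE weigh the two misses more evenly.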

Train-test split and cross-validation

When evaluating machine learning models, remember that good performance on training data doesn’t guarantee good results on new, out-of-sample data. To estimate how well a model generalizes, two common approaches are train-test split and cross-validation.

Doing a train-test split involves holding back a portion of the data as a test set. For example, you could reserve the last 3 years of a CO2 dataset as a test set and use the remaining 40 years for training. Using a chosen evaluation metric, you’d then forecast the reserved period and compare predictions to actual values.

To benchmark multiple models, train each on the same 40-year data, forecast the test period, and select the model with the best performance.

A limitation of the train-test split in time series is that it only evaluates performance at a single point in time. Unlike non-sequential data, where test sets can be randomly selected, time series data relies on sequence order, so it is essential to reserve the final period as the test set. However, this approach may be unreliable if the last period is atypical (e.g., due to events like COVID-19, which disrupted trends and forecasts).
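
The key difference from a random split is that nothing gets shuffled. A minimal sketch, assuming the data is already sorted by time (the function name chronological_split is hypothetical):

```python
def chronological_split(series, test_size):
    """Hold out the final `test_size` observations, preserving order."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


monthly = list(range(1, 25))  # 24 ordered observations, e.g., two years of monthly data
train, test = chronological_split(monthly, test_size=6)
```

Every training observation precedes every test observation, so the model is never evaluated on a period it has already “seen the future” of.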

Cross-validation provides a more robust approach, repeatedly splitting the data for training and testing. For example, in 3-fold cross-validation, the data is divided into three parts. Each fold is used as a test set once, with the other two as training sets, producing three evaluation scores. Averaging these scores provides a more reliable measure of model performance.

By doing this, you avoid selecting a model that performs well on the test set by chance: you now have a more reliable measure of its performance.

Time series model experiments

To guide your time series model selection, consider the following key questions before starting your experiments:

Which metric will you use for evaluation?
Which period are you aiming to forecast?
How will you ensure the model performs well in the future using unseen data?

Once you’ve answered these questions, you can begin testing different models and applying your evaluation strategy to select and refine the best-performing model.

Example use case: time series forecasting for S&P 500

In this example, we’ll build a model to predict the next day’s percentage change of the S&P 500 index (and therefore its direction, up or down), simulating a scenario where predictions are made nightly for potential trading insights (note: don’t take this as serious financial advice!).

Stock market forecasting data

To access stock data, we can use the Yahoo Finance (yfinance) package in Python:

!pip install yfinance
import yfinance as yf

# Download S&P 500 closing prices from Yahoo Finance
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")[['Close']]
sp500_data.plot(figsize=(12, 12))

The output:

This plot shows the evolution of S&P 500 closing prices since 1980 | Source: Author

Instead of using absolute prices, traders often focus on daily percentage changes. We can calculate these changes as follows:

difs = sp500_data.pct_change()  # (today - yesterday) / yesterday
difs = difs.dropna()
difs.plot(figsize=(12, 12))

The plot now displays the percentage change in the S&P 500 over time | Source: Author

Defining the experimental approach

The model’s goal is to predict the next day’s percentage change accurately. Since our prediction period is just one day, the test set will be small, and multiple test splits are needed to ensure reliability.

Building on the train-test split discussed previously, we can set up 100 different splits, each training on (at most) three months of data and testing on the following day. This setup allows consistent evaluation and better selection of the best-performing model.
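
The splitting scheme described above can be sketched in plain Python as follows. This is only an illustration of the indices involved, similar in spirit to what scikit-learn’s TimeSeriesSplit produces; the function name rolling_origin_splits and the sample count of 500 are assumptions made for the example.

```python
def rolling_origin_splits(n_samples, n_splits, max_train_size, test_size=1):
    """Yield (train_indices, test_indices) pairs, each test block directly after its training window.

    Splits are taken from the end of the series, so the last split tests
    on the final `test_size` observations.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start < 1:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_start = max(0, test_start - max_train_size)
        yield (
            list(range(train_start, test_start)),
            list(range(test_start, test_start + test_size)),
        )


# 100 splits, at most 3 * 31 training days each, one test day per split
splits = list(rolling_origin_splits(n_samples=500, n_splits=100, max_train_size=3 * 31))
```

Each split moves the forecast origin forward by one day, so the 100 scores cover 100 different days rather than a single, possibly atypical, one.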

Building a classical time series model: ARIMA

To address this forecasting problem, we’ll start with a classical ARIMA model. The code below sets up ARIMA models with orders ranging from (0, 0, 0) to (4, 4, 4). Each model is evaluated using 100 splits, where each split uses a maximum of three months for training and one day for testing.

There are a lot of training and test runs, so we will use a tracking tool, neptune.ai, for easy comparison.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Before we continue, let’s first:

Sign up for a Neptune account and create a new project
Save your credentials as environment variables

Now that we are all set up, let’s start!

import numpy as np 
from sklearn.metrics import mean_squared_error 
from sklearn.model_selection import TimeSeriesSplit 
import neptune 
from neptune.utils import stringify_unsupported 
import statsmodels.api as sm

# List of ARIMA parameter combinations 
param_list = [(p, d, q) for p in range(5) for d in range(5) for q in range(5)] 

for order in param_list:
    # Initialize a Neptune run
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    )

    run['parameters/order'] = stringify_unsupported(order)

    mses = []
    tscv = TimeSeriesSplit(n_splits=100, max_train_size=3*31, test_size=1)

    series = difs.Close.values  # daily S&P 500 percentage changes

    for train_index, test_index in tscv.split(series):
        try:
            train, test = series[train_index], series[test_index]

            # Fit ARIMA model
            model = sm.tsa.ARIMA(train, order=order)
            result = model.fit()
            prediction = result.forecast(1)  # array holding the one-step forecast

            # Calculate Mean Squared Error (MSE)
            mse = mean_squared_error(test, prediction)
            mses.append(mse)

        except Exception:
            # Ignore models that produce errors
            pass

    # Log results to Neptune
    run['average_mse'] = np.mean(mses) if mses else None
    run['std_mse'] = np.std(mses) if mses else None
    run.stop()

After running, you can view the results in a table format in the Neptune dashboard:

The model with the lowest average MSE is ARIMA(0, 1, 3). However, its standard deviation is exactly 0, which most likely means that only a single split ran without errors and raises concerns about the stability of this result. The next best models, ARIMA(1, 0, 3) and ARIMA(1, 0, 2), have very similar performance, indicating more reliable outcomes.

Based on this, ARIMA(1, 0, 3) is the best choice, with an average MSE of 0.00000131908 and a standard deviation of 0.00000197007, suggesting both accuracy and consistency in forecasting performance.

💡 If you work with Prophet, the Neptune-Prophet integration can help you track parameters, forecast data frames, residual diagnostic charts, and other model-building metadata.

Building a supervised machine learning model

Next, we’ll explore a supervised machine learning model and see how its performance compares to a classical time series model.

As we mentioned earlier, feature engineering is important in supervised machine learning for forecasting. Supervised models need both dependent (target) and independent (predictor) variables. Sometimes, you may have additional future data (like reservation numbers to help predict a restaurant’s daily customer count). However, for this stock market example, we only have past stock prices.

A supervised model can’t be trained on just a target variable, so we need to create features to capture seasonality and autocorrelation effects. For this model, we’ll use the daily percentage changes from the past 30 days as input features to predict the change on the 31st day.

This approach will create a dataset where each entry contains 30 consecutive days as predictors and the 31st day as the target. By sliding this 30-day window across the S&P 500 data, we can generate a large training dataset for model development.

Because each row of this dataset is a self-contained example (30 input days plus one ‘future’ target day), the rows can be used independently of one another. The code below builds the dataset:

import yfinance as yf
import numpy as np

# Download S&P 500 closing price data
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
sp500_data = sp500_data[['Close']]

# Calculate daily percentage changes
difs = sp500_data.pct_change()  # (today - yesterday) / yesterday
difs = difs.dropna()  # Remove any NaN values from the dataset

# Extract the 'Close' values as our target variable
y = difs.Close.values

# Generate input windows of 30 days to predict the 31st day
X_data = []
y_data = []
for i in range(len(y) - 31):
    X_data.append(y[i:i+30])     # Last 30 days as input features
    y_data.append(y[i+30])       # 31st day as the target

# Convert lists to numpy arrays
X_windows = np.vstack(X_data)

With the training dataset prepared, you can now apply regular cross-validation. Each row in this dataset is a separate sequence of 30 training days followed by a target day, allowing them to be used independently in the model.

Using cross-validation will help evaluate model performance more reliably by testing it across different splits of the data, rather than relying on a single train-test split.

The code below performs a grid search with cross-validation using XGBoost on our prepared dataset. It evaluates model performance across multiple hyperparameter combinations and logs the results to Neptune.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
import neptune
from neptune.utils import stringify_unsupported
from sklearn.metrics import mean_squared_error

# Define parameter grid for hyperparameter tuning
parameters = {
    'max_depth': list(range(2, 20, 4)),
    'gamma': list(range(0, 10, 2)),
    'min_child_weight': list(range(0, 10, 2)),
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5]
}

# Create a list of all possible parameter combinations
param_list = [(x, y, z, a) for x in parameters['max_depth'] 
                              for y in parameters['gamma'] 
                              for z in parameters['min_child_weight'] 
                              for a in parameters['eta']]

# Iterate over all parameter combinations
for params in param_list:
    mses = []

    # Initialize Neptune run for logging
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    )
    run['params'] = stringify_unsupported(params)

    # Set up KFold cross-validation
    my_kfold = KFold(n_splits=10, shuffle=True, random_state=0)

    for train_index, test_index in my_kfold.split(X_windows):
        X_train, X_test = X_windows[train_index], X_windows[test_index]
        y_train, y_test = np.array(y_data)[train_index], np.array(y_data)[test_index]

        # Create and train the XGBoost model
        xgb_model = xgb.XGBRegressor(
            max_depth=params[0],
            gamma=params[1],
            min_child_weight=params[2],
            eta=params[3]
        )
        xgb_model.fit(X_train, y_train)
        preds = xgb_model.predict(X_test)

        # Calculate and store Mean Squared Error for the fold
        mses.append(mean_squared_error(y_test, preds))

    # Log average MSE and standard deviation to Neptune
    average_mse = np.mean(mses)
    std_mse = np.std(mses)
    run['average_mse'] = average_mse
    run['std_mse'] = std_mse

    # Stop the Neptune run
    run.stop()

Some of the scores obtained using this loop are shown in the table below:

The parameters that were tested in this grid search are reproduced below:

Parameter name	Values tested	Description
Max Depth	2, 6, 10, 14, 18	Controls the tree depth. Higher values make the model more complex and increase the risk of overfitting.
Gamma	0, 2, 4, 6, 8	Minimum loss reduction required to split a node. Higher values make the model more conservative by reducing unnecessary splits.
Min Child Weight	0, 2, 4, 6, 8	Minimum sum of instance weights needed in a child node. Higher values prevent overly complex models by stopping splits that don’t meet this threshold.
Eta	0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5	Learning rate (or step size). Low values mean slow learning, but they can improve accuracy by preventing overfitting.

For more information on XGBoost tuning, check out the official XGBoost documentation.

The best (lowest) MSE this XGBoost model achieves is 0.000129982, with several hyperparameter combinations reaching this score. However, the XGBoost model underperforms in its current setup compared to the classical time series model. To improve XGBoost’s results, a different approach to organizing the data may be needed.

LSTM model for time series forecasting

As a third model for the model comparison, let’s take an LSTM and see whether it can beat our ARIMA model.

The following code sets up an LSTM model using Keras. Instead of cross-validation (which can be time-consuming for LSTMs), a train-test split approach is used here to evaluate the model.

import tensorflow as tf
import numpy as np
import yfinance as yf
from sklearn.model_selection import train_test_split
import neptune

# Load and preprocess data
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
sp500_data = sp500_data[['Close']]
difs = sp500_data.pct_change()  # (today - yesterday) / yesterday
difs = difs.dropna()
y = difs.Close.values

# Create windows
X_data = []
y_data = []
window_size = 3 * 31  # 3 months of data
for i in range(len(y) - window_size):
    X_data.append(y[i:i+window_size])
    y_data.append(y[i+window_size])

X_windows = np.array(X_data)
y_data = np.array(y_data)

# Train/test/validation split
X_train, X_test, y_train, y_test = train_test_split(X_windows, y_data, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

# Reshape input data for LSTM (samples, timesteps, features)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Define LSTM architectures
archi_list = [
    [tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(window_size, 1)),
     tf.keras.layers.LSTM(32, return_sequences=False),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(window_size, 1)),
     tf.keras.layers.LSTM(64, return_sequences=False),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(128, return_sequences=True, input_shape=(window_size, 1)),
     tf.keras.layers.LSTM(128, return_sequences=False),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(window_size, 1)),
     tf.keras.layers.LSTM(32, return_sequences=True),
     tf.keras.layers.LSTM(32, return_sequences=False),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(window_size, 1)),
     tf.keras.layers.LSTM(64, return_sequences=True),
     tf.keras.layers.LSTM(64, return_sequences=False),
     tf.keras.layers.Dense(units=1)],
]

# Loop through architectures and log results
for archi in archi_list:
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    )

    run['params'] = f'LSTM Layers: {len(archi) - 1}, Units: {archi[0].units}'
    run['Tags'] = 'lstm_model_comparison'

    # Build, compile, and train model
    lstm_model = tf.keras.models.Sequential(archi)
    lstm_model.compile(loss=tf.losses.MeanSquaredError(),
                       optimizer=tf.optimizers.Adam(),
                       metrics=[tf.metrics.MeanSquaredError()])
    history = lstm_model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val), verbose=1)
    
    # Log final validation MSE to Neptune
    run['final_val_mse'] = history.history['val_mean_squared_error'][-1]
    run.stop()

Here, we can see the output for the 10 epochs:

The LSTM performed similarly to the XGBoost model. To improve the results, you could experiment with different training period lengths or adjust data standardization methods, which often impact neural network performance.

Selecting the best model

Out of the three models we tested, the ARIMA model (highlighted in the blue box below) showed the best performance based on a three-month training period and a one-day forecast, with the lowest mean squared error value (the row “average”, with MSE 0.423, compared to 0.46 and 1.07 for the other two models).

Next steps

To further improve this model, you could experiment with different training period lengths or add more data, such as seasonal indicators (day of the week, month, etc.) or additional predictors like market sentiment. If adding external variables, consider using a SARIMAX model.

Now you have a solid overview of time series model selection, including model types and tools like windowing and time series splits!

Time Series Prediction: How Is It Different From Other Machine Learning? [ML Engineer Explains]

Aayush Bajaj — Fri, 22 Jul 2022 06:44:43 +0000

Time-series problems are something every Data Scientist/ML Engineer will encounter over the span of their career, more often than they think. So, it’s an important concept to understand inside out.

You see, time-series is a type of data that is sampled based on a time-based dimension like days, months, years, etc. We term this data as “dynamic” as we’ve indexed it based on a DateTime attribute. This gives data an implicit order. Don’t get me wrong, static data can still have an attribute that’s a DateTime value but the data will not be sampled or indexed based on that attribute.

When we apply machine learning algorithms to time-series data, we usually want to make predictions for future DateTime values, e.g., predicting total sales for February given data for the previous 5 years, or predicting the weather for a certain day given weather data of several years. These predictions on time-series data are called forecasting. This contrasts with what we deal with when working on static data.

In this blog we’re going to talk about:

1 How is time-series prediction, i.e., forecasting, different from static machine learning predictions?
2 Best practices while working on time series forecasting

Time-series data vs static ML

So far we’ve established a baseline on how we should perceive time-series data as compared to static data. In this section, we are going to talk about the difference in approaching both of these types of data.

Note: For the sake of simplicity we assume data to be continuous in all cases.

Imputation of missing data

Imputation of missing data is a key preprocessing step in any tabular machine learning project. In static data, you can use techniques like Simple Imputation, where you fill missing data with the mean, median, or mode of the data depending on the nature of the attribute, or more sophisticated methods like Nearest Neighbour imputation, where you employ a KNN algorithm to estimate the missing values.

However, In time-series, missing data looks something like this:

Time-series – missing data | Source

You have these visible gaps in the data that can’t be logically filled with any of the imputation strategies that can be used on static data. Let’s discuss some techniques that can be useful:

Why not fill it with mean? Static mean doesn’t do us any good here since it makes no sense to fill your missing values by taking cues from the future. In the plot above, it’s quite intuitive that gaps between 2001-2003 can logically be filled with only historical data i.e. pre-2001 data.

In time-series data, we use something called a rolling mean (also known as a moving average or window mean), which is the mean of the values within a predefined window, e.g. a 7-day window or a 1-month window. So, we can utilize this moving average to fill in any missing gaps in our time-series data.

Note: Stationarity plays an important role when working with averages in time-series data.
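Here is a minimal sketch of filling gaps with a rolling mean in pandas. The series values, the dates, and the 3-day window are made up for illustration; the point is that `shift(1)` keeps the fill value based on past observations only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a two-day gap
idx = pd.date_range("2021-01-01", periods=8, freq="D")
s = pd.Series([10.0, 12.0, 11.0, np.nan, np.nan, 13.0, 14.0, 12.0], index=idx)

window = 3  # assumed window length, tune it to your data

# Rolling mean over the window; NaNs inside the window are skipped.
# shift(1) moves every mean one step forward, so the value used on a
# given day is computed from the preceding days only.
past_mean = s.rolling(window, min_periods=1).mean().shift(1)

# Only the missing entries are replaced; observed values stay untouched.
filled = s.fillna(past_mean)
print(filled)
```

A gap longer than the window would still leave NaNs behind, which is usually a sign that the window is too short for the data at hand.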

Interpolations are quite popular: Utilizing the implicit order that time-series data has, interpolation is quite often the go-to method for estimating the missing parts in the time-series data. Interpolations, in brief, use the values present before and after the missing point to calculate the missing datum. For example, linear interpolation works by drawing a straight line between the two neighbouring points and reading the missing datum off that line.

There are many types of interpolation available, like Linear, Spline, and Stineman. Their implementations are available in most major libraries, such as the interpolate() function in Python's pandas and the imputeTS package in R.
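As a quick illustration, here is a small sketch with pandas' `interpolate()`. The dates and values are invented; the two calls show the difference between treating observations as equally spaced (`method="linear"`) and respecting the actual time gaps (`method="time"`).

```python
import numpy as np
import pandas as pd

# Hypothetical series with unevenly spaced timestamps and two missing values
idx = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05", "2021-01-06"])
s = pd.Series([10.0, np.nan, np.nan, 20.0], index=idx)

# Straight line between the known neighbours, ignoring the index spacing
linear = s.interpolate(method="linear")

# Straight line in time: a point 1 day into a 5-day gap gets 1/5 of the change
by_time = s.interpolate(method="time")

print(pd.DataFrame({"linear": linear, "time": by_time}))
```

With irregular sampling, the time-aware variant is usually the safer default, since it doesn't pretend the observations are evenly spaced.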

Interpolation can be used on static data as well. However, it isn't widely used there since there are more sophisticated imputation techniques for static data (some of which are explained above).

Understanding the business use-case: This is not a technical method to deal with missing data, but I feel it's the most underrated technique, and one that can give results quickly. It involves understanding the problem at hand and then working out which method would work best. After all, SOTA might not be SOTA on your use case. For example, sales data should be treated differently than, say, stock data, since each has a different set of market metrics.
By the way, this technique is common between static as well as time-series data.

Feature engineering in time-series model

Working with features is another major step that differentiates time-series data from static. Feature engineering is a broad term that encapsulates a variety of standard techniques and ad-hoc methods. Features are handled differently in time-series data as compared to static data.

Note: One might argue that imputation comes under Feature engineering, which is not wrong but I wanted to explain this under a separate section to give you a better idea.

In static data, the choice depends heavily on the problem at hand, but a few standard techniques include Feature Transformations, Scaling, Compression, Normalization, Encoding, etc.

Time-series data can have other attributes apart from time-based features. If those attributes are also time-based, the resulting time-series is multivariate; if they are static, the result is a univariate time-series with static features. Non-time-based features can utilize methods from the static techniques in a way that doesn't hinder the integrity of the data.

All time-based components have a definite pattern that can be derived using some standard techniques. Let's look at some of the techniques that prove useful while working with time-based features.

Time-series components: what is the main characteristic of time-series data

For starters, every time-series data has time-series components. We do an STL decomposition (Seasonal and Trend decomposition using Loess) to extract some of these components. Let’s take a look at what each of these means.

Example of an STL decomposition | Source

Trend: Time-series data shows a trend when its value changes consistently over time; an increasing value shows a positive trend and a decreasing value, a negative trend. In the plot above, you can see a positive (increasing) trend.
Seasonality: Seasonality refers to a property of time-series that displays periodical patterns repeating at a constant frequency. In the example above, we can observe a seasonal component with the frequency being 12 months, which broadly means that the periodical pattern repeats every twelve months.
Remainder: After extracting Trend and Seasonality from the data, what remains is called the remainder (error) or residual. This actually helps in anomaly detection in time-series.
Cycle: Time-series data is termed cyclical when it shows rises and falls that have no fixed period, unlike seasonality.
Stationarity: Time-series data is stationary when its statistical properties do not change over time, i.e. it has a constant mean and standard deviation, and its covariance is independent of time.

These components when extracted usually form the basis of the next steps in Feature engineering in time-series data. To put this in perspective of static data, STL decomposition is the descriptive part of the time-series world. There are a few more time-series specific metrics subjective to the type of time-series data like dummy variables when working on stock data.

Time-series components are highly important for analyzing the time-series variable of interest in order to understand its behavior, what patterns it has, and to be able to choose and fit an appropriate time-series model.

Analysis and visualization in time-series models

Analysis

Time-series data analysis comes with a different blueprint than a static data analysis. As discussed in the previous section, time-series analysis starts with answering questions like:

Does this data have a trend?
Does this data contain any sort of pattern or seasonality?
Is the data stationary or non-stationary?

Ideally, one should proceed with the analysis only after answering the above questions. Similarly, static data analysis has procedures like Descriptive, Predictive and Prescriptive. Descriptive analysis is standard in all problem statements, while Predictive and Prescriptive are subjective. These procedures are common to both time-series and static ML. However, many metrics used inside Descriptive, Predictive and Prescriptive are used differently, one of which is correlation.

In time-series data, instead of plain correlation, we use something called Autocorrelation and Partial Autocorrelation. Both are measures of association between current and past series values and indicate which past series values are most useful in predicting future values.

An example ACF and PACF plot in time-series | Source
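For intuition, autocorrelation at a given lag is just the correlation of the series with a copy of itself shifted by that lag. Below is a small NumPy sketch of the sample autocorrelation; the helper name `autocorr` and the synthetic 12-step sine wave are illustrative choices, not part of any library.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Covariance between the series and its lagged copy, scaled by the variance
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

# Synthetic series that repeats every 12 steps
t = np.arange(120)
monthly = np.sin(2 * np.pi * t / 12)

acf_6 = autocorr(monthly, 6)    # half a period apart: strongly negative
acf_12 = autocorr(monthly, 12)  # a full period apart: strongly positive
print(acf_6, acf_12)
```

In practice you would more likely call `plot_acf` and `plot_pacf` from `statsmodels.graphics.tsaplots`, which also draw confidence bands and cover the partial autocorrelation that this sketch leaves out.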

While the approach to analysis is somewhat different between the two kinds of data, the core idea is the same: it depends largely on the problem statement. For example, stock and weather data are both time-series, but you might use stock data to predict future values and weather data to study seasonal patterns. Similarly, you can use loan data to analyze the patterns of borrowers or to check whether a new borrower will default on loan repayment.

Visualization

Visualization is an integral part of any analysis. The differentiating question isn't what you should visualize but how you should visualize it.

You see, the time-based features of time-series data should be visualized with time on one axis of the plot, while the visualization of non-time-based features depends on the strategy employed to work on the problem.

An example visualization of time-series | Source

Time-series forecasting vs static ML predictions

In the previous section, we saw the difference between the two data kinds pertaining to the initial steps and also the difference in approaches while comparing the two. In this section, we’re going to explore the next steps i.e. prediction or in terms of time-series, forecasting.

Algorithms

The choice of algorithms in time-series data is completely different from the one in static data. An algorithm that can extrapolate patterns and encapsulate the time-series components outside of the domain of training data can be considered as a time-series algorithm.

Now, most static machine learning algorithms, like linear regression or SVMs, do not have this capability, as they generalize over the training space for any new prediction. They simply can't exhibit the behaviour we discussed above.

Some common algorithms used for time-series forecasting:

ARIMA: It stands for Autoregressive-Integrated-Moving Average. It utilizes the combination of Autoregressive and moving averages to predict future values. Read more about it here.
EWMA/Exponential Smoothing: The exponentially weighted moving average, or exponential smoothing, serves as an upgrade to moving averages. It reduces the lag effect shown by moving averages by putting more weight on values that occurred more recently. Read more about it here.
Dynamic Regression Models: This algorithm also takes other miscellaneous information into account such as public holidays, changes in law, etc. Read more about it here.
Prophet: Prophet is an open-source library released by Facebook's Core Data Science team and designed for automatic forecasting of univariate time series data.
LSTM: Long Short-Term Memory (LSTM) is a type of recurrent neural network that can learn the order dependence between items in a sequence. It is often used to solve time series forecasting problems.
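To show what "autoregressive" means in practice, here is a bare-bones sketch that fits an AR(1) model with least squares in NumPy. It is only the AR building block, not a full ARIMA (there is no differencing and no moving-average term), and the simulated series with coefficient 0.8 is an assumption made for the demo.

```python
import numpy as np

# Simulate an AR(1) process: x[t] = 0.8 * x[t-1] + noise
rng = np.random.default_rng(42)
n, true_phi = 2000, 0.8
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_phi * x[t - 1] + rng.normal()

# Regress each value on the previous one (no intercept: the process has zero mean)
lagged, target = x[:-1], x[1:]
phi_hat = float(lagged @ target / (lagged @ lagged))

# One-step-ahead forecast from the last observed value
next_value = phi_hat * x[-1]
print(f"estimated coefficient: {phi_hat:.3f}, forecast: {next_value:.3f}")
```

The model predicts the future from the series' own past values, which is exactly the kind of lagged structure a static regressor on unrelated features never sees.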

This list is certainly not exhaustive. More complex models or approaches, such as Generalized Autoregressive Conditional Heteroskedasticity (GARCH) and Bayesian structural time series (BSTS), may be very useful in some cases. There are also neural network models, like Neural Network Autoregression (NNAR), that use lagged predictors and can handle additional features.

Evaluation metrics in time-series models

Forecasting evaluation involves metrics like scale-dependent errors such as Mean squared error (MSE) and Root mean squared error (RMSE), percentage errors such as Mean absolute percentage error (MAPE), and scaled errors such as Mean absolute scaled error (MASE), to mention a few. These metrics are actually similar to static ML metrics.

However, while evaluation metrics help determine how close the fitted values are to the actual ones, they do not evaluate whether the model properly fits the time series. For this, we do something called Residual Diagnostics. Read about it in detail here.
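Of the metrics above, MASE is the one most specific to forecasting, so here is a small sketch of it. This version scales the forecast error by the in-sample error of a one-step naive forecast ("tomorrow equals today"); the helper name `mase` and the toy numbers are illustrative.

```python
import numpy as np

def mase(y_train, y_true, y_pred):
    """Mean absolute scaled error, scaled by a one-step naive forecast on the training data."""
    y_train = np.asarray(y_train, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Mean absolute error of predicting each training value with the previous one
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    # Mean absolute error of the forecast being evaluated
    forecast_mae = np.mean(np.abs(y_true - y_pred))
    return float(forecast_mae / naive_mae)

y_train = [10, 12, 11, 13, 12]
y_true = [14, 13]
y_pred = [13, 13]
score = mase(y_train, y_true, y_pred)
print(score)
```

A score below 1 means the forecast beats the naive baseline on average, which makes MASE comparable across series with different scales.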

Dealing with outliers/anomalies

Outliers plague almost every real-world dataset. Time-series and static data take two completely different routes, from identification to the handling of outliers/anomalies.

Identification

For identification in static data, we use techniques ranging from Z-score and Boxplot analysis to more advanced statistical techniques like hypothesis testing.
In time-series, we use a range of techniques and algorithms, from STL analysis to algorithms like Isolation Forests. You can read about it in more detail here.

Handling

We use methods like Trimming, Quantile based flooring and capping, and Mean/Median Imputation in static data depending on the capacity and problem statement at hand.
In time-series data, there are a number of options that can be highly subjective to your use case. A few of them are:
- Using replacement: We can compute values that can replace the outlier and will make a better fit for the data. tsclean() function in R will fit a robust trend using loess (for non-seasonal series), or robust trend and seasonal components using STL (for seasonal series) to compute the replacement value.
- Studying the business: This is not a technical approach but an ad-hoc one. You see, identifying and studying the business behind the problem can really help deal with the outlier. Whether it is wiser to drop it or replace it will come from first studying it inside out.
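For the replacement option, here is a rough Python sketch of the idea. It is not a port of R's tsclean(); it swaps in a much simpler rule that flags points far from a rolling median and replaces them with that median. The series, the window of 5, and the cutoff of 5 are all assumptions for the demo.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one obvious spike
s = pd.Series([10.0, 11.0, 10.5, 11.2, 50.0, 10.8, 11.1, 10.9, 11.3, 10.7])

# A centered rolling median is barely affected by a single spike
med = s.rolling(window=5, center=True, min_periods=1).median()

# Distance of every point from its local median, and the typical distance (MAD)
abs_dev = (s - med).abs()
mad = abs_dev.median()

threshold = 5.0  # assumed cutoff, in multiples of the MAD
is_outlier = abs_dev > threshold * mad

# Keep normal points, replace flagged ones with the local median
cleaned = s.where(~is_outlier, med)
print(cleaned)
```

The centered window looks at values on both sides of each point, which is fine for cleaning historical data but not for flagging anomalies as they arrive.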

Best practices while working on time-series data and forecasting

Although there are no fixed steps to be followed while working on time-series and forecasting, there are still some good practices one can employ to get optimal results.

No One-size-fits-all: No forecasting method performs best for all time-series. You need to understand the problem statement, type of features, and goals before starting to work on forecasting. Some domains you can select algorithms from depending on your need (compute + goals):
- statistical models,
- machine learning,
- and hybrid methods.
Feature selection: The selection of features has an impact on the resulting forecast error, so it has to be done carefully. There are different kinds of methods: filter (e.g., correlation analysis), wrapper (i.e., adding or removing features iteratively), and embedded (i.e., the selection is already part of the forecasting method).
Countering Overfitting: During training there is a risk of overfitting, as the model that best fits the historical data does not always produce the best forecast. To counteract overfitting, the historical data can be split into train and test data and internal validations can be conducted.
Data preprocessing: Data should be first analyzed and preprocessed to make it clean for forecasting. Data can contain missing values and as most forecasting methods can’t handle missing values, values have to be imputed.
Keep the Curse of Dimensionality in mind: When models in training are presented with a lot of dimensions and a lot of potential factors, they can encounter the Curse of Dimensionality, which says that as we have a finite amount of training data and we add more dimensions to that data, we start having diminishing returns in terms of accuracy.
Working with Seasonal Data Patterns: If there is seasonality in time series data, multiple cycles that include that seasonal pattern are required to make a proper forecast. Otherwise, there is no way for the model to learn the pattern.
Deal with Anomalies before moving to Forecast: Anomalies can create a huge bias in what the model learns, and more often than not the results will be subpar.
Studying the problem statement carefully: This is probably the most underrated practice especially when you’re just starting to work on a time-series problem. Identify your time-based and non-time-based features, study the data first before moving to any standard techniques.
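On the overfitting point, the split has to respect time order: training data must always come before the data it is validated on. Here is a minimal pure-Python sketch of an expanding-window split; the function name and parameters are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))  # everything before the test block
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
```

scikit-learn's `TimeSeriesSplit` offers a ready-made version of the same idea, so a hand-rolled splitter like this is mostly useful for understanding what it does.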

You’ve reached the end!

We successfully understood the difference in structure and approach between Time-series and static data. The sections listed in this blog are by no means exhaustive. There can be more differences when we move to more granularities pertaining to specific data problems in each. Here are some of my favorite resources you can refer to while studying about time-series:

References

Machine Learning for Stock Price Prediction

Katherine (Yi) Li — Fri, 22 Jul 2022 06:15:04 +0000

The stock market is known for being volatile, dynamic, and nonlinear. Accurate stock price prediction is extremely challenging because of multiple (macro and micro) factors, such as politics, global economic conditions, unexpected events, a company’s financial performance, and so on.

But all of this also means that there’s a lot of data to find patterns in. So, financial analysts, researchers, and data scientists keep exploring analytics techniques to detect stock market trends. This gave rise to the concept of algorithmic trading, which uses automated, pre-programmed trading strategies to execute orders.

In this article, we’ll be using both traditional quantitative finance methodology and machine learning algorithms to predict stock movements. We’ll go through the following topics:

Stock analysis: fundamental vs. technical analysis
Stock prices as time-series data and related concepts
Predicting stock prices with Moving Average techniques
Introduction to LSTMs
Predicting stock prices with an LSTM model
Final thoughts on new methodologies, such as ESN

Disclaimer: this project/article is not intended to provide financial, trading, and investment advice. No warranties are made regarding the accuracy of the models. Audiences should conduct their due diligence before making any investment decisions using the methods or code presented in this article.

Stock analysis: fundamental analysis vs. technical analysis

When it comes to stocks, fundamental and technical analyses are at opposite ends of the market analysis spectrum.

Fundamental analysis (you can read more about it here):
- Evaluates a company’s stock by examining its intrinsic value, including but not limited to tangible assets, financial statements, management effectiveness, strategic initiatives, and consumer behaviors; essentially all the basics of a company.
- Being a relevant indicator for long-term investment, the fundamental analysis relies on both historical and present data to measure revenues, assets, costs, liabilities, and so on.
- Generally speaking, the results from fundamental analysis don’t change with short-term news.
Technical analysis (you can read more about it here):
- Analyzes measurable data from stock market activities, such as stock prices, historical returns, and volume of historical trades; i.e. quantitative information that could identify trading signals and capture the movement patterns of the stock market.
- Technical analysis focuses on historical data and current data just like fundamental analysis, but it’s mainly used for short-term trading purposes.
- Due to its short-term nature, technical analysis results are easily influenced by news.
- Popular technical analysis methodologies include moving average (MA), support and resistance levels, as well as trend lines and channels.

For our exercise, we'll be looking at technical analysis solely and focusing on the Simple MA and Exponential MA techniques to predict stock prices. Additionally, we'll utilize LSTM (Long Short-Term Memory), a deep learning architecture for sequential data such as time series, to build a predictive model and compare its performance against our technical analysis.

As stated in the disclaimer, stock trading strategy is not in the scope of this article. I’ll be using trading/investment terms only to help you better understand the analysis, but this is not financial advice. We’ll be using terms like:

trend indicators: statistics that represent the trend of stock prices,
medium-term movements: the 50-day movement trend of stock prices.

Stock prices as time-series data

Despite the volatility, stock prices aren’t just randomly generated numbers. So, they can be analyzed as a sequence of discrete-time data; in other words, time-series observations taken at successive points in time (usually on a daily basis). Time series forecasting (predicting future values based on historical values) applies well to stock forecasting.

Because of the sequential nature of time-series data, we need a way to aggregate this sequence of information. From all the potential techniques, the most intuitive one is MA with the ability to smooth out short-term fluctuations. We’ll discuss more details in the next section.

Dataset analysis

For this demonstration exercise, we’ll use the closing prices of Apple’s stock (ticker symbol AAPL) from the past 21 years (1999-11-01 to 2021-07-09). Analysis data will be loaded from Alpha Vantage, which offers a free API for historical and real-time stock market data.

To get data from Alpha Vantage, you need a free API key; a walk-through tutorial can be found here. Don't want to create an API key? No worries, the analysis data is available here as well. If you feel like exploring other stocks, code to download the data is accessible in this Github repo. Once you have the API key, all you need is the ticker symbol for the particular stock.

For model training, we’ll use the oldest 80% of the data, and save the most recent 20% as the hold-out testing set.

# %% Train-Test split for time-series
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

stockprices = pd.read_csv("stock_market_data-AAPL.csv", index_col="Date")

test_ratio = 0.2
training_ratio = 1 - test_ratio

train_size = int(training_ratio * len(stockprices))
test_size = int(test_ratio * len(stockprices))
print(f"train_size: {train_size}")
print(f"test_size: {test_size}")

train = stockprices[:train_size][["Close"]]
test = stockprices[train_size:][["Close"]]

Creating a neptune.ai project

With regard to model training and performance comparison, neptune.ai makes it convenient for users to track everything model-related, including hyper-parameter specification and evaluation plots.

Disclaimer

neptune.ai is NOT a stock prediction software.

It is an experiment tracker for ML/AI teams that struggle with debugging and reproducing experiments, sharing results, and messy model handover.

neptune.ai lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye.

Now, let’s create a project in Neptune for this particular exercise and name it “StockPrediction”.

Disclaimer

Please note that this article references a deprecated version of Neptune.

For information on the latest version with improved features and functionality, please visit our website.

Evaluation metrics and helper functions

Since stock price prediction is essentially a regression problem, RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error, %) will be our model evaluation metrics. Both are useful measures of forecast accuracy.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}(A_t - F_t)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{N}\sum_{t=1}^{N}\left|\frac{A_t - F_t}{A_t}\right|$$

where N = the number of time points, A_t = the actual / true stock price, and F_t = the predicted / forecast value.

RMSE measures the typical size of the difference between predicted and true values, in the units of the price, whereas MAPE (%) measures this difference relative to the true values. For example, a MAPE value of 12% indicates that, on average, the predicted stock price is off by 12% of the actual stock price.

Next, let’s create several helper functions for the current exercise.

Split the stock prices data into training sequence X and the next output value Y,

## Split the time-series data into training seq X and output value Y
def extract_seqX_outcomeY(data, N, offset):
    """
    Split time-series into training sequence X and outcome value Y
    Args:
        data - dataset
        N - window size, e.g., 50 for 50 days of historical stock prices
        offset - position to start the split
    """
    X, y = [], []

    for i in range(offset, len(data)):
        X.append(data[i - N : i])
        y.append(data[i])

    return np.array(X), np.array(y)
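To see what this helper returns, here is a toy run on ten made-up "prices". The function is repeated so the snippet runs on its own; the window of 3 is only for readability.

```python
import numpy as np

def extract_seqX_outcomeY(data, N, offset):
    """Same helper as above: N-step windows as X, the following value as y."""
    X, y = [], []
    for i in range(offset, len(data)):
        X.append(data[i - N : i])
        y.append(data[i])
    return np.array(X), np.array(y)

# Ten fake prices in a single column, shaped like the scaled "Close" data
toy = np.arange(10, dtype=float).reshape(-1, 1)

X, y = extract_seqX_outcomeY(toy, N=3, offset=3)

print(X.shape, y.shape)    # (samples, window, features) and (samples, features)
print(X[0].ravel(), y[0])  # first window and the value that follows it
```

Note that `offset` should be at least `N`; with a smaller offset the first slices start at a negative index and come back empty or misaligned.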

Calculate the RMSE and MAPE (%),

#### Calculate the metrics RMSE and MAPE ####
def calculate_rmse(y_true, y_pred):
    """
    Calculate the Root Mean Squared Error (RMSE)
    """
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse


def calculate_mape(y_true, y_pred):
    """
    Calculate the Mean Absolute Percentage Error (MAPE) %
    """
    y_pred, y_true = np.array(y_pred), np.array(y_true)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return mape

Calculate the evaluation metrics for technical analysis and log to Neptune,

def calculate_perf_metrics(var):
    ### RMSE
    rmse = calculate_rmse(
        np.array(stockprices[train_size:]["Close"]),
        np.array(stockprices[train_size:][var]),
    )
    ### MAPE
    mape = calculate_mape(
        np.array(stockprices[train_size:]["Close"]),
        np.array(stockprices[train_size:][var]),
    )

    ## Log to Neptune
    run["RMSE"] = rmse
    run["MAPE (%)"] = mape

    return rmse, mape

Plot the trend of the stock prices and log the plot to Neptune,

def plot_stock_trend(var, cur_title, stockprices=stockprices):
    ax = stockprices[["Close", var, "200day"]].plot(figsize=(20, 10))
    plt.grid(False)
    plt.title(cur_title)
    plt.axis("tight")
    plt.ylabel("Stock Price ($)")

    ## Log to Neptune
    run["Plot of Stock Predictions"].upload(
        neptune.types.File.as_image(ax.get_figure())
    )

Predicting stock price with Moving Average (MA) technique

MA is a popular method to smooth out random movements in the stock market. Similar to a sliding window, an MA is an average that moves along the time scale/periods; older data points get dropped as newer data points are added.

Commonly used periods are 20-day, 50-day, and 200-day MA for short-term, medium-term, and long-term investment respectively.

Two types of MA are most preferred by financial analysts: Simple MA and Exponential MA.

Simple MA

SMA, short for Simple Moving Average, calculates the average of a range of stock (closing) prices over a specific number of periods in that range. The formula for SMA is:

$$\mathrm{SMA} = \frac{P_1 + P_2 + \dots + P_N}{N}$$

where P_n = the stock price at time point n, and N = the number of time points.

For this exercise of building an SMA model, we’ll use the Python code below to compute the 50-day SMA. We’ll also add a 200-day SMA for good measure.

import neptune

window_size = 50

# Initialize a Neptune run (myProject is assumed to hold your "workspace/project" name)
run = neptune.init_run(
    project=myProject,
    name="SMA",
    description="stock-prediction-machine-learning",
    tags=["stockprediction", "MA_Simple", "neptune"],
)

window_var = f"{window_size}day"

stockprices[window_var] = stockprices["Close"].rolling(window_size).mean()

### Include a 200-day SMA for reference
stockprices["200day"] = stockprices["Close"].rolling(200).mean()

### Plot and performance metrics for SMA model
plot_stock_trend(var=window_var, cur_title="Simple Moving Averages")
rmse_sma, mape_sma = calculate_perf_metrics(var=window_var)

### Stop the run
run.stop()

In our Neptune run, we’ll see the performance metrics on the testing set; RMSE = 43.77, and MAPE = 12.53%. In addition, the trend chart shows the 50-day, 200-day SMA predictions compared with the true stock closing values.

It’s not surprising to see that the 50-day SMA is a better trend indicator than the 200-day SMA in terms of (short-to-) medium movements. Both indicators, nonetheless, seem to give smaller predictions than the actual values.

Exponential MA

Different from SMA, which assigns equal weights to all historical data points, EMA, short for Exponential Moving Average, applies higher weights to recent prices, i.e., tail data points of the 50-day MA in our example. The magnitude of the weighting factor depends on the number of time periods. The formula to calculate EMA is:

$$\mathrm{EMA}_t = k \cdot P_t + (1 - k) \cdot \mathrm{EMA}_{t-1}$$

where P_t = the price at time point t,

EMA_{t-1} = the EMA at time point t-1,

N = the number of time points in the EMA,

and the weighting factor k = 2/(N+1).
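As a sanity check on the recursion, the sketch below computes the EMA by hand and compares it with pandas' `ewm(span=N, adjust=False)`, which uses the same weighting factor k = 2/(N+1). The five prices and N = 3 are made up for illustration, and seeding the EMA with the first price is the convention pandas follows with `adjust=False`.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])
N = 3
k = 2 / (N + 1)  # weighting factor

# Manual recursion: start from the first price, then blend each new price in
ema = [prices.iloc[0]]
for p in prices.iloc[1:]:
    ema.append(k * p + (1 - k) * ema[-1])

# The same calculation done by pandas
pandas_ema = prices.ewm(span=N, adjust=False).mean()

print(ema)
print(pandas_ema.tolist())
```

A larger N gives a smaller k, so each new price moves the average less and the curve becomes smoother but slower to react.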

One advantage of the EMA over SMA is that EMA is more responsive to price changes, which makes it useful for short-term trading. Here’s a Python implementation of EMA:

# Initialize a Neptune run
run = neptune.init_run(
    project=myProject,
    name="EMA",
    description="stock-prediction-machine-learning",
    tags=["stockprediction", "MA_Exponential", "neptune"],
)

###### Exponential MA
window_ema_var = f"{window_var}_EMA"

# Calculate the 50-day exponentially weighted moving average
stockprices[window_ema_var] = (
    stockprices["Close"].ewm(span=window_size, adjust=False).mean()
)
stockprices["200day"] = stockprices["Close"].rolling(200).mean()

### Plot and performance metrics for EMA model
plot_stock_trend(
    var=window_ema_var, cur_title="Exponential Moving Averages")
rmse_ema, mape_ema = calculate_perf_metrics(var=window_ema_var)

### Stop the run
run.stop()

Examining the performance metrics tracked in Neptune, we have RMSE = 36.68, and MAPE = 10.71%, which is an improvement from SMA’s 43.77 and 12.53% for RMSE and MAPE, respectively. The trend chart generated from this EMA model also implies that it outperforms the SMA.

Comparison of the SMA and EMA prediction performance

The screenshot below shows a comparison of SMA and EMA side-by-side in Neptune.

Introduction to LSTMs for the time-series data

Now, let’s move on to the LSTM model. LSTM, short for Long Short-term Memory, is an extremely powerful algorithm for time series. It can capture historical trend patterns, and predict future values with high accuracy.

In a nutshell, the key component to understand an LSTM model is the Cell State (Ct), which represents the internal short-term and long-term memories of a cell.

To control and manage the cell state, an LSTM model contains three gates/layers. It’s worth mentioning that the “gates” here can be treated as filters to let information in (being remembered) or out (being forgotten).

Forget gate:

As the name implies, the forget gate decides which information to throw away from the current cell state. Mathematically, it applies a sigmoid function to output a value in [0, 1] for each value from the previous cell state (Ct-1); here '1' indicates "completely passing through" whereas '0' indicates "completely filtering out".

Input gate:

It's used to choose which new information gets added and stored in the current cell state. In this layer, a sigmoid function produces the input gate vector (i_t), with each value in [0, 1], and a tanh function produces a vector of candidate values (C̃_t), each squashed into [-1, 1]. The element-by-element multiplication of i_t and C̃_t represents the new information that needs to be added to the current cell state.

Output gate:

The output gate controls the output flowing to the next hidden state. Similar to the input gate, it applies a sigmoid to the inputs and a tanh to the updated cell state, then multiplies the two to filter out unwanted information, keeping only what we've decided to let through.
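To tie the three gates together, here is a minimal NumPy sketch of a single LSTM time step. It follows the standard LSTM equations rather than any particular library's internals; the weight layout (one matrix `W` producing all four gate pre-activations), the sizes, and the random weights are assumptions made for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps the concatenated [h_prev, x_t] to four gate blocks."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b

    f_t = sigmoid(z[:hidden])                     # forget gate: what to keep from c_prev
    i_t = sigmoid(z[hidden:2 * hidden])           # input gate: how much new info to admit
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])   # candidate values in [-1, 1]
    o_t = sigmoid(z[3 * hidden:])                 # output gate: what to expose

    c_t = f_t * c_prev + i_t * c_tilde            # updated cell state
    h_t = o_t * np.tanh(c_t)                      # new hidden state (the cell's output)
    return h_t, c_t

# Untrained random weights, just to run the mechanics
rng = np.random.default_rng(0)
hidden, n_features = 4, 1
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + n_features))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for price in [0.1, 0.3, 0.2]:  # three scaled "prices", fed one step at a time
    h, c = lstm_step(np.array([price]), h, c, W, b)

print(h)
```

Frameworks like Keras do the same thing per step, but with learned weights and batched inputs.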

For a more detailed understanding of LSTM, you can check out this document.

Knowing the theory of LSTM, you must be wondering how it does at predicting real-world stock prices. We’ll find out in the next section, by building an LSTM model and comparing its performance against the two technical analysis models: SMA and EMA.

Predicting stock prices with an LSTM model

First, we need to create a Neptune experiment dedicated to LSTM, which includes the specified hyper-parameters.

layer_units = 50
optimizer = "adam"
cur_epochs = 15
cur_batch_size = 20

cur_LSTM_args = {
    "units": layer_units,
    "optimizer": optimizer,
    "batch_size": cur_batch_size,
    "epochs": cur_epochs,
}

# Initialize a Neptune run
run = neptune.init_run(
    project=myProject,
    name="LSTM",
    description="stock-prediction-machine-learning",
    tags=["stockprediction", "LSTM", "neptune"],
)
run["LSTM_args"] = cur_LSTM_args

Next, we scale the input data for the LSTM model and split it into train and test sets.

from sklearn.preprocessing import StandardScaler

# Scale our dataset: fit the scaler on the full Close series, then keep the training portion
scaler = StandardScaler()
scaled_data = scaler.fit_transform(stockprices[["Close"]])
scaled_data_train = scaled_data[: train.shape[0]]

# We use the past 50 days' stock prices to predict the 51st day's closing price.
X_train, y_train = extract_seqX_outcomeY(scaled_data_train, window_size, window_size)

A couple of notes:

we use the StandardScaler, rather than the MinMaxScaler as you might have seen before. The reason is that stock prices are ever-changing, and there are no true min or max values. It doesn’t make sense to use the MinMaxScaler, although this choice probably won’t lead to disastrous results at the end of the day;
stock price data in its raw format can’t be used in an LSTM model directly; we need to transform it using our pre-defined `extract_seqX_outcomeY` function. For instance, to predict the 51st price, this function creates input vectors of 50 data points prior and uses the 51st price as the outcome value.

Moving on, let’s kick off the LSTM modeling process. Specifically, we’re building an LSTM with two hidden layers, and a ‘linear’ activation function upon the output. We also use Neptune’s Keras integration to monitor and log model training progress live.

Read more about the integration in the Neptune docs.

### Setup Neptune's Keras integration ###
from neptune.integrations.tensorflow_keras import NeptuneCallback

neptune_callback = NeptuneCallback(run=run)

### Build a LSTM model and log training progress to Neptune ###

def Run_LSTM(X_train, layer_units=50):
    inp = Input(shape=(X_train.shape[1], 1))

    x = LSTM(units=layer_units, return_sequences=True)(inp)
    x = LSTM(units=layer_units)(x)
    out = Dense(1, activation="linear")(x)
    model = Model(inp, out)

    # Compile the LSTM neural net
    model.compile(loss="mean_squared_error", optimizer="adam")

    return model


model = Run_LSTM(X_train, layer_units=layer_units)

history = model.fit(
    X_train,
    y_train,
    epochs=cur_epochs,
    batch_size=cur_batch_size,
    verbose=1,
    validation_split=0.1,
    shuffle=True,
    callbacks=[neptune_callback],
)

Training progress is visible live on Neptune.

Once the training completes, we’ll test the model against our hold-out set.

# predict stock prices using past window_size stock prices
def preprocess_testdat(data=stockprices, scaler=scaler, window_size=window_size, test=test):
    raw = data["Close"][len(data) - len(test) - window_size:].values
    raw = raw.reshape(-1,1)
    raw = scaler.transform(raw)

    X_test = [raw[i-window_size:i, 0] for i in range(window_size, raw.shape[0])]
    X_test = np.array(X_test)

    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
    return X_test

X_test = preprocess_testdat()

predicted_price_ = model.predict(X_test)
predicted_price = scaler.inverse_transform(predicted_price_)

# Store the predictions next to the actual closing prices for evaluation and plotting
test["Predictions_lstm"] = predicted_price

Time to calculate the performance metrics and log them to Neptune.

# Evaluate performance
rmse_lstm = calculate_rmse(np.array(test["Close"]), np.array(test["Predictions_lstm"]))
mape_lstm = calculate_mape(np.array(test["Close"]), np.array(test["Predictions_lstm"]))

### Log to Neptune
run["RMSE"] = rmse_lstm
run["MAPE (%)"] = mape_lstm

### Plot prediction and true trends and log to Neptune
def plot_stock_trend_lstm(train, test):
    fig = plt.figure(figsize = (20,10))
    plt.plot(np.asarray(train.index), np.asarray(train["Close"]), label = "Train Closing Price")
    plt.plot(np.asarray(test.index), np.asarray(test["Close"]), label = "Test Closing Price")
    plt.plot(np.asarray(test.index), np.asarray(test["Predictions_lstm"]), label = "Predicted Closing Price")
    plt.title("LSTM Model")
    plt.xlabel("Date")
    plt.ylabel("Stock Price ($)")
    plt.legend(loc="upper left")

    ## Log image to Neptune
    run["Plot of Stock Predictions"].upload(neptune.types.File.as_image(fig))

plot_stock_trend_lstm(train, test)

### Stop the run after logging
run.stop()
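
The `calculate_rmse` and `calculate_mape` helpers used above are defined earlier in the article. For readers who land on this section directly, the following is a minimal sketch of what such metrics compute; the function names `rmse` and `mape` and the toy numbers are assumptions for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumes y_true has no zeros)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


actual = [100.0, 200.0, 400.0]    # toy values
forecast = [110.0, 180.0, 400.0]  # toy values
print(round(rmse(actual, forecast), 2), round(mape(actual, forecast), 2))
```

RMSE penalizes large misses more heavily, while MAPE expresses the error relative to the price level, which makes it easier to compare across stocks.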

In Neptune, it’s amazing to see that our LSTM model achieved an RMSE of 12.58 and a MAPE of 2%, a tremendous improvement over the SMA and EMA models! The trend chart shows a near-perfect overlay of the predicted and actual closing prices for our testing set.

Final thoughts on new methodologies

We’ve seen the advantage of LSTMs over traditional MA models in the example of predicting Apple stock prices. Be careful about generalizing to other stocks: unlike many well-behaved, stationary time series, stock market data shows little to no seasonality and is far more chaotic.

In our example, Apple, as one of the biggest tech giants, has not only established a mature business model and management, its sales figures also benefit from the release of innovative products or services. Both contribute to the lower implied volatility of Apple stock, making the predictions relatively easier for the LSTM model in contrast to different, high-volatility stocks.

To account for the chaotic dynamics of the stock market, Echo State Networks (ESNs) have been proposed. A member of the RNN (Recurrent Neural Network) family, an ESN uses a hidden layer of sparsely and randomly interconnected neurons; this hidden layer is referred to as the ‘reservoir’ and is designed to capture the non-linear history of the input data.

Schema of an Echo State Network (ESN)

At a high level, an ESN takes in a time-series input vector and maps it to a high-dimensional feature space, i.e. the dynamical reservoir, whose connections are random and stay fixed rather than being trained like the weights of a regular network. Then, at the output layer, a linear readout, which is the only trained part, is applied to calculate the final predictions.
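
To make the reservoir idea more concrete, here is a minimal NumPy sketch of an ESN on a toy next-step prediction task. Everything in it (the sine-wave data, the reservoir size, the sparsity, the spectral radius of 0.9, the ridge penalty) is an assumption chosen for illustration; it is not a tuned setup and not the configuration from the referenced paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task (assumption): predict the next value of a sine wave
signal = np.sin(np.arange(600) * 0.1)
inputs, targets = signal[:-1], signal[1:]

n_reservoir = 100
# Fixed random input and reservoir weights: these are never trained
W_in = rng.uniform(-0.5, 0.5, size=n_reservoir)
W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
W[rng.random(W.shape) > 0.1] = 0.0               # keep roughly 10% of the connections
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # rescale the spectral radius to 0.9

# Drive the reservoir with the input and record its states
states = np.zeros((len(inputs), n_reservoir))
x = np.zeros(n_reservoir)
for t, u in enumerate(inputs):
    x = np.tanh(W_in * u + W @ x)
    states[t] = x

# Train only the linear readout with ridge regression, skipping a warm-up period
washout, n_train, ridge = 50, 400, 1e-4
S, y = states[washout:n_train], targets[washout:n_train]
W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_reservoir), S.T @ y)

pred = states[n_train:] @ W_out
test_mse = float(np.mean((pred - targets[n_train:]) ** 2))
print(f"test MSE: {test_mse:.6f}")
```

The design choice worth noticing is that training reduces to a single linear solve, because the recurrent weights are left untouched.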

If you’re interested in learning more about this methodology, check out the original paper by Jaeger and Haas.

In addition, it would be interesting to incorporate sentiment analysis on news and social media regarding the stock market in general, as well as a given stock of interest. Another promising approach for better stock price predictions is the hybrid model, where we add MA predictions as input vectors to the LSTM model. You might want to explore different methodologies, too.

Hope you enjoyed reading this article as much as I enjoyed writing it! The whole Neptune project is available here for your reference.

ARIMA & SARIMA: Real-World Time Series Forecasting

Aayush Bajaj — Thu, 21 Jul 2022 15:10:16 +0000

Time series analysis and forecasting have long been key problems in statistics and Data Science. Data becomes a time series when it’s sampled along a time-bound attribute like days, months, or years, which inherently gives it an implicit order. Forecasting is when we take that data and predict future values.

ARIMA and SARIMA are both algorithms for forecasting. ARIMA takes into account the past values (autoregressive, moving average) and predicts future values based on that. SARIMA similarly uses past values but also takes into account any seasonality patterns. Since SARIMA brings in seasonality as a parameter, it’s significantly more powerful than ARIMA in forecasting complex data spaces containing cycles.

Further in the blog, we’re going to explore:

ARIMA
- What it is and how it forecasts
- Example of predicting GDP of USA using ARIMA
SARIMA
- What it is and how it forecasts
- Example of predicting electricity consumption
Pros and cons of both models
Real-world use-cases of ARIMA and SARIMA

Before we move on to the algorithms, there’s an important section about data processing that you should be aware of before embarking on your forecasting journey.

Data preprocessing for time series forecasting

Time series data is messy. Forecasting models, from simple rolling averages to LSTMs, require the data to be clean. So here are some techniques you could use before moving to forecasting.

Note: This data preprocessing step is general, and it’s highlighted here because real-world projects involve a lot of cleaning and preparation.

Detrending/Stationarity: Before forecasting, we want our time series variables to be mean-variance stationary. This means that the statistical properties of the series do not depend on when the sample was taken. Models built on stationary data are generally more robust. Stationarity can often be achieved by differencing.
Anomaly detection: Any outlier present in the data might skew the forecasting results, so it’s considered good practice to identify and normalize outliers before moving on to forecasting. You could follow this blog here where I have explained anomaly detection algorithms at length.
Check for sampling frequency: This is an important step to check the regularity of sampling. Irregular data has to be imputed or made uniform before applying any modeling techniques, because irregular sampling breaks the integrity of the time series and doesn’t fit well with the models.
Missing data: At times there can be missing data for some datetime values, and it needs to be addressed before modeling. For example, time series data with missing values looks like this:

Missing data in time series | Source
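
The preprocessing steps listed above can be sketched with pandas as follows. The toy daily series, the choice of time-based linear interpolation, and the first-order differencing are assumptions for illustration; the right imputation method depends on your data.

```python
import numpy as np
import pandas as pd

# Toy daily series (assumption) with two timestamps missing entirely
idx = pd.date_range("2022-01-01", periods=10, freq="D").delete([3, 7])
ts = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# 1. Enforce a regular daily frequency; absent dates become NaN
regular = ts.asfreq("D")

# 2. Fill the gaps by time-based linear interpolation
filled = regular.interpolate(method="time")

# 3. First-order differencing to remove a linear trend
differenced = filled.diff().dropna()

print(regular.isna().sum(), filled.isna().sum(), len(differenced))
```

Making the sampling regular first matters, because interpolation and differencing both assume evenly spaced observations.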

Now, let’s move on to the models.

ARIMA

The ARIMA model is a class of linear models that utilizes historical values to forecast future values. ARIMA stands for Autoregressive Integrated Moving Average, and each of these components contributes to the final forecast. Let’s go through them one by one.

Autoregressive (AR)

In an autoregression model, we forecast the variable of interest using a linear combination of past values of that variable. The term autoregression indicates that it is a regression of the variable against itself. That is, we use lagged values of the target variable as our input variables to forecast values for the future. An autoregression model of order p will look like:

$$m_t = \beta_0 + \beta_1 m_{t-1} + \beta_2 m_{t-2} + \beta_3 m_{t-3} + \dots + \beta_p m_{t-p}$$

In the above equation, the currently observed value of m is a linear function of its past p values. $\beta_0, \dots, \beta_p$ are the regression coefficients, which are determined during training. There are some standard methods to determine a good value of p, one of which is analyzing the Autocorrelation and Partial Autocorrelation function plots.
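
As a small illustration of "a regression of the variable against itself", the sketch below simulates a toy AR(2) series and recovers its coefficients with ordinary least squares on lagged values. The simulated process, its coefficients, and the use of plain least squares are assumptions for illustration; libraries such as statsmodels offer dedicated estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a toy AR(2) process (assumption):
# m_t = 0.5 + 0.6 * m_{t-1} - 0.2 * m_{t-2} + noise
n = 2000
m = np.zeros(n)
for t in range(2, n):
    m[t] = 0.5 + 0.6 * m[t - 1] - 0.2 * m[t - 2] + rng.normal(scale=0.5)

p = 2
# Design matrix: a column of ones (intercept) plus the p lagged values
X = np.column_stack([np.ones(n - p)] + [m[p - k:n - k] for k in range(1, p + 1)])
y = m[p:]

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and lag coefficients:", np.round(coefs, 2))

# One-step-ahead forecast from the last p observations (most recent lag first)
next_value = coefs[0] + coefs[1:] @ m[-1:-p - 1:-1]
print("forecast:", round(float(next_value), 3))
```

The estimated coefficients should land close to the values used in the simulation, which is a handy sanity check before applying the same idea to real data.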

The autocorrelation function (ACF) is the correlation between the current and the past values of the same variable. Apart from the direct effect, it also includes the indirect effect that values carry over through time. For example, prices of oil 2 days ago will affect prices 1 day ago and, eventually, today. But the prices of oil 2 days ago might also have a direct effect on today. The ACF at lag 2 measures both effects together.

Partial Autocorrelation (PACF), on the other hand, measures only the direct correlation between past values and current values. For example, PACF will only measure the direct effect of prices of oil 2 days ago on today, with the indirect effect through yesterday’s prices removed.

ACF and PACF plots help us determine past value dependency, which in turn helps us deduce p in AR and q in MA. As a rule of thumb, the PACF points to p and the ACF points to q. Head over here to understand how to deduce values for p (AR) and q (MA) in depth.

Integrated (I)

Integrated represents any differencing that has to be applied in order to make the data stationary. A Dickey-Fuller test (code below) can be run on the data to check for stationarity, and we can then experiment with different differencing factors. A differencing factor of d=1 means taking the difference between consecutive values, i.e. $m_t - m_{t-1}$. Let’s look at a plot of original vs differenced data.

Original Data | Source: Author

After applying d=1 | Source: Author

The difference between them is evident. After differencing, the series is significantly more stationary than the original, and the mean and variance are approximately constant over the years. We can use the code below to conduct a Dickey-Fuller test.

def check_stationarity(ts):
    dftest = adfuller(ts)
    adf = dftest[0]
    pvalue = dftest[1]
    critical_value = dftest[4]['5%']
    if (pvalue < 0.05) and (adf < critical_value):
        print('The series is stationary')
    else:
        print('The series is NOT stationary')

Moving Average (MA)

Moving average models use past forecast errors rather than past values in a regression-like model to forecast future values. A moving average model can be denoted by the following equation:

$$m_t = \beta_0 + \beta_1 e_{t-1} + \beta_2 e_{t-2} + \beta_3 e_{t-3} + \dots + \beta_q e_{t-q}$$

This is referred to as an MA(q) model. In the above equation, e is called the error, and it represents the random residual deviations between the model and the target variable. Since e can only be determined after fitting the model, it is unobservable, which means the coefficients can’t be estimated with ordinary least squares (OLS). Hence, iterative techniques like Maximum Likelihood Estimation are used to solve the MA equation.

Now that we’ve looked at how ARIMA works, let’s dive into an example and see how ARIMA is applied to time series data.

Implementing ARIMA

For the implementation, I’ve chosen catfish sales data from 1996 to 2008. We’re going to apply the techniques we learned above to this dataset and see them in action. Although the data doesn’t need a lot of cleaning and is in a ready-to-be-analyzed state, you might have to apply cleaning techniques to your own dataset.

Unfortunately, we cannot replicate each and every scenario as cleaning methods are highly subjective and depend on the team’s requirements too. But the techniques learned here can be directly applied to your dataset after cleaning.

Let’s start with importing essential modules.

Importing dependencies

from IPython.display import display

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import matplotlib.pyplot as plt
from datetime import datetime
from datetime import timedelta
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()

from statsmodels.tsa.seasonal import seasonal_decompose
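# Legacy import: this class no longer works in statsmodels 0.13 and later. Newer versions use
# statsmodels.tsa.arima.model.ARIMA, whose predict() returns values on the original scale.
from statsmodels.tsa.arima_model import ARIMA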
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from time import time
import seaborn as sns
sns.set(style="whitegrid")

import warnings
warnings.filterwarnings('ignore')

# np.random.seed() returns None, so keep the seed value in its own variable
RANDOM_SEED = 0
np.random.seed(RANDOM_SEED)

These are pretty self-explanatory modules every Data Scientist will be familiar with. It’s always a good practice to set the RANDOM_SEED to make code reproducible with the same results.

Next, we’re going to import and plot the time series data.

Extract-Transform-Load (ETL)

def parser(s):
    return datetime.strptime(s, '%Y-%m-%d')
#read data
catfish_sales = pd.read_csv('catfish.csv', parse_dates=[0], index_col=0, date_parser=parser)
#infer the frequency of the data
catfish_sales = catfish_sales.asfreq(pd.infer_freq(catfish_sales.index))

#transform
start_date = datetime(1996,1,1)
end_date = datetime(2008,1,1)
lim_catfish_sales = catfish_sales[start_date:end_date]

#plot
plt.figure(figsize=(14,4))
plt.plot(lim_catfish_sales)
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)

For the sake of simplicity, I’ve limited the data to 1996-2008. The plot generated by the above code looks like this:

Catfish sales | Source: Author

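First impressions say there is a definite trend and seasonality present in the data. Let’s decompose the series into trend, seasonal, and residual components to get a better understanding. Note that statsmodels’ seasonal_decompose, used below, performs a classical moving-average decomposition rather than the LOESS-based STL procedure.

Seasonal decomposition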

plt.rc('figure',figsize=(14,8))
plt.rc('font',size=15)

result = seasonal_decompose(lim_catfish_sales,model='additive')
fig = result.plot()

The resulting plot looks like this:

Points to ponder:

A 6-month and 12-month seasonal pattern is visible
An upwards and downwards trend is evident

Let’s look at ACF and PACF plots to get an idea for p and q values.

ACF and PACF plots

plot_acf(lim_catfish_sales['Total'], lags=48);
plot_pacf(lim_catfish_sales['Total'], lags=30);

The output of the above code plots ACF and PACF:

Autocorrelation plot for Catfish data

Partial autocorrelation plot for Catfish data

Points to ponder:

There’s a significant spike at 6-month and 12-month in ACF
PACF is nearly sinusoidal

The differencing factor d should be kept at 1 since there’s a clear trend and the data is non-stationary. p can be tested with values 6 and 12.

Fitting ARIMA

We’re going to use the statsmodels module to implement ARIMA. For this, we’ve imported the ARIMA class from statsmodels. Now, let’s fit the model with the parameters we discussed in the previous section.

arima = ARIMA(lim_catfish_sales['Total'], order=(12,1,1))
predictions = arima.fit().predict()

As you can see above, I started with (12,1,1) for (p,d,q), based on what we saw in the ACF and PACF plots.

Note: It is quite handy to use ready-made libraries for algorithms (like scikit-learn), and you’ll be glad to know that statsmodels is one of the most widely used ones for time series.

Let’s see how our predictions stack up with the original data.

Visualizing the result

plt.figure(figsize=(16,4))
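# The legacy ARIMA class predicts on the differenced scale when d=1, so compare against the differenced series
plt.plot(lim_catfish_sales.diff(), label="Actual")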
plt.plot(predictions, label="Predicted")
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)
plt.legend()

The output of the above code will give you a comparative plot of predictions and the actual data.

A comparative plot of predictions and the actual data

You can see here that the model didn’t quite catch some of the peaks but captured the essence of the data well. We can experiment with more p, d, q values to help the model generalize better and make sure it doesn’t overfit.

Trial and error is one way, but you can also use Auto-ARIMA. It essentially does the heavy lifting and tunes the hyperparameters for you. This blog is a good starting point for auto-ARIMA.

Keep in mind that with Auto-ARIMA you still have to be able to explain the chosen parameters. Make sure the model doesn’t turn into a black box, as forecasting models often have to pass governance before deployment. So, it’s good practice to be able to explain the parameter values and their contribution.

SARIMA

SARIMA stands for Seasonal-ARIMA, and it includes the contribution of seasonality to the forecast. The importance of seasonality is quite evident, and ARIMA doesn’t model that information explicitly.

The Autoregressive (AR), Integrated (I), and Moving Average (MA) parts of the model remain the same as in ARIMA. The addition of seasonality adds robustness to the SARIMA model. It’s commonly written as $\text{ARIMA}(p,d,q)(P,D,Q)_m$:

Source

where m is the number of observations per year. We use the uppercase notation for the seasonal parts of the model, and lowercase notation for the non-seasonal parts of the model.

Similar to ARIMA, the P, D, Q values for the seasonal parts of the model can be deduced from the ACF and PACF plots of the data. Let’s implement SARIMA for the same Catfish sales data.

Implementing SARIMA

The ETL and dependencies will remain the same as in ARIMA so we’ll jump straight to the modeling part.

Fitting SARIMA

sarima = SARIMAX(lim_catfish_sales['Total'],
                order=(1,1,1),
                seasonal_order=(1,1,0,12))
predictions = sarima.fit().predict()

I experimented with (1,1,1) for the non-seasonal part and (1,1,0,12) for the seasonal part, as the ACF showed 6-month and 12-month lagged correlations. Let’s see how it turned out.

Visualizing the result

plt.figure(figsize=(16,4))
plt.plot(lim_catfish_sales, label="Actual")
plt.plot(predictions, label="Predicted")
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)
plt.legend()

A comparative plot of predictions and the actual data

As you can see, the model struggled to fit at the start, probably because of how it is initialized, but it quickly found the right path. The fit is quite good compared to the ARIMA one, suggesting that SARIMA can learn seasonality better; if seasonality is present in the data, it makes sense to try SARIMA out.

Pros and cons of ARIMA and SARIMA models

Owing to their linear nature, both algorithms are quite handy and are used in the industry for experimentation, understanding the data, and creating baseline forecasting scores. If the orders (p,d,q) are tuned right, they can perform significantly better. The simple and explainable nature of both algorithms makes them one of the top picks for analysts and Data Scientists. There are, however, some pros and cons when working with ARIMA and SARIMA at scale. Let’s discuss both:

Pros of ARIMA & SARIMA

Easy to understand and interpret: The one thing that your fellow teammates and colleagues would appreciate is the simplicity and interpretability of the models. Focusing on both of these things while also maintaining the quality of the results will help with presentations with the stakeholders.
Limited variables: There are fewer hyperparameters so the config file will be easily maintainable if the model goes into production.

Cons of ARIMA & SARIMA

High computational cost: As p and q increase, there are more coefficients to fit, so training time grows quickly when p and q are high. This makes both of these algorithms hard to put into production and makes Data Scientists look into Prophet and other algorithms. Then again, it depends on the complexity of the dataset too.
Complex data: Your data may be too complex for any choice of p and q to work well. It’s unlikely that ARIMA and SARIMA would fail outright, but if this occurs, you may have to look elsewhere.
Amount of data needed: Both algorithms require considerable data to work with, especially if the data is seasonal. For example, three years of historical demand is likely not enough for a good forecast, which is a problem for short life-cycle products.

Real-world use-cases of ARIMA and SARIMA

ARIMA/SARIMA are among the most popular econometric models, used for forecasting stock prices, demand forecasting, and even the spread of infectious diseases. When the underlying mechanisms are not known or are too complicated, e.g., the stock market, or not fully known, e.g., retail sales, it is usually better to apply ARIMA or a similar statistical model than complex deep learning algorithms like RNNs.

However, there are cases where applying ARIMA can give you subpar results.

Here are some curated papers that use ARIMA/SARIMA:

An Application of ARIMA Model to Forecast the Dynamics of COVID-19 Epidemic in India: This research paper used ARIMA to forecast COVID-19 case numbers in India. The shortcoming of ARIMA in this case is that it only uses past values to forecast the future. But COVID-19 changes shape over time, and its spread depends on many behavioral factors beyond past values that ARIMA isn’t able to capture.
Time Series ARIMA Model for Prediction of Daily and Monthly Average Global Solar Radiation: The Case Study of Seoul, South Korea: This study uses SARIMA to forecast solar radiation in South Korea, based on hourly solar radiation data obtained from the Korean Meteorological Administration over 37 years.
Disease management with ARIMA model in time series: Another example of ARIMA in disease management, showing the wide applicability of ARIMA/SARIMA models. The paper touches on some real-life use cases for ARIMA. For example, a hospital in Singapore accurately predicted the number of beds it would need three days ahead during the SARS epidemic.
Forecasting of demand using the ARIMA model: This use case focuses on modeling and forecasting demand in a food company using ARIMA.

When it comes to the industry, here’s a nice article about forecasting in Uber.

More often than not, you’ll find ARIMA/SARIMA used when the problem statement is limited to past values, whether it’s predicting hospital beds, COVID cases, or forecasting demand. The shortcoming, however, arises when there are other factors to consider in forecasting, like attributes that are static. Look closely at the problem statement you’re working on; if these circumstances apply, try other methods like Theta, QRF (quantile regression forests), Prophet, or RNNs.

Conclusion and final notes

You’ve reached the end! In this blog, we discussed ARIMA and SARIMA at length pertaining to their utilization and importance for research in the industry. Their simplicity and robustness make them top contenders for modeling and forecasting. There are however some things to keep in mind while working on them in your real-world use case:

Increasing p and q can sharply increase training time. So, it’s advisable to deduce their values beforehand and then experiment.
They are prone to overfitting. So, make sure you set the hyperparameters right and do validation before moving to production.

That’s it from my side. Keep learning and stay tuned for more! Adios!

Anomaly Detection in Time Series

Aayush Bajaj — Thu, 21 Jul 2022 14:21:40 +0000

Time series are everywhere, whether in user behavior on a website, stock prices of a Fortune 500 company, or any other time-related example. Time series data shows up in every industry in some shape or form.

Naturally, it’s also one of the most researched types of data. As a rule of thumb, you could say time series is a type of data that’s sampled based on some kind of time-related dimension like years, months, or seconds.

Time series are observations that have been recorded in an orderly fashion and which are correlated in time.

While analyzing time series data, we have to watch out for outliers, much as we do in static data. If you’ve worked with data in any capacity, you know how much pain outliers cause for an analyst. These outliers are called “anomalies” in time series jargon.

What are anomalies/outliers and types of anomalies in time-series data?

From a traditional point of view, an outlier/anomaly is:

“An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”

Therefore, you can think of outliers as observations that don’t follow the expected behavior.

As the figure above shows, outliers in time series can have two different meanings. The semantic distinction between them is mainly based on your interest as the analyst, or the particular scenario.

In the first meaning, outliers are related to noise, erroneous, or unwanted data, which by itself isn’t interesting to the analyst. In these cases, outliers should be deleted or corrected to improve data quality and generate a cleaner dataset that can be used by other data mining algorithms. For example, sensor transmission errors are eliminated to obtain more accurate predictions, because the main goal is to make predictions.

Nevertheless, in recent years – especially in the area of time series data – many researchers have aimed to detect and analyze unusual, but interesting phenomena. Fraud detection is a good example – the main objective is to detect and analyze the outlier itself. These observations are often referred to as anomalies.

The anomaly detection problem for time series is usually formulated as identifying outlier data points relative to some norm or usual signal. Take a look at some outlier types:

Let’s break this down one by one:

Point outlier

A point outlier is a datum that behaves unusually in a specific time instance when compared either to the other values in the time series (global outlier), or to its neighboring points (local outlier).

Example: Are you aware of the GameStop frenzy? A slew of young retail investors bought GME stock to get back at big hedge funds, driving the stock price way up. That sudden, short-lived spike caused by an unlikely event is an additive (point) outlier. In general, an unexpected jump in a time-based value over a short period, which looks like a sudden spike, counts as an additive outlier.

Source: Google

Point outliers can be univariate or multivariate, depending on whether they affect one or more time-dependent variables, respectively.

Fig. 1a contains two univariate point outliers, O1 and O2, whereas the multivariate time series in Fig. 1b is composed of three variables and has both univariate (O3) and multivariate (O1 and O2) point outliers.

Fig: 1 — Point outliers in time series data. | Source

We will take a deeper look at Univariate Point Outliers in the Anomaly Detection section.
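
The difference between global and local point outliers can be illustrated with two simple checks: a z-score against the whole series and a deviation from a rolling median. The toy series, the window size, and both thresholds below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy series (assumption): a smooth ramp with one global and one local outlier
values = np.linspace(0, 100, 200)
values[50] = 300.0    # global outlier: far from every other value in the series
values[150] += 15.0   # local outlier: within the overall range, but far from its neighbors
series = pd.Series(values)

# Global check: z-score against the whole series
global_z = (series - series.mean()) / series.std()
global_flags = series.index[global_z.abs() > 3].tolist()

# Local check: deviation from a centered rolling median
rolling_median = series.rolling(window=11, center=True, min_periods=1).median()
deviation = (series - rolling_median).abs()
local_flags = series.index[deviation > 10].tolist()

print("global:", global_flags)
print("local :", local_flags)
```

The global check misses the point at position 150 because its value is unremarkable for the series as a whole, while the local check catches it by comparing it with its neighbors.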

Subsequence outlier

This means consecutive points in time whose joint behavior is unusual, although each observation individually is not necessarily a point outlier. Subsequence outliers can also be global or local, and can affect one (univariate subsequence outlier) or more (multivariate subsequence outlier) time-dependent variables.

Fig. 2 provides an example of univariate (O1 and O2 in Fig. 2a, and O3 in Fig. 2b) and multivariate (O1 and O2 in Fig. 2b) subsequence outliers. Note that the latter does not necessarily affect all the variables (e.g., O2 in Fig. 2b).

Fig: 2 — Subsequence outliers in time series data. | Source

Anomaly detection techniques in time series data

There are a few techniques that analysts can employ to identify different anomalies in data. They range from basic statistical decomposition all the way up to autoencoders. Let’s start with the basic one, and understand how and why it’s useful.

Note

Here you can find the notebook and the data used in the article

STL decomposition

STL stands for seasonal-trend decomposition procedure based on LOESS. This technique gives you the ability to split your time series signal into three parts: seasonal, trend, and residue.

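It works for seasonal time series, which is also one of the most common types of time series data. To generate a decomposition plot, we just use the ever-amazing statsmodels to do the heavy lifting for us. Note that the seasonal_decompose function used here performs a classical moving-average decomposition; the LOESS-based STL procedure is available separately as statsmodels.tsa.seasonal.STL.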

plt.rc('figure',figsize=(12,8))
plt.rc('font',size=15)
result = seasonal_decompose(lim_catfish_sales,model='additive')
fig = result.plot()

This is Catfish sales data from 1996–2000 with an anomaly introduced in Dec-1998

If we analyze the deviation of residue and introduce some threshold for it, we’ll get an anomaly detection algorithm. To implement this, we only need the residue data from the decomposition.

import matplotlib.dates as mdates  # needed for date2num below
plt.rc('figure',figsize=(12,6))
plt.rc('font',size=15)
fig, ax = plt.subplots()
x = result.resid.index
y = result.resid.values
ax.plot_date(x, y, color='black',linestyle='--')
ax.annotate('Anomaly', (mdates.date2num(x[35]), y[35]), xytext=(30, 20),
          textcoords='offset points', color='red',arrowprops=dict(facecolor='red',arrowstyle='fancy'))
fig.autofmt_xdate()
plt.show()

Residue from the above STL decomposition

Pros

It’s simple, robust, it can handle a lot of different situations, and all anomalies can still be intuitively interpreted.

Cons

The biggest downside of this technique is its limited tuning options. Apart from the threshold and maybe the confidence interval, there isn’t much you can adjust. For example, say you’re tracking users on a website that was closed to the public and then suddenly opened. In this case, you should track anomalies that occur before and after the launch separately.

Classification and Regression Trees (CART)

We can utilize the power and robustness of Decision Trees to identify outliers/anomalies in time series data.

First, you can use supervised learning to teach trees to classify anomaly and non-anomaly data points. In order to do that, we’d need to have labeled anomaly data points, which you won’t find often outside of toy datasets.
Unsupervised is what you need! We can use the Isolation Forest algorithm to predict whether a certain point is an outlier or not, without the help of any labeled dataset. Let’s see how.

The main idea, which is different from other popular outlier detection methods, is that Isolation Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any tree ensemble method, is based on decision trees.

In other words, Isolation Forest detects anomalies purely based on the fact that anomalies are data points that are few and different. The isolation of anomalies is implemented without employing any distance or density measure.

When applying an IsolationForest model, we set contamination = outliers_fraction, which tells the model what proportion of outliers to expect in the data. This value is usually tuned by trial and error.
Calling fit(data) and then predict(data) performs outlier detection on the data, and returns 1 for normal points and -1 for anomalies.
Finally, we visualize anomalies with the Time Series view.

Let’s do it step by step. First, visualize the time series data:

plt.rc('figure',figsize=(12,6))
plt.rc('font',size=15)
catfish_sales.plot()

The same Catfish Sales data but with different (multiple) anomalies introduced

Next, we need to set some parameters like the outlier fraction, and train our IsolationForest model. We can utilize the super useful scikit-learn to implement the Isolation Forest algorithm. You can find the complete notebook with code and other stuff here.

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

outliers_fraction = 0.01
scaler = StandardScaler()
np_scaled = scaler.fit_transform(catfish_sales.values.reshape(-1, 1))
data = pd.DataFrame(np_scaled)
# train isolation forest
model = IsolationForest(contamination=outliers_fraction)
model.fit(data)

Lastly, we need to visualize how the prediction was.

catfish_sales['anomaly'] = model.predict(data)
# visualization
fig, ax = plt.subplots(figsize=(10,6))
a = catfish_sales.loc[catfish_sales['anomaly'] == -1, ['Total']] #anomaly
ax.plot(catfish_sales.index, catfish_sales['Total'], color='black', label = 'Normal')
ax.scatter(a.index,a['Total'], color='red', label = 'Anomaly')
plt.legend()
plt.show();

Anomaly Detection using Isolation Forest algorithm

As you can see, the algorithm did a pretty good job in identifying our planted anomalies, but it also labeled a few points at the start as “outlier”. This is due to two reasons:

At the start, the algorithm is pretty naive to be able to comprehend what qualifies as an anomaly. The more data it gets, the more variance it’s able to see, and it adjusts itself.
If you see many false positives (normal points flagged as anomalies), your contamination parameter is probably too high. Conversely, if you don’t see the red dots where they should be, the contamination parameter is set too low.
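To get a feel for how the contamination parameter drives the number of flagged points, here is a minimal sketch on synthetic data. The series, the planted values, and the helper name flagged_count are made up for illustration and are not part of the Catfish Sales example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic series (an assumption for this sketch): noise around 100
# with three planted anomalies
rng = np.random.default_rng(0)
values = rng.normal(100, 5, size=500)
values[[50, 200, 350]] = [160, 30, 170]
X = values.reshape(-1, 1)

def flagged_count(contamination):
    """Fit an Isolation Forest and count the points labeled -1 (anomaly)."""
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(X)  # 1 for normal, -1 for anomaly
    return int((labels == -1).sum())

# A higher contamination value means more points are flagged
counts = {c: flagged_count(c) for c in (0.005, 0.01, 0.05)}
print(counts)
```

With a fixed random_state the forest is the same in every run, so only the score threshold moves: raising contamination can only add flagged points, which is why a value that is too high produces false positives.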

Pros

The biggest advantage of this technique is you can introduce as many random variables or features as you like to make more sophisticated models.

Cons

The weakness is that a growing number of features can start to impact your computational performance fairly quickly. In this case, you should select features carefully.

Detection using Forecasting

Anomaly detection using Forecasting is based on an approach that several points from the past generate a forecast of the next point with the addition of some random variable, which is usually white noise.

As you can imagine, forecasted points are in turn used to generate further points, and so on. The obvious effect over the forecast horizon is that the signal gets smoother.

The difficult part of using this method is that you should select the number of differences, number of autoregressions, and forecast error coefficients.

Each time you work with a new signal, you should build a new forecasting model.

Another obstacle is that your signal should be stationary after differencing. In simple words, the statistical properties of the signal (such as its mean and variance) shouldn’t depend on time, which is a significant constraint.

We can utilize different forecasting methods such as Moving Averages, Autoregressive approach, and ARIMA with its different variants. The procedure for detecting anomalies with ARIMA is:

Predict the new point from past data points and measure the difference between the prediction and the observed value.
Choose a threshold and identify anomalies based on that difference threshold. That’s it!
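The two steps above can be sketched without any forecasting library. Note that this sketch swaps ARIMA for a much simpler rolling-mean forecast, and the synthetic series, the window size, and the threshold multiplier are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic series (an assumption): noise around 50 with one planted spike
rng = np.random.default_rng(1)
series = pd.Series(50 + rng.normal(0, 1, size=200))
series.iloc[120] += 15

# Step 1: forecast each point from the previous `window` points only.
# shift(1) ensures the forecast for time t never sees the value at time t.
window = 10
forecast = series.rolling(window).mean().shift(1)
residual = (series - forecast).abs()

# Step 2: flag points whose forecast error exceeds a threshold
threshold = 4 * residual.std()  # hypothetical choice
anomalies = series[residual > threshold]
print(anomalies)
```

The same structure applies with a real forecasting model: only the line that computes forecast changes.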

To test this technique, we’re gonna use a popular time series module called fbprophet (published as prophet since version 1.0). This module is designed for series with trend and seasonality, and can be tuned with some hyper-parameters.

The same Catfish Sales data but with different (multiple) anomalies introduced

We’ll utilize the same data as we did above with the same anomalies. First, let’s import it and make it ready for the environment:

from fbprophet import Prophet  # with prophet >= 1.0, use: from prophet import Prophet

Now let’s define the forecasting function. An important thing to note here is that fbprophet will add some additional metrics as features, in order to help identify anomalies better. For example, the predicted time series variable (by the model), the upper and lower limit of the target time series variable, and the trend metric.

def fit_predict_model(dataframe, interval_width = 0.99, changepoint_range = 0.8):
   m = Prophet(daily_seasonality = False, yearly_seasonality = False, weekly_seasonality = False,
               seasonality_mode = 'additive',
               interval_width = interval_width,
               changepoint_range = changepoint_range)
   m = m.fit(dataframe)
   forecast = m.predict(dataframe)
   forecast['fact'] = dataframe['y'].reset_index(drop = True)
   return forecast

# t is the input DataFrame; Prophet expects a 'ds' (datestamp) column and a 'y' (value) column
pred = fit_predict_model(t)

We now have to push the pred variable to another function, which will detect anomalies based on a threshold of lower and upper limit in the time series variable.

def detect_anomalies(forecast):
   forecasted = forecast[['ds','trend', 'yhat', 'yhat_lower', 'yhat_upper', 'fact']].copy()
   # 1: above the upper bound, -1: below the lower bound, 0: within the interval
   forecasted['anomaly'] = 0
   forecasted.loc[forecasted['fact'] > forecasted['yhat_upper'], 'anomaly'] = 1
   forecasted.loc[forecasted['fact'] < forecasted['yhat_lower'], 'anomaly'] = -1
   # anomaly importances: relative distance outside the interval
   forecasted['importance'] = 0.0
   forecasted.loc[forecasted['anomaly'] == 1, 'importance'] = (
       (forecasted['fact'] - forecasted['yhat_upper']) / forecasted['fact'])
   forecasted.loc[forecasted['anomaly'] == -1, 'importance'] = (
       (forecasted['yhat_lower'] - forecasted['fact']) / forecasted['fact'])

   return forecasted

pred = detect_anomalies(pred)

At last, we just need to plot the above predictions and visualize the anomalies.

Pros

This algorithm nicely handles different seasonality parameters like monthly or yearly, and it has native support for all time series metrics.

If you look closely, this algorithm can handle edge cases well as compared to the Isolation Forest algorithm.

Cons

Since this technique is based on forecasting, it will struggle in limited data scenarios. The quality of prediction in limited data will be lower, and so will the accuracy of anomaly detection.

Clustering-based anomaly detection

So far, we’ve looked at the IsolationForest algorithm as our unsupervised way of anomaly detection. Now, we’ll look into another unsupervised technique: Clustering!

The approach is pretty straightforward. Data instances that fall outside of defined clusters could potentially be marked as anomalies. We’re gonna use k-means clustering, because why not!

For the sake of visualizations, we’ll use a different dataset that corresponds to a multivariable time series with one or more time-based variables. The dataset will be a subset of the one found here (columns/features are the same).

Dataset Description: Data contains information on shopping and purchase as well as information on price competitiveness.

Now in order to process k-means, first we need to know the number of clusters we’re gonna be dealing with. The Elbow Method works pretty efficiently for this.

The Elbow method is a graph of the number of clusters vs the variance explained/objective/score

To implement this, we’ll use scikit-learn’s implementation of K-means.

from sklearn.cluster import KMeans
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
n_cluster = range(1, 20)
kmeans = [KMeans(n_clusters=i).fit(data) for i in n_cluster]
scores = [kmeans[i].score(data) for i in range(len(kmeans))]
fig, ax = plt.subplots(figsize=(10,6))
ax.plot(n_cluster, scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show();

From the above elbow curve, we see that the graph levels off after 10 clusters, implying that adding more clusters does not explain much more of the variance in our relevant variable; in this case price_usd.

We set n_clusters=10, and upon generating the k-means output, use the data to plot the 3D clusters.

Now we need to find out the number of components (features) to keep.

data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
X = data.values
X_std = StandardScaler().fit_transform(X)
# Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key = lambda x: x[0], reverse= True)
# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)] # Individual explained variance
cum_var_exp = np.cumsum(var_exp) # Cumulative explained variance
plt.figure(figsize=(10, 5))
plt.bar(range(len(var_exp)), var_exp, alpha=0.3, align='center', label='individual explained variance', color = 'y')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show();

We see that the first component explains almost 50% of the variance. The second component explains over 30%. However, notice that almost none of the components are really negligible. The first 2 components contain over 80% of the information. So, we will set n_components=2.

The underlying assumption in the clustering-based anomaly detection is that if we cluster the data, normal data will belong to clusters while anomalies will not belong to any clusters, or belong to small clusters.

We use the following steps to find and visualize anomalies:

Calculate the distance between each point and its nearest centroid. The biggest distances are considered anomalies.
We use outliers_fraction to provide information to the algorithm about the proportion of the outliers present in our data set, similarly to the IsolationForest algorithm. This is largely a hyperparameter that needs hit/trial or grid-search to be set right – as a starting figure, let’s estimate, outliers_fraction=0.1
Calculate number_of_outliers using outliers_fraction.
Set the threshold as the minimum distance of these outliers.
Store the result in the anomaly1 column (0: normal, 1: anomaly).
Visualize anomalies with cluster view.
Visualize anomalies with Time Series view.

# Return a Series with the distance between each point and its assigned (closest) centroid
def getDistanceByPoint(data, model):
   distance = pd.Series(dtype=float)
   for i in range(0, len(data)):
       Xa = np.array(data.loc[i])
       # labels_ holds the centroid index for each point, so use it directly
       Xb = model.cluster_centers_[model.labels_[i]]
       distance.at[i] = np.linalg.norm(Xa - Xb)
   return distance
outliers_fraction = 0.1
# get the distance between each point and its nearest centroid. The biggest distances are considered as anomaly
distance = getDistanceByPoint(data, kmeans[9])
number_of_outliers = int(outliers_fraction*len(distance))
threshold = distance.nlargest(number_of_outliers).min()
# anomaly1 contain the anomaly result of the above method Cluster (0:normal, 1:anomaly)
df['anomaly1'] = (distance >= threshold).astype(int)
fig, ax = plt.subplots(figsize=(10,6))
colors = {0:'blue', 1:'red'}
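# principal_feature1 and principal_feature2 are assumed to hold the first two principal components of the data
ax.scatter(df['principal_feature1'], df['principal_feature2'], c=df["anomaly1"].apply(lambda x: colors[x]))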
plt.xlabel('principal feature1')
plt.ylabel('principal feature2')
plt.show();
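As a side note, the per-point loop can be replaced by a vectorized call: scikit-learn’s KMeans.transform() returns the distance from every point to every centroid, so the row-wise minimum is the distance to the nearest one. Below is a minimal sketch on synthetic two-dimensional data; the blobs, the planted outliers, and the outliers_fraction value are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption): two tight blobs plus two far-away points
rng = np.random.default_rng(42)
normal = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
outliers = np.array([[2.5, 10.0], [-6.0, 3.0]])
X = np.vstack([normal, outliers])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives distances to all centroids; keep the nearest one per point
distance = model.transform(X).min(axis=1)

outliers_fraction = 0.01
number_of_outliers = int(outliers_fraction * len(X))
# The threshold is the smallest distance among the largest `number_of_outliers`
threshold = np.sort(distance)[-number_of_outliers]
is_anomaly = distance >= threshold
print(np.where(is_anomaly)[0])
```

The thresholding logic is the same as in the steps above; only the distance computation differs.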

Now, in order to see the anomalies against real-world features, we process the dataframe we created in the previous step.

df = df.sort_values('date_time')
fig, ax = plt.subplots(figsize=(10,6))
a = df.loc[df['anomaly1'] == 1, ['date_time', 'price_usd']] #anomaly
ax.plot(pd.to_datetime(df['date_time']), df['price_usd'], color='k',label='Normal')
ax.scatter(pd.to_datetime(a['date_time']),a['price_usd'], color='red', label='Anomaly')
ax.xaxis_date()
plt.xlabel('Date Time')
plt.ylabel('price in USD')
plt.legend()
fig.autofmt_xdate()
plt.show()

This method captures the peaks pretty well, with some misses of course. Part of the issue may be that we haven’t experimented with many values of outliers_fraction.

Pros

The biggest advantage of this technique is similar to other unsupervised techniques, which is that you can introduce as many random variables or features as you like to make more sophisticated models.

Cons

The weakness is that a growing number of features can start to impact your computational performance fairly quickly. In addition to this, there are more hyper-parameters to tune and get right, so there’s always a chance of high model variance in performance.

Autoencoders

Can’t talk about data techniques without Deep Learning! So, let’s discuss Anomaly detection using Autoencoders.

Autoencoders are an unsupervised technique that recreates the input data while extracting its features through different dimensions. So, in other words, if we use the Latent Representation of data from Autoencoders, it corresponds to dimensionality reduction.

Why do we apply dimensionality reduction to find outliers?

Don’t we lose some information, including the outliers, if we reduce the dimensionality? The answer is that once the main patterns are identified, the outliers are revealed. Many distance-based techniques (e.g. KNNs) suffer the curse of dimensionality when they compute distances of every data point in the full feature space. High dimensionality has to be reduced.

Interestingly, during the process of dimensionality reduction outliers are identified. We can say outlier detection is a by-product of dimension reduction.
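To see why outliers fall out of dimensionality reduction, here is a minimal sketch that uses PCA as a linear stand-in for an autoencoder: compress the data, reconstruct it, and score each point by how badly it is reconstructed. The synthetic data and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 300 points lying close to a 2-D plane inside a 10-D space (synthetic)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 10))

# Push the first five points off the plane to act as outliers
X[:5] += rng.normal(scale=1.0, size=(5, 10))

# Compress to 2 dimensions, then map back to the original 10 dimensions
pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))

# Reconstruction error: points that do not follow the main pattern score high
score = np.linalg.norm(X - X_rec, axis=1)
top5 = np.argsort(score)[-5:]
print(sorted(top5))
```

An autoencoder follows the same compress-and-reconstruct idea, but with non-linear layers instead of a linear projection.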

Autoencoders are an unsupervised approach to find anomalies.

Why autoencoders?

There are many useful tools, such as Principal Component Analysis (PCA), for detecting outliers. Why do we need autoencoders? The reason is that PCA uses linear algebra to transform. In contrast, autoencoder techniques can perform non-linear transformations with their non-linear activation function and multiple layers. It’s more efficient to train several layers with an autoencoder, rather than training one huge transformation with PCA. The autoencoder techniques thus show their merits when the data problems are complex and non-linear in nature.

Build the model

We can implement Autoencoders with popular frameworks like TensorFlow or PyTorch, but, for the sake of simplicity, we’re gonna use a Python module called PyOD, which builds autoencoders internally using a few inputs from the user.

For the data part, let’s use the utility function generate_data() of PyOD to generate 25 variables, 500 observations, and ten percent outliers.

import numpy as np
import pandas as pd
from pyod.models.auto_encoder import AutoEncoder
from pyod.utils.data import generate_data
contamination = 0.1  # percentage of outliers
n_train = 500  # number of training points
n_test = 500  # number of testing points
n_features = 25 # Number of features
X_train, y_train, X_test, y_test = generate_data(
   n_train=n_train, n_test=n_test,
   n_features= n_features,
   contamination=contamination,random_state=1234)
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

When you do unsupervised learning, it’s always a safe step to standardize the predictors like below:

from sklearn.preprocessing import StandardScaler
# fit the scaler on the training data only, then apply the same transformation to the test data
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))

In order to get a good sense of what the data looks like, let’s use PCA to reduce it to two dimensions, and plot accordingly.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA(2)
x_pca = pca.fit_transform(X_train)
x_pca = pd.DataFrame(x_pca)
x_pca.columns=['PC1','PC2']
# Plot the two principal components, colored by the true label (0: inlier, 1: outlier)
plt.scatter(x_pca['PC1'], x_pca['PC2'], c=y_train, alpha=1)
plt.title('Scatter plot')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

The dark points clustered together are the typical observations, and the yellow points are the outliers.

Model specification

We specify the network as [25, 2, 2, 25]: the input layer and the output layer have 25 neurons each, and there are two hidden layers with two neurons each.

Step 1 — Build your model

clf = AutoEncoder(hidden_neurons =[25, 2, 2, 25])
clf.fit(X_train)

Step 2 — Determine the cut point

Let’s apply the trained model clf to predict the anomaly score for each observation in the test data. How do we define an outlier? An outlier is a point that’s distant from other points, so the outlier score is defined by distance. For an autoencoder, that distance is the reconstruction error between an input and its reconstruction. The PyOD function .decision_function() calculates this anomaly score for each data point.

# Get the outlier scores for the train data
y_train_scores = clf.decision_scores_
# Predict the anomaly scores
y_test_scores = clf.decision_function(X_test)  # outlier scores
y_test_scores = pd.Series(y_test_scores)
# Plot it!
import matplotlib.pyplot as plt
plt.hist(y_test_scores, bins='auto')
plt.title("Histogram for Model Clf1 Anomaly Scores")
plt.show()

If we use a histogram to count the frequency by anomaly score, we see that high scores correspond to low frequencies, which is evidence of outliers. We choose 4.0 as the cut point and treat observations with scores >= 4.0 as outliers.

Step 3 — Get the summary statistics by cluster

Let’s assign observations with anomaly scores below 4.0 to Cluster 0, and those with scores of 4.0 or above to Cluster 1. Also, let’s calculate the summary statistics by cluster using .groupby(). This model has identified 50 outliers (not shown).

df_test = X_test.copy()
df_test['score'] = y_test_scores
df_test['cluster'] = np.where(df_test['score']<4, 0, 1)
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()

The following output shows the mean variable values in each cluster. The values of Cluster ‘1’ (the abnormal cluster) are quite different from those of Cluster ‘0’ (the normal cluster). The “score” values show the average distance of those observations to others. A high “score” means that observation is far away from the norm.

This way, we can distinguish between typical data points and anomalies, and label them, quite accurately.

Pros

Autoencoders can handle high-dimensional data with ease.
Owing to their non-linear behavior, they can find complex patterns within high-dimensional datasets.

Cons

Since it’s a deep learning-based strategy, it will particularly struggle when data is scarce.
Computation costs will skyrocket as the depth of the network increases and when dealing with big data.

So far we’ve seen how to detect and identify anomalies. But the real question arises after finding them. Now what? What do we do about it?

Let’s discuss some of the pointers you could apply in your scenario.

How to deal with the anomalies?

After detection, there comes a big question of what to do about the stuff we identified. There are numerous ways to deal with the newly found information. I’ll list some of them based on my experience to give you a headway on how to approach this question.

Understanding the business case

Anomalies almost always provide new information and perspective on your problems. Stock prices going up suddenly? There has to be a reason for this, like the one we saw with GameStop; a pandemic could be another. So, understanding the reasons behind the spike can help you solve the problem in an efficient manner.

Understanding the business use case can also help you identify the problem better. For instance, you might be working on some sort of fraud detection which means your primary goal is indeed understanding the outliers in the data.

If none of this is your concern, you can move to remove or smoothen out the outlier.

Statistical methods to adjust outliers

Statistical methods let you adjust the value of your outlier to match the original distribution. Let’s see one of the methods that use mean to smoothen out the anomalies.

Using mean to smoothen out the outlier

The idea is to smoothen out the anomaly by using data from the previous DateTime. E.g., to even out a sudden usage of electricity due to an event that happened in your house, you could take an average of usages in the same month for previous years.

Let’s implement the same to get a clear picture. We’ll employ the same catfish sales data we did earlier. We can adjust with the mean using the script below.

adjusted_data = lim_catfish_sales.copy()
adjusted_data.loc[curr_anomaly] = december_data[(december_data.index != curr_anomaly) & (december_data.index < test_data.index[0])].mean()
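Since the variables in the snippet above come from the accompanying notebook, here is a self-contained sketch of the same idea on a made-up monthly series. The series, the anomaly date, and the variable names are assumptions for illustration: an anomalous December value is replaced by the mean of the earlier Decembers.

```python
import pandas as pd

# Made-up monthly sales with a regular December peak
index = pd.date_range("2015-01-01", "2019-12-01", freq="MS")
sales = pd.Series(100.0, index=index)
sales[sales.index.month == 12] = 150.0

# Plant an anomaly in December 2018
anomaly_date = pd.Timestamp("2018-12-01")
sales.loc[anomaly_date] = 400.0

# Same month in earlier years only, so the anomaly itself is excluded
same_month = sales[(sales.index.month == anomaly_date.month)
                   & (sales.index < anomaly_date)]

adjusted = sales.copy()
adjusted.loc[anomaly_date] = same_month.mean()
print(adjusted.loc[anomaly_date])
```

Restricting the mean to earlier dates avoids using information from the future, which matters if the adjusted series is later used to train a forecasting model.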

Plotting the adjusted data and the old data will look something like this:

plt.figure(figsize=(10,4))
plt.plot(lim_catfish_sales, color='firebrick', alpha=0.4)
plt.plot(adjusted_data)
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)
for year in range(start_date.year,end_date.year):
   plt.axvline(pd.to_datetime(str(year)+'-01-01'), color='k', linestyle='--', alpha=0.2)
plt.axvline(curr_anomaly, color='k', alpha=0.7)

This way, you can proceed to apply forecasting or analysis without worrying much about skewness in your results.

There are numerous methods for dealing with outliers in non-time series data, but unfortunately they cannot be used directly on time series due to the difference in underlying structures. Many of them are distribution-based and can’t be simply translated to time series data. If you wish to look at some of those, you can head over here.

Removing the Outlier

The last option, if neither of the above two fits your solution, is to get rid of the anomalies. This is not recommended (as you’re basically getting rid of some potentially valuable information) unless it’s absolutely necessary and it doesn’t harm the analysis in the future.

You can use the .drop() method in pandas after identification. It will do the heavy lifting for you.
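A minimal sketch of what that might look like, assuming the anomalies have already been identified by their timestamps (the values and dates here are made up):

```python
import pandas as pd

# Made-up daily series with one obvious outlier on the third day
sales = pd.Series(
    [10.0, 12.0, 95.0, 11.0],
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)

# Timestamps flagged by whichever detection method you used
anomaly_dates = [pd.Timestamp("2021-01-03")]

# drop() returns a new Series without the flagged rows
cleaned = sales.drop(anomaly_dates)
print(cleaned)
```

Keep in mind that dropping rows leaves a gap in the time index, which some forecasting methods don’t tolerate, so you may still need to fill the gap afterwards.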

You’ve reached the end!

Congratulations! You now know about Anomalies, how to detect them, and what you can do about them. A few endnotes:

Time series data varies a lot depending on the business case, so it’s better to experiment and find out what works instead of just applying what you find. Experience can do wonders!
There are tons of techniques for anomaly detection apart from what we’ve discussed on this blog. I encourage you to read more in research papers.

You can find the complete notebook with code and some bonus stuff here!

That’s it for now, stay tuned for more! Adios!

Note: images are created by the author unless stated otherwise.

Time Series Forecasting: Data, Analysis, and Practice

Akshay P Jain — Thu, 21 Jul 2022 14:04:33 +0000

Usually, in the traditional machine learning approach, we randomly split the data into training data, test data, and cross-validation data.

Here, each point x_i in the dataset has:

60% probability of going into D_train
20% probability of going into D_test
20% probability of going into Validation

Instead of random-based splitting, we can use another approach called time-based splitting. When we have a timestamp given in our dataset, we can split the data according to time.

Imagine you’re an ML engineer at Amazon, trying to productionize a model to classify reviews. You randomly split the data into training data and test data, and after obtaining the required accuracy, you deploy the model. With additional reviews being added to new products, over time the model’s accuracy could decrease. Time-based splitting is a way to overcome this issue.

In time-based splitting, we sort the data by timestamp, train the model on the earlier part, and evaluate it on the later part. With this, the test score is a more realistic estimate of how the model will perform on future data than with random-based splitting.
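A minimal sketch of a time-based 60/20/20 split with pandas could look like this. The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up dataset, shuffled to mimic raw data that isn't in time order
df = pd.DataFrame({
    "timestamp": pd.date_range("2022-01-01", periods=100, freq="D"),
    "value": np.arange(100),
}).sample(frac=1.0, random_state=0)

# Sort by time first, then cut at fixed positions instead of sampling randomly
df = df.sort_values("timestamp").reset_index(drop=True)
n = len(df)
train_end = n * 6 // 10
val_end = n * 8 // 10

train = df.iloc[:train_end]          # oldest 60%
val = df.iloc[train_end:val_end]     # next 20%
test = df.iloc[val_end:]             # most recent 20%
print(len(train), len(val), len(test))
```

Every row in the validation set is newer than every row in the training set, and the test set is newer still, which mirrors how the model will be used after deployment.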

Why do we need a different approach?

The standard ML approach doesn’t work for time series models:

Features and target variables are the same,
Data correlated over time,
Often non-stationary (hard to model),
Need a lot of data to capture the patterns and trends and model those changes appropriately.

What is a time series?

A time series is a sequence of data points organized in time order.

Types of forecasting

Time series are everywhere

Finance: we’re trying to predict perhaps stock prices over time, asset prices, different macroeconomic factors that will have a large effect on our business objectives.

E-commerce: we’re trying to predict future page views compared to what happened in the past, and whether it’s trending up, down, or if there’s seasonality. Same with new users, how many new users are you getting/losing over time?

Business: we’re trying to predict the number of transactions, future revenue, and future inventory levels that you will need.

Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.

One of the fundamental topics in time series is time series decomposition:

Components of time series data
Seasonal patterns and trends
Decomposition of time series data

What are the components of time series?

Trend: the general direction of change over a period of time.

Seasonality: seasonality is about periodic behavior, spikes or drops caused by different factors, for example:

Naturally occurring events, like weather fluctuations
Business or administrative procedures, like start or end of a fiscal year
Social and cultural behavior, like holidays or religious observances
Calendar events, like the number of Mondays per month or holidays shifting year to year

Residual: irregular fluctuations that we cannot predict using trend or seasonality.

The graphs of trends, seasonality, and residual factors are constructed below using Pandas and NumPy arrays in Python.

Decomposition models

Additive model

The additive model assumes the observed time series is the sum of components:

Observation = trend + seasonality + residual

Additive models are used when the magnitudes of the seasonal and residual values are independent of the trend.

The above graph is generated using python which we will learn in a while

In the above example, we can see that the magnitude of the seasonal and residual swings doesn’t increase or decrease as the trend increases; it stays constant all the way. Looking at this plot and subtracting out the straight line that is the trend, we can imagine that what remains is a seasonal component that stays the same no matter what the trend is.

Multiplicative model

The multiplicative model assumes the observed time series is a product of its components:

Observation = trend * seasonality * residual

We can transform the multiplicative model to an additive model by applying a log transformation:

log(trend * seasonality * residual) = log(trend) + log(seasonality) + log(residual)

These are used if the magnitudes of seasonal and residual values fluctuate with the trend.

The above graph is generated using python which we will learn in a while

In the above image, we see the trend increases, so we’re trending up. The seasonal component is also trending up with the trend. This means that it’s likely a multiplicative model, so we should divide out that trend, and then we would end up with more reasonable looking (more consistent) seasonality.

Pseudo-additive models

Pseudo-additive models combine the elements of both additive and multiplicative models. They can be useful when:

Time series values are close to or equal to zero
We expect features related to the multiplicative model
Division by zero often becomes a problem when this is the case

Time series decomposition using Python-Pandas

We will individually construct fictional trends, seasonality, and residual components. This is an example to show how a simple time-series dataset can be constructed using the Pandas module.

import numpy as np
time = np.arange(1, 51)

Now we need to create a trend. Let’s pretend we have a sensor measuring electricity demand. We’ll ignore units to keep things simple.

trend = time * 2.75

Now let’s plot the trend as a function of time.

Now let’s generate a seasonal component.

seasonal = 10 + np.sin(time) * 10

Let’s plot seasonality against time.

Now, let’s construct the residual component.

np.random.seed(10)  # reproducible results
residual = np.random.normal(loc=0.0, scale=1, size=len(time))

A quick plot of residuals:

Aggregate trend, seasonality, and residual components

Additive time series

Remember the equation for additive time series is simply: O_t = T_t + S_t + R_t

O_t = output
T_t = trend
S_t = seasonality
R_t = residual
_t = variable representing a particular point in time

additive = trend + seasonal + residual

The same follows for multiplicative time series, except we don’t add, but multiply the values of trend, seasonality, and residual.
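Here is a minimal sketch of the multiplicative case. One assumption to flag: instead of reusing the additive seasonal and residual arrays, it uses hypothetical factors that fluctuate around 1, which keeps every component positive so that the log transformation mentioned earlier is well defined.

```python
import numpy as np

time = np.arange(1, 51)
trend = time * 2.75

# Hypothetical multiplicative factors centered on 1 (not from the text above)
seasonal_factor = 1 + 0.3 * np.sin(time)
rng = np.random.default_rng(10)
residual_factor = 1 + rng.normal(loc=0.0, scale=0.05, size=len(time))

# Multiplicative series: the seasonal swings grow as the trend grows
multiplicative = trend * seasonal_factor * residual_factor

# Taking logs turns the product into a sum of components
log_sum = np.log(trend) + np.log(seasonal_factor) + np.log(residual_factor)
print(np.allclose(np.log(multiplicative), log_sum))
```

The final check confirms that the log of the multiplicative series equals the sum of the logged components, which is what lets us treat it as an additive model after the transformation.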

Stationarity and autocorrelation

What is stationarity?

For time series data to be stationary, the data must exhibit four properties over time:

1. Constant Mean:

A stationary time series will have a constant mean throughout the entire series.

As an example, if we were to draw a horizontal line at the mean of the series, that line would describe the mean at every point in time.

A good example where the mean wouldn’t be constant is if we had some type of trend. With an upward or downward trend, for example, the mean at the end of our series would be noticeably higher or lower than the mean at the beginning of the series.

2. Constant Variance:

A stationary time series will have a constant variance throughout the entire series.

3. Constant Autocorrelation Structure:

Autocorrelation simply means that the current time series measurement is correlated with a past measurement. For example, today’s stock price is often highly correlated with yesterday’s price.

The time interval between correlated values is called the lag. Suppose we wanted to know if today’s stock price correlated better with yesterday’s price, or the price from two days ago. We could test this by computing the correlation between the original time series and the same series delayed by one time interval. So, the second value of the original time series would be compared with the first of the delayed. The third original value would be compared with the second of the delayed, and so on. Performing this process for a lag of 1 and a lag of 2, respectively, would yield two correlation outputs. Comparing them would tell us which lag is more correlated. That is autocorrelation in a nutshell.
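The lag comparison described above can be sketched with pandas. The series here is synthetic (each value is 0.8 times the previous one plus noise), which is an assumption made so that the lag-1 correlation is visibly stronger than the lag-2 correlation.

```python
import numpy as np
import pandas as pd

# Synthetic series where each value depends on the previous one
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
values = np.zeros(500)
for t in range(1, 500):
    values[t] = 0.8 * values[t - 1] + noise[t]
series = pd.Series(values)

# Correlation of the series with itself delayed by 1 and by 2 steps
lag1 = series.autocorr(lag=1)
lag2 = series.autocorr(lag=2)

# The same lag-1 number computed by hand: pair value t with value t-1
manual_lag1 = np.corrcoef(values[1:], values[:-1])[0, 1]
print(lag1, lag2, manual_lag1)
```

The manual version makes the pairing explicit: the second value is compared with the first, the third with the second, and so on.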

Time series smoothing

What is Smoothing?

Smoothing is a process that often improves our ability to forecast series by reducing the impact of noise.

Why is smoothing important?

Smoothing is an important tool that lets us improve forward-looking forecasts.

Consider the data in the below graph. How could we forecast what will happen in one, two, or three steps into the future?

One solution is to calculate the mean of the series and predict the value in the future.

But, using the mean to predict future values doesn’t seem like a good way, and we might not get accurate predictions. Instead, we employ a technique called exponential smoothing.

Single Exponential Smoothing

Single Exponential Smoothing, also called Simple Exponential Smoothing, is a time series forecasting method for univariate data without a trend or seasonality.

It requires a single parameter, called alpha (a), also called the smoothing factor or smoothing coefficient.

This parameter controls the rate at which the influence of observations at prior time steps decays exponentially. Alpha is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.
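To make the role of alpha concrete, here is a minimal pure-Python sketch of the smoothing recursion. The function name and the choice to initialize with the first observation are assumptions for illustration; libraries may initialize differently.

```python
def simple_exp_smoothing(values, alpha):
    """Return the smoothed series; the last element is the one-step forecast."""
    smoothed = [values[0]]  # initialize with the first observation
    for x in values[1:]:
        # New level = alpha * latest observation + (1 - alpha) * previous level
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(simple_exp_smoothing([10, 20, 30], alpha=0.5))  # [10, 15.0, 22.5]
print(simple_exp_smoothing([10, 20, 30], alpha=1.0))  # [10, 20.0, 30.0]
```

With alpha = 1 the smoothed series simply follows the latest observation, while smaller values let older observations keep more of their influence.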

Double Exponential Smoothing

Double Exponential Smoothing is an extension to Exponential Smoothing that explicitly adds support for trends in the univariate time series.

In addition to the alpha parameter for controlling the smoothing factor for the level, a smoothing factor is added to control the decay of the influence of the change in a trend, called beta (b).

The method supports trends that change in different ways: an additive and a multiplicative, depending on whether the trend is linear or exponential respectively.

Double Exponential Smoothing with an additive trend is classically referred to as Holt’s linear trend model, named after the developer of the method, Charles Holt.
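A minimal pure-Python sketch of Holt’s linear trend method follows. The function name and the initialization (level from the first value, trend from the first difference) are assumptions made for this illustration.

```python
def holt_linear(values, alpha, beta, horizon=1):
    """Forecast `horizon` steps ahead with an additive (linear) trend."""
    level = values[0]
    trend = values[1] - values[0]  # simple initialization (an assumption)
    for x in values[1:]:
        prev_level = level
        # alpha smooths the level, beta smooths the change in level (the trend)
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Extrapolate the last level along the last trend
    return [level + h * trend for h in range(1, horizon + 1)]

forecast = holt_linear([2, 4, 6, 8, 10], alpha=0.5, beta=0.5, horizon=2)
print(forecast)  # [12.0, 14.0]
```

On a perfectly linear series the method continues the line, which is a quick sanity check that the level and trend updates are wired up correctly.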

Triple Exponential Smoothing

Triple Exponential Smoothing is an extension of Exponential Smoothing that explicitly adds support for seasonality to the univariate time series.

This method is sometimes called Holt-Winters Exponential Smoothing, named for two contributors to the method: Charles Holt and Peter Winters.

In addition to the alpha and beta smoothing factors, a new parameter is added called gamma (γ), which controls the influence on the seasonal component.

As with the trend, the seasonality may be modeled as either an additive or multiplicative process, for a linear or exponential change in the seasonality.
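Here is a minimal sketch of the additive version of triple exponential smoothing. The function name and the crude initial values (taken from the first two seasons) are our assumptions; library implementations typically estimate the initial state and the smoothing factors for you.

```python
import numpy as np

def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Triple exponential smoothing with additive trend and seasonality."""
    y = np.asarray(series, dtype=float)
    # assumption: simple initial values from the first two seasons
    level = y[:period].mean()
    trend = (y[period:2 * period].mean() - y[:period].mean()) / period
    season = list(y[:period] - level)  # season[t] is the seasonal estimate at time t
    for t in range(period, len(y)):
        s = season[t - period]  # seasonal estimate from one period ago
        prev_level = level
        level = alpha * (y[t] - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season.append(gamma * (y[t] - level) + (1 - gamma) * s)
    n = len(y)
    # forecast: level + h * trend + the most recent estimate for that season
    return np.array([level + h * trend + season[n - period + (h - 1) % period]
                     for h in range(1, horizon + 1)])

# hypothetical series: constant level 10 plus a repeating pattern of length 4
y = 10 + np.tile([1.0, -1.0, 2.0, -2.0], 5)
print(holt_winters_additive(y, period=4, alpha=0.3, beta=0.1, gamma=0.2, horizon=4))
```

For this purely seasonal toy series the forecast reproduces the repeating pattern.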

Autoregressive Moving Average (ARMA) models

ARMA models combine two models:

The first is an autoregressive (AR) model. Autoregressive models capture the dependence of the series on its own past values.

The second is the moving average (MA) model. Moving average models capture the dependence of the series on past forecast errors.

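The combined model (ARMA) is commonly associated with the Box-Jenkins approach, which fits such models by iterating between model identification, parameter estimation, and diagnostic checking.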

ARMA model: Autoregressive (AR) part

ARMA models are often expressed using the orders p and q for the AR and MA components. For a time series variable X that we want to predict at time t, the last few observations are:

$X_{t-3}, X_{t-2}, X_{t-1}$

AR(p) models are assumed to depend on the last p values of the time series. Let's say p = 2; the forecast then has the form:

$X_t = \beta_0 + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \epsilon_t$

MA(q) models are assumed to depend on the last q forecast errors of the time series. Let's say q = 2; the forecast then has the form:

$X_t = \beta_0 + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \epsilon_t$

We’ll discuss what exactly these equations mean and how the errors are calculated in a while.

Now, to bring the two models together, we combine AR(p) and MA(q) to yield the ARMA(p,q) model. For p = 2 and q = 2 the ARMA(2,2) forecast will be:

$X_t = \beta_0 + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \epsilon_t$

Again, we'll see all of this in the hands-on part.

There are some things to keep in mind while implementing ARMA models:

First, the time series is assumed to be stationary. The regression approach will fail if we're working with a non-stationary series.

Second, a good rule of thumb is to have at least 100 observations when fitting an ARMA model, so that the past autocorrelations can be estimated adequately.
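For a pure AR model with known coefficients, stationarity can be checked directly: the process is stationary when all roots of the characteristic polynomial 1 − φ₁z − … − φ_p z^p lie outside the unit circle. The sketch below uses NumPy's root finder; the function name is hypothetical, and for real data you would estimate the coefficients or run a unit-root test rather than assume them.

```python
import numpy as np

def ar_is_stationary(phis):
    """phis = [phi_1, ..., phi_p] for X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p} + e_t."""
    # Characteristic polynomial 1 - phi_1 z - ... - phi_p z^p,
    # listed from the highest power down, as np.roots expects.
    coeffs = [-phi for phi in reversed(phis)] + [1.0]
    roots = np.roots(coeffs)
    # stationary only if every root lies strictly outside the unit circle
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_is_stationary([0.7]))       # the AR(1) process simulated later in this article
print(ar_is_stationary([0.3, 0.3]))  # the AR(2) process simulated later in this article
print(ar_is_stationary([1.0]))       # a random walk
```

The first two processes pass the check, while the random walk has a root exactly on the unit circle and does not.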

Now we’ll take a practical approach to understand auto-regressive models, and get a practical understanding of moving averages.

Hands-on approach

One of the key concepts in the quantitative toolbox is that of mean reversion. This process refers to a time series that displays a tendency to revert to its historical mean value. Mathematically, such a (continuous) time series is referred to as an Ornstein-Uhlenbeck process.

This is in contrast to a random walk (aka Brownian motion), which has no “memory” of where it has been at each particular instance of time.

The mean-reverting property of a time series can be exploited to produce better predictions.

A continuous mean-reverting time series can be represented by an Ornstein-Uhlenbeck stochastic differential equation:

$dx_t = \theta(\mu - x_t)\,dt + \sigma\,dW_t$

Where:

θ is the rate of reversion to the mean,
μ is the mean value of the process,
σ is the volatility of the process (it scales the random shocks),
$W_t$ is a Wiener Process or Brownian Motion.

In a discrete setting, the equation states that the change of the price series in the next time period is proportional to the difference between the mean price and the current price, with the addition of Gaussian noise.
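The discrete version can be sketched with a simple Euler-style update. The function name, the parameter values, and the step size dt are assumptions chosen for illustration.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, n_steps, dt=1.0, seed=0):
    """Simulate a discretised mean-reverting (Ornstein-Uhlenbeck) series."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        # Gaussian shock scaled by sigma and the square root of the time step
        noise = sigma * np.sqrt(dt) * rng.standard_normal()
        # the change is proportional to the gap between the mean and the current value
        x[t] = x[t - 1] + theta * (mu - x[t - 1]) * dt + noise
    return x

path = simulate_ou(theta=0.5, mu=10.0, sigma=1.0, x0=0.0, n_steps=2000)
print(path[:3])            # starts far from the mean...
print(path[-500:].mean())  # ...and ends up fluctuating around mu
```

Setting sigma to 0 removes the noise and makes the pull towards the mean easy to follow by hand: starting from 0 with theta = 0.5 and mu = 10, the series moves to 5, 7.5, 8.75, and so on.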

For more details, have a look here.

Section 1: ARMA

Enter Autoregressive Integrated Moving Average (ARIMA) modeling. When outcomes are autocorrelated with their own earlier values, we will see a pattern or relationship in the outcome plot. This relationship can be modeled in its own right, allowing us to predict the future with a confidence level proportionate to the strength of the relationship and the proximity to known values (predictions weaken the further out we go).

For second-order stationary data (constant mean and variance: $\mu_t = \mu$ and $\sigma_t^2 = \sigma^2$ for all $t$), autocovariance is expressed as a function only of the time lag $\tau$:

$\gamma_\tau = E[(X_t - \mu)(X_{t+\tau} - \mu)]$

Therefore, the autocorrelation function is defined as:

$\rho_\tau = \gamma_\tau / \sigma^2$

We use the plot of these values at different lags to determine optimal ARIMA parameters. Notice how phi changes the process.
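The definitions above translate almost directly into code. This is a minimal sketch of a sample autocorrelation function; the function name is ours, and we divide by n at every lag, which is one common convention.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    mean = x.mean()
    variance = np.mean((x - mean) ** 2)  # autocovariance at lag 0
    acf = []
    for lag in range(max_lag + 1):
        # average product of deviations that are `lag` steps apart
        autocov = np.sum((x[:n - lag] - mean) * (x[lag:] - mean)) / n
        acf.append(autocov / variance)
    return np.array(acf)

print(sample_acf([1, 2, 3, 4, 5], max_lag=2))
```

The value at lag 0 is always 1, because a series is perfectly correlated with itself.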

Section 2: Autoregressive (AR) Models

Autocorrelation: a variable’s correlation with itself at different lags.

AR models regress on actual past values.

This is the first order or AR(1) formula you should know:

$X_t = \beta_0 + \beta_1 X_{t-1} + \epsilon_t$

The β’s are just like those in linear regression and ϵ is an irreducible error.

A second-order or AR(2) would look like this:

$X_t = \beta_0 + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \epsilon_t$

We’ll generate our data to gain insight into how AR models work.

# imports used by the code in this walkthrough
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# reproducibility
np.random.seed(123)

# create autocorrelated data
time = np.arange(100)
# assuming 0 mean
ar1_sample = np.zeros(100)

# set the first value to a random draw with mean 0 and standard deviation 2.5
ar1_sample[0] = np.random.normal(loc=0, scale=2.5)

# set every value thereafter to 0.7 * the previous value plus a random error
for t in time[1:]:
    ar1_sample[t] = (0.7 * ar1_sample[t-1]) + np.random.normal(loc=0, scale=2.5)

# plot the series as a filled area around zero
plt.fill_between(time, ar1_sample)

Here we fit a model to the generated data to show that we recover a process that is approximately AR(1) with phi ≈ 0.7.

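# fit an AR(1) model without a constant; ARIMA replaces sm.tsa.ARMA, which recent statsmodels releases removed
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(ar1_sample, order=(1, 0, 0), trend='n').fit()
# params holds the AR(1) coefficient followed by the noise variance (sigma2)
model.params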
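Since an AR model is a regression on past values, we can cross-check the idea with plain least squares. The sketch below is not how statsmodels estimates the model; it is a simplified, intercept-free estimator, and the helper names and the longer simulated series are our assumptions.

```python
import numpy as np

def simulate_ar1(phi, n, scale=2.5, seed=123):
    """Generate a zero-mean AR(1) series with Gaussian errors."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[0] = rng.normal(0, scale)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(0, scale)
    return x

def fit_ar1_no_intercept(x):
    """Least-squares slope of x[t] regressed on x[t-1], without a constant."""
    x = np.asarray(x, dtype=float)
    past, present = x[:-1], x[1:]
    return float(past @ present / (past @ past))

long_sample = simulate_ar1(phi=0.7, n=5000)
print(fit_ar1_no_intercept(long_sample))  # should land near the true value of 0.7
```

With only 100 observations, as in the example above, the estimate is noticeably noisier, which is one reason for the rule of thumb about sample size.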

We can generate an AR(2) series in the same way:

# create autocorrelated AR(2) data
np.random.seed(112)
# mean is again 0
ar2_sample = np.zeros(100)
# set the first two values to random draws with mean 0 and standard deviation 2.5
ar2_sample[0:2] += np.random.normal(loc=0, scale=2.5, size=2)
# set each later value to 0.3 times the prior value plus 0.3 times the value two steps back, plus a random error
for t in time[2:]:
    ar2_sample[t] = (0.3 * ar2_sample[t-1]) + (0.3 * ar2_sample[t-2]) + np.random.normal(loc=0, scale=2.5)

plt.fill_between(time, ar2_sample)

Section 3: Moving Average (MA) models

MA Model Specifics

A MA model is defined by this equation:

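$X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}$

Where:

$\epsilon_t$ is the white noise value,
$\mu$ is a constant value,
the $\theta$'s are coefficients, not unlike those found in linear regression.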

MA Models != Moving Average Smoothing

An important distinction is that a moving average model is not the same thing as moving average smoothing. What we did in previous lessons was smoothing. It has important properties that we’ve discussed. However, moving average models are a completely different beast.

Moving average smoothing is useful for estimating the trend and seasonality of past data. MA models, on the other hand, are forecasting models that regress on past forecast errors to forecast future values.

It’s easy to lump the two techniques together, but they serve very different functions. Thus, a moving-average model is conceptually a linear regression of the current value of the series against current and previous (unobserved) white noise error terms or random shocks.

The random shocks at each point are assumed to be mutually independent and to come from the same distribution, typically a normal distribution, with a location at zero and constant scale.
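To make the contrast concrete, this is what moving average smoothing looks like: a rolling mean of the observed values, with no error terms involved. The function name and the toy data are assumptions for this sketch.

```python
import numpy as np

def moving_average_smooth(series, window):
    """Rolling mean of the observed values (smoothing, not an MA model)."""
    kernel = np.ones(window) / window  # equal weights that sum to 1
    # mode="valid" keeps only the positions where the full window fits
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")

print(moving_average_smooth([1, 2, 3, 4, 5, 6], window=3))
```

An MA model, by contrast, never averages observed values; it combines the current and previous random shocks.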

We’ll generate our data so we know the generative process for an MA series.

# reproducibility
np.random.seed(12)

# create autocorrelated data
time = np.arange(100)
#mean 0
ma1_sample = np.zeros(100)
#create vector of random normally distributed errors
error = np.random.normal(loc=0, scale=2.5, size=100)
# set first value to one of the random errors
ma1_sample[0] += error[0]

#set future values to 0.4 times error of prior value plus the current error term
for t in time[1:]:
    ma1_sample[t] = (0.4 * error[t-1]) + error[t]

plt.fill_between(time,ma1_sample)

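# find model params for the generated sample; ARIMA replaces sm.tsa.ARMA, which recent statsmodels releases removed
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(ma1_sample, order=(0, 0, 1), trend='n').fit()
# params holds the MA(1) coefficient followed by the noise variance (sigma2)
model.params

out (as reported with the older sm.tsa.ARMA API, which returned only the coefficient): array([0.34274651])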

Section 4: The Autocorrelation Function (ACF)

There’s a crucial question we need to answer: how do you choose the orders (p and q) for a time series?

To answer that question, we need to understand the Autocorrelation Function (ACF). Let’s start by showing an example ACF plot for our different simulated series.

fig = sm.tsa.graphics.plot_acf(ar1_sample, lags=range(1,30), alpha=0.05,title = 'ar1 ACF')
fig = sm.tsa.graphics.plot_acf(ma1_sample, lags=range(1,15), alpha=0.05,title = 'ma1 ACF')

An explanation is in order. First, the blue region represents a confidence interval. Alpha, in this case, was set to 0.05 (95% confidence interval). This can be set to whatever float value you require. See the plot_acf function for details.

The stems represent lagged correlation values. In other words, a lag of 1 shows the correlation with the prior endogenous value, a lag of 2 shows the correlation with the value two steps prior, and so on. Remember that we're correlating the series with its own past values; that's the correlation we're inspecting here.

Correlations outside of the confidence interval are statistically significant, whereas the others are not.
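As a rough sketch of that significance check, we can compare each autocorrelation with the white-noise approximation ±1.96/√n for a 95% interval. This is a simplification: the band drawn by plot_acf may be computed differently, and both the function name and the autocorrelation values below are hypothetical.

```python
import numpy as np

def significant_lags(acf_values, n_obs, z=1.96):
    """Return the lags whose autocorrelation falls outside +/- z / sqrt(n_obs).

    acf_values[k] is the autocorrelation at lag k (lag 0 included).
    """
    bound = z / np.sqrt(n_obs)
    return [lag for lag, r in enumerate(acf_values) if lag > 0 and abs(r) > bound]

# hypothetical autocorrelations for lags 0..4 of a 100-point series
example_acf = [1.0, 0.65, 0.40, 0.15, -0.05]
print(significant_lags(example_acf, n_obs=100))
```

With 100 observations the bound is about 0.196, so only the first two lags in this made-up example would count as significant.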

Note that if lag 1 shows strong autocorrelation, lag 2 will show strong autocorrelation as well, since lag 1 is correlated with lag 2, lag 2 with lag 3, and so on. That’s why you see the ar1 model with slowly decaying correlation.

If we think about the functions, we note that autocorrelation will propagate for AR(1) models:

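$X_t = \beta_0 + \beta_1 X_{t-1} + \epsilon_t$
$X_{t-1} = \beta_0 + \beta_1 X_{t-2} + \epsilon_{t-1}$
$X_t = \beta_0 + \beta_1 (\beta_0 + \beta_1 X_{t-2} + \epsilon_{t-1}) + \epsilon_t$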

The past errors will propagate into the future, leading to the slowly decaying plot we just mentioned.

For MA(1) models:

$X_t = \beta_0 + \theta_1 \epsilon_{t-1} + \epsilon_t$

Only the previous error affects the current value, so the correlation does not carry beyond lag 1.
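The difference is easy to see in the theoretical autocorrelations. For a stationary AR(1) process with coefficient φ, the autocorrelation at lag k is φ^k, while for an MA(1) process with coefficient θ it is θ / (1 + θ²) at lag 1 and zero afterwards. The helper names below are ours; the coefficients 0.7 and 0.4 match the simulated series above.

```python
def ar1_acf(phi, max_lag):
    """Theoretical ACF of a stationary AR(1) process: phi ** k at lag k."""
    return [phi ** k for k in range(max_lag + 1)]

def ma1_acf(theta, max_lag):
    """Theoretical ACF of an MA(1) process: non-zero only at lags 0 and 1."""
    acf = [1.0]
    for k in range(1, max_lag + 1):
        acf.append(theta / (1 + theta ** 2) if k == 1 else 0.0)
    return acf

print(ar1_acf(0.7, 4))  # decays gradually
print(ma1_acf(0.4, 4))  # cuts off after lag 1
```

Sample ACF plots from 100 observations will be noisier than these values, but the overall shapes should be similar.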

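So an easy way to tell an AR(1) model from an MA(1) model is to check whether the correlation carries over from one lag to the next: for AR(1) it decays slowly across lags, while for MA(1) it drops off after lag 1.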
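The ACF calls below use ma2_sample, which is not generated anywhere in this article. Here is a minimal sketch that creates one in the same style as the MA(1) example; the seed and the coefficients 0.4 and 0.3 are our assumptions, not values from the original.

```python
import numpy as np

# reproducibility (arbitrary seed)
np.random.seed(21)

time = np.arange(100)
# vector of random normally distributed errors
error2 = np.random.normal(loc=0, scale=2.5, size=100)

ma2_sample = np.zeros(100)
# the first two values only have the errors that exist so far
ma2_sample[0] = error2[0]
ma2_sample[1] = 0.4 * error2[0] + error2[1]
# later values: 0.4 * previous error + 0.3 * error two steps back + current error
for t in time[2:]:
    ma2_sample[t] = 0.4 * error2[t-1] + 0.3 * error2[t-2] + error2[t]
```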

fig = sm.tsa.graphics.plot_acf(ar2_sample, lags=range(1,15), alpha=0.05,title = 'ar2 ACF')
fig = sm.tsa.graphics.plot_acf(ma2_sample, lags=range(1,15), alpha=0.05,title = 'ma2 ACF')

Summary

In this post, we explored what time series forecasting is and what its important components are, i.e., the constituent components that a time series can be decomposed into when performing an analysis.

We also went through different types of forecasting, and dove into moving averages, stationary models, and how to plot time series using Python.

In the next article, we'll focus on how to model time series data using ARIMA, SARIMA, and FB Prophet. Thanks for reading!

