Time Series Forecasting Based on AutoRegression

8688-5.jpg

It is often necessary to quickly and efficiently calculate time series values to assess the behavior of an observed object or process. This can be an indicator of the company's performance in a certain direction, the temperature inside the room, the cost of a certain product for a promotion, an assessment of customer satisfaction with service, and so on.

Practical problems of this class have to be solved quite often, and instead of building the expected curve using a spreadsheet editor or pencil and paper, you can quickly and efficiently use Python and the Google Colab cloud service. We will choose the simplest and most popular method, autoregressive (AR), as a method for forecasting a time series.

Let's see how to solve this problem using the example of data on Covid-19 in Ukraine, specifically the number of new cases detected daily.

Time Series

A time series (or a dynamic series) is statistical material collected at different points in time about the value of any parameters (in the simplest case, one) of the process under study. Each unit of statistical material is called a measurement or data sample at a specific moment in time.

The time series for each sample must indicate the measurement time or the number of measurements in order. A time series differs significantly from a simple data sample since the analysis takes into account the relationship of measurements to time, and not only the statistical diversity and statistical characteristics of the sample.

A time series consists of two elements:

  • the period of time for which or as of which the numerical values are given;
  • numerical values of one or another indicator called the levels of the series.

A time series can be classified by the number of indicators for which the levels are determined at each point in time: one-dimensional and multi-dimensional time series.

Also, time series are divided depending on the presence of the main trend, while stationary series, in which the main value and variance are constant, and non-stationary ones, containing the main development trend, are distinguished.

Why use forecasting methods in data science?

A time series is one of the most important forecasting methods in various fields of human activity and natural phenomena. Time series, as a rule, results from the measurement of some indicator. These can be both indicators (characteristics) of technical systems, and indicators of natural, social, economic, and other systems, for example, weather data. A typical example of a time series is the stock exchange rate, for which the purpose is to determine the main direction of development or trends.

Currently, there are about 220 forecasting methods, but most often no more than 10 are used in practice.  These include: factual (extrapolation, interpolation, trend analysis), expert (including survey, questionnaire), the publication (including patent), citation-index, scenario, matrix, modeling, analogies, and building graphs.

Forecast estimates using extrapolation methods are calculated in several stages:

  • forecast baseline check
  • identification of patterns of the past development of the phenomenon
  • assessment of the degree of reliability of the revealed pattern of development in the past (selection of the trend function)
  • extrapolation - transferring the identified patterns to a certain period of the future
  • correction of the obtained forecast, taking into account the results of a meaningful analysis of the current state.

To obtain an objective forecast of the development of the phenomenon under study, the baseline data must meet the following requirements:

  • the time step for the entire baseline must be the same
  • observations are recorded at the same moment of every time interval (for example, at noon of every day, the first of every month)
  • the baseline must be complete, that is, no data skipping is allowed.

If there are no results in the observations for a short period of time, then to ensure the completeness of the baseline, it is necessary to fill them in with approximate data. For example, use the average value of adjacent segments.

Correction of the obtained forecast is performed to clarify the obtained long-term forecasts, taking into account the influence of seasonality or abrupt development of the phenomenon under study.

When predicting an object, you often have to predict not one, but several of its indicators. In this case, the forecast of the development of one indicator can be performed by one method, and another indicator by another method, i.e., combinations of methods are used.

The most popular methods in time series forecasting are:

  • Autoregression (AR)
  • Moving Average (MA)
  • Autoregressive Moving Average (ARMA)
  • Autoregressive Integrated Moving Average (ARIMA)
  • Seasonal Autoregressive Integrated Moving-Average (SARIMA)
  • Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
  • Vector Autoregression (VAR)
  • Vector Autoregression Moving-Average (VARMA)
  • Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
  • Simple Exponential Smoothing (SES)
  • Holt Winter’s Exponential Smoothing (HWES)

All these methods can be used in time series forecasting, but some methods will be more effective in specific areas of domains. In practice, a  combination of methods will generate better results.

Analytics and Big Data Services for Informed Decisions Empower your business with Svitla Systems’ advanced analytics and considerable data expertise to drive more innovative strategies. Get a Consultation

How to make simple and effective time series forecasting on AutoRegression model

In statistics, economics, and digital signal processing, an autoregressive (AR) model, a random process, is common. It is used to describe certain time-varying processes in nature, sociology, economics, electronic and mechanical systems, and many other fields.

The autoregressive model determines that the output variable is linearly dependent on its own previous values and on the stochastic term (responsible for a parameter that cannot be fully predicted). The model takes the form of a stochastic difference equation (or a recurrence relation, but not in the form of a differential equation).

Together with the moving average (MA) model, it is a special case and a key component of the more general autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) time series models.

The AR model of the time series, in general, is described as a formula:
Time series model

Where are the model parameters and et are the white noise values. Where p is the number of values by which the dependency is determined (aka lags).

For practical use of AR, the method is easiest to run in Python: for this, you can download the ready-made statsmodel library. You can read more about this library here.

As a practical and relevant task, we will choose the time series of the daily parameter of the detected new cases of COVID-19 in Ukraine (according to official sources).

In order to run a real example AR method, go to Google Colab and start by installing the statsmodel package, which can be easily done via pip install.

%pip install statsmodels --upgrade

After that, do not forget to restart the runtime. Without this, it will not work, since the environment does not see the newly installed package. To restart, click on the “RESTART RUNTIME” button.

Restart runtime

Let's train the model using AutoReg() and then predict using model_fit.predict() for the future 21 values (we predict three weeks ahead).

Set the lags parameter to 7, since the nature of the input data has a weekly frequency.

from statsmodels.tsa.ar_model import AutoReg
from random import random
# contrived dataset
data = numpy.array([4,7,2,10,21,26,11,29,83,22,92,46,119,73,97,149,
148,154,155,68,143,206,224,311,308,266,325,270,392,397,501,444,343,261,415,
467,578,477,478,492,392,401,456,540,455,550,502,418,366,487,507,504,515,522,
416,375,402,422,483,528,433,325,260,354,476,442,432,406,259,339,321,477,
429,393,468,340,328,482,589,553,550,485,463,394,525,689,683,753,648,656,
666,758,829,921,841,735,681,833,940,994,1109,948,917,646,706,664,889,876,
914,823,543,564,807,810,819,800,678,612,638,836,848,809,847,731,651,673,
829,856,972,1106,920,807,919,1022,1197,1090,1172,1112,990,1061,1271,1318,
1453,1489,1199,1008,1158,1433,1592,1732,1847,1637,1464,1616,1967,2134,2106,
2328,1987,1799,1658,1670,1974,2438,2481,2096,2141,2088,2495,2430,2723,2836,
2107,2174,2411,2551,2582,3144,3103,2476,2462,2905])
 
# fit model
model = AutoReg(data, lags=7)
model_fit = model.fit()
# let's make prediction
y = model_fit.predict(len(data), len(data)+21)
print(y)

plt.figure(figsize=(10,5))
plt.plot(range(1,len(y)+1), y, 'b')
plt.show()
Code language: PHP (php)

Then we get a forecast graph for 21 days according to the previous values.

Forecast graph for 21 days

But in this form, it is very difficult to visually assess how this graph is combined with the previous values. Therefore, let's add the old values (blue graph) and new values (red graph) in one picture.

plt.figure(figsize=(10,5))
prognose = numpy.concatenate((data,y))
plt.plot(range(1,len(prognose)+1), prognose, 'r')
plt.plot(range(1,len(data)+1), data, 'b')
plt.show()
Code language: JavaScript (javascript)
Forecast graph

Now let's change the value of lags = 75 to increase the impact of more data in the window, and see how the predicted value changes.

# fit model
model = AutoReg(data, lags=75)
model_fit = model.fit()
# make prediction
y = model_fit.predict(len(data), len(data)+20)
print(y)
 
plt.figure(figsize=(10,5))
prognose = numpy.concatenate((data,y))
plt.plot(range(1,len(prognose)+1), prognose, 'r')
plt.plot(range(1,len(data)+1), data, 'b')
plt.show()
Code language: PHP (php)
Descent increase forecast graph

As you can see, taking into account a larger number of previous values, we get a decent increase in the absolute values of the predicted parameter. Both results are plausible though.

Conclusion

If you are interested in the topic of time series and like the idea of using these methods in practice, we strongly recommend reading about other methods in the article by Dr. Jason Brownlee (Ph.D.) at this link.

There you will find descriptions of more advanced methods and can expand your toolbox specifically for your task.

Our experts from Svitla Systems will assist you in the research and development of machine learning and data analysis methods. Since the education system in our main locations prepares engineers who are savvy in mathematics and programming, you can contact our company for solving a problem even if you cannot clearly formulate it, but need a solution for your business process.

Our system analysts and experienced project managers will assess the project at a high level, will talk with you to clarify the tasks for the project, and choose the right development toolkit. At the same time, the copyrights are fully and completely retained by your company. Svitla Systems operates in many locations on several continents and will always find the right team for you to solve your software development and data science tasks.

FAQ

What is the time series forecasting?

Time series forecasting refers to the modeling of a historical sequence of time-ordered data and then using this model to predict an indicator in the future. It includes such disparate examples as stock prices, counts of Covid-19 cases, room temperature, or customer satisfaction. The explicit relationship between measurements and time ensures that patterns such as trend, seasonality, or stationarity are assumed more generally than with individual data sets. Modeling typically occurs through identifying past patterns under an assumption of regular time steps without missing values using a suitable model (often some variant of AR, MA, ARIMA, SARIMA, or exponential smoothing), and projecting these patterns into the future. In actual practice, hybrids of these methods-all available within Python’s statsmodels, e.g., AutoReg for autoregression – can be marshaled quickly to generate predictions and associated visualizations that inform real-world decisions.

What is time series in simple words?

A time series is basically a set of observations taken at successive times, where the sequence has inherent significance. It can be seen as a running log of what you are measuring – for instance, daily temperatures, monthly sales, or hourly counts of website hits. Each observation corresponds to some specific point or period in time, enabling us to see how something varies over time. The time dimension is important because only with it can certain regularities – such as trends and cycles – be discerned; they would not show up in an unorganized pile of numbers.

How do you forecast using time series?

Collect historical data at regular time intervals, make sure it is complete and consistently measured. Study previous behavior and look for patterns regarding trend, seasonality, or stability (stationarity) of the series, choose a relevant method among autoregression, ARIMA, exponential smoothing, their seasonal and vector variants. Fit the selected model to the historical data so that these patterns are captured, then extrapolate them into the future, which will produce forecasts for the coming periods. Review and adjust, if necessary, taking into account domain knowledge, seasonal effects, and recent, sudden changes in the underlying process.

What is the best forecasting method for time series data?

There is no one single best forecasting method that could be applied to every time series. The choice depends on whether the data is trending, seasonal, how noisy it is, and whether it’s univariate or multivariate. For most simple, quite stable series, methods such as autoregression (AR), moving average (MA), ARIMA, or exponential smoothing give good results. When strong seasonality or more complex behavior exists in the data, the extended models SARIMA, SARIMAX, and Holt-Winters are typically used with better results. Vector models: VAR, VARMA, and VARMAX are of use when modeling several related time series jointly. In practical work, analysts usually compare several methods and often use a combination thereof to have the most accurate and robust forecasts for a particular task.

How much data is required for time series forecasting?

The amount of data required depends on how complex the patterns inside the data are and what method is being used to forecast. More data will always be preferable since it will enable models to better spot trends, seasonality, or any other periodic pattern within the series. The observation has to be at regular intervals with no missing values, and it should also be relevant and recent.