Springer Nature - PMC COVID-19 Collection
Forecasting COVID-19: Vector Autoregression-Based Model

Khairan Rajab

1 Najran University, Najran, Kingdom of Saudi Arabia

Firuz Kamalov

2 Canadian University Dubai, Dubai, UAE

Aswani Kumar Cherukuri

3 Vellore Institute of Technology, Vellore, India

Forecasting the spread of COVID-19 infection is an important aspect of public health management. In this paper, we propose an approach to forecasting the spread of the pandemic based on the vector autoregressive model. Concretely, we combine the time series for the number of new cases and the number of new deaths to obtain a joint forecasting model. We apply the proposed model to forecast the number of new cases and deaths in the UAE, Saudi Arabia, and Kuwait. Test results based on out-of-sample forecasts show that the proposed model achieves a high level of accuracy that is superior to many existing methods. Concretely, our model achieves a mean absolute percentage error (MAPE) of 0.35%, 2.03%, and 3.75% in predicting the number of daily new cases for the three countries, respectively. Furthermore, converting our predictions into forecasts of the cumulative number of cases, we obtain MAPE of 0.0017%, 0.002%, and 0.024%, respectively. The strong performance of the proposed approach indicates that it could be a valuable tool in managing the pandemic.

Introduction

The COVID-19 pandemic has brought tremendous challenges to the governments and public health authorities around the globe. One of the key aspects to managing the governments’ response to the pandemic is forecasting the spread of the infection. Accurate forecast of the expected number of new cases can help the authorities better plan their policies and actions to achieve the optimal outcome. In this paper, we propose a vector autoregressive (VAR) model to forecast the daily number of new cases and deaths. The proposed algorithm produces accurate results and outperforms the existing state-of-the-art forecasting models.

The ability to accurately forecast the spread of the infection allows governments to make smart and informed decisions. If the number of infections is expected to rise sharply, the government may consider imposing a lockdown in order to stop the spread of the virus. On the other hand, if the number of new cases is expected to decline, the government may consider easing some of the social and economic restrictions to improve the quality of life. Similarly, an accurate forecast allows public health authorities to better manage their limited resources.

Forecasting COVID-19 infection has received a considerable amount of attention among researchers. However, our method differs from the existing approaches in two important ways. First, we consider the number of new cases in conjunction with the number of deaths. We believe that the two time series are related. Consequently, the information about one time series can be used to forecast the other time series. Using the VAR model, we can take into account the cross-correlation between the series and achieve more accurate forecasts. Second, we use data from some of the most extensively tested populations in the world—the UAE, Saudi Arabia, and Kuwait—to train our model. As a result, the data accurately represent the true prevalence of the infection in each country. Additionally, our data cover a 12-month period, which is considerably longer than in many existing studies. We believe that the quality and depth of the data lead to more reliable results.

Although vector autoregression was used to forecast COVID-19 in the past [ 1 ], its application has been sparse. Our approach is based on a careful study of the time series plots, correlation plots, and information criteria. Remarkably, our analysis produced the same order model for each country. This indicates that the proposed approach can be adopted to forecast the infection rates in other countries. To test the efficacy of our approach, we measured the model accuracy based on a 10-day-ahead forecast. The mean absolute percentage error for the three countries, relative to the cumulative number of cases, is 0.0017%, 0.002%, and 0.024%, respectively.

Our paper is structured as follows. In Sect.  2 , we briefly discuss the existing efforts in the literature to forecast the COVID-19 infection. In Sect.  3 , we provide the required theoretical background about the VAR model. In Sect.  4 , we construct and apply the proposed model to forecast the infection rates in the UAE, Saudi Arabia, and Kuwait. We conclude the paper with a few closing remarks in Sect.  5 .

Related Work

The topic of forecasting COVID-19 has recently attracted a significant amount of attention in the literature. A number of different forecasting approaches have been explored, and the results vary depending on the model and data used in the study. Despite the considerable volume of research dedicated to the subject, the results have sometimes been criticized as inadequate [ 2 ]. One of the issues with the existing attempts in the literature is the size and quality of the training data: the data are often too limited or originate from countries with low testing rates [ 3 ]. To address this issue, we employ a 12-month dataset from rigorously tested countries. Another issue is the vulnerability of the underlying assumptions of the models. Most forecasting models rest on certain assumptions about the time series; if those assumptions are not satisfied, the model is not technically sound.

The majority of the existing forecasting methods can be grouped into three categories: autoregressive integrated moving average (ARIMA) models, mathematical growth models, and machine learning (ML) models. In an ARIMA model, the values of an individual time series are forecasted based on a linear combination of the past values and random shocks. Formally, ARIMA models are denoted ARIMA( p , d , q ), where p is the number of time lags of the autoregressive model, d is the degree of differencing, and q is the order of past shocks. The authors in [ 4 ] find ARIMA(0,1,0) to be the best fit for predicting the trend of daily confirmed COVID-19 cases in Malaysia. The authors use the data from January 22 to March 31, 2020, for training and April 1 to April 17, 2020, for testing. The test results show a MAPE of 16.01%. In a similar study [ 5 ], the authors estimated the total daily infected cases in the five countries with the highest case counts at the time: the US, Brazil, India, Russia, and Spain. The authors obtained different optimal ARIMA models for each country: (4,2,4), (3,1,2), (3,0,0), (4,2,4), and (1,2,1), respectively. Model specifications were estimated using the Hannan-Rissanen algorithm [ 6 ]. The data for the study were taken from February 15 to June 30, 2020, for training and July 1 to July 18, 2020, for testing. The MAPE for each country is 3.701%, 1.844%, 1.090%, 0.832%, and 2.885%, respectively.

The mathematical growth (contagion) models are based on differential equations that model the spread of infection. In [ 7 ], the authors forecast the total number of daily confirmed and death cases in India using several models based on gene expression programming. The data for the study are taken from April 7 to May 5, 2020. The reported root-mean-squared errors on the training data for the confirmed and death cases are 5.5574 and 90.1863, respectively. The authors do not provide out-of-sample forecasting and testing. In [ 8 ], the authors forecast the total number of daily infected individuals in Brazil, the UK, and South Korea using a discrete-time evolution model based on a set of four equations. The reported MAPE values are 5.25% for Brazil, 4% for the UK, and 3.75% for South Korea.
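As a generic illustration of this class of differential-equation models (not the specific models of [ 7 ] or [ 8 ]), the classic SIR contagion equations can be integrated with simple Euler steps; all parameter values below are arbitrary assumptions.

```python
# A minimal SIR contagion model integrated with Euler steps, as a generic
# illustration of the differential-equation model class described above.
# Population size, beta, and gamma are arbitrary illustrative values.
N = 1_000_000           # population size (assumed)
beta, gamma = 0.3, 0.1  # infection and recovery rates (assumed)
S, I, R = N - 100, 100, 0
dt = 1.0                # one-day time step

history = []
for day in range(120):
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    S, I, R = S + dS * dt, I + dI * dt, R + dR * dt
    history.append(I)

# Conservation check: S + I + R stays equal to N.
print(round(S + I + R))  # 1000000
```

More elaborate growth models add compartments (exposed, deceased, etc.) or, as in [ 7 ], evolve the functional form itself.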

Machine learning models have been used for forecasting in a variety of applications including finance [ 9 , 10 ], energy [ 11 ], education [ 12 ], temperature [ 13 ], and many others. A number of authors have employed ML methods such as regularized linear regression (LASSO) and recurrent neural networks (RNN) to forecast the spread of the infection [ 14 ]. In [ 15 ], the authors compared RNN, long short-term memory (LSTM), BiLSTM, gated recurrent units (GRU), and variational autoencoders (VAE) to forecast the total number of daily cases in Italy, Spain, France, China, the USA, and Australia. The study used data from January 22 to June 1, 2020, for training and June 1 to June 17, 2020, for testing. The VAE model achieved the best MAPE values: 5.90%, 2.19%, 1.88%, 0.128%, 0.236%, and 2.04%, respectively. Similar studies using LASSO, SVM, logistic regression, and others have also been carried out in [ 16 , 17 ].

Vector autoregression is used to model the joint dynamic behavior of a collection of time series. It was used in [ 18 ] to forecast mortality rates, where mortality rates of each age depend on the historical values of itself and the neighboring ages. The VAR model is commonly used in spatiotemporal settings. For instance, in [ 19 ] wind power forecast at multiple plants at different locations was done within a single framework using LASSO vector autoregression. The predicted output from each plant in the model is based on its own past values and the past values of the other plants included in the model. Similarly, LASSO vector autoregression was applied for wind power prediction in [ 20 ]. Vector autoregression has also been applied in a number of other fields including finance [ 21 ], tourism [ 22 ], and commodity prices [ 23 ].

The application of the vector autoregression model in the context of COVID-19 has been limited. For instance, in [ 1 ] the VAR model was studied together with linear regression and multilayer perceptron. The authors in [ 24 ] used the VAR model to forecast the infection, hospitalization, and ICU bed numbers in Italy. Neither of the two studies includes any measure of accuracy to evaluate the models.

Vector Autoregressive Model

The VAR process is traditionally used to jointly model two or more related time series. In the case of COVID-19, the number of new cases is related to the number of deaths: as the number of new cases increases, so does the number of deaths. Therefore, information about the former can help predict the latter. The VAR process makes it possible to incorporate both the number of new cases and the number of deaths into a single model, producing a more powerful forecasting paradigm. In addition, the VAR process requires minimal assumptions about the nature of the time series. As will be shown in Sect. 4.3, modeling the number of new cases in conjunction with the number of deaths is more effective than modeling the series individually.

A good introduction to vector autoregression can be found in [ 25 ]. Let

\[\mathbf{x}_t = \begin{bmatrix} x_{t,1} \\ x_{t,2} \\ \vdots \\ x_{t,k} \end{bmatrix}\]

be a vector-valued time series consisting of k individual time series. Assume that \(\mathbf{x}_t\) is stationary, i.e., the cross-covariance function \(\mathrm{Cov}(x_{t,i}, x_{s,j})\) depends only on \(s - t\). Then, the VAR( p ) model is given by the following equation:

\[\mathbf{x}_t = \Phi_1 \mathbf{x}_{t-1} + \Phi_2 \mathbf{x}_{t-2} + \cdots + \Phi_p \mathbf{x}_{t-p} + \mathbf{w}_t,\]

where the \(\Phi_j\) are \(k \times k\) matrices of coefficients and \(\mathbf{w}_t\) is vector Gaussian white noise with \(\mathrm{Cov}(\mathbf{w}_t, \mathbf{w}_s) = 0\) for \(s \neq t\). In our paper, we examine a vector of two time series, so the corresponding VAR( p ) model is given by the following equation:

\[\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \sum_{j=1}^{p} \Phi_j \begin{bmatrix} x_{t-j} \\ y_{t-j} \end{bmatrix} + \mathbf{w}_t,\]

where \(x_t\) and \(y_t\) are the numbers of new cases and deaths at time t , respectively. The coefficients of each matrix

\[\Phi_j = \begin{bmatrix} \Phi_{11} & \Phi_{12} \\ \Phi_{21} & \Phi_{22} \end{bmatrix}\]

are estimated by maximum likelihood; in other words, the matrix coefficients are calculated to maximize the likelihood of obtaining the sample. In our paper, we use the statsmodels package in Python [ 26 ] to implement the VAR model. The order p of the VAR model is chosen based on a combination of different factors including the time series plots, the correlation plots, and information criteria. In addition, residual analysis is employed to confirm the model assumptions about normality and independence.

Model Construction and Forecasting

The data used in this study consist of the daily number of new cases and deaths. The data are collected for three countries: the UAE, Saudi Arabia, and Kuwait. The countries were chosen due to the rigorous COVID-19 testing conducted within their populations [ 3 ]. The data span the 12-month period from March 20, 2020, to March 20, 2021. We used the data until March 10, 2021, for training and the data from March 11 to March 20, 2021, for testing. The data are sourced from OurWorldInData.org [ 27 ]. The time series plots for the number of new cases and deaths for each country are presented in Fig. 1.

Fig. 1: The original time series data for the three countries

Data Preprocessing

According to the European Centre for Disease Prevention and Control, “the daily number of cases is frequently subject to retrospective corrections, delays in reporting and/or clustered reporting of data for several days.” Therefore, daily variations in the number of cases are unreliable for effective forecasting. To obtain a smoother and more reliable time series, we employ a 7-day rolling mean. As shown in Fig. 2, using the rolling mean produces a more reliable time series. We can also see from the plots that the time series for the number of cases and the number of deaths are correlated with an approximately 30-day lag. For instance, in the case of the UAE, the peak in the number of new cases occurs at the end of January, while the peak in the number of new deaths occurs at the end of February.
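The 7-day rolling mean can be computed directly with pandas; the toy numbers below are illustrative, not actual case counts.

```python
# The 7-day rolling mean described above, sketched with pandas on a toy series.
import pandas as pd

raw = pd.Series([120, 80, 150, 90, 200, 60, 140, 130, 95, 160])
smoothed = raw.rolling(window=7).mean()

# The first 6 values are NaN because a full 7-day window is not yet available.
print(smoothed.iloc[6])  # 120.0, the mean of the first 7 observations
```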

Fig. 2: The 7-day rolling mean of the number of new cases and deaths

The plots in Fig. 2 show that the time series are not stationary. Let \(x_t\) be the value of the time series at time t . To move toward stationarity, we take the first difference of the time series values:

\[\nabla x_t = x_t - x_{t-1}.\]

As shown in Fig. 3, the resulting series is closer to stationary, with a constant mean around zero. There remain spikes in the variance, especially toward the end of the period (Fig. 3a), which can be attributed to the stochastic realization of the time series. Nevertheless, the transformed series appears sufficiently stable to proceed with our analysis.

Fig. 3: Taking the first difference of the series helps achieve stationarity

Building the Forecasting Model

The key to building an effective VAR model is determining the correct order. To identify the order of the model, we consider three factors: time series plots, correlation plots, and information criteria. We illustrate our approach to building a model in the case of the UAE; the models for Saudi Arabia and Kuwait are constructed in a similar fashion. Visual examination of the time series in Fig. 1a shows that the time series for the new cases and the time series for the new deaths are correlated with an approximately 30-day lag. Next, we consider the auto- and cross-correlation plots. As shown in Fig. 4, there exists nontrivial correlation in both time series. Concretely, the autocorrelation plot for the new cases (upper left) contains nonzero values up to lag 28, while for the new deaths (lower right) it is up to lag 17. The cross-correlation function is not even, i.e., \(\rho_{xy}(s, t) \neq \rho_{yx}(s, t)\); therefore, the off-diagonal cross-correlation plots are different. Since the number of new cases leads the number of new deaths, we consider the cross-correlation plot on the upper right. It shows nontrivial correlation, which indicates that the number of new cases has a lagged correlation with the number of new deaths, so the use of the VAR model is statistically justified. In addition, the nonzero value at lag 29 is consistent with our visual observations in Fig. 1a.

Fig. 4: Correlation plots for the UAE. The upper left and lower right plots represent autocorrelations for the number of cases and deaths, respectively. The off-diagonal plots are cross-correlations. The dashed horizontal lines represent the confidence limits

As the final step in identifying the order of the model, we study information criteria. In particular, we calculate the Akaike information criterion (AIC) of the VAR model for the first 33 orders. The AIC is given by the following equation:

\[\mathrm{AIC} = 2k - 2\ln(\hat{L}),\]

where k is the number of parameters in the model and \(\hat{L}\) is the maximum sample likelihood for the model. The model with the lowest AIC is considered optimal. As shown in Table 1, the minimum AIC is achieved at lag 29. This is consistent with our earlier observations. Recall that the plots in Fig. 1a show an approximately 30-day lag between the two time series. In addition, the cross-correlation plot had a maximum nonzero value at lag 29. We conclude that the optimal order for our VAR model is p = 29.

Table 1: AIC values for different orders p of the VAR model

 p   AIC      p   AIC      p   AIC      p   AIC      p   AIC      p   AIC
 0   5.693    6   4.785   12   4.337   18   4.106   24   3.990   30   3.968
 1   4.911    7   4.624   13   4.292   19   4.113   25   3.990   31   3.998
 2   4.877    8   4.436   14   4.198   20   4.088   26   3.986   32   4.008
 3   4.824    9   4.442   15   4.081   21   4.019   27   3.999   33   4.017
 4   4.841   10   4.310   16   4.089   22   3.977   28   3.997
 5   4.826   11   4.325   17   4.102   23   3.978   29   3.951

We split the data into training and testing subsets. The training subset encompasses the period March 20, 2020–March 10, 2021, while the testing subset covers March 11–March 20, 2021. We fit the VAR(29) model on the training data. Then, we use the fitted model to forecast the daily number of new cases and deaths for the testing period. As shown in Fig. 5a, the forecasted numbers of new cases match almost perfectly with the actual numbers. Similarly, as can be seen from Fig. 5b, the forecasted numbers of new deaths match closely with the actual numbers.

Fig. 5: The actual and forecast values for the number of new cases and deaths in the UAE

To make the comparison between the forecasted and actual values more precise, we calculate the root-mean-squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) of the model forecast. Note that RMSE and MAE are scale-dependent and should be used with caution for comparison between countries with significantly different numbers of cases. While our model is designed to forecast the number of new cases, it can easily be used to forecast the total number of cases: we simply add the forecast for the number of new cases to the current total number of cases. The results presented in Table 2 show that the proposed model produces highly accurate forecasts. The RMSE, MAE, and MAPE columns in Table 2 correspond to the number of new cases, while the MAPE_T column corresponds to the total number of cases. Observe that the MAPE relative to the number of new cases is 0.35%, while the MAPE relative to the total number of cases is 0.0017%. The model is less accurate in predicting the number of deaths. The lower accuracy on the number of deaths can be attributed to two primary factors: insufficient data and the more complex dynamics underlying mortality. On the other hand, the MAE for new deaths is 0.86, which means that the forecast is accurate to within one count on average.
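The three accuracy metrics are straightforward to compute with numpy; the actual and forecast values below are illustrative, not the paper's.

```python
# The three accuracy metrics used above, computed with numpy.
# The "actual" and "forecast" numbers here are illustrative only.
import numpy as np

actual = np.array([1000.0, 1020.0, 980.0, 1010.0])
forecast = np.array([995.0, 1030.0, 985.0, 1000.0])

rmse = np.sqrt(np.mean((actual - forecast) ** 2))
mae = np.mean(np.abs(actual - forecast))
mape = np.mean(np.abs((actual - forecast) / actual)) * 100  # in percent

print(round(mae, 2))  # 7.5
```

Note how MAPE normalizes by the actual values, which is what makes it comparable across countries with different case counts, unlike RMSE and MAE.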

Table 2: 10-day-ahead forecast accuracy for the UAE

             RMSE   MAE    MAPE     MAPE_T
New cases    9.65   7.45    0.35%   0.0017%
New deaths   1.02   0.86   10.75%   0.0612%

Indeed, the results are substantially better than those of many current models in the literature. As shown in Table 3, the proposed model outperforms existing approaches from a range of categories including ARIMA, mathematical modeling, and ML. In particular, the MAPE for predicting the total number of cases using our method is 0.0017%, which is substantially smaller than the MAPE values presented in Table 3. Although the comparison in Table 3 is not ideal due to the differences between the studies, it does provide a useful benchmark for our proposed model.

Table 3: Accuracy results of the existing methods for forecasting the number of cases

Source    Singh et al. [ ]    Singh et al. [ ]    Curado et al. [ ]     Zeroual et al. [ ]
Method    ARIMA               ARIMA               Mathematical model    Machine learning
Results   MAPE: 16.01%        MAPE: 0.8–3.7%      MAPE: 3–5%            MAPE: 0.128–5.90%

To further validate the use of the VAR model, we compare its performance against a basic AR model. To this end, we fitted and tested an AR(30) model on the data. We obtain a MAPE_T of 0.0063%, which is more than three times the MAPE_T of the VAR model. We conclude that the vector-based approach to forecasting COVID-19 infection is more effective than the single-series approach.

Saudi Arabia and Kuwait

To demonstrate the effectiveness of the proposed approach, we apply it to the cases of Saudi Arabia and Kuwait. To construct the forecasting models for the two countries, we follow the same steps as in the case of the UAE: we consider the time series plots, correlation plots, and AIC. Our analysis yields VAR(29) and VAR(28) as the optimal models for Saudi Arabia and Kuwait, respectively. After fitting the models on the training data, we forecast the number of new cases and deaths for the test period March 11–March 20, 2021. The results are illustrated in Figs. 6 and 7. As shown in Fig. 6, the forecasted number of cases in Saudi Arabia matches very closely with the actual number of cases. Furthermore, the forecasted number of deaths is nearly identical to the actual numbers. Similarly, as shown in Fig. 7, the forecasted values for Kuwait are not far from the actual values.

Fig. 6: The actual and forecast values for the number of new cases and deaths in Saudi Arabia

Fig. 7: The actual and forecast values for the number of new cases and deaths in Kuwait

As shown in Tables 2 and 4, the proposed forecasting approach achieves a high degree of accuracy in forecasting the number of new cases and deaths. The 10-day-ahead forecast of the number of new cases in the UAE is accurate to within 0.35%. The results attained by the proposed VAR model are significantly better than the results in the existing literature (Table 3). The success of the proposed approach lies in the minimal assumptions that underlie the VAR model together with the high quality of the time series data. Given the performance of the proposed model, it can similarly be applied to other countries and help health officials combat the pandemic.

Table 4: 10-day-ahead forecast accuracy for Saudi Arabia and Kuwait

             Saudi Arabia                       Kuwait
             RMSE   MAE    MAPE    MAPE_T      RMSE    MAE     MAPE     MAPE_T
New cases    8.59   7.41   2.03%   0.002%      64.49   49.78    3.75%   0.024%
New deaths   0.10   0.07   1.29%   0.001%       0.85    0.74   10.55%   0.063%

Conclusion

In this paper, we investigated the use of vector autoregression to forecast the spread of COVID-19 infection. In particular, we applied the VAR model to jointly forecast the number of new cases and deaths in the UAE, Saudi Arabia, and Kuwait. The results show a high level of accuracy of the proposed model. Indeed, the MAPE of the model is substantially lower than that of existing models in the literature. The success of the proposed approach shows that it can also be used for forecasting in other countries.

Despite the high accuracy of the model, there remains room for improvement. Although taking the first difference stabilized the mean of the series around zero, additional transformations can be applied to stabilize the variance. The model may also benefit from being expanded to include quantities such as the percentage of the population tested, the number of recoveries, and others. Alternative VAR models such as LASSO VAR and VARCH may also be investigated in future research. The proposed approach has great potential to be a valuable tool in managing the government response to COVID-19.

This work was supported in part by the Najran University under Grant NU/-/SERC/10/597.

Contributor Information

Firuz Kamalov, Email: firuz@cud.ac.ae

Aswani Kumar Cherukuri, Email: cherukuri@acm.org

An Introduction to Structural Vector Autoregression (SVAR)

Vector autoregressive (VAR) models constitute a rather general approach to modelling multivariate time series. A critical drawback of those models in their standard form is their inability to describe contemporaneous relationships between the analysed variables. This becomes a central issue in the impulse response analysis of such models, where it is important to know the contemporaneous effects of a shock to the economy. Usually, researchers address this by using orthogonal impulse responses, where the correlation between the errors is obtained from the (lower) Cholesky decomposition of the error covariance matrix. This requires them to arrange the variables of the model in a suitable order. An alternative to this approach is to use so-called structural vector autoregressive (SVAR) models, where the relationship between contemporaneous variables is modelled more directly. This post provides an introduction to the concept of SVAR models and how they can be estimated in R.
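The Cholesky-based orthogonalisation mentioned above can be sketched in a few lines. The post's own examples use R, so this numpy sketch is an assumption about tooling, and the covariance matrix is illustrative.

```python
# The (lower) Cholesky decomposition used for orthogonal impulse responses,
# sketched with numpy. The error covariance matrix Sigma is illustrative.
import numpy as np

Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])   # symmetric error covariance (assumed)

P = np.linalg.cholesky(Sigma)    # lower-triangular factor with Sigma = P P'
print(np.allclose(P @ P.T, Sigma))  # True

# Orthogonalised shocks e_t = P^{-1} u_t then have identity covariance,
# which is why the ordering of variables (rows of P) matters.
```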

What does the term “structural” mean?

To understand what a structural VAR model is, let’s repeat the main characteristics of a standard reduced form VAR model:

\[y_{t} = A_{1} y_{t-1} + u_t \ \ \text{with} \ \ u_{t} \sim (0, \Sigma),\] where \(y_{t}\) is a \(k \times 1\) vector of \(k\) variables in period \(t\), \(A_1\) is a \(k \times k\) coefficient matrix and \(u_t\) is a \(k \times 1\) vector of errors, which have a multivariate normal distribution with zero mean and a \(k \times k\) variance-covariance matrix \(\Sigma\).

To understand SVAR models, it is important to look more closely at the variance-covariance matrix \(\Sigma\). It contains the variances of the endogenous variables on its diagonal elements and the covariances of the errors on the off-diagonal elements. The covariances contain information about the contemporaneous effects each variable has on the others. The covariance matrix of a standard VAR model is symmetric, i.e. the elements to the top-right of the diagonal (the “upper triangle”) mirror the elements to the bottom-left of the diagonal (the “lower triangle”). This reflects the idea that the relations between the endogenous variables only reflect correlations and do not allow us to make statements about causal relationships.

Contemporaneous causality or, more precisely, the structural relationships between the variables is analysed in the context of SVAR models, which impose special restrictions on the covariance matrix and – depending on the model – on other coefficient matrices as well. The drawback of this approach is that it depends on the more or less subjective assumptions made by the researcher. For many researchers this is too much subjective information, even if sound economic theory is used to justify the restrictions. However, SVAR models can be useful tools, and that is why it is worth looking into them.

Lütkepohl (2007) mentions four approaches to model structural relationships between the endogenous variables of a VAR model: the A-model, the B-model, the AB-model and long-run restrictions à la Blanchard and Quah (1989).

The A-model

The A-model assumes that the covariance matrix is diagonal - i.e. it only contains the variances of the error terms - and that the contemporaneous relationships between the observable variables are described by an additional matrix \(A\) so that

\[ A y_{t} = A^*_{1} y_{t-1} + ... + A^*_{p} y_{t-p} + \epsilon_t,\] where \(A^*_j = A A_j\) and \(\epsilon_{t} = A u_t \sim (0, \Sigma_\epsilon = A \Sigma_u A^{\prime})\) .

Matrix \(A\) has the special form:

\[ A = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ a_{21} & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ a_{K1} & a_{K2} & \cdots & 1 \\ \end{bmatrix} \]

Besides the normalisation that is achieved by setting the diagonal elements of \(A\) to one, the matrix must contain \(K(K - 1) / 2\) further restrictions, which are needed to obtain unique estimates of the structural coefficients. If fewer restrictions were provided, the model would not be identified - to express it in a simple way. In the above example, the upper triangular elements of the matrix are set to zero and the elements below the diagonal are freely estimated. However, it would also be possible to estimate a coefficient in the upper triangular area if a value in the lower triangular area were set to zero. Furthermore, it would be possible to set more than \(K(K - 1) / 2\) elements of \(A\) to zero. In this case the model is said to be over-identified .

The B-model

The B-model describes the structural relationships of the errors directly by adding a matrix \(B\) to the error term and normalises the error variances to unity so that

\[ y_{t} = A_{1} y_{t-1} + ... + A_{p} y_{t-p} + B \epsilon_t,\] where \(u_{t} = B \epsilon_t\) and \(\epsilon_t \sim (0, I_K)\). Again, \(B\) must contain at least \(K(K - 1) / 2\) restrictions.

The AB-model

The AB-model is a mixture of the A- and B-model, where the errors of the VAR are modelled as

\[ A u_t = B \epsilon_t \text{ with } \epsilon_t \sim (0, I_K).\]

This general form requires the specification of more restrictions than the A- or B-model. Thus, one of the matrices is usually replaced by an identity matrix and the other contains the necessary restrictions. This model can be useful for estimating, for example, a system of equations that describes the IS-LM model.

Long-run restrictions à la Blanchard-Quah

Blanchard and Quah (1989) propose an approach which does not require directly imposing restrictions on the structural matrices \(A\) or \(B\). Instead, structural innovations are identified by looking at the accumulated effects of shocks and placing zero restrictions on those accumulated relationships that die out and become zero in the long run.

For this illustration we generate an artificial data set with three endogenous variables, which follow the data generating process

\[y_t = A_1 y_{t - 1} + B \epsilon_t,\]

\[ A_1 = \begin{bmatrix} 0.3 & 0.12 & 0.69 \\ 0 & 0.3 & 0.48 \\ 0.24 & 0.24 & 0.3 \end{bmatrix} \text{, } B = \begin{bmatrix} 1 & 0 & 0 \\ -0.14 & 1 & 0 \\ -0.06 & 0.39 & 1 \end{bmatrix} \text{ and } \epsilon_t \sim N(0, I_3). \]
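The same data generating process can also be simulated outside R; this numpy sketch is an assumption about tooling (the post itself uses the R vars package) and draws 500 observations from the process above.

```python
# Simulating the artificial data generating process above with numpy.
# Tooling assumption: the original post simulates this in R instead.
import numpy as np

rng = np.random.default_rng(42)
A1 = np.array([[0.30, 0.12, 0.69],
               [0.00, 0.30, 0.48],
               [0.24, 0.24, 0.30]])
B = np.array([[ 1.00, 0.00, 0.00],
              [-0.14, 1.00, 0.00],
              [-0.06, 0.39, 1.00]])

T = 500
y = np.zeros((T, 3))
for t in range(1, T):
    y[t] = A1 @ y[t - 1] + B @ rng.standard_normal(3)

print(y.shape)  # (500, 3)
```

Because the eigenvalues of \(A_1\) lie inside the unit circle, the simulated process is stationary and suitable for estimation.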


The vars package (Pfaff, 2008) provides functions to estimate structural VARs in R. The workflow is divided into two steps, where the first consists in estimating a standard VAR model using the VAR function:

The estimated coefficients are reasonably close to the true coefficients. In the next step the resulting object is used in function SVAR to estimate the various structural models.

The A-model requires the specification of a matrix Amat , which contains the \(K (K - 1) / 2\) restrictions. In the following example, we create a matrix with ones on the diagonal and zeros in its upper triangle. The lower triangular elements are set to NA , which indicates that they should be estimated.

The result is not equal to matrix \(B\) , because we estimated an A-model. In order to translate it into the structural coefficients of the B-model, we only have to obtain the inverse of the matrix:
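The translation from A-model to implied B-model coefficients is just a matrix inversion, since \(u_t = A^{-1} \epsilon_t\). In this numpy sketch (a tooling assumption, since the post does this step in R), the numeric entries of \(A\) are hypothetical stand-ins for estimated values.

```python
# Translating A-model coefficients into the implied B-model matrix by
# inverting A, as described above. The entries of A here are hypothetical
# stand-ins for estimated structural coefficients.
import numpy as np

A = np.array([[ 1.00, 0.00, 0.00],
              [-0.14, 1.00, 0.00],
              [-0.06, 0.39, 1.00]])

B_implied = np.linalg.inv(A)   # since u_t = A^{-1} eps_t
print(np.allclose(A @ B_implied, np.eye(3)))  # True
```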

Confidence intervals for the structural coefficients can be obtained by directly accessing the respective element in svar_est_a :

B-models are estimated in a similar way as A-models by specifying a matrix Bmat , which contains restrictions on the structural matrix \(B\) . In the following example \(B\) is equal to Amat above.

Again, confidence intervals of the structural coefficients can be obtained by directly accessing the respective element in svar_est_b :

Blanchard, O., & Quah, D. (1989). The dynamic effects of aggregate demand and supply disturbances. American Economic Review 79 , 655-673.

Lütkepohl, H. (2007). New Introduction to Multiple Time Series Analysis (2nd ed.). Berlin: Springer.

Bernhard Pfaff (2008). VAR, SVAR and SVEC Models: Implementation Within R Package vars. Journal of Statistical Software 27 (4).

Sims, C. (1980). Macroeconomics and Reality. Econometrica, 48 (1), 1-48.

Computer Science > Computer Vision and Pattern Recognition

Title: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, improving Fréchet inception distance (FID) from 18.65 to 1.73 and inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.
Comments: Demo website: https://var.vision/
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)


Time Series Analysis Handbook

Chapter 3: Vector Autoregressive Methods ¶

Prepared by: Maria Eloisa Ventura

Previously, we introduced the classical approaches to forecasting single/univariate time series, such as the ARIMA (autoregressive integrated moving-average) model and the simple linear regression model. We learned that stationarity is a necessary condition when using ARIMA, while it need not be imposed when using the linear regression model. In this notebook, we extend the forecasting problem to a more generalized framework where we deal with multivariate time series – time series with more than one time-dependent variable. More specifically, we introduce vector autoregressive (VAR) models and show how they can be used in forecasting multivariate time series.

Multivariate Time Series Model ¶

As shown in the previous chapters, one of the main advantages of using simple univariate methods (e.g., ARIMA) is the ability to forecast future values of a variable using only its own past values. However, we know that most, if not all, of the variables that we observe actually depend on other variables. Most of the time, the information that we gather is limited by our capacity to measure the variables of interest. For example, if we want to study the weather in a particular city, we could measure temperature, humidity, and precipitation over time. But if we only have a thermometer, then we will only be able to collect data for temperature, effectively reducing our dataset to a univariate time series.

Now, in cases where we have multiple time series (longitudinal measurements of more than one variable), we can actually use multivariate time series models to understand the relationships of the different variables over time. By utilizing the additional information available from related series, these models can often provide better forecasts than the univariate time series models.

In this section, we cover the foundational information needed to understand Vector Autoregressive Models, a class of multivariate time series models, by using a framework similar to univariate time series laid out in the previous chapters, and extending it to the multivariate case.

Definition: Univariate vs Multivariate Time Series ¶

Time series can either be univariate or multivariate. A univariate time series consists of single observations recorded sequentially over equal time increments. When dealing with a univariate time series model (e.g., ARIMA), we usually refer to a model that uses lagged values of the series itself as the independent variables.

On the other hand, a multivariate time series has more than one time-dependent variable. For a multivariate process, several related time series are observed simultaneously over time. As an extension of the univariate case, the multivariate time series model involves two or more input variables, and leverages the interrelationship among the different time series variables.

Example 1: Multivariate Time Series ¶

A. Air Quality Data from UCI ¶

The dataset contains hourly averaged measurements obtained from an Air Quality Chemical Multisensor Device located in a significantly polluted area of an Italian city. The dataset can be downloaded here .

CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
Date_Time
2004-03-10 18:00:00 2.6 1360.00 150.0 11.881723 1045.50 166.0 1056.25 113.0 1692.00 1267.50 13.6 48.875001 0.757754
2004-03-10 19:00:00 2.0 1292.25 112.0 9.397165 954.75 103.0 1173.75 92.0 1558.75 972.25 13.3 47.700000 0.725487

[Figure: time series plots of the air quality variables]

B. Global Health from The World Bank ¶

This dataset combines key health statistics from a variety of sources to provide a look at global health and population trends. It includes information on nutrition, reproductive health, education, immunization, and diseases from over 200 countries. The dataset can be downloaded here .

indicator_code Capital_health_expenditure Current_health_expenditure Domestic_general_government_health_expenditure
2000-12-31 0.013654 3.154818 1.400685
2001-12-31 0.012675 2.947059 1.196554
2002-12-31 0.018476 2.733301 1.012481

[Figure: time series plots of the health expenditure indicators]

C. US Treasury Rates ¶

This sample dataset contains weekly data of US Treasury rates from January 1982 to December 2016 ( https://essentialoftimeseries.com/data/ ). The dataset can be downloaded here .

1-month 3-month 6-month 1-year 2-year 3-year 5-year 7-year 10-year Excess CRSP Mkt Returns 10-year Treasury Returns Term spread Change in term spread 5-year Treasury Returns Unnamed: 15 Excess 10-year Treasury Returns Term Spread VXO Delta VXO
Date
1982-01-08 10.296 12.08 13.36 13.80 14.12 14.32 14.46 14.54 14.47 -1.632 NaN 2.39 NaN NaN NaN -0.286662 1.729559 20.461911 -0.003106
1982-01-15 10.296 12.72 13.89 14.39 14.67 14.73 14.79 14.84 14.76 -2.212 -2.9 2.04 -0.35 1.65 -2.556 -3.758000 4.464000 NaN NaN
1982-01-22 10.296 13.47 14.30 14.72 14.93 14.92 14.81 14.80 14.73 -0.202 0.3 1.26 -0.78 0.10 0.049 -0.558000 4.434000 NaN NaN

[Figure: time series plots of the US Treasury rates data]

D. Jena Climate Data ¶

Time series dataset recorded at the Weather Station at the Max Planck Institute for Biogeochemistry in Jena, Germany, from 2009 to 2016. It contains 14 different quantities (e.g., air temperature, atmospheric pressure, humidity, wind direction, and so on), recorded every 10 minutes. You can download the data here .

[Figure: time series plots of the Jena climate variables]

Foundations ¶

Before we discuss VARs, we outline some fundamental concepts below that we will need to understand the model.

Weak Stationarity of Multivariate Time Series ¶

As in the univariate case, one of the requirements that we need to satisfy before we can apply VAR models is stationarity – in particular, weak stationarity. In both the univariate and multivariate cases, this means the first two moments of the time series are time-invariant. More formally, we describe weak stationarity below.

Consider a \(N\) -dimensional time series, \(\mathbf{y}_t = \left[y_{1,t}, y_{2,t}, ..., y_{N,t}\right]^\prime\) . This is said to be weakly stationary if the first two moments are finite and constant through time, that is,

\(E\left[\mathbf{y}_t\right] = \boldsymbol{\mu}\)

\(E\left[(\mathbf{y}_t-\boldsymbol{\mu})(\mathbf{y}_t-\boldsymbol{\mu})^\prime\right] \equiv \boldsymbol\Gamma_0 < \infty\) for all \(t\)

\(E\left[(\mathbf{y}_t-\boldsymbol{\mu})(\mathbf{y}_{t-h}-\boldsymbol{\mu})^\prime\right] \equiv \boldsymbol\Gamma_h\) for all \(t\) and \(h\)

where the expectations are taken element-by-element over the joint distribution of \(\mathbf{y}_t\) :

\(\boldsymbol{\mu}\) is the vector of means \(\boldsymbol\mu = \left[\mu_1, \mu_2, ..., \mu_N \right]\)

\(\boldsymbol\Gamma_0\) is the \(N\times N\) covariance matrix where the \(i\) th diagonal element is the variance of \(y_{i,t}\) , and the \((i, j)\) th element is the covariance between \(y_{i,t}\) and \({y_{j,t}}\)

\(\boldsymbol\Gamma_h\) is the cross-covariance matrix at lag \(h\)

Obtaining Cross-Correlation Matrix from Cross-Covariance Matrix ¶

When dealing with a multivariate time series, we can examine the predictability of one variable on another by looking at the relationship between them using the cross-covariance function (CCVF) and cross-correlation function (CCF). To do this, we begin by defining the cross-covariance between two variables, then we estimate the cross-correlation between one variable and another variable that is time-shifted. This informs us whether one time series may be related to the past lags of the other. In other words, CCF is used for identifying lags of one variable that might be useful as a predictor of the other variable.

Let \(\mathbf\Gamma_0\) be the covariance matrix at lag 0, and \(\mathbf D\) be a \(N\times N\) diagonal matrix containing the standard deviations of \(y_{i,t}\) for \(i=1, ..., N\) . The correlation matrix of \(\mathbf{y}_t\) is defined as

\[\boldsymbol\rho_0 = \mathbf D^{-1} \boldsymbol\Gamma_0 \mathbf D^{-1},\]

where the \((i, j)\) th element of \(\boldsymbol\rho_0\) is the correlation coefficient between \(y_{i,t}\) and \(y_{j,t}\) at time \(t\) :

\[\rho_{i,j}(0) = \frac{Cov\left(y_{i,t}, y_{j,t}\right)}{\sigma_i \sigma_j}.\]

Let \(\boldsymbol\Gamma_h = E\left[(\mathbf{y}_t-\boldsymbol{\mu})(\mathbf{y}_{t-h}-\boldsymbol{\mu})^\prime\right]\) be the lag- \(h\) cross-covariance matrix of \(\mathbf y_{t}\) . The lag- \(h\) cross-correlation matrix is defined as

\[\boldsymbol\rho_h = \mathbf D^{-1} \boldsymbol\Gamma_h \mathbf D^{-1}.\]

The \((i,j)\) th element of \(\boldsymbol\rho_h\) is the correlation coefficient between \(y_{i,t}\) and \(y_{j,t-h}\) :

\[\rho_{i,j}(h) = \frac{Cov\left(y_{i,t}, y_{j,t-h}\right)}{\sigma_i \sigma_j}.\]

What do we get from this? The table below summarizes how the cross-correlations are interpreted.

Correlation coefficient | Interpretation
--- | ---
\(\rho_{i,j}(0)\neq0\) | \(y_{i,t}\) and \(y_{j,t}\) are contemporaneously (linearly) correlated
\(\rho_{i,j}(h)=\rho_{j,i}(h)=0\) for all \(h\geq0\) | \(y_{i,t}\) and \(y_{j,t}\) share no (linear) relationship
\(\rho_{i,j}(h)=0\) and \(\rho_{j,i}(h)=0\) for all \(h>0\) | \(y_{i,t}\) and \(y_{j,t}\) are said to be linearly uncoupled (any linear dependence is only contemporaneous)
\(\rho_{i,j}(h)=0\) for all \(h>0\), but \(\rho_{j,i}(q)\neq0\) for at least some \(q>0\) | There is a unidirectional (linear) relationship between \(y_{i,t}\) and \(y_{j,t}\), where \(y_{i,t}\) does not depend on \(y_{j,t}\), but \(y_{j,t}\) depends on (some) lagged values of \(y_{i,t}\)
\(\rho_{i,j}(h)\neq0\) for at least some \(h>0\) and \(\rho_{j,i}(q)\neq0\) for at least some \(q>0\) | There is a feedback (linear) relationship between \(y_{i,t}\) and \(y_{j,t}\)
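A minimal numpy sketch of the sample cross-correlation matrix defined above (the function name is ours). In the toy data, the second series depends on the lag-1 value of the first series, so \(\rho_{2,1}(1)\) is clearly nonzero while \(\rho_{1,2}(1)\) is near zero:

```python
import numpy as np

def cross_corr(y, h):
    """Sample lag-h cross-correlation matrix rho_h = D^-1 Gamma_h D^-1."""
    y = np.asarray(y, dtype=float)
    T, _ = y.shape
    yc = y - y.mean(axis=0)
    gamma_h = yc[h:].T @ yc[:T - h] / T      # (i, j) ~ Cov(y_{i,t}, y_{j,t-h})
    d_inv = np.diag(1.0 / yc.std(axis=0))    # D^-1, reciprocal std devs on diagonal
    return d_inv @ gamma_h @ d_inv

rng = np.random.default_rng(0)
y = rng.standard_normal((5000, 2))
y[1:, 1] += 0.8 * y[:-1, 0]     # series 2 depends on lag-1 of series 1

rho1 = cross_corr(y, 1)         # rho1[1, 0] is clearly nonzero, rho1[0, 1] is not
```

This is the kind of lead-lag signature the CCF plots later in the chapter look for.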

Vector Autoregressive (VAR) Models ¶

The vector autoregressive (VAR) model is one of the most successful models for analysis of multivariate time series. It has been demonstrated to be successful in describing relationships and forecasting economic and financial time series, and providing more accurate forecasts than the univariate time series models and theory-based simultaneous equations models.

The structure of the VAR model is just the generalization of the univariate AR model. It is a system regression model that treats all the variables as endogenous, and allows each of the variables to depend on \(p\) lagged values of itself and of all the other variables in the system.

A VAR model of order \(p\) can be represented as

\[\mathbf y_t = \mathbf a_0 + \mathbf A_1 \mathbf y_{t-1} + \mathbf A_2 \mathbf y_{t-2} + \cdots + \mathbf A_p \mathbf y_{t-p} + \mathbf u_t,\]

where \(\mathbf y_t\) is a \(N\times 1\) vector containing \(N\) endogenous variables, \(\mathbf a_0\) is a \(N\times 1\) vector of constants, \(\mathbf A_1, \mathbf A_2, ..., \mathbf A_p\) are the \(p\) \(N\times N\) matrices of autoregressive coefficients, and \(\mathbf u_t\) is a \(N\times 1\) vector of white noise disturbances.

VAR(1) Model ¶

To illustrate, let’s consider the simplest VAR model, where \(p=1\) :

\[\mathbf y_t = \mathbf a_0 + \mathbf A_1 \mathbf y_{t-1} + \mathbf u_t.\]

Structural and Reduced Forms of the VAR model ¶

Consider the following bivariate system

where both \(y_{1,t}\) and \(y_{2,t}\) are assumed to be stationary, and \(\varepsilon_{1,t}\) and \(\varepsilon_{2,t}\) are the uncorrelated error terms with standard deviation \(\sigma_{1,t}\) and \(\sigma_{2,t}\) , respectively.

\[\mathbf B \mathbf y_t = \mathbf Q_0 + \mathbf Q_1 \mathbf y_{t-1} + \boldsymbol\varepsilon_t, \qquad \mathbf B = \begin{bmatrix} 1 & b_{1,2} \\ b_{2,1} & 1 \end{bmatrix}, \quad \mathbf y_t = \begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix}, \quad \boldsymbol\varepsilon_t = \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{bmatrix}\]

Structural VAR (VAR in primitive form) ¶

Described by equation above

Captures contemporaneous feedback effects ( \(b_{1,2}, b_{2,1}\) )

Not very practical to use

Contemporaneous terms cannot be used in forecasting

Needs further manipulation to make it more useful (e.g. multiplying the matrix equation by \(\mathbf B^{-1}\) )

Multiplying the matrix equation by \(\mathbf B^{-1}\) , we get

\[\mathbf y_t = \mathbf a_0 + \mathbf A_1 \mathbf y_{t-1} + \mathbf u_t,\]

where \(\mathbf a_0 = \mathbf B^{-1}\mathbf Q_0\) , \(\mathbf A_1 = \mathbf B^{-1}\mathbf Q_1\) , \(L\) is the lag/backshift operator, and \(\mathbf u_t = \mathbf B^{-1}\boldsymbol\varepsilon_t\) ; equivalently,

\[\left(\mathbf I_N - \mathbf A_1 L\right) \mathbf y_t = \mathbf a_0 + \mathbf u_t.\]

Reduced-form VAR (VAR in standard form) ¶

Only dependent on lagged endogenous variables (no contemporaneous feedback)

Can be estimated using ordinary least squares (OLS)
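Because the reduced form contains only lagged regressors, each equation can be estimated by ordinary least squares. A minimal equation-by-equation OLS sketch for a reduced-form VAR(1) (numpy only; the helper name and toy process are ours):

```python
import numpy as np

def fit_var1_ols(y):
    """Equation-by-equation OLS for y_t = a0 + A1 y_{t-1} + u_t."""
    Y = y[1:]                                           # targets y_t
    X = np.hstack([np.ones((len(y) - 1, 1)), y[:-1]])   # constant + lag-1 values
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta[0], beta[1:].T                          # a0 (N,), A1 (N, N)

# Simulate a bivariate VAR(1) and recover its coefficients
rng = np.random.default_rng(1)
a0 = np.array([1.0, 2.0])
A1 = np.array([[0.5, 0.1],
               [0.0, 0.4]])
y = np.zeros((4200, 2))
for t in range(1, 4200):
    y[t] = a0 + A1 @ y[t - 1] + rng.standard_normal(2)
y = y[200:]                                             # drop burn-in

a0_hat, A1_hat = fit_var1_ols(y)
```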

VMA infinite representation and Stationarity ¶

Consider the reduced-form, standard VAR(1) model

\[\mathbf y_t = \mathbf a_0 + \mathbf A_1 \mathbf y_{t-1} + \mathbf u_t.\]

Assuming that the process is weakly stationary and taking the expectation of \(\mathbf y_t\) , we have

\[E\left[\mathbf y_t\right] = \left(\mathbf I_N - \mathbf A_1\right)^{-1} \mathbf a_0 \equiv \boldsymbol\mu,\]

where \(E\left[\mathbf u_t\right]=0.\) If we let \(\tilde{\mathbf y}_{t}\equiv \mathbf y_t - \boldsymbol \mu\) be the mean-corrected time series, we can write the model as

\[\tilde{\mathbf y}_t = \mathbf A_1 \tilde{\mathbf y}_{t-1} + \mathbf u_t.\]

Substituting \(\tilde{\mathbf y}_{t-1} = \mathbf A_1 \tilde{\mathbf y}_{t-2} + \mathbf u_{t-1}\) ,

\[\tilde{\mathbf y}_t = \mathbf A_1^2 \tilde{\mathbf y}_{t-2} + \mathbf A_1 \mathbf u_{t-1} + \mathbf u_t.\]

If we keep iterating, we get

\[\tilde{\mathbf y}_t = \sum_{i=0}^{\infty} \mathbf A_1^{i} \mathbf u_{t-i}.\]

Letting \(\boldsymbol\Theta_i\equiv \mathbf A_1^i\) , we get the VMA infinite representation

\[\tilde{\mathbf y}_t = \sum_{i=0}^{\infty} \boldsymbol\Theta_i \mathbf u_{t-i}.\]

Stationarity of the VAR(1) model ¶

All the \(N\) eigenvalues of the matrix \(A_1\) must be less than 1 in modulus, so that the coefficient matrices \(A_1^j\) neither explode nor converge to a nonzero matrix as \(j\) goes to infinity.

Provided that the covariance matrix of \(u_t\) exists, the requirement that all the eigenvalues of \(A_1\) are less than one in modulus is a necessary and sufficient condition for \(y_t\) to be stable, and thus, stationary.

All roots of \(det\left(\mathbf I_N - \mathbf A_1 z\right)=0\) must lie outside the unit circle.
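The eigenvalue condition is easy to verify numerically; a quick sketch (the coefficient values are illustrative):

```python
import numpy as np

A1 = np.array([[0.5, 0.3],
               [0.2, 0.4]])
moduli = np.abs(np.linalg.eigvals(A1))
stable = moduli.max() < 1    # eigenvalues are 0.7 and 0.2, so this VAR(1) is stable
```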

VAR(p) Model ¶

Consider the VAR(p) model described by

\[\mathbf y_t = \mathbf a_0 + \mathbf A_1 \mathbf y_{t-1} + \cdots + \mathbf A_p \mathbf y_{t-p} + \mathbf u_t.\]

Using the lag operator \(L\) , we get

\[\tilde{\mathbf A}(L)\, \mathbf y_t = \mathbf a_0 + \mathbf u_t,\]

where \(\tilde{\mathbf A} (L) = (\mathbf I_N - A_1 L - ... - A_p L^p)\) . Assuming that \(\mathbf y_t\) is weakly stationary, we obtain that

\[\boldsymbol\mu = E\left[\mathbf y_t\right] = \left(\mathbf I_N - \mathbf A_1 - \cdots - \mathbf A_p\right)^{-1} \mathbf a_0.\]

Defining \(\tilde{\mathbf y}_t=\mathbf y_t -\boldsymbol\mu\) , we have

\[\tilde{\mathbf A}(L)\, \tilde{\mathbf y}_t = \mathbf u_t.\]

Properties ¶

\(Cov[\mathbf y_t, \mathbf u_t] = \Sigma_u\) , the covariance matrix of \(\mathbf u_t\)

\(Cov[\mathbf y_{t-h}, \mathbf u_t] = \mathbf 0\) for any \(h>0\)

\(\boldsymbol\Gamma_h = \mathbf A_1 \boldsymbol\Gamma_{h-1} +...+\mathbf A_p \boldsymbol\Gamma_{h-p}\) for \(h>0\)

\(\boldsymbol\rho_h = \boldsymbol \Psi_1 \boldsymbol\rho_{h-1} +...+\boldsymbol \Psi_p \boldsymbol\rho_{h-p}\) for \(h>0\) , where \(\boldsymbol \Psi_i = \mathbf D^{-1}\mathbf A_i \mathbf D\)

Stationarity of VAR(p) model ¶

All roots of \(det\left(\mathbf I_N - \mathbf A_1 z - ...- \mathbf A_p z^p\right)=0\) must lie outside the unit circle.
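A practical way to check this condition is through the companion form: stack the VAR(p) into an equivalent VAR(1) and inspect the eigenvalues of the companion matrix, which must all lie inside the unit circle (equivalent to the root condition above). A minimal numpy sketch (the helper name is ours):

```python
import numpy as np

def is_stable(coef_mats):
    """Stack A_1..A_p into the companion matrix and check its eigenvalues."""
    A = [np.asarray(a, dtype=float) for a in coef_mats]
    N, p = A[0].shape[0], len(A)
    top = np.hstack(A)                        # [A_1 A_2 ... A_p]
    if p == 1:
        companion = top
    else:
        shift = np.eye(N * (p - 1), N * p)    # identity block that shifts lags down
        companion = np.vstack([top, shift])
    return bool(np.max(np.abs(np.linalg.eigvals(companion))) < 1)

I2 = np.eye(2)
is_stable([0.5 * I2, 0.4 * I2])   # True: largest companion eigenvalue is about 0.93
is_stable([0.5 * I2, 0.6 * I2])   # False: one eigenvalue has modulus above 1
```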

Specification of a VAR model: Choosing the order p ¶

Fitting a VAR model involves the selection of a single parameter: the model order or lag length \(p\) . The most common approach to selecting the best model is choosing the \(p\) that minimizes one or more information criteria evaluated over a range of model orders. Each criterion consists of two terms: the first characterizes the entropy rate or prediction error, and the second characterizes the number of freely estimated parameters in the model. Minimizing both terms allows us to identify a model that accurately fits the data while preventing overfitting (due to too many parameters).

Some of the commonly used information criteria include: Akaike Information Criterion (AIC), Schwarz’s Bayesian Information Criterion (BIC), Hannan-Quinn Criterion (HQ), and Akaike’s Final Prediction Error Criterion (FPE). The equations for each are shown below.

Akaike’s information criterion

\[MAIC(p) = \ln \left| \tilde{\boldsymbol\Sigma}_u \right| + \frac{2k}{T}\]

Schwarz’s Bayesian information criterion

\[MBIC(p) = \ln \left| \tilde{\boldsymbol\Sigma}_u \right| + \frac{k \ln T}{T}\]

Hannan-Quinn’s information criterion

\[MHQ(p) = \ln \left| \tilde{\boldsymbol\Sigma}_u \right| + \frac{2 k \ln (\ln T)}{T}\]

Final prediction error

\[MFPE(p) = \left| \tilde{\boldsymbol\Sigma}_u \right| \left( \frac{T + Np + 1}{T - Np - 1} \right)^{N}\]

where \(M\) stands for multivariate, \(\tilde{\boldsymbol\Sigma}_u\) is the estimated covariance matrix of the residuals, \(T\) is the number of observations in the sample, and \(k\) is the total number of freely estimated parameters in the VAR( \(p\) ) (i.e. \(N^2p + N\) , where \(N\) is the number of equations and \(p\) is the number of lags).

Among the metrics above, AIC and BIC are the most widely used criteria, with BIC penalizing higher orders more. For moderate and large sample sizes, AIC and FPE are essentially equivalent, but FPE may outperform AIC for small sample sizes. HQ also penalizes higher-order models more heavily than AIC, but less than BIC. However, under small-sample conditions, AIC/FPE may outperform BIC and/or HQ in selecting the true model order.

There are cases when AIC and FPE show no clear minimum over a range of model orders or lag lengths. In such cases, there may be a clear “elbow” when we plot the values against model order, which may indicate a suitable model order.

Building a VAR model ¶

In this section, we show how we can use the VAR model to forecast the air quality data. The following steps are shown below:

Check for stationarity.

Split data into train and test sets.

Select the VAR order p that gives the lowest information criteria.

Fit VAR model of order p on the train set.

Generate forecast.

Evaluate model performance.

Example 2: Forecasting Air Quality Data (CO, NO2 and RH) ¶

For illustration, we consider the carbon monoxide, nitrogen dioxide and relative humidity time series from the Air Quality dataset, starting 1 October 2004.

[Figure: CO, NO2 and RH time series]

Quick inspection before we proceed with modeling…

To find out whether the multivariate approach is better than treating the signals separately as univariate time series, we examine the relationship between the variables using CCF. The sample below shows the CCF for the last 100 data points of the Air quality data for CO, NO2 and RH.

[Figure: cross-correlation functions between CO, NO2 and RH]

Observation/s :

As shown in the plot above, we can see that there’s a relationship between:

CO and some lagged values of RH and NO2

NO2 and some lagged values of RH and CO

RH and some lagged values of CO and NO2

This shows that we can benefit from the multivariate approach, so we proceed with building the VAR model.

1. Check stationarity ¶

To check for stationarity, we use the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test and the Augmented Dickey-Fuller (ADF) test. For the data to be suitable for VAR modelling, each of the variables in the multivariate time series needs to be stationary. In both tests, the test statistic must be less than the critical values for a variable to be considered stationary.

Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test ¶

Recall: Null hypothesis is that an observable time series is stationary around a deterministic trend (i.e. trend-stationary) against the alternative of a unit root.

CO(GT) NO2(GT) RH
Test statistic 0.0702 0.3239 0.1149
p-value 0.1000 0.0100 0.1000
Critical value - 1% 0.2160 0.2160 0.2160
Critical value - 2.5% 0.1760 0.1760 0.1760
Critical value - 5% 0.1460 0.1460 0.1460
Critical value - 10% 0.1190 0.1190 0.1190

From the KPSS test, CO and RH are stationary.

Augmented Dickey-Fuller (ADF) test ¶

Recall: Null hypothesis is that a unit root is present in a time series sample against the alternative that the time series is stationary.

CO(GT) NO2(GT) RH
Test statistic -7.0195 -6.7695 -6.8484
p-value 0.0000 0.0000 0.0000
Critical value - 1% -3.4318 -3.4318 -3.4318
Critical value - 5% -2.8622 -2.8622 -2.8622
Critical value - 10% -2.5671 -2.5671 -2.5671

From the ADF test, CO, NO2 and RH are stationary.

2. Split data into train and test sets ¶

We use the dataset from 01 October 2004 onward, and predict the last 24 points (24 hrs / 1 day) in the dataset.

[Figure: train-test split of the CO, NO2 and RH series]

3. Select order p ¶

We compute the different multivariate information criteria (AIC, BIC, HQIC), and FPE. We pick the set of order parameters that correspond to the lowest values.

[Figure: information criteria (AIC, BIC, HQIC, FPE) versus lag order]

We find BIC and HQIC to be lowest at \(p=26\) , and we also observe an elbow in the plots for AIC, and FPE, so we choose the number of lags to be 26.

4. Fit VAR model with chosen order ¶

5. Generate forecast ¶

[Figure: VAR forecasts of CO, NO2 and RH versus actual test values]

Performance Evaluation: Comparison with ARIMA model ¶

When using ARIMA, we treat each variable as a univariate time series and perform the forecasting separately for each variable: one model for CO, one for NO2, and one for RH.

CO(GT)-ARIMA NO2(GT)-ARIMA RH-ARIMA
Date_Time
2005-04-03 15:00:00 0.999865 87.002935 14.502893
2005-04-03 16:00:00 0.999729 87.005870 16.743239
2005-04-03 17:00:00 0.999594 87.008806 19.253414
2005-04-03 18:00:00 0.999458 87.011741 21.715709
2005-04-03 19:00:00 0.999323 87.014676 23.974483

[Figure: ARIMA forecasts of CO, NO2 and RH versus actual test values]

Performance Metrics

CO(GT)-VAR NO2(GT)-VAR RH-VAR CO(GT)-ARIMA NO2(GT)-ARIMA RH-ARIMA
MAE 0.685005 31.687612 9.748883 1.057097 59.283376 16.239017
MSE 1.180781 1227.046604 111.996306 2.163021 4404.904459 333.622140
MAPE 43.508500 29.811066 35.565987 56.960617 46.190671 51.199023

MAE: VAR forecasts have lower errors than ARIMA forecasts for CO and NO2, but not for RH.

MSE: VAR forecasts have lower errors for all variables (CO, NO2 and RH).

MAPE: VAR forecasts have lower errors for all variables (CO, NO2 and RH).

Training time is significantly reduced when using VAR compared to ARIMA (<0.1 s run time for VAR versus ~20 s for ARIMA).

Structural VAR Analysis ¶

In addition to forecasting, VAR models are also used for structural inference and policy analysis. In macroeconomics, this structural analysis has been extensively employed to investigate the transmission mechanisms of macroeconomic shocks (e.g., monetary shocks, financial shocks) and to test economic theories. Particular assumptions are imposed about the causal structure of the dataset, and the resulting causal impacts of unexpected shocks (also called innovations or perturbations) to a specific variable on the different variables in the model are summarized. In this section, we cover two of the common methods for summarizing these causal impacts: (1) impulse response functions, and (2) forecast error variance decompositions.

Impulse Response Function (IRF) ¶

Coefficients of VAR models are often difficult to interpret, so practitioners instead estimate impulse response functions.

IRFs trace out the time path of the effects of an exogenous shock to one (or more) of the endogenous variables on some or all of the other variables in a VAR system.

IRF traces out the response of the dependent variable of the VAR system to shocks (also called innovations or impulses) in the error terms.

IRF in the VAR system for Air Quality ¶

Let \(y_{1,t}\) , \(y_{2,t}\) and \(y_{3,t}\) be the time series corresponding to CO signal, NO2 signal, and RH signal, respectively. Consider the moving average representation of the system shown below:

\[\begin{bmatrix} y_{1,t} \\ y_{2,t} \\ y_{3,t} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} + \sum_{i=0}^{\infty} \boldsymbol\Theta_i \begin{bmatrix} u_{1,t-i} \\ u_{2,t-i} \\ u_{3,t-i} \end{bmatrix}\]

Suppose \(u_1\) in the first equation increases by a value of one standard deviation.

This shock will change \(y_1\) in the current as well as the future periods.

This shock will also have an impact on \(y_2\) and \(y_3\) .

Suppose \(u_2\) in the second equation increases by a value of one standard deviation.

This shock will change \(y_2\) in the current as well as the future periods.

This shock will also have an impact on \(y_1\) and \(y_3\) .

Suppose \(u_3\) in the third equation increases by a value of one standard deviation.

This shock will change \(y_3\) in the current as well as the future periods.

This shock will also have an impact on \(y_1\) and \(y_2\) .

[Figure: impulse response functions for the CO, NO2 and RH system]

Observation/s:

Effects of an exogenous perturbation/shock (1 SD) of a variable on itself:

CO \(\rightarrow\) CO: A shock to CO has a large effect on CO in the early hours, but this decays over time.

NO2 \(\rightarrow\) NO2: A shock to NO2 has a large effect on NO2 in the early hours, but this decays over time.

RH \(\rightarrow\) RH: A shock to RH has its largest effect on RH after 1 hour, and this effect decays over time.

Effects of an exogenous perturbation/shock of a variable on another:

CO \(\rightarrow\) NO2: A shock to CO has its largest effect on NO2 after 1 hour, and this effect decays over time.

CO \(\rightarrow\) RH: A shock to CO has an immediate effect on the value of RH. However, the effect decreases immediately after an hour, and the value seems to stay at around 0.2.

NO2 \(\rightarrow\) CO: A shock to NO2 causes only a small effect on the values of CO. There seems to be a delayed effect, peaking after 3 hours, but the magnitude is still small.

NO2 \(\rightarrow\) RH: A shock to NO2 causes a small (negative) effect on the values of RH. The magnitude seems to decline further after 6 hours, and the IRF reaches zero in about 7 hours.

RH \(\rightarrow\) CO: A shock to RH causes only a small effect on the values of CO.

RH \(\rightarrow\) NO2: A shock to RH has its largest effect on NO2 after 1 hour, and this effect decays over time. The IRF reaches zero after 6 hours.

Forecast Error Variance Decomposition (FEVD) ¶

FEVD indicates the amount of information each variable contributes to the other variables in the autoregression.

While impulse response functions trace the effects of a shock to one endogenous variable on to the other variables in the VAR, variance decomposition separates the variation in an endogenous variable into the component shocks to the VAR.

It determines how much of the forecast error variance of each of the variables can be explained by exogenous shocks to the other variables.

[Figure: forecast error variance decompositions for CO, NO2 and RH]

For CO, the variance is mostly explained by exogenous shocks to CO. This decreases over time but only by a small amount.

For NO2, the variance is mostly explained by exogenous shocks to NO2 and CO.

For RH, the variance is mostly explained by exogenous shocks to RH. Over time, the contribution of the exogenous shocks to CO increases.

Example 3: Forecasting the Jena climate data ¶

We try to forecast the Jena climate data using the method outlined above. We will train the VAR model using hourly weather measurements from January 1, 2009 (00:10) up to December 29, 2014 (18:10). The performance of the model will be evaluated on the test set, which contains data from December 29, 2014 (19:10) to December 31, 2016 (23:20), equivalent to 17,523 data points for each of the variables.

Load dataset ¶

Check stationarity of each variable using ADF test ¶

p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%) VPmax (mbar) VPact (mbar) VPdef (mbar) sh (g/kg) H2OC (mmol/mol) rho (g/m**3) wv (m/s) max. wv (m/s) wd (deg)
Test statistic -15.5867 -7.9586 -8.3354 -8.5750 -17.7069 -9.1945 -9.0103 -13.5492 -9.0827 -9.0709 -9.3980 -24.2424 -24.3052 -19.8394
p-value 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Critical value - 1% -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305 -3.4305
Critical value - 5% -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616 -2.8616
Critical value - 10% -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668 -2.5668

From the values above, all the components of the Jena climate data are stationary, so we’ll use all the variables in our VAR model.

Select order p ¶

[Figure: information criteria versus lag order for the Jena climate data]

The model order that results in the minimum value varies across the information criteria, showing no clear global minimum. We see an elbow at \(p=5\) , but for AIC and HQIC we observe another elbow/local minimum at \(p=26\) . So, we choose \(p=26\) as our lag length.

Train VAR model using the training and validation data ¶

Forecast 24-hour weather measurements and evaluate performance on test set ¶

[Figure: VAR(26) forecasts on the Jena test set]

Evaluate forecasts

p (mbar)-VAR T (degC)-VAR Tpot (K)-VAR Tdew (degC)-VAR rh (%)-VAR VPmax (mbar)-VAR VPact (mbar)-VAR VPdef (mbar)-VAR sh (g/kg)-VAR H2OC (mmol/mol)-VAR rho (g/m**3)-VAR wv (m/s)-VAR max. wv (m/s)-VAR wd (deg)-VAR
MAE 2.681284 2.542025 2.603187 1.797102 10.899944 2.526237 1.171625 2.379343 0.744665 1.186948 12.164486 3.136091 4.414089 66.420597
MSE 238.321753 652.733627 618.458262 102.257346 9581.665059 540.666240 39.837241 490.874048 15.612470 39.678736 9986.126326 17352.964659 23404.318476 87581.698748

Observation/s : The VAR(26) model outperformed the naive (MAE= 3.18), seasonal naive (MAE= 2.61) and ARIMA (MAE= 3.19) models.

Forecast 24 hours beyond test set ¶

[Figure: 24-hour forecasts beyond the test set]

VAR methods are useful when dealing with multivariate time series, as they allow us to use the relationships between the different variables to forecast.

These models allow us to forecast the different variables simultaneously, with the added benefit of easy (only 1 hyperparameter) and fast training.

Using the fitted VAR model, we can also explain the relationship between variables, and how the perturbation in one variable affects the others by getting the impulse response functions and the variance decomposition of the forecasts.

However, the application of these models is limited due to the stationarity requirement for ALL the variables in the multivariate time series. This method won’t work well if there is at least one variable that’s non-stationary. When dealing with non-stationary multivariate time series, one can explore the use of vector error correction models (VECM).

Preview to the Next Chapter ¶

In the next chapter, we further extend the use of VAR models to explain the relationships between variables in a multivariate time series using Granger causality , which is one of the most common ways to describe causal mechanisms in time series data.

References ¶

Main references

Lütkepohl, H. (2005). New introduction to multiple time series analysis. Berlin: Springer.

Kilian, L., & Lütkepohl, H. (2018). Structural vector autoregressive analysis. Cambridge: Cambridge University Press.

Supplementary references are listed here.


Evaluating variable selection methods for multivariable regression models: A simulation study protocol

Theresa Ullmann, Georg Heinze, Lorena Hafermann, Christine Schilhart-Wallisch, Daniela Dunkler, for TG2 of the STRATOS initiative ¶

Affiliations: Institute of Clinical Biometrics, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria; Institute of Biometry and Clinical Epidemiology, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; Austrian Agency for Health and Food Safety (AGES), Vienna, Austria

¶ Membership list can be found in the Acknowledgments section.


  • Published: August 9, 2024
  • https://doi.org/10.1371/journal.pone.0308543


Researchers often perform data-driven variable selection when modeling the associations between an outcome and multiple independent variables in regression analysis. Variable selection may improve the interpretability, parsimony and/or predictive accuracy of a model. Yet variable selection can also have negative consequences, such as false exclusion of important variables or inclusion of noise variables, biased estimation of regression coefficients, underestimated standard errors and invalid confidence intervals, as well as model instability. While the potential advantages and disadvantages of variable selection have been discussed in the literature for decades, few large-scale simulation studies have neutrally compared data-driven variable selection methods with respect to their consequences for the resulting models. We present the protocol for a simulation study that will evaluate different variable selection methods: forward selection, stepwise forward selection, backward elimination, augmented backward elimination, univariable selection, univariable selection followed by backward elimination, and penalized likelihood approaches (Lasso, relaxed Lasso, adaptive Lasso). These methods will be compared with respect to false inclusion and/or exclusion of variables, consequences on bias and variance of the estimated regression coefficients, the validity of the confidence intervals for the coefficients, the accuracy of the estimated variable importance ranking, and the predictive performance of the selected models. We consider both linear and logistic regression in a low-dimensional setting (20 independent variables with 10 true predictors and 10 noise variables). The simulation will be based on real-world data from the National Health and Nutrition Examination Survey (NHANES). Publishing this study protocol ahead of performing the simulation increases transparency and allows integrating the perspective of other experts into the study design.

Citation: Ullmann T, Heinze G, Hafermann L, Schilhart-Wallisch C, Dunkler D, for TG2 of the STRATOS initiative (2024) Evaluating variable selection methods for multivariable regression models: A simulation study protocol. PLoS ONE 19(8): e0308543. https://doi.org/10.1371/journal.pone.0308543

Editor: Suyan Tian, The First Hospital of Jilin University, CHINA

Received: February 7, 2024; Accepted: July 25, 2024; Published: August 9, 2024

Copyright: © 2024 Ullmann et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: This manuscript is a protocol of a simulation study. We intend to share the software code after the study has been conducted and published. This will allow recreating our data and reproducing our simulation study.

Funding: This research was funded in part by the Austrian Science Fund (FWF, https://www.fwf.ac.at/en/ ) [I-4739-B] (for T.U. and C.W.) and by the German Research Foundation (DFG, https://www.dfg.de/en ) [RA 2347/8-1] (for L. H.). For open access purposes, the author has applied a CC BY public copyright license to any author accepted manuscript version arising from this submission. The funders did not and will not have any role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Data-driven variable selection is frequently performed when modeling the associations between an outcome and multiple independent variables (sometimes also referred to as explanatory variables, covariates or predictors). Variable selection may help to generate parsimonious and interpretable models, and may also yield models with increased predictive accuracy. Despite these potential advantages, data-driven variable selection can also have unintended negative consequences that many researchers are not fully aware of. Variable selection induces additional uncertainty in the estimation process and may cause biased estimation of regression coefficients, model instability (i.e., models that are not robust with respect to small perturbations of the data set), and issues with post-selection inference such as underestimated standard errors and invalid confidence intervals [ 1 – 5 ].

A recent review [ 1 ] provided guidance about variable selection and gave an overview of possible consequences of variable selection. However, there are few systematic simulation studies that compare different variable selection methods with respect to their consequences for the resulting models (for some exceptions, see [ 6 – 10 ]). While many articles proposing new variable selection methods include a comparison with existing methods (based on simulated or real data), these comparisons are typically somewhat limited, often comparing the new method to only one to three competitors, even though there are many more existing methods. Moreover, these articles are inherently biased towards demonstrating superiority of the new methods. In particular, such studies cannot be considered as neutral . A neutral comparison study is a study whose authors do not have a vested interest in one of the competing methods, and are (as a group) approximately equally familiar with all considered methods [ 11 , 12 ]. More neutral comparison studies about existing variable selection methods are needed to better understand their properties, a viewpoint that aligns with the goals of the STRATOS initiative (STRengthening Analytical Thinking for Observational Studies [ 13 ]). The STRATOS initiative is an international consortium of biostatistical experts, and aims to provide guidance in the design and analysis of observational studies for specialist and non-specialist audiences. This perspective motivates our comprehensive simulation study.

We will focus on descriptive modeling (i.e., describing the relationship between the outcome and the independent variables in a parsimonious manner) and predictive modeling (i.e., predicting the outcome as accurately as possible) [ 14 ]. Our setting is multivariable regression analysis with one outcome variable. The outcome is either continuous (linear regression) or binary (logistic regression). We simulate data in a low-dimensional scenario (20 variables consisting of 10 true predictors and 10 noise variables). Different variable selection methods with multiple parameter settings are compared: forward selection, stepwise forward selection, backward elimination, augmented backward elimination [ 15 ], univariable selection, univariable selection followed by backward elimination, the Lasso [ 16 ], the relaxed Lasso [ 9 , 17 ], and the adaptive Lasso [ 18 ]. We compare the performances of these methods with respect to false inclusion and/or exclusion of variables, consequences on bias and variance of the estimated regression coefficients, the validity of the confidence intervals for the coefficients, the accuracy of the estimated variable importance ranking, and finally the predictive performance of the selected models.

Using simulated instead of real data allows us to a) know the true data generating process and b) systematically vary several data characteristics [ 19 , 20 ]. For example, we will include varying sample sizes and R 2 , as the consequences of variable selection depend on these parameters. To ensure that the simulation results are practically relevant, we use real data as the starting point for our simulation. The distributions and correlation structure of the variables are based on data from the National Health and Nutrition Examination Survey (NHANES) [ 21 ]. The choice of variables and true regression coefficients is inspired by an applied study about predicting the difference between ambulatory/home and clinic blood pressure readings [ 22 ]. Our simulated data thus mimics real cardiovascular data.

Our focus is on low-dimensional data, which is reflected in our simulation setting with twenty independent variables. Data of this type frequently appears in medicine and other application fields, and researchers often apply variable selection in this context. For example, a systematic review of models for COVID-19 prognosis [ 23 , 24 ] identified 236 newly developed regression models for prediction. Data-driven variable selection was applied (and reported) for 196 models. In 165 models both the number of candidate predictors (i.e., the predictors considered at the start of data-driven selection) and the number of predictors in the final model were reported; the median numbers were 28 (range 4–130), and 6 (range 1–38), respectively. This demonstrates that low- to medium-dimensional data played an important role in COVID-19 prediction research. Of course, data-driven variable selection is also relevant for high-dimensional data. Comparing variable selection methods for high-dimensional data would require a different study design and is not the purpose of this planned simulation study.

As mentioned above, neutrality is an important goal when conducting systematic comparison studies. “Perfect” neutrality may be the ultimate goal, but this ideal can be difficult to achieve in practice. While we aim to be as neutral as possible, we disclose (for the purpose of full transparency) that one of the methods for variable selection included in our comparison, namely augmented backward elimination, was originally proposed by two authors of the present study protocol [ 15 ]. Our goal was to not let this fact influence our choice of study design, though unconscious biases can never be fully excluded. Striving for as much neutrality as possible motivated us to publish this study protocol. This will allow us to integrate the comments of reviewers before performing the simulation. For the design of our study, results from previous smaller simulation studies and pilot studies were taken into account [ 1 ]; however, the study outlined in this protocol has not yet been run and analyzed. Preregistration of study protocols for simulation studies/methodological studies is still very rare (for an exception, see [ 25 ]). However, this practice could offer similar advantages to those discussed for preregistration in applied research, such as increased transparency and prevention of “hindsight bias” [ 26 ]. Potential advantages of preregistering protocols for simulation studies, but also possible limitations and challenges, are discussed more extensively elsewhere [ 27 ].

A specific goal of our simulation study is to evaluate previously published recommendations about variable selection [ 1 ], which we discuss in Section 2. We then describe our simulation design in Section 3, explain the planned code review in Section 4, and conclude the protocol with some final remarks in Section 5.

2 Previous variable selection recommendations

Varied viewpoints exist in the literature as to whether researchers should apply data-driven variable selection, and, if so, which methods and parameters are deemed preferable. Some authors generally caution against data-driven variable selection, stressing potential negative consequences [ 5 ]. Other authors put more focus on potential advantages of variable selection and are more optimistic about using selection methods, at least if the sample size is large enough and if selection is accompanied by a stability analysis [ 28 ]. In a review conducted by three co-authors of the present study protocol, Heinze et al. [ 1 ] summarized different perspectives from the literature. Drawing upon existing recommendations, but also taking their own experience and a small simulation study into account, they derived recommendations for the usage of variable selection methods. These recommendations consider both benefits and drawbacks of variable selection, thereby reconciling different viewpoints on the matter. The recommendations depend on the “events-per-variable” (EPV) in the data. The EPV is the ratio between sample size (in linear regression) or the number of the less frequent outcome (in logistic regression) and the number of independent variables. Data-driven variable selection is applied on a carefully designed “global” model which includes all independent variables relevant for the research question. The denominator of EPV refers to the number of design variables (including possible dummy variables and other constructed variables) in this global model. The following bullet points list the recommendations, and how we plan to evaluate them.

  • EPV > 25: While variable selection may generally work well for a large EPV value, the selection of independent variables with small effect size can still be unstable. If backward elimination is used, a stringent threshold of α = 0.05 or selection with the BIC may lead to a more accurate selection of variables than milder thresholds. In our study : We will check whether selection rates of variables with small standardized regression coefficients (e.g., ±0.25) are notably different from either 0 or 1 (which indicates instability). For backward elimination, we will evaluate whether the selection of variables is more accurate when using the threshold α = 0.05 or the BIC (which corresponds to even stricter thresholds for our considered sample sizes [ 1 ]), compared to using larger α values.
  • 10 < EPV ≤ 25: In general, the selection of variables might be unstable with such an EPV. When variables with unclear effect size are selected, their effects might be over-estimated. Penalized estimation (Lasso) or postestimation shrinkage is thus recommended. If backward elimination is used, a threshold corresponding to selection with the AIC (approximately α = 0.157) is recommended, but not smaller α values. In our study : Again, we will evaluate stability by checking whether selection rates of variables, particularly those with small standardized regression coefficients, are notably different from either 0 or 1. We will also calculate the conditional bias (i.e., bias conditioned on selection) of the variables and analyze whether variables with small standardized regression coefficients have large conditional bias away from zero. For backward elimination, we will evaluate to which extent a threshold of α = 0.157 (or an even milder threshold of α = 0.5) selects the true predictors more frequently than smaller thresholds (i.e., a fixed threshold of α = 0.05 or selection with the BIC) [ 3 ].
  • EPV ≤ 10: Data-driven variable selection is generally not recommended. In our study : We will analyze whether variable selection has negative consequences with respect to the different performance criteria.

The results of variable selection are not only influenced by EPV, but also by other aspects such as the R 2 of the model. We will thus consider different R 2 values in our simulation study. The recommendations above do not take R 2 into account, as the R 2 of the model is typically not known prior to the data analysis.

3 Simulation design

Morris et al. [ 19 ] proposed to describe the following components when reporting a simulation study: the aims of the study (A), the data-generating mechanisms (D), the estimands (i.e., the population quantities which are estimated) and other targets of interest (E), the methods to be compared (M), and the performance measures used for evaluating the methods (P). The ADEMP components of our study are briefly summarized in Tables 1 and 2 . We now describe the components in more detail.

Table 1: https://doi.org/10.1371/journal.pone.0308543.t001

Table 2: https://doi.org/10.1371/journal.pone.0308543.t002

3.1 Aims (A)

We aim to compare different variable selection methods for multivariable linear or logistic regression, with respect to their consequences for the resulting models. We consider consequences on bias and variance of the estimated regression coefficients, validity of confidence intervals for the coefficients, false inclusion or exclusion of variables, and predictive performance. We analyze the behavior of variable selection methods…

  • … depending on sample size/EPV, with particular focus on evaluating the recommendations of Heinze et al. [ 1 ],
  • … depending on the R 2 of the population model,
  • … depending on the modeling goal (description or prediction),
  • … when functional forms are misspecified (i.e., when fitting models assuming linear functional forms of continuous predictors even though the true functional forms are nonlinear),
  • … when switching from our realistic scenario that mimics cardiovascular data to simplified scenarios (i.e., all variables are normally distributed and/or uncorrelated).

3.2 Data-generating mechanisms (D)

3.2.1 Simulation of independent variables (predictors and noise variables).

We simulate 20 independent variables: 10 true predictors (from now on just called “predictors”) and 10 noise variables. The correlation structure and distributions are based on real-world data from the 2013–2014 and 2015–2016 cycles of the National Health and Nutrition Examination Survey (NHANES) [ 21 ]. To choose suitable variables in the NHANES data, we drew inspiration from a regression model reported by Sheppard et al. [ 22 ] for predicting the difference between diastolic blood pressure readings as measured ambulatory/at home versus in the clinic. The variables are described in detail in S1 Appendix . The correlation matrix Σ for the simulation is based on the empirical correlation matrix of the variables. For better interpretability, we set correlations below 0.15 to zero and round all values to the closest multiple of 0.05 (see S1 Fig and S1 Table for the resulting correlation matrix).

To obtain distributions from the NHANES data, we fit Bernoulli distributions for the binary variables, and normal distributions, log-normal distributions, or approximations of the empirical cumulative distribution function (CDF) for the continuous variables. For each continuous variable, we truncate its distribution with the minimum of the variable in the NHANES data as the lower bound and the maximum as the upper bound. The resulting distributions are as follows (see also Fig 1 ):

  • predictors: X 1 (log-normal), X 2 (continuous with approximated CDF), X 3 (log-normal), X 4 (binary, p = 0.50), X 5 (normal), X 6 (binary, p = 0.29), X 7 (log-normal), X 8 (log-normal), X 9 (normal), X 10 (binary, p = 0.11)
  • noise variables: X 11 (log-normal), X 12 (normal), X 13 (log-normal), X 14 (binary, p = 0.61), X 15 (normal), X 16 (binary, p = 0.20), X 17 (log-normal), X 18 (normal), X 19 (normal), X 20 (binary, p = 0.20)

Fig 1. Predictors are ordered by absolute values of standardized regression coefficients. Histograms are based on a large simulated dataset (n = 100,000).

https://doi.org/10.1371/journal.pone.0308543.g001

The distributions, together with the correlation matrix Σ, are then used as input for the normal-to-anything (NORTA) method for simulation [ 29 , 30 ].
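A minimal NORTA sketch in Python (the study's own implementation is not reproduced here): correlated standard normals are pushed through the normal CDF to uniforms, then through the target inverse CDFs. The distribution parameters below are illustrative, and this naive version does not calibrate Σ, so output correlations only approximate the input matrix:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])          # illustrative 2x2 correlation matrix

# Step 1: correlated standard normals.
z = rng.multivariate_normal(np.zeros(2), sigma, size=10_000)
# Step 2: map to uniforms via the standard normal CDF.
u = stats.norm.cdf(z)
# Step 3: map uniforms through the target inverse CDFs.
x1 = stats.lognorm.ppf(u[:, 0], s=0.5)       # a log-normal predictor
x2 = stats.bernoulli.ppf(u[:, 1], p=0.29)    # a binary predictor, p = 0.29
```

The full NORTA method additionally adjusts the input correlation matrix so that the outputs hit the target correlations exactly [ 29 , 30 ].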

3.2.2 Choice of regression coefficients.

[Equation image not reproduced: regression coefficients chosen for the predictors.]

This choice reflects a mixture of stronger and weaker effects, a situation typical for many applications in biology and medicine. We would expect different behaviors of the predictors during variable selection depending on their effects.


The regression coefficients for the noise variables X 11 , …, X 20 are set to zero.


3.2.3 Simulation of outcome Y .

The outcome Y is simulated as follows:

[Equation image not reproduced.]

3.2.4 Nonlinear functional forms.

So far, we assumed that the functional forms of the effects of continuous predictors on Y are linear. In applied studies in biology and medicine, the actual functional forms of such variables might often be nonlinear, but researchers nonetheless fit a model with linear functional forms, e.g., because they are not aware that some functional forms might be nonlinear, or because they prefer a simpler model. To analyze the behavior of variable selection methods in this scenario, we include settings 1b-7b (corresponding to settings 1–7) where all predictors have nonlinear functional forms. The models that we consider for analysing the simulated data (linear/logistic regression) will not take the nonlinear functional forms into account and will thus be misspecified.

For each continuous predictor X j , we define a function g j ( x ) that describes the nonlinear functional form of the effect of the predictor on Y . We choose various functional forms: quadratic, log-quadratic, exponential and sigmoid. The functions are depicted in S3 Fig ; exact definitions are given in S1 Appendix .

[Equation image not reproduced.]

After determining β ( g ) , the outcome Y is simulated as previously described in Section 3.2.3, with xβ replaced by the nonlinear composite predictor.

Table 3: https://doi.org/10.1371/journal.pone.0308543.t003

For the global model in the settings with nonlinear effects, we will not only calculate the usual standard errors of the regression coefficients, but also robust standard errors [ 31 ], to check whether robust SEs improve the coverage of the confidence intervals. If robust SEs improve the coverage for the global model, it would be interesting to analyze whether this is also the case for models obtained by variable selection; however, combining robust standard errors with variable selection requires some further work and would go beyond the scope of the proposed study. For now, we will restrict the investigation of robust SEs to the global model for linear regression.

3.2.5 Simplified settings.

While our main focus is on simulating variables of various distribution types (e.g., Bernoulli, normal, and log-normal) and with correlation matrix Σ based on the empirical correlation matrix from the NHANES data ( S1 Table ), we are also interested in the behavior of the variable selection methods for data with simpler distribution-correlation structures. We thus consider the three following simplified scenarios:

  • [Scenario description rendered as an image in the original; not reproduced.]

  • The variables have the same individual distributions as described in Section 3.2.1 ( Fig 1 ), but are not correlated.

  • [Scenario description rendered as an image in the original; not reproduced.]

Depending on the results for settings 1b-2b and 4b-7b with nonlinear effects, we might additionally consider nonlinear effects for the simplified scenario 3 (variables not multivariate normal and not correlated).

3.2.6 Sample sizes.

For linear regression, we consider eight different sample sizes: 100, 200, 400, 500, 800, 1600, 3200, and 6400. These sample sizes result when doubling sample size six times from 100. Additionally, the sample size 500 is included because it corresponds to EPV = 25, and this EPV value was specifically mentioned in the recommendations of Heinze et al. [ 1 ].

[Equation image not reproduced.]

Because this procedure is unstable for small event rates, we do not use the alignment based on standard errors for event rate 0.05. Instead, we choose sample sizes corresponding to the EPV values in linear regression.

The resulting sample sizes are displayed in Table 4 . The numbers below the sample sizes indicate the corresponding EPV values. For event rate 0.05, we will first include sample sizes only up to 10,000 (EPV = 25) to save computation time. We expect the variable selection methods to behave similarly for both event rates (0.3 and 0.05). If we observe different behaviors for event rate 0.05, we will include the additional sample sizes.

Table 4: https://doi.org/10.1371/journal.pone.0308543.t004

In S1 Appendix , we additionally report expected shrinkage factors for each setting, based on sample size and R 2 [ 32 , 33 ].

3.3 Estimands and other targets (E)

As estimands, we consider the true regression coefficients of the data generating models. As further targets, we are interested in model selection (e.g., whether the true model is selected) and predictive performance of the selected models.

For the settings with linear functional forms, the regression coefficient estimands are the coefficients β (respectively c β for logistic regression) as described in Sections 3.2.2 and 3.2.3. For the settings with nonlinear effects, we cannot take the coefficients β ( g ) as defined in 3.2.4 as estimands, because our linear/logistic regression models will not take nonlinear functional forms into account and will thus be misspecified.

[Equation image not reproduced.]

3.4 Methods (M)

3.4.1 Overview of variable selection methods.

We include the following methods:

  • Forward selection with AIC: starting from the model containing only the intercept, variables are iteratively added to the model based on their capability to decrease the AIC when included.
  • Stepwise forward selection with AIC (i.e., forward selection with backward elimination steps): like simple forward selection, this method starts from the intercept model and adds variables based on the AIC. However, in each step, re-exclusion of already selected variables is allowed, based on the capability to decrease the AIC when removed.
  • Backward elimination with α = 0.05, with BIC, with AIC, and with α = 0.5: starting from the global model, variables are iteratively removed, either based on their capability to decrease the BIC/AIC when removed, or based on the p -values of their coefficients. We do not consider a stepwise variant of backward elimination with forward selection steps, following the recommendations of Royston and Sauerbrei [ 28 , p. 32] who argue that allowing re-inclusion of removed variables in backward elimination is rarely relevant, while allowing re-exclusion of included variables may cause a notable difference for forward selection.
  • Augmented backward elimination (ABE) with AIC [ 15 ]: backward elimination is combined with the change-in-estimate criterion [ 34 , 35 ]. A variable that would be removed in backward elimination based on AIC may stay in the model if its removal would induce a large change in the estimated regression coefficients of the other variables that are currently in the model. As threshold for the standardized change-in-estimate, we choose τ = 0.05. We will use the R package abe [ 36 ].
  • Univariable selection with α = 0.05 and α = 0.20: a variable is selected if its regression coefficient in a univariable model is significant at level α . While many authors have advised against using univariable selection [ 5 , 37 , 38 ], the method is still often used in practice, which is why we include it in our simulation study.
  • Univariable selection with α = 0.20, followed by backward elimination with α = 0.05: frequently, researchers use this combination instead of using only univariable selection or only backward elimination [ 39 , 40 ]. However, the warnings against univariable selection still apply to the combination method.
  • Lasso [ 16 ]: a penalty on the coefficients is added to the OLS criterion (linear regression) or the negative log-likelihood (logistic regression), causing shrinkage of the coefficients toward zero and setting some of them to exactly zero.
  • Relaxed Lasso [ 9 , 17 ]: variables are selected with the Lasso, but the shrinkage of the coefficients of the selected variables is relaxed by refitting the model with the selected variables without penalty.
  • Adaptive Lasso [ 18 ]: first, the global linear/logistic model is fit, then a Lasso with variable-specific weights for the penalty is estimated. The estimates from the first step serve to get the variable-specific weights for the second step: the weights are calculated such that a variable with larger regression coefficient in the first step is penalized less than a variable with smaller regression coefficient. For all variants of the Lasso, we will use the R package glmnet [ 41 ]. The complexity parameter λ will be tuned with 10-fold cross-validation (CV). As performance criterion for the prediction on test sets during CV, we use the mean squared error for linear regression and deviance for logistic regression. For the relaxed Lasso, we additionally consider tuning λ with the BIC.

We also consider the global model with all variables.
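The protocol will use R's glmnet for the Lasso variants; as an illustration only, here is a Python/scikit-learn sketch of the three variants on synthetic data. The simple relaxed-Lasso variant (OLS refit on the selected set) and the column-rescaling trick for the adaptive Lasso are standard constructions, but the data and all parameters below are assumptions:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

# Toy data: 3 true predictors and 17 noise variables.
rng = np.random.default_rng(7)
n, p = 400, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.8, 0.5]
y = X @ beta + rng.normal(size=n)

# Lasso with the penalty parameter tuned by 10-fold cross-validation.
lasso = LassoCV(cv=10).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Relaxed Lasso (simplest variant): refit OLS on the selected variables only,
# undoing the shrinkage of the nonzero coefficients.
relaxed = LinearRegression().fit(X[:, selected], y)

# Adaptive Lasso: penalize each variable by 1/|initial estimate|, implemented
# by rescaling columns before a standard Lasso fit and unscaling afterwards.
ols = LinearRegression().fit(X, y)
w = 1.0 / np.abs(ols.coef_)
adaptive = LassoCV(cv=10).fit(X / w, y)
adaptive_coef = adaptive.coef_ / w
```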

3.4.2 Firth correction in logistic regression.

In the models for logistic regression, separation may occur (i.e., perfect separation of events and non-events by a linear combination of covariates), particularly for small to medium sample sizes and low event rates [ 42 ]. In this case, at least one parameter estimate is infinite. While separation can be detected by linear programming [ 43 ], we found that in practice, a simple and robust check can be performed by inspecting the model standard errors of the regression coefficients. If at least one standard error is extremely large, this indicates separation. A possible solution to the problem of separation is to apply the Firth correction to obtain finite parameter estimates [ 42 , 44 ].

In the simulation settings for logistic regression, we check for each individual simulated dataset whether separation occurs. In the case of separation, we apply the Firth correction (with the FLIC intercept correction [ 45 ] to obtain unbiased predictions), otherwise we use the standard logistic regression. When Firth correction is applied, confidence intervals for the regression coefficients are calculated based on the profile penalized likelihood, otherwise based on the profile likelihood.

We describe our procedure to check for separation based on the model standard errors of the coefficients in S1 Appendix .
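As a rough illustration of the idea (not the protocol's exact rule, which is given in S1 Appendix), a standard-error-based separation check might look like the following; the threshold is purely illustrative, and a near-unpenalized scikit-learn fit stands in for maximum likelihood:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def separation_suspected(X, y, se_threshold=5.0):
    """Flag possible separation by checking whether any model-based
    standard error of the logistic regression coefficients explodes.
    Under separation, fitted probabilities approach 0/1, the Fisher
    weights vanish, and the standard errors diverge."""
    # Very large C ~= (almost) no regularization.
    clf = LogisticRegression(C=1e10, max_iter=10000).fit(X, y)
    Z = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
    beta = np.concatenate([clf.intercept_, clf.coef_.ravel()])
    p = 1.0 / (1.0 + np.exp(-Z @ beta))
    W = p * (1.0 - p)                              # Fisher weights
    info = Z.T @ (Z * W[:, None])                  # information matrix
    se = np.sqrt(np.diag(np.linalg.pinv(info)))
    return bool(np.max(se) > se_threshold)
```

On a perfectly separated sample the largest standard error is typically orders of magnitude above any plausible threshold, which is what makes this simple check robust in practice.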

3.5 Performance measures (P)

We organize the performance measures into three categories, based on which estimands/targets they pertain to. Formulas for all performance measures are given in S1 Appendix .


Performance measures for model selection as target include the selection rate of the true model consisting exactly of the ten predictors, the selection rate of an “over-selection” model which we define as a model including all predictors as well as at least one noise variable (previously called an “inflated” model [ 15 ]), and the selection rate of any “under-selection” model defined as a model not containing all predictors but possibly including noise variables (previously called a “biased” model [ 15 ]).
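The three mutually exclusive outcomes can be expressed as a small classification rule, and each selection rate is then the mean of the corresponding label over the simulation repetitions. A sketch (the function and variable names are ours):

```python
def classify_selection(selected, true_predictors):
    """Classify one simulation run's selected variable set relative to
    the set of true predictors: 'true' = exactly the true predictors,
    'over' = all true predictors plus at least one noise variable,
    'under' = at least one true predictor missing (noise variables may
    or may not be present)."""
    selected, true_predictors = set(selected), set(true_predictors)
    if not true_predictors <= selected:
        return "under"
    return "true" if selected == true_predictors else "over"
```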


The performance measures for the regression coefficients and for model selection are primarily relevant for descriptive models, while performance measures for predictive performance are mainly relevant for prediction models. However, a descriptive model may also be suitable for prediction; therefore, performance measures for prediction could also be relevant for descriptive modeling. Vice versa, in prediction models, aspects such as interpretability, fairness etc. often play an important role; researchers might thus consider performance measures such as bias of coefficients also for prediction models.

3.6 Monte Carlo errors and number of simulation runs

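For a performance measure that is a rate (such as the selection rates above), the Monte Carlo standard error over n_sim independent repetitions takes a simple binomial form; this is a generic sketch, using the study's planned n_sim = 2000, not necessarily the protocol's exact procedure:

```python
import math

def mc_standard_error(rate, n_sim):
    """Monte Carlo standard error of an estimated probability (e.g., a
    selection rate) based on n_sim independent simulation repetitions."""
    return math.sqrt(rate * (1.0 - rate) / n_sim)

# Worst case (rate = 0.5) for n_sim = 2000 repetitions:
print(round(mc_standard_error(0.5, 2000), 4))  # 0.0112
```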

4 Code review

To ensure reproducibility, as well as readability, the code will be checked by another researcher (a “code reviewer”) who works at the same institute as the first, second and last author of this protocol, but was not involved in planning the study. After writing the code, the first author (T.U.) will hand over the code to the code reviewer, together with instructions for running the code as well as some partial results (using less than the full n sim = 2000 repetitions). The code reviewer will then check the plausibility of the partial results and provide feedback on the simulation code, focusing on a) data generation, b) the implementation of the compared models, and c) the implementation of the performance measures applied to these models. Once T.U. and the code reviewer have agreed upon a final version of the code, T.U. will re-run the partial results, and the code reviewer will check the complete computational reproducibility by re-running the code on another machine. This check for reproducibility is done on the partial results as the generation of the final results is expected to require large amounts of computational resources. Once the reproducibility check has successfully concluded, T.U. will perform the full n sim = 2000 repetitions to generate the final results.

5 Final remarks

Our simulation study will enable researchers to better understand the consequences of variable selection, and will clarify differences in the performance of different selection methods depending on the considered scenarios. To make the results of the study more accessible and interpretable, we plan to display all results in an interactive web app (Shiny app) that will be published alongside the main paper. We will also make our code available on a Git repository, and will specify random seeds to ensure reproducibility of the results.

The performance measures for our study (Section 3.5) are defined as expected values and probabilities. Their estimation by simulation thus always involves taking the mean over (a part of) the simulation repetitions. However, if one only calculates the mean over the repetitions, one might miss relevant properties of the distribution of the values over the simulation repetitions. We will thus use distribution plots and correlation analyses to evaluate the simulation results in more detail [ 19 ]. Moreover, we will analyze how many variables were selected by each variable selection method. We did not include model size as a performance measure in Section 3.5 because there is no clear target value and smaller/larger values are not automatically better/worse (a smaller model size is preferable in some applications, but might be less relevant in others). A specific focus on model size (e.g., comparing different variable selection methods under constraints w.r.t. the number of chosen variables) would require a different study design.


In future work, it would be interesting to consider various extensions of our simulation. For example, while we focus on linear and logistic regression in the present protocol, data-driven variable selection is also often used in the context of survival analysis. We plan to conduct a further simulation study comparing different data-driven variable selection methods for Cox regression and the accelerated failure time model.

In the present study, we include several settings where all predictors have true nonlinear functional forms, but we nevertheless fit all models with linear functional forms; this mimics the frequent misspecification of models in practice. Generally, when fitting a regression model with linear effects, it is advisable to check for misspecification by analyzing the residuals. If misspecification is only mild, then a model with linear effects might still be justifiable. If misspecification is too severe, functional form selection can be performed to account for nonlinear effects, e.g., with spline-based approaches. In future work, our study could be extended by considering the combination of variable selection and functional form selection, which is a complex issue [ 39 ].

We focus on low-dimensional data in our study. Future studies could compare variable selection methods for high-dimensional data. Finally, our study considers variable selection in a frequentist framework. Future simulation studies could also evaluate Bayesian methods for variable selection.

Supporting information

S1 Fig. Correlation network graph.

https://doi.org/10.1371/journal.pone.0308543.s001

S2 Fig. Absolute standardized regression coefficients plotted against coefficients of determination for each independent variable.

https://doi.org/10.1371/journal.pone.0308543.s002

S3 Fig. Nonlinear effects.

https://doi.org/10.1371/journal.pone.0308543.s003

S1 Appendix. Details of the simulation design.

https://doi.org/10.1371/journal.pone.0308543.s004

S1 Table. Correlation table.

https://doi.org/10.1371/journal.pone.0308543.s005

Acknowledgments

We would like to thank the members of Topic Group (TG) 2 and the Publications Panel of the STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative for helpful comments. In particular, we thank Willi Sauerbrei, Frank Harrell, Nadja Klein and Harald Binder.

At the time of submission, STRATOS TG2 consisted of the following members (in alphabetical order): Michal Abrahamowicz, Harald Binder, Daniela Dunkler, Frank Harrell, Georg Heinze, Marc Henrion, Michael Kammer, Aris Perperoglou, Willi Sauerbrei, and Matthias Schmid. The group is co-chaired by Georg Heinze ( [email protected] ), Aris Perperoglou, and Willi Sauerbrei.

  • 5. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Cham: Springer International Publishing; 2015.
  • 21. Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data; 2023. Available from: https://www.cdc.gov/nchs/nhanes/ .
  • 24. COVID-19 living review, summary details per model;. https://www.covprecise.org/living-review/ [Accessed: 2024-05-13].
  • 28. Royston P, Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. John Wiley & Sons; 2008.
  • 29. Cario MC, Nelson BL. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Department of Industrial Engineering and Management, Northwestern University; 1997.
  • 34. Hosmer DW Jr, Lemeshow S. Applied logistic regression. New York: John Wiley & Sons; 2000.
  • 36. Blagus R. abe: Augmented Backward Elimination. R package version 5.1.1; 2022.
  • 43. Konis K. Linear programming algorithms for detecting separated data in binary logistic regression models [PhD thesis]. University of Oxford; 2007.

  • 14 August 2024

Has your paper been used to train an AI model? Almost certainly

  • Elizabeth Gibney



Academic publisher Wiley has sold access to its research papers to firms developing large language models. Credit: Timon Schneider/Alamy

Academic publishers are selling access to research papers to technology firms to train artificial-intelligence (AI) models. Some researchers have reacted with dismay at such deals happening without the consultation of authors. The trend is raising questions about the use of published and sometimes copyrighted work to train the exploding number of AI chatbots in development.

Experts say that, if a research paper hasn’t yet been used to train a large language model (LLM), it probably will be soon. Researchers are exploring technical ways for authors to spot whether their content is being used.


AI models fed AI-generated data quickly spew nonsense

Last month, it emerged that the UK academic publisher Taylor & Francis had signed a US$10-million deal with Microsoft, allowing the US technology company to access the publisher’s data to improve its AI systems. And in June, an investor update showed that US publisher Wiley had earned $23 million from allowing an unnamed company to train generative-AI models on its content.

Anything that is available to read online — whether in an open-access repository or not — is “pretty likely” to have been fed into an LLM already, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle. “And if a paper has already been used as training data in a model, there’s no way to remove that paper after the model has been trained,” she adds.

Massive data sets

LLMs train on huge volumes of data, frequently scraped from the Internet. They derive patterns between the often billions of snippets of language in the training data, known as tokens, that allow them to generate text with uncanny fluency.

Generative-AI models rely on absorbing patterns from these swathes of data to output text, images or computer code. Academic papers are valuable for LLM builders owing to their length and “high information density”, says Stefan Baack, who analyses AI training data sets at the Mozilla Foundation, a global non-profit organization in San Francisco, California that aims to keep the Internet open for all to access.


How does ChatGPT ‘think’? Psychology and neuroscience crack open AI large language models

Training models on a large body of scientific information also gives them a much better ability to reason about scientific topics, says Wang, who co-created S2ORC, a data set based on 81.1 million academic papers. The data set was originally developed for text mining — applying analytical techniques to find patterns in data — but has since been used to train LLMs.

The trend of buying high-quality data sets is growing. This year, the Financial Times has offered its content to ChatGPT developer OpenAI in a lucrative deal, and the online forum Reddit has struck a similar deal with Google. And given that scientific publishers probably view the alternative as their work being scraped without an agreement, “I think there will be more of these deals to come,” says Wang.

Information secrets

Some AI developers, such as the Large-scale Artificial Intelligence Network, intentionally keep their data sets open, but many firms developing generative-AI models have kept much of their training data secret, says Baack. “We have no idea what is in there,” he says. Open repositories such as arXiv and PubMed’s scholarly database of abstracts are thought to be “very popular” sources, he says, although paywalled journal articles probably have their free-to-read abstracts scraped by big technology firms. “They are always on the hunt for that kind of stuff,” he adds.

Proving that an LLM has used any individual paper is difficult, says Yves-Alexandre de Montjoye, a computer scientist at Imperial College London. One way is to prompt the model with an unusual sentence from a text and see whether the output matches the next words in the original. If it does, that is good evidence that the paper is in the training set. But if it doesn’t, that doesn’t mean that the paper wasn’t used — not least because developers can code the LLM to filter responses to ensure they don’t match training data too closely. “It takes a lot for this to work,” he says.
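The continuation probe described here can be sketched in a few lines; `generate` is a hypothetical stand-in for any LLM completion call, not a real API:

```python
def continuation_overlap(generate, prompt, true_continuation):
    """Prompt the model with an unusual sentence fragment and measure
    how much of its continuation matches the original next words.
    Returns the fraction of leading tokens that match; a high value is
    evidence of membership, but a low value is inconclusive (the model
    may filter responses that match training data too closely)."""
    predicted = generate(prompt).split()
    actual = true_continuation.split()
    matches = 0
    for p_tok, a_tok in zip(predicted, actual):
        if p_tok != a_tok:
            break
        matches += 1
    return matches / max(len(actual), 1)
```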


Robo-writers: the rise and risks of language-generating AI

Another method to check whether data are in a training set is known as a membership inference attack. This relies on the idea that a model will be more confident about its output when it is seeing something that it has seen before. De Montjoye’s team has developed a version of this, called a copyright trap, for LLMs.

To set the trap, the team generates sentences that look plausible but are nonsense, and hides them in a body of work, for example as white text on a white background or in a field that’s displayed as zero width on a webpage. If an LLM is more ‘surprised’ — a measure known as its perplexity — by an unused control sentence than it is by the one hidden in the text, “that is statistical evidence that the traps were seen before”, he says.
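The perplexity comparison at the heart of the trap can be sketched as follows; the per-token log-probabilities would come from the model under test (here they are just inputs), and in practice the evidence is aggregated over many traps with a statistical test rather than a single comparison:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (natural log):
    exp of the negative mean log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

def trap_seen_before(trap_logprobs, control_logprobs):
    """Copyright-trap check: if the model is less 'surprised' (lower
    perplexity) by the hidden trap sentence than by an unused control
    sentence of the same style, that is evidence the trap sentence was
    in the training data."""
    return perplexity(trap_logprobs) < perplexity(control_logprobs)
```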

Copyright questions

Even if it were possible to prove that an LLM has been trained on a certain text, it is not clear what happens next. Publishers maintain that, if developers use copyrighted text in training and have not sought a licence, that counts as infringement. But a legal counter-argument says that LLMs do not copy anything: they harvest informational content from the training data, which is broken up, and use what they learn to generate new text.


AI is complicating plagiarism. How should scientists respond?

Litigation might help to resolve this. In an ongoing US copyright case that could be precedent-setting, The New York Times is suing Microsoft and ChatGPT’s developer OpenAI in San Francisco, California. The newspaper accuses the firms of using its journalistic content to train their models without permission.

Many academics are happy to have their work included in LLM training data — especially if the models make them more accurate. “I personally don’t mind if I have a chatbot who writes in the style of me,” says Baack. But he acknowledges that his job is not threatened by LLM outputs in the way that those of other professions, such as artists and writers, are.

Individual scientific authors currently have little power if the publisher of their paper decides to sell access to their copyrighted works. For publicly available articles, there is no established means to apportion credit or know whether a text has been used.

Some researchers, including de Montjoye, are frustrated. “We want LLMs, but we still want something that is fair, and I think we’ve not invented what this looks like yet,” he says.

doi: https://doi.org/10.1038/d41586-024-02599-9



Stock Investment Modeling and Prediction Using Vector Autoregression (VAR) and Cross Industry Standard Process for Data Mining (CRISP-DM)

  • Conference paper
  • First Online: 29 April 2023


  • Agung Triayudi,
  • Iskandar Fitri,
  • Sumiati &
  • Iksal

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 1008))


Stock prediction is an important problem today, as many people are switching to digital investment. Many studies have applied machine learning to stock data prediction; however, most of the models used are complex and predict only a single time series variable. This research focuses on building a machine learning model that uses the VAR algorithm to predict several variables at once with a single model and to provide recommendations, following the Cross Industry Standard Process for Data Mining (CRISP-DM) framework. The contribution of this research is to analyze whether the open, high, low, and close share price variables can be predicted from each variable’s past data, and to build a forecasting model using the Vector Autoregression (VAR) algorithm and the CRISP-DM method. The resulting VAR model can predict three variables at once — the high, low, and close prices, with R2 scores of 0.60, 0.51, and 0.54, respectively — using an optimal lag of 273; the opening price requires a separate model, with a lag of two and an R2 score of 0.63. Based on testing and evaluation of the models with R2, MAE, and RMSE scores, we conclude that VAR can be used to predict the highest, lowest, and closing stock prices at once with fairly good accuracy, although the opening price variable must be modeled separately.
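To illustrate the core technique the abstract describes, a minimal VAR(1) can be estimated by ordinary least squares with NumPy alone; this sketch does not reproduce the paper's lag selection, CRISP-DM pipeline, or evaluation:

```python
import numpy as np

def fit_var1(Y):
    """Least-squares estimate of a VAR(1): Y_t = c + A @ Y_{t-1} + e_t.
    Y has shape (T, k); returns intercept c of shape (k,) and the
    coefficient matrix A of shape (k, k)."""
    X = np.hstack([np.ones((Y.shape[0] - 1, 1)), Y[:-1]])  # [1, Y_{t-1}]
    B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)          # (k+1, k)
    return B[0], B[1:].T

def forecast_var1(c, A, y_last, steps):
    """Iterate the fitted recursion forward from the last observation;
    every variable is forecast jointly at each step."""
    out, y = [], y_last
    for _ in range(steps):
        y = c + A @ y
        out.append(y)
    return np.array(out)  # shape (steps, k)
```

A single fitted model forecasts all k series at once, which is exactly the multivariate property the paper exploits for the high, low, and close prices.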



Author information

Authors and Affiliations

Department of ICT, Universitas Nasional, Jakarta, Indonesia

Agung Triayudi & Iskandar Fitri

Informatics Department, Universitas Serang Raya, Serang, Indonesia

Engineering Department, Universitas Faletehan, Kabupaten Serang, Indonesia


Corresponding author

Correspondence to Agung Triayudi .

Editor information

Editors and Affiliations

Department of Electromedical Engineering, Poltekkes Kemenkes Surabaya, Surabaya, Indonesia

Triwiyanto Triwiyanto

School of Electrical Engineering, Telkom University, Indonesia, Indonesia

Achmad Rizal

Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong, Brunei Darussalam

Wahyu Caesarendra


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Triayudi, A., Fitri, I., Sumiati, Iksal (2023). Stock Investment Modeling and Prediction Using Vector Autoregression (VAR) and Cross Industry Standard Process for Data Mining (CRISP-DM). In: Triwiyanto, T., Rizal, A., Caesarendra, W. (eds) Proceeding of the 3rd International Conference on Electronics, Biomedical Engineering, and Health Informatics. Lecture Notes in Electrical Engineering, vol 1008. Springer, Singapore. https://doi.org/10.1007/978-981-99-0248-4_20

Download citation

DOI : https://doi.org/10.1007/978-981-99-0248-4_20

Published : 29 April 2023

Publisher Name : Springer, Singapore

Print ISBN : 978-981-99-0247-7

Online ISBN : 978-981-99-0248-4

eBook Packages : Biomedical and Life Sciences Biomedical and Life Sciences (R0)


American Psychological Association

How to cite ChatGPT

Timothy McAdoo


We, the APA Style team, are not robots. We can all pass a CAPTCHA test , and we know our roles in a Turing test . And, like so many nonrobot human beings this year, we’ve spent a fair amount of time reading, learning, and thinking about issues related to large language models, artificial intelligence (AI), AI-generated text, and specifically ChatGPT . We’ve also been gathering opinions and feedback about the use and citation of ChatGPT. Thank you to everyone who has contributed and shared ideas, opinions, research, and feedback.

In this post, I discuss situations where students and researchers use ChatGPT to create text and to facilitate their research, not to write the full text of their paper or manuscript. We know instructors have differing opinions about how or even whether students should use ChatGPT, and we’ll be continuing to collect feedback about instructor and student questions. As always, defer to instructor guidelines when writing student papers. For more about guidelines and policies about student and author use of ChatGPT, see the last section of this post.

Quoting or reproducing the text created by ChatGPT in your paper

If you’ve used ChatGPT or other AI tools in your research, describe how you used the tool in your Method section or in a comparable section of your paper. For literature reviews or other types of essays or response or reaction papers, you might describe how you used the tool in your introduction. In your text, provide the prompt you used and then any portion of the relevant text that was generated in response.

Unfortunately, the results of a ChatGPT “chat” are not retrievable by other readers, and although nonretrievable data or quotations in APA Style papers are usually cited as personal communications, with ChatGPT-generated text there is no person communicating. Quoting ChatGPT’s text from a chat session is therefore more like sharing an algorithm’s output; thus, credit the author of the algorithm with a reference list entry and the corresponding in-text citation.

When prompted with “Is the left brain right brain divide real or a metaphor?” the ChatGPT-generated text indicated that although the two brain hemispheres are somewhat specialized, “the notion that people can be characterized as ‘left-brained’ or ‘right-brained’ is considered to be an oversimplification and a popular myth” (OpenAI, 2023).

OpenAI. (2023). ChatGPT (Mar 14 version) [Large language model]. https://chat.openai.com/chat

You may also put the full text of long responses from ChatGPT in an appendix of your paper or in online supplemental materials, so readers have access to the exact text that was generated. It is particularly important to document the exact text created because ChatGPT will generate a unique response in each chat session, even if given the same prompt. If you create appendices or supplemental materials, remember that each should be called out at least once in the body of your APA Style paper.

When given a follow-up prompt of “What is a more accurate representation?” the ChatGPT-generated text indicated that “different brain regions work together to support various cognitive processes” and “the functional specialization of different regions can change in response to experience and environmental factors” (OpenAI, 2023; see Appendix A for the full transcript).

Creating a reference to ChatGPT or other AI models and software

The in-text citations and references above are adapted from the reference template for software in Section 10.10 of the Publication Manual (American Psychological Association, 2020, Chapter 10). Although here we focus on ChatGPT, because these guidelines are based on the software template, they can be adapted to note the use of other large language models (e.g., Bard), algorithms, and similar software.

The reference and in-text citations for ChatGPT are formatted as follows:

  • Parenthetical citation: (OpenAI, 2023)
  • Narrative citation: OpenAI (2023)

Let’s break that reference down and look at the four elements (author, date, title, and source):

Author: The author of the model is OpenAI.

Date: The date is the year of the version you used. Following the template in Section 10.10, you need to include only the year, not the exact date. The version number provides the specific date information a reader might need.

Title: The name of the model is “ChatGPT,” so that serves as the title and is italicized in your reference, as shown in the template. Although OpenAI labels unique iterations (i.e., ChatGPT-3, ChatGPT-4), they are using “ChatGPT” as the general name of the model, with updates identified with version numbers.

The version number is included after the title in parentheses. The format for the version number in ChatGPT references includes the date because that is how OpenAI is labeling the versions. Different large language models or software might use different version numbering; use the version number in the format the author or publisher provides, which may be a numbering system (e.g., Version 2.0) or other methods.

Bracketed text is used in references for additional descriptions when they are needed to help a reader understand what’s being cited. References for a number of common sources, such as journal articles and books, do not include bracketed descriptions, but things outside of the typical peer-reviewed system often do. In the case of a reference for ChatGPT, provide the descriptor “Large language model” in square brackets. OpenAI describes ChatGPT-4 as a “large multimodal model,” so that description may be provided instead if you are using ChatGPT-4. Later versions and software or models from other companies may need different descriptions, based on how the publishers describe the model. The goal of the bracketed text is to briefly describe the kind of model to your reader.

Source: When the publisher name and the author name are the same, do not repeat the publisher name in the source element of the reference, and move directly to the URL. This is the case for ChatGPT. The URL for ChatGPT is https://chat.openai.com/chat. For other models or products for which you may create a reference, use the URL that links as directly as possible to the source (i.e., the page where you can access the model, not the publisher’s homepage).
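Assembling the four elements in order can be sketched mechanically. The helper below is purely illustrative (not an APA tool), and plain text cannot carry the italics APA requires on the title:

```python
def format_software_reference(author, year, title, version, descriptor, url):
    """Join the four APA Style (7th ed.) reference elements for software,
    per the Section 10.10 template: author, (date), title (version)
    [bracketed descriptor], and source URL."""
    return f"{author}. ({year}). {title} ({version}) [{descriptor}]. {url}"

# The ChatGPT reference shown above, rebuilt element by element.
ref = format_software_reference(
    author="OpenAI",
    year=2023,
    title="ChatGPT",
    version="Mar 14 version",
    descriptor="Large language model",
    url="https://chat.openai.com/chat",
)
print(ref)
# → OpenAI. (2023). ChatGPT (Mar 14 version) [Large language model]. https://chat.openai.com/chat
```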

Other questions about citing ChatGPT

You may have noticed the confidence with which ChatGPT described the ideas of brain lateralization and how the brain operates, without citing any sources. I asked for a list of sources to support those claims and ChatGPT provided five references—four of which I was able to find online. The fifth does not seem to be a real article; the digital object identifier given for that reference belongs to a different article, and I was not able to find any article with the authors, date, title, and source details that ChatGPT provided. Authors using ChatGPT or similar AI tools for research should consider making this scrutiny of the primary sources a standard process. If the sources are real, accurate, and relevant, it may be better to read those original sources to learn from that research and paraphrase or quote from those articles, as applicable, than to use the model’s interpretation of them.

We’ve also received a number of other questions about ChatGPT. Should students be allowed to use it? What guidelines should instructors create for students using AI? Does using AI-generated text constitute plagiarism? Should authors who use ChatGPT credit ChatGPT or OpenAI in their byline? What are the copyright implications?

On these questions, researchers, editors, instructors, and others are actively debating and creating parameters and guidelines. Many of you have sent us feedback, and we encourage you to continue to do so in the comments below. We will also study the policies and procedures being established by instructors, publishers, and academic institutions, with a goal of creating guidelines that reflect the many real-world applications of AI-generated text.

For questions about manuscript byline credit, plagiarism, and related ChatGPT and AI topics, the APA Style team is seeking the recommendations of APA Journals editors. APA Style guidelines based on those recommendations will be posted on this blog and on the APA Style site later this year.

Update: APA Journals has published policies on the use of generative AI in scholarly materials.

We, the APA Style team humans, appreciate your patience as we navigate these unique challenges and new ways of thinking about how authors, researchers, and students learn, write, and work with new technologies.

American Psychological Association. (2020). Publication manual of the American Psychological Association (7th ed.). https://doi.org/10.1037/0000165-000


Performance Enhancement of H-Type Darrieus VAWT Using a Hybrid Method of Blade Pitch Angle Regulation

1. Introduction
2. Solution Strategy
3. Materials and Methods
3.1. Double Multiple Streamtube Model
3.2. Numerical Modeling and Simulation
3.3. Experimental Validation
4. Results and Discussion
4.1. Impact of Blade Pitch Angle Regulation on H-Type Darrieus VAWT Aerodynamics
4.2. Effect of Blade Pitch Angle Regulation on H-Type Darrieus VAWT Performance
4.3. Self-Starting Behavior of H-Type Darrieus VAWT
4.4. Hybrid Blade Pitch Angle Technique
5. Conclusions
Author Contributions
Data Availability Statement
Conflicts of Interest


H-Type Darrieus VAWT Specification | Symbol | Value
Blade Profile                      | —      | NACA0018
Blade Shape                        | —      | Straight
Number of blades                   | N      | 3
Rotor Radius (mm)                  | R      | 300
Blade height (mm)                  | H      | 400
Chord length (mm)                  | C      | 250
Solidity                           | σ      | 0.191
V      | β = −15° | β = −12.5° | β = −10° | β = −7.5° | β = −5° | β = −2.5° | β = 0° | β = 2.5° | β = 5° | β = 7.5°
4 m/s  | F-S      | F-S        | F-S      | F-S       | F-S     | 100       | 110    | 110      | 120    | F-S
5 m/s  | 0        | 0          | 20       | 30        | 70      | 80        | 90     | 110      | 110    | 120
6 m/s  | 0        | 0          | 0        | 0         | 20      | 50        | 60     | 100      | 110    | 120
7 m/s  | 0        | 0          | 0        | 0         | 10      | 60        | 60     | 90       | 100    | 120
8 m/s  | 0        | 0          | 0        | 0         | 0       | 0         | 20     | 80       | 100    | 110
9 m/s  | 0        | 0          | 0        | 0         | 0       | 0         | 0      | 0        | 0      | 110
10 m/s | 0        | 0          | 0        | 0         | 0       | 0         | 0      | 0        | 0      | 110
11 m/s | 0        | 0          | 0        | 0         | 0       | 0         | 0      | 0        | 0      | 110
12 m/s | 0        | 0          | 0        | 0         | 0       | 0         | 0      | 0        | 0      | 100
13 m/s | 0        | 0          | 0        | 0         | 0       | 0         | 0      | 0        | 0      | 100
Share and Cite

Hammad, M.A.; Mahmoud, A.M.; Abdelrhman, A.M.; Sarip, S. Performance Enhancement of H-Type Darrieus VAWT Using a Hybrid Method of Blade Pitch Angle Regulation. Energies 2024 , 17 , 4044. https://doi.org/10.3390/en17164044

Hammad MA, Mahmoud AM, Abdelrhman AM, Sarip S. Performance Enhancement of H-Type Darrieus VAWT Using a Hybrid Method of Blade Pitch Angle Regulation. Energies . 2024; 17(16):4044. https://doi.org/10.3390/en17164044

Hammad, Mahmood Abduljabbar, Abdelgadir Mohamed Mahmoud, Ahmed M. Abdelrhman, and Shamsul Sarip. 2024. "Performance Enhancement of H-Type Darrieus VAWT Using a Hybrid Method of Blade Pitch Angle Regulation" Energies 17, no. 16: 4044. https://doi.org/10.3390/en17164044


IMAGES

  1. A Structural VAR Approach to Estimating Budget
  2. (PDF) Application of structural VAR models and impulse response function
  3. (PDF) Prediction and Impulse of Major Currency Pairs Using VAR Model
  4. VAR Model
  5. Impulse response analysis chart of the VAR model.

COMMENTS

  1. PDF VAR Analysis of Economic Activity, Unemployment, and Inflation during

    research as the sample size for my model is 82. Bayoumi and Eichengreen (1992) further interrogate the previously stated conclusion regarding supply and demand shocks. They conduct multiple bi-variate VAR models using price and output data within the European Union. Their research primarily identifies and isolates the result of shocks between ...

  2. Vector Autoregressive (VAR) Models and Granger Causality in Time Series

    To demonstrate the utility of VAR modeling in nursing research, we carried out an analysis with the purpose of developing a stable patient-specific multivariate time series VAR model— using HR, RR and SpO 2 in a sample of SDU patients—in order to study the Granger casual dynamics among the monitored vital signs leading up to a first CRI ...

  3. PDF Vector Autoregression Analysis: Estimation and Interpretation

    mine what variables should be in the VAR, the appropriate number of lags, whether seasonal dummies should be included and, indeed, whether a VAR is even appropriate for the research problem at hand. To focus strictly on the mechanics at this point, however, these model-selection issues are postponed to a later section. 3

  4. An Introduction to Vector Autoregression (VAR) · r-econometrics

    Since the seminal paper of Sims (1980) vector autoregressive models have become a key instrument in macroeconomic research. This post presents the basic concept of VAR analysis and guides through the estimation procedure of a simple model. ... In order to estimate the VAR model I use the vars package by Pfaff (2008). The relevant function is ...

  5. (PDF) Vector Autoregressive Model and Analysis

    In this paper, we develop a vector autoregressive (VAR) model of the Turkish financial markets for the period of June 15 2006-June 15 2010 and forecasts ISE100 index, TRY/USD exchange rate, and ...

  6. (PDF) A comprehensive review of Value at Risk methodologies

    Abstract. In this article we present a theoretical review of the existing literature on Value at Risk (VaR) specifically focussing on the development of new approaches for its estimation. We ...

  7. Forecasting COVID-19: Vector Autoregression-Based Model

    In our paper, we use the statsmodels package in Python to implement the VAR model. The order p of the VAR model is chosen based on a combination of different factors including the time series plots, the correlation plots, and information criteria. In addition, residual analysis is employed to confirm the model assumptions about normality and ...

  8. VAR Models: Estimation, Inferences, and Applications

    Vector autoregression (VAR) models have been used extensively in finance and economic analysis. This paper provides a brief overview of the basic VAR approach by focusing on model estimation and statistical inferences. Applications of VAR models in some finance areas are discussed, including asset pricing, international finance, and market ...

  9. PDF The Econometrics of Oil Market VAR Models

Introduction. In the last decade, structural vector autoregressive (VAR) models of the global oil market have become the standard tool for understanding the evolution of the real price of oil and its effect on the macro economy. These models have helped create a consensus among researchers and ...

  10. Engineering Proceedings

    In this paper, the scope is to study whether and how the COVID-19 situation affected the unemployment rate in Greece. To achieve this, a vector autoregression (VAR) model is employed and data analysis is carried out. Another interesting question is whether the situation affected more heavily female and the youth unemployment (under 25 years old) compared to the overall unemployment.

  11. PDF Vector Autoregressive Models for Multivariate Time Series

The general form of the VAR(p) model with deterministic terms and exogenous variables is given by

Y_t = Π_1 Y_{t−1} + Π_2 Y_{t−2} + · · · + Π_p Y_{t−p} + Φ D_t + G X_t + ε_t   (11.4)

where D_t represents an (l × 1) matrix of deterministic components, X_t represents an (m × 1) matrix of exogenous variables, and Φ and G are parameter matrices.
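Written out directly, one step of this VAR(p) recursion is just a sum of matrix products. All matrices below are illustrative assumptions, not estimates from any dataset:

```python
import numpy as np

def var_step(Pi, Y_lags, Phi, D_t, G, X_t, eps_t):
    """One step of Y_t = Pi_1 Y_{t-1} + ... + Pi_p Y_{t-p}
    + Phi D_t + G X_t + eps_t."""
    y = Phi @ D_t + G @ X_t + eps_t
    for Pi_i, Y_lag in zip(Pi, Y_lags):  # Y_lags[0] is Y_{t-1}, etc.
        y = y + Pi_i @ Y_lag
    return y

Pi = [np.array([[0.5, 0.1], [0.0, 0.4]])]   # VAR(1): a single lag matrix
Y_lags = [np.array([1.0, 2.0])]             # Y_{t-1}
Phi, D_t = np.eye(2), np.array([0.1, 0.1])  # deterministic (constant) term
G, X_t = np.zeros((2, 1)), np.array([0.0])  # no exogenous contribution
print(var_step(Pi, Y_lags, Phi, D_t, G, X_t, np.zeros(2)))  # [0.8 0.9]
```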

  12. A dominance approach for comparing the performance of VaR forecasting

    We introduce three dominance criteria to compare the performance of alternative value at risk (VaR) forecasting models. The three criteria use the information provided by a battery of VaR validation tests based on the frequency and size of exceedances, offering the possibility of efficiently summarizing a large amount of statistical information. They do not require the use of any loss function ...

  13. Vector Autoregressive Models: Specification, Estimation, Inference, and

The seminal work of Sims (1972; 1980a; 1980b; 1982) introduced the vector autoregressive (VAR) methodology into the mainstream of applied macroeconomic research as an alternative to large-scale macroeconometric models.

  14. An Introduction to Structural Vector Autoregression (SVAR)

    This requires them to arrange the variables of the model in a suitable order. An alternative to this approach is to use so-called structural vector autoregressive (SVAR) models, where the relationship between contemporaneous variables is modelled more directly. This post provides an introduction to the concept of SVAR models and how they can be ...

  15. [2404.02905] Visual Autoregressive Modeling: Scalable Image Generation

    We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize ...

  16. The Knowledge Analysis of Panel Vector Autoregression: A Systematic

Among them, the panel vector autoregression (PVAR) model, proposed by Holtz-Eakin et al. (1988), is a new model based on panel data that is adapted from the VAR model. It allows for the presence of unobservable individual heterogeneity and time effects. Unlike the VAR model, the PVAR model does not have strict requirements on data volume and format.

  17. Comparison of Forecasting Performance with VAR vs. ARIMA Models Using

    This research found that the VAR model presented a better forecast than ARIMA models for the highly correlated variables such as GDP vs. GNP, Export vs. Import, etc. But ARIMA and VAR models ...

  18. Choosing between AR(1) and VAR(1) models in typical ...

    Time series of individual subjects have become a common data type in psychological research. The Vector Autoregressive (VAR) model, which predicts each variable by all variables including itself at previous time points, has become a popular modeling choice for these data. However, the number of observations in typical psychological applications is often small, which puts the reliability of VAR ...

  19. Chapter 3: Vector Autoregressive Methods

    We will train the VAR model using hourly weather measurements from January 1, 2019 (00:10) up to December 29, 2014 (18:10). The performance of the model will be evaluated on the test set which contains data from December 29, 2014 (19:10) to December 31, 2014 (23:20) which is equivalent to 17,523 data points for each of the variables. ...
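Forecasts on a held-out test window like the one described here are commonly scored with the mean absolute percentage error (MAPE). A minimal sketch with hypothetical actual and forecast values:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))

# Hypothetical held-out actuals vs. model forecasts.
print(round(mape([100, 200, 400], [110, 190, 400]), 2))  # 5.0
```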

  20. Dissertations / Theses: 'VECTOR AUTOREGRESSIVE (VAR) MODELS ...

    The model is estimated via maximum likelihood technique. The results suggest DSGE-VAR model outperforms the Classical VAR, but not the Bayesian VARs. However, it does indicate that the forecast accuracy can be improved alarmingly by using the estimated version of the DSGE model. The third paper develops a micro-founded New-Keynesian DSGE ...

  21. VAR Models: Estimation, Inferences, and Applications

    This paper provides a brief overview of the basic VAR approach by focusing on model estimation and statistical inferences. Applications of VAR models in some finance areas are discussed, including asset pricing, international finance, and market micro-structure. It is shown that such approach provides a powerful tool to study financial market ...

  22. VARMA versus VAR for Macroeconomic Forecasting

econometric research paper has considered the VARMA model as an alternative to the finite-order vector autoregressive (VAR) model. Ever since the publication of the seminal paper by Christopher Sims (Sims 1980), the finite-order VAR model has become the cornerstone of macroeconometric modeling. The reason for

  23. Evaluating variable selection methods for multivariable regression

    Researchers often perform data-driven variable selection when modeling the associations between an outcome and multiple independent variables in regression analysis. Variable selection may improve the interpretability, parsimony and/or predictive accuracy of a model. Yet variable selection can also have negative consequences, such as false exclusion of important variables or inclusion of noise ...

  24. Optimizing Supply Chain Efficiency using Innovative Goal ...

    This paper presents an optimization approach for supply chain management that incorporates goal programming (GP), dependent chance constraints (DCC), and the hunger games search algorithm (HGSA). The model acknowledges uncertainty by embedding uncertain parameters that promote resilience and efficiency. It focuses on minimizing costs while maximizing on-time deliveries and optimizing key ...

  25. Has your paper been used to train an AI model? Almost certainly

    Experts say that, if a research paper hasn't yet been used to train a large language model (LLM), it probably will be soon. Researchers are exploring technical ways for authors to spot if their ...

  26. Stock Investment Modeling and Prediction Using Vector ...

    The contribution of this research is (1) to find out whether the VAR algorithm can produce a model with good accuracy, especially using existing variables, so that it can be used to forecast stock data; (2) create a forecasting model that can predict more than one line time variable at a time using a machine learning model with the VAR ...

  27. A Thermodynamically Consistent Phase-Field Model and an ...

    We use the phase-field model to represent a two-phase incompressible fluid flow with variable physical properties and thermocapillary effects along the fluid/fluid interface. In particular, we use the following formulation as the variable density for the two-phase fluid: ρ ( Φ 1 , Φ 2 ) = ρ 1 Φ 1 + ρ 2 Φ 2 , where ρ i > 0 is the ...

  28. How to cite ChatGPT

    Title: The name of the model is "ChatGPT," so that serves as the title and is italicized in your reference, as shown in the template. Although OpenAI labels unique iterations (i.e., ChatGPT-3, ChatGPT-4), they are using "ChatGPT" as the general name of the model, with updates identified with version numbers.

  29. Performance Enhancement of H-Type Darrieus VAWT Using a Hybrid ...

    Blade pitch angle regulation is an effective approach to enhance the performance of H-type Darrieus Vertical Axis Wind Turbines (VAWTs). Improving the blade interaction with the wind for this type of rotor is a challenging task, especially in unsteady wind conditions. This paper presents a novel hybrid approach that integrates fixed and variable blade pitch angle regulation techniques, aiming ...