Overview
Regression analysis: Modeling the relationship between a target variable and one or more explanatory variables, and quantitatively analyzing how well the target variable is explained by the explanatory variables.
(If there is only one explanatory variable, it is called simple regression analysis; if there are multiple explanatory variables, it is called multiple regression analysis.)
It is also referred to as Multiple Linear Regression (MLR), Ordinary Least Squares (OLS) regression, Classical Linear Regression, etc.
When there is strong correlation between explanatory variables (multicollinearity), the regression coefficients become unstable, and countermeasures such as regularization or dimensionality reduction are necessary.
Simple regression analysis
Consider the simplest form of regression analysis (when there is only one explanatory variable).
Consider $N$ data points $(x_i, y_i),\ i = 1, 2, \ldots, N$. The approximation line (equation 1) for these data is found by choosing $w_0$ and $w_1$ so that the sum of the squared differences between the measured and predicted values (the squared error, equation 2) is smallest.

$$y = w_0 + w_1 x \tag{1}$$

$$\sum_{i=1}^{N} \{ y_i - (w_0 + w_1 x_i) \}^2 \tag{2}$$
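As a sketch, the minimizing $w_0$ and $w_1$ can be computed in closed form; the data below are made up purely for illustration.

```python
import numpy as np

# Illustrative data (x_i, y_i), i = 1, ..., N
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form minimizer of the squared error (2):
# w1 = Cov(x, y) / Var(x), w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)  # intercept and slope of the approximation line (1)

# np.polyfit(x, y, deg=1) returns the same pair, ordered [w1, w0]
```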
Multiple regression analysis
In multiple regression we obtain the weight (coefficient) of each explanatory variable when there are several of them, rather than the single one of simple regression analysis.
In other words, by obtaining $w_0$ to $w_n$ in the following equation, it is possible to determine how much the target variable changes when an explanatory variable $x$ changes (≈ its degree of contribution), or to predict the value of the target variable for a given combination of explanatory variables.
$$y = w_0 + w_1 x_1 + \cdots + w_n x_n \tag{3}$$
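Since equation (3) is also fitted by minimizing the squared error, the weights have the standard least-squares (normal equation) solution $w = (X^\top X)^{-1} X^\top y$; a minimal numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 explanatory variables
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so w[0] plays the role of the intercept w0
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution; lstsq is numerically safer than an explicit inverse
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w)  # approximately [1.0, 2.0, -1.0, 0.5]
```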
Other statistics that can be calculated
Partial regression coefficients: The coefficient attached to each explanatory variable in equation (3).
Standardized partial regression coefficients: Partial regression coefficients computed from data in which both the explanatory variables and the target variable have been standardized (standardization: transforming a variable so that its mean is 0 and its variance is 1). Since the scales of the variables are aligned, the magnitudes of the standardized partial regression coefficients can be compared with each other, as in the sketch below.
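As a sketch of the standardization step, one might standardize with scikit-learn's StandardScaler before fitting; the data here are made up, with raw scales that differ by orders of magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(100, 3))
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(size=100)

# Standardize both X and y to mean 0, variance 1
X_std = StandardScaler().fit_transform(X)
y_std = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()

coef = LinearRegression().fit(X_std, y_std).coef_
print(coef)  # coefficients now comparable on a common scale
```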
Standard error of the coefficients (SE | Std. Error): The standard error of the estimated values of the coefficients. The smaller the error, the more accurate the estimation.
The t statistic: A statistic used to test whether each partial regression coefficient (including the constant term) is zero.
The null hypothesis is that the partial regression coefficient $= 0$. The test is conducted assuming that the $t$-value obtained from the equation below follows the $t$ distribution with $(n - k - 1)$ degrees of freedom ($n$: sample size, $k$: number of explanatory variables, $\hat{\beta}_i$: partial regression coefficient, $se(\hat{\beta}_i)$: standard error of the coefficient of variable $i$).

$$t_i = \frac{\hat{\beta}_i - 0}{se(\hat{\beta}_i)}$$
(Based on the t-test) p-value: The probability, in the $t$ distribution with $(n - k - 1)$ degrees of freedom, of obtaining a $t$ value at least as extreme as the one computed above. Generally, the null hypothesis is rejected when the p-value is less than 5% or 1%.
Confidence interval for a partial regression coefficient: Using the $t$ distribution with $(n - k - 1)$ degrees of freedom, the $100(1 - \alpha)\%$ confidence interval for the partial regression coefficient $\beta_i$ is obtained from the following equation ($n$: sample size, $k$: number of explanatory variables, $t_{\alpha/2}(n - k - 1)$: the value of $t$ whose upper-tail probability is $\alpha/2$ in the $t$ distribution with $(n - k - 1)$ degrees of freedom).

$$\hat{\beta}_i - t_{\alpha/2}(n - k - 1) \times se(\hat{\beta}_i) \le \beta_i \le \hat{\beta}_i + t_{\alpha/2}(n - k - 1) \times se(\hat{\beta}_i)$$
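The standard errors, $t$-values, p-values, and confidence intervals described above can be reproduced by hand from a least-squares fit; a sketch with made-up data (statsmodels reports the same numbers in its summary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), X])           # add the constant term
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # estimated coefficients

resid = y - X1 @ beta
dof = n - k - 1                                  # degrees of freedom
s2 = resid @ resid / dof                         # residual variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X1.T @ X1)))  # standard errors

t = beta / se                                    # t_i = (beta_i - 0) / se(beta_i)
p = 2 * stats.t.sf(np.abs(t), dof)               # two-sided p-values
t_crit = stats.t.ppf(1 - 0.05 / 2, dof)          # t_{alpha/2}(n - k - 1), alpha = 0.05
ci = np.column_stack([beta - t_crit * se, beta + t_crit * se])  # 95% intervals
print(t, p, ci, sep="\n")
```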
The coefficient of determination (R-squared): “the proportion of the variability of the target variable that can be explained by the explanatory variables”; it indicates how well the regression equation fits. A value of 1 is best.
Adjusted R-squared: Used to judge goodness of fit while accounting for the number of explanatory variables, since R-squared never decreases when variables are added.
F-statistic: A statistic that tests whether the regression equation as a whole is meaningful (the null hypothesis that all coefficients other than the constant term are zero).
(Based on the F-test) p-value: The p-value that can be calculated from the F-statistic. Generally, the null hypothesis is rejected if the p-value is less than 5% or 1%.
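Using the same toy fit as the previous sketch, R-squared, adjusted R-squared, the F-statistic, and its p-value follow from the residual and total sums of squares; the formulas are standard:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(size=n)
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta

ss_res = resid @ resid                           # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)             # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes added variables
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))     # F-test: all slopes are zero
f_p = stats.f.sf(f_stat, k, n - k - 1)           # upper-tail p-value
print(r2, adj_r2, f_stat, f_p)
```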
Code
Let’s perform a multiple regression analysis using the wine data in scikit-learn.
We regress the alcohol concentration on the remaining features (i.e., all columns other than alcohol).
Two patterns are implemented: scikit-learn and statsmodels.
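A minimal sketch of what the two patterns could look like (the wine data's alcohol column is taken as the target; inspecting the output is left to the reader):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_wine
from sklearn.linear_model import LinearRegression

# Load the wine data and split off alcohol as the target variable
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
y = df["alcohol"]
X = df.drop(columns="alcohol")

# Pattern 1: scikit-learn
lr = LinearRegression().fit(X, y)
print("R^2:", lr.score(X, y))
print(pd.Series(lr.coef_, index=X.columns))  # partial regression coefficients

# Pattern 2: statsmodels (also reports SE, t, p, CIs, and the F-statistic)
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())
```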