For my lectures on applied linear models, I wanted to illustrate the fact that the R^2 is never a good measure of the goodness of the model, since it’s quite easy to improve it. Consider the following dataset

n=100 df=data.frame(matrix(rnorm(n*n),n,n)) names(df)=c("Y",paste("X",1:99,sep="")) |

with one variable of interest y, and 99 features x_j. All of them being (by construction) independent. And we have 100 observations… Consider here the regression on the first k features, and compute R_k^2 of that regression

reg=function(k){ frm=paste("Y~",paste("X",1:k,collapse="+",sep="")) model=lm(frm,data=df) summary(model)$adj.r.squared} |

Let us see what’s going on…

plot(1:99,Vectorize(reg)(1:99)) |

(actually, it’s not exactly what we have on the graph…. we have the average obtained over 1,000 samples randomly generated, with 90% confidence bands). Oberve that \mathbb{E}[R^2_k]=k/n, i.e. if we add some pure random noise, we keep increasing the R^2 (up to 1, actually).

Good news, as we’ve seen in the course, the adjusted R^2 – denoted \bar R^2-might help. Observe that \mathbb{E}[\barR^2_k]=0, so, in some sense, adding features does not help here…

reg=function(k){ frm=paste("Y~",paste("X",1:k,collapse="+",sep="")) model=lm(frm,data=df) summary(model)$r.squared} plot(1:99,Vectorize(reg)(1:99)) |

We can actually do the same with Akaike criteria AIC_k and Schwarz (bayesian) criteria BIC_k.

reg=function(k){ frm=paste("Y~",paste("X",1:k,collapse="+",sep="")) model=lm(frm,data=df) AIC(model)} plot(1:99,Vectorize(reg)(1:99)) |

For the AIC, the intitial increase makes sense : we should not prefer the model with 10 covariates, compared with nothing. The strange thing is the far right behavior : we prefer here 80 random noise features to none ! Which I find hard to interprete… For the BIC the code is simply

reg=function(k){ frm=paste("Y~",paste("X",1:k,collapse="+",sep="")) model=lm(frm,data=df) BIC(model)} plot(1:99,Vectorize(reg)(1:99)) |

and here also, we have the same pattern, where we prefer a big model with juste pure noise to nothing…

A last one to conclude (or not) : what about the leave-one-out cross validation mean squared error ? More precisely, CV=\frac{1}{n}\sum_{i=1}\widehat{\varepsilon}^2_{-i}where \widehat{\varepsilon}^2_{-i}=y_i-\widehat{y}_{-i} where \widehat{y}_{-i} is the predicted value obtained with the model is estimated when the ith observation is deleted. One can prove that \widehat{\beta}_{-i}=\widehat{\beta}-(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\hat\varepsilon_i(1-H_{i,i})^{-1}where H is the classical hat matrix, thus\widehat{\varepsilon}_{-i}=(1-H_{i,i})^{-1}\hat\varepsilon_ii.e. we do note have to estimate (at each round) n models

reg=function(k){ frm=paste("Y~",paste("X",1:k,collapse="+",sep="")) model=lm(frm,data=df) h=lm.influence(model)$hat/2 mean( (residuals(model)/1-h)^2 ))} plot(1:99,Vectorize(reg)(1:99)) |

Here, it make sense : adding noisy features yields overfit ! So the mean squared error is decreasing !

That’s all nice, but it might not be very realistic… Here, for my model with only one variable, I just pick one, at random…. In practice, we try to get the “best one”… So a more natural idea would be to order the variables according to their correlations with y,

df=data.frame(matrix(rnorm(n*n),n,n)) df=df[,rev(order(abs(cor(df)[1,])))] names(df)=c("Y",paste("X",1:99,sep=""))} |

and as before, we can plot the evolution of R^2_k as a function of k the number of features considered,

which is increasing, with a higher slope at the beginning… For the \bar R^2_k we might actually prefer a correlated noise to nothing (which makes sense actually). So here since we somehow chose our variables, \bar R^2_k seems to be always positive…

For the AIC_k here also, there is an improvement. Before coming back to the original situation (with about 80 features) and here also, we observe the drop on the far right part of the graph

The BIC_k might like the top three features, but soon, we have a deterioration…. even if here also, we have the drop at the far right (with more than 95 features… for 100 observations).

Finally, observe that here again, our (leave-one-out) cross-validation has not been mesled by our noisy variables : it is always decreasing !

So it seems that cross-validation techniques are more robust than the AIC and BIC (even if we mentioned in a previous post connexions between all those concepts) when we have a lot a noisy (non-relevent) features.