Regression using Genetic Algorithm

The idea of this post is to solve (or at least approximate) Ordinary Least Squares (OLS), Least Absolute Deviation (LAD), Quantile Regression, the Lasso and \(L0\) regularization using a Genetic Algorithm, via the GA package.

Creating the data

I generate data in which some of the columns of X are uncorrelated with the response Y.

# Packages
library(GA)
library(L1pack)
library(L0Learn)
library(quantreg)
library(glmnet)
library(knitr)

# Problem data
set.seed(1)
p <- 8
n <- 5000
DENSITY <- 0.5   # Fraction of non-zero beta
beta.mean<-10; intercept<-13
beta_true <- matrix(rnorm(p, beta.mean), ncol = 1)
idxs <- sample.int(p, size = floor((1 - DENSITY) * p), replace = FALSE)
beta_true[idxs] <- 0
X <- matrix(rnorm(n * p, sd = 5), nrow = n, ncol = p)
sigma <- 20
eps <- matrix(rnorm(n, sd = sigma), ncol = 1)
beta_true<-c(intercept, beta_true)
X.intercept<-cbind(1, X)
Y <- X.intercept %*% beta_true + eps
K <- as.numeric(ncol(X))
m <- length(Y)

X<-scale(X) # standardize for regularization techniques

Ordinary Least Squares (OLS)

The solution can be easily calculated in closed form using \(\hat{\beta}_{OLS}= (X^T X)^{-1} X^T y\), which gives:

beta.ols<-function(xx=X, yy=Y){
        xx<-cbind(1, xx)                       # add the intercept column
        beta<-solve(t(xx)%*%xx)%*%t(xx)%*%yy   # closed-form OLS solution
        as.numeric(beta)
}

ols.beta<-beta.ols()
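
As a quick sanity check, the closed-form solution should match the coefficients returned by lm() up to numerical precision:

# compare with R's built-in least-squares fit
lm.beta <- coef(lm(Y ~ X))
max(abs(lm.beta - ols.beta)) # ~ 0 up to floating-point error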

In this case, using a Genetic Algorithm is clearly inefficient and less precise. The same holds for the other models in this post: they generally lack a closed-form solution, but dedicated numerical methods exist that exploit the structure of each problem.

To solve these problems with the GA package we need to specify some parameters:

max.iter<-10000 # maximum number of iterations
max.run<-1000 # iterations without any improvement before stopping
theta<-rep(0,p) # vector of suggested parameters, without intercept
iper<-0.01 # hyperparameter for the regularization methods
up<-rep(100, p+1) # upper bounds of the parameters
low<- -up # lower bounds of the parameters

We can solve OLS by minimizing the function directly:

\[\min_{\beta} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \]

Since ga maximizes its fitness function, we pass the negative of this loss, together with lower and upper bounds for the parameters, in this case the vectors \((-100, \dots, -100)\) and \((100, \dots, 100)\), and an optional vector of suggested initial solutions, in this case \((0, \dots, 0)\).

# negative sign, since ga actually maximizes the function
ols.func<-function(beta, xx=X, yy=Y){  
        xx<-cbind(1, xx)
        J <- -sum((xx%*%beta- yy)^2)
        return(J) 
}
GA.ols <- ga(type = "real-valued", fitness = ols.func,
             lower = low, upper = up, suggestions = c(0,theta),
             popSize = 300, maxiter = max.iter, run = max.run, monitor = NULL)

GA.ols<-as.numeric(summary(GA.ols)$solution)
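
To get a feel for the efficiency gap mentioned above, one can time the two approaches (an illustrative sketch; exact timings depend on the machine and on the GA settings):

# closed form: typically milliseconds
system.time(beta.ols())
# GA: far slower, even when capped at 100 generations
system.time(ga(type = "real-valued", fitness = ols.func,
               lower = low, upper = up, suggestions = c(0,theta),
               popSize = 300, maxiter = 100, monitor = NULL))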

Least Absolute Deviation (LAD)

The Least Absolute Deviation (LAD) problem doesn’t have a closed-form solution, but it can be solved with numerical methods, like the Barrodale and Roberts algorithm (the default of the L1pack package used in this post) or the EM algorithm:

# L1pack package
lad.package<-lad(Y~X, data=list(X,Y), method = "BR")$coefficients

We can solve LAD by minimizing the function directly:

\[\min_{\beta} \sum_{i=1}^n |y_i - x_i^T \beta| \]

# negative sign, since ga actually maximizes the function
lad.func<-function(beta, xx=X, yy=Y){  
  xx<-cbind(1, xx)
  lad<-sum(abs(yy-xx%*%beta))
  J<- - lad
  return(J) 
}

GA.lad <- ga(type = "real-valued", fitness = lad.func,
             lower = low, upper = up,
             suggestions = c(0,theta),
             popSize = 300, maxiter = max.iter,
             run = max.run, monitor = NULL)

GA.lad<-as.numeric(summary(GA.lad)$solution)

Quantile Regression

Quantile regression can also be solved with several numerical methods: the default used by quantreg is a modified version of the Barrodale and Roberts algorithm for \(L1\) regression, described in detail in Koenker and d’Orey (1987, 1994). For larger problems it is advantageous to use the Frisch-Newton interior point method or the other variants described in detail in Portnoy and Koenker (1997).

quantile.fit<-rq(Y~X, data=list(X,Y), tau = c(0.05, 0.5, 0.95), method = "br")
quantile.fit<-t(quantile.fit$coefficients)

The loss of quantile regression can be calculated using the check function:

\[\rho(x, \ \tau)= \begin{cases} \tau x & \quad \text{if } x \geq 0\\ (\tau-1) x & \quad \text{if } x < 0\\ \end{cases}\]

check.function<-function(z, t=0.5){
  # check (pinball) loss: t*z for z>=0, (t-1)*z for z<0, so it is always non-negative
  risultato<-ifelse(z>=0, t*z, (t-1)*z)
  risultato
}
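
For example, with \(\tau = 0.9\) negative residuals are penalized nine times less than positive ones:

check.function(c(-2, 0, 3), t=0.9)
# returns 0.2 0.0 2.7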

Then we can define the loss function of the quantile regression as:

\[\min_{\beta} \frac{1}{n} \sum_{i=1}^n \rho(y_i - x_i^T \beta, \ \tau)\]

Where the constant \(\frac{1}{n}\) can be omitted from the optimization procedure, since it doesn’t depend on \(\beta\):

# negative sign, since ga actually maximizes the function
quant_loss <- function(beta, xx=X, yy=Y, tau=0.5) { 
  xx<-cbind(1, xx)
  error<-yy-xx%*%beta
  loss<-check.function(error, t=tau)
  -sum(loss)
}

GA.quant <- ga(type = "real-valued", fitness = quant_loss, tau=0.5,
             lower = low, upper = up,
             suggestions = c(0,theta), # alternatively quantile.fit[2,], the tau=0.5 row
             popSize = 300, maxiter = max.iter, run = max.run, monitor = NULL)

GA.quant.50<-as.numeric(summary(GA.quant)$solution)

GA.quant.95 <- ga(type = "real-valued", fitness = quant_loss, tau=0.95,
                  lower = low, upper = up,
                  suggestions = c(0,theta), # alternatively quantile.fit[3,], the tau=0.95 row
                  popSize = 300, maxiter = max.iter, run = max.run, monitor = NULL)

GA.quant.95<-as.numeric(summary(GA.quant.95)$solution)

GA.quant.05 <- ga(type = "real-valued", fitness = quant_loss, tau=0.05,
                  lower = low, upper = up,
                  suggestions = c(0,theta), # alternatively quantile.fit[1,], the tau=0.05 row
                  popSize = 300, maxiter = max.iter, run = max.run, monitor = NULL)

GA.quant.05<-as.numeric(summary(GA.quant.05)$solution)

Lasso - \(L1\) Penalization

The Least Absolute Shrinkage and Selection Operator (Lasso) problem can be solved with the glmnet package, which uses a cyclical coordinate descent algorithm: it successively optimizes the objective function over each parameter with the others fixed, and cycles repeatedly until convergence:

fit.glmnet.lasso<-glmnet(X, Y, family ="gaussian", lambda = iper)
lasso.glmnet<-as.numeric(coef(fit.glmnet.lasso))
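
To make the mechanics concrete, here is a minimal sketch of cyclical coordinate descent with soft-thresholding for this objective. This is my own toy illustration (the helper names soft and cd.lasso are mine), assuming centered predictors as produced by scale(X); it is not the actual glmnet implementation:

# soft-thresholding operator: the solution of the univariate lasso problem
soft <- function(z, g) sign(z)*pmax(abs(z)-g, 0)

cd.lasso <- function(xx=X, yy=Y, lambda=iper, iters=100){
  n <- nrow(xx); p <- ncol(xx)
  b0 <- mean(yy)        # optimal intercept when the predictors are centered
  b <- rep(0, p)
  r <- yy - b0          # current residuals
  for(it in 1:iters){
    for(j in 1:p){
      r <- r + xx[,j]*b[j]            # partial residual excluding predictor j
      z <- sum(xx[,j]*r)/n            # univariate least-squares coefficient
      b[j] <- soft(z, lambda)/(sum(xx[,j]^2)/n)
      r <- r - xx[,j]*b[j]            # restore residuals with the updated b[j]
    }
  }
  c(b0, b)
}

# cd.lasso() should land close to lasso.glmnet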

The loss can be minimized directly:

\[\min_\beta \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j )^2 + \lambda \sum_{j=1}^p |\beta_j|\]

# negative sign, since ga actually maximizes the function
lasso.loss<-function(beta, xx=X, yy=Y, lambda=iper){  
        const<-1/(length(yy)*2)
        # xx is already standardized (see above); no intercept column here
        ols<-sum((yy-xx%*%beta)^2)
        J <- -((const*ols)+(lambda*sum(abs(beta))))
        return(J) 
}

X must be standardized before optimizing this function. The intercept is excluded from the optimization and is estimated afterwards using:

\[\hat{\beta}_0= \bar{y} - \sum_{j=1}^p \bar{x}_j \hat{\beta}_j\]

intercept.mean<-function(beta, xx=X, yy=Y){
  # beta_0 = mean(y) - colMeans(x) %*% beta; with centered X this is ~mean(yy)
  intercept<-mean(yy)-apply(xx,2,mean)%*%beta
  intercept
}

So the full solution is given by:

GA.lasso <- ga(type = "real-valued", fitness = lasso.loss,
               lower = low[-1], upper = up[-1],
               suggestions=theta, # rbind(theta, lasso.glmnet[-1]),
               popSize = 300, maxiter = max.iter,
               run = max.run, monitor = NULL)


GA.lasso<-as.numeric(summary(GA.lasso)$solution)
GA.lasso<-c(intercept.mean(GA.lasso), GA.lasso)

Zero Norm - \(L0\) Regularization

The \(L0\) regularization is very similar, but it penalizes the number of coefficients \(\beta_j \neq 0\):

\[\min_\beta \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j )^2 + \lambda \sum_{j=1}^p I_{\beta_j \neq 0}(\beta_j)\]

Where \(I_{\beta_j \neq 0}(\beta_j)\) returns 1 if \(\beta_j \neq 0\) and 0 otherwise.

The problem can be solved with the L0Learn package, which by default uses a variant of cyclic coordinate descent, or alternatively a local combinatorial search on top of coordinate descent (which typically achieves higher-quality solutions at the expense of increased running time).

# L0 regularization L0Learn package
fit.L0<-L0Learn.fit(X, Y, penalty="L0", autoLambda=FALSE, lambdaGrid=list(iper),
                    loss="SquaredError", algorithm = "CD")
L0.package<-as.numeric(coef(fit.L0))
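
For intuition, the coordinate update for the \(L0\) objective replaces the lasso’s soft-thresholding with hard thresholding: the univariate least-squares coefficient \(z\) is kept only when it lowers the squared-error term by more than \(\lambda\), i.e. when \(z^2/2 > \lambda\). Again a toy sketch of mine, assuming standardized predictors; it is not the L0Learn implementation:

# hard-thresholding: keep z only if it pays for its lambda penalty
hard <- function(z, lambda=iper) ifelse(z^2 > 2*lambda, z, 0)

Plugging hard() in place of soft() in the coordinate descent sketch above gives a toy \(L0\) coordinate descent.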

The loss can be minimized directly using GA:

L0.loss<-function(beta, xx=X, yy=Y, lambda=iper){  
  const<-1/(2*length(yy)) # 1/(2n), to match the formula above
  ols<-sum((yy-xx%*%beta)^2)
  J <- -((const*ols)+(lambda*sum(beta!=0)))
  return(J) 
}

GA.zero.norm <- ga(type = "real-valued", fitness = L0.loss,
             lower = low[-1], upper = up[-1],
             suggestions=theta,# rbind(theta, lasso.glmnet[-1], L0.package[-1]),
             popSize = 300, maxiter = max.iter, run = max.run, monitor = NULL)

GA.zero.norm<-as.numeric(summary(GA.zero.norm)$solution)
GA.zero.norm<-c(intercept.mean(GA.zero.norm), GA.zero.norm)

Results

We can compare the results of GA with the true beta (since we know the data-generating process) and with their package counterparts. Note that because X was standardized after being generated with standard deviation 5, the fitted coefficients refer to the standardized scale and are roughly five times the corresponding entries of beta_true:

beta.problema<-rbind(beta_true=beta_true,
                     ols.beta=ols.beta,
                     GA.ols=GA.ols,
                     lad.package=lad.package,
                     GA.lad=GA.lad,
                     quantile.fit,
                     GA.quant.05=GA.quant.05,
                     GA.quant.50=GA.quant.50,
                     GA.quant.95=GA.quant.95,
                     lasso.glmnet=lasso.glmnet,
                     GA.lasso=GA.lasso,
                     L0.package=L0.package,
                     GA.zero.norm=GA.zero.norm)

kable(round(beta.problema,2))
(Intercept) X1 X2 X3 X4 X5 X6 X7 X8
beta_true 13.00 0.00 0.00 9.16 11.60 10.33 0.00 10.49 0.00
ols.beta 11.30 0.43 -0.10 46.21 57.34 53.04 -0.24 52.34 0.09
GA.ols 11.32 0.40 -0.07 46.14 57.31 53.03 -0.23 52.26 0.09
lad.package 10.99 0.41 -0.48 46.42 57.20 52.83 -0.37 52.65 0.12
GA.lad 10.98 0.39 -0.47 46.43 57.18 52.81 -0.36 52.66 0.14
tau= 0.05 -22.26 1.40 0.59 45.40 57.32 53.27 -0.53 52.32 -0.65
tau= 0.50 10.99 0.41 -0.48 46.42 57.20 52.83 -0.37 52.65 0.12
tau= 0.95 44.48 -0.10 0.15 45.98 56.80 52.67 0.04 51.87 0.93
GA.quant.05 -22.22 1.46 0.51 45.35 57.27 53.31 -0.48 52.21 -0.70
GA.quant.50 10.99 0.42 -0.47 46.44 57.18 52.79 -0.37 52.66 0.14
GA.quant.95 44.53 -0.14 0.18 45.79 56.70 52.53 0.05 51.84 0.91
lasso.glmnet 11.30 0.42 -0.09 46.20 57.33 53.03 -0.23 52.33 0.08
GA.lasso 11.30 0.42 -0.09 46.20 57.27 53.02 -0.22 52.28 0.06
L0.package 11.30 0.00 0.00 46.20 57.34 53.04 0.00 52.34 0.00
GA.zero.norm 11.30 0.42 -0.12 46.17 57.31 52.94 -0.22 52.24 0.09

All the GA results are close to their package counterparts, except for the zero norm and the L0Learn package, which tends to produce a sparser solution. This could be due to numerical approximation or simply to a different parametrization, since the package authors don’t state the full mathematical formula used in the package; if that is the case, we are optimizing different functions, where the hyperparameter \(\lambda\) may have a different meaning. Note also that a real-valued GA almost never produces coefficients that are exactly zero, so the \(L0\) penalty is effectively constant during the search and the GA solution is not sparse.

It’s possible to calculate the predictions and errors from the \(\beta\). Keep in mind that X.intercept contains the original, unstandardized predictors, while all the fitted coefficients refer to the standardized X; this scale mismatch inflates the MAE of the fitted models relative to beta_true:

predictions<-X.intercept%*%t(beta.problema)
errors<-apply(predictions, 2, function(x) Y-x)
mae<-apply(errors, 2, function(x) mean(abs(x)))
kable(mae, col.names = "MAE")
MAE
beta_true 16.11658
ols.beta 331.65362
GA.ols 331.30705
lad.package 331.88195
GA.lad 331.84284
tau= 0.05 332.02116
tau= 0.50 331.88195
tau= 0.95 329.40789
GA.quant.05 331.70249
GA.quant.50 331.81229
GA.quant.95 328.50262
lasso.glmnet 331.57432
GA.lasso 331.33601
L0.package 331.66200
GA.zero.norm 331.12767

The regularized methods don’t outperform the others, since I haven’t chosen the hyperparameter \(\lambda\) by validation: I just chose a value that replicates the results of the packages.
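
Had the goal been predictive performance, \(\lambda\) could have been tuned by k-fold cross-validation; a minimal sketch with cv.glmnet (not run in this post):

# 10-fold cross-validation over a grid of lambda values (glmnet defaults)
cv.fit <- cv.glmnet(X, Y, family = "gaussian", alpha = 1)
best.lambda <- cv.fit$lambda.min # lambda minimizing the cross-validated error
fit.cv <- glmnet(X, Y, family = "gaussian", lambda = best.lambda)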

Conclusion

Genetic Algorithms can be very useful when we don’t have a closed-form solution or when we can’t employ other, more reliable numerical methods (especially for non-convex problems). This article is mostly an exercise in using Genetic Algorithms for regression, but the real advantage of GA emerges with less trivial losses to minimize.

At the same time, we have shown that GA can give a good approximation of other numerical methods, but it is very slow, and it is generally better to employ smarter solutions.

Genetic Algorithms can be a useful first step to try to solve problems we don’t fully understand, a smart use of brute force.
