Regression Models: GAM, GLM and GLMM
Overview
The generalized linear model (GLM) is a cure for some issues posed by ordinary linear regression. In the well-known linear regression model, we often assume $y = X\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, i.e. the response is a linear function of the covariates plus Gaussian noise.
GLM
We note that a GLM has three major parts:
- An exponential-family probability distribution for the response $Y$; some examples include:
  - normal
  - exponential
  - gamma
  - chi-squared
  - beta
  - Dirichlet
  - Bernoulli
  - categorical
  - Poisson
- A linear predictor $\eta = X\beta$ (in GLM it is linear in the covariates; in extended models it can be other things, see GAM and GLMM). The coefficients can be estimated via maximum likelihood or Bayesian methods such as the Laplace approximation and Gibbs sampling.
- A link function $g$ such that $E[Y \mid X] = \mu = g^{-1}(\eta)$ (sometimes we may also have a tractable form for the variance $\mathrm{Var}(Y \mid X)$ as a function of $\mu$).
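To make the three parts concrete, here is a minimal sketch of fitting a Poisson GLM (log link) with statsmodels; the synthetic data and variable names are assumptions purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: counts whose log-mean is linear in x (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.5 + 1.2 * x))

X = sm.add_constant(x)                               # linear predictor eta = b0 + b1 * x
model = sm.GLM(y, X, family=sm.families.Poisson())   # exponential family; default log link
result = model.fit()                                 # maximum-likelihood estimation
print(result.summary())
```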
1. Pros and Cons for GLM and GLMM
Pros:
- Easy to interpret
- Easy to grasp
- Coefficients can be further used in numerical models
- Easy to extend: link functions, fixed and random effects, correlation structures
Cons:
- Not good for dynamic models (the model is not linear, and a transformation may not help or may lose information)
Generalized additive models (GAMs)
- GAMs are extensions to GLMs in which the linear predictor $\eta$ is not restricted to be linear in the covariates but is the sum of smooth functions applied to each covariate $x_j$. For example, $\eta = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)$
- Useful if the relationship between $Y$ and $X$ is likely to be non-linear but we don't have any theory or mechanistic model to suggest a particular functional form
- Each covariate $x_j$ is linked with $Y$ by a smoothing function $f_j$ instead of a coefficient
- GAMs are data-driven rather than model-driven, that is, the resulting fitted values do not come from an a priori model (non-parametric)
- All of the distribution families allowed with GLM are available with GAM
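As a rough illustration, statsmodels exposes a penalized-spline GAM through GLMGam; the sketch below fits a single smooth term on synthetic data. The variable names, degrees of freedom, and penalty weight are placeholder choices, not recommendations.

```python
import numpy as np
import pandas as pd
from statsmodels.gam.api import GLMGam, BSplines

# Synthetic non-linear relationship (illustrative only)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 300)})
df["y"] = np.sin(df["x"]) + rng.normal(scale=0.3, size=len(df))

# B-spline basis for the smooth term f(x); alpha is the smoothness penalty weight
bs = BSplines(df[["x"]], df=[10], degree=[3])
gam = GLMGam.from_formula("y ~ 1", data=df, smoother=bs, alpha=1.0)
res = gam.fit()
print(res.summary())
```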
1. Pros and Cons for GAM
- Pros:
- By combining basis functions, GAMs can represent a large number of functional relationships (to do so they rely on the assumption that the true relationship is likely to be smooth, rather than wiggly)
- Particularly useful for uncovering nonlinear effects of numerical covariates, and for doing so in an "automatic" fashion
- More flexible, as $Y$ is now associated with each $X_j$ through a smoothing function instead of a single coefficient
- Cons:
- Interpretability suffers: the smooth terms $f_j$ have no single coefficient and need to be read off graphically
- Coefficients are not easily transferable to other datasets or parameterizations
- Very sensitive to gaps in the data and to outliers
- Lack an underlying theory for the use of hypothesis tests; one solution is to bootstrap and aggregate the results for more reliable confidence bands (see the sketch after this list)
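A minimal sketch of that bootstrap idea, written generically: `fit_predict` is a hypothetical placeholder for whichever smoother you use (a GAM, or the LOESS smoother introduced below), and the resample count and confidence level are arbitrary choices.

```python
import numpy as np

def bootstrap_band(x, y, grid, fit_predict, n_boot=200, seed=0):
    """Refit a smoother on bootstrap resamples and aggregate the fitted curves.

    fit_predict(x, y, grid) is any function returning predictions on `grid`;
    it stands in for your smoother of choice.
    """
    rng = np.random.default_rng(seed)
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # resample rows with replacement
        curves[b] = fit_predict(x[idx], y[idx], grid)
    # Pointwise 2.5% / 97.5% quantiles give an approximate 95% confidence band
    return np.percentile(curves, [2.5, 97.5], axis=0)
```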
2. Examples of GAM (different predictor representation functions):
Loess (Locally weighted regression smoothing)
- The key factor is the span width (usually set to be a proportion of the data set: 0.5 as a standard starting point)
- Main idea: Split the data into separate blobs using sliding windows and fit linear regressions in each blob/interval
- Pros:
- Easily interpretable: at each test point a local linear model is fit, so predictions can ultimately be explained by linear behaviour
- a popular way to see smooth trends on scatterplots
- Cons:
- If there are a lot of data points, fitting a LOESS over the entire range of the predictor can be slow because so many local linear regressions must be fit.
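A minimal sketch of LOESS/LOWESS smoothing with statsmodels, where frac plays the role of the span width mentioned above; the data here are synthetic and purely illustrative.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic scatter with a smooth underlying trend (illustrative only)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
y = np.sin(x) + rng.normal(scale=0.4, size=x.size)

# frac is the span: the fraction of the data used for each local regression
smoothed = lowess(y, x, frac=0.5)        # array of (x, fitted) pairs sorted by x
x_grid, y_hat = smoothed[:, 0], smoothed[:, 1]
```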
Regression Splines (piecewise polynomials over usually a finite range)
- Main constraint is that the splines must remain smooth and continuous at knots
- To avoid overfitting of splines, penalty terms are added
- The penalty term also reflects the degree of smoothness in the regression
- The less smooth the regression is (after fitting the spline functions), the higher the penalty terms
- Pros:
- Cover all sorts of nonlinear trends and are computationally very attractive because spline terms fit exactly into a least-squares linear regression framework, and least-squares models are very easy to fit (see the sketch after this list)
- Cons:
- It is possible to create multidimensional splines by creating interactions between spline terms for different predictors. This suffers from the curse of dimensionality, like KNN, because we are trying to estimate a wavy surface in a high-dimensional (many-variable) space where data points only sparsely cover the many regions of that space
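To show that spline terms drop straight into a least-squares fit, here is a minimal sketch using a cubic B-spline basis via patsy's bs() inside a statsmodels formula. The degrees of freedom and data are arbitrary illustrative choices, and no penalty term is added here (the penalized version is what GLMGam's alpha handles above).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic non-linear data (illustrative only)
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": np.sort(rng.uniform(0, 10, 300))})
df["y"] = np.cos(df["x"]) + rng.normal(scale=0.3, size=len(df))

# bs(x, df=6, degree=3) expands x into a B-spline basis; the fit is still ordinary least squares
fit = smf.ols("y ~ bs(x, df=6, degree=3)", data=df).fit()
print(fit.summary())
```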
GLMM
The model has the form $g\!\left(E[y \mid u]\right) = X\beta + Zu$, where $\beta$ are the fixed-effect coefficients, $u \sim \mathcal{N}(0, G)$ are the random effects with design matrix $Z$, and $g$ is the link function.
1. Code implementation
I recommend that beginners use the statsmodels package because the output of the .summary() function is very easy to read. Advanced users may implement the model themselves by referring to the mathematical expressions and the documentation of the following packages:
- statsmodels:
statsmodels.formula.api.mixedlm
- pymc3
- theano
- pystan
- tensorflow
- keras
2. A sample code using statsmodels
```python
import statsmodels.formula.api as smf
```
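The original snippet stops at the import, so here is a minimal sketch of how a random-intercept model could be fit with mixedlm; the dataset, column names, and formula are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical grouped data: 20 groups with 10 observations each (illustrative only)
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "group": np.repeat(np.arange(20), 10),
    "x": rng.normal(size=200),
})
group_effect = rng.normal(scale=1.0, size=20)              # random intercept per group
df["y"] = 1.0 + 0.5 * df["x"] + group_effect[df["group"]] + rng.normal(scale=0.5, size=200)

# Fixed effect for x, random intercept for each group
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
result = model.fit()
print(result.summary())
```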