A Comprehensive Introduction to Linear Regression

In predicting real-world events, there are two main types of outcomes that cover most questions. If the outcome we’re trying to predict can take on any and every value in some interval (such as temperature, bodyweight, speed, calories) then we say the outcome is continuous. If the quantity is really better represented as falling into a certain set of groups or buckets (such as cancer/not cancer, spam/not spam, car manufacturer, dog species) then we say the outcome is discrete.

Furthermore, when we are creating a statistical model to predict and draw inference about the outcome, we can use either a linear or non-linear approach (more on this below). We will be talking specifically about linear approaches in this guide, and more specifically about linear approaches to model continuous outcomes. Such approaches are, unsurprisingly, called linear regression models.

A linear model simply seeks to model our output Y as a linear function of one or many features X = [x1, x2, …, xp]. This means the model can be written as Y = β0 + β1x1 + … + βpxp, where each βi is a coefficient measuring the contribution of the feature xi to Y, and β0 is the intercept, or baseline prediction when all features are zero. Such models manifest themselves as lines or flat planes in space. Let’s look at an example below.

## importing all libraries for this guide
from sklearn.datasets import load_boston
from sklearn.mixture import GaussianMixture
import numpy as np
import statsmodels.api as smf
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, BayesianRidge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
try:
    # Python >= 3.0
    from urllib.request import urlopen
except ImportError:
    # Python < 3.0
    from urllib2 import urlopen
from scipy.stats import truncnorm, gamma, expon, norm
%matplotlib inline

## Read in Motortrend (mtcars) dataset from web
url = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
response = urlopen(url)
dat = pd.read_csv(response)

## Plot and run a simple linear regression
sns.regplot(x=dat.hp, y=dat.mpg)
Predicting the fuel efficiency of a vehicle (in Miles Per Gallon) as a function of the vehicle’s horsepower

The above data is from MotorTrend vehicle tests, with fuel efficiency (in miles per gallon) plotted against horsepower. The line represents a simple linear regression attempting to capture the relationship between (and predict) the fuel efficiency of a vehicle given its horsepower. Notice the line doesn’t cross many points and is particularly bad near the ends of the data range; however, it accurately captures what we could call the ‘spirit’ of the relationship: MPG generally decreases as horsepower goes up. We will look at ways to assess how good the fit of this model is soon. The equation for this model (the one that produced the line above) is:

mpg = 27.09 − 0.068 · hp

The parameters 27.09 and −0.068 were found using a method called Ordinary Least Squares (OLS). Looking at the regression line generated above, it would be easy to assume that what we are looking for is the conditional probability distribution of Y given X.

P(Y | X = x): the probability of different Y values given a particular value of X

However, for many real-life systems there is not a unique output produced for each input. If multiple auto manufacturers produce cars with the same amount of horsepower, there is no reason they should all have exactly the same fuel efficiency, since there are many components besides horsepower that play a part in a vehicle’s fuel efficiency. However, given a specific horsepower, it makes more sense to try to determine what the fuel efficiency would be *on average*. Thus we are actually looking for:

E[Y | X = x]: the expected value of Y given a particular value of X

But the paragraph above may spark an interesting point. The reason we cannot say all cars will have the same mpg given a certain horsepower is that there are many other factors that go into a car’s fuel efficiency. What if we had access to all those other factors? Indeed, linear models are very powerful in their ability to generalize the main idea of relationships, and adding more relevant variables (or engineering new features from what we already have) should be the first step we take before trying more complicated models. Linear models also have significant power because of their interpretability (when properly constrained and specified), speed, and general stability.

Let’s understand the problem we’re trying to solve and how it leads to the linear regression model.

We are trying to predict a vector of responses, Y, given a vector of observations of a variable X. Specifically, we are trying to predict Y as a linear function of X. Since our prediction won’t be perfect (there may be other factors affecting Y) we need an error term ϵ. Thus we should be able to write our model as:

Y = β0 + β1X + ϵ: a mathematical representation of our linear model

Now what we’re really trying to do is minimize the amount of error ϵ we make. We can rearrange the equation above so that:

ϵ = Y − (β0 + β1X)

Unfortunately, this quantity depends on the signs of our data, so we need to take either the absolute value or the square of the difference for it to represent a proper distance. For computational reasons, it is easier to minimize the squared differences. Hence, we are looking for the β0 and β1 that minimize the sum of the squared differences over all of our data points. This can be written mathematically as:

minimize over β0, β1:  Σi (yi − β0 − β1xi)²

Notice that if we have multiple input features (X1, X2, …, Xp) then we need a β term for each, and the equation above quickly becomes unwieldy:

minimize over β0, β1, …, βp:  Σi (yi − β0 − β1x1i − β2x2i − … − βpxpi)²

We can use linear algebra to significantly simplify this equation. Let X = [X1, X2, …, Xp] so that each Xi is a feature vector with n observations (i.e. X1 = (x11, x12, …, x1n)). Let β be the vector of βi’s, one for each feature. Using matrices to represent X and Y, we can rewrite the above minimization problem much more concisely as follows:

minimize over β:  (Y − Xβ)ᵀ(Y − Xβ)

But we’re missing the β0. We can’t just add it to the β vector, because then the dimensions of the X matrix and the β vector wouldn’t conform. We can fix this with a small technical edit: simply add a column to the X matrix that is always 1 and add β0 as the first entry of the β vector. Now we can continue with our matrix multiplication. We would like to obtain the minimum of this quantity (with respect to β); we can do this by setting the derivative to 0 and solving for β. After carrying out the minimization we find that the following β minimizes the above equation. This is called the **Ordinary Least Squares (OLS)** estimator for linear regression. We place a hat above it to reflect that it is an estimator of the true value.

β̂ = (XᵀX)⁻¹XᵀY

Recall that the inverse of (XᵀX) only exists if the columns of X aren’t linearly dependent (no column can be written as a linear combination of the other columns). This means that if we have features that are just linear transformations of one another (degrees in Fahrenheit and degrees in Celsius, for example), we literally cannot solve for the optimal β. In reality, even a high degree of linear correlation will still cause problems for our model; we will explore this in a later section.
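
To make the formula concrete, here is a minimal sketch (not part of the original walkthrough) that recovers known coefficients from synthetic data using the normal equations:

## Sketch: recovering known coefficients with beta_hat = (X'X)^(-1) X'Y on synthetic data
rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y_sim = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=100)  # true coefficients: 2, 3, -1.5
X_sim = np.column_stack([np.ones(100), x1, x2])                      # column of 1s for the intercept
beta_hat = np.linalg.solve(X_sim.T @ X_sim, X_sim.T @ y_sim)         # solve (X'X) beta = X'Y
print(beta_hat)                                                      # should be close to [2, 3, -1.5]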

Let’s estimate a linear regression using the OLS process we derived above. We will use a sample dataset of tips at a restaurant.

# Loading tips dataset
tips = sns.load_dataset("tips")
print(tips.head())
# Building linear regression model with statsmodels
X = tips.total_bill
y = tips.tip
tips_lr = smf.OLS(y,X).fit()
# Plotting a linear model with seaborn
sns.regplot(x=X, y=y)
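
The fitted coefficient itself can be read off the statsmodels results object (a small addition to the snippet above):

print(tips_lr.params)  # the estimated slope on total_bill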

Under some conditions, the OLS estimator is the BLUE (best linear unbiased estimator) for the parameters β, essentially the best we can do given X and Y. These conditions are fairly straightforward and often referred to as the Gauss-Markov Conditions. Firstly, observe that we have an ϵ error term as part of the population equation that we are trying to estimate. If this term has some kind of pattern in it (or is a constant), we could use it for modeling to make our model better. Hence, to make sure the ϵ term captures the actual unpredictable error noise in trying to predict Y using X, we must assume a few things:

The Gauss-Markov conditions for the linear regression residuals:

(1) The residuals have a mean of zero

(2) The residuals have constant variance (homoskedasticity)

(3) The residuals are uncorrelated with one another (no autocorrelation)

The first assumption is satisfied by construction of the OLS estimator. We can verify the other two assumptions are satisfied by looking at the residual plot for the fitted regression. This is simply a plot of the predicted values from the regression vs. their distance from the actual value (called the residual). It is useful to standardize the residuals (by subtracting their mean then dividing by their standard deviation) in order to make them comparable between models. These residuals correspond to the ϵ term when using a sample estimator on real data. Let’s calculate and plot the residuals from the example above:

fitted = tips_lr.predict(X)
residual = y - fitted
residual_mean = np.mean(residual)
residual_std = np.std(residual)
std_residual = (residual - residual_mean) / residual_std  # standardize: subtract the mean, divide by the std. dev.
print("Mean of residuals: " + str(np.round(residual_mean, 3)))


sns.regplot(x=fitted, y=std_residual, fit_reg=False)

The X-axis is the predicted value and the Y-axis is the standardized residual for that predicted value. We can deduce a few things by studying the residual plot above. Namely, the ‘spread’ (or variance) of the points seems to increase with X. This is not good news, as it violates our assumption that the residuals have constant variance (2). There is no obvious ‘pattern’ from one residual to the next, so there doesn’t seem to be a high degree of autocorrelation. Let’s examine how to check for each of these issues and what their consequences are:

Sometimes the spread of the residuals increases or decreases with different values of the predictor variables. This means the variance of the residuals carries information about the predictors and is thus not random error. This occurs quite frequently in practice, as many real-life processes are highly non-linear or chaotic at the extremes (in our context, tips for very small or very large bills may not be strictly linear compared to tips for average bills). Looking at the above plot, the variance of the points seems to be increasing with X. Given that we’ve just studied linear regression, a natural question might be whether we can simply regress the (squared) errors on X and examine the resulting linear regression model. This would be an equation of the form:

ϵ̂² = γ0 + γ1X + u

This is the premise of the Breusch-Pagan test for homoskedasticity (in fact, the test statistic is simply the R² of the regression of the squared residuals against X, multiplied by the number of observations). The null hypothesis of the Breusch-Pagan test is homoskedasticity, that is, that the variance of the residuals does not depend on the X values. High values of the test statistic (and the corresponding low p-values) indicate the presence of heteroskedasticity. This test (like many others) may not be reliable with highly non-normal data. For our tips regression, the test yields a significantly high test statistic of 78.72 (corresponding to a p-value < 0.0001) and thus supports our intuition that there is heteroskedasticity present in our regression model’s residuals.

test = sms.het_breuschpagan(tips_lr.resid, tips_lr.model.exog)
print("Breusch-Pagan Test Statistic: " + str(test[0]))
print("Breusch-Pagan p-value: " + str(test[1]))

Because of this, our OLS estimator for linear regression is not BLUE. Specifically, it does not have the lowest possible variance out of all estimators. Depending on the severity of the problem this may or may not affect significance tests, but it definitely biases the standard errors (often underestimating the actual variance). This leads to inaccurate confidence bounds. Luckily, there exist some simple methods to overcome heteroskedasticity (such as applying a log() or power transform to the response) that we will discuss in the next section.
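
As a quick preview of the log-transform remedy just mentioned (a sketch only; whether it fully removes the heteroskedasticity here is not guaranteed), we can refit the model on log(tip) and re-run the test:

## Sketch: refit with a log-transformed response and re-check Breusch-Pagan
tips_lr_log = smf.OLS(np.log(y), X).fit()
test_log = sms.het_breuschpagan(tips_lr_log.resid, tips_lr_log.model.exog)
print("Breusch-Pagan after log transform: " + str(test_log[0]) + " (p-value " + str(test_log[1]) + ")")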

Sometimes the overall variance of the residuals might be constant, but the residuals exhibit a clear ‘pattern’ such as a sine wave. Essentially this means that each residual contains information about the next or previous residuals (and thus about the response variable). This usually occurs when a linear regression is mistakenly applied with time as a predictor (time-series models are more appropriate here) or when the response variable is cyclic in nature. It often (but not always) occurs alongside heteroskedasticity, and the combination of both can seriously hamper the reliability and interpretability of a linear regression model. Intuitively, if we consider our residuals resid = [r1, r2, …, rn] (where the residual ri corresponds to Yi), then autocorrelation means that there exists a relationship between ri and ri+1, or between ri and ri−1. We can test for this using the Durbin-Watson test for autocorrelation.

Consider the following equation:

D = Σt=2..n (rt − rt−1)² / Σt=1..n rt²

The Durbin-Watson test statistic for autocorrelation

D is large when the difference between successive residuals is large relative to the sum of the squared residuals. We will return to the derivation of this test statistic in the time-series chapter, but for now it is important to know that when D = 2 there is no autocorrelation, when D < 2 there is evidence for positive autocorrelation, and when D > 2 there is evidence for negative autocorrelation. Autocorrelation is not very easy to mitigate and likely indicates a time-dependence of either the predictors or the response, or an inappropriate choice of model. If linear regression must be used, the Cochrane–Orcutt procedure exists to adjust for time dependence by modeling the residuals separately, but it is outside the scope of this text. For our tips data, the Durbin-Watson statistic is roughly 2.13, indicating no serious problems with autocorrelation.

sms.durbin_watson(tips_lr.resid)
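
The statistic is simple enough to compute directly from the residuals, which serves as a sanity check on the library value (a sketch, not part of the original code):

## Sketch: Durbin-Watson computed by hand from the definition above
resid_vals = np.asarray(tips_lr.resid)
D = np.sum(np.diff(resid_vals) ** 2) / np.sum(resid_vals ** 2)
print("Durbin-Watson (manual): " + str(D))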

While autocorrelation and non-constant variance represent some of the more serious issues in linear regression (their absence, together with the other Gauss-Markov conditions, is what makes the OLS estimator BLUE), an additional assumption of residual normality is important to make sure some of the statistical tests and variance bounds we draw for our regression are valid. A regression where the residuals are normally distributed gives us much more confidence in being able to quantify the uncertainty around some of our parameters. In addition, serious violations of normality in the residuals can indicate other problems (such as outliers) that might crop up. The easiest way to check for normality is with a Quantile-Quantile plot, which shows where the quantiles of one dataset fall compared to another. We will use this to check where the quantiles of our residual values fall versus where they would fall if they followed a perfectly normal distribution. If the quantiles line up exactly, the points will fall on a straight line.

smf.qqplot(tips_lr.resid, line='45')

Dealing with categorical variables

Our independent variables are not always continuous, even if we are trying to predict a continuous value. For example, we might be trying to predict the fuel efficiency of a car given its horsepower and make. If there are 20 makes and we label them 1–20, we implicitly tell the model that make 19 and make 20 are ‘closer’ than make 1 and make 20. This is likely not the case. The simple fix is to create new features, one per make, that are 1 if the particular observation is of that make and 0 otherwise. In general, these features are called dummy variables. There will be C−1 dummy variables if we have C categories (as the remaining category corresponds to setting all the other dummy variables to 0). In our tips data we have a feature ‘day’ indicating the day of the week, which takes the values ‘Thur’, ‘Fri’, ‘Sat’, ‘Sun’. We will create 3 new dummy variables that indicate the day of the week. We will also need one dummy variable for the two ‘sex’ categories, one dummy variable for the two ‘smoker’ categories, and finally one dummy variable for the two ‘time’ categories.

tips.head()
tips_w_dummies = pd.get_dummies(tips, drop_first=True)
tips_w_dummies.head()
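
To see how the dummies enter the model, here is a hedged sketch (not part of the original walkthrough) fitting an OLS regression on the dummy-encoded data; each dummy gets its own coefficient, interpreted relative to the dropped baseline category:

## Sketch: OLS on the dummy-encoded tips data
X_dum = tips_w_dummies.drop(columns='tip').astype(float)  # cast boolean dummies to 0/1
y_dum = tips_w_dummies.tip
dummy_lr = smf.OLS(y_dum, smf.add_constant(X_dum)).fit()
print(dummy_lr.params)                                    # one coefficient per dummy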

Given that a linear regression demonstrates no significant autocorrelation or heteroskedasticity, we can say that our OLS linear regression is BLUE. However, this doesn’t mean that all aspects of our regression analysis are accurate and actionable. In particular, we’ve seen in the linear regression equation that if there are columns of X (features) that are linear transformations of one another (say X2 = 2X1 + 10) then we cannot invert the XᵀX matrix and derive the OLS estimator. However, when there is a high degree of linear correlation between features (say X2 is only roughly equal to 2X1 + 10, plus some small error), our estimates of the regression coefficients can still become distorted. If the coefficients are being used for interpretation this can lead to highly misleading conclusions. We can first evaluate the linear correlation between features using a correlation matrix.

tips_w_dummies.corr()

It is worth noting a few things about this matrix:

(1) The values on the diagonal are always 1, as they represent the correlation of a variable with itself

(2) The matrix is symmetric about the diagonal; since Cor(X1, X2) = Cor(X2, X1) we only need to inspect values above or below the diagonal.

(3) It doesn’t make sense to look at correlations between the dummy variables within a single category (for example day_Sat vs. day_Sun), since these are mutually exclusive by construction (if it’s Saturday, it cannot be Sunday) and their correlations depend only on the fraction of observations in each category

(4) High correlations of features with the response (tip in this case) are good

In general, it is hard to simply look at the correlation matrix above and judge which values are problematic for our model. We notice that ‘size’ and ‘total_bill’ are highly correlated (0.598) which makes sense given that larger groups of people usually order more food and thus have larger bills. We also notice a rather large positive correlation between the day being Saturday and the order being for Dinner time. Being female also seems to negatively correlate with bill size (makes sense if females eat less on average). Does this indicate any issues that need to be taken care of?

One way to look at this problem is to measure how much more uncertainty (variance) we introduce into our parameter estimates if we include vs. exclude a feature. This is aptly called the Variance Inflation Factor (VIF) of the feature. To calculate it for a particular feature Xi, we pretend that Xi is the response variable and regress it on the other features. This gives us Ri², the R² value of the model using Xi as the response. Then the following simple transform (derived from the equation for the standard error of βi) gives us the VIF for Xi:

VIFi = 1 / (1 − Ri²)

numerical_features = ['total_bill', 'size']
for feature_idx in range(0, tips_w_dummies.shape[1]):
    feature_name = tips_w_dummies.columns[feature_idx]
    if feature_name in numerical_features:
        vif = variance_inflation_factor(tips_w_dummies.values, feature_idx)
        print(str(feature_name) + " : VIF = " + str(vif))
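
The VIF formula above can also be checked by hand for a single feature. The sketch below regresses total_bill on the remaining columns (the response column is excluded here, so the number may differ somewhat from the library call above, which was handed every column of the data frame):

## Sketch: VIF for total_bill computed directly as 1 / (1 - R^2)
others = tips_w_dummies.drop(columns=['tip', 'total_bill']).astype(float)
target = tips_w_dummies['total_bill']
r_squared = LinearRegression().fit(others, target).score(others, target)
print("VIF for total_bill (manual): " + str(1.0 / (1.0 - r_squared)))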

Given that we’ve done the proper pre-processing and dealt with the few major issues in regression outlined above, we should have a fairly good model. However, things change a bit when we’re using a really large number of features. Firstly, generating the correlation matrix or VIF scores might become computationally difficult. Unfortunately, large feature spaces also lend themselves to high degrees of collinearity, either by chance or because of latent variables (these will be studied later). Models generated with a large number of features using unmodified linear regression tend to be unreliable for interpretation, mainly because of the collinearity issues and high parameter variances associated with having many features. Secondly, models generated with a large number of features are more prone to overfitting, or calibrating too closely to noise in the data, which leads to a poor ability to generalize to unseen data. Both of these problems can be addressed using shrinkage methods, which have the basic idea of trying to drive the coefficients of ‘unhelpful’ features close to zero. This not only prevents overfitting in high-dimensional data but can also address collinear features by shrinking one of them close to zero. First, recall that the OLS estimator we found was the optimal solution to minimizing the sum of squared errors:

β̂ = (XᵀX)⁻¹XᵀY, the minimizer of (Y − Xβ)ᵀ(Y − Xβ)

In some settings (such as with large numbers of highly correlated features), this problem is ill-posed, meaning there is no single unique solution. This can lead to multiple problems, not the least of which are under- or over-fitting the model and having distorted parameter estimates. We will attempt to solve this by adding another constraint besides just minimizing the sum of squared errors: also minimizing the size of the coefficients. Thus we look for a solution that minimizes the sum of squared errors while keeping the β vector small. Before we go further, let us refresh the idea of measuring the ‘size’ of a vector. For a vector in p dimensions, centered at the origin, the length of the vector in Euclidean space is given by:

‖β‖2 = √(β1² + β2² + … + βp²)

This is also known as the L2 norm of the vector β (because of the exponent of 2 on each coordinate and the degree of the root). We use this terminology because we can simply vary the power in the sum above to arrive at a measure of distance in a different space. For our first estimator we simply use the L2 norm to define the size of the β vector. Thus, to minimize both the error and the size of the β vector, we can pose the problem as minimizing the penalized sum of squares, as follows:

(Y − Xβ)ᵀ(Y − Xβ) + λ‖β‖2²

The solution to this more strongly constrained optimization problem is given by the Ridge estimator for linear regression:

β̂ridge = (XᵀX + λI)⁻¹XᵀY
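
As a quick numerical check (a sketch on synthetic data, with no intercept so the raw formula applies directly), the closed form matches what scikit-learn's Ridge produces when fit_intercept=False:

## Sketch: ridge closed form (X'X + lambda*I)^(-1) X'Y vs. scikit-learn's Ridge
rng = np.random.RandomState(1)
X_r = rng.normal(size=(50, 3))
y_r = X_r @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
lam = 2.0
beta_closed = np.linalg.solve(X_r.T @ X_r + lam * np.eye(3), X_r.T @ y_r)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X_r, y_r).coef_
print(beta_closed)
print(beta_sklearn)  # the two should agree up to numerical precision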

Note that since we have artificially altered the problem to search for smaller β solutions, this estimator is no longer unbiased. However, a property that may seem surprising is that there always exists at least one λ such that the Ridge estimator has a lower MSE (mean squared error) than the OLS estimator, meaning we can always use the bias introduced by λ to lead us to a better solution. Choosing this optimal λ exactly is not feasible, but depending on the values of the features, coefficients, and standard errors there is often a range of λ where we can achieve this property. We can try different values of λ and settle on the one that gives us the best mean squared error using cross-validation. The Ridge estimator also helps to prevent overfitting and has monotonically decreasing variance (but monotonically increasing bias) as λ gets bigger. We can use cross-validation on our data and see how well the Ridge estimator performs:

X = tips_w_dummies.drop(columns='tip')
y = tips_w_dummies.tip
X_train, X_test, y_train, y_test = train_test_split(X, y)

best_alpha = 0
best_mse = 1000000000
for alpha in [0, 0.2, 0.4, 0.6, 0.8, 1]:
    tips_ridgereg = Ridge(alpha=alpha)
    mse = cross_val_score(tips_ridgereg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    mse = np.abs(np.mean(mse))
    if mse < best_mse:
        best_alpha = alpha
        best_mse = mse
    if alpha == 0:
        print("Unpenalized Linear Regression MSE = " + str(mse))
    else:
        print("Ridge Regression (alpha=" + str(alpha) + ") MSE = " + str(mse))
print("*** Best CV-MSE at alpha=" + str(best_alpha) + " ***")

Now, recall that we tried to minimize the size of the β vector, in Euclidean space. Nothing stops us from trying to change the loss term to penalize for large beta vectors in a different space, such as the Manhattan (or L1) space. In this space the distance is represented by the sum of the absolute values of the coordinates. Let’s go over the distance metrics one more time to see what form these both take:

‖β‖2 = √(β1² + β2² + … + βp²): the vector’s distance from the origin in Euclidean (or L2) space
‖β‖1 = |β1| + |β2| + … + |βp|: the vector’s distance from the origin in Manhattan (or L1) space

Notice that all that’s changed is the exponent on each term (which has gone from 2 to 1; the absolute values are implied in Euclidean space since all squared values are positive) and the degree of the root. Now, we can edit the problem to minimize the L1 penalized sum of squares:

(Y − Xβ)ᵀ(Y − Xβ) + λ‖β‖1

Unlike Ridge and OLS, there is no closed-form equation for the lasso estimator (the solution to the equation above), and we must use something like coordinate descent to arrive at the estimator. Just as before, we can choose λ with cross-validation. Because Lasso() below uses coordinate descent rather than plugging into an equation, we skip the plain linear regression case (the λ=0 we tried above), which the solver does not handle reliably. We can see the improvement in MSE given by the lasso below:

best_alpha = 0
best_mse = 1000000000
for alpha in [0.2, 0.4, 0.6, 0.8, 1]:
    tips_lassoreg = Lasso(alpha=alpha)
    mse = cross_val_score(tips_lassoreg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    mse = np.abs(np.mean(mse))
    if mse < best_mse:
        best_alpha = alpha
        best_mse = mse
    print("Lasso Regression (alpha=" + str(alpha) + ") MSE = " + str(mse))
print("*** Best CV-MSE at alpha=" + str(best_alpha) + " ***")

Latent Variables

We saw above how including a categorical variable (as a dummy variable) separates our regression line (or plane) into distinct parts. In our data we have categorical variables that are manifest, or observed. Occasionally, we may have features we suspect play a role in the values of the covariates (such as gender, if we have body weight as a feature), but these features are latent, or unobserved. If we were able to recover such features, they would explain a large portion of some of the covariates as well as improve the performance of our model. But how could we hope to calculate features inherently missing from our data?

Let’s generate some hypothetical data for predicting height in cm (y) from weight in lbs (X). We will draw the observations for weight 50/50 from two separate gaussian distributions, representing males and females respectively.

np.random.seed(42)
mu_1 = 185
mu_2 = 135
sd_1 = 25
sd_2 = 15
n = 150

X_male = np.random.normal(loc=mu_1, scale=sd_1, size=n)
X_female = np.random.normal(loc=mu_2, scale=sd_2, size=n)
y_male = np.random.normal(loc=175, scale=5, size=n)
y_female = np.random.normal(loc=161, scale=4, size=n)
gender = np.concatenate([np.repeat(0, n), np.repeat(1, n)], axis=0)
X = np.concatenate([X_male, X_female], axis=0)
y = np.concatenate([y_male, y_female], axis=0)

plt.scatter(X, y, marker='o', c=gender, s=25, edgecolor='k')
plt.title("Weight vs. Height with gender observed")

We can clearly see the distinction in the two groups above when the gender is manifest. However, if we take gender to be an unobserved latent variable, we would see the data as below.

plt.scatter(X, y, s=25)
plt.title("Weight vs. Height with gender latent")

We will look at a particular kind of latent variable model called a Gaussian Mixture Model. A GMM assumes that there are C Gaussian/normal distributions (C=2 in our case) from which our data can be drawn, and that each datapoint is a mixture of these Gaussians. Thus, we can write any datapoint xi as:

xi = ϕ1C1 + ϕ2C2

Where C1 and C2 are normally distributed random variables with distinct means and variances, and ϕ1 and ϕ2 are the probabilities of xi being drawn from the respective distributions.

This makes sense for our case if you imagine each individual being a linear combination of their male and female parent’s heights. If we know the probabilities of being drawn from each distribution (the ϕs) then this model is very simple to solve. However, in reality the ϕs are unobserved hence our need to use a latent variables model. All is not lost, however. We can use our data to guide our selection of the ϕs.

Imagine you only had two datapoints, one for an individual with height 5'3" and a weight of 110lbs and another for an individual with height 5'9" and a weight of 200lbs. Without knowing the genders, we can make a guess that we have a female and a male (respectively) and set our ϕs to 0.5 each. Now assume that we instead see the same individual with height 5'3" and weight 110lbs, but alongside 9 other individuals with weights above 200lbs and heights above 6'. We may now assume that we have 1 female and 9 males and set ϕ1 = 0.1 and ϕ2 = 0.9. Essentially what we are doing is setting our parameters to whatever would maximize the probability of the data we see; this is the basic idea behind the Expectation-Maximization algorithm we will discuss later on. We can use EM to find the best estimates for ϕ1 and ϕ2.

C = 2
X = X.reshape((X.shape[0], 1))
gmm = GaussianMixture(n_components=C, n_init=10)
gmm.fit(X)
for c in range(0, C):
    print("Latent class discovered with mean=" + str(gmm.means_[c]) + " and std.dev=" + str(np.sqrt(gmm.covariances_[c])))

Recall the original means and standard deviations we used to define our groups.

For C1 we had μ1=185, σ1=25

For C2 we had μ2=135, σ2=15

The estimates of 188,137 for the μs and 20,15 for the σs generated by the EM algorithm are not bad at all. We somehow recovered information about the average height for each gender without knowing anything about which datapoint is of which gender! Since the GMM gives each datapoint a probability of being from each distribution (each gender in our case), we can use the highest of the probabilities to assign each datapoint to a distribution and recover an estimate for the categorical variable (gender) we omitted.

gender_estimated = gmm.predict(X)

plt.figure(figsize=(10, 4))
plt.subplot(1, 3, 1)
plt.scatter(X, y, marker='o', s=25, edgecolor='k')
plt.title("Weight vs. Height with gender latent")
plt.subplot(1, 3, 2)
plt.scatter(X, y, marker='o', c=gender_estimated, s=25, edgecolor='k')
plt.title("Weight vs. Height with gender estimated")
plt.subplot(1, 3, 3)
plt.scatter(X, y, marker='o', c=gender, s=25, edgecolor='k')
plt.title("Weight vs. Height with gender observed")
plt.tight_layout()
Estimating the unobserved latent variable for gender using the EM algorithm

Observe that since the mixture model only sees weight, the estimated gender amounts to a hard cutoff on the weight axis, so there is a bit of error where the two groups overlap. At this point, we could attach our estimated gender variable to our original data and run a plain linear regression. We can also build individual linear models for each group, or take a stratified regression approach.

weight_lr = LinearRegression().fit(X, y)
lr_predictions = weight_lr.predict(X)
gender_estimated = gender_estimated.reshape(gender_estimated.shape[0], 1)
newX = np.concatenate([X, gender_estimated], axis=1)
weight_mm = LinearRegression().fit(newX, y)
mm_predictions = weight_mm.predict(newX)
## Compare models with and without the estimated gender feature
lr_perf = mean_squared_error(y_pred=lr_predictions, y_true=y)
mm_perf = mean_squared_error(y_pred=mm_predictions, y_true=y)
print("MSE without estimating gender")
print(lr_perf)
print("")
print("MSE using estimated gender as a feature")
print(mm_perf)
## Stratified regressions: one model per estimated gender group
for est_gender in [0, 1]:
    mask = newX[:, -1] == est_gender
    X_gender = newX[mask]
    y_gender = y[mask]
    gender_lr = LinearRegression().fit(X_gender, y_gender)
    print("MSE for gender " + str(est_gender))
    print(mean_squared_error(y_pred=gender_lr.predict(X_gender), y_true=y_gender))
    print("")

Comparing the two MSEs shows how much the estimated latent variable helps as an additional feature in our case, and the stratified regressions let us compare how easily height is predicted with a linear regression within each estimated gender group.

Linear regression provides a powerful way to model and predict a continuous response given at least one continuous feature. We saw how to estimate a linear regression model using OLS, some common pitfalls and problems with this approach, and how to circumvent them using shrinkage methods and diagnostic tests. We also learned how we can use latent variables modeling to discover new discrete features to add to our linear regression models. Thank you for reading.
