Linear Regression and Matrices
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
1 Load Data
Boston = pd.read_csv("../dataset/Boston.csv")
2 Simple Linear Regression
The ISLR2 library contains the Boston data set, which records medv (median house value) for \(506\) census tracts in Boston. We will seek to predict medv using \(12\) predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
Boston.head()
|   | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 |
| 2 | 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 |
| 3 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 |
| 4 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 |
To find out more about the data set, we can type ?Boston.
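Since we are working in Python rather than R, a rough equivalent (a small sketch, not part of the original lab) is to inspect the DataFrame directly with pandas:

# quick look at the columns, dtypes, and summary statistics (pandas built-ins)
Boston.info()
Boston.describe()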
We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
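For reference, a close Python analogue of that R call is the statsmodels formula API; a minimal sketch (not used in the rest of this notebook):

import statsmodels.formula.api as smf
# analogue of R's lm(medv ~ lstat, data = Boston); the intercept is added automatically
lm_fit = smf.ols('medv ~ lstat', data=Boston).fit()
print(lm_fit.params)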
x = Boston['lstat']
y = Boston['medv']
2.1 Plotting Gotcha: Remember to Add/Remove Bias column
We calculate the best-fit model by appending a bias column to X, but when plotting we need to remember to remove the bias column.
- Convert_NP: converts a pandas.Series to an np.ndarray
- X_NoBias: will be used for plotting
- X: input vector lstat appended with the bias column
- bias: column vector of 1's
  - This allows us to get the Y-intercept, or Beta_0, from linear regression
Convert_NP = lambda g: g.to_numpy().reshape(-1,1).astype('float32') # convert to one-column ndarray
X = Convert_NP(x) # shape (506,1)
Y = Convert_NP(y) # shape (506,1)

X_NoBias = X # USED FOR PLOTTING PURPOSES
bias = np.ones((X.shape[0],1)).astype('float32') # column of 1's represents the intercept coefficient
X_bias = np.append(X,bias,axis=1) # shape (506,2)
X = X_bias
print(f"X-dim: {X.shape}\nY-dim: {Y.shape}")
X-dim: (506, 2)
Y-dim: (506, 1)
x_manual = np.linspace(Y.min()-Y.std(), Y.max()+Y.std(), Y.shape[0])
plt.scatter(X_NoBias, Y)

3 Manual
\[ \hat{\beta} = (X^T X)^{-1} X^T Y \]
Betas = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)),X.T),Y)
print(f"Betas: {Betas}")
Betas: [[-0.95004904]
[34.55383 ]]
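As a side note (a sketch, not in the original): explicitly inverting \(X^T X\) is usually avoided in practice; np.linalg.solve applied to the normal equations gives the same coefficients more stably:

# solve the normal equations without forming the explicit inverse
Betas_alt = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Y))
assert np.allclose(Betas, Betas_alt, atol=1e-3)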
\[ \hat{Y} = X \hat{\beta} \]
medv \(= \beta_0 + \beta_1\) lstat
Y_hat = np.dot(X, Betas) # shape (506,1)
x_manual = np.linspace(Y_hat.min()-Y_hat.std(), Y_hat.max()+Y_hat.std(), Y_hat.shape[0])

plt.scatter(X_NoBias, Y)
plt.plot(X_NoBias, Y_hat, color='g')

\[Residual = e_i = Y_i - \hat{Y}_i \tag{reducible error}\]
\[RSS = \sum_i (Y_i-\hat{Y}_i)^2 = \sum_i e_i^2\]
\[RSS = \sum Residual^2\]
\[MSE = \frac{1}{n} RSS \tag{loss function}\]
Residual = Y - Y_hat
RSS = (Residual**2).sum()
MSE = (1/Y_hat.shape[0])*RSS
assert RSS == ((Y-Y_hat)**2).sum()
print(f"Residual{Residual[:3]}\nRSS: {RSS}\nMSE: {MSE}")
Residual[[-5.822586 ]
[-4.27038 ]
[ 3.9748688]]
RSS: 19472.3828125
MSE: 38.48296998517786
3.1 Prob-plot
A prob-plot (probability plot) is what ISLR actually uses, although it misleadingly labels it a QQ plot. A QQ plot plots the data against theoretical quantiles, which we obtain by applying the inverse CDF to the (non-negative) probabilities.
In a prob-plot, the theoretical quantiles (calculated here using Filliben's estimate) are plotted on the x-axis and our data's z-scores are plotted on the y-axis.
Gotcha: sort the residuals before plotting, and REMEMBER TO USE THE PARAMETER axis=0.
The x-axis points divide the bell curve into n quantiles, n being the number of residuals you have.
This means that the areas under the bell curve bounded between consecutive points are all equal.
Since the bell curve is taller in the middle, the points are denser in the middle (less distance between them) and spread apart at the ends.
A z-score tells us how many standard deviations we are from the mean, and can be positive or negative.
A quantile is an equal-area partition of the probability density function.
- CDF :: z-score --> 100-quantile (note that the 100-quantile is just the percentile)
- InvCDF (stats.norm.ppf) :: 100-quantile --> z-score
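A tiny round-trip check of that mapping (illustrative sketch using scipy's standard normal):

z = 1.5
p = stats.norm.cdf(z)       # z-score -> probability (Phi(1.5) is about 0.933)
z_back = stats.norm.ppf(p)  # probability -> z-score, recovers 1.5
print(p, z_back)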
Resid_normed = Residual/Residual.std()

def Filliben(val):
    n = val.shape[0]
    fst = 1-(0.5**(1/n))
    midd = [(i-0.3175)/(n+0.365) for i in range(1,n-1)]
    lst = 0.5**(1/n)
    return np.append(np.append(fst,midd),lst)

Resid_normed.sort(axis=0)
print(np.append(Resid_normed[0], Resid_normed[1]))
InvertedPercentile = [stats.norm.ppf(i) for i in Filliben(Resid_normed)]
plt.scatter(InvertedPercentile,Resid_normed)
[-2.4449956 -1.6037545]

Residual_Standardized = Residual/Residual.std()
# notice the standardized residual is the z-score
normdist = np.random.normal(size=506)
print(Residual_Standardized.mean())
print(Residual_Standardized.std())
1.2740787e-06
1.0
fig,ax = plt.subplots(2,2)
ax[0,0].scatter(Y_hat,Residual,facecolors="none",edgecolors="black")
ax[0,0].set_title("yhat vs e")
#yy = [stats.percentileofscore(Residual_Standardized, a, 'rank') for a in Residual_Standardized]
#yy2 = np.sort(yy)
SortedResidStand = np.sort(Residual_Standardized,axis=0)
normdist = np.random.normal(size=506)
Sortednormdist = np.sort(normdist)
ax[0,1].scatter(Sortednormdist,SortedResidStand)
ax[1,0].scatter(Y_hat,np.sqrt(np.abs(Residual_Standardized)))

plt.show()

- yhat vs e (Top Left): used to check whether there is a non-linear pattern
  - The data points look like a curve, clearly not linear
- Normal QQ Plot (Top Right): tells us the residuals follow a normal distribution if they fall along the y=x diagonal
  - The top part shows a set of points above the y=x diagonal, indicating heavy tails / a higher chance of seeing extreme values
- Scale-Location Plot (Bottom Left): tells us whether the residuals have non-constant variance (heteroscedasticity)
  - If our data were linear, constant variance would show up as a horizontal line pattern
  - Since our data is non-linear, it is hard to tell whether the non-horizontal pattern is due to changing variance or to non-linearity
- Residuals vs Leverage/Influence Plot (Bottom Right): leverage points are outliers along the x-axis
  - Leverage points should be removed
  - Anything beyond the Cook's distance boundary of 0.5 is an influential point to be removed (a rough by-hand sketch of leverage and Cook's distance follows below)
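A rough by-hand version of the leverage and Cook's distance used in that last panel, assuming the X, Residual, and RSS computed above (a sketch; statsmodels' influence measures are used later in Section 4):

# hat matrix H = X (X^T X)^-1 X^T; its diagonal is the leverage of each observation
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H).reshape(-1, 1)
p = X.shape[1]                      # number of coefficients (slope + intercept)
s2 = RSS / (X.shape[0] - p)         # estimate of the residual variance
CooksD = (Residual**2 / (p * s2)) * h / (1 - h)**2
print(CooksD.max())                 # points with D > 0.5 would be flagged by the 0.5 boundary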
\[ Variance(\hat{Y}) = \sigma^2 = \frac{\sum_i (\hat{Y}_i-\overline{\hat{Y}})^2}{n}\]
\[ TSS = \sum_i (Y_i-\overline{Y})^2 \]
\[ Stddev(\hat{Y}) = \sigma\]
\[R^2 = 1-\frac{RSS}{TSS} \]
- TSS is just the numerator of the variance of the response \(Y\)
- TSS can be considered the background variance of our outputs, centered around their mean
- RSS can be considered the variance of our data points centered around the best-fit line
- RSS/TSS behaves like a normalized measure of how scattered or clumped our points are around the best-fit line (an \(R^2\) check follows after the code below)
Var_Yhat = ((Y_hat - Y_hat.mean())**2).sum()/Y_hat.shape[0]
Std_Yhat = Var_Yhat**0.5
TSS = ((Y - Y.mean())**2).sum() # total sum of squares of the response
assert np.isclose(Var_Yhat, Y_hat.var())
assert np.isclose(Std_Yhat, Y_hat.std())
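With the RSS from above and TSS in hand, \(R^2\) can be checked directly; a small sketch, which should agree with the R-squared of roughly 0.544 reported by statsmodels and scipy below:

R2 = 1 - RSS/TSS
print(R2)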
\[SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\]
- The SE of the sample mean tells us how far, on average, our sample mean is from the population mean
SE_Yhat = (Var_Yhat)/Y_hat.shape[0] # squared standard error of the mean
print(SE_Yhat)
0.09078371181201081
4 Statsmodels
model = sm.OLS(Y,X)
fit = model.fit()
print(fit.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.544
Model: OLS Adj. R-squared: 0.543
Method: Least Squares F-statistic: 601.6
Date: Fri, 22 Apr 2022 Prob (F-statistic): 5.08e-88
Time: 10:33:30 Log-Likelihood: -1641.5
No. Observations: 506 AIC: 3287.
Df Residuals: 504 BIC: 3295.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.9500 0.039 -24.528 0.000 -1.026 -0.874
const 34.5538 0.563 61.415 0.000 33.448 35.659
==============================================================================
Omnibus: 137.043 Durbin-Watson: 0.892
Prob(Omnibus): 0.000 Jarque-Bera (JB): 291.374
Skew: 1.453 Prob(JB): 5.36e-64
Kurtosis: 5.319 Cond. No. 29.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
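If only individual pieces of that table are needed, they are available as attributes of the fitted results object (a brief sketch):

print(fit.params)      # coefficients
print(fit.bse)         # standard errors
print(fit.tvalues)     # t statistics
print(fit.pvalues)     # p-values
print(fit.rsquared)    # R-squared
print(fit.conf_int())  # 95% confidence intervals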
fig,ax = plt.subplots(2,2)

pp = np.sort(Residual_Standardized,axis=0); # parameter axis=0 IS IMPORTANT!!
sm.qqplot(pp,line='45',ax=ax[0,1])

leverage = fit.get_influence().hat_matrix_diag
normresid = fit.get_influence().resid_studentized_internal
model_cooks = fit.get_influence().cooks_distance[0]
ax[1,1].scatter(leverage,Residual_Standardized)
ax[1,1].set_xlim(0,max(leverage)+0.01)

plt.show()

5 Scipy
result = stats.linregress(x,y)
print(f"""
Beta_1:\t\t{result.slope}
Beta_0:\t\t{result.intercept}
R:\t\t{result.rvalue}
R^2:\t\t{result.rvalue**2}
Beta_1 pvalue:\t{result.pvalue}
SE:\t\t{result.stderr}
Beta_0 SE:\t{result.intercept_stderr}
""")
Beta_1: -0.9500493537579908
Beta_0: 34.5538408793831
R: -0.737662726174015
R^2: 0.5441462975864798
Beta_1 pvalue: 5.081103394387796e-88
SE: 0.03873341621263941
Beta_0 SE: 0.562627354988433
- Missing from scipy's output (computed by hand in the sketch below):
  - Beta_0 p-value
  - t-values for both Betas
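Those missing pieces can be filled in from what linregress does return; a sketch assuming a two-sided t-test with n - 2 degrees of freedom:

dof = len(x) - 2                                         # 506 - 2 residual degrees of freedom
t_slope = result.slope / result.stderr                   # about -24.5, matching the statsmodels table
t_intercept = result.intercept / result.intercept_stderr
p_intercept = 2 * stats.t.sf(abs(t_intercept), df=dof)   # two-sided p-value for Beta_0
print(t_slope, t_intercept, p_intercept)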
6 Numpy
Betas, RSS, Rank, SingVals = np.linalg.lstsq(X,Y,rcond=None)
print(f"""
Beta_1, Beta_0: {[i[0] for i in Betas]}
RSS: {RSS}
RSE : {(RSS/(X.shape[0]-2)) ** (0.5)}
Rank: {Rank}
SingularVal: {SingVals}
""")
Beta_1, Beta_0: [-0.95004934, 34.55384]
RSS: [19472.38]
RSE : [6.21576]
Rank: 2
SingularVal: [327.33368 11.027905]
7 Torch
7.1 Simple LinReg
import torch
Xtensor = torch.from_numpy(X)
Ytensor = torch.from_numpy(Y)

W = torch.linalg.lstsq(Xtensor, Ytensor).solution
print(W)
# print(X[:,[1]])
Xlayer = torch.from_numpy(X[:,[0]])
hidlayer = torch.from_numpy(X[:,[1]])
Ytensor = torch.from_numpy(Y)
tensor([[-0.9500],
[34.5538]])
7.2 1-layer NN
import random
import torch
from torch import nn, optim
import math
from IPython import display
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
seed = 1

random.seed(seed)
torch.manual_seed(seed)

inputDim = 1
outputDim = 1
hiddenDim = 100

learning_rate = 1e-4
lambda_l2 = 1e-6

model = nn.Sequential(
    nn.Linear(inputDim, hiddenDim),
    nn.Linear(hiddenDim, outputDim)
)
# Convert to CUDA
model.to(device)

# nn package also has different loss functions.
# we use MSE loss for our regression task
criterion = torch.nn.MSELoss()

# we use the optim package to apply
# stochastic gradient descent for our parameter updates
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=lambda_l2) # built-in L2

# Training
for t in range(1000):
    # Feed forward to get the predictions
    y_pred = model(Xlayer)

    # Compute the loss (MSE)
    loss = criterion(y_pred, Ytensor)
    print("[EPOCH]: %i, [LOSS or MSE]: %.6f" % (t, loss.item()))
    display.clear_output(wait=True)

    # zero the gradients before running
    # the backward pass.
    optimizer.zero_grad()

    # Backward pass to compute the gradient
    # of loss w.r.t our learnable params.
    loss.backward()

    # Update params
    optimizer.step()
[EPOCH]: 999, [LOSS or MSE]: 39.632763
plt.scatter(Xlayer.data.cpu().numpy(), Ytensor.data.cpu().numpy())
plt.plot(Xlayer.data.cpu().numpy(), y_pred.data.cpu().numpy(), 'r-', lw=5)
plt.axis('equal');
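Since both layers are linear with no activation in between, they collapse into a single affine map; a sketch (assuming the model trained above) that recovers an effective slope and intercept roughly comparable to the OLS values:

# compose the two linear layers: y = W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
with torch.no_grad():
    W1, b1 = model[0].weight, model[0].bias
    W2, b2 = model[1].weight, model[1].bias
    eff_slope = (W2 @ W1).item()
    eff_intercept = (W2 @ b1 + b2).item()
print(eff_slope, eff_intercept)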

7.3 1 layer NN Class style
import torch
from torch.autograd import Variable
import torch.nn.functional as F
Xm = np.delete(X,1,1)
Xtensor = torch.from_numpy(Xm)
Ytensor = torch.from_numpy(Y)
numHiddenlayer = 1
# In pytorch we don't need to artificially add a bias column to our X input
class LinearRegressionModel(torch.nn.Module):
    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linearA = torch.nn.Linear(Xtensor.shape[1], numHiddenlayer, bias=True) # One in and one out
        self.linearB = torch.nn.Linear(numHiddenlayer, Ytensor.shape[1])

    def forward(self, x):
        #y_pred = F.relu(self.linear(x))
        y_pred = self.linearB(self.linearA(x))
        #y_pred = self.linearA(x)
        return y_pred
# our model
our_model = LinearRegressionModel()

criterion = torch.nn.MSELoss()

# EXTREMELY SENSITIVE TO LEARNING RATE, 1e-4 is the sweet spot
learning_rate = 1e-4

optimizer = torch.optim.SGD(our_model.parameters(), lr=learning_rate)

#w,b = our_model.parameters()
#print(f"initial: {w,b}")
bestloss = 99999
bestepoch = -1
for epoch in range(4000): # EXTREMELY SENSITIVE TO EPOCH, 50000 is the sweet spot
    epoch += 1
    # Convert numpy array to torch Variable
    #inputs = torch.from_numpy(p).requires_grad_()
    #labels = torch.from_numpy(q)

    # Clear gradients w.r.t. parameters
    optimizer.zero_grad()

    # Forward to get output
    outputs = our_model(Xtensor)

    # Calculate Loss
    loss = criterion(outputs, Ytensor)
    currloss = loss.item()
    if currloss < bestloss:
        bestloss = currloss
        bestepoch = epoch
    # if currloss - bestloss > 200 and epoch > 50:
    #     print("boom")

    # Getting gradients w.r.t. parameters
    loss.backward()

    # Updating parameters
    optimizer.step()

print('epoch {}, loss {}'.format(epoch, loss.item()))
print('epoch {}, loss {}'.format(bestepoch, bestloss))
#w,b = our_model.parameters()
#print(w,b)
epoch 4000, loss 38.486961364746094
epoch 4000, loss 38.486961364746094
w, b, *_ = our_model.parameters() # weight and bias of the first linear layer
print(w, b)
Parameter containing:
tensor([[-4.9096]], requires_grad=True) Parameter containing:
tensor([0.5459], requires_grad=True)
8 Numpy vs Scipy vs Torch vs Statsmodels
From the linear regression methods above we can see that NumPy, SciPy, and Torch lack the fine-grained details that statsmodels provides.
statsmodels produces output that resembles R's much more closely.