Intro To Statistical Learning Notes Mathematica
= Import["https://www.statlearning.com/s/Advertising.csv","Dataset","HeaderLines"->1]; Addataset
= Import["https://www.statlearning.com/s/Auto.data","Table","HeaderLines" -> 0];
autodata = Dataset[AssociationThread[First[autoData] -> #] & /@ Rest[autoData]]; autodataset
= Import["https://www.statlearning.com/s/Auto.csv","Dataset","HeaderLines"->1]; autodataset
= Import["https://www.statlearning.com/s/College.csv","Dataset","HeaderLines"->1]; collegedataset
= Import["https://www.statlearning.com/s/Ch12Ex13.csv","Dataset","HeaderLines"->1]; ch12ex13dataset
= Import["https://www.statlearning.com/s/Credit.csv","Dataset","HeaderLines"->1]; creditdataset
= Import["https://www.statlearning.com/s/Heart.csv","Dataset","HeaderLines"->1]; heartdataset
= Import["https://www.statlearning.com/s/Income1.csv","Dataset","HeaderLines"->1]; income1dataset
= Import["https://www.statlearning.com/s/Income2.csv","Dataset","HeaderLines"->1]; income2dataset
Import all of the datasets:
AdDataset = Import["https://www.statlearning.com/s/Advertising.csv","Dataset","HeaderLines"->1];
AutoDataset = Import["https://www.statlearning.com/s/Auto.csv","Dataset","HeaderLines"->1];
CollegeDataset = Import["https://www.statlearning.com/s/College.csv","Dataset","HeaderLines"->1];
Ch12ex13Dataset = Import["https://www.statlearning.com/s/Ch12Ex13.csv","Dataset","HeaderLines"->1];
CreditDataset = Import["https://www.statlearning.com/s/Credit.csv","Dataset","HeaderLines"->1];
HeartDataset = Import["https://www.statlearning.com/s/Heart.csv","Dataset","HeaderLines"->1];
Income1Dataset = Import["https://www.statlearning.com/s/Income1.csv","Dataset","HeaderLines"->1];
Income2Dataset = Import["https://www.statlearning.com/s/Income2.csv","Dataset","HeaderLines"->1];
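A quick sanity check that the imports worked (a minimal sketch; it only assumes the AdDataset import above succeeded):
Dimensions[AdDataset]  (* rows x columns *)
AdDataset[1 ;; 5]      (* first five rows *)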
1 Intro
- Wage: Linear Regression. How does age affect wage?
- Smarket: Classification.
2 Terms
2.1 Variance
Variance is the average squared (distance-to-mean).
It is larger for scattered points than for clumped points.
The squaring has an absolute-value-like effect: deviations count regardless of sign.
TSS = Total Sum of Squares = \(\sum_i (y_i - \bar{y})^2\), the total variability of the \(y\)'s.
Dividing a quantity by TSS normalises it against this total variability.
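A tiny Mathematica illustration of these two quantities (the numbers are made up):
ys = {2., 4., 9., 5., 7.};
tss = Total[(ys - Mean[ys])^2]   (* TSS: total sum of squares around the mean *)
Variance[ys]                     (* sample variance = TSS/(n - 1) *)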
3 Chapter 2 OLS
Goal: find the weight vector \(\beta\) that minimizes the residual sum of squares.
Method:
1. Find the stationary point (min/max): set the derivative of the error with respect to the weights equal to 0 and solve.
2. Prove that it is a minimum, not a maximum: check that the second derivative is positive definite (the matrix analogue of a positive real number).
3.1 Reality vs Estimate
Ideal Model of Reality \[Y = X\beta + \epsilon\]
Estimated Model \[Y=X\hat{\beta} + e\]
- Population is analogous to Reality.
- Sample is analogous to a simulation or model of reality.
Statistics is about using samples (simulations/models/estimates) to estimate the population (reality).
3.2 Residual vs Irreducible Error
\[X_1, \ldots, X_n \sim N(\mu ,\sigma^2)\]
\(X_1, \ldots, X_n\) each represent a random draw from the population, i.e. a singleton sample.
\(\mu\) is the population mean.
\[\bar{X} = \frac{X_1 + \cdots + X_n}{n}\] \(\bar{X}\) is the sample mean. Notice it is a random variable: the choice of sample is random, so each possible sample has its own mean.
\[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]
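A quick Mathematica simulation of this fact (the parameters \(\mu = 10\), \(\sigma = 2\), \(n = 25\) are made up): the sample means scatter around \(\mu\) with standard deviation close to \(\sigma/\sqrt{n}\).
mu = 10; sigma = 2; n = 25;
samples = RandomVariate[NormalDistribution[mu, sigma], {1000, n}];  (* 1000 samples of size n *)
sampleMeans = Mean /@ samples;
Mean[sampleMeans]               (* close to mu = 10 *)
StandardDeviation[sampleMeans]  (* close to sigma/Sqrt[n] = 0.4 *)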
\[E(Y_i) = \mu\] \[\epsilon_i = Y_i - E(Y_i) \text{ typically impossible to find}\]
\[ e_i = X_i - \bar{X}\]
\[e \neq \epsilon\]
Residuals \(e\) are basically estimates of Error \(\epsilon\)
- \(\epsilon\) is the random noise of reality; we typically can't measure it.
- Error \(\epsilon\) is the random deviation in the population data.
- \(e\) is our residual error: the vertical distance of a datapoint from the best-fit line.
- Residual \(e\) is the deviation between our estimated model and the sample data.
We can set aside the ideal-model quantities \(\dot{Y},\dot{\beta},\epsilon\); they are simply a model of an unobservable ideal.
3.3 Finding best fit
Regression finds the best-fit line, the one that minimizes the squared residuals \(e\).
\[Y=X\hat{\beta} + e\]
\[e=Y-X\hat{\beta}\]
3.3.1 RSS: Residual Sum of Squares
Sum of squared residuals:
\[RSS(\hat{\beta}) = e^T e \]
\[e^Te = (Y-X\hat{\beta})^T(Y-X\hat{\beta})\]
Set the derivative to 0 to find the stationary point.
\[\frac{\partial e^Te}{\partial\beta}=-2X^TY+2X^TX\hat{\beta} = 0\]
\[(X^TX)\hat{\beta}=X^T Y\]
\[ \hat{\beta}=(X^T X)^{-1} X^T Y\]
\[\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = X(X^T X)^{-1} X^T ( X \beta + \epsilon )\]
Show that this stationary point is a minimum by proving the 2nd derivative is positive (positive definite for matrices).
\[\frac{\partial^2 e^Te}{\partial\beta \partial\beta^T} = 2X^T X\]
Assuming X has full column rank, \(X^T X\) can be shown to be positive definite.
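A minimal sketch of this derivation in Mathematica, on synthetic data (the true coefficients 1, 2, -3 and the noise level are made up); it solves the normal equations directly and cross-checks against the built-in LinearModelFit:
SeedRandom[1];
n = 100;
x1 = RandomVariate[NormalDistribution[0, 1], n];
x2 = RandomVariate[NormalDistribution[0, 1], n];
y = 1 + 2 x1 - 3 x2 + RandomVariate[NormalDistribution[0, 0.5], n];
X = Transpose[{ConstantArray[1, n], x1, x2}];            (* design matrix with an intercept column *)
betaHat = Inverse[Transpose[X] . X] . Transpose[X] . y   (* (X^T X)^-1 X^T Y *)
PositiveDefiniteMatrixQ[Transpose[X] . X]                (* second-order condition: returns True *)
LinearModelFit[Transpose[{x1, x2, y}], {u, v}, {u, v}]["BestFitParameters"]  (* should match betaHat *)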
3.3.1.1 Ax=b
Tricky
- \(X\beta = Y\)
- \(Ax = b\)
The \(\beta\) represents \(x\).
The input \(X\) does NOT represent \(x\)
Each input, i.e. each row of \(X\), represents an equation or constraint on the beta coefficients.
This is an overdetermined system (too many constraints/equations for too few variables).
Overdetermined systems generally have no exact solution, but we can find the best-fit \(x\) (in our case, the coefficients \(\beta\)).
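A tiny Mathematica sketch of an overdetermined system (the numbers are made up): four equations, two unknowns, no exact solution, so we take the least-squares one.
A = {{1, 1}, {1, 2}, {1, 3}, {1, 4}};   (* 4 constraints, 2 unknowns *)
b = {6, 5, 7, 10};
LeastSquares[A, b]                            (* best-fit x, i.e. the beta coefficients *)
Inverse[Transpose[A] . A] . Transpose[A] . b  (* same answer via the normal equations *)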
3.3.2 Covariance matrix
The matrix \(X^T X\) is (after centering the columns and dividing by \(n-1\)) the sample version of the covariance matrix \(\Sigma\) that appears in the multivariate normal distribution.
\[\Sigma_{ij} = Cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]\]
\[ E[Y-\hat{Y}]^2 = E[f(X)+\epsilon -\hat{f}(X)]^2 = {\color{red}E[f(X)-\hat{f}(X)]^2} + Var(\epsilon)\]
\[{\color{red}e = f(X)-\hat{f}(X)}\]
- reducible error : \({\color{red}E[f(X)-\hat{f}(X)]^2}\)
- irreducible error: \(Var(\epsilon)\)
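Why the cross term disappears (a short worked step, assuming \(\epsilon\) has mean zero and is independent of \(X\), and treating \(\hat{f}\) as fixed):
\[E[Y-\hat{Y}]^2 = E[(f(X)-\hat{f}(X)) + \epsilon]^2 = E[f(X)-\hat{f}(X)]^2 + 2E[(f(X)-\hat{f}(X))\epsilon] + E[\epsilon^2]\]
\[E[(f(X)-\hat{f}(X))\epsilon] = E[f(X)-\hat{f}(X)]\,E[\epsilon] = 0, \qquad E[\epsilon^2] = Var(\epsilon)\]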
Steps:
- Choose model: Linear equation
\[ Y = f(X_1,X_2,...) + \epsilon\]
\(f\) is matrix multiplication with weight vector \(\beta\)
\[Y = X\beta + \epsilon\]
- Train parameters: Ordinary Least Squares(OLS)
a <- c(2,3,5,6)
library(ISLR)
x <- rnorm(50, mean = 0, sd = 1)        # creates 50 points
y <- x + rnorm(50, mean = 1, sd = 0.5)  # creates 50 points
cor(x, y)  # correlation
#> 0.8344
3.4 Validate linearity
- We can use linear regression on any dataset but that DOES NOT imply a linear relation exists.
- We must perform a t-test (single independent variable) or an ANOVA F-test (multiple independent variables) on the linear regression model's coefficients to test whether a linear relation exists.
3.4.1 T-test
- The t-distribution is a probability distribution like the normal distribution but with fatter tails.
- With infinite degrees of freedom, the t-distribution equals the normal distribution.
- Smaller degrees of freedom imply fatter tails.
- It can be useful for modelling returns with fat tails.
Null H: There is no linear relationship between independent variable and output \(\beta = 0\)
Alt H: There is a linear relationship between independent variable and output \(\beta \neq 0\)
We do a T-test for EACH coefficient.
What happens when only some coefficients are significant and others aren't?
ANSWER: Typically, we drop the non-significant (high p-value) coefficients, accepting a little extra bias to reduce variance.
Remember that high variance leads to overfitting.
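A minimal Mathematica sketch of per-coefficient t-tests (synthetic data; the predictor z is pure noise by construction, so its coefficient should come out non-significant):
SeedRandom[2];
n = 100;
x = RandomVariate[NormalDistribution[0, 1], n];
z = RandomVariate[NormalDistribution[0, 1], n];   (* unrelated to y *)
y = 3 + 2 x + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x, z, y}], {u, v}, {u, v}];
lm["ParameterTable"]    (* estimate, standard error, t-statistic, p-value per coefficient *)
lm["ParameterPValues"]  (* just the p-values *)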
3.4.1.1 degree of freedom
3.4.2 Validate multi-linear regression model
3.4.2.1 F-test
Null H: none of the independent variables has a linear relationship with the output: \(\forall n, \beta_n = 0\)
Alt H: at least one of the independent variables has a linear relationship with the output: \(\exists n, \beta_n \neq 0\)
Variance WITHIN groups vs variance BETWEEN groups.
Example:
Low WITHIN-group, high BETWEEN-group variance (each group is tightly clustered but the group means differ a lot):
- Height: 10, 9, 12
- Weight: 92, 95, 93
- Age: 20, 22, 21
High WITHIN-group variance (values inside the group are spread out):
- Height: 15, 30
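A hand-computed overall F-statistic in Mathematica (synthetic data; x2 is noise, but x1 is related to y, so the F-test should reject the null):
SeedRandom[3];
n = 100; p = 2;
x1 = RandomVariate[NormalDistribution[0, 1], n];
x2 = RandomVariate[NormalDistribution[0, 1], n];
y = 1 + 2 x1 + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x1, x2, y}], {u, v}, {u, v}];
rss = Total[lm["FitResiduals"]^2];
tss = Total[(y - Mean[y])^2];
fStat = ((tss - rss)/p)/(rss/(n - p - 1))
1 - CDF[FRatioDistribution[p, n - p - 1], fStat]  (* p-value of the overall F-test *)
lm["ANOVATable"]                                   (* per-term F tests, for comparison *)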
3.5 Model fit ( Loss Function )
3.5.1 MSE
- MSE (Mean Squared Error): used to check how well our regression model fits.
- Add up the squared distances from each point to the regression line, then divide by the number of datapoints.
- Notice \(MSE = \frac{1}{n} RSS\).
The goal is to minimize the MSE on the test dataset, not the training dataset.
3.5.1.1 Standard Error SE
- How far the sample {mean, weights, residuals} are from the population (reality) {mean, weights, residuals}.
3.5.1.2 RSE: Residual Standard Error
RSE is computed from the RSS: \(RSE = \sqrt{RSS/(n-p-1)}\) with \(p\) predictors (\(\sqrt{RSS/(n-2)}\) for simple regression). It estimates the standard deviation of the irreducible error \(\epsilon\).
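A quick check of these definitions in Mathematica on synthetic data (the true noise standard deviation is 1 here, so RSE should come out near 1):
SeedRandom[4];
n = 60;
x = RandomVariate[NormalDistribution[0, 1], n];
y = 2 + 3 x + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x, y}], t, t];
rss = Total[lm["FitResiduals"]^2];
mse = rss/n              (* training MSE = RSS/n *)
rse = Sqrt[rss/(n - 2)]  (* residual standard error; estimates the sd of epsilon *)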
3.5.2 R^2
\(R^2\) is a number in [0,1] related to variance (how scattered or clumped the data are).
\(R^2\) measures how scattered or clumped the data are with respect to our regression line, i.e. our model.
- Close to 1 means every datapoint is (nearly) on our regression line.
- Close to 0 means the datapoints are widely scattered around the regression line; the model explains little of the variance.
Optional: drop coefficients (independent variables) that contribute little to \(R^2\) despite having a low p-value.
Example:
The Age variable alone gives \(R^2 = 0.7\) (p-value 0.04).
Adding a Height variable gives \(R^2 = 0.71\) (p-value 0.005).
We can safely drop the Height variable: despite its low p-value, it barely improves \(R^2\).
Cons: \(R^2\) increases as you add more parameters.
- Solution: use Adjusted \(R^2\).
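A small Mathematica illustration (synthetic data; the second predictor is pure noise): plain \(R^2\) creeps up when the useless predictor is added, while adjusted \(R^2\) penalizes it.
SeedRandom[5];
n = 80;
x = RandomVariate[NormalDistribution[0, 1], n];
junk = RandomVariate[NormalDistribution[0, 1], n];  (* predictor unrelated to y *)
y = 1 + 2 x + RandomVariate[NormalDistribution[0, 1], n];
lm1 = LinearModelFit[Transpose[{x, y}], u, u];
lm2 = LinearModelFit[Transpose[{x, junk, y}], {u, v}, {u, v}];
{lm1["RSquared"], lm2["RSquared"]}                  (* R^2 never decreases when a predictor is added *)
{lm1["AdjustedRSquared"], lm2["AdjustedRSquared"]}  (* adjusted R^2 usually does not reward noise *)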
3.5.3 Cor
\[R^2 = Cor(Y,\hat{Y})^2 \text{ for multiple lin reg}\]
Linear fit models aim to \(max(Cor(Y,\hat{Y}))\)
3.6 Prediction
A prediction interval quantifies the uncertainty in a single new prediction \(\hat{Y}\). It is wider than the confidence interval for the mean response because it also includes the irreducible error \(\epsilon\).
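A minimal Mathematica sketch (synthetic data) contrasting the confidence band for the mean response with the wider prediction band for a single new observation:
SeedRandom[6];
n = 50;
x = RandomVariate[NormalDistribution[0, 2], n];
y = 1 + 0.5 x + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x, y}], t, t];
lm["MeanPredictionBands"]    (* 95% band for the average response at each t *)
lm["SinglePredictionBands"]  (* wider 95% band for one new observation: adds Var(epsilon) *)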
4 Chapter 4 Classification
Why can't we just code the response with dummy variables and use linear regression?
We can, but only for binary outcomes (a single 0/1 dummy variable).
For more than two classes, coding the response as a single number (e.g. 1, 2, 3) imposes an equidistant ordering on the classes which may not reflect reality.
5 Chapter 5 Resampling
Remember, statistics is about using our subset/sample to understand the population.
Resampling is about repeatedly taking different subsets or samples.
Two common resampling methods: cross-validation and the bootstrap.
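A minimal bootstrap sketch in Mathematica (synthetic data): estimate the standard error of the sample mean by resampling with replacement, and compare to the classical formula.
SeedRandom[7];
data = RandomVariate[ExponentialDistribution[1], 100];
bootMeans = Table[Mean[RandomChoice[data, Length[data]]], {2000}];  (* resample with replacement *)
StandardDeviation[bootMeans]                (* bootstrap SE of the sample mean *)
StandardDeviation[data]/Sqrt[Length[data]]  (* classical SE formula, for comparison *)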