Intro To Statistical Learning Notes Mathematica
= Import["https://www.statlearning.com/s/Advertising.csv","Dataset","HeaderLines"->1]; Addataset
= Import["https://www.statlearning.com/s/Auto.data","Table","HeaderLines" -> 0];
autodata = Dataset[AssociationThread[First[autoData] -> #] & /@ Rest[autoData]]; autodataset
= Import["https://www.statlearning.com/s/Auto.csv","Dataset","HeaderLines"->1]; autodataset
= Import["https://www.statlearning.com/s/College.csv","Dataset","HeaderLines"->1]; collegedataset
= Import["https://www.statlearning.com/s/Ch12Ex13.csv","Dataset","HeaderLines"->1]; ch12ex13dataset
= Import["https://www.statlearning.com/s/Credit.csv","Dataset","HeaderLines"->1]; creditdataset
= Import["https://www.statlearning.com/s/Heart.csv","Dataset","HeaderLines"->1]; heartdataset
= Import["https://www.statlearning.com/s/Income1.csv","Dataset","HeaderLines"->1]; income1dataset
= Import["https://www.statlearning.com/s/Income2.csv","Dataset","HeaderLines"->1]; income2dataset
Import all of the datasets:
AdDataset = Import["https://www.statlearning.com/s/Advertising.csv","Dataset","HeaderLines"->1];
AutoDataset = Import["https://www.statlearning.com/s/Auto.csv","Dataset","HeaderLines"->1];
CollegeDataset = Import["https://www.statlearning.com/s/College.csv","Dataset","HeaderLines"->1];
Ch12ex13Dataset = Import["https://www.statlearning.com/s/Ch12Ex13.csv","Dataset","HeaderLines"->1];
CreditDataset = Import["https://www.statlearning.com/s/Credit.csv","Dataset","HeaderLines"->1];
HeartDataset = Import["https://www.statlearning.com/s/Heart.csv","Dataset","HeaderLines"->1];
Income1Dataset = Import["https://www.statlearning.com/s/Income1.csv","Dataset","HeaderLines"->1];
Income2Dataset = Import["https://www.statlearning.com/s/Income2.csv","Dataset","HeaderLines"->1];
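A quick sanity check that the imports worked (a minimal sketch; it only assumes the AdDataset import above succeeded):
Dimensions[AdDataset]  (* rows x columns *)
AdDataset[1 ;; 5]      (* first five rows *)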
1 Intro
- Wage: Linear Regression. How does age affect wage?
- Smarket: Classification.
2 Terms
2.1 Variance
Variance is the average squared (distance-to-mean).
It is larger for scattered points than for clumped points.
The squaring has an absolute-value-like effect: deviations count regardless of sign.
TSS = Total Sum of Squares = \(\sum_i (y_i - \bar{y})^2\), the total variability of the \(y\)'s.
Dividing a quantity by TSS normalises it against this total variability.
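A tiny Mathematica illustration of these two quantities (the numbers are made up):
ys = {2., 4., 9., 5., 7.};
tss = Total[(ys - Mean[ys])^2]   (* TSS: total sum of squares around the mean *)
Variance[ys]                     (* sample variance = TSS/(n - 1) *)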
3 Chapter 2 OLS
Goal: find the weight vector \(\beta\) that minimizes the residual sum of squares.
Method:
1. Find the stationary point (min/max): set the derivative of the error with respect to the weights equal to 0 and solve.
2. Prove that it is a minimum, not a maximum: check that the second derivative is positive definite (the matrix analogue of a positive real number).
3.1 Reality vs Estimate
Ideal Model of Reality \[Y = X\beta + \epsilon\]
Estimated Model \[Y=X\hat{\beta} + e\]
- Population is analogous to Reality.
- Sample is analogous to a simulation or model of reality.
Statistics is about using samples (simulations/models/estimates) to estimate the population (reality).
3.2 Residual vs Irreducible Error
\[X_1, \ldots, X_n \sim N(\mu ,\sigma^2)\]
\(X_1, \ldots, X_n\) each represent a random draw from the population, i.e. a singleton sample.
\(\mu\) is the population mean.
\[\bar{X} = \frac{X_1 + \cdots + X_n}{n}\] \(\bar{X}\) is the sample mean. Notice it is a random variable: the choice of sample is random, so each possible sample has its own mean.
\[\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\]
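A quick Mathematica simulation of this fact (the parameters \(\mu = 10\), \(\sigma = 2\), \(n = 25\) are made up): the sample means scatter around \(\mu\) with standard deviation close to \(\sigma/\sqrt{n}\).
mu = 10; sigma = 2; n = 25;
samples = RandomVariate[NormalDistribution[mu, sigma], {1000, n}];  (* 1000 samples of size n *)
sampleMeans = Mean /@ samples;
Mean[sampleMeans]               (* close to mu = 10 *)
StandardDeviation[sampleMeans]  (* close to sigma/Sqrt[n] = 0.4 *)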
\[E(Y_i) = \mu\] \[\epsilon_i = Y_i - E(Y_i) \text{ typically impossible to find}\]
\[ e_i = X_i - \bar{X}\]
\[e \neq \epsilon\]
Residuals \(e\) are basically estimates of Error \(\epsilon\)
- \(\epsilon\) is the random noise of reality; we typically can't measure it.
- Error \(\epsilon\) is the random deviation in the population data.
- \(e\) is our residual error: the vertical distance of a datapoint from the best-fit line.
- Residual \(e\) is the deviation between our estimated model and the sample data.
We can set aside the ideal-model quantities \(\dot{Y},\dot{\beta},\epsilon\); they are simply a model of an unobservable ideal.
3.3 Finding best fit
Regression finds the best-fit line, the one that minimizes the squared residuals \(e\).
\[Y=X\hat{\beta} + e\]
\[e=Y-X\hat{\beta}\]
3.3.1 RSS: Residual Sum of Squares
Sum of squared residuals:
\[RSS(\hat{\beta}) = e^T e \]
\[e^Te = (Y-X\hat{\beta})^T(Y-X\hat{\beta})\]
Set the derivative to 0 to find the stationary point.
\[\frac{\partial e^Te}{\partial\beta}=-2X^TY+2X^TX\hat{\beta} = 0\]
\[(X^TX)\hat{\beta}=X^T Y\]
\[ \hat{\beta}=(X^T X)^{-1} X^T Y\]
\[\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = X(X^T X)^{-1} X^T ( X \beta + \epsilon )\]
Show that this stationary point is a minimum by proving the 2nd derivative is positive (positive definite for matrices).
\[\frac{\partial^2 e^Te}{\partial\beta \partial\beta^T} = 2X^T X\]
Assuming X has full column rank, \(X^T X\) can be shown to be positive definite.
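A minimal sketch of this derivation in Mathematica, on synthetic data (the true coefficients 1, 2, -3 and the noise level are made up); it solves the normal equations directly and cross-checks against the built-in LinearModelFit:
SeedRandom[1];
n = 100;
x1 = RandomVariate[NormalDistribution[0, 1], n];
x2 = RandomVariate[NormalDistribution[0, 1], n];
y = 1 + 2 x1 - 3 x2 + RandomVariate[NormalDistribution[0, 0.5], n];
X = Transpose[{ConstantArray[1, n], x1, x2}];            (* design matrix with an intercept column *)
betaHat = Inverse[Transpose[X] . X] . Transpose[X] . y   (* (X^T X)^-1 X^T Y *)
PositiveDefiniteMatrixQ[Transpose[X] . X]                (* second-order condition: returns True *)
LinearModelFit[Transpose[{x1, x2, y}], {u, v}, {u, v}]["BestFitParameters"]  (* should match betaHat *)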
3.3.1.1 Ax=b
Tricky
- \(X\beta = Y\)
- \(Ax = b\)
The \(\beta\) represents \(x\).
The input \(X\) does NOT represent \(x\)
Each input, i.e. each row of \(X\), represents an equation or constraint on the beta coefficients.
This is an overdetermined system (too many constraints/equations for too few variables).
Overdetermined systems generally have no exact solution, but we can find the best-fit \(x\) (in our case, the coefficients \(\beta\)).
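A tiny Mathematica sketch of an overdetermined system (the numbers are made up): four equations, two unknowns, no exact solution, so we take the least-squares one.
A = {{1, 1}, {1, 2}, {1, 3}, {1, 4}};   (* 4 constraints, 2 unknowns *)
b = {6, 5, 7, 10};
LeastSquares[A, b]                            (* best-fit x, i.e. the beta coefficients *)
Inverse[Transpose[A] . A] . Transpose[A] . b  (* same answer via the normal equations *)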
3.3.2 Covariance matrix
The matrix \(X^T X\) is (after centering the columns and dividing by \(n-1\)) the sample version of the covariance matrix \(\Sigma\) that appears in the multivariate normal distribution.
\[\Sigma_{ij} = Cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]\]
\[ E[Y-\hat{Y}]^2 = E[f(X)+\epsilon -\hat{f}(X)]^2 = {\color{red}E[f(X)-\hat{f}(X)]^2} + Var(\epsilon)\]
\[{\color{red}e = f(X)-\hat{f}(X)}\]
- reducible error : \({\color{red}E[f(X)-\hat{f}(X)]^2}\)
- irreducible error: \(Var(\epsilon)\)
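Why the cross term disappears (a short worked step, assuming \(\epsilon\) has mean zero and is independent of \(X\), and treating \(\hat{f}\) as fixed):
\[E[Y-\hat{Y}]^2 = E[(f(X)-\hat{f}(X)) + \epsilon]^2 = E[f(X)-\hat{f}(X)]^2 + 2E[(f(X)-\hat{f}(X))\epsilon] + E[\epsilon^2]\]
\[E[(f(X)-\hat{f}(X))\epsilon] = E[f(X)-\hat{f}(X)]\,E[\epsilon] = 0, \qquad E[\epsilon^2] = Var(\epsilon)\]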
Steps:
- Choose model: Linear equation
\[ Y = f(X_1,X_2,...) + \epsilon\]
\(f\) is matrix multiplication with weight vector \(\beta\)
\[Y = X\beta + \epsilon\]
- Train parameters: Ordinary Least Squares(OLS)
a <- c(2,3,5,6)
library(ISLR)
x <- rnorm(50, mean = 0, sd = 1)        # creates 50 points
y <- x + rnorm(50, mean = 1, sd = 0.5)  # creates 50 points
cor(x, y)  # correlation
#> 0.8344
3.4 Validate linearity
- We can use linear regression on any dataset but that DOES NOT imply a linear relation exists.
- We must perform a t-test (single independent variable) or an ANOVA F-test (multiple independent variables) on the linear regression model's coefficients to test whether a linear relation exists.
3.4.1 T-test
- The t-distribution is a probability distribution like the normal distribution but with fatter tails.
- With infinite degrees of freedom, the t-distribution equals the normal distribution.
- Smaller degrees of freedom imply fatter tails.
- It can be useful for modelling returns with fat tails.
Null H: There is no linear relationship between independent variable and output \(\beta = 0\)
Alt H: There is a linear relationship between independent variable and output \(\beta \neq 0\)
We do a T-test for EACH coefficient.
What happens when only some coefficients are significant and others aren't?
ANSWER: Typically, we drop the non-significant (high p-value) coefficients, accepting a little extra bias to reduce variance.
Remember that high variance leads to overfitting.
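A minimal Mathematica sketch of per-coefficient t-tests (synthetic data; the predictor z is pure noise by construction, so its coefficient should come out non-significant):
SeedRandom[2];
n = 100;
x = RandomVariate[NormalDistribution[0, 1], n];
z = RandomVariate[NormalDistribution[0, 1], n];   (* unrelated to y *)
y = 3 + 2 x + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x, z, y}], {u, v}, {u, v}];
lm["ParameterTable"]    (* estimate, standard error, t-statistic, p-value per coefficient *)
lm["ParameterPValues"]  (* just the p-values *)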
3.4.1.1 degree of freedom
3.4.2 Validate multi-linear regression model
3.4.2.1 F-test
Null H: none of the independent variables has a linear relationship with the output: \(\forall n, \beta_n = 0\)
Alt H: at least one of the independent variables has a linear relationship with the output: \(\exists n, \beta_n \neq 0\)
Variance WITHIN groups vs variance BETWEEN groups.
Example:
Low WITHIN-group, high BETWEEN-group variance (each group is tightly clustered but the group means differ a lot):
- Height: 10, 9, 12
- Weight: 92, 95, 93
- Age: 20, 22, 21
High WITHIN-group variance (values inside the group are spread out):
- Height: 15, 30
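A hand-computed overall F-statistic in Mathematica (synthetic data; x2 is noise, but x1 is related to y, so the F-test should reject the null):
SeedRandom[3];
n = 100; p = 2;
x1 = RandomVariate[NormalDistribution[0, 1], n];
x2 = RandomVariate[NormalDistribution[0, 1], n];
y = 1 + 2 x1 + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x1, x2, y}], {u, v}, {u, v}];
rss = Total[lm["FitResiduals"]^2];
tss = Total[(y - Mean[y])^2];
fStat = ((tss - rss)/p)/(rss/(n - p - 1))
1 - CDF[FRatioDistribution[p, n - p - 1], fStat]  (* p-value of the overall F-test *)
lm["ANOVATable"]                                   (* per-term F tests, for comparison *)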
3.5 Model fit ( Loss Function )
3.5.1 MSE
- MSE (Mean Squared Error): used to check how well our regression model fits.
- Add up the squared distances from each point to the regression line, then divide by the number of datapoints.
- Notice \(MSE = \frac{1}{n} RSS\).
The goal is to minimize the MSE on the test dataset, not the training dataset.
3.5.1.1 Standard Error SE
- How far the sample {mean, weights, residuals} are from the population (reality) {mean, weights, residuals}.
3.5.1.2 RSE: Residual Standard Error
RSE is computed from the RSS: \(RSE = \sqrt{RSS/(n-p-1)}\) with \(p\) predictors (\(\sqrt{RSS/(n-2)}\) for simple regression). It estimates the standard deviation of the irreducible error \(\epsilon\).
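A quick check of these definitions in Mathematica on synthetic data (the true noise standard deviation is 1 here, so RSE should come out near 1):
SeedRandom[4];
n = 60;
x = RandomVariate[NormalDistribution[0, 1], n];
y = 2 + 3 x + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x, y}], t, t];
rss = Total[lm["FitResiduals"]^2];
mse = rss/n              (* training MSE = RSS/n *)
rse = Sqrt[rss/(n - 2)]  (* residual standard error; estimates the sd of epsilon *)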
3.5.2 R^2
\(R^2\) is a number in [0,1] related to variance (how scattered or clumped the data are).
\(R^2\) measures how scattered or clumped the data are with respect to our regression line, i.e. our model.
- Close to 1 means every datapoint is (nearly) on our regression line.
- Close to 0 means the datapoints are widely scattered around the regression line; the model explains little of the variance.
Optional: drop coefficients (independent variables) that contribute little to \(R^2\) despite having a low p-value.
Example:
The Age variable alone gives \(R^2 = 0.7\) (p-value 0.04).
Adding a Height variable gives \(R^2 = 0.71\) (p-value 0.005).
We can safely drop the Height variable: despite its low p-value, it barely improves \(R^2\).
Cons: \(R^2\) increases as you add more parameters.
- Solution: use Adjusted \(R^2\).
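A small Mathematica illustration (synthetic data; the second predictor is pure noise): plain \(R^2\) creeps up when the useless predictor is added, while adjusted \(R^2\) penalizes it.
SeedRandom[5];
n = 80;
x = RandomVariate[NormalDistribution[0, 1], n];
junk = RandomVariate[NormalDistribution[0, 1], n];  (* predictor unrelated to y *)
y = 1 + 2 x + RandomVariate[NormalDistribution[0, 1], n];
lm1 = LinearModelFit[Transpose[{x, y}], u, u];
lm2 = LinearModelFit[Transpose[{x, junk, y}], {u, v}, {u, v}];
{lm1["RSquared"], lm2["RSquared"]}                  (* R^2 never decreases when a predictor is added *)
{lm1["AdjustedRSquared"], lm2["AdjustedRSquared"]}  (* adjusted R^2 usually does not reward noise *)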
3.5.3 Cor
\[R^2 = Cor(Y,\hat{Y})^2 \text{ for multiple lin reg}\]
Linear fit models aim to \(max(Cor(Y,\hat{Y}))\)
3.6 Prediction
A prediction interval quantifies the uncertainty in a single new prediction \(\hat{Y}\). It is wider than the confidence interval for the mean response because it also includes the irreducible error \(\epsilon\).
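A minimal Mathematica sketch (synthetic data) contrasting the confidence band for the mean response with the wider prediction band for a single new observation:
SeedRandom[6];
n = 50;
x = RandomVariate[NormalDistribution[0, 2], n];
y = 1 + 0.5 x + RandomVariate[NormalDistribution[0, 1], n];
lm = LinearModelFit[Transpose[{x, y}], t, t];
lm["MeanPredictionBands"]    (* 95% band for the average response at each t *)
lm["SinglePredictionBands"]  (* wider 95% band for one new observation: adds Var(epsilon) *)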
4 Chapter 4 Classification
Why can't we just code the response with dummy variables and use linear regression?
We can, but only for binary outcomes (a single 0/1 dummy variable).
For more than two classes, coding the response as a single number (e.g. 1, 2, 3) imposes an equidistant ordering on the classes which may not reflect reality.
5 Chapter 5 Resampling
Remember, statistics is about using our subset/sample to understand the population.
Resampling is about repeatedly taking different subsets or samples.
Two common resampling methods: cross-validation and the bootstrap.
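A minimal bootstrap sketch in Mathematica (synthetic data): estimate the standard error of the sample mean by resampling with replacement, and compare to the classical formula.
SeedRandom[7];
data = RandomVariate[ExponentialDistribution[1], 100];
bootMeans = Table[Mean[RandomChoice[data, Length[data]]], {2000}];  (* resample with replacement *)
StandardDeviation[bootMeans]                (* bootstrap SE of the sample mean *)
StandardDeviation[data]/Sqrt[Length[data]]  (* classical SE formula, for comparison *)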