10 Regression

Given a dataset with inputs \(X\) and output \(Y\), a regression model is a model \[\hat{Y} = \hat{f}(X)\] which optimizes parameters of \(\hat{f}\) such that the squared error between the model and observations is minimized. In other words, regression is the process of finding a “best-fit” function \(\hat{f}\) to your data.

This section utilizes NumPy and Pyplot:

import numpy as np 
import matplotlib.pyplot as plt

10.1 Polynomial Regression

NumPy’s polyfit function determines optimized parameters for a polynomial to fit the given data \(X\) and \(Y\), along with the specified degree of the polynomial. With those parameters, the polyval function evaluates polynomials which is useful for model prediction and visualization.

Given the data:

X = np.array([1,2,3,4,5, 6, 7, 8, 9])
Y = np.array([8,6,5,6,9,12,15,16,16])

we fit polynomials of desired degree with polyfit:

Description	Model	Python
Constant	\(\hat{Y}=\beta_0\)	`params = np.polyfit(X,Y,0)`
Linear	\(\hat{Y}=\beta_0 + \beta_1 X\)	`params = np.polyfit(X,Y,1)`
Quadratic	\(\hat{Y}=\beta_0 + \beta_1 X + \beta_2 X^2\)	`params = np.polyfit(X,Y,2)`
Cubic	\(\hat{Y}=\beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3\)	`params = np.polyfit(X,Y,3)`

The polyfit function returns the parameters of the model. Confusingly, it returns the parameters in reverse order from \(\beta_n\) down to \(\beta_0\).

10.1.1 Linear Regression

A complete linear regression looks like this:

params = np.polyfit(X, Y, 1)
print( params )

[1.46666667 3.        ]

That is the slope of \(\beta_1=1.46\) and \(y\)-intercept of \(\beta_0=3\). Any polynomial can be evaluated for given \(X\) values with polyval, which is useful for making model predictions and plotting:

# Plot the data
plt.scatter(X, Y)

# Plot the regression 
Y_hat = np.polyval(params, X)
plt.plot(X, Y_hat )

plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression')
plt.show()

Model predictions can be calculated for individual input values:

# Input value 
x1 = 3.5  

# Model prediction 
y1 = np.polyval(params, x1)

Residual error is the difference between the observed data \((x_i, y_i)\) and the model prediction \((x_i, \hat{y}_i)\), \[ e_i = y_i - \hat{y}_i \] The residuals for our linear model are calculated and visualized with:

Y_hat = np.polyval(params, X)
residuals = Y - Y_hat 

plt.axhline(0, color='black')
plt.scatter(X, residuals) 
plt.xlabel('x')
plt.title('Residual Error')
plt.show()

The structure of residual error can give us information about how well the model fits the data. Here, there appears to be a wave-like signal our model has failed to capture. Roughly, we want to see residual error that looks like noise.

10.1.2 Higher-Order Polynomials

Let us look at multiple polynomial models simultaneously:

# Fit models to the data 
line_params = np.polyfit(X, Y, 1)
quad_params = np.polyfit(X, Y, 2)
cube_params = np.polyfit(X, Y, 3)

# Plot the data
plt.scatter(X, Y)

# Plot the models 
x = np.linspace(1,9,50)
plt.plot( x, np.polyval(line_params, x) )
plt.plot( x, np.polyval(quad_params, x) )
plt.plot( x, np.polyval(cube_params, x) )

plt.xlabel('x')
plt.ylabel('y')
plt.show()

10.2 Model Regression

Exercises

Find the parameters of the linear regression of the model \(\hat{Y}=\beta_0+\beta_1 X\) on the following data:

\(x\) -4 -1 0 2 3 5 6 7 9

\(y\) 22 19 15 16 15 19 19 20 22

Given the linear model \(\hat{Y}=\beta_0+\beta_1 X\) for the above data, find:
1. The residual errors \(e_i = y_i - \hat{y}_i\)
2. The sum of squared errors \(\sum_{i=1}^{n} e_{i}^{2}\)
3. The mean-squared error \(\frac{1}{n}\sum_{i=1}^{n} e_{i}^{2}\)
Plot the data and the linear model.
Repeat the above for a quadratic model \(\hat{Y}=\beta_0+\beta_1 X + \beta_2 X^2\).