Regression using scikit-learn

This notebook will demonstrate using linear and polynomial regression to model the relationship between feature variables and the response variable.

We will use the MPG dataset to practice using regression to predict the fuel economy (MPG) of a car given its features.

For the mathematical explanation and theory, please view lecture notes: http://george1328.github.io/lecture_notes/Regression_Regularization.pdf

Step 1: Get the data

In [43]:
import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [44]:
# Column/feature labels are not available in the dataset, so we create a list of features using auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
In [45]:
# Import the data directly into pandas from the url, specify header=None as column labels are not in dataset
import urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# file is fixed-width-format so we will use read_fwf instead of read_csv
df = pd.read_fwf(urllib.urlopen(url), header = None)
df.columns = features
In [46]:
# Alternatively, we can download the data
# We use the bang (!) within IPython Notebooks to run command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30286  100 30286    0     0  85395      0 --:--:-- --:--:-- --:--:-- 85553
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1660  100  1660    0     0   5315      0 --:--:-- --:--:-- --:--:--  5320

In [47]:
# Since this dataset has no column headings, we need to explicitly state names=features
df = pd.read_fwf("auto-mpg.data", names = features)

Step 2: Clean the data

In [48]:
# Use head, describe, info and unique to get a sense of the data
df.describe()
Out[48]:
mpg cylinders displacement weight acceleration model year origin
count 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.425879 2970.424623 15.568090 76.010050 1.572864
std 7.815984 1.701004 104.269838 846.841774 2.757689 3.697627 0.802055
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.250000 2223.750000 13.825000 73.000000 1.000000
50% 23.000000 4.000000 148.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3608.000000 17.175000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000
In [49]:
df.head()
Out[49]:
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18 8 307 130.0 3504 12.0 70 1 "chevrolet chevelle malibu"
1 15 8 350 165.0 3693 11.5 70 1 "buick skylark 320"
2 18 8 318 150.0 3436 11.0 70 1 "plymouth satellite"
3 16 8 304 150.0 3433 12.0 70 1 "amc rebel sst"
4 17 8 302 140.0 3449 10.5 70 1 "ford torino"
In [50]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(3), object(2)
In [51]:
# Car name and horsepower are the only non-numeric fields. The name of a car is unlikely to have an influence on MPG,
# so horsepower is the one we need to convert.
df.horsepower.unique()
Out[51]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)
In [52]:
(df.horsepower == '?').sum()
Out[52]:
6
In [53]:
# Let's convert horsepower to a numeric field so we can use it in our analysis
df['horsepower'] = df['horsepower'].convert_objects(convert_numeric = True)
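Side note: convert_objects is deprecated in newer pandas releases. A minimal equivalent, assuming a pandas version that provides pd.to_numeric, would be:

df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')  # the '?' entries become NaN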
In [54]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these missing values,
# but for the sake of accuracy we will not. Also, it's only 6 missing values.
df = df.dropna()
In [55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null float64
acceleration    392 non-null float64
model year      392 non-null int64
origin          392 non-null int64
car name        392 non-null object
dtypes: float64(5), int64(3), object(1)

Step 3: Get a sense of the data using Exploratory Data Analysis(EDA)

In [56]:
import seaborn as sns
In [57]:
sns.boxplot(df.mpg, df.cylinders)
# This is interesting. 4 cylinder vehicles have better mileage on average than 3 cylinder vehicles
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x11156f590>
In [58]:
three_cyl = df[df.cylinders == 3]
print three_cyl['car name']
## Aha! Tiny Mazda roadsters...
71     "mazda rx2 coupe"
111          "maxda rx3"
243         "mazda rx-4"
334      "mazda rx-7 gs"
Name: car name, dtype: object

In [59]:
sns.violinplot(df.mpg, df['model year'])
# Fancy seaborn graphing
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x1116b3950>
In [60]:
sns.barplot(df.mpg, df.horsepower)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1115fd9d0>
In [61]:
sns.barplot(df.mpg, df.weight)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129ff9d0>
In [62]:
sns.boxplot(df.mpg, df.origin)
# Although the values of origin are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x113e3eb90>
In [63]:
sns.boxplot(df.mpg, df.acceleration)
# Little cars have pretty good acceleration AND good mileage, so not a great association
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1127112d0>
In [64]:
sns.kdeplot(df.mpg, df.cylinders)
# Showing different plot options in seaborn :-)
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x116532710>

An alternative to visualizing the data with plots is to print the correlation matrix.

In [65]:
df.corr()
Out[65]:
mpg cylinders displacement horsepower weight acceleration model year origin
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541 0.565209
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647 -0.568932
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855 -0.614535
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361 -0.455171
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120 -0.585005
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316 0.212746
model year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000 0.181528
origin 0.565209 -0.568932 -0.614535 -0.455171 -0.585005 0.212746 0.181528 1.000000
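If a picture is preferred, the same matrix can be drawn as a heatmap. A quick sketch, assuming a seaborn version that includes heatmap:

sns.heatmap(df.corr(), annot=True)  # annotate each cell with its correlation value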

Step 4: Prepare data for analysis

In [66]:
# Create numpy arrays X and y with the predictor and response variables
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values
In [67]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
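Note that no random seed is fixed here, so the split (and therefore the predictions and scores printed below) will vary from run to run. A reproducible variant, using the import location introduced in scikit-learn 0.18+, would look like this (the seed value is arbitrary):

# from sklearn.model_selection import train_test_split  # newer scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)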

Step 5: Fit, predict and interpret the results

In [69]:
model = LinearRegression()
model.fit(X_train, y_train)
Out[69]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
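Before predicting, we can inspect the fitted intercept and coefficients to see how each feature contributes to MPG. A quick sketch (the order of coef_ matches the columns of X defined above):

print model.intercept_
print zip(['weight', 'model year', 'horsepower', 'origin', 'displacement'], model.coef_)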
In [70]:
predictions = model.predict(X_test)
In [71]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])
Predicted: 24.1374463009, Actual: 23.0
Predicted: 20.7218206344, Actual: 21.0
Predicted: 21.8941489423, Actual: 21.0
Predicted: 31.2739533219, Actual: 43.4
Predicted: 8.73849758971, Actual: 10.0
Predicted: 12.3567499567, Actual: 13.0
Predicted: 20.5539888194, Actual: 18.6
Predicted: 23.5043942628, Actual: 21.5
Predicted: 24.2250158308, Actual: 20.8
Predicted: 29.7560919602, Actual: 23.9
Predicted: 23.9300320498, Actual: 19.1
Predicted: 18.1430047987, Actual: 17.5
Predicted: 24.1607297159, Actual: 25.0
Predicted: 20.2761485564, Actual: 18.0
Predicted: 35.8155410311, Actual: 38.0
Predicted: 8.7074446523, Actual: 13.0
Predicted: 17.8979064936, Actual: 17.5
Predicted: 24.4450715073, Actual: 19.0
Predicted: 17.6644426204, Actual: 18.0
Predicted: 31.3474114534, Actual: 32.0
Predicted: 20.4370044076, Actual: 17.7
Predicted: 15.9413895668, Actual: 15.0
Predicted: 21.0107531943, Actual: 22.0
Predicted: 25.7540134863, Actual: 23.0
Predicted: 23.9496970655, Actual: 20.2
Predicted: 29.6572909222, Actual: 35.0
Predicted: 26.9401894891, Actual: 28.8
Predicted: 19.8034067713, Actual: 18.0
Predicted: 24.5575812291, Actual: 23.0
Predicted: 35.4302563906, Actual: 37.0
Predicted: 20.8287149326, Actual: 21.0
Predicted: 29.6734765718, Actual: 29.8
Predicted: 22.015168613, Actual: 16.2
Predicted: 26.484637129, Actual: 21.5
Predicted: 31.8567092134, Actual: 29.5
Predicted: 25.3383968192, Actual: 19.0
Predicted: 25.368336478, Actual: 23.2
Predicted: 25.9351262498, Actual: 28.0
Predicted: 24.533323956, Actual: 23.0
Predicted: 24.3018889873, Actual: 25.0
Predicted: 16.9777759556, Actual: 16.0
Predicted: 10.832224498, Actual: 13.0
Predicted: 30.9380337002, Actual: 29.5
Predicted: 23.5404225255, Actual: 19.4
Predicted: 28.6884754537, Actual: 31.0
Predicted: 27.5495074176, Actual: 28.0
Predicted: 31.3874972604, Actual: 32.0
Predicted: 26.2884785574, Actual: 24.0
Predicted: 26.8559022819, Actual: 28.4
Predicted: 31.0940066284, Actual: 32.0
Predicted: 24.0129770055, Actual: 24.0
Predicted: 30.9106684972, Actual: 31.3
Predicted: 23.8318097224, Actual: 20.0
Predicted: 11.4172815594, Actual: 14.0
Predicted: 25.4509846351, Actual: 19.8
Predicted: 19.9742706988, Actual: 18.0
Predicted: 20.823209095, Actual: 20.0
Predicted: 25.589920275, Actual: 30.0
Predicted: 24.8602224809, Actual: 20.2
Predicted: 23.4832212726, Actual: 23.0
Predicted: 20.5270445691, Actual: 15.0
Predicted: 28.6886846162, Actual: 27.0
Predicted: 20.5578198929, Actual: 18.0
Predicted: 26.3241951584, Actual: 28.1
Predicted: 17.6514477796, Actual: 18.0
Predicted: 22.5131941701, Actual: 20.0
Predicted: 9.68360980809, Actual: 14.0
Predicted: 26.8113390062, Actual: 29.0
Predicted: 26.8711409002, Actual: 30.7
Predicted: 29.5757309976, Actual: 35.0
Predicted: 31.8732349221, Actual: 41.5
Predicted: 26.8232988233, Actual: 28.0
Predicted: 14.976597184, Actual: 15.0
Predicted: 10.4056403926, Actual: 13.0
Predicted: 29.4808727792, Actual: 29.0
Predicted: 19.6230595321, Actual: 18.0
Predicted: 23.2566865315, Actual: 21.0
Predicted: 21.2311062773, Actual: 19.2
Predicted: 24.4456286124, Actual: 26.0

In [72]:
model.score(X_test,y_test)
Out[72]:
0.77764337315272958
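R squared is unitless; an error metric expressed in MPG can be more intuitive. A sketch using sklearn.metrics:

from sklearn.metrics import mean_squared_error
import numpy as np
print np.sqrt(mean_squared_error(y_test, predictions))  # RMSE, in MPG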
In [73]:
sns.regplot(predictions, y_test)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x11680fd10>
In [74]:
from sklearn.preprocessing import PolynomialFeatures
In [75]:
quad_model = PolynomialFeatures(degree=2)
In [76]:
quad_X_train = quad_model.fit_transform(X_train)
# transform (rather than fit_transform) is the idiomatic call for the test set;
# for PolynomialFeatures the result is identical either way
quad_X_test = quad_model.transform(X_test)
In [77]:
model.fit(quad_X_train, y_train)
Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
In [78]:
predictions = model.predict(quad_X_test)
In [79]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])
Predicted: 23.3025864121, Actual: 23.0
Predicted: 19.9410146537, Actual: 21.0
Predicted: 19.4236241937, Actual: 21.0
Predicted: 35.4731652612, Actual: 43.4
Predicted: 12.1003117586, Actual: 10.0
Predicted: 13.3266849819, Actual: 13.0
Predicted: 18.369698152, Actual: 18.6
Predicted: 21.0278008026, Actual: 21.5
Predicted: 23.181315018, Actual: 20.8
Predicted: 27.0425561005, Actual: 23.9
Predicted: 23.7648774373, Actual: 19.1
Predicted: 15.7164721958, Actual: 17.5
Predicted: 23.6872919766, Actual: 25.0
Predicted: 19.9817607207, Actual: 18.0
Predicted: 38.5456159447, Actual: 38.0
Predicted: 13.504280478, Actual: 13.0
Predicted: 16.3986029163, Actual: 17.5
Predicted: 24.8423074503, Actual: 19.0
Predicted: 16.6782040171, Actual: 18.0
Predicted: 32.7968375922, Actual: 32.0
Predicted: 14.9025660389, Actual: 17.7
Predicted: 15.3464262096, Actual: 15.0
Predicted: 18.9638111493, Actual: 22.0
Predicted: 26.9743738532, Actual: 23.0
Predicted: 23.2312119071, Actual: 20.2
Predicted: 29.4165488717, Actual: 35.0
Predicted: 24.6202932979, Actual: 28.8
Predicted: 19.5000915272, Actual: 18.0
Predicted: 23.090561581, Actual: 23.0
Predicted: 38.0664049108, Actual: 37.0
Predicted: 20.9651746814, Actual: 21.0
Predicted: 29.5417968194, Actual: 29.8
Predicted: 17.0144256287, Actual: 16.2
Predicted: 20.8326170148, Actual: 21.5
Predicted: 32.1670449438, Actual: 29.5
Predicted: 23.0082059997, Actual: 19.0
Predicted: 23.2513931027, Actual: 23.2
Predicted: 24.1017375894, Actual: 28.0
Predicted: 23.8755314491, Actual: 23.0
Predicted: 21.5433611521, Actual: 25.0
Predicted: 16.0666414964, Actual: 16.0
Predicted: 13.0808748237, Actual: 13.0
Predicted: 31.2958897725, Actual: 29.5
Predicted: 22.3450668295, Actual: 19.4
Predicted: 28.1136406981, Actual: 31.0
Predicted: 26.8495836, Actual: 28.0
Predicted: 31.5882815681, Actual: 32.0
Predicted: 23.8615361052, Actual: 24.0
Predicted: 25.8740604078, Actual: 28.4
Predicted: 30.214390327, Actual: 32.0
Predicted: 23.1906818455, Actual: 24.0
Predicted: 32.449544013, Actual: 31.3
Predicted: 23.2856693712, Actual: 20.0
Predicted: 12.0127567535, Actual: 14.0
Predicted: 24.7192335988, Actual: 19.8
Predicted: 18.2473381975, Actual: 18.0
Predicted: 19.2284388676, Actual: 20.0
Predicted: 26.992739702, Actual: 30.0
Predicted: 23.7443765288, Actual: 20.2
Predicted: 24.9418790366, Actual: 23.0
Predicted: 19.7488154845, Actual: 15.0
Predicted: 28.7659298749, Actual: 27.0
Predicted: 19.4636716738, Actual: 18.0
Predicted: 28.0162056597, Actual: 28.1
Predicted: 16.9002717618, Actual: 18.0
Predicted: 20.2513640854, Actual: 20.0
Predicted: 14.5779224126, Actual: 14.0
Predicted: 26.5113041664, Actual: 29.0
Predicted: 29.3333899853, Actual: 30.7
Predicted: 28.4753991337, Actual: 35.0
Predicted: 32.9672553815, Actual: 41.5
Predicted: 25.9449460258, Actual: 28.0
Predicted: 14.2340168329, Actual: 15.0
Predicted: 12.4077841734, Actual: 13.0
Predicted: 29.9203154541, Actual: 29.0
Predicted: 17.5081234779, Actual: 18.0
Predicted: 22.8148987001, Actual: 21.0
Predicted: 19.4805423487, Actual: 19.2
Predicted: 24.3435280626, Actual: 26.0

In [80]:
model.score(quad_X_test,y_test)
Out[80]:
0.85141966237441713
In [81]:
sns.regplot(predictions, y_test)
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x11686bc90>
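The manual transform-then-fit steps above can also be bundled into a single estimator with a Pipeline, which keeps the polynomial expansion and the regression together and removes the need to transform the train and test sets by hand. A minimal sketch, assuming make_pipeline is available in your scikit-learn version:

from sklearn.pipeline import make_pipeline
quad_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad_pipeline.fit(X_train, y_train)
print quad_pipeline.score(X_test, y_test)  # should be close to the score above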
In [82]:
from sklearn.cross_validation import cross_val_score
In [83]:
scores = cross_val_score(model, quad_X_train, y_train, cv =10)
In [84]:
# R squared is the proportion of variance in the response variable that is explained by the model.
# A score of 1.0 means the model predicts the response perfectly; 0.0 means it does no better than always predicting the mean.
scores
Out[84]:
array([ 0.86963228,  0.88186708,  0.83549739,  0.82170629,  0.8882992 ,
        0.87384832,  0.83684078,  0.8580886 ,  0.84189087,  0.87059883])
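A common way to summarize cross-validation results is the mean score and its spread:

print scores.mean(), scores.std()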

Using the attributes of a car to predict MPG is not an exact science: car manufacturers design cars around their marketing plans. Although some cars are marketed as being fuel efficient, others are marketed for comfort, some are built to appeal to a certain demographic (women, young adults, corporate types), and some others for speed. Given these caveats, our model does a fairly good job.