Regression using scikit-learn

This notebook will demonstrate using linear and polynomial regression to model the relationship between feature variables and the response variable.

We will use the MPG dataset to practice using regression to predict the fuel economy (MPG) of a car given its features.

For the mathematical explanation and theory, please view lecture notes: http://george1328.github.io/lecture_notes/Regression_Regularization.pdf

Step 1: Get the data

In [43]:
import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [44]:
# Column/feature labels are not available in the dataset, so we create a list of features using auto-mpg.names
features = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
In [45]:
# Import the data directly into pandas from the url, specify header=None as column labels are not in dataset
import urllib
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
# file is fixed-width-format so we will use read_fwf instead of read_csv
df = pd.read_fwf(urllib.urlopen(url), header = None)
df.columns = features
In [46]:
# Alternatively, we can download the data
# We use the bang (!) within IPython Notebooks to run command-line statements directly from the notebook
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
! curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30286  100 30286    0     0  85395      0 --:--:-- --:--:-- --:--:-- 85553
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1660  100  1660    0     0   5315      0 --:--:-- --:--:-- --:--:--  5320

In [47]:
# Since this dataset has no column headings, we need to explicitly state names=features
df = pd.read_fwf("auto-mpg.data", names = features)

Step 2: Clean the data

In [48]:
# Use head, describe, info and unique to get a sense of the data
df.describe()
Out[48]:
mpg cylinders displacement weight acceleration model year origin
count 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.425879 2970.424623 15.568090 76.010050 1.572864
std 7.815984 1.701004 104.269838 846.841774 2.757689 3.697627 0.802055
min 9.000000 3.000000 68.000000 1613.000000 8.000000 70.000000 1.000000
25% 17.500000 4.000000 104.250000 2223.750000 13.825000 73.000000 1.000000
50% 23.000000 4.000000 148.500000 2803.500000 15.500000 76.000000 1.000000
75% 29.000000 8.000000 262.000000 3608.000000 17.175000 79.000000 2.000000
max 46.600000 8.000000 455.000000 5140.000000 24.800000 82.000000 3.000000
In [49]:
df.head()
Out[49]:
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18 8 307 130.0 3504 12.0 70 1 "chevrolet chevelle malibu"
1 15 8 350 165.0 3693 11.5 70 1 "buick skylark 320"
2 18 8 318 150.0 3436 11.0 70 1 "plymouth satellite"
3 16 8 304 150.0 3433 12.0 70 1 "amc rebel sst"
4 17 8 302 140.0 3449 10.5 70 1 "ford torino"
In [50]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(3), object(2)
In [51]:
# Car name and horsepower are the only non-numeric fields. The name of a car is unlikely to have an influence on MPG,
# so horsepower is the one we need to convert.
df.horsepower.unique()
Out[51]:
array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)
In [52]:
(df.horsepower == '?').sum()
Out[52]:
6
In [53]:
# Let's convert horsepower to a numeric field so we can use it in our analysis
df['horsepower'] = df['horsepower'].convert_objects(convert_numeric = True)
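Side note: convert_objects is deprecated in newer pandas releases. A minimal equivalent, assuming a pandas version that provides pd.to_numeric, would be:

df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')  # the '?' entries become NaN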
In [54]:
# We will drop the 6 records that are missing horsepower. We could estimate (impute) these missing values,
# but for the sake of accuracy we will not. Also, it's only 6 missing values.
df = df.dropna()
In [55]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null float64
acceleration    392 non-null float64
model year      392 non-null int64
origin          392 non-null int64
car name        392 non-null object
dtypes: float64(5), int64(3), object(1)

Step 3: Get a sense of the data using Exploratory Data Analysis(EDA)

In [56]:
import seaborn as sns
In [57]:
sns.boxplot(df.mpg, df.cylinders)
# This is interesting. 4 cylinder vehicles have better mileage on average than 3 cylinder vehicles
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x11156f590>
In [58]:
three_cyl = df[df.cylinders == 3]
print three_cyl['car name']
## Aha! Tiny Mazda roadsters...
71     "mazda rx2 coupe"
111          "maxda rx3"
243         "mazda rx-4"
334      "mazda rx-7 gs"
Name: car name, dtype: object

In [59]:
sns.violinplot(df.mpg, df['model year'])
# Fancy seaborn graphing
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x1116b3950>
In [60]:
sns.barplot(df.mpg, df.horsepower)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x1115fd9d0>
In [61]:
sns.barplot(df.mpg, df.weight)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129ff9d0>
In [62]:
sns.boxplot(df.mpg, df.origin)
# Although the values of origin are not given, we can guess that 1=USA, 2=Europe and 3=Japan... Maybe...
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x113e3eb90>
In [63]:
sns.boxplot(df.mpg, df.acceleration)
# Little cars have pretty good acceleration AND good mileage, so not a great association
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1127112d0>
In [64]:
sns.kdeplot(df.mpg, df.cylinders)
# Showing different plot options in seaborn :-)
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x116532710>

An alternative to visualizing the data with plots is to print the correlation matrix.

In [65]:
df.corr()
Out[65]:
mpg cylinders displacement horsepower weight acceleration model year origin
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541 0.565209
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647 -0.568932
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855 -0.614535
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361 -0.455171
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120 -0.585005
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316 0.212746
model year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000 0.181528
origin 0.565209 -0.568932 -0.614535 -0.455171 -0.585005 0.212746 0.181528 1.000000
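If a picture is preferred, the same matrix can be drawn as a heatmap. A quick sketch, assuming a seaborn version that includes heatmap:

sns.heatmap(df.corr(), annot=True)  # annotate each cell with its correlation value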

Step 4: Prepare data for analysis

In [66]:
# Create numpy arrays X and y with the predictor and response variables
X = df[['weight', 'model year', 'horsepower', 'origin', 'displacement']].values
y = df['mpg'].values
In [67]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
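Note that no random seed is fixed here, so the split (and therefore the predictions and scores printed below) will vary from run to run. A reproducible variant, using the import location introduced in scikit-learn 0.18+, would look like this (the seed value is arbitrary):

# from sklearn.model_selection import train_test_split  # newer scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)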

Step 5: Fit, predict and interpret the results

In [69]:
model = LinearRegression()
model.fit(X_train, y_train)
Out[69]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
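Before predicting, we can inspect the fitted intercept and coefficients to see how each feature contributes to MPG. A quick sketch (the order of coef_ matches the columns of X defined above):

print model.intercept_
print zip(['weight', 'model year', 'horsepower', 'origin', 'displacement'], model.coef_)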
In [70]:
predictions = model.predict(X_test)
In [71]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])
Predicted: 24.1374463009, Actual: 23.0
Predicted: 20.7218206344, Actual: 21.0
Predicted: 21.8941489423, Actual: 21.0
Predicted: 31.2739533219, Actual: 43.4
Predicted: 8.73849758971, Actual: 10.0
Predicted: 12.3567499567, Actual: 13.0
Predicted: 20.5539888194, Actual: 18.6
Predicted: 23.5043942628, Actual: 21.5
Predicted: 24.2250158308, Actual: 20.8
Predicted: 29.7560919602, Actual: 23.9
Predicted: 23.9300320498, Actual: 19.1
Predicted: 18.1430047987, Actual: 17.5
Predicted: 24.1607297159, Actual: 25.0
Predicted: 20.2761485564, Actual: 18.0
Predicted: 35.8155410311, Actual: 38.0
Predicted: 8.7074446523, Actual: 13.0
Predicted: 17.8979064936, Actual: 17.5
Predicted: 24.4450715073, Actual: 19.0
Predicted: 17.6644426204, Actual: 18.0
Predicted: 31.3474114534, Actual: 32.0
Predicted: 20.4370044076, Actual: 17.7
Predicted: 15.9413895668, Actual: 15.0
Predicted: 21.0107531943, Actual: 22.0
Predicted: 25.7540134863, Actual: 23.0
Predicted: 23.9496970655, Actual: 20.2
Predicted: 29.6572909222, Actual: 35.0
Predicted: 26.9401894891, Actual: 28.8
Predicted: 19.8034067713, Actual: 18.0
Predicted: 24.5575812291, Actual: 23.0
Predicted: 35.4302563906, Actual: 37.0
Predicted: 20.8287149326, Actual: 21.0
Predicted: 29.6734765718, Actual: 29.8
Predicted: 22.015168613, Actual: 16.2
Predicted: 26.484637129, Actual: 21.5
Predicted: 31.8567092134, Actual: 29.5
Predicted: 25.3383968192, Actual: 19.0
Predicted: 25.368336478, Actual: 23.2
Predicted: 25.9351262498, Actual: 28.0
Predicted: 24.533323956, Actual: 23.0
Predicted: 24.3018889873, Actual: 25.0
Predicted: 16.9777759556, Actual: 16.0
Predicted: 10.832224498, Actual: 13.0
Predicted: 30.9380337002, Actual: 29.5
Predicted: 23.5404225255, Actual: 19.4
Predicted: 28.6884754537, Actual: 31.0
Predicted: 27.5495074176, Actual: 28.0
Predicted: 31.3874972604, Actual: 32.0
Predicted: 26.2884785574, Actual: 24.0
Predicted: 26.8559022819, Actual: 28.4
Predicted: 31.0940066284, Actual: 32.0
Predicted: 24.0129770055, Actual: 24.0
Predicted: 30.9106684972, Actual: 31.3
Predicted: 23.8318097224, Actual: 20.0
Predicted: 11.4172815594, Actual: 14.0
Predicted: 25.4509846351, Actual: 19.8
Predicted: 19.9742706988, Actual: 18.0
Predicted: 20.823209095, Actual: 20.0
Predicted: 25.589920275, Actual: 30.0
Predicted: 24.8602224809, Actual: 20.2
Predicted: 23.4832212726, Actual: 23.0
Predicted: 20.5270445691, Actual: 15.0
Predicted: 28.6886846162, Actual: 27.0
Predicted: 20.5578198929, Actual: 18.0
Predicted: 26.3241951584, Actual: 28.1
Predicted: 17.6514477796, Actual: 18.0
Predicted: 22.5131941701, Actual: 20.0
Predicted: 9.68360980809, Actual: 14.0
Predicted: 26.8113390062, Actual: 29.0
Predicted: 26.8711409002, Actual: 30.7
Predicted: 29.5757309976, Actual: 35.0
Predicted: 31.8732349221, Actual: 41.5
Predicted: 26.8232988233, Actual: 28.0
Predicted: 14.976597184, Actual: 15.0
Predicted: 10.4056403926, Actual: 13.0
Predicted: 29.4808727792, Actual: 29.0
Predicted: 19.6230595321, Actual: 18.0
Predicted: 23.2566865315, Actual: 21.0
Predicted: 21.2311062773, Actual: 19.2
Predicted: 24.4456286124, Actual: 26.0

In [72]:
model.score(X_test,y_test)
Out[72]:
0.77764337315272958
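R squared is unitless; an error metric expressed in MPG can be more intuitive. A sketch using sklearn.metrics:

from sklearn.metrics import mean_squared_error
import numpy as np
print np.sqrt(mean_squared_error(y_test, predictions))  # RMSE, in MPG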
In [73]:
sns.regplot(predictions, y_test)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x11680fd10>
In [74]:
from sklearn.preprocessing import PolynomialFeatures
In [75]:
quad_model = PolynomialFeatures(degree=2)
In [76]:
quad_X_train = quad_model.fit_transform(X_train)
# transform (rather than fit_transform) is the idiomatic call for the test set;
# for PolynomialFeatures the result is identical either way
quad_X_test = quad_model.transform(X_test)
In [77]:
model.fit(quad_X_train, y_train)
Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
In [78]:
predictions = model.predict(quad_X_test)
In [79]:
for i, prediction in enumerate(predictions):
    print 'Predicted: %s, Actual: %s' % (prediction, y_test[i])
Predicted: 23.3025864121, Actual: 23.0
Predicted: 19.9410146537, Actual: 21.0
Predicted: 19.4236241937, Actual: 21.0
Predicted: 35.4731652612, Actual: 43.4
Predicted: 12.1003117586, Actual: 10.0
Predicted: 13.3266849819, Actual: 13.0
Predicted: 18.369698152, Actual: 18.6
Predicted: 21.0278008026, Actual: 21.5
Predicted: 23.181315018, Actual: 20.8
Predicted: 27.0425561005, Actual: 23.9
Predicted: 23.7648774373, Actual: 19.1
Predicted: 15.7164721958, Actual: 17.5
Predicted: 23.6872919766, Actual: 25.0
Predicted: 19.9817607207, Actual: 18.0
Predicted: 38.5456159447, Actual: 38.0
Predicted: 13.504280478, Actual: 13.0
Predicted: 16.3986029163, Actual: 17.5
Predicted: 24.8423074503, Actual: 19.0
Predicted: 16.6782040171, Actual: 18.0
Predicted: 32.7968375922, Actual: 32.0
Predicted: 14.9025660389, Actual: 17.7
Predicted: 15.3464262096, Actual: 15.0
Predicted: 18.9638111493, Actual: 22.0
Predicted: 26.9743738532, Actual: 23.0
Predicted: 23.2312119071, Actual: 20.2
Predicted: 29.4165488717, Actual: 35.0
Predicted: 24.6202932979, Actual: 28.8
Predicted: 19.5000915272, Actual: 18.0
Predicted: 23.090561581, Actual: 23.0
Predicted: 38.0664049108, Actual: 37.0
Predicted: 20.9651746814, Actual: 21.0
Predicted: 29.5417968194, Actual: 29.8
Predicted: 17.0144256287, Actual: 16.2
Predicted: 20.8326170148, Actual: 21.5
Predicted: 32.1670449438, Actual: 29.5
Predicted: 23.0082059997, Actual: 19.0
Predicted: 23.2513931027, Actual: 23.2
Predicted: 24.1017375894, Actual: 28.0
Predicted: 23.8755314491, Actual: 23.0
Predicted: 21.5433611521, Actual: 25.0
Predicted: 16.0666414964, Actual: 16.0
Predicted: 13.0808748237, Actual: 13.0
Predicted: 31.2958897725, Actual: 29.5
Predicted: 22.3450668295, Actual: 19.4
Predicted: 28.1136406981, Actual: 31.0
Predicted: 26.8495836, Actual: 28.0
Predicted: 31.5882815681, Actual: 32.0
Predicted: 23.8615361052, Actual: 24.0
Predicted: 25.8740604078, Actual: 28.4
Predicted: 30.214390327, Actual: 32.0
Predicted: 23.1906818455, Actual: 24.0
Predicted: 32.449544013, Actual: 31.3
Predicted: 23.2856693712, Actual: 20.0
Predicted: 12.0127567535, Actual: 14.0
Predicted: 24.7192335988, Actual: 19.8
Predicted: 18.2473381975, Actual: 18.0
Predicted: 19.2284388676, Actual: 20.0
Predicted: 26.992739702, Actual: 30.0
Predicted: 23.7443765288, Actual: 20.2
Predicted: 24.9418790366, Actual: 23.0
Predicted: 19.7488154845, Actual: 15.0
Predicted: 28.7659298749, Actual: 27.0
Predicted: 19.4636716738, Actual: 18.0
Predicted: 28.0162056597, Actual: 28.1
Predicted: 16.9002717618, Actual: 18.0
Predicted: 20.2513640854, Actual: 20.0
Predicted: 14.5779224126, Actual: 14.0
Predicted: 26.5113041664, Actual: 29.0
Predicted: 29.3333899853, Actual: 30.7
Predicted: 28.4753991337, Actual: 35.0
Predicted: 32.9672553815, Actual: 41.5
Predicted: 25.9449460258, Actual: 28.0
Predicted: 14.2340168329, Actual: 15.0
Predicted: 12.4077841734, Actual: 13.0
Predicted: 29.9203154541, Actual: 29.0
Predicted: 17.5081234779, Actual: 18.0
Predicted: 22.8148987001, Actual: 21.0
Predicted: 19.4805423487, Actual: 19.2
Predicted: 24.3435280626, Actual: 26.0

In [80]:
model.score(quad_X_test,y_test)
Out[80]:
0.85141966237441713
In [81]:
sns.regplot(predictions, y_test)
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x11686bc90>
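The manual transform-then-fit steps above can also be bundled into a single estimator with a Pipeline, which keeps the polynomial expansion and the regression together and removes the need to transform the train and test sets by hand. A minimal sketch, assuming make_pipeline is available in your scikit-learn version:

from sklearn.pipeline import make_pipeline
quad_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad_pipeline.fit(X_train, y_train)
print quad_pipeline.score(X_test, y_test)  # should be close to the score above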
In [82]:
from sklearn.cross_validation import cross_val_score
In [83]:
scores = cross_val_score(model, quad_X_train, y_train, cv =10)
In [84]:
# R squared is the proportion of variance in the response variable that is explained by the model.
# A score of 1.0 means the model predicts the response perfectly; 0.0 means it does no better than always predicting the mean.
scores
Out[84]:
array([ 0.86963228,  0.88186708,  0.83549739,  0.82170629,  0.8882992 ,
        0.87384832,  0.83684078,  0.8580886 ,  0.84189087,  0.87059883])
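A common way to summarize cross-validation results is the mean score and its spread:

print scores.mean(), scores.std()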

Using the attributes of a car to predict MPG is not an exact science: car manufacturers design cars around their marketing plans. Although some cars are marketed as being fuel efficient, others are marketed for comfort, some are built to appeal to a certain demographic (women, young adults, corporate types), and some others for speed. Given these caveats, our model does a fairly good job.