Enron POI Classification

Project Overview

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I'm going to play detective, and build a person of interest identifier based on financial and email data made public as a result of the Enron scandal.

Python Preparation

In [15]:
import sys
import pickle
import numpy
import pandas
import sklearn
from ggplot import *
import matplotlib
%matplotlib inline

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import test_classifier, dump_classifier_and_data

Feature Selection

I'm going to start with all of the available features, then filter them and choose the best ones.

All available features fall into three major categories: financial features, email features, and the POI label.

  • financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)

  • email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'poi', 'shared_receipt_with_poi'] (units are generally the number of email messages; the notable exception is 'email_address', which is a text string)

  • POI label: [‘poi’] (boolean, represented as integer)

In [16]:
## Create features list
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 
                 'loan_advances', 'bonus', 'restricted_stock_deferred', 
                 'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person', 
                 'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi', 
                 'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees'] 

### Load the dictionary containing the dataset
enron_data = pickle.load(open("final_project_dataset.pkl", "r") )

## Load POI names file
f = open("poi_names.txt", "r")

# Print available information for Jeffrey Skilling
print enron_data["SKILLING JEFFREY K"]
{'salary': 1111258, 'to_messages': 3627, 'deferral_payments': 'NaN', 'total_payments': 8682716, 'exercised_stock_options': 19250000, 'bonus': 5600000, 'restricted_stock': 6843672, 'shared_receipt_with_poi': 2042, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 26093672, 'expenses': 29336, 'loan_advances': 'NaN', 'from_messages': 108, 'other': 22122, 'from_this_person_to_poi': 30, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 1920000, 'email_address': 'jeff.skilling@enron.com', 'from_poi_to_this_person': 88}

I want my classifier to handle 'NaN' values gracefully, so I'm going to reformat the dataset into a pandas DataFrame. Fortunately, the pandas package does all of this routine work for us in just one line of code. Let's also have a first look at the data.

In [17]:
df = pandas.DataFrame.from_records(list(enron_data.values()))
persons = pandas.Series(list(enron_data.keys()))

print df.head()
     bonus deferral_payments deferred_income director_fees  \
0   600000               NaN             NaN           NaN   
1  1200000           1295738        -1386055           NaN   
2   350000               NaN         -400729           NaN   
3      NaN               NaN             NaN           NaN   
4  1500000               NaN        -3117011           NaN   

              email_address exercised_stock_options expenses from_messages  \
0      mark.metts@enron.com                     NaN    94299            29   
1                       NaN                 6680544    11200           NaN   
2  steven.elliott@enron.com                 4890344    78552           NaN   
3     bill.cordes@enron.com                  651850      NaN            12   
4    kevin.hannon@enron.com                 5538001    34039            32   

  from_poi_to_this_person from_this_person_to_poi        ...         \
0                      38                       1        ...          
1                     NaN                     NaN        ...          
2                     NaN                     NaN        ...          
3                      10                       0        ...          
4                      32                      21        ...          

  long_term_incentive    other    poi restricted_stock  \
0                 NaN     1740  False           585062   
1             1586055  2660303  False          3942714   
2                 NaN    12961  False          1788391   
3                 NaN      NaN  False           386335   
4             1617011    11350   True           853064   

  restricted_stock_deferred  salary shared_receipt_with_poi to_messages  \
0                       NaN  365788                     702         807   
1                       NaN  267102                     NaN         NaN   
2                       NaN  170941                     NaN         NaN   
3                       NaN     NaN                      58         764   
4                       NaN  243293                    1035        1045   

  total_payments total_stock_value  
0        1061827            585062  
1        5634343          10623258  
2         211725           6678735  
3            NaN           1038185  
4         288682           6391065  

[5 rows x 21 columns]

Looks very good. Let's handle the 'NaN' values later.

Removing Outliers and Handling 'NaN' values

Let's see how many 'NaN' values each column contains and decide whether to impute those values or simply drop the column from the dataset. It will also be interesting to look for outliers once the values have been imputed and the dataset is tidy.

In [18]:
# Convert to numpy nan
df.replace(to_replace='NaN', value=numpy.nan, inplace=True)

# Count number of NaN's for columns
print df.isnull().sum()

# DataFrame dimensions
print df.shape
bonus                         64
deferral_payments            107
deferred_income               97
director_fees                129
email_address                 35
exercised_stock_options       44
expenses                      51
from_messages                 60
from_poi_to_this_person       60
from_this_person_to_poi       60
loan_advances                142
long_term_incentive           80
other                         53
poi                            0
restricted_stock              36
restricted_stock_deferred    128
salary                        51
shared_receipt_with_poi       60
to_messages                   60
total_payments                21
total_stock_value             20
dtype: int64
(146, 21)

I don't want to spend too much time on this, so I'll apply a simple rule: if a column has more than 65 'NaN' values, I'll remove it; otherwise I'll impute them.

In [19]:
# Remove a column from the dataset if its NaN count > 65

for column, series in df.iteritems():
    if series.isnull().sum() > 65:
        df.drop(column, axis=1, inplace=True)
       
# Drop email address column
if 'email_address' in list(df.columns.values):
    df.drop('email_address', axis=1, inplace=True)

        
# Alternative: mean imputation with sklearn.preprocessing.Imputer
#impute = sklearn.preprocessing.Imputer(missing_values=numpy.nan, strategy='mean', axis=0)
#impute.fit(df)
#df_imp = pandas.DataFrame(impute.transform(df.copy(deep=True)))

# Here the remaining NaN values are simply replaced with 0
df_imp = df.fillna(0).copy(deep=True)
print df_imp.isnull().sum()
print df_imp.head()

df_imp.describe()
bonus                      0
exercised_stock_options    0
expenses                   0
from_messages              0
from_poi_to_this_person    0
from_this_person_to_poi    0
other                      0
poi                        0
restricted_stock           0
salary                     0
shared_receipt_with_poi    0
to_messages                0
total_payments             0
total_stock_value          0
dtype: int64
     bonus  exercised_stock_options  expenses  from_messages  \
0   600000                        0     94299             29   
1  1200000                  6680544     11200              0   
2   350000                  4890344     78552              0   
3        0                   651850         0             12   
4  1500000                  5538001     34039             32   

   from_poi_to_this_person  from_this_person_to_poi    other    poi  \
0                       38                        1     1740  False   
1                        0                        0  2660303  False   
2                        0                        0    12961  False   
3                       10                        0        0  False   
4                       32                       21    11350   True   

   restricted_stock  salary  shared_receipt_with_poi  to_messages  \
0            585062  365788                      702          807   
1           3942714  267102                        0            0   
2           1788391  170941                        0            0   
3            386335       0                       58          764   
4            853064  243293                     1035         1045   

   total_payments  total_stock_value  
0         1061827             585062  
1         5634343           10623258  
2          211725            6678735  
3               0            1038185  
4          288682            6391065  
Out[19]:
bonus exercised_stock_options expenses from_messages from_poi_to_this_person from_this_person_to_poi other poi restricted_stock salary shared_receipt_with_poi to_messages total_payments total_stock_value
count 146.000000 1.460000e+02 146.000000 146.000000 146.000000 146.000000 146.000000 146 1.460000e+02 146.000000 146.000000 146.000000 1.460000e+02 1.460000e+02
mean 1333474.232877 4.182736e+06 70748.267123 358.602740 38.226027 24.287671 585431.794521 0.1232877 1.749257e+06 365811.356164 692.986301 1221.589041 4.350622e+06 5.846018e+06
std 8094029.239637 2.607040e+07 432716.319438 1441.259868 73.901124 79.278206 3682344.576631 0.3298989 1.089995e+07 2203574.963717 1072.969492 2226.770637 2.693448e+07 3.624681e+07
min 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 False -2.604490e+06 0.000000 0.000000 0.000000 0.000000e+00 -4.409300e+04
25% 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0 8.115000e+03 0.000000 0.000000 0.000000 9.394475e+04 2.288695e+05
50% 300000.000000 6.082935e+05 20182.000000 16.500000 2.500000 0.000000 959.500000 0 3.605280e+05 210596.000000 102.500000 289.000000 9.413595e+05 9.659550e+05
75% 800000.000000 1.714221e+06 53740.750000 51.250000 40.750000 13.750000 150606.500000 0 8.145280e+05 270850.500000 893.500000 1585.750000 1.968287e+06 2.319991e+06
max 97343619.000000 3.117640e+08 5235198.000000 14368.000000 528.000000 609.000000 42667589.000000 True 1.303223e+08 26704229.000000 5521.000000 15149.000000 3.098866e+08 4.345095e+08

Just from the summary we can already spot a number of extreme values by comparing the max with the 75% quantile. Actually, I don't think these are outliers or bad data; they are just extreme values for some persons. Let's explore the person names first.
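Before scanning the full list of names by hand, a quick lookup of the rows that hold these extreme values can point at suspicious entries. This is a hedged sketch: it assumes the persons Series built earlier from enron_data.keys() is aligned row-by-row with the DataFrame built from enron_data.values(), which holds because both come from the same Python 2 dictionary.

# Sketch: which person holds the maximum of a few payment-related columns?
for col in ['total_payments', 'salary', 'bonus']:
    idx = df_imp[col].idxmax()
    print col, "->", persons[idx]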

In [20]:
print(enron_data.keys())
['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY REX', 'LEMAISTRE CHARLES', 'DEFFNER JOSEPH M', 'KISHKILL JOSEPH G', 'WHALLEY LAWRENCE G', 'MCCONNELL MICHAEL S', 'PIRO JIM', 'DELAINEY DAVID W', 'SULLIVAN-SHAKLOVITZ COLLEEN', 'WROBEL BRUCE', 'LINDHOLM TOD A', 'MEYER JEROME J', 'LAY KENNETH L', 'BUTTS ROBERT H', 'OLSON CINDY K', 'MCDONALD REBECCA', 'CUMBERLAND MICHAEL S', 'GAHN ROBERT S', 'MCCLELLAN GEORGE', 'HERMANN ROBERT J', 'SCRIMSHAW MATTHEW', 'GATHMANN WILLIAM D', 'HAEDICKE MARK E', 'BOWEN JR RAYMOND M', 'GILLIS JOHN', 'FITZGERALD JAY L', 'MORAN MICHAEL P', 'REDMOND BRIAN L', 'BAZELIDES PHILIP J', 'BELDEN TIMOTHY N', 'DURAN WILLIAM D', 'THORN TERENCE H', 'FASTOW ANDREW S', 'FOY JOE', 'CALGER CHRISTOPHER F', 'RICE KENNETH D', 'KAMINSKI WINCENTY J', 'LOCKHART EUGENE E', 'COX DAVID', 'OVERDYKE JR JERE C', 'PEREIRA PAULO V. FERRAZ', 'STABLER FRANK', 'SKILLING JEFFREY K', 'BLAKE JR. NORMAN P', 'SHERRICK JEFFREY B', 'PRENTICE JAMES', 'GRAY RODNEY', 'PICKERING MARK R', 'THE TRAVEL AGENCY IN THE PARK', 'NOLES JAMES L', 'KEAN STEVEN J', 'TOTAL', 'FOWLER PEGGY', 'WASAFF GEORGE', 'WHITE JR THOMAS E', 'CHRISTODOULOU DIOMEDES', 'ALLEN PHILLIP K', 'SHARP VICTORIA T', 'JAEDICKE ROBERT', 'WINOKUR JR. HERBERT S', 'BROWN MICHAEL', 'BADUM JAMES P', 'HUGHES JAMES A', 'REYNOLDS LAWRENCE', 'DIMICHELE RICHARD G', 'BHATNAGAR SANJAY', 'CARTER REBECCA C', 'BUCHANAN HAROLD G', 'YEAP SOON', 'MURRAY JULIA H', 'GARLAND C KEVIN', 'DODSON KEITH', 'YEAGER F SCOTT', 'HIRKO JOSEPH', 'DIETRICH JANET R', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'PAI LOU L', 'BAY FRANKLIN R', 'HAYSLETT RODERICK J', 'FUGH JOHN L', 'FALLON JAMES B', 'KOENIG MARK E', 'SAVAGE FRANK', 'IZZO LAWRENCE L', 'TILNEY ELIZABETH A', 'MARTIN AMANDA K', 'BUY RICHARD B', 'GRAMM WENDY L', 'CAUSEY RICHARD A', 'TAYLOR MITCHELL S', 'DONAHUE JR JEFFREY M', 'GLISAN JR BEN F']

Generally, all of these strings are names of people from the Enron data. But I've noticed "TOTAL" and "THE TRAVEL AGENCY IN THE PARK", which aren't names. These observations should be deleted.

In [21]:
total_index = enron_data.keys().index("TOTAL")
print(total_index)
travel_index = enron_data.keys().index("THE TRAVEL AGENCY IN THE PARK")
print(travel_index)
df_subset = df_imp.drop(df_imp.index[[total_index,travel_index]])
df_subset.describe()
104
101
Out[21]:
bonus exercised_stock_options expenses from_messages from_poi_to_this_person from_this_person_to_poi other poi restricted_stock salary shared_receipt_with_poi to_messages total_payments total_stock_value
count 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144 144.000000 144.000000 144.000000 144.000000 1.440000e+02 144.000000
mean 675997.354167 2075801.979167 35375.340278 363.583333 38.756944 24.625000 294745.534722 0.125 868536.291667 185446.034722 702.611111 1238.555556 2.256543e+06 2909785.611111
std 1233155.255938 4795513.145239 45309.303038 1450.675239 74.276769 79.778266 1131325.452833 0.3318733 2016572.388715 197042.123807 1077.290736 2237.564816 8.847189e+06 6189018.075043
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 False -2604490.000000 0.000000 0.000000 0.000000 0.000000e+00 -44093.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 24345.000000 0.000000 0.000000 0.000000 9.019275e+04 244326.500000
50% 300000.000000 608293.500000 20182.000000 17.500000 4.000000 0.000000 919.000000 0 360528.000000 210596.000000 114.000000 347.500000 9.413595e+05 965955.000000
75% 800000.000000 1683580.250000 53328.250000 53.000000 41.250000 14.000000 148577.000000 0 737456.000000 269667.500000 933.750000 1623.000000 1.945668e+06 2295175.750000
max 8000000.000000 34348384.000000 228763.000000 14368.000000 528.000000 609.000000 10359729.000000 True 14761694.000000 1111258.000000 5521.000000 15149.000000 1.035598e+08 49110078.000000

After this cleaning, let's explore which features we have left and the dimensions of the data.

In [22]:
print "Values:", list(df_subset.columns.values)

print "Shape: ", df_subset.shape

print "Number of POI in DataSet: ", (df_subset['poi'] == 1).sum()
print "Number of non-POI in Dataset: ", (df_subset['poi'] == 0).sum()
Values: ['bonus', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'from_this_person_to_poi', 'other', 'poi', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'to_messages', 'total_payments', 'total_stock_value']
Shape:  (144, 14)
Number of POI in DataSet:  18
Number of non-POI in Dataset:  126

Creating new Features

In my opinion, a very useful feature is the ratio between messages connected with a POI and all messages. I'm going to name this variable 'poi_ratio'; it stores the ratio as a percentage. The following features will also be added:

  • the fraction of all emails to a person that were sent from a person of interest
  • the fraction of all emails that a person sent that were addressed to persons of interest

The hypothesis behind these features was that there might be stronger email connections between POIs than between POIs and non-POIs, and a scatterplot of these two features suggests that there might be some truth to that hypothesis.

Additionally, I want to scale 'salary' to the range [0, 100].

In [23]:
poi_ratio = (df_subset['from_poi_to_this_person'] + df_subset['from_this_person_to_poi']) / (df_subset['from_messages'] + df_subset['to_messages'])
fraction_to_poi = (df_subset['from_this_person_to_poi']) / (df_subset['from_messages'])
fraction_from_poi = (df_subset['from_poi_to_this_person']) / (df_subset['to_messages'])
scale = sklearn.preprocessing.MinMaxScaler(feature_range=(0, 100), copy=True)

df_subset['poi_ratio'] = pandas.Series(poi_ratio) * 100
df_subset['fraction_to_poi'] = pandas.Series(fraction_to_poi) * 100
df_subset['fraction_from_poi'] = pandas.Series(fraction_from_poi) * 100
df_subset['salary'] = scale.fit_transform(df_subset['salary'])  # scale salary into [0, 100]
df_subset.describe()
Out[23]:
bonus exercised_stock_options expenses from_messages from_poi_to_this_person from_this_person_to_poi other poi restricted_stock salary shared_receipt_with_poi to_messages total_payments total_stock_value poi_ratio fraction_to_poi fraction_from_poi
count 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144.000000 144 144.000000 144.000000 144.000000 144.000000 1.440000e+02 144.000000 86.000000 86.000000 86.000000
mean 675997.354167 2075801.979167 35375.340278 363.583333 38.756944 24.625000 294745.534722 0.125 868536.291667 16.687937 702.611111 1238.555556 2.256543e+06 2909785.611111 4.770893 18.405548 3.796298
std 1233155.255938 4795513.145239 45309.303038 1450.675239 74.276769 79.778266 1131325.452833 0.3318733 2016572.388715 17.731447 1077.290736 2237.564816 8.847189e+06 6189018.075043 4.645513 21.061035 4.054439
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 False -2604490.000000 0.000000 0.000000 0.000000 0.000000e+00 -44093.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 24345.000000 0.000000 0.000000 0.000000 9.019275e+04 244326.500000 1.304333 1.242096 0.919988
50% 300000.000000 608293.500000 20182.000000 17.500000 4.000000 0.000000 919.000000 0 360528.000000 18.951135 114.000000 347.500000 9.413595e+05 965955.000000 3.176610 10.057359 2.584537
75% 800000.000000 1683580.250000 53328.250000 53.000000 41.250000 14.000000 148577.000000 0 737456.000000 24.266867 933.750000 1623.000000 1.945668e+06 2295175.750000 6.696795 27.203947 5.608871
max 8000000.000000 34348384.000000 228763.000000 14368.000000 528.000000 609.000000 10359729.000000 True 14761694.000000 100.000000 5521.000000 15149.000000 1.035598e+08 49110078.000000 22.435175 100.000000 21.734104

Let's visualize fraction_to_poi against the (scaled) salary, colored by POI status. Note that the new ratio features are only defined for the 86 persons with email data; the remaining 'NaN' values will be filled with 0 before training.

In [24]:
g = ggplot(aes(x='salary', y='fraction_to_poi', colour='poi'), data=df_subset) + geom_point()
print g
<ggplot: (280128253)>
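To check the hypothesis from the feature-creation step more directly, the two fraction features can also be plotted against each other in the same ggplot style. This is a sketch using the columns created above; its output is not reproduced here.

# Sketch: scatter the two new fraction features, colored by POI status
g2 = ggplot(aes(x='fraction_from_poi', y='fraction_to_poi', colour='poi'), data=df_subset) + geom_point()
print g2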

Training Classifier

First, split the data into training and test sets. We will use StratifiedShuffleSplit for cross-validation because the dataset is small and the POI class is rare.

In [25]:
labels = df_subset['poi'].copy(deep=True).astype(int).as_matrix()
features = (df_subset.drop('poi', axis=1)).fillna(0).copy(deep=True).as_matrix()
shuffle = sklearn.cross_validation.StratifiedShuffleSplit(labels, 4, test_size=0.3, random_state=0)

print labels
print features
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1]
[[  6.00000000e+05   0.00000000e+00   9.42990000e+04 ...,   4.66507177e+00
    3.44827586e+00   4.70879802e+00]
 [  1.20000000e+06   6.68054400e+06   1.12000000e+04 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  3.50000000e+05   4.89034400e+06   7.85520000e+04 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 ..., 
 [  6.00000000e+05   3.18125000e+06   0.00000000e+00 ...,   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  8.00000000e+05   7.65920000e+05   9.62680000e+04 ...,   2.24351747e+01
    5.00000000e+01   2.17341040e+01]
 [  6.00000000e+05   3.84728000e+05   1.25978000e+05 ...,   6.52418448e+00
    3.75000000e+01   5.95647194e+00]]

Let's try four different initial models, and then tune the parameters of a few of them.

Random Forest

In [26]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=10)
scores = sklearn.cross_validation.cross_val_score(rf_clf, features, labels)

print scores
[ 0.875       0.875       0.85416667]
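Note that accuracy from the default 3-fold split can be misleading here: only about 12.5% of the people are POIs, so a classifier that always predicts non-POI already scores around 0.875. As a hedged sketch, the StratifiedShuffleSplit created earlier (shuffle) and an F1 scorer could be passed to cross_val_score instead:

# Sketch: evaluate with the stratified shuffle split and F1 instead of plain accuracy
scores_f1 = sklearn.cross_validation.cross_val_score(rf_clf, features, labels, scoring='f1', cv=shuffle)
print scores_f1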

AdaBoost

In [27]:
from sklearn.ensemble import AdaBoostClassifier

ab_clf = AdaBoostClassifier(n_estimators=100)
scores = sklearn.cross_validation.cross_val_score(ab_clf, features, labels)
print scores 
[ 0.85416667  0.85416667  0.77083333]

Extremely Randomized Tree

In [28]:
from sklearn.ensemble import ExtraTreesClassifier

erf_clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0)
scores = sklearn.cross_validation.cross_val_score(erf_clf, features, labels)
print scores
[ 0.85416667  0.91666667  0.8125    ]

Gaussian Naive Bayes

In [29]:
from sklearn.naive_bayes import GaussianNB

gnb_clf = GaussianNB()
scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels)
print scores
[ 0.85416667  0.875       0.77083333]

Tuning Chosen Classifiers

Chosen Algorithms:

  • Random Forest
  • AdaBoost

Why not SVM? It's too slow to tune; I want a more computationally efficient algorithm.

Why not Naive Bayes? It's too simple, and judging from the initial scores it doesn't seem to be a good choice.

Why not Extremely Randomized Trees? They are too complicated to tune.

Tuning Random Forest

Let's investigate the main parameters of the random forest:

  • min_samples_split
  • n_estimators
  • min_samples_leaf
  • criterion
In [30]:
from sklearn import grid_search

cv = sklearn.cross_validation.StratifiedShuffleSplit(labels, n_iter=10)

def scoring(estimator, features_test, labels_test):
    labels_pred = estimator.predict(features_test)
    p = sklearn.metrics.precision_score(labels_test, labels_pred, average='micro')
    r = sklearn.metrics.recall_score(labels_test, labels_pred, average='micro')
    if p > 0.3 and r > 0.3:
        return sklearn.metrics.f1_score(labels_test, labels_pred, average='macro')
    return 0
                                     
parameters = {'max_depth': [2,3,4,5,6],'min_samples_split':[2,3,4,5], 'n_estimators':[10,20,50], 'min_samples_leaf':[1,2,3,4], 'criterion':('gini', 'entropy')}
rf_clf = RandomForestClassifier()
rfclf = grid_search.GridSearchCV(rf_clf, parameters, scoring = scoring, cv = cv)
rfclf.fit(features, labels)

print rfclf.best_estimator_
print rfclf.best_score_
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=5, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=2,
            min_samples_split=5, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)
0.483333333333
/Users/oleksiirenov/anaconda/lib/python2.7/site-packages/sklearn/metrics/metrics.py:1771: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

Tuning AdaBoost

Let's investigate the main parameters of AdaBoost:

  • n_estimators
  • learning_rate
  • algorithm
In [31]:
from sklearn import grid_search
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
parameters = {'n_estimators' : [5, 10, 30, 40, 50, 100,150], 'learning_rate' : [0.1, 0.5, 1, 1.5, 2, 2.5], 'algorithm' : ('SAMME', 'SAMME.R')}
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8))
adaclf = grid_search.GridSearchCV(ada_clf, parameters, scoring = scoring, cv = cv)
adaclf.fit(features, labels)

print adaclf.best_estimator_
print adaclf.best_score_
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=8, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=None, splitter='best'),
          learning_rate=2.5, n_estimators=10, random_state=None)
0.616666666667

Testing Classifier

Testing Random Forest best estimator

In [32]:
rf_best_clf = rfclf.best_estimator_
list_cols = list(df_subset.columns.values)
list_cols.remove('poi')
list_cols.insert(0, 'poi')
data = df_subset[list_cols].fillna(0).to_dict(orient='records')
enron_data_sub = {}
counter = 0
for item in data:
    enron_data_sub[counter] = item
    counter += 1
    
test_classifier(rf_best_clf, enron_data_sub, list_cols)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=5, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=2,
            min_samples_split=5, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)
	Accuracy: 0.85760	Precision: 0.39408	Recall: 0.12650	F1: 0.19152	F2: 0.14638
	Total predictions: 15000	True positives:  253	False positives:  389	False negatives: 1747	True negatives: 12611
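As a quick sanity check, the precision and recall reported by the tester follow directly from the confusion counts printed above; the small sketch below just redoes that arithmetic.

# Sketch: derive precision and recall from the confusion counts above
tp, fp, fn = 253, 389, 1747
print "precision:", float(tp) / (tp + fp)   # ~0.394
print "recall:   ", float(tp) / (tp + fn)   # ~0.127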

Testing AdaBoost best estimator

In [33]:
ada_best_clf = adaclf.best_estimator_
test_classifier(ada_best_clf, enron_data_sub, list_cols)
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=8, max_features=None, max_leaf_nodes=None,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            random_state=None, splitter='best'),
          learning_rate=2.5, n_estimators=10, random_state=None)
	Accuracy: 0.81860	Precision: 0.31966	Recall: 0.31950	F1: 0.31958	F2: 0.31953
	Total predictions: 15000	True positives:  639	False positives: 1360	False negatives: 1361	True negatives: 11640

Dump successful classifier

My AdaBoost classifier beats the 0.3 threshold for both precision and recall; unfortunately, the random forest does not.

In [34]:
dump_classifier_and_data(ada_best_clf, enron_data_sub, list_cols)

List Of Resources