In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I'm going to play detective, and build a person of interest identifier based on financial and email data made public as a result of the Enron scandal.
import sys
import pickle
import numpy
import pandas
import sklearn
# Explicitly import the sklearn submodules used below; "import sklearn" alone does not expose them
from sklearn import cross_validation, metrics, preprocessing
from ggplot import *
import matplotlib
%matplotlib inline
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import test_classifier, dump_classifier_and_data
I'm going to start with all of the available features, then filter them and keep only the best ones.
The available features fall into three major types: financial features, email features, and the POI label.
financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)
email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'poi', 'shared_receipt_with_poi'] (units are generally the number of email messages; the notable exception is 'email_address', which is a text string)
POI label: ['poi'] (boolean, represented as an integer)
## Create features list
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments',
'loan_advances', 'bonus', 'restricted_stock_deferred',
'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person',
'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi',
'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees']
### Load the dictionary containing the dataset
enron_data = pickle.load(open("final_project_dataset.pkl", "rb"))
## Load POI names file
f = open("poi_names.txt", "r")
# Print available information for Jeffrey Skilling
print enron_data["SKILLING JEFFREY K"]
I want my classifier to cope with the 'NaN' values, so I'm going to reformat the dataset into a pandas DataFrame. Fortunately, the pandas package does all of this routine work for us in just one line of code. Let's also take a first look at the data.
df = pandas.DataFrame.from_records(list(enron_data.values()))
persons = pandas.Series(list(enron_data.keys()))
print df.head()
Looks good. We'll handle the 'NaN' values in a moment.
Let's look at the numeric columns and see how many 'NaN' values each one contains, then decide whether to impute those values or simply drop the column from the dataset. It will also be interesting to look for outliers once the values are imputed and the dataset is tidy.
# Convert to numpy nan
df.replace(to_replace='NaN', value=numpy.nan, inplace=True)
# Count number of NaN's for columns
print df.isnull().sum()
# DataFrame dimensions
print df.shape
I don't want to spend too much time on this, so I'll use a simple rule: if a column contains more than 65 NaN values, I remove the column; otherwise, I impute the missing values.
# Remove a column from the DataFrame if it has more than 65 NaN values
# (iterate over a snapshot of the column names so columns can be dropped safely)
for column in list(df.columns.values):
    if df[column].isnull().sum() > 65:
        df.drop(column, axis=1, inplace=True)
# Drop the email address column, which is a text string and not useful as a feature
if 'email_address' in list(df.columns.values):
    df.drop('email_address', axis=1, inplace=True)
# Impute the remaining NaN values with zero
# (mean imputation via sklearn.preprocessing.Imputer would be an alternative)
df_imp = df.fillna(0)
print df_imp.isnull().sum()
print df_imp.head()
df_imp.describe()
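The full describe() table is wide; as a minimal sketch (assuming the df_imp built above), we can pull just the .75 quantile and the max rows out of the summary to see how stretched each column is:
# Compare the 75th percentile with the max for every column;
# large gaps point at extreme values for a handful of people
summary = df_imp.describe()
print summary.loc[['75%', 'max']]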
Just from the summary we can spot a large gap between the .75 quantile and the max in several columns. I don't think these are outliers or bad data, though; they are just extreme values for a few people. Let's look at the person names first.
print(enron_data.keys())
Generally, these strings are all names of people from the Enron data. However, I noticed "TOTAL" and "THE TRAVEL AGENCY IN THE PARK", which are not names of persons. These observations should be deleted.
total_index = enron_data.keys().index("TOTAL")
print(total_index)
travel_index = enron_data.keys().index("THE TRAVEL AGENCY IN THE PARK")
print(travel_index)
df_subset = df_imp.drop(df_imp.index[[total_index,travel_index]])
df_subset.describe()
After this cleaning, let's see which features we have left and the dimensions of the data.
print "Values:", list(df_subset.columns.values)
print "Shape: ", df_subset.shape
print "Number of POI in DataSet: ", (df_subset['poi'] == 1).sum()
print "Number of non-POI in Dataset: ", (df_subset['poi'] == 0).sum()
In my opinion, a very useful feature is the ratio of messages connected with a POI to all messages. I'm going to call this variable 'poi_ratio' and store it as a percentage. I'll also add two more features: 'fraction_to_poi' and 'fraction_from_poi', the fractions of a person's sent and received messages that go to or come from a POI.
The hypothesis behind these features is that there might be stronger email connections between POIs than between POIs and non-POIs, and a scatterplot of these features suggests there may be some truth to that hypothesis.
Additionally, I want to scale 'salary' to the range [0, 100].
poi_ratio = (df_subset['from_poi_to_this_person'] + df_subset['from_this_person_to_poi']) / (df_subset['from_messages'] + df_subset['to_messages'])
fraction_to_poi = (df_subset['from_this_person_to_poi']) / (df_subset['from_messages'])
fraction_from_poi = (df_subset['from_poi_to_this_person']) / (df_subset['to_messages'])
scale = sklearn.preprocessing.MinMaxScaler(feature_range=(0, 100), copy=True)
df_subset['poi_ratio'] = pandas.Series(poi_ratio) * 100
df_subset['fraction_to_poi'] = pandas.Series(fraction_to_poi) * 100
df_subset['fraction_from_poi'] = pandas.Series(fraction_from_poi) * 100
# Scale 'salary' to [0, 100]; the scaled values are kept in a separate array, not written back into df_subset
salary_scaled = scale.fit_transform(df_subset[['salary']].values)
df_subset.describe()
Let's visualize salary vs. fraction_to_poi, coloured by POI status.
g = ggplot(aes(x='salary', y='fraction_to_poi', colour='poi'), data=df_subset) + geom_point()
print g
First, split the data into labels and features and set up a stratified shuffle split for train/test evaluation. We use StratifiedShuffleSplit because the dataset is small and POIs are rare.
labels = df_subset['poi'].copy(deep=True).astype(int).as_matrix()
features = (df_subset.drop('poi', axis=1)).fillna(0).copy(deep=True).as_matrix()
shuffle = sklearn.cross_validation.StratifiedShuffleSplit(labels, 4, test_size=0.3, random_state=0)
print labels
print features
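The split object itself only yields index arrays; here is a minimal sketch (assuming the shuffle, features and labels defined above) of how its folds would be consumed:
# Each iteration of the split yields train/test index arrays for one stratified fold
for train_idx, test_idx in shuffle:
    features_train, features_test = features[train_idx], features[test_idx]
    labels_train, labels_test = labels[train_idx], labels[test_idx]
print "Fold sizes (train/test):", len(features_train), len(features_test)
In the model comparison below I simply pass features and labels to cross_val_score, which handles its own splitting.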
Let's try four different models with their default settings first, and then tune the parameters of a couple of them.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=10)
scores = sklearn.cross_validation.cross_val_score(rf_clf, features, labels)
print scores
from sklearn.ensemble import AdaBoostClassifier
ab_clf = AdaBoostClassifier(n_estimators=100)
scores = sklearn.cross_validation.cross_val_score(ab_clf, features, labels)
print scores
from sklearn.ensemble import ExtraTreesClassifier
erf_clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0)
scores = sklearn.cross_validation.cross_val_score(erf_clf, features, labels)
print scores
from sklearn.naive_bayes import GaussianNB
gnb_clf = GaussianNB()
scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels)
print scores
Chosen algorithms: Random Forest and AdaBoost.
Why not SVM? It is too slow to tune; I want a more computationally efficient algorithm.
Why not Naive Bayes? It is too simple, and judging from the initial scores it does not seem to be a good choice.
Why not Extremely Randomized Trees? It has too many parameters to tune.
Let's investigate the main parameters of the random forest:
from sklearn import grid_search
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels, n_iter=10)
# Custom scorer: only reward a model if both precision and recall clear 0.3, then rank by F1
def scoring(estimator, features_test, labels_test):
    labels_pred = estimator.predict(features_test)
    p = sklearn.metrics.precision_score(labels_test, labels_pred, average='micro')
    r = sklearn.metrics.recall_score(labels_test, labels_pred, average='micro')
    if p > 0.3 and r > 0.3:
        return sklearn.metrics.f1_score(labels_test, labels_pred, average='macro')
    return 0
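As a quick sanity check that this custom scorer plugs into the scikit-learn machinery (a minimal sketch assuming the gnb_clf, features, labels and cv objects defined above), it can be passed directly to cross_val_score:
# The scorer's (estimator, X, y) signature is exactly what cross_val_score expects
sanity_scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels, scoring=scoring, cv=cv)
print "Custom scorer on untuned GaussianNB:", sanity_scores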
parameters = {'max_depth': [2,3,4,5,6],'min_samples_split':[2,3,4,5], 'n_estimators':[10,20,50], 'min_samples_leaf':[1,2,3,4], 'criterion':('gini', 'entropy')}
rf_clf = RandomForestClassifier()
rfclf = grid_search.GridSearchCV(rf_clf, parameters, scoring = scoring, cv = cv)
rfclf.fit(features, labels)
print rfclf.best_estimator_
print rfclf.best_score_
Let's investigate the main parameters of AdaBoost:
from sklearn import grid_search
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
parameters = {'n_estimators' : [5, 10, 30, 40, 50, 100,150], 'learning_rate' : [0.1, 0.5, 1, 1.5, 2, 2.5], 'algorithm' : ('SAMME', 'SAMME.R')}
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8))
adaclf = grid_search.GridSearchCV(ada_clf, parameters, scoring = scoring, cv = cv)
adaclf.fit(features, labels)
print adaclf.best_estimator_
print adaclf.best_score_
rf_best_clf = rfclf.best_estimator_
list_cols = list(df_subset.columns.values)
list_cols.remove('poi')
list_cols.insert(0, 'poi')
data = df_subset[list_cols].fillna(0).to_dict(orient='records')
# Re-key the records by integer index so the data has the dict-of-dicts shape tester.py expects
enron_data_sub = {}
counter = 0
for item in data:
    enron_data_sub[counter] = item
    counter += 1
test_classifier(rf_best_clf, enron_data_sub, list_cols)
ada_best_clf = adaclf.best_estimator_
test_classifier(ada_best_clf, enron_data_sub, list_cols)
My AdaBoost classifier clears the 0.3 precision and recall thresholds; unfortunately, the random forest does not.
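For reference, here is a minimal sketch of how precision and recall for the chosen model could be double-checked outside tester.py (assuming the ada_best_clf, features and labels defined above; this uses plain 5-fold cross-validated predictions rather than tester.py's own stratified shuffle evaluation, so the numbers will differ slightly):
# Cross-validated predictions for every sample, then overall precision/recall on the POI class
predictions = sklearn.cross_validation.cross_val_predict(ada_best_clf, features, labels, cv=5)
print "Precision:", sklearn.metrics.precision_score(labels, predictions)
print "Recall:   ", sklearn.metrics.recall_score(labels, predictions)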
dump_classifier_and_data(ada_best_clf, enron_data_sub, list_cols)
More code and related material can be found at https://github.com/orenov/udacity-nanodegree-data-analyst/tree/master/Machine-Learning/final_project.