In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, there was a significant amount of typically confidential information entered into public record, including tens of thousands of emails and detailed financial data for top executives. In this project, I'm going to play detective, and build a person of interest identifier based on financial and email data made public as a result of the Enron scandal.
import sys
import pickle
import numpy
import pandas
import sklearn
# Explicitly import the sklearn submodules used below; "import sklearn" alone does not expose them
from sklearn import cross_validation, metrics, preprocessing
from ggplot import *
import matplotlib
%matplotlib inline
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import test_classifier, dump_classifier_and_data
I'm going to start with all of the available features, then filter them and keep only the best ones.
The available features fall into three major types: financial features, email features, and the POI label.
financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)
email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'poi', 'shared_receipt_with_poi'] (units are generally the number of email messages; the notable exception is 'email_address', which is a text string)
POI label: ['poi'] (boolean, represented as an integer)
## Create features list
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments',
'loan_advances', 'bonus', 'restricted_stock_deferred',
'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person',
'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi',
'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees']
### Load the dictionary containing the dataset
enron_data = pickle.load(open("final_project_dataset.pkl", "rb"))
## Load POI names file
f = open("poi_names.txt", "r")
# Print available information for Jeffrey Skilling
print enron_data["SKILLING JEFFREY K"]
I want my classifier to cope with the 'NaN' values, so I'm going to reformat the dataset into a pandas DataFrame. Fortunately, the pandas package does all of this routine work for us in just one line of code. Let's also take a first look at the data.
df = pandas.DataFrame.from_records(list(enron_data.values()))
persons = pandas.Series(list(enron_data.keys()))
print df.head()
Looks good. We'll handle the 'NaN' values in a moment.
Let's look at the numeric columns and see how many 'NaN' values each one contains, then decide whether to impute those values or simply drop the column from the dataset. It will also be interesting to look for outliers once the values are imputed and the dataset is tidy.
# Convert to numpy nan
df.replace(to_replace='NaN', value=numpy.nan, inplace=True)
# Count number of NaN's for columns
print df.isnull().sum()
# DataFrame dimensions
print df.shape
I don't want to spend too much time on this, so I'll use a simple rule: if a column contains more than 65 NaN values, I remove the column; otherwise, I impute the missing values.
# Remove a column from the DataFrame if it has more than 65 NaN values
# (iterate over a snapshot of the column names so columns can be dropped safely)
for column in list(df.columns.values):
    if df[column].isnull().sum() > 65:
        df.drop(column, axis=1, inplace=True)
# Drop the email address column, which is a text string and not useful as a feature
if 'email_address' in list(df.columns.values):
    df.drop('email_address', axis=1, inplace=True)
# Impute the remaining NaN values with zero
# (mean imputation via sklearn.preprocessing.Imputer would be an alternative)
df_imp = df.fillna(0)
print df_imp.isnull().sum()
print df_imp.head()
df_imp.describe()
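The full describe() table is wide; as a minimal sketch (assuming the df_imp built above), we can pull just the .75 quantile and the max rows out of the summary to see how stretched each column is:
# Compare the 75th percentile with the max for every column;
# large gaps point at extreme values for a handful of people
summary = df_imp.describe()
print summary.loc[['75%', 'max']]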
Just from the summary we can spot a large gap between the .75 quantile and the max in several columns. I don't think these are outliers or bad data, though; they are just extreme values for a few people. Let's look at the person names first.
print(enron_data.keys())
Generally, these strings are all names of people from the Enron data. However, I noticed "TOTAL" and "THE TRAVEL AGENCY IN THE PARK", which are not names of persons. These observations should be deleted.
total_index = enron_data.keys().index("TOTAL")
print(total_index)
travel_index = enron_data.keys().index("THE TRAVEL AGENCY IN THE PARK")
print(travel_index)
df_subset = df_imp.drop(df_imp.index[[total_index,travel_index]])
df_subset.describe()
After this cleaning, let's see which features we have left and the dimensions of the data.
print "Values:", list(df_subset.columns.values)
print "Shape: ", df_subset.shape
print "Number of POI in DataSet: ", (df_subset['poi'] == 1).sum()
print "Number of non-POI in Dataset: ", (df_subset['poi'] == 0).sum()
In my opinion, a very useful feature is the ratio of messages connected with a POI to all messages. I'm going to call this variable 'poi_ratio' and store it as a percentage. I'll also add two more features: 'fraction_to_poi' and 'fraction_from_poi', the fractions of a person's sent and received messages that go to or come from a POI.
The hypothesis behind these features is that there might be stronger email connections between POIs than between POIs and non-POIs, and a scatterplot of these features suggests there may be some truth to that hypothesis.
Additionally, I want to scale 'salary' to the range [0, 100].
poi_ratio = (df_subset['from_poi_to_this_person'] + df_subset['from_this_person_to_poi']) / (df_subset['from_messages'] + df_subset['to_messages'])
fraction_to_poi = (df_subset['from_this_person_to_poi']) / (df_subset['from_messages'])
fraction_from_poi = (df_subset['from_poi_to_this_person']) / (df_subset['to_messages'])
scale = sklearn.preprocessing.MinMaxScaler(feature_range=(0, 100), copy=True)
df_subset['poi_ratio'] = pandas.Series(poi_ratio) * 100
df_subset['fraction_to_poi'] = pandas.Series(fraction_to_poi) * 100
df_subset['fraction_from_poi'] = pandas.Series(fraction_from_poi) * 100
# Scale 'salary' to [0, 100]; the scaled values are kept in a separate array, not written back into df_subset
salary_scaled = scale.fit_transform(df_subset[['salary']].values)
df_subset.describe()
Let's visualize salary vs. fraction_to_poi, coloured by POI status.
g = ggplot(aes(x='salary', y='fraction_to_poi', colour='poi'), data=df_subset) + geom_point()
print g
First, split the data into labels and features and set up a stratified shuffle split for train/test evaluation. We use StratifiedShuffleSplit because the dataset is small and POIs are rare.
labels = df_subset['poi'].copy(deep=True).astype(int).as_matrix()
features = (df_subset.drop('poi', axis=1)).fillna(0).copy(deep=True).as_matrix()
shuffle = sklearn.cross_validation.StratifiedShuffleSplit(labels, 4, test_size=0.3, random_state=0)
print labels
print features
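The split object itself only yields index arrays; here is a minimal sketch (assuming the shuffle, features and labels defined above) of how its folds would be consumed:
# Each iteration of the split yields train/test index arrays for one stratified fold
for train_idx, test_idx in shuffle:
    features_train, features_test = features[train_idx], features[test_idx]
    labels_train, labels_test = labels[train_idx], labels[test_idx]
print "Fold sizes (train/test):", len(features_train), len(features_test)
In the model comparison below I simply pass features and labels to cross_val_score, which handles its own splitting.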
Let's try four different models with their default settings first, and then tune the parameters of a couple of them.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=10)
scores = sklearn.cross_validation.cross_val_score(rf_clf, features, labels)
print scores
from sklearn.ensemble import AdaBoostClassifier
ab_clf = AdaBoostClassifier(n_estimators=100)
scores = sklearn.cross_validation.cross_val_score(ab_clf, features, labels)
print scores
from sklearn.ensemble import ExtraTreesClassifier
erf_clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0)
scores = sklearn.cross_validation.cross_val_score(erf_clf, features, labels)
print scores
from sklearn.naive_bayes import GaussianNB
gnb_clf = GaussianNB()
scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels)
print scores
Chosen algorithms: Random Forest and AdaBoost.
Why not SVM? It is too slow to tune; I want a more computationally efficient algorithm.
Why not Naive Bayes? It is too simple, and judging from the initial scores it does not seem to be a good choice.
Why not Extremely Randomized Trees? It has too many parameters to tune.
Let's investigate the main parameters of the random forest:
from sklearn import grid_search
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels, n_iter=10)
# Custom scorer: only reward a model if both precision and recall clear 0.3, then rank by F1
def scoring(estimator, features_test, labels_test):
    labels_pred = estimator.predict(features_test)
    p = sklearn.metrics.precision_score(labels_test, labels_pred, average='micro')
    r = sklearn.metrics.recall_score(labels_test, labels_pred, average='micro')
    if p > 0.3 and r > 0.3:
        return sklearn.metrics.f1_score(labels_test, labels_pred, average='macro')
    return 0
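As a quick sanity check that this custom scorer plugs into the scikit-learn machinery (a minimal sketch assuming the gnb_clf, features, labels and cv objects defined above), it can be passed directly to cross_val_score:
# The scorer's (estimator, X, y) signature is exactly what cross_val_score expects
sanity_scores = sklearn.cross_validation.cross_val_score(gnb_clf, features, labels, scoring=scoring, cv=cv)
print "Custom scorer on untuned GaussianNB:", sanity_scores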
parameters = {'max_depth': [2,3,4,5,6],'min_samples_split':[2,3,4,5], 'n_estimators':[10,20,50], 'min_samples_leaf':[1,2,3,4], 'criterion':('gini', 'entropy')}
rf_clf = RandomForestClassifier()
rfclf = grid_search.GridSearchCV(rf_clf, parameters, scoring = scoring, cv = cv)
rfclf.fit(features, labels)
print rfclf.best_estimator_
print rfclf.best_score_
Let's investigate the main parameters of AdaBoost:
from sklearn import grid_search
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
parameters = {'n_estimators' : [5, 10, 30, 40, 50, 100,150], 'learning_rate' : [0.1, 0.5, 1, 1.5, 2, 2.5], 'algorithm' : ('SAMME', 'SAMME.R')}
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8))
adaclf = grid_search.GridSearchCV(ada_clf, parameters, scoring = scoring, cv = cv)
adaclf.fit(features, labels)
print adaclf.best_estimator_
print adaclf.best_score_
rf_best_clf = rfclf.best_estimator_
list_cols = list(df_subset.columns.values)
list_cols.remove('poi')
list_cols.insert(0, 'poi')
data = df_subset[list_cols].fillna(0).to_dict(orient='records')
# Re-key the records by integer index so the data has the dict-of-dicts shape tester.py expects
enron_data_sub = {}
counter = 0
for item in data:
    enron_data_sub[counter] = item
    counter += 1
test_classifier(rf_best_clf, enron_data_sub, list_cols)
ada_best_clf = adaclf.best_estimator_
test_classifier(ada_best_clf, enron_data_sub, list_cols)
My AdaBoost classifier clears the 0.3 precision and recall thresholds; unfortunately, the random forest does not.
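For reference, here is a minimal sketch of how precision and recall for the chosen model could be double-checked outside tester.py (assuming the ada_best_clf, features and labels defined above; this uses plain 5-fold cross-validated predictions rather than tester.py's own stratified shuffle evaluation, so the numbers will differ slightly):
# Cross-validated predictions for every sample, then overall precision/recall on the POI class
predictions = sklearn.cross_validation.cross_val_predict(ada_best_clf, features, labels, cv=5)
print "Precision:", sklearn.metrics.precision_score(labels, predictions)
print "Recall:   ", sklearn.metrics.recall_score(labels, predictions)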
dump_classifier_and_data(ada_best_clf, enron_data_sub, list_cols)
More code and related material can be found at https://github.com/orenov/udacity-nanodegree-data-analyst/tree/master/Machine-Learning/final_project.