Spam Email Classification¶
This project compares several models to find which one works best for spam email classification. It is a binary classification problem, solved here with supervised learning.
The dataset was obtained from UCI:
Hopkins, M., Reeber, E., Forman, G., & Suermondt, J. (1999). Spambase [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53G6X.
A similar problem was tackled in module 4, though it was solved using regularized decision trees only. This project tackles it using Logistic Regression, Ensembles and SVMs, and concludes which parameters give the best results.
The data structure is basically the same as the spamdata.csv used in module 4. To quote UCI's description on the source of the data and why it was structured the way it is:
Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
The data contains 4601 data points with 57 features in total. The following description of the features was also adapted from UCI's website:
48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
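To make the feature definitions above concrete, here is a minimal sketch that computes them for a single raw message. The function names and the sample text are hypothetical, and the case-insensitive word matching is an assumption; the UCI description does not say how case was handled.

```python
import re

def word_freq(text, word):
    """100 * (occurrences of `word`) / (total words in `text`).
    Case-insensitive matching is an assumption of this sketch."""
    # A "word" is a maximal run of alphanumeric characters.
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)

def char_freq(text, char):
    """100 * (occurrences of `char`) / (total characters in `text`)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Hypothetical message, not taken from the dataset.
email = "FREE money!!! Claim your FREE prize now"
print("word_freq_free:", word_freq(email, "free"))
print("char_freq_!:", char_freq(email, "!"))
print("capital run (average, longest, total):", capital_runs(email))
```

In this message "free" accounts for 2 of 7 words, and the capital runs are "FREE", "C" and "FREE", so the total equals the number of capital letters, as the description states.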
Cleaning¶
The dataset needs no cleaning: the UCI metadata reports no missing values for any variable.
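A quick sanity check can back up a claim like this before modelling. The sketch below runs on a tiny made-up frame (the values are illustrative, not taken from Spambase); in the notebook the same checks would be applied to the feature table once it is loaded.

```python
import pandas as pd

# Made-up stand-in for the feature table; column names follow the UCI naming.
demo = pd.DataFrame({
    "word_freq_free": [0.0, 1.5, 0.0, 3.2],
    "char_freq_!": [0.0, 0.4, 0.1, 2.0],
    "capital_run_length_total": [12, 250, 40, 1020],
})

# Count missing cells and fully duplicated rows.
missing_total = int(demo.isna().sum().sum())
duplicate_rows = int(demo.duplicated().sum())

# Frequency features are percentages, so they should lie in [0, 100].
freq_cols = [c for c in demo.columns if c.startswith(("word_freq_", "char_freq_"))]
in_range = bool(((demo[freq_cols] >= 0) & (demo[freq_cols] <= 100)).all().all())

print("missing cells:", missing_total)
print("duplicate rows:", duplicate_rows)
print("frequencies within [0, 100]:", in_range)
```

Duplicate rows are not necessarily errors, since two different e-mails can produce identical feature vectors, so this count is informational rather than a reason to drop data.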
Can we do better?¶
One of the objectives will be to obtain results better than those shown on the UCI website. The website gives the following achieved precisions:
As well as the following accuracy data:
# Import all the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve
%matplotlib inline
Import the data¶
The following cells fetch the data and split it into a training set (90%) and a held-out test set (10%).
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
# fetch dataset
spambase = fetch_ucirepo(id=94)
  
# data (as pandas dataframes)
X = spambase.data.features
y = spambase.data.targets
  
# variable information
print(spambase.variables)
# metadata
spambase.metadata
                          name     role        type demographic  \
0               word_freq_make  Feature  Continuous        None   
1            word_freq_address  Feature  Continuous        None   
2                word_freq_all  Feature  Continuous        None   
3                 word_freq_3d  Feature  Continuous        None   
4                word_freq_our  Feature  Continuous        None   
5               word_freq_over  Feature  Continuous        None   
6             word_freq_remove  Feature  Continuous        None   
7           word_freq_internet  Feature  Continuous        None   
8              word_freq_order  Feature  Continuous        None   
9               word_freq_mail  Feature  Continuous        None   
10           word_freq_receive  Feature  Continuous        None   
11              word_freq_will  Feature  Continuous        None   
12            word_freq_people  Feature  Continuous        None   
13            word_freq_report  Feature  Continuous        None   
14         word_freq_addresses  Feature  Continuous        None   
15              word_freq_free  Feature  Continuous        None   
16          word_freq_business  Feature  Continuous        None   
17             word_freq_email  Feature  Continuous        None   
18               word_freq_you  Feature  Continuous        None   
19            word_freq_credit  Feature  Continuous        None   
20              word_freq_your  Feature  Continuous        None   
21              word_freq_font  Feature  Continuous        None   
22               word_freq_000  Feature  Continuous        None   
23             word_freq_money  Feature  Continuous        None   
24                word_freq_hp  Feature  Continuous        None   
25               word_freq_hpl  Feature  Continuous        None   
26            word_freq_george  Feature  Continuous        None   
27               word_freq_650  Feature  Continuous        None   
28               word_freq_lab  Feature  Continuous        None   
29              word_freq_labs  Feature  Continuous        None   
30            word_freq_telnet  Feature  Continuous        None   
31               word_freq_857  Feature  Continuous        None   
32              word_freq_data  Feature  Continuous        None   
33               word_freq_415  Feature  Continuous        None   
34                word_freq_85  Feature  Continuous        None   
35        word_freq_technology  Feature  Continuous        None   
36              word_freq_1999  Feature  Continuous        None   
37             word_freq_parts  Feature  Continuous        None   
38                word_freq_pm  Feature  Continuous        None   
39            word_freq_direct  Feature  Continuous        None   
40                word_freq_cs  Feature  Continuous        None   
41           word_freq_meeting  Feature  Continuous        None   
42          word_freq_original  Feature  Continuous        None   
43           word_freq_project  Feature  Continuous        None   
44                word_freq_re  Feature  Continuous        None   
45               word_freq_edu  Feature  Continuous        None   
46             word_freq_table  Feature  Continuous        None   
47        word_freq_conference  Feature  Continuous        None   
48                 char_freq_;  Feature  Continuous        None   
49                 char_freq_(  Feature  Continuous        None   
50                 char_freq_[  Feature  Continuous        None   
51                 char_freq_!  Feature  Continuous        None   
52                 char_freq_$  Feature  Continuous        None   
53                 char_freq_#  Feature  Continuous        None   
54  capital_run_length_average  Feature  Continuous        None   
55  capital_run_length_longest  Feature  Continuous        None   
56    capital_run_length_total  Feature  Continuous        None   
57                       Class   Target      Binary        None   
                 description units missing_values  
0                       None  None             no  
1                       None  None             no  
2                       None  None             no  
3                       None  None             no  
4                       None  None             no  
5                       None  None             no  
6                       None  None             no  
7                       None  None             no  
8                       None  None             no  
9                       None  None             no  
10                      None  None             no  
11                      None  None             no  
12                      None  None             no  
13                      None  None             no  
14                      None  None             no  
15                      None  None             no  
16                      None  None             no  
17                      None  None             no  
18                      None  None             no  
19                      None  None             no  
20                      None  None             no  
21                      None  None             no  
22                      None  None             no  
23                      None  None             no  
24                      None  None             no  
25                      None  None             no  
26                      None  None             no  
27                      None  None             no  
28                      None  None             no  
29                      None  None             no  
30                      None  None             no  
31                      None  None             no  
32                      None  None             no  
33                      None  None             no  
34                      None  None             no  
35                      None  None             no  
36                      None  None             no  
37                      None  None             no  
38                      None  None             no  
39                      None  None             no  
40                      None  None             no  
41                      None  None             no  
42                      None  None             no  
43                      None  None             no  
44                      None  None             no  
45                      None  None             no  
46                      None  None             no  
47                      None  None             no  
48                      None  None             no  
49                      None  None             no  
50                      None  None             no  
51                      None  None             no  
52                      None  None             no  
53                      None  None             no  
54                      None  None             no  
55                      None  None             no  
56                      None  None             no  
57  spam (1) or not spam (0)  None             no  
{'uci_id': 94,
 'name': 'Spambase',
 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase',
 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv',
 'abstract': 'Classifying Email as Spam or Non-Spam',
 'area': 'Computer Science',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 4601,
 'num_features': 57,
 'feature_types': ['Integer', 'Real'],
 'demographics': [],
 'target_col': ['Class'],
 'index_col': None,
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 1999,
 'last_updated': 'Mon Aug 28 2023',
 'dataset_doi': '10.24432/C53G6X',
 'creators': ['Mark Hopkins',
  'Erik Reeber',
  'George Forman',
  'Jaap Suermondt'],
 'intro_paper': None,
 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collection of spam e-mails came from our postmaster and individuals who had filed spam.  Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word \'george\' and the area code \'650\' are indicators of non-spam.  These are useful when constructing a personalized spam filter.  One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.\n\nFor background on spam: Cranor, Lorrie F., LaMacchia, Brian A.  Spam!, Communications of the ACM, 41(8):74-83, 1998.\n\nTypical performance is around ~7% misclassification error. False positives (marking good mail as spam) are very undesirable.If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter. See also Hewlett-Packard Internal-only Technical Report. External version forthcoming. ',
  'purpose': None,
  'funded_by': None,
  'instances_represent': 'Emails',
  'recommended_data_splits': None,
  'sensitive_data': None,
  'preprocessing_description': None,
  'variable_info': 'The last column of \'spambase.data\' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.  Most of the attributes indicate whether a particular word or character was frequently occuring in the e-mail.  The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.  For the statistical measures of each attribute, see the end of this file.  Here are the definitions of the attributes:\r\n\r\n48 continuous real [0,100] attributes of type word_freq_WORD \r\n= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail.  A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.\r\n\r\n6 continuous real [0,100] attributes of type char_freq_CHAR] \r\n= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurences) / total characters in e-mail\r\n\r\n1 continuous real [1,...] attribute of type capital_run_length_average \r\n= average length of uninterrupted sequences of capital letters\r\n\r\n1 continuous integer [1,...] attribute of type capital_run_length_longest \r\n= length of longest uninterrupted sequence of capital letters\r\n\r\n1 continuous integer [1,...] attribute of type capital_run_length_total \r\n= sum of length of uninterrupted sequences of capital letters \r\n= total number of capital letters in the e-mail\r\n\r\n1 nominal {0,1} class attribute of type spam\r\n= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.  \r\n',
  'citation': None}}
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, test_size=.1)
Helpful functions¶
# Adapted from Module 6, used by plotSearchGrid
class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)
    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))
# Adapted from Module 6
def plotSearchGrid(grid, var1, var2):
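    # GridSearchCV enumerates parameter names in sorted order, with the last name
    # varying fastest, so var1 must sort before var2 for this reshape to be valid.
    scores = np.array(grid.cv_results_["mean_test_score"])
    scores = scores.reshape(len(grid.param_grid[var1]), len(grid.param_grid[var2]))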
    #plt.figure(figsize=(10, 8))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
               norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
    plt.xlabel(var2)
    plt.ylabel(var1)
    plt.colorbar()
    plt.xticks(np.arange(len(grid.param_grid[var2])), grid.param_grid[var2], rotation=45)
    plt.yticks(np.arange(len(grid.param_grid[var1])), grid.param_grid[var1])
    plt.title('Validation accuracy')
    plt.show()
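The reshape in plotSearchGrid relies on the order in which the grid search enumerates parameter combinations. As a small check of that ordering, the sketch below uses scikit-learn's ParameterGrid on a made-up grid (the parameter values and the stand-in scores are hypothetical): names are visited in sorted order, and the last name varies fastest.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical grid; "gamma" is deliberately listed before "C".
param_grid = {"gamma": [0.01, 0.1, 1.0], "C": [1, 10]}

combos = list(ParameterGrid(param_grid))
order = [(p["C"], p["gamma"]) for p in combos]
print(order)

# Stand-in for mean_test_score: one value per combination, in enumeration order.
fake_scores = np.arange(len(combos))

# Rows follow "C" (sorts first), columns follow "gamma" (varies fastest).
table = fake_scores.reshape(len(param_grid["C"]), len(param_grid["gamma"]))
print(table)
```

Because uppercase letters sort before lowercase ones, "C" comes before "gamma", which is why the calls in this notebook pass "C" as the first variable whenever it is present.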
def calculate_precision(y_true, y_pred, pos_label_value=1.0):
    '''
    This function accepts the labels and the predictions, then
    calculates precision for a binary classifier.
    
    Args
        y_true: np.ndarray
        y_pred: np.ndarray
        
        pos_label_value: (float) the number which represents the positive
        label in the y_true and y_pred arrays. Other numbers will be taken
        to be the non-positive class for the binary classifier.
    
    Returns precision as a floating point number between 0.0 and 1.0
    '''
    TP = np.sum((y_true == pos_label_value) * (y_pred == pos_label_value))
    P_data = np.sum(y_pred == pos_label_value)
    return TP/P_data
def calculate_recall(y_true, y_pred, pos_label_value=1.0):
    '''
    This function accepts the labels and the predictions, then
    calculates recall for a binary classifier.
    
    Args
        y_true: np.ndarray
        y_pred: np.ndarray
        
        pos_label_value: (float) the number which represents the positive
        label in the y_true and y_pred arrays. Other numbers will be taken
        to be the non-positive class for the binary classifier.
    
    Returns recall as a floating point number between 0.0 and 1.0
    '''
    TP = np.sum((y_true == pos_label_value) * (y_pred == pos_label_value))
    P_data = np.sum(y_true == pos_label_value)
    return TP/P_data
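As a worked example of the two definitions above, the following sketch counts true positives, false positives and false negatives on a small made-up label vector and derives precision and recall from them. The arrays are illustrative only, and the zero guard for precision is an addition that the helper functions above do not have.

```python
import numpy as np

# Made-up labels: 1 = spam, 0 = not spam.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # spam correctly flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # legitimate mail flagged as spam
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # spam that slipped through

# Precision: of everything flagged as spam, how much really was spam.
# The guard avoids a division by zero when nothing is flagged at all.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
# Recall: of all real spam, how much was flagged.
recall = tp / (tp + fn)

print("TP, FP, FN:", tp, fp, fn)
print("precision:", precision)
print("recall:", recall)
```

Here 3 of the 5 flagged messages are spam (precision 0.6) and 3 of the 4 spam messages are flagged (recall 0.75).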
EDA¶
All the features are continuous numbers; only the target is binary. As there are far too many features to plot and analyse each one individually, the EDA focuses on the correlation between the features, starting with a correlation heatmap.
sns.heatmap(X.corr(), vmin=-1, vmax=1, cmap="coolwarm", center=0, annot=True, ax=plt.subplots(figsize=(30,30))[1])
From the correlation heatmap above, we see that there are two "groups" of high correlation: one among 11 particular word frequencies, and one among the capital run lengths. In particular, word_freq_857 and word_freq_415 have a correlation of 1, meaning that at least one of them is practically redundant.
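Reading strongly correlated pairs off a 57 by 57 heatmap is error-prone, so they can also be listed programmatically. The sketch below does this on synthetic data (the column names and values are made up, with one column built as an exact linear copy of another); applied to the notebook's feature table it would rank the real pairs.

```python
import numpy as np
import pandas as pd

# Synthetic demo data: "a_copy" is an exact linear function of "a".
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "a_copy": 2.0 * base,
    "b": rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr()
cols = list(corr.columns)

# Visit each unordered pair once (upper triangle, excluding the diagonal)
# and sort by absolute correlation, strongest first.
rows, cols_idx = np.triu_indices(len(cols), k=1)
pairs = sorted(
    ((abs(corr.iat[i, j]), cols[i], cols[j]) for i, j in zip(rows, cols_idx)),
    reverse=True,
)

for strength, left, right in pairs[:3]:
    print(f"{left} vs {right}: |r| = {strength:.3f}")
```

The absolute value is used so that strong negative correlations rank as high as strong positive ones.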
The two groups are further explored using pairplots below.
sns.pairplot(X[["word_freq_hp", "word_freq_hpl", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_direct"]], diag_kind="kde")
As the pairplots above show, the patterns exhibit a very high degree of collinearity. We therefore expect the regularization strength to have a big impact on the Logistic Regression model.
sns.pairplot(X[["capital_run_length_average", "capital_run_length_longest", "capital_run_length_total"]], diag_kind="kde")
The capital run lengths are expected to be collinear as well, since all three are derived from the same runs of capital letters, though this is less obvious from the plots. As we will see from the Logistic Regression coefficients, capital_run_length_total is almost eliminated by the regularization, while the coefficient of word_freq_857 becomes exactly 0.
Model training¶
Logistic Regression¶
The first model to use is logistic regression. Using GridSearchCV, different regularization techniques and strengths can be experimented with.
Logistic Regression ordinarily suffers from the curse of dimensionality. With regularization, however, especially l1, this becomes less of a concern.
The coefficients of the best model are printed to provide some insight into which features had the most influence.
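For reference, the two penalties searched below correspond to the following objectives. This is the usual textbook form, written with labels y_i in {-1, +1}, coefficient vector w and intercept b; scaling conventions and intercept handling vary between solvers, so treat it as a sketch rather than the exact function liblinear minimizes. C is the inverse regularization strength: a smaller C means a stronger penalty.

```latex
\begin{align*}
\text{l1:}\quad & \min_{w,\,b}\; \lVert w \rVert_1
    + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \left(x_i^\top w + b\right)}\right) \\
\text{l2:}\quad & \min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert_2^2
    + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \left(x_i^\top w + b\right)}\right)
\end{align*}
```

The l1 term is not differentiable at zero, which is what allows it to set individual coefficients exactly to zero, whereas the l2 term only shrinks them.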
LogRegGrid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {
        "penalty": ("l1", "l2"),
        "C": np.logspace(-5, 5, base=2, num=11)
    }
).fit(X_train, y_train)
plotSearchGrid(LogRegGrid, "C", "penalty")
print("Best Logistic Regression parameters:", LogRegGrid.best_params_)
print("Best Logistic Regression score:", LogRegGrid.best_score_)
LogRegBest = LogRegGrid.best_estimator_
Best Logistic Regression parameters: {'C': 8.0, 'penalty': 'l1'}
Best Logistic Regression score: 0.9231884057971016
for feature, coef in zip(X.columns.tolist(), LogRegBest.coef_[0]):
    print(feature, ":", coef)
word_freq_make : -0.3352771610625865
word_freq_address : -0.144134104466229
word_freq_all : 0.112018672406649
word_freq_3d : 2.076387357353494
word_freq_our : 0.554593550485832
word_freq_over : 1.2904401963364194
word_freq_remove : 2.5267285725399806
word_freq_internet : 0.5172196645160981
word_freq_order : 0.627473532037221
word_freq_mail : 0.17249868699823487
word_freq_receive : -0.25581126978924495
word_freq_will : -0.14234046357326724
word_freq_people : -0.11854115705721327
word_freq_report : 0.17412149671014215
word_freq_addresses : 1.0500388675253347
word_freq_free : 0.9634466906627731
word_freq_business : 0.9305392973545669
word_freq_email : 0.10133392987631257
word_freq_you : 0.08488243536713497
word_freq_credit : 1.3819517056872157
word_freq_your : 0.2654681374949797
word_freq_font : 0.21749820028292105
word_freq_000 : 2.0810250283128378
word_freq_money : 0.36561247488532317
word_freq_hp : -1.7902775548167147
word_freq_hpl : -1.2336732801585475
word_freq_george : -10.880824805770052
word_freq_650 : 0.44226034330510694
word_freq_lab : -2.2885745842672742
word_freq_labs : -0.2654647593774336
word_freq_telnet : -0.16163430120735922
word_freq_857 : 0.0
word_freq_data : -0.785684361230612
word_freq_415 : 0.45936893654380306
word_freq_85 : -1.9259154363336377
word_freq_technology : 0.8586013549791187
word_freq_1999 : 0.09105557504327932
word_freq_parts : -0.5351601870000458
word_freq_pm : -0.7897860751763727
word_freq_direct : -0.3114823079515479
word_freq_cs : -16.272347427710518
word_freq_meeting : -2.513144116975214
word_freq_original : -1.1180888218223113
word_freq_project : -1.6215101779132928
word_freq_re : -0.8091323320430137
word_freq_edu : -1.307624281598115
word_freq_table : -1.3921522502944936
word_freq_conference : -3.6846542873407873
char_freq_; : -1.233208442411895
char_freq_( : -0.13236694620475534
char_freq_[ : -0.545299178734433
char_freq_! : 0.3144554745969651
char_freq_$ : 4.90464860192831
char_freq_# : 1.85535122747682
capital_run_length_average : 0.014264456456158873
capital_run_length_longest : 0.007832612484971016
capital_run_length_total : 0.0008400454072136723
Feature ranking¶
Judging by the coefficients above, char_freq_$ is the strongest spam indicator, while word_freq_cs is the strongest non-spam indicator.
The features were not normalized though, and they live on very different scales, so one should exercise caution when ranking them solely by their coefficients.
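One common way to make coefficients more comparable is to multiply each one by the standard deviation of its feature, which gives the change in log-odds per one standard deviation rather than per raw unit. The sketch below illustrates this with made-up numbers; the helper name, the coefficients and the toy matrix are all hypothetical.

```python
import numpy as np

def standardized_effects(coef, X):
    """Scale each coefficient by its feature's standard deviation."""
    return np.asarray(coef, dtype=float) * np.asarray(X, dtype=float).std(axis=0)

# Toy matrix: column 0 is a small percentage, column 1 a large count.
X_toy = np.array([
    [0.0, 100.0],
    [0.2, 700.0],
    [0.4, 1300.0],
])
coef_toy = np.array([5.0, 0.001])  # raw coefficients differ by a factor of 5000

effects = standardized_effects(coef_toy, X_toy)
print("raw coefficients:     ", coef_toy)
print("per-standard-dev view:", effects)
```

After rescaling, the two toy effects are of the same order of magnitude, even though the raw coefficients are not. This is a rough heuristic and not a substitute for fitting the model on standardized features.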
Decision Tree Ensembles¶
Next, two ensemble methods are tried: Random Forest and AdaBoost.
For both, a grid search looks for the best n_estimators. In addition, the Random Forest search varies max_depth, while the AdaBoost search varies learning_rate.
In module 4 using Decision Trees, we could not get an accuracy above 0.95 for the test data. Can we do better?
RFGrid = GridSearchCV(
    RandomForestClassifier(max_samples=.7),
    {
        "max_depth": range(10, 20, 3),
        "n_estimators": range(100, 1100, 300)
    }
).fit(X_train, y_train)
plotSearchGrid(RFGrid, "max_depth", "n_estimators")
print("Best Random Forest parameters:", RFGrid.best_params_)
print("Best Random Forest score:", RFGrid.best_score_)
RFBest = RFGrid.best_estimator_
Best Random Forest parameters: {'max_depth': 19, 'n_estimators': 700}
Best Random Forest score: 0.9487922705314009
ABGrid = GridSearchCV(
    AdaBoostClassifier(),
    {
        "learning_rate": np.logspace(-4, -1, base=2, num=4),
        "n_estimators": range(100, 1100, 300)
    }
).fit(X_train, y_train)
plotSearchGrid(ABGrid, "learning_rate", "n_estimators")
print("Best AdaBoost parameters:", ABGrid.best_params_)
print("Best AdaBoost score:", ABGrid.best_score_)
ABBest = ABGrid.best_estimator_
Best AdaBoost parameters: {'learning_rate': 0.0625, 'n_estimators': 1000}
Best AdaBoost score: 0.9475845410628019
fig, ax = plt.subplots()
ax.set_xlabel("Iteration")
ax.set_ylabel("Error")
ax.set_title("Misclassification error for training and testing sets when using AdaBoost")
ax.plot(1 - np.fromiter(ABBest.staged_score(X_train, y_train), float), label="train")
ax.plot(1 - np.fromiter(ABBest.staged_score(X_test, y_test), float), label="test")
ax.legend()
plt.show()
SVMs¶
Finally, SVMs are used, first with the rbf kernel, which is a flexible general-purpose choice but needs tuning. GridSearchCV is employed to find the best values for C and gamma.
Then we check whether the simpler LinearSVC actually performs better on this dataset.
SVMs could be a good fit for this dataset as they are less vulnerable to the curse of dimensionality, but will the assumption that a separating hyperplane exists apply well to this situation?
SVCGrid = GridSearchCV(
    SVC(),
    {
        "gamma": np.logspace(-7, -3, base=2, num=5),
        "C": np.logspace(1, 5, base=2, num=5)
    }
).fit(X_train, y_train)
plotSearchGrid(SVCGrid, "C", "gamma")
print("Best SVM parameters:", SVCGrid.best_params_)
print("Best SVM score:", SVCGrid.best_score_)
SVCBest = SVCGrid.best_estimator_
Best SVM parameters: {'C': 8.0, 'gamma': 0.0078125}
Best SVM score: 0.8657004830917874
SVCGrid2 = GridSearchCV(
    LinearSVC(),
    {
        "loss": ("hinge", "squared_hinge"),
        "C": np.logspace(-15, -5, base=2, num=11)
    }
).fit(X_train, y_train)
plotSearchGrid(SVCGrid2, "C", "loss")
print("Best LinearSVC parameters:", SVCGrid2.best_params_)
print("Best LinearSVC score:", SVCGrid2.best_score_)
LinearSVCBest = SVCGrid2.best_estimator_
Best LinearSVC parameters: {'C': 0.015625, 'loss': 'hinge'}
Best LinearSVC score: 0.9031400966183576
Results¶
We start by checking the score of each of the best models on the held-out test set (the initial train-test split, not CV):
print("Logistic Regression validation score:", LogRegBest.score(X_test, y_test))
print("Random Forest       validation score:", RFBest.score(X_test, y_test))
print("AdaBoost            validation score:", ABBest.score(X_test, y_test))
print("SVM with rbf        validation score:", SVCBest.score(X_test, y_test))
print("Linear SVC          validation score:", LinearSVCBest.score(X_test, y_test))
Logistic Regression validation score: 0.9501084598698482
Random Forest       validation score: 0.9631236442516269
AdaBoost            validation score: 0.9674620390455532
SVM with rbf        validation score: 0.8937093275488069
Linear SVC          validation score: 0.9175704989154013
Comparing these values, all models outperform the baseline model scores published on UCI's website, and by a good margin.
But do we get better precision values as well? Let's check:
# These variables store the true and predicted values so we don't need to recalculate them again and again.
y_test_flat = np.ravel(y_test)
LogRegPredict = LogRegBest.predict(X_test)
RFPredict = RFBest.predict(X_test)
ABPredict = ABBest.predict(X_test)
SVCPredict = SVCBest.predict(X_test)
LinearSVCPredict = LinearSVCBest.predict(X_test)
print("Logistic Regression precision score:", calculate_precision(y_test_flat, LogRegPredict))
print("Random Forest       precision score:", calculate_precision(y_test_flat, RFPredict))
print("AdaBoost            precision score:", calculate_precision(y_test_flat, ABPredict))
print("SVM with rbf        precision score:", calculate_precision(y_test_flat, SVCPredict))
print("Linear SVC          precision score:", calculate_precision(y_test_flat, LinearSVCPredict))
Logistic Regression precision score: 0.9481865284974094
Random Forest       precision score: 0.9735449735449735
AdaBoost            precision score: 0.9738219895287958
SVM with rbf        precision score: 0.8808290155440415
Linear SVC          precision score: 0.9247311827956989
The precision values also outperform the baseline model scores.
While recall values are not given by the UCI website, let us calculate them nonetheless:
print("Logistic Regression recall score:", calculate_recall(y_test_flat, LogRegPredict))
print("Random Forest       recall score:", calculate_recall(y_test_flat, RFPredict))
print("AdaBoost            recall score:", calculate_recall(y_test_flat, ABPredict))
print("SVM with rbf        recall score:", calculate_recall(y_test_flat, SVCPredict))
print("Linear SVC          recall score:", calculate_recall(y_test_flat, LinearSVCPredict))
Logistic Regression recall score: 0.9336734693877551
Random Forest       recall score: 0.9387755102040817
AdaBoost            recall score: 0.9489795918367347
SVM with rbf        recall score: 0.8673469387755102
Linear SVC          recall score: 0.8775510204081632
Finally, let us check the ROC curves:
predictions = (("Logistic Regression", LogRegPredict), ("Random Forest", RFPredict), ("AdaBoost", ABPredict), ("SVM", SVCPredict), ("Linear SVC", LinearSVCPredict))
for label, pred in predictions:
    # pred holds hard 0/1 labels, so each curve has a single interior point
    fpr, tpr, _ = roc_curve(y_test_flat, pred)
    roc_auc = roc_auc_score(y_test_flat, pred)
    plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
    plt.title(f"ROC by {label}, AUC={roc_auc:.3f}")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()
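One caveat about the ROC cell: it is fed hard 0/1 predictions, which collapse each curve to a single operating point. ROC curves are normally built from continuous scores. The sketch below contrasts the two on a small made-up example (labels and scores are illustrative only).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up ground truth and classifier scores (higher = more spam-like).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9])

# Hard labels obtained by thresholding the scores at 0.5.
hard = (scores >= 0.5).astype(int)

fpr_hard, tpr_hard, _ = roc_curve(y_true, hard)
fpr_soft, tpr_soft, _ = roc_curve(y_true, scores)
auc_hard = roc_auc_score(y_true, hard)
auc_soft = roc_auc_score(y_true, scores)

print("points on curve from hard labels:", len(fpr_hard))
print("points on curve from scores:     ", len(fpr_soft))
print("AUC from hard labels:", auc_hard)
print("AUC from scores:     ", auc_soft)
```

For the models in this notebook, continuous scores could be obtained with predict_proba(X_test)[:, 1] where it is available, or with decision_function(X_test) for the SVMs (SVC only offers predict_proba when constructed with probability=True). The AUC values would then differ from the ones computed from hard labels.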
Analysis, Discussion and Conclusion¶
Comparing the above results, we can safely say that the ensemble models (Random Forest and AdaBoost) clearly outperformed the other models. Logistic Regression did not do badly either.
SVMs, though, performed quite poorly. It seems that the assumptions inherent to SVMs do not hold well for the dataset at hand. This might have been expected, given that the SVC baseline model from UCI did poorly as well. The fact that LinearSVC performed better than rbf was more surprising. It seems that the underlying decision boundary is closer to linear than radial, as evidenced by the rbf model leaning strongly towards smaller gamma values.
Logistic Regression¶
It seems that the model preferred l1 regularization. This was expected, as this dataset contains highly correlated features, which calls for the effective elimination of some of them. Looking at the heatmap, though, it seems that as regularization gets stronger (smaller C), l2 becomes more suitable.
One of the things that might improve the results is normalizing the data before training. That should make the regularization treat the features more evenly.
Ensembles¶
Random Forest seemed to prefer deeper trees with this dataset, while AdaBoost performed better and better the lower the learning rate. From the Misclassification error graph obtained from the staged score, we can clearly see that AdaBoost does run the risk of slightly overfitting as the number of estimators increases.
Further experiments could attempt to improve on the results by trying deeper trees for Random Forest and even lower learning rates for AdaBoost.
SVMs¶
As mentioned earlier, the rbf kernel performed poorly on the dataset, clearly leaning towards the smallest gamma in the grid. This might suggest that the non-linearity introduced by the kernel is causing overfitting.
LinearSVC, by comparison, prefers a very small C (0.015625), i.e. strong regularization, again suggesting that the model is very sensitive to high variance. For the selected value of C, hinge vs squared_hinge did not seem to make much of a difference, though squared_hinge seems to cope better with more extreme values of C.
Nevertheless, there might be some steps that can be taken to improve the results, such as normalizing the data beforehand.
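Normalization can be folded into the model itself, so that the scaler is fitted only on the training folds during cross-validation. The sketch below shows the pattern on synthetic data with one feature deliberately put on a much larger scale; the data and variable names are made up, and nothing here was run on Spambase, so it makes no claim about how much the SVM scores above would change.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is blown up to mimic mixed feature scales.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_demo = X_demo.copy()
X_demo[:, 0] *= 1000.0

raw_svc = SVC()
# The pipeline standardizes each feature before the SVM sees it.
scaled_svc = make_pipeline(StandardScaler(), SVC())

raw_score = cross_val_score(raw_svc, X_demo, y_demo, cv=5).mean()
scaled_score = cross_val_score(scaled_svc, X_demo, y_demo, cv=5).mean()

print("mean CV accuracy without scaling:", raw_score)
print("mean CV accuracy with scaling:   ", scaled_score)
```

A pipeline like this can be passed to GridSearchCV in place of the bare estimator; the parameter names then take the step name as a prefix, for example "svc__C" and "svc__gamma".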
Precision vs Recall¶
In this particular case, false positives are especially bad - we never want legitimate email to be falsely classified as spam. False negatives on the other hand are less problematic. As such, precision is much more important than mere accuracy. Thankfully, comparing the precision and recall values, we see that precision is generally higher. The ensembles in particular are exceptional in terms of precision.
This result can also be seen from the ROC/AUC graphs. Both ensembles outperform the rest of the models.
That being said, this is still not satisfactory. Steps should be taken to increase the precision even if that means lower accuracy overall. Possible methods include changing the decision threshold for Logistic Regression or adjusting the corresponding weights for the other methods.
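The effect of moving the decision threshold can be illustrated with a few made-up predicted probabilities. In the sketch below the labels, probabilities and thresholds are hypothetical; with a fitted model the probabilities would come from predict_proba.

```python
import numpy as np

# Made-up labels (1 = spam) and predicted spam probabilities.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
proba = np.array([0.05, 0.20, 0.35, 0.60, 0.80, 0.40, 0.55, 0.70, 0.90, 0.95])

def precision_recall_at(threshold):
    """Flag a message as spam only if its probability reaches the threshold."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for threshold in (0.5, 0.85):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold {threshold}: precision {precision:.3f}, recall {recall:.3f}")
```

Raising the threshold from 0.5 to 0.85 removes both false positives in this toy example, at the cost of letting more spam through: precision rises while recall falls.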
Takeaway¶
Overall, it seems that AdaBoost and Random Forest are the best algorithms for this task, as they top both the precision and accuracy metrics. Improvements can still be made, both to increase performance by tweaking the hyperparameters and to ensure that precision is given top priority.