email-spam-classification

Email Spam Classification using Machine Learning

Author: Ameen Mohammad

This project implements a full machine learning pipeline for spam classification.


Viewing Pre-Generated Results

The notebook has been pre-executed, so you can explore the findings without running the code.

🔗 Open in Google Colab

Running the Notebook (Optional)

For those interested in running the full pipeline:

  1. Open the notebook in Google Colab using the link above.
  2. Navigate to Runtime in the top menu and select Run all (or press CTRL + F9 / Cmd + F9 on Mac).
  3. If prompted with a security notice about execution permissions, click “Run Anyway.”
  4. The notebook will process the dataset and generate results step by step.

Dataset Overview

The dataset used in this project is a collection of 5,574 email messages, labeled as ‘spam’ or ‘notspam’. Below is a summary of the dataset:

  • Total Emails: 5,574
  • Spam Emails: 747 (13.4%)
  • Not Spam Emails: 4,827 (86.6%)
  • Average Email Length: 80 characters
  • Number of Duplicate Entries: 406

Note: Missing values in the ‘text’ column have been handled by replacing them with an empty string.
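
For reference, the cleaning steps described above could look roughly like the sketch below using pandas. The file name and the 'label'/'text' column names are assumptions for illustration and may differ from the notebook.

```python
import pandas as pd

# Hypothetical file name and column names -- adjust to match the notebook.
df = pd.read_csv("spam.csv", encoding="latin-1")

# Replace missing message text with an empty string, as noted above.
df["text"] = df["text"].fillna("")

# Inspect class balance, duplicate entries, and average message length.
print(df["label"].value_counts())            # spam vs. notspam counts
print("Duplicates:", df.duplicated().sum())
print("Avg. length:", df["text"].str.len().mean())
```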


Key Findings

After processing the data and training multiple models, the following results were obtained:

Model                  | Feature Extraction | Accuracy | Spam Recall | Spam F1-score
-----------------------|--------------------|----------|-------------|--------------
Logistic Regression    | TF-IDF             | 95.87%   | 74%         | 85%
Support Vector Machine | TF-IDF             | 97.76%   | 86%         | 92%
Random Forest          | TF-IDF             | 97.49%   | 86%         | 91%
Logistic Regression    | BoW                | 98.21%   | 89%         | 94%
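
The accuracy, recall, and F1 figures above can be computed with scikit-learn's metrics module. The snippet below is a minimal sketch using toy stand-in predictions; in the notebook the real values would come from a trained model's predictions on a held-out test set.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy stand-ins; in practice these are the test labels and model predictions.
y_test = ["spam", "notspam", "notspam", "spam", "notspam"]
y_pred = ["spam", "notspam", "spam", "spam", "notspam"]

print("Accuracy:", accuracy_score(y_test, y_pred))
# The 'spam' row of this report corresponds to the Spam Recall and
# Spam F1-score columns in the table above.
print(classification_report(y_test, y_pred))
```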

Conclusion:

  • Best Overall Accuracy: Logistic Regression with BoW features achieved the highest accuracy at 98.21%.
  • Best Spam Detection with TF-IDF: Among the TF-IDF models, the Support Vector Machine detected spam best, with 86% recall and a 92% F1-score.

Note: TF-IDF (Term Frequency-Inverse Document Frequency) and BoW (Bag of Words) are techniques used to convert text data into numerical features.
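
To illustrate the difference, here is a small sketch using scikit-learn's vectorizers on two made-up messages (not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["free prize claim now", "are we still on for lunch"]

# Bag of Words: each column is a vocabulary term, each value a raw count.
bow = CountVectorizer()
print(bow.get_feature_names_out())
print(bow.fit_transform(texts).toarray())

# TF-IDF: the same counts, re-weighted so terms shared by many messages count less.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(texts).toarray().round(2))
```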


Future Improvements

To further enhance the spam classification model, the following approaches can be considered:

  1. Hybrid Feature Extraction: Combining TF-IDF and BoW features to capture more nuances in the text data.
  2. Advanced Modeling Techniques: Implementing deep learning models such as LSTMs or Transformers (e.g., BERT) to improve performance.
  3. Hyperparameter Tuning: Fine-tuning the hyperparameters of the models to achieve better accuracy and recall; a minimal sketch follows below.
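
As a sketch of point 3, a grid search over a TF-IDF + Logistic Regression pipeline might look like the following. The pipeline structure, parameter grid, and the variable names X_train_texts / y_train are illustrative assumptions, not the notebook's actual code.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: n-gram range and regularization strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
# search.fit(X_train_texts, y_train)   # raw texts and labels from a training split
# print(search.best_params_, search.best_score_)
```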

Explore the notebook to analyze the results or modify the code to experiment with different techniques.

Repository: https://github.com/ameen-m-dev/email-spam-classification
