Spam Detection

Introduction

Spam emails and SMS messages are a common nuisance for many people, clogging up inboxes and taking up valuable time to sort through. In this project, we aim to build a machine learning model that can accurately classify SMS messages as spam or non-spam (also known as "ham").

To achieve this goal, we use the SMS Spam Collection data set, which contains 5,574 SMS messages labeled as either spam or ham. We train and test a Naive Bayes classifier on this data and evaluate it with accuracy, precision, and recall to understand the model's performance in more detail.

The results of this project have the potential to be applied in a variety of contexts, such as creating spam filters for email or SMS systems, or helping individuals and organizations more efficiently manage their communications.

UCI dataset link: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Data Preprocessing

The SMS Spam Collection data set was loaded into a Pandas DataFrame, and basic statistics and visualizations were used to understand the distribution and characteristics of the data. The messages were then converted into a numerical representation using the CountVectorizer from scikit-learn. To ensure a fair evaluation of the model, the data was split into training and test sets using the train_test_split function from scikit-learn. The training set was used to fit the model, and the test set was used to evaluate the model's performance.
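The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual notebook: the rows are hypothetical stand-ins so the snippet runs without the download, the column names `label` and `message` are assumptions, and the 25% test size and `random_state` are arbitrary choices. The sketch also splits before vectorizing so that the vocabulary is learned from the training messages only; the original project may have done these steps in the other order.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in rows so the sketch runs without the download.
# Assumption: the real file is tab-separated (label, then message) and could be
# loaded with pd.read_csv(path, sep="\t", header=None, names=["label", "message"]).
rows = [
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "Call me when you get home"),
    ("ham", "See you at the gym tonight"),
    ("ham", "Thanks for the help yesterday"),
    ("spam", "WINNER claim your free prize now"),
    ("spam", "Free entry text WIN to claim cash"),
    ("spam", "Urgent your account has won a prize"),
    ("spam", "Claim your free ringtone reply now"),
]
df = pd.DataFrame(rows, columns=["label", "message"])

# Basic statistics: class balance and average message length per class.
print(df["label"].value_counts())
df["length"] = df["message"].str.len()
print(df.groupby("label")["length"].mean())

# Hold out 25% of the messages, keeping the class ratio the same in both parts.
train_text, test_text, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.25, random_state=42, stratify=df["label"]
)

# Learn the vocabulary from the training messages only, then reuse it for the test set.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# Both matrices share the same number of columns (one per vocabulary word).
print(X_train.shape, X_test.shape)
```

Each row of `X_train` and `X_test` is a sparse vector of word counts, which is the numerical representation a Naive Bayes classifier can consume.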

Model Training and Evaluation

The Naive Bayes classifier was trained on the training data using the fit method. The classifier was then used to make predictions on the test data, and the performance was evaluated using the classification_report function from scikit-learn.
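A minimal sketch of this training and evaluation step is shown here. The text does not say which Naive Bayes variant was used, so `MultinomialNB` (a common choice for word counts) is an assumption, and the handful of messages are hypothetical placeholders rather than rows from the data set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages standing in for the real training and test splits.
train_messages = [
    "Are we still meeting for lunch today",
    "Call me when you get home",
    "See you at the gym tonight",
    "WINNER claim your free prize now",
    "Free entry text WIN to claim cash",
    "Urgent your account has won a prize",
]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
test_messages = ["Claim your free prize today", "Thanks see you at lunch"]
test_labels = ["spam", "ham"]

# Turn the text into word-count vectors, fitting the vocabulary on training data only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)
X_test = vectorizer.transform(test_messages)

# Assumption: multinomial Naive Bayes; the source only says "Naive Bayes".
model = MultinomialNB()
model.fit(X_train, train_labels)

# Predict labels for the held-out messages and summarize per-class metrics.
predictions = model.predict(X_test)
report = classification_report(test_labels, predictions, zero_division=0)
print(report)
```

The report lists precision, recall, and F1 score for each class alongside overall accuracy, which is where per-class figures such as spam precision and ham recall come from.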

The evaluation results showed that the model achieved an overall accuracy of 98.4% on the test set, with a precision of 99% for the spam class and a recall of 100% for the non-spam (ham) class.
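To make the three metrics concrete, the sketch below computes them from confusion-matrix counts. The helper name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not the project's actual confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) for one class treated as positive."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the messages flagged as positive, how many really were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive messages, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts with spam as the positive class (not the project's results).
accuracy, precision, recall = binary_metrics(tp=8, fp=2, fn=2, tn=88)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Treating ham as the positive class instead swaps the roles of the counts, which is how a ham recall can be reported next to a spam precision.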

Conclusion

In this project, we built a machine learning model that classifies SMS messages as spam or non-spam. On the test set it reached 98.4% accuracy, with 99% precision on spam and 100% recall on ham, indicating that it is effective at separating spam from legitimate messages.

An example of a spam message is shown below