Machine Learning Project on Stress Detection
Project by Samuel Nnamani a.k.a SammystTheAnalyst
Project Description
Stress, according to the World Health Organization (WHO), can be defined as a state of worry or mental tension caused by a difficult situation. It is a natural response that prompts us to address challenges and threats in our lives, and as long as you are human and alive, you are prone to experiencing stress to some degree.
Stress levels differ among individuals, owing to improvement in the situation and/or the experience gained in coping emotionally with past stressful situations.
The dataset has 116 columns, of which two were used for the project: the text column (input) and the label column (output).
Aim:
· Detection of stress: predicting whether a patient is stressed or not.
Python Libraries used:
· Pandas: for Data Analysis
· nltk (stopwords) and re: for cleaning the text column
· WordCloud: for Data Visualization
· Bernoulli Naive Bayes algorithm: for Binary Classification
· CountVectorizer: for converting the text into a matrix of word counts
Project Design:
I imported the necessary libraries for the project.
Then I read the data into a dataframe called df and printed out the top rows of the data.
Next, I checked the total number of columns in the dataset, to gauge the size of the data I was working with.
The dataset has 116 columns, which is a very large number of features. The next step was the data cleaning stage.
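The loading and inspection steps can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the in-memory CSV and its column names are made up so the snippet runs on its own, and with the real dataset the file would be read from disk with pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a tiny in-memory CSV.
# With the actual dataset, pass the path of the CSV file to pd.read_csv.
sample_csv = io.StringIO(
    "text,label,confidence\n"
    "I cannot sleep and I feel overwhelmed,1,0.8\n"
    "Had a relaxing walk in the park today,0,0.9\n"
)

df = pd.read_csv(sample_csv)

# Print the top rows of the data.
print(df.head())

# df.shape is (number of rows, number of columns).
n_rows, n_cols = df.shape
print(f"{n_cols} columns, {n_rows} rows")
```

On the real dataset, the column count reported this way would be the 116 mentioned above.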
Data Cleaning
Since I was mainly interested in the text, I focused the data cleaning on the text column only. But before that, I checked whether there were missing values in the dataset.
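A missing-value check of this kind might look like the sketch below. The small dataframe is invented for illustration and deliberately contains one missing text so that the check has something to report.

```python
import pandas as pd

# Toy data (hypothetical); one text is deliberately missing.
df = pd.DataFrame(
    {
        "text": ["I feel overwhelmed", "Great day at the beach", None],
        "label": [1, 0, 0],
    }
)

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing values across the whole dataframe.
total_missing = int(missing_per_column.sum())
print(f"Total missing values: {total_missing}")
```

A total of zero corresponds to the situation described in this project, where no values were missing.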
There were no missing values in the dataset, so I proceeded to clean the text column using nltk stopwords (for filtering out common words that carry little meaning for the model, such as the, a, an, in, etc.) and re (for finding and replacing text that matches a specified pattern).
In the above, I used stopwords to filter out irrelevant words, then I defined a function that goes through the text column, applies the pattern specified in each line of code, and returns the cleaned text.
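A cleaning function along these lines could be sketched as below. It is an assumption-laden illustration, not the project's exact function: the particular regular expressions are my own choice, and a small hand-written stopword set stands in for nltk's English stopword list so that the snippet runs without any downloads.

```python
import re
import string

# Stand-in for nltk's English stopword list (assumption: kept tiny on purpose).
STOPWORDS = {"the", "a", "an", "in", "is", "and", "to", "of", "i", "my", "at"}


def clean(text: str) -> str:
    """Lowercase the text, strip noise with regular expressions, drop stopwords."""
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"<.*?>", "", text)  # remove HTML tags
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # remove punctuation
    text = re.sub(r"\w*\d\w*", "", text)  # remove tokens containing digits
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


print(clean("I am STRESSED about the exam in 2 days! See https://example.com"))
# prints: am stressed about exam days see
```

With a dataframe, the same function could be applied to every row with df["text"].apply(clean).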
Text Visualization
To visualize the text, I imported the libraries matplotlib and WordCloud.
In the word cloud, the size of a word corresponds to its frequency in the text, i.e., the more frequently a word is used in users’ descriptions of stress, the larger it appears in the WordCloud, and vice versa.
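The frequencies behind a word cloud can be computed directly, which makes the link between word size and word count concrete. The sketch below uses only the standard library and a few invented, already-cleaned texts; it does not draw anything, it only shows the counts that a word cloud scales its words by.

```python
from collections import Counter

# Hypothetical cleaned texts, standing in for the cleaned text column.
cleaned_texts = [
    "feel anxious work deadline",
    "work pressure makes feel tired",
    "enjoyed calm weekend",
]

# Join everything into one long string, split into words, and count them.
all_words = " ".join(cleaned_texts).split()
frequencies = Counter(all_words)

# The most frequent words are the ones drawn largest in a word cloud.
for word, count in frequencies.most_common(3):
    print(word, count)
```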
The next step was to map the binary values 0 and 1 in the label column to the categorical values “No Stress” and “Stress” respectively, storing the result in the “label” column. Afterwards, I created a new dataframe with only the input feature “text” and the output feature “label”.
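The label mapping and column selection can be sketched as follows. The toy dataframe and the extra column name other_feature are hypothetical, included only to show that everything except “text” and “label” is dropped.

```python
import pandas as pd

# Toy data (hypothetical) with one extra column that will be dropped.
df = pd.DataFrame(
    {
        "text": ["feel overwhelmed", "great day beach"],
        "label": [1, 0],
        "other_feature": [0.3, 0.7],
    }
)

# Replace the binary values with readable categories.
df["label"] = df["label"].map({0: "No Stress", 1: "Stress"})

# Keep only the input feature and the output feature.
df = df[["text", "label"]]
print(df)
```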
Data Modelling
In building the model, I used CountVectorizer (for transforming the text into a matrix in which each row is a text sample, each column is a unique word, and each cell holds the number of times that word occurs in that sample) and train_test_split (for splitting the data into a train dataset and a test dataset). I used a test size of 30%.
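The vectorizing and splitting step might look like the sketch below. The ten example texts are invented, and random_state=42 is an assumption added only to make the split reproducible; the 30% test size matches the one described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical cleaned texts and labels, standing in for the real columns.
texts = [
    "feel anxious about work deadline",
    "cannot sleep worried about money",
    "panic attacks every night exhausted",
    "overwhelmed by exams and pressure",
    "constant worry makes me tired",
    "enjoyed a calm walk in the park",
    "had a lovely dinner with friends",
    "feeling relaxed after the holiday",
    "great weekend at the beach",
    "happy with my new hobby",
]
labels = ["Stress"] * 5 + ["No Stress"] * 5

# Rows are texts, columns are unique words, cells are word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out 30% of the samples for testing (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=42
)

print(X.shape, X_train.shape, X_test.shape)
```

The first number in X.shape is the number of texts and the second is the size of the vocabulary learned by the vectorizer.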
Then I used the Bernoulli Naïve Bayes algorithm as the model for my project. The reason is that the aim is to predict one of two outcomes, and Bernoulli Naïve Bayes is built on the principle of a trial with two possible outcomes, such as True or False, Yes or No, Success or Failure. In practice it treats each word as either present in or absent from a text and uses those presence patterns to choose between the two classes.
The model returned a prediction accuracy of 92%.
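Training and scoring the classifier can be sketched as below. The data is a tiny invented sample, so the accuracy it prints is only illustrative and will not reproduce the 92% obtained on the real dataset; random_state=42 is again an assumption for reproducibility.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical cleaned texts and labels, standing in for the real columns.
texts = [
    "feel anxious about work deadline",
    "cannot sleep worried about money",
    "panic attacks every night exhausted",
    "overwhelmed by exams and pressure",
    "constant worry makes me tired",
    "enjoyed a calm walk in the park",
    "had a lovely dinner with friends",
    "feeling relaxed after the holiday",
    "great weekend at the beach",
    "happy with my new hobby",
]
labels = ["Stress"] * 5 + ["No Stress"] * 5

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=42
)

# BernoulliNB binarizes the counts, so each word is treated as present or absent.
model = BernoulliNB()
model.fit(X_train, y_train)

# Predict the held-out samples and compute the share of correct predictions.
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```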
Let’s now test how well the model predicts whether a user is stressed or not. I created an input box that allows the user to enter any text of their choice.
In the first case, the text entered returns the response that the user is not stressed.
In the second case, the text entered returns the response that the user is stressed.
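Predicting on new text can be sketched as follows. The helper name predict_stress and the training sentences are hypothetical; the key point is that new text must be transformed with the same fitted vectorizer that was used for training. A fixed string is used here in place of interactive input so the snippet runs unattended.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical cleaned texts and labels, standing in for the real columns.
texts = [
    "feel anxious about work deadline",
    "cannot sleep worried about money",
    "panic attacks every night exhausted",
    "overwhelmed by exams and pressure",
    "enjoyed a calm walk in the park",
    "had a lovely dinner with friends",
    "feeling relaxed after the holiday",
    "great weekend at the beach",
]
labels = ["Stress"] * 4 + ["No Stress"] * 4

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = BernoulliNB().fit(X, labels)


def predict_stress(user_text: str) -> str:
    """Return the predicted label for a single piece of text."""
    # transform (not fit_transform) reuses the vocabulary learned in training.
    features = vectorizer.transform([user_text])
    return str(model.predict(features)[0])


# In an interactive session the text could come from input("Enter a text: ").
print(predict_stress("worried about exams and cannot sleep"))
```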
Conclusion
In conclusion, social media interactions reveal, through users’ keywords and text alone, that many users suffer from stress, anxiety and depression. The model’s effectiveness in predicting the stress status of a user can also be built upon to support those who are in need of urgent assistance.
I wrap this up by saying that the earlier a person knows they are struggling with stress, the sooner that person can begin to work on relieving it.
P.S. The Python code for this project can be accessed on my GitHub profile here