COSC2960 – Foundations of Artificial Intelligence for STEM
Assessment 2B
Data Analysis and Modelling Assignment
1.0 Statement of Problem/Introduction
In the era of data abundance, the significance lies not merely in its volume, but in our ability to derive meaningful insights. This task invites the application of class-acquired techniques to dissect and interpret data.
Our focus rests on scrutinizing the pivotal role of sleep in overall health and well-being, encompassing cognitive, emotional, and physical realms. The analysis of the Sleep Health and Lifestyle Dataset, collected via wearable devices, is paramount. Tasked as data scientists, our mission is to enhance sleep tracking accuracy and elucidate the impact of lifestyle factors on sleep quality.
Through meticulous analysis, we seek to uncover trends and correlations, illuminating pathways for enhanced sleep hygiene and the mitigation of sleep-related disorders. This endeavour extends to the identification of predictors for sleep duration and quality, clustering individuals based on sleep behaviour and lifestyle, and discerning patterns indicative of sleep disorders.
Our pursuit is to contribute to the scholarly discourse on sleep health, offering actionable recommendations for fostering improved sleep habits and overall well-being.
The dataset contains a wide variety of categories relevant to our study. To be exact, there are 14 categories which outline the key aspects of sleep health and our knowledge of it. They are listed below, Numerical variables first, followed by Nominal variables:
- Age: The age of the person in years.
- Sleep Duration (hours): The number of hours the person sleeps per day.
- Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity per day.
- Systolic Blood Pressure: The blood pressure of the person while the heart is beating.
- Diastolic Blood Pressure: The blood pressure of the person while the heart rests between beats.
- Heart Rate (bpm): The resting heart rate of the person in beats per minute.
- Daily Steps: The number of steps the person takes per day.
- Stress Level: A subjective rating of the stress level experienced by the person.
- Quality of Sleep: A subjective rating of the quality of sleep.
- Gender: The gender of the person.
- Occupation: The occupation or profession of the person.
- BMI Category: The BMI category of the person.
- Sleep Disorder: The presence or absence of a sleep disorder in the person.
3.0 Data Cleaning and Processing
Category | Values Missing
Sleep Duration | 2 (1%)
Quality of Sleep | 4 (1%)
Physical Activity Level | 1 (<1%)
Stress Level | 1 (<1%)
BMI Category | 7 (2%)
Systolic BP | 1 (<1%)
Diastolic BP | 3 (1%)
Heart Rate | 4 (1%)
Daily Steps | 7 (2%)
TOTAL | 30
Table 1. Missing Values in Sleep Data Set
Using the WEKA NumericCleaner function to display all missing values, we can identify every individual missing value in our data set.
Fig 1. WEKA Attribute Missing Values Display (Sleep Duration)
Fig 2. WEKA Attribute Missing Value Example (ID 15, Sleep Duration)
As displayed, a total of 30 values were missing across the 374 people surveyed for the study. Missing values in a large dataset can occur for various reasons, such as data entry errors, equipment malfunction, incomplete surveys, participant non-response, or simply because certain variables were not applicable or not recorded. These missing values may be sporadic and random, or they may follow a pattern influenced by specific factors within the data collection process.
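The per-category tally behind a table of missing values can be sketched outside WEKA as follows. This is a minimal illustration only: the rows are hypothetical toy data (not the study's records), `None` stands in for WEKA's `?` marker, and `count_missing` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical toy rows; None marks a missing value (WEKA displays these as "?").
rows = [
    {"Sleep Duration": 6.1, "Daily Steps": 4200, "BMI Category": "Overweight"},
    {"Sleep Duration": None, "Daily Steps": 10000, "BMI Category": "Normal"},
    {"Sleep Duration": 7.8, "Daily Steps": None, "BMI Category": None},
    {"Sleep Duration": 7.2, "Daily Steps": 8000, "BMI Category": "Normal"},
]


def count_missing(rows):
    """Return {category: number of missing values}, omitting complete categories."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None:
                counts[name] += 1
    return dict(counts)


missing = count_missing(rows)
for name, n in missing.items():
    # Share of instances in which this category is missing.
    print(f"{name}: {n} missing ({n / len(rows):.0%})")
print("TOTAL:", sum(missing.values()))
```

Dividing each count by the number of instances gives the percentage column; for a single missing value in a large sample this share falls below 1%.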
3.2 Dealing with Values Missing:
With all missing values identified, the remaining decision is how to deal with them. Three options are viable for preserving the homogeneity of the data:
1. Removing all missing values altogether.
2. Adjusting the missing values with the mean/mode of their respective category.
3. Replacing all missing values with a constant arbitrary variable.
For the following reasons, replacing the missing values with the mean/mode of their respective categories, using the WEKA function ReplaceMissingValues, was the clearer option:
Preservation of Data Integrity: By filling in missing values with the mean/mode of their respective categories, we retain more data points, thus preserving the integrity and completeness of the dataset. This ensures that we can still analyse trends and patterns across all variables without significant loss of information.
Maintaining Sample Size: Removing missing values altogether could reduce the sample size, which might affect the statistical power of the analysis. By imputing missing values with the mean/mode, we avoid this loss of sample size, allowing for more robust statistical analysis and potentially more accurate results.
Fig 3. WEKA Attribute Information Post Replacement (Sleep Duration)
Mitigating Bias: Removing missing values can introduce bias, especially when the missing values aren’t random. Replacing missing values with the average helps to reduce this bias by leaving the mean of each category unchanged, although it does slightly reduce the spread of the values.
Simplicity and Ease of Interpretation: Replacing missing values with the average is a straightforward method. It’s simple to implement and interpret, and avoids the overhead of more elaborate imputation techniques.
Category | Mean/Mode Value
Sleep Duration | 7.135
Quality of Sleep | 7.308
Physical Activity Level | 59.088
Stress Level | 5.389
BMI Category | Normal
Systolic BP | 128.523
Diastolic BP | 84.668
Heart Rate | 68.457
Daily Steps | 6843.869
Table 2. Mean/Mode Values of Categories with Missing Values
Fig 4. WEKA Attribute Replaced Value Example (ID 15, Sleep Duration)
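The idea of mean/mode replacement can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not WEKA's implementation: the two columns are hypothetical toy data, `None` marks a missing value, and `impute` is a helper name introduced here. Numeric columns are filled with the mean of the observed values and nominal columns with the mode.

```python
from statistics import mean, mode


def impute(values):
    """Replace None with the mean (numeric column) or the mode (nominal column)."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns, each with one missing entry.
sleep_duration = [6.0, None, 8.0, 7.0]
bmi_category = ["Normal", "Overweight", None, "Normal"]

print(impute(sleep_duration))  # the gap is filled with the mean of 6.0, 8.0 and 7.0
print(impute(bmi_category))    # the gap is filled with the most frequent label
```

Because the fill value is computed only from the observed entries, the mean of a numeric column is the same before and after replacement.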
3.3 Outliers
Using the WEKA InterquartileRange function, we are able to identify all the outliers in the data set. A total of 14 of the 374 instances are identified as outliers in the Sleep Data set. In other words, any instance labelled with a ‘yes’ is considered an outlier instance, whether one or several of its values cause it to be flagged.
Fig 5. WEKA Attribute Outliers Information
Fig 6. WEKA Outliers Example on Data Set Table
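An interquartile-range check of this kind can be sketched as below. The factors 3.0 (outlier) and 6.0 (extreme value) are assumed here to mirror the filter's default settings, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from WEKA's own quartile calculation. The heart-rate values are hypothetical toy data and `iqr_flags` is a name introduced for this sketch.

```python
from statistics import quantiles


def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    """Label each value 'no', 'outlier' or 'extreme' by its distance outside [Q1, Q3]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    flags = []
    for x in values:
        # Positive only when x lies below Q1 or above Q3.
        distance = max(q1 - x, x - q3)
        if distance > extreme_factor * iqr:
            flags.append("extreme")
        elif distance > outlier_factor * iqr:
            flags.append("outlier")
        else:
            flags.append("no")
    return flags


# Hypothetical resting heart rates (bpm); the last two values are deliberately unusual.
heart_rate = [64, 65, 66, 67, 68, 68, 69, 70, 71, 72, 90, 120]
for value, flag in zip(heart_rate, iqr_flags(heart_rate)):
    print(value, flag)
```

In this sketch a value is labelled either as an outlier or as an extreme value, never both, which matches the two separate passes described for the removal step.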
3.4 Dealing with Outliers and Scaling Method
To deal with the outliers in the dataset, it was decided to remove all instances flagged as outliers. The scaling method adopted is standardisation, which expresses each value as a number of standard deviations from the mean and so provides a ‘range’ that makes it easier to see which values lie far from the rest of the dataset. Using the WEKA RemoveWithValues function, all flagged outliers are removed from the dataset. We repeat the process with ‘Extreme Values’, which are also outliers but lie even further from the bulk of the data.
Fig 7. WEKA Attribute Outliers Information (Post Removal)
Fig 8. WEKA Outliers Removal Example on Data Set Table
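The two steps can be sketched together: dropping flagged instances and then standardising what remains. This is a minimal illustration with hypothetical toy instances. The `Outlier` key stands in for the flag column produced during detection, `standardise` is a helper name introduced here, and the use of the population standard deviation is an assumption of this sketch rather than a statement about WEKA.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical instances carrying the flag added during outlier detection.
instances = [
    {"id": 1, "Daily Steps": 4000, "Outlier": "no"},
    {"id": 2, "Daily Steps": 6000, "Outlier": "no"},
    {"id": 3, "Daily Steps": 30000, "Outlier": "yes"},
    {"id": 4, "Daily Steps": 8000, "Outlier": "no"},
    {"id": 5, "Daily Steps": 10000, "Outlier": "no"},
]

# Keep only the instances that were not flagged, then standardise the remainder.
kept = [row for row in instances if row["Outlier"] == "no"]
z_scores = standardise([row["Daily Steps"] for row in kept])

for row, z in zip(kept, z_scores):
    print(row["id"], round(z, 3))
```

After standardisation a value's distance from the mean is read directly in standard deviations, which makes values on different scales, such as daily steps and heart rate, comparable.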
Removing outliers from a dataset can be crucial for several reasons:
Maintaining Data Integrity: By definition, outliers are data points that differ substantially from most of the data set. Although they may represent valid observations, they may also result from errors, noise, or anomalies. Removing outliers helps ensure that the dataset accurately reflects the underlying phenomena being studied, thereby preserving its integrity.
Enhancing Statistical Analysis: Outliers can distort statistical measures such as the mean and standard deviation (and, to a lesser extent, the median), leading to misleading interpretations. Removing outliers stabilises these measures, making statistical analysis more reliable. This ensures that descriptive statistics give a clearer view of the central tendency and variability of the data.
Improving Model Performance: Outliers can negatively affect the performance of predictive models by adding noise and bias. Models trained on datasets with outliers can produce less accurate predictions and generalise more poorly to new data. Removing outliers improves the quality of the data used to train the model, leading to more accurate and reliable predictions.
Facilitating Visualization: Outliers can distort data visualisations, making it difficult to identify meaningful patterns and relationships. Removing outliers can improve the clarity and interpretability of visualisations, enabling a better understanding of the underlying structure of the data.
Ensuring Assumptions Hold: Many statistical techniques and machine learning algorithms assume that the data is normally distributed or has certain other properties. Outliers can violate these assumptions and lead to potentially incorrect conclusions. Removing outliers helps the assumptions underlying the analysis to hold, thereby improving the validity of the results.
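The effect of a single outlier on descriptive statistics can be seen in a small sketch. The heart-rate values are hypothetical toy data chosen for illustration, not measurements from the study.

```python
from statistics import mean, median, stdev

# Hypothetical resting heart rates (bpm), without and with one unusual reading.
heart_rate = [65, 68, 70, 72, 75]
with_outlier = heart_rate + [140]

for label, data in (("clean", heart_rate), ("with outlier", with_outlier)):
    print(
        f"{label}: mean={mean(data):.2f} "
        f"median={median(data):.2f} stdev={stdev(data):.2f}"
    )
```

One extra reading pulls the mean and standard deviation up sharply while the median barely moves, which is why means computed before outlier handling can mislead.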
4.0 Data Cleaning and Processing
4.1 Unique Values within each Category of the Dataset
Category | Unique Values
Age | 1 (<1%)
Occupation | 1 (<1%)
Physical Activity Level | 2 (1%)
Stress Level | 1 (<1%)
Systolic BP | 3 (1%)
Diastolic BP | 1 (<1%)
Heart Rate | 2 (1%)
Daily Steps | 2 (1%)
TOTAL | 13
Table 3. Unique Values in Sleep Data Set
In our deep dive into the sleep data, we are not just crunching numbers; we are examining how different jobs shape sleep habits.
Within this large dataset, each occupation tells its own story about how people sleep. Whether a person works through the night or keeps regular daytime hours, we look across this range of occupations to show how work and sleep interact.
Fig 9. Count of Occupation Graph
Table 4. Count of Occupation
4.3 Sleep Duration Dependency on Occupation Type
In our investigation, we examine the impact of various professions on sleep duration. We are interested in comparing how specific occupations (scientist, salesperson, teacher, software engineer, manager, doctor, nurse, accountant, lawyer, and engineer) affect both the length and quality of sleep.
Through rigorous analysis, we aim to uncover the intricate connections between these professional roles and the crucial aspect of achieving adequate rest. Join us as we explore the relationship between occupation and sleep duration, shedding light on the diverse factors influencing our nightly rejuvenation.
Fig 10. Occupation vs Sleep Duration
Table 5. Mean/Modal/Median Sleep Duration vs Occupation
Fig 11. Occupation Mean/Modal/Median Sleep Duration
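A per-occupation summary of this kind can be sketched as a simple group-by. The records are hypothetical toy data (they are not the study's results), and `summarise` is a helper name introduced for this sketch.

```python
from collections import defaultdict
from statistics import mean, median, mode

# Hypothetical (occupation, sleep duration in hours) pairs.
records = [
    ("Nurse", 6.5), ("Nurse", 7.0), ("Nurse", 6.5),
    ("Engineer", 8.0), ("Engineer", 7.5), ("Engineer", 8.0),
]


def summarise(records):
    """Group sleep durations by occupation and return mean, median and mode per group."""
    groups = defaultdict(list)
    for occupation, hours in records:
        groups[occupation].append(hours)
    return {
        occupation: {"mean": mean(h), "median": median(h), "mode": mode(h)}
        for occupation, h in groups.items()
    }


summary = summarise(records)
for occupation, stats in summary.items():
    print(occupation, {name: round(value, 2) for name, value in stats.items()})
```

Reporting the median and mode alongside the mean helps when an occupation has only a few instances, since a single unusual sleeper can shift a small group's mean noticeably.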
4.4 Most Common BMI Categories
We turn our attention to the prevalence of Body Mass Index (BMI) categories. Specifically, we aim to dissect the distribution of individuals across the most common BMI classifications: Normal, Normal Weight, Obese, and Overweight.
By scrutinizing this data, we seek to unveil patterns and trends regarding the prevalence of each BMI category within our sample population.
Through meticulous analysis, we endeavour to gain insights into the frequency and proportions of individuals falling into each category, providing valuable information on the distribution of BMI across our dataset.
Fig 12. BMI Categories Count Graph
4.5 Age Groups and Their Respective BMI Categories
Through rigorous analysis, we aim to explore the demographic composition of individuals across various BMI ranges by examining both mean and median ages. Our goal is to identify any discernible patterns or trends in age distribution within each BMI category. By uncovering these insights, we hope to shed light on the relationship between BMI and age within our dataset. Join us as we delve into this analytical journey to unravel the complexities of BMI categories and age distribution, gaining valuable insights into their interplay.
Fig 13. BMI Categories vs Age Graph
Table 7. BMI Categories and Mean/Median Age
4.6 Blood Pressure Dependency on BMI Category
Our aim is to uncover any observable patterns or trends in blood pressure levels across different BMI categories, offering valuable insights into how BMI may influence cardiovascular health within our dataset. Join us as we delve into this analytical endeavour, examining the relationship between BMI categories and blood pressure to gain a deeper understanding of cardiovascular health metrics.
Fig 14. BMI Categories vs Blood Pressures Graph
Table 8. Systolic/Diastolic Blood Pressure Mean Values
4.7 Different Sleep Quality against Increasing Stress Levels
We focus on understanding the potential impact of increased stress levels on the quality of sleep within a diverse and extensive dataset. By scrutinizing a range of variables related to stress and sleep quality, we aim to discern any correlations or trends that may exist.
Through rigorous analysis, we shed light on the intricate relationship between stress levels and sleep quality, offering valuable insights into the factors that may influence individuals' ability to attain restful sleep amidst varying stress levels.
Fig 15. Stress Level vs Sleep Quality Graph
Table 9. Stress Level vs Sleep Quality Mean/Modal Values
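One way to put a number on the relationship between two rated variables is the Pearson correlation coefficient, sketched below. The ratings are hypothetical toy data, so the resulting value says nothing about the study's findings, and `pearson` is a helper name introduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical subjective ratings for six people.
stress_level = [3, 4, 5, 6, 7, 8]
sleep_quality = [9, 8, 7, 7, 6, 5]

r = pearson(stress_level, sleep_quality)
print(f"r = {r:.3f}")
```

A coefficient near -1 means that higher stress ratings go with lower sleep quality ratings, a value near 0 means no linear relationship, and a value near +1 means the two rise together.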
4.8 Male vs Female Sleeping Disorder Count
We're examining the occurrence of sleeping disorders across genders in our dataset, aiming to identify differences between males and females. By carefully documenting the instances of sleeping disorders in each gender, we aim to quantify and contrast their frequency. Through comprehensive analysis, our goal is to uncover any noticeable patterns or trends that might indicate which gender has a higher prevalence of sleeping disorders.
Fig 16. Female Sleep Disorder Count Graph
Table 10. Female Sleep Disorder Count
Fig 17. Male Sleep Disorder Count Graph
Table 11. Male Sleep Disorder Count
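Counts of this kind amount to a cross-tabulation of gender against sleep disorder, which can be sketched with a counter. The records and the disorder labels used here are hypothetical toy data, not the study's counts.

```python
from collections import Counter

# Hypothetical (gender, sleep disorder) pairs; the labels are illustrative only.
people = [
    ("Female", "None"), ("Female", "Insomnia"), ("Female", "Sleep Apnea"),
    ("Male", "None"), ("Male", "Insomnia"), ("Male", "None"),
]

# Count how often each (gender, disorder) combination occurs.
counts = Counter(people)
genders = sorted({gender for gender, _ in people})
disorders = sorted({disorder for _, disorder in people})

for gender in genders:
    row = {disorder: counts[(gender, disorder)] for disorder in disorders}
    print(gender, row)
```

When the two groups differ in size, comparing proportions within each gender is fairer than comparing raw counts.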
4.9 How Daily Steps affect Quality of Sleep
In this study, we're exploring how boosting daily step count might affect sleep quality. We're analysing data that tracks individuals' daily steps alongside metrics of sleep quality to investigate the connection between physical activity and sleep patterns. Our aim is to gain a deeper understanding of how changes in daily step count could impact factors like sleep duration, efficiency, and disturbances. We're aiming to uncover valuable insights into the potential advantages of upping physical activity levels for better sleep quality. Ultimately, our goal is to enhance our understanding of lifestyle factors that influence overall sleep health.