Data Cleaning in R for Data Science
The Experiment
The data set used here comes from the UCI Machine Learning Repository. A group of 30 volunteers (aged 19–48 years) performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a Samsung Galaxy S smartphone. The data collected from the phone's embedded accelerometer and gyroscope was split into training and test sets.
Step 1: Retrieving Data from URL
The first step is to obtain the data. To avoid the headache of manually downloading files, they are often downloaded with a small code snippet. Since this data set comes as a zipped folder, it also has to be unzipped after the download.
Data Reference:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile = "files", method = "curl", mode = "wb")
unzip("files")
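If the script is re-run often, it can help to skip the download when the archive is already on disk. The following is a minimal sketch and not part of the original workflow: the helper name download_if_missing() is hypothetical, and it relies only on the base R functions file.exists() and download.file().

```r
# Hypothetical helper: download a file only when it is not already on disk.
# Returns TRUE if a download happened, FALSE if the existing file was reused.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  # mode = "wb" writes the zip archive in binary mode (matters on Windows)
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Example usage (commented out so the sketch does not hit the network):
# zip_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
# download_if_missing(zip_url, "files")
# unzip("files")
```

Returning TRUE or FALSE makes it easy to log whether a fresh copy was fetched.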
Step 2: Reading the files into R
features <- read.table("...\\UCI\\features.txt", col.names = c("serial", "Functions"))
features
activities <- read.table("...\\UCI\\activity_labels.txt", col.names = c("serial", "Activity"))
activities
x_test <- read.table("...\\test\\X_test.txt", col.names = features$Functions)
x_test
y_test <- read.table("...\\test\\y_test.txt", col.names = "serial")
y_test
subject_test <- read.table("...\\test\\subject_test.txt", col.names = "subject")
subject_test
subject_train <- read.table("...\\UCI\\train\\subject_train.txt", col.names = "subject")
subject_train
x_train <- read.table("...\\UCI\\train\\X_train.txt", col.names = features$Functions)
x_train
y_train <- read.table("...\\UCI\\train\\y_train.txt", col.names = "serial")
y_train
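One detail worth knowing about the col.names argument: by default, read.table() converts the supplied names into syntactically valid R names, so the dashes and parentheses in the feature names become dots. The sketch below illustrates this on a tiny temporary file that I made up for the purpose; the two feature names follow the style of features.txt, and the numbers are invented.

```r
# Write a tiny two-column file that mimics the layout of X_test.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("0.25 -0.98", "0.27 -0.97"), tmp)

# Feature names in the style of features.txt (dashes and parentheses)
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# Default behaviour: names are converted to syntactically valid R names
cleaned <- read.table(tmp, col.names = feature_names)
names(cleaned)

# check.names = FALSE keeps the original spelling
original <- read.table(tmp, col.names = feature_names, check.names = FALSE)
names(original)
```

This is why the column names seen later in the analysis contain runs of dots rather than "-" and "()".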
Note: It might be difficult at first to understand what the data means and which column names to use, but after a while it will start to make sense.
Looking at how the files are laid out, two things become clear:
- I had to merge the training and test sets by row-binding them.
- I had to merge the different attributes of the subjects by column-binding them.
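To see why the order of the two operations matters, here is a small sketch on toy data frames (all values are invented, and the *_toy names are placeholders). The key assumption is that every table is stacked in the same order, test first and then train, so that row i still describes the same observation in all three tables when they are bound side by side.

```r
# Toy stand-ins for the real files: 2 test rows and 3 training rows
x_test_toy        <- data.frame(feature = c(0.1, 0.2))
y_test_toy        <- data.frame(serial  = c(5, 4))
subject_test_toy  <- data.frame(subject = c(2, 2))

x_train_toy       <- data.frame(feature = c(0.3, 0.4, 0.5))
y_train_toy       <- data.frame(serial  = c(1, 1, 6))
subject_train_toy <- data.frame(subject = c(1, 1, 3))

# Stack rows in the same order (test, then train) for every table
x_all       <- rbind(x_test_toy, x_train_toy)
y_all       <- rbind(y_test_toy, y_train_toy)
subject_all <- rbind(subject_test_toy, subject_train_toy)

# Only now attach the columns; row i refers to the same observation everywhere
combined <- cbind(subject_all, x_all, y_all)
dim(combined)
```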
Step 3: Merging the tables intelligently
binded_x <- rbind(x_test, x_train)
binded_y <- rbind(y_test, y_train)
subject <- rbind(subject_test, subject_train)
# Next, I used the cbind() function to attach the columns as well.
raw_data_combined <- cbind(subject, binded_x, binded_y)
raw_data_combined
Step 4: Filtering out only the mean and std columns
The data set is large (561 feature columns per observation), so we need to filter it down to the attributes we actually need. I had to keep only those columns with ‘mean’ or ‘std’ in their names. I used the select() function here, which keeps the code short and readable.
Note: Install the “dplyr” package and load it with library() to use its functions such as select(), arrange(), mutate(), filter() and summarise().
#install.packages("dplyr")
library(dplyr)
analysis <- raw_data_combined %>%
  select(serial, subject, contains("mean"), contains("std"))
analysis
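A quick way to check which columns such a filter picks up is to try the same match in base R. The sketch below uses grepl() on a toy data frame whose column names are invented in the style of this data set. One thing to be aware of: contains() ignores case by default, so it behaves like the second, case-insensitive match here and would also keep a column such as angle.tBodyAccMean.gravity.

```r
# Toy data frame with column names in the style produced by read.table()
toy <- data.frame(
  subject = 1,
  serial = 5,
  tBodyAcc.mean...X = 0.25,
  tBodyAcc.std...X = -0.98,
  tBodyAcc.max...X = 0.90,
  angle.tBodyAccMean.gravity. = 0.10
)

# Case-sensitive match: keeps the mean/std columns, skips "Mean" in angle...
keep_strict <- grepl("mean|std", names(toy))
names(toy)[keep_strict]

# Case-insensitive match, which mirrors the default behaviour of contains()
keep_loose <- grepl("mean|std", names(toy), ignore.case = TRUE)
names(toy)[keep_loose]
```

Whether the angle columns belong in the result is a judgment call; the point is only to know that they are included.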
Step 5: Changing the activity labels from numeric codes to descriptive values
activities
“activity_labels.txt” assigns the numbers 1–6 to the six activities, and these codes were being used in the data instead of the activity names. For better readability, I changed them into descriptive values using the following commands:
analysis$serial[analysis$serial == "1"] <- "WALKING"
analysis$serial[analysis$serial == "2"] <- "WALKING_UPSTAIRS"
analysis$serial[analysis$serial == "3"] <- "WALKING_DOWNSTAIRS"
analysis$serial[analysis$serial == "4"] <- "SITTING"
analysis$serial[analysis$serial == "5"] <- "STANDING"
analysis$serial[analysis$serial == "6"] <- "LAYING"
analysis
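The six assignments can also be written as a single lookup against the labels table, which avoids typing each code by hand. This is a sketch of an alternative, not what I originally ran: activities_toy stands in for the activities table read earlier, and codes is a made-up vector of activity codes.

```r
# Stand-in for the table read from activity_labels.txt
activities_toy <- data.frame(
  serial = 1:6,
  Activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
               "SITTING", "STANDING", "LAYING")
)

# Made-up activity codes, as they would appear in the serial column
codes <- c(5, 6, 1, 6)

# match() finds the row of each code in the lookup table;
# indexing Activity with those positions gives the descriptive labels
labels <- activities_toy$Activity[match(codes, activities_toy$serial)]
labels
```

Because the labels come straight from the lookup table, a mistyped code cannot silently leave one activity unconverted.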
Step 6: Changing columns names to enhance readability
- names() will give you only the column names of the dataset you’ve provided to it.
- gsub() will replace an old string with the new string you pass to it.
# Collapse the duplicated "BodyBody" first, before "fBody" is rewritten,
# otherwise this replacement never finds anything to match
names(analysis) <- gsub("BodyBody", "Body", names(analysis))
names(analysis) <- gsub("Acc", "Accelerometer", names(analysis))
names(analysis) <- gsub("tBody", "time", names(analysis))
names(analysis) <- gsub("fBody", "frequency", names(analysis))
names(analysis) <- gsub("Gyro", "Gyroscope", names(analysis))
names(analysis) <- gsub("Mag", "Magnitude", names(analysis))
names(analysis) <- gsub("serial", "Activity", names(analysis))
names(analysis)
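Because each gsub() call works on the output of the previous one, the order of the replacements affects the result. The sketch below applies an ordered set of replacements in a loop to two sample names (in the dotted form that read.table() produces). The replacements vector is my own arrangement for illustration, with the "BodyBody" fix placed first so that it runs before "fBody" is rewritten.

```r
# Two sample column names in the dotted form produced by read.table()
sample_names <- c("tBodyAcc.mean...X", "fBodyBodyGyroMag.std..")

# Ordered replacements: names are the patterns, values are the new strings
replacements <- c(
  BodyBody = "Body",
  Acc      = "Accelerometer",
  tBody    = "time",
  fBody    = "frequency",
  Gyro     = "Gyroscope",
  Mag      = "Magnitude"
)

# Apply each replacement in turn; fixed = TRUE treats patterns as plain text
renamed <- sample_names
for (pattern in names(replacements)) {
  renamed <- gsub(pattern, replacements[[pattern]], renamed, fixed = TRUE)
}
renamed
```

Keeping the replacements in one vector also makes it easy to see the order at a glance and to add new abbreviations later.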
Step 7: Creating an independent tidy data set with the average of each variable for each activity and each subject
Put simply, this means we need to take the mean of each feature in the ‘analysis’ dataset for every combination of subject and activity.
tidy_data <- analysis %>% group_by(subject, Activity) %>% summarise_all(list(mean))
tidy_data
The group_by() function groups your data by the columns you pass to it, and summarise_all() applies the function you pass to it (in this case mean()) to every remaining column within each group.
This concludes the project: the raw data has been transformed into a tidy data set that can be used for analysis later.
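As a cross-check, the same per-group averages can be computed in base R with aggregate(). This is a sketch on a toy data frame with invented values, not the code I used for the project; it is handy for confirming that the grouped result has one row per subject and activity combination that actually occurs in the data.

```r
# Toy data: subject 1 walks twice, subject 2 walks once and sits once
toy <- data.frame(
  subject  = c(1, 1, 2, 2),
  Activity = c("WALKING", "WALKING", "WALKING", "SITTING"),
  timeAccelerometer.mean...X = c(1, 3, 5, 7)
)

# The "." on the left-hand side means "every column not used for grouping"
averages <- aggregate(. ~ subject + Activity, data = toy, FUN = mean)
averages
```

Subject 1 has two WALKING rows (1 and 3), so their average is 2, while the other two groups each contain a single row and keep their value.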
More such simplified Data Science concepts will follow. If you liked this or have some feedback or follow-up questions please comment below.
Thanks for Reading!