Data Cleaning in R for Data Science

Data cleaning in R for data science typically involves:

Removing duplicate values

Removing null values

Changing column names to readable, understandable, formatted names

Removing commas from numeric values (e.g. 1,000,657 to 1000657)

Converting data types into their appropriate types for analysis
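As a minimal sketch of these five operations, the snippet below cleans a small made-up data frame in base R. The data frame `raw`, its values and its column names are invented for illustration and are not part of the dataset used later in this post.

```r
# Toy data frame (hypothetical values) with a duplicate row, a missing
# value, awkward column names, and numbers stored as text with commas.
raw <- data.frame(
  "Cust ID" = c(1, 2, 2, 3),
  "Total.SALES" = c("1,000,657", "2,500", "2,500", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

clean <- raw[!duplicated(raw), ]                       # 1. remove duplicate rows
clean <- clean[complete.cases(clean), ]                # 2. remove rows containing NA
names(clean) <- c("customer_id", "total_sales")        # 3. readable column names
clean$total_sales <- gsub(",", "", clean$total_sales)  # 4. strip commas
clean$total_sales <- as.numeric(clean$total_sales)     # 5. convert to numeric
clean$customer_id <- as.integer(clean$customer_id)

clean
```

The order matters here: the commas have to be removed before `as.numeric()` is called, otherwise the conversion produces `NA` with a warning.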

The Experiment :

The data used here comes from the UCI Machine Learning Repository. A group of 30 volunteers (aged 19–48) performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a Samsung Galaxy S II smartphone. The data collected from the embedded accelerometer and gyroscope was divided into training and test sets.

Step 1: Retrieving Data from URL

The first step is to obtain the data. To avoid the headache of manually downloading many files, they are often downloaded with a small code snippet. Since this dataset comes as a zipped folder, it also needs to be unzipped after the download.

Data Reference :

http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile = "files", method = "curl", mode = "wb")
unzip("files")
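Re-running the script downloads the archive again every time. One common way around that, sketched below, is to download only when the file is not already on disk. `fetch_once()` is a hypothetical helper name, not a base R function, and the URL in the demonstration is a placeholder; the helper only wraps `file.exists()` and `download.file()`.

```r
# Hypothetical helper: download `url` to `destfile` only if it is missing.
# Returns TRUE when a download was attempted, FALSE when it was skipped.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Demonstration with a file that already exists, so nothing is downloaded.
existing <- tempfile(fileext = ".zip")
file.create(existing)
skipped <- !fetch_once("https://example.com/placeholder.zip", existing)
skipped
```

For the real dataset, the same call would take the download URL and destination file shown above.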

Step 2: Reading the files into R

features <- read.table("...\\UCI\\features.txt", col.names = c("serial", "Functions"))
features

activities <- read.table("...\\UCI\\activity_labels.txt", col.names = c("serial", "Activity"))
activities

x_test <- read.table("...\\test\\X_test.txt", col.names = features$Functions)
x_test

y_test <- read.table("...\\test\\y_test.txt", col.names = "serial")
y_test


subject_test <- read.table("...\\test\\subject_test.txt", col.names = "subject")
subject_test

subject_train <- read.table("...UCI\\train\\subject_train.txt", col.names = "subject")
subject_train

x_train <- read.table("...\\UCI\\train\\X_train.txt", col.names = features$Functions)
x_train

y_train <- read.table("...\\UCI\\train\\y_train.txt", col.names = "serial")
y_train

Note: It might be difficult at first to understand what the data means and which column names to use, but after a while it will start to make sense.
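One detail that helps when matching column names later: `read.table()` checks names by default (`check.names = TRUE`), so characters such as `-`, `(` and `)` in the feature names become dots. Here is a small sketch using inline text instead of the real files. The two feature names appear in features.txt; the numbers are made up.

```r
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X")

# `text =` lets read.table() parse a string instead of a file on disk.
toy <- read.table(
  text = "0.25 0.10\n0.30 0.12",
  col.names = feature_names
)

names(toy)

# The same conversion can be previewed without reading any data:
make.names(feature_names, unique = TRUE)
```

This is why the column names in the loaded tables contain runs of dots rather than the hyphens and parentheses seen in features.txt.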

The way the files are split implies two things:

I had to merge the training and test sets by row binding them.
I had to merge the different attributes of the subjects by column binding them.
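To see why both kinds of binding are needed, here is a toy version with made-up values: two "test" rows and three "train" rows. `rbind()` stacks tables that have the same columns, and `cbind()` then places tables with the same number of rows side by side. All object names ending in `_toy` are invented for this sketch.

```r
x_test_toy  <- data.frame(f1 = c(0.1, 0.2), f2 = c(1.1, 1.2))
x_train_toy <- data.frame(f1 = c(0.3, 0.4, 0.5), f2 = c(1.3, 1.4, 1.5))
subject_test_toy  <- data.frame(subject = c(2, 4))
subject_train_toy <- data.frame(subject = c(1, 3, 5))

# Stack test on top of train, in the same order for every table.
x_all <- rbind(x_test_toy, x_train_toy)
subject_all <- rbind(subject_test_toy, subject_train_toy)

# Attach the subject column to the feature columns.
combined <- cbind(subject_all, x_all)
dim(combined)
```

Because `cbind()` pairs rows purely by position, every `rbind()` call has to stack test and train in the same order, otherwise subjects end up next to the wrong measurements.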

Step 3: Merging the tables intelligently

binded_x <- rbind(x_test, x_train)
binded_y <- rbind(y_test, y_train)
subject <- rbind(subject_test, subject_train)
#Next, I used the cbind() function to complete attaching the columns as well.
raw_data_combined <- cbind(subject, binded_x, binded_y)
raw_data_combined

Step 4: Filtering out only the mean and std columns

One thing to understand is that the data is large (561 feature columns), so we need to filter it down to the attributes we need. I had to keep only those columns that mention ‘mean’ or ‘std’ in their names. I used the select() function here, which keeps this kind of column filtering short and readable.

Note: Install the “dplyr” package and then load it with library() to use functions such as select(), arrange(), mutate(), filter() and summarise().

#install.packages("dplyr")
library(dplyr)
analysis <- raw_data_combined %>% select(serial, subject, contains("mean"), contains("std"))
analysis
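One behaviour worth knowing about `contains()`: it ignores case by default, so `contains("mean")` also picks up columns with `Mean` in the name. The sketch below uses a hand-made data frame whose column names imitate the dataset's (the values are invented) and compares the default with `ignore.case = FALSE`.

```r
library(dplyr)

toy <- data.frame(
  subject = 1:2,
  serial = c(1, 2),
  tBodyAcc.mean...X = c(0.1, 0.2),
  tBodyAcc.std...X = c(0.3, 0.4),
  tBodyAcc.mad...X = c(0.5, 0.6),
  angle.tBodyAccMean.gravity. = c(0.7, 0.8)
)

# Default: case-insensitive, so the angle...Mean column is kept too.
default_match <- toy %>%
  select(serial, subject, contains("mean"), contains("std"))

# Case-sensitive: only names containing lower-case "mean" are kept.
strict_match <- toy %>%
  select(serial, subject, contains("mean", ignore.case = FALSE), contains("std"))

names(default_match)
names(strict_match)
```

Which of the two is appropriate depends on whether the angle columns count as mean measurements for the analysis at hand.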

Step 5: Changing the activity labels from numeric codes to descriptive values

activities

“activity_labels.txt” assigns the numbers 1–6 to the six activities, and these codes were used in the data instead of the activity names. For better readability, I changed them into descriptive values using the following commands:

analysis$serial[analysis$serial == "1"] <- "WALKING"
analysis$serial[analysis$serial == "2"] <- "WALKING_UPSTAIRS"
analysis$serial[analysis$serial == "3"] <- "WALKING_DOWNSTAIRS"
analysis$serial[analysis$serial == "4"] <- "SITTING"
analysis$serial[analysis$serial == "5"] <- "STANDING"
analysis$serial[analysis$serial == "6"] <- "LAYING"

analysis
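An alternative sketch that avoids writing one assignment per code: since the activities table already pairs each code with its label, `factor()` can do the mapping in a single step. The small objects below are typed in by hand to stand in for the `activities` table and the `serial` column; their names are made up for this example.

```r
activities_toy <- data.frame(
  serial = 1:6,
  Activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
               "SITTING", "STANDING", "LAYING")
)
codes <- c(5, 1, 6, 3)

# Map each numeric code to the label stored next to it in the lookup table.
activity_names <- as.character(
  factor(codes, levels = activities_toy$serial, labels = activities_toy$Activity)
)
activity_names
```

A lookup like this takes the labels straight from the table, so a mistyped code in a hand-written list of assignments cannot leave one activity unlabeled.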

Step 6: Changing columns names to enhance readability

names() returns the column names of the dataset you pass to it.
gsub() replaces every match of a pattern in a string with the replacement you pass to it.

names(analysis) <- gsub("Acc", "Accelerometer", names(analysis))
names(analysis) <- gsub("tBody", "time", names(analysis))
names(analysis) <- gsub("fBody", "frequency", names(analysis))
names(analysis) <- gsub("Gyro", "Gyroscope", names(analysis))
names(analysis) <- gsub("BodyBody", "Body", names(analysis))
names(analysis) <- gsub("Mag", "Magnitude", names(analysis))
names(analysis) <- gsub("serial", "Activity", names(analysis))

names(analysis)
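The same idea can be written more compactly by keeping the substitutions in a named vector and looping over it. This is only a sketch on two hand-typed names in the dataset's style; `old_names`, `replacements` and `new_names` are names chosen for this example, and only three of the substitutions are included.

```r
old_names <- c("tBodyAccMag.mean..", "fBodyGyro.std...Z")

# Abbreviation -> full word; applied in the order listed.
replacements <- c(Acc = "Accelerometer", Gyro = "Gyroscope", Mag = "Magnitude")

new_names <- old_names
for (abbr in names(replacements)) {
  # gsub() replaces every occurrence of the pattern in each name.
  new_names <- gsub(abbr, replacements[[abbr]], new_names)
}
new_names
```

Since each `gsub()` call works on the output of the previous one, the order of the substitutions can change the result. Note also that `gsub()` treats the pattern as a regular expression unless `fixed = TRUE` is passed.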

Step 7: Creating an independent tidy data set with the average of each variable for each activity and each subject

Put simply, we need to take the mean of each feature in the ‘analysis’ dataset for every combination of subject and activity.

tidy_data <- analysis %>% group_by(subject, Activity) %>% summarise_all(list(mean))
tidy_data

The group_by() function groups your data by the columns you pass to it, and summarise_all() applies the function you give it (in this case mean()) to every remaining column within each group.

This concludes the project: the raw data has been transformed into a tidy data set that can be used for analysis later.
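To make the grouping concrete, here is a toy run with invented values for two subjects. It uses `summarise()` with `across()`, which newer dplyr versions offer in place of `summarise_all()`; this sketch assumes dplyr 1.0 or later is installed.

```r
library(dplyr)

toy <- data.frame(
  subject = c(1, 1, 1, 2, 2),
  Activity = c("WALKING", "WALKING", "SITTING", "WALKING", "WALKING"),
  feature_a = c(1, 3, 10, 4, 6),
  feature_b = c(2, 4, 20, 8, 10)
)

# One output row per (subject, Activity) pair, holding the mean of each feature.
toy_means <- toy %>%
  group_by(subject, Activity) %>%
  summarise(across(everything(), mean), .groups = "drop")
toy_means
```

The five input rows collapse to three, one for each subject and activity combination that occurs in the data.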


More such simplified Data Science concepts will follow. If you liked this or have some feedback or follow-up questions please comment below.

Thanks for Reading!


Blogs by Shoaib
