Data Cleaning in R for Data Science

Data cleaning typically involves tasks such as:


  • Removing duplicate values
  • Removing null values
  • Changing column names to readable, understandable, formatted names
  • Removing commas from numeric values, e.g. 1,000,657 becomes 1000657
  • Converting data types into their appropriate types for analysis
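
As a rough illustration of the tasks listed above, here is a minimal base R sketch on a made-up data frame. The data frame `raw` and its column names are hypothetical and are not part of the experiment described in this post.

```r
# Hypothetical toy data: one duplicate row, one missing value, numbers stored as text
raw <- data.frame(
  c("Pune", "Delhi", "Delhi", NA),
  c("1,000,657", "2,500", "2,500", "300"),
  stringsAsFactors = FALSE
)
names(raw) <- c("City Name", "Population Count")

clean <- unique(raw)                     # remove duplicate rows
clean <- na.omit(clean)                  # remove rows containing missing values
names(clean) <- c("city", "population")  # readable column names

# Strip the thousands separators, then convert text to numeric
clean$population <- as.numeric(gsub(",", "", clean$population, fixed = TRUE))

str(clean)
```
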

    The Experiment:

    The data for this experiment comes from the UCI Machine Learning Repository. A group of 30 volunteers (aged 19–48 years) performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a Samsung Galaxy S II smartphone. The data collected from the embedded accelerometer and gyroscope was divided into training and test sets.

    Step 1: Retrieving Data from URL

    The first step is to obtain the data. To avoid the headache of manually downloading thousands of files, data is often downloaded with a small code snippet. Since this data set is distributed as a zipped folder, it has to be downloaded and unzipped before the files can be read.

    Data Reference :

    http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
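
The post does not show the download code, so the following is only a sketch of how the download-and-unzip step could look in base R. The helper name `fetch_and_unzip`, its arguments, and the `data` folder are assumptions for illustration; the archive URL is deliberately left as a placeholder to be taken from the dataset page above.

```r
# Hypothetical helper: download a zip archive once and extract it into dest_dir
fetch_and_unzip <- function(zip_url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(zip_url))

  # Skip the download if the archive is already on disk
  if (!file.exists(zip_path)) {
    download.file(zip_url, destfile = zip_path, mode = "wb")  # "wb" keeps the zip intact on Windows
  }

  # unzip() returns the paths of the extracted files
  unzip(zip_path, exdir = dest_dir)
}

# Usage (replace the placeholder with the real archive link):
# extracted_files <- fetch_and_unzip("<zip-archive-url>")
```
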

    Step 2: Reading the files into R

    features <- read.table("...\\UCI\\features.txt", col.names = c("serial", "Functions"))
    features

    activities <- read.table("...\\UCI\\activity_labels.txt", col.names = c("serial", "Activity"))
    activities

    x_test <- read.table("...\\test\\X_test.txt", col.names = features$Functions)
    x_test

    y_test <- read.table("...\\test\\y_test.txt", col.names = "serial")
    y_test


    subject_test <- read.table("...\\test\\subject_test.txt", col.names = "subject")
    subject_test

    subject_train <- read.table("...\\UCI\\train\\subject_train.txt", col.names = "subject")
    subject_train

    x_train <- read.table("...\\UCI\\train\\X_train.txt", col.names = features$Functions)
    x_train

    y_train <- read.table("...\\UCI\\train\\y_train.txt", col.names = "serial")
    y_train
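
Because the test and train folders share the same file layout, the repeated read.table() calls could also be wrapped in a small helper. This is only a sketch: `read_split` and `data_dir` are hypothetical names, and `file.path()` is used so the path separators do not need to be escaped by hand.

```r
# Hypothetical helper: read the subject, X and y files of one split ("test" or "train")
read_split <- function(data_dir, split, feature_names) {
  split_dir <- file.path(data_dir, split)
  list(
    subject = read.table(file.path(split_dir, paste0("subject_", split, ".txt")),
                         col.names = "subject"),
    x = read.table(file.path(split_dir, paste0("X_", split, ".txt")),
                   col.names = feature_names),
    y = read.table(file.path(split_dir, paste0("y_", split, ".txt")),
                   col.names = "serial")
  )
}

# Usage, assuming data_dir points at the unzipped folder and features was read as above:
# test  <- read_split(data_dir, "test",  features$Functions)
# train <- read_split(data_dir, "train", features$Functions)
```
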

    Note: It might be difficult at first to understand what the data means and which column names to use, but after a while it will start to make sense.

    The way the files are split implies two things:

    1. I had to merge the training and test sets by row binding them
    2. I had to merge the different attributes of the subjects by column binding them.
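
The difference between the two kinds of binding is easy to see on a toy example. The data frames below are made up for illustration and are much smaller than the real files.

```r
# Two row-wise pieces with the same columns (like the test and train sets)
test_part  <- data.frame(subject = c(2, 4),    serial = c(5, 1))
train_part <- data.frame(subject = c(1, 3, 5), serial = c(6, 2, 3))

# rbind() stacks rows: 2 + 3 = 5 rows, still 2 columns
stacked <- rbind(test_part, train_part)

# A column-wise piece with one row per observation (like the measurements)
measurements <- data.frame(f1 = c(0.1, 0.2, 0.3, 0.4, 0.5))

# cbind() attaches columns: still 5 rows, now 3 columns
combined <- cbind(stacked, measurements)
dim(combined)
```

Note that rbind() requires matching column names and cbind() requires matching row counts, so the pieces must be bound in the same order (test first, then train) everywhere.
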

    Step 3: Merging the tables intelligently

    binded_x <- rbind(x_test, x_train)
    binded_y <- rbind(y_test, y_train)
    subject <- rbind(subject_test, subject_train)
    # Next, I used the cbind() function to attach the columns as well.
    raw_data_combined <- cbind(subject, binded_x, binded_y)
    raw_data_combined

    Step 4: Filtering out only the mean and std columns

    The data set is wide (the feature file lists 561 measurements), so we need to filter it down to the attributes we actually need. I kept only the columns whose names mention ‘mean’ or ‘std’. I used the select() function from dplyr here, which keeps the code short and readable.

    Note: Install the “dplyr” package and load it with library() to use its functions such as select(), arrange(), mutate(), filter() and summarise().

    #install.packages("dplyr")
    library(dplyr)
    analysis <- raw_data_combined %>% 
     select(serial , subject ,contains("mean") , contains("std"))
    analysis
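
One detail worth knowing about contains(): it ignores case by default, so contains("mean") also picks up columns whose names include "Mean". The sketch below shows this on a made-up data frame; the column names only imitate the style of the real ones.

```r
library(dplyr)

# Hypothetical toy data with column names in the style produced by read.table()
toy <- data.frame(1:2, c(1, 1), c(0.1, 0.2), c(0.3, 0.4), c(0.5, 0.6), c(0.7, 0.8))
names(toy) <- c("serial", "subject", "tBodyAcc.mean...X", "tBodyAcc.std...X",
                "tBodyAcc.max...X", "angle.X.gravityMean.")

# Default: case-insensitive match, so "gravityMean" is selected too
loose <- toy %>% select(contains("mean"))
names(loose)

# Case-sensitive match keeps only names with a lowercase "mean"
strict <- toy %>% select(contains("mean", ignore.case = FALSE))
names(strict)
```

Whether the "Mean" columns belong in the analysis is a judgment call; the code in this post uses the default, case-insensitive behaviour.
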

    Step 5: Changing the activity labels from numeric codes to descriptive values

    activities

    “activity_labels.txt” assigns the numbers 1–6 to the six activities, and the data uses these codes instead of the activity names. For better readability, I changed them into descriptive values using the following commands:

    analysis$serial[analysis$serial == "1"] <- "WALKING"
    analysis$serial[analysis$serial == "2"] <- "WALKING_UPSTAIRS"
    analysis$serial[analysis$serial == "3"] <- "WALKING_DOWNSTAIRS"
    analysis$serial[analysis$serial == "4"] <- "SITTING"
    analysis$serial[analysis$serial == "5"] <- "STANDING"
    analysis$serial[analysis$serial == "6"] <- "LAYING"

    analysis
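
As an alternative to one assignment per code, the lookup table itself can drive the recoding through factor(). This is a sketch on stand-in data: `activity_lookup` is built by hand here to mirror the `activities` table, and `codes` is a made-up vector of activity codes.

```r
# Stand-in for the activities table read from activity_labels.txt
activity_lookup <- data.frame(
  serial   = 1:6,
  Activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
               "SITTING", "STANDING", "LAYING")
)

# Made-up activity codes
codes <- c(5, 6, 1, 3)

# Map each code to its label in one step
labelled <- as.character(
  factor(codes, levels = activity_lookup$serial, labels = activity_lookup$Activity)
)
labelled
```

This avoids typing each code by hand, which is where copy-and-paste slips tend to creep in.
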

    Step 6: Changing columns names to enhance readability

    1. names() returns the column names of the data set you pass to it, and can also be used to assign new ones.
    2. gsub() replaces every occurrence of a pattern with the replacement string you pass to it.

    names(analysis) <- gsub("Acc", "Accelerometer", names(analysis))
    names(analysis) <- gsub("tBody", "time", names(analysis))
    names(analysis) <- gsub("fBody", "frequency", names(analysis))
    names(analysis) <- gsub("Gyro", "Gyroscope", names(analysis))
    names(analysis) <- gsub("BodyBody", "Body", names(analysis))
    names(analysis) <- gsub("Mag", "Magnitude", names(analysis))
    names(analysis) <- gsub("serial", "Activity", names(analysis))

    names(analysis)
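
To see what these replacements do, here is a small sketch that applies the same patterns, in the same order, to a few made-up column names. The vector `cols` and the named vector `replacements` are illustrative only.

```r
# Made-up column names in the style of the data set
cols <- c("tBodyAcc.mean...X", "fBodyGyro.std...Y", "fBodyBodyAccMag.mean..")

# Pattern -> replacement pairs, applied in the order listed
replacements <- c(Acc = "Accelerometer", tBody = "time", fBody = "frequency",
                  Gyro = "Gyroscope", BodyBody = "Body", Mag = "Magnitude")

for (pattern in names(replacements)) {
  # fixed = TRUE treats the pattern as plain text rather than a regular expression
  cols <- gsub(pattern, replacements[[pattern]], cols, fixed = TRUE)
}
cols
```

The order matters: once "fBody" has been replaced, a name that started as "fBodyBody..." no longer contains "BodyBody", so the later pattern has nothing left to match there.
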

    Step 7: Creating an independent tidy data set with the average of each variable for each activity and each subject

    Put simply, we need to compute the mean of each feature in the ‘analysis’ data set for every combination of subject and activity.

    tidy_data <- analysis %>% group_by(subject, Activity) %>% summarise_all(list(mean))
    tidy_data

    The group_by() function groups the rows by the columns you pass to it, and summarise_all() applies the function you pass to it (in this case mean()) to every remaining column within each group.
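
The grouping step can be checked by hand on a tiny example. The sketch below uses made-up data and the across() form of summarise(), which newer versions of dplyr (1.0 and later) offer as the replacement for summarise_all(); the column names `feature_a` and `feature_b` are hypothetical.

```r
library(dplyr)

# Made-up data: two observations per subject/activity pair
toy <- data.frame(
  subject   = c(1, 1, 2, 2),
  Activity  = c("WALKING", "WALKING", "SITTING", "SITTING"),
  feature_a = c(1, 3, 5, 7),
  feature_b = c(10, 20, 30, 50)
)

# One row per (subject, Activity) group, holding the mean of every feature
tidy_toy <- toy %>%
  group_by(subject, Activity) %>%
  summarise(across(everything(), mean), .groups = "drop")

tidy_toy
```

Here subject 1 ends up with means 2 and 15, and subject 2 with means 6 and 40, which is easy to verify by eye.
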

    This concludes the project: the raw data has been transformed into a tidy data set that can be used for later analysis.

