RegEx in R for Data Science

The ‘regex’ family of languages and commands is used for manipulating text strings. More specifically, regular expressions are typically used for finding specific patterns of characters and replacing them with others.

Finding Regex Matches in String Vectors

The grep function takes your regex as the first argument, and the input vector as the second argument. If you pass value=FALSE or omit the value parameter then grep returns a new vector with the indexes of the elements in the input vector that could be (partially) matched by the regular expression. If you pass value=TRUE, then grep returns a vector with copies of the actual elements in the input vector that could be (partially) matched.

> grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=FALSE)
[1] 1 3 4
> grep("a+", c("abc", "def", "cba a", "aa"), perl=TRUE, value=TRUE)
[1] "abc" "cba a" "aa"

The grepl function takes the same arguments as the grep function, except for the value argument, which is not supported. grepl returns a logical vector with the same length as the input vector. Each element in the returned vector indicates whether the regex could find a match in the corresponding string element in the input vector.

> grepl("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] TRUE FALSE TRUE TRUE
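
A common use of grep and grepl is subsetting a vector. The following is a minimal sketch, reusing the sample vector from the examples above; the variable names are only illustrative. It shows that indexing by the positions from grep, indexing by the logical vector from grepl, and calling grep with value=TRUE all select the same elements.

```r
x <- c("abc", "def", "cba a", "aa")

idx  <- grep("a+", x, perl = TRUE)    # integer positions of the matching elements
keep <- grepl("a+", x, perl = TRUE)   # one TRUE/FALSE per input element

by_index   <- x[idx]                                    # subset by position
by_logical <- x[keep]                                   # subset by logical mask
by_value   <- grep("a+", x, perl = TRUE, value = TRUE)  # let grep return the strings

# The logical form is convenient for counting matches or inverting the filter
n_matching   <- sum(keep)
non_matching <- x[!keep]
```

The logical vector from grepl is usually the easier choice when the result feeds into a condition or has to line up with other columns of the same length.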

The regexpr function takes the same arguments as grepl. regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector indicates the character position in each corresponding string element in the input vector at which the (first) regex match was found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn’t match.

gregexpr is the same as regexpr, except that it finds all matches in each string. It returns a list with the same length as the input vector. Each element of the list is an integer vector, with one element for each match found in the string, indicating the character position at which that match was found. Each of these vectors also has a match.length attribute with the lengths of all matches. If no matches could be found in a particular string, the corresponding list element is still a vector, but with just one element, -1.

> regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] 1 -1 3 1
attr(,"match.length")
[1] 1 -1 1 2
> gregexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
[[1]]
[1] 1
attr(,"match.length")
[1] 1

[[2]]
[1] -1
attr(,"match.length")
[1] -1

[[3]]
[1] 3 5
attr(,"match.length")
[1] 1 1

[[4]]
[1] 1
attr(,"match.length")
[1] 2
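
To make the meaning of the positions and the match.length attribute concrete, here is a small sketch that uses them directly with substring to cut out the first match, and that counts the matches per string from the gregexpr result. The variable names are illustrative, not part of any API.

```r
x <- c("abc", "def", "cba a", "aa")

m     <- regexpr("a+", x, perl = TRUE)
start <- as.integer(m)              # start positions, attributes dropped
len   <- attr(m, "match.length")    # length of each first match, -1 if none
hit   <- start > 0                  # TRUE where a match was found

# Cut the first match out of each matching string by position and length
first_match <- substring(x[hit], start[hit], start[hit] + len[hit] - 1)

# Count matches per string: a lone -1 means "no match", so count positive positions
g         <- gregexpr("a+", x, perl = TRUE)
n_matches <- vapply(g, function(pos) sum(pos > 0), integer(1))
```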

Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the result returned by regexpr or gregexpr. If you pass the result from regexpr, then regmatches returns a character vector with all the strings that were matched. This vector is shorter than the input vector if no match was found in some of the elements. If you pass the result from gregexpr, then regmatches returns a list with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or an empty character vector, character(0), if an element had no matches.

> x <- c("abc", "def", "cba a", "aa")
> m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a" "a" "aa"
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"

[[2]]
character(0)

[[3]]
[1] "a" "a"

[[4]]
[1] "aa"
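
As a slightly more practical sketch of the same pattern, the code below pulls all runs of digits out of some made-up strings. Because the regexpr form drops non-matching elements, it also shows one possible way to keep the result aligned with the input by filling in NA. The sample data and variable names are assumptions made for illustration.

```r
lines <- c("width=10 height=20", "no numbers here", "depth=5")

# All matches per element: a list with one character vector per input string
m     <- gregexpr("[0-9]+", lines, perl = TRUE)
found <- regmatches(lines, m)

counts      <- lengths(found)               # number of matches in each string
all_numbers <- as.numeric(unlist(found))    # flatten the list and convert

# First match only, kept aligned with the input by filling non-matches with NA
r     <- regexpr("[0-9]+", lines, perl = TRUE)
hit   <- as.integer(r) > 0
first <- rep(NA_character_, length(lines))
first[hit] <- regmatches(lines, r)
```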

Replacing Regex Matches in String Vectors

The sub function has three required parameters: a string with the regular expression, a string with the replacement text, and the input vector. sub returns a new vector with the same length as the input vector. If a regex match could be found in a string element, it is replaced with the replacement text. Only the first match in each string element is replaced. If no matches could be found in some strings, those are copied into the result vector unchanged.

Use gsub instead of sub to replace all regex matches in all the string elements in your vector. Other than replacing all matches, gsub works in exactly the same way, and takes exactly the same arguments.

R uses its own replacement string syntax. Even though R 4.0.0 uses the PCRE2 regex flavor when you pass perl=TRUE, it still uses the R replacement string syntax. There is no option to use the PCRE2 replacement string syntax.

You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. You cannot use backreferences to groups 10 and beyond. If your regex has named groups, you can use numbered backreferences to the first 9 groups. There is no replacement text token for the overall match. Place the entire regex in a capturing group and then use \1 to insert the whole regex match.

> sub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc" "def" "cbzaz a" "zaaz"
> gsub("(a+)", "z\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zazbc" "def" "cbzaz zaz" "zaaz"
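
Backreferences are not limited to a single group. The sketch below uses made-up date strings to reorder three capturing groups, and wraps every run of digits by placing the whole regex in one group, as described above. Strings that do not match are returned unchanged.

```r
dates   <- c("2024-01-15", "not a date", "1999-12-31")
pattern <- "^([0-9]{4})-([0-9]{2})-([0-9]{2})$"

# Reinsert the groups in a different order: day/month/year
reordered <- sub(pattern, "\\3/\\2/\\1", dates, perl = TRUE)

# Whole-match replacement: put the entire regex in group 1 and reuse it
wrapped <- gsub("([0-9]+)", "<\\1>", "a1b22", perl = TRUE)
```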

When you pass perl=TRUE, you can use \U and \L to change the text inserted by all following backreferences to uppercase or lowercase. You can use \E to insert the following backreferences without any change of case. These escapes do not affect literal text.

> sub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc" "def" "cbzAz a" "zAAz"
> gsub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc" "def" "cbzAz zAz" "zAAz"
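
The case escapes can be combined within one replacement string. As a sketch, the first call below capitalizes each word by uppercasing the first group and lowercasing the second, and the second call uses \E so that the second group is inserted unchanged. The input strings are made up for illustration, and perl=TRUE is required for these escapes.

```r
txt <- "hELLO wORLD"

# \U applies to group 1 (first letter), \L to group 2 (rest of the word)
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl = TRUE)

# \E ends the case conversion, so group 2 keeps its original case
partial <- sub("(\\w+)-(\\w+)", "\\U\\1\\E-\\2", "abc-Def", perl = TRUE)
```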

A very powerful way of making replacements is to assign new values to regmatches when you call it on the result of gregexpr. The value you assign should be a list with as many elements as the original input vector. Each element should be a character vector with as many strings as there are matches in the corresponding input element. The original input vector is then modified in place, with all the regex matches replaced by the text from the list.

> x <- c("abc", "def", "cba a", "aa")
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m) <- list(c("one"), character(0), c("two", "three"), c("four"))
> x
[1] "onebc" "def" "cbtwo three" "four"
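
The replacement list does not have to be typed out by hand. It can be computed from the matches themselves, which is something sub and gsub cannot do with a plain replacement string. A minimal sketch, assuming made-up input strings, that doubles every number found:

```r
x <- c("2 apples and 3 pears", "no fruit", "10 plums")

m     <- gregexpr("[0-9]+", x, perl = TRUE)
found <- regmatches(x, m)    # list of character vectors, one per string

# Compute one replacement per match; elements with no matches stay character(0)
doubled <- lapply(found, function(s) as.character(as.numeric(s) * 2))

# Assign the computed replacements back into x
regmatches(x, m) <- doubled
```

Because lapply keeps the list structure, the replacement list automatically has the right number of strings for each input element.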

Regular expressions can conveniently be created using rex::rex().


