String Manipulation

Introduction

Before we begin discussing web scraping, it will be useful to have a baseline level of knowledge of string manipulation. The majority of this document will refer to string manipulation in R, but the discussion of regular expressions is more generally applicable.

This section first covers the stringr package in trivial applications, then introduces regular expressions and shows how they greatly extend what the stringr functions can do.

Note: A major complication with strings in any programming language is region- or locale-specific distinctions. This document assumes all strings are in an English (US) locale. The stringr package discussed below attempts to be locale-agnostic; if you work with text in different locales, some of these functions take a locale argument that you can tweak.

stringr

The stringr package extends R’s basic string handling capabilities, specifically in searching strings for given words. We will quickly give examples of various functions.

library(stringr)

Note that in these examples, I refer to a “vector” of strings. Recall that a single string (or numeric) can be considered a vector of length 1.

Basics

str_count

str_count(c("abc", "", NA))
## [1]  3  0 NA

Given a vector of strings, returns the number of characters in each string. Some characters in strings are “escaped” characters, either whitespace characters (for example, "\t" is a tab) or characters that have special meaning (for example "\""). Although each is written with two characters, it represents a single character and is counted as one:

str_count("\"")
## [1] 1
writeLines("\"")
## "

str_sort and str_order

str_sort(c("Matt", "Alice", "Frank", "Wayne"))
## [1] "Alice" "Frank" "Matt"  "Wayne"
str_order(c("Matt", "Alice", "Frank", "Wayne"))
## [1] 2 3 1 4

Given a vector of strings, str_sort sorts them alphabetically. str_order instead returns the positions that would put the vector in sorted order, which can be useful programmatically.
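As a small sketch of why str_order can be useful programmatically, the positions it returns can reorder a second vector so that it stays aligned with the sorted names. The ages vector is made up for illustration and does not come from the text.

```r
library(stringr)

people <- c("Matt", "Alice", "Frank", "Wayne")
ages   <- c(31, 25, 47, 52)   # hypothetical values, one per person

idx <- str_order(people)      # positions that sort `people`: 2 3 1 4

sorted_people <- people[idx]  # same result as str_sort(people)
sorted_ages   <- ages[idx]    # ages reordered to match: 25 47 31 52
```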

str_sub

str_sub("abcde", 2, 4)
## [1] "bcd"
str_sub("abcde", 7, 10)
## [1] ""
str_sub("abcde", 4, 10)
## [1] "de"

Given a vector of strings, extract the substrings beginning at the position given by the 2nd argument and ending at the position given by the 3rd. If the start is past the end of a given string, return "". If the end is past the end of the string, return only up to the end.
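str_sub also accepts negative positions, which count backwards from the end of the string. A brief sketch, with variable names chosen only for this illustration:

```r
library(stringr)

last_two        <- str_sub("abcde", -2, -1)  # "de"
from_third_last <- str_sub("abcde", -3)      # "cde"; end defaults to the last character

# The function is vectorised over the input strings.
prefixes <- str_sub(c("apple", "kiwi"), 1, 3)  # "app" "kiw"
```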

See also str_sub (again).

Searching

lorem <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
            "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.",
            "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.",
            "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.")

str_detect

str_detect(lorem, "dolor")
## [1]  TRUE FALSE  TRUE FALSE

Given a vector of strings, return a vector indicating whether the search string is found within each.

str_count

str_count(lorem, "dolor")
## [1] 2 0 2 0

Similar to str_detect except it counts the number of occurrences.

str_locate and str_locate_all

str_locate(lorem, "dolor")
##      start end
## [1,]    13  17
## [2,]    NA  NA
## [3,]    17  21
## [4,]    NA  NA
str_locate_all(lorem, "dolor")
## [[1]]
##      start end
## [1,]    13  17
## [2,]   104 108
##
## [[2]]
##      start end
##
## [[3]]
##      start end
## [1,]    17  21
## [2,]    71  75
##
## [[4]]
##      start end

Given a vector of strings, return the starting and ending positions of the first match (str_locate) or of every match (str_locate_all) in each string.
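The positions returned by str_locate can be fed back into str_sub to pull out the matched text. This is only a sketch on a shortened copy of the first lorem string:

```r
library(stringr)

x   <- "Lorem ipsum dolor sit amet"
loc <- str_locate(x, "dolor")   # one-row matrix with columns "start" and "end"

# Use the located positions as the start and end of the substring.
found <- str_sub(x, loc[1, "start"], loc[1, "end"])  # "dolor"
```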

str_extract and str_extract_all

str_extract(lorem, "dolor")
## [1] "dolor" NA      "dolor" NA
str_extract_all(lorem, "dolor")
## [[1]]
## [1] "dolor" "dolor"
##
## [[2]]
## character(0)
##
## [[3]]
## [1] "dolor" "dolor"
##
## [[4]]
## character(0)

Similar to str_locate and str_locate_all, but it instead returns the matched text. (See regular expressions below for why this function isn’t useless and in fact is probably the most useful of all.)

str_which and str_subset

str_which(lorem, "dolor")
## [1] 1 3
str_subset(lorem, "dolor")
## [1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
## [2] "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur."

Given a vector of strings, extract either the position (str_which) or entire string (str_subset) which match the expression.

Modification

str_split and str_split_fixed

s <- c("1997_John_M", "2007_Mary_F")
str_split(s, "_")
## [[1]]
## [1] "1997" "John" "M"
##
## [[2]]
## [1] "2007" "Mary" "F"
str_split_fixed(s, "_", 2)
##      [,1]   [,2]
## [1,] "1997" "John_M"
## [2,] "2007" "Mary_F"

Given a vector of strings and a string (usually a single character) to split on, return the split strings. str_split_fixed is extremely useful if you have a vector of strings which should be multiple variables, you know exactly how many variables there are, and each string follows the same pattern. In the example above, if I want to keep first name and middle initial together, I split into only two pieces.
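As a sketch of the “multiple variables” use, the matrix returned by str_split_fixed can be turned into a data frame. The column names below are assumptions made for this illustration.

```r
library(stringr)

s     <- c("1997_John_M", "2007_Mary_F")
parts <- str_split_fixed(s, "_", 3)   # 2 x 3 character matrix

records <- data.frame(
  year = as.integer(parts[, 1]),
  name = parts[, 2],
  code = parts[, 3],                  # hypothetical name for the last field
  stringsAsFactors = FALSE
)
```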

str_c

str_c("a", "b")
## [1] "ab"
str_c("a", "b", sep = ".")
## [1] "a.b"
str_c(c("a", "b"), c("d", "e"))
## [1] "ad" "be"
str_c(c("a", "b"), "d", sep = "-", collapse = ".")
## [1] "a-d.b-d"

Given either a collection of strings or a collection of vectors of strings, concatenate the strings together into one string. sep controls the character appearing between the items in the collection (default is "") and collapse, if not NULL, collapses the concatenated vectors as well with the given string.
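A common use of collapse on its own is to flatten a single vector into one string. A minimal sketch:

```r
library(stringr)

# collapse joins the elements of one vector into a single string.
joined <- str_c(c("a", "b", "c"), collapse = ", ")      # "a, b, c"

# Length-1 arguments are recycled against the longer vector.
labels <- str_c("file", c("1", "2", "3"), ".txt")       # "file1.txt" "file2.txt" "file3.txt"
```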

str_pad and str_trim

str_pad("a", 4)
## [1] "   a"
str_trim("   a   b   ")
## [1] "a   b"

Given a vector of strings, either adds (str_pad) or removes (str_trim) whitespace. The side argument to either controls which side(s) it operates on.
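The side argument, and the pad argument of str_pad, can be sketched as follows:

```r
library(stringr)

right_padded <- str_pad("a", 4, side = "right")         # "a   "
zero_padded  <- str_pad("7", 3, pad = "0")              # "007"; pad with a non-space character
left_trimmed <- str_trim("   a   b   ", side = "left")  # "a   b   "
```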

str_replace and str_replace_all

str_replace(lorem[1], "dolor", "ABRACADABRA")
## [1] "Lorem ipsum ABRACADABRA sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
str_replace_all(lorem[3], "dolor", "Oops")
## [1] "Duis aute irure Oops in reprehenderit in voluptate velit esse cillum Oopse eu fugiat nulla pariatur."

Given a vector of strings, replace the first or all occurrences of the pattern with the given string. The replacement string does not need to be the same length as the search string.

str_sub (again)

s <- "abcde"
substr(s, 2, 4) <- "A"
s
## [1] "aAcde"
substr(s, 2, 4) <- "ABCDEFGHI"
s
## [1] "aABCe"

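In addition to extracting substrings, the assignment form can be used for replacement within a string. Note that the example above uses base R’s substr, which overwrites characters in place and never changes the length of the string. stringr’s str_sub can also be assigned to, but it replaces the whole range with the new text, so the length of the string may change. To replace by pattern rather than by position, use str_replace.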
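For comparison with the base R substr assignments above, this sketch shows stringr’s own assignment form of str_sub. It replaces the entire range from start to end, so the string may shrink or grow.

```r
library(stringr)

s <- "abcde"
str_sub(s, 2, 4) <- "A"          # positions 2-4 ("bcd") are replaced by "A"
shrunk <- s                      # "aAe"

s <- "abcde"
str_sub(s, 2, 4) <- "ABCDEFGHI"  # the string grows to fit the replacement
grown <- s                       # "aABCDEFGHIe"
```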

Regular Expressions

The point of introducing these fairly basic string manipulation functions is to build up to using regular expressions. Regular expressions are a way to search for text in a more advanced fashion than simple letter detection. For a simple example, if we wanted to extract all capitalized words from a document:

[A-Z][a-z]+

We’ve been using the most basic forms of regular expressions already. Any search pattern above was really a regular expression, just the simplest form of text to match.
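The capitalized-word pattern [A-Z][a-z]+ can be tried with str_extract_all. The sentence is made up for illustration.

```r
library(stringr)

sentence <- "The quick Brown fox met Alice"   # hypothetical input

# One uppercase letter followed by one or more lowercase letters.
caps <- str_extract_all(sentence, "[A-Z][a-z]+")[[1]]
# caps is c("The", "Brown", "Alice")
```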

Note: Regular expressions exist outside of R (most languages implement them, many search fields support them, the Unix command line supports them, etc.). There are different implementations of regular expressions. Base R supports both POSIX (the default) and Perl-like (by passing perl = TRUE) expressions. Note that stringr, which we use here, is built on the stringi package and uses the ICU regular expression engine, which is similar but not identical to either. Other languages will generally support one or more of these flavors, but may either not fully support them or extend them. All this is to say: while most of these skills will carry over to other uses of regular expressions, be careful of language-specific implementation details.

Special characters

Before we dive into the ways to match more flexibly than exact strings, it helps to be aware of the 12 special characters which cannot be used directly in a search string. These characters are . \ | ( ) [ { ^ $ * + ?. For example:

str_detect("function(s)", "(")
## Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)): Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

There are two ways around this. First, you can tell the function that you are actually using a fixed string and not a regular expression. In stringr, you can pass the search pattern into the fixed function:

str_detect("function(s)", fixed("("))
## [1] TRUE

If you have a special character inside a more complex regular expression, doing this will obviously not work. The second option (see footnote 1) is to escape the characters using \\:

str_detect("function(s)", "\\(")
## [1] TRUE

Escaping special characters

Sometimes you need to escape special characters with the double backslash, e.g. \\(, and sometimes only a single backslash, e.g. \".

The difference is in R specific special characters versus regular expressions special characters.

R-specific special characters are those which cannot be typed directly into a string literal. They include things such as \n and \t for newline and tab, or more mundane things such as \" or \', which could otherwise be read as the end of the string.

a <- '"'
a
## [1] "\""
cat(a)
## "

When referencing these in regular expressions, we need to refer to the entire special character, which is e.g. \". Hence,

str_extract("a\"b", "\"")
## [1] "\""

The problem child is that \ needs to be escaped as well:

str_count("\\")
## [1] 1

Now consider the regular-expression-specific characters, which need the double backslash. First R processes the \\ into \, and then the regular expression engine sees the escaped character. In other words, \\? gets processed by R into \?, which the regular expression engine then treats as a literal ? instead of a special character.
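The two layers of processing can be observed directly. In this sketch, writeLines shows what the regular expression engine actually receives after R has handled its own escapes.

```r
library(stringr)

pattern <- "\\?"
n_chars <- str_length(pattern)   # 2: one backslash and one question mark
writeLines(pattern)              # prints \?  (the text the regex engine sees)

has_qmark <- str_detect(c("why?", "why"), pattern)   # TRUE FALSE

# A literal backslash needs four in the pattern: R turns "\\\\" into \\,
# which the regex engine reads as one escaped backslash.
has_backslash <- str_detect("C:\\temp", "\\\\")      # TRUE
```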

Square Brackets

The simplest modification to the basic string search is allowing flexibility in terms of a single character. We can include characters inside square brackets, [ and ], and the expression will match exactly one of the characters listed there. For example, let’s say we want to find all references to someone named Brian, but are concerned that people may be misspelling it as Brien.

str_extract(c("Brian", "Brien", "Brin"), "Bri[ae]n")
## [1] "Brian" "Brien" NA

What if we want to match many characters? The - can be used to represent a range such as a-e or A-Z. Notice the case sensitivity here. You can also combine ranges, as in [a-eA-Z].

str_extract(c("Brian", "Brien", "BriTn", "Bri8n", "Briaan", "Bri&n"), "Bri[a-zA-Z0-9]n")
## [1] "Brian" "Brien" "BriTn" "Bri8n" NA      NA

Note this doesn’t catch the double “aa” or the &.

We can also negate an entire bracket by making the first character inside the brackets ^.

str_extract(c("Brian", "Brien", "Brion"), "Bri[^ae]n")
## [1] NA      NA      "Brion"

Note that “not-ing” anything besides a bracket may not work. Use R’s built in ! in that case, e.g.

!str_detect(c("Brian", "Brien", "Brion"), "Brian")
## [1] FALSE  TRUE  TRUE

Predefined classes

There are a few special classes. You can see a full list in help("regex", package = "base") but a few useful ones include:

  • [:lower:] and [:upper:]: Equivalent to [a-z] and [A-Z] respectively.
  • [:alnum:]: Equivalent to [a-zA-Z0-9].
  • [:punct:]: Matches punctuation characters (see that help for the full list).
  • [:space:]: Matches whitespace such as spaces, tabs, newlines, etc.
  • [:print:]: Everything printable.

str_extract(c("Brian", "Brien", "BriTn", "Bri8n", "Briaan", "Bri&n"), "Bri[:lower:]n")
## [1] "Brian" "Brien" NA      NA      NA      NA
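A few of the other classes in use. This sketch writes each class inside an outer pair of brackets, e.g. [[:punct:]], which is the form base R functions such as grepl require and which stringr accepts as well.

```r
library(stringr)

# Remove punctuation.
no_punct <- str_replace_all("Hello, world!", "[[:punct:]]", "")   # "Hello world"

# Count whitespace characters (one space and one tab here).
n_space <- str_count("a b\tc", "[[:space:]]")                      # 2

# Base R needs the class inside a bracket expression.
base_match <- grepl("Bri[[:lower:]]n", c("Brian", "BriTn"))        # TRUE FALSE
```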

Start and End of String

To match the beginning or end of a string, use ^ and $ respectively. Note that this use of ^ is different from negation: here ^ is the first character of the regular expression rather than the first character inside a bracket.

str_extract(c("moose", "a moose", "moosewood"), "^moose")
## [1] "moose" NA      "moose"
str_extract(c("moose", "a moose", "moosewood"), "[^moose]")
## [1] NA  "a" "w"

Here we extract the first character which is not “m”, “o”, “s” or “e”. The “w” in the third string is not at the beginning of the string.

str_extract(c("moose", "a moose", "moosewood"), "^[^moose]")
## [1] NA  "a" NA

Here we use ^ in both senses: the pattern matches only if the very first character of the string is not one of “m”, “o”, “s” or “e”.

str_extract(c("moose", "a moose", "moosewood"), "moose$")
## [1] "moose" "moose" NA
str_extract(c("moose", "a moose", "moosewood"), "^moose$")
## [1] "moose" NA      NA

Using both ensures that the entire string is matched.

Repeated matches

By default, any matched character (whether individual character or a bracketed set of possible matches) matches exactly once. We can modify this by attaching a modifier after it.

  • ?: Match 0 or 1 times.
  • *: Match 0 or more times.
  • +: Match 1 or more times.
  • {n}: Match exactly n times.
  • {n,}: Match at least n times.
  • {n,m}: Match at least n and at most m times.

s <- c("ct", "cat", "caat", "caaat", "caaaaaaaaat")
str_extract(s, "ca?t")
## [1] "ct"  "cat" NA    NA    NA
str_extract(s, "ca*t")
## [1] "ct"          "cat"         "caat"        "caaat"       "caaaaaaaaat"
str_extract(s, "ca+t")
## [1] NA            "cat"         "caat"        "caaat"       "caaaaaaaaat"
str_extract(s, "ca{3}t")
## [1] NA      NA      NA      "caaat" NA
str_extract(s, "ca{2,4}t")
## [1] NA      NA      "caat"  "caaat" NA
str_extract(c("caat", "cabt", "cact"), "c[ab]+t")
## [1] "caat" "cabt" NA
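The open-ended form {n,} is not shown above. A short sketch, which also illustrates that these modifiers match as many repetitions as they can:

```r
library(stringr)

s <- c("ct", "cat", "caat", "caaat", "caaaaaaaaat")

# At least three "a"s between "c" and "t".
at_least_three <- str_extract(s, "ca{3,}t")
# NA NA NA "caaat" "caaaaaaaaat"

# Repetition is greedy: a+ takes the whole run of "a"s.
greedy <- str_extract("caaat", "a+")   # "aaa"
```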

“or”

The | special character represents an “or”. It can be used in place of brackets:

str_extract_all("abcde", "[bc]")
## [[1]]
## [1] "b" "c"
str_extract_all("abcde", "b|c")
## [[1]]
## [1] "b" "c"

This use is trivial. It gets more interesting when we “or” groups, which are enclosed in parentheses ( and ). These are not brackets; on their own they do not change what is matched, so the two expressions “abcd” and “(ab)(cd)” match identical text. However, we can do something like

s <- c("hotdog", "catdog", "hotcat")
str_extract(s, "(hot|cat)dog")
## [1] "hotdog" "catdog" NA

Without the parentheses, the pattern hot|catdog would match either hot or catdog, because | applies to everything on each side of it. With the parentheses, the “or” is restricted to hot vs cat.
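The effect of the parentheses can be checked by running both forms of the pattern on the same strings:

```r
library(stringr)

s <- c("hotdog", "catdog", "hotcat")

# No grouping: the alternatives are "hot" and "catdog".
without_group <- str_extract(s, "hot|catdog")    # "hot" "catdog" "hot"

# Grouping: the alternatives are "hot" and "cat", each followed by "dog".
with_group <- str_extract(s, "(hot|cat)dog")     # "hotdog" "catdog" NA
```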

Real examples

Here are a few real examples of how this string manipulation might be used.

Example 1

I had a list of files over a number of years. The files were not yet processed so the data within them was raw, but I wanted to only concern myself with files from 2003 or more recent.

files <- c("file1999.txt", "file2000.txt", "file2001.txt", "file2002.txt", "file2003.txt",
           "file2009.txt", "file2010.txt", "file2011.txt")
str_subset(files, "(200[3-9]{1}|201[0-9]{1})")
## [1] "file2003.txt" "file2009.txt" "file2010.txt" "file2011.txt"
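An alternative sketch, not the approach used above: extract the four-digit year, convert it to a number, and compare numerically. This avoids enumerating decades in the pattern, at the cost of assuming every file name contains exactly one four-digit year.

```r
library(stringr)

files <- c("file1999.txt", "file2000.txt", "file2001.txt", "file2002.txt", "file2003.txt",
           "file2009.txt", "file2010.txt", "file2011.txt")

# Pull out the first run of four digits and treat it as the year.
years  <- as.integer(str_extract(files, "[0-9]{4}"))
recent <- files[years >= 2003]
# "file2003.txt" "file2009.txt" "file2010.txt" "file2011.txt"
```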

Example 2

I was debugging an intermittent disconnection from a remote server. The R code connected to the server to access the data, but every once in a while that connection would drop. We’d ruled out network issues, so it had to be an issue with the R code. To make a long story short, I needed to look through a log file that was exceptionally long and cluttered. The useful lines would refer to the server (the actual address it connected to was ns##.server.net; see footnote 2), would indicate that a change in connection was taking place (“connection”) but not an initialization (“connection init”), and would report a numeric error within a particular range (the errors were printed “E##”; errors E17 and below were really warnings, so I needed errors E18 and above). My goal was to extract the errors that were occurring.

The readLines function in R reads a file and creates a vector of strings where each entry is a single line.

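# logfile is assumed to hold the path to the log file
r <- readLines(logfile)
# keep lines that mention the server
r <- str_subset(r, "ns[0-9]{2}\\.server\\.net")
# keep lines about a connection, but drop initializations
r <- str_subset(r, "connection")
r <- r[!str_detect(r, "init")]
# keep errors E18 through E99
r <- str_subset(r, "E([2-9]{1}[0-9]|1[89]{1})")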
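To see the filtering steps run end to end without the real log, here is a sketch on a handful of invented lines. The log format is an assumption, since the actual file is not shown.

```r
library(stringr)

# Hypothetical log lines; the real log's layout is not given in the text.
log_lines <- c(
  "ns01.server.net connection init E25",
  "ns01.server.net connection dropped E12",
  "ns02.server.net connection dropped E18",
  "ns02.server.net connection reset E42",
  "other.host.org connection dropped E30"
)

r <- str_subset(log_lines, "ns[0-9]{2}\\.server\\.net")  # drops other.host.org
r <- str_subset(r, "connection")
r <- r[!str_detect(r, "init")]                           # drops the init line
r <- str_subset(r, "E([2-9][0-9]|1[89])")                # drops E12 (a warning)

errors <- str_extract(r, "E[0-9]{2}")                    # "E18" "E42"
```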

Example 3

Quick functions to check whether a word is all lowercase or all UPPERCASE.

isupper <- function(x) {
  str_detect(x, "^[:upper:]+$")
}
islower <- function(x) {
  str_detect(x, "^[:lower:]+$")
}
abc <- c("ABC", "abc", "ABc")
islower(abc)
## [1] FALSE  TRUE FALSE
isupper(abc)
## [1]  TRUE FALSE FALSE

Note that these rely on two special characters introduced earlier, ^ (see footnote 3) and $, which refer to the start and end of a string. So [ab] refers to a match of “a” or “b” anywhere in a string, whereas ^[ab] means that “a” or “b” must be the first character, and [ab]$ means one of them must be the last character.


  1. The fixed() helper is exclusive to stringr functions. If you were to use base R functions such as gsub or grep instead of the stringr versions, you would either escape the character as shown here or pass the argument fixed = TRUE.

  2. server.net is a placeholder; the real address has been removed for privacy.

  3. This is distinct from the negation of a bracketed expression.

Josh Errickson