Best Practices in Data Cleaning

Thank you for visiting my notes. I've noticed a large uptick in views lately which is terrific! However, these notes are quite old (2017) and some details may be outdated. I have a more recent class (2023) which covers much of the same information and is hopefully more up-to-date.

If you are using these notes, I'd love to hear from you and how you are using them! Please reach out to jerrick (at) umich (dot) edu.

Introduction

“Best practices” might be a bit melodramatic, but here’s a list of things I find important during the data cleaning stage. Some of these may be old news to some of you, but hopefully there are some new ideas as well.

RMarkdown

We’ve covered RMarkdown already, but it is worth repeating: save everything you do, with good comments and/or explanations! If you find a mistake in the data later and your RMarkdown file is clean, fixing the issue and re-running the cleaning should be straightforward.

Variable names

When naming variables in a data set (see my notes on code style for R variables), try to be consistent and concise, and keep a data dictionary if needed. Far too often I see clients come in with variable names like How.often.do.you.meet.with.your.youth.in.person when something like freq.meeting would suffice!
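
Renaming is a one-liner. Here’s a quick sketch using a toy data frame with that hypothetical verbose column:

# Toy data frame with a hypothetical, overly verbose survey column
survey <- data.frame(How.often.do.you.meet.with.your.youth.in.person = c(2, 4, 1))
names(survey)[names(survey) == "How.often.do.you.meet.with.your.youth.in.person"] <- "freq.meeting"
names(survey)
## [1] "freq.meeting"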

Another common issue arises when merging two or more files that contain variables representing the same thing. I recommend using a prefix to identify which file each variable came from - in fact, I recommend this for all merged data regardless of whether any variable names are duplicated.

For example, if you have two sources - one a “Survey of Local Residents” containing aggregated responses by region, the other a “Region Information” file containing geographic and demographic data - I’d suggest naming the variables slr._____ and ri._____ for clarity.
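
Here’s a minimal sketch of that naming scheme, using made-up slr and ri data frames that share a region key:

# Made-up sources: slr = Survey of Local Residents, ri = Region Information
slr <- data.frame(region = c("A", "B"), satisfaction = c(3.2, 4.1))
ri  <- data.frame(region = c("A", "B"), population = c(12000, 8500))

# Prefix every non-key variable with its source before merging
names(slr)[-1] <- paste0("slr.", names(slr)[-1])
names(ri)[-1]  <- paste0("ri.", names(ri)[-1])

merged <- merge(slr, ri, by = "region")
names(merged)  # "region", "slr.satisfaction", "ri.population"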

Variable types

We’ve already addressed this quite a bit in the web scraping notes, mostly when we converted strings to numeric, but it is worth repeating. This issue can come up even when you are handed pre-“cleaned” data - who knows how reliable the previous cleaner was!

The first check you should run on any data is str. This will give you the quickest overview of what’s going on in the data.

This data is the Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System data set from data.gov.

d <- read.csv("https://chronicdata.cdc.gov/api/views/hn4x-zwk7/rows.csv?accessType=DOWNLOAD",
              stringsAsFactors = FALSE)
str(d)
## 'data.frame':    35168 obs. of  33 variables:
##  $ YearStart                 : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ YearEnd                   : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ LocationAbbr              : chr  "AL" "AL" "AL" "AL" ...
##  $ LocationDesc              : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Datasource                : chr  "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
##  $ Class                     : chr  "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Topic                     : chr  "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Question                  : chr  "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" ...
##  $ Data_Value_Unit           : logi  NA NA NA NA NA NA ...
##  $ Data_Value_Type           : chr  "Value" "Value" "Value" "Value" ...
##  $ Data_Value                : num  32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
##  $ Data_Value_Alt            : num  32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
##  $ Data_Value_Footnote_Symbol: chr  "" "" "" "" ...
##  $ Data_Value_Footnote       : chr  "" "" "" "" ...
##  $ Low_Confidence_Limit      : num  30.5 29.9 30 29.9 30.2 31 23.7 12.6 30.7 31.6 ...
##  $ High_Confidence_Limit     : num  33.5 34.7 33.6 37.6 35.6 36.8 29.3 20.9 40 39.6 ...
##  $ Sample_Size               : int  7304 2581 4723 1153 2402 1925 1812 356 598 865 ...
##  $ Total                     : chr  "Total" "" "" "" ...
##  $ Age.years.                : chr  "" "" "" "" ...
##  $ Education                 : chr  "" "" "" "Less than high school" ...
##  $ Gender                    : chr  "" "Male" "Female" "" ...
##  $ Income                    : chr  "" "" "" "" ...
##  $ Race.Ethnicity            : chr  "" "" "" "" ...
##  $ GeoLocation               : chr  "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
##  $ ClassID                   : chr  "OWS" "OWS" "OWS" "OWS" ...
##  $ TopicID                   : chr  "OWS1" "OWS1" "OWS1" "OWS1" ...
##  $ QuestionID                : chr  "Q036" "Q036" "Q036" "Q036" ...
##  $ DataValueTypeID           : chr  "VALUE" "VALUE" "VALUE" "VALUE" ...
##  $ LocationID                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StratificationCategory1   : chr  "Total" "Gender" "Gender" "Education" ...
##  $ Stratification1           : chr  "Total" "Male" "Female" "Less than high school" ...
##  $ StratificationCategoryId1 : chr  "OVR" "GEN" "GEN" "EDU" ...
##  $ StratificationID1         : chr  "OVERALL" "MALE" "FEMALE" "EDUHS" ...

This comes out relatively clean - nothing needs immediate attention. The numeric variables are all properly coded as numeric or integer. (Integers are obviously whole numbers, but if you were to add a decimal value to a collection of integers, R would automatically convert the vector to numeric, so for our purposes we don’t care which it is. The only time this matters is if your data is taxing the limits of your storage: storing integers as numeric takes slightly more space, so you’d want to make sure they were proper integers.)
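
A quick illustration of the integer/numeric point (exact sizes will vary slightly by R version):

# Adding a decimal value to a collection of integers silently promotes it to numeric
class(c(1L, 2L, 3L))   # "integer"
class(c(1L, 2L, 3.5))  # "numeric"

# The storage difference only matters for very large data: doubles take about
# twice the space of integers
x <- rep(1L, 1e6)
object.size(x)              # ~4 MB as integer
object.size(as.numeric(x))  # ~8 MB as numeric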

There are a few variables (e.g. Gender) which are strings but probably should be numeric (e.g. as dummy variables); we can transform these:

table(d$Gender, useNA = 'ifany')
##
##        Female   Male
##  32656   1256   1256
library(stringr)
gg <- str_replace(d$Gender, "Male", "1")
gg <- as.numeric(str_replace(gg, "Female", "0"))

gg2 <- as.numeric(d$Gender == "Male")
gg2[d$Gender == ""] <- NA

identical(gg, gg2)
## [1] TRUE
d$Male <- gg
str(d)
## 'data.frame':    35168 obs. of  34 variables:
##  $ YearStart                 : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ YearEnd                   : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ LocationAbbr              : chr  "AL" "AL" "AL" "AL" ...
##  $ LocationDesc              : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Datasource                : chr  "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
##  $ Class                     : chr  "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Topic                     : chr  "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Question                  : chr  "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" ...
##  $ Data_Value_Unit           : logi  NA NA NA NA NA NA ...
##  $ Data_Value_Type           : chr  "Value" "Value" "Value" "Value" ...
##  $ Data_Value                : num  32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
##  $ Data_Value_Alt            : num  32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
##  $ Data_Value_Footnote_Symbol: chr  "" "" "" "" ...
##  $ Data_Value_Footnote       : chr  "" "" "" "" ...
##  $ Low_Confidence_Limit      : num  30.5 29.9 30 29.9 30.2 31 23.7 12.6 30.7 31.6 ...
##  $ High_Confidence_Limit     : num  33.5 34.7 33.6 37.6 35.6 36.8 29.3 20.9 40 39.6 ...
##  $ Sample_Size               : int  7304 2581 4723 1153 2402 1925 1812 356 598 865 ...
##  $ Total                     : chr  "Total" "" "" "" ...
##  $ Age.years.                : chr  "" "" "" "" ...
##  $ Education                 : chr  "" "" "" "Less than high school" ...
##  $ Gender                    : chr  "" "Male" "Female" "" ...
##  $ Income                    : chr  "" "" "" "" ...
##  $ Race.Ethnicity            : chr  "" "" "" "" ...
##  $ GeoLocation               : chr  "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
##  $ ClassID                   : chr  "OWS" "OWS" "OWS" "OWS" ...
##  $ TopicID                   : chr  "OWS1" "OWS1" "OWS1" "OWS1" ...
##  $ QuestionID                : chr  "Q036" "Q036" "Q036" "Q036" ...
##  $ DataValueTypeID           : chr  "VALUE" "VALUE" "VALUE" "VALUE" ...
##  $ LocationID                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StratificationCategory1   : chr  "Total" "Gender" "Gender" "Education" ...
##  $ Stratification1           : chr  "Total" "Male" "Female" "Less than high school" ...
##  $ StratificationCategoryId1 : chr  "OVR" "GEN" "GEN" "EDU" ...
##  $ StratificationID1         : chr  "OVERALL" "MALE" "FEMALE" "EDUHS" ...
##  $ Male                      : num  NA 1 0 NA NA NA NA NA NA NA ...

A few things to note there:

  1. In the stringr approach, when we cast to numeric, the blanks ("") are automatically converted to NA (with a coercion warning), so we don’t need to handle them manually.
  2. In the second approach, a blank is not equal to “Male”, so the comparison returns FALSE (i.e. 0), not NA. In this case, we do need to set the missing values to NA manually.
  3. Don’t be confused by that table output: the unlabeled 32,656 count is the number of "" entries.
  4. Note that I store the results in d$Male, not d$Gender. That makes the dummy variable “is the individual male?” so that I don’t need to remember/write down that 0 = Female and 1 = Male.

stringsAsFactors

Notice that I passed stringsAsFactors = FALSE to read.csv. The default in (most) R functions is to convert strings to factors. This works great when you actually want factors (e.g. if you have a categorical variable that you need in a model), but it’s far easier to have bugs related to factors than to strings.
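
For example, here is probably the most common factor-related bug, using a made-up vector of numbers that were read in as strings:

# as.numeric() on a factor returns the internal level codes, not the values
x <- factor(c("10", "20", "30"))
as.numeric(x)                # 1 2 3 -- almost certainly not what you wanted
as.numeric(as.character(x))  # 10 20 30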

For that reason, I recommend always setting stringsAsFactors = FALSE whenever you use

  • read.csv() or read.table() or any other read.____()
  • data.frame() or as.data.frame()

Converting a string to a factor is as easy as calling as.factor. Other functions useful for dealing with factors include levels and reorder.
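
As a brief sketch using the Education column of d from above:

# Convert the cleaned string to a factor and inspect its levels
edu <- as.factor(d$Education)
levels(edu)  # note the "" blanks show up as their own level

# reorder() re-sorts the levels by another variable, here the mean Data_Value
edu2 <- reorder(edu, d$Data_Value, FUN = function(x) mean(x, na.rm = TRUE))
levels(edu2)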

Additionally, there is a package forcats by Hadley Wickham. I have not gotten into this package myself, but his work is generally high caliber, so it might be worth learning if you are just starting with factors.

Missing data

After ensuring the data are the proper types, I recommend next getting a handle on the missing data. summary will report the number of NA’s, but it will not catch missing values stored as empty strings (e.g. "") or as sentinel codes like 999 or -1.
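
If you do run into sentinel codes, recode them to NA before doing anything else. A minimal sketch with a made-up vector:

# Hypothetical variable where 999 and -1 were used to mean "missing"
x <- c(12, 999, 15, -1, 20)
x[x %in% c(999, -1)] <- NA
x
## [1] 12 NA 15 NA 20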

We can use sapply to quickly check the proportion of missing (NA or blank) values in each column:

sapply(d, function(x) {
  round(sum(is.na(x) | x == "")/nrow(d), 2)
})
##                  YearStart                    YearEnd
##                       0.00                       0.00
##               LocationAbbr               LocationDesc
##                       0.00                       0.00
##                 Datasource                      Class
##                       0.00                       0.00
##                      Topic                   Question
##                       0.00                       0.00
##            Data_Value_Unit            Data_Value_Type
##                       1.00                       0.00
##                 Data_Value             Data_Value_Alt
##                       0.09                       0.09
## Data_Value_Footnote_Symbol        Data_Value_Footnote
##                       0.91                       0.91
##       Low_Confidence_Limit      High_Confidence_Limit
##                       0.09                       0.09
##                Sample_Size                      Total
##                       0.09                       0.96
##                 Age.years.                  Education
##                       0.79                       0.86
##                     Gender                     Income
##                       0.93                       0.75
##             Race.Ethnicity                GeoLocation
##                       0.71                       0.02
##                    ClassID                    TopicID
##                       0.00                       0.00
##                 QuestionID            DataValueTypeID
##                       0.00                       0.00
##                 LocationID    StratificationCategory1
##                       0.00                       0.00
##            Stratification1  StratificationCategoryId1
##                       0.00                       0.00
##          StratificationID1                       Male
##                       0.00                       0.93

We’ll discuss multiple imputation for missing data later, but for now, here’s some tips on handling it appropriately.

By default, most R models (e.g. lm) do “complete case” analysis - any observation missing any variable in the model will be dropped. However, most regular functions don’t do this:

mean(c(1,2,NA))
## [1] NA

Instead, many of these functions take an na.rm argument (sometimes it’s na.action, or, in table’s case, useNA).

mean(c(1,2,NA), na.rm = TRUE)
## [1] 1.5

Be sure to check documentation to see what the default action is, and to see how to change it!

For manipulating data, complete.cases can be used to subset the data:

table(complete.cases(d[, 15:20]), useNA = 'ifany')
##
## FALSE  TRUE
##  3204 31964
d2 <- d[complete.cases(d[, 15:20]), 15:20]
dim(d2)
## [1] 31964     6

(I restrict to columns 15-20 as there are no complete cases across the whole data set!) There is also na.omit, which approaches the same task from the other direction, dropping the incomplete rows directly:

d3 <- na.omit(d[, 15:20])
identical(matrix(d2), matrix(d3))
## [1] TRUE

(I convert to matrix to remove attributes which differ but don’t really affect anything.)

Outlier detection

Outlier detection, in a statistical sense, is very amorphous. If you have an apparently normally distributed variable with an extreme value, is that truly an outlier? Is it evidence that the true distribution is something like Cauchy? Is your data not IID?

On the other hand, computational outlier detection is much more straightforward - we’re not trying to determine why a value is extreme nor what to do with it (yet); we’re merely trying to identify values that could cause issues later.

Basic outlier detection is straightforward - use summary or hist to look at numerical values, and use table to check strings/factors. Sorting the data can also sometimes be useful (though I can’t think of many situations where hist and table don’t suffice). Be especially on the lookout for typos/errors - is gender recorded as both “male” and “Male”? Is a percentage that should lie between 0 and 1 occasionally entered as 64 instead of .64?
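
For instance, using the BRFSS data from above (nothing is actually wrong here; this is just what the checks look like):

# Numeric checks: range and distribution of the reported percentages
summary(d$Data_Value)
hist(d$Data_Value)

# Categorical check: a table after tolower() makes "Male" vs "male" typos obvious
table(tolower(d$Gender), useNA = "ifany")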

One catch to look out for:

1 > "a"
## [1] FALSE
"a" < "b"
## [1] TRUE

If you’ve properly handled the variable types, this shouldn’t come up, but it might. Just keep it in mind!

Dates

The built-in date functionality in R is clunky at best. If you need to deal with dates, try the lubridate package, by Hadley Wickham and others.
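
A minimal sketch of lubridate’s parsing functions (the dates here are made up):

library(lubridate)

# Parse text in whatever order the components arrive, then work with real dates
date1 <- mdy("6/5/2008")     # month/day/year
date2 <- ymd("2009-05-24")   # year-month-day
year(date1)                  # 2008
date2 - date1                # a difftime, in days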

Cleaning strings

We saw a lot of this in the web scraping notes. A few useful functions, with a brief illustration after the list:

  • str_trim and str_pad: In stringr, remove or add whitespace around a string.
  • toupper and tolower: In base R, convert case.
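
A quick illustration of each:

library(stringr)

str_trim("  Ann Arbor  ")                 # "Ann Arbor"
str_pad("AA", width = 5, side = "right")  # "AA   "
toupper("Ann Arbor")                      # "ANN ARBOR"
tolower("Ann Arbor")                      # "ann arbor"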

When handling string data, there is a notion of “distance” between strings. There are various ways of measuring this; for example, the Levenshtein distance is the minimum number of single-character edits needed to change one string into another. If you have user-entered data where, in theory, there should be a limited number of possible responses but, in practice, users made a ton of typos or embellishments, you can use the package “stringdist” to visualize the issue and perhaps find a few categories (detected as clusters) to collapse to.

library(stringdist)

Say, for example, users were asked which town they live in. The poll was taken in Ann Arbor, so we expect answers of only “Ann Arbor”, “Dexter”, “Ypsilanti”, and “Saline” (pretend there aren’t other towns). The unique values of the entries might look like:

towns <- c("Ann Arbor", "AA", "Ann arbor", "AnnArbor", "Ann Abror",
           "Dexter", "dexter", "Dexter Township",
           "Ypsi", "Ypsilanti", "ypsilanti",
           "Saline", "saline", "Salene")

Now we can compute the string distances between each pair of entries and cluster them:

clust <- hclust(stringdistmatrix(towns, method = "jw"))
plot(clust, labels = towns)

This uses the Jaro distance, which I found by trial and error. See all the distances offered in help("stringdist-metrics").

It’s not always this nice (try adding “Ann Arbor Township”), but it gives us a great start at identifying clusters. If you have too many unique entries for table to be useful (but not so many that the dendrogram is unplottable), this approach can really help.
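
Once the dendrogram suggests a sensible number of groups, cutree can assign each entry to a cluster, which you can then map back to a canonical spelling (here I assume the four expected towns form four clusters):

# Cut the tree into four groups and see which raw entries land together
groups <- cutree(clust, k = 4)
split(towns, groups)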

Consistency of multi-level data

Hierarchical or multi-level data can occur when you have repeated measures (e.g. repeated measurements per individual while attributes like race don’t change) or clustered data (e.g. student data and school data). Often in these situations, we assume consistency of the higher-level data. For example, the stuatt data in the eeptools package is a synthetic set of student records (created explicitly to practice data cleaning on!).

data(stuatt, package = "eeptools")
help(stuatt, package = "eeptools")
head(stuatt)
##   sid school_year male race_ethnicity birth_date first_9th_school_year_reported hs_diploma      hs_diploma_type hs_diploma_date
## 1   1        2004    1              B      10869                           2004          0
## 2   1        2005    1              H      10869                           2004          0
## 3   1        2006    1              H      10869                           2004          0
## 4   1        2007    1              H      10869                           2004          0
## 5   2        2006    0              W      11948                             NA          1     Standard Diploma        6/5/2008
## 6   2        2007    0              B      11948                             NA          1 College Prep Diploma       5/24/2009

As you can see, each sid has data across multiple school years. Variables such as male and race_ethnicity should be the same within a given sid. Just by looking at the first individual, we can see that they aren’t.

There are several ways to check this and get a list of the offending cases. Here is just one:

tt <- tapply(stuatt$race_ethnicity, stuatt$sid,
             function(x) {
               length(unique(x))
             })
names(tt)[tt > 1]
##  [1] "1"     "2"     "3"     "117"   "118"   "119"   "122"   "124"
##  [9] "125"   "126"   "127"   "128"   "229"   "230"   "231"   "233"
## [17] "234"   "235"   "237"   "238"   "239"   "240"   "241"   "246"
## [25] "247"   "249"   "342"   "343"   "344"   "345"   "347"   "348"
## [33] "349"   "351"   "457"   "458"   "459"   "460"   "462"   "463"
## [41] "464"   "465"   "466"   "467"   "468"   "474"   "475"   "478"
## [49] "559"   "560"   "561"   "562"   "563"   "566"   "676"   "677"
## [57] "678"   "687"   "689"   "899"   "902"   "903"   "904"   "907"
## [65] "908"   "909"   "910"   "1108"  "1309"  "1313"  "1320"  "1323"
## [73] "8552"  "11382"

The tapply function applies the length(unique()) call to each collection of race_ethnicity values broken up by sid.
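
Once you’ve identified the inconsistent ids, you still have to decide how to resolve them; that is a substantive choice, but one common heuristic is to assign each sid its most frequently reported value. A sketch:

# Modal race_ethnicity per sid (ties broken arbitrarily -- a judgment call!)
modal_race <- tapply(stuatt$race_ethnicity, stuatt$sid, function(x) {
  names(sort(table(x), decreasing = TRUE))[1]
})
stuatt$race_ethnicity_clean <- modal_race[as.character(stuatt$sid)]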

Josh Errickson