Best Practices in Data Cleaning
Introduction
“Best practices” might be a bit melodramatic, but here’s a list of things I find important during the data cleaning stages. Some of these may be old news for some of you, but hopefully there are some new ideas as well.
RMarkdown
We’ve covered RMarkdown already, but it’s worth repeating - save everything you do, with good comments and/or explanations! If you find a mistake in the data later and your RMarkdown file is clean, fixing the issue and re-running the cleaning should be straightforward.
Variable names
Referring to variables in a data set (see my notes on code style for R variables): try to be consistent and concise, and keep a data dictionary if needed. Far too often I see clients come in with variable names like `How.often.do.you.meet.with.your.youth.in.person` when something like `freq.meeting` would suffice!
Another common issue arises when merging two or more files that contain variables representing the same thing. I recommend using a prefix to identify which file each variable came from - in fact, I recommend this for all merged data situations, regardless of duplicate variable names.
For example, if you have two sources, one “Survey of Local Residents”, containing aggregated responses by region, and “Region Information”, containing geographic and demographic results, I’d suggest naming variables `slr._____` and `ri._____` for clarity.
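A minimal sketch of this prefixing, using base R’s `merge` on hypothetical data (the column names here are made up for illustration):

```r
# Hypothetical aggregated survey results and region information.
slr <- data.frame(region = c("A", "B"), score = c(3.2, 4.1))
ri  <- data.frame(region = c("A", "B"), population = c(1000, 2500))

# Prefix every variable except the merge key with its source.
names(slr)[-1] <- paste0("slr.", names(slr)[-1])
names(ri)[-1]  <- paste0("ri.", names(ri)[-1])

merged <- merge(slr, ri, by = "region")
names(merged)  # "region" "slr.score" "ri.population"
```

Now it’s obvious at a glance which file any variable in `merged` came from.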
Variable types
We’ve already addressed this quite a bit in scraping, mostly when we converted strings to numeric, but it is worth repeating. This issue can come up even when you are handed pre-“cleaned” data, as who knows how reliable the previous cleaner was!
The first check you should run on any data is `str`. This will give you the quickest overview of what’s going on in the data.
This data is the Nutrition, Physical Activity, and Obesity - Behavioral Risk Factor Surveillance System data set from data.gov:
d <- read.csv("https://chronicdata.cdc.gov/api/views/hn4x-zwk7/rows.csv?accessType=DOWNLOAD",
stringsAsFactors = FALSE)
str(d)
## 'data.frame': 35168 obs. of 33 variables:
## $ YearStart : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ YearEnd : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ LocationAbbr : chr "AL" "AL" "AL" "AL" ...
## $ LocationDesc : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Datasource : chr "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
## $ Class : chr "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Topic : chr "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Question : chr "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" ...
## $ Data_Value_Unit : logi NA NA NA NA NA NA ...
## $ Data_Value_Type : chr "Value" "Value" "Value" "Value" ...
## $ Data_Value : num 32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
## $ Data_Value_Alt : num 32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
## $ Data_Value_Footnote_Symbol: chr "" "" "" "" ...
## $ Data_Value_Footnote : chr "" "" "" "" ...
## $ Low_Confidence_Limit : num 30.5 29.9 30 29.9 30.2 31 23.7 12.6 30.7 31.6 ...
## $ High_Confidence_Limit : num 33.5 34.7 33.6 37.6 35.6 36.8 29.3 20.9 40 39.6 ...
## $ Sample_Size : int 7304 2581 4723 1153 2402 1925 1812 356 598 865 ...
## $ Total : chr "Total" "" "" "" ...
## $ Age.years. : chr "" "" "" "" ...
## $ Education : chr "" "" "" "Less than high school" ...
## $ Gender : chr "" "Male" "Female" "" ...
## $ Income : chr "" "" "" "" ...
## $ Race.Ethnicity : chr "" "" "" "" ...
## $ GeoLocation : chr "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
## $ ClassID : chr "OWS" "OWS" "OWS" "OWS" ...
## $ TopicID : chr "OWS1" "OWS1" "OWS1" "OWS1" ...
## $ QuestionID : chr "Q036" "Q036" "Q036" "Q036" ...
## $ DataValueTypeID : chr "VALUE" "VALUE" "VALUE" "VALUE" ...
## $ LocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StratificationCategory1 : chr "Total" "Gender" "Gender" "Education" ...
## $ Stratification1 : chr "Total" "Male" "Female" "Less than high school" ...
## $ StratificationCategoryId1 : chr "OVR" "GEN" "GEN" "EDU" ...
## $ StratificationID1 : chr "OVERALL" "MALE" "FEMALE" "EDUHS" ...
This comes out relatively clean - nothing needs immediate attention. The numeric variables are all properly coded as `numeric` or `integer`. (`integer`s are obviously whole numbers, but if you were to add a decimal to a collection of `integer`s, R would convert it automatically to `numeric`, so for our purposes we don’t care which it is. The only time this would matter is if you had data that was taxing the limits of your storage: storing `integer`s as `numeric` takes slightly more space, so you’d want to make sure they were proper `integer`s.)
There are a few variables (e.g. `Gender`) which are strings but probably should be numeric; we can transform these:
table(d$Gender, useNA = 'ifany')
##
## Female Male
## 32656 1256 1256
library(stringr)
gg <- str_replace(d$Gender, "Male", "1")
gg <- as.numeric(str_replace(gg, "Female", "0"))
gg2 <- as.numeric(d$Gender == "Male")
gg2[d$Gender == ""] <- NA
identical(gg, gg2)
## [1] TRUE
d$Male <- gg
str(d)
## 'data.frame': 35168 obs. of 34 variables:
## $ YearStart : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ YearEnd : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ LocationAbbr : chr "AL" "AL" "AL" "AL" ...
## $ LocationDesc : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Datasource : chr "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
## $ Class : chr "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Topic : chr "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
## $ Question : chr "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" "Percent of adults aged 18 years and older who are obese" ...
## $ Data_Value_Unit : logi NA NA NA NA NA NA ...
## $ Data_Value_Type : chr "Value" "Value" "Value" "Value" ...
## $ Data_Value : num 32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
## $ Data_Value_Alt : num 32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
## $ Data_Value_Footnote_Symbol: chr "" "" "" "" ...
## $ Data_Value_Footnote : chr "" "" "" "" ...
## $ Low_Confidence_Limit : num 30.5 29.9 30 29.9 30.2 31 23.7 12.6 30.7 31.6 ...
## $ High_Confidence_Limit : num 33.5 34.7 33.6 37.6 35.6 36.8 29.3 20.9 40 39.6 ...
## $ Sample_Size : int 7304 2581 4723 1153 2402 1925 1812 356 598 865 ...
## $ Total : chr "Total" "" "" "" ...
## $ Age.years. : chr "" "" "" "" ...
## $ Education : chr "" "" "" "Less than high school" ...
## $ Gender : chr "" "Male" "Female" "" ...
## $ Income : chr "" "" "" "" ...
## $ Race.Ethnicity : chr "" "" "" "" ...
## $ GeoLocation : chr "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
## $ ClassID : chr "OWS" "OWS" "OWS" "OWS" ...
## $ TopicID : chr "OWS1" "OWS1" "OWS1" "OWS1" ...
## $ QuestionID : chr "Q036" "Q036" "Q036" "Q036" ...
## $ DataValueTypeID : chr "VALUE" "VALUE" "VALUE" "VALUE" ...
## $ LocationID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StratificationCategory1 : chr "Total" "Gender" "Gender" "Education" ...
## $ Stratification1 : chr "Total" "Male" "Female" "Less than high school" ...
## $ StratificationCategoryId1 : chr "OVR" "GEN" "GEN" "EDU" ...
## $ StratificationID1 : chr "OVERALL" "MALE" "FEMALE" "EDUHS" ...
## $ Male : num NA 1 0 NA NA NA NA NA NA NA ...
A few things to note there:

- In the `stringr` approach, when we cast to numeric, the blanks (`"^$"`) are automatically converted to NA, so we don’t need to handle them manually.
- In the second approach, blanks are not equal to “Male”, so the equality is FALSE. So in that case, we do need to manually enter the missing data.
- Don’t get confused by that `table` output - the 32,656 value refers to how many `""` entries there are.
- Note that I store the results in `d$Male`, not `d$Gender`. That makes the dummy variable “is the individual male?” so that I don’t need to remember/write down that 0 = Female and 1 = Male.
stringsAsFactors
Notice that I passed `stringsAsFactors = FALSE` to `read.csv`. The default in (most) R functions is to convert strings to factors¹. This works great when you actually want factors (e.g. if you have a categorical variable that you need in a model), but it’s far easier to have bugs related to factors than to strings.
For that reason, I recommend always passing `stringsAsFactors = FALSE` whenever you use

- `read.csv()`, `read.table()`, or any other `read.____()` function
- `data.frame()` or `as.data.frame()`
Converting a string to a factor is as easy as `as.factor`. Other functions useful for dealing with factors include `levels` and `reorder`.
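A quick sketch of these on made-up data - note that `as.factor` orders levels alphabetically, which is often not what you want for ordinal categories:

```r
x <- c("low", "high", "medium", "high")

f <- as.factor(x)
levels(f)  # "high" "low" "medium" (alphabetical)

# Impose a sensible ordering instead:
f <- factor(x, levels = c("low", "medium", "high"))
levels(f)  # "low" "medium" "high"
```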
Additionally, there is a package forcats by Hadley Wickham. I have not gotten into this package myself, but his work is generally high caliber, so it might be worth learning if you are just starting with factors.
Missing data
After ensuring the data is the proper type, I recommend next getting a handle on the missing data. `summary` will report the number of NAs, but will not handle either blank characters (e.g. `""`) or people who use values like `999` or `-1` for missing values. We can use `sapply` to quickly check:
sapply(d, function(x) {
round(sum(is.na(x) | x == "")/nrow(d), 2)
})
## YearStart YearEnd
## 0.00 0.00
## LocationAbbr LocationDesc
## 0.00 0.00
## Datasource Class
## 0.00 0.00
## Topic Question
## 0.00 0.00
## Data_Value_Unit Data_Value_Type
## 1.00 0.00
## Data_Value Data_Value_Alt
## 0.09 0.09
## Data_Value_Footnote_Symbol Data_Value_Footnote
## 0.91 0.91
## Low_Confidence_Limit High_Confidence_Limit
## 0.09 0.09
## Sample_Size Total
## 0.09 0.96
## Age.years. Education
## 0.79 0.86
## Gender Income
## 0.93 0.75
## Race.Ethnicity GeoLocation
## 0.71 0.02
## ClassID TopicID
## 0.00 0.00
## QuestionID DataValueTypeID
## 0.00 0.00
## LocationID StratificationCategory1
## 0.00 0.00
## Stratification1 StratificationCategoryId1
## 0.00 0.00
## StratificationID1 Male
## 0.00 0.93
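This data set uses `""` for missing, but if you run into sentinel codes like `999` or `-1`, recode them to NA before doing anything else. A minimal sketch on a hypothetical vector:

```r
# Hypothetical ages where -1 and 999 both mark non-response.
age <- c(34, -1, 52, 999, 41)
age[age %in% c(-1, 999)] <- NA
age  # 34 NA 52 NA 41
```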
We’ll discuss multiple imputation for missing data later, but for now, here are some tips on handling it appropriately.
By default, most R models (e.g. `lm`) do “complete case” analysis - any observation missing any variable in the model will be dropped. However, most regular functions don’t do this:
mean(c(1,2,NA))
## [1] NA
Instead, a lot of these functions take an argument, `na.rm` (sometimes `na.action`, or even `table`’s `useNA` argument).
mean(c(1,2,NA), na.rm = TRUE)
## [1] 1.5
Be sure to check documentation to see what the default action is, and to see how to change it!
For manipulating data, `complete.cases` can be used to subset the data:
table(complete.cases(d[, 15:20]), useNA = 'ifany')
##
## FALSE TRUE
## 3204 31964
d2 <- d[complete.cases(d[, 15:20]), 15:20]
dim(d2)
## [1] 31964 6
(I restrict to columns 15-20 as there are no complete cases across the whole data set!) There is also `na.omit`, which works somewhat inversely:
d3 <- na.omit(d[, 15:20])
identical(matrix(d2), matrix(d3))
## [1] TRUE
(I convert to `matrix` to remove attributes which differ but don’t really affect anything.)
Outlier detection
Outlier detection, in a statistical sense, is very amorphous. If you have an apparently normally distributed variable with an extreme value, is that truly an outlier? Is it evidence that the true distribution is something like Cauchy? Is your data not IID?
On the other hand, computational outlier detection is much more straightforward - we’re not trying to determine why a value is extreme nor what to do with it (yet); we’re merely trying to identify values that could cause issues later.
Basic outlier detection is straightforward - use `summary` or `hist` to look at numerical values, and use `table` to check strings/factors. Sorting data can also sometimes be useful (though I can’t think of many situations where `hist` and `table` don’t suffice). Be especially on the lookout for typos/errors - is gender defined as both “male” and “Male”? Is a percentage stored between 0 and 1 occasionally stored as 64 instead of .64?
One catch to look out for:
1 > "a"
## [1] FALSE
"a" < "b"
## [1] TRUE
If you’ve properly handled the variable types, this shouldn’t come up, but it might. Just keep it in mind!
Dates
The built-in date functionality in R is clunky at best. If you need to deal with dates, try the lubridate package, by Hadley Wickham and others.
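A taste of what lubridate offers (this assumes the package is installed; the dates are just examples):

```r
library(lubridate)

d1 <- mdy("6/5/2008")   # parse month/day/year strings
d2 <- mdy("5/24/2009")

d2 - d1                 # time difference in days
year(d1)                # 2008
wday(d1, label = TRUE)  # day of the week
```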
Cleaning strings
We saw a lot of this in the web scraping notes. A few useful functions:

- `str_trim` and `str_pad`: in stringr, remove or add whitespace around a string.
- `toupper` and `tolower`: in base R, convert case.
In handling string data, there’s the idea of “distance” between strings. There are various ways of measuring this (e.g. the Levenshtein distance, which is the number of manipulations necessary to change one string into another). If you have user-entered data where, in theory, there should be a limited number of possible responses, but in practice the users made a ton of typos or embellishments, you can use the package “stringdist” to visualize this issue and perhaps find a few categories (detected as clusters) to collapse to.
library(stringdist)
Say, for example, users were asked what town they lived in. This poll was taken in Ann Arbor, so we expect answers only of “Ann Arbor”, “Dexter”, “Ypsilanti” and “Saline” (pretend there aren’t other towns). If we take the `unique` values of the entries, it might look like:
towns <- c("Ann Arbor", "AA", "Ann arbor", "AnnArbor", "Ann Abror",
"Dexter", "dexter", "Dexter Township",
"Ypsi", "Ypsilanti", "ypsilanti",
"Saline", "saline", "Salene")
Now, we can compute the string metrics between each, and cluster:
clust <- hclust(stringdistmatrix(towns, method = "jw"))
plot(clust, labels = towns)
This uses the Jaro distance, which I found by trial and error. See all the distances offered in `help("stringdist-metrics")`.
It’s not always this nice (try adding “Ann Arbor Township”), but it gives us a great start at identifying clusters. If you have a large number of unique entries, such that `table` will not be useful (but not so many that the dendrogram is un-plottable), this approach can really help.
Consistency of multi-level data
Hierarchical or multi-level data can occur when you have repeated measures (e.g. repeated measurements by individual, while attributes like race or birth date don’t change) or clustered data (e.g. student data and school data). Often in these situations, we assume consistency of the higher-level data. For example, the stuatt data in the eeptools package is a synthetic set of student records (created explicitly to practice data cleaning on!).
data(stuatt, package = "eeptools")
help(stuatt, package = "eeptools")
head(stuatt)
## sid school_year male race_ethnicity birth_date first_9th_school_year_reported hs_diploma hs_diploma_type hs_diploma_date
## 1 1 2004 1 B 10869 2004 0
## 2 1 2005 1 H 10869 2004 0
## 3 1 2006 1 H 10869 2004 0
## 4 1 2007 1 H 10869 2004 0
## 5 2 2006 0 W 11948 NA 1 Standard Diploma 6/5/2008
## 6 2 2007 0 B 11948 NA 1 College Prep Diploma 5/24/2009
As you can see, each `sid` has data across multiple school years. Variables such as `male` and `race_ethnicity` should be the same within a given `sid`. Just by looking at the first individual, we can see that they aren’t.
There are several ways to check this and get a list of the broken occurrences. This is just one way:
tt <- tapply(stuatt$race_ethnicity, stuatt$sid,
function(x) {
length(unique(x))
})
names(tt)[tt > 1]
## [1] "1" "2" "3" "117" "118" "119" "122" "124"
## [9] "125" "126" "127" "128" "229" "230" "231" "233"
## [17] "234" "235" "237" "238" "239" "240" "241" "246"
## [25] "247" "249" "342" "343" "344" "345" "347" "348"
## [33] "349" "351" "457" "458" "459" "460" "462" "463"
## [41] "464" "465" "466" "467" "468" "474" "475" "478"
## [49] "559" "560" "561" "562" "563" "566" "676" "677"
## [57] "678" "687" "689" "899" "902" "903" "904" "907"
## [65] "908" "909" "910" "1108" "1309" "1313" "1320" "1323"
## [73] "8552" "11382"
The `tapply` function applies the `length(unique())` call to each collection of `race_ethnicity` values, broken up by `sid`.
¹ For some historical context, read http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ and http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh