Multidimensional Scaling
Introduction
Multidimensional Scaling (MDS) is a dimension-reduction technique that projects high-dimensional data down to a small number of dimensions (usually 2, so it can be plotted) while preserving the relative distances between observations as well as possible. It can be used to look at higher-dimensional data and try to find patterns or groupings. It is most useful when the individual observations are meaningful in their own right (e.g. if it’s a random sample of people, probably not useful; but if it’s all fish species found in a certain lake, it may be useful) and the data set is relatively small (basically up to the limits of a readable scatter plot).
MDS vs PCA
Without getting too heavily into the theory behind MDS, note that you can define the distance however you want. Classical MDS with Euclidean distance is equivalent to extracting the first two principal components from a PCA; other distance metrics can be used as well, in which case the result generally differs from PCA.
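The equivalence with PCA can be checked numerically. The following is a minimal sketch on the built-in USArrests data (chosen here only for illustration; it is not the data used below), comparing the output of cmdscale with the first two component scores from prcomp.

```r
# Built-in dataset, standardized so every variable has unit variance
x <- scale(USArrests)

# Classical MDS on Euclidean distances, keeping two dimensions
mds <- cmdscale(dist(x), k = 2)

# Scores on the first two principal components
pca <- prcomp(x)$x[, 1:2]

# The coordinates agree up to a sign flip of each axis,
# so compare absolute values
all.equal(abs(unname(mds)), abs(unname(pca)))

# Using another metric only changes the dist() call;
# the result is then no longer tied to PCA
mds_manhattan <- cmdscale(dist(x, method = "manhattan"), k = 2)
```

The sign of each axis is arbitrary in both methods, which is why the comparison uses absolute values.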
Implementation in R
For this example, I combined data about countries’ birth rates, death rates and homicide rates from Wikipedia. The code is long and not novel at this point, so feel free to skip it.
library(rvest)
page1 <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_birth_rate")
page1 <- html_table(html_nodes(page1, ".wikitable")[1], fill = TRUE)[[1]]
page1 <- page1[-(1:2), ]
page1[, -1] <- sapply(page1[, -1], as.numeric)
names(page1)[1] <- "country"
page2 <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_mortality_rate")
page2 <- html_table(html_nodes(page2, ".wikitable")[1], fill = TRUE)[[1]]
page2 <- page2[-(1:2), ]
page2[, -1] <- sapply(page2[, -1], as.numeric)
names(page2)[1] <- "country"
page3 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate")
page3 <- html_table(html_nodes(page3, ".wikitable")[3], fill = TRUE)[[1]]
page3 <- page3[-(1:2),]
page3 <- page3[, c(1:3,5)]
page3[, 2:3] <- sapply(page3[, 2:3], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
names(page3)[1] <- "country"
names(page3)[4] <- "region"
library(stringr)
countrydata <- Reduce(function(...) merge(..., by = "country", all = TRUE),
                      list(page1, page2, page3))
## Warning in merge.data.frame(..., by = "country", all = TRUE): column names
## 'WB 2010', 'OECD 2011.x', 'CIA WF 2013', 'CIA WF 2014.x', 'CIA WF 2016',
## 'OECD 2011.y', 'CIA WF 2014.y' are duplicated in the result
## Warning in merge.data.frame(..., by = "country", all = TRUE): column names
## 'WB 2010', 'OECD 2011.x', 'CIA WF 2013', 'CIA WF 2014.x', 'CIA WF 2016',
## 'OECD 2011.y', 'CIA WF 2014.y' are duplicated in the result
countrydata$region <- str_replace_all(countrydata$region,
                                      str_c("Central |Middle |Eastern |Western |",
                                            "South_Eastern |Southern |Northern |South-"),
                                      "")
countrydata$region <- str_replace(countrydata$region, "Melanesia", "Asia")
countrydata$region <- str_replace(countrydata$region, "Caribbean", "America")
countrydata$region <- str_replace(countrydata$region, "^America$", "North America")
countrydata <- countrydata[, c(1, 18, 4:9, 12:16)]
countrydata <- na.omit(countrydata)
The end result is the countrydata data.frame:
head(countrydata)
## country region OECD 2011.x OECD 2011.x.1 CIA WF 2013
## 1 Afghanistan Asia 45.1 3 39.05
## 2 Albania Europe 11.5 183 12.57
## 3 Algeria Africa 24.8 75 24.25
## 5 Andorra Europe 10.2 211 8.88
## 6 Angola Africa 40.9 13 39.16
## 7 Anguilla (UK) North America 11.1 194 12.82
## CIA WF 2013.1 CIA WF 2014.x CIA WF 2014.x.1 OECD 2011.y OECD 2011.y.1
## 1 12 38.84 10 18.2 2
## 2 159 12.73 156 6.9 144
## 3 63 23.99 63 4.4 207
## 5 212 8.48 218 3.5 221
## 6 10 38.97 9 15.2 8
## 7 155 12.68 157 3.3 226
## CIA WF 2014.y CIA WF 2014.y.1 UNODC murder rates. Most recent year.1
## 1 14.12 7 6.5
## 2 6.47 152 4.0
## 3 4.31 205 1.5
## 5 6.82 139 0.0
## 6 11.67 29 10.8
## 7 4.54 201 7.5
Don’t worry about the unclear variable names; we don’t need them for this approach (though of course in general I should clean them up!).
First, we must compute the distance between each pair of countries. We’ll stick with Euclidean distance. I also pull out country and region to make plotting easier later. Note that I normalize the data with scale before computing the distance; in this case it doesn’t change much, but if any variable were on a much larger scale than the others (e.g. if we had population), that variable would dominate the distance.
country <- countrydata$country
region <- countrydata$region
countrydata <- countrydata[, -(1:2)]
countrydata <- scale(countrydata)
countrydist <- dist(countrydata)
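To see why the scale step matters, here is a small sketch with made-up numbers (the toy data frame and its values are hypothetical, not taken from the Wikipedia tables). Without scaling, the distances are driven almost entirely by population, and the large gap in rate between A and B barely registers.

```r
# Hypothetical toy data: one small-scale and one large-scale variable
toy <- data.frame(
  rate       = c(10, 40, 12),
  population = c(1e6, 1.1e6, 5e7),
  row.names  = c("A", "B", "C")
)

# Raw Euclidean distances: effectively just differences in population
raw_dist <- dist(toy)

# After centering and scaling each column, both variables contribute
scaled_dist <- dist(scale(toy))

raw_dist
scaled_dist
```

In the raw distances A and B look nearly identical compared with C, even though their rates differ a lot; after scaling, that difference in rate counts as much as the difference in population.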
Now we can perform the MDS:
mdsout <- data.frame(cmdscale(countrydist))
head(mdsout)
## X1 X2
## 1 5.379460 2.7569699
## 2 -2.114017 -0.4592517
## 3 0.584729 -2.8292988
## 5 -3.670878 -0.8448818
## 6 4.938717 1.7627268
## 7 -2.651280 -2.1841003
Note that the default output of cmdscale is a matrix; I convert it to a data.frame for ggplot.
library(ggplot2)
ggplot(mdsout, aes(x = X1, y = X2)) +
  geom_label(aes(fill = region, label = country))
From this we can see clustering, e.g. the cluster of African countries or of former USSR countries. We also see outliers: Japan is similar to European countries, and Seychelles (in Africa) is similar to the United States.
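As a rough check on how much of the distance structure two dimensions capture, cmdscale can also return its eigenvalues and a goodness-of-fit measure when called with eig = TRUE. The sketch below uses the built-in eurodist distances rather than the country data, so it runs on its own; the same call should work on countrydist.

```r
# eurodist is a built-in "dist" object of road distances between European cities
fit <- cmdscale(eurodist, k = 2, eig = TRUE)

# With eig = TRUE the result is a list; the coordinates are in $points
head(fit$points)

# Two goodness-of-fit values between 0 and 1; closer to 1 means the
# two dimensions reproduce the distances more faithfully
fit$GOF
```

Because the return value is a list in this case, the coordinates for plotting would come from data.frame(fit$points) rather than from wrapping the whole result.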