Multidimensional Scaling

Introduction
MDS vs PCA
Implementation in R

Thank you for visiting my notes. I've noticed a large uptick in views lately which is terrific! However, these notes are quite old (2017) and some details may be outdated. I have a more recent class (2023) which covers much of the same information and is hopefully more up-to-date.

If you are using these notes, I'd love to hear from you and how you are using them! Please reach out to jerrick (at) umich (dot) edu.

Introduction

Multidimensional Scaling (MDS) is a dimension-reduction technique that projects high-dimensional data down to a small number of dimensions (typically 2, so the result can be plotted) while preserving the relative distances between observations as well as possible. It can be used to look at higher-dimensional data and try to find patterns or groupings. It is most useful when the individual observations are meaningful in their own right (e.g. if it’s a random sample of people, probably not useful; but if it’s all fish species found in a certain lake, it may be useful) and the number of observations is relatively small (basically up to the limits of a readable scatter plot).

MDS vs PCA

Without getting too heavily into the theory behind MDS, note that you can define the distance however you want. Classical MDS with Euclidean distance is equivalent to extracting the first two principal components from a PCA (up to the sign of each axis). In general, other distance metrics can be used.
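
As a quick check of the Euclidean case, here is a minimal sketch on simulated data. The matrix X and the object names are made up for illustration and are not part of the country example that follows. It compares the coordinates from cmdscale with the first two principal component scores from prcomp, and then swaps in a different metric.

```r
# Simulated data (hypothetical): 50 observations, 5 variables.
set.seed(1)
X <- matrix(rnorm(50 * 5), nrow = 50)

# Classical MDS on Euclidean distances, keeping 2 dimensions.
mds <- cmdscale(dist(X), k = 2)

# Scores on the first two principal components (prcomp centers by default).
pca <- prcomp(X)$x[, 1:2]

# Compare absolute values, because each axis may be flipped in sign.
max_diff <- max(abs(abs(mds) - abs(pca)))
max_diff

# Any other metric supported by dist() can be used instead; with a
# non-Euclidean metric the result no longer matches PCA.
mds_manhattan <- cmdscale(dist(X, method = "manhattan"), k = 2)
dim(mds_manhattan)
```

If the two solutions agree, max_diff should be zero up to floating-point error.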

Implementation in R

For this example, I combined data on countries’ birth rates, death rates, and homicide rates from Wikipedia. The data-preparation code is long and nothing new, so feel free to skip it.

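library(rvest)

# Scrape the first ".wikitable" on the birth-rate page into a data.frame.
page1 <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_birth_rate")
page1 <- html_table(html_nodes(page1, ".wikitable")[1], fill = TRUE)[[1]]
# Drop the first two rows of the scraped table.
page1 <- page1[-(1:2), ]
# Convert every column except the country name to numeric.
page1[, -1] <- sapply(page1[, -1], as.numeric)
names(page1)[1] <- "country"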

# Same steps for the mortality-rate page.
page2 <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_mortality_rate")
page2 <- html_table(html_nodes(page2, ".wikitable")[1], fill = TRUE)[[1]]
page2 <- page2[-(1:2), ]
page2[, -1] <- sapply(page2[, -1], as.numeric)
names(page2)[1] <- "country"

# Homicide-rate page: this time use the third ".wikitable".
page3 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate")
page3 <- html_table(html_nodes(page3, ".wikitable")[3], fill = TRUE)[[1]]
page3 <- page3[-(1:2), ]
# Keep columns 1-3 and 5, then convert columns 2 and 3 to numeric.
page3 <- page3[, c(1:3, 5)]
page3[, 2:3] <- sapply(page3[, 2:3], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
names(page3)[1] <- "country"
names(page3)[4] <- "region"
library(stringr)

# Full outer join of the three tables on country.
countrydata <- Reduce(function(...) merge(..., by = "country", all = TRUE),
                      list(page1, page2, page3))
## Warning in merge.data.frame(..., by = "country", all = TRUE): column names
## 'WB 2010', 'OECD 2011.x', 'CIA WF 2013', 'CIA WF 2014.x', 'CIA WF 2016',
## 'OECD 2011.y', 'CIA WF 2014.y' are duplicated in the result
## Warning in merge.data.frame(..., by = "country", all = TRUE): column names
## 'WB 2010', 'OECD 2011.x', 'CIA WF 2013', 'CIA WF 2014.x', 'CIA WF 2016',
## 'OECD 2011.y', 'CIA WF 2014.y' are duplicated in the result
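# Strip directional prefixes so sub-regions collapse into their parent region
# (e.g. "Western Europe" becomes "Europe").
countrydata$region <- str_replace_all(countrydata$region,
                                      str_c("Central |Middle |Eastern |Western |",
                                            "South_Eastern |Southern |Northern |South-"),
                                      "")
# Fold the remaining small regions into larger ones.
countrydata$region <- str_replace(countrydata$region, "Melanesia", "Asia")
countrydata$region <- str_replace(countrydata$region, "Caribbean", "America")
countrydata$region <- str_replace(countrydata$region, "^America$", "North America")

# Keep only country, region, and a selection of the numeric columns.
countrydata <- countrydata[, c(1, 18, 4:9, 12:16)]

# Drop countries with any missing value.
countrydata <- na.omit(countrydata)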

The end result is the countrydata data.frame:

head(countrydata)

##         country        region OECD 2011.x OECD 2011.x.1 CIA WF 2013
## 1   Afghanistan          Asia        45.1             3       39.05
## 2       Albania        Europe        11.5           183       12.57
## 3       Algeria        Africa        24.8            75       24.25
## 5       Andorra        Europe        10.2           211        8.88
## 6        Angola        Africa        40.9            13       39.16
## 7 Anguilla (UK) North America        11.1           194       12.82
##   CIA WF 2013.1 CIA WF 2014.x CIA WF 2014.x.1 OECD 2011.y OECD 2011.y.1
## 1            12         38.84              10        18.2             2
## 2           159         12.73             156         6.9           144
## 3            63         23.99              63         4.4           207
## 5           212          8.48             218         3.5           221
## 6            10         38.97               9        15.2             8
## 7           155         12.68             157         3.3           226
##   CIA WF 2014.y CIA WF 2014.y.1 UNODC murder rates. Most recent year.1
## 1         14.12               7                                    6.5
## 2          6.47             152                                    4.0
## 3          4.31             205                                    1.5
## 5          6.82             139                                    0.0
## 6         11.67              29                                   10.8
## 7          4.54             201                                    7.5

Don’t worry about the unclear variable names; we don’t need them for this approach (though of course in general I should clean them up!).

First, we must compute the distance between each pair of countries. We’ll stick with Euclidean distance. I also pull out the country and region to make plotting easier later. Note that I normalize the data with scale before computing the distance; in this case it doesn’t change much, but if any variable were on a much larger scale than the others (e.g. if we had population), it would dominate the distance.

country <- countrydata$country
region <- countrydata$region
countrydata <- countrydata[, -(1:2)]
countrydata <- scale(countrydata)
countrydist <- dist(countrydata)
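
To see why scaling matters, here is a small sketch with made-up numbers (the toy data frame and its values are hypothetical, not taken from the Wikipedia tables). One variable is a rate and the other is a population count several orders of magnitude larger.

```r
# Hypothetical toy data: a rate and a much larger population count.
toy <- data.frame(rate = c(10, 40, 12),
                  population = c(1e6, 1.1e6, 5e6))
rownames(toy) <- c("A", "B", "C")

# Without scaling, the distance between A and B is essentially just their
# population difference; the 30-point gap in rate barely registers.
raw_d <- as.matrix(dist(toy))
raw_d

# After scaling, both variables are in standard-deviation units, so the
# rate difference between A and B now contributes to the distance.
scaled_d <- as.matrix(dist(scale(toy)))
scaled_d
```

In the unscaled matrix, A and B look like near neighbors purely because their populations are similar; after scaling, their large difference in rate is no longer drowned out.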

Now we can perform the MDS:

mdsout <- data.frame(cmdscale(countrydist))
head(mdsout)

##          X1         X2
## 1  5.379460  2.7569699
## 2 -2.114017 -0.4592517
## 3  0.584729 -2.8292988
## 5 -3.670878 -0.8448818
## 6  4.938717  1.7627268
## 7 -2.651280 -2.1841003

Note that the default output of cmdscale is a matrix; I convert it to a data.frame for ggplot.
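
cmdscale can also return more than the coordinates. The sketch below uses R's built-in eurodist dataset (road distances between European cities) rather than the country data, so it runs on its own; the object name fit is just for illustration. Setting eig = TRUE returns a list that includes the coordinates, the eigenvalues, and a goodness-of-fit measure.

```r
# Built-in dataset of road distances between 21 European cities.
fit <- cmdscale(eurodist, k = 2, eig = TRUE)

# With eig = TRUE the coordinates live in fit$points (still a matrix).
head(fit$points)

# Goodness of fit: values closer to 1 mean the 2-dimensional picture
# preserves more of the original distances.
fit$GOF

# Converting to a data.frame, as in the notes, gives columns X1 and X2.
fitdf <- data.frame(fit$points)
names(fitdf)
```

The k argument controls how many dimensions are kept; 2 is the default and the natural choice for a scatter plot.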

library(ggplot2)
ggplot(mdsout, aes(x = X1, y = X2)) + 
  geom_label(aes(fill = region, label = country))

From this we can see clustering, e.g. the cluster of African countries or of former USSR countries. We also see some surprises: Japan sits among the European countries, and Seychelles (in Africa) lands close to the United States.