A lot of attempts have been made at mapping the social networks of famous books, movies and TV series. For example, here is the Harry Potter network from the Intrepid Mathematician blog:

Source: The Intrepid Mathematician

As the blog post notes:

The fun thing was that we derived these graphs without directly reading any of the text of the novels! We devised a graph extraction algorithm (analogous to the one in Network of Thrones) that based on character names being fifteen or fewer words apart in the text.

This technique is a systematic and effective way of mapping networks. You can also create these links by asking questions like: have the characters appeared in the same paragraph or chapter? But I wanted to move away from networks and take a step beyond these time/space links (conceptually, these techniques record a link whenever two characters share time and space). I wanted to try to place Harry Potter’s characters inside an all-encompassing thematic map.

**Let’s get cracking.**

I’m going to walk you through how to create these thematic maps in R. And along the way, I’ll give you a 10,000-foot understanding of vector space models (the backbone of these thematic maps) and point you in all the right directions if you want to explore them further. I hope you enjoy it!

The first step is always to import all the necessary libraries.

- I use the *tidyverse*, the ultimate data wrangling framework, in pretty much every project I work on
- Instead of downloading the txt files, the *harrypotter* library allows for direct access to the books from R
- The *stringr* library is an R workhorse for text processing
- The *wordVectors* and *wordspace* libraries are for the creation and manipulation of vector spaces
- The *proxy* package has a bunch of great distance calculations
- Although I won’t load it explicitly, I will also briefly use the *tidytext* library to demonstrate the usefulness of having texts in a data frame format
- I likewise won’t load the *ggdendro* library for plotting dendrograms, but I will use it

```
library(tidyverse)
library(harrypotter)
library(stringr)
library(wordspace)
library(wordVectors)
library(proxy)
```

The *harrypotter* and *wordVectors* libraries were developed by Bradley Boehmke and Benjamin Schmidt, respectively, and are not on CRAN (R’s package repository). Instead, you need to install them from their GitHub pages. Here are some links to help you do that.

- installing packages from GitHub: (https://cran.r-project.org/web/packages/githubinstall/vignettes/githubinstall.html)
- harrypotter GitHub page: https://github.com/bradleyboehmke/harrypotter
- wordVectors GitHub page: https://github.com/bmschmidt/wordVectors
- wordVectors installation: https://github.com/bmschmidt/wordVectors/blob/master/vignettes/introduction.Rmd

Please note that there is another *harrypotter* library on CRAN that is not the same. If you install and load that one, the code will not work.

One final note: I had some trouble installing the newest version of *wordVectors*. It turned out my Rtools, which is “a collection of resources for building packages for R under Microsoft Windows, or for building R itself”, and which is needed to install GitHub packages, was a little dated. Whether you’re installing it for the first time (you need it for this tutorial) or updating it, here is the link to Rtools:

https://cran.r-project.org/bin/windows/Rtools/

Ok, now we are all set. Are you ready? The answer I am hoping for is:

Source: gifer

The *harrypotter* library makes it super easy to load each of the seven books. Beyond this ease, all the texts are consistent in their format. I have found other resources from which I can download the texts, but the formatting (e.g. spacing between paragraphs) varied from book to book. Consistency makes text processing much easier. The package loads each book as a character vector where each element is a chapter.

```
# Load each book with its order, and annotate the chapters
# Wrapping the data frames in bind_rows collapses them into one
all_books_df <-
bind_rows(
data_frame(
txt = philosophers_stone,
book = 1),
data_frame(
txt = chamber_of_secrets,
book = 2),
data_frame(
txt = prisoner_of_azkaban,
book = 3),
data_frame(
txt = goblet_of_fire,
book = 4),
data_frame(
txt = order_of_the_phoenix,
book = 5),
data_frame(
txt = half_blood_prince,
book = 6),
data_frame(
txt = deathly_hallows,
book = 7)
) %>%
# Since each vector element is a chapter,
# this assigns chapter numbers inside each book
# The group_by allows the counts to restart for each book
group_by(book) %>%
mutate(chapter = 1:n()) %>%
# Always a good idea to ungroup to avoid problems later
ungroup()
```

Having done all this, I now must admit that, for this post, creating this data frame is overkill. We are not going to use the data frame format of the different books. But having this format is incredibly useful for libraries like *tidytext* that use data frame structures for text analytics. These libraries are very intuitive, and, as the “tidy” prefix suggests, they fit right into the tidyverse.

Ok, I cannot help myself. Here is a super brief example of how to see the counts of the word “Hermione” across all books and chapters. I am going to use the “::” notation, which allows you to call a function from a package without loading the entire thing. For Python users, this is the equivalent of: from package import function. Just make sure you’ve installed *tidytext*.

```
tidytext_demo_df <-
all_books_df %>%
# The following creates a data frame with 3 columns:
# book and chapter (from the all_books df)
# and word, the breakdown of text into a "bag of words"
tidytext::unnest_tokens(output = word, input = txt)
# Create line chart of the counts of "hermione" over each of the books
tidytext_demo_df %>%
# Filter for "hermione"
filter(word == 'hermione') %>%
# Create counts for each book-chapter pair
group_by(book, chapter) %>%
summarise(cnt = n()) %>%
# Plot the lines broken up by books
ggplot(aes(x = chapter, y = cnt)) +
geom_line() +
facet_wrap(~ book, nrow = 2)
```

I really hope that you can see the enormous value in this type of text analytics in R. It is incredibly intuitive and useful.

In case you want to dig a little deeper into this, go straight to the source. This is a book written by the people who wrote the library:

https://www.tidytextmining.com/.

Ok. Sorry for the tangent…

Source: History Vikings

Btw, if you haven’t watched Vikings yet, it is spectacular!

https://www.google.com/search?q=vikings+rating

To create the vector space model (I will explain in just a little bit what that is all about), the *wordVectors* library needs all the text to be stored in a folder. This collection of text is typically called a corpus. Although I have not read this explicitly anywhere, the probable reason for reading from a folder is that, given all the calculations (and thus RAM) the algorithm requires, it is better not to have the text already loaded and taking up memory. Moreover, you might want to create a vector space model from a lot of text, and it might not all fit into memory.

For example, Google trained its word2vec algorithm on close to 50 billion words:

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

In comparison, all the Harry Potter books have about 1.1 million words. If you are wondering where I got this number, it’s simply the number of rows of tidytext_demo_df. The tidytext demo did come in handy!

```
# Count the number of rows (one per word)
nrow(tidytext_demo_df)
```

## [1] 1089386

To store the text, we just need to collapse all the books into a single character vector and dump it into a text file. I create this file inside of a folder called “all_books”, which has already been created.
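In case you’re following along and haven’t created that folder yet, a couple of lines of base R take care of it (the folder name matches the one used below):

```r
# Create the corpus folder if it doesn't already exist
if (!dir.exists('all_books')) {
  dir.create('all_books')
}
```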

```
# Collapse the data frame column containing all the text
all_books <-
str_c(all_books_df$txt, collapse = '\n\n\n')
# Establish a connection with the file
file_conn <- file('all_books/hp_all.txt')
# Dump the vector into it
writeLines(all_books, file_conn)
# Close the connection
close(file_conn)
```

Perfect. Now we have the text document with all the HP books in the all_books folder. Note that it would have also been fine to create 7 different files, one for each book.

A vector space model is, for lack of a better explanation, a map. Although these models tend to have hundreds of dimensions, let’s first work with just 2. This allows us to think of it as a flat surface (or a Cartesian plane), just like a map. The points on a map (e.g. towns or cities) have an x value, longitude, and a y value, latitude.

Source: asperia.org

These points also have relationships with one another, distance being the most basic: what is the closest city to Germany? Or: which is farther from Berlin, Paris or Brussels? To the idea of distance, you might also add direction: where would we end up if we took the vector (or the line) going from Berlin to Paris, but started at Brussels? This is the equivalent of Brussels + (Paris − Berlin).

Probably somewhere in the Atlantic.

It’s very easy to check. Create a vector with latitude and longitude for each of the cities (got them from Google) and check the result of the calculation.

```
# Create vectors
paris <-
c(48.8566, 2.3522)
berlin <-
c(52.52, 13.405)
brussels <-
c(50.8503, 4.3517)
# Calculate new vector
# Note on dealing with North, South, East and West:
# north latitudes are positive and south latitudes are negative
# east longitudes are positive and west longitudes are negative
new_vector <-
paris - berlin + brussels
new_vector
```

## [1] 47.1869 -6.7011

So, the new vector is 47.1869N, 6.7011W. A quick Google search for these coordinates and we can confirm that the new vector is indeed in the Atlantic.

Source: Google Maps

If we consider this example as a table or data frame, Berlin, Paris and Brussels have vectors (rows) with 2 values (columns): latitude and longitude. We could also add population as a 3rd, non-spatial dimension. Although it is very difficult to visualize how this would work, you can certainly calculate a difference between the population of Paris and that of Berlin, which would be the equivalent of a distance.
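To make that concrete, here is a tiny sketch with a made-up third dimension (the population figures, in millions, are rough values for illustration only): each city becomes a 3-value vector, and the usual Euclidean distance formula still applies.

```r
# Hypothetical city vectors: latitude, longitude and
# (illustrative) population in millions
paris  <- c(48.8566,  2.3522, 2.1)
berlin <- c(52.5200, 13.4050, 3.6)

# Euclidean distance generalizes to any number of dimensions:
# the square root of the sum of squared differences
euclidean <- function(a, b) sqrt(sum((a - b)^2))
euclidean(paris, berlin)
```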

If we keep the 2-dimensional map in mind, but replace the cities with the words of books, we have ourselves a vector space model. In this model, the dimensions don’t have a significance like latitude and longitude do. But all the comparisons we discussed, in terms of distance and direction, can certainly be maintained.

It is precisely through these comparisons that Google’s word2vec algorithm got its fame.

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

Remember how we discussed calculating Paris – Berlin + Brussels? Google did something similar, with some astonishing results.

Source: Depends on the definition

And this result makes perfect sense: if you convert a king from a man to a woman, you get a queen. The vector space model was so well constructed that it contains the very subtle concepts of gender and monarchy, and they are essentially operational: you can do arithmetic with them.

There are a lot of other super interesting results that came from Google’s “word map”. Here are some links that discuss them and how the word2vec algorithm works in detail:

A vector space model is simply a map. Although many applications (including this one) have to do with text analytics and natural language processing (so creating maps of words), there are many other applications of these models. For example:

https://towardsdatascience.com/a-non-nlp-application-of-word2vec-c637e35d3668.

Fivethirtyeight also had an amazing article on using a vector space model of Reddit groups to better understand the groups related to Donald Trump:

https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/.

So, there is a huge amount of applicability.

Now we’re ready to get our hands dirty.

Before training the vector space model (“training” in machine learning terminology means running data through an algorithm), the *wordVectors* library preprocesses the data. This is just like any other text analytics process in which the excesses of the text (uppercase letters, stop words, punctuation, etc.) are removed. This is essentially removing noise to get the best possible signal. Once again, the results are saved as an external file so as not to take up space in memory. Although in this example we only have a single source file (all 7 books in a single document), if we had more files in the origin folder, they would all be processed and concatenated into a single destination file.

```
prep_word2vec(
origin = 'all_books',
destination = 'hp_all_processed.txt',
lowercase = TRUE,
# ngrams refers to the maximum number of words per phrase
bundle_ngrams = 2)
```

Now that we have the preprocessed file, we can train the model. Once again, the results are stored in an external file (a binary file) which can then be read in. Note that the training could also be assigned to an object (object <- train_word2vec(…)), in which case it would both write the binary file and keep the result in memory. Other than the choice of the number of dimensions, the rest of the function parameters tune the inner workings of the word2vec model. The library’s GitHub page has further explanations that you can explore to get more insight into what each parameter is tuning.

```
# Only train model if the binary model file does not exist
# If model exists, the train_word2vec function throws an error
if(!file.exists('hp_all_processed.bin')){
# Train the model
train_word2vec(
train_file = 'hp_all_processed.txt',
output_file = 'hp_all_processed.bin',
# This defines the number of dimensions per vector
vectors = 200,
threads = 4,
window = 12,
iter = 5,
negative_samples = 0)
}
# Read the model's binary file
suppressMessages(
w2v_model <-
read.binary.vectors('hp_all_processed.bin')
)
```

And that’s it. The model is complete.

One of the most interesting features of vector space models is the ability to test the proximity of the model’s elements, in this case the words and phrases of the Harry Potter books. For the more curious among you, this is done by looking at the cosine similarity between elements. Here is a great example and explanation of how cosine distance is calculated and why it is a better metric for these types of models (as opposed to something like Euclidean distance): https://cmry.github.io/notes/euclidean-v-cosine.
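To make the formula itself concrete, here is a minimal sketch of cosine similarity in plain R (this is just the textbook definition, not the wordVectors internals): the dot product of two vectors divided by the product of their magnitudes.

```r
# Cosine similarity: dot product over the product of vector lengths
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Vectors pointing in the same direction score 1, regardless of magnitude;
# perpendicular vectors score 0
cosine_similarity(c(1, 2, 3), c(2, 4, 6))  # 1
cosine_similarity(c(1, 0), c(0, 1))        # 0
```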

The *wordVectors* package has a very easy to use function for just this calculation.

```
# Get the 10 closest concepts to "patronus"
w2v_model %>%
closest_to(vector = "patronus", n = 10)
```

## word similarity to "patronus"
## 1 patronus 1.0000000
## 2 stag 0.5989453
## 3 produce 0.4764178
## 4 dementor 0.4694126
## 5 doe 0.4575373
## 6 produced 0.4274573
## 7 guardian 0.4243686
## 8 goat 0.4041396
## 9 otter 0.3854000
## 10 true 0.3732683

As you can see, the closest concept to “patronus” is “stag”. This makes complete sense: in the books, patronuses usually appear when Harry produces them, and Harry’s patronus is a stag. Still, even though the cosine similarity of “stag” is considerably larger than that of the next closest term, there are going to be other terms that are much closer to one another. There are other cool things you can do (king – man + woman type operations). I recommend you look at the GitHub documentation of *wordVectors* to get some ideas.
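For example, here is a sketch of one such operation (this assumes the w2v_model trained above is in memory; per the wordVectors vignette, closest_to accepts a formula interface for this kind of vector arithmetic):

```r
# "harry" - "he" + "she": which terms sit in a position
# analogous to Harry's, but on the female side?
w2v_model %>%
  closest_to(~ "harry" - "he" + "she", n = 5)
```

The exact results will depend on how your model trained, so treat this as an experiment rather than a guaranteed output.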

Source: Shepherd of the Gurneys

A very similar way of analyzing vector space data is through a clustering algorithm. If you are not familiar with clustering, it groups the observations of a dataset that are closest to one another based on a chosen distance metric. Here is a great explanation of clustering and its main algorithms:

https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/.

In this vector space model, clustering allows us to determine which terms in the Harry Potter books naturally group together.

Although it would be very interesting to see how all the books’ concepts cluster, I am more interested in how the different characters cluster. For this, I got a list of all Harry Potter characters from Wikipedia and added a few more variables (stored in the harry_potter_characters.csv file). The most important of these is a search parameter.

https://en.wikipedia.org/wiki/List_of_Harry_Potter_characters

For example, instead of looking for instances of “Alastor Mad-Eye Moody” in the text, I would simply look for “Mad-Eye”. This might not be perfect, since there are instances in which he is referred to as “Alastor”, but it works well enough. The optimal way to search would be to have multiple search terms per character and map them all to a single name, but for vector space models, these changes must be made to the source text. So, for the sake of simplicity, I used a single term (the one I consider most relevant) for each character.

```
# Read dataset with all characters and relevant metadata
hp_characters_df <-
read_csv('harry_potter_characters.csv', progress = FALSE)
# Extract the vector space from the word2vec model
w2v_matrix <-
# It's stored in the ".Data" element of the class
w2v_model@.Data %>%
# Convert it to a regular matrix
as.matrix()
# This matrix has all the vectors representing the different terms
# stored as the rows (the name of the row is the term)
# Only keep the terms whose row name matches the character search variable
w2v_matrix_filtered <-
w2v_matrix[rownames(w2v_matrix) %in% tolower(hp_characters_df$search), ]
# Inspect the first 5 columns/dimensions of the matrix
head(w2v_matrix_filtered[, 1:5])
```

## [,1] [,2] [,3] [,4] [,5]
## harry 0.008171266 -0.03027906 -0.10212392 -0.07503098 -0.094751164
## ron -0.087304167 -0.07390372 0.03376728 -0.03057711 -0.019961827
## hermione -0.010194995 -0.03472352 -0.02310267 -0.02202773 -0.049171895
## dumbledore -0.030959761 0.06883359 0.11200651 0.02781693 -0.092563055
## hagrid 0.184760004 0.07664469 0.07896528 0.01836261 0.008585766
## snape -0.102141000 0.19810164 -0.06812304 -0.10695923 -0.202439532

Awesome! So, to reiterate, in this matrix each character has a 200-dimension vector that represents, for lack of a better term, their thematic position. Characters that are close to one another in this conceptual space are related thematically. And this is exactly what we want to get at.

Remember the map analogy we discussed before? Let’s first try to recreate that, but instead of European cities, we’ll create a map of Harry Potter characters. All we need to do is reduce the 200 dimensions to just 2, using a very important concept in data science: dimensionality reduction. Here is a super quick overview:

In short, these dimensionality reduction techniques squeeze all the information / variability from many variables into a few. The most frequently used techniques are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). Here is a detailed explanation and comparison of both:
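As a quick, self-contained illustration in base R (a toy example, separate from the wordspace function used below for the actual character map): prcomp performs PCA, and keeping the first two principal components squeezes four variables down to two.

```r
# Toy data: 10 observations of 4 variables
set.seed(42)
toy_matrix <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Run PCA and keep the first 2 principal components,
# a 2-dimensional summary of the original 4 dimensions
pca_result <- prcomp(toy_matrix)
reduced <- pca_result$x[, 1:2]
dim(reduced)  # 10 rows, 2 columns
```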

```
# Reduce matrix to 2 dimensions using singular value decomposition.
character_map_df <-
# Using dsm.projection from the wordspace library
dsm.projection(model = w2v_matrix_filtered,
# Reducing to 2 dimensions
n = 2,
# Using singular value decomposition
method = 'svd') %>%
# Convert matrix to a data frame
as.data.frame() %>%
# Convert rownames of character into a column named character
rownames_to_column('character') %>%
# Rename the resulting dimensions to vec1 and vec2 (from svd1 and svd2)
# This is so that you can try different reduction algorithms
# but have consistent names for plotting
rename('vec1' = !!names(.[2]),
'vec2' = !!names(.[3]))
# Plot
character_map_df %>%
ggplot(aes(x = vec1, y = vec2)) +
geom_text(aes(label = character))
```

This map is pretty cool, but it’s hard to read in full. For another, slightly clearer way of visualizing this data, clustering and dendrograms (a clustering visualization technique) offer a solution. At the end, it’ll be very easy to see which characters are closest to each other and form natural thematic groups. I previously briefly covered clustering, and, in particular, k-means clustering in the post “From an image of a scatter plot to a regression in R”. Here are great overviews of the hierarchical clustering algorithm and of dendrograms:

- https://www.displayr.com/what-is-hierarchical-clustering/
- https://www.displayr.com/what-is-dendrogram/

```
# The first step is to calculate the distance matrix
# As mentioned previously, the cosine distance is the best
# distance metric for vector space models.
distance_matrix <-
dist(w2v_matrix_filtered,
method = "cosine")
# Run hierarchical clustering algorithm - hclust function (stats library)
hierarchical_cluster <-
# Feed the distance matrix
hclust(distance_matrix,
# For a full explanation of methods, go to function help
method = "complete")
# Replace the search terms with the full names
name_match <-
# Create a data frame with the cluster labels
data_frame(
char_names = hierarchical_cluster$labels
) %>%
# Join the characters data frame matching the search term
left_join(hp_characters_df %>%
mutate(char_names = tolower(search)),
by = 'char_names') %>%
# Mark down order to be able to maintain it when the names are switched
mutate(order = 1:n()) %>%
# Keep only one row per character using group_by and slice
# This is just a clean-up step
group_by(char_names) %>%
slice(1) %>%
ungroup() %>%
# Return to original order - group_by changes the data frame order
arrange(order)
# Replace the cluster labels with the full names
hierarchical_cluster$labels <-
name_match$full_name
# Plot the obtained dendrogram
ggdendro::ggdendrogram(
data = hierarchical_cluster,
rotate = TRUE
)
```

Now we can start seeing the natural groupings of the characters. To make the graph easier to read and to make it look prettier :), I exported it as a PDF, popped it into Adobe Illustrator, and improved some of the graphical elements (background, font size, colors, spacing, etc.).

Awesome! This is exactly what I was looking for when I began experimenting with word2vec and the Harry Potter books. Hope you enjoyed it and that you learned some cool stuff.

Cheers!

Source: Ignited Moth

Imagine some co-workers send you this old image of a scatter plot, without any other information, and ask you to help them figure out what type of regression best fits the data.

Source: betterevaluation

Your first reaction might be to extract the values manually with a set square or something of the sort. But your co-workers tell you they’ll send a few more of these your way. And, although the work with the set square might take you back to your middle school days, you want to figure out if you can do the whole process in a more automated way… Yep, that story is a little weak, but go with it. I promise you won’t regret it (famous last words). In any case…

**R to the rescue!!**

I’m going to show you how you can do all of this in R. And along the way, you might learn a bunch of stuff about pipes, image representations, arrays, data munging, regression, and even how a scene from Rocky IV is a great way to think about certain clustering methods. I hope you enjoy it!

The first step is always to import all the necessary libraries.

- I use the *tidyverse*, the ultimate data wrangling framework, in pretty much every project I work on
- The *magick* and *EBImage* libraries are for image loading and processing
- The *cluster* library is for advanced clustering

```
library(tidyverse)
library(magick)
library(EBImage)
library(cluster)
```

Although I don’t get very fancy with the data wrangling and the tidyverse, I do use pipes (%>%) a whole lot. For those that are not familiar with them, pipes take the result from an operation and use it as the first argument of the following operation.

In the following example, I pipe ‘rogue’ into the paste function (which pastes, or concatenates, 2 strings together) to paste it with ‘data’, and then pipe that result into a paste function with ‘science’. The final result is ‘rogue data science.’ This is the exact same thing as nesting all of them together. It is just much more readable.

```
a <- 'rogue'
b <- 'data'
c <- 'science'
# Pasting with pipes
a %>%
paste(b) %>%
paste(c)
```

## [1] "rogue data science"

```
# Same result as pasting with nested functions
paste(a, paste(b, paste(c)))
```

## [1] "rogue data science"

If you want to get more familiar with pipes, here are a couple of resources:

- Quick overview: http://blog.revolutionanalytics.com/2014/07/magrittr-simplifying-r-code-with-pipes.html
- More complete introduction: https://www.datacamp.com/community/tutorials/pipe-r-tutorial

Now we’re all set to start. We’ll load the scatter plot image with the magick library’s *image_read* function. Although you could crop out the unnecessary parts of the image within R, I did it manually, for the sake of brevity.

For those that are interested in how magick represents an image in R (the structure of the data), we can quickly inspect the new object with the *str* function.

```
# https://www.betterevaluation.org/evaluation-options/scatterplot
# Load image into img object
img <-
image_read('random_scatterplot_trimmed.jpg')
# Look at the structure of the img object
str(img)
```

## Class 'magick-image' <externalptr>

We don’t get a whole lot of information back, but this does tell us that a magick image object is simply an external pointer (the <externalptr> in the output above). This means that R creates a temporary file (on your computer’s hard drive) and points to that file while the image remains loaded. This is different from other R data structures, which are stored in your computer’s memory (its RAM). Using hard drive space is actually how some packages (such as bigmemory) deal with data that is bigger than the available memory. But I digress…

Although magick is great for loading images, the EBImage library is better suited for manipulating them. Conveniently, magick has a function that converts magick objects into EBImage objects: *as_EBImage*.

Once again, we can look at the structure of the object to better understand it and also the *class* function to specifically see what type of object it is.

```
# Convert the magick image object into an EBImage object
# and overwrite the old img object with the new one
img <-
img %>%
as_EBImage()
# See structure
str(img)
```

## Formal class 'Image' [package "EBImage"] with 2 slots
## ..@ .Data : num [1:684, 1:542, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ colormode: int 2

So an EBImage object (of the Image class), which unlike a magick object is contained within the R environment (it’s not a pointer), contains 2 slots: the data (an array representation of the image) and a colormode.

The array representation of the image is what we are really interested in. The dimensions of this array are 684 (width) x 542 (height) x 3 (rgb). For those not super familiar with matrices and arrays, think of this as an Excel file with 3 sheets, where each sheet has 684 columns and 542 rows.

Here is a great visualization of what an array representation of an image is.

Source: mathworks

As you can see, for each pixel in an image, there is a value for each red, green and blue channels. Here is more detailed information on the rgb color model: https://en.wikipedia.org/wiki/RGB_color_model.
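A tiny sketch of the idea: a made-up 2 x 2 “image” as a 3-dimensional array, where each pixel’s color is given by its three rgb values (1 = full intensity, 0 = none).

```r
# A 2 x 2 image: rows x columns x rgb channels
tiny_img <- array(0, dim = c(2, 2, 3))

# Top-left pixel: white (full red, green and blue)
tiny_img[1, 1, ] <- c(1, 1, 1)

# Top-right pixel: pure red (red channel only)
tiny_img[1, 2, 1] <- 1

# The red channel on its own is a 2 x 2 matrix
tiny_img[, , 1]
```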

One more note on image array representation. While some images, like .jpg, simply have rgb information, other image types, like .png, can also have information about the transparency of the image (the little white and light grey checkerboard in Google Images represents transparency). When you import a png image, there is an additional transparency channel: the alpha channel (these images are said to have an rgba color space). To see how this works, let’s load this free Superman logo from the web and look at its structure (yep, you can load images, and really any other file, directly from the web, as long as you have an internet connection and permission):

Source: freepngimg

```
# Read image from the web (as magick object)
image_read('http://www.freepngimg.com/download/superman_logo/2-2-superman-logo-png-file.png') %>%
# Convert into EBImage
as_EBImage() %>%
# See structure
str()
```

## Formal class 'Image' [package "EBImage"] with 2 slots
## ..@ .Data : num [1:3001, 1:2252, 1:4] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ colormode: int 2

```
# Notice that we are not saving an object but rather feeding the converted
# web image directly into the structure function
```

You can see from the *str* results that the logo is 3001 x 2252 and that, in addition to the 3 rgb dimensions, there is now a 4th dimension (a 4th sheet in the Excel file) representing the transparency of each pixel.

Now that we know how images are loaded and represented, let’s try to make sense of the data and get it in a format that we can use.

Instead of trying to make sense of all the channels, for the purpose of this exercise we are simply going to choose one: the red channel. By selecting a single channel, the resulting data is a matrix, which is a special kind of array with just 2 dimensions. Going back to the Excel analogy, we are getting rid of 2 of the 3 sheets.

```
# From the img object, grab the .Data component (the @ here is exactly the same as the $)
# and select all of the items from the first 2 dimensions
# and only the first item from the 3rd dimension
image_mat <-
# Blanks within the brackets means everything
img@.Data[, , 1]
# See structure
str(image_mat)
```

## num [1:684, 1:542] 1 1 1 1 1 1 1 1 1 1 ...

```
# See class
class(image_mat)
```

## [1] "matrix"

The structure of the object confirms it is now a 2-dimensional object. The class confirms that it is no longer an EBImage object but rather a native R matrix. This is exactly what we were looking for. Sweet.

If you’re still having trouble understanding what we are doing, this might help. Think about being in a three-story building. Inside each of the floors, you can move wherever you want, and you can even go up a floor and stand in the exact same place as on the floor below. So you have a lot of horizontal movement and are somewhat flexible in terms of vertical movement (one of 3 stories). Suddenly, the top 2 floors close down for repairs. Now you are completely limited to horizontal movement. So the 3-story building is like an array, and a single floor within that building is a matrix. Hope that clears things up.

The next step is to get the data into a data frame, the best format for data munging and modeling. By simply converting the data into a data frame, we do in fact get a data frame, but it is not in the right shape. We can see this by using the *dim* function, which outputs the dimensions of an object.

```
# Convert image_mat (a matrix) into a data frame
image_df <-
image_mat %>%
as.data.frame()
# See dimensions of image_df
dim(image_df)
```

## [1] 684 542

This data frame has 684 rows and 542 columns. This makes sense: we simply converted image_mat without any other manipulation, so we can expect the result to have the same shape. But remember that our end product is a scatter plot, for which we can only have 2 columns: one with a value for x and another with a value for y.

Still, all of the data we need is there, except it is not stored in the actual values within the data frame, but rather in the names of its columns and rows: the row names represent the x position and the column names represent the y position (R has added a “V” to the beginning of the column names because column names, like the names of other R objects, should not start with numbers), while the actual value represents the red color value of the pixel. So the red value for the pixel with coordinates x = 1 and y = 1 is 1:

```
image_df[1, 1]
```

## [1] 1

So the next step is to reshape this data and convert it into a 3-variable data frame: x position, y position, and the value for that pixel.

**Advanced note:** For those of you who have worked with matrices, this is exactly how sparse matrices (matrices with a lot of zeros) are represented. Only the elements of these matrices with a value other than 0 are recorded.
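As an aside, this triplet (row, column, value) idea is exactly what the *Matrix* package uses for sparse storage; a small sketch (assuming the Matrix package is installed):

```r
library(Matrix)
# A 3x3 matrix with only two non-zero entries
m <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(5, 7), dims = c(3, 3))
# summary() shows the triplet representation: one row per non-zero element
summary(m)
```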

This type of data conversion is called wide format (where many variables typically measure the same thing – in this case pixel colors) to long format (where the variable names are stored in one variable and the values are stored in another). Let’s go ahead and do this transformation.

```
# Convert image_df from wide to long format
image_df <-
image_df %>%
# Convert the rownames of the data into a variable
rownames_to_column() %>%
# Rename this new variable which represents the x values
rename(x = rowname) %>%
# This new variable is a string so we have to convert to numeric
mutate(x = as.numeric(x)) %>%
# Convert all variables other than the x into long form
gather('y', 'pixel_value', everything(), -x) %>%
# Fix the y values
mutate(y =
y %>%
# Remove the V
str_replace_all('V', '') %>%
# Convert to numeric
as.numeric())
# See the top rows
head(image_df)
```

##   x y pixel_value
## 1 1 1           1
## 2 2 1           1
## 3 3 1           1
## 4 4 1           1
## 5 5 1           1
## 6 6 1           1

Awesome. That did the trick.

Now, to finalize the data, we need to make sure we only include the pixel values that actually interest us: those containing the points of the scatter plot.

To determine how to adequately choose the pixels of interest, we can view the values of these pixels in a histogram and see how they are distributed. We can preemptively filter out any value equal to 1, which are the white pixels (full red, green and blue result in white, while having all values at 0 results in black). We’ll use ggplot’s *geom_histogram* for the graph.

```
# Plot pixel_value histogram
image_df %>%
# Use filtering functionality of data frame to remove all 1s (white pixels)
filter(pixel_value < 1) %>%
# Visualize in histogram of pixel_values
ggplot(aes(pixel_value)) +
geom_histogram(bins = 50)
```

From the histogram, we can tell that there are more values than just pure white (where red is equal to 1), and that the values we are interested in probably lie between 0.1 and 0.25. But to make sure we don’t miss anything, and since there is such a big gap between the values of the dark pixels and those of the light ones, we can make 0.5 our cutoff point.

```
image_df <-
image_df %>%
# Filter
filter(pixel_value < 0.5)
```

Now we should be all set. The next step is to actually take a look at the scatter plot we are trying to replicate.

Now that the data is finalized, both in structure and in the choice of pixels, we can use ggplot’s *geom_point* to view the scatter plot.

```
image_df %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```

If we compare this scatter plot to the original one we loaded, it seems to be upside down. This is because the array representation of the image flips it within the data structure. I am not 100% sure why that is, but I am sure it has some purpose.

Either way, it’s an easy fix. All we need to do is flip the points. To do so, we just multiply the y values by -1, which flips the data, and then add the difference between the maximum of the original values and the maximum of the flipped values, which shifts the flipped values back into the original range.

The other quick change we have to make is to ensure that the scales of the x and y axes are the same as in the original plot. Eyeballing it, the y values seem to range from 4 to 98, and the x from 1900 to 1920. In order to rescale the x and the y, we have to perform the following calculation:

new_value = ((new_max − new_min) / (old_max − old_min)) × (value − old_max) + new_max

Source: stackexchange
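As a quick sanity check, here is that rescaling wrapped in a small helper function (my own, not from the original post), applied to the x pixel range of 1 to 684:

```r
# Rescale a value v from [old_min, old_max] to [new_min, new_max]
rescale <- function(v, old_min, old_max, new_min, new_max) {
  ((new_max - new_min) / (old_max - old_min)) * (v - old_max) + new_max
}
rescale(684, 1, 684, 1900, 1920)  # the largest pixel x maps to 1920
rescale(1, 1, 684, 1900, 1920)    # the smallest pixel x maps to 1900
```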

```
min_x_new <- 1900
max_x_new <- 1920
min_y_new <- 4
max_y_new <- 98
image_df <-
image_df %>%
# First flip the points
mutate(new_y = y * (-1)) %>%
mutate(new_y = new_y + (max(y) - max(new_y))) %>%
select(-y) %>%
rename(y = new_y) %>%
# Then rescale the values
mutate(x =
((max_x_new - min_x_new) / (max(x) - min(x))) *
(x - max(x)) + max_x_new,
y =
((max_y_new - min_y_new) / (max(y) - min(y))) *
(y - max(y)) + max_y_new)
image_df %>%
ggplot(aes(x = x, y = y)) +
geom_point()
```

Perfect. Onto the next part!

Although we got the data in the shape that we want, we still have a problem with the points. We want each point on the scatter plot to be represented by a single point in the data. Right now, each one is represented by multiple rows, which makes sense since, in the image, each point is made up of multiple pixels.

Clustering to the rescue! Specifically, k-means clustering to the rescue!

For those who are not familiar with clustering, I was going to do a brief overview of k-means clustering, but there are so many great sources already out there that pointing you towards them is probably more efficient. So before moving forward, I recommend the following sources:

- Non-mathy overview: http://www.dummies.com/programming/big-data/data-science/data-clustering-with-the-k-means-algorithm/
- More technical overview: https://www.datascience.com/blog/k-means-clustering

While most of the time when one uses k-means clustering there is a bit of guesstimation as to how many clusters to look for (although there are also ways of finding the optimal number of clusters), we actually know exactly how many clusters we are looking for: 21, the number of points in the scatter plot. This makes the algorithm super easy to run.

Once we have the results, all we have to do is extract the centroids of the clusters. These are simply the central points (more specifically, the centers of mass) of each of the groups.
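In other words, a centroid is just the per-coordinate mean of the points in a cluster; a toy example:

```r
# The centroid of a cluster is the mean of each coordinate
cluster_points <- data.frame(x = c(1, 2, 3), y = c(4, 6, 8))
colMeans(cluster_points)  # x = 2, y = 6
```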

Source: humanoriented

For those familiar with the Rocky movies, using the centroids here is the equivalent of “hit the one in the middle!”

Movie reference: check!

```
# Set a random seed so that the results are the same every time
set.seed(1)
# Run the k-means algorithm on the x and y values of the data
kmeans_fit <-
kmeans(
# We won't need the whole dataframe, just the x and y values
x =
image_df %>%
select(x, y),
# We set the number of clusters to look for to 21
centers = 21)
# Extract the centroids from the results as a dataframe
simplified_df <-
kmeans_fit$centers %>%
as_data_frame()
# Plot the centroids in a scatter plot
simplified_df %>%
ggplot(aes(x = x, y = y)) +
geom_point(size = 6, alpha = .4)
```

The k-means clustering did a really good job. Almost exactly what we wanted, but not quite. Let’s overlay the original scatter plot onto this new one to see where there are discrepancies (not super pretty, but it does the job).
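The overlay code isn’t shown here, but a minimal sketch (reusing the image_df and simplified_df objects built above) could look like this:

```r
# Overlay the original pixel points (small, red) on the k-means centroids
simplified_df %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 6, alpha = 0.4) +
  geom_point(data = image_df, colour = 'red', size = 0.5)
```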

So k-means clustering did not recognize the right clusters. Luckily, there is another clustering algorithm developed precisely for when k-means fails: the PAM, or partitioning around medoids, algorithm (https://www.r-bloggers.com/when-k-means-clustering-fails/). The *cluster* library has a great implementation of PAM.

```
# To run pam, we use the exact same arguments as with kmeans
pam_fit <-
cluster::pam(x =
image_df %>%
select(x, y),
k = 21)
# Extract the medoids from the results as a dataframe
simplified_df2 <-
pam_fit$medoids %>%
as_data_frame()
# Plot the medoids in a scatter plot
simplified_df2 %>%
ggplot(aes(x = x, y = y)) +
geom_point(size = 6, alpha = 0.4)
```

It seems like it did a better job. But to make sure, let’s overlap the original scatter plot onto this new one once again.

Sweet! That worked perfectly.

We got our data ready and we’ve made sure it matches the original scatter plot. Now all we have to do is see which regression fits the data best.

The first thing we can do is a good old eye test. Using the ggplot stat_smooth function, we can plot both a linear regression and a quadratic. Here is a pretty good and simple overview of these different regression types: http://blog.minitab.com/blog/adventures-in-statistics-2/curve-fitting-with-linear-and-nonlinear-regression.

```
# Plot the scatter plot with the linear and quadratic regression lines
simplified_df2 %>%
ggplot(aes(x = x, y = y)) +
# First plot the simple linear regression
stat_smooth(method = 'lm',
# The formula of a linear regression is y = a + bx
formula = y ~ x,
# This removes the confidence interval plot
se = FALSE,
# Color of the line
col = 'deepskyblue2',
# 20% alpha (or 80% transparency)
alpha = 0.2,
# Width of the line
size = 2) +
stat_smooth(method = 'lm',
# The formula of a quadratic regression is y = a + b1x + b2x^2
formula = y ~ x + I(x^2),
se = FALSE,
col = 'forestgreen',
alpha = 0.2,
size = 2) +
# Add the scatter plot
geom_point(size = 6, alpha = 0.4)
```

```
# Note that I first plotted the regression lines and then the scatter plot.
# This is simply to have the lines in the "background" of the graph
# but you can do it whichever way you prefer.
```

Based on the graph, it seems like the quadratic is a slightly better fit, especially at the edges of the curves. But it’s close enough where we would probably want to make sure.

To do so, we can run the two regressions and compare their r-squared values. If you’re not familiar with r-squared, it is a very common measure of how well a regression fits. It is typically interpreted as the percent of variability of the dependent variable (y in this case) that can be explained by the independent variable or variables (in case there is more than one). The higher the r-squared, the better. Once again, if you want a little more information about this topic, here is a pretty good and simple explanation: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit.

```
# First we run the linear regression (with the same formula as in the ggplot graph)
linreg_fit <-
lm(y ~ x, data = simplified_df2)
# Second we run the quadratic regression (also with the same formula as in the ggplot graph)
quadreg_fit <-
lm(y ~ x + I(x^2), data = simplified_df2)
# Now we can return the r-squared.
# We can do this in a more fancy manner with the message function
message(
paste('Linear regression r-squared:',
# Get the r-squared element from the summary of the fit
summary(linreg_fit)$r.squared)
)
```

## Linear regression r-squared: 0.799783601324086

```
message(
paste('Quadratic regression r-squared:',
# Get the r-squared element from the summary of the fit
summary(quadreg_fit)$r.squared)
)
```

## Quadratic regression r-squared: 0.861010457639293

This confirms what we concluded from the graph: the quadratic is a better fit. It explains 86% of the variability of the dependent variable, while the linear regression only explained 80%, which is still pretty good.
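One caveat: the quadratic has an extra parameter, and adding terms can only increase plain r-squared, so a fairer comparison is the adjusted r-squared, which penalizes extra terms. It is available from the same summary objects:

```r
# Adjusted r-squared penalizes the extra parameter of the quadratic,
# making it a fairer comparison between the two fits
summary(linreg_fit)$adj.r.squared
summary(quadreg_fit)$adj.r.squared
```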

Awesome, we’re all done. Hope you enjoyed it and learned a few new things. Please reach out if you have any questions.

Cheers!
