Love him and his music or not, Bob Dylan is considered by many to be one of the greatest lyricists of all time. He has recorded 36 studio albums, written hundreds of songs, and is still touring the world, performing at dozens of concerts each year. Not bad for a 74-year-old grandpa who started his career more than 5 decades ago. He’s been recognized so many times that Wikipedia has a separate page to list all of his awards and nominations.
But what is it that draws people to his songs? Do his lyrics have a particular essence? Are they happy or sad? Does Dylan sing primarily about himself and his experiences? What words does he use most frequently? If you’ve ever wondered about any of this, my latest viz, and a submission to Tableau’s IronViz Music Viz Contest, attempts to help answer these questions. In this post I break down how I approached the challenge of finding new ways to understand Dylan through text analysis of his lyrics.
I started this project by deciding where to source my primary data set – the lyrics! Many websites archive song lyrics, but I decided to rely on Dylan’s official website, presumably the most reliable source of information about the artist and his writing. Most of his songs have lyrics listed there; I say most because several songs are missing lyrics and a handful of pieces are instrumental. I used import.io to scrape the albums and song lists, as well as the lyrics. To do this yourself, all you need is their free app, available on their website. Be sure to check out the tutorials in the Help section to get started. The program is fairly easy to use, but the video walkthroughs will likely save you time and some potential frustration.
I scraped the lyrics twice. For text analysis in Tableau I needed each line in a separate row, and for sentiment analysis (more on this later) I wanted the text of the whole song in a single row.
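To make the two shapes concrete, here is a minimal base R sketch that derives the one-song-per-row table from the one-line-per-row table. This is not what I did (I simply scraped twice); the data frame, its column names, and the sample lines are all made up for illustration.

```r
# Hypothetical per-line table: one row per lyric line.
# Column names and contents are assumptions, not the real scrape.
lines <- data.frame(
  song = c("Song A", "Song A", "Song B"),
  line = c("first line of a", "second line of a", "only line of b"),
  stringsAsFactors = FALSE
)

# Join the lines of each song into a single string, separated by spaces.
# Extra arguments to aggregate() (here, collapse) are passed on to paste().
songs <- aggregate(line ~ song, data = lines, FUN = paste, collapse = " ")
names(songs)[2] <- "lyrics"

print(songs)
```

The first table suits word-level analysis in Tableau, while the second matches the one-string-per-song input used for sentiment analysis.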
Preparing data for Tableau
After training import.io’s crawler to separate each line into its own row, this is what I got:
What I eventually wanted was each word in a separate row. To achieve this, I took the data to Tableau and used the INDEX() function to number each line, restarting the count for every song. I copied the resulting crosstab back to Excel and used Excel’s Text to Columns command to split each line into words, using a space as the delimiter. I then added numbering to each column of words:
From there, I relied on Tableau’s free data reshape tool to convert my wide data into a long format. This gave me a table like the one below, with each word identified by its line number and its position in the line. The only things lost in translation were the paragraph breaks between verses, but I can live with that.
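For readers who would rather skip the Tableau-to-Excel round trip, the same numbering, splitting, and reshaping can be sketched in base R. This is an alternative under my own assumptions, not the workflow I actually used: the column names (song, text, line_no, word_no, word) and the sample lines are hypothetical, and the sketch goes straight to the long format without building the wide table first.

```r
# Hypothetical per-line table: one row per lyric line.
lyrics <- data.frame(
  song = c("Song A", "Song A", "Song B"),
  text = c("hello world again", "one two", "just this"),
  stringsAsFactors = FALSE
)

# Number the lines within each song
# (analogous to INDEX() restarting for every song).
lyrics$line_no <- ave(seq_along(lyrics$song), lyrics$song, FUN = seq_along)

# Split each line on single spaces (the Text to Columns step).
words <- strsplit(lyrics$text, " ", fixed = TRUE)

# Build the long table directly: one row per word, identified by
# its song, its line number, and its position within the line.
n <- lengths(words)
long <- data.frame(
  song    = rep(lyrics$song, n),
  line_no = rep(lyrics$line_no, n),
  word_no = sequence(n),
  word    = unlist(words),
  stringsAsFactors = FALSE
)

print(long)
```

As in my own version, verse breaks are not represented; keeping them would need an extra column in the per-line table.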
If you were paying attention, you may have noticed that I added 2 additional columns: pct_before (for punctuation preceding the word) and pct_after (for punctuation following the word). I’d like to credit Robert Rouse for this method of treating punctuation in data prep. He did some fantastic work with his visualization of Bible text and I got many great ideas by digging into his viz.
Separating but keeping punctuation marks is important if you want to analyze raw words while retaining the ability to display your text properly in Tableau, which you do by concatenating each raw word with its punctuation marks.
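Here is a rough base R sketch of that idea: split each token into leading punctuation, the raw word, and trailing punctuation, then glue the three back together for display. The names pct_before and pct_after follow the columns described above; the sample tokens and the exact splitting rules are my own assumptions, not necessarily how Robert Rouse or I handled every edge case.

```r
# Hypothetical tokens, as they might come out of a split on spaces.
tokens <- c("\"Hello,", "world!", "plain", "don't", "(wait)...")

# Strip leading punctuation; whatever was removed is pct_before.
rest       <- sub("^[[:punct:]]+", "", tokens)
pct_before <- substr(tokens, 1, nchar(tokens) - nchar(rest))

# Strip trailing punctuation; whatever was removed is pct_after.
# Punctuation inside a word (the apostrophe in "don't") is kept.
word      <- sub("[[:punct:]]+$", "", rest)
pct_after <- substr(rest, nchar(word) + 1, nchar(rest))

parts <- data.frame(pct_before, word, pct_after, stringsAsFactors = FALSE)
print(parts)

# Concatenating the three columns restores the original text for display.
rebuilt <- paste0(pct_before, word, pct_after)
```

The word column is what you count and analyze; the rebuilt string is what you show. In Tableau the same concatenation can be done in a calculated field.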
Okay, so we can now analyze our lyric statistics in Tableau, but how can we assess the mood and emotion of a song? R to the rescue! As I mentioned, for sentiment analysis I needed each song as one long string (one song per row), which is the input format the R sentiment package expects. Bora Beran writes more on his blog about running an R sentiment package from within Tableau, but for a Tableau Public viz I had to do my analysis in R and plug the results into Tableau. The R sentiment package (download it here) cross-references the words being analyzed against its built-in database of over 6000 words classified as positive, negative, or neutral, compiles the results for the whole string (a song, in our case), and uses fancy statistics to best fit a single descriptor to your text. The database of words is just a text file, so you can add your own words and their sentiment to adapt the package to your data set, which is especially useful if the text you are analyzing contains uncommon words or word combinations.
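To give a feel for the word-list idea, here is a deliberately simplified base R sketch that looks words up in a tiny lexicon and counts positive versus negative hits. It is not the package's method (the package fits a naive Bayes classifier over a far larger list); the lexicon entries and the score_polarity function are hypothetical names invented for this example.

```r
# Tiny hypothetical lexicon; the real package ships a much larger word list.
lexicon <- c(love = "positive", happy = "positive",
             lonesome = "negative", cry = "negative")

# Adapting the lexicon to your own data set by adding a custom word.
lexicon["blues"] <- "negative"

# Illustrative scorer: plain counting, NOT the package's Bayesian classifier.
score_polarity <- function(text, lexicon) {
  # Lowercase, then split on anything that is not a letter.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Keep only the labels of words that appear in the lexicon.
  hits <- lexicon[words[words %in% names(lexicon)]]
  pos <- sum(hits == "positive")
  neg <- sum(hits == "negative")
  if (pos > neg) "positive" else if (neg > pos) "negative" else "neutral"
}

print(score_polarity("I love you, I'm so happy", lexicon))
print(score_polarity("Lonesome blues make me cry", lexicon))
```

Counting like this ignores negation and context, which is one reason a statistical classifier is used instead.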
Below is the R code that reads in the data file (a two-column CSV with the song name and lyrics), strips numbers and punctuation marks, converts all words to lowercase, runs the classification, and outputs the results to a new CSV.
# load library
library(sentiment)
# load data
data <- read.csv("words.csv")
# remove numbers
data[,2] = gsub("[[:digit:]]", "", data[,2])
# remove punctuation
data[,2] = gsub("[[:punct:]]", "", data[,2])
# convert to lowercase
data[,2] = tolower(data[,2])
# classify emotion
class_emo = classify_emotion(data, algorithm="bayes", prior=1.0)
# get best fit
emotion = class_emo[,7]
# classify polarity
class_pol = classify_polarity(data, algorithm="bayes")
# get best fit
polarity = class_pol[,4]
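# combine the original data with the best-fit emotion and polarity
sentiment = cbind(data, emotion, polarity)
# write the results to a new CSV (the file name here is a placeholder)
write.csv(sentiment, "sentiment.csv", row.names=FALSE)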
The rest was Tableau fun and the result is below. I hope you’ll enjoy.