Friday, June 10, 2016

Group Project: Sentiment Analysis of Poetry in Python (DHSI 2016)

I took a one-week course on Coding Fundamentals at DHSI 2016 with Dennis Tenen (Columbia University) and John Simpson (University of Alberta). You can see the syllabus for the course here.

Let me start with a quick plug for Dennis Tenen's group at Columbia, the "Group for Experimental Methods in the Humanities." You can see some of the projects they are doing at their Github site; one in particular that seems really interesting is RikersBot, a Twitter bot that conveys a series of statements from inmates at Rikers Island Prison in New York. It was created as a joint project between Columbia University students and Rikers inmates interested in learning coding; part of the project involved teaching all of the young people in the class the coding they would need to build a Twitter bot. The bot is currently not active, but the stream it produced over several months is well worth a look.

*

Why coding? I wanted to get started with coding because it seems to be one of the major dividing lines between people who can chart their own independent course through the digital humanities and people who work with ideas and tools developed by others. It's not the be-all and end-all, of course (as I've said before, you can do so much now with off-the-shelf tools), but some experience with coding seems like it could be really helpful for projects that don't quite fit the mold of what's come before.

The class itself was intense, frustrating, and sometimes really fun. I'm not going to lie: learning how to code is hard. I can't say that I will readily be able to start spitting out Python scripts after four days of working with the language, but I might at least be able to figure out how to a) write some simple scripts to process batches of text files that would otherwise require repetitive, laborious work, and b) use libraries of code developed by others in Python to do more advanced things.

*



Starting Thursday morning, we were working in small groups on various projects of our own design and choice. The group I ended up with was interested in learning how to do sentiment analysis in Python with various bodies of poetry.

I have talked about my explorations with sentiment analysis before; last fall, I spent some time learning the rudiments of how to apply Matt Jockers' sentiment analysis package in R to a series of Victorian novels, and wrote about it here. Sentiment analysis is a method that many literary scholars in particular don't fully trust, since it depends on the algorithm's ability to parse and accurately understand the emotional valence of individual sentences. Individual sentences can 'trick' the parser ("I was extremely happy yesterday; today, not so much."), so the data is most meaningful when the sample size is large (if a novel has 10,000 sentences, it's unlikely that more than a few will be tricky in that way). Another concern is of course the idea of quantifying sentiment to begin with -- the algorithm can't really pick up on irony or hyperbole.

And yes, it might be worth acknowledging the psychological truth that human sentiments -- human emotions -- don't follow a 1-dimensional axis (-1 to +1). "Elated" and "ecstatic" are both highly positive sentiments, but they are qualitatively different from one another. But as of right now, this is what we have to work with (a dream project might entail recalibrating the entire sentiment analysis algorithm away from a 1-dimensional / linear measure and trying to structure it as a 2-dimensional or 3-dimensional array of linguistic representations of emotional states...).

So, for now, sentiment is measured on a scale from -1 to +1, with +1 a very 'happy' value and -1 a very 'unhappy' value. The package we were using in Python also measured an additional quality, called "Subjectivity." For the sentence, "I was extremely happy yesterday; today, not so much," the algorithm gives us a value of 0.5 on sentiment (moderately positive), and 0.6 on subjectivity, which is just above the half-way point between "very objective," or concrete (0), and "very subjective" (1).
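For anyone who wants to try this on a single sentence, here is a minimal sketch of how the two numbers can be requested from TextBlob (the library named later in this post). It assumes TextBlob is installed; the helper names score_sentence and describe are made up for illustration, and the cut-off at 0.5 in describe is an arbitrary choice of mine, not something the library defines.

```python
def score_sentence(sentence):
    """Return (polarity, subjectivity) for one sentence using TextBlob."""
    # Imported inside the function so the rest of the sketch loads
    # even on a machine where TextBlob is not installed.
    from textblob import TextBlob

    sentiment = TextBlob(sentence).sentiment
    return sentiment.polarity, sentiment.subjectivity


def describe(polarity, subjectivity):
    """Turn the two scores into a rough label (hypothetical helper)."""
    if polarity > 0:
        tone = "positive"
    elif polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"
    # Assumption: treat anything above the midpoint as "subjective".
    stance = "subjective" if subjectivity > 0.5 else "objective"
    return f"{tone}, leaning {stance}"


# Example use (requires TextBlob):
# polarity, subjectivity = score_sentence(
#     "I was extremely happy yesterday; today, not so much."
# )
# print(polarity, subjectivity, describe(polarity, subjectivity))
```

Polarity comes back on the -1 to +1 scale and subjectivity on the 0 to 1 scale described above.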

Our Code / My Particular Data Set 

The question I started with for my own data set for this micro-project was, "Is Modernist poetry darker than Victorian poetry?" (Hypothesis: Yes.) I was inspired to pursue the question of the gap between Modernist and Victorian poetry by Ted Underwood's blog post from May 2015, where he showed that in terms of vocabulary and the prestige economy, there are more continuities between Victorian and early 20th century literary scenes than differences.

For reasons of time I had to work with a very limited data set (50 Victorian poems, 50 modernist poems); if I were to seriously pursue this project, I would try to work from a much larger data set (e.g., using archives from MJP). Others in my group had their own data sets and questions they were trying to answer. The nice thing about this particular bit of code is that it could work for what I wanted to do as much as for any number of other projects.

I took two text files from Project Gutenberg: the Oxford Book of English Verse (published in 1914; I used only the Victorian poets in the anthology) and Some Imagist Poets (1916; the Amy Lowell imagist anthology). I thought the fact that the Oxford anthology was published at almost exactly the same time as Some Imagist Poets might act as a sort of control: this is what an Oxford editor in the 1910s thought was the most representative Victorian poetry (what the Victorians themselves thought at the time might have looked different; what editors of today's big anthologies think is also different, and potentially distorting). The Oxford Book of English Verse had mostly British writers, though writers like Emerson and Poe are also included amongst the Victorians.

Each dataset had about 50 poems (admittedly a very small sample size!).

Just by taking an average of the +/- polarity the sentiment analyzer produced, I saw a modest validation of my hypothesis -- Some Imagist Poets has an average sentiment value of 0.03 (just about balanced), while the Victorian writers in the Oxford Book of English Verse have an average sentiment value of 0.13. Since quite a number of the poems have values between -0.5 and 0.5 (you need pretty extreme emotional states to get outside that window), a difference of 0.1 is actually a fairly substantial one. (It would be more convincing if it were to hold up on an expanded dataset.)
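The averaging step itself is simple enough to sketch. The snippet below is only an illustration of the arithmetic: the function name polarity_gap is hypothetical, and the scores are made-up placeholder numbers, not the values from the two anthologies.

```python
from statistics import mean


def polarity_gap(group_a, group_b):
    """Return mean(group_a) - mean(group_b) for two lists of polarity scores."""
    return mean(group_a) - mean(group_b)


# Made-up polarity scores for illustration only,
# NOT the actual output for either anthology.
victorian_scores = [0.30, 0.10, 0.05, 0.15]
imagist_scores = [0.10, -0.05, 0.00, 0.05]

gap = polarity_gap(victorian_scores, imagist_scores)
print("Victorian mean:", round(mean(victorian_scores), 3))
print("Imagist mean:", round(mean(imagist_scores), 3))
print("Gap:", round(gap, 3))
```

A positive gap means the first group scores as "happier" on average than the second.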

With help from one of the colleagues in my group (thanks Josh K.), who was an expert in data visualization, I took the data we output from our Python script and created the following preliminary chart using Tableau data visualization software. The X-axis shows subjectivity, while the Y-axis shows polarity (+/- sentiment value).





This chart includes subjectivity (0 to 1) as well as sentiment (-1 to +1). The orange dots are Victorian poems, while the blue dots are Modernist. One thing I think we can see here is that the Modernist poems in Some Imagist Poets seem to have a clear clustering along the lines of more concrete, more negative, while the Victorians have a cluster (on the upper right) that is more subjective, more positive.

When I have time and energy I'll try to go back into my data and link each data point to the particular names of poems. For now, they are just numbered "poem1," "poem2," etc.
(I also want to break the poets in the data sets down by gender and compare the sentiment values of the men and the women... Maybe later!)

*

The coding process. Since we were first starting out with this language and with coding in general, even getting Python to do simple things was a challenge at first. It's easy to open a single text file into memory and do something with it; it's significantly harder to open a batch of them and perform the same operation on a whole directory of files. A big stumbling block is data types: strings (i.e., text) are iterable (you can loop over them one character at a time), while integers aren't. Some steps in our process required data in the form of strings, while others required lists.
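To make those two stumbling blocks concrete, here is a small sketch. It is not the script we wrote in class; the function names read_poems and is_iterable are invented for this example, and it assumes the poems are saved as UTF-8 .txt files in one folder. The demo builds a throwaway folder with placeholder text so it runs anywhere.

```python
import tempfile
from pathlib import Path


def read_poems(folder):
    """Return {filename: text} for every .txt file in the folder."""
    poems = {}
    for path in sorted(Path(folder).glob("*.txt")):
        poems[path.name] = path.read_text(encoding="utf-8")
    return poems


def is_iterable(value):
    """Return True if Python can loop over the value, False otherwise."""
    try:
        iter(value)
    except TypeError:
        return False
    return True


# Demo with a temporary folder and placeholder text (not real poems).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "poem1.txt").write_text("first placeholder poem", encoding="utf-8")
    Path(folder, "poem2.txt").write_text("second placeholder poem", encoding="utf-8")
    poems = read_poems(folder)

print(sorted(poems))           # the two file names
print(is_iterable("poem"))     # a string can be looped over
print(is_iterable(["a", "b"])) # so can a list
print(is_iterable(42))         # an integer cannot
```

The same loop that handles two files handles two hundred, which is the whole appeal of batch processing.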

We figured out how to open the files; we then started to work on how to use the TextBlob library for Python and feed the text we had called up into memory into the sentiment analyzer algorithm that is included with TextBlob. Finally, we wanted to output our results to a new file we were creating, which we called "Workfile." Each of these steps was, not surprisingly, harder than we thought it would be. With help from our teachers, we managed to figure it out, and successfully got an output file with Sentiment values in the form of a comma-separated list. (I should say that the others in my group were especially good at resolving the problem of getting the data in the right form for TextBlob to be able to interpret it; I found it a struggle... )
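The three steps described above (read the text, score it, write comma-separated output) can be sketched roughly as follows. This is a reconstruction under assumptions, not our actual class script: the names textblob_scorer, write_scores, and placeholder_scorer are hypothetical, the column layout is my own guess, and the demo uses a stand-in scorer with fixed numbers so that it runs without TextBlob installed.

```python
import csv
import io


def textblob_scorer(text):
    """Score text with TextBlob; returns (polarity, subjectivity)."""
    # Imported here so the sketch still loads without TextBlob installed.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def write_scores(poems, out_file, scorer):
    """Write one CSV row per poem: name, polarity, subjectivity."""
    writer = csv.writer(out_file)
    writer.writerow(["poem", "polarity", "subjectivity"])
    for name, text in poems.items():
        polarity, subjectivity = scorer(text)
        writer.writerow([name, polarity, subjectivity])


def placeholder_scorer(text):
    """Stand-in scorer returning fixed numbers, for the demo only."""
    return 0.5, 0.6


# Demo: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_scores({"poem1": "placeholder text"}, buffer, placeholder_scorer)
print(buffer.getvalue())
```

To write a real file, open it with open("workfile.csv", "w", newline="") (the file name here is just an example) and pass textblob_scorer as the scorer. Keeping the scorer as a separate argument also makes it easy to swap in a different sentiment library later.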

With a comma-separated list it's easy to put the data into a spreadsheet and produce a simple visualization. My colleague had experience working in a much more advanced dataviz application called Tableau, and he was kind enough to walk me through the methodology that allowed us to produce the chart above. (Tableau offers a free edition, Tableau Public.)

So that's it, so far. The project in and of itself isn't anything terribly exciting at present, but it could potentially be the basis of something bigger down the road. (We'll see. First, a nap.) 
