As Claire mentioned, I do have a post coming up about network graphs. And, I should note, I ended up with more static images from that experiment than from all the other experiments I performed combined. There’s something about networking that makes me want to document every single change I make to the image.
Anyway, that’s not this post. This post is a return to the problem of the pretty, which I have not touched on in a while. I am at DHSI, the Digital Humanities Summer Institute, taking a course with David Hoover on seeing what can be done with text analysis. So while the focus of the class is on playing with textual analysis and seeing what it can provide on an interpretive level (or possibly just the “hmm, I wonder what this button does” level), almost all the analyses we have done have produced, almost as an afterthought, a visualization.
So I couldn’t resist.
Here are the results of three linked analyses done in Minitab, which is statistical analysis software. My basic relationship with this software is as follows: I can interpret the results. I even have a pretty good broad idea of how the computer gets the results. But if you want a concrete explanation, I soon get very, very lost in my own verbiage, and I recommend that you do what I do when I don’t know what to do…turn to Google.
My emphasis here is on the visualizations, however, so let us move towards those.
1) Cluster Observations of the Daniel Deronda Chapters
So…what do we make of this?
I’ve taken a leaf out of David Hoover’s book and color coded the labels to make it easier to see patterns. The green are the chapters from Daniel’s perspective, the purple are the chapters from Gwendolen’s perspective and the blue are chapters in which the perspectives switch back and forth between them. Uncolored chapters are those from the perspective of non-main characters. This was done in Pixelmator (a reasonably-priced image editing app for Mac) at some less than salubrious hour of the night and it rather shows.
Also, I spent several minutes that felt like hours last night trying to change the colors of the dendrogram (as I discovered, the technical term for this tree is a dendrogram). Of course, when I went back to it this morning, I figured out how to do it by accident. However, given the choice between recoloring the entire bottom or simply dealing with the ugly shade of red, I chose laziness over aesthetics.
Anyway, back to the graph. This analysis takes the 990 most frequent words in the novel after all proper names and gendered pronouns have been removed* and uses them to try to determine which chapters are most like other chapters based on how they use those words. It then shows the clusters (hence cluster observations), with the heights of the linkages in the dendrogram conveying degree of similarity.
So what this tells us is that the word usage is different enough that the top 1000 words, even without names, are enough to broadly distinguish between the chapters dealing with Deronda and those dealing with Gwendolen (It’s worth remembering that this is the top 1000 of ALL words in the novel, including words like “the,” “I” and “and”). It’s not perfect, but as Eliot is writing a fair amount of free indirect discourse, I would have been more surprised if it were. I’m actually surprised that it worked as well as it did.
And if you’re wondering about that weird blue line that is separate from the entire rest of the novel, it’s a very short chapter that consists almost entirely of a letter to Deronda from his long-lost mother. I had to go and look up the chapter to see why it was weird, but once I had, it made perfect sense. (The human brain, of course, can justify anything, but I happen to think mine is right in this case.)
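For anyone who would rather poke at this kind of clustering outside Minitab, here is a minimal Python sketch of the general idea. It is not a reconstruction of what Minitab does: the toy chapter texts, the exclusion list, the tiny `TOP_N`, and the choice of Euclidean distance with average linkage are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stand-ins for chapters; real input would be the full chapter texts.
chapters = {
    "ch1": "she looked at the table and she smiled at the game",
    "ch2": "she smiled and looked at the game on the table",
    "ch3": "he thought of the river and of what he must do",
    "ch4": "he must think of what to do and of the river",
}
# Words dropped before counting (names and gendered pronouns); an assumed list.
excluded = {"he", "she", "him", "her", "his", "hers", "gwendolen", "deronda"}
TOP_N = 10  # the post uses 990; tiny here because the toy texts are tiny


def tokenize(text):
    """Lowercase the text, split it into words, and drop excluded words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in excluded]


tokens = {name: tokenize(text) for name, text in chapters.items()}
overall = Counter(w for toks in tokens.values() for w in toks)
vocab = [w for w, _ in overall.most_common(TOP_N)]


def profile(toks):
    """Relative frequency of each vocabulary word in one chapter."""
    counts = Counter(toks)
    return [counts[w] / len(toks) for w in vocab]


profiles = {name: profile(toks) for name, toks in tokens.items()}


def distance(a, b):
    """Euclidean distance between two frequency profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster(profiles):
    """Average-linkage agglomerative clustering; returns the merge history."""

    def linkage(pair):
        left, right = pair
        dists = [distance(profiles[a], profiles[b]) for a in left for b in right]
        return sum(dists) / len(dists)

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters whose members are closest on average.
        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


merges = cluster(profiles)
for left, right, height in merges:
    print(f"{height:.3f}  {left} + {right}")
```

Each printed line is one join in the tree, and the number in front is the height at which the join happens: small numbers mean the chapters use the counted words in very similar proportions. A dendrogram is just this merge history drawn as a picture.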
Onto the next graph!
2) Principal Component Analysis. (Adam Crymble, this one’s for you.)
100 mfw means that these divisions are based only on the 100 most frequent words. That’s less accurate, but because I’m going to superimpose all the words used in this analysis on the graph in a minute, I need to keep the number of words down to a size that doesn’t resemble a plague of locusts.
So what’s going on here?
Okay, here’s the version I understand. PCA takes the raw data, in this case word frequency proportions, and “rotates them through multiple dimensions” to figure out the combinations of variables that best represent the variation seen in the data. Once it has worked out these components (using math far beyond my comprehension), it graphs the data on the two principal components, that is, the two that account for the most variation. And that’s the graph. (I am utterly indebted to Charles Shirley for directing me to the following link, http://www.mun.ca/biology/scarr/Lab_4_-_Adaptive_shape_variation_09Nov02.pdf, which helped me understand what is going on.)
In terms of interpretive work, Deronda and Gwendolen don’t really cluster, but they do tend to divide, with Gwendolen showing up predominantly above the x axis and Deronda predominantly below it. So the second principal component, the one that divides these two groups of chapters, seems to track the main character. I have no idea what the first component is, though there is a way to…if not find out, to think about it.
This is a graph that uses the same components as above, but this time it graphs the 100 words used onto the component axes so you can compare where the words fall on the above graph. Imagine it superimposed on the previous graph. (I would do so myself but I haven’t figured out a way to pull it off without making both graphs entirely illegible). It’s…an interesting way to think about the distinctions in word usage. For example, my intuitive response to the way the words are laid out is to guess that the left-right principal component tells you how much dialogue is in a given chapter. The dialogue-heavy chapters are on the right. (Obviously, the principal component itself is much more mathematical than that, but in my terms it seems to equate with heaviness of dialogue). My next step would be to check the chapters themselves and see if my predictions are correct.
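The two PCA graphs above (chapters on the component axes, then words on the same axes) can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not a reproduction of the Minitab analysis: the word list and frequency table are made up, and the data is only centred rather than standardized, which may differ from the settings Minitab applied.

```python
import numpy as np

# Hypothetical data: rows are chapters, columns are word-frequency proportions.
words = ["the", "and", "said", "you"]
freqs = np.array([
    [0.060, 0.030, 0.002, 0.004],
    [0.058, 0.031, 0.003, 0.005],
    [0.045, 0.025, 0.012, 0.015],
    [0.044, 0.027, 0.013, 0.016],
    [0.052, 0.020, 0.006, 0.010],
])

# Centre each word's column so the components describe variation around the mean.
centered = freqs - freqs.mean(axis=0)

# SVD of the centred matrix: the rows of vt are the component directions,
# ordered from most to least variance explained.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

scores = centered @ vt[:2].T      # chapter coordinates on PC1 and PC2
loadings = vt[:2].T               # word coordinates on the same two axes
explained = s**2 / np.sum(s**2)   # share of total variance per component

for word, (pc1, pc2) in zip(words, loadings):
    print(f"{word:>5}  PC1={pc1:+.3f}  PC2={pc2:+.3f}")
print("variance explained by PC1, PC2:", explained[:2].round(3))
```

Plotting `scores` gives the chapter graph and plotting `loadings` gives the word graph. Because both live on the same two axes, a word that sits far to the right pulls chapters that use it heavily to the right as well, which is what makes the "imagine it superimposed" reading possible. The sign of each axis is arbitrary, so left and right (or up and down) may come out flipped from one program to another.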
The fact that these graphs are appearing at the end of this portion of the project is, in its own way, unfortunate. I think of these graphs as a pedagogical tool that works best at the beginning of a project. If you wish to use them as such, what I would suggest doing, once you’ve gotten up the learning curve and worked out how to read them, is use these graphs to direct future research. For example, look at the divide between Eliot’s writing when she writes dialogue-heavy chapters versus when she writes as the narrator (in fact, her narrative voice is quite interesting in its own right). That might be something to look into. And coming up with explanations for why the chapters in the dendrogram divided the way they did could certainly be fun (although suspect from the perspective of valid textual explanations).
However, what I would really do is look for better ways to display the information in these charts. Statistics programs were not invented to create beauty, unless you find order out of chaos to be beautiful. As a project for the future, perhaps over the summer, it might be worth thinking of ways to artistically reinterpret statistical charts so that, even at some cost in fidelity to the data, they still convey the information while being aesthetically appealing enough to draw in the reader. Basically, I want a visualization that attracts both people who love graphs and people who start running in the opposite direction when they see one. That would be a really useful visualization.
*I removed the proper names and pronouns because I already knew the information they could give me. If I found that the best indication that a chapter was about Gwendolen was the frequency of the word Gwendolen, I might get a very accurate result, but not a very interesting one. This way, though perhaps my results will be less accurate in division, they will be more interesting.