Playful Visualizations at Work, Working Visualizations at Play


I Blog Therefore I Am Doing Something

There’s not much to report on the visualization front this week. I have created a couple of elementary (actually, closer to Kindergarten) graphs in R by following the instructions in Matthew Jockers’ Text Analysis with R for Students of Literature, which is currently in draft form but an excellent resource nonetheless. So I have learned some things about the relative frequencies of the words “whale” and “Ahab” and, more importantly, I’m gaining some insight into what else I could do with my newfound knowledge of statistical programming. But my studies in R are still very much at the learning stage and I have yet to reach a point where I can imagine using it in a more playful, exploratory sense. While this is not true of every tool, R is one of those that must be mastered before it can be misused in an interesting manner. Which is not to say that it cannot be used badly – I am getting good at that – but the difference between using a tool badly and using it playfully is a critical distinction. A playful framework is one that eschews the tool’s obvious purpose in order to see what else it can produce; a framework that validates a kind of “What the hell, why not?” approach to analysis. Playfulness exists when we search for new ways to analyze old data and disconcerting methods for presenting it. It can be found in the choice to showcase one’s work as a large-scale, three-dimensional art project and in the decision to bring the history of two Wikipedia articles about the author into one’s examination of the text. It is not, more’s the pity, found in code that fails to execute.*
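For the record, those Kindergarten-level graphs need only a few lines. Here is roughly the kind of thing Jockers walks you through – a minimal sketch, assuming a plain-text copy of Moby Dick saved locally (the file name is just a placeholder):

```r
# A minimal sketch in the spirit of Jockers' Text Analysis with R.
# Assumes a plain-text Moby Dick saved locally as "melville.txt" (placeholder name).
text <- tolower(paste(readLines("melville.txt", warn = FALSE), collapse = " "))
words <- unlist(strsplit(text, "\\W+"))   # crude tokenization on non-word characters
words <- words[words != ""]
freqs <- table(words) / length(words)     # relative frequency of every word
freqs["whale"]
freqs["ahab"]
barplot(freqs[c("whale", "ahab")], main = "Relative frequencies in Moby Dick")
```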

All this adds up to an apology: I have no intriguing word clouds for you this week. I don’t even have any less-than-intriguing word clouds this week. But I do have some thoughts about the nature of this blogging endeavor, nearly a year and a half after it was started.

This blog began as a way to record our visualization experiments in a forum where we could treat them as part of a larger group and where we would be forced to engage with them publicly. It was a way to hold ourselves accountable to our professor, to each other and to ourselves. At the same time, it was a way to provide all our visualizations (even the ones that did not make it into our final seminar papers) with a home and a life beyond our hard drives and walls.

The class has ended and the blog lives on. Last year, it was a place for me to think through a social-network graph of William Faulkner’s Light in August, a project that grew out of the work I did on Daniel Deronda. This year, it’s serving as a repository for experiments that I perform as part of my work in UCSB’s Transcriptions Center.

And throughout those different iterations, one element of common purpose stands out to me. The blog is a place for scholarly work-in-progress. It’s where projects that need an audience but are not meant for traditional publication can go. It’s where projects that have reached a dead end in my mind and require a new perspective can be aired for public consumption. It is, at its most basic level, a way of saying “This work that I am in the process of doing is meaningful.”

And that, I think, is the real key to why I find maintaining this blog – despite my sporadic updating during my exam year – so valuable. Blogging about my work gives me a reason to do it. This might sound simplistic, if not absurd, but bear with me for a moment. Academia is a goal-oriented endeavor. We begin with the understanding that we finish our readings on time in order to have meaningful conversations about them in order to do well in a course. We do our own research in order to write a paper about it in order, once again, to do well in a course or to present it at a conference. (Obviously, I’m not arguing that the only reason anyone reads anything is for a grade, but the fact that graduate students turn said research into a well-argued paper within a reasonable time-frame is tied to the power of the grade.) The books we read, the programs we learn, the courses we teach are oriented towards the dual goals of spreading knowledge in the classroom and publishing knowledge in the form of an article or monograph.

So where does practical knowledge within the digital humanities fit in? In the goal-oriented culture of academia, where is the value in learning a program before you have a concrete idea of what you will use it for? Why learn R without a specific project in mind? Why topic model a collection of books if you’re not really interested in producing knowledge from that form of macroanalysis? My experience with academia has not really encouraged a “for the hell of it” attitude and yet a number of the tools used specifically within the digital humanities require one to invest time and practice before discovering the ways in which they might be useful.

There are several answers to the above questions. One that is used to great effect in this department, and that is becoming more popular at other universities as well, is the digital humanities course. I am thinking in particular of Alan Liu’s Literature+ course, the seminar for which this blog was originally created. By placing digital training within the framework of a quasi-traditional class, we as students are introduced to and taught to deploy digital forms of scholarship in the same manner that we learn other forms of scholarly practice. If we master close reading in the classroom, we should master distant reading in it as well.

And yet, what does one do when the class is over? Styles of human reading are consistently welcome in graduate seminars in a way that machinic readings are not. And there are only so many times one can take the same class, even assuming that one’s institution offers a class like Literature+ in the first place.

The alternative is to take advantage of blogging as a content-production platform. The blog takes over as the goal towards which digital training is oriented. Which is a very long way of saying that I blog so that I have something to do with my digital experiments and I perform digital experiments so that I have something to blog about. Which seems like circular logic (because it is), but the decision to make blogging an achievement like, albeit not on the same level as, producing a conference paper is one that allows me, once again, to hold myself accountable for producing work and results.

This year, “Ludic Analytics” will be my own little Literature+ class, a place where I record my experiments in order to invest them with a kind of intellectual meaning and sense of achievement. Learning to count occurrences of “Ahab” and “whale” in Moby Dick may not be much, but just wait until next week when I start counting mentions of “Gwendolen” and “Deronda”…

*I apologize for the slight bitterness; I spent half an hour today combing through some really simple code trying to find the one mistake. There was a “1” instead of an “i” near the top.


MALLET redux

I considered many alternative titles for this post:

“I Think We’re Gonna Need a Bigger Corpus”

“Long Book is Long”

“The Nail is Bigger, but the MALLET Remains the Same”

“Corpo-reality: The Truth About Large Data Sets”

(I reserve the right to use that last at some later date.) But there is something to be said for brevity (thank you, Twitter) and, after all, the real point of this experiment is to see what needs to be done to generate better results using MALLET. The biggest issue with the previous run – as is inevitably the case with tools designed for large-scale analysis – was that I was using a corpus that consisted of one text. My goal, this time around, is to see what happens when I scale up. So I copied the largest 150 novels out of a collection of 19th- and early 20th-century texts that I happened to have sitting on my hard drive and split them into 500-word chunks. (Many, many thanks to David Hoover at NYU, who had provided me with those 300 texts several years ago as part of his graduate seminar on digital humanities. As they were already stripped of their metadata, I elected to use them.) Then I ran the topic modeling command in MALLET and discovered the first big difference between working with one large book and with 150. Daniel Deronda took 20 seconds to model. My 19th-century corpus took 49 minutes. (In retrospect, I probably shouldn’t have used my MacBook Air to run MALLET this time.)
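For what it’s worth, the splitting step itself needs nothing fancy. Here is a rough sketch in R, assuming the novels sit as plain-text files in a directory called “corpus” and the chunks go into a directory called “chunks” (both names are placeholders, not what I actually used):

```r
# A rough sketch of the chunking step. Assumes plain-text novels in "corpus";
# writes 500-word chunks into "chunks". Directory names are placeholders.
dir.create("chunks", showWarnings = FALSE)
for (f in list.files("corpus", pattern = "\\.txt$", full.names = TRUE)) {
  words <- unlist(strsplit(paste(readLines(f, warn = FALSE), collapse = " "), "\\s+"))
  words <- words[words != ""]
  starts <- seq(1, length(words), by = 500)          # one chunk per 500 words
  for (i in seq_along(starts)) {
    chunk <- words[starts[i]:min(starts[i] + 499, length(words))]
    out <- file.path("chunks",
                     sprintf("%s_%03d.txt", tools::file_path_sans_ext(basename(f)), i))
    writeLines(paste(chunk, collapse = " "), out)
  }
}
```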

Results were…mixed. Which is to say that the good results were miles ahead of last time and the bad results were…well, uninformative. I set the number of topics to 50 and, out of those 50 topics, 21 were not made up of a collection of people’s names from the books involved.* I was fairly strict with the count, so any topic with more than three or so names in the top 50 words was relegated to my mental “less than successful” pile. But the topics that did work worked nicely.

So here are two examples. The first is a topic that, to my mind, works quite well and is easily interpretable. The second is a topic that is the opposite of what I want, though it too is fairly easy to interpret.

Topic #1

[Word cloud for Topic #1]

So, as a topic, this one seems to be about the role of people in the world. And by people, of course, we mean MEN.

Topic #2:

[Word cloud for Topic #2]

Now, this requires some familiarity with 19th-century literature. This topic is “Some Novels by Anthony Trollope.” While technically accurate, it’s not very informative, especially not compared to the giant man above. The problem is that, while it’s a fairly trivial endeavor to put the cast of one novel into a stop list, it’s rather more difficult to find every first and last name mentioned in 150 Victorian novels and take them out. In an even larger corpus (one with over 1,000 books, say), these names might not be as noticeable simply because there are so many books. But in a corpus this size, a long book like “He Knew He Was Right” can dominate a topic.

There is a solution to this problem, of course. It’s called learning how to quickly and painlessly (for a given value of both of those terms) remove proper nouns from a text. I doubt I will have mastered that by next week, but it is on my to-do list (under “Learn R,” which is, as with most things, easier said than done).
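A stopgap, should I want one before then, might be a crude heuristic rather than real named-entity recognition: drop any word that only ever appears capitalized in the text, on the theory that proper names rarely show up in lowercase. A sketch of that idea in R (heuristic only, with obvious failure cases):

```r
# A crude stopgap, not real named-entity recognition: drop any word that only ever
# appears capitalized, on the theory that proper names rarely occur in lowercase.
# Sentence-initial words that never recur in lowercase will be lost too, which is
# one of several reasons this is only a heuristic.
drop_probable_names <- function(text) {
  tokens <- unlist(strsplit(text, "[^A-Za-z']+"))
  tokens <- tokens[tokens != ""]
  seen_lowercase <- unique(tokens[tokens == tolower(tokens)])  # word types seen in lowercase
  keep <- tolower(tokens) %in% seen_lowercase
  paste(tokens[keep], collapse = " ")
}
```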

In the meantime, here are six more word clouds culled from my fifty. Five of these are from the “good” set and one more is from the “bad.”

Topic #3:

[Word cloud for Topic #3]

Topic #4:

[Word cloud for Topic #4]

(I should note, by the way, that “party” appears in another topic as well. In that one, it means party as in a celebration. So MALLET did distinguish between the two parties.)

Topic #5:

[Word cloud for Topic #5]

Topic #6:

[Word cloud for Topic #6]

Topic #7:

[Word cloud for Topic #7]

Topic #8:

[Word cloud for Topic #8]

There are 42 more topics, but since I’m formatting these word clouds individually in Many Eyes, I think these 8 are enough to start with.

So the question now on everyone’s mind (or, certainly, on mine) is what do I do with these topic models? I could (and may, in some future post) take some of the better topics and look for the novels in which they are most prevalent. I could see where in the different novels reading is the dominant topic, for example. I could also see which topics, overall, are the most popular in my corpus. On another note, I could use these topics to analyze Daniel Deronda and see what kinds of results I get.
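The first of those follow-ups is mostly a matter of reading MALLET’s doc-topics file back into something like R. A sketch, assuming the newer MALLET output format with one proportion column per topic (older versions emit topic/weight pairs instead and would need reshaping first); the file name and topic number below are placeholders:

```r
# Sketch: which 500-word chunks (and hence which novels) give the most weight to a
# given topic, read from MALLET's --output-doc-topics file. Assumes one column per
# topic after the doc index and file name; adjust if your MALLET version differs.
doc_topics <- read.table("doc-topics.txt", header = FALSE, sep = "\t",
                         stringsAsFactors = FALSE)
names(doc_topics)[1:2] <- c("doc", "file")
topic <- 3                                   # hypothetical topic id (MALLET counts from 0)
prop <- doc_topics[[topic + 3]]              # column offset: doc, file, then topics 0, 1, ...
doc_topics$file[order(prop, decreasing = TRUE)[1:10]]   # the ten chunks where it is strongest
```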

Of course, I could also just stare up at the word clouds and think. What is going on with the “man” cloud up in topic 1? (Will it ever start raining men?) Might there be some relationship between that and evolving ideas of masculinity in the Victorian era? Why is “money” so much bigger than anything else in topic #6? What does topic #7 have to say about family dynamics?

And, perhaps the most important question to me, how do you bring the information in these word clouds back into the texts in a meaningful fashion? Perhaps that will be next week’s post.

*MALLET allows you to add a stopwords list, which is a list of words automatically removed from the text. I did include the list, but it’s by no means a full list of every common last name in England. And, even if it were, the works of Charles Dickens included in this corpus would leave it utterly stymied.

Hammering at Daniel Deronda

This time, we are using a MALLET!

(I apologize for the pun, but it does not seem to get old).

MALLET stands for MAchine Learning for LanguagE Toolkit and is proof that, among other things, there is no such thing as an impossible acronym. MALLET is a Java-based package designed for multiple kinds of natural language processing/machine learning, including what I used it for – Topic Modeling.

So what is Topic Modeling? Well, let’s say that texts are made up of a number of topics. How many? That depends on the text. So every word in that text (with the exception of common words like “an”) should be related to one of those topics. What MALLET does in topic-modeling mode is sort the words of a set of texts into X topics (where X is your best guesstimate of how many there should be) and output the words assigned to each topic, along with a shorter list of top words for each. Your job, as the human, is to guess what those topics are.

For more on the idea behind topic modeling, check out Matthew Jockers’ Topic Modeling Fable for the decidedly non-technical version or Clay Templeton’s Overview of Topic Modeling in the Humanities.

Now for the second question – why am I doing it? Beyond the “well, it’s cool!” and “because I can,” that is, both of which are valid reasons, especially in DH. And my third reason is a subset of the second, in a way. I want to test the feasibility of topic modeling so that, as this year’s Transcriptions Fellow*, I can help others use it in their own work. But in order to help others, I need to first help myself.

So, for the past two weeks or so, I’ve been playing around with MALLET, which is fairly easy to run and, as I inevitably discovered, fairly easy to run badly. Because of the nature of topic modeling, which is less interested in tracking traditional co-occurrences of words (i.e. how often two specific words are found within 10 words of each other) and more interested in seeing text segments as larger grab-bags of words where every word is equidistant from every other**, you get the best topic models when working with chunks of 500-1000 words. So after a few less-than-useful results when I had divided the text by chapters, I realized that I needed a quick way to turn a 300,000+ word text file into 300+ 1000-word text files. Why so long a text? Well, George Eliot’s Daniel Deronda is in fact a really long text. Why Daniel Deronda? Because, as the rest of this blog demonstrates, DD has become my go-to text for experimenting with text analysis (and, well, any other form of analysis). So I have MALLET, I have Daniel Deronda, I now also have a method for splitting the text thanks to my CS friends on Facebook and, finally, I have IBM’s “Many Eyes” visualization website for turning the results into human-readable graphics. All that’s missing is a place to post the results and discuss them.

I knew Ludic Analytics would not let me down. So, without further ado, I present the 6 topics of Daniel Deronda, organized into word clouds where size, as always, represents the word’s frequency within the topic:

Topic 1:

[Word cloud for Topic 1]

Topic 2:

[Word cloud for Topic 2]

Topic 3:

[Word cloud for Topic 3]

Topic 4:

[Word cloud for Topic 4]

Topic 5:

[Word cloud for Topic 5]

Topic 6:

[Word cloud for Topic 6]

 

You will notice that the topics themselves do not yet have titles, only identifying numbers. Which brings us to the problem with topic modeling small text sets – too few examples to really get high-quality results that identify what we would think of as topics. (Also, topic modeling is apparently better when one uses a POS (part-of-speech) tagger and even gets rid of everything that isn’t a noun. Or so I have heard.)

Which is not to say that I will not take a stab at identifying them, not as topics, but as people. (If you’ve never read Daniel Deronda, this may make less sense to you…)

  1. Daniel
  2. Mordecai
  3. Society
  4. Mirah
  5. Mirah/Gwendolen
  6. Gwendolen

I will leave you all with two questions:

Given the caveat that one needs a good-sized textual corpus to REALLY take advantage of topic modeling as it is meant to be used, in what interesting ways might we play with MALLET by using it on smaller corpora or single texts like this? Do the 6 word clouds above suggest anything interesting to you?

And, as a follow-up, what do you make of my Daniel Deronda word clouds? If you’ve never read the text, what would you name each topic? And, if you have read the text, what do you make of my categorizations?

*Oh, yes. I’m the new Graduate Fellow at the Transcriptions Center for Literature & the Culture of Information. Check us out online and tune in again over the course of the next few weeks to see some of the exciting recent developments at the Center. Just because I haven’t gotten them up onto the site yet doesn’t mean they don’t exist!

**This is a feature, not a bug. Take, for example, a series of conversations between friends in which, in every conversation, they always reference the same 10 movies, although not always in the same order. MALLET would be able to identify that set of references as one topic – one that the human would probably call movies – while collocation wouldn’t be able to tell that the first movie and the last movie were part of the same group. By breaking a long text up into 500-1000 word chunks, we are approximating how long something stays on the same topic.

The Limits of Social Networks

Though we have mostly gone our separate ways over the past year, I find that I am attached to the idea of the LuAn collective and want to keep it going just a bit longer. After all, you never know when you might need a data viz blog that you co-run.

As a second-year student in the English department at UCSB, I am gearing up to take (i.e. reading madly for) my qualifying exams this June. As luck would have it, I am also finishing up my course requirements this quarter, so I find myself in the…unenviable position of writing a paper on a topic that would ordinarily lie far outside my interests in the 19th-century English novel: William Faulkner. So I did what any digital humanist with an unhealthy interest in visualization would do in my situation – I made a graph.

I wanted to write a final paper for this course that reflected my theoretical interests and would allow me to continue developing a subset of my digital skills. Of course, trying to get all of my interests to move in more or less the same direction is like herding kittens, but I had been seeking another opportunity to think through a novel using a social network graph and, well, I wouldn’t have to start from scratch this time. I knew how my graphing software, yEd, worked and I knew how long it took to turn a book into a collection of Excel cells denoting conversations (20% longer than you think it will take, for those of you wondering). So why not create a social network graph of one story in Yoknapatawpha?

Don’t answer that question.

Light in August is widely considered to be the most novel-like of Faulkner’s novels, which made it a good choice for my project. After all, I had experience turning a novel-like novel into a social network graph and no experience whatsoever with a text like The Sound and the Fury. Much as I was intrigued by and even enjoyed The Sound and the Fury and Absalom, Absalom!, the prospect of figuring out the rules for graphing them was…intimidating to say the least.

For all its novelistic tendencies, Light in August is still decidedly Faulknerian and, in order to work with it, I found myself either revising some of my previous rules or inventing new ones. When I worked on George Eliot’s Daniel Deronda, I had used a fairly simple set of two rules: “A bidirectional interaction occurs when one named character speaks aloud (that is, with quotation marks) to another named character. A unidirectional interaction occurs when a named character speaks aloud about another named character.”

Here are the Faulkner rules:

  1. When one character speaks to another, that interaction is marked with a thicker, dark grey arrow.
  2. When one character speaks about another, that interaction is marked with a thin, dark blue arrow.
  3. When one character speaks to another within another character’s narration (i.e. X is telling a story and, in it, Y talks to Z), that interaction is marked with a thicker, light grey arrow.
  4. When one character speaks about another within another character’s narration, that interaction is marked with a thin, green arrow.

There are several changes of note here. First, I learned more about yEd and figured out how to put properties like line size and color in the spreadsheet itself so that the software would automatically map color and line weight as appropriate. This meant I could make finer and clearer distinctions than last time, at least in terms of showing kinds of communication. Second, I changed the rule about quotation marks because quotation marks don’t necessarily connote audible speech in Faulkner, nor does their absence connote internal monologue. I relied entirely on the dialogue tags in the text to decide whether a sentence was spoken aloud or not. Finally, I changed the rule about named characters. All speaking characters are represented in the graph, regardless of whether or not we are ever told their names. Had I not changed this rule, the number of characters of color represented in this graph would have fallen from 15 to 3. There are 103 distinct nodes in this graph, which means 103 characters speak in this text.
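For anyone curious about the mechanics: underneath the picture, the data is just a spreadsheet of edges with attribute columns. Here is a toy version of the same idea sketched in R with the igraph package, rather than the Excel-plus-yEd workflow I actually used; the rows are invented for illustration, and igraph can export the result as GraphML, which yEd opens directly (mapping the attributes onto visual properties is then done inside yEd).

```r
# A toy edge list in the spirit of the rules above; the rows are invented purely
# for illustration, not taken from my actual Light in August spreadsheet.
library(igraph)
edges <- data.frame(
  from  = c("Byron Bunch", "Lena Grove", "Gail Hightower"),
  to    = c("Lena Grove", "Joe Brown", "Byron Bunch"),
  type  = c("speaks to", "speaks about", "speaks to, within narration"),
  color = c("darkgrey", "darkblue", "lightgrey"),
  width = c(3, 1, 3),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = TRUE)
# igraph writes GraphML, which yEd can open and lay out.
write_graph(g, "light_in_august_sample.graphml", format = "graphml")
```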

Jeffrey Stayton, in an article entitled “Southern Expressionism: Apocalyptic Hillscapes, Racial Panoramas, and Lustmord in William Faulkner’s Light in August” (which, in the interest of full disclosure, I am still in the middle of reading), discusses how Faulkner figures racial landscapes in Light in August as a kind of Southern Expressionism. It is fitting, of course, that one of Willem de Kooning’s expressionist paintings is based on and entitled “Light in August”. But this graph highlights the relationship between fading into the background and remaining unnamed; it shows how easily racial landscapes can become racial backgrounds and how easy it is to elide the unnamed. In the Victorian novel, a certain charactorial parsimony seems to ensure that everyone who speaks is named. Daniel Deronda is 800 pages long and contains 62 character nodes. Light in August is 500 pages long and contains 103. If you remove all the unnamed characters, there are 44 character nodes. (For those of you counting, that’s 38/88, close to half of the white characters, and 12/15, or four fifths, of the black characters. The other 8 are groups of people, who seem to speak and are spoken to fairly often in this text.)

There are several ways to interpret this difference and I am loath to embrace any of them without, frankly, having done more work both with Faulkner and with the Victorian novels. One of the things I find striking, though, is that Light in August seems to be making visible (though only just) things that are either not visible or entirely absent in Daniel Deronda. Light in August is told from different characters’ viewpoints and the narration always locates itself in their perspective and confines itself to what they know. So the graph becomes a record not only of what they have seen, but also of how they have seen it.

I can hear some of you grumbling “What graph? You haven’t shown us a graph yet!”

My apologies. For that, I will give you three. Anything worth doing is worth overdoing.

1) The first graph.

[Image: Light in August social network, organic disk layout] Click to see it in full size.

In this graph, color corresponds to importance, as determined by number of interactions. The darker the color, the more interactions that character has had. That dark red mark in the middle is Joe Christmas.

2) The graph without the unnamed characters

[Image: Light in August social network, organic disk layout, unnamed characters removed]

Click for full size.

Color means the same thing here that it did in the previous graph.

There are several differences between the two graphs. Obviously, the second is legible in a way that the first one is not, which is not entirely a virtue. When it comes to graphing, legibility and completeness tend not to walk hand in hand. The more you leave out, the more you can see; contrapositively, the less you can see, the less you have left out. The best-of-both-worlds solution is to use both images.

Interestingly enough, there are no unconnected nodes in the second image, even though I deleted more than half of the nodes in the graph. That surprised me. I expected to find at least one person who was only connected to the network through one of the unnamed characters, but there’s no such person. And many of the people who remain are not characters I would consider to be important to the story. (Why has the entire history of the Burden family remained more or less intact? Who is Halliday, anyway?)

These are questions to be solved, or at least pondered. They are, at any rate, questions worth asking. If the network remains intact without these characters, what does their presence signify? What has changed between the first graph and the second?

After all, I do have a paper to write from all of this.

I promised you a third graph, did I not? This one moves in a rather different direction. As part of its ability to organize and rearrange your graph, yEd has a grouping functionality and will divide your graph into groups based on the criteria you choose. I had it use natural clustering.

A grouping into natural clusters should fulfill the following properties:

  • each node is a member of exactly one group,
  • each node should have many edges to other members of its group, and
  • each node should have few or even no edges to nodes of other groups.

yEd gave me 8 distinct groups, two of which had only two nodes in them.

[Image: Light in August social network, grouped into natural clusters]

As always, click for full-size.

I assume that when yEd said that the groups would have few or no edges to nodes in other groups, it was doing the best it could with the material I gave it. I then had yEd rearrange the positions of the nodes so that the centrality of a node’s position within a group indicates how many connections it has.
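I do not know exactly what algorithm yEd’s natural clustering uses under the hood, but for comparison’s sake, an analogous operation in R is modularity-based community detection with igraph. A sketch, run on the toy graph from the earlier snippet rather than my real data:

```r
# Not yEd's own "natural clustering," but an analogous grouping: modularity-based
# community detection in igraph. 'g' is the toy graph from the earlier sketch.
g_und <- as.undirected(g, mode = "collapse")   # Louvain clustering wants an undirected graph
groups <- cluster_louvain(g_und)
membership(groups)                             # which group each character lands in
sizes(groups)                                  # how many characters per group
```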

What I love about this graph is how it divides Light in August into a set of six interconnected but distinct narratives. Each group larger than two centers on a specific character or group of characters involved in one thread of narrative. Joe Christmas, who is arguably the main character, has one section (along with a plurality of the other characters of color); Lena Grove, Byron Bunch and Joe Brown are all grouped together in another and, while they talk about the characters in Joe Christmas’s section quite often, they have only three conversations with the characters in that group. Those are the two largest groups. Percy Grimm, for all that he only appears in one chapter, manages to collect 7 other nodes around himself and does seem, in his own way, to be the protagonist of his own story who just walked into this one for one chapter and then left again. He is also the only named character in his section.

Social network graphs are, for me, a way of re-encountering a text. They strip away most of the novel and model only a small portion of what is present in the text, but that portion becomes both visible and analytically available in a new way. (I think seeing and visibility will become a theme in this paper, once I write it.) The title of this course is “Experimental Faulkner”. I like to think that this qualifies.

Visualizations and Pedagogy

One of the questions that I feel has been lurking at the back of my mind over the course of this project, but that hasn’t really gotten much screen time, is that of pedagogy. I’ve thought about how visualizations inform and engage their viewers, but that has been fairly tightly focused on Creator+Image versus Image+Viewer, rather than the question currently on my mind: how exactly can we use visualization in the classroom?

The impetus for thinking about this question comes from an article I read last week: Five-Picture Charades: A Flexible Model for Technology Training in Digital Media Tools and Teaching Strategies. It describes a kind of visualization production modeled on the game of charades – using cameras and photo-editing software, much of which is free, to give future teachers a way to integrate both technology and engaging activities into the classroom.

I was taken with it, as it represented yet another way to create images out of literature, but in a manner that seemed to embrace some of the…let’s call them features of visualizations that I have been struggling with. As you may recall from previous posts, we’ve all thought about the problem of meaning making in visualizations and how the images we create always tell us far more than they tell the viewers. The act of creating a visualization educates far better than the act of seeing it. The game of charades is predicated on this point. First of all, it involves the students (or, in this case, teachers experimenting with it) in the actual production in a way that is fun and that forces them to think about how to translate their impressions of the work into another medium. But, perhaps more importantly, it actually takes advantage of the disconnect between the creator and the audience. Charades is focused entirely on conveying information through a visual format, so the creators need to think about whether they’re doing the best job they can of conveying information, while the audience also needs to work to understand the visualization. By turning visualization into a game, the viewers become participants.

So does this help bridge the gap between creator and viewer, introducing this new kind of ludic element into the mix?

And what do you think about these kinds of classroom visualizations? Are they helpful educational tools or gimmicks to replace engagement with entertainment? (And does that question depend on how old the class is?)

The Social Network (of Daniel Deronda)

Since this project’s beginning, I had toyed with the idea of doing a social network graph that would look at the relationships between all the characters in the novel. I was aware that this would be a substantially larger undertaking than any of the other visualizations I had in mind, which perhaps explains why I left it for last. Despite forewarning myself, I grossly underestimated how difficult it would be and set off to code character interactions over the course of 70 chapters in an 800-page novel. As an experience that opened up the novel to me in all sorts of new ways, it was wonderful. As a mix between skimming and data entry, it was profoundly unpleasant.

But enough lamenting the plight of the digital scholar; that’s boring. Here are the results:

Now for the specs. In order to create this graph, I needed to set some rules for what qualified as interaction. A bidirectional interaction occurs when one named character speaks aloud (that is, with quotation marks) to another named character. A unidirectional interaction occurs when a named character speaks aloud about another named character. The chart does not differentiate between two people who gossip about one another and two people who actually speak to one another. Also, the chart only shows the presence or absence of interaction; it does not add weight to the edges based on how many times interactions took place. I am aware that this is less than ideal, but as this is just my first foray into social network graphing, I have not yet worked out the full range of the software’s abilities. I have the data to create that graph, just not the know-how. But I plan to work it out when I have the chance.
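For what it’s worth, the weighting step itself should be mechanical once I learn the software: collapse repeated interactions into a count and let edge thickness carry it. A sketch of that idea in R with igraph, using an invented interaction log rather than my real spreadsheet:

```r
# Sketch: turning a log of individual interactions into a weighted edge list.
# The rows here are invented placeholders, not my actual Daniel Deronda data.
library(igraph)
interactions <- data.frame(
  from = c("Gwendolen", "Gwendolen", "Deronda", "Deronda"),
  to   = c("Grandcourt", "Grandcourt", "Mirah", "Mordecai"),
  stringsAsFactors = FALSE
)
weighted_edges <- aggregate(weight ~ from + to,
                            data = transform(interactions, weight = 1), FUN = sum)
g <- graph_from_data_frame(weighted_edges, directed = TRUE)
E(g)$weight   # repeated conversations now show up as heavier edges
```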

Anyway, this graph was generated by the graphing software yEd. I told it to place the characters in a single circle and to use color to convey a character’s centrality (darker colored nodes have more connections to the other nodes). Then I just played around with the background because I am a sucker for light on dark presentation.

Here’s where it gets fun. I told the software to redraw the graph based on the groups it thought that the characters should be divided into (well, not in so many words, but that was how I translated the instructions in my head). The resulting graph is below.

Cool, right? The weirdest part, for me, was that Mrs. Davilow (Gwendolen’s mother) is at the center of the giant social cluster rather than Gwendolen herself. I have a few ideas as to why she might be–she’s more important than I tend to give her credit for–but I’m leery of creating post-hoc explanations for something that could simply be a software quirk. Still, it’s provocative.

The other point I want to make is about families. Here is another version of this graph, this time with immediate family members all colored the same color.

Now, it’s much easier to see which family groups are more connected throughout the novel and which are not. I find it particularly intriguing that upper-middle class families are all spread out along one giant social circle while the lower class families tend to cluster closer together as family groups.

Finally, I did one more thing with this graph. In the spirit of Franco Moretti’s work with Hamlet, where he graphed the social network of the play and then deleted the Danish Prince from the graph, I did the same with both Gwendolen and Deronda, then told yEd to rearrange the groups based on the new data.
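The deletion itself is easy to reproduce programmatically; a minimal sketch with igraph, again on the toy weighted graph from the snippet above rather than my full data set:

```r
# Sketch of the Moretti-style experiment: remove the protagonists and see what,
# if anything, still holds together. 'g' is the toy graph from the previous sketch.
g_without <- delete_vertices(g, c("Gwendolen", "Deronda"))
components(g_without)$no   # how many disconnected pieces remain
degree(g_without)          # who is still talking, and how much
```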

Okay, take a look at the two graphs.

I’d be mean and ask for your thoughts, but as I’m not sure how many of my readers have read Daniel Deronda (not to mention how many readers we have), it would be unfair to ask you for an interpretation. Instead, I will provide you with mine. So here’s the cool thing. The families that grouped together in the previous graph but not in this one were brought together by the actions of the main character – in this case, Deronda. So Mordecai rediscovered his long-lost sister Mirah through Deronda, for example. On the other hand, the families that now group together had their lives disrupted in the book by the actions of the main characters, either Deronda or Gwendolen, depending on the family in question. So if you look at Grandcourt, pictured here with his mistress, Mrs. Glasher, and illegitimate heir, Henleigh, you’ll see that he’s nowhere near them in the graph with Gwendolen. In the text, Gwendolen marries Grandcourt despite knowing that he has a mistress and son who deserve to be legitimized. (Illegitimacy is a theme in this text.) I found it absolutely fascinating that removing the characters from the graph actually mimics what removing them from the book would have done.

So here’s my invitation to you: think about how else these graphs might be able to speak. I used them to construct a specific narrative of family ties throughout the novel based on how the connections behave. How else might you produce new elements of the novel’s narrative using these kinds of graphs? And, if you’ll think back to last week’s thoughts on dynamic social network graphs, how might those really help to structure questions about the novel?

One final note–I am really pleased to have finally produced something using statistical software that I think is pretty. It makes me feel that all is not yet lost.

Back to Textual Basics

As Claire mentioned, I do have a post coming up about network graphs. And, I should note, I ended up with more static images from that experiment than from all the other experiments I performed combined. There’s something about networking that makes me want to document every single change I make to the image.

Anyway, that’s not this post. This post is a return to the problem of the pretty, which I have not touched on in a while. I am at DHSI, the Digital Humanities Summer Institute, taking a course with David Hoover on seeing what can be done with text analysis. So while the focus of the class is on playing with textual analysis and seeing what it can provide on an interpretive level (or possibly just the “hmm, I wonder what this button does” level), almost all the analyses we have done have produced, almost as an afterthought, a visualization.

So I couldn’t resist.

Here are the results of three linked analyses done in Minitab, which is statistical analysis software. My basic relationship with this software is as follows: I can interpret the results. I even have a pretty good broad idea of how the computer gets the results. But if you want a concrete explanation, I soon get very, very lost in my own verbiage and I recommend that you do what I do when I don’t know what to do…turn to Google.

My emphasis here is on the visualizations, however, so let us move towards those.

1) Cluster Observations of the Daniel Deronda Chapters

So…what do we make of this?

I’ve taken a leaf out of David Hoover’s book and color coded the labels to make it easier to see patterns. The green are the chapters from Daniel’s perspective, the purple are the chapters from Gwendolen’s perspective and the blue are chapters in which the perspectives switch back and forth between them. Uncolored chapters are those from the perspective of non-main characters. This was done in Pixelmator (a reasonably-priced image editing app for Mac) at some less than salubrious hour of the night and it rather shows.

Also, I spent several minutes that felt like hours last night trying to change the colors of the dendrogram (as I discovered, the technical term for this tree is a dendrogram). Of course, when I went back to it this morning, I figured out how to do it by accident. However, given the choice between recoloring the entire bottom or simply dealing with the ugly shade of red, I chose laziness over aesthetics.

Anyway, back to the graph. This analysis takes the 990 most frequent words in the novel after all proper names and gendered pronouns have been removed* and uses them to try to determine which chapters are most like other chapters based on how they use those words. It then shows the clusters (hence cluster observations), with the heights of the linkages in the dendrogram conveying degree of similarity.
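Outside Minitab, the same kind of dendrogram takes only a few lines of base R. A minimal sketch, assuming a hypothetical matrix freq_matrix of relative word frequencies with one row per chapter and one column per word (Minitab’s own distance and linkage settings may well differ from the ones chosen here):

```r
# Sketch of cluster observations in base R. Assumes freq_matrix holds relative
# word frequencies: one row per chapter, one column per word (proper names and
# gendered pronouns already removed).
scaled <- scale(freq_matrix)            # standardize so no single word dominates
d <- dist(scaled, method = "euclidean")
fit <- hclust(d, method = "ward.D2")
plot(fit, cex = 0.6, main = "Cluster observations of Daniel Deronda chapters")
```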

So what this tells us is that the word usage is different enough that the top 1000 words, even without names, are enough to broadly distinguish between the chapters dealing with Deronda and those dealing with Gwendolen (It’s worth remembering that this is the top 1000 of ALL words in the novel, including words like “the,” “I” and “and”). It’s not perfect, but as Eliot is writing a fair amount of free indirect discourse, I would have been more surprised if it were. I’m actually surprised that it worked as well as it did.

And if you’re wondering about that weird blue line that is separate from the entire rest of the novel, it’s a very short chapter that consists almost entirely of a letter to Deronda from his long-lost mother. I had to go and look up the chapter to see why it was weird, but once I had, it made perfect sense. (The human brain, of course, can justify anything, but I happen to think mine is right in this case.)

Onto the next graph!

2) Principal Component Analysis. (Adam Crymble, this one’s for you.)

100 mfw means that these divisions are based only on the 100 most frequent words. That’s less accurate, but because I’m going to superimpose all the words used in this analysis on the graph in a minute, I need to keep the number of words down to a size that doesn’t resemble a plague of locusts.

So what’s going on here?

*crickets chirp*

Okay, here’s the version I understand. What PCA does is take the raw data, in this case word frequency proportions, and “rotate them through multiple dimensions” to figure out the best combination of variables that represents the variation seen in the data. Once it has worked out these components (using math far beyond my comprehension), it graphs them based on the two principal components – that is, the two that best describe the differences. And that’s the graph. (I am utterly indebted to Charles Shirley for directing me to the following link, http://www.mun.ca/biology/scarr/Lab_4_-_Adaptive_shape_variation_09Nov02.pdf, which helped me understand what is going on.)
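In R, the equivalent step is a call to prcomp plus a plot. A rough sketch, again using the hypothetical freq_matrix from above, this time restricted to the 100 most frequent words:

```r
# Sketch of PCA on the chapter-by-word matrix of the 100 most frequent words.
pca <- prcomp(freq_matrix, scale. = TRUE)
summary(pca)                             # proportion of variation per component
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2",
     main = "Daniel Deronda chapters, 100 MFW")
text(pca$x[, 1], pca$x[, 2], labels = rownames(freq_matrix), cex = 0.6, pos = 3)
head(pca$rotation[, 1:2])                # the word loadings that the scatterplot below maps out
```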

In terms of interpretive work, Deronda and Gwendolen don’t really cluster, but they do tend to divide, with Gwendolen showing up predominantly above the x-axis and Deronda predominantly below it. So the second principal component involved in dividing these two groups of chapters is the main character. I have no idea what the first component is, though there is a way to…if not find out, then at least think about it.

3) Scatterplot

This is a graph that uses the same components as above, but this time it maps the 100 words used in the analysis onto the component axes so you can compare where the words fall on the previous graph. Imagine it superimposed on the previous graph. (I would do so myself but I haven’t figured out a way to pull it off without making both graphs entirely illegible.) It’s…an interesting way to think about the distinctions in word usage. For example, my intuitive response to the way the words are laid out is to guess that the left-right principal component tells you how much dialogue is in a given chapter. The dialogue-heavy chapters are on the right. (Obviously, the principal component itself is much more mathematical than that, but in my terms it seems to equate with heaviness of dialogue.) My next step would be to check the chapters themselves and see if my predictions are correct.

The fact that these graphs are appearing at the end of this portion of the project is, in its own way, unfortunate. I think of these graphs as a pedagogical tool that works best at the beginning of a project. If you wish to use them as such, what I would suggest doing, once you’ve gotten up the learning curve and worked out how to read them, is use these graphs to direct future research. For example, look at the divide between Eliot’s writing when she writes dialogue-heavy chapters versus when she writes as the narrator (in fact, her narrative voice is quite interesting in its own right). That might be something to look into. And coming up with explanations for why the chapters in the dendrogram divided the way they did could certainly be fun (although suspect from the perspective of valid textual explanations).

However, what I would really do is look for better ways to display the information in these charts. Statistics programs were not invented to create beauty, unless you find order out of chaos to be beautiful. As a project for the future – perhaps over the summer – it might be worth thinking of ways to artistically reinterpret statistical charts in a manner that, though it sacrifices some fidelity to the data, still conveys the information in a way that is aesthetically appealing and draws in the reader. Basically, I want a visualization that attracts both people who love graphs and people who start running in the opposite direction when they see one. That would be a really useful visualization.

~~~

*I removed the proper names and pronouns because I already knew the information they could give me. If I found that the best indication that a chapter was about Gwendolen was the frequency of the word Gwendolen, I might get a very accurate result, but not a very interesting one. This way, though perhaps my results will be less accurate in division, they will be more interesting.