Playful Visualizations at Work, Working Visualizations at Play


Hammering at Daniel Deronda

This time, we are using a MALLET!

(I apologize for the pun, but it does not seem to get old).

MALLET stands for MAchine Learning for LanguagE Toolkit and is proof that, among other things, there is no such thing as an impossible acronym. MALLET is a Java-based package designed for multiple kinds of natural language processing/machine learning, including what I used it for – Topic Modeling.

So what is Topic Modeling? Well, let’s say that texts are made up of a number of topics. How many? That depends on the text. So every word in that text (with the exception of common words like “an”) should be related to one of those topics. In topic modeling mode, MALLET divides a set of texts into X topics (where X is your best guesstimate of how many there should be) and outputs all the words in each topic, along with a shorter list of top words for each. Your job, as the human, is to guess what those topics are.
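To give a flavor of what that output looks like in practice: MALLET can write a tab-separated “topic keys” file, with one line per topic listing the topic number, a weight, and that topic’s top words. The little Python parser below, and the two-line sample it runs on, are my own toy illustration of reading such a file, not part of MALLET itself.

```python
# Toy sketch: parse a MALLET-style topic-keys file into labeled word lists.
# Assumed format, per line: topic number <tab> weight <tab> space-separated top words.
def parse_topic_keys(lines):
    topics = {}
    for line in lines:
        if not line.strip():
            continue
        topic_id, _weight, words = line.rstrip("\n").split("\t")
        topics[int(topic_id)] = words.split()
    return topics

# Hypothetical example of what two lines of such a file might contain:
sample = [
    "0\t0.5\tderonda grandcourt said gwendolen",
    "1\t0.5\tmirah mordecai music street",
]
print(parse_topic_keys(sample))
```

Lists like these are exactly the raw material one would feed into a word-cloud tool: the human still has to supply the topic’s name.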

For more on the idea behind topic modeling, check out Matthew Jockers’ Topic Modeling Fable for the decidedly non-technical version or Clay Templeton’s Overview of Topic Modeling in the Humanities.

Now for the second question – why am I doing it? Beyond “well, it’s cool!” and “because I can,” that is, both of which are valid reasons, especially in DH. My third reason is, in a way, a subset of the second: I want to test the feasibility of topic modeling so that, as this year’s Transcriptions Fellow*, I can help others use it in their own work. But in order to help others, I need to first help myself.

So, for the past two weeks or so, I’ve been playing around with MALLET, which is fairly easy to run and, as I inevitably discovered, fairly easy to run badly. Because of the nature of topic modeling, which is less interested in tracking traditional co-occurrences of words (i.e. how often two specific words are found within 10 words of each other) and more interested in seeing text segments as larger grab-bags of words where every word is equidistant from every other**, you get the best topic models when working with chunks of 500–1000 words. So after a few less-than-useful results when I had divided the text by chapters, I realized that I needed a quick way to turn a 300,000+ word text file into 300+ 1,000-word text files.

Why so long a text? Well, George Eliot’s Daniel Deronda is in fact a really long text. Why Daniel Deronda? Because, as the rest of this blog demonstrates, DD has become my go-to text for experimenting with text analysis (and, well, any other form of analysis). So I have MALLET, I have Daniel Deronda, I now also have a method for splitting the text thanks to my CS friends on Facebook and, finally, I have IBM’s “Many Eyes” visualization website for turning the results into human-readable graphics. All that’s missing is a place to post the results and discuss them.
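For the curious, the splitting step is simple enough to sketch. This is a minimal Python version, assuming plain whitespace tokenization and a 1,000-word chunk size; it is an approximation of the approach, not the exact script my friends suggested.

```python
# Minimal sketch: split a long text into consecutive chunks of roughly
# 1,000 words each, ready to be saved as separate files for MALLET.
# Whitespace tokenization is an assumption; MALLET re-tokenizes on import.
def chunk_words(text, chunk_size=1000):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# A 2,500-word text yields three chunks (1000 + 1000 + 500 words),
# so a 300,000+ word novel yields 300+ files.
chunks = chunk_words("word " * 2500, chunk_size=1000)
print(len(chunks))
```

In practice, each chunk would then be written out as its own .txt file so that MALLET’s import step can treat it as a separate document.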

I knew Ludic Analytics would not let me down. So, without further ado, I present the 6 topics of Daniel Deronda, organized into word clouds where size, as always, represents the word’s frequency within the topic:

Topic 1:


Topic 2:


Topic 3:


Topic 4:


Topic 5:


Topic 6:



You will notice that the topics themselves do not yet have titles, only identifying numbers. Which brings us to the problem with topic modeling small text sets – too few examples to get really high-quality results that identify what we would think of as topics. (Also, topic modeling apparently works better when one uses a POS (part-of-speech) tagger and even discards everything that isn’t a noun. Or so I have heard.)

Which is not to say that I will not take a stab at identifying them, not as topics, but as people. (If you’ve never read Daniel Deronda, this may make less sense to you…)

  1. Daniel
  2. Mordecai
  3. Society
  4. Mirah
  5. Mirah/Gwendolen
  6. Gwendolen

I will leave you all with two questions:

Given the caveat that one needs a good-sized textual corpus to REALLY take advantage of topic modeling as it is meant to be used, in what interesting ways might we play with MALLET by using it on smaller corpora or single texts like this? Do the 6 word clouds above suggest anything interesting to you?

And, as a follow-up, what do you make of my Daniel Deronda word clouds? If you’ve never read the text, what would you name each topic? And, if you have read the text, what do you make of my categorizations?

*Oh, yes. I’m the new Graduate Fellow at the Transcriptions Center for Literature & the Culture of Information. Check us out online and tune in again over the course of the next few weeks to see some of the exciting recent developments at the Center. Just because I haven’t gotten them up onto the site yet doesn’t mean they don’t exist!

**This is a feature, not a bug. Take, for example, a series of conversations between friends in which, in every conversation, they always reference the same 10 movies, although not always in the same order. MALLET would be able to identify that set of references as one topic – one that the human would probably call “movies” – while collocation wouldn’t be able to tell that the first movie and the last movie were part of the same group. By breaking a long text up into 500–1000 word chunks, we are approximating how long something stays on the same topic.

Artist Analyst

As we get closer to the end of the academic quarter, and thus the end of the grade-motivated production of our dialogue in our wonderfully exciting, structured seminar, I am reading more and more and feeling more and more excited about the art I anticipate making this coming summer and how -fun- and -exhilarating- it will be to work through some of my Photoshop/Pixelmator frustrations and limitations using my hands and paint and epoxy and physical texts. I’m not anti-technology, and this seminar has certainly encouraged me to learn more and do more with my computer, but I can’t wait to be outside with just my hands and some supplies. I am going to spend one more week playing with Pixelmator, but am certainly going to spend the next days, months, and years thinking about a number of ideas generated by our blog. One thought in particular that I want to note today was sparked by Liz’s most recent post, in which she wrote:

This brings me back to something I had discussed earlier, which is that our data always seems to be more useful for ourselves than for our readers. This may explain why the scholarly article and book have had such a long life; they don’t simply convey understanding, they enact it as well. Close reading recreates, in the article, the process through which we imbue texts with meaning. The act of applying historical research to a volume of literature mimics the act of research and the flash of understanding that comes when one grasps how a specific historical fact is relevant to the text at hand. Articles are processes, they are a temporal movement towards the end of an argument. Visualizations, however, lack that sense of journey. They are always, already, at the end even when you, the reader, are still at the beginning.

Wow. One issue I’ve always had with writing is that I can never quote quite enough of the brilliance I see in other people’s work. Luckily, no one is grading my blog (at least, I hope that my writing skills are not under serious scrutiny here, as I rarely read over what I publish), so I get to copy Liz’s paragraph in all its glory.

The value of a process versus its product has been a concern of mine for the past few months. I have hopped and skipped and (mostly) stumbled through a wide variety of text analysis tools, all kinds of word generators and algorithms that changed my text input into hilarious explosions of letters. As you have seen in my posts, I also spent quite a bit of time in Microsoft Office. However, there is no single product of this play that says to me: here, I’ve found something breathtakingly new and spectacularly beautiful/useful for my greater purpose of understanding David Foster Wallace’s last work. That said, I now have an intimacy with chapter 14 of The Pale King that I have with few other texts on the planet. When I think of other works I know as well, I can only come up with a handful: Julio Cortázar’s “El otro cielo,” Sophocles’ Antigone, Harry Potter books 5-7, and André Gide’s L’Immoraliste. What’s interesting about this list, in terms of what Liz wrote above, is that my closeness to these highly varied texts derives from an equally varied list of reading practices. Harry Potter I’ve read four to five times. That type of intimacy is easily understood, right? I teach Antigone every year to undergraduates. I wrote one of my senior theses in college on Cortázar’s “cielo.” I painted sections of L’Immoraliste in a similar way to the pictures of a painting I posted recently in the entry entitled “l’art pour l’art.” Liz’s statement, that the visualization lacks a sense of journey, really intrigues me. I think she is right, at least about a lot of types of visualizations.
Temporally, writing an essay on Wallace demands more time than copying part of his novel into Many Eyes and seeing what types of visuals can be created from it. Additionally, the process, or the parts of the process we do not edit out of the final product, are more transparent. What I want to know, or to think about, is what the difference might be between the journey depicted in an article or conference presentation, and what one can see in a piece of art. As with a visualization like a graph or word cloud, it is difficult to see a process when looking at one of my paintings. I can see the process, of course, as the artist, but it is not immediately visible to an audience. Our relationship to our own work is, necessarily, so different from what anyone else might be able to take away from it. I like that with an essay, it is easier to trace or unpack the levels of research that have gone into constructing an argument. But, I also like that my art can stand alone, and that there are an infinite number of ways for an audience to evaluate, think about, or create from what I’ve done.

Can you see me?

DFW Pale King Excel Sheets

This link will allow you to download my spreadsheets, I think, so check it out! Please comment if you have any other ideas about how to mess around with Excel that might be interesting for the project.