Playful Visualizations at Work, Working Visualizations at Play


MALLET redux

I considered many alternative titles for this post:

“I Think We’re Gonna Need a Bigger Corpus”

“Long Book is Long”

“The Nail is Bigger, but the MALLET Remains the Same”

“Corpo-reality: The Truth About Large Data Sets”

(I reserve the right to use that last at some later date). But there is something to be said for brevity (thank you, Twitter) and, after all, the real point of this experiment is to see what needed to be done to generate better results using MALLET. The biggest issue with the previous run–as is inevitably the case with tools designed for large-scale analysis–was that I was using a corpus that consisted of one text. So my goal, this time around, is to see what happens when I scale up. I copied the largest 150 novels out of a collection of 19th and early 20th century texts that I happened to have sitting on my hard drive and split them into 500-word chunks. (Many, many thanks to David Hoover at NYU, who had provided me with those 300 texts several years ago as part of his Graduate Seminar on Digital Humanities. As they were already stripped of their metadata, I elected to use them.) Then I ran the topic modeling command in MALLET and discovered the first big difference between working with one large book and with 150. Daniel Deronda took 20 seconds to model. My 19th Century Corpus took 49 minutes. (In retrospect, I probably shouldn’t have used my MacBook Air to run MALLET this time.)
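For anyone wanting to reproduce a run like this, the invocation is a two-step affair: import the chunk directory into MALLET’s binary format, then train the topics. The sketch below frames the commands as Python lists (the paths and output file names are placeholders, not the ones I actually used); the flags themselves come from MALLET’s documented import-dir and train-topics options:

```python
import subprocess

MALLET = "mallet/bin/mallet"  # placeholder: wherever your MALLET launcher lives

# Step 1: turn a directory of 500-word chunk files into MALLET's binary
# corpus format, dropping stopwords along the way.
import_cmd = [
    MALLET, "import-dir",
    "--input", "chunks",
    "--output", "corpus.mallet",
    "--keep-sequence",
    "--remove-stopwords",
]

# Step 2: model 50 topics and write out the top words for each one.
train_cmd = [
    MALLET, "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "50",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
]

# Uncomment to actually run (and expect a long wait on a large corpus):
# subprocess.run(import_cmd, check=True)
# subprocess.run(train_cmd, check=True)
```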

Results were…mixed. Which is to say that the good results were miles ahead of last time and the bad results were…well, uninformative. I set the number of topics to 50 and, out of those 50 topics, 21 were not made up of a collection of people’s names from the books involved.*  I was fairly strict with the count, so any topic with more than three or so names in the top 50 words was relegated to my mental “less than successful” pile. But the topics that did work worked nicely.

So here are two examples. The first is of a topic that, to my mind, works quite well and is easily interpretable. The second example is of a topic that is the opposite of what I want though it too is fairly easy to interpret.

Topic #1

First

So, as a topic, this one seems to be about the role of people in the world. And by people, of course, we mean MEN.

Topic #2:

Second

Now, this requires some familiarity with 19th century literature. This topic is “Some Novels by Anthony Trollope”. While technically accurate, it’s not very informative, especially not compared to the giant man above. The problem is that, while it’s a fairly trivial endeavor to put the cast of one novel into a stop list, it’s rather more difficult to find every first and last name mentioned in 150 Victorian novels and take them out. In an even larger corpus (one with over 1,000 books, say), these names might not be as noticeable simply because there are so many books. But in a corpus this size, a long book like “He Knew He Was Right” can dominate a topic.

There is a solution to this problem, of course. It’s called learning how to quickly and painlessly (for a given value of both of those terms) remove proper nouns from a text. I doubt I will have mastered that by next week, but it is on my to do list (under “Learn R” which is, as with most things, easier said than done).
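For what it’s worth, one quick-and-dirty heuristic (not R, and not anything I’ve settled on): any word that shows up capitalized in the middle of a sentence more than a few times is probably a proper noun, and can be fed into the stop list. A sketch in Python, with the caveat that it will happily eat mid-sentence “God” and “London” too:

```python
import re
from collections import Counter

def probable_names(text, threshold=3):
    """Collect words that appear capitalized mid-sentence at least
    `threshold` times: a rough stand-in for proper-noun detection."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+\s+", text):
        tokens = sentence.split()
        for token in tokens[1:]:  # skip the sentence-initial word
            word = token.strip(".,;:!?\"'()")
            if word[:1].isupper() and word[1:].islower():
                counts[word] += 1
    return {word for word, n in counts.items() if n >= threshold}

def strip_names(text, names):
    """Delete every occurrence of the collected names."""
    if not names:
        return text
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.sub(pattern, "", text)
```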

In the meantime, here are six more word clouds culled from my fifty. Five of these are from the “good” set and one is from the “bad”.

Topic #3:

Third

Topic #4:

Fourth

(I should note, by the way, that “party” appears in another topic as well. In that one, it means party as a celebration. So MALLET did distinguish between the two parties.)

Topic #5:

Fifth

Topic #6:

Sixth

Topic #7

Seventh

Topic #8:

Eighth

There are 42 more topics, but since I’m formatting these word clouds individually in Many Eyes, I think these 8 are enough to start with.

So the question now on everyone’s mind (or certainly on mine) is: what do I do with these topic models? I could (and may, in some future post) take some of the better topics and look for the novels in which they are most prevalent. I could see where in the different novels reading is the dominant topic, for example. I could also see which topics, overall, are the most popular in my corpus. On another note, I could use these topics to analyze Daniel Deronda and see what kinds of results I get.
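Tracking a topic back to the novels it dominates would start from the doc-topics file MALLET can emit (via its --output-doc-topics option). The format varies between MALLET versions; the sketch below assumes each line holds a document number, the chunk’s file name, and then one proportion per topic, and that chunk names encode the novel as novelname_0001.txt (both assumptions on my part):

```python
from collections import defaultdict

def topic_by_novel(doc_topics_path, topic):
    """Rank novels by the average weight of one topic across their chunks."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(doc_topics_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # header or blank line
            parts = line.split()
            # chunks/novelname_0001.txt -> novelname
            novel = parts[1].rsplit("/", 1)[-1].rsplit("_", 1)[0]
            totals[novel] += float(parts[2 + topic])
            counts[novel] += 1
    return sorted(((totals[n] / counts[n], n) for n in totals), reverse=True)
```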

Of course, I could also just stare up at the word clouds and think. What is going on with the “man” cloud up in topic 1? (Will it ever start raining men?). Might there be some relationship between that and evolving ideas of masculinity in the Victorian era? Why is “money” so much bigger than anything else in topic #6? What does topic #7 have to say about family dynamics?

And, perhaps the most important question to me, how do you bring the information in these word clouds back into the texts in a meaningful fashion? Perhaps that will be next week’s post.

*MALLET allows you to add a stopwords list, which is a list of words automatically removed from the text. I did include the list, but it’s by no means a full list of every common last name in England. And, even if it were, the invented names in the works of Charles Dickens included in this corpus would leave it utterly stymied.


Portland Craigslist vs. Santa Barbara Craigslist

Seeing Liz’s post yesterday reminded me that it has been a while since I have posted here on LuAn.  Surely, I thought, there must be something I could include about my current work, as I’ve missed frequenting this friendly space.  And while I could dedicate a post to some of my students’ projects from the course I taught this summer or a few of the new visualizations I’ve created about La tumba (if you recall, Agustín’s La tumba to me is like Daniel Deronda to Liz—our default textual subjects), something else came to mind:

My good friend just moved to Portland, Oregon.

A random thought, indeed, but I plan to show how it is connected to this world of Ludic Analytics (or perhaps, how this world has taken over my mind and now affects how I relate to any and all texts I come across).  Since my friend was new to the Portland area, she was looking for a place to live and would send me craigslist ads to get my opinion on possible new roommates.  She would also send me some of the ads she found crazy or ridiculous (of which there were shockingly many…it’s probably more a “craigslist thing” than a “Portland thing”).  Then, to help out, I began searching the Portland craigslist ads for her, in an effort to find her the perfect place to live in her new city.

It’s been a few years since I last looked for housing, so I was not up to date on my local craigslist ad situation, but it seemed to me that the Portland posters had some common themes that kept popping up and were distinctly “Portland” compared to the “Santa Barbara” ads with which I was more familiar.  Primarily, the Portland posters needed evidence that you were employed or had a steady job–which is definitely a good quality in a roommate. It seemed to me, however, that this requirement was disproportionately common in the Portland ads.  The other commonalities I perceived from reading the ads were that there were more vegetarians and self-identified “420 friendly” posters in Portland than in Santa Barbara.  However, I wondered: is my sense about this correct?  I decided to investigate by creating some visualizations of the ads and comparing the results.  (Thank you, Many Eyes.)

Keep in mind that this is not the most scientific of experiments, but I was just curious, and I had the tools at the ready (focus more on the ludic here than the analytic).  I compared text from the first 11 posts from each city, Portland, Oregon and Santa Barbara, California.  In these ads, people were looking for roommates to fill their house.  Someday it might be fun to do a more formal analysis (with a bigger sample set, and more rigorous methodologies), but until then, consider these word clouds:

Portland:

portland word cloud

Santa Barbara:

santa barbara word cloud

“Room” and “House” are (logically) prominent in both clouds. “Kitchen” is more evident in Santa Barbara, while “work” or “working” does seem to have a higher prevalence in the Portland cloud, as I suspected. However, “420” is actually bigger in the Santa Barbara cloud.  School-related terms are also more present in the Santa Barbara cloud, perhaps reflecting the large population of students in our much-smaller-than-Portland town.

The clouds did not reveal as much information as I had hoped (despite looking cool), so I decided to check out some more visualizations:

Portland 2-word tag cloud:

portland 2 word tag

Santa Barbara 2 word tag cloud (with phone numbers removed):

tag cloud numbers removed santa barbara

Some observations from these visualizations: 1) it’s cheaper to live in Portland ($600 vs. $800); 2) people in Portland do in fact “cook meat” and tend to name their dog “Roxy” (or one person with a dog named Roxy mentions said dog numerous times in the same ad); 3) my perception that self-identified “420” posters were more prevalent in Portland appears to be wrong.  Of course, one of the caveats of this type of visualization is that it can be misleading.  A phrase might be preceded by a “no” that reverses its meaning, as in the following example of a Santa Barbara phrase net diagram:

bring the party sb craigslist

Hmm.  Interesting.  It’s important to Santa Barbara Craigslist posters that you both “Share THE bathroom” and “Bring THE party”.  However, upon closer investigation, it’s actually “DON’T bring the party”:

dont bring the party screen shot

So, there you go.  I guess sometimes data can be misleading (which we already knew).
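Under the hood, the two-word tag cloud is just a bigram frequency count, which is exactly why it can mislead: each pair of words is counted on its own, with no memory of the “don’t” sitting one word earlier. A toy sketch:

```python
from collections import Counter

def bigrams(text):
    """Count adjacent word pairs, the raw material of a 2-word tag cloud."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

ad = "don't bring the party please don't bring the party"
counts = bigrams(ad)
# ("bring", "the") and ("the", "party") each score 2 here, so a cloud
# built from these counts shows "bring the party" prominently even
# though every instance in the ad is negated.
```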

And just so you know, in Portland it’s important to:

share the costs portland

Share THE costs.

Did these visualizations help my friend find a house?  No.  But they were fun to make and she definitely appreciated the effort.  It also solidified in my mind the fact that the process can be just as important as the results, and that it has come to the point where I make visualizations for the amusement of myself and my friends (a good thing?  I hope so).

Said friend eventually found a room in a nice house with an amicable roommate; although, unfortunately her new place does not actually come with a dog named Roxy.

Hammering at Daniel Deronda

This time, we are using a MALLET!

(I apologize for the pun, but it does not seem to get old).

MALLET stands for MAchine Learning for LanguagE Toolkit and is proof that, among other things, there is no such thing as an impossible acronym. MALLET is a Java-based package designed for multiple kinds of natural language processing/machine learning, including what I used it for – Topic Modeling.

So what is Topic Modeling? Well, let’s say that texts are made up of a number of topics. How many? That depends on the text. Every word in that text (with the exception of common words like “an”) should be related to one of those topics. What MALLET does in topic modeling mode is divide a set of texts up into X number of topics (where X is your best guesstimate of how many there should be) and output all the words in each topic, along with a shorter list of top words for each. Your job, as the human, is to guess what those topics are.
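To make the mechanics concrete, here is a toy topic modeler: a bare-bones collapsed Gibbs sampler for LDA, the model family MALLET implements. (MALLET’s own sampler is far more sophisticated; this is just the textbook loop, and every name in it is mine.)

```python
import random
from collections import defaultdict

def toy_lda(docs, num_topics, iters=200, alpha=0.5, beta=0.5, seed=42):
    """Bare-bones collapsed Gibbs sampling for LDA.

    `docs` is a list of token lists; returns one word-count dict per
    topic, from which a "top words" list can be read off.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Randomly assign every word token to a topic to start.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    # Repeatedly resample each token's topic from its conditional
    # distribution given every other token's current assignment.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word
```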

For more on the idea behind topic modeling, check out Matthew Jockers’ Topic Modeling Fable for the decidedly non-technical version or Clay Templeton’s Overview of Topic Modeling in the Humanities.

Now for the second question: why am I doing it? Beyond the “well, it’s cool!” and “because I can,” that is, both of which are valid reasons, especially in DH. And my third reason is a subset of the second, in a way. I want to test the feasibility of topic modeling so that, as this year’s Transcriptions Fellow*, I can help others use it in their own work. But in order to help others, I need to first help myself.

So, for the past two weeks or so, I’ve been playing around with MALLET, which is fairly easy to run and, as I inevitably discovered, fairly easy to run badly. Because of the nature of topic modeling, which is less interested in tracking traditional co-occurrences of words (i.e. how often two specific words are found within 10 words of each other) and more interested in seeing text segments as larger grab-bags of words where every word is equidistant from every other**, you get the best topic models when working with chunks of 500-1000 words. So after a few less-than-useful results when I had divided the text by chapters, I realized that I needed a quick way to turn a 300,000+ word text file into 300+ 1,000-word text files. Why so long a text? Well, George Eliot’s Daniel Deronda is in fact a really long text. Why Daniel Deronda? Because, as the rest of this blog demonstrates, DD has become my go-to text for experimenting with text analysis (and, well, any other form of analysis). So I have MALLET, I have Daniel Deronda, I now also have a method for splitting the text thanks to my CS friends on Facebook and, finally, I have IBM’s “Many Eyes” visualization website for turning the results into human-readable graphics. All that’s missing is a place to post the results and discuss them.
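The splitting step itself turns out to need only a few lines of Python; a sketch, assuming the novel is sitting in a single plain-text file (the filename in the comment is my guess):

```python
def split_text(path, words_per_chunk=1000, prefix="chunk"):
    """Split one long plain-text file into fixed-size word chunks."""
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    out_paths = []
    for i in range(0, len(words), words_per_chunk):
        out_path = f"{prefix}_{i // words_per_chunk:04d}.txt"
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(" ".join(words[i:i + words_per_chunk]))
        out_paths.append(out_path)
    return out_paths

# e.g. split_text("daniel_deronda.txt") would yield 300+ chunk files
```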

I knew Ludic Analytics would not let me down. So, without further ado, I present the 6 topics of Daniel Deronda, organized into word clouds where size, as always, represents the word’s frequency within the topic:

Topic 1:

Topic1

Topic 2:

Topic2

Topic 3:

Topic3

Topic 4:

TOPIC4

Topic 5:

Topic5

Topic 6:

Topic6

 

You will notice that the topics themselves do not yet have titles, only identifying numbers. Which brings us to the problem with topic modeling small text sets – too few examples to really get high-quality results that identify what we would think of as topics. (Also, topic modeling apparently works better when one uses a POS (part-of-speech) tagger and even gets rid of everything that isn’t a noun. Or so I have heard.)

Which is not to say that I will not take a stab at identifying them, not as topics, but as people. (If you’ve never read Daniel Deronda, this may make less sense to you…)

  1. Daniel
  2. Mordecai
  3. Society
  4. Mirah
  5. Mirah/Gwendolen
  6. Gwendolen

I will leave you all with two questions:

Given the caveat that one needs a good-sized textual corpus to REALLY take advantage of topic modeling as it is meant to be used, in what interesting ways might we play with MALLET by using it on smaller corpora or single texts like this? Do the 6 word clouds above suggest anything interesting to you?

And, as a follow-up, what do you make of my Daniel Deronda word clouds? If you’ve never read the text, what would you name each topic? And, if you have read the text, what do you make of my categorizations?

*Oh, yes. I’m the new Graduate Fellow at the Transcriptions Center for Literature & the Culture of Information. Check us out online and tune in again over the course of the next few weeks to see some of the exciting recent developments at the Center. Just because I haven’t gotten them up onto the site yet doesn’t mean they don’t exist!

**This is a feature, not a bug. Take, for example, a series of conversations between friends in which, in every conversation, they always reference the same 10 movies, although not always in the same order. MALLET would be able to identify that set of references as one topic–one that the human would probably call “movies”–while collocation wouldn’t be able to tell that the first movie and the last movie were part of the same group. By breaking a long text up into 500-1000 word chunks, we are approximating how long something stays on the same topic.

Seeing and Doing

Before I begin, I’m going to apologize. I will not be present at the Research Slam and so I don’t get a chance to play with glitter. Or communicate my work on one large sheet of paper in a way that balances both an overall view of my research along with enough details to make it comprehensible and interesting.

Claire, Meaghan, further thoughts about poster-ing as a form of visualization that might help us for presentations later on?

I will, however, try to put together a presentation-esque page on this site for all those interested in what my work is. A virtual poster, if you will.

In the meantime, I’m going to talk about the act of making a visualization less useful. I realize this is not what we think of ourselves as doing, but the work I have been doing in Many Eyes has been leading me in this direction.

As you may have noticed, I’m the one a bit hung up on “but what will I use this for?” Despite trying to let go of this obsession, I still think of my visualizations in terms of use-value. And while the mere act of seeing anew that is at the heart of visualization always possesses this mix of dulce et utile, the balance between the two seems to shift depending on where I am in the process of creating multiple iterations of visualizations. Or, to put it another way, the further I push the visualizations associated with a given topic, the more insight I gain into the text from the process, even as my readers gain less insight from the final product than they would have from the original.

I’ll provide an example, and nicely circumvent the aforementioned problem by making you, O readers, a part of the process.

I began with one of the Word Tree visualizations in Many Eyes.

As before, feel free to click on the caption and play with the Visualization in Many Eyes. You can type anything you want into the search box and the word tree will move to show you all the sentences in the text that have that phrase and how they “branch” off. This dynamic visualization is eminently useful as a way to think about characters and get a quick glance at the different traits associated with them. I chose Gwendolen because the impression I got of the novel while reading it was that her feelings were the most interesting. And here they are.

Now I wanted to explore what exactly it was that Gwendolen felt. To do that, I returned to the one form of text analysis with which I think we’re all familiar: the “Find” function in Microsoft Word. I took all the sentences in the text where either “Gwendolen felt” or “She felt” (in relation to Gwendolen) appeared. I then sorted them into those sentences that used “felt” to refer to emotions and those that used it differently, such as referring to actual contact between two people. Then I color-coded each sentence based on whether the emotions were positive, negative or neutral/unclear.
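The Find-function pass can be automated with a regular expression, though deciding which “She felt” actually refers to Gwendolen, and which feelings are positive, stays with the human. A sketch, with a deliberately naive sentence splitter:

```python
import re

def felt_sentences(text):
    """Every sentence containing 'Gwendolen felt' or 'She felt'."""
    # Naive splitter: break after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pattern = re.compile(r"\b(?:Gwendolen|She) felt\b")
    return [s for s in sentences if pattern.search(s)]
```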

The Good:

Gwendolen felt ready to manage her own destiny? Gwendolen felt daring. Gwendolen felt some strength.  She felt well equipped for the mastery of life. She felt quite sure of herself. She felt assured that she could act well. She felt satisfied with her prospects at Offendene. She felt kindly toward everybody and was satisfied with the universe. She felt as if she were reinforcing herself by speaking with this decisiveness to her uncle. She felt prepared to hear everything. She felt equal to arguing with him about her going on the stage. She felt able to maintain a dogged calm in the face of any humiliation that might be proposed. She felt an equal right to the Promethean tone. She felt at this moment that it was not likely she could ever have loved another man better than this one. She felt the more assured that her expectations of what was coming were right. 

The Bad:

Gwendolen felt the bitter tears of mortification rising and rolling down her cheeks. Gwendolen felt some anger with her mamma, but carefully concealed it. Gwendolen felt herself painfully in the position of the young lady who professed to like potted sprats. Gwendolen felt that the dirty paint in the waiting-room, the dusty decanter of flat water, and the texts in large letters calling on her to repent and be converted, were part of the dreary prospect opened by her family troubles; and she hurried away to the outer door looking toward the lane and fields. Gwendolen felt every word of that speech. Gwendolen felt that she was being weighed. Gwendolen felt a sinking of heart under this unexpected solemnity. Gwendolen felt a sudden alarm at the image of Grandcourt finally riding away. Gwendolen felt herself stricken. Gwendolen felt suddenly uncomfortable, wondering what was to come. She felt passionately averse to this volunteered love. She felt a new current of fear passing through her. She felt herself very far away from taking the resolve that would enforce acceptance. She felt shaken into a more alert attention, as if by a call to drill that everybody else was obeying. She felt a sort of numbness and could set about nothing. She felt a retrospective disgust for them. She felt compelled to silence. She felt her heart beating with a vague fear. She felt herself in an attitude of apology. She felt bashful about walking up to him and letting him know that she was there. She felt a peculiar anxiety to-day. She felt sick with irritation. She felt a little dispirited. She felt them to be insulting. She felt like a shaken child – shaken out of its wailing into awe. She felt some tingling bashfulness at the remembrance of her behavior towards him. She felt a rising rage against him mingling with her shame for herself. What she felt beside was a dull despairing sense. She felt her habitual stifling consciousness of having an immovable obstruction in her life.
She felt herself reduced to a mere speck. She felt a peculiar vexation that her helpless fear had shown itself, not, as usual, in solitude, but in well-lit company. 

The Neutral:

Gwendolen felt this lot of unhoped-for fullness rounding itself too definitely. Gwendolen felt an inward shock. Gwendolen felt a contradictory desire to be hastened. Gwendolen felt as if her heart were making a sudden gambol. She felt quietly, unargumentatively sure. She felt something very far from indifference as to the impression she would make on him. Was it triumph she felt most or terror?

You can see several interesting things from these phrases. One is that Gwendolen is not a happy person. Two is that she is most definitely the “Spoiled Child” Eliot names her. The act of going through the text and pulling out these quotes one by one was fascinating. The process didn’t provide a whole new view of Gwendolen’s character, but it did create a portrait of her that seemed to come to life differently than the one that appears over the course of the novel. It’s a slightly altered picture of both her vulnerabilities and her determination, especially as it takes no notice of any changes in her over the course of the novel. You might not get the same sense of her as you would from reading the book; this is more like a character sketch. They say a picture is worth a thousand words.

I was trying to figure out how to present this data other than using the link to the Many Eyes tree above and was inspired by Meaghan’s post with the images and texts. I couldn’t do that, but I wanted to try something similar.

For any of you suffering under the mistaken impression that I can freehand this, I should mention that I traced over the actress Romola Garai’s silhouette. As she played Gwendolen in the BBC miniseries adapted from Daniel Deronda, it seemed appropriate. (All work was done in Adobe Photoshop CS5.1, which I do not own, but which the Transcriptions lab at UCSB does.)

I think I will call this my ludic interpretation. Next to the Many Eyes Word Tree, it seems rather less informative, yet having gone through the process of visualizing it, I feel as though I have learned more than I would have simply by looking at the word tree.

As a personal project and a way to learn about a text and formulate views of characters, this was a great exercise. But how do I make it useful for anyone other than myself? I benefited from the process; the end result is interesting, but the true value lies in enacting. So how can we make visualizations that are as useful to our viewers as they are to us?