Playful Visualizations at Work, Working Visualizations at Play

Posts tagged ‘Networks’

Sefaria IV: The Whys of Data Visualization

Welcome to the final post in this Ludic Analytics series on Sefaria. While my research itself is ongoing, this part of the project, where I experiment with the images I can make and ponder their value, has come to a close.

This post is distinct from the previous ones, which can be found here: part 1, part 2, and part 3, in that I’m finally going to move away from looking at the images themselves and focus instead on what their larger purpose is. But before I get to that, an important announcement.

All the data from this project – all three datasets, the .gexf files, the .csv exportable versions and some of the high res images – are now available on my Github page for the Sefaria Visualization project. Sefaria is serious about its commitment to an open repository and I share that commitment with them. So if you want to grab any of these datasets and play around with them, please do so and I would very much like to see what you do with them.

A word of caution, however. These datasets are large and Gephi requires a lot of memory. All three datasets, but especially the August and September ones, will take forever to run on a computer with less than 8GB of RAM. Run them in the background on a machine that can handle it and assume that any layout algorithm other than plotting data points on a 2D plane will take some time to render. So, if you’re like me, and started doing this on a 2011 MacBook Air with 4GB of memory…don’t. And if you are familiar with software other than Gephi and prefer to use that, be my guest and do let me know about it.

All the data can be found here: Sefaria Data Visualization Project.

And now, onwards!

What is the value of this research?

As we all know, answering broad and general questions is difficult, so let’s break this one down a bit.
1. How does this kind of work – making visualizations and thinking about networked Jewish text – enhance the traditional experience of studying Jewish texts in a Jewish environment?
2. How can an academic researcher make use of these visualizations and to what degree does she need to become an expert in network theory to do so?

There. That’s much less intimidating.

Going in order, the first question really asks whether this kind of work has value within the traditional classroom. Given that the teaching of Jewish texts often focuses on the micro level and dwells on one topic for a very long time, this kind of visualization work seems like an important counterpoint to that kind of study. If students, as part of their explorations of Jewish texts, are learning how to trace a legal ruling from its source in the Bible to the modern day responsa on the topic, turning that unbroken line of tradition into a network that they can see could be really interesting. Rather than thinking linearly, they can look at specific ideas as clusters. And, starting with one of those groups, students could begin to think in terms of idea clusters – what groups of legal decisions come from verses that are right next door to one another.

None of this is new information and all of it could, in theory, be taught without the aid of images at all. But the images make it much easier to think in a networked way.

And this is reflective also of the change that has come about with giant repositories like the Bar Ilan CD that contain an extraordinary number of sources or even Sefaria itself. We have access to the legal system as a whole in a way that really did not exist before the age of the computer. We’re going to have to think about how we want to access that system in a way that is both true to traditional forms and takes advantage of technology.1

The goal of teaching students about Jewish texts is only partially to familiarize them with the narratives they hear in synagogue and the laws that structure their lives. The other, more difficult job is to create a bond between the student and the text(s). And we do that by making the books tangible and meaningful, but we can also do that by making the text network tangible (metaphorically speaking). If we create emotional connections through interacting with texts and those connections have a profound influence on what we learn and how,2 we should be able to build on those connections through even less traditional forms of interaction. Such as making a graph.

So why is this kind of work useful in the classroom? Because it provides another way of accessing meaningful texts, one that can help students make connections they could not otherwise see and connect with the text in a way that deepens their appreciation for it.

Yes, this approach might make understanding a particular section of Jewish law a bit easier. But I’m much more interested in the way that it reshapes our relationship with all the texts as a whole. Not exactly what we can see, but how it changes the way that we look.

Which brings me to my second question, which I have partially answered in previous posts on this topic. How does an academic make use of this research?

I’ve given several examples of using the graphs as pointers towards interesting features in the texts. The strangeness of tractate Sukkah that I addressed here and the connections in the 18th chapter of Tanna Debei Eliyahu that I discussed last time are good examples. Both of these are interesting features noticeable only when examining the graph and each leads to a very different kind of research. As Sara pointed out in the comments, the results I was seeing for Sukkah came from the work her Talmud students did with Sefaria. So while not a feature of the text itself, this node opens up a conversation about using Sefaria in the classroom and data collection in a large, open-source project. Conversely, research into the 18th chapter of TDE would require a very different approach, as the question, as far as I can determine, involves investigating why it has a disproportionate number of prooftexts: whether the topic at hand requires so many texts or, as might always be the case, something else is at work.

And this might even be enough. If a literary critic with little or no network training can use these network graphs to discover new regions of research potential and new questions to ask about the construction of corpora, then perhaps this work has already achieved its goal.

But that feels like such a weak note on which to end. Not only because it absolves us from having to learn anything new about the networks themselves, but also because there should be so much more to do with this technology other than generate ideas and make pretty pictures.

A circular graph of all the nodes in the September database, arranged by name

Sefaria 9-22-14

Not to discount either generating ideas or making art. The practice of displaying information in an unreadable form purely for its aesthetic appeal is valuable as an act of artistic creation. If another value of this work is a set of awesome looking pictures to hang by my desk…far be it from me to complain. They’ll look great next to the embroidered TARDIS. I said that I was in this for the visceral joy of working with texts and the delight in making the things with which I think. But I will concede that not everyone wants what I want. I think we, the academic community, see the artistic values of our work as byproducts and, overall, would prefer research methods that generate answers rather than questions. So I will address that approach as well.

I realize that, in this conversation, I’m leaving out large swathes of digital research up to and including the WhatEvery1Says topic modeling project going at UCSB right now under the leadership of the 4Humanities group there. Using digital tools to interpret literary texts, while not free from controversy, has a pretty impressive track record and allows us to think anew about what we know and how we know it. But for many of these approaches, the images are secondary. They are elegant methods of displaying the information detailed in the critical literature itself. I’m talking about the actual value of taking information and transforming it into a visualization as a way of answering questions about a work or a corpus. To put the question another way, when is it better to see information than to read it?

And here’s where I think we start to see the value in making visualizations and knowing network theory. This kind of research is useful for destroying the linear thinking that narrative naturally invites. Database thinking (see Manovich in Language of New Media and Hayles in How We Think) has similar results in theory, but is comparatively useless to us as human beings in practice. We can’t read databases. We have tools that can, but what we end up reading or, realistically, seeing is the visual representation of the connections that are not apparent when traversing the work. Visualization breaks narrative. And sometimes, that’s what we want.

We want to break out of a narrative approach to, for example, the corpus of Jewish texts as a way of rethinking the legal, cultural and social influences that the texts have on one another. Here are some questions that, I hope, work like this might inspire us to answer.

  • How accurate is the vision of the Jewish legal system as a ladder with each subsequent generation relying on the scholarship that came beforehand? Do more recent writers hearken back to the earlier legal scholars or do they go straight to the Talmud or do they skip the legal sources in their own writing and rely entirely on the Biblical texts? What, in short, does a community of scholars look like?
  • Do scholars in different eras work differently? Are scholars more likely to refer to their predecessors in certain times than in others?
  • How interconnected are the commentaries? How often do they quote one another?
  • How interconnected is the corpus as a whole? Can you start anywhere and get back to Genesis 1:1? Which texts are inaccessible and do they share any features?
  • How much of the corpus is a dead end? And are dead ends characterized by any specific features?

We can’t read this information in the texts, but we can see it by looking at the visualizations. Which brings me to the end of this series, but to the beginning, I hope, of much research. All this data is available at Github and I welcome you to use it as you see fit.
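For anyone who wants to start with the last two questions in the list above, here is a minimal sketch in Python of how "can you get back to Genesis 1:1?" and "how much of the corpus is a dead end?" might be phrased as graph queries. The handful of links, the reference strings and the function names are all invented for illustration (one cross-reference is made up outright), and the sketch assumes each link is a directed edge from the citing text to the cited text, which may not match how the published datasets encode direction.

```python
from collections import deque

# Hypothetical directed links (citing text, cited text). Not real Sefaria data.
links = [
    ("Rashi on Genesis 1:1", "Genesis 1:1"),
    ("Sukkah 2a", "Leviticus 23:42"),
    ("Mishneh Torah, Sukkah 4:1", "Sukkah 2a"),
    ("Leviticus 23:42", "Genesis 1:1"),  # invented cross-reference
    ("Orphan Text 1:1", "Orphan Text 1:2"),
]

def build_adjacency(edges):
    """Map every node to the set of nodes it points to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())
    return adjacency

def can_reach(adjacency, start, goal):
    """Breadth-first search: is there a directed path from start to goal?"""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            queue.append(neighbor)
    return False

adjacency = build_adjacency(links)
goal = "Genesis 1:1"

# Texts from which no chain of references leads back to the goal.
cut_off = sorted(n for n in adjacency if not can_reach(adjacency, n, goal))

# Dead ends: texts that cite nothing further (no outgoing edges).
dead_ends = sorted(n for n, targets in adjacency.items() if not targets)

print("Cannot reach", goal, ":", cut_off)
print("Dead ends:", dead_ends)
```

Running a search from every node is fine for a toy example but slow on a graph with 100,000+ nodes; a single search over the reversed edges, starting from the goal, would likely be the better choice there.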

As for me? Well, I’ve a dissertation to write and the data-viz work that I’ve been doing here is going to be a big part of that. And while my next post won’t be about Sefaria per se, let’s just say I still have a lot more to talk about when it comes to making meaning using network graphs.

  1. Last year, there was a…controversy over a specific decision made by a school principal when two of his students asked him to rule on a matter of law for them. I don’t want to get into the details of the controversy, but one of the loudest objections came from a rabbi who argued that simply having access to the sources to back up one’s opinion (he noted the Bar Ilan CD in particular) did not give someone the right to rule when more prominent rabbis disagreed with that ruling.
    Leaving aside that the principal in question is absolutely not the kind of person who would scour a database for one minor dissenting opinion rather than using his own knowledge to rule as he sees fit, this argument points to a more pervasive fear within all scholarly culture.
    Has the database replaced human memory? And can you really claim mastery over a topic if the mastery you have is, in fact, over the information seeking in the database?
    Conversely, can you claim mastery without the database? One of the points that I think the Sefaria graph makes elegantly is that there really is (and always has been) “Too Much to Know,” to borrow the title of Ann Blair’s book on the subject. Is human knowledge of the canonical sources better than having a functioning database of every source? How do we rank knowledge without a canon?
    Given that my attitude towards technology can very broadly be summed up as “technology you do not like will not go away because you do not like it, so the only choice is to make it better”, I would argue that we need to train Jewish legal scholars in both forms of study. Legal precedent (unlike literature) has a better argument for the maintenance of the canon, but I think we do our laws and ourselves a disservice if we don’t take advantage of what technology can do and realize a system for using it to better understand and, yes, rule on Jewish law.
    Still, this point applies to relatively few people – the rabbis and not-rabbis responsible for handing down legal rulings. So let’s return to pedagogy. 
  2. A claim I don’t quite have the space to back up here, but I’m working on something that will address it. It’s called my dissertation. 

Sefaria II: The Map, the Territory and the Sukkah

My first post in this series dealt with the possibilities of Sefaria and what mapping such a system would look like at all. This, my second post, will jump to the opposite end of the spectrum. What are the limits of this kind of work and, perhaps more crucially, how do we make those limits work for us?

But first, a status update:

As many of you probably already noticed, the previous post in this series was featured in Wired’s science blog. You can find it here: The Network Structure of Jewish Texts. I was thrilled to have the work featured and I am so glad to see The Sefaria Project getting this kind of recognition.

Speaking of the project, a recent update to the database has increased the number of links from ~87,000 to over 150,000. This is incredibly exciting (obviously!) because it not only marks Sefaria’s continued growth, but also means that I have more data. So future posts in this series will draw on that new dataset, and I’m looking forward to some comparative visualizations as well.

But enough about the future. Let us return to the past and the other visualizations I created with the first data set.

After negotiating with the 100,000+ nodes, I decided that I wanted something on a slightly more humanly sensible scale. I took the dataset I used for the previous visualizations and combined the nodes so that each node no longer represented a verse or a small section, but an entire book. This meant I only had ~400 nodes, a far more legible graph (at least by my standards).
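The collapsing step described above can be sketched in a few lines of Python. This is a rough illustration, not the procedure actually used for these graphs: the sample links are invented, and the helper `book_of` assumes that a reference is a book title followed by a single space-free locator (as in "Sukkah 2a"), which is a guess about the format and would need adjusting for the real data.

```python
from collections import Counter

# Hypothetical verse-level links; the reference format is an assumption.
verse_links = [
    ("Rashi on Genesis 1:1", "Genesis 1:1"),
    ("Rashi on Genesis 1:2", "Genesis 1:2"),
    ("Sukkah 2a", "Leviticus 23:42"),
    ("Sukkah 11b", "Leviticus 23:43"),
    ("Sukkah 2a", "Genesis 1:1"),
]

def book_of(reference):
    """Assume the book title is everything before the final space."""
    return reference.rsplit(" ", 1)[0]

# Each book-to-book edge is weighted by the number of verse-level links
# that were merged into it.
book_edges = Counter((book_of(s), book_of(t)) for s, t in verse_links)

# Weighted degree of a book: total weight of the edges touching it.
# (A link within a single book would be counted twice here.)
degree = Counter()
for (source, target), weight in book_edges.items():
    degree[source] += weight
    degree[target] += weight

for (source, target), weight in sorted(book_edges.items()):
    print(f"{source} -> {target}: weight {weight}")
```

The edge weights correspond to line thickness in the figure that follows. The degree computed here is weighted, which may differ from the plain count of distinct neighbours that a graphing tool reports as "degree".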

Figure 1

So this is the map, arranged in a circle according to the category of text. The size of the node corresponds to the degree (how many connections it has) while the color corresponds to the kind of node. Edge weight or line thickness corresponds to how many connections exist between each node. The thicker the edge, the more references between the source node and the target.

Here is the key to the map:

  • Blue: Biblical texts
  • Green: The Talmud
  • Red: Mussar
  • Indigo: Mishnah
  • Yellow: Midrash
  • Green: Philosophy
  • Magenta: Halacha
  • Purple: Commentaries and Exegeses

This image tells a very different story than the map in the last post.  That map was a big data artifact (for a given value of big); it worked on the micro level to create macro sized connections. This graph is human scaled, which makes it more interesting to interpret, but perhaps less interesting to discuss observations about.

The strongest connections (by which I mean the thickest edges) are between the individual books of the Talmud and Rashi’s commentary on that book. Almost as thick are the connections between the five books of the Torah and their commentaries. This is not surprising. Rashi is the exegetical commentator for the Talmud; his commentary appears on the inside of every page and, as Haym Soloveitchik points out in his essay on the printed page of the Talmud, Rashi democratized the Talmud. Rashi is an indispensable learning aid, which also explains why Sefaria might make it a high priority to have all those links in place. This tracing of explicit references is the area in which Sefaria excels. Of course, there are other kinds of connections.

The Bible, specifically the five books of the Torah, is an interesting case study in what the current database can and cannot display. The most interesting piece of information, at least to me, is the paucity of connections between the Biblical books themselves. My immediate reaction was “Of course there are so few links!” After all, the network of reference and commentary relies on the presence of texts further along the timeline that can speak of the earlier texts. And the Bible does not make a practice of citing its own chapter and verse (especially because the chapters as we know them were introduced over 1,000 years after the closing of the canon). Figure 2 gives a better sense of what I’m talking about.

Figure 2


Here, you can see all the books of the Bible in the inner circle and, while there are some connections between the individual books (most notably the 5 books of the Torah to texts in Prophets and Writings), those edges seem scarce compared to the suffusion of green that encroaches from the Talmud’s corner and that signifies the interconnectedness of the Talmudic tractates.

Yet assuming that the Bible is not self-referential would be another kind of mistake. Many of the prophets speak about the covenant between God and Abraham, the exodus from Egypt, the calamities that might befall a recalcitrant king as they did that king’s father. And those are just the obvious, semantic references. The poetry of the prophets, the psalms and the language of the 5 megillot are just some examples of texts that use literary allusion and similarities of language to reference one another. So the network of references within the biblical texts is present, but these are not really the kind of references that Sefaria is set up to import wholesale. This is where the crowd-sourced nature of Sefaria really has a chance to shine; in a few years, it can become a repository of all the different possible connections between texts – an archive of what people think they see and how readers work with the texts. Sefaria has this capability built in – there is an option to add “allusion”s between one text and another, but those have to be added manually and individually. So check back in a few years.

This leads towards the point I allude to in my title. The graph is not really a record of Jewish texts as such, but a record of these texts as they are integrated into Sefaria. To borrow a well-known quote from Alfred Korzybski, “the map is not the territory”. Bearing this useful adage in mind, we can turn to what was my biggest question when looking at this graph. What is going on with Sukkah?

Sukkah is one of the 37 tractates of the Gemara*. It is neither the longest nor the shortest, not the most complex to grasp, nor the simplest. Based purely on my knowledge of the Talmud, I can’t think of a single reason why Sukkah should be far and away the largest of the tractates present.

And yet there it is. There are two possible kinds of answers. The first is that there is something special about Sukkah that sets it apart from the other tractates. Maybe there is something that I am not aware of or maybe this is a fascinating new discovery about the tractate itself. The second possibility is that something happened during the creation of this dataset to give Sukkah significantly more edges as compared to the other tractates.

The practical distinction between these two answers is that the former assumes that Sukkah is an actual outlier that is referenced significantly more often than the other tractates. The latter assumes that Sukkah is actually representative of what all the tractates should look like and the extra edges that it possesses represent data that has only been entered for Sukkah, but should eventually be added for the rest. (The third possibility is a data error. I’m discounting that because I looked back at the actual data and, as I’ll get to in a minute, it’s pretty clear that it’s not an error. But it is always wise to assume human error first.)

So which is it? How does one pinpoint which of the possibilities is more likely? Well, this is how I did it.

I created an ego graph of tractate Sukkah. The ego graph is a graph that shows only the nodes that connect to a specific node. So this graph shows all the nodes that connect, one way or another, to Sukkah.
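As a minimal sketch of the idea in Python, assuming a plain list of book-level edges: keep the chosen node, every node one step away from it in either direction, and whatever edges run among those nodes. The edge list and the function name `ego_graph` are invented for illustration, and this is not a description of how any particular graphing program builds its ego networks.

```python
# Hypothetical book-level edges (source, target); not the real dataset.
edges = [
    ("Sukkah", "Leviticus"),
    ("Mishneh Torah", "Sukkah"),
    ("Rashi on Sukkah", "Sukkah"),
    ("Mishneh Torah", "Leviticus"),
    ("Berakhot", "Deuteronomy"),
]

def ego_graph(edges, center):
    """Return (nodes, edges) for the center and everything one step from it."""
    nodes = {center}
    for source, target in edges:
        # Direction is ignored: citing and cited texts both count as neighbours.
        if source == center:
            nodes.add(target)
        elif target == center:
            nodes.add(source)
    # Keep only the edges whose endpoints both survive.
    kept = [(s, t) for s, t in edges if s in nodes and t in nodes]
    return nodes, kept

nodes, kept = ego_graph(edges, "Sukkah")
print(sorted(nodes))
print(kept)
```

Note that the edge between Mishneh Torah and Leviticus is kept even though it does not touch Sukkah: both ends are neighbours of the center, so it belongs to the ego graph.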

Figure 3


The giant green blob in the hat is Sukkah. The collection on the left are all the biblical, Talmudic and halachic sources that refer to or are referenced in Sukkah. But what’s interesting is the cloud of small nodes surrounding Sukkah on the right. Those nodes are almost entirely from Maimonides’ Mishneh Torah, one of the foremost works of halachic literature and, more crucially for our purposes, a text that references pretty much every tractate of Talmud. There should be edges between the Mishneh Torah and each and every green node here. The absence of those edges suggests that it is the dataset that is incomplete and that Sukkah, rather than an outlier, is the node that most closely represents the textual connections that exist.

So that’s cool. By looking at the node as an extraordinary case, we uncover evidence of its ordinariness. That leaves us with an entirely different set of questions. What happened to Sukkah? Why did someone take the time to add all these edges to Sukkah?

I can think of several possibilities.

  1. Daf Yomi. Daf Yomi is the practice of learning one folio (front and back) of Gemara a day and, in 7 1/2 short years, completing the entire Talmud. About 6 months ago, Daf Yomi covered tractate Sukkah. It’s possible that some Daf Yomi scholar discovered Sefaria right when he (statistically speaking, Daf Yomi scholars are he) started Sukkah and decided that, as part of his daily study, he would add the connections between the Talmud and the Mishneh Torah. This doesn’t explain why he stopped after Sukkah – there have been four tractates since Sukkah – but it’s a start.
  2. Pedagogy. An educator decided to introduce the concept of the halachic chain of tradition using digital tools and assigned their students to collaboratively edit Sukkah by adding the connections between the section they were learning and the halachic literature. So, as part of a classroom module, these students entered this data. This seems like a lot of data for students to enter manually, but it is certainly a possibility.
  3. It was a test of an automatic importing system. The powers that be were testing to see whether they could import the edges between the Talmudic texts and their halachic commentaries. Sukkah just happened to be the one they tested.

There are probably more possibilities, but I think that covers the basic kinds of users – the scholar, the educator, the technologist. Each of whom could be responsible for this anomaly. (By the way, if any of my readers has inside knowledge and knows what actually happened, I would appreciate anything you have to say.) When looking at a dataset like this, I find that my inclination is to start asking about the data. What would it mean to ask instead about the users and the development of the dataset? Or, to indulge in both my impulses, how can we study the data and the dataset in tandem? How do we mediate between the impulse to assign meaning to the data and the equally compelling impulse to assign it to the dataset? What exactly should I be studying?

And that is the question with which I leave you and to which I invite your responses. What intrigues you about these visualizations? What would you like to talk about? In the crowd-sourcing spirit of Sefaria, I would like to augment my questions with yours. What would you like to know?

*Brief technical note – the Mishnah and the Gemara together make up the Talmud. However, both the term “Talmud” and “Gemara” are colloquially used to refer to the tractates that include the Mishnaic text and the Gemara that accompanies it.

Revisiting the Social Networks of Daniel Deronda

My twitterstream overflowed, in the past few days, with tweets about the uses, misuses and limits of social networking.* Coincidentally (or perhaps not, given the identity of at least one retweeter), we discussed the role of social network graphs in humanistic inquiry in this week’s session of Alan Liu’s “Intro to Digital Humanities” class. For those of you following along, we are #engl236 on Twitter and, last week, we made graphs. So I am going to interrupt my glacial progress through the possible uses of R** and put the longer-form meditation on what I am trying to do with these experiments in statistical programming on hold in order to talk about my latest adventures in social network graphing.

As longtime readers of this blog will remember, this is not my first foray into Social Network graphing. Nor is it my second. This gave me a huge advantage over many of my colleagues (sorry!) because I had already spent hours collecting and formatting the data necessary to graph these kinds of social networks. Since I wasn’t going to map new content, I thought I would at least learn a new program to handle the data. So I returned to Gephi, the network visualization tool that I had failed to master 18 months ago.

And promptly failed again.

PSA: If you have Apple’s latest OS installed, Gephi will not work on your machine. I and two of my classmates discovered this the hard way. Fortunately, the computers in the Transcriptions Lab are–like most institutional machines–about an OS and a half behind and so I resigned myself to only doing my work on my work computer. After some trial and error, I figured out how I needed to format the csv file with all my Daniel Deronda data and imported it into Gephi. After some more trial, more error, and going back to the quickstart tutorial, I actually produced a graph I liked.

Daniel Deronda in Gephi

In this graph, size signifies “betweenness centrality”, which is a marker of how important a node is in the graph according to how often that node is necessary for getting places in the network (i.e., how often the shortest path between two other nodes runs through this node). A node’s size therefore indicates how vital that person is to other people’s connections. It tends to track how many connections they themselves have, although that count is strictly a separate measure (degree). Color signifies grouping. Nodes that are the same color are nodes that have been grouped together by Gephi’s modularity algorithm…which is Gephi’s function for dividing graphs into groups.
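To make the measure concrete, here is a minimal sketch of betweenness centrality in Python. The five-person network is invented for illustration and is not the Daniel Deronda data; the function follows Brandes' algorithm for unweighted, undirected graphs, which is an assumption on my part and not necessarily how Gephi computes or normalizes its scores.

```python
from collections import deque

# Invented conversation network; not the hand-coded Daniel Deronda data.
conversations = [
    ("Deronda", "Mirah"),
    ("Deronda", "Mordecai"),
    ("Mirah", "Mordecai"),
    ("Deronda", "Gwendolen"),
    ("Gwendolen", "Grandcourt"),
]

adjacency = {}
for a, b in conversations:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected pair was counted once from each end.
    return {node: score / 2 for node, score in scores.items()}

scores = betweenness(adjacency)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, "betweenness:", scores[name], "degree:", len(adjacency[name]))
```

In this toy network Gwendolen and Mirah each have two connections, yet only Gwendolen lies on anyone else's shortest path, so the two measures come apart even in a graph this small.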

So here we see three groups, which can be very roughly divided into Gwendolen’s social circle, Deronda’s social circle and Mirah’s social circle. There’s something delightful about the fact that the red group is made up entirely of the members of the Meyrick family and the girl they took in (Mirah). So Mirah truly becomes a member of the Meyrick family.

As this is a comparative exercise, I’m less interested in close-reading this graph and more interested in thinking through how it compares to yEd.

Gephi is certainly more aesthetically pleasing than yEd, especially given the settings I was using on the latter. And, unlike yEd, Gephi can very easily translate multiple copies of the same interaction into more heavily weighted lines, which helps provide a better idea of who speaks to whom how often in the novel (something I had been struggling with last year). At the same time, yEd’s layout algorithms seem far more interesting to me than Gephi’s “play around with Force Atlas until it looks right” approach. So while the layout does, I think, do a decent job of capturing centrality and periphery, it is less interestingly suggestive than yEd.

The other failing that Gephi has is the lack of an undo button. This might seem trivial to some of you, but being able to click on a node, delete it from the graph and then quickly undo the deletion was what made it so easy for me to do “Daniel Deronda without Daniel (and, erm, Gwendolen)”. With Gephi, I have this paranoid fear that I will lose the data forever and it will automatically save and I’ll have to do all this work over again. After a while, I finally screwed my courage to the sticking place and deleted our main characters to produce the following three graphs.


Daniel Deronda without Daniel


Daniel Deronda without Gwendolen


Daniel Deronda without Daniel or Gwendolen

The results are interesting, although perhaps less interesting than the disk-shaped diagrams from yEd that demonstrated changes in grouping. yEd allowed for some rather fine-grained analysis about who was regrouped with whom. On the other hand, Gephi makes it clear that both Gwendolen and Deronda tie together groups that, otherwise, are more distinct, as shown by the sudden proliferation of color in the first and third graphs particularly. Gephi makes it easy to see Deronda’s importance in tying many of the characters together. His influence on the networks is far stronger than Gwendolen’s.
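A much cruder version of the "delete a character and see what falls apart" experiment can be sketched in Python by counting connected components before and after removal. This is a simplification under stated assumptions: the network below is invented and is not the Daniel Deronda data, and connected components are not the same thing as the groups a modularity algorithm finds, so the sketch only shows the general shape of the test.

```python
# Invented network for illustration; not the hand-coded novel data.
conversations = [
    ("Deronda", "Mirah"),
    ("Deronda", "Mordecai"),
    ("Deronda", "Hans"),
    ("Deronda", "Gwendolen"),
    ("Gwendolen", "Grandcourt"),
    ("Grandcourt", "Lush"),
]

adjacency = {}
for a, b in conversations:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def components(adjacency, removed=()):
    """Connected components of an undirected graph, ignoring removed nodes."""
    unvisited = set(adjacency) - set(removed)
    groups = []
    while unvisited:
        start = unvisited.pop()
        group = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                # Removed and already-visited nodes are both absent here.
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    group.add(neighbor)
                    stack.append(neighbor)
        groups.append(group)
    return groups

for removed in [(), ("Deronda",), ("Gwendolen",), ("Deronda", "Gwendolen")]:
    label = " and ".join(removed) or "nobody"
    print("Without", label, "->", len(components(adjacency, removed)), "pieces")
```

Because nothing is deleted from the underlying dictionary, the "removal" is always reversible, which sidesteps the missing undo button entirely.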

Now, for the sake of comparison, here are the Gephi and yEd graphs side by side.

Daniel Deronda Gephi and yEd Comparison

I have not yet performed a more complete observational comparison of the layout, centrality measures and grouping algorithms in Gephi versus yEd (which, I admit, would begin with researching what they all mean) and the relationship between how data is presented and what questions the viewer can ask, but here are my preliminary reactions. Gephi does a far better job of pointing to Deronda’s importance within the text while yEd is better at portraying the upper-class social network in which Gwendolen is enmeshed. And while Gephi’s layout invites the viewer to think of its nodes in terms of centrality and periphery, yEd’s circular layout structures one’s thought along the lines of smaller groups within networks. Different avenues of inquiry appear based on which graph I look at.

This comparison produces three different questions.

  1. How do you know when to use which program? Can one tell at the outset whether the data will be more interesting and approachable in Gephi, e.g., or is this the perfect application of the “guess and check” approach where you always run them both and then decide which graph is more useful for the kinds of questions you want to ask? Are my conclusions here, about Gephi’s focus on centrality versus yEd’s focus on group dynamics, representative?
  2. How meaningful are the visual relationships one perceives in the network?
    1. Let’s take the graph above as an example and go for the low-hanging fruit. Young Henleigh, the illegitimate son of Grandcourt is way down at the bottom of the graph, connected unidirectionally to his father (his father speaks to him, but he does not speak back) and bidirectionally to his mother, with whom he converses. Gephi has colored him blue, indicating that, at least according to Gephi’s grouping algorithm, he is more closely associated with the other blue characters (a group made up predominantly of those who show up in Daniel’s side of the story and who I am valiantly resisting calling the Blue Man Group). Arguably, this is because those in Deronda’s circle talk slightly more about the boy since they have heard rumors of his existence, while those in Grandcourt’s social circle have not. And Henleigh’s repulsion distance is another indicator of how Grandcourt ignores his son and keeps his family at a distance.
    2. That is, I think, a fair reading of the book Daniel Deronda. My conclusions are borne out in the text itself and are justifiable within the larger narratives of Grandcourt’s treatment of others, a topic that I’ve written about several times over the course of my graduate career. But is it a fair reading of the graph? Am I taking accidents of layout as purposeful signals? Or are my claims, grounded as they are in edge distance and modularity, reasonable?
  3. In addition, did the graph actually tell me this information in a way that the book did not or did it simply remind me to look at what I already knew? This is part of an old and still unanswered question of mine – will the viewing of the social network graph ever really be useful or is it the decisions and critical moves that go into making the graph that produce results?

Obviously, this last question only applies to work like mine, where the graph is hand-coded and viewed as a model of an individual text. In cases where this work is mostly automated and several hundreds of novels are being studied for larger patterns of interactions, the question of whether the graph or the making thereof produces the information is irrelevant.

But the question of what kinds of meaning can be located in layout and pattern is still crucial, especially when one is comparing how different networks “look”. This may be a particularly pernicious problem in literary criticism and media studies: we’re trained to look at texts and images and treat them as…intentional. Words have meaning, pictures have meaning and we talk about this larger category of “media objects” in a way that assumes that their constituent parts have interpretable significance. This is not the same as claiming authorial intentionality; it’s simply an observation that, when we encounter a text, we take it as given that we can make meaning using any element of that text that impinges on our consciousness. There are no limits regarding what we can read into word choices, provided we can defend our readings and make sense out of them. Is that true of graphs? Are we entitled to make similar claims by reading interpretations into features of the layout, with the only test of said interpretation’s veracity being our rhetorical ability to convince someone else to buy it? For example, could I claim that Juliet Fenn’s position on the graph between Deronda and Gwendolen shows that she, and all that she stands for, comes between them? My instinct is to say no. But the same argument about place applied to a different character makes perfect sense. Mordecai’s place between Deronda and the group of Jewish philosophers on the far right is emblematic of how he connects Deronda to his nation and how he is the one who rouses Deronda’s interest in Zionism.

I can think of three off-the-cuff responses to this problem. The first is to say that location is a fluke and, when it corresponds to meaning, that’s an accident. This feels unsatisfying. The second is to say that there is something about Juliet Fenn that I’m missing and, were I to apply myself to the task, I could divine the reason behind her placement. This is differently unsatisfying, not because I don’t think I can come up with a reason, but because I am afraid that I can.*** And if I succeed in making a convincing argument, is that because I unearthed something new about the book or because I’m a human being who is neurologically wired to find patterns, a tendency exacerbated by my undergraduate and graduate training in the art of rhetorical argument? In short, the position that all claims that “can” be made can be taken seriously is only marginally less absurd than the claim that all layout elements are always meaningless and, consequently, any meaning we make or find is insignificant. The third response heads off in a different direction. Perhaps my discomfort with reading these networks lies not in the network, but in my own lack of knowledge. I have not been trained in network interpretation and I need to stop thinking like a literary theorist and start thinking like a social scientist. I need to learn a new mode of reading. This, while perhaps true, also leaves me dissatisfied. I am not, fundamentally, a social scientist. I am not looking for answers, I’m looking for interesting questions/interpretive moves/ideas worth pursuing. While it would be very cool to show, in graph form, how Mordecai’s ideology spreads to Daniel and how ideas act as a kind of positive contagion in this novel, that theory is not stymied if there is insufficient data to prove it. I can take imaginative leaps that social scientists responsible for policy decisions must absolutely eschew.

Which means it is time to think about a fourth position. If we, as scholars of media in particular, are going to continue doing such work, then we need a set of protocols for understanding these visualizations in a manner that both embraces the creativity and speculative nature of our field and articulates the ways in which this model of the text corresponds to the actual text. Such a set of guidelines would be useful not only as a series of trail markers for those of us, like me, who are still new to this practice and unsure of where we can step, but also as a touchstone that we can use to justify (mis)using these graphs. If the sole framework currently in existence is one that does not account for our needs, we may find ourselves accused of “doing it wrong” and, without an articulated, alternative set of guidelines, it becomes exponentially more difficult to respond. On the most basic level, this means having resources like Ted Underwood’s explanation of why humanists might not want to follow the same steps that computer scientists do when using LSA available for network analysis. Underwood explains how the literary historian’s goal differs from the computer scientist’s and how that difference affects one’s use of the tool. Is there a similar post for networks? Is there an explanation of how networks within media differ from networks outside of media and advice on how to shift our analytic practice accordingly? Do we even have a basic set of rules or best practices for this act of visualizing? And, if not, can we even claim these tools as part of our discipline without actually sitting down and remaking them in our image?

I don’t want to spend the rest of my scholarly career just borrowing someone else’s tools. I want Gephi and yEd…and MALLET and Scalar and, yes, even R to feel like they belong to us. Because right now, for all that I’ve gotten Gephi to do what I want and even succeeded in building a dynamic graph of the social network of William Faulkner’s Light in August (which told me nothing I did not already know from reading the book), I still feel like I’m playing in someone else’s sandbox.

*Granted, this is Twitter and so three posts, each retweeted several times, can make quite a little waterfall.

**I will say that the R learning curve made figuring out Gephi seem nearly painless by comparison.

***In the interest of proving a point, a short discussion of Juliet Fenn: Juliet Fenn’s location between Deronda and Gwendolen and at the center of the graph is significant precisely because she is the character who represents what each of them is not. Juliet is of the more aristocratic circle defined by Sir Hugo and his peers and, unlike Daniel, actually belongs there by birth. She beats Gwendolen in the archery contest, which proves her authenticity both in terms of talent and, again, aristocracy. Were either Daniel OR Gwendolen authentically what they present themselves as (and, coincidentally, who their co-main-character perceives them to be), Juliet Fenn would be Gwendolen’s mirror and Deronda’s ideal mate. As neither Gwendolen nor Daniel are, in fact, who they seem to be, Juliet is neither. She is merely a short blip during the early chapters of the book who can be easily ignored until her graphic location discloses the subtle purpose of her character–the idea of a “real” who Gwendolen cannot be and Deronda cannot have. Of course, neither character explicitly wants or wants to be Juliet. This isn’t meant to be explicit, merely to color our understanding of the otherness of Deronda and Gwendolen. It’s not that Juliet Fenn keeps them apart per se, but the discrepancies between who she is and who they are, as illustrated by the graph, are what make any relationship between Gwendolen and Deronda impossible.

The Limits of Social Networks

Though we have mostly gone our separate ways over the past year, I find that I am attached to the idea of the LuAn collective and want to keep it going just a bit longer. After all, you never know when you might need a data viz blog that you co-run.

As a second year student in the English department at UCSB, I am gearing up to take (i.e. reading madly for) my qualifying exams this June. As luck would have it, I am also finishing up my course requirements this quarter, so I find myself in the…unenviable position of writing a paper on a topic that would ordinarily lie far outside my interests in the 19th century English novel: William Faulkner. So I did what any digital humanist with an unhealthy interest in visualization would do in my situation – I made a graph.

I wanted to write a final paper for this course that reflects my theoretical interests and would allow me to continue developing a subset of my digital skills. Of course, trying to get all of my interests to move in more or less the same directions is like herding kittens, but I had been seeking another opportunity to think through a novel using a social network graph and, well, I wouldn’t have to start from scratch this time. I knew how my graphing software, yEd, worked and I knew how long it took to turn a book into a collection of Excel cells denoting conversations (20% longer than you think it will take, for those of you wondering). So why not create a social network graph of one story in Yoknapatawpha?

Don’t answer that question.

Light in August is widely considered to be the most novel-like of Faulkner’s novels, which made it a good choice for my project. After all, I had experience turning a novel-like novel into a social network graph and no experience whatsoever with a text like The Sound and the Fury. Much as I was intrigued by and even enjoyed The Sound and the Fury and Absalom, Absalom!, the prospect of figuring out the rules for graphing them was…intimidating to say the least.

For all its novelistic tendencies, Light in August is still decidedly Faulknerian and, in order to work with it, I found myself either revising some of my previous rules or inventing new ones. When I worked on George Eliot’s Daniel Deronda, I had used a fairly simple set of two rules: “A bidirectional interaction occurs when one named character speaks aloud (that is, with quotation marks) to another named character. A unidirectional interaction occurs when a named character speaks aloud about another named character.”

Here are the Faulkner rules:

  1. When one character speaks to another, that interaction is marked with a thicker, dark grey arrow.
  2. When one character speaks about another, that interaction is marked with a thin, dark blue arrow.
  3. When one character speaks to another within another character’s narration (i.e. X is telling a story and, in it, Y talks to Z), that interaction is marked with a thicker, light grey arrow
  4. When one character speaks about another within another character’s narration, that interaction is marked with a thin, green arrow.

There are several changes of note here. First, I learned more about yEd and figured out how to put properties like line size and color in the spreadsheet itself so that the software would automatically map color and line weight as appropriate. This meant I could make finer and clearer distinctions than last time, at least in terms of showing kinds of communication. Second, I changed the rule about quotation marks because quotation marks don’t necessarily connote audible speech in Faulkner, nor does their absence connote internal monologue. I relied entirely on the dialogue tags in the text to decide whether a sentence was spoken aloud or not. Finally, I changed the rule about named characters. All speaking characters are represented in the graph, regardless of whether or not we are ever told their names. Had I not changed this rule, the number of characters of color represented in this graph would have fallen from 15 to 3. There are 103 distinct nodes in this graph, which means 103 characters speak in this text.
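The four rules and the spreadsheet trick can be sketched in a few lines of Python. This is a rough illustration only: the column names, the hex colors and the numeric line widths are assumptions made up for the example, not the values in the actual spreadsheet, and the characters X, Y and Z are the placeholders from rule 3.

```python
import csv
import io

# Hypothetical style table following the four Faulkner rules.
# Keys are (kind of speech, whether it happens inside another character's narration).
# Widths and hex colors are illustrative assumptions.
EDGE_STYLES = {
    ("to", False): {"width": 3, "color": "#404040"},     # speaks to: thicker, dark grey
    ("about", False): {"width": 1, "color": "#00008B"},  # speaks about: thin, dark blue
    ("to", True): {"width": 3, "color": "#C0C0C0"},      # narrated speaks to: thicker, light grey
    ("about", True): {"width": 1, "color": "#008000"},   # narrated speaks about: thin, green
}

FIELDS = ["source", "target", "kind", "narrated", "width", "color"]


def edge_row(source, target, kind, narrated=False):
    """Build one spreadsheet row, with the style looked up from the rule table."""
    style = EDGE_STYLES[(kind, narrated)]
    return {"source": source, "target": target,
            "kind": kind, "narrated": narrated, **style}


def write_edges(rows, handle):
    """Write rows as CSV so graphing software can map width and color per edge."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder example: X tells a story in which Y talks to Z, and X talks about Y.
rows = [edge_row("Y", "Z", "to", narrated=True), edge_row("X", "Y", "about")]
buffer = io.StringIO()
write_edges(rows, buffer)
print(buffer.getvalue())
```

Keeping the style in a lookup table means that a change of heart about color or line weight touches one place rather than every row.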

Jeffrey Stayton, in an article entitled “Southern Expressionism: Apocalyptic Hillscapes, Racial Panoramas, and Lustmord in William Faulkner’s Light in August” (which, in the interest of full disclosure, I am still in the middle of reading), discusses how Faulkner figures racial landscapes in Light in August as a kind of Southern Expressionism. It is fitting, of course, that one of Willem de Kooning’s expressionist paintings is based on and entitled “Light in August”. But this graph highlights the relationship between fading into the background and remaining unnamed; it shows how easily racial landscapes can become racial backgrounds and how easy it is to elide the unnamed. In the Victorian novel, a certain charactorial parsimony seems to ensure that everyone who speaks is named. Daniel Deronda is 800 pages long and contains 62 character nodes. Light in August is 500 pages long and contains 103. If you remove all the unnamed characters, there are 44 character nodes. (For those of you counting, that’s 38/88, close to half of the white characters, and 12/15 or four fifths of the black characters. The other 8 are groups of people, who seem to speak and are spoken to fairly often in this text.)

There are several ways to interpret this difference and I am loath to embrace any of them without, frankly, having done more work both with Faulkner and with the Victorian novels. One of the things I find striking, though, is that Light in August seems to be making visible (though only just) things that are either not visible or entirely not-present in Daniel Deronda. Light in August is told from different characters’ viewpoints and the narration always locates itself in their perspective and confines itself to what they know. So the graph becomes a record not only of what they have seen, but also of how they have seen it.

I can hear some of you grumbling “What graph? You haven’t shown us a graph yet!”

My apologies. For that, I will give you three. Anything worth doing is worth overdoing.

1) The first graph.

Light in August Social Network Organic Disk

Click to see it in full size.

In this graph, color corresponds to importance, as determined by number of interactions. The darker the color, the more interactions that character has had. That dark red mark in the middle is Joe Christmas.
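For the curious, here is a minimal sketch of how “darker means more interactions” could be computed. It assumes that an interaction counts once for each end of an edge and that darkness scales linearly with the count. Both are assumptions for the sake of the example; yEd’s own mapping may well differ.

```python
from collections import Counter


def interaction_counts(edges):
    """Count how many edges touch each character (both ends of every edge count)."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


def shade(count, max_count):
    """Map a count to a hex red: pale for few interactions, dark red for the most.

    The linear scale and the two endpoint colors are illustrative choices.
    """
    t = count / max_count if max_count else 0.0
    red = 255 - round(116 * t)          # 255 down to 139
    green_blue = round(230 * (1 - t))   # 230 down to 0
    return f"#{red:02x}{green_blue:02x}{green_blue:02x}"


# Placeholder edges, not taken from the Light in August data.
edges = [("A", "B"), ("A", "C"), ("B", "A")]
counts = interaction_counts(edges)
busiest = max(counts.values())
for name, count in counts.most_common():
    print(name, count, shade(count, busiest))
```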

2) The graph without the unnamed characters

Light in August Social Network Organic Disk Sans Unnamed

Click for full size.

Colors mean the same here as they did in the previous graph.

There are several differences between the two graphs. Obviously, the second is legible in a way that the first one is not, which is not entirely a virtue. When it comes to graphing, legibility and completeness tend not to walk hand in hand. The more you leave out, the more you can see so, contrapositively, the less you can see, the less you have left out. The best-of-both-worlds solution is to use both images.

Interestingly enough, there are no unconnected nodes in the second image, even though I deleted half of the nodes in the graph. That surprised me. I expected to find at least one person who was only connected to the network through one of the unnamed characters, but there’s no such person. And many of the people who remain are not characters I would consider to be important to the story (Why has the entire history of the Bundren family remained more or less intact? Who is Halliday, anyway?)
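The check described here, deleting a set of nodes and then looking for anyone left with no edges at all, is easy to sketch. The node names below are placeholders rather than characters from the novel, and “unconnected” is taken to mean “has no remaining edges”, which is an assumption about what the graph showed.

```python
def remove_nodes(edges, to_remove):
    """Drop every edge that touches a removed node."""
    return [(s, t) for s, t in edges if s not in to_remove and t not in to_remove]


def isolated_nodes(nodes, edges):
    """Return the nodes that no remaining edge touches."""
    connected = {node for edge in edges for node in edge}
    return set(nodes) - connected


# Placeholder data: C is only attached to the network through an unnamed character.
nodes = {"A", "B", "C", "unnamed man"}
edges = [("A", "B"), ("C", "unnamed man")]
unnamed = {"unnamed man"}

kept_edges = remove_nodes(edges, unnamed)
stranded = isolated_nodes(nodes - unnamed, kept_edges)
print(sorted(stranded))
```

In this toy case C is stranded once the unnamed character goes; in the actual graph, the same check came back empty.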

These are questions to be solved, or at least pondered. They are, at any rate, questions worth asking. If the network remains intact without these characters, what does their presence signify? What has changed between the first graph and the second?

After all, I do have a paper to write from all of this.

I promised you a third graph, did I not? This one moves in a rather different direction. As part of its ability to organize and rearrange your graph, yEd has a grouping functionality and will divide your graph into groups based on the criteria you choose. I had it use natural clustering.

A grouping into natural clusters should fulfill the following properties:

  • each node is a member of exactly one group,
  • each node should have many edges to other members of its group, and
  • each node should have few or even no edges to nodes of other groups.
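These three properties can be checked against any proposed grouping. The sketch below does not reproduce yEd’s clustering algorithm (I don’t know its internals); it only tests whether a given grouping is a true partition and counts how many edges stay inside groups versus crossing between them. All names are placeholders.

```python
def is_partition(nodes, groups):
    """True if every node appears in exactly one group."""
    members = [node for group in groups for node in group]
    return len(members) == len(set(members)) and set(members) == set(nodes)


def edge_split(edges, groups):
    """Return (edges inside a group, edges crossing between groups)."""
    group_of = {node: index for index, group in enumerate(groups) for node in group}
    inside = sum(1 for s, t in edges if group_of[s] == group_of[t])
    return inside, len(edges) - inside


# Placeholder graph: two tight pairs joined by a single crossing edge.
nodes = {"a", "b", "c", "d"}
edges = [("a", "b"), ("c", "d"), ("b", "c")]
groups = [{"a", "b"}, {"c", "d"}]

print(is_partition(nodes, groups))
print(edge_split(edges, groups))
```

A good natural clustering is one where the first number is large relative to the second.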

yEd gave me 8 distinct groups, two of which had only two nodes in them.

Light in August Social Network Grouped

As always, click for full-size.

I assume that when yEd said that the groups would have few or no edges to nodes in other groups, it was doing the best it could with the material I gave it. I then had yEd rearrange the positions of the nodes so that the centrality of a node’s position within a group indicates how many connections it has.

What I love about this graph is how it divides Light in August into a set of six interconnected but distinct narratives. Each group larger than two centers around a specific character or group of characters involved in one thread of narrative. Joe Christmas, who is arguably the main character, has one section (along with a plurality of the other characters of color), Lena Grove, Byron Bunch and Joe Brown are all grouped together in another and, while they talk about the characters in Joe Christmas’s section quite often, they have only three conversations with the characters in that group. Those are the two largest groups. Percy Grimm, for all that he only appears in one chapter, manages to collect 7 other nodes around himself and does seem, in his own way, to be the protagonist of his own story who just walked into this one for one chapter and then left again. He is also the only named character in his section.

Social network graphs are, for me, a way of re-encountering a text. They strip away most of the novel and model only a small portion of what is present in the text, but that portion becomes both visible and analytically available in a new way. (I think seeing and visibility will become a theme in this paper, once I write it.) The title of this course is “Experimental Faulkner”. I like to think that this qualifies.

To each his (or her) own.

In this post, I would like to bring up something that I know Liz, Meaghan, and I have talked about in person, but have yet to discuss in LuAn (you know, the “cool” way to refer to our blog, Ludic Analytics).  The theme of today’s post:  Subjectivity.

Many of the visualizations that we create are based on a series of rules; regulations that each person as a visualization creator must invent before approaching the source to collect data. In life, I tend to be a rule follower. I’m good at standing in lines, maybe not so good at coloring inside the lines, but overall I like structure. The problem, however, is that when one creates the rules oneself, a) it is easier both to follow and (occasionally) to break them, and b) one person’s rules will be different from another’s.

I’ll give an example to help elucidate the point.  Liz and I both have worked on network graphing the dialog in the novels that we are studying (I believe she will post some interesting graphics on her work soon).  The other day, she mentioned the problem of judging what is, and what is not, dialog in a novel.

With certain genres, such as plays, this is less of a problem.  That’s why, I believe, many network graphs of literature are often done on theatrical works, especially Shakespeare (like Moretti’s work on Hamlet).  However, with novels, there are different types of dialog, and sometimes it is not as easy to grasp the flow of conversation.

I know that when I approached this problem, I resorted to making a list of rules.  I needed some structure to validate what I was doing.  I think, in a way, I wanted to make it more “scientific.”  Here are a few examples from my recent dialog network project:

1) It counts as dialog even if the protagonist talks to himself, as long as the comment is made “outloud” (in La tumba, this type of dialog is marked by a “-”, so it’s easier to see compared to some other novels)

2) If however, the comment is not “outloud” it does not count

3) Implied dialog does not count (if there is mention of two characters talking, but the reader doesn’t know what was said)

4) If the speaker is talking to more than one person, each listener will be listed

5) The first person who speaks is the speaker, and the other is the listener (meaning that directionally, the arrow representing the edge between the nodes will travel from speaker to listener, even if both actively participate in the conversation).

The list goes on.  Despite the fact that I created the list, I still found myself in the midst of grey areas.  To attempt to find more black/white territory and avoid the penumbra, I would make another rule.

This might be an extreme example; I got a bit carried away with the rule making.  Yet, anyone who has approached a text for this type of data classifying knows that it can be challenging to decipher different aspects of a text or in this case (to continue with the example) every instance of dialog.
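To make the rules above concrete, here is a rough sketch of how rules 1, 2, 4 and 5 might turn hand-collected conversations into directed edges. The record format (first speaker, list of listeners, whether the line was said aloud) is a hypothetical one invented for this example; rule 3 is handled simply by never recording implied dialog in the first place.

```python
def dialog_edges(conversations):
    """Turn conversation records into directed (speaker, listener) edges.

    Each record is (first_speaker, listeners, spoken_aloud), a made-up format.
    """
    edges = []
    for speaker, listeners, spoken_aloud in conversations:
        if not spoken_aloud:
            continue  # rule 2: comments that are not said aloud do not count
        for listener in listeners:
            # rule 4: one edge per listener
            # rule 5: the arrow runs from the first speaker to the listener
            edges.append((speaker, listener))
    return edges


# Placeholder records, not taken from La tumba.
conversations = [
    ("A", ["B", "C"], True),   # A addresses two people at once
    ("A", ["A"], True),        # rule 1: talking to oneself aloud still counts
    ("B", ["A"], False),       # unspoken, so it is skipped
]
print(dialog_edges(conversations))
```

Of course, the code only looks objective. Every judgment call (was it aloud? who was listening?) has already been made by the person filling in the records.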

In fact, almost all of the work done for this visualization was by hand (excepting, of course, the actual visualization), which, incidentally, brings up the other issue of human vs. machine readings. Could I have saved myself the work of manually mining the data? Perhaps. I’m sure some sort of program could be written to do the reading for me. But, would the computer “know” who is talking? Can the computer understand the context enough to fill in the character’s name if it were not mentioned? Yes, but only if I “taught” the computer to do that, and even then, it might not always be right. Also, while I may teach students daily, my methods for teaching a computer (i.e. programming) are not at the level where I could teach it to recognize specific characters.

In my case, it was easier to go through the book myself.  The result?  It was not as exciting as I had hoped, but I don’t know what I expected.  After all, in a first person narrative, most of the dialog does indeed revolve around the protagonist, with very few instances (just one in this case) of outside conversations.  However, the only slightly surprising factor was how many conversations there are.  Twenty-two nodes!  I know the novel is known for its use of dialog, but I had not realized how many different people are a part of these conversations with the protagonist.

Dialog Network for La Tumba

As for the subjectivity aspect of this post, it would be interesting to see someone else’s dialog network visualization of the same work (and based on his or her set of rules).  Would this somehow change the appearance of the graph?  I assume it would, considering even the presentation aspect was up to me.  I picked the color, yellow (seemed like a good choice at the time) and then changed the layout to better see the edges, so it was more aesthetically pleasing (at least to my eyes).

The more and more that I work with visualizations, the more I realize how much they are an extension of me: from my methodology in collecting the data, to my interpretation of the data, and finally to my presentation of it.  However much I strive to make a logical and objective product, I can never seem to separate it from being a form of (personal) expression.  Yet, I continually ask myself, is that a bad thing?  I think/hope not.

What Are We Doing With Our Visualizations?

A colleague of mine pointed me towards the following post about Shock and Awe Graphs in the Digital Humanities. The author, Adam Crymble, makes some decidedly thought-provoking points about what graphs are meant to be doing and how data visualization can sometimes work as a tool of intimidation as well as elucidation.

So before you publish a visualization, please take a moment and step back. As in the cult classic, Office Space, ask yourself: Is this Good for the Company?

Is this Good for Scholarship?

Or am I just trying to overwhelm my reviewers and my audience?

The authors of the blog Clioviz respond to Crymble’s question with a post In Praise of Shock and Awe, which also (and unsurprisingly) has some very good points to make about the value of disseminating information via visualization. They note that a certain amount of “shock and awe” is inevitable in fields like ours where the mere existence of plotted data points is enough to give some scholars palpitations. The main thrust of their argument, however, is that complex, beautiful and awe-inspiring graphs are not inherently a bad thing when they are usable. If a graph is complex to the point of unreadability, that is usually because the graph-er was attempting a kind of elegant complexity and failed. (This, of course, returns us to one of the basic problems of DH: we’re doing things we were never trained to do and the success of being able to do them at all blinds us to the necessity of doing them well.)

Both pieces make certain assumptions that I think we, as the Ludic Analytics group, are not willing to make. The first is that visualizations exist to convey information to the reader and the second that visualizations must have some immediately identifiable utility. The visualization presented at the beginning of Crymble’s piece is meant as a joke, but because a) he doesn’t provide any more serious examples and b) the point I’m trying to make works just as well, I am going to pretend it is real and assume that if I can answer his reductio ad absurdum with logic, then said logic can surely be applied to more reasonable work. Like others of its ilk, this image is a piece of art I would frame and hang on my wall rather than a readable graph. It offers very little in the way of interpretation to the untrained viewer and is, as Crymble says with his tongue firmly in his cheek, about 18th century cattle’s preference for south-facing barns. Crymble is frustrated when asked to view graphs like this as proof. However, were there no image whatsoever–had he merely read a paper that claimed to have looked at the data and found that cattle preferred south-facing barns–I would imagine he would have had less trouble with the assertion. This visualization exists because it can, not because it makes any particular point. It is there to be beautiful. And were it real, it would also show that the researchers engaged with the data to the extent necessary to produce such a graph. It would not be proof of point, but proof of process. And I would imagine that the task of creating such a visualization and dealing with the information would give the researchers a better understanding of their data, even if the visualization lacks a trickle-down effect of understanding to the reader.

This brings me back to something I had discussed earlier, which is that our data always seems to be more useful for ourselves than for our readers. This may explain why the scholarly article and book have had such a long life; they don’t simply convey understanding, they enact it as well. Close reading recreates, in the article, the process through which we imbue texts with meaning. The act of applying historical research to a volume of literature mimics the act of research and the flash of understanding that comes when one grasps how a specific historical fact is relevant to the text at hand. Articles are processes, they are a temporal movement towards the end of an argument. Visualizations, however, lack that sense of journey. They are always, already, at the end even when you, the reader, are still at the beginning.

I can think of several possible solutions to this problem. One is to accompany visualizations with detailed descriptions of their genesis (Stephen Ramsay does this to good effect in his article “In Praise of Pattern”). Another is to create dynamic visualizations that can operate on a temporal as well as spatial scale. For example, imagine a social network graph where you can watch the edges build up between the different nodes while the nodes move around to create different groupings as the networks grow over the course of a novel. You could even have edges fade slightly if a connection has not been mentioned for over ten chapters, for example. As might be evident, I find this idea truly exciting and would love to imagine a novel performed as a network graph. A third option would be to use the visualization not as proof of theory, but as a starting point for the reader to form her own conclusions about the topic. The visualization becomes a way to share data rather than results and the reader is invited to tell her own story with it (I am drawing this idea from N. Katherine Hayles’s new book, How We Think). The data, and database, are an interface where textual exploration can happen rather than a static image of exploration someone else has already done.
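The fading-edge idea can be sketched without any graphing software at all. What follows is a minimal illustration, assuming each interaction is logged as (chapter, speaker, listener); the ten-chapter grace period comes from the paragraph above, while the fade step and the minimum opacity are arbitrary values picked for the example.

```python
def edge_opacity(last_mentioned, current_chapter, grace=10, step=0.1, floor=0.2):
    """Full opacity within the grace period, then a gradual fade down to a floor."""
    idle = current_chapter - last_mentioned
    if idle <= grace:
        return 1.0
    return max(floor, 1.0 - step * (idle - grace))


def snapshot(events, current_chapter):
    """Return {edge: opacity} for every edge seen up to the current chapter."""
    last_seen = {}
    for chapter, source, target in events:
        if chapter <= current_chapter:
            edge = (source, target)
            last_seen[edge] = max(chapter, last_seen.get(edge, chapter))
    return {edge: edge_opacity(chapter, current_chapter)
            for edge, chapter in last_seen.items()}


# Placeholder event log, not drawn from any particular novel.
events = [(1, "A", "B"), (3, "B", "C"), (14, "A", "B")]
for chapter in (2, 15):
    print(chapter, snapshot(events, chapter))
```

Rendering one snapshot per chapter and stringing the frames together would give the animated version: edges appear as characters first speak and dim as the novel stops mentioning them.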

These last two solutions require a somewhat radical rethinking of data presentation. Putting the visualization in as “(fig. 3)” on page 6 of the printed article is no longer going to cut it. Articles are very good at what they do, which is provide a forum in which to recreate traditional practice so that the reader can experience it along with the author. If we want our readers to experience our non-traditional readings along with us, we’re going to need non-traditional modes of delivery to do it.

Animation and Information

So I have found myself increasingly drawn to the idea of these phrase nets. There’s something about the way they ask me to engage with the text on a decontextualized level that I love. Certain specific words make me wonder about how they are being used in the text and I try to remember when they might be deployed, but others just distract me and intrigue me.

Like the networks of body parts here. Perhaps if I animate it… (click image for animation)

Animated Phrase Net

I suppose my question here is “What is the difference between the animated version and the earlier ones?” Is there something more compelling about a dynamic visualization? On the simplest level, I find myself spending more time staring at things that move, but does that make this a “better” visualization if I don’t even know what I’m trying to convey with it? Then again, the reason I don’t know what it means is because I haven’t returned to the text yet to think about it. If visualizations are a tool for analysis as well as a form of…art, I suppose (My artistic skills leave what to be desired, but art nonetheless), then I need to think about their implications. Or perhaps I don’t. Perhaps my next read-through of Daniel Deronda will be more illuminating even if I’m not directly thinking about this network. Perhaps someone else will see a connection. Or perhaps this gif will just remain here, fading slowly into color.

Questions of usefulness bring me to my second point, which is an article I saw that I felt resonated with something Meaghan had brought up before. There are several people in the field of Digital Humanities, Stephen Ramsay comes to mind immediately, who are insisting that you cannot be a digital humanist if you do not know how to code. (Full disclosure–I can handle basic HTML and have once or twice actually uttered the phrase “Stand back, I know regular expressions”.) But aside from the fact that this more or less relegates people working with Facebook and Twitter and doing really interesting things with technology in the classroom to something else (and maybe they should have a different title, but they seem to be part of the club these days), I am bothered by this assertion and have spent some time trying to work out why…other than the fact that I find marginalization disturbing especially when aimed at me. Then a friend of mine posted the following article to his Facebook page and I got it.

Please Don’t Learn to Code

The author makes several good points, chief among them that we don’t need any more (bad) code in the world and, I have to be honest, most of the software I’ve seen produced by those in the humanities has been just that. We don’t need “good enough” coding, we need excellent coding done by professionals who are willing to share and maintain and update their software so that we, as scholars, can have equally excellent results. Which is not to say that Digital Humanists shouldn’t know a bit about code or shouldn’t decide to make it their “skill” and become just as good as a professional. I have met amateurs in almost every field who can beat the pants off the professionals, but still only do what they do as a hobby. And if you’re that good, please go for it! But here’s my plea. If you’re just going to learn enough to hack something together to get you through a project, a clunky thing that needs you to coax it along and that can’t really be used with any reliability by your colleagues, then perhaps you should think about whether the discipline as a whole will benefit more from your code or from you teaming up with someone who really knows what they’re doing.