Visualization as an Introduction to Text Mining?

Over the next 24 hours thousands, if not tens of thousands, of pages of text will appear on the Internet that will be of use to historians–books via Google, government documents, primary sources from archives, and many more. This blessing and curse of the digital age presents those of us who teach history with a serious problem.

How do we teach our students techniques for working with all of this text?

Once upon a time, in the “olden days” as one of my students put it recently, college professors took their students to the university library and taught them how to use the many resources available to them there. But we already knew the basics back in that time gone by. We knew what a book was and how to read it–not like a historian reads it, but we knew that books had tables of contents and indexes that could help us find things we were looking for. And, while the university library was pretty overwhelming compared with our high school libraries, it was small potatoes compared to the mass of online text that just keeps growing and growing and growing…

Now that the olden days are gone, it’s up to us to teach our students some new techniques. But where to start? Text mining–the new big thing in digital humanities–is a relatively higher-order skill. Should history majors develop this skill? I happen to think so. And if I’m right, then how might we begin to teach them text mining skills?

One possible way is to use some of the new data visualization tools that are floating around online. One that I think has some promise as a teaching tool (not so much as a research tool) is Many Eyes. As my colleague Dan Cohen has already pointed out, Many Eyes visualizations can be very misleading when we use them to try to understand the meaning of text. That, in itself, is a good lesson for students when they start learning about text mining.

I think that, instead, we need to use these sorts of tools at a much more basic level. For instance, the next time I teach Historical Methods, I plan to use a number of new exercises I’ve been working on to get students thinking about how they might use technology to analyze text. One of these will be using Many Eyes to have them analyze their own writing–just to introduce them to the ways these sorts of tools might be used.

As an example, I took all three of the posts I wrote here titled Making Digital Scholarship Count and poured the text into Many Eyes. Here is an example of what the just over 3,000 words in those three posts look like when converted to a tag cloud.

Right away students can see that a few key terms dominate the text I wrote in those three posts. This doesn’t mean that the posts are definitely about “scholarship”, “digital”, or “work”, but focusing on these three terms is a starting point for analysis. In this case, of course, the posts are about digital scholarship, so seeing those terms writ large like this offers no big analytical breakthrough. This too can be a useful insight for students–sometimes the answer is pretty obvious.
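For students who want to see what sits underneath a tag cloud, the counting can be sketched in a few lines of Python. This is only an illustration, not how Many Eyes itself works: the function names, the short stopword list, and the sample sentence are all invented for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real tools use much longer ones).
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "that",
    "it", "for", "on", "as", "this", "be", "are", "we", "our",
}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each word appears, ignoring case and common stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


def top_terms(text, n=3):
    """Return the n most frequent terms: the words a tag cloud would draw largest."""
    return word_frequencies(text).most_common(n)


# Made-up sample text, standing in for a student's own writing.
sample = (
    "Digital scholarship is scholarship. Digital work counts as work, "
    "and digital scholarship should count as scholarship."
)

print(top_terms(sample))
# → [('scholarship', 4), ('digital', 3), ('work', 2)]
```

A tag cloud is essentially this list with font size standing in for the count, which is also why it says nothing about what the words mean in context.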

But Many Eyes offers additional tools that students can use to play around with basic text mining. For instance, because my posts were obviously most concerned with scholarship, in what contexts did I use this loaded term? Here’s another visualization of my text, using “scholarship” as the key word.

This particular visualization is one they can use to start thinking about how a scholar or an historical actor used certain words.
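The same question, in what contexts does a word appear, can be approximated with a simple keyword-in-context listing. The sketch below is a rough stand-in for that kind of view rather than a reproduction of the Many Eyes visualization; the function name, the `window` parameter, and the sample text are assumptions made for illustration.

```python
import re


def keyword_in_context(text, keyword, window=3):
    """List each occurrence of keyword with up to `window` words on either side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    key = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == key:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Made-up sample text, standing in for the three blog posts.
sample = (
    "Digital scholarship should count. "
    "We reward scholarship that reaches readers."
)

for left, word, right in keyword_in_context(sample, "scholarship", window=2):
    print(f"{left:>12} [{word}] {right}")
```

Reading down the right-hand column shows which words tend to follow the term, which is the kind of pattern students can then go back to the full text to check.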

Obviously, Many Eyes visualizations aren’t sophisticated analytical tools. But they do offer a useful first step toward a more sophisticated understanding of text mining and what it might do for historians.

2 thoughts on “Visualization as an Introduction to Text Mining?”

Tiffany Oxley says:

September 25, 2008 at 10:46 am

Eaagle Software recently released Full Text Mapper (FTM) which provides an outstanding visualization analytical tool. It allows you to not only see the key concepts or topics within texts but also allows you to explore the relationships between the topics. With FTM you can map data contained in Word, Excel, PowerPoint, Text, HTML, and PDF files. The software also has a good reporting tool which allows you to export reports of data findings. Information about the software can be obtained at http://www.eaagle.com/index.php?go=FTM.

Jon Olsen says:

October 20, 2008 at 8:47 pm

Mills, we had one of our alumnae from the Computer Science program here at UMass today discussing this very project – she is on the research team at IBM that is developing it. You raise some very valid points about some of the possible pitfalls, but she raised some interesting aspects as well. One is the social networking and the conversations that are taking place around the data sets and the visualizations. Another is the ability to compare two different texts: it creates a dual tag cloud matching common words between two data sets. This could be interesting for comparing speeches between different people or language used during a debate (there is quite a bit of data now on the site taken from the presidential debates). Personally, I like the idea of being able to graph the data on the maps and the network diagram.
