When I was a freshman in college one of the first history classes I took included a tour of the university’s main library and an introduction to its vast card catalog, the like of which none of us had ever seen. Our professor patiently explained the arcana of the Library of Congress subject heading system, showed us how a work might turn up in the catalog either by title, author, or subject heading, and then sent us off on a scavenger hunt through the thousands of little file drawers. By the end of our class period, each of us had the beginnings of a bibliography on the subject of our course.
That first foray into the world of real historical research was fun, overwhelming, and educational all at the same time. But it was also limited to secondary sources, and only to those works available in the university library.
How the history student’s world has changed.
Today our students have access to primary and secondary sources beyond count: quite literally tens of millions of primary sources and an equally large and growing corpus of scanned secondary works. My professors taught me in a pedagogical world based on scarcity. Today we teach in a world dominated by abundance.
“Big data” is one of the “big ideas” of the current decade across many sectors of the information economy, and historians and other humanists have already begun working on exciting projects that are helping us find ways to mine emerging super massive datasets of historical information. One maturing example is the Criminal Intent project funded by the NEH’s Digging into Data program (my colleagues Fred Gibbs and Dan Cohen are central players in this project).
As exciting as the Criminal Intent project and other similar data mining efforts are, they are currently operating at a level a bit too complex for the average undergraduate. Simpler data mining tools like Google’s NGram viewer offer a more frictionless introduction to data mining concepts. For instance, I’ve written about how undergraduates might use the NGram viewer to mine millions of words from the Google book database and begin to think about what sorts of historians’ questions might then come out of such a mining exercise.
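The core question behind the NGram viewer, how often a term appears relative to all the words published in a given year, can be sketched in a few lines of Python. The mini-corpus below is purely illustrative (it is not the viewer’s actual data or API), but it shows the kind of computation students are implicitly running when they type a word into the search box:

```python
from collections import Counter
import re

def term_frequency(text, term):
    """Relative frequency of a term: occurrences per word of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)

# Hypothetical mini-corpus keyed by year, standing in for a dataset
# like the Google Books n-gram counts.
corpus = {
    1850: "the railway changed the pace of trade and travel",
    1900: "the telegraph and the railway knit the nation together",
    1950: "television brought the world into the living room",
}

# Trace one word's relative frequency across the years, the same
# trend line the NGram viewer draws.
for year, text in sorted(corpus.items()):
    print(year, round(term_frequency(text, "railway"), 3))
```

Even a toy exercise like this surfaces the historian’s questions that matter: what counts as the corpus, what gets lost in normalization, and why a rising line on a chart is the beginning of an inquiry rather than the end of one.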
Right now, today, getting much beyond these basic sorts of exercises with undergraduates will be difficult. But it is useful to remember that ten years ago it was not so easy to make a web page. Before too much longer the user interfaces for mining massive data sets of historical information — especially texts and images — will be appropriate for the undergraduate curriculum. That means it is already past time for historians to be thinking about how we can incorporate data mining into the undergraduate curriculum. Some interesting graduate syllabi have begun to appear, but data mining, whether text or image mining, seems to be largely absent from the undergraduate history curriculum.
Imagine, for instance, a course that begins with the simplest tools, such as Many Eyes or the NGram viewer, helping history students to see both the strengths and weaknesses of these tools. From there the course could move on to increasingly complex forays into data mining, letting the students range further and further afield as their skills grow. Our colleagues in computer science have already developed such courses, but they would need to be adapted heavily to work with history students who (mostly) lack a background in programming.
In my previous post I pointed out that incorporating “making” into the history curriculum gives us opportunities to build connections to other academic disciplines (art, engineering, graphic design). Data mining offers us similar opportunities (computer science, library science, computational sciences). The more creative we can be about building such linkages, the richer our curriculum can be and the better prepared our students will be for the world they’ll face when they graduate.
But just as important, we’ll be training a new generation of historians to work with the unimaginable wealth of historical information that a decade’s worth of scanning and marking up of texts, images, video, and sound files has made available to us all.