Update: The new version of JUXTA (version 1.4, released September 2010) can collate XML files in any valid schema, including P4 and P5 from the Text Encoding Initiative. So you can now collate encoded texts against plain texts, such as Google OCR scans. For a more elaborately encoded text, you can easily select options that will produce a clean text if you have a basic understanding of TEI. But the post’s point still holds. In most cases you don’t NEED to understand the TEI schema. JUXTA’s default import options work quite well for the majority of XML texts.
I am in the first flush of enthusiasm for JUXTA, the textual collation tool from NINES. It is a beautiful thing that could achieve an absurdly unrealistic goal, to democratize the practice of textual scholarship and make it accessible to most literary scholars. I think the Whitman air is getting to me. Blakeans would never imagine “democratizing” literary scholarship. In any case, JUXTA is a tool of this moment in time, the moment of free online archives of texts. Its transformative power is its ability to compare unencoded texts.
Even though I used the term unencoded in the loose sense (all texts are encoded, see McGann), the fact that JUXTA can collate texts that lack formal markup (that is, REQUIRES texts without markup) was a big stumbling block for me. On trying the latest version, I discovered a stable application with sufficient documentation to collate texts in the wild, such as raw OCR. In three days I’ve collated three texts: Sarah Orne Jewett’s The Country of the Pointed Firs, Augusta Jane Evans’s St. Elmo, and Harriet Beecher Stowe’s Uncle Tom’s Cabin. Maybe I should say two “texts in the wild.” I’ve spent too much time domesticating wild texts of Uncle Tom’s Cabin to bother with the wild ones anymore.
With JUXTA, a basic familiarity with online archives, and minimal familiarity with text encoding, you can compare two printed versions of a text with relative ease. Gaining a basic familiarity with online archives of texts is not a trivial exercise, but dabblers can get started at my summary of digital American Literature collections. Scholars should search a library catalog instead of my idiosyncratic collection.
Unencoded really means unencoded. All XML tags except the opening header tags, closing tags, and JUXTA’s milestone tags are prohibited. If you download an XML-encoded text, you’ll need to know how to remove its encoding. There are many ways to do this, but an XSLT processor with a stylesheet that outputs only the text is probably the easiest (an identity transform on its own copies the tags through unchanged). If I’ve started talking gibberish to you, don’t do it this way. You’ll get frustrated before you ever recognize the beauty of JUXTA. In any case, remove all tags, tag-like character detritus (i.e., angle brackets), and entities (i.e., those things that begin with ampersands).
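If XSLT is gibberish to you, a few lines of Python can do the same chore. What follows is only a rough sketch, not anything JUXTA provides: the function name strip_markup is my own invention, and it makes one choice the paragraph above does not, namely that it turns the entities it recognizes into the characters they stand for instead of deleting them. It also strips every tag without exception, so any header or milestone tags JUXTA wants would have to be added back afterward.

```python
import html
import re


def strip_markup(xml_text: str) -> str:
    """Return xml_text with tags, entities, and stray angle brackets removed.

    Hypothetical helper for illustration; it is not part of JUXTA.
    """
    # Drop comments first, since a comment may contain a ">" of its own.
    text = re.sub(r"<!--.*?-->", "", xml_text, flags=re.DOTALL)
    # Drop every remaining tag, declaration, or processing instruction.
    text = re.sub(r"<[^>]+>", "", text)
    # Assumption: resolve the entities Python knows (&amp;, &#8212;, ...)
    # into plain characters instead of deleting them.
    text = html.unescape(text)
    # Delete any entity reference that could not be resolved.
    text = re.sub(r"&#?\w+;", "", text)
    # Finally, remove leftover angle brackets.
    return text.replace("<", "").replace(">", "")


if __name__ == "__main__":
    print(strip_markup("<p>Tom &amp; Eva</p>"))
```

Note that a resolved &amp; leaves a bare ampersand behind. If that causes trouble on import, replace it with the word “and” or delete it before collating.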
I hope to return to this soon, but I’ve too much collating to do. After I drafted this, I left it sitting in my draft box. And now I wonder what I was thinking. Most literary scholars will not compare texts because it would not occur to them that the differences could be meaningful. But maybe they can spare a few hours to compare two versions of a novel.
I of course mean “Dummies” in the best sense, that used by Wiley Publishing for its famous series of how-to books on technical matters, for people who want to learn but would rather their book titles not advertise an interest in self-improvement. The Dummies label is oddly portable, but I lack sufficient imagination to understand how Post-Traumatic Stress Disorder for Dummies could possibly fit into the series.
By the way, the textual comparison work in the previous post (one below), on Sarah Orne Jewett’s Pointed Firs, is based on JUXTA work.
NOTE: JUXTA is the right tool for collating scanned and OCR’d text if your interest is the major differences between two different printings of the same text. For comparing two copies from the same setting of type, JUXTA is the wrong tool. You need a tool for sight-based collating, such as the Hinman Collator, Lindstrand Comparator, Hailey’s Comet, or McLeod’s portable collator. But you don’t need a device of any type. I’ve written a post on device-free collating.