The Digital Archive and Literary Scholarship: Textual Collation for Dummies

Update: The new version of JUXTA (version 1.4, released September 2010) can collate XML files in any valid schema, including P4 and P5 from the Text Encoding Initiative. So you can now collate encoded texts against plain texts, such as Google OCR scans. For a more elaborately encoded text, you can easily select options that will produce a clean text if you have a basic understanding of TEI. But the post’s point still holds. In most cases you don’t NEED to understand the TEI schema. JUXTA’s default import options work quite well for the majority of XML texts.

I am in the first flush of enthusiasm for JUXTA, the textual collation tool from NINES. It is a beautiful thing that could achieve an absurdly unrealistic goal, to democratize the practice of textual scholarship and make it accessible to most literary scholars. I think the Whitman air is getting to me. Blakeans would never imagine “democratizing” literary scholarship. In any case, JUXTA is a tool of this moment in time, the moment of free online archives of texts. Its transformative power is its ability to compare unencoded texts.

Even though I used the term unencoded in the loose sense (all texts are encoded, see McGann), the fact that JUXTA can collate texts that lack formal markup (that is, REQUIRES texts without markup) was a big stumbling block for me. On trying the latest version, I discovered a stable application with sufficient documentation to collate text in the wild, such as raw OCR. In three days I’ve collated three texts: Sarah Orne Jewett’s The Country of the Pointed Firs, Augusta Jane Evans’s St. Elmo, and Harriet Beecher Stowe’s Uncle Tom’s Cabin. Maybe I should say two “texts in the wild.” I’ve spent too much time domesticating wild texts of Uncle Tom’s Cabin to bother with the wild ones anymore.

With JUXTA, a basic familiarity with online archives, and minimal familiarity with text encoding, you can compare two printed versions of a text with relative ease. To gain a basic familiarity with online archives of texts is not a trivial exercise, but dabblers can get started at my summary of digital American Literature collections. Scholars should search a library catalog instead of my idiosyncratic collection.

JUXTA can be downloaded here. The help page tells you most things that you need to know. But I’ll tell you one more thing from hard-won experience.

Unencoded really means unencoded. All XML tags except the opening header tags, closing tags, and JUXTA’s milestone tags are prohibited. If you download an XML-encoded text, you’ll need to know how to remove its encoding. There are many ways to do this, but an XML parser with an Identity XSLT script is probably the easiest. If I’ve started talking gibberish to you, don’t do it this way. You’ll get frustrated before you ever recognize the beauty of JUXTA. In any case, remove all tags, tag-like character detritus (i.e., angle brackets), and entities (i.e., those things that begin with ampersand signs).
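For readers who would rather skip XSLT altogether, the cleanup described above can be roughed out in a few lines of Python. This is only a sketch, not anything from the JUXTA documentation: the function name strip_markup and the file names in the usage note are my own inventions, and it swaps in regular expressions for the XSLT route. It assumes that anything between a pair of angle brackets is a tag, that known entities should be turned into the characters they stand for, and that a leftover ampersand is best spelled out as "and".

```python
import html
import re

# Anything that looks like a tag: an opening bracket, some text, a closing bracket.
TAG = re.compile(r"<[^<>]*>")
# Entities that html.unescape() does not recognize, e.g. a custom "&zzqx;".
UNKNOWN_ENTITY = re.compile(r"&#?\w+;")
# Angle brackets left over after the tags are gone (common in raw OCR).
STRAY_BRACKET = re.compile(r"[<>]")


def strip_markup(text: str) -> str:
    """Return text with tags, entities, and stray markup characters removed."""
    text = TAG.sub("", text)             # drop the tags themselves
    text = html.unescape(text)           # turn known entities into characters
    text = UNKNOWN_ENTITY.sub("", text)  # drop entities that could not be decoded
    text = STRAY_BRACKET.sub("", text)   # drop orphaned angle brackets
    # Assumption: spell out a bare ampersand rather than delete it.
    return text.replace("&", "and")


if __name__ == "__main__":
    sample = '<p>The <hi rend="italic">Pointed Firs</hi> &amp; other sketches</p>'
    print(strip_markup(sample))
```

To clean a whole file, read it into a string, pass it to strip_markup, and write the result to a new .txt file (say, from a hypothetical jewett.xml to jewett.txt). One caution: because the sketch treats everything between a matched pair of angle brackets as a tag, a stray "<" followed later by a stray ">" in raw OCR will swallow the words between them, so skim the output before collating.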

I hope to return to this soon, but I’ve too much collating to do. After I drafted this, I left it sitting in my draft box. And now I wonder what I was thinking. Most literary scholars will not compare texts because it would not occur to them that the differences could be meaningful. But maybe they can spare a few hours to compare two versions of a novel.

I of course mean “Dummies” in the best sense, that used by Wiley Publishing for its famous series of how-to books on technical matters, for people who want to learn but would rather not let their book titles show their interest in self-improvement. The Dummies label is oddly portable, but I lack sufficient imagination to understand how Post-Traumatic Stress Disorder for Dummies could possibly fit into the series.

By the way, the textual comparison work in the previous post (one below), on Sarah Orne Jewett’s Pointed Firs, is based on JUXTA work.

NOTE: JUXTA is the right tool for collating scanned and OCR’d text if your interest is the major differences between two different printings of the same text. For comparing two copies from the same setting of type, Juxta is the wrong tool. You need a tool for sight-based collating, such as the Hinman Collator, Lindstrand Comparator, Haley’s Comet, or McLeod’s portable collator. But you don’t need a device of any type. I’ve written a post on device-free collating.


6 Responses to The Digital Archive and Literary Scholarship: Textual Collation for Dummies

  1. Wesley, on behalf of the ARP team, thanks for the nod! We’re really proud of Juxta, and hope it finds a wider audience. I just want to qualify what you say here about Juxta requiring “unencoded” texts. It’s true that it works well with plain text (.txt) files, and that those are perhaps easiest for casual users to find or produce. But Juxta is designed to do much more sophisticated things with XML files that have been transformed (generally using XSLT) to the Juxta XML format. Again — not a requirement, but nice for “power users” to know. All this is explained in the user manual available from our web site.

  2. wraabe says:

    Bethany, thanks for the kind words and the explanation. As usual, all partial truths are untruths, unencoded is not unencoded, and plain text is not plain text. Had Juxta documentation a nice little XSLT identity style sheet (say, on a web page) that would strip tags from XML documents, a wider audience of users might not merely try Juxta but also succeed at using it for textual comparison.

    I also hope that the next release will improve (or offer better warnings) on Juxta’s two annoying features of choking on long texts and stray encoding. We students of prose might want to compare more than 120 duodecimo pages [ten chapters of typical 19th-century length]. Juxta complains. Texts in the wild (scanned OCR texts) may have stray angle brackets or ampersands. And a user needs to know that it is necessary to remove them from the file before Juxta will successfully load it.

  3. Pingback: Collation in Scholarly Editing: An Introduction (DRAFT) « Fill His Head First with a Thousand Questions

  4. Pingback: Juxta » Blog Archive » for dummies?

  5. Pingback: N I N E S - News

  6. Thanks for your explanation and sharing this useful information. I found many useful information in this blog. Thanks once more times :D
