Reliability of Electronic Texts

A blanket statement about the reliability of texts in digital collections is not possible. Users should keep in mind two considerations: the method of text transcription and the provenance of the text. If a digital text (or printed text) includes information about these two topics in its notes or bibliographical apparatus, the reader can make informed decisions about that particular text. The absence of this information should always put the reader on guard.

The best method to establish a text is to choose a text of known provenance and transcribe it by keyboard, at least twice. These two transcriptions can then be compared to note differences, and all differences should be referred back to the original documents. The corrected text should then be orally proofed against the original. Additional electronic collations, careful double-checking of all corrections, and additional oral or silent proofings are necessary. If an editor decides to introduce emendations, he or she should dutifully record any departures from the text of those documents. I’ve spent hundreds of hours doing this for my dissertation. The efficacy of these methods is comparatively well established, but they are extremely time-consuming, and the editors of most texts regard them as uninteresting and not strictly necessary. A negative characterization of the activity of textual scholars is that they engage in a fetish for accuracy. Let’s not argue: each to his or her own obsession be true.
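The comparison step, at least, is easy to mechanize. Here is a minimal sketch in Python, assuming two independent keyings of the same document saved as plain-text files (the filenames are hypothetical); every disagreement the diff reports is a candidate error to check against the original:

```python
# A sketch of the comparison step described above: diff two independent
# keyings of the same document. The filenames are hypothetical.
import difflib

with open("keying_a.txt", encoding="utf-8") as f:
    first = f.read().splitlines()
with open("keying_b.txt", encoding="utf-8") as f:
    second = f.read().splitlines()

# Each line reported here marks a point where the two keyings disagree;
# every such difference should be referred back to the original document.
for line in difflib.unified_diff(first, second,
                                 fromfile="keying_a.txt",
                                 tofile="keying_b.txt",
                                 lineterm=""):
    print(line)
```

The diff only finds disagreements between the two keyings; errors that both typists happened to make identically will survive it, which is why the oral proofing against the original remains necessary.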

The major area of concern is not the reliability of the small set of electronic texts produced by obsessive textual scholars (though such texts are likely to elicit commentary if a scholarly reader with a strong opinion prefers a different approach). The concern is the massive digital collections encountered far more frequently: primary documents reproduced by librarians or database vendors, paperbacks or electronic versions of texts produced by serious literary scholars who are not interested in textual scholarship, and the many outputs of interested amateurs or hacks. In large-scale digital collections, librarians generally select a few hundred (or a few thousand) volumes according to some criteria. Money enough and time is not available to edit properly, so texts are scanned with OCR, perhaps spell-checked, validated and perhaps made to conform to a DTD, and then put out on the web. When texts are scanned using OCR software, an accuracy rate of 95% is often considered good enough.

This is a statistic, so you should trust it less than a damned lie. Remember that 95% usually reflects the best of all possible worlds: a wonderful microfilm copy of a flawless original. If the 19th-century publisher was not thoughtful enough to print in a typeface ideal for OCR, if the copy used for microfilm was bent or torn, or if the microfilm is faded or poorly exposed, the 95% rate is applied against the “original,” which is understood to be that lousy microfilm copy. But even in the best of all possible worlds (the “original” is a superb microfilm copy of an original), a line of type typically has at least 40 characters. Simple math will tell you that any two lines of type will likely include errors in transcription (the sketch after this paragraph works out the numbers). To see how bad this turns out in practice, compare a few lines of the transcribed text in the Making of America (MOA) books collection with the digital images. Facsimile page images are offered to compensate for the unreliability of the transcription. Searchable transcriptions compensate for the fact that images cannot be searched, so long as a person who is aware of the shortfalls remembers to perform multiple searches on letterforms likely to be mistaken in OCR (an interesting project, which someone should undertake!). Despite the shortfalls in accuracy (XML validation and DTD conformance have at best an incidental relationship to accuracy), society and serious scholars thank the professional technocrats and interested amateurs for the massive production.
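To make the “simple math” explicit, here is a back-of-the-envelope calculation. It assumes the 95% figure is per-character accuracy (an assumption; vendors sometimes quote per-word or per-page rates) and a 40-character line:

```python
# Back-of-the-envelope arithmetic for the claim above. Assumptions:
# the 95% figure is per-character accuracy; a line holds 40 characters.
accuracy = 0.95
chars_per_line = 40

p_clean_line = accuracy ** chars_per_line            # ~0.129
p_two_clean = p_clean_line ** 2                      # ~0.017
expected_errors = (1 - accuracy) * chars_per_line    # 2.0

print(f"Chance a 40-character line is error-free: {p_clean_line:.1%}")
print(f"Chance two consecutive lines are both clean: {p_two_clean:.1%}")
print(f"Expected errors per line: {expected_errors:.1f}")
```

Under those assumptions, roughly seven lines in eight contain at least one error, and a 300-page volume at 40 lines per page carries on the order of 24,000 errors.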

The world will never have sufficient numbers of textual scholars to edit even a discernible slice of the world’s print production, so textual scholars generally confine themselves to works that have a considerable claim for importance—works that are interesting to society and thus have a chance of publication or funding. Regardless of whether one is classified as a database producer, librarian, textual scholar, literary scholar, amateur, or hack, one can do a good job or a poor one. Publishers may insist that the work is done well, or they may not care. A press’s reputation is a helpful though not always trustworthy guide to method.

Aside from method, the other factor for analyzing an electronic text is its provenance, that is, which “text” is being transcribed. Library collections will often produce a transcription of the first edition, the edition that is usually desirable to collectors, librarians, and scholars. The practice is probably a good thing because of a well-understood fact about textual transmission: when people transcribe texts, they make errors. You can test this statement by comparing a block quotation from almost any scholarly essay to the text that the author claims to be citing. More often than not, a careful double-check of five or six long quotations will reveal an error (often a small one) in transcription. We all make these mistakes, but when the types of mistakes that even careful scholars make are writ large over the course of a triple-decker (three-volume) novel, the result is hundreds of errors. Transcribing accurately is no mean feat, and any transcription is always a translation from one form into another. However, a transcription of the first edition is likely to have accreted fewer small errors than a transcription of whichever paperback is on the shelf, because the editors of paperbacks, even those aimed at the academic market, are typically not trained textual editors. They are more likely to be prominent critics, who in recent decades have been less likely to receive formal training in textual scholarship. If a project (electronic or paperback edition) informs you of the provenance, you as the reader can check the transcription against the original, if accuracy is important.

If you know the provenance of an electronic text and you know that the OCR transcription is 95% accurate, then as a responsible scholar seeking, say, all instances of a word in a novel, you are advised to perform multiple searches with partial words or word variations (a sketch of the idea follows this paragraph). But what happens if you don’t know the provenance of a text and it seems to read accurately? You will typically find this case in Project Gutenberg, which often produces readable texts for which it provides no information about the source. Project Gutenberg does not have the provenance fetish, and it is agnostic about which text is transcribed. Any old (or new) text will do. Since provenance information may be omitted, or may be so minimal as to identify only the edition on which the PG text is based, it is difficult (if not impossible) to know whether you are reading an accurate transcription of an early text, a cleaned-up transcription of a faulty text, a transcription that has been corrected by a computer spell-checker, or an inadvertent conflation because the original transcriber used a different printing than the proofreader. To spell-check a text that is more than 100 years old is a waste of time if your aim is to transcribe accurately. Modern spelling and grammatical usage differ markedly from typical forms of the early nineteenth century and earlier periods and from dialectal forms of any period, and an editor who does not know that cannot serve scholarship. If you compare an MOA text with a Project Gutenberg text, it may seem that the Project Gutenberg text has been proofread better. For Project Gutenberg, which promotes the availability of electronic texts at the expense of other values, these details are often insignificant.
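As a minimal sketch of the multiple-search strategy, one can generate variant search terms from a small table of common OCR confusions. The confusion pairs below are illustrative, not a complete inventory:

```python
# A sketch of the "multiple searches" strategy: expand a search term into
# variants built from common OCR confusions. The pairs are illustrative.
OCR_CONFUSIONS = [
    ("m", "rn"), ("rn", "m"),   # 'm' misread as 'rn' and vice versa
    ("l", "1"), ("e", "c"),
    ("h", "b"), ("s", "f"),     # long s in older typefaces resembles 'f'
]

def search_variants(term: str) -> set[str]:
    """Return the term plus its single-substitution OCR-confusion variants."""
    variants = {term}
    for good, bad in OCR_CONFUSIONS:
        start = 0
        while (i := term.find(good, start)) != -1:
            variants.add(term[:i] + bad + term[i + len(good):])
            start = i + 1
    return variants

print(search_variants("modern"))
# e.g. {'modern', 'rnodern', 'modem', 'modcrn', ...}
```

Searching a 95%-accurate transcription for each variant in turn recovers hits that a single search on the correctly spelled word would miss.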

For serious students of literary texts, the availability fetish should be troubling (as should the fetish for first editions, since authors and publishers revise, correct, and re-issue texts in altered forms). The faults that I ascribe to Project Gutenberg often apply to texts edited by serious scholars. Transcriptions (OCR or keyboard) produce modernizations and other types of errors, and efforts at correction often produce other errors. The same is true for large library digital text collection projects. And it is true for the vast majority of paperbacks at Barnes & Noble or Amazon or your local college bookstore.

So what is a sane person to do? Before relying on a text, check the electronic text’s bibliographical header or a paperback’s “Note on the Text.” The lack of provenance information should be troubling. A library-produced site will often provide information on provenance and transcription processes; a transcription can be only as reliable as the process that produced it. A text produced by a serious literary scholar will often include a note on the “source” for the text and maybe a brief note about modernization. This is where the editor pats the reader on the head and says “trust me.” Proceed at your own risk. Trust Project Gutenberg as much as you trust a paperback at a garage sale. The text may be interesting in its own right (or useful for its annotations) quite apart from its interest as an accurate reproduction of an earlier text.
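For library-produced electronic texts, the bibliographical header is often a TEI header, and the first check can even be mechanized. A minimal sketch, assuming the text is TEI P5 XML (the filename is hypothetical; the namespace is the standard TEI one):

```python
# A sketch of the header check for TEI-encoded electronic texts: report
# whether the header carries a source description at all. The filename
# is hypothetical.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = ET.parse("electronic_text.xml")
source = tree.find(".//tei:sourceDesc", TEI_NS)

if source is None:
    print("No source description found: treat the provenance as unknown.")
else:
    print("".join(source.itertext()).strip())
```

A present sourceDesc is no guarantee of accuracy, of course; it only tells you which “text” the transcription claims to reproduce, so that you can check it if accuracy matters.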

If a print or electronic edition has been prepared by a textual scholar who is aware of multiple versions and intends for the reader to become aware of them, the editor will typically include a textual introduction with words like recto and verso, technical terms like issue and state, apparatuses with variant readings from other versions of the text, and nods to scholars like A. E. Housman, Joseph Bédier, Walter W. Greg, Fredson Bowers, G. Thomas Tanselle, Jerome McGann, etc. This is where the editor should lay out the provenance of those texts considered significant for the edition, the principles for the transcription and presentation of evidence, and the origin of textual variations insofar as they can be determined. If the argument is sound and the edition lives up to the argument, you can decide to place your trust in the editor. If the argument is unsound, wrongheaded, or describes an approach antithetical to your beliefs, then you must rely on the textual scholar in your own head.

UPDATE: Lisa Spiro’s post on her Digital Humanities blog, “Evaluating the quality of electronic texts,” offers additional perspective on considerations other than provenance and accuracy of transcription in the case of electronic texts. While I deviously implied that print has concerns similar to electronic texts (despite my title), she offers a much broader perspective on the particular issues that apply to electronic texts. You may notice that she refers to my blog in the post; such is life in the digital humanities blogger world, which is a small one. In the tradition of journalistic disclosure: Dr. Spiro and I both attended UVA, shared some of the same dissertation readers, have presented on the same panel at a conference, and are both contributing to a forthcoming volume on digital texts from Michigan. But the reason I suggest reading her site is not any of that. I suggest her site because her work is really good.

