CollateX, Python, Anaconda, Oh My: Or, What Have I Done? (Week 3 Reflections)

Somewhere in the previous two or three posts I explained that I want to engage more sophisticated collation tools, which for me includes CollateX. So this past week I decided to get really engaged. Upon turning to the CollateX site, I find that there is no longer a Windows or Mac command-line version, as there was in CollateX 1.5. Now, with version 1.6, it’s something else, a Java archive, and I’m not really sure what that means.

I play a DH scholar on TV (I’ve encoded texts in XML, published peer-reviewed scholarship on Scholarly Editing and the Whitman Archive, and done some XSLT development for the Blake Archive), but I’m still an English major at heart. Mostly I read things, so this is distressing to me. But, okay, I have the enormous privilege of a research leave semester. If I’m going to learn something new and technical, it’s going to be when I have intensive time to devote to it. So I may as well. Take a few deep breaths, and here I go.

First I tried figuring out what the heck to do with the new CollateX download. Being the naive sort, I went to the directory and tried a version of what worked with version 1.5. This command worked in CollateX 1.5, when the files one wanted to compare were in the bin directory:

./collatex wit1.txt wit2.txt > wit1wit2compare.txt

I wondered whether (but doubted that) it would work with CollateX 1.6, this new Java version. I’m not expecting much, ’cause there’s no longer a “bin” directory.

java -jar collatex-tools-1.6.1-jar wit1.txt wit2.txt > wit1wit2compare.txt

The result? Unable to access the jar file. OK, so that’s not going to work. This command displays the documentation:

java -jar collatex-tools-1.6.1.jar -h

This is what the “documentation” looks like:

[Screenshot: CollateX command-line help output, 2015-09-20]

I try what the documentation says, so:
collatex wit1.txt wit2.txt > wit1wit2compare.txt

And I get…nada. Command not found. I realize that there shall be more floundering in a technical hellscape. Then I recall something that Ronald Dekker had tweeted in reply to one of my Twitter questions:

I had started studying Python this summer. I’m glad I did, because now it looks like I may have no choice. I know the people who put together these tools are wicked smart, and when they are academics I’ve found them to be genuinely nice people, but (as I said) my English major heart is having palpitations.

UPDATE (April 2016): This works in version 1.7:
java -jar collatex-tools-1.7.1.jar -o:output1.txt file1.json file2.json

Surely, I’m not the only one with this trouble. Isn’t Google search my friend? Yes, so I search for “collatex install” and hope for the best. Surprisingly, halfway down the first page of results is “Python 3 and CollateX installation instructions – oo.” The “oo” is a bit worrisome, but it is on the Obdurodon server, which I recognize. And I know the scholar from other work, his “Even Gentler Introduction to XML,” which I have assigned to students: it’s David Birnbaum. Alright, despair, away with ye for now.

I followed the Python and CollateX install instructions (this is not a namby-pamby version but serious geek tools, “an enterprise-ready Python distribution for large-scale data processing” called Anaconda). It’s not “clickable,” so to the command line. And I apparently have something called Pip, which is like Homebrew or MacPorts, a command-line install routine (not so bad, I’ve been around this kind of block with LaTeX and installed Gimp with MacPorts). More Pip for Levenshtein (harmless process, will read about that later), and then GraphViz both as a separate program (had that, now updated) and something called Python bindings for GraphViz. No idea, just need to do it, but I almost skipped that step. Some 3 hours later (minor problems with the Xcode install having hung, Python 2.7 showing up in IPython Notebook, moments of worry about the Python 3 requirement interfering with Python 2.7, reading Anaconda documentation), I have CollateX installed and running in IPython Notebook.
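A quick way to see which Python is actually answering, and whether the Pip installs took, is a check like the one below. It is only a sketch: the module names `collatex`, `Levenshtein`, and `graphviz` are assumptions about what those installs expose for import, and `check_environment` is a hypothetical helper, not part of Anaconda or CollateX.

```python
import importlib.util
import sys


def check_environment(modules=("collatex", "Levenshtein", "graphviz")):
    """Report the running Python version and which modules can be imported.

    The default module names are assumptions; adjust them to whatever
    the install instructions actually name.
    """
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report
```

Running `check_environment()` in a notebook cell returns a small dictionary; a `False` next to a name means that install did not land in the Python that the notebook is using.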

Now, I’m trying to get my head round the fact that I now have an industrial-strength Python distribution (farewell Monty Python and Eric Idle jokes), what possible reason I need to launch a web server to run IPython Notebook, what IPython Notebook is, and where on God’s green earth (though I suspect in my file system) the files that I want to collate should be.

Side note here: I’m not trying to save the world. I’m trying to collate 5 very accurate transcriptions, typed by myself and others and read aloud for proofreading, in a very arts-and-crafts sense. I’m a bookish person: I treasure books as individual physical objects, I gather up the fragments and put them in little baggies when old bindings or paper crumble, and I totally sympathize with others who do the same thing. I need computers to automate my editorial work, not do it for me. But I’m spending an awful lot of time trying to figure out whether computers can do the work that I need them to do.

Now, time to begin the CollateX tutorial. ’Cause of course all that’s not the actual work, just the setup to be completed before the work can begin. An afternoon I spent working my way through this. Ooh, step by step with IPython, I think I can do this. This is all well and good, but my trouble is in 119, when transcribed texts are inserted into the Python script and viewed on screen. That’s weird: no one would do that except a computer person teaching a tutorial for demonstration purposes. What I really need is to collate external files. And yeah, I see external files, sample collation files from Barbara Bordalejo’s neat Darwin Online project. But how do you pull in external files? This is what the tutorial says:

Part 3: Reading multiline input from files (watch this space)
Part 4: Creating XML output (watch this space)

You can expect me to be watching this space daily for the next two months. Enough for one day. Then, on Monday, I wake up and remember there’s email. So I (with hope) send David Birnbaum an email message. Not 30 minutes later (I kid you not; I had just walked the dog around the block after sending), he replies with new instructions posted to GitHub.

This is where my weekend work ended, perched between hope and fear, as I needed also to do other work, to demonstrate a class project at the library and to write a letter of recommendation. Maybe I’ll figure out how to do this, with its 180 lines of Python code, and maybe I won’t. And what will I do with JSON output, when I don’t even know what that is?
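For the question of pulling in external files, the core idea is smaller than 180 lines. The sketch below is not the script from GitHub. It assumes the witnesses are plain-text files in one folder (named like wit1.txt and wit2.txt), and that the collatex Python package offers `Collation`, `add_plain_witness`, and `collate` as its tutorial uses them. `read_witnesses` and `collate_witnesses` are hypothetical helper names.

```python
from pathlib import Path


def read_witnesses(folder, pattern="*.txt"):
    """Map each file's stem (e.g. 'wit1') to its text, in sorted filename order."""
    witnesses = {}
    for path in sorted(Path(folder).glob(pattern)):
        witnesses[path.stem] = path.read_text(encoding="utf-8")
    return witnesses


def collate_witnesses(witnesses):
    """Hand the texts to CollateX.

    The import sits inside the function so that read_witnesses still
    works on a machine where collatex is not installed.
    """
    from collatex import Collation, collate

    collation = Collation()
    for sigil, text in witnesses.items():
        # the file stem doubles as the witness sigil (an assumption of this sketch)
        collation.add_plain_witness(sigil, text)
    return collate(collation)
```

In a notebook, something like `collate_witnesses(read_witnesses("my_texts"))` would be the whole job, where `my_texts` is a placeholder for wherever the transcriptions live.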

UPDATE (April 2016): More advice to my former self: Go study Python incessantly, with books like Lutz’s Learning Python and Programming Python. Go study XSLT, with books like Tennison’s Beginning XSLT. You have to be able to rewrite these scripts. Also, some matters seem like bugs in CollateX. 1) If it won’t process something (HTML, etc.), it’s probably because it only processes JSON. 2) You may not have a blank space before a closing tag in an item to be collated. Put in a tilde instead. Otherwise, the tokenizing XSLT template will break.
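JSON itself turns out to be nothing mysterious: it is a plain-text way of writing lists and labeled values. The sketch below wraps plain-text witnesses in JSON using only Python’s standard library. The layout (a “witnesses” list whose entries each have an “id” and a “content”) is my understanding of the CollateX input format and should be checked against its documentation. `build_collatex_json` and `write_collatex_json` are hypothetical helper names.

```python
import json


def build_collatex_json(witnesses):
    """Turn {sigil: text} into a {"witnesses": [{"id": ..., "content": ...}]} layout.

    The layout is an assumption about what CollateX expects as input.
    """
    return {
        "witnesses": [
            {"id": sigil, "content": text} for sigil, text in witnesses.items()
        ]
    }


def write_collatex_json(witnesses, out_path):
    """Save the witnesses as a UTF-8 JSON file, keeping accented characters readable."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(build_collatex_json(witnesses), handle, ensure_ascii=False, indent=2)
```

For example, `write_collatex_json({"wit1": "first text", "wit2": "second text"}, "witnesses.json")` produces a file that opens in any text editor, which is a good way to see what JSON looks like.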

The project that I had to finish up on Monday is the Drupal publication of the letters of Alfred Chester, which students in my DH course got into a near-publishable state at the end of class. But we had to wait for permissions from Chester’s executor, Edward Field, which he provided last month. Today I tried very hard to figure out how to publish letter images with the TEICHI Framework, but I was ultimately stymied. I don’t feel like I can go back to that, as it may take several days, and I need to focus on this Python CollateX work.

I am nervous and anxious again, especially after reading “Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project,” in which the idea is that automated collation is supposed to solve all kinds of problems for technically sophisticated projects. The projects that are imagined are well-funded and technically supported, not the work of a lonesome scholar at a regional state university. Gonna have to get more imaginative here, as I have no desire to be Boxer the horse from Orwell’s Animal Farm. I know, maybe I’ll use my annual $500 travel budget and travel as luggage and sleep on the street in Europe the next time that a CollateX seminar is offered. Or raid my high school son’s college fund?

Caution (Profanity with Sexual Innuendo Follows)

You don’t have to keep reading, as no further information about collation follows.

And yes, it’s bawdy, but it’s really funny: Alfred Chester is a riot. And now that his letters are published online with the permission of his estate, I can share the funniest line, from his 22 May 1964 letter to Norman Glass, “Why do you grease your asshole if there is no one there to fuck it?” Since reading that letter, I have no longer cared about a tree falling unheard in the forest. Now, I think instead of Chester’s line (and chuckle to myself).
