OCR-ing Tristram Shandy

Cleaning the OCR’d text of my Tristram Shandy section wasn’t too difficult once I used find and replace. I learned that because databases like ECCO use OCR’d texts to power their search engines, searches are not going to be accurate if your word has an “s” at the beginning or in the middle of it, where eighteenth-century printers set a long s (ſ) that OCR tends to misread. But you can try searching for the word as the OCR reads it, so I’d have to try “worfhip” or “worlhip” (what my OCR’d text garbled it into) to find mentions of worship. If the OCR program analyzed the entire sentence rather than just single characters, machine learning could help predict what the right word should be. For example, it could make sure pronouns are spelled correctly, because grammar dictates where they have to go; it could flag words in the OCR’d text that aren’t in the dictionary and aren’t capitalized (so less likely to be proper nouns); and it could use the placement of an “s” in a word to guess what word it should be.
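To make the dictionary idea concrete, here is a minimal Python sketch of the simpler, word-by-word version (not the sentence-level machine learning described above). It assumes the OCR only ever misreads a long s as “f” or “l”, that a long s never ends a word, and that a tiny stand-in word list is good enough for illustration. The names `DICTIONARY`, `correct_word`, `correct_text`, and `search_variants` are my own hypothetical helpers, not part of any OCR tool or of ECCO.

```python
import re
from itertools import product

# Tiny stand-in word list (an assumption); a real run would load a full dictionary.
DICTIONARY = {"the", "of", "worship", "sense", "is", "a"}


def _swap(word, targets, replacements):
    """Return every spelling made by optionally swapping non-final target letters."""
    options = [
        # Only letters before the last position are candidates, since the
        # long s was not used at the end of a word.
        ch + replacements if ch in targets and i < len(word) - 1 else ch
        for i, ch in enumerate(word)
    ]
    return {"".join(combo) for combo in product(*options)}


def correct_word(word, dictionary=DICTIONARY):
    """Fix a garbled word only when exactly one dictionary word fits."""
    # Leave known words alone, and skip capitalized words (possible proper nouns).
    if word in dictionary or word[0].isupper():
        return word
    matches = _swap(word, "fl", "s") & dictionary
    return matches.pop() if len(matches) == 1 else word


def correct_text(text, dictionary=DICTIONARY):
    """Apply correct_word to every run of letters in the text."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_word(m.group(0), dictionary), text)


def search_variants(word):
    """List the garbled spellings worth trying in an OCR-backed search box."""
    return _swap(word, "s", "fl")


print(correct_text("the worfhip of fenfe"))
print(sorted(search_variants("worship")))
```

A sketch like this cannot tell “fame” from “same”, since both are real words; that is exactly the kind of case where looking at the whole sentence would be needed.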

 

Cleaned, OCR’d texts seem extremely helpful to me in increasing the accessibility of historically significant texts. Even digital facsimiles are less helpful, because they take forever to read (and a lot of paper, if you print them). Still, to me, divorcing a text from the layout that its original readers engaged with significantly changes it. I’m a visual learner, so I remember the location of a word on a page when I read, which leads me to think that the layout is important because it shapes how we group different parts of the text and how we digest what we’re reading, especially with newspapers and paratexts.

And that’s why it’s hard for us to contextualize the proliferation of the novel and how it affected people: we no longer experience the difficulty of obtaining novels. I think it’s important to consider how books used to be read: slowly, carefully, while turning many expensive pages. Perhaps people memorized as they read so they could recount the story orally later, or read it out loud. Authors couldn’t easily edit content, either, the way we can edit a Word document. Reading a book on a webpage, as we do with Project Gutenberg texts, makes it seem like a sea of words that goes on continuously, and we can get lost in it.