Rise of the Novel 2018

I used chapter five of Tristram Shandy and a free trial version of Prizmo for the OCR editor. As others have noted, the errors were mostly at the level of the letter (such as confusing the letter s for f). While I am very unfamiliar with the software, perhaps having some sort of autocorrect dictionary could be useful in avoiding these mistakes, especially since they popped up so often.
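A minimal sketch of that autocorrect-dictionary idea in Python (the dictionary entries and function name are invented for illustration; Prizmo has no such documented feature): map the most frequent long-s misreadings to their modern spellings and apply them to raw OCR output.

```python
# Hypothetical fix-up dictionary: each entry maps a frequent long-s
# misreading to the word the scan almost certainly intended.
OCR_FIXES = {
    "fo": "so",
    "faid": "said",
    "fuch": "such",
    "firft": "first",
}

def autocorrect(text):
    """Replace whole words that appear in the fix-up dictionary."""
    return " ".join(OCR_FIXES.get(word, word) for word in text.split())

print(autocorrect("he faid no fuch thing"))  # -> "he said no such thing"
```

A real implementation would need a much larger dictionary and some handling of punctuation, but even a crude whole-word lookup like this would catch the repeated s/f confusions described above.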

I think it is very useful for these databases of digital facsimiles to exist alongside actual copies of the books. The digital facsimile allows for a preservation of the original text; it allows us to understand the meaning of certain stylistic choices in context, and it allows us to get a better idea of the embodied experience of novel reading as it once was. Of course, being able to extract the text from the digital facsimile is also extremely valuable, since it greatly increases access to the text. The two forms should not be considered exact replacements or conversions of one another, though; they are obviously connected, but they give us as readers different ways of accessing the text.

OCR and making texts real/less real

3 min read

Besides the dozens of times I had to substitute s for f in correcting my text of Tristram Shandy (a correction which is more about a change in written English and thus feels beyond the scope of an OCR program, although I’d be interested to see whether a program could be trained to recognize and remove antiquated uses of f from texts), the mistake which stuck out to me the most was how my OCR software treated non-standard letters. I used a free trial of Prizmo after Google Docs only succeeded in giving me half a page of recognizable text from the first chapter of Tristram Shandy, and as far as I could tell there were two primary sticking points for the software when it tried to OCR my text—that my text was an older edition with some uneven letters and random spots, and that the chapter I chose had annotations handwritten between some paragraphs. The first problem resulted in some a’s being swapped out for s’s and the addition of commas or periods wherever there’d been a dot on the page, but the handwriting threw Prizmo into a tailspin; the two main chunks of marginal notes were transcribed as “4” (f4:1638]o HW Pp, fT.” and “Aahelary (3% ••. sare _- Sy are. Cr oe! Bz ftJSL.”

I can understand why recognizing handwriting is so hard for a computer program—OCR software is trained on standardized fonts, and handwriting is often particular, individual, messy, and weird. But this complete meltdown of Prizmo’s abilities still surprised me—I think I’d somehow unconsciously trusted that the program would have the same reading comprehension abilities as a human, and when that expectation (of course) fell short, I was happy. Because I could read that handwriting! I’m still smarter and more human than this image-text conversion robot! I could correct Prizmo’s mistakes and create a more perfect version of the text it was trying to copy and (in its own weird and particular way) failed to fully capture; I could turn the ECCO’s collection of images into a file which can now be run through other computer programs; and I could take those handwritten (and therefore individual, and more obviously human) notes and type them, too, into a standardized form. All of which felt somehow both revolutionary and strange; to think of this program as my co-translator of Tristram Shandy, of myself as both a witness to the uniqueness of an annotated text and the editor of that text into something more generalized, was deeply weird.

This exercise brought up so many questions for me about human writing, print culture, computer programs and our relationships to them, and whether a program can have character (and that somehow connects to Gallagher again, I think, although reading computer programs as analogous to fictional inhabitants of the novel is quite a stretch)—I don’t think I’m anywhere close to figuring out how to answer these questions or what they are in the first place, but I’ve had a lovely time trying.


OCR: Technology and Temptation (Assignment 4)

2 min read

I would be interested to know just how the OCR technology I used (Prizmo) works. Does it recognize only individual characters and separate them into words solely by spacing? Or does it make use of some kind of internal dictionary, whose entries it seeks to match to the words it processes? Does it--and if not, could it--use technology akin to the predictive text of messaging apps, probabilistically identifying visually unclear words by considering their verbal contexts? From my results, I see no evidence that Prizmo's OCR function is so sophisticated (the developers might want to stop it from recognizing the antiquated form of 's' as 'f' before tackling my proposals), but why couldn't it be?


I find myself instinctively tempted to consider an OCR-ed (and cleaned) text to be formally purer than the digital facsimile it is drawn from, and I'm curious about the reason or reasons behind that temptation. What about the conventions of digital text makes it feel, at least to me, less distorted or corrupt than an image of the same words on the pages of a book (even if that book is a first edition)? Is it a matter of font? Of digital text's manipulability--its relative freedom from the structure of lines and pages? Perhaps I've simply been conditioned by a twenty-first-century upbringing to think of digital words as more normative, more natural, and more legitimate than printed ones. I am unsure, but I intend to go on resisting this feeling and this temptation.

"E'rifirain S'handy" (According to OCR)

2 min read

I used Online OCR, a free online character recognition service, to look at the first few pages of chapter 9. The most frequently occurring mistake, as many others have stated, is the recognition of "s" as "f." I was able to create a small program to replace every "f" in each word with "s," and while this is still inaccurate (as it replaces letters that should be f), this version of the text serves as a closer digital facsimile.
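The small program itself isn't reproduced in the post, but the blanket substitution it describes might look something like this in Python (a reconstruction, not the author's actual code):

```python
def naive_long_s_fix(text):
    """Blanket substitution: every 'f' becomes 's'."""
    return text.replace("f", "s")

print(naive_long_s_fix("fo it feems"))      # -> "so it seems"     (fixes long-s errors)
print(naive_long_s_fix("forty fhillings"))  # -> "sorty shillings" (but corrupts a real f)
```

As the post notes, this is still inaccurate: any genuine f gets corrupted along with the misread long s, which is why a dictionary check is the natural next step.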

Further, in the original text, the spaces between words and punctuation marks vary at times. Online OCR seemed to have trouble detecting this -- for example, it read a smaller space as negligible, while larger spaces indicated to the OCR that it should add a space in the digital copy. On the first page of chapter 9, the first few words, "It was forty," have smaller spaces between them, especially compared to the rest of the page. OCR read this as one word: "Tomasforty."

Most of the texts we have read use a font similar to Tristram Shandy's, so it would be interesting to see if we could adapt OCR software for different time periods and genres, since the font accounts for many of our digital translation issues. I plan to look into this idea more throughout the week.

Exercise 4

2 min read

I used the first chapter of the first volume of "Tristram Shandy," and the biggest mistake the OCR editor made was with letters that are written differently now than they were in 1759. The letter "s" was the biggest offender. Perhaps if there were a way to get the editor to recognize the unique shape of the old "s," this problem would not arise. Also, the editor was unable to decipher words that had faded. I'm not sure how to address this issue; more accurate reconstruction of partially faded words might help, but what can be done for words that have faded entirely? In the future, could the editor be upgraded to guess words based on context? Perhaps it could even use previous sentences as reference points if there happen to be a lot of sentences that are constructed similarly.

The difference between a single copy of a book and a digital facsimile is fairly small. Because the digital facsimile captures the physicality of the book (the damage, the unique lettering, etc.) it can serve as a fairly suitable substitute for the real thing. It does hold a distinct advantage over a physical copy: it can never be destroyed. It will forever exist and be accessible through databases like ECCO, while physical copies either decay or are inaccessible in the interest of preservation. However, it's not perfect. Due to its commitment to recreating the novel as it was in physical form, its contents are not always legible to machine reading, making them less than useful for transcribing. But with gradual improvements to OCR editors, the digital facsimile will become increasingly useful.

Assignment 4: Legibility and Spacing in Tristram Shandy

4 min read

From the Eighteenth-Century Collections Online database, I used Laurence Sterne's The life and opinions of Tristram Shandy, gentleman. ... London, MDCCLX. [1760]-67 [1771?]. 139 pp. Vol. Volume 1 of 9 (9 vols. available) Literature and Language. I attempted to do word searches specifically on the title and cover pages, believing that the larger amount of spacing on those first pages would allow the site's keyword search to detect the words I was looking for more easily. The first page was composed almost entirely of a portrait of the author and his name in large font (LAURENCE STERNE. A.M.). However, when searching for any part of his name, there were no matches whatsoever. I then turned to "Image 8," the title page. The page itself had the title (THE LIFE AND OPINIONS OF TRISTRAM SHANDY, GENT.), followed by a brief description in the lower quadrant of the page that was not in English, and then the year and volume at the bottom. Nearly everything on that page, with the exception of the character's name and the word "Opinions," was detected by the keyword finder. I assumed this was because of the cursive writing overlaid on some of the title and, possibly, the penmanship and spacing on the part of the author and publisher. My assumption was later supported when I spaced out the word chapter ("CHAP" vs. "C H A P") and found that the keyword reader picked up the first spelling for some chapters and the second spelling for the majority of the rest. Finally, I attempted to match the large-font writing of "Sir" on the page labeled "Image 10," but all I matched were blank areas on other pages and the non-English words on the title page.

Using the keyword readers in both the free version of Adobe Acrobat and Google Chrome proved fruitless, as I could not match visible words, letters, or numbers to the pages. When running the pages through Adobe Pro's Optical Character Recognition (OCR), I made sure to run three types of pages: one with solely text, one with text and handwritten notes, and one with exclusively handwritten notes. When running pages with text and written notes through OCR, this was my result:

I was overwhelmed with text boxes to the point where I was unable to edit words without having another text box shroud my writing. For pages with solely handwritten notes, I found little to no text boxes. Even on the pages with exclusively published text, I noticed increased spacing in certain areas that were previously unreadable, which were brought up as corrections by Adobe's spellchecking system. In trying to reduce the number of words continued onto another line (to increase the chances of detection in plain text and Microsoft Word files), I was unable to change the letter count in a line without having the word spill over to the line below it. The OCR also saw certain letters that overlapped each other as one single letter, indicated by both letters disappearing when I used the backspace key. Finally, aspects of the PDF pages of the novel, including chapter headings, enlarged first letters for each chapter, and even an entire page, were merely left as images by the OCR. Not only were certain words untraceable in the text, but the handwritten notes, if translated by OCR, could have provided a source of thought or critique concerning the ideas presented in the novel. Though Adobe Pro's spellcheck was successful in determining some common English words from the text, it might also help to run a program cross-referencing the text with words common during that period, as that is much easier than keeping a systematic database of penmanship for certain writers.

My concern about the use of digital facsimiles and keyword searching of copied texts centers on the negative impact on research. In cases where research requires a specific word, topic, or individual, the inability of such programs to discover words in documents could mean the researcher misses a crucial report, or needs to spend additional time reading every source if they cannot parse through them. The impaired translations of facsimiles would also limit the sources a reader could cross-reference in a database, significantly impairing their understanding of the effects of their research topic on individuals who may not have had access to printing or did not want to risk outwardly sharing their opinions. Therefore, the ability to translate image to text using OCR in facsimiles needs to be improved, or researchers and students risk confusing the words in literary texts and overlooking critical pieces of history.


I used the Google Drive OCR software to convert chapter 4 of Tristram Shandy. The clearest and most obvious mistake I saw was the f/s conversion, where Google Drive would inconsistently pick up on the correct version. I was (and am) curious as to why this was, so I did some research and watched this great video.

Basically, the video states that text is broken into binaries of 1s representing pixels of black and 0s representing pixels of white; the software then looks at sections of 1s and 0s and compares them to a database of hardcoded curves and letters to determine the correct letter. However, if Google Drive used this method, then we would always get an f as an f as opposed to an s. But we do get a decent amount of s's in our text, so the inconsistency is confusing. Even with the inconsistency, the presence of s's in the first place indicates that Google's OCR software has some way of determining whether the perceived f is a true f. In order to raise the consistency, I would write a couple of lines of code using an if statement to determine what percentage of the f's have been converted to s's, and after a certain threshold is reached, convert all the f's that fulfill a certain criterion (Google Docs never turned a real f into an s in my chapter, so I'm presuming this isn't a problem) into s's.
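A sketch of that threshold idea in Python (the function name, the 0.5 cutoff, and the tiny stand-in word list are all invented for illustration): measure how many f-containing words fail a dictionary check, and only if enough of them look suspect, convert the f's that the dictionary doesn't vouch for.

```python
# Tiny stand-in for a real English word list.
ENGLISH_WORDS = {"so", "it", "seems", "forty", "for", "first"}

def threshold_fix(words, threshold=0.5):
    """Convert f->s only when the text looks like a long-s-era scan."""
    f_words = [w for w in words if "f" in w]
    if not f_words:
        return words
    # Fraction of f-words that aren't real words: a proxy for long-s errors.
    suspect = sum(1 for w in f_words if w not in ENGLISH_WORDS) / len(f_words)
    if suspect < threshold:
        return words  # probably a modern text; leave the f's alone
    # Convert f->s, but keep any f-spelling the dictionary vouches for.
    return [w if w in ENGLISH_WORDS else w.replace("f", "s") for w in words]

print(threshold_fix(["fo", "it", "feems", "forty"]))  # -> ['so', 'it', 'seems', 'forty']
```

The dictionary check is what implements the post's "certain criteria": a word like "forty" survives because its f-spelling is already a real word, while "feems" is converted because it isn't.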

With regard, more generally, to the entire medium of digital facsimiles, I honestly don't see many differences between digital facsimiles and actual books, besides things like the thickness of the cover. Converting to text, of course, loses a lot of information, especially if a book has a lot of paratext. I'm curious whether the plain-text version of an actual book can exist as an independent text in its own right, and whether operating conventional means of literary analysis upon the plain text would result in differing conclusions (I highly doubt it), or whether the conventional means would simply prove insufficient (I think so), because of the lack of stylistic indicators that we rely on. Perhaps texts that rely a lot on a single stylistic difference (half the words in italics, half in bold, for example) would result in starkly different analyses when read in a plain-text version. I am unsure how I feel about that, if it could be correct.

Assignment 4 - Tristram Shandy

3 min read

I worked with Chapter 5 of Tristram Shandy. The main issues I ran into included s being turned into f or { because of the font. As for the latter, we could default { to s. For the former, we could take the words the OCR spits out with f in them and run them through a dictionary: if they are words in the dictionary, the f stays, and if not, the computer converts it to s. Using a loop in the programming, we could do this for all combinations of letters if there are multiple f’s in a word. Additionally, w almost always came out as vv or yv. Since these letter combinations are much less likely than w, w should be the default. The same is true of m being converted to rn and in (although these letter combinations are more likely). Therefore, I think going back to running the output through the dictionary for words with letter combinations like this would be a good idea. (In fact, I think running all words through a dictionary and flagging those that don’t show up to be checked would be a good idea. However, this could take a lot more time and would require someone to look through the flagged words.) As for the names in italics that didn’t translate closely at all, I don’t know if there would be a way either to train the OCR to be better at recognizing such text or to use another tool to get at least more of a semblance of the word in the translation.
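The loop over letter combinations described above could be sketched like this in Python (the word list is a tiny stand-in for a real dictionary, and the function name is invented):

```python
from itertools import combinations

ENGLISH_WORDS = {"first", "confusion", "said", "such"}  # stand-in dictionary

def best_variant(word):
    """Try every combination of f->s substitutions; return the first
    variant that is a dictionary word, preferring the fewest changes."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for k in range(len(positions) + 1):
        for combo in combinations(positions, k):
            chars = list(word)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate in ENGLISH_WORDS:
                return candidate
    return word  # no dictionary match: flag for a human to check

print(best_variant("firft"))      # -> "first"     (only the misread f changes)
print(best_variant("confufion"))  # -> "confusion"
```

Words that match no variant fall through unchanged, which lines up with the post's suggestion of flagging unrecognized words for human review.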


In translating a piece to plain text, however, we lose some of the stylistic elements chosen for the book, such as an enlarged first letter, or the syllable at the bottom of the page to help in reading aloud (which may not be missed by readers now, but is an artifact of the culture at the time that sheds light on the period). Also lost are varying text sizes (like we saw on the title pages) as well as images and the effects the font itself might have (emotional or cultural, such as the s that appears to be an f, or at least so says the OCR). In terms of symbolism, too, something is lost by turning the image (as its own piece of art, chosen by authors and editors and printers) into a plain .txt file or Times New Roman, and thus from art (and a physical commodity) into merely another computer file. Whether these losses are important, however, remains up for debate and likely depends on the manner in which the texts are used. While stylistic elements may be lost, other elements may be gained by turning it into text: for example, text can be run through tools like Voyant for analysis, and it allows for changing font or size to increase readability for those who are visually impaired or otherwise wouldn’t be able to read.