I wrote up the following as a primer for the students in my Digital Humanities seminar in 2020; updated in Spring 2025. If you have favorite RegEx commands and tips, I would welcome them in the comments, or hit me up on Bluesky.
The most common use case for needing a bit of coding for people in literary studies – especially people working with digital collections and archives – is when we have to format texts. This is less glamorous work than working with fancy visualizations or maps, but it can be incredibly useful and time-saving in many different contexts. Some of these will also potentially translate into work skills outside of academia.
For my own work, I frequently get messy files that have been scanned from old editions, and then OCRed. They need clean-up!
Cleaning up a single 80-page collection of poetry by hand is not that big a deal, but we have been working with dozens of them. And a single 300-page novel can take hours if you don’t have any tools to speed it up. My rule of thumb is: when you find yourself doing the same repetitive task again and again hundreds of times, that’s something that ought to be do-able by a machine.
Sometimes messy scanned texts have certain recognizable patterns. For instance, in scanned/OCRed poetry, you often see things like this:
I never see the burial place,
Where my dear mother lies ;
But that I think I see her face,
Peak at me through the skies.
[And yes, it says, “peak” not “peek” in the original. Don’t think this one had a copy-editor]
The quality of this is pretty good actually, but notice the extra space before the semi-colon on the second line. Let’s say you have that exact glitch 50 or 100 times in a collection… You would normally fix that with Find and Replace:
Find: [space];
Replace: ;
So now the second line should read: “Where my dear mother lies;”
Chances are, if you see that space before punctuation with a semi-colon, there are probably some with other punctuation as well – I would go through the same document, and do Replace All changes for space before comma, space before period, space before question mark, space before exclamation point.
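If you would rather do all of those punctuation fixes in one pass, here is a minimal sketch in Python. The function name and the sample text are my own inventions for illustration (the second sample line is made up, not from the original poem), and the pattern assumes the only problem is stray spaces or tabs directly before ; , . ? or !

```python
import re

def fix_space_before_punctuation(text: str) -> str:
    # [ \t]+ matches one or more spaces or tabs;
    # ([;,.?!]) captures the punctuation mark that follows them.
    # The replacement \1 keeps the punctuation and drops the spaces.
    return re.sub(r"[ \t]+([;,.?!])", r"\1", text)

# First line is from the stanza above; the second is a made-up test line.
sample = "Where my dear mother lies ;\nIs she watching me ?"
print(fix_space_before_punctuation(sample))
```

One caution: a pattern this broad will also close up spaced ellipses (" . . .") or a space before a decimal like " .5", so it is worth skimming the results before saving.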
But what if you saw patterns with more complicated glitches – things that a simple find and replace couldn’t address?
Like, say, you wanted to get rid of all of the running headers in a novel – lines that begin with a number and then end with the title of the novel? Again, not a big deal to do this 10 or 20 times by hand. But 300 times? Gets a little old. You can literally do it with a little snippet that looks like this:
Find: \n\d+.*\n
Replace with: \n
(Note: we're jumping ahead a bit here, but the bit of code above says: look for any line break (\n), followed by a number (\d+), followed by any text (.*), and ending with another line break (\n). Replacing all of that with a single line break deletes the header line while keeping the lines before and after it separate. If you replaced it with nothing instead, both line breaks would disappear and the neighboring lines would get fused together.)
Hit “Replace All” on the find and replace box, and you just saved yourself 300 manual edits.
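For anyone curious what the same header clean-up might look like as a script, here is a rough Python sketch. The function name and the sample text (including the novel title) are hypothetical. Instead of matching the line break before the header, this version uses ^ with the MULTILINE flag to match at the start of each line, which is a slightly different technique from the Find box pattern above.

```python
import re

def remove_running_headers(text: str) -> str:
    # With re.MULTILINE, ^ matches at the start of every line.
    # \d+ is one or more digits, .* is the rest of that line,
    # and \n? removes the line's own line break so no blank line is left.
    return re.sub(r"^\d+.*\n?", "", text, flags=re.MULTILINE)

# Made-up sample: a page number and title sitting between two lines of prose.
page = "she said nothing more.\n42 A HYPOTHETICAL NOVEL\nThe next morning\n"
print(remove_running_headers(page))
```

Keep in mind that this deletes every line that starts with a digit, so a sentence that happens to open a line with a year or a number would be lost too. Check a few matches before trusting it on a whole novel.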
Or say you wanted to find every line that ends with a hyphenated word and rejoin the two halves of the word?
Or: you have a passage or a text that for whatever reason is in ALL CAPS. How can you convert it to conventional capitalization without retyping it?
For those types of problems, you could use a coding system called Regular Expressions (Regex). You can use Regex codes directly in a Find and Replace box in a sophisticated text editor – you don’t need to know Python or R (though these commands do work within Python and R, and some people talking about Regex online are using it with Python).
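To make the last two problems concrete, here is a small Python sketch of each. The function names and sample strings are my own, for illustration only. The hyphen fix assumes prose where it is fine for the broken line to be joined to the next one, and the capitalization fix is deliberately naive.

```python
import re

def join_hyphenated_line_breaks(text: str) -> str:
    # (\w)-\n(\w): a letter, a hyphen, a line break, then another letter.
    # The replacement keeps both letters and drops the hyphen and line break.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def sentence_case(text: str) -> str:
    # Lowercase everything first, then capitalize the first letter of the
    # text and the first letter after a period, question mark, or
    # exclamation point followed by whitespace.
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )

print(join_hyphenated_line_breaks("a beauti-\nful day"))
print(sentence_case("IT WAS LATE. SHE SLEPT."))
```

Neither function is smart about meaning: the first will also strip the hyphen from a genuine compound like "well-known" if it breaks across lines, and the second leaves names and the pronoun "I" in lowercase, so some hand correction is still needed afterwards.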
Technically Google Docs has a Regular Expressions box, though to be honest I've not used it much, mainly because Google Docs starts to run very slowly if you are working with larger documents. A 300-page novel runs super-slowly because the software is constantly analyzing and indexing your file as you work and relaying everything back and forth with a remote server. It is also creating a ton of invisible stuff in your file related to formatting and special characters.
I usually use a free piece of software called Notepad++ for this kind of work (CotEditor on Mac). 300-page novels load quickly and without delay, and there are no hidden characters or invisible formatting. It’s also completely offline, though, so you have to remember to hit “Save” and then upload the file to a destination when done.





