Amardeep Singh: Text Processing with Regular Expressions (RegEx): a Digital Humanities Work-Flow for Beginners (no coding)

I wrote up the following as a primer for the students in my Digital Humanities seminar in 2020; updated in Spring 2025. If you have favorite RegEx commands and tips, I would welcome them in the comments or hit me up on Blue Sky.

The most common use case for needing a bit of coding for people in literary studies – especially people working with digital collections and archives – is when we have to format texts. This is less glamorous work than working with fancy visualizations or maps, but it can be incredibly useful and time-saving in many different contexts. Some of these will also potentially translate into work skills outside of academia.

For my own work, I frequently get messy files that have been scanned from old editions, and then OCRed. They need clean-up!

Cleaning up a single 80-page collection of poetry by hand is not that big a deal, but we have been working with dozens of them. And a single 300 page novel can take hours if you don’t have any tools to speed it up. My rule of thumb is: when you find yourself doing the same repetitive task again and again hundreds of times, that’s something that ought to be do-able by a machine.

Sometimes messy scanned texts have certain recognizable patterns. For instance, in scanned/OCRed poetry, you often see things like this:

I never see the burial place,
Where my dear mother lies ;
But that I think I see her face,
Peak at me through the skies.

[And yes, it says, “peak” not “peek” in the original. Don’t think this one had a copy-editor]

The quality of this is pretty good actually, but notice the extra space before the semi-colon on the second line. Let’s say you have that exact glitch 50 or 100 times in a collection… You would normally fix that with Find and Replace:

Find: [space];

Replace: ;

So now the second line should read: “Where my dear mother lies;”

Chances are, if you see that space before punctuation with a semi-colon, there are probably some with other punctuation as well – I would go through the same document, and do Replace All changes for space before comma, space before period, space before question mark, space before exclamation point.

But what if you saw patterns with more complicated glitches – things that a simple find and replace couldn’t address?

Like, say, you wanted to get rid of all of the running headers in a novel – lines that begin with a number and then end with the title of the novel? Again, not a big deal to do this 10 or 20 times by hand. But 300 times? Gets a little old. You can literally do it with a little snippet that looks like this:

Find: \n\d+.*\n

Replace with : [leave blank].

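(Note: we're jumping ahead a bit here, but the bit of code above says: look for any new line (\n) that starts with a number (\d+) followed by any text (.*) and ending with another line break (\n). If you replace that with nothing, you are telling it to delete any lines that fit that description. Because the match includes the line break before the header as well as the one after it, the text on either side gets joined together; put a \n or a space in the Replace box if you want to keep the two apart.)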

Hit “Replace All” on the find and replace box, and you just saved yourself 300 manual edits.

Or, you wanted to find all lines that end with hyphenated words and unbreak the hyphenated words, putting them together?

Or: you have a passage or a text that for whatever reason is in ALL CAPS. How can you convert it to conventional capitalization without retyping it?

For those types of problems, you could use a coding system called Regular Expressions (Regex). You can use Regex codes directly in a Find and Replace box in a sophisticated text editor – you don’t need to know Python or R (though these commands do work within Python and R, and some people talking about Regex online are using it with Python).

Technically Google Docs has a Regular Expressions box, though to be honest I've not used it much, mainly because Google Docs starts to run very slowly if you are working with larger documents. A 300-page novel runs super-slowly because the software is constantly analyzing and indexing your file as you work and relaying everything back and forth with a remote server. It is also creating a ton of invisible stuff in your file related to formatting and special characters.

I usually use a free piece of software called Notepad++ for this kind of work (CotEditor on Mac). 300-page novels load quickly and without delay, and there are no hidden characters or invisible formatting. It’s also completely offline, though, so you have to remember to hit “Save” and then upload the file to a destination when done.

1. Start by installing a text editor.

See above. I would also make a sub-folder in your Documents folder dedicated to working with texts.

1a. Two tips for saving files from the internet.

Wherever possible, when downloading text files from the internet, make sure to save them as Plain Text (UTF-8). UTF-8 is a standard way of encoding characters that nearly every modern program can read, and we can mostly not worry about it right now.

Mac users only: This can be a little more complicated on a Mac than on Windows. If you're using Safari on a Mac, you might have trouble figuring out how to save a plain text file from the internet (it will only give you the ".webarchive" format on the default "Save As..." option).

Try this: hit CTRL-click, and then select "Download Linked file as..." to save a file as plain text when using Safari on a Mac.

Also, when you save files from the internet, you should probably start getting in the habit of labeling them really specifically to help you figure out what you're looking at later. Don't just accept the file name and file type chosen by someone else. I typically use filenames like

"Pandita-Ramabai-The-High-Caste-Hindu-Woman-1888-nonfiction.txt"

It might seem like overkill if you just have three files. But when you have three hundred files to search through it will come in handy to know exactly what you're looking at.

Also: it would be good to get into the habit of creating filenames that don't have spaces or other punctuation. If and when your files are queried by other software (e.g., a script running in Python), those spaces and punctuation marks may cause problems.

2. A few basics

Note from 2025. I recently discovered this interactive Regex tutorial, and it's pretty good. If you find my introduction below confusing, you might try it instead:

https://regexone.com/

You could just work your way through this (would take about 30-40 minutes maybe?). Honestly, it might be a better way to learn this than just reading my tutorial (though my tutorial might make it clearer how / why you might want to know how to use some of these codes in a literature context).

* * *

Ok, let's run through a couple of basics. There are a few basic codes that are helpful to know. You can often combine them and string them along together to try and mean / do various things.

\d – any digit in a number. To represent any three-digit number, use this \d\d\d

\w – any alphanumeric character (letters, digits, and the underscore). \w\w\w\w matches any four of them in a row, such as a four-letter word

\s – any whitespace character, including carriage return, space, or tab.

\n – end line / carriage return in a text file (usually an invisible / nonprinting character)

. – any single character at all (in most editors, anything except a line break)

From the Regular Expressions Quickstart page:

gr.y matches gray, grey, gr%y, etc. Use the dot sparingly. Often, a character class or negated character class is faster and more precise.

[When you might want this? Let’s say you have to change all the British spellings in a book to American spellings – grey to gray, colour to color, etc. Or vice versa. Some academic journals actually have style sheets that dictate spelling along these lines. My first book (2007) was published with a British publisher, and they wanted British spellings per the house style guide. At the time, I went through the entire book manuscript and changed it by hand.]

.* – Powerful regex code that means “any and everything going forward.” In most editors it stops at the end of the line, but if an option like ". matches newline" is switched on (Notepad++ has one), it will go to the end of the file unless you put in an endpoint (like \n). The most common one I use is .*\n

+ – “hungry” modifier (usually called “greedy”), which means one or more of that thing: keep going until it stops. Example: \d+ – means if you find a number digit, keep going until the number stops. Useful if you want to do a search for numbers of different sizes (for my work, page numbers: \d+ would match the numbers 3, 30, and 300).

| – this means “or.” Comes in handy if you want to set up a find and replace that includes semicolons, periods, commas, and other punctuation: ;|,|.|?|!

(Though with that particular example, since some of those punctuation marks are also commands within Regex, you need to do an “escape” – meaning, you need to tell the editor you mean an actual question mark in the text, not what the ? means to Regex. So it would actually look like this: \;|\,|\.|\?|\! Strictly speaking only the period and the question mark need it here, but escaping the others generally does no harm.) These are the characters that generally need a \ to 'escape' them in RegEx:

\^, \$, \\, \., \*, \+, \?, \(, \), \[, \], \{, \}, \|, \/

And if you find all that confusing, don't worry about it for now!
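If you ever do want to try these codes outside a text editor, they behave the same way in Python's built-in re module. The following is only a minimal sketch for the curious (it is not needed for the no-coding workflow), and the sample sentence is invented for illustration:

```python
import re

# Invented sample text, just for illustration.
sample = "Page 3, page 30 and page 300 ; is it grey or gray?"

# \d+ matches a run of one or more digits.
numbers = re.findall(r"\d+", sample)

# gr.y: the dot stands in for any single character.
spellings = re.findall(r"gr.y", sample)

# Punctuation joined with | ("or"); the period and the question mark
# are escaped so that they are read as literal characters.
marks = re.findall(r";|,|\.|\?|!", sample)

print(numbers)    # ['3', '30', '300']
print(spellings)  # ['grey', 'gray']
print(marks)      # [',', ';', '?']
```

The r in front of each pattern tells Python to leave the backslashes alone, so the pattern reaches the regex engine exactly as you would type it in a Find box.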

() – In RegEx, you can put things in parentheses to capture them temporarily to memory. This can come in handy if you want to do something a little more complicated than usual. Typically the “Find” line will involve some text you want to transform in some way. On the “Replace” line, you can call back the text that was ‘captured’ using \1 (some editors use $1 instead).

Let’s say your boss has a spreadsheet with hundreds of phone numbers in a column, that are formatted inconsistently. Some are (123) 456-7890, while others are 123-456-7890. She wants them all just like this: 1234567890 so the autodialer software can understand them.

You could use a Find/Replace Regex to do that. The code is a little complicated, and I won’t try and do it here, but essentially you would use parentheses and “escapes” to capture the various digits, remove the punctuation and spaces, and then reinsert the numbers by themselves.
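For the curious, here is one way the phone-number clean-up could look in Python. It is a sketch under assumptions: the two sample numbers are the made-up ones from the paragraph above, and the pattern only handles those two formats, so a real spreadsheet would probably need more cases.

```python
import re

# The two made-up formats described above.
phone_numbers = ["(123) 456-7890", "123-456-7890"]

# \(? and \)? allow an optional literal parenthesis (note the escapes).
# Each (\d\d\d) or (\d\d\d\d) captures a group of digits to memory.
# [\s-]* allows any spaces or hyphens between the groups.
pattern = r"\(?(\d\d\d)\)?[\s-]*(\d\d\d)[\s-]*(\d\d\d\d)"

# \1\2\3 puts the three captured groups back with nothing in between.
cleaned = [re.sub(pattern, r"\1\2\3", number) for number in phone_numbers]

print(cleaned)  # ['1234567890', '1234567890']
```

In a text editor the same pattern would go in the Find box and \1\2\3 in the Replace box.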

3. Putting RegEx to Work: Examples and Use Cases

First, make sure that the "Regular Expression" button is turned on in the Find + Replace dialog box. (Ctrl-H in Notepad++)

Also note the "Match Case" option. For now, we may not need to worry about that. But for some RegEx tricks we may want to turn that on.

3a. Removing Numbers and Page Headers

If you copy and paste a text file from HathiTrust into a Text Editor, you might get something that looks like this:

Page Scan 13

A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities, Lady Kesters was enjoying a well-earned rest, reposing at full length on a luxurious Chesterfield, with cushions of old brocade piled at her back and a new French novel in her hand. Nevertheless, her attention wandered from Anatole France ; every few minutes she raised her head to listen intently, then, as a little silver clock chimed five thin strokes, she rose, went over to a window, and, with an impatient jerk, pulled aside the blind. She was looking down into Mount Street, W., and endeavouring to penetrate the gloom of a raw evening towards the end of March. It was evident that the lady was expecting some one, for there were two cups and saucers on a well-equipped tea-table, placed between the sofa and a cheerful log fire. As the mistress of the house peers eagerly at passers- by, we may avail ourselves of the opportunity to examine her surroundings. There is an agreeable feeling of ample space, softly shaded lights, and rich but subdued colours. The polished floor is strewn with ancient rugs ; bookcases and rare cabinets exhibit costly con» I

Page 2
2 A ROLLING STONE tents ; flowers arc in profusion ;

And this continues for the whole book...

There are a few simple things we can do using Find and Replace automation to clean this up.

To get rid of page numbers, use "Replace" (Ctrl-H in Notepad++ on Windows), and do the following three RegEx replace commands

Find: Page \d+\n
Replace: [leave this blank]

(Hint: Make sure "Regular Expressions" are turned on!) The "\d" in the find stands for a numerical digit.

The "+" asks your editor to look for one or more digits in a row. So if you have one-digit, two-digit, or three-digit numbers, all should "match" for this search.

The \n at the end is a new line or carriage return.

(You might not want to do "Replace All" on this, since there may be some numbers at the ends of lines you might want to keep.)

You might also notice that all of the semi-colons in the paragraph above are preceded by a space. To get rid of those, you can do this:

Find: [space];
Replace: ;

If there are spaces before punctuation throughout a document, another hack might be to use a Regex command like this:

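Find: (.*) ([:,.?!;])
Replace: \1\2

This one is harder to explain. Essentially, you are asking the Text Editor to 'capture' all text (.*) to memory, then a space, then 'capture' any common punctuation. Each parenthesis becomes a captured 'string' that is held in memory and numbered. You Replace with the two captured strings in sequence -- without a space in between them. One caveat: because .* grabs as much of the line as it can, each pass only fixes the last space-before-punctuation on a line, so you may need to run it more than once. A simpler version that catches every instance in one pass is Find: [space]([:,.?!;]) and Replace: \1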
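The same clean-up can be sketched in Python, for anyone who eventually wants to run it over many files at once. This uses the simpler form of the pattern (a space followed by one captured punctuation mark); the sample text is pieced together from the excerpts quoted earlier.

```python
import re

# Sample lines with the stray space before punctuation.
messy = (
    "Where my dear mother lies ;\n"
    "her attention wandered from Anatole France ; every few minutes"
)

# One or more spaces, then a captured punctuation mark.
# The replacement keeps the punctuation (\1) and drops the spaces.
tidy = re.sub(r" +([:,.?!;])", r"\1", messy)

print(tidy)
```

Note that re.sub replaces every match it finds, so this behaves like "Replace All". It would also close up spaces you might want to keep (before an ellipsis, for example), so it is worth spot-checking the result.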

3b. Getting rid of other page headers.

In the sample of text above, you see this:

2 A ROLLING STONE

That is a page header. If you look at the rest of the book, a version of that is on every other page. Again, we can automate the removal of this using RegEx:

Find: \d+ A ROLLING STONE\n
Replace: [Leave this blank.]
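Here is how the page-number and page-header removals from 3a and 3b might look in Python. It is a sketch that assumes each page marker and each running header sits on a line of its own; the sample string is a tidied-up, hypothetical version of the excerpt above.

```python
import re

# Hypothetical sample: page marker and running header on their own lines.
page_text = (
    "exhibit costly con\n"
    "Page 2\n"
    "2 A ROLLING STONE\n"
    "tents ; flowers are in profusion ;\n"
)

# re.MULTILINE makes ^ match at the start of every line, so only
# lines that BEGIN with the pattern are removed.
no_page_numbers = re.sub(r"^Page \d+\n", "", page_text, flags=re.MULTILINE)
no_headers = re.sub(r"^\d+ A ROLLING STONE\n", "", no_page_numbers, flags=re.MULTILINE)

print(no_headers)
```

The ^ anchor is an addition to the patterns shown above; it guards against deleting text when "Page 12" happens to appear in the middle of a sentence.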

3c. Putting Line Breaks Before New Chapters.

In the chunk above, you see this:

A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities,

In this book, it looks like new Chapters are not going to be clearly demarcated with line breaks, but we probably want them to make the text file readable for humans.

To make sure new chapters are easy to find, you can put them in using this command:

Find: CHAPTER
Replace: \n\nCHAPTER

The "\n" is for new line. If you do the command above, it will put two line breaks before each instance of CHAPTER (make sure the "Match Case" option is turned on, or it might do this when it randomly encounters the word "chapter." Most likely, the only time you'll see the word CHAPTER in all caps is the beginning of a new chapter).
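A Python version of the chapter-break step might look like the sketch below. Python's re module is case-sensitive by default, which is the equivalent of having "Match Case" turned on. The \s* at the front is an addition of mine to swallow the space left over before CHAPTER.

```python
import re

chapter_text = "A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities,"

# \s* soaks up any whitespace before CHAPTER; the replacement
# puts two line breaks in front of the word instead.
with_breaks = re.sub(r"\s*CHAPTER", r"\n\nCHAPTER", chapter_text)

print(with_breaks)
```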

3d. HathiTrust-specific clean-up.

Documents derived from HathiTrust often look a little like this:

## p. (#5) ##################################################
A TINY SPARK
BY
CHRISTINA MOODY
Washington, D. C.
MURRAY BROTHERS PRESS
1910
## p. (#6) ##################################################

What if you want to get rid of every line that starts with ##?

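You can use a RegEx find and replace that looks like this:

Find: ##.*\n
Replace: [leave this blank]

That looks for ## and matches all the text after it (.*) up to and including the line break (\n). To be sure it only catches lines that begin with ##, you can put a ^ (start of line) in front: ^##.*\n

Replace with -- nothing.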
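In Python, the same removal could be sketched like this, using a shortened copy of the HathiTrust sample above. The ^ anchor and the re.MULTILINE flag are there so that only lines starting with ## are touched.

```python
import re

hathi_text = (
    "## p. (#5) ##################################################\n"
    "A TINY SPARK\n"
    "BY\n"
    "CHRISTINA MOODY\n"
    "## p. (#6) ##################################################\n"
)

# ^## matches only at the start of a line (because of re.MULTILINE);
# .* runs to the end of that line and \n takes the line break with it.
without_markers = re.sub(r"^##.*\n", "", hathi_text, flags=re.MULTILINE)

print(without_markers)
```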

3e. Converting unformatted poetry to indented stanzas.

Let's say you're working with a text where the poetry looks like this in your file:

PHILADELPHIA, SEPT., 1872.

TO THE FACULTY OF HOWARD UNIVERSITY:

GENTLEMEN, I my pen have raised,
The one by which your Board I've praised;
It is a pen of noble deeds,
By which I have sown wisdom's seeds.
It is a pen I long have trained.
By it a thousand hearts I've gained,
For it was truly made of steel,
Therefore to it your hearts will yield.
For truly it does speak to-day,
As did it on the first of May;
For then I know it did record
Your little and your great reward.
Remember that its highest aim
Is much like yours-is much the same;
For you will heal the wounded heart,
And give the young an upward start.

But you can see that in the original page images, the verses are in indented stanzas. Is there a quick way to reformat the text so it looks like the original?

Find: (.*\n)(.*\n)(.*\n)(.*\n)
Replace: \1   \2\3   \4\n

(There are three spaces between \1 and \2 and again between \3 and \4)

What that does is take all characters (.*) in four lines of text, each ending with a carriage return or new line (\n), and capture each to memory as "1" "2" "3" and "4." (The parentheses do the capturing.) The Replace then returns the four lines, indents lines 2 and 4, and adds a new carriage return at the end. And voila:

GENTLEMEN, I my pen have raised,
   The one by which your Board I've praised;
It is a pen of noble deeds,
   By which I have sown wisdom's seeds.

It is a pen I long have trained.
   By it a thousand hearts I've gained,
For it was truly made of steel,
   Therefore to it your hearts will yield.

For truly it does speak to-day,
   As did it on the first of May;
For then I know it did record
   Your little and your great reward.

Remember that its highest aim
   Is much like yours-is much the same;
For you will heal the wounded heart,
   And give the young an upward start.

Again, you cannot really do this for a Global Find and Replace, but if you use the Regex above you can speed through a document pretty well.

This is especially helpful for longer poems in the ballad stanza format (especially popular in 19th-century poetry).
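The stanza trick can also be sketched in Python. This assumes, as the Find pattern does, that the lines of verse follow one another with no blank lines in between; the sample is the first eight lines of the poem above.

```python
import re

poem = (
    "GENTLEMEN, I my pen have raised,\n"
    "The one by which your Board I've praised;\n"
    "It is a pen of noble deeds,\n"
    "By which I have sown wisdom's seeds.\n"
    "It is a pen I long have trained.\n"
    "By it a thousand hearts I've gained,\n"
    "For it was truly made of steel,\n"
    "Therefore to it your hearts will yield.\n"
)

# Four captured lines in a row; the replacement indents the 2nd and
# 4th by three spaces and adds a blank line after each group of four.
stanzas = re.sub(r"(.*\n)(.*\n)(.*\n)(.*\n)", r"\1   \2\3   \4\n", poem)

print(stanzas)
```

Because re.sub works through the whole string, this acts like "Replace All", so it is only safe on a stretch of text you already know is in regular four-line stanzas.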

4. Putting in line breaks in unformatted poetry.

Sometimes when you bring text in from an OCRed PDF file, you get all of the text of the poems, but none of the line breaks. This is from a book of poetry I've been working with, in a file derived from HathiTrust:

SONG OF THE HINDUSTANEE MINSTREL WITH surmah tinge thy black eye's fringe, 'Twill sparkle like a star; With roses dress each raven tress, My only loved Dildar! II Dildar ! there's many a valued pearl In richest Oman's sea; But none, my fair Cashmerian girl! O ! none can rival thee. Ill In Busrah there is many a rose Which many a maid may seek, But who shall find a flower which blows Like that upon thy cheek? IV In verdant realms, 'neath sunny skies, With witching minstrelsy, We'll favor find in all young eyes, And all shall welcome thee.

This is a challenging one! I haven't found a way to fully automate introducing line breaks using RegEx, though I have found a command that works reasonably well to speed it up -- find capitalized letters, and insert a line break before.

For the above, do

Find: ([ABCDEFGHIJKLMNOPQRSTUVWXYZ])
Replace: \n\1

The square brackets tell the software you're looking only for one of these capital letters (make sure Match Case is on); [A-Z] is a shorter way of writing the same list. This doesn't work perfectly, since the "I" will often catch the pronoun by itself (which might not be the beginning of a line). It will also catch randomly capitalized words and proper nouns (in the above, it will catch words like "Dildar"). So you can't just tell it to Replace All -- you have to go through and check each one. It's still faster than doing it completely without automation.

How it works: the parentheses around the brackets tell the software to "capture" the letter in question and keep it in memory for the Replace command.

The \1 in the replace command calls back the string we captured in the Find command, and tells the software to print the same letter again.

The above passage is particularly messy, but if you run the command above and make some judicious choices about likely line breaks using the Find/Replace dialog box (again, not using Replace All), you could end up with:

SONG OF THE HINDUSTANEE MINSTREL

I. WITH surmah tinge thy black eye's fringe,
'Twill sparkle like a star;
With roses dress each raven tress,
My only loved Dildar!

II. Dildar ! there's many a valued pearl
In richest Oman's sea;
But none, my fair Cashmerian girl!
O ! none can rival thee.

III. In Busrah there is many a rose
Which many a maid may seek,
But who shall find a flower which blows
Like that upon thy cheek?

IV. In verdant realms, 'neath sunny skies,
With witching minstrelsy,
We'll favor find in all young eyes,
And all shall welcome thee.

After the line breaks are in place, we might go through and use the hack above to clean up the "space before punctuation" problem.
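To see what the capital-letter trick does when run all at once, here is a small Python sketch on the first stanza. It uses a slight variant of the pattern, which is my assumption rather than the command above: it requires a space before the capital letter and replaces that space with the line break, so that all-caps words like "WITH" are not split letter by letter.

```python
import re

stanza = (
    "WITH surmah tinge thy black eye's fringe, 'Twill sparkle like a star; "
    "With roses dress each raven tress, My only loved Dildar!"
)

# A space followed by a captured capital letter becomes a line break
# followed by that same letter.
broken = re.sub(r" ([A-Z])", r"\n\1", stanza)
candidate_lines = broken.split("\n")

for line in candidate_lines:
    print(line)
```

The output shows both kinds of error described above: "Dildar!" ends up on a line of its own (a proper noun, not a new line of verse), and the break before 'Twill is missed because an apostrophe comes before the capital. That is why the results still need to be checked one by one.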

4a. Removing line breaks -- but not at the ends of paragraphs.

Many times, when you extract text from page scans via OCR, you have line breaks in the text file that match the line breaks in the original scanned image. As you convert text files to formats for publishing on the web, you typically want to remove those.

It would not be hard to do a global find and replace to remove all line breaks, but that would produce a single line of unformatted text. No bueno!

So what we need ideally is a command that removes line breaks but not at the ends of paragraphs.

One possible Regex command that would do this is:

Search: (?<!\.)\r?\n(?!"$)

Replace with: [space]

That is a more advanced RegEx code. Roughly, it looks for a line break (\r?\n) that is not preceded by a period, which is what (?<!\.) means, and that is not followed by a closing quotation mark at the end of a line, which is what (?!"$) means. Try it and see if it works for you (note: again, I would not do global Find and Replace for this).

However, this is not quite up to the task of global search and replace, since sometimes a sentence ends at the end of a line in the file without being the end of a paragraph -- you'll have to go back through and correct those. That said, it can speed up the process of working through a text file quite a bit.
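The same pattern works unchanged in Python. Here is a small sketch on an invented four-line sample, where the second line ends a sentence (and, we are assuming, a paragraph):

```python
import re

# Invented sample: hard line breaks in the middle of sentences.
wrapped = (
    "Lady Kesters was enjoying a\n"
    "well-earned rest.\n"
    "She rose and went over\n"
    "to a window."
)

# A line break NOT preceded by a period, and NOT followed by a
# quotation mark at the very end, is replaced with a space.
unwrapped = re.sub(r'(?<!\.)\r?\n(?!"$)', " ", wrapped)

print(unwrapped)
```

The break after "rest." is kept because a period comes right before it. That is exactly the behavior that makes the pattern imperfect: any line that happens to end with a period is treated as the end of a paragraph.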

5. A more advanced RegEx example: extracting a list of words from a tagged file.

RegEx is extremely sophisticated, and there are many more advanced commands I won't get into here (also: I am still very much a learner).

(It's basically a form of coding without actually writing programs... Interestingly, some full-fledged programming languages do allow RegEx code to be embedded within, so it might be worth your while to learn more of it...)

Here is a more advanced example that was shown to me by a software developer who works in the library at my institution (Rob Weidman). We have a file where we used Stanford Named Entity Recognition (NER) to tag every proper Name and every Location in a book (why we did that and how that works is a question for another day). The output it produces looks like this:

Major <PERSON>Carteret</PERSON>, though dressed in brown linen, had thrown off his coat for greater comfort. The stifling heat, in spite of the palm-leaf fan which he plied mechanically, was scarcely less oppressive than his own thoughts. Long ago, while yet a mere boy in years, he had come back from <LOCATION>Appomattox</LOCATION> to find his family, one of the oldest and proudest in the state, hopelessly impoverished by the war,--even their ancestral home swallowed up in the common ruin. His elder brother had sacrificed his life on the bloody altar of the lost cause, and his father, broken and chagrined, died not many years later, leaving the major the last of his line. He had tried in various pursuits to gain a foothold in the new life, but with indifferent success until he won the hand of <PERSON>Olivia Merkell</PERSON>, whom he had seen grow from a small girl to glorious womanhood.

Let's say we want a file with just the names of people referenced in this book -- the items the NER software has tagged as <PERSON>.

First, you want to make sure there aren't a lot of invisible line breaks in the text. It's ok to do a global Find/Replace where you replace \n with a blank space.

Then:

1) Replace <PERSON with \n<PERSON
This puts a line break before each Person tag

2) Delete the first line of text (everything up to the first instance of a person tag)

3) Replace <PERSON>(.*)<\/PERSON>.* with $1

This gets rid of the tags and everything outside of the tags and replaces it with just the text within the tags.

This produces a list of just names of people tagged in the text. In the paragraph above, it would produce

Carteret
Olivia Merkell

What the various commands above are doing is complicated. The .* in the parentheses means capture everything. The $1 calls back the string we just captured between the tags. The .* after the second person tag is grabbing all of the text *outside* of the tags -- which we're going to delete.
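In Python, the same list can be pulled out in a single step. To be clear, this is a different technique from the three find-and-replace steps above: it uses re.findall with a "lazy" .*? (which stops at the first closing tag instead of the last one). The sample is an abridged version of the tagged paragraph.

```python
import re

# Abridged version of the tagged paragraph above.
tagged = (
    "Major <PERSON>Carteret</PERSON> had come back from "
    "<LOCATION>Appomattox</LOCATION> and won the hand of "
    "<PERSON>Olivia Merkell</PERSON>."
)

# (.*?) captures as little as possible between the two PERSON tags;
# findall returns just the captured text for every match.
people = re.findall(r"<PERSON>(.*?)</PERSON>", tagged)

print(people)  # ['Carteret', 'Olivia Merkell']
```

Swapping PERSON for LOCATION in the pattern would give the list of places instead.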

For reference, ".*" is a really important RegEx command.

Also helpful is the ".+" command... And the "NOT" command (^ used inside square brackets)...

It goes on... I'll just recommend people look at the "RegEx Cheat Sheet" here.
