I wrote up the following as a primer for the students in my Digital Humanities seminar this fall, but I figured others might benefit from it as well. If you have favorite RegEx commands and tips, I would welcome them in the comments, or you can hit me up on Twitter.
A lot of digital humanities work involves messy texts -- you get a PDF image file from Google Books, HathiTrust, or Archive.org, or scans from old microfilm, and you want to turn it into something you can work with, either for producing digital editions of texts or for quantitative analysis.
OCR (optical character recognition) is software that converts images of text, such as the page scans in a PDF, into machine-readable text. It is built into some PDF software (the non-free version of Adobe Acrobat has OCR, for instance), and you can find various PDF-to-text converters online that will do it for you. Depending on the quality of the software and the quality of your page scan, OCR can be anywhere from 80% to 95% accurate. For most things (other than producing digital editions), 95% is pretty good. Still, I often find myself working hard to clean up OCR output to make sure it's useful for my various projects.
It's also worth mentioning that some image files are poor enough that it's not worth your while to use OCR at all -- there are so many mistakes that it might just be faster to retype the whole document, letter for letter.
Many digital humanities queries about literary texts require plain text files that don't have a lot of noise in them. If you are asking software to do word counts or study other features of the language inside a text, you want to make sure the file contains the author's own words and nothing else. If you have a folder full of text files from Project Gutenberg, you need to go through and cut out the header and footer text they attach to every text they publish online. If you have a text from HathiTrust that started as a PDF file, you probably want to cut out page numbers and page headers (Page 7, Page 8, etc.).
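For readers who would rather script this than do it by hand, here is a minimal Python sketch of both clean-ups. It rests on a few assumptions: that the Project Gutenberg boilerplate is fenced off by lines beginning `*** START OF` and `*** END OF` (common in their plain text files, but check your own copies), and that page headers sit alone on a line in the form `Page 7`. The function names are my own, purely for illustration.

```python
import re

# Assumed marker lines; verify them against your own Gutenberg files.
START_MARKER = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)

# A line containing nothing but "Page <number>", plus its line break.
PAGE_LINE = re.compile(r"^[ \t]*Page \d+[ \t]*$\n?", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Keep only what lies between the START and END marker lines."""
    start = START_MARKER.search(text)
    if start:
        text = text[start.end():]
    end = END_MARKER.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


def strip_page_lines(text):
    """Delete lines that consist solely of a page header like 'Page 7'."""
    return PAGE_LINE.sub("", text)


sample = (
    "Header junk\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Chapter 1\n"
    "Page 7\n"
    "It was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License junk\n"
)

cleaned = strip_page_lines(strip_gutenberg_boilerplate(sample))
print(cleaned)
```

If a marker line is missing, the sketch leaves that end of the file untouched rather than guessing, so it is worth spot-checking a few outputs before running it over a whole folder.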
Below I give a few tips on how to do that type of clean-up work efficiently using text-editing software.