Friday, September 18, 2020

Announcing: a Digital Edition of the Poems of Henry Derozio

I've been happy to collaborate with Professor Manu Samriti Chander on a digital edition of the poems of Henry Louis Vivian Derozio, the first Indian poet to write in English. 

It is essentially finished, though some additional copy-editing and proofreading probably remains to be done (if you see any typos or other errors, please contact me!). I edited it and built the collection, so any glitches you find are my doing. It's completely appropriate that Manu wrote the Preface to the project, for reasons I'll explain. 

Derozio published two books of poems in 1827 and 1828, and had an intense, impactful, but brief career as a professor at Hindu College in Calcutta. He died of cholera in 1831. 

I learned about Derozio through reading Chander's Brown Romantics: Poetry and Nationalism in the Global Nineteenth Century. Among other things, Chander's account of Derozio convinced me of his importance both as a Romantic poet -- and Derozio was intensely interested in and engaged with the writings of British and Irish poets of the 1810s and 20s -- and as a key figure in the emergence of modern Anglophone South Asian literature.

Derozio was criticized by English reviewers, even during his lifetime, for "imitating" English Romantic poets. And while there is no doubt that he did borrow heavily from the form and style of writers like Byron, Thomas Moore, and others, he also applied his own, distinctly Indian sensibility in his writing. As Chander puts it in his preface to the Digital Edition:

Far from simply following in the footsteps of such popular figures as Lord Byron, Thomas Campbell, and Thomas Moore, Derozio uses their work often as a point of departure or as a signpost on his own poetic journey. Indeed, Derozio inaugurated his own tradition in India, inspiring his students to form the Young Bengal Movement. These liberal thinkers and activists were sometimes referred to as “Derozians,” and they carried their teacher’s ideas forward even after his death in 1831. (link)

For more on Derozio's relationship to Romanticism, I would refer readers to Chander's book, or to the books and essays of Professor Rosinka Chaudhuri, who has also edited the Oxford University Press scholarly edition of Derozio's works. 

This digital edition is not meant to supplant Chaudhuri's volume, but rather to provide a convenient point of access to Derozio's works for a broad readership. Among other things, I hope people teaching literature courses -- including specialist courses on Romantic poetry, but also literature surveys, courses on South Asian literature, and others -- will consider assigning Derozio. To facilitate that, I've put together a "Teaching Resource" page on the Scalar site, along with a downloadable PDF with some suggested selections from Derozio's poetry (this might make for a lively one-day unit on Derozio). 

Thursday, August 20, 2020

Fall Teaching: "Decolonizing (Digital) Humanities"

I'm teaching a grad seminar on Digital Humanities this fall. It's the first time I've taught this material formally since Fall 2015, when I co-taught an Intro to DH class with my colleague Ed Whitley. It's a whole new group of students, of course, but also almost an entire turnover in terms of scholarship. 

I'm structuring most of the hands-on work around two Text Corpora I've been developing, one on African American Literature, and the other on Colonial South Asian Literature.

If the Canon has been the defining structure of traditional literary studies, in the DH framework the starting point is the Corpus. You can do a lot with a group of texts structured this way -- from Text Analysis, to Natural Language Processing, to thinking about Archives and Editions. As with the Canon, the questions you can ask and the knowledge you can produce are strongly determined by what's included or excluded from the Corpus. 

Course Description: 

This course introduces students to the emerging field of digital humanities scholarship with an emphasis on social justice-oriented projects and practices. The course will begin with a pair of foundational units that aim to define digital humanities as a field, and also to frame what’s at stake. What are the Humanities and why do they matter in the 21st century? How might the advent of digital humanities methods impact how we read and interpret literary texts? Some topics we’ll consider include: Quantifying the Canon, Race, Empire & Gender in Digital Archives, and an introduction to Corpus Text Analysis. Along the way, we’ll explore specific Digital Humanities projects that exemplify those areas, and play and learn with digital tools and do some basic coding. The final weeks of the course will be devoted to collaborative, student-driven projects. No programming or web development experience is necessary, but a willingness to experiment and ‘break things’ is essential to the learning process envisioned in this course.


August 25

Intro.: Discuss in person/Zoom

Matthew Kirschenbaum, “What Is Digital Humanities and What’s It Doing in English Departments?”

Roopika Risam, “Introduction: the Postcolonial Digital Record” (from New Digital Worlds)

Keywords: Digital Humanities, Postcolonial Studies, Postcolonial Digital Humanities (Risam), “Digital Canonical Humanities” (Risam)

Example in class (in support of Risam’s point about Digital Canonical Humanities): compare the Charles Chesnutt Archive with the Walt Whitman Archive.

Getting our feet wet at home (20-30 minutes): Google Ngram viewer. Set for “English Fiction.” Recommend “Smoothing” set to 0. 

August 27


Risam, “Chapter One: The Stakes of Postcolonial Digital Humanities”

Ted Underwood, “Preface: the Curve of the Literary Horizon” from Distant Horizons

Keywords: Quantitative vs. Digital; Distant Reading vs. Close Reading; “Slaughterhouse of Literature”/“Great Unread” 

Getting our feet wet with a Corpus of African American literature:

September 1

Politics & Terminology in Literary Studies 

M.H. Abrams, “Canon of Literature” from A Glossary of Literary Terms 

Other Keywords Entries (read a selection according to interest): “Black Arts Movement,” “Feminist Criticism,” “Harlem Renaissance,” “New Criticism,” “New Historicism,” “Periods of American Literature,” “Periods of English Literature,” “Postcolonial Studies,” “Queer Theory” 


September 3

Digital Humanities and Literary History

Underwood, Chapter 1, “Do We Understand the Outlines of Literary History?” (From Distant Horizons)

 Franco Moretti, “Graphs,” from Graphs, Maps, Trees (2007. On CourseSite)

Homework: Play with Voyant-Tools. For this exercise, let’s look at a second Text Corpus: Colonial South Asian Literature. 

September 8

Digital Humanities--Canonicity

Amy Earhart, “Can Information Be Unfettered? Race and the New Digital Humanities Canon”

Stephanie P. Browner, “Digital Humanities and the Study of Race and Ethnicity”

Underwood, Chapter 2 “The Life Spans of Genres” (from Distant Horizons)

On your own: New tool to explore: AntConc (downloadable software)

September 10

Quantifying the Expanding Canon

Studying Anthologies: Lehigh grad student Adam Heidebrink-Bruno’s work on American modernism. Zoom visit from Adam himself.

Open Syllabus Project: Who is being taught?

Homework: Do test queries on

African-American authors? Latinx authors? LGBTQ+ authors? Postcolonial authors? How would we quantify the results? How might we visualize them?

September 15

Hands-on project workshop: Playing with data -- either from the Corpora I posted on CourseSite or from other corpora you can find online. 

(If there’s a particular topical corpus -- say, Detective Fiction or Science Fiction -- you’re looking for, you could start by Googling it. But also feel free to ask me.)

I also recommend you read this primer for working with plain text files & getting started with processing those texts to make them useful:

September 17

Workshop continued.

Short analysis with data due: September 20 

September 22

Race and the Digital Humanities 1

Kim Gallon, “Making a Case for the Black Digital Humanities” (2016)

Safiya Umoja Noble, “Towards a Critical Black Digital Humanities” (2019)

September 24

Race and the Digital Humanities 2: Algorithms of Oppression

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018, doi: 10.2307/j.ctt1pwt9w5.

Noble, Algorithms of Oppression: Introduction
Noble, Algorithms of Oppression Chapter 1 

Risam, “What Passes for Human?” (2019) (Bringing the kinds of questions Noble asks to AI, Facial recognition, robotics)


September 29

Slavery and the Archive 1

Jessica Marie Johnson, “Markup Bodies: Black [Life] Studies and Slavery [Death] Studies at the Digital Crossroads” (2017) (CourseSite)

Gabrielle Foreman, “Writing About ‘Slavery’? This Might Help” (brief document with tips and dos & don’ts)

Colored Conventions Project

Hands-on work on creating custom maps: Possibly: using Named Entity Recognition to get Names and Maps from our African American Literature Corpus. 

October 1

Slavery and the Archive 2: Jamaica

Vincent Brown, “A Slave Revolt in Jamaica”

Readings from Vincent Brown, Tacky’s Revolt (2020): “Prologue,” “Chapter 2: The Jamaica Garrison,” “Chapter 4: Tacky’s Revolt” 

October 6

Slavery and the Archive 3:

Getting our feet wet with a newspaper archive. African American Newspapers Series 1: 1827-1998. Need to log in through Lehigh’s library website using your Lehigh account credentials.

Try some sample queries, perhaps related to abolition, emancipation, reconstruction. 

Could also return to the African American authors from our African American Text Corpus. Passing? Liberia? Lynching? Interracial romance/mixed-race experiences? African American genre fiction (i.e., detective fiction, science fiction, Gothic, etc.)? Other topics of interest?

October 8

Digital Archives, Editions, Collections

Earhart, “The Era of the Archive” (Traces of the Old, Uses of the New, Chapter 2). Keywords: New Historicism; Digital Archive vs. Digital Edition

Kenneth M. Price, “Edition, Project, Database, Archive, Thematic Research Collection: What's in a Name?”

Risam, Chapter 2 of New Digital Worlds. “Colonial Violence and the Postcolonial Digital Archive” 


October 13

Analog Archives: What Are Archives For?

Terry Cook, “Evidence, Memory, Identity, and Community: Four Shifting Archival Paradigms” (2013) 

Kate Theimer, “Archives in Context and As Context” (2013)

(An analog archivist questions the way Digital Humanities scholars use the word “archive”; she posits “collection” might be more appropriate)


October 15

Digital Editions: Hands-on/Collaborative/Student-driven

Workshop for Second Project: Constructing a Basic Digital Edition in Scalar. Hands-on Introduction to the Scalar platform & Lehigh's Instance of Scalar.

Possible sources for producing Digital Editions/Collections in Scalar: African American Text Corpus, Colonial South Asian Literature

October 20

Students work collaboratively on building a Digital Edition of a text in Scalar, with introductory essay, notes, other relevant materials. More info. TBA.

Project Due Sunday October 25.

October 22

Digital Media Studies 1: Twitter -- Hashtag Activism

Jackson, Sarah J, Moya Bailey, and Brooke Foucault Welles. #Hashtag Activism: Networks of Race and Gender Justice, 2020.

#Hashtag Activism, Introduction, “Making Race and Gender Politics on Twitter”

#Hashtag Activism, Chapter 5: “From Ferguson to #FalconHeights: The Networked Case for Black Lives”

October 27

Digital Media Studies 2: Twitter; Scraping

Marcia Chatelain, “Is Twitter Any Place for a [Black Academic] Lady?” [focus on “#FergusonSyllabus” and academic expectations/culture] (2019)

Hands-on work: Scraping hashtags on Twitter. Possibly using Python (will demonstrate how to do this)

October 29

Digital Media Studies 3: Instagram

Hands-on work: Scraping hashtags, keywords, authors on Instagram:

November 3

Digital Media Studies 4: Instagram -- InstaPoetry.

 Lili Paquet, “Selfie Help: The Multimodal Appeal of Instagram Poetry” (2019) (CourseSite)

Instapoets: Rupi Kaur, others

Possibly: Analyzing our scraped data using Sentiment Analysis:

November 5

Intersectional Data Feminism 1

Lauren Klein and Catherine D’Ignazio, Data Feminism:
Introduction: “Why Data Science Needs Feminism”
Chapter 1: “The Power Chapter”

November 10

Intersectional Data Feminism 2

Klein and D’Ignazio, Data Feminism:
Chapter 3: “On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints”
Chapter 4: “What Gets Counted Counts”
Hands-on work: To be announced

November 12

Intersectional Data Feminism 3

Ted Underwood, Chapter 4 of Distant Horizons: “Metamorphoses of Gender” 

Hands-on work: Can we replicate some of Underwood’s analyses? Also, can we apply some of this to the African American Literature Text Corpus or the Colonial South Asian Literature Corpus? Do texts by black and brown writers engage with gender the same way? Are there variations in the pattern? 

November 17

Digital Humanities Pedagogy

Roopika Risam, New Digital Worlds. Chapter 4.

Explore some of the tools Risam mentions. 

November 19

Digital Humanities Pedagogy

Stefan Sinclair and Geoffrey Rockwell, “Teaching Computer-Assisted Text Analysis: Approaches to Learning New Methodologies” (from Digital Humanities Pedagogy)

Olin Bjork, “Digital Humanities and the First-Year Writing Course” (from Digital Humanities Pedagogy)

November 24-26

Thanksgiving Week (nothing scheduled)

December 1

(Fully Remote) Workshop: Final projects

December 3

(Fully Remote) Workshop: Final projects + Semester Wrap-up


Saturday, August 08, 2020

Text Corpus: Colonial South Asian Literature

Recently, I announced a Text Corpus I had put together, of African American Literature from 1853-1923. 

I've also been putting together a Corpus of Colonial South Asian Literature from roughly the same period.  

The link to that folder can be accessed here. I'll also be posting the files on Github soon.

This has been a much harder Corpus to compose. Whereas with the African American literature we have bibliographic lists of published works to serve as a guide (such as the one posted at the History of Black Writing at Kansas), there does not appear to be an equivalent list with respect to Colonial South Asia. 

Choices Made in Producing this Corpus:

1. Nationalities

I decided to include British as well as South Asian writers in the Corpus. Many of the writers were clearly in dialogue with one another; South Asian writers were clearly reading people like Rudyard Kipling, E.M. Forster, and Katherine Mayo. It's a little less clear which South Asian writers British and American writers were reading other than Tagore (and this itself might be studied). The publishing industries also overlapped to a considerable extent; while some South Asian writers published their works with publishers based in India, many aimed to publish with houses based in London. 

One possible line of inquiry with this material might be to try and compare fiction, poetry and drama by British authors with South Asian output in English. Such inquiry could either be historical and thematic (i.e., comparing the way British and South Asian writers reacted to historical events like the Sepoy Mutiny or the Famine of 1876), or it could be connected to matters of language and style. To do that it makes sense to have writers from different backgrounds represented in the Corpus. 

I knew there was a fair amount of interest in colonial India in the U.S. at the time -- from the appreciation of Kipling to the American feminist fascination with Pandita Ramabai. However, while doing this research I was surprised to come across a large number of pulpy Indian adventure novels by an American writer named Talbot Mundy.  

In the metadata file, I list the nationalities of the authors. Besides a few Americans in the collection, I would draw readers' attention to B.M. Croker (an Irish woman who lived in India and wrote many Romance novels based in colonial India), and Sara Jeannette Duncan (a Canadian woman who also lived in India and wrote prolifically as well).  

In addition to the nationality question, with South Asian writers who moved abroad there is also the question of destination. Cornelia Sorabji (who eventually moved to England) is of course pretty well known. Dhan Gopal Mukerji, who moved to the U.S. in the 1910s, is mainly known for his memoir Caste and Outcast, but he was quite a prolific literary writer, with several books of poetry and fiction that are worth looking at. 

2. Translations. 

I decided to include translations by South Asian writers like Bankim Chandra Chatterjee (Chattopadhyay) and Rabindranath Tagore in the Corpus. Tagore of course needs no explanation; he was one of the few South Asian writers to break through and achieve global acclaim in the early 20th century. Bankim Chandra Chatterjee (here, I'm using one of the spellings used at the time, aware of course that "Chatterjee" and "Chatterji" are colonial-era abbreviations of Chattopadhyay...) is slightly different. He is clearly historically important for Anandamath (here included in translation as Dawn Over India) and Rajmohan's Wife (thought to be the first English-language novel by an Indian author), but it seemed like it might be valuable to include some other of his Bengali novels in translation here. Several of these I found at Wikisource.

Alongside translations by South Asian writers, there are a few translations in the corpus of historical South Asian texts by British writers. 

3. Fiction and Nonfiction

Right now there is a limited amount of nonfiction included in the corpus. This was a very tough decision, as there is a vast array of nonfiction colonial travel writing based in South Asia from this period. I've excluded that sort of writing for now, though I may include more of it as I continue to expand the corpus. 

However, I decided to include some nonfiction, mostly texts by literary authors who wrote occasional works of nonfiction (Dhan Gopal Mukerji's Caste and Outcast is included, as is Tagore's My Reminiscences). I've also included a plain text file of Pandita Ramabai's The High-Caste Hindu Woman, mainly because it seems like an important text that might be useful for researchers in this field. Any queries specifically structured around the stylistics of fiction or the colonial novel might want to exclude these nonfiction texts. 

4. Derivation; grunt work

As with my other Corpus, I pulled together materials from different repositories to assemble this corpus. Here, the lion's share of material comes from Project Gutenberg and HathiTrust. (Derivation is indicated in my metadata file.) 

The Gutenberg materials were in good shape; they've generally been proofread and formatted cleanly.

The HathiTrust materials required much more work. One can extract HathiTrust texts by requesting plain text, but these OCR page scans need quite a bit of processing to make them clean enough to use. A lot of the grunt work of assembling this collection has entailed doing that processing. 

Here is a list of works I've imported from HathiTrust page scans thus far: 

Author | Title | Year
Arnold, W.D. | Oakfield; Or, Fellowship in the East | 1855
Bain, F.W. | A Hindoo Love Story | 1898
Candler, Edmund | Abdication | 1922
Candler, Edmund | Siri Ram, Revolutionist | 1911
Candler, Edmund | Mantle of the East | 1910
Candler, Edmund | Year of Chivalry | 1916
Chatterji, Bankim Chandra | Anandamath: Dawn Over India | 1882 (1941)
Chatterji, Bankim Chandra | Krishnakanta's Will | 1917
Croker, B.M. | Proper Pride | 1882
Croker, B.M. | Diana Barrington: A Romance of Central India | 1888
Croker, B.M. | A Rolling Stone | 1911
Diver, Maud | Lilamani: A Study in Possibilities | 1911
Diver, Maud | Unconquered | 1917
Derozio, Henry Louis Vivian | Poems of Henry Louis Vivian Derozio: A Forgotten Anglo-Indian Poet | 1923 (1831)
Duncan, Sara Jeannette | Burnt Offering | 1910
Dutt, Michael Madhusudan | Sermista; a drama in five acts | 1859
Dyer, Helen S. | Pandita Ramabai: The Story of Her Life | 1900
Kipling, Rudyard and Wolcott Balestier | The Naulahka: A Story of West and East | 1892
Mukerji, Dhan Gopal | Caste and Outcast | 1923
Mukerji, Dhan Gopal | Layla-Majnu: A Musical Play in Three Acts | 1916
Mukerji, Dhan Gopal | Rajani: Songs of the Night | 1916
Ramabai, Pandita | The High Caste Hindu Woman | 1888
Satthianadhan, Krupabai | Kamala: A Story of Hindu Life | 1894
Sorabji, Cornelia | Between the Twilights: Being Studies of Indian Women By one of Themselves | 1908
Sorabji, Cornelia | Indian Tales of the Great Ones Among Men, and Bird-People | 1916
Sorabji, Cornelia | Shubala-A Child Mother | 1920
Sorabji, Cornelia | Sun-Babies: Studies in the Child-Life of India | 1904
Tagore, Rabindranath | Gora | 1924 (1901)

Some of the highlights in the table above are in bold. As far as I know, these are the first plain text versions of the above texts to be made available online. 

You may notice that a couple of these texts are dated post-1923. I believe the 1941 translation of Anandamath (Dawn Over India) has fallen out of copyright in the U.S.

I should add that while I've cleaned up these files, I haven't proofread them. That is going to be a long-term project -- for which I would welcome collaborators! 

Friday, August 07, 2020

Text Processing 101: a Digital Humanities Work-Flow for Beginners

I wrote up the following as a primer for the students in my Digital Humanities seminar this fall, but I figured others might benefit from it as well. If you have favorite RegEx commands and tips, I would welcome them in the comments or hit me up on Twitter.

A lot of digital humanities work involves working with messy texts -- you get a PDF image file from Google Books or HathiTrust, or scans from old Microfilm, and you want to turn it into something you can work with, either for producing digital editions of texts or for quantitative analysis. 

OCR (Optical character recognition) is software that converts image-text in PDFs to Text. It is built into some PDF software (the non-free version of Adobe Acrobat has OCR, for instance), and you can find various PDF-Text converters online that will do it for you. Depending on the quality of the software and the quality of your page scan, OCR can be somewhere from 80-95% accurate. For most things (other than producing digital editions), 95% is pretty good. Still, I often find myself working hard to clean up the output of OCR to make sure it's useful for my various projects. 

It's also worth mentioning that some image files are poor enough that it's not worth your while to use OCR at all -- there are so many mistakes that it might just be faster to retype the whole document, letter for letter. 

Many digital humanities queries about literary texts require plain text files that don't have a lot of noise in them. If you are asking software to do word counts or study other features of the language inside a text, you want to make sure you have words by the author themself in the text, nothing else. If you have a folder full of Text files from Project Gutenberg, you need to go through and cut out the header and footer texts they attach to every text they publish online. If you have a text from HathiTrust that started as a PDF file, you probably want to cut out page numbers and page headers (Page 7, Page 8, etc.). 
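If you'd rather script the Project Gutenberg clean-up than do it by hand, here is a minimal Python sketch. It assumes the file contains the "*** START OF ..." and "*** END OF ..." marker lines that appear in many Gutenberg plain-text files; older files may use different wording, so check a few by eye first. The function name and the sample text are made up for illustration.

```python
import re

# Assumption: the file uses "*** START OF ..." / "*** END OF ..." marker
# lines. Older Gutenberg files may be worded differently.
START_MARKER = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(text)
    if start:
        text = text[start.end():]
    end = END_MARKER.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


# A tiny made-up sample standing in for a downloaded file.
sample = (
    "The Project Gutenberg eBook of A Sample Book\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A SAMPLE BOOK ***\n"
    "CHAPTER I\nIt was a dark night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A SAMPLE BOOK ***\n"
    "Updated editions will replace the previous one.\n"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

Because the function leaves the text alone when it can't find a marker, it is safe to run over a whole folder and then spot-check the files that didn't change.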

Below I give a few tips on how to do that type of clean-up work efficiently using Text Editing software. 

1. Installing a text editor.

To get started, you need dedicated text editor software, and you probably need to plan to do this work on a computer rather than a tablet or phone. Note: you can't really do this work in Microsoft Word, Pages, or Google Docs! Those apps will keep trying to add material into your files you don't want -- formatting, bits of invisible code. They also lack the really sophisticated Find and Replace features ("Regular Expressions") we'll need later. 

I use Notepad++ (free) on my Windows laptop and CotEditor (free) on the Mac in my office.   

I would also make a sub-folder in your Documents folder dedicated to playing with texts. 

2. Two tips for saving files from the internet.

Wherever possible, when downloading text files from the internet, make sure to save them as Plain Text (UTF-8). The UTF-8 refers to a character set, and we can mostly not worry about it right now. 

This can be a little more complicated on a Mac than on Windows. If you're using Safari on a Mac, you might have trouble figuring out how to save a plain text file from the internet (it will only give you the ".webarchive" format on the default "Save As..." option). 

Try this: hit CTRL-click, and then select "Download Linked file as..." to save a file as plain text when using Safari on a Mac. 

Also, when you save files from the internet, you should probably start getting in the habit of labeling them really specifically to help you figure out what you're looking at later. Don't just accept the file name and file type chosen by someone else. I typically use filenames like


It might seem like overkill if you just have three files. But when you have three hundred files to search through it will come in handy to know exactly what you're looking at. 

Also: it would be good to get into the habit of creating filenames that don't have spaces. If and when your files are queried by other software (e.g., a script running in Python), those spaces will cause problems. 
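As one way to keep filenames specific and space-free, here is a small Python sketch. The naming scheme (author, title, year joined with hyphens) and the function name are assumptions made for illustration, not a required convention.

```python
import re


def safe_filename(author, title, year, ext="txt"):
    """Build a descriptive filename with no spaces or punctuation.

    Hypothetical scheme: Author-Title-Year.ext, with any run of
    characters other than A-Z, a-z, 0-9 collapsed to one underscore.
    """
    cleaned = []
    for part in (author, title, str(year)):
        part = re.sub(r"[^A-Za-z0-9]+", "_", part).strip("_")
        cleaned.append(part)
    return "-".join(cleaned) + "." + ext


name = safe_filename("Croker, B.M.", "A Rolling Stone", 1911)
print(name)  # Croker_B_M-A_Rolling_Stone-1911.txt
```

Note that this simple version also replaces accented letters with underscores, which may or may not be what you want for your collection.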

2. Find and Replace function

Quite a lot of text processing can be done with advanced Find and Replace features in Text Editors. You don't need coding!

2a. Removing Numbers and Page Headers

If you copy and paste a text file from HathiTrust into a Text Editor, you might  get something that looks like this: 

Page Scan 13

A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities, Lady Kesters was enjoying a well-earned rest, reposing at full length on a luxurious Chesterfield, with cushions of old brocade piled at her back and a new French novel in her hand. Nevertheless, her attention wandered from Anatole France ; every few minutes she raised her head to listen intently, then, as a little silver clock chimed five thin strokes, she rose, went over to a window, and, with an impatient jerk, pulled aside the blind. She was looking down into Mount Street, W., and endeavouring to penetrate the gloom of a raw evening towards the end of March. It was evident that the lady was expecting some one, for there were two cups and saucers on a well-equipped tea-table, placed between the sofa and a cheerful log fire. As the mistress of the house peers eagerly at passers- by, we may avail ourselves of the opportunity to examine her surroundings. There is an agreeable feeling of ample space, softly shaded lights, and rich but subdued colours. The polished floor is strewn with ancient rugs ; bookcases and rare cabinets exhibit costly con» I

Page 2

2 A ROLLING STONE tents ; flowers arc in profusion ; 

And this continues for the whole book... 

There are a few simple things we can do using Find and Replace automation to clean this up. 

--First, make sure that the "Regular Expression" button is turned on in the Replace dialog box. (Regular expressions -- RegEx -- are little bits of code that help us automate certain tasks.)

To get rid of page numbers, use "Replace" (Ctrl-H in Notepad++ on Windows), and do the following three RegEx replace commands 

Find: Page \d\d\d

Replace: [leave this blank]

The "\d" in the find stands for a numerical digit. If you put three of those in a row, the software is looking for specifically -- and only -- three digit numbers. If you do the above command and hit "Replace All" it should remove all of the "Page ..." markers for pages 100 and above. 

If you then do this: 

Find: Page \d\d

Replace: [leave this blank]

It will then do the page numbers in the 10-99 range. Then repeat again with Find: "Page \d" --> Replace: blank. And that does pages 1-9. 

(Why do it in this order? Try it the other way and see what happens. You'll figure out why it's best to start with the hundreds pretty quickly...) 
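The same page-number clean-up can be sketched in Python with the standard `re` module. The sample string below is made up for illustration. The pattern `\d+` ("one or more digits") is a swapped-in shortcut, not the three-pass method described above. It handles one-, two-, and three-digit page numbers in a single pass.

```python
import re

# Made-up sample with one-, two-, and three-digit page markers.
sample = "costly con Page 2 tents ; Page 14 more text Page 213 the end"

# "\d+" means "one or more digits", so a single pass covers every case.
no_pages = re.sub(r"Page \d+", "", sample)
print(no_pages)

# Doing the single-digit pattern first shows why order matters in the
# three-pass method: it strips only the first digit and strands the rest.
wrong_order = re.sub(r"Page \d", "", sample)
print(wrong_order)
```

In the second result, "Page 14" leaves a stray "4" behind and "Page 213" leaves "13", which is the problem that starting with the hundreds avoids.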

You might also notice that all of the semi-colons in the paragraph above are preceded by a space. To get rid of those, you can do this:

Find: [space];

Replace: ;
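For comparison, here is the same semicolon fix as a Python sketch. The sample line is taken from the passage above.

```python
import re

line = "The polished floor is strewn with ancient rugs ; bookcases and rare cabinets"

# Replace "space + semicolon" with a bare semicolon.
fixed = re.sub(r" ;", ";", line)
print(fixed)
```

The same idea can be extended to other punctuation, but check a sample first. In poetry, for instance, spacing before "!" may be something you want to review case by case.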

2b. Getting rid of other page headers. In the sample of text above, you see this: 


That is a page header. If you look at the rest of the book, a version of that is on every other page. Again, we can automate the removal of this using RegEx: 

Find: \d\d\d A ROLLING STONE

Replace: [Leave this blank.]

Then do the tens and ones again. 
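A Python sketch of the running-header removal follows. Again `\d+` stands in for the separate hundreds, tens, and ones passes. The sample string is stitched together for illustration, and the optional trailing space in the pattern is my own addition so that no double spaces are left behind.

```python
import re

text = "costly con 2 A ROLLING STONE tents ; flowers 118 A ROLLING STONE in profusion"

# Remove "<page number> A ROLLING STONE" plus one trailing space if present.
cleaned = re.sub(r"\d+ A ROLLING STONE ?", "", text)
print(cleaned)
```

A pattern like this only catches headers where the number comes first. If the facing pages print the number after the title, or a chapter title instead of the book title, those need their own pattern.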

2c. Putting Line Breaks Before New Chapters. 

In the chunk above, you see this: 

A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous social activities, 

In this book it's not clear whether new Chapters are going to be clearly demarcated with line breaks. 

To make sure new chapters are easy to find, you can put them in using this command: 


Replace: \n\n CHAPTER

The "\n" is for new line. If you do the command above, it will put two line breaks before each instance of CHAPTER (make sure the "Match Case" option is turned on, or it might do this when it randomly encounters the word "chapter." Most likely, the only time you'll see the word CHAPTER in all caps is the beginning of a new chapter). 
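Here is a rough Python equivalent of the chapter-break step, offered as a sketch. Python's `re.sub` is case-sensitive by default, which plays the role of the "Match Case" option. The Find pattern used here (`CHAPTER`, with an optional space in front) and the sample text are assumptions for illustration.

```python
import re

text = (
    "A ROLLING STONE CHAPTER I LADY KESTERS After a day of strenuous "
    "social activities, she closed that chapter. CHAPTER II Next morning"
)

# Put two line breaks before each all-caps CHAPTER; lowercase "chapter"
# is left alone because the match is case-sensitive.
broken = re.sub(r" ?CHAPTER", "\n\nCHAPTER", text)
print(broken)
```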

3. Putting in line breaks in unformatted poetry. 

Sometimes when you bring text in from an OCRed PDF file, you get all of the text of the poems, but none of the line breaks. This is from a book of poetry I've been working with, in a file derived from HathiTrust: 

SONG OF THE HINDUSTANEE MINSTREL  WITH surmah tinge thy black eye's fringe, 'Twill sparkle like a star; With roses dress each raven tress, My only loved Dildar! II Dildar ! there's many a valued pearl In richest Oman's sea; But none, my fair Cashmerian girl! O ! none can rival thee. Ill In Busrah there is many a rose Which many a maid may seek, But who shall find a flower which blows Like that upon thy cheek? IV In verdant realms, 'neath sunny skies, With witching minstrelsy, We'll favor find in all young eyes, And all shall welcome thee. 

This is a challenging one! I haven't found a way to fully automate introducing line breaks using RegEx, though I have found a command that works reasonably well to speed it up -- find capitalized letters, and insert a line break before. 

For the above, do


Replace: \n\1

The square brackets tell the software you're looking only for these capital letters (make sure Match Case is on). This doesn't work perfectly, since the "I" will often catch the pronoun by itself (which might not be the beginning of a line). It will also catch randomly capitalized words and proper nouns (in the above, it will catch words like "Dildar"). So you can't just tell it to Replace All -- you have to go through and check each one. It's still faster than doing it completely without automation.  

How it works: the parentheses around the brackets tell the software to "capture" the letter in question and keep it in memory for the Replace command. 

The \1 in the replace command calls back the string we captured in the Find command, and tells the software to print the same letter again. 

The above passage is particularly messy, but if you run the command above and make some judicious choices about likely line breaks using the Find/Replace dialog box (again, not using Replace All), you could end up with:


I. WITH surmah tinge thy black eye's fringe,
'Twill sparkle like a star;
With roses dress each raven tress,
My only loved Dildar!

II. Dildar ! there's many a valued pearl
In richest Oman's sea;
But none, my fair Cashmerian girl!
O ! none can rival thee.

III. In Busrah there is many a rose
Which many a maid may seek, 
But who shall find a flower which blows 
Like that upon thy cheek? 

IV. In verdant realms, 'neath sunny skies,
With witching minstrelsy,
We'll favor find in all young eyes,
And all shall welcome thee.

4. A more advanced RegEx example: extracting a list of words from a tagged file.

RegEx is extremely sophisticated, and there are many more advanced commands I won't get into here (also: I am still very much a learner). 

(It's basically a form of coding without actually writing programs... Interestingly, some full-fledged programming languages do allow RegEx code to be embedded within, so it might be worth your while to learn more of it...) 

Here is a more advanced example that was shown to me by a software developer who works in the library at my institution (Rob Weidman). We have a file where we used Stanford Named Entity Recognition (NER) to tag every proper Name and every Location in a book (why we did that and how that works is a question for another day). The output it produces looks like this: 

Major <PERSON>Carteret</PERSON>, though dressed in brown linen, had thrown off his coat for greater comfort. The stifling heat, in spite of the palm-leaf fan which he plied mechanically, was scarcely less oppressive than his own thoughts. Long ago, while yet a mere boy in years, he had come back from <LOCATION>Appomattox</LOCATION> to find his family, one of the oldest and proudest in the state, hopelessly impoverished by the war,--even their ancestral home swallowed up in the common ruin. His elder brother had sacrificed his life on the bloody altar of the lost cause, and his father, broken and chagrined, died not many years later, leaving the major the last of his line. He had tried in various pursuits to gain a foothold in the new life, but with indifferent success until he won the hand of <PERSON>Olivia Merkell</PERSON>, whom he had seen grow from a small girl to glorious womanhood.

Let's say we want a file with just the names of people referenced in this book -- the items the NER software has tagged as <PERSON>. 

First, you want to make sure there aren't a lot of invisible line breaks in the text. It's ok to do a global Find/Replace that swaps every \n for a blank space. 


1) Replace <PERSON with \n<PERSON
This puts a line break before each Person tag.

2) Delete the first line of text (everything up to the first instance of a person tag).

3) Replace <PERSON>(.*)<\/PERSON>.* with $1

This gets rid of the tags and everything outside of the tags, and replaces it with just the text within the tags.

This produces a list of just the names of people tagged in the text. For the paragraph above, it would produce:

Carteret
Olivia Merkell

What the various commands above are doing is complicated. The .* in the parentheses means capture everything. The $1 calls back the string we just captured between the tags. The .* after the second person tag is grabbing all of the text *outside* of the tags -- which we're going to delete. 
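The same steps can be sketched in Python, for anyone who wants to run them on a whole file at once. This is only an illustration under a few assumptions: the function name is hypothetical, the sample string is a condensed version of the passage above, and Python writes the callback as \1 where the editor command uses $1.

```python
import re

def extract_person_names(tagged_text):
    """Return the text inside each <PERSON> tag, following the steps above."""
    # Preliminary step: swap line breaks for blank spaces.
    flat = tagged_text.replace("\n", " ")
    # Step 1: put a line break before each PERSON tag.
    broken = flat.replace("<PERSON", "\n<PERSON")
    # Step 2: drop the first line (everything before the first PERSON tag).
    lines = broken.split("\n")[1:]
    # Step 3: keep only the captured text between the tags.
    # (.*) captures the name; the trailing .* swallows the rest of the line.
    return [re.sub(r"<PERSON>(.*)</PERSON>.*", r"\1", line) for line in lines]

sample = (
    "Major <PERSON>Carteret</PERSON>, though dressed in brown linen, had come back\n"
    "from <LOCATION>Appomattox</LOCATION> and won the hand of "
    "<PERSON>Olivia Merkell</PERSON>, whom he had seen grow."
)
names = extract_person_names(sample)
print(names)
```

The greedy (.*) is safe here only because step 1 guarantees one PERSON tag per line; without that step it would grab everything between the first opening tag and the last closing tag.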

For reference, ".*" is a really important RegEx command.

Also helpful is the ".+" command... And the "NOT" command (^)...
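To make those two a little more concrete, here is a small Python sketch (my own illustration, not part of the workflow I was shown). It uses a shortened version of the tagged sentence above. The pattern [^<]+ reads as "one or more characters that are NOT a <", which pulls out the names in a single pass; the last two lines show that .* is happy to match nothing at all, while .+ insists on at least one character.

```python
import re

tagged = (
    "Major <PERSON>Carteret</PERSON> won the hand of "
    "<PERSON>Olivia Merkell</PERSON>."
)

# Inside square brackets, ^ means NOT: [^<]+ stops at the next "<",
# so each match ends at its own closing tag.
names = re.findall(r"<PERSON>([^<]+)</PERSON>", tagged)

# .* matches zero or more characters; .+ needs at least one.
star_matches_empty = re.fullmatch(r".*", "") is not None
plus_matches_empty = re.fullmatch(r".+", "") is not None

print(names, star_matches_empty, plus_matches_empty)
```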

It goes on... I'll just recommend people look at the "RegEx Cheat Sheet" here

Thursday, July 30, 2020

Announcing An Open-Access African-American Literature Corpus, 1853-1923

Announcing: an Open-Access African American Literature Corpus, 1853-1923
Amardeep Singh, Lehigh University. On Twitter @electrostani
July 2020

I’ve put together a small corpus of texts by Black literary authors in plain text format. The corpus is downloadable and researchers are free to modify it according to preference.

The corpus at present consists of about 100 texts by African American writers, of which about 75 are works of fiction (about 4.1 million words) and 25 are books of poetry (about 400,000 words). It starts in 1853, the year of publication of William Wells Brown’s Clotel and Frederick Douglass’ short fiction “The Heroic Slave,” and ends in 1923, with Jean Toomer’s Cane. Some of the files are admittedly still a little rough around the edges; cleaning and formatting will be an ongoing and long-term process. Still, I think the files are in good enough shape to start preliminarily exploring them using tools like AntConc or VoyantTools.

Right now I’m making the collection available as a Google Drive link as well as on Github.

→ Download link. You can find the corpus here (Google Drive) or here (Github).


In the Metadata file I’ve created to accompany the collection, I indicate the origin of each text. Many come from Project Gutenberg, HathiTrust, the American Verse Project at the University of Michigan, the Library of Congress, and the History of Black Writing Novel Corpus. A few texts were present on multiple repositories; I generally used the text of the source that seemed cleanest and most convenient. 

I believe everything I’ve included in the corpus is in the public domain. 

Why Do This / My Background:

I started thinking about the relative paucity of collections focused on people of color online a few years ago (see my blog post on “Archive Gap” from 2015). I then initiated a couple of digital projects aimed to intervene in what I saw as the absence of Black writers in particular, “Claude McKay’s Early Poetry,” and “Women of the Early Harlem Renaissance.” That latter project in particular opened my eyes to the wealth of materials that have essentially fallen off the radar of literary history. A limited quantity of this overlooked material is sampled in anthologies like Maureen Honey’s Shadowed Dreams: Women’s Poetry of the Harlem Renaissance or Double-Take: A Revisionist Take on the Harlem Renaissance. But there remains a fairly substantial ‘great unread’ in the African American literary tradition that could be brought to light, at least partly just by gathering materials that might have already been digitized in one form or another. 

Other corpora centered around Black writers do appear to exist, but they’re often restricted access. (For instance, The History of the Black Novel corpus has 53 works available to the public, but the larger corpus with about 450 works is restricted access for copyright reasons.) 

If corpora either don’t exist or aren’t readily available to scholars who don’t have access to password-protected university servers, that slows down research. At this point, Digital Humanities scholars have done impressive work analyzing large corpora of literature, but very few have applied computational methods to specifically African American texts. My hope is that this corpus might nudge more people to try. 

What’s included in the Corpus: 

In its current form, the corpus contains a mix of poetry and prose (for convenience, I’ve indicated whether a text is poetry or fiction in the title of each file). I’ve excluded slave narratives and other texts that are clearly not literary. (A large number of North American Slave Narratives are, in any case, collected here.) 

I included poetry alongside fiction in part because many of the topics historically-minded scholars might be interested in from these materials can be found in both formats. Many Black poets from this period wrote occasional poetry connected to historical events, including the Civil War and Emancipation, the Spanish-American War, World War I, the "Red Summer" of 1919, and so on. Admittedly, this mixing of formats might cause problems when studying these texts using certain software platforms (i.e., poetry and prose will be tokenized differently; they also need to be classified differently when doing word frequency types of queries, and sentence-length queries won't be useful). 
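One way to handle the word-frequency issue is to keep the two formats in separate buckets and compare rates rather than raw counts. The sketch below is only an illustration: the function names are hypothetical, the tokenizer is deliberately crude, and the two toy strings are placeholders I made up, not texts from the corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_by_genre(texts, word):
    """Return occurrences of `word` per 1,000 words, keyed by genre.

    `texts` is a list of (genre, text) pairs.
    """
    hits = Counter()
    totals = Counter()
    for genre, text in texts:
        tokens = tokenize(text)
        totals[genre] += len(tokens)
        hits[genre] += tokens.count(word)
    return {g: 1000 * hits[g] / totals[g] for g in totals if totals[g]}

# Toy placeholder strings, not drawn from the corpus.
toy_texts = [
    ("poetry", "the night the star"),
    ("fiction", "he walked into the town at night"),
]
rates = frequency_by_genre(toy_texts, "night")
print(rates)
```

Reporting a rate per 1,000 words matters here because the fiction portion of the collection is roughly ten times the size of the poetry portion, so raw counts would almost always favor fiction.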

For convenience, I've also created folders with "Just Poetry" and "Just Fiction" from the collection in the Google Drive folder link above. 

Gender issues: It might also be worth noting that during this time period there were many African-American women publishing poetry -- but not as many who published fiction. (The reasons for this are beyond the scope of a brief announcement.) Still, including poetry can also be seen as an intentional choice -- designed to include writing by women in the field of view. It's also an invitation to other scholars using these materials to encourage them to work with writing by women. 

Users of this corpus who disagree with my choices are welcome to modify the selection when they design their own queries. I would also welcome any and all feedback. 

Honoring Black Writers / Expanding the Canon:
I’ve been inspired by the statement the Colored Conventions Project asks users to agree to when they download the CCP corpus, especially the first three principles:

  • I honor CCP’s commitment to a use of data that humanizes and acknowledges the Black people whose collective organizational histories are assembled here. Although the subjects of datasets are often reduced to abstract data points, I will contextualize and narrate the conditions of the people who appear as “data” and to name them when possible.
  • I will include the above language in my first citation of any data I pull/use from the CCP Corpus.
  • I will be sensitive to a standard use of language that again reduces 19th-century Black people to being objects. Words like “item” and “object,” standard in digital humanities and data collection, fall into this category. (Link)
While I don’t ask users of this collection to sign an analogous statement, I encourage all users of these materials to adhere to the spirit of the request made by CCP of the users of their corpus. My goal in doing this type of work is to recognize and validate the work of African American writers as important contributors to world literature. One of the ways we can do that is to consider the work at scale, using computational tools like text analysis and stylistics.