Announcing An Open-Access African-American Literature Corpus, 1853-1929

Amardeep Singh, Lehigh University. On Twitter @electrostani
July 2020 (updated January 2025)

I’ve put together a small corpus of texts by Black authors of fiction and poetry in plain text format. The corpus is downloadable and researchers are free to modify it according to preference.

The corpus at present consists of about 175 texts by African American writers, of which about 90 are works of fiction (about 5 million words) and 85 are books of poetry (about 700,000 words). It currently starts in 1853, the year of publication of William Wells Brown’s Clotel and Frederick Douglass’ short fiction “The Heroic Slave,” and ends in 1929, the year of Nella Larsen's Passing. Some of the files are admittedly still a little rough around the edges; cleaning and formatting will be an ongoing and long-term process. Still, I think the files are in good enough shape to start exploring them using tools like AntConc or Voyant Tools.

Right now I’m making the collection available as a Google Drive link as well as on GitHub.


→ Download link. You can find the corpus here (Google Drive) or here (Github). (The Google Drive is more recently updated.)


Sources: 

In the Metadata file I’ve created to accompany the collection, I indicate the origin of each text. Many come from Project Gutenberg, HathiTrust, the American Verse Project at the University of Michigan, the Library of Congress, and the History of Black Writing Novel Corpus. A few texts were present on multiple repositories; I generally used the text of the source that seemed cleanest and most convenient. 


Why Do This / My Background:

I started thinking about the relative paucity of collections focused on people of color online a few years ago (see my blog post on “Archive Gap” from 2015). I then initiated a couple of digital projects that aimed to intervene in what I saw as the absence of Black writers in particular: “Claude McKay’s Early Poetry” and “Women of the Early Harlem Renaissance.” That latter project in particular opened my eyes to the wealth of materials that have essentially fallen off the radar of literary history. A limited quantity of this overlooked material is sampled in anthologies like Maureen Honey’s Shadowed Dreams: Women’s Poetry of the Harlem Renaissance or Double-Take: A Revisionist Harlem Renaissance Anthology. But there remains a fairly substantial ‘great unread’ in the African American literary tradition that could be brought to light, at least partly just by gathering materials that might have already been digitized in one form or another.

Other corpora centered on Black writers do appear to exist, but access to them is often restricted. (For instance, The History of the Black Novel corpus has 53 works available to the public, but the larger corpus of about 450 works is restricted for copyright reasons.)

If corpora either don’t exist or aren’t readily available to scholars who don’t have access to password-protected university servers, that slows down research. At this point, Digital Humanities scholars have done impressive work analyzing large corpora of literature, but very few have applied computational methods to specifically African American texts. My hope is that this corpus might nudge more people to try. 


What’s included in the Corpus: 

In its current form, the corpus contains a mix of poetry and prose (for convenience, I’ve indicated whether a text is poetry or fiction in the title of each file). I’ve excluded slave narratives and other texts that are clearly not literary. (A large number of North American Slave Narratives are, in any case, collected here.) 

I included poetry alongside fiction in part because many of the topics historically minded scholars might look for in these materials appear in both formats. Many Black poets from this period wrote occasional poetry connected to historical events, including the Civil War and Emancipation, the Spanish-American War, World War I, the "Red Summer" of 1919, and so on. Admittedly, this mixing of formats might cause problems when studying these texts using certain software platforms (e.g., poetry and prose will be tokenized differently; they also need to be treated separately in word-frequency queries, and sentence-length queries won't be useful for poetry).

For convenience, I've also created folders with "Just Poetry" and "Just Fiction" from the collection in the Google Drive folder link above. 
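
For anyone who would rather script the poetry/fiction split than rely on the folders, here is a minimal Python sketch of one way to keep the two genres apart in a word-frequency query. It is an illustration under stated assumptions, not a description of how the corpus itself was built: it assumes the files are UTF-8 `.txt` files in a single folder, that each file title contains the word "poetry" or "fiction" somewhere (the exact naming pattern is an assumption), and the function names are hypothetical. The tokenizer is deliberately simple and only keeps unaccented a-z words, so it is no substitute for what AntConc or Voyant Tools do.

```python
import re
from collections import Counter
from pathlib import Path

# Simple word pattern: runs of a-z letters, allowing internal apostrophes.
# Accented letters and digits are ignored (a simplifying assumption).
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def genre_of(filename: str) -> str:
    """Guess the genre from the file title (assumed naming convention)."""
    lowered = filename.lower()
    if "poetry" in lowered:
        return "poetry"
    if "fiction" in lowered:
        return "fiction"
    return "unknown"


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its word tokens."""
    return WORD_RE.findall(text.lower())


def frequencies_by_genre(corpus_dir: str) -> dict[str, Counter]:
    """Build one word-frequency Counter per genre from the .txt files in corpus_dir."""
    counts: dict[str, Counter] = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.setdefault(genre_of(path.name), Counter()).update(tokenize(text))
    return counts


def relative_frequency(counter: Counter, word: str) -> float:
    """Occurrences of word per 10,000 tokens, so genres of unequal size can be compared."""
    total = sum(counter.values())
    return 10_000 * counter[word] / total if total else 0.0
```

Calling `frequencies_by_genre` on the downloaded folder returns separate counts for poetry and fiction, and `relative_frequency` then lets you compare a word across the two. Since the fiction portion is roughly seven times larger than the poetry portion by word count, raw counts alone would be misleading, which is why the sketch normalizes per 10,000 tokens.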

Gender issues: It might also be worth noting that during this period there were many African American women publishing poetry -- but not as many who published fiction. (The reasons for this are beyond the scope of a brief announcement.) Including poetry is therefore also an intentional choice -- designed to keep writing by women in the field of view. It's also an invitation to other scholars using these materials to work with writing by women.

Users of this corpus who disagree with my choices are welcome to modify the selection when they design their own queries. I would also welcome any and all feedback. 

Honoring Black Writers / Expanding the Canon:
I’ve been inspired by the statement the Colored Conventions Project asks users to agree to when they download the CCP corpus, especially the first three principles:

  • I honor CCP’s commitment to a use of data that humanizes and acknowledges the Black people whose collective organizational histories are assembled here. Although the subjects of datasets are often reduced to abstract data points, I will contextualize and narrate the conditions of the people who appear as “data” and to name them when possible.
  • I will include the above language in my first citation of any data I pull/use from the CCP Corpus.
  • I will be sensitive to a standard use of language that again reduces 19th-century Black people to being objects. Words like “item” and “object,” standard in digital humanities and data collection, fall into this category. (Link)
While I don’t ask users of this collection to sign an analogous statement, I encourage all users of these materials to adhere to the spirit of the request made by CCP of the users of their corpus. My goal in doing this type of work is to recognize and validate the work of African American writers as important contributors to world literature. One of the ways we can do that is to consider the work at scale, using computational tools like text analysis and stylistics.

"Some Have Happiness Thrust Upon Them": Playing With "Twelfth Night" in "A Suitable Boy" (2/3)

(Part 2 in a Series. See part 1 here. Mira Nair's adaptation of A Suitable Boy debuts on BBC One in the UK on 7/26; the U.S. broadcast dates are yet to be announced.)

Vikram Seth's A Suitable Boy, set just after Indian independence, is deeply concerned with what we might call "de-Anglicization" -- the process by which upper-class and -caste Indians began to shed the Anglophilia that had been thoroughly imposed upon them over two centuries of British rule in India.

Elite English culture was presented to Indians in modes of dress and eating; it was seen as a work ethic and a demeanor to aspire to ("stiff upper lip"); it was visible in architecture and social structures (the "Club"). But nowhere was the pursuit of Englishness more palpable than in the school system the British established and that Indians continued to propagate for several generations. Most major English-medium Indian schools and universities remain modeled on the British system; it's only recently that the American approach to "college" has begun to make inroads.

At the beating heart of that system of educative discipline is of course the Canon of English Literature. So it's not at all an accident that in A Suitable Boy one of the main characters is a young lecturer in English at the provincial (fictional) Brahmpur University. And his young sister-in-law, Lata -- the primary protagonist in the novel -- is herself an English major at the same university. 

It's not that the British are still hanging around at Brahmpur University in Seth's novel; even by the early 1950s, they've all departed. All of the faculty we meet are either fully Indian or mixed-race Anglo-Indian. There's no wizened British Department Chair to force the Indian faculty to toe the line and live and die by Shakespeare, Donne, Milton, and (Percy) Shelley. The Indian faculty enforce the Canon all the same. But the young people at least inhabit Shakespeare slightly differently than the British might have. And the audience receives the play differently than we might expect.

Revisiting "A Suitable Boy" in 2020 (1/3)

I'm excited about Mira Nair's six-part adaptation of Vikram Seth's A Suitable Boy, which will be premiering on BBC One and Netflix India on July 26th. (No word yet on when and how we'll be able to see it in the U.S.) As most people reading this probably know, I have a special interest in this project since I published a book-length study of Mira Nair's films. This is Nair's first feature film since Queen of Katwe (2016), and her first film set in South Asia since The Reluctant Fundamentalist (2012). Nair has a special eye and a gift for telling stories about India, and it's been too long since she's made a film there. Seth's novel, I think, seems like a great fit for picking up where Monsoon Wedding left off...

*

I actually re-read Seth's book in its entirety earlier this summer, partly out of anticipation for the coming adaptation (I should also say that I'm thinking of writing an article or a book chapter on the novel...). As I did so, I felt a newfound appreciation for the book that I didn't have the first time I approached it. In the 1990s, as a young reader, I was interested in the shiny and topical style of writers like Rushdie. I wanted 'quick hits' -- ideas that can be encapsulated nicely in a seminar paper or conference talk. Later, as a young teacher, I tended to look for short books that work well with undergraduates; hence, I put A Suitable Boy away on a high shelf and left it alone. Today, I'm drawn much more to good storytelling and research, and Seth's novel has both.

For those who don't have the many, many hours required to read the whole thing, one possible angle you could try is the Dramatized Audiobook version, which condenses the story and uses a pretty well-known ensemble voice cast. It does downplay the politics and play up the "Anglophile" parts of the plot a bit, but it's a high-quality dramatization and quite entertaining. I listened to it a couple of years ago, and it whetted my appetite to get back to the text itself.

Nair's television adaptation has a trailer that you can see here:



Thoughts about the trailer? To my eye, the trailer emphasizes two of the romantic plots (Lata-Kabir and Maan-Saeeda Bai), while deemphasizing some of the less glamorous characters and side-plots (Kabir Durrani is clearly there -- but where's Haresh Khanna?).


That said, I have heard (directly from the director!) that the adaptation is going to attend to the social and political upheaval described at length in the novel -- the tensions between urban and rural Indias, the caste politics, and communalism. I'm pleased about that; the novel is much more than a period piece and romantic drama. (If you look carefully at the trailer, you'll see some hints of the politics...)

In a series of three blog posts (one per week), I'll revisit this fine novel, and introduce it (without spoilers!) for people who've never read it.