Thursday, July 30, 2020

Announcing An Open-Access African-American Literature Corpus, 1853-1923

Amardeep Singh, Lehigh University. On Twitter @electrostani
July 2020

I’ve put together a small corpus of texts by Black literary authors in plain text format. The corpus is downloadable and researchers are free to modify it according to preference.

At present, the corpus consists of about 100 texts by African American writers, of which about 75 are works of fiction (about 4.1 million words) and 25 are books of poetry (about 400,000 words). It starts in 1853, the year of publication of William Wells Brown’s Clotel and Frederick Douglass’ short fiction “The Heroic Slave,” and ends in 1923, with Jean Toomer’s Cane. Some of the files are admittedly still a little rough around the edges; cleaning and formatting will be an ongoing and long-term process. Still, I think the files are in good enough shape to begin exploring them with tools like AntConc or Voyant Tools.

Right now I’m making the collection available as a Google Drive link as well as on GitHub.


→ Download link. You can find the corpus here (Google Drive) or here (GitHub).


Sources: 

In the Metadata file I’ve created to accompany the collection, I indicate the origin of each text. Many come from Project Gutenberg, HathiTrust, the American Verse Project at the University of Michigan, the Library of Congress, and the History of Black Writing Novel Corpus. A few texts were available in multiple repositories; in those cases I generally used the version that seemed cleanest and most convenient.

I believe everything I’ve included in the corpus is in the public domain. 


Why Do This / My Background:

I started thinking about the relative paucity of online collections focused on people of color a few years ago (see my blog post on “Archive Gap” from 2015). I then initiated a couple of digital projects meant to intervene in what I saw as the absence of Black writers in particular: “Claude McKay’s Early Poetry” and “Women of the Early Harlem Renaissance.” That latter project in particular opened my eyes to the wealth of materials that have essentially fallen off the radar of literary history. A limited quantity of this overlooked material is sampled in anthologies like Maureen Honey’s Shadowed Dreams: Women’s Poetry of the Harlem Renaissance or Double-Take: A Revisionist Harlem Renaissance Anthology. But there remains a fairly substantial ‘great unread’ in the African American literary tradition that could be brought to light, at least partly just by gathering materials that might have already been digitized in one form or another.

Other corpora centered on Black writers do appear to exist, but access to them is often restricted. (For instance, the History of Black Writing Novel Corpus has 53 works available to the public, but the larger corpus of about 450 works is restricted for copyright reasons.)

If corpora either don’t exist or aren’t readily available to scholars who don’t have access to password-protected university servers, that slows down research. At this point, Digital Humanities scholars have done impressive work analyzing large corpora of literature, but very few have applied computational methods to specifically African American texts. My hope is that this corpus might nudge more people to try. 


What’s included in the Corpus: 

In its current form, the corpus contains a mix of poetry and prose (for convenience, I’ve indicated whether a text is poetry or fiction in the title of each file). I’ve excluded slave narratives and other texts that are clearly not literary. (A large number of North American Slave Narratives are, in any case, collected here.) 

I included poetry alongside fiction in part because many of the topics that might interest historically minded scholars appear in both formats. Many Black poets from this period wrote occasional poetry connected to historical events, including the Civil War and Emancipation, the Spanish-American War, World War I, the "Red Summer" of 1919, and so on. Admittedly, this mixing of formats might cause problems when studying these texts using certain software platforms (for example, poetry and prose will be tokenized differently; they also need to be classified separately for word-frequency queries, and sentence-length queries won't be useful for poetry).

For convenience, I've also created "Just Poetry" and "Just Fiction" folders within the Google Drive folder linked above.
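
For anyone working from a single flat folder of files instead, here is a minimal Python sketch of one way to keep the two genres apart when counting words. It assumes that each file name contains the word "Poetry" or "Fiction" (I note the genre in each file's title, but the exact naming pattern used here is a guess), and the sample file names and texts are made-up placeholders, not actual corpus files. The tokenizer is deliberately crude and is not a substitute for what AntConc or Voyant Tools do.

```python
import re
from collections import Counter


def genre_of(filename):
    """Guess the genre from a file name.

    Assumption: the name contains the word 'poetry' or 'fiction'.
    """
    name = filename.lower()
    if "poetry" in name:
        return "poetry"
    if "fiction" in name:
        return "fiction"
    return "unknown"


def word_counts(text):
    """Lowercase the text and count runs of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def counts_by_genre(texts):
    """Take a dict of file name -> full text; return one Counter per genre."""
    totals = {}
    for filename, text in texts.items():
        totals.setdefault(genre_of(filename), Counter()).update(word_counts(text))
    return totals


# Hypothetical file names and toy texts, standing in for real corpus files.
sample = {
    "example_novel_Fiction.txt": "The slave trade. The river.",
    "example_poems_Poetry.txt": "O river, river of song",
}
result = counts_by_genre(sample)
print(result["fiction"].most_common(3))
print(result["poetry"].most_common(3))
```

To run this against downloaded files, the `sample` dictionary could be built from disk, for instance by reading every `.txt` file in a local folder with `pathlib.Path.glob` and `Path.read_text`. Files whose names match neither label end up under "unknown", which makes them easy to spot and sort by hand.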

Gender issues: It might also be worth noting that during this time period there were many African-American women publishing poetry -- but not as many who published fiction. (The reasons for this are beyond the scope of a brief announcement.) Including poetry can therefore be seen as an intentional choice -- designed to keep writing by women in the field of view. It's also an invitation to other scholars using these materials to work with writing by women.

Users of this corpus who disagree with my choices are welcome to modify the selection when they design their own queries. I would also welcome any and all feedback. 

Honoring Black Writers / Expanding the Canon:
I’ve been inspired by the statement the Colored Conventions Project asks users to agree to when they download the CCP corpus, especially the first three principles:

  • I honor CCP’s commitment to a use of data that humanizes and acknowledges the Black people whose collective organizational histories are assembled here. Although the subjects of datasets are often reduced to abstract data points, I will contextualize and narrate the conditions of the people who appear as “data” and to name them when possible.
  • I will include the above language in my first citation of any data I pull/use from the CCP Corpus.
  • I will be sensitive to a standard use of language that again reduces 19th-century Black people to being objects. Words like “item” and “object,” standard in digital humanities and data collection, fall into this category. (Link)
While I don’t ask users of this collection to sign an analogous statement, I encourage all users of these materials to adhere to the spirit of the request made by CCP of the users of their corpus. My goal in doing this type of work is to recognize and validate the work of African American writers as important contributors to world literature. One of the ways we can do that is to consider the work at scale, using computational tools like text analysis and stylistics.
