Work in Progress: A Modernism Text Corpus [Early 20th Century Literature Corpus]

As readers may be aware, I've been periodically creating small, open-access textual corpora, collecting African American literature and literature from Colonial South Asia. 

The Kitchen Sink, Carefully Collected and Labeled: Recently, I thought it might be a worthwhile project to create a larger textual corpus, collecting out-of-copyright materials from a broad range of authors from the early 20th century. The idea is to collect materials from recognizable modernists like Virginia Woolf and James Joyce, alongside African American writers, Indian writers like Rabindranath Tagore, as well as a sampling of genre fiction (including detective fiction, adventure fiction, science fiction, etc.). So: everything from Jack London to Edith Wharton to Langston Hughes. 

The goal is to produce a collection that could be useful to people doing quantitative analyses of these materials, but also to scholars doing conventional historical scholarship on the literature of the period. I've tried to segment the collection so that people interested only in writing by modernist women, for instance, can sort it that way (see the metadata below). Similarly, people interested only in African American poetry can do the same (using the Af-Am poetry folder). 

Having these aspects of social and cultural identity represented in the metadata was important to me; it's one reason why I've found existing textual repositories online insufficient. 

How to access the corpus? This is a work in progress. It can be found here for the moment.

As I've been going, I've been drawing largely on digital files at Project Gutenberg, Archive.org, and HathiTrust. (Note: the Gutenberg files will need to be "cleaned" to make them useful for quantitative queries; as of the present writing, I have not yet done that with the files, but it should be happening soon.)
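For readers wondering what "cleaning" might involve, here is a minimal sketch in Python, not the procedure I will necessarily use. It assumes the plain-text file carries the usual "*** START OF THE PROJECT GUTENBERG EBOOK ..." and "*** END OF THE PROJECT GUTENBERG EBOOK ..." marker lines (older files say "THIS" instead of "THE") and keeps only what falls between them. The function name strip_gutenberg_boilerplate is hypothetical.

```python
import re

# Marker lines as they commonly appear in Project Gutenberg plain-text files.
# Assumption: each marker sits on a line of its own; some files may differ.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the start or end of the file,
    so an unmarked file is returned whole (minus surrounding whitespace).
    """
    start = START_RE.search(raw)
    body_start = start.end() if start else 0

    end = END_RE.search(raw, body_start)
    body_end = end.start() if end else len(raw)

    return raw[body_start:body_end].strip()


if __name__ == "__main__":
    sample = (
        "License header text\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE TITLE ***\n"
        "\n"
        "First line of the actual text.\n"
        "\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE TITLE ***\n"
        "License footer text\n"
    )
    print(strip_gutenberg_boilerplate(sample))
```

This only removes the license header and footer. Transcriber's notes, tables of contents, and similar front matter inside the markers would still need separate handling.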

As important as (or more important than) the collection itself is the metadata file, with information about the texts. I'll say more about the metadata file below. 


1. Folders: 

Literary Fiction / High Modernism. Essentially what you would expect -- texts from 30-40 prominent modernist writers from the UK, Ireland, and the U.S., with a few less well-known figures like Hope Mirrlees. 

Genre Fiction, including Science Fiction, Detective Fiction, Adventure, Romance, Horror. This period was of course the Golden Age of Detective Fiction, with Arthur Conan Doyle writing at the fin de siècle and writers like Agatha Christie and Dorothy Sayers emerging in the 1920s. Doyle and Wells both straddled the late 19th and early 20th centuries; ultimately, I will probably aim to put their pre-1900 works in an appropriate folder for people doing author-based work. 

All Fiction. What it sounds like. A mix of "highbrow," "middlebrow" and popular fiction. 

All Poetry. Canonical figures like 

Drama. At the moment, I haven't been actively seeking out dramatists to include in this folder; it mostly consists of plays written by authors who were not primarily playwrights (such as Yeats).

African American Fiction. For more on this collection, see this earlier description of my African American materials.

African American Poetry. See the link above.

Colonial South Asian Texts. For more on this collection, see here.

Nonfiction and Essays (including Travel narratives, Memoirs, and Literary Criticism).


2. Metadata.

We've been collecting the following information about the texts as we go. The metadata file (a work in progress) can be viewed here.

Author's name

Title

Year of First Publication

Year of Author's Birth

Publisher (first publisher)

Genre or Mode: Fiction, Nonfiction, Poetry, Short Fiction, Drama

Author's inferred gender: M, F, NB. As of now, I am understanding writers like Bryher and Radclyffe Hall to be nonbinary (NB). Others of course have complex relationships to gender expression (one thinks of Gertrude Stein). This category may be revised or rethought over time. 

Author's nationality

Location in Corpus

Location of Publisher

Tags and Themes: WWI, Travel, LGBTQIA, Disability, Environmental, African American, South Asian, Indigenous

Provenance of Text: Gutenberg, HathiTrust, Archive.org, etc.
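To give a sense of how these fields could be used to slice the collection, here is a minimal Python sketch. It assumes the metadata is exported as a CSV file; the column names (author, title, year, genre, gender, tags), the semicolon-separated tags, and the filter_rows helper are all hypothetical, and the sample rows are illustrative rather than copied from the actual metadata file.

```python
import csv
import io

# Illustrative rows only; the real metadata file may use different
# column names, values, and tag conventions.
SAMPLE_CSV = """author,title,year,genre,gender,tags
Virginia Woolf,Jacob's Room,1922,Fiction,F,WWI
Langston Hughes,The Weary Blues,1926,Poetry,M,African American
Jack London,The Call of the Wild,1903,Fiction,M,Environmental
Edith Wharton,The House of Mirth,1905,Fiction,F,
"""


def load_rows(handle):
    """Read metadata rows from an open CSV file (or file-like object)."""
    return list(csv.DictReader(handle))


def filter_rows(rows, gender=None, genre=None, tag=None):
    """Keep rows matching every criterion that is not None."""
    selected = []
    for row in rows:
        # Assumption: multiple tags are separated by semicolons.
        tags = [t.strip() for t in row["tags"].split(";") if t.strip()]
        if gender is not None and row["gender"] != gender:
            continue
        if genre is not None and row["genre"] != genre:
            continue
        if tag is not None and tag not in tags:
            continue
        selected.append(row)
    return selected


if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE_CSV))
    for row in filter_rows(rows, gender="F", genre="Fiction"):
        print(row["author"], "-", row["title"])
```

With a real export, the io.StringIO wrapper would be replaced by an ordinary open() call on the CSV file. The same filter could be combined, for example tag="African American" with genre="Poetry".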

Again, the metadata file is very much a work in progress. Completing it may take weeks or even months, but I hope that when it's complete it will be useful to researchers.