As readers may be aware, I've been periodically creating small, open-access textual corpora, collecting African American literature and literature from Colonial South Asia.
After a recent experience at the Modernist Studies Association conference, I thought it might be a worthwhile project to create a larger textual corpus, collecting out-of-copyright materials from a broad range of authors from the early 20th century.
What is a Textual Corpus?
A textual corpus is a collection of texts, typically in plain text format, arranged to be analyzed in various ways, including using quantitative methods. The first major creators of textual corpora were computational linguists, who have studied large-scale linguistic phenomena in corpora constructed within a given language. More recently, digital humanities scholars have been working with corpora of specifically literary texts, often with methodologies that borrow from or gesture towards linguistics. For instance, can we infer author gender in a large corpus of novels to ascertain patterns in the demographics of fiction over time? Can we use certain linguistic patterns to ascertain the genres of novels within a larger corpus?
While anthologies and archives (including digital archives) have traditionally been designed to represent the most important and meaningful texts in particular geographical, cultural, and historical contexts, textual corpora often eschew questions of literary value in the interest of maximal inclusivity. Many quantitative methods rely on large-scale corpora to achieve statistical viability, and to answer questions about patterns in language usage, the fact that a particular book of poetry was critically well-received and another was not might be less important than the fact that both were published at a certain time and place. In our collection, we have aspired to maximal inclusivity, incorporating materials that the editorial tradition might have overlooked, such as 'minor' texts by 'major' writers, as well as writing that has entirely fallen off the critical radar.
The idea is to collect materials from recognizable modernists like Virginia Woolf and James Joyce, alongside African American writers, Indian writers like Rabindranath Tagore and Cornelia Sorabji, as well as a sampling of genre fiction (including detective fiction, historical fiction, adventure fiction, science fiction, romance, etc.).
So: everything from Jack London to Edith Wharton to Georgette Heyer to Langston Hughes.
The goal is to produce a collection that could be useful to people doing quantitative analyses of these materials, but also to scholars doing conventional historical scholarship on the literature of the period.
I've been creating thematic tags and genre classifications as I go, so that people interested in just writing by modernist women, for instance, could sort the collection that way (see the metadata below). Similarly, people interested in just African American poetry could sort the collection that way as well (using the Af-Am poetry folder). Other topics I've started tracking are materials related to World War I, materials related to colonialism and empire, LGBTQIA materials, disability, and the environment.
(Note: tagging is still at a very early stage. I would welcome help and contributions from any readers who have specialist knowledge about any of the topics mentioned above.)
Having these topics represented in the metadata was important to me; it's one reason why I've found existing textual repositories online insufficient. Project Gutenberg, for instance, has in recent years dramatically improved its approach to data about original publication, but many texts in its collection still have no information about publication date or publisher name. I wanted to make a collection where all of that information was added back in.
How to access the corpus?
This is a work in progress. It can be found here for the moment.
So far, I've been drawing largely on digital files at Project Gutenberg, Archive.org, and HathiTrust. (Note: the Gutenberg files will need to be "cleaned" to make them useful for quantitative queries; as of the present writing, I have not yet done that with the files, but it should be happening soon.)
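For readers who want to clean Gutenberg files themselves in the meantime, here is a minimal Python sketch of one possible approach. It is not the cleaning process used for this corpus (which has not been done yet). It assumes that each file carries the "*** START OF THE PROJECT GUTENBERG EBOOK ..." and "*** END OF THE PROJECT GUTENBERG EBOOK ..." marker lines commonly found in Gutenberg plain-text files; the function name is hypothetical, and older or irregular files may need different patterns.

```python
import re

# Assumption: the file uses the common Gutenberg marker lines, e.g.
# "*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***". Some older files
# say "THIS" instead of "THE", so both are accepted here.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines.

    If either marker is missing (or they appear out of order), the input is
    returned with only surrounding whitespace trimmed, so nothing is lost.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None or end.start() <= start.end():
        return raw.strip()
    return raw[start.end():end.start()].strip()
```

A script could read each downloaded file, pass its contents through this function, and save the result to a separate "cleaned" folder so the original files stay untouched.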
As important as (or more important than) the collection itself is the metadata file, with information about the texts. I'll say more about the metadata file below.
1. Folders:
On the Google Drive, I have been subdividing files into folders to make them more useful to conventional, historically-minded scholars.
Literary Fiction / High Modernism. Essentially what you would expect -- texts from 30-40 prominent modernist writers from the UK, Ireland, and the U.S., with a few less well-known figures like Hope Mirrlees.
Genre Fiction, including Science Fiction, Detective Fiction, Adventure, Romance, Horror. This period was of course the Golden Age of Detective Fiction, with Arthur Conan Doyle writing at the fin de siecle and writers like Agatha Christie and Dorothy Sayers emerging in the 1920s. Writers like Doyle and Wells straddled the late 19th and early 20th centuries; ultimately, I will probably aim to put their pre-1900 works in an appropriate folder for people doing author-based work. You'll also see out-of-copyright materials by people like A.E.W. Mason, H. Rider Haggard, Georgette Heyer, etc.
All Fiction. What it sounds like. A mix of "highbrow," "middlebrow" and popular fiction.
All Poetry. Canonical figures like Yeats, Pound and Eliot alongside "minor" figures. A very substantial representation of African American poetry.
Drama. As of the present moment, I haven't been actively seeking out dramatists to include in this folder; it mostly consists of plays written by authors who were not primarily playwrights (such as Yeats), though there is a pretty good collection of Somerset Maugham plays.
African American Fiction. For more on this collection, see this earlier description of my African American materials.
African American Poetry. See the link above.
Colonial South Asian Texts. For more on this collection see here.
Nonfiction and Essays (including Travel narratives, Memoirs, and Literary Criticism).
2. Metadata File.
We've been collecting the following information about the texts as we go. The metadata file (a work in progress) can be viewed here.
Author's name (Last, first)
Title of work
Year of First Publication
Year of Author's Birth. This is interesting and probably important. We see writers like Joseph Conrad, who is often considered a "Modernist" but who was born in 1857. Most writers associated with inventing high modernism were born between 1870 and 1890. Virginia Woolf and James Joyce were born in the same year!
Publisher (first publisher). Publisher information could be really interesting to explore. Modernist studies scholars have long been interested in small presses like the Woolfs' Hogarth Press. But here, we are gathering information about who published with which publisher, including big commercial houses. This could be useful to scholars interested in the business side of early 20th century literature. (It's interesting to see that many African American writers before the Harlem Renaissance used small and local publishers, as the major houses were typically closed to them.)
Genre or Mode: Fiction, Nonfiction, Poetry, Short Fiction, Drama
Author's inferred gender: M, F, NB. As of now, I am understanding writers like Bryher and Radclyffe Hall to be nonbinary (NB). Others of course have complex relationships to gender expression (one thinks of Gertrude Stein, who has historically been identified as a lesbian, but whom some scholars have posited as transmasculine or genderqueer). This category may be revised or rethought over time.
Author's nationality
Location in Corpus: Which folder in the Google Drive contains the file?
Location of Publisher: London, New York, somewhere else?
Tags and Themes: Some tags I have been tracking: WWI, Travel, LGBTQIA, Disability, Environmental, African American, South Asian, Indigenous, Interracial, Passing
Provenance of Text: Gutenberg, HathiTrust, Archive.org, etc.
Again, the metadata file is very much a work in progress. Completing it may take weeks or even months, but I hope that when it's complete it will be useful to researchers.
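To give a sense of how the metadata could support the kind of sorting described earlier (for example, pulling out just the writing by women, or just the texts with a given tag), here is a rough Python sketch. It assumes the metadata has been exported as a CSV file. The column names (author, title, year, genre, gender, tags), the semicolon separator for multiple tags, and the function names are all hypothetical, and the sample rows and their tags are illustrative only, not copied from the actual metadata file.

```python
import csv
import io

# Illustrative sample only; column names and tag values are assumptions.
SAMPLE_CSV = """author,title,year,genre,gender,tags
"Woolf, Virginia",Jacob's Room,1922,Fiction,F,WWI
"Wharton, Edith",The Age of Innocence,1920,Fiction,F,
"London, Jack",The Call of the Wild,1903,Fiction,M,Environmental
"McKay, Claude",Harlem Shadows,1922,Poetry,M,African American
"""


def load_metadata(csv_text):
    """Parse CSV text into a list of dicts, one per text in the corpus."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def has_tag(row, tag):
    """True if `tag` is among the row's semicolon-separated tags (case-insensitive)."""
    tags = {t.strip().lower() for t in row["tags"].split(";")}
    return tag.lower() in tags


def select(rows, gender=None, genre=None, tag=None):
    """Return the rows matching every filter that was supplied."""
    matches = []
    for row in rows:
        if gender is not None and row["gender"] != gender:
            continue
        if genre is not None and row["genre"] != genre:
            continue
        if tag is not None and not has_tag(row, tag):
            continue
        matches.append(row)
    return matches


rows = load_metadata(SAMPLE_CSV)
for row in select(rows, gender="F", genre="Fiction"):
    print(row["author"], "/", row["title"], "/", row["year"])
```

With a real export, the text of the file would replace SAMPLE_CSV, and the column names would need to be adjusted to match whatever headings the metadata file actually uses.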










