I thought it might be a worthwhile project to create a textual corpus collecting out-of-copyright materials from a broad range of authors from the early 20th century -- 1900-1930. The idea is to include publication information, genre classifiers (literary fiction, romance fiction, detective fiction, etc.), and some topical tags (World War I, gender/feminism, etc.). I am also including in the new corpus selected texts from the earlier African American Literature and Literature of Colonial South Asian corpora I created a few years ago.
Short version:
- The corpus containing out-of-copyright works 1900-1930 is here; the metadata file (also very important!) is here.
- As of May 2026, it contains about 1100 texts by about 215 authors -- British, American, Irish, Canadian, New Zealand, Australian, Indian, and a few others. Some dual nationality authors (Joseph Conrad was born in Poland but became a British citizen) are indicated with both of their nationalities.
- Of those 1100 texts, about 130 texts are by African American authors.
- About 330 of the texts in the corpus are by women, and about 30-40 texts are by authors who may today be understood as Trans or Nonbinary (marked as NB in the metadata), including Radclyffe Hall, Bryher, and Gertrude Stein (on Stein, see Chris Coffman's book). As I continue to expand the corpus, I will be prioritizing authors who were women (though I believe the current ratio of roughly 2:1, texts by men to texts by women, imbalanced as it is, is fairly close to the historical average for this period).
Precedents: There is a small modernist (mainly high modernist) corpus created by a group in the UK here. I corresponded a bit with the curators of that project, though in the end I created my own corpus from scratch. And the US Novel Corpus at the University of Chicago is here (1200 texts are open access). In truth, it is not very easy to use; I have looked at their metadata file for reference, but I have not used their texts in building this corpus.
Basics: What is a Textual Corpus?
A textual corpus is a collection of texts, typically in plain text format, arranged to be analyzed in various ways, sometimes using quantitative methods. (I also think text corpora can be used by scholars doing more traditional thematic and historicist research, especially if the materials are tagged. More about that below.)
The first major creators of textual corpora were computational linguists, who have studied large-scale linguistic phenomena in corpora constructed within a given language. More recently, digital humanities scholars have been working with corpora of specifically literary texts, often with methodologies that borrow from or gesture towards linguistics. For instance, can we infer author gender in a large corpus of novels to ascertain patterns in the demographics of fiction over time? Can we use certain linguistic patterns to ascertain the genres of novels within a larger corpus?
While anthologies and archives (including digital archives) have traditionally been designed to represent the most important and meaningful texts in particular geographical, cultural, and historical contexts, textual corpora often eschew questions of literary value in the interest of maximal inclusivity. Many quantitative methods rely on large-scale corpora to achieve statistical viability. When the aim is to answer questions about patterns in language usage, the fact that a particular book of poetry was critically well-received and another was not might be less important than the fact that both were published at a certain time and place. In our collection, we have aspired to maximal inclusivity, incorporating materials that the editorial tradition might have overlooked, such as 'minor' texts by 'major' writers, as well as writing that has entirely fallen off the critical radar.
What is in this Early 20th C. Text Corpus?
The idea is to collect materials from recognizable high modernists like Virginia Woolf and James Joyce, alongside African American writers, Indian writers like Rabindranath Tagore and Cornelia Sorabji, as well as a sampling of genre fiction (including detective fiction, historical fiction, adventure fiction, science fiction, romance, westerns, etc.).
So: everything from Jack London to Edith Wharton to Georgette Heyer to Langston Hughes.
The goal is to produce a collection that could be useful to people doing quantitative analyses of these materials, but also to scholars doing conventional historical scholarship on the literature of the period.
I've been creating thematic tags and genre classifications as I go, so that people interested in just writing by modernist women, for instance, could sort the collection that way (see the metadata file). Similarly, people interested in just African American poetry could sort the collection that way as well (using the Af-Am poetry folder). Other topics I've started tracking are materials related to World War I, materials related to colonialism and empire, LGBTQIA materials, disability, and the environment.
(Note: tagging is still at a very early stage. I would welcome help and contributions from any readers who have specialist knowledge about any of the topics mentioned above.)
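For readers who want to do this kind of sorting programmatically, here is a minimal sketch in Python. It is not a description of the actual metadata file: the column names (author, title, year, gender, file_under, tags), the semicolon used to separate multiple tags, and the sample rows themselves are all assumptions made up for illustration, and would need to be adjusted to match the real spreadsheet.

```python
import csv
import io

# Invented sample rows for illustration only; the real metadata file
# may use different column names, tag separators, and classifications.
SAMPLE_CSV = """author,title,year,gender,file_under,tags
"Woolf, Virginia",Jacob's Room,1922,F,Literary Fiction-High Modernism,WWI
"Hughes, Langston",The Weary Blues,1926,M,African American Poetry,African American
"Wharton, Edith",The Age of Innocence,1920,F,Literary Fiction-High Modernism,Gender / feminism / suffrage
"Hall, Radclyffe",The Well of Loneliness,1928,NB,Literary Fiction-High Modernism,LGBTQIA; WWI
"""


def load_rows(source):
    """Read metadata rows from any file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def has_tag(row, tag):
    """True if `tag` is one of the row's tags (case-insensitive).

    Assumes tags are separated by semicolons, which is a hypothetical choice.
    """
    tags = [t.strip().lower() for t in row["tags"].split(";")]
    return tag.lower() in tags


def select(rows, gender=None, file_under=None, tag=None):
    """Return rows matching every criterion that was supplied."""
    selected = []
    for row in rows:
        if gender is not None and row["gender"] != gender:
            continue
        if file_under is not None and row["file_under"] != file_under:
            continue
        if tag is not None and not has_tag(row, tag):
            continue
        selected.append(row)
    return selected


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Writing by modernist women, in the sense described above.
for row in select(rows, gender="F", file_under="Literary Fiction-High Modernism"):
    print(row["author"], "|", row["title"], "|", row["year"])
```

To run this against a downloaded copy of the metadata, one would replace the in-memory sample with an opened CSV file and rename the columns to whatever the file actually uses.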
Having these topics represented in the metadata was important to me; it's one reason why I've found existing textual repositories online insufficient. Project Gutenberg, for instance, has in recent years dramatically improved its approach to data about original publication, but many texts in their collection continue to have no information about publication date or the publisher name. I wanted to make a collection where all of that information was added back in.
How to access the corpus?
This is a work in progress; it has been steadily growing over the course of 2025-2026. The whole corpus can be found here. You can either work with the materials in that Google Drive folder, or download the whole folder to your computer.
As I've been going, I've been drawing largely on digital files at Project Gutenberg, Archive.org, and HathiTrust. I have been cleaning the Gutenberg files of header and footer boilerplate, though admittedly there may be some files in the folders that have not been cleaned.
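If you come across a file that still has the Gutenberg boilerplate, something like the following sketch could remove it. This is not necessarily how the files in the corpus were cleaned. It assumes the file uses the usual "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" and "*** END OF THE PROJECT GUTENBERG EBOOK ... ***" marker lines (older files say "THIS" instead of "THE"); the function name is my own invention, and files without these markers are returned unchanged apart from trimmed whitespace.

```python
import re

# Marker lines that Project Gutenberg places around the body of a text.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left as it is.
    """
    start = START_RE.search(text)
    body_start = start.end() if start else 0

    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)

    return text[body_start:body_end].strip()


sample = (
    "The Project Gutenberg eBook of An Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "\n"
    "Chapter I\n"
    "The text itself.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "License information follows.\n"
)
print(strip_gutenberg_boilerplate(sample))
```

It is worth spot-checking the output by hand, since some files carry extra transcriber's notes or production credits inside the markers that this sketch does not touch.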
As important as (or more important than) the collection itself is the metadata file, with information about the texts. I'll say more about the metadata file below.
Licensing and Use: This work is licensed on a CC-BY-NC basis, meaning you are free to download the whole project and use it for your own research, though I would like to be credited. Also, I ask that all uses of this corpus and associated metadata remain non-commercial.
1. "File Under"/Folders:
On the Google Drive, I have been subdividing files into folders to make them more useful to conventional, historically-minded scholars.
Literary Fiction-High Modernism. Essentially what you would expect -- texts from 30-40 prominent modernist writers from the UK, Ireland, and the U.S., with a few less well-known figures like Hope Mirrlees. Writers like Arnold Bennett would not have called themselves "modernists," but they were definitely in the ballpark of literary fiction and in conversation with modernists. One reason to have this classifier might be to distinguish / compare writers like Ford Madox Ford, Ernest Hemingway, Virginia Woolf, Sherwood Anderson, and so on, against writers associated with Genre Fiction.
Genre Fiction, including Science Fiction, Detective Fiction, Adventure, Romance, Horror. This period was of course the Golden Age of Detective Fiction, with Arthur Conan Doyle writing at the fin de siècle and writers like Agatha Christie and Dorothy Sayers emerging in the 1920s. Writers like Doyle and Wells both straddled the late 19th and early 20th centuries; ultimately, I will probably aim to put their pre-1900 works in an appropriate folder for people doing author-based work. You'll also see out-of-copyright materials by people like A.E.W. Mason, H. Rider Haggard, Georgette Heyer, etc.
Drama. As of the present moment, I haven't been actively seeking out dramatists to include in this folder; it mostly consists of plays written by authors who were not primarily playwrights (such as Yeats), though there is a pretty good collection of Somerset Maugham plays.
African American Fiction. For more on this collection, see this earlier description of my African American materials.
African American Poetry. See the link above.
Colonial South Asian Texts. For more on this collection see here.
Nonfiction and Essays (including Travel narratives, Memoirs, and Literary Criticism).
2. Metadata File.
I've been collecting the following information about the texts as I go.
Author's name (Last, first)
Title of work
Year of First Publication
Year of Author's Birth. This is interesting and probably important. We see writers like Joseph Conrad, who is often considered a "Modernist," but who was born in 1857. Most writers associated with high modernism were born between 1870 and 1890. Virginia Woolf and James Joyce were born in the same year! Quite a lot of writers associated with the Victorian period -- Henry James, Rudyard Kipling -- were still actively writing and publishing well into the early 20th century.
Publisher (first publisher). Publisher information could be really interesting to explore; and again, the absence of this information has been a major limitation of Project Gutenberg's collection. Modernist studies scholars have long been interested in small presses like the Woolfs' Hogarth Press or Elizabeth Yeats' Cuala Press; it's revealing to see how and when writers worked with these presses, and when they published with big commercial houses like Macmillan. This information could be useful to scholars interested in the business/publishing side of early 20th-century literature. (It's interesting to see that many African American writers before the Harlem Renaissance used small and local publishers, as the major houses were typically closed to them.)
Publisher location. I have also been keeping track of publisher location. This is not always 100% accurate, especially when books may have been simultaneously published in London and New York. I have been going from how the publisher is described on the book's title page. Besides New York and London, it's interesting to see the publishers in other locations, including Chicago, Indianapolis, Toronto, San Francisco, Dublin, and Calcutta. African American literature before the Harlem Renaissance was mostly self-published on local presses around the country. It became more "New York" focused in the 1920s.
Genre or Mode: Fiction, Nonfiction, Poetry, Short Fiction, Drama
Author's inferred gender: M, F, NB. As of now, I am understanding writers like Bryher and Radclyffe Hall to be nonbinary (NB). Others of course have complex relationships to gender expression (one thinks of Gertrude Stein, who has historically been identified as a lesbian, but whom some scholars have posited as transmasculine or genderqueer). This category may be revised or rethought over time.
Author's nationality
File Under: Broad classifier, equivalent to the shelf in a bookstore or library: Historical fiction, Romance Fiction, Literary Fiction-High Modernism, etc.
Location of Publisher: London, New York, somewhere else?
Tags and Themes: Some tags I have been tracking: WWI, Travel ("Italy," "India" etc), LGBTQIA, Gender / feminism / suffrage, Disability, Environmental, African American, South Asian, Indigenous, Interracial, Passing
Provenance of Text: Gutenberg, HathiTrust, Archive.org, etc.
Again, the metadata file is very much a work in progress. Completing it may take weeks or even months, but I hope that when it's complete it will be useful to researchers.
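As one small illustration of what the year-of-birth and year-of-publication fields make possible, here is a hedged sketch that computes an author's age at first publication and groups authors by birth decade. The field names (year, birth_year) and the three sample records are assumptions for the sake of the example, not rows copied from the metadata file.

```python
from collections import defaultdict

# Hypothetical records; the field names are assumptions, not the real columns.
SAMPLE = [
    {"author": "Conrad, Joseph", "title": "The Secret Agent", "year": 1907, "birth_year": 1857},
    {"author": "Woolf, Virginia", "title": "Jacob's Room", "year": 1922, "birth_year": 1882},
    {"author": "Joyce, James", "title": "Ulysses", "year": 1922, "birth_year": 1882},
]


def age_at_publication(row):
    """Approximate age of the author in the year the work first appeared."""
    return row["year"] - row["birth_year"]


def group_by_birth_decade(rows):
    """Map each birth decade (e.g. 1880) to the authors born in it."""
    groups = defaultdict(list)
    for row in rows:
        decade = (row["birth_year"] // 10) * 10
        if row["author"] not in groups[decade]:
            groups[decade].append(row["author"])
    return dict(groups)


for row in SAMPLE:
    print(row["author"], "|", row["title"], "|", age_at_publication(row))

print(group_by_birth_decade(SAMPLE))
```

The age figure is only approximate, since it ignores the month of birth and of publication.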