Showing posts with label Digital Humanities. Show all posts

Association for Asian Studies Conference 2026: A Few Highlights and Notes

I was at the AAS conference in Vancouver over the weekend, to be part of a panel on Colonial Archives and Digital Humanities in South Asia. 

I also took the opportunity to listen in on some conversations I might not normally get to hear at literature conferences. 

I was just there for Friday and Saturday, and I was able to attend the following panels:


I'll do brief summaries of some takeaways from the various sessions below.

* * * 


1. The Asian Smart Cities panel was something I went to on a lark, mainly out of curiosity. Here's a bit from the panel description: 

The concept of smart city is linked to futuristic scenarios made of images, symbols and concepts that became part of collective imagination and memory: cities should not only be efficient, productive and accessible; they also need to be beautiful, sustainable and socially inclusive.  

At present, the smart city designation means things like: real-time traffic monitoring, with cameras and sensors; CCTV cameras everywhere, observed either by humans or (increasingly) by AIs; weather and threat warnings (e.g., flood sensors). 

By and large, I was not surprised to hear Singapore discussed on the panel as embracing the smart city approach. But I was interested in the presentation on the panel dealing with the Smart City approach in Jakarta. There, it has been only partially successful, since so many people in the city live in informal settlements; it's hard to use high-tech cameras and monitors when people are living in shacks and improvised housing. There was also an interesting paper on the rise and fall of the cycle rickshaw (becak, in Jakarta) as a mode of transportation and as a symbol of the Indonesian working-class "everyman" that continues to be invoked by politicians even as the city modernizes. 

(Side comment: I do wonder whether before planners invest billions of dollars making smart cities in the Global South, they should make cities where everyone has access to affordable housing, power grids and sewage systems that work, and roads and public transportation.)

Some of the papers alluded to other dissents from the Smart City model, especially the growing emphasis on using AI instead of human monitoring. AI-powered smart city technology is expensive; it's often strongly promoted by companies selling monitoring systems and other tech companies; and it can lead to a sense of being constantly policed that might be good for preventing street crime, but that's not good for overall social well-being or urban discovery or spontaneity. 

Along those lines I came across this Op-Ed by Richard Sennett in the Guardian that spoke to those dissents: "No One Likes a Smart City That's Too Smart": 

Uniform architecture need not inevitably produce a dead environment, if there is some flexibility on the ground; in New York, for instance, along parts of Third Avenue monotonous residential towers are subdivided on street level into small, irregular shops and cafes; they give a good sense of neighbourhood. But in Songdo, lacking that principle of diversity within the block, there is nothing to be learned from walking the streets. [...]

A great deal of research during the last decade, in cities as different as Mumbai and Chicago, suggests that once basic services are in place people don't value efficiency above all; they want quality of life. A hand-held GPS device won't, for instance, provide a sense of community. More, the prospect of an orderly city has not been a lure for voluntary migration, neither to European cities in the past nor today to the sprawling cities of South America and Asia. If they have a choice, people want a more open, indeterminate city in which to make their way; this is how they can come to take ownership over their lives.

(This wasn't mentioned on the panel; just something I read and thought was on point.)

* * * 


2. The Cultural Revolution panel was very well attended -- standing room only, with a number of people turned away at the door due to the overflow crowd. The speakers were all very senior academics, some with several books on the history of post-revolution China. Here's a bit from the program copy:

Yiching Wu will argue that in May of 1966, Mao’s intention was to initiate a targeted purge within education institutions, but the campaign soon escalated into a generalized attack on “capitalist roaders” inside the party. Andrew Walder will examine how the unintended consequences of Mao’s moves shaped the course of factional conflicts, particularly in the context of failed truce negotiations among rival rebel groups. Patricia Thornton will focus on the dynamics of the mass movement and the question of representation, raising critical questions about Mao’s ability to direct or contain the grassroots movement he had unleashed. Daniel Leese will assess the quality and structure of information that reached Mao, drawing on the party’s internal reporting systems to interrogate the limits of central knowledge and decision-making during the Cultural Revolution. Felix Wemheuer will chair the discussion.  

Essentially, what I took away from the discussion was the sense that the opening of the Cultural Revolution was a lot less organized than one might think. Mao himself initiated some of the new policies, but the extremity of what followed was not really his intent, nor were the actions of party officials in towns and villages outside of Beijing fully under his control. The panelists discussed a number of key events in 1966-1967 in pretty granular detail (see the Wikipedia page for the Cultural Revolution, and scroll down to 1966: Outbreak).

* * * 


3. The "Beyond the Visual: Gender, Queerness, and Media Margins" panel I attended had some really interesting papers thinking about sound and voice in Japanese popular culture. 

The paper I found most interesting was Haruki Segicuchi's paper on a 1988 Japanese film called Summer Vacation 1999, about a homoerotic relationship between teen boys where the actors were actually all cisgender women! 

I also really enjoyed Minori Ishida's paper on "Gender Deviance in the Bodies of Anime Characters." The panelist mentioned anime series I mostly hadn't seen, like Fena: Pirate Princess and The Land of the Lustrous. There's some really interesting stuff going on here with representations of gender identity (including non-binary and gender non-conforming characters) in both art design and in voicing in these series. While traditional anime featured a highly stylized and binarized approach to gender (soft / feminine women and girls; tough/masculine boys & men), some newer series are exploring queer and nonbinary aesthetics both in visual character design and voicing. 

* * * 


4. The Film, Media, and Gender panel I attended was a bit of a hodge-podge. I especially enjoyed the two papers dealing with South Asian film studies. 

Rebecca Peters of Florida State University gave a paper on Kiran Rao's film Laapataa Ladies, focused on how the film uses costume design and clothing to mount a critique of conservative gender norms and expectations. It's part of a dissertation she's writing on women film directors in Bollywood, which sounds like it will be pretty impactful. 

Arpit Gaind of UCLA gave a rich talk summarizing his research based on his field experience in Jharkhand working with Adivasi filmmakers. 

Here's a bit from his abstract: 

Drawing on ethnographic fieldwork and film analysis, this study demonstrates how Indigenous collectives such as Akhra Ranchi have pioneered what Raheja (2007) theorizes as "visual sovereignty"—the space wherein Indigenous filmmakers critique and reconfigure dominant media conventions while operating within their constraints. By repurposing technologies from analog VHS to digital drones, Adivasi filmmakers parallel global Indigenous movements in asserting what Barry Barclay conceptualized as "Fourth Cinema"—media controlled by Indigenous communities rather than cultural colonizers.

Links for further exploration:

Akhra Ranchi main page

Akhra Ranchi Facebook page

Scholarly chapter on Adivasi Dance in Jharkhand that alludes to Akhra Ranchi

* * * 


5. As I suggested above, the panel "Sitting in the Tension: Caste in the South Asian Diaspora" was a highlight for me. 

Speakers were Sharanjit Kaur Sandhra (University of the Fraser Valley), Neha Gupta (UBC), Sasha Sabherwal (Northeastern University), Anita Lal (Poetic Justice Foundation), and Manmit Singh (grad student at UBC). 

I was especially interested in the stories told about a recent exhibit that has appeared at various universities in British Columbia called Overcaste, which has been controversial in the Sikh community. (See coverage in the Vancouver Sun). 

Anita Lal is a Dalit (Chamaar) Sikh whose family has been in British Columbia for four generations. Her great-grandfather Maya Ram Mahmi was the first Dalit migrant to arrive in Canada. The community was small, but over time they established their own institutions; today, there are several Ravidasia Gurdwaras that have been founded by Dalit Sikhs. 

The Overcaste exhibit has a nice digital version that can be accessed here.

More relevant links: Punjabi Sikh and Dalit (article at SAADA)

Poetic Justice Foundation

Account of the Exhibit at Community Wire, with a quote from Anita Lal that contains a mention of Maya Ram Mahmi:

“In 1906, my great-grandfather Maya Ram Mahmi became the first recorded Dalit immigrant to Canada, seeking a brighter future and escape from the social and economic oppressions he faced in India. Yet, he and his descendants, including myself, have faced ongoing caste discrimination, an issue that persists over a century later. Through the OVERCASTE exhibit, we aim to highlight the often-ignored problem of caste bias in Canada. This initiative seeks to amplify the Dalit Canadian narrative, which has been historically sidelined and ignored,” says Anita Lal, Co-Curator of the exhibit and Co-Founder of the Poetic Justice Foundation. 

* * * 

6. I was surprised by the generally optimistic tone of the next panel I attended, "AI in Action: Best Practices for Research, Publishing, and Teaching in Asian Studies." Two of the speakers here, Joseph Alter and Elise Huerta, were journal editors. 

Alter described how the submission rate for the Journal of Asian Studies has increased by 150% in the past five years. The reason is not so much AI-assisted writing as AI-assisted translation, as many potential contributors who are not native speakers of English are writing up their research in their own languages and then using Gen-AI translation to render their work in smooth, idiomatic English. 

The editor was not especially bothered by this, and I can see why -- it has the potential to democratize scholarship in Asian Studies. (However, it does mean that reviewers have to be found to handle all those new submissions, and policies have to be developed to handle the use of AI...) 

The editors also mentioned the growing problem of peer reviewers being tempted to use generative AI to create overviews or summaries of submitted articles, or even to write assigned reviews. 

Along those lines, in the Q&A I asked the following question: 

[Me] This question is first for the editors on the panel but others might also have things to say about it. I’m a little surprised that the overall tone of this panel is a lot less apocalyptic than I would have expected. In literature and writing, the mood is a lot darker – I taught first-year writing recently, and it was really tough to get through to students about the importance of the process we’re asking them to engage in. Some students are having trouble resisting the temptation to cheat with AI, while others wish it would just go away. 

Perceived audience and reward matter a lot. People tend to work hard when they know there’s a reward for their effort. People tend to write more thoughtfully and carefully when they know there is a reader who will care what they say. I'm worried about academics also being tempted to cheat using gen-AI for peer-review. 

We should mention that peer-review is by and large unpaid labor. It’s also work that doesn’t really have the same level of professional reward as our primary research. Most likely our reviews will be read by an editor who knows our name but will go back to the author who doesn’t know who we are. And while we can claim the review on our CVs it doesn’t count for much in university professional activities reports, so our department chairs and Deans don’t really pay much attention either. So our audience of human readers is tiny; it seems hard to imagine people will not start to cheat when they write anonymous peer-reviews. 

So it's a structural problem. Can there be structural solutions? 

Perhaps open-peer review?  So if we do a review of an essay, it is and can be known by others...?

In their responses, the editors of the two journals and others on the panel were not terribly concerned with this problem. Their sense is that peer-reviewing is voluntary writing, so people who don't want to do the work will turn down the request to review. And they feel that most if not all of what they currently get in terms of peer-review evaluations are written by humans even if the readership is largely anonymized.  And they feel that people are by and large sticking to the honor system & often writing really compelling, constructive reviews that help other scholars and that help the field overall. 

Overall, a lot less apocalyptic than one would expect! 

* * * 

7. Finally, my own panel. 

Margaret Schotte and Christina Welsch have collaborated on an impressive DH project called Sailing With the French, which aims to "visualize and analyze more than 1300 voyages of the French East India Company during the 18th century, uncovering patterns and stories from archival records of the era." They're finding some really fascinating stuff about the demographic backgrounds of the sailors who sailed for the French Indies Company in the 18th century. Alongside Frenchmen, there were also Lascars and enslaved people, some of them from Africa, who were on these ships. 

I would also recommend people interested in these topics check out Christina Welsch's book, The Company's Sword: The East India Company and the Politics of Militarism, 1644–1858.

For my part, I posted the text of my own talk and slides here.

Dhanashree Thorat's talk on telegraph and internet infrastructures overlapped with her 2019 article in South Asian Review, which you can see here.

* * * 

After my panel I chatted with Nicole Ranganath of UC-Davis. She mentioned the Pioneering Punjabis Digital Archive (1300 items) and the Punjabi and Sikh Diaspora Archive. The latter has some impressive material related specifically to early Punjabi women settlers in California (see Women's Gallery).

African American Poetry: Website Updates

At the end of Black History Month for 2026, I was impressed to see a new high in monthly traffic for the digital collection I edit on African American Poetry: 40,000 users in February! 

Chart showing user traffic for African American Poetry: A Digital Anthology February 2026. 40,000 users for 30 days

40,000 users in a month is a jaw-dropping number that is a little hard to comprehend, especially considering most academic articles I might otherwise publish would be read by 100 people or fewer. Even for other digital collections I have edited, I would consider 5% of that traffic -- 2,000 users a month -- to constitute success, so this is really a different scale. Admittedly, at least some of the new traffic might be bots and generative AI scrapers; I've seen a significant uptick in users in China, though I would be surprised to learn that African American literature in English has suddenly appeared on university syllabi there. 

Of course all credit really goes to the amazing writers whose works are collected on the site -- there is clearly a large number of folks out there looking for these materials, both inside and outside of academia. Writings by Nella Larsen, Zora Neale Hurston, Langston Hughes, and Claude McKay are the most in demand.  

Chart showing usage statistics for African American Poetry a Digital Anthology. Nella Larsen, Zora Neale Hurston, Alain Locke's The New Negro, Langston Hughes, Claude McKay

I have also been making some regular updates and additions to the site.

1. A simple digital edition of the volume, "Four Lincoln University Poets" (1930). It includes poems by Langston Hughes, Edward Silvera, and Waring Cuney. All three influential Harlem Renaissance poets were undergrads at Lincoln at the same time! (Admittedly, Hughes was a little older than most other students -- he had spent several years in the early 1920s wandering the world as a sailor even as his literary career was taking off in the pages of magazines like The Crisis. When he decided to go back to college in 1926, he landed on Lincoln, then a fairly modest but well-reputed HBCU outside of Philadelphia.)  

2. An author page for Harlem Renaissance poet Waring Cuney

Cuney's best-known poem is the free verse "No Images"; it was widely anthologized at the time: 

She does not know
Her beauty
She thinks her brown body
Has no glory

If she could dance
Naked
Under palm trees
And see her image in the river
She would know

But there are no palm trees
On the street
And dish water gives back no images


3. A stub author page for poet Azalia E. Martin (active 1900-1910). Sadly, I couldn't find much biographical information on her. 

However, see her powerful 1906 poem "A Protest": 

"Ye who would stop the progress of a race,
Give ear; that race would question thee." 


4. Improved author pages for Harlem Renaissance poets Edward Silvera and Lewis Alexander. I tracked down the only publicly available photo of Edward Silvera from a Lincoln University yearbook. (I've asked the Lincoln U. library for a higher-res version...) Once I can track down a better / higher res. version of a photo, I might take a stab at making a Wikipedia page for Edward Silvera. 

5. Added new poems by Lucian Watkins, mostly discovered in the Richmond Planet newspaper, via the Library of Virginia's website. Watkins served in the Army and was stationed in the Philippines during the Spanish-American War (1898-1900) and the war against Filipino independence fighters that followed (early 1900s). Based on an entry in the Planet, it seems like we can confirm that he remained enlisted in the Army all the way through the World War I years (1918). 

2025: My Year in Books


1. General Interest Recommendations


Arundhati Roy, Mother Mary Comes to Me. This was a standout for me this year -- Roy's beautifully written memoir of her rocky relationship with her mother. It is also a compelling intellectual autobiography that follows the arc of Roy's career, from her early days (training as an architect; acting in and then writing for films and television), to her more contemporary social justice interventions. The God of Small Things was a work of fiction, but every major character was based on a real person, and many of the difficult things that happened to the children in the novel are based on events experienced by Roy and her family. I especially appreciated the section in Mother Mary Comes to Me on the architect Laurie Baker, someone I'd not heard of before. 

Even now -- and after many, many years of teaching books like The God of Small Things -- I've still never seen Roy's early films (Massey Sahib, directed by Pradip Krishen; In Which Annie Gives It Those Ones, which Roy wrote; and Electric Moon, which, frankly, I'd never even heard of!)

Massey Sahib (1989) is a kind of loose adaptation of Joyce Cary's Mister Johnson transposed to India; there's a version of it up on YouTube here.

There's a version of In Which Annie Gives It Those Ones (1990) here. (This film, which is based on Roy's experience in a school of architecture in Delhi in the 1970s, seems like the place to start.)

I don't see any versions of Electric Moon (1992) online. (Probably ok; in her account of it in the memoir, Roy suggests that this film, a hybrid British-Indian production made with BBC funding, was a bit of a misfire.)

Caoilinn Hughes, The Alternatives. File under: thoughtful climate fiction. A readable but somewhat idiosyncratic novel of ideas; what would it really mean to move to rural Ireland and drop off the grid? What sacrifices would it require, especially in terms of your personal relationships and your family? At the center of this smart novel are four sisters, each with a Ph.D. -- one a philosopher, one a geologist, one a caterer, and the fourth a political scientist. The debates between the sisters form the core of the novel. Some of the philosophy might be a little abstruse for readers (Kant!), though Hughes does find ways to make it accessible enough and relevant to the core ethical dilemmas she wants to explore. 

Charlotte McConaghy, Wild Dark Shore. File under: climate fiction + thriller. A novel set on a remote island outpost near Antarctica (Shearwater Island), with a group of caretakers whose main job is to protect a doomsday seed bank. The novel has the stylized language and lyricism of literary fiction, though in the second half it turns more into a thriller plot. Overall, it made me curious to visit the place itself, though given its remoteness that seems far-fetched. (Let's start by getting ourselves to Australia or New Zealand first...)

Percival Everett, James. I'm guessing most people in my circle have read this brilliant rewriting of Huckleberry Finn from Jim/James' point of view -- it was on everybody's top ten lists last year. I finally read it this year; it's very good. I especially liked the investment in James' interest in writing his own story: "With my pencil, I wrote myself into being. Wrote myself to here." Also: "I can tell you that I am a man who is cognizant of his world, a man who has a family, who loves a family, who has been torn from his family, a man who can read and write, a man who will not let his story be self-related, but self-written." This theme of the novel reminded me of other 'postcolonial' texts that write back to the Anglo-American Canon -- and that thematize the act of writing as a central part of coming to own one's subjectivity (see: J.M. Coetzee's Foe). I've never taught Uncle Tom's Cabin, but if I were to do that in the future, I would do it alongside James.

Work in Progress: A Modernism Text Corpus [Early 20th Century Literature Corpus]

As readers may be aware, I've been periodically creating small, open-access textual corpora, collecting African American literature and literature from Colonial South Asia. 

After a recent experience at the Modernist Studies Association conference, I thought it might be a worthwhile project to create a larger textual corpus, collecting out-of-copyright materials from a broad range of authors from the early 20th century. 

What is a Textual Corpus? 

A textual corpus is a collection of texts, typically in plain text format, arranged to be analyzed in various ways, including using quantitative methods. The first major creators of textual corpora were computational linguists, who have studied large-scale linguistic phenomena in corpora constructed within a given language. More recently, digital humanities scholars have been working with corpora of specifically literary texts, often with methodologies that borrow from or gesture towards linguistics. For instance, can we infer author gender in a large corpus of novels to ascertain patterns in the demographics of fiction over time? Can we use certain linguistic patterns to ascertain the genres of novels within a larger corpus? 

While anthologies and archives (including digital archives) have traditionally been designed to represent the most important and meaningful texts in particular geographical, cultural, and historical contexts, textual corpora often eschew questions of literary value in the interest of maximal inclusivity. Many quantitative methods rely on large-scale corpora to achieve statistical viability, and to answer questions about patterns in language usage, the fact that a particular book of poetry was critically well-received and another was not might be less important than the fact that both were published at a certain time and place. In our collection, we have aspired to maximal inclusivity, incorporating materials that the editorial tradition might have overlooked, such as 'minor' texts by 'major' writers, as well as writing that has entirely fallen off the critical radar. 
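To make the definition above a little more concrete, here is a minimal sketch of what "a folder of plain text files, arranged to be analyzed" can look like in practice. Everything in it is an assumption for illustration: the file names and sample sentences are placeholders (not files from the actual corpus), and the helpers `load_corpus` and `word_counts` are hypothetical names, not part of any existing tool.

```python
from collections import Counter
from pathlib import Path
import re
import tempfile


def load_corpus(folder):
    """Read every .txt file in a folder into a {filename: text} dict."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    }


def word_counts(text):
    """Lowercase the text and count simple alphabetic word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Placeholder files so the sketch runs on its own; a real corpus folder
# would simply be passed to load_corpus() instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample_a.txt").write_text(
        "The river gives back no image. The river runs.", encoding="utf-8"
    )
    Path(tmp, "sample_b.txt").write_text(
        "The city is loud, and the city is new.", encoding="utf-8"
    )
    corpus = load_corpus(tmp)

# Add up word counts across every text in the corpus.
totals = Counter()
for name, text in corpus.items():
    totals.update(word_counts(text))

print(len(corpus), "texts loaded")
print(totals.most_common(3))
```

The tokenizer here is deliberately crude (it ignores hyphenation, accents, and numerals), so a real project would probably want something more careful; the point is only that once texts are in plain text, counting across all of them takes a few lines.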

What is in our Modernist Text Corpus?

The idea is to collect materials from recognizable modernists like Virginia Woolf and James Joyce, alongside African American writers, Indian writers like Rabindranath Tagore and Cornelia Sorabji, as well as a sampling of genre fiction (including detective fiction, historical fiction, adventure fiction, science fiction, romance, etc.). 

So: everything from Jack London to Edith Wharton to Georgette Heyer to Langston Hughes. 

The goal is to produce a collection that could be useful to people doing quantitative analyses of these materials, but also to scholars doing conventional historical scholarship on the literature of the period. 

I've been creating thematic tags and genre classifications as I go, so that people interested in just writing by modernist women, for instance, could sort the collection that way (see the metadata below). Similarly, people interested in just African American poetry could sort the collection that way as well (using the Af-Am poetry folder). Other topics I've started tracking are materials related to World War I, materials related to colonialism and empire, LGBTQIA materials, disability, and the environment. 

(Note: tagging is at a very early stage thus far. I would welcome help and contributions from any readers who have specialist knowledge about any of the topics mentioned above.) 

Having these topics represented in the metadata was important to me; it's one reason why I've found existing textual repositories online insufficient. Project Gutenberg, for instance, has in recent years dramatically improved its approach to data about original publication, but many texts in their collection continue to have no information about publication date or the publisher name. I wanted to make a collection where all of that information was added back in. 

How to access the corpus? 

This is a work in progress. It can be found here for the moment.

As I've been going, I've been drawing largely on digital files at Project Gutenberg, Archive.org, and HathiTrust. (Note: the Gutenberg files will need to be "cleaned" to make them useful for quantitative queries; as of the present writing, I have not yet done that with the files, but it should be happening soon.)
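For readers wondering what "cleaning" a Gutenberg file might involve, here is a rough sketch, not the procedure I will necessarily end up using. Project Gutenberg plain text files wrap the work itself in licensing boilerplate, usually marked off by lines beginning "*** START OF" and "*** END OF". The exact marker wording has varied over the years, so the two patterns below are an assumption to check against your own files, and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Assumed marker patterns; older files may word these differently.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the beginning or end of the file
    rather than discarding anything.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


# Made-up sample standing in for a downloaded file.
sample = (
    "The Project Gutenberg eBook of Example\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
    "Chapter I.\nIt was a quiet morning.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows here.\n"
)

cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

This only handles the header and footer. Transcriber's notes, page-number residue, and similar material inside the body would still need separate attention.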

As important (or more important) than the collection itself is the metadata file, with information about the texts. I'll say more about the metadata file below. 


1. Folders: 

On the Google Drive, I have been subdividing files into folders to make them more useful to conventional, historically-minded scholars.

Literary Fiction / High Modernism. Essentially what you would expect -- texts from 30-40 prominent modernist writers from the UK, Ireland, and the U.S., with a few less well-known figures like Hope Mirrlees. 

Genre Fiction, including Science Fiction, Detective Fiction, Adventure, Romance, Horror. This period was of course the Golden Age of Detective Fiction, with Arthur Conan Doyle writing at the fin de siecle and writers like Agatha Christie and Dorothy Sayers emerging in the 1920s. Writers like Doyle and Wells both straddled the late 19th and early 20th centuries; ultimately, I will probably aim to put their pre-1900 works in an appropriate folder for people doing author-based work. You'll also see out-of-copyright materials by people like A.E.W. Mason, H. Rider Haggard, Georgette Heyer, etc.

All Fiction. What it sounds like. A mix of "highbrow," "middlebrow" and popular fiction. 

All Poetry. Canonical figures like Yeats, Pound and Eliot alongside "minor" figures. A very substantial representation of African American poetry.

Drama. As of the present moment, I haven't been actively seeking out dramatists to include in this folder; it mostly consists of plays written by authors who were primarily not playwrights (such as Yeats), though there is a pretty good collection of Somerset Maugham plays. 

African American Fiction. For more on this collection, see this earlier description of my African American materials

African American Poetry. See the link above.

Colonial South Asian Texts. For more on this collection see here

Nonfiction and Essays (including Travel narratives, Memoirs, and Literary Criticism).


2. Metadata File.

We've been collecting the following information about the texts as we go. The metadata file (a work in progress) can be viewed here.

Author's name (Last, first)

Title of work

Year of First Publication

Year of Author's Birth. This is interesting and probably important. We see writers like Joseph Conrad, who is often considered a "Modernist," but who was born in 1857. Most writers associated with inventing high modernism were born between 1870 and 1890. Virginia Woolf and James Joyce were born in the same year!

Publisher (first publisher). Publisher information could be really interesting to explore. Modernist studies scholars have long been interested in small presses like the Woolfs' Hogarth Press. But here, we are gathering information about who published with which publisher including big commercial houses. This could be useful to scholars interested in the business side of early 20th century literature. (It's interesting to see that many African American writers before the Harlem Renaissance used small and local publishers, as the major houses were typically closed to them.) 

Genre or Mode: Fiction, Nonfiction, Poetry, Short Fiction, Drama

Author's inferred gender: M, F, NB. As of now, I am understanding writers like Bryher and Radclyffe Hall to be nonbinary (NB). Others of course have complex relationships to gender expression (one thinks of Gertrude Stein, who has historically been identified as a lesbian, but who some scholars have been positing as transmasculine or genderqueer). This category may be revised or rethought over time. 

Author's nationality

Location in Corpus: Which folder on the Google Drive is the file in?

Location of Publisher: London, New York, somewhere else? 

Tags and Themes: Some tags I have been tracking: WWI, Travel, LGBTQIA, Disability, Environmental, African American, South Asian, Indigenous, Interracial, Passing

Provenance of Text: Gutenberg, HathiTrust, Archive.org, etc.

Again, the metadata file is very much a work in progress. Completing it may take weeks or even months, but I hope that when it's complete it will be useful to researchers. 
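As a rough illustration of how the metadata fields above could be used to sort the collection, here is a small sketch that filters rows by tag. All of the specifics are assumptions: the column names, the semicolon separator inside the tags column, and the three placeholder rows are invented for the example and do not reflect the layout or contents of the actual metadata file.

```python
import csv
import io

# Hypothetical column names and placeholder rows, for illustration only.
SAMPLE_CSV = """author,title,year,genre,tags
"Author, A.",Example Novel,1915,Fiction,WWI; Travel
"Author, B.",Example Poems,1922,Poetry,African American
"Author, C.",Example Memoir,1919,Nonfiction,WWI; Disability
"""


def rows_with_tag(rows, tag):
    """Return the rows whose tags column contains the given tag.

    Matching ignores case and surrounding whitespace; tags are assumed
    to be separated by semicolons.
    """
    wanted = tag.strip().lower()
    return [
        row
        for row in rows
        if wanted in {t.strip().lower() for t in row["tags"].split(";")}
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Everything tagged WWI, in order of first publication.
wwi = rows_with_tag(rows, "WWI")
for row in sorted(wwi, key=lambda r: int(r["year"])):
    print(row["year"], row["author"], row["title"])
```

The same pattern would work for any of the other columns (genre, nationality, publisher location), and the resulting list of rows can then be matched against file names in the relevant folder.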

Fall Teaching: "Decolonizing (Digital) Humanities"

[Updated January 2022] 

I'm teaching a grad seminar on Digital Humanities this fall. I'm structuring most of the hands-on work around two Text Corpora I've been developing, one on African American Literature, and the other on Colonial South Asian Literature.

If the Canon has been the defining structure of traditional literary studies, in the DH framework the starting point is the Corpus. You can do a lot with a group of texts structured this way -- from Text Analysis, to Natural Language Processing, to thinking about Archives and Editions. As with the Canon, the questions you can ask and the knowledge you can produce are strongly determined by what's included or excluded from the Corpus. 


Course Description (short version): 

This course introduces students to the emerging field of digital humanities scholarship with an emphasis on social justice-oriented projects and practices. The course will begin with a pair of foundational units that aim to define digital humanities as a field, and also to frame what’s at stake. What are the Humanities and why do they matter in the 21st century? How might the advent of digital humanities methods impact how we read and interpret literary texts? Some topics we’ll consider include: Quantifying the Canon, Race, Empire & Gender in Digital Archives, and an introduction to Corpus Text Analysis. Along the way, we’ll explore specific Digital Humanities projects that exemplify those areas, and play and learn with digital tools and do some basic coding. The final weeks of the course will be devoted to collaborative, student-driven projects. No programming or web development experience is necessary, but a willingness to experiment and ‘break things’ is essential to the learning process envisioned in this course.

Text Corpus: Colonial South Asian Literature

Recently, I announced a Text Corpus I had put together, of African American Literature from 1853-1923. 

I've also been putting together a Corpus of Colonial South Asian Literature from roughly the same period.  

The link to that folder can be accessed here. I'll also be posting the files on Github soon.

This has been a much harder Corpus to compose. Whereas with the African American literature we have bibliographic lists of published works to serve as a guide (such as the one posted at the History of Black Writing at Kansas), there does not appear to be an equivalent list with respect to Colonial South Asia. 

Choices Made in Producing this Corpus:

1. Nationalities

I decided to include British as well as South Asian writers in the Corpus. Many of the writers were clearly in dialogue with one another; South Asian writers were clearly reading people like Rudyard Kipling, E.M. Forster, and Katherine Mayo. It's a little less clear which South Asian writers British and American writers were reading other than Tagore (and this itself might be studied). The publishing industries also overlapped to a considerable extent; while some South Asian writers published their works with publishers based in India, many aimed to publish with houses based in London. 

One possible line of inquiry with this material might be to try and compare fiction, poetry and drama by British authors with South Asian output in English. Such inquiry could either be historical and thematic (i.e., comparing the way British and South Asian writers reacted to historical events like the Sepoy Mutiny or the Famine of 1876), or it could be connected to matters of language and style. To do that it makes sense to have writers from different backgrounds represented in the Corpus. 

I knew there was a fair amount of interest in colonial India in the U.S. at the time -- from the appreciation of Kipling to the American feminist fascination with Pandita Ramabai. However, while doing this research I was surprised to come across a large number of pulpy Indian adventure novels by an American writer named Talbot Mundy.

In the metadata file, I list the nationalities of the authors. Besides a few Americans in the collection, I would draw readers' attention to B.M. Croker (an Irish woman who lived in India and wrote many Romance novels based in colonial India), and Sara Jeannette Duncan (a Canadian woman who also lived in India and wrote prolifically as well).  

In addition to the nationality question, with South Asian writers who moved abroad there is also the question of destination. Cornelia Sorabji (who eventually moved to England) is of course pretty well known. Dhan Gopal Mukerji, who moved to the U.S. in the 1910s, is mainly known for his memoir Caste and Outcast, but he was quite a prolific literary writer, with several books of poetry and fiction that are worth looking at. 

2. Translations. 

I decided to include translations by South Asian writers like Bankim Chandra Chatterjee (Chattopadhyay) and Rabindranath Tagore in the Corpus. Tagore of course needs no explanation; he was one of the few South Asian writers to break through and achieve global acclaim in the early 20th century. Bankim Chandra Chatterjee (here, I'm using one of the spellings used at the time, aware of course that "Chatterjee" and "Chatterji" are colonial-era abbreviations of Chattopadhyay...) is slightly different. He is clearly historically important for Anandamath (here included in translation as Dawn Over India) and Rajmohan's Wife (thought to be the first English-language novel by an Indian author), but it seemed like it might be valuable to include some of his other Bengali novels in translation here. Several of these I found at Wikisource.

Alongside translations by South Asian writers, there are a few translations in the corpus of historical South Asian texts by British writers. 

3. Fiction and Nonfiction

Right now there is a limited amount of nonfiction included in the corpus. This was a very tough decision, as there is a vast array of nonfiction colonial travel writing based in South Asia from this period. I've excluded that sort of writing for now, though I may include more of it as I continue to expand the corpus. 

However, I decided to include some nonfiction, mostly texts by literary authors who wrote occasional works of nonfiction (Dhan Gopal Mukerji's Caste and Outcast is included, as is Tagore's My Reminiscences). I've also included a plain text file of Pandita Ramabai's The High-Caste Hindu Woman, mainly because it seems like an important text that might be useful for researchers in this field. Any queries specifically structured around the stylistics of fiction or the colonial novel might want to exclude these nonfiction texts. 

4. Derivation; grunt work

As with my other Corpus, I pulled together materials from different repositories to assemble this corpus. Here, the lion's share of material comes from Project Gutenberg and HathiTrust. (Derivation is indicated in my metadata file.) 

The Gutenberg materials were in good shape; they've generally been proofread and formatted cleanly.

The HathiTrust materials required much more work. One can extract HathiTrust texts by requesting plain text, but these OCR page scans need quite a bit of processing to make them clean enough to use. A lot of the grunt work of assembling this collection has entailed doing that processing. 

Here is a list of works I've imported from HathiTrust page scans thus far: 

Arnold, W.D. Oakfield; Or, Fellowship in the East 1855
Bain, F.W. A Hindoo Love Story 1898
Candler, Edmund Abdication 1922
Candler, Edmund Siri Ram, Revolutionist 1911
Candler, Edmund Mantle of the East 1910
Candler, Edmund Year of Chivalry 1916
Chatterji, Bankim Chandra Anandamath: Dawn Over India 1882 (1941)
Chatterji, Bankim Chandra Krishnakanta's Will 1917
Croker, B.M.  Proper Pride 1882
Croker, B.M.  Diana Barrington: A Romance of Central India 1888
Croker, B.M.  A Rolling Stone 1911
Diver, Maud Lilamani: A Study in Possibilities 1911
Diver, Maud Unconquered 1917
Derozio, Henry Louis Vivian Poems of Henry Louis Vivian Derozio: A Forgotten Anglo-Indian Poet 1923 (1831)
Duncan, Sara Jeannette Burnt Offering 1910
Dutt, Michael Madhusudan Sermista; a drama in five acts 1859
Dyer, Helen S. Pandita Ramabai: The Story of Her Life 1900
Kipling, Rudyard and Wolcott Balestier The Naulahka: A Story of West and East 1892
Mukerji, Dhan Gopal Caste and Outcast 1923
Mukerji, Dhan Gopal Layla-Majnu: A Musical Play in Three Acts 1916
Mukerji, Dhan Gopal Rajani: Songs of the Night 1916
Ramabai, Pandita The High Caste Hindu Woman 1888
Satthianadhan, Krupabai Kamala: A Story of Hindu Life 1894
Sorabji, Cornelia Between the Twilights: Being Studies of Indian Women By one of Themselves 1908
Sorabji, Cornelia Indian Tales of the Great Ones Among Men, and Bird-People 1916
Sorabji, Cornelia Shubala-A Child Mother 1920
Sorabji, Cornelia Sun-Babies: Studies in the Child-Life of India 1904
Tagore, Rabindranath Gora 1924 (1901)

Some of the highlights in the table above are in bold. As far as I know, these are the first plain text versions of the above texts to be made available online. 

You may notice that a couple of these texts are dated post-1923. I believe the 1941 translation of Anandamath (Dawn Over India) has fallen out of copyright in the U.S.

I should add that while I've cleaned up these files, I haven't proofread them. That is going to be a long-term project -- for which I would welcome collaborators! 


Text Processing with Regular Expressions (RegEx): a Digital Humanities Work-Flow for Beginners (no coding)

I wrote up the following as a primer for the students in my Digital Humanities seminar in 2020; updated in Spring 2025. If you have favorite RegEx commands and tips, please share them in the comments or hit me up on Bluesky. 


For people in literary studies – especially people working with digital collections and archives – the most common reason to need a bit of coding is formatting texts. This is less glamorous work than working with fancy visualizations or maps, but it can be incredibly useful and time-saving in many different contexts. Some of these skills will also potentially translate into work outside of academia. 

For my own work, I frequently get messy files that have been scanned from old editions, and then OCRed. They need clean-up! 

Cleaning up a single 80-page collection of poetry by hand is not that big a deal, but we have been working with dozens of them. And a single 300-page novel can take hours if you don’t have any tools to speed it up. My rule of thumb is: when you find yourself doing the same repetitive task again and again hundreds of times, that’s something that ought to be do-able by a machine. 

Sometimes messy scanned texts have certain recognizable patterns. For instance, in scanned/OCRed poetry, you often see things like this: 

   I never see the burial place, 
   Where my dear mother lies ; 
   But that I think I see her face, 
   Peak at me through the skies. 

[And yes, it says, “peak” not “peek” in the original. Don’t think this one had a copy-editor] 


The quality of this is pretty good actually, but notice the extra space before the semi-colon on the second line. Let’s say you have that exact glitch 50 or 100 times in a collection… You would normally fix that with Find and Replace: 

   Find: [space]; 

   Replace: ; 

So now the second line should read: “Where my dear mother lies;” 

Chances are, if you see that space before punctuation with a semi-colon, there are probably some with other punctuation as well – I would go through the same document, and do Replace All changes for space before comma, space before period, space before question mark, space before exclamation point. 
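For readers who do end up working in Python, here is a minimal sketch of that same "space before punctuation" clean-up done in one pass rather than one punctuation mark at a time. The function name tidy_punctuation is my own invention for illustration, and the set of punctuation marks is an assumption you would adjust to your own text.

```python
import re

def tidy_punctuation(text):
    """Remove stray spaces or tabs that sit directly before punctuation."""
    # [ \t]+      one or more spaces or tabs
    # ([;,.?!])   a single punctuation mark, captured as group 1
    # \1          puts the captured mark back, minus the space before it
    return re.sub(r"[ \t]+([;,.?!])", r"\1", text)

stanza = (
    "I never see the burial place,\n"
    "Where my dear mother lies ;\n"
    "But that I think I see her face,\n"
    "Peak at me through the skies.\n"
)

print(tidy_punctuation(stanza))
```

One thing to watch for: a pattern like this will also close up spaced ellipses (". . ."), so it is worth searching for those before running it on a whole file.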

But what if you saw patterns with more complicated glitches – things that a simple find and replace couldn’t address? 

Like, say, you wanted to get rid of all of the running headers in a novel – lines that begin with a number and then end with the title of the novel? Again, not a big deal to do this 10 or 20 times by hand. But 300 times? Gets a little old. You can literally do it with a little snippet that looks like this: 

   Find: \n\d+.*\n 

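   Replace with: \n 

(Note: we're jumping ahead a bit here, but the bit of code above says: look for any new line (\n) that is followed by a number (\d+), then any text (.*), and then another new line (\n). If you replace all of that with a single new line, you are telling it to delete any lines that fit that description while keeping the lines before and after them on separate lines. Replacing with nothing at all would also delete the header, but it would glue the line before it to the line after it.)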

Hit “Replace All” on the find and replace box, and you just saved yourself 300 manual edits. 
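If you later want to script the header clean-up, a rough Python version might look like the sketch below. The function name and the sample text are made up for illustration. This variant anchors on the start of each line with ^ (which needs the re.MULTILINE flag in Python) instead of matching the line break before the header, so the neighboring lines stay on separate lines.

```python
import re

def drop_running_headers(text):
    """Delete every line that begins with a digit (e.g. '42 NOVEL TITLE')."""
    # ^    start of a line (because of re.MULTILINE)
    # \d+  one or more digits
    # .*   the rest of that line
    # \n   the line break that ends it
    return re.sub(r"^\d+.*\n", "", text, flags=re.MULTILINE)

# Invented sample: a page break in the middle of a novel.
page = (
    "and so the evening ended.\n"
    "42 AN INVENTED NOVEL TITLE\n"
    "The next morning he rose early.\n"
)

print(drop_running_headers(page))
```

Note that this deletes any line starting with a digit, including a legitimate sentence that opens with a year. It is safer to step through a few matches with plain Find before hitting Replace All.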

Or, you wanted to find all lines that end with hyphenated words and unbreak the hyphenated words, putting them together? 

Or: you have a passage or a text that for whatever reason is in ALL CAPS. How can you convert it to conventional capitalization without retyping it? 

For those types of problems, you could use a coding system called Regular Expressions (Regex). You can use Regex codes directly in a Find and Replace box in a sophisticated text editor – you don’t need to know Python or R (though these commands do work within Python and R, and some people talking about Regex online are using it with Python). 
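For the curious, here is one way the two problems just mentioned (words hyphenated across a line break, and text in ALL CAPS) could be sketched in Python. Both function names are hypothetical, and both are deliberately simple: they illustrate the idea and are not finished tools.

```python
import re

def rejoin_hyphenated(text):
    """Rejoin a word that was split with a hyphen at the end of a line."""
    # (\w+)-\n   a word fragment, a hyphen, then the line break
    # (\S+)      the rest of the word (plus any attached punctuation)
    # [ \t]*\n?  the space or line break that follows it
    # The fragments are joined and the line break moves after the word.
    return re.sub(r"(\w+)-\n(\S+)[ \t]*\n?", r"\1\2\n", text)

def sentence_case(text):
    """Turn ALL CAPS text into conventional sentence capitalization."""
    # Lowercase everything, then re-capitalize the first letter of the
    # text and the first letter after each ., ! or ?.
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )

print(rejoin_hyphenated("the colo-\nnial novel was\n"))
print(sentence_case("THE HOUSE WAS QUIET. NO ONE SPOKE."))
```

Two caveats: the first function will also remove the hyphen from genuine compounds that happen to break across a line ("well-" / "known"), and the second leaves proper nouns in lowercase, so both still call for a read-through afterwards.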

Technically Google Docs has a Regular Expressions box, though to be honest I've not used it much, mainly because Google Docs starts to run very slowly if you are working with larger documents. A 300-page novel runs super-slowly, because the software is constantly analyzing and indexing your file as you work and relaying everything back and forth with a remote server; it is also creating a ton of invisible stuff in your file related to formatting and special characters. 


I usually use a free piece of software called Notepad++ for this kind of work (CotEditor on Mac). 300-page novels load quickly and without delay, and there are no hidden characters or invisible formatting. It’s also completely offline, though, so you have to remember to hit “Save” and then upload the file to a destination when done. 


Announcing An Open-Access African-American Literature Corpus, 1853-1929

Amardeep Singh, Lehigh University. On Twitter @electrostani
July 2020 (updated January 2025)

I’ve put together a small corpus of texts by Black authors of fiction and poetry in plain text format. The corpus is downloadable and researchers are free to modify it according to preference.

The corpus at present consists of about 175 texts by African American writers, of which about 90 are works of fiction (about 5 million words) and 85 are books of poetry (about 700,000 words). It currently starts in 1853, the year of publication of William Wells Brown’s Clotel and Frederick Douglass’ short fiction “The Heroic Slave,” and ends in 1929, the year of Nella Larsen's Passing. Some of the files are admittedly still a little rough around the edges; cleaning and formatting will be an ongoing and long-term process. Still, I think the files are in good enough shape to start preliminarily exploring them using tools like AntConc or Voyant Tools.

Right now I’m making the collection available as a Google Drive link as well as on Github.


→ Download link. You can find the corpus here (Google Drive) or here (Github). (The Google Drive is more recently updated.)


Sources: 

In the Metadata file I’ve created to accompany the collection, I indicate the origin of each text. Many come from Project Gutenberg, HathiTrust, the American Verse Project at the University of Michigan, the Library of Congress, and the History of Black Writing Novel Corpus. A few texts were present on multiple repositories; I generally used the text of the source that seemed cleanest and most convenient. 


Why Do This / My Background:

I started thinking about the relative paucity of collections focused on people of color online a few years ago (see my blog post on “Archive Gap” from 2015). I then initiated a couple of digital projects aimed to intervene in what I saw as the absence of Black writers in particular, “Claude McKay’s Early Poetry,” and “Women of the Early Harlem Renaissance.” That latter project in particular opened my eyes to the wealth of materials that have essentially fallen off the radar of literary history. A limited quantity of this overlooked material is sampled in anthologies like Maureen Honey’s Shadowed Dreams: Women’s Poetry of the Harlem Renaissance or Double-Take: A Revisionist Harlem Renaissance Anthology. But there remains a fairly substantial ‘great unread’ in the African American literary tradition that could be brought to light, at least partly just by gathering materials that might have already been digitized in one form or another. 

Other corpora centered around Black writers do appear to exist, but they’re often restricted access. (For instance, The History of the Black Novel corpus has 53 works available to the public, but the larger corpus with about 450 works is restricted access for copyright reasons.) 

If corpora either don’t exist or aren’t readily available to scholars who don’t have access to password-protected university servers, that slows down research. At this point, Digital Humanities scholars have done impressive work analyzing large corpora of literature, but very few have applied computational methods to specifically African American texts. My hope is that this corpus might nudge more people to try. 


What’s included in the Corpus: 

In its current form, the corpus contains a mix of poetry and prose (for convenience, I’ve indicated whether a text is poetry or fiction in the title of each file). I’ve excluded slave narratives and other texts that are clearly not literary. (A large number of North American Slave Narratives are, in any case, collected here.) 

I included poetry alongside fiction in part because many of the topics historically-minded scholars might be interested in from these materials can be found in both formats. Many Black poets from this period wrote occasional poetry connected to historical events, including the Civil War and Emancipation, the Spanish-American War, World War I, the "Red Summer" of 1919, and so on. Admittedly, this mixing of formats might cause problems when studying these texts using certain software platforms (i.e., poetry and prose will be tokenized differently; they also need to be classified differently when doing word frequency types of queries, and sentence-length queries won't be useful). 

For convenience, I've also created folders with "Just Poetry" and "Just Fiction" from the collection in the Google Drive folder link above. 

Gender issues: It might also be worth noting that during this time period there were many African-American women publishing poetry -- but not as many who published fiction. (The reasons for this are beyond the scope of a brief announcement.) Still, including poetry can also be seen as an intentional choice -- designed to include writing by women in the field of view. It's also an invitation to other scholars using these materials to work with writing by women. 

Users of this corpus who disagree with my choices are welcome to modify the selection when they design their own queries. I would also welcome any and all feedback. 

Honoring Black Writers / Expanding the Canon:
I’ve been inspired by the statement the Colored Conventions Project asks users to agree to when they download the CCP corpus, especially the first three principles:

  • I honor CCP’s commitment to a use of data that humanizes and acknowledges the Black people whose collective organizational histories are assembled here. Although the subjects of datasets are often reduced to abstract data points, I will contextualize and narrate the conditions of the people who appear as “data” and to name them when possible.
  • I will include the above language in my first citation of any data I pull/use from the CCP Corpus.
  • I will be sensitive to a standard use of language that again reduces 19th-century Black people to being objects. Words like “item” and “object,” standard in digital humanities and data collection, fall into this category. (Link)
While I don’t ask users of this collection to sign an analogous statement, I encourage all users of these materials to adhere to the spirit of the request made by CCP of the users of their corpus. My goal in doing this type of work is to recognize and validate the work of African American writers as important contributors to world literature. One of the ways we can do that is to consider the work at scale, using computational tools like text analysis and stylistics.

Digital Humanities Exhibits at #MSA18: An Annotated Overview

I'm at the MSA this year to talk about my Claude McKay project as part of a Digital Exhibition.

The format is unusual: in one of the main halls of the conference hotel, the organizers set up large-ish monitors. Presenters bring their own laptops and, for a single morning of the conference, demonstrate their work to conference attendees as they come and go from regular panels. You don't give full-length talks, but that makes sense for many digital projects -- the open-ended format allows you to be more interactive and exploratory than is possible in a conventional conference talk.

Here are some of the exhibits that were on display at #MSA18 with my brief annotations:

Mapping Expatriate Paris. I got a chance to talk to Clifford Wulfman and Joshua Kotin from Princeton, who have been building a polished, very useful site based on Sylvia Beach's lending library records at Shakespeare & Co. bookstore. She kept the lending library records for many users. These contain books signed out but also the addresses of members of the lending library. One interesting discovery: many of the users of her lending library were actually not poor, left-bank bohemians, but members of the French upper class. (Check out this page to see a map and discussion of the left bank/right bank addresses of Shakespeare & Co. lending library patrons.)

Modernist Archives Publishing Project. I got to talk to Alice Staveley of Stanford about this project. It's an impressive archive of the output of the Hogarth Press -- its books, but also secondary materials like account books and correspondence. There was much more printed by the Hogarth Press than just Virginia Woolf and the Bloomsbury mainstays; among the many authors and texts I'd never heard of were a number of Indian authors whose works I'd like to explore. Many of the texts in this archive might be technically under copyright, but many of the authors' families have granted permission for the digital presentation of their works. I was impressed by the level of care and attention to details in this well-funded project.

Marianne Moore Digital Archive. The majority of Marianne Moore's poetry is under copyright, but this site is planning to put forward some really interesting ancillary materials, including Moore's notebooks and the Marianne Moore Newsletter, which contained sketches Marianne Moore made in her notebooks as well as analysis and rare historical-biographical engagement with the author.

Modernist Networks (Modnets). I didn't get a chance to talk to the folks doing this project in person but the goal is pretty clear -- they're aiming to be a hub for modernist studies digital humanities projects and also a kind of vetting / peer-review mechanism along the lines of what we see with sites focusing on earlier periods. Currently they have 59 federated sites and links to more than 78,000 objects. (I will submit my own project to them for peer-review / federation once it's a little further along.)

Modeling Modernist Studies (Topic Modeling Modernism/Modernity). Jonathan Goodwin's interesting topic modeling project exploring keywords and concept-clusters in the flagship journal of Modernist Studies. It's a continuation of a kind of meta-scholarly analysis he was doing earlier with his modeling of the language of MLA job listings. I got a chance to talk with Jonathan about the project and I hope to play around more with some of the newer topic modeling tools he's been using at some point. 

Modernism in Baltimore: A Literary Archive. I did not get to talk to the folks behind this project. Still, the idea here seems fairly straightforward -- they're collecting artifacts and historical materials related to literary modernism in Baltimore (the contributors also appear to have an interest in architecture and the arts more broadly). As of now the home for this is a Facebook page, though some resources are stored at Baltimoreheritage.org.

Routledge Encyclopedia of Modernism / Linked Modernisms. Stephen Ross, whom I met at DHSI last year (he teaches at University of Victoria and is currently President of the MSA), is the general editor of the Routledge Encyclopedia of Modernism (a large-scale digital / subscription-based encyclopedia project). He's been taking the metadata generated by that project to produce an open (non-paywalled) resource called "Linked Modernisms." As of this morning the main link for the project seems to be broken, but you can read about the project here.

Open Modernisms. Another project from the University of Victoria. It's a collection of modernism studies syllabi. At this point just starting out, it looks like. (But I have some syllabi I want to send them... Readers, consider contributing!)

I enjoyed talking to Brandon White of UC Berkeley about his project using WordNet and NLTK to analyze the plot and evolving thematics of Faulkner's Absalom, Absalom! (incest, bigamy, race/miscegenation...). I don't see a link to his Compson project online, so I don't think it's public research yet.

William S. Burroughs Digital Manuscript Project at  Florida State University. Unfortunately the Burroughs archive is, for copyright reasons, largely behind a password-protected firewall. But I got to talk to Stanley Gontarski and Paul Ardoin about the project at length, and I was really impressed by the level of attention and care they have put in -- there are some really powerful tools for analyzing and comparing versions and studying Burroughs' intertextuality. In short, a really powerful resource for serious Burroughs scholars. (Anyone reading this interested in using the site should contact the site editors; they can get you a temporary password to access FSU's amazing Burroughs materials.)

Using a Visual Understanding Environment to Understand H.D.'s Networks of Influence. Celena Kusch is co-chair of the international H.D. Society. I got to talk to her about using a software package called the Visual Understanding Environment to study the social network around the writer H.D. Fascinating project and a software package I definitely want to explore a little myself, perhaps for my Kiplings project.

American WWI Poetry Digital Archive. I talked to Tim Dayton of Kansas State at length about this excellent archive of more than 400 books of American World War I poetry. This morning, unfortunately, I can't seem to find a link to the project itself anywhere. (I think this project is currently being migrated from Scalar 1 to Drupal or perhaps Scalar 2.)

*

My own Claude McKay project was a modest first version of a site that will eventually have more primary texts (the two Jamaican collections of poems are coming soon!) and more robust network diagrams (probably using Gephi down the road). It was gratifying to talk about the work with a number of people walking by my booth; thank you to everyone who took a few minutes to stop by and say hello. Most people seemed to get it, and saw the value of the network diagrams / thematic tagging that I and my graduate students have been doing.

Claude McKay: New Site, Expanded Project (w/Network Diagrams)

Harlem Shadows: Claude McKay's Early Poetry
http://scalar.lehigh.edu/mckay/

I've recently been working on rebuilding a collaborative class project on Claude McKay's Harlem Shadows in the Scalar platform. As I've been putting the new site together, I've also been adding fresh material to the project, including a number of McKay's early political poems. (I've also been using Scalar for my Kiplings and India project.) It's a powerful platform, especially with regard to metadata, annotations, and tagging. It's also designed to allow you to create multiple "paths" through overlapping material. In McKay's case the Paths feature comes in particularly handy as he tended to publish the same poems in different venues; it's revealing to see which poems he tended to republish and which he quietly "put away."

The new site can be accessed here. I would particularly recommend readers play around with the Visualizations options on the menu at the top corner of the screen.

Here is the text of some new material I've added to the site, analyzing, in a very preliminary and informal way, a couple of network diagrams I generated using Scalar's built-in visualization tools.

* * *

Below I'll present two different network diagrams I've derived from Scalar's built-in visualization feature. One looks at the clusters created by thematic tags, the other looks at the relationship between poems published in different venues.

Skeptics of Digital Humanities scholarship sometimes see objects like network diagrams and wonder what they might tell us that we don't already know. And indeed, even here, to some extent, the diagrams below do show us visually some things we might have been able to intuit without the benefit of this tool.  I should also acknowledge that the thematic tags we have been using are somewhat subjective. We have the poem "A Capitalist at Dinner" tagged by "Class" but not by "Labor." Others might structure these tags differently and end up with diagrams that look different. 

That said, there are some surprises here. In McKay's poetry I'm especially interested in thinking about the connections between the two streams of his writing from this early period, which we might loosely divide into a) political poems (including race-themed poems and Communist/worker-themed poems) and b) nature-oriented, pastoral and romantic poems. At least in terms of publication venue, there is quite a bit of overlap between these two broad categories. McKay excluded the most directly Communist poems from his book-length publications, but he included—often at the urging of his editors—poems expressing decisive anger at racial injustice in American society. And even in the body of poems published in magazines like Workers Dreadnought there are hints of the nature themes in poems like "Joy in the Woods" and "Birds of Prey." The network diagrams show us a series of other poems as well at the "hinge" between the two clusters. These poems might be particularly worthy of special attention and study in the future. 


A. Thematic Tags.

Take a look at the following network diagram showing the relations between a limited set of thematic tags, generated by Scalar using the built-in visualization application. The image below is a static image, but if you click on VISUALIZATIONS > TAG on the menu in the corner of this site, you'll get a "clickable" diagram that is also live and manipulable. The body of poems included here is comprised of all of the poems from Harlem Shadows as well as about fifteen of the early poems not included in Harlem Shadows.



(See the full-size version of this diagram here)

What does this diagram show? First, we should note that the red dots show tags, while the orange dots show poems. As of November 2016, only eight thematic areas have been tagged: Race, Class, City, Nature, Home, Sexuality, Homoeroticism, Labor. (More Tag information from the earlier, Wordpress version of this site is currently in the Metadata for individual poems, and is discoverable using the search function on this Scalar site. Try searching for "Birds," for instance.) 

What Can We Learn? 

1. Thematic Clusters. First and most obviously, certain themes are "clustered" together. Nature and Home have many overlaps, and thus appear clustered. Sexuality and homoeroticism also form a cluster. And finally, the tags focused on Class, Labor, and city life also form a natural cluster, though the clustering is significantly less tight than the others.

2. Centrality of Nature. An obvious discovery is that "Nature" is one of the most common tags in McKay's early poetry. This was a surprise to the students in the Digital Humanities class (given that we think of McKay as a black poet with militant/leftist politics, we might expect those themes to be more dominant). Of course, many of the poems marked "Nature" also overlap with race, class/labor, or sexual/queer themes. The surprise in finding so much discussion of Nature—and specifically McKay's interest in writing about birds—might remind us that we actually need to read a poet's poems before rushing to narrowly define them (i.e., as a black, political poet). (I would encourage visitors to look at Joanna Grim's essay exploring the "bird" theme in Harlem Shadows.)

3. Home. Many of McKay's poems in this period thematize his memory of life in Jamaica. Thus, a few of the poems (for instance, "The Tropics in New York") reflect McKay's nostalgia for his pastoral upbringing from the vantage point of someone now living in a much larger, modern urban setting. 

4. Poems with three or more tags. I'm interested in the poems that presently have three or more tags: "The Barrier," "The Castaways," and "On the Road." These are poems that scholars may not have paid much attention to in the past, but diagrams like the one above might lead us to think of them as newly important as they bridge some of McKay's most important themes from this period. (Again, the number of tags is a bit arbitrary and at present an artifact of the way metadata has been tagged. At most this information might nudge readers to pay a bit more attention to some poems rather than others, not to make any sweeping conclusions about the poems as a whole.)

I would encourage users of this site to play with the live visualization tool and send me (Amardeep Singh) any screen captures that seem interesting or telling. 


B. Publication Venues

This diagram is a bit messier. It contains nodes for publication venues (which are organized on this Scalar site using "Paths"). These appear in light blue in the diagram below. Users can access a "live" version of the diagram using VISUALIZATIONS > CONNECTIONS in the menu in the corner above. 



(See the full size version of this diagram here)

What do we see here? (Note: the blue dots represent publication venues. The red dots represent thematic tags. The orange dots represent individual poems. The green dots are media files uploaded to this site. Readers should probably try to ignore the green dots.)

Essentially there is a larger cluster around Spring in New Hampshire and Other Poems and Harlem Shadows, and a smaller cluster around the Workers Dreadnought path and the Early Uncollected Poetry path I've constructed on this site. Perhaps not all that surprisingly, the sexuality and homoeroticism tags are almost entirely disconnected from the labor- and class-oriented poetry published in magazines like Workers Dreadnought. But there are some poems right in the middle between the two clusters that seem especially interesting to consider -- poems like "Joy in the Woods," "The Battle," "Summer Morn in New Hampshire," "Birds of Prey," and "Labor's Day" that appear with strong connections both to the "Nature" tag and to the "Class" and "Labor" tags. Though few of these poems have been looked at closely by critics, they are in some ways the key to understanding the two major aspects of Claude McKay's poetry in this period. 

Group Project: Sentiment Analysis of Poetry in Python (DHSI 2016)

I took a one-week course on Coding Fundamentals at DHSI 2016 with Dennis Tenen (Columbia University) and John Simpson (University of Alberta). You can see the syllabus for the course here.

Let me start with a quick plug for Dennis Tenen's group 
part of the project involved teaching all of the young people in the class the coding they would need to build a Twitter bot. The Bot is currently not active, but the stream it produced over several months is well worth a look.

*

Why coding? I wanted to get started with coding because it seems to be one of the major dividing lines between people who can chart their own independent course through the digital humanities and people who work with ideas and tools developed by others. It's not the be-all, end-all, of course (as I've said before, you can do so much now with off-the-shelf tools), but some experience with coding seems like it could be really helpful for projects that don't quite fit the mold of what's come before.

The class itself was intense, frustrating, and sometimes really fun. I'm not going to lie: learning how to code is hard. I can't say that I will readily be able to start spitting out Python scripts after four days of working with the language, but I might at least be able to figure out how to a) do some simple scripts to process batches of text files that otherwise require repetitive, laborious work, and b) use libraries of code developed by others in Python to do more advanced things.
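To give a sense of what (a) might look like in practice, here is a minimal sketch of a batch script that counts the words in every plain-text file in a folder. It is an illustration under stated assumptions, not something from the course: the folder name `poems` and the function name `count_words_in_folder` are hypothetical, and it assumes the files are UTF-8 encoded `.txt` files.

```python
from pathlib import Path


def count_words_in_folder(folder):
    """Return {filename: word count} for every .txt file directly inside folder."""
    counts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumes UTF-8; other encodings would need a different argument here.
        text = path.read_text(encoding="utf-8")
        # Splitting on whitespace is a rough definition of a "word".
        counts[path.name] = len(text.split())
    return counts


if __name__ == "__main__":
    # "poems" is a hypothetical folder name used only for illustration.
    for name, n in count_words_in_folder("poems").items():
        print(f"{name}: {n} words")
```

The same loop could do any repetitive per-file task, such as stripping headers or renaming files, by swapping out the two lines that read and count the text.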

*