Session 1-2

Defining Literary Data: An Information Theoretical Approach

  • Devin Higgins (Michigan State University)
  • Thomas Padilla (Michigan State University)
  • Arend Hintze (Michigan State University)

In addition to forming a piece of the lasting and living embodiment of the cultural heritage of humanity, literature also comprises a form of data. The features of this data are precisely what define the “literary” as such. In order to “understand the structural continuity of the step from information to literature and back again...[and] to grasp the nonuniqueness of literature in an absolute structural sense,” that is, to specify a difference of degree rather than kind between literature and other forms of data, it will be necessary to isolate and define the features of the literary band of the data spectrum in as nuanced a way as possible.[1]

To isolate and define features of literary data, the authors will employ a range of information-theoretical techniques to analyze literary text and find distinguishing patterns. An algorithm developed at Michigan State University to study information novelty in DNA sequences can equally be applied to strings of arbitrary text. This algorithm has been used to quantify how much information Twitter users generate on a daily basis, and will be adapted to measure patterns of information novelty over the duration of a text. Do literary and non-literary texts give rise at similar rates to new tokens of length n, to new words, and to new concepts? How does the frequency with which new n-long character strings appear vary by genre? Comparing corpora of literary texts (as defined in published bibliographies) against random subsets of non-canonized texts published in the same era and location may reveal distinctive patterns of information.
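As a rough illustration of what measuring novelty "over the duration of a text" could look like, the sketch below tracks, window by window, the fraction of n-long character strings that have not appeared earlier in the text. It is a simplified stand-in written for this abstract, not the Michigan State University algorithm itself; the function name novelty_curve and the parameters n and window are hypothetical.

```python
from typing import List


def novelty_curve(text: str, n: int = 3, window: int = 1000) -> List[float]:
    """Return, for each consecutive window of n-grams, the fraction that
    had never been seen earlier in the text.

    Assumption: "novelty" is approximated as first occurrence of a
    character n-gram. The actual algorithm may define it differently.
    """
    seen = set()          # every n-gram encountered so far
    curve = []            # one novelty fraction per window
    new_in_window = 0     # first-time n-grams in the current window
    count_in_window = 0   # n-grams read in the current window

    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if gram not in seen:
            seen.add(gram)
            new_in_window += 1
        count_in_window += 1

        if count_in_window == window:
            curve.append(new_in_window / window)
            new_in_window = 0
            count_in_window = 0

    # Report the final, partially filled window if there is one.
    if count_in_window:
        curve.append(new_in_window / count_in_window)
    return curve
```

Under these assumptions, curves computed for a literary corpus and for a matched non-literary corpus could be compared directly, and rerunning with different values of n would show how the pattern changes with string length.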

In addition to literary and non-literary corpora, these algorithms will also be used to assess highly formalist works which employ diverse lexical and structural strategies to achieve new literary effects, such as works originating from the OuLiPo, and Christian Bök's Eunoia. How will radical approaches to form reveal themselves within the text as data? Are algorithmic literatures particularly amenable to computational analysis?

A scientific theory of literature existed as a premise for literary analysis throughout the 20th century. Formalists, structuralists, semioticians, and others used scientific techniques to describe and formalize aspects of language particular to the literary domain. Advances in computing and the advent of information theory in the middle of the century led to perhaps the first conference on literary data in 1964.[2] Broadly construed, these approaches to literary studies have been characterized as reductive, yet the impulse to structure and categorize the text was not intended to ascribe unquestionable stability to literature. Rather, every text “manifests to some degree features of a work of art.”[3]

The “literary” exists on a spectrum, and the motive of the information-theoretical approach is not to categorize, but to trace the signal.

In order to perform this analysis, appropriate datasets will need to be collected. The proposed project will describe methods of extracting relevant texts from the portion of the Google Books dataset that MSU hosts locally (approximately 3 million volumes). Using linked data to sort and extract texts in the dataset according to non-bibliographic forms of description will allow the authors to create textual datasets based on an expanded list of parameters, and with greater granularity, than is currently possible using existing search and retrieval functions.

The presentation at JADH will share techniques for the creation of these worksets, in addition to describing the results of new investigations into the informational content and form of literary text construed as data.

information theory, data, literary studies, novelty


  1. Parker, Terence quoted in Hayot: “What is Data in Literary Studies?”
  2. Literary Data Processing Conference proceedings, September 9, 10, 11 (1964). IBM, Data Processing Division (1964).
  3. Lotman, Yurij M.: “The Future for Structural Poetics,” Poetics, Vol. 8, No. 6 (1979).