Data curation as publishing for digital humanists

This is an edited version of the presentation I gave at the CIC Center for Library Initiatives Annual Conference, May 22-23, 2013. I was delighted to be part of a panel with Matt Gold (CUNY) and Matthew Jockers (Nebraska) on "Digital Humanities, Alternative Publishing Needs of Faculty." Since I wrote much of the text of the talk I presented, I thought I would share it here, somewhat edited. (I don't usually write the "prose" of my talks beforehand.) My original slides are available on Speaker Deck with all of the leaps, shorthand, and repetitions inherent in being a product intended for verbal delivery.

I want to extend my thanks again to all the staff of the CIC Center for Library Initiatives and to the members of the Program Committee for the 2013 Annual Conference for inviting me to speak. The University of Maryland will become part of the CIC on July 1 of this year and attending this conference was a great way to preview some of the exciting work happening in the libraries of the CIC institutions.

UPDATE: This piece was later published in the Journal of Digital Humanities 2.3 (Summer 2013).

Title slide: Data curation as publishing for digital humanists

When I was asked to participate on this panel about digital humanities and the alternative publishing needs of faculty, I felt obliged to temper my delighted acceptance of the invitation with a caveat that "scholarly publishing" (and its associated challenges) is not something I usually see as central to the work I do. One of the things I work on, however, is data curation, and the theory and practice of data curation should be relevant to conversations about "emerging options for scholarly publishing." So, to address the theme of this conference and this panel, I would like to talk about data curation as publishing. The work of curating data—the activities required to maintain the usefulness of information produced as part of research—should be legible as "publishing" work in much the same way that well-understood tasks related to preparing and circulating monographs or journals are publishing work.

Data curation as a "publishing" activity is increasingly relevant to the working lives of digital humanities scholars. Moreover, articulating connections between "publishing" and data curation is important in the context of the strategic decisions libraries might make and, in fact, are making about how to participate in "publishing." Data curation as publishing is publishing work that draws directly on the unique skills of librarians and aligns directly with library missions and values in ways that other kinds of publishing endeavors may not.

In referring to "data curation" I am speaking specifically of information work that integrates closely with the disciplinary work practices and needs of researchers in order to "maintain digital information that is produced in the course of research in a manner that preserves its meaning and usefulness as a potential input for further research" (Munoz and Renear 2011). Data curation is "the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time" (Cragin et al. 2007). This distinguishes data curation from many near synonyms: digital curation, digital stewardship, digital preservation. Hopefully, this high-level description of what curation activities enable—discovery, assurance of quality, "added value", and re-use—also suggests the points of connection (via their similar ends) with other activities that are considered part of "publishing." I also want to emphasize—for the purpose of making clear what kind of project library-data-curation-as-library-publishing needs to be—that data curation work is active (activist?) and that it is informationist work.

The link between data curation and publishing is not new. Joyce Ray, Sayeed Choudhury, and Mike Furlough presented a paper in 2009, summarizing several strands of contemporaneous work. The paper was entitled "Digital Curation and E-Publishing: Libraries Make the Connection". [ed. note: To suggest the multiple connections between data curation and publishing, I also pointed to a symposium on the "Now and Future of Data Publishing" taking place at Oxford the same day as the CIC Libraries conference.]

What is new—or at least newer—is data curation (in the sense above) as a part of humanities research. Digital humanists in particular are becoming increasingly aware of data curation issues and data curation needs as part of the way they (we) work. This element of digital humanities work is becoming prevalent enough that I've selected examples casually from things that I've come across in my professional networks and feeds recently. Lincoln Mullen, a PhD student at Brandeis University, posted on his blog about using the statistical programming language R for historical research. As part of his discussion, Mullen describes how he converted the tables from a monograph he found in his research to a series of comma-separated values (CSV) files in order to produce graphs and charts of the changing demographics of American religion. Along with his analysis and the blog post about his methods, he posted the (small) data set to GitHub, a platform for sharing open source software code and open data. Ted Underwood, Associate Professor of English at the University of Illinois, has made the work he and a graduate assistant have done building, cleaning, normalizing, and labeling a data set drawn from the HathiTrust corpus a significant part of the output of his "Uses of Scale" project and other professional presentations. It is also increasingly common to see the release of open data sets as enticement to attract digital humanists to work on particular sets of questions, or in partnership with cultural heritage organizations—see, for example, the IndexCat data from the National Library of Medicine, a small collection of catalog records for a historical library of children's literature, data from some of the crowdsourcing projects run by the New York Public Library, the Smithsonian Cooper-Hewitt, National Design Museum collection data, and many more examples.

At least part of the professional activity of the digital humanists and organizations above involves making data available and suitable for re-use. As any of the researchers involved would no doubt say, curation of these data sets takes time, effort, and money. Libraries getting involved to help digital humanists do this kind of work would be offering something of value. This would be "publishing" not only in the sense of registering and "making public" a product of scholarly work, but also in the sense of ensuring quality and disseminating outputs to interested communities. (Thanks are due to Shana Kimball for prompting this extension of the argument in discussion after my original talk.) By recognizing data curation work as a publishing activity, libraries would have a "market opportunity" to address unmet needs in the digital humanities community (among others).

In the paper by Choudhury, Furlough, and Ray mentioned above, the authors describe how data curation and publishing can be mutually reinforcing activities. They write:

we have on the one hand, a community, or a subset of several communities, that has been working on the “back end” of digital production from the generation of raw data to the construction of an organized product that can be accessed, and, on the other hand, another community—publishers—who work on the “front end” of scholarly communications, from manuscripts to publication.

"Making the connection" involves bringing these communities together as complementary elements of a service portfolio that will help libraries justify their funding and their relevance amid changing scholarly practices. This is a good argument and some innovative libraries (among them Penn State, Johns Hopkins, and Purdue) seem to be having some success with this as a strategy. I would argue that it is possible, even preferable, to treat the connection between data curation and publishing as being more fundamental. Data curation is publishing—a form of publishing especially for digital scholarship—and libraries interested in investing in "publishing" as an innovative activity should take some of the resources allocated for such endeavors and devote them to paying for data curation work.

The discussion of "back end" and "front end" by Choudhury, Furlough, and Ray places the connections between data curation and "publishing" in the context of lifecycle models of data (this is explicit in the paper) and lifecycle models of existing scholarly publications like journals and monographs (from author to editor, publisher to library, etc.). As such, while the mutual reinforcement of curation and publishing is emphasized, the recommendations as to what activities libraries and publishers should undertake are (somewhat disappointingly) familiar. Publishers add value to end products through peer review and high quality production and presentation. Libraries standardize and preserve these outputs and continue to make them available to a community over time. Treating data curation and publishing as kindred services may offer the prospect of expanding a library's stable of "innovative" offerings while not straining resources because there are management efficiencies in having both the "front end" and "back end" people in the library. However, in this model, neither libraries nor publishing seems truly transformed and this is a problematic mismatch when so many other aspects of scholarly work are being transformed.

So there is a need to step outside the lifecycle model inherited from other kinds of scholarly publications. The products, work practices, and exchanges involved in doing data curation as a publishing activity will look different from those involved in earlier kinds of publishing. However, since data curation work still fulfills the ends of registering, making public, ensuring quality, and disseminating to potential users, data curation should still be legible as publishing. In thinking of data curation as publishing, it is important to understand that this is not exactly the same as data publication.

Slide: Data curation as publishing is not the same as data publication

In a recent publication in Data Science Journal, Mark Parsons and Peter Fox explore "data publication" as a metaphor for the kind of things that scholarly communities want to see happen with data. They explain that "Data Publication builds from the familiar and conceptually simple model of scholarly literature publication" and they capitalize the terms deliberately to indicate the status of this phrase as "a recognized metaphor and data management paradigm." Parsons and Fox's paper elaborates on some significant problems in adopting this metaphor. In the limited space available I want to focus on just one of these problems. Parsons and Fox note that under the model of Data Publication "publishers are distributed and can act autonomously or in concert." Thus, they write, "there is ... little emphasis on data discovery and interoperability across systems. Data are often presented as they were created without explicit considerations of data integration or significant reuse. … The attention is on preservation and formal recognized scholarly contribution with less attention to … issues such as latency, rapid versioning and reprocessing, and computational demands." To understand data-curation-as-publishing (which I'm advocating as a way to serve digital humanities scholars) only as "Data Publication" expands recognizable publisher and library activities to a new class of scholarly objects (data) but in many ways perpetuates the (flawed) status quo. Libraries becoming data publishers has many of the same flaws as the model of libraries becoming journal and monograph publishers.

Within the critique of Data Publication there are glimpses of what it could mean to treat the activities of data curation as "publishing" activity in a way that would benefit both scholars and libraries. The first part of Parsons and Fox's critique is that under the model of "Data Publication" there is "little emphasis on data discovery and interoperability across systems." Various examples from the media landscape suggest the truth of this claim. In the realm of ebooks, the importance of outlets like Amazon and other digital dissemination channels has recently forced publishers to pay greater attention to "discovery" and to devote more resources to things like metadata, but at the same time, the fracturing and proliferation of ebook reading platforms is an ongoing example of problems of interoperability across systems in a publishing marketplace. (There is a similar shape to the story of the relative fortunes of the on-demand video company Netflix and various real or rumoured video platforms implemented by specific studios or content creators.) This leads to the question of whether a lack of emphasis on discovery and interoperability is intrinsic to the business of publishing (presumably because the energies of publishers are directed elsewhere, to activities considered more vital to mission and survival). Attention to "discovery" and to related issues of interoperability across systems is, by contrast, a traditional and persistent feature of library work. The risk is that the more the library acts as a publisher, the more it may drift away from the valuable work it has done in the past. The flip side of this point is the opportunity, expressed in Choudhury, Furlough, and Ray's piece, to excel where traditional publishers have not. However, this alignment, just having both the "back end" and the "front end" of the process in the library, may not be sufficient to avoid falling into the trap of neglecting discovery and interoperability if Data Publication is the governing metaphor rather than data curation being the predominant action.

This leads to the next part of the critique—that, in a model of Data Publication, "data are often presented as they were created without explicit considerations of data integration or significant reuse." Data being "presented as they were created" sounds like a description of researcher self-deposit into (institutional) data repositories—currently the most common form of library engagement with data curation. That Parsons and Fox single this problem out in a discussion of why Data Publication is a problematic metaphor from the perspective of solving the real information needs of researchers suggests that while the provision of institutional data repositories is necessary and important it is not sufficient to support scholarship. So, libraries cannot stand pat; they cannot maintain only the "back end" of these processes but must make the connection to more active engagement. Libraries also cannot just adopt a position of becoming data publishers (via repository provision) in the way some are seeking to become journal publishers through the use of platforms like Digital Commons and similar initiatives. Data spread across institutional repositories becomes like a fragmented ebook market spread across proprietary reading platforms.

It is worth noting too that issues that a Data Publication model does not easily encompass—"issues such as latency, rapid versioning and reprocessing, and computational demands"—resemble precisely the kinds of demands that digital humanists are likely to make in the course of trying to do their work.

The recommendation, then, is to treat data curation activities as "publishing"—worthy of new enthusiasm and new resources from libraries—but to be wary of framing the endeavor as "data publishing" (an analog to journal and monograph publishing). What form could this actually take? First, I return to part of the definition of data curation offered above: "curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time." Many of the kinds of work that librarians do meet this definition: creating metadata, building catalogs, developing and refining indexes, building, organizing, and maintaining collections. The extension of library, archive, and informationist practice into new forms of work also applies here: aggregating data, cleaning and normalizing data, annotating data with controlled vocabularies and ontologies. Second, and offering a specific example from the digital humanities, data-curation-as-publishing could look something like the Alexandria Archive Institute's Open Context project. Open Context works on the review, documentation, and publication of research data, mostly in the discipline of archaeology. The first heading on the "About" page of the project web site speaks of "data sharing as publication" and a flavor of the work the project carries out can be gleaned from the editors' blog. Open Context is hosted and administered by the non-profit Alexandria Archive Institute and thus represents a kind of freestanding example of an organization doing data-curation-as-publishing. This recalls an interesting remark that Choudhury, Furlough, and Ray make in passing. In describing the creation of the Data Conservancy architecture and service at Johns Hopkins, they write: "It is especially important to note the role of a particular individual at AAS who acted as the human “interface” between the various players. This individual could easily be classified as a “data scientist” – an individual with knowledge of a specific domain or discipline yet also a deep knowledge of data management." They go on to remark that "libraries would be wise to consider developing such expertise and capacity in-house." I contend that Open Context, and its editors, represent another example of this kind and that libraries should be figuring out how to set up and host such activities. At the University of Maryland Libraries, those working on data curation are beginning to make the case to subject selectors (who control collection budgets to support various disciplines) to spend collection funds on curation work for significant data sets. These discussions are still at early stages—there is a lot to figure out, including what specifically should appear on "the invoices" for the data curation work that selectors are being asked to pay for—but libraries that wish to engage seriously with support for data-intensive research (like the digital humanities) will increasingly need to sell and buy such services.

Slide: Data curation as publishing aligns directly with library missions and values

Finally, why argue for framing this work and these transactions around data curation as "publishing" activity? Because I think data curation activities are fully legible as "publishing"—meeting the same ends and goals and potentially contributing to scholarship in the same kinds of ways. Also because "library publishing" is a site of buzz and activity and potential investment. Despite how it might sound, this is the opposite of cynical. I would argue that if libraries are going to invest resources in "publishing" then that money should be spent partly on doing data curation work because data-curation-as-publishing offers the most value to both researchers and libraries. (Note: my fellow panelists Matt Gold and Matthew Jockers also offered compelling visions of how to deploy some of the "publishing" resources to support digital humanities.) Data-curation-as-publishing is the right form of publishing for libraries to be in because the work of data curation aligns with libraries' missions and values in ways that other kinds of publishing ventures do not. (There is much about scholarly "publishing" as it exists now that is not about making knowledge public or ensuring quality of that knowledge or disseminating it to those who need and could use it. There is a great deal of "publishing" that is about issues of prestige, labor, and equity of the disciplinary professions. In my opinion, libraries don't really have a dog in that fight and shouldn't spend resources trying to fix those problems.) In a recent paper in the library and information science literature on assessing data value, Carole Palmer, Nic Weber, and Melissa Cragin remind us that "the library and information science meta-science perspective articulated by [Marcia] Bates (1999) has always been fundamental to the role of providing broad, useable information collections and services, especially to support interdisciplinary research." Doing data curation work (like that described above) needs the unique training and skills of librarians and other information professionals and it supports the goals and values of the profession in making information accessible and usable to communities of users who need it. Making data curation fully legible as publishing, and investing in data-curation-as-publishing, can help make problems of data discovery, interoperability, and re-use less daunting and show a clear way for the library to be a publisher in ways that research communities like the digital humanities need.