Since I started teaching short courses on humanities data curation on a semi-regular basis (first as part of MITH's digital humanities training institute and then as part of the Digital Humanities Data Curation institute), I've been looking around for suitable hands-on exercises to help people "get a feel for" different aspects of the work involved in curating data in a humanities context. Maintaining the usefulness of data to researchers can involve planning, describing, building collections, and even tasks that shade into digital preservation, like migrating data to new media. Curation can also involve “cleaning,” normalizing, and reconciling (what we might call “munging” data), probably (hopefully?) for the purpose of creating better search, retrieval, or indexing. The open data generated by the New York Public Library's What's on the Menu? project has been a great testbed for experimenting with these latter kinds of practical curation work.

Lydia Zvyagintseva, a Master's student from the University of Alberta visiting MITH for a practicum, did great work exploring possibilities for how additional, curator-generated facets for browsing data about events and locations could add research value. In the most recent Digital Humanities Data Curation workshop (which I am fortunate to co-teach with Dorothea Salo and Julia Flanders), we worked through exercises on significant properties and potential user needs related to the menus data. In this post, I’ll describe some more exploratory work I’ve done since the most recent workshop.

Beyond Access and Preservation

What's on the menu? is a useful curation testbed because the New York Public Library (NYPL) has already done a generally excellent job providing access to the data. The menu transcription project has been running since 2011, and it has been a wild success. Volunteers have used the site to transcribe almost 17,000 digitized historic menus (as of the time of this post). NYPL makes all the menu data available for bulk download and also provides an application programming interface (API).

Screenshot of the front page of the NYPL's menus site as of August 2013

Beyond this impressionistic sense of “good access” to the data, we can evaluate the NYPL’s arrangements for access according to a commonly used quality measure like Tim Berners-Lee's 5 Star Linked Open Data scale. The What’s on the menu? data set scores fairly well by this measure: somewhere between 3 and 4 stars on the 5-star scale. The data is available on the web in a machine-readable, structured, non-proprietary format (stars 1-3). The criteria for the first star also specify that data should be distributed with an open license, and here we could quibble a little with the existing provisions for access. The “Data” page at the What’s on the menu? site states that there are “No known copyright restrictions on this material” but asks those who use the data to “credit The New York Public Library as source on any applications or publications.” The phrase “no known copyright restrictions” echoes the language of the Public Domain Mark suggested by Creative Commons, but it’s not entirely clear that NYPL’s intent with this data is wholly the same as that underlying the Public Domain Mark. Perhaps formally using the Public Domain Mark would help clarify that this is truly open data? (I offer this suggestion tentatively because I know that NYPL has some very good copyright advisors, and so on the whole I think we can give What’s on the menu? at least 3 open data stars.) NYPL also uses HTTP URIs as identifiers for things, which is part of the criteria for 4-star linked open data, but the data is returned via the API in either JSON or (a custom) XML rather than using the most-W3C-blessed standards (RDF and SPARQL). For example, I can get back data about a particular dish by sending a request to the URI that identifies it (e.g., http://api.menus.nypl.org/dishes/1860):

{
    "description": null,
    "first_appeared": 1887,
    "highest_price": "$0.85",
    "id": 1860,
    "last_appeared": 1989,
    "links": [
        {
            "href": "http://menus.nypl.org/api/dishes",
            "rel": "index"
        },
        {
            "href": "http://menus.nypl.org/api/dishes/1860/menus",
            "rel": "menus"
        }
    ],
    "lowest_price": null,
    "menus_appeared": 98,
    "name": "Brussel Sprouts",
    "times_appeared": 98
}
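
To make the shape of that response a little more concrete, here is a minimal sketch of how one might read it in Python. It works on the response body shown above, pasted in as a string, rather than calling the API (which may need credentials). The helper name `link_for` is hypothetical and not part of NYPL's API.

```python
import json

# The response body shown above, pasted in as a string for illustration.
response_body = """
{
    "description": null,
    "first_appeared": 1887,
    "highest_price": "$0.85",
    "id": 1860,
    "last_appeared": 1989,
    "links": [
        {"href": "http://menus.nypl.org/api/dishes", "rel": "index"},
        {"href": "http://menus.nypl.org/api/dishes/1860/menus", "rel": "menus"}
    ],
    "lowest_price": null,
    "menus_appeared": 98,
    "name": "Brussel Sprouts",
    "times_appeared": 98
}
"""


def link_for(record, rel):
    """Return the href of the first link whose "rel" matches, or None.

    Hypothetical helper: it only assumes the "links" layout seen above.
    """
    for link in record.get("links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


dish = json.loads(response_body)

# JSON null becomes Python None, so missing prices need no special parsing.
print(dish["name"], dish["first_appeared"], dish["last_appeared"])
print(link_for(dish, "menus"))
```

The "links" array is what lets a client move from a dish to the menus it appears on without constructing URIs by hand, which is the part of this response closest to linked-data practice.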

The highest 5-star rating would apply to data that includes links to other datasets. One objective of further curation work might be to discover and contribute links between the NYPL data and other open data sets to create something like the concordances that the Cooper Hewitt Labs have built for entities in their collections.

In many data curation scenarios the most urgent tasks involve moving data from the original site of creation to a stable environment (like a repository) where it can be preserved, but also where it can be reliably accessed. Neither of these problems is at issue with the menus data, so the potential curator can consider what other activities might improve the usefulness of the data.

What’s a Data Curator To Do?

Looking just beyond the (valid, important) tasks of preservation and basic access that are currently occupying many academic libraries entering the realm of data curation, interesting additional possibilities emerge for constructing what the work of data curation can be. I’m particularly interested right now in work that data curators can do to build secondary and tertiary resources—reference materials, if you will—around data. I mean particularly reference materials that draw on the skills of people with training in library and information science, things like indexes. These types of organized systems of description can be one way to provide additional value over full text search (which, for many kinds of data sets, e.g., a table of numerical readings, is not particularly effective anyway).

How might this apply to the data from NYPL’s menu transcription site? For this exploratory data curation exercise, I’m setting myself the goal of seeing what can be done with the names of various dishes in What’s on the Menu? (surely, one of the main points of interest in this data set). The end product I’m imagining is a good index to the dishes represented in NYPL’s collection of menus. We could have an “authorized form” for each dish, keep track of any alternate forms, and begin to work out categories of related dishes. From this, we could make some headway toward producing linked data from the menus data set—via concordances like the Cooper Hewitt’s—and we could also make our index of dishes available to others as a reconciliation service for cleaning and normalizing other data sets (using tools like Open Refine and the Open Knowledge Foundation Labs’ Nomenklatura).

NYPL has a data set that scores very well on an established scale of openness—the library provides access to machine-readable, structured data in a non-proprietary format—but further curatorial work can still improve the usefulness of this data by ordering and systematizing it at a layer beyond the technical structure of file format. The reason additional curation is needed has to do with the difference between strings and ‘things.’

‘Strings Versus Things’

Advocates of linked open data often use some variation of the phrase ‘from strings to things’ in order to convey the basic motivation behind the technology. A Google search will turn up numerous examples. See, for example, this talk by Mia Ridge at a Linked Open Data in Libraries, Archives and Museums (LODLAM) workshop from last year (2012). As Ridge explains,

Computers think in strings (and numbers) where people think in ‘things’. If I say ‘Captain Cook’, we all know I’m talking about a person, and that it’s probably the same person as ‘James Cook’). The name may immediately evoke dates, concepts around voyages and sailing, exploration or exploitation, locations in both England and Australia… but a computer knows none of that context and by default can only search for the string of characters you’ve given it.

An inspection of the What’s on the menu? data set shows that we’re working with strings. A search for “Brussel sprouts” returns 611 results, including at least 3 that look nearly identical but have different counts for the number of menus on which they appear. For our reliable index of dishes we want to be working with things (where we can leverage those nice HTTP URIs to convey our specific meaning in machine-parseable terms). In the era of Google, this type of variation and duplication in search results is something to which researchers are accustomed, and perhaps it even re-introduces a kind of serendipity. However, this feature of full-text search, which operates on strings, does make asking other questions of the data more difficult.
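
One rough way to see how many strings might collapse into a single thing is to group names by a normalized key. The sketch below is only an illustration under stated assumptions: the sample names are made up, the `fingerprint` and `group_by_fingerprint` helpers are hypothetical, and the normalization (strip accents, lowercase, drop punctuation, sort unique tokens) is similar in spirit to key-collision clustering rather than a description of any particular tool's implementation.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name):
    """Reduce a dish name to a crude comparison key (hypothetical helper)."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, keep only alphanumeric runs, and sort the unique tokens
    # so that word order and punctuation no longer matter.
    tokens = re.findall(r"[a-z0-9]+", ascii_name.lower())
    return " ".join(sorted(set(tokens)))


def group_by_fingerprint(names):
    """Map each fingerprint to the list of original strings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return dict(groups)


# Made-up variants for illustration; not taken from the NYPL data.
names = [
    "Brussel Sprouts",
    "brussel sprouts",
    "Sprouts, Brussel",
    "Brussels Sprouts",
]

for key, variants in group_by_fingerprint(names).items():
    print(key, "->", variants)
```

Note that "Brussels Sprouts" still lands in its own group: a key like this catches differences of case, punctuation, and word order, but not spelling, which is where a human curator (or a fuzzier matching method) has to step in.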

To appreciate the effect that going from strings to things can have, I extended my method of inspection from a single search to the whole data set. The front page of What’s on the menu? proclaims that 1,260,150 dishes have been transcribed to date (this was in late July, so the figure is slightly higher now). A quick look at the downloaded data suggests that it might be more accurate to say there are 1,260,150 instances of dishes in NYPL’s system. There are only 394,871 entries (rows) in “Dish.csv” (again, for the July data), each one representing a “dish” that has been given a unique identifier. (To quickly check myself, I looped through the CSV file and totaled up the values from the “times_appeared” column. The result, 1,257,525 dish instances, is close enough to the published value to confirm my assumption.) So, really, we have 1.2 million instances of 394 thousand types of dishes. Given the example of the Brussel sprouts, I suspect that the number of types is actually lower still. 394 thousand “dishes” is large enough to make for an interesting challenge, but curation of this data to create a reliable index is less than half as big a job as it appears from the web site.
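
The check described above (loop through the CSV, count the rows, and total one column) can be sketched roughly as follows. This is an illustration rather than the script actually used: the function name `count_dishes` is hypothetical, the column is assumed to be named "times_appeared" (matching the field in the API response earlier), and the sample rows are made up so the example runs without the real download.

```python
import csv
import io


def count_dishes(csv_file, column="times_appeared"):
    """Return (number of rows, total of `column`) for a dish CSV.

    Hypothetical helper; blank cells are treated as zero.
    """
    rows = 0
    total = 0
    for row in csv.DictReader(csv_file):
        rows += 1
        value = (row.get(column) or "").strip()
        if value:
            total += int(value)
    return rows, total


# Tiny made-up sample standing in for the real "Dish.csv".
sample = io.StringIO(
    "id,name,times_appeared\n"
    "1,Brussel Sprouts,98\n"
    "2,Brussels sprouts,12\n"
    "3,Consomme,\n"
)

types_of_dish, instances = count_dishes(sample)
print(types_of_dish, "rows;", instances, "dish instances")

# Against the real download (the path is an assumption):
# with open("Dish.csv", newline="", encoding="utf-8") as f:
#     print(count_dishes(f))
```

The first number corresponds to types of dishes (rows with their own identifier) and the second to instances of dishes, which is the distinction the paragraph above is drawing.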

Cracking open the data set and inspecting it is one way of assessing the need for curation and the likely amount of effort required. At 394 thousand data “points,” or even 1.2 million, the data set is small enough to do this without stretching common workflows or computational tools. (You can open “Dish.csv” in Microsoft Excel, for example.) You could also make a similar determination about the curatorial actions needed and the rough scale of the challenge without opening any of the data files.

Part of the basic conceptual equipment of data curation (as a meta-discipline) is a rough taxonomy of types of data: observational, experimental, simulation, etc. Data curation researchers have also developed some cross-cutting ideas of “data levels”—from more “raw” to more “cooked.” These terms come from a techno-scientific context (data levels developed in the context of work with earth-observing satellite imagery) but we can also use them to reason about humanities data like that from What’s on the menu by thinking about the project like a system.

The NYPL has imaged a collection of physical objects, producing a first level of data (though not really “raw” in any deep sense, cf. “Raw Data” Is an Oxymoron). Then, through the construction of the What’s on the menu? site/application, the Library processed this first level of data with the aid of online volunteers (another term for “crowdsourcing” being “human computation”). What we can download as a data set is roughly this second level of data: transcriptions based on the images. Treating the contents of the downloaded CSV files as a kind of partially processed observational data helps us both to estimate error and variation (hello “human error”) and to think about how to plan for transformations and changes to the data set (will “authorized” forms fully supersede original forms?). Thus, we can reason from our theoretical knowledge of data and data curation to guide practical “hands on” action.

Next Steps

In the next post, I’ll describe some of the actual data munging work I’m doing to get closer to an index of dishes for What’s on the menu?. The data set of “dish” names is small enough to open in Excel but big enough to challenge the normalizing functions of the more powerful Open Refine. I found a way around this bottleneck and fell into a few other useful workflows along the way.

UPDATE (2013-08-17): Corrected the number of rows in the downloaded CSV file.