Digipres + ??? = PROFIT.

OK, in reality profit is the last thing on my mind when it comes to digital preservation (or anything really, which is why I’m a librarian and *not* a businessperson), but as I’ve been working on a presentation of my initial digital collections assessment findings, this meme popped into my head. This is largely because one of my final slides presents the next steps of my research and work, which are basically building blocks, with each leading easily into the next step.


1. Digital Preservation Collections Assessment
2. Preservation Profiles of Assessed Content
3. Preservation Strategies for Each Profile
4. Digital Policies at the Departmental and Institutional Level
5. Institutional Digital Preservation Strategy


So what is the point of each step? First, a collections assessment provides a snapshot of what is currently happening around the institution. In my recent meetings, I’ve discovered a lot of information just by talking to collections managers – much of this knowledge is tacit within the institution, held by specific staff members. So on top of the snapshot, a collections assessment also begins the process of better documentation, which is the cornerstone of institutionalizing digital preservation.

The next step I’ll be undertaking (though slowly, since October is filled with meetings and travel) is the creation of preservation profiles for the content currently held by IU Libraries. This basically means that I’ll be grouping content by shared characteristics (e.g., Image, A/V, e-Text, Book/Serial) in order to develop specific strategies to preserve it. When I think about this work, the DPLA Challenge Grant always pops into my head. Where DPLA currently has A LOT of rights statements defined by individual institutions and collections, the aim of the grant (a collaboration with Europeana and Creative Commons) is to simplify them into a dozen rights statement profiles that will suit the needs of all of their content and content providers. Likewise, at IU our current digital collections are defined only individually, and so there is an overabundance of unique content needs. By defining the most important characteristics of a specific content profile, we can start to group things together in meaningful ways. And then, once that’s been done, the creation of specific strategies for each profile (e.g., migration, emulation) will be much easier.
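The grouping step itself is conceptually simple, and can be sketched in a few lines of Python. The records and field names below are hypothetical placeholders, not actual IU collections or columns from my assessment template:

```python
from collections import defaultdict

# Hypothetical assessment records -- names and fields are placeholders,
# not actual IU collections or template columns.
collections = [
    {"name": "Collection A", "content_type": "Image"},
    {"name": "Collection B", "content_type": "A/V"},
    {"name": "Collection C", "content_type": "e-Text"},
    {"name": "Collection D", "content_type": "Image"},
]

# Group collections into preservation profiles by shared content type.
profiles = defaultdict(list)
for record in collections:
    profiles[record["content_type"]].append(record["name"])

for profile, members in sorted(profiles.items()):
    print(f"{profile}: {', '.join(members)}")
```

Real profiles will of course hinge on more than one characteristic (format, rights, complexity, and so on), but the principle is the same: shared traits first, strategies second.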

The final pieces of the puzzle are the overall institutional policies and strategy that will define how we move into digital preservation. While there are currently a lot of resources for digital preservation – the technical infrastructure is substantial, and there are a lot of staff and faculty with knowledge of the issues – there is no mindful push into the future. So this will be the big picture of the next year of my work, and will be directly based on the collections assessment that I’m finalizing now.

All of this is a push towards building capacity for the institution. Where currently our digital collections are relatively safe and well-managed, we have the potential to receive a lot of unique and interesting collections and content that we currently have no way of handling. So the definition of preservation profiles and strategies will create opportunities for the collection of new content, and our digital policies and overall institutional strategy will be our guide to whether or not we end up collecting that content. As a large public institution with the resources for digital preservation, we have a great responsibility to steward scholarship into the future, and my hope in all of this is that our digital preservation capacity-building will ultimately allow IU to say YES more often. That would be a great thing.

Digital Preservation: Collections Assessment

After nearly a year without a new post, I’m reviving my blog as a way to work through some of the (very exciting!) challenges I’m facing in my new position (see my About page for more information on the most recent updates in my career). As a digital preservationist, my main professional charge is to move the institution from ‘storage’ to ‘preservation’ by developing mindful digital preservation strategies and practices. Where things are currently managed with (relatively) minimal problems, there is still a lot of room to optimize our stewardship of digital scholarship in order to ensure long, long-term sustainability.

Since I started in July, one of my biggest tasks has been to develop a comprehensive assessment of the digital collections held by the Libraries. While IUBL’s content is relatively manageable at this point (though we’re approaching the 100TB mark), the is about to create . So before we get too massive, I’m going through and developing a collections assessment based on this template.

So what is the goal of a collections assessment? According to Faria et al. (2013):

Digital preservation starts by understanding what content a repository holds and what are the specific characteristics of that content. This process is supported by the characterization of content and allows a content owner to be aware of content volumes, characteristics, format distributions, and specific peculiarities such as digital rights management issues, complex content elements, or other preservation risks.

So the goal in this case is to start to define the most important characteristics of each collection, in order to later develop strategies for effective preservation. As it stands, the assessment spreadsheet has 29 fields, broken down into seven groupings. One of the most difficult challenges in developing the template so far has been finding the right balance of granularity. While it’s pretty easy to wander down a rabbit hole of detail with each collection, defining the most important aspects for preservation involves a fair amount of stepping back and looking at the big picture. As I’ve worked to add information to the spreadsheet on our various collections, the template has evolved quite a bit.

So this is where I currently am, and I’d be interested to hear any feedback on the template. Once I finish the assessment, I’ll post an update and share some information about the different content preservation profiles that I’ll be developing.

Collections Assessment Template (Excel)

Suggested Readings

Open Access in the Humanities


My long wait for Martin Paul Eve’s book on  is apparently going to be a little longer – it seems that the OA copy isn’t up on Cambridge UP’s website yet, and physical copies are still in presale.  I’ve been waiting for months to see this come out, so the added agony is killing me.

But! In the meantime, I plan to check out another open access book of interest: .  This looks like it’ll definitely provide an interesting perspective, even if there isn’t a chapter on digital economics.

 Edit: The book is now live, and can be read in full . 

Extensive Readings.


As part of my research at the Huygens Institute in The Hague, I’ve been reading A LOT lately and on the advice of a visiting faculty member, I’ve recently started tracking all of my articles and books in my Zotero account.  You can view my library here: .  I haven’t done much organization of the collection at this point, but I plan to go through and tag contents and break it up into thematic collections, so check back if you’re interested in digital preservation of digital humanities publications, copyright, piracy, or scholarly communication.


The Medium is the Message

“A book will always have its role… But the opportunity is to use a technology built for discourse to create an unprecedented good for scholarship.”

– Cameron Neylon (PLoS), discussing monograph publication and the move to OA during his closing keynote at the Open Access Monographs in the Humanities and Social Sciences conference


Semantics and Siloization: A Few Thoughts on DH BeNeLux


This morning I posted a  to the , outlining some of the major themes and takeaways of the conference.  This leaves the choicer, more granular bits to be lolled about here on my personal blog.  So then, this post is meant to provide a deeper understanding of the conference, but is also really an avenue for me to work through a few concepts that have been sitting at the edge of my mind for the past week.

So first, to contextualize – the DH BeNeLux conference, in its first year, is a collaboration between various cultural heritage organizations and research centers in Belgium (Be), the Netherlands (Ne), and Luxembourg (Lux) and aims to foster a sense of local community within the larger context of the digital humanities.  While there seem to be a lot of meetup groups in various parts of the world, this multi-country collaboration, to me, felt fresh.  Because there were still researchers from very different cultures in attendance, the more international issues like conference language were broached; however, unlike a more international conference like , the use of English seemed a lot less justifiable as none of the organizing institutions utilize the language officially (to my knowledge).

The conference raised a lot of interesting points overall, and even the final panel discussion felt more like the beginning of a conversation rather than the end.  So with this fact in mind, here are a few discussion starters that I grasped from the various DH BeNeLux sessions I attended.

1. Digital Humanities is not digital humanities is not the humanities AS digital.

Whitacre’s Virtual Choir 3

This is a really important point  that was apparent in a lot of the breakout sessions, but truly laid bare by .  A faculty member at Leiden University College in The Hague, Fu’s (specifically highlighting the Virtual Choir work of , which should be watched and rewatched in surround sound) was a resounding hit as part of the Day One session on Crowdsourcing; however, it was actually her comments in a few other presentations that really put forth the need for a deeper discussion of semantics.

During a presentation by Niels-Oliver Walkowski on the  (as part of the generically-titled About DH session), Fu engaged Walkowski in an interesting discussion of his use of ‘digital’, in which she asserted that he was using the word predicatively, rather than attributively.  She again brought up a similar point during the final panel discussions, though apropos to what I don’t remember.  Her assertion, though a bit difficult to engage with outright, was something I wrote in my notebook as a point I wanted to return to.  Now that I’ve had time to roll it around in my brain (and refamiliarize myself with the grammatical terms – thanks, Google), I think it’s an important thing to talk about.  How do we as digital humanists (capital D capital H) differentiate from humanists working with the digital… or do we?

Further, what constitutes a digital humanities project?  Looking over some of the projects coming out of even the top digital humanities institutes around the world, I can’t help but often wonder, “isn’t that just a digital archive?”  Speaking to a traditional humanist about this issue, he asserted that a lot of digital humanities projects seem like “humanities with maps”, often with a questionable methodology.  While that’s just one scholar’s understanding of the field, it really points to the need to pin down a semantic understanding of ‘digital’ in this context.  Or not.  But at the very least we need to elucidate what ‘digital humanities’ means in the context of our own projects, so that we’re not all being lumped into the same, often wrong, box.

2. External factors need to be better addressed in terms of their impact on the field.

Albert Meroño-Peñuela presenting on the Short Title Catalogue, Netherlands during the Linked Open Data breakout.

One of the most interesting presentations I saw during the conference was that of Alastair Dunning, who discussed the in digital humanities projects, asserting that copyright has, to a large extent, excluded more recent works from being studied in the field.  As a copyright nerd, this was one of those ah-ha moments, where something so obvious was laid out so succinctly that I had to wonder why I had never considered it before.  The data from Dunning’s study of some of the top DH research centers (which can be viewed in his ) is very straightforward, and I’ll be interested to see what comes of it.

Another issue worth mentioning here is something that Max Kemman talks about in his post on the conference: the fact that many DH projects remain siloed.  The heavily-attended Linked Data breakout on Friday provided some insight into how this is changing, but still there are issues (e.g., audience members asking how a project would map to other existing LOD projects like , and presenters not having a plan in action).  , librarian at Universiteit Gent, did address this issue in the final panel by suggesting that libraries were the place where desiloization occurs, but this raises yet another discussion point…

3. Digital humanities ≠ digital preservation, and we need to figure that out.

A common question in most of the breakouts that I attended was “What are you planning to do with the data once [insert techie project name here] is complete?”, and a common answer was that other researchers will figure that out once the data is published.  While it’s one thing to leave interpretation up to the scholarly masses, it’s a whole different issue that many (though not all) digital humanists leave long-term preservation of digital projects up to other entities like libraries to solve at the end of the project cycle.  As a digital preservationist myself (and now as someone studying the long-term sustainability of digital scholarly editions, har har) I can feel my blood start to boil just now.  Preservation strategies need to be implemented at the beginning of a project and adapted as it goes, otherwise the library is simply going to become a place where DH projects go to die.

This is probably a good stopping point, before the entire post becomes a heated discussion of best practices in digital preservation.  Suffice it to say that the conference was amazing for generating great ideas and discussions, allowed researchers to share final and mid-cycle projects and receive feedback, and also the conference swag was highly notable.


DH BeNeLux

DH BeNeLux, or Digital Humanities Belgium-Netherlands-Luxembourg in longform, is happening today at  in The Hague (the research institute where I work is on the fifth floor of the same building).  Follow along with the conference hashtag and official Twitter account if you can.  The program is amazingly full of some of the top digital literary and history scholars working on cool projects, so I can’t wait to hear what they have to say.

As a heads up, if you follow my own Twitter account, I’ll be livetweeting this schedule (on UTC time):

Thursday, June 12

13:00-14:00 Keynote: Melissa Terras

14:00-15:00 About DH

 – Hieke Huistra, Bram Mellink

 – Niels-Oliver Walkowski

15:30-16:50 Crowdsourcing

 – Lars Wieneke and Marten Düring

 – Marie-Charlotte Le Bailly

 – Cissie Fu

 – René Voorburg

Friday, June 13

09:15-10:15 Copyright

 – Alastair Dunning

 – Renske van Nie

 – Tjeerd Schiphof and Karina van Dalen-Oskam

10:45-11:45 Linked Data

 – Wouter Beek, Rinke Hoekstra, Fernie Maas, Albert Meroño-Peñuela and Inger Leemans

 – Alina Saenko

 – Max Kemman and Astrid van Aggelen

12:00-12:45 Panel Session

12:45-13:00 Closing remarks

Digital Preservation 101: Undertaking a File-Level Inventory for Preservation Planning

This entry was originally posted on January 13, 2014.  It outlines the institution-wide file format inventory that I undertook at Dumbarton Oaks as part of my work as a Library of Congress National Digital Stewardship Residency fellow.

In order to develop a better understanding of the holdings at Dumbarton Oaks as part of my NDSR project, I have been working on a file-level inventory that can hopefully be embedded in a digital preservation workflow process at DO in the future.

The benefits of an inventory are manifold, but these are a few that I highlighted in a recent presentation (all originally adapted from ):


The inventory basically tells us what we have, how much we have, where we have it, and most importantly, what user behaviors surround the creation and management of digital assets.  Keep these goals in mind as you are working, because undertaking a file-level inventory won’t be easy.  There really aren’t a lot of tools out there, and the ones that are there require a pretty solid base of technical knowledge.
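For context, the bare-minimum version of such an inventory can be scripted. This sketch (my own illustration, not part of the DO workflow) records only what the filesystem itself knows – path, extension, size, modification date – with none of the format identification and validation that the tools discussed below provide:

```python
import csv
import os
import time

def inventory(root, out_csv):
    """Walk a directory tree and record basic facts about every file.

    A minimal baseline, not a replacement for DROID or JHOVE2: it captures
    path, extension, size, and last-modified date, but does no format
    identification or validation.
    """
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "extension", "size_bytes", "last_modified"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                stat = os.stat(path)
                # Normalize extensions so .JPG and .jpg group together.
                ext = os.path.splitext(name)[1].lower()
                writer.writerow([
                    path,
                    ext,
                    stat.st_size,
                    time.strftime("%Y-%m-%d", time.localtime(stat.st_mtime)),
                ])
```

Even a crude listing like this is enough to answer the "what and how much" questions; the dedicated tools earn their keep on the harder "what format is it really" question.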

The two tools I decided to try out were JHOVE2 and DROID.

The first was my main focus, as .  On top of this, JHOVE2 includes validation of files, which is an added bonus when compared to DROID.

Drawbacks of JHOVE2, however, were pretty insurmountable in my project implementation.  They included the need to run now-outdated Java 6, and the lack of a GUI.


Command line, anyone?

The main problem that I ran up against with JHOVE2, however, wasn’t the actual implementation (all of the basic commands needed are outlined in the handbook, so even a relative novice can run it), but rather the reporting.  After going through all of the steps, the tool was spitting out a massive jumble of text that I was unable to make sense of.  After consulting the forums and trying our in-house IT specialist at Dumbarton Oaks, I had committed too much time to JHOVE2 and still couldn’t process the inventory reports.  So, for the sake of moving the project forward, I decided to go with DROID instead.

The most recent version, , is a lot more accurate than older versions.  The install is incredibly easy (for Windows: download ZIP file, unzip, run BAT file, done).

The interface is also a whole lot prettier than JHOVE2:


But of course, there were still (mysterious) problems.


Beyond the occasional crash, the tool’s output is fairly readable, especially if you pre-read the user guide referenced above.

Here’s a small example of what the final reports look like:


DROID is helpful for identifying preservation issues, like the  above.  The report also provides information like MIME type (to get a top-level idea of general types of media), date last modified (I found this really helpful for determining whether a drive was full of archival assets or everyday files), and file format and size.
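As an aside, once you have DROID's CSV export, post-processing it is straightforward. This sketch tallies file counts and total bytes per MIME type; the column names (`TYPE`, `MIME_TYPE`, `SIZE`) match the export from the DROID version I used, so verify them against your own export before relying on this:

```python
import csv
from collections import Counter

def summarize_droid_export(csv_path):
    """Tally file counts and total bytes per MIME type from a DROID CSV export.

    Assumes the TYPE, MIME_TYPE, and SIZE columns present in the export
    I worked with; check your DROID version's column headers first.
    """
    counts = Counter()
    sizes = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            if row.get("TYPE") == "Folder":  # skip directory rows
                continue
            mime = row.get("MIME_TYPE") or "unknown"
            counts[mime] += 1
            sizes[mime] += int(row.get("SIZE") or 0)
    return counts, sizes
```

A summary like this gives you the top-level view of media types in a drive at a glance, which is exactly the kind of snapshot the inventory is meant to produce.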

While I tried out these two tools, there are other possibilities to check out.  See this .  Some all-in-one preservation tools like also integrate file inventorying, sometimes referred to as preservation planning.