Incorporating Emulation into a ‘business as usual’ digital preservation workflow.
This post is intended to be speculative and may well be full of errors, both in the writing (spelling/grammar/typos) and in the content (I could be way off-mark). I am putting it out here as a thought piece to stimulate commentary and ideas. Some of this came out of recent discussions at the Future Perfect 2012 conference with many people including Jeff Rothenberg and Dirk von Suchodoletz.
What would it mean to take emulation seriously as a digital preservation strategy?
Most major digital preservation systems currently treat migration as their main long-term preservation strategy. Some may argue that they are all in fact based on a strategy of hedging bets by retaining the original files as well as implementing migration, and this may be so; however, none that I am aware of use emulation as their only digital preservation strategy. I believe there is merit in some institutions using only emulation as a digital preservation strategy. They may wish to also use migration for providing access derivatives, much as we use a photocopier for providing access derivatives of paper records today. However, there are some interesting and potentially cost-saving differences when implementing an emulation-based digital preservation strategy instead of a migration-based strategy.
This post is an attempt to highlight some of the differences in implementing a purely emulation based approach.
What would a business as usual digital preservation workflow look like?
At point of transfer or earlier, digital preservation practitioners (DPPs) would try to ascertain the necessary rendering environment or environments for each digital object. This might be as simple as knowing that the object was a PDF file from a certain era and so would have been intended to be rendered in one of x versions of Acrobat Reader, or that it was a Microsoft Word document file from a certain era, created with OpenOffice, and therefore intended to be rendered with either OpenOffice or one of the versions of Microsoft Word that was available at the time. Or it may be far more complex. The decision on how accurate the rendering environment has to be will depend on the context in which the object was normally used. If it was normally used by many users on many different systems then one or more representative rendering environments may be appropriate. If it was normally used by multiple users via a specialised environment, then a copy of that environment may need to be made and transferred with the object.
Any necessary environments or environment components would be checked off against the preservation institution’s inventory (e.g. Microsoft Word xx, Java version xx, environment xx). Any components that had to be transferred from the agency would be packaged for transfer. Where full environments had to be transferred, disk images would be made or virtual appliances would be transferred.
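As a rough illustration of the check-off step, here is a minimal Python sketch. The `Component` type, the function name and the example product versions are all my own assumptions for illustration; a real inventory would almost certainly need richer identifiers (language, patch level, licence and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One piece of a rendering environment (hypothetical structure)."""
    name: str
    version: str


def components_to_transfer(required, inventory):
    """Return the required components the institution does not already hold."""
    missing = set(required) - set(inventory)
    return sorted(missing, key=lambda c: (c.name, c.version))


# Illustrative data only: what the institution holds vs. what an object needs.
inventory = {Component("Microsoft Word", "97"), Component("Java", "1.4")}
required = [Component("Microsoft Word", "97"), Component("Lotus 1-2-3", "5")]

missing = components_to_transfer(required, inventory)
```

Anything left in `missing` is what would have to be packaged and transferred from the agency.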
Files would go into the repository with some (digital) preservation metadata consisting of their age, rendering environment ID(s), date of last modification and any relevant fixity information (other metadata would be transferred for access restriction and discovery purposes etc). The date of last modification would be used when configuring the rendering environment to ensure active date fields were contemporaneous with the file (i.e. the emulated environment would have the system date set to the date the file was last modified).
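A minimal sketch of such a metadata record, and of deriving the emulated system date from it, might look like the following. The field names and the example identifiers are hypothetical. The use of QEMU is also an assumption made purely for illustration (the workflow does not depend on any particular emulator); its `-rtc base=` option takes a start date in this format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class PreservationRecord:
    """Preservation metadata for one object (hypothetical field names)."""
    object_id: str
    rendering_environment_ids: List[str]
    last_modified: datetime
    fixity: Dict[str, str] = field(default_factory=dict)  # algorithm -> hex digest

    def age_in_days(self, now: datetime) -> int:
        """Age of the object, measured from its last modification."""
        return (now - self.last_modified).days


def qemu_clock_args(record: PreservationRecord) -> List[str]:
    """Build emulator arguments that start the guest clock at the file's
    last-modified date. QEMU is an assumed emulator choice here."""
    stamp = record.last_modified.strftime("%Y-%m-%dT%H:%M:%S")
    return ["-rtc", "base=" + stamp]


record = PreservationRecord(
    object_id="obj-0001",                      # hypothetical identifier
    rendering_environment_ids=["env-word97"],  # hypothetical environment ID
    last_modified=datetime(1998, 3, 14, 9, 30, 0),
)
clock_args = qemu_clock_args(record)
```

With the clock set this way, an active date field in the document should show a date contemporaneous with the file rather than today's date.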
The files would then have bit-preservation routines applied to them as per usual (copies made, checksums checked, media refreshment and replacement, etc).
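The checksum part of those routines is simple enough to sketch. The function names below are my own, and SHA-256 is an assumed choice of algorithm; an institution may well use something else.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_copies(copy_paths, expected_digest):
    """Return the copies whose current digest no longer matches the recorded one."""
    return [path for path in copy_paths if sha256_of(path) != expected_digest]
```

Any path returned by `failed_copies` would be a candidate for replacement from a copy that still verifies.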
If an appropriate rendering environment was not available in the inventory of the transferring agency, one would either have to be configured or selected from a provider. Testing of the environment could be done in conjunction with the transferring organisation or individual, or could be done automatically using standard software installation testing routines. That one environment could then be used to render any object that was associated with it in the future. An average DPP (archivist, librarian) with basic IT skills should be able to be trained to configure most environments. In many cases it will only require knowledge of how to install applications on a base operating system image.
When a user requested access to the original object there would be a number of options available:
1. They could be provided access to the object automatically rendered in the associated rendering environment within a controlled environment, e.g. in a reading room.
2. They could be provided access to the object automatically rendered in the associated rendering environment remotely, either through a custom application or through a web browser.
3. They could be provided with the files that make up the object and information about the rendering environment, e.g. a unique ID for the environment or a list of its components. The environment itself could then be provided by the user (e.g. the transferring agency may still have the environment running) or by an external service provider.
4. They could be provided an access derivative created as part of a non-preservation, value-add process to facilitate greater reuse.
Throughout all of these options (aside from 4) the user could be given a number of ways to interact with the object and move content from it to a more modern environment (these may depend on confidentiality or commercial constraints):
They could be given the option of printing objects to a file or printer.
They could be given the option of selecting and copying content to paste into the modern host environment.
They could be given the option of saving the object in a different format and moving the result to the modern host environment.
How does this process differ from standard, migration-based, approaches?
1. There is no validating of files against format standards (JHOVE would be unnecessary). Format validation only matters if you want to be able to consistently apply migration tools across a large set of files. Intra-format variance generally results from different creating applications writing files differently while intending them to adhere to the same format standard. If you are employing an emulation strategy this variance is not a problem. It is useful for identifying the rendering application, but it is a problem for validation tools.
2. Format analysis becomes less important. Strictly speaking, format identification is unnecessary when implementing an emulation strategy. The only format-like information that is necessary is an identifier for the rendering environment(s) to be used to render the object. File format identification tools could be used to infer the rendering environment(s) for the files. For example, tools like DROID could be repurposed to identify patterns relating to creating applications, and from there the intended rendering environment(s) could be inferred.
3. Identifying the rendering environment would be much more important, and testing that environment at point of transfer could be more important. Doing this at point of transfer would make any issues apparent immediately rather than putting them off to a later date. In theory it would make it easier to consult with the original content owners to confirm decisions made (something that is harder to do each time a migration is conducted).
4. Preservation planning would involve tracking systems architecture etc., not software “obsolescence”. That is, preservation planning would require ensuring that your emulation tools ran on your current host environment(s).
5. Preservation actions would involve writing new emulation hosts to host the old virtual hardware or writing new emulators to run the old environment images. This could be a significant process but would be relatively rare and would only need to be done once per emulator (which might emulate many different architectures and hundreds or thousands of environments).
6. Decisions about the content presented to users (e.g. as a result of migration or emulation) are made early in the preservation process (at point of transfer) as opposed to when a migration action is deemed necessary.
7. Access to the digital original could be more complicated for the average user and various mechanisms may have to be put in place to overcome this. Providing basic instructions for interacting with each environment would be an initial step. Old software documentation could be digitised and made available. Old software manuals often assumed no knowledge of computers and could be repurposed for future users. Interactive walk-through overlays could be added to the software (thanks to Jeff Rothenberg for suggesting this) leading users through the main steps necessary to interact with the objects (e.g. when mice no longer exist). Access to derivative versions may also be provided if required.
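To make the idea of inferring rendering environments from creating-application patterns a little more concrete, here is a very rough Python sketch. It is not how DROID works internally. The byte patterns for creating applications and all of the environment IDs are hypothetical placeholders that I have made up; the only pattern I am sure of is that PDF files begin with `%PDF-`.

```python
# Each rule pairs a byte pattern with a rendering environment ID.
# All environment IDs, and the creating-application patterns, are hypothetical.
RULES = [
    (b"%PDF-", "env-acrobat-reader"),             # PDF header (real)
    (b"ExampleWriter 2.0", "env-examplewriter"),  # made-up creating application
    (b"OtherSuite 1.1", "env-othersuite"),        # made-up creating application
]


def infer_environments(data, rules=RULES):
    """Return the environment IDs whose pattern occurs anywhere in the bytes."""
    return [env_id for pattern, env_id in rules if pattern in data]


sample = b"%PDF-1.4\n... /Producer (ExampleWriter 2.0) ..."
candidates = infer_environments(sample)
```

A real tool would need proper signatures and some way of ranking candidates, but the output is the same kind of thing: one or more environment IDs to record against the object and then test at point of transfer.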
In general the steps involved in implementing a digital preservation strategy involving only emulation are quite different from those involved in implementing a migration strategy. Without solid examples of the practice of each, and metrics on costs and results, it is hard to say which would be more efficient.
I welcome comments and am very aware of the many gaps in this quite hurriedly written post. I chose to post this here rather than on the OPF or elsewhere because of its very raw nature, its speculative content and because I do not want it in any way associated with my kind employers.
EDIT:
I forgot an important point:
The digital preservation institution would not necessarily have to hold copies of any or every environment. They would only need to have access to them or to ensure that users could access them. Initially this may be possible with no work whatsoever. For example, the environment for a PDF file may be limited to any current version of Acrobat Reader that a user would likely have at home, running on any OS that supported it. In the future, if external emulation services were available, the preservation institution may only have to check that the particular environment was available or request that it be configured and made available by the service provider. After that they may not need to actively do a lot besides tracking the health of the service providers (and the usual bit-preservation routines).