
Reply to DSHR’s Comment

I ran over the comment limits on David Rosenthal’s blog when I tried to reply to his reply to my comment on his blog. I’ve included my reply below instead. 

Hi David,

The problem I see is that we fundamentally disagree on the framing of the digital preservation challenge. I meant to reply to your last “refutation” of Jeff Rothenberg’s presentation at Future Perfect 2012 but hadn’t gotten around to it yet. Perhaps now is a good time. I was the one who organised Jeff’s visit and presentation, and I talked with him about his views both before and after, so I have a pretty good idea of what he was trying to say. I won’t try to put words into his mouth though and will instead give my (similar) views below.

The digital preservation challenge, as I see it, is to preserve digitally stored or accessed content over time. I think we can both agree that if we aren’t leaving something unchanged then we aren’t preserving anything. So, to me, the digital preservation challenge requires that we ensure that the content is unchanged over time.

Now I’m not sure if you would agree that that is what we are trying to do. If you do, then it seems we disagree on what the content is that we are trying to preserve.  If you disagree that that is what we are trying to do then at least we might be able to make some progress on figuring out what the disagreement stems from.

So if you can at least understand my perspective I’d also like to address your comments about format obsolescence. I’m not a proponent of the idea of format obsolescence. The idea makes little sense to me. However I am a proponent of a weak form of the idea of software obsolescence and, more importantly, the associated idea of content loss due to software obsolescence.

The weaker form of the idea of software obsolescence that I’m a proponent of is that because of hardware changes, software loss and loss of understanding about how to use software, software becomes unusable using current technology without active intervention.

The associated idea of content loss that I am a proponent of is the idea that to successfully preserve many types of content you need to preserve software that that content relies upon in order to be presented to users and interacted with. A stronger way of putting that is to say that in many cases, the thing to be preserved is so inextricably connected to the software that the software is part of that thing.

If you take that leap to accepting (whether fully or in order to simplify the explanation) that the software is part of the thing to be preserved, then it becomes obvious that practitioners who are  only doing migration are in many cases not doing real preservation as they are not preserving the entirety of the objects.  Hence Jeff’s presentation in which he reprimanded the community for not really making progress since the early 2000s.  Almost nobody is preserving the software functionality.

As it is relevant to your post and comments, I’ll use a web page as an example to illustrate what I mean. The content that a traditional web page presents to users is produced from a number of digital files: the server-hosted files (e.g. the web server and applications, the HTML/XHTML pages, scripts, images and audio) and the locally hosted files (such as the browser, fonts, browser skins, extensions etc.). The combination of these files, usually mediated by at least two computers (the server and the client), presents content that the user can interact with. Changing any one of the files involved in this process may change the content presented to the user. To preserve such a page it is my view that we need to start by deciding what content makes up the page, so that we can both begin to preserve it and later confirm that the content has been preserved and is still there in an unchanged form at a point in the future. In most cases it’s likely that all that needs to be preserved is the basic text and images in the page and their general layout. If this is all, then migration techniques may well be appropriate if the browser ever becomes unable to render the text and images (though I agree with you that that doesn’t seem necessary yet or likely to be necessary in a hurry). However there are two difficulties with this scenario:

  1. There will be many cases where the content includes interactive components and/or things that include software dependencies.
  2.  When you don’t know, or can’t affordably identify the content to be preserved, preserving as much as possible, cheaply, is your best option. 

The first difficulty means that you will require some solution that involves preserving the software’s functionality, and I believe that the second means you should use an emulation based technique to preserve the content.

Emulation based techniques are highly scalable (across many pieces of digital content) and so benefit from economies of scale. Emulation strategies and tools, once fully realised, I believe will provide a cheaper option when you factor in the cost of confirming the preservation of the content.

It’s a bit like the global warming problem. Most products and services do not include the carbon cost in them. If they did they would likely be much more expensive. Well I believe digital preservation solutions are similar: if you factor in the costs of confirming/verifying the preservation of the content you are trying to preserve, then many solutions are likely to be prohibitively expensive as they will require manual intervention at the individual object level.  Emulation solutions, on the other hand, can be verified at the environment level and applied across many objects, greatly reducing costs.

So as I see it, it is not about format obsolescence, it is about (a weak form of) software obsolescence and preservation of content that can’t be separated from software.

In your post you seemed to be suggesting something similar, that content needed to be preserved that was heavily reliant upon browsers and server based applications. You also discussed a number of approaches including some that involved creating and maintaining virtual machines, and followed that with the statement that: “the most important thing going forward will be to deploy a variety of approaches”. I took that to mean you had softened a little in your attitude towards using emulation to preserve content over time.

Sorry, I seem to have misunderstood.


How could son B rely on the emulation solution?

Commenting on my last post about the importance of digital object integrity Barbara asked:

“Nice post! But son B need to do something more if I was the judge. If in court, how could son B show that the firm that did the emulation was trustworthy? By asking 3 firms and compare the results?”

I’m going to take this to mean: how might the firm prove to son B that their emulation solution was trustworthy so that son B could prove it in court?

It’s a good question, and one that I get asked a lot, so here is my answer:

I think the bar for how “authentic” an emulation solution needs to be should be set as follows: for running a particular software environment, emulated hardware should meet the same expectations as the hardware that the environment would have run on when it was in use.

I don’t think that (in most cases) it has to be the exact same hardware. The reason I believe this is because that is not expected of digital objects today (and wasn’t expected of the objects created in 2009/2010). I.e. it is not expected that when one person opens a file another person created that it has to be opened on the exact same hardware. On the other hand it is often expected that the user should use the same software. Software is often recommended or required by organisations when users are given files to use from them. Websites used to (and sometimes still do) regularly list the software that should be used to interact with them.

I could speculate as to why hardware is deemed less important than software by actual users but (to an extent) it doesn’t matter why, it just matters that that is the way it is.

When hardware vendors develop new PCs they test them to ensure they function as they should. These same tests (which are mostly automated) should be run on emulated hardware and the hardware should pass if it is to be accepted. The first Google hit I found for such a test suite was for EuroSoft.

So in order to prove that their emulation solution is trustworthy the firm providing the service needs to prove the emulated hardware stands up to the same integrity tests any hardware would have had to have passed if it was to be used with Windows Vista & Office 2003 back in 2009. 



A scenario illustrating the importance of digital object integrity

I’ve been commenting on some blog posts recently about the importance of being able to be sure that your preservation actions don’t change the content/message(s) of a digital object and today I thought of a scenario that seems both very plausible and hopefully effective at illustrating the issue.

In 2009 a mother wrote her last will and testament using Microsoft Word 2003. She never printed it, and saved it in the default Word 2003 format. In her will she split her assets evenly between her two sons, sons A and B. In 2010 she had a falling out with son A and updated the document to reflect that, removing from him a large portion of the inheritance that he otherwise would have received. She died 15 years later without ever updating her will again. It survived on the home server backups her sons had made for her over the years.

Come 2025 the two sons both retrieved the will and prepared to use it to argue their case for their portion of her assets. Son A presents the will in court as rendered using LibreOffice 2023, the only contemporary software that will open it. When opening the file LibreOffice misinterprets the changes that had been saved using the track changes feature of Word 2003 and presents a slightly mangled version that includes portions of the 2009 version in which Son A gets half his mother’s assets. 

When presented with this, Son B knows something is not right. His mother (whom he looked after as her Alzheimer’s set in) often and repeatedly told him that she had disowned Son A and that “he’s not getting anything!”.

However when the object was presented in the court Son B was surprisingly calm. Son B had come prepared! He had also tried accessing the object by rendering the file in LibreOffice 2023, however when he saw the result he immediately set about looking for someone who could help him access the “real” object. 

After allowing his thoughts about searching to interface with the BrainCloud* he had quickly found a company specializing in old digital objects and digital access. Within seconds his problem had been analysed by them and, after he had provided his authorization thought pattern, the file was opened using a remotely hosted emulated environment that included Office 2003 running on Windows Vista (his mother had bought her laptop at just the wrong time and refused to upgrade it as she “liked the colour” of Windows Vista).

Back in the court Son B presented his version of the object. The version that used Word 2003 running on Vista as the interaction environment, with the change tracking presentation layer turned on, clearly showed what he was entitled to and, coupled with other evidence of his late mother’s change in relationship to Son A, was the deciding factor that won him his full inheritance.

 (*TM) 


Accessing Authentic Archived Websites Well (aaaww)

I’d like to start this post with a bold claim:

The only currently practical solution for preserving access to archived websites over the long term is to maintain old web browsers and access the websites using those browsers running on emulated or virtualised hardware.

The truth of that claim could easily be debated but I’m not going to do that here. Instead I’d like to assume that it’s true and look at what stands in the way of our doing that right now.

Given this (assumption), what happens when we try to do this right now?:

Obscure DP problem: optimising web archive interfaces for old... on Twitpic

(www.archive.org running in Netscape Navigator on Windows 98 on VMware).

In (other) words it doesn’t work. It doesn’t work because our web-archive interfaces are not designed for old browsers. So this is the first issue. 

This is assuming a workflow in which users load up an old browser and browse amongst links from web archives within the old browser. Another option might be to have the browsing done via a modern host and the viewing/rendering passed to the emulated/virtualised browser. But either way, this seems to be something practical web archivists could be working on. 

Another issue is security: providing web archives via old software may mean providing old software with access to the internet. For example, it might mean giving Windows 95 with IE 5 access to the internet. This should be manageable through good use of firewalls etc. In any case, most modern host environments should in theory be able to be made immune to viruses that might attack the old operating systems, and the emulated systems could be set in “snapshot” mode to ensure any damage done can be undone simply by restarting the emulated desktop.
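As a rough sketch of the “snapshot” idea, the snippet below builds a launch command for an emulator that never writes changes back to the disk image. It assumes QEMU as the emulator (an assumption on my part; the screenshot above used VMware), whose `-snapshot` option writes changes to temporary files instead of the image. The image file name and memory size are hypothetical.

```python
import shlex


def snapshot_command(disk_image: str, memory_mb: int = 128) -> list[str]:
    """Build a QEMU command line that boots disk_image without modifying it."""
    return [
        "qemu-system-i386",       # assumed emulator binary for an old x86 guest
        "-m", str(memory_mb),     # guest memory in megabytes
        "-hda", disk_image,       # the preserved disk image (hypothetical name below)
        "-snapshot",              # write changes to temporary files, not the image
    ]


command = snapshot_command("win95-ie5.img")
print(shlex.join(command))
```

Because the image itself is never altered, anything an old virus did inside the guest disappears when the emulated desktop is restarted.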

The third and more challenging issue is the ever present issue for emulation solutions: licensing. Many old browsers require old proprietary operating systems on which to run. This is a legitimate issue that desperately needs to be dealt with if we are to make emulation a viable solution more widely. However this is actually slightly less of a problem with browsers than with other software. Many browsers were freely available and many can be run directly on old Linux distributions or indirectly through API emulators such as WINE on Linux. Most old browsers can still be downloaded via sites like OldVersion.com and OldApps.com, or are included in old Linux distributions and repositories, as David Rosenthal likes to point out.

In fact if I load up the KEEP Emulation Framework’s default Damn Small Linux environment I can browse the Wayback Machine with ease, aaaww:

Aaaww .... Lolcat.com from 2007 running in Firefox via the Em... on Twitpic

UPDATE:

I may try to build a version of a Damn Small Linux disk image for use in the Emulation Framework with old versions of browsers running on it via WINE. 

The WINE Database shows Old Versions of Internet Explorer, Netscape, Firefox and Opera running well. 

UPDATE #2: I resized the PuppyLinux disk image provided with the Emulation Framework, added Wine 1.5 and installed Internet Explorer 3.01 and Netscape Communicator 4.80 for Windows. The disk image is available here and can be added to the Emulation Framework by following the instructions in this document to “add software”. I may try to add more browsers in the future if anyone is interested. 


Maintaining Format Migration Paths

This is a quick post to put a question out there for discussion. The (partly rhetorical) question that I have been pondering over and raising with others is:

How long do we have to maintain a migration path for any particular format?

It’s probably safe to assume that most digital preservation institutions will continue to receive files in old formats for a long time after the format is considered to be obsolete. If we are going to use a migration strategy to preserve these files then the current best practice seems to be that we should migrate them as soon as we believe that their format is obsolete.

For example, say a digital preservation institution has a set of WordPerfect 5.1 files and in 2012 it realizes/decides that they are obsolete and decides to migrate them all to ODT files to preserve them. This would seem to be a reasonable and practical approach for preserving these files. However if we apply the question posed above to this example: what happens if the institution receives more WordPerfect 5.1 files?

Presumably if the institution receives the additional WordPerfect 5.1 files while the tool(s) they used to migrate the original set are still functioning then they should be able to migrate them as they ingest them or soon after. But what happens when those migration tools are obsolete? Will they have to find or create new migration tools? Will they have to migrate the old migration tools? 

There are a lot of answers to these questions including the option of refusing to accept any more files in formats that they have migrated away from. But to me it gives two good reasons to maintain emulation tools:

1. Migration through emulation tools (such as the UFC Migrate tool created as part of the Planets project) could help to ensure that files that come into the repository in obsolete formats can always be migrated. These tools do partly beg the question: Why bother migrating them if you are maintaining the ability to render them anyway (as you probably are if you are maintaining the ability to migrate them using original software)? - one answer is that you might migrate for reusability of parts of the content in modern software. 

2. Using an emulation strategy to preserve the objects would make this issue redundant. 


Incorporating Emulation into a ‘business as usual’ digital preservation workflow.

This post is intended to be speculative and may well be full of errors, both in the writing (spelling/grammar/typos) and in the content (I could be way off-mark). I am putting it out here as a thought piece to stimulate commentary and ideas. Some of this came out of recent discussions at the Future Perfect 2012 conference with many people including Jeff Rothenberg and Dirk von Suchodoletz.

 

What would it mean to take emulation seriously as a digital preservation strategy?

 

Most major digital preservation systems are currently based around having migration as the main long term preservation strategy. Some may argue that they are all in fact based on a strategy of hedging bets by way of retaining the original files and implementing migration, and this may be so; however none that I am aware of are based around using only emulation as a digital preservation strategy. I believe there is merit in some institutions using only emulation as a digital preservation strategy. They may wish to also use migration for providing access derivatives, much as we use a photocopier for providing access derivatives of paper records today. However there are some interesting and potentially cost-saving differences when implementing an emulation based digital preservation strategy instead of a migration based strategy.

 

This post is an attempt to highlight some of the differences in implementing a purely emulation based approach.

 

What would a business as usual digital preservation workflow look like?

 

At point of transfer or earlier, digital preservation practitioners (DPPs) would try to ascertain the necessary rendering environment or environments for each digital object. This might be as simple as knowing that the object was a PDF file from a certain era and so would have been intended to be rendered in one of x versions of Acrobat Reader, or a Microsoft Word document file from a certain era, created with OpenOffice, and therefore intended to be rendered with either OpenOffice or one of the versions of Microsoft Word that was available at the time. Or it may be far more complex. The decision on how accurate the rendering environment has to be will depend on the context in which the object was normally used. If it was normally used by many users on many different systems then one or more representative rendering environments may be appropriate. If it was normally used by multiple users via a specialised environment, then a copy of that environment may need to be made and transferred with the object.

 

Any necessary environments or environment components would be checked off against the preservation institution’s inventory (e.g. Microsoft Word xx, Java version xx, environment xx). Any components that had to be transferred from the agency would be packaged for transfer. Where full environments had to be transferred, disk images would be made or virtual appliances would be transferred.
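The checking-off step is essentially a set difference. A minimal sketch, assuming each environment component can be given a stable identifier (the identifiers below are hypothetical, not drawn from any real registry):

```python
def components_to_transfer(required: set[str], inventory: set[str]) -> set[str]:
    """Return the components an object needs that the institution does not yet hold."""
    return required - inventory


# Hypothetical identifiers for what the institution already holds.
inventory = {"windows-xp-sp2", "microsoft-word-2003", "java-1.4"}

# Hypothetical identifiers for what one transferred object needs.
required = {"windows-xp-sp2", "microsoft-word-2003", "acrobat-reader-6"}

missing = components_to_transfer(required, inventory)
print(sorted(missing))
```

Anything left in `missing` would have to be packaged and transferred by the agency, or sourced elsewhere.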

 

Files would go into the repository with some (digital) preservation metadata consisting of their age, rendering environment ID(s), date of last modification and any relevant fixity information (other metadata would be transferred for access restriction and discovery purposes etc). The date of last modification would be used when configuring the rendering environment to ensure active date fields were contemporaneous with the file (i.e. the emulated environment would have the system date set to the date the file was last modified).

The files would then have bit-preservation routines applied to them as per usual (copies made, checksums checked, media refreshment and replacement, etc).
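To make the metadata and fixity steps concrete, here is a minimal sketch of such a record. The field names, the choice of SHA-256 and the date format are my assumptions for illustration; it also assumes the file system’s modification time still reflects the real date of last modification, which copying during transfer can easily break.

```python
import hashlib
import os
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PreservationRecord:
    path: str
    environment_ids: list[str]   # IDs of the rendering environment(s)
    last_modified: datetime      # used to set the emulated system date
    sha256: str                  # fixity information


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_record(path: str, environment_ids: list[str]) -> PreservationRecord:
    """Capture the preservation metadata for one file at ingest."""
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return PreservationRecord(path, list(environment_ids), modified, sha256_of(path))


def fixity_ok(record: PreservationRecord) -> bool:
    """Re-check the stored checksum as part of routine bit preservation."""
    return sha256_of(record.path) == record.sha256


def emulated_system_date(record: PreservationRecord) -> str:
    """Date (YYYY-MM-DD) to give the emulated environment's clock."""
    return record.last_modified.strftime("%Y-%m-%d")
```

A record would be made once at ingest with `make_record`, `fixity_ok` would be run on a schedule, and `emulated_system_date` would be read when the rendering environment is configured.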

If an appropriate rendering environment was not available in the inventory of the transferring agency one would either have to be configured or selected from a provider. Testing of the environment could be done in conjunction with the transferring organisation or individual, or could be done automatically using standard software installation testing routines. That one environment could then be used to render any object that was associated with it in the future. An average DPP (archivist, librarian) with basic IT skills should be able to be trained on how to configure most environments. In many cases it will only require knowledge of how to install applications on a base-operating system image.

 

When a user requested access to the original object there would be a number of options available:

1.  They could be provided access to the object automatically rendered in the associated rendering environment within a controlled environment, e.g. in a reading room.

2.  They could be provided access to the object automatically rendered in the associated rendering environment remotely, either through a custom application or through a web-browser.

3.  They could be provided with the files that make up the object and information about the rendering environment, e.g. a unique ID for the environment or a list of the components. This could then be provided by the user (e.g. the transferring agency may still have the environment running) or by an external service provider.

4.  They could be provided an access derivative created as part of non-preservation value-add process to facilitate greater reuse.

 

Throughout all of these options (aside from 4) the user could be given a number of ways to interact with the object and move content from it to a more modern environment (these may depend on confidentiality or commercial constraints):

 

a)  They could be given the option of printing objects to a file or printer.

b)  They could be given the option of selecting and copying content to paste into the modern host environment.

c)  They could be given the option of saving the object in a different format and moving the result to the modern host environment.

 

How does this process differ from standard, migration-based, approaches?

 

 

1.  There is no validating of files against format standards (JHOVE would be unnecessary). Format validation only matters if you want to be able to consistently apply migration tools across a large set of files, where intra-format variance can cause those tools to behave inconsistently. Intra-format variance generally results from different creating applications writing files differently while intending them to adhere to the same formatting standard. If you are employing an emulation strategy this variance is not a problem. It is useful for identifying the rendering application, though it is a problem for validation tools.

 

2.  Format analysis becomes less important. Strictly speaking format identification is unnecessary when implementing an emulation strategy. The only format-like information that is necessary is an identifier for the rendering environment(s) to be used to render the object. File format identification tools could be used to infer the rendering environment(s) for the files. For example tools like DROID could be repurposed to identify patterns relating to creating applications and from there the intended rendering environment(s) could be inferred.

 

3.  Identifying the rendering environment would be much more important and testing that environment at point of transfer could be more important. Doing this at point of transfer would make any issues apparent immediately rather than putting them off to a later date. In theory it would make it easier to consult with the original content owners to confirm decisions made (something that is harder to do each time a migration is conducted).

 

4.  Preservation planning would involve tracking systems architecture etc, not software “obsolescence”. I.e. preservation planning would require ensuring that your emulation tools ran on your current host environment(s).

 

5.  Preservation actions would involve writing new emulation hosts to host the old virtual hardware or writing new emulators to run the old environment images. This could be a significant process but would be relatively rare and would only need to be done once per emulator (which might emulate many different architectures & hundreds or thousands of environments).

 

6.  Decisions about the content presented to users (e.g. as a result of migration or emulation) are made early in the preservation process (at point of transfer) as opposed to when a migration action is deemed necessary. 

7. Access to the digital original could be more complicated for the average user and various mechanisms may have to be put in place to overcome this. Providing basic instructions for interacting with each environment would be an initial step. Old software documentation could be digitised and made available. Old software manuals often assumed no knowledge of computers and could be repurposed for future users.  Interactive walk-through overlays could be added to the software (thanks to Jeff Rothenberg for suggesting this) leading users through the main steps necessary to interact with the objects (e.g. when mice no longer exist). Access to derivative versions may also be provided if required.
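Point 2 above suggests inferring rendering environments from traces left by the creating application. A minimal sketch of that idea follows; the byte patterns and environment IDs are entirely hypothetical and are not real DROID signatures, and a real tool would match at specific offsets rather than anywhere in the file.

```python
# Hypothetical mapping from byte patterns left by creating applications
# to hypothetical rendering environment IDs.
CREATOR_PATTERNS: list[tuple[bytes, str]] = [
    (b"OpenOffice.org", "env-openoffice-2"),
    (b"Microsoft Office Word", "env-word-2003-winxp"),
]


def infer_environments(data: bytes) -> list[str]:
    """Return the environment IDs whose creator pattern occurs in the file bytes."""
    return [env for pattern, env in CREATOR_PATTERNS if pattern in data]


sample = b"...document properties... OpenOffice.org 2.0 ..."
print(infer_environments(sample))
```

An empty result would simply mean the object has to be examined by hand, for example by trying it in a few candidate environments.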


 

In general the steps involved in implementing a digital preservation strategy involving only emulation are quite different from those involved in implementing a migration strategy. Without solid examples of the practice of each, and metrics on costs and results, it is hard to say which would be more efficient.

 

I welcome comments and am very aware of the many gaps in this quite hurriedly written post. I chose to post this here rather than on the OPF or elsewhere because of its very raw nature, its speculative content and because I do not want it in any way associated with my kind employers.

 

EDIT:

I forgot an important point

The digital preservation institution would not necessarily have to hold copies of any or every environment. They would only need to have access to them or to ensure that users could access them. Initially this may be possible with no work whatsoever. For example the environment for a PDF file may be limited to any current version of Acrobat Reader that a user would likely have at home, running on any OS that supported it. In the future, if external emulation services were available, the preservation institution may only have to check that the particular environment was available or request that it was configured and made available from the service provider. After that they may not need to actively do a lot besides tracking the health of the service providers (and the usual bit-preservation routines).


PDF, File Formatting and Creating Applications

I often go on about how software applications often diverge from, or uniquely implement, documented standards for formatting files that they create. Some software vendors are aware of, and document, these deviations. Adobe are a good example of this with their software and in particular their implementation of extensions in writing files that adhere to the PDF version 1.7 formatting standard. Below is an extract from the Wikipedia entry on PDF with my emphasis in italics at the bottom:

“Adobe’s PDF specifications

Adobe changed the PDF specification several times and continues to develop new specifications with new versions of Adobe Acrobat. There have been nine versions of PDF with corresponding Acrobat releases:[10]

  • (1993) – PDF 1.0 / Acrobat 1.0
  • (1994) – PDF 1.1 / Acrobat 2.0
  • (1996) – PDF 1.2 / Acrobat 3.0
  • (1999) – PDF 1.3 / Acrobat 4.0
  • (2001) – PDF 1.4 / Acrobat 5.0
  • (2003) – PDF 1.5 / Acrobat 6.0
  • (2005) – PDF 1.6 / Acrobat 7.0
  • (2006) – PDF 1.7 / Acrobat 8.0
  • (2008) – PDF 1.7, Adobe Extension Level 3 / Acrobat 9.0
  • (2009) – PDF 1.7, Adobe Extension Level 5 / Acrobat 9.1

The ISO standard ISO 32000-1:2008 is equivalent to Adobe’s PDF 1.7. Adobe declared that it is not producing a PDF 1.8 Reference. The future versions of the PDF Specification will be produced by ISO technical committees. However, Adobe published documents specifying what extended features for PDF, beyond ISO 32000-1 (PDF 1.7), are supported in its newly released products. This makes use of the extensibility features of PDF as documented in ISO 32000-1 in Annex E. Adobe declared all extended features in Adobe Extension Level 3 and 5 have been accepted for a new proposal of ISO 32000-2 (a.k.a. PDF 2.0).[11]

The specifications for PDF are backward inclusive. The PDF 1.7 specification includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. Where Adobe removed certain features of PDF from their standard, they too are not contained in ISO 32000-1.[1]

PDF documents conforming to ISO 32000-1 carry the PDF version number 1.7. Documents containing Adobe extended features still carry the PDF base version number 1.7 but also contain an indication of which extension was followed during document creation.”

I added the emphasis to make a point. For understanding how files are internally structured it is not always enough to just know the formatting standard adhered to when files were created (e.g. PDF version 1.7). Sometimes we need more information about how the particular application chose to interpret or, as in the example above, implement the standard. This information could be represented in many cases simply by knowing what the creating application was. This is shown/implicitly acknowledged in the Wikipedia extract by the inclusion of the application association with the name of each PDF version listed in the list of versions. 
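As a rough illustration, the sketch below pulls three things out of a PDF’s raw bytes: the version in the header, the Adobe extension level recorded in the catalog’s Extensions dictionary, and the Producer string naming the creating application. It is a naive byte scan under stated assumptions: it will miss entries stored in compressed object streams, it expects BaseVersion to precede ExtensionLevel, and it only handles simple literal Producer strings. The sample bytes are made up for the example.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")
EXTENSION_RE = re.compile(
    rb"/ADBE\s*<<\s*/BaseVersion\s*/(\d\.\d)\s*/ExtensionLevel\s+(\d+)"
)
PRODUCER_RE = re.compile(rb"/Producer\s*\(([^)]*)\)")


def describe_pdf(data: bytes) -> dict:
    """Report header version, Adobe extension level and producer, where found."""
    header = HEADER_RE.search(data[:1024])   # the header sits at the start of the file
    extension = EXTENSION_RE.search(data)
    producer = PRODUCER_RE.search(data)
    return {
        "header_version": header.group(1).decode() if header else None,
        "adobe_extension_level": int(extension.group(2)) if extension else None,
        "producer": producer.group(1).decode("latin-1") if producer else None,
    }


# Made-up fragment of a PDF, for illustration only.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog "
    b"/Extensions << /ADBE << /BaseVersion /1.7 /ExtensionLevel 3 >> >> >>\nendobj\n"
    b"2 0 obj\n<< /Producer (Acrobat Distiller 9.0.0) >>\nendobj\n"
)
print(describe_pdf(sample))
```

Two files can both report version 1.7 in the header and still differ in extension level and producer, which is exactly the extra information the paragraph above argues for recording.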



Clarifying Migration vs Emulation (+ some conjectures)

Disclaimer:

This post and all others on this blog are my personal thoughts and opinions and are not necessarily those of any organisation I work for or have worked for.

Now to the post.

Firstly, the clarification:

If we assume that “the aim of digital preservation is to maintain our (the preserving organisation’s) ability to render digital objects over time”.

Then this means that digital objects become at risk when there is potential for them not to be rendered by us at a point in the future, and digital objects become issues when they can’t be rendered by us.

Maintaining the ability to render digital objects means maintaining access to a software environment that can render the objects. In other words this means we have to have at least one copy of the software and dependencies that are needed to render the objects. 

In order to mitigate the risk that objects won’t be renderable we have at least two options:

1. migrate content from files that make up the objects to other files that can be rendered in environments that we currently support. 

2. maintain access to environments indefinitely using emulation/virtualization.

So there is the clarification. Now some conjectures regarding it:

  1. For any reasonably sized volume of digital objects that require the same rendering environment, it may be simpler and cheaper to just continue to maintain access to one environment by emulating or virtualizing it. All this takes is the ability for somebody to install the required software in a virtual/emulated machine and for that machine image to continue to be renderable by emulation/virtualisation software in the future.
  2. Maintaining one copy of a compatible environment suffices for preservation purposes as it enables us to say we have preserved the objects, but is probably not good enough for access. There are reasons why we should provide viewers for digital objects, and also reasons why we should try to make sure users can access objects using their own modern/contemporary software. For these reasons we may also have to perform migration where it is cheap/fundable and provide access to the preservation master through reading rooms (either physical or virtual) in which we can restrict the number of concurrent users to as many as we have licenses for the emulated environments.

Emulation Workbench for Digital Object Format Analysis

As part of on-going research I have recently been working a lot with emulated desktop environments. 

One of the somewhat surprising things to come out of this work has been the realisation that having a set of emulated desktops with various old applications installed on them (an emulation workbench) is a really valuable tool for digital preservation practitioners.

When faced with a digital object with an unknown format that DROID, JHOVE etc. cannot identify, one of the most useful approaches I have found for discovering the format of the object is to try opening it in a number of applications of roughly the same era. Often applications will suggest an open parameter to use when opening a file, e.g.:

[Screenshot not preserved: an application suggesting an open parameter for a file]

Or they may produce obvious errors when opening a file, e.g.:

[Screenshot not preserved: an application reporting errors when opening a file]

Both can be useful for understanding the types of objects you are dealing with.

Some applications specify explicitly that they are converting an object from one format to another, implying that the application decided that the object was of the first format. 

Admittedly this approach can be time consuming. But if you have a set of files that you think are the same type it may be worthwhile spending the time attempting to open the files in different applications. Also, with some research it may be possible to automate this process so that an object can be automatically opened in a range of applications from its era and the results automatically analysed to see which gave the fewest errors, or to analyse the conversion messages provided to see whether all the applications agree on the original format. Jay Gattuso has discussed something similar here.
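As a rough illustration of the analysis step only (not of driving the emulated applications themselves), the sketch below assumes that each attempt to open an object has already been recorded somehow. The `OpenAttempt` record, its fields and the helper functions are all hypothetical names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OpenAttempt:
    """One attempt to open an object in one application (hypothetical record)."""
    application: str
    error_count: int                       # errors or warnings seen while opening
    reported_format: Optional[str] = None  # format named in a conversion message, if any


def fewest_errors(attempts: List[OpenAttempt]) -> List[str]:
    """Return the application(s) that opened the object with the fewest errors."""
    if not attempts:
        return []
    best = min(a.error_count for a in attempts)
    return [a.application for a in attempts if a.error_count == best]


def format_votes(attempts: List[OpenAttempt]) -> Counter:
    """Count how many applications reported each original format."""
    return Counter(a.reported_format for a in attempts if a.reported_format)


def applications_agree(attempts: List[OpenAttempt]) -> bool:
    """True when every application that named a format named the same one."""
    return len(format_votes(attempts)) == 1
```

With made-up data, `fewest_errors([OpenAttempt("Application A", 0, "Format X"), OpenAttempt("Application B", 3)])` would single out "Application A", and `applications_agree` would report agreement only because one application named a format at all. A real workflow would probably want a minimum number of votes before trusting the result.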

Given the obsolescence of hardware, and difficulty setting up old hardware, this use-case highlights the need for a set of emulated desktops for digital preservation practitioners to add to their tool-set. Such a tool-set or “workbench” would be extremely helpful for adding to format databases such as Pronom and UDFR.

Comments appreciated via @euanc on twitter

Mining Application Documentation for File Format Intelligence

I’ve been working on an application and installed environment database. 

As part of this I have been documenting the save-as, open, export and import parameters (options) for many business applications. 

For example, the following are the open parameters available for Lotus 1-2-3 97 edition installed on Windows 95:

ANSI Metafile (CGM)
Bitmap (BMP)
dBase (DBF)
Excel (XLS;XLT;XLW)
Lotus 1-2-3 PIC (PIC)
Lotus 1-2-3 SmartMaster Template (12M)
Lotus 1-2-3 Workbook (123;WK*)
Paradox (DB)
Quattro Pro (WQ1;WB1;WB2)
Text (TXT;PRN;CXV;DAT;OUT;ASC)
Windows Metafile (WMF)
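To make such a list easier to query, each entry could be split into a label and its extensions. The following is only a minimal sketch, assuming the entries follow the "Label (EXT;EXT)" pattern shown above. The `OpenParameter` record and both functions are hypothetical and do not reflect the actual structure of the database.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OpenParameter:
    label: str                   # e.g. "Excel"
    extensions: Tuple[str, ...]  # e.g. ("XLS", "XLT", "XLW")


def parse_parameter(line: str) -> OpenParameter:
    """Split 'Label (EXT1;EXT2)' into a label and a tuple of extensions."""
    match = re.fullmatch(r"\s*(.+?)\s*\(([^()]*)\)\s*", line)
    if match is None:
        # No bracketed extension list: keep the whole line as the label.
        return OpenParameter(label=line.strip(), extensions=())
    label, ext_field = match.groups()
    extensions = tuple(e.strip().upper() for e in ext_field.split(";") if e.strip())
    return OpenParameter(label=label, extensions=extensions)


def index_by_extension(parameters: List[OpenParameter]) -> Dict[str, List[str]]:
    """Map each extension to the labels of the parameters that list it."""
    index: Dict[str, List[str]] = {}
    for parameter in parameters:
        for ext in parameter.extensions:
            index.setdefault(ext, []).append(parameter.label)
    return index


# The Lotus 1-2-3 97 open parameters listed above.
LOTUS_123_97_OPEN = [parse_parameter(line) for line in [
    "ANSI Metafile (CGM)",
    "Bitmap (BMP)",
    "dBase (DBF)",
    "Excel (XLS;XLT;XLW)",
    "Lotus 1-2-3 PIC (PIC)",
    "Lotus 1-2-3 SmartMaster Template (12M)",
    "Lotus 1-2-3 Workbook (123;WK*)",
    "Paradox (DB)",
    "Quattro Pro (WQ1;WB1;WB2)",
    "Text (TXT;PRN;CXV;DAT;OUT;ASC)",
    "Windows Metafile (WMF)",
]]
```

Entries without a bracketed extension list are kept as a bare label rather than rejected, since not every application states extensions in its parameters.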

Recently I realised that this might be a good source for intelligence about file formats. Let me explain what I mean.

Different applications differentiate in different ways between versions of file formats in their open and save-as parameters. Analysing the logic behind the differentiation may reveal whether format variants are significant or not.

For example, Microsoft Word Version 6.0c (running on Windows 3.11) has the following open parameters for Word for MS-DOS files:

Word for MS-DOS 3.x - 5.x
Word for MS-DOS 6.0

In contrast to this, WordPerfect 5.2 for Windows (running on Windows 3.11) has these open parameters:

MS Word 4.0; 5.0 or 5.5
MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b

The first of these may be referring to MS-DOS versions.

Lotus Word Pro 96 Edition for Windows (running on Windows 3.11) has the following open parameter for Word for MS-DOS files:

MS Word for DOS 3;4;5;6 (*.doc)

And Corel WordPerfect Version 6.1 for Windows (running on Windows 3.11) has these open parameters:

MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b; 2.0c
MS Word for Windows 6.0

None of these refer to any MS-DOS variants.

This pattern continues through more recent variants of each office suite.

The interesting finding from this is that the Microsoft suites differentiate between versions 3, 4 and 5 (as a group) and version 6, but not between versions 3, 4 and 5, whereas the other suites (when they have a relevant parameter) do not differentiate between any of versions 3, 4, 5 or 6.

If every office suite differentiated between the variants in the same way, this would indicate that there were significant differences between them. However, as they don’t, the evidence is inconclusive in this case.

As Microsoft wrote the standards in this example, their suites ought to have the most reliable information, so it may be sensible to conclude that version 6 is significantly different from versions 3, 4 and 5.

This pattern also holds for save-as parameters. The Microsoft suites differentiate between version 6 and the group of versions 3, 4 and 5, whereas the other suites don’t differentiate this way.

As the database becomes more populated, more analysis will be possible. Where there is general agreement in both open and save-as parameters across multiple applications, this will give digital preservation practitioners very good reason to believe that there are significant differences between the formats in question.
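The comparison described above could be sketched roughly as follows. This is only an illustration, assuming each application's parameters have already been reduced by hand to groups of Word for MS-DOS version numbers. The `GROUPINGS` table encodes my reading of the two open parameters quoted earlier that clearly refer to MS-DOS versions, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# How each application groups Word for MS-DOS versions in its open parameters.
# Only the two applications whose parameters clearly name MS-DOS versions are included.
GROUPINGS: Dict[str, List[Set[str]]] = {
    "Microsoft Word 6.0c": [{"3", "4", "5"}, {"6"}],
    "Lotus Word Pro 96": [{"3", "4", "5", "6"}],
}


def group_index(groups: List[Set[str]], version: str) -> Optional[int]:
    """Return the position of the group containing the version, or None."""
    for position, group in enumerate(groups):
        if version in group:
            return position
    return None


def separates(groups: List[Set[str]], a: str, b: str) -> Optional[bool]:
    """True if the application puts a and b in different groups, None if unknown."""
    index_a, index_b = group_index(groups, a), group_index(groups, b)
    if index_a is None or index_b is None:
        return None
    return index_a != index_b


def agreement(groupings: Dict[str, List[Set[str]]], a: str, b: str) -> str:
    """Summarise whether the applications agree that versions a and b differ."""
    verdicts = [separates(groups, a, b) for groups in groupings.values()]
    known = [v for v in verdicts if v is not None]
    if not known:
        return "no evidence"
    if all(known):
        return "all separate"
    if not any(known):
        return "none separate"
    return "mixed"
```

With this table, `agreement(GROUPINGS, "5", "6")` comes out as "mixed", which matches the inconclusive result described above, while `agreement(GROUPINGS, "3", "4")` comes out as "none separate".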

I am carefully suggesting that these findings only give us reason to believe that there are differences. There may not actually be differences. Just because particular applications allow users to differentiate between these parameters/file formatting options, that does not mean that the applications themselves actually do. It may, for example, be a marketing tool to enable the vendor of the product to state that the tool is “compatible with many formats” even though it may use the same code to open them all.

Hopefully finding similar differences across many vendors’ tools enables us to mitigate this issue, but it should be noted that this approach does not provide definitive results.

Comments would be appreciated via twitter @euanc