Workflow for Digitization and Preservation
The workflow developed for the digitization and preservation of floppy disks at ARCS comprises six main stages, as shown in the workflow diagram below. It is adapted from Meister (2014), which divides the process into four stages: Document Media, Create Disk Image, Export Files, and Initial Analysis. We added an Appraisal stage and a Pre-Ingest/Ingest stage to account for the final actions in the preservation process.
This section will walk through the workflow and discuss some of the challenges we encountered while performing the digitization process. A detailed step-by-step version of the workflow is available in the Digital Preservation Guide: 3.5-Inch Floppy Disks.
Stage 1: Document Media
To run through the workflow, we selected a disk in a format that the disk drive would be able to read without any issues: an IBM-formatted disk. We assigned an identifier to the disk (Step 1). Since all the disks in the fonds were found in a box with no particular arrangement, the identifier is simply a number denoting the order in which they were processed. Recording the identifier number in the filename and metadata was simple enough. However, the question of whether, and how, to mark the physical floppy disk with the identifier number has not been answered; it depends on whether the disk is retained after digitization for long-term preservation.
We then took note of the physical characteristics of the disk, including its color, format, storage capacity, serial number, physical condition, and the information on the label (Step 2). The label of this particular disk read: "Final Report 1997, Women in Engrg chair; SSHRC; AMIA.htm[?]". We then took a photograph of the floppy disk (Step 3).
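As an illustration of how the identifier and physical description from Steps 1 and 2 could be recorded together, here is a minimal sketch. The `DiskRecord` class, its field names, and the `disk_001` naming pattern are hypothetical and are not part of the ARCS guide; apart from the format and label text quoted above, the values are placeholders rather than observations of the actual disk.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DiskRecord:
    """Hypothetical record of one floppy disk (Stage 1, Steps 1 and 2)."""

    identifier: int       # processing-order number assigned in Step 1
    color: str
    disk_format: str
    capacity: str
    serial_number: str
    condition: str
    label_text: str

    def base_filename(self, prefix: str = "disk") -> str:
        # Zero-pad the identifier so that filenames sort in processing order.
        return f"{prefix}_{self.identifier:03d}"


# Placeholder values, except the format and label text taken from the report.
record = DiskRecord(
    identifier=1,
    color="(not recorded here)",
    disk_format="IBM",
    capacity="(not recorded here)",
    serial_number="(not recorded here)",
    condition="(not recorded here)",
    label_text="Final Report 1997, Women in Engrg chair; SSHRC; AMIA.htm[?]",
)

print(record.base_filename())
print(json.dumps(asdict(record), indent=2))
```

Keeping the identifier in one place and deriving the filename from it is one way to make sure the photograph, disk image, and metadata all carry the same number.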
Stage 2: Create Disk Image
To create the disk image, we used the following equipment:
- BitCurator software environment, installed on a dedicated computer without network access
- the Targus PA905 disk drive
- the Tableau write-blocker, in addition to the write-protection tabs on the disk itself
Steps 1 to 3 of this stage consist of configuring the software and hardware.
Step 4 was to create the disk image. It was interesting to see the perspective of the digital forensics community reflected in Guymager, the tool used to acquire the disk image. Metadata fields such as "Case number", "Evidence number", and "Examiner" were easily adapted to suit ARCS' needs.
The disk imaging action produced two files: the disk image file in Expert Witness format (.E01), which contains embedded metadata, and an info file with metadata about the process. With the disk image file created, the floppy disk was safely removed from the drive. The disk image serves as a baseline copy of the artifact. Any extraction or manipulation of the disk contents is done on this digital copy (Durno, 2016; Lee et al., 2013).
Stage 3: Export Files
Working with the disk image in BitCurator is relatively simple thanks to the BitCurator Quick Start Guide (UNC School of Information and Library Science, 2018). However, there were a few unexpected situations that came up along the way.
When we opened ClamTK to perform a virus check on the disk image (Step 1), it informed us that updates were required. Without network access, however, we could not easily download them. We decided to proceed without the updates, though in the future ARCS will need a system for updating ClamTK and the other open-source tools in the BitCurator environment.
After running a virus check, we attempted to run a checksum on the disk image file, but were not able to. We soon discovered that all we needed to do was mount the disk (Step 2) for the checksum (Step 3) to work. Next, we exported the files from the disk image (Step 4), a total of 37 files, and ran a virus check on all the files (Step 5).
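For readers unfamiliar with checksums, the sketch below shows the general idea of Step 3: a digest is computed for each file so that later copies can be compared against it. This is not the BitCurator tool itself. The function names are hypothetical, and the choice of SHA-256 is an assumption, since the algorithm used in our workflow is not stated here.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Calling `checksum_manifest(Path("/path/to/exported/files"))` (a placeholder path) would return one entry per file. Reading in chunks keeps memory use low even for large disk image files.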
Stage 4: Initial Analysis
The initial analysis stage is about making sense of the disk image contents using various tools in BitCurator. First, we ran the bulk extractor tool, which identifies personal and private information, such as phone numbers and email addresses, as well as other features of interest like URLs (Step 1). Then, we used the BitCurator Reporting Tool to extract filesystem metadata (Step 2), create preservation metadata in PREMIS format (Step 3), and generate a series of human-readable reports that allow the archivist to conduct an appraisal of the disk contents (Step 4).
In keeping with the relevant code of ethics, we permanently removed all the deleted files that were recovered by BitCurator (Step 5). The deleted files were identified in the fiwalk_deleted_files.pdf report, though we noticed that not all of them were visible in the file viewer. We were able to reveal the hidden files and perform the deletion by clicking “show hidden files” in the menu options. Then, another checksum was run (Step 6).
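One way to confirm that a removal affected only the intended files is to compare checksums recorded before and after it. The sketch below assumes the checksums are held as a mapping of relative path to digest; the `compare_manifests` function, the `ManifestDiff` type, and the file names in the example are hypothetical and do not come from our disk.

```python
from typing import Dict, List, NamedTuple


class ManifestDiff(NamedTuple):
    removed: List[str]   # present before, absent after
    added: List[str]     # absent before, present after
    changed: List[str]   # present in both, but with different digests


def compare_manifests(before: Dict[str, str], after: Dict[str, str]) -> ManifestDiff:
    """Compare two {relative path: digest} mappings."""
    before_names, after_names = set(before), set(after)
    changed = sorted(
        name for name in before_names & after_names if before[name] != after[name]
    )
    return ManifestDiff(
        removed=sorted(before_names - after_names),
        added=sorted(after_names - before_names),
        changed=changed,
    )


# Hypothetical digests, shortened for readability.
before = {"report.doc": "aa11", "notes.txt": "bb22", "_eleted.tmp": "cc33"}
after = {"report.doc": "aa11", "notes.txt": "bb22"}
diff = compare_manifests(before, after)
print(diff.removed)   # ['_eleted.tmp']
print(diff.changed)   # []
```

An empty `changed` list indicates that the remaining files are bit-for-bit identical to what was there before the removal.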
The final step (6) of the initial analysis stage is to catalogue the created contents (the photograph of the floppy disk, the disk image, and the .info file). This digital preservation metadata will be uploaded into Archivematica with the Submission Information Package (SIP).
At this point, our role in the process was complete, as the next stage relies on appraisal by an archivist. We were also unable to proceed because many of the files were in legacy formats that we could not open.
Stage 5: Appraisal
The reports created in the previous stage enable the archivist to conduct a high-level appraisal of the disk image contents (Step 1). The archivist can also assess individual files further by opening them. In particular, files containing personal or private information should be examined and removed or redacted as needed. The bulk_extractor_report.pdf provides an overview of how many instances there are of each feature type (e.g. telephone, email, etc.); however, the individual Excel files for each feature type must be opened to see which file each feature appears in. It took us a while to figure out how to find the location of each feature: look in column B, "filename"!
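When a feature spreadsheet is long, tallying the "filename" column can show at a glance which files contain the most features of that type. The sketch below assumes the spreadsheet has first been saved as a CSV file with a header row that includes a "filename" column. The function name and the sample rows used to try it out are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def features_per_file(csv_path: Path, column: str = "filename") -> Counter:
    """Count how many feature rows point at each file.

    Assumes a CSV export with a header row containing `column`.
    Rows where that column is empty or missing are skipped.
    """
    counts: Counter = Counter()
    with csv_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            name = (row.get(column) or "").strip()
            if name:
                counts[name] += 1
    return counts
```

Calling `features_per_file(Path("email.csv")).most_common(5)` (a placeholder filename) would list the five files with the most hits, which could help prioritize which files to open first for review.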
Though it may not be possible to open legacy files, a file format identification tool should be used to identify the formats of the extracted files (Step 2). Appraisal decisions need to weigh the difficulty of accessing and migrating files against their potential value. The availability of special software to access the files, or the obsolescence of the format, may affect these decisions (McGuire, 2018).
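To illustrate the principle behind format identification, the sketch below checks the first bytes of a file against a handful of well-known signatures. It is a simplified stand-in and not a substitute for a dedicated identification tool: the signature list is deliberately tiny, the function name is hypothetical, and many legacy word-processing formats have no signature that a check this simple would catch.

```python
from pathlib import Path
from typing import Optional

# A few widely documented leading-byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Microsoft Office)"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"{\\rtf", "Rich Text Format"),
]


def guess_format(path: Path) -> Optional[str]:
    """Return a format label if the file starts with a known signature."""
    with path.open("rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None  # unknown: needs a fuller signature registry or manual review
```

Because the check reads only the file's leading bytes, it works even when no software is available to open the file, which is the situation we faced with many of the legacy formats.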
Files deemed not to have archival value can be deleted (Step 3), although it is interesting to note that they remain accessible if they are re-extracted from the disk image file. It is recommended that a final directory listing of all files being accessioned be created (Step 4) and included in the SIP (McGuire, 2018).
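A final directory listing can be produced in many ways. As one possibility, the sketch below walks a folder of retained files and writes a CSV listing. The function name and the column headings (`path`, `size_bytes`, `modified_utc`) are our own assumptions, not a format prescribed by McGuire (2018) or by Archivematica.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def write_directory_listing(root: Path, output_csv: Path) -> int:
    """Write one CSV row per file under root; return the number of files listed."""
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            info = path.stat()
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append(
                [path.relative_to(root).as_posix(), info.st_size, modified.isoformat()]
            )
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "size_bytes", "modified_utc"])
        writer.writerows(rows)
    return len(rows)
```

Writing the listing to a location outside the folder being listed keeps the listing file itself out of later runs. Note that the modification times reported are those of the exported copies, which may differ from the original timestamps preserved in the filesystem metadata extracted in Stage 4.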
Below are some of the BitCurator files which the archivist may use in the appraisal stage.
Stage 6: Pre-Ingest/Ingest
Finally, the output files generated throughout the entire process would be packaged to create a Submission Information Package (SIP). A table showing the output file generated by each step is available on pages 34-35 of our Digital Preservation Guide. The SIP would then be uploaded into Archivematica, where a checksum is generated upon ingest.