TestData

From PreservWiki


Revision as of 16:19, 27 February 2009


Methodology

  • Identifying the test data: what it is, where it is, and where it comes from.

The test data represents a snapshot of ROAR taken at the time of running the import_test_data script. This script asks ROAR for a current overall snapshot of all the repositories registered with ROAR, as well as for each of the individual repositories identified. The snapshot consists of a table outlining how many files of each format (and version) are in the repository, as classified by an earlier version of DROID. From this snapshot the import script then processed the results to establish how many files of each format were required to build a smaller but similar-looking snapshot. For the typical repository snapshot, made from the snapshot of all registered repositories in ROAR (over 1400), it was decided to harvest around 1000 files. From each of the individual repositories, 100 files were targeted.
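The downscaling step described above can be sketched as follows. This is a minimal illustration; the function name, format labels and counts are hypothetical, not the real ROAR snapshot figures:

```python
def target_counts(format_counts, sample_size):
    """Scale a repository's format profile down to a smaller sample
    with roughly the same proportions."""
    total = sum(format_counts.values())
    targets = {
        fmt: round(count / total * sample_size)
        for fmt, count in format_counts.items()
    }
    # Formats so rare they round down to zero files are dropped.
    return {fmt: n for fmt, n in targets.items() if n > 0}

# Illustrative counts only -- not the real snapshot data.
snapshot = {"PDF 1.4": 13000, "PDF 1.3": 10000, "HTML": 4000, "Rare format": 3}
print(target_counts(snapshot, 1000))
```

The rounding means very rare formats disappear from the scaled-down sample, which is one reason the small dataset can only approximate the full snapshot.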

  • How did I get the data?
  • What did I have to change to make it work?

To harvest the content, ROAR was again used to randomly select file URLs for each format, and these were then downloaded directly from the source repository. Any URLs which failed to connect or download (of which there were a lot!) were ignored, and another URL was requested from ROAR until a complete dataset had been downloaded. Since ROAR is based on data ascertained from the OAI-PMH interface of each repository, which is intended to be a reliable service, it was surprising how many URLs had to be attempted before a file was returned.
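The skip-and-retry harvesting loop can be sketched like this, with a stand-in fetcher instead of real HTTP (all names here are hypothetical):

```python
def harvest(candidate_urls, fetch, needed):
    """Try candidate URLs in turn, skipping any that fail, until
    `needed` files have been retrieved (or the candidates run out)."""
    files, failures = [], 0
    for url in candidate_urls:
        if len(files) >= needed:
            break
        try:
            files.append(fetch(url))
        except OSError:
            failures += 1  # dead link -- move on to the next candidate
    return files, failures

# Stand-in fetcher: pretend every URL containing "dead" is unreachable.
def fake_fetch(url):
    if "dead" in url:
        raise OSError("connection failed")
    return url + " (downloaded)"

urls = ["http://a/1", "http://dead/2", "http://b/3", "http://dead/4", "http://c/5"]
files, failures = harvest(urls, fake_fetch, 3)
```

In real use `fetch` would be an HTTP download and `candidate_urls` would be requested from ROAR incrementally rather than supplied up front; the failure count is what made the surprising unreliability visible.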

  • What did I do with the data?
  • What I did with the files? Where they went?

Once the data was harvested from the source repositories, it was briefly analyzed to double-check for completeness before being injected into an EPrints 3.2 (svn) repository extended with the Eprints/Preserv2 Preservation Toolkit. Each dataset was fed into a separate repository, keeping the datasets apart. In total, 2144 files were fed into 13 repositories.

  • What is the toolkit and what does it include?

The Preserv2/Eprints Preservation Toolkit consists of 3 main parts, 2 of which have been covered elsewhere within this project. The first is the DroidWrapper, which has been customized to read and manipulate EPrints 3.2 datasets directly, allowing direct and reliable access to all of the files within the repository. More about why this was done is discussed in the results section below. This version of the DroidWrapper is otherwise the same as any other version: it simply locates the files and feeds them to DROID for classification, and the result is then fed back into the repository. Because of the nature of this particular DROID wrapper, the classification is fed directly into the repository rather than made available for a separate parser to process.

The next part of the Preserv2/Eprints Preservation Toolkit consists of a configuration file, including an EPrints dataset definition which extends the EPrints file dataset with pronomid, classification_quality and classification_date fields. This configuration file also adds the pronom dataset, which consists of a cache table storing data from the PRONOM registry so that the registry doesn't have to be queried every time the repository administrator asks for a preservation report. This dataset also caches the file counts and risk scores, enabling that page to load much faster.
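The caching behaviour of the pronom dataset can be sketched roughly like this; `PronomCache` and the fake registry function are illustrative, not the toolkit's actual code:

```python
class PronomCache:
    """Memoise PRONOM registry lookups so the remote registry is queried
    at most once per format id, not on every preservation report."""

    def __init__(self, query_registry):
        self._query = query_registry  # an HTTP call in real use
        self._cache = {}

    def lookup(self, pronom_id):
        if pronom_id not in self._cache:
            self._cache[pronom_id] = self._query(pronom_id)
        return self._cache[pronom_id]

calls = []
def fake_registry(pronom_id):
    calls.append(pronom_id)  # record how often the "registry" is hit
    return {"id": pronom_id, "name": "Portable Document Format"}

cache = PronomCache(fake_registry)
cache.lookup("fmt/18")
cache.lookup("fmt/18")  # served from the cache; no second registry query
```

In the toolkit the cache lives in a database table rather than in memory, but the effect is the same: repeated preservation reports do not repeatedly hit the remote registry.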

The final part of the Preserv2/Eprints Preservation Toolkit is the page which displays the results to a repository administrator. More about this page and the toolkit can be found at http://wiki.preserv.org.uk/index.php/EPrintsPreservation.

  • Where were those files stored and managed when I fed them into the classification process?

The files are fed to DROID via the new EPrints storage controller, so DROID essentially has direct access to the object (if local) or receives it via a download if the file is stored offsite. This depends on the separation between DROID and the objects. For more information, please refer to the papers on Smart Storage and the EPrints Storage Layer.
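The local-versus-offsite decision can be sketched as follows, assuming a hypothetical record layout (the real storage controller API differs):

```python
def retrieve_for_droid(file_record, download):
    """Hand DROID a readable local path: the stored file itself when it
    is local, otherwise a temporary copy fetched from the remote store."""
    if file_record.get("local_path"):
        return file_record["local_path"]
    return download(file_record["url"])

# Hypothetical records -- paths and URLs are made up for illustration.
local = {"local_path": "/data/eprints/doc1.pdf", "url": None}
remote = {"local_path": None, "url": "http://store.example/doc2.pdf"}
fetched = retrieve_for_droid(remote, lambda url: "/tmp/doc2.pdf")
```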

  • What tool did I use for the process?

Not included in the toolkit is the DROID tool itself (as well as Java, which is required to run DROID). To classify our objects we used DROID v3.00 and Signature File v13.

  • What happens to the results?

Each object is fed to DROID individually via a DROID XML classification file. This is done to avoid command-line interpretation problems involving escaped characters. DROID then classifies the file and feeds back the results in the same XML format. These results are processed by the wrapper before being injected directly into the EPrints dataset fields.
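The file-list idea can be illustrated in Python like this. Note the element and attribute names below are placeholders, not DROID's exact input schema; the point is that paths live inside an XML document, so shell escaping never enters the picture:

```python
import xml.etree.ElementTree as ET

def build_file_collection(paths):
    """Build an XML file list to hand to DROID. Putting paths inside an
    XML document sidesteps shell quoting and escaping entirely."""
    root = ET.Element("FileCollection")  # placeholder element names
    for path in paths:
        ET.SubElement(root, "IdentificationFile", Name=path)
    return ET.tostring(root, encoding="unicode")

# Awkward characters in the path are handled by the XML serializer.
xml_doc = build_file_collection(["/data/odd name & (chars).pdf"])
# The wrapper would then invoke DROID on the written list file
# (exact command-line flags vary by DROID version).
```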

Once all the objects are classified, the wrapper triggers an update of the risk scores relating to those objects, and then updates the file counts for each format, readying the data to be displayed by the Admin page.
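That post-classification refresh amounts to a re-aggregation, which can be sketched as follows (function name, records and risk values are hypothetical):

```python
from collections import Counter

def refresh_summary(classified_files, risk_for):
    """Recompute per-format file counts and risk scores after a
    classification run, so the Admin page can read cached values
    instead of rescanning every file."""
    counts = Counter(f["pronomid"] for f in classified_files)
    risks = {fmt: risk_for(fmt) for fmt in counts}
    return dict(counts), risks

# Hypothetical classified records and risk lookup.
files = [{"pronomid": "fmt/18"}, {"pronomid": "fmt/18"}, {"pronomid": "fmt/96"}]
counts, risks = refresh_summary(files, lambda fmt: {"fmt/18": 1, "fmt/96": 3}[fmt])
```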

  • How is that processed and displayed?

The Preserv2/Eprints Preservation Toolkit comes with an Admin page which displays the results. More on this page, including screenshots of it acting upon the typical repository, can be found at http://wiki.preserv.org.uk/index.php/EPrintsPreservation.

  • Preparing the results for comparison to ROAR and conclusions

To process these results further, an extra system component was produced to analyse them. This component consisted of several extension services:

    • The ability to compare the Preserv2/Eprints Preservation Toolkit classification to that originally established by ROAR to find differences.
    • The ability to fill in gaps in mime-types.
    • The ability to display file extensions and compare these to the DROID classification.
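The first two of these extension services can be sketched roughly as follows; all names and records are hypothetical:

```python
def classification_diff(droid, roar):
    """Files whose DROID format disagrees with ROAR's earlier verdict."""
    return {
        f: (droid[f], roar[f])
        for f in droid
        if f in roar and droid[f] != roar[f]
    }

def fill_mime_gaps(records, mime_by_pronomid):
    """Fill in missing mime-types from the PRONOM id where one is known."""
    for r in records:
        if not r.get("mime_type"):
            r["mime_type"] = mime_by_pronomid.get(r["pronomid"])
    return records

# Hypothetical classifications keyed by filename.
droid = {"a.pdf": "fmt/18", "b.doc": "fmt/40"}
roar = {"a.pdf": "fmt/17", "b.doc": "fmt/40"}
diff = classification_diff(droid, roar)
```

The third service (comparing file extensions to the DROID classification) follows the same pattern: map each extension to its expected format ids and report files whose classification falls outside that set.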

About the dataset

The test dataset is a snapshot from roar.eprints.org taken at 17:00 on Thursday 27th November 2008.

The Typical Repository Dataset

The following table and chart represent the typical repository according to ROAR [1]

Here the formats covering the top 95% of files have been listed individually, leaving the final 5% grouped as other formats (a total of 111 formats in this case).

Format                                     Percentage
Unknown                                    13
Portable Document Format (1.4)             13
Portable Document Format (1.3)             10
Portable Document Format - Archival (1)     8
Portable Document Format (1.2)              5
Portable Document Format (1.6)              5
Hypertext Markup Language                   4
Portable Document Format (1.5)              4
Fixed Width Values Text File                2
MS-DOS Text File with line breaks           2
Unicode Text File                           2
IBM DisplayWrite Document (3)               2
IBM DisplayWrite Document (2)               2
MS-DOS Text File                            2
Macintosh Text File                         2
Tab-Delimited Text File                     2
Plain Text File                             2
Fixed Width Values Text File                2
Other (111 Formats)                         5

[Pie Chart[2]]
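The 95% cutoff used to build the table above can be computed like this (the counts here are illustrative, not the real figures):

```python
def top_formats(format_counts, cutoff=0.95):
    """Return formats in descending order of frequency until `cutoff`
    of all files is covered; the remainder is lumped into 'Other'."""
    total = sum(format_counts.values())
    kept, covered = [], 0
    for fmt, n in sorted(format_counts.items(), key=lambda kv: -kv[1]):
        if covered / total >= cutoff:
            break  # everything from here on falls into the final 5%
        kept.append(fmt)
        covered += n
    return kept

# Illustrative counts: 'Tiny' falls into the final 5% and is cut off.
counts = {"Unknown": 60, "PDF 1.4": 35, "Tiny": 5}
```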

In Preserv2 we are going to select 1000 items in a weighted fashion (e.g. 13% PDF (1.4)) to make up a test dataset. This data will be taken randomly from the 1200+ repositories currently registered with ROAR.

The dataset will then be loaded into a fully functional EPrints repository with the Preserv2 extensions and used for testing those extensions. This dataset can also be used as the exemplar dataset from which risk scores are generated, although it is more of a guide to which formats we need risk scores for, being atypical in the community.

A Sample of Specific Repositories

In the specific repository tests we are going to choose at least 10 repositories from ROAR which exhibit the diversity of repositories. We will then consider the effects of each of the following factors with regard to each repository's preservation profile and possible strategy.

  • Size of repository.
  • Preserv profile.
  • Type of repositories (Software and Content)
  • Levels of activity.
  • What effects do mandates have?

100+ files will be taken from each repository to represent an accurate cross-section of that repository (similar to the first tests, but this time repository-specific). Each of these datasets will then be subjected to the same tests as the typical-repository dataset in part 1.
