TestData

Methodology

  • What is the test data, where is it, and where does it come from?

The test data represents a snapshot of ROAR taken at the time of running the import_test_data script. This script asks ROAR for a current overall snapshot covering all the repositories registered with ROAR, as well as one for each of the individual repositories identified. Each snapshot consists of a table outlining how many files of each format (and version) are in the repository, as classified by an earlier version of DROID. From this snapshot the import script then processed the results to establish how many files of each format were required to build a smaller but similarly distributed snapshot. For the typical repository snapshot, built from the snapshot of all repositories registered in ROAR (over 1400), around 1000 files were harvested. For each of the individual repositories, 100 files were targeted.
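
As a minimal sketch of this scaling step, assuming the ROAR snapshot has already been parsed into a format-to-count table (the names scale_snapshot and snapshot are illustrative, not taken from import_test_data):

    def scale_snapshot(snapshot, target_total=1000):
        """Scale a format histogram down to roughly target_total files,
        preserving the relative proportions of each format."""
        grand_total = sum(snapshot.values())
        targets = {}
        for fmt, count in snapshot.items():
            # Rare formats may round down to zero and drop out of the sample.
            targets[fmt] = round(target_total * count / grand_total)
        return {fmt: n for fmt, n in targets.items() if n > 0}

    # Illustrative input only; real counts come from the ROAR snapshot.
    print(scale_snapshot({"Portable Document Format (1.4)": 130,
                          "Hypertext Markup Language": 40,
                          "Plain Text File": 20}))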

  • How did I get the data?
  • What did I have to change to make it work?

To harvest the content, ROAR was used again to randomly select file URLs for each format, and these were downloaded directly from the source repository. Any URLs which failed to connect or download (of which there were a lot!) were ignored and another URL was requested from ROAR, until a complete dataset had been downloaded. Since ROAR is built on data obtained from each repository's OAI-PMH interface, which is intended to be a reliable service, it was surprising how many URLs had to be attempted before a file was returned.
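
The harvesting loop can be sketched as follows, assuming a helper that asks ROAR for a random file URL of a given format (roar_random_url below is hypothetical, as is the flat output layout):

    import urllib.request

    def harvest_format(fmt, needed, roar_random_url, out_dir="."):
        """Download `needed` files of format `fmt`, fetching a fresh URL
        from ROAR whenever a download fails."""
        collected, attempts = 0, 0
        while collected < needed:
            url = roar_random_url(fmt)  # hypothetical ROAR query
            attempts += 1
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    data = resp.read()
            except OSError:
                continue  # dead URL: ignore it and ask ROAR for another
            with open(f"{out_dir}/{fmt}_{collected}", "wb") as f:
                f.write(data)
            collected += 1
        return attempts  # surprisingly large in practice, as noted above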

  • What did I do with the data?
  • What did I do with the files, and where did they go?

Once the data had been harvested from the source repositories, it was briefly analyzed to double-check for completeness, before being injected into an EPrints 3.2 (svn) repository extended with the Preserv2/EPrints Preservation Toolkit. Each dataset was fed into its own repository, keeping the datasets separate. In total, 2144 files were fed into 13 repositories.
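
A sketch of how the harvested files might be split across repositories; the per-dataset directory layout and the EPrints bin/import invocation (including its arguments and plugin name) are assumptions, not details taken from the project:

    import pathlib
    import subprocess

    EPRINTS_ROOT = pathlib.Path("/opt/eprints3")  # assumed install path
    DATASETS = pathlib.Path("/data/harvested")    # assumed: one subdir per dataset

    for dataset_dir in sorted(DATASETS.iterdir()):
        archive_id = dataset_dir.name  # one repository per dataset
        for file_path in sorted(dataset_dir.iterdir()):
            # Hypothetical import call: load one file into the repository
            # dedicated to this dataset.
            subprocess.run([str(EPRINTS_ROOT / "bin" / "import"),
                            archive_id, "eprint", "Documents", str(file_path)],
                           check=True)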

  • What is the toolkit and what does it include?

The Preserv2/EPrints Preservation Toolkit consists of three main parts, two of which have been covered elsewhere within this project. The first is the DroidWrapper, which has been customized to read and manipulate EPrints 3.2 datasets directly, allowing direct and reliable access to all of the files within the repository; why this was done is discussed in the results section later. This version of the DroidWrapper is otherwise the same as any other: it locates the files and feeds them to DROID for classification, and the result is then fed back into the repository. Because of the nature of this particular DROID wrapper, the classification is fed directly into the repository rather than being made available for a separate parser to process.
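
Schematically, the wrapper behaves like the sketch below; the DROID command line and the record-update hook are placeholders rather than the wrapper's real interface:

    import subprocess

    def classify_repository(file_records, droid_cmd, update_record):
        """file_records: (record_id, path) pairs read directly from the
        EPrints dataset; update_record writes the format back into it."""
        for record_id, path in file_records:
            # Placeholder invocation of a DROID classification command.
            result = subprocess.run(droid_cmd + [path],
                                    capture_output=True, text=True, check=True)
            fmt = result.stdout.strip()
            # The classification goes straight back into the repository
            # rather than to a separate parser.
            update_record(record_id, fmt)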

The other two parts of the Preserv2/EPrints Preservation Toolkit consist of a configuration file, including an EPrints dataset which is used to extend the

  • Where were those files stored and managed when I fed them into the classification process?
  • What tool did I use for the process?
  • What happens to the results?
  • How are the results processed and displayed?

About the dataset

The test dataset is a snapshot from roar.eprints.org taken at 17:00 on Thursday 27th November 2008.

The Typical Repository Dataset

The following table and chart represent the typical repository according to ROAR [1].

Here the formats making up the top 95% have been listed individually, leaving the final 5% as other formats (a total of 111 formats in this case); a sketch of this grouping is given after the chart below.

Format                                     Percentage
Unknown                                        13
Portable Document Format (1.4)                 13
Portable Document Format (1.3)                 10
Portable Document Format - Archival (1)         8
Portable Document Format (1.2)                  5
Portable Document Format (1.6)                  5
Hypertext Markup Language                       4
Portable Document Format (1.5)                  4
Fixed Width Values Text File                    2
MS-DOS Text File with line breaks               2
Unicode Text File                               2
IBM DisplayWrite Document (3)                   2
IBM DisplayWrite Document (2)                   2
MS-DOS Text File                                2
Macintosh Text File                             2
Tab-Delimited Text File                         2
Plain Text File                                 2
Other (111 Formats)                             5

[Pie chart of the format distribution above [2]]
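
The grouping mentioned above can be sketched as follows (the function name and the 95% cutoff parameter are illustrative):

    def top_percentile(format_shares, cutoff=95.0):
        """Keep formats until their cumulative share reaches `cutoff`,
        folding the remainder into a single 'Other' row."""
        ranked = sorted(format_shares.items(), key=lambda kv: -kv[1])
        kept, running = [], 0.0
        for fmt, share in ranked:
            if running >= cutoff:
                break
            kept.append((fmt, share))
            running += share
        other = sum(share for _, share in ranked[len(kept):])
        kept.append((f"Other ({len(ranked) - len(kept)} Formats)", other))
        return kept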

In Preserv2 we are going to select 1000 items in a weighted fashion (e.g. 13% PDF (1.4)) to make up a test dataset. This data will be taken randomly from the 1200+ repositories currently registered with ROAR.
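
A sketch of the weighted selection using the percentages from the table above (the weights dictionary is truncated here, and the real selection runs against ROAR's records rather than an in-memory list):

    import random

    weights = {"Portable Document Format (1.4)": 13,
               "Portable Document Format (1.3)": 10,
               "Hypertext Markup Language": 4}  # truncated for brevity

    formats = list(weights)
    sample = random.choices(formats,
                            weights=[weights[f] for f in formats],
                            k=1000)
    # Each entry in `sample` names the format of one item to harvest; a
    # concrete file URL for it would then be requested from ROAR.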

The dataset will then be loaded into a fully functional EPrints repository with Preserv2 extensions and used to test these extensions. This dataset can also be used as the exemplar dataset from which risk scores can be generated, although it is more of a guide to which formats we need risk scores for, being atypical in the community.

A Sample of Specific Repositories

In the specific repository tests we are going to choose at least 10 repositories from ROAR which exhibit the diversity of repositories as a whole. We are then going to consider the effect of each of the following factors on the repository's preservation profile and possible strategy:

  • Size of repository.
  • Preserv profile.
  • Type of repository (software and content).
  • Levels of activity.
  • What effects do mandates have?

100+ files will be taken from each repository to represent an accurate cross-section of that repository (similar to the first tests, but this time repository-specific). Each of these datasets will then be subject to the same tests as the atypical dataset in part 1.
