DaveTarrant

Revision as of 12:36, 3 March 2009

These conclusions are split into several sections relating to the different stages of the testing process, from the initial harvesting of the data through to the classification itself.


Building the Test Datasets

To build the test datasets, a script was written which queries ROAR directly for URLs of files relating to the dataset being harvested. This script samples randomly from ROAR, so running it twice would produce different datasets. Initially it was thought that generating a 100-file dataset would only require asking ROAR for 100 URLs, but this soon proved not to be the case: although the URLs provided by ROAR appear valid, almost a third of them fail to return a file when harvested directly. Most of these failures were probably "file not found", "file withdrawn" or "request a copy" responses, which should return the relevant error codes. On receiving an error code the harvester simply tried another file. One could argue that the correct result was received for these files (i.e. the error code), but as we will see later, the error codes returned are rarely relevant to the actual error.
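The retry-on-failure loop described above can be sketched as follows. This is an illustration, not the actual harvesting script: `candidate_urls` stands in for the URL list returned by a ROAR query, and `fetch` stands in for the HTTP download step.

```python
import random

def build_dataset(candidate_urls, target_size, fetch):
    """Randomly sample URLs until `target_size` files are harvested.

    `fetch(url)` should return the file's bytes, or raise IOError when
    the URL fails ("file not found", "file withdrawn", "request a copy").
    """
    pool = list(candidate_urls)
    random.shuffle(pool)          # random sampling, so two runs differ
    harvested, failures = [], 0
    for url in pool:
        if len(harvested) >= target_size:
            break
        try:
            harvested.append((url, fetch(url)))
        except IOError:
            failures += 1         # try another file, as the harvester did
    return harvested, failures
```

With roughly a third of URLs failing, the pool requested from ROAR has to be noticeably larger than the target dataset size for the loop to complete.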

Constructing the Test Repositories

With the datasets constructed, it was time to submit them into EPrints 3.2 repositories. This proved to be the easiest stage (having done this process many times before), and each 100-item repository took about 5 minutes to fully populate. There was one issue with a long file name which was wrongly escaped on disk; renaming that single file solved the problem (a bit of a hack, but it was 1 file out of 2144).

The Classification Process

The Preserv2-EPrints-Toolkit, which includes a DROID wrapper, a caching database (as an EPrints dataset) and a results page (as an EPrints admin screen), was fully tested and found to be working correctly (after some minor bug fixes). The toolkit can be downloaded at http://files.eprints.org/422/.

The classification part of the toolkit, which is meant to be scheduled under all of our models, was in fact invoked manually on the datasets to save waiting for the scheduled job to launch; the command used was the same as in the scheduled process, however. The process generally ran very smoothly once it had been discovered that DROID's command-line invocation does not support the full shell escaping used in a Linux environment running the bash shell (other shells were untested). It is suspected that DROID has seen most of its testing on Windows and thus supports the DOS-style command prompt. Rather than investigate further, the problem was solved by using the XML classification file syntax, which DROID can both read from and write to. This slows the process slightly (by a few milliseconds) but is much more stable as a result.

With the above fix applied, the classification process went extremely smoothly: all 2144 files across the 13 repositories were classified and the output data was fed into EPrints.

Classification Conclusions

DROID v3.00 & Signature File v13 classifies more files than the version used by ROAR

275 files were unknown on import and 146 remain unknown. However, 89 invalid and badly formed files exist in both datasets; discounting these means 186 files were originally unknown versus the 57 which are now unknown, which is a major improvement.
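Restating that arithmetic as a quick check:

```python
unknown_before = 275   # files unknown on import (version used by ROAR)
unknown_after = 146    # files still unknown under DROID v3.00 / sig file v13
invalid_both = 89      # invalid and badly formed files present in both runs

# Discount the files that no version could ever identify:
genuinely_unknown_before = unknown_before - invalid_both   # 186
genuinely_unknown_after = unknown_after - invalid_both     # 57
```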

DROID v3.00 + Sig File v13 has issues determining exact version of certain file formats

The newer version of DROID is much worse at identifying file versions when asked to identify formats including plain text, RTF and the TIFF image format. DROID knew which format a file was, i.e. a text file, but could not tell whether it was a comma-separated values text file, a Macintosh-formatted text file or any other particular version.

This is likely also the case for other formats, but the three examples listed existed in significant numbers within our dataset. For these types DROID identified every file as the same version, grouping them together inaccurately.

Mime-types should be considered as a means to a basic classification

With DROID getting the basic classification correct, e.g. knowing a text file is a text file, it is important to consider mime-types when deciding which files have changed their format classification in a way that may put the file at risk. Without this factor, the comparison with ROAR implies that over a quarter of the files in the 1000-item typical dataset have changed classification. When mime-types are taken into account, however, we find that 256 files match by mime-type and only 40 files genuinely change classification (investigated later).
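A sketch of that comparison logic; the mappings used here are illustrative stand-ins, not the real ROAR/DROID data:

```python
def classify_changes(old, new, mime_of):
    """Compare two classification runs, treating a matching mime-type
    as 'unchanged at the basic level'.

    `old` and `new` map filename -> format id; `mime_of` maps a format
    id to its mime-type (or None where PRONOM lacks one).
    """
    same, same_mime, changed = [], [], []
    for name, old_fmt in old.items():
        new_fmt = new[name]
        if old_fmt == new_fmt:
            same.append(name)
        elif mime_of.get(old_fmt) and mime_of.get(old_fmt) == mime_of.get(new_fmt):
            same_mime.append(name)   # e.g. a text file still known to be text
        else:
            changed.append(name)     # a real change worth investigating
    return same, same_mime, changed
```

Only the last bucket needs manual investigation, which is how the apparent quarter of the dataset shrinks to 40 files.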

To compare the mime-types of the files, this data was obtained from the DROID/PRONOM identification file, which worked well for the records which had a mime-type listed against them. However, PRONOM lacks mime-type information for many file formats which do have an established mime-type. In most cases this mime-type information has existed for a while, and it is unclear why the data is missing from the PRONOM registry. TNA suspects that mime-type is not a compulsory field when data is provided to the registry.

DROID v3.00 - The percentages!

Classification by Extension & Percentages

Word Documents being wrongly classified as Excel files

Inconclusive file classification changes

Why so many HTML files!
