TestDataConclusions

These conclusions are split into several sections relating to the different stages of the testing process, from the initial harvesting of the data through to the classification itself.

Building the Test Datasets

In order to build the test datasets a script was written which directly queries ROAR for URLs of files relating to the dataset being harvested. This script samples randomly from ROAR, so running it twice would produce different datasets. Initially it was thought that generating a 100 file dataset would only require asking ROAR for 100 URLs; this was soon found not to be the case. The URLs provided by ROAR appear to be valid, yet almost a third of them fail to provide a file when harvesting directly from the URL. Most of this third was probably made up of "file not found", "file withdrawn" or "request a copy" responses, which should return the relevant error codes. On receiving an error code the harvester would simply try another file. One could argue that the correct result was received for these files (i.e. the error code); however, as we will find later, it is rare that the error codes are actually relevant to the error.
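
As an illustration, here is a minimal Python sketch of the sampling-and-retry loop described above. The function name, the way candidate URLs are supplied and the target size are all assumptions made for illustration; the real script queries ROAR directly.

    import random
    import urllib.error
    import urllib.request

    def build_dataset(candidate_urls, target_size=100):
        """Sample URLs at random until target_size files download successfully.

        candidate_urls stands in for the URL list obtained from ROAR;
        roughly a third of those URLs return an error rather than a file.
        """
        harvested = []
        pool = list(candidate_urls)
        random.shuffle(pool)  # random sampling, so two runs give different datasets
        while pool and len(harvested) < target_size:
            url = pool.pop()
            try:
                with urllib.request.urlopen(url, timeout=30) as response:
                    harvested.append((url, response.read()))
            except (urllib.error.HTTPError, urllib.error.URLError):
                # "file not found" / "file withdrawn" / "request a copy":
                # skip and try another URL, as the real harvester does.
                continue
        return harvested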

Constructing the Test Repositories

With the datasets constructed it was time to submit them into EPrints 3.2 repositories. This proved to be the easiest stage (having done this process many times before) and each 100 item repository took about 5 minutes to fully populate. I did have one issue with a long file name which was wrongly escaped on disk; renaming that one file solved the problem (a bit of a hack, but it was 1 file out of 2144).

The Classification Process

The Preserv2-EPrints-Toolkit, which includes a DROID wrapper, a caching database (as an EPrints dataset) and a results page (as an EPrints admin screen), was fully tested and found to be working (after some minor bug fixes). The toolkit can be downloaded at http://files.eprints.org/422/.

The classification part of the toolkit, which is meant to be scheduled under all our models, was in fact invoked manually on the datasets to save waiting for the scheduled job to launch; the command used to run it manually was, however, the same as that used in the scheduled process. The process generally ran very smoothly once it had been discovered that DROID's command line invocation does not support the full shell escaping used in a Linux environment running the bash shell (other shells were untested). It is suspected that DROID has seen most of its testing on Windows and thus supports the DOS-style command prompt. Rather than investigate this further, the problem was solved by using the XML classification file syntax which DROID can both read from and write to. This slows the process slightly (by a few milliseconds) but is much more stable as a result.
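
A rough Python sketch of this workaround follows. The XML element names and the DROID command-line invocation shown here are assumptions for illustration only, not taken from the DROID documentation; only the overall approach (write the file list as XML, then hand DROID the file instead of shell arguments) reflects what the toolkit does.

    import subprocess
    import xml.etree.ElementTree as ET

    def classify_with_droid(file_paths, list_xml="files.xml", results_xml="results.xml"):
        """Hand DROID an XML file list instead of shell arguments.

        The element names and command-line arguments below are hypothetical;
        the point is that no path ever passes through bash escaping.
        """
        root = ET.Element("FileCollection")
        for path in file_paths:
            item = ET.SubElement(root, "IdentificationFile")
            ET.SubElement(item, "FilePath").text = path
        ET.ElementTree(root).write(list_xml)
        # Hypothetical invocation: the real flag names may differ.
        subprocess.check_call(["java", "-jar", "DROID.jar", list_xml, results_xml])
        return results_xml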

With the above fix applied the classification process went extremely smoothly: all 2144 files across the 13 repositories were classified and the output data was fed into EPrints.

Classification Conclusions

DROID v3.00 & Signature File v13 classifies more files than the version used by ROAR

275 files were unknown on import and 146 remain unknown. However, 89 invalid and badly formed files exist in both datasets; discounting these means 186 files were originally unknown versus the 57 which are unknown now, which is a major improvement.
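
In other words, once the files neither version could be expected to classify are set aside:

    total_files       = 2144
    unknown_on_import = 275   # DROID version/signature file used by ROAR
    unknown_now       = 146   # DROID v3.00 + signature file v13
    malformed         = 89    # badly formed files, unknown in both runs

    print(unknown_on_import - malformed)  # 186 genuinely unknown before
    print(unknown_now - malformed)        # 57 genuinely unknown now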

DROID v3.00 + Sig File v13 has issues determining the exact version of certain file formats

The newer version of DROID is much worse at identifying the different versions of certain file formats, including text, RTF and the TIFF image format. That is, DROID knew which format a file was, e.g. a text file, but could not tell you whether it was a comma separated values text file, a Macintosh formatted text file or any other particular version.

It is likely that this is the case with other formats too, but the three examples listed existed in significant numbers within our dataset. For these types DROID identified every file as the same file version, grouping them together inaccurately.

Mime-types should be considered as a means of basic classification

With DROID getting the basic classification correct, e.g. knowing a text file is a text file, it is important to compare mime-types when deciding which files have changed their format classification in a way that may put the file at risk. Without this factor the comparison with ROAR implies that over a quarter of the files in the 1000 item typical dataset have changed classification. When mime-types are applied, however, we find that 256 files match by mime-type and only 40 files change classification (investigated later).
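
A sketch of this mime-type level comparison follows, assuming each run's results are available as a mapping from file identifier to (format name, mime-type); the data structures are hypothetical stand-ins, but the comparison mirrors the one described above.

    def compare_by_mime(old_results, new_results):
        """Count files whose mime-type is unchanged between two DROID runs.

        old_results / new_results: dicts of file id -> (format_name, mime_type).
        For the 1000 item typical dataset this yields 256 matches, 40 changes.
        """
        matched, changed = 0, []
        for file_id, (_, old_mime) in old_results.items():
            _, new_mime = new_results[file_id]
            if old_mime == new_mime:
                matched += 1  # same mime-type, so not considered at risk even
                              # if the reported format version differs
            else:
                changed.append(file_id)
        return matched, changed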

In order to compare the mime-types of the files, this data was obtained from the DROID/PRONOM identification file, which was fine for the records which had a mime-type listed against them. However, PRONOM is missing mime-type information for many file formats which do have an established mime-type. In most cases this mime-type information has existed for a while and it is unclear why it is missing from the PRONOM registry. The TNA suspects that mime-type is not a compulsory field when data is provided to the registry.

DROID v3.00 - The percentages!

Including the 89 malformed files, DROID v3.00 was unable to classify a total of 146 files out of the 2144 across all 13 repositories, a 93.1% classification rate. If we also take into account those which are wrongly classified (discussed later) this comes down to a 92.75% success rate by mime-type.
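
As a quick check of the arithmetic (the first figure comes out at roughly 93.2% before rounding, quoted above as 93.1%):

    total   = 2144
    unknown = 146                      # includes the 89 malformed files

    classified = total - unknown       # 1998 files classified
    print(100.0 * classified / total)  # ~93.2% classification rate

    # The 92.75% mime-type success rate implies roughly
    # total - 0.9275 * total - unknown ~= 9 further wrongly classified files.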

This figure does not take into account the success rate per file version, which was deemed unnecessary for this part of the system testing. However, as already outlined, it is believed that one of the DROID versions (likely the new one) has issues telling apart the different file versions.

Classification by Extension & Percentages

This test was designed to see whether DROID is more accurate than simply looking at file extensions when it comes to deciding mime-types. Across the entire 2144 item dataset only 4 items did not have extensions. Processing the rest gave an accuracy rate of 99.8% for file extensions, which includes the 89 malformed files that were classified as possible HTML files by their extension (these files contained HTML but lacked an HTML header, which is why they are considered malformed).

It should be noted that not all files were opened to check that they were indeed what their extension states; however, fringe cases and DROID unknowns were tested in this way and found to be the file types their extension decreed.

It should also be noted that while the file extension is a good way of finding the file type, many files carried additional data after their extension, e.g. a .1, .2, .3 or a _backup or _old suffix. It is therefore necessary to parse these manually to find the correct extension, which may sit in the middle of the file name (e.g. test.txt.backup).
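
A small Python sketch of such manual parsing; the suffix list is illustrative rather than exhaustive, covering only the patterns mentioned above.

    import re

    # Suffix patterns seen in the dataset: numeric copies (.1, .2, ...)
    # and backup markers. This list is an assumption, not exhaustive.
    TRAILING_JUNK = re.compile(r"(\.(\d+|backup|old)|_(backup|old))+$")

    def real_extension(filename):
        """Return the extension once trailing copy/backup suffixes are stripped.

        >>> real_extension("test.txt.backup")
        'txt'
        >>> real_extension("report.pdf.2")
        'pdf'
        """
        stem = TRAILING_JUNK.sub("", filename)
        return stem.rsplit(".", 1)[-1].lower() if "." in stem else ""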

DROID vs File Extensions

Although we have stated here that file extensions are potentially a more accurate way of classifying the content of your repository, you cannot find out the file version this way. This means you cannot differentiate between a Word 95 document and one conforming to the Word 2003 specification; DROID is still required at this stage. The general conclusion is that by combining techniques DROID may become more accurate.
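
One possible combination, sketched in Python; the inputs (a DROID result record and an extension-to-mime lookup table) are hypothetical stand-ins for the toolkit's data.

    def classify_item(path, droid_result, ext_mime_map):
        """Combine an extension lookup with DROID output (both hypothetical).

        The extension settles the mime-type, which proved more accurate in
        this test; DROID supplies the format version, which an extension
        alone never can.
        """
        ext = real_extension(path)         # helper from the sketch above
        mime = ext_mime_map.get(ext)       # e.g. {"txt": "text/plain", ...}
        return {
            "mime": mime or droid_result.get("mime"),  # fall back to DROID
            "version": droid_result.get("version"),    # only DROID knows this
        }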

Word Documents being wrongly classified as Excel files

Inconclusive file classification changes

Why so many HTML files!
