TestDataResults

From PreservWiki

** File in question: http://dspace.mit.edu/47:14Z%20(GMT).%20No.%20of%20bitstreams:%20263191853.pdf:%207781517%20bytes,%20checksum:%20f3b11c5d00fa6a9a02d4c906c1825757%20(MD5)63191853MIT.pdf:%207786389%20bytes,%20checksum:%2032bfd24738365b6456b4d2a10a961c6b%20(MD5)&.
** FIXED: DROID does not handle shell escaping (in this case bash) in the file names handed to it, so it cannot read the file. This was worked around by handing DROID a "droid list XML file" which contains an XML-encoded link to the file.
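
The workaround above relies on XML encoding rather than shell escaping. A minimal sketch of that idea follows, assuming a made-up list layout: the element names `FileList` and `File` are hypothetical and are not DROID's actual file list schema. It only illustrates that characters such as `&` in a file name survive a round trip once they are XML-encoded.

```python
import xml.etree.ElementTree as ET


def build_file_list(paths):
    """Return an XML document listing the given paths.

    The element names are hypothetical placeholders, not DROID's real
    schema. ElementTree escapes XML-special characters (&, <, >) in the
    element text when the tree is serialised.
    """
    root = ET.Element("FileList")
    for path in paths:
        item = ET.SubElement(root, "File")
        item.text = path
    return ET.tostring(root, encoding="unicode")


def read_file_list(xml_doc):
    """Parse the list back into the original, unescaped paths."""
    return [item.text for item in ET.fromstring(xml_doc).iter("File")]


if __name__ == "__main__":
    # An awkward name in the spirit of the file above: spaces, colons,
    # parentheses and a trailing ampersand.
    awkward = "63191853.pdf: 7781517 bytes, checksum (MD5)&."
    print(build_file_list([awkward]))
```

Because the file name travels inside a document rather than on a command line, no bash quoting rules apply to it.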

Revision as of 17:07, 17 February 2009

For now this is a bullet-pointed list that needs bulking out:

  • When harvesting repositories, about 1/3 of downloads fail for various reasons. Further investigation?
  • ROAR identifies a lot of content as HTML because a request may get redirected to an HTML page in the process of trying to get a resource. This is wrong, as the HTML page is not the resource. We believe the bug is that the repository sends an HTTP 200 and not an HTTP 401 header. This is particularly the case on the ANU repository, where we have only 74 items, as fmt/94 (html) is the redirected 401 page. Further investigation?
  • FIRST PARSE RESULTS (with the above correction)
    • All 994 files were located by the EPrints3.2 classification engine.
    • Only 45 objects were left unclassified, which is better than the 129 that were UNKNOWN at the time of importation.
    • Other comparisons are difficult. An initial observation is that DROID classifies some objects differently from the original classification, although the MIME type stays the same. For example, an HTML 4.0 file is now classified as XHTML or just HTML. The same is true of text files and some PDFs. We suspect this is due to the Tentative classifications which DROID provides for some files. As a result, we now include the classification quality (tentative, positive) in the data output from a classification.
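
The comparison described in the last bullet (format changed, MIME type unchanged, quality recorded) could be tallied roughly as follows. This is a sketch only: the record fields `format`, `mime` and `quality` and the sample records are invented for illustration and are not the actual EPrints or DROID output.

```python
from collections import Counter


def compare(original, droid):
    """Label how a new classification relates to the original one.

    Both arguments are dicts with hypothetical keys "format" and "mime";
    the DROID record also carries "quality" (tentative or positive).
    A DROID record of None means the object was left unclassified.
    """
    if droid is None:
        return "unclassified"
    if original["format"] == droid["format"]:
        return "same format"
    if original["mime"] == droid["mime"]:
        # Format label differs but the MIME type agrees: keep the quality
        # so tentative matches can be told apart from positive ones.
        return "same MIME type (%s)" % droid["quality"]
    return "different"


def summarise(pairs):
    """Count the comparison labels over (original, droid) pairs."""
    return Counter(compare(original, droid) for original, droid in pairs)


if __name__ == "__main__":
    # Invented sample data, not real results.
    pairs = [
        ({"format": "HTML 4.0", "mime": "text/html"},
         {"format": "HTML", "mime": "text/html", "quality": "tentative"}),
        ({"format": "PDF 1.4", "mime": "application/pdf"},
         {"format": "PDF 1.4", "mime": "application/pdf", "quality": "positive"}),
        ({"format": "Plain text", "mime": "text/plain"}, None),
    ]
    for label, count in sorted(summarise(pairs).items()):
        print(label, count)
```

Keeping the quality in the label makes it easy to check whether the differing classifications are mostly tentative ones, which is the suspicion stated above.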
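
For the redirect problem noted in the list above, one possible harvest-time check is to treat a download as suspect when the server answers with HTML although the record promised something else. The sketch below assumes the harvester knows an expected MIME type for each item, which the notes above do not confirm; the function name and parameters are hypothetical.

```python
def is_suspect_download(status, content_type, expected_mime):
    """Return True when a response probably is not the requested resource.

    status        -- HTTP status code of the final response
    content_type  -- raw Content-Type header, e.g. "text/html; charset=utf-8"
    expected_mime -- MIME type the repository record claims (an assumption)
    """
    if status != 200:
        return True
    # Drop parameters such as charset before comparing.
    mime = content_type.split(";")[0].strip().lower()
    # An HTML body for a non-HTML item suggests a login or error page
    # that was served with HTTP 200 instead of an error status.
    return mime == "text/html" and expected_mime.strip().lower() != "text/html"


if __name__ == "__main__":
    print(is_suspect_download(200, "text/html; charset=utf-8", "application/pdf"))
    print(is_suspect_download(200, "application/pdf", "application/pdf"))
```

Items flagged this way could be counted separately from genuine HTML resources, so that redirected pages do not inflate the HTML total.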