From PreservWiki

Revision as of 17:09, 26 February 2009



For now this is a bullet pointed list which needs bulking out:

  • When harvesting repositories, about 1/3 of downloads fail for various reasons. Further investigation?
  • ROAR identifies a lot of resources as HTML because a request may be redirected to an HTML page in the process of trying to get a resource. This is wrong, because that page is not the resource. We believe the bug is that the repository sends an HTTP 200 and not an HTTP 401 status. This is particularly the case on the ANU repository where we have only 74 items as fmt/94 (html) is the redirected 401 page. Further investigation?
  • FIRST PARSE RESULTS (with the above correction)
    • All 994 files were located by the EPrints3.2 classification engine.
    • Only 45 objects were left unclassified which is better than the 129 which were UNKNOWN at the time of importation.
    • All other comparisons are difficult at this stage.
    • Some initial observations:
      • DROID is classifying the objects differently from the original classification; however, the results still share the same mime-type. For example, an HTML 4.0 file is now being classified as XHTML or just HTML. The same is true of text files and some PDFs. We suspect this is due to the tentative classifications which DROID provides for some files. As a result, we now include the classification quality (tentative, positive) in the data output from a classification.
    • The update_pronom_uids script tries to process NULL and dies because it receives an HTTP 400 error. Technically it should die only if the service is not reachable, or else handle errors on a per-object basis.
    • In the philsci repository the number of files with pronomids is correct but the division by mime-type is wrong.
    • No error handling on the formats_risks page if nothing has been classified!
    • The files missing bug was a problem with the script generating the EP3XML not escaping URLs correctly. With this fixed all repositories populate correctly. NEED TO SUBMIT THIS TO GOOGLE CODE OR SOMETHING?
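The per-object behaviour suggested above for update_pronom_uids could look roughly like the following. This is only a sketch in Python, not the toolkit's actual script: the function signature, the `lookup` callable and both exception classes are hypothetical names introduced for illustration.

```python
class ServiceUnreachable(Exception):
    """Hypothetical: raised by the lookup when the service cannot be contacted."""


class BadRequest(Exception):
    """Hypothetical: raised by the lookup when the service answers HTTP 400."""


def update_pronom_uids(objects, lookup):
    """Update identifiers one object at a time.

    objects: dict mapping an object id to its current PRONOM id, or None.
    lookup:  callable taking a PRONOM id and returning the updated id.
    Returns (updated, skipped, failed).
    """
    updated, skipped, failed = {}, [], []
    for obj_id, puid in objects.items():
        if puid is None:
            # Never send NULL to the service; just record the object.
            skipped.append(obj_id)
            continue
        try:
            updated[obj_id] = lookup(puid)
        except BadRequest:
            # A bad request affects this object only, so carry on.
            failed.append(obj_id)
        # ServiceUnreachable is deliberately not caught: the whole run stops.
    return updated, skipped, failed
```

With this shape, unclassified objects are skipped rather than sent to the service, a rejected request is recorded against one object, and only an unreachable service ends the run.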

The Differences

  • ROAR: DROID v1.2 with signature file version 12.
  • Preserv2-EPrints-Toolkit: DROID v3.0 with signature file version 13.

General Observations

  • The Preserv2-EPrints-Toolkit version of DROID has issues when trying to differentiate between different types of text and rtf files.
  • As a result it classifies them all as one type.
  • Pronom is lacking a complete set of mime-types for its format data, which is key when it comes to changing identifiers.
  • The Preserv2-EPrints-Toolkit version of DROID is able to classify more objects, cutting down on the number of unknowns
  • Some files have been updated in the base repositories but remain of the same mime-type. This could be a DROID classification error or a manual user update; unfortunately there is no way to tell.

Typical Repository Outcomes (994 Files)

  • 2 files previously of known format are now unknown?
  • Out of the 994 files, only 83 remain a concern: 43 are still unknown and 40 do not match their original classification even by mime-type. Of the remaining files, 567 matched their previous classification exactly and 256 (most likely text and rtf) matched by mime-type. This backs the general conclusion that DROID v3.0 with signature file v13 has issues determining the exact file version of some types of simple file.
    • 184 text files changed classification but stayed of the same mime-type.
    • 45 tiff files changed classification but stayed of the same mime-type.
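The categories used in these outcomes (exact match, mime-type match, non-matching, newly classified, still unknown, now unknown) can be expressed as a small comparison function. This is a sketch of the bucketing only, not the code used to produce the figures above; the `(puid, mime)` tuple representation and the function names are assumptions.

```python
from collections import Counter


def compare(old, new):
    """Bucket one file by comparing its old and new classification.

    Each classification is a (puid, mime_type) tuple, or None if unknown.
    """
    if old is None and new is None:
        return "still unknown"
    if old is None:
        return "newly classified"
    if new is None:
        return "now unknown"
    if old[0] == new[0]:
        return "exact match"
    if old[1] is not None and old[1] == new[1]:
        return "mime-type match"
    return "non-matching"


def tally(pairs):
    """Count how many (old, new) pairs fall into each bucket."""
    return Counter(compare(old, new) for old, new in pairs)
```

An HTML 4.0 file reclassified as another HTML version would fall into "mime-type match", while an object previously classified as PDF and now identified as HTML would be "non-matching".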

Specific Repository Outcomes

Philsci (94 Files)

  • EPrints 2.2.1 (pepper)
  • 10 objects classified which were previously unknown.
  • 75 exact matching classifications and 6 matching on mime-type (All text/rtf files)
  • 5 remain unknown.

Senado (99 Files)

  • DSpace Repository
  • 89 unknown on input, 89 unknown on output
    • All of these 89 files were in fact only 2 files which were provided by the publication URL.
    • I can't read Brazilian/Spanish, but I think these pages describe a method to get at the document, such as payment required (HTTP 402) or resource unavailable (HTTP 404). The pages are malformed HTML and are not served with an HTTP status code that describes them. Basically a very bad web site implementation (Web < 0.1).
  • All other classifications matched exactly.

Minho (98 Files)

  • DSpace 1.5.1
  • 3 newly classified files
  • 90 exact matches
  • 3 matching on mime-type
  • 2 non-matching
    • Previously Classified as HTML now PDF (embargos?)

Good Result That

ANU (76 Files)

  • DSpace
  • All 76 files I could access matched their imported classifications in every way.
  • Objects which I couldn't retrieve could not be classified or compared.

Roskilde (98 files)

  • DSpace
  • Newly classified (previously unknown): 2
  • 1 still unknown
  • 1 non-matching classification
  • 3 matched on mime-type
    • PDF changes this time.
  • 91 exact matches

Stirling (102 Files)

  • DSpace
  • Newly classified (previously unknown): 1
  • Matched on mime-type: 5
  • Exact matches: 63
  • Absolute non matching: 33
    • These are documents previously classified as PDF which have been newly classified as HTML.
    • This is due to a 302 redirect being embedded in the page rather than sent as an HTTP header. This appears to have been fixed since I took the original dataset.
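Cases like this, where an HTML page arrives with a success status in place of the expected PDF, could be flagged at harvest time with a crude content check. The following is a sketch only: it is not how DROID identifies formats, and `sniff`, `suspicious_download` and the `expected_mime` parameter are hypothetical names.

```python
def sniff(body):
    """Very rough guess at the format from the first bytes of a download."""
    head = body[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"
    return "unknown"


def suspicious_download(status, expected_mime, body):
    """True when an HTTP 200 response for an expected PDF actually carries HTML.

    expected_mime is the type recorded for the item before harvesting
    (an assumption about what metadata is available).
    """
    return (
        status == 200
        and expected_mime == "application/pdf"
        and sniff(body) == "html"
    )
```

Flagged downloads could then be excluded from the comparison instead of being counted as non-matching classifications.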

ECS (94 Files)

  • EPrints 3.1
  • Still Unknown: 2
  • Newly Classified (previously UNKNOWN): 1
  • Matched on Mime Type: 1
  • Exact Matching Classifications: 88
  • Absolute non matching classifications: 2
    • 1 HTML is a PDF now
    • 1 HTML is now a Web Archive document (rfc822)

Glasgow (99 Files)

  • EPrints 3.1.1
  • Newly Classified (previously UNKNOWN): 7
  • Matched on Mime Type: 2
  • Exact Matching Classifications: 88
  • Still UNKNOWN: 2
    • Invalid HTML redirect pages; the redirect pages are correctly used.

Good repository

Queensland (99 Files)

  • EPrints 3.1.1 (Port and Brandy)
  • Newly Classified (previously UNKNOWN): 6
  • Exact Matching Classifications: 93

Good Repository

  • All PDF repository, everything classified, no problems, other than the fact that it's all PDF.

E-Lis (96 Files)

  • EPrints (Chocolate-coated Coffee Bean)
  • Newly Classified (previously UNKNOWN): 7
  • Exact Matching Classifications: 88
  • Absolute non matching classifications: 1
    • Probably due to upgrade in DROID Sig file as they are similar types
    • fmt/99 (Hypertext Markup Language 4.0) to x-fmt/429 (Microsoft Web Archive - message/rfc822)

Good Repository

Soton (97 Files)

  • EPrints 2.x
  • Newly Classified (previously UNKNOWN): 4
  • Matched on Mime Type: 6
  • Exact Matching Classifications: 85
  • Absolute non matching classifications: 1
    • PDF now HTML
  • Still UNKNOWN: 1
    • The file has a Shockwave Flash (Flash) extension; DROID may not have a signature for this format.

Tartu (98 Files)

  • DSpace
  • Newly Classified (previously UNKNOWN): 4
  • Matched on Mime Type: 14
  • Exact Matching Classifications: 60
  • Absolute non matching classifications: 19
    • 18 are TIFFs which are now classified as HTML pages
    • 1 is a PDF which is now classified as an HTML page
  • Still UNKNOWN: 1
    • A 650Mb zip file, part of a much larger set of zip files. Should DROID check for consecutively numbered zip files, and how should this be handled?
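One possible way to spot such sets, independently of DROID, is to group filenames by a shared stem followed by a number. This is a sketch under an assumed naming convention (a stem, digits, then ".zip"); the Tartu files may be named differently, and `numbered_zip_sets` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed convention: <stem><digits>.zip, for example "scan_01.zip".
_NUMBERED = re.compile(r"^(?P<stem>.*?)(?P<num>\d+)\.zip$", re.IGNORECASE)


def numbered_zip_sets(filenames):
    """Return {stem: [names in order]} for consecutively numbered zip sets."""
    groups = defaultdict(list)
    for name in filenames:
        match = _NUMBERED.match(name)
        if match:
            groups[match.group("stem")].append((int(match.group("num")), name))

    sets = {}
    for stem, members in groups.items():
        members.sort()
        numbers = [number for number, _ in members]
        expected = list(range(numbers[0], numbers[0] + len(numbers)))
        # Keep only groups of two or more files with no gaps in the numbering.
        if len(members) > 1 and numbers == expected:
            sets[stem] = [name for _, name in members]
    return sets
```

A set detected this way could be reported as one multi-part object rather than as several unknown files.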