Harvesting Repository Data and OAI-PMH

Principles of Harvesting Repositories

In general, the Open Archives Initiative's (OAI) preferred method of re-using repository data is harvesting. This differs from web crawling in that harvesting gathers data in structured XML formats - i.e. retaining separate fields for authors, titles, dates, and so forth - whereas web crawlers treat everything as one undifferentiated block of text. Structured data not only provides opportunities for richer search services, but also facilitates data analysis and data mining.

What gives this process its power is that individual institutional repositories can each have their own collection policies and administrative systems, yet still be linked into one large, virtual, global repository through the use of OAI-PMH. This allows individual institutions or subject communities to build repositories for their own purposes, while users need search only one service to gain access to the content of all of the repositories.

About 75% of repositories worldwide (~85% in the UK) provide an interface that uses the standard OAI protocol, OAI-PMH. Such repositories are designated 'OAI-compliant'. When a repository adheres to this protocol, some or all of the metadata that it holds for the items in its collection is exposed for harvesting by service providers. The returned metadata usually includes a URI for the full-text file, which can therefore also be processed if required.

The advantage of OAI-PMH, apart from its ubiquity, is that it is relatively simple both to implement within a repository software package and to use. The trade-off is that its query facilities are very rudimentary - some might even say non-existent - and variations in the format of the returned data can be a problem. Consequently, some repositories may provide other machine-to-machine interfaces with richer functionality, usually in addition to OAI-PMH.
The two principal examples are SRU-CQL and Z39.50. The extra features of these protocols mean that they require more development effort, and so far very few software suppliers or actual repositories have implemented them. For a more detailed overview of harvesting protocols and related issues, see Swan & Awre (2006).

There are some other names you may hear in the context of harvesting. RSS and Atom are increasingly being used to provide details of repository items in XML format. However, these are really news feed protocols and are not suitable for serious harvesting, although they can be used for harvester-like applications such as the production of departmental publications lists. RSS and Atom are covered in more detail elsewhere on this website.

You may also encounter the terms REST and SOAP. These are not harvesting protocols, but web communications standards, and their documentation can seem impenetrable to all but serious techies. Fortunately, repository administrators and their technical support staff should never need to work with them at this level.

OAI Protocol for Metadata Harvesting (OAI-PMH)

While OAI-PMH is intended as a machine-to-machine interface, it returns results as XML, which can also be displayed in web browsers for human consumption. Hence, the examples below are given as hyperlinks. EPrints.org provide a useful XML stylesheet for rendering OAI-PMH output, which is used by many repositories that run EPrints software.

Principles of OAI-PMH

OAI-compliant repositories have an 'OAI Base URL' in addition to the URL for human users. For instance, Aberystwyth University's CADAIR repository (http://cadair.aber.ac.uk/) has the OAI Base URL http://cadair.aber.ac.uk/dspace-oai/request. On its own, an OAI Base URL simply returns XML containing an error message. This is because the protocol expects instructions in the form of a 'verb' and other arguments to be appended to the URL.
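As a rough illustration of how a verb and its arguments are appended to a base URL, the following Python sketch builds request URLs as ordinary query strings. The function name `build_oai_request` is hypothetical, and the `ListRecords` verb and `metadataPrefix` argument are taken from the OAI-PMH specification rather than from the text above; only the CADAIR base URL comes from this page. No network request is made.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **arguments):
    """Append an OAI-PMH verb and its arguments to a base URL as a query string.

    Hypothetical helper for illustration; it does not contact the repository.
    """
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# OAI Base URL quoted in the text for Aberystwyth University's CADAIR repository.
CADAIR_BASE = "http://cadair.aber.ac.uk/dspace-oai/request"

# A verb with no further arguments.
print(build_oai_request(CADAIR_BASE, "Identify"))
# prints http://cadair.aber.ac.uk/dspace-oai/request?verb=Identify

# A verb with one additional argument (names assumed from the OAI-PMH spec).
print(build_oai_request(CADAIR_BASE, "ListRecords", metadataPrefix="oai_dc"))
# prints http://cadair.aber.ac.uk/dspace-oai/request?verb=ListRecords&metadataPrefix=oai_dc
```

Using `urlencode` rather than string concatenation means that argument values containing characters such as colons or slashes are escaped correctly.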
The simplest case is the verb 'Identify', which returns identity information about the repository. E.g.: http://www.british-history.ac.uk/oai/oai.aspx?verb=Identify

Another 'Identify' example, this time using EPrints' XML stylesheet to render the output more readably: http://eprints.soton.ac.uk/perl/oai2?verb=Identify

OAI-PMH Verbs

Altogether, there are six OAI-PMH verbs, some of which require additional arguments: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords and GetRecord.
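The verbs and their arguments can be written down as a small lookup table. The Python sketch below does this and checks a request against it. The argument names follow the OAI-PMH 2.0 specification; the names `OAI_VERBS` and `check_request` are hypothetical, and follow-up requests that carry only a resumptionToken are deliberately left out to keep the example short.

```python
# Required and optional arguments for each of the six OAI-PMH verbs,
# as defined in the OAI-PMH 2.0 specification (resumptionToken omitted).
OAI_VERBS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"}},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"}},
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set()},
}


def check_request(verb, arguments):
    """Return a list of problems with a request; an empty list means none found.

    Hypothetical helper: verb and argument names are compared exactly,
    so 'identify' is not accepted in place of 'Identify'.
    """
    spec = OAI_VERBS.get(verb)
    if spec is None:
        return ["unknown verb: " + verb]
    names = set(arguments)
    problems = []
    for name in sorted(spec["required"] - names):
        problems.append("missing argument: " + name)
    for name in sorted(names - spec["required"] - spec["optional"]):
        problems.append("unexpected argument: " + name)
    return problems


print(check_request("Identify", {}))
# prints []
print(check_request("GetRecord", {"identifier": "oai:example.org:1"}))
# prints ['missing argument: metadataPrefix']
```

The identifier `oai:example.org:1` is a made-up placeholder, not a record in any real repository.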
Note: The verbs and their associated arguments are case-sensitive.

Results are often not returned as one big file, but in chunks of a fixed number of records, each ending with a 'resumptionToken' that can be used to retrieve the next chunk.

OAI-PMH installations can be set up to return results using a variety of metadata schemas. As a minimum, all OAI-PMH servers must be able to return results using the Dublin Core (oai_dc) schema, and this is all that many repositories or packages offer. However, they can provide as many or as few additional schemas as they wish. It is encouraging that EPrints Version 3 now comes with several schema options by default, while DSpace offers extra schemas that just need to be enabled in the configuration file. Also, the DRIVER project is heavily promoting the MPEG21-DIDL schema as a European standard.

For further details, an excellent online tutorial on OAI-PMH is available from the Open Archives Forum.

SRU-CQL

As the hyphen in the name suggests, SRU-CQL comprises two parts:

SRU (Search/Retrieval via URL) is a protocol for sending queries to the server and receiving search results back. It uses a rich set of query and response parameters, and can accommodate a variety of XML record schemas. Unlike OAI-PMH, it does not have an obligatory record schema, although in practice Dublin Core tends to be available. See http://www.loc.gov/standards/sru/specs/search-retrieve.html

CQL (Contextual Query Language) is the query language used within SRU for specifying actual queries. It offers all the advanced options you would expect of a mainstream bibliographic database, which for some purposes may be preferable to the rudimentary facilities of OAI-PMH. See http://www.loc.gov/standards/sru/specs/cql.html

SRU-CQL is maintained by the Library of Congress. Its documentation is well-written and easy to understand.

Z39.50

Z39.50 was originally a pre-Web ancestor of SRU-CQL, developed primarily for library and information-related systems.
It is mostly used for cross-searching bibliographic databases, although it has been extended to cover non-bibliographic media. In principle, Z39.50 interfaces could be added to Open Access repositories, but its 'successor' SRU-CQL is now generally preferable for OA purposes. The international Z39.50 standard is maintained by the Library of Congress. See documentation and related information at http://www.loc.gov/z3950/agency/.

References

Alma Swan & Chris Awre (2006) Linking UK Repositories: Technical and organisational models to support user-oriented services across institutional and other digital repositories: Scoping Study Report: Appendix. JISC, 2006.
Contact: support@rsp.ac.uk. Reviewed: 30-Jul-2008.