Harvesting Repository Data and OAI-PMH

Principles of Harvesting Repositories

In general, the Open Archives Initiative's (OAI) preferred method of re-use of repository data is harvesting. This differs from web crawling in that harvesting gathers data in structured XML formats - i.e. retaining separate fields for authors, titles, dates, and so forth - whereas web crawlers treat everything as a single undifferentiated block of text. Structured data not only provides opportunities for richer search services, but also facilitates data analysis and data mining.

What gives this process its power is that individual institutional repositories can each have their own collection policies and administrative systems, yet be linked into one large, virtual, global repository through the use of OAI-PMH. This allows individual institutions or subject communities to build their own repositories for their own purposes, while users need to search just one service to gain access to the content of all of the repositories.

About 75% of repositories worldwide (~85% in the UK) provide an interface that uses the standard Open Access protocol OAI-PMH. Such repositories are designated 'OAI-compliant'. When a repository adheres to this protocol, some or all of the metadata that it holds for all the items in its collection is exposed for harvesting by service providers. The returned metadata usually includes a URI for the full-text file, which can therefore also be processed if required.
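To illustrate what "structured" means in practice, the following minimal sketch splits a Dublin Core record into separate fields and picks out any identifiers that look like web addresses. The sample record, the example.org URL and the function names are invented for illustration; only the two namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Invented sample record, for illustration only. The namespace URIs are the
# standard ones for oai_dc and Dublin Core; everything else is made up.
SAMPLE_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Title</dc:title>
  <dc:creator>Author, A.</dc:creator>
  <dc:creator>Writer, B.</dc:creator>
  <dc:date>2006-01-01</dc:date>
  <dc:identifier>http://repository.example.org/123/1/paper.pdf</dc:identifier>
</oai_dc:dc>"""


def parse_dc_record(xml_text):
    """Return a dict mapping each Dublin Core field name to a list of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        # Tags look like '{namespace-uri}title'; keep only the local name.
        name = element.tag.split("}", 1)[-1]
        fields.setdefault(name, []).append((element.text or "").strip())
    return fields


def link_identifiers(fields):
    """Pick out dc:identifier values that look like web addresses."""
    return [
        value
        for value in fields.get("identifier", [])
        if value.startswith(("http://", "https://"))
    ]


record = parse_dc_record(SAMPLE_RECORD)
print(record["creator"])
print(link_identifiers(record))
```

Note that repositories differ in what they put in the identifier field, so a link found this way may point to a landing page rather than to the full-text file itself.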

The advantage of OAI-PMH, apart from its ubiquity, is that it is relatively simple both to implement within a repository software package and to use. The trade-off is that its query facilities are very rudimentary - some might even say non-existent - and variations in the format of the returned data can be a problem. Consequently, some repositories may provide other machine-to-machine interfaces with richer functionality, usually in addition to OAI-PMH. The two principal examples are SRU-CQL and Z39.50. The extra features of these protocols mean they require more development effort, and so far very few software suppliers or actual repositories have implemented them.

For a more detailed overview of harvesting protocols and related issues, see Swan & Awre (2006).

There are some other names you may hear in the context of harvesting. RSS and Atom are protocols that are increasingly being used for providing details of repository items in XML format. However, these are really news feed protocols and not suitable for serious harvesting, although they can be used for harvester-like applications such as the production of departmental publications lists.

You may also encounter the terms REST and SOAP. These are not harvesting protocols, but web communications standards, and their documentation seems particularly impenetrable to all but serious techies. Fortunately, repository administrators and their technical support staff should never need to work with them at this level.

OAI Protocol for Metadata Harvesting (OAI-PMH)

While OAI-PMH is intended as a machine-to-machine interface, it returns results as XML, which can also be displayed on web browsers for human consumption. Hence, the examples below are given as hyperlinks. EPrints.org provide a useful XML stylesheet for rendering OAI-PMH output that is used by many repositories that run EPrints software.

Principles of OAI-PMH

OAI-compliant repositories have an 'OAI Base URL' in addition to the URL for human users. For instance, Aberystwyth University's CADAIR repository (http://cadair.aber.ac.uk/) has the OAI Base URL http://cadair.aber.ac.uk/dspace-oai/request. On its own, an OAI Base URL simply returns XML containing an error message. This is because the protocol expects instructions in the form of a 'verb' and other arguments to be appended to the URL.
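A harvester can detect such an error response programmatically. The sketch below is a minimal example, assuming a response shaped like the invented sample shown; the 'error' element and its 'code' attribute (here 'badVerb') follow the OAI-PMH 2.0 specification, but the wording of the message and the example.org address are made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample of what a bare Base URL might return.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2008-01-01T00:00:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""


def oai_errors(xml_text):
    """Return a list of (code, message) pairs for any errors in a response."""
    root = ET.fromstring(xml_text)
    return [
        (error.get("code"), (error.text or "").strip())
        for error in root.findall("oai:error", OAI_NS)
    ]


print(oai_errors(SAMPLE_ERROR))  # prints [('badVerb', 'Illegal OAI verb')]
```

An empty list means the response contained no error elements and can be processed normally.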

The simplest case is the verb 'Identify', which returns identity information about the repository. E.g.:

http://www.british-history.ac.uk/oai/oai.aspx?verb=Identify

Another 'Identify' example using EPrints' XML stylesheet to render the output more readably:

http://eprints.soton.ac.uk/perl/oai2?verb=Identify
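Request URLs of this kind can also be assembled in code rather than typed by hand. The following is a minimal sketch; the function name build_oai_url is hypothetical and not part of any library.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, **arguments):
    """Append the verb and any further arguments to an OAI Base URL."""
    params = {"verb": verb}
    params.update(arguments)
    # urlencode escapes reserved characters in the argument values.
    return base_url + "?" + urlencode(params)


print(build_oai_url("http://eprints.soton.ac.uk/perl/oai2", "Identify"))
# prints http://eprints.soton.ac.uk/perl/oai2?verb=Identify
```

Further arguments are passed as keywords, for example build_oai_url(base, "ListRecords", metadataPrefix="oai_dc").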

OAI-PMH Verbs

Altogether, there are six OAI-PMH verbs, some of which require additional arguments:

Identify
Returns information about the repository. Example: http://rrp.roehampton.ac.uk/cgi/oai2.cgi?verb=Identify
ListMetadataFormats
Lists the metadata formats supported by the repository. The minimum requirement is oai_dc (Dublin Core). Example: http://cadair.aber.ac.uk/dspace-oai/request?verb=ListMetadataFormats
ListSets
Lists the sets provided by the repository (departments, subjects, etc.). Example: http://eprints.soton.ac.uk/perl/oai2?verb=ListSets
ListIdentifiers
Lists record identifiers, dates and any other headers for each deposited item. Requires the argument 'metadataPrefix'; metadataPrefix=oai_dc should suffice. Results can be limited to specified sub-sets. Example: http://epubs.cclrc.ac.uk/oai?verb=ListIdentifiers&metadataPrefix=oai_dc
ListRecords
Harvests metadata records from the repository. Requires the argument 'metadataPrefix'; metadataPrefix=oai_dc should suffice. Results can be limited to specified sub-sets. Example: http://eprints.nottingham.ac.uk/perl/oai2?verb=ListRecords&metadataPrefix=oai_dc
GetRecord
Gets an individual metadata record from the repository. Requires the arguments 'identifier' and 'metadataPrefix'. Example: http://eprints.whiterose.ac.uk/cgi/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:eprints.whiterose.ac.uk:937
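The required arguments listed above can be captured in a small lookup table, which a harvester might use to check a request before sending it. This is a minimal sketch with hypothetical names; it deliberately ignores optional arguments and the special case of follow-up requests that carry only a resumptionToken.

```python
# Required arguments for each of the six verbs, as listed above.
REQUIRED_ARGUMENTS = {
    "Identify": (),
    "ListMetadataFormats": (),
    "ListSets": (),
    "ListIdentifiers": ("metadataPrefix",),
    "ListRecords": ("metadataPrefix",),
    "GetRecord": ("identifier", "metadataPrefix"),
}


def missing_arguments(verb, arguments):
    """Return the names of required arguments absent from 'arguments'."""
    # Verb names are matched exactly, so 'identify' is rejected.
    if verb not in REQUIRED_ARGUMENTS:
        raise ValueError(f"unknown OAI-PMH verb: {verb!r}")
    return [name for name in REQUIRED_ARGUMENTS[verb] if name not in arguments]


print(missing_arguments("GetRecord", {"metadataPrefix": "oai_dc"}))
# prints ['identifier']
```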

Note: The verbs and their associated arguments are case-sensitive. Results are often not returned as one big file, but as chunks of so many records, each ending with a 'resumptionToken' that can be used to retrieve the next chunk.
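Following resumption tokens amounts to a simple loop: request a chunk, process it, and repeat with the token until none is returned. The sketch below shows one way this could be done for ListIdentifiers. The function names are hypothetical, and the 'fetch' parameter is an assumption added so that the loop can be tried without a live repository; error responses and retry handling are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace prefix for elements in OAI-PMH 2.0 responses.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Download one response and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest_identifiers(base_url, metadata_prefix="oai_dc", fetch=fetch_url):
    """Yield every record identifier, following resumption tokens."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f".//{OAI}resumptionToken")
        # A missing or empty token marks the final chunk.
        if not token or not token.strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.strip()}
```

A call such as list(harvest_identifiers("http://epubs.cclrc.ac.uk/oai")) would then collect identifiers across all chunks, assuming the repository is reachable and responds as described.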

OAI-PMH installations can be set up to return results using a variety of metadata schemas. As a minimum, all OAI-PMH servers must be able to return results using the Dublin Core (oai_dc) schema, and this is all that many repositories or packages offer. However, they can provide as many or as few additional schemas as they wish. It is encouraging that EPrints Version 3 now comes with several schema options by default, while DSpace offers extra schemas that just need to be enabled in the configuration file. Also, the DRIVER project is heavily promoting the MPEG21-DIDL schema as a European standard.
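A harvester can find out which schemas a repository offers by reading the response to ListMetadataFormats. The sketch below extracts the metadata prefixes from such a response. The sample XML is invented: the oai_dc entry uses the standard schema and namespace URIs, while the second entry (prefix 'didl' with example.org addresses) is a placeholder, not a real repository's configuration.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response; the second format is a placeholder.
SAMPLE_FORMATS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>didl</metadataPrefix>
      <schema>http://example.org/didl.xsd</schema>
      <metadataNamespace>http://example.org/didl/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def metadata_prefixes(xml_text):
    """Return the metadataPrefix values advertised in a response."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.findall(".//oai:metadataPrefix", OAI_NS)
        if element.text
    ]


print(metadata_prefixes(SAMPLE_FORMATS))  # prints ['oai_dc', 'didl']
```

Any prefix returned this way can be supplied as the metadataPrefix argument to ListIdentifiers, ListRecords or GetRecord.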

For further details, an excellent online tutorial on the OAI-PMH is available from the Open Archives Forum.

SRU-CQL

As the hyphen in the name suggests, SRU-CQL comprises two parts:

SRU (Search/Retrieval via URL) is a protocol for sending queries to the server and receiving search results back. It uses a rich set of query and response parameters, and can accommodate a variety of XML record schemas. Unlike OAI-PMH, it has no obligatory record schema, although in practice Dublin Core tends to be available. See http://www.loc.gov/standards/sru/specs/search-retrieve.html

CQL (Contextual Query Language) is the query language used within SRU for specifying actual queries. It offers all the advanced options you would expect of a mainstream bibliographic database, which for some purposes may be preferable to the rudimentary facilities of OAI-PMH. See http://www.loc.gov/standards/sru/specs/cql.html
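To show how the two parts fit together, the sketch below embeds a CQL query in an SRU searchRetrieve request URL. The function name and the sru.example.org address are hypothetical. The parameter names (operation, version, query, maximumRecords, recordSchema) are those of SRU 1.1; which indexes (such as dc.title) and record schemas a given server supports is an assumption that would need checking against that server.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, maximum_records=10,
                  record_schema=None, version="1.1"):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,  # the CQL expression, escaped by urlencode
        "maximumRecords": maximum_records,
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base_url + "?" + urlencode(params)


# A CQL query combining two fielded conditions with a boolean operator.
query = 'dc.title = "open access" and dc.creator = swan'
print(build_sru_url("http://sru.example.org/search", query, record_schema="dc"))
```

The contrast with OAI-PMH is that the selection criteria travel in the query itself, so the server returns only matching records rather than the whole collection or set.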

SRU-CQL is maintained by the Library of Congress. Its documentation is well-written and easy to understand.

Z39.50

Z39.50 was originally a pre-Web ancestor of SRU-CQL, developed primarily for library and information related systems. It is mostly used for cross-searching bibliographic databases, although it has been extended to cover non-bibliographic media. In principle Z39.50 interfaces could be added to Open Access repositories, but its 'successor' SRU-CQL is now generally preferable for OA purposes.

The international Z39.50 standard is maintained by the Library of Congress. See documentation and related information at http://www.loc.gov/z3950/agency/.

References

Swan, Alma & Awre, Chris (2006) Linking UK Repositories: Technical and organisational models to support user-oriented services across institutional and other digital repositories: Scoping Study Report: Appendix. JISC. http://www.jisc.ac.uk/uploaded_documents/Linking_UK_repositories_appendix.pdf