Metadata creation and flows

Because this is a guide to embedding repositories, it assumes that established metadata creation practices and defined schemas are already in use at existing repositories, conforming to standards that permit interoperability and harvesting. Information about such metadata in the context of setting up repositories is available here.

This section is concerned with what happens to metadata in various embedding scenarios, illustrated with specific examples.

General principles

Clearly, the principles of simplicity and avoidance of duplicated effort discussed in the Deposit workflows section also apply to the creation of metadata under any scenario:

  • The workflow should be as simple as possible, while capturing all the information needed to deliver the benefits to the depositor
  • Benefits that will encourage repeated engagement include the easy population of staff web profiles with publication data, the construction of lists for CVs, etc.
  • If information can be captured automatically from the object or extracted from a file, it should be, as long as the quality of the resulting metadata is adequate (see the sketch after this list)
  • Maximum use should be made of drop-down menus and auto-completion features, which lower the barriers to deposit and increase the accuracy of the metadata
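
A minimal sketch of the kind of automatic capture described above, assuming the Python pypdf library: it pulls the text of a PDF's first page and looks for a DOI, which could then pre-populate the deposit form. The regular expression and function name are illustrative, not taken from any particular repository platform.

    import re
    from pypdf import PdfReader

    DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)

    def extract_doi(pdf_path):
        """Return the first DOI-like string on page 1, or None."""
        first_page = PdfReader(pdf_path).pages[0].extract_text() or ""
        match = DOI_PATTERN.search(first_page)
        return match.group(0).rstrip(".,;") if match else None

    # A deposit form could call this to pre-fill the identifier field:
    # doi = extract_doi("submission.pdf")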

The issue of how to disambiguate names in repositories is one that preoccupies many repository managers. There is a section specifically on Names here.

Metadata flows and data quality

In the standalone repository model, metadata schemas are usually taken ‘out of the box’ with the repository software and then modified or extended. Metadata is submitted by researchers or administrators directly to the repository, or derived automatically from the file or object itself, then modified and often augmented (for example with subject headings) by repository staff.

In an embedded repository linked with research information systems, metadata is drawn from a number of different sources, both internal university systems and external ones, as well as being derived from files or added by repository staff where necessary. For example, in one case citations are added to PDFs so that they are visible to users who do not come into the repository but click through to the PDFs directly from Google or other search tools.
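
One way to achieve this kind of citation stamping, sketched with the Python pypdf and reportlab libraries (the layout and wording are assumptions, not the practice of any particular repository): an overlay page carrying the citation is merged onto the first page of the deposited PDF.

    from io import BytesIO
    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import A4

    def stamp_citation(src_path, out_path, citation):
        """Merge a one-line citation onto the first page of a PDF."""
        # Draw the citation onto a blank page held in memory.
        buf = BytesIO()
        overlay = canvas.Canvas(buf, pagesize=A4)
        overlay.setFont("Helvetica", 8)
        overlay.drawString(40, 20, citation)  # near the bottom margin
        overlay.save()
        buf.seek(0)

        stamp_page = PdfReader(buf).pages[0]
        writer = PdfWriter()
        for i, page in enumerate(PdfReader(src_path).pages):
            if i == 0:
                page.merge_page(stamp_page)
            writer.add_page(page)
        with open(out_path, "wb") as f:
            writer.write(f)

    # stamp_citation("accepted.pdf", "public.pdf",
    #                "Published in Journal of X, 12(3), 2010. doi:10.xxxx/xxxx")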

In the final report of the IncReASe project, the project team commented on the question of metadata quality:

“We have tried to maintain good metadata quality and consistency within WRRO [the White Rose repository] but we have variation in name formats for authors and, to a lesser degree, journal titles and publishers. Sources of metadata are often imperfect; repositories need to make a realistic assessment of the resource needed to improve metadata and decide whether the added value justifies the cost of the resource needed to achieve it. Repositories need to think about how the metadata will be re-used and strike an appropriate balance between speed of dissemination and quality of metadata. It may be that tools to work in conjunction with bulk metadata ingest (e.g. to identify empty fields or potentially anomalous metadata) could help improve metadata quality.”
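
The tools the IncReASe team envisage need not be elaborate. A minimal Python sketch of a post-ingest check for empty fields and obviously anomalous values (the field names and rules here are illustrative assumptions, not any project's actual checks):

    REQUIRED = ("title", "creators", "date", "type")

    def audit(records):
        """Yield (record_id, problem) pairs for a batch of ingested metadata."""
        for rec in records:
            for field in REQUIRED:
                if not rec.get(field):
                    yield rec.get("id", "?"), f"empty field: {field}"
            date = str(rec.get("date", ""))
            if date and not date[:4].isdigit():
                yield rec.get("id", "?"), f"anomalous date: {date!r}"

    # for rec_id, problem in audit(batch):
    #     print(rec_id, problem)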

Internal sources of data

Metadata which links research outputs to projects and funding information is a valuable addition to the repository: it smooths the deposit workflow, permits better measurement of research performance, can drive alerts to authors about funders' Open Access requirements, and assists with reporting to funders.

A Research Information Management (RIM) system may interact dynamically with other business systems such as HR, grants and awards, student systems and finance, or may bulk-import their data on a regular basis. Some of this data then cascades into the repository, where it can be used to link funding and project information to research outputs.

If data is duplicated across systems, there are likely to be considerable data and metadata clean-up issues during the implementation phases of integration. Guaranteeing data quality is one of the main challenges in implementing a Research Information Management system. The general approach is to ensure that data is cleaned in the originating systems (HR, student information, etc.) rather than imported and then cleaned, though some institutions have opted to place all data in a data warehouse before importing it into the Research Information Management system or directly into the repository.

While not all the metadata entered in a Research Information Management system is relevant to the repository function, during integration there will be situations where data conflicts occur and where high-quality repository records risk being overridden by edited metadata coming from the Research Information Management system. For example, this has arisen where a Research Information Management system has updated items in a DSpace repository without sending file metadata as part of that process, and in some cases fields have been deleted as part of the update. Some of these problems have been solved by fixes to proprietary systems, others by local action to protect fields in DSpace (see the sketch below). It is worth noting that metadata quality may no longer be controlled by the repository where the workflow originates in the Research Information Management system/CRIS: cataloguing standards set by the repository may be incompatible with incoming metadata and require compromises to be made.
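
A simple way to express that kind of local protection, as a hedged Python sketch (the field names and the PROTECTED set are assumptions for illustration, not DSpace's actual mechanism): incoming CRIS values are applied only to fields the repository has not reserved for itself, and absent or empty fields are never treated as deletions.

    PROTECTED = {"dc.description.provenance", "dc.rights", "file_metadata"}

    def merge_update(repo_record, cris_update):
        """Apply a CRIS update without clobbering repository-managed fields."""
        merged = dict(repo_record)
        for field, value in cris_update.items():
            # Skip protected fields and ignore empty values, so a field
            # missing from the update cannot silently delete local data.
            if field in PROTECTED or value in (None, "", []):
                continue
            merged[field] = value
        return merged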

As the White Rose Research Online team has noted, in relation to the Symplectic implementation at Leeds:

“Potentially, we have a source of high quality metadata for publications. But we also lose control of metadata quality as the Symplectic installation becomes our metadata authority source for any records that co-occur in Symplectic and WRRO. Although Symplectic harvests metadata from quality controlled sources, because of the wide subject spread at University of Leeds, a significant proportion (as yet unquantified) of additions to Symplectic will be via manual metadata creation. It remains to be seen whether repository staff – or library staff more generally – will have a role in ensuring metadata consistency, quality and completeness within the Symplectic system. Such proactive improvement is likely to be of long term benefit – not just for metadata quality within WRRO – but also because the data is likely to be used for Leeds’ Research Excellence Framework submission.”[1]

Upgrades to repository platform software can also cause problems, and it can be frustrating to have to delay upgrades which have desirable features because of potential conflicts.

External sources of data

Whether or not a Research Information Management system is in place, it is now possible to import external data from a number of sources, making the completion of bibliographic data in the deposit workflow much easier. Thomson Reuters and Elsevier provide APIs to allow import of data from Web of Science and Scopus (though subscriptions to the products are necessary, and in the case of Thomson some libraries report having to purchase InCites as well to meet all their needs). EPrints has also had plug-ins to perform such imports for some time. The data imports can be very large and rapid, which causes problems of its own, and considerable manual editorial intervention is needed to eliminate records, both for quality reasons (non-peer-reviewed and student papers also appear in Web of Science) and to sort out duplication (see the sketch below).
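
A hedged Python sketch of the kind of duplicate detection such imports require, matching first on DOI and then on a normalised title (the field names and matching rules are assumptions; real deduplication usually needs fuzzier comparison):

    import re

    def norm_title(title):
        """Lower-case a title and strip punctuation/whitespace for matching."""
        return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

    def find_duplicates(incoming, existing):
        """Return incoming records that already appear in the repository."""
        by_doi = {r["doi"].lower(): r for r in existing if r.get("doi")}
        by_title = {norm_title(r.get("title", "")): r for r in existing}
        dupes = []
        for rec in incoming:
            doi = (rec.get("doi") or "").lower()
            if doi and doi in by_doi:
                dupes.append(rec)
            elif norm_title(rec.get("title", "")) in by_title:
                dupes.append(rec)
        return dupes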

APIs are also available for arXiv and PubMed, though these are not without problems in use, for example extracting journal title, volume and issue data from arXiv. Metadata in arXiv is often incomplete and affiliation data is frequently missing. For more experiences of using these, see the final report of the IncReASe project.[2]
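
The arXiv API illustrates the problem well: it returns an Atom feed in which journal information, where present at all, is a single free-text journal_ref element rather than separate title, volume and issue fields. A minimal Python sketch of fetching one record (error handling omitted):

    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"
    ARXIV = "{http://arxiv.org/schemas/atom}"

    def fetch_arxiv(arxiv_id):
        """Return (title, journal_ref) for an arXiv ID; journal_ref may be None."""
        url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
        with urllib.request.urlopen(url) as resp:
            feed = ET.fromstring(resp.read())
        entry = feed.find(ATOM + "entry")
        title = entry.findtext(ATOM + "title", default="").strip()
        # journal_ref, when it exists, is unparsed text such as
        # "Phys. Rev. D 76, 013009 (2007)" -- splitting out the volume
        # and issue is left to the repository.
        journal_ref = entry.findtext(ARXIV + "journal_ref")
        return title, journal_ref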

These sources mean that many authors can now review and check their attributed publications rather than entering the data manually, which helps to raise deposit rates and improve accuracy. They tend, however, to have weaker coverage of the arts and humanities.

RoMEO data can also be imported to allow checking of the copyright position of particular papers. See the Interaction with external systems section.
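
A hedged Python sketch of such a lookup, based on the SHERPA/RoMEO API as it stood at the time of writing (the endpoint, parameter and response elements below are assumptions to verify against the current API documentation):

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    def romeo_colour(issn):
        """Return the RoMEO 'colour' for a journal ISSN, or None if unknown."""
        url = ("http://www.sherpa.ac.uk/romeo/api29.php?"
               + urllib.parse.urlencode({"issn": issn}))
        with urllib.request.urlopen(url) as resp:
            tree = ET.fromstring(resp.read())
        colour = tree.find(".//romeocolour")
        return colour.text if colour is not None else None

    # romeo_colour("1234-5678")  # placeholder ISSN; e.g. 'green' indicates
    #                            # self-archiving is generally permitted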

Interoperability and data exchange

A number of projects are working to assist the interoperability of metadata, both between Research Information Management systems and repositories within an institution and in the exchange of data with other organisations, such as HEFCE and the research councils, as well as in the aggregation of data between HEIs, e.g. the CRISPool project in Scotland.[3]

The Knowledge Exchange,[4] a collaboration between JISC and similar organisations in several other countries, has undertaken a project to increase interoperability between CRIS systems and repositories by defining and proposing a metadata exchange format for publication information, with an associated common vocabulary, contributing to:

  • Making metadata input more efficient
  • Avoiding or reducing duplicated inputs on the two platforms
  • Increasing metadata quality, reliability and reusability
  • Increasing quality level of services based on these metadata
  • Reducing costs of metadata handling and exchange

Inter-institution and funder data exchange

CERIF is the recommended format for the exchange of research information between HEIs and for submission to the REF. However, this is a complex area: CERIF itself may need extensions, and the systems environment varies considerably between institutions, making a uniform approach harder to accomplish in the UK than in smaller nations with national Research Information Management implementations. There are alternatives to using CERIF.

Technical issues around integration, interoperability and standards, including CERIF, are discussed in more detail here and here.

Some instances of metadata flows in embedded repositories

The scenarios described at the start of this guide illustrate different ways in which repositories may be embedded. The following two cases explore some metadata-related aspects of two of the scenarios.

Scenario 1: using a repository as a research publications database – The case of Enlighten at Glasgow

In this embedding scenario, the repository also becomes the central publications database, holding both metadata records and full text/other outputs. It is linked with other elements of the research management infrastructure; as far as metadata is concerned, the most important elements are likely to be project and funding data and staff and research student identity information.

Project and funder data

In the case of the embedding of Glasgow’s Enlighten repository, project and funder data are imported into the repository in overnight bulk downloads. This allows the repository’s deposit interface to be pre-populated with project and funding data in drop-down menus, so that when researchers deposit a publication they can enter the project code and associate it with a particular project. All the other funding data connected to that project code is then auto-completed in the multi-value funding field of the record (a sketch of this step follows the list below). The data recorded are:

  • Project code number
  • Award number
  • Principal investigators [and associated project staff] – look-up is on any member of the project team, but the record then stores the PI name against the link
  • Funder Name
  • Funder Code
  • Funder’s Reference Number
  • Lead Organisational Unit – again, any unit can be the object of a search, but the result stored is the Lead Unit
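
A minimal Python sketch of the auto-completion step (the structure of the nightly download and the field names below are assumptions based on the list above, not Enlighten's actual implementation):

    # Imported overnight from the research system; keyed by project code.
    PROJECTS = {
        "P12345": {
            "award_number": "A-678",
            "principal_investigator": "Smith, J.",
            "funder_name": "EPSRC",
            "funder_code": "EP",
            "funder_reference": "EP/X012345/1",
            "lead_unit": "School of Physics",
        },
    }

    def complete_funding_fields(record, project_code):
        """Copy all funding metadata for a project code into a deposit record."""
        project = PROJECTS.get(project_code)
        if project is None:
            return record  # unknown code: leave the funding step skippable
        record.setdefault("funding", []).append(dict(project, code=project_code))
        return record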

It is very important that the depositing author is not asked to rekey such information manually, and that the funding metadata step is easy to skip if the output is not associated with any project funding. Repository staff also have to ensure that funding data is made public only where that has been flagged as permissible.

As a result of bringing funding data into the repository, funding can be mapped to research outputs and compliance with any open access mandates demonstrated.

If the deposit is being made by someone other than the author, a check might be made with the author to ensure the output is associated with the correct project.

Author names

The Glasgow team have solved the problem of name disambiguation by creating a Glasgow author authority listing, based on user records imported from the data vault. Users now log in with their Glasgow Unique Identifier (GUID). Regardless of the cited form of an author’s name, or whether they have published under other names (e.g. a maiden name), the publication can be associated with the author’s name in the authority file (through the addition of the GUID field in the author record) and can be found by browsing that list. This brings benefits both for the author, who can see an accurate and non-duplicated publications list (EPrints generates a list, but one that uses multiple forms of the author name), and for the university.
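
In outline, the approach amounts to mapping every cited form of a name onto a single institutional identifier. A Python sketch (the data structures are illustrative assumptions; in EPrints the GUID is stored in the author record itself):

    # Authority file: one GUID per person, many name variants per GUID.
    AUTHORITY = {
        "jsmith01": {"display": "Smith, Jane",
                     "variants": {"smith, j.", "smith, jane", "jones, j."}},
    }

    def resolve_guid(cited_name):
        """Map a cited form of a name to a GUID, or None if ambiguous/unknown."""
        needle = cited_name.strip().lower()
        matches = [guid for guid, person in AUTHORITY.items()
                   if needle in person["variants"]]
        return matches[0] if len(matches) == 1 else None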

Additional journal-related metadata

Three new fields were added for the journal article document type:

  • ISSN (Online) [Text]
  • Journal Abbreviation [Text]
  • Published Online [Date]

This clearly distinguishes between printed and online ISSNs and allows the export of the appropriate ISSN, for example for research assessment (the electronic ISSN was specified in the RAE2008 for articles with DOIs).

Journal abbreviations were added to address the needs of disciplines, such as Mathematics, which use the American Mathematical Society’s short names in their citations. The Published Online date was added as a result of data coming from the Faculty of Biomedical and Life Sciences, whose articles are often published online and may have no hardcopy publication date.

A benefit of the ISSN data is that it will allow a relatively easy comparison over time between where Glasgow authors publish and the journals subscribed to by the Library, through matching against the ISSN data held in the LMS (a sketch of this comparison follows).
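
In practice this comparison is little more than a set intersection on ISSNs, sketched here in Python (how the two ISSN lists are extracted from the repository and the LMS is assumed, not specified by the source):

    def publication_holdings_overlap(repository_issns, lms_issns):
        """Compare where authors publish with what the Library subscribes to."""
        published = set(repository_issns)
        held = set(lms_issns)
        return {
            "published_and_held": published & held,
            "published_not_held": published - held,
        }

    # overlap = publication_holdings_overlap(
    #     issns_from_repository_records, issns_from_lms_export)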

Scenario 2: Current Research Information System (CRIS) with linked repository – The case of AURA (repository) and PURE (CRIS) at Aberdeen

In this scenario, a repository is linked with a Research Information Management system. In the Aberdeen example, the Research Information Management system is a proprietary CRIS (Pure), where the links to both the repository and other business systems (e.g. HR) are handled by the CRIS itself; in other cases it could be an assemblage of linked systems, parts of which constitute a virtual CRIS.

In this scenario, the metadata flows from the CRIS into the repository, with or without the output itself.

At Aberdeen, the decision was taken to acquire the PURE system (jointly procured with St Andrews, though each implementation is separate) following the experience of submitting to RAE2008. The Library had been involved in data collection and checking for the RAE, and was therefore in a position to participate actively in the project to implement the PURE system, advocating an integrated approach incorporating the repository rather than merely a replacement for the existing publications database. Since PURE was launched, the repository has seen a doubling of the full-text outputs it contains, attributed both to the higher profile of the repository arising from the project and to the ease of adding an output as the final step of recording a publication in PURE – as simple as attaching a document to an email message.

Bibliographic data – deposit workflow

Bibliographic data is imported into the repository from a number of external data sources, including:

  • Web of Science
  • arXiv
  • PubMed

Depositors (whether researchers or administrators undertaking mediated deposit) can then associate the identified outputs with data drawn from other university databases or created by themselves via the PURE system:

  • Current, former and honorary members of staff – synchronised with the University’s HR system
  • Projects – synchronised with the University’s research grants database
  • Events, impact cases and professional activities created by themselves and other members of staff
  • Journals and publishers – from authority lists maintained in Pure by the AURA team

Export of bibliographic data

Data can be exported in the following formats (see the sketch after this list):

  • BibTeX
  • HTML
  • Microsoft Office (Excel and Word)
  • PDF
  • RefMan
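
A minimal Python sketch of the simplest of these, a BibTeX export (the record fields are illustrative assumptions, not PURE's data model):

    def to_bibtex(rec):
        """Render one article record as a BibTeX entry."""
        return (
            "@article{%s,\n"
            "  author  = {%s},\n"
            "  title   = {%s},\n"
            "  journal = {%s},\n"
            "  year    = {%s}\n"
            "}" % (rec["key"], " and ".join(rec["authors"]),
                   rec["title"], rec["journal"], rec["year"])
        )

    # print(to_bibtex({"key": "smith2010", "authors": ["Smith, J."],
    #                  "title": "An example", "journal": "J. Ex.",
    #                  "year": "2010"}))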

Bibliographic data – initial reconciliation of repository data with external data

Existing data was passed to Thomson Reuters, who returned three data sets (a sketch of the load step follows the list):

  • Existing data enhanced with Web of Science data – loaded as ‘Validated’
  • Existing data which could not be matched to Web of Science data – loaded as ‘For Validation’
  • Web of Science data linked to the university by Thomson Reuters which could not be matched to the existing data – loaded as ‘For Validation’
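
A hedged Python sketch of the load step implied by that three-way split. Matching here is by DOI only, which is a simplification of whatever matching Thomson Reuters actually performed; the field names are assumptions.

    def classify(existing, wos):
        """Partition records into 'Validated' and 'For Validation' loads."""
        existing_by_doi = {r["doi"]: r for r in existing if r.get("doi")}
        validated, for_validation = [], []
        for rec in wos:
            doi = rec.get("doi")
            if doi and doi in existing_by_doi:
                # Existing record enhanced with Web of Science data.
                validated.append({**existing_by_doi[doi], **rec})
            else:
                # WoS record with no local match: needs human checking.
                for_validation.append(rec)
        # Existing records never matched by WoS also need checking.
        matched_dois = {r.get("doi") for r in wos if r.get("doi")}
        for_validation += [r for r in existing
                           if r.get("doi") not in matched_dois]
        return validated, for_validation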