Introducing Repository Tools 2

Introduction

With the release of v4.16 of Elements in December 2015, we delivered the first piece of functionality on what we have been calling internally our Repository Tools 2 roadmap: the introduction of figshare for institutions as a data source for Elements. In the releases since, we have built Repository Tools 2 integrations with DSpace, EPrints, and figshare for institutions.

This article explains what Repository Tools 2 is, the vision behind it, and how it relates to Repository Tools 1.x. We also invite you as a community to ask questions and make comments in the Repository Tools 2 Q&A community thread, for everyone to share.

What is Repository Tools 2?

Repository Tools 2 is the philosophy behind the building of next generation institutional repository interoperability in Elements. It differs from the existing Repository Tools 1.x approach in several important ways, ultimately enabling us to deliver richer functionality and a more maintainable interaction between Elements and your institutional repository in the months and years to come.

Repository Tools 1.x was conceived and designed over 6 years ago in partnership with Imperial College library, in an age when institutional Open Access mandates were a thing of the future. Its design enabled it to achieve the goals it had at that time. Since then, it has benefited from a long road of incremental improvements and alterations, and is still a very capable solution for boosting rates of full text deposit and providing automated high quality metadata updates to a connected repository, as well as for reporting within Elements on compliance levels with an institutional Open Access policy.

However, scholarly communications and Open Access have moved on, and the design that was appropriate 6 years ago, although still valid today, now limits the future growth potential of Repository Tools 1.x.

As they are developed, new Repository Tools 2 connectors will be offered alongside the existing Repository Tools 1.x connectors, which will be maintained for ongoing use by customers. We understand that stability of existing system connections is very important to our customers.

Because of fundamental incompatibilities in the approaches behind Repository Tools 1.x and Repository Tools 2, it is not going to be possible to connect to a repository system using Repository Tools 1.x and Repository Tools 2 at the same time. For your repository, you will choose whether to use the older, but currently more complete Repository Tools 1.x technology, or where available the newer, currently less complete, in-development Repository Tools 2. We will be happy to discuss the pros and cons with you, and help you make the right choice.

Principle 1: Repositories must be data sources

The new vision for Repository Tools 2 is to start from a foundation of integrating your institutional repository first as a data source for Elements. Although the ultimate goal of most repository integrations is to push data and files from Elements to the repository, this initial requirement will nevertheless be the foundation on which such functionality is then developed.

Metadata from items in your repository will appear within Elements in the same way as records from the more traditional data sources like the Web of Science, PubMed or arXiv. If a given institutional repository platform cannot be integrated by Symplectic as a data source for Elements, then it is not a candidate for a Repository Tools 2 integration at all. Being able to assume that all repositories supported under Repository Tools 2 are fully-fledged data sources for Elements significantly simplifies the planning, development, delivery and maintenance of later, more complex functionality, such as file deposit and OA policy compliance monitoring.

As an example, see a DSpace repository record in the publication image below:


This is a brand new approach, and it will be the main visible difference between Repository Tools 1.x and 2. At the time of writing, we have so far implemented figshare for institutions, Digital Commons, and DSpace as Repository Tools 2 data sources.

The technical amongst you may be interested to know that the existence of institutional repository metadata records in Elements is the authoritative information that tells the two systems which publications in Elements are associated with which items in the repository.

This is in stark contrast to the information architecture of Repository Tools 1.x, where a custom table (called "symplectic_pids" in DSpace and EPrints) or other similar data structure (in Fedora 3) created and maintained by Symplectic inside your institutional repository is the place where these associations are stored and managed. There is no need for Symplectic to create and maintain this table in your institutional repository in the Repository Tools 2 world.

Unlike Repository Tools 1.x, Repository Tools 2 will be written to be hands-off about how duplicate items are handled in the connected repository, allowing repository administrators to address duplicates in their own time, and in a way that works best for their repository.

  • Repository Tools 1.x was not aware of items in the repository that it had not itself created, which could sometimes lead to the upload of duplicates by users of Elements. Repository Tools 2 connects the repository as a data source for Elements, meaning that Elements will generally know when there is already an associated repository item for any given publication. As a result, the probability that a user of Elements uploads a duplicate item to the repository in the first place is reduced.

  • Elements will automatically de-duplicate imported repository metadata records just as it does for records from any other data source, such that where it believes a duplicate item exists in the repository, two (or more) repository records will be placed into the same Elements publication. Just as for every other data source, if Elements incorrectly assigns two repository records to the same publication, the falsely matched record can be manually split out of the publication. And likewise, where Elements misses a duplicate, publications containing repository records can be merged manually into one publication containing more than one repository record. None of these Elements-side operations will result in an automated attempt by Elements to perform a similar "merge" operation in the connected repository, avoiding associated problems such as the creation of "clone" items and loss of download statistics experienced in some Repository Tools 1.x setups.

  • Repository administrators interested in taking advantage of the ability of Elements to detect duplicates in the repository can (for example) query the Elements reporting database to export a report detailing which Elements publications contain multiple records from the repository, and act on this information in the repository as they see fit.
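As a rough illustration of the kind of duplicate report described above, the following Python sketch groups exported rows by publication and keeps only those publications that hold more than one record from the repository. The row layout, the "dspace" source name and the record identifiers are all assumptions made for the example; the actual schema of the Elements reporting database is not described in this article.

```python
from collections import defaultdict

# Hypothetical export rows: (publication_id, data_source, record_id).
# These values are invented for illustration; they do not reflect the
# real Elements reporting database schema.
rows = [
    (101, "dspace", "123456789/10"),
    (101, "dspace", "123456789/27"),
    (101, "pubmed", "26500000"),
    (102, "dspace", "123456789/31"),
]


def find_repository_duplicates(rows, repository_source):
    """Return {publication_id: [record ids]} for publications that
    contain more than one record from the given repository source."""
    by_publication = defaultdict(list)
    for publication_id, source, record_id in rows:
        if source == repository_source:
            by_publication[publication_id].append(record_id)
    # Keep only publications with two or more repository records.
    return {pub: ids for pub, ids in by_publication.items() if len(ids) > 1}


duplicates = find_repository_duplicates(rows, "dspace")
for publication_id, record_ids in duplicates.items():
    print(publication_id, record_ids)
```

A repository administrator could then review each listed publication and decide, in the repository itself, whether and how to merge the items.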

Principle 2: Let the experts maintain the repositories

The second big differentiating factor between Repository Tools 1.x and 2 is that we will aim to use only the native API functionality offered by each repository platform, in contrast to the approach taken with Repository Tools 1.x, where Symplectic traditionally wrote large and complex plugins or alterations to repository platforms.

6 years ago, this approach was our only choice, since the popular institutional repository platforms of the day did not offer their own APIs powerful enough to support your use cases in Elements. Symplectic's developers are experts in the technologies used to develop Elements. Stretching those skills to maintain EPrints scripts (written in Perl) and DSpace and Fedora 3 plugins (written in Java) takes special dedication from a team specialised in other technologies.

Now that many institutional repository platforms have more fully featured native APIs, allowing us to interact with them more flexibly than before, it is appropriate that we engage with those APIs instead of spending time maintaining our own competing ones. This brings many benefits: maintenance is performed by the experts for each respective platform, Symplectic (and ultimately you) save time, and those platforms' APIs should receive better quality support as a result.

Where a platform is missing necessary or desired functionality in its API, we will aim to engage with external communities who are experts in developing those platforms to push their APIs forward. This is a more community based approach and you should expect to see us being more active in institutional repository developer communities, representing your interests.

Principle 3: Cope gracefully with differences in repository platform capabilities

If Symplectic is not going to write and maintain a proprietary API plugin for each repository platform, then the varied and changing capabilities of the native APIs of repository platforms raise the risk that some platforms will be missing required or desired API features, and therefore cannot be connected to optimally, or in some cases, at all.

Our Repository Tools 2 roadmap therefore needs to treat each repository platform separately, and dynamically adjust the end user experience in Elements in response to the supported capabilities of the connected repository. Our philosophy is to apply something like a Progressive Enhancement design strategy to our evolving Repository Tools 2 feature set, starting from a baseline of implementing a repository as a data source for Elements. The ability to integrate a repository as a data source is then the minimum entry requirement for the Repository Tools 2 feature set, with progressively more functionality being made available in an Agile manner as Symplectic is able to code against the various capabilities of each supported version of each repository platform.

This approach will result in different user experiences in Elements for different repositories, but will allow us the flexibility to take advantage of newer features on each repository platform without requiring that all repositories support them.
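The progressive enhancement idea can be pictured with a small Python sketch. It is only an illustration of the principle, not how Elements is implemented: the capability flags and feature names below are hypothetical, chosen to mirror the stages described later in this article (data source, OA monitoring, deposit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryCapabilities:
    """Hypothetical capability flags for one connected repository."""
    harvest_metadata: bool   # baseline: usable as a data source
    exposes_oa_status: bool  # embargo / file version data is harvestable
    supports_deposit: bool   # API can create items and accept files


def available_features(caps):
    """Return the features that could be offered for these capabilities.

    Without the data source baseline nothing is offered; every other
    feature is layered on top only when the repository supports it.
    """
    if not caps.harvest_metadata:
        return []
    features = ["data source"]
    if caps.exposes_oa_status:
        features.append("OA monitoring")
    if caps.supports_deposit:
        features.append("item/file deposit")
    return features


basic = RepositoryCapabilities(True, False, False)
full = RepositoryCapabilities(True, True, True)
print(available_features(basic))
print(available_features(full))
```

The point of the design is that adding a new capability check extends the experience for repositories that support it without breaking the baseline for those that do not.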

Principle 4: Configure interoperability as much as possible within Elements itself, and not in the repository

With Repository Tools 1.x, much of the configuration of the way the two systems work together lies outside of the Elements system, in files that form a part of your repository installation. These typically include Symplectic "crosswalk" files (typically XSLTs) that describe how to map metadata from Symplectic Elements into metadata formatted for the repository, as well as a configuration file that influences how data alterations made by Elements affect the workflows and other mechanics of the repository. In addition, optional scripts running as a part of the repository installation are responsible for periodically detecting and pulling metadata changes in from Elements and updating the associated items in the repository. All of these need to be maintained to some extent by your repository IT team.

Some degree of repository-side configuration may still be necessary with Repository Tools 2, but our philosophy will be, where there is a choice, to move as much configuration as possible to the Elements side of the bridge and to leave the repository being as close to a clean native installation as possible. This overlaps a little with Principle 2, as it aims to reduce the number of alterations that need to be made to a repository installation to make it able to work with Elements.

However, moving configuration parameters to the Elements side opens up the path to many other benefits, too. With Elements being more aware of how you wish to work with your repository and how you want to crosswalk your data, we can later introduce user interface controls to help configure and simplify aspects of your crosswalks. We can also more ably detect when a metadata change in Elements is not of interest to the connected repository, allowing us to suppress unnecessary metadata change propagation events across the network, reducing the load on your repository. There are many other potential benefits besides.
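To make the idea of suppressing irrelevant change events concrete, here is a minimal Python sketch. It assumes a crosswalk can be represented as a simple mapping from Elements field names to repository field names; the mapping, the field names and both function names are invented for the example and do not describe the actual Elements crosswalk format.

```python
# Hypothetical crosswalk: Elements field name -> repository field name.
# Real crosswalks are richer than a flat mapping; this is illustrative only.
CROSSWALK = {
    "title": "dc.title",
    "abstract": "dc.description.abstract",
    "doi": "dc.identifier.doi",
}


def changes_to_propagate(changed_fields, crosswalk):
    """Translate changed Elements fields into repository fields,
    dropping any field the crosswalk does not map."""
    return {
        crosswalk[name]: value
        for name, value in changed_fields.items()
        if name in crosswalk
    }


def should_notify_repository(changed_fields, crosswalk):
    """True only if at least one changed field is of interest
    to the connected repository."""
    return bool(changes_to_propagate(changed_fields, crosswalk))


# A change to an unmapped field produces no update for the repository.
print(should_notify_repository({"internal_note": "checked"}, CROSSWALK))
# A change to a mapped field is translated and would be sent.
print(changes_to_propagate({"title": "New title", "internal_note": "x"}, CROSSWALK))
```

Because the mapping lives on the Elements side in this sketch, the decision not to send an update can be made before any request reaches the repository.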

Further notes on supported proper usage

The intended and proper usage of Repository Tools 2 is for you to connect Elements to your own institution's dedicated institutional repository. Although Elements cannot always prevent you from connecting to a repository that does not belong to your institution, this is not a supported use of the functionality.

The code behind the harvesting of data is optimised under the assumption that the harvested data are all outputs of your institution. Amongst other things, this enables us to infer more about the likelihood that any given repository item was authored by a member of your staff with a matching name, something that is critical to not overpopulating users' pending publication lists with false positives, and also not bloating and slowing down the Elements database with irrelevant publications.

If you are interested in using Repository Tools 2 to connect to a repository that contains shared content with other institutions, please discuss this with Symplectic to ascertain whether your particular usage is supported.

Adding support for additional repository platforms

We have regular internal prioritisation meetings where we review the current landscape, including customer feature requests and the evolving repository platforms themselves.

Feasibility Study

For each repository platform and version of it, the first thing we must do is a feasibility study to ascertain to what degree the platform meets two important criteria for a maintainable implementation under Repository Tools 2:

  1. The platform must be suitably standardised

  2. The platform must provide a powerful enough API

By "standardised", we mean a platform that in practice will look similar enough in its configuration across all possible customer sites, that a standard working connector to the platform's API can be developed and maintained efficiently by Symplectic. Examples of non-standardised repositories might include repository platforms that do not provide a standard set of metadata fields and/or workflows relevant to a typical digital library.

By a "powerful enough API", we mean a repository platform that provides an API that can be used to give Elements ongoing visibility of all metadata and file contents of the repository and access to item publishing workflow statuses. Ideally it should also give Elements the ability to submit new items and/or files to the repository in real-time.

Integration as a data source

If the repository platform is suitably standardised and provides a powerful enough API, then the next thing we would look to do is implement the repository as a data source for Elements. This allows the harvesting of repository items as metadata records inside Elements, providing ongoing visibility to Elements of the changing contents of the repository.

OA Monitoring

Where appropriate embargo, file version, and other accessibility status information can be stored by the repository platform and made available in the data harvested from it, we would aim to integrate it with the Elements OA Monitor, allowing administrators access to powerful reports measuring degree of compliance with and outstanding actions under an institutional Open Access policy.

New item/file deposit

For repositories with APIs supporting the creation of new items and upload of files in real-time, we would hope to provide this functionality via the Elements user interface, alongside prompts and workflows to deposit appropriate files where there are outstanding actions under your institutional Open Access policy. When depositing files from Elements, we would where possible store appropriate embargo and file version information.

Feedback and questions

Please feel free to provide thoughts and ask questions about Repository Tools 2 on the OA and Repository Q&A community thread.
