Repository Tools 1: Architecture

N.B. This article relates to the Repository Tools 1 (RT1) protocol, which is no longer supported as of Elements 6.11. New repository integrations should use Repository Tools 2 (RT2).

This document describes the general architecture of the Repository Tools module. It is a companion piece to the installation guides, detailing how Elements and the Repository interact, the data model used to represent the connection between the two systems, and how changes in the Elements data are handled.

Design

The fundamental building blocks of Repository Tools are a set of RESTful web services, with endpoints provided in both Elements and the repository, to cover both synchronous and asynchronous updates to the data in both systems.

Repository Tools are designed to keep the amount of communication between Elements and the repository to a minimum, so that during normal operation (e.g. times of heavy user activity) the impact on both systems is minimized.

Synchronous Operations

To reduce communication between the two systems, synchronous operations are reserved for file-based operations and for occasions where immediate feedback is needed.

File-Based Operations

An operation happens immediately when it involves the manipulation of files (rather than only items or metadata). That is, any operation that is initiated by a user and involves one of the following:

  • Depositing full text.

  • Granting the repository licence (depositing the licence file).

  • Deleting full text.

  • Revoking the repository licence (deleting the licence file).

Note: when a full text file is deposited to the repository or the licence is granted, then an Atom XML document with the publication state and metadata is included with the file that is pushed to the repository. This Atom document is not included when a file is deleted or a licence is revoked.

Current Publication Status

There are times when the user will require immediate feedback regarding the up-to-the-minute status of the repository. In order to do this, Elements asks the repository to return the status of anything relating to a single Elements publication id. This happens when a user (1) opens the 'Full text' tab; or (2) accesses the 'Manage full text' page.

For bulk display (e.g. the file count in the 'Full text' tab heading, or API access), Elements maintains a cache of what it knows about any repository items linked to a publication.

When Elements retrieves information about a single publication, it will update the cache of repository data for that publication accordingly.
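The relationship between the live status call and the cache can be sketched as follows. This is a minimal illustration only, not how Elements is implemented: the class name `RepositoryCache`, its methods, and the shape of the status dictionary are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RepositoryCache:
    """Hypothetical stand-in for the Elements cache of repository data."""

    # Function that performs the synchronous status call to the repository.
    fetch_status: Callable[[int], dict]
    entries: Dict[int, dict] = field(default_factory=dict)

    def file_count(self, publication_id: int) -> int:
        # Bulk display: answer from the cache only, never call the repository.
        entry = self.entries.get(publication_id, {})
        return len(entry.get("files", []))

    def refresh(self, publication_id: int) -> dict:
        # Single-publication view: ask the repository, then update the cache.
        status = self.fetch_status(publication_id)
        self.entries[publication_id] = status
        return status


def fake_repository_status(publication_id: int) -> dict:
    # Placeholder for the real repository response (assumed shape).
    return {"files": ["accepted-manuscript.pdf", "licence.txt"]}


cache = RepositoryCache(fetch_status=fake_repository_status)
count_before = cache.file_count(42)  # nothing cached yet
cache.refresh(42)                    # e.g. the user opens the 'Full text' tab
count_after = cache.file_count(42)   # now served from the refreshed cache
```

The point of the split is that bulk views never trigger a repository call, while a single-publication view always does and leaves the cache fresher as a side effect.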

Asynchronous Operations

There are many causes of data change in both Elements and a repository. For example, Elements may find a new record of a publication in the data sources; a manual data source may be edited; publications may be merged; data sources may have updated metadata. In the repository, an administrator may add or remove files directly, without using Elements.

To prevent performance issues with excessive communication between the two systems, these changes are not reflected immediately in the other system. Instead, both Elements and the repository (Repository Tools package for EPrints and DSpace) have scheduled processes which periodically query the other system and process the changes they find.

Note: to manage Elements' scheduled process, navigate to System Admin > Operations > Scheduled Jobs and find the "Synchroniser: Repository" process.

Elements' "Synchroniser: Repository" scheduled process performs the following functions:

  • Retrieve a summary of the repository contents.

  • Update the repository cache.

The repository scheduled processes perform the following functions:

  • Create a summary of all the contents of the repository, to be returned by the “holdings” web service.

  • Query the Elements Repository Tools API to find all publications that have been changed (updated metadata, new data sources, merges, and deletions, i.e. records merged into another record).

  • Update the repository with new metadata and ensure that items are connected to the correct publication.

These processes should be scheduled by an administrator to run at times that will limit the impact on both systems (i.e. when they are being used least). Regular running of the scheduled jobs is important to ensure that Elements and your repository stay synchronised.

Example: suppose two publications are merged in Elements. When the repository Query and Update processes run, your repository is informed of the changes and the affected repository records are connected to the currently active publication.
Example: Elements discovers updated metadata in its data sources. When the repository Query and Update processes run, your repository will be updated with this new metadata.
Example: You are using the Elements API to get information about your repository deposits. When the repository Create process runs, it updates the information in the holdings service. Elements' "Synchroniser: Repository" process then runs to retrieve and process that information. Elements is kept up to date with any changes in the repository.

Repository-Side Web Services

All user repository interaction is handled by calls from Elements to the repository-side web services. All of the repository endpoints should sit under a common root path (e.g. https://myrepo.edu/rt). There currently is no requirement for this common root to respond to requests. We recommend that the path is reserved entirely for Repository Tools functionality, so that:

  • it could respond to requests in the future if needed.

  • current or future endpoints do not clash with any other URLs for your repository.

Note: when configuring the repository connection, this common root path is the 'Service Base URL' described in Repository Tools 1: Module Administration - Repository Connection.

The Repository Tools connectors supplied by Symplectic will by default install on a unique root path, as described above.

There are three endpoints that are required by Elements:

[http://myrepo.edu/rt]/repository

HTTP Method | Description
GET | Returns a summary of the repository contents, in Atom XML. This summary is usually pre-generated to reduce load, and so will only be as up to date as the last time the scheduled job was run to create it. Should support paginated operation (query string ?page=1, etc.).
POST | Deposit files to the repository (full text or licence). Expects a multipart AtomPub, where the first part is the Atom document providing all details of the deposit (who is depositing, on behalf of, Elements publication serialization, etc.), and the second part is the file that is being deposited. If the publication is currently linked to a repository item, the result may depend on the state of the repository item (live, in review, etc.).
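The two-part structure of a deposit (Atom document first, file second) can be sketched with Python's standard `email` package. This is an illustration under stated assumptions, not the exact wire format: the `related` multipart subtype, the part headers, the function name `build_deposit_body` and the placeholder Atom content are all assumptions, since the article does not specify them.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart


def build_deposit_body(atom_xml: bytes, file_bytes: bytes, filename: str) -> MIMEMultipart:
    """Assemble a two-part message: Atom document first, deposited file second."""
    # "related" is an assumed subtype; the article only says "multipart".
    message = MIMEMultipart("related")

    # Part 1: the Atom document describing the deposit.
    message.attach(MIMEApplication(atom_xml, "atom+xml"))

    # Part 2: the file being deposited (full text or licence).
    file_part = MIMEApplication(file_bytes, "octet-stream")
    file_part.add_header("Content-Disposition", "attachment", filename=filename)
    message.attach(file_part)
    return message


# Placeholder Atom content; a real document carries the depositor details and
# the Elements publication serialization, which are not reproduced here.
atom = b'<entry xmlns="http://www.w3.org/2005/Atom"><title>Example</title></entry>'
body = build_deposit_body(atom, b"%PDF-1.4 example bytes", "paper.pdf")
first, second = body.get_payload()
print(first.get_content_type(), second.get_filename())
```

Keeping the Atom document as the first part means a receiver can read the deposit details before deciding how to handle the file that follows.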

[http://myrepo.edu/rt]/publication/{id}

HTTP Method | Description
GET | Given the Elements publication id ({id}), returns in Atom XML format all known information about a linked repository item. This must return the current status of the item, at the time it is called.

[http://myrepo.edu/rt]/file/{id}

HTTP Method | Description
GET | Returns the file specified by {id}. Allows a file to be returned to a user via Elements, even when it is not available via the public repository interface.
DELETE | Delete the file specified by {id}. If this is a licence file, then it indicates the item should be taken back into the “workspace”. The result of this operation may depend on the state of the repository item (live, in review, etc.).
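Because all three repository-side endpoints hang off one common root, a client only needs the 'Service Base URL' to derive every URL it calls. The sketch below illustrates that; the class and method names are hypothetical, and treating file identifiers as opaque strings is an assumption.

```python
from urllib.parse import quote, urlencode


class RepositoryEndpoints:
    """Build URLs for the three repository-side endpoints under one root."""

    def __init__(self, service_base_url: str) -> None:
        # Tolerate a trailing slash on the configured root path.
        self.base = service_base_url.rstrip("/")

    def holdings(self, page: int = 1) -> str:
        # GET: paginated summary of the repository contents.
        return f"{self.base}/repository?{urlencode({'page': page})}"

    def deposit(self) -> str:
        # POST: deposit of a full text or licence file.
        return f"{self.base}/repository"

    def publication(self, publication_id: int) -> str:
        # GET: current status for one Elements publication id.
        return f"{self.base}/publication/{quote(str(publication_id), safe='')}"

    def file(self, file_id: str) -> str:
        # GET or DELETE: a single file (identifier format is an assumption).
        return f"{self.base}/file/{quote(str(file_id), safe='')}"


endpoints = RepositoryEndpoints("https://myrepo.edu/rt/")
print(endpoints.holdings(page=2))
print(endpoints.publication(123))
```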

Elements-Side Web Services

The repository scheduled processes need to be able to query Elements and process updates. As explained above, this ensures that the majority of updates – those triggered by data changes within Elements, such as finding updated metadata – are deferred and scheduled for a time that will cause least impact.

[http://myelements.edu/rt-api]/list-publications

HTTP Method | Description
GET | Returns a list of all the publications that have been updated. Is normally called with the previous update date, to only get the most recent updates, and can be paginated (supplying a “page=1” parameter). Note that unlike the main API endpoints, this will include any publications that have been deleted.

[http://myelements.edu/rt-api]/publication/{id}

HTTP Method | Description
GET | Return the detailed Elements state for the given publication {id}. This includes the status (approved, deleted, etc.), any merge history (combining publications), and serialization of all of the fields and records.

Note: you only need to enable a repository tools endpoint in the API configuration, providing a port and path, in order to have the appropriate web services available.
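A repository-side process consuming the paginated list-publications endpoint would typically loop over pages until none remain. The following is a sketch of such a loop with the HTTP call abstracted away. The function names are hypothetical, and the stopping rule (an empty page means the end) is an assumption, as the article does not say how the last page is signalled.

```python
from typing import Callable, Iterator, List


def iter_updated_publications(fetch_page: Callable[[int], List[str]]) -> Iterator[str]:
    """Yield publication ids page by page until an empty page is returned."""
    page = 1
    while True:
        ids = fetch_page(page)
        if not ids:  # assumed end-of-results signal
            break
        yield from ids
        page += 1


# Stand-in for the real HTTP request and Atom parsing.
fake_pages = {1: ["10", "11"], 2: ["12"]}


def fake_fetch(page: int) -> List[str]:
    return fake_pages.get(page, [])


updated_ids = list(iter_updated_publications(fake_fetch))
```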

Firewall / Security Requirements

Authentication has not been implemented in either the Elements or repository side Repository Tools API web services. As such, it is important that you secure access to the endpoints via alternative means.

Elements allows you to limit the IP addresses that are allowed to make requests of the insecure endpoints, and this applies to the repository tools API endpoints. Additionally, firewalls – both on the server, and separate network devices – can be used to limit the systems that are allowed to connect to the configured port.

On the repository server, the Repository Tools connectors for DSpace and EPrints have configuration options that make the connector process requests only from the specified IPs. Additionally, standard configuration is available within the application server (Tomcat for DSpace, Apache HTTPd for EPrints) to limit access to the connector application / paths. As the connector will typically be served on the same port as your repository, it will not be possible to use operating system or network device firewalls to limit access.
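The kind of IP restriction described here amounts to checking the caller's address against an allow-list before a request is processed. The sketch below is a generic illustration using Python's `ipaddress` module. It is not the connectors' actual configuration mechanism, and the function name and example addresses are assumptions.

```python
import ipaddress
from typing import Iterable


def is_allowed(remote_addr: str, allowed: Iterable[str]) -> bool:
    """Return True if remote_addr matches any allowed address or network."""
    addr = ipaddress.ip_address(remote_addr)
    # A bare address such as "192.0.2.10" is treated as a single-host network.
    return any(addr in ipaddress.ip_network(entry, strict=False) for entry in allowed)


# Example allow-list: one specific server plus an internal network range.
allow_list = ["192.0.2.10", "10.0.0.0/8"]
print(is_allowed("10.1.2.3", allow_list))
print(is_allowed("203.0.113.7", allow_list))
```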

If the Elements and repository servers are not installed on the same local network (for example, if you make use of a hosted repository service), then it is important to make sure that the web services on each server can be accessed by the other. For the repository server, that is likely to be the standard HTTP port that is also used for accessing the repository. With Elements, you will need to know the port that you have configured for the repository tools API endpoint, and ensure that the repository server can reach it through the firewall.

Repository Data Model

As designed, it is the responsibility of the repository / Repository Tools connector to maintain the link between a publication in Elements and a repository item. Elements does not “know” what repository item a publication is connected to, and instead uses the id of the publication in Elements as the primary identifier for all communication with the Repository Tools connector.

So the connector needs to keep track of the Elements ID associated with the publication for which a deposit is made, and the item(s) that represent that publication in the repository. It also needs to maintain some other state information and queues in order to process updated Elements data.

In order to do this, the connector stores information either in the database for the repository, or in the case of Fedora, in the repository data model itself.

Publications Record

This is the main record that is required by the connector – the record that tracks the link between a publication ID, and the associated repository item(s).

DSpace and EPrints

Implemented as the symplectic_pids table in the database. There is a row for each Elements publication that has been deposited / matched with the repository. The main columns that it uses are:

Table: symplectic_pids
Column | Description
pid | The ID of the publication in Elements
item_id | The ID of the DSpace Item / EPrint that is “live” (i.e. publicly available)
submission_id | The ID of the DSpace Item / EPrint that is currently being submitted / reviewed.

The distinction between item_id and submission_id is important. As will be described later, once the repository item that was initially created / matched has been made part of the public archive, we may not be able to make changes (e.g. adding a new file) directly on that record. In those cases, the Repository Tools connector will need to create a clone; until the clone is made public, the link to it will be tracked as submission_id.

In addition, there are columns tracking the time when the item was first deposited via Elements (first_imported), and the last time it was modified via Elements / Repository Tools (last_modified). These are largely for information and debugging purposes, and are not actively important for the connector to function.
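To make the role of these columns concrete, here is a small sketch using an in-memory SQLite database. The column names come from the description above, but the column types, the helper function `linked_items` and the sample ids are assumptions. The real connectors use the repository's own database, and its schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE symplectic_pids (
        pid            INTEGER PRIMARY KEY,  -- Elements publication ID
        item_id        INTEGER,              -- "live" repository item, if any
        submission_id  INTEGER,              -- item being submitted / reviewed, if any
        first_imported TEXT,                 -- informational only
        last_modified  TEXT                  -- informational only
    )
    """
)


def linked_items(connection: sqlite3.Connection, pid: int):
    """Return the live and in-progress item ids for a publication, or None."""
    row = connection.execute(
        "SELECT item_id, submission_id FROM symplectic_pids WHERE pid = ?", (pid,)
    ).fetchone()
    if row is None:
        return None  # publication has never been deposited / matched
    return {"live": row[0], "submission": row[1]}


insert = "INSERT INTO symplectic_pids (pid, item_id, submission_id) VALUES (?, ?, ?)"
conn.execute(insert, (1001, 55, None))  # live item only
conn.execute(insert, (1002, 56, 78))    # live item plus a clone under review
print(linked_items(conn, 1002))
```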

Fedora

Implemented as objects in the Fedora data model. For each publication, an object is created with the identifier info:fedora/pubs:{id}. All of the required values are stored in the RELS-EXT.

Fedora Object ID: info:fedora/pubs:{id}
Subject | Predicate | Object | Usage
info:fedora/pubs:{id} | TYPE | PUBS_RECORD | Identifies the object as a Publications Record
info:fedora/pubs:{id} | LOCKED_BY | PUBLICATIONS | Signifies that it is in the workspace (i.e. licence not granted). Is it a problem that this identifies the publication record, and not the item? Only used in one particular Fedora implementation
info:fedora/pubs:{id} | DATE_LOCKED |  | Only used in one particular Fedora implementation
info:fedora/pubs:{id} | HAS_CURRENT | {uuid} | Is the “submission” item
info:fedora/pubs:{id} | HAS_VERSION | {uuid} | Historical “live” items?
info:fedora/pubs:{id} | HAS_VISIBLE | {uuid} | Is the “live” item
info:fedora/pubs:{id} | REPLACES | info:fedora/pubs:{id} |
info:fedora/pubs:{id} | MERGE_FROM | info:fedora/pubs:{id} |

Merge History

In Elements, it is important to have a clean representation of the publications, without any duplicates. With many data sources contributing bibliographic information, and the possibility for the data sources to overlap in their coverage, there may be occasions where multiple representations can’t be automatically grouped into a single publication. So, in Elements you will merge those separate representations into a single publication – and, where (at least) one of those publications is linked to a repository record, then the links between the repository item and Elements publication need to be maintained.

As Repository Tools processes these merges from Elements, it records each merge in a history table. The primary reason for this is to be able to update metadata in old repository items to indicate that they have been replaced, and by which item.

DSpace and EPrints

Table: symplectic_merge
Column | Description
source | The ID of the publication in Elements merged FROM
source_item | The ID of the (working/submission) item in the repository merged FROM
target | The ID of the publication in Elements merged TO
target_item | The ID of the (working/submission) item in the repository merged TO
merge_date | The date that the merge was recorded
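The stated purpose of the merge history (being able to say which item replaced an old one) can be illustrated with a short SQLite sketch. As before, the column types, the function names `record_merge` and `replaced_by`, and the sample ids are assumptions for illustration only.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE symplectic_merge (
        source      INTEGER,  -- Elements publication merged FROM
        source_item INTEGER,  -- repository item merged FROM
        target      INTEGER,  -- Elements publication merged TO
        target_item INTEGER,  -- repository item merged TO
        merge_date  TEXT      -- when the merge was recorded
    )
    """
)


def record_merge(connection, source, source_item, target, target_item):
    """Append one merge to the history table."""
    recorded_at = datetime.now(timezone.utc).isoformat()
    connection.execute(
        "INSERT INTO symplectic_merge VALUES (?, ?, ?, ?, ?)",
        (source, source_item, target, target_item, recorded_at),
    )


def replaced_by(connection, source_item):
    """Return the item that most recently replaced source_item, or None."""
    row = connection.execute(
        "SELECT target_item FROM symplectic_merge "
        "WHERE source_item = ? ORDER BY rowid DESC LIMIT 1",
        (source_item,),
    ).fetchone()
    return row[0] if row else None


record_merge(conn, source=1001, source_item=55, target=1002, target_item=56)
print(replaced_by(conn, 55))
```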

Fedora

There is no direct equivalent of this table in Fedora. Instead, the connector maintains history information (previous versions of the item) as part of the publications record.

Fedora Object ID: info:fedora/pubs:{id}
Subject | Predicate | Object | Usage
info:fedora/pubs:{id} | HAS_CURRENT | {uuid} | Is the “submission” item
info:fedora/pubs:{id} | HAS_VERSION | {uuid} | Historical “live” items?
info:fedora/pubs:{id} | HAS_VISIBLE | {uuid} | Is the “live” item
info:fedora/pubs:{id} | REPLACES | info:fedora/pubs:{id} |
info:fedora/pubs:{id} | MERGE_FROM | info:fedora/pubs:{id} |

Updates Queue and Status

In order to process updated information from Elements as part of the scheduled tasks, the Repository Tools connector needs to record which publications have updates available. It also needs to record status about the updates – the time that the updates were last obtained – in order to only request the changes since the previous run in subsequent requests.

DSpace and EPrints

Table: symplectic_updates
Column | Description
pid | The ID of the publication in Elements
update_date | The date the update was queued
url | The URL to retrieve the publication details from Elements

Table: symplectic_state
Column | Description
last_update | The date the update scheduled job was last run
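The interplay between the queue table and the state table can be sketched as follows: the stored last_update value bounds the next request to Elements, and each run both queues what it found and moves the timestamp forward. The column types, function names and timestamp format are assumptions, and the example URL simply follows the endpoint pattern shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE symplectic_updates (pid INTEGER, update_date TEXT, url TEXT);
    CREATE TABLE symplectic_state (last_update TEXT);
    """
)


def last_update(connection):
    """Return the time of the previous run, or None if there has been none."""
    row = connection.execute("SELECT last_update FROM symplectic_state").fetchone()
    return row[0] if row else None


def record_run(connection, run_time, updates):
    """Queue the publications found in this run and remember when it happened."""
    for pid, url in updates:
        connection.execute(
            "INSERT INTO symplectic_updates VALUES (?, ?, ?)", (pid, run_time, url)
        )
    # Keep a single state row holding the most recent run time.
    connection.execute("DELETE FROM symplectic_state")
    connection.execute("INSERT INTO symplectic_state VALUES (?)", (run_time,))


since_first_run = last_update(conn)  # no previous run: request everything
record_run(
    conn,
    "2020-01-01T02:00:00Z",
    [(1001, "http://myelements.edu/rt-api/publication/1001")],
)
since_second_run = last_update(conn)  # only changes after this time are needed
```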

Fedora

In Fedora, there is no object recording the publications / URLs that require updates. Instead, when a publication is found to require an update (via the list updates scheduled task), this is recorded directly on the info:fedora/pubs:{id} object.

There are two ways this information can be recorded. The first is encoded in the label field of the Fedora object. The other is a triple inserted into the Publications Record (the Fedora object that represents the link between an Elements publication and a repository item).

Fedora Object ID: info:fedora/pubs:{id}
Subject | Predicate | Object | Usage
info:fedora/pubs:{id} | REQUIRES_UPDATE | <uri> | URI of publication in Elements API

Additionally, a single “state” object is created in the repository, and holds the last date that updates were retrieved from Elements.

Fedora Object ID: info:fedora/pubs:state
Subject | Predicate | Object | Usage
info:fedora/pubs:state | LAST_UPDATE | <time> | Date the update scheduled job was last run

Data Flow


This section shows the data flows that currently happen in the Repository Tools connector, particularly with respect to DSpace and EPrints. It is not necessarily a description of ideal data flows, and different implementations could be appropriate for other repository platforms.

File Deposit


Initiated by a user uploading a file to Elements, or clicking on the “upload” button next to a file found in an external source (arXiv, Europe PubMed Central).

The Repository Tools connector is supplied with a multipart Atom form. The Atom document describes the item being uploaded: its publication ID, metadata for all data sources, user information for the author(s), and user information for the person making the submission (e.g. an impersonator). The actual file being deposited is supplied as a binary stream.

 

New Item

Existing Item Workspace

Existing Item Workflow

Existing Item Live

Delete a File

Initiated by a user clicking the “delete” button next to a file in the repository, whilst in Elements.

The Repository Tools API defines this to be an HTTP DELETE operation on a particular URL: a Repository Tools specific URL that identifies a particular item and file. This URL is generated by Repository Tools and returned to the user interface as part of the message that is sent back to describe the holdings.

Existing Item Workspace

Existing Item Workflow


Existing Item Live

Licence Grant

Initiated by a user clicking the “grant licence” button in Elements. The Repository Tools API treats this the same as a file deposit, i.e. an actual licence file is sent to the repository, as part of a multipart Atom form, to the same endpoint.

However, the deposit is marked to say that it is a licence file, and the expectation is that the repository will “finalize” the item at that point and pass it out of a hidden / disabled state, and make it available to a review workflow (if there is no review workflow for the repository, then the item would be made live).
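The expected effect of a licence grant on the item's state can be written down as a small decision function. This is a sketch of the expectation described above, not connector code: the state names "workspace", "workflow" and "live" are borrowed from the section headings in this article, and leaving items in other states untouched is an assumption.

```python
def state_after_licence_grant(current_state: str, has_review_workflow: bool) -> str:
    """Return the state an item is expected to reach once the licence is granted."""
    if current_state != "workspace":
        # Assumption: only hidden / workspace items are finalized by a grant.
        return current_state
    # With a review workflow the item awaits review; otherwise it goes live.
    return "workflow" if has_review_workflow else "live"


print(state_after_licence_grant("workspace", has_review_workflow=True))
print(state_after_licence_grant("workspace", has_review_workflow=False))
```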

Licence Revoke

Initiated by a user clicking the “revoke licence” button in Elements. The Repository Tools API treats this the same as a file deletion, i.e. the URL of the licence file (as reported by Repository Tools) is supplied with an HTTP DELETE command.

It is not marked in the API that it is a licence file, but the Repository Tools connector is expected to be able to determine that this represents a licence file and act on it accordingly.

Existing Item Workflow

Existing Item Live

Process Update

Initiated by the scheduled tasks that you run on the repository server – specifically, ListUpdates and GetRecords.

As shown below, it is important to note that the list of updated publications returned by Elements does not (cannot) take into account whether there is a repository item attached to it. So, Elements will return all of the updated publications, and it is up to the ListUpdates process to filter the publications into the ones that it is potentially interested in.

When ListUpdates finds a publication that it thinks it should be interested in, it adds the URL to a queue for later processing.

GetRecords runs through the queue generated by ListUpdates, and applies the changes that it finds. The most visible change in the repository will be the updating of metadata: applying the latest set of metadata in Elements to the repository record (new metadata is not pushed to the repository by Elements, unless it is part of a file deposit). This can be configured based on the state of the item (so that you don’t get unexpected metadata changes to live repository items).

In many ways, the most important role for GetRecords is to be a calling point for the Merged Publications processing. Records in Elements can be merged and split, even after files have been deposited (this is important to ensure that you do not have duplicate publications in Elements). Merge operations – like metadata updates – are not pushed to the repository automatically.

Regular pull updates (via GetRecords) will ensure that the repository items remain connected appropriately to an Elements resource, so that information about the full text is displayed correctly in the UI and reported correctly via the API.
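The division of labour between the two scheduled tasks can be sketched in a few lines: one task filters and queues, the other drains the queue and applies changes. This is a simplified illustration, not the connector's code. The function signatures are hypothetical, and the filter used here (keep only publications already linked to a repository item) is an assumption. A real connector may also need to keep publications involved in merges.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set, Tuple

Update = Tuple[int, str]  # (Elements publication ID, URL of its details)


def list_updates(updated: Iterable[Update], linked_pids: Set[int],
                 queue: Deque[Update]) -> None:
    """Queue only the updated publications the repository may care about."""
    for pid, url in updated:
        if pid in linked_pids:  # assumed filter: already linked to an item
            queue.append((pid, url))


def get_records(queue: Deque[Update], fetch: Callable[[str], dict],
                apply: Callable[[int, dict], None]) -> List[int]:
    """Drain the queue, fetching each publication and applying its changes."""
    processed = []
    while queue:
        pid, url = queue.popleft()
        apply(pid, fetch(url))
        processed.append(pid)
    return processed


# Elements reports every updated publication, linked or not.
reported = [(1001, "http://myelements.edu/rt-api/publication/1001"),
            (2002, "http://myelements.edu/rt-api/publication/2002")]
queue: Deque[Update] = deque()
list_updates(reported, linked_pids={1001}, queue=queue)

applied = {}
processed = get_records(
    queue,
    fetch=lambda url: {"source_url": url},  # stand-in for the HTTP call
    apply=lambda pid, record: applied.update({pid: record}),
)
```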

Process Merged Publications in Elements


When publications are merged in Elements, there will always be a record of a (deleted) publication that has been merged FROM, and the surviving publication that has been merged TO.

How the merge will be handled by the Repository Tools connector depends on which record survives in Elements, which item(s) the publications are connected to in the repository, and what state those item(s) are in.
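The combinations that need to be distinguished can be enumerated with a small classifier. The sketch below only names the case, using the case headings that follow in this article; it does not describe what the connector does in each case. The state names, the function name and the extra "No Items Attached" label are assumptions added for the example.

```python
from typing import Optional

VALID_STATES = {"live", "workflow", "workspace"}


def classify_merge(surviving_state: Optional[str], deleted_state: Optional[str]) -> str:
    """Name the merge case from the state of the item attached to each publication.

    Each argument is None when no repository item is attached.
    """
    for state in (surviving_state, deleted_state):
        if state is not None and state not in VALID_STATES:
            raise ValueError(f"unknown item state: {state!r}")

    if deleted_state is None:
        if surviving_state is None:
            return "No Items Attached"  # not a case listed in the article
        return "Only Item Attached to Surviving Publication"
    if surviving_state is None:
        return "Only Item Attached to Deleted Publication"

    states = [surviving_state, deleted_state]
    if states.count("live") == 2:
        return "Two Items, Both Live"
    if states.count("live") == 1:
        return "Two Items, One Live"
    if states.count("workflow") == 2:
        return "Two Items, Both Workflow"
    if states.count("workflow") == 1:
        return "Two Items, One Workflow (one workspace)"
    return "Two Items, Both Workspace"


print(classify_merge("live", None))
print(classify_merge("workflow", "workspace"))
```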

Only Item Attached to Surviving Publication

No action required

Only Item Attached to Deleted Publication

Items Attached to Both Publications

Two Items, Both Live

Two Items, One Live

Two Items, Both Workflow

Two Items, One Workflow (one workspace)

Two Items, Both Workspace
