Repository Tools 1: Architecture
N.B. This article relates to the Repository Tools 1 (RT1) protocol, which is no longer supported as of Elements 6.11. New repository integrations should use Repository Tools 2 (RT2).
This document describes the general architecture of the Repository Tools module. It is a companion piece to the installation guides, detailing how Elements and the Repository interact, the data model used to represent the connection between the two systems, and how changes in the Elements data are handled.
Design
The fundamental building blocks of Repository Tools are a set of RESTful web services, with endpoints provided in both Elements and the repository, to cover both synchronous and asynchronous updates to the data in both systems.
Repository Tools is designed to keep communication between Elements and the repository to a minimum, so that during normal operation (e.g. times of heavy user activity) the impact on both systems is minimized.
Synchronous Operations
To reduce inter-process communication, synchronous operations are reserved for file-based operations and for occasions where immediate feedback is needed.
File-Based Operations
An operation happens immediately when it involves the manipulation of files (rather than only items or metadata). That is, any operation that is initiated by a user and involves one of the following:
Depositing full text.
Granting the repository licence (depositing the licence file).
Deleting full text.
Revoking the repository licence (deleting the licence file).
Note: when a full text file is deposited to the repository or the licence is granted, then an Atom XML document with the publication state and metadata is included with the file that is pushed to the repository. This Atom document is not included when a file is deleted or a licence is revoked.
Current Publication Status
There are times when the user will require immediate feedback regarding the up-to-the-minute status of the repository. To provide this, Elements asks the repository to return the status of everything relating to a single Elements publication id. This happens when a user (1) opens the 'Full text' tab; or (2) accesses the 'Manage full text' page.
For bulk display (e.g. the file count in the 'Full text' tab heading, or API access), Elements maintains a cache of what it knows about any repository items linked to a publication.
When Elements retrieves information about a single publication, it will update the cache of repository data for that publication accordingly.
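The relationship between the cached view and the live lookup can be sketched as follows. This is only an illustration: the class name `RepositoryCache`, its methods, and the `fetch_status` callable are hypothetical and are not part of Elements.

```python
from typing import Callable, Dict, List


class RepositoryCache:
    """Hypothetical in-memory stand-in for Elements' cache of repository data."""

    def __init__(self, fetch_status: Callable[[int], List[str]]):
        # fetch_status stands in for the synchronous call to the repository
        # that returns the files linked to one Elements publication id.
        self._fetch_status = fetch_status
        self._files: Dict[int, List[str]] = {}

    def file_count(self, pub_id: int) -> int:
        # Bulk display reads only the cache; it never calls the repository.
        return len(self._files.get(pub_id, []))

    def open_full_text_tab(self, pub_id: int) -> List[str]:
        # A single-publication view asks the repository for its live status
        # and refreshes the cached entry as a side effect.
        files = self._fetch_status(pub_id)
        self._files[pub_id] = list(files)
        return files


# Assumed repository contents for publication 42.
live = {42: ["accepted-manuscript.pdf", "licence.txt"]}
cache = RepositoryCache(lambda pub_id: live.get(pub_id, []))

stale_count = cache.file_count(42)   # cache has not been populated yet
cache.open_full_text_tab(42)         # live lookup refreshes the cache
fresh_count = cache.file_count(42)
```

Until the live lookup (or the scheduled synchroniser) runs, bulk displays only reflect what the cache last recorded.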
Asynchronous Operations
There are many causes of data change in both Elements and a repository. For example, Elements may find a new record of a publication in the data sources; a manual data source may be edited; publications may be merged; data sources may have updated metadata. In the repository, an administrator may add or remove files directly, without using Elements.
To prevent performance issues with excessive communication between the two systems, these changes are not reflected immediately in the other system. Instead, both Elements and the repository (Repository Tools package for EPrints and DSpace) have scheduled processes which periodically query the other system and process the changes they find.
Note: to manage Elements' scheduled process, navigate to System Admin > Operations > Scheduled Jobs and find the "Synchroniser: Repository" process.
Elements' "Synchroniser: Repository" scheduled process performs the following functions:
Retrieve a summary of the repository contents.
Update the repository cache.
The repository scheduled processes perform the following functions:
Create a summary of all the contents of the repository, to be returned by the “holdings” web service.
Query the Elements Repository Tools API to find all publications that have changed (updated metadata, new data sources, merges, and deletions where a record has been merged into another).
Update the repository with new metadata and ensure that items are connected to the correct publication.
These processes should be scheduled by an administrator to run at times that will limit the impact on both systems (i.e. when they are being used least). Regular running of the scheduled jobs is important to ensure that Elements and your repository stay synchronised.
Example: suppose two publications are merged in Elements. When the repository Query and Update processes run, your repository is informed of the changes and the affected repository records are connected to the currently active publication.
Example: Elements discovers updated metadata in its data sources. When the repository Query and Update processes run, your repository will be updated with this new metadata.
Example: You are using the Elements API to get information about your repository deposits. When the repository Create process runs, it updates the information in the holdings service. Elements' "Synchroniser: Repository" process then runs to retrieve and process that information. Elements is kept up to date with any changes in the repository.
Repository-Side Web Services
All user repository interaction is handled by calls from Elements to the repository-side web services. All of the repository endpoints should sit under a common root path (e.g. https://myrepo.edu/rt). There currently is no requirement for this common root to respond to requests. We recommend that the path is reserved entirely for Repository Tools functionality, so that:
it could respond to requests in the future if needed.
current or future endpoints do not clash with any other URLs for your repository.
Note: when configuring the repository connection, this common root path is the 'Service Base URL' described in Repository Tools 1: Module Administration - Repository Connection.
The Repository Tools connectors supplied by Symplectic will by default install on a unique root path, as described above.
There are three endpoints that are required by Elements:
HTTP Method | Description |
|---|---|
GET | Returns a summary of the repository contents, in Atom XML. This summary is usually pre-generated to reduce load, and so will only be as up to date as the last time the scheduled job was run to create it. |
POST | Deposit files to the repository (full text or licence). Expects a multipart AtomPub, where the first part is the Atom document providing all details of the deposit (who is depositing, on behalf of, Elements publication serialization, etc.), and the second part is the file that is being deposited. |
HTTP Method | Description |
|---|---|
GET | Given the Elements publication id ({id}), returns in Atom XML format all known information about a linked repository item. |
HTTP Method | Description |
|---|---|
GET | Returns the file specified by {id}. Allows a file to be returned to a user via Elements, even when it is not available via the public repository interface. |
DELETE | Delete the file specified by {id}. If this is a licence file, then it indicates the item should be taken back into the “workspace”. |
Elements-Side Web Services
The repository scheduled processes need to be able to query Elements and process updates. As explained above, this ensures that the majority of updates – those triggered by data changes within Elements, such as finding updated metadata – are deferred and scheduled for a time that will cause least impact.
HTTP Method | Description |
|---|---|
GET | Returns a list of all the publications that have been updated. It is normally called with the previous update date, so that only the most recent updates are returned, and can be paginated (by supplying a “page=1” parameter). |
HTTP Method | Description |
|---|---|
GET | Returns the detailed Elements state for the given publication {id}. This includes the status (approved, deleted, etc.), any merge history (combining publications), and a serialization of all of the fields and records. |
Note: you only need to enable a repository tools endpoint in the API configuration, providing a port and path, in order to have the appropriate web services available.
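As a rough illustration of how a repository-side process might build a paginated "list updates" request, the sketch below assembles a query URL. Only the `page` parameter comes from the endpoint description above; the host, port, `/publications` path, and the `modified-since` parameter name are all hypothetical placeholders, so check your own endpoint configuration for the real values.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_updates_url(base_url: str, last_update: datetime, page: int = 1) -> str:
    # "modified-since" is a hypothetical parameter name; "page" is the
    # pagination parameter mentioned in the endpoint description.
    query = urlencode({
        "modified-since": last_update.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "page": page,
    })
    # "/publications" is likewise a placeholder path, not a documented one.
    return f"{base_url.rstrip('/')}/publications?{query}"


url = build_updates_url(
    "https://elements.example.edu:8091/rt-api",  # hypothetical host, port, path
    datetime(2020, 1, 31, 2, 0, 0, tzinfo=timezone.utc),
    page=2,
)
```

Passing the date of the previous run keeps each response limited to changes made since then, and the page number lets a large result set be fetched in several smaller requests.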
Firewall / Security Requirements
Authentication has not been implemented in either the Elements or repository side Repository Tools API web services. As such, it is important that you secure access to the endpoints via alternative means.
Elements allows you to limit the IP addresses that are allowed to make requests of the insecure endpoints, and this applies to the repository tools API endpoints. Additionally, firewalls – both on the server, and separate network devices – can be used to limit the systems that are allowed to connect to the configured port.
On the repository server, the Repository Tools connectors for DSpace and EPrints have configuration options that make the connector process requests only from the specified IPs. Additionally, standard configuration is available within the application server (Tomcat for DSpace, Apache HTTPd for EPrints) to limit access to the connector application / paths. As the connector will typically be served on the same port as your repository, it will not be possible to use operating system or network device firewalls to limit access.
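The kind of check that an IP restriction performs can be sketched with Python's standard `ipaddress` module. This is not the connectors' actual implementation; the function name and the addresses in the allow-list are illustrative assumptions.

```python
import ipaddress
from typing import Iterable


def is_allowed(remote_addr: str, allowed: Iterable[str]) -> bool:
    """Return True if remote_addr falls inside any allowed address or network."""
    address = ipaddress.ip_address(remote_addr)
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in allowed
    )


# Hypothetical allow-list: a single Elements server plus one internal subnet.
ALLOWED = ["192.0.2.10", "10.20.0.0/16"]

elements_ok = is_allowed("192.0.2.10", ALLOWED)
subnet_ok = is_allowed("10.20.3.4", ALLOWED)
stranger_ok = is_allowed("203.0.113.7", ALLOWED)
```

Because the web services themselves are unauthenticated, a check like this (or its firewall equivalent) is the only barrier between the endpoints and arbitrary clients.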
If the Elements and repository servers are not installed on the same local network (for example, because you use a hosted repository service), then it is important to make sure that the web services on each server can be accessed by the other. For the repository server, that is likely to be the standard HTTP port that is also used for accessing the repository. For Elements, you will need to know the port that you have configured for the Repository Tools API endpoint, and ensure that the repository server can reach it through the firewall.
Repository Data Model
As designed, it is the responsibility of the repository / Repository Tools connector to maintain the link between a publication in Elements and a repository item. Elements does not “know” what repository item a publication is connected to, and instead uses the id of the publication in Elements as the primary identifier for all communication with the Repository Tools connector.
So the connector needs to keep track of the Elements ID associated with the publication for which a deposit is made, and the item(s) that represent that publication in the repository. It also needs to maintain some other state information and queues in order to process updated Elements data.
In order to do this, the connector stores information either in the database for the repository, or in the case of Fedora, in the repository data model itself.
Publications Record
This is the main record that is required by the connector – the record that tracks the link between a publication ID, and the associated repository item(s).
DSpace and EPrints
Implemented as the symplectic_pids table in the database. There is a row for each Elements publication that has been deposited / matched with the repository. The main columns that it uses are:
Column | Description |
|---|---|
Table: symplectic_pids | |
pid | The ID of the publication in Elements |
item_id | The ID of the DSpace Item / EPrint that is “live” (i.e. publicly available) |
submission_id | The ID of the DSpace Item / EPrint that is currently being submitted / reviewed. |
The distinction between item_id and submission_id is important. As will be described later, once the repository item that was initially created / matched has been made part of the public archive, it may no longer be possible to make changes (e.g. adding a new file) directly on that record. In those cases, the Repository Tools connector will need to create a clone. Until the clone is made public, the link to it is tracked as submission_id.
In addition, there are columns tracking the time when the item was first deposited via Elements (first_imported), and the last time it was modified via Elements / Repository Tools (last_modified). These are largely for information and debugging purposes, and are not actively important for the connector to function.
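To make the roles of these columns concrete, here is a small sketch using an in-memory SQLite database. The column names follow the description above, but the column types, the sample values, and the `resolve_target` helper are assumptions for illustration, not the connector's actual schema or logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the description above; the column types are assumptions.
conn.execute(
    """
    CREATE TABLE symplectic_pids (
        pid            INTEGER PRIMARY KEY,
        item_id        INTEGER,
        submission_id  INTEGER,
        first_imported TEXT,
        last_modified  TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO symplectic_pids VALUES (?, ?, ?, ?, ?)",
    [
        # Publication 1001: live item 55, plus a clone (78) still under review.
        (1001, 55, 78, "2019-05-01T10:00:00Z", "2019-06-12T09:30:00Z"),
        # Publication 1002: live item 60 only.
        (1002, 60, None, "2019-07-01T10:00:00Z", "2019-07-01T10:00:00Z"),
    ],
)


def resolve_target(pid: int):
    """Hypothetical helper: report which repository record a change would touch."""
    row = conn.execute(
        "SELECT item_id, submission_id FROM symplectic_pids WHERE pid = ?", (pid,)
    ).fetchone()
    if row is None:
        return None  # publication has never been deposited / matched
    item_id, submission_id = row
    if submission_id is not None:
        return ("submission", submission_id)  # changes go to the in-progress record
    return ("live", item_id)  # only a live item exists; a clone may be needed


with_clone = resolve_target(1001)
live_only = resolve_target(1002)
unknown = resolve_target(9999)
```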
Fedora
Implemented as objects in the Fedora data model. For each publication, an object is created with the identifier info:fedora/pubs:{id}. All of the required values are stored in the RELS-EXT.
Subject | Predicate | Object | Usage |
|---|---|---|---|
Fedora Object ID: info:fedora/pubs:{id} | |||
info:fedora/pubs:{id} | TYPE | PUBS_RECORD | Identifies the object as a Publications Record |
info:fedora/pubs:{id} | LOCKED_BY | PUBLICATIONS | Signifies that it is in the workspace (i.e. licence not granted). Is it a problem that this identifies the publication record, and not the item? |
info:fedora/pubs:{id} | DATE_LOCKED | | Only used in one particular Fedora implementation |
info:fedora/pubs:{id} | HAS_CURRENT | {uuid} | Is the “submission” item |
info:fedora/pubs:{id} | HAS_VERSION | {uuid} | Historical “live” items? |
info:fedora/pubs:{id} | HAS_VISIBLE | {uuid} | Is the “live” item |
info:fedora/pubs:{id} | REPLACES | info:fedora/pubs:{id} | |
info:fedora/pubs:{id} | MERGE_FROM | info:fedora/pubs:{id} | |
Merge History
In Elements, it is important to have a clean representation of the publications, without any duplicates. With many data sources contributing bibliographic information, and the possibility for the data sources to overlap in their coverage, there may be occasions where multiple representations can’t be automatically grouped into a single publication. So, in Elements you will merge those separate representations into a single publication – and, where (at least) one of those publications is linked to a repository record, then the links between the repository item and Elements publication need to be maintained.
As Repository Tools processes these merges from Elements, it records each merge in a history table. The primary reason for this is to be able to update metadata in old repository items to indicate that they have been replaced, and by which item.
DSpace and EPrints
Column | Description |
|---|---|
Table: symplectic_merge | |
source | The ID of the publication in Elements merged FROM |
source_item | The ID of the (working/submission) item in the repository merged FROM |
target | The ID of the publication in Elements merged TO |
target_item | The ID of the (working/submission) item in the repository merged TO |
merge_date | The date that the merge was recorded |
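The sketch below shows how such a history table could answer the question "which item replaced this one?". It uses an in-memory SQLite database; the column types and the two helper functions are illustrative assumptions rather than the connector's real code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the table above; the column types are assumptions.
conn.execute(
    """
    CREATE TABLE symplectic_merge (
        source      INTEGER,
        source_item INTEGER,
        target      INTEGER,
        target_item INTEGER,
        merge_date  TEXT
    )
    """
)


def record_merge(source, source_item, target, target_item, merge_date):
    """Append one merge (FROM source TO target) to the history."""
    conn.execute(
        "INSERT INTO symplectic_merge VALUES (?, ?, ?, ?, ?)",
        (source, source_item, target, target_item, merge_date),
    )


def replaced_by(item):
    """Return the item that replaced `item`, or None if it was never merged away."""
    row = conn.execute(
        "SELECT target_item FROM symplectic_merge "
        "WHERE source_item = ? ORDER BY merge_date DESC LIMIT 1",
        (item,),
    ).fetchone()
    return row[0] if row else None


# Publication 1001 (item 55) is merged into publication 1002 (item 60).
record_merge(1001, 55, 1002, 60, "2020-02-01T02:00:00Z")

replacement = replaced_by(55)
not_replaced = replaced_by(60)
```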
Fedora
Fedora has no direct equivalent of this table. Instead, the connector maintains history information (previous versions of the item) as part of the Publications Record.
Subject | Predicate | Object | Usage |
|---|---|---|---|
Fedora Object ID: info:fedora/pubs:{id} | |||
info:fedora/pubs:{id} | HAS_CURRENT | {uuid} | Is the “submission” item |
info:fedora/pubs:{id} | HAS_VERSION | {uuid} | Historical “live” items? |
info:fedora/pubs:{id} | HAS_VISIBLE | {uuid} | Is the “live” item |
info:fedora/pubs:{id} | REPLACES | info:fedora/pubs:{id} | |
info:fedora/pubs:{id} | MERGE_FROM | info:fedora/pubs:{id} | |
Updates Queue and Status
In order to process updated information from Elements as part of the scheduled tasks, the Repository Tools connector needs to record which publications have updates available. It also needs to record status about the updates – the time that the updates were last obtained – in order to only request the changes since the previous run in subsequent requests.
DSpace and EPrints
Column | Description |
|---|---|
Table: symplectic_updates | |
pid | The ID of the publication in Elements |
update_date | The date the update was queued |
url | The URL to retrieve the publication details from Elements |
Column | Description |
|---|---|
Table: symplectic_state | |
last_update | The date the update scheduled job was last run |
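A minimal sketch of how the queue and state tables work together during a scheduled run, again using in-memory SQLite. The column types, helper names, timestamps, and the example URL are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names follow the description above; types are assumptions.
conn.executescript(
    """
    CREATE TABLE symplectic_updates (pid INTEGER, update_date TEXT, url TEXT);
    CREATE TABLE symplectic_state (last_update TEXT);
    INSERT INTO symplectic_state VALUES ('2020-01-01T00:00:00Z');
    """
)


def fetch_last_update() -> str:
    """Date of the previous run, used to ask Elements only for newer changes."""
    return conn.execute("SELECT last_update FROM symplectic_state").fetchone()[0]


def queue_update(pid: int, url: str, now: str) -> None:
    """Remember that a publication has changes waiting to be processed."""
    conn.execute(
        "INSERT INTO symplectic_updates (pid, update_date, url) VALUES (?, ?, ?)",
        (pid, now, url),
    )


def finish_run(now: str) -> None:
    """Record the time of this run so the next request starts from here."""
    conn.execute("UPDATE symplectic_state SET last_update = ?", (now,))


since = fetch_last_update()
run_time = "2020-02-01T02:00:00Z"
# Hypothetical URL for the publication details in Elements.
queue_update(1001, "https://elements.example.edu/rt-api/publication/1001", run_time)
finish_run(run_time)

queued = conn.execute("SELECT pid, update_date FROM symplectic_updates").fetchall()
```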
Fedora
In Fedora, there is no object recording the publications / URLs that require updates. Instead, when a publication is found to require an update (via the list updates scheduled task), this is recorded on the info:fedora/pubs:{id} object directly.
There are two ways this information can be recorded. The first is encoded in the label field of the Fedora object. The other is a triple inserted into the Publications Record (the Fedora object that represents the link between an Elements publication and a repository item).
Subject | Predicate | Object | Usage |
|---|---|---|---|
Fedora Object ID: info:fedora/pubs:{id} | |||
info:fedora/pubs:{id} | REQUIRES_UPDATE | <uri> | URI of publication in Elements API |
Additionally, a single “state” object is created in the repository, and holds the last date that updates were retrieved from Elements.
Subject | Predicate | Object | Usage |
|---|---|---|---|
Fedora Object ID: info:fedora/pubs:state | |||
info:fedora/pubs:state | LAST_UPDATE | <time> | Date the update scheduled job was last run |
Data Flow
This section shows the data flows that currently happen in the Repository Tools connector, particularly with respect to DSpace and EPrints. It is not necessarily a description of ideal data flows, and different implementations could be appropriate for other repository platforms.
File Deposit
Initiated by a user uploading a file to Elements, or clicking on the “upload” button next to a file found in an external source (arXiv, Europe PubMed Central).
The Repository Tools connector is supplied with a multipart Atom form. The Atom document describes the item being uploaded: its publication ID, metadata for all data sources, user information for the author(s), and user information for the person making the submission (e.g. an impersonator). The actual file being deposited is supplied as a binary stream.
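The two-part structure of a deposit can be sketched with Python's standard library. This is only an approximation: the Atom entry here is deliberately tiny, and the `multipart/related` container type, the element choices, and the function name are assumptions, not the exact wire format used by Elements.

```python
import xml.etree.ElementTree as ET
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_deposit(pub_id: int, title: str, filename: str, payload: bytes) -> MIMEMultipart:
    # Part 1: a minimal Atom entry. The real document carries far more
    # (metadata for every data source, author and depositor details).
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = str(pub_id)
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    atom_part = MIMEApplication(ET.tostring(entry, encoding="utf-8"), "atom+xml")

    # Part 2: the file itself, carried as binary data.
    file_part = MIMEApplication(payload, "octet-stream")
    file_part.add_header("Content-Disposition", "attachment", filename=filename)

    # "related" is an assumed multipart subtype for this sketch.
    message = MIMEMultipart("related")
    message.attach(atom_part)   # first part: the Atom document
    message.attach(file_part)   # second part: the deposited file
    return message


deposit = build_deposit(1001, "An example article", "manuscript.pdf", b"%PDF-1.4 example")
parts = deposit.get_payload()
```

Keeping the descriptive document first means the receiving side knows which publication the file belongs to before it reads the binary data.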
New Item
Existing Item Workspace
Existing Item Workflow
Existing Item Live
Delete a File
Initiated by a user clicking the “delete” button next to a file in the repository, whilst in Elements.
The Repository Tools API defines this to be an HTTP DELETE operation on a particular URL: a Repository Tools specific URL that identifies a particular item and file. This URL is generated by Repository Tools and returned to the user interface as part of the message that describes the holdings.
Existing Item Workspace
Existing Item Workflow
Existing Item Live
Licence Grant
Initiated by a user clicking the “grant licence” button in Elements. The Repository Tools API treats this the same as a file deposit, i.e. an actual licence file is sent to the repository, as part of a multipart Atom form, to the same endpoint.
However, the deposit is marked to say that it is a licence file, and the expectation is that the repository will “finalize” the item at that point and pass it out of a hidden / disabled state, and make it available to a review workflow (if there is no review workflow for the repository, then the item would be made live).
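The expected effect of a licence deposit on an item's state can be summarised in a few lines. The state names mirror the workspace / workflow / live distinction used in this document; the enum, the function, and its parameters are hypothetical.

```python
from enum import Enum


class ItemState(Enum):
    WORKSPACE = "workspace"  # hidden / disabled, licence not yet granted
    WORKFLOW = "workflow"    # awaiting review
    LIVE = "live"            # publicly available


def on_deposit(state: ItemState, is_licence: bool, has_review_workflow: bool) -> ItemState:
    """Return the item's state after a deposit has been received."""
    # Ordinary file deposits, or licences for items already past the
    # workspace, do not change the state in this simplified sketch.
    if not is_licence or state is not ItemState.WORKSPACE:
        return state
    # A licence "finalizes" the item: into review if there is a review
    # workflow, otherwise straight to live.
    return ItemState.WORKFLOW if has_review_workflow else ItemState.LIVE


after_file = on_deposit(ItemState.WORKSPACE, is_licence=False, has_review_workflow=True)
after_licence = on_deposit(ItemState.WORKSPACE, is_licence=True, has_review_workflow=True)
no_workflow = on_deposit(ItemState.WORKSPACE, is_licence=True, has_review_workflow=False)
```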
Licence Revoke
Initiated by a user clicking the “revoke licence” button in Elements. The Repository Tools API treats this the same as a file deletion, i.e. the URL of the licence file (as reported by Repository Tools) is supplied with an HTTP DELETE command.
It is not marked in the API that it is a licence file, but the Repository Tools connector is expected to be able to determine that this represents a licence file and act on it accordingly.
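Since the DELETE request does not say what kind of file it targets, the connector has to consult its own records. A sketch of that decision follows; the `StoredFile` type, the `handle_delete` function, the action names, and the file ids are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StoredFile:
    file_id: str
    is_licence: bool  # known to the connector; not stated in the DELETE request


def handle_delete(file_id: str, files: Dict[str, StoredFile]) -> str:
    """Decide what a DELETE on the given file id should do."""
    stored = files.get(file_id)
    if stored is None:
        return "not_found"
    if stored.is_licence:
        # Deleting the licence revokes it: the item returns to the workspace.
        return "return_to_workspace"
    return "delete_file"


files = {
    "f1": StoredFile("f1", is_licence=False),
    "lic1": StoredFile("lic1", is_licence=True),
}

full_text_action = handle_delete("f1", files)
licence_action = handle_delete("lic1", files)
missing_action = handle_delete("nope", files)
```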
Existing Item Workflow
Existing Item Live
Process Update
Initiated by the scheduled tasks that you run on the repository server – specifically, ListUpdates and GetRecords.
As shown below, it is important to note that the list of updated publications returned by Elements does not (cannot) take into account whether there is a repository item attached to it. So, Elements will return all of the updated publications, and it is up to the ListUpdates process to filter the publications into the ones that it is potentially interested in.
When ListUpdates finds a publication that it thinks it should be interested in, it adds the URL to a queue for later processing.
GetRecords runs through the queue generated by ListUpdates and applies the changes that it finds. The most visible change in the repository will be the updating of metadata: applying the latest set of metadata in Elements to the repository record (new metadata is not pushed to the repository by Elements, unless it is part of a file deposit). This can be configured based on the state of the item (so that you don’t get unexpected metadata changes to live repository items).
In many ways, the most important role for GetRecords is to be a calling point for the Merged Publications processing. Records in Elements can be merged and split, even after files have been deposited (this is important to ensure that you do not have duplicate publications in Elements). Merge operations – like metadata updates – are not pushed to the repository automatically.
Regular pull updates (via GetRecords) will ensure that the repository items remain appropriately connected to an Elements resource, so that information about the full text is displayed correctly in the UI and reported correctly via the API.
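The division of labour between the two scheduled tasks can be sketched as a filter-then-process pipeline. The function names echo the task names, but the bodies are a simplified assumption: the filter here only checks whether the repository already knows the publication id, which may be cruder than what a real connector does, and the merge handling is left out entirely.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Set


def list_updates(updated_pids: Iterable[int], known_pids: Set[int], queue: Deque[int]) -> int:
    """Queue only the updated publications that the repository holds an item for."""
    queued = 0
    for pid in updated_pids:
        if pid in known_pids and pid not in queue:
            queue.append(pid)
            queued += 1
    return queued


def get_records(
    queue: Deque[int],
    fetch: Callable[[int], dict],
    apply: Callable[[int, dict], None],
) -> int:
    """Drain the queue, fetching each publication and applying its changes."""
    processed = 0
    while queue:
        pid = queue.popleft()
        apply(pid, fetch(pid))
        processed += 1
    return processed


# Elements reports two updated publications; only 1001 has a repository item.
elements = {1001: {"title": "Corrected title"}, 2002: {"title": "No deposit yet"}}
repository: Dict[int, dict] = {1001: {"title": "Old title"}}

queue: Deque[int] = deque()
queued = list_updates(elements.keys(), set(repository), queue)
processed = get_records(
    queue,
    elements.__getitem__,
    lambda pid, record: repository[pid].update(record),
)
```

Separating the cheap filtering step from the heavier per-publication fetch lets each task be scheduled independently.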
Process Merged Publications in Elements
When publications are merged in Elements, there will always be a record of a (deleted) publication that has been merged FROM, and the surviving publication that has been merged TO.
How the merge is handled by the Repository Tools connector depends on which record survives in Elements, which item(s) the publications are connected to in the repository, and what state those item(s) are in.
Only Item Attached to Surviving Publication
No action required






















