Overview

This document is intended to be a thorough description of the data repository design, concentrating on its external interfaces. Its intended audience is implementors of other components that depend on the repository, as well as the repository coders. It should be kept up to date with any API changes so it can serve as a reference manual.

The repository is essentially an RDF triple-store with some extra features dictated by the needs of data entry tools, search, dissemination, and the data collection and curation process. Here are some of them:

Fine-grained access control for reading as well as modification operations.
Resolve URIs as linked data.
Maintain provenance metadata.
Workflow and lifecycle management with automated enforcement.
Provide isolated administrative domains so separate groups can share a repository.
Bind RDF graph to the contents of a serialized file allowing clean updates.

See the Repository Requirements page to review the original requirements list and correlate it with what this design document provides.

Concepts and Internal Structure

This section describes the internal conceptual data model.

Resource Instance

The primary purpose of the repository is to store, edit, and retrieve resource instances for its clients. The term comes from the eagle-i data model: a resource is an indivisible unit of something described in the eagle-i database. A resource instance in the repository is the corresponding graph of RDF statements that represents this resource as abstract data. It is rooted at one subject URI.

A resource instance is defined as the subject URI, and the collection of statements in which it is the subject. Furthermore, there must be one statement whose predicate is rdf:type and whose object is a URI (preferably of type owl:Class, but this is not checked).

The significance of resource instances in the repository architecture is:

The /update service creates, modifies, or deletes exactly one single instance transactionally.
The repository maintains metadata about each instance, as described in a separate section below.
Resource instances typically live in workspaces, about which there are more details below.
The dissemination service retrieves the contents of a single instance in various formats.
The /harvest service reports on changes to instances.

Embedded Instances (within Resource Instance)

In Cycle 3, the definition of a resource instance was extended to encompass embedded instances. This only changes the rules governing what statements belong in the instance's graph, although it has profound effects on the behavior of the repository. Embedded instances are essentially regular resource instances which are considered part of a "parent" resource instance. The precise definition of an embedded instance is as follows:

It has a unique subject URI, and exactly one asserted rdf:type statement.
Its rdf:type (possibly an inferred type) is a member of the designated class group indicating Embedded Instances. In the eagle-i data model the URI of this class group is:
http://eagle-i.org/ont/app/1.0/ClassGroup_embedded_class
.
It has exactly one parent: There is exactly one instance which is the subject of statements for which the EI is the object. This is an informal restriction (really an assumption) imposed on all instances of embedded types, though it is enforced by logic in the repository.
EIs may not be Orphaned or Shared: Any transaction which would result in an EI left without a parent, i.e. so it is not the object of any conforming statements or has multiple parents, is to be forbidden by logic in the repository. The only way to remove an EI from its parent is to delete all of its statements. You may copy an EI to another parent by creating a new EI under that parent, with a new unique URI for its subject.
No Broken Links to EIs: If an EI is removed, an instance may not retain any statements of which it is hte object. These must also be removed.
EIs Do Not Have Metadata. The repository does not create metadata (e.g. Dublin Core) about EIs. Any transactions on the EI are considered transactions on its parent and recorded as such, e.g. in the last-modified date in the metadata.
Transactional Integrity All transactional operations, such as /update and workflow, operate on the EI statements together with the parent instance's statements. For example, a workflow transition to published moves all the EIs to the Published graph along with their parent's statements.
*EIs reside in the Same Graph as the parent. Though it seems obvious, it's worth stating formally that the statements describing an EI must reside in the same named graph (workspace) as the statements of its parent.

Here is how EIs behave in repository operations:

Creation: An EI is created by adding a new URI with an appropriate rdf:type statement to a modification (including creation) of its parent. The type must belong to the embedded class group.

Modification, Deletion: Any modification of an EI must be done as a modification of its parent. The EI's properties, including type, may be changed; it may be deleted. These changes are recorded as a modification of the parent. The changes to the parent and its EIs may be driven by one HTTP request to the /update service, and will be performed in a single transaction.

Dissemination: A dissemination request on the parent instance will include all of the statements about its EIs. The EIs will be filtered of hidden properties (e.g. admin data and contact hiding) by the same rules as the parent, and returned in the same serialized RDF graph.
Dissemination requests on EIs are not supported. It is not recommended, the results are undefined.

Metadata Harvest:

See the description of the /harvest service for full details. Essentially, since EIs do not have an independent presence in the "instance" model of the repository, they are not reported on individually when the harvest service reports changes. A change to an EI, even deletion of the EI, is reported as a change to its parent. Likewise, creation of an EI is also reported as a change to its parent.

Resource Property Hiding

The repository is required to hide statements with certain predicates when exporting resource instances in these contexts:

The dissemination service, for example, the /i service and /repository/resource.
The /harvest service for populating search indexes

The set of predicates to be hidden is defined by the data model ontology, and identified by the data model configuration properties:

datamodel.hideProperty.predicate
datamodel.hideProperty.object

The hidden predicates are themselves subjects of statements whose predicate is the hiding predicate, and whose object is the hiding object, e.g.

...for example this configuration would hide dm:someStupidProperty:
datamodel.hideProperty.predicate = dm:hasSpecialAttribute
datamodel.hideProperty.object = dm:cantSeeMe
....and then later on, in the ontology (shown in N3):
dm:someStupidProperty dm:hasSpecialAttribute dm:cantSeeMe.

The mechanism of hidden-property hiding is implemented through access controls. See that section for more details.

This has to be implemented in the repository in order to enforce a consistent security model, which would not be possible if content hiding were left up to each client application.

Properties are "hidden" for various reasons, such as:

Properties that are effectively provenance metadata added by users in the resource acquisition and curation process, but which are not suitable for public viewing. They may contain confidential information or comments that the curators do not want publicized.
Properties whose values inherently contain confidential information, or information that administrators have been directed to hide, e.g. email addresses, physical locations, phone numbers.

Contacts and Email

The "contact" issue is closely related to property hiding. The essential problem is that for every resource instance, it is desired to have a means for the agent viewing that resource (e.g. a user in a Web browser viewing a Semantic Web-style dissemination page of the resource instance) to contact the agent who "owns" (or is otherwise responsible for) the resource. Email is one means of implementing this contact, but certainly not the only one. The contact could be in the form of a telephone number, street address, or even a redirect to another Web site which might include indirect contact info of its own. The purpose is to put a potential consumer of the resource in touch with its owner.

The repository only gets involved to mediate this contact process because it is also responsible for hiding all contact information from the agents who would use it. It must therefore implement some means of accepting a contact request or message from the outside agent, and forward it to the determined owner of the resource.

Contact properties are identified in the same way as hidden properties, only the relevant data model configuration keys are:

datamodel.contactProperty.predicate
datamodel.contactProperty.object

The mechanism of contact-property hiding is implemented through access controls - see that section for more details.

Internal Ontology and Metadata

There is a separate ontology document describing the repository's internal data model and administrative metadata.It i s an attachment to this page. Note that some statements described by that ontology appear as publically-readable metadata statements, while others are private and never exposed outside of the repository codebase.

The "ontology" graph is considered read-only in normal operation. All internal metadata (i.e. administrative metadata) is stored in a separate, distinct, named graph which should only be available to the repository's internal operations.

Named Graphs

The repository design takes full advantage of the named graph abstraction provided by most modern RDF database engines (Sesame, Virtuoso, Anzo, AllegroGraph). Every statement in the RDF database belongs to exactly one named graph. Since this is typically implemented by adding a fourth column to each triple for the named graph's URI, these databases are often called quad-stores instead of triple-stores. The data repository design takes advantage of named graphs:

To restrict query results (for logic and/or access control) efficiently by defining a dataset made up of named graphs.
Impose different access controls on subsets of the database defined by named graph.
Group the statements loaded from a file or network ingest and keep administrative metadata about their source, so they can be checked and updated as new versions of that file come out.
Apply different inference policies to graphs, according to their purpose (e.g. OWL-driven inference rules on ontologies).
Implement the "workspace" abstraction efficiently.

Internally, we collect some metadata about each named graph: access control rules, of course, and a type that documents the purpose of the named graph.

Named Graph Types:

Ontology---contains ontology descriptions, so reasoning may be required. The name of the graph is the namespace URI. Also implies that the graph is part of the "TBox" for inferencing.
Metadata---contains the repository's own metadata; public and private metadata are in separate graphs for easeir containment of query results.
*Workspace---*hold resource instances still in the process of editing and curation. Not visible to the public, and access to some workspaces is restricted to a subset of users.
*Published---*Published resource description data.
Internal - Contains data that is only to be used internally by repository code, never exposed.

The repository is created with a few fixed named graphics for specific purposes (e.g. internal metadata statements). Other named graphs are created as needed. Even the repository's own ontology is not a fixed graph since it can be managed like any other ontology loaded from a serialization.

Relationships - it would be helpful to record metadata about related named graphs, though the most compelling case for this is ontologies that use owl:includes to embed other ontology graphs. Since Sesame does not interpret OWL by itself, and we have no plans to add this sort of functionality for the initial repository implementation, this will be considered later.

Views

The repository provides views to give clients an easier way to query over a useful set of named graphs. A view is just a way of describing a dataset (a collection of named graphs). The repository server has a built-in set of views, each named by a simple keyword. You can use a view with the SPARQL Protocol and with a resource dissemination request. It is a equivalent to building up a dataset out of named graphs but it is a lot less trouble, and guaranteed to be stable whereas graph names might change. The views are:

published ---all resource instances and user records visible to the public, and all relevant ontologies and metadata.
published-resources ---just resource instances visible to the public, and all relevant ontologies and metadata.
metadata ---all named graphs of type Metadata visible to the authenticated user.
ontology ---all named graphs of type Ontology visible to the authenticated user
metadata+ontology ---all named graphs of types Metadata and Ontology visible to the authenticated user
null ---all statements in the internal RDF database regardless of named graph.
- Administrators only.
- NOTE: This is the ONLY way to see any statements that do not belong to any named graph, i.e. the "null context" in Sesame. If we are lucky this will be a small or empty set.
user ---those graphs which the current user has permission to read.
user-resources ---the graphs containing or related to eagle-i resources which the current user has permission to read. Note that the graph containing instances of repository users (i.e. of type foaf:Person) is, ironically, NOT part of this dataset, since they are not eagle-i resources.
public ---all graphs which are visible to the public, i.e. to the anonymous user, plus inferred statements. Note that this is not the same thing as 'published'.
all ---all named graphs including the ones internal to the repository; administrators only.

Important Note: You may have noticed that according to the definition, the user view is the same as the all view for an administrator user, so why bother creating an all view? It is intended to be specified when you have a query that really must cover all of the named graphs to work properly; if a non-administrator attempts it, it will fail with a permission error, instead of misleadingly returning a subset of the graphs.

Workspaces

A workspace is just another way to describe a dataset, by starting with a named graph. It is effectively a special kind of view. The name of a workspace is the URI of its base named graph, which must be of type workspace or published. When you specify that as the workspace, the repository server automatically adds these other graphs to the dataset:

All graphs of type Ontology that are readable by the current user.
All graphs of type Metadata that are readable by the current user.
The graph of inferred statements.
The graph of repository user instances (referenced by metadata statements)

You can specify a workspace instead of a view in SPARQL Protocol requests, and in resource dissemination requests.

Inferencing

As of the Version 1, MS5 release, the repository supports inferencing in some very specific cases. Since the repository's RDF data is very frequently modified, it does only the minimal inferencing needed by its users in order to keep the performance bearable.

Many inferencing schemes require inferencing to be re-done over the entire RDF database after every change because tracing the effects of a change through various rules would be at least as much computational effort as simply running the inferencing over. We have chosen a select subset of RDFS and OWL inference rules that makes incremental changes easy and efficient to re-compute.

See the RDF Semantics page for an overview of the greater body of inference rules (of which we only implement a small subset). The repository implements two different kinds of inferencing:

"TBox" inferencing:
- TBox is the terminology source, i.e. the ontology graphs.
- It consists of named graphs whose type is ontology.
- Upon any change to a TBox graph, inferencing is redone on the entire graph, and on all the ABox graphs.
- All inferred TBox statements are added as inferred statements directly to their TBox graph.
- Inference is done independently on every TBox graph -- they are expected to be completely disjoint. Thus, there should only be one or two TBox graphs.
- Following entailment rule rdfs11, all inferred rdfs:subClassOf relationships are created as direct statements.
- Following entailment rule rdfs5, all inferred subPropertyOf relationships are created as direct statements.
- Following entailment rule rdfs9, all inferred rdf:type properties are added.
"ABox" inferencing:
- ABox is the body of assertions, i.e. the statements about resource instances.
- All non-TBox named graphs are part of the ABox.
- All statements created from inference on ABox statements are added to a special named graph, http://eagle-i.org/ont/repo/1.0/NG_Inferred.;
  - This makes it easy to include or exclude the inferred statements in the dataset of SPARQL query
  - You can even detect whether or not a statement is inferred by adding a GRAPH keyword to the query and testing its home graph.
- Any change to a TBox graph only causes re-inferencing of the subject of each statement where a named instance was changed. This is possible because the inferred statements only depend on the asserted rdf:type properties of a a subject and the TBox graphs.
- Following entailment rule rdfs9, all inferred rdf:type properties are added to the graph for inferred statements.

The TBox graphs are configurable. You can set the configuration property eaglei.repository.tbox.graphs to a comma-separated list of graph URIs. By default, the TBox consists of:

The repository's internal ontology, http://eagle-i.org/ont/repo/1.0/
The eagle-i data model ontology, http://purl.obolibrary.org/obo/ero.owl

This inferencing scheme ensures very fast performance by assuming the TBox graphs never change under normal operations, which ought to be true. The data model ontology graph is only modified when a new version of the ontology is released. Likewise, the repository's internal ontology graph remains unchanged once the repository is installed. When the TBox graphs are changed, be aware that you will probably see a delay of many seconds or perhaps minutes, as all the TBox and ABox inferencing is re-done.

Inferred statements are not normally written when an entire graph is dumped. See the /graph service for details.

Users and Authentication

Authentication is managed entirely by the Java Servlet container. We rely on the container to supply an authenticated user name (a short text string) and whether that user has the "superuser" role. The container's role is only used for bootstrapping; normally roles are recorded in the RDF database and they take precedence over the container's role map.

Users Are Described in Two Databases

Each login user is (ideally) recorded in both the RDBMS used by the servlet container (or possibly some other external DB) and the RDF database. This is necessary because the servlet container, which is doing the authentication, only has access to the RDBMS through a plugin, but the repository authorization mechanism expects an RDF expression of the user as well. All of the services that modify users keep the RDBMS and RDF databases synchronized, and can cope with users found in one and not the other.

The RDBMS description of a user contains:

Username---*Present in RDF as well, this is the common key. Since it is the permanent description of a user this is *immutable.
Password---Only in RDBMS.
Membership in Superuser role. No other roles.

The RDF description of a user contains:

Username of corresponding RDBMS user.
Roles including Superuser if present in RDBMS
Various descriptive and provenance metadata.

When a user is present in RDF but not in the RDBMS, they are considered disabled and cannot login. They can be reinstated through the Admin UI.

When a user is present in the RDBMS but not in the RDF, they are considered undocumented. Upon login, an undocumented user is given the URI corresponding to his/her highest known role:
:Role_Superuser if the RDBMS indicates that role, or :Role_Anonymous otherwise. (Arguably the default role could also be :Role_Authenticated, but without RDF data for the user they are not fully authenticated, and this is incentive to fix the discrepancy.)

To fix an undocumented user:

An Administrator (superuser) can become documented by logging in, and either running the /whoami service with create=true or using the Admin UI to edit their own user info and saving it. An Administrator can fix an ordinary undocumented user by using the Admin UI to save their descriptive metadata; even if it is all blank, a user record will be created. Importing users also straightens out the mapping automatically.

About Roles

Roles are a way to characterize a group of users, for example, to grant them some access rights in the access-control system. Functionally, the role is part of a user's authentication, i.e., "who" they are.
A role is defined by a URI, the subject of a :Role instance. It should also have a locally unique, short text-string name (the rdfs:label of its :Role instance).

Each Role is independent of other Roles. Roles cannot be "nested". This is a necessary limitation that simplified the implementation considerably.

The Superuser role is built into the system because its privileges are hardcoded.

There are a couple of special Roles whose membership is implicit, that is, it never needs to be granted explicitly:

Anonymous -- unauthenticated users are implicitly assigned this role so they can be given explicit access. This role is never visibly asserted by a user, it is only for describing access controls. E.g. "The Published graph is readable by the Anonymous role".
Authenticated -- any user who logs in is authenticated and implicitly belongs to this role; the opposite of anonymous. This role is never explicitly asserted by a user, it is only for describing access controls.

Username and Password

A repository user is identified uniquely (within the scope of ONE repository instance) by a short symbolic username. This is a character string composed of characters from a certain restricted subset of the ASCII character set, in order to avoid problems of character translation and metacharacter interpretation in both the protocol layer (HTTP) and OS tools such as command shells. The password, which is paired with a username to serve as login credentials, is likewise restricted to the same range of characters as the username.

Character restrictions: The username and password MUST NOT include the character ':' (colon). They MAY only include:

Letter or digit characters in Unicode C0 and C1 (basic latin, latin-1)
The punctuation characters: ~, @, #, $, %, _, -, . (dot)

Note that although the HTTP protocol allows any graphic characters in the ISO-8859-1 codeset (modulo ':'), and linear whitespace, and even chars WITH special MIME RFC-2047 encoding rules, these are often implemented wrongly by HTTP clients and also invite encoding and metacahracter problems with OS and scriptiing tools. To avoid these troubles we simply restrict the available characters.

Authentication policy

All of the servlet URLs in this interface except the public dissemination request (/i) require authentication. The dissemination request makes use of an authenticated user and roles when they are available, to access data that would be invisible to an anonymous request, but it is never required.

Access Control (Authorization)

This is just an outline of the access control system. It is implemented as statements stored in the internal metadata graph. The access controls applying to an instance or other object are not to be directly visible in the repository API, execpt through administrative UI.

These types of access can be granted:

Read -- in context of named graph, allows the graph to be included in dataset of a query; for workflow transitions and hidden/contact property groups, this is the only relevant access grant.
Add -- add triples to a graph or instance.
Remove -- remove triples from a graph or instance.
Admin -- change the access controls on a resource. (Not really used in practice; only the Superuser role can change access.)

On the following types of resources:

Named Graphs, including workspaces.
- Read lets people download and run queries against the graph.
- Add, Remove let you modify it with /graph - this is usually reserved for admins.
Resource Instances (ignores explicit Read access; that comes from it's home named graph.)
- Add, Remove let you modify it with /update, this is usually granted temporarily to a specific user as part of the workflow process.
- ignores Read grants, read access reverts to its home graph.
Property Groups - sets of properties on a resource instance identified by statements in the data model ontology, see the data model configuration manual, particularly datamodel.hideProperty.predicate, datamodel.hideProperty.object.
- Read lets the Dissemination and Harvest service report on these properties.
- All other access types are ignored.
Workflow Transitions
- Read gives access to take (push) the transition.
- All other access types are ignored.

Access control statements grant access to either a specific user, or to a Role, which applies to all users holding that role.

Any user asserting the Superuser role is always granted access, bypassing all controls. This lets us bootstrap the system when there is no RDF database yet to describe grants. Repository administrators should always have the Administrator role, since most of the Admin UI and API requires it.

Access control is implemented by statements of the form:

Subject: resource, Predicate: access-type, Object: accessor

The resource is the URI of the instance, named graph, or workflow transition of interest. The access-type names one of the four types of access described above: read, add, remove, admin. Finally, the accessor is the URI of the Principal to be granted the access, either a Role or an Agent (user).

We anticipate having a relatively small number of these access grants. Although named graphs and workflow transitions need elaborate access descriptions, there are only a few of those -- on the order of dozens. Resource instances are of course more numerous but most of them have no access grants, deriving their read/query access from the named graph they reside in. The workflow claim service adds temporary grants to give the claim owner read/write access to be able to edit the instance while it is claimed.

Provenance Metadata

The repository automatically records provenance metadata about objects when they are created and modified by users' actions. Provenance means information about the history and origin of the data, in this case the authenticated identity responsible and time of the latest change. The following properties are recorded for these types of objects, and can be obtained by querying with a view or dataset that includes the named graph containing public administrative metadata.

Note that the there is at most one value of any of these properties for each subject. That means the "modified" properties are updated whenever a subject is modified and the record of the previous modification is lost. This is a simplification that may be remedied at some point in the future if we add versioning of data to the repository.

Named Graphs:
- dcterms:modified - literal date of last modification, encoded as xsd:dateTime
- dcterms:contributor - the URI of the Agent (authenticated user) who last modified it
- dcterms:source - description of file or URI last loaded into this NG, if that is how it was created. This record is used to compare it against the source later to decide whether an updaet is necessary. It is a node (possibly blank node) with the following properties:
  - dcterms:identifier - the resolvable URI of the resource loaded, most likely a URL in either the file or http scheme.
  - dcterms:modified - last-modification date of the resource, for later comparison when deciding whether to decache the repository copy of an external file, a literal xsd:dateTime.
Resource Instance:
- dcterms:created - literal date when resource was created, encoded as [xsd:dateTime
  
  http://www.google.com/url?q=http%3A%2F%2Fwww.w3.org%2FTR%2Fxmlschema-2%2F%23dateTime&sa=D&sntz=1&usg=AFrqEzfRDUuCfWjkGIj5ZuqOGGVPGBo7ZQ]
  - Set automatically by the Repository.
- dcterms:creator - the URI of the Agent (should be an auth'd user) who created teh instance.
  - The true meaning is the user who authored the data in the inital version of tihs instance.
  - Usually, this is the same as the user directly responsible for creating the instance.
  - However, when a different user uploads e.g. a spreadsheet created by other RNAVs, the Repository user is a mediator. The actual value of dcterms:creator comes from the uploaded data.
- (optional) dcterms:mediator - ONLY when dcterms:creator does not refer to the authenticated user who created the data, this is the URI of the Agent (authenticated user) who created this instance in the Repository.
  - Set automatically by the Repository.
- dcterms:modified - literal date when resource was last modified, encoded as [xsd:dateTime
  
  http://www.google.com/url?q=http%3A%2F%2Fwww.w3.org%2FTR%2Fxmlschema-2%2F%23dateTime&sa=D&sntz=1&usg=AFrqEzfRDUuCfWjkGIj5ZuqOGGVPGBo7ZQ]
  - Set automatically by the Repository.
- dcterms:contributor - the URI of the Agent (authenticated user) who last modified this instance.
  - Set automatically by the Repository.

Sesame Triplestore Extensions

Some repository features are implemented as extensions to the Sesame RDF database (aka triplestore). This means they are available both internally to the repository implementation and externally whenever an API to Sesame (e.g. its SPARQL query engine) is exposed.

1. Output formats

Additional output formats for both RDF serialization and SPARQL tuple query results allow output in:

*HTML, *media type text/html
*plain text, *media type text/plain (for SPARQL)
N-Quads, RDF only, for comparing serialized RDF including context (graph).

2. SPARQL Query Function

The repo adds a custom function to Sesame's query engine: repo:upperCaseStr. It returns the toUpperCase() version of of hte string value of an RDF value. Use it to sort values ignoring whether (a) the case of characters differs, (b) they are datatyped-literals or untyped literals (or other terms).

To invoke it you must have the repository's URI namespace defined as a
prefix, e.g.
PREFIXrepo:<http://eagle-i.org/ont/repo/1.0/> ..query text...
ORDER BYrepo:upperCaseStr(?label)

Workflow

The repository includes a "workflow" control system that directs the complete lifecycle of each resource instance, and mediates access by users at each lifecycle stage. The word "workflow" is often used to describe process-management and administration systems, but in this case it is really just a minimal implementation of states and extended access control.

It was implemented in the repository because it depends on persistent data and access control which are already available in the repository. It is also closely integrated with the access control system, which is easier to accomplish securely from within the repository codebase.

Workflow is manifested in RDF statements (of course) in the internal metadata graph. Although the Web API exposes some URIs and names of workflow objects, the ontology and access control details are intentionally hidden. There is no need for applications using workflow to see the model, all their access is through the API.

The model is a state map, with nodes and transitions between them. elements of workflow are:

Workflow State---A node on the state map. Every resource instance has exactly one current state. A newly created resource is initialized to a fixed "New" state.
*Transition---*Description of an arc on the map, or a transition from an initial state to a final state.
*Claim---*Assertion on a resource instance that a specific user (the claimant) has taken possession of it, in order to prepare it for the next workflow transition.
- Claiming typically causes some side-effects such as changing access controls.
- A claim may be resolved by taking a transition to another state, or by releasing it.
Pool---The set of resource instances available for claiming to a specific user. A pool is always computed by a query, it is not materialized.