This document is intended to be a thorough description of the data repository design, concentrating on its external interfaces. Its intended audience is implementors of other components that depend on the repository, as well as the repository coders. It should be kept up to date with any API changes so it can serve as a reference manual.
The repository is essentially an RDF triple-store with some extra features dictated by the needs of data entry tools, search, dissemination, and the data collection and curation process. Here are some of them:
See the Repository Requirements page to review the original requirements list and correlate it with what this design document provides.
This section describes the internal conceptual data model.
The primary purpose of the repository is to store, edit, and retrieve resource instances for its clients. The term comes from the eagle-i data model: a resource is an indivisible unit of something described in the eagle-i database. A resource instance in the repository is the corresponding graph of RDF statements that represents this resource as abstract data. It is rooted at one subject URI.
A resource instance is defined as the subject URI, and the collection of statements in which it is the subject. Furthermore, there must be one statement whose predicate is rdf:type and whose object is a URI (preferably of type owl:Class, but this is not checked).
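The definition above can be sketched in a few lines. This is only an illustration under stated assumptions: statements are modelled as plain `(subject, predicate, object)` tuples, and the `URI` marker class and the `resource_instance` function are hypothetical names, not part of the repository API.

```python
class URI(str):
    """Marker type (hypothetical): a term that is a URI rather than a literal."""


RDF_TYPE = URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")


def resource_instance(subject, statements):
    """Return the statements whose subject is `subject`.

    Raises ValueError if none of them is an rdf:type statement with a
    URI object, since such a graph is not a resource instance.
    """
    graph = [st for st in statements if st[0] == subject]
    has_type = any(p == RDF_TYPE and isinstance(o, URI) for _, p, o in graph)
    if not has_type:
        raise ValueError(f"{subject} has no rdf:type statement with a URI object")
    return graph
```

As in the definition, the sketch does not check that the type URI is itself an owl:Class.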
The significance of resource instances in the repository architecture is:
In Cycle 3, the definition of a resource instance was extended to encompass embedded instances. This only changes the rules governing what statements belong in the instance's graph, although it has profound effects on the behavior of the repository. Embedded instances are essentially regular resource instances which are considered part of a "parent" resource instance. The precise definition of an embedded instance is as follows:
http://eagle-i.org/ont/app/1.0/ClassGroup_embedded_class
Here is how EIs behave in repository operations:
Creation: An EI is created by adding a new URI with an appropriate rdf:type statement to a modification (including creation) of its parent. The type must belong to the embedded class group.
Modification, Deletion: Any modification of an EI must be done as a modification of its parent. The EI's properties, including type, may be changed; it may be deleted. These changes are recorded as a modification of the parent. The changes to the parent and its EIs may be driven by one HTTP request to the /update service, and will be performed in a single transaction.
Dissemination: A dissemination request on the parent instance will include all of the statements about its EIs. The EIs will be filtered of hidden properties (e.g. admin data and contact hiding) by the same rules as the parent, and returned in the same serialized RDF graph.
Dissemination requests made directly on EIs are not supported. They are not recommended, and the results are undefined.
Metadata Harvest:
See the description of the /harvest service for full details. Essentially, since EIs do not have an independent presence in the "instance" model of the repository, they are not reported on individually when the harvest service reports changes. A change to an EI, even deletion of the EI, is reported as a change to its parent. Likewise, creation of an EI is also reported as a change to its parent.
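The reporting rule can be pictured as collapsing every changed EI onto its parent. The following is a minimal sketch, assuming the caller already has a mapping from each EI URI to its parent URI. The function name and both parameters are hypothetical and do not reflect the actual /harvest implementation.

```python
def harvest_report(changed_uris, parent_of):
    """Collapse changes on embedded instances into changes on their parents.

    changed_uris: URIs created, modified, or deleted since the last harvest.
    parent_of: dict mapping each embedded-instance URI to its parent URI.
    """
    reported = []
    for uri in changed_uris:
        top = parent_of.get(uri, uri)  # an EI is reported as its parent
        if top not in reported:        # report each instance only once
            reported.append(top)
    return reported
```

For example, a change to two EIs of the same parent plus one independent instance yields two reported instances, not three.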
The repository is required to hide statements with certain predicates when exporting resource instances in these contexts:
The set of predicates to be hidden is defined by the data model ontology, and identified by the data model configuration properties:
The hidden predicates are themselves subjects of statements whose predicate is the hiding predicate and whose object is the hiding object. For example, this configuration would hide dm:someStupidProperty:

    datamodel.hideProperty.predicate = dm:hasSpecialAttribute
    datamodel.hideProperty.object = dm:cantSeeMe

...and then later on, in the ontology (shown in N3):

    dm:someStupidProperty dm:hasSpecialAttribute dm:cantSeeMe.
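To make the lookup concrete, here is a rough sketch of how the hidden set could be derived from the ontology and applied to an exported instance. It assumes statements are plain `(subject, predicate, object)` tuples of prefixed names. Both function names are hypothetical, and the sketch filters statements directly rather than going through the repository's access-control mechanism.

```python
def hidden_predicates(ontology, hide_predicate, hide_object):
    """Collect every predicate the ontology marks with the hiding statement."""
    return {s for s, p, o in ontology if p == hide_predicate and o == hide_object}


def filter_hidden(statements, hidden):
    """Drop statements whose predicate is in the hidden set."""
    return [st for st in statements if st[1] not in hidden]


# Values taken from the configuration example above.
config = {
    "datamodel.hideProperty.predicate": "dm:hasSpecialAttribute",
    "datamodel.hideProperty.object": "dm:cantSeeMe",
}
ontology = [("dm:someStupidProperty", "dm:hasSpecialAttribute", "dm:cantSeeMe")]
hidden = hidden_predicates(
    ontology,
    config["datamodel.hideProperty.predicate"],
    config["datamodel.hideProperty.object"],
)
```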
Property hiding is implemented through access controls. See that section for more details.
Property hiding has to be implemented in the repository itself in order to enforce a consistent security model, which would not be possible if content hiding were left up to each client application.
Properties are "hidden" for various reasons, such as:
The "contact" issue is closely related to property hiding. The essential problem is that for every resource instance, it is desired to have a means for the agent viewing that resource (e.g. a user in a Web browser viewing a Semantic Web-style dissemination page of the resource instance) to contact the agent who "owns" (or is otherwise responsible for) the resource. Email is one means of implementing this contact, but certainly not the only one. The contact could be in the form of a telephone number, street address, or even a redirect to another Web site which might include indirect contact info of its own. The purpose is to put a potential consumer of the resource in touch with its owner.
The repository only gets involved to mediate this contact process because it is also responsible for hiding all contact information from the agents who would use it. It must therefore implement some means of accepting a contact request or message from the outside agent and forwarding it to the determined owner of the resource.
Contact properties are identified in the same way as hidden properties, only the relevant data model configuration keys are:
Contact-property hiding is implemented through access controls; see that section for more details.
There is a separate ontology document describing the repository's internal data model and administrative metadata. It is an attachment to this page. Note that some statements described by that ontology appear as publicly readable metadata statements, while others are private and never exposed outside of the repository codebase.
The "ontology" graph is considered read-only in normal operation. All internal metadata (i.e. administrative metadata) is stored in a separate, distinct, named graph which should only be available to the repository's internal operations.
The repository design takes full advantage of the named graph abstraction provided by most modern RDF database engines (Sesame, Virtuoso, Anzo, AllegroGraph). Every statement in the RDF database belongs to exactly one named graph. Since this is typically implemented by adding a fourth column to each triple for the named graph's URI, these databases are often called quad-stores instead of triple-stores. The data repository design takes advantage of named graphs:
Internally, we collect some metadata about each named graph: access control rules, of course, and a type that documents the purpose of the named graph.
The repository is created with a few fixed named graphs for specific purposes (e.g. internal metadata statements). Other named graphs are created as needed. Even the repository's own ontology is not a fixed graph, since it can be managed like any other ontology loaded from a serialization.
Relationships - it would be helpful to record metadata about related named graphs, though the most compelling case for this is ontologies that use owl:imports to embed other ontology graphs. Since Sesame does not interpret OWL by itself, and we have no plans to add this sort of functionality for the initial repository implementation, this will be considered later.
The repository provides views to give clients an easier way to query over a useful set of named graphs. A view is just a way of describing a dataset (a collection of named graphs). The repository server has a built-in set of views, each named by a simple keyword. You can use a view with the SPARQL Protocol and with a resource dissemination request. It is equivalent to building up a dataset out of named graphs, but it is a lot less trouble and is guaranteed to be stable, whereas graph names might change. The views are:
Important Note: You may have noticed that according to the definition, the user view is the same as the all view for an administrator user, so why bother creating an all view? It is intended to be specified when you have a query that really must cover all of the named graphs to work properly; if a non-administrator attempts it, it will fail with a permission error, instead of misleadingly returning a subset of the graphs.
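The difference between the two views can be sketched as follows. This is an illustration only: it assumes the user view resolves to the graphs the current user may read (as the note implies), and `resolve_view`, `ViewPermissionError`, and the parameters are hypothetical names rather than the server's actual code.

```python
class ViewPermissionError(Exception):
    """Raised when a user asks for a view they are not allowed to use."""


def resolve_view(view, all_graphs, readable_graphs, is_admin):
    """Turn a view keyword into a dataset (a set of named graph URIs)."""
    if view == "all":
        # Fail loudly instead of silently returning a subset of the graphs.
        if not is_admin:
            raise ViewPermissionError("the 'all' view requires administrator access")
        return set(all_graphs)
    if view == "user":
        # Whatever the current user can read; for an administrator this is everything.
        return set(all_graphs) & set(readable_graphs)
    raise ValueError(f"unknown view: {view}")
```

A query that is only correct over the complete dataset should therefore name the all view, so that a caller lacking access gets an error rather than a plausible but incomplete result.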
A workspace is just another way to describe a dataset, by starting with a named graph. It is effectively a special kind of view. The name of a workspace is the URI of its base named graph, which must be of type workspace or published. When you specify that as the workspace, the repository server automatically adds these other graphs to the dataset:
You can specify a workspace instead of a view in SPARQL Protocol requests, and in resource dissemination requests.
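As a rough illustration, a client might build a SPARQL Protocol GET request like this. The `query` parameter comes from the SPARQL Protocol itself; the `view` and `workspace` parameter names, the endpoint URL, and the rule that the two are mutually exclusive are assumptions made for this sketch, so consult the service descriptions for the actual arguments.

```python
from urllib.parse import urlencode


def sparql_request_url(endpoint, query, view=None, workspace=None):
    """Build a SPARQL Protocol GET URL that selects a dataset by view or workspace."""
    if view is not None and workspace is not None:
        # Assumption: a workspace is given instead of a view, never alongside it.
        raise ValueError("specify either a view or a workspace, not both")
    params = {"query": query}
    if view is not None:
        params["view"] = view            # hypothetical parameter name
    if workspace is not None:
        params["workspace"] = workspace  # hypothetical parameter name
    return endpoint + "?" + urlencode(params)
```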
As of the Version 1, MS5 release, the repository supports inferencing in some very specific cases. Since the repository's RDF data is very frequently modified, it does only the minimal inferencing needed by its users in order to keep the performance bearable.
Many inferencing schemes require inferencing to be re-done over the entire RDF database after every change, because tracing the effects of a change through various rules would take at least as much computational effort as simply re-running the inferencing from scratch. We have chosen a select subset of RDFS and OWL inference rules that makes incremental changes easy and efficient to re-compute.
See the RDF Semantics page for an overview of the greater body of inference rules (of which we only implement a small subset). The repository implements two different kinds of inferencing:
The TBox graphs are configurable. You can set the configuration property eaglei.repository.tbox.graphs to a comma-separated list of graph URIs. By default, the TBox consists of:
This inferencing scheme ensures very fast performance by assuming the TBox graphs never change under normal operations, which ought to be true. The data model ontology graph is only modified when a new version of the ontology is released. Likewise, the repository's internal ontology graph remains unchanged once the repository is installed. When the TBox graphs are changed, be aware that you will probably see a delay of many seconds or perhaps minutes, as all the TBox and ABox inferencing is re-done.
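The sketch below shows why a stable TBox makes incremental updates cheap. It assumes, purely for illustration, that the implemented rules include rdfs:subClassOf transitivity in the TBox and rdf:type propagation in the ABox. All function names are hypothetical; only the configuration property name comes from this document.

```python
RDFS_SUBCLASS = "rdfs:subClassOf"


def parse_tbox_graphs(value):
    """Split the comma-separated eaglei.repository.tbox.graphs value into URIs."""
    return [u.strip() for u in value.split(",") if u.strip()]


def superclass_closure(tbox_statements):
    """Precompute, once, every transitive superclass of each TBox class."""
    direct = {}
    for s, p, o in tbox_statements:
        if p == RDFS_SUBCLASS:
            direct.setdefault(s, set()).add(o)
    closure = {}
    for cls in direct:
        seen, stack = set(), list(direct[cls])
        while stack:
            sup = stack.pop()
            if sup not in seen:  # the seen set also guards against cycles
                seen.add(sup)
                stack.extend(direct.get(sup, ()))
        closure[cls] = seen
    return closure


def inferred_types(asserted_type, closure):
    """Inferred rdf:type values for one newly asserted type: a single lookup."""
    return closure.get(asserted_type, set())
```

Under these assumptions, a change to an instance costs one dictionary lookup per asserted type, while a change to the TBox forces the closure, and every inference derived from it, to be rebuilt.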
Inferred statements are not normally written when an entire graph is dumped. See the /graph service for details.