...

This document is intended to be a thorough description of the data repository design, concentrating on its external interfaces. The intended audience is repository coders as well as implementers of other components that depend on the repository. It should be kept up to date with any API changes so it can serve as a reference manual.

The repository is essentially an RDF triple-store with some extra features dictated by the needs of data entry tools, search, dissemination, and the data collection and curation process. These features include the ability to:

  • Bind an RDF graph to the contents of a serialized file, allowing clean updates.
  • Resolve URIs as linked data.
  • Maintain provenance metadata.

Other features include:

  • Workflow and lifecycle management with automated enforcement.
  • Isolated administrative domains so separate groups can share a repository.

...

  • Fine-grained access control for reading as well as modification operations.

Concepts and Internal Structure

...

The primary purpose of the repository is to store, edit, and retrieve resource instances for its clients. The term resource instance comes from the eagle-i data model: a resource is an indivisible unit of something described in the eagle-i database.  A resource instance in the repository is the corresponding graph of RDF statements that represents this resource as abstract data.  It is rooted at one subject URI.

A resource instance is defined as the subject URI and the collection of statements in which it is the subject. Furthermore, there must be one statement whose predicate is rdf:type and whose object is a URI (preferably of type owl:Class, but this is not checked).
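
The definition above can be sketched in a few lines of Python. This is only an illustration, not the repository's implementation: statements are modeled as plain (subject, predicate, object) tuples, URIs are assumed to be strings with an http(s) scheme (anything else is treated as a literal), and the function name extract_instance and the example.org URIs are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def is_uri(term):
    # Assumption for this sketch: URIs are strings with an http(s) scheme;
    # every other term is treated as a literal.
    return isinstance(term, str) and term.startswith(("http://", "https://"))


def extract_instance(statements, subject):
    """Return the statements whose subject is `subject`.

    Raises ValueError if none of them is an rdf:type statement with a
    URI object, since such a graph is not a resource instance.
    """
    graph = [st for st in statements if st[0] == subject]
    if not any(p == RDF_TYPE and is_uri(o) for _, p, o in graph):
        raise ValueError(f"{subject} has no rdf:type statement with a URI object")
    return graph


# Hypothetical data: two subjects, only the first one carries an rdf:type.
STATEMENTS = [
    ("http://example.org/i/1", RDF_TYPE, "http://example.org/ont/Instrument"),
    ("http://example.org/i/1", RDFS_LABEL, "Microscope"),
    ("http://example.org/i/2", RDFS_LABEL, "Untyped"),
]

instance = extract_instance(STATEMENTS, "http://example.org/i/1")
```

As in the definition, the sketch does not check that the type URI is itself an owl:Class.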

The significance of resource instances in the repository architecture is:

  • The /update service creates, modifies, or deletes exactly one instance transactionally.
  • The repository maintains metadata about each instance, as described in more detail below.
  • Resource instances typically live in workspaces, which are described in more detail below.
  • The dissemination service retrieves the contents of a single instance in various formats.
  • The /harvest service reports on changes to instances.

Embedded Instances (within Resource Instance)

In Cycle 3, the definition of a resource instance was extended to encompass embedded instances. This change only affects the rules governing what statements belong in the instance's graph, although it has a profound effect on the behavior of the repository. Embedded instances are essentially regular resource instances that are considered part of a "parent" resource instance. The precise definition of an embedded instance is as follows:

  1. It has a unique subject URI, and exactly one asserted rdf:type statement.
  2. Its rdf:type (possibly an inferred type) is a member of the designated class group indicating embedded instances. In the eagle-i data model the URI of this class group is:
    Code Block
    http://eagle-i.org/ont/app/1.0/ClassGroup_embedded_class
    
  3. It has exactly one parent: There is exactly one instance which is the subject of statements for which the embedded instance (EI) is the object. This is an informal restriction (really an assumption) imposed on all instances of embedded types, though it is enforced by logic in the repository.
  4. EIs May Not Be Orphaned or Shared: Any transaction that would leave an EI without a parent (so that it is not the object of any conforming statement) or with multiple parents is forbidden by logic in the repository. The only way to remove an EI from its parent is to delete all of its statements. You may copy an EI to another parent by creating a new EI under that parent, with a new unique URI for its subject.
  5. No Broken Links to EIs: If an EI is removed, no instance may retain any statements of which the EI is the object. These must also be removed.
  6. EIs Do Not Have Metadata: The repository does not create metadata (e.g. Dublin Core) about EIs. Any transactions on the EI are considered transactions on its parent and recorded as such, e.g. in the last-modified date in the metadata.
  7. Transactional Integrity: All transactional operations, such as /update and workflow, operate on the EI statements together with the parent instance's statements. For example, a workflow transition to published moves all the EIs to the Published graph along with their parent's statements.
  8. EIs Reside in the Same Graph as the Parent: Though it seems obvious, it is worth stating formally that the statements describing an EI must reside in the same named graph (workspace) as the statements of its parent.
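
The parent rules in this list (exactly one parent, no orphans, no sharing, no broken links) can be illustrated with a small check over a set of statements. This is a sketch under stated assumptions rather than the repository's actual logic: statements are (subject, predicate, object) tuples, the caller already knows which URIs are EIs, and all names and example.org URIs are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def find_parents(statements, ei_uri):
    """Return the set of subjects that have `ei_uri` as an object."""
    return {s for s, _, o in statements if o == ei_uri and s != ei_uri}


def check_embedded_instance(statements, ei_uri):
    """Return the EI's single parent, or raise ValueError on a violation.

    Returns None when the EI is entirely absent from the statements.
    """
    has_statements = any(s == ei_uri for s, _, _ in statements)
    parents = find_parents(statements, ei_uri)
    if has_statements and not parents:
        raise ValueError(f"orphaned embedded instance: {ei_uri}")
    if len(parents) > 1:
        raise ValueError(f"shared embedded instance: {ei_uri}")
    if parents and not has_statements:
        raise ValueError(f"broken link to removed embedded instance: {ei_uri}")
    return next(iter(parents), None)


# Hypothetical parent with one embedded instance.
PARENT = "http://example.org/i/parent"
EI = "http://example.org/i/ei1"
HAS_PART = "http://example.org/ont/hasPart"
GOOD = [
    (PARENT, RDF_TYPE, "http://example.org/ont/Instrument"),
    (PARENT, HAS_PART, EI),
    (EI, RDF_TYPE, "http://example.org/ont/EmbeddedThing"),
]

parent = check_embedded_instance(GOOD, EI)
```

A check of this kind would run against the state a transaction is about to produce, so that an offending /update can be rejected as a whole.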

...

Creation: An EI is created by adding a new URI with an appropriate rdf:type statement to a modification (including creation) of its parent. The type must belong to the embedded class group.

...

See the description of the /harvest service for full details. Essentially, since EIs do not have an independent presence in the "instance" model of the repository, they are not reported on individually when the harvest service reports changes. A change to an EI, even deletion of the EI, is reported as a change to its parent. Likewise, creation of an EI is also reported as a change to its parent.
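
The reporting rule described above can be sketched as a simple mapping step. The function name reported_changes, the parent_of lookup table, and the ex: identifiers are hypothetical; how the /harvest service actually tracks parents is not specified here.

```python
def reported_changes(changed_subjects, parent_of):
    """Return the instances a harvest report would name for these changes.

    `parent_of` maps an EI URI to its parent's URI. Subjects that are not
    EIs are absent from the map and are reported as themselves.
    """
    return sorted({parent_of.get(subject, subject) for subject in changed_subjects})


# Hypothetical data: ei1 and ei2 are embedded in parentA; parentB has no EIs.
PARENT_OF = {"ex:ei1": "ex:parentA", "ex:ei2": "ex:parentA"}

report = reported_changes(["ex:ei1", "ex:ei2", "ex:parentB"], PARENT_OF)
```

Changes to several EIs of the same parent collapse into a single reported change to that parent.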

...

The repository is required to hide statements with certain predicates when exporting resource instances in these contexts:

  1. The dissemination service, for example, the /i service and /repository/resource.
  2. The /harvest service, for populating search indexes.

The set of predicates to be hidden is defined by the data model ontology, and identified by the data model configuration properties:

  • datamodel.hideProperty.predicate
  • datamodel.hideProperty.object

The hidden predicates are themselves subjects of statements whose predicate is the hiding predicate, and whose object is the hiding object, for example:

Code Block
# For example, this configuration would hide dm:someStupidProperty:
datamodel.hideProperty.predicate = dm:hasSpecialAttribute
datamodel.hideProperty.object = dm:cantSeeMe
# ...and then later on, in the ontology (shown in N3):
dm:someStupidProperty dm:hasSpecialAttribute dm:cantSeeMe.

Property hiding is implemented through access controls. See that section for more details.

This has to be implemented in the repository in order to enforce a consistent security model, which would not be possible if content hiding were left up to each client application.
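
To make the selection rule concrete, the following sketch derives the set of hidden predicates from the ontology and filters an exported instance. It only illustrates which statements are affected; the repository itself implements hiding through access controls, as noted above. Statements are modeled as tuples of prefixed names standing in for full URIs, and the function names and ex: instance data are hypothetical.

```python
def hidden_predicates(ontology, hide_predicate, hide_object):
    """Collect every predicate the ontology marks as hidden."""
    return {s for s, p, o in ontology if p == hide_predicate and o == hide_object}


def export_instance(instance, hidden):
    """Return the instance's statements minus those with a hidden predicate."""
    return [st for st in instance if st[1] not in hidden]


# Values taken from the configuration example above.
CONFIG = {
    "datamodel.hideProperty.predicate": "dm:hasSpecialAttribute",
    "datamodel.hideProperty.object": "dm:cantSeeMe",
}
ONTOLOGY = [("dm:someStupidProperty", "dm:hasSpecialAttribute", "dm:cantSeeMe")]

# Hypothetical instance data.
INSTANCE = [
    ("ex:resource1", "rdfs:label", "Microscope"),
    ("ex:resource1", "dm:someStupidProperty", "internal curator note"),
]

hidden = hidden_predicates(
    ONTOLOGY,
    CONFIG["datamodel.hideProperty.predicate"],
    CONFIG["datamodel.hideProperty.object"],
)
visible = export_instance(INSTANCE, hidden)
```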

Properties are "hidden" for various reasons, such as:

  • Properties that are effectively provenance metadata added by users in the resource acquisition and curation process, but which are not suitable for public viewing. They may contain confidential information or comments that the curators do not want publicized.
  • Properties whose values inherently contain confidential information, or information that administrators have been directed to hide, for example, e-mail addresses, physical locations, and phone numbers.

...

The "contact" issue is closely related to property hiding. The essential problem is that every resource instance needs a means for the agent viewing that resource to contact the agent who "owns" (or is otherwise responsible for) the resource. For example, the viewing agent might be a user in a Web browser viewing a Semantic Web-style dissemination page of the resource instance. E-mail is one means of implementing this contact, but certainly not the only one. The contact could be in the form of a telephone number, street address, or even a redirect to another Web site which might include indirect contact info of its own. The purpose is to put a potential consumer of the resource in touch with its owner.

The repository only gets involved to mediate this contact process because it is also responsible for hiding all contact information from the agents who would use it.   It must therefore implement some means of accepting a contact request or message from the outside agent, and forward it to the determined owner of the resource.

...

Contact-property hiding is implemented through access controls. See that section for more details.

...

  1. Ontology---contains ontology descriptions, so reasoning may be required. The name of the graph is the namespace URI. This type also implies that the graph is part of the "TBox" for inferencing.
  2. Metadata---contains the repository's own metadata; public and private metadata are in separate graphs for easier containment of query results.
  3. Workspace---holds resource instances still in the process of editing and curation. Not visible to the public, and access to some workspaces is restricted to a subset of users.
  4. Published---contains published resource description data.
  5. Internal---contains data that is only to be used internally by repository code, never exposed.

The repository is created with a few fixed named graphs for specific purposes (e.g. internal metadata statements). Other named graphs are created as needed. Even the repository's own ontology is not a fixed graph since it can be managed like any other ontology loaded from a serialization.

Relationships---It would be helpful to record metadata about related named graphs, though the most compelling case for this is ontologies that use owl:imports to embed other ontology graphs. Since Sesame does not interpret OWL by itself, and there are no plans to add this sort of functionality for the initial repository implementation, this will be considered later.

...

  • published---all resource instances and user records visible to the public, and all relevant ontologies and metadata.
  • published-resources---just resource instances visible to the public, and all relevant ontologies and metadata.
  • metadata---all named graphs of type Metadata visible to the authenticated user.
  • ontology---all named graphs of type Ontology visible to the authenticated user.
  • metadata+ontology---all named graphs of types Metadata and Ontology visible to the authenticated user.
  • null---all statements in the internal RDF database regardless of named graph.
    • Administrators only.
    • NOTE: This is the ONLY way to see any statements that do not belong to any named graph, i.e. the "null context" in Sesame.   If we are lucky this will be a small or empty set.
  • user---those graphs which the current user has permission to read.
  • user-resources---the graphs containing or related to eagle-i resources which the current user has permission to read. Note that the graph containing instances of repository users (i.e. of type foaf:Person) is, ironically, NOT part of this dataset, since they are not eagle-i resources.
  • public---all graphs which are visible to the public, i.e. to the anonymous user, plus inferred statements. Note that this is not the same thing as 'published'.
  • all---all named graphs including the ones internal to the repository; administrators only.

Important Note: You may have noticed that according to the definition, the user view is the same as the all view for an administrator user, so why bother creating an all view? It is intended to be specified when you have a query that really must cover all of the named graphs to work properly; if a non-administrator attempts it, it will fail with a permission error, instead of misleadingly returning a subset of the graphs.
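
The difference between the user and all views can be sketched as follows. This is a simplified model, not the repository's code: the NamedGraph and User classes, the per-graph readers set standing in for access controls, and the graph names are all hypothetical, and only a few of the views listed above are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedGraph:
    name: str
    graph_type: str       # "Ontology", "Metadata", "Workspace", "Published" or "Internal"
    readers: frozenset    # user names allowed to read (hypothetical ACL model)


@dataclass(frozen=True)
class User:
    name: str
    is_admin: bool = False


def can_read(user, graph):
    # Assumption: administrators can read every graph.
    return user.is_admin or user.name in graph.readers


TYPE_VIEWS = {
    "metadata": {"Metadata"},
    "ontology": {"Ontology"},
    "metadata+ontology": {"Metadata", "Ontology"},
}


def resolve_view(view, user, graphs):
    """Return the names of the graphs that make up `view` for `user`."""
    if view == "all":
        # Fail loudly instead of silently returning a subset of the graphs.
        if not user.is_admin:
            raise PermissionError("the 'all' view is restricted to administrators")
        return [g.name for g in graphs]
    if view == "user":
        return [g.name for g in graphs if can_read(user, g)]
    wanted = TYPE_VIEWS.get(view)
    if wanted is None:
        raise ValueError(f"view not covered by this sketch: {view}")
    return [g.name for g in graphs if g.graph_type in wanted and can_read(user, g)]


# Hypothetical graphs and users.
GRAPHS = [
    NamedGraph("ex:ontology", "Ontology", frozenset({"alice"})),
    NamedGraph("ex:workspace1", "Workspace", frozenset({"alice"})),
    NamedGraph("ex:internal", "Internal", frozenset()),
]
ADMIN = User("root", is_admin=True)
ALICE = User("alice")
```

For an administrator both views resolve to the same graphs; for any other user, requesting all raises an error while user returns only the readable subset.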

...

Many inferencing schemes require inferencing to be redone over the entire RDF database after every change, because tracing the effects of a change through the various rules would take at least as much computational effort as simply rerunning the inferencing. We have chosen a select subset of RDFS and OWL inference rules that makes incremental changes easy and efficient to recompute.

See the RDF Semantics page for an overview of the greater body of inference rules (of which we only implement a small subset). The repository implements two different kinds of inferencing:

  1. "TBox" inferencing:
    • TBox is the terminology source, i.e. the ontology graphs.
    • It consists of named graphs whose type is Ontology.
    • Upon any change to a TBox graph, inferencing is redone on the entire graph, and on all the ABox graphs.
    • All inferred TBox statements are added as inferred statements directly to their TBox graph.
    • Inference is done independently on every TBox graph -- they are expected to be completely disjoint. Thus, there should only be one or two TBox graphs.
    • Following entailment rule rdfs11, all inferred rdfs:subClassOf relationships are created as direct statements.
    • Following entailment rule rdfs5, all inferred rdfs:subPropertyOf relationships are created as direct statements.
    • Following entailment rule rdfs9, all inferred rdf:type properties are added.
  2. "ABox" inferencing:
    • ABox is the body of assertions, i.e. the statements about resource instances.
    • All non-TBox named graphs are part of the ABox.
    • All statements created from inference on ABox statements are added to a special named graph, http://eagle-i.org/ont/repo/1.0/NG_Inferred.
      • This makes it easy to include or exclude the inferred statements in the dataset of a SPARQL query.
      • You can even detect whether or not a statement is inferred by adding a GRAPH keyword to the query and testing its home graph.
    • Any change to an ABox graph only causes re-inferencing of the subjects of the statements that were changed. This is possible because the inferred statements only depend on the asserted rdf:type properties of a subject and the TBox graphs.
    • Following entailment rule rdfs9, all inferred rdf:type properties are added to the graph for inferred statements.
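
The two rules that drive this scheme, rdfs11 (transitivity of rdfs:subClassOf) and rdfs9 (type propagation), can be sketched in plain Python to show why re-inferencing can be limited to the subjects that changed. This is an illustration under simplifying assumptions, not the repository's implementation: class and instance names are hypothetical prefixed strings, and the inferred graph is modeled as a dictionary from subject to its inferred types.

```python
def superclass_closure(subclass_pairs):
    """rdfs11: map each class to all of its direct and inferred superclasses."""
    direct = {}
    for sub, sup in subclass_pairs:
        direct.setdefault(sub, set()).add(sup)
    closure = {}
    for cls in direct:
        seen, stack = set(), list(direct[cls])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(direct.get(current, ()))
        closure[cls] = seen
    return closure


def inferred_types(asserted_types, closure):
    """rdfs9: types implied by the asserted types but not asserted themselves."""
    implied = set()
    for asserted in asserted_types:
        implied |= closure.get(asserted, set())
    return implied - set(asserted_types)


def reinfer_subjects(changed_subjects, asserted, closure, inferred_graph):
    """Recompute inferred types for the changed subjects only."""
    for subject in changed_subjects:
        types = asserted.get(subject, set())
        if types:
            inferred_graph[subject] = inferred_types(types, closure)
        else:
            # The subject no longer has asserted types, so drop its inferences.
            inferred_graph.pop(subject, None)
    return inferred_graph


# Hypothetical TBox and ABox content.
CLOSURE = superclass_closure([
    ("ex:Microscope", "ex:Instrument"),
    ("ex:Instrument", "ex:Resource"),
])
ASSERTED = {"ex:i1": {"ex:Microscope"}}
inferred = reinfer_subjects(["ex:i1"], ASSERTED, CLOSURE, {})
```

Because the inferred types of a subject depend only on its own asserted types and the class closure, an ABox change never requires touching other subjects; only a TBox change invalidates the closure and forces a full recomputation.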

...