Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. It has a unique subject URI, and exactly one asserted rdf:type statement.
  2. Its rdf:type (possibly an inferred type) is a member of the designated class group indicating embedded instances. In the eagle-i data model the URI of this class group is:
    Code Block
    
    http://eagle-i.org/ont/app/1.0/ClassGroup_embedded_class
    
    t
  3. It has exactly one _ parent: . There is _ exactly one _ instance, which is the subject of statements for which the EI is the object. This is an informal restriction (really an assumption) imposed on all instances of embedded types, though although it is enforced by logic in the repository.
  4. EIs may not be Orphaned or Shared:. Any transaction which that would result in an EI left without a parent, i.e. so in other words it is **{}not* the object of any conforming statements, or it has multiple parents, is to be forbidden by logic in the repository. The only way to remove an EI from its parent is to delete all of its statements. You may can copy an EI to another parent by creating a new EI under that parent, with a new unique URI for its subject.
  5. No Broken Links to EIs: . If an EI is removed, an instance may not cannot retain any statements of which it is hte the object. These must also be removed.
  6. EIs Do Not do not Have Metadata. The repository does not create metadata (e.g. , Dublin Core) , for example, about EIs. Any transactions on the EI are considered transactions on its parent and recorded as such, e.g. in the last-modified date in the metadata.
  7. Transactional Integrity All transactional operations, such as /update and workflow, operate on the EI statements together with the parent instance's statements. For example, a workflow transition to published moves all the EIs to the Published published graph along with their parent's statements.
  8. *EIs reside in the Same Graph same graph as the parent. Though it seems obvious, it's worth stating formally that the statements describing an EI must reside in the same named graph (workspace) as the statements of its parent.

...

Creation: An EI is created by adding a new URI with an appropriate rdf:type statement to a modification (including creation) of its parent. The type must belong to the embedded class group.

Modification, /Deletion: Any modification of an EI must be done as a modification of its parent. The EI's properties, including type, may be changed; it may be deleted. These changes are recorded as a modification of the parent. The changes to the parent and its EIs may can be driven by one HTTP request to the / update service, and will be performed in a single transaction.

Dissemination: A dissemination request on the parent instance will include all of the statements about its EIs. The EIs will be filtered of hidden properties (e.g. such as admin data and contact hiding) , by the same rules as the parent, and nd returned in the same serialized RDF graph.
Dissemination requests on EIs are not supported. It is not recommended, the results are undefined.

...

  1. The dissemination service , (for example, the /i service and /repository/resource).
  2. The /harvest service for populating search indexes

...

  • Properties that are effectively provenance metadata added by users in the resource acquisition and curation process, but which are not suitable for public viewing. They may contain confidential information or comments that the curators do not want publicized.
  • Properties whose values inherently contain confidential information, or information that administrators have been directed to hide, fore example, such as e-mail addresses, physical locations, and phone numbers that administrators have been directed to hide.

Contacts and

...

E-mail

The "contact" issue is closely related to property hiding. The essential problem is that for every resource instance, it is desired to have a means for the agent viewing that resource. For example, a user in a Web browser viewing a Semantic semantic Web-style dissemination page of the resource instance to contact the agent who "owns" (or is otherwise responsible for) the resource. E-mail is one means of implementing this contact, but certainly not the only one. The contact could be in the form of a telephone number, street address, or even a redirect to another Web site which might include indirect contact info of its own. The purpose is to put a potential consumer of the resource in touch with its owner.

...

The mechanism of contact-property hiding is implemented through access controls See that section for more details, explained in more detail later in this document.

Internal Ontology and Metadata

There is a separate ontology document describing the repository's internal data model and administrative metadata.It i  s is an attachment to this page. Note that some statements described by that ontology appear as publically-readable metadata statements, while others are private and never exposed outside of the repository codebase.

...

The repository design takes full advantage of the named graph abstraction provided by most modern RDF database engines (Sesame, Virtuoso, Anzo, AllegroGraph). Every statement in the RDF database belongs to exactly one named graph. Since this is typically implemented by adding a fourth column to each triple for the named graph's URI, these databases are often called quad-stores instead of triple-stores. The data repository design takes advantage of named graphs to:

  • To restrict Restrict query results (for logic and/or access control) efficiently by defining a dataset made up of named graphs.
  • Impose different access controls on subsets of the database defined by named graph.
  • Group the statements loaded from a file or network ingest and keep administrative metadata about their source, so they can be checked and updated as new versions of that file come out.
  • Apply different inference policies to graphs, according to their purpose (e.g. OWL-driven inference rules on ontologies).
  • Implement the "workspace" abstraction efficiently.

Internally, we collect some metadata about each named graph : including access control rules, of course, and a type that documents the purpose of the named graph. 

Named Graph Types:

  1. Ontology---contains ontology descriptions, so reasoning may be required. The name of the graph is the namespace URI. Also implies that the graph is part of the "TBox" for inferencing.
  2. Metadata---contains the repository's own metadata; public and private metadata are in separate graphs for easeir easier containment of query results.
  3. *Workspace*---hold holds resource instances still in the process of editing and curation. Not visible to the public, and access to some workspaces is restricted to a subset of users.
  4. *Published*--- Published resource description data.
  5. Internal---Contains data that is only to be used internally by repository code , and never exposed.

The repository is created with a few fixed named graphics for specific purposes (e.g. such as internal metadata statements). Other named graphs are created as needed. Even the repository's own ontology is not a fixed graph since it can be managed like any other ontology loaded from a serialization.

Relationships---it would be helpful to record metadata about related named graphs, though although the most compelling case for this is the ontologies that use owl:includes to embed other ontology graphs. Since Sesame does not interpret OWL by itself, and we have no plans to add this sort of functionality for the initial repository implementation, this will be considered later.

...

The repository provides views to give clients an easier way to query over a useful set of named graphs. A view is just a way of describing a dataset (a collection of named graphs). The repository server has a built-in set of views, each named by a simple keyword. You can use a view with the SPARQL Protocol and with a resource dissemination request. It is a equivalent to building up a dataset out of named graphs but it is a lot less trouble, and guaranteed to be stable whereas graph names might could change. The views are:

  • published---all resource instances and user records visible to the public, and all relevant ontologies and metadata.
  • published-resources __ ---just resource instances visible to the public, _and all relevant ontologies and metadata.
  • metadata---all named graphs of type Metadata visible to the authenticated user.
  • ontology---all named graphs of type Ontology  Ontology visible to the authenticated user
  • metadata+ontology---all named graphs of types Metadata and Ontology visible to the authenticated user
  • null ---all statements in the internal RDF database regardless of named graph.
    • Administrators only.
    • NOTE: This is the ONLY way to see any statements that do not belong to any named graph, i.e. the "null context" in Sesame. If we are lucky this will be a small or empty set.
  • user ---those graphs which the current user has permission to read.
  • user-resources ---the graphs containing or related to eagle-i resources which the current user has permission to read. Note that the graph containing instances of repository users (i.e. of type foaf:Person) is, ironically, NOT part of this dataset, since they are not eagle-i resources.
  • public ---all graphs which are visible to the public, i.e. to the anonymous user, plus inferred statements. Note that this is not the same thing as 'published'.
  • all ---all named graphs including the ones internal to the repository; administrators only.

...

You can specify a workspace instead of a view in SPARQL Protocol requests, and in resource dissemination requests.

Inferencing

As of the Version 1, MS5 release, the The repository supports inferencing in some very specific cases. Since the repository's RDF data is very frequently modified, it does only the minimal inferencing needed by its users in order to keep the performance bearable.

Many inferencing schemes require inferencing to be re-done over the entire RDF database after every change because tracing the effects of a change through various rules would be at least as much computational effort as simply running the inferencing over. We have chosen a select subset of RDFS and OWL inference rules that makes incremental changes easy and efficient to re-compute.

Wiki MarkupSee the \[ RDF Semantics page for an overview of the greater body of inference rules (of which we only implement a small subset). The repository implements two different kinds of inferencing:

  1. "TBox" inferencing:
    • TBox is the terminology source, i.e. the ontology graphs.
    • It consists of named graphs whose type is ontology.
    • Upon * any change to a TBox graph, inferencing is redone on the entire graph, and on all the ABox graphs.
    • All inferred TBox statements are added as inferred statements directly to their TBox graph.
    • Inference is done independently on every TBox graph -- they are expected to be completely disjointdisjointed. Thus, there should only be one or two TBox graphs.
    • Following entailment rule rdfs11, all inferred rdfs:subClassOf relationships are created as direct statements.
    • Following entailment rule rdfs5, all inferred subPropertyOf relationships are created as direct statements.
    • Following entailment rule rdfs9, all inferred rdf:type properties are added.
  2. "ABox" inferencing:
    • ABox is the body of assertions, i.e. the statements about resource instances.
    • All non-TBox named graphs are part of the ABox.
    • All statements created from inference on ABox statements are added to a special named graph, http://eagle-i.org/ont/repo/1.0/NG_Inferred.;
      • This makes it easy to include or exclude the inferred statements in the dataset of SPARQL query
      • You can even detect whether or not a statement is inferred by adding a GRAPH keyword to the query and testing its home graph.
    • Any change to a TBox graph only causes re-inferencing of the subject of each statement where a named instance was changed.   This is possible because the inferred statements only depend on the asserted rdf:type properties of a a subject and the TBox graphs.
    • Following entailment rule rdfs9, all inferred rdf:type properties are added to the graph for inferred statements.

The TBox graphs are configurable. You can set the configuration property eaglei.repository.tbox.graphs to a comma-separated list of graph URIs. By default, the TBox consists of:

...

Inferred statements are not normally written when an entire graph is dumped. See the /graph service for details.

Users and Authentication

Authentication is managed entirely by the Java Servlet container.   We rely on the container to supply an authenticated user name (a short text string) and whether that user has the "superuser" role. The container's role is only used for bootstrapping; normally roles are recorded in the RDF database and they take precedence over the container's role map.

...

Each login user is (ideally) recorded in both the RDBMS used by the servlet container (or possibly some other external DB) and the RDF database. This is necessary because the servlet container, which is doing the authentication, only has access to the RDBMS through a plugin, but the repository authorization mechanism expects an RDF expression of the user as well.   All of the services that modify users keep the RDBMS and RDF databases synchronized, and can cope with users found in one and not the other. 

The RDBMS description of a user contains:

  1. Username---*Present in RDF as well, this is the common key.   Since it is the permanent description of a user this is **{}immutable*.
  2. Password---Only in RDBMS.
  3. Membership in Superuser role.  No other roles.

The RDF description of a user contains:

...

When a user is present in RDF but not in the RDBMS, they are considered disabled and cannot loginlog in.   They can be reinstated through the Admin UI.

When a user is present in the RDBMS but not in the RDF, they are considered undocumented.   Upon login, an undocumented user is given the URI corresponding to his/her highest known role:
:Role_Superuser if the RDBMS indicates that role, or :Role_Anonymous otherwise.   (Arguably, the default role could also be :Role_Authenticated, but without RDF data for the user they are not fully authenticated, and this is incentive to fix the discrepancy.)

...

An Administrator (superuser)  ; can become documented by logging in, and either running the / \whoami service with create=true or using the Admin UI to edit and save their own user info and saving itinformation. An Administrator can fix an ordinary undocumented user by using the Admin UI to save their descriptive metadata; even if it is all blank, a user record will be created.   Importing users also straightens out the mapping automatically.

...

Roles are a way to characterize a group of users, for example, to grant them some access rights in the access-control system. Functionally, the role is part of a user's authentication, i.e., "who" they are.
A role is defined by a URI, the subject of a :Role instance. It should also have a locally unique, short text-string name (the rdfs:label of its :Role instance).

Each Role is independent of other Roles.   Roles cannot be "nested". ___\ This is a _ necessary limitation that simplified the implementation considerably.

The Superuser role is built into the system because its privileges are hardcoded.

There are a couple of special Roles whose membership is implicit, that is, it never needs to be granted explicitly:

  1. Anonymous---unauthenticated users are implicitly assigned this role so they can be given explicit access. This role is never visibly asserted by a user, it is only for describing access controls.  E.g. : For example, "The Published graph is readable by the Anonymous role".
  2. Authenticated ---any user who logs in is authenticated and implicitly belongs to this role; the opposite of anonymous.    This role is never explicitly asserted by a user, it is _only for describing access controls.

...

A repository user is identified uniquely (within the scope of ONE repository instance) by a short symbolic username.   This is a character string composed of characters from a certain restricted subset of the ASCII character set, in order to avoid problems of character translation and metacharacter interpretation in both the protocol layer (HTTP) and OS tools such as command shells.   The password, which is paired with a username to serve as login credentials, is likewise restricted to the same range of characters as the username.

Character restrictions: The username and password * MUST NOT include* the character ':' (colon).   They MAY can only include:

  • Letter or digit characters in Unicode C0 and C1 (basic latin, latin-1)
  • The punctuation characters: ~, @, #, $, %, _, -, . (dot)

Note that although the HTTP protocol allows any graphic characters in the ISO-8859-1 codeset (modulo ':'), and linear whitespace, and even chars WITH special MIME RFC-2047 encoding rules, these are often implemented wrongly by  by HTTP clients and also invite encoding and metacahracter metacharacter problems with OS and scriptiing scripting tools.  To avoid these troubles we simply restrict the available characters.

...

This is just an outline of the access control system. It is implemented as statements stored in the internal metadata graph. The access controls applying to an instance or other object are not to be directly visible in the repository API, execpt except through administrative UI.

...

On the following types of * resources*:

  • Named Graphs, including workspaces.
    • Read lets people download and run queries against the graph.
    • Add, Remove lets you modify it with /graph Usually reserved for admins.
  • Resource Instances (ignores explicit Read access; that comes from it's home named graph.)
    • Add, Remove let you modify it with /update, this is usually granted temporarily to a specific user as part of the workflow process.
    • ignores Read grants,   read access reverts to its home graph.
  • Property Groups - sets of properties on a resource instance identified by statements in the data model ontology, see the data model configuration manual, particularly datamodel.hideProperty.predicate, datamodel.hideProperty.object.
    • Read lets the Dissemination and Harvest service report on these properties.
    • All other access types are ignored.
  • Workflow Transitions
    • Read gives access to take (push) the transition.
    • All other access types are ignored.

...

Any user asserting the Superuser role is always granted access, bypassing all controls.   This lets us bootstrap the system when there is no RDF database yet to describe grants.   Repository administrators should always have the Administrator role, since most of the Admin UI and API requires it.

Access control is implemented by statements of the form:

Subject:* resource, Predicate: access-type, Object: *accessor

The resource is the URI of the instance, named graph, or workflow transition of interest. The access-type names one of the four types of access described above: read, add, remove, admin. Finally, the accessor is the URI of the Principal to be granted the access, either a Role or an Agent (user).

...

Some repository features are implemented as extensions to the Sesame RDF database (aka also known as triplestore). This means they are available both internally to the repository implementation and externally whenever an API to Sesame, its SPARQL query engine, is exposed.

...

You can ask for an output format in two ways as well:

  1. Explicitly through the format arg, which takes precedence.
  2. As the HTTP Accept header in the request headers, which may be a list of formats and qualifiers; the repo repository implements full HTTP 1.1 content negotiation.

...

Name

Symbol

Default MIME type

Additional MIME types

RDF/XML

RDFXML

application/rdf+xml

application/xml

N3

N3

text/rdf+n3

 

N-Triples

NTRIPLES

text/plain

 

TriG

TRIG

application/x-trig

 

TriX

TRIG

application/trix

 

NTriples With Context 1,2

Context-NTriples

text/x-context-ntriples

 

HTML 2,3

RDFHTML

text/html

application/xhtml+xml

...

1 The Context-NTriples format is not an official RDF serialization; it was added for this repository, as a convenient way to export quads for testing. Note that it was formerly named NQuads, but there is already a different unofficial format known to the RDF community that is called "NQuads".
2 This format only supports output, it cannot be read by the repository.
3 HTML is for interactive viewing only, it cannot be parsed.

...

Name

Symbol

Default MIME type

Additional MIME types

SPARQL/XML

SPARQL

application/sparql-results+xml

 

TEXT

TEXT

text/boolean

 

Identify and/or "Create" Current User

...

/whoami

...

This request returns a tabular report on the current authenticated user's identity, to compose a friendly display in a Web UI. This is simply a SPARQL query wrapped in a servlet to hide the details of the internal data.

...

  • 201 when a new User instance was successfully created.
  • 409 (conflict) when a User instance already exists - . THIS SHOULD BE CONSIDERED SUCCESS!
  • Any other code indicates failure.

...

This call creates one or more globally unique, resolvable, URIs for new resource instances. It does not add any data to the repository; the instances will not exist until a user inserts some statements about them. The URI namespace is the default namespace from the configuration properties, followed immediately by the unique identifier. Note that ETL tools may request thousands of URIs at once so the mechanism to produce unique IDs must be able to handle that.

URL: {{ /repository/new (POST only)

Args:
count---number of URIs to return; optional, default is 1.
format---same as for SPARQL result format, same default (SPARQL XML)

...

Args:
uri=uri---optional, only if a URI is not specified as the tail of the request URI; an alternate way to explicitly specify the complete URI of the resource to disseminate. Allows any URI to be accessed, instead of assuming that the URI's namespace matches the hostname, context, and servlet path ("/i") to which the repository's webserver responds.
format=mimetype---optionally override the dissemination format that would be chosen by HTTP content negotiation. Note that choosing text/html results in a special human-readable result.
view=view---optionally choose a different view dataset from which to select the graph for dissemination. Mutually exclusive with workspace.
workspace=uri---URI of the workspace named graph to take the place of the default graph. Relevant metadata and ontology graphs are included automatically. Mutually exclusive with view.
noinferred (boolean)---Exclusive of all inferred statements from the generated results. This only applies to rdf:type statements; if the noinferred query argument is present (it need not have any value) then inferred types are left out of the results.
forceRDF (boolean)---forces a result of serialized RDF data. When the negotiated result format is text/html, the usual choice is to geneate generate the human-readable view. This forces an HTML rendering of the RDF statements which can be handy for troubleshooting, especially when combined with noinferred. NOTE: This is the only way to see the Embedded Instance statements in an interactive HTML view, for example, in a web browser, so it is especially handy for generating a clean view for debugging EIs. Default false.
forceXML (boolean)---when an HTML format would be generated, output the intermediate XML document instead of transforming it to XHTML. This is mainly useful for obtaining examples of the intermediate XML for developing new XSLT stylesheets and testing/debugging. Default is false.

Result:
Returns a serialization, or , optionally, human-readable HTML rendering of the graph of RDF statements describing the indicated resource instance. Note the deliberate choice of words: This graph includes not only the statements of which the URI is the subject, but also:

  1. Statements describing Embedded Instances embedded instances in the resource instance, for the same meaning of "describing" (but note that EIs are not recursive, an EI may not have its own EIs.)
  2. Statements about the "Label" properties of all predicates and object-property values of the instance and its EIs. (The exact set of properties used for "label" is configurable.) See the SWAG page for more about label properties.
  3. Provenance and some administrative metadata about the instance.

...

About HTML dissemination:
When the negotiated format is text/html, and unless either of the forceRDF or forceXML args was given, the dissemination process creates an intermediate XML document and transforms it into XHTML with the configured XSLT stylesheet. See description of the eaglei.repository.instance.xslt in the Repository Administrator Guide.

...

We provide an example transformation stylesheet that produces very simple HTML, intended to be the basis of custom stylesheets. It is available for download at .:https://localhost:8443/repository/styles/example.xsl
We manage the transformation within the repository, instead of adding an xml-stylesheet processing instruction to the XML, for compelling reasons:

...

There is also access control on some indivdual properties of the resource: those properties identified as hidden and contact properties by the data model ontology (and its configuration, see that separate document). The access controls on the resource URI configured as datamodel.hideProperty.object regulate hidden proerties, and datamodel.contactProperty.object regulates contact properties.

...

This ensures that no matter how much time passes between (2) and (3), if, for example, a user dawdles over an interactive edit or forgets and leaves it overnight, the edit token is already in place to indicate his/her intention to make a change. It does not prevent another user from coming along and grabbing the token to make a change, but it will indicate that there is an edit in progress, and it will prevent a stale copy from being checked in.

Comparison Comparison with SPARQL/UPDATE: In case you are wondering why we chose to implement this complex special-purpose service instead of a general protocol like SPARQL/UPDATE - , there were some compelling reasons:

...

Args:
{{uri---}}optional way to explicitly specify the complete URI, instead of assuming that the URI's namespace matches the hostname, context, and servlet path ("/i") of this webserver.
format---the default expected format for insert and delete graphs. If the args specify a content-type header, that overrides this value. Only recognizes triples even if the format supports quads.
action=(update|create|gettoken)---Update to modify an existing instance, create adds a new one. See below for details about gettoken.
token=uri---When action is update or create, this must be supplied. The value is the URI returned by the last gettoken request.
workspace=uri---Choose choose workspace named graph where new instance is created. Only necessary when action=create. Optional, default is the default workspace. DO NOT specify a workspace when action=update.
delete---graph of statements to remove from the instance; subject must be the instance URI. Deletes are done before inserts. Graph may include wildcard URIs in predicate and/or object to match all values in that part of a statement.
insert---graph of statements to add to instance; subject must be the instance URI.
bypassSanity---(boolean, default false, deprecated) NOTE: It is best if you pretend this option does not exist. When true, it skips some of the sanity tests on the resulting instance graph, mostly the ones checking the integrity of Embedded instances. Requires Administrator privilege. This was added to make the data migration from broken old EI data possible, it should rarely if ever be needed.

...

Result:
HTTP status indicates success or failure. Any modifications are transactional; on success the entire change was made, and upon failure nothing gets changed.

When action=update is used to effect a change to a resource instance, this service automatically optimizes the requested change so the fewest actual statements are modified. For example, if the request deletes all statements by using the wildcard URI in the position of predicate and value, and then inserts all of the statements that were there before along with an additional new statement, the only change actually made is to add that new statement. Since a gratuitous change to an rdf:type statement results in extra time spent inferencing, it is best to avoid it when possible.

When an update fails because the edit token is stale, the HTTP status is always 409 (Conflict). If this occurs, the only solution is to get a fresh token and re-do the update. It is NOT advisable to have a client simply retry the update with the same data , at least inform the user that there has been an intervening edit and updating now would destroy somebody else's changes.

When action=gettoken, the response includes a document formatted (according to the chosen format) as a SPARQL tuple result. It includes the columns:

token---URI of the edit token. It has no meaning other than its use in an update transaction. This is the last edit token created (and not yet "used up" by an update) on this instance; or else a new one that was created if there wasn't one available.
created---literal timestamp at which the token was created. It may be useful to display the date to the user if there was an existing token of recent vintage.
creator---URI of the user who created the token
new---boolean literal that is true if the gettoken operation created a new token. When false, it means the token already existed, which indicates there MAY already be another user's update in progress and that update may be in conflict with yours.
creatorLabel---rdfs:label\ of the creator, if available.

...

Character Sets: The SPARQL query document is entered as the value of the HTTP query-argument query in a GET or POST request. When the query is embedded in the URL in a GET request, or in a body of content-type application/x-www-form-urlencoded in a POST, it MUST use the character set of the URL. This is supposed to be ISO Latin-1 (ISO-8859-1) according to the HTTP 1.1 spec, but in practice browsers seem to use UTF-8-encoded Unicode.
If you put multilingual multi-lingual characters in a SPARQL query document, which is perfectly legitimate in literal values, the best way to ensure they are interpreted correctly is to declare the content-type explicitly with a charset parameter. To do this you must send a POST request with the entity body in multipart/form-data format, so that each argument value is encoded as a separate entity. This adds complexity, but it is the best way to ensure your HTTP client conveys the character set identification correctly. Within the entity for the query arg, add the header: Content-Type: text/plain;charset="UTF-8"
If you're using UTF-8. The media type should always be text/plain.

URL: /repository/sparql (GET and POST)

...