Federation in metaphactory

Andreas Schwarte


With metaphactory, we serve customers of various sizes and across multiple industries, but no matter whether we're talking about a clinical trial scoping or a bill of materials use case, customers are looking for solutions to address hybrid information needs. That means that end users usually have questions or information needs that are not limited to one single data source or just RDF graph data, but involve simultaneously dealing with a multitude of data sources, a multitude of data modalities and a multitude of data processing techniques.

Making the case for federation

Let's have a brief look at the challenges associated with each of these three dimensions characterizing hybrid information needs:

  • Variety of data sources. We often see a need to integrate data stored in several physical repositories. These repositories can include native RDF triple stores as well as datasets in other formats presented as RDF (e.g., a relational database exposed using R2RML mappings). Migrating data away from the original source is often not feasible, particularly when the change frequency makes regular synchronization necessary. Such cases require a solution that keeps the data in place and separate from the knowledge graph, while still creating integrated views on all available data.
  • Variety of data modalities. Graph data in RDF often needs to be combined with other data modalities, e.g., textual, temporal, or geospatial data. A SPARQL query then needs to support corresponding extensions for full-text, spatial, and other types of search. Depending on the modality, a knowledge graph might not be the most natural database backend and such data can only be efficiently queried and processed in a dedicated storage, while it only reveals its full value when combined with other data sources.
  • Variety of data processing techniques. Retrieved data often has to be further processed using dedicated domain-specific services, e.g., graph analytics (finding the shortest path or interconnected graph cliques), statistical analysis and machine learning (applying a machine learning classifier or finding similar entities using a vector space model), etc. These services are typically external to the knowledge graph and require data from the knowledge graph to augment and enrich it.

Federation is a very powerful technique for providing integrated access over multiple heterogeneous data sources. The SPARQL 1.1 query language already provides a convenient built-in way of expressing data requests over multiple RDF knowledge graphs: with the SERVICE clause, a single query can address multiple SPARQL endpoints.

Introducing Ephedra and FedX

In metaphactory, we support SPARQL 1.1 federation and go one step further: we offer two federation techniques - Ephedra and FedX - tailored to solve specific information needs, particularly involving hybrid data sources.

  • Ephedra. Ephedra is a federation technology that integrates data from the main RDF database with information from (hybrid) sources using SERVICE calls. Ephedra enables combining available RDF data with data from RESTful services, SQL databases, or other hybrid sources. While adopting the SPARQL 1.1 federation mechanism, we broaden its usage to include custom services as data sources and optimize such hybrid queries to be executed efficiently.
  • FedX. FedX provides transparent federation over multiple SPARQL endpoints under a single virtual endpoint. As an example, a knowledge graph such as Wikidata can be queried in a federation with endpoints that are linked to Wikidata as an integration hub. In a federated SPARQL query in FedX, you no longer need to explicitly address specific endpoints using SERVICE clauses. Instead, FedX automatically selects relevant sources, sends statement patterns to these sources for evaluation, and joins the individual results.

Figure: Ephedra Federation Engine in metaphactory

Figure: FedX Federation Engine in metaphactory

Examples for use of federation

In the remainder of this blog post, we will highlight some practical examples demonstrating how the federation technologies discussed above - SPARQL 1.1 federation, Ephedra, and FedX - address hybrid information needs. For each example, we'll show how one specific federation technique can be applied. Since other federation techniques might be applicable as well, each section ends with a brief review of all three techniques that compares their benefits and drawbacks.

1. Combining data from several SPARQL endpoints

For this first example, we combine data from two SPARQL endpoints whose data we understand well (combining more than two endpoints is also possible). This means that we can clearly scope each part of our query to one source or the other, and that we know how to interlink the data the sources provide.

The SPARQL 1.1 query language provides a built-in feature for expressing information needs over multiple RDF knowledge graphs. Using SERVICE clauses, it is possible to combine information from remote RDF sources - typically open SPARQL endpoints - with data from the local RDF database.

This approach is clearly restrictive: it handles only a limited variety of data sources (every source must be accessible as a SPARQL endpoint), and all sources must be addressed explicitly from within the query. Additionally, there is no option to connect to an endpoint that requires authentication, so only open SPARQL endpoints can be used. However, such a solution can still be sufficient for use cases like the one described below, which uses Wikidata and neXtProt.

In one of the examples hosted on our public Wikidata demonstrator, we apply SPARQL 1.1 federation to augment proteins with specific details available through the public neXtProt SPARQL endpoint: we take the UniProt identifier available in Wikidata and use it as a parameter to a regular SPARQL 1.1 SERVICE invocation that fetches all tissue expressions from neXtProt. The full example is available here »
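
To make the shape of such a query concrete, here is a minimal sketch of the pattern described above. It is not the query from the demonstrator: the remote endpoint URL and the two ex: predicates are hypothetical placeholders, since the actual neXtProt vocabulary is not shown in this post. Only wdt:P352 (the Wikidata property "UniProt protein ID") is taken from Wikidata itself.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
# Hypothetical vocabulary standing in for the remote source's schema.
PREFIX ex:  <http://example.org/vocab#>

SELECT ?protein ?uniprotId ?tissue WHERE {
  # Evaluated against the local repository (Wikidata in this scenario):
  # fetch entities together with their UniProt identifier.
  ?protein wdt:P352 ?uniprotId .

  # Evaluated remotely. The endpoint URL and both ex: predicates are
  # placeholders (assumptions), not the real neXtProt endpoint or schema.
  SERVICE <https://sparql.example.org/proteins> {
    ?entry ex:uniprotAccession ?uniprotId ;
           ex:expressedIn      ?tissue .
  }
}
LIMIT 100
```

The shared variable ?uniprotId is what joins the two sources: it is bound by the local pattern and matched again inside the SERVICE block.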

 

| Federation technique | Rating | Comments |
|---|---|---|
| Standard SPARQL 1.1 query | + | Information about which source provides which data is required. Limited to RDF sources. Authentication on endpoints not supported. |
| Ephedra | ++ | Information about which source provides which data is required. Additional non-RDF sources can be included as well. Authentication on SPARQL and non-SPARQL endpoints supported. |
| FedX | ++ | Information about which source provides which data does not have to be complete. Limited to RDF sources. Authentication per SPARQL endpoint supported. |

2. Knowledge Graph Enrichment

Knowledge Graph Enrichment with Ephedra

A common use case of federation technology is knowledge graph enrichment, i.e., enrichment of local data with information from other (hybrid) data sources. Ephedra provides out-of-the-box modules for declaratively describing and accessing information from RESTful services and relational databases.

In the following, we describe an example where the knowledge graph is enriched using Word2Vec models. Here, we use a specialized data processing service which applies a trained machine learning model to find entities similar to a given set of other entities. This service uses a Word2Vec vector space model trained on the English Wikipedia corpus. Each Wikidata entity is represented as an embedding vector. Similarity, defined as a distance in the embedding vector space, indicates relatedness between entities and complements the explicit relations stored in the RDF triplestore.

In a concrete example, you can search for entities similar to a keyword token you provided (e.g., an artist like Rembrandt). Initially, the keyword search is evaluated through Ephedra using the Wikidata Search API. Your selected entity is then used as input to the similarity services to identify similar resources. Note that the similarity service is made available as a virtual endpoint in the Ephedra federation.

 

Knowledge Graph Enrichment Example

The live example, together with a detailed explanation and its configuration, is available here »
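
As a rough illustration of what calling such a service from a query can look like, the sketch below invokes a similarity service through a SERVICE clause and joins its output with labels from the local graph. The service IRI and the ex: input/output predicates are hypothetical; the actual names depend on how the service is configured in Ephedra and are documented in the linked example.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd:   <http://www.wikidata.org/entity/>
# Hypothetical parameter vocabulary of the similarity service.
PREFIX ex:   <http://example.org/similarity#>

SELECT ?similar ?label ?score WHERE {
  # Hypothetical virtual endpoint wrapping the similarity service.
  # Input: the selected entity (wd:Q5598 is Rembrandt in Wikidata).
  # Output: similar entities together with a similarity score.
  SERVICE <http://example.org/services/similarity> {
    ?similar ex:similarTo wd:Q5598 ;
             ex:score     ?score .
  }

  # Join the service results with explicit data from the local graph.
  ?similar rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
ORDER BY DESC(?score)
LIMIT 10
```

The point of the pattern is that the service looks like any other graph pattern to the query author, while its results are computed outside the triplestore. The ordering assumes that a higher score means a closer match.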

 

Other examples for knowledge graph enrichment are:

  • Enriching RDF data with geospatial information. The OpenStreetMap REST API can be connected as a virtual endpoint in our Ephedra federation; this allows us to enrich locally available RDF data with geospatial information fetched from OpenStreetMap, without having to download the geospatial data, convert it to RDF and load it into a graph database. A step-by-step tutorial with a full configuration can be found here »
    Please note that OpenStreetMap data is also available through a SPARQL endpoint [1]; this example is simply meant to highlight a possible use of REST APIs in federation.
  • Accessing information from a relational database. To demonstrate how Ephedra can be used to retrieve information from a relational database and integrate this information with local data, we provide an example scenario based on the Postgres Dvdrental tutorial. The step-by-step guide can be found here »

| Federation technique | Rating | Comments |
|---|---|---|
| Standard SPARQL 1.1 query | - | Limited to RDF sources. Can't be applied here. |
| Ephedra | ++ | Allows integration of any data source. |
| FedX | - | Limited to RDF sources. Can't be applied here. |

3. Database agnostic search through a Lookup Service

Many graph databases offer specific capabilities for full-text search that are often exposed as special services. Ephedra can be used to unify these database specific approaches under one common interface (which is called the Lookup Service in metaphactory) and support database agnostic search.

metaphactory provides multiple Lookup Service abstractions out-of-the-box: in addition to a generic pure SPARQL variant, metaphactory has multiple implementations that provide deep integration with the respective search technologies of supported databases. Specific search services can also be covered: in our Wikidata demo system, we have integrated a Lookup Service that specifically targets the Wikidata Search API (which is a REST API) for performing keyword search.

 

One of the main advantages of the Lookup Service abstraction is that search queries can be written independently of the underlying database. The configuration of UI search components does not encode any knowledge about the underlying database; instead, metaphactory transparently selects the suitable Lookup Service implementation and invokes the search query on it.

 

A simple lookup query example for the keyword "Mona Lisa" looks as follows:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX lookup: <http://www.metaphacts.com/ontologies/platform/repository/lookup#>

# Note: the Repository: prefix is not declared in this snippet; it is assumed
# to be resolved by the platform.
SELECT ?item ?label ?score ?description WHERE {
  # Delegate the keyword search to the Lookup Service.
  SERVICE Repository:lookup {
    ?item lookup:token "Mona Lisa" ;   # the search keyword
          lookup:name ?label ;
          lookup:score ?score ;
          lookup:description ?description .
  }
}

Database agnostic search through a Lookup Service

Note that as an advanced feature of the Lookup Service abstraction, metaphactory supports the concept of Federated Lookup: the keyword query is sent to all individual lookup targets in parallel and the result is composed as a union of the individual results.

Please refer to Lookup Service documentation for further details.

Going beyond this, with Ephedra federation technology, it is possible to augment the pure set of search result items with additional information from the local database, e.g., to enrich information or offer filtering through facets. This is achieved by providing additional join patterns outside the actual lookup SERVICE query:

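PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX lookup: <http://www.metaphacts.com/ontologies/platform/repository/lookup#>
PREFIX dct: <http://purl.org/dc/terms/>

# Note: the Repository: prefix is not declared in this snippet; it is assumed
# to be resolved by the platform.
SELECT ?item ?label ?score ?description ?creator WHERE {
  # Keyword search, delegated to the Lookup Service.
  SERVICE Repository:lookup {
    ?item lookup:token "Mona Lisa" ;
          lookup:name ?label ;
          lookup:score ?score ;
          lookup:description ?description .
  }
  # Join pattern outside the SERVICE block, evaluated on the local database.
  ?item dct:creator ?creator .
}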
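
Building on the same pattern, filtering through a facet can be sketched by constraining the joined variable. The query below is an illustrative assumption rather than a query from the platform documentation: the creator IRI is a made-up placeholder, and ordering by ?score assumes that a higher score means a better match.

```sparql
PREFIX lookup: <http://www.metaphacts.com/ontologies/platform/repository/lookup#>
PREFIX dct:    <http://purl.org/dc/terms/>

# The Repository: prefix is assumed to be resolved by the platform,
# as in the lookup queries shown in this post.
SELECT ?item ?label ?score ?description ?creator WHERE {
  SERVICE Repository:lookup {
    ?item lookup:token "Mona Lisa" ;
          lookup:name ?label ;
          lookup:score ?score ;
          lookup:description ?description .
  }

  # Local join pattern supplying the value to filter on.
  ?item dct:creator ?creator .

  # Facet selection: keep only hits by one creator.
  # The IRI is a hypothetical placeholder.
  FILTER(?creator = <http://example.org/person/leonardo-da-vinci>)
}
ORDER BY DESC(?score)
LIMIT 10
```

Because the filter sits outside the SERVICE block, the search itself stays independent of the underlying database; only the join and the filter touch the local data.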
  



| Federation technique | Rating | Comments |
|---|---|---|
| Standard SPARQL 1.1 query | - | Limited to RDF data. Can't be applied here. |
| Ephedra | ++ | Allows integration of any data source. |
| FedX | - | Limited to RDF data. Can't be applied here. |

4. Transparent federation for a Linked Data Browser

Another interesting federation use case is browsing through Linked Data published by multiple data sources, which might be independent for a number of regulatory, data access, governance, or organizational reasons. Each of these data sources may have specific access rights and users may not be allowed to see all information. The idea here is to start from a simple search interface for finding relevant (and accessible) resources and then navigate to the respective resource in the context of a transparent federation. By combining federation technology with data source specific permission features, only those pieces of information that are accessible for the current user are exposed.

On the resource page itself, the FedX federation engine will transparently show all connected information of the given resource, e.g., as a knowledge panel. A visualization of the resource's neighborhood in a graph is also possible. For the user it will appear as if all the resource data were available in a single graph database, while, in fact, the data may be distributed amongst multiple databases.

 

With transparent federation, performance and scalability depend on various factors, including the number of federation members, network latency, and distribution of data. We recommend applying a transparent FedX federation primarily in use cases with a concrete information need, i.e., where the need can be expressed in a very selective query that ensures a response in a reasonable time.

 

| Federation technique | Rating | Comments |
|---|---|---|
| Standard SPARQL 1.1 query | + | Information about which source provides which data is required. Use of UNION is required to combine data from all sources, which makes the execution less performant and queries more complex. |
| Ephedra | + | Information about which source provides which data is required. Use of UNION is required to combine data from all sources, which makes the execution less performant and queries more complex. |
| FedX | ++ | Information about which source provides which data does not have to be complete. Limited to RDF sources. |
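
The difference between the approaches compared above can be sketched with two variants of the same resource-page query. Both are illustrative only: the resource IRI and the two endpoint URLs are hypothetical placeholders. With explicit SERVICE clauses, every source has to be listed and the branches combined with UNION:

```sparql
# Variant A: standard SPARQL 1.1 federation, one SERVICE block per source.
# Resource IRI and endpoint URLs are hypothetical placeholders.
SELECT ?property ?value WHERE {
  {
    SERVICE <https://source-a.example.org/sparql> {
      <http://example.org/resource/123> ?property ?value .
    }
  }
  UNION
  {
    SERVICE <https://source-b.example.org/sparql> {
      <http://example.org/resource/123> ?property ?value .
    }
  }
}
```

Against a transparent federation, the same information need is written as if all data lived in a single database, and source selection is left to the federation engine:

```sparql
# Variant B: the same request sent to a transparent federation endpoint.
# No SERVICE clauses; the resource IRI is the same hypothetical placeholder.
SELECT ?property ?value WHERE {
  <http://example.org/resource/123> ?property ?value .
}
```

Variant A grows by one UNION branch for every additional federation member, while Variant B stays unchanged when members are added or removed.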

Summary

In this blog post, we discussed hybrid information needs involving a variety of data sources, a variety of data modalities, and a variety of data processing techniques, and presented concrete example use cases that leverage metaphactory's federation technologies. We demonstrated how Ephedra adds value to standard SPARQL 1.1 federation by specifically tackling hybrid information needs, and showed how a Linked Data browser can benefit from a transparent FedX federation.

 

Moreover, by integrating federation into the metaphactory platform directly, we allow you to leverage this powerful feature while making use of all platform features, including: development of end-user oriented applications using our rich set of UI components; lifecycle management; security features such as authentication and authorization; declarative configurations, etc.

 

All examples discussed in this blog post are available in our Wikidata public demonstrator. Moreover, detailed step-by-step tutorials for the individual examples (including configuration and snippets) can be found in our samples repository »

Footnotes
1 https://sophox.org/

Andreas Schwarte

As a Principal Software Engineer at metaphacts and a specialist in semantic technologies, Linked Data, SPARQL and federated query processing, Andreas leads our software engineering team in developing, documenting, and testing metaphactory to ensure that the platform meets our customers' needs and helps them achieve their business goals.