Foundations & Technology

Zitgist's technology is built around the central data model of RDF (Resource Description Framework, a means to represent the "facts" about any data object) and follows Linked Data principles.  RDF is generally expressed in "schemas" (or vocabularies, which with their specified relationships are formally known as "ontologies") that provide the semantics for what these facts and objects mean.

RDF is well suited to represent data sources that range from unstructured text to highly structured relational databases: in other words, virtually everything.  In addition, a broad series of converters ("RDFizers"), guided by these ontologies, can express almost any content or data format as RDF.  Thus, data from disparate sources may be meaningfully combined and related.

Zitgist stores and processes this linked RDF in efficient and scalable "triple stores" that, along with other standards, enable querying, retrieval and viewing of these consolidated datasets. Moreover, compliance with open standards and use of Web services and proven Internet architectures ensure broad applicability and superior performance for Zitgist's technologies.


The Web in Transition

The Web is in transition. While there are no real beginning and end points, there is a steady progression from a document-centric Web ("Web of documents") to one that is data-centric, including the mediation of semantics (the "Web of data"):

Transition in Web Structure

Document Web
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993

Structured Web
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003

Semantic Web: Linked Data
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2007

Semantic Web
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric
  • circa ???

We are in the midst of a transition phase — Linked Data — that marks the beginning of the dominance of data on the Web.  Linked Data is a direct precursor to the semantic Web with its emphasis on RDF and data interoperability and services.

RDF - Resource Description Framework

RDF — Resource Description Framework — is a data representation model that uses a “triple” of subject-predicate-object, as generally defined by the W3C’s standard RDF model.  Triples are used to represent informational entities or objects as an assertion or "fact".  In such triples, subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. (You can think of subjects and objects as nouns, predicates as verbs, and even think of the triples themselves as simple Dick-and-Jane sentences from a child's beginning reader.)

Resources and predicates are each given a URI, as are most objects (except those specified as a literal), so that there is a single, unique reference for each item. (These conventions are great for machines, but the length and complexity of the URIs can look daunting to humans; for example, ‘Dick’ seems much more complicated when it is expressed as http://www.dick-is-the-subject-of-this-discussion.com/identity/dickResolver/DicksOpenID.xml.)
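As a rough illustration of the subject-predicate-object model, a triple can be sketched as a small Python structure. The example.org URIs and the Triple, Literal and describe names are assumptions made up for this sketch; the two predicates are terms from the FOAF vocabulary.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A plain value (not a URI), such as a name or a number."""
    value: str


class Triple(NamedTuple):
    subject: str                # URI of the resource being described
    predicate: str              # URI naming the trait or relationship
    obj: Union[str, Literal]    # either another URI or a literal value


# Hypothetical resource URIs, used only for illustration.
DICK = "http://example.org/people/dick"
JANE = "http://example.org/people/jane"
KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"

facts = [
    Triple(DICK, KNOWS, JANE),             # "Dick knows Jane."
    Triple(DICK, NAME, Literal("Dick")),   # "Dick is named 'Dick'."
]


def describe(triple: Triple) -> str:
    """Render a triple in an N-Triples-like form."""
    o = triple.obj
    rendered = f'"{o.value}"' if isinstance(o, Literal) else f"<{o}>"
    return f"<{triple.subject}> <{triple.predicate}> {rendered} ."


for fact in facts:
    print(describe(fact))
```

Each printed line reads like one of the simple sentences described above: a subject, a predicate, and an object that is either another resource or a literal.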

These URI lookups can themselves be an individual assertion, an entire specification (as is the case, for example, when referencing the RDF or XML standards), or a complete or partial ontology for some domain or world-view. While the RDF data is often stored and displayed using XML syntax, that is not a requirement. Other RDF forms may include N3 or Turtle syntax, and variants or more schematic representations of RDF also exist.

In all statements, the predicates point to reference URIs that precisely define the schema or controlled vocabularies used in that triple. Depending on provenance, source format, use of aliases, or other changes to make the display of triples more readable, it may at times be necessary to “dereference” what is displayed to obtain the URI values to trace or navigate the actual triple linkages. Dereferencing in this case means translating the displayed portion (the “reference”) of a triple to its actual value and storage location, which means providing its linkable URI value. Note that literals are already actual values and thus are not “dereferenced”.
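In the sense used here, dereferencing a displayed term can be sketched as expanding a short, prefixed alias into its full URI. The prefix table and the expand function below are assumptions for illustration only; a real document declares its own prefixes.

```python
# Hypothetical prefix table mapping display aliases to namespace URIs.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Translate a displayed reference into its full, linkable URI.

    Quoted literals and terms that are already full URIs are returned
    unchanged, since they need no dereferencing.
    """
    if term.startswith('"'):
        return term
    if term.startswith(("http://", "https://")):
        return term
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    raise KeyError(f"unknown prefix in {term!r}")


displayed = ("foaf:name", '"Dick"')
print([expand(term) for term in displayed])
```

The literal passes through untouched, while the alias foaf:name becomes a URI that can be followed to the vocabulary that defines it.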

The great thing about RDF is how well it lends itself, through subsequent logic, to mapping and mediating concepts from different sources into an unambiguous semantic representation [my ‘glad’ == (is the same as) your ‘happy’ OR my ‘glad’ is your ‘glad’]. Further, with additional structure, such as through RDF Schema (RDFS) or the various dialects of OWL (Web Ontology Language) and their formal ontologies, it is also possible to draw inferences and machine reason on the data.
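One simple way to picture this mediation is to declare which terms mean the same thing and then rewrite every triple to use one canonical term per group. The sketch below does this with a small union-find; the mine: and yours: terms, and the build_canonical and merge helpers, are hypothetical and stand in for what a real system would express with ontology mappings.

```python
def build_canonical(same_as_pairs):
    """Return a function mapping any term to its group's canonical term.

    Terms declared equivalent are grouped with a small union-find; the
    canonical term is the alphabetically first member of each group.
    """
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]   # path halving
            term = parent[term]
        return term

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return find


def merge(datasets, same_as_pairs):
    """Combine datasets, rewriting equivalent terms to one canonical term."""
    canon = build_canonical(same_as_pairs)
    return {
        tuple(canon(term) for term in triple)
        for dataset in datasets
        for triple in dataset
    }


mine = {("ex:alice", "ex:feels", "mine:glad")}
yours = {("ex:bob", "ex:feels", "yours:happy")}
merged = merge([mine, yours], [("mine:glad", "yours:happy")])
print(sorted(merged))
```

After merging, both statements point at the same term, so a single query for that term finds Alice and Bob alike.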

Linked Data

Linked Data follows recommended practices for identifying, exposing and connecting this RDF data on the semantic Web.  Linked Data uses the standards of the Web to create typed links between data from different sources.  The three principles of Linked Data are to use:
  1. The RDF data model to publish structured data on the Web
  2. RDF links to interlink data from different data sources
  3. Ontologies to define the data relationships and semantics between sources so as to facilitate the actual interlinking.

A robust Linking Open Data (LOD) community has rapidly developed around the practice since its approval as a formal project of the W3C’s Semantic Web Education and Outreach (SWEO) Interest Group in March 2007. Though counts rapidly become dated, today, in less than a year, the size of the Linked Data on the Web exceeds several billion RDF triples.

This foundation of public interlinkable data comes from the highest-value reference sources available, and includes the most notable place, people, event, book, music, cultural, language and government entities. The following official figure of the LOD community, which is updated frequently, shows well the breadth of this data value:

[Figure: Linked Data Web]

This publicly available Linked Data is growing rapidly and can be conjoined with any and all enterprise data or non-linked data in other formats.  Those datasets merely await conversion.

The Linked Data principles can be applied equally well to existing data.  In this manner, other standard data sources in non-Linked Data format and internal enterprise data stores can be made interoperable.

Linked Data Conversion

The structure extraction necessary to construct an RDF “triple” is thus pivotal, and may require multiple steps. Depending on the nature of the starting content and the participation or not of the site publisher, there is a range of approaches.

Where linked data does not already exist, the Virtuoso Sponger is the enabling technology used by Zitgist for these conversions.  The Sponger follows a cascading pipeline to process input files of heterogeneous origin.

The pipeline first checks to see if the file is RDF, in which case it processes the data directly and then cascades through a series of ontology checks.  If RDF is not returned, the Sponger passes the data through a metadata extraction pipeline that cascades through a variety of extractors, called "cartridges".  Assuming an applicable cartridge is found, the extracted data is transformed into RDF via a mapping pipeline. RDF entities (instance data) are generated from the input data by way of ontology matching and mapping.  Then, the structured RDF Linked Data is generated.

Even if a specific cartridge is lacking, the Sponger is able to extract some minimum information from the basic Web page. Though minimally useful, this fallback falls short of full data conversion and is not advisable for purposeful conversion projects.
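The cascade described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the Sponger's actual interface: the sponge function, the vCard cartridge, the RDF check and the title-only fallback are all hypothetical stand-ins.

```python
import re
from typing import Callable, Optional

Triple = tuple[str, str, str]
# A cartridge returns triples if it applies to the input, otherwise None.
Cartridge = Callable[[str, str], Optional[list[Triple]]]

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def vcard_cartridge(url: str, body: str) -> Optional[list[Triple]]:
    """Hypothetical cartridge: picks up 'FN:' lines from vCard-like text."""
    if "BEGIN:VCARD" not in body:
        return None          # not applicable; let the next cartridge try
    names = re.findall(r"^FN:(.+)$", body, flags=re.MULTILINE)
    return [(url, FOAF_NAME, name.strip()) for name in names]


def fallback(url: str, body: str) -> list[Triple]:
    """Minimal extraction from a plain Web page: just its <title>."""
    match = re.search(r"<title>(.*?)</title>", body,
                      flags=re.IGNORECASE | re.DOTALL)
    return [(url, DC_TITLE, match.group(1).strip())] if match else []


def sponge(url: str, body: str,
           cartridges: list[Cartridge]) -> tuple[str, list[Triple]]:
    """Return (stage_name, triples) for one fetched resource."""
    if "<rdf:RDF" in body:          # crude stand-in for real RDF detection
        return "rdf", []            # parsing the RDF itself is omitted here
    for cartridge in cartridges:
        triples = cartridge(url, body)
        if triples is not None:     # first applicable cartridge wins
            return "cartridge", triples
    return "fallback", fallback(url, body)


card = "BEGIN:VCARD\nFN:Jane Doe\nEND:VCARD"
print(sponge("http://example.org/card", card, [vcard_cartridge]))
print(sponge("http://example.org/", "<html><title> Home </title></html>",
             [vcard_cartridge]))
```

The first call is handled by the matching cartridge; the second finds no cartridge and drops to the minimal fallback, which yields only the page title.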

RDF generation is done on the fly either using built-in XSLT processors, or in the case of GRDDL, its associated XSLT and local or remote XSLT processors. The RDF generation is based on an internal mapping table that associates the source's data type with matching ontologies (see next section).  The number of ontologies handled by the Sponger is being increased constantly.  A more detailed explanation of the Virtuoso Sponger is available in an OpenLink white paper.

Zitgist now has about 100 different converters in hand, and routinely adds new ones.  For a listing of these converters and further background, see the expanded section under Linked Data Conversion services.

Controlled Vocabularies and Ontologies

Ontology is one of those daunting terms for people first exposed to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, meaning being or the nature of things. Literally, and in classical philosophy, ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined an ontology as an “explicit specification of a conceptualization.”  In this context, understanding, using and manipulating ontologies is important because:

  • Depending on their degree of formalism, ontologies help make explicit the scope, definition, and language and meaning (semantics) of a given domain or world view
  • Ontologies may provide the power to generalize about their domains
  • Ontologies, if hierarchically structured in part (and not all are), can provide the power of inheritance
  • Ontologies provide guidance for how to correctly “place” information in relation to other information in that domain
  • Ontologies may provide the basis to reason or infer over their domains (again as a function of their formalism)
  • Ontologies can provide a more effective basis for information extraction or content clustering
  • Ontologies, again depending on their formalism, may be a source of structure and controlled vocabularies helpful for disambiguating context; they can inform and provide structure to the “lexicons” in particular domains
  • Ontologies can provide guiding structure for browsing or discovery within a domain, and
  • Ontologies can help relate and “place” other ontologies or world views in relation to one another; in other words, ontologies can organize ontologies from the most specific to the most abstract.

Ontologies are thus a central component of the semantic Web and Linked Data.  Their use is essential to resolving the semantics and relationships that truly allow data to interoperate.

The ontologies of import to Linked Data are written in RDF (Resource Description Framework) or other language variants based on RDF. Use of common frameworks enables descriptions of different domains and the instance data within them to be semantically related via their ontology constructs.

RDF ontologies bear some resemblance to the more familiar XML Document Type Definitions (DTDs) and XML Schema, though with important differences.  First, an RDF ontology does not require an XML serialization (though many are written in XML).  Second, DTDs and XML Schema do not indicate how a document should be interpreted; they only restrict the set of elements that can be used in any given file.  Third, and most importantly, RDF ontologies, which are themselves written in RDF, provide definitions for relations and organizations between higher-level things.

These definitions are the basis for bringing workable semantics to RDF ontologies.  They began with RDF Schema (RDFS), which introduces the notion of a class and provides class hierarchies as well as the domains and ranges of relations.  The further extensions in the various dialects of the Web Ontology Language (OWL) define more classes and predicates that enable more expressive logics for inferencing and data mapping.
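To make the idea of class-based inference concrete, here is a minimal sketch of two RDFS-style rules: subclass relations are transitive, and an instance of a class is also an instance of its superclasses. The ex: terms are hypothetical, prefixed names are used instead of full URIs for brevity, and a real reasoner covers many more rules (domains, ranges, properties and so on).

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Apply two RDFS-style rules until no new triples appear."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in known:
                # Rule 1: A subClassOf B, B subClassOf C => A subClassOf C
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # Rule 2: x type A, A subClassOf B => x type B
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= known:        # nothing new was derived; stop
            return known
        known |= new


data = {
    ("ex:Rex", RDF_TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
}
inferred = infer(data)
print(("ex:Rex", RDF_TYPE, "ex:Animal") in inferred)
```

Nothing in the input states that Rex is an animal; that fact follows from the class hierarchy alone, which is the kind of inheritance the ontology provides.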

There are now literally hundreds of prominent RDF ontologies, most of which are dedicated to specific domains such as finance, products and bioinformatics, among many others.  However, some of the broader ones with potential applicability to most engagements include:

  • DOAP: Description Of A Project, an RDF schema and XML vocabulary to describe open-source projects
  • Dublin Core: a standard, official metadata vocabulary for libraries and information science
  • FOAF: Friend of a Friend, an RDF schema for machine-readable modelling of homepage-like profiles and social networks
  • GeoNames: integrates geographical data such as names of places in various languages, elevation, population and others from various sources
  • GRDDL: a markup format for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT
  • Microformats: (sometimes abbreviated μF or uF) pieces of markup that allow expression of semantics in an HTML (or XHTML) Web page. Programs can extract meaning from a Web page that is marked up with one or more microformats
  • OPML: Outline Processor Markup Language, an XML format for outlines that is commonly used to exchange lists of Web feeds between feed aggregators
  • RDFa: a set of extensions to XHTML from the W3C. RDFa takes attributes from XHTML's meta and link elements and generalises them so that they are usable on all elements, allowing XHTML markup to be annotated with semantics
  • SIOC: Semantically-Interlinked Online Communities, written in RDFS, which interconnects discussion methods such as blogs, forums and mailing lists to each other
  • SKOS: the Simple Knowledge Organisation System, a family of formal languages designed for representing thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary; it is built upon RDF and RDFS
  • YAGO: "Yet Another Great Ontology", a WordNet structure placed on top of Wikipedia.

Of course, these ontologies are also supported by the four that have Zitgist principals as lead editors, especially the subject concept matching ontology, UMBEL.  See the partner page on Ontologies for more information.

In general, though, Dublin Core, FOAF, GeoNames, GRDDL, microformats, SKOS and UMBEL are arguably the more important frameworks to supplement most domains.

Data Storage and Management

Once converted, Linked Data needs to be managed at scale with appropriate functionality.  Zitgist uses the OpenLink Virtuoso universal server for these purposes.

The section on Zitgist's Linked Data Platform (zLDP) product describes this server-based software in more detail.  But, another point is important in relation to the open standards and foundations for Zitgist's offerings.

Moving forward, one central feature is the huge diversity of formats, protocols and legacies that govern existing data.  The guiding thesis of the Virtuoso server, and by extension of zLDP, is complete interoperable functionality with all available sources, open or proprietary.  The diversity of the formats and protocols that can work with Virtuoso is truly impressive.  Some are indeed obscure, and many are legacy:  it is never easy to predict market winners in advance and even second-tier standards can have substantial user bases.  OpenLink understands this and has been committed for 15 years to providing universal access.

This commitment has accumulated to provide a technology base suitable for transitioning to Linked Data that is unparalleled.  Furthermore, that very same commitment also reflects itself in the consuming protocols of presenting and using data results.  Thus, from generation to consumption, Virtuoso is the right basis for building Linked Data platforms.

Scalability

Converting data into a common, interoperable RDF framework is essential for interoperability, but says little about workability and usability.  Even in a consistent RDF triple model, Linked Data still must be able to be retrieved, queried, searched and viewed and — most critically  — to do so at scale.  Indeed, these scale issues will only grow in importance as the scope of Linked Data embraces the entire Internet.

Major providers of data at scale on the Web, such as Google, Amazon and eBay, demonstrate how so-called RESTful designs and architectures can scale through the large-scale addition of commodity servers.  These lessons are now pretty well understood, and "cloud" computing and "big table" approaches to large index integrations should provide good confidence about general scale-up issues. OpenLink Software understands these issues well and writes frequently on related architecture and design.

But the nature of RDF triples and consolidation of multiple, broad datasets poses different and new issues for Linked Data performance at scale.  One approach, organizing data via "quads" with specific graph (dataset) conventions, is clearly desirable and is one that Virtuoso has already taken.  Likely, new indexing structures and other structural data organization innovations will also be required.
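The idea of a quad can be sketched as a triple plus the name of the graph (dataset) it came from. The QuadStore class and the example graph URIs below are hypothetical and say nothing about how Virtuoso lays out its data internally; the sketch only shows why tagging each triple with its graph is useful when consolidating sources.

```python
from collections import defaultdict


class QuadStore:
    """Toy store that keeps each triple together with its source graph."""

    def __init__(self):
        self._by_graph = defaultdict(set)

    def add(self, subject, predicate, obj, graph):
        """Record the triple as belonging to the named graph."""
        self._by_graph[graph].add((subject, predicate, obj))

    def triples(self, graph=None):
        """Return triples from one graph, or from all graphs if None."""
        if graph is not None:
            return set(self._by_graph.get(graph, ()))
        return set().union(*self._by_graph.values())

    def graphs_containing(self, triple):
        """Return the names of every graph that asserts this triple."""
        return {g for g, ts in self._by_graph.items() if triple in ts}


store = QuadStore()
store.add("ex:Paris", "ex:country", "ex:France", graph="http://example.org/geo")
store.add("ex:Paris", "ex:country", "ex:France", graph="http://example.org/wiki")
store.add("ex:Paris", "ex:population", "2100000", graph="http://example.org/wiki")

print(len(store.triples()))                          # merged view
print(len(store.triples("http://example.org/geo")))  # one dataset only
```

The merged view de-duplicates the fact asserted by both sources, while the graph name still records where each statement originated.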

On all measures, Virtuoso (and therefore zLDP) wins every RDF and Linked Data performance benchmark.  Moreover, the same technical posts previously mentioned demonstrate continuous innovation and an unmatched sophistication on performance questions and the unique challenges posed by RDF and Linked Data.

The view that a triple store must be a pure, low-level repository that maintains only triples and operates only on that construct is immaterial.  The real benchmark is whether an RDF datastore can export and expose its data via conformant APIs.  On this basis, the focus should rightfully be on improving performance and scalability internally without sacrificing external interoperability.  And performance must include the ability to handle real-world queries and retrieval demands.

In the end, there should be no real concerns about converting existing resources to Linked Data.  What can be readily converted from legacy formats to Linked Data can also be easily reversed.  Meanwhile, the benefits of interoperable Linked Data remain clear.

Querying and Retrieval

The standard basis for querying RDF triple stores is through SPARQL (pronounced "sparkle"). SPARQL is an RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language.

The SPARQL query language is somewhat modeled after SQL, though less developed and with no update function (SPARQL/Update, nicknamed SPARUL, is one option that addresses this, and is also supported by zLDP).  In its early gestation, there were a number of competing alternatives to SPARQL that have since lost steam.  A similar winnowing of competing approaches preceded the emergence of SQL for relational databases.  And, as with SQL, it is quite likely SPARQL will also transition through a number of reference implementations.  One of the first incorporations, for instance, will likely be SPARUL.

Another challenge is full-text searching.  A still further challenge is integration with conventional RDBMSs.  Again, zLDP incorporates these functions; internally they may not be fully conformant with existing standards, but they can be exposed to the outside world in conforming ways.

Nonetheless, in its current version, the SPARQL language has four types of queries:  SELECT, ASK, DESCRIBE, and CONSTRUCT.  The SELECT query is most similar to SQL in that a query provides a tabular response.  However, and this is fundamental, the basis of SPARQL queries is the resource, and not the SQL field.  Also, SPARQL, like SQL, has a WHERE clause, but one that is grounded in graphs.
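To show what a WHERE clause "grounded in graphs" means, the sketch below evaluates a SELECT-style pattern over in-memory triples in plain Python. It is a toy matcher, not a SPARQL engine; the match and select helpers, the ex: terms and the prefixed-name shorthand are all assumptions. The pattern corresponds roughly to the SPARQL query shown in the comment.

```python
# Roughly equivalent SPARQL (for orientation only):
#   SELECT ?name WHERE { ?person foaf:knows ex:jane .
#                        ?person foaf:name  ?name . }

def match(pattern, triples, binding=None):
    """Yield variable bindings that satisfy every triple in the pattern.

    Terms beginning with '?' are variables; all other terms must match
    the data exactly.
    """
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


def select(variables, pattern, triples):
    """Return a sorted table of distinct rows for the requested variables."""
    rows = {tuple(b[v] for v in variables) for b in match(pattern, triples)}
    return sorted(rows)


data = [
    ("ex:dick", "foaf:knows", "ex:jane"),
    ("ex:dick", "foaf:name", "Dick"),
    ("ex:spot", "foaf:name", "Spot"),
]
pattern = [
    ("?person", "foaf:knows", "ex:jane"),
    ("?person", "foaf:name", "?name"),
]
print(select(["?name"], pattern, data))
```

The shared ?person variable is what joins the two patterns: only a resource that satisfies both contributes a row, which is how the tabular SELECT result emerges from a graph.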

Another real failing of current Linked Data and semantic Web query frameworks is a too literal view of triples and RDF. (Indeed, the jargon of the whole approach is often the worst enemy of the Linked Data promise.)  SQL was never able to bring data retrieval to the capability of standard users, and SPARQL in whatever standards form will likely not be able to do so either.

Just as Linked Data is grounded in the Web and the Internet, search, query and retrieval paradigms must be so as well.  And for that to occur, the emphasis needs to embrace user interface and usability issues as well as the technical abilities of retrieval languages represented in typical standards.

Viewing

Viewing data and results is listed last not because it is the least important consideration; quite the opposite.  But data viewing is the end of the data processing pipeline and the natural outcome of this technology and foundational discussion.  

Zitgist's perspective on, and capabilities for, viewing the unique aspects of Linked Data are evident on this site, with supporting discussion under zLinks, the DataViewer and the Query Builder.
