The Web in Transition
The Web is in transition.
While there are no real beginning and end points, there is a steady
progression from a document-centric Web ("Web of documents") to one
that is data-centric,
including the mediation of semantics (the "Web of data"):
| Document Web | Structured Web | Linked Data | Semantic Web |
|---|---|---|---|
| Document-centric | Data-centric | Data-centric | Data-centric |
| Document resources | Structured data | Linked data | Linked data |
| Unstructured data and semi-structured data | Semi-structured data and structured data | Semi-structured data and structured data | Semi-structured data and structured data |
| HTML | XML, JSON, RDF, etc. | RDF, RDF-S | RDF, RDF-S, OWL |
| URL-centric | URI-centric | URI-centric | URI-centric |
| circa 1993 | circa 2003 | circa 2007 | circa ??? |
We are in the midst of a transition phase — Linked Data — that
marks the beginning of the dominance of data on the Web.
Linked Data is a direct precursor to the semantic
Web with its emphasis on RDF and data interoperability and services.
RDF - Resource Description Framework
RDF (Resource Description Framework) is a data representation model
that uses a “triple” of subject-predicate-object, as generally
defined by the W3C’s standard RDF model.
Triples are used to represent informational entities
or objects as an assertion or "fact". In such triples, the subject
denotes the resource, and the predicate denotes
traits or aspects of the resource and expresses a relationship between
the subject and the object.
(You can think of subjects and objects as nouns, predicates
as verbs, and even think of the triples themselves as simple
Dick-and-Jane sentences from a child's beginning reader.)
Resources and predicates, as well as most objects (except
those specified with a literal), are given a URI so that there is
a single, unique reference for each
item. (These conventions are great for machines, but the length of
the URIs can make even simple things look complicated
to humans; for example, ‘Dick’
seems much more complicated when it is expressed as http://www.dick-is-the-subject-of-this-discussion.com/identity/dickResolver/DicksOpenID.xml.)
What these URIs resolve to can itself be an individual assertion, an
entire specification (as is the case, for example, when referencing the
RDF or XML standards), or a complete or partial ontology for some
domain or world-view. While RDF data is often stored and displayed
using XML syntax, that is not a requirement. Other RDF forms
include N3 or Turtle syntax, and variants or
more schematic representations of RDF also exist.
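As a minimal sketch of the triple model described above, the following
Python represents a triple as a plain tuple and writes it out in an
N-Triples-style line form. The example.org URIs, the Triple and Literal
types, and the to_ntriples helper are all hypothetical names chosen for
illustration; a real application would normally rely on an RDF library.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain string value, as opposed to a URI reference."""
    value: str


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property or relationship
    obj: object     # either a URI string or a Literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples-style line."""
    if isinstance(t.obj, Literal):
        escaped = t.obj.value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


EX = "http://example.org/"  # hypothetical namespace, for illustration only

triples = [
    Triple(EX + "Dick", EX + "knows", EX + "Jane"),     # object is a resource
    Triple(EX + "Dick", EX + "name", Literal("Dick")),  # object is a literal
]
lines = [to_ntriples(t) for t in triples]
print("\n".join(lines))
```

The same tuples could equally be written out as RDF/XML or Turtle; the
serialization is separate from the underlying subject-predicate-object
model.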
In all statements, the predicates point to reference URIs that
precisely define the schema or controlled vocabularies used in that
triple. Depending on provenance, source format, use of
aliases, or other
changes made to render triples more readable, it may at times
be necessary to “dereference” what is displayed to obtain the URI
values needed to trace or navigate the actual triple linkages. Dereferencing in
this case means translating the displayed portion (the “reference”) of
a triple to its actual value and storage location, which means
providing its linkable URI value. Note that literals are already actual
values and thus are not “dereferenced”.
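One common case of the translation described above is expanding a
shortened, prefixed name back to its full URI. The sketch below assumes
a simple prefix table; the rdf and foaf namespace URIs are the published
ones, while the ex prefix and the expand function are hypothetical.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://example.org/",  # hypothetical namespace
}


def expand(term: str) -> str:
    """Return the full URI for a prefixed name.

    Quoted literals and terms that are already full URIs are returned
    unchanged, since they are already actual values.
    """
    if term.startswith('"') or term.startswith(("http://", "https://")):
        return term
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    raise KeyError(f"unknown prefix in {term!r}")


# A triple as it might be displayed to a human reader.
displayed = ("ex:Dick", "foaf:name", '"Dick"')
resolved = tuple(expand(t) for t in displayed)
print(resolved)
```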
The great thing about RDF is how well it lends itself,
through subsequent logic, to mapping and
mediating concepts from different sources into an unambiguous semantic
representation [my ‘glad’ == (is the same as) your ‘happy’
OR my ‘glad’ is your ‘glad’].
Further, with additional structure, such as through RDF Schema
(RDFS)
or the various dialects of OWL
(Web Ontology Language) and their formal ontologies, it is
also possible to draw inferences and apply machine reasoning to the
data.
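The ‘glad’ and ‘happy’ mapping can be sketched as follows. This is an
illustration only: the two example.org vocabularies are hypothetical,
and the equivalence list stands in for the kind of "same as" statements
an ontology would supply. Terms declared equivalent are rewritten to a
single representative so that data from both sources lines up.

```python
def canonical_map(equivalences):
    """Map every term to one representative of its equivalence group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we walk it
            x = parent[x]
        return x

    for a, b in equivalences:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Pick the lexicographically smaller URI so the result is stable.
            parent[max(ra, rb)] = min(ra, rb)
    return {term: find(term) for term in list(parent)}


MINE = "http://example.org/mine/"    # hypothetical vocabulary of source 1
YOURS = "http://example.org/yours/"  # hypothetical vocabulary of source 2

# my 'glad' is the same as your 'happy'
mapping = canonical_map([(MINE + "glad", YOURS + "happy")])

my_data = [(MINE + "Dick", MINE + "feels", MINE + "glad")]
your_data = [(YOURS + "Jane", YOURS + "feels", YOURS + "happy")]

# Rewrite every term to its representative; unmapped terms stay as they are.
merged = [tuple(mapping.get(term, term) for term in triple)
          for triple in my_data + your_data]
print(merged)
```

After the rewrite both statements use the same object URI, so a query
for everyone who is ‘glad’ would find Dick and Jane alike.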
Linked Data
Linked Data follows recommended practices for identifying, exposing
and connecting this RDF data on the semantic Web. Linked
Data uses the standards of the Web to create typed links
between data from different
sources. The three principles of Linked Data are to use:
- The RDF data model to publish structured data on the Web
- RDF links to interlink data from different data sources
- Ontologies to define the data relationships and
semantics between sources so as to facilitate the
actual interlinking.
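A small sketch of these principles in practice: two datasets are
published separately as triples, and one typed RDF link connects a
resource in the first to a resource in the second. All URIs, the
vocabulary terms and the data values below are invented for
illustration.

```python
A = "http://example.org/people/"  # hypothetical dataset A
B = "http://example.net/places/"  # hypothetical dataset B
V = "http://example.org/vocab/"   # hypothetical shared vocabulary

people = {
    (A + "jane", V + "name", '"Jane"'),
    (A + "jane", V + "livesIn", B + "springfield"),  # typed link into dataset B
}
places = {
    (B + "springfield", V + "name", '"Springfield"'),
    (B + "springfield", V + "population", '"30000"'),  # made-up value
}


def describe(graph, subject):
    """Collect the predicate -> object pairs stated about one subject."""
    return {p: o for s, p, o in graph if s == subject}


# Because both datasets share the triple model, merging is a set union.
web = people | places

# Follow the typed link from the person to the place and read its name.
home = describe(web, A + "jane")[V + "livesIn"]
home_name = describe(web, home)[V + "name"]
print(home, home_name)
```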
A robust Linking Open Data (LOD) community has rapidly developed
around the practice since its approval as a formal project of the
W3C’s Semantic Web Education
and Outreach (SWEO) Interest Group in
March 2007.
Though counts rapidly become dated, in less than a year the
size of the Linked Data on the Web has come to exceed several billion
RDF triples.
This foundation of public interlinkable data comes from the highest
value
reference sources available, and includes most notable place, people,
event, book, music, cultural, language and government entities. The
official figure of the LOD community, which is updated
frequently, shows well the breadth of this data value.
This publicly available Linked Data is growing rapidly and can
be conjoined with any and all enterprise data or non-linked
data in other formats. Those datasets merely await
conversion.
The Linked Data principles can be applied equally
well to existing data.
In this manner, other standard data sources in non-Linked
Data format and internal enterprise data stores can be made
interoperable.
Linked Data Conversion
The structure extraction necessary to construct an RDF “triple”
is
thus pivotal, and may require multiple steps. Depending on the nature
of the starting content and whether or not the site
publisher participates, there is a range of approaches.
Where linked data does not already exist, the Virtuoso
Sponger is the enabling technology used by
Zitgist for these conversions. The Sponger follows a
cascading pipeline to process input files of heterogeneous origin.
The pipeline first checks to see if the file is RDF, in which
case it processes the data directly and then cascades through a series
of ontology checks. If RDF is not returned, the
Sponger passes the data through a metadata extraction pipeline that
cascades through a variety of extractors, called "cartridges".
Assuming an applicable cartridge is found, the extracted data
is transformed into RDF via a mapping pipeline. RDF entities (instance
data) are generated from the input data by way of ontology matching and
mapping. Then, the structured RDF Linked
Data is generated.
However, even if a specific cartridge is lacking, the Sponger
is able to extract some minimum information from the basic Web
page. Though minimally useful, this fallback falls
short of full data conversion and is not advisable for purposeful
conversion projects.
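The cascade described above can be sketched in outline. This is not the
Sponger's actual API: the function names, the single JSON "cartridge"
and the example.org vocabulary are hypothetical stand-ins that only
mirror the order of the checks (RDF first, then cartridges, then a
minimal fallback).

```python
import json

RDF_TYPES = {"application/rdf+xml", "text/turtle", "text/n3"}
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
EX = "http://example.org/vocab/"  # hypothetical vocabulary


def json_cartridge(url, content_type, body):
    """Hypothetical cartridge: map top-level JSON keys to predicates."""
    if content_type != "application/json":
        return None  # not applicable; let the cascade continue
    record = json.loads(body)
    return [(url, EX + key, str(value)) for key, value in sorted(record.items())]


CARTRIDGES = [json_cartridge]


def fallback(url, body):
    """Last resort: keep only the page title, if there is one."""
    if "<title>" in body and "</title>" in body:
        title = body.split("<title>", 1)[1].split("</title>", 1)[0].strip()
        return [(url, DC_TITLE, title)]
    return []


def convert(url, content_type, body):
    """Return (route, triples), where route names the branch that fired."""
    if content_type in RDF_TYPES:
        return "rdf", []  # a real system would parse the RDF directly here
    for cartridge in CARTRIDGES:
        triples = cartridge(url, content_type, body)
        if triples is not None:
            return cartridge.__name__, triples
    return "fallback", fallback(url, body)


route, triples = convert("http://example.org/item/1", "application/json",
                         '{"name": "Widget", "price": 5}')
print(route, triples)
```

Adding support for a new source format in this sketch means appending
one more function to CARTRIDGES; the fallback branch yields far less
than a dedicated cartridge would.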
RDF generation is done on the fly either using built-in XSLT
processors, or in the case of GRDDL, its associated
XSLT and local or remote XSLT processors. The RDF
generation is based on an internal mapping table
that associates the source's data type with matching
ontologies (see next section). The
number of ontologies handled by the Sponger is being increased
constantly. A more detailed explanation of the Virtuoso
Sponger is
available in an OpenLink
white paper.
Zitgist now has about 100 different converters in
hand, and routinely adds new ones. For a listing of
these
converters and further background, see the expanded section under Linked Data
Conversion services.
Controlled Vocabularies and Ontologies
Ontology is
one of those daunting terms for people first exposed
to the semantic
Web. Not only is the word long and without many common antecedents, but
it is also a term that has widely divergent use and understanding
within the community. It can be argued that this not-so-little word is
one of the barriers to mainstream understanding of the semantic Web.
The root of the term is the Greek ontos, meaning being or the nature of things.
Literally — and in classical philosophy — ontology
was used in relation to the study of the nature of being or the world,
the nature of existence. Tom
Gruber, among others, made the term popular in relation to
computer science and artificial intelligence about 15 years ago when he
defined ontology as a “formal specification of a conceptualization.”
In this context, understanding, using and manipulating
ontologies is important because:
- Depending on their degree of formalism,
ontologies help make explicit the scope, definition, and language and
meaning (semantics)
of a given domain or world view
- Ontologies may provide the power to generalize about their
domains
- Ontologies, if hierarchically structured in part (and not
all are), can provide the power of inheritance
- Ontologies provide guidance for how to correctly “place”
information in relation to other information in that domain
- Ontologies may provide the basis to reason or infer over
its domain (again as a function of its formalism)
- Ontologies can provide a more effective basis for
information extraction or content clustering
- Ontologies, again depending on their formalism, may be a
source of
structure and controlled vocabularies helpful for disambiguating
context; they can inform and provide structure to the “lexicons” in
particular domains
- Ontologies can provide guiding structure for browsing or
discovery within a domain, and
- Ontologies can help relate and “place” other ontologies or
world
views in relation to one another; in other words, ontologies can
organize ontologies from the most specific to the most abstract.
Ontologies are thus a
central component of the semantic Web and Linked Data.
Their use is essential to resolving the semantics and
relationships that truly allow data to interoperate.
The ontologies of import to Linked Data are written
in RDF (Resource Description
Framework) or other language variants based on RDF.
Use of common frameworks enables descriptions of
different domains and the instance data within them to be semantically
related via their ontology constructs.
RDF ontologies bear some resemblance to the more
familiar XML Document Type
Definitions (DTDs) and XML Schema, though with important differences.
First, an RDF ontology does not require an XML serialization
(though many
are written in XML). Second, DTDs and XML Schema do
not indicate how a document should be interpreted; they
only
restrict the set of elements that can be used in any given file.
Third, and most importantly, RDF
ontologies, which are themselves written in RDF, provide definitions
for relations
and organizations between higher-level things.
These extensions are the basis for bringing workable semantics
to RDF ontologies. They began with RDF
Schema (RDFS), which introduces the notion of a class and provides
hierarchies, ranges and domains for relations. The further
extensions in the various dialects of the Web Ontology Language (OWL)
define more classes and predicates that enable higher-order logics for
inferencing and data mapping.
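To make the role of classes, domains and ranges concrete, here is a
minimal sketch of three RDFS-style inference rules applied repeatedly
until nothing new is derived. The rdf and rdfs namespace URIs are the
published ones; the example.org classes and the rdfs_closure function
are hypothetical, and the sketch covers only a small subset of RDFS.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
SUBCLASS, DOMAIN, RANGE = RDFS + "subClassOf", RDFS + "domain", RDFS + "range"
EX = "http://example.org/"  # hypothetical namespace


def rdfs_closure(triples):
    """Apply three RDFS-style rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p == RDF_TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, RDF_TYPE, o2))  # membership moves up the hierarchy
                if p2 == DOMAIN and p == s2:
                    new.add((s, RDF_TYPE, o2))  # subject belongs to the domain class
                if p2 == RANGE and p == s2:
                    new.add((o, RDF_TYPE, o2))  # object belongs to the range class
        if new <= inferred:
            return inferred
        inferred |= new


graph = {
    (EX + "Student", SUBCLASS, EX + "Person"),
    (EX + "Person", SUBCLASS, EX + "Agent"),
    (EX + "enrolledIn", DOMAIN, EX + "Student"),
    (EX + "enrolledIn", RANGE, EX + "Course"),
    (EX + "jane", EX + "enrolledIn", EX + "semweb101"),
}
closure = rdfs_closure(graph)
for triple in sorted(closure - graph):
    print(triple)
```

From the single stated fact that jane is enrolled in semweb101, the
schema lets a machine conclude that jane is a Student, a Person and an
Agent, and that semweb101 is a Course.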
Hundreds of prominent RDF ontologies have now emerged, most of
which are dedicated to specific domains such
as finance, products and bioinformatics. However, some of the
broader ones with potential
applicability to most engagements include:
- DOAP: Description Of A Project is an RDF schema
and XML vocabulary to describe open-source projects
- Dublin Core: a standard, official metadata vocabulary for libraries
and information science
- FOAF: Friend of a Friend is an RDF schema for
machine-readable modelling of homepage-like profiles and social
networks
- GeoNames: integrates geographical data such as
names of places in various languages, elevation, population and others
from various sources
- GRDDL: a markup format for Gleaning Resource Descriptions
from Dialects of Languages; that is, for getting RDF data out of XML
and XHTML documents using explicitly associated transformation
algorithms, typically represented in XSLT
- Microformats: a microformat (sometimes abbreviated μF or uF) is a
piece of markup that allows expression of semantics in an HTML (or
XHTML) Web page. Programs can extract meaning from a Web page that is
marked up with one or more microformats
- OPML: Outline Processor Markup Language is an XML format
for outlines, and is commonly used to exchange lists of web
feeds between web feed aggregators
- RDFa: a set of extensions to XHTML from the W3C.
RDFa uses attributes from XHTML's meta and link elements, and
generalises them so that they are usable on all elements,
allowing XHTML markup to be annotated with semantics
- SIOC: Semantically-Interlinked
Online Communities is written in RDFS and interconnects discussion
methods such as blogs, forums and mailing
lists to each other
- SKOS: the Simple Knowledge Organisation System is a family of
formal languages designed for representing thesauri,
classification schemes, taxonomies, subject-heading systems, or any
other type of structured controlled vocabulary; it is built
upon RDF and RDFS
- YAGO: "Yet another great ontology"
is a WordNet structure placed on top of Wikipedia.
Of course, these ontologies are also supported by the four
ontologies that have Zitgist principals as lead editors, especially the
subject concept matching ontology, UMBEL.
See the partner page on Ontologies for more information.
In general, though, Dublin Core, FOAF, GeoNames,
GRDDL, microformats, SKOS and UMBEL are arguably the more important
frameworks to supplement most domains.
Data Storage and Management
Once converted, Linked Data needs to be managed at scale with
appropriate functionality. Zitgist uses the OpenLink Virtuoso
universal server for these purposes.
The section on Zitgist's Linked Data Platform (zLDP) product
describes this server-based software in more detail.
Another point is also important in relation to the open
standards and foundations for Zitgist's offerings.
Moving forward, one central feature is the
huge diversity of formats, protocols and legacies that govern existing
data. The guiding thesis of the Virtuoso server,
and by extension zLDP,
is complete interoperable functionality with all available sources,
open or proprietary. The diversity of the formats and
protocols that can work with Virtuoso is truly impressive.
Some are indeed obscure, and many are legacy: it is
never easy to predict market winners in advance and even second-tier
standards can have substantial user bases. OpenLink
understands this and has been committed for 15 years to providing
universal access.
The result of this commitment is an unparalleled technology base
for transitioning to Linked Data.
Furthermore, that very same commitment is also reflected
in the protocols for presenting and consuming data results.
Thus, from generation to consumption, Virtuoso is the right
basis for building Linked Data platforms.
Scalability
Converting data into a common, interoperable RDF framework is
essential for interoperability, but says little about workability and
usability. Even in a consistent RDF triple model, Linked Data
must still be retrieved, queried, searched and viewed
and, most critically,
this must be possible at scale. Indeed, these scale issues will only grow
in importance as the scope of Linked Data embraces the entire Internet.
Major providers of data at scale on the Web, such as Google,
Amazon and eBay, demonstrate how so-called RESTful designs and
architectures can scale with the large-scale addition of commodity
servers. These lessons are now well understood, and
"cloud" computing and "big table" approaches to large index
integrations should provide good confidence about general scale-up
issues. OpenLink Software understands these issues well and
writes
frequently on related architecture and design.
But the nature of RDF triples and the consolidation of
multiple, broad datasets pose different and new issues for Linked Data
performance at scale. One approach, organizing data via
"quads" with specific graph (dataset) conventions, is clearly desirable,
and is an approach that Virtuoso has already taken. New
indexing structures and other structural data organization innovations
will likely also be required.
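The "quad" idea can be illustrated with a short sketch: each triple
carries a fourth element naming the graph (dataset) it came from, so
consolidated data can still be filtered or traced by source. This is an
illustration only and says nothing about how Virtuoso stores or indexes
quads; the URIs and values are invented.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace

# (subject, predicate, object, graph): values are made up for illustration.
quads = [
    (EX + "berlin", EX + "population", "100", EX + "graph/sourceA"),
    (EX + "berlin", EX + "population", "120", EX + "graph/sourceB"),
    (EX + "berlin", EX + "country", EX + "germany", EX + "graph/sourceA"),
]

# Index triples by the graph they belong to.
by_graph = defaultdict(set)
for s, p, o, g in quads:
    by_graph[g].add((s, p, o))


def match(s=None, p=None, o=None, graph=None):
    """Return (triple, graph) pairs; None acts as a wildcard."""
    graphs = [graph] if graph is not None else sorted(by_graph)
    return [
        (t, g)
        for g in graphs
        for t in sorted(by_graph.get(g, ()))
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


everywhere = match(p=EX + "population")
only_b = match(p=EX + "population", graph=EX + "graph/sourceB")
print(everywhere)
print(only_b)
```

The two sources disagree on the value, and the graph element is what
lets a consumer see which source said what.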
On all measures, Virtuoso (and therefore zLDP) wins every RDF
and Linked Data performance benchmark. Moreover, the
technical posts mentioned above demonstrate continuous innovation
and an unmatched sophistication on performance questions and the unique
challenges posed by RDF and Linked Data.
The view that there are pure, low-level triple stores
that maintain a repository of
triples only, and operate only on that construct, is immaterial.
The real benchmark is whether an RDF datastore can export and
expose its data via conformant APIs. On this basis,
the focus should rightfully be on improving performance and scalability
internally without sacrificing external interoperability. And
performance must include the ability to handle real-world queries and
retrieval demands.
In the end, there should be no real concerns about
converting existing resources to Linked Data. What can be
readily converted from legacy formats to Linked Data can also be easily
reversed. Meanwhile, the benefits of interoperable Linked
Data remain clear.
Querying and Retrieval
The standard basis for querying RDF triple stores is through SPARQL
(pronounced "sparkle"). SPARQL is an RDF query language;
its name is a recursive acronym that stands for SPARQL Protocol and RDF
Query Language.
The SPARQL query language is somewhat
modeled after SQL, though it is less developed and has no update
function. (SPARQL/Update, nicknamed SPARUL, is
one option that addresses this, and is also supported by zLDP.)
In fact, in SPARQL's early development, there were a number of
competing alternatives that have since lost steam. A similar
winnowing of protocols occurred prior to the emergence of SQL for
relational databases. And, as with SQL, it is quite
likely SPARQL will also transition through a number of reference
implementations. One of the first incorporations, for
instance, will likely be SPARUL.
Another challenge is full-text searching. A still
further challenge is integration with conventional RDBMSs.
Again, zLDP incorporates these functions. Internally they may
not be fully conformant with existing standards, but they can be
exposed to the outside world in conforming ways.
Nonetheless, in its current version, the SPARQL language has
four
types of queries: SELECT, ASK, DESCRIBE, and CONSTRUCT. The
SELECT query is most
similar to SQL in that a query provides a tabular
response. However, and this is fundamental, the basis of
SPARQL queries is the resource, and not the SQL field. Also,
SPARQL, like SQL, has a WHERE clause, but one that is grounded in graphs.
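To show what a graph-grounded WHERE clause means, the sketch below
evaluates a pattern of triples containing variables (names beginning
with "?") against a small in-memory graph. It is a toy matcher written
for illustration, not a SPARQL engine; the function names and
example.org data are hypothetical. The pattern used corresponds roughly
to the SPARQL query
SELECT ?name WHERE { ex:dick ex:knows ?friend . ?friend ex:name ?name }.

```python
EX = "http://example.org/"  # hypothetical namespace

graph = [
    (EX + "dick", EX + "knows", EX + "jane"),
    (EX + "jane", EX + "name", "Jane"),
    (EX + "dick", EX + "name", "Dick"),
    (EX + "jane", EX + "knows", EX + "spot"),
]


def match_pattern(pattern, triple, binding):
    """Extend a variable binding so that pattern matches triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(where, graph):
    """Find every binding that satisfies all triple patterns together."""
    bindings = [{}]
    for pattern in where:
        extended = []
        for binding in bindings:
            for triple in graph:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


def select(variables, where, graph):
    """SELECT-style result: a table with one row per solution."""
    return sorted(tuple(b[v] for v in variables) for b in solve(where, graph))


def ask(where, graph):
    """ASK-style result: does at least one solution exist?"""
    return bool(solve(where, graph))


where = [(EX + "dick", EX + "knows", "?friend"), ("?friend", EX + "name", "?name")]
rows = select(["?name"], where, graph)
found = ask(where, graph)
print(rows, found)
```

The shared variable ?friend is what joins the two patterns: the query
walks an edge of the graph from one resource to the next rather than
joining tables on a field.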
Another real failing of current Linked Data and
semantic Web query frameworks is an overly literal view of triples and RDF.
(Indeed, the jargon of the whole approach is often the worst
enemy of the Linked Data promise.) SQL was never able to
bring data retrieval to the capability of standard users, and SPARQL in
whatever standards form will likely not be able to do so either.
Just as Linked Data is grounded in the Web and the Internet,
search, query and retrieval paradigms must be so as well. And
for that to occur, the emphasis needs to embrace user interface and
usability issues as well as the technical abilities of retrieval
languages represented in typical standards.
Viewing
Viewing data and results is listed last not because it is the
least important consideration; quite the opposite. But data
viewing is
the end of the data processing pipeline and the natural outcome of
this technology and foundational discussion.
Zitgist's
perspective and capabilities surrounding how to view the unique aspects
of Linked Data are evident on this site, with supporting discussion under
zLinks, the DataViewer and the Query Builder.