Experience and Expertise
Zitgist's work in developing UMBEL (see the Ontologies and Ontology
Development sections) exemplifies the company's data
conversion expertise. UMBEL involves about 22,000
reference "subject concepts" that are the binding points for
inter-relating various datasets and (at the moment) about 2 million
"named entities" (people, places, organizations, things and events)
that are the instance data subsumed under these concepts.
The subject concepts were drawn from a
complete vetting of the OpenCyc
knowledge base and its 300,000 or so total concepts and predicates.
OpenCyc provides the basis for the structural and
semantic relationships between the UMBEL subject concepts.
The named entity instance data were drawn
from Wikipedia,
YAGO,
and other fact-based Web resources. Both OpenCyc and WordNet
supplied the aliases and synonyms that form the controlled
vocabulary used for disambiguation (choosing among alternative senses)
of the concepts and entities.
Processing such massive reference sets
requires both semi-automated tools and expertise. In turn, as
such sources are processed, the methods and lessons learned become
resources for new conversions. Zitgist's familiarity with
potential tools for such Linked Data purposes is exemplified by its Sweet Tools listing
of more than 650 tools.
The network effect of bootstrapping existing
Linked Data to leverage the conversion of new Linked Data is a virtuous
multiplier. With each subsequent domain and format the
conversion hurdle lowers and the degree of automation increases.
Zitgist routinely completes new conversions.
Conversions with broader applicability are made into Sponger
"cartridges" (see the next section). Most important, however, is that
every member of the Linked Data community performing such conversions
adds to the foundation of Linked Data useful for further
conversions, and to the storehouse of off-the-shelf
converters available for existing data. These third-party
converters are often called "RDFizers" (see the concluding
listing).
A key aspect of Zitgist's conversion services is understanding the
strengths and limitations of these various methods and knowing how to
improve them for specific engagements.
Virtuoso Sponger
The Virtuoso Sponger is a key enabling technology used by
Zitgist for converting existing data. The Sponger follows a
cascading pipeline to process input files of heterogeneous origin.
The pipeline first checks whether the file is RDF, in which
case it processes the data directly and then cascades through a series
of ontology checks. If the file is not RDF, the
Sponger passes the data through a metadata extraction pipeline that
cascades through a variety of extractors, called "cartridges".
Assuming an applicable cartridge is found, the extracted data
is transformed into RDF via a mapping pipeline. RDF entities (instance
data) are generated from the input data by way of ontology matching and
mapping, and the structured RDF Linked Data is then produced.
Even if a specific cartridge is lacking, the Sponger
is able to extract some minimum information from the basic Web
page. Though minimally useful, this fallback falls
short of full data conversion and is not advisable for purposeful
conversion projects.
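The cascade described above can be pictured with a short sketch. This is only an illustration of the control flow, not the Sponger's actual implementation: the function names (`sponge`, `rss_cartridge`), the set of RDF MIME types, and the shape of the returned record are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical cartridge signature: returns extracted data, or None when the
# cartridge does not apply to the input.
Cartridge = Callable[[str, str], Optional[dict]]

# Assumed set of MIME types treated as "already RDF" in this sketch.
RDF_MIME_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}


def rss_cartridge(mime_type: str, body: str) -> Optional[dict]:
    """Toy extractor that only recognizes RSS feeds."""
    if mime_type == "application/rss+xml":
        return {"kind": "feed", "source": "rss"}
    return None


def sponge(mime_type: str, body: str, cartridges: Sequence[Cartridge]) -> dict:
    """Report which stage of the cascade handled the input."""
    # Stage 1: input that is already RDF is processed directly.
    if mime_type in RDF_MIME_TYPES:
        return {"stage": "rdf", "data": body}
    # Stage 2: try each cartridge in order; the first applicable one wins.
    for cartridge in cartridges:
        extracted = cartridge(mime_type, body)
        if extracted is not None:
            return {"stage": "cartridge", "data": extracted}
    # Stage 3: no cartridge applies, so only minimal information is kept.
    return {"stage": "fallback", "data": {"length": len(body)}}
```

Calling `sponge("text/html", "<html/>", [rss_cartridge])` in this sketch ends in the fallback stage, which mirrors the point that a missing cartridge yields only minimal information.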
RDF generation is done on the fly, using either built-in XSLT
processors or, in the case of GRDDL *, its associated
XSLT and local or remote XSLT processors. The RDF
generation is based on an internal mapping table
that associates the source's data type with matching
ontologies. This mapping may use SIOC,
SKOS, FOAF, AtomOWL, Annotea bookmarks, Annotea annotations, EXIF,
or other ontologies depending on the source data.
The number of ontologies handled by the Sponger is constantly
increasing.
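As a rough illustration of such a mapping table, the following sketch associates source data types with ontologies. The table contents and the names `ONTOLOGY_MAP` and `ontologies_for` are hypothetical; the actual Sponger table is internal and is not reproduced here.

```python
# Hypothetical mapping table: source data type -> candidate ontologies.
# The pairings are illustrative assumptions, not the Sponger's real table.
ONTOLOGY_MAP = {
    "blog_post": ["SIOC"],
    "person_profile": ["FOAF"],
    "subject_heading": ["SKOS"],
    "atom_feed": ["AtomOWL"],
    "photo": ["EXIF"],
}


def ontologies_for(source_type: str) -> list:
    """Return the ontologies mapped to a source type, or an empty list."""
    return ONTOLOGY_MAP.get(source_type, [])
```

A source type without an entry returns an empty list, which a caller could treat as the signal to fall back to generic extraction.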
Generally, of course, successful conversions early in the
processing pipeline or with a matching ontology produce more accurate
and complete results. A key aspect of conversion projects,
therefore, is to move the processing earlier in the chain through
better input format recognition and targeted ontologies.
A more detailed explanation of the Virtuoso Sponger is
available in an OpenLink white paper.
Successful ongoing conversions result in adding
new Sponger cartridges to the available repository. Most of these
are made public, though some are retained for specific clients or
proprietary purposes.
Listing of Existing Formats
At present, Zitgist's conversion services recognize nearly 100
formats, and the number is constantly increasing (contact Zitgist
directly if your format is not on this list):
- RDF
- Serialization formats:
- Automatically recognized ontologies:
- SIOC
- SKOS
- FOAF
- AtomOWL
- Annotea
- Music Ontology
- Bibliographic Ontology
- EXIF
- vCard
- WSDL
- Others
- (X)HTML pages
- HTML header metadata
- Embedded microformats
- eRDF
- RDFa
- hCard
- hCalendar
- XFN
- xFolk
- Syndication Formats:
- RSS 2.0
- Atom
- OPML
- OCS
- XBEL (for bookmarks)
- GRDDL * (see note below)
- REST-style Web service APIs:
- Google Base
- Flickr
- Del.icio.us
- Ning
- Amazon
- eBay
- Freebase
- Facebook
- raw HTTP
- Etc.
- Files (multitude of file formats and MIME types, including):
- MS Office
- OpenOffice
- Open Document Format
- images
- audio
- video
- Etc.
- Data exchange formats
- iCalendar
- vCard
- XBRL
- BPEL
- TheyWorkForYou
- Virtuoso VADs
- OpenLink licence files
- Third-party metadata extraction frameworks:
Note that MIT's SIMILE RDFizers also
recognize these formats, in addition to those listed above:
And, there is a growing list of independent,
third-party RDFizers as well:
* GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is
a W3C markup format for getting RDF data out of XML and XHTML
documents using explicitly associated transformation algorithms,
typically represented in XSLT. GRDDL accommodates a
wide variety of dialects (see one listing)
and can be combined with
arbitrary transformation mechanisms (though currently mostly
based on XSLT).
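To make "explicitly associated" concrete: in XHTML, a GRDDL-enabled page declares the GRDDL profile on its head element and links to its transformations with rel="transformation". The sketch below only discovers those links using the Python standard library; it does not run the XSLT, it ignores the other association mechanisms GRDDL defines, and the function name, the sample document, and the example.org URL are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def grddl_transformations(xhtml: str) -> list:
    """Return hrefs of transformations linked from a GRDDL-enabled XHTML page."""
    root = ET.fromstring(xhtml)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs and must
    # include the GRDDL profile for the links to count.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # rel is also a space-separated list of link types.
        if "transformation" in link.get("rel", "").split() and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


# Hypothetical sample page used only for this sketch.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Example</title>
    <link rel="transformation" href="http://example.org/extract.xsl"/>
  </head>
  <body/>
</html>"""
```

The returned hrefs would then be fetched and applied by an XSLT processor, local or remote, to produce the RDF.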