Linked Data Conversion Services

Zitgist provides Linked Data conversion services to:  1) develop new converters for new domains; 2) develop new converters for new formats; and 3) convert and process non-Linked Data using existing converters.

Converting existing data to Linked Data requires a syntactic step of converting the form to RDF (Resource Description Framework) and a semantic step for how to represent the data.  The general conversion process involves parsing and transforming the native data to a standard ("canonical") form, and then processing that new form into proper semantic relationships via a governing ontology for that domain. Zitgist has significant experience and expertise in all areas.
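
The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Zitgist's converter: the CSV input, the `FIELD_TO_PROPERTY` table, the function names, and the choice of FOAF as the governing ontology are all hypothetical.

```python
import csv
import io

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from canonical field names to FOAF properties.
FIELD_TO_PROPERTY = {
    "name": FOAF + "name",
    "homepage": FOAF + "homepage",
}


def to_canonical(raw_csv):
    """Syntactic step: parse native CSV into a canonical list of dicts."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [
        {key.strip().lower(): (value or "").strip() for key, value in row.items()}
        for row in reader
    ]


def to_triples(records, base_uri):
    """Semantic step: express canonical records as N-Triples lines."""
    triples = []
    for rec in records:
        subject = f"<{base_uri}{rec['id']}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{FOAF}Person> .")
        for field, prop in FIELD_TO_PROPERTY.items():
            value = rec.get(field)
            if not value:
                continue
            if field == "homepage":
                obj = f"<{value}>"  # a resource, so written as a URI
            else:
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                obj = f'"{escaped}"'  # a plain literal
            triples.append(f"{subject} <{prop}> {obj} .")
    return triples


if __name__ == "__main__":
    raw = "ID,Name,Homepage\n1,Ada Lovelace,http://example.org/ada\n"
    for line in to_triples(to_canonical(raw), "http://example.org/person/"):
        print(line)
```

Keeping the canonical form separate from the triple generation means a new input format only needs a new parser, while a new domain only needs a new mapping table.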

Zitgist routinely develops new ontologies and frameworks and works daily with reference standards for various domain-specific purposes. The company is fully conversant with all leading algorithms and tools.

Zitgist has experience with nearly 100 existing converters, many of which we initially designed (see the Listing of Existing Formats below).  But it is when existing data requires a new converter that Zitgist's experience and expertise shine.

For further information, read on, or contact Zitgist for a consultation on your particular needs.

Experience and Expertise

Zitgist's work in developing UMBEL (see Ontologies and Ontology Development sections) is an exemplar for the company's data conversion expertise.  UMBEL involves about 22,000 reference "subject concepts" that are the binding points for inter-relating various datasets and (at the moment) about 2 million "named entities" (people, places, organizations, things and events) that are the instance data subsumed under these concepts.

The subject concepts were drawn from a complete vetting of the OpenCyc knowledge base and its 300,000 or so total concepts and predicates.  OpenCyc provides the basis for the structural and semantic relationships between the UMBEL subject concepts.

The named entity instance data were drawn from Wikipedia, YAGO, and other fact-based Web resources.  Both OpenCyc and WordNet provided the aliases and synonyms that provide the controlled vocabulary that aids disambiguation (choosing among alternative senses) for the concepts and entities.

Processing of such massive reference sets requires both semi-automated tools and expertise.  In turn, as such sources are processed, the methods and learning also become resources for new conversions.  Zitgist's familiarity with potential tools for such Linked Data purposes is exemplified by its Sweet Tools listing of more than 650 tools. 

The network effect of bootstrapping existing Linked Data to leverage the conversion of new Linked Data is a virtuous multiplier.  With each subsequent domain and format the conversion hurdle lowers and the degree of automation increases.

Zitgist routinely completes new conversions.  Conversions with broader applicability are made into Sponger "cartridges" (see next).  Most importantly, every member of the Linked Data community who performs such conversions adds to the foundation of Linked Data useful for further conversions, and to the storehouse of off-the-shelf converters available for existing data.  These third-party converters are also frequently called "RDFizers" (see concluding listing).

A key aspect in Zitgist's conversion services is understanding the strengths and limitations of these various methods and knowing how to improve them for specific engagements.

Virtuoso Sponger

The Virtuoso Sponger is a key enabling technology used by Zitgist for converting existing data.  The Sponger follows a cascading pipeline to process input files of heterogeneous origin.

The pipeline first checks to see if the file is RDF, in which case it processes the data directly and then cascades through a series of ontology checks.  If RDF is not returned, the Sponger passes the data through a metadata extraction pipeline that cascades through a variety of extractors, called "cartridges".  Assuming an applicable cartridge is found, the extracted data is transformed into RDF via a mapping pipeline. RDF entities (instance data) are generated from the input data by way of ontology matching and mapping.  Then, the structured RDF Linked Data is generated.
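
The cascade described above can be pictured with a small Python sketch. It is not the Sponger's implementation or API: the function names, the MIME-type check, and the two toy cartridges are assumptions chosen only to show the order of the checks (RDF first, then cartridges in turn, then a minimal fallback).

```python
# MIME types treated as "already RDF" in this sketch (an assumption).
RDF_TYPES = {"application/rdf+xml", "text/turtle", "text/n3"}


def hcard_cartridge(body):
    """Hypothetical cartridge: claims the input if it sees an hCard marker."""
    if 'class="vcard"' in body:
        return {"format": "hCard"}
    return None


def rss_cartridge(body):
    """Hypothetical cartridge: claims the input if it sees an <rss> element."""
    if "<rss" in body:
        return {"format": "RSS 2.0"}
    return None


# Cartridges are tried in order; the first one that returns data wins.
CARTRIDGES = [hcard_cartridge, rss_cartridge]


def sponge(content_type, body):
    """Return (stage, extracted_data) for the first stage that applies."""
    if content_type in RDF_TYPES:
        return ("rdf-direct", {"format": content_type})
    for cartridge in CARTRIDGES:
        extracted = cartridge(body)
        if extracted is not None:
            return ("cartridge", extracted)
    # No cartridge matched: keep only minimal information about the page.
    return ("fallback", {"format": "unknown", "length": len(body)})


if __name__ == "__main__":
    print(sponge("text/html", '<div class="vcard">Jane</div>'))
    print(sponge("text/html", "<p>hello</p>"))
```

In a real pipeline the extracted data would then pass to the ontology mapping stage; here each stage simply reports which branch handled the input.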

Even if a specific cartridge is lacking, the Sponger is able to extract some minimum information from the basic Web page.  Though minimally useful, this fallback falls short of full data conversion and is not advisable for purposeful conversion projects.

RDF generation is done on the fly, either with built-in XSLT processors or, in the case of GRDDL *, with its associated XSLT and local or remote XSLT processors. The RDF generation is based on an internal mapping table that associates the source's data type with matching ontologies.  This mapping may use SIOC, SKOS, FOAF, AtomOWL, Annotea bookmarks, Annotea annotations, EXIF, or other ontologies depending on the source data.  The number of ontologies handled by the Sponger is constantly increasing.
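
A mapping table of this kind can be pictured as a simple lookup from source type to ontology. The sketch below is illustrative only: the pairings in `SOURCE_TO_ONTOLOGIES` and the function name are assumptions, not the Sponger's actual internal table; only the three namespace URIs are the published ones for SIOC, SKOS, and FOAF.

```python
# Published namespace URIs for three of the ontologies named above.
NAMESPACES = {
    "SIOC": "http://rdfs.org/sioc/ns#",
    "SKOS": "http://www.w3.org/2004/02/skos/core#",
    "FOAF": "http://xmlns.com/foaf/0.1/",
}

# Hypothetical pairings of source type to ontologies (illustration only).
SOURCE_TO_ONTOLOGIES = {
    "hCard": ["FOAF"],
    "RSS 2.0": ["SIOC"],
    "Del.icio.us": ["SKOS", "SIOC"],
}


def ontologies_for(source_type):
    """Return (name, namespace) pairs for a source type, or [] if unmapped."""
    names = SOURCE_TO_ONTOLOGIES.get(source_type, [])
    return [(name, NAMESPACES[name]) for name in names]


if __name__ == "__main__":
    print(ontologies_for("Del.icio.us"))
    print(ontologies_for("unmapped source"))
```

An unmapped source returns an empty list, which corresponds to the case where no targeted ontology is available for the input.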

Generally, of course, successful conversions early in the processing pipeline or with a matching ontology produce more accurate and complete results.  A key aspect of conversion projects, therefore, is to move the processing earlier in the chain through better input format recognition and targeted ontologies.

A more detailed explanation of the Virtuoso Sponger is available in an OpenLink white paper.

Successful ongoing conversions result in adding new Sponger cartridges to the available repository. Most of these are made public, though some are retained for specific clients or proprietary purposes.

Listing of Existing Formats

At present, Zitgist's existing conversion services recognize nearly 100 formats, and the number is constantly increasing (contact Zitgist directly if your format is not on this list):

  • RDF
    • Serialization formats:
      • RDF/XML
      • N3
      • Turtle
    • Automatically recognized ontologies:
      • SIOC
      • SKOS
      • FOAF
      • AtomOWL
      • Annotea
      • Music Ontology
      • Bibliographic Ontology
      • EXIF
      • vCard
      • WSDL
      • Others
  • (X)HTML pages
  • HTML header metadata
    • Dublin Core
  • Embedded microformats
    • eRDF
    • RDFa
    • hCard
    • hCalendar
    • XFN
    • xFolk
  • Syndication Formats:
    • RSS 2.0
    • Atom
    • OPML
    • OCS
    • XBEL (for bookmarks)
  • GRDDL * (see note below)
  • REST-style Web service APIs:
    • Google Base
    • Flickr
    • Del.icio.us
    • Ning
    • Amazon
    • eBay
    • Freebase
    • Facebook
    • raw HTTP
    • Etc.
  • Files (multitude of file formats and MIME types, including):
    • MS Office
    • OpenOffice
    • Open Document Format
    • images
    • audio
    • video
    • Etc.
  • Data exchange formats
    • iCalendar
    • vCard
    • XBRL
    • BPEL
    • TheyWorkForYou
  • Virtuoso VADs
  • OpenLink licence files
  • Third party metadata extraction frameworks:

Note that MIT's SIMILE RDFizers also recognize these formats, in addition to those listed above:

And, there is a growing list of independent, third-party RDFizers as well:

* GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C markup format for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT.  GRDDL accommodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).
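
As a rough illustration of what "explicitly associated" means, an XML document can name its transformations in a `grddl:transformation` attribute on the root element, using the GRDDL namespace `http://www.w3.org/2003/g/data-view#`. The Python sketch below only discovers those transformation URIs; it does not run the XSLT, and the function name and example URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def grddl_transformations(xml_text):
    """Return the transformation URIs declared on the document's root element."""
    root = ET.fromstring(xml_text)
    # The attribute holds a whitespace-separated list of IRI references.
    value = root.get(f"{{{GRDDL_NS}}}transformation", "")
    return value.split()


if __name__ == "__main__":
    doc = (
        '<doc xmlns:grddl="http://www.w3.org/2003/g/data-view#" '
        'grddl:transformation="http://example.org/to-rdf.xsl"/>'
    )
    print(grddl_transformations(doc))
```

A GRDDL-aware agent would fetch each listed transformation and apply it to the source document to obtain RDF; that step is left out here because it depends on the XSLT processor in use.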
