Matt Kohl

From monolithic XML for print/web to lean XML for data

realising linked data for dictionaries

Matt Kohl (Oxford University Press) and Sandro Cirulli (Oxford University Press)


In order to reconcile the need for legacy data compatibility with changing business requirements, proprietary XML schemas inevitably become larger and looser over time. We discuss the transition at Oxford University Press from monolithic XML models designed to capture monolingual and bilingual print dictionaries derived from multiple sources, towards a single, leaner, semantic model. This new model reflects the lexical content units of a traditional dictionary, while maximising human readability and machine interpretability, thus facilitating transformation to Resource Description Framework (RDF) triples as linked data.

We describe a modular transformation process based on XProc, XSLT, XSpec and Schematron that maps complex structures and multilingual metadata in the legacy data to the structures and harmonised taxonomy of the new model, making explicit information that is often implicit in the original data. Using the new model in its prototype RDF form, we demonstrate how cross-lingual, cross-domain searches can be performed, and custom data-sets can be constructed, that would be impossible or very time-consuming to achieve with the original XML content stored at the individual dictionary level.
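The idea of mapping legacy entries to a lean model and then querying across dictionaries can be illustrated with a small sketch. It is not the pipeline described in the paper, which is built on XProc, XSLT, XSpec and Schematron; it uses the Python standard library purely for illustration. All element names, label abbreviations, the `DOMAIN` table and the `ex:` prefix are hypothetical and not taken from any Oxford University Press schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical legacy entries, one per bilingual dictionary. The element
# names and label abbreviations are invented for this illustration.
LEGACY = {
    "en-es": "<e><hw>bank</hw><lbl>Fin</lbl><tr>banco</tr></e>",
    "en-de": "<e><hw>bank</hw><lbl>Finanz.</lbl><tr>Bank</tr></e>",
}

# Hypothetical harmonised taxonomy: dictionary-specific labels map to
# a single shared domain name.
DOMAIN = {"Fin": "finance", "Finanz.": "finance"}


def to_lean(dict_id, legacy_xml):
    """Map one legacy entry to a lean entry, making implicit facts explicit.

    The source and target languages are implicit in the legacy data (they
    are only known from which dictionary the entry lives in); here they
    become explicit attributes.
    """
    src = ET.fromstring(legacy_xml)
    source_lang, target_lang = dict_id.split("-")
    entry = ET.Element("entry", {"lang": source_lang})
    ET.SubElement(entry, "headword").text = src.findtext("hw")
    sense = ET.SubElement(entry, "sense", {"domain": DOMAIN[src.findtext("lbl")]})
    translation = ET.SubElement(sense, "translation", {"lang": target_lang})
    translation.text = src.findtext("tr")
    return entry


def to_triples(entry):
    """Flatten a lean entry into (subject, predicate, object) tuples."""
    subject = f"ex:{entry.get('lang')}/{entry.findtext('headword')}"
    triples = []
    for sense in entry.findall("sense"):
        triples.append((subject, "ex:domain", sense.get("domain")))
        for tr in sense.findall("translation"):
            triples.append((subject, "ex:translation", f"{tr.text}@{tr.get('lang')}"))
    return triples


def translations_in_domain(triples, domain):
    """Return translations, in any language, of headwords in one domain."""
    subjects = {s for s, p, o in triples if p == "ex:domain" and o == domain}
    return sorted(o for s, p, o in triples if p == "ex:translation" and s in subjects)


# Pool the triples from every dictionary, then query across all of them.
triples = [t for d, x in LEGACY.items() for t in to_triples(to_lean(d, x))]
print(translations_in_domain(triples, "finance"))
```

Because the triples from both dictionaries share one subject and one domain vocabulary, a single query returns translations from each of them, which is the kind of search that is awkward when content is stored per dictionary. A real implementation would use proper IRIs and an RDF store rather than tuples in a list.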

How to cite this

Matt Kohl, Sandro Cirulli and Phil Gooch. "From monolithic XML for print/web to lean XML for data" Presented at XML London 2014, June 7-8th, 2014. doi:10.14337/XMLLondon14.Kohl01.
