In order to reconcile the need for legacy data compatibility with changing business requirements, proprietary XML schemas inevitably become larger and looser over time. We discuss the transition at Oxford University Press from monolithic XML models designed to capture monolingual and bilingual print dictionaries derived from multiple sources, towards a single, leaner, semantic model. This new model reflects the lexical content units of a traditional dictionary, while maximising human readability and machine interpretability, thus facilitating transformation to Resource Description Framework (RDF) triples as linked data.
We describe a modular transformation process based on XProc, XSLT, XSpec and Schematron that maps complex structures and multilingual metadata in the legacy data to the structures and harmonised taxonomy of the new model, making explicit information that is often implicit in the original data. Using the new model in its prototype RDF form, we demonstrate how cross-lingual, cross-domain searches can be performed, and custom data-sets can be constructed, that would be impossible or very time-consuming to achieve with the original XML content stored at the individual dictionary level.
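As a rough illustration of why a triple-based form makes such searches straightforward, the following minimal sketch holds a handful of lexical entries as subject-predicate-object triples and runs one cross-lingual, domain-restricted lookup over them. It is not the model or tooling described in the paper: the entry identifiers, the predicate names (`lemma`, `language`, `domain`, `translation`) and the sample data are all hypothetical, and a plain Python dictionary stands in for a real RDF store and query language.

```python
# Hypothetical sample data: (subject, predicate, object) triples.
# Identifiers and predicate names are illustrative assumptions only.
TRIPLES = [
    ("en:mouse_1", "lemma", "mouse"),
    ("en:mouse_1", "language", "en"),
    ("en:mouse_1", "domain", "computing"),
    ("en:mouse_1", "translation", "fr:souris_1"),
    ("fr:souris_1", "lemma", "souris"),
    ("fr:souris_1", "language", "fr"),
    ("fr:souris_1", "domain", "computing"),
    ("en:mouse_2", "lemma", "mouse"),
    ("en:mouse_2", "language", "en"),
    ("en:mouse_2", "domain", "zoology"),
    ("en:mouse_2", "translation", "de:maus_1"),
    ("de:maus_1", "lemma", "Maus"),
    ("de:maus_1", "language", "de"),
    ("de:maus_1", "domain", "zoology"),
]


def build_index(triples):
    """Group triples as {subject: {predicate: set(objects)}}."""
    index = {}
    for subject, predicate, obj in triples:
        index.setdefault(subject, {}).setdefault(predicate, set()).add(obj)
    return index


def translations_in_domain(index, source_lang, domain):
    """Return sorted (source lemma, target language, target lemma) rows for
    entries in `source_lang` that are tagged with `domain`."""
    rows = []
    for props in index.values():
        if source_lang not in props.get("language", set()):
            continue
        if domain not in props.get("domain", set()):
            continue
        # Follow each translation link to the entry in the other language.
        for target in props.get("translation", set()):
            target_props = index.get(target, {})
            for source_lemma in props.get("lemma", set()):
                for target_lang in target_props.get("language", set()):
                    for target_lemma in target_props.get("lemma", set()):
                        rows.append((source_lemma, target_lang, target_lemma))
    return sorted(rows)


index = build_index(TRIPLES)
print(translations_in_domain(index, "en", "computing"))
# prints [('mouse', 'fr', 'souris')]
```

Because every entry shares the same small set of predicates, one function can answer the question across what would otherwise be separate English-French and English-German dictionary files; with per-dictionary XML, the same lookup would need to know each file's own structure.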
Matt Kohl, Sandro Cirulli and Phil Gooch. "From monolithic XML for print/web to lean XML for data"
Presented at XML London 2014, June 7-8th, 2014.
doi:10.14337/XMLLondon14.Kohl01