Over the last few years, we, as a community, have spent a great deal of time writing code to convert Microsoft Word documents into XML. This is a common task with fairly predictable stages. We need to read the .docx or WordML file and transform the flat, formatting-rich XML into a well-structured XML document.
One approach to this problem is to create a pipeline that uses progressive refinement: a sequence of simple transformations, each of which moves the document one step closer to the target format. Because this approach depends on chaining multiple transformations together, we decided to build a framework to enable that.
This paper explores the implementation of this kind of pipelining through XProc and examines the pipeline processing used. We discuss the use of progressive enhancement to convert Microsoft Word files to an intermediate format, considering the challenges involved in converting Word in context. We look at the features of XProc that enable this sort of processing.
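As a rough illustration of the progressive refinement idea described above, the following minimal sketch chains two small transformations over a flat, style-driven document. It is written in Python with the standard library rather than XProc and XSLT, and it is not the framework the paper describes. The element names (`doc`, `p`, `title`, `para`, `section`), the `style` values and the function names are all hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET
from functools import reduce

# Hypothetical flat input: structure is implied only by a style attribute.
FLAT = """<doc>
  <p style="Heading1">Introduction</p>
  <p style="Normal">First paragraph.</p>
  <p style="Heading1">Method</p>
  <p style="Normal">Second paragraph.</p>
</doc>"""


def name_by_style(root):
    """Step 1: replace style-based paragraphs with semantic element names."""
    out = ET.Element("doc")
    for p in root.findall("p"):
        tag = "title" if p.get("style") == "Heading1" else "para"
        ET.SubElement(out, tag).text = p.text
    return out


def group_sections(root):
    """Step 2: wrap each title and the elements that follow it in a section."""
    out = ET.Element("doc")
    current = out  # content before the first title stays at the top level
    for child in root:
        if child.tag == "title":
            current = ET.SubElement(out, "section")
        ET.SubElement(current, child.tag).text = child.text
    return out


def run_pipeline(root, steps):
    """Feed the output of each step into the next, in order."""
    return reduce(lambda tree, step: step(tree), steps, root)


result = run_pipeline(ET.fromstring(FLAT), [name_by_style, group_sections])
print(ET.tostring(result, encoding="unicode"))
```

Each step does one simple job and knows nothing about the others, so steps can be added, removed or reordered without rewriting the rest of the chain. That independence is the property a pipeline language such as XProc is meant to provide.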
Nic Gibson. "Publishing with XProc"
Presented at XML London 2015, June 6-7th, 2015.
doi:10.14337/XMLLondon15.Gibson01