XQuery/DocBook to Microsoft Word

Motivation

You want to create a Microsoft Word document from a DocBook file.

Method

There are two steps to building a high-quality DocBook to Microsoft Word (.docx) transform:

Create a docx generator that assembles a zip file with all the correct components
Create a typeswitch transform that converts each DocBook element into the appropriate Office Open XML format

In this article we will create a zip file using the Microsoft Open Packaging Conventions (OPC) format. OPC files can be opened with any desktop unzip program, but they must be created programmatically to ensure that specific files are placed in the correct order. We will create several small XQuery functions that extract key elements from the input DocBook 5 file and generate the XML files defined by the Office Open XML specifications. We will then assemble all the components into a zip file using a single generate-docx() function.

The second step, the typeswitch transformation, follows a very similar pattern of small functions.

Zip File Generation

This section shows the process of generating a zip file using XQuery. It depends on a zip function that lets you specify each component of the output file and that writes the components to the archive in the order in which you supply them.

The Output File Configuration

The output is a zip file with the following contents:

[Content_Types].xml - a single XML file in the root directory. This file MUST be placed first in the zip file collection.
_rels - a folder that holds a single .rels file, an XML file describing the relationships between files
docProps - a folder that holds the document properties files. These are usually the app.xml and core.xml files
word - a folder with all the Word content and two subfolders. Typical content includes:
- _rels folder with a single file such as document.xml.rels in it
- theme folder with a single file such as theme1.xml in it
- document.xml
- fontTable.xml
- settings.xml
- styles.xml
- webSettings.xml

Sample Use of Zip Function

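The compression:zip( $entries, true() ) function takes two parameters. The first is a sequence of <entry> elements, one for each file or collection we are going to create. The second is a boolean; in eXist-db's compression module it indicates whether the collection hierarchy should be preserved in the zip file.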

Here is the entry for building the main [Content_Types].xml file.

<entry name="[Content_Types].xml" type="xml" method="store">{doc(concat($db2docx:template-collection, '/content-types-template.xml'))}</entry>

The following is an example of how we put a file in the _rels folder with the .rels file name:

<entry name="_rels/.rels" type="xml" method="store">{doc(concat($db2docx:template-collection, '/dot-rels.xml'))}</entry>
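
Since every entry follows the same shape, the pattern can be wrapped in a small helper. The following is a minimal sketch, not the article's actual code: the function name local:xml-entry is hypothetical, and the two inline elements stand in for templates that would normally be loaded with doc().

```xquery
xquery version "1.0";

(: Hypothetical helper: wraps any XML node in an <entry> element
   with the same attributes as the entries shown above. :)
declare function local:xml-entry($name as xs:string, $content as node()) as element(entry) {
    <entry name="{$name}" type="xml" method="store">{$content}</entry>
};

(: The sequence order is the order in which the entries are listed,
   so [Content_Types].xml is listed first. :)
(
    local:xml-entry("[Content_Types].xml",
        <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"/>),
    local:xml-entry("_rels/.rels",
        <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"/>)
)
```

The query returns two <entry> elements in the order they were listed, which is the form that the $entries parameter expects.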

So to build a docx file we "assemble" each of the <entry> elements using the compression:zip() function and then return the binary stream to the web browser with the correct MIME type and file name. The file is downloaded, and you can then open it with Microsoft Word.

declare function db2docx:generate-docx($docbook-input-document as node(), $filename as xs:string) {

(: this has a sequence of <entry> elements that is used by the zip function :)
    let $entries :=
        (
        db2docx:content-type-entry(),
        ...
        db2docx:root-rels())
    return
        (
        response:set-header("Content-Disposition", concat("attachment; filename=", $filename, ".docx"))
        ,
        response:stream-binary(
            compression:zip( $entries, true() ),
            (: MIME type registered for .docx files :)
            'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
            concat($filename, '.docx')
            )
        )
};

Mapping DocBook 5 Elements to Office Open XML

DocBook files are very easy to work with since the entire document can be stored in a single file. DocX has many small files and these files are stored in many different locations in the zip archive.

Here are some examples:

Core Properties

The core properties file holds the standard Dublin Core metadata elements that you might see in a bibliographic entry for a book or article, along with a few package-specific elements in the cp namespace.

<cp:coreProperties xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <!-- Dublin Core Metadata Elements -->
    <dc:title>Converting DocBook to DocX Format</dc:title>
    <dc:subject>Horror Stories</dc:subject>
    <dc:creator>Dan McCreary</dc:creator>
    <cp:keywords>XML, Docbook, DocX, Conversion, Transformation, TypeSwitch</cp:keywords>
    <dc:description>How to convert DocBook to DocX</dc:description>
    <cp:lastModifiedBy>Dan McCreary</cp:lastModifiedBy>
    <cp:revision>1</cp:revision>
    <dcterms:created xsi:type="dcterms:W3CDTF">2012-05-04T13:35:00Z</dcterms:created>
    <dcterms:modified xsi:type="dcterms:W3CDTF">2012-05-04T13:35:00Z</dcterms:modified>
</cp:coreProperties>
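
A function in the same style as the other generators could fill in some of these values from the DocBook input. This is a hedged sketch: the function name local:core-properties and its parameters are assumptions, and the author and timestamp are passed in by the caller because the sample DocBook file does not carry them.

```xquery
xquery version "1.0";

declare namespace db = "http://docbook.org/ns/docbook";
declare namespace cp = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
declare namespace dc = "http://purl.org/dc/elements/1.1/";
declare namespace dcterms = "http://purl.org/dc/terms/";

(: Hypothetical function: builds the core properties from the DocBook title
   plus an author name and a timestamp supplied by the caller. :)
declare function local:core-properties(
    $docbook as element(),
    $author as xs:string,
    $timestamp as xs:string
) as element(cp:coreProperties) {
    (: take the first title found directly under the root or under its info element :)
    let $title := string(($docbook/db:title, $docbook/db:info/db:title)[1])
    return
        <cp:coreProperties>
            <dc:title>{$title}</dc:title>
            <dc:creator>{$author}</dc:creator>
            <cp:lastModifiedBy>{$author}</cp:lastModifiedBy>
            <cp:revision>1</cp:revision>
            <dcterms:created xsi:type="dcterms:W3CDTF">{$timestamp}</dcterms:created>
            <dcterms:modified xsi:type="dcterms:W3CDTF">{$timestamp}</dcterms:modified>
        </cp:coreProperties>
};

local:core-properties(
    <chapter xmlns="http://docbook.org/ns/docbook"><title>Chapter Title</title></chapter>,
    "Dan McCreary",
    "2012-05-04T13:35:00Z"
)
```

The xsi prefix is predeclared in XQuery, so it needs no namespace declaration in the prolog.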

Application Properties

Here is an example of an XQuery function that will fill in the number of sections in the application properties XML file.

declare function db2docx:app-properties($docbook-input-document as node()) {
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" 
   xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
    <SectionCount>{count($docbook-input-document//db:sect1)}</SectionCount>
    <Application>DocBook 5 to Microsoft Open Office XML (DocX) Converter by Dan McCreary of Kelly-McCreary &amp; Associates</Application>
    <DocSecurity>0</DocSecurity>
    <Lines>1</Lines>
    <Paragraphs>1</Paragraphs>
    <ScaleCrop>false</ScaleCrop>
    <Company>Kelly-McCreary &amp; Associates</Company>
    <LinksUpToDate>false</LinksUpToDate>
    <CharactersWithSpaces>12</CharactersWithSpaces>
    <SharedDoc>false</SharedDoc>
    <HyperlinksChanged>false</HyperlinksChanged>
    <AppVersion>0.1</AppVersion>
</Properties>
};

Document Body Element Transforms

Mapping your DocBook elements into Office Open XML format will vary depending on which DocBook elements you use and how your Word template is structured. This tutorial example will demonstrate mapping for the following elements:

article
article title
sect1
sect1 title
para
figure

Sample DocBook 5 Input File

We will begin with a DocBook 5 chapter with two level 1 sections. Each level 1 section has two paragraphs followed by two level 2 subsections, and each subsection has two paragraphs.

<chapter xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
    <title>Chapter Title</title>
    <subtitle>Chapter Subtitle</subtitle>
    <para>This is the body introductory text of the chapter</para>
    <sect1>
        <title>Level 1 Title 1</title>
        <subtitle>Section 1 Subtitle</subtitle>
        <para>This is the text of the first paragraph of the first level 1 article section.</para>
        <para>This is the text of the second paragraph of the first level 1 article section.</para>
        <sect2>
            <title>Level 2 Title 1.1</title>
            <para>This is the text of the first paragraph of the first level 2 article sub-section.</para>
            <para>This is the text of the second paragraph of the first level 2 article sub-section.</para>
        </sect2>
        <sect2>
            <title>Level 2 Title 1.2</title>
            <para>This is the text of the first paragraph of the second level 2 article sub-section.</para>
            <para>This is the text of the second paragraph of the second level 2 article sub-section.</para>
        </sect2>        
    </sect1>
    <sect1>
        <title>Level 1 Title 2</title>
        <subtitle>Section 1 Subtitle</subtitle>
        <para>This is the text of the first paragraph of the first level 1 article section.</para>
        <para>This is the text of the second paragraph of the first level 1 article section.</para>
        <sect2>
            <title>Section 2 Title 2.1</title>
            <para>This is the text of the first paragraph of the first level 2 article sub-section.</para>
            <para>This is the text of the second paragraph of the first level 2 article sub-section.</para>
        </sect2>
        <sect2>
            <title>Section 2 Title 2.2</title>
            <para>This is the text of the first paragraph of the second level 2 article sub-section.</para>
            <para>This is the text of the second paragraph of the second level 2 article sub-section.</para>
        </sect2>
        
    </sect1>
</chapter>

Document Body

Office Open XML uses a complex XML structure for storing the body text. Paragraphs (w:p) are broken up into "runs" (w:r), and the text sits in w:t elements within those runs. The following structure is an example of this:

<w:document xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
    xmlns:o="urn:schemas-microsoft-com:office:office" 
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" 
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
    xmlns:w10="urn:schemas-microsoft-com:office:word" 
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
    xmlns:v="urn:schemas-microsoft-com:vml" 
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
    xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006">
    <w:body>
        <w:p>
            <w:r>
                <w:t>Hello World!</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>
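
To show how a DocBook paragraph could be mapped onto this structure, here is a minimal sketch. The function name local:para is hypothetical, and the sketch assumes that inline markup inside the paragraph may be flattened to plain text.

```xquery
xquery version "1.0";

declare namespace db = "http://docbook.org/ns/docbook";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Hypothetical handler: one DocBook para becomes one paragraph with a single run.
   string() flattens any inline markup inside the para to plain text. :)
declare function local:para($para as element(db:para)) as element(w:p) {
    <w:p>
        <w:r>
            <w:t>{string($para)}</w:t>
        </w:r>
    </w:p>
};

local:para(<db:para>Hello World!</db:para>)
```

The result is a single w:p element with the same paragraph, run and text nesting as the body of the document above.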

Creating Your Typeswitch Transform

We are now ready to dive into the element by element transform.

The structure looks like this:

declare function db2docx:main($content as node()*) as item()* {
    for $node in $content
    (: let $log := util:log-system-out(concat('In main with ', count($node//node()), ' elements')) :)
    return 
        typeswitch($node)
            case text() return $node
            
            (:
            case element(article) return db2h:article($node)
            case element(book) return db2h:book($node) :)

            (: each of these is responsible for handling its own title, paragraphs and subsections :)
            case element(db:chapter) return db2docx:chapter($node)
            case element(db:sect1) return db2docx:sect1($node)
            case element(db:sect2) return db2docx:sect2($node)
....
           default return db2docx:null()
};

Sample Recursive Function

The main "dispatch" function is called for the node of every high-level element. It then just jumps to the function specifically associated with that element. Usually the function has the same name as the element.

At each level of the transform you put in the data elements you need and then call the main function for each sub-element. This lets you put in structure that you know exists and avoids having to look up the context of an element depending on where you are in the tree. For example, the title element is used consistently in the chapter, sect1 and sect2 sections. You can look up the parent element name when you get to the title element, but it is often easier just to put in the elements within the section you have just arrived at.

declare function db2docx:sect1($sect1 as node()) as node()* {
(
    <!-- sect1 -->,
    <w:p>
        <w:pPr>
            <w:pStyle w:val="Heading1"/>
        </w:pPr>
        <w:r>
            <w:t>{$sect1/db:title/text()}</w:t>
        </w:r>
    </w:p>,
    db2docx:main($sect1/db:para),
    db2docx:main($sect1/db:sect2)
)
};
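
The same pattern can be followed one level down. The sketch below is self-contained so that it can be run on its own, which is why it uses the local: prefix instead of db2docx:. The helper local:styled-para and the "Heading2" style name are assumptions, not part of the original module.

```xquery
xquery version "1.0";

declare namespace db = "http://docbook.org/ns/docbook";
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Hypothetical helper: a paragraph with an optional Word style name :)
declare function local:styled-para($style as xs:string?, $text as xs:string) as element(w:p) {
    <w:p>
        {if ($style) then <w:pPr><w:pStyle w:val="{$style}"/></w:pPr> else ()}
        <w:r><w:t>{$text}</w:t></w:r>
    </w:p>
};

(: The section handler emits its own title, then hands its paragraphs back to main :)
declare function local:sect2($sect2 as element(db:sect2)) as node()* {
    (
    local:styled-para("Heading2", string($sect2/db:title)),
    local:main($sect2/db:para)
    )
};

(: Cut-down dispatch function covering only the two elements used here :)
declare function local:main($content as node()*) as item()* {
    for $node in $content
    return
        typeswitch ($node)
            case element(db:sect2) return local:sect2($node)
            case element(db:para) return local:styled-para((), string($node))
            default return ()
};

local:main(
    <db:sect2>
        <db:title>Level 2 Title 1.1</db:title>
        <db:para>First paragraph.</db:para>
        <db:para>Second paragraph.</db:para>
    </db:sect2>
)
```

Run against the inline section, the query returns three w:p elements: one styled heading followed by two body paragraphs.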

Sample Output

Adding Images

DocBook Figures

DocBook figures have the following sample structure:

<figure>
   <title>Figure Caption</title>
   <mediaobject>
      <imageobject>
           <imagedata fileref="images/my-image.png" scale="50" contentwidth="500"/>
      </imageobject>
   </mediaobject>
</figure>

In the sample above, all images for an article are stored in an images collection directly inside the collection that holds the main article XML file. The imagedata attributes either scale the image to 50% of its original size or set the content width to a fixed number of pixels.

Sample Office Open XML Image

Below is the equivalent structure of an image in docx format:

<w:drawing>
    <wp:inline distT="0" distB="0" distL="0" distR="0">
        <wp:extent cx="1714286" cy="514286"/>
        <wp:effectExtent l="19050" t="0" r="214" b="0"/>
        <wp:docPr id="1" name="Picture 0" descr="nosql-logo.png"/>
        <wp:cNvGraphicFramePr>
            <a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
               noChangeAspect="1"/>
        </wp:cNvGraphicFramePr>
        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
            <a:graphicData
                uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                <pic:pic
                    xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                    <pic:nvPicPr>
                        <pic:cNvPr id="0" name="my-image-file.png"/>
                        <pic:cNvPicPr/>
                    </pic:nvPicPr>
                    <pic:blipFill>
                        <a:blip r:embed="rId4" cstate="print"/>
                        <a:stretch>
                            <a:fillRect/>
                        </a:stretch>
                    </pic:blipFill>
                    <pic:spPr>
                        <a:xfrm>
                            <a:off x="0" y="0"/>
                            <a:ext cx="1714286" cy="514286"/>
                        </a:xfrm>
                        <a:prstGeom prst="rect">
                            <a:avLst/>
                        </a:prstGeom>
                    </pic:spPr>
                </pic:pic>
            </a:graphicData>
        </a:graphic>
    </wp:inline>
</w:drawing>

The binary images must be placed in the word/media collection.
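
The cx and cy values in the drawing are expressed in English Metric Units (EMUs), with 914400 EMUs to the inch. A possible way to derive them from pixel dimensions and a DocBook scale percentage is sketched below. The function name local:extent is hypothetical, the pixel size of the image is assumed to be known already, and the conversion assumes 96 pixels per inch.

```xquery
xquery version "1.0";

declare namespace wp = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing";

(: 914400 EMUs per inch divided by an assumed 96 pixels per inch :)
declare variable $emu-per-pixel as xs:integer := 9525;

(: Hypothetical helper: converts pixel dimensions and a percentage scale
   into a wp:extent element :)
declare function local:extent(
    $width-px as xs:integer,
    $height-px as xs:integer,
    $scale-percent as xs:integer
) as element(wp:extent) {
    let $cx := ($width-px * $scale-percent * $emu-per-pixel) idiv 100
    let $cy := ($height-px * $scale-percent * $emu-per-pixel) idiv 100
    return <wp:extent cx="{$cx}" cy="{$cy}"/>
};

(: a 500 x 150 pixel image scaled to 50% :)
local:extent(500, 150, 50)
```

For the 500 by 150 pixel example at 50%, this gives cx="2381250" and cy="714375". The same values would also go on the a:ext element inside pic:spPr.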

Revision Identifiers (RSIDS)

Microsoft Word documents also carry a large number of revision attributes, or "RSIDs", on each paragraph, run and text element. These are used when multiple authors make changes and the changes must be tracked using a revision reviewing system. Assigning random ID numbers to each component of text makes it easier for a person to view the tracked changes.

<w:p w:rsidR="00D910F7" w:rsidRDefault="00CB02EF" w:rsidP="00CB02EF">
   <w:pPr>
      <w:pStyle w:val="Title"/>
   </w:pPr>
   <w:r>
      <w:t>Document Title</w:t>
   </w:r>
</w:p>
<w:p w:rsidR="00CB02EF" w:rsidRDefault="00CB02EF" w:rsidP="00CB02EF">
   <w:pPr>
      <w:pStyle w:val="Heading1"/>
   </w:pPr>
   <w:r>
      <w:t>I am heading</w:t>
   </w:r>
</w:p>
<w:p w:rsidR="00CB02EF" w:rsidRDefault="00CB02EF">
   <w:r>
      <w:t>This is the body text for a paragraph.</w:t>
   </w:r>
</w:p>

You can disable the generation of RSIDs in Microsoft Word: open Word Options, go to the Trust Center, select "Privacy Settings" (although this has nothing to do with privacy), and then uncheck "Store random numbers to improve Combine accuracy".
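
If a template was saved before that option was turned off, the attributes can also be removed in XQuery. The following is only a sketch: the function name local:strip-rsids is hypothetical, and it assumes that every attribute in the w namespace whose local name starts with "rsid" can be dropped safely.

```xquery
xquery version "1.0";

declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Hypothetical helper: recursively copies a node, dropping every attribute
   in the w namespace whose local name starts with "rsid" :)
declare function local:strip-rsids($node as node()) as node()? {
    typeswitch ($node)
        case element() return
            element { node-name($node) } {
                $node/@* except $node/@w:*[starts-with(local-name(.), "rsid")],
                for $child in $node/node() return local:strip-rsids($child)
            }
        default return $node
};

local:strip-rsids(
    <w:p w:rsidR="00CB02EF" w:rsidRDefault="00CB02EF">
        <w:r>
            <w:t>This is the body text for a paragraph.</w:t>
        </w:r>
    </w:p>
)
```

The returned paragraph keeps its run and text but no longer carries the w:rsidR and w:rsidRDefault attributes.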
