XQuery/Splitting Files

From Wikibooks, the open-content textbooks collection


Motivation

You have a single large XML file containing many records that share the same structure. You want to split it into many smaller files so that each can be edited by a separate user.

Method

We will write an XQuery that iterates through all the records in the file. For each record, we will call eXist's xmldb:store() function to store the record as a file in a collection. The format of this function is:

xmldb:store($collection, $filename, $data)

Where:

  • $collection is a string that holds the path to the collection in which the data for each record will be stored. For example '/db/test/data'
  • $filename is the name of the file. The name can either be derived from the data or be generated by a sequence counter in the split query. For example 'Hello.xml' or '1.xml'.
  • $data is the data we will be storing into the file
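
As a minimal sketch of a single call, the query below stores one record under a fixed name. It assumes an eXist database in which the collection '/db/test/data' already exists and the current user has write access to it; the record content is borrowed from the sample input used in this chapter.

```xquery
xquery version "1.0";

(: Assumption: running inside eXist, where the xmldb module is available,
   and /db/test/data exists and is writable by the current user. :)
let $collection := '/db/test/data'
let $filename := 'Hello.xml'
let $data :=
   <row>
      <Term>Hello</Term>
      <Definition>A more formal greeting.</Definition>
   </row>
return
   xmldb:store($collection, $filename, $data)
```

When the store succeeds, the function should return the path of the new resource as a string, which is why the split query can capture its result in a variable.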

Sample Input XML File

<root>
   <row>
      <Term>Hi</Term>
      <Definition>An informal short greeting.</Definition>
   </row>
   <row>
      <Term>Hello</Term>
      <Definition>A more formal greeting.</Definition>
   </row>
</root>

Sample XQuery

xquery version "1.0";
 
let $input-file := '/db/test/input.xml'
let $collection := '/db/test/terms'
 
(: the login used must have write access to the collection;
   xmldb:login returns true if the login succeeded :)
let $login := xmldb:login($collection, 'my-login', 'my-password')
 
return
<SplitResults>{
     for $term-data in doc($input-file)/root/row
        (: For brevity we build the file name from the term name.  Change this to use an ID function if you want. :)
        let $term-name := $term-data/Term/text()
        let $filename := concat($term-name, '.xml')
        (: store the record as its own file in the target collection :)
        let $store-return := xmldb:store($collection, $filename, $term-data)
     return
        <store-result>
           <store>{$term-name}</store>
           <filename>{$filename}</filename>
        </store-result>
}</SplitResults>

Using a Sequence Counter for Artificial Keys

Sometimes the record being imported has no element that can serve as a unique key, or the candidate elements are not appropriate for use in a file name. In this case you can use a counter to create each XML file with a unique number in its name. The sequence number generated is called an "artificial key" since it is not directly related to any data element in the record.

You can achieve this by adding an "at" counter to your for loop. To do this, add at $count after the for variable, like this:

for $term-data at $count in doc($input-file)/root/row

The $count variable can then be used to build the file name that is passed to the store function:

let $filename := concat($count, '.xml')
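
To preview the file names a counter would produce before storing anything, the two fragments above can be combined into a query that uses only standard XQuery. This is a sketch: the input is an inline copy of the sample records, and the files and file elements are hypothetical names used only for the report.

```xquery
xquery version "1.0";

(: Inline copy of the sample input so the query runs without a database. :)
let $input :=
   <root>
      <row><Term>Hi</Term></row>
      <row><Term>Hello</Term></row>
   </root>
return
   <files>{
      (: $count is the position of the row in the input sequence, starting at 1 :)
      for $term-data at $count in $input/row
      let $filename := concat($count, '.xml')
      return
         <file name="{$filename}">{$term-data/Term/text()}</file>
   }</files>
```

The result lists 1.xml for the term Hi and 2.xml for the term Hello. Replacing the file element with a call to the store function turns the preview into the actual split.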

Adding an ID to Each Item Using the XQuery Update Operator

Once you have inserted the data into a collection, you may want to assign each item a unique ID. This ID is an artificial key: it is created by the import process and is not related to the data inside the item. Artificial keys are usually assigned by the computer system that stores the data rather than derived from the data. For example, an item might start out like this:

<item>
   <person-name>John Doe</person-name>
   ...
</item>

You can add an ID to each item automatically with an update expression:

  for $item at $count in $items
  return
     update insert <id>{$count}</id> preceding $item/person-name

After this update the new ID element will be inserted before the person-name element:

<item>
   <id>47</id>
   <person-name>John Doe</person-name>
   ...
</item>

It is a best practice to select only the items that do not already have an ID element:

  for $item at $count in $items[not(id)]
  return
     update insert <id>{$count}</id> preceding $item/person-name

This prevents duplicate IDs from being added if the script is run twice. You can also modify this to start the count one higher than the largest ID in a collection.

  (: get the largest ID in the collection, or 0 if no item has an ID yet :)
  let $largest-id := (max(collection($my-collection)/*/id/xs:integer(.)), 0)[1]
  for $item at $count in $items[not(id)]
  return
     update insert <id>{$largest-id + $count}</id> preceding $item/person-name
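
Because an update changes stored data, it can help to check the numbering on a copy first. The sketch below uses only standard XQuery and hypothetical in-memory items (the names and the preview element are illustrative) to show which IDs would be assigned when numbering continues from the largest existing ID.

```xquery
xquery version "1.0";

(: Hypothetical items: two already carry an id, two do not. :)
let $items := (
   <item><id>46</id><person-name>Ann Lee</person-name></item>,
   <item><id>47</id><person-name>John Doe</person-name></item>,
   <item><person-name>Jane Roe</person-name></item>,
   <item><person-name>Sam Poe</person-name></item>
)
(: max() of an empty sequence is empty, so fall back to 0 when no ids exist. :)
let $largest-id := (max($items/id/xs:integer(.)), 0)[1]
return
   <preview>{
      (: $count restarts at 1 for the filtered sequence of items without an id :)
      for $item at $count in $items[not(id)]
      return
         <item>
            <id>{$largest-id + $count}</id>
            {$item/*}
         </item>
   }</preview>
```

With this data the two items without an ID receive 48 and 49, one higher than the largest existing ID and onward, while the items that already have an ID are left out of the preview.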

References

The splitter pattern is documented on the Enterprise Integration Patterns web site.