XQuery/String Analysis


XQuery analyze-string

XSLT 2.0 includes the xsl:analyze-string instruction, which captures matching groups (in parentheses) in a regular expression. XQuery 1.0 has no equivalent. It is possible to use the XSLT instruction by wrapping an XQuery function around a generated XSLT stylesheet, although this is rather cumbersome. In this installation of eXist, the XSLT engine is Saxon 8.

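module namespace str = "http://www.cems.uwe.ac.uk/string";

(: Runs a generated XSLT 2.0 stylesheet and returns, for each match of $regex
   in $string, one <match> element per captured group 1 to $n. :)
declare function str:analyze-string($string as xs:string, $regex as xs:string, $n as xs:integer) {
   transform:transform(
      (: dummy input document; the stylesheet never reads it :)
      <any/>,
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
         <xsl:template match="/">
            <xsl:analyze-string regex="{$regex}" select="'{$string}'">
               <xsl:matching-substring>
                  <xsl:for-each select="1 to {$n}">
                     <match>
                        <xsl:value-of select="regex-group(.)"/>
                     </match>
                  </xsl:for-each>
               </xsl:matching-substring>
            </xsl:analyze-string>
         </xsl:template>
      </xsl:stylesheet>,
      (: no stylesheet parameters :)
      ()
   )
};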
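
Processors that implement XQuery 3.0 provide a built-in fn:analyze-string function, which may make the XSLT detour unnecessary. The following is a minimal sketch under that assumption. The function name local:regex-groups is hypothetical and simply mirrors the signature of the wrapper above; it returns the first $n captured groups of the first match as strings.

```xquery
xquery version "3.0";

(: Hypothetical helper, not part of the original library module. :)
declare function local:regex-groups(
    $string as xs:string, $regex as xs:string, $n as xs:integer
) as xs:string* {
  (: fn:analyze-string returns an fn:analyze-string-result element; each
     fn:match child holds fn:group elements identified by their nr attribute :)
  let $match := analyze-string($string, $regex)/fn:match[1]
  for $i in 1 to $n
  (: string(()) is "", so a group that captured nothing yields an empty string :)
  return string(($match//fn:group[@nr = $i])[1])
};

local:regex-groups("WP05LNU", "^([A-Z][A-Z])(\d\d)[A-Z][A-Z][A-Z]$", 2)
```

For anchored patterns such as the ones used on this page there is at most one match, so looking only at the first fn:match should be sufficient.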


UK Vehicle Registration numbers

To illustrate the use of this function, here is a decoder for UK vehicle registration plates. The format has changed several times, so the script must first decide which format is used, then analyze the number to find the significant codes for the area and date of registration. The patterns are defined in XML; each gives the regular expression to use and the meaning of its matched groups.

Problem: passing repetition modifiers through to the generated stylesheet is currently failing.
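
One possible cause, stated here as an assumption rather than a confirmed diagnosis: in XSLT 2.0 the regex attribute of xsl:analyze-string is an attribute value template, so a braced quantifier such as {3} is evaluated as an XPath expression unless the braces are doubled. A minimal sketch of a helper that doubles them before the stylesheet is generated (the name local:escape-avt is hypothetical):

```xquery
(: Hypothetical helper: doubles curly braces so that an XSLT attribute
   value template treats them as literal characters. :)
declare function local:escape-avt($regex as xs:string) as xs:string {
  replace(replace($regex, "\{", "{{"), "\}", "}}")
};

local:escape-avt("([A-Z]{2})(\d{2})[A-Z]{3}")
```

If this is indeed the cause, the wrapper would build the instruction with regex="{local:escape-avt($regex)}" instead of regex="{$regex}".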


import module namespace str = "http://www.cems.uwe.ac.uk/string" at "../lib/string.xqm";

declare variable $patterns := 
<patterns>
   <pattern version="01" regexp="([A-Z][A-Z])(\d\d)[A-Z][A-Z][A-Z]">
            <field>Area</field><field>Date</field>
    </pattern>
   <pattern version="83" regexp="([A-Z])\d+[A-Z]([A-Z][A-Z])">
          <field>Date</field><field>Area</field>
   </pattern>
   <pattern version="63" regexp="([A-Z][A-Z])[A-Z]?\d+([A-Z])">
                <field>Area</field><field>Date</field>
   </pattern>
</patterns>;

declare function local:decode-regno($regno)  {
let $regno := upper-case($regno)
let $regno := replace($regno, " ","")

return 
    for $pattern in $patterns/pattern
    let $regexp := concat("^",$pattern/@regexp,"$")
    return 
       if (matches($regno,$regexp))
       then 
          let $analysis := str:analyze-string($regno,$regexp,count($pattern/field))
          return 
          <regno version="{$pattern/@version}">      
             {for $field at $i in $pattern/field
              let $value := string($analysis[position() = $i])
              let $table := concat($field,$pattern/@version)
              let $value := /CodeList[@id=$table]/Entry[Code=$value]
              return 
                    element {$field} {$value/*} 
             }
          </regno>
       else 
         ()
};

let $regno := request:get-parameter("regno",())
return local:decode-regno($regno)


Decode tables

Separate tables decode codes to date ranges or areas. These tables are plain XML created from CSV files via Excel. The pre-83 area codes are currently incorrect.


e.g.

<CodeList id="Area83">
        <Entry>
                <Code>AA</Code>
                <Location>Bournemouth</Location>
        </Entry>
        <Entry>
                <Code>AB</Code>
                <Location>Worcester</Location>
        </Entry>
        <Entry>
                <Code>AC</Code>
                <Location>Coventry</Location>
        </Entry>
...
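
The lookup that the decoder performs against these tables can be tried in isolation. The sketch below inlines the three entries shown above in a variable, an assumption made only so that the example is self-contained (the real script queries the stored documents), and resolves one code. The function name local:lookup is hypothetical.

```xquery
declare variable $table :=
<CodeList id="Area83">
  <Entry><Code>AA</Code><Location>Bournemouth</Location></Entry>
  <Entry><Code>AB</Code><Location>Worcester</Location></Entry>
  <Entry><Code>AC</Code><Location>Coventry</Location></Entry>
</CodeList>;

(: Returns the Entry elements whose Code equals $code; empty if none. :)
declare function local:lookup($codelist as element(CodeList), $code as xs:string) as element(Entry)* {
  $codelist/Entry[Code = $code]
};

string(local:lookup($table, "AB")/Location)
```

A code that is not in the table yields an empty sequence, which is why the decoder can emit an empty Area or Date element for unknown codes.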

Examples

  1. A current number plate: WP05LNU
  2. One from the previous series: L162BAY

Location Mapping

One use of this conversion is to display the locations on a map. Here we take a file of observed registration numbers, decode them all, group by location and generate a KML file with the locations geocoded through the Google API.

<NumberList>
    <Regno>H251GBU</Regno>
    <Regno>WRA870Y</Regno>
    <Regno>ENB427T</Regno>
    <Regno>C406OUY</Regno>
    <Regno>N62VNF</Regno>
    <Regno>R895KCV</Regno>
    <Regno>C758HOV</Regno>
    <Regno>H541HEM</Regno>
 ...


(:  this script plots the registration locations of a set of  UK vehicle license plates using kml.  :)

import module namespace geo="http://www.cems.uwe.ac.uk/exist/geo" at "../lib/geo.xqm";

import module namespace str = "http://www.cems.uwe.ac.uk/string" at "../lib/string.xqm";
declare  namespace reg = "http://www.cems.uwe.ac.uk/wiki/reg";

declare option exist:serialize "method=xml media-type=application/vnd.google-earth.kml+xml  indent=yes  omit-xml-declaration=yes"; 
declare variable $reg:icon := "http://maps.google.com/mapfiles/kml/paddle/ltblu-blank.png";
declare variable $reg:patterns := 
<patterns>
   <pattern version="01" regexp="([A-Z][A-Z])(\d\d)[A-Z][A-Z][A-Z]">
            <field>Area</field><field>Date</field>
    </pattern>
   <pattern version="83" regexp="([A-Z])\d+[A-Z]([A-Z][A-Z])">
          <field>Date</field><field>Area</field>
   </pattern>
   <pattern version="63" regexp="([A-Z][A-Z])[A-Z]?\d+([A-Z])">
                <field>Area</field><field>Date</field>
   </pattern>
</patterns>;

declare function reg:decode-regno($regno)  {
let $regno := upper-case($regno)
let $regno := replace($regno, " ","")

return
    for $pattern in $reg:patterns/pattern
    let $regexp := concat("^",$pattern/@regexp,"$")
    return 
       if (matches($regno,$regexp))
       then 
          let $analysis := str:analyze-string($regno,$regexp,count($pattern/field))
          return 
          <regno version="{$pattern/@version}">      
             {for $field at $i in $pattern/field
              let $value := string($analysis[position() = $i])
              let $table := concat($field,$pattern/@version)
              let $value := /CodeList[@id=$table]/Entry[Code=$value]
              return 
                    element {$field} {$value/*} 
             }
          </regno>
       else 
         ()
};

declare function reg:regno-locations($regnos) {
for $regno in  $regnos
let $analysis := reg:decode-regno($regno)
return
   if (exists($analysis//Location))
   then  string($analysis//Location) 
   else ()
};

let $url := request:get-parameter("url",())
let $x := response:set-header('Content-Disposition','inline;filename=regnos.kml;')

return
   <Document>
      <name>Reg nos</name>
      {for $i in (1 to 10)
       return 
       <Style id="size{$i}">
             <IconStyle>
             <scale>{$i}</scale>
             <Icon><href>{$reg:icon}</href> </Icon>     
          </IconStyle>
          </Style>
       }
      {
      let $locations :=   reg:regno-locations(doc($url)//Regno)
      let $max := count($locations)
      for $place in distinct-values($locations)
      let $latlong := geo:geocode(concat($place,',UK'))
      let $count := count($locations[. = $place])
      let $scale := max((round($count div $max  * 10),1))
         order by $count descending
         return         
          <Placemark>
           <name>{$place} ({$count})</name>
           <styleUrl>#size{$scale}</styleUrl>
           <Point><coordinates>{geo:position-as-kml($latlong)}</coordinates></Point>
         </Placemark>
        }
   </Document>
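
The marker size in the script above is the location's share of all decoded plates, scaled to the range 1 to 10. The sketch below isolates that calculation on a made-up list of locations (the place names and counts are illustrative, not observed data); the function name local:scale is hypothetical.

```xquery
(: Share of the total, scaled to 0..10 and rounded, with a floor of 1. :)
declare function local:scale($count as xs:integer, $total as xs:integer) as xs:integer {
  xs:integer(max((round($count div $total * 10), 1)))
};

let $locations := ("Bristol", "Bristol", "Leeds", "Bristol", "Exeter")
for $place in distinct-values($locations)
let $count := count($locations[. = $place])
order by $count descending
return concat($place, ": size", local:scale($count, count($locations)))
```

Here Bristol (3 of 5) is given size6, while Leeds and Exeter (1 of 5 each) are given size2. Note that $max in the script holds the total number of locations rather than the largest group, so size10 is reached only when almost all plates decode to the same place.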


SMS service

The Department of Information Science and Digital Media supports an SMS service with facilities to send and receive text messages. The service is paid for by the University of the West of England, Bristol and all traffic is logged.

A decoder for UK vehicle license numbers is one of the demonstration services which are supported for mobile-originated (MO) text messages.

The format of the text message is

REG <regno>

e.g.

REG L162 BAY

A text message in this format sent to our SMS mobile number 447624803759 passes through a PHP script which allows multiple SMS services to be supported. The script uses the first word of the message to identify the associated service endpoint, and then invokes that endpoint via HTTP, passing the prefix as code, the rest of the message as text and the origination mobile number as from.
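
The splitting step described above could look roughly like the following in XQuery. This is only a sketch of the idea, not the PHP script itself: the function name local:split-message and the element names of the result are hypothetical.

```xquery
(: Splits a message into its first word (the service prefix) and the rest. :)
declare function local:split-message($message as xs:string) as element(sms) {
  let $words := tokenize(normalize-space($message), " ")
  return
    <sms>
      <code>{$words[1]}</code>
      <text>{string-join(subsequence($words, 2), " ")}</text>
    </sms>
};

local:split-message("REG L162 BAY")
```

For the sample message, code holds REG and text holds L162 BAY; a switch would then look up the endpoint registered for that code.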

For the prefix REG, the associated endpoint is an XQuery script:

  http://www.cems.uwe.ac.uk/xmlwiki/regno/smsregno.xq

The smsregno.xq script is essentially the parseregno script above.

declare option exist:serialize "method=text media-type=text/plain";
 
...
 
let $regno := request:get-parameter("text",())
let $data :=  local:decode-regno($regno)
return
   concat("Reply: ",
          $regno , 
          " was registered in  ",
          $data/Area/Location,
          " between ", 
          $data/Date/From ,
          " and ", 
          $data/Date/To
         )

The SMS switch then sends the Reply on to the originating mobile phone.

To do

  • solve problem with repetition modifiers (or function support for analyze-string)
  • Pre-83 area code data
  • Switch implementation in XQuery to replace the PHP application - awaits switch to eXist v2