XQuery/Wikipedia Lookup

Page scraping is one way to retrieve a specific fact from a web page, provided the page's structure is stable.

Here the task is to use Wikipedia to find the Latin name for a bird, given its common name.

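declare namespace h = "http://www.w3.org/1999/xhtml";

(: common name of the bird, read from the request (eXist-db request module) :)
let $name := request:get-parameter("name",())
(: escape-uri() was dropped from the final XQuery 1.0 function library; encode-for-uri() escapes the title :)
let $url := concat("https://en.wikipedia.org/wiki/", encode-for-uri($name))
(: doc() only succeeds if the page is served as well-formed XHTML :)
let $page := doc($url)
(: second cell of the row that has a cell reading 'Genus:' or 'Species:' :)
let $genus := $page//h:tr[h:td[. ='Genus:']]/h:td[2]
let $species := $page//h:tr[h:td[. ='Species:']]/h:td[2]
(: first bold text in the rows that follow the 'Binomial name' header row :)
let $binomial := string(($page//h:tr[h:th//h:a[.='Binomial name']]/following-sibling::h:tr//h:b)[1])
return 
   <bird name="{$name}" genus="{$genus}" species="{$species}" binomial="{$binomial}"/>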

Locating the required data involves fairly complex XPath expressions, and they work only if the page follows Wikipedia's bird page format. For example, the genus is the second cell of the table row whose first cell contains 'Genus:'.
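
To see how the row predicate selects a value, the same path can be tried on a small inline fragment. This is only a sketch: the $table fragment is a hypothetical stand-in that imitates two rows of a bird infobox, not markup copied from Wikipedia.

```xquery
declare namespace h = "http://www.w3.org/1999/xhtml";

(: hypothetical fragment imitating two infobox rows; not real Wikipedia markup :)
let $table :=
  <table xmlns="http://www.w3.org/1999/xhtml">
    <tr><td>Genus:</td><td><i>Cygnus</i></td></tr>
    <tr><td>Species:</td><td><i>C. atratus</i></td></tr>
  </table>
(: pick the row that has a cell whose text is exactly 'Genus:', then take that row's second cell :)
let $genus := $table//h:tr[h:td[. = 'Genus:']]/h:td[2]
(: string() flattens the cell, dropping the italic markup :)
return string($genus)
```

For this fragment the query returns the string Cygnus. If the label text or the cell order on the real page differs even slightly, the predicate matches nothing and the result is empty.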

Example: Black Swan (Wikipedia)

The script often fails because:

  1. the name is ambiguous, e.g. Thrush (Wikipedia)
  2. the name is too broad, e.g. Kiwi (Wikipedia)
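
One way to make such failures visible is to check whether a binomial name was found at all before building the result. The following is a minimal sketch, not part of the original script: local:bird is a hypothetical helper, the error element is an invented report format, and the inline $page is a stand-in for a page that has no species infobox.

```xquery
declare namespace h = "http://www.w3.org/1999/xhtml";

(: hypothetical helper: reports a miss instead of returning empty attributes :)
declare function local:bird($name as xs:string, $page as node()?) as element() {
  let $binomial := normalize-space(string(
    ($page//h:tr[h:th//h:a[. = 'Binomial name']]/following-sibling::h:tr//h:b)[1]))
  return
    if ($binomial eq "")
    then <error name="{$name}">no binomial name found; the name may be ambiguous or too broad</error>
    else <bird name="{$name}" binomial="{$binomial}"/>
};

(: stand-in page without a 'Binomial name' row :)
let $page :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body><p>Stand-in for a page without a species infobox</p></body>
  </html>
return local:bird("Thrush", $page)
```

With this stand-in page the helper returns the error element rather than a bird element. The check cannot tell an ambiguous name from one that is too broad; it only signals that the expected row was not found.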

Markup that carried explicit semantics, including the ontological relationships between the terms, would clearly be preferable to these fragile contortions.