XQuery/Wikipedia Lookup
Page scraping is one way to retrieve a specific fact from a page, provided its structure is stable.
Here the task is to use Wikipedia to find the Latin name for a bird, given its common name.
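declare namespace h = "http://www.w3.org/1999/xhtml";

(: common name of the bird, passed as the "name" request parameter :)
let $name := request:get-parameter("name", ())
let $url := escape-uri(concat("http://en.wikipedia.org/wiki/", $name), false())
let $page := doc($url)
(: each value is the second cell of a row containing the cell 'Genus:' or 'Species:' :)
let $genus := $page//h:tr[h:td[. = 'Genus:']]/h:td[2]
let $species := $page//h:tr[h:td[. = 'Species:']]/h:td[2]
(: the binomial is the bold text in the rows after the 'Binomial name' header row :)
let $binomial := string($page//h:tr[h:th//h:a[. = 'Binomial name']]/following-sibling::h:tr//h:b)
return
  <bird name="{$name}" genus="{$genus}" species="{$species}" binomial="{$binomial}"/>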
Assuming the page is in Bird page format, locating the required data involves fairly complex XPath expressions. For example, the genus is the second cell of the table row whose label cell reads 'Genus:'.
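To see the genus and species paths in isolation, the following sketch runs them against a small inline fragment instead of a live page. The fragment is a hypothetical, heavily simplified stand-in for the real infobox markup, and the sample values (for the European robin) are only illustrative.

```xquery
xquery version "1.0";
declare namespace h = "http://www.w3.org/1999/xhtml";

(: Hypothetical, simplified stand-in for the infobox of a bird page;
   the markup of a real page is considerably larger. :)
let $page :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
      <table>
        <tr><td>Genus:</td><td>Erithacus</td></tr>
        <tr><td>Species:</td><td>E. rubecula</td></tr>
      </table>
    </body>
  </html>
(: pick the row that contains a cell reading 'Genus:', then take its second cell :)
let $genus := $page//h:tr[h:td[. = 'Genus:']]/h:td[2]
(: same pattern for the species row :)
let $species := $page//h:tr[h:td[. = 'Species:']]/h:td[2]
return
  <bird genus="{$genus}" species="{$species}"/>
```

Because the predicate compares the full text of a cell, any change to the label (a missing colon, extra text, a th instead of a td) makes the path select nothing, and the attribute is then silently empty.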
The script often fails, for instance when the page is not in the expected Bird page format or when its structure has changed.
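One way to make such failures visible, rather than returning empty attributes, is to test whether each path matched anything before building the result. The sketch below is an assumption-laden illustration, not part of the original script: the helper local:taxobox-value, the error element, and the inline sample page (which deliberately lacks a species row) are all hypothetical.

```xquery
xquery version "1.0";
declare namespace h = "http://www.w3.org/1999/xhtml";

(: Hypothetical helper: the text of the second cell in the first row whose
   first cell equals $label, or the empty sequence if no such row exists. :)
declare function local:taxobox-value($page as node()?, $label as xs:string) as xs:string? {
  let $cell := ($page//h:tr[h:td[1] = $label]/h:td[2])[1]
  return
    if (exists($cell)) then normalize-space($cell) else ()
};

(: Hypothetical sample page with no 'Species:' row, to exercise the error branch :)
let $page :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
      <table>
        <tr><td>Genus:</td><td>Erithacus</td></tr>
      </table>
    </body>
  </html>
let $genus := local:taxobox-value($page, "Genus:")
let $species := local:taxobox-value($page, "Species:")
return
  if (exists($genus) and exists($species))
  then <bird genus="{$genus}" species="{$species}"/>
  else <error>page is not in the expected format</error>
```

This only covers pages that load but do not match the expected layout. A page that cannot be retrieved or parsed at all makes doc() raise an error, which would need to be handled separately.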
It is not hard to see that more semantic markup with ontological relationships would be preferable to these uncertain contortions.