ETD Guide/Technical Issues/Harvest usage in Germany, France

From Wikibooks, open books for an open world

The Harvest system is often used for full-text search within ETD archives. In Germany, most university libraries use this software.

Preconditions for using Harvest

Before the installation, one should check whether the following technical preconditions are fulfilled.

Hardware:

fast processor (e.g. Sparc5...)
fast I/O
enough RAM (> 64 MB) and 1-2 GB of free disk space (the sources take 25 MB)

Supported operating systems:

DEC OSF/1 from 2.0
SunOS from 4.1.x
Sun Solaris from 2.3
HP-UX
AIX from 3.x
Linux, all kernels from 1999 onward, on all Unix platforms
Windows NT

The following additional software is needed to use Harvest:
Perl v4.0 or higher (v5.0)
gzip
tar
HTTP-Server (with remote machine)
GNU gcc v2.5.8 and higher
flex v2.4.7
bison v1.22

Harvest Components

The Harvest system consists of two major components:
The Harvest Gatherer
The Harvest Broker.
This split makes a distributed retrieval and search model possible.
Installation procedure: see ftp://ftp.tardis.ed.ac.uk/pub/harvest/develop/snapshots/

Gatherer

The Gatherer is responsible for collecting the full text and metadata of the dissertations. It visits several sites at regular intervals, for example daily or weekly, and builds an index database incrementally. The collected index data are held in a special format called SOIF (Summary Object Interchange Format). The Gatherer can be extended so that it can interpret additional formats.
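To give an idea of what a SOIF record looks like, here is a minimal Python sketch that serializes attribute/value pairs. It assumes the commonly documented layout (an `@TYPE { URL` header, one `Name{size}:<tab>value` line per attribute with the size counted in bytes, and a closing brace). The function name `to_soif` and the example URL are invented for illustration and are not part of Harvest.

```python
def to_soif(url, attributes, template_type="FILE"):
    """Serialize (name, value) pairs into a SOIF-style record (illustrative only)."""
    lines = [f"@{template_type} {{ {url}"]
    for name, value in attributes:
        # The number in braces is the size of the value in bytes.
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


record = to_soif(
    "http://example.org/etd/thesis.html",  # hypothetical URL
    [("Title", "Test"), ("Author", "Ludewig, Stefan")],
)
print(record)
```

Recording the byte size next to each attribute name lets a reader of the record take over values that contain line breaks without needing an escape syntax.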

Broker

The Broker makes the indexed information available through a search interface. It operates as a query manager and does the actual indexing of the data. Through a Web-based interface it can reach several Gatherers and Brokers simultaneously and perform several search requests.

In Germany, a nationwide retrieval interface based on the Harvest software has been established. It is called TheO (Theses Online Broker) and is accessible via:
http://www.iuk-initiative.org/iwi/TheO
Within NDLTD, a special Broker has been set up to add the German sites that use Harvest to the international search.

harvest arch.jpeg

harvest network.jpeg

Harvest can search within the following document formats. Each entry gives the document type, followed by what the summarizer does with it:

C, Cheader: Extract procedure names, included file names, and comments
Dvi: Invoke the Text summarizer on extracted ASCII text
FAQ, FullText, README: Extract all words in file
Framemaker: Up-convert to SGML and pass through SGML summarizer
HTML: Up-convert to SGML and pass through SGML summarizer
LaTex: Parse selected LaTeX fields (author, title, etc.)
Makefile: Extract comments and target names
ManPage: Extract synopsis, author, title, etc., based on "-man" macros
News: Extract certain header fields
Patch: Extract patched file names
Perl: Extract procedure names and comments
PostScript: Extract text in word processor-specific fashion, and pass through Text summarizer
RTF: Up-convert to SGML and pass through SGML summarizer
SGML: Extract fields named in extraction table
SourceDistribution: Extract full text of README file and comments for Makefile and source code files, and summarize any manual pages
Tex: Invoke the Text summarizer on extracted ASCII text
Text: Extract first 100 lines plus first sentence of each remaining paragraph
Troff: Extract author, title, etc., based on "-man", "-ms", "-me" macro packages, or extract section headers and topic sentences
Unrecognized: Extract file name, owner, and date created
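The rule listed for the Text type (first 100 lines plus the first sentence of each remaining paragraph) can be sketched in a few lines of Python. This is only an illustration of the rule, not Harvest's own summarizer: the function name `summarize_text` is invented, and the sketch assumes that paragraphs are separated by blank lines and that a sentence ends at the first ".", "!" or "?" followed by whitespace.

```python
import re


def summarize_text(text, head_lines=100):
    """Keep the first head_lines lines, then one sentence per remaining paragraph."""
    lines = text.splitlines()
    summary = lines[:head_lines]
    rest = "\n".join(lines[head_lines:])
    # Assumption: a blank line separates two paragraphs.
    for paragraph in re.split(r"\n\s*\n", rest):
        flat = " ".join(paragraph.split())
        if not flat:
            continue
        # Assumption: the first sentence ends at the first ., ! or ?
        # that is followed by whitespace or the end of the paragraph.
        match = re.match(r".*?[.!?](?=\s|$)", flat)
        summary.append(match.group(0) if match else flat)
    return "\n".join(summary)


sample = (
    "Heading\nIntro line\n\n"
    "First sentence. Second sentence.\n\n"
    "Another paragraph without end"
)
print(summarize_text(sample, head_lines=2))
```

If the cut after the leading lines falls in the middle of a paragraph, this sketch treats the remainder of that paragraph as a paragraph of its own.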

Configuration for PDF files:

Before the Harvest Gatherer can collect PDF documents and transform them into SOIF format, it has to be configured.

The standard configuration ignores the format. To make a format known to the Gatherer, a summarizer for PDF has to be built:

Delete the following line in the file /lib/gatherer/byname.cf:
Pdf ^.*\.(pdf|PDF)$

Configure the PDF summarizer. Acrobat can be used to convert PDF documents into PostScript documents, which the summarizer then processes. A better choice is the xpdf package by Derek B. Noonburg (http://www.foolabs.com/xpdf). It contains a PDF-to-text converter (pdftotext) that can be integrated into the summarizer Pdf.sum:
/usr/local/bin/pdftotext $1 /tmp/$$.txt
Text.sum /tmp/$$.txt
rm /tmp/$$.txt

Configuring the Gatherers for HTML-Metadata

By default, the Harvest Gatherer is configured to map every HTML meta tag into a SOIF attribute of its own, whose name equals the NAME attribute of the meta tag. For example, <META NAME="DC.Title" CONTENT="Test"> becomes a SOIF attribute named DC.Title. The configuration can be found at:
<harvest-home>/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl

The summarizer table has entries like this:
<META:CONTENT> $NAME

If a search should be restricted to the HTML meta tags, that is, to certain SOIF attributes, the attribute name has to be put in front of the search term and built into the retrieval forms provided to the user, e.g.
DC.Title: Test
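The effect of the mapping described above can be imitated with Python's standard HTML parser. The sketch below is not how the Gatherer is implemented (Harvest passes HTML through its SGML summarizer). It only shows the resulting name/value pairs, and the names `MetaCollector` and `meta_to_attributes` are invented for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (NAME, CONTENT) pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)  # the parser lower-cases tag and attribute names
        name, content = fields.get("name"), fields.get("content")
        if name and content is not None:
            # The NAME value becomes the attribute name, CONTENT its value.
            self.attributes.append((name, content))


def meta_to_attributes(html):
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.attributes


print(meta_to_attributes('<META NAME="DC.Title" CONTENT="Test">'))
```

A list of pairs is used instead of a dictionary because a page may repeat the same NAME, for example once per referee.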

Searching Metadata encoded in HTML

Because Germany has a nationally agreed metadata set for ETDs, these metadata are searchable throughout the Germany-wide Harvest network.

The following example shows how the Dublin Core metadata (only a small part is displayed) are encoded within the HTML front pages for ETDs:

<META NAME="DC.Type" CONTENT="Text.PhDThesis">
<META NAME="DC.Title" LANG="ger" CONTENT="Titelseite: Ergebnisse der CT- Angiographie bei der Diagnostik von Nierenarterienstenosen">
<META NAME="DC.Creator.PersonalName" CONTENT="Ludewig, Stefan">
<META NAME="DC.Contributor.Referee" CONTENT="Prof. Dr. med. K.- J. Wolf">
<META NAME="DC.Contributor.Referee" CONTENT="Prof. Dr. med. B. Hamm">
<META NAME="DC.Contributor.Referee" CONTENT="PD Dr. med. S. Mutze">

A suggestion has been formulated for how this agreed metadata set can be produced and handled at the university libraries. The following schema shows the details:

The doctoral candidate uploads the ETD document to the library by filling in an HTML form, which internally collects metadata in Dublin Core format.

The university library checks the metadata for correctness, and the ETD for readability and correct usage of style sheets.

The university library adds some descriptive metadata to the metadata set and puts a presentation version of the ETD on its server. During this procedure an HTML front page is produced that contains the Dublin Core metadata encoded as HTML meta tags.

Finally, the university library submits the metadata to the national library, which is in charge of archiving all German-language literature.

The national library copies the ETD and metadata to its own internal system.
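The step in which the collected form data end up as meta tags on an HTML front page could look roughly like the following Python sketch. The function name `render_meta_tags` and the sample values are assumptions made for illustration. The workflow above does not prescribe any particular implementation.

```python
from html import escape


def render_meta_tags(metadata):
    """Render (name, value) pairs as META tags, one per line."""
    tags = []
    for name, value in metadata:
        # Escape quotes and angle brackets so a value cannot break the tag.
        tags.append(
            f'<META NAME="{escape(name, quote=True)}" '
            f'CONTENT="{escape(value, quote=True)}">'
        )
    return "\n".join(tags)


form_data = [  # hypothetical form submission
    ("DC.Type", "Text.PhDThesis"),
    ("DC.Creator.PersonalName", "Ludewig, Stefan"),
]
print(render_meta_tags(form_data))
```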

germanywide.jpeg

theses online broker.jpeg

Searching SGML/XML documents

Harvest also allows a search within the elements of an SGML/XML DTD (document type definition).

All that has to be done is to configure the Gatherer component according to the following rules:

In the file /lib/gatherer/byname.cf below the home directory of the Harvest software (written as <harvest-home>), a line has to be added:
DIML ^.*\.did$
DiML is the DTD used at Humboldt-University, and .did is the file name extension of SGML documents conforming to the DiML DTD. This line tells the Harvest Gatherer which summarizer to use when documents ending in .did are found.
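How such name-based rules select a document type can be illustrated with a short Python sketch. It reads lines of the form "type, whitespace, regular expression" and returns the first type whose pattern matches a file name. The two patterns are the ones quoted in this guide. The function names, the handling of "#" comment lines, and the use of Python's `re` module in place of Harvest's own regular expression engine are assumptions.

```python
import re


def parse_byname(config_text):
    """Parse 'TypeName  regex' lines into a list of (type, compiled pattern)."""
    rules = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        type_name, pattern = line.split(None, 1)
        rules.append((type_name, re.compile(pattern)))
    return rules


def type_for(filename, rules):
    """Return the first type whose pattern matches the file name, else None."""
    for type_name, pattern in rules:
        if pattern.search(filename):
            return type_name
    return None


BYNAME = r"""
Pdf   ^.*\.(pdf|PDF)$
DIML  ^.*\.did$
"""

rules = parse_byname(BYNAME)
print(type_for("thesis.did", rules))
```

Once a type such as DIML has been determined, the Gatherer runs the summarizer that carries the same name (DIML.sum).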

Now the summarizer has to be built and saved as DIML.sum in <harvest-home>/lib/gatherer. The summarizer contains the following lines:
#!/bin/sh
exec SGML.sum ETD $*

In the catalog file <harvest-home>/lib/gatherer/sgmls-lib/catalog, the following entries have to be made. They point to the public identifiers of DiML and of the DTDs used by DiML:
DOCTYPE ETD DIML/diml2_0.dtd
PUBLIC "-//HUB//DTD Electronic Thesis and Dissertations Version DiML 2.0//EN" DIML/diml2_0.dtd
PUBLIC "-//HUB//DTD Cals-Table-Model//EN" DIML/cals_tbl.dtd
PUBLIC "-//HUBspec//ENTITIES Special Symbols//EN" DIML/dimlspec.ent

Now the directory <harvest-home>/lib/gatherer/sgmls-lib/DIML can be created (mkdir <path>) and four files copied into it:
diml2_0.dtd, cals_tbl.dtd, dimlspec.ent and diml2_0.sum.tbl (the DTDs, the entity file and the summarizer table). The file diml2_0.sum.tbl lists the DTD tags that should be searchable together with the corresponding SOIF attributes.

The Gatherer can be launched now.

To search within a certain SOIF attribute, the name of the attribute has to be put in front of the search term. For example, "title: Hallo" searches the SOIF attribute 'title' for the term 'Hallo'.
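The attribute-prefixed query form can be illustrated with a small Python sketch that splits a query into attribute and term and filters a list of records. This is not the Broker's query engine. The function names, the sample records, the case-insensitive comparison and the substring matching are all assumptions made for the example.

```python
def parse_query(query):
    """Split 'attribute: term' into (attribute, term); attribute is None if absent."""
    attribute, sep, term = query.partition(":")
    attribute = attribute.strip()
    if sep and attribute and " " not in attribute:
        return attribute.lower(), term.strip()
    return None, query.strip()


def search(records, query):
    """Return the records whose selected attribute (or any attribute) contains the term."""
    attribute, term = parse_query(query)
    needle = term.lower()
    hits = []
    for record in records:
        if attribute is None:
            values = list(record.values())
        else:
            values = [v for k, v in record.items() if k.lower() == attribute]
        if any(needle in value.lower() for value in values):
            hits.append(record)
    return hits


records = [  # invented sample data
    {"title": "Hallo Welt", "author": "Ludewig, Stefan"},
    {"title": "Other", "abstract": "Hallo again"},
]
print(search(records, "title: Hallo"))
```

With the prefix, only the first record is returned. Without it, the plain query "Hallo" matches both records because every attribute is searched.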

At Humboldt-University Berlin, a prototype has been installed that allows retrieval within document structures. A user can restrict the search to the following parts of a document and thereby narrow it down to the wanted information and hits:

  • In the full text (im Volltext)
  • By author (nach Autoren)
  • In titles (in Titeln)
  • In abstracts (im Abstract)
  • In author keywords (in Autorenschlagwörtern)
  • By institute/subject (nach Institut/Fachgebiet)
  • By referee (nach Gutachtern)
  • In chapter headings (Überschriften)
  • In figure captions (in Abbildungsbeschriftungen)
  • In tables (in Tabellen)
  • In the bibliography (in der Bibliographie)
  • By author name within the bibliography (nach Autoren in der Bibliographie)

search interface.jpeg

Harvest and OAI

With the growing enthusiasm for the approach of the Open Archives Initiative (OAI), in which a dedicated OAI protocol allows requests to be sent to document archives and standardized metadata sets to be received in response, ideas have emerged on how this could be connected with the Harvest-based infrastructure that has been set up in Germany.

oai retrieval.jpeg

Making a Harvest archive OAI compliant means that the information held by the Gatherer has to be normalized (same metadata usage) and that the index has to be stored temporarily in a database. The Institute for Science Networking at the Department of Physics of the University of Oldenburg, Germany, developed the following software. This software, written in PHP4, uses an SQL database to answer the OAI protocol requests. The SQL database holds the normalized data from the Harvest Gatherer.

oai spec.jpeg

Other university document servers, like the one at Humboldt-University, hold the Dublin Core metadata in an SQL database (Sybase) anyway. There, a PHP4 script running behind the CGI interface reads the OAI protocol requests, which are transported via HTTP, and turns them into SQL statements that are sent as queries to the SQL database. The database response, likewise in SQL form, is then transformed into OAI protocol syntax using XML and Dublin Core (see http://edoc.hu-berlin.de/oai).
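The translation from an OAI request to an SQL statement can be sketched as follows. This is not the PHP4 script described above: it is a Python illustration that uses an in-memory SQLite database as a stand-in for Sybase. The table name `records`, its columns, and the sample rows are invented. Only the `GetRecord` and `ListRecords` verbs with the `identifier`, `from` and `until` arguments are handled.

```python
import sqlite3
from urllib.parse import parse_qs

COLUMNS = "identifier, title, creator, datestamp"  # hypothetical column names


def oai_to_sql(query_string):
    """Translate an OAI request query string into (sql, parameters)."""
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    verb = params.get("verb")
    if verb == "GetRecord":
        return f"SELECT {COLUMNS} FROM records WHERE identifier = ?", [params["identifier"]]
    if verb == "ListRecords":
        clauses, args = [], []
        if "from" in params:
            clauses.append("datestamp >= ?")
            args.append(params["from"])
        if "until" in params:
            clauses.append("datestamp <= ?")
            args.append(params["until"])
        where = " WHERE " + " AND ".join(clauses) if clauses else ""
        return f"SELECT {COLUMNS} FROM records{where} ORDER BY datestamp", args
    raise ValueError(f"unsupported verb: {verb!r}")


# In-memory stand-in for the metadata database, filled with invented rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (identifier TEXT, title TEXT, creator TEXT, datestamp TEXT)")
db.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("oai:example:1", "First thesis", "Author, A.", "2001-03-01"),
        ("oai:example:2", "Second thesis", "Author, B.", "2001-09-15"),
    ],
)

sql, args = oai_to_sql("verb=ListRecords&from=2001-06-01&metadataPrefix=oai_dc")
rows = db.execute(sql, args).fetchall()
print(rows)
```

Passing the request values as query parameters instead of pasting them into the SQL text keeps a malformed request from altering the statement. A real provider would then wrap each row in the XML and Dublin Core response syntax, which this sketch leaves out.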

oai inter.jpeg

