# ETD Guide/Print version

ETD Guide

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/ETD_Guide

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

# Introduction

The UNESCO Guide for Creating Electronic Theses and Dissertations (ETDs) aims to help all those interested in projects and programs involving ETDs. To the extent possible, it has the eventual goal of aiding all students at all universities to create electronic documents and to use digital libraries. It has a particular focus on the emerging genre of ETDs, which should enhance the quality, content, form, and impact of scholarly communication that involves students engaged in research. It should help universities to develop their local infrastructure, especially regarding electronic publishing and digital libraries, which in turn build upon networking, computing, multimedia, and related technologies. In so doing, it should promote the sharing of knowledge locked up in universities, and the collaboration of universities spanning all countries and continents, from North to South, from East to West, from developing to developed, from public to private, and from large to small. The ultimate effect may be one of the world's largest sustainable programs for the diffusion of knowledge, across all fields, including science, technology, and culture.

This work should be of interest to diverse audiences. Its various sections (see discussion in section 1.6) aim to address the needs of universities (including administrators and faculty), students (including those who wish to create ETDs as well as those who wish to make use of already-created works), and those involved in training or setting up ETD projects or programs. Ample technical details are covered to support the needs of students wishing to apply multimedia methods to enhance their ability to communicate complex results, as well as the requirements of staff building a local support infrastructure to help such students.

This work is a living document that will continue to be updated in connection with the work of the Networked Digital Library of Theses and Dissertations (NDLTD, www.ndltd.org). It was born as a result of the support provided by UNESCO in grants given to Virginia Tech and the University of Montreal. It was prepared by an international team of faculty and staff, coordinated by Shalini Urs. Its organization and content are the result of the editorial labor of Joseph Moxley. Its availability in a broad range of languages is the result of teams of translators, some funded in part by UNESCO, and others volunteering their assistance.

While the development of a comprehensive worldwide program of ETDs of necessity builds upon an enormous range of knowledge and experience, it is hoped that this work will suffice as an initial Guide. We hope that those who read some or all of the sections contained herein will build upon the foundation laid out in the Introduction, so they understand the key ideas and are energized to move forward to advance the cause. In keeping with the goals of NDLTD, we hope that people around the globe who are interested in ETDs will help each other, and together transform graduate education, promote understanding, vastly expand access to new discoveries, and empower the next generation of researchers to become effective leaders of the Information Age.

Next Section: What are ETDs

# Introduction/What are ETDs

Joining and participating in the Networked Digital Library of Theses and Dissertations (NDLTD) is one of the best ways to understand the concepts behind digital libraries. It directly involves students pursuing graduate education by having them develop their theses or dissertations (TDs) as electronic documents, that is, as electronic theses or dissertations (ETDs).

There are two main types of ETDs. The first type, strongly preferred since students learn by doing as they create them, is the author-created and author-submitted work. In other words, these are documents that are prepared by the (student) author, as is typical in almost all cases, using electronic tools (e.g., Microsoft Word, LaTeX), and then are submitted in their approved and final electronic form (to their university or an agent thereof). Typically, the raw form of the document (e.g., Word's ".doc" format) is converted into a form that is easy to preserve, archive, and make accessible for future readers (e.g., one that follows standards, such as PDF or XML). That form is submitted, typically over a network connection, usually with related metadata (i.e., "data about data", often cataloging information as one might find in a library catalog, including title, year, author, abstract, and descriptors). Once submitted, such ETDs can be "discovered" by those interested, as a result of searching or browsing through the metadata, or by full-text searching through the full document (text, and perhaps even multimedia components, like images, video, or music).
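
To make the notion of metadata concrete, the sketch below models an ETD record as a small data structure and serializes it in the style of Dublin Core elements. The field names loosely follow the Dublin Core convention often used for ETD metadata, but the specific schema, record, and values here are illustrative assumptions, not any particular institutional standard.

```python
# Minimal sketch of an ETD metadata record ("data about data").
# Field names loosely follow Dublin Core (dc:title, dc:creator, ...);
# the record and its values are invented for illustration.
etd_metadata = {
    "title": "A Study of Digital Library Interoperability",
    "creator": "Jane Doe",                                 # student author
    "date": "1999",
    "description": "Abstract: This thesis examines ...",
    "subject": ["digital libraries", "interoperability"],  # descriptors
    "format": "application/pdf",                           # archival format
}

def to_dublin_core_xml(record):
    """Render the record as simple Dublin Core-style XML elements."""
    lines = []
    for element, value in record.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            lines.append(f"<dc:{element}>{v}</dc:{element}>")
    return "\n".join(lines)

print(to_dublin_core_xml(etd_metadata))
```

A record along these lines is what search and browse services index, so that an ETD can be discovered by title, author, abstract, or descriptors.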

The second type of ETD is typically an electronic file that is created (usually by university or service company staff) by scanning in the pages of a paper thesis or dissertation. The resulting ETDs are much less desirable than the abovementioned type: they require much more storage space, they do not easily support full text searching, they cannot be flexibly manipulated (e.g., cannot be zoomed in on by those with poor vision), and they don't lead to the student authors learning about electronic publishing (to prepare them for electronic submission of papers, proposals, or other works now commonly required). Nevertheless, such page images can be made accessible at low cost so that those afar can print and read a facsimile of the original paper pages.

In the subsequent discussion, most of the focus is on the first type of ETD mentioned above. However, the second type is commonplace in projects where a retrospective capture of old works is desired, or where a university wishes to share its research, is willing to go to considerable expense in that regard, and is not very concerned with educating or empowering students in electronic publishing methods.

Next Section: ETDs as new genre of documents

# Introduction/ETDs as new genre of documents

With thousands of students each year preparing ETDs, the creativity of the newest generation of scholars is being continuously expressed as they work to present their research results using the most appropriate form, structure, and content. While conforming as needed to the requirements of their institution, department, and discipline, students should develop and apply skills that will best prepare them for their future careers and lead to the most expressive rendering possible of their discoveries and ideas. Thus, ETDs are a new genre of documents, continuously re-defined as technology and student knowledge evolve.

The first benefit is that new, better types of TDs may emerge as ETDs develop as a genre. Rather than being bound by the limits of old-style typewriters, students may be freed to include color diagrams and images, dynamic constructs like spreadsheets, interactive forms such as animations, and multimedia resources including audio and video. To ensure preservation of the raw data underlying their work, promote learning from their experience, and facilitate confirmation of their findings, they may enhance their ETDs by including the key datasets that they have assembled.

As the new genre of ETDs emerges from this growing community of scholars, it is likely to build upon earlier forms. Simplest are documents that can be thought of as “electronic paper” where the underlying authoring goal is to produce a paper form, perhaps with color used in diagrams and images. Slightly richer are documents that have links, as in hypertext, at least from tables of contents, tables of figures, tables of tables, and indexes – all pointing to target locations in the body of the document. To facilitate preservation, some documents may be organized in onion fashion, with a core mostly containing text (that thus may be printable), appendices including multimedia content following international standards, and supplemental files including data and interactive or dynamic forms that may be harder to migrate as the years pass by. Programs, applets, simulations, virtual environments, and other constructs yet to be discovered may be shared by students who aim to communicate their findings using the most suitable objects and representations.

Next Section: Why ETDs?

# Introduction/Why ETDs?

The quality of a university is reflected by the quality of its students’ intellectual products. Theses and dissertations reflect an institution’s ability to lead students and support original work. In time, as digital libraries of ETDs become more commonplace, students and faculty will make judgments regarding the quality of a university by reviewing its digital library. Universities that incorporate new literacy tools, such as streaming multimedia, will attract students who hope to produce innovative work.

Starting an ETD program is like starting any other project: a need for the results must exist so all those involved will be motivated and committed through all the steps to the end—the moment when ETDs have become a regular and consolidated activity in the graduate programs of the University.

ETDs are based on the joint work of graduate students, mentors, graduate deans, administrative staff, library staff and the IT team. The success of the implementation of the ETD program requires the commitment of all these players plus that of the university’s higher administrative officers.

Moxley, Joseph M., “American Universities Should Require Electronic Theses and Dissertations,” Educause Quarterly, No. 3, 2001, pp. 61–63.

Next Section: Minimize duplication of effort

# Introduction/Minimize duplication of effort

One benefit of ETDs is a reduction in the needless repetition of investigations that are carried out because people are unaware of the findings of other students who have completed a TD. Except in unusual cases, masters’ theses are rarely reported in databases (e.g., very few, except those from Canada, appear in UMI services like Dissertation Abstracts). Few dissertations prepared outside North America are reported either. With a globally accessible collection of ETDs, students can quickly search for works related to their interest from anywhere in the world, and in most cases examine and learn from those studies without incurring any cost.

Next Section: Improve visibility

# Introduction/Improve visibility

Once ETDs are collected on behalf of educational institutions, digital library technology makes it easy for works to be found. Through http://www.theses.org/, NDLTD directly makes ETDs available, and points to other services that facilitate such discovery. As a result, hundreds or thousands of accesses per year per work are logged, for example, according to reports from the Virginia Tech library regarding the ETDs it makes publicly accessible. As the collection of available ETDs grows and reaches critical mass, it is likely that it will be frequently consulted by the millions of researchers and graduate students interested in such detailed studies, expositions of new methodologies, reviews of the literature on specialized topics, extensive bibliographies, illustrative figures and tables, and highly expressive multimedia supplements. Thus, students and student works will become more visible, facilitating advances in scholarship and leading to increased collaboration, each made possible by electronic communication, across space and time.

# Introduction/Accelerate workflow: graduate more quickly, make ETDs available faster to outside audience

ETDs can be managed through automated procedures honed to take advantage of modern networked information systems. Since the shift to ETDs requires policy and process discussion among campus stakeholders, it is possible to streamline workflow and save time and labor. Checking of submissions and cataloging is sped up, moving and handling of paper copies is eliminated, and delays for binding are removed. The time between submission and graduation can be reduced, and ETDs can be made available for access within days or weeks rather than months.

Next Section: Costs and benefits

# Introduction/Costs and benefits

ETD submission over networks costs essentially nothing, which compares favorably with the charges of hundreds or thousands of dollars otherwise required to print, copy, or publish TDs using paper or other media. In many institutions, the networking, computing, and software resources available to students suffice, so that students preparing ETDs need make no additional expenditure. Similarly, on many campuses, assistance is available to answer questions and to train students in word processing and other skills valuable for authors of electronic documents and users of digital libraries. If students elect to use personal computers and acquire their own software for ETD creation, these will later be useful in other research and development work, for both professional and personal needs, with low marginal expense specifically attributable to ETDs. Thus, for students preparing ETDs, the pros typically far outweigh the cons.

Next Section: Purpose, goals, objectives of ETD activities

# Introduction/Purpose, goals, objectives of ETD activities

The underlying purpose of ETD activities is to prepare the next generation of scholars to function effectively as knowledge workers in the Information Age. By institutionalizing this in a worldwide program, progress can be made toward tripartite goals of enhancing graduate education, promoting sharing of research, and supporting university collaboration. Particular objectives include:

• students knowing how to contribute to and use digital libraries;
• universities developing digital library services and infrastructure;
• enhanced sharing of university research results; and
• ETDs having higher quality and becoming more expressive of student findings.

# Introduction/Helping students be better prepared as knowledge workers

Most documents created today are prepared with the aid of computers. Many universities have "writing across the curriculum" programs, to ensure that students can create electronic documents that convey their knowledge and understanding, and demonstrate their ability to participate in the scholarly communication process. To function as effective knowledge workers, students must go beyond word processing skills that lead only to paper documents. They must learn to work with others, to share their findings by transmitting their results to others. This teamwork makes it feasible to collaborate, to co-author works, and thus to participate in research groups or teams, which are common throughout the research world (at the very least involving a faculty advisor and graduate student author). It also makes it pertinent for students to participate in common activities of modern researchers. Thus, they can be trained to submit a proposal electronically (e.g., as is required by the US's National Science Foundation) and to submit a paper to a conference (where papers are uploaded by authors, and downloaded by editors/reviewers, as part of the collection and selection activities).

Next Section: Helping students be original

# Introduction/Helping students be original

Originality is a defining feature of academic research. Using new media can empower you to conduct research in new ways. As the Exemplary ETDs suggest, you can incorporate streaming video, including interviews with subjects and footage of settings, which readers can then view; this can be particularly useful for a case study or for ethnographically informed research. You can provide different reading pathways for multiple audiences: inexperienced lay audiences can follow a simplified version of your study, while more technical audiences can pursue more detailed analysis and citations. The technology allows you to present your work in new and different ways. For example, you can include audio, video, and animations in your research. You can add spreadsheets, databases, and simulations. You can even create virtual reality worlds.

Next Section: Helping students network professionally

# Introduction/Helping students network professionally

Networking on the Network by Phil Agre

This is a 52,000-word essay that Phil Agre wrote with the intention of “get[ting] it into the hands of every PhD student in the world.”

# Introduction/Improving graduate education, and quality/expressiveness of ETDs

Around the globe, people have diverse views toward graduate students and graduate education. Some universities have strong advocacy for graduate students, often involving a graduate school and a graduate dean. Others have no particular focus on graduate students as a group, with all support assumed to come by way of faculty mentors and advisors. Regardless of the local context, however, few would dispute that graduate students should have fundamental knowledge and skills that will allow them to be effective researchers.

Accordingly, the move toward ETDs aims to enhance graduate education in effective researching. Building on the fact that students learn best by doing, this move encourages students to create and submit their own ETDs, thus learning a bit about electronic publishing and digital libraries in the context of preparing a work that is of importance to them. In particular, they need to gain some knowledge and skill in electronic document preparation, understanding not only the superficial aspects of relevant tools, but also beginning to learn about such processes as archiving, preservation, cataloging, indexing, resource discovery, searching, and browsing.

## Aids in Communication

Furthermore, once students become exposed to electronic document preparation processes, they often become aware of aids to help them communicate. Tools can be used to help with figures and drawing, to assist with analysis and graphing, or to support sharing and interactive exploration of data (e.g., through spreadsheets or database management systems). Images can be prepared as thumbnails that go into the body of a document to help the reader, as well as in varying sizes and resolutions to promote in-depth analysis.

Beyond these tools and aids, however, is the fact that authors spend more time when their likely readership is large. Students will tend to produce higher quality work, and faculty will demand better writing and clearer presentation of results, if the audience for a work numbers in the hundreds or thousands, as opposed to the common current situation where a mere handful will read the document. With ETDs leading to large numbers of accesses, many students and faculty will work hard to enhance quality beyond that commonplace prior to the advent of ETDs.

Next Section: Helping faculty

# Introduction/Helping faculty

• Each student could develop a bibliography reflecting his or her work, and a collective bibliography would emerge encompassing all of a faculty member's advisees.
• A student's acquired expertise will not completely leave with that student but will remain to help bootstrap new students (and new interests of the faculty member).
• The efforts of students working with a faculty member can be known to a wider audience. This would provide publicity and enhanced visibility for the student and that student's lab and major professor.
• Students who know how to use tools, such as Microsoft Word's tracking or commenting features, are better prepared for future e-publishing; they can use these tools for future collaboration and mentoring, which should save the faculty member time with the reviews and revisions.

# Introduction/Increasing readership of ETDs, communicating research results

Once a student embarks upon the process of creating an ETD, this student will be assured of a large increase in the number of readers. Such an audience emerges when ETDs are locatable because of searches using popular search engines and/or are available for other students as well as more advanced researchers to draw upon. Many look for the large bibliographies and comprehensive literature reviews found in ETDs. Others look for in-depth discussion of research methods. Some seek data sets to use for follow-up studies. A growing number of faculty use ETD collections as reference sources, saving space on their shelves or time required to walk to a library. Some also refer to ETDs in classes or in homework assignments, especially where there are important results and/or clear expressions of concepts and ideas.

Through such ETDs, students can communicate more effectively. Color figures are much easier to obtain, in many situations, than those limited to black and white. Complex tables can be built, with sorting and subtotals incorporated, thanks to software tools. Spreadsheets or simulations help readers gain hands-on familiarity with data and analysis, promoting deeper understanding. Those studying phenomena that can be characterized with color photos (from digital cameras or scanners), digital video sequences, audio files, medical images, or other digital representations can include those representations directly. Thus, expanded and more effective communication of research results can be aided by the use of ETDs.

At Virginia Tech, for example, many popular theses and dissertations are available to the public electronically. In 1996, there were 25,829 requests for ETD abstracts and 4,600 requests for ETDs themselves; by 1999 (January–August), there were over 143,056 requests for abstracts and 244,987 requests for ETDs. As of October 1999, the most popular ETD at VT had been requested over 75,000 times (see VT's download statistics at http://scholar.lib.vt.edu/theses/data/pdatah.htm).

# Introduction/Helping universities develop digital library services & infrastructure

Many universities have little or no involvement in digital library efforts. Yet, it is clear that most universities will have digital library efforts as part of, or in addition to, conventional library efforts. Digital libraries are of particular value when distance education is involved or when a university has a number of far-flung sites. They also can promote more flexible access to information (e.g., 24 hours/day, 7 days/week, from home or office, with no blocking because another person has borrowed a work).

Establishing an ETD program automatically moves a university into the digital library era. Free software that allows a university to develop its own digital library is available through NDLTD or from NDLTD members. Setting up that digital library helps the university bring together the personnel and infrastructure required for other digital library projects, too. Though the demands are small, the digital library that emerges will force consideration of almost all of the key concerns of those who work with digital libraries. Also, since various sponsors interested in digital library technology are helping in many ETD projects, there will be continual enhancement of the digital library services that relate to improving local infrastructure.

# Introduction/Increasing sharing and collaboration among universities and students

As more and more ETDs become available at universities or through their agents, more and more students will draw upon the emerging digital library of ETDs. This will make it easier for students to learn of the work of other students. Contact is quite feasible, since many ETDs include email addresses and other contact information, at least for faculty helpers/advisors if it is not feasible to track down a graduated student author. Once contact is made, further discussion can follow on references, application of methodologies, reuse of datasets, and extension of prior work, as well as investigation of remaining open problems. This may lead to friendships, collaborations, joint publications, and other benefits. Such relationships may expand beyond student-student connections to also include faculty, groups, laboratory teams, and other communities.

Once ETDs are available, through diverse means, others may gain access. In particular, such access may be the only recourse open to those in developing countries who cannot afford to make purchases from ProQuest, who cannot wait for expensive shipping of copies through interlibrary loan, who cannot attend the myriad conferences that demand considerable travel expenses, or who cannot pay for expensive journals (which may have only short summaries of thesis or dissertation results).

Access may result from word of mouth, from searches, from announcements, or from following links or citations. It may occur soon after a work is made known to those whose works it cites, or to those who employ Selective Dissemination of Information services. It may be mediated by digital library systems or similar mechanisms. It may result when ETDs are referred to in journal articles, conference papers, reports, course notes, and other forms.

Numerically, if all theses and dissertations are released as ETDs, the number of works per year may be comparable to the number of journal articles published by university students, and perhaps 20-50% of the number of journal articles prepared by faculty. For many student authors, an ETD will be their only publication. If the number who read an ETD is 10 or 100 times the number who read a paper thesis, then there will clearly be a significant increase in access to university research as a result of an ETD program.

Next Section: Searching

# Introduction/Searching

Since in most cases it is in the interest of students and universities to maximize the visibility of their research results, the general approach of NDLTD is to encourage all parties interested to facilitate access to ETDs.

As part of the education component of NDLTD, it is hoped that graduate students will become adept at searching through electronic collections, especially those in digital libraries. If we regard managing information as a basic human need, ensuring that the next generation of scholars has such skill seems an appropriate minimal objective. Most specifically, since graduate research often builds upon prior results from other graduate researchers, it seems sensible for all ETD authors to be able to search through available ETD holdings. NDLTD encourages that online resources, self-study materials, individual assistance, and group training activities be provided so that graduate students become knowledgeable about resource discovery, searching, query construction, query refinement, citation services, and other processes – both for ETDs and for content in their discipline.
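
As a toy illustration of the full-text searching discussed above, the sketch below ranks ETD records by how often the query terms occur in their titles and abstracts. Real digital library systems use far more sophisticated indexing and ranking; the records and the scoring scheme here are invented for illustration.

```python
# Toy full-text search over ETD metadata: score each record by the
# number of query-term occurrences in its title and abstract, then
# return matching titles, best score first. Records are invented.
def search(records, query):
    terms = query.lower().split()
    scored = []
    for rec in records:
        text = (rec["title"] + " " + rec["abstract"]).lower()
        score = sum(text.count(term) for term in terms)
        if score > 0:
            scored.append((score, rec["title"]))
    return [title for score, title in sorted(scored, reverse=True)]

records = [
    {"title": "Digital Libraries",
     "abstract": "A study of digital library interoperability."},
    {"title": "Soil Chemistry",
     "abstract": "Nutrient cycling in forest soils."},
]
print(search(records, "digital library"))  # → ['Digital Libraries']
```

Query refinement, in this simplified picture, amounts to adjusting the terms until the ranked list surfaces the works of interest.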

# Introduction/Browsing: Classification systems, classification schemes used in different disciplines

A key method to gain access to ETDs is through browsing. Browsing promotes serendipity, in analogous fashion to when a person looks around in library stacks, picking up and glancing at a number of works, typically ones that are relatively close to each other.

Browsing often involves a researcher in a learning process, connecting with the concepts, areas, and vocabulary used in a particular field. A researcher often moves around in "concept space", seeing which concepts are broader and which are narrower, which are related, and which are examples or applications of theories or methods. Thus, in the case of medical works, browsing often encourages researchers to think about diseases, treatments, location in an organism (or the human body or a subsystem thereof), symptoms, and other considerations. In many fields, browsing involves exploring a taxonomic system, navigating an organization chart or hierarchical structure, or moving from a term to a more focused noun phrase.

In many disciplines, there are official classification systems. Some are quite broad, in some cases covering all areas, as is the case when using the subject headings prepared by the US Library of Congress (LCSH), or the Dewey Decimal Classification system (DDC, available from Forest Press, owned by OCLC). However, for in-depth characterization of a research work in a particular field, it is more appropriate to use the category system for the field. In computing, the ACM system is popular. In medicine, MeSH or UMLS are widely used. In physics, PACS is widely used. Gradually, digital library support for ETDs will support browsing using both broad schemes like DDC, as well as narrow schemes like PACS, integrated synergistically.
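
As a small sketch of hierarchy-based browsing, the snippet below walks a toy classification tree, in the spirit of broad schemes like DDC or narrow, field-specific ones like PACS. The categories and ETD titles are invented; real schemes are far larger and standardized.

```python
# Toy classification tree for browsing ETDs. Inner dicts are
# subcategories; leaf lists hold ETD titles. All entries are invented.
taxonomy = {
    "Science": {
        "Physics": ["ETD: Quantum dot spectroscopy"],
        "Computing": ["ETD: Digital library protocols"],
    },
}

def browse(tree, path):
    """Follow a category path (broad to narrow) and return that node."""
    node = tree
    for category in path:
        node = node[category]
    return node

# Narrowing from a broad category to a specific field:
print(sorted(browse(taxonomy, ["Science"]).keys()))  # → ['Computing', 'Physics']
print(browse(taxonomy, ["Science", "Physics"]))      # → ['ETD: Quantum dot spectroscopy']
```

Integrating broad and narrow schemes, as the text suggests, would amount to cross-linking nodes of one such tree to nodes of another.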

Next Section: Well known sites and resources for ETDs

# Introduction/Well known sites and resources for ETDs

NDLTD runs the Web site http://www.theses.org/ (also under the alias http://www.dissertations.org/) as a central clearinghouse for access to ETDs. This site points to various other locations that support portions of the worldwide holdings of ETDs. For example, the largest corporate archive, with over 1.5 million entries, is managed by UMI and has most doctoral dissertations from USA and Canada, as well as most masters’ theses from Canada, in microfilm form with metadata available as a searchable collection through Dissertation Abstracts. Since 1997 UMI has scanned new submissions (originally from microfilm, later directly from paper) and made the page images available through PDF files. With over 100,000 ETDs accessible through subscription or direct payment mechanisms, UMI hosts the largest single collection of both electronic and microfilm TDs.

Other corporations as well as local, regional, national, and international groups associated with NDLTD have Web sites too, such as http://www.cybertheses.org/ for the International Francophone project or http://www.dissonline.org/ for the German Dissertation Online project. In addition, a number of WWW search engines have indexed some of the available ETD collections, so this genre is included in general Web searches.

Some other schemes allow access to ETD collections. Using Z39.50, the “information retrieval protocol”, for example, the Virginia Tech ETD collection can be accessed through suitable clients or from some library catalog systems. OCLC’s WorldCat service, with over 20 million catalog records, has an estimated 3.5 million entries for TDs. Perhaps most promising is that the global as well as regional and local metadata information about ETDs may become widely accessible through the Open Archives Initiative (http://www.openarchives.org/).
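
To show how ETD metadata can become accessible through the Open Archives Initiative, the sketch below builds an OAI-PMH harvesting request. The verb and parameter names (`ListRecords`, `metadataPrefix=oai_dc`) come from the OAI-PMH specification; the repository base URL is a hypothetical placeholder, and no request is actually sent.

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository.

    The base_url used below is a made-up placeholder; a real harvester
    would use a repository's advertised OAI-PMH endpoint.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optionally restrict to one set of records
    return base_url + "?" + urlencode(params)

print(oai_list_records_url("https://repository.example.edu/oai"))
# → https://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester issues such requests periodically, so that local metadata about ETDs can be aggregated into regional and global search services.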

The German "Dissertation Online" project was undertaken by the Initiative of Information and Communication of the German Learned Societies (http://www.iuk-initiative.org/index.shtml.en). This project was funded by the German Research Foundation (DFG: Deutsche Forschungsgemeinschaft) for 2.5 years (April 1998 - October 2000). The final conference took place in Berlin from October 30–31, 2000 (http://www.dissonline.de/abschlusstagung/index.html). The project worked at an integrative level and aimed toward a Germany-wide initiative to bring scholarly publications online, such as dissertations and diploma and master's theses, which are usually lost in libraries. Through the joint work of six academic disciplines (mathematics, physics, chemistry, educational sciences, and computer science) and of libraries such as the State and University Library Lower Saxony (SUB Göttingen) and the German National Library (DDB: Die Deutsche Bibliothek), carried out at several locations (Duisburg, Oldenburg, Erlangen, the Berlin Computing Centre and School of Education, Göttingen, and Frankfurt), the project was highly successful in Germany and elsewhere. A tight cooperation was established with the Networked Digital Library of Theses and Dissertations (NDLTD), set up by Edward Fox of the Virginia Polytechnic Institute and State University, USA.

The developments and the "DissOnline.de" movement resulted in the establishment of a coordination bureau (Koordinierungsstelle DissOnline) at the German National Library (DDB). All German efforts are now coordinated, and all developments (tools, guidelines, etc.) collected, by this bureau.

The main tasks of the original project still reflect the problem areas that must be taken into consideration when setting up local, national, or global projects on electronic theses and dissertations:

• Legal Issues
• Document formats
• Retrieval
• Multimedia
• Library issues
• Archiving

The results of these different subtasks are integrated into the different sections that follow for students, universities, technical issues and trainers.

European collections of ETDs include:

1. PhysDis, a large collection of physics theses from universities across Europe
2. TheO, a collection of theses from different fields at 43 universities in Germany, insofar as the theses contain metadata
3. MPRESS, a large collection of European mathematical theses, which contains as a subset
4. Index nationaux prépublications, thèses et habilitations, a collection of French theses in mathematics

Next Section: Brief history of ETD activities: 1987-2007

# Introduction/Brief history of ETD activities: 1987-2007

The first real activity directed toward ETDs was a meeting convened by Nick Altair of UMI in Ann Arbor, Michigan during the fall of 1987 involving participants from Virginia Tech, ArborText, SoftQuad, and University of Michigan. Discussion focused on the latest approaches to electronic publishing and the idea of applying the Standard Generalized Markup Language (SGML, an ISO standard approved in 1985) to the preparation of dissertations, possibly as an extension of the Electronic Manuscript Project. In 1988, Yuri Rubinsky of SoftQuad was funded by Virginia Tech to help develop the first Document Type Definition (DTD) to specify the structure of ETDs using SGML. Pilot studies continued using SoftQuad’s Author/Editor tool, but only with the appearance of Adobe’s Acrobat software and Portable Document Format (PDF) in the early 1990s did it become clear that students could easily prepare their own ETDs.

In 1992 Virginia Tech joined with the Coalition for Networked Information, the Council of Graduate Schools, and UMI, to invite ten other universities to select three representatives each, from their library, graduate school/program, and computing/information technology groups. This meeting in Washington, D.C. demonstrated the strong interest in and feasibility of ETD activities among US and Canadian universities. In 1993, the Southeastern Universities Research Association (SURA) and Southeastern Library Network (Solinet) decided to include ETD efforts in regional electronic library plans. Virginia Tech hosted another meeting involving multiple universities in Blacksburg, VA in 1994 to develop specific plans regarding ETD projects. On the technical side, the decision was made that whenever feasible, students should prepare ETDs using appropriate multimedia standards in addition to both a descriptive (e.g., SGML) and rendered (e.g., PDF) form for the main work.

Then, in 1996, the pace of ETD activities sped up. SURA funded a project led by Virginia Tech to spread the concept around the Southeastern United States. Starting in September 1996, the US Department of Education funded a three-year effort to spread the concept around the USA. The pilot project that had proceeded at Virginia Tech led to a mandatory requirement that, after 1996, all theses and dissertations be submitted only in electronic form. International interest spread the concept to Canada, the UK, Germany, and other countries. To coordinate all these efforts, the free, voluntary federation called NDLTD (Networked Digital Library of Theses and Dissertations) was established and quickly began to expand. Annual meetings began in the spring of 1998 with about 20 people gathering in Memphis, TN. In 1999 about 70 came to Blacksburg, VA, while in 2000 about 225 arrived in St. Petersburg, FL for the third annual conference.

Next Section: Global cooperation in ETD activities

# Introduction/Global cooperation in ETD activities

There continues to be rapid growth and development of ETD activities around the world. Whether such efforts arise spontaneously or as extensions of existing efforts, it is hoped that all will proceed in cooperative fashion so universities can help each other in a global collaboration [4], passing on lessons learned as well as useful tools and information. The mission of NDLTD is to facilitate such progress in a supportive rather than prescriptive manner. Over 100 members joined NDLTD by 2000, including over 80 universities in addition to national and regional project efforts; international, national, and regional organizations; and interested companies and associations. The only requirement for joining NDLTD is interest in advancing ETD activities, so it is hoped this will help ensure global cooperation.

A number of groups involved in NDLTD are particularly interested in supporting efforts in developing countries. The sharing of research results through ETDs is one of the fastest ways for scholars working in developing countries to become known and have an impact on the advancement of knowledge. It also is one of the easiest and least costly ways for universities in developing countries to become involved in digital library activities and to become known for their astute deployment of relevant and helpful technologies. The Organization of American States, UNESCO, and other groups are playing a most supportive role in facilitating this process.

Next Section: Overview of rest of the Guide

# Introduction/Overview of rest of the Guide

Subsequent sections further explain ETD activities.

Section 2 discusses issues for university decision makers and implementers of projects on campuses.

Section 3 presents the topic for students.

Section 4 deals with further technical details.

Section 5 takes a broader view, raising the level to issues related to launching campus initiatives and training those who may train students.

Finally, Section 6 provides a glimpse of future directions.

Next Section: Universities

# Universities

By the end of 2007, the number of universities involved in NDLTD was well over 100. Scores of other universities were considering work with ETDs, and hundreds of universities were aware of the concept. NDLTD hopes that there soon will be ETD efforts in every country, then in every state/province, eventually in every leading university, serving every language group, and ultimately reaching every college and university. Since NDLTD aims to help, and there are no costs to join, it is hoped that membership will rise to match closely the number of institutions interested in ETDs.

It is not clear why any university should not become involved in ETD efforts. Once an ETD program on a campus has evolved to the point that ETD submission is a requirement, the effort saves money for the university as well as the students, while providing many important benefits. And, reaching such a point is not hard, if there is a local team, with effective leadership, that has a clear understanding of ETDs and what is occurring elsewhere in terms of support, cooperation, and collaborative efforts.

This section of the Guide explains why and how universities can establish ETD programs, and helps those involved in an ETD program to address concerns and problems that may be voiced. It appears on the one hand that an ETD program has the healthy effect of helping a university to engage in a rich dialog on a wide variety of issues related to scholarly communication. This is important, since we are in the midst of a revolution in these processes, and both students and players must be aware of the situation if they are to manage effectively (and economically, while working in accord with the core values of scholars). On the other hand, ETD efforts have advanced to the point that proceeding with an ETD program, while not as "sexy" as some of the more vigorously funded digital library efforts, does work well, providing real solutions to real concerns, leading to sustainable and beneficial practices. In short, launching an ETD program is a "no brainer" that can quickly advance almost any university to the happy position of having an effective and economical digital library initiative in place, which also can serve as a model for other efforts.

Next Section: Why ETDs?

# Universities/Why ETDs?

The quality of a university is reflected by the quality of its students’ intellectual products. Theses and dissertations reflect an institution’s ability to lead students and support original work. In time, as digital libraries of ETDs become more commonplace, students and faculty will make judgments regarding the quality of a university by reviewing its digital library. Universities that incorporate new literacy tools, such as streaming multimedia, will attract students who hope to produce innovative work.

Starting an ETD program is like starting any other project: a need for the results must exist so all those involved will be motivated and committed through all the steps to the end, the moment when ETDs have become a regular, consolidated activity in the university's graduate programs.

ETDs are based on the joint work of graduate students, mentors, graduate deans, administrative staff, library staff, and the IT team. Successful implementation of an ETD program requires the commitment of all these players, plus that of the university's senior administrators.

See also: Joseph M. Moxley, "American Universities Should Require Electronic Theses and Dissertations," Educause Quarterly, no. 3 (2001): 61–63.

# Universities/Reasons and strategies for archiving electronic theses and dissertations

1. ETDs make the results of graduate programs widely known.

2. Graduate programs may be evaluated by the number of theses and dissertations (TDs) produced and by the number of accessible ETDs.

3. In many countries, when financed with public funds, it is expected that TDs will be made public. ETDs are the easiest way to accomplish this.

4. TDs are part of the assets and history of the universities.

5. TDs exist and are published on paper, so why not publish them electronically?

6. TDs are refereed by examining committees, a guarantee of the quality of what is published.

7. TDs contain bibliographical reviews.

8. TDs present the methods used during research, thus allowing these methods to be used by others.

9. TDs allow extensions to be identified and undertaken.

10. TDs hold information that will help avoid duplication of efforts.

11. To publish TDs funded with public money is a way of returning the results to society.

12. To electronically publish TDs makes the results known nationally and internationally.

13. To electronically publish TDs makes it less expensive to students, who do not have to print as many copies.

14. ETDs require less storage space.

15. ETDs can identify and connect national and international research groups.

16. An ETD program introduces digital libraries into universities, allowing other projects to bloom.

17. ETDs are a way of sharing intellectual production.

18. Wide knowledge of good-quality TDs strengthens the faculty, the graduate programs, and the university.

19. Widely known results allow copies to be identified more easily.

20. Universities will be able to share knowledge on digital libraries.

The reasons are purposely listed in the order they were presented during the discussion and, no doubt, seem quite scattered.

Thus, let us imagine three categories of reasons: benefits to students, benefits to universities, and benefits to regions/countries/society. The reasons listed above can be reorganized and assigned to these categories.

First, there are benefits to students, which reflect on the university and on society too:

1. An ETD program introduces digital libraries into universities, allowing other projects to bloom.

2. ETDs are a way of sharing intellectual production.

3. ETDs can identify and connect national and international research groups.

4. ETDs make the results of graduate programs widely known.

5. Graduate programs may be evaluated by the number of theses and dissertations (TDs) and by the number of accessible ETDs.

6. TDs allow extensions to be identified and undertaken.

7. TDs contain bibliographical reviews.

8. TDs hold information that will help avoid duplication of efforts.

9. TDs present the methods used during research, thus allowing these methods to be used by others.

10. Electronic publication of TDs makes them less expensive for students, who do not have to print as many copies.

11. Electronic publication of TDs makes the results known nationally and internationally.

12. Wide knowledge of good-quality TDs strengthens the faculty, the graduate programs, and the university.

13. Widely known results allow copies to be identified more easily.

Next, the universities are the focus, and some reasons listed under benefits to students appear again:

1. An ETD program introduces digital libraries into universities, allowing other projects to bloom.

2. Graduate programs may be evaluated by the number of theses and dissertations (TDs) and by the number of accessible ETDs.

3. ETDs can identify and connect national and international research groups.

4. ETDs make the results of graduate programs widely known.

5. ETDs require less storage space.

6. TDs are part of the assets and the history of the universities.

7. Electronic publication of TDs makes the results known nationally and internationally.

8. Universities will be able to share knowledge on digital libraries.

9. Wide knowledge of good-quality TDs strengthens the faculty, the graduate programs, and the university.

10. Widely known results allow copies to be identified more easily.

Benefits that appeared under the previous two categories are listed again in this last category, devoted to regions/countries/society:

1. ETDs can identify and connect national and international research groups.

2. In many countries, when TDs are financed with public funds, it is expected that they will be made public. ETDs are the easiest way to accomplish this.

3. Publishing TDs funded with public money is a way of returning the results to society.

4. Universities will be able to share knowledge on digital libraries.

5. Widely known results allow copies to be identified more easily.

Finally, two other reasons for ETDs are:

1. TDs exist and are published on paper, so why not publish them electronically?

2. TDs are refereed by examining committees, a guarantee of the quality of what is published.

The classified lists together contain more entries than the original list. The explanation is simple: many reasons bring benefits to more than one category.

It is not hard to conclude that ETDs are beneficial to all, and that ETD programs are worthwhile and should be considered by universities.

Next Section: How to develop an ETD program

# Universities/How to develop an ETD program

Some points must be considered before starting an ETD program because they affect:

• The stage where the project starts;
• The type of training to be provided;
• The technology to be adopted;
• The personnel to be hired, if needed;
• The time frame for the program to be operative;
• The total cost of implementation.

The team implementing the ETD program must be aware that decisions must be made based on the legislation, culture, financial conditions, infrastructure, and political aspects of their area. So, they must be prepared to analyze the important issues, suggest solutions, and have them approved by the proper authorities. Only when these aspects are considered can the program start.

Many of the following points may not apply to developed nations but can be crucial to developing countries.

The points are grouped into four categories:

Analysis of the stage of the automation of the library system:

• Is there an OPAC?
• Does it support the MARC format?
• Does it support digital objects? What types? Can the contents of objects be searched?
• Does it have a Web interface?
• Is it compliant with Protocol Z39.50?
• Is there a project to automate the catalog? Does the system to be used support the MARC format? What is the time frame for the catalog to be automated?
• Can the system to be used hold a digital library (types of digital objects, Web interface, Z39.50)?

Number of TDs per year and to date (both on paper and in digital files):

Per year - this number is important because it influences:

• The number of persons in the help desk;
• The decision on how to start: with a few areas as a proof of concept, or with all programs;
• The planning of the growth of infrastructure/equipment.

To date - this number is important because it influences:

• The decision whether to retrospectively capture TDs from selected areas and/or date ranges, or all TDs;
• The decision whether to use scanning plus OCR for selected areas and/or date ranges, or for all (paper) TDs;
• The planning of the types and numbers of equipment;
• The planning of the team;
• The planning of the growth of infrastructure/equipment.

Presentation format of TDs:

• One or many formats? The number of DTDs, templates and viewing filters depends on this;
• Is an update of the format needed or desired? The process must be defined: top-down, by consensus, etc.;
• Is there legislation to comply with?

Authors' rights and publishing conditions:

• Is there legislation to comply with?
• Is a document of authorization required? If so, define terms and approve with legal department;
• Which conditions to establish: blackout period, on-campus publishing only, partial publishing, etc.

The breadth of the topics above shows that an ETD program involves many areas of the university, and all of them must be committed to it. After these points have been addressed, the project of the ETD program may begin.

An ETD program, like any other project, requires that the roles of each player be well defined. All the people involved must be aware of the importance of their work individually and of the interaction that makes up a team. There must be a commitment in all levels, from the highest administration to the graduate students.

In terms of the university, the three main players are:

• The Graduate School
• The IT/Computer Group
• The Library

They will lead the university in deciding upon:

• The formats for ETDs;
• The way to deal with authors' rights and publishing procedures;
• The workflow for the new dissertations;
• The strategy and workflow for retrospective capture;
• The style sheets and filters for visualization;
• The preservation format(s) and procedures;
• The digital library system to be used;
• The identification of the digital documents;
• The relation of the ETD digital library with legacy systems (library, administrative, etc.);
• The support team to be used;
• The training program.

At the beginning of the project, they must have the support of the highest administration to:

• Change the TDs’ formats, if necessary;
• Change the culture of mentors, students and administrative staff;
• Solve the problems related to authors' rights;
• Assure preservation and information security of the digital collection (ETD);
• Make sure funds are provided for equipment (HW & SW), personnel and training programs.

A good way to start the discussion in the university is by writing a pre-project to be submitted by the three players to the senior administrative officers. The main topics of the pre-project should include, but not be limited to:

• The objectives:
  • Main objective
  • Secondary objectives
• The benefits to:
  • Students
  • The institution
  • The region/country/society
• The characteristics of the ETD program and of the digital library:
  • Functional characteristics
  • Technical characteristics
• The results to be achieved
• A brief and focused description of the program and the project
• A description of the main steps of the project
• An estimate of the resources (human and material) needed to implement the program
• An estimate of the annual resources (human and material) needed to keep the program in operation
• An estimate of the time frame to implement each step of the project and to have the program operable
• The commitment, responsibilities, and actions of each participant involved
• The relation of the program to other programs in:
  • The national scenario
  • The international scenario

The writing of this pre-project will help the team organize and ascertain everyone’s involvement. In addition, the higher administration will be able to decide based on more reliable information, increasing the degree of commitment of all involved in the program.

# Universities/Scenarios illustrating approaches, schedules and workflow

A common workflow at a campus involved in NDLTD is as follows. Each student will use a home computer, or a computer in a lab, to submit his or her ETD. The work will have been created on that computer, or moved there on some medium, such as a diskette or CD-RW. The student will go to a well-known URL (e.g., at Virginia Tech, will start at http://etd.vt.edu and then follow links).

First, students will provide an ID and password that authenticates them. Next, they will enter metadata about their ETD. When that is complete and checked, they will use their browser to locate each of the files that make up the ETD, so uploading can proceed. Eventually, the entire ETD will be uploaded.
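The metadata-entry step can be sketched as a completeness check before upload is allowed. This is an illustrative example, not any university's actual submission code; the required field names are assumptions in a Dublin Core style:

```python
# Required descriptive fields a submission system might demand
# before accepting an ETD upload (illustrative, Dublin Core style).
REQUIRED_FIELDS = {"title", "creator", "date", "subject", "abstract", "degree"}

def missing_metadata(record):
    """Return the required fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

submission = {
    "title": "A Study of Digital Libraries",
    "creator": "A. Student",
    "date": "2000-05-01",
    "subject": "digital libraries",
    "abstract": "An abstract.",
    "degree": "PhD",
}
print(missing_metadata(submission))  # prints set(), i.e. nothing missing
```

In practice the checker would also validate formats (e.g. the date) before moving on to file upload.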

Later, a person in the graduate school will find that a new ETD has been uploaded. He or she will check the submission. If there are problems, he or she will use email to contact the student so that changes/corrections can be made, and the process repeated up through this step. Once a proper ETD is submitted, a cataloguer in the Library will be notified. He or she will catalog the work, adding in additional categories, and checking the metadata. Eventually he or she will release the work to allow proper access, according to the student/faculty preferences.
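The check-correct-resubmit loop described above behaves like a small state machine. A sketch, with state names invented for illustration:

```python
# Allowed transitions in a hypothetical ETD review workflow: the
# graduate school may bounce a submission back for corrections any
# number of times before the library catalogs and releases it.
TRANSITIONS = {
    "submitted": {"needs_changes", "approved"},
    "needs_changes": {"submitted"},   # student resubmits
    "approved": {"cataloged"},        # library catalogs the work
    "cataloged": {"released"},        # access opened per preferences
}

def advance(state, new_state):
    """Move to new_state if the workflow allows it."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError("cannot go from %s to %s" % (state, new_state))
    return new_state

state = "submitted"
for step in ["needs_changes", "submitted", "approved", "cataloged", "released"]:
    state = advance(state, step)
print(state)  # released
```

Encoding the workflow this way makes the rules explicit: a release can only follow cataloging, which can only follow graduate-school approval.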

More refined workflow models can be applied. Suppose that at the Polytechnic University of Valencia, Spain, the process starts from a catalog of ETD proposals, published by faculty members filling in the appropriate form; each proposal may include the title, keywords, abstract, level of expertise required, and other useful information. A student can apply for a number of proposals (specifying an order of preference) by filling in a form with his/her personal data. A faculty committee makes the final assignment of proposals to candidates. At this point, most of the ETD metadata have already been collected, and there is no need to enter them during the submission process.
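The committee's final assignment could, in the simplest case, be a greedy match on students' stated preference orders. This is only a sketch; a real committee would weigh expertise and other criteria, and the names below are invented:

```python
def assign(preferences):
    """Greedily give each student the highest-preference proposal
    still unclaimed, considering students in order."""
    taken, result = set(), {}
    for student, prefs in preferences.items():
        for proposal in prefs:
            if proposal not in taken:
                taken.add(proposal)
                result[student] = proposal
                break
    return result

prefs = {
    "ana": ["P1", "P2"],
    "luis": ["P1", "P3"],
}
print(assign(prefs))  # {'ana': 'P1', 'luis': 'P3'}
```

Note the dependence on ordering: "ana" is considered first and wins the contested proposal "P1", so "luis" falls through to his second listed choice.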

This section looks at how electronic publication of theses and dissertations will enhance graduate education. Topics discussed include: improved knowledge of electronic publication technologies, greater access to scholarly information, wider distribution of an author's work, and student and faculty concerns.

## Introduction

The move by Graduate Schools to allow or even require students to submit theses and dissertations as electronic or digital documents (ETDs) creates much excitement, both positive and negative, among the students and faculty who will be affected by this initiative to digitize these important documents. These positive and negative views will no doubt be tempered by increased knowledge of the ETD process and through increased experience in creating and archiving ETDs. At this time in the development of the ETD process, I believe the importance of an open-minded approach to this new way of expressing the outcomes of masters and doctoral research is captured very well in the following statement by Jean-Claude Guédon in his work, Publications électroniques (1998):

When print emerged, universities failed to recognize its importance and almost managed to marginalize themselves into oblivion. With a new major transition upon us, such benign neglect simply will not do. Yet the challenges universities face in responding to an increasingly digitized and networked world are staggering. Universities need a vision allowing them to express their dearest values in new forms, rather than protect their present form at the expense of their most fundamental values.

The ETD initiatives now under way in universities around the world are about bringing fundamental change to our current concept of what constitutes a thesis or a dissertation. In the U.S., this concept has not changed significantly since students first began to submit paper theses and dissertations in our first research universities over 120 years ago. By moving from a paper presentation of research results to a digital presentation, we make available to the ETD author a powerful array of presentation and distribution tools. These tools allow the author to reveal to masters and doctoral committees, to other scholars, and to the world, the results of their research endeavors in ways and with a level of access never before possible.

## Changes in Presentation

I believe graduate schools and faculty, in the name of maintaining quality, have all too often inhibited the creativity of graduate students by forcing them into a mold to which they must all conform. This is nowhere more evident than in the thesis or dissertation where format restrictions abound. Some graduate schools have special paper with printed margins within which all written material must be contained. Some graduate schools still read and edit the entire text of every thesis or dissertation. Many have thesis or dissertation editors whose reputation for using fine rulers and other editorial devices for enforcing graduate school format are legendary.

I believe that the student must submit a high quality document that is legible, readable, and that conveys the results of the research or scholarship in a manner that is clear and informative to other scholars. The document does not, however, need to be narrowly confined to a specific format if it meets the above criteria. To create a high quality ETD, students must be information literate. That is, they must, at a minimum, have a level of knowledge of office software that will allow them to create a document that, if printed, would result in a high quality paper document. This kind of properly formatted digital document thus becomes the primary construct of the author, rather than a paper document. In conducting training workshops for Virginia Tech students, a number of whom are older, non-traditional students, we have found that this lack of office software skills is the single greatest impediment to their being able to produce a good "vanilla" ETD, that is, an ETD that has the appearance of a paper ETD but is submitted as a digital document.

As of early 1999, about 80% of Virginia Tech's 1,500 ETDs were vanilla ETDs. Accordingly, we have emphasized the development of these skills, which number fewer than ten and can be taught in an hour, in our student ETD workshops. Once students have the fundamental skills to produce an ETD, they are ready, if they desire, to move on to more advanced topics for producing a visually and audibly enhanced ETD. Advanced topics include landscape pages; multimedia objects like graphs, pictures, sound, movies, and simulations; and reader aids like internal and external links, thumbnail pages, and text notes. Students are not required to use these enhancement tools, but by giving them access to these tools we open creative opportunities for students to express the outcomes of their masters or doctoral research more clearly.

To maintain quality, the student's thesis or dissertation committee must actively participate as reviewers in this process and must be prepared to exercise judgment concerning the suitability of material for inclusion in the ETD. The resulting "chocolate ripple" or in some cases "macadamia nut fudge" ETDs are the forerunners of a new genre of theses and dissertations which will become commonplace in the future.

Whether tomorrow's graduate students are employed inside or outside the university environment, the ubiquitous presence and use of digital information will certainly be a major part of their future careers. For this reason efforts to increase the information literacy are certain to benefit graduate students long after they have used these skills to produce a thesis or a dissertation.

## Valuable Content

The traditional view is that the doctoral dissertation, and to a lesser extent the masters thesis, provides a one-time opportunity for the student to do an in-depth study of an area of research or scholarship and to write at length about the topic, free of the restrictions on length imposed by book and journal editors. Such writings may contain extensive literature reviews and lengthy bibliographies. They may also contain results of preliminary studies or discussions of future research directions that would be very valuable to the researchers and scholars who follow. Primarily because of restrictions on the length of journal articles, such information often exists only in theses and dissertations. I believe this view is correct and should be maintained in the digital thesis or dissertation.

## Access and Attitudes

The attitudes of students and faculty toward the value of theses and dissertations vary greatly. For the reasons given above some value them highly. Others, particularly some faculty, see them as requirements of graduate schools that have little value. These individuals consider the journal publication the primary outcome of graduate student research. I do not dispute the added value of the peer review process for journal articles and for books, yet I do firmly believe that so long as the scholar or researcher using ETDs as information sources recognizes theses and dissertations for what they are, these documents are valuable sources of information.

Indeed, these information sources have been grossly underutilized because of the difficulty in obtaining widely available, free access to them either through university libraries or through organizations like University Microfilms. If a comprehensive worldwide networked digital library of theses and dissertations existed, I believe the impact and utilization of these sources of information would rise in proportion to the increased access. This view is supported by experience at Virginia Tech in our ETD project. Research done in 1996 by the Virginia Tech library showed that the average thesis circulated about twice a year and the average dissertation about three times a year in the first four years they were in the library. These usage statistics do not include the use of copies housed in the home departments of the students or the usage of dissertations in the University Microfilms collection. Even so, the usage of the 1500 ETDs in our digital library far outpaces the use of paper documents.

Growth in usage has been steady and remarkable. For the calendar year 1998 there were over 350,000 downloads of the PDF files of the 1,500 ETDs in the VT library: over 200 downloads for each ETD in the collection. The distribution of interest in the ETDs is equally remarkable. The majority of the interest comes from the U.S., with inquiries in 1998 coming from the following domains: 250,000 from .edu, 88,000 from .com, 27,000 from .net, 6,800 from .gov, and 3,400 from .mil. Inquiries also come from countries around the world, including 8,100 from the United Kingdom, 7,300 from Germany, 4,200 from Australia, 3,900 from Canada, and 2,200 from South Korea. The most accessed ETDs have been downloaded tens of thousands of times, and many have over one thousand accesses. To learn more about accesses, see http://scholar.lib.vt.edu/theses/data/somefacts.html
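Figures like these come from tallying Web server log requests by top-level domain. A minimal sketch (the hostnames are invented):

```python
from collections import Counter

def tally_domains(hosts):
    """Count requests by top-level domain, e.g. 'edu' or 'uk'."""
    return Counter(h.rsplit(".", 1)[-1] for h in hosts)

# Hypothetical client hostnames extracted from an access log.
hosts = ["lib.vt.edu", "www.example.com", "cs.ox.ac.uk", "grad.mit.edu"]
print(tally_domains(hosts))  # Counter({'edu': 2, 'com': 1, 'uk': 1})
```

Country-level counts such as those quoted above would group on country-code domains (.uk, .de, .au) the same way.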

## Publication and Plagiarism

When the ETD project began at Virginia Tech, some students and faculty expressed great concern that publishers would not accept derivative manuscripts or book manuscripts from ETDs. For some publishers this concern is legitimate and the ETD project has put into place a system for students and advisors to restrict access to ETDs until after journal articles appear. This system seems to satisfy faculty, students and publishers. Publishers that have discussed this matter with us usually have not expressed concern with the release of the ETD after the journal article is published. One exception may be small scholarly presses that publish books derived from ETDs. These presses view the book as having a sales life of several years after the initial date of publication. In these cases, it may be necessary to extend the period of restricted access well beyond the publication date of the book.

For the longer term, however, it is important that researchers and scholars regain control of their work by becoming more knowledgeable about their rights as original creators and as holders of the copyrights to the work. This requires universities to have active programs to educate their faculty and students about copyright. Publishers also need to be educated to be less concerned about ETDs interfering with the marketability of their journals. This can be done, in part, by an effort on the part of researchers and scholars to educate the publishers of their professional journals. They need to help persuade journal editors that ETDs most often are not the same as the journal articles derived from them, and that there is a serious difference: ETDs have not received the stamp of approval that results from peer review. As such, they should not be considered a threat to the news value or to the sales potential of the journal. It is interesting to note that a Virginia Tech survey of students who had released their ETDs worldwide showed that twenty students had published derivative manuscripts from their ETDs with no publisher resistance to accepting the manuscripts.

It is also noteworthy that the American Physical Society has a practice of sharing electronic copies of preprints of manuscripts undergoing peer review (http://xxx.lanl.gov/). Those that successfully pass peer review are published in the Society's journals. This practice is essentially the same as the practice being proposed for ETDs above.

The risk of plagiarism is next on the list of concerns of students and faculty. We do not yet have enough experience with ETDs to speak authoritatively about this issue. If one thinks a bit about it, though, it seems that the risk of exposure will deter such activity. Most researchers and scholars still work in fields where a fairly small group of workers have detailed knowledge of their work. It follows that, because of the size of the field and because of the ease of detecting plagiarized passages in electronic documents, the risk of detection will make widespread plagiarism unlikely.

More disconcerting to me is the closely related concern of researchers and scholars that by reading their students' ETDs, other researchers and scholars will achieve a competitive edge in the contest for grants and contracts. Most research in U.S. universities is done in the name of supporting the well-being of the nation and is sponsored directly or indirectly with public tax dollars. There is something wrong with a view that research and scholarship should not be shared among other researchers and scholars for the above reasons. Yet the concern is understandable in today's financially stretched research universities, where the competition for promotion and tenure among young faculty is fierce. Similarly, faculty are encouraged to develop intellectual property in which the university claims a share. I'm not sure if we have gone too far down this road, but I am concerned that our obligation as scholars to make our work known to other scholars is being compromised. A result of this compromise is that the goal of scholars to advance knowledge through sharing knowledge may also be slowed.

## How Virginia Tech implemented the ETD Requirement

ETD discussions with the Graduate Dean, the Library, and Ed Fox, a faculty member conducting research on digital libraries, began in 1991. At that time we were exploring the possibilities of optional submission. Shortly thereafter, Adobe Acrobat® software for creating and editing Portable Document Format (PDF) files came on the market. This software for the first time provided a tool that was easy to use and allowed documents to be moved between computer operating systems and platforms while retaining the original document formatting. This was a great step forward in increasing worldwide access to information while retaining the original author's formatting style. At this time we began a pilot study to determine if Acrobat® met our needs. We determined rather quickly that it was the most suitable product for our needs at that time. In my opinion that conclusion holds true today.

We continued discussions with the Graduate School and the Library, and in the Fall of 1995 concluded that we would seek to make the submission of ETDs a requirement of the Graduate School. We took a proposal to the Commission on Graduate Studies and Policies for discussion. A degree standards subcommittee discussed the proposal amongst themselves, then with ETD team members Ed Fox from Computer Science, Gail McMillan from the Library, and John Eaton from the Graduate School. In these discussions the expressed concerns dealt with archiving and preservation, the burden to the students, and the burden to the faculty and departments. After full discussion, the subcommittee recommended approval of the proposal. The commission discussed and approved the proposal, subject to the following provisions.

• That a student training process be conducted to show students how to produce an ETD.
• That necessary software (Adobe Acrobat®) be made available to students in campus computer labs.
• That the faculty not be burdened by this process.

With these provisions agreed to, the Commission approved a one-year voluntary submission period to be used for beginning the student ETD workshops, informing the university community, and developing the infrastructure needed to move to requiring ETDs; after this period, ETDs would become a requirement in the spring semester of 1997. All went very smoothly while the process was voluntary. Workshops were started, software was placed in campus computer labs, visits were made to departments, articles were published in the campus newspaper, and the advisory committee was formed.

Late in the spring semester of 1997, after the mandatory requirement began, a small but vocal group of faculty, mostly from the life sciences and chemistry, expressed serious concerns about compromising the publication of derivative manuscripts from ETDs made available worldwide. While we had a provision for withholding release of ETDs pending publication of manuscripts, the time period of six months was thought to be too short. The ETD team responded to this concern by giving the student and the advisor greater control over access to the ETD through the ETD approval form, which can be found at http://etd.vt.edu/. The modifications made to the ETD approval form seem to have satisfied faculty concerns about publication, and since that date the ETD project has operated very smoothly at Virginia Tech and is now rapidly becoming an integral part of graduate education.

## Conclusion

The ETD project has provided the opportunity for fundamental change in the expression of and access to the results of research and scholarship done by students in research universities around the world. These tools can also easily be extended to the expression of and access to research done by faculty. As scholars, we should not let this opportunity slip by. As Jean-Claude Guédon said, "Benign neglect simply will not do".

Next Section: Role of the Library and Archives

# Universities/Role of the Library and Archives

ETDs have a very positive impact on libraries because they are an easy way to expand services and resources. With ETDs libraries can evolve into digital libraries and online libraries. Because authors create and submit the digital documents and yet others validate them (i.e., graduate school approval), all the library has to do is receive, store, and provide access. This is not a radical departure from what libraries do normally; only these documents are electronic. Today, many libraries are already handling electronic journals, so ETDs can extend the multimedia resources to the online environment and give every library something unique in their digital resources.

The library can do more, such as improving workflow, reducing the time from receipt to public access, and, of course, one ETD can have multiple simultaneous users. ETDs can be submitted directly to the library server so that as soon as they are approved, they can become available to users, eliminating the need to move the documents from the Graduate School to the Library. This change in workflow can also eliminate the time delay previously caused by binding and cataloging them prior to providing access. The most tenuous and highly emotional service libraries provide for ETDs is archiving. Because not enough time has elapsed to prove that digital documents can live for decades in publicly accessible digital libraries, the uncertainty of online archives causes great unease to many. Libraries must, therefore, be careful about security and back-ups.

Another role the library can play is to put a prototype in place. While the Graduate School is nurturing the policy evolution among the academic community, a model can be developed to meet the needs of the academy. This can be done by establishing an ETD project Web site that documents the evolving initiative, lists active participants, and provides a sample submission form and potential policy statements. These statements might address levels of access and copyright, and link to existing ETD initiatives.

When a university is adopting policies about ETDs, it is helpful to have a place where its students, faculty, and administrators can see what an electronic library of digital theses and dissertations might be. Many graduate students are eager to participate in an ETD initiative, and the library’s Web site can take advantage of their enthusiasm. Invite students who have completed their TDs to submit them electronically. This will build the initial ETD database and test the submission form, as well as give the Graduate School personnel opportunities to compare the old and the new processes with somewhat familiar TDs.

The library is also in a position to offer graduate students an incentive to participate. Most libraries collect binding fees so that theses and dissertations can be bound uniformly. Archiving fees can replace binding fees when ETDs replace paper TDs. However, the library may wish to waive this fee for the first students (up to some limited number) who submit ETDs instead of TDs.

# Archiving Electronic Theses and Dissertations

The best chance electronic information has of being preserved is when it is used online regularly and continually. As soon as it is not used, there will be trouble remembering the media that produced it and that made it accessible.

As ETDs begin to join the library’s traditional theses and dissertations, it is a good time to align the commitment and the resources to maintain these online information resources over time. A library’s Special Collections Department and/or its University Archives are often responsible for storing and preserving theses and dissertations. Parallel standards, policies, and procedures should be documented for electronic theses and dissertations (ETDs).

The academic departments determine the quality of the work of their students, while the individual thesis/dissertation committees approve the student's work on its own merits. The Graduate School primarily oversees mechanical considerations, the purpose of which is to provide a degree of uniformity, to assure that each thesis or dissertation is in a form suitable for reading and/or viewing online and that it can be preserved. The University Archives ensures long-term preservation and access to this record of graduate students’ research.

With digital materials, libraries provide access and simultaneously prolong the life of the work, ensuring the durability of the present through the stability of the means of mediation.

# Factors Affecting Archiving

1) Access

The first goal is to have all ETDs online and available all the time from a stable server. If necessary (depending on the capabilities of the server), some ETDs could be moved to a secondary server. Considerations for moving ETDs to a secondary server are usage (ETDs with the fewest accesses) or age (the oldest ETDs). Formats and file sizes probably would not be a factor in employing a secondary server, though extremely large ETDs may be prime candidates for separate online storage and access. If it becomes necessary to move some ETDs to a secondary server, programs would be written to trigger migration. Currently "age" would be easier to program, but in the future "usage" (actually, lack of use or few downloads) would be the preferable characteristic for migrating ETDs to a secondary server for archiving. URNs would link to migrated ETDs and could be mapped to PURLs at some future date.
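The migration trigger described here can be sketched as a simple selection rule over the collection. The following is a minimal Python sketch under stated assumptions: the record fields (`urn`, `submitted`, `downloads`) and the thresholds are hypothetical, and an actual program would read them from the library's ETD database.

```python
from datetime import date

# Hypothetical records; field names are illustrative assumptions only.
etds = [
    {"urn": "etd-1998-001", "submitted": date(1998, 5, 1), "downloads": 12},
    {"urn": "etd-2003-042", "submitted": date(2003, 6, 15), "downloads": 4100},
]

def select_for_migration(records, today, max_age_years=5, min_downloads=50):
    """Pick ETDs for the secondary server: the oldest, or the rarely used."""
    selected = []
    for r in records:
        age_years = (today - r["submitted"]).days / 365.25
        if age_years > max_age_years or r["downloads"] < min_downloads:
            selected.append(r["urn"])
    return selected

print(select_for_migration(etds, date(2004, 1, 1)))  # ['etd-1998-001']
```

Age-based selection needs only the submission date, which is why it is easier to program today; usage-based selection additionally requires per-ETD download statistics to be collected over time.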

2) Security

Data

ETDs that have been submitted but not yet approved should be backed up frequently (e.g., hourly) if changes have occurred since the last back-up; otherwise, generate a back-up programmatically every few hours. Make a weekly back-up of ETDs in all directories (i.e., all levels of access). Make copies programmatically and transfer them to another server; make weekly back-ups to tape for off-line storage. Retain copies in quarterly cycles and annually archive to a CD-ROM.
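The back-up schedule above amounts to a small decision rule: a pending ETD is copied when it has changed since the last back-up and the back-up interval has elapsed. A minimal Python sketch under those assumptions (the timestamps would come from the submission server's file system or database):

```python
from datetime import datetime, timedelta

def backup_due(last_backup, last_modified, now, interval=timedelta(hours=1)):
    """A pending (submitted, not yet approved) ETD is backed up hourly,
    but only if it has changed since the last back-up."""
    return last_modified > last_backup and now - last_backup >= interval

now = datetime(2002, 3, 4, 12, 0)
# Changed since the last back-up, and an hour has passed -> back up.
print(backup_due(datetime(2002, 3, 4, 10, 0), datetime(2002, 3, 4, 11, 30), now))
# No change since the last back-up -> skip.
print(backup_due(datetime(2002, 3, 4, 11, 0), datetime(2002, 3, 4, 10, 0), now))
```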

Content

Authors cannot modify their ETDs once approved. Exceptions are made with proper approval. Viewers/readers cannot modify or replace any ETDs. Only in extreme circumstances would the system administrator make modifications to an ETD (e.g., when requested by the Graduate School to change the access restrictions or to activate or change email addresses).
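The content-security policy above can be stated as a simple access check. The sketch below is illustrative only; the role names and the `graduate_school_request` flag are assumptions, not part of any actual ETD system:

```python
def may_modify(role, etd_approved, graduate_school_request=False):
    """Authors may edit only before approval; readers never may; once
    approved, only the system administrator may modify an ETD, and only
    on a Graduate School request (e.g., changed access restrictions)."""
    if role == "reader":
        return False
    if role == "author":
        return not etd_approved
    if role == "sysadmin":
        return (not etd_approved) or graduate_school_request
    return False

print(may_modify("author", etd_approved=False))   # True: not yet approved
print(may_modify("author", etd_approved=True))    # False: approved, locked
```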

3) Format Migration

The library should share with the university the responsibility to guarantee that ETDs will be available both within and outside the scholarly community indefinitely. To keep ETDs reader-friendly and to retain full access will mean migrating current formats to new standard formats not yet known. This will be done through the cooperative efforts of the library (which maintains the submission software, the database of ETDs, and the secure archive) and university computing expertise. Standard formats should be the only files accepted for approval. The formats recommended for ETDs, which may need to be converted to new standards in the future, are listed below.

Image Formats:

• CGM (.cgm);
• Encapsulated PostScript (.eps);
• GIF (.gif);
• JPEG (.jpg);
• PDF (.pdf);
• PhotoCD;
• TIFF (.tif)

Video Formats:

• MPEG (.mpg);
• QuickTime – Apple (.qt and .mov)

Audio Formats:

• AIF (.aif);
• CD-DA, CD-ROM/XA (A, B, or C);
• MIDI (.midi); MPEG-2;
• SND (.snd);
• WAV (.wav)

Text Formats:

• ASCII (.txt);
• PDF (.pdf);
• XML/SGML according to the document type: "etd.dtd" (.etd) ETD-ML

Authoring Formats:

• Authorware, Director (MMM, PICS)

Special Formats:

• Excel (.xls)

Next Section: Intellectual Property Rights

# Universities/Intellectual Property Rights

Whether an author creates an electronic or a paper thesis or dissertation does not change his or her rights and obligations under the law. While university policies vary, it is the custom that the person who creates a work is the owner of the copyright. Therefore, the author of an electronic thesis or dissertation is the copyright holder and owns the intellectual property, the ETD. Within the United States, authors have rights protected by law, particularly US Code, Title 17, especially section 106. Authors get to decide how their works will be reproduced, modified, distributed, performed in public, and displayed in public. An author may use another’s work with certain restrictions known as “fair use” (US Code, Title 17, sect. 107). The four factors of fair use that must be considered equally are: (1) purpose and character of use; (2) nature of the copyrighted work; (3) amount and substantiality; and (4) effect. In the United States, libraries are also covered in the same copyright law under section 108. [For further explanations, see http://www4.law.cornell.edu/uscode/17/ch1.html]

The owner of an ETD (as explained above, usually the student) must take direct action if an ETD service is to be provided. The wording that is agreed to in writing, by student authors as well as the faculty working with them, must make clear how ETDs are handled. The wording used at Virginia Tech is one model. In the approval form used for this purpose, the following is agreed:

The student certifies that the work submitted is the one approved by the faculty they work with. The university is given authority, directly or through third party agents, to archive the work and to make it accessible, in accord with any access restrictions also specified on the form. This right is in perpetuity, and in all forms and technologies that may apply.

Next section: Publishers

# Universities/Publishers

A continuing topic of discussion in the ETD community, including Graduate School administrators, research faculty, and librarians, is the question of “prior publication.” That is, whether publishers and editors of scholarly journals view electronic theses and dissertations that are available on the Internet and through convenient Web browsers as being published because they are so readily and widely available.

Several publishers have attested that they do not regard ETDs as prior publications, and these statements are available in a variety of sources such as http://www.ndltd.org/publshrs/index.html. At the Cal Tech ETD 2001 conference, Keith Jones from Elsevier stated emphatically that his company encourages its authors to link their articles in Elsevier journals to their personal Web sites and also authorizes faculty members’ departments to provide such links. Jones reported that Elsevier understands the importance of getting new authors such as graduate students to publish in its journals early in their careers because they are then likely to continue to publish with the same journal. He also pointed out that publishing in an Elsevier Science journal is an important source of validation for academics, so that the subsequent availability of those articles from other non-profit and educational sources is not a threat.

# Universities/Human resources and expertise needed for an ETD program

While each university’s situation with regard to personnel will vary, it has been shown that ETD programs do not require a large contingent of expert professionals to initiate the program or to maintain it. With a portion of the time of one librarian and one programmer, the Virginia Tech library established the ETD Web site, documented the sequence of events that led to the computer programs for each stage of acceptance, storage, and access, and implemented the procedures for the university.

One step that should not be overlooked is to involve every person who had a role in traditional theses and dissertation handling. This includes, but is not necessarily limited to:

• Graduate School personnel who, for example, receive the TDs as well as the various forms and payments, and who may be responsible for approving the final document
• Library personnel such as the University Archivist, reference librarian, cataloging librarian, binding clerk, and business services personnel who, for example, are responsible for the long term preservation and access as well as those responsible for processing the microfilming invoices

LIBRARY STAFFING: programmer, student assistant, faculty liaison

A programmer may spend one-half to one hour per day during non-peak periods on maintenance and development. During peak periods this person may spend eight hours per day on problems, development, and system improvements.

One or two student assistants are helpful. One student who knows programming may spend a maximum of two hours per week during non-peak times, and up to ten hours per week during the periodic submission/approval peaks. Another student assistant would work face-to-face with graduate student authors to train them and to provide assistance with word processing and PDF software. This assistant might also maintain the training materials, handouts, and Web pages, including instructions for preparing and submitting ETDs.

A faculty member from the library would supervise staff; draft policies; prepare budgets; collaborate with system maintenance and developers; and monitor workflow. This person would also be a liaison to faculty, staff, students, departments, and colleges to help them to become familiar with the processing, accessing, and archiving of ETDs. The faculty liaison from the library would also conduct workshops, write articles, participate in graduate student seminars, prepare handouts and Web pages, as well as collaborate with other universities and libraries.

Next section: Sources for funding

# Universities/Sources for funding

The aim of the ADT (Australian Digital Theses) Program is to establish a distributed database of digital versions of theses produced by the postgraduate research students at Australian universities. The theses are available worldwide via the web. The ideal behind the program is to provide access to and promote Australian research to the international community.

The ADT concept was an initiative of seven Australian university libraries in association with the Council of Australian University Librarians (CAUL).

The ADT model was developed by the seven original project partners during 1998-1999. The program was then opened up to all CAUL members (all Australian universities) in July 2000. The original seven partners will continue to guide and advise the national group in their role as the ADT Steering Committee.

The initial project was funded by an Australian Research Council (ARC) Research Infrastructure Equipment and Facilities (RIEF) Scheme grant (1997/1998). The ARC is the peak Australian government research funding body, and the grant to establish the ADT was the result of a successful application made by the group of seven above. It was recognised at the outset that recurrent funding was going to be problematic, and that the model to be developed had to take this into consideration. The model developed is essentially self-sustaining, with only a small commitment of resources required. The original funds were used to create such a self-supporting, distributed, and collaborative system.

The software used is generic, and designed to be easily integrated at ADT member sites. Once installed, students can either self-submit, or seek free support from the library. Theses are mounted on local servers and require a minimum of maintenance. The central ADT metadata repository is searchable and is created automatically from rich DC (Dublin Core) metadata generated from the submission/deposit form. The metadata is gathered automatically by a metadata-gathering robot. The idea behind the ADT is that producing research theses is normal business for universities, and a commitment to contribute a digital copy to the ADT requires minimal resources. In fact, digital copies are much cheaper to produce than the traditional paper-bound versions.
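The automatic creation of the central repository rests on each site exposing Dublin Core metadata generated from the deposit form. The following is a minimal Python sketch of that generation step; the form field names are assumptions for illustration, and the real ADT deposit form and metadata schema may differ:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_record(form):
    """Build a simple Dublin Core record from submission-form fields."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for field in ("title", "creator", "date", "description", "identifier"):
        if field in form:
            el = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            el.text = form[field]
    return ET.tostring(root, encoding="unicode")

xml = dc_record({"title": "Sample Thesis", "creator": "A. Student",
                 "date": "2001", "identifier": "http://example.edu/etd-2001-1"})
print(xml)
```

A harvesting robot would then collect such records from each member site and merge them into the searchable central index.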

It must be noted that institutional membership and individual contributions to the ADT are voluntary, and will remain so for some time to come.

Each NDLTD member has to apply to its own national, regional, and community funding agencies. However, recurrent funding will probably be an ongoing issue unless a commitment is made at the government level to support such initiatives. In the broader context, ongoing funding for an international community body, which the NDLTD now is, will be difficult to achieve. A possible solution would be for the NDLTD to seek Non-Government Organization [NGO] status and thus secure ongoing funding from UN instrumentalities such as UNESCO. Such funding is critical to sustain the good work already achieved by the NDLTD in bringing together a large and disparate number of institutions from across the globe, bound by an ideal: to promote, support, and facilitate open, unrestricted access to the worldwide research contained in theses, and to share experiences, support, and guidelines with the world community in an open, transparent way, without cost or commercial barriers.

Next section: Costs

# Universities/Costs

Evaluation of Costs at Virginia Tech

Libraries usually have some of the infrastructure already in place to handle ETDs, especially if they are already providing access to electronic journals or digital images. However, many administrators like to have cost data when establishing an initial budget, and for this reason the following costs for personnel, hardware, and software were established by Virginia Tech’s Digital Library and Archives. Keep in mind that existing personnel and hardware can be commandeered for the ETD prototype and that software is frequently available on the Internet as shareware. This is how the Virginia Tech initiative began in 1995 and continued into the first year of required ETDs, 1997.

[taken from http://scholar.lib.vt.edu/theses/data/setup.html]

These estimates assume that the university/library would adapt existing programs, scripts, Web pages, software, etc. that have been developed by the VT ETD Team. No additional equipment, software, or staff was necessary for the VT library to begin or to maintain the ETD system and services.

$24,000 STAFF
$13,500 EQUIPMENT
$12,000 SOFTWARE
--------
$49,500 TOTAL

The estimated costs below show that for about $49,500 a library could replicate the Scholarly Communications Project ETD library if it adapts what VT has already developed for the NDLTD. If another university/library must also provide support to students like the New Media Center does, then the costs are greater for training, equipment, and software.

STAFFING: $24,000

Systems Administrator: .25 FTE at $6,000-$6,600/yr

In Virginia, pay band 5: $30,000/year; $14/hr. We estimate that our Sys Admin spends .5-1 hr/day during non-peak times on maintenance and development. During peak periods he may spend 8 hrs/day on problems, development, and system improvements.

Student assistant: .25 FTE at $2,900-$4,500/yr

• knowledge of PDF and HTML desirable; train to use ftp and telnet if necessary
• desirable: programming experience
• salary depending on skill level at VT would be $6.00-$10.00/hr

We estimate that our student (who knows programming) spends a maximum of 2 hrs/wk during non-peak times, and up to 10 hrs/wk during the periodic peaks.

Faculty liaison: .25 FTE @ $50,000 = $12,500

• Supervise staff; draft policies; prepare budgets; collaborate with system maintenance and developers; monitor workflow.
• Work with faculty, staff, students, departments, colleges to become familiar with the processing, accessing, and archiving of ETDs.
• Conduct workshops, write articles, participate in graduate student seminars, prepare handouts and Web pages; collaborate with other universities and libraries.

EQUIPMENT - SERVING ETDS: $13,500

server: $5,000 (dual 3 GHz processor, 2 GB RAM, at least 20 to 80 GB hard drive)

Originally, we did not purchase special equipment for ETDs, but incorporated this additional responsibility into the original server, a NeXT 3.3 running HP. In Sept. 1997 we purchased a Sun Netra.

tape drive for back-ups: $3,500 (at VT we use our Computing Center's backup service)

increase of 4-5 GB disk space per year (each ETD requires an average of 4.5 MB)

2 workstations: $5,000

SOFTWARE - SERVING ETDS: $250-$12,000

$50 Red Hat 3 Advanced Server (free software download; $50 education entitlement cost)

$10,000 InfoSeek

$1,000/yr server equipment maintenance fee

$40 x 2 Acrobat*

$50 x 2 Word Processor*

$150 x 2 Photoshop*

* Educational pricing available for this software

At VT we have a New Media Center with staff, software, and equipment that can help students with every stage of ETD preparation and submission. If you don't have such a facility, consider:

STUDENT SUPPORT: EQUIPMENT

$1,000 Desktop with monitor; CD-RW/DVD; (internet access)

$150 scanner

$3,000 2 printers: LaserJet and ColorPrinter

$300 digital camera

STUDENT SUPPORT: SOFTWARE

$40 x 2 Acrobat*

$50 x 2 Word Processor*

$150 x 2 Photoshop*

* Educational pricing available for this software

STUDENT SUPPORT: STAFF

• Training and assistance for equipment and software
• Maintenance of training materials: Web pages: instructions and handouts

Cataloging costs may not change from those for traditional theses and dissertations. One advantage is that, for the same costs, more information can easily be added to the bibliographic record because of the ease of copy-and-paste features of word processors. For example, include the abstract in online catalog records for ETDs and index this MARC field to enhance retrieval through keyword searching. Another advantage of ETDs is that there are no longer the fees associated with binding, security stripping, labeling, shelving ($.10/vol. estimated), circulating ($.07/vol. estimated), etc. The Virginia Tech Library saved about 66% of the processing costs because of the greatly reduced handling of ETDs; costs dropped from $12 per TD to $3.20 per ETD.

Next section: Processing charges

# Universities/Processing charges

Many universities that have paper submissions of theses and dissertations collect funds for processing these works. Usually the funds are collected from students. In some cases they might be collected from grants, or from a sponsor (e.g., from the National Library, with regard to works in Canada).

These funds commonly are called "binding fees" or are given other names designating their use for processing. A typical fee might be $20 or $30 per work.

In the case of transition to ETDs at Virginia Tech, there were changes to these fees. First, in 1996, when submission of ETDs was encouraged but voluntary, the processing fee was waived for those students submitting an ETD instead of a paper document. This provided a monetary incentive to move toward ETDs.

However, in 1997, with a requirement in place, the processing fee was re-instituted, renamed as an "archiving fee". Thus, funds were collected from students:

• to help defray the costs of the ETD program;
• to provide a pool of funds to archive and preserve works throughout their lifetime, allowing for migration to new media types (e.g., automatic copying to newer types of storage media, such as from one online disk to a more modern disk), as well as conversion to new formats (e.g., moving from SGML to XML, or from one version of PDF to a newer version).

Next section: Budgets

# Universities/Budgets

The estimated cost of a project to electronically produce and distribute theses varies based on a number of factors such as: the expertise and competence of your team; the technology used; the volume of documents to process; and the cost of living in your country (human resources, computer equipment, communications, etc.). In these circumstances, it is impossible to provide any budgetary estimates, at least without knowing the operating conditions and elements of a particular situation. Giving an estimate, even in the broad sense of an order of magnitude, would be of no use. Nonetheless, this section will allow you to determine your expenditure budget by providing a list of budgetary items that must be foreseen.

The expenditure budget of an electronic theses project involves four principal modules. These are: the start-up; the implementation of thesis production and distribution services; communications; and student training.

## Start-up

Start-up costs are related to what you require to begin a project to electronically produce and distribute theses. Besides the personnel, who constitute the most important element in any electronic thesis distribution project, the start-up costs principally relate to infrastructure and training. Since the network infrastructure is usually provided by the host institution, this expense is omitted from the following table.

Human Resources
Professionals
• Adoption/creation of a procedure for processing theses
• Development of tools, programmes and scripts
• The number of professionals required is linked to the volume of theses to be processed and to the variety of disciplines taught at your institution
Technicians
• Production of theses, assisting students, and routine tasks
• The number of technicians required is linked to the volume of theses to be processed
Management
• Supervision
• Communication with the university’s administration, and with external partners
• Day-to-day management
Infrastructure
Server
• Materials (the actual server, UPS, back-up unit, etc.)
• Software (operating system, search tool, etc.)
Server site
• This is usually provided by the University
Computer equipment for processing and for management
• Materials (PC, printer, etc.) for all members of the team
• A workstation (PC) to test different software and to receive students’ files (by ftp) in a secure manner
• Software chosen to suit the chosen technology and assembly line (Office suite, Adobe, XML software, HTML editor, etc.)
Training
Training team members
• Organized training or self-instruction
• Manuals and documentation
Table 1– Start-up for an electronic thesis production and distribution project

## Implementation of production and distribution services

The resources required to create an assembly line for producing theses depend on the technologies used as well as the expertise and experience of team members. For instance, choosing to create valid XML documents from the files submitted by students necessarily implies larger investments, even though many useful XML tools are now available at reasonable prices. Nonetheless, it is important to remember that more costly technological solutions may provide significant benefits to other electronic publication projects within the same institution. For instance, they might be useful for publishing electronic journals or for digitizing other types of collections. Table 2 lists the general stages needing attention when establishing a budget, regardless of the technologies chosen.

Implementing production and distribution services
Production
• Analysis of needs
• Choice of a metadata model (including a permanent referencing system)
• Integration of the metadata creation or generation process
• Testing different software
• Planning and testing of the production assembly line
• Creation or adoption of a follow-up tool for production work
• Formal implementation of the service
Distribution
• Install and set parameters for the search tool (full-text and metadata)
• Create and manage the Web site
• Create the site’s architecture
• Design and create the site’s graphic identity
• Ensure Web referencing (Web search tools, indexes, other ETD sites, etc.).
• Create mechanisms for managing access
• Install a tool to measure visits to, and usage of, the site

Table 2 – Implementing production and distribution services

## Communication

The communications plan often makes the difference between successful projects and aborted ones. A communications plan should be drafted once the technical processes are in place and the university’s governing bodies have approved the project. Work time and other resources must be budgeted for the creation and implementation of the plan. The principal tasks usually required to this end are listed in Table 3.

Communication Plan
Human Resources
• Drafting the Communications plan
• Organizing and holding information sessions with the university’s professors, researchers and students
• Writing the content of information documents
• Meeting with journalists from the university or from the city’s newspapers.
Tools for communicating and promoting the project
• Posters, brochures and other documents
• E-mails to professors and students
• Putting information on-line on the project’s Website

Table 3 – Communication Plan

## Training Students

Distributing theses on the Web is in itself an effective means of showcasing an institution’s students and research. Nevertheless, emphasis must be placed on training students so that they master the document-creation tools that allow them to circulate their research results. The use of new information technologies has become essential for university studies. Whether it involves searching databases and the Internet, or using software to help manage and assess data and present findings, the ability to work easily with these tools will be of daily use to students throughout their future careers. Many universities undertaking electronic theses projects have uncovered serious skills gaps in creating tables, integrating figures and images, and using functions that automatically generate tables of contents or lists. It is clear that basic training must be offered to better prepare students for writing important documents like theses.

Student training can be offered in several forms, such as workshops, on-line tutorials and personal consultations; an institution may offer one of these or all of them. In many institutions, preparing and offering training to students is a collaboration among several units: the faculty of graduate studies, the libraries and the information technology services.

Training Students
Planning and drafting
• Determining the students’ current state of knowledge/skills
• Analyzing needs
• Planning the various training modules
• Preparing and drafting the pedagogical tools
Workshops
• Preparing workshops, examples and exercises
• Organizing the workshops (selecting classrooms, installing software, etc.)
• Hiring assistants (possibly students) for the workshop’s “hands-on” period
On-line tutorials
• Adapting workshop content and materials for the Web
Personalized consultation service
• Time taken by professionals and technicians to answer student questions (by e-mail, telephone, and possibly in person).
Table 4 – Training Students

Next section: Plagiarism

# Universities/Plagiarism

The issue of plagiarism often arises among the arguments used to express skepticism about putting theses online. In short, many people tend to think that because a digitized thesis is easily copied in part or in whole, it can be easily plagiarized. Consequently, so the reasoning goes, it is better to keep theses offline.

The argument is largely false and can be refuted fairly easily. To begin with, it is worth recalling that the invention of the Philosophical Transactions (1665) by Henry Oldenburg, the Secretary of the Royal Society in London, was motivated by the issue of intellectual property. Oldenburg reasoned that if the research results of Scientist X were printed in a journal (after being certified as original and of good quality) and that journal was made widely available through the multiplication of copies, then Scientist X would have a better chance to lay claim to ownership than if he/she held back these results. By apparently giving away the results of his/her work, a scientist ensures his/her intellectual property most effectively. The ability to compare new results to already published work makes plagiarism a very risky business at best.

Theses are not so well protected at present. Widely dispersed across many institutions in many countries (and languages), they are so poorly catalogued on a national or international basis that they often disappear from sight. This means that someone taking the time to read a thesis at a remote university in a country where cataloguing is poorly organized may well be able simply to take that thesis and pass it off as his or her own. Occasionally, such cases emerge in the literature, even in the United States, where the cataloguing of theses is most advanced.

Paradoxically, placing theses online, especially in a form that can be harvested and searched in full text, makes it rather easy to identify analogous texts. As a result, far from placing digitized theses at risk, putting them online in a manner that optimizes their access, retrievability and, therefore, visibility offers a very efficient way to protect intellectual property and prevent plagiarism. In fact, it would probably be relatively easy to design software that makes periodic sweeps through interoperable thesis collections, applying ever more sophisticated algorithms to ferret out possible forms of plagiarism. With many languages involved, it is clear that no perfect solution will ever appear; however, theses available online will be better protected than theses that remain poorly catalogued and are not readily available outside the institutions from which they are issued.
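The periodic-sweep idea can be sketched in a few lines of Python (the n-gram size and threshold below are illustrative assumptions, not a recommendation): documents are reduced to sets of word n-grams ("shingles"), and pairs whose overlap is unusually high are flagged for human review.

```python
# Sketch of one simple overlap-detection algorithm such a sweep could
# apply. The n-gram size and threshold are illustrative only.

def ngrams(text, n=5):
    """Return the set of word n-grams ('shingles') in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_pair(doc1, doc2, n=5, threshold=0.3):
    """Flag a pair of documents whose shingle overlap warrants review."""
    return jaccard(ngrams(doc1, n), ngrams(doc2, n)) >= threshold
```

A real sweep across interoperable collections would additionally need language-aware tokenization and an index (for example, MinHash signatures) to avoid comparing every pair of theses directly.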

In effect, putting theses on-line amounts to rediscovering Oldenburg's wisdom when it comes to scientific intellectual property. It provides what could arguably turn out to be the best deterrent to plagiarism, wherever it may arise. The more theses appear on line, the fewer will be the chances of carrying on successful plagiarism.

Next section: Assessment and Measurement Introduction

# Universities/Assessment and Measurement Introduction

I shall consider assessment to include the gathering of information concerning the functioning of students, staff, and institutions of higher education. The information may or may not be in numerical form, but the basic motive for gathering it is to improve the functioning of the institution and its people. I used functioning to refer to the broad social purposes of a college or university: to facilitate student learning and development, to advance the frontiers of knowledge, and to contribute to the community, and the society.

(Alexander W. Astin, Assessment for Excellence, 1991, p. 2)

Electronic theses and dissertations are not only products of student research, but also marks of the students’ preparation to become scholars in the information society. In pursuit of this broader social purpose, this section of the Guide focuses on two separate but related topics: assessment and measurement. Both are important components of an institutional ETD program. The role of assessment in an ETD program is to understand whether the goals and objectives of the program are being met, while issues of measurement focus on the production and use of an institution’s ETDs.

Assessment of a program’s goals and objectives yields information that may be of great value to policymakers and administrators. Higher education institutions are increasingly asked by their boards, by state legislators, and by federal government agencies to demonstrate the effectiveness of their programs, which has led to the development of many assessment programs on campuses. Assessment is often used to provide trend data or to assist in resource allocation, with data helping to make a case for the viability of an ETD program or to determine which parts of the program require the most (or least) resources. Ideally, an ETD assessment program will have ties to overall campus assessment activities.

However, whether or not accountability to administration or government is a driving factor in developing assessment for an ETD program, any new effort can benefit from a systematic way of measuring what it is achieving. Such measures as the number of ETDs produced at an institution each year or the number of times a dissertation has been downloaded from the institution’s web site are frequently cited to demonstrate the success of an ETD program.

The goals of this section of the Guide are:

1. to encourage those involved in developing and implementing ETD programs to think at early stages about broad questions of assessment, and

2. to familiarize individuals working on the measurement of ETDs with national and international initiatives that are developing standard ways of measuring the use of electronic information resources.

Next Section: Types of Assessment

# Universities/Types of Assessment

Assessment takes two basic forms, based on the point at which the evaluation is done:

• Formative evaluation takes place during the development of a program. It is used to help refocus activities, revise or fine-tune a program, or validate the success of particular activities. Formative assessment in an ETD program may help diagnose whether training sessions are useful to students, whether the submission process works efficiently, whether students learned about electronic scholarly communication issues, or whether students find what they learned in order to produce an ETD helpful in their employment.
• Summative evaluation is used to examine the program at its completion or at the end of a stage. Since an ETD program is ongoing, summative evaluation can be used on an annual basis to yield statistics that can be compared from year to year.

Next Section: The Assessment and Measurement Process

# Universities/The Assessment and Measurement Process

This chapter of the Guide is intended for those who have responsibility for implementing an institutional or departmental ETD program. The process of creating a successful assessment and measurement program requires planning, creating goals and objectives, choosing types of measures, and deciding what type of data to collect.

## Planning

Some campuses form an ETD committee or team, consisting of representatives from the Graduate School, the faculty, the library, the computing center, and other relevant units on campus. As part of their overall planning for the development of an ETD program, the committee should make explicit their goals for assessment and measurement of the program and put in place mechanisms to collect data related to the goals. The data could be gathered from web statistics packages, from student surveys, or from interviews, depending on what type of information meets the assessment’s goals.

Several units of the university, including academic departments, the graduate school, information technology services, and the library, are involved in any ETD program. In developing an assessment plan, it is useful to look for convergence: are there particular things that would be useful for several of these constituents to know? Identifying them will leverage the value of the assessment and may allow for joint funding and implementation, thereby spreading the costs and the work.

During the planning process, the ETD committee may focus on a variety of issues, including:

• the impact that the ETD program is having on the institution’s reputation;
• the degree to which the ETD program is assisting the institution in developing a digital library; and
• the benefits to and concerns of students and advisors participating in the ETD program.

Once the focus of the assessment and measurement activities is identified, the ETD committee should assign responsibility for development and implementation to another team. This team should include assessment and measurement experts from institutional units such as an institutional planning office, a survey research institute, or an instructional assessment office.

## Creating Goals and Objectives

A necessary prerequisite to assessment is a clear understanding of the ETD project’s goals. Each institution’s assessment plan should match the goals and objectives of the institutional ETD program, which may have a broader scope than the simple production of electronic content.

If an ETD program does not have clearly defined goals, an excellent resource is the NDLTD web site. This site includes the goals of the NDLTD, which can be adapted to local needs. These goals include:

• Increasing availability of student research
• Lowering costs of submission and handling of theses and dissertations
• Empowering students
• Empowering universities

Within each of these broad areas, many types of measures can be developed to help evaluate whether the ETD program is succeeding.

## Choosing Types of Measures

Institutions have many choices in what they measure in an ETD program. After aligning the assessment goals with the institution’s goals, the assessment plan must describe what types of measures are needed for various aspects of the program. McClure describes a number of categories of measures, including those that focus on extensiveness, efficiency, effectiveness, service quality, impact, and usefulness. (Charles R. McClure and Cynthia L. Lopata, Assessing the Academic Networked Environment, 1996, p. 6)

An extensiveness measure collects data on such questions as how many departments within the university are requiring ETDs or how many ETD submissions are made each year. This type of data lends itself to comparison, both as trend data for the individual institution and in comparisons to peer institutions.

An efficiency measure collects data on how the life cycle costs of ETDs compare to those of print theses and dissertations. For example, Virginia Tech includes such a comparison on its Electronic Thesis and Dissertation Initiative web site.

Effectiveness measures examine the degree to which the objectives of a program have been met. For example, if an objective of the institution’s ETD program is to empower students to convey a richer message through the use of multimedia and hypermedia, data can be collected that displays the proportion and number of ETDs employing such techniques by year. If an objective of the program is to improve students’ understanding of electronic publishing issues, the institution can measure such understanding prior to and after the student produces an ETD.

Measures of service quality examine whether students are receiving the training and follow-up assistance they need.

## Deciding What Data to Collect

Frequently, discussions of how to assess electronic information resources are limited to defining ways of counting such things as searches, downloads, and hits. These measures are certainly useful, but they provide a limited view of the overall value of electronic information resources.

Many vital questions cannot be answered with statistics about searches, downloads, and hits: Can users access more information than in the past due to availability of information resources online? Has the availability of electronic information resources improved individuals’ productivity and quality of research? Has the availability saved them time?

Collection of data for assessment should be designed to answer some of these questions, to address the educational goals embedded in an ETD program, and to gauge whether or not those goals have been achieved.

Decisions about data collection are also informed by the institutional mission and goals. For example, if the institution is interested in increased visibility both nationally and internationally, then statistics on downloads of ETDs by country and institutional IP address could be useful. Examining who is using an institution’s ETDs, by country and by amount of use, would also be a valuable gauge of impact. As recommender systems develop, this area may grow in importance.

There are a number of questions related to users that are possible targets for assessment. These include:

• Are students achieving the objectives of the ETD program?
• Are students using tools such as Acrobat appropriately and efficiently?
• Has the availability of student work increased?
• Do students have an increased understanding of publishing issues, such as intellectual property concerns?

In addition, a number of questions related to student satisfaction could be addressed in data collection plans. These may include:

• Were students satisfied with the training or guidance they received to assist them with producing an ETD?
• Did the availability of their dissertation on the web assist them in getting a job?
• Are they using the technology and electronic authoring skills they learned in their current work?

Usefulness to students may stem from the availability of their ETD on the web, which helps employers gauge their area of research and the quality of their output. Availability may also lead employers to contact students about openings that require a particular skill set. And students may find that the skills gained in preparing an ETD, and the framework of associated issues such as intellectual property, are useful in their places of employment after graduation.

The usefulness of an ETD program to students, faculty, and others may be an important factor to measure in order to gather data that can be conveyed to administrators and funding agencies. This data might best be collected six months to a year after the completion of the ETD.

Finally, the ability for faculty and students around the world to easily examine the dissertation and thesis output of a particular department may provide a new dimension to rankings and ratings of graduate departments. Monitoring national ratings in the years pre- and post-implementation of an ETD program could be useful, although may be only one factor in any change in ranking or rating.

Next Section: Measuring Production and Use of ETDs: Useful Models

# Universities/Measuring Production and Use of ETDs: Useful Models

In addition to assessing some of the programmatic goals of the ETD program, institutions will want to have some basic assessment measures in place to document the production and use of their ETDs. The work done at Virginia Tech’s Electronic Thesis and Dissertation Initiative can provide a model for other institutions. Using web statistics reporting software, Virginia Tech monitors a number of measures for their ETDs, including:

• the availability of campus ETDs
• multimedia in ETDs
• which domains within the United States and abroad are requesting ETDs
• requests for PDF and HTML files
• distinct files requested
• distinct hosts served
• average data transferred daily

In compiling its counts, Virginia Tech excludes requests from everyone working on ETDs at the institution, including the Graduate School, and tries to eliminate repetitive activity from robots and other sources of that type as well. Virginia Tech is also working on compiling an international count of ETDs produced in universities.
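Filtering of this kind can be sketched against standard web server logs. The script below assumes the Apache "combined" log format; the campus-host and robot patterns are hypothetical placeholders to be replaced with local values.

```python
import re

# Hypothetical filters: campus hosts (ETD staff, Graduate School) and
# known robots to exclude from the counts.
CAMPUS = re.compile(r"\.example\.edu$")
ROBOTS = re.compile(r"(googlebot|crawler|spider)", re.I)

# One line of an Apache "combined" log:
# host - - [date] "GET /path HTTP/1.0" status bytes "referer" "agent"
LOG = re.compile(
    r'^(\S+) \S+ \S+ \[.*?\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def count_pdf_requests(lines):
    """Count successful PDF downloads and distinct hosts served,
    excluding campus hosts and robots."""
    hosts, pdf_requests = set(), 0
    for line in lines:
        m = LOG.match(line)
        if not m:
            continue
        host, path, status, agent = m.groups()
        if CAMPUS.search(host) or ROBOTS.search(agent):
            continue
        if status == "200" and path.lower().endswith(".pdf"):
            pdf_requests += 1
            hosts.add(host)
    return pdf_requests, len(hosts)
```

In practice such counts would be accumulated per day or per month to produce the kinds of measures listed above.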

Institutions must decide whether to report their ETD collections and usage separately, in conjunction with other campus web site usage, or in conjunction with other electronic resources managed by the library. While practices and standards for gathering these statistics are still evolving, institutions may need to collect the information and report it in conjunction with more than one type of related collection. In any case, institutions should keep abreast of national and international initiatives seeking to define and standardize statistical reporting of the number and use of electronic information resources.

Several projects currently focus on collection of statistics related to information resources:

• The Association of Research Libraries (ARL) collects statistics from libraries of large research universities in North America and has been working on several initiatives that explore data collection related to digital information. In particular, one of its New Measures initiatives, the E-Metrics Project, is recommending a particular set of measures and defining their collection.
• The International Coalition of Library Consortia (ICOLC) has created Guidelines for Statistical Measures of Usage of Web-Based Indexed, Abstracted, and Full-Text Resources. It encourages vendors of electronic information products to build into their software statistical report generation that meets the ICOLC Guidelines, promoting comparability of products.
• In Europe, the EQUINOX Project, funded under the Telematics for Libraries Programme of the European Commission, addresses the need to develop methods for measuring performance in the electronic environment within a framework of quality management. Their products include a consolidated list of data sets and definitions of terms.
• Library Statistics and Measures, a web site maintained by Joe Ryan of Syracuse University, also provides a useful set of links to resources on library statistics and measures.

Another important type of post-processing is the extraction of statistical information from metadata sets. For administrative purposes, institutions may be interested in the number of ETDs supervised by each professor, the keywords most used, the month(s) in which most ETDs are submitted, etc. Usually, relevant metadata are extracted from the ETD database and processed using tools such as Microsoft Excel. Access to the database can be provided through either ODBC drivers or specialized middleware utilities.
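As a sketch of this kind of post-processing (the table layout, column names, and figures are hypothetical), such statistics can be aggregated directly with SQL before any export to a spreadsheet; in production the same queries would typically run over ODBC against the live ETD database.

```python
import sqlite3

# Hypothetical schema: one row of metadata per ETD.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etd (advisor TEXT, submitted TEXT)")  # submitted: YYYY-MM-DD
conn.executemany(
    "INSERT INTO etd VALUES (?, ?)",
    [("Prof. A", "2001-05-10"), ("Prof. A", "2001-12-02"), ("Prof. B", "2001-05-21")],
)

# Number of ETDs supervised by each professor.
per_advisor = conn.execute(
    "SELECT advisor, COUNT(*) FROM etd GROUP BY advisor ORDER BY advisor"
).fetchall()

# Month(s) in which most ETDs are submitted.
per_month = conn.execute(
    "SELECT substr(submitted, 6, 2) AS month, COUNT(*) FROM etd "
    "GROUP BY month ORDER BY COUNT(*) DESC"
).fetchall()
```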

Next Section: Statistics and Usage

# Universities/Statistics and Usage

Gathering data about ETDs can be done through online surveys and log file analysis. Online surveys can be used to gather data from graduate student authors submitting ETDs and from readers accessing them. An easy-to-use online survey system is available to NDLTD members at http://lumiere.lib.vt.edu/surveys/. Ask graduate student ETD authors a variety of questions about the process and about support services, and use this data to improve resources and services. Query readers to discover their reasons for accessing ETDs and the results of their use of ETDs, as well as to improve digital library resources and services.

Questions for student authors may be designed to improve the process of preparing and submitting electronic documents. Sample questions might include:

• If you consulted the VT ETD Web site, please indicate if the site was useful.
• If you attended an ETD workshop, please indicate if you found the workshop useful.
• If you used a [particular] computer lab, please indicate if the staff was helpful.
• How many ETDs did you consult while preparing your ETD?
• Compared to what you expected, how difficult was it to create a PDF file?
• My computer is a [PC, Mac, Unix, other].
• Where were you when you submitted your ETD?
• Compared to what you expected, how difficult was it to submit your thesis/dissertation electronically?
• Within the next 1–2 years, what do you intend to publish from your ETD?

## Usage

In the online environment, usage is similar to, but not the same as, library circulation and reshelving statistics. When reporting usage, institutions should also report the number of ETDs available, so that downloads can be viewed in the context of what was available at a given point in time.
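As a minimal illustration (all figures invented), a raw download count becomes more meaningful once it is divided by the number of ETDs available in the same period:

```python
# Hypothetical monthly figures: downloads recorded in a month, and the
# number of ETDs available in the collection during that month.
downloads = {"2001-01": 1200, "2001-02": 1500}
available = {"2001-01": 400, "2001-02": 500}

# Downloads per available ETD separates growth in usage from growth
# in the collection itself.
per_etd = {month: downloads[month] / available[month] for month in downloads}
```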

It would be beneficial if all libraries providing access to ETDs captured and reported similar data about usage, but at this time no such common practice exists. Gail McMillan has periodically surveyed and reported the numbers of ETDs available from the NDLTD. Her data are based on institutions reporting the numbers of ETDs available through their institutions, but the NDLTD has not yet begun to report institutional use of or access to ETD collections.

Next Section: Measurement in Related Contexts

# Universities/Measurement in Related Contexts

For a broad view of counting information, two projects are widely regarded as providing interesting models and data.

• Peter Lyman and Hal R. Varian’s How Much Information project at the University of California, Berkeley is an attempt to measure how much information is produced in the world each year.
• The OCLC Web Characterization Project conducts an annual web sample of publicly available web sites to analyze trends in the size and content of the web.

Some programs that provide guidance or models for collecting institutional data in higher education are also available. These projects can provide definitions for data, survey questions, and descriptions of data collection that can be adapted for one’s own institution.

• K. C. Green has been conducting the Campus Computing Project since 1990. His work charts the increasing use of technology on campuses.
• The TLT Group’s Flashlight Program, under the direction of Steve Ehrmann, has developed a subscription-based tool kit that provides a large, structured set of assessment techniques and data collection models that can be adapted by individual campuses that want to study and improve the educational uses of technology. The Flashlight Program web site also includes valuable overviews of assessment issues and provides advice on deciding what to assess and how to develop questions.
• The Coalition for Networked Information’s Assessing the Academic Networked Environment project provides case studies of campuses that implemented assessment projects.
• One of the participants in the CNI project, the University of Washington, has a rich set of assessment instruments and reports on its web site, UW Libraries Assessment.

# Universities/Guidelines for Implementing an Assessment Program for ETDs

The Coalition for Networked Information (CNI) sponsored a project, “Assessing the Academic Networked Environment,” in which institutional teams developed and implemented assessment projects related to a variety of areas, including teaching and learning, electronic reserves, computer skills, and electronic library resources. From the project reports and informal feedback from the participants, CNI developed a set of guidelines for institutions engaging in assessment activities related to networks or networked information. The guidelines focus on the process of doing assessment in higher education. The suggestions can be applied directly to assessment projects for ETDs.

• Bring together an assessment team of individuals from various units on campus that can add useful perspectives and expertise; include, if possible, someone who specializes in assessment.
• Align the overall goals of the assessment initiative with the institution’s goals and priorities.
• Gain support from the administration at as many levels as possible.
• Make a realistic determination of the resources (staff, time, equipment, and money) that are available for the assessment.
• Choose a manageable portion of the assessment project as the first implementation. Do not attempt to do a comprehensive assessment of campus networking on the first try.
• Consider using more than one assessment technique to measure the aspect of networking that you have chosen; particularly consider combining quantitative and qualitative approaches as complementary techniques.
• Identify carefully who are the audiences for the assessment reports.
• Examine what you might do with the information you collect, including improving services, seeking additional funding and determine whether your data will provide what you need for that objective.
• Refine assessment instruments on a periodic basis and incrementally add new components.
• Monitor the work of national groups such as ARL, EDUCAUSE, CNI, and the Flashlight Project to see whether materials they develop and guidelines they produce can provide a framework for your project.

(Joan K. Lippincott, “Assessing the Academic Networked Environment,” Information Technology in Higher Education: Assessing Its Impact and Planning for the Future, ed., Richard N. Katz and Julia A. Rudy. Jossey-Bass, 1999, pp. 21–35.)

In addition to gathering availability and usage data, online surveys can be used to gather information from readers who voluntarily agree to answer questionnaires. Sample questions for ETD readers might include:

• Where do you work/study?
• What do you do?
• What type of computer are you using?
• What is the speed/type of connection you are using?
• Are you familiar with Adobe PDF?
• Are you familiar with online databases?
• If you are from a university, does your institution accept electronic theses and dissertations (ETDs)?
• If your institution does not accept ETDs, do you think it should?
• Have you ever submitted an ETD? (to find out if your readers are ETD authors past, present or future)
• For what purpose are you using this digital library?
• If you downloaded any ETDs, how easy was it to find what you were looking for?
• If you searched for an ETD, how fast was the response to your search request?
• How often do you plan to use Virginia Tech's ETD library?
• How often do you plan to use other ETD libraries?

Next Section: Resources List

# Universities/Resources List

General Information:

Association of Research Libraries. ARL Statistics and Measurement Program. http://www.arl.org/stats

Association of Research Libraries. Supplementary Statistics 1998-99. Washington, DC: ARL, 2000.

Astin, Alexander W. Assessment for Excellence: The Philosophy and Practice of Assessment and Evaluation in Higher Education. American Council on Education/Oryx Press, 1991.

Developing National Library Network Statistics & Performance Measures.

EQUINOX: Library Performance Measurement and Quality Management System. http://equinox.dcu.ie/

International Coalition of Library Consortia (ICOLC). Guidelines for Statistical Measures of Usage of Web-Based Indexed, Abstracted, and Full Text Resources. November 1998. http://www.library.yale.edu/consortia/webstats.html

McClure, Charles R. and Cynthia L. Lopata. Assessing the Academic Networked Environment. Washington, DC: Coalition for Networked Information, 1996.

NDLTD: Networked Digital Library of Theses and Dissertations. http://www.ndltd.org/

Ryan, Joe. Library Statistics & Measures. http://web.syr.edu/~jryan/infopro/stats.html

Technology in Higher Education and Web Studies:

The Campus Computing Project. http://www.campuscomputing.net/

Coalition for Networked Information. Assessing the Academic Networked Environment. http://www.cni.org/projects/assessing/

Katz, Richard N. and Julia A. Rudy. Information Technology in Higher Education: Assessing Its Impact and Planning for the Future. New Directions for Institutional Research No. 102. San Francisco: Jossey-Bass, 1999.

Lyman, Peter and Hal R. Varian. How Much Information? http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

Online Computer Library Center, Inc. Web Characterization Project. http://wcp.oclc.org/

TLT Group. Flashlight Program. http://www.tltgroup.org/flashlightP.htm

University of Washington. UW Libraries Assessment. http://www.lib.washington.edu

Virginia Polytechnic Institute and State University. Electronic Thesis and Dissertation Initiative http://etd.vt.edu/

# Universities/Policy Initiatives: National, Regional, and Local; Discipline specific; Language specific

## National

For universities, it seems most practical to participate in national ETD initiatives. For those initiatives it is advisable that the national library, which is often in charge of archiving the country's literature, take a leading role; we see this approach in Germany, within the Dissertation Online initiative, and in Canada. The national library, as a central point organising not only the archive structure but also the cooperation between universities, may serve as a central entry point to the ETD initiative for interested parties.

The principal tasks of such a so-called central coordination bureau for national ETD initiatives are:

• providing a coordination structure for the cooperation of universities for political, organisational, technological and educational issues and developments
• providing an organizational concept for funding local or regional initiatives, therefore negotiating and cooperating with national funding agencies
• defining special interest and working groups in which representatives from universities can participate
• organizing workshops for participating universities covering special topics in order to discuss and solve particular problems
• cooperating with the international initiatives, e.g. NDLTD, Cybertheses, MathDissInternational or PhysDiss.

In France, a national program for ETDs has been initiated by the Ministry of Education. Electronic deposit is to become compulsory by the end of the year. The organisational scheme adopted is as follows:

• each university will be in charge of the conversion of its theses and dissertations into an archiving format (SGML/XML).
• associations of institutions may allow the pooling of human and technical resources.
• a national institution, the Association des Bibliothèques de l'Enseignement Supérieur (ABES) (approximately, Association of University Libraries), has been designated as the national central bureau.

The following text was taken from Prof. Peter Diepold (Sourcebook for ETDs)

In 1996 four German learned societies - comprising the fields of chemistry, informatics, mathematics, and physics - signed a formal agreement to collaborate in developing and using digital information and communication technologies (ICT) for their members, scientific authors and readers. The objectives of this collaboration were:

• on a local level: to bring together the activities of individual - and often isolated - university researchers and teachers in the various academic fields
• nationwide: to join forces in voicing the interests and needs of scientific authors and readers toward the educational administration, granting agencies, research libraries, documenting agencies, publishing houses and media enterprises
• globally: to use the widespread international contacts of the learned societies to exchange concepts, developments and solutions and adapt them to the specific needs within one's own field

The initiative soon caught public attention, leading to the enlargement of the group. Since then, the learned societies in the fields of education, sociology, psychology, biology, and electronic engineering have also committed themselves to the advancement of the goals of the IuK-Initiative. (IuK stands for Information und Kommunikation in German.)

Funds were granted for three years by the Federal Ministry of Education and Research, Bundesministerium für Bildung und Forschung (BMBF) (www.bmbf.de).

One major project within this initiative was the Dissertation Online project, undertaken from April 1998 until October 2000. The activities of one of the workgroups led to a proposal to the German Research Foundation (DFG) to fund an interdisciplinary project to present dissertations online on the Internet, involving five universities (Berlin, Duisburg, Erlangen, Karlsruhe, and Oldenburg) and five academic fields (chemistry, education, informatics, mathematics, and physics). DFG stands for Deutsche Forschungsgemeinschaft (http://www.dfg.de/) and is Germany's national science foundation. Funding was initially restricted to one year.


Parser/Validator

When an XML document is created, it is parsed and validated to see whether it is well formed and syntactically correct. To accomplish this task, we need an XML parser/validator. There are many XML parsers available, and most of the XML editors mentioned above are also XML parsers. For example, XML Spy provides an IDE for XML, i.e. it is both an editor and a parser.

To view XML documents, we need a browser that supports XML. Many browsers now support XML, such as:

Most newer browsers support XML in a way that lets it integrate fully with HTML. A stylesheet is required to view XML documents. For ETDs a cascading style sheet has been developed for this purpose; it can be downloaded from http://www.etdguide.org/.
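The well-formedness check that such a parser performs can be sketched with Python's standard library (a minimal illustration; it checks syntax only, not validity against a DTD):

```python
# Sketch: checking whether an XML fragment is well formed, using
# only the Python standard library. This tests syntax (matching
# tags, quoting) but NOT validity against a DTD.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<author><surname>Fox</surname></author>"))  # True
print(is_well_formed("<author><surname>Fox</author>"))            # False: mismatched tags
```

A validating parser such as nsgmls goes further and also checks the document against its DTD.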

# Technical Issues/Conversions from Word, Word Perfect or other RTF-compatible tools to SGML\XML

Performing a conversion from MS Word documents into instances of a specified SGML or XML DTD is a very complex task. What you will need for it is:

• An SGML or XML document type definition (DTD) that serves as the structure model for the output. One says that the output SGML document is valid against the specified DTD, or that it is an instance of this DTD.
• A Word stylesheet (template) that provides paragraph and character styles corresponding to the structures in the DTD. For example, if the DTD defines a structure for the author, expressed in the output file as:

<author>
<title>Dr.</title><firstname>Peter</firstname><surname>Fox</surname>
</author>

then you have to express this in Word with a paragraph style (author) and character styles to be used only within an author paragraph (firstname, surname, title).

• You will need some kind of configuration file that allows the mapping of DTD elements onto Word styles and vice versa.
• You will need an SGML or XML parser to check the output SGML/XML document against the DTD.
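The style-to-element mapping such a configuration file encodes can be sketched as follows (a minimal illustration; the style names and the function runs_to_xml are hypothetical, not part of any real converter):

```python
# Sketch of a style-to-element mapping, as a configuration file for
# a Word-to-XML converter might encode it. All names here are
# hypothetical illustrations, not taken from a real tool.

# Character styles used inside an "author" paragraph map to DTD elements.
STYLE_TO_ELEMENT = {
    "author-title": "title",
    "author-firstname": "firstname",
    "author-surname": "surname",
}

def runs_to_xml(runs):
    """Convert the (character-style, text) runs of an author
    paragraph into the <author> element required by the DTD."""
    parts = []
    for style, text in runs:
        element = STYLE_TO_ELEMENT[style]
        parts.append(f"<{element}>{text}</{element}>")
    return "<author>" + "".join(parts) + "</author>"

print(runs_to_xml([("author-title", "Dr."),
                   ("author-firstname", "Peter"),
                   ("author-surname", "Fox")]))
```

Real converters read such mappings from a configuration file (e.g. the .dta association files discussed below) rather than hard-coding them.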

Conversion Methods

Often a conversion is done by using a plug-in to MS Word directly, but other options use Microsoft's internal exchange format RTF (Rich Text Format) for conversion. Such tools interpret the RTF file, in which the MS Word styles are still encoded, and export it as an SGML document. This process mostly happens in batch mode, without much of a graphical user interface.

Within the following paragraphs we describe several approaches:

1. The approach of the Université de Montréal, Université de Lyon 2, and Universidad de Chile
2. The approach of Humboldt-University Berlin and the Germany-wide Dissertation Online project

There are other approaches in development as well, especially in Scandinavia at the University of Oslo, Norway; we do not describe their solution yet.

Conversion method of the Cybertheses project

The processing line for converting Word files into SGML documents developed within the CyberThèses project uses scripts written in the Omnimark language.

Conversion Process

The input of the processing line is an RTF file produced with a "structuring style sheet"; the output is an SGML document encoded according to the TEI Lite DTD (see the TEI web site at http://etext.virginia.edu/TEI.html).

The conversion process is made up of three main steps:

• First, the RTF file is converted into a flat XML file encoded according to a DTD of RTF. The produced file is a linear sequence of paragraph elements, each one having an explicit "stylename" attribute corresponding to the RTF style names.
• The second step consists of re-generating the hierarchical and logical structure of the document, based on an analysis of the stylename attributes.
• Last, an SGML parser allows one to validate the conformity of the produced SGML document with the TEI Lite DTD. Some supplementary scripts then allow the export of the SGML document to other formats (HTML, XML).

Most of the scripts will soon be available from the CyberTheses web site: http://www.cybertheses.org. This system is devoted to a particular DTD, but its generalization to other document models should not pose any difficulty.
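The second step - rebuilding a hierarchy from a flat sequence of styled paragraphs - can be sketched as follows (the style names and output tags are illustrative assumptions; the actual CyberThèses scripts are Omnimark programs targeting TEI Lite):

```python
# Sketch of step two: turning a flat list of (stylename, text)
# paragraphs into nested divisions. Style names and output tags are
# illustrative only; CyberTheses targets the TEI Lite DTD.

STYLE_LEVEL = {"Heading1": 1, "Heading2": 2}  # hypothetical style names

def nest(paragraphs):
    out, open_levels = [], []
    for style, text in paragraphs:
        if style in STYLE_LEVEL:
            level = STYLE_LEVEL[style]
            # Close any divisions at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append("</div>")
                open_levels.pop()
            out.append(f'<div level="{level}"><head>{text}</head>')
            open_levels.append(level)
        else:
            out.append(f"<p>{text}</p>")
    out.extend("</div>" for _ in open_levels)  # close what remains open
    return "".join(out)

flat = [("Heading1", "Introduction"),
        ("Normal", "First paragraph."),
        ("Heading2", "Background"),
        ("Normal", "More text."),
        ("Heading1", "Methods")]
print(nest(flat))
```

The point of the sketch is only that the hierarchy is recoverable mechanically from the stylename attributes, which is why a correctly applied structuring style sheet matters so much.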

Using SGML Author for Word (Humboldt-University Berlin)

Why did we use the SGML Author for Word?

The "Dissertation Online" project implemented and refined a conversion strategy that allows writers to convert documents written in MS Word with a special stylesheet (dissertation.dot) into an SGML instance of the DiML DTD.

We used this product from Microsoft, the SGML Author for Word, for several reasons:

1. SGML Author is quite easy to configure.
2. It is easy to use.
3. It is less expensive than other software producing SGML files of the same quality.
4. It supports an international standard for tables: CALS.
5. As it is a Word add-on, it handles documents in MS Word .doc format better than other tools.
6. As we started using this technology in 1997, it supported from the very beginning Word 97, the current version of Word at that time.

Unfortunately, Microsoft did not continue the development of this tool, so there are no new versions available for Office 2000 or Office XP. However, the internal document formats of MS Word 97, Word 2000 and Office XP are the same for the purposes of the conversion into SGML. This means documents written in Word 2000 or Office XP can be imported into Word 97, and the conversion can then be done.

Preconditions

For a successful conversion from a word document into a DiML document you will need:

• The DiML-document type definition (diml20.dtd, calstb.dtd)
• the SGML Author for Word 97 (it may no longer be available from Microsoft, but NDLTD, especially Prof. Dr. Edward Fox, may provide English versions of it that work with English Word)
• The Association file for the Microsoft SGML-Author for Word (diml20.dta)
• The converter stylesheet, which consists of several macros programmed to make the preconversion process easier.
• The Perl programming language (free software)
• The nsgmls parser (free software)
• Several Perl scripts to correct the transformation of tables.

Software

You must have the following software installed on your computer:

The converter stylesheet and the authors stylesheet can be obtained from the following website: http://dochost.rz.hu-berlin.de/epdiss/vorlage.html

Converter scripts and Perl scripts can be obtained from http://www.educat.huberlin.de/diss_online/software/tools.exe (Perl scripts, plus the DTD and converter file for the MS SGML Author for Word - KonverterDiML2_0.dta).

Conversions

The conversion from a Microsoft Word document into an SGML document, an instance of the DiML DTD that is used at Humboldt-University, takes several steps:

Step 1

Preparing the conversion without using the converter Microsoft SGML Author for Word directly

• Check that the author has used the styles correctly.
• Load the stylesheet for conversion (NOT the one for authors); see the figure below.
• There is a special feature to get the page numbers out of the Word document by using certain Word-specific text anchors. These have to be converted into hard-coded information using a page number stylesheet.
• Formatting that the author applied without using styles has to be replaced by the correct styles.
• In order to get a correct display of tables later on (using CSS stylesheets within common browsers), empty table cells have to be filled with a single space character.
• Soft-coded line breaks have to be preserved for the conversion. This is done by inserting the special character sequence #BR#, which will later be used to insert a special SGML tag for soft line breaks, <br/>.

Step 2

Converting with Microsoft SGML Author for Word

• Press the button "Save as SGML" within the FILE menu.
• Load the converter file KonverterDiML2_0.DTA
• Check the XML/SGML output using the feedback file (.fbk); see the figure below.

Step 3

Work through the output file (which conforms to the DiML DTD) automatically:

• Run the Perl scripts using the batch file preprocessor.bat
• Parse the DiML file

Step 4

Transforming the DiML file into an HTML file

• Load the perl scripts by using the batch file did2html.bat
• Check the HTML Output.
• Correct possible errors manually within the SGML file and repeat the transformation.
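The kind of textual cleanup these batch-driven Perl scripts perform can be sketched in Python (a simplified illustration of two fixes described above - the #BR# placeholders and empty table cells - not the actual Humboldt-University scripts):

```python
# Sketch of two cleanup fixes described in the steps above, as the
# Perl postprocessing scripts might perform them. This is a
# simplified illustration, not the actual Humboldt-University code.
import re

def postprocess(sgml):
    # Replace the #BR# placeholders inserted before conversion
    # with the SGML tag for soft line breaks.
    sgml = sgml.replace("#BR#", "<br/>")
    # Ensure empty table cells contain a single space so that
    # CSS-styled tables render correctly in common browsers.
    # The <entry> tag name follows the CALS table model.
    sgml = re.sub(r"<entry>\s*</entry>", "<entry> </entry>", sgml)
    return sgml

print(postprocess("<entry></entry>line one#BR#line two"))
```

Running such fixes as scripts over the converter output keeps the Word-side preparation minimal and repeatable.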

A demonstration QuickTime video may be found on the ETD Guide server as well (see http://www.educat.hu-berlin.de/diss_online/software/didi.mov).

Using FrameMaker+SGML 6.0 for a conversion of MS Word documents into SGML instances

Editing or converting using FrameMaker is much more complex than the previously described methods. FrameMaker is able to import formatted Word documents, keeping the stylesheet information, and to export the document, via an internal FrameMaker format, as SGML or XML. In order to carry out a conversion using FrameMaker you will need the following configuration files:

1. A conversion table. This contains the list of the Word styles and the corresponding elements within the FrameMaker internal format. This table is saved within the FrameMaker internal format (*.frm).
2. A document type definition, saved within FrameMaker internally as an EDD (Element Definition Document). It is stored in the FrameMaker internal document format (*.edd).
3. FrameMaker uses layout rules for the internal layout of documents. Within this layout definition the layout of documents is described just as it is within MS Word documents: individual formats and their appearance, such as text height, are defined. This file is also stored as a *.frm file.
4. The read/write rules define which FrameMaker format will be exported as which SGML/XML element.
5. The SGML or XML DTD has to be provided as well, including catalog or entity files and any sub-DTDs, such as CALS for tables.
6. To carry out a conversion, a new SGML application has to be defined within FrameMaker+SGML. This application links all the files needed for a conversion, as described above. It enables FrameMaker to parse the output file when exporting a document to SGML or XML.

A workflow and a technology for converting ETDs using FrameMaker+SGML 6.0 were first developed at the Helsinki University of Technology, within the HUTPubl project (1997–2000); see http://www.hut.fi/Yksikot/Kirjasto/HUTpubl

Other Tools

Text editors, Desktop Publishing Systems that can export SGML/XML documents

Tools that export using a user specified DTD:

Tools that export using their own native DTD:

Converter Tools:


Next Section: Conversions from LaTeX to SGML/XML

# Technical Issues/Conversions from LaTeX to SGML\XML

At first glance it seems that, because writing in LaTeX is also a kind of structured writing, a conversion into SGML- or XML-compatible documents should be quite easy.

For the conversion, people often use software such as Omnimark or Balise, or simple Perl scripts.

Problems

Problem 1: The LaTeX format itself, thanks to its structural approach, enables an easier conversion into SGML/XML. But the pure LaTeX approach often goes hand in hand with sophisticated macro programming, so that in many cases the clean structure of a LaTeX document is destroyed by macros and the conversion becomes much harder to perform. The lack of a parser that checks the correct usage of structure elements such as chapter, section and subsection also makes a conversion more complicated.

Problem 2: If an author wants to write a mathematical formula in LaTeX, there are two basic options:

• producing the mathematical formula as a picture
• defining it with the appropriate mathematical LaTeX features, as a text formula

First and Second Versions

The first version prevents any secondary use of the formula. The second allows the reuse of a formula in different contexts; for example, it makes it easy to check the correctness of a statement by importing the LaTeX formula into a mathematical software package such as Maple or Mathematica. Formulas coded in LaTeX can be displayed in rendered form in a browser by software or plug-ins such as IBM Techexplorer or Math Viewer. But because LaTeX-coded formulas still have the disadvantage that they are not encoded using so-called semantic tags, the use of MathML is highly advised. MathML is an XML document type definition for mathematics developed by the W3C.

The Letter e

In LaTeX, authors often do not distinguish between the letter 'e' that may stand for a variable and the Euler constant. In MathML there is a big difference between something encoded as the variable e and as the Euler e (2.718...). Therefore, the use of layout-only definitions for mathematics in LaTeX complicates the conversion into MathML, and thus into any SGML/XML format.
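The distinction shows up directly in MathML content markup, where the Euler constant has its own element (a minimal contrast using standard MathML content elements):

```xml
<!-- 'e' as an ordinary variable identifier: -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <ci>e</ci>
</math>

<!-- 'e' as the Euler constant (2.718...), as a content element: -->
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <exponentiale/>
</math>
```

A converter working only from LaTeX layout commands cannot tell which of the two encodings is meant, which is why authoring conventions or dedicated macros are needed.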

Possible Solutions

In order to prepare mathematical formulas in LaTeX for a conversion, many universities and the TeX user groups around the world are working on the definition of certain macros that can be transformed into the appropriate MathML definitions.

See the University of Montréal at www.theses.umontreal.ca or the University of the Bundeswehr in Munich.

Next Section: Rendering-style sheets

# Technical Issues/Rendering-style sheets

The Problem

In the case of conversion of documents into SGML/XML, the original file must be written using a structuring style sheet (template). But students often give a lot of importance to the layout of the final document and make many personalizations and adaptations of the original structuring style sheet. It is our duty to convert this aspect of their work as well. The conversion tools devoted to the document content therefore have to be complemented by conversion tools that generate rendering style sheets: typically XSL files for XML documents.

Layout Demand

As SGML or XML cannot be directly read by users with today's browsers (Opera, Netscape, Internet Explorer), it is necessary either to provide layout information in the form of a stylesheet, or to transform these highly structured documents into a layout or printable version (HTML, PDF, PS) that users can read more easily.

Generally, universities have two possibilities to cope with the demand for layout:

1. Try to preserve all the layout information the author (student) has given to the MS Word or other document. This strategy involves the author directly, through the personalization of a standard style sheet for his/her work.
2. Universities could decide to develop one single style sheet or a choice of certain style sheets usable for all ETDs. This would support a corporate design and solve the layout question on a more general level, leaving it to the ETD production department.

Style Sheets for SGML or XML documents

The development of such tools is being carried out in France (a collaboration between Lyon 2 and Marne-la-Vallée) for documents produced in MS Word and other RTF-compatible authoring tools. These tools are based on the analysis and extraction of the typographic characteristics associated with each style used.

Equivalent Tools

Equivalent tools should be easy to develop for documents produced with LaTeX, as this kind of language natively uses the notion of rendering style sheets.

Style sheet languages for XML are:

• Cascading Style Sheets (CSS)
• Extensible Stylesheet Language (XSL)

As CSS is not powerful enough to handle the complexity of, and the demands made by, large XML documents such as theses and dissertations, it is not advisable here.

Substandards within XSL Standard

Within the XSL standard, one distinguishes between several sub-standards:

• Extensible Stylesheet Language Transformations (XSLT). This part allows the user to produce stylesheets that act like small programs. They transform the original document, which is always valid against a specified DTD, into a document that either follows another DTD (allowing easier rendering within browsers, e.g. via the HTML DTD) or is expressed in another document description language such as Rich Text Format (RTF), LaTeX or PDF. From those formats the production of a printed version is possible.
• XPath allows authors to build expressions that link to other XML documents, and not only to the whole document: the citation of a certain subsection, or of, e.g., sections 3 to 5, might be possible once common browsers support this linking technology.
• XSL-FO (a formatting vocabulary that can be applied to the nodes of an XML document)
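As an illustration of the XSLT case, a minimal stylesheet of this kind might look as follows (the element names thesis, chapter, title and para are assumptions for the sketch, not taken from a real ETD DTD):

```xml
<?xml version="1.0"?>
<!-- Minimal sketch: transform hypothetical <thesis>/<chapter>/
     <title>/<para> elements into HTML. Element names are
     illustrative only, not from a real ETD DTD. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Wrap the whole document in an HTML skeleton. -->
  <xsl:template match="/thesis">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>

  <!-- Chapter titles become headings. -->
  <xsl:template match="chapter/title">
    <h1><xsl:value-of select="."/></h1>
  </xsl:template>

  <!-- Paragraphs become HTML paragraphs. -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

The same source document can be fed through a second stylesheet that emits XSL-FO for PDF production, which is exactly the "one source, several renderings" approach described above.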

Example of a Printing on Demand (POD) service based on the use of stylesheets

Digital libraries that use their document servers as long-term electronic archives will not make printed information dispensable. On the contrary: among users of these information systems, the desire for printed documents is increasing. In most cases this desire focuses not on the whole document, but on particular parts of it, such as chapters or citations. For that reason, Humboldt-University Berlin's printing-on-demand project aims at the development of a technology that allows users to print only the desired part of a certain document.

Printing on Demand with XML

For the printing-on-demand component with XML, Apache Cocoon was chosen. This software uses an XSLT engine to produce an HTML or PDF version on the fly. "The Cocoon Project is an Open Source volunteer project under the auspices of the Apache Software Foundation (ASF), and, in harmony with the Apache webserver itself, it is released under a very open license. Even if the most common use of Cocoon is the automatic creation of HTML through the processing of statically or dynamically generated XML files, Cocoon is also able to perform more sophisticated formatting, such as XSL:FO rendering to PDF files, client-dependent transformations such as WML formatting for WAP-enabled devices, or direct XML serving to XML and XSL aware clients."

Cocoon

As Cocoon does not include a printing-on-demand component (in particular, a selection feature), a small workaround using different XSLT stylesheets had to be created. The user's view, an HTML view of the actual document, includes check boxes with which the user can select parts of a specific document. This view is produced by the XSLT broker stylesheet, which calls a default stylesheet that produces HTML (document.xml?format=html). If the user selects certain parts of the document via the checkboxes and clicks the "OK" button, a script (PHP-Choise) is called. This script selects the desired chapters and sections of the document using XPath expressions (e.g. http://dochost.rz.huberlin.de/proprint/bsp/slides.xml?CHAPTER=3 and http://dochost.rz.huberlin.de/proprint/bsp/slides.xml?CHAPTER=4), cuts those parts out of the document, and holds them in main memory. This is carried out by the XSLT broker stylesheet, now called with the XML option (document.xml?format=xml). The selected parts are added to one single XML document (all in main memory) and processed by the XSLT broker stylesheet with either the print option or the HTML option (document.xml?format=pdf or document.xml?format=html).
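The chapter-selection step can be sketched with any XPath-capable toolkit; here is a minimal illustration using Python's standard library (the document and element names are hypothetical; Cocoon itself performs this with XSLT stylesheets):

```python
# Sketch: extracting selected chapters from an XML document via
# XPath-style expressions, as the printing-on-demand workflow does.
# The document and element names are illustrative only.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<thesis>"
    "<chapter n='1'><title>Intro</title></chapter>"
    "<chapter n='2'><title>Methods</title></chapter>"
    "<chapter n='3'><title>Results</title></chapter>"
    "</thesis>"
)

def select_chapters(root, wanted):
    """Build a new document containing only the requested chapters."""
    out = ET.Element("thesis")
    for n in wanted:
        # Attribute predicate, a supported subset of XPath.
        for ch in root.findall(f"chapter[@n='{n}']"):
            out.append(ch)
    return out

selection = select_chapters(doc, ["1", "3"])
print(ET.tostring(selection, encoding="unicode"))
```

The assembled selection document can then be handed to a rendering stylesheet (HTML or PDF), just as the broker stylesheet does above.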

M 1.0 DC.Title 245 $a title name the author assigned to the work 1.1 DC.Title.X-Notes 500 source of title note M 2.0 DC.Creator.PersonalName 100$a author, personal person primarily responsible for creating the intellectual content of the work
M 2.3 DC.Creator.X-Institution 710 $a institution's full name name R 2.4 DC.Creator.X-Major O 2.5 DC.Creator.X-College 710$b college's official name
R 2.6 DC.Creator.X-Dept 710 $b department's official name O 3.0 DC.Subject 653$a uncontrolled keywords keywords or phrases describing the subject or content of the work
650 $a LCSH 690$a LCSH
R 4.0 DC.Description.Abstract 520 $a author's summary textual description of the content of the work R 5.0 DC.Publisher.CorporateName 260$b entity responsible for making the work available in its present form
R 6.0 DC.Contributor.X-Chair.PersonalName 700 $a added personal name person(s) or organization(s) who made significant contributions to the work but that contribution is secondary to the author R 6.1 DC.Contributor.XCommittee.PersonalName 700$a added personal name
O 7.0 DC.Date.Valid date committee approved the work in final form(s)
M 7.1 DC.Date.X-Approved 260 $c date Graduate School approved the work M 8.0 DC.Type 655$2 local index term--genre/form category of the work:Text.Thesis.Doctoral or Text.Thesis.Masters + (see DC enumer'd list)
M 9.0 DC.Format 856$q locat.+access/file trans. mode work's data format; ID software (+hardware?) to display/operate the work (see DC list) M 10.0 DC.Identifier 856$u URL (or PURL, hndl, URN) unique ID of the work(s)
856$b If IP address: (Access number) 10.1 DC.Identifier.XCallNumber.LC 090$a
O 11.0 DC.Source 786$n 0 Data Source Entry/Title metadata describing original source from which the work was derived O 12.0 DC.Language 546$a language note language of intellectual content of the work
041$a language code M 13.0 DC.Relation 787$n nonspecific relationship note describes how ETD's parts relate to entire work
787$o nonspecific relationship ID URL O 14.0 DC.Coverage 500$a general note spatial or temporal characteristics of intellectual content of the work
255 $c If spatial: cartographic statement of coordinates 513$b If temporal:period covered note
15.0 DC.Rights 540$a terms governing use/repro copyright statement 856$u If URL: (with $3=rights) M 15.1 DC.Rights.X-Availability Should be "accessibility?" none/some/full access O 15.2 DC.Rights.X-Availability.Notify date to assess accessibility O 15.3 DC.Rights.X-Proxy.PersonalName contact re accessibility if author unreachable O 15.4 DC.Rights.X-Proxy.Address O 15.5 DC.Rights.X-Checksum.MD5 Reveals if file is corrupted. Appropriate here? O 15.6 DC.Rights.X-Signature.PGP Should be accessibility? none/some/full access M = mandatory O = optional R = highly recommended MARC 502 derived from type, creator's institution, date valid MARC 538 = format Paul Mather and Gail McMillan, Virginia Polytechnic Institute and State University, Nov. 18, 1998 MARC does not equal Metadata: missing from Metadata=file size; several notes (vita, abstract), 949/040 (more?) Leader and 008--some redundant info from variable fields Next Section: Naming Standards # Technical Issues/Naming Standards This topic is discussed in Naming standards: file names; unique Ids. The focus there is on naming of files and on unique identifiers. Here we consider those matters briefly, but also discuss URNs and OAI naming. At Virginia Tech, naming of the parts of an ETD is left to authors. Many have a single document, called etd.pdf, but there are many other selections. The entire ETD is referred to by way of a summary page. This has an ID like: etd-12598-10640 or (in earlier cases) etd-5941513972900 These two correspond to full URLs, respectively: Other but similar schemes are in use elsewhere - recall Naming standards: file names; unique Ids. The key point is to have an ID for each work, and a way to resolve that to the actual document. In some cases, a URN is given, such as a handle or PURL, which in turn is resolved to a URL. 
(See Identifying: URN, PURL, DOI) That allows users to remember the URN, and be assured that over time, as documents are moved about, there still be automatic linking from the URN to the designated document. IDs as discussed above also are important with regard to OAI. Assuming that, as is required, each archive has a unique identifier, then as long as each record in the archive, as required, has a unique internal ID, then the combination of the two will uniquely determine a record in an archive. Next Section: Encryption; Watermarking # Technical Issues/Encryption; Watermarking Encryption and Digital Signature Overview: Using digital signatures Digital signatures act like conventional signatures - allowing you to "sign off" on anything that requires an approval. You can simply attach your "signature" to the document. In addition, a signature stores information, like the date and time, and allows you to track document versions and validate their authenticity. To create a digital signature profile: 1. Choose Tools > Self-Sign Signatures > Log In. 2. In the Acrobat Self-Sign Signatures - Log In dialog box, click New Profile. 3. In the User Attributes area of the Acrobat Self-Sign Signatures - Create New User dialog box, enter your name and whatever other information you want to include in the three optional fields. 4. In the Profile File area of the Acrobat Self-Sign Signatures - Create New User dialog box, enter the path name for the folder in which you want to store your signature profile or click Browse and choose a folder. Enter a password of at least six characters in the User Password and Confirm Password fields and then click OK. To add a digital signature to a document: 1. Click on the Digital Signature tool in the Tool bar and then click and drag where you want to place your signature. 2. 
In the Acrobat Self-Sign Signatures - Sign Document dialog box, you can select an option from the Reason for Signing Document pop-up menu or enter a reason in the field, and you can enter a location in the Location, e.g. City Name field. Note: If you're using a third-party signature handler, follow the instructions displayed on screen. You may be prompted to log in to the handler or enter required information.
3. Enter your password in the Confirm User Password field and then click Save Document.
4. If this is the first signature added to the document, the Save As dialog box is displayed. Enter a name and choose a location for the file, and then click OK. Note: If the Save As dialog box is displayed when you add a digital signature, you end up with two copies of the document: one unsigned and one signed.
5. From this point on, you should use the signed version.
6. To display a list of a document's signatures, click the Signatures tab in the Navigation pane. The Signatures palette's pop-up menu contains several commands for working with digital signatures; the Properties command lets you see the attributes of a digital signature.

Next Section: Packaging

# Technical Issues/Packaging

The Australian Digital Theses (ADT) Program software is a modified version of the original Virginia Tech ETD software and is in its second release. The ADT software was modified to make it generic, flexible, and customisable for easy integration within the local IT infrastructure. This is critical because the ADT Program is a distributed and collaborative system involving a large number of Australian universities. The software is distributed to all ADT members free of charge as a .tar file downloaded via FTP. A .tar file keeps related files together, and thus facilitates the transfer of multiple files between computers. The files in a .tar archive must be extracted before they can be used.
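Extraction of such an archive can be sketched with Python's standard tarfile module. This is a minimal sketch; the function name is our own, and the archive path is whatever the downloaded distribution file is called locally.

```python
import tarfile

def extract_distribution(tar_path, dest="."):
    """Extract a distribution .tar archive and return the names of its members."""
    with tarfile.open(tar_path) as archive:
        archive.extractall(path=dest)
        return archive.getnames()
```

On most systems the same result is obtained from the command line with `tar -xf adt.tar`.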
Extracting the distribution .tar file reveals a directory (adt) containing two further directories (cgi-bin & public_html), plus installation instructions in a readme.txt file, plus an empty ADT admin site to help identify the structure of the admin side of the software. Also included in the release package is a test site for members to look at and use to familiarise themselves with the look, structure, and functionality. Three dummy theses are used as examples of how theses can look and how they fit into the admin structure. General software details, as well as all other information regarding the ADT Program, are publicly available on the ADT Information page at http://adt.caul.edu.au/. Responsibility for the ADT software and initial setup support for new members is taken by the lead institution, The University of New South Wales Library.

Overview of the ADT Program software

1. Deposit Form
• generic look and feel
• includes both Copyright & Authenticity statements
• completely revised help screens
• revised, as well as new, alerts for errors etc.
• does not allow non-compliance with core ADT standards, e.g. filenames; illegal symbols
• all fields & processes compulsory unless otherwise indicated on the form

2. Administration pages
• possible to edit HTML as well as change restrictions
• with the new update function it is now possible to ADD, DELETE, and RENAME files. Another feature of the update function is the ability to move files to a NOACCESS directory, as well as to make them accessible again. This is designed primarily for certain parts of theses that cannot be made public for reasons of copyright, patents, legal reasons, and other sensitivities
• now possible to make a thesis available without any restriction (ideal), restrict it to campus only, restrict the whole thesis for an approved caveat period, or totally restrict parts of the thesis. Any combination is possible, with the choice or choices made being reflected in the local view of the thesis.
That is, if a thesis is restricted to campus only, this is now obvious; similarly, if part of a thesis is restricted (noaccess), this is obvious too. Knowing at the outset whether restrictions apply will not frustrate those searching the database
• now possible to easily un-make a deposit, that is, to remove a thesis from public/restricted view and/or to take it back to the deposit directory, where any editing and changes can be made before reapproving it and making it accessible again
• refined URI structure using date (yyyymmdd) and time (hhmmss)
• revised and new help & alert pages

3. Metadata
• completely revised and updated according to the latest Dublin Core Qualifiers document, released 11 July 2000.
• to aid search functionality, keywords/phrases (i.e. DC.Subject) are each repeated as separate element strings. This has resulted in a revised look of the HotMeta database view. The brief record (default) now shows Title, Author, Date, and Institution/School, with the expanded view showing all the metadata.

Next Section: Backups; Mirrors

# Technical Issues/Backups; Mirrors

An archival infrastructure for ETDs should not only consider document formats or the use of digital signatures, but also a consistently implemented concept for mirroring and backups. Mirrors ensure that archives run stably, regardless of usage, bandwidth, and location. A common server concept for an ETD archive is the following:

1. Production server
2. Archive server (storage unit, like IBM's Tivoli Storage Management System)
3. Public archive server
4. Archive mirror at a different geographical location

The production server is the server that communicates with the users, especially the authors. Here the uploads and metadata processing are done. The archive server ensures that a permanent incremental backup of the public archive server is made. Only system administrators have access to the archive server. This secures the access to the ETDs and their digital signatures.
It prevents users from manipulating ETDs and compromising their authenticity and integrity. The public archive server itself is the server that is known as the document server from the outside. Here the ETDs can be accessed by users, and retrieval is available. We advise securing this server with a RAID (Redundant Array of Independent Disks) system, which provides security in case of a hardware failure of a disk, possibly caused by extreme usage of the server. Such a RAID system internally holds two equal copies of the data distributed over independent hard disks. If one copy cannot be accessed due to a head crash or other hardware failure, the second copy takes over and operates in its place. Users will not even notice which copy is actually in use. In addition to the archive server there should be another storage management server, an archive mirror that mirrors the original archive. This is important in case an environmental accident, e.g. fire, earthquake, or anything else, destroys the original archive server; a data loss, or the loss of the full archive, can thus be prevented. This backup archive should be located at another geographical position, even on another continent.

Next Section: Dissemination of ETDs

# Technical Issues/Dissemination of ETDs

Ultimately, ETDs are designed to be disseminated, to at least some audience. Aspects of this activity are considered in the next subsections. First, each ETD must have an ID, and there must be a link between the ID and the actual work. Second, each ETD must have a metadata record attached that can be used for resource discovery and other purposes. Third, there must be a link between the ID, the metadata, and the actual body of the ETD. Fourth, the works must be made available, typically through a Web server. Fifth, there may be summaries or full works that are provided for discovery, or for indexing that is designed to lead to discovery.
For example, summary pages may be indexed by Web search engines so that they may be found as a result of a Web search. Note that all NDLTD sites are encouraged to keep a log regarding dissemination of the local ETDs, as well as other related information, so that statistical and other reports can be prepared both for individual sites and for sites that hold aggregate information (such as NDLTD).

Next Section: Identifying: URN, PURL, DOI

# Technical Issues/Identifying: URN, PURL, DOI

Resources distributed on the Internet are accessible by means of a syntax which corresponds to their physical location. This syntax is defined by RFC 1738 and is known as a Uniform Resource Locator (URL). This way of doing things creates certain problems which we must often confront. Who has not encountered the famous HTTP error 404 Not Found, which indicates that the server cannot find the location of the requested resource? This does not mean that the resource is no longer on the server; it may simply have been moved to another location. URLs have no means of being automatically updated when a resource is moved, so we often run up against that famous HTTP error. While the URL identifies the address of a resource, the Uniform Resource Name (URN) identifies the actual resource, the unit of information, much like the ISBN does for books. To draw a parallel, the URL corresponds to a user's postal address while the URN corresponds to a user's social insurance or social security number. The URN is thus attached to a resource and not to a physical address. By knowing this identifier, it is possible to find the resource even if its physical address changes. The URN ensures an institutional commitment to the preservation of access to a resource on the Internet. In the framework of the Université de Montréal's digital thesis pilot project, undertaken in 1999-2000, we implemented a system for producing URNs based on the model proposed by the CNRI.
A global server based at the CNRI manages "naming authorities" which refer to publishers' numbers. A local server installed at the thesis distribution station in turn houses a database which manages the associations between URNs and URLs. All of this bears close resemblance to Network Solutions' system for managing the DNS, which regulates the IP addresses of computers linked to the Internet, except that in our case it is documents being given addresses as opposed to computers. The model proposed by the CNRI is the Handle System. This system is also the cornerstone of the DOI Foundation's system. The construction of a Handle falls into two parts. The URN's prefix corresponds to the publisher's number (the Université de Montréal's publisher's number is 1012). This number is unique and cannot be used by any other organization. "Sub-names" can be added following this number in order to subdivide it into more precise units. This sequence is followed by a slash ("/") and a freely chosen alphanumeric sequence. Thus, a Handle-type URN for theses reads as follows: hdl:1012.Theses/1999-Albert.Mathieu(1959)-[HTML] We chose the year of the thesis defense, the author's name, his/her date of birth, and the format of the file as the constitutive elements of a thesis' URN identifier. Please note that one must first download the CNRI's plugin in order to use the Handle System. This system has the advantage of conforming fairly closely to the requirements of RFC 1737 concerning the framework regulating a URN system. Its application is nevertheless tedious, since the plugin is absolutely required to resolve the links. After experimenting with the CNRI's system, the Université de Montréal intends to use another system for its ongoing electronic theses project. Another interesting avenue is the PURL system created by OCLC. Let us note that a document attached to a PURL can be modified, contrary to other norms or applications for the use of a URN.
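The two-part structure of a Handle described above (naming-authority prefix, then a slash, then a freely chosen local name) can be sketched with a small parsing helper. The helper itself is illustrative; only the example handle string comes from the text.

```python
def split_handle(urn):
    """Split a Handle-style URN into its naming-authority prefix and local name."""
    handle = urn[len("hdl:"):] if urn.startswith("hdl:") else urn
    prefix, _, local_name = handle.partition("/")
    return prefix, local_name

# The Universite de Montreal example from the text:
print(split_handle("hdl:1012.Theses/1999-Albert.Mathieu(1959)-[HTML]"))
# prints: ('1012.Theses', '1999-Albert.Mathieu(1959)-[HTML]')
```

A Handle resolver performs the reverse operation: given the prefix it locates the responsible local server, which then maps the full handle to the current URL.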
The PURL system largely follows the same principle as the Handle System, except that the URNs are resolved using a URL address. This solution has the advantage of not requiring the use of a plugin. In fact, a PURL is a URL. Rather than pointing directly to an Internet resource, a PURL points to an intermediary resolution service. This service associates the PURL with the active URL, which is then provided to the client. The client can then access the resource normally. It is possible to register PURLs with an intermediary service (such as OCLC's) or to install the service on one's own server.

Next Section: Metadata models for ETDs

# Technical Issues/Metadata models for ETDs

One of the objectives of an ETD program is to yield easy access to TDs. Since we are dealing with digital libraries, we are implicitly dealing with libraries. One of the actions performed on a library catalog is search and retrieval. This is the first step towards accessing the contents of a library item; the second step is the use (reading, listening, viewing, etc.) of the item. To be efficient in search and retrieval, the user must search a catalog in which the items were properly identified, besides using good search functions. This section is about the identification of ETDs, which is a very important step towards their dissemination. The identification is accomplished through the use of metadata elements, whose set is named the metadata model of the digital library of TDs. Before we address metadata models for ETDs, it is important to bring some ideas into the discussion. These ideas are related to the choice of a model, to be considered later on. These models must be rich and versatile, to contain information of different natures and to be searched by users from all over the world.
It is obvious that the richer and more versatile the metadata model is, the more time and effort it takes to capture (collect and record) the information in the digital library. The decision on which model to use has to take this into consideration. In some situations it may be necessary to adopt the simplest possible model in order to make metadata capture viable. Later in this chapter the Dublin Core Metadata Element Set will be introduced; it seems to be the consensus on the minimum identification to be used for ETDs. The ideas for us to think about are:

• Many languages in one world
• ETDs to be read all over the world
• Metadata
• Contents and instances
• Contents, instances and metadata
• Contents, instances and languages
• Metadata models and languages
• Metadata schemes
• Specialization of the metadata models for TDs
• Conclusion - metadata models for ETDs

Many languages in one world

Our world is a linguistically very diverse place. Those who work with information and are involved in international projects know English. This is the language they use to communicate, to access the Internet, to read technical literature, etc. At the same time, not only do many other languages exist, but some of them have large numbers of native speakers. The 100 most spoken languages of the world, counted by first-language speakers, can be found at http://www.sil.org/ethnologue/top100.html. In descending order, the first 10 are Chinese (Mandarin), Spanish, English, Bengali, Hindi, Portuguese, Russian, Japanese, German (Standard) and Chinese (Wu). If only the other 9 languages are considered, it is not hard to imagine the number of texts that are written and published every year. The same happens with TDs. The number of TDs published in languages other than English must be very large.
ETDs to be read all over the world

One of the purposes and benefits of an ETD program is to yield easy access to the results presented in TDs, no matter where the reader is or where the dissertation was written. We assume that ETD digital libraries are connected to the Internet so that their contents can be shared worldwide, making sure this benefit is accomplished.

Metadata

Metadata are data about data, or information about information. The metadata elements are the attributes used to describe a digital library item, just like the ones used to catalog items in a traditional library. Many of these attributes are language dependent, for example titles, abstracts, subjects, keywords, etc. Others obviously are not, for example authors' names, digital format, number of bytes of the file, etc. Since some metadata elements are language dependent and TDs are written in many languages, we can expect that the metadata will most probably use the language of the work. This can pose a problem for search and retrieval, since most of us are not fluent in as many languages as we would like to be.

Contents and instances

The items of a digital library may be identified at two different levels, the same way the items of a traditional library are. The first level is the content, which is equivalent to a title in a traditional library; the second is the instance, which is equivalent to a volume. A content is the logical definition of an item of the digital library and is identified by a set of attributes. An instance is the physical realization of a content or title. It is a digital object and is identified by a set of attributes too. The use of contents and instances allows contents to have multiple instances, either in different formats or due to physical partitions. This yields a one-to-many relationship between contents and instances.
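The one-to-many relationship between contents and instances can be sketched with a minimal data model. All class and field names here are illustrative, not taken from any particular ETD system.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """A physical realization of a content: one digital object."""
    file_format: str
    access_level: str

@dataclass
class Content:
    """The logical item; one content may have many instances."""
    title: str
    language: str
    instances: list = field(default_factory=list)

# One thesis (content) available in two formats (instances):
thesis = Content(title="Sample thesis", language="en")
thesis.instances.append(Instance(file_format="application/pdf", access_level="public"))
thesis.instances.append(Instance(file_format="text/html", access_level="campus-only"))
```

Note that the access level hangs off the instance, not the content, which is exactly what makes per-partition access control possible.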
The use of contents and instances also allows access control to be performed on the partitions instead of on the content. This makes the digital library more flexible in terms of dealing with intellectual property rights. Therefore, we can conclude that there are attributes that are particular to contents and others that refer to instances. The metadata model must contain both.

Contents, instances and metadata

Some metadata elements are common to all contents, for example title, abstract, type, etc., while others are common to all instances, for example electronic format, access level, etc. On the other hand, some metadata elements are specific to some contents, for example translation control (original content, translator, etc.), and others are specific to some instances, for example special equipment, expiration date, remote location, etc. From this we can see that the metadata model must be versatile enough to contain attributes that are common to all contents and to all instances, as well as the specific ones, in order to accommodate specialization of the digital library items.

Contents, instances and languages

Contents may be language dependent. The language of a content is the one in which it is written, spoken or sung. Other languages may be associated with a content: the ones in which it is catalogued. It is possible to describe a content written/spoken/sung in one language in other language(s). This way, there is one catalog entry in each of the languages to be used. The use of multilingual cataloguing yields points of access in different languages if the search is performed in all of them. This topic will be addressed in the section Database and IR.

Metadata models and languages

It is possible to define the digital library to hold more than one language. A good choice would be, at least, the language(s) of the nation where the TDs are developed, plus English.
If this is the case, the metadata model can have all language-dependent attributes written in each language to be used in the digital library, and the language code must be part of the primary key in the database. Attributes that are language independent have only one representation in the database.

Metadata schemes

There are quite a few metadata schemes. Some are strictly related to library items while others have a broader scope, for example the ones devoted to digital objects to be used in Web-based education. Some schemes are well known and should be mentioned:

Specialization of the metadata models for TDs

Besides the usual data contained in general-purpose metadata schemes, there are some types of information related to TDs that may be of interest to the university. For this reason, it may be useful to consider adding extra metadata elements to the traditional metadata schemes. The additional elements can be separated into 3 groups:

• Administrative information - department, date of presentation, date of acceptance, financial support, etc.
• Academic information - level, mentor, examining committee, etc.
• Traditional library information - university, library system, control number, call number, etc.

These may be useful to yield information concerning the graduate programs of the university.

Conclusion - metadata models for ETDs

The definition of the metadata model for an ETD digital library must combine:

• The needs for proper identification of ETDs for the goals of access to be achieved (national access? international access?)
• The administrative needs of the university

At the same time, the restrictions imposed by budgets or operational time frames must be taken into consideration. There is a balance between what is desired and what is possible. Some comments concerning this balance follow:

• For international access, the use of English besides the original language(s) is mandatory.
This means that titles and abstracts must be translated, and that subject headings, keywords, etc. will become multilingual catalogs to be maintained.
• For the ETD digital library to be a part of the international community, the minimum requirements in terms of ETD identification must be met. This means that at least the DCMES must be used.
• For the university to have good control of the intellectual property, the content/instance concept allows access specifications to be established on the digital objects. Thus, some objects may be made public while others may have different types of restrictions due to format or to intellectual content.
• In the definition of the workflow to operate the ETD program, attention must be given to the capture of the metadata elements. If non-librarians are involved in the process, there must be a good training program and a careful review process so that the attributes are catalogued correctly.

The choice of the metadata model is very important, and the team in charge of the implementation of the ETD program must study the possibilities before making the decision. Minimum standards must be met.

Next Section: Cataloging: MARC, DC, RDF

# Technical Issues/Cataloging: MARC, DC, RDF

Many traditional library systems exchange and store records using the MARC (Machine-Readable Cataloging) format, one of the realizations of the ISO 2709 standard. The MARC format has approximately 1,000 fields, many with subfields, which can be repeated. The use of this format allows a very detailed description of the items. This format has a specific field (856) to identify electronic objects associated with the intellectual item and its other physical instances. The DCMES (Dublin Core Metadata Element Set - http://dublincore.org/documents/dces/) is a set of 15 attributes divided into 3 groups: content, intellectual property and instantiation.
Associated with them are the Dublin Core Qualifiers (http://dublincore.org/documents/dcmes-qualifiers/), which enhance the identification of the items. There is a relation between the MARC format and the DCMES, since there is an intersection between the two sets of attributes. The RDF (Resource Description Framework - http://www.w3.org/TR/1999/rec-rdf-syntax-19990222/) is a foundation for processing metadata. It specifies a representation for metadata as well as the syntax for encoding and transporting this metadata. The objective is to yield interoperability of Web servers and clients, and to facilitate automated processing of resources. It can be used to describe Web pages, sites or digital libraries. The DCMES can be used with the RDF representation.

Considering the MARC Record and the Cataloging Department Work Flow

MARC Bibliographic Records

Catalogers may want to focus initially on what fields are currently included in the MARC bibliographic record for theses and how these would be the same or different for ETDs. The MARC record for theses is not very robust and often has a local twist, presenting valuable information in a unique format that can be seen only at the originating institution. Optimally, author, title, abstract, and other relevant bibliographic information would be programmatically adapted to the appropriate MARC fields. The extent of AACR2r compliance is another complicating factor. For example, would programming change upper-case letters to lower case? Requiring authors of theses (in all formats) to provide keywords for use in the bibliographic record may enhance search results. Assigning Library of Congress subject headings (or other controlled vocabulary) is very time consuming, so having the authors assign uncontrolled subject headings may be an appropriate alternative. MARC tag 653 would be appropriate for author-assigned terms.
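Building such a 653 field with repeated subfields from author-assigned keywords can be sketched as follows. The helper is illustrative; the keywords are the ones that appear in figure 2 below, and the `\a` delimiter mirrors the subfield notation used in that figure.

```python
def keyword_field(keywords):
    """Build a MARC 653 field with repeated \\a subfields from author keywords."""
    return "653 " + " \\a ".join(keywords)

print(keyword_field(["mural painting", "WPA", "abstract expressionism"]))
# prints: 653 mural painting \a WPA \a abstract expressionism
```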
Without LC subject headings, however, these MARC records would be considered "minimal level," rather than full level, cataloging. This seemed particularly unjust because the ETD bibliographic records would actually be more robust than previous theses cataloging, since additional information is included. The catalogers on the ad hoc task force suggested including tables of contents (MARC tag 505) and abstracts (MARC tag 520), since the standard copy-and-paste features of today's word processors would make this a relatively easy process. The table of contents for dissertations, however, proved to be quite generic, usually containing only the standard dissertation topics (e.g., literature review, methodology, findings, etc.), and, therefore, not an enhancement to the information available about the work in the OPAC. The abstract, however, contains valuable information about the research topic. The 520 is also an indexed field in our online catalog and, therefore, a word-searchable field for OPAC users. Adding the abstract (250-350 words) can, however, add tremendously to the length of the MARC record. See figure 2 and figure 3. Cataloging conventions have not generally included the name of the thesis author's department as a standard feature of the bibliographic record. This is an opportunity to modify local cataloging practices. Taking advantage of the opportunity to incorporate changes in theses cataloging, consider using MARC tag 502, the dissertation note field, to include the degree, institution, and year the degree was granted, expanding the institution to include the name of the department. The new note would follow this example: 502 Thesis (Ph. D. in Mechanical Engineering)--Virginia Polytechnic Institute and State University, 1955. Evaluating the potential value of an e-thesis bibliographic record provides the opportunity to propose a substantially enhanced record of real and continuing value to OPAC users.
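Assembling the expanded 502 note just described can be sketched programmatically; the function name is our own, and the example values are the ones used in the text.

```python
def marc_502_note(degree, department, institution, year):
    """Format a MARC 502 dissertation note with the department included."""
    return f"502 Thesis ({degree} in {department})--{institution}, {year}."

print(marc_502_note("Ph. D.", "Mechanical Engineering",
                    "Virginia Polytechnic Institute and State University", 1955))
# prints: 502 Thesis (Ph. D. in Mechanical Engineering)--Virginia Polytechnic Institute and State University, 1955.
```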
AACR2r compliance may be an issue. In her Cataloging Internet Resources: A Manual and Practical Guide, Nancy Olson states that when cataloging Internet-accessible documents, one should consider them to be published documents. Therefore, publisher information belongs in field 260 of an e-thesis record. [Coming from a serials background, it seems reasonable to add a 710 for this corporate body tracing.] Additional fields required for cataloging computer files include tag 256 for computer file characteristics, tag 538 for notes of system details, and tag 856 for formatted electronic location and access information. These additional fields (505, 260, 710, etc.), however, also increase the length of the record and, therefore, should be carefully considered, as should the usefulness of the information provided in meeting the combined needs of OPAC users and computerized access and retrieval systems.

Figure 2: MARC bibliographic record for an ETD

VT University Libraries - - - - - - - - ADDISON - - - - MARC BIBLIOGRAPHIC RECORD

[OCLC fixed field tags]
Local lvl: 0 Analyzed: 0 Operator: 0000 Edit: Type cntl: CNTL: Rec stat: n
Entrd: 010608 Used: 010706 Type: a Bib lvl: m Govt pub: s Lang: eng Source: d Illus: a
Repr: Enc lvl: K Conf pub: 0 Ctry: vau Dat tp: s M/F/B: 0
Indx: 0 Mod rec: Festschr: 0 Cont: b Desc: a Int lvl: Dates: 2001,

[003-049 system assigned fields and information]
1. 001 ocm47092981 010608
2. 003 OCoLC
3. 005 20010608092044.0
4. 006 m d s
5. 007 c \b r \d u \e n \f u
6. 035 1475-05560
7. 040 VPI \c VPI
8. 049 VPII
1. 099 Electronic Thesis 2001 Alvarez
2. 100 1 Alvarez, Leticia, \d 1973-
3. 245 14 The influence of the Mexican muralists in the United States \h [computer file] : \b from the new deal to the abstract expressionism / \c Leticia Alvarez.
4. 256 Computer data (1 file)
5. 260 [Blacksburg, Va. : \b University Libraries, Virginia Polytechnic Institute and State University, \c 2001]
6. 440 0 VPI & SU. History. M.A. 2001
7.
500 Title from electronic submission form.
8. 500 Vita.
9. 500 Abstract.
10. 502 Thesis (M.A.)--Virginia Polytechnic Institute and State University, 2001.
1. 504 Includes bibliographical references.
2. 520 3 This thesis proposes to investigate the influence of the Mexican muralists in the United States, from the Depression to the Cold War. This thesis begins with the origins of the Mexican mural movement, which will provide the background to understand the artists' ideologies and their relationship and conflicts with the Mexican government. Then, I will discuss the presence of Mexican artists in the United States, their repercussions, and the interaction between censorship and freedom of expression as well as the controversies that arose from their murals. This thesis will explore the influence that the Mexican mural movement had in the United States in the creation of a government-sponsored program for the arts (The New Deal, Works Progress Administration). During the 1930s, sociological factors caused not only the art, but also the political ideologies of the Mexican artists to spread across the United States. The Depression provided the environment for a public art of social content, as well as a context that allowed some American artists to accept and follow the Marxist ideologies of the Mexican artists. This influence of radical politics will also be described. Later, I will examine the repercussions of the Mexican artists' work on the Abstract Expressionist movement of the 1940s. Finally I will also examine the iconography of certain murals by Mexican and American artists to appreciate the reaction of their audience, their acceptance among a circle of artists, and the historical context that allowed those murals to be created.
1. 538 System requirements: PC, World Wide Web browser and PDF reader.
2. 538 Available electronically via Internet.
3. 653 mural painting \a WPA \a abstract expressionism
4.
856 40 \u http://scholar.lib.vt.edu/theses/available/etd-05092001-130514
5. 945 NBJun2001
6. 949 dpm/tm 06/07/01
7. 994 E0 \b VPI

Figure 3: OPAC display for an electronic thesis from the Virginia Tech VTLS OPAC

VT University Libraries - - - - - - - - ADDISON - - - - - - - - - - - - FULL RECORD

CALL NUMBER: Electronic Thesis 2001 Alvarez
Author: Alvarez, Leticia, 1973-
Title: The influence of the Mexican muralists in the United States [computer file] : from the new deal to the abstract expressionism / Leticia Alvarez.
File Type: Computer data (1 file)
Imprint: [Blacksburg, Va. : University Libraries, Virginia Polytechnic Institute and State University, 2001]
Series: VPI & SU. History. M.A. 2001
Note: System requirements: PC, World Wide Web browser and PDF reader.
Note: Available electronically via Internet.
Remote Acc.: http://scholar.lib.vt.edu/theses/available/etd-05092001-130514
Note: Title from electronic submission form.
Note: Vita.
Note: Abstract.
Note: Thesis (M.A.)--Virginia Polytechnic Institute and State University, 2001.
Note: Includes bibliographical references.
Abstract: This thesis proposes to investigate the influence of the Mexican muralists in the United States, from the Depression to the Cold War. This thesis begins with the origins of the Mexican mural movement, which will provide the background to understand the artists' ideologies and their relationship and conflicts with the Mexican government. Then, I will discuss the presence of Mexican artists in the United States, their repercussions, and the interaction between censorship and freedom of expression as well as the controversies that arose from their murals. This thesis will explore the influence that the Mexican mural movement had in the United States in the creation of a government-sponsored program for the arts (The New Deal, Works Progress Administration).
During the 1930s, sociological factors caused not only the art, but also the political ideologies, of the Mexican artists to spread across the United States. The Depression provided the environment for a public art of social content, as well as a context that allowed some American artists to accept and follow the Marxist ideologies of the Mexican artists. This influence of radical politics will also be described. Later, I will examine the repercussions of the Mexican artists' work on the Abstract Expressionist movement of the 1940s. Finally, I will also examine the iconography of certain murals by Mexican and American artists to appreciate the reaction of their audience, their acceptance among a circle of artists, and the historical context that allowed those murals to be created.
Key Words: -- mural painting

In terms of the broader topic of bibliographic control of electronic publications, focus on adding to current cataloging practices those fields that would enhance the OPAC users' access and conform to AACR2r. Many of the fields describing computer files appear to be redundant (245 \h, 256, 516, and 538, for example), telling the OPAC user over and over that the item is a computer file. To stay within the stringent restrictions of full-level cataloging, however, the members of the task force saw no way to avoid requiring catalogers to use most of the available fields. Concentrate on the MARC tags that provide information about access. The principal fields include: 256 (computer file characteristics), 506 (restrictions on access note), 516 (type of computer file or data note), 530 (other formats available), 538 (system details note), 556 (accompanying documentation), and 856 (electronic location and access). Current OPACs, however, together with the limitations of hardware and workstations, still prevent most users from accessing electronic texts or images directly and smoothly from one menu or even from a single, multi-function workstation.
However, workstations are gradually becoming available that permit users to copy the URL from the bibliographic record and paste it into a World Wide Web browser for accessing an e-text from a single terminal. Knowing this was possible, we included MARC tag 856. Another issue that must be addressed is whether to put the whole URL in subfield u or to split it across multiple subfields. We opted for simplicity and decided to format the 856 subfield u so that it could be copied and pasted into a World Wide Web browser. Again, we were not willing to wait for the programming that would be necessary to combine the separate subfields into a clickable URL. Cataloging has greatly benefited from advances in library automation, and the cataloging of e-texts is ripe for further automation. It is now possible to derive MARC cataloging from text mark-up languages, subsets of XML and SGML such as TEI (Text Encoding Initiative) headers, and possibly even HTML (hypertext markup language) tags.

ETD Submission Form
Name: [MARC tag 100]
Title: [MARC tag 245]
Document Type (check one):
Abstract: [MARC tag 520]
Keywords: [MARC tag 653]
1.
2.
3.
Department: [MARC tag 502]
Degree: [MARC tag 502]
Filename(s), size(s): [MARC tag 256]
1.
2.
3.
4.

In addition to considering the MARC record and the cataloging department work flow, consider a procedure for getting the files from the Graduate School (the approving unit), the mechanics of making an ETD available to a cataloguer (from the secure and private environment of the server), and for moving an ethesis into public access. Have the cataloguer forward a copy of each ETD to a server at UMI as it is processed. If UMI would prefer batch processing, files could be accumulated (i.e., stored in a directory on the etheses server) for batched file transfer, or perhaps a UMI access point could be established on the ETD server from which its staff could retrieve them.
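The form-to-MARC-tag mapping shown above can be sketched programmatically. In the following illustration the record-building helper, its field names, and the simplified subfield coding are assumptions made for the sketch, not calls into any real cataloging toolkit, and the output is not full AACR2r/MARC practice:

```python
# Illustrative sketch: derive basic MARC-style fields from an ETD
# submission form, following the tag mapping on the form above.
# Subfield coding here is simplified for demonstration only.

def form_to_marc(form):
    """Map submission-form entries to (tag, value) pairs."""
    fields = [
        ("100", form["name"]),
        ("245", form["title"]),
        ("502", f'Thesis ({form["degree"]})--{form["institution"]}.'),
        ("520", form["abstract"]),
        ("856", f'\\u {form["url"]}'),
    ]
    # One 653 field with the keywords joined by repeated \a subfields.
    fields.append(("653", " \\a ".join(form["keywords"])))
    return fields

record = form_to_marc({
    "name": "Alvarez, Leticia",
    "title": "The influence of the Mexican muralists in the United States",
    "degree": "M.A.",
    "institution": "Virginia Polytechnic Institute and State University",
    "abstract": "This thesis investigates ...",
    "url": "http://scholar.lib.vt.edu/theses/available/etd-05092001-130514",
    "keywords": ["mural painting", "WPA", "abstract expressionism"],
})
for tag, value in record:
    print(tag, value)
```

A script along these lines could pre-fill the descriptive fields from the submission form, leaving the cataloger to verify and complete the record.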
With input from the University Archivist, and addressing a concern of the Graduate School's, long-term preservation and access of ETDs should also be factored into the procedures. A plan may include periodically writing ETDs to CD-ROMs for security back-ups and possibly longer term preservation. While this may not be the final answer, an alternative has not been brought forward; how frequently this would be done has also not been determined.

Processing Electronic Theses: a possible scenario

1. Graduate School
• Electronically transfers approved file to library theses server
• E-mails Thesis Transmission Form to library thesis coordinator

2. Library/Cataloguer
• Downloads ETD from closed server to her workstation
• Prepares cataloging (see new features above in figure 4)
• Adds a screen to the file that includes the call number and property "stamp" (using "memos" feature of Acrobat)
• Moves file to server for public access
• Electronically sends file to UMI or moves to UMI holding file

3. Library Theses Server Administrator
• Indexes text for word searchability
• Integrates new index with existing index
• Maintains server, including weekly back-ups (stored on site) and monthly tapes (stored off site)
• Removes files from closed server following completed processing
• Makes CD-ROMs

4. Special Collections Department/University Archives
• Retains CD-ROMs
• Works with Theses Server Administrator as necessary to ensure that archival files are accessible

Theses and dissertations as electronic files may be the first major source of electronic texts that many libraries encounter regularly. Seize this opportunity to enhance the OPAC users' search results by expanding current theses cataloging and taking advantage of online information prepared by authors.
Since authors will probably not be adding TEI, MARC, or other element tags to their documents to help cataloging in the near term, catalogers can use the information available in a variety of online sources, including the document itself or the online submission form, to provide the basic descriptive MARC fields. Whether programmatic changes are made or the standard copy-and-paste features of word processors are used, enhancing the ETD bibliographic record does not require a lot of extra work. See also "Electronic Theses and Dissertations: Merging Perspectives," chapter in Electronic Resources: Selection and Bibliographic Control, Pattie, Ling-yuh W. (Miko), and Bonnie Jean Cox, eds. New York: Haworth, 1997 (105-125). [Simultaneously published in Cataloging and Classification Quarterly, 22(3/4)]

Next Section: Database and IR

# Technical Issues/Database and IR

The next step after identification of the items of the digital library, the ETDs, is to address storage of the cataloguing attributes and the operations of searching and retrieving. Remember that the quality of retrieval depends not only on the programming of the search and retrieval functions but, just as importantly, on the quality of the information used to catalog the items of the collection. Databases are common and suitable tools to store, search and retrieve information. Besides this, they can also be very helpful in the process of capturing the attributes, since their general function is managing information. Before implementing the database, the database model must be created. This will happen only after the metadata model has been defined and related to other existing identification procedures, such as traditional cataloguing on an automated library system. If the traditional OPAC is to be maintained during the ETD program, it is desirable to avoid duplicated information.
Thus, the attributes that are present in the OPAC should not be repeated, and a link between the OPAC record and the ETD metadata should be created. No matter where information is stored, the user should be able to perform the types of search that are standard in library systems: author, title, keywords, subjects, ISBN, etc. As mentioned in the section Metadata models for ETDs, some of the information used to identify the items is language dependent (title, keywords, subjects, etc.). If the database holds only one language per record, search procedures are to be performed using arguments in this language. If a multilingual database is modelled, it is recommended that search be language independent, i.e., that the argument be checked against all languages. In the latter situation, after a record is found as the result of a search, all its language instances should be displayed to the user so that he/she can choose the language for retrieval. Database extenders (text, image) may be considered to increase the number of access points by performing searches within the ETDs themselves, not only on the cataloguing attributes. The relation with the legacy systems' databases must be examined, since information concerning the theses and dissertations may be stored in them. Some examples are the graduate program, the mentor, the examining committee, etc. The next sections address specific aspects of this topic.

Next Section: Packaged solutions

# Technical Issues/Packaged solutions

Since 1995, there have been plans to share work developed in connection with ETDs so as to help others. The first such effort was funded by the Southeastern Universities Research Association (SURA), to spread the work around the Southeast of the USA. Virginia Tech acted as the agent for this effort, creating software, documentation, training materials, and other resources. In particular, it became clear that software was needed to support student submission, staff checking, library processing, and support for access.
The Virginia Tech resources were packaged in connection with these efforts and with the follow-on work funded by the USA's Department of Education. These were made available starting in 1996 through local and NDLTD sites, and have been updated regularly since. The following subsections give details of the Virginia Tech solutions, as well as extensions to them, and alternatives. NDLTD offers to assist such sharing by providing a clearinghouse for whatever seems appropriate and useful to share.

Next Section: DiTeD and DIENST

# Technical Issues/DiTeD and DIENST

Theses and dissertations are traditionally covered by the legal deposit law in Portugal. Nowadays, almost all theses and dissertations are created using word processors, confirming the fact that science and technology became one of the first areas for digital publishing. In this context, the deposit of theses and dissertations emerged as an ideal case study for a scenario concerned with a specific genre. For that purpose, the National Library of Portugal promoted the project DiTeD (Digital Theses and Dissertations), from which a software package with the same name originated.

Requirements

Theses and dissertations carry special requirements for registration and access, since their contents are usually used to produce other genres, such as books and papers, or they can include sensitive material related with, for example, patents. This requires that the management system make it possible for authors to declare special requirements for access, which have to be registered and respected. Universities have a long tradition of independence in their organization, culture and procedures. As a consequence, it was soon learned that it would be impossible to reach, in the short and medium term, any kind of overall agreement on common formats or standard procedures with the different administrative services.
Therefore, the main objective defined for DiTeD was the development, on top of the Internet, of a framework that would connect the National Library to the local university libraries and make it possible to support a full digital circuit for the deposit of theses and dissertations.

Architecture

A solution for this framework was found in the DIENST technology [3], which provides a good set of core services. DIENST also has an open architecture that can be used with great flexibility, making it possible to extend its services and build new functionalities. The basic entities of this architecture are shown in Figure 1, as a class diagram in UML (Unified Modeling Language).

Master Server

The Master Meta Server provides the centralized services, including a directory of all the local servers that are members of the system. Only one of these servers exists in each system. In DiTeD this server exists at the National Library. It was renamed Master Server, and differs substantially from the original versions developed for DIENST. The original server was designed to manage only metadata, while now it is necessary also to manage the contents of the theses or dissertations and to support the workflow for their submission and deposit.

DIENST Standard Server

The DIENST Standard Server is the server installed at the university libraries. This server was modified in DiTeD, and renamed Local Server. It is composed of the following core modules:

• Repository Service: This is where the documents are stored. It manages metadata structures and multiple content formats for the same document, functions that were substantially extended in DiTeD (to support a specific metadata format, as well as to recognize that a thesis or dissertation may be composed of several files). It is also possible to define and manage different collections in the same server.
• Index Service: This service is responsible for indexing the metadata and responding to queries.
Small adjustments were made in DiTeD to support diacritics in the indexes and queries, a requirement of written Portuguese.
• User Interface: This service is responsible for the interaction with the user. It was extended in DiTeD to support a flexible multilingual interface and a workflow for submissions using HTTP.

Identifiers

Two Local Servers are running at the National Library. One, named the Deposit Server, is used to locally store the deposited theses and dissertations coming from all universities (the deposit consists of a copy, so in the end each thesis or dissertation will exist in two places, the Local Server and the Deposit Server). A second Local Server is used as a virtual system for those university libraries that do not have the necessary technical resources or skills to maintain their own server. Each thesis or dissertation deposited in DiTeD automatically receives a URN [4], which is registered and managed by a namespace and resolution service. This is in fact a simple implementation of the concept of a PURL (Persistent URL), with the particular property that it resolves any PURL by returning its real URL in the original Local Server, unless that is not available anymore; in that case, it resolves it by returning its URL in the Deposit Server. The entities of this final DiTeD architecture are shown in Figure 2. The prefix of the URN has the form "HTTP://PURL.PT/DITED", while the suffix is formed by an identifier of the university library (the "publisher") and by a specific identifier of the work itself, automatically assigned locally.

Workflow

The workflow comprises two main steps: submission and deposit.

Submission

The submission process comprises the following steps:

Delivery: The process starts with the submission by the student of the thesis or dissertation to a Local Server. In this step the student fills in a metadata form, recording the bibliographic information and the access conditions.
All of this information is held in a pending status until it is checked.

Verification: In a second step a librarian checks the quality of the submission (a login to the local server gives access to all the pending submissions). This task is supposed to be carried out by a local librarian, but it can also be carried out remotely, such as by a professional from the National Library (in the first phase of the project, this task will be carried out by the National Library, especially to ensure uniformity of criteria and to test and tune the procedures).

Registration: If everything is correct (metadata and contents), the thesis or dissertation is stored in the local repository, and the student receives a confirmation. Otherwise, the student is contacted to solve any problem, and the submission remains in the pending status.

Deposit

The deposit consists of copying the thesis or dissertation, together with its metadata, from the Local Server to the Deposit Server. This is done in the following steps:

What's new: Periodically, the Master Server contacts the repository of a Local Server to check if there are new submissions. The Local Server replies, giving a list of the identifiers of the new submissions.

Delivery: For each new submission, the Master Server sends a request to the Local Server to deposit it in the Deposit Server. Because this Deposit Server is also a Local Server, this deposit works just like a normal local submission.

Verification: A librarian at the National Library checks the deposit. This double checking is important, especially in the early phases of the project, to reassess the procedures and test the automatic transfer of files over the Internet (not always a reliable process).

Registration: If everything is correct, the thesis or dissertation is stored in the deposit repository, the final URN (a PURL) is assigned, and both the student and the local librarian receive a confirmation.
The metadata is also reused to produce a standard UNIMARC record for the national catalogue. If any problem is detected, the local librarian is contacted and the deposit remains in the pending status. One can argue that, if the Deposit Server is really also a Local Server, then the first step could be skipped and the Local Server could perform the delivery automatically after a successful submission. This may be a future optimization, but for now the reason for this extra step is to preserve the requirement of an asynchronous system, making it possible for the Master Server, for example, to better control the moment of the deposit (such as giving preference to night periods).

Metadata

DiTeD utilizes a metadata structure for theses and dissertations defined by the National Library and coded in XML. This structure contains descriptive bibliographic information about the work and the author, as well as information about the advisers and jury members, access conditions, etc. This metadata structure is configurable at installation time, making the software flexible for use in other countries, or even with other publication genres. Metadata may also be accessed and exported in other formats, like UNIMARC and Dublin Core.

Multilingual interface

DiTeD's user interface has multilingual capabilities, allowing users to switch between the available languages at any time. The base configuration includes English and Portuguese.

Software availability

The software is maintained by the National Library of Portugal, and distributed freely for noncommercial use. Access to the software package may be requested by email to dited@bn.pt.

References

1. <http://dited.bn.pt>
2. <http://purl.org>
3. <http://www.cs.cornell.edu/cdlrg/dienst/software/DienstSoftware.htm>
4. Sollins, K.; Masinter, L. (1994). Functional Requirements for Uniform Resource Names. RFC 1737.
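The PURL resolution rule described under Identifiers above (return the work's URL on its original Local Server, falling back to the Deposit Server copy when the original is no longer available) can be sketched as follows; the registry contents, server names, and availability flag are illustrative assumptions, not DiTeD's actual data structures:

```python
# Sketch of DiTeD-style PURL resolution: prefer the original Local
# Server URL, fall back to the Deposit Server copy. The registry and
# its entries are illustrative placeholders.

# registry: PURL suffix -> (local URL, deposit URL, local still available?)
REGISTRY = {
    "libA/12345": (
        "http://local.example.edu/etds/12345",            # hypothetical Local Server
        "http://deposit.example.org/etds/libA-12345",     # Deposit Server copy
        True,
    ),
    "libB/99": (
        "http://old.example.edu/etds/99",
        "http://deposit.example.org/etds/libB-99",
        False,  # original server no longer available
    ),
}

def resolve(purl_suffix):
    """Return the URL a DiTeD-style resolver would redirect to."""
    local_url, deposit_url, local_ok = REGISTRY[purl_suffix]
    return local_url if local_ok else deposit_url

print(resolve("libA/12345"))  # original Local Server URL
print(resolve("libB/99"))     # falls back to the Deposit Server URL
```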
Next Section: ADT

# Technical Issues/ADT

Australian Digital Theses Program
http://adt.caul.edu.au/

The ADT Program is a collaborative Australia-wide university libraries initiative. Membership is open to all Australian universities and is voluntary. National coordination of the ADT is via the Council of Australian University Librarians [CAUL, an umbrella group representing all Australian university libraries]. The ADT Steering Committee is currently investigating a proposal to further extend the ADT Program and software to the Australasian region. The ADT software was designed to be transportable and flexible enough to be 'plugged in' at each member institution with minimum modification. The ADT software was also designed to automatically generate simple core metadata, which is gathered automatically to form a national distributed 'metadata' database of ADT ETDs. The ADT software is based on the original Virginia Tech [VT] software. The original ADT v1.0/1999 was released in April 1999, with upgrade v1.1/2000 released in October 2000.

ADT software basics:

• Perl scripts extended by the library CGI.pm, which uses objects to create web forms on the fly and parse their content
• facilitates file uploading and form handling by generation of HTML via function calls and passing of the form state
• extension of variable use to make scripts more generic, to facilitate local institutional setup and style
• a developed standard for generation of unique addresses [URLs] for deposited documents
• automatic generation of DC metadata from the deposited information

The distribution software package requires modification of a list of variables and some webserver-dependent adjustments so that it runs in a standard way at each of the local institutions. Local search software, which may be institution-mandated, is not included in the package. It is expected that each institution will apply its own security accordingly.
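The automatic DC metadata generation listed among the software basics can be sketched as follows; the Dublin Core element choices and the HTML meta-tag rendering are illustrative assumptions for the sketch, not the ADT software's actual output:

```python
# Illustrative sketch of generating simple Dublin Core metadata from a
# thesis deposit form. Element names and the meta-tag rendering are
# assumptions; only the PDF-format core standard comes from the text.

def deposit_to_dc(deposit):
    """Map deposit-form fields to Dublin Core element/value pairs."""
    return [
        ("DC.Title", deposit["title"]),
        ("DC.Creator", deposit["author"]),
        ("DC.Date", deposit["year"]),
        ("DC.Identifier", deposit["url"]),
        ("DC.Format", "application/pdf"),  # ADT core standard: PDF documents
    ]

def as_meta_tags(pairs):
    """Render the pairs as HTML meta tags, one per line."""
    return "\n".join(
        f'<meta name="{name}" content="{value}">' for name, value in pairs)

tags = as_meta_tags(deposit_to_dc({
    "title": "A Sample Thesis",
    "author": "Jane Citizen",
    "year": "2001",
    "url": "http://thesis.example.edu.au/available/etd-001/",  # placeholder URL
}))
print(tags)
```

Records of this shape, generated at deposit time, are what a central harvester could gather into the national metadata database.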
ADT software overview:

• Deposit Form: generic look and feel; includes copyright and authenticity statements; complete set of help screens; quality control using alerts when errors are made, disallowing non-compliance with core ADT standards
• Administration pages: possible to edit HTML and change document restrictions; also possible to add, delete and rename files, as well as to move files to a no-access directory to restrict files temporarily for copyright reasons; varying levels of restriction possible; easy to 'unmake' a deposit
• Metadata: revised and updated according to the latest DC Qualifiers document; generated and gathered automatically, used to create the central national searchable 'metadata' database
• ADT Standards: few and simple, but necessary for the success of the collaborative program. Core standards are: research theses only; PDF document format; PDF filename convention; unique URL; metadata standard.

Summary:

• the ADT software and model is fully transportable and designed to be easily installed locally by any participating institution regardless of local IT infrastructure and architecture
• the ADT software facilitates loading digital versions of theses to the local institutions' servers, where the PDF files will be housed permanently
• the theses can be fully integrated into the local access infrastructure and searched using any local database and/or the local web-based catalogue
• all ADT Program theses can also be searched nationally via the ADT Program metadata database. This database is constructed from DC metadata generated automatically during the deposit process.
The metadata gathered creates rich records that allow highly flexible and specific searching, with links back to the local institutions' servers where the full digital theses are housed
• the ADT Program is a collaborative effort across the whole Australian university community and a proven model for creating a national dataset of digitised theses
• the ADT software and model is relatively inexpensive to install, integrate with local requirements, and operate. Once this is achieved it is virtually maintenance free, sustainable, scalable, and very cost effective for both the institution and the student/author
• the ADT metadata and other standards conform to current international standards and therefore have the potential to integrate with other international open archive initiatives
• theses can be deposited from anywhere, and similarly, the metadata can be gathered from anywhere
• full details and information are available from the ADT homepage

Next Section: Cybertheses

# Technical Issues/Cybertheses

The www.cybertheses.org site is the result of a cooperative project that started between l'Université de Montréal "http://www.theses.umontreal.ca/" and l'Université Lumière, Lyon 2 "http://www.univlyon2.fr/", supported by the Fonds Francophone des Inforoutes "http://www.francophonie.org/fonds/", and dealing with the electronic publication and distribution of theses on the Web. At first, this cooperation dealt with the conception and creation of a production line for electronic documents, using the SGML norm. It also had the objective of setting up a server to be shared by the different participating establishments so as to allow their theses to be indexed. In a desire for openness, we decided to enlarge participation on this server to all establishments of higher education distributing full-text versions of their theses on the Internet, without constraints based on the language used or on the chosen format of distribution.
The www.cybertheses.org site allows theses to be indexed online using a common metadata model. Its implementation provides a structure for this growing cooperative effort by basing itself on the Internet's own modes of functioning and of distributing skills. Our wish is that it can quickly ensure a better distribution of the research work conducted within the participating establishments and serve as an effective tool for the entire community of researchers. From the conception of the project, the partners wanted to contribute to the distribution of software tools created for use in the university environment. These tools are available to the partners in the Cybertheses network. They permit all the partners to participate, on an equal footing, in the construction of an electronic university library that functions in a distributed manner and applies the concept of distributed intelligence. One of the principles that motivated the implementation of this program is to favor, as much as possible, the re-appropriation of research work by the researchers themselves. Our objective is to eventually help create a new political economy of knowledge, one that makes researchers the masters of their publications, beyond any economic constraints. Many aspects of our project respond to this goal:

• Placing theses freely and completely online on the Internet, and thus widening their distribution, means that the thesis will no longer be considered solely the result of a research project. It will instead become a genuine work instrument integrated into a much larger system in order to satisfy the user's demand.
• The creation of the Cybertheses database containing the metadata of the participating institutions' theses. Cybertheses provides an efficient indexing system and rapid searching, while significantly increasing the visibility and the distribution of the theses.
• The use of free or freely distributed software is favored at each stage of the production and distribution process.
• The sharing of research, self-created software programs, and documentation between the Cybertheses partners, thereby creating a "toolbox" permitting participation in the program.

The Cybertheses partners are concerned with developing solutions that can be used by all actors, be they of the North, South or East. Our procedure is based on appropriating the skills and techniques related to the electronic production and distribution of the university community's research results. For more information, consult: http://www.cybertheses.org/

Next Section: VT DV and other tools

# Technical Issues/VT DV and other tools

This page describes the hardware and software requirements involved in setting up your own ETD database.

Hardware

To use this software, you must have a web server available. At Virginia Tech we use a UNIX-based server platform. You should allocate enough disk space on your machine for at least a year's worth of submissions. Our site averages 2.5 Megabytes per submission. Keep in mind that it is better to have more space early on, as the scripts are not designed to deal with a collection that spans multiple drives. You should also have enough memory to handle the web server, the database server, and whatever other tasks you have in mind. As an example, our site uses a dual-processor Sun Enterprise 250 with 384 Mb of RAM, running Solaris 2.7. Our machine has an 18 Gb drive allocated solely for the ETD collection.

Software

Before you can make use of the scripts provided here, you must have the following installed:

Mysql

Mysql is a database server and client which partially implements the SQL92 standard. It is in many ways similar to other SQL databases, such as Oracle, Postgres, and miniSQL. UNIX versions of Mysql are made available without charge to educational institutions at http://www.mysql.com/.
A version of Mysql for Windows NT is available as well for an additional charge, although the scripts are not designed for use with Windows NT.

Perl

All of the scripts included with this distribution are written using perl. Perl is also freely available (under the GNU public license) from http://www.perl.com/CPAN/. It is recommended that you download, install, and test the latest version available for your operating environment.

CGI.pm

The CGI module for perl is one of the most widely used and best supported libraries of CGI-oriented routines in existence. Virtually all of the query handling performed in these scripts relies heavily on CGI.pm. CGI.pm is available from http://www.perl.com/CPAN-local/modules/bymodule/CGI/.

The DBI and DBD:Mysql modules for perl

The DBI module for perl is a generic set of database calls designed to interface with a wide range of different database technologies in a powerful, reliable, and easy to understand way. To make use of the DBI module, you need a DBD module for the particular database you intend to use. The DBD:Mysql module (also known as the DBD:Msql module) allows you to easily perform all types of database operations on a Mysql database from within perl. Both modules are available from http://www.perl.com/CPAN/modules/dbperl/.

The Tie-IxHash module for perl

The Tie-IxHash module is a very small add-in that allows you to reliably output hashes in the order in which they are defined. Without this module, none of the global hashes that contain department names, degree information, etc. would appear in the order we'd like them to. This module is available at http://www.perl.com/CPAN-local/modules/by-module/Tie/.

Web Server Software

The perl scripts provided are designed for the most part to be used through a CGI interface, meaning that you must have a compatible web server installed.
The freely available Apache Web Server is our recommendation, although any web server capable of seamlessly handling HTML output from perl scripts should be acceptable. Apache is available from http://www.apache.org/. Once you have all the preceding items installed and tested, you should be ready to download (http://scholar.lib.vt.edu/ETD-db/developer/download/) and install (http://scholar.lib.vt.edu/ETD-db/developer/install.html) the scripts.

Next Section: Library Automation/OPAC: VTLS

# Technical Issues/Library Automation/OPAC: VTLS

ETDs are intimately connected with libraries. The first key automation of libraries, launched in the 1980s, involved computerization of the card catalogs. Online Public Access Catalogs (OPACs) were developed, and supported search through records describing library holdings. Many library catalogs have their holdings encoded according to the MARC (machine readable cataloging) standard. Managed by the US Library of Congress, the MARC standard in the USA has been applied in many other contexts, leading to broader standards like UNIMARC. The latest standard as of 2001 is MARC 21. VTLS, Inc. (http://www.vtls.com) is one company that sells software (e.g., their Virtua system, which works with MARC) and services to support catalog search and other related library requirements. OPACs in many cases have evolved into sophisticated digital libraries, and can be used to help manage ETD collections. Indeed, ETDs at most universities can be found by searching in the local catalog system, as long as one can distinguish this genre or type of work from other holdings (e.g., books, journals, maps). VTLS has volunteered equipment, software, staff, and services to help NDLTD. From http://www.vtls.com/ndltd one can gain access to a system affording search and browse support for the union catalog of all ETDs that can be harvested using the mechanisms supported by the Open Archives Initiative.
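The Open Archives Initiative harvesting just mentioned works through simple HTTP requests defined by the OAI-PMH protocol. A minimal sketch of building a ListRecords request follows; the base URL is a placeholder, not NDLTD's or any repository's actual endpoint:

```python
# Minimal sketch of constructing an OAI-PMH harvesting request, the
# mechanism used to build union catalogs of ETD metadata. The base URL
# below is a placeholder.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL per the OAI-PMH protocol."""
    if resumption_token:
        # A resumption token must be the only argument besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

print(list_records_url("http://repository.example.edu/oai"))
```

A harvester fetches such URLs repeatedly, following resumption tokens, and merges the returned Dublin Core records into a central searchable index.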
Next Section: Harvest usage in Germany, France

# Technical Issues/Harvest usage in Germany, France

The HARVEST system is often used for fulltext search within ETD archives. In Germany, most of the university libraries use this particular software.

Preconditions for using Harvest

Before the installation, one should check whether the following technical preconditions are fulfilled.

Hardware:
• fast processor (e.g., Sparc 5)
• fast I/O
• enough RAM (> 64 MB)
• 1-2 GB free disk space (sources 25 MB)

Operating systems supported:
• DEC OSF/1 2.0 and later
• SunOS 4.1.x and later
• Sun Solaris 2.3 and later
• HP-UX
• AIX 3.x and later
• Linux (all kernels from 1999 on)
• all Unix platforms
• Windows NT

The following additional software is needed to use Harvest:
• Perl v4.0 and higher (v5.0)
• gzip
• tar
• HTTP server (with remote machine)
• GNU gcc v2.5.8 and higher
• flex v2.4.7
• bison v1.22

Harvest Components

The Harvest system consists of two major components: the Harvest Gatherer and the Harvest Broker. This allows establishing a distributed retrieval and search model.

Installation procedure (ftp://ftp.tardis.ed.ac.uk/pub/harvest/develop/snapshots/)

Gatherer

This component is responsible for collecting the full text and metadata of the dissertations. The Gatherer visits several sites regularly, sometimes daily or weekly, and builds an incremental index database. The collected index data are held in a special format, called SOIF (Summary Object Interchange Format). The Gatherer can be extended so that it can interpret different formats.

Broker

This part of the software is responsible for providing the indexed information through a search interface. The Broker operates as query manager and does the real indexing of the data. Using a Web-based interface, it can reach several Gatherers and Brokers simultaneously and perform several search requests.
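The SOIF idea can be sketched in a few lines. The record below is hypothetical and simplified; real SOIF prefixes every attribute value with its byte count in braces, which this toy parser captures but does not verify:

```python
import re

# Hypothetical, simplified SOIF record as a Gatherer might emit it.
record = """@FILE { http://example.org/etd/42.pdf
Title{21}: CT-Angiographie Study
Author{15}: Ludewig, Stefan
}"""

def parse_soif(text):
    """Parse one simplified SOIF record into (url, attributes)."""
    lines = text.strip().splitlines()
    header = re.match(r"@(\S+)\s*\{\s*(\S+)", lines[0])
    url = header.group(2)
    attrs = {}
    for line in lines[1:]:
        m = re.match(r"(\S+?)\{(\d+)\}:\s*(.*)", line)
        if m:
            attrs[m.group(1)] = m.group(3)
    return url, attrs

url, attrs = parse_soif(record)
print(url)              # http://example.org/etd/42.pdf
print(attrs["Author"])  # Ludewig, Stefan
```

A real Broker would index many such records gathered from several sites; this only shows the record shape.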
In Germany, a nationwide retrieval interface on the basis of the Harvest software has been established, called TheO (Theses Online Broker), which is accessible via http://www.iuk-initiative.org/iwi/TheO. Within NDLTD a special Broker has been set up to add the German sites using Harvest to the international search. Harvest is able to search within the following document formats:

• C, Cheader: Extract procedure names, included file names, and comments
• Dvi: Invoke the Text summarizer on extracted ASCII text
• FAQ, FullText, README: Extract all words in file
• FrameMaker: Up-convert to SGML and pass through SGML summarizer
• HTML: Up-convert to SGML and pass through SGML summarizer
• LaTeX: Parse selected LaTeX fields (author, title, etc.)
• Makefile: Extract comments and target names
• ManPage: Extract synopsis, author, title, etc., based on -man
• News: Extract certain header fields
• Patch: Extract patched file names
• Perl: Extract procedure names and comments
• PostScript: Extract text in word processor-specific fashion, and pass through Text summarizer
• RTF: Up-convert to SGML and pass through SGML summarizer
• SGML: Extract fields named in extraction table
• SourceDistribution: Extract full text of README file and comments for Makefile and source code files, and summarize any manual pages
• Tex: Invoke the Text summarizer on extracted ASCII text
• Text: Extract first 100 lines plus first sentence of each remaining paragraph
• Troff: Extract author, title, etc., based on -man, -ms, -me macro packages, or extract section headers and topic sentences
• Unrecognized: Extract file name, owner, and date created

Configuration for PDF files:

Before the Harvest Gatherer can collect PDF documents and transform them into SOIF format, it has to be configured; using only the standard configuration ignores the format. In order to make the format known to the Gatherer, a summarizer for PDF has to be built. Delete the following line in the file /lib/gatherer/byname.cf:

Pdf ^.*\.(pdf|PDF)$

Configure the PDF summarizer. Acrobat can be used to convert PDF documents into PostScript documents, which are then used by the summarizer. A better choice is the xpdf package by Derek B. Noonburg (http://www.foolabs.com/xpdf). It contains a PDF-to-text converter (pdftotext) that can be integrated into the summarizer Pdf.sum:
#!/bin/sh
/usr/local/bin/pdftotext $1 /tmp/$$.txt
Text.sum /tmp/$$.txt
rm /tmp/$$.txt

Configuring the Gatherer for HTML metadata

The Harvest Gatherer is by default configured to map every HTML metatag into a SOIF attribute, e.g. <META NAME="DC.Title" CONTENT="Test"> into a SOIF attribute of its own that is equal to the NAME attribute of the metatag. The configuration can be found at <harvest home>/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. The summarizer table has entries like this: <META:CONTENT>$NAME. If a retrieval should be done only in the HTML metatags, meaning within certain SOIF attributes, those attributes have to be put in front of the search request and put into the retrieval forms provided to the user, e.g.
DC.Title: Test

As there is a nationwide agreed metadata set for ETDs in Germany, these attributes are searchable within the Germany-wide Harvest network.

The following example shows how such Dublin Core metadata (only a small part is displayed) is encoded within the HTML front pages of ETDs:

<META NAME="DC.Type" CONTENT="Text.PhDThesis">
<META NAME="DC.Title" LANG="ger" CONTENT="Titelseite: Ergebnisse der CT-Angiographie bei der Diagnostik von Nierenarterienstenosen">
<META NAME="DC.Creator.PersonalName" CONTENT="Ludewig, Stefan">
<META NAME="DC.Contributor.Referee" CONTENT="Prof. Dr. med. K.-J. Wolf">
<META NAME="DC.Contributor.Referee" CONTENT="Prof. Dr. med. B. Hamm">
<META NAME="DC.Contributor.Referee" CONTENT="PD Dr. med. S. Mutze">
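As a rough illustration of the mapping the Gatherer performs (this is a sketch, not the Gatherer's actual code), the following pulls Dublin Core META tags out of an HTML front page:

```python
from html.parser import HTMLParser

# Sketch only: collect Dublin Core attributes from an ETD front page,
# much as the Gatherer maps each META tag onto a SOIF attribute.
class DCExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name, content = d.get("name"), d.get("content")
        if name and name.startswith("DC.") and content:
            # Repeated tags (e.g. several referees) accumulate in a list.
            self.metadata.setdefault(name, []).append(content)

page = """<html><head>
<meta name="DC.Type" content="Text.PhDThesis">
<meta name="DC.Creator.PersonalName" content="Ludewig, Stefan">
<meta name="DC.Contributor.Referee" content="Prof. Dr. med. B. Hamm">
</head></html>"""

parser = DCExtractor()
parser.feed(page)
print(parser.metadata["DC.Type"])  # ['Text.PhDThesis']
```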

For this agreed metadata set, a suggestion has been formulated on how the metadata can be produced and handled at the university libraries. The following schema shows the details: The doctoral candidate uploads his ETD document to the library. He does this by filling in an HTML form, which internally collects metadata in Dublin Core format. The university library checks the metadata for correctness and the ETD for readability and correct usage of style sheets. The university library adds some descriptive metadata to the metadata set and puts a presentation version of the ETD on its server. During this procedure an HTML front page containing the Dublin Core metadata encoded as HTML meta tags is generated.

Finally, the university library submits the metadata to the national library, which is in charge of archiving all German-language literature.

The national library copies the ETD and metadata to its own internal system.

Searching SGML/XML documents

Harvest also allows searching within SGML/XML DTD (document type definition) elements.

All that has to be done is to configure the Gatherer component according to the following rules:

Within the home directory of the Harvest software (written as <harvest-home>), a line has to be added in /lib/gatherer/byname.cf: DIML ^.*\.did$ (DiML is the DTD used at Humboldt University; .did is the file extension of SGML documents conforming to the DiML DTD). This tells the Harvest Gatherer which summarizer should be used if documents ending in .did are found. Now the summarizer has to be built and saved as DIML.sum within the filesystem at <harvest-home>/lib/gatherer. The summarizer contains the following lines:

#!/bin/sh
exec SGML.sum ETD $*

Within the catalog file <harvest-home>/lib/gatherer/sgmls-lib/catalog the following entries have to be made (they point to the public identifiers of DiML and the DTDs used by DiML):
DOCTYPE ETD DIML/diml2_0.dtd
PUBLIC "-//HUB//DTD Electronic Thesis and Dissertations Version DiML 2.0//EN" DIML/diml2_0.dtd
PUBLIC "-//HUB//DTD Cals-Table-Model//EN" DIML/cals_tbl.dtd
PUBLIC "-//HUBspec//ENTITIES Special Symbols//EN" DIML/dimlspec.ent

Now <harvest-home>/lib/gatherer/lib/sgmls-lib/DIML can be created (mkdir <path>) and four files copied into the path:
diml2_0.dtd, cals_tbl.dtd, dimlspec.ent and diml2_0.sum.tbl (the DTD, table DTD, entity file, and summarizer table). The file diml2_0.sum.tbl consists of the DTD tags that should be searchable and the appropriate SOIF attributes.

The Gatherer can be launched now.

In order to search within certain SOIF tags, the name of the SOIF attribute has to be put in front of the search term; e.g., searching for "title: Hallo" means searching within the SOIF attribute 'title' for the term 'Hallo'.
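The fielded-query convention can be sketched as follows; the default field name here is an assumption for illustration, not Harvest's actual behavior:

```python
# Sketch: split an optional "attribute:" prefix off a query string.
# A query with no prefix falls back to a (hypothetical) fulltext field.
def parse_query(query, default_field="fulltext"):
    field, sep, term = query.partition(":")
    if sep and " " not in field:
        return field.strip(), term.strip()
    return default_field, query.strip()

print(parse_query("title: Hallo"))  # ('title', 'Hallo')
print(parse_query("Hallo"))         # ('fulltext', 'Hallo')
```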

At Humboldt University Berlin, a prototype has been installed that allows retrieval within document structures, so a user may search within the following parts of a document and thereby narrow the search in order to retrieve only the wanted information and hits:

• Fulltext (im Volltext)
• For authors (nach Autoren)
• In titles (in Titeln)
• In abstracts (Im Abstract)
• Within author keywords (in Autorenschlagwörtern)
• For Institutes/ Subjects (nach Institut/ Fachgebiet)
• For reviewers (nach Gutachtern)
• Captions of figures (in Abbildungsbeschriftungen)
• In Tables (in Tabellen)
• Within the bibliography (in der Bibliographie)
• For Author names within bibliography (nach Autoren in der Bibliographie)

Harvest and OAI

With the growing enthusiasm for the approach of the Open Archives Initiative, where a special OAI software protocol allows sending requests to document archives and receiving standardized metadata sets as answers, there are ideas on how this could be connected with the Harvest-based infrastructure that has been set up in Germany.

Making a Harvest archive OAI-compliant means that the information the Gatherer holds has to be normalized (same metadata usage) and that the index has to be temporarily saved within a database. The Institute for Science Networking at the Department of Physics at the University of Oldenburg, Germany, developed the following software. This software, written in PHP4, uses an SQL database to answer the OAI protocol requests. The SQL database holds the normalized data from the Harvest Gatherer.

Other university document servers, like the one at Humboldt University, hold the Dublin Core metadata within an SQL database (Sybase) anyway. There, a PHP4 script operating at the CGI interface reads the OAI protocol requests, which are transported via HTTP, and turns them into SQL statements, which are then run against the SQL database. The database response is given in SQL syntax as well, and is then transformed into OAI protocol syntax using XML and Dublin Core (see http://edoc.hu-berlin.de/oai).
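The production gateway is a PHP4 script, but the final step, wrapping one normalized database row as Dublin Core elements for the OAI response, can be sketched like this (the row layout is hypothetical, and only a fragment of a full OAI response is shown):

```python
import xml.etree.ElementTree as ET

# Dublin Core element-set namespace, as used in OAI responses.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def dc_record(row):
    """Wrap one normalized database row as Dublin Core elements."""
    record = ET.Element("record")
    for field in ("title", "creator", "date"):
        ET.SubElement(record, f"{{{DC}}}{field}").text = row[field]
    return ET.tostring(record, encoding="unicode")

row = {"title": "CT-Angiographie", "creator": "Ludewig, Stefan",
       "date": "2001"}
print(dc_record(row))
```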

Next Section: The NDLTD Union Catalog

# Technical Issues/The NDLTD Union Catalog

The Networked Digital Library of Theses and Dissertations is an interconnected system of digital archives. After NDLTD became a reality, the logical next step was to give users the means to search and browse the entire collection of theses and dissertations. Much research went into deciding how best to do this, and a couple of proto-solutions were examined. In the end, by adopting the Metadata Harvesting Protocol of the Open Archives Initiative to gather metadata in the ETDMS format, NDLTD was able to make the collection of ETDs accessible via a central portal. VTLS Inc., using their Virtua Integrated Library System (Virtua ILS), is hosting this portal, providing a Web interface to the ETD Union Catalog. This portal gives users a simple and intuitive way to search and browse through the merged collection of theses and dissertations. Once relevant items are discovered via a browse, keyword, or Boolean search, users can follow links that go directly to the source archives or to an alphabetical index that allows for further searching.

VTLS’ flagship product, Virtua ILS, is especially suited to the needs of NDLTD. For starters, it is inherently a distributed system and adheres to emerging standards, such as the Unicode® standard, for encoding metadata. Secondly, VTLS’ Web portal, called Chameleon Gateway to emphasize its changeability, is not only straightforward and easy to use, but highly customizable and translatable, giving global users the ability to retrieve information and enter searches in a multitude of languages. Currently, versions of the NDLTD portal interface exist in 14 languages, including Arabic, Korean, and Russian, and plans call for support of all languages used by NDLTD members.

In addition to creating an ETD, students now need to become their own cataloguers. The tools being developed by the NDLTD will assist in the submission of a basic description of the ETD but there are skills to be developed and issues to be aware of.

Students will have already discovered that one of the greatest barriers to finding information is the difficulty of coming up with the right terminology. Lists of standardized subject heading terms, structured thesauri, and fielded searching have been created to remedy this problem.

Accurate metadata improves precision and increases the recall of the content of ETDs by using the same standardized terms or elements. However, even if common metadata elements are used, there is no guarantee that the vocabularies, the content of the elements, will be compatible between communities of interest. Students and researchers working within specialized areas sometimes forget that language and terms often have particular and precise meanings. Outside this field of interest, global searches can return too much of the wrong information.

Generating accurate metadata requires some of the basic skills of resource description and good practice in avoiding three language problems that cause poor precision:

• Polysemy: words with multiple meanings. For example, if we are searching for an article that discusses types of springs and their uses, we might retrieve articles on freshwater springs or on the season of spring, as well as on leaf springs, flat springs, or coil springs.
• Synonymy: words representing the same concept, although they may do it with different shades of meaning. Take the words 'ball,' 'sphere,' and 'orb,' or 'scuba diving' versus 'skin diving.' If we look for scuba diving, but the term used is skin diving, we will miss materials we might otherwise find. Good metadata should draw these materials together, despite their use of different synonyms.
• Ambiguity: If we return to our example of springs, we can see that what differentiates these meanings is their context. It is unlikely an article on coil springs will also discuss water quality. The other words used in the article, and the processes described, will be entirely different. A search engine must understand the meaning, not just be able to match the spelling of a word, if it is going to differentiate between different meanings of the same word.

A possible solution to this difficulty lies in the recent development of the notion of Application Profiles. Application Profiles provide a model for the way in which metadata might be aggregated in 'packages' in order to combine different element sets relating to one resource. This is a way of making sense of the differing relationships that implementers and namespace managers have with metadata schemas, and the different ways they use and develop schemas. Students should investigate these developments within their community of interest.
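The effect of synonymy on recall can be sketched with a tiny "synonym ring": a query term is expanded with its controlled-vocabulary equivalents before matching. The vocabulary and documents below are invented for illustration:

```python
# Hypothetical synonym ring drawn from a controlled vocabulary.
SYNONYMS = {
    "scuba diving": {"scuba diving", "skin diving"},
    "skin diving": {"scuba diving", "skin diving"},
}

def expand(term):
    """Return the term plus its controlled-vocabulary equivalents."""
    return SYNONYMS.get(term, {term})

docs = {
    1: "an introduction to skin diving equipment",
    2: "coil springs in suspension design",
}

def search(term):
    terms = expand(term)
    return [doc_id for doc_id, text in docs.items()
            if any(t in text for t in terms)]

# Without expansion, "scuba diving" would miss document 1 entirely.
print(search("scuba diving"))  # [1]
```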

Resources

See: Jessica Milstead and Susan Feldman, "Metadata: Cataloging by Any Other Name ...", ONLINE, January 1999.

See: Rachel Heery and Manjula Patel, "Application profiles: mixing and matching metadata schemas".

Next Section: Fulltext

# Technical Issues/Fulltext

When all of the text of an ETD is available for searching, a digital library system is said to support fulltext searching. Users can submit queries that call for documents that have particular phrases, words, categories, or word stems appearing anywhere in the text (e.g., in the middle of a paragraph, or as part of the caption of a figure).

In fulltext searching it often is possible to specify that query terms appear in the same paragraph, same sentence, or within n words of each other. These refinements may work together with support for exact or approximate phrase and/or word matching.

For fulltext searching to work, the entire document must be analyzed, and used to build an index that will speed up searching. This may require a good deal of space for the index, often around 30% of the size of the texts themselves. Further, such searching may lead to decreased precision, since a document may be located that only makes casual mention of a topic, when the bulk of the document is about other topics. On the other hand, fulltext searching may improve recall, since works can be found that are not classified to be about a certain topic. Further, fulltext searching often yields passages in a document, so one can find a possibly relevant paragraph, rather than just a pointer to a document that then must be scanned to ascertain relevance.
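The machinery behind such refinements can be sketched with a positional inverted index, which records where each word occurs and so supports "within n words" queries. This is an illustration, not any particular system's implementation:

```python
from collections import defaultdict

# Illustrative only: a positional inverted index mapping
# word -> {doc_id: [positions]}, built from a toy collection.
def build_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def near(index, w1, w2, n):
    """Doc ids in which w1 and w2 occur within n words of each other."""
    hits = []
    for doc_id in sorted(set(index[w1]) & set(index[w2])):
        if any(abs(p - q) <= n
               for p in index[w1][doc_id] for q in index[w2][doc_id]):
            hits.append(doc_id)
    return hits

docs = {1: "coil springs store mechanical energy",
        2: "fresh water springs feed the river"}
index = build_index(docs)
print(near(index, "springs", "water", 5))  # [2]
```

Storing positions as well as document ids is part of why fulltext indexes take substantial space relative to the texts themselves.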

Next Section: SGML/XML Overview

# Technical Issues/SGML\XML Overview

SGML/XML is a multiple-target strategy. "It allows librarians to ensure longevity of digital dissertations. Modern hardware and redundancy can keep all the bits of an electronic thesis or dissertation (ETD) intact. But electronic archives must be modernized continually as new document formats become popular." As librarians tend to think in decades, document formats like TIFF, PostScript or PDF do not meet their requirements. If PDF is replaced by another de facto (industry, not ISO-like) standard, preserving digital documents would mean converting thousands of documents. XML can help overcome those difficulties. "If an electronic document is to be of archival quality, it should be liberated from the page metaphor."

A second reason for using SGML/XML is that it ensures reusability of documents by preserving raw data and content-based structuring of information pieces. Preserving data for statistics and formulas in mathematics and chemistry could allow researchers to reuse and repeat simulations, calculations and experiments, deriving the needed data directly from an archive.

Third, using structured information allows the reuse of the same information or documents in different contexts, i.e., the same digital dissertation can be used to produce an online or print version, and to produce additional information products, like monthly proceedings containing the abstracts of all dissertations produced within the university during the last month, or a citation index. Additionally, the dissertation can be rendered on different media, so a Braille reader or an automatic voice synthesizer could be used as a back-end device.

Another reason for using markup for encoding documents is that a wider, more qualified retrieval could be provided to the users of an archive. As university libraries are more and more challenged by the problem of handling, converting, archiving and providing electronic publications, one of the major tasks is providing a new quality for retrieval within the user interface. Using an SGML/XML-based publishing concept enables a new quality in the distribution of scientific contents via specific information and knowledge management.

What does SGML/XML mean?

The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web. The current W3C Recommendations are XML 1.0 (February 1998), Namespaces (January 1999), Associating Stylesheets (June 1999), and XSLT/XPath (November 1999); see http://www.w3.org/XML. The development of XML started in 1996, and it has been a W3C (http://www.w3.org/) standard since February 1998, which may make you suspect that this is rather immature technology. But in fact the technology isn't very new.

Before XML there was the Standard Generalized Markup Language (SGML), developed in the early '80s, an ISO standard since 1986, and widely used for large documentation projects. And of course HTML, whose development started in 1990. The designers of XML simply took the best parts of SGML, guided by the experience with HTML, and produced something that is no less powerful than SGML, but vastly more regular and simpler to use. While SGML was mostly used for technical documentation and much less for other kinds of data, with XML it is the opposite.

"Structured data", such as mathematical or chemical formulas, spreadsheets, address books, configuration parameters, financial transactions, technical drawings, etc., is usually put on the Web using the output of layout programs such as PostScript or PDF, or by converting it into graphics formats like GIF, JPEG, PNG, VRML, and so on. Programs that produce such data often also store it on disk, for which they can use either a binary format or a text format. So, if somebody wants to look at the data, he usually needs the program that produced it. With XML, those data can be stored in a text format, which allows the user to read the file without having the original program. XML is a set of rules, guidelines, conventions, whatever you want to call them, for designing text formats for such data, in a way that produces files that are easy to generate and read (by a computer).

The eXtensible Markup Language (XML) is a markup or structuring language for documents, a so-called metalanguage, that defines rules for the structural markup of documents independently of any output medium. XML is a "reduced" version of the Standard Generalized Markup Language (SGML), which has been an ISO-certified standard since 1986. In the field of Internet publishing, SGML never achieved wide success due to the complexity of the standard and the high cost of the tools. It prevailed only in certain areas, such as technical documentation in large enterprises (Boeing, patent information). The main philosophy of SGML and XML is the strict separation of content, structure and layout of documents. Most ETD projects use either the SGML standard (ISO 8879, with corrigendum of 4 December 1997) or the World Wide Web Consortium (W3C) definition, XML 1.0 (10 February 1998, revised 6 October 2000). The crux of all those projects was always the document type definition (DTD).

Next Section: SGML/XML and other Markup Languages

# Technical Issues/SGML\XML and Other Markup Languages

SGML (Standard Generalized Markup Language) and XML (eXtensible Markup Language) are markup languages that use tags (label names enclosed in "<" and ">") around the sections of a document that are thus marked or bracketed. A Document Type Definition (DTD) specifies the grammar or structure for a type or class of documents. SGML requires a DTD, while XML employs a DTD optionally. But given current trends, it seems that XML is most likely to be used, for the following reasons.

• XML is a method for putting structured data in a text file. For "structured data", think of such things as spreadsheets, address books, configuration parameters, financial transactions, technical drawings, etc. Programs that produce such data often also store it on disk, for which they can use either a binary format or a text format. The latter allows you, if necessary, to look at the data without the program that produced it. XML is a set of rules, guidelines, conventions, whatever you want to call them, for designing text formats for such data, in a way that produces files that are easy to generate and read (by a computer), that are unambiguous, and that avoid common pitfalls, such as lack of extensibility, lack of support for internationalization/localization, and platform dependency.
• XML looks a bit like HTML but isn't HTML.
Like HTML, XML makes use of tags and attributes (of the form name="value"), but while HTML specifies what each tag and attribute means (and often how the text between them will look in a browser), XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data completely to the application that reads it. In other words, if you see "<p>" in an XML file, don't assume it is a paragraph. Depending on the context, it may be a price, a parameter, or a person. In short, XML allows you to develop your own markup language specific to a particular domain.
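A minimal sketch of this point, using an invented document in which the application decides that <p> is a price, not a paragraph:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: this application treats <p> as a price element.
doc = ET.fromstring('<catalog><p currency="EUR">19.99</p></catalog>')
price = float(doc.find("p").text)
print(price)  # 19.99
```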
• XML documents can be preserved for a long time.
XML is, at a basic level, an incredibly simple data format. It can be written in 100 percent pure ASCII text, as well as in a few other well-defined formats. ASCII text is reasonably resistant to corruption. Also, XML is very well documented: the W3C's XML 1.0 specification tells us exactly how to read XML data.
• XML is license-free, platform-independent and well-supported.
By choosing XML as the basis for some project, you buy into a large and growing community of tools (one of which may already do what you need!) and engineers experienced in the technology. Opting for XML is a bit like choosing SQL for databases: you still have to build your own database and your own programs/procedures that manipulate it, but there are many tools available and many people that can help you. And since XML, as a W3C technology, is license-free, you can build your own software around it without paying anybody anything. The large and growing support means that you are also not tied to a single vendor. XML isn't always the best solution, but it is always worth considering.
• XML is a family of technologies.
There is XML 1.0, the specification that defines what "tags" and "attributes" are, but around XML 1.0 there is a growing set of optional modules that provide sets of tags and attributes, or guidelines for specific tasks. There is, e.g., XLink, which describes a standard way to add hyperlinks to an XML file. XPointer and XFragments are syntaxes for pointing to parts of an XML document. (An XPointer is a bit like a URL, but instead of pointing to documents on the Web, it points to pieces of data inside an XML file.) CSS, the style sheet language, is applicable to XML as it is to HTML. XSL is the advanced language for expressing style sheets. The DOM is a standard set of function calls for manipulating XML (and HTML) files from a programming language. XML Namespaces is a specification that describes how you can associate a URL with every single tag and attribute in an XML document. What that URL is used for is up to the application that reads the URL, though. XML Schemas help developers to precisely define their own XML-based formats. There are several more modules and tools available or under development.
• XML provides structured and integrated data. XML is ideal for large and complex data like ETDs because the data is structured. It not only lets you specify a vocabulary that defines the elements in the document; it also allows you to specify relations between the elements.
Documents are often supplemented with metadata (that is, data about data). If such metadata were included inside an ETD, it would make the ETD self-describing. XML can encode such metadata. However, on the downside, XML comes with its own bag of discomforts:
1. Conversion from word processing formats to XML requires more planning in advance, different tools, and broader learning about processing concepts than is required for PDF.
2. There are many fewer people knowledgeable about these matters, and the tools that support this conversion are less mature and more expensive. Also, the process of converting may be complicated, difficult and time consuming.
3. Writing directly in XML by using XML authoring tools requires some prior knowledge of XML.
4. XML is also very strict regarding the naming and ordering of tags, and it is case sensitive, illustrating the relative effort required by students to prepare ETDs in this form.

Process of Creating an XML document

XML documents have a four-stage life cycle.

XML documents are mostly created using an editor. It may be a basic text editor like Notepad or vi; we may even use WYSIWYG editors. The XML parser reads the document and converts it into a tree of elements. The parser passes the tree to the browser, which displays it. It is important to know that all these processes are independent and decoupled from each other.

Putting XML to work for ETDs

Before we jump into the XML details for ETDs, we should make certain things clear, since we will be using them on a regular basis from now on.

DTD (Document Type Definition):

An XML document primarily consists of a strictly nested hierarchy of elements with a single root. Elements can contain character data, child elements, or a mixture of both. The structure of the XML document is described in the DTD. There are different kinds of documents, like letters, poems, books, theses, etc. Each kind of document has its own structure. This specific structure is defined in a separate document called the Document Type Definition (DTD).

The DTD used here is based on XML, and it covers most of the basic HTML formatting tags as well as some specific tags from the Dublin Core metadata set. A DTD has been developed for ETDs, but the developed DTD is quite generic. If someone wants to use mathematical equations or incorporate chemical equations, it won't be sufficient. For that, we can incorporate MathML (Mathematical Markup Language) and/or CML (Chemical Markup Language). There are defined DTDs for these languages that we also have to use for our documents. But research on incorporating more than one DTD for different parts of a document is still going on.
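A minimal, hypothetical thesis-like DTD and a conforming instance might look like the sketch below. Note that Python's standard library parser checks only well-formedness; validating the instance against the DTD itself requires an external validator:

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal thesis-like DTD (illustration only; not the
# DTD of any actual ETD project).
dtd = """<!ELEMENT etd (title, author, chapter+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT chapter (#PCDATA)>"""

doc = """<etd>
<title>CT-Angiographie</title>
<author>Ludewig, Stefan</author>
<chapter>Introduction</chapter>
</etd>"""

# fromstring raises ParseError if the document is not well-formed;
# whether it is *valid* against the DTD is a separate, stricter check.
tree = ET.fromstring(doc)
print(tree.find("title").text)  # CT-Angiographie
```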

CSS (Cascading Style Sheets):

CSS is a flexible, cross-platform, standards-based language used to specify stylistic or presentational features applied throughout entire websites or web pages. In its most elegant form, a CSS style sheet is specified in a separate file and called from within the XML or HTML header area when the document loads into a CSS-enabled browser. Users can always turn off the author's styles and apply their own, or mix their important styles with the author's. This points to the "cascading" aspect of CSS.

CSS is based on rules and style sheets. A rule is a statement about one stylistic aspect of one or more elements. A style sheet is one or more rules that apply to a markup document.

An example of a simple style sheet is a sheet that consists of one rule. In the following example, we add a color to all first-level headings (H1). Here's the line of code - the rule - that we add:

H1 {color: red}

XSL (the eXtensible Stylesheet Language):

XSL is a language for expressing stylesheets. It consists of two parts:

1. A language for transforming XML documents, and
2. An XML vocabulary for specifying formatting semantics.

If you don't understand the meaning of this, think of XSL as a language that can transform XML into HTML; a language that can filter and sort XML data; a language that can address parts of an XML document; a language that can format XML data based on the data value, like displaying negative numbers in red; and a language that can output XML data to different devices, like screen, paper or voice. XSL is developed by the W3C XSL Working Group, whose charter is to develop the next version of XSL.
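As a rough illustration (hypothetical element names, not the DTD of any actual ETD project), an XSLT 1.0 stylesheet that renders a thesis-like document as HTML might look like:

```xml
<?xml version="1.0"?>
<!-- Sketch only: render the title and chapter list of a hypothetical
     <etd> document as an HTML page. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/etd">
    <html><body>
      <h1><xsl:value-of select="title"/></h1>
      <ul>
        <xsl:for-each select="chapter">
          <li><xsl:value-of select="."/></li>
        </xsl:for-each>
      </ul>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
```

An XSLT processor applies the template to the source document and emits the HTML; the same source could be styled differently for print or voice output by swapping in another stylesheet.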

Because XML does not use predefined tags (we can use any tags we want), the meanings of these tags are not understood: <table> could mean an HTML table or maybe a piece of furniture. Because of the nature of XML, the browser does not know how to display an XML document. In order to display XML documents, it is necessary to have a mechanism to describe how the document should be displayed. One of these mechanisms is CSS, as discussed above, but XSL is the preferred style sheet language of XML, and XSL is far more sophisticated and powerful than the CSS used by HTML.

XML Namespaces

The purpose of XML namespaces is to distinguish between duplicate element type and attribute names. Such duplication might occur, for example, in an XSLT stylesheet or in a document that contains element types and attributes from two different DTDs. An XML namespace is a collection of element type and attribute names. The namespace is identified by a unique name, which is a URI. Thus, any element type or attribute name in an XML namespace can be uniquely identified by a two-part name: the name of its XML namespace and its local name. This two-part naming system is the only function of XML namespaces.
• XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace. The declaration is in scope for the element containing the attribute and all its descendants. For example, the code below declares two XML namespaces. Their scope is the A and B elements:
<A xmlns:foo="http://www.foo.org/" xmlns="http://www.bar.org/"><B>abcd</B></A>
• If an XML namespace declaration contains a prefix, you refer to element type and attribute names in that namespace with the prefix. For example, the code below declares A and B in the http://www.foo.org/ namespace, which is associated with the foo prefix:
<foo:A xmlns:foo="http://www.foo.org/"> <foo:B>abcd</foo:B> </foo:A>
• If an XML namespace declaration does not contain a prefix, the namespace is the default XML namespace and you refer to element type names in that namespace without a prefix. For example, the code below is the same as the previous example but uses a default namespace instead of the foo prefix:
<A xmlns="http://www.foo.org/"><B>abcd</B></A>
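As an illustration of the two-part naming described above (not part of the original examples), a small Python script shows how an XML parser resolves namespace declarations into universal names of the form {namespace-URI}local-name:

```python
# An XML parser expands each prefixed or default-namespace name into
# its two-part form "{namespace-URI}local-name".
import xml.etree.ElementTree as ET

doc = '<foo:A xmlns:foo="http://www.foo.org/"><foo:B>abcd</foo:B></foo:A>'
root = ET.fromstring(doc)

print(root.tag)               # {http://www.foo.org/}A
child = root[0]
print(child.tag)              # {http://www.foo.org/}B
print(child.text)             # abcd

# A default namespace (no prefix) yields the same universal names:
doc2 = '<A xmlns="http://www.foo.org/"><B>abcd</B></A>'
root2 = ET.fromstring(doc2)
print(root2.tag == root.tag)  # True
```

Note that the prefix itself (foo) disappears after parsing: only the namespace URI and the local name matter.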
Glossary

• tokenized attribute type: Each attribute has an attribute type. Seven attribute types are characterized as tokenized: ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, and NMTOKENS.
• Uniform Resource Identifier (URI): The generic set of all names and addresses that refer to resources, including URLs and URNs. Defined in Berners-Lee, T., R. Fielding, and L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax and Semantics, 1997. See updates to the W3C document RFC1738. The Layman-Bray proposal for namespaces makes every element name subordinate to a URI, which would ensure that element names are always unambiguous.
• Uniform Resource Locator (URL): The set of URI schemes that have explicit instructions on how to access the resource on the Internet.
• Uniform Resource Name (URN): A Uniform Resource Name identifies a persistent Internet resource.
• valid XML: XML that conforms to the vocabulary specified in a DTD or schema.
• W3C: World Wide Web Consortium.
• well-formed XML: XML that meets the requirements listed in the W3C Recommendation for XML 1.0: it contains one or more elements; it has a single document element, with any other elements properly nested under it; and each of the parsed entities referenced directly or indirectly within the document is well-formed. A well-formed XML document does not necessarily include a DTD.
• World Wide Web Consortium (W3C): The international consortium founded in 1994 to develop standards for the Web.
• XLL: Extensible Linking Language.
• XML: Extensible Markup Language.
• XML declaration: The first line of an XML file can optionally contain the "xml" processing instruction, which is known as the XML declaration. The XML declaration can contain pseudo-attributes to indicate the XML language version, the character set, and whether the document can be used as a standalone entity.
• XML document: A data object that is well-formed, according to the XML recommendation, and that might (or might not) be valid. The XML document has a logical structure (composed of declarations, elements, comments, character references, and processing instructions) and a physical structure (composed of entities, starting with the root, or document entity).
• XML parser: A generalized XML parser reads XML files and generates a hierarchically structured tree, then hands off data to viewers and other applications for processing. A validating XML parser also checks the XML syntax and reports errors.
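To make the distinction between well-formed and malformed XML concrete, here is a small illustrative check (the sample documents are invented for this sketch): a parser accepts properly nested elements and rejects mismatched tags.

```python
# Well-formedness in practice: a non-validating parser builds a tree
# from properly nested XML and raises an error on mismatched tags.
import xml.etree.ElementTree as ET

good = "<doc><title>ETD Guide</title></doc>"
bad = "<doc><title>ETD Guide</doc></title>"  # tags close out of order

ET.fromstring(good)  # parses without error

try:
    ET.fromstring(bad)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```

Validity is a stronger property: a validating parser would additionally check the tree against a DTD or schema, which this sketch does not do.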

Next Section: Multimedia

# Technical Issues/Multimedia

Searching through works with multimedia content requires extra support beyond what is usually provided (e.g., searching through metadata or full text). Content-based image retrieval (CBIR), searching in files of spoken language, and searching in video collections are all supported with special methods and systems.

While most ETD collections today allow search for multimedia content based on descriptions of such content (i.e., metadata), in the future, as collections grow and have richer assortments of multimedia elements, it is likely that multimedia database and multimedia content search software will be more widely deployed. Indeed, Virginia Tech carried out some preliminary work in this regard with its ETD collection, using tools provided by IBM (e.g., QBIC, VideoCharger). Other vendors and systems exist, and can be used for searching local, regional, national, and international collections of ETDs.
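To give a flavor of what content-based search involves, here is a minimal sketch of CBIR by color histogram, one of the simplest techniques behind systems such as QBIC. The image names and pixel lists are invented stand-ins; a real system would decode image files and use far richer features.

```python
# Minimal content-based image retrieval sketch: images are compared by
# the distance between their grey-level histograms.

def histogram(pixels, bins=4):
    """Quantize 0-255 grey values into a normalized histogram."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def distance(h1, h2):
    """L1 distance between two histograms (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def rank(query_pixels, collection):
    """Return image names ordered from most to least similar."""
    q = histogram(query_pixels)
    return sorted(collection,
                  key=lambda name: distance(q, histogram(collection[name])))

collection = {
    "dark_figure.png":  [10, 20, 30, 15, 25, 5],
    "bright_chart.png": [240, 250, 230, 245, 235, 255],
    "mid_photo.png":    [120, 130, 110, 140, 125, 135],
}
# A dark query image should match the dark figure first.
print(rank([12, 18, 28, 22, 9, 14], collection)[0])  # dark_figure.png
```

Production CBIR systems refine this idea with color spaces, texture and shape features, and indexing structures that avoid comparing the query against every image.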

Next Section: Interfaces

# Technical Issues/Interfaces

Information retrieval (IR) systems have been improving since the 1950s. One of the most important areas of advance in the 1990s and in the 21st century benefits from the rapid enhancement of human-computer interaction (HCI) that results from new types of interfaces, new methods of interaction, and new integration of those activities with IR methods. Information visualization, for example, allows faster analysis of the contents of a collection of ETDs, or of the result set from a search.

Next Section: Training the Trainers

# Training the Trainers

We believe that all universities should have ETD programs. Clearly that will be the case, when one considers the situation in the long term (e.g., 10 years). Why not join now, so that students over the next decade can benefit? Why not participate, so that the research of universities is more widely shared? Why not develop local infrastructure, so that students are properly prepared for the Information Age, and so that both students and the university save money?

Training efforts should aim to make clear the benefits of working with ETDs. They should assuage concerns, and make obvious how changes can be made in a smooth transition from current practice.

Next Section: Initiatives to support ETD projects in Latin America

# Training the Trainers/Initiatives to support ETD projects in Latin America

Abstract: This section focuses on the outreach work of the Ibero-American Science & Technology Education Consortium (ISTEC) and selected other organizations in developing ETD projects in Latin America. Training for librarians and for ETD trainers is described.

Many Latin American universities have digital library projects, some of which include electronic theses. However, standards and consistency may be lacking in local ETD initiatives. The Ibero-American Science & Technology Education Consortium (ISTEC) and its partners have been creating learning opportunities and instigating local projects in digital libraries and ETDs. This section describes ISTEC's outreach process and progress with regard to ETD projects for the science and technology libraries that are members of the organization.

Overview of ISTEC

ISTEC is a non-profit organization comprised of educational, research, and industrial institutions throughout the Americas and the Iberian Peninsula. The Consortium was established in September 1990 to foster scientific, engineering, and technology education, joint international research and development efforts among its members and to provide a cost-effective vehicle for the application of technology.

With start-up funding from the State of New Mexico and selected IT companies, the ISTEC board created four initiatives to address obstacles to IT developments and to encourage IT manpower development. These are:

1. The ACE Initiative champions continuing engineering and computer sciences education projects. The most important goals are to upgrade human resources and curriculum development through training and non-traditional exchange programs. The methodology involves on-site training, web-based education, video courses, satellite delivery, and "sandwich" graduate programs. The latter brings graduate students from Ibero-America together with experts from ISTEC member organizations to ensure excellence.
2. The Research and Development (R&D) Initiative focuses on the development and enhancement of laboratory infrastructure at member organizations. The major goal is the design and installation of modular, flexible, and expandable laboratory facilities for education, training, and R&D with links to the private sector.
4. The Los Libertadores Initiative champions networks of excellence in the region. The main goal is to network Centers of Excellence equipped with the latest telecommunications and computer technology to provide real-time access to a world-wide system of expertise and knowledge. This requires partnerships among industries and governments to create an Ibero-American academic and R&D Internet backbone.
5. The Library Linkages Initiative (LibLINK) is ISTEC's information creation, management, and sharing project. Below is a description of LibLINK efforts in developing digital library projects in Latin America, especially in the area of ETDs.

The major goal of LibLINK is to design and implement innovative, international Science and Technology (S&T) information-sharing services. The annual compound growth rate of the Rapid Document Delivery (RDD) project has been hovering around 200% since 1995. Over 27 libraries in 19 countries are connected in real-time and documents are provided using the Ariel® software. The RDD project, although the most popular service, is a foundation for the more important digital library initiatives which were started in 1998. The projects within LibLINK can be categorized as follows:

• Connecting libraries for information transfer. This is accomplished through opening S&T library collections - especially Latin American collections - to scholars through regional networks created to complement the LibLINK document delivery services. Currently these include LigDoc in Brazil, PrEBi in Argentina, REBIDIMEX in Mexico, and most recently, a cooperative group of libraries in Colombia.
• Training librarians and researchers in digital library concepts.
• Working with the Networked Digital Library of Theses and Dissertations (NDLTD) initiative at Virginia Tech to expand the concept in Latin America. The LibLINK initiative seeks to promote easy access to scientific information in the region, especially to theses and dissertations of master's and doctoral candidates, as this is our member organizations' most important intellectual property.
• Advancing and piloting new types of scholarly communication by actively supporting new publishing efforts such as the NDLTD and the Open Archives initiatives.

LibLINK volunteers plan and carry out workshops and mini-conferences to facilitate the above. Funding generally comes from grants provided by organizations such as the US National Science Foundation (NSF), national science councils such as CONACyT in Mexico, and regional organizations such as the OAS and UNESCO.

LibLINK and ETDs in Latin America

We have refined a process for involving librarians and computer scientists in digital library projects that has proven successful. The principles on which ISTEC and LibLINK base their outreach efforts are:

• Establishing the capacity of libraries and library staff for participating in digital projects, through site visits.
• Participating in regional or local IT and computer science workshops to identify computer scientists working on digital library projects or components thereof. We are especially interested in initiatives created in isolation from each other and from their local libraries.
• Identifying digital library initiatives and researchers in this way, so that a DL group consisting of librarians and computer scientists/engineers can be established from the findings. Outcome: a critical mass of computer scientists and librarians linked to each other and to ISTEC.
• With this groundwork done, planning and finding funding sources for a Digital Library Workshop, which generally has two major aims:
• to share information about current DL initiatives in that specific country or region;
• to provide training in ETDs as the preferred first DL project.
• In some cases, such a workshop is also an opportunity to create strategic plans for coordinated national projects.

This process has resulted in the following ETD and DL workshops:

• An NSF/CONACyT digital library workshop that included NDLTD training by Ed Fox, in Albuquerque, NM, July 7–9, 1999. Funding was obtained from various sources, mainly the National Science Foundation, CONACyT (the Mexican Science Council), and the Organization of American States.
• A successful DL conference in Costa Rica for Central American countries (Seminario / Taller Subregional sobre Bibliotecas Digitales). A full-day ETD workshop was delivered by Ed Fox, followed by a day of digital leadership training and planning for local ETD projects. Funding was provided by the Organization of American States (OAS), the US Ambassador to Costa Rica, and ISTEC. San Jose, Costa Rica, November 1999 (more detail below).
• 1st Course for the Training of Project Directors for Electronic Thesis and Dissertation Projects. The course was organized by UNESCO, the VII CYTED, Universidad de los Andes, and ISTEC's Library Linkages Initiative, and held in conjunction with the VII Jornadas Iberoamericanas de Informatica, Cartagena de Indias, Colombia, from August 30 to September 1, 2000. This is an example of how ETD training can be piggybacked on a significant regional event, creating synergy and providing more value for the money spent to attend.
• A 2nd Training Course for Directors of ETD Projects was funded by UNESCO (Montevideo), the Asociación de Universidades Grupo Montevideo (AUGM), and ISTEC. This "Train the Trainer" course was held in Montevideo, December 7–9, 2000. The main goal of this series of courses is to create a group of specialists responsible for the dissemination and management of electronic dissertations and theses. The trainer for both these sessions was Ana Pavani from the Pontificia Universidade Catolica do Rio de Janeiro.
• REBIDIMEX is the Mexican arm of ISTEC's Library Linkages initiative. This group has held a number of meetings to develop coordinated digital library and ETD projects in Mexico. The digital theses project at the library of the Universidad de las Américas-Puebla is an excellent example, http://biblio.udlap.mx/tesis/, and forms the basis for a national Mexican ETD project.

Case Study:

The Primer Seminario-Taller Subregional sobre Bibliotecas Digitales, sponsored by the OAS and ISTEC at the Universidad de Costa Rica, San Jose, Costa Rica, mentioned above, provides a good case study of ETD outreach events. Each participating Central American country was asked to identify universities with sufficient technological infrastructure to support a digital library/ETD project. Then each organization was funded to send representatives from each of its computer systems and library groups. The agenda focused on providing one whole day of basic training by Ed Fox (a co-author of this Guide) in digital library and ETD concepts, followed by another day of leadership training for digital environments and a planning session.

During this portion the groups identified a project that all could participate in. They chose the digitization of their organizations' theses and dissertations, making them available through the Open Archives system using the processes developed by Virginia Tech and others described in other sections of the Guide. Regional working groups were assigned. The most important outcome, however, was the creation of a network of librarians and computer scientists who understand ETD technological, operational, and political issues and who now have contacts for joint projects in the region.

What next?

The model of creating synergism and connections between librarians and computer scientists and focusing their energies on basic digital library/ETD projects will continue to be replicated in other parts of Latin America. Ana Pavani (a section author of the Guide) continues to deliver ETD courses with the help of ISTEC, such as one in Pernambuco, Brazil, in the spring of 2001. Members from ISTEC's regional and executive offices regularly speak at conferences regarding ETDs and are available to help arrange ETD events. Organizing training events and developing joint funding arrangements takes a lot of time, expertise, local contacts, and effort, but is critical for creating opportunities in under-served countries.

ISTEC and its partner organizations, the OAS, International Development Bank, UNESCO, etc., continue to work together to offer regional digital library workshops in Latin America. UNESCO is formulating an international strategy for creating and disseminating electronic theses and dissertations that will support Latin American outreach efforts (UNESCO, 1999). As well, we are assisting governments to draft suitable policies to improve access to information, especially in making their universities’ intellectual property (theses and dissertations) widely available to publicize the universities’ research strengths. ISTEC is also sponsoring the Spanish translation of the Guide and will disseminate it through the ISTEC Science & Technology Portal.

Other Latin American Projects

The Latin American Network Information Center (LANIC) at the University of Texas at Austin is the most comprehensive resource for academic and economic information on Latin America (http://lanic.utexas.edu/), but not specifically for ETD projects. Some beginning and mature ETD projects can be found at:

• The Library System of the Universidad Católica de Valparaíso (Chile). http://biblioteca.ucv.cl/tesis_digitales/
• The Digital Technology Research and Development Center (CITEDI) of the Instituto Politecnico Nacional in Tijuana, México. http://www.citedi.net/docs/tesis.htm
• The UNESP site at the Universidade Estadual Paulista in Brazil. http://www.cgb.unesp.br/etheses/
• The digital theses project at the library of the Universidad de las Américas-Puebla, Mexico. http://biblio.udlap.mx/tesis/
• University of Antioquia (Medellin, Colombia)
• University of São Paulo (Brazil). http://www.usp.teses.br/
• In Chile, the Universidad de Chile has been developing Cyberthesis since November 1999. This is an electronic theses production project with support from UNESCO and the cooperation of the Université de Lyon and the Université de Montréal. The Information Service and Library System, SISIB, is coordinating the Electronic Theses and Dissertations Project (Cybertesis), applying the production process developed by the Université de Montréal, based on the conversion of texts to SGML/XML. In 2002, the Universidad de Chile will organize training workshops in the production of ETDs (structured documents) for other Chilean universities.

Associations

The Transborder Library Forum/FORO Transfronterizo de Bibliotecas shares many of the aims of ISTEC's Library Linkages initiative. Its meetings have been held annually since 1991 to work on ways to improve communications relating to border issues and to foster professional networking among librarians from Mexico and the United States. Recently, Canadians and representatives from Latin American libraries also interested in NAFTA and border issues began to have a presence. At the 10th Transborder Library FORO, held in Albuquerque, New Mexico, March 23–25, 2000, the attendance of several representatives from Latin American libraries was sponsored, and ISTEC's LibLINK project provided a training session and talks about the REBIDIMEX initiative in Mexico.

The Association of Latin American and Caribbean Academic Libraries (Bibliotecas Universitarias da America Latina e do Caribe) sponsored a LibLINK workshop that included ETD discussions and talks at its annual meeting in Florianopolis, Brazil, in April 2000. ISTEC continues to have a presence at these conferences and is involved with the Brazilian ETD initiatives.

The impact of bandwidth and infrastructure issues on ETD outreach in Latin America

Bandwidth and IT infrastructure are important factors for ETD project developers in Latin America. IT policy development is another. More aggressive action is needed in both the governmental and industrial sectors. ISTEC emphasizes these issues at the "IT Challenge" conferences by bringing industrial and academic members together with regional decision makers, such as ministers of education and technology and representatives from national science councils.

The first high-performance Internet link between North and South America for research and education was inaugurated in Santiago, Chile on September 12, 2000. Chile and the USA connected their respective high-performance networks, REUNA* and Internet2, enabling collaboration among researchers and educators at universities in the two countries. Such high-performance network links are critical to ensure the bandwidth required for future format-enriched ETD projects. ISTEC and its partners are strongly committed to advancing this cause.

* REUNA is a collaboration among the national universities of Chile that introduced the Internet in Chile in 1992. REUNA's high-speed network, REUNA2, is a 155 Mbps ATM network spanning the country. The National University Network is a non-profit consortium of 19 leading Chilean universities plus the National Commission for Science and Technology. Its mission is the creation and development of networks and services in IT aimed at supporting participation in the Information Society.

Conclusion:

The Library Linkages initiative of ISTEC has found a methodology for encouraging and supporting ETD developments in Latin America that has proven successful. The most important step is to identify local players in digital library initiatives in both libraries and computer science and computer engineering departments. The next step brings these players together at events that provide opportunities for training, information sharing, and national/regional ETD project planning. ISTEC monitors subsequent developments and provides support to keep projects going as appropriate. The most important strategic outcome we are aiming for is to create an open archive structure that will provide access to all science and technology theses and dissertations of member organizations through the ISTEC Portal. We believe that this will be a rich source of innovation, manpower identification and development, and an opportunity to highlight the intellectual property of our member universities.

REFERENCES:

UNESCO (1999). Workshop on an international project of electronic dissemination of theses and dissertations, UNESCO, Paris, September 27–28, 1999. http://www.unesco.org/webworld/etd/


Next Section: Tool kits for trainers

# Training the Trainers/Tool kits for trainers

This section of the Guide outlines the approaches to student training used by certain universities. For the most part, links are made to sites where the tools themselves can be consulted and eventually used or adapted for local use.

Next Section: Identifying what is available

# Training the Trainers/Identifying what is available

The training offered to doctoral students registered at the Université de Montréal, entitled "Helpful tools for writing a thesis" ("Outils d’aide à la rédaction d’une thèse"), is part of that institution's programme for electronically distributing theses. The training seeks to respond to two sorts of needs: (1) the needs related to the processing of theses (greatly aided by the adequate use of a normalized document template); and (2) the needs of the doctoral students (who want to increase their capabilities in using writing tools, and, in the process, their level of productivity and the quality of their production).

The implementation of a programme of electronic thesis distribution affects the entire institution. At the Université de Montréal, three units are actively involved: the Faculty of Graduate Studies, the Library, and the Information technology services (Direction générale des technologies de l’information et de la communication—DGTIC, which holds the mandate to process and distribute theses in electronic formats). The training of doctoral students is planned and provided by individuals drawn from these three units. For the sessions held in September 2001, four people acted as trainers (one from Graduate Studies, one from the Libraries, and two from the DGTIC).

The content of the training is as follows:

1. Welcome and general presentation of the Programme for the electronic publication and distribution of theses (DGTIC)
2. Presentation of the Université de Montréal’s Style Guide for masters and doctoral theses (Graduate Studies).
3. On-campus services offered to thesis-writing students: equipment rentals, self-service digitizer, digital camera, etc. (DGTIC).
4. Presentation of the capabilities of the EndNote software for managing bibliographic referencing (Library)
5. Practical exercises with the Word document template to be used by the students.

Points 1 to 4 in this general plan involved presentations given by trainers, while point 5 directly involved participants in concrete exercises, accompanied by several demonstrations. For these exercises, the students used a working text (a Word document containing the principal editorial elements of a thesis, but without a proper layout), a Word document template developed for theses, and a list of instructions. The exercises essentially consisted of applying the template's styles to the working text, producing a screen capture and inserting it into the text, and automatically producing a table of contents and an index of tables.
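The idea behind the last exercise, deriving a table of contents automatically from styled headings, can be sketched outside Word as follows (the style names and sample document are invented for this illustration):

```python
# Sketch of automatic table-of-contents generation: paragraphs carry a
# (style, text) pair, as they would in a styled document template, and
# heading paragraphs are numbered hierarchically.

def table_of_contents(paragraphs):
    """Collect and number Heading1/Heading2 paragraphs."""
    toc, h1, h2 = [], 0, 0
    for style, text in paragraphs:
        if style == "Heading1":
            h1 += 1
            h2 = 0
            toc.append(f"{h1} {text}")
        elif style == "Heading2":
            h2 += 1
            toc.append(f"  {h1}.{h2} {text}")
        # body paragraphs are ignored
    return toc

doc = [
    ("Heading1", "Introduction"),
    ("Body", "This thesis examines..."),
    ("Heading2", "Motivation"),
    ("Heading1", "Methods"),
]
print(table_of_contents(doc))
# ['1 Introduction', '  1.1 Motivation', '2 Methods']
```

This is exactly why consistent use of the template's styles matters: the automatic table of contents can only see paragraphs that carry a heading style.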

As well, several documents were provided to participants at the sessions: the instructions, the document template (Word style sheets) as well as training guides and informative documents of interest to the students. All these documents are available on-line at www.theses.umontreal.ca.

The workshops are offered to all doctoral students, whether they are at the beginning of their research or near the end of their writing. To reach the largest possible number of students, we used a multidimensional communications plan: posters, advertising, messages to student associations, and letters from the Dean of Graduate Studies to the heads of departments and research centres. The number of seats in the training laboratory is limited to 15 to 20 students per session, depending on the laboratory used. Students are invited to register in advance, either by telephoning one of the DGTIC's administrative assistants, or, in an autonomous manner, by using an interactive Web form managed by a CGI script. The form tells students how many participants have already registered for each session and the maximum number of seats available. When the maximum number of registrations is reached, the script no longer allows registrations. Nevertheless, an invitation to register on a "waiting list" allows students to signal their interest and to be rapidly informed when new workshops are held. This also allowed us to confirm the relevance of the training: the available spaces were quickly filled, and the waiting list allowed us to reach interested students for the next sessions.
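The registration logic described above, seats capped per session and overflow routed to a waiting list, can be sketched as follows. The class, session name, and capacity are illustrative; the Université de Montréal's actual system was a Web form backed by a CGI script.

```python
# Sketch of capacity-limited workshop registration with a waiting list.

class Session:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.registered = []
        self.waiting = []

    def seats_left(self):
        return self.capacity - len(self.registered)

    def register(self, student):
        """Register while seats remain; afterwards, add to waiting list."""
        if self.seats_left() > 0:
            self.registered.append(student)
            return "registered"
        self.waiting.append(student)
        return "waiting list"

session = Session("Outils d'aide à la rédaction", capacity=2)
print(session.register("alice"))  # registered
print(session.register("bob"))    # registered
print(session.register("carol"))  # waiting list
print(session.seats_left())       # 0
```

The waiting list serves a second purpose noted in the text: its length is a direct measure of unmet demand, useful when deciding whether to schedule additional sessions.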

Next Section: Demonstrations, explanations

# Training the Trainers/Demonstrations, explanations

Many technological choices, such as encoding formats, computing tools, dissemination tools, and copyright arrangements, have an important impact on an ETD project's success. An overview is needed of the different aspects of producing, disseminating, and archiving theses in electronic formats, as well as of implementing and managing ETD projects.

At the present time several universities have initiated individual or cooperative projects for publishing and distributing theses, with different strategic approaches: NDLTD (Networked Digital Library of Theses and Dissertations) with 103 member universities, Cyberthèses (Université de Montréal, Université de Lyon, Universidad de Chile, and Université de Genève, with about 10 member universities), the Australian Digital Theses Program (8 Australian universities), and Dissertationen Online (5 German universities).

Next Section: Initiatives and Projects

# Training the Trainers/Initiatives and Projects

Collectives:

1. (http://elfikom.physik.uni-oldenburg.de/dissonline/PhysDis/dis_europe.html) PhysDis - a large collection of Physics Theses of Universities across Europe
2. (http://www.iwi-iuk.org/dienste/TheO/) TheO - a collection of theses from different fields at 43 universities in Germany, insofar as the theses contain metadata.
3. (http://MathNet.preprints.org/) MPRESS - a large collection of European mathematical theses, which also contains the next entry as a subset.
4. (http://mathdoc.ujf-grenoble.fr/Harvest/brokers/prepub/query.html#math-prepub) Index nationaux prépublications, thèses et habilitations - a collection of theses in France in Mathematics

Next Section: Guidelines and Tutorials for ETDs

# Training the Trainers/Guidelines and Tutorials for ETDs

The inclusion of courses and tutorials with online demonstrations can facilitate a better understanding of the entire production and submission process. With a comprehensive guide for ETD creation, students can learn how to create an ETD in word processing software, how to convert it to PDF format, and how to follow the submission procedure.

Next Section: Specific Guidelines

# Training the Trainers/Specific Guidelines

This section includes a series of tools intended to support the work of producing and submitting ETDs.

Writing in word processing systems: ETD Formatting

• Humboldt University of Berlin Word-style (for German Word Versions only)
• Université de Lyon
• Université de Montréal
• University of South Florida
• Virginia Tech

ETD Submission

Prepare a PDF document

Writing SGML/XML

• Sites with SGML/XML-Theses and Dissertations
• Helsinki University of Technology
• Humboldt-University Berlin
• Université de Montreal
• Université de Lyon 2
• University of Iowa
• University of Michigan at Ann Arbor
• University of Oslo
• Swedish University of Agricultural Sciences Uppsala

# Training the Trainers/Creating an online database of problem solving solutions

Some quite important documents must be at users' disposal together with the various tools implemented for the electronic edition of scientific documents:

• User’s guide for the preparation of the documents
• User’s guide for the conversion itself
• Pedagogical toolkit for the training of authors...

This initial documentation, as complete as it may be, is nevertheless insufficient. It must be supplemented by the online publication of a database listing the problems usually encountered and their solutions (an FAQ), possibly completed with links to specific documents containing examples.

Most of the time, such a database cannot be completely defined a priori and will have to be enriched as the number of users grows and their mastery of the various tools increases.

The questions and problems can be collected very efficiently through discussion lists or forums open to the users. It is advisable to group frequent questions or problems by subject, to offer easier and friendlier consultation.

Two methods can be considered for constructing the database of solutions:

• A synthesis based on the discussion list, prepared by "experts" who supply the database with questions and answers. The information made available to users is thus validated.
• The content of the forum indexed directly. The information is not validated, so users themselves must sort through the list of answers returned for their query. While this second method is lighter in terms of administration and management, its efficiency depends on the quality of the posted messages (questions and answers): a precise and clear subject line, a good presentation of the problem, a well-argued answer, etc.

Whichever solution is adopted, its implementation can rest on standard free mailing-list and database server software.

The online FAQ so constructed must still display the name and e-mail address of a member of the technical staff, or a visible link to the discussion list, so that questions not covered by the database can be answered.

As an example, CyberThèses offers a forum built on a MySQL database and interfaced with PHP, and will develop a second database from validated information.
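The core of such an FAQ database, entries grouped by subject, flagged as validated or not, and searchable by keyword, can be sketched as follows. SQLite stands in here for the MySQL/PHP stack mentioned above, and the table layout and sample entries are invented for this illustration.

```python
# Sketch of a problem-solving FAQ database: validated question/answer
# pairs, grouped by subject, retrieved by keyword search.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE faq (
    subject TEXT, question TEXT, answer TEXT, validated INTEGER)""")
entries = [
    ("templates", "How do I apply a style?", "Use the Styles menu.", 1),
    ("conversion", "PDF fonts look wrong?", "Embed all fonts.", 1),
    ("conversion", "Images missing after export?", "Check link paths.", 0),
]
conn.executemany("INSERT INTO faq VALUES (?, ?, ?, ?)", entries)

def search(keyword):
    """Return only validated Q/A pairs whose question mentions the keyword."""
    rows = conn.execute(
        "SELECT question, answer FROM faq "
        "WHERE validated = 1 AND question LIKE ?", (f"%{keyword}%",))
    return rows.fetchall()

print(search("PDF"))  # [('PDF fonts look wrong?', 'Embed all fonts.')]
```

The `validated` flag models the first method described above: forum messages enter the table unvalidated, and only expert-reviewed entries are served to users.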

Next Section: Help develop a broad local team

# Training the Trainers/Help develop a broad local team

The core of the ADT Program was developed at the University of New South Wales Library (UNSW Library) as the lead institution. The team at UNSW Library included the overall coordinator and designer, the technical manager and programmer, a metadata consultant, and the web coordinator and designer. This team took the model from concept to reality. The most important things were not to lose focus, to keep the model as close to the original project proposal as possible, and not to over-complicate processes - in fact, to keep them as simple as possible. It was seen as critical to develop a workable model for the project partners to test and further refine.

During the development and testing process, all 7 original partners were consulted and had input at all stages. Two workshops for all 7 members were held approximately 12 months apart. These were used to discuss all aspects of the ADT model as well as to agree on the standards and protocols used. The agreed standards are at the core of the distributed model of the ADT Program. Without the involvement of an effective team to lead the process, and the effective input of the broader team, arriving at the desired outcome would have been very difficult.

The ADT Program is now working effectively across all 7 original member sites. Membership of the ADT has now been opened up to all Australian universities. It is anticipated that all Australian universities will become members and thus form a comprehensive national program. The benefits of a broad membership team are many: shared infrastructure, shared software development, shared metadata, shared documentation and training, as well as the shared satisfaction that comes with effective collaboration for the common good. Membership of the ADT may in time also extend to others in the region, such as New Zealand.

Next Section: Standards, cooperation, and collaboration

# Training the Trainers/Standards, cooperation, and collaboration

While each institution will differ in how its procedures are implemented and its ETDs are presented to satisfy local needs, adherence to basic standards is essential if the potential for optimal global dissemination of ETDs is to be realized. These standards relate to document formats and settings, filename protocols, and metadata. Agreement on a small set of standard metadata elements can facilitate harvesting for the creation of collaborative databases. If a minimum level of metadata is established, particularly if that metadata can be generated automatically, then any institution can access the document regardless of its resources and expertise, while individual institutions remain free to apply additional metadata, including subject or format schemas, as desired. This provides an entry point for retrieving an ETD from any institution without compromising the ability of other institutions to provide very rich metadata for their ETDs.

Cooperation and collaboration between institutions in the creation and dissemination of ETDs have a number of benefits. For the creation of ETDs, these include the use of software and procedures developed and tested by others, the sharing of generic training tools, the sharing of expertise in problem solving and developmental work, and the provision of mutual support. In disseminating the research findings contained in ETDs, collaborative approaches can be particularly effective in small or developing countries where the total volume of ETDs may not be large. In these situations a national or regional approach can provide increased visibility, economies of scale, and sharing of the resources required to mount and maintain ETDs.

There are a number of models for cooperation and collaboration in creation, dissemination and preservation of ETDs. While the models can vary in detail, major elements are:

• Shared infrastructure
In this model, a central agency provides the infrastructure for publication, dissemination, maintenance and preservation of ETDs. The central agency provides the server and the network access to a central repository of ETDs. Other institutions cooperate with the central agency by supplying copies of theses or dissertations. These copies could be supplied in digital form, conforming to the basic standards, or the central agency can be responsible for ensuring a suitable digital version is created and made available. The central agency has the responsibility for maintaining archival versions. This model could operate at a supranational, national or regional level, or could be used by a group of universities with similar interests.
• Shared software development
Sharing of generic software, which is easily installed and maintained, can be a cost effective way of establishing an ETD program. This can be of particular benefit for institutions where staff with highly developed IT skills are in short supply. Collaborative development or modification of new versions of software can also be very cost-effective.
• Shared metadata
In this model, institutions publish and maintain digital theses on their own institutional servers, but the metadata is harvested to produce a central database describing the theses of the collaborating institutions. The metadata is linked to the full text of the ETD wherever it resides. The home institution retains responsibility for the preservation of the ETD.
• Shared documentation and training tools
Sharing detailed documentation on all aspects of operating an ETD program is a very cost-effective method of collaboration. The development of generic procedures and training programs, which can be customized for local conditions, can facilitate the participation of institutions in an ETD program. Sharing documentation is also likely to reinforce the use of standards, which will ensure that ETDs are readily discoverable.

Next Section: Outreach/helping others

# Training the Trainers/Outreach/helping others

An effective mechanism for providing support is to identify centres of expertise which are prepared to assist others in setting up their ETD systems. These centres of expertise can be regional, national, or within an international geographic or linguistic region. The benefits flowing from these centres include assistance with designing processes, assistance with implementing software (especially if a common software product is being used), assistance in the use of metadata, and continuing support as systems evolve. Another very important function of a centre of expertise is to provide a focus around which a group can develop, offering self-support and an environment, attuned to local or regional conditions, for continuing discussion and development of ETD programs as the technology evolves. This latter is essential if ETD programs are to prosper.

The centre of expertise concept also has the great advantage of reducing significantly the learning curve in starting up an ETD program and consequently the timeframe and cost of new programs.

Next Section: Expanding ETD initiatives

# The Future/Expanding ETD initiatives

ETD initiatives will spread in many ways. Word of mouth, from universities with successful programs, is one of the most effective means. Word of mouth, from graduates of universities with ETD programs, will spread the idea further.

Further, as the worldwide collection of ETDs grows, more and more researchers, including graduate students, will make use of the valuable content. As use expands, others will be convinced to include their works in the collection, so it approaches full coverage.

Academics stand on a precipice separating our past, when genres of communication evolved slowly, and our future, when new genres emerge overnight. Our concepts of research, the authority of knowledge, and the shape of content are being radically challenged. We have difficulty imagining what dissertations or academic digital libraries will look like ten years from now. The shape of a dissertation is evolving from the first six-page, handwritten thesis at Yale University in 1860 into a form we cannot yet predict. Today's researchers and scholars are challenging the conventions of linear texts, one-inch margins, and texts written for extremely narrow audiences. They are integrating video, audio, animation, and graphics into their works. They are creating interactive elements, including real-time video, pivot tables, and online writing spaces.

The power of ETDs is rooted in access. No longer are theses and dissertations just an academic hurdle, a last step in the arduous process of graduate education. Instead, ETDs are a meaningful connection with significant readers. Collaborative authoring tools enable faculty to serve on dissertation committees at universities distant from their home campuses and, using tools such as NetMeeting, to mentor students from a distance. Rather than accepting that their research and scholarship will be read only by a select few (i.e., their committees), graduate students can now expect many readers.

Predicting the future of academic scholarship is a little like predicting the stock market: both are volatile and unpredictable. Even so, a number of emerging trends appear likely to affect our enterprise:

• Dissertations will matter more than they have in the past. Thanks to digital libraries, which increase access (http://scholar.lib.vt.edu/theses/data/somefacts.html) from one or two readers to sometimes more than 60,000, both students and universities will pay greater attention to the quality of students' research and scholarly writing.
• Progressive universities will use their digital libraries of ETDs to market their programs, and universities will provide the resources students need to write multimedia research.
• Multimedia documents will transform author-reader relations. Authors will interact synchronously with readers, create different reading paths for different readers, and use visuals, animation, and pivot tables.
• Students will increasingly search the worldwide digital libraries of ETDs, resulting in research that is more collaborative and more current.
• Across disciplines, students will provide links that clarify the significance, methodology, and findings of their work to a broader range of readers, including lay audiences, thereby helping the general public better understand the value of academic scholarship. As an example, students in the social sciences can incorporate video of cultures and primary subjects; they can create polyvocal case studies and ethnographies - that is, studies with alternative interpretations.
• Faculty members will work more collaboratively with students, resulting in more complete bibliographies and saved time.

Next Section: Managing technology changes

# The Future/Managing technology changes

The ease of distributed access accompanying the digital revolution has a darker side. Technological obsolescence and ephemeral formats have left little firm ground upon which to build the infrastructure necessary for the effective management and preservation of digital resources. An infrastructure supporting long-term access needs to be able to accommodate the continuous evolution of the technology as well as continuous streams of data. This means that a durable record of such work now includes a flexible and adaptable approach to maintaining live access.

Devices and processes used to record, store, and retrieve digital information now have a life cycle of two to five years, far shorter than that of even the most fragile physical materials they are meant to preserve. The practice known as 'refreshing' digital information by copying it onto new media is particularly vulnerable to problems of 'backward compatibility' and 'interoperability'. Software capable of emulating obsolete systems currently relies on the goodwill and enthusiasm of talented programmers but has little basis in economic reality.

The foundation of effective management of an ETD as a digital resource is laid at the point of creation. How a digital resource is created, and the form and format it is created in, will determine how it can be managed, used, preserved, and re-used at some future date. Authors, creators, and migrators of all digital resources need to be aware of how much their choices affect the life span, or term of access, of a digital resource.

Simple but practical strategies that might increase the life span of an ETD are as follows:

• Do not assume any stability in hardware or software, and try to record all the dependencies that support your ETD.
• Use non-proprietary formats (ASCII, Unicode, or plain text) as much as possible (e.g. Notepad or SimpleText).
• Use widely available formats, such as Word or PDF, that can be easily converted into plain text.
• Use widely available non-proprietary image formats such as JPEG or PNG rather than GIF (a proprietary format).
• If you must use custom or proprietary software, make sure you keep a record of the version number (e.g. RealMedia 7) and all the hardware dependencies (e.g. Mac System 7).

This information can be stored within the metadata associated with your ETD. For example, you could use the Dublin Core metadata element, 'DC.Relation.Requires'. As technologies evolve, the use of preservation metadata to record this kind of information is increasing. Check with your library to see what standard is currently in use.
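As a minimal sketch of recording such dependencies, the snippet below builds a small Dublin Core record, including 'DC.Relation.Requires', and renders it as HTML meta tags. The field values are invented examples, not a prescribed standard:

```python
# Hypothetical sketch: storing an ETD's technical dependencies in
# Dublin Core metadata, rendered as HTML <meta> tags. Values are
# invented examples for illustration.
from html import escape

record = {
    "DC.Title": "An Example Electronic Thesis",
    "DC.Creator": "A. Student",
    "DC.Format": "application/pdf",
    # Software/hardware needed to render the work:
    "DC.Relation.Requires": "RealMedia 7; Mac System 7",
}

meta_tags = "\n".join(
    f'<meta name="{escape(name)}" content="{escape(value)}">'
    for name, value in record.items()
)
print(meta_tags)
```

A record like this travels with the ETD, so future maintainers can identify the dependencies before the original environment disappears.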

Next Section: Interoperability

# The Future/Interoperability

At our various locations, document servers for electronic theses and dissertations have been set up independently of each other. The aim is to build a network of sites that allows worldwide retrieval within a heterogeneous knowledge base, independent of the physical location of the data. Users should not have to navigate and search the various servers separately; they should be provided with one retrieval interface that can link to all the different nodes of the network of ETD sites. This is the horizontal level on which retrieval could take place.

Another level, which we call the vertical level of the information portal, configures the retrieval interface so that the user retrieves only the relevant and desired information rather than everything that could possibly be provided. We wish to avoid the "Altavista effect" of information overload. Users should be able to search within specific subjects and for specific information structures. For example, they should be able to search just within the author field, the title field, or the abstract field, or for certain keywords or institutions. A highly sophisticated retrieval facility would allow a worldwide search within certain internal document structures, such as the bibliography.

For the scientific use of theses in the humanities and social sciences, as well as in the natural and technical sciences, it is necessary to offer not only bibliographical metadata and full text but also structural information for retrieval purposes, such as:

• captions of tables and graphs;
• special index terms (such as name, person, or location indexes);
• references (links) to external sources (printed resources as well as Web sources);
• the bibliography;
• references or footnotes within the work;
• definitions;
• mathematical / chemical formulas;
• theses / hypotheses.

These structural metadata are an integral part of the document and have to be defined by the author. At present, this predominantly takes place while formatting the text (e.g. headings, footnotes, etc.). For these structural data also to be usable for retrieval, they must be tagged as such by the author, using either a structured language like LaTeX or the style sheets of a word processor such as Microsoft Word.
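As a small illustration of such tagging, the LaTeX fragment below marks headings, index terms, footnotes, citations, formulas, and the bibliography as explicit structures rather than mere visual formatting. The commands are standard LaTeX; the document content itself is invented:

```latex
\documentclass{article}
\usepackage{makeidx}
\makeindex

\begin{document}
\section{Results}            % heading tagged as structure, not just layout
Our hypothesis\index{hypothesis} is supported by the
data\footnote{See Appendix A for the full data set.}
reported by Smith~\cite{smith2001}.

\[ E = mc^2 \]               % formula tagged as mathematics

\bibliographystyle{plain}
\bibliography{references}    % bibliography built from tagged entries
\printindex
\end{document}
```

Because each element is tagged, a conversion tool can extract the index terms, footnotes, citations, and formulas as separate retrieval fields.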

What could be the lowest common denominator for interoperability?

The first step within the development outlined above is to reach agreement on a common metadata set for theses and dissertations and to formulate guidelines on how to use it in ETD projects. See http://www.ndltd.org/standards/metadata/ for the Dublin Core metadata set proposed by the NDLTD. Those guidelines could be supported by additional free software tools that allow library staff to create the necessary metadata set without needing technical knowledge of the actual encoding in HTML or XML/RDF. Such a "metamaker" has been developed for the German ETD projects and could be translated into English, French, Spanish, and Portuguese. MySQL or other free software can be used as the underlying database system.

The Open Archives Specification: A chance for metadata interoperability

During the last two years, one initiative has enormously affected the discussion of interoperability in the digital library community: the Open Archives Initiative (OAI; see http://www.openarchives.org). At its core is a protocol that can easily be implemented on archive servers and that allows archives - ETD servers and preprint archives as well as museums and other institutions - to expose their local catalogues to a worldwide community without having to implement specialized and complicated interfaces.

The Open Archives framework thus enables interoperability among heterogeneous and distributed ETD archives and servers at a very low cost of entry. For ETD initiatives and projects, OAI compliance should be seen as an opportunity to connect ETD servers worldwide.
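To sketch how simple the protocol is in practice: an OAI-PMH harvesting request is just an HTTP GET with a few query parameters. The 'verb' and 'metadataPrefix' parameters are defined by the OAI-PMH specification; the base URL below is a hypothetical placeholder, not a real server:

```python
# Build an OAI-PMH ListRecords request URL. The 'verb' and
# 'metadataPrefix' parameters are standard OAI-PMH; the endpoint
# is a hypothetical placeholder for an institution's ETD server.
from urllib.parse import urlencode

BASE_URL = "https://etd.example.edu/oai"  # placeholder endpoint

def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Return the URL for harvesting all records in Dublin Core format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"

url = list_records_url(BASE_URL)
print(url)
```

A harvester simply fetches this URL periodically and parses the XML response, which is why a central union catalogue can be built without the source archives implementing any search interface of their own.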

Next Section: A vision of the future

# The Future/A vision of the future

Theses and dissertations represent a global source of information resulting from cutting-edge research. While a proportion of this information is published in other forms, much of the detail is not, and research emanating from lesser-known institutions, particularly in developing countries, may be less likely to be published in mainstream journals. Creating this information in electronic form and making it readily accessible via the Web through standard, ubiquitous, free software provides the key to disseminating it independent of the source of the research. The ETD initiatives to date have proven that electronic theses and dissertations can be created using relatively low technology, at a cost within the reach of most institutions. Portable packages have been developed which eliminate the majority of the developmental work required. The point has been reached where all research institutions can technically establish their own ETD programs.

Traditionally, theses and dissertations have been extremely underutilized sources of information due to their lack of physical availability. The development of ETDs provides the opportunity for theses and dissertations to be recognized as a basic channel for the dissemination of research findings and an essential resource in the discovery process. Therefore, the focus for the future needs to be to ensure optimal access to ETDs by information seekers. This in essence means ensuring that ETD metadata records are accessible through as many channels as possible and are retrieved as integral components of searches without the researcher necessarily specifying an ETD. Some ways of achieving this could be:

• Creation of a virtual union catalogue of ETD metadata through frequent regular harvesting of data from ETD sites which could be individual, regional or national
• Integration of ETD metadata into general metadata repositories for electronic scholarly information
• Ensuring search engines being established for other electronic scholarly information initiatives such as the Open Archives Initiative also search the ETD metadata repositories
• Inclusion of ETD metadata in local library catalogues
• Inclusion of ETD metadata in subject or form oriented databases

The development of ETD programs worldwide and the implementation of access structures have the potential to significantly enhance the opportunity for all researchers, independent of geographic and economic constraints, to make their contribution to the global research effort.

Work in all of these areas continues under development by NDLTD. Acting as agent for NDLTD, Virginia Tech is running a union catalog, drawing upon sites that support OAI. The result is accessible through Virginia Tech software as well as VTLS’s Virtua software.

With support from the USA’s National Science Foundation, Virginia Tech also is engaged in a number of research activities related to ETDs. These include matching efforts funded by DFG in Germany (in collaboration with Oldenburg University) and by CONACyT in Mexico (with Puebla and Monterrey). These aim to promote mirroring (of metadata as well as regular data), high-performance access, effective searching and browsing, visualization of results and of sites, and other advanced schemes.

It is hoped that all involved in ETD efforts will assist, with a global perspective, so that all universities become involved and ultimately all students submit an ETD, thus becoming better prepared to be leaders in the Information Age.