Chemical Information Sources/General Search Strategies
Introduction: Search Engines versus Databases
The most common first step in finding information of any type is to use an Internet search engine, such as Google. A search engine is a computer program designed to retrieve Internet-based resources (web pages, files, images, etc.) that correspond to an entered search term. Usually, there is little to no additional information provided with the search results. The search results themselves may differ from engine to engine, depending on the program used to compile and return results. For specialized or scholarly information, including chemical information, general search engines fall short in two key aspects:
- They are, at a basic level, very broad. This leads to user frustration when an unrefined search for information retrieves too many irrelevant results, some of which may not be appropriate for an academic or industrial research project.
- They are by their nature limited to items that are available online and that permit access by search engine indexing software, commonly called web crawlers or spiders. Thus, search engines cannot access, or include in their results, journals and books that have no online representation (some small publishers and older titles), nor the content of web sites and databases that block web spiders.
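One common way a site grants or withholds that permission is a robots.txt file. The sketch below uses Python's standard urllib.robotparser to show how a crawler would interpret such a file. The publisher site, its paths, and the crawler name ExampleBot are all hypothetical, and real sites may restrict crawlers by other means as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: crawlers may index
# abstracts but are asked to stay out of the full-text area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /fulltext/
"""


def crawler_may_fetch(url, user_agent="ExampleBot"):
    """Return True if the rules above allow the named crawler to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://publisher.example/abstracts/12345",
    "https://publisher.example/fulltext/12345.pdf",
):
    print(url, crawler_may_fetch(url))
```

Under these assumed rules the abstract page can be indexed, while the full text stays invisible to a search engine that honors the file.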
There are several search strategies that can be employed for any given text-based electronic search engine to help alleviate Problem 1. These include using Boolean operators to narrow or widen a search with specific terms, employing truncation or “wildcard” symbols to provide for variable matches on a base search term, and enclosing phrases in quotation marks to ensure exact phrase matching. Section 2 - Search Strategies describes some of these techniques in more detail.
The use of subject-specific databases helps with both Problem 1 and Problem 2. A database is a searchable repository of actual information, in contrast to a search engine, which only provides links to information. Databases may exist in print, online, or in electronic but offline form (e.g., on a DVD). Databases are usually maintained by individuals or organizations that control how the database is structured, what preferred words are used within the database, and what types of searches can be performed. By choosing an appropriate database (one that covers the subject of interest), the searcher is more likely to get relevant results from the very first search. Subject-specific databases also aim to cover all the literature available, including journals or resources that are still produced in print or are more widely available in print, and archival materials that have not yet been digitized.
Two key skills are important to navigating subject databases successfully:
- Knowing what the database covers with respect to sub-disciplines (and even individual journals), types of documents such as conference papers or dissertations, and time periods;
- Knowing the language of the subject and the specific database, including preferred search terms (referred to as index terms, supplementary terms, keywords, etc.), as well as categorization terms, which may vary from database to database.
Developing Skill 1 involves familiarization with specific databases. Usually coverage information can be found on the company or database website. Skill 2 is naturally developed by researching a topic, and taking advantage of information provided at both the article level and database level. Skill 2 development also involves the use of the search strategies mentioned earlier, and will be included throughout this chapter in the relevant database or resource sections. In particular, the chemical literature can be searched by visual terms (i.e. chemical structures) in addition to text, which poses its own challenges and opportunities.
Section 3 - Types of Electronic Information Sources outlines different electronic sources available for searching different types of information, and recommends an appropriate approach for each one.
Section 4 - Chemistry Databases and Search Engines provides an overview of some of the most popular chemical information databases and associated platforms. Some databases can be accessed via several different platforms.
Finally, Section 5 - Summary and Supplemental Information provides a brief summary and contains links to further reading and supplemental information.
Search Strategies
With the vast majority of current searches being performed on electronic interfaces, this section will focus on applying specific search techniques to those interfaces. Most of these techniques will work across a variety of search engines as well as databases. It will be noted where specific information for specific databases is provided.
Boolean Search Operators
BOOLEAN SEARCH OPERATORS show the logical relationship among different concepts or words in the search.
For a concrete example, let us assume that we are sending orders for home delivery to Doc's Gourmet Bakery using Boolean operators to express our orders. Assume that the plates on which the desserts are delivered are documents, and the pie, cake, and ice cream are words in those documents. The tray on which the plates are sitting represents the answer set.
The most common Boolean operators are:
- OR - Concepts linked with the OR operator are synonymous or related in some fashion.
The OR operator broadens the scope of the search by including acronyms, abbreviations, and similar terms that may be used in the indexing of the documents in the database. One document in the answer set might contain only one of the terms, a different document might have another one, and a third might contain two, three, or all of the terms in an OR statement. The OR Boolean operator puts all of these documents into the final answer set, even if only one of the terms is actually present in a given document.
The normal use of the English word "or" implies a choice, with only one thing possible in the final selection. In a Boolean sense, OR really grabs all of the items and puts them into a set. A special variant of the OR operator is XOR (exclusive OR). XOR retrieves a document only if exactly one of the terms in the statement is present, and skips any document that contains both terms.
Example: pie OR cake
If each of the pieces of pie and each of the pieces of cake in Doc's Gourmet Bakery were placed on its own plate and arranged on a huge tray, we would satisfy the search (pie OR cake), and the tray would represent our answer set. Since the XOR operator was not used, there could even be some plates on which both pie and cake were found. In the Venn diagram, everything that is represented by the top two circles would be pulled and delivered in the order. The overlapping segment of the top two circles implies that some of the plates would definitely have both pie and cake on them.
- AND - Different concepts are combined with the AND operator to ensure that both are found in the same document(s).
In conversational English, "and" is used to group things that may or may not be similar. In a Boolean search, all terms connected with the AND operator must appear in each document in the answer set.
Example: cake AND ice cream
In this example, each of the pieces of cake in our order would be on its own plate with some ice cream on top in order to satisfy the search, and only those plates would be on the tray that is delivered. The two segments of the bottom circle that are shaded in the upper right-hand area of the Venn diagram represent this search.
- NOT - A concept is excluded from the final answer set with the use of the NOT operator.
Example: (cake AND ice cream) NOT chocolate
Now, let's add a further refinement to the search that is not really illustrated in the Venn diagram. Let's assume that you are allergic to chocolate, but that Doc's Gourmet Bakery at the time of your order has only chocolate cake left. You would not get any dessert because the NOT completely eliminates the subset when one of the terms satisfies it. It throws out each of the plates containing the chocolate cake even if the ice cream on top is your favorite, vanilla.
Let's try another search for pie on the same day that Doc's has only chocolate cake on the shelves.
Example: (pie AND ice cream) NOT chocolate
In this case, our order would get us some pie (as long as it wasn't chocolate pie or the pie didn't have chocolate ice cream on it).
From the examples, you should realize that the NOT command must be used with caution in online searching since it could eliminate some documents that are of interest if they also happen to discuss aspects of a topic that are not of interest. In the last NOT example, for instance, you would not get any plate that had both pie and chocolate cake on it.
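The bakery orders above can be sketched with Python sets, where each plate (document) is the set of desserts (words) on it. The six plates are invented for illustration. OR maps to set union, AND to intersection, NOT to set difference, and XOR to symmetric difference.

```python
# Each plate (document) is modelled as the set of desserts (words) on it.
# The plates themselves are invented for illustration.
plates = {
    "plate1": {"pie"},
    "plate2": {"cake", "chocolate"},
    "plate3": {"pie", "cake"},
    "plate4": {"cake", "ice cream", "chocolate"},
    "plate5": {"pie", "ice cream"},
    "plate6": {"cake", "ice cream"},
}


def having(term):
    """Return the names of all plates that contain the given term."""
    return {name for name, items in plates.items() if term in items}


pie, cake = having("pie"), having("cake")
ice_cream, chocolate = having("ice cream"), having("chocolate")

pie_or_cake = pie | cake                            # OR: union
pie_xor_cake = pie ^ cake                           # XOR: exactly one term
cake_and_ice = cake & ice_cream                     # AND: intersection
cake_ice_not_choc = (cake & ice_cream) - chocolate  # NOT: difference
pie_ice_not_choc = (pie & ice_cream) - chocolate

print(sorted(pie_or_cake))
print(sorted(pie_xor_cake))
print(sorted(cake_and_ice))
print(sorted(cake_ice_not_choc))
print(sorted(pie_ice_not_choc))
```

Note that plate4 is dropped by the NOT even though it carries cake and ice cream, which is exactly the risk described above. The parentheses play the role of nesting: the AND is evaluated before the NOT is applied.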
There are more specific variants of the AND command that can be used to define the spatial relationships of search terms. These are called POSITIONAL or PROXIMITY OPERATORS. On STN, they are:
- (A) - terms must be adjacent without regard to order
- (W) - terms must be adjacent and in the order specified
- (L) - terms must occur in the same logical unit (field)
- (S) - terms must be in the same sentence within the same field.
Note that on STN the (A) and (W) operators mean the same in all files; other proximity operators may yield different results depending on the file. STN assumes that multi-word phrases are to be searched using the (W) operator in the absence of explicit positional or other Boolean operators.
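To make the difference between these operators concrete, here is a minimal sketch of (A)-, (W)-, and (S)-style matching on plain text. It is not STN's implementation: it reads (W) as adjacent terms in the stated order, uses a naive tokenizer and sentence splitter, and the sample sentence is invented.

```python
import re


def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def adjacent_in_order(text, first, second):
    """(W)-style match: `first` immediately followed by `second`."""
    words = tokens(text)
    return any(a == first and b == second for a, b in zip(words, words[1:]))


def adjacent_any_order(text, first, second):
    """(A)-style match: the two terms side by side, in either order."""
    return (adjacent_in_order(text, first, second)
            or adjacent_in_order(text, second, first))


def same_sentence(text, first, second):
    """(S)-style match: both terms occur within one sentence."""
    for sentence in re.split(r"[.!?]", text):
        words = set(tokens(sentence))
        if first in words and second in words:
            return True
    return False


sample = "Infrared spectroscopy of polymers. Raman data were also collected."
print(adjacent_in_order(sample, "infrared", "spectroscopy"))
print(adjacent_in_order(sample, "spectroscopy", "infrared"))
print(adjacent_any_order(sample, "spectroscopy", "infrared"))
print(same_sentence(sample, "infrared", "polymers"))
print(same_sentence(sample, "infrared", "raman"))
```

The (L) operator would need records split into fields, so it is left out of this sketch.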
See "Operators for Relating Search Terms" for some examples of Boolean search operators on the STN system.
Some of the examples illustrate the use of NESTING, placing terms in parentheses so that the search system knows to perform those functions first before moving on to other operators.
Truncation (Masking) of Characters to Expand a Search
In many cases where subject searches are concerned, we are looking for topics that involve words built on a common root word, or that have some other variations that are easily signaled to a computer by means of a special symbol. TRUNCATION is the technique that tells the computer to form an answer set consisting of all records that contain words with the characters input for the search, but could also contain related words with suffixes (or, in some cases, prefixes) or variable characters at a given point in the word. It is NOT possible to use the truncation technique on SciFinder research topic searches. However, it can be applied on command-driven searches such as those done on STN.
Truncation can occur at the left end or the right end of a word stem or within the word. STN now allows all three types of truncation in the CA File Basic Index, an index of subject words from the title words, words in the abstracts, or index terms (including Registry Numbers for compounds discussed in the documents). The limit of terms that can be gathered in a set by truncation is 30,000 stems. For left truncation the search term must have at least four characters.
On the STN system, truncation symbols are:
|Symbol||Matches||Example|
|exclamation point (!)||Exactly one character||cataly!e|
|hash mark (#)||One or no character||alcohol#|
|question mark (?)||Any number of characters||?therap?|
As noted in the table, the # sign can be used at the end of a word to pick up both singular and plural forms of a word. Another way of accomplishing the same thing on STN using the command language option is to enter SET PLURALS ON at the system prompt. Both left- and right-hand truncations are allowed with the "?".
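One way to see what each symbol does is to translate a truncated term into a regular expression. The sketch below is only an illustration, not how STN processes queries. The mapping of "!" to one word character, "#" to zero or one, and "?" to zero or more is an assumption based on the table, and the test words are chosen for the example.

```python
import re

# Assumed mapping from the truncation symbols in the table to regex pieces.
SYMBOLS = {
    "!": r"\w",    # exactly one character
    "#": r"\w?",   # one or no character
    "?": r"\w*",   # any number of characters, including none
}


def truncation_to_regex(term):
    """Translate a truncated search term into a compiled regex."""
    parts = [SYMBOLS.get(ch, re.escape(ch)) for ch in term]
    return re.compile("".join(parts), re.IGNORECASE)


def matches(term, word):
    """Return True if the whole word satisfies the truncated term."""
    return truncation_to_regex(term).fullmatch(word) is not None


for term, word in [
    ("cataly!e", "catalyze"),
    ("cataly!e", "catalyse"),
    ("alcohol#", "alcohols"),
    ("alcohol#", "alcoholic"),
    ("?therap?", "chemotherapy"),
]:
    print(term, word, matches(term, word))
```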
There are limits to the number of terms that can be gathered into a set using truncation. Therefore, caution must be exercised in using truncation to prevent too many search terms (or unexpected words) from entering the answer set.
Novice searchers and even professionals sometimes make gross errors with truncation, especially in systems that allow both left- and right-hand truncation. Think what would happen if a search were run with these character strings truncated on both sides:
Every occurrence of the word "chemical" or "chemistry" or "biochemical," etc. would be pulled in the first search, but also documents containing words such as "hemisphere". In the second case, every document that contains an English word that ends in -ION would be pulled. Probably not what the searcher would have wanted!
Unfortunately, there is no uniformity of symbols used to designate truncation among different vendors or search engines, although often we find an asterisk (*) used to indicate the right-hand truncation point. That is the case with the Web of Science, for example.
With SciFinder, no truncation is used. The searcher simply types into the Research Topic search window the natural language expression that defines the search, without even trying to insert Boolean search terms. The SciFinder search algorithm has some built-in intelligence to look for relevant word forms for the search. For instance, the search system automatically searches for both singular and plural subject words.
Let's consider the results of a research topic search run a few years ago on SciFinder for the analytical technique "Electron Spectroscopy for Chemical Analysis (ESCA)," including results from both the CAplus and Medline databases.
At the time it was run, the search as entered found 4395 references where the two concepts "electron spectroscopy" and "chemical analysis" were closely associated with each other and only 582 where the phrase as entered was found. In this case, let's repeat the search using the acronym for the analytical technique (ESCA) and also use a synonymous acronym, XPS. (The technique is also known as X-Ray Photoelectron Spectroscopy.) We have the option of entering synonymous words in parentheses, following a term or phrase. Thus, entering the research topic search on SciFinder as:
would imply to the system that you are looking for synonymous terms (an OR search). This search found considerably more documents: 114,511 at the time of the search on October 3, 2004. However, many of the 35,609 records pulled by the ESCA part of the search were false drops that match the word "escape"! Entering ESCA by itself pulls 7516 records with the term "as entered," and it appears that all but the oldest (a 1918 record) are relevant. Thus, the technique of entering synonyms in parentheses must be used with caution on SciFinder.
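A false drop of this kind can be pictured as the difference between matching a string as a word stem and matching it as a whole word. The sketch below is not SciFinder's algorithm, which is not described here. The three titles are invented, and the two matching functions are assumptions made for illustration.

```python
import re

# Invented titles for illustration only.
titles = [
    "ESCA study of oxidized polymer surfaces",
    "Escape of electrons from thin metal films",
    "XPS and ESCA characterization of catalysts",
]


def stem_hits(stem, texts):
    """Match the stem at the start of any word (picks up longer words)."""
    pattern = re.compile(r"\b" + re.escape(stem), re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


def whole_word_hits(word, texts):
    """Match the term only when it stands alone as a complete word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


print(len(stem_hits("esca", titles)))
print(len(whole_word_hits("esca", titles)))
```

In this toy set, stem matching also retrieves the title about electrons escaping, while whole-word matching keeps only the two titles that actually mention ESCA.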
Enclosing a phrase in quotation marks considerably narrows a search by limiting results to those in which the exact phrase appears, in the order in which it is entered. A basic example would be searching for polymer nanorods versus "polymer nanorods":
polymer nanorods: Most search engines will perform an AND search with the terms polymer and nanorods and return results that have both terms anywhere in the result, resulting in extraneous results.
"polymer nanorods": Enclosing the term in quotes will ensure the results returned contain polymer nanorods as adjacent terms.
Types of Electronic Information Sources
Bibliographic versus Non-Bibliographic
When searching for peer-reviewed scientific information, two broad types of databases can be distinguished:
- Non-bibliographic: This includes sources such as property databases, chemical structure databases, dictionaries, and encyclopedias that provide actual answers to questions without having to consult another source.
Examples: Encyclopedia Britannica, the CRC Handbook of Chemistry and Physics, SciFinder, ChemSpider
- Bibliographic: These databases include records of published works, perhaps with abstracts, and increasingly with links to the full texts of the primary documents.
Examples: Web of Science, SciFinder, Compendex, PubMed
Usually, commercial products cannot be found or accessed via a public Internet connection: access is limited to organizations that have paid for it, and this is usually enforced by authenticating the IP addresses of the organization's computers. Examples include the CRC Handbook and the Web of Knowledge. Some resources are publicly available, such as ChemSpider and PubMed. Web search engines do not have access to library online public access catalogs (OPACs) that tell you specific library holdings, nor can they access any of the commercial vendors' offerings. However, publicly accessible databases will often show up in search engine results. Thus, web search engines can be very powerful tools, and for certain types of questions they can be very useful in a search for information. Many people today, including chemists, maintain their own personal web pages. For locating someone and perhaps finding a full or selective bibliography or a curriculum vitae (CV) of a chemist, the Web may offer the best route to reliable, up-to-date information. Likewise, very new or hot topics may be discussed in Web news groups, discussion lists, or blogs long before they appear in traditional journals and, later, in abstracting and indexing services. For all of these reasons, we are beginning to see the commercial vendors add options to transfer the search strategy used in a commercial database search to the Internet for further information.
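As a rough picture of the IP authentication mentioned at the start of this discussion, a vendor can compare the address of an incoming request with the network ranges registered by subscribing organizations. The sketch below uses Python's standard ipaddress module. The ranges are reserved documentation addresses standing in for a hypothetical subscriber, and real vendors' systems are certainly more elaborate.

```python
from ipaddress import ip_address, ip_network

# Hypothetical ranges registered by a subscribing organization
# (reserved documentation ranges, used here as placeholders).
SUBSCRIBER_NETWORKS = [
    ip_network("192.0.2.0/24"),
    ip_network("198.51.100.0/24"),
]


def has_access(client_ip):
    """Return True if the client address falls in a subscriber range."""
    addr = ip_address(client_ip)
    return any(addr in net for net in SUBSCRIBER_NETWORKS)


print(has_access("192.0.2.17"))    # an on-campus computer
print(has_access("203.0.113.5"))   # a computer outside the organization
```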
In spite of the ease of accessing the Web, it ought to be a fairly rare case that you begin a subject search for information with a Web search engine if you have easy access to online commercial databases in your organization. Databases such as the Web of Science (including Science Citation Index potentially all the way back to 1900), Elsevier's Reaxys databases (among which are the Gmelin and Beilstein databases that cover the literature of modern inorganic, organic, and organometallic chemistry back to their beginnings in the 18th and 19th centuries), and Chemical Abstracts (that covers all areas of chemistry in a comprehensive manner back to 1907 and even earlier in some cases) are usually much better first choices, if they are available to you.
The options for database searching include:
- ONLINE SEARCHING of commercial databases located outside your organization.
Vendors of online search services (for example, STN International) lease or acquire databases from the database producers (such as Chemical Abstracts Service or Thomson Reuters) and make them available on remote computers. For a given vendor, which may have dozens or hundreds of databases on its computers, the databases are all searched by a common command language or graphical user interface. In the vast majority of these cases, there is a fee for searching the databases.
- WEB SEARCH ENGINES.
As noted above, the powerful search engines of today can provide a useful supplement to traditional online searches.
- FREE CHEMISTRY DATABASES ON THE WEB.
Some databases that are available for searching free on the Internet are of very high quality, for example, those produced by the National Library of Medicine or other government agencies or commercial organizations. However, the quality of most databases that are freely accessible on the Internet is likely not to be as high as that of commercial databases. In addition, there are many differences in the search interfaces that the user encounters among free Internet databases. Nevertheless, they should not be ignored for certain types of searches.
- IN-HOUSE SEARCHING of databases within the organization.
Chemical and pharmaceutical companies now routinely load databases on their own computers.
Summary and Supplemental Information
Commercial databases offer many advantages over free Web search engines, including significantly more in-depth indexing of the material and more sophisticated search techniques. Although many of the search techniques discussed here, such as the use of Boolean operators and truncation, can be applied to free search engines, the in-depth indexing of commercial databases, which includes fields such as document type (among others), makes these techniques much more powerful. It is always advisable to consult a specialized database when available, and not to rely solely on search engine results.
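The value of indexed fields can be illustrated with a toy example in which a keyword search is narrowed by document type. The records, field names, and the search function are all hypothetical and do not reflect the structure of any actual database.

```python
# Hypothetical indexed records; titles, years, and fields are invented.
records = [
    {"title": "Polymer nanorods for drug delivery",
     "doc_type": "journal article", "year": 2009},
    {"title": "Polymer nanorods: a review",
     "doc_type": "review", "year": 2011},
    {"title": "Method of making polymer nanorods",
     "doc_type": "patent", "year": 2010},
]


def search(items, keyword, doc_type=None):
    """Keyword search on titles, optionally limited to one document type."""
    hits = [r for r in items if keyword.lower() in r["title"].lower()]
    if doc_type is not None:
        hits = [r for r in hits if r["doc_type"] == doc_type]
    return hits


print(len(search(records, "nanorods")))
print([r["title"] for r in search(records, "nanorods", doc_type="review")])
```

A plain keyword search returns all three records, while the field limit isolates the single review, something a general Web search engine usually cannot do reliably.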
CIIM Link for further study (Major Tools or Databases)
Chemical Abstracts Databases vs. the Printed Chemical Abstracts
David Flaxbart, chemistry librarian at the University of Texas, pointed out some of the reasons to retain the printed Chemical Abstracts volumes in library collections (CHMINF-L, 8 June 2010). He notes: SciFinder is not identical to Chemical Abstracts. All (or nearly all) the content of the latter is included in the CAPLUS file and robustly substance-indexed via the Registry file. But it is an oversimplification to say that you can do everything in SciFinder that you could do in the print.
- The Collective subject/substance/formula indexes allow browsing of chemical names, formulas, and subject headings in a way that isn't possible in SciFinder. SciFinder is great for snapshots, but it doesn't provide any view of the hierarchical structure of the CA database, or its indexing and nomenclature practices; nor does it allow browsing for derivatives, salts, and other variants of a parent structure. In other words, you can't browse online for nearby entries like you can in the print, which removes a serendipity factor. For some purposes, this is an important distinction. (Browsing index entries is possible in STN.)
- When you can't figure out how CAS has defined the structure or formula of certain types of compounds, especially inorganic (salts, hydrates, ions, decimals, etc.), coordination compounds, and multicomponent substances, SciFinder can be frustrating. Using the Index Guide and Chemical Substance Index can actually save some time, and when you find the Registry number then you can go back to SciFinder, locate the substance record and complete the literature search. (Of course, this method only works for compounds registered before your last Collective Index.)
- Pre-1967 CA abstract numbers are not searchable or displayed in SciFinder, and can only be looked up or verified in the print or on STN. These numbers are occasionally cited in the older literature, especially as stand-ins for obscure and foreign documents.
- Some printed abstracts may contain structure graphics that aren't duplicated online.
- Some older CA records were not properly converted and are missing from SciFinder or merged with adjacent records. CAS will fix these when notified, and it seems to be a rare occurrence.
- SciFinder is not available to unaffiliated users, per license restrictions. CA print is a potential fallback. (Unless it's in storage. Indexes stored remotely will almost certainly never be used again, and can't be used for their intended purpose, so this is essentially no different than discarding them.) Naturally, CA in print is only for historical searching. Even if you were to lose access to SciFinder, print CA could not fill the gap or be an acceptable substitute for modern users.
- Even if you decide to discard the bulk of CA, consider retaining the most valuable parts, such as the Index Guides (very useful for finding index terms, synonyms, controlled vocabulary, Registry Numbers, etc.); patent indexes; formula and name indexes; and the Ring Systems Handbook. Also, general wisdom suggests that the older (and smaller) pre-1967 portion of CA is more valuable archivally than the post-1967 volumes, which are somewhat more expendable.
See also from Chemical Abstracts Service: Transitioning from CA Print to CAS' Electronic Products for more information.