Chemical Information Sources/Cheminformatics

Introduction

Cheminformatics is the application of information technology to the investigation of chemistry research problems and to the organization and analysis of chemical data. Cheminformaticians work with huge amounts of data and develop systems to organize and evaluate data to give new insights for further chemical research. There is a fine line between theoretical chemistry/computational chemistry and cheminformatics. Cheminformatics has had its biggest impact in the pharmaceutical industry, although its techniques and tools are beginning to be applied to other areas of chemistry.

How Can Cheminformatics Help?

Cheminformatics can help chemists and other scientists produce and manage information. In silico analysis using cheminformatics techniques can reduce the risks of developing a drug. Techniques such as virtual screening, library design, and docking figure into the analysis. Physical properties that might affect whether a substance could be developed as a drug are often treated in cheminformatics as features that can be compared across large numbers of substances. An example is clogP, the calculated logarithm of the octanol-water partition coefficient, which measures how lipophilic ("fatty") a compound is. Sometimes inferences can be drawn about a related set of properties, as when Chris Lipinski formulated his now famous Rule of Five, which says that drug-like compounds tend to have 5 or fewer hydrogen-bond donors, 10 or fewer hydrogen-bond acceptors, a calculated logP less than or equal to 5, and a molecular weight up to 500. Compounds that exceed these values tend to have poor absorption or permeation.
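The Rule of Five thresholds quoted above are simple enough to express as a short check. The following Python sketch is only an illustration: the `Descriptors` class and the function name are hypothetical, the descriptor values are supplied by hand rather than calculated from a structure, and the aspirin figures are approximate values used for demonstration.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hand-supplied properties of one compound (hypothetical container)."""
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float
    mol_weight: float


def rule_of_five_violations(d: Descriptors) -> list:
    """Return a description of each Rule of Five threshold that is exceeded."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    if d.clogp > 5:
        violations.append("calculated logP above 5")
    if d.mol_weight > 500:
        violations.append("molecular weight above 500")
    return violations


# Approximate, illustrative values for aspirin (not computed here).
aspirin = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                      clogp=1.2, mol_weight=180.16)
print(rule_of_five_violations(aspirin))
```

In practice the descriptor values would come from a cheminformatics toolkit or a database rather than being typed in, and a compound that exceeds a threshold is flagged for closer inspection rather than automatically rejected.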

Techniques Used In Cheminformatics

Representation of molecular structures (2D, 3D, protein structures, 3-point pharmacophores, fragments)
- Graph isomorphism: determining whether two graphs are identical, e.g., by comparing connection tables
- Line notations, e.g., SMILES.
Representation of chemical reactions
Molecular Modeling (simulations) and Molecular Diversity
Structure-Activity Relationships (QSAR, QSPR)
Combinatorial Chemistry and High-Throughput Screening
Calculation of physicochemical effects
Topological Indices
Statistics
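The graph isomorphism item in the list above can be illustrated with a very small sketch. It assumes each molecule is given as a list of element symbols plus a list of bonded atom-index pairs (a minimal stand-in for a connection table), ignores bond orders and hydrogens, and simply tries every renumbering of the atoms. That brute-force approach is only practical for a handful of atoms; real systems use canonical numbering or more efficient matching algorithms. The function name is hypothetical.

```python
from itertools import permutations


def are_isomorphic(atoms_a, bonds_a, atoms_b, bonds_b):
    """Check whether two small molecular graphs are identical up to renumbering.

    atoms_*: list of element symbols, indexed by atom number.
    bonds_*: list of (i, j) pairs of bonded atom indices (bond order ignored).
    """
    edges_a = {frozenset(bond) for bond in bonds_a}
    edges_b = {frozenset(bond) for bond in bonds_b}
    # Quick rejections: different composition or different number of bonds.
    if sorted(atoms_a) != sorted(atoms_b) or len(edges_a) != len(edges_b):
        return False
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # perm[i] is the atom of molecule B that atom i of molecule A maps to.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue  # the mapping must preserve element types
        mapped = {frozenset(perm[i] for i in edge) for edge in edges_a}
        if mapped == edges_b:
            return True
    return False


# Heavy atoms of ethanol, numbered two different ways, and dimethyl ether.
ethanol_1 = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_2 = (["O", "C", "C"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

print(are_isomorphic(*ethanol_1, *ethanol_2))       # True
print(are_isomorphic(*ethanol_1, *dimethyl_ether))  # False
```

The two ethanol inputs differ only in atom numbering, so a renumbering exists that makes their bond lists match; ethanol and dimethyl ether share a formula but no renumbering maps one bond list onto the other.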

InChI--The IUPAC International Chemical Identifier

An InChI is a character string generated by a computer algorithm to represent a chemical structure. It is used in software applications and databases where chemical structures need to be represented as machine-readable strings of information. InChIs are unique to the compound they describe and can encode absolute stereochemistry. InChI has been called the bar code for chemistry and chemical structures. The InChI format and algorithm are non-proprietary and the software is open source, with ongoing development done by the community.

Steve Heller wrote in a September 15, 2010 posting on CHMINF-L that virtually all major publishers now support InChI and are adding the InChI/InChIKey to the chemicals reported in journal articles. InChIs and InChIKeys are searchable in Google, Yahoo, Bing, and other search engines. The two major NIH databases (PubChem and NCI) have over 60 million InChIs, while ChemSpider has well over 20 million. All the major commercial and open source structure drawing programs have embedded InChI generation in their products. InChIs are freely usable, and they allow a more advanced representation of chemical information than other codes, such as SMILES. InChIs are also unambiguous: a given chemical structure, converted with the standardized algorithm, leads to only one InChI.
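An InChI string is built from slash-separated layers, which makes it easy to take apart with ordinary string handling. The Python sketch below is a simplified illustration, not the official InChI software: it assumes the string begins with the `InChI=` prefix, that the first layer is the version, and that the second layer is the chemical formula, and it returns the remaining layers tagged by their leading letter. The function name is hypothetical. The example string is the standard InChI for ethanol.

```python
def split_inchi(inchi):
    """Split an InChI string into its version, formula, and tagged sublayers.

    Simplified sketch: assumes the second layer is the chemical formula,
    which holds for ordinary organic compounds but not for every InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: " + inchi)
    parts = inchi[len(prefix):].split("/")
    return {
        "version": parts[0],
        "formula": parts[1] if len(parts) > 1 else "",
        # Each later layer starts with a one-letter tag, e.g. 'c' or 'h'.
        "sublayers": [(p[0], p[1:]) for p in parts[2:] if p],
    }


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["version"])    # 1S
print(ethanol["formula"])    # C2H6O
print(ethanol["sublayers"])  # [('c', '1-2-3'), ('h', '3H,2H2,1H3')]
```

Here the `c` layer describes how the non-hydrogen atoms are connected and the `h` layer describes where the hydrogens are attached; the `S` in the version marks a standard InChI.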

Standards for Coding Chemical Data

In order for cheminformatics to succeed, certain standards had to be developed, although often a format developed by a dominant company became a standard coding method once it was made public, as in the case of MDL's SDF format or, more recently, their CTfile format. In the field of crystallography, the CIF format is widely used for small molecules and mmCIF for macromolecules. Even for such things as the colors used for atoms in a 3D depiction of a molecule, it is important to follow standards. For example, the CPK (Corey-Pauling-Koltun) color-coding convention assigns:

Carbon: grey or black (although some use green)
Hydrogen: white
Oxygen: red
Nitrogen: blue
Sulfur: yellow
Phosphorus: orange
Chlorine: green
Sodium: blue
Iron: purple
Bromine: brown
Zinc: brown
Calcium: dark grey
Other metals: dark grey
Unknown: deep pink

CPK models have their atomic radii defined to reflect the space which molecules take up when they pack in solids or associate in liquids.
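In software, a color convention like the one listed above is typically stored as a lookup table keyed by element symbol. The Python sketch below copies the assignments from the list; the `OTHER_METALS` set is a small, deliberately incomplete illustration of the "other metals" rule, and the names used are hypothetical rather than taken from any particular visualization program.

```python
# Color assignments copied from the CPK list above, keyed by element symbol.
CPK_COLORS = {
    "C": "grey",
    "H": "white",
    "O": "red",
    "N": "blue",
    "S": "yellow",
    "P": "orange",
    "Cl": "green",
    "Na": "blue",
    "Fe": "purple",
    "Br": "brown",
    "Zn": "brown",
    "Ca": "dark grey",
}

# Illustrative, incomplete set of metals not listed individually above.
OTHER_METALS = {"Mg", "K", "Cu", "Ni", "Al"}


def cpk_color(symbol):
    """Return the display color for an element symbol."""
    if symbol in CPK_COLORS:
        return CPK_COLORS[symbol]
    if symbol in OTHER_METALS:
        return "dark grey"
    return "deep pink"  # anything unrecognized


print(cpk_color("O"))   # red
print(cpk_color("Cu"))  # dark grey
print(cpk_color("Xx"))  # deep pink
```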

Current Issues in Cheminformatics

What is a small molecule?
What is an adequate representation of a sample?
Property calculations vs. measurements
Scoring functions for drug-like molecules
Docking for ligand binding prediction
Calculating diversity and similarity
Where do cheminformatics and bioinformatics merge?
Toxicology, ADME (Absorption, Distribution, Metabolism, Excretion), and other pieces of the puzzle for drugs
Depictions of structure and visualization of data
Electronic notebooks
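As one concrete illustration of the similarity item in the list above, the Python sketch below computes the Tanimoto coefficient, a widely used similarity measure that this page does not itself name. It assumes each molecule has already been reduced to a set of feature identifiers (for example, the positions of the bits set in a structural fingerprint); the feature sets shown are made up for demonstration, and treating two empty sets as identical is a choice made here, not a universal rule.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient: shared features divided by total distinct features."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty feature sets count as identical
    return len(a & b) / union


# Made-up feature sets standing in for fingerprint bits.
mol_1 = {1, 2, 3}
mol_2 = {2, 3, 4}
print(tanimoto(mol_1, mol_2))  # 0.5
```

A value of 1 means the two feature sets are the same and 0 means they share nothing. Diversity between two molecules is often expressed as 1 minus the similarity.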

Summary

Cheminformatics (or, as it is more commonly known in Europe, chemoinformatics) has almost as long a history as the computer itself. It is the application of computer technology and methods to chemistry. Related fields are molecular modeling and computational chemistry. Cheminformatics techniques have found particular application in the drug industry, but are now beginning to penetrate other areas of chemistry.

CIIM Link for further study

SIRCh Link for Cheminformatics

Cheminformatics Introductory Resources