Chapter Edited and Updated by: Vishal Thovarai & Kevin Smith
Docking & Scoring
Following protein studies and either virtual library creation or the decision to use a particular existing library, the next steps in this type of drug discovery are docking and scoring. In docking, various algorithms fit compounds from the virtual libraries into the specified target site or sites on a particular protein. The algorithms used in this stage are often based on Monte Carlo methods, genetic algorithms, or incremental construction, which are elaborated upon later in this section.
After compounds are docked, they are then scored. Scoring takes into account the chemical interactions between the ligand and target, in order to determine the energy state of the complex and to rank the efficacy of the compound being scored.
The top-scoring compounds are then visually inspected in the predicted docking orientation to verify the calculated stability, before actual laboratory experiments are performed to verify and determine both the efficacy and toxicity of the drug lead.
Principles of Molecular Docking
The aim of molecular docking is to determine the binding interactions between two molecular species, in this context either protein-protein or protein-ligand interactions. Data representing each molecule and its properties must already be known in order to simulate its behavior. For this reason, this type of behavioral deduction is known as retrospective virtual screening (Paetz). Computationally, docking is carried out by molecular modeling software, which evaluates a large number of possible conformations in order to identify those that appear energetically favorable. The main algorithms used to do this rely on Monte Carlo methods, genetic algorithms, or incremental construction.
Monte Carlo algorithms can be used to help predict the behavior of a physical system. Pseudo-random number generation drives the calculations, producing many random conformations and configurations of the ligand in the binding site in an attempt to “randomly” find a best-fit match. Given the complexity and sheer volume of the possible interactions, this type of algorithm is well suited to applications designed to emulate molecular interactions.
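The random search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "pose" is reduced to two coordinates, `toy_energy` is a made-up stand-in for a real docking energy, and the Metropolis acceptance rule with a fixed `kT` is one common choice rather than the method of any particular docking program.

```python
import math
import random

def toy_energy(pose):
    """Hypothetical energy surface: a bowl centred near (1, -2) with small ripples."""
    x, y = pose
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + 0.3 * math.sin(4 * x) * math.sin(4 * y)

def monte_carlo_search(energy, start, steps=5000, step_size=0.5, kT=0.2, seed=0):
    """Metropolis Monte Carlo: random moves, always accept downhill, sometimes uphill."""
    rng = random.Random(seed)  # seeded pseudo-random generator for reproducibility
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        # Propose a random perturbation of the current pose.
        trial = tuple(c + rng.uniform(-step_size, step_size) for c in current)
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Accept improvements always; accept worse poses with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best

best_pose, best_energy = monte_carlo_search(toy_energy, start=(0.0, 0.0))
print(best_pose, best_energy)
```

In a real docking run the pose would also include orientation and torsion angles, and the energy would come from a scoring function rather than a closed-form expression.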
Genetic algorithms treat the different ligand states as members of a population. These members are mated and mutated, with the best-fit members of each generation carried on to additional generations. In working towards a best-fit case, evolution is modeled in order to produce better and better matches with each new generation. Mutations and deleterious events are used to avoid local maxima, which would give the illusion of working towards a “local” best-fit match, even though it might not be the absolute best-fit match.
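A minimal sketch of this idea follows, assuming a two-coordinate pose and a made-up `toy_energy` function (both hypothetical). The selection scheme (keep the better half), uniform crossover, and Gaussian mutation are illustrative choices, not those of a specific docking package.

```python
import random

def toy_energy(pose):
    """Hypothetical energy: lower is better, minimum at (1, -2)."""
    x, y = pose
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def genetic_search(energy, pop_size=40, generations=60, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(pop_size)]
    history = []  # best energy seen in each generation
    for _ in range(generations):
        population.sort(key=energy)
        history.append(energy(population[0]))
        survivors = population[: pop_size // 2]  # best-fit members carry on
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Uniform crossover: each coordinate comes from one parent.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Occasional mutation helps the population escape local optima.
            if rng.random() < mutation_rate:
                i = rng.randrange(len(child))
                child[i] += rng.gauss(0.0, 0.5)
            children.append(tuple(child))
        population = survivors + children
    best = min(population, key=energy)
    history.append(energy(best))
    return best, history

best, history = genetic_search(toy_energy)
print(best, history[-1])
```

Because the survivors are copied unchanged into the next generation, the best energy recorded in `history` can never get worse from one generation to the next.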
Incremental construction treats the ligands as fragments that can be either rigid or flexible, depending on known stereochemical principles. The rigid sections are first put into ideal positions in the target site, followed by the flexible sections. The goal of adding the flexible portions second is to be able to connect all of the rigid fragments completely and appropriately. Some versions of incremental construction are similar to the clique search. In this method, the rigid fragments are selected and positioned based on specific characteristics, such as the ability to take part in a hydrogen bond or electrostatic interactions between functional groups and side chains (Krovat).
A major factor to be decided upon is whether the ligand, the protein, or both are going to be treated as rigid or flexible. The computation time is already significant when the ligand is treated as flexible and the protein, rigid. The reason for this scenario, which is often typical in initial screenings, is that a.) flexibility in the ligand is easier to simulate, due to the small size of the compound and b.) flexibility in the ligand can lead to a broader range of compounds that fit a particular conformation; thus, there is a set of alternatives to choose from in terms of a chemical that is cheap and easy to synthesize or acquire. Many chemicals in virtual libraries do not actually exist, or are very difficult to acquire, so this element of choice among several good matches is critical in making VLS a much cheaper alternative to traditional laboratory screening.
Treating the protein as flexible greatly increases the computation time for docking due to the vast number of conformational changes that need to be considered; however, this allows more accurate docking when proteins are flexible in vivo, especially when induced-fit proteins are being targeted. Certain ligands that would not necessarily be top matches when the protein is considered rigid, may be ranked much higher when the protein is allowed to be flexible. There are alternatives for protein flexibility that keep computation time down. One such method is to use the X-ray crystallographic or NMR structure of a protein bound to a known ligand during docking. This allows the virtual library to be docked to the protein in its ligand-bound conformation. Another method is the use of several protein structures with slightly different conformations, which is usually achieved during NMR structure determination. This set of structures, or ensemble, is used in docking to find chemical matches that dock appropriately in several of the conformational models; the goal is to find a lead that could remain docked during a transition from the unbound protein conformation to the protein-ligand conformation.
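The ensemble idea can be illustrated with a small ranking sketch. The scores below are invented toy numbers, and ranking by the worst score across conformers is just one possible way to favor ligands that dock acceptably in every model; the function and ligand names are hypothetical.

```python
def rank_by_ensemble(scores):
    """Rank ligands by their worst (highest) score across all conformers.

    scores maps a ligand name to a list of docking scores, one per protein
    conformer, where lower means better binding.
    """
    return sorted(scores, key=lambda ligand: max(scores[ligand]))

# Made-up scores for three ligands against three conformers of one protein.
toy_scores = {
    "ligand_A": [-9.0, -2.0, -8.5],  # excellent in two conformers, poor in one
    "ligand_B": [-7.0, -6.5, -7.2],  # consistently good
    "ligand_C": [-5.0, -5.5, -4.0],  # consistently mediocre
}

ranking = rank_by_ensemble(toy_scores)
print(ranking)  # ['ligand_B', 'ligand_C', 'ligand_A']
```

Ranking by the mean score instead would favor ligand_A, so the choice of aggregate determines how strongly a single poor conformer is penalized.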
Approaches and Algorithms
The two commonly used approaches are surface complementarity and simulation-based approaches. Surface complementarity approaches use techniques such as Fourier transforms or fast Fourier transforms (FFTs) to convert the structural information of the receptors and ligands into mathematical grids and then check for complementarity between the structures. Simulation algorithms such as simulated annealing and tabu search generally perform simulations using energy calculations as described later.
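One common way to write the grid-based complementarity check is as a correlation. The notation here is an assumption for illustration and does not come from the source: R and L are real-valued grids describing the receptor and the ligand, t is a translation of the ligand, and both grids are treated as periodic (in practice they are zero-padded).

```latex
c(\mathbf{t}) = \sum_{\mathbf{x}} R(\mathbf{x})\, L(\mathbf{x} + \mathbf{t}),
\qquad
c = \mathcal{F}^{-1}\!\left[\, \overline{\mathcal{F}[R]} \cdot \mathcal{F}[L] \,\right]
```

Evaluating the sum directly for every translation of an N x N x N grid costs on the order of N^6 operations, whereas the Fourier form costs on the order of N^3 log N per ligand orientation, which is why fast Fourier transforms are attractive here. Rotations of the ligand still have to be sampled separately.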
Simulated annealing is a probabilistic meta-algorithm with many applications, one of which is molecular docking. It finds a good approximation to the global optimum of a function. Here the function is the energy E(s) of a state s, and minimizing it corresponds to optimizing the interaction between the protein and the ligand. The algorithm runs a series of calculations, each moving from the current state s to a candidate neighboring state s'; such a move is called a transition. Whether a transition is accepted is governed by an acceptance probability function P(e, e', T), where e is the energy of the current state, e' is the energy of the candidate state, and T is the temperature. At very small values of T, the search moves almost exclusively to states of lower energy. One requirement is that P must be nonzero even when e' > e, so that the search can occasionally accept a worse state and escape a local minimum. Another aspect of simulated annealing is the annealing schedule, in which the temperature is slowly reduced as the simulation progresses. The parameters used in simulated annealing are the state space, the energy function, the candidate generator procedure (neighbor), the acceptance probability function, and the annealing schedule.
Figure: Flowchart of the simulated annealing process
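The pieces listed above (state, energy function, neighbor generator, acceptance probability, and annealing schedule) can be put together in a short sketch. The Metropolis form of P(e, e', T), the geometric cooling schedule, and the one-dimensional toy energy are all assumptions chosen for illustration.

```python
import math
import random

def acceptance_probability(e, e_new, T):
    """Metropolis rule: always accept improvements, sometimes accept worse states."""
    if e_new < e:
        return 1.0
    return math.exp(-(e_new - e) / T)  # nonzero for any T > 0

def simulated_annealing(energy, neighbor, s0, T0=5.0, cooling=0.99, steps=2000, seed=0):
    rng = random.Random(seed)
    s, e = s0, energy(s0)
    best, e_best = s, e
    T = T0
    for _ in range(steps):
        s_new = neighbor(s, rng)          # candidate generator
        e_new = energy(s_new)
        if rng.random() < acceptance_probability(e, e_new, T):
            s, e = s_new, e_new           # transition accepted
            if e < e_best:
                best, e_best = s, e
        T *= cooling                      # annealing schedule: slow geometric cooling
    return best, e_best

# Hypothetical rugged one-dimensional energy with several local minima.
energy = lambda s: (s - 2.0) ** 2 + 2.0 * math.sin(5.0 * s)
neighbor = lambda s, rng: s + rng.uniform(-0.5, 0.5)

best, e_best = simulated_annealing(energy, neighbor, s0=-3.0)
print(best, e_best)
```

Early in the run, when T is large, uphill moves are accepted often and the search explores widely; as T shrinks, the same uphill move becomes increasingly unlikely and the search settles into a minimum.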
Tabu search is another type of algorithm. The difference is that it keeps a memory structure and will not revisit a solution it has recently evaluated. It can be used for combinatorial optimization problems, where the goal is to find the best possible solution. It uses a local search procedure to move from one solution to the next. The restricted neighborhood N*(x) in the tabu search algorithm excludes the last n solutions tried, which are held in memory, so that calculations are not repeated. This can make it more efficient than other algorithms (Hou).
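A minimal sketch of tabu search on a toy problem is shown below. The one-dimensional `landscape` of energies is made up, and keeping the last `tabu_size` visited solutions in a fixed-length queue is one simple way to implement the memory structure.

```python
from collections import deque

def tabu_search(energy, neighbors, start, tabu_size=10, iterations=100):
    current = start
    best, e_best = start, energy(start)
    tabu = deque([start], maxlen=tabu_size)  # memory of the last n solutions
    for _ in range(iterations):
        # Restricted neighbourhood: drop anything visited recently.
        candidates = [n for n in neighbors(current) if n not in tabu]
        if not candidates:
            break
        # Move to the best allowed neighbour, even if it is worse than the current one.
        current = min(candidates, key=energy)
        tabu.append(current)
        if energy(current) < e_best:
            best, e_best = current, energy(current)
    return best, e_best

# Hypothetical energy landscape over positions 0..7, with a local minimum at index 1.
landscape = [5, 3, 4, 6, 2, 1, 0, 3]
energy = lambda i: landscape[i]
neighbors = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

best, e_best = tabu_search(energy, neighbors, start=0)
print(best, e_best)  # 6 0
```

A plain greedy descent from position 0 would stop at the local minimum at position 1. Because stepping back is forbidden, the tabu search is pushed over the barrier at position 3 and reaches the global minimum at position 6.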
Due to the enormous computational demand of docking and scoring algorithms, many such applications are developed with parallel programming techniques, so that they can be run on clusters of multiple CPUs and be greatly sped up. Grid computing, a rapidly growing area of parallel computing, is currently being investigated as an alternative to dedicated computer clusters. Grid computing, which is modeled on the electrical utility grid, would allow all of the idle computers at a pharmaceutical company to be used in a distributed computational grid. Especially at night, hundreds or even thousands of idle workstations could rival dedicated clusters, at a fraction of the cost. The tradeoff, however, is grid computing’s lack of guaranteed availability on demand.
Scoring, which evaluates the binding strength and energy state of the ligand-protein complex, follows docking during virtual drug discovery. Scoring can be done with several degrees of stringency. The most basic scoring functions consider hydrogen bond formation, van der Waals interactions, and other electrostatic forces. More advanced scoring functions also consider the dielectric constant of the solution and various chemical energy models (Anderson). More basic scoring is often done in initial screenings, with increasing levels of scoring complexity used iteratively to narrow the field of high-scoring drug leads.
Another approach aimed at “scoring” non-metallic protein-ligand interactions is based on the comprehensive analysis of specific atomic characteristics of the molecule(s) in question. While considering some of the more basic criteria listed above, such as van der Waals and electrostatic interactions, an empirical score is also assigned with respect to hydrophobicity and free energy changes in the bound state (Jain). This protocol has thus far shown much promise in terms of rapid computation and accuracy, and is available on the internet under the name Binding Affinity Prediction of Protein-Ligand server, or BAPPL.
A second phase of scoring involves evaluating the highest-scored matches visually to verify the predicted binding potential. Conformation, alignment of bonds, and the shapes of the interacting surfaces are considered in this final review step. Furthermore, the size and the availability or producibility of the leads are included in the evaluation step. A compound that is easily synthesized but slightly less effective may be cheaper and better to investigate further than a higher-scoring ligand that does not exist outside of the virtual library.
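A very basic scoring function of the kind described above can be sketched as a sum of pairwise terms. This is an illustration only: the Lennard-Jones 12-6 form for van der Waals interactions, the Coulomb term scaled by a dielectric constant, and the parameter values (`epsilon`, `sigma`, the partial charges) are assumptions, not the scoring function of any particular program.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), as commonly used in molecular mechanics

def pair_energy(r, q1, q2, epsilon=0.2, sigma=3.5, dielectric=4.0):
    """Energy of one atom pair at distance r (Angstrom) with partial charges q1, q2."""
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_CONSTANT * q1 * q2 / (dielectric * r)
    return lennard_jones + coulomb

def score_complex(ligand_atoms, protein_atoms, dielectric=4.0):
    """Sum pair energies over all ligand-protein atom pairs; lower is better."""
    return sum(
        pair_energy(math.dist(l_pos, p_pos), l_q, p_q, dielectric=dielectric)
        for l_pos, l_q in ligand_atoms
        for p_pos, p_q in protein_atoms
    )

# Toy complex: each atom is (xyz position in Angstrom, partial charge).
ligand = [((0.0, 0.0, 0.0), -0.5)]
protein = [((4.0, 0.0, 0.0), 0.5), ((0.0, 4.0, 0.0), 0.0)]

total = score_complex(ligand, protein)
print(total)
```

Raising the `dielectric` argument weakens the electrostatic term, which is a crude way of accounting for screening by the solvent.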
Issues of Concern in Docking and Scoring
One of the first issues to be concerned with in docking is the resolution of a protein’s structure, along with the temperature of the atoms that comprise the protein. Temperature refers to the degree of uncertainty in the three-dimensional coordinates of an atom, due to resonance and vibration during the structure determination phase. This vibration is present in all atoms; however, greater vibration leads to greater temperature, and thus an atom’s determined coordinates become less reliable and accurate (Krovat).
A further concern in docking is the physiological pH of the protein versus the pH of the protein during crystallization. This is a major factor in locating potential hydrogen bonding sites. The protonation and deprotonation of many amino acid side chains as pH changes could mean the difference between a tightly binding ligand and one that does not bind well at all, even though the former may be a false positive and the latter a false negative. The context of hydrogen bonds must be considered as well, because they can span a range of binding strengths. This factor affects scoring too.
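The effect of pH on a titratable side chain can be estimated with the Henderson-Hasselbalch relation. The sketch below assumes an idealized, isolated group; the pKa of 6.0 used for a histidine side chain is an approximate textbook value, and real pKa values shift with the protein environment.

```python
def fraction_protonated(pH, pKa):
    """Fraction of a titratable group in its protonated form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

HISTIDINE_PKA = 6.0  # approximate value, assumed for illustration

# Compare a mildly acidic crystallization condition with physiological pH.
at_crystal_pH = fraction_protonated(5.0, HISTIDINE_PKA)
at_physiological_pH = fraction_protonated(7.4, HISTIDINE_PKA)
print(round(at_crystal_pH, 3), round(at_physiological_pH, 3))
```

Under these assumptions the side chain is mostly protonated at pH 5.0 and mostly deprotonated at pH 7.4, so a hydrogen bond donor present in the crystal structure may be largely absent under physiological conditions.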
Additional concerns include binding in broad and shallow sites, as surface geometry plays less of a role due to the absence of specific shapes that might otherwise interact. Also, unusual binding modes that might exist in vitro might not be detected during in silico docking (Krovat).
Due to the difficulty in resolving membrane-bound proteins, homology modeling, or the elucidation of molecular structure inferred from the primary sequence, is used in docking in these cases; this is in addition to the homology modeling used in determining the structure of membrane proteins.
Bleicher, H. Konrad, et al. Hit and Lead Generation: Beyond High-Throughput Screening. Nature Reviews, Drug Discovery. 2003 May:(2), 369-378.
Hou, T., Wang, J., Xu, X. A Comparison of Three Heuristic Algorithms for Molecular Docking. Chinese Chemical Letters. 1999:(10), 615-618.
Jain, Tarun, Jayaram, B. An all atom energy based computational protocol for predicting binding affinity of protein-ligand complexes. FEBS Letters. 2005:(579), 6659-6666.
Kapetanovic, M. I. Computer-aided drug discovery and development (CADDD): In-silico-chemico-biological approach. Chemico-Biology Int. 2007.
Lengauer, Thomas, et al. Novel technologies for virtual screening. Drug Discovery Today. 2004 Jan:(9)1, 27-34.
Meek, J. Peter, et al. Shape Signatures; speeding up computer aided drug discovery. Drug Discovery Today. 2006 Oct:(11)19-20, 895-904.
Paetz, Jurgen, Schneider, Gisbert. A neuro-fuzzy approach to virtual screening in molecular bioinformatics. Fuzzy Sets and Systems. 2005:(152), 67-82.
Krovat, M. Eva, Langer, Thierry. Impact of Scoring Functions on Enrichment in Docking-Based Virtual Screening: An Application Study on Renin Inhibitors. J. Chem. Inf. Comput. Sci. 2004 Apr:(3)44, 1123-1129.