DChip/Printable version



The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.


Obtain the dChip software and gene information files, store them in a local directory, and double-click the program icon to start dChip. This manual is written in a tutorial style, and the functions of dChip are best explored by following the steps described here in sequence. Paragraphs labeled “[Analysis example]” may be skipped on first reading. Please cite Li and Wong 2001a if dChip results are used in manuscripts.

dChip is a single executable program running on Windows 2000, under which it is being developed. It should presumably also run properly on Windows NT/XP, but some functions, such as zooming the clustering picture and saving images, do not work properly on Windows 95/98. Because dChip reads all CEL data of a group of arrays into memory, the required memory grows with the number of chips (e.g. analyzing 100 chips together requires more than 100 MB of memory). dChip is written in Visual C++ and uses Windows-specific functions for graphic tasks. Victoria Perreau has run dChip on an iBook using the Connectix Virtual PC Windows 2000 software, which costs about $225. On the iBook (466 MHz, 320 MB RAM) the analysis takes about 7 times longer than on a PC (1000 MHz Pentium III, 256 MB RAM), i.e. 6 CEL files are normalized in 8 minutes instead of 1.


User support and trouble-shooting guide

Please send dChip usage questions and discussions of microarray data analysis to the dChip Yahoo group, and send feature suggestions and bug reports to Cheng Li (cli@hsph.harvard.edu). For diagnostic purposes, it is helpful to attach the following to the email: the dChip version and download date, analysis outputs and error messages (use “Analysis/Copy” or “Analysis/Save”), dChip exported data and image files, parameter files (with the .INI extension, in the same directory as dchip.exe), or a screenshot (press the PrintScreen key and paste into the Paint software to save). If dChip crashes or you observe strange behavior, you may also try one or more of the following:

· If part of the clustering picture is overlapped or not displayed, use the Arrow keys, Control+Arrow keys and Shift+Arrow keys to adjust it.

· Click the "Tools/Options/Reset Default" button to reset settings, or delete the existing “group_name”.ini files (in the same directory as dchip.exe) and then restart dChip.

· Download and use the latest version of the gene information files or of dChip itself. If needed, copy these files and the CEL data files (but not the *.ini files containing dChip settings) into a new directory, and start dChip from there.

· Change to another PC or Windows system.

· When opening a group, check “Open group/ignore existing DCP file” and “Open group/Other information/ignore existing cdf.bin file” to re-extract the CEL and CDF files.

· The “Gene information file” may have been saved in Excel format. Open the file in Excel and use “File/Save As” to save it in tab-delimited text format.

· If an “out of memory” error occurs when clustering many genes, uncheck “Tools/Options/Clustering/Pre-calculate distance” before clustering.

· Clustering image exporting problem.

Example Method Description

Example method description using dChip

Array normalization, expression value calculation and clustering analysis were performed using DNA-Chip Analyzer (www.dchip.org; Li & Wong 2001a). The Invariant Set Normalization method (Li & Wong 2001b) was used to normalize arrays at the probe cell level to make them comparable, and the model-based method (Li & Wong 2001b) was used for probe selection and for computing expression values. Standard errors were attached to these expression levels as measures of accuracy, and were subsequently used to compute 90% confidence intervals of fold changes in two-sample or two-group comparisons (Li & Wong 2001b). The lower confidence bounds of the fold changes are conservative estimates of the real fold changes. Genes whose expression increased or decreased after treatment by more than 2-fold (lower confidence bound) were selected for further study.

Hierarchical clustering analysis (Eisen et al. 1998) is used to group genes with similar expression patterns. A gene is selected for clustering if (1) its expression values in the 20 samples have a coefficient of variation (standard deviation / mean) between 0.5 and 10, and (2) it is called “Present” by the MAS5 (or GCOS or dChip) software in more than 5 samples. The expression values for a gene across the 20 samples are then standardized by a linear transformation to have mean 0 and standard deviation 1, and the distance between two genes is defined as 1 - r, where r is the standard correlation coefficient between the 20 standardized values of the two genes. The two genes with the smallest distance are merged first into a super-gene, connected by branches whose length represents their distance, and are removed from further merging. The expression level of the newly formed super-gene is, for each sample, the average of the standardized expression levels of the two genes (average-linkage). Then the next pair of genes (or super-genes) with the smallest distance is chosen to merge, and the process is repeated until all genes are merged into one cluster. The dendrogram in Figure ? illustrates the final clustering tree, where genes close to each other have high similarity in their standardized expression values across the 20 samples.
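The filtering and merging steps described above can be illustrated with a minimal Python sketch. This is not dChip's own code: the function names, the default thresholds passed as parameters, and the handling of a constant profile (correlation treated as 0) are assumptions made for illustration only.

```python
import math

def standardize(values):
    """Linearly rescale values to mean 0 and standard deviation 1.
    Assumes the values are not all equal (the CV filter guarantees this)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """Standard (Pearson) correlation coefficient r between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def passes_filter(values, present_calls, cv_min=0.5, cv_max=10.0, min_present=5):
    """Criteria (1) and (2): CV within [cv_min, cv_max] and more than
    min_present samples called Present (present_calls is a list of booleans)."""
    n = len(values)
    mean = sum(values) / n
    if mean <= 0:
        return False
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return cv_min <= sd / mean <= cv_max and sum(present_calls) > min_present

def cluster(profiles):
    """profiles: dict mapping gene name -> expression values across samples.
    Returns the merges in order as (node_a, node_b, distance) tuples."""
    nodes = {name: standardize(vals) for name, vals in profiles.items()}
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        best = None
        # Find the pair of genes (or super-genes) with the smallest 1 - r.
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = 1 - correlation(nodes[names[i]], nodes[names[j]])
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        d, a, b = best
        # The super-gene profile is the per-sample average of the merged pair.
        merged = [(x + y) / 2 for x, y in zip(nodes.pop(a), nodes.pop(b))]
        nodes[(a, b)] = merged
        merges.append((a, b, d))
    return merges
```

For example, `cluster({"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]})` merges the two perfectly correlated genes first, then joins the anti-correlated gene at a distance of about 2. The exhaustive pair search is kept for readability and is only practical for small gene lists.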

Design Experiments

Designing microarray experiments

Randomize samples

To minimize experimental variation, it is desirable to have the same person perform all the experiments in the same microarray core facility. However, if an experiment involves many samples, they have to be processed on different days. Arrays generated on different days may show a “batch difference”, since different reagents are used to amplify and label the samples. This may be detected by unsupervised clustering. Although the “Array list file” can be used to alleviate the situation, it is better to consider such effects before doing the experiment. For example, if an experiment compares two conditions with multiple samples in each condition, it is undesirable to have all samples of condition A amplified and hybridized on one day and all samples of condition B on another day. In that case, even if batch effects occur, we cannot detect them, since they are confounded with the real biological variation that we are interested in. It is therefore more reasonable to use a balanced design in which the samples of every condition are randomly distributed across the sample amplification and hybridization days.
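One way to draw up such a balanced, randomized schedule is sketched below in Python. This is only an illustration and is not part of dChip; the function name, the sample labels, and the round-robin allocation with a random starting day are all assumptions.

```python
import random

def balanced_assignment(samples_by_condition, n_days, seed=None):
    """Randomly spread the samples of each condition evenly across days.

    samples_by_condition: dict mapping condition name -> list of sample labels.
    Returns a dict mapping day number (1..n_days) -> list of (condition, sample).
    """
    rng = random.Random(seed)
    schedule = {day: [] for day in range(1, n_days + 1)}
    for condition, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)            # random order within the condition
        offset = rng.randrange(n_days)   # random starting day for this condition
        for i, sample in enumerate(shuffled):
            day = (i + offset) % n_days + 1   # round-robin keeps days balanced
            schedule[day].append((condition, sample))
    return schedule

# Hypothetical example: 4 samples per condition processed over 2 days.
design = balanced_assignment(
    {"A": ["A1", "A2", "A3", "A4"], "B": ["B1", "B2", "B3", "B4"]},
    n_days=2,
    seed=1,
)
```

With this allocation each day receives two samples of condition A and two of condition B, so a day-to-day batch effect is not confounded with the comparison between conditions.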

Replicate samples

Another way to reduce experimental variation is to use replicate samples. If variation is introduced in an unbiased manner in the experimental or analysis steps after the replication point, averaging the final expression values gives a better estimate of the “expression level” at the replication point (the variance of the average is inversely proportional to the number of replicates). Replication can be done at points ranging from early to late, roughly in this order: different individuals (cell line strains), independently grown cell lines (pure strain animals), different tissue samples from the same individual, split tissue samples, split mRNA, split IVT, and scanning one array multiple times. We have observed that replicates at the split-IVT level usually agree well in terms of expression values and cluster very tightly. Such replicates therefore may not help us to estimate or reduce the variation introduced before the IVT-splitting point.
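The statement that the variance of the average falls in proportion to the number of replicates can be checked with a small simulation. The Python sketch below assumes independent, unbiased Gaussian noise added after the replication point; the function name and the parameter values are hypothetical.

```python
import random
import statistics

def variance_of_replicate_mean(n_replicates, noise_sd=1.0, n_trials=20000, seed=0):
    """Simulate averaging n_replicates noisy measurements of the same true level
    (taken as 0) and return the variance of that average over many trials."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        replicates = [rng.gauss(0.0, noise_sd) for _ in range(n_replicates)]
        means.append(statistics.fmean(replicates))
    return statistics.pvariance(means)

v1 = variance_of_replicate_mean(1)   # close to noise_sd**2
v4 = variance_of_replicate_mean(4)   # close to noise_sd**2 / 4
```

Noise introduced before the replication point is shared by all replicates and is not reduced by averaging, which is consistent with the split-IVT observation above.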

The practical choice of the replication point should suit the experimental purpose. For example, when an investigator has only a very small amount of RNA, the choices are to use double-round amplification (which may have 5' bias) or to pool RNA from different animals in the same litter. Pooling more samples is good as long as the gene expression variation among the pooled animals is expected to be smaller than the gene expression difference among the biological conditions studied, but this choice may be more expensive. Even when the sample amount is sufficient, there is a choice between processing the sample of each animal and hybridizing it to a separate array, or pooling the samples of different animals first, splitting the pooled sample into several aliquots, and performing IVT and hybridization separately on replicate arrays. This situation is more complex, and the answer probably depends on the specific experiment (the number of replicates, the point at which to replicate and pool, and the variation of genes at these points).

Classify Genes

Classify genes by annotation terms

After a list of genes is obtained by “Compare samples” or “Filter genes”, we can use the “Tools/Classify Genes” dialog to classify these genes into different groups according to GeneOntology or other annotation terms. Gene groups have header lines such as “Found 15 GeneOntology 'response to external stimulus' genes in a 120-group (all: 1068/7734, PValue: 0.661181)”. The p-values are calculated in the same way as for the significant gene clusters. Here 120 is the number of genes in the input gene list that have a GeneOntology annotation, so it may be smaller than the actual number of genes in the list. Note that in “Tools/Classify genes” the whole gene list is considered when assessing significant enrichment, while in clustering every gene cluster with at least 4 annotated genes is considered. Thus the former gives fewer significant gene groups than the latter.
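As an illustration of how an enrichment p-value of this kind can be computed, the Python sketch below uses a one-sided hypergeometric test. This is an assumption: the text above does not spell out the formula, and the sketch is not guaranteed to reproduce the p-value that dChip reports. The function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(hits, list_size, term_total, universe):
    """Probability of seeing at least `hits` genes with a given annotation term
    in a list of `list_size` annotated genes, when `term_total` of the
    `universe` annotated genes on the array carry that term
    (hypergeometric upper tail, computed with exact integer arithmetic)."""
    upper = min(list_size, term_total)
    numerator = sum(
        comb(term_total, i) * comb(universe - term_total, list_size - i)
        for i in range(hits, upper + 1)
    )
    return numerator / comb(universe, list_size)

# Counts taken from the example header line: 15 of 120 genes in the list,
# 1068 of 7734 annotated genes on the array.
p = enrichment_p_value(15, 120, 1068, 7734)
```

The expected number of hits by chance here is about 120 × 1068 / 7734 ≈ 16.6, so finding 15 is unremarkable and the p-value is large, in line with the non-significant value shown in the example header.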

Significant p-values as defined in the “Tools/Options/Clustering” dialog are suffixed by stars (“***”) in the output file. Also one may check the “Only report significant results” box to output only gene groups with significant p-values. The additional data columns such as expression values or fold changes of the “gene list file” will be copied into the output “classified file”.

To prevent multiple probe sets for the same gene from biasing the functional significance computation, it is best to check “Analysis/Open group/Options/Analysis/Mask redundant probe sets” to exclude the redundant probe sets (identified by LocusLink ID) from a gene list. This can also be done at “Tools/Options/Analysis/Mask redundant probe sets”, but redoing “Analysis/Open group” afterwards is recommended, since the array-wide background information on gene annotation is computed after the “gene information file” is read in.
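The effect of masking redundant probe sets can be sketched in Python as follows. This is a simplified illustration, not dChip's implementation: keeping the first probe set seen for each LocusLink ID, and keeping probe sets without an ID, are assumptions, and the identifiers in the example are made up.

```python
def mask_redundant_probe_sets(gene_list):
    """gene_list: list of (probe_set_id, locuslink_id) pairs, where
    locuslink_id may be None for probe sets without an ID.
    Returns the probe set IDs kept after removing later probe sets that
    share a LocusLink ID with one already kept."""
    seen = set()
    kept = []
    for probe_set_id, locus_id in gene_list:
        if locus_id is not None:
            if locus_id in seen:
                continue          # redundant: same gene already represented
            seen.add(locus_id)
        kept.append(probe_set_id)
    return kept

# Hypothetical identifiers, for illustration only.
example = [("probe_a", 1001), ("probe_b", 1001), ("probe_c", 2002), ("probe_d", None)]
kept = mask_redundant_probe_sets(example)
```

After masking, each gene contributes at most one probe set to the counts used in the significance computation.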