# Data Mining Algorithms In R/Sequence Mining/DEGSeq

Working with DEGseq

## Introduction

First we have to choose the way to organize your data. There are two ways to do that: using the MA-plot-based method with random sampling model or MA-plot-based method with technical replicates.

MA-plot-based method with random sampling model

The MA-plot is a statistical analysis tool that having been used to detect and visualize intensity-dependent ratio of microarray data. Let C1 and C2 denote the counts of reads mapped to a specific gene obtained from two samples, with Ci ~ binomial(ni, pi), I = 1,2, where ni denotes the total number of mapped reads and pi denotes the probability of a read coming from that gene. We define M = log2C1 - log2C2, and A = (log2C1 + log2C2)/2. It can be proven that under the random sampling assumption the conditional distribution of M given that A = a (a is an observation of A), follows an approximate normal distribution. For each gene on the MA-plot, we do the hypothesis test of H0: p1 = p2 versus H1: p1 ≠ p2. Then a P-value could be assigned based on the conditional normal distribution.

MA-plot-based method with technical replicates

Though it has been reported that sequencing platform has low background noise, technical replicates would still be informative for quality control and to estimate the variation due to different machines or platforms. MA-plot-based is an another method which estimates the noise level by comparing technical replicates in the data (if available). In this method, a sliding-window is first applied on the MA-plot of the two technical replicates along the A-axis to estimate the random variation corresponding to different expression levels. A smoothed estimate of the intensity-dependent noise level is done by loess regression, and converted to local standard deviations of M conditioned on A, under the assumption of normal distribution. The local standard deviations are then used to identify the difference of the gene expression between the two samples.

Multiple testing correction

For the above methods, the P-values calculated for each gene are adjusted to Q-values for multiple testing corrections. Users can set either a P-value or a false discovery rate (FDR) threshold to identify differentially expressed genes.

Once you choose your data, you can apply the R package DEGseq. There are five different methods to run the program. They are DEGexp, DEGseq, readGeneExp, samWrapper and getGeneExp.

## Functions of Package

### DEGexp

Description

This function is used to identify differentially expressed genes when users already have the gene expression values (such as the number of reads mapped to each gene).

Usage

`DEGexp( geneExpFile1, geneCol1=1, expCol1=2, depth1=rep(0, length(expCol1) ), groupLabel1=" geneExpFile2, geneCol2=1, expCol2=2, depth2=rep(0, length(expCol2) ), groupLabel2=" header=TRUE, sep="", method=c("LRT", "CTR", "FET", "MARS", "MATR", "FC"), pValue=1e-3, zScore=4, qValue=1e-3, foldChange=4, thresholdKind=1, outputDir="none", normalMethod=c("none", "loess", "median"), replicate1="none", geneColR1=1, expColR1=2, depthR1=rep(0, length(expColR1) ), replicate2="none", geneColR2=1, expColR2=2, depthR2=rep(0, length(expColR2) ), rawcount=TRUE )`

Arguments

Argument Description
geneExpFile1 file containing gene expression values for replicates of sample1 (or replicate1 when method="CTR").
geneCol1 gene id column in geneExpFile1.
expCol1 expression value columns in geneExpFile1 for replicates of sample1 (numeric vector). Note: Each column corresponds to a replicate of sample1.
depth1 the total number of reads uniquely mapped to genome for each replicate of sample1 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate.
groupLabel1 label of group1 on the plots.
GeneExpFile2 file containing gene expression values for replicates of sample2 (or replicate2 when method="CTR").
geneCol2 gene id column in geneExpFile2.
expCol2 expression value columns in geneExpFile2 for replicates of sample2 (numeric vector). Note: Each column corresponds to a replicate of sample2.
depth2 the total number of reads uniquely mapped to genome for each replicate of sample2 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate.
groupLabel2 label of group2 on the plots.
header a logical value indicating whether geneExpFile1 and geneExpFile2 contain the names of the variables as its first line.
sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns.
method method to identify differentially expressed genes. Possible methods are: • "LRT": Likelihood Ratio Test (Marioni et al. 2008),• "CTR": Check whether the variation between Technical Replicates can be explained by the random sampling model (Wang et al. 2009), • "FET": Fisher’s Exact Test (Joshua et al. 2009), • "MARS": MA-plot-based method with Random Sampling model (Wang et al. 2009), • "MATR": MA-plot-based method with Technical Replicates (Wang et al.2009), • "FC" : Fold-Change threshold on MA-plot.
pValue pValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=1.
zScore zScore threshold (for the methods: MARS, MATR). only used when thresholdKind=2.
qValue qValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=3 or thresholdKind=4.
thresholdKind the kind of threshold. Possible kinds are: • 1: pValue threshold, • 2: zScore threshold, • 3: qValue threshold (Benjamini et al. 1995), • 4: qValue threshold (Storey et al. 2003).
foldChange fold change threshold on MA-plot (for the method: FC).
outputDir the output directory.
normalMethod the normalization method: "none", "loess", "median". recommend: "none".
replicate1 file containing gene expression values for replicate batch1 (only used when method="MATR"). Note: replicate1 and replicate2 are two (groups of) technical replicates of a sample.
GeneColR1 gene id column in the expression file for replicate batch1 (only used when method="MATR").
expColR1 expression value columns in the expression file for replicate batch1 (numeric vector) (only used when method="MATR").
depthR1 the total number of reads uniquely mapped to genome for each replicate in replicate batch1 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate (only used when method="MATR").
ReplicateLabel1 label of replicate batch1 on the plots (only used when method="MATR").
replicate2 file containing gene expression values for replicate batch2 (only used when method="MATR"). Note: replicate1 and replicate2 are two (groups of) technical replicates of a sample.
geneColR2 gene id column in the expression file for replicate batch2 (only used when method="MATR").
expColR2 expression value columns in the expression file for replicate batch2 (numeric vector) (only used when method="MATR").
depthR2 the total number of reads uniquely mapped to genome for each replicate in replicate batch2 (numeric vector), default: take the total number of reads mapped to all annotated genes as the depth for each replicate (only used when method="MATR").
ReplicateLabel2 label of replicate batch2 on the plots (only used when method="MATR").
rawCount a logical value indicating the gene expression values are based on raw read counts or normalized values.

Example:

```> library(DEGseq)
> geneExpFile <- system.file("data", "GeneExpExample5000.txt",
+ package = "DEGseq")
> if (geneExpFile == "") {
+ zipFile <- system.file("data", "Rdata.zip", package = "DEGseq")
+ if (zipFile != "") {
+ unzip(zipFile, "GeneExpExample5000.txt", exdir = tempdir())
+ geneExpFile <- file.path(tempdir(), "GeneExpExample5000.txt")
+ }
+ }
> layout(matrix(c(1, 2, 3, 4, 5, 6), 3, 2, byrow = TRUE))
> par(mar = c(2, 2, 2, 2))
> DEGexp(geneExpFile1 = geneExpFile, expCol1 = c(7, 9, 12, 15, 18), groupLabel1 = "kidney", geneExpFile2 = geneExpFile, expCol2 = c(8, 10, 11, 13, 16), groupLabel2 = "liver", method = "MARS")

geneExpFile1: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/GeneExpExample5000.txt
gene id column in geneExpFile1: 1
expression value column(s) in geneExpFile1: 7 9 12 15 18
total number of reads uniquely mapped to genome obtained from sample1: 345504 354981 334557

geneExpFile2: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/GeneExpExample5000.txt
gene id column in geneExpFile2: 1
expression value column(s) in geneExpFile2: 8 10 11 13 16
total number of reads uniquely mapped to genome obtained from sample2: 274430 274486 264999

method to identify differentially expressed genes: MARS
pValue threshold: 0.001
output directory: none
The outputDir is not specified!

Identifying differentially expressed genes ...
Please wait patiently ...
output …
The results can be observed in directory: none

```

### DEGseq

Description

This function is used to identify differentially expressed genes from RNA-seq data. It takes uniquely mapped reads from RNA-seq data for the two samples with a gene annotation as input. So users should map the reads (obtained from sequencing libraries of the samples) to the corresponding genome in advance.

Usage

DEGseq ( mapResultBatch1, mapResultBatch2, fileFormat="bed", readLength=32, strandInfo=FALSE, refFlat, groupLabel1="group1", groupLabel2="group2", method=c("LRT", "CTR", "FET", "MARS", "MATR", "FC"), pValue=1e-3, zScore=4, qValue=1e-3, foldChange=4, thresholdKind=1, outputDir="none", normalMethod=c("none", "loess", "median"), depthKind=1, replicate1="none", replicate2="none", replicateLabel1="replicate1", replicateLabel2="replicate2" )

Arguments

Argument Description
mapResultBatch1 uniquely mapping result files for technical replicates of sample1 (or replicate1 when method="CTR").
MapResultBatch2 uniquely mapping result files for technical replicates of sample2 (or replicate2 when method="CTR").
fileFormat file format: "bed" or "eland". example of "bed" format: chr12 7 38 readID 2 example of "eland" format: readID chr12.fa 7 U2 F Note: The field separator character is TAB. And the files must follow the format as one of the examples.
ReadLength the length of the reads (only used if fileFormat="eland"). strandInfo whether the strand information was retained during the cloning of the cDNAs. "TRUE" : retained, "FALSE": not retained.
RefFlat gene annotation file in UCSC refFlat format.
GroupLabel1 label of group1 on the plots.
GroupLabel2 label of group2 on the plots.
Method method to identify differentially expressed genes. Possible methods are: "LRT": Likelihood Ratio Test, "CTR": Check whether the variation between two Technical Replicates can be explained by the random sampling model, "FET": Fisher’s Exact Test, "MARS": MA-plot-based method with Random Sampling model, "MATR": MA-plot-based method with Technical Replicates, "FC" : Fold-Change threshold on MA-plot.
pValue pValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=1.
zScore zScore threshold (for the methods: MARS, MATR). only used when thresholdKind=2.
qValue qValue threshold (for the methods: LRT, FET, MARS, MATR). only used when thresholdKind=3 or thresholdKind=4.
ThresholdKind the kind of threshold. Possible kinds are: • 1: pValue threshold, • 2: zScore threshold, • 3: qValue threshold, • 4: qValue threshold .
foldChange fold change threshold on MA-plot (for the method: FC).
outputDir the output directory.
normalMethod the normalization method: "none", "loess", "median". recommend: "none".
DepthKind 1- take the total number of reads uniquely mapped to genome as the depth for each replicate, 0: take the total number of reads uniquely mapped to all annotated genes as the depth for each replicate. We recommend taking depthKind=1, especially when the genes in annotation file are part of all genes.
replicate1 files containing uniquely mapped reads obtained from replicate batch1 (only used when method="MATR").
replicate2 files containing uniquely mapped reads obtained from replicate batch2 (only used when method="MATR").
ReplicateLabel1 label of replicate batch1 on the plots (only used when method="MATR").
ReplicateLabel2 label of replicate batch2 on the plots (only used when method="MATR").

Example:

> kidneyR1L1 <- system.file("data", "kidneyChr21.bed.txt", package = "DEGseq") > liverR1L2 <- system.file("data", "liverChr21.bed.txt", package = "DEGseq") > refFlat <- system.file("data", "refFlatChr21.txt", package = "DEGseq") > mapResultBatch1 <- c(kidneyR1L1) > mapResultBatch2 <- c(liverR1L2) > outputDir <- file.path(tempdir(), "DEGseqExample") > DEGseq(mapResultBatch1, mapResultBatch2, fileFormat = "bed",refFlat = refFlat, outputDir = outputDir, method = "LRT")

mapResultBatch1: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt mapResultBatch2: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/liverChr21.bed.txt file format: bed refFlat: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/refFlatChr21.txt Ignore the strand information when count the reads mapped to genes! Count the number of reads mapped to each gene ... This will take several minutes, please wait patiently!

SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes.

total 259 unique genes processed 0 reads (kidneyChr21.bed.txt) processed 10000 reads (kidneyChr21.bed.txt) processed 20000 reads (kidneyChr21.bed.txt) processed 30000 reads (kidneyChr21.bed.txt) processed 34304 reads (kidneyChr21.bed.txt) total used 0.328000 seconds!

SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/liverChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes.

total 259 unique genes processed 0 reads (liverChr21.bed.txt) processed 10000 reads (liverChr21.bed.txt) processed 20000 reads (liverChr21.bed.txt) processed 30000 reads (liverChr21.bed.txt) processed 30729 reads (liverChr21.bed.txt) total used 0.297000 seconds!

geneExpFile1: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample/group1.exp gene id column in geneExpFile1: 1 expression value column(s) in geneExpFile1: 2 total number of reads uniquely mapped to genome obtained from sample1: 34304 geneExpFile2: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample/group2.exp gene id column in geneExpFile2: 1 expression value column(s) in geneExpFile2: 2 total number of reads uniquely mapped to genome obtained from sample2: 30729 method to identify differentially expressed genes: LRT pValue threshold: 0.001 output directory: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGseqExample

Identifying differentially expressed genes ... Please wait patiently ... output ...

Done ... The results can be observed in directory: C:\DOCUME~1\wanglk\LOCALS~1\Temp\RtmpUvf7Fw/DEGs

### getGeneExp

Description

This function is used to count the number of reads and calculate the RPKM for each gene. It takes uniquely mapped reads from RNA-seq data for a sample with a gene annotation file as input. So users should map the reads (obtained from sequencing library of the sample) to the corresponding genome in advance.

Usage

getGeneExp( mapResultBatch, fileFormat="bed", readLength=32, strandInfo=FALSE, refFlat, output=paste(mapResultBatch[1],".exp",sep=""), min.overlapPercent= 1 )

Arguments

Argument Description
mapResultBatch a vector containing uniquely mapping result files for a sample. Note: The sample can have multiple technical replicates.
fileFormat file format: "bed" or "eland". example of "bed" format: chr12 7 38 readID 2 example of "eland" format: readID chr12.fa 7 U2 F Note: The field separator character is TAB. And the files must follow the format as one of the examples.
readLength the length of the reads (only used if fileFormat="eland"). strandInfo whether the strand information was retained during the cloning of the cDNAs. • "TRUE" : retained, • "FALSE": not retained.
refFlat gene annotation file in UCSC refFlat format.
output the output file.
min.overlapPercent the minimum percentage of the overlapping length for a read and an exon over the length of the read itself, for counting this read from the exon. should be <=1. 0: at least 1 bp overlap between a read and an exon.

Example:

> kidneyR1L1 <- system.file("data", "kidneyChr21.bed.txt", package = "DEGseq") > refFlat <- system.file("data", "refFlatChr21.txt", package = "DEGseq") > mapResultBatch <- c(kidneyR1L1) > output <- file.path(tempdir(), "kidneyChr21.bed.exp") > getGeneExp(mapResultBatch, refFlat = refFlat, output = output) Please wait... SampleFiles: D:/myrpackage/DEGseq.Rcheck/DEGseq/data/kidneyChr21.bed.txt Count the number of reads mapped to each gene. This will take several minutes. Please wait ... total 259 unique genes processed 0 reads (kidneyChr21.bed.txt) processed 10000 reads (kidneyChr21.bed.txt) processed 20000 reads (kidneyChr21.bed.txt) processed 30000 reads (kidneyChr21.bed.txt) processed 34304 reads (kidneyChr21.bed.txt) total used 0.328000 seconds! > exp <- readGeneExp(file = output, geneCol = 1, valCol = c(2, + 3), label = c("raw count", "RPKM")) > exp[30:32, ] raw count RPKM C21orf131 0 0.000 C21orf15 0 0.000 C21orf2 51 665.789

Description

This method is used to read gene expression values from a file to a matrix in R workspace. So that the matrix can be used as input of other packages, such as edgeR. The input of the method is a file that contains gene expression values.

Usage

readGeneExp(file, geneCol=1, valCol=2, label = NULL, header=TRUE, sep="")

Arguments

Argument Descripition file file containing gene expression values. GeneCol gene id column in file. valCol expression value columns to be read in the file. label label for the columns. Header a logical value indicating whether the file contains the names of the variables as its first line. sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns.

Example:

> geneExpFile <- system.file("data", "GeneExpExample1000.txt", + package = "DEGseq") > exp <- readGeneExp(file = geneExpFile, geneCol = 1, valCol = c(7, + 9, 12, 15, 18, 8, 10, 11, 13, 16)) > exp[30:32, ]

R1L1Kidney

```R1L3Kidney
```

R1L7Kidney

```R2L2Kidney
```

R2L6Kidney ENSG00000188976 73 77 68 70 82 ENSG00000187961 15 15 13 12 15 ENSG00000187583 1 1 3 0 3

R1L2Liver

```R1L4Liver
```

R1L6Liver

```R1L8Liver
```

R2L3Liver ENSG00000188976 34 56 45 55 42 ENSG00000187961 8 13 11 12 20 ENSG00000187583 0 1 0 0 2

### samWrapper

Description

This function is a wrapper of the functions in samr. It is used to identify differentially expressed genes for two sets of samples with multiple replicates or two groups of samples from different individuals (e.g. disease samples vs. control samples).

Usage

samWrapper( geneExpFile1, geneCol1=1, expCol1=2, measure1=rep(1, length(expCol1) ), geneExpFile2, geneCol2=1, expCol2=2, measure2=rep(2, length(expCol2) ), header=TRUE, sep="", paired=FALSE, s0=NULL, s0.perc=NULL, nperms=100, testStatistic= c("standard","wilcoxon"), max.qValue=1e-3, in.foldchange=logged2=FALSE, output )

Arguments

Argument Description geneExpFile1 file containing gene expression values for group1. geneCol1 gene id column in geneExpFile1. expCol1 expression value columns in geneExpFile1. See the example. measure1

```numeric vector of outcome measurements for group1. like c(1,1,1...) when paired=FALSE, or like c(-1,-2,-3,...) when paired=TRUE.
```

geneExpFile2 file containing gene expression values for group2. geneCol2 gene id column in geneExpFile2. ExpCol2 expression value columns in geneExpFile2. See the example. Measure2 numeric vector of outcome measurements for group2. like c(2,2,2...) when paired=FALSE, or like c(1,2,3,...) when paired=TRUE. header a logical value indicating whether geneExpFile1 and geneExpFile2 contain the names of the variables as its first line. sep the field separator character. If sep = "" (the default for read.table) the separator is white space, that is one or more spaces, tabs, newlines or carriage returns. paired a logical value indicating whether the samples are paired. s0 exchangeability factor for denominator of test statistic; Default is automatic choice. s0.perc percentile of standard deviation values to use for s0; default is automatic choice; -1 means s0=0 (different from s0.perc=0, meaning s0=zeroeth percentile of standard deviation values= min of sd values. Nperms number of permutations used to estimate false discovery rates. TestStatistic test statistic to use in two class unpaired case. Either "standard" (t-statistic) or "wilcoxon" (Two-sample wilcoxon or Mann-Whitney test). recommend "standard". max.qValue the max qValue desired; shoube be <1. min.foldchange the minimum fold change desired; should be >1. default is zero, meaning no fold change criterion is applied. logged2 a logical value indicating whether the expression values are logged2. output the output file.

Example

> geneExpFile <- system.file("data", "GeneExpExample1000.txt", + package = "DEGseq") > set.seed(100) > geneExpFile1 <- geneExpFile > geneExpFile2 <- geneExpFile > output <- file.path(tempdir(), "samWrapperOut.txt") > expCol1 = c(7, 9, 12, 15, 18) > expCol2 = c(8, 10, 11, 13, 16) > measure1 = c(-1, -2, -3, -4, -5) > measure2 = c(1, 2, 3, 4, 5) > samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, nperms = 100, min.foldchange = 2, max.qValue = 1e-04, output = output, paired = TRUE)

## References

Jiang,H. and Wong,W.H. (2009) Statistical inferences for isoform expression in RNASeq. Bioinformatics, 25, 1026–1032.

Likun Wang, Zhixing Feng, Xi Wang, Xiaowo Wang, and Xuegong Zhang. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Bioinformatics Advance Access published on October 24, 2009, DOI 10.1093/bioinformatics/btp612.

Wang, Likun; Feng, Zhixing; Wang, Xi; Wang, Xiaowo; Zhang, Xuegong. DEGseq Manual. Tsinghua University. Beijing, China. 2009(B). Available at <http://www.bioconductor.org/packages/devel/bioc/manuals/DEGseq/man/DEGseq.pdf> Access on: 18 November 2009

Wang, Likun; Wang, Xi. How to use the Degseq Package. Laboratory of Bioinformatics and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University. Beijing, China. 2009(A). Available at <http://bioinfo.au.tsinghua.edu.cn/software/degseq/DEGseq.pdf> Access on: 18 November 2009