「R」用方差分析差异表达基因

Q1

读入数据

代码语言：javascript复制

## Read data of mouse weight
Mouse.Weight <- read.table('体重.csv', sep = ',', skip = 1)
colnames(Mouse.Weight) <- c('A', 'B', 'Weight')

检查因素A和因素B下数据的正态性

对数据的正态性，R中有许多的方法和函数（可以参考博文R语言与正态性检验），这里利用自带常用Shapiro-Wilk正态检验方法(W检验)进行正态性检测。

先将因素对应的体重数据提取出来（A1，A2，A3，B1，B2），分别进行正态性检测。

代码语言：javascript复制

###########################################
## Test normality by using Shapiro-Wilk test
############################################

# test factor A
A1 <- as.numeric(subset(Mouse.Weight, A=='A1')$Weight)
shapiro.test(A1)
A2 <- as.numeric(subset(Mouse.Weight, A=='A2')$Weight)
shapiro.test(A2)
A3 <- as.numeric(subset(Mouse.Weight, A=='A3')$Weight)
shapiro.test(A3)

# test factor B
B1 <- as.numeric(subset(Mouse.Weight, B=='B1')$Weight)
shapiro.test(B1)
B2 <- as.numeric(subset(Mouse.Weight, B=='B2')$Weight)
shapiro.test(B2)

结果：

代码语言：javascript复制

    Shapiro-Wilk normality test

data:  A1
W = 0.97521, p-value = 0.8588


    Shapiro-Wilk normality test

data:  A2
W = 0.94792, p-value = 0.3366


    Shapiro-Wilk normality test

data:  A3
W = 0.96289, p-value = 0.603

    Shapiro-Wilk normality test

data:  B1
W = 0.9442, p-value = 0.118

    Shapiro-Wilk normality test

data:  B2
W = 0.95376, p-value = 0.213

由上述结果可以知道，所有因素都满足正态性(p>0.05)。

方差齐性的检测

可以通过Bartlett检验来检测数据的方差齐性。

代码语言：javascript复制

###########################################
## Test homogeneity of variances by using bartlett.test
############################################
bartlett.test(Weight~A,data=Mouse.Weight)
bartlett.test(Weight~B,data=Mouse.Weight)

结果：

代码语言：javascript复制

> bartlett.test(Weight~A,data=Mouse.Weight)

    Bartlett test of homogeneity of variances

data:  Weight by A
Bartlett's K-squared = 0.19888, df = 2, p-value = 0.9053

> bartlett.test(Weight~B,data=Mouse.Weight)

    Bartlett test of homogeneity of variances

data:  Weight by B
Bartlett's K-squared = 0.20953, df = 1, p-value = 0.6471

从p-value来看，因素A、因素B满足方差齐性(p>0.05)。

方差分析

调用aov函数进行方差分析

代码语言：javascript复制

> attach(Mouse.Weight)
The following objects are masked from Mouse.Weight (pos = 3):

    A, B, Weight

> fit <- aov(Weight ~ A*B)
> summary(fit)
            Df Sum Sq Mean Sq F value   Pr(>F)    
A            2   9080    4540  56.809 5.22e-14 ***
B            1    127     127   1.589    0.213    
A:B          2     17       9   0.108    0.897    
Residuals   54   4316      80                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

由结果可见，只有因素A对大鼠体重有显著性影响。

多重比较

TukeyHSD()函数提供了对各组均值差异的成对检验。

代码语言：javascript复制

########## Multiple Comparison
TukeyHSD(fit)
par(las=2)
par(mar=c(5,8,4,2))
plot(TukeyHSD(fit)) # plot  comparsion test result

结果

代码语言：javascript复制

> TukeyHSD(fit)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Weight ~ A * B)

$A
       diff       lwr       upr     p adj
A2-A1 25.39 18.576905 32.203095 0.0000000
A3-A1 26.75 19.936905 33.563095 0.0000000
A3-A2  1.36 -5.453095  8.173095 0.8805321

$B
      diff       lwr      upr     p adj
B2-B1 2.91 -1.717782 7.537782 0.2128406

$`A:B`
              diff        lwr       upr     p adj
A2:B1-A1:B1  25.09  13.277922  36.90208 0.0000009
A3:B1-A1:B1  25.49  13.677922  37.30208 0.0000006
A1:B2-A1:B1   1.87  -9.942078  13.68208 0.9970585
A2:B2-A1:B1  27.56  15.747922  39.37208 0.0000001
A3:B2-A1:B1  29.88  18.067922  41.69208 0.0000000
A3:B1-A2:B1   0.40 -11.412078  12.21208 0.9999985
A1:B2-A2:B1 -23.22 -35.032078 -11.40792 0.0000050
A2:B2-A2:B1   2.47  -9.342078  14.28208 0.9892562
A3:B2-A2:B1   4.79  -7.022078  16.60208 0.8358408
A1:B2-A3:B1 -23.62 -35.432078 -11.80792 0.0000035
A2:B2-A3:B1   2.07  -9.742078  13.88208 0.9952518
A3:B2-A3:B1   4.39  -7.422078  16.20208 0.8800277
A2:B2-A1:B2  25.69  13.877922  37.50208 0.0000005
A3:B2-A1:B2  28.01  16.197922  39.82208 0.0000001
A3:B2-A2:B2   2.32  -9.492078  14.13208 0.9919360

比较结果

从结果图中可以发现有7组区间穿过mean=0，它们就是没有差异的组。差异的组是置信区间不含0的组别，分别是

代码语言：javascript复制

A2:B1-A1:B1
A3:B1-A1:B1
A2:B2-A1:B1
A3:B2-A1:B1
A1:B2-A2:B1
A1:B2-A3:B1
A2:B2-A1:B2
A3:B2-A1:B2

Q2

对数据进行标准化，每个数取底为10的对数后绘制箱线图

首先读入数据，然后处理为正确的数据集，最后运用apply函数进行列操作得到标准化数据，接着画出箱线图形。

代码语言：javascript复制

## Read data 
library(readr)
dataset <- read.csv('protein.csv')
row.names(dataset) <- dataset$X  ## get colnames
dataset <- dataset[,-1]          ## romove unuseful coloum  

#################
## normalize data
#################
protein.dataset <- apply(dataset, 2, function(x)return(x/mean(x)))

## plot boxplot #log10
boxplot(log10(protein.dataset))

箱线图

筛选差异表达基因

先用一行进行计算P-value测试

代码语言：javascript复制

  x = protein.dataset[1,]  
  a <- rep('A', times=5)
  b <- rep('B', times=5)
  c <- rep('C', times=5)
  Drug <- c(a,b,c)
  test <- data.frame(Drug, x)

> summary(aov(test$x ~ test$Drug))
            Df   Sum Sq   Mean Sq F value Pr(>F)
test$Drug    2 0.000818 0.0004089   0.463   0.64
Residuals   12 0.010606 0.0008839

尝试取出p值

代码语言：javascript复制

> T1 <- summary(aov(test$x ~ test$Drug))
> T1[[1]][5] > 0.05
            Pr(>F)
test$Drug     TRUE
Residuals       NA
> T1[[1]][5][1,] > 0.05
[1] TRUE
> T1[[1]][5][1,] 
[1] 0.6404265

现在进行整体计算筛选

代码语言：javascript复制

############################
## Screen differential genes
############################

# function: oneway test
anova_test <- function(x){
  a <- rep('A', times=5)
  b <- rep('B', times=5)
  c <- rep('C', times=5)
  Drug <- c(a,b,c)
  test <- data.frame(Drug, x)
  return(summary(aov(test$x ~ test$Drug))[[1]][5][1,])
}

anova_result <- apply(protein.dataset,1,anova_test)
table(anova_result < 0.05)

结果计数

代码语言：javascript复制

FALSE  TRUE 
 1062  1048

由此得知差异表达基因数有1048个。

可以利用anova_result[anova_result < 0.05]对差异基因进行输出。

bonferroni矫正

代码语言：javascript复制

adjust_result <- p.adjust(anova_result, method='bonferroni')
table(adjust_result < 0.05)

结果

代码语言：javascript复制

FALSE  TRUE 
 1907   203

经过矫正的结果有203个。

可以利用adjust_result[adjust_result < 0.05]对差异基因进行输出。

anova apply mean

0 人点赞