Q1
读入数据
代码语言:javascript复制## Read data of mouse weight
Mouse.Weight <- read.table('体重.csv', sep = ',', skip = 1)
colnames(Mouse.Weight) <- c('A', 'B', 'Weight')
检查因素A和因素B下数据的正态性
对数据的正态性,R中有许多的方法和函数(可以参考博文R语言与正态性检验),这里利用自带常用Shapiro-Wilk正态检验方法(W检验)进行正态性检测。
先将因素对应的体重数据提取出来(A1,A2,A3,B1,B2),分别进行正态性检测。
代码语言:javascript复制###########################################
## Test normality by using Shapiro-Wilk test
############################################
# test factor A
A1 <- as.numeric(subset(Mouse.Weight, A=='A1')$Weight)
shapiro.test(A1)
A2 <- as.numeric(subset(Mouse.Weight, A=='A2')$Weight)
shapiro.test(A2)
A3 <- as.numeric(subset(Mouse.Weight, A=='A3')$Weight)
shapiro.test(A3)
# test factor B
B1 <- as.numeric(subset(Mouse.Weight, B=='B1')$Weight)
shapiro.test(B1)
B2 <- as.numeric(subset(Mouse.Weight, B=='B2')$Weight)
shapiro.test(B2)
结果:
代码语言:javascript复制 Shapiro-Wilk normality test
data: A1
W = 0.97521, p-value = 0.8588
Shapiro-Wilk normality test
data: A2
W = 0.94792, p-value = 0.3366
Shapiro-Wilk normality test
data: A3
W = 0.96289, p-value = 0.603
Shapiro-Wilk normality test
data: B1
W = 0.9442, p-value = 0.118
Shapiro-Wilk normality test
data: B2
W = 0.95376, p-value = 0.213
由上述结果可以知道,所有因素都满足正态性(p>0.05)。
方差齐性的检测
可以通过Bartlett检验来检测数据的方差齐性。
代码语言:javascript复制###########################################
## Test homogeneity of variances by using bartlett.test
############################################
bartlett.test(Weight~A,data=Mouse.Weight)
bartlett.test(Weight~B,data=Mouse.Weight)
结果:
代码语言:javascript复制> bartlett.test(Weight~A,data=Mouse.Weight)
Bartlett test of homogeneity of variances
data: Weight by A
Bartlett's K-squared = 0.19888, df = 2, p-value = 0.9053
> bartlett.test(Weight~B,data=Mouse.Weight)
Bartlett test of homogeneity of variances
data: Weight by B
Bartlett's K-squared = 0.20953, df = 1, p-value = 0.6471
从p-value来看,因素A、因素B满足方差齐性(p>0.05)。
方差分析
调用aov
函数进行方差分析
> attach(Mouse.Weight)
The following objects are masked from Mouse.Weight (pos = 3):
A, B, Weight
> fit <- aov(Weight ~ A*B)
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 2 9080 4540 56.809 5.22e-14 ***
B 1 127 127 1.589 0.213
A:B 2 17 9 0.108 0.897
Residuals 54 4316 80
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
由结果可见,只有因素A对大鼠体重有显著性影响。
多重比较
TukeyHSD()
函数提供了对各组均值差异的成对检验。
########## Multiple Comparison
TukeyHSD(fit)
par(las=2)
par(mar=c(5,8,4,2))
plot(TukeyHSD(fit)) # plot comparsion test result
结果
代码语言:javascript复制> TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Weight ~ A * B)
$A
diff lwr upr p adj
A2-A1 25.39 18.576905 32.203095 0.0000000
A3-A1 26.75 19.936905 33.563095 0.0000000
A3-A2 1.36 -5.453095 8.173095 0.8805321
$B
diff lwr upr p adj
B2-B1 2.91 -1.717782 7.537782 0.2128406
$`A:B`
diff lwr upr p adj
A2:B1-A1:B1 25.09 13.277922 36.90208 0.0000009
A3:B1-A1:B1 25.49 13.677922 37.30208 0.0000006
A1:B2-A1:B1 1.87 -9.942078 13.68208 0.9970585
A2:B2-A1:B1 27.56 15.747922 39.37208 0.0000001
A3:B2-A1:B1 29.88 18.067922 41.69208 0.0000000
A3:B1-A2:B1 0.40 -11.412078 12.21208 0.9999985
A1:B2-A2:B1 -23.22 -35.032078 -11.40792 0.0000050
A2:B2-A2:B1 2.47 -9.342078 14.28208 0.9892562
A3:B2-A2:B1 4.79 -7.022078 16.60208 0.8358408
A1:B2-A3:B1 -23.62 -35.432078 -11.80792 0.0000035
A2:B2-A3:B1 2.07 -9.742078 13.88208 0.9952518
A3:B2-A3:B1 4.39 -7.422078 16.20208 0.8800277
A2:B2-A1:B2 25.69 13.877922 37.50208 0.0000005
A3:B2-A1:B2 28.01 16.197922 39.82208 0.0000001
A3:B2-A2:B2 2.32 -9.492078 14.13208 0.9919360
比较结果
从结果图中可以发现有7组区间穿过mean=0
,它们就是没有差异的组。差异的组是置信区间不含0的组别,分别是
A2:B1-A1:B1
A3:B1-A1:B1
A2:B2-A1:B1
A3:B2-A1:B1
A1:B2-A2:B1
A1:B2-A3:B1
A2:B2-A1:B2
A3:B2-A1:B2
Q2
对数据进行标准化,每个数取底为10的对数后绘制箱线图
首先读入数据,然后处理为正确的数据集,最后运用apply函数进行列操作得到标准化数据,接着画出箱线图形。
代码语言:javascript复制## Read data
library(readr)
dataset <- read.csv('protein.csv')
row.names(dataset) <- dataset$X ## get colnames
dataset <- dataset[,-1] ## romove unuseful coloum
#################
## normalize data
#################
protein.dataset <- apply(dataset, 2, function(x)return(x/mean(x)))
## plot boxplot #log10
boxplot(log10(protein.dataset))
箱线图
筛选差异表达基因
先用一行进行计算P-value测试
代码语言:javascript复制 x = protein.dataset[1,]
a <- rep('A', times=5)
b <- rep('B', times=5)
c <- rep('C', times=5)
Drug <- c(a,b,c)
test <- data.frame(Drug, x)
> summary(aov(test$x ~ test$Drug))
Df Sum Sq Mean Sq F value Pr(>F)
test$Drug 2 0.000818 0.0004089 0.463 0.64
Residuals 12 0.010606 0.0008839
尝试取出p
值
> T1 <- summary(aov(test$x ~ test$Drug))
> T1[[1]][5] > 0.05
Pr(>F)
test$Drug TRUE
Residuals NA
> T1[[1]][5][1,] > 0.05
[1] TRUE
> T1[[1]][5][1,]
[1] 0.6404265
现在进行整体计算筛选
代码语言:javascript复制############################
## Screen differential genes
############################
# function: oneway test
anova_test <- function(x){
a <- rep('A', times=5)
b <- rep('B', times=5)
c <- rep('C', times=5)
Drug <- c(a,b,c)
test <- data.frame(Drug, x)
return(summary(aov(test$x ~ test$Drug))[[1]][5][1,])
}
anova_result <- apply(protein.dataset,1,anova_test)
table(anova_result < 0.05)
结果计数
代码语言:javascript复制FALSE TRUE
1062 1048
由此得知差异表达基因数有1048个。
可以利用anova_result[anova_result < 0.05]
对差异基因进行输出。
bonferroni矫正
代码语言:javascript复制adjust_result <- p.adjust(anova_result, method='bonferroni')
table(adjust_result < 0.05)
结果
代码语言:javascript复制FALSE TRUE
1907 203
经过矫正的结果有203个。
可以利用adjust_result[adjust_result < 0.05]
对差异基因进行输出。