问题
你想要按照组别总结你的数据(均值、标准差等等)。
方案
有三种方法描述基于一些特定变量的分组数据,然后对每一组使用总结函数(像均值、标准差等等)。
ddply()
函数:它比较容易使用,但需要载入plyr
包。这种方法可能就是你要找的(说明很多人用呗,好用呗)。summaryBy()
函数:它也比较容易使用,然而它需要载入doBy
包。aggregate()
函数,它比较难使用一点但内置于R中。
假设你有以下数据并想求得每一组样本大小、均值的改变、标准差以及均值的标准误,而这里的组别是根据性别和条件指定的:F-placebo, F-aspirin, M-placebo和 M-aspirin。
代码语言:javascript复制data <- read.table(header=TRUE, text='
subject sex condition before after change
1 F placebo 10.1 6.9 -3.2
2 F placebo 6.3 4.2 -2.1
3 M aspirin 12.4 6.3 -6.1
4 F placebo 8.1 6.1 -2.0
5 M aspirin 15.2 9.9 -5.3
6 F aspirin 10.9 7.0 -3.9
7 F aspirin 11.6 8.5 -3.1
8 M aspirin 9.5 3.0 -6.5
9 F placebo 11.5 9.0 -2.5
10 M placebo 11.9 11.0 -0.9
11 F aspirin 11.4 8.0 -3.4
12 M aspirin 10.0 4.4 -5.6
13 M aspirin 12.5 5.4 -7.1
14 M placebo 10.6 10.6 0.0
15 M aspirin 9.1 4.3 -4.8
16 F placebo 12.1 10.2 -1.9
17 F placebo 11.0 8.8 -2.2
18 F placebo 11.9 10.2 -1.7
19 M aspirin 9.1 3.6 -5.5
20 M placebo 13.5 12.4 -1.1
21 M aspirin 12.0 7.5 -4.5
22 F placebo 9.1 7.6 -1.5
23 M placebo 9.9 8.0 -1.9
24 F placebo 7.6 5.2 -2.4
25 F placebo 11.8 9.7 -2.1
26 F placebo 11.8 10.7 -1.1
27 F aspirin 10.1 7.9 -2.2
28 M aspirin 11.6 8.3 -3.3
29 F aspirin 11.3 6.8 -4.5
30 F placebo 10.3 8.3 -2.0
')
使用 ddply
代码语言:javascript复制library(plyr)
# 给每一组运行长度、均值、标准差等函数
# 每一组依据性别 条件划分
cdata <- ddply(data, c("sex", "condition"), summarise,
N = length(change),
mean = mean(change),
sd = sd(change),
se = sd / sqrt(N)
)
cdata
#> sex condition N mean sd se
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190
#> 4 M placebo 4 -0.975000 0.7804913 0.3902456
处理缺失值
如果数据中存在NA值,需要给每个函数添加na.rm=TRUE
标记去除缺失值。因为函数length()
没有na.rm
选项,所以可以使用sum(!is.na(...))
的方式对非缺失值进行计数。
# 给数据加些NA值
dataNA <- data
dataNA$change[11:14] <- NA
cdata <- ddply(dataNA, c("sex", "condition"), summarise,
N = sum(!is.na(change)),
mean = mean(change, na.rm=TRUE),
sd = sd(change, na.rm=TRUE),
se = sd / sqrt(N)
)
cdata
#> sex condition N mean sd se
#> 1 F aspirin 4 -3.425000 0.9979145 0.4989572
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867
#> 3 M aspirin 7 -5.142857 1.0674848 0.4034713
#> 4 M placebo 3 -1.300000 0.5291503 0.3055050
自动总结函数
不像我们刚才手动地指定想要的值然后计算标准误,这个函数可以自动处理所有的细节。它可以干以下的事情:
- 寻找均值、标准差和计数
- 寻找均值的标准误(强调,如果你处理的是被试内变量这可能不是你想要的)
- 寻找95%的置信区间(也可以指定其他值)
- 重命令结果数据集的变量名,这样更方便后续处理
要使用的话,把函数放你的代码中然后像下面一样调用它。
代码语言:javascript复制## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summariezed
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
library(plyr)
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
举个例子使用它(用95%的置信区间)。与之前手动计算这些步骤相反,summarySE
函数一步搞定:
summarySE(data, measurevar="change", groupvars=c("sex", "condition"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4 M placebo 4 -0.975000 0.7804913 0.3902456 1.2419358
# With a data set with NA's, use na.rm=TRUE
summarySE(dataNA, measurevar="change", groupvars=c("sex", "condition"), na.rm=TRUE)
#> sex condition N change sd se ci
#> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 1.5879046
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 0.9872588
#> 4 M placebo 3 -1.300000 0.5291503 0.3055050 1.3144821
用零填满空组合
有时候总结的数据框中存在因子的空组合 - 这意思是,因子组合可能存在,但原始数据框里又没有实际出现。它在自动填满有NA值的数据框时有用。要做到这一点,当调用ddply
或summarySE
时设置.drop=FALSE
。(这里我翻译的不是很如意,大家可以查看原文)
例子:
代码语言:javascript复制# 首先移除所有 Male Placebo 条目
dataSub <- subset(data, !(sex=="M" & condition=="placebo"))
# 如果我们总结数据,在本来有 Male Placebo 的地方会存在空行
# 因为这个组合已经被我们删除了
summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
# 设置 .drop=FALSE 指定不要扔掉这个组合
summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"), .drop=FALSE)
#> Warning in qt(conf.interval/2 0.5, datac$N - 1): NaNs produced
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4 M placebo 0 NaN NA NA NA
使用summaryBy
使用summarizeBy()
函数瓦解数据:
library(doBy)
# 给每一组运行长度、均值、标准差等函数
# 每一组依据性别 条件划分
cdata <- summaryBy(change ~ sex condition, data=data, FUN=c(length,mean,sd))
cdata
#> sex condition change.length change.mean change.sd
#> 1 F aspirin 5 -3.420000 0.8642916
#> 2 F placebo 12 -2.058333 0.5247655
#> 3 M aspirin 9 -5.411111 1.1307569
#> 4 M placebo 4 -0.975000 0.7804913
# Rename column change.length to just N
names(cdata)[names(cdata)=="change.length"] <- "N"
# Calculate standard error of the mean
cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
cdata
#> sex condition N change.mean change.sd change.se
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190
#> 4 M placebo 4 -0.975000 0.7804913 0.3902456
注意如果你有任何被试内变量,这些标准误值在比对组别差异时就没用了。
处理缺失值
如果数据中存在NA值,你需要添加na.rm=TRUE
选项。通常你可以在summaryBy()
函数中设置,但length()
函数识别不了这个选项。一种解决方式是根据length()
函数定义一个新的取长度函数去处理NA值。
# 新版的length函数可以处理NA值,如果na.rm=T,则不对NA计数
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# 给数据添加一些NA值
dataNA <- data
dataNA$change[11:14] <- NA
cdataNA <- summaryBy(change ~ sex condition, data=dataNA,
FUN=c(length2, mean, sd), na.rm=TRUE)
cdataNA
#> sex condition change.length2 change.mean change.sd
#> 1 F aspirin 4 -3.425000 0.9979145
#> 2 F placebo 12 -2.058333 0.5247655
#> 3 M aspirin 7 -5.142857 1.0674848
#> 4 M placebo 3 -1.300000 0.5291503
# Now, do the same as before
自动总结函数
(注意这里的自动总结函数与之前的不同,它是通过summaryBy
实现的)
不像我们刚才手动地指定想要的值然后计算标准误,这个函数可以自动处理所有的细节。它可以干以下的事情:
- 寻找均值、标准差和计数
- 寻找均值的标准误(强调,如果你处理的是被试内变量这可能不是你想要的)
- 寻找95%的置信区间(也可以指定其他值)
- 重命令结果数据集的变量名,这样更方便后续处理
要使用的话,把函数放你的代码中然后像下面一样调用它。
代码语言:javascript复制## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence
## interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summariezed
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
library(doBy)
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# Collapse the data
formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" "), sep=" ~ "))
datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
# Rename columns
names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- measurevar
names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd"
names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
举个例子使用它(用95%的置信区间)。与之前手动计算这些步骤相反,summarySE
函数一步搞定:
summarySE(data, measurevar="change", groupvars=c("sex","condition"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4 M placebo 4 -0.975000 0.7804913 0.3902456 1.2419358
# With a data set with NA's, use na.rm=TRUE
summarySE(dataNA, measurevar="change", groupvars=c("sex","condition"), na.rm=TRUE)
#> sex condition N change sd se ci
#> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 1.5879046
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 0.9872588
#> 4 M placebo 3 -1.300000 0.5291503 0.3055050 1.3144821
用零填满空组合
有时候总结的数据框中存在因子的空组合 - 这意思是,因子组合可能存在,但原始数据框里又没有实际出现。它在自动填满有NA值的数据框时有用。
这个例子将会用0填满缺失的组合:
代码语言:javascript复制fillMissingCombs <- function(df, factors, measures) {
# Make a list of the combinations of factor levels
levelList <- list()
for (f in factors) { levelList[[f]] <- levels(df[,f]) }
fullFactors <- expand.grid(levelList)
dfFull <- merge(fullFactors, df, all.x=TRUE)
# Wherever there is an NA in the measure vars, replace with 0
for (m in measures) {
dfFull[is.na(dfFull[,m]), m] <- 0
}
return(dfFull)
}
使用例子:
代码语言:javascript复制# First remove some all Male Placebo entries from the data
dataSub <- subset(data, !(sex=="M" & condition=="placebo"))
# If we summarize the data, there will be a missing row for Male Placebo,
# since there were no cases with this combination.
cdataSub <- summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
cdataSub
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
# This will fill in the missing combinations with zeros
fillMissingCombs(cdataSub, factors=c("sex","condition"), measures=c("N","change","sd","se","ci"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
#> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
#> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
#> 4 M placebo 0 0.000000 0.0000000 0.0000000 0.0000000
使用 aggregate
aggregate
函数比较难用,但它内置于R,所以不需要按照其他包。
# 对每个目录 (sex*condition) 中的对象计数
cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
cdata
#> sex condition subject
#> 1 F aspirin 5
#> 2 M aspirin 9
#> 3 F placebo 12
#> 4 M placebo 4
# 重命名 "subject" 列为 "N"
names(cdata)[names(cdata)=="subject"] <- "N"
cdata
#> sex condition N
#> 1 F aspirin 5
#> 2 M aspirin 9
#> 3 F placebo 12
#> 4 M placebo 4
# 按性别排序
cdata <- cdata[order(cdata$sex),]
cdata
#> sex condition N
#> 1 F aspirin 5
#> 3 F placebo 12
#> 2 M aspirin 9
#> 4 M placebo 4
# 我们也保留 before 和 after列:
# 得到性别和条件下的平均影响大小
# Get the average effect size by sex and condition
cdata.means <- aggregate(data[c("before","after","change")],
by = data[c("sex","condition")], FUN=mean)
cdata.means
#> sex condition before after change
#> 1 F aspirin 11.06000 7.640000 -3.420000
#> 2 M aspirin 11.26667 5.855556 -5.411111
#> 3 F placebo 10.13333 8.075000 -2.058333
#> 4 M placebo 11.47500 10.500000 -0.975000
# 融合数据框
cdata <- merge(cdata, cdata.means)
cdata
#> sex condition N before after change
#> 1 F aspirin 5 11.06000 7.640000 -3.420000
#> 2 F placebo 12 10.13333 8.075000 -2.058333
#> 3 M aspirin 9 11.26667 5.855556 -5.411111
#> 4 M placebo 4 11.47500 10.500000 -0.975000
# 得到标准差
cdata.sd <- aggregate(data["change"],
by = data[c("sex","condition")], FUN=sd)
# 重命名列
names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
cdata.sd
#> sex condition change.sd
#> 1 F aspirin 0.8642916
#> 2 M aspirin 1.1307569
#> 3 F placebo 0.5247655
#> 4 M placebo 0.7804913
# 融合
cdata <- merge(cdata, cdata.sd)
cdata
#> sex condition N before after change change.sd
#> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916
#> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655
#> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569
#> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913
# 计算标准误
cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
cdata
#> sex condition N before after change change.sd change.se
#> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916 0.3865230
#> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655 0.1514867
#> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569 0.3769190
#> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
如果你有NA值想要跳过,设置 na.rm=TRUE
:
cdata.means <- aggregate(data[c("before","after","change")],
by = data[c("sex","condition")],
FUN=mean, na.rm=TRUE)
cdata.means
#> sex condition before after change
#> 1 F aspirin 11.06000 7.640000 -3.420000
#> 2 M aspirin 11.26667 5.855556 -5.411111
#> 3 F placebo 10.13333 8.075000 -2.058333
#> 4 M placebo 11.47500 10.500000 -0.975000