前几天对LULU这种方法进行了介绍: 但是! 我在组会上讲了一下这个方法的原理,马上得到了老板的灵魂连击:
- 为啥作者用植物数据做,这个方法对细菌的适用度有多高?
- 序列相似性为啥默认为84%?这对于微生物来说阈值也太低了。这样划分的依据是什么?
- 共现性具体是怎么计算的?计算的是perason还是spearman?另外对于0是怎么处理的?
其实这些问题我之前也想到过,但是NC原文确实都没有写,我也回答不上来。不知道当时审稿人是怎么提的意见,这些关键的细节作者竟然都没有写清楚。 没办法只好看LULU的源代码,也只解决了第三个问题。前两个的原因不清楚。因此在用于细菌群落,及设定相似性阈值的时候要小心。
这一段是对OTU进行排序,先按照每个OTU在多少个样本中检测到,再按照每个OTU的序列数,从高到底排序。
代码语言:javascript复制lulu <- function(otutable, matchlist, minimum_ratio_type = "min", minimum_ratio = 1, minimum_match = 84, minimum_relative_cooccurence = 0.95) {
require(dplyr)
start.time <- Sys.time()
colnames(matchlist) <- c("OTUid", "hit", "match")
# remove no hits (vsearch)
matchlist = matchlist[which(matchlist$hit != "*"), ]
# remove self-hits
matchlist = matchlist[which(matchlist$hit != matchlist$OTUid), ]
# Making a separate table with stats (total readcount and spread).
statistics_table <- otutable[, 0]
statistics_table$total <- rowSums(otutable)
# calculating spread (number of presences (samples with 1 read) pr OTU)
statistics_table$spread <- rowSums(otutable > 0) #包含此OTU的样本的个数
statistics_table <- statistics_table[with(statistics_table,
order(spread,
total,
decreasing = TRUE)), ]
otutable <- otutable[match(row.names(statistics_table),
row.names(otutable)), ]
这一段是对每个OTU,判断与其他OTU的相似度,默认84%。挑出这些相似度高于84%且丰度高于此OTU的对应OTU,最为潜在的parent OTU。
代码语言:javascript复制 statistics_table$parent_id <- "NA"
log_con <- file(paste0("lulu.log_", format(start.time, "%Y%m%d_%H%M%S")),
open = "a")
for (line in seq(1:nrow(statistics_table))) { #line=2
# make a progressline
print(paste0("progress: ",
round(((line/nrow(statistics_table)) * 100), 0), "%"))
potential_parent_id <- row.names(otutable)[line]
cat(paste0("n", "####processing: ", potential_parent_id, " #####"),
file = log_con)
daughter_samples <- otutable[line, ]
#minimum_match = 84
hits <- matchlist[which(matchlist$OTUid == potential_parent_id &
matchlist$match > minimum_match), "hit"]
cat(paste0("n", "---hits: ", hits), file = log_con)
last_relevant_entry <- sum(statistics_table$spread >=
statistics_table$spread[line])
potential_parents <- which(row.names(otutable)[1:last_relevant_entry]
%in% hits)
cat(paste0("n", "---potential parent: ",
row.names(statistics_table)[potential_parents]), file = log_con)
success <- FALSE
对于这些潜在的parent OTU与考察的OTU, 先计算共发生率。即parent OTU在样本中出现的次数/考察的OTU在样本中出现的次数。若小于阈值(95%),则考察的OTU为真实OTU。若大于阈值,进行abundance比例的进一步判断。abundance比例默认为1。
这里又有两种方法,默认为min。对于每个样本,计算parent OTU序列数/考察的OTU序列数。得到的所有值取平均。若此值大于阈值(1),考察的OTU为真OTU。否则为假OTU。
另外还有一种average方法,即得到的所有值取最小值。
代码语言:javascript复制 if (length(potential_parents) > 0) {
for (line2 in potential_parents) { #line2 = 1
cat(paste0("n", "------checking: ", row.names(statistics_table)[line2]),
file = log_con)
if (!success) {
relative_cooccurence <-
sum((daughter_samples[otutable[line2, ] > 0]) > 0)/
sum(daughter_samples > 0)
cat(paste0("n", "------relative cooccurence: ",
relative_cooccurence), file = log_con)
if (relative_cooccurence >= minimum_relative_cooccurence) {
cat(paste0(" which is sufficient!"), file = log_con)
if (minimum_ratio_type == "avg") {
relative_abundance <-
mean(otutable[line2, ][daughter_samples > 0]/
daughter_samples[daughter_samples > 0])
cat(paste0("n", "------mean avg abundance: ",
relative_abundance), file = log_con)
} else {
relative_abundance <-
min(otutable[line2, ][daughter_samples > 0]/
daughter_samples[daughter_samples > 0])
cat(paste0("n", "------min avg abundance: ",
relative_abundance), file = log_con)
}
if (relative_abundance > minimum_ratio) {
cat(paste0(" which is OK!"), file = log_con)
if (line2 < line) {
statistics_table$parent_id[line] <-
statistics_table[row.names(otutable)[line2],"parent_id"]
cat(paste0("n", "SETTING ",
potential_parent_id, " to be an ERROR of ",
(statistics_table[row.names(otutable)[line2],
"parent_id"]), "n"),
file = log_con)
} else {
statistics_table$parent_id[line] <- row.names(otutable)[line2]
cat(paste0("n", "SETTING ", potential_parent_id,
" to be an ERROR of ", (row.names(otutable)[line2]),
"n"), file = log_con)
}
success <- TRUE
}
}
}
}
}
if (!success) {
statistics_table$parent_id[line] <- row.names(statistics_table)[line]
cat(paste0("n", "No parent found!", "n"), file = log_con)
}
}
#输出结果 close(log_con)
total_abundances <- rowSums(otutable)
curation_table <- cbind(nOTUid = statistics_table$parent_id, otutable)
statistics_table$curated <- "merged"
curate_index <- row.names(statistics_table) == statistics_table$parent_id
statistics_table$curated[curate_index] <- "parent"
statistics_table <- transform(statistics_table,
rank = ave(total,FUN = function(x)
rank(-x, ties.method = "first")))
curation_table <- as.data.frame(curation_table %>%
group_by(nOTUid) %>%
summarise_all(funs(sum)))
row.names(curation_table) <- as.character(curation_table$nOTUid)
curation_table <- curation_table[, -1]
curated_otus <- names(table(statistics_table$parent_id))
curated_count <- length(curated_otus)
discarded_otus <- setdiff(row.names(statistics_table), curated_otus)
discarded_count <- length(discarded_otus)
end.time <- Sys.time()
time.taken <- end.time - start.time
result <- list(curated_table = curation_table,
curated_count = curated_count,
curated_otus = curated_otus,
discarded_count = discarded_count,
discarded_otus = discarded_otus,
runtime = time.taken,
minimum_match = minimum_match,
minimum_relative_cooccurence = minimum_relative_cooccurence,
otu_map = statistics_table,
original_table = otutable)
return(result)
}
综上,共现的计算先考虑OTU在样本出现的次数,再考虑丰度的比例。用的并不是pearson或spearman的方法,也没有对0进行处理。只是简单的计数与算比例。
链接:https://github.com/tobiasgf/lulu/blob/master/R/Functions.R
END