不让人省心的LULU!

2020-06-01 16:22:58 浏览数 (2)

前几天对LULU这种方法进行了介绍: 但是! 我在组会上讲了一下这个方法的原理,马上得到了老板的灵魂连击:

  1. 为啥作者用植物数据做,这个方法对细菌的适用度有多高?
  2. 序列相似性为啥默认为84%?这对于微生物来说阈值也太低了。这样划分的依据是什么?
  3. 共现性具体是怎么计算的?计算的是perason还是spearman?另外对于0是怎么处理的?

其实这些问题我之前也想到过,但是NC原文确实都没有写,我也回答不上来。不知道当时审稿人是怎么提的意见,这些关键的细节作者竟然都没有写清楚。 没办法只好看LULU的源代码,也只解决了第三个问题。前两个的原因不清楚。因此在用于细菌群落,及设定相似性阈值的时候要小心。

这一段是对OTU进行排序,先按照每个OTU在多少个样本中检测到,再按照每个OTU的序列数,从高到底排序。

代码语言:javascript复制
lulu <- function(otutable, matchlist, minimum_ratio_type = "min", minimum_ratio = 1, minimum_match = 84, minimum_relative_cooccurence = 0.95) {

  require(dplyr)

  start.time <- Sys.time()

  colnames(matchlist) <- c("OTUid", "hit", "match")

  # remove no hits (vsearch)

  matchlist = matchlist[which(matchlist$hit != "*"), ]

  # remove self-hits

  matchlist = matchlist[which(matchlist$hit != matchlist$OTUid), ]

  # Making a separate table with stats (total readcount and spread).


  statistics_table <- otutable[, 0]

  statistics_table$total <- rowSums(otutable)



  # calculating spread (number of presences (samples with 1  read) pr OTU)

  statistics_table$spread <- rowSums(otutable > 0) #包含此OTU的样本的个数

  statistics_table <- statistics_table[with(statistics_table,

                                            order(spread,

                                                  total,

                                                  decreasing = TRUE)), ]

  otutable <- otutable[match(row.names(statistics_table),

                             row.names(otutable)), ]

这一段是对每个OTU,判断与其他OTU的相似度,默认84%。挑出这些相似度高于84%且丰度高于此OTU的对应OTU,最为潜在的parent OTU。

代码语言:javascript复制
  statistics_table$parent_id <- "NA"

  log_con <- file(paste0("lulu.log_", format(start.time, "%Y%m%d_%H%M%S")),

                  open = "a")

  for (line in seq(1:nrow(statistics_table))) {  #line=2

    # make a progressline

    print(paste0("progress: ",

                 round(((line/nrow(statistics_table)) * 100), 0), "%"))

    potential_parent_id <- row.names(otutable)[line]

    cat(paste0("n", "####processing: ", potential_parent_id, " #####"),

        file = log_con)

    daughter_samples <- otutable[line, ]

    #minimum_match = 84

    hits <- matchlist[which(matchlist$OTUid == potential_parent_id &

                              matchlist$match > minimum_match), "hit"]

    cat(paste0("n", "---hits: ", hits), file = log_con)

    last_relevant_entry <- sum(statistics_table$spread >=

                                 statistics_table$spread[line])

    potential_parents <- which(row.names(otutable)[1:last_relevant_entry]

                               %in% hits)

    cat(paste0("n", "---potential parent: ",

               row.names(statistics_table)[potential_parents]), file = log_con)

    success <- FALSE

对于这些潜在的parent OTU与考察的OTU, 先计算共发生率。即parent OTU在样本中出现的次数/考察的OTU在样本中出现的次数。若小于阈值(95%),则考察的OTU为真实OTU。若大于阈值,进行abundance比例的进一步判断。abundance比例默认为1。

这里又有两种方法,默认为min。对于每个样本,计算parent OTU序列数/考察的OTU序列数。得到的所有值取平均。若此值大于阈值(1),考察的OTU为真OTU。否则为假OTU。

另外还有一种average方法,即得到的所有值取最小值。

代码语言:javascript复制
    if (length(potential_parents) > 0) {

      for (line2 in potential_parents) {  #line2 = 1

        cat(paste0("n", "------checking: ", row.names(statistics_table)[line2]),

            file = log_con)

        if (!success) {

          relative_cooccurence <-

            sum((daughter_samples[otutable[line2, ] > 0]) > 0)/

            sum(daughter_samples > 0)

          cat(paste0("n", "------relative cooccurence: ",

                     relative_cooccurence), file = log_con)

          if (relative_cooccurence >= minimum_relative_cooccurence) {

            cat(paste0(" which is sufficient!"), file = log_con)

            if (minimum_ratio_type == "avg") {

              relative_abundance <-

                mean(otutable[line2, ][daughter_samples > 0]/

                       daughter_samples[daughter_samples > 0])

              cat(paste0("n", "------mean avg abundance: ",

                         relative_abundance), file = log_con)

            } else {

              relative_abundance <-

                min(otutable[line2, ][daughter_samples > 0]/

                      daughter_samples[daughter_samples > 0])

              cat(paste0("n", "------min avg abundance: ",

                         relative_abundance), file = log_con)

            }

            if (relative_abundance > minimum_ratio) {

              cat(paste0(" which is OK!"), file = log_con)

              if (line2 < line) {

                statistics_table$parent_id[line] <-

                  statistics_table[row.names(otutable)[line2],"parent_id"]

                cat(paste0("n", "SETTING ",

                           potential_parent_id, " to be an ERROR of ",

                           (statistics_table[row.names(otutable)[line2],

                                             "parent_id"]), "n"),

                    file = log_con)

              } else {

                statistics_table$parent_id[line] <- row.names(otutable)[line2]

                cat(paste0("n", "SETTING ", potential_parent_id,

                           " to be an ERROR of ", (row.names(otutable)[line2]),

                           "n"), file = log_con)

              }

              success <- TRUE

            }

          }

        }

      }

    }

    if (!success) {

      statistics_table$parent_id[line] <- row.names(statistics_table)[line]

      cat(paste0("n", "No parent found!", "n"), file = log_con)

    }

  }

 #输出结果  close(log_con)

  total_abundances <- rowSums(otutable)

  curation_table <- cbind(nOTUid = statistics_table$parent_id, otutable)

  statistics_table$curated <- "merged"

  curate_index <- row.names(statistics_table) == statistics_table$parent_id

  statistics_table$curated[curate_index] <- "parent"

  statistics_table <- transform(statistics_table,

                                rank = ave(total,FUN = function(x)

                                  rank(-x, ties.method = "first")))

  curation_table <- as.data.frame(curation_table %>%

                                    group_by(nOTUid) %>%

                                    summarise_all(funs(sum)))

  row.names(curation_table) <- as.character(curation_table$nOTUid)

  curation_table <- curation_table[, -1]

  curated_otus <- names(table(statistics_table$parent_id))

  curated_count <- length(curated_otus)

  discarded_otus <- setdiff(row.names(statistics_table), curated_otus)

  discarded_count <- length(discarded_otus)

  end.time <- Sys.time()

  time.taken <- end.time - start.time

  result <- list(curated_table = curation_table,

                 curated_count = curated_count,

                 curated_otus = curated_otus,

                 discarded_count = discarded_count,

                 discarded_otus = discarded_otus,

                 runtime = time.taken,

                 minimum_match = minimum_match,

                 minimum_relative_cooccurence = minimum_relative_cooccurence,

                 otu_map = statistics_table,

                 original_table = otutable)



  return(result)

}

综上,共现的计算先考虑OTU在样本出现的次数,再考虑丰度的比例。用的并不是pearson或spearman的方法,也没有对0进行处理。只是简单的计数与算比例。

链接:https://github.com/tobiasgf/lulu/blob/master/R/Functions.R

END

0 人点赞