GMSB文章九:微生物的相关关系组间波动

2024-06-30 15:53:40 浏览数 (1)

欢迎大家关注全网生信学习者系列:

  • WX公zhong号:生信学习者
  • Xiao hong书:生信学习者
  • 知hu:生信学习者
  • CDSN:生信学习者2

介绍

计算配对微生物在组间的相关关系波动情况进而评估不同分组的微生物状态。secom_linear 函数可以评估不同分组(例如,健康组与疾病组)中微生物分类群之间的线性相关性,帮助研究者理解不同分类群如何相互作用以及它们在不同状态下的相互关系。通过分析不同分组间微生物相关性的波动情况,secom_linear 函数能够揭示微生物群落结构的动态变化,这对于理解微生物群落对环境变化的响应至关重要。

在不同分组之间,微生物分类群的相互关系表现出显著的波动性。这种波动性反映了微生物群落结构在不同环境或条件下的动态变化,是评估微生物群落稳定性和功能多样性的关键指标。通过定量分析这些波动,研究者可以深入理解微生物群落如何响应外部扰动,以及它们在不同生态位中的作用和相互依赖性。

加载R包

代码语言:javascript复制
library(readr)
library(openxlsx)
library(tidyverse) 
library(igraph)
library(ggraph)
library(tidygraph)
library(ggpubr)
library(microbiome)
library(ANCOMBC)

导入数据

大家通过以下链接下载数据:

  • 百度网盘链接:https://pan.baidu.com/s/1fz5tWy4mpJ7nd6C260abtg
  • 提取码: 请关注WX公zhong号生信学习者后台发送 复现gmsb 获取提取码
代码语言:javascript复制
otu_table <- read_tsv("./data/GMSB-data/otu-table.tsv", show_col_types = FALSE)
​
tax <- read_tsv("./data/GMSB-data/taxonomy.tsv", show_col_types = FALSE)
​
meta_data <- read_csv("./data/GMSB-data/df_v1.csv", show_col_types = FALSE)

数据预处理

  • 提取差异物种丰度表
  • 合并分组变量和差异物种丰度表

Primary group: 按照频率分组

  • G1: # receptive anal intercourse = 0
  • G2: # receptive anal intercourse = 1
  • G3: # receptive anal intercourse = 2 - 5
  • G4: # receptive anal intercourse = 6
代码语言:javascript复制
# OTU table
otu_id <- otu_table$`#OTU ID`
otu_table <- data.frame(otu_table[, -1], check.names = FALSE, row.names = otu_id)
​
# Taxonomy table
otu_id <- tax$`Feature ID`
tax <- data.frame(tax[, - c(1, 3)], row.names = otu_id)
tax <- tax %>% 
  separate(col = Taxon, 
           into = c("Kingdom", "Phylum", "Class", "Order", 
                    "Family", "Genus", "Species"),
           sep = ";") %>%
  rowwise() %>%
  dplyr::mutate_all(function(x) strsplit(x, "__")[[1]][2]) %>%
  mutate(Species = ifelse(!is.na(Species) & !is.na(Genus),
                          paste(ifelse(strsplit(Genus, "")[[1]][1] == "[",
                                       strsplit(Genus, "")[[1]][2],
                                       strsplit(Genus, "")[[1]][1]), Species, sep = "."),
                          NA)) %>%
  ungroup()
tax <- as.matrix(tax)
rownames(tax) <- otu_id
tax[tax == ""] <- NA
​
# Meta data
meta_data$status <- factor(meta_data$status, levels = c("nc", "sc"))
meta_data$time2aids <- factor(meta_data$time2aids,
                             levels = c("never", "> 10 yrs",
                                        "5 - 10 yrs", "< 5 yrs"))
​
# Phyloseq object
OTU <- otu_table(otu_table, taxa_are_rows = TRUE)
META <- sample_data(meta_data)
sample_names(META) <- meta_data$sampleid
TAX <- tax_table(tax)
otu_data <- phyloseq(OTU, TAX, META)
species_data <- aggregate_taxa(otu_data, "Species")
​
tse <- mia::makeTreeSummarizedExperimentFromPhyloseq(otu_data)
tse <- tse[, tse$group1 != "missing"]
tse1 <- tse[, tse$group1 == "g1"]
tse2 <- tse[, tse$group1 == "g2"]
tse3 <- tse[, tse$group1 == "g3"]
tse4 <- tse[, tse$group1 == "g4"]
​
tse4
​
key_species <- c(
  "A.muciniphila", "B.caccae", "B.fragilis", "B.uniformis",
  "Bacteroides spp.", "Butyricimonas spp.", "Dehalobacterium spp.", 
  "Methanobrevibacter spp.", "Odoribacter spp.")
代码语言:javascript复制
class: TreeSummarizedExperiment 
dim: 6111 35 
metadata(0):
assays(1): counts
rownames(6111): 000e1601e0051888d502cd6a535ecdda 0011d81f43ec49d21f2ab956ab12de2f ...
  fff3a05f150a52a2b6c238d2926a660d fffc6ea80dc6bb8cafacbab2d3b157b3
rowData names(7): Kingdom Phylum ... Genus Species
colnames(35): F-195 F-133 ... F-368 F-377
colData names(45): sampleid subjid ... hbv hcv
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
rowLinks: NULL
rowTree: NULL
colLinks: NULL
colTree: NULL

函数

  • get_upper_tri:获取上三角矩阵结果
  • data_preprocess:获取相关性画图矩阵
代码语言:javascript复制
get_upper_tri <- function(cormat){
  
  cormat[lower.tri(cormat)] <- NA
  diag(cormat) <- NA
  
  return(cormat)
}
​
data_preprocess <- function(
  res, 
  type = "linear", 
  level = "Species", 
  tax) {
  
  if (type == "linear") {
    df_corr <- res$corr_fl
  } else {
    df_corr <- res$dcorr_fl
  }
  
  if (level == "Species") {
    tax_name <- colnames(df_corr)
    tax_name <- sapply(tax_name, function(x) {
      name <- ifelse(grepl("Genus:", x), 
             paste(strsplit(x, ":")[[1]][2], "spp."),
             ifelse(grepl("Species:", x), 
                    strsplit(x, ":")[[1]][2], x))
      return(name)
    })
    colnames(df_corr) <- tax_name
    rownames(df_corr) <- tax_name
    
    df_corr <- df_corr[tax, tax]
  } else {
    tax_name <- colnames(df_corr)
    tax_name <- sapply(tax_name, function(x) 
      strsplit(x, ":")[[1]][2])
    colnames(df_corr) <- tax_name
    rownames(df_corr) <- tax_name
    
    df_corr <- df_corr[tax, tax]
  }
  
  tax_name <- sort(colnames(df_corr))
  df_corr <- df_corr[tax_name, tax_name]
  
  df_clean <- data.frame(get_upper_tri(df_corr)) %>%
    rownames_to_column("var1") %>%
    pivot_longer(cols = -var1, names_to = "var2", values_to = "value") 
  
  # Correct for names
  df_name <- data.frame(
    var1 = unique(df_clean$var1),
    var2 = unique(df_clean$var2))
  
  for (i in seq_len(nrow(df_name))) {
    df_clean$var2[df_clean$var2 == df_name$var2[i]] = df_name$var1[i]
  }
  df_clean <- df_clean %>%
    filter(!is.na(value)) %>%
    mutate(value = round(value, 2))
  
  return(df_clean)
}

Linear correlations

secom_linear 函数是 ANCOMBC 包中的一个函数,用于在微生物组数据中进行线性相关性的稀疏估计。该函数支持三种相关性系数的计算:皮尔逊(Pearson)、斯皮尔曼(Spearman)和肯德尔(Kendall's tau)相关系数。以下是 secom_linear 函数的主要参数和它们的作用:

  • data: 包含微生物组数据的列表。
  • assay_name: 指定数据集中的哪个检测类型(如“counts”)。
  • tax_level: 指定使用的分类水平,例如“Phylum”(门)。
  • pseudo: 伪计数,用于稳定稀疏矩阵的计算。
  • prv_cut: 用于过滤掉低丰度的物种的阈值。
  • lib_cut: 用于过滤掉低测序深度的样本的阈值。
  • corr_cut: 用于过滤掉低相关性的阈值。
  • wins_quant: 用于确定窗口大小的分位数。
  • method: 指定计算哪种相关性系数,可以是“pearson”、“spearman”。
  • soft: 是否使用软阈值。
  • thresh_len: 硬阈值的长度。
  • n_cv: 交叉验证的迭代次数。
  • thresh_hard: 硬阈值,用于确定最终的相关性矩阵。
  • max_p: 最大 p 值,用于多重测试校正。
  • n_cl: 聚类的数量。

函数会返回两个主要的结果对象:corr_thcorr_fl,分别代表阈值相关性矩阵和完整相关性矩阵。这些矩阵提供了不同物种或分类水平之间的线性相关性估计。

Run SECOM

secom_linear 函数1)首先通过设置不同的阈值来过滤数据,2)然后使用指定的方法计算相关性系数,3)并通过交叉验证等技术来确定最终的相关性矩阵。这个过程涉及到数据的预处理、相关性计算和结果的后处理,以确保相关性估计的准确性和稀疏性。

代码语言:javascript复制
set.seed(123)
​
res_linear1 <- ANCOMBC::secom_linear(
  data = list(tse1), assay_name = "counts",
  tax_level = "Species", pseudo = 0, 
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95), method = "pearson",
  soft = FALSE, thresh_len = 20, n_cv = 10,
  thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
names(res_linear1)
​
res_linear2 <- ANCOMBC::secom_linear(
  data = list(tse2), assay_name = "counts",
  tax_level = "Species", pseudo = 0,
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95), method = "pearson", 
  soft = FALSE, thresh_len = 20, n_cv = 10, 
  thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
res_linear3 <- ANCOMBC::secom_linear(
  data = list(tse3), assay_name = "counts",
  tax_level = "Species", pseudo = 0, 
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95), method = "pearson",
  soft = FALSE, thresh_len = 20, n_cv = 10, 
  thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
res_linear4 <- ANCOMBC::secom_linear(
  data = list(tse4), assay_name = "counts",
  tax_level = "Species", pseudo = 0, 
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95), method = "pearson",
  soft = FALSE, thresh_len = 20, n_cv = 10, 
  thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
species1 <- rownames(res_linear1$corr_fl)
species1 <- species1[grepl("Species:", species1)|grepl("Genus:", species1)]
species2 <- rownames(res_linear2$corr_fl)
species2 <- species2[grepl("Species:", species2)|grepl("Genus:", species2)]
species3 <- rownames(res_linear3$corr_fl)
species3 <- species3[grepl("Species:", species3)|grepl("Genus:", species3)]
species4 <- rownames(res_linear4$corr_fl)
species4 <- species4[grepl("Species:", species4)|grepl("Genus:", species4)]
common_species <- Reduce(intersect, list(species1, species2, species3, species4))
common_species <- sapply(common_species, function(x) 
  ifelse(grepl("Genus:", x), 
                        paste(strsplit(x, ":")[[1]][2], "spp."),
                        strsplit(x, ":")[[1]][2]))
代码语言:javascript复制
[1] "s_diff_hat"  "y_hat"       "cv_error"    "thresh_grid" "thresh_opt"  "mat_cooccur" "corr"        "corr_p"     
 [9] "corr_th"     "corr_fl"  

结果:四个分组的线性相关的共有物种,查看微生物两两之间的相关系数

Visualization

可视化同一组微生物两两之间的相关系数在不同组的变化状态

代码语言:javascript复制
df_corr1 <- data_preprocess(res_linear1, type = "linear", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G1")
df_corr2 <- data_preprocess(res_linear2, type = "linear", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G2")
df_corr3 <- data_preprocess(res_linear3, type = "linear", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G3")
df_corr4 <- data_preprocess(res_linear4, type = "linear", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G4")
​
df_corr <- do.call('rbind', list(df_corr1, df_corr2, df_corr3, df_corr4)) %>%
  unite("pair", var1:var2, sep = " vs. ")
​
value_check <- df_corr %>%
  dplyr::group_by(pair) %>%
  dplyr::summarise(empty_idx = ifelse(all(value == 0), TRUE, FALSE))
​
non_empty_pair <- value_check %>%
  filter(empty_idx == FALSE) %>%
  .$pair
df_fig <- df_corr %>%
  dplyr::filter(pair %in% non_empty_pair)
​
fig_species_linear <- df_fig %>%
  ggline(x = "group", y = "value",
         color = "steelblue", facet.by = "pair",
         xlab = "Correlation coefficient", ylab = "", title = "Pearson Correlation")  
  scale_y_continuous(breaks = seq(0, 0.8, 0.2), limits = c(0, 0.9))  
  geom_text(aes(label = round(value, 2)), vjust = -0.5)  
  theme(plot.title = element_text(hjust = 0.5))
​
fig_species_linear

结果:同一微生物对两两之间的相关关系在不同分组不同,这可能表明不同状态下,微生物之间的相关关系不一样或意味着不同的微生物模式。

Nonlinear correlations

secom_linear 函数是 ANCOMBC 包中的一个函数,用于在微生物组数据中进行线性相关性的稀疏估计。该函数支持三种相关性系数的计算:皮尔逊(Pearson)、斯皮尔曼(Spearman)和肯德尔(Kendall's tau)相关系数。以下是 secom_linear 函数的主要参数和它们的作用:

  • data: 包含微生物组数据的列表。
  • assay_name: 指定数据集中的哪个检测类型(如“counts”)。
  • tax_level: 指定使用的分类水平,例如“Phylum”(门)。
  • pseudo: 伪计数,用于稳定稀疏矩阵的计算。
  • prv_cut: 用于过滤掉低丰度的物种的阈值。
  • lib_cut: 用于过滤掉低测序深度的样本的阈值。
  • corr_cut: 用于过滤掉低相关性的阈值。
  • wins_quant: 用于确定窗口大小的分位数。
  • method: 指定计算哪种相关性系数,可以是“pearson”、“spearman”。
  • soft: 是否使用软阈值。
  • thresh_len: 硬阈值的长度。
  • n_cv: 交叉验证的迭代次数。
  • thresh_hard: 硬阈值,用于确定最终的相关性矩阵。
  • max_p: 最大 p 值,用于多重测试校正。
  • n_cl: 聚类的数量。

函数会返回两个主要的结果对象:corr_thcorr_fl,分别代表阈值相关性矩阵和完整相关性矩阵。这些矩阵提供了不同物种或分类水平之间的线性相关性估计。

Run SECOM

secom_linear 函数1)首先通过设置不同的阈值来过滤数据,2)然后使用指定的方法计算相关性系数,3)并通过交叉验证等技术来确定最终的相关性矩阵。这个过程涉及到数据的预处理、相关性计算和结果的后处理,以确保相关性估计的准确性和稀疏性。

代码语言:javascript复制
set.seed(123)
​
res_dist1 <- ANCOMBC::secom_dist(
  data = list(tse1), assay_name = "counts",
  tax_level = "Species", pseudo = 0, 
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  R = 1000, thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
res_dist2 <- ANCOMBC::secom_dist(
  data = list(tse2), assay_name = "counts",
  tax_level = "Species", pseudo = 0,
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  R = 1000, thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
res_dist3 <- ANCOMBC::secom_dist(
  data = list(tse3), assay_name = "counts",
  tax_level = "Species", pseudo = 0, 
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  R = 1000, thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
res_dist4 <- ANCOMBC::secom_dist(
  data = list(tse4), assay_name = "counts",
  tax_level = "Species", pseudo = 0,
  prv_cut = 0.1, lib_cut = 1000, corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  R = 1000, thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
​
species1 <- rownames(res_dist1$corr_fl)
species1 <- species1[grepl("Species:", species1)|grepl("Genus:", species1)]
species2 <- rownames(res_dist2$corr_fl)
species2 <- species2[grepl("Species:", species2)|grepl("Genus:", species2)]
species3 <- rownames(res_dist3$corr_fl)
species3 <- species3[grepl("Species:", species3)|grepl("Genus:", species3)]
species4 <- rownames(res_dist4$corr_fl)
species4 <- species4[grepl("Species:", species4)|grepl("Genus:", species4)]
common_species <- Reduce(intersect, list(species1, species2, species3, species4))
common_species <- sapply(common_species, function(x) 
  ifelse(grepl("Genus:", x), 
                        paste(strsplit(x, ":")[[1]][2], "spp."),
                        strsplit(x, ":")[[1]][2]))

结果:四个分组的线性相关的共有物种,查看微生物两两之间的相关系数

Visualization

可视化同一组微生物两两之间的相关系数在不同组的变化状态

代码语言:javascript复制
df_corr1 <- data_preprocess(res_dist1, type = "dist", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G1")
df_corr2 <- data_preprocess(res_dist2, type = "dist", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G2")
df_corr3 <- data_preprocess(res_dist3, type = "dist", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G3")
df_corr4 <- data_preprocess(res_dist4, type = "dist", level = "Species", tax = key_species) %>%
  dplyr::mutate(group = "G4")
​
df_corr <- do.call('rbind', list(df_corr1, df_corr2, df_corr3, df_corr4)) %>%
  unite("pair", var1:var2, sep = " vs. ")
​
value_check <- df_corr %>%
  dplyr::group_by(pair) %>%
  dplyr::summarise(empty_idx = ifelse(all(value == 0), TRUE, FALSE))
​
non_empty_pair <- value_check %>%
  filter(empty_idx == FALSE) %>%
  .$pair
df_fig <- df_corr %>%
  dplyr::filter(pair %in% non_empty_pair)
​
fig_species_dist <- df_fig %>%
  ggline(x = "group", y = "value",
         color = "steelblue", facet.by = "pair",
         xlab = "", ylab = "", title = "Distance Correlation")  
  scale_y_continuous(breaks = seq(0, 0.8, 0.2), limits = c(0, 0.9))  
  geom_text(aes(label = round(value, 2)), vjust = -0.5)  
  theme(plot.title = element_text(hjust = 0.5))
​
fig_species_dist

结果:同一微生物对两两之间的相关关系在不同分组不同,这可能表明不同状态下,微生物之间的相关关系不一样或意味着不同的微生物模式。

  • B.caccae vs. Bacteroides spp.的距离相关系数在G2组是0.68,而在G4组则是0,相比G4组,其他三个组是较为轻微的症状。同样的发现也在Bacteroides spp. vs. Butyricimonas spp.Bacteroides spp. vs. Odoribacter spp.中出现。

0 人点赞