R语言中,当我们获取到了基因表达的count矩阵,怎么下载对应的基因长度并将count矩阵转换为FPKM矩阵
**********************************************
count矩阵:适用于差异基因表达分析
FPKM矩阵:适用于绘制heatmap
**********************************************
1. 从下方网址下载gencode.v21.annotation.gtf,其他版本也可以
https://www.gencodegenes.org/human/release_21.html
2. 加载文件进行处理
代码语言:r复制if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicFeatures") #安装必要library
library(GenomicFeatures)
filepath <- "gencode.v21.annotation.gtf.gz" #改成文件路径
txdb <- makeTxDbFromGFF(filepath ,format = "gtf")
genes <- exonsBy(txdb,by="gene")
genes_length <- lapply(genes,function(x){sum(width(reduce(x)))}) #慢!
genes_length <- as.data.frame(t(as.data.frame(genes_length)))
genes_length<- data.frame(gene_id=rownames(genes_length),len=genes_length$V1)
genes_length$gene_id <- gsub("\..*$", "", genes_length$gene_id)
genes_length
3. 处理后的结果
得到的结果,其中gene_id如需要,可以根据gtf文件,改成gene_symbol
gene_id | len | gene_symbol |
---|---|---|
ENSG00000000003 | 4535 | TSPAN6 |
ENSG00000000005 | 1610 | TNMD |
ENSG00000000419 | 1207 | DPM1 |
ENSG00000000457 | 6883 | SCYL3 |
ENSG00000000460 | 5967 | C1orf112 |
ENSG00000000938 | 3474 | FGR |
4. 读取count矩阵,每行一个基因,每列一个样本,行名为基因,列名为样本,基因不能作为第一列
示例为生成一个count矩阵:
代码语言:r复制# 创建基因表达数据框
gene_names <- c("TSPAN6", "TNMD", "DPM1", "SCYL3", "C1orf112", "FGR")
sample_names <- paste0("Sample", 1:6)
# 模拟基因表达数据
set.seed(123) # 设置随机数种子以便复现结果
count_matrix <- data.frame(matrix(rpois(length(gene_names) * length(sample_names), lambda = 10),
nrow = length(gene_names), ncol = length(sample_names)))
rownames(count_matrix) <- gene_names
colnames(count_matrix) <- sample_names
Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 | |
---|---|---|---|---|---|---|
TSPAN6 | 8 | 11 | 10 | 6 | 6 | 12 |
TNMD | 9 | 5 | 8 | 8 | 8 | 11 |
DPM1 | 14 | 4 | 15 | 6 | 5 | 9 |
SCYL3 | 10 | 13 | 11 | 8 | 12 | 9 |
C1orf112 | 10 | 11 | 3 | 4 | 12 | 8 |
FGR | 15 | 11 | 7 | 6 | 12 | 7 |
5. 计算FPKM
代码语言:r复制# Step 1: 从gene_length的字典中,选择与count_matrix行名对应的基因长度
select_gene_length <- gene_length[gene_length$gene_symbol %in% rownames(gene_expression),]
# Step 2: 计算FPKM
countToFPKM <- function(counts, effLen){
N <- sum(counts)
exp( log(counts) log(1e9) - log(effLen) - log(N) )
}
fpkms <- as.data.frame(apply(count_matrix, 2, countToFPKM, effLen = select_gene_length$len))
fpkms