count值转FPKM(R语言)

2024-04-28 20:10:34 浏览数 (2)

R语言中,当我们获取到了基因表达的count矩阵,怎么下载对应的基因长度并将count矩阵转换为FPKM矩阵

**********************************************

count矩阵:适用于差异基因表达分析

FPKM矩阵:适用于绘制heatmap

**********************************************

1. 从下方网址下载gencode.v21.annotation.gtf,其他版本也可以

https://www.gencodegenes.org/human/release_21.html

2. 加载文件进行处理

代码语言:r复制
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicFeatures") #安装必要library

library(GenomicFeatures)
filepath <- "gencode.v21.annotation.gtf.gz" #改成文件路径
txdb <- makeTxDbFromGFF(filepath ,format = "gtf")
genes <- exonsBy(txdb,by="gene")
genes_length <- lapply(genes,function(x){sum(width(reduce(x)))}) #慢!
genes_length <- as.data.frame(t(as.data.frame(genes_length)))
genes_length<- data.frame(gene_id=rownames(genes_length),len=genes_length$V1)
genes_length$gene_id <- gsub("\..*$", "", genes_length$gene_id)
genes_length 

3. 处理后的结果

得到的结果,其中gene_id如需要,可以根据gtf文件,改成gene_symbol

gene_id

len

gene_symbol

ENSG00000000003

4535

TSPAN6

ENSG00000000005

1610

TNMD

ENSG00000000419

1207

DPM1

ENSG00000000457

6883

SCYL3

ENSG00000000460

5967

C1orf112

ENSG00000000938

3474

FGR

4. 读取count矩阵,每行一个基因,每列一个样本,行名为基因,列名为样本,基因不能作为第一列

示例为生成一个count矩阵:

代码语言:r复制
# 创建基因表达数据框
gene_names <- c("TSPAN6", "TNMD", "DPM1", "SCYL3", "C1orf112", "FGR")
sample_names <- paste0("Sample", 1:6)
# 模拟基因表达数据
set.seed(123) # 设置随机数种子以便复现结果
count_matrix <- data.frame(matrix(rpois(length(gene_names) * length(sample_names), lambda = 10), 
                          nrow = length(gene_names), ncol = length(sample_names)))
rownames(count_matrix) <- gene_names
colnames(count_matrix) <- sample_names

Sample1

Sample2

Sample3

Sample4

Sample5

Sample6

TSPAN6

8

11

10

6

6

12

TNMD

9

5

8

8

8

11

DPM1

14

4

15

6

5

9

SCYL3

10

13

11

8

12

9

C1orf112

10

11

3

4

12

8

FGR

15

11

7

6

12

7

5. 计算FPKM

代码语言:r复制
# Step 1: 从gene_length的字典中,选择与count_matrix行名对应的基因长度
select_gene_length <- gene_length[gene_length$gene_symbol %in% rownames(gene_expression),]

# Step 2: 计算FPKM
countToFPKM <- function(counts, effLen){
    N <- sum(counts)
    exp( log(counts)   log(1e9) - log(effLen) - log(N) )
}
fpkms <- as.data.frame(apply(count_matrix, 2, countToFPKM, effLen = select_gene_length$len))
fpkms

0 人点赞