论文
Structural variants and short tandem repeats impact gene expression and splicing in bovine testis tissue
https://academic.oup.com/genetics/article/225/3/iyad161/7258327?login=false
代码
https://github.com/Meenu-Bhati/SV-STR/blob/main/RNA_quantification/TPM_normalization.r
https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py
https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py
为啥要做这个分位数标准化和反正则转换暂时不太理解,这个链接给出一些解释 https://bsunfullsite.github.io/post/quantile-nomrlization-and-inverse-normal-transform/quantile-normalization-and-inverse-normal-transform/
也没看太明白,基本上做eQTL分析的论文里方法部分都会写这个,比如开头提到的论文,方法不部分写到
Finally, TPM values were quantile normalized and inverse normal transformed across samples per gene using the R package RNOmni
水稻泛基因组的论文 (A super pan-genomic landscape of rice),做eQTL分析,方法部分写到
To obtain a normal distribution of expression values for each gene, FPKM values of each gene were further normalized using the quantile-quantile normalization (qqnorm) function in R (version 3.1.2).
玉米eQTL的论文 (A role for heritable transcriptomic variation in maize adaptation to temperate environments)方法部分写到
For each gene, expression values were transformed using the Box-Cox method [103] prior to mapping
西红柿泛基因组 eQTL分析部分方法
Genes with TPM values of >0.5 were defined as expressed. Only genes expressed in at least 100 accessions were retained for the downstream analyses. Raw expression levels were normalized with quantile–quantile normalization
这篇文章里提供的代码链接 https://github.com/YaoZhou89/TGG/blob/main/5.Genetic_analysis/scripts/prepare_gene_expression.R
这里标准化是自己自定义的函数
代码语言:javascript复制quantile_normalisation <- function(df){
df_rank <- apply(df,2,rank,ties.method="min")
df_sorted <- data.frame(apply(df, 2, sort))
df_mean <- apply(df_sorted, 1, mean)
index_to_mean <- function(my_index, my_mean){
return(my_mean[my_index])
}
df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
rownames(df_final) <- rownames(df)
return(df_final)
}
我试了一下这个函数的输出和preprocessCore::normalize.quantiles 这个函数的输出是一致的
https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py 这个链接里提供了python做这个标准化的函数
代码语言:javascript复制def normalize_quantiles(df):
"""
Quantile normalization to the average empirical distribution
Note: replicates behavior of R function normalize.quantiles
from library("preprocessCore")
Reference:
[1] Bolstad et al., Bioinformatics 19(2), pp. 185-193, 2003
Adapted from https://github.com/andrewdyates/quantile_normalize
"""
M = df.values.copy()
Q = M.argsort(axis=0)
m,n = M.shape
# compute quantile vector
quantiles = np.zeros(m)
for i in range(n):
quantiles = M[Q[:,i],i]
quantiles = quantiles / n
for i in range(n):
# Get equivalence classes; unique values == 0
dupes = np.zeros(m, dtype=np.int)
for j in range(m-1):
if M[Q[j,i],i] == M[Q[j 1,i],i]:
dupes[j 1] = dupes[j] 1
# Replace column with quantile ranks
M[Q[:,i],i] = quantiles
# Average together equivalence classes
j = m-1
while j >= 0:
if dupes[j] == 0:
j -= 1
else:
idxs = Q[j-dupes[j]:j 1,i]
M[idxs,i] = np.median(M[idxs,i])
j -= 1 dupes[j]
assert j == -1
return pd.DataFrame(M, index=df.index, columns=df.columns)
开头提到的论文里除了分位数标准化还做了反正则转换,这个有现成的R包 RNOmni
,代码
expr.int = t(apply(file_filter_norm, 1, RankNorm ))
eQTL分析还有一步是用peer这个包计算混杂因素(To remove potential batch effects and cconfounding factors),之前有一个困惑是直接用TPM值去计算混杂因素还是用标准化后的表达数据去计算这个混杂因素
https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl
这个链接里有一些步骤,这里用的是标准化后的数据。计算多少个混杂因素这个链接里也给出了建议
The number of PEER factors was selected as function of sample size (N):
- 15 factors for N < 150
- 30 factors for 150 ≤ N < 250
- 45 factors for 250 ≤ N < 350
- 60 factors for N ≥ 350