eQTL分析中对转录组表达量的值进行分位数标准化和反正则转换

论文

Structural variants and short tandem repeats impact gene expression and splicing in bovine testis tissue

https://academic.oup.com/genetics/article/225/3/iyad161/7258327?login=false

代码

https://github.com/Meenu-Bhati/SV-STR/blob/main/RNA_quantification/TPM_normalization.r

https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py

https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py

为啥要做这个分位数标准化和反正则转换暂时不太理解，这个链接给出一些解释 https://bsunfullsite.github.io/post/quantile-nomrlization-and-inverse-normal-transform/quantile-normalization-and-inverse-normal-transform/

也没看太明白，基本上做eQTL分析的论文里方法部分都会写这个，比如开头提到的论文，方法不部分写到

Finally, TPM values were quantile normalized and inverse normal transformed across samples per gene using the R package RNOmni

水稻泛基因组的论文 (A super pan-genomic landscape of rice)，做eQTL分析，方法部分写到

To obtain a normal distribution of expression values for each gene, FPKM values of each gene were further normalized using the quantile-quantile normalization (qqnorm) function in R (version 3.1.2).

玉米eQTL的论文（A role for heritable transcriptomic variation in maize adaptation to temperate environments）方法部分写到

For each gene, expression values were transformed using the Box-Cox method [103] prior to mapping

西红柿泛基因组 eQTL分析部分方法

Genes with TPM values of >0.5 were defined as expressed. Only genes expressed in at least 100 accessions were retained for the downstream analyses. Raw expression levels were normalized with quantile–quantile normalization

这篇文章里提供的代码链接 https://github.com/YaoZhou89/TGG/blob/main/5.Genetic_analysis/scripts/prepare_gene_expression.R

这里标准化是自己自定义的函数

代码语言：javascript复制

quantile_normalisation <- function(df){
  df_rank <- apply(df,2,rank,ties.method="min")
  df_sorted <- data.frame(apply(df, 2, sort))
  df_mean <- apply(df_sorted, 1, mean)
  
  index_to_mean <- function(my_index, my_mean){
    return(my_mean[my_index])
  }
  
  df_final <- apply(df_rank, 2, index_to_mean, my_mean=df_mean)
  rownames(df_final) <- rownames(df)
  return(df_final)
}

我试了一下这个函数的输出和preprocessCore::normalize.quantiles 这个函数的输出是一致的

https://github.com/broadinstitute/pyqtl/blob/master/qtl/norm.py 这个链接里提供了python做这个标准化的函数

代码语言：javascript复制

def normalize_quantiles(df):
    """
    Quantile normalization to the average empirical distribution
    Note: replicates behavior of R function normalize.quantiles
          from library("preprocessCore")

    Reference:
     [1] Bolstad et al., Bioinformatics 19(2), pp. 185-193, 2003

    Adapted from https://github.com/andrewdyates/quantile_normalize
    """
    M = df.values.copy()

    Q = M.argsort(axis=0)
    m,n = M.shape

    # compute quantile vector
    quantiles = np.zeros(m)
    for i in range(n):
        quantiles  = M[Q[:,i],i]
    quantiles = quantiles / n

    for i in range(n):
        # Get equivalence classes; unique values == 0
        dupes = np.zeros(m, dtype=np.int)
        for j in range(m-1):
            if M[Q[j,i],i] == M[Q[j 1,i],i]:
                dupes[j 1] = dupes[j] 1

        # Replace column with quantile ranks
        M[Q[:,i],i] = quantiles

        # Average together equivalence classes
        j = m-1
        while j >= 0:
            if dupes[j] == 0:
                j -= 1
            else:
                idxs = Q[j-dupes[j]:j 1,i]
                M[idxs,i] = np.median(M[idxs,i])
                j -= 1   dupes[j]
        assert j == -1

    return pd.DataFrame(M, index=df.index, columns=df.columns)

开头提到的论文里除了分位数标准化还做了反正则转换，这个有现成的R包 RNOmni，代码

代码语言：javascript复制

expr.int = t(apply(file_filter_norm, 1, RankNorm ))

eQTL分析还有一步是用peer这个包计算混杂因素（To remove potential batch effects and cconfounding factors），之前有一个困惑是直接用TPM值去计算混杂因素还是用标准化后的表达数据去计算这个混杂因素

https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl

这个链接里有一些步骤，这里用的是标准化后的数据。计算多少个混杂因素这个链接里也给出了建议

The number of PEER factors was selected as function of sample size (N):

15 factors for N < 150
30 factors for 150 ≤ N < 250
45 factors for 250 ≤ N < 350
60 factors for N ≥ 350

github https 函数论文数据

0 人点赞