R支持同名函数,小心李逵变李鬼

2021-11-09 15:21:01 浏览数 (2)

程序出乎意料,怎么办?

今天在星球圈里收到提问:

img

我对ddply()这个函数是不熟悉的,只知道hadley一个过时的包plyr里有一系列这样的函数。所以我首先想到的是这位朋友用错了。不过我马上就排除了,这种问题是非常容易发现和处理的。

因此还是得动手实际检验一下这个问题在我的电脑上是否可以重复。

我们首先把数据导入进来:

代码语言:javascript复制
library(scales)
library(tidyverse)
library(plyr)

ts <- openxlsx::read.xlsx("~/Downloads/示例数据.xlsx")
head(ts)
##      Name     variable      value
## 1 SLCO1B1 TCGA-44-2666 3.52916020
## 2    GCGR TCGA-44-2666 0.08499940
## 3   HTR3A TCGA-44-2666 0.05029628
## 4     CA9 TCGA-44-2666 0.19814361
## 5 TNFSF11 TCGA-44-2666 0.28202803
## 6     FGB TCGA-44-2666 4.56223499

按照两种不同的方法生成结果:

代码语言:javascript复制
out1 = ts %>%
 ddply(., .(variable), transform, rescale = rescale(value)) %>%
 arrange(variable, Name)

head(out1)
##    Name     variable       value     rescale
## 1 ADRB2 TCGA-05-4390  2.07637862 0.179759689
## 2 BIRC5 TCGA-05-4390  4.76410093 0.412445637
## 3   CA9 TCGA-05-4390  4.47563050 0.387471697
## 4   FGB TCGA-05-4390 11.55085787 1.000000000
## 5  GCGR TCGA-05-4390  0.06531648 0.005654687
## 6 HTR3A TCGA-05-4390  0.13418263 0.011616680
out2 = ts %>%
 group_by(variable) %>%
 mutate(rescale = rescale(value)) %>%
 ungroup() %>%
 arrange(variable, Name) %>%
 as.data.frame()

head(out2)
##    Name     variable       value     rescale
## 1 ADRB2 TCGA-05-4390  2.07637862 0.179759689
## 2 BIRC5 TCGA-05-4390  4.76410093 0.412445637
## 3   CA9 TCGA-05-4390  4.47563050 0.387471697
## 4   FGB TCGA-05-4390 11.55085787 1.000000000
## 5  GCGR TCGA-05-4390  0.06531648 0.005654687
## 6 HTR3A TCGA-05-4390  0.13418263 0.011616680

从结果的格式可以看的出来,它们从输入到输出经历的操作应该是相同的。

那么我们实际对比下结果:

代码语言:javascript复制
setdiff(out1, out2)
##       Name     variable      value     rescale
## 1    ADRB2 TCGA-21-5782 1.64848811 0.192834783
## 2    BIRC5 TCGA-21-5782 4.75155422 0.675892757
## 3      CA9 TCGA-21-5782 6.83355317 1.000000000
## 4      FGB TCGA-21-5782 0.84742072 0.068131673
## 5     GCGR TCGA-21-5782 0.51442280 0.016293492
## 6    HTR3A TCGA-21-5782 0.86791639 0.071322259
## 7      IL2 TCGA-21-5782 0.40975672 0.000000000
## 8    KCNA5 TCGA-21-5782 0.83802752 0.066669422
## 9    PADI4 TCGA-21-5782 0.65747836 0.038563121
## 10 SLCO1B1 TCGA-21-5782 0.91029877 0.077919974
## 11 TNFSF11 TCGA-21-5782 0.68943867 0.043538419
## 12   ADRB2 TCGA-38-4632 1.49522485 0.173357774
## 13   BIRC5 TCGA-38-4632 5.68731111 0.659392191
## 14     CA9 TCGA-38-4632 0.47006487 0.054499762
## 15     FGB TCGA-38-4632 8.62508107 1.000000000
## 16    GCGR TCGA-38-4632 0.10181144 0.011804114
## 17   HTR3A TCGA-38-4632 0.37084853 0.042996526
## 18     IL2 TCGA-38-4632 0.04300049 0.004985518
## 19   KCNA5 TCGA-38-4632 0.18014798 0.020886526
## 20   PADI4 TCGA-38-4632 0.11071744 0.012836683
## 21 TNFSF11 TCGA-38-4632 0.21541180 0.024975047
## 22   ADRB2 TCGA-44-2666 3.11114961 0.678389828
## 23   BIRC5 TCGA-44-2666 1.34130377 0.286131434
## 24     CA9 TCGA-44-2666 0.19814361 0.032768028
## 25     FGB TCGA-44-2666 4.56223499 1.000000000
## 26    GCGR TCGA-44-2666 0.08499940 0.007691400
## 27   HTR3A TCGA-44-2666 0.05029628 0.000000000
## 28     IL2 TCGA-44-2666 0.06204253 0.002603372
## 29   KCNA5 TCGA-44-2666 0.89569660 0.187369639
## 30   PADI4 TCGA-44-2666 0.26135921 0.046778767
## 31 SLCO1B1 TCGA-44-2666 3.52916020 0.771035279
## 32 TNFSF11 TCGA-44-2666 0.28202803 0.051359685
## 33   ADRB2 TCGA-49-AAQV 1.61516402 0.183927550
## 34   BIRC5 TCGA-49-AAQV 3.19367091 0.365190414
## 35     CA9 TCGA-49-AAQV 4.63585169 0.530798699
## 36     FGB TCGA-49-AAQV 8.72183748 1.000000000
## 37    GCGR TCGA-49-AAQV 0.01345201 0.000000000
## 38   HTR3A TCGA-49-AAQV 3.07484686 0.351545629
## 39     IL2 TCGA-49-AAQV 0.21759594 0.023442225
## 40   KCNA5 TCGA-49-AAQV 0.23745833 0.025723059
## 41   PADI4 TCGA-49-AAQV 0.07889044 0.007514416
## 42 SLCO1B1 TCGA-49-AAQV 1.10041358 0.124817806
## 43 TNFSF11 TCGA-49-AAQV 0.72236264 0.081405517

结果显示两个数据框确实不一样。那这是怎么回事?

代码语言:javascript复制
out1$rescale[which(out1$rescale != out2$rescale)]
##  [1] 0.192834783 0.675892757 1.000000000 0.068131673 0.016293492 0.071322259
##  [7] 0.000000000 0.066669422 0.038563121 0.077919974 0.043538419 0.173357774
## [13] 0.659392191 0.054499762 1.000000000 0.011804114 0.042996526 0.004985518
## [19] 0.020886526 0.012836683 0.024975047 0.678389828 0.286131434 0.032768028
## [25] 1.000000000 0.007691400 0.000000000 0.002603372 0.187369639 0.046778767
## [31] 0.771035279 0.051359685 0.183927550 0.365190414 0.530798699 1.000000000
## [37] 0.000000000 0.351545629 0.023442225 0.025723059 0.007514416 0.124817806
## [43] 0.081405517
out2$rescale[which(out1$rescale != out2$rescale)]
##  [1] 0.142715643 0.411359422 0.591605684 0.073364310 0.044535462 0.075138695
##  [7] 0.035474138 0.072551106 0.056920306 0.078807893 0.059687226 0.129447083
## [13] 0.492371317 0.040695234 0.746704805 0.008814189 0.032105713 0.003722710
## [19] 0.015596069 0.009585213 0.018648988 0.269343597 0.116121572 0.017154017
## [25] 0.394969365 0.007358709 0.004354333 0.005371249 0.077543730 0.022626822
## [31] 0.305532303 0.024416198 0.139830655 0.276487768 0.401342631 0.755081361
## [37] 0.001164590 0.266200735 0.018838077 0.020557636 0.006829834 0.095266827
## [43] 0.062537575

认真查看数据我发现这里的variable指代的是TCGA的样本,上面的操作是一个对不同样本进行相同处理的操作。最实际 的探索笨办法就是一个一个样本检查:

代码语言:javascript复制
setdiff(out1[1:11, ], out2[1:11, ])
## [1] Name     variable value    rescale 
## <0 行> (或0-长度的row.names)

第1个样本没有问题,继续:

代码语言:javascript复制
setdiff(out1[12:22, ], out2[12:22, ])
##       Name     variable     value    rescale
## 1    ADRB2 TCGA-21-5782 1.6484881 0.19283478
## 2    BIRC5 TCGA-21-5782 4.7515542 0.67589276
## 3      CA9 TCGA-21-5782 6.8335532 1.00000000
## 4      FGB TCGA-21-5782 0.8474207 0.06813167
## 5     GCGR TCGA-21-5782 0.5144228 0.01629349
## 6    HTR3A TCGA-21-5782 0.8679164 0.07132226
## 7      IL2 TCGA-21-5782 0.4097567 0.00000000
## 8    KCNA5 TCGA-21-5782 0.8380275 0.06666942
## 9    PADI4 TCGA-21-5782 0.6574784 0.03856312
## 10 SLCO1B1 TCGA-21-5782 0.9102988 0.07791997
## 11 TNFSF11 TCGA-21-5782 0.6894387 0.04353842

问题出来了,那个是对的呢?手动算一下

代码语言:javascript复制
out1[12:22, ]
##       Name     variable     value    rescale
## 12   ADRB2 TCGA-21-5782 1.6484881 0.19283478
## 13   BIRC5 TCGA-21-5782 4.7515542 0.67589276
## 14     CA9 TCGA-21-5782 6.8335532 1.00000000
## 15     FGB TCGA-21-5782 0.8474207 0.06813167
## 16    GCGR TCGA-21-5782 0.5144228 0.01629349
## 17   HTR3A TCGA-21-5782 0.8679164 0.07132226
## 18     IL2 TCGA-21-5782 0.4097567 0.00000000
## 19   KCNA5 TCGA-21-5782 0.8380275 0.06666942
## 20   PADI4 TCGA-21-5782 0.6574784 0.03856312
## 21 SLCO1B1 TCGA-21-5782 0.9102988 0.07791997
## 22 TNFSF11 TCGA-21-5782 0.6894387 0.04353842
out2[12:22, ] # there are some problem with the dplyr approach
##       Name     variable     value    rescale
## 12   ADRB2 TCGA-21-5782 1.6484881 0.14271564
## 13   BIRC5 TCGA-21-5782 4.7515542 0.41135942
## 14     CA9 TCGA-21-5782 6.8335532 0.59160568
## 15     FGB TCGA-21-5782 0.8474207 0.07336431
## 16    GCGR TCGA-21-5782 0.5144228 0.04453546
## 17   HTR3A TCGA-21-5782 0.8679164 0.07513870
## 18     IL2 TCGA-21-5782 0.4097567 0.03547414
## 19   KCNA5 TCGA-21-5782 0.8380275 0.07255111
## 20   PADI4 TCGA-21-5782 0.6574784 0.05692031
## 21 SLCO1B1 TCGA-21-5782 0.9102988 0.07880789
## 22 TNFSF11 TCGA-21-5782 0.6894387 0.05968723
rescale(out1[12:22, ]$value)
##  [1] 0.19283478 0.67589276 1.00000000 0.06813167 0.01629349 0.07132226
##  [7] 0.00000000 0.06666942 0.03856312 0.07791997 0.04353842

手动算的结果跟第1个是一致的,那么第2种方法肯定哪里出问题了。

难道是调用rescale()这个函数出了问题??由于这个函数要做的事情很简单,我们可以自己 写一个看看:

代码语言:javascript复制
rescale2 <- function(x) {
 (x - min(x)) / (max(x) - min(x))
}

out3 = ts %>%
 group_by(variable) %>%
 mutate(rescale = rescale2(value)) %>%
 ungroup() %>%
 arrange(variable, Name) %>%
 as.data.frame()

setdiff(out1, out3)
##       Name     variable      value     rescale
## 1    ADRB2 TCGA-21-5782 1.64848811 0.192834783
## 2    BIRC5 TCGA-21-5782 4.75155422 0.675892757
## 3      CA9 TCGA-21-5782 6.83355317 1.000000000
## 4      FGB TCGA-21-5782 0.84742072 0.068131673
## 5     GCGR TCGA-21-5782 0.51442280 0.016293492
## 6    HTR3A TCGA-21-5782 0.86791639 0.071322259
## 7      IL2 TCGA-21-5782 0.40975672 0.000000000
## 8    KCNA5 TCGA-21-5782 0.83802752 0.066669422
## 9    PADI4 TCGA-21-5782 0.65747836 0.038563121
## 10 SLCO1B1 TCGA-21-5782 0.91029877 0.077919974
## 11 TNFSF11 TCGA-21-5782 0.68943867 0.043538419
## 12   ADRB2 TCGA-38-4632 1.49522485 0.173357774
## 13   BIRC5 TCGA-38-4632 5.68731111 0.659392191
## 14     CA9 TCGA-38-4632 0.47006487 0.054499762
## 15     FGB TCGA-38-4632 8.62508107 1.000000000
## 16    GCGR TCGA-38-4632 0.10181144 0.011804114
## 17   HTR3A TCGA-38-4632 0.37084853 0.042996526
## 18     IL2 TCGA-38-4632 0.04300049 0.004985518
## 19   KCNA5 TCGA-38-4632 0.18014798 0.020886526
## 20   PADI4 TCGA-38-4632 0.11071744 0.012836683
## 21 TNFSF11 TCGA-38-4632 0.21541180 0.024975047
## 22   ADRB2 TCGA-44-2666 3.11114961 0.678389828
## 23   BIRC5 TCGA-44-2666 1.34130377 0.286131434
## 24     CA9 TCGA-44-2666 0.19814361 0.032768028
## 25     FGB TCGA-44-2666 4.56223499 1.000000000
## 26    GCGR TCGA-44-2666 0.08499940 0.007691400
## 27   HTR3A TCGA-44-2666 0.05029628 0.000000000
## 28     IL2 TCGA-44-2666 0.06204253 0.002603372
## 29   KCNA5 TCGA-44-2666 0.89569660 0.187369639
## 30   PADI4 TCGA-44-2666 0.26135921 0.046778767
## 31 SLCO1B1 TCGA-44-2666 3.52916020 0.771035279
## 32 TNFSF11 TCGA-44-2666 0.28202803 0.051359685
## 33   ADRB2 TCGA-49-AAQV 1.61516402 0.183927550
## 34   BIRC5 TCGA-49-AAQV 3.19367091 0.365190414
## 35     CA9 TCGA-49-AAQV 4.63585169 0.530798699
## 36     FGB TCGA-49-AAQV 8.72183748 1.000000000
## 37    GCGR TCGA-49-AAQV 0.01345201 0.000000000
## 38   HTR3A TCGA-49-AAQV 3.07484686 0.351545629
## 39     IL2 TCGA-49-AAQV 0.21759594 0.023442225
## 40   KCNA5 TCGA-49-AAQV 0.23745833 0.025723059
## 41   PADI4 TCGA-49-AAQV 0.07889044 0.007514416
## 42 SLCO1B1 TCGA-49-AAQV 1.10041358 0.124817806
## 43 TNFSF11 TCGA-49-AAQV 0.72236264 0.081405517

问题依旧,数据肯定在rescale的时候出了问题。我debug()进去一看,发现全部的数据,而不是单独一个样本的数据作为输入!

在确定group_by()函数没有问题后,终于在mutate()上发现了端倪。

代码语言:javascript复制
mutate
## function (.data, ...) 
## {
##     stopifnot(is.data.frame(.data) || is.list(.data) || is.environment(.data))
##     cols <- as.list(substitute(list(...))[-1])
##     cols <- cols[names(cols) != ""]
##     for (col in names(cols)) {
##         .data[[col]] <- eval(cols[[col]], .data, parent.frame())
##     }
##     .data
## }
## <bytecode: 0x7fbbcd1fa470>
## <environment: namespace:plyr>

环境显示它来自plyr。而我们实际想要使用的是dplyr包中的同名函数!

明确指定命名空间后发现问题也确实解决了。

代码语言:javascript复制
out3 = ts %>%
 group_by(variable) %>%
 dplyr::mutate(rescale = rescale2(value)) %>%
 ungroup() %>%
 arrange(variable, Name) %>%
 as.data.frame()

setdiff(out1, out3)
## [1] Name     variable value    rescale 
## <0 行> (或0-长度的row.names)

mutate()非彼mutate(),建议大家在发现类似问题时(一个常用函数做了一件意外的事情), 请检查使用的函数来自哪个包。

0 人点赞