[R] How to compare the differences between two strings

2020-07-02 21:16:05

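Suppose you have the sequences AAA and ATA. How do you compare them in R to find the difference (here, the second character) and return both the position and the characters that differ?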

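When I googled this problem, I found a similar question on Stack Overflow, but it was not quite the same: it basically asked how to find the differing characters, which is less than what I wanted. Two solutions were offered: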

do.call(setdiff, strsplit(c(a, b), split = ""))
# or
Reduce(setdiff, strsplit(c(a, b), split = ""))

Here a and b are the two strings.

> do.call(setdiff, strsplit(c("ATA", "AAA"), split = ""))
[1] "T"
> Reduce(setdiff, strsplit(c("ATA", "AAA"), split = ""))
[1] "T"

Surprisingly, if you swap the two sequences, it stops working!

> Reduce(setdiff, strsplit(c("AAA", "ATA"), split = ""))
character(0)
> do.call(setdiff, strsplit(c("AAA", "ATA"), split = ""))
character(0)
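
The order dependence comes from `setdiff(x, y)` itself: it returns the unique elements of `x` that do not occur in `y`, so it is asymmetric and discards positions. A minimal sketch of this, together with an element-by-element comparison that does keep the position, assuming both strings have the same length (the variable names are only illustrative):

```r
a <- "AAA"
b <- "ATA"

# Split each string into a vector of single characters
chars_a <- strsplit(a, split = "")[[1]]
chars_b <- strsplit(b, split = "")[[1]]

# setdiff(x, y): unique elements of x that are absent from y
only_in_a <- setdiff(chars_a, chars_b)  # empty, because "A" also occurs in b
only_in_b <- setdiff(chars_b, chars_a)  # "T"

# Comparing element by element is symmetric and keeps the position
diff_pos <- which(chars_a != chars_b)
print(diff_pos)           # position of the mismatch
print(chars_a[diff_pos])  # character in a at that position
print(chars_b[diff_pos])  # character in b at that position
```

Swapping `a` and `b` leaves `diff_pos` unchanged, which is the behaviour the question actually asks for.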

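There is not much material on this. I eventually found a function on an R blog that does something similar, tweaked it a little, and it works nicely: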

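list_string_diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE,
                             show.excluded = FALSE, only.position = TRUE) {
    # A position-wise comparison only makes sense for strings of equal length
    if (nchar(a) != nchar(b)) stop("Lengths of input strings differ")
    if (ignore.case) {
        a <- toupper(a)
        b <- toupper(b)
    }

    # Split both strings into vectors of single characters
    split_seqs <- strsplit(c(a, b), split = "")
    # TRUE where the characters differ, FALSE where they match
    only.diff <- split_seqs[[1]] != split_seqs[[2]]
    # Mark positions that hold an excluded character (e.g. a gap) as NA
    only.diff[
        (split_seqs[[1]] %in% exclude) |
        (split_seqs[[2]] %in% exclude)
    ] <- NA

    # Keep differing (TRUE) and excluded (NA) positions; indexing with NA
    # yields NA, so excluded rows carry NA in seq.a and seq.b
    diff.info <- data.frame(which(is.na(only.diff) | only.diff),
                            split_seqs[[1]][only.diff], split_seqs[[2]][only.diff])
    names(diff.info) <- c("position", "seq.a", "seq.b")

    # Drop the excluded (NA) rows unless the caller asked to see them
    if (!show.excluded) diff.info <- na.omit(diff.info)
    if (only.position) {
        diff.info$position
    } else diff.info
}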

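This function records both the positions and the original characters, can ignore case, and can even exclude certain characters. To keep the result simple, I added a parameter that by default returns only the positions.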

> list_string_diff("AAA", "ATA")
[1] 2
> list_string_diff("ATA", "AAA")
[1] 2
> list_string_diff("ATA", "AAA", only.position = FALSE)
  position seq.a seq.b
1        2     T     A
> list_string_diff("ATA", "AAa", only.position = FALSE)
  position seq.a seq.b
1        2     T     A
> list_string_diff("ATA", "AAa", only.position = FALSE, ignore.case = FALSE)
  position seq.a seq.b
1        2     T     A
2        3     A     a
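
None of the calls above exercise the `exclude` argument. The following standalone sketch illustrates the masking idea behind it: positions where either string holds an excluded character are set to `NA`, so they can be told apart from true mismatches. The example strings and variable names are made up for illustration.

```r
exclude <- c("-", "?")  # characters to ignore, e.g. alignment gaps

chars_a <- strsplit("AT-A", split = "")[[1]]
chars_b <- strsplit("AAGA", split = "")[[1]]

# TRUE where the characters differ
differs <- chars_a != chars_b
# Mask every position where either string holds an excluded character
differs[(chars_a %in% exclude) | (chars_b %in% exclude)] <- NA

real_diff <- which(differs)         # which() skips NA, so only true mismatches remain
excluded  <- which(is.na(differs))  # positions that were masked out
print(real_diff)
print(excluded)
```

Here position 3 differs only because of the gap character, so it ends up in `excluded` rather than in `real_diff`.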
