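Suppose you have the sequences AAA and ATA. How can you use R to compare them, find that they differ at the second character, and return both the position and the characters that differ?
When I googled this, I found similar questions on stackoverflow, but none matched exactly: they mostly ask which characters differ, which is less than what I wanted. Two solutions were suggested: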
do.call(setdiff, strsplit(c(a, b), split = ""))
# or
Reduce(setdiff, strsplit(c(a, b), split = ""))
Here a and b are the two strings.
> do.call(setdiff, strsplit(c("ATA", "AAA"), split = ""))
[1] "T"
> Reduce(setdiff, strsplit(c("ATA", "AAA"), split = ""))
[1] "T"
Oddly, if you swap the two sequences, this no longer works:
> Reduce(setdiff, strsplit(c("AAA", "ATA"), split = ""))
character(0)
> do.call(setdiff, strsplit(c("AAA", "ATA"), split = ""))
character(0)
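The order matters because `setdiff(x, y)` is an asymmetric set operation: it returns the unique elements of `x` that are absent from `y`, and it knows nothing about positions. A minimal sketch of what happens in each direction (the variable names `x`, `y`, `only_in_x`, `only_in_y`, `sym_diff` and `diff_pos` are only for illustration):

```r
# Split each string into single characters.
x <- strsplit("AAA", split = "")[[1]]  # "A" "A" "A"
y <- strsplit("ATA", split = "")[[1]]  # "A" "T" "A"

# setdiff() works on sets: unique elements of the first argument
# that do not occur in the second.
only_in_x <- setdiff(x, y)  # character(0), because "A" also occurs in y
only_in_y <- setdiff(y, x)  # "T"

# A symmetric version catches both directions, but still loses the position.
sym_diff <- union(only_in_x, only_in_y)
print(sym_diff)

# Comparing element by element keeps the position.
diff_pos <- which(x != y)
print(diff_pos)
```

So the set-based approach can say that a "T" is involved, but only an element-wise comparison can say where the two strings differ.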
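There is not much material on this. I eventually found a function on an R blog that does something similar, modified it a little, and I am quite happy with the result: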
list_string_diff = function(a, b, exclude = c("-", "?"), ignore.case = TRUE,
                            show.excluded = FALSE, only.position = TRUE){
  # A position-wise comparison needs strings of equal length
  if(nchar(a) != nchar(b)) stop("Lengths of input strings differ")
  if(ignore.case){
    a = toupper(a)
    b = toupper(b)
  }
  # Split both strings into single characters
  split_seqs = strsplit(c(a, b), split = "")
  # TRUE where the characters differ
  only.diff = split_seqs[[1]] != split_seqs[[2]]
  # Positions holding an excluded character in either string become NA
  only.diff[
    (split_seqs[[1]] %in% exclude) |
    (split_seqs[[2]] %in% exclude)
  ] = NA
  # Indexing with NA yields NA, so excluded positions show up as NA rows
  diff.info = data.frame(which(is.na(only.diff) | only.diff),
                         split_seqs[[1]][only.diff], split_seqs[[2]][only.diff])
  names(diff.info) = c("position", "seq.a", "seq.b")
  # Drop the excluded (NA) rows unless asked to keep them
  if(!show.excluded) diff.info = na.omit(diff.info)
  if(only.position){
    diff.info$position
  }else diff.info
}
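This function records both the position and the original characters, can ignore case, and can even exclude certain characters (by default "-" and "?"). To keep the result simple, I added the only.position argument, which by default returns only the positions.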
> list_string_diff("AAA", "ATA")
[1] 2
> list_string_diff("ATA", "AAA")
[1] 2
> list_string_diff("ATA", "AAA", only.position = FALSE)
  position seq.a seq.b
1        2     T     A
> list_string_diff("ATA", "AAa", only.position = FALSE)
  position seq.a seq.b
1        2     T     A
> list_string_diff("ATA", "AAa", only.position = FALSE, ignore.case = FALSE)
  position seq.a seq.b
1        2     T     A
2        3     A     a
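The examples above do not exercise the exclude argument, so here is a minimal stand-alone sketch of the masking step it relies on. The input strings "A-GT" and "ATGC" and the variable names `differs`, `kept` and `masked` are made up for illustration; this mirrors the logic of the function rather than calling it.

```r
# Two hypothetical aligned sequences; "-" stands for a gap.
a <- strsplit("A-GT", split = "")[[1]]
b <- strsplit("ATGC", split = "")[[1]]
exclude <- c("-", "?")

# TRUE where the characters differ: FALSE TRUE FALSE TRUE
differs <- a != b

# Positions with an excluded character in either sequence are masked as NA.
differs[(a %in% exclude) | (b %in% exclude)] <- NA

# Real differences: positions that are TRUE and not masked.
kept <- which(!is.na(differs) & differs)
# Masked positions, which the function reports only when show.excluded = TRUE.
masked <- which(is.na(differs))

print(kept)
print(masked)
```

Here position 2 differs only because of the gap character, so it is masked, and position 4 (T versus C) is the only difference that is kept.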