Requirement
Dou (a fellow student) wanted to extract gene names from a large number of sentences.
Take one of them as an example:
"To ascertain whether a pre-existing subset of endoderm progenitors were responsible for generating endoderm cells in EZH2-/- cultures, we used flow cytometry to separate KIT /CXCR4 (endoderm primed) and KIT-/CXCR4- (not endoderm primed) EZH2-/- populations and subjected the cells to endoderm differentiation"
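This sentence contains three gene names: "EZH2", "KIT", and "CXCR4".
Approach
Replace every punctuation mark in the text with a space, split the result on spaces, then keep only the tokens that appear in a known list of gene names.
Code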
library(stringr)
s = "To ascertain whether a pre-existing subset of endoderm progenitors were responsible for generating endoderm cells in EZH2-/- cultures, we used flow cytometry to separate KIT /CXCR4 (endoderm primed) and KIT-/CXCR4- (not endoderm primed) EZH2-/- populations and subjected the cells to endoderm differentiation"
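# Replace every punctuation character with a space
s2 = gsub("[[:punct:]]"," ",s)
# Split on single spaces; runs of spaces produce empty strings, which are harmless
# because they are filtered out by the %in% lookup later
m2 = str_split(s2," ")[[1]]
# all_g is the vector of all gene names; it is shortened here to keep the example brief.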
all_g = c("EZH2", "KIT", "CXCR4", "AKR1B1P8", "AKR1B10", "AKR1B10P1",
"AKR1B10P2", "AKR1B11", "AKR1B15")
all_g
## [1] "EZH2" "KIT" "CXCR4" "AKR1B1P8" "AKR1B10" "AKR1B10P1"
## [7] "AKR1B10P2" "AKR1B11" "AKR1B15"
unique(m2[m2 %in% all_g])
## [1] "EZH2" "KIT" "CXCR4"
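The key point: the POSIX character class [:punct:], written inside a bracket expression as [[:punct:]], matches punctuation characters.
gsub then replaces every one of them with a space, so each gene name is left as a standalone token.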
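Since the original request involved many sentences rather than one, the same steps can be wrapped in a function and applied to a whole character vector. The following is only a minimal sketch in base R: `extract_genes` is a hypothetical helper name, and the short sentences and the trimmed `all_g` are made up for illustration.

```r
# Hypothetical helper (not from the original post): return the known gene
# symbols found in each sentence of a character vector.
extract_genes <- function(sentences, all_g) {
  # Replace punctuation with spaces, as in the single-sentence example
  cleaned <- gsub("[[:punct:]]", " ", sentences)
  # Split each sentence on runs of whitespace; gives a list of token vectors
  tokens <- strsplit(cleaned, "[[:space:]]+")
  # For each sentence, keep the unique tokens that are known gene symbols
  lapply(tokens, function(tok) unique(tok[tok %in% all_g]))
}

# Illustrative inputs (assumed, shortened for the example)
all_g <- c("EZH2", "KIT", "CXCR4", "AKR1B10")
sentences <- c(
  "we separated KIT-/CXCR4- EZH2-/- populations",
  "AKR1B10 expression was unchanged",
  "no gene symbols here"
)

res <- extract_genes(sentences, all_g)
print(res)
```

The result is a list with one character vector per sentence, empty when nothing matches. One limitation to keep in mind: because hyphens are treated as punctuation, a gene symbol that itself contains a hyphen (for example HLA-A) would be split into pieces and would not match its entry in `all_g`.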