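Today I ran into a natural language processing problem: given a paragraph, count its word frequencies, treat every word that occurs more than once as a keyword, and work out what share of the paragraph's length the keywords account for.
I solved it in three programming languages: R, Perl, and Python.
1. R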
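# The paragraph whose word frequencies we want to count
para='This is a test. Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Now is better than never. Is this a complex thing?'
# Remove "." and "?" (the backslashes must be doubled inside an R string), then convert to lowercase
para_sub=tolower(gsub("\\.|\\?","",para))
# Split on spaces and count word frequencies, most frequent first
count=sort(table(unlist(strsplit(para_sub," "))),decreasing = TRUE)
# Keep the words that occur more than once as keywords
keys=count[count>1]
# Total length of the keywords: word length times number of occurrences
keylen=sum(nchar(names(keys))*as.numeric(keys))
# Share of the whole paragraph taken up by keywords (a fraction between 0 and 1)
percent=keylen/nchar(para)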
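The R snippet above does not print anything, so it helps to know what values to expect. Below is a small independent cross-check of the same steps (strip `.` and `?`, lowercase, split, count, keep repeated words, divide by the paragraph length). It is only a sketch written in Python with the standard library; the function name `keyword_stats` and the variable names are my own and are not taken from the original code.

```python
from collections import Counter

para = ('This is a test. Beautiful is better than ugly. '
        'Explicit is better than implicit. Simple is better than complex. '
        'Now is better than never. Is this a complex thing?')

def keyword_stats(text):
    """Return (keywords, total keyword length, keyword share of the text)."""
    # Drop '.' and '?', lowercase, then split on whitespace
    words = text.replace('.', '').replace('?', '').lower().split()
    counts = Counter(words)
    # Keywords are the words that occur more than once
    keys = {w: c for w, c in counts.items() if c > 1}
    # Each keyword contributes its length once per occurrence
    key_len = sum(len(w) * c for w, c in keys.items())
    # The denominator is the length of the original, unmodified text
    return keys, key_len, key_len / len(text)

keys, key_len, percent = keyword_stats(para)
print(keys)
print(key_len, len(para), f"{percent:.4f}")  # 76 162 0.4691
```

With this paragraph the keywords are `is` (6), `better` (4), `than` (4), `this` (2), `a` (2) and `complex` (2), so `keylen` in the R code should come out as 76 and `percent` as roughly 0.4691.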
2. Perl
#!/usr/bin/perl
use strict;
use warnings;

my $para='This is a test. Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Now is better than never. Is this a complex thing?';
my $para_sub=$para;
# Remove "." and "?" (both must be escaped in the regex)
$para_sub=~s/\.|\?//g;
# Convert to lowercase
$para_sub=lc($para_sub);
# Split into words on whitespace
my @array=split " ",$para_sub;
# Count word frequencies
my %hash;
foreach my $word (@array){
    $hash{$word}++;
}
# Print each keyword with its count and accumulate the total keyword length
my $key_len=0;
foreach my $i (sort {$hash{$b}<=>$hash{$a}} keys %hash){
    if($hash{$i}>1){
        print "$i\t$hash{$i}\n";
        $key_len+=length($i)*$hash{$i};
    }
}
# Share of the whole paragraph taken up by keywords (a fraction between 0 and 1)
my $percent=$key_len/length($para);
$percent=sprintf("%.4f",$percent);
print "keyword percent: $percent\n";
3. Python