Natural Language Processing: Word Frequency Counting

2020-08-06 11:38:05

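Today I ran into a natural language processing problem: given a paragraph of English text, count how often each word occurs, treat the words that occur more than once as keywords, and work out what share of the paragraph's length those keywords account for.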

I solved it in three programming languages: R, Perl, and Python.

1.R

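# Paragraph whose word frequencies we want to count
para='This is a test. Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Now is better than never. Is this a complex thing?'
# Remove "." and "?" (backslashes must be doubled inside an R string), then convert to lowercase
para_sub=tolower(gsub("\\.|\\?","",para))

# Split on spaces and count word frequencies, most frequent first
count=sort(table(unlist(strsplit(para_sub," "))),decreasing = TRUE)
# Keep the words that occur more than once as keywords
keys=count[count>1]
# Total length of all keyword occurrences (word length times count)
keylen=sum(nchar(names(keys))*as.numeric(keys))
# Share of the whole paragraph's length taken up by keywords (a fraction, not multiplied by 100)
percent=keylen/nchar(para)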

2.Perl

#!/usr/bin/perl
use strict;
use warnings;

my $para='This is a test. Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Now is better than never. Is this a complex thing?';
my $para_sub=$para;
# Remove "." and "?" (both must be escaped inside the regex)
$para_sub=~s/\.|\?//g;
# Convert to lowercase
$para_sub=lc($para_sub);
# Split into words on whitespace
my @array=split " ",$para_sub;

# Count word frequencies
my %hash;
foreach my $word (@array){
 $hash{$word}++;
}

# Print each keyword with its count and add up the total keyword length
my $key_len=0;
foreach my $i (sort {$hash{$b}<=>$hash{$a}} keys %hash){
 if($hash{$i}>1){
  print "$i\t$hash{$i}\n";
  $key_len+=length($i)*$hash{$i};
 }
}

# Share of the whole paragraph's length taken up by keywords
my $percent=$key_len/length($para);
$percent=sprintf("%.4f",$percent);
print "keyword percent: $percent\n";

3.Python
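
The same steps can be sketched in Python as follows. This is a minimal sketch that mirrors the R and Perl versions above, not necessarily the code the post originally used; the names `PARA` and `keyword_stats` are made up for illustration.

```python
import re
from collections import Counter

# Paragraph whose word frequencies we want to count (same text as in the R and Perl versions)
PARA = 'This is a test. Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Now is better than never. Is this a complex thing?'


def keyword_stats(text):
    """Return (keywords, key_len, fraction) for the given text.

    keywords: dict mapping each word that occurs more than once to its count
    key_len:  total length of all keyword occurrences (word length times count)
    fraction: key_len divided by the length of the original text
    """
    # Remove "." and "?", then convert to lowercase
    cleaned = re.sub(r"[.?]", "", text).lower()
    # Split on whitespace and count word frequencies
    counts = Counter(cleaned.split())
    # Keep the words that occur more than once, most frequent first
    keywords = {word: n for word, n in counts.most_common() if n > 1}
    key_len = sum(len(word) * n for word, n in keywords.items())
    return keywords, key_len, key_len / len(text)


keywords, key_len, percent = keyword_stats(PARA)
for word, n in keywords.items():
    print(f"{word}\t{n}")
print(f"keyword percent: {percent:.4f}")
```

As in the other two versions, the denominator is the length of the original paragraph, including spaces and punctuation, so the result is a fraction between 0 and 1 rather than a value multiplied by 100.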
