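This is the 170th article from 奔跑的键盘侠
Author | 我是奔跑的键盘侠
Source | 奔跑的键盘侠 (ID: runningkeyboardhero)
Contact for authorization before reprinting (WeChat ID: ctwott)
Ta-da, here I am!
Today I'll cover a method for counting word frequencies. Put grandly, it's big data analysis; once you've read through it, it comes down to a few lines of code.
It has plenty of uses: counting how often words appear in an article, tracking trending terms online, or ranking popular baby names or tourist attractions can all follow the same pattern.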
1
coding
#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time : 2020-03-29 22:04
# @Author : Ed Frey
# @File : counter_func.py
# @Software: PyCharm
text = '''O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
Or that the Everlasting had not fix'd
His canon 'gainst self-slaughter! O God! God!
How weary, stale, flat and unprofitable,
Seem to me all the uses of this world!
Fie on't! ah fie! 'tis an unweeded garden,
That grows to seed; things rank and gross in nature
Possess it merely. That it should come to this!
But two months dead: nay, not so much, not two:
So excellent a king; that was, to this,
Hyperion to a satyr; so loving to my mother
That he might not beteem the winds of heaven
Visit her face too roughly. Heaven and earth!
Must I remember? why, she would hang on him,
As if increase of appetite had grown'''
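# Replace newlines with spaces so the text can be split on a single delimiter.
all_strings = text.replace("\n", " ")
# Splitting on " " keeps empty strings wherever two spaces are adjacent.
words = all_strings.split(" ")
stat_counter = {}
for word in words:
    if word in stat_counter:
        stat_counter[word] += 1
    else:
        stat_counter[word] = 1
# Sort the keys by their counts, highest first, and keep the top 10.
result = sorted(stat_counter, key=stat_counter.get, reverse=True)[:10]
for key in result:
    print("%s:%d" % (key, stat_counter[key]))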
The test output is as follows:
to:6
and:4
not:4
that:3
too:3
a:3
the:3
:3
of:3
That:3
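The ranking relies on the built-in sorted function, with key=stat_counter.get so that the words are ordered by their counts. The entry with an empty key is an empty string, which split(" ") yields wherever two spaces end up adjacent, for example after a line with a trailing space. The comparison is case-sensitive, so "that" and "That" are counted separately.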
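To make the sorting step concrete, here is a minimal sketch with a small hypothetical dictionary (the name counts and its values are made up for illustration). It shows that sorted iterates over a dict's keys and that key=counts.get ranks them by value. Sorting counts.items() instead is one possible variation when the counts should travel with the words.

```python
# Hypothetical counts, used only to illustrate the sorting step.
counts = {"to": 6, "and": 4, "not": 4, "that": 3}

# sorted() iterates over the dict's keys; key=counts.get ranks each key by its count.
top_keys = sorted(counts, key=counts.get, reverse=True)[:3]

# Sorting the (word, count) pairs instead keeps each count next to its word.
top_pairs = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]

print(top_keys)
print(top_pairs)
```

Because sorted is stable, words with equal counts keep the order in which they were first inserted into the dictionary.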
2
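A supplement: how to use Counter
Python's standard-library module collections provides a Counter class. It is also very capable and may come in handy in experiment design, though it works a little differently from the word count above. Given a string, Counter treats each single letter or character as an item to count, and the code is even more concise.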
#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time : 2020-03-29 22:04
# @Author : Ed Frey
# @File : counter_func.py
# @Software: PyCharm
from collections import Counter
text= '''清明时节雨纷纷,路上行人欲断魂。
借问酒家何处有?牧童遥指杏花村。'''
# Drop the newline so it is not counted as a character.
stat = Counter(text.replace("\n", ""))
print(stat.most_common(5))
The output is as follows:
[('纷', 2), ('。', 2), ('清', 1), ('明', 1), ('时', 1)]
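Counter counts the items of whatever iterable it receives, so passing it a list of words rather than a string gives word frequencies as well. The sketch below is only an illustration: the sample sentence is a short excerpt, and the tokenization rule (lowercase, then keep runs of letters and apostrophes) is an assumption that may not suit every text.

```python
import re
from collections import Counter

# Short sample text, chosen only for illustration.
sample = "That it should come to this! But two months dead: nay, not so much, not two"

# Lowercase the text, then keep runs of letters and apostrophes as words.
# This tokenization rule is an assumption; real text may need a different one.
words = re.findall(r"[a-z']+", sample.lower())

# A list of words makes Counter count words instead of characters.
word_counts = Counter(words)
print(word_counts.most_common(2))
```

Lowercasing first means "That" and "that" fall under one key, and using a regular expression avoids the empty-string tokens that splitting on a single space can produce.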
Finally, here is an excerpt of the syntax from Counter's built-in help, for reference:
'''
Help on class Counter in module collections:
class Counter(builtins.dict)
| Dict subclass for counting hashable items. Sometimes called a bag
| or multiset. Elements are stored as dictionary keys and their counts
| are stored as dictionary values.
|
| >>> c = Counter('abcdeabcdabcaba') # count elements from a string
|
| >>> c.most_common(3) # three most common elements
| [('a', 5), ('b', 4), ('c', 3)]
| >>> sorted(c) # list all unique elements
| ['a', 'b', 'c', 'd', 'e']
| >>> ''.join(sorted(c.elements())) # list elements with repetitions
| 'aaaaabbbbcccdde'
| >>> sum(c.values()) # total of all counts
| 15
|
| >>> c['a'] # count of letter 'a'
| 5
| >>> for elem in 'shazam': # update counts from an iterable
| ... c[elem] += 1 # by adding 1 to each element's count
| >>> c['a'] # now there are seven 'a'
| 7
| >>> del c['b'] # remove all 'b'
| >>> c['b'] # now there are zero 'b'
| 0
|
| >>> d = Counter('simsalabim') # make another counter
| >>> c.update(d) # add in the second counter
| >>> c['a'] # now there are nine 'a'
| 9
|
| >>> c.clear() # empty the counter
| >>> c
| Counter()
|
| Note: If a count is set to zero or reduced to zero, it will remain
| in the counter until the entry is deleted or the counter is cleared:
|
| >>> c = Counter('aaabbc')
| >>> c['b'] -= 2 # reduce the count of 'b' by two
| >>> c.most_common() # 'b' is still in, but its count is zero
| [('a', 3), ('c', 1), ('b', 0)]
'''
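The note at the end of the help text says that a zero count stays in the counter. As a small sketch of one way to deal with that, the unary plus operator on a Counter returns a new Counter holding only the positive counts. The variable name positive_only is introduced here purely for illustration.

```python
from collections import Counter

c = Counter("aaabbc")
c["b"] -= 2  # 'b' stays in the counter with a count of zero

# Unary plus builds a new Counter that keeps only positive counts.
positive_only = +c

print(sorted(c.items()))
print(sorted(positive_only.items()))
```

The original counter c is left unchanged, so this suits cases where the zero entries should be hidden from a report but not deleted.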
-END-
© Copyright
Original work by 奔跑的键盘侠 | Feel free to share on WeChat Moments | Contact for authorization before reprinting