Python: Word Frequency Statistics for Big Data

2020-03-31 17:02:28

This is the 170th article from 奔跑的键盘侠.

Author | 我是奔跑的键盘侠

Source | 奔跑的键盘侠 (ID: runningkeyboardhero)

Please contact the author for permission before reposting (WeChat ID: ctwott)

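Ta-da, here I am!

Today I will walk through a way to count word frequencies. Put grandly, it is big data analysis; once you have read it through, it turns out to be only a few lines of code.

It is widely applicable: counting how often each word is used in an article, finding trending words online, or building rankings of popular baby names or popular tourist destinations. All of these can reuse the same approach.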

1

coding

#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm

text = '''O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
Or that the Everlasting had not fix'd
His canon 'gainst self-slaughter! O God! God!
How weary, stale, flat and unprofitable, 
Seem to me all the uses of this world!
Fie on't! ah fie! 'tis an unweeded garden,
That grows to seed; things rank and gross in nature
Possess it merely. That it should come to this!
But two months dead: nay, not so much, not two: 
So excellent a king; that was, to this,
Hyperion to a satyr; so loving to my mother
That he might not beteem the winds of heaven
Visit her face too roughly. Heaven and earth!
Must I remember? why, she would hang on him, 
As if increase of appetite had grown'''

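# Join the lines with spaces, then split on single spaces to get the words.
all_strings = text.replace("\n", " ")
words = all_strings.split(" ")

# Tally how many times each word occurs.
stat_counter = {}
for word in words:
    if word in stat_counter:
        stat_counter[word] += 1
    else:
        stat_counter[word] = 1

# Take the ten most frequent words, ordered by count (highest first).
result = sorted(stat_counter, key=stat_counter.get, reverse=True)[:10]
for key in result:
    print("%s:%d" % (key, stat_counter[key]))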
The test output is as follows:

to:6

and:4

not:4

that:3

too:3

a:3

the:3

:3

of:3

That:3

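This relies on the built-in sorted function: passing key=stat_counter.get sorts the dictionary's keys by their counts, and reverse=True puts the largest counts first. The entry with an empty key (":3") is the empty string. It appears because three lines of the text end with a trailing space, so turning the line break into a space leaves two spaces in a row, and split(" ") yields an empty word between them.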
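As a minimal sketch of how that sort works, the toy dictionary below (the name `counts` and its values are made up for illustration) is ranked in two ways: by passing `counts.get` as the key, as the script above does, and by sorting the `(word, count)` pairs directly.

```python
# Toy data, invented for illustration only.
counts = {"to": 6, "and": 4, "a": 3, "the": 3, "dew": 1}

# Iterating a dict yields its keys; key=counts.get ranks each key by its value.
top_words = sorted(counts, key=counts.get, reverse=True)[:3]

# Sorting the items instead keeps each word paired with its count.
top_pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:3]

print(top_words)
print(top_pairs)
```

Because sorted is stable, words with equal counts keep the order in which they were first inserted into the dictionary, which would explain why tied words in the output above appear in order of first appearance in the text.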


2

A supplementary note on using Counter

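Python's standard-library module collections provides a Counter class. It is also very powerful and may come in handy in experiment design, though it behaves a little differently from the word count above. When it is given a string, Counter treats each single letter or character as the item to count, and the code becomes even more concise.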
#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# @Time    : 2020-03-29 22:04
# @Author  : Ed Frey
# @File    : counter_func.py
# @Software: PyCharm
from collections import Counter

text= '''清明时节雨纷纷,路上行人欲断魂。
借问酒家何处有?牧童遥指杏花村。'''
# Remove the line break so that it is not counted as a character.
stat = Counter(text.replace("\n", ""))
print(stat.most_common(5))
The output is as follows:
[('纷', 2), ('。', 2), ('清', 1), ('明', 1), ('时', 1)]
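Counter is not limited to single characters: it counts whatever items the iterable yields, so handing it a list of words gives a word count. The sketch below assumes a simple tokenizer, a regular expression that lowercases the text and keeps runs of letters and apostrophes. That rule and the names `sample`, `tokens`, and `word_counts` are illustrative choices, not part of the original script.

```python
import re
from collections import Counter

# A short excerpt of the soliloquy used earlier, for illustration.
sample = """O, that this too too solid flesh would melt
Thaw and resolve itself into a dew!
Or that the Everlasting had not fix'd"""

# Assumed tokenizer: lowercase, then keep runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", sample.lower())

# Counter tallies each distinct token.
word_counts = Counter(tokens)
print(word_counts.most_common(2))
```

Tokenizing this way also avoids empty-string entries and merges capitalized and lowercase forms of the same word, at the cost of discarding punctuation entirely.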
Finally, here is an excerpt from the Counter help text, for reference:
'''
Help on class Counter in module collections:
class Counter(builtins.dict)
 |  Dict subclass for counting hashable items.  Sometimes called a bag
 |  or multiset.  Elements are stored as dictionary keys and their counts
 |  are stored as dictionary values.
 |
 |  >>> c = Counter('abcdeabcdabcaba')  # count elements from a string
 |
 |  >>> c.most_common(3)                # three most common elements
 |  [('a', 5), ('b', 4), ('c', 3)]
 |  >>> sorted(c)                       # list all unique elements
 |  ['a', 'b', 'c', 'd', 'e']
 |  >>> ''.join(sorted(c.elements()))   # list elements with repetitions
 |  'aaaaabbbbcccdde'
 |  >>> sum(c.values())                 # total of all counts
 |  15
 |
 |  >>> c['a']                          # count of letter 'a'
 |  5
 |  >>> for elem in 'shazam':           # update counts from an iterable
 |  ...     c[elem] += 1                # by adding 1 to each element's count
 |  >>> c['a']                          # now there are seven 'a'
 |  7
 |  >>> del c['b']                      # remove all 'b'
 |  >>> c['b']                          # now there are zero 'b'
 |  0
 |
 |  >>> d = Counter('simsalabim')       # make another counter
 |  >>> c.update(d)                     # add in the second counter
 |  >>> c['a']                          # now there are nine 'a'
 |  9
 |
 |  >>> c.clear()                       # empty the counter
 |  >>> c
 |  Counter()
 |
 |  Note:  If a count is set to zero or reduced to zero, it will remain
 |  in the counter until the entry is deleted or the counter is cleared:
 |
 |  >>> c = Counter('aaabbc')
 |  >>> c['b'] -= 2                     # reduce the count of 'b' by two
 |  >>> c.most_common()                 # 'b' is still in, but its count is zero
 |  [('a', 3), ('c', 1), ('b', 0)]
'''
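The note at the end of the help text says that zero counts stay in the counter until they are deleted. A minimal sketch of one way to drop them, using the unary plus operator that Counter supports (Python 3.3 and later); the variable names are illustrative.

```python
from collections import Counter

c = Counter("aaabbc")
c["b"] -= 2                 # 'b' stays in the counter with a count of zero

# Unary plus returns a new Counter holding only the positive counts.
positive_only = +c

print(c.most_common())
print(positive_only.most_common())
```

The original counter is left unchanged, so this is convenient when the zero entries should be hidden only for reporting.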

-END-

© Copyright

An original work by 奔跑的键盘侠 | Feel free to share on WeChat Moments | Please contact the author for permission before reposting
