Python爬虫技术系列-06requests完成yz网数据采集V01
- 1.yz网数据爬取概述
- 2. 案例实现
- 2.1 模拟登录页面分析
- 2.2 模拟登录实现
- 2.3 构建待爬取的药材名称
- 2.4 药材查询页面分析
- 2.3 拼接网址,并获取数据
- 2.4 保存数据
- 2.5 保存数据
- 2.6 总结
1.yz网数据爬取概述
药智网在医药领域,有着很多的数据,在一次编书过程中,需要需要相关中药材数据,就通过使用爬虫技术获取部分数据。 由于官方对爬虫类文章的限制,所以本文不进行过多展示。
2. 案例实现
2.1 模拟登录页面分析
登录页:
登录提交页面:
可以看到需要提交的数据
2.2 模拟登录实现
代码语言:javascript复制import requests
import sys
import time
import bs4
# 初始化headers数据
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0",
"Referer":"https://www.yaozh.com/login/proxy?time={}".format(str(time.time_ns())[0:13]),
"Cookie":"",
}
# 请求的登录路径
post_url = "https://www.yaozh.com/login"
# 传入的参数
post_data = {
"username":"xxx",
"pwd":"XXX",
"type":"0",
"pc_vode":"",
"country":"86_zh-CN",
"mobile":"",
"vcode":"",
"pincode":"",
"formhash":"https://www.yaozh.com/login",
"backurl":"https://www.yaozh.com/login"
}
# 构建session
session = requests.session()
# 通过session发送post请求
session.post(post_url,post_data,headers=headers)
2.3 构建待爬取的药材名称
代码语言:javascript复制zy_list = [
"巴戟天","白芍","白术","萹蓄","苍耳子","炒白术","车前子","川芎",
"大黄","大云","当归","党参","翟麦","茯苓","腹毛","甘草","狗脊",
"枸杞","故之子","海金沙","黑杜仲","黑芋头","红花","黄柏","黄精",
"黄连","黄芪","鸡血藤","姜黄","韭子","鹿茸","没药","牡丹皮",
"木瓜","木通","木香","年见","牛膝","全虫","人参","乳香","桑寄生",
"山药","山芋肉","伸筋草","生地","石楠藤","石韦","熟地","锁阳",
"葶苈子","菟丝子","文术","西吉","细辛","香附","辛夷","续断",
"阳起石","淫羊藿","元参","元胡","云苓","枝子","枳壳"
]
2.4 药材查询页面分析
基本网址为:
代码语言:javascript复制https://db.yaozh.com/zhongyaocai?comprehensivesearchcontent=当归&
换一个中药,只需要更新当归即可。 每个中药的页面为:
2.3 拼接网址,并获取数据
代码语言:javascript复制import time
content_dict = {}
for zy_name in zy_list:
time.sleep(10)
# https://db.yaozh.com/zhongyaocai?comprehensivesearchcontent=黄芪&
# 拼接药材信息网址
zy_url = "https://db.yaozh.com/zhongyaocai?comprehensivesearchcontent={}&".format(zy_name)
try:
res = session.get(zy_url)
res.encoding = res.apparent_encoding
content_all = res.text
# 把药材详情页加载为dom文档
soup = bs4.BeautifulSoup(content_all,"html.parser")
# 获取每个药材页面的数据
tbody = soup.select('.table > tbody:nth-child(2) > tr')
# 遍历页面
for tr in tbody:
num = 0
content = []
for cell in tr:
num = 1
# 如果为偶数 提取其内容 把内容追加到列表中
if num % 2 == 0:
content.append(cell.get_text())
if num == 17:
# 把当前数据追加到字典中 并其用中药名作为key
content_dict[content[0]] = content
print("content:=====>",content)
except:
print("当前行有问题{}".format(zy_name))
# 查看保存的数据
print(content_dict)
输出为:
代码语言:javascript复制{'巴戟天': ['巴戟天', '中药', '本品为茜草科植物巴戟天Morinda officinalis How 的干燥根。全年均可采挖,n洗净,除去须根,晒至六、七成干,轻轻捶扁,晒干。', '巴戟、巴吉天、戟天、巴戟肉、鸡肠风、猫肠筋、兔儿肠', '甘、辛,微温。归肾、肝经。', '补肾阳,强筋骨,祛风湿。用于阳痿遗精,宫冷不孕,月经不调,少腹冷痛,风湿痹痛,筋骨痿软。', '查看', '查看'],
'白芍': ['白芍', '中药', '本品为毛茛科植物芍药Paeonia tacti lora Pall.的干燥根。夏、秋二季采挖,洗净,除去头尾和细根,置沸水中煮后除去外皮或去皮后再煮,晒干。', '白芍药、金芍药。', '苦、酸,微寒。归肝、脾经。', '养血调经,敛阴止汗,柔肝止痛,平抑肝阳。用于血虚萎黄,月经不调,自汗,盗汗,胁痛,腹痛,四肢挛痛,头痛眩晕。', '查看', '查看'],
...
}
2.4 保存数据
代码语言:javascript复制import pandas as pd
df1 = pd.DataFrame(content_dict,index = ['药材名称', '药材类别', '药材基原', '药材别名', '性味归经', '功能主治', '文件原文', '详情'])
df1
输出为:
2.5 保存数据
代码语言:javascript复制df_t = df1.T
df_t.to_excel("./zhongyao.xlsx",engine = "openpyxl")
df_t
输出为:
2.6 总结
尽管该网站对外公开的,但使用的过程中一定要注意采集数据的频率,不要给该网站带来性能压力。