Background: Having previously covered the BeautifulSoup module and the Re library (see Further Reading at the end of this post), we build on that by fetching Taobao search result pages and extracting the product names and prices from them.
Technical route: requests-re
Key point: handling pagination
Start page:
https://s.taobao.com/search?initiative_id=staobaoz_20201209&q=牛奶
Page 2:
https://s.taobao.com/search?initiative_id=staobaoz_20201209&q=牛奶&bcoffset=3&ntoffset=3&p4ppushleft=1,48&s=44
Page 3:
https://s.taobao.com/search?initiative_id=staobaoz_20201209&q=牛奶&bcoffset=1&ntoffset=1&p4ppushleft=1,48&s=88
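Comparing the three URLs, the parameter that actually controls paging is s, which grows by 44 per page (44 listings per page); the other offset parameters can be left out. A minimal sketch of how the crawler in section 3 builds each page's URL (goods and depth are the same variables used there):

goods = '牛奶'      # search keyword
depth = 3           # number of result pages to fetch
start_url = 'https://s.taobao.com/search?q=' + goods

for i in range(depth):
    # page 1 -> s=0, page 2 -> s=44, page 3 -> s=88
    url = start_url + '&s=' + str(44 * i)
    print(url)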
1 Feasibility of a targeted crawler
Robots protocol: https://s.taobao.com/robots.txt
User-agent: *
Disallow: /
Note: the Robots protocol above disallows crawling for every user agent. This example only explores the technical feasibility; please do not crawl the site without restraint.
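As a quick check of what the Robots protocol permits, the standard library's urllib.robotparser can be pointed at the same file (a minimal sketch, not part of the original program):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://s.taobao.com/robots.txt')
rp.read()   # fetches and parses the robots.txt shown above

# With "User-agent: *" and "Disallow: /", this prints False for any search URL
print(rp.can_fetch('*', 'https://s.taobao.com/search?q=牛奶'))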
2 Program structure design
- Fetch the search page content from the web: getHTMLText()
- Extract the information from the page into a suitable data structure: parsePage()
- Use the data structure to display and print the results: printGoodsList()
3 Code
import requests
import re


def getHTMLText(url):
    # Fetch the raw HTML of one search page, sending browser-like headers
    try:
        headers = {
            'authority': 's.taobao.com',
            'cache-control': 'max-age=0',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cookie': '***********',
        }
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as exc:
        print('There was a problem: %s' % (exc))


def parsePage(ilt, html):
    # Extract "view_price" / "raw_title" pairs from the page source with regular expressions
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])  # eval strips the surrounding quotes
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except Exception as exc:
        print('There was a problem: %s' % (exc))


def printGoodsList(ilt):
    # Print the collected results as an aligned table (columns: index, price, title)
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序号", "价格", "商品名称"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '牛奶'       # search keyword
    depth = 2            # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # the s parameter steps by 44 per page
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception as exc:
            print('There was a problem: %s' % (exc))
    printGoodsList(infoList)


main()
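A note on the eval(...) calls in parsePage(): they are only used to strip the surrounding double quotes from the matched substrings (see reference [3]). The same result can be obtained without eval, as in this small sketch (the matched string here is a made-up sample):

raw = '"view_price":"29.90"'                 # hypothetical match returned by re.findall
price = eval(raw.split(':')[1])              # evaluates the quoted literal -> '29.90'
price_safe = raw.split(':')[1].strip('"')    # same result, without calling eval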
Note: Because Taobao has anti-scraping measures in place, requests made with the requests module must carry a customized headers dictionary. The cookie value in the code above has been replaced with *. For how to obtain your own headers, see reference [2] at the end of this post.
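If you would rather not hard-code your own cookie into the script, one option (an assumption, not part of the original code) is to read it from an environment variable, for example:

import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36',
    # assumes the cookie string copied from your browser is exported as TAOBAO_COOKIE
    'cookie': os.environ.get('TAOBAO_COOKIE', ''),
}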
Program output: (screenshot omitted)
References:
[1] China University MOOC: Python Web Crawling and Information Extraction (https://www.icourse163.org/course/BIT-1001870001)
[2] Scraping Taobao product listings with the requests and re libraries (https://zhuanlan.zhihu.com/p/112125997)
[3] Explanation of the many uses of Python's eval function (https://www.jb51.net/article/178395.htm)
Further reading:
[1] Python: Getting started with the BeautifulSoup library
[2] Python: Getting started with the Re (regular expressions) library