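Preface
An old hand takes you car shopping: the thousands of used-car listings on the site can all be fetched with just a few dozen lines of code and saved to your local computer.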
Topics covered:
1. Python basics
2. Functions
3. The requests library
4. XPath
Suitable for complete beginners.
Environment:
Windows, PyCharm, Python 3
Crawler workflow:
1. Pick the target URL
2. Send the request and get the response
3. Parse the page and extract the data
4. Save the data
Steps:
1. Import the tools
import io
import sys
import requests  # pip install requests
from lxml import etree  # pip install lxml
2. Get the URL of each car's detail page by parsing the listing page
def get_detail_urls(url):
    # Target URL, for example:
    # url = 'https://www.guazi.com/cs/buy/o3/'
    # Send the request and get the response
    resp = requests.get(url, headers=headers)
    text = resp.content.decode('utf-8')
    # Parse the page
    html = etree.HTML(text)
    ul = html.xpath('//ul[@class="carlist clearfix js-top"]')[0]
    # print(ul)
    lis = ul.xpath('./li')
    detail_urls = []
    for li in lis:
        # xpath() returns a list; the href is its first element
        detail_url = li.xpath('./a/@href')
        # print(detail_url)
        # The href is site-relative, so prepend the domain
        detail_url = 'https://www.guazi.com' + detail_url[0]
        # print(detail_url)
        detail_urls.append(detail_url)
    return detail_urls
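The string concatenation above works as long as every href is a site-relative path starting with "/". As a minimal sketch, assuming some links might already be absolute, the standard library's urllib.parse.urljoin handles both cases. The helper name to_absolute and the sample hrefs are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin

BASE = 'https://www.guazi.com'


def to_absolute(href, base=BASE):
    """Resolve an href against the site root (hypothetical helper)."""
    return urljoin(base, href)


if __name__ == '__main__':
    # Sample hrefs are made up for illustration.
    print(to_absolute('/cs/abc123.htm'))
    print(to_absolute('https://www.guazi.com/cs/abc123.htm'))
```

Inside get_detail_urls, the concatenation line could then become detail_url = to_absolute(detail_url[0]).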
3. Add the request headers
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
'Cookie':'uuid=5a823c6f-3504-47a9-8360-f9a5040e5f23; ganji_uuid=4238534742401031078259; lg=1; Hm_lvt_936a6d5df3f3d309bda39e92da3dd52f=1590045325; track_id=79952087417704448; antipas=q7222002m3213k0641719; cityDomain=cs; clueSourceCode=*#00; user_city_id=204; sessionid=38afa34e-f972-431b-ce65-010f82a03571; close_finance_popup=2020-05-23; cainfo={"ca_a":"-","ca_b":"-","ca_s":"pz_baidu","ca_n":"pcbiaoti","ca_medium":"-","ca_term":"-","ca_content":"","ca_campaign":"","ca_kw":"-","ca_i":"-","scode":"-","keyword":"-","ca_keywordid":"-","ca_transid":"","platform":"1","version":1,"track_id":"79952087417704448","display_finance_flag":"-","client_ab":"-","guid":"5a823c6f-3504-47a9-8360-f9a5040e5f23","ca_city":"cs","sessionid":"38afa34e-f972-431b-ce65-010f82a03571"}; preTime={"last":1590217273,"this":1586866452,"pre":1586866452}',
}
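The Cookie value above was copied from one browser session, so it will most likely need to be replaced with a fresh one from your own browser. As a rough sketch, assuming you would rather manage cookies as a dict than as one long string, the helper below splits a raw Cookie header into name/value pairs. The name cookie_string_to_dict and the short sample string are illustrative only.

```python
def cookie_string_to_dict(raw):
    """Split a 'k1=v1; k2=v2' Cookie header value into a dict."""
    cookies = {}
    for pair in raw.split('; '):
        # partition splits on the first '=' only, so values may contain '='
        name, sep, value = pair.partition('=')
        if sep:  # skip fragments that have no '='
            cookies[name.strip()] = value
    return cookies


if __name__ == '__main__':
    sample = 'lg=1; cityDomain=cs; clueSourceCode=*#00'
    print(cookie_string_to_dict(sample))
```

The resulting dict can be passed to requests through the cookies argument, for example requests.get(url, headers={'User-Agent': ...}, cookies=cookie_string_to_dict(raw)), instead of embedding the string in headers.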
4. Extract the data from each car's detail page
def parse_detail_page(url):
    resp = requests.get(url, headers=headers)
    text = resp.content.decode('utf-8')
    html = etree.HTML(text)
    # Title
    title = html.xpath('//div[@class="product-textbox"]/h2/text()')[0]
    title = title.strip()
    print(title)
    # Basic information: registration date, mileage, displacement, gearbox
    info = html.xpath('//div[@class="product-textbox"]/ul/li/span/text()')
    # print(info)
    infos = {}
    cardtime = info[0]
    km = info[1]
    displacement = info[2]
    speedbox = info[3]
    infos['title'] = title
    infos['cardtime'] = cardtime
    infos['km'] = km
    infos['displacement'] = displacement
    infos['speedbox'] = speedbox
    print(infos)
    return infos
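Indexing info[0] through info[3] raises an IndexError whenever a detail page yields fewer than four span texts. A minimal sketch of a more tolerant mapping, assuming missing fields should simply be left empty, is shown below. The helper build_record and the sample values are hypothetical.

```python
FIELDS = ('cardtime', 'km', 'displacement', 'speedbox')


def build_record(title, info):
    """Map scraped span texts onto named fields, padding missing ones with ''."""
    values = [text.strip() for text in info[:len(FIELDS)]]
    values += [''] * (len(FIELDS) - len(values))
    record = {'title': title.strip()}
    record.update(zip(FIELDS, values))
    return record


if __name__ == '__main__':
    # Only two of the four expected values are present in this made-up sample.
    print(build_record(' Sample Car 1.6L ', ['2015-06', '82000 km']))
```

In parse_detail_page, the block of individual assignments could then be replaced by infos = build_record(title, info).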
5. Save the data
def save_data(infos, f):
    # One comma-separated line per car
    f.write('{},{},{},{},{}\n'.format(
        infos['title'], infos['cardtime'], infos['km'],
        infos['displacement'], infos['speedbox']))


if __name__ == '__main__':
    base_url = 'https://www.guazi.com/cs/buy/o{}/'
    with open('guazi.csv', 'a', encoding='utf-8') as f:
        # Listing pages 1 to 50
        for x in range(1, 51):
            url = base_url.format(x)
            detail_urls = get_detail_urls(url)
            for detail_url in detail_urls:
                infos = parse_detail_page(detail_url)
                save_data(infos, f)
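Writing rows with str.format works until a title itself contains a comma, at which point the columns shift. As a sketch of one alternative, assuming the same five fields, the standard library's csv.writer quotes such fields automatically. The function save_row and the sample record are illustrative, and an in-memory buffer stands in for the real file.

```python
import csv
import io

COLUMNS = ['title', 'cardtime', 'km', 'displacement', 'speedbox']


def save_row(infos, writer):
    """Write one record; csv.writer quotes fields that contain commas."""
    writer.writerow([infos[name] for name in COLUMNS])


if __name__ == '__main__':
    buffer = io.StringIO()  # stands in for the opened guazi.csv file
    writer = csv.writer(buffer)
    sample = {
        'title': 'Sample Car 1.6L, automatic',  # made-up record
        'cardtime': '2015-06',
        'km': '82000 km',
        'displacement': '1.6L',
        'speedbox': 'automatic',
    }
    save_row(sample, writer)
    print(buffer.getvalue())
```

With a real file, the csv documentation recommends opening it with newline='', for example open('guazi.csv', 'a', encoding='utf-8', newline=''), and creating the writer once before the page loop.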
Finally, run the code. The result is shown in the figure below.