Background: a previous article (see Further Reading at the end) used a requests-bs4-re approach to fetch the names and trading information of all A-share stocks on the Shanghai and Shenzhen exchanges and save them to a file. This article crawls the same stock data with the scrapy module.
Technology stack: Scrapy
Environment: Windows 10, JupyterLab
1 Choosing the data source
Selection principle: the stock information must be present statically in the HTML page, not generated by JavaScript.
How to check: press F12 in the browser, inspect the page source, and so on.
Mindset: don't get stuck on any single site; look for multiple information sources.
(1) Stock list:
炒股一点通: http://www.cgedt.com/stockcode/yilanbiao.asp
(2) Individual stock information:
股城网 (gucheng.com): https://hq.gucheng.com/HSinfo.html
Individual stock pages: https://hq.gucheng.com/SH600050/
https://hq.gucheng.com/SZ002276/
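The two sample URLs above follow a simple pattern: an exchange prefix ("SH" for codes beginning with 6, "SZ" otherwise) plus the 6-digit code. The spider below applies exactly this rule; as a standalone sketch (the helper name `stock_page_url` is illustrative, not part of the article's code):

```python
def stock_page_url(code: str) -> str:
    """Map a 6-digit A-share code to its gucheng.com quote URL.

    Codes beginning with '6' trade on Shanghai (SH); all others are
    treated as Shenzhen (SZ), matching the spider's rule below.
    (Illustrative helper, not part of the article's code.)
    """
    prefix = "SH" if code.startswith("6") else "SZ"
    return "https://hq.gucheng.com/" + prefix + code + "/"

print(stock_page_url("600050"))  # https://hq.gucheng.com/SH600050/
print(stock_page_url("002276"))  # https://hq.gucheng.com/SZ002276/
```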
2 Design
- Create the project and a Spider template
- Write the Spider
- Write the Item Pipelines
3 Implementation
(1) Create the project and Spider template (JupyterLab)
import scrapy, os
os.chdir(r"E:\python123\网络爬虫")
!scrapy startproject GuchengStocks
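`startproject` generates the standard Scrapy skeleton; the files edited in the following steps (stocks.py under spiders/, pipelines.py, settings.py) all live inside it:

```
GuchengStocks/
├── scrapy.cfg
└── GuchengStocks/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
```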
(2.1) Generate the Spider (JupyterLab)
import scrapy, os
os.chdir(r"E:\python123\网络爬虫\GuchengStocks")
!scrapy genspider stocks hq.gucheng.com
(2.2) Write the Spider (edit the code in stocks.py)
# -*- coding: utf-8 -*-
# stocks.py
import scrapy
import re

class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['http://www.cgedt.com/stockcode/yilanbiao.asp']

    def parse(self, response):
        # Collect every link on the list page and keep those of the
        # form /stock/XXXXXX/ (a 6-digit stock code).
        for href in response.css('a::attr(href)').extract():
            try:
                temp = re.findall(r"/stock/\d{6}/", href)[0]
                # Codes starting with 6 are Shanghai (SH), otherwise Shenzhen (SZ)
                if temp[7] == "6":
                    stock = "SH" + temp[7:13]
                else:
                    stock = "SZ" + temp[7:13]
                url = 'https://hq.gucheng.com/' + stock
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo1 = response.css('.stock_title')
        name1 = stockInfo1.css('h1').extract()[0]
        name2 = stockInfo1.css('h2').extract()[0]
        stockInfo2 = response.css('.stock_price.clearfix')
        keyList = stockInfo2.css('dt').extract()
        valueList = stockInfo2.css('dd').extract()
        for i in range(len(keyList)):
            # Strip the surrounding tags: '>' at the front, '</dt>' at the back
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:
                val = '--'
            infoDict[key] = val
        infoDict.update(
            {'股票名称': re.findall('>.*</h1>', name1)[0][1:-5] +
                         re.findall('>.*</h2>', name2)[0][1:-5]})
        yield infoDict
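The field-extraction regexes in `parse_stock` can be sanity-checked without running the crawler by applying them to a hand-written `<dt>`/`<dd>` fragment (the markup below is illustrative of the assumed page structure, not a capture of the live site):

```python
import re

# Illustrative markup mimicking the .stock_price dt/dd pairs (assumed shape)
keyHtml = '<dt>最高价</dt>'
valHtml = '<dd>7.28</dd>'

# Same extraction logic as parse_stock:
# slice off the leading '>' and the trailing 5-character closing tag
key = re.findall(r'>.*</dt>', keyHtml)[0][1:-5]
val = re.findall(r'\d+\.?.*</dd>', valHtml)[0][0:-5]

print(key, val)  # 最高价 7.28
```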
(3.1) Write the Pipelines (edit the code in pipelines.py)
- Define the classes that process each scraped item
# pipelines.py
from itemadapter import ItemAdapter

class GuchengstocksPipeline:
    def process_item(self, item, spider):
        return item

class GuchengstocksInfoPipeline:
    def open_spider(self, spider):
        # Called when the spider starts: open the output file
        self.f = open('GuchengStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called when the spider finishes: close the file
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
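The file-writing logic of `GuchengstocksInfoPipeline.process_item` can be exercised on its own, without Scrapy: each scraped item is a dict, serialized one per line. A minimal stand-alone sketch (the sample item values are made up for illustration):

```python
import os
import tempfile

# A fake scraped item with the same shape the spider yields (values invented)
item = {'股票名称': '中国联通', '最高价': '7.28'}

# Write it the same way process_item does: str(dict(item)) + '\n'
path = os.path.join(tempfile.mkdtemp(), 'GuchengStockInfo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(str(dict(item)) + '\n')

with open(path, encoding='utf-8') as f:
    lines = f.readlines()

print(lines[0])
```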
(3.2) Enable the pipeline in ITEM_PIPELINES (edit the code in settings.py)
# settings.py
ITEM_PIPELINES = {
    'GuchengStocks.pipelines.GuchengstocksInfoPipeline': 300,
}
(4) Run the crawler (in a command-prompt window)
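From the project directory, the crawler is started with `scrapy crawl` followed by the spider's name (`stocks`, as defined in stocks.py above); the exact directory path depends on where the project was created:

```shell
cd GuchengStocks
scrapy crawl stocks
```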
Result:
References:
[1] China University MOOC: Web Crawling and Information Extraction with Python (https://www.icourse163.org/course/BIT-1001870001)
[2] Extracting data with Scrapy CSS selectors (https://www.cnblogs.com/runningRain/p/12741095.html)
[3] The meaning of callback functions in Python (https://blog.csdn.net/qq_37849776/article/details/88407371)
[4] Scrapy: handling CSS class names containing spaces in response.css (https://blog.csdn.net/liuhehe123/article/details/81608225)
Further reading:
[1] Python: a "targeted stock-data crawler" example