Python: "Stock Data Scrapy Crawler" Example

2022-09-20

Background: a previous article used the requests-bs4-re technical route (see Further Reading at the end of this post) to fetch the names and trading information of all A-share stocks listed in Shanghai and Shenzhen and save them to a file. This post crawls the same stock data with the scrapy framework instead.

Technical route: scrapy

Code environment: Win10, JupyterLab

1 Choosing the data websites

Selection principle: the stock information must be present statically in the HTML page, not generated by JavaScript.

Selection method: open the browser developer tools (F12), view the page source, and so on.

Selection mindset: don't get hung up on any single website; look for multiple sources of the same information.

(1) Getting the stock list:

炒股一点通:http://www.cgedt.com/stockcode/yilanbiao.asp

(2) Getting individual stock information:

股城网:https://hq.gucheng.com/HSinfo.html

A single stock: https://hq.gucheng.com/SH600050/

https://hq.gucheng.com/SZ002276/
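
To verify that a candidate page really serves its quotes as static HTML rather than rendering them with JavaScript, one quick check is to fetch the raw source with requests and search it for the markup the crawler will rely on. A minimal sketch (the class name stock_title is the one used by the Spider below):

import requests

# Fetch the raw HTML; requests does not execute JavaScript, so anything
# found in r.text is guaranteed to be in the static page source.
r = requests.get('https://hq.gucheng.com/SH600050/', timeout=10)
r.raise_for_status()
r.encoding = r.apparent_encoding
# True only if the quote markup exists without JavaScript rendering
print('stock_title' in r.text)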

2 Design
  1. Create the project and the Spider template
  2. Write the Spider
  3. Write the Item Pipelines
3 Implementation

(1) Create the project and the Spider template (JupyterLab)

import os
os.chdir(r"E:\python123\网络爬虫")   # directory in which the project will be created

!scrapy startproject GuchengStocks
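
If the command succeeds, startproject generates the standard Scrapy project skeleton; the files edited in the later steps live in the inner module:

GuchengStocks/
    scrapy.cfg                # project configuration file
    GuchengStocks/            # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py          # edited in step (3.1)
        settings.py           # edited in step (3.2)
        spiders/              # stocks.py is generated here in step (2.1)
            __init__.py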

(2.1) Create the Spider (JupyterLab)

import os

os.chdir(r"E:\python123\网络爬虫\GuchengStocks")   # move into the project root

!scrapy genspider stocks hq.gucheng.com
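
genspider writes a minimal template into GuchengStocks/spiders/stocks.py, roughly the following (the exact template varies slightly across Scrapy versions):

import scrapy

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['hq.gucheng.com']
    start_urls = ['http://hq.gucheng.com/']

    def parse(self, response):
        pass

Step (2.2) replaces start_urls with the list page on cgedt.com and drops allowed_domains, since the Spider has to visit both cgedt.com and hq.gucheng.com.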

(2.2) Write the Spider (edit the code in stocks.py)

# -*- coding: utf-8 -*-
# stocks.py

import scrapy, re

class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['http://www.cgedt.com/stockcode/yilanbiao.asp']

    def parse(self, response):
        # walk every hyperlink on the stock list page
        for href in response.css('a::attr(href)').extract():
            try:
                # stock links look like /stock/600050/; keep the 6-digit code
                temp = re.findall(r"/stock/\d{6}/", href)[0]
                if temp[7] == "6":
                    stock = "SH" + temp[7:13]   # codes starting with 6: Shanghai
                else:
                    stock = "SZ" + temp[7:13]   # all others here: Shenzhen
                url = 'https://hq.gucheng.com/' + stock
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}
        # the block containing the stock name (h1) and code (h2)
        stockInfo1 = response.css('.stock_title')
        name1 = stockInfo1.css('h1').extract()[0]
        name2 = stockInfo1.css('h2').extract()[0]

        # the quote table: field names in <dt>, values in <dd>
        stockInfo2 = response.css('.stock_price.clearfix')
        keyList = stockInfo2.css('dt').extract()
        valueList = stockInfo2.css('dd').extract()
        for i in range(len(keyList)):
            # strip the surrounding tags, e.g. '<dt>最高</dt>' -> '最高'
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                # values start with a digit, e.g. '<dd>7.89</dd>' -> '7.89'
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key] = val

        infoDict.update(
            {'股票名称': re.findall('>.*</h1>', name1)[0][1:-5] +
             re.findall('>.*</h2>', name2)[0][1:-5]})
        yield infoDict
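
To make the string handling in parse concrete, here is what the regular expression and the slicing do to the example link from section 1; this is a standalone sketch that needs no network access:

import re

href = '/stock/600050/'
temp = re.findall(r'/stock/\d{6}/', href)[0]   # '/stock/600050/'
print(temp[7])       # '6'  -- the first digit decides the exchange prefix
print(temp[7:13])    # '600050'  -- the six-digit stock code
stock = ('SH' if temp[7] == '6' else 'SZ') + temp[7:13]
print('https://hq.gucheng.com/' + stock)       # 'https://hq.gucheng.com/SH600050'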

(3.1) Write the Pipelines (edit the code in pipelines.py)

  • Define the classes that process the scraped items
from itemadapter import ItemAdapter

# pipelines.py
class GuchengstocksPipeline:
    def process_item(self, item, spider):
        return item

class GuchengstocksInfoPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        # (utf-8 so the Chinese field names are written correctly on Win10)
        self.f = open('GuchengStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # write each scraped item as one dictionary per line
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item

(3.2) Configure the ITEM_PIPELINES option (edit the code in settings.py)

# settings.py
ITEM_PIPELINES = {
    'GuchengStocks.pipelines.GuchengstocksInfoPipeline': 300,
}
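
The integer after each pipeline class (here 300) is its order: Scrapy accepts values from 0 to 1000 and passes items through the enabled pipelines in ascending order. A hypothetical configuration enabling both classes from pipelines.py (not needed for this project) would look like this:

# settings.py -- hypothetical: both pipelines enabled, lower number runs first
ITEM_PIPELINES = {
    'GuchengStocks.pipelines.GuchengstocksPipeline': 200,
    'GuchengStocks.pipelines.GuchengstocksInfoPipeline': 300,
}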

(4) Run the crawler (command prompt)
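
From the project's root directory (the one containing scrapy.cfg), the Spider is started by the name defined in stocks.py:

scrapy crawl stocks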

Result: each scraped item is written as a one-line dictionary to GuchengStockInfo.txt.


Further reading:

[1] Python: "股票数据定向爬虫"实例 (the requests-bs4-re crawler mentioned in the background)
