A Simple Scrapy Crawler for a Novel Site

Prerequisites

Windows 11
Python 3.7.9

Setting Up the Environment

Install Scrapy

pip install Scrapy

Create a Scrapy project

scrapy startproject novelScrapy

The project skeleton has been generated; it looks roughly like this:

novelScrapy/
    scrapy.cfg
    novelScrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

With the framework in place, let's first create our cute little spider:

scrapy genspider novel www.xbiquge.la

That completes the setup. The directory now looks like this:

novelScrapy/
    scrapy.cfg
    novelScrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            novel.py

Writing the Code

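Open Chrome or Edge (I use Edge), go to a novel's table-of-contents page on the site, press F12 to inspect the page, find the anchor tag of a chapter link, then right-click it and copy its XPath: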

//*[@id="list"]/dl/dd[1]/a
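
The copied expression ends in `dd[1]`, so it matches only the first chapter link; dropping the index matches every chapter, which is what the spider needs. The sketch below shows the difference using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, on a made-up, well-formed fragment. The markup and hrefs are assumptions for illustration, not the site's real HTML. In the spider itself, `response.xpath` accepts the original `//*[@id="list"]/dl/dd/a` form.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the chapter list markup.
SAMPLE = """<html><body>
<div id="list"><dl>
  <dd><a href="/0/1/1.html">Chapter 1</a></dd>
  <dd><a href="/0/1/2.html">Chapter 2</a></dd>
  <dd><a href="/0/1/3.html">Chapter 3</a></dd>
</dl></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# The XPath copied from DevTools: the [1] index pins it to the first <dd>.
first_only = root.findall(".//*[@id='list']/dl/dd[1]/a")

# Without the index, every <dd>/<a> under the list is matched.
all_links = root.findall(".//*[@id='list']/dl/dd/a")

chapter_titles = [a.text for a in all_links]
chapter_hrefs = [a.get("href") for a in all_links]
print(len(first_only), chapter_titles, chapter_hrefs)
```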

I'll skip the long-winded analysis. Open the spider file (novel.py) and write the code; the finished result looks like this:

import scrapy

from novelScrapy.items import NovelcrapyItem


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['www.xbiquge.la']
    start_urls = ['https://www.xbiquge.la/xiaoshuodaquan/']
    root_url = 'https://www.xbiquge.la'

    # Step 1: collect the list of novels
    def parse(self, response):
        # One block per novel category
        novel_class_list = response.xpath('//*[@id="main"]/div[@class="novellist"]')
        for i in novel_class_list:
            # Category name
            novel_class = i.xpath('./h2/text()').get()
            # URLs of the novels in this category
            novel_url = i.xpath('./ul/li/a/@href').getall()
            for novel in novel_url:
                yield scrapy.Request(
                    url=novel,
                    meta={'novel_class': novel_class},
                    callback=self.parse_chapter
                )

    # Step 2: get the novel title and its chapters
    def parse_chapter(self, response):
        # Category passed along from parse()
        novel_class = response.meta['novel_class']
        # Novel title
        novel_name = response.xpath('//*[@id="info"]/h1/text()').get()
        # Chapter list
        novel_chapter_list = response.xpath('//*[@id="list"]/dl/dd')
        for i in novel_chapter_list:
            # Chapter title: the full text content of the <a> element
            novel_chapter = i.xpath('./a').xpath('string(.)').get()
            # Build the full chapter URL from the site root and the relative href
            link = self.root_url + i.xpath('./a/@href').get()
            yield scrapy.Request(
                url=link,
                meta={'novel_class': novel_class, 'novel_name': novel_name, 'novel_chapter': novel_chapter},
                callback=self.parse_content
            )

    # Step 3: get the chapter text
    def parse_content(self, response):
        # Category
        novel_class = response.meta['novel_class']
        # Novel title
        novel_name = response.meta['novel_name']
        # Chapter title
        novel_chapter = response.meta['novel_chapter']
        # Chapter text, as a list of text nodes
        novel_content = response.xpath('//*[@id="content"]/text()').getall()

        item = NovelcrapyItem()
        item['novel_class'] = novel_class
        item['novel_chapter'] = novel_chapter
        item['novel_name'] = novel_name
        item['novel_content'] = novel_content

        # Done: hand the item back to the engine
        yield item
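
The chapter links in `parse_chapter` are built by prefixing `root_url` to the extracted `href`, which only works while every `href` is site-relative. A slightly more forgiving approach is to resolve each link against the page URL. Scrapy's `response.urljoin(href)` does this, and the standard-library `urljoin` below follows the same resolution rules. The page URL and hrefs are made-up values for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of a novel's table-of-contents page.
page_url = "https://www.xbiquge.la/0/1/"

# Three shapes an href might take; all values are illustrative.
hrefs = [
    "/0/1/100.html",                        # site-relative
    "100.html",                             # relative to the current page
    "https://www.xbiquge.la/0/1/100.html",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for href, url in zip(hrefs, resolved):
    print(f"{href!r:45} -> {url}")
```

Inside the spider, the equivalent one-liner would be `link = response.urljoin(i.xpath('./a/@href').get())`.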

I almost forgot items.py. It plays much the same role as an ORM model: you only need to declare the fields. It looks roughly like this:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NovelcrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # Novel category
    novel_class = scrapy.Field()
    # Novel title
    novel_name = scrapy.Field()
    # Chapter title
    novel_chapter = scrapy.Field()
    # Chapter text
    novel_content = scrapy.Field()
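
The spider fills `novel_content` from `//*[@id="content"]/text()`, so the field holds a list of text nodes rather than one string, and those nodes may carry padding such as non-breaking spaces or be empty. One possible way to tidy them before saving is sketched below. `clean_content` is a hypothetical helper that is not part of the original project, and the sample fragments are invented.

```python
def clean_content(fragments):
    """Join text nodes into paragraphs, dropping padding and empty nodes."""
    paragraphs = []
    for fragment in fragments:
        # Non-breaking spaces (\xa0) and ideographic spaces (\u3000) count as
        # whitespace in Python 3, so str.strip() removes them too.
        text = fragment.strip()
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Invented sample resembling what .getall() might return for one chapter.
sample = ["\xa0\xa0\xa0\xa0First paragraph.", "\r\n", "\u3000\u3000Second paragraph.\n", ""]
cleaned = clean_content(sample)
print(cleaned)
```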

After the steps above, the spider can crawl every novel on the site. As for the analysis, read the code; I think the comments cover it fairly well.

At this point we seem to have overlooked something. Recall how Scrapy works; it goes roughly like this:

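Engine: Hi, Spider! Which site are you going to handle?
Spider: The boss wants me to handle xxxx.com.
Engine: Give me the first URL that needs processing.
Spider: Here you go, the first URL is xxxxxxx.com.
Engine: Hi, Scheduler! I have a request here; please sort it and put it in the queue.
Scheduler: OK, working on it, just a moment.
Engine: Hi, Scheduler! Give me the request you have processed.
Scheduler: Here you go, this is the request I processed.
Engine: Hi, Downloader! Please download this request according to the boss's downloader middleware settings.
Downloader: Sure! Here is what I downloaded. (On failure: sorry, this request failed to download. The engine then tells the scheduler: this request failed, make a note of it and we'll download it again later.)
Engine: Hi, Spider! Here is the downloaded content, already processed by the boss's downloader middlewares. Handle it yourself. (Note: responses are handed to the def parse() function by default.)
Spider: (after processing the data and working out which URLs to follow up) Hi, Engine! I have two kinds of results: the URLs I need to follow up, and the Item data I extracted.
Engine: Hi, Pipeline! I have an item for you to process. Scheduler! Here are the URLs to follow up; please handle them. Then the cycle repeats from step 4 until everything the boss needs has been collected.
Pipeline and Scheduler: OK, on it!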
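
The conversation above boils down to a loop: the scheduler queues requests, the downloader fetches them, the spider turns each response into new requests or items, and the pipeline consumes the items. The toy model below mimics that loop with plain Python objects. It is not how Scrapy is actually implemented (the real engine is asynchronous and built on Twisted), and the `FAKE_SITE` pages, the `Request` tuple, and the function names are all invented for illustration.

```python
from collections import deque, namedtuple

# Minimal stand-in for a request: a URL plus the callback that parses it.
Request = namedtuple("Request", ["url", "callback"])

# Invented stand-in for the network: URL -> page body.
FAKE_SITE = {
    "/index": "chapter links: /ch1 /ch2",
    "/ch1": "text of chapter 1",
    "/ch2": "text of chapter 2",
}


def parse_index(url, body):
    """Spider callback: follow every chapter link found on the index page."""
    for word in body.split():
        if word.startswith("/ch"):
            yield Request(word, parse_chapter)


def parse_chapter(url, body):
    """Spider callback: turn a chapter page into an item (a plain dict)."""
    yield {"url": url, "content": body}


def run(start_url):
    scheduler = deque([Request(start_url, parse_index)])  # request queue
    seen = {start_url}                                    # duplicate filter
    pipeline_output = []                                  # stands in for the pipeline
    while scheduler:
        request = scheduler.popleft()          # engine asks the scheduler
        body = FAKE_SITE[request.url]          # "downloader" fetches the page
        for result in request.callback(request.url, body):  # spider parses
            if isinstance(result, Request):
                if result.url not in seen:     # new URL goes back to the scheduler
                    seen.add(result.url)
                    scheduler.append(result)
            else:
                pipeline_output.append(result)  # item goes to the pipeline
    return pipeline_output


items = run("/index")
print(items)
```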

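Looking at this, it turns out the pipeline has been slacking off the whole time. That won't do, so let's give it some work: open pipelines.py and write the code that saves each chapter to a file.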

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import os
import time

from itemadapter import ItemAdapter


class NovelcrapyPipeline:

    def process_item(self, item, spider):
        # Target directory: <base path>\<category>\<novel title>
        novel_dir = os.path.join(r'D:\Project\Python\小说', item['novel_class'], item['novel_name'])
        # Create the directory if it does not exist yet
        os.makedirs(novel_dir, exist_ok=True)

        # One UTF-8 text file per chapter
        filename = os.path.join(novel_dir, item['novel_chapter'] + '.txt')
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(''.join(item['novel_content']))

        # Log progress with a timestamp
        now_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        print('[%s] %s %s %s downloaded' % (now_time, item['novel_class'], item['novel_name'], item['novel_chapter']))
        return item
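
Category, novel, and chapter titles go straight into file paths here, and Windows rejects file names containing any of `\ / : * ? " < > |`, so a chapter title with a question mark or colon would make `open` fail. A small sanitizing step, sketched below, is one way to guard against that. `safe_name` is a hypothetical helper that is not in the original code, and replacing with an underscore is an arbitrary choice.

```python
import re

# Characters that Windows does not allow in file names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_name(name, replacement="_"):
    """Return `name` with illegal characters replaced and edges trimmed."""
    cleaned = _ILLEGAL.sub(replacement, name)
    # Trailing spaces and dots are also problematic on Windows.
    return cleaned.strip().rstrip(". ")


print(safe_name("Chapter 12: Who am I?"))
```

In the pipeline, each title could be passed through such a helper before it is joined into the path.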

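With the pipeline written, the spider is ready to go. Run the command below in cmd to start crawling. Because JOBDIR is set, pressing Ctrl+C once stops the crawl and saves its progress, so the next run with the same command picks up where it left off.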

scrapy crawl novel -s JOBDIR=crawls/novel-1

Optimization

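After a trial run it still felt too slow, so I added the settings below to settings.py purely for speed: they remove the download delay, raise the concurrency limits to 100, and disable cookies.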

DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100
COOKIES_ENABLED = False
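
One setting the walkthrough does not show is `ITEM_PIPELINES`: Scrapy only passes items to pipelines listed there, and the generated settings.py leaves that entry commented out. Assuming the project package is `novelScrapy` and the pipeline class keeps the name `NovelcrapyPipeline` from pipelines.py, the entry would look roughly like this. The value 300 is simply the one used in the project template; the integer controls the order in which pipelines run.

```python
# settings.py (fragment)
# Assumed dotted path: <project package>.pipelines.<pipeline class>.
ITEM_PIPELINES = {
    "novelScrapy.pipelines.NovelcrapyPipeline": 300,
}
```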

Finished Product

Source code: https://moleft.lanzoum.com/i7O3W01n23xi
