Python Crawler: Scraping a Novel

2022-01-13 14:45:45

Import the libraries

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup

Scraping a single chapter

The novel I chose is 《你是我的城池营垒》. To grab the whole book you have to open every chapter and scrape it one by one. At first I thought that meant simulating clicks, which is why I reached for selenium, but by the end it was clear that passing each chapter's URL is enough, so no clicks are needed and the same thing could be done with requests instead of selenium (a requests-based sketch appears at the end of this section).

Request the site:

url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
# Give PhantomJS a mobile Chrome user agent so the site serves the page normally
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)

First open a chapter and look at its structure.

You can see that the chapter title carries the class title, and the body text sits inside a div with class="content".

Store the title and the div text, appending a "\n" after the title to break the line, and a couple of newlines after the div text as well; otherwise consecutive chapters would run together.

title = driver.find_element_by_class_name('title')
title = title.text + "\n"
print(title)
div = driver.find_element_by_id('content')
content = div.text + "\n\n"

Write it out to a file; the encoding has to be set to utf-8 here:

f = open("d:/a.txt", 'a+', encoding='utf-8')
f.write(title)
f.write(content)
f.close()
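
As a small aside, the same write can be done with a with block, which closes the file automatically even if an exception is raised; a minimal equivalent:

with open("d:/a.txt", 'a+', encoding='utf-8') as f:
    f.write(title)
    f.write(content)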

With that, the code for scraping one chapter is done; run it to test:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
title = driver.find_element_by_class_name('title')
title = title.text + "\n"
div = driver.find_element_by_id('content')
f = open("d:/a.txt", 'a+', encoding='utf-8')
print(title)
content = div.text + "\n\n"
f.write(title)
f.write(content)
f.close()
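
Since no clicking is involved, the same chapter can also be fetched without a browser at all, as mentioned above. Below is a rough requests-based sketch: download_with_requests is just an illustrative name, the selectors (class title, id content) are the ones used above, and it assumes the chapter text is present in the raw HTML without JavaScript rendering.

import requests
from bs4 import BeautifulSoup

def download_with_requests(url):
    # Send the same mobile user agent as a plain HTTP header
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = resp.apparent_encoding  # guard against a wrong charset guess
    soup = BeautifulSoup(resp.text, "lxml")
    title = soup.find(class_="title").get_text() + "\n"
    content = soup.find(id="content").get_text() + "\n\n"
    with open("d:/a.txt", "a+", encoding="utf-8") as f:
        f.write(title)
        f.write(content)

download_with_requests('http://www.fyhuabo.com/bqg/3805/4369788.html')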

Scraping every chapter

Wrap the single-chapter scraper above into a function so it can be called later.

Next, analyse the page:

It turns out that the "latest chapters" block and the full chapter list below it have the same class attribute. We want the second div, so all_li collects every div with class="section-box" and then takes the second one, which is the chapter list we are after.

all_li = BeautifulSoup(driver.page_source, "lxml").find_all(class_="section-box")
all_li = all_li[1]  # the second section-box div is the full chapter list

What we need is the href attribute of the a tag inside each li, so we run all_li = all_li.find_all('a') to collect all the a tags. Inspecting all_li shows entries such as 第1章 序 and 第2章 上个路口遇见你. All the href links turn out to be strings of the same length, so slicing is enough to pull out each chapter's link:

for li in all_li:
    str_0 = str(li)
    str_0 = str_0[9: 31]  # slice out the href (every chapter href has the same length)
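
Slicing works because every href happens to be the same length; as an alternative that does not rely on that, BeautifulSoup can return the attribute directly. A minimal variant of the same loop:

for li in all_li:
    str_0 = li.get('href')  # read the href attribute instead of slicing the tag's string form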

Then pass each link into the chapter-scraping function and the whole novel gets downloaded.

Full code

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def download(url_0):
    # Fetch one chapter page and append its title and text to the output file
    url = 'http://www.fyhuabo.com/bqg/3805' + url_0
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
    )
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    driver.get(url)
    title = driver.find_element_by_class_name('title')
    title = title.text + "\n"
    div = driver.find_element_by_id('content')
    f = open("d:/a.txt", 'a+', encoding='utf-8')
    print(title)
    content = div.text + "\n\n"
    f.write(title)
    f.write(content)
    f.close()

# Open the table-of-contents page and collect every chapter link
driver = webdriver.PhantomJS()
driver.get('http://www.fyhuabo.com/bqg/3805/')
all_li = BeautifulSoup(driver.page_source, "lxml").find_all(class_="section-box")
all_li = all_li[1]  # the second section-box div is the full chapter list
all_li = all_li.find_all('a')

for li in all_li:
    str_0 = str(li)
    str_0 = str_0[9: 31]  # slice out the href (every chapter href has the same length)
    download(str_0)
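
One caveat about the code above: PhantomJS support was deprecated and later removed from Selenium, and the find_element_by_* helpers no longer exist in Selenium 4. A minimal sketch of the equivalent single-chapter fetch on a current Selenium, using headless Chrome and the same selectors assumed above:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Headless Chrome takes the place of PhantomJS
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument(
    "user-agent=Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.Chrome(options=options)
driver.get('http://www.fyhuabo.com/bqg/3805/4369788.html')
title = driver.find_element(By.CLASS_NAME, 'title').text + "\n"
content = driver.find_element(By.ID, 'content').text + "\n\n"
driver.quit()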
