Importing the libraries
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
Scraping a single chapter
The novel chosen is 你是我的城池营垒. To grab every chapter you have to open each one and scrape it, and since that looked like it involved clicking, Selenium was used at first. Further along it turned out that simply passing each chapter's URL avoids any simulated clicks, so Selenium is not strictly needed and requests would work just as well.
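For reference, a requests version of the single-chapter fetch might look like the sketch below. The URL and User-Agent string are the same ones used in the Selenium code that follows; whether the site accepts plain requests is an assumption.

import requests
from bs4 import BeautifulSoup

url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/48.0.2564.23 Mobile Safari/537.36'
}
resp = requests.get(url, headers=headers)   # fetch the chapter page directly
resp.encoding = 'utf-8'                     # make sure the Chinese text decodes correctly
soup = BeautifulSoup(resp.text, 'lxml')     # parse it the same way as below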
Request the site:
url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
dcap = dict(DesiredCapabilities.PHANTOMJS)
# set a mobile User-Agent for PhantomJS
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
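Note that PhantomJS support is deprecated in newer Selenium releases; headless Chrome can stand in for it (a sketch, assuming chromedriver is installed and on the PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--headless')
opts.add_argument('user-agent=Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36')
driver = webdriver.Chrome(options=opts)
driver.get('http://www.fyhuabo.com/bqg/3805/4369788.html')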
First open a chapter and take a look at its structure.
You can see that the chapter title has the class title, and the body text sits in the div named content.
Store the title and the div: append a "\n" newline after the title, and append newlines after the div as well, otherwise every chapter would run together.
title = driver.find_element_by_class_name('title')
title = title.text + "\n"            # newline after the chapter title
print(title)
div = driver.find_element_by_id('content')
str = div.text + "\n\n"              # blank line after the chapter body
Save it to a file; the encoding has to be set to utf-8 here:
f = open("d:/a.txt", 'a+', encoding='utf-8')
f.write(title)
f.write(str)
f.close()
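A slightly safer variant is to let a with block close the file automatically even if something raises an exception (same output path, just a sketch):

with open("d:/a.txt", 'a', encoding='utf-8') as f:
    f.write(title)
    f.write(str)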
With that, the code for scraping one chapter is done and can be run as a test:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

url = 'http://www.fyhuabo.com/bqg/3805/4369788.html'
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get(url)
title = driver.find_element_by_class_name('title')
title = title.text + "\n"
div = driver.find_element_by_id('content')
f = open("d:/a.txt", 'a+', encoding='utf-8')
print(title)
str = div.text + "\n\n"
f.write(title)
f.write(str)
f.close()
Scraping all chapters
Wrap the single-chapter scraper above into a function so it can be called in a moment.
Next, analyze the catalogue page:
It turns out the "latest chapters" block and the full chapter list below it share the same div class. We want the second one, so have all_li collect every div with class="section-box" and then take the second element, which is the chapter list we are after.
all_li = BeautifulSoup(driver.page_source, "lxml").find_all(class_="section-box")
all_li = all_li[1]    # the second section-box div is the full chapter list
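The same element can also be picked out with a CSS selector, which some may find more readable (just a sketch):

soup = BeautifulSoup(driver.page_source, "lxml")
chapter_list = soup.select('div.section-box')[1]   # second match = full chapter list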
What we want is the href attribute of the a tags inside the li elements, so run all_li = all_li.find_all('a')
to collect all the a tags.
Take a look at the values in all_li:
第1章 序
第2章 上个路口遇见你 1
Notice that all of the href links are strings of equal length, so each chapter's link can be cut out with a fixed slice of str(li), taking the characters that follow the <a href=" prefix:
for li in all_li:
    str_0 = str(li)        # e.g. '<a href="...">chapter title</a>'
    str_0 = str_0[9:31]    # characters 9-30 hold the relative link
Then pass each link into the chapter-scraping function and the whole novel can be pulled down.
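If the link lengths ever differ, a sturdier option is to read the href attribute directly rather than slicing at fixed positions (a sketch, calling the download() function defined in the full code below):

for li in all_li:
    href = li.get('href')    # BeautifulSoup exposes tag attributes directly
    if href:
        download(href)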
Full code
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup


def download(url_0):
    # scrape one chapter and append it to the output file
    url = 'http://www.fyhuabo.com/bqg/3805' + url_0
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
    )
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    driver.get(url)
    title = driver.find_element_by_class_name('title')
    title = title.text + "\n"
    div = driver.find_element_by_id('content')
    f = open("d:/a.txt", 'a+', encoding='utf-8')
    print(title)
    str = div.text + "\n\n"
    f.write(title)
    f.write(str)
    f.close()


# open the catalogue page and collect the chapter links
driver = webdriver.PhantomJS()
driver.get('http://www.fyhuabo.com/bqg/3805/')
all_li = BeautifulSoup(driver.page_source, "lxml").find_all(class_="section-box")
all_li = all_li[1]
all_li = all_li.find_all('a')
for li in all_li:
    str_0 = str(li)
    str_0 = str_0[9:31]
    download(str_0)
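One possible refinement, as a sketch only: starting a fresh PhantomJS instance for every chapter is slow, so the driver could be created once and reused, with a short pause between requests to stay polite to the site (download_chapter is a hypothetical helper, and all_li is the list built above):

import time

def download_chapter(driver, url_0):
    # same steps as download(), but reusing an already-running driver
    driver.get('http://www.fyhuabo.com/bqg/3805' + url_0)
    title = driver.find_element_by_class_name('title').text + "\n"
    body = driver.find_element_by_id('content').text + "\n\n"
    with open("d:/a.txt", 'a', encoding='utf-8') as f:
        f.write(title)
        f.write(body)

driver = webdriver.PhantomJS()
for li in all_li:
    download_chapter(driver, str(li)[9:31])
    time.sleep(1)      # brief pause between chapters
driver.quit()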