精通Python爬虫框架Scrapy
The book was published in February 2018, yet its code is still written in Python 2.
The environment is set up with Vagrant, but because of network issues in mainland China the installation was painfully slow.
The book's content is fairly advanced; for simpler Scrapy examples, see my GitHub repo: https://github.com/zx490336534/spider-review
Selecting HTML elements with XPath
Select HTML elements in the browser's developer console:

$x('//h1')
An XPath expression can be turned into a relative one by prefixing it with a dot (.).
XQuery 1.0, XPath 2.0 and XSLT 2.0 share the same function library.
XPath functions: https://www.w3school.com.cn/xsl/xsl_functions.asp
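Relative and absolute XPath behave differently when applied to an already-selected node. A minimal sketch in the Scrapy shell (the div selection is illustrative, not from the book):

>>> for div in response.xpath('//div'):
...     inner = div.xpath('.//a')   # relative: only anchors inside this div
...     every = div.xpath('//a')    # absolute: still matches anchors anywhere in the document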
Debugging with the Scrapy shell

$ scrapy shell http://example.com
>>> response.xpath('//a/text()')
[<Selector xpath='//a/text()' data='More information...'>]
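To get the actual strings rather than Selector objects, call extract() (or get()/getall() in newer Scrapy versions):

>>> response.xpath('//a/text()').extract()
['More information...']
>>> response.xpath('//a/text()').extract_first()
'More information...'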
Creating a Scrapy project

$ scrapy startproject xxx
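startproject generates the usual skeleton; assuming a project named properties (the name used later in these notes), the layout looks roughly like this:

properties/
    scrapy.cfg
    properties/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py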
Selector objects
Ways of extracting data: https://docs.scrapy.org/en/latest/topics/selectors.html
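CSS selectors work alongside XPath; a quick sketch against the same example.com response:

>>> response.css('a::text').extract()              # CSS equivalent of the XPath above
['More information...']
>>> response.css('a::attr(href)').extract_first()  # first link's href attribute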
Listing the available spider templates

(venv) (base) 192:properties zhongxin$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Use scrapy genspider -t <template> to create a spider from a chosen template.
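For example, to generate a spider named basic restricted to the domain web from the basic template (the names used later in these notes):

$ scrapy genspider -t basic basic web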
Logging from a spider

def parse(self, response):
    self.log('title:%s' % response.xpath('//*[@itemprop="name"][1]/text()').extract())

2021-03-06 09:15:30 [basic] DEBUG: title:['United Kingdom', 'England', 'London', 'All Categories', 'Property']
Testing other URLs

$ scrapy parse --spider=basic http://xxx
Saving to a file

from properties.items import PropertiesItem

def parse(self, response):
    item = PropertiesItem()
    item['title'] = response.xpath('//*[@itemprop="name"][1]/text()').extract()
    return item

2021-03-06 09:23:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.gumtree.com/flats-houses/london>
{'title': ['United Kingdom', 'England', 'London', 'All Categories', 'Property']}
2021-03-06 09:23:08 [scrapy.core.engine] INFO: Closing spider (finished)
Use -o to write the scraped items to a specified file:

(venv) (base) 192:properties zhongxin$ scrapy crawl basic -o a.json
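Scrapy infers the feed format from the file extension, so other formats work the same way:

$ scrapy crawl basic -o items.csv   # CSV
$ scrapy crawl basic -o items.jl    # JSON Lines
$ scrapy crawl basic -o items.xml   # XML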
Item loaders and housekeeping fields
Official docs: https://docs.scrapy.org/en/latest/topics/loaders.html

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from properties.items import PropertiesItem

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    # start_urls from the generated template, overridden by the tuple below
    start_urls = ['http://web/']
    start_urls = (
        'https://www.gumtree.com/flats-houses/london',
    )

    def parse(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(str.strip, str.title))
        return l.load_item()
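The PropertiesItem class itself is not shown in these notes; a minimal sketch of properties/items.py, assuming the fields used by the contracts and pipelines below, could look like this:

import scrapy

class PropertiesItem(scrapy.Item):
    # Primary fields
    title = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    address = scrapy.Field()
    image_urls = scrapy.Field()
    # Housekeeping fields
    url = scrapy.Field()
    project = scrapy.Field()
    spider = scrapy.Field()
    server = scrapy.Field()
    date = scrapy.Field()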
Creating contracts
Contracts are a kind of unit test designed for spiders.

def parse(self, response):
    """ This function parses a property page.

    @url http://web:9312/properties/property_000000.html
    @returns items 1
    @scrapes title price description address image_urls
    @scrapes url project spider server date
    """

Check that URL and verify that the listed fields are populated on the returned Item:

$ scrapy check basic
Two-direction crawling with CrawlSpider
CrawlSpider provides a parse() implementation that is driven by the rules attribute.
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item'),
)
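A minimal sketch of how these rules sit inside a spider (the class name and start URL are illustrative, not taken from the book):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://web:9312/properties/index_00000.html']

    rules = (
        # No callback: just follow "next page" links (horizontal crawling)
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # With a callback: extract items from each listing page (vertical crawling)
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        ...  # same extraction logic as the parse() shown earlier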
Submitting a login form

from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider

class LoginSpider(CrawlSpider):
    name = 'login'
    allowed_domains = ["web"]

    # Start with a login request
    def start_requests(self):
        return [
            FormRequest(
                "http://web:9312/dynamic/login",
                formdata={"user": "user", "pass": "pass"})]
Customized login (carrying a nonce)

from scrapy.http import Request, FormRequest
from scrapy.spiders import CrawlSpider

class NonceLoginSpider(CrawlSpider):
    name = 'noncelogin'
    allowed_domains = ["web"]

    # Start on the welcome page
    def start_requests(self):
        return [
            Request(
                "http://web:9312/dynamic/nonce",
                callback=self.parse_welcome)]

    # Post welcome page's first form with the given user/pass
    def parse_welcome(self, response):
        return FormRequest.from_response(
            response,
            formdata={"user": "user", "pass": "pass"})
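FormRequest.from_response() locates the first form in the response and pre-fills its existing fields (including hidden ones such as the server-generated nonce), then merges in the formdata you pass, so the hidden token is submitted along with the credentials.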
Passing arguments between responses

def parse(self, response):
    xxx
    yield Request(url, meta={"title": title}, callback=self.parse_item)

def parse_item(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_value('title', response.meta['title'],
                MapCompose(str.strip, str.title))
Deploying to Scrapinghub
http://scrapinghub.com/
Writing an Item pipeline

# scrapybook/ch08/properties/properties/pipelines/tidyup.py
from datetime import datetime

class TidyUp(object):
    """A pipeline that does some basic post-processing"""

    def process_item(self, item, spider):
        """Pipeline's main method. Formats the date as a string."""
        # list(...) keeps this working on Python 3, where map() returns an iterator
        item['date'] = list(map(datetime.isoformat, item['date']))
        return item

# scrapybook/ch08/properties/properties/settings.py
ITEM_PIPELINES = {
    'properties.pipelines.tidyup.TidyUp': 100,
}
A typical pipeline skeleton

class XXXPipeline(object):
    def open_spider(self, spider):
        self.f = xx.open()

    def process_item(self, item, spider):
        self.f.write(item)
        return item

    def close_spider(self, spider):
        self.f.close()
For example, writing items to a JSON file:

import json

class MyPipeline(object):
    def open_spider(self, spider):
        self.file = open('Thanzhou.json', 'w', encoding='utf8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()
Scrapy is a Twisted application
Never write blocking code, under any circumstances.
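If you must call blocking code (for example, a synchronous library) from a pipeline, one option is to push it onto Twisted's thread pool. A minimal sketch; slow_clean is a hypothetical helper, not from the book:

from twisted.internet.threads import deferToThread

class DeferToThreadPipeline(object):
    def process_item(self, item, spider):
        # Run the blocking call in a worker thread; Scrapy waits on the returned Deferred
        return deferToThread(self.slow_clean, item)

    def slow_clean(self, item):
        # ... some blocking work on the item ... (hypothetical)
        return item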
Inserting items into MySQL

import traceback

import dj_database_url
import MySQLdb

from twisted.internet import defer
from twisted.enterprise import adbapi
from scrapy.exceptions import NotConfigured


class MysqlWriter(object):
    """
    A pipeline that writes items to MySQL databases
    """

    @classmethod
    def from_crawler(cls, crawler):
        """Retrieves scrapy crawler and accesses pipeline's settings"""
        # Get MySQL URL from settings
        mysql_url = crawler.settings.get('MYSQL_PIPELINE_URL', None)
        # If doesn't exist, disable the pipeline
        if not mysql_url:
            raise NotConfigured
        # Create the class
        return cls(mysql_url)

    def __init__(self, mysql_url):
        """Opens a MySQL connection pool"""
        # Store the url for future reference
        self.mysql_url = mysql_url
        # Report connection error only once
        self.report_connection_error = True
        # Parse MySQL URL and try to initialize a connection
        conn_kwargs = MysqlWriter.parse_mysql_url(mysql_url)
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            charset='utf8',
                                            use_unicode=True,
                                            connect_timeout=5,
                                            **conn_kwargs)

    def close_spider(self, spider):
        """Discard the database pool on spider close"""
        self.dbpool.close()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        """Processes the item. Does insert into MySQL"""
        logger = spider.logger
        try:
            yield self.dbpool.runInteraction(self.do_replace, item)
        except MySQLdb.OperationalError:
            if self.report_connection_error:
                logger.error("Can't connect to MySQL: %s" % self.mysql_url)
                self.report_connection_error = False
        except:
            print(traceback.format_exc())
        # Return the item for the next stage
        defer.returnValue(item)

    @staticmethod
    def do_replace(tx, item):
        """Does the actual REPLACE INTO"""
        sql = """REPLACE INTO properties (url, title, price, description)
        VALUES (%s,%s,%s,%s)"""
        args = (
            item["url"][0][:100],
            item["title"][0][:30],
            item["price"][0],
            item["description"][0].replace("\r\n", " ")[:30]
        )
        tx.execute(sql, args)

    @staticmethod
    def parse_mysql_url(mysql_url):
        """
        Parses mysql url and prepares arguments for
        adbapi.ConnectionPool()
        """
        params = dj_database_url.parse(mysql_url)
        conn_kwargs = {}
        conn_kwargs['host'] = params['HOST']
        conn_kwargs['user'] = params['USER']
        conn_kwargs['passwd'] = params['PASSWORD']
        conn_kwargs['db'] = params['NAME']
        conn_kwargs['port'] = params['PORT']
        # Remove items with empty values
        conn_kwargs = dict((k, v) for k, v in conn_kwargs.items() if v)
        return conn_kwargs
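Enabling the pipeline would then look something like the following in settings.py; the module path, priority value, and connection URL are illustrative and assume the class lives in properties/pipelines/mysql.py:

ITEM_PIPELINES = {
    'properties.pipelines.tidyup.TidyUp': 100,
    'properties.pipelines.mysql.MysqlWriter': 700,
}
MYSQL_PIPELINE_URL = 'mysql://user:pass@localhost/properties'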