Python Crawlers: A Scrapy Project (Part 2)
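Crawl target: the nationwide rental listings on Fang.com (start URL: http://zu.fang.com/cities.aspx)
Fields to scrape: city; title; rental type; price; layout; area; address; transport
Counter-measures against anti-crawling: a random User-Agent and a delay between requests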
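The two counter-measures listed above could be wired up roughly as follows. This is a minimal sketch, not the configuration used in this project: the middleware class name, the `fang.middlewares` module path, the User-Agent strings and the delay value are all assumptions chosen for illustration. `DOWNLOADER_MIDDLEWARES`, `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` are standard Scrapy settings.

```python
import random

# Hypothetical pool of User-Agent strings; replace with values of your own.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that puts a random User-Agent on every request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent was set before this middleware ran.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue handling the request normally.
        return None


# settings.py fragment (module path assumes the middleware lives in fang/middlewares.py)
DOWNLOADER_MIDDLEWARES = {
    "fang.middlewares.RandomUserAgentMiddleware": 543,
}
DOWNLOAD_DELAY = 2               # seconds between requests; the value is an assumption
RANDOMIZE_DOWNLOAD_DELAY = True  # Scrapy then waits 0.5x to 1.5x DOWNLOAD_DELAY
```

The middleware assigns the header directly instead of using a default, so it takes effect even if another middleware has already filled in a User-Agent.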
1. Create the project:
    scrapy startproject fang
2. Go into the fang folder and run the command that generates the spider file:
    scrapy genspider zufang "zu.fang.com"
Once the command has finished, open the project directory in PyCharm, the best IDE for Python.
3. Edit items.py in the project directory and declare the fields you want to scrape:
    import scrapy


    class HomeproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()

        city = scrapy.Field()       # city
        title = scrapy.Field()      # listing title
        rentway = scrapy.Field()    # rental type
        price = scrapy.Field()      # price
        housetype = scrapy.Field()  # layout
        area = scrapy.Field()       # floor area
        address = scrapy.Field()    # address
        traffic = scrapy.Field()    # transport
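4. Go into the spiders folder, open the spider file (hr.py here) and write the spider. Note that the listing imports from a package named homepro, so adjust the import to match your own project name.
    # -*- coding: utf-8 -*-
    import scrapy
    from homepro.items import HomeproItem
    from scrapy_redis.spiders import RedisSpider


    class HomeSpider(RedisSpider):
        # RedisSpider rather than RedisCrawlSpider: no crawl rules are defined,
        # and parse() has to remain the default callback for URLs read from Redis
        name = 'home'
        allowed_domains = ['zu.fang.com']
        # start_urls = ['http://zu.fang.com/cities.aspx']

        # Redis list from which the start URLs are read
        redis_key = 'homespider:start_urls'
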
        def parse(self, response):
            # Collect the link to each city's rental page
            hrefs = response.xpath('//div[@class="onCont"]/ul/li/a/@href').extract()
            for href in hrefs:
                # The extracted hrefs have no scheme, so prepend one
                href = 'http:' + href
                yield scrapy.Request(url=href, callback=self.parse_city, dont_filter=True)

        def parse_city(self, response):
            # The pager text looks like "共N页" ("N pages in total"); keep only N
            page_num = response.xpath('//div[@id="rentid_D10_01"]/span[@class="txt"]/text()').extract()[0].strip('共页')

            # + 1 so that the last page is requested as well
            for page in range(1, int(page_num) + 1):
                if page == 1:
                    url = response.url
                else:
                    # page 2 -> house/i32, page 3 -> house/i33, ...
                    url = response.url + 'house/i%d' % (page + 30)
                self.logger.debug(url)
                yield scrapy.Request(url=url, callback=self.parse_houseinfo, dont_filter=True)

        def parse_houseinfo(self, response):
            divs = response.xpath('//dd[@class="info rel"]')
            for info in divs:
                # The city is read from a page-level element, hence the absolute XPath
                city = info.xpath('//div[@class="guide rel"]/a[2]/text()').extract()[0].rstrip("租房")
                title = info.xpath('.//p[@class="title"]/a/text()').extract()[0]
                # The three text nodes hold rental type, layout and area, in that order
                details = info.xpath('.//p[@class="font15 mt12 bold"]/text()')
                rentway = details[0].extract().replace(" ", '').lstrip('\r\n')
                housetype = details[1].extract().replace(" ", '')
                area = details[2].extract().replace(" ", '')
                addresses = info.xpath('.//p[@class ="gray6 mt12"]//span/text()').extract()
                address = '-'.join(addresses)
                des = info.xpath('.//p[@class ="mt12"]//span/text()').extract()
                # Fall back to a placeholder ("no details available") when nothing is found
                traffic = '-'.join(des) or "暂无详细信息"

                p_name = info.xpath('.//div[@class ="moreInfo"]/p/text()').extract()[0]
                p_price = info.xpath('.//div[@class ="moreInfo"]/p/span/text()').extract()[0]
                price = p_price + p_name

                item = HomeproItem()
                item['city'] = city
                item['title'] = title
                item['rentway'] = rentway
                item['price'] = price
                item['housetype'] = housetype
                item['area'] = area
                item['address'] = address
                item['traffic'] = traffic
                yield item
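The string handling in the spider is easy to get wrong, so here is a small standalone sketch of the two trickiest pieces: turning the pager text into page URLs, and cleaning a raw text node. The function names and the example city URL are hypothetical; the URL pattern simply mirrors the spider's own `page + 30` rule and has not been checked against the live site.

```python
def page_urls(city_url, pager_text):
    """Build the listing-page URLs for one city from pager text such as '共3页'."""
    # str.strip with '共页' removes those two characters from both ends.
    total = int(pager_text.strip("共页"))
    urls = [city_url]  # page 1 is the city URL itself
    for page in range(2, total + 1):
        urls.append(city_url + "house/i%d" % (page + 30))
    return urls


def clean_field(raw):
    """Drop all spaces and any leading CR/LF characters from a text node."""
    return raw.replace(" ", "").lstrip("\r\n")


if __name__ == "__main__":
    # Hypothetical city URL, used only to show the shape of the result.
    for url in page_urls("http://city.zu.fang.com/", "共3页"):
        print(url)
    print(clean_field("\r\n        整租"))
```

Keeping this logic in plain functions makes it possible to check the page arithmetic without running a crawl.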
5. Edit settings.py to configure how Scrapy runs:
    # Use the scrapy-redis scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Use the scrapy-redis duplicate filter
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

    # Queue class that decides the order in which URLs are crawled.
    # The default orders by priority (as Scrapy does); it is built on a sorted set, so it is neither FIFO nor LIFO.
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

    REDIS_HOST = '10.8.153.73'
    REDIS_PORT = 6379
    # Whether to keep the scheduler queue and dedup records on shutdown: True = keep, False = clear
    SCHEDULER_PERSIST = True
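To make the "neither FIFO nor LIFO" remark concrete, the following pure-Python sketch imitates how a sorted set can act as a priority queue. It is an in-memory illustration, not scrapy-redis code: the class name is made up, and it assumes the score is the negated request priority and that the member with the lowest score is popped first.

```python
class SortedSetQueue:
    """In-memory stand-in for a Redis sorted set used as a priority queue."""

    def __init__(self):
        # member -> score; a member can appear only once, as in a sorted set
        self._scores = {}

    def push(self, member, priority=0):
        # Negate the priority so that the highest priority has the lowest score.
        self._scores[member] = -priority

    def pop(self):
        if not self._scores:
            return None
        # Take the lowest score; equal scores are ordered by member value.
        member = min(self._scores, key=lambda m: (self._scores[m], m))
        del self._scores[member]
        return member


if __name__ == "__main__":
    queue = SortedSetQueue()
    queue.push("listing-page", priority=0)
    queue.push("city-page", priority=10)
    queue.push("next-page", priority=5)
    # Items come out by priority, regardless of insertion order.
    print(queue.pop(), queue.pop(), queue.pop())
```

The order of arrival plays no part here, which is why such a queue behaves differently from a plain list used as FIFO or LIFO.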
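6. Copy the code to the other worker machines and start the spider on each of them. Each worker connects to the Redis instance on the master server:
    redis-cli -h <master server IP>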
7. On the master server, start redis-server first, then start redis-cli and push the start URL onto the spider's redis_key:
    lpush homespider:start_urls <start URL>
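The same push can be done from Python instead of redis-cli. The helper below is a sketch with a hypothetical name; it accepts any client object that offers an `lpush` method, such as a redis-py `redis.Redis` instance, so the function itself does not depend on a running Redis server.

```python
def seed_start_url(client, url, key="homespider:start_urls"):
    """Push a start URL onto the Redis list that the spider reads from."""
    # lpush returns the length of the list after the push.
    return client.lpush(key, url)


# Usage with redis-py (pip install redis); the host is a placeholder:
#
#   import redis
#   client = redis.Redis(host="<master server IP>", port=6379)
#   seed_start_url(client, "http://zu.fang.com/cities.aspx")
```

The default key matches the `redis_key` declared in the spider; if one is changed, the other has to change with it.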