Scraping Tencent's Recruitment Site with Scrapy

2019-07-19 15:51:47

Goal: extract each position's title, category, link, number of openings, work location, and publish date.

I. Workflow for Creating the Scrapy Project

1) Create the project for crawling Tencent's job postings: scrapy startproject tencent

2) Enter the project directory (cd tencent) and generate the spider: scrapy genspider tencentPosition hr.tencent.com

3) Open the project in PyCharm

4) Based on the requirements, define the fields in items.py

5) Write the spider

6) Write the pipeline

7) Configure settings.py

8) The project layout as shown in PyCharm:

II. Writing the Code for Each File

1. The tencentPosition.py file

```python
import scrapy

from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a']

    def parse(self, response):
        # Each job posting is a table row with class "even" or "odd"
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            item["position_name"] = position.xpath("./td[1]/a/text()").get()
            item["position_link"] = position.xpath("./td[1]/a/@href").get()
            item["position_type"] = position.xpath("./td[2]/text()").get()
            item["people_num"] = position.xpath("./td[3]/text()").get()
            item["work_address"] = position.xpath("./td[4]/text()").get()
            item["publish_time"] = position.xpath("./td[5]/text()").get()
            yield item

        # Request the next page (this belongs outside the item loop,
        # so it runs once per response rather than once per row)
        total_page = response.xpath('//div[@class="left"]/span/text()').get()

        if self.offset < int(total_page):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
```
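The pagination above steps `offset` by 10 as long as it is below the total reported on the page. A plain-Python sketch of the URL sequence it produces; `page_urls` is a helper name introduced here for illustration, not part of the spider:

```python
def page_urls(base_url, total, step=10):
    """Replicate the spider's pagination: start at offset 0, then keep
    stepping by `step` while the current offset is still below `total`."""
    offset = 0
    urls = [f"{base_url}{offset}#a"]   # matches start_urls
    while offset < total:
        offset += step
        urls.append(f"{base_url}{offset}#a")
    return urls

base = "https://hr.tencent.com/position.php?&start="
urls = page_urls(base, total=25)
# offsets visited: 0, 10, 20, 30
```

Note that, like the spider, the final request can overshoot the total (offset 30 for a total of 25); the site simply answers it with an empty listing.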

2. The items.py file

```python
import scrapy


class TencentItem(scrapy.Item):
    # Define the fields for your item here:
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()
```

*****Remember: these field names must match the item keys used in tencentPosition.py

3. The pipelines.py file

```python
import json


class TencentPipeline(object):
    def __init__(self):
        print("=======start========")
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the item to a plain dict
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
```
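The pipeline writes one JSON object per line, and `ensure_ascii=False` keeps Chinese text readable instead of escaping it to `\uXXXX` sequences. A stdlib-only sketch of that serialization (the sample items are made up for illustration):

```python
import json

def to_json_lines(items):
    """Serialize items one per line, the way TencentPipeline writes tencent.json."""
    return "".join(json.dumps(dict(it), ensure_ascii=False) + "\n" for it in items)

sample = [
    {"position_name": "后台开发工程师", "work_address": "深圳"},
    {"position_name": "测试工程师", "work_address": "北京"},
]
text = to_json_lines(sample)
print(text, end="")
```

Each line is then independently parseable with `json.loads`, which is convenient for large crawls that do not fit in memory as a single JSON array.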

4. The settings.py file
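The original post showed the settings only as a screenshot. A minimal sketch of the entries this project needs; the pipeline priority of 300 and the User-Agent string are illustrative choices, not taken from the original:

```python
# settings.py (excerpt)

BOT_NAME = "tencent"

SPIDER_MODULES = ["tencent.spiders"]
NEWSPIDER_MODULE = "tencent.spiders"

# Common tutorial setup: skip the robots.txt check (an assumption here)
ROBOTSTXT_OBEY = False

# A browser-like User-Agent (illustrative value)
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0 Safari/537.36")

# Enable the pipeline so items actually reach tencent.json;
# 300 is an arbitrary priority in the 0-1000 range
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}
```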

5. Running the project:

1) Create a main.py in the project root

2) The main.py file

```python
from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())
```

III. Run results:
