Scrapy Framework + Elasticsearch

2019-09-23 15:20:31

Prerequisites

1. Scrapy is installed

2. Elasticsearch is installed

Create a project named scrapyes

scrapy startproject scrapyes

Directory structure

.
|____scrapy.cfg
|____scrapyes
| |______init__.py
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py

Install ScrapyElasticSearch

pip install ScrapyElasticSearch

Configure settings.py

...

ITEM_PIPELINES = {
  'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 300,
}

ELASTICSEARCH_SERVERS = ['192.168.4.215']
ELASTICSEARCH_PORT = 9200 # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'scrapy.course'
ELASTICSEARCH_TYPE = 'course'
ELASTICSEARCH_UNIQ_KEY = 'url'

...

For a description of these settings, see https://github.com/knockrentals/scrapy-elasticsearch
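ELASTICSEARCH_UNIQ_KEY names the item field used to derive each document's _id, so crawling the same URL again should overwrite the existing document rather than create a duplicate. The 40-character hexadecimal _id values in the query result further down are consistent with a SHA-1 digest of that field. The sketch below only illustrates the idea; the function name `document_id` is hypothetical, and hashing the UTF-8 encoded URL with SHA-1 is an assumption about the pipeline's behaviour, not a copy of its code.

```python
import hashlib


def document_id(unique_value):
    """Return a deterministic 40-character hex id for a unique field value.

    Assumption: the id is the SHA-1 digest of the UTF-8 encoded value.
    """
    return hashlib.sha1(unique_value.encode("utf-8")).hexdigest()


# The same URL always yields the same id, so a re-crawl maps to the same document.
first = document_id("http://demo.edusoho.com/course/1")
again = document_id("http://demo.edusoho.com/course/1")
other = document_id("http://demo.edusoho.com/course/2")
```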

Write a spider for online courses

import scrapy


class ESCourseSpider(scrapy.Spider):
    name = 'es_course'

    def start_requests(self):
        # Request course pages 1 through 29 of the demo site.
        for i in range(1, 30):
            url = 'http://demo.edusoho.com/course/' + str(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # 'url' is the field named by ELASTICSEARCH_UNIQ_KEY in settings.py.
        yield {
            'title': response.css('span.course-detail-heading::text').extract_first(),
            'price': response.css('b.pirce-num::text').extract_first(),
            'url': response.url,
        }

Run the spider

scrapy crawl es_course -o es_course.json

The scraped items are written to a newly created file, es_course.json

[
{"url": "http://demo.edusoho.com/course/1", "price": "免费", "title": "\n               课程功能体验\n                        "},
{"url": "http://demo.edusoho.com/course/20", "price": "0.01", "title": "\n               官方主题\n                        "},
{"url": "http://demo.edusoho.com/course/24", "price": "999.00", "title": "\n               会员专区\n                        "},
{"url": "http://demo.edusoho.com/course/22", "price": "免费", "title": "\n               第三方主题\n                        "},
{"url": "http://demo.edusoho.com/course/27", "price": "0.01", "title": "\n               优惠码\n                        "}
]
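The titles in this output carry the page's surrounding whitespace, and the price field mixes decimal strings with the text 免费 ("free"). A minimal sketch of a cleanup step follows. The helper name `normalize_item` is hypothetical, and mapping 免费 to 0.0 is an assumption about how free courses should be represented; any other non-numeric price text would raise a ValueError here.

```python
def normalize_item(item):
    """Return a copy of a scraped item with a trimmed title and numeric price."""
    cleaned = dict(item)

    title = cleaned.get("title")
    if title is not None:
        # Drop the leading/trailing newlines and spaces taken from the HTML.
        cleaned["title"] = title.strip()

    price = cleaned.get("price")
    if price == "免费":
        # Assumption: a course marked "free" is stored with a price of 0.0.
        cleaned["price"] = 0.0
    elif price is not None:
        cleaned["price"] = float(price)

    return cleaned


sample = {
    "url": "http://demo.edusoho.com/course/24",
    "price": "999.00",
    "title": "\n               会员专区\n                        ",
}
result = normalize_item(sample)
```

One possible place to call such a helper is inside the spider's parse method before yielding, so that the cleaned values are what reach both the JSON file and Elasticsearch.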

Check the data in Elasticsearch with the following query

GET scrapy.course*/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 50
}
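The same search can also be issued from Python instead of a query console. The sketch below uses only the standard library and assumes the host and port from settings.py (192.168.4.215:9200) with no authentication; the function names `build_search_request` and `run_search` are hypothetical. Building the request is kept separate from sending it, so the request can be inspected without a running cluster.

```python
import json
import urllib.request


def build_search_request(host, port, index_pattern, size=50):
    """Build an HTTP request for a match_all search on the given index pattern."""
    body = {"query": {"match_all": {}}, "from": 0, "size": size}
    url = "http://{}:{}/{}/_search".format(host, port, index_pattern)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and return the decoded JSON response (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("192.168.4.215", 9200, "scrapy.course*")
# hits = run_search(request)["hits"]["hits"]  # uncomment when Elasticsearch is reachable
```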

Result

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6306093149d91c35eabc1c59f28d68355cc4de9d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/1",
          "price": "免费",
          "title": "\n               课程功能体验\n                        "
        }
      },
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6a090cfe8f9dbf3d21248d64d9907eab4b31bc4d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/24",
          "price": "999.00",
          "title": "\n               会员专区\n                        "
        }
      },

...

This shows that the data has been stored in Elasticsearch.
