This is from the Huawei Cloud Developer Conference lab "Scraping images and text with a Python crawler". It uses the Scrapy framework for crawling and MySQL for storage. The lab runs on a cloud server, but everything here was reproduced locally, so wherever it differs from the original, that's this noob's own tinkering!
- step1. Set up the environment
1. Create a folder named huawei
2. Create a Python virtual environment from the command line

```shell
python -m venv ven
```

3. Install the Scrapy framework

On Windows 7 64-bit, `pip install scrapy` needs its dependencies in place first, or it errors out — typically on Twisted. Match the wheel to your Python version; this noob is on Python 3.8, so the download was Twisted-20.3.0-cp38-cp38-win_amd64.whl. Yes, installed locally from the wheel file!

For detailed steps on installing Scrapy on Windows, your favorite search engine is your friend!
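The local-wheel route sketched above boils down to two pip commands (the wheel filename here matches Python 3.8 on 64-bit Windows — swap in the one for your interpreter):

```shell
# Inside the activated virtual environment:
# 1. Install the prebuilt Twisted wheel downloaded for your Python version.
pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl
# 2. With Twisted in place, Scrapy installs cleanly.
pip install scrapy
```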
- step2. Create the Scrapy project
Again on the command line:
1. cd into the project directory and activate the virtual environment

```shell
cd huawei
ven\Scripts\activate.bat
```

2. In the cmd window, create a new Scrapy project

```shell
scrapy startproject vmall_spider
cd vmall_spider
scrapy genspider -t crawl vmall "vmall.com"
```
- step3. Key source code
1. vmall.py (the core crawling code)

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from vmall_spider.items import VmallSpiderItem


class VmallSpider(CrawlSpider):
    name = 'vmall'
    allowed_domains = ['vmall.com']
    start_urls = ['https://www.vmall.com/']

    # Follow every link matching a product URL and hand it to parse_item.
    rules = (
        Rule(LinkExtractor(allow=r'.*/product/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath("//div[@class='product-meta']/h1/text()").get()
        image = response.xpath("//a[@id='product-img']/img/@src").get()
        item = VmallSpiderItem(
            title=title,
            image=image,
        )
        print("=" * 30)
        print(title)
        print(image)
        print("=" * 30)
        yield item
```
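The two XPath expressions are the heart of `parse_item`. As a rough standalone sanity check of what they select (a sketch only — the HTML snippet below is made up, not the real vmall.com markup, and Scrapy's selectors support far more XPath than the standard library's limited subset):

```python
import xml.etree.ElementTree as ET

# Hypothetical product-page fragment in the shape the spider expects.
html = """
<html>
  <body>
    <div class="product-meta"><h1>HUAWEI Mate 40</h1></div>
    <a id="product-img"><img src="https://img.example.com/photo_123.jpg"/></a>
  </body>
</html>
"""

root = ET.fromstring(html)
# ElementTree supports relative paths with attribute predicates,
# mirroring the spider's //div[@class=...] and //a[@id=...] lookups.
title = root.find(".//div[@class='product-meta']/h1").text
image = root.find(".//a[@id='product-img']/img").get("src")
print(title)  # HUAWEI Mate 40
print(image)  # https://img.example.com/photo_123.jpg
```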
2. items.py

```python
import scrapy


class VmallSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image = scrapy.Field()
```
3. pipelines.py
Data storage: downloads each image and inserts the item into MySQL.

```python
import os
from urllib import request

import pymysql


class VmallSpiderPipeline:
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',   # elastic public IP of the cloud database (localhost here)
            'port': 3306,          # database port
            'user': 'vmall',       # database user
            'password': '123456',  # RDS password
            'database': 'vmall',   # database name
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        # Parameterized INSERT; the table name `goods` is an assumption here --
        # adjust it to whatever table your schema actually uses.
        self._sql = "insert into goods(title, image) values(%s, %s)"
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        url = item['image']
        image_name = url.split('_')[-1]
        print("--------------------------image_name-----------------------------")
        print(image_name)
        print(url)
        request.urlretrieve(url, os.path.join(self.path, image_name))
        self.cursor.execute(self._sql, (item['title'], item['image']))
        self.conn.commit()
        return item
```
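`process_item` derives the local filename from the last underscore-separated chunk of the image URL. A quick standalone check of that split logic (the URL below is made up for illustration):

```python
# Hypothetical image URL in the style the crawler sees.
url = "https://res.vmall.com/pimages/CN/pms/gbom/428_428_ABCDEF.jpg"
# Keep everything after the last underscore as the saved filename.
image_name = url.split('_')[-1]
print(image_name)  # ABCDEF.jpg
```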
4. settings.py
Only the changed parts:

```python
BOT_NAME = 'vmall_spider'

SPIDER_MODULES = ['vmall_spider.spiders']
NEWSPIDER_MODULE = 'vmall_spider.spiders'

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Duplicate dict keys collapse to the last one, so define 'User-Agent'
    # exactly once; rotate UAs via a downloader middleware if needed.
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
}

ITEM_PIPELINES = {
    'vmall_spider.pipelines.VmallSpiderPipeline': 300,
}
```
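Note that a Python dict literal silently keeps only the last value for a repeated key, which is why listing several `'User-Agent'` entries in `DEFAULT_REQUEST_HEADERS` would send just one of them:

```python
# Repeated keys in a dict literal: only the last assignment survives.
headers = {
    'User-Agent': 'UA-one',
    'User-Agent': 'UA-two',
    'User-Agent': 'UA-three',
}
print(len(headers))           # 1
print(headers['User-Agent'])  # UA-three
```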
5. Create start.py as a run/debug entry point

```python
from scrapy import cmdline

cmdline.execute("scrapy crawl vmall".split())
```
- step4. Set up the local database
Tool: phpstudy panel (XiaoPi panel)
Client: Navicat for MySQL
Run it!
Source:
Lab: Scraping images and text with a Python crawler
https://lab.huaweicloud.com/testdetail.html?testId=468&ticket=ST-1363346-YzykQhBcmiNeURp6pgL0ahIy-sso