Today we'll write a Python script that uses web-scraping modules (Selenium, Beautiful Soup, and urllib) to scrape Craigslist, a classified-ads site. We'll drive a browser to Craigslist and extract the title, link, and other information from each search result.
First, let's take a look at the site we are going to scrape:
Based on the input parameters, we assemble the URL ahead of time; the parameters are the postal code, the maximum price, the search radius, and the site's location (the Craigslist subdomain).
https://sfbay.craigslist.org/search/sss?search_distance=5&postal=94201&max_price=500
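For reference, the same query string can be assembled from a dictionary of parameters. Here is a minimal sketch using urllib.parse (the final script below simply builds the URL with an f-string instead):
from urllib.parse import urlencode

# Hypothetical helper: build the search URL from the parameters described above.
params = {"search_distance": 5, "postal": 94201, "max_price": 500}
url = "https://sfbay.craigslist.org/search/sss?" + urlencode(params)
print(url)  # https://sfbay.craigslist.org/search/sss?search_distance=5&postal=94201&max_price=500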
Let's walk through the code step by step based on this URL; the complete script is shown at the end.
First, import the packages we'll use:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import urllib.request
Next, we define a class that implements the scraping operations. Its attributes are:
location: the Craigslist site location (subdomain)
postal: the postal (ZIP) code
max_price: the maximum price
radius: the search radius
url: the assembled URL to visit
driver: the Chrome browser driver
delay: the wait time in seconds
class CraiglistScraper(object):
    def __init__(self, location, postal, max_price, radius):
        self.location = location
        self.postal = postal
        self.max_price = max_price
        self.radius = radius
        self.url = f"https://{location}.craigslist.org/search/sss?search_distance={radius}&postal={postal}&max_price={max_price}"
        self.driver = webdriver.Chrome('chromedriver.exe')
        self.delay = 3
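By default this opens a visible Chrome window. If you prefer, Chrome can also be started headless; this is an optional tweak that is not part of the original script, and the exact keyword argument may depend on your Selenium version:
# Optional: run Chrome without opening a visible window.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome('chromedriver.exe', options=options)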
Next, inside the class we define the load_craigslist_url method. It uses Selenium to open the page, waits up to 3 seconds for it to load, and checks that the search form (the element whose id is searchform) is present.
The method looks like this:
def load_craigslist_url(self):
    self.driver.get(self.url)
    try:
        wait = WebDriverWait(self.driver, self.delay)
        wait.until(EC.presence_of_element_located((By.ID, "searchform")))
        print("Page is ready")
    except TimeoutException:
        print("Timed out loading the page")
Looking at the page source, each search result is an li element with class="result-row":
Based on this, we write an extract_post_information method that pulls the title, price, and date out of each result:
def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
        # the row text contains the price (after "$"), the date, and the title
        title = post.text.split("$")
        if title[0] == '':
            title = title[1]
        else:
            title = title[0]
        title = title.split("\n")
        price = title[0]
        title = title[-1]
        title = title.split(" ")
        month = title[0]
        day = title[1]
        title = ' '.join(title[2:])
        date = month + " " + day
        titles.append(title)
        prices.append(price)
        dates.append(date)
    return titles, prices, dates
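To make the string handling above easier to follow, here is a standalone run of the same steps on a made-up row text (the real text comes from post.text, and its exact layout may differ):
# A hypothetical result-row text: price first, then date and title on the next line.
sample = "$25\nDec 15 Mountain bike, good condition"

parts = sample.split("$")           # ['', '25\nDec 15 Mountain bike, good condition']
rest = parts[1] if parts[0] == '' else parts[0]
lines = rest.split("\n")            # ['25', 'Dec 15 Mountain bike, good condition']
price = lines[0]                    # '25'
words = lines[-1].split(" ")
date = words[0] + " " + words[1]    # 'Dec 15'
title = ' '.join(words[2:])         # 'Mountain bike, good condition'
print(price, date, title)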
Next we extract each post's link. From the page source we can see that the link is an a tag whose class is result-title hdrlnk:
We write an extract_post_urls method for this and use BeautifulSoup to do the parsing:
def extract_post_urls(self):
    url_list = []
    html_page = urllib.request.urlopen(self.url)
    soup = BeautifulSoup(html_page, "lxml")
    for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
        print(link["href"])
        url_list.append(link["href"])
    return url_list
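Since Selenium has already loaded the search page, you could also skip the extra urllib request and parse the rendered page source instead. This is just an alternative sketch, not part of the original script:
def extract_post_urls_from_driver(self):
    # Hypothetical variant: reuse the page Selenium already rendered
    # instead of downloading the URL again with urllib.
    soup = BeautifulSoup(self.driver.page_source, "lxml")
    return [link["href"] for link in soup.find_all("a", {"class": "result-title hdrlnk"})]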
Then we add a method that closes the browser:
def quit(self):
    self.driver.close()
Finally, we call the class and run the scrape:
# run a quick test
location = "sfbay"
postal = "94201"
max_price = "500"
radius = "5"
scraper = CraiglistScraper(location, postal, max_price, radius)
scraper.load_craigslist_url()
titles, prices, dates = scraper.extract_post_information()
print(titles)
scraper.extract_post_urls()
scraper.quit()
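If you want to see the three lists side by side, a simple follow-up (not part of the original script) is to zip them together:
# Hypothetical follow-up: print each post as "date | price | title".
for title, price, date in zip(titles, prices, dates):
    print(f"{date} | ${price} | {title}")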
Now you can run it and see the results. The complete code is shown below:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import urllib.request


class CraiglistScraper(object):
    def __init__(self, location, postal, max_price, radius):
        self.location = location
        self.postal = postal
        self.max_price = max_price
        self.radius = radius
        self.url = f"https://{location}.craigslist.org/search/sss?search_distance={radius}&postal={postal}&max_price={max_price}"
        self.driver = webdriver.Chrome('chromedriver.exe')
        self.delay = 3

    def load_craigslist_url(self):
        self.driver.get(self.url)
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is ready")
        except TimeoutException:
            print("Timed out loading the page")

    def extract_post_information(self):
        all_posts = self.driver.find_elements_by_class_name("result-row")
        dates = []
        titles = []
        prices = []
        for post in all_posts:
            title = post.text.split("$")
            if title[0] == '':
                title = title[1]
            else:
                title = title[0]
            title = title.split("\n")
            price = title[0]
            title = title[-1]
            title = title.split(" ")
            month = title[0]
            day = title[1]
            title = ' '.join(title[2:])
            date = month + " " + day
            titles.append(title)
            prices.append(price)
            dates.append(date)
        return titles, prices, dates

    def extract_post_urls(self):
        url_list = []
        html_page = urllib.request.urlopen(self.url)
        soup = BeautifulSoup(html_page, "lxml")
        for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
            print(link["href"])
            url_list.append(link["href"])
        return url_list

    def quit(self):
        self.driver.close()


# run a quick test
location = "sfbay"
postal = "94201"
max_price = "500"
radius = "5"

scraper = CraiglistScraper(location, postal, max_price, radius)
scraper.load_craigslist_url()
titles, prices, dates = scraper.extract_post_information()
print(titles)
scraper.extract_post_urls()
scraper.quit()
If you're interested, give it a try yourself. If you're not yet familiar with Selenium or BeautifulSoup, you can refer to the earlier articles:
Web scraping: practicing on Tianya forum posts
Web scraping: driving the browser with Selenium to grab data
That's it for today. See you next time.