Task requirements: for the page https://www.jianshu.com/u/9104ebf5e177, scrape the title, body, and timestamp of each entry under the user's "latest comments" tab, covering 5 pages, using Selenium. Save the results to an Excel file with the .xls extension, then package the .ipynb notebook and the .xls file into one archive.
With the basics of Selenium covered in earlier posts, it's time for some hands-on practice. If you haven't read them yet, start here:
- Using Selenium to locate tag objects and extract data
- A summary of scraping data with Selenium
Straight to the code:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 21 14:03:06 2020
@author: kun
"""
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from random import uniform
import pandas as pd

url = "https://www.jianshu.com/u/9104ebf5e177"
browser = webdriver.Chrome()
browser.maximize_window()
browser.implicitly_wait(3)  # implicit wait for elements to appear
browser.get(url)

title, comment, time1 = [], [], []

# Switch to the "latest comments" tab
browser.find_element(By.XPATH, "/html/body/div[2]/div/div[1]/ul/li[3]/a").click()
# browser.find_element(By.LINK_TEXT, "最新评论").click()
sleep(uniform(2, 3))

# Scroll to the bottom repeatedly to trigger lazy loading of more entries
for i in range(1, 20):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    sleep(uniform(1, 2))

def get_info():
    titles = browser.find_elements(By.CSS_SELECTOR, "a[class='title']")
    comments = browser.find_elements(By.CSS_SELECTOR, "p[class='abstract']")
    times = browser.find_elements(By.CSS_SELECTOR, "div > div > span.time")
    for i in titles:
        title.append(i.text)
    for i in comments:
        comment.append(i.text)
    for i in times:
        time1.append(i.text)
    data = {"title": title,
            "comment": comment,
            "time": time1}
    df = pd.DataFrame(data)
    # Note: the assignment asks for .xls, which pandas writes via the xlwt
    # engine; .xlsx (openpyxl) is used here, so rename/convert if needed.
    df.to_excel("jianshu.xlsx", index=False, na_rep="null")

if __name__ == "__main__":
    get_info()
    sleep(uniform(1, 2))
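One pitfall worth noting: if an entry on the page is missing its abstract or timestamp, the three lists end up with different lengths and `pd.DataFrame(data)` raises "arrays must all be same length". A minimal sketch of one way to guard against this, using hypothetical sample data in place of real scraped text, is to align the columns with `zip()` before building the frame:

```python
# Hypothetical scraped columns; one abstract is missing, so the lists
# differ in length and would break the DataFrame constructor.
titles = ["post A", "post B", "post C"]
comments = ["nice article", "thanks"]
times = ["2020.12.20", "2020.12.21", "2020.12.21"]

# zip() truncates to the shortest list, keeping rows aligned.
rows = list(zip(titles, comments, times))
data = {"title": [r[0] for r in rows],
        "comment": [r[1] for r in rows],
        "time": [r[2] for r in rows]}
```

Truncating drops the trailing unmatched entries; if you would rather keep them, `itertools.zip_longest(..., fillvalue="null")` pads the shorter lists instead.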
The result is as follows: