Using Selenium to scrape a Jianshu user's latest comments: titles, comment text, and comment times

2022-11-27 11:09:12

Task requirements: the target URL is https://www.jianshu.com/u/9104ebf5e177. Using Selenium, scrape the comment titles, comment content, and comment times from the user's latest-comments list, covering 5 pages, and save the comments to an Excel file with a .xls extension. Finally, compress the .ipynb file and the .xls file into one archive.
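The scraping script below does not handle that final packaging step, so here is a minimal sketch using Python's standard zipfile module; the file names jianshu.ipynb and jianshu.xls are placeholders for whatever your actual notebook and spreadsheet are called.

Code language: python
import zipfile

# bundle the notebook and the exported spreadsheet into one archive
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("jianshu.ipynb")   # placeholder notebook name
    zf.write("jianshu.xls")     # placeholder spreadsheet name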


Earlier posts gave a brief introduction to Selenium; now it's time for some hands-on practice. If you haven't read them yet, start there:

  • Using Selenium to locate element objects and extract data
  • A summary of scraping data with Selenium

Straight to the code:

Code language: python
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 21 14:03:06 2020

@author: kun
"""

from selenium import webdriver
from time import sleep
from random import uniform
import pandas as pd

url = "https://www.jianshu.com/u/9104ebf5e177"
browser = webdriver.Chrome()      # needs a chromedriver that matches the local Chrome version
browser.maximize_window()
browser.implicitly_wait(3)        # implicit wait (seconds) for element lookups
browser.get(url)

title, comment, time1 = [], [], []      # collected titles, comment texts and timestamps
browser.find_element_by_xpath("/html/body/div[2]/div/div[1]/ul/li[3]/a").click()   # switch to the "最新评论" (latest comments) tab
#browser.find_element_by_link_text("最新评论").click()   # alternative: locate the tab by its link text
sleep(uniform(2, 3))
for i in range(1, 20):
    # scroll to the bottom repeatedly so more comments are lazy-loaded
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    sleep(uniform(1, 2))    # random pause so the newly loaded comments have time to render
def get_info():
    # grab every loaded title, abstract and timestamp on the page
    titles = browser.find_elements_by_css_selector("a[class='title']")
    comments = browser.find_elements_by_css_selector("p[class='abstract']")
    times = browser.find_elements_by_css_selector("div > div > span.time")
    for i in titles:
        title.append(i.text)
    for i in comments:
        comment.append(i.text)
    for i in times:
        time1.append(i.text)
    data = {"title": title,
            "comment": comment,
            "time": time1}
    df = pd.DataFrame(data)
    # the assignment asks for a .xls file; an older pandas with the xlwt package
    # can write .xls, otherwise the script saves .xlsx as done here
    df.to_excel("jianshu.xlsx", index=False, na_rep="null")

if __name__ == "__main__":
    get_info()
    sleep(uniform(1, 2))

The result is an Excel file with three columns: title, comment, and time.
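A note for anyone running this script on a current Selenium install: the find_element_by_xpath / find_elements_by_css_selector helpers used above were removed in Selenium 4, so the same lookups have to go through the By API instead. A minimal sketch of the equivalent calls, reusing the selectors from the script above:

Code language: python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://www.jianshu.com/u/9104ebf5e177")

# click the "最新评论" (latest comments) tab
browser.find_element(By.XPATH, "/html/body/div[2]/div/div[1]/ul/li[3]/a").click()

# collect the same three element groups as the original script
titles = browser.find_elements(By.CSS_SELECTOR, "a[class='title']")
comments = browser.find_elements(By.CSS_SELECTOR, "p[class='abstract']")
times = browser.find_elements(By.CSS_SELECTOR, "div > div > span.time")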
