A Simple Sogou WeChat Official Account Scraping Example

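This post walks through a small example of scraping WeChat articles.

Sogou provides links to WeChat official accounts. Each account page only lists its 10 most recent articles, but they are still worth scraping.

Because I wanted to scrape articles from several different official accounts, I used selenium to simulate browser operations. In principle, we could first run a search to collect all the WeChat IDs of one category of official accounts.

For convenience, I manually copied a few WeChat IDs here for testing.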

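Now for the walkthrough. First use selenium to open the page, then type in the WeChat ID and click the button that searches official accounts.

After the search, the matching official account appears.

Next, click straight into it with a click action, because obtaining the URL and requesting it yourself is a lot of trouble.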

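Sogou obfuscates this URL with JavaScript. If you request it directly, the response brazenly displays your IP address, as if to tell you that a few more attempts will get you banned. The ban is not limited to your IP; your cookie can be banned as well. On top of that, the JavaScript changes every few days, so there is no point in forcing this route.

So here we simply use selenium to click through.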

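    # Assumes `driver` is a Chrome WebDriver and `By` comes from selenium.webdriver.common.by
    driver.get('https://weixin.sogou.com/')
    search_box = driver.find_element(By.XPATH, '//*[@id="query"]')
    search_box.click()
    search_box.send_keys('pythonlx')  # the WeChat ID to search for
    # Click the button that searches official accounts
    driver.find_element(By.XPATH, '//*[@id="searchForm"]/div/input[4]').click()
    # Remember the current window before the next click opens a new one
    main_handle = driver.current_window_handle
    # Click the first official account in the results
    driver.find_element(By.XPATH, '//*[@id="sogou_vr_11002301_box_0"]/div/div[2]/p[1]/a').click()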

Because this opens a new window, we need to switch window handles.

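    # The driver does not change focus by itself, so this is still the original window
    new_handle = driver.current_window_handle
    handles = driver.window_handles
    for handle in handles:
        if handle != main_handle:
            driver.switch_to.window(handle)  # move to the newly opened window
    href_list = []
    for i in range(1, 11):  # the page lists the 10 most recent articles
        time.sleep(0.2)
        # Read the `hrefs` attribute from the <h4> and prepend the host to get the full URL
        href_ = driver.find_element(By.XPATH, '//*[@id="history"]/div[{}]/div[2]/div/div/h4'.format(i)).get_attribute('hrefs')
        href = "https://mp.weixin.qq.com" + href_
        href_list.append(href)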
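
The concatenation above works as long as the `hrefs` attribute always holds a path that starts with `/`. As a slightly more defensive sketch, the standard library's `urllib.parse.urljoin` can build the full address instead. The helper name `build_article_url` and the sample path are made up for illustration and are not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = "https://mp.weixin.qq.com"  # same host as in the snippet above


def build_article_url(relative_href):
    """Join an article path onto the WeChat host (hypothetical helper)."""
    if not relative_href:
        return None  # the attribute was missing or empty
    return urljoin(BASE_URL, relative_href)


# Made-up path, for illustration only
print(build_article_url("/s?timestamp=123&src=3"))
```

A `None` result can then be skipped instead of being appended to `href_list`.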

To make the window easy to reuse next time and to save browser memory, close the current window and then switch back to the previous one.

    driver.close()  # close the article-list window
    driver.switch_to.window(new_handle)  # back to the original search window
    return href_list
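
The handle bookkeeping above can also be isolated in a small pure function, which makes it easy to check without launching a browser. This is only a sketch: `find_new_handle` is a hypothetical name, the handles are plain strings standing in for what `driver.window_handles` returns, and unlike the loop above it stops at the first handle that is not the main window.

```python
def find_new_handle(all_handles, main_handle):
    """Return the first handle that is not the main window, or None."""
    for handle in all_handles:
        if handle != main_handle:
            return handle
    return None


# Plain strings stand in for real window handles here
print(find_new_handle(["main", "popup"], "main"))
```

With a real driver, the returned handle would be passed to the window-switching call, and `main_handle` would be used again after `driver.close()`.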

At this point we have the article URLs from the list page. With those in hand, we can fetch each one with plain requests.

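def parse1(href):
    record = {}
    # verify=False skips TLS certificate verification
    resp = requests.get(href, verify=False)
    html = resp.content.decode('utf-8')
    doc = etree.HTML(html)
    # Strip newlines and spaces from the title
    title = doc.xpath('//*[@id="activity-name"]/text()')[0].replace('\n', '').replace(' ', '')
    author = doc.xpath('//*[@id="js_author_name"]/text()')
    paragraphs = doc.xpath('//*[@id="js_content"]/p')
    text = []
    for i in range(1, len(paragraphs) + 1):
        # Take the first text node of each paragraph, skipping empty ones
        p = doc.xpath('//*[@id="js_content"]/p[{}]//text()'.format(i))
        if p == []:
            continue
        text.append(p[0])
    record['标题'] = title    # title
    record['作者'] = author   # author
    record['内容'] = text     # content
    # `f` is the output file opened in the main block
    json.dump(record, f, ensure_ascii=False)
    f.write('\n')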
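
Writing one JSON object per line, as `parse1` does, produces a file that can be read back line by line. Here is a minimal sketch of that round trip, using an in-memory buffer in place of `test.json`. The function names and the sample records are assumptions for illustration.

```python
import io
import json


def write_record(fp, record):
    """Append one record as a single JSON line, keeping Chinese text readable."""
    fp.write(json.dumps(record, ensure_ascii=False))
    fp.write("\n")


def read_records(fp):
    """Parse every non-empty line back into a dict."""
    return [json.loads(line) for line in fp if line.strip()]


buffer = io.StringIO()  # stands in for the output file
write_record(buffer, {"标题": "示例", "作者": ["某人"], "内容": ["第一段"]})
write_record(buffer, {"标题": "second", "作者": [], "内容": []})
buffer.seek(0)
records = read_records(buffer)
print(len(records))
```

Keeping each record on its own line means a partially written file can still be read up to the last complete line.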

That is pretty much it. The complete code further down puts those WeChat IDs together.

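However, some articles have incomplete fields, for example no declared author, and these need more careful handling. Since this is only an example, I have not dealt with many of the details; feel free to adapt it if you are interested.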
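
One way to handle incomplete fields is to normalize them in a single place before writing. The sketch below is an assumption about how that could look, not the author's code: `build_record` and its defaults are hypothetical, and its inputs mimic the lists of strings that the XPath queries return. It also stores the author as a single string rather than a list, which is a design choice of this sketch.

```python
def build_record(title_nodes, author_nodes, paragraphs, default_author="未声明作者"):
    """Turn raw XPath results into a record, or return None if unusable."""
    if not title_nodes:
        return None  # no title found: skip the article
    # Collapse all whitespace in the title (newlines, spaces, tabs)
    title = "".join(title_nodes[0].split())
    # Fall back to a placeholder ("author not declared") when the field is missing
    author = author_nodes[0].strip() if author_nodes else default_author
    # Drop paragraphs that are empty or whitespace-only
    content = [p.strip() for p in paragraphs if p and p.strip()]
    if not content:
        return None  # empty body: treat as a failed parse
    return {"标题": title, "作者": author, "内容": content}


print(build_record(["\n  Hello World \n"], [], ["first", "  ", "second "]))
```

Returning `None` for unusable articles lets the caller decide whether to log, retry, or skip them.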

Complete code:

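import json
import time

import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By


def web_driver(name):
    """Search Sogou for a WeChat ID and return the URLs of its listed articles."""
    driver.get('https://weixin.sogou.com/')
    search_box = driver.find_element(By.XPATH, '//*[@id="query"]')
    search_box.click()
    time.sleep(1)
    search_box.send_keys(name)
    # Click the button that searches official accounts
    driver.find_element(By.XPATH, '//*[@id="searchForm"]/div/input[4]').click()
    time.sleep(1)
    # Remember the current window before the next click opens a new one
    main_handle = driver.current_window_handle
    # Click the first official account in the results
    driver.find_element(By.XPATH, '//*[@id="sogou_vr_11002301_box_0"]/div/div[2]/p[1]/a').click()
    time.sleep(1)
    # The driver does not change focus by itself, so this is still the original window
    new_handle = driver.current_window_handle
    handles = driver.window_handles
    for handle in handles:
        if handle != main_handle:
            driver.switch_to.window(handle)  # move to the newly opened window
    href_list = []
    for i in range(1, 11):  # the page lists the 10 most recent articles
        time.sleep(0.2)
        # Read the `hrefs` attribute from the <h4> and prepend the host to get the full URL
        href_ = driver.find_element(By.XPATH, '//*[@id="history"]/div[{}]/div[2]/div/div/h4'.format(i)).get_attribute('hrefs')
        href = "https://mp.weixin.qq.com" + href_
        href_list.append(href)
    driver.close()  # close the article-list window
    driver.switch_to.window(new_handle)  # back to the original search window
    return href_list


def parse1(href):
    """Fetch one article and append its title, author and content to the output file."""
    record = {}
    # verify=False skips TLS certificate verification
    resp = requests.get(href, verify=False)
    html = resp.content.decode('utf-8')
    doc = etree.HTML(html)
    try:
        # Strip newlines and spaces from the title
        title = doc.xpath('//*[@id="activity-name"]/text()')[0].replace('\n', '').replace(' ', '')
        print(title)
        record['标题'] = title  # title
        author = doc.xpath('//*[@id="js_author_name"]/text()')
        paragraphs = doc.xpath('//*[@id="js_content"]/p')
        text = []
        for i in range(1, len(paragraphs) + 1):
            # Take the first text node of each paragraph, skipping empty ones
            p = doc.xpath('//*[@id="js_content"]/p[{}]//text()'.format(i))
            if p == []:
                continue
            text.append(p[0])
        if not author:
            record['作者'] = "未声明作者"  # author: "author not declared"
        else:
            record['作者'] = author
        if text == []:
            # "Empty content, something is wrong, please check"
            print("空内容，有误，请查看")
        else:
            record['内容'] = text  # content
        json.dump(record, f, ensure_ascii=False)
        f.write('\n')
    except Exception as e:
        print(e)


if __name__ == '__main__':
    # Append mode, so repeated runs keep adding one JSON object per line
    f = open("test.json", "a+", encoding="utf-8")
    driver = webdriver.Chrome()
    name_list = ['pythonlx', 'python6359', 'pythonbuluo',
                 'PythonCoder', 'PythonPush', 'Python_xiaowu', 'cainiao_xueyuan']
    for name in name_list:
        href_list = web_driver(name)
        for href in href_list:
            parse1(href)
    f.close()
    driver.quit()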
