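Preface
On the scraping road there is always encryption that beginners like us cannot crack and anti-scraping measures we cannot get around. That is when browser automation tools come in. In most cases, though, using an automation tool as-is gets detected by the target site, because dozens of telltale browser features are exposed. This article shows how to run JS in several common automation tools and how to hide those browser features. It does not cover installation or environment setup; look up a tutorial for those.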
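To make "hiding a feature" a little more concrete: one widely known signal is `navigator.webdriver`, which is true in a browser under automation. The sketch below is only an illustration and is not taken from the stealth.min.js used in this article, which patches far more than this one property. The names `HIDE_WEBDRIVER_JS` and `wrap_patch` are hypothetical.

```python
# Illustrative only: a single-property patch, not a replacement for stealth.min.js.
HIDE_WEBDRIVER_JS = """
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined
});
"""


def wrap_patch(js_source):
    """Wrap a JS patch in an immediately invoked arrow function.

    The wrapper keeps any variables the patch declares out of the
    page's global scope.
    """
    return "(() => {\n" + js_source.strip() + "\n})();"


if __name__ == "__main__":
    # Print the script that would be injected before the page loads.
    print(wrap_patch(HIDE_WEBDRIVER_JS))
```

The resulting string can be injected the same way stealth.min.js is injected in the examples in this article.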
selenium
The first automation module I came across.
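# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date: 2023-12-07 19:58:47
# @Last Modified by: Mehaei
# @Last Modified time: 2023-12-07 21:03:31
import time

from selenium import webdriver


def start():
    driver = webdriver.Chrome()
    # Read the stealth script that patches the exposed browser features
    with open('stealth.min.js', 'r', encoding='utf-8') as f:
        js = f.read()
    # Register the script via CDP so it runs before each page's own scripts
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': js})
    # Open the bot-detection test page and keep it open for inspection
    driver.get("https://bot.sannysoft.com/")
    time.sleep(60)


if __name__ == '__main__':
    start()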
pyppeteer
In testing, a small number of features still cannot be hidden this way, but there are other options: pyppeteer_stealth is about as good as it gets for hiding pyppeteer features.
# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date: 2023-12-07 19:58:47
# @Last Modified by: Mehaei
# @Last Modified time: 2023-12-07 21:22:31
import asyncio

from pyppeteer import launch


async def start():
    browser = await launch(headless=False)
    page = await browser.newPage()
    # Read the stealth script that patches the exposed browser features
    with open('stealth.min.js', 'r', encoding='utf-8') as f:
        js = f.read()
    # Run the script in every new document before the page's own scripts
    await page.evaluateOnNewDocument(js)
    # Open the bot-detection test page and keep it open for inspection
    await page.goto("https://bot.sannysoft.com/")
    await asyncio.sleep(60)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(start())
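For the pyppeteer_stealth route mentioned above, a minimal sketch might look like the following. It assumes the `pyppeteer_stealth` package is installed and uses its `stealth(page)` coroutine in place of the manual stealth.min.js injection. The function name `open_with_stealth` is hypothetical, and I have not compared the two approaches' results on the test page here.

```python
import asyncio


async def open_with_stealth(url):
    """Open `url` in pyppeteer with pyppeteer_stealth's patches applied."""
    # Imported inside the function so this file can be loaded
    # even where pyppeteer is not installed.
    from pyppeteer import launch
    from pyppeteer_stealth import stealth

    browser = await launch(headless=False)
    try:
        page = await browser.newPage()
        # Apply the stealth patches to this page before navigating
        await stealth(page)
        await page.goto(url)
        # Keep the window open long enough to inspect the results
        await asyncio.sleep(60)
    finally:
        # Always close the browser, even if navigation fails
        await browser.close()
```

It can be run the same way as the example above, for instance with `asyncio.get_event_loop().run_until_complete(open_with_stealth("https://bot.sannysoft.com/"))`.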
playwright
A new-generation scraping tool that can record manual actions and generate the code automatically. A superb automation tool.
Official site: https://playwright.dev/
# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date: 2023-12-07 19:58:47
# @Last Modified by: Mehaei
# @Last Modified time: 2023-12-07 20:52:55
import time

from playwright.sync_api import sync_playwright


def start():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        # Load the stealth script into every page created in this context
        context.add_init_script(path='stealth.min.js')
        page = context.new_page()
        # Open the bot-detection test page (timeout is in milliseconds)
        page.goto("https://bot.sannysoft.com/", timeout=100000)
        time.sleep(60)


if __name__ == '__main__':
    start()
DrissionPage
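A newer automation tool that combines the convenience of requests with the power of a browser automation tool. It also hides some automation features automatically and does not need a driver installed. If you are interested, see the official site: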
https://g1879.gitee.io/drissionpagedocs/
# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date: 2023-12-07 19:58:47
# @Last Modified by: Mehaei
# @Last Modified time: 2023-12-07 22:02:58
import time

from DrissionPage import ChromiumPage


def start():
    page = ChromiumPage()
    with open('stealth.min.js', 'r', encoding='utf-8') as f:
        js = f.read()
    # run_js executes JS in the page, but running this stealth script
    # with it raises an error, so the call is left commented out.
    # page.run_js(js)
    # Open the bot-detection test page and keep it open for inspection
    page.get("https://bot.sannysoft.com/")
    time.sleep(60)


if __name__ == '__main__':
    start()
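A possible workaround for the run_js error noted above, not tested in this article: register the script through CDP, as the selenium example does, instead of executing it in the current page. The sketch assumes `page` is a DrissionPage `ChromiumPage` and relies on its `run_cdp` method. The helper name `inject_on_new_document` is hypothetical, and whether this actually avoids the error is an assumption.

```python
from pathlib import Path


def inject_on_new_document(page, script_path="stealth.min.js"):
    """Register a script to run before each new document's own scripts.

    `page` is assumed to be a DrissionPage ChromiumPage, or any object
    with a compatible run_cdp(cmd, **cmd_args) method.
    """
    js = Path(script_path).read_text(encoding="utf-8")
    # Same CDP command the selenium example sends via execute_cdp_cmd
    return page.run_cdp("Page.addScriptToEvaluateOnNewDocument", source=js)
```

With a real browser this would be called once, right after creating the page and before `page.get(...)`, so the script is already registered when the test page loads.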