Configuring Dynamic Proxy IPs for Headless Chrome on Linux (Selenium)

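There are many ways to set a proxy in a Python crawler, for example through urllib, requests, or selenium, and detailed code for each is easy to find online. But suppose you need to scrape Taobao, or log in by simulating a CAPTCHA interaction: would you then use this combination (Selenium + Chromedriver + Chrome)?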

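The example above is a Taobao product-data crawler built with Selenium + Chromedriver + Chrome. It contains no proxy configuration, so once the number of requests passes a certain threshold, Taobao becomes unreachable because the local IP is temporarily banned. How, then, do you set a proxy with this stack? Selenium supports proxies in two setups: a browser with a UI, using Chrome as the example, and a browser without a UI, headless Chrome.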

Environment setup

Environment used in this article:

CentOS 7.8
Python 2.7.5
Selenium 3.141.0
Chromedriver 83.0.4103.14
Google Chrome 83.0.4103.116

See: Installing Chrome on CentOS 7

See: Installing chromedriver on CentOS 7

See: Installing the Python environment

Proxy without a username and password (Windows)


from selenium import webdriver

proxy = '127.0.0.1:9743'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + proxy)
chrome = webdriver.Chrome(chrome_options=chrome_options)
chrome.get('http://httpbin.org/get')

Here the proxy is set through ChromeOptions and passed in through the chrome_options parameter when the Chrome object is created.
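Because the end goal is a Linux server with no display, the same switch can be combined with Chrome's headless mode when the proxy needs no credentials. The following is a minimal sketch: the helper names `build_proxy_arguments` and `make_driver` and their parameters are hypothetical, and only `--proxy-server` and `--headless` are actual Chrome switches.

```python
def build_proxy_arguments(host, port, scheme='http', headless=False):
    """Return the Chrome command-line switches for an unauthenticated proxy.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    arguments = ['--proxy-server={0}://{1}:{2}'.format(scheme, host, port)]
    if headless:
        # No extension is involved here, so headless mode can be used directly.
        arguments.append('--headless')
    return arguments


def make_driver(arguments):
    """Create a Chrome driver with the given switches (needs Selenium and Chrome)."""
    # Imported lazily so build_proxy_arguments works without Selenium installed.
    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    for argument in arguments:
        chrome_options.add_argument(argument)
    return webdriver.Chrome(chrome_options=chrome_options)
```

For example, `make_driver(build_proxy_arguments('127.0.0.1', 9743, headless=True))` would start a headless Chrome routed through the local proxy from the example above. The `chrome_options=` keyword matches the Selenium 3.141.0 release used in this article; newer Selenium releases expect `options=` instead.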

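Proxy settings using Abuyun (Windows)

If the proxy requires authentication, the setup is more involved, as shown below. A manifest.json configuration file and a background.js script are needed to configure the authenticated proxy; running the code generates a local file, authProxy@http-dyn.abuyun.9020.zip, that stores this configuration.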


import base64
import string

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import zipfile

proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"
# Tunnel credentials
proxyUser = "xxxxxxxxxx"
proxyPass = "xxxxxxxxxx"
authStr = proxyUser + ":" + proxyPass

proxyAuth = "Basic " + base64.b64encode(authStr.encode('utf-8')).decode('utf-8')


def create_proxy_auth_extension(proxy_host, proxy_port,
                                proxy_username, proxy_password,
                                scheme='http', plugin_path=None):
    if plugin_path is None:
        plugin_path = r'./authProxy@http-dyn.abuyun.9020.zip'

    manifest_json = """
        {
            "version": "1.0.0",
            "manifest_version": 2,
            "name": "Abuyun Proxy",
            "permissions": [
                "proxy",
                "tabs",
                "unlimitedStorage",
                "storage",
                "<all_urls>",
                "webRequest",
                "webRequestBlocking"
            ],
            "background": {
                "scripts": ["background.js"]
            },
            "minimum_chrome_version":"22.0.0"
        }
        """

    background_js = string.Template(
        """
        var config = {
            mode: "fixed_servers",
            rules: {
                singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                },
                bypassList: ["foobar.com"]
            }
          };

        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }

        chrome.webRequest.onAuthRequired.addListener(
            callbackFn,
            {urls: ["<all_urls>"]},
            ['blocking']
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )

    with zipfile.ZipFile(plugin_path, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path


proxy_auth_plugin_path = create_proxy_auth_extension(
    proxy_host=proxyHost,
    proxy_port=proxyPort,
    proxy_username=proxyUser,
    proxy_password=proxyPass)


chrome_options = Options()
chrome_options.add_argument("--start-maximized")
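# Install the extension into Chrome via add_extension; the extension supplies the dynamic proxy
chrome_options.add_extension(proxy_auth_plugin_path)
# Open the browser several times to check that the proxy is in effect
for i in range(5):
    browser = webdriver.Chrome(chrome_options=chrome_options)
    browser.get('http://httpbin.org/get')
    print(browser.page_source)
    # Close each browser before opening the next one
    browser.quit()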
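
Before handing the archive to Chrome, it can be worth checking that the generated zip really contains both files and that the manifest parses as JSON. A minimal sketch using only the standard library; the function name `inspect_proxy_extension` is hypothetical.

```python
import json
import zipfile


def inspect_proxy_extension(plugin_path):
    """Return (sorted file names, parsed manifest) for an extension zip.

    Raises ValueError if manifest.json or background.js is missing.
    """
    with zipfile.ZipFile(plugin_path) as archive:
        names = sorted(archive.namelist())
        for required in ('background.js', 'manifest.json'):
            if required not in names:
                raise ValueError('missing {0} in {1}'.format(required, plugin_path))
        # json.loads raises ValueError if the manifest is not valid JSON.
        manifest = json.loads(archive.read('manifest.json').decode('utf-8'))
    return names, manifest
```

Calling `inspect_proxy_extension(proxy_auth_plugin_path)` after `create_proxy_auth_extension` should return the two file names and a manifest whose `permissions` list includes `proxy`.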

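Running chromedriver with the authenticated-proxy extension in an environment without a display

With the proxy set up as above, one common problem remains: chromedriver cannot run in headless mode once the authenticated-proxy extension is added. Chrome with an extension installed cannot be started headless directly, but the same effect can be achieved indirectly with a virtual display, via pyvirtualdisplay.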

Install the Xvfb virtual display tool: yum install Xvfb
Install the corresponding Python package: pip install pyvirtualdisplay
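
After installing, a quick check that the required programs are actually on PATH can save a confusing failure later. This is a rough sketch: the function name `missing_executables` is hypothetical, and it assumes Python 3.3 or newer for `shutil.which`, whereas the environment listed in this article is Python 2.7.5.

```python
import shutil


def missing_executables(names):
    """Return the entries of `names` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]
```

For instance, `missing_executables(['Xvfb', 'chromedriver'])` returns an empty list when both tools are installed and on PATH.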

The test code is as follows:


from selenium import webdriver
from pyvirtualdisplay import Display
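# Start a virtual display before chromedriver starts
display = Display(visible=0, size=(800, 800))
display.start()
# Use the Abuyun proxy extension built in the previous example
plugin_path = './authProxy@http-dyn.abuyun.9020.zip'
# Add the extension and the required settings
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_extension(plugin_path)
# Run a test request and print the result
driver = webdriver.Chrome(chrome_options=option)
driver.get("https://httpbin.org/ip")
print(driver.page_source)
driver.quit()
# Shut down the virtual display once the browser has exited
display.stop()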

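Each run returns a different IP. The remaining step is to move this code into the original Taobao crawler example, which gives it dynamic-IP scraping of product data, so you no longer have to worry about the IP being banned halfway through a crawl.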
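
To confirm the rotation programmatically rather than by eye, the IP can be pulled out of each `driver.page_source` and the distinct values counted. A minimal sketch, assuming Chrome wraps the httpbin.org/ip JSON body in a simple HTML page with no other braces; the function names are hypothetical.

```python
import json


def extract_origin_ip(page_source):
    """Pull the "origin" field out of an httpbin.org/ip response.

    Chrome wraps the JSON body in HTML, so take the text between the
    first '{' and the last '}' and parse that.
    """
    start = page_source.find('{')
    end = page_source.rfind('}')
    if start == -1 or end < start:
        raise ValueError('no JSON object found in page source')
    return json.loads(page_source[start:end + 1])['origin']


def count_distinct_ips(page_sources):
    """Count how many different exit IPs appear across several responses."""
    return len({extract_origin_ip(source) for source in page_sources})
```

Collecting `driver.page_source` from several runs into a list and passing it to `count_distinct_ips` gives a number greater than 1 when the proxy is rotating IPs.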
