Selenium | Notes

2022-04-01 13:03:08

Introduction

  • Saving a web page as an image with Selenium
  • Saving a web page as a PDF with Selenium
  • More

Preparation

chromedriver downloads - Official: https://chromedriver.storage.googleapis.com/index.html - Taobao mirror: https://npm.taobao.org/mirrors/chromedriver/

Chrome downloads - https://www.slimjet.com/chrome/google-chrome-old-version.php

Selenium / WebDriver basics

Importing the package

Install the Python selenium package with pip

pip install selenium

Ubuntu

Download and install Chrome

Note: pin the Chrome version. The Chrome version must match the chromedriver version.

# Install
sudo dpkg -i google-chrome*.deb
# Fix any missing dependencies
sudo apt-get install -f

Download the matching version of chromedriver

# Download chromedriver
sudo wget http://chromedriver.storage.googleapis.com/88.0.4324.96/chromedriver_linux64.zip

sudo apt-get install unzip

# Unzip
sudo unzip chromedriver_linux64.zip

# Make chromedriver executable for all users
sudo chmod a+x chromedriver

# Fix garbled Chinese text in screenshots of Chinese pages: install Chinese fonts
sudo apt install -y --force-yes --no-install-recommends fonts-wqy-microhei
sudo apt install -y --force-yes --no-install-recommends ttf-wqy-zenhei
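Once both are installed, the version requirement noted earlier can be checked from Python. The sketch below is only an illustration: the helper names are made up, and comparing just the major version number is an assumption about what "matching" means for your Chrome build.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 88.0.4324.182'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """Return True when Chrome and chromedriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def read_version(command):
    """Run a '--version' command and return its stdout (the binary must be on PATH)."""
    return subprocess.run(command, capture_output=True, text=True, check=True).stdout
```

Possible usage: `versions_match(read_version(["google-chrome", "--version"]), read_version(["chromedriver", "--version"]))`.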

Import in code

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Additional imports

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.ie.options import Options as IEOptions

Driver instance

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

import time

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')

# Use Chrome, with the chromedriver path set to './chromedriver'
# (Selenium 4 style; Selenium 3 took the path as the first positional argument)
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)

driver.get("https://github.com/yiyungent/WebScreenshot")

# Resize the window to the full page so the screenshot covers everything
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)

# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()) + '.png')

driver.quit()

Saving a web page as an image with Selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

import time

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')

# Use Chrome, with the chromedriver path set to './chromedriver'
# (Selenium 4 style; Selenium 3 took the path as the first positional argument)
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)

driver.get("https://github.com/yiyungent/WebScreenshot")

# Resize the window to the full page so the screenshot covers everything
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)

# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()) + '.png')

driver.quit()
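The screenshot path above is built inline and relies on the target directory existing. A small helper can take care of both. This is a sketch only: `screenshot_path` is a hypothetical name, and replacing ":" with "-" in the timestamp is my own choice so the file name also works on Windows.

```python
import os
import time


def screenshot_path(directory, when=None):
    """Build '<directory>/<timestamp>.png', creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    # Use '-' rather than ':' so the name is valid on Windows as well.
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime(when))
    return os.path.join(directory, stamp + ".png")
```

It would be used as `driver.save_screenshot(screenshot_path('./screenshots'))`.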

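Saving a web page as a PDF with Selenium

Approach

There are several main options:

  • Use a third-party package: pdfkit. See: https://www.cnblogs.com/silence-cc/p/9463227.html
  • Use Chrome's --print-to-pdf mode to export the requested HTML as a PDF. See: http://osask.cn/front/ask/view/1029784
  • Use the JavaScript call window.print(); to invoke the browser's print function. See: https://gitee.com/shinemic/codes/09y87ph6vf2c5zamwls3q48

Here we choose the third option: it adapts to more situations and makes progress easy to watch. To hide the browser window, just add the --headless option.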
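For comparison, the second option (Chrome's --print-to-pdf mode) needs no Selenium at all. A minimal sketch, assuming the Chrome binary is available as `google-chrome`; the function names are hypothetical, and the extra flags simply mirror the ones used elsewhere in these notes.

```python
import subprocess


def print_to_pdf_command(url, output_pdf, chrome_binary="google-chrome"):
    """Build the argv for a headless Chrome run that writes `url` to `output_pdf`."""
    return [
        chrome_binary,
        "--headless",
        "--disable-gpu",
        "--no-sandbox",  # same flag as in the Selenium examples in these notes
        f"--print-to-pdf={output_pdf}",
        url,
    ]


def save_pdf(url, output_pdf):
    """Run Chrome once; raises CalledProcessError if Chrome exits with an error."""
    subprocess.run(print_to_pdf_command(url, output_pdf), check=True)
```

This gives less control over page state than driving the browser (no login, no waiting for scripts), which is one reason to prefer the third option.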

Implementation

Configure the chromedriver options

import json

from selenium import webdriver

# Pre-select "Save as PDF" as the print destination
appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local"
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2
}
profile = {
    'printing.print_preview_sticky_settings.appState': json.dumps(appState),
    'savefile.default_directory': './articles'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
# Print immediately instead of showing the print preview dialog
chrome_options.add_argument('--kiosk-printing')

Here savefile.default_directory sets the directory where the file is saved; configure it as needed.

Saving the PDF

driver.get(url)
time.sleep(5)  # give the page time to finish loading
# Save the PDF
temp_title = driver.title
driver.execute_script('window.print();')

When Chrome prints a page, the default file name is the page title, so temp_title = driver.title is stored first.

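Renaming

os.rename('./articles/' + temp_title + '.pdf', './articles/' + title + '.pdf')

If several pages of the same site are opened and saved as PDFs, they may well share a title and overwrite one another, so rename the PDF after each save.

Note: the title may be empty when, for example, the page fails to load. The rename then raises an exception, so exception handling is required.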
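One way to handle the rename more defensively is sketched below. The helper names are hypothetical, and the code assumes Chrome names the file exactly after the page title, which may not hold for titles containing characters the file system rejects. It waits for the printed file to appear, cleans the new name, and raises a clear error instead of failing on a missing file.

```python
import os
import re
import time


def safe_filename(title, fallback="untitled"):
    """Replace characters that are unsafe in file names; fall back if nothing is left."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return cleaned or fallback


def rename_pdf(directory, printed_title, new_title, timeout=10.0):
    """Wait for '<printed_title>.pdf' to appear, then rename it to '<new_title>.pdf'."""
    source = os.path.join(directory, printed_title + ".pdf")
    target = os.path.join(directory, safe_filename(new_title) + ".pdf")
    deadline = time.monotonic() + timeout
    while not os.path.exists(source):
        if time.monotonic() > deadline:
            raise FileNotFoundError(f"{source!r} did not appear within {timeout}s")
        time.sleep(0.2)
    os.replace(source, target)  # replaces an existing target on every platform
    return target
```

A caller can wrap `rename_pdf('./articles', temp_title, title)` in `try/except FileNotFoundError` to skip pages that produced no PDF.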

Cookies

Reference:

  • Working with cookies | Selenium

Waits

Reference:

  • Waits | Selenium

Explicit waits

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def document_initialised(driver):
    return driver.execute_script("return initialised")

driver.get("file:///race_condition.html")
# Poll for up to 10 seconds until the page sets `initialised`
WebDriverWait(driver, timeout=10).until(document_initialised)
el = driver.find_element(By.TAG_NAME, "p")
assert el.text == "Hello from JavaScript!"

The above can be simplified to:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver.get("file:///race_condition.html")
el = WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.TAG_NAME, "p"))
assert el.text == "Hello from JavaScript!"
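To make clear what an explicit wait does, here is a simplified polling loop in plain Python. It is not Selenium's implementation: `wait_until` is a hypothetical helper, and unlike WebDriverWait it swallows every exception raised by the condition rather than only a configured set.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            value = condition()
            if value:
                return value
        except Exception as error:  # e.g. the element does not exist yet
            last_error = error
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s") from last_error
        time.sleep(poll_interval)
```

The condition is retried at a fixed interval, and the first truthy result is returned, which is why the lambda form above can hand back the element directly.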

Q&A

Similar tools

Puppeteer

  • puppeteer/puppeteer: Headless Chrome Node.js API

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

PhantomJS

  • ariya/phantomjs: Scriptable Headless Browser

PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. The latest stable release is version 2.1. Important: PhantomJS development is suspended until further notice (see #15344 for more details).

Additional notes

Selenium driver.Url vs. driver.Navigate().GoToUrl()

Selenium is an open source framework, so please have a look at the source code here.

GoToUrl() is defined in RemoteNavigator.cs:

/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">String of where you want the browser to go to</param>
public void GoToUrl(string url)
{
    this.driver.Url = url;
}

/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">Uri object of where you want the browser to go to</param>
public void GoToUrl(Uri url)
{
    if (url == null)
    {
        throw new ArgumentNullException("url", "URL cannot be null.");
    }

    this.driver.Url = url.ToString();
}

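Internally, driver.Navigate().GoToUrl() simply sets driver.Url = url.

Installing / uninstalling *.deb packages on Ubuntu

To install a deb package from the command line, you can use either the apt command or the dpkg command. apt actually uses dpkg under the hood, but apt is more popular and easier to use.

If you get a dependency error while installing a deb package, you can fix it with the following command: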

sudo apt install -f

Method 1

# Install a .deb file
sudo dpkg -i package_name.deb

# Uninstall
sudo dpkg -r program_name

# Query
# This lists every installed package whose name contains "grid", which gives the exact program name.
apt list --installed | grep grid
#WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
#appgrid/now 0.298 all [installed,local]
Method 2

# Install
# sudo apt install ./teamviewer_amd64.deb
sudo apt install path_to_deb_file

# Uninstall
# or: sudo apt remove program_name
sudo apt-get remove package_name

# Query
dpkg -l | grep grid
#ii appgrid 0.298 all Discover and install apps for Ubuntu

Selenium: countering anti-scraping measures

chromedriver: error while loading shared libraries: libglib-2.0.so.0:

The following command fixed it:

apt-get install libglib2.0 -y

But it did not fix the following:

Network is unreachable Network is unreachable
OpenQA.Selenium.WebDriverException: Cannot start the driver service on http://localhost:39255/
   at OpenQA.Selenium.DriverService.Start()

chromedriver: error while loading shared libraries: libnss3.so:

apt-get install libnss3-dev -y

chromedriver: error while loading shared libraries: libxcb.so.1:

apt-get install libxcb1 -y

OpenQA.Selenium.WebDriverException: unknown error: cannot find Chrome binary

Solution:

Chrome is not installed correctly. If the error persists, specify the binary path manually.

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");

// Path to the Chrome executable
// Not needed as long as Chrome is installed correctly
//options.BinaryLocation = "";

OpenQA.Selenium.WebDriverArgumentException: invalid argument

// The url must be a complete, valid URL, e.g. http://moeci.com
OpenQA.Selenium.Navigator.GoToUrl(String url)
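A small pre-check can reject or repair incomplete URLs before they reach the driver. The sketch below uses Python for consistency with the rest of these notes; `normalize_url` is a hypothetical helper, and defaulting to the http scheme is an assumption.

```python
from urllib.parse import urlparse


def normalize_url(raw, default_scheme="http"):
    """Return an absolute http(s) URL, adding a scheme when the input has none."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a complete http(s) URL: {raw!r}")
    return candidate
```

Calling `driver.get(normalize_url(user_input))` then fails early with a readable message instead of an "invalid argument" error from the driver.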

OpenQA.Selenium.WebDriverException: The HTTP request to the remote WebDriver server for URL http://localhost:40811/session timed out after 60 seconds.

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--headless");

// Note: TimeSpan.FromMinutes(5) sets a 5-minute timeout
var driver = new ChromeDriver(chromeDriverDirectory: "/app/tools/selenium/", options, commandTimeout: TimeSpan.FromMinutes(5));

driver.Navigate().GoToUrl(url);

OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash

OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.ExecuteScriptCommand(String script, String commandName, Object[] args)
at OpenQA.Selenium.WebDriver.ExecuteScript(String script, Object[] args)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78

This error only occurs when running in a Docker container: the shared memory (shm_size) runs out, since the default is 64MB.

docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome:4.1.2-20220217

Or with Docker Compose:

version: "3"
services:
  hub:
    image: selenium/hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
  firefox:
    image: selenium/node-firefox
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
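To confirm from inside the container whether shared memory is the limiting factor, the size of /dev/shm can be read with the standard library. This is a sketch: the function names are hypothetical and the 512 MiB threshold is an arbitrary assumption, not a documented Chrome requirement.

```python
import shutil


def shm_total_mb(path="/dev/shm"):
    """Return the total size, in MiB, of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total / (1024 * 1024)


def shm_is_large_enough(path="/dev/shm", minimum_mb=512.0):
    """True when the shared-memory mount offers at least `minimum_mb` MiB."""
    return shm_total_mb(path) >= minimum_mb
```

With Docker's default settings `shm_total_mb()` should report roughly 64, and a larger value after adding --shm-size or shm_size.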

System.InvalidOperationException: session not created

System.InvalidOperationException: session not created
from tab crashed
(Session info: headless chrome=88.0.4324.182) (SessionNotCreated)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.StartSession(ICapabilities desiredCapabilities)
at OpenQA.Selenium.WebDriver..ctor(ICommandExecutor executor, ICapabilities capabilities)
at OpenQA.Selenium.Chromium.ChromiumDriver..ctor(ChromiumDriverService service, ChromiumOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(ChromeDriverService service, ChromeOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(String chromeDriverDirectory, ChromeOptions options, TimeSpan commandTimeout)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78

Solution

var options = new ChromeOptions();
// https://stackoverflow.com/questions/59186984/selenium-common-exceptions-sessionnotcreatedexception-message-session-not-crea
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
// Important: in testing, it only succeeded after this line was added
options.AddArgument("--ignore-certificate-errors");

Timed out receiving message from renderer: 10.000

ChromeDriver was started successfully.
[1646482757.506][SEVERE]: Timed out receiving message from renderer: 10.000
[1646482757.506][WARNING]: screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.000
[1646482767.506][SEVERE]: Timed out receiving message from renderer: 10.000
OpenQA.Selenium.WebDriverTimeoutException: timeout: Timed out receiving message from renderer: 10.000
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.GetScreenshot()
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 288
at WebScreenshot.Controllers.HomeController.FileCache(Byte[]& cacheEntry, String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 167
at WebScreenshot.Controllers.HomeController.Get(String url, String jsurl, Int32 windowWidth, Int32 windowHeight) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 130

Solution

var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
options.AddArgument("--ignore-certificate-errors");
// Important: the line below
options.AddArgument("--disable-gpu");
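The flags collected across these fixes can be kept in one place. Below is a Python sketch of that idea; `DEFAULT_CHROME_ARGS` and `apply_chrome_args` are hypothetical names, and the function works with any object that exposes `add_argument`, such as Selenium's Chrome `Options`.

```python
DEFAULT_CHROME_ARGS = (
    "--no-sandbox",                 # needed when Chrome runs as root, e.g. in a container
    "--disable-dev-shm-usage",      # write shared memory files to /tmp instead of /dev/shm
    "--headless",
    "--ignore-certificate-errors",
    "--disable-gpu",
)


def apply_chrome_args(options, extra_args=()):
    """Add each flag once to any object exposing `add_argument`."""
    seen = set()
    for arg in (*DEFAULT_CHROME_ARGS, *extra_args):
        if arg not in seen:
            seen.add(arg)
            options.add_argument(arg)
    return options
```

For example: `chrome_options = apply_chrome_args(Options(), ["--window-size=1280,800"])`.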

screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.

Dockerfile: /bin/sh: 1: source: not found

Add the directory containing chromedriver to PATH

# TODO: adding to PATH this way fails (it has no effect)
RUN echo 'export PATH=$PATH:/app' >> ~/.bash_profile
RUN /bin/bash -c "source ~/.bash_profile"
# Add to PATH the Dockerfile way
ENV PATH=/app:$PATH
# Verify the versions
RUN google-chrome --version
RUN chromedriver --version

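PS: ~ stands for your home directory. .bash_profile is a hidden configuration file, mainly used to configure the bash shell, and source ~/.bash_profile makes changes to it take effect immediately. In a Dockerfile this does not help: RUN uses /bin/sh by default, which has no source command (hence the error above), and each RUN starts a new shell, so variables set in one RUN do not carry over to the next. ENV is the way to set PATH for the image.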
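The same PATH check can also run at application startup, so a misconfigured image fails early with a clear message. A minimal sketch with hypothetical helper names, built on the standard library's `shutil.which`:

```python
import shutil


def find_executable(name, path=None):
    """Return the full path of `name` if it is on PATH (or on `path`), else None."""
    return shutil.which(name, path=path)


def missing_tools(names=("google-chrome", "chromedriver"), path=None):
    """List the tools that cannot be found, so a startup check can fail early."""
    return [name for name in names if find_executable(name, path) is None]
```

An empty list from `missing_tools()` means both binaries are reachable through PATH.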

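Logging in without a password in Selenium by using cookies

References:

  • Logging in to Bilibili with cookies, without account or password - JavaShuo
  • Automatically logging in to bilibili with Python Selenium and cookies - python黑洞网

Execute JavaScript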

document.cookie ="SESSDATA=49d4147c%6557247677,f295e641;domain=.bilibili.com;path=/";
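
As an alternative to setting document.cookie through JavaScript, Selenium's own `add_cookie` can be used. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the cookie string is expected in the "name=value; name=value" form copied from a logged-in browser.

```python
def parse_cookie_header(cookie_string, domain):
    """Turn 'a=1; b=2' into a list of dicts in the shape `driver.add_cookie` accepts."""
    cookies = []
    for pair in cookie_string.split(";"):
        name, separator, value = pair.strip().partition("=")
        if not separator or not name:
            continue  # skip empty or malformed fragments
        cookies.append({"name": name, "value": value, "domain": domain, "path": "/"})
    return cookies


def apply_cookies(driver, cookies, start_url):
    """Open the site first (cookies can only be set for the current domain), then add them."""
    driver.get(start_url)
    for cookie in cookies:
        driver.add_cookie(cookie)
    driver.refresh()  # reload so the page is requested with the new cookies
```

For example: `apply_cookies(driver, parse_cookie_header(raw_cookie, ".bilibili.com"), "https://www.bilibili.com")`, where `raw_cookie` is the string copied from the browser.
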
  • Author: yiyun
  • Link: https://moeci.com/posts/分类-爬虫/selenium/
  • License: Unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!
