Introduction
- Saving a web page as an image with Selenium
- Saving a web page as a PDF with Selenium
- More
Preparation
chromedriver downloads
- Official: https://chromedriver.storage.googleapis.com/index.html
- Taobao mirror: https://npm.taobao.org/mirrors/chromedriver/
Chrome downloads
- https://www.slimjet.com/chrome/google-chrome-old-version.php
Selenium / WebDriver basics
Importing the package
Install the Python selenium package with pip
pip install selenium
Ubuntu
Download and install Chrome
Note: pin the Chrome version. The Chrome version must match the chromedriver version.
# Install
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
Download the matching version of chromedriver
# Download chromedriver
sudo wget http://chromedriver.storage.googleapis.com/88.0.4324.96/chromedriver_linux64.zip
sudo apt-get install unzip
# Unzip
sudo unzip chromedriver_linux64.zip
# Grant execute permission on the chromedriver file to all users
sudo chmod a+x chromedriver
# Fix garbled Chinese text in screenshots of Chinese pages: install Chinese fonts
# The next two lines install the Chinese fonts
sudo apt install -y --force-yes --no-install-recommends fonts-wqy-microhei
sudo apt install -y --force-yes --no-install-recommends ttf-wqy-zenhei
Import in code
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
Supplement
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.ie.options import Options as IEOptions
driver instance
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
# 1. Use Chrome  2. Point Service at the chromedriver path: './chromedriver'
# (Selenium 4 style; Selenium 3 accepted the path as the first positional argument)
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
driver.get("https://github.com/yiyungent/WebScreenshot")
# Resize the window to the full scroll size of the page so the whole page is captured
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)
# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '.png')
driver.quit()
Saving a web page as an image with Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
# 1. Use Chrome  2. Point Service at the chromedriver path: './chromedriver'
# (Selenium 4 style; Selenium 3 accepted the path as the first positional argument)
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
driver.get("https://github.com/yiyungent/WebScreenshot")
# Resize the window to the full scroll size of the page so the whole page is captured
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)
# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '.png')
Saving a web page as a PDF with Selenium
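Approach
There are several main options:
- Use a third-party package, pdfkit; see https://www.cnblogs.com/silence-cc/p/9463227.html
- Use Chrome's --print-to-pdf mode to export the fetched HTML as a PDF; see http://osask.cn/front/ask/view/1029784
- Use the JavaScript command window.print(); to invoke the browser's print function; see https://gitee.com/shinemic/codes/09y87ph6vf2c5zamwls3q48
Here we choose the third option: it adapts relatively well to different pages and makes it easy to watch progress. To hide the page, just add the --headless option.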
Implementation
Configure the chromedriver options
import json
from selenium import webdriver
# Make "Save as PDF" the selected destination in Chrome's print preview
appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local"
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2
}
profile = {
    'printing.print_preview_sticky_settings.appState': json.dumps(appState),
    'savefile.default_directory': './articles'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
# Print immediately without waiting for the user to confirm the print dialog
chrome_options.add_argument('--kiosk-printing')
Here, savefile.default_directory specifies the directory where files are saved; configure it to suit your setup.
Save the PDF
driver.get(url)
time.sleep(5)
# 保存 PDF
temp_title = driver.title
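driver.execute_script('window.print();')
When Chrome prints a page, the default file name is the page title, so temp_title = driver.title is saved first.
Rename
os.rename('./articles/' + temp_title + '.pdf', './articles/' + title + '.pdf')
If you open several pages of the same site and save each as a PDF, the pages may well share a title, and a later file will overwrite an earlier one. So rename the PDF each time one is saved.
Note: when a page fails to load or is otherwise abnormal, its title may be empty, and the rename will then raise an exception, so exception handling is needed.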
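The exception handling mentioned in the note could look roughly like the sketch below. It is only an illustration: `safe_filename` and `rename_pdf` are hypothetical helper names, not part of Selenium, and the set of characters replaced in the file name is an assumption.

```python
import os
import re


def safe_filename(title, fallback="untitled"):
    """Replace characters that are unsafe in file names; use a fallback if nothing is left."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title or "").strip()
    return cleaned or fallback


def rename_pdf(directory, temp_title, new_title):
    """Rename '<temp_title>.pdf' to '<new_title>.pdf' inside directory.

    Returns the new path, or None when the rename could not be done
    (empty page title, missing file, or any other OS-level error).
    """
    if not temp_title:
        # An empty page title means the default PDF file name cannot be guessed.
        return None
    src = os.path.join(directory, temp_title + ".pdf")
    dst = os.path.join(directory, safe_filename(new_title) + ".pdf")
    try:
        os.rename(src, dst)
    except OSError:
        # Covers FileNotFoundError and permission problems alike.
        return None
    return dst
```

A caller might use it as `rename_pdf('./articles', temp_title, title)` and log or retry when the result is None.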
Cookies
References:
- Working with cookies | Selenium
Waits
References:
- Waits | Selenium
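Conceptually, a wait polls a condition until it returns something truthy or a timeout expires. The sketch below shows that idea in plain Python; `wait_until` is a hypothetical helper written for illustration, not Selenium's implementation, and the default values are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)
```

With a driver at hand, a call might look like `wait_until(lambda: driver.execute_script("return document.readyState") == "complete")`.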
Explicit waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
def document_initialised(driver):
    return driver.execute_script("return initialised")
# In Python, navigation is driver.get(); there is no driver.navigate()
driver.get("file:///race_condition.html")
# WebDriverWait requires a timeout in seconds
WebDriverWait(driver, timeout=10).until(document_initialised)
el = driver.find_element(By.TAG_NAME, "p")
assert el.text == "Hello from JavaScript!"
The above can be simplified to the following
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
driver.get("file:///race_condition.html")
el = WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.TAG_NAME, "p"))
assert el.text == "Hello from JavaScript!"
Q&A
Similar tools
Puppeteer
- puppeteer/puppeteer: Headless Chrome Node.js API
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
PhantomJS
- ariya/phantomjs: Scriptable Headless Browser
PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. The latest stable release is version 2.1. Important: PhantomJS development is suspended until further notice (see #15344 for more details).
Supplement
Selenium driver.Url vs. driver.Navigate().GoToUrl()
Selenium is an open source framework, so please have a look at the source code here.
GoToUrl() is defined in RemoteNavigator.cs:
/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">String of where you want the browser to go to</param>
public void GoToUrl(string url)
{
    this.driver.Url = url;
}
/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">Uri object of where you want the browser to go to</param>
public void GoToUrl(Uri url)
{
    if (url == null)
    {
        throw new ArgumentNullException("url", "URL cannot be null.");
    }
    this.driver.Url = url.ToString();
}
Internally, driver.Navigate().GoToUrl() simply assigns driver.Url = url.
Installing / uninstalling *.deb on Ubuntu
To install a deb package from the command line, you can use either the apt command or the dpkg command. Under the hood, apt uses dpkg, but apt is more popular and easier to use.
If you hit a dependency error while installing a deb package, the following command fixes the dependency problem:
sudo apt install -f
Method 1
# Install a .deb file
sudo dpkg -i package_name.deb
# Uninstall
sudo dpkg -r program_name
# Query
# This lists every installed package whose name contains "grid"; from it I can get the exact program name.
apt list --installed | grep grid
#WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
#appgrid/now 0.298 all [installed,local]
Method 2
# Install
# sudo apt install ./teamviewer_amd64.deb
sudo apt install path_to_deb_file
# Uninstall
# or: sudo apt remove program_name
sudo apt-get remove package_name
# Query
dpkg -l | grep grid
#ii appgrid 0.298 all Discover and install apps for Ubuntu
Selenium: countering anti-scraping measures
chromedriver: error while loading shared libraries: libglib-2.0.so.0:

The following fixed it:
apt-get install libglib2.0 -y
But it did not fix the following:
Network is unreachable Network is unreachable
OpenQA.Selenium.WebDriverException: Cannot start the driver service on http://localhost:39255/
at OpenQA.Selenium.DriverService.Start()
chromedriver: error while loading shared libraries: libnss3.so:

apt-get install libnss3-dev -y
chromedriver: error while loading shared libraries: libxcb.so.1:

apt-get install libxcb1 -y
OpenQA.Selenium.WebDriverException: unknown error: cannot find Chrome binary

Solution:
Chrome is not installed correctly. If the error persists, specify the Chrome binary path manually.
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
// Path to the Chrome executable
// Not needed when Chrome is installed correctly
//options.BinaryLocation = "";
OpenQA.Selenium.WebDriverArgumentException: invalid argument

// The url must be a valid, complete URL, e.g. http://moeci.com
OpenQA.Selenium.Navigator.GoToUrl(String url)
OpenQA.Selenium.WebDriverException: The HTTP request to the remote WebDriver server for URL http://localhost:40811/session timed out after 60 seconds.
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--headless");
// Note: TimeSpan.FromMinutes(5) sets a 5-minute command timeout
var driver = new ChromeDriver(chromeDriverDirectory: "/app/tools/selenium/", options, commandTimeout: TimeSpan.FromMinutes(5));
driver.Navigate().GoToUrl(url);
OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash
OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.ExecuteScriptCommand(String script, String commandName, Object[] args)
at OpenQA.Selenium.WebDriver.ExecuteScript(String script, Object[] args)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78
This error only occurs when running inside a Docker container: the shared memory (shm_size) is exhausted, and the default is 64 MB. Increase it with docker run or in docker-compose:
docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome:4.1.2-20220217
version: "3"
services:
  hub:
    image: selenium/hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
  firefox:
    image: selenium/node-firefox
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
System.InvalidOperationException: session not created
System.InvalidOperationException: session not created
from tab crashed
(Session info: headless chrome=88.0.4324.182) (SessionNotCreated)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.StartSession(ICapabilities desiredCapabilities)
at OpenQA.Selenium.WebDriver..ctor(ICommandExecutor executor, ICapabilities capabilities)
at OpenQA.Selenium.Chromium.ChromiumDriver..ctor(ChromiumDriverService service, ChromiumOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(ChromeDriverService service, ChromeOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(String chromeDriverDirectory, ChromeOptions options, TimeSpan commandTimeout)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78
Solution
var options = new ChromeOptions();
// https://stackoverflow.com/questions/59186984/selenium-common-exceptions-sessionnotcreatedexception-message-session-not-crea
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
// Important: in testing, it only succeeded after this line was added
options.AddArgument("--ignore-certificate-errors");
Timed out receiving message from renderer: 10.000
ChromeDriver was started successfully.
[1646482757.506][SEVERE]: Timed out receiving message from renderer: 10.000
[1646482757.506][WARNING]: screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.000
[1646482767.506][SEVERE]: Timed out receiving message from renderer: 10.000
OpenQA.Selenium.WebDriverTimeoutException: timeout: Timed out receiving message from renderer: 10.000
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.GetScreenshot()
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 288
at WebScreenshot.Controllers.HomeController.FileCache(Byte[]& cacheEntry, String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 167
at WebScreenshot.Controllers.HomeController.Get(String url, String jsurl, Int32 windowWidth, Int32 windowHeight) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 130
Solution
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
options.AddArgument("--ignore-certificate-errors");
// Important: the line below
options.AddArgument("--disable-gpu");
screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.
Dockerfile: /bin/sh: 1: source: not found
Add the directory containing chromedriver to PATH
# TODO: adding to PATH this way fails: it has no effect
RUN echo 'export PATH=$PATH:/app' >> ~/.bash_profile
RUN /bin/bash -c "source ~/.bash_profile"
# Add to PATH the Dockerfile way
ENV PATH=/app:$PATH
# Verify the versions
RUN google-chrome --version
RUN chromedriver --version
PS:
The ~ symbol stands for your home directory. .bash_profile is a hidden configuration file used mainly to configure the bash shell, and source ~/.bash_profile makes changes to it take effect immediately in the current shell. In a Dockerfile, each RUN instruction runs in a fresh shell (by default /bin/sh, which has no source command on Debian/Ubuntu images), so anything sourced there is lost by the next instruction; ENV is what persists.
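The version check at the end of the Dockerfile can also be scripted. The sketch below parses output such as `Google Chrome 88.0.4324.182` and `ChromeDriver 88.0.4324.96` and compares the first three version components. The helper names are hypothetical, and treating "same major.minor.build" as "matching" is an assumption based on the version pair used in this article.

```python
import re


def parse_version(version_output):
    """Extract (major, minor, build, patch) from text like 'Google Chrome 88.0.4324.182'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return tuple(int(part) for part in match.groups())


def versions_compatible(chrome_output, driver_output):
    """Assumption: Chrome and chromedriver match when major.minor.build are equal."""
    return parse_version(chrome_output)[:3] == parse_version(driver_output)[:3]


# Possible usage (requires both binaries on PATH):
#   import subprocess
#   chrome = subprocess.run(["google-chrome", "--version"], capture_output=True, text=True).stdout
#   driver = subprocess.run(["chromedriver", "--version"], capture_output=True, text=True).stdout
#   if not versions_compatible(chrome, driver):
#       raise SystemExit("Chrome and chromedriver versions do not match")
```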
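Logging in without credentials by using a cookie in Selenium
References:
- Logging in to bilibili with a cookie instead of account and password - JavaShuo
- Automatic bilibili login with Python Selenium and cookies - python黑洞网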
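One way to reuse a login cookie from Python is Selenium's `driver.add_cookie`, which takes a dict with at least `name` and `value`. The sketch below only converts a `name=value; name2=value2` string (as copied from browser developer tools) into such dicts; `parse_cookie_string` is a hypothetical helper, and the cookie names in the usage comment are placeholders.

```python
def parse_cookie_string(cookie_string, domain, path="/"):
    """Turn 'a=1; b=2' into a list of dicts suitable for driver.add_cookie()."""
    cookies = []
    for pair in cookie_string.split(";"):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if not sep or not name:
            continue  # skip empty or malformed fragments
        cookies.append({"name": name, "value": value, "domain": domain, "path": path})
    return cookies


# Possible usage with an existing driver (the browser must already be on the
# target domain before cookies for it can be added):
#   driver.get("https://www.bilibili.com")
#   for cookie in parse_cookie_string("SESSDATA=<your value>", ".bilibili.com"):
#       driver.add_cookie(cookie)
#   driver.refresh()
```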
Run JavaScript
document.cookie ="SESSDATA=49d4147c%6557247677,f295e641;domain=.bilibili.com;path=/";
- Author: yiyun
- Link: https://moeci.com/posts/分类-爬虫/selenium/
- License: unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting.


