Introduction
- Saving a web page as an image with Selenium
- Saving a web page as a PDF with Selenium
- More
Preparation
chromedriver downloads
- Official: https://chromedriver.storage.googleapis.com/index.html
- Taobao mirror: https://npm.taobao.org/mirrors/chromedriver/
Chrome downloads
- https://www.slimjet.com/chrome/google-chrome-old-version.php
Selenium / WebDriver basics
Importing the package
Install the Python selenium package with pip:
pip install selenium
Ubuntu
Download and install Chrome
Note: pinning the Chrome version is recommended, because the Chrome version must match the chromedriver version.
# Install
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
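Because the chromedriver version has to match the installed Chrome, a small check can catch a mismatch before any browser session is started. The sketch below is only an illustration: the helper names `version_prefix` and `versions_match` are hypothetical, and it assumes that "matching" means the first three version components agree (for example 88.0.4324), with the version strings taken from `google-chrome --version` and `chromedriver --version`.

```python
import re


def version_prefix(version_output):
    """Return (major, minor, build) from text such as 'Google Chrome 88.0.4324.182'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def versions_match(chrome_output, chromedriver_output):
    """Assumption: the two are compatible when major.minor.build are identical."""
    return version_prefix(chrome_output) == version_prefix(chromedriver_output)


# The two version strings below are the ones that appear in this article.
print(versions_match("Google Chrome 88.0.4324.182", "ChromeDriver 88.0.4324.96"))
```

In practice the two input strings would come from running both commands, for instance with `subprocess.run(..., capture_output=True)`.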
Download the matching version of chromedriver
# Download chromedriver
sudo wget http://chromedriver.storage.googleapis.com/88.0.4324.96/chromedriver_linux64.zip
sudo apt-get install unzip
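# Unzip
sudo unzip chromedriver_linux64.zip
# Add execute permission for all users (on the chromedriver file)
sudo chmod a+x chromedriver
# Fix garbled Chinese characters in screenshots of Chinese pages: install Chinese fonts
# The next two lines install the Chinese fonts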
sudo apt install -y --force-yes --no-install-recommends fonts-wqy-microhei
sudo apt install -y --force-yes --no-install-recommends ttf-wqy-zenhei
Import in code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
Supplement
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.ie.options import Options as IEOptions
Driver instance
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
# 1. Use Chrome  2. Specify the chromedriver path: './chromedriver'
driver = webdriver.Chrome('./chromedriver', options=chrome_options)
driver.get("https://github.com/yiyungent/WebScreenshot")
# Resize the window to the full page size so the screenshot covers the whole page
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)
# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '.png')
driver.quit()
Saving a web page as an image with Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
# 1. Use Chrome  2. Specify the chromedriver path: './chromedriver'
driver = webdriver.Chrome('./chromedriver', options=chrome_options)
driver.get("https://github.com/yiyungent/WebScreenshot")
# Resize the window to the full page size so the screenshot covers the whole page
width = driver.execute_script("return document.documentElement.scrollWidth")
height = driver.execute_script("return document.documentElement.scrollHeight")
driver.set_window_size(width, height)
# Save the screenshot (the ./screenshots directory must already exist)
driver.save_screenshot('./screenshots/' + time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + '.png')
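The screenshot file name above is built from a timestamp that contains spaces and colons, and the target directory has to exist beforehand. As a rough sketch (the helper name `screenshot_path` and the alternative timestamp format are assumptions, not part of the original code), the path could be prepared like this before calling `driver.save_screenshot`:

```python
import os
import time


def screenshot_path(directory, timestamp=None):
    """Build '<directory>/<timestamp>.png', creating the directory if needed.

    timestamp is seconds since the epoch; None means "now".
    """
    os.makedirs(directory, exist_ok=True)
    # Hyphens and underscores instead of ':' and ' ' keep the name portable.
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime(timestamp))
    return os.path.join(directory, stamp + ".png")


# Usage with the driver from the example above (not executed here):
# driver.save_screenshot(screenshot_path("./screenshots"))
```

Avoiding colons matters mainly if the files are later copied to a file system that does not allow them, such as on Windows.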
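Saving a web page as a PDF with Selenium
Approach
There are three main options:
- Use the third-party package pdfkit; see: https://www.cnblogs.com/silence-cc/p/9463227.html
- Use Chrome's --print-to-pdf mode to export the requested HTML as a PDF; see: http://osask.cn/front/ask/view/1029784
- Use the JavaScript command window.print() to invoke the browser's print function; see: https://gitee.com/shinemic/codes/09y87ph6vf2c5zamwls3q48
Here we choose the third option: it is comparatively adaptable and makes it easy to watch the progress. To hide the browser window, just add the --headless option.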
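For comparison with the chosen approach, the second option (Chrome's --print-to-pdf mode) does not need Selenium at all. The following is a minimal sketch that only assembles and runs the command line; the function names are hypothetical, the binary name `google-chrome` is an assumption about the local installation, and `--no-sandbox` is carried over from the options used elsewhere in this article.

```python
import subprocess


def build_print_to_pdf_command(chrome_binary, url, output_pdf):
    """Assemble a headless Chrome command that writes the page to a PDF file."""
    return [
        chrome_binary,
        "--headless",
        "--no-sandbox",
        f"--print-to-pdf={output_pdf}",
        url,
    ]


def print_to_pdf(chrome_binary, url, output_pdf, timeout=60):
    """Run Chrome and raise CalledProcessError if it exits with an error."""
    command = build_print_to_pdf_command(chrome_binary, url, output_pdf)
    subprocess.run(command, check=True, timeout=timeout)


# Only the command is printed here; print_to_pdf() needs Chrome installed.
print(build_print_to_pdf_command(
    "google-chrome", "https://github.com/yiyungent/WebScreenshot", "page.pdf"))
```

This route gives less control over timing than driving the page with Selenium, which is one reason the window.print() approach is used below.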
Implementation
Configure the chromedriver options:
import json
from selenium import webdriver
appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local"
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2
}
profile = {
    'printing.print_preview_sticky_settings.appState': json.dumps(appState),
    'savefile.default_directory': './articles'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', profile)
# --kiosk-printing prints immediately instead of showing the print preview dialog
chrome_options.add_argument('--kiosk-printing')
Here savefile.default_directory specifies the directory where the file is saved; configure it for your own environment.
Save the PDF
driver.get(url)
time.sleep(5)
# Save the PDF
temp_title = driver.title
driver.execute_script('window.print();')
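When Chrome prints a page, the default file name is the page title, so we store it first with temp_title = driver.title.
Rename
os.rename('./articles/' + temp_title + '.pdf', './articles/' + title + '.pdf')
If you open several pages of the same site and save each as a PDF, pages with identical titles are likely to overwrite one another, so rename the PDF after every save.
Note: when a page fails to load or is otherwise abnormal, the title may be empty, and the rename will then raise an exception, so exception handling is needed.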
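The exception handling mentioned in the note could look roughly like the sketch below. The helper names `safe_filename` and `rename_pdf` are hypothetical, and the set of characters replaced in the new title is an assumption about what is unsafe in a file name; Chrome may also alter the title itself when it writes the file, which this sketch does not try to predict.

```python
import os
import re


def safe_filename(title, fallback="untitled"):
    """Replace characters that are awkward in file names; never return ''."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or fallback


def rename_pdf(directory, temp_title, new_title):
    """Rename '<temp_title>.pdf' to '<new_title>.pdf'; return the new path or None."""
    source = os.path.join(directory, temp_title + ".pdf")
    target = os.path.join(directory, safe_filename(new_title) + ".pdf")
    try:
        os.replace(source, target)
    except OSError as error:
        # Covers a missing source file, e.g. when the page title was empty.
        print(f"could not rename {source!r}: {error}")
        return None
    return target


# Usage with the names from the article (not executed here):
# rename_pdf('./articles', temp_title, title)
```

Returning None instead of raising lets a crawl over many pages continue and log the failures.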
Cookies
References:
- Working with cookies | Selenium
Waits
References:
- Waits | Selenium
Explicit waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
def document_initialised(driver):
    return driver.execute_script("return initialised")
# The Python binding navigates with driver.get()
driver.get("file:///race_condition.html")
# WebDriverWait needs a timeout in seconds; 10 is an arbitrary choice
WebDriverWait(driver, timeout=10).until(document_initialised)
el = driver.find_element(By.TAG_NAME, "p")
assert el.text == "Hello from JavaScript!"
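The code above can be simplified to the following:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
driver.get("file:///race_condition.html")
# WebDriverWait needs a timeout in seconds; 10 is an arbitrary choice
el = WebDriverWait(driver, timeout=10).until(lambda d: d.find_element(By.TAG_NAME, "p"))
assert el.text == "Hello from JavaScript!"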
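To make the idea behind an explicit wait concrete, here is a simplified stand-in written without Selenium. It is not Selenium's implementation: `wait_until` and its parameters are hypothetical names, and it only mirrors the general pattern of polling a condition until it returns something truthy or a timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5, ignored_exceptions=()):
    """Call condition() repeatedly until it returns a truthy value.

    Exceptions listed in ignored_exceptions are treated as "not ready yet".
    Raises TimeoutError when the timeout (in seconds) expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition()
            if value:
                return value
        except ignored_exceptions:
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)


# Example: a condition that becomes truthy on its third call.
state = {"calls": 0}


def becomes_ready():
    state["calls"] += 1
    return "ready" if state["calls"] >= 3 else None


result = wait_until(becomes_ready, timeout=2.0, poll_frequency=0.01)
print(result)  # → ready
```

The lambda passed to WebDriverWait in the example above plays the same role as `becomes_ready` here: it is retried until it yields an element.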
Q&A
Similar tools
Puppeteer
- puppeteer/puppeteer: Headless Chrome Node.js API
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
PhantomJS
- ariya/phantomjs: Scriptable Headless Browser
PhantomJS (phantomjs.org) is a headless WebKit scriptable with JavaScript. The latest stable release is version 2.1. Important: PhantomJS development is suspended until further notice (see #15344 for more details).
Supplement
Selenium driver.Url vs. driver.Navigate().GoToUrl()
Selenium is an open source framework, so please have a look at the source code here.
GoToUrl()
is defined in RemoteNavigator.cs:
/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">String of where you want the browser to go to</param>
public void GoToUrl(string url)
{
    this.driver.Url = url;
}
/// <summary>
/// Navigate to a url for your test
/// </summary>
/// <param name="url">Uri object of where you want the browser to go to</param>
public void GoToUrl(Uri url)
{
    if (url == null)
    {
        throw new ArgumentNullException("url", "URL cannot be null.");
    }
    this.driver.Url = url.ToString();
}
driver.Navigate().GoToUrl()
internally just sets driver.Url = url.
Installing / uninstalling *.deb packages on Ubuntu
To install a deb package from the command line, you can use either the apt command or the dpkg command. The apt command actually uses dpkg under the hood, but apt is more popular and easier to use.
If you get a dependency error while installing a deb package, you can fix the dependencies with the following command:
sudo apt install -f
Method 1
# Install a .deb file
sudo dpkg -i package_name.deb
# Uninstall
sudo dpkg -r program_name
# Query
# This lists every package whose name contains "grid"; from the list I can get the exact program name.
apt list --installed | grep grid
#WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
#appgrid/now 0.298 all [installed,local]
Method 2
# Install
# sudo apt install ./teamviewer_amd64.deb
sudo apt install path_to_deb_file
# Uninstall
# or: sudo apt remove program_name
sudo apt-get remove package_name
# Query
dpkg -l | grep grid
#ii appgrid 0.298 all Discover and install apps for Ubuntu
Selenium anti-anti-scraping
chromedriver: error while loading shared libraries: libglib-2.0.so.0:
The following fixed it:
apt-get install libglib2.0 -y
But it did not fix the following:
Network is unreachable Network is unreachable
OpenQA.Selenium.WebDriverException: Cannot start the driver service on http://localhost:39255/
at OpenQA.Selenium.DriverService.Start()
chromedriver: error while loading shared libraries: libnss3.so:
apt-get install libnss3-dev -y
chromedriver: error while loading shared libraries: libxcb.so.1:
apt-get install libxcb1 -y
OpenQA.Selenium.WebDriverException: unknown error: cannot find Chrome binary
Fix:
Chrome is not installed correctly. If the error persists after reinstalling, specify the binary path manually:
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
// Path to the Chrome executable
// Not needed as long as Chrome is installed correctly
//options.BinaryLocation = "";
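When the binary path does have to be set by hand, it first has to be found. In Python, a lookup on PATH could be sketched as follows; the function name `find_chrome_binary` is hypothetical, and the candidate executable names are assumptions about common Linux installations rather than an exhaustive list.

```python
import shutil

# Assumed executable names; adjust to the local installation.
DEFAULT_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_chrome_binary(candidates=DEFAULT_CANDIDATES):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_chrome_binary())

# With the Python binding, the result could then be assigned explicitly:
# chrome_options.binary_location = find_chrome_binary()
```

A result of None points to the same root cause as the error message: Chrome is missing or not on PATH.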
OpenQA.Selenium.WebDriverArgumentException: invalid argument
// url must be a valid, complete URL, e.g. http://moeci.com
OpenQA.Selenium.Navigator.GoToUrl(String url)
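One way to avoid this error is to validate the URL before handing it to the driver. The check below is a sketch in Python; `is_complete_url` is a hypothetical helper, and restricting the scheme to http and https is an assumption that fits a screenshot service but would reject, for example, file:// URLs.

```python
from urllib.parse import urlparse


def is_complete_url(url):
    """Return True for URLs that have an http(s) scheme and a host part."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse can reject malformed input such as a broken IPv6 host.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_complete_url("http://moeci.com"))  # → True
print(is_complete_url("moeci.com"))         # → False
```

A request with an incomplete URL can then be rejected up front, or have a default scheme prepended, instead of failing inside the driver.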
OpenQA.Selenium.WebDriverException: The HTTP request to the remote WebDriver server for URL http://localhost:40811/session timed out after 60 seconds.
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--headless");
// Note: TimeSpan.FromMinutes(5) sets a 5-minute command timeout
var driver = new ChromeDriver(chromeDriverDirectory: "/app/tools/selenium/", options, commandTimeout: TimeSpan.FromMinutes(5));
driver.Navigate().GoToUrl(url);
OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash
OpenQA.Selenium.WebDriverException: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.ExecuteScriptCommand(String script, String commandName, Object[] args)
at OpenQA.Selenium.WebDriver.ExecuteScript(String script, Object[] args)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78
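This error only occurs when running inside a Docker container: the shared memory (shm_size) is exhausted, and the default is only 64MB. Increase it when starting the container: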
docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-chrome:4.1.2-20220217
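To confirm inside the container that the larger shared memory setting took effect, the size of /dev/shm can be inspected. This is a small sketch; the function names are hypothetical, and the 1024 MB threshold is only an assumed example value, not a documented requirement.

```python
import shutil


def shm_size_mb(path="/dev/shm"):
    """Total size of the file system mounted at path, in megabytes."""
    return shutil.disk_usage(path).total / (1024 * 1024)


def shm_is_large_enough(minimum_mb=1024, path="/dev/shm"):
    """Return True/False, or None when path does not exist (e.g. not on Linux)."""
    try:
        return shm_size_mb(path) >= minimum_mb
    except OSError:
        return None


print(shm_is_large_enough())
```

With Docker's default setting this would be expected to report a size of about 64 MB, which matches the crash described above.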
version: "3"
services:
  hub:
    image: selenium/hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
  firefox:
    image: selenium/node-firefox
    shm_size: '1gb'
    depends_on:
      - hub
    environment:
      - HUB_HOST=hub
System.InvalidOperationException: session not created
System.InvalidOperationException: session not created
from tab crashed
(Session info: headless chrome=88.0.4324.182) (SessionNotCreated)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.StartSession(ICapabilities desiredCapabilities)
at OpenQA.Selenium.WebDriver..ctor(ICommandExecutor executor, ICapabilities capabilities)
at OpenQA.Selenium.Chromium.ChromiumDriver..ctor(ChromiumDriverService service, ChromiumOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(ChromeDriverService service, ChromeOptions options, TimeSpan commandTimeout)
at OpenQA.Selenium.Chrome.ChromeDriver..ctor(String chromeDriverDirectory, ChromeOptions options, TimeSpan commandTimeout)
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 109
at WebScreenshot.Controllers.HomeController.Get(String url) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 78
Fix
var options = new ChromeOptions();
// https://stackoverflow.com/questions/59186984/selenium-common-exceptions-sessionnotcreatedexception-message-session-not-crea
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
// Important: in testing, it only succeeded after this line was added
options.AddArgument("--ignore-certificate-errors");
Timed out receiving message from renderer: 10.000
ChromeDriver was started successfully.
[1646482757.506][SEVERE]: Timed out receiving message from renderer: 10.000
[1646482757.506][WARNING]: screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.000
[1646482767.506][SEVERE]: Timed out receiving message from renderer: 10.000
OpenQA.Selenium.WebDriverTimeoutException: timeout: Timed out receiving message from renderer: 10.000
(Session info: headless chrome=88.0.4324.182)
at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse)
at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
at OpenQA.Selenium.WebDriver.GetScreenshot()
at WebScreenshot.Controllers.HomeController.SaveScreenshot(String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 288
at WebScreenshot.Controllers.HomeController.FileCache(Byte[]& cacheEntry, String url, String jsurl, String jsStr) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 167
at WebScreenshot.Controllers.HomeController.Get(String url, String jsurl, Int32 windowWidth, Int32 windowHeight) in /src/src/WebScreenshot/Controllers/HomeController.cs:line 130
Fix
var options = new ChromeOptions();
options.AddArgument("--no-sandbox");
options.AddArgument("--disable-dev-shm-usage");
options.AddArgument("--headless");
options.AddArgument("--ignore-certificate-errors");
// Important: the line below
options.AddArgument("--disable-gpu");
screenshot failed, retrying timeout: Timed out receiving message from renderer: 10.
Dockerfile: /bin/sh: 1: source: not found
Add the directory containing chromedriver to PATH
# TODO: adding to PATH this way fails (it has no effect)
RUN echo 'export PATH=$PATH:/app' >> ~/.bash_profile
RUN /bin/bash -c "source ~/.bash_profile"
# Add to PATH the Dockerfile way
ENV PATH=/app:$PATH
# Verify the versions
RUN google-chrome --version
RUN chromedriver --version
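PS: the ~ symbol stands for your home directory, and .bash_profile is a hidden configuration file used mainly to configure the bash shell. Running source ~/.bash_profile makes changes to that file take effect immediately.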
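Skipping login with cookies in Selenium
References:
- Logging in to Bilibili with cookies instead of an account and password - JavaShuo
- Using Python Selenium with cookies to log in to bilibili automatically - python黑洞网
Execute JavaScript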
document.cookie ="SESSDATA=49d4147c%6557247677,f295e641;domain=.bilibili.com;path=/";
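Instead of setting document.cookie through JavaScript, the same cookies can be handed to the driver with its add_cookie method. The sketch below shows one way to turn a "name=value; name=value" string into the dictionaries that add_cookie takes; the helper names are hypothetical, the cookie values are placeholders, and the driver is assumed to have already opened a page on the target domain.

```python
def parse_cookie_string(cookie_string, domain, path="/"):
    """Split 'a=1; b=2' into a list of cookie dictionaries."""
    cookies = []
    for pair in cookie_string.split(";"):
        name, separator, value = pair.strip().partition("=")
        if not separator or not name:
            continue  # skip empty or malformed fragments
        cookies.append({"name": name, "value": value, "domain": domain, "path": path})
    return cookies


def apply_cookies(driver, cookies):
    """Add each cookie to a driver that is already on the matching domain."""
    for cookie in cookies:
        driver.add_cookie(cookie)


# Placeholder values only; a real SESSDATA value comes from a logged-in browser.
example = parse_cookie_string("SESSDATA=placeholder-value; other=1", ".bilibili.com")
print(example)
```

After apply_cookies, reloading the page with driver.refresh() lets the site pick up the session.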
- Author: yiyun
- Link: https://moeci.com/posts/分类-爬虫/selenium/
- Copyright notice: unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!