Configuring a selenium Crawler on a Linux Server

Configuring a selenium Crawler That Uses a Proxy IP on a Linux Server

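Running a crawler on a Linux server can sometimes work surprisingly well, but setting up the crawler environment and a proxy IP on a Linux server (that is, one without a graphical interface) differs considerably from the familiar Windows setup. This post records the author's experience configuring selenium and Google Chrome on a Linux server and running a crawler through a proxy IP, with solutions to the pitfalls encountered along the way, for readers' reference.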

1. Base Environment

Operating system: Ubuntu 20.04

Python: 3.7

Proxy IP: Clash (for configuring Clash on Linux, see the article "Linux服务器基于代理IP的爬虫" (Proxy-IP-based crawlers on a Linux server) on 西南小游侠's CSDN blog)

2. Installing and Using Google Chrome

First, under the root account, download Chrome directly from the official Chrome site and install it:

sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f

After installation, the launcher script has to be modified so that Chrome can run with root privileges. Open the file /opt/google/chrome/google-chrome and find the line:

exec -a "$0" "$HERE/chrome" "$@"

Append the following flags to the end of it so that it reads:

exec -a "$0" "$HERE/chrome" "$@" --user-data-dir --no-sandbox
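The manual edit above can also be scripted. The following is a minimal sketch, assuming the launcher contains exactly the exec line quoted above; the function name add_root_flags is hypothetical, and it only transforms text, so reading and writing the file (and backing it up first) is left to the caller.

```python
# The exec line as it appears in the launcher, and the flags from this post.
ORIGINAL = 'exec -a "$0" "$HERE/chrome" "$@"'
FLAGS = "--user-data-dir --no-sandbox"


def add_root_flags(script_text: str) -> str:
    """Append FLAGS to the exec line of the launcher script text.

    A line that already carries the flags no longer equals ORIGINAL,
    so applying the function twice does not duplicate them.
    """
    patched = []
    for line in script_text.splitlines():
        if line.strip() == ORIGINAL:
            line = f"{line} {FLAGS}"
        patched.append(line)
    return "\n".join(patched) + "\n"
```

As root, the result could be written back with pathlib, for example by reading /opt/google/chrome/google-chrome with Path.read_text(), passing the text through add_root_flags, and saving it with Path.write_text().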

Next, test whether Chrome works with the following command:

google-chrome --headless --remote-debugging-port=9222 https://chromium.org --disable-gpu

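If Chrome starts and prints output (typically a "DevTools listening on ws://..." line, since --remote-debugging-port is set) rather than an error, it is working.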
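To confirm from a script that the headless instance started by the command above is really up, one option is to query the DevTools HTTP endpoint /json/version on the debugging port. This is a minimal sketch; the function name devtools_version is hypothetical, and the port 9222 is taken from the command above.

```python
import http.client
import json
import urllib.request
from typing import Optional


def devtools_version(port: int = 9222, timeout: float = 2.0) -> Optional[dict]:
    """Return the JSON from Chrome's /json/version endpoint, or None if unreachable."""
    url = f"http://127.0.0.1:{port}/json/version"
    # An empty ProxyHandler bypasses any proxy set in the environment,
    # which matters on a machine that routes traffic through Clash.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
```

While the test command is running in another terminal, devtools_version() should return a dictionary that includes a "Browser" entry; None means nothing answered on that port.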

Next, some file permissions have to be changed so that non-root accounts can use Chrome as well.

Log in to a non-root account and run the command above; it fails with a permission error.

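The error occurs because starting Chrome needs to modify the /tmp/Crashpad directory, and this account has no permission to do so. The fix is to change the permissions of /tmp/Crashpad; under the root account, run: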

chmod -R 777 /tmp/Crashpad

Running Chrome again produces another permission error, this time for files under /opt.

In the same way, change the directory permissions; here the whole /opt directory is simply opened up:

chmod -R 777 /opt

Run Chrome again, and this time it starts successfully.
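Before opening up permissions, it can help to see which of the directories involved the current account actually cannot write to. A minimal sketch, run as the non-root user; the function name unwritable is hypothetical, and the two paths are the ones mentioned in this section.

```python
import os
from typing import Iterable, List


def unwritable(paths: Iterable[str]) -> List[str]:
    """Return the paths that exist but that the current user cannot write to."""
    return [p for p in paths if os.path.exists(p) and not os.access(p, os.W_OK)]


# Directories that caused errors in this section.
CANDIDATES = ["/tmp/Crashpad", "/opt/google/chrome"]
print(unwritable(CANDIDATES))
```

An empty list means none of the existing candidate paths is blocked for this account; any path that is printed is a candidate for the chmod commands above.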

3. Installing and Using selenium

First, install a webdriver (ChromeDriver) whose version matches the installed Chrome. Start by checking the Chrome version:

google-chrome --version

Then, based on the Chrome version number, download the matching webdriver from CNPM Binaries Mirror (npmmirror.com) and extract it to a directory of your choice.
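The version check can be automated by comparing the major version reported by both binaries. This is a minimal sketch, assuming both tools print a four-part version number such as 110.0.5481.77; the function names and the sample version strings used with them are illustrative only.

```python
import re
from typing import Optional


def major_version(version_output: str) -> Optional[int]:
    """Extract the major version from `--version` output, or None if absent."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """True when Chrome and the webdriver report the same major version."""
    chrome = major_version(chrome_output)
    driver = major_version(driver_output)
    return chrome is not None and chrome == driver
```

The two inputs can be captured with subprocess.run(["google-chrome", "--version"], capture_output=True, text=True).stdout and the equivalent call on the extracted chromedriver binary.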

Next, install the third-party Python library selenium; installing it through conda is enough:

conda install selenium

A selenium-based crawler can now be run from code. Here is an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

wd = webdriver.Chrome(executable_path='path_for_webdriver')
wd.get("https://www.baidu.com")
content = wd.page_source
url = wd.current_url
print(url)
print(content)
wd.quit()

However, running this code as-is fails with an error when Chrome starts.

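This error is fairly common. Searching online shows that startup arguments have to be added, but different systems may need different ones: some need only one to three, while the author's system needed all five of the following before Chrome would start:

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--remote-debugging-port=9222")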

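With these arguments added, the crawler runs successfully. The complete example is as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Headless mode; required on a Linux server without a GUI
chrome_options.add_argument('--headless')
# Do not use the GPU; some machines do not support it
chrome_options.add_argument('--disable-gpu')
# Required for Chrome to start as root
chrome_options.add_argument('--no-sandbox')
# Use a temporary directory instead of /dev/shm for shared memory, and fix
# the DevTools port; the author's machine needed both of these to run
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--remote-debugging-port=9222")
# Optional: set a desktop user agent to make the crawler less recognizable
chrome_options.add_argument("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

# chrome_options= and executable_path= are the older, Selenium 3-style keywords
wd = webdriver.Chrome(chrome_options=chrome_options, executable_path='path_for_webdriver')
wd.get("https://www.baidu.com")
content = wd.page_source
url = wd.current_url
print(url)
print(content)
wd.quit()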
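Selenium 4 deprecates the executable_path and chrome_options keywords used above, and later 4.x releases remove them in favor of a Service object and options=. The following is a sketch of the same setup in that style, not the author's code; build_chrome_args and make_driver are hypothetical helper names, and the optional proxy parameter corresponds to the --proxy-server flag used in section 4.

```python
from typing import List, Optional


def build_chrome_args(proxy: Optional[str] = None,
                      user_agent: Optional[str] = None) -> List[str]:
    """Return the five startup flags from this post plus optional extras."""
    args = [
        "--headless",
        "--disable-gpu",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--remote-debugging-port=9222",
    ]
    if proxy:
        args.append(f"--proxy-server={proxy}")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def make_driver(driver_path: str, **kwargs):
    """Create a Chrome driver using the Selenium 4 Service/options style."""
    # Imported here so build_chrome_args stays usable without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in build_chrome_args(**kwargs):
        options.add_argument(arg)
    return webdriver.Chrome(service=Service(executable_path=driver_path),
                            options=options)
```

Calling make_driver('path_for_webdriver') then returns a driver that can be used exactly like wd in the example above.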

4. A selenium Crawler That Uses a Proxy IP

The previous article already set up a Clash-based proxy IP, so all that remains is to add one line to the crawler code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--remote-debugging-port=9222")
# Added line: route traffic through the local Clash proxy
chrome_options.add_argument('--proxy-server=http://127.0.0.1:7890')
chrome_options.add_argument("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

wd = webdriver.Chrome(chrome_options=chrome_options, executable_path='path_for_webdriver')
try:
    wd.get("https://www.google.com")
    content = wd.page_source
    url = wd.current_url
    print(url)
    print(content)
finally:
    # Always close the browser, even if the request fails
    wd.quit()

Running this code now reaches Google successfully.
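If the page fails to load, a quick first check is whether anything is listening on the proxy port at all. A minimal sketch using only the standard library; the function name proxy_is_listening is hypothetical, and the default address 127.0.0.1:7890 is the one used in the code above.

```python
import socket


def proxy_is_listening(host: str = "127.0.0.1", port: int = 7890,
                       timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: no proxy is accepting connections.
        return False
```

This only shows that the port accepts connections, not that Clash can actually reach the target site, so a True result narrows the problem down rather than ruling the proxy out.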

Source: https://blog.csdn.net/UIBE_day_day_up/article/details/128989600
