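Python Scrapy in Practice: Crawling gushiwen.cn (古诗文网)

Requirements

Using Python and the Scrapy framework, crawl the poetry data on gushiwen.cn (古诗文网): each poem's title, author, dynasty, content, and translation. The crawl proceeds page by page, 4 pages in total. The URL of the first page is https://www.gushiwen.cn/default_1.aspx.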

1. Creating the Scrapy project

First, create the Scrapy project and the spider.

In the target directory, create a project named prose:

scrapy startproject prose

Enter the project directory, then create a spider named gs whose crawl scope is gushiwen.cn:

cd prose
scrapy genspider gs gushiwen.cn

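2. Global configuration: settings.py

Edit the configuration file settings.py as follows:

① Do not obey the robots protocol (ROBOTSTXT_OBEY = False)

② Set the download delay to 1 second (DOWNLOAD_DELAY = 1)

③ Add request headers and enable the item pipeline

④ Set the log level: LOG_LEVEL = "WARNING"

The full file: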

# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

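3. Spider program (gs.py)

The first step is analyzing the page structure; that process is not covered in detail here.

This file is the core part that needs editing.

First, change the start URL to the first page of the listing to be crawled, rather than the gushiwen.cn home page.

Requirement: the data to crawl consists of each poem's title, author, dynasty, content, and translation, and the crawl goes page by page.

The title, author, dynasty, content, and translation all sit inside the same <div> tag.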

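To demonstrate two different approaches:

The title, author, dynasty, and content are extracted in one way, directly inside the for loop of parse. That loop uses an exception-handling statement (try...except) to avoid the error raised when indexing into an empty result.

For the translation, we define a separate parse_detail function and pass it to scrapy.Request() as the callback.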
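
The empty-result problem mentioned above can also be handled with an explicit length check instead of exception handling. The following is only a minimal sketch: the helper name split_source is hypothetical, and its argument stands for the list that .getall() returns for the author and dynasty links.

```python
from typing import Optional, Sequence, Tuple


def split_source(source: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return (author, dynasty), or None when either part is missing.

    Hypothetical helper: `source` stands for the list of link texts
    extracted from the p.source element of one poem block.
    """
    if len(source) < 2:
        # Indexing source[0] / source[1] here would raise IndexError
        return None
    return source[0].strip(), source[1].strip()


print(split_source(["李白", "〔唐代〕"]))  # a complete entry
print(split_source([]))                    # a block without source links
```

A caller would skip the block when the helper returns None, which has the same effect as catching the IndexError.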

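For pagination, the idea is this: once all the required data on a page has been collected (that is, after the main loop finishes), take the next-page link from the current page and check whether it is empty. If it is not empty, a next page exists, so call scrapy.Request() again with that link and parse as the callback. If it is empty, the current page is the last one and the spider ends there.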

The code is as follows:

import scrapy
from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse a list page
    def parse(self, response):
        # Each div with class="sons" holds one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            # Poem title
            title = div.xpath('.//b/text()').get()
            # Author and dynasty are the two links inside p.source
            source = div.xpath('.//p[@class="source"]/a/text()').getall()
            try:
                author = source[0]
                dynasty = source[1]
            except IndexError:
                # Skip blocks that lack the author/dynasty links
                continue
            content_list = div.xpath('.//div[@class="contson"]//text()').getall()
            content_plus = ''.join(content_list).strip()
            # URL of the poem's detail page; urljoin also covers relative hrefs
            detail_url = response.urljoin(div.xpath('.//p/a/@href').get())
            item = ProseItem(title=title, author=author, dynasty=dynasty, content_plus=content_plus, detail_url=detail_url)
            # Fetch the detail page and carry the partially filled item along
            yield scrapy.Request(
                url=detail_url,
                callback=self.parse_detail,
                meta={'prose_item': item}
            )

        # Follow the next-page link until there is none
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=response.urljoin(next_url),
                callback=self.parse
            )

    # Parse a detail page
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        yield item
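
A note on link resolution: an href taken from a page may be relative, whereas scrapy.Request() needs an absolute URL with a scheme. The sketch below uses only the standard library to show how such links resolve against the page URL; both href values are made up for illustration. Scrapy's response.urljoin() performs this kind of resolution against the response's own base URL.

```python
from urllib.parse import urljoin

page_url = "https://www.gushiwen.cn/default_1.aspx"

# Hypothetical href values, as they might appear in <a> tags on the page
relative_href = "/default_2.aspx"
absolute_href = "https://www.gushiwen.cn/example.aspx"

# A relative href is resolved against the page it was found on
next_url = urljoin(page_url, relative_href)
# An absolute href passes through unchanged
detail_url = urljoin(page_url, absolute_href)

print(next_url)
print(detail_url)
```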

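4. Data structure: items.py

The ProseItem class is defined here so that the spider above can use it. (Note that the spider imports this module; if the IDE cannot resolve the import, it may be necessary to mark the appropriate folder as the source root.)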

import scrapy


class ProseItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem content
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()

5. Pipeline: pipelines.py

The pipeline is where the data-storage logic is written.

import json

from itemadapter import ItemAdapter


class ProsePipeline:
    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict, then to a JSON string (one item per line)
        item_json = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.f.close()
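
Because the pipeline writes one JSON object per line, the output file can be read back line by line. The following is a minimal sketch, not part of the project itself: the function name read_json_lines is hypothetical, and the demo writes a made-up record to a temporary file instead of touching a real gs.txt.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Load a file that holds one JSON object per line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # ignore blank lines
                records.append(json.loads(line))
    return records


# Demo: write a made-up record the same way the pipeline writes items
sample = {"title": "example title", "author": "example author"}
demo_path = os.path.join(tempfile.mkdtemp(), "gs_demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")

records = read_json_lines(demo_path)
print(len(records), records[0]["title"])
```

After a real crawl, read_json_lines('gs.txt') would return one dict per poem.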

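6. Running the program: start.py

Define a script that runs the crawl command. Place it in the project root (next to scrapy.cfg) so that Scrapy can find the project.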

from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())

After the program runs, the data we need is saved in a text file named gs.txt.

