[Learning Python Web Crawling] Python - 编程学习网

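Install with pip:

pip install scrapy

Possible problems:
Problem: error: Microsoft Visual C++ 14.0 is required.
This appears on Windows when a dependency has to be compiled from source; installing the Microsoft C++ Build Tools usually resolves it.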
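
Before moving on, it can be worth confirming that the install actually succeeded. The following is only a minimal sketch: the helper name `is_installed` is made up for illustration, and it relies on the standard-library `importlib` rather than on anything Scrapy provides.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # 'scrapy' is the import name of the package installed above
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints False right after a seemingly successful install, pip may have installed into a different Python environment than the one running the script.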

Example demo, following the tutorial in the Chinese documentation
Step 1: Create the project directory

scrapy startproject tutorial

Step 2: Enter the tutorial directory and create a spider

scrapy genspider baidu www.baidu.com

Step 3: Create the item container: copy items.py in the project package and name the copy BaiduItems.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

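class BaiduItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    # used by the deep-crawl spider further down; assigning to an
    # undeclared field raises KeyError
    nav = scrapy.Field()
    href = scrapy.Field()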

Step 4: Edit spiders/baidu.py to extract data with XPath

# -*- coding: utf-8 -*-
import scrapy
# Import the item container
from tutorial.BaiduItems import BaiduItems


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.readingbar.net']
    start_urls = ['http://www.readingbar.net/']

    def parse(self, response):
        # Each <li> under a <ul> yields one item
        for sel in response.xpath('//ul/li'):
            item = BaiduItems()
            # extract() returns a list of every match
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
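
To see what the per-`li` loop in the spider pulls out without running a crawl, the same idea can be tried on a small fragment. This is a rough sketch that swaps in the standard-library `xml.etree.ElementTree` for Scrapy's selectors, so it only works on well-formed markup; the sample markup and the helper name `extract_items` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real page would be fetched by the spider
SAMPLE = """
<html><body><ul>
  <li><a href="/books/1">First book</a> a short description</li>
  <li><a href="/books/2">Second book</a> another description</li>
</ul></body></html>
"""


def extract_items(markup):
    """Collect title, link and desc for each <li> that sits under a <ul>."""
    root = ET.fromstring(markup.strip())
    items = []
    for li in root.iterfind(".//ul/li"):
        a = li.find("a")
        if a is None:
            continue  # skip list entries that carry no link
        items.append({
            "title": (a.text or "").strip(),
            "link": a.get("href"),
            # text after the closing </a> is stored as the tail of <a>
            "desc": (a.tail or "").strip(),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item)
```

One difference from the spider: `extract()` returns a list of all matches, so each field of the scraped items is a list, whereas this sketch stores plain strings.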

Step 5: Fix the blank result when crawling the Baidu home page by editing settings.py

# Set the user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'

# Do not obey robots.txt, which also removes the robots.txt-related debug output
ROBOTSTXT_OBEY = False
# Export feeds as UTF-8 so saved non-ASCII text is not garbled
FEED_EXPORT_ENCODING = 'utf-8'
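
For JSON feeds, the "garbled" text that FEED_EXPORT_ENCODING fixes is usually \uXXXX escaping rather than real corruption. The sketch below uses the standard-library `json` module to show the difference; it mirrors the effect only and is not Scrapy's exporter code, and the sample title is just an illustration.

```python
import json

# Illustrative item; real items come from the spider
item = {"title": "百度一下，你就知道"}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters readable when written as UTF-8
readable = json.dumps(item, ensure_ascii=False)

if __name__ == "__main__":
    print(escaped)
    print(readable)
```

Both strings decode to the same data; only the on-disk readability differs.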

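Final step: run the crawl and save the data to a file
Running the command may fail with: No module named 'win32api'. The module is provided by the pywin32 package; download and install the version that matches your Python.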

scrapy crawl baidu -o baidu.json
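
Once the crawl has finished, a quick way to check the result is to load the file back. This is a small sketch using only the standard library; the helper names `load_items` and `count_with_field` are made up for illustration, and it assumes the output is a single JSON array as written by the command above.

```python
import json


def load_items(path):
    """Read a JSON feed written by `scrapy crawl ... -o <file>.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_with_field(items, field):
    """Count the items in which the given field is present and non-empty."""
    return sum(1 for item in items if item.get(field))


# Example use after the crawl has finished:
#     items = load_items("baidu.json")
#     print(len(items), count_with_field(items, "title"))
```

Note that `-o` appends to an existing file, so rerunning the crawl into the same .json file can leave it unparseable; deleting the old file first avoids that.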

Deep crawl: the Baidu home page and the pages linked from its navigation menu

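# -*- coding: utf-8 -*-
import scrapy

# This example lives in a project named scrapyProject; in the tutorial
# project above the import would be tutorial.BaiduItems
from scrapyProject.BaiduItems import BaiduItems


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    # The nav tabs point to other subdomains; list them here, otherwise
    # those requests are filtered out and never crawled
    allowed_domains = [
        'www.baidu.com',
        'v.baidu.com',
        'map.baidu.com',
        'news.baidu.com',
        'tieba.baidu.com',
        'xueshu.baidu.com'
    ]
    start_urls = ['https://www.baidu.com/']

    def parse(self, response):
        item = BaiduItems()
        item['title'] = response.xpath('//title/text()').extract()
        yield item
        for sel in response.xpath('//a[@class="mnav"]'):
            # BaiduItems must declare nav and href fields, otherwise
            # these assignments raise KeyError
            item = BaiduItems()
            item['nav'] = sel.xpath('text()').extract()
            item['href'] = sel.xpath('@href').extract()
            yield item
            # Build a new request from the extracted nav URL and handle the
            # response in parse_newpage; urljoin resolves relative links
            if item['href']:
                yield scrapy.Request(response.urljoin(item['href'][0]),
                                     callback=self.parse_newpage)

    # Extract information from each tab page
    def parse_newpage(self, response):
        item = BaiduItems()
        item['title'] = response.xpath('//title/text()').extract()
        yield item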
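
Two details decide whether a nav link is actually followed: the href has to be turned into an absolute URL, and its host has to fall under allowed_domains. The sketch below illustrates both with the standard library only. It approximates Scrapy's offsite filtering rather than reproducing it, and the helper names and the shortened domain list are assumptions for illustration.

```python
from urllib.parse import urljoin, urlparse

# Shortened, illustrative list; the spider above allows more subdomains
ALLOWED_DOMAINS = ["www.baidu.com", "news.baidu.com", "tieba.baidu.com"]


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Approximate the offsite check: the host must equal an allowed
    domain or be a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    for href in ["//news.baidu.com/", "/more/", "https://map.baidu.com/"]:
        url = absolute_url("https://www.baidu.com/", href)
        print(url, "allowed" if is_allowed(url) else "filtered")
```

With this shortened list, map.baidu.com is reported as filtered, which is the situation the longer allowed_domains list in the spider is meant to avoid.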

Crawling past a login
a. Solving image CAPTCHAs: pytesseract
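
As a rough sketch of how pytesseract could be applied here: read the CAPTCHA image, run OCR on it, and strip anything that is not a letter or digit. The helper names are made up for illustration, and the code assumes an alphanumeric CAPTCHA and that the Tesseract binary, pytesseract and Pillow are installed. Heavily distorted CAPTCHAs may well defeat plain OCR.

```python
import re


def clean_captcha_text(raw):
    """Keep only letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def solve_captcha(image_path):
    """OCR a CAPTCHA image file and return the cleaned guess.

    Assumes the Tesseract binary plus the pytesseract and Pillow
    packages are installed; they are imported here so the helper
    above can be used without them.
    """
    from PIL import Image
    import pytesseract

    # Convert to grayscale first, which often helps OCR on noisy images
    image = Image.open(image_path).convert("L")
    return clean_captcha_text(pytesseract.image_to_string(image))
```

The cleaned string would then be submitted together with the login form fields.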
