Scraping Douban's New Book List with Python

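Use the requests library in Python 3 to quickly fetch the list of new books recommended on Douban Books, then save each book's details and cover thumbnail to local disk.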

#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author:Aiker Zhao
@file:douban3.py
@time:10:34 AM
"""
import json
import os
import re
from multiprocessing import Pool
import requests
from requests.exceptions import RequestException

dir = 'z:\\douban\\'

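def get_web(url):
    # Douban may reject the default requests User-Agent, so send a browser-like one
    # (the header value here is an illustrative assumption).
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        # A timeout keeps the script from waiting forever on a stalled connection.
        rq = requests.get(url, headers=headers, timeout=10)
        if rq.status_code == 200:
            return rq.text
        return None
    except RequestException:
        return None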

def parse_web(html):
    # Capture groups, in order: link, title, cover image, author, year, publisher.
    pattern = re.compile(r'<li\sclass="">.*?cover".*?href="(.*?)"\stitle="(.*?)".*?img\ssrc="(.*?)"' +
                         r'.*?class="author">(.*?)<.*?year">(.*?)<.*?publisher">(.*?)<.*?</li>', re.S)
    results = re.findall(pattern, html)
    for i in results:
        # Strip the surrounding whitespace that the page leaves around text fields.
        yield {
            'title': i[1],
            'url': i[0],
            'img': i[2],
            'author': i[3].strip(),
            'year': i[4].strip(),
            'publisher': i[5].strip()
        }

def save_image(title, img):
    images = dir + title + '.jpg'
    # Skip the download when the thumbnail is already on disk.
    if not os.path.exists(images):
        with open(images, 'wb') as f:  # the with block closes the file
            f.write(requests.get(img).content)

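def save_info(content):
    info = dir + 'info.txt'
    # Write the file as UTF-8 rather than the platform default encoding.
    with open(info, 'a', encoding='utf-8') as fd:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        fd.write(json.dumps(content, ensure_ascii=False) + '\n')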

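def main():
    url = 'https://book.douban.com/'
    html = get_web(url)
    if html is None:
        # get_web returns None on any failure; parse_web cannot handle that.
        print('Failed to fetch', url)
        return
    os.makedirs(dir, exist_ok=True)  # make sure the output directory exists
    for i in parse_web(html):
        print(i)
        save_info(i)
        save_image(i.get('title'), i.get('img'))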

if __name__ == '__main__':
    main()
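
save_image builds the file name directly from the book title, and titles can contain characters such as `:` or `?` that Windows does not allow in file names. Below is a minimal sketch of one way to clean the title first. The helper `safe_filename` and its `max_length` parameter are hypothetical and not part of the original script.

```python
import re

# Hypothetical helper, not part of the original script.
# Characters Windows forbids in file names, plus common control whitespace.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, max_length=100):
    """Return a version of title that is safe to use as a file name."""
    # Replace every forbidden character with an underscore.
    cleaned = _INVALID_CHARS.sub('_', title)
    # Windows also rejects names that start or end with spaces or dots.
    cleaned = cleaned.strip(' .')
    # Fall back to a placeholder so an empty title never yields just '.jpg'.
    return cleaned[:max_length] or 'untitled'


if __name__ == '__main__':
    print(safe_filename('Python: What? "Why"'))
```

With this in place, save_image could build its path from `safe_filename(title)` instead of the raw title.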


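Takeaways:
- Make sure the regex pattern matches the page precisely; otherwise the script may hang with no response or time out indefinitely.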
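
One way to check the pattern before touching the network is to run it against a small saved snippet. The sketch below uses the pattern from parse_web with the image attribute read as `img\ssrc`. The sample HTML, its example.com URLs, and the helper `extract` are made up for illustration, so Douban's real markup may differ.

```python
import re

# Same pattern as parse_web, written with raw strings.
PATTERN = re.compile(
    r'<li\sclass="">.*?cover".*?href="(.*?)"\stitle="(.*?)".*?img\ssrc="(.*?)"'
    r'.*?class="author">(.*?)<.*?year">(.*?)<.*?publisher">(.*?)<.*?</li>',
    re.S)

# Made-up snippet that mimics one book entry; not copied from Douban.
SAMPLE_HTML = '''
<li class="">
  <div class="cover">
    <a href="https://example.com/book/1" title="Sample Book">
      <img src="https://example.com/cover.jpg" alt="Sample Book">
    </a>
  </div>
  <div class="info">
    <span class="author"> Jane Doe </span>
    <span class="year"> 2019-1 </span>
    <span class="publisher"> Example Press </span>
  </div>
</li>
'''


def extract(html):
    """Return one tuple of whitespace-stripped fields per matched entry."""
    return [tuple(field.strip() for field in match)
            for match in PATTERN.findall(html)]


if __name__ == '__main__':
    # One entry with six fields is expected for the sample snippet.
    print(extract(SAMPLE_HTML))
    # Markup that does not fit the pattern gives an empty list, not an error.
    print(extract('<li class="other">no match</li>'))
```

If the snippet stops matching after a site redesign, the empty list shows the problem right away, without waiting on a live request.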
