BeautifulSoup与aiohtt-编程学习网

　　香港电台的节目素质都比较不错，其中有个《中华五千年》的节目是以情景剧与旁白的形式来展示历史故事，由传说时代一直到民国，1983年首播至2000年，非常长寿的一个节目。网上能找到版本声音非常模糊，不过在其《网上中华五千年》的网站上可以在线收听所有节目。虽然可以在线听，但要science上网，而且在线听中断了就不能再续着听，很难受。因此，就想到利用Python来的爬虫来把节目都下载下来慢慢听。

　　在浏览器打开审查元素找到音频的链接标签，发现链接都在class为.listen-button的a标签里。只要定位到这个标签，取出text作为文件名，href作为下载url就可以了。

　　代码很简单，首先，主体结构是这样的：

'''
    下载中华五千年
'''
from bs4 import BeautifulSoup
import requests,urllib,re
import time
import aiohttp
import asyncio
import os

async def main():
    start_page = 1
    while True:     
        url = 'http://rthk9.rthk.hk/chiculture/fivethousandyears/subpage{0}.htm'.format(start_page)
        soup = await getUrl(url)      #取html内容
        if not soup.title: return   #直到无内容退出
        title = soup.title.text 
        title = title[title.rfind(' ')+1:]
        listenbutton = soup.select(".listen-button") #查出所有.listen-button类的标签
        #根据title 创建相应的文件夹
        rootPath = './中华五千年/'
        if not os.path.exists(rootPath + title):
            os.makedirs(rootPath + title)

        for l in listenbutton:
            if  l.text != "":
                href = l['href']
                filename  = str(title) +'_' + str(l.text)
                if filename.find('公元') > -1
                    await download(filename=filename,url=href,title=title)  #下载语音

        start_page += 1 #下一页

asyncio.run(main())

其中异步函数（协程）getUrl :

async def getUrl(url):
    async with aiohttp.ClientSession() as session:
        #因需science上网所以需要本地代理
        async with session.get(url,proxy='http://127.0.0.1:1080') as resp:
            wb_data = await resp.text()
            soup = BeautifulSoup(wb_data,'lxml')
    return soup

异步下载语音函数 download：

async def download(url,filename,title):
    file_name = './中华五千年/{0}/{1}'.format(title,filename + '.mp3') 
    async with aiohttp.ClientSession() as session:
        async with session.get(url,proxy='http://127.0.0.1:1080') as resp:
            with open(file_name, 'wb') as fd:
                while True:
                    chunk = await resp.content.read()
                    if not chunk:
                        break
                    fd.write(chunk)

由于用了异步IO的方式，很快便可以下载完一页。

文章详情

BeautifulSoup与aiohtt

软考中级精品资料免费领

相关文章

猜你喜欢

BeautifulSoup与aiohtt

python BeautifulSoup

BeautifulSoup的基本用法

Python中有哪些BeautifulSoup对象

Python中BeautifulSoup模块详解

Python使用BeautifulSoup实现解析网页

BeautifulSoup怎么解析XML命名空间

BeautifulSoup常用语法有哪些

使用BeautifulSoup和正则表达

python爬虫怎么使用BeautifulSoup库

beautifulsoup库怎么在python中使用

python BeautifulSoup中findNext()函数怎么使用

Python中怎么利用Beautifulsoup爬取网站

Python爬虫之BeautifulSoup的基本使用教程

python 中的 BeautifulSoup 网页使用方法解析

利用requests+BeautifulSoup爬取网页关键信息

Python 爬虫：如何用 BeautifulSoup 爬取网页数据

Python爬虫包 BeautifulSoup 递归抓取实例详解

Python实战快速上手BeautifulSoup库爬取专栏标题和地址

Python使用BeautifulSoup库解析HTML基本使用教程