How to Scrape a Novel with PyCharm - 编程学习网

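How do you scrape a novel with PyCharm? This article walks through the analysis and a working solution, in the hope of giving anyone facing the same problem a simpler, more practical method.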

The approach to scraping a novel:

1. Get the novel's address

This article uses a novel from 搜书网 as the example: 《嘘，梁上有王妃！》
Table-of-contents URL: https://www.soshuw.com/XuLiangShangYouWangFei/
Load the required packages:

import re
from bs4 import BeautifulSoup as ds
import requests

Fetch the novel's table-of-contents page. A return value of <Response [200]> means the page can be scraped normally.

base_url = 'https://www.soshuw.com/XuLiangShangYouWangFei/'
chapter_html = requests.get(base_url)
print(chapter_html)

2. Analyze the novel's address structure

Parse the table-of-contents page. The output is the page's source code.

# Parse the response body with the lxml parser.
chapter_page = ds(chapter_html.content, 'lxml')
print(chapter_page)

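Opening the table-of-contents page shows a list of the latest chapters (nine of them here) ahead of the main table of contents. The complete table of contents already includes those chapters, so the latest-chapters list is not needed.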


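Right-click on the page and choose "Inspect" (or "Properties"; the name differs between browsers, and I am using IE), then select the "Elements" tab. As the mouse moves over the code on the right, the matching area of the page on the left is highlighted. Use this to find the code block for the complete table of contents, as shown in the figure below: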

[Figure: the code block of the complete table of contents in the browser's Elements panel]

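The complete table of contents offers two possible anchors: class="novel_list" and id="novel108799". On closer inspection the class is not unique on the page, so we use the id to extract this block.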


Extract the complete table-of-contents block:

chapter_novel = chapter_page.find(id="novel108799")
print(chapter_novel)
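To see why the id is the safer anchor, here is a minimal, self-contained sketch. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The markup, the div tags, and all chapter numbers other than 3647144 are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup: two blocks share class "novel_list",
# but only the complete table of contents carries id "novel108799".
SAMPLE_HTML = """
<div class="novel_list"><dl>
<dd><a href="/XuLiangShangYouWangFei/3647152.html">Latest chapter</a></dd>
</dl></div>
<div class="novel_list" id="novel108799"><dl>
<dd><a href="/XuLiangShangYouWangFei/3647144.html">Chapter 1</a></dd>
<dd><a href="/XuLiangShangYouWangFei/3647145.html">Chapter 2</a></dd>
</dl></div>
"""


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.target_tag = None  # tag name of the element that carries the id
        self.depth = 0          # nesting depth inside that element; 0 = outside
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Outside the target: only react to the element with the wanted id.
            if attrs.get("id") == self.target_id:
                self.target_tag = tag
                self.depth = 1
            return
        if tag == self.target_tag:
            self.depth += 1     # nested element with the same tag name
        if tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth and tag == self.target_tag:
            self.depth -= 1


parser = LinkCollector("novel108799")
parser.feed(SAMPLE_HTML)
toc_hrefs = parser.hrefs
print(toc_hrefs)
```

Selecting by class would also pick up the link in the first block, whereas selecting by id returns only the two links from the complete table of contents.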

The result is as follows (partial output only):

[Figure: partial output of print(chapter_novel)]

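Comparing a chapter's URL with the table-of-contents URL (base_url) shows that the full chapter URL is simply base_url joined with the trailing part of the chapter's address.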

3. Build the chapter addresses

Use the regular expression library (re) to extract the trailing part of each address:

chapter_novel_str = str(chapter_novel)
# Capture whatever follows the novel's directory in each chapter link.
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)
print(chapter_href_list)
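To make the pattern concrete, the following minimal sketch runs the same regular expression against a small hand-written fragment. The fragment and the second chapter number are made up for illustration; the real page source may differ.

```python
import re

# Hypothetical fragment imitating the table-of-contents markup.
sample_html = (
    '<dd><a href="/XuLiangShangYouWangFei/3647144.html">Chapter 1</a></dd>\n'
    '<dd><a href="/XuLiangShangYouWangFei/3647145.html">Chapter 2</a></dd>\n'
)

# Same pattern as in the article: the lazy group stops at the first closing
# quote, so it captures only the part of the path after the novel's directory.
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
sample_hrefs = re.findall(regx, sample_html)
print(sample_hrefs)  # ['/3647144.html', '/3647145.html']
```

Each captured suffix begins with a slash, which matters when it is appended to the base address.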

Build the URLs.
Define a list, chapter_url_list, to hold the complete addresses:

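chapter_url_list = []
for i in chapter_href_list:
    # Each captured suffix starts with '/', so drop any trailing '/' from
    # base_url first to avoid a double slash in the result.
    url = base_url.rstrip('/') + i
    chapter_url_list.append(url)
print(chapter_url_list)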
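As an alternative to string concatenation, the standard library's urllib.parse.urljoin can resolve a link against the base address the way a browser would. The sketch below is not the article's method, and the href values are assumed examples of what the page source might contain.

```python
from urllib.parse import urljoin

base_url = 'https://www.soshuw.com/XuLiangShangYouWangFei/'

# Hypothetical href values: one absolute path and one bare file name.
raw_hrefs = ['/XuLiangShangYouWangFei/3647144.html', '3647144.html']

# urljoin resolves each href against base_url, so both forms
# end up pointing at the same chapter page.
joined = [urljoin(base_url, h) for h in raw_hrefs]
print(joined)
```

Note that urljoin expects the complete href value. The suffix captured by the regular expression begins with a slash, so urljoin would treat it as a path from the site root and drop the novel's directory.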

4. Analyze the chapter content structure

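Open a chapter, right-click, choose "Properties", and look at the content structure. The novel's body text has two anchors, a class and an id. The class stays the same while the id changes from chapter to chapter, so we use the class to extract the body.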


Extract the body block:

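# Fetch one chapter and select the body block by its class.
body_html = requests.get('https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html')
body_page = ds(body_html.content, 'lxml')
body = body_page.find(class_='content')
print(body)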

Extract the body text and the chapter title:

body_html = requests.get('https://www.soshuw.com/XuLiangShangYouWangFei/3647144.html')
body_page = ds(body_html.content, 'lxml')
body = body_page.find(class_='content')
body_content = str(body)
print(body_content)

# Each paragraph follows a "<br/> " marker and runs to the end of the line.
body_regx = '<br/> (.*?)\n'
content_list = re.findall(body_regx, body_content)
print(content_list)

# The chapter title sits in the page's <h2> element.
title_regx = '<h2>(.*?)</h2>'
title = re.findall(title_regx, body_html.text)
print(title)
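The two patterns are easier to follow on a small example. The sketch below applies them to a hand-written fragment; the markup is an assumption about what str(body) looks like, chosen only so that both patterns have something to match.

```python
import re

# Hypothetical chapter markup in the form the two patterns expect:
# each paragraph follows a "<br/> " marker and ends with a newline.
sample_body = (
    '<div class="content">\n'
    '<h2>Chapter 1</h2>\n'
    '<br/> First paragraph.\n'
    '<br/> Second paragraph.\n'
    '</div>'
)

paragraphs = re.findall('<br/> (.*?)\n', sample_body)
titles = re.findall('<h2>(.*?)</h2>', sample_body)
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
print(titles)      # ['Chapter 1']
```

If a real page puts no space after the line-break tag, or no newline after a paragraph, the first pattern returns an empty list, so it is worth printing body_content once and comparing it with the pattern.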

5. Save the text

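# Append the chapter to 1.txt; utf-8 is set explicitly so the Chinese text
# does not depend on the platform's default encoding.
with open('1.txt', 'a+', encoding='utf-8') as f:
    f.write('\n\n')
    f.write(title[0] + '\n')
    f.write('\n\n')
    for e in content_list:
        f.write(e + '\n')
print('{} 爬取完毕'.format(title[0]))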
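The saving step can also be wrapped in a small function, which makes it easy to try out on made-up data before running the scraper. This is a sketch only: save_chapter and the demo file are hypothetical names, not part of the original code.

```python
import os
import tempfile


def save_chapter(path, title, paragraphs):
    """Append one chapter (title plus paragraphs) to a text file."""
    # utf-8 is set explicitly so the result does not depend on the platform default.
    with open(path, 'a+', encoding='utf-8') as f:
        f.write('\n\n' + title + '\n\n\n')
        for p in paragraphs:
            f.write(p + '\n')


# Usage with made-up data, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
save_chapter(demo_path, 'Chapter 1', ['First paragraph.', 'Second paragraph.'])
save_chapter(demo_path, 'Chapter 2', ['Third paragraph.'])

with open(demo_path, encoding='utf-8') as f:
    saved_text = f.read()
print(saved_text)
```

Because the file is opened in append mode, running the scraper twice writes every chapter twice; deleting the output file before a fresh run avoids that.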

6. Complete code

import re
from bs4 import BeautifulSoup as ds
import requests

# Fetch and parse the table-of-contents page.
base_url = 'https://www.soshuw.com/XuLiangShangYouWangFei'
chapter_html = requests.get(base_url)
chapter_page = ds(chapter_html.content, 'lxml')

# Select the complete table of contents by id and pull out each link's suffix.
chapter_novel = chapter_page.find(id="novel108799")
chapter_novel_str = str(chapter_novel)
regx = '<dd><a href="/XuLiangShangYouWangFei(.*?)"'
chapter_href_list = re.findall(regx, chapter_novel_str)

# Build the complete chapter URLs.
chapter_url_list = []
for i in chapter_href_list:
    url = base_url + i
    chapter_url_list.append(url)

# Fetch every chapter, extract the title and body, and append them to 1.txt.
for u in chapter_url_list:
    body_html = requests.get(u)
    body_page = ds(body_html.content, 'lxml')
    body = body_page.find(class_='content')
    body_content = str(body)
    body_regx = '<br/> (.*?)\n'
    content_list = re.findall(body_regx, body_content)
    title_regx = '<h2>(.*?)</h2>'
    title = re.findall(title_regx, body_html.text)
    # utf-8 is set explicitly so the output does not depend on the platform default.
    with open('1.txt', 'a+', encoding='utf-8') as f:
        f.write('\n\n')
        f.write(title[0] + '\n')
        f.write('\n\n')
        for e in content_list:
            f.write(e + '\n')
    print('{} 爬取完毕'.format(title[0]))
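The loop above sends requests back to back and stops at the first error. One possible refinement, sketched below, pauses between requests and records failures instead of aborting. scrape_all and fake_fetch are hypothetical names introduced for this illustration, and the fetch step is passed in as a function so that the sketch runs without network access.

```python
import time


def scrape_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests and skipping failures.

    `fetch` is any callable that takes a URL and returns the page text,
    raising an exception on failure.
    """
    pages, failed = {}, []
    for n, url in enumerate(urls):
        if n:                      # no pause before the first request
            sleep(delay)
        try:
            pages[url] = fetch(url)
        except Exception as exc:   # keep going; record the failure instead
            failed.append((url, str(exc)))
    return pages, failed


# Demonstration with a stand-in fetcher, so no network access is needed.
def fake_fetch(url):
    if url.endswith('bad.html'):
        raise RuntimeError('simulated network error')
    return '<h2>' + url + '</h2>'


pages, failed = scrape_all(['a.html', 'bad.html', 'c.html'], fake_fetch, delay=0)
print(sorted(pages))  # ['a.html', 'c.html']
print(failed)         # [('bad.html', 'simulated network error')]
```

With requests, the fetch argument could be something like `lambda u: requests.get(u, timeout=10).text`, which also keeps a stalled connection from hanging the whole run.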

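That wraps up the answer to how to scrape a novel with PyCharm. Hopefully the above is of some help; if you still have unanswered questions, you can follow the 编程网 industry news channel to learn more.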
