How to Use the Beautiful Soup Module in Python

This article introduces how to use the Beautiful Soup module in Python: installing it, parsing an HTML document, the kinds of objects it exposes, traversing the document, and searching for tags. The examples are short and meant to be tried out as you read.

1. Introduction to the Beautiful Soup Module

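Beautiful Soup is a Python library for extracting data from HTML and XML files. Put simply, it parses a tag-based document into a tree, which makes it easy to look up a given tag and read its attributes. It is also commonly used as the parsing step when scraping the content of an entire site.
Beautiful Soup supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers. If no third-party parser is installed, it falls back to the standard-library parser.
lxml is a Python parsing library that handles both HTML and XML. The html5lib parser parses a page the same way a web browser does and produces valid HTML5.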

pip install beautifulsoup4
pip install html5lib
pip install lxml
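Since the third-party parsers are optional, a script may not know which ones are present on a given machine. The sketch below is one possible way to handle that: it tries parsers in a preferred order and falls back to the next one when Beautiful Soup raises FeatureNotFound. The helper name make_soup, the preference order, and the sample markup are assumptions made for illustration, not part of the original article.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# The markup is deliberately left unclosed.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)    # whichever parser was found first
print(soup.b.string)  # text of the <b> tag
```

html.parser ships with Python, so the last entry in the tuple should always be available.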

2. Parsing an HTML Document with Beautiful Soup

Suppose we have an incomplete fragment of HTML and want to parse it with Beautiful Soup:

data = '''
<html><head><title>The Dormouse's story</title></he
<body>
<p class="title"><b id="title">The Dormouse's story</b></p>
<p class="story">Once upon a time there were three
<a href="http://example.com/elsie" class="sister" i
<a href="http://example.com/lacie" class="sister" i
<a href="http://example.com/tillie" class="sister"
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

First import BeautifulSoup, then create a BeautifulSoup object:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

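The methods on the BeautifulSoup object then give access to the elements, attributes, links, and text of the HTML. Beautiful Soup can also turn an incomplete HTML document into a complete one. For example, this is what print(soup.prettify()) outputs: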

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b id="title">
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three
   <a a="" and="" at="" bottom="" class="sister" href="http://example.com/elsie" i="" lived="" of="" the="" they="" well.="">
    <p class="story">
     ...
    </p>
   </a>
  </p>
 </body>
</html>
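To see the repair step in isolation, here is a minimal sketch using a much smaller fragment. The fragment is made up for illustration, and the example uses the standard-library html.parser so that it runs without any third-party parser. Different parsers may repair the same broken markup in different ways, so the result for a given document depends on the parser chosen.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is ever closed in this made-up fragment.
broken = '<p class="x"><b>bold text'

soup = BeautifulSoup(broken, "html.parser")

# Converting the tree back to a string shows the closing tags that were added.
repaired = str(soup)
print(repaired)  # <p class="x"><b>bold text</b></p>

# prettify() returns the same tree with one tag or string per line.
print(soup.prettify())
```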

Get a tag, such as the title tag or the a tag (only the first matching tag is returned):

print('title = {}'.format(soup.title))
# Output: title = <title>The Dormouse's story</title>
print('a = {}'.format(soup.a))

Get the name of a tag, such as the title tag or the body tag:

print('title_name = {}'.format(soup.title.name))
# Output: title_name = title
print('body_name = {}'.format(soup.body.name))
# Output: body_name = body

Get the text content of a tag, such as the title tag:

print('title_string = {}'.format(soup.title.string))
# Output: title_string = The Dormouse's story

To get the parent of a tag, use parent. For the title tag this returns the head tag, and the incomplete closing tag is completed automatically:

print('title_parent = {}'.format(soup.title.parent))
# Output: title_parent = <head><title>The Dormouse's story</title></head>

Get the first p tag:

print('p = {}'.format(soup.p))
# Output: p = <p class="title"><b id="title">The Dormouse's story</b></p>

Get the class value of the first p tag and of the first a tag:

print('p_class = {}'.format(soup.p["class"]))
# Output: p_class = ['title']
print('a_class = {}'.format(soup.a["class"]))
# Output: a_class = ['sister']
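The class value comes back as a list because class is a multi-valued attribute in HTML, while most other attributes come back as plain strings. The following sketch shows the difference, along with get(), which avoids a KeyError when an attribute is absent. The one-line fragment (including the extra class big and the id link1) is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two classes on a single tag.
html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

classes = link["class"]      # multi-valued attribute: a list of strings
href = link["href"]          # single-valued attribute: a plain string
missing = link.get("title")  # attribute is absent: None rather than KeyError
all_attrs = link.attrs       # every attribute, as a dictionary

print(classes)    # ['sister', 'big']
print(href)       # http://example.com/elsie
print(missing)    # None
print(all_attrs)
```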

Get all tags with a given name:

# Get all a tags
print('a = {}'.format(soup.find_all('a')))
# Get all p tags
print('p = {}'.format(soup.find_all('p')))

Get the tag whose id is title:

print('title_tag = {}'.format(soup.find(id='title')))
# Output: title_tag = <b id="title">The Dormouse's story</b>
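Searching by id is one case of filtering on attributes. As a rough sketch, the example below also filters by class (written class_ because class is a reserved word in Python) and by an arbitrary attribute through the attrs dictionary. The complete, well-formed fragment and the ids link1 and link2 are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b id="title">The Dormouse's story</b></p>
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments are treated as attribute filters.
by_id = soup.find(id="link2")

# class is a Python keyword, so the filter is spelled class_.
by_class = soup.find_all(class_="sister")

# attrs accepts a dictionary of attribute names and required values.
by_attrs = soup.find_all("a", attrs={"href": "http://example.com/elsie"})

print(by_id.string)              # Lacie
print(len(by_class))             # 2
print(by_attrs[0]["id"])         # link1
```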

3. Objects in Beautiful Soup

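Beautiful Soup represents a document with four kinds of objects: Tag (an HTML or XML tag), NavigableString (the text inside a tag), BeautifulSoup (the document as a whole, at the root of the tree), and Comment (a special kind of NavigableString that holds an HTML comment).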
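The four kinds of object can be told apart with isinstance. The sketch below parses a tiny made-up fragment that contains one text node and one comment, then checks the type of each piece.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- a note --></p>", "html.parser")
p = soup.p

# The <p> tag has exactly two children: a text node and a comment.
text_node, comment_node = p.contents

is_soup = isinstance(soup, BeautifulSoup)            # the whole document
is_tag = isinstance(p, Tag)                          # a tag
is_plain_text = type(text_node) is NavigableString   # ordinary text
is_comment = isinstance(comment_node, Comment)       # an HTML comment
# Comment is a subclass of NavigableString, so this is also True.
comment_is_string = isinstance(comment_node, NavigableString)

print(is_soup, is_tag, is_plain_text, is_comment, comment_is_string)
print(repr(comment_node))
```

Because a Comment is also a NavigableString, code that collects text from a page may want to skip Comment instances explicitly.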

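Syntax:

soup.tagname: get the first tag with that name in the HTML;
soup.tagname.name: get the name of the tag;
soup.tagname.attrs: get all attributes of the tag, as a dictionary;
soup.tagname.string: get the text content of the tag;
soup.tagname.parent: get the parent of the tag;
prettify() method: formats the Beautiful Soup document tree and returns it as a Unicode string, with each XML/HTML tag on its own line.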
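The attributes listed above can be tried together on one small tree. This is a minimal sketch on a made-up fragment. It also shows a detail that is easy to trip over: string is None when a tag has more than one child, in which case get_text() can be used to join all the text inside the tag.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p class="lead">Hi <b>there</b></p></div>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name = p.name                # 'p'
attrs = p.attrs              # {'class': ['lead']}
parent_name = p.parent.name  # 'div'

# <p> has two children ("Hi " and <b>), so .string has no single answer.
p_string = p.string          # None
b_string = p.b.string        # 'there'
full_text = p.get_text()     # 'Hi there'

print(name, attrs, parent_name)
print(p_string, b_string, full_text)
```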

4. Traversing the Document

contents: returns all direct child nodes as a list, which can be indexed:

soup = BeautifulSoup(data, "lxml")
# Returns a list
print(soup.p.contents)
# Get the first child node
print(soup.p.contents[0])

children: returns an iterator over the direct child nodes:

for tag in soup.p.children:
    print(tag)
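Both contents and children stop at the direct children of a tag. The sketch below makes that visible on a made-up nested fragment and contrasts it with descendants, an attribute not covered in this article that walks the whole subtree.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one<b>two<i>three</i></b>four</p>", "html.parser")
p = soup.p

# Direct children only: the string "one", the <b> tag, the string "four".
direct = p.contents
first = p.contents[0]

# children iterates over the same nodes; keep only the tags.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]

# descendants recurses, so the nested <i> shows up as well.
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(len(direct), repr(first))  # 3 'one'
print(child_tags)                # ['b']
print(descendant_tags)           # ['b', 'i']
```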

soup.strings: yields every string in the document, including strings that consist only of whitespace:

soup = BeautifulSoup(data, "lxml")
for content in soup.strings:
    print(repr(content))

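soup.stripped_strings: yields every string in the document with leading and trailing whitespace removed, skipping strings that consist only of whitespace: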

soup = BeautifulSoup(data, "lxml")
for tag in soup.stripped_strings:
    print(repr(tag))

5. Searching for Tags

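find_all(): searches all descendants of a tag for those that match the given filter (a tag name, a list of tag names, a regular expression, and so on) and returns the matches as a list: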

import re

soup = BeautifulSoup(data, "lxml")
print(soup.find_all('a'))
print(soup.find_all(['a', 'p']))
# Tags whose name starts with "a"
print(soup.find_all(re.compile('^a')))
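To make the effect of each filter type concrete, the sketch below runs the same three calls on a small made-up document that includes an article tag, so the regular expression matches more than just a. It also shows the limit argument, which stops the search after a given number of matches.

```python
import re

from bs4 import BeautifulSoup

html = """
<body>
<a href="/one">1</a>
<b>bold</b>
<article>text</article>
<p>para</p>
<a href="/two">2</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# A string matches tags with exactly that name.
by_name = [t.name for t in soup.find_all("a")]

# A list matches tags whose name is any item in the list.
by_list = [t.name for t in soup.find_all(["a", "p"])]

# A compiled regex is searched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^a"))]

# limit caps the number of results.
limited = soup.find_all("a", limit=1)

print(by_name)       # ['a', 'a']
print(by_list)       # ['a', 'p', 'a']
print(by_regex)      # ['a', 'article', 'a']
print(len(limited))  # 1
```

Results come back in document order, regardless of the order of the names in the list filter.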

find(): works like find_all(), except that find_all() returns a list of every match, while find() returns the first match directly, or None if nothing matches:

soup = BeautifulSoup(data, "lxml")
print(soup.find('a'))
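The two methods also differ in what they return when nothing matches, which matters for error handling. The following minimal sketch uses a made-up fragment and searches for a table tag that does not exist.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")

first = soup.find("p")            # the first matching tag
everything = soup.find_all("p")   # a list of all matching tags

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # an empty list when nothing matches

print(first.string)     # first
print(len(everything))  # 2
print(missing_one)      # None
print(len(missing_all)) # 0

# Check for None before touching attributes of a find() result.
if missing_one is None:
    print("no table in this document")
```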

That concludes this walkthrough of the Beautiful Soup module in Python. Pairing the explanations with hands-on practice is the best way to learn, so try the examples yourself.
