python beautiful soup库入门安装教程-编程学习网

beautiful soup库的安装


pip install beautifulsoup4

beautiful soup库的理解

beautiful soup库是解析、遍历、维护“标签树”的功能库

beautiful soup库的引用


from bs4 import BeautifulSoup
import bs4

BeautifulSoup类

BeautifulSoup对应一个HTML/XML文档的全部内容

回顾demo.html


import requests

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
print(demo)


<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

Tag标签

基本元素	说明
Tag	标签，最基本的信息组织单元，分别用<>和</>标明开头和结尾


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a
print(tag)


<title>This is a python demo page</title>
<a  href="http://www.icourse163.org/course/BIT-268001" >Basic Python</a>

任何存在于HTML语法中的标签都可以用soup.访问获得。当HTML文档中存在多个相同对应内容时，soup.返回第一个

Tag的name

基本元素	说明
Name	标签的名字， … 的名字是'p',格式：.name


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)


a
p   
body

Tag的attrs（属性）

基本元素	说明
Attributes	标签的属性，字典形式组织，格式：.attrs


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(tag.attrs['href'])
print(type(tag.attrs))
print(type(tag))


{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>

Tag的NavigableString

基本元素	说明
NavigableString	标签内非属性字符串，<>…</>中字符串，格式：.string

Tag的Comment

基本元素	说明
Comment	标签内字符串的注释部分，一种特殊的Comment类型


import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(newsoup.b.string)
print(type(newsoup.b.string))
print(newsoup.p.string)
print(type(newsoup.p.string))


This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>

HTML基本格式

标签树的下行遍历

属性	说明
.contents	子节点的列表，将所有儿子结点存入列表
.children	子节点的迭代类型，与.contents类似，用于循环遍历儿子结点
.descendents	子孙节点的迭代类型，包含所有子孙节点，用于循环遍历

BeautifulSoup类型是标签树的根节点


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
print(len(soup.body.contents))
print(soup.body.contents[1])


<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p ><b>The demo python introduces several python courses.</b></p>, '\n', <p >Python 
is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the 
following courses:
<a  href="http://www.icourse163.org/course/BIT-268001" >Basic Python</a> and <a  href="http://www.icourse163.org/course/BIT-1001870001" >Advanced Python</a>.</p>, '\n']
5
<p ><b>The demo python introduces several python courses.</b></p>


for child in soup.body.children:
	print(child)  #遍历儿子结点
for child in soup.body.descendants:
	print(child) #遍历子孙节点

标签树的上行遍历

属性	说明
.parent	节点的父亲标签
.parents	节点先辈标签的迭代类型，用于循环遍历先辈节点


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.title.parent)
print(soup.html.parent)


<head><title>This is a python demo page</title></head>
<html><head><title>This is a python demo page</title></head>
<body>
<p ><b>The demo python introduces several python courses.</b></p>
<p >Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a  href="http://www.icourse163.org/course/BIT-268001" >Basic Python</a> and <a  href="http://www.icourse163.org/course/BIT-1001870001" >Advanced Python</a>.</p>
</body></html>


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)


p
body      
html      
[document]

标签的平行遍历

属性	说明
.next_sibling	返回按照HTML文本顺序的下一个平行节点标签
.previous.sibling	返回按照HTML文本顺序的上一个平行节点标签
.next_siblings	迭代类型，返回按照HTML文本顺序的后续所有平行节点标签
.previous.siblings	迭代类型，返回按照HTML文本顺序的前续所有平行节点标签


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)

print(soup.a.previous_sibling)
print(soup.a.previous_sibling.previous_sibling)

print(soup.a.parent)


and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

None
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>


for sibling in soup.a.next_sibling:
	print(sibling)  #遍历后续节点
for sibling in soup.a.previous_sibling:
	print(sibling)  #遍历前续节点

在这里插入图片描述

bs库的prettify()方法


import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())


<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

.prettify()为HTML文本<>及其内容增加更加'\n'
.prettify()可用于标签，方法：.prettify()

bs4库的编码

bs4库将任何HTML输入都变成utf-8编码
python 3.x默认支持编码是utf-8,解析无障碍


import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>","html.parser")
print(soup.p.string)

print(soup.p.prettify())


中文

<p>  
 中文
</p>

到此这篇关于python beautiful soup库入门安装教程的文章就介绍到这了,更多相关python beautiful soup库入门内容请搜索编程网以前的文章或继续浏览下面的相关文章希望大家以后多多支持编程网！

文章详情

python beautiful soup库入门安装教程

目录

beautiful soup库的安装

beautiful soup库的理解

beautiful soup库的引用

BeautifulSoup类

回顾demo.html

Tag标签

Tag的name

Tag的attrs（属性）

Tag的NavigableString

HTML基本格式

标签树的下行遍历

标签树的上行遍历

标签的平行遍历

bs库的prettify()方法

bs4库的编码

软考中级精品资料免费领

相关文章

猜你喜欢

python beautiful soup库入门安装教程

python 解析库Beautiful Soup的安装

python解析库Beautiful Soup安装的详细步骤

Python入门教程(三十九)Python的NumPy安装与入门

Android studio 4.1.2安装入门教程

Python入门教程之Python的安装下载配置

python开发之Docker入门安装部署教程

python扩展库numpy入门教程

Drupal8入门教程之安装部署

Pythonpyecharts模块安装与入门教程

python库pydantic的简易入门教程

Maven入门教程(一)：安装Maven环境

Centos7安装ElasticSearch 6.4.1入门教程详解

PyQT安装教程：快速入门指南

Python使用Selenium WebDriver的入门介绍及安装教程

Linux - Java 8 入门安装与重装教程集锦

Python入门教程之pycharm安装/基本操作/快捷键

PyQt5 教科书级完整教程（一）安装与入门

Python入门教程｜｜Python3 MySQL 数据库连接｜｜

Python安装jieba库详细教程