Python 3 Web Scraping Notes (1): beaut

Many people start learning Python through web scraping, and Python offers many libraries for building scrapers.

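For making requests there are the basic libraries urllib (Python 3) and requests; for parsing there are tools such as XPath (via lxml), Beautiful Soup, and pyquery. XPath has its own expression syntax, and hand-written regular expressions are another common way to extract data; both are easy for beginners to get wrong, so these notes start with Beautiful Soup.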

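import re

from bs4 import BeautifulSoup

# HTML_TEXT is a sample HTML string kept in the author's own local module
from beautiful_soup.constant import HTML_TEXT

soup = BeautifulSoup(HTML_TEXT, 'lxml')
# Print the document in a standard indented format; the parser also fills in missing HTML structure
print(soup.prettify())
# Get the text of the first div tag (.string is None if the tag has more than one child)
print(soup.div.string)
# Get the tag name
print(soup.div.name)
# Get the attributes as a dict; multi-valued attributes such as class are returned as a list
print(soup.div.attrs)
# Attribute access can be chained. When several tags match, only the first is returned:
# if body contains several divs, this picks the first one
print(soup.body.div.a.attrs)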
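
Since HTML_TEXT is not shown in these notes, here is a minimal self-contained sketch of the same attribute-style access. The markup in SAMPLE_HTML is made up for illustration, and the built-in "html.parser" is swapped in for 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the author's HTML_TEXT
SAMPLE_HTML = """
<html><head><title>Demo</title></head>
<body>
<div id="first" class="box main"><a href="/a" title="A">link A</a></div>
<div id="second"><a href="/b">link B</a></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_div = soup.div          # attribute access returns the first matching tag only
print(first_div.name)         # the tag name
print(first_div.attrs)        # a dict; the class value is a list
print(first_div.string)       # the div's only child is <a>, so its string is used
print(soup.body.div.a.attrs)  # chained access: first div in body, then its first a
```

Note that soup.div never reaches the second div; getting every match needs a search method such as find_all().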

# .contents returns a tag's direct children as a list; .children yields the same nodes as an iterator
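
A small sketch of the difference between direct children and deeper nodes, using made-up markup. The .descendants attribute is not mentioned in the notes above; it is included here only as a contrast to show what "direct" means.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration
html = "<ul><li>one</li><li>two <b>bold</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

direct = ul.contents                          # a list of the two li tags
child_names = [c.name for c in ul.children]   # same nodes, produced by an iterator

# .descendants walks the whole subtree, so it also reaches the nested b tag
descendant_names = [d.name for d in ul.descendants if isinstance(d, Tag)]

print(len(direct))
print(child_names)
print(descendant_names)
```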

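Attribute access is quick, but it becomes inflexible in more complex cases. For those, Beautiful Soup provides search methods.

# find_all() returns every element that matches the given conditions
# find_all(name, attrs, recursive, string, limit, **kwargs)
# name is the tag name; attrs is a dict of attributes to match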
print(soup.find_all(name="ul"))

for ul in soup.find_all(name="ul"):
    print(ul.find_all(name="li"))
# attrs is passed as a dict
print(soup.find_all(attrs={"class": "js-geo-city"}))

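# string matches against text content and accepts a str or a compiled regex
# (bs4 versions before 4.4.0 call this argument text)
print(soup.find_all(string=re.compile("热")))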
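
The find_all() calls above can be tried on a self-contained sample. This is a sketch with made-up markup; the class name "hot" and the list contents are assumptions for illustration, and the string argument assumes bs4 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<ul class="menu"><li class="hot">hot deals</li><li>new items</li></ul>
<ul class="footer"><li class="hot">hot topics</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all(name="ul")                         # match by tag name
hot_items = soup.find_all(attrs={"class": "hot"})      # match by attribute dict
hot_kw = soup.find_all(class_="hot")                   # keyword form; class_ avoids the reserved word
hot_strings = soup.find_all(string=re.compile("hot"))  # returns the matching strings, not tags
first_only = soup.find_all("li", limit=1)              # limit caps the number of results

print(len(uls))
print([tag.get_text() for tag in hot_items])
print([str(s) for s in hot_strings])
```

When string is used on its own, the result contains text nodes rather than tags; combine it with a tag name (for example find_all("li", string=...)) to get the tags back.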

# find() takes the same arguments as find_all(), but returns only the first matching element (or None if nothing matches)
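
A short sketch contrasting the return values of find() and find_all() on made-up markup. The practical consequence is that a find() result should be checked for None before chaining further calls on it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = "<div><p class='intro'>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")        # a single Tag: the first p
all_p = soup.find_all("p")      # a list-like result with both p tags
missing = soup.find("table")    # None, because there is no table
empty = soup.find_all("table")  # an empty result, not None

# Guard against None before calling methods on the result
table_text = missing.get_text() if missing is not None else ""

print(first_p.get_text())
print(len(all_p))
print(missing)
```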

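# Other methods
find_parents()            # returns all ancestor nodes
find_parent()             # returns the direct parent node

find_next_siblings()      # returns all following sibling nodes
find_next_sibling()       # returns the first following sibling node

find_previous_siblings()  # returns all preceding sibling nodes
find_previous_sibling()   # returns the first preceding sibling node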
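
These navigation methods are called on a tag that has already been located. The following sketch uses made-up markup and starts from the li with id "mid" (a hypothetical id chosen for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = "<div id='outer'><ul><li>a</li><li id='mid'>b</li><li>c</li><li>d</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")
mid = soup.find(id="mid")

parent = mid.find_parent()                         # the enclosing ul
ancestors = [t.name for t in mid.find_parents()]   # nearest ancestor first: ul, then div, ...

next_all = mid.find_next_siblings()                # the li tags after mid
next_one = mid.find_next_sibling()                 # only the first of them
prev_all = mid.find_previous_siblings()            # the li tags before mid
prev_one = mid.find_previous_sibling()             # only the nearest one

print(parent.name)
print([t.get_text() for t in next_all])
print(prev_one.get_text())
```

Each method also accepts the same filters as find_all(), for example mid.find_parent("div") to skip straight to the enclosing div.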

# CSS selectors: select()
print(soup.select("ul li"))
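
A self-contained sketch of select() on made-up markup. The ids and city names are assumptions for illustration; the class name js-geo-city is reused from the find_all() example above. select_one() is the single-result counterpart, in the same way that find() relates to find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<ul id="cities"><li class="js-geo-city">Beijing</li><li class="js-geo-city">Shanghai</li></ul>
<ul id="other"><li>misc</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_li = soup.select("ul li")                   # descendant selector: every li inside any ul
cities = soup.select("#cities li.js-geo-city")  # id selector combined with a class selector
first_city = soup.select_one("li.js-geo-city")  # first match only, or None if nothing matches

names = [li.get_text() for li in cities]
print(len(all_li))
print(names)
print(first_city.get_text())
```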
