How to extract text from HTML with Python

This article compares several ways of extracting plain text from HTML in Python and measures how fast each one is.

# coding: utf-8
from time import time

import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser


def get_text_bs(html):
    # Parse with BeautifulSoup using the lxml backend.
    tree = BeautifulSoup(html, 'lxml')

    body = tree.body
    if body is None:
        return None

    # Drop script and style elements so their contents stay out of the text.
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    text = body.get_text(separator='\n')
    return text


def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')
    return text


def read_doc(record, parser=get_text_selectolax):
    url = record.url
    text = None

    if url:
        payload = record.payload.read()
        # Separate the HTTP headers from the HTML body.
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()

        if len(html) > 0:
            text = parser(html)

    return url, text


def process_warc(file_name, parser, limit=10000):
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)

        # Skip records without a URL or without extractable text.
        if not doc or not url:
            continue

        n_documents += 1

        if i > limit:
            break

    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))

>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents

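Clearly this is not the best way to benchmark anything, but it gives an idea: selectolax can sometimes be up to 30 times faster than lxml (about 27 times in this run, where lxml is used through BeautifulSoup).
selectolax is best suited to stripping HTML down to plain text. Suppose I have more than 10,000 HTML fragments that need to be indexed into Elasticsearch as plain text. (Elasticsearch has an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML to plain text at this scale is actually quite slow. So what is the most efficient way to do it?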

PyQuery

from pyquery import PyQuery as pq

text = pq(html).text()

selectolax

from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

Regular expression

import re

# The compiled pattern and the name used below must match.
clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

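Results

I wrote a script to measure the time. It iterates over 10,000 files containing HTML fragments. Note that these fragments are not complete <html> documents (with <head>, <body> and so on), just small pieces of HTML. The average size is 10,314 bytes (median 5,138 bytes). The results are as follows: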
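
The timing script itself is not shown. A minimal sketch of such a harness follows, assuming the fragments have already been loaded into a list of strings. The names `benchmark` and `strip_tags_regex` and the sample data are hypothetical and not taken from the original script.

```python
import re
import statistics
from time import perf_counter

_TAG_RE = re.compile(r'<.*?>')


def strip_tags_regex(html):
    """Stand-in converter; swap in the PyQuery or selectolax version."""
    return _TAG_RE.sub('', html)


def benchmark(func, fragments):
    """Time func on every fragment; return (sum_s, mean_ms, median_ms)."""
    timings = []
    for html in fragments:
        t0 = perf_counter()
        func(html)
        timings.append(perf_counter() - t0)
    return (
        sum(timings),
        statistics.mean(timings) * 1000,
        statistics.median(timings) * 1000,
    )


if __name__ == '__main__':
    # Hypothetical sample data; the original run read 10,000 files.
    fragments = ['<p>Hello <b>world</b></p>'] * 1000
    total, mean_ms, median_ms = benchmark(strip_tags_regex, fragments)
    print('SUM:    %.2f seconds' % total)
    print('MEAN:   %.4f ms' % mean_ms)
    print('MEDIAN: %.4f ms' % median_ms)
```

Timing each call separately is what makes it possible to report a median as well as a sum and a mean.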

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

I have run this many times and the results are very stable. The takeaway: selectolax is roughly 6 to 7 times faster than PyQuery.
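
The ratio can be checked directly from the reported numbers. The snippet below is only a small arithmetic sketch; the `results` dictionary and the `speedup` helper are hypothetical names, and the figures are copied from the results listed above.

```python
# Totals (seconds) and medians (ms) copied from the reported results.
results = {
    'pyquery': {'sum_s': 18.61, 'median_ms': 1.0554},
    'selectolax': {'sum_s': 3.08, 'median_ms': 0.1621},
    'regex': {'sum_s': 1.64, 'median_ms': 0.0881},
}


def speedup(slow, fast, key):
    """Return how many times faster `fast` is than `slow` for one metric."""
    return results[slow][key] / results[fast][key]


print('%.1f' % speedup('pyquery', 'selectolax', 'sum_s'))      # 6.0
print('%.1f' % speedup('pyquery', 'selectolax', 'median_ms'))  # 6.5
print('%.1f' % speedup('selectolax', 'regex', 'sum_s'))        # 1.9
```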

Is the regular expression good enough? Really?

For the most basic HTML blobs it may work fine. But if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.
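
The entity problem is easy to reproduce. The following is a minimal sketch using only the standard library; pairing the regex with `html.unescape` is an illustration and not something the original benchmark did.

```python
import html
import re

_TAG_RE = re.compile(r'<.*?>')

fragment = '<p>Foo &amp; Bar</p>'

# Tag stripping alone leaves the entity untouched.
stripped = _TAG_RE.sub('', fragment)
print(stripped)  # Foo &amp; Bar

# html.unescape() from the standard library decodes it afterwards.
print(html.unescape(stripped))  # Foo & Bar
```

Decoding has to happen after the tags are removed; otherwise an escaped `&lt;b&gt;` would turn into markup and be stripped as well.
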
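More importantly, PyQuery and selectolax support something very specific that matters for my use case: before extracting the text, I need to remove certain tags together with their content. For example: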

<h5 class="warning">This should get stripped.</h5>
<p>Please keep.</p>
<div style="display:none">This should also get stripped.</div>

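A tag-stripping regular expression will never be able to do that, because it removes only the tags and keeps everything between them.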
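
A short sketch makes the limitation concrete. It applies the same tag-stripping pattern to the fragment above; the `style` attribute on the `div` is assumed here.

```python
import re

_TAG_RE = re.compile(r'<.*?>')

fragment = (
    '<h5 class="warning">This should get stripped.</h5>'
    '<p>Please keep.</p>'
    '<div style="display:none">This should also get stripped.</div>'
)

# Only the tags are removed; the text inside the unwanted elements survives.
text = _TAG_RE.sub('', fragment)
print(text)
# This should get stripped.Please keep.This should also get stripped.
```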

Version 2.0

My requirements may change, but basically I want to remove certain tags, for example <div class="warning">, <div class="hidden"> and <div style="display:none">. So let's implement that:

PyQuery

import re

from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
# Remove the elements that can be matched by class.
doc.remove('div.warning, div.hidden')
# Remove divs that are hidden through an inline style.
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax

import re

from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
# Remove the elements that can be matched by class.
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
# Remove divs that are hidden through an inline style.
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I run the same benchmark on the 10,000 fragments again, the new results are:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

