Beautiful Soup

article 2026/4/9 6:50:14

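What is Beautiful Soup

The official site recommends that current projects be developed with Beautiful Soup 4, abbreviated bs4. bs4 is an HTML/XML parser whose main job is to parse HTML/XML documents and extract data from them. Besides CSS selectors, bs4 supports the HTML parser in the Python standard library as well as the HTML and XML parsers provided by lxml.

bs4 converts a complex HTML document into a tree structure (the HTML DOM). Every node in this tree is a Python object, and these objects fall into the following four types:

| Class | Description |
| --- | --- |
| bs4.element.Tag | An HTML tag, the most basic unit for organizing information. |
| bs4.element.NavigableString | The text inside an HTML tag (a non-attribute string). |
| bs4.BeautifulSoup | The entire content of the HTML DOM. It supports most of the methods for traversing and searching the document tree. |
| bs4.element.Comment | The comment portion of a tag's content. It is a special kind of NavigableString. |

General workflow for using bs4

Install the required libraries

```
pip install lxml -i https://mirrors.aliyun.com/pypi/simple/
pip install bs4 -i https://mirrors.aliyun.com/pypi/simple/
```

Parsing

Parsing starts by creating an object of the BeautifulSoup class:

```python
def __init__(self, markup="", features=None, builder=None, parse_only=None,
             from_encoding=None, exclude_encodings=None, **kwargs)
```

- markup: the document to parse, given as a string or a file object.
- features: the name of the parser.
- builder: the parser (tree builder) to use.
- from_encoding: the encoding of the document.
- exclude_encodings: encodings to rule out.

For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")
```

Parsers supported by bs4

bs4 currently supports the Python standard library parser, lxml, and html5lib. Their strengths and weaknesses compare as follows.

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | 1. Built into Python. 2. Moderate speed. 3. Tolerant of malformed documents. | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2. |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | 1. Fast. 2. Tolerant of malformed documents. | Requires a C library to be installed. |
| lxml XML parser | BeautifulSoup(markup, "lxml-xml") | 1. Fast. 2. The only parser that supports XML. | Requires a C library to be installed. |
| html5lib | BeautifulSoup(markup, "html5lib") | 1. The best tolerance of malformed documents. 2. Parses documents the way a browser does. 3. Produces HTML5-format documents. | 1. Slow. 2. Depends on an external Python package (html5lib). |

If no parser is specified when the BeautifulSoup object is created, BeautifulSoup chooses one automatically from the libraries installed on the system, in this order of preference: lxml > html5lib > Python standard library.

The HTML string to parse

The examples below read this markup from test.html.

```html
<html>
<head>
    <title>Test data</title>
    <p>
        <title>Nested title</title>
    </p>
</head>
<body>
<p id="link"></p>
<p id="u1">
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/elsie" class="sister" id="link2">Elsie</a>
    <a href="http://example.com/elsie" class="sister" id="link3">Elsie</a>
    <a href="http://example.com/elsie" class="sister" id="link4">Elsie</a>
    <b>
        <a href="http://tieba.baidu.com" class="Lacie" id="link5">Elsie</a>
    </b>
</p>
<div class="bri">
    <p class="bri"></p>
    <a class="bri">Lacie</a>
    <span>
```

(The rest of the listing, including the closing tags, is missing.)

Searching with the find_all API

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

# Create a Path object for the test document
file_path = Path("test.html")
soup = BeautifulSoup(file_path.read_text(encoding="utf-8"), "html.parser")

print("----", "print the document tree", "----")
print(soup)
print("----", "print the formatted document tree", "----")
print(soup.prettify())

# find_all(self, name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
# (older releases name the string parameter text)

# Pass a string: match tags by name
print("----", "pass a string", "----")
result_set = soup.find_all("b")
print(result_set)

# Pass a regular expression: match every tag whose name starts with "b"
print("----", "pass a regular expression", "----")
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# Pass a list: match any of the listed tag names
print("----", "pass a list", "----")
r = soup.find_all(["a", "b"])
print(r)

# Pass an attribute as a keyword argument
print("----", "pass an attribute", "----")
r = soup.find_all(id="link1")
print(r)

# Pass several attributes at once
print("----", "pass several attributes", "----")
r = soup.find_all(href=re.compile("elsie"), id="link1")
print(r)

# Attribute names that are not valid Python identifiers cannot be keyword arguments
print("----", "special attribute names", "----")
# r = soup.find_all(data-foo="pig")  # SyntaxError
r = soup.find_all(attrs={"data-foo": "pig"})
print(r)

# Pass the class attribute (class is a reserved word, so use class_)
print("----", "pass the class attribute", "----")
r = soup.find_all("a", class_="sister")
print(r)

# Pass text
print("----", "pass text", "----")
r = soup.find_all(string="Elsie")
print(r)

# Pass several text values
print("----", "pass several text values", "----")
r = soup.find_all(string=["Tillie", "Elsie", "Lacie"])
print(r)

# limit parameter: stop after the given number of matches
print("----", "limit parameter", "----")
r = soup.find_all("a", limit=2)
print(r)

# recursive parameter: to search only the direct children of the current node,
# pass recursive=False.
print("----", "recursive=True", "----")
r = soup.html.find_all("title")
print(r)
print("----", "recursive=False", "----")
r = soup.html.find_all("title", recursive=False)
print(r)
```

Searching with CSS selectors

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Create a Path object for the test document
file_path = Path("test.html")
bs = BeautifulSoup(file_path.read_text(encoding="utf-8"), "html.parser")
# print(bs.prettify())

# CSS selectors
# 1. Search by tag name
print("----", "search by tag a", "----")
print(len(bs.select("a")))
print("----", "search by the descendant combinator p a", "----")
print(len(bs.select("p a")))
print("----", "search by the child combinator p > a", "----")
print(len(bs.select("p > a")))

# 2. Search by class name
print("----", "search by class name sister", "----")
print(len(bs.select(".sister")))

# 3. Search by id
print("----", "search by id u1", "----")
print(bs.select("#u1"))

# 4. Combined search
print("----", "combined search", "----")
print(bs.select("div .bri"))

# 5. Search by attribute
print("----", "search by the class attribute", "----")
print(bs.select('a[class~="bri"]'))
print("----", "search by the href attribute", "----")
print(bs.select('a[href="http://tieba.baidu.com"]'))

# 6. Get the text content
print("----", "get the text content", "----")
print(bs.select("title")[0].get_text())
```

Comparing find_all() and select()

| Purpose | find_all() | select() |
| --- | --- | --- |
| Select all a tags | soup.find_all("a") | soup.select("a") |
| Select tags whose class attribute is footer | soup.find_all(class_="footer") | soup.select(".footer") |
| Select p tags whose id attribute is link | soup.find_all("p", id="link") | soup.select('p[id="link"]') |
| Select title tags under the head tag | soup.head.find_all("title") (a grandchild title is also found) | soup.select("head title") |
| Select a tags whose href attribute is "www.baidu.com" | soup.find_all("a", href="www.baidu.com") | soup.select('a[href="www.baidu.com"]') |

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Create a Path object for the test document
file_path = Path("test.html")
soup = BeautifulSoup(file_path.read_text(encoding="utf-8"), "html.parser")

# p tags with id="link", first with find_all(), then with select()
r = soup.find_all("p", id="link")
print(r)
r = soup.select('p[id="link"]')
print(r)

# title tags under head, first with find_all(), then with select()
r = soup.head.find_all("title")
print(r)
r = soup.select("head title")
print(r)
```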
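The four node types listed at the start of the article (Tag, NavigableString, BeautifulSoup, Comment) can be inspected directly. The following is a minimal sketch, assuming bs4 is installed. The html_doc string is a small hypothetical snippet made up for this illustration, not the article's test.html, and the node_types dictionary is just a helper name used here.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup for illustration only (not the article's test.html)
html_doc = (
    '<p id="u1">'
    '<a class="sister" href="http://example.com/elsie">Elsie</a>'
    "<!-- a comment -->"
    "</p>"
)

soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a                 # the first <a> element: a Tag
text = link.string            # the text inside <a>: a NavigableString
comment = soup.p.contents[1]  # the second child of <p>: a Comment

# Collect the class name of each kind of node
node_types = {
    "soup": type(soup).__name__,
    "link": type(link).__name__,
    "text": type(text).__name__,
    "comment": type(comment).__name__,
}
for name, type_name in node_types.items():
    print(name, type_name)

# Comment is a subclass of NavigableString
print(isinstance(comment, NavigableString))
```

Because Comment subclasses NavigableString, code that collects text with an isinstance check against NavigableString will also pick up comments. Checking for Comment first is one way to filter them out.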
