
Scraping Web Content with Jsoup in Java, Parsing the HTML, and Exporting to an Excel File with EasyExcel

news 2026/2/11 0:10:36

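Scenario

Scraping a few thousand love quotes with Python, requests, and BeautifulSoup:

Scraping a Few Thousand Love Quotes with Python, requests, and BeautifulSoup (CSDN blog)

Scraping HTML pages with the html node in Node-RED, using the latest Node-RED version number as the example:

Scraping HTML Pages with the html Node in Node-RED: Fetching the Latest Node-RED Version (CSDN blog)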

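Jsoup

jsoup is a Java HTML parser (it can also parse XML). It can parse a URL directly or parse HTML text supplied as a string.

It provides an easy-to-use API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

jsoup uses DOM-style parsing: the whole HTML (or XML) document is loaded into memory as a DOM tree, and the result is a Document object.

Each HTML tag becomes an Element object.

Official site:

jsoup: Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety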

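EasyExcel

The best-known Java frameworks for reading and writing Excel are Apache POI and jxl, but both consume a great deal of memory.

POI has a SAX-mode API that mitigates some out-of-memory problems, yet it still has shortcomings: for example, decompressing an Excel 2007 (.xlsx) file and storing the decompressed data both happen in memory, so memory consumption remains high.

EasyExcel rewrites POI's parsing of the 2007 format. A 3 MB Excel file that needs about 100 MB of memory with POI's SAX parser needs only a few MB with EasyExcel, and even very large files do not cause out-of-memory errors.

For the 2003 format, EasyExcel relies on POI's SAX mode and wraps it in a model-conversion layer that makes it simpler to use.

Official site:

About EasyExcel | Easy Excel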

Note:

Blog:
https://blog.csdn.net/badao_liumang_qizhi

Implementation

1. Add the dependencies

        <!-- jsoup is a Java library for parsing HTML and XML documents -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <!-- EasyExcel is a fast, simple Java tool for processing Excel files without running out of memory on large files -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>easyexcel</artifactId>
            <version>3.0.5</version>
        </dependency>

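2. Find the page content to scrape

The following page is used as the example:

2023 Fortune Global 500 company list: 2023 Global 500 rankings at a glance → Maigoo (买购网)

The goal is to get the Global 500 ranking data. A single page load returns only 100 rows, so this example parses only the first 100. To get the rest, fetch the remaining pages one by one according to the site's pagination request rules.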
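The page-by-page idea can be sketched as a small loop that keeps asking for the next page until one comes back empty. This is only a sketch: the site's actual pagination rule is not described here, so `fetchPage` is a hypothetical callback you would implement yourself, and `PagedCollector`, `collectAll`, and `maxPages` are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper: collects rows page by page until a page comes back empty.
final class PagedCollector {
    private PagedCollector() {}

    /**
     * @param fetchPage hypothetical callback that returns the rows of one page (page numbers start at 1)
     * @param maxPages  hard upper bound so that a misbehaving site cannot cause an endless loop
     */
    static <T> List<T> collectAll(IntFunction<List<T>> fetchPage, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<T> rows = fetchPage.apply(page);
            if (rows == null || rows.isEmpty()) {
                break; // an empty page is treated as "no more data"
            }
            all.addAll(rows);
        }
        return all;
    }
}
```

With 100 rows per page, a call such as `PagedCollector.collectAll(page -> parsePage(page), 5)` would cover all 500 entries, where `parsePage` stands for whatever request-and-parse code matches the site's real paging rule.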

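Open the browser developer tools (F12) and locate the DOM structure that holds the data.

The data we want is in the td cells of the tr rows of the table inside the 22nd child element (index 21) of the div whose id is t_container.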

3. Write test code that connects to the page and parses the HTML elements

        String url = "https://www.maigoo.com/news/3jcNODk3.html";
        try {
            // Fetch the URL and parse the response into a Document
            Document document = Jsoup.connect(url)
                    .ignoreContentType(true)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
                    .timeout(30000)
                    .header("referer", "https://www.maigoo.com")
                    .get();
            Elements select = document.select("#t_container > div:eq(21) table tr");
        } catch (IOException e) {
            e.printStackTrace();
        }

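Note the selector syntax used here:

#t_container matches the element whose id is t_container

> selects direct children of the element on its left

div:eq(21) matches a div whose index among its parent's child elements is 21, that is, the 22nd child element (indexes start at 0)

table tr matches tr elements anywhere inside a table element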
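The index rule is easy to get wrong, so it can help to try a selector offline before pointing it at the real page. The sketch below parses a small HTML string with `Jsoup.parse` and counts the matches; the fragment, the id `box`, and the class name `EqSelectorDemo` are made up for illustration and are not taken from the target site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Offline check of how ":eq(n)" counts siblings; the HTML fragment is invented for this demo.
final class EqSelectorDemo {
    // The first child of #box is a <p>, so the first <div> has sibling index 1, not 0.
    static final String HTML =
            "<div id=\"box\">"
          + "<p>intro</p>"
          + "<div><table><tr><th>Rank</th></tr><tr><td>1</td></tr></table></div>"
          + "<div>footer</div>"
          + "</div>";

    // Parses the fragment and returns how many elements the CSS query matches.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // The element at index 0 is the <p>, so no div matches here.
        System.out.println(count("#box > div:eq(0)"));
        // The div at index 1 holds the table with a header row and a data row.
        System.out.println(count("#box > div:eq(1) table tr"));
    }
}
```

The point of the demo is that `:eq(n)` counts every child element of the parent, not only the div children, which is why the index has to be read off the real DOM in the developer tools.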

For more selector syntax, see:

Use CSS selectors to find elements: jsoup Java HTML parser

Selector overview

tagname: find elements by tag, e.g. div
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with attribute, e.g. [href]
[^attrPrefix]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
[attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements, e.g. *
ns|tag: find elements by tag in a namespace prefix, e.g. fb|name finds <fb:name> elements
*|tag: find elements by tag in any namespace prefix, e.g. *|name finds <fb:name> and <name> elements

Selector combinations

el#id: elements with ID, e.g. div#logo
el.class: elements with class, e.g. div.masthead
el[attr]: elements with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo

Pseudo selectors

:has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
:is(selector): find elements that match any of the selectors in the selector list; e.g. :is(h1, h2, h3, h4, h5, h6) finds any heading element
:not(selector): find elements that do not match the selector; e.g. div:not(.logo)
:contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the specified regular expression
:lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
:gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
:eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc

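Besides CSS selectors, jsoup also supports XPath selectors. XPath support (selectXpath) was added in jsoup 1.14.3, so it requires a newer release than the 1.11.3 declared in step 1.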

Use XPath selectors to find elements and nodes: jsoup Java HTML parser

4. Parse the DOM data, map it to objects, and add them to a list

Create an entity class and add the Excel annotations:

import com.alibaba.excel.annotation.ExcelProperty;
import lombok.Builder;
import lombok.Data;

import java.io.Serializable;

@Data
@Builder
public class WealthEntity implements Serializable {

    private static final long serialVersionUID = -1760099890427975758L;

    // Column headers are kept in Chinese so the exported sheet matches the source page
    @ExcelProperty(value = "排名", index = 0)
    private Integer index;

    @ExcelProperty(value = "公司名称", index = 1)
    private String companyName;

    @ExcelProperty(value = "收入", index = 2)
    private String income;

    @ExcelProperty(value = "利润", index = 3)
    private String profit;
}

Parse the DOM and add each row to the list:

            Elements select = document.select("#t_container > div:eq(21) table tr");
            List<WealthEntity> list = new ArrayList<>();
            // Start at 1 to skip the header row
            for (int i = 1; i < select.size(); i++) {
                Element tr = select.get(i);
                Elements tds = tr.select("td");
                Integer index = Integer.valueOf(tds.get(0).text());
                String companyName = tds.get(1).text();
                String income = tds.get(2).text();
                String profit = tds.get(3).text();
                WealthEntity wealthEntity = WealthEntity.builder()
                        .index(index)
                        .companyName(companyName)
                        .income(income)
                        .profit(profit)
                        .build();
                list.add(wealthEntity);
            }
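The loop above assumes every row after the first has four cells and a numeric rank; a row with fewer cells would throw IndexOutOfBoundsException and a non-numeric first cell would make `Integer.valueOf` throw NumberFormatException. A minimal sketch of a guard, assuming each row has already been reduced to a list of its cell texts (`RowFilter`, `isDataRow`, and `keepDataRows` are hypothetical names, not part of jsoup or EasyExcel):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides whether a table row (given as its cell texts) looks like a data row.
final class RowFilter {
    private RowFilter() {}

    /** Returns true when the row has at least four cells and its first cell consists only of digits. */
    static boolean isDataRow(List<String> cells) {
        if (cells == null || cells.size() < 4) {
            return false;
        }
        String rank = cells.get(0).trim();
        return !rank.isEmpty() && rank.chars().allMatch(Character::isDigit);
    }

    /** Keeps only the rows that pass isDataRow, preserving their order. */
    static List<List<String>> keepDataRows(List<List<String>> rows) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> row : rows) {
            if (isDataRow(row)) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

In the scraping loop, the cell texts would come from calling `text()` on each element of `tr.select("td")`, and rows for which `isDataRow` returns false would simply be skipped with `continue`.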

5. Export to Excel

            String fileName = "D:/2023财富世界100强.xlsx";
            // Write the list to one sheet; the header row comes from the @ExcelProperty annotations
            EasyExcel.write(fileName, WealthEntity.class).sheet("100强").doWrite(list);
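The hard-coded `D:/` path only works on a Windows machine that has a D: drive. A minimal sketch of building the output path in a portable way, using only the JDK (`OutputPaths` and `prepare` are hypothetical names introduced for this example):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: makes sure the target directory exists and returns the full output path.
final class OutputPaths {
    private OutputPaths() {}

    /** Creates dir (and any missing parents) if needed, then returns dir/fileName as an absolute path. */
    static Path prepare(Path dir, String fileName) {
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create output directory " + dir, e);
        }
        return dir.resolve(fileName).toAbsolutePath();
    }
}
```

For example, `OutputPaths.prepare(Paths.get(System.getProperty("user.home"), "exports"), "2023财富世界100强.xlsx").toString()` yields a string that can be passed as `fileName` to the `EasyExcel.write` call; the `exports` directory name is an arbitrary choice.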

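6. Complete example code

        import com.alibaba.excel.EasyExcel;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;
        import org.jsoup.select.Elements;

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        // The class name is chosen for illustration; the original shows only the method body
        public class WealthCrawler {
            public static void main(String[] args) {
                String url = "https://www.maigoo.com/news/3jcNODk3.html";
                try {
                    // Fetch the URL and parse the response into a Document
                    Document document = Jsoup.connect(url)
                            .ignoreContentType(true)
                            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
                            .timeout(30000)
                            .header("referer", "https://www.maigoo.com")
                            .get();
                    Elements select = document.select("#t_container > div:eq(21) table tr");
                    List<WealthEntity> list = new ArrayList<>();
                    // Start at 1 to skip the header row
                    for (int i = 1; i < select.size(); i++) {
                        Element tr = select.get(i);
                        Elements tds = tr.select("td");
                        Integer index = Integer.valueOf(tds.get(0).text());
                        String companyName = tds.get(1).text();
                        String income = tds.get(2).text();
                        String profit = tds.get(3).text();
                        WealthEntity wealthEntity = WealthEntity.builder()
                                .index(index)
                                .companyName(companyName)
                                .income(income)
                                .profit(profit)
                                .build();
                        list.add(wealthEntity);
                    }
                    String fileName = "D:/2023财富世界100强.xlsx";
                    EasyExcel.write(fileName, WealthEntity.class).sheet("100强").doWrite(list);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }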

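7. Result

Running the code writes the file D:/2023财富世界100强.xlsx, which contains one sheet named 100强 with the columns 排名 (rank), 公司名称 (company name), 收入 (income), and 利润 (profit).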
