• Java中使用Jsoup实现网页内容爬取与Html内容解析并使用EasyExcel实现导出为Excel文件


    场景

    Pythont通过request以及BeautifulSoup爬取几千条情话:

    Pythont通过request以及BeautifulSoup爬取几千条情话_爬取情话-CSDN博客

    Node-RED中使用html节点爬取HTML网页资料之爬取Node-RED的最新版本:

    Node-RED中使用html节点爬取HTML网页资料之爬取Node-RED的最新版本_node-red html-CSDN博客

    Jsoup

    Jsoup是一种Java 的HTML(html也是XML文档)解析器,可直接解析某个URL地址、HTML文本内容。

    它提供了一套易于操作的API,可通过DOM,CSS以及类似于jQuery选择器的操作方法来取出和操作数据。

    使用jsoup就可以解析HTML。

    Jsoup使用的是DOM解析方式,把整个HTML文档(XML文档)加载到内存中形成一棵DOM树,得到文档的Document对象。

    HTML里的标签,会转换成Element对象。

    官网地址:

    jsoup: Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety

    EasyExcel

    Java解析、生成Excel比较有名的框架有Apache poi、jxl。但他们都存在一个严重的问题就是非常的耗内存,

    poi有一套SAX模式的API可以一定程度的解决一些内存溢出的问题,但POI还是有一些缺陷,

    比如07版Excel解压缩以及解压后存储都是在内存中完成的,内存消耗依然很大。

    easyexcel重写了poi对07版Excel的解析,一个3M的excel用POI sax解析依然需要100M左右内存,

    改用easyexcel可以降低到几M,并且再大的excel也不会出现内存溢出;03版依赖POI的sax模式,

    在上层做了模型转换的封装,让使用者更加简单方便。

    官网地址:

    关于Easyexcel | Easy Excel

    注:

    博客:
    https://blog.csdn.net/badao_liumang_qizhi 

    实现

    1、引入依赖

    1.         <!--Jsoup 是一个用于解析HTML和XML文档的Java库-->
    2.         <dependency>
    3.             <groupId>org.jsoup</groupId>
    4.             <artifactId>jsoup</artifactId>
    5.             <version>1.11.3</version>
    6.         </dependency>
    7.         <!--EasyExcel是一个基于Java的、快速、简洁、解决大文件内存溢出的Excel处理工具-->
    8.         <dependency>
    9.             <groupId>com.alibaba</groupId>
    10.             <artifactId>easyexcel</artifactId>
    11.             <version>3.0.5</version>
    12.         </dependency>

    2、找到需要爬取的网页内容

    比如以下面为例

    2023财富世界500强企业榜单 2023全球500强企业 世界500强排名一览表→买购网

    这里要获取500强排名数据,因为单次刷新网页只能返回100条数据,所以只解析前100条。获取更多数据可根据其分页请求规则分别进行爬取。

    打开F12找到要爬取的数据的dom结构

    这里要获取到id为t_container的div元素大的第22个子元素(索引为21)的table元素的tr元素的td数据。

    3、编写测试代码,连接并解析html元素

    1. String url = "https://www.maigoo.com/news/3jcNODk3.html";
    2. try {
    3. //读取url,得到Document
    4. Document document = Jsoup.connect(url)
    5. .ignoreContentType(true)
    6. .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
    7. .timeout(30000)
    8. .header("referer","https://www.maigoo.com")
    9. .get();
    10. Elements select = document.select("#t_container > div:eq(21) table tr");
    11. } catch (IOException e) {
    12. e.printStackTrace();
    13. }

    注意这里使用选择器的语法:

    #t_container 代表id为t_container

    >代表找父元素下的子元素

    div:eq(21) 代表第22个元素

    table tr 代表table 标签下tr标签

    更多select选择器用法

    Use CSS selectors to find elements: jsoup Java HTML parser

    Selector overview

    • tagname: find elements by tag, e.g. div
    • #id: find elements by ID, e.g. #logo
    • .class: find elements by class name, e.g. .masthead
    • [attribute]: elements with attribute, e.g. [href]
    • [^attrPrefix]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
    • [attr=value]: elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
    • [attr^=value][attr$=value][attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
    • [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
    • *: all elements, e.g. *
    • ns|tag: find elements by tag in a namespace prefix, e.g. fb|name finds  elements
    • *|tag: final elements by tag in any namespace prefix, e.g. *|name finds  and  elements

    Selector combinations

    • el#id: elements with ID, e.g. div#logo
    • el.class: elements with class, e.g. div.masthead
    • el[attr]: elements with attribute, e.g. a[href]
    • Any combination, e.g. a[href].highlight
    • ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
    • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
    • siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
    • siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
    • el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo

    Pseudo selectors

    • :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
    • :is(selector): find elements that match any of the selectors in the selector list; e.g. :is(h1, h2, h3, h4, h5, h6) finds any heading element
    • :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
    • :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
    • :containsOwn(text): find elements that directly contain the given text
    • :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
    • :matchesOwn(regex): find elements whose own text matches the specified regular expression
    • :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
    • :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
    • :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
    • Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc

    除使用select选择器之外还可使用XPath选择器用法

    Use XPath selectors to find elements and nodes: jsoup Java HTML parser

    4、解析dom数据并赋值到对象添加到list

    新建实体对象,并添加excel注解

    1. import com.alibaba.excel.annotation.ExcelProperty;
    2. import lombok.Builder;
    3. import lombok.Data;
    4. import java.io.Serializable;
    5. @Data
    6. @Builder
    7. public class WealthEntity implements Serializable {
    8.     private static final long serialVersionUID = -1760099890427975758L;
    9.     @ExcelProperty(value = "排名",index = 0)
    10.     private Integer index;
    11.     @ExcelProperty(value = "公司名称",index = 1)
    12.     private String companyName;
    13.     @ExcelProperty(value = "收入",index = 2)
    14.     private String income;
    15.     @ExcelProperty(value = "利润",index = 3)
    16.     private String profit;
    17. }

    进行dom解析和添加到list

    1.             Elements select = document.select("#t_container > div:eq(21) table tr");
    2.             List<WealthEntity> list = new ArrayList<>();
    3.             for (int i = 1; i < select.size(); i++) {
    4.                 Element tr = select.get(i);
    5.                 Elements tds = tr.select("td");
    6.                 Integer index = Integer.valueOf(tds.get(0).text());
    7.                 String companyName = tds.get(1).text();
    8.                 String income = tds.get(2).text();
    9.                 String profit = tds.get(3).text();
    10.                 WealthEntity wealthEntity = WealthEntity.builder()
    11.                         .index(index)
    12.                         .companyName(companyName)
    13.                         .income(income)
    14.                         .profit(profit)
    15.                         .build();
    16.                 list.add(wealthEntity);
    17.             }

    5、导出为excel

    1.             String fileName = "D:/2023财富世界100强.xlsx";
    2.             EasyExcel.write(fileName,WealthEntity.class).sheet("100强").doWrite(list);

    6、完整示例代码

    1. String url = "https://www.maigoo.com/news/3jcNODk3.html";
    2. try {
    3. //读取url,得到Document
    4. Document document = Jsoup.connect(url)
    5. .ignoreContentType(true)
    6. .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
    7. .timeout(30000)
    8. .header("referer","https://www.maigoo.com")
    9. .get();
    10. Elements select = document.select("#t_container > div:eq(21) table tr");
    11. List<WealthEntity> list = new ArrayList<>();
    12. for (int i = 1; i < select.size(); i++) {
    13. Element tr = select.get(i);
    14. Elements tds = tr.select("td");
    15. Integer index = Integer.valueOf(tds.get(0).text());
    16. String companyName = tds.get(1).text();
    17. String income = tds.get(2).text();
    18. String profit = tds.get(3).text();
    19. WealthEntity wealthEntity = WealthEntity.builder()
    20. .index(index)
    21. .companyName(companyName)
    22. .income(income)
    23. .profit(profit)
    24. .build();
    25. list.add(wealthEntity);
    26. }
    27. String fileName = "D:/2023财富世界100强.xlsx";
    28. EasyExcel.write(fileName,WealthEntity.class).sheet("100强").doWrite(list);
    29. } catch (IOException e) {
    30. e.printStackTrace();
    31. }

    7、运行结果

  • 相关阅读:
    eslint错误修改之后依然报错
    Linux进程创建、进程终止、进程等待、进程程序替换
    R语言——R语言基础
    记录tomcat-9.0.65在apr模式下无法通过IP访问问题排查和处理
    408计算机网络知识点简记 (背诵用
    数说故事与中山大学人机物智能融合实验室正式达成战略合作
    C语言指针(习题篇)
    【每日一题Day47】LC1774最接近目标价格的甜点成本 | 回溯 哈希表
    国学---佛系算吉凶~
    【ML特征工程】第 1 章 :机器学习管道
  • 原文地址:https://blog.csdn.net/BADAO_LIUMANG_QIZHI/article/details/136341036