PHP——爬虫DOM解析

背景

php在爬取网页信息的时候，有一些函数可以使用。
这里介绍两个

DOMDocument
DOMXPath

代码解析



  
  
    Example
  
  
    
      Hello World!
      this is a test 中文
      Link
      Link
      Link
      Link
    
    
        Another paragraph
        Another link
        second empty div
        
            second div paragraph
        
    
  
';

function printDomNode($paragraph){
    echo "----\nNode: ".$paragraph->nodeValue . "\n";
    echo "all attr: \n";
    for ($i = 0; $i < $paragraph->attributes->length; $i++) {
        $attr = $paragraph->attributes->item($i);
        echo "\t".$attr->nodeName . ': ' . $attr->nodeValue . "\n";
    }
}

// 创建DOMDocument实例并加载HTML
$dom = new DOMDocument();
@$dom->loadHTML($html);

// 创建DOMXPath实例
$xpath = new DOMXPath($dom);


echo "----------------------------\n";
// 示例1：查找所有元素
$paragraphs = $xpath->query('//p');
foreach ($paragraphs as $paragraph) {
    echo "----\nNode----------: ".$paragraph->nodeValue . "\n";
    echo "id attr: ".$paragraph->getAttribute('id') . "\n";
    echo "all attr: \n";
    for ($i = 0; $i < $paragraph->attributes->length; $i++) {
        $attr = $paragraph->attributes->item($i);
        echo "\t".$attr->nodeName . ': ' . $attr->nodeValue . "\n";
    }
    echo "=foreach=\n";
    foreach ($paragraph->attributes as $attr) {
        echo "\t".$attr->name . ': ' . $attr->value . "\n";
        echo "\t".$attr->nodeName . ': ' . $attr->nodeValue . "\n";
    }

}

echo "----------------------------\n";
$paragraphs = $xpath->query('//p[@xx="test_custom_key"]');
foreach ($paragraphs as $paragraph) {
    printDomNode($paragraph);
}

echo "----------------------------\n";
// 示例2：查找包含特定文本的元素
$links = $xpath->query('//a[text()="Link"]');
//$links = $xpath->query('//*[text()="Link"]'); //不限制a标签，会找到所有值是Link的节点
foreach ($links as $link) {
    $herf = $link->getAttribute('href');
    echo "origin: ".$herf . "\n";
    echo "decode: ".urldecode($herf) . "\n";
}

//如果找到指定路径下面的节点
echo "----------------------------指定路径下的p节点\n";
$paragraphs = $xpath->query('//div[@id="custom_div_id"]//p[@id="xxx"]');
foreach ($paragraphs as $paragraph) {
    printDomNode($paragraph);
}

echo "----------------------------\n";
// 示例3：查找元素内的所有节点（这个会找到所有子节点，包括节点里面的节点）
$divChildren = $xpath->query('//div/*');
foreach ($divChildren as $child) {
//    echo $child->nodeName . ": " . $child->nodeValue . "\n";
    printDomNode($paragraph);
}

?>

示例：比如获取一个 html文档中的p标签

步骤
- 获取网页html
  - 这里省略了请求url。如果需要从url获取html：$html = file_get_contents($url);
- 将html文件构建成DOM树结构
- 使用DOMXPath类来查找指定的元素节点
  - 构建DOMXPath类实例
  - 使用query函数查询
    - 如果找所有p标签，那么会遍历整个树，把所有的p标签找出来
    - 如何找指定属性的p标签：p[@xx="test_custom_key"]
      - 这里的xx是自定义的属性
      - 这里的@表示选择节点的属性
    - 如何找到指定路径下的节点
      - //div[@id="custom_div_id"]//p[@id="xxx"]

相关阅读:
阿里妈妈返利比率的商品搜索API接口
SwiftUI Swift CoreData 计算某实体某属性总和
c 实用化的摄像头生成avi视频程序（加入精确的时间控制和随时停止录像功能）
Spark.示例
速卖通跨境智星靠谱吗？还有其他隐藏费用吗？
重温历史：Palm OS经典游戏于发布20年后公开源代码
Java中的SPI原理浅谈
想做游戏翻译员，需要具备什么能力呢?
UVa11324 - The Largest Clique
【图解RabbitMQ-4】Docker安装RabbitMQ详细图文过程

原文地址：https://blog.csdn.net/u014704998/article/details/139880082