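PHP offers several built-in tools for scraping information from web pages.
This post introduces two of them: the DOMDocument and DOMXPath classes.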
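<?php
// Sample HTML for the queries below. This markup is an assumed stand-in, chosen to match those queries.
$html = '<html><head><title>Example</title></head><body>
<p id="first" xx="test_custom_key">First paragraph</p>
<a href="https://example.com/?q=hello%20world">Link</a>
<div id="custom_div_id"><p id="xxx">Paragraph inside the div</p></div>
</body></html>';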
function printDomNode($paragraph){
echo "----\nNode: ".$paragraph->nodeValue . "\n";
echo "all attr: \n";
for ($i = 0; $i < $paragraph->attributes->length; $i++) {
$attr = $paragraph->attributes->item($i);
echo "\t".$attr->nodeName . ': ' . $attr->nodeValue . "\n";
}
}
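// Create a DOMDocument instance and load the HTML
$dom = new DOMDocument();
@$dom->loadHTML($html); // "@" suppresses the warnings that malformed HTML would trigger
// Create a DOMXPath instance
$xpath = new DOMXPath($dom);
echo "----------------------------\n";
// Example 1: find all p elements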
$paragraphs = $xpath->query('//p');
foreach ($paragraphs as $paragraph) {
echo "----\nNode----------: ".$paragraph->nodeValue . "\n";
echo "id attr: ".$paragraph->getAttribute('id') . "\n";
echo "all attr: \n";
for ($i = 0; $i < $paragraph->attributes->length; $i++) {
$attr = $paragraph->attributes->item($i);
echo "\t".$attr->nodeName . ': ' . $attr->nodeValue . "\n";
}
echo "=foreach=\n";
foreach ($paragraph->attributes as $attr) {
echo "\t".$attr->name . ': ' . $attr->value . "\n";
echo "\t".$attr->nodeName . ': ' . $attr->nodeValue . "\n";
}
}
echo "----------------------------\n";
$paragraphs = $xpath->query('//p[@xx="test_custom_key"]');
foreach ($paragraphs as $paragraph) {
printDomNode($paragraph);
}
echo "----------------------------\n";
// Example 2: find a elements whose text is exactly "Link"
$links = $xpath->query('//a[text()="Link"]');
//$links = $xpath->query('//*[text()="Link"]'); // not limited to a tags: matches every element whose text is "Link"
foreach ($links as $link) {
$href = $link->getAttribute('href');
echo "origin: ".$href . "\n";
echo "decode: ".urldecode($href) . "\n";
}
// Find nodes under a specific path
echo "----------------------------p nodes under the given path\n";
$paragraphs = $xpath->query('//div[@id="custom_div_id"]//p[@id="xxx"]');
foreach ($paragraphs as $paragraph) {
printDomNode($paragraph);
}
echo "----------------------------\n";
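// Example 3: find the child elements of every div. Note that /* selects only the direct
// child elements of each div; use //div//* to include all nested descendants as well.
$divChildren = $xpath->query('//div/*');
foreach ($divChildren as $child) {
// echo $child->nodeName . ": " . $child->nodeValue . "\n";
printDomNode($child);
}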
?>
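The listing above silences parser warnings with the `@` operator. As an alternative, the sketch below buffers libxml's warnings so they can be inspected afterwards. The helper name `loadHtmlCollectingErrors` and the sample markup are assumptions made for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: parse HTML and return the document together with
// any libxml warnings, instead of hiding them with "@".
function loadHtmlCollectingErrors(string $html): array {
    $previous = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $errors = libxml_get_errors();                // array of LibXMLError objects
    libxml_clear_errors();                        // empty the buffer for the next parse
    libxml_use_internal_errors($previous);        // restore the previous setting
    return [$dom, $errors];
}

// Assumed sample markup.
[$dom, $errors] = loadHtmlCollectingErrors('<div><p id="a">hello</p></div>');
foreach ($errors as $error) {
    echo trim($error->message) . " (line {$error->line})\n";
}
$xpath = new DOMXPath($dom);
echo $xpath->query('//p')->length . "\n";
```

Restoring the previous libxml setting keeps the helper from changing error handling for the rest of the script.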
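Example: getting the p tags from an HTML document
- Steps
  - Get the page's HTML
    - The code above skips the URL request. To fetch the HTML from a URL:
      $html = file_get_contents($url);
  - Build the HTML into a DOM tree (DOMDocument::loadHTML)
  - Use the DOMXPath class to find the target element nodes
    - Create a DOMXPath instance
    - Search with the query() method
      - Querying for all p tags (//p) walks the whole tree and returns every p element
      - To find p tags with a specific attribute:
        //p[@xx="test_custom_key"]
        - xx here is a custom attribute
        - @ selects an attribute of the node
      - To find nodes under a specific path:
        //div[@id="custom_div_id"]//p[@id="xxx"]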
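
The steps above can be sketched as one small function. This is a minimal sketch, not a definitive implementation: the helper name `queryTexts` and the inline sample HTML are assumptions made for illustration, and the network request is left as a comment so the example runs offline.

```php
<?php
// Hypothetical helper: return the trimmed text of every node matched by an XPath expression.
function queryTexts(string $html, string $expression): array {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);               // build the DOM tree
    $xpath = new DOMXPath($dom);          // create the DOMXPath instance
    $nodes = $xpath->query($expression);  // run the query
    $texts = [];
    if ($nodes !== false) {               // query() returns false for a malformed expression
        foreach ($nodes as $node) {
            $texts[] = trim($node->nodeValue);
        }
    }
    return $texts;
}

// To fetch a live page instead: $html = file_get_contents($url);
// file_get_contents() returns false on failure, so check the result before parsing.
// Assumed sample markup:
$html = '<div id="custom_div_id"><p id="xxx">inside</p></div>'
      . '<p xx="test_custom_key">custom</p><p>plain</p>';

print_r(queryTexts($html, '//p'));                                       // every p element
print_r(queryTexts($html, '//p[@xx="test_custom_key"]'));                // p with the custom attribute
print_r(queryTexts($html, '//div[@id="custom_div_id"]//p[@id="xxx"]'));  // p under a specific div
```

Returning plain strings keeps the example short; code that needs attributes as well would return the DOMElement objects instead.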