News list URL: https://www.gov.cn/yaowen/liebiao/home.htm
Next page URL: https://www.gov.cn/yaowen/liebiao/home_1.htm
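The two URLs above suggest a simple paging pattern. A minimal sketch of building page URLs from it follows; the helper name `page_url` and the constant `LIST_DIR` are hypothetical, and the assumption that later pages continue as `home_2.htm`, `home_3.htm`, and so on is inferred from the two URLs shown, not confirmed by the site.

```python
# Directory that holds the paginated news list (taken from the URLs above).
LIST_DIR = "https://www.gov.cn/yaowen/liebiao/"


def page_url(index):
    """Return the list-page URL for a zero-based page index.

    Index 0 is home.htm; index n > 0 is assumed to be home_n.htm.
    """
    if index < 0:
        raise ValueError("page index must be non-negative")
    filename = "home.htm" if index == 0 else f"home_{index}.htm"
    return LIST_DIR + filename


if __name__ == "__main__":
    for i in range(3):
        print(page_url(i))
```

Keeping the pattern in one function means the first page does not need a special case at every call site.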
- import requests
- from lxml import etree
-
- def download_xinwen():
-     basic_url = 'https://www.gov.cn/yaowen/liebiao/home.htm'
-     headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'}
-     for num in range(5):
-         print(f"Downloading: page {num + 1}")
-         # The first page is home.htm; later pages are home_1.htm, home_2.htm, ...
-         new_url = basic_url if num == 0 else 'https://www.gov.cn/yaowen/liebiao/home_{}.htm'.format(num)
-         response = requests.get(new_url, headers=headers)
-         response.encoding = 'utf8'
-         html = etree.HTML(response.text)
-         # Use the same path for titles and links so the two lists pair up in zip().
-         xinwen_info = html.xpath('//div//li//h4/a/text()')
-         xinwen_lianjie = html.xpath('//div//li//h4/a/@href')
-         for describe, download_url in zip(xinwen_info, xinwen_lianjie):
-             print("Title:", describe, "|", "Link:", download_url)
-
- download_xinwen()
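As the output shows, some of the scraped links are not full URLs. To fix this, add the following code.
Inside the inner for loop, add a check: if "https" does not appear in the link, the link is a relative path. Remove every "./" from it and prepend the base URL of the list directory, which yields the full URL.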
- if "https" not in download_url:
-     # Relative link: drop "./" and prepend the list directory URL.
-     new_str = download_url.replace("./", "")
-     download_url = "https://www.gov.cn/yaowen/liebiao/" + new_str
-
- print("Title:", describe, "|", "Link:", download_url)
All the links are now full URLs.
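The string replacement above works for links that start with "./", but it would mangle a link such as "../x.htm" (the "./" inside "../" is removed as well). A possible alternative, sketched here with the standard library's `urllib.parse.urljoin`, resolves any relative link against the page it was scraped from. The helper name `to_absolute` and the sample paths in the usage section are made up for illustration and are not real article links.

```python
from urllib.parse import urljoin

# Assumed base: the list page the links were scraped from.
LIST_PAGE_URL = "https://www.gov.cn/yaowen/liebiao/home.htm"


def to_absolute(href, base=LIST_PAGE_URL):
    """Resolve a possibly relative href against the list page URL.

    Links that are already absolute are returned unchanged.
    """
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical sample links, for illustration only.
    samples = [
        "./202403/content_0000001.htm",
        "../zhengce/content_0000002.htm",
        "https://www.gov.cn/example/content_0000003.htm",
    ]
    for href in samples:
        print(href, "->", to_absolute(href))
```

In the scraper, this would stand in for the `if "https" not in download_url` branch: calling `to_absolute(download_url)` for every link covers both the relative and the absolute case.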