• Crawler: fetching page source


     When fetching a page's source from a URL, we generally run into the following cases.

    1. Direct fetch

    The page is static, so a plain HTTP request is enough:

    public static string HttpPost(string url, string paraJsonStr)
    {
        // WebClient is enough for static pages: the full HTML arrives in the response.
        WebClient webClient = new WebClient();
        webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
        // Encode the payload and POST it to the target URL.
        byte[] postData = System.Text.Encoding.UTF8.GetBytes(paraJsonStr);
        byte[] responseData = webClient.UploadData(url, "POST", postData);
        // Decode the raw response bytes as UTF-8 text: this is the page source.
        string returnStr = System.Text.Encoding.UTF8.GetString(responseData);
        return returnStr;
    }

    to retrieve the page source.
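
    A quick usage sketch (the URL and payload below are placeholders, not from the original):

    // Hypothetical call: POST an empty payload to a static page and print the HTML.
    string html = HttpPost("https://example.com/page", "");
    Console.WriteLine(html);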

    2. The method above fails to fetch the page and returns "Request unsuccessful"

    Request unsuccessful. Incapsula incident ID: 89300071...

    This usually means the site has anti-crawler protection (the Incapsula incident ID above is one such bot-detection response).

    Solutions:

    2.1. Add information to the request headers

    You usually need to set a "User-Agent" value; if that alone does not work, add a few more header fields and values.

    Commonly used "User-Agent" strings:

    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39",

        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54"

    For example:

    public static string HttpPost(string url, string paraJsonStr = "")
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // User-Agent is a restricted header in .NET: it must be set through the
        // UserAgent property; writing it into Headers would throw an exception.
        request.UserAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Mobile Safari/537.36 Edg/104.0.1293.54";
        // Some anti-bot layers (e.g. Incapsula) also expect their cookie back.
        request.Headers["Cookie"] = "visid_incap_161858=BlNpjA+qS9ucNza1";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream responseStream = response.GetResponseStream();
        StreamReader streamReader = new StreamReader(responseStream, Encoding.UTF8);
        string str = streamReader.ReadToEnd();
        return str;
    }
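
    If a single hard-coded User-Agent keeps getting blocked, rotating through the list above can help. A minimal sketch (the helper name and the idea of random rotation are illustrative assumptions, not from the original; only two of the listed strings are shown):

    // Hypothetical helper: rotate through the User-Agent strings listed above
    // so successive requests do not all present the same client fingerprint.
    static readonly string[] userAgents =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54"
        // ... add the rest of the list as needed
    };
    static readonly Random rng = new Random();

    static string RandomUserAgent() => userAgents[rng.Next(userAgents.Length)];

    Then set request.UserAgent = RandomUserAgent(); in the method above instead of the hard-coded string.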

    2.2. Add a time interval between page visits
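
    A minimal throttling sketch, reusing the HttpPost method from 2.1; the CrawlWithDelay name and the 2-5 second range are assumptions for illustration:

    // Hypothetical crawl loop: sleep a random 2-5 seconds between requests so
    // the traffic pattern looks less like an automated burst.
    static void CrawlWithDelay(string[] urls)
    {
        Random rng = new Random();
        foreach (string url in urls)
        {
            string html = HttpPost(url);             // HttpPost from 2.1 above
            // ... parse or store html here ...
            System.Threading.Thread.Sleep(rng.Next(2000, 5000));
        }
    }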

    3. The fetched page source is incomplete: only the basic skeleton comes back

    The Response does not contain the complete page source and differs from what the browser's Elements panel shows, because the content is rendered dynamically by JavaScript. In this case, use Selenium to drive a real browser:

    import time
    import csv
    from selenium import webdriver
    # from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.edge.options import Options

    # Path to the browser driver: download msedgedriver.exe and point to it here.
    driver = webdriver.Edge(executable_path="C:\\...\\msedgedriver.exe")
    # Maximize the window while loading.
    driver.maximize_window()
    # url is the address of the specific page to scrape.
    driver.get(url)
    # The sleep gives the page time to fetch its dynamic content; adjust it
    # depending on the speed of the network.
    time.sleep(10)
    # Scroll the page down to trigger lazily loaded content.
    driver.execute_script("window.scrollBy(0,{})".format(1500))
    time.sleep(2)
    # The page has a button; click it to expose the whole article.
    # Use the XPath of the button on your specific page. find_elements (plural)
    # returns an empty list when the button is absent, so the guard works.
    buttons = driver.find_elements_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']")
    if buttons:
        buttons[0].click()
        time.sleep(5)
    # Store the data in a CSV file.
    f = open("C:\\...\\a.csv", "a+", encoding="utf-8", newline="")
    a_writer = csv.writer(f)
    # Parse the rendered HTML: grab the page title.
    aText = driver.find_element_by_xpath("//h1[@class='pageTitle']").text
    print("aText:", aText)
    a_writer.writerow(["name", aText])
    # Writing a blank row takes an explicit pair of empty fields.
    a_writer.writerow(["", ""])
    f.close()
    driver.quit()

    Summary:

    With the first two approaches, the complete page source can be seen directly in the Response; only the third case needs a browser (Selenium) to render the page first.

  • Original article: https://blog.csdn.net/hellolianhua/article/details/126430340