• Crawler: fetching a page's source code


     When fetching a page's source code from a URL, we generally run into the following cases.

    1. Direct fetch

    The page is static, so a plain request is enough:

    public static string HttpPost(string url, string paraJsonStr)
    {
        using (WebClient webClient = new WebClient())
        {
            webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
            byte[] postData = System.Text.Encoding.UTF8.GetBytes(paraJsonStr);
            byte[] responseData = webClient.UploadData(url, "POST", postData);
            return System.Text.Encoding.UTF8.GetString(responseData);
        }
    }

    This returns the page's source code.
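    For comparison, a minimal Python sketch of the same direct fetch, using only the standard library. The tiny local server here is a stand-in for a real static site, included only so the example is self-contained:

```python
import http.server
import threading
import urllib.request

def http_get(url):
    """Fetch a static page and return its source decoded as UTF-8."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Hypothetical local server standing in for a real static site.
class Page(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = "<html><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Page)
threading.Thread(target=server.serve_forever, daemon=True).start()
html = http_get("http://127.0.0.1:%d/" % server.server_address[1])
server.shutdown()
```

    For a real site, `http_get` would be called with the target URL directly; no server setup is needed.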

    2. The method above fails and only returns "Request unsuccessful"

    Request unsuccessful. Incapsula incident ID: 89300071...

    This usually means the site has anti-crawler protection in place.

    Solutions:

    2.1. Add information to the request headers

    Usually adding a "User-Agent" value is enough; if that does not work, add a few more header fields and values.

    Commonly used "User-Agent" strings:

    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39",

    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54"

    For example:

    public static string HttpPost(string url, string paraJsonStr = "")
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // User-Agent is a restricted header and must be set through the
        // UserAgent property; assigning Headers["User-Agent"] throws
        request.UserAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Mobile Safari/537.36 Edg/104.0.1293.54";
        request.Headers["Cookie"] = "visid_incap_161858=BlNpjA+qS9ucNza1";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream responseStream = response.GetResponseStream())
        using (StreamReader streamReader = new StreamReader(responseStream, Encoding.UTF8))
        {
            return streamReader.ReadToEnd();
        }
    }
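    The same header trick can be sketched in Python. The echo server below is hypothetical and exists only so the example can confirm the "User-Agent" header was actually sent with the request:

```python
import http.server
import threading
import urllib.request

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54")

def http_get_with_headers(url):
    """Fetch a page while sending a browser-like User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Hypothetical echo server: replies with the User-Agent it received.
class Echo(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = (self.headers.get("User-Agent") or "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Echo)
threading.Thread(target=server.serve_forever, daemon=True).start()
seen_ua = http_get_with_headers("http://127.0.0.1:%d/" % server.server_address[1])
server.shutdown()
```

    Extra header fields (Cookie, Referer, and so on) can be added to the same `headers` dict.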

    2.2. Add a time interval between page requests
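    One way to space requests out is a randomized delay between fetches; a sketch assuming a hypothetical `fetch` callable (the tiny delays are only to keep the example fast):

```python
import time
import random

def polite_fetch(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Fetch each URL in turn, sleeping a random interval between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # randomized gap so the request pattern looks less mechanical
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Stand-in fetch function; in practice this would be HttpPost / urlopen.
pages = polite_fetch(["a", "b", "c"],
                     fetch=lambda u: "<html>%s</html>" % u,
                     min_delay=0.01, max_delay=0.02)
```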

    3. The page source cannot be fetched, or the fetched source is incomplete and contains only the basic page skeleton

    The Response does not contain the complete page source and differs from what the Elements panel shows, because the content is rendered dynamically. In this case, use Selenium to drive a real browser:

    import time
    import csv
    from selenium import webdriver
    # from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.edge.options import Options

    # Path to the webdriver: download the browser's driver first; the argument
    # is the location of the downloaded msedgedriver.exe
    driver = webdriver.Edge(executable_path="C:\\...\\msedgedriver.exe")
    # Window size while loading
    driver.maximize_window()
    # url is the address of the specific page to fetch information from
    driver.get(url)
    # The sleep time is the time for the web page to load its dynamic content;
    # adjust it depending on the speed of the network
    time.sleep(10)
    # How far the page automatically scrolls down
    driver.execute_script("window.scrollBy(0,{})".format(1500))
    time.sleep(2)
    # The page has a button; click it to get the information of the whole page
    # (use the XPath of the button on the specific page)
    if driver.find_element_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']"):
        driver.find_element_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']").click()
        time.sleep(5)
    # Store data to a file
    f = open("C:\\...\\a.csv", "a+", encoding='utf-8', newline="")
    a_writer = csv.writer(f)
    # Parse information out of the HTML
    aText = driver.find_element_by_xpath("//h1[@class='pageTitle']").text
    print("aText:", aText)
    a_writer.writerow(["name", aText])
    # Writing an empty line has to be done like this
    a_writer.writerow(["", ""])
    f.close()
    driver.quit()
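    The CSV step above, including the empty-row trick, can be checked on its own with an in-memory buffer; "Example Page Title" is a hypothetical stand-in for the scraped aText:

```python
import csv
import io

buf = io.StringIO()
a_writer = csv.writer(buf)
# "Example Page Title" stands in for the aText scraped above
a_writer.writerow(["name", "Example Page Title"])
# an empty line must be written as a row of empty strings
a_writer.writerow(["", ""])
csv_text = buf.getvalue()
```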

    Summary:

    With the first two approaches, the complete page source can be seen directly in the Response; only dynamically rendered pages need the third, browser-driven approach.

  • Original article: https://blog.csdn.net/hellolianhua/article/details/126430340