• Fetching page source with a crawler


    When fetching a page's source code from a URL, we generally run into the following cases.

    1. Direct fetch

    The page is static, so a plain request like the following:

    using System.Net;
    using System.Text;

    // Sends a POST request with a form-encoded body and returns the response as a UTF-8 string.
    public static string HttpPost(string url, string paraJsonStr)
    {
        WebClient webClient = new WebClient();
        webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
        byte[] postData = Encoding.UTF8.GetBytes(paraJsonStr);
        byte[] responseData = webClient.UploadData(url, "POST", postData);
        string returnStr = Encoding.UTF8.GetString(responseData);
        return returnStr;
    }

    is enough to retrieve the page's source.
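    A minimal usage sketch, assuming the HttpPost helper above is in scope (the URL and form body are made-up placeholders, not from the original post):

    // Hypothetical call site; replace the URL and body with your target's.
    string html = HttpPost("https://example.com/list", "page=1&size=20");
    Console.WriteLine(html);  // prints the raw page source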

    2. The method above fails to fetch the page and returns "Request unsuccessful"

    Request unsuccessful. Incapsula incident ID: 89300071...

    This usually means the site has anti-crawler measures in place.

    Solutions:

    2.1. Add information to the request headers

    Usually adding a "User-Agent" value is enough; if that does not work, add a few more header fields and values (a rotation sketch follows the list below).

    Commonly used "User-Agent" strings:

    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39",

        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54"

    For example:

    public static string HttpPost(string url, string paraJsonStr = "")
    {
        // Note: despite the name, this version never writes paraJsonStr to the
        // request body, so it is effectively a GET with custom headers.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // User-Agent is a restricted header on HttpWebRequest; set it via the
        // property rather than Headers[], which would throw an ArgumentException.
        request.UserAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Mobile Safari/537.36 Edg/104.0.1293.54";
        request.Headers["Cookie"] = "visid_incap_161858=BlNpjA+qS9ucNza1";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream responseStream = response.GetResponseStream())
        using (StreamReader streamReader = new StreamReader(responseStream, Encoding.UTF8))
        {
            return streamReader.ReadToEnd();
        }
    }

    2.2. Add a time interval between page visits
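    A minimal sketch of the idea, assuming a hypothetical list of target URLs and the HttpPost helper from 2.1: sleep a random few seconds between requests so they do not arrive back-to-back.

    using System;
    using System.Threading;

    // Hypothetical crawl loop; urls is a stand-in for your own target list.
    string[] urls = { "https://example.com/page1", "https://example.com/page2" };
    var rng = new Random();
    foreach (string url in urls)
    {
        string html = HttpPost(url);
        Thread.Sleep(rng.Next(2000, 5000));  // pause a random 2-5 seconds
    }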

    3. The page source cannot be fetched, or the fetched source is incomplete and contains only the basic skeleton

    The response does not contain the complete page source and differs from what the browser's Elements panel shows, because the content is rendered by JavaScript after the page loads. In this case, drive a real browser with Selenium:

    import time
    import csv
    from selenium import webdriver

    # Path to the browser driver: download msedgedriver.exe and point to it here.
    # (Selenium 3 style API; Selenium 4 uses a Service object and find_element(By.XPATH, ...).)
    driver = webdriver.Edge(executable_path = "C:\\...\\msedgedriver.exe")
    # Maximize the browser window on startup.
    driver.maximize_window()
    # url is the address of the page to scrape (placeholder here).
    url = "https://example.com/article"
    driver.get(url)
    # Sleep while the page loads its content dynamically; adjust for your network speed.
    time.sleep(10)
    # Scroll the page down by a fixed number of pixels.
    driver.execute_script("window.scrollBy(0,{})".format(1500))
    time.sleep(2)
    # The page has a button; click it to expand the full content.
    # Use the XPath of the button on your target page. Note that
    # find_element_by_xpath raises NoSuchElementException if the button is absent.
    if driver.find_element_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']"):
        driver.find_element_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']").click()
        time.sleep(5)
    # Store the scraped data in a CSV file.
    f = open("C:\\...\\a.csv", "a+", encoding='utf-8', newline="")
    a_writer = csv.writer(f)
    # Parse the rendered HTML.
    aText = driver.find_element_by_xpath("//h1[@class='pageTitle']").text
    print("aText:", aText)
    a_writer.writerow(["name", aText])
    # Writing an empty row requires this form.
    a_writer.writerow(["", ""])
    f.close()
    driver.quit()

    Summary:

    With the first two approaches the complete page source is already visible in the response; only when it is not (case 3) do you need Selenium to render the page.

  • Original article: https://blog.csdn.net/hellolianhua/article/details/126430340