• Fetching page source with a crawler


    When fetching a page's source code from a URL, we generally run into the following cases.

    1. Direct fetch

    The page is static, so a plain request like the following:

    using System.Net;
    using System.Text;

    // Sends a POST request with a form-encoded body and returns the response as a UTF-8 string.
    public static string HttpPost(string url, string paraJsonStr)
    {
        WebClient webClient = new WebClient();
        webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
        byte[] postData = Encoding.UTF8.GetBytes(paraJsonStr);
        byte[] responseData = webClient.UploadData(url, "POST", postData);
        string returnStr = Encoding.UTF8.GetString(responseData);
        return returnStr;
    }

    is enough to retrieve the page's source.
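    A minimal usage sketch, assuming the HttpPost helper above is in scope (the URL and form body are made-up placeholders, not from the original post):

    // Hypothetical call site; replace the URL and body with your target's.
    string html = HttpPost("https://example.com/list", "page=1&size=20");
    Console.WriteLine(html);  // prints the raw page source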

    2. The method above fails to fetch the page and returns "Request unsuccessful"

    Request unsuccessful. Incapsula incident ID: 89300071...

    This usually means the site has anti-crawler measures in place.

    Solutions:

    2.1. Add information to the request headers

    Usually adding a "User-Agent" value is enough; if that does not work, add a few more header fields and values (a rotation sketch follows the list below).

    Commonly used "User-Agent" strings:

    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36 Edg/101.0.1210.39",

        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54"

    For example:

    public static string HttpPost(string url, string paraJsonStr = "")
    {
        // Note: despite the name, this version never writes paraJsonStr to the
        // request body, so it is effectively a GET with custom headers.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // User-Agent is a restricted header on HttpWebRequest; set it via the
        // property rather than Headers[], which would throw an ArgumentException.
        request.UserAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Mobile Safari/537.36 Edg/104.0.1293.54";
        request.Headers["Cookie"] = "visid_incap_161858=BlNpjA+qS9ucNza1";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream responseStream = response.GetResponseStream())
        using (StreamReader streamReader = new StreamReader(responseStream, Encoding.UTF8))
        {
            return streamReader.ReadToEnd();
        }
    }

    2.2. Add a time interval between page visits
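    A minimal sketch of the idea, assuming a hypothetical list of target URLs and the HttpPost helper from 2.1: sleep a random few seconds between requests so they do not arrive back-to-back.

    using System;
    using System.Threading;

    // Hypothetical crawl loop; urls is a stand-in for your own target list.
    string[] urls = { "https://example.com/page1", "https://example.com/page2" };
    var rng = new Random();
    foreach (string url in urls)
    {
        string html = HttpPost(url);
        Thread.Sleep(rng.Next(2000, 5000));  // pause a random 2-5 seconds
    }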

    3. The page source cannot be fetched, or the fetched source is incomplete and contains only the basic skeleton

    The response does not contain the complete page source and differs from what the browser's Elements panel shows, because the content is rendered by JavaScript after the page loads. In this case, drive a real browser with Selenium:

    import time
    import csv
    from selenium import webdriver

    # Path to the browser driver: download msedgedriver.exe and point to it here.
    # (Selenium 3 style API; Selenium 4 uses a Service object and find_element(By.XPATH, ...).)
    driver = webdriver.Edge(executable_path = "C:\\...\\msedgedriver.exe")
    # Maximize the browser window on startup.
    driver.maximize_window()
    # url is the address of the page to scrape (placeholder here).
    url = "https://example.com/article"
    driver.get(url)
    # Sleep while the page loads its content dynamically; adjust for your network speed.
    time.sleep(10)
    # Scroll the page down by a fixed number of pixels.
    driver.execute_script("window.scrollBy(0,{})".format(1500))
    time.sleep(2)
    # The page has a button; click it to expand the full content.
    # Use the XPath of the button on your target page. Note that
    # find_element_by_xpath raises NoSuchElementException if the button is absent.
    if driver.find_element_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']"):
        driver.find_element_by_xpath("//div[@class='articlePage_button']/span[@class='buttonText']").click()
        time.sleep(5)
    # Store the scraped data in a CSV file.
    f = open("C:\\...\\a.csv", "a+", encoding='utf-8', newline="")
    a_writer = csv.writer(f)
    # Parse the rendered HTML.
    aText = driver.find_element_by_xpath("//h1[@class='pageTitle']").text
    print("aText:", aText)
    a_writer.writerow(["name", aText])
    # Writing an empty row requires this form.
    a_writer.writerow(["", ""])
    f.close()
    driver.quit()

    Summary:

    With the first two approaches the complete page source is already visible in the response; only when it is not (case 3) do you need Selenium to render the page.

  • Original article: https://blog.csdn.net/hellolianhua/article/details/126430340