• Python web scraping: crawling a three-letter code website


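    Three-letter code site: http://www.6qt.net/

    The goal is to scrape eight fields from each row: city, three-letter code, country, country code, four-letter code, airport name, English name, and query count.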

    import requests
    
    url = 'http://www.6qt.net/'
    r = requests.get(url)
    r.encoding='gb2312'
    print(r.text)
    
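The explicit r.encoding assignment matters when the response headers do not declare a charset, since the text would otherwise be decoded with a fallback codec. The following is a minimal offline sketch of that effect, not the site's actual response: the payload string is a made-up example, and ISO-8859-1 is assumed as the wrong fallback.

```python
# Hypothetical payload: the bytes a server might send for a GB2312-encoded page.
raw = '北京 PEK'.encode('gb2312')

# Decoding with the wrong codec does not raise; it silently produces mojibake.
wrong = raw.decode('iso-8859-1')

# Decoding with the page's real encoding recovers the original text.
right = raw.decode('gb2312')

print(repr(wrong))
print(right)
```

Setting r.encoding before reading r.text tells requests which codec to use for this same decoding step.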

    Parse the page with XPath to get the city names.

    html.fromstring(html, base_url=None, parser=None, **kw)
    Parse the html, returning a single element/document.

    import requests
    from lxml import html
    
    url = 'http://www.6qt.net/'
    r = requests.get(url)
    r.encoding='gb2312'
    data_html = html.fromstring(r.text)
    # Select every tr element whose class attribute is "tdbg"
    tr_html = data_html.xpath('//tr[@class="tdbg"]')  # returns a list of Element objects
    for tr in tr_html:
        city_name = tr.xpath('td[1]/a/text()')
        if city_name:
            print(city_name[0])
    
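To try the row-relative XPath without touching the network, here is a small sketch that runs the same two queries against inline markup. The SAMPLE string is hypothetical and only mimics the row layout the code above assumes; the real page may differ.

```python
from lxml import html

# Hypothetical sample that mimics the assumed layout of the listing table.
SAMPLE = '''
<html><body><table>
  <tr class="title"><td>City</td><td>Code</td></tr>
  <tr class="tdbg"><td><a href="#">Beijing</a></td><td><a href="#">PEK</a></td></tr>
  <tr class="tdbg"><td><a href="#">Shanghai</a></td><td><a href="#">SHA</a></td></tr>
</table></body></html>
'''

doc = html.fromstring(SAMPLE)
rows = doc.xpath('//tr[@class="tdbg"]')  # the header row has a different class and is skipped

cities = []
for tr in rows:
    # 'td[1]/a/text()' has no leading slash, so it is evaluated relative to this row.
    name = tr.xpath('td[1]/a/text()')
    if name:
        cities.append(name[0])

print(cities)
```

Because xpath() always returns a list, the emptiness check before indexing is what keeps rows without a link from raising IndexError.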

    Complete code

    import requests
    from lxml import html
    
    url = 'http://www.6qt.net/'
    r = requests.get(url)
    r.encoding='gb2312'
    data_html = html.fromstring(r.text)
    tr_html = data_html.xpath('//tr[@class="tdbg"]')
    for tr in tr_html:
        city_name = tr.xpath('td[1]/a/text()')
        if not city_name:
            continue  # rows without a city link carry no data
        tcc = tr.xpath('td[2]/a/text()')  # three-letter code
        country = tr.xpath('td[3]/a/u/text()')
        country_code = tr.xpath('td[4]/a/u/text()')
        fcc = tr.xpath('td[5]/a/u/text()')  # four-letter code, missing for some cities
        airport_name = tr.xpath('td[6]/text()')
        en_name = tr.xpath('td[7]/text()')
        number = tr.xpath('td[8]/a/text()')  # query count
        # Guard each field on its own result list; an empty string stands in for a missing value
        data = {
            'city_name': city_name[0].replace('\xa0', ''),
            'tcc': tcc[0].replace('\xa0', '') if tcc else '',
            'country': country[0] if country else '',
            'country_code': country_code[0] if country_code else '',
            'fcc': fcc[0] if fcc else '',
            'airport_name': airport_name[0].replace('\xa0', '') if airport_name else '',
            'en_name': en_name[0].replace('\xa0', '') if en_name else '',
            'number': number[0] if number else '',
        }
        print(data)
    
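The complete code applies the same pattern to every column: run an XPath, take the first match, strip \xa0. One possible way to factor that out is a small helper. This is a sketch only: first_text is a hypothetical name, not part of lxml, and ROW is made-up markup with an empty four-letter code cell.

```python
from lxml import html


def first_text(node, path, default=''):
    """Return the first XPath text match with non-breaking spaces removed,
    or `default` when the query matches nothing."""
    found = node.xpath(path)
    return found[0].replace('\xa0', '') if found else default


# Hypothetical row; the fifth cell is empty to imitate a missing four-letter code.
ROW = ('<table><tr class="tdbg">'
       '<td><a href="#">Beijing&nbsp;</a></td>'
       '<td><a href="#">PEK</a></td>'
       '<td><a href="#"><u>China</u></a></td>'
       '<td><a href="#"><u>CN</u></a></td>'
       '<td></td>'
       '</tr></table>')

tr = html.fromstring(ROW).xpath('//tr[@class="tdbg"]')[0]
record = {
    'city_name': first_text(tr, 'td[1]/a/text()'),
    'tcc': first_text(tr, 'td[2]/a/text()'),
    'country': first_text(tr, 'td[3]/a/u/text()'),
    'country_code': first_text(tr, 'td[4]/a/u/text()'),
    'fcc': first_text(tr, 'td[5]/a/u/text()'),
}
print(record)
```

With a helper like this, a missing cell falls back to the default instead of raising IndexError, and the \xa0 cleanup lives in one place.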

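    \xa0 is the non-breaking space (also called a no-break space), Unicode code point U+00A0. It is an invisible character that lies outside the 7-bit ASCII range and comes from Latin-1 (ISO-8859-1). Web pages commonly use it to insert a space or to keep text from wrapping, and in HTML it is written as the &nbsp; entity.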
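These properties of \xa0 can be checked with the standard library alone. The snippet below is a small illustration; the 'PEK\xa0' string is a made-up cell value.

```python
import html
import unicodedata

nbsp = '\xa0'

print(unicodedata.name(nbsp))            # the character's official Unicode name
print(html.unescape('&nbsp;') == nbsp)   # the HTML entity decodes to the same character
print(nbsp.isascii())                    # code point 160 is outside ASCII's 0-127 range

# Hypothetical scraped cell with a trailing non-breaking space.
cleaned = 'PEK\xa0'.replace(nbsp, '')
print(repr(cleaned))
```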


    lxml's etree.tostring() function converts an Element object into readable markup. For example:

    from lxml import etree

    # Create the root element of an HTML document
    root = etree.Element("html")

    # Create a <head> element and add it to the root element
    head = etree.SubElement(root, "head")

    # Create a <title> element and add it to the <head> element
    title = etree.SubElement(head, "title")
    title.text = "Welcome to my website"

    # Convert the root element to readable HTML and print it
    html_code = etree.tostring(root, pretty_print=True, encoding='unicode')
    print(html_code)

    Running the code above prints:

    <html>
      <head>
        <title>Welcome to my website</title>
      </head>
    </html>
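The same call can help when an XPath returns nothing: serializing a matched row shows which tags actually wrap the text. A minimal sketch, assuming a made-up one-row table rather than the real page:

```python
from lxml import etree, html

# Hypothetical markup standing in for one row of the scraped table.
doc = html.fromstring(
    '<table><tr class="tdbg"><td><a href="#">Beijing</a></td></tr></table>'
)
row = doc.xpath('//tr[@class="tdbg"]')[0]

# Serialize only this row so the nesting of td and a is visible.
markup = etree.tostring(row, pretty_print=True, encoding='unicode')
print(markup)
```

Reading the printed markup makes it easier to tell whether a path such as td[3]/a/u/text() matches the real nesting.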
    Original article: https://blog.csdn.net/weixin_64729620/article/details/132852450
    