    Scrapy Spider Tutorial: Extracting Product Prices

    1. Setting Up the Environment:
    • Install Scrapy:
      pip install scrapy
      
    2. Creating a New Scrapy Project:
    • Navigate to where you want to create your project:

      cd /desired/path/
      
    • Create a new Scrapy project:

      scrapy startproject price_scraper
      
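      This should produce a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):

      price_scraper/
          scrapy.cfg
          price_scraper/
              __init__.py
              items.py
              middlewares.py
              pipelines.py
              settings.py
              spiders/
                  __init__.py
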
    3. Creating a Spider:

    Inside the price_scraper/spiders directory, create a new spider:

    • Navigate to spiders directory:

      cd price_scraper/spiders
      
    • Create a new spider:

      scrapy genspider price_spider www.example.com
      

      Replace www.example.com with the target website.
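
      For reference, the generated price_spider.py starts out roughly like this before you edit it (the exact class name and start URL depend on your Scrapy version):

      import scrapy


      class PriceSpiderSpider(scrapy.Spider):
          name = "price_spider"
          allowed_domains = ["www.example.com"]
          start_urls = ["https://www.example.com"]

          def parse(self, response):
              pass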

    4. Defining Spider Logic:

    Inside price_spider.py, define the parsing logic:

    import scrapy
    
    class PriceSpider(scrapy.Spider):
        name = "price_spider"
        start_urls = [
            'https://www.example.com/product-page/'
        ]
    
        def parse(self, response):
            # Yield the raw price text scraped from the product page
            yield {
                'original_price': response.css('CSS_SELECTOR_FOR_ORIGINAL_PRICE::text').get(),
                'sale_price': response.css('CSS_SELECTOR_FOR_SALE_PRICE::text').get(),
            }
    

    Replace CSS_SELECTOR_FOR_ORIGINAL_PRICE and CSS_SELECTOR_FOR_SALE_PRICE with the actual CSS selectors of the data you want to extract.
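
    As an illustration only: if the product page marked its prices up with hypothetical classes such as .price--original and .price--sale, the parse method might look like the sketch below, which also strips surrounding whitespace. You can verify candidate selectors interactively with scrapy shell before committing them to the spider.

    import scrapy


    class PriceSpider(scrapy.Spider):
        name = "price_spider"
        start_urls = ['https://www.example.com/product-page/']

        def parse(self, response):
            # .price--original / .price--sale are hypothetical class names;
            # inspect the target page to find the real ones.
            original = response.css('.price--original::text').get(default='')
            sale = response.css('.price--sale::text').get(default='')
            yield {
                'original_price': original.strip(),
                'sale_price': sale.strip(),
            }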

    5. Pipeline to Send Email:

    In price_scraper/pipelines.py, add:

    from .mail import send_email
    
    class EmailPipeline:
        def process_item(self, item, spider):
            # Build a notification e-mail from the scraped prices and send it
            subject = "Product Price Update"
            body = f"Original Price: {item['original_price']}, Sale Price: {item['sale_price']}"
            from_email = spider.settings.get('FROM_EMAIL')
            from_password = spider.settings.get('FROM_PASSWORD')
            to_email = spider.settings.get('TO_EMAIL')
            send_email(subject, body, from_email, from_password, to_email)
            return item
    
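    Note that this pipeline sends one e-mail for every item the spider yields. If you would rather receive a single summary per crawl, a sketch along these lines (using Scrapy's open_spider/close_spider hooks and the same assumed settings names) is one option:

    from .mail import send_email


    class SummaryEmailPipeline:
        def open_spider(self, spider):
            # Collect items during the crawl instead of mailing them one by one
            self.items = []

        def process_item(self, item, spider):
            self.items.append(item)
            return item

        def close_spider(self, spider):
            if not self.items:
                return
            body = "\n".join(
                f"Original Price: {i['original_price']}, Sale Price: {i['sale_price']}"
                for i in self.items
            )
            send_email(
                "Product Price Update",
                body,
                spider.settings.get('FROM_EMAIL'),
                spider.settings.get('FROM_PASSWORD'),
                spider.settings.get('TO_EMAIL'),
            )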

    Create a mail.py inside the price_scraper directory:

    import smtplib
    
    def send_email(subject, body, from_email, from_password, to_email):
        # Connect to Gmail's SMTP server and upgrade to TLS before logging in
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.starttls()
        server.login(from_email, from_password)
        message = f"Subject: {subject}\n\n{body}"
        server.sendmail(from_email, to_email, message)
        server.quit()
    
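    The string-based message above works for simple ASCII subjects and bodies. If you want proper headers and encoding, a variant of mail.py built on the standard library's email.message.EmailMessage (same function signature assumed) might look like this:

    import smtplib
    from email.message import EmailMessage


    def send_email(subject, body, from_email, from_password, to_email):
        # Build a properly encoded message with explicit Subject/From/To headers
        msg = EmailMessage()
        msg['Subject'] = subject
        msg['From'] = from_email
        msg['To'] = to_email
        msg.set_content(body)

        # The context manager closes the connection even if sending fails
        with smtplib.SMTP('smtp.gmail.com', 587) as server:
            server.starttls()
            server.login(from_email, from_password)
            server.send_message(msg)
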
    6. Activate Pipeline:

    In price_scraper/settings.py, activate the pipeline:

    ITEM_PIPELINES = {
        'price_scraper.pipelines.EmailPipeline': 1,
    }
    
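    The cron job in the next step passes FROM_EMAIL, FROM_PASSWORD and TO_EMAIL on the command line with -s. If you prefer, you can define the same custom settings in settings.py; reading them from environment variables, as sketched below, keeps real credentials out of version control:

    import os

    # Custom settings read by EmailPipeline via spider.settings.get(...)
    FROM_EMAIL = os.environ.get('FROM_EMAIL')
    FROM_PASSWORD = os.environ.get('FROM_PASSWORD')
    TO_EMAIL = os.environ.get('TO_EMAIL')
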
    7. Run the Spider with Cron Job:
    • Open your crontab:

      crontab -e
      
    • Add your cron job:

      0 0 * * * cd /path/to/your/scrapy/project && /usr/local/bin/scrapy crawl price_spider -s FROM_EMAIL=c@gmail.com -s FROM_PASSWORD="password" -s TO_EMAIL=receiver@gmail.com >> /path/to/logfile.log 2>&1
      

      This will run the spider daily at midnight. Adjust the time as needed. Ensure the paths and email credentials are correct.

    8. Test & Monitor:
    • First, run the spider manually to confirm it completes without errors:

      cd /path/to/your/scrapy/project
      scrapy crawl price_spider -s FROM_EMAIL=c@gmail.com -s FROM_PASSWORD="password" -s TO_EMAIL=receiver@gmail.com
      
    • Monitor the logs to troubleshoot and ensure smooth operation.
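
      If you prefer Scrapy to write its own log file in addition to the cron redirection, the standard LOG_FILE and LOG_LEVEL settings can be set in settings.py (the path below is a placeholder):

      # Optional logging configuration in price_scraper/settings.py
      LOG_LEVEL = 'INFO'
      LOG_FILE = '/path/to/scrapy.log'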

    9. Important Notes:
    • Make sure to handle exceptions and errors for a robust spider (see the errback sketch after these notes).
    • Respect each website's robots.txt and use the ROBOTSTXT_OBEY setting wisely.
    • If using Gmail, note that Google no longer supports the old “less secure apps” option; use an app password (with 2-Step Verification enabled) or a dedicated email service for sending notifications.
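
    As a minimal sketch of the error-handling note above (reusing the spider from step 4), you can attach an errback to each request so that failed downloads are logged rather than silently dropped:

    import scrapy


    class PriceSpider(scrapy.Spider):
        name = "price_spider"

        def start_requests(self):
            # errback is called when a request fails (DNS errors, timeouts,
            # or non-2xx responses filtered by the HttpError middleware)
            yield scrapy.Request(
                'https://www.example.com/product-page/',
                callback=self.parse,
                errback=self.on_error,
            )

        def parse(self, response):
            yield {
                'original_price': response.css('CSS_SELECTOR_FOR_ORIGINAL_PRICE::text').get(),
                'sale_price': response.css('CSS_SELECTOR_FOR_SALE_PRICE::text').get(),
            }

        def on_error(self, failure):
            # Log the failure so the cron log file shows what went wrong
            self.logger.error('Request failed: %r', failure)
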
  • Original article: https://blog.csdn.net/weixin_38396940/article/details/133760780