Install Scrapy:
pip install scrapy
Navigate to where you want to create your project:
cd /desired/path/
Create a new Scrapy project:
scrapy startproject price_scraper
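This generates a project skeleton; the files referenced in the following steps live inside it (abridged):

price_scraper/
    scrapy.cfg
    price_scraper/
        settings.py
        pipelines.py
        spiders/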
Inside the price_scraper/spiders directory, create a new spider. First, navigate to the spiders directory:
cd price_scraper/spiders
Then generate the spider:
scrapy genspider price_spider www.example.com
Replace www.example.com with the target website.
Inside price_spider.py, add:
import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = [
        'https://www.example.com/product-page/'
    ]

    def parse(self, response):
        # Extract both prices from the product page and yield them
        # as an item, which Scrapy hands to the item pipelines.
        yield {
            'original_price': response.css('CSS_SELECTOR_FOR_ORIGINAL_PRICE::text').get(),
            'sale_price': response.css('CSS_SELECTOR_FOR_SALE_PRICE::text').get(),
        }
Replace CSS_SELECTOR_FOR_ORIGINAL_PRICE and CSS_SELECTOR_FOR_SALE_PRICE with the actual CSS selectors of the data you want to extract.
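For example, if the target page marked its prices with classes like price--original and price--sale (hypothetical names; inspect the real page's HTML to find yours), parse would become:

def parse(self, response):
    # Hypothetical selectors -- replace with the classes used on your target page.
    yield {
        'original_price': response.css('span.price--original::text').get(),
        'sale_price': response.css('span.price--sale::text').get(),
    }

You can verify selectors interactively before a full crawl with Scrapy's shell:
scrapy shell 'https://www.example.com/product-page/'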
In price_scraper/pipelines.py, add:
from .mail import send_email

class EmailPipeline:
    def process_item(self, item, spider):
        subject = "Product Price Update"
        body = f"Original Price: {item['original_price']}, Sale Price: {item['sale_price']}"
        # Pull the credentials from Scrapy settings (passed with -s at runtime).
        from_email = spider.settings.get('FROM_EMAIL')
        from_password = spider.settings.get('FROM_PASSWORD')
        to_email = spider.settings.get('TO_EMAIL')
        send_email(subject, body, from_email, from_password, to_email)
        return item
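Scrapy calls process_item once for every item the spider yields; returning the item at the end keeps it flowing to any later pipelines you may add (lower priority numbers run first).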
Create a mail.py inside the price_scraper directory:
import smtplib

def send_email(subject, body, from_email, from_password, to_email):
    # Connect to Gmail's SMTP server and upgrade the connection to TLS.
    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(from_email, from_password)
    # A minimal RFC 822-style message: subject header, blank line, body.
    message = f"Subject: {subject}\n\n{body}"
    server.sendmail(from_email, to_email, message)
    server.quit()
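Note: Gmail no longer accepts plain account passwords for SMTP logins. You will typically need to enable 2-step verification and generate an app password, then use that as FROM_PASSWORD.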
In price_scraper/settings.py, activate the pipeline:
ITEM_PIPELINES = {
    'price_scraper.pipelines.EmailPipeline': 1,
}
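If you prefer not to pass credentials on the command line every run, you can also define defaults here. The names below are the ones EmailPipeline reads; the values are placeholders, and keeping real credentials out of version control is advisable:

# Optional defaults, overridable at runtime with -s NAME=value.
FROM_EMAIL = 'sender@gmail.com'
FROM_PASSWORD = 'app-password'
TO_EMAIL = 'receiver@gmail.com'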
Open your crontab:
crontab -e
Add your cron job:
0 0 * * * cd /path/to/your/scrapy/project && /usr/local/bin/scrapy crawl price_spider -s FROM_EMAIL=c@gmail.com -s FROM_PASSWORD="password" -s TO_EMAIL=receiver@gmail.com >> /path/to/logfile.log 2>&1
This will run the spider daily at midnight. Adjust the time as needed. Ensure the paths and email credentials are correct.
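Only the five time fields need to change for a different cadence: for example, 0 */6 * * * runs the same command every six hours, and 30 9 * * 1-5 runs it at 9:30 on weekdays.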
Before relying on the cron job, run the spider manually to confirm there are no errors:
cd /path/to/your/scrapy/project
scrapy crawl price_spider -s FROM_EMAIL=c@gmail.com -s FROM_PASSWORD="password" -s TO_EMAIL=receiver@gmail.com
Monitor the logs to troubleshoot and ensure smooth operation.
Respect the robots.txt of websites. Use the ROBOTSTXT_OBEY setting wisely.
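Scrapy's project template enables this setting by default; in settings.py:

# When True (the default in new projects), Scrapy fetches robots.txt
# before crawling and skips disallowed URLs.
ROBOTSTXT_OBEY = True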