# Saving Scraped Data to a Database in Scrapy
## Overview
Scrapy’s pipeline architecture makes it straightforward to persist scraped items to a database. Combining a database pipeline with a custom duplicate filter lets you:
- Store every successfully scraped item.
- Skip URLs that have already been crawled across spider runs (persistent deduplication).
## Architecture

```text
Spider → Item → Pipeline → Database
  ↑
  DupeFilter checks the DB before
  each request is scheduled
```
## 1. Database Pipeline (SQLAlchemy + SQLite / PostgreSQL)
### Install dependencies

```bash
pip install sqlalchemy psycopg2-binary   # PostgreSQL
# or
pip install sqlalchemy                   # SQLite (driver ships with Python)
```
### Define the ORM model — models.py

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Article(Base):
    __tablename__ = "articles"

    id = Column(Integer, primary_key=True, autoincrement=True)
    url = Column(String, unique=True, nullable=False, index=True)  # dedup key
    title = Column(String)
    content = Column(String)
    scraped_at = Column(DateTime, default=datetime.utcnow)


def get_engine(db_url: str):
    """Create the engine and make sure the articles table exists."""
    engine = create_engine(db_url, echo=False)
    Base.metadata.create_all(engine)
    return engine
```
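A quick standalone sanity check, run from the project root outside Scrapy, confirms that `get_engine()` creates the table; the `test.db` path and sample values here are illustrative, not part of the project:

```python
from sqlalchemy.orm import sessionmaker

from myproject.models import Article, get_engine

engine = get_engine("sqlite:///test.db")  # creates the articles table if missing
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(Article(url="https://example.com/a1", title="Hello"))
    session.commit()
    # Inserting the same URL a second time would raise IntegrityError,
    # which is exactly what the pipeline below relies on.
    print(session.query(Article).filter_by(url="https://example.com/a1").one().title)
```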
### Write the pipeline — pipelines.py

```python
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import sessionmaker

from .models import Article, get_engine


class DatabasePipeline:
    """Persist scraped items and silently drop duplicates (unique URL constraint)."""

    def open_spider(self, spider):
        db_url = spider.settings.get("DATABASE_URL", "sqlite:///scrapy.db")
        engine = get_engine(db_url)
        Session = sessionmaker(bind=engine)
        self.session = Session()

    def close_spider(self, spider):
        self.session.close()

    def process_item(self, item, spider):
        article = Article(
            url=item["url"],
            title=item.get("title"),
            content=item.get("content"),
        )
        self.session.add(article)
        try:
            self.session.commit()
            spider.logger.info(f"Saved: {item['url']}")
        except IntegrityError:
            # Duplicate URL — the unique constraint fired
            self.session.rollback()
            spider.logger.debug(f"Duplicate skipped: {item['url']}")
        return item
```
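To see the duplicate handling in action without running a crawl, here is a minimal smoke-test sketch; it assumes an in-memory SQLite URL and uses a `MagicMock` as a stand-in spider (both are illustrative assumptions, not part of the project):

```python
from unittest.mock import MagicMock

from myproject.pipelines import DatabasePipeline

spider = MagicMock()
spider.settings.get.return_value = "sqlite:///:memory:"  # any args -> this URL

pipeline = DatabasePipeline()
pipeline.open_spider(spider)

item = {"url": "https://example.com/a1", "title": "t", "content": "c"}
pipeline.process_item(item, spider)  # first call: row is committed
pipeline.process_item(item, spider)  # second call: IntegrityError caught, rolled back
pipeline.close_spider(spider)
```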
### Enable the pipeline — settings.py

```python
DATABASE_URL = "postgresql://user:password@localhost/scrapy_db"
# DATABASE_URL = "sqlite:///scrapy.db"  # quick local testing

ITEM_PIPELINES = {
    "myproject.pipelines.DatabasePipeline": 300,
}
```
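Hard-coding credentials in settings.py is convenient for examples but risky in practice. A common alternative (not required by anything above) is to read the connection string from the environment with a local fallback:

```python
import os

# Keep credentials out of version control: take the connection string from
# the environment and fall back to local SQLite for development.
DATABASE_URL = os.environ.get("DATABASE_URL", "sqlite:///scrapy.db")
```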
## 2. Persistent Duplicate Request Filter

The pipeline above deduplicates items. To avoid even making the HTTP request again, subclass Scrapy's `RFPDupeFilter` so that the database is checked before a request is scheduled.
### dupefilter.py

```python
from scrapy.dupefilters import RFPDupeFilter
from sqlalchemy.orm import sessionmaker

from .models import Article, get_engine


class DBDupeFilter(RFPDupeFilter):
    """
    Extends Scrapy's default fingerprint filter with a DB look-up so that
    URLs already stored in the database are never re-requested, even across
    spider restarts.
    """

    @classmethod
    def from_settings(cls, settings):
        instance = super().from_settings(settings)
        db_url = settings.get("DATABASE_URL", "sqlite:///scrapy.db")
        engine = get_engine(db_url)
        Session = sessionmaker(bind=engine)
        instance.db_session = Session()
        return instance

    def request_seen(self, request):
        # 1. Check the in-memory fingerprint set (fast path). RFPDupeFilter
        #    also records the fingerprint here, so each URL hits the DB at
        #    most once per run.
        if super().request_seen(request):
            return True
        # 2. Check the persistent DB (survives restarts).
        exists = (
            self.db_session.query(Article)
            .filter_by(url=request.url)
            .first()
        )
        return exists is not None

    def close(self, reason):
        self.db_session.close()
        super().close(reason)
```
### Register the filter — settings.py

```python
DUPEFILTER_CLASS = "myproject.dupefilter.DBDupeFilter"
```
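One DB query per request can become a bottleneck on large crawls. A possible optimization, sketched here under the assumption that the stored URL set fits in memory (`CachedDBDupeFilter` is a hypothetical name, written as an addition to dupefilter.py where `DBDupeFilter` and `Article` are already in scope):

```python
class CachedDBDupeFilter(DBDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        instance = super().from_settings(settings)
        # One query at startup instead of one per request; trades RAM for speed.
        instance.seen_urls = {url for (url,) in instance.db_session.query(Article.url)}
        return instance

    def request_seen(self, request):
        # Keep the fast in-memory fingerprint check from RFPDupeFilter...
        if super(DBDupeFilter, self).request_seen(request):
            return True
        # ...then consult the preloaded set instead of hitting the DB.
        return request.url in self.seen_urls
```

If you adopt this variant, point `DUPEFILTER_CLASS` at it instead.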
## 3. Full Project Layout

```text
myproject/
├── scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py      ← DATABASE_URL, ITEM_PIPELINES, DUPEFILTER_CLASS
    ├── models.py        ← SQLAlchemy ORM + get_engine()
    ├── pipelines.py     ← DatabasePipeline
    ├── dupefilter.py    ← DBDupeFilter
    └── spiders/
        └── article_spider.py
```
## 4. Spider Example

```python
import scrapy


class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # Follow every article link on the listing page.
        for link in response.css("a.article-link::attr(href)").getall():
            yield response.follow(link, callback=self.parse_article)

        # Follow pagination.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_article(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(default="").strip(),
            "content": " ".join(response.css("article p::text").getall()),
        }
```
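The spider yields plain dicts, which the pipeline accepts as-is via `item["url"]` and `item.get(...)`. If you prefer declared fields, an equivalent `scrapy.Item` works unchanged with the same pipeline (optional, shown only for completeness):

```python
import scrapy


class ArticleItem(scrapy.Item):
    # Field names match the keys the pipeline reads.
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```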
## 5. Deduplication Strategy Summary

| Layer | Mechanism | Scope |
|---|---|---|
| Request filter (`DBDupeFilter`) | DB look-up before the HTTP request | Across spider restarts |
| Pipeline (`DatabasePipeline`) | `UNIQUE` constraint + `IntegrityError` catch | Any run (the constraint lives in the DB) |
| Scrapy default (`RFPDupeFilter`) | In-memory fingerprint set | Current run only |
- Use all three layers for the most robust deduplication.
- The DB constraint is the final safety net — it prevents duplicate rows even if both filters are bypassed (see the DDL sketch below).
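To verify that safety net, you can ask SQLAlchemy for the DDL it generates; the output should include the `UNIQUE` constraint on `url` (a standalone sketch, assuming the models.py defined above):

```python
from sqlalchemy.schema import CreateTable

from myproject.models import Article, get_engine

engine = get_engine("sqlite:///scrapy.db")
# Prints the CREATE TABLE statement, including "UNIQUE (url)", the constraint
# that triggers the IntegrityError the pipeline catches.
print(CreateTable(Article.__table__).compile(engine))
```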
## 6. Running the Spider

```bash
# First run — crawls and saves everything
scrapy crawl article

# Subsequent runs — the DB filter skips already-seen URLs automatically
scrapy crawl article
```
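A quick way to confirm persistence between runs is to count the stored rows (a standalone sketch, assuming the default SQLite database):

```python
from sqlalchemy.orm import sessionmaker

from myproject.models import Article, get_engine

engine = get_engine("sqlite:///scrapy.db")
with sessionmaker(bind=engine)() as session:
    # The count should not grow when a second run only sees known URLs.
    print(session.query(Article).count(), "articles stored")
```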
## 7. Tips

- Index the `url` column for fast look-ups (`index=True` in the ORM model above).
- Use connection pooling (`pool_size`, `max_overflow` in `create_engine`) for high-concurrency crawls; see the sketch after this list.
- For very large crawl sets, consider a Bloom filter, or a ready-made package such as `scrapy-crawl-once`, as a lighter-weight alternative to a DB look-up on every request.
- Set `DOWNLOAD_DELAY` and `AUTOTHROTTLE_ENABLED = True` to be respectful to target servers.
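For the pooling tip, a variant of `get_engine` in models.py might look like the sketch below; the pool numbers are illustrative starting points, and these arguments are intended for server databases such as PostgreSQL rather than SQLite:

```python
def get_pooled_engine(db_url: str):
    # pool_size: connections kept open; max_overflow: extra connections
    # allowed under burst load. Tune alongside CONCURRENT_REQUESTS.
    engine = create_engine(db_url, pool_size=10, max_overflow=20, echo=False)
    Base.metadata.create_all(engine)
    return engine
```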