Scrapy: Spider.parse → ItemPipeline → New Request Flow
Scrapy: Spider.parse → ItemPipeline → New Request Flow
Overview
A common question is: does Scrapy return control to Spider.parse after ItemPipeline finishes?
Short answer: No. ItemPipeline and new Request scheduling are two independent, concurrent paths managed by Scrapy’s async engine. parse() is a generator — it yields objects and the engine routes them. The engine never “returns” to parse() because of an item completing the pipeline.
Architecture: The Two Paths
When Spider.parse() runs, it can yield two types of objects:
| Yielded Type | Engine Action |
|---|---|
scrapy.Item / dict |
Sent to ItemPipeline for processing |
scrapy.Request |
Sent to Scheduler for future downloading |
Both paths run asynchronously via Scrapy’s Twisted event loop. They do not block each other, and neither returns control back to parse().
Full Data Flow
Spider.parse()
│
├── yield Item ──────────────────────────────────────────────────────────────────────────▶ ItemPipeline
│ │
│ process_item() chain
│ │
│ (Item saved / dropped)
│
└── yield Request ──▶ Scheduler ──▶ Downloader ──▶ Response ──▶ Request.callback()
│
(e.g., Spider.parse_detail)
│
yield Item ──▶ ItemPipeline
yield Request ──▶ Scheduler
Step-by-Step Walkthrough
Step 1 — Spider.parse() is a generator
parse() uses yield to produce objects lazily. The engine calls next() on it, collects what was yielded, and routes it. The method itself does not wait for the pipeline or the next download to complete.
import scrapy
class BookSpider(scrapy.Spider):
name = "books"
start_urls = ["https://books.toscrape.com/"]
def parse(self, response):
# Yield an Item → goes to ItemPipeline
yield {
"title": response.css("h1::text").get(),
"url": response.url,
}
# Yield a Request → goes to Scheduler
next_page = response.css("li.next a::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
Step 2 — The Engine routes each yielded object
Scrapy’s ExecutionEngine receives the yielded objects and immediately dispatches them:
- Item → passed to
ItemPipelineManager.process_item()(runs through every pipeline stage) - Request → passed to
Scheduler.enqueue_request()(queued for downloading)
This routing is synchronous in terms of dispatch, but the actual work (pipeline processing, HTTP download) is asynchronous.
Step 3 — ItemPipeline processes the Item independently
Each ITEM_PIPELINES class in settings.py receives the item in order of priority. Pipeline processing does not pause spider execution or affect Request scheduling.
# settings.py
ITEM_PIPELINES = {
"myproject.pipelines.ValidationPipeline": 100,
"myproject.pipelines.DatabasePipeline": 300,
}
# pipelines.py
class DatabasePipeline:
def process_item(self, item, spider):
# This runs completely independent of any new Request being fetched
self.db.insert(item)
return item
Step 4 — The Scheduler feeds the Downloader
The Scheduler dequeues a Request and passes it to the Downloader. Concurrency is controlled by:
CONCURRENT_REQUESTS— max parallel downloadsCONCURRENT_REQUESTS_PER_DOMAIN— per-domain limit
Step 5 — The callback is invoked with the Response
When the download completes, the engine calls Request.callback(response). This callback is what produces the next wave of Items and Requests — it could be the same parse() method or a different one (e.g., parse_detail).
def parse(self, response):
for book_url in response.css("article.product_pod a::attr(href)"):
# New Request with a different callback
yield response.follow(book_url, callback=self.parse_detail)
# Pagination Request fed back to parse()
next_page = response.css("li.next a::attr(href)").get()
if next_page:
yield response.follow(next_page, callback=self.parse)
def parse_detail(self, response):
yield {
"title": response.css("h1::text").get(),
"price": response.css("p.price_color::text").get(),
}
Mermaid Sequence Diagram
sequenceDiagram
participant Engine
participant Spider
participant Scheduler
participant Downloader
participant Pipeline
Engine->>Spider: parse(response)
Spider-->>Engine: yield Item
Spider-->>Engine: yield Request
Engine->>Pipeline: process_item(item)
Engine->>Scheduler: enqueue(request)
Note over Pipeline,Scheduler: Concurrent — neither blocks the other
Scheduler->>Downloader: fetch(request)
Downloader-->>Engine: response
Engine->>Spider: callback(response)
Spider-->>Engine: yield Item
Spider-->>Engine: yield Request
Engine->>Pipeline: process_item(item)
Engine->>Scheduler: enqueue(request)
Key Takeaways
parse()does not resume after ItemPipeline finishes. The pipeline and new Request fetching run on separate async paths.parse()is re-entered only via aRequestcallback, not by the pipeline completing.- Items and Requests yielded from the same
parse()call are dispatched concurrently. The engine does not wait for the item to complete the pipeline before starting to fetch the next URL. - Dropping an item in the pipeline (raising
DropItem) has zero effect on pending or future Requests. - If you need to send a new Request based on a pipeline result, use
scrapy.crawler.Crawlersignals or restructure the logic inparse()before yielding the item.
Common Misconception
def parse(self, response):
yield {"title": "foo"} # ← does NOT block here waiting for pipeline
yield response.follow("/next") # ← scheduled immediately alongside the item above
Many developers assume the second yield only executes after the item finishes the pipeline. It does not. The engine drains the generator greedily and routes each yielded object as soon as it appears.