Scrapy: Why custom_settings on Your Spider Gets Ignored
The Problem
You define custom_settings on your Spider class, expecting Scrapy to pick them up — but they silently have no effect. The spider runs with default (or externally provided) settings as if custom_settings doesn’t exist.
import scrapy


class DocSpider(scrapy.Spider):
    name = "doc"
    custom_settings = {
        "DEPTH_LIMIT": 3,
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 0.5,
        "RETRY_TIMES": 3,
        "RETRY_HTTP_CODES": [500, 502, 503, 504, 408],
    }
Despite this, none of these settings actually take effect at runtime.
Root Cause
The issue arises when you instantiate Crawler directly and pass in external settings, instead of using the standard CrawlerProcess or CrawlerRunner.
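# Direct instantiation: Spider.custom_settings may be bypassed on this path.
# Note: Crawler expects the spider class as its first argument.
crawler = Crawler(DocSpider, settings=cfg.crawler_settings)
deferred = crawler.crawl(urls=urls)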
How custom_settings Normally Works
In Scrapy’s standard flow:
- Crawler is initialized with base settings.
- When crawler.crawl(...) is called, Scrapy internally invokes _apply_settings().
- _apply_settings() merges the Spider’s custom_settings into the active settings object.
This merge step is guaranteed when you use CrawlerProcess or CrawlerRunner, because they follow the full initialization protocol.
What Goes Wrong with Direct Crawler Instantiation
When you create a Crawler instance directly, the crawl() method may skip the _apply_settings() call. In that case, custom_settings on the Spider class is never read, and the settings you passed into the Crawler constructor are the only ones in effect.
| Approach | custom_settings Applied? |
|---|---|
| CrawlerProcess / CrawlerRunner | Yes, merged automatically |
| Direct Crawler(settings=...) | No, unless _apply_settings() is explicitly called |
How to Verify
Print the active settings after creating the crawler to confirm what is actually loaded:
crawler = Crawler(DocSpider, settings=cfg.crawler_settings)
print(dict(crawler.settings))  # Check whether the custom_settings values appear
If DEPTH_LIMIT, CONCURRENT_REQUESTS, etc. are missing or have default values, you know the merge didn’t happen.
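Scanning a full settings dump by eye is error-prone, so the comparison can be automated. This is a small sketch with a hypothetical helper, unapplied_settings(), which is not part of Scrapy; the active dict is a stand-in for dict(crawler.settings). Depending on the Scrapy version, the merge may only happen once crawl() has been called, so running the check from inside the spider (for example against self.settings) is likely more reliable than running it right after construction.

```python
from typing import Any, Mapping


def unapplied_settings(expected: Mapping[str, Any], active: Mapping[str, Any]) -> dict:
    """Return {key: (expected, actual)} for each expected value that is not active.

    A key that is absent from `active` is reported with an actual value of None.
    """
    return {
        key: (want, active.get(key))
        for key, want in expected.items()
        if active.get(key) != want
    }


custom = {"DEPTH_LIMIT": 3, "CONCURRENT_REQUESTS": 8, "DOWNLOAD_DELAY": 0.5}
active = {"DEPTH_LIMIT": 0, "CONCURRENT_REQUESTS": 8}  # stand-in for dict(crawler.settings)

report = unapplied_settings(custom, active)
print(report)
```

An empty report means every value from custom_settings is in effect; any entry points at a setting that was not merged or was overridden elsewhere.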
Solutions
Option 1: Manually Merge Spider Settings
Before constructing the Crawler, read the Spider’s custom_settings and merge them into your configuration:
from scrapy.crawler import Crawler
from scrapy.utils.log import configure_logging
from scrapy.utils.reactor import install_reactor


def main(urls, crawler_cfg_file):
    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    from twisted.internet import reactor  # import only after the reactor is installed

    cfg = CrawlerCfg.from_yaml(crawler_cfg_file)
    # Manually merge the Spider's custom_settings into the config
    if getattr(DocSpider, "custom_settings", None):
        cfg.crawler_settings.update(DocSpider.custom_settings)

    configure_logging(cfg.crawler_settings)
    # Crawler expects the spider class as its first argument
    crawler = Crawler(DocSpider, settings=cfg.crawler_settings)
    deferred = crawler.crawl(urls=urls)
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()
This ensures the Spider’s settings are present before the Crawler is initialized.
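One caveat with the in-place update() is that it mutates the loaded configuration. A non-mutating variant might look like the sketch below; merged_settings() and FakeSpider are hypothetical names used only for illustration, and plain dicts stand in for the YAML-derived settings.

```python
from typing import Any, Mapping, Optional


def merged_settings(base: Mapping[str, Any], custom: Optional[Mapping[str, Any]]) -> dict:
    """Return a new dict: base values overlaid with the spider's custom_settings."""
    merged = dict(base)           # copy, so the loaded config is left untouched
    merged.update(custom or {})   # custom_settings may be None on a spider
    return merged


class FakeSpider:
    """Stand-in for a spider class; only custom_settings matters here."""
    custom_settings = {"DEPTH_LIMIT": 3, "DOWNLOAD_DELAY": 0.5}


yaml_settings = {"DEPTH_LIMIT": 1, "LOG_LEVEL": "INFO"}
effective = merged_settings(yaml_settings, FakeSpider.custom_settings)
print(effective)
```

In this ordering the spider's values win on conflict, which mirrors how custom_settings overrides project settings in a standard run. If the config file should take precedence instead, swap the order of the two sources.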
Option 2: Move Settings to the YAML Config
Remove custom_settings from the Spider class entirely and define everything in your external configuration file:
# crawler.yaml
DEPTH_LIMIT: 3
CONCURRENT_REQUESTS: 8
DOWNLOAD_DELAY: 0.5
RETRY_TIMES: 3
RETRY_HTTP_CODES: [500, 502, 503, 504, 408]
This is the recommended approach. It keeps all configuration in one place and avoids the split-brain problem of settings defined in two locations.
Option 3: Use CrawlerProcess / CrawlerRunner
If your architecture allows it, switch to Scrapy’s standard entry points so that custom_settings is merged automatically:
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess(settings=cfg.crawler_settings)
process.crawl(DocSpider, urls=urls)
process.start()
Key Takeaway
Scrapy’s custom_settings class attribute relies on the framework calling _apply_settings() during crawl setup. When you bypass the standard CrawlerProcess / CrawlerRunner flow and instantiate Crawler directly, that merge step may be skipped. There are three fixes: merge the settings manually, centralize them in your config file, or use the standard Scrapy entry points.