Scrapy - Settings



Description

The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.

Designating the Settings

You must notify Scrapy which settings module you are using when you scrape a website. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax.
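
For example, in a Bash-compatible shell, you might point Scrapy at a hypothetical myproject.settings module (the module name is only an illustration) −

export SCRAPY_SETTINGS_MODULE=myproject.settings
scrapy crawl myspider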

Populating the Settings

The following table shows some of the mechanisms by which you can populate the settings, listed in decreasing order of precedence −

Sr.No Mechanism & Description
1

Command line options

Here, the arguments that are passed take the highest precedence, overriding other options. The -s option is used to override one or more settings.

scrapy crawl myspider -s LOG_FILE=scrapy.log
2

Settings per-spider

Spiders can have their own settings that override the project settings by using the custom_settings attribute.

class DemoSpider(scrapy.Spider): 
   name = 'demo'  
   custom_settings = { 
      'SOME_SETTING': 'some value', 
   }
3

Project settings module

Here, you can populate your custom settings, such as adding or modifying settings, in the settings.py file (a minimal sketch is shown after this table).

4

Default settings per-command

Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings.

5

Default global settings

These settings are found in the scrapy.settings.default_settings module.
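
The following is a minimal sketch of a project settings module (settings.py), as described in mechanism 3 above. The project name demo and the module demo.spiders are assumptions used only for illustration −

# settings.py - project settings module
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

# Example overrides; adjust these values to your project's needs.
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 2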

Access Settings

Settings are available through self.settings, which is set in the base Spider class after the spider is initialized.

The following example demonstrates this.

class DemoSpider(scrapy.Spider): 
   name = 'demo' 
   start_urls = ['http://example.com']  
   def parse(self, response): 
      print("Existing settings: %s" % self.settings.attributes.keys()) 

To use settings before the spider is initialized (for example, in your spider's __init__() method), you must override the from_crawler() class method. You can access the settings through the scrapy.crawler.Crawler.settings attribute of the crawler object passed to from_crawler().

The following example demonstrates this.

class MyExtension(object):
   def __init__(self, log_is_enabled = False):
      if log_is_enabled:
         print("Enabled log")

   @classmethod
   def from_crawler(cls, crawler):
      settings = crawler.settings
      return cls(settings.getbool('LOG_ENABLED'))
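
Scrapy itself instantiates extensions and middlewares through the from_crawler() class method, so this is the standard hook for reading settings while a component is being constructed.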

Rationale for Setting Names

Setting names include the name of the component they configure as a prefix. For example, for the robots.txt extension, the setting names are ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, etc.

Built-in Settings Reference

The following table shows the built-in settings of Scrapy −

Sr.No Setting & Description
1

AWS_ACCESS_KEY_ID

It is used to access Amazon Web Services.

Default value: None

2

AWS_SECRET_ACCESS_KEY

It is used to access Amazon Web Services.

Default value: None

3

BOT_NAME

It is the name of the bot that is used for constructing the User-Agent.

Default value: 'scrapybot'

4

CONCURRENT_ITEMS

Maximum number of concurrent items processed in parallel in the item processor.

Default value: 100

5

CONCURRENT_REQUESTS

Maximum number of concurrent requests that the Scrapy downloader performs.

Default value: 16

6

CONCURRENT_REQUESTS_PER_DOMAIN

Maximum number of concurrent requests performed simultaneously to any single domain.

Default value: 8

7

CONCURRENT_REQUESTS_PER_IP

Maximum number of concurrent requests performed simultaneously to any single IP.

Default value: 0

8

DEFAULT_ITEM_CLASS

It is a class used to represent items.

Default value: 'scrapy.item.Item'

9

DEFAULT_REQUEST_HEADERS

These are the default headers used for Scrapy HTTP requests.

Default value −

{
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
}
10

DEPTH_LIMIT

The maximum depth for a spider to crawl any site.

Default value: 0

11

DEPTH_PRIORITY

It is an integer used to alter the priority of a request according to its depth.

Default value: 0

12

DEPTH_STATS

It states whether to collect depth stats or not.

Default value: True

13

DEPTH_STATS_VERBOSE

When this setting is enabled, the number of requests at each depth is collected in the stats.

Default value: False

14

DNSCACHE_ENABLED

It is used to enable the in-memory DNS cache.

Default value: True

15

DNSCACHE_SIZE

It defines the size of the in-memory DNS cache.

Default value: 10000

16

DNS_TIMEOUT

It defines the timeout (in seconds) for DNS queries to be processed.

Default value: 60

17

DOWNLOADER

It is a downloader used for the crawling process.

Default value: 'scrapy.core.downloader.Downloader'

18

DOWNLOADER_MIDDLEWARES

It is a dictionary holding downloader middlewares and their orders (see the sketch after this table).

Default value: {}

19

DOWNLOADER_MIDDLEWARES_BASE

It is a dictionary holding downloader middlewares that are enabled by default.

Default value −

{ 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, }
20

DOWNLOADER_STATS

This setting is used to enable the downloader stats.

Default value: True

21

DOWNLOAD_DELAY

It defines the time (in seconds) the downloader should wait before downloading consecutive pages from the same site.

Default value: 0

22

DOWNLOAD_HANDLERS

It is a dictionary with download handlers.

Default value: {}

23

DOWNLOAD_HANDLERS_BASE

It is a dictionary with download handlers that are enabled by default.

Default value −

{ 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', }
24

DOWNLOAD_TIMEOUT

It is the amount of time (in seconds) the downloader waits before it times out.

Default value: 180

25

DOWNLOAD_MAXSIZE

It is the maximum response size (in bytes) for the downloader to download.

Default value: 1073741824 (1024MB)

26

DOWNLOAD_WARNSIZE

It defines the response size (in bytes) at which the downloader starts to issue a warning.

Default value: 33554432 (32MB)

27

DUPEFILTER_CLASS

It is the class used for detecting and filtering duplicate requests.

Default value: 'scrapy.dupefilters.RFPDupeFilter'

28

DUPEFILTER_DEBUG

When set to true, this setting logs all duplicate requests.

Default value: False

29

EDITOR

It is used to edit spiders using the edit command.

Default value: Depends on the environment

30

EXTENSIONS

It is a dictionary having extensions that are enabled in the project.

Default value: {}

31

EXTENSIONS_BASE

It is a dictionary having built-in extensions.

Default value: { 'scrapy.extensions.corestats.CoreStats': 0, }

32

FEED_TEMPDIR

It sets the custom folder where the crawler's temporary files are stored.

33

ITEM_PIPELINES

It is a dictionary having pipelines.

Default value: {}

34

LOG_ENABLED

It defines if the logging is to be enabled.

Default value: True

35

LOG_ENCODING

It defines the type of encoding to be used for logging.

Default value: 'utf-8'

36

LOG_FILE

It is the name of the file to be used for the output of logging.

Default value: None

37

LOG_FORMAT

It is a string using which the log messages can be formatted.

Default value: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

38

LOG_DATEFORMAT

It is a string using which date/time can be formatted.

Default value: '%Y-%m-%d %H:%M:%S'

39

LOG_LEVEL

It defines minimum log level.

Default value: 'DEBUG'

40

LOG_STDOUT

If this setting is set to true, all of your process output will appear in the log.

Default value: False

41

MEMDEBUG_ENABLED

It defines if the memory debugging is to be enabled.

Default Value: False

42

MEMDEBUG_NOTIFY

It defines the list of addresses to which a memory report is sent when memory debugging is enabled.

Default value: []

43

MEMUSAGE_ENABLED

It defines whether the memory usage extension is enabled; this extension monitors the memory used by the Scrapy process and can shut it down when it exceeds a memory limit.

Default value: False

44

MEMUSAGE_LIMIT_MB

It defines the maximum amount of memory (in megabytes) to be allowed.

Default value: 0

45

MEMUSAGE_CHECK_INTERVAL_SECONDS

It sets the interval (in seconds) at which the current memory usage is checked.

Default value: 60.0

46

MEMUSAGE_NOTIFY_MAIL

It is a list of emails used to notify when the memory limit is reached.

Default value: False

47

MEMUSAGE_REPORT

It defines if the memory usage report is to be sent on closing each spider.

Default value: False

48

MEMUSAGE_WARNING_MB

It defines the amount of memory (in megabytes) to be allowed before a warning notification is sent.

Default value: 0

49

NEWSPIDER_MODULE

It is the module where new spiders are created using the genspider command.

Default value: ''

50

RANDOMIZE_DOWNLOAD_DELAY

It defines whether Scrapy should wait a random amount of time while downloading requests from the site.

Default value: True

51

REACTOR_THREADPOOL_MAXSIZE

It defines a maximum size for the reactor threadpool.

Default value: 10

52

REDIRECT_MAX_TIMES

It defines how many times a request can be redirected.

Default value: 20

53

REDIRECT_PRIORITY_ADJUST

When set, this setting adjusts the priority of a request when it is redirected, relative to the original request.

Default value: +2

54

RETRY_PRIORITY_ADJUST

When set, this setting adjusts the priority of a request when it is retried, relative to the original request.

Default value: -1

55

ROBOTSTXT_OBEY

Scrapy obeys robots.txt policies when set to true.

Default value: False

56

SCHEDULER

It defines the scheduler to be used for the crawl.

Default value: 'scrapy.core.scheduler.Scheduler'

57

SPIDER_CONTRACTS

It is a dictionary in the project having spider contracts to test the spiders.

Default value: {}

58

SPIDER_CONTRACTS_BASE

It is a dictionary holding Scrapy contracts which are enabled in Scrapy by default.

Default value −

{ 
   'scrapy.contracts.default.UrlContract' : 1, 
   'scrapy.contracts.default.ReturnsContract': 2, 
} 
59

SPIDER_LOADER_CLASS

It defines a class which implements SpiderLoader API to load spiders.

Default value: 'scrapy.spiderloader.SpiderLoader'

60

SPIDER_MIDDLEWARES

It is a dictionary holding spider middlewares.

Default value: {}

61

SPIDER_MIDDLEWARES_BASE

It is a dictionary holding spider middlewares that are enabled in Scrapy by default.

Default value −

{ 
   'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, 
}
62

SPIDER_MODULES

It is a list of modules in which Scrapy will look for spiders.

Default value: []

63

STATS_CLASS

It is a class which implements Stats Collector API to collect stats.

Default value: 'scrapy.statscollectors.MemoryStatsCollector'

64

STATS_DUMP

This setting when set to true, dumps the stats to the log.

Default value: True

65

STATSMAILER_RCPTS

Once the spiders finish scraping, Scrapy uses this list of emails to send the stats.

Default value: []

66

TELNETCONSOLE_ENABLED

It defines whether to enable the telnet console.

Default value: True

67

TELNETCONSOLE_PORT

It defines the port range to be used for the telnet console.

Default value: [6023, 6073]

68

TEMPLATES_DIR

It is a directory containing templates that can be used while creating new projects.

Default value: templates directory inside scrapy module

69

URLLENGTH_LIMIT

It defines the maximum URL length to be allowed for crawled URLs.

Default value: 2083

70

USER_AGENT

It defines the user agent to be used while crawling a site.

Default value: "Scrapy/VERSION (+http://scrapy.org)"

For other Scrapy settings, refer to the official Scrapy documentation.
