Scrapy - Other Settings



The following table lists other Scrapy settings:

Sr.No Setting & Description
1

AJAXCRAWL_ENABLED

It enables the AjaxCrawlMiddleware, which you may want to turn on for broad crawls.

Default value: False

2

AUTOTHROTTLE_DEBUG

When enabled, it displays stats on every received response, so you can see how the throttling parameters are adjusted in real time.

Default value: False

3

AUTOTHROTTLE_ENABLED

It is used to enable the AutoThrottle extension.

Default value: False

4

AUTOTHROTTLE_MAX_DELAY

It sets the maximum download delay (in seconds) to be applied in case of high latencies.

Default value: 60.0

5

AUTOTHROTTLE_START_DELAY

It sets the initial download delay (in seconds).

Default value: 5.0

6

AUTOTHROTTLE_TARGET_CONCURRENCY

It defines the average number of requests that Scrapy should send in parallel to remote sites.

Default value: 1.0

7

CLOSESPIDER_ERRORCOUNT

It defines the total number of errors that must be received before the spider is closed. A value of 0 disables this limit.

Default value: 0

8

CLOSESPIDER_ITEMCOUNT

It defines the number of scraped items after which the spider is closed.

Default value: 0

9

CLOSESPIDER_PAGECOUNT

It defines the maximum number of responses to crawl before the spider is closed.

Default value: 0

10

CLOSESPIDER_TIMEOUT

It defines the amount of time (in seconds) a spider may stay open before it is closed.

Default value: 0

11

COMMANDS_MODULE

It names the module in which Scrapy looks up the custom commands that you add to your project.

Default value: ''

12

COMPRESSION_ENABLED

It indicates whether the compression middleware is enabled.

Default value: True

13

COOKIES_DEBUG

If set to True, all the cookies sent in requests and received in responses are logged.

Default value: False

14

COOKIES_ENABLED

It indicates whether the cookies middleware is enabled. If it is disabled, no cookies are sent to web servers.

Default value: True

15

FILES_EXPIRES

It defines how long a downloaded file is kept before it is considered expired.

Default value: 90 days

16

FILES_RESULT_FIELD

It is set when you want to use another field name for your processed files.

17

FILES_STORE

It sets the location where the downloaded files are stored. It must be set to a valid value.

18

FILES_STORE_S3_ACL

It is used to modify the ACL policy for the files stored in an Amazon S3 bucket.

Default value: private

19

FILES_URLS_FIELD

It is set when you want to use another field name for your file URLs.

20

HTTPCACHE_ALWAYS_STORE

If this setting is enabled, the spider caches pages unconditionally.

Default value: False

21

HTTPCACHE_DBM_MODULE

It is the database module used by the DBM storage backend.

Default value: 'anydbm'

22

HTTPCACHE_DIR

It is the directory used to store the HTTP cache. If it is empty, the HTTP cache is disabled.

Default value: 'httpcache'

23

HTTPCACHE_ENABLED

It indicates whether the HTTP cache is enabled.

Default value: False

24

HTTPCACHE_EXPIRATION_SECS

It sets the expiration time (in seconds) for cached requests. A value of 0 means that cached requests never expire.

Default value: 0

25

HTTPCACHE_GZIP

If this setting is set to True, all the cached data is compressed with gzip.

Default value: False

26

HTTPCACHE_IGNORE_HTTP_CODES

It lists the HTTP status codes for which responses are not cached.

Default value: []

27

HTTPCACHE_IGNORE_MISSING

If this setting is enabled, requests that are not found in the cache are ignored instead of being downloaded.

Default value: False

28

HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS

It is a list of Cache-Control directives in responses to be ignored.

Default value: []

29

HTTPCACHE_IGNORE_SCHEMES

It lists the URI schemes for which responses are not cached.

Default value: ['file']

30

HTTPCACHE_POLICY

It defines the class implementing the cache policy.

Default value: 'scrapy.extensions.httpcache.DummyPolicy'

31

HTTPCACHE_STORAGE

It is the class implementing the cache storage backend.

Default value: 'scrapy.extensions.httpcache.FilesystemCacheStorage'

32

HTTPERROR_ALLOWED_CODES

It is a list of non-200 status codes. Responses with these status codes are passed through instead of being filtered out.

Default value: []

33

HTTPERROR_ALLOW_ALL

When this setting is enabled, all responses are passed regardless of their status codes.

Default value: False

34

HTTPPROXY_AUTH_ENCODING

It is the default encoding used for proxy authentication in HttpProxyMiddleware.

Default value: "latin-1"

35

IMAGES_EXPIRES

It defines how long a downloaded image is kept before it is considered expired.

Default value: 90 days

36

IMAGES_MIN_HEIGHT

It is used to drop images whose height is below this minimum.

37

IMAGES_MIN_WIDTH

It is used to drop images whose width is below this minimum.

38

IMAGES_RESULT_FIELD

It is set when you want to use another field name for your processed images.

39

IMAGES_STORE

It sets the location where the downloaded images are stored. It must be set to a valid value.

40

IMAGES_STORE_S3_ACL

It is used to modify the ACL policy for the images stored in an Amazon S3 bucket.

Default value: private

41

IMAGES_THUMBS

It is set to create thumbnails of the downloaded images. It is a dictionary that maps each thumbnail name to its dimensions.

42

IMAGES_URLS_FIELD

It is set when you want to use another field name for your image URLs.

43

MAIL_FROM

It is the sender address (the From: header) used when sending emails.

Default value: 'scrapy@localhost'

44

MAIL_HOST

It is the SMTP host used to send emails.

Default value: 'localhost'

45

MAIL_PASS

It is the password used for SMTP authentication.

Default value: None

46

MAIL_PORT

It is the SMTP port used to send emails.

Default value: 25

47

MAIL_SSL

It enforces connecting over an SSL-encrypted connection.

Default value: False

48

MAIL_TLS

When enabled, it enforces the use of STARTTLS for the connection.

Default value: False

49

MAIL_USER

It defines the user for SMTP authentication.

Default value: None

50

METAREFRESH_ENABLED

It indicates whether the meta refresh middleware is enabled.

Default value: True

51

METAREFRESH_MAXDELAY

It is the maximum meta-refresh delay (in seconds) for which the redirection is followed.

Default value: 100

52

REDIRECT_ENABLED

It indicates whether the redirect middleware is enabled.

Default value: True

53

REDIRECT_MAX_TIMES

It defines the maximum number of redirections that are followed for a single request.

Default value: 20

54

REFERER_ENABLED

It indicates whether the referer middleware is enabled.

Default value: True

55

RETRY_ENABLED

It indicates whether the retry middleware is enabled.

Default value: True

56

RETRY_HTTP_CODES

It defines which HTTP response codes are retried.

Default value: [500, 502, 503, 504, 408]

57

RETRY_TIMES

It defines the maximum number of times a request is retried, in addition to the first download.

Default value: 2

58

TELNETCONSOLE_HOST

It defines the interface on which the telnet console listens.

Default value: '127.0.0.1'

59

TELNETCONSOLE_PORT

It defines the port range to be used for the telnet console.

Default value: [6023, 6073]
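
As a rough illustration of how some of the settings listed above fit together, the following is a minimal sketch of a project settings.py fragment. The setting names come from the table. The values are illustrative assumptions rather than recommendations, and the grouping by feature is only one way to organize the file.

```python
# settings.py (sketch): every value below is an illustrative assumption.

# AutoThrottle: adapt the download delay to the observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound on the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False              # True logs throttling stats for every response

# Closing conditions: stop the spider once a limit is reached (0 disables a limit).
CLOSESPIDER_TIMEOUT = 3600      # seconds the spider may stay open
CLOSESPIDER_ITEMCOUNT = 1000    # scraped items
CLOSESPIDER_PAGECOUNT = 0       # crawled responses (disabled here)
CLOSESPIDER_ERRORCOUNT = 10     # received errors

# HTTP cache: keep responses on disk to avoid downloading them again.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_EXPIRATION_SECS = 86400                   # one day; 0 means never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # do not cache these responses
HTTPCACHE_GZIP = True                               # compress the cached data

# Retry and redirect limits.
RETRY_ENABLED = True
RETRY_TIMES = 2                                  # retries in addition to the first download
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
REDIRECT_MAX_TIMES = 20
```

Settings that are left out of the file keep the default values shown in the table, so a project typically lists only the ones it overrides.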
