Scrapy - Extracting Items



Description

For extracting data from web pages, Scrapy uses a mechanism called selectors, based on XPath and CSS expressions. Following are some examples of XPath expressions (a short sketch trying them out follows this list) −

  • /html/head/title − This will select the <title> element inside the <head> element of an HTML document.

  • /html/head/title/text() − This will select the text within the same <title> element.

  • //td − This will select all the <td> elements.

  • //div[@class = "slice"] − This will select all <div> elements that have an attribute class = "slice".
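
The following is a minimal sketch that tries these expressions with scrapy.Selector on a small HTML snippet (the snippet and its contents are hypothetical, made up for illustration) −

from scrapy.selector import Selector

# A hypothetical HTML snippet to exercise the XPath expressions above
html = """
<html>
   <head><title>Sample Page</title></head>
   <body>
      <table><tr><td>Cell 1</td><td>Cell 2</td></tr></table>
      <div class="slice">First slice</div>
   </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title').extract())          # the whole <title> element
print(sel.xpath('/html/head/title/text()').extract())   # ['Sample Page']
print(sel.xpath('//td').extract())                      # both <td> elements
print(sel.xpath('//div[@class = "slice"]').extract())   # the <div> with class = "slice"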

Selectors have four basic methods, as shown in the following table −

Sr.No   Method & Description

1       extract() − Returns the selected data as a unicode string.

2       re() − Returns a list of unicode strings, extracted by applying the regular expression given as an argument.

3       xpath() − Returns a list of selectors representing the nodes selected by the XPath expression given as an argument.

4       css() − Returns a list of selectors representing the nodes selected by the CSS expression given as an argument.
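
As a quick illustration, the following sketch exercises all four methods on a small, hypothetical snippet −

from scrapy.selector import Selector

# Hypothetical snippet for demonstration
sel = Selector(text='<ul><li><a href="/book1">Book 1: Basics</a></li></ul>')

links = sel.xpath('//li/a')                      # xpath() returns a list of selectors
same_links = sel.css('li a')                     # css() selects the same nodes
print(links.extract())                           # ['<a href="/book1">Book 1: Basics</a>']
print(links.xpath('text()').re(r'Book (\d+)'))   # ['1'] − re() applies a regular expression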

Using Selectors in the Shell

To demonstrate the selectors with the built-in Scrapy shell, you should have IPython installed on your system (without it, the shell falls back to a plain Python console). The important thing here is that the URLs should be enclosed in quotes while running Scrapy; otherwise, URLs with '&' characters won't work. You can start a shell by using the following command in the project's top-level directory −

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

The shell will look like the following −

[ ... Scrapy log here ... ]

2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x3636b50>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x3fadc50>
[s]   spider     <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]:
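
The shortcuts listed in the banner can be called directly. For example, fetch() downloads another page and updates the local objects (such as response) without restarting the shell; a sketch using another URL from the same site −

fetch("http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/")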

When the shell loads, you can access the body or headers by using response.body and response.headers respectively. Similarly, you can run queries on the response using response.selector.xpath() or response.selector.css(), which are also available through the shortcuts response.xpath() and response.css().

For instance −

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>My Book - Scrapy'>]

In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>My Book - Scrapy: Index: Chapters</title>']

In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'My Book - Scrapy: Index:'>]

In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'My Book - Scrapy: Index: Chapters']

In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Scrapy', u'Index']
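
The same queries can also be written with CSS expressions, using Scrapy's ::text extension to select text nodes; a sketch of the equivalents of the queries above −

response.css('title')                   # same nodes as response.xpath('//title')
response.css('title::text').extract()   # same result as the //title/text() query above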

Extracting the Data

To extract data from a normal HTML site, we have to inspect the source code of the site to work out the XPaths. On inspecting it, you can see that the data is inside a <ul> tag; select the elements within the <li> tags.

The following lines of code show the extraction of different types of data −

For selecting the data within the <li> tags −

response.xpath('//ul/li')

For selecting descriptions −

response.xpath('//ul/li/text()').extract()

For selecting site titles −

response.xpath('//ul/li/a/text()').extract()

For selecting site links −

response.xpath('//ul/li/a/@href').extract()

The following code demonstrates the use of the above extractors −

import scrapy

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
      "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]
   
   def parse(self, response):
      # Each <li> holds one listing: a link (title and URL) and a text description
      for sel in response.xpath('//ul/li'):
         title = sel.xpath('a/text()').extract()
         link = sel.xpath('a/@href').extract()
         desc = sel.xpath('text()').extract()
         print(title, link, desc)
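
To try this spider, run the following command from the project's top-level directory; the extracted titles, links, and descriptions are printed to the console −

scrapy crawl project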