Scrapy - SelectorList Objects



Selector Examples on HTML Response

The following examples work on an HTML response. Assume html_response is an HtmlResponse object; a selector is instantiated from it as follows −

res = Selector(html_response)

You can select the h2 elements from the HTML response body; the call returns a SelectorList object −

>>> res.xpath("//h2")

To get the selected h2 elements as a list of strings, call extract() on the result −

>>> res.xpath("//h2").extract()

This returns the h2 elements, tags included. Similarly −

>>> res.xpath("//h2/text()").extract()

returns only the text inside each h2 element, without the h2 tags.

You can iterate over the p elements and print the class attribute of each −

for ele in res.xpath("//p"):
    print(ele.xpath("@class").extract())

Selector Examples on XML Response

The following examples work on an XML response. Assume xml_response is an XmlResponse object; a selector is instantiated from it as follows −

res = Selector(xml_response)

You can select the description elements from the XML response body; the call returns a SelectorList object −

>>> res.xpath("//description")

To get the price values from a Google Base XML feed, first register the g namespace and then use its prefix in the XPath −

>>> res.register_namespace("g", "http://base.google.com/ns/1.0")
>>> res.xpath("//g:price").extract()

Removing Namespaces

When working on scraping projects, it is often convenient to remove the namespaces with the Selector.remove_namespaces() method and work with plain element names, which makes for simpler XPaths.

There are two reasons why Scrapy does not apply the namespace removal procedure to every document by default −

  • Removing namespaces requires iterating over the document and modifying all of its elements, which is an expensive operation to perform on every document Scrapy crawls.

  • In some cases you actually need the namespaces, because element names from different namespaces would clash once the namespaces are removed. Such cases are rare.
