Scrapy - Selectorlist Objects

Selector Examples on HTML Response

The following examples work on an HTML response. Assume an HtmlResponse object named html_response, from which a Selector is instantiated as follows −

res = Selector(html_response)

You can select the h2 elements from the HTML response body, which returns a SelectorList object −

>>> res.xpath("//h2")

To get the h2 elements as a list of Unicode strings, call extract() on the result −

>>> res.xpath("//h2").extract()

It returns the h2 elements, including their tags. To get only the text, select the text nodes −

>>> res.xpath("//h2/text()").extract()

It returns the text inside each h2 tag, without the h2 tags themselves.

You can iterate over the p tags and print the class attribute of each one −

for ele in res.xpath("//p"):
    print(ele.xpath("@class").extract())

Selector Examples on XML Response

The following examples work on an XML response. Assume an XmlResponse object named xml_response, from which a Selector is instantiated as follows −

res = Selector(xml_response)

You can select the description elements from the XML response body, which returns a SelectorList object −

>>> res.xpath("//description")

You can get the price value from the Google Base XML feed by registering a namespace as −

>>> res.register_namespace("g", "http://base.google.com/ns/1.0")
>>> res.xpath("//g:price").extract()

Removing Namespaces

When working on a Scrapy project, it is often convenient to drop the namespaces with the Selector.remove_namespaces() method, so that plain element names can be used in XPath expressions.

There are two reasons why namespace removal is not performed by default −

Removing namespaces requires iterating over the document and modifying every element, which is an expensive operation to run on every document that Scrapy crawls.
In some cases the namespaces are actually needed, because element names from different namespaces may clash. Such cases are rare.
