Scrapy - XPath Tips

Using Text Nodes in a Condition

When you need the text content of a node as an argument to an XPath string function, use . (dot) instead of .//text(). The expression .//text() yields a collection of text nodes, called a node-set, and when a node-set is converted to a string, only its first node is used.

For instance −

from scrapy import Selector
val = Selector(text = '<a href = "#">More Info<strong>click here</strong></a>')

The following expression returns the node-set of all the text nodes under the a element −

>>> val.xpath('//a//text()').extract()

It will display the elements as −

[u'More Info',u'click here']

When that node-set is converted to a string with string(), only its first text node is kept −

>>> val.xpath("string(//a[1]//text())").extract()

It will display the result as −

[u'More Info']

Difference Between //node[1] and (//node)[1]

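The expression //node[1] selects every node that is the first of its kind under its own parent, so it can return several elements. The expression (//node)[1] first collects all the matching nodes in the document and then selects only the first one.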

For instance −

from scrapy import Selector
val = Selector(text = """
   <ul class = "list">
      <li>one</li>
      <li>two</li>
      <li>three</li>
   </ul>
   
   <ul class = "list">
      <li>four</li>
      <li>five</li>
      <li>six</li>
   </ul>""")
res = lambda x: val.xpath(x).extract()

The following line displays all the first li elements defined under their respective parents −

>>> res("//li[1]")

It will display the result as −

[u'<li>one</li>', u'<li>four</li>']

You can get the first li element of the complete document as shown below −

>>> res("(//li)[1]")

It will display the result as −

[u'<li>one</li>']

You can also display all the first li elements defined under a ul parent −

>>> res("//ul//li[1]")

It will display the result as −

[u'<li>one</li>', u'<li>four</li>']

You can get the first li element defined under a ul parent in the whole document as shown below −

>>> res("(//ul//li)[1]")

It will display the result as −

[u'<li>one</li>']

Built-in Selectors Reference

The built-in selectors include the following class −

class scrapy.selector.Selector(response = None, text = None, type = None)

The above class contains the following parameters −

response − An HtmlResponse or XmlResponse object from which the data is selected and extracted.
text − A unicode string or UTF-8 encoded text, used when no response is available.
type − The selector type: "html" for HtmlResponse, "xml" for XmlResponse, or None (the default). When type is None, the selector picks the type from the response type, or falls back to "html" when it is used with text.

The built-in selectors contain the following methods −

Sr.No	Method & Description
1	xpath(query) Finds the nodes that match the XPath query and returns the result as a SelectorList instance. The parameter query is a string containing the XPath query to apply.
2	css(query) Applies the given CSS selector and returns a SelectorList instance. The parameter query is a string containing the CSS selector to apply.
3	extract() Serializes the matched node and returns it as a single unicode string.
4	re(regex) Applies the given regular expression and returns the matches as a list of unicode strings. The parameter regex can be either a compiled regular expression or a string, which is compiled using re.compile(regex).
5	register_namespace(prefix, uri) Registers the given namespace to be used in the selector. Without registering it, you cannot select or extract data from a non-standard namespace.
6	remove_namespaces() Removes all namespaces, which allows the document to be traversed using namespace-less XPaths.
7	__nonzero__() Returns True if any real content is selected, otherwise False. This is the Python 2 truth-value hook; Python 3 uses __bool__() for the same purpose.

SelectorList Objects

class scrapy.selector.SelectorList

The SelectorList objects contain the following methods −

Sr.No	Method & Description
1	xpath(query) Calls the .xpath() method on each element of the list and returns the combined results as a single SelectorList instance. The parameter query is the same argument as in the Selector.xpath() method.
2	css(query) Calls the .css() method on each element of the list and returns the combined results as a single SelectorList instance. The parameter query is the same argument as in the Selector.css() method.
3	extract() Calls the .extract() method on each element of the list and returns the results as a list of unicode strings.
4	re(regex) Calls the .re() method on each element of the list and returns the combined matches as a list of unicode strings.
5	__nonzero__() Returns True if the list is not empty, otherwise False.

SelectorList objects build on the selector concepts explained in scrapy_selectors.htm.
