Natural Language Toolkit - Transforming Trees



Following are the two reasons to transform the trees −

  • To modify deep parse tree and
  • To flatten deep parse trees

Converting Tree or Subtree to Sentence

The first recipe we are going to discuss here is to convert a Tree or subtree back to a sentence or chunk string. This is very simple, let us see in the following example −

Example

from nltk.corpus import treebank_chunk
tree = treebank_chunk.chunked_sents()[2]
' '.join([w for w, t in tree.leaves()])

Output

'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC , was named a nonexecutive director of this British industrial
conglomerate .'

Deep tree flattening

Deep trees of nested phrases can’t be used for training a chunk hence we must flatten them before using. In the following example, we are going to use 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus.

Example

To achieve this, we are defining a function named deeptree_flat() that will take a single Tree and will return a new Tree that keeps only the lowest level trees. In order to do most of the work, it uses a helper function which we named as childtree_flat().

from nltk.tree import Tree
def childtree_flat(trees):
   children = []
   for t in trees:
      if t.height() < 3:
         children.extend(t.pos())
      elif t.height() == 3:
         children.append(Tree(t.label(), t.pos()))
      else:
         children.extend(flatten_childtrees([c for c in t]))
   return children
def deeptree_flat(tree):
   return Tree(tree.label(), flatten_childtrees([c for c in tree]))

Now, let us call deeptree_flat() function on 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named deeptree.py.

from deeptree import deeptree_flat
from nltk.corpus import treebank
deeptree_flat(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]),
(',', ','), Tree('NP', [('55', 'CD'), 
('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'),
Tree('NP', [('former', 'JJ'), 
('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'), 
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 
'NNP')]), (',', ','), ('was', 'VBD'), 
('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]), 
Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]),
('of', 'IN'), Tree('NP', 
[('this', 'DT'), ('British', 'JJ'), 
('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

Building Shallow tree

In the previous section, we flatten a deep tree of nested phrases by only keeping the lowest level subtrees. In this section, we are going to keep only the highest-level subtrees i.e. to build the shallow tree. In the following example we are going to use 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus.

Example

To achieve this, we are defining a function named tree_shallow() that will eliminate all the nested subtrees by keeping only the top subtree labels.

from nltk.tree import Tree
def tree_shallow(tree):
   children = []
   for t in tree:
      if t.height() < 3:
         children.extend(t.pos())
      else:
         children.append(Tree(t.label(), t.pos()))
   return Tree(tree.label(), children)

Now, let us call tree_shallow() function on 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named shallowtree.py.

from shallowtree import shallow_tree
from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), 
('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), 
('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), 
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]), 
Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), 
('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), 
('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

We can see the difference with the help of getting the height of the trees −

from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2]).height()

Output

3

from nltk.corpus import treebank
treebank.parsed_sents()[2].height()

Output

9

Tree labels conversion

In parse trees there are variety of Tree label types that are not present in chunk trees. But while using parse tree to train a chunker, we would like to reduce this variety by converting some of Tree labels to more common label types. For example, we have two alternative NP subtrees namely NP-SBL and NP-TMP. We can convert both of them into NP. Let us see how to do it in the following example.

Example

To achieve this we are defining a function named tree_convert() that takes following two arguments −

  • Tree to convert
  • A label conversion mapping

This function will return a new Tree with all matching labels replaced based on the values in the mapping.

from nltk.tree import Tree
def tree_convert(tree, mapping):
   children = []
   for t in tree:
      if isinstance(t, Tree):
         children.append(convert_tree_labels(t, mapping))
      else:
         children.append(t)
   label = mapping.get(tree.label(), tree.label())
   return Tree(label, children)

Now, let us call tree_convert() function on 3rd parsed sentence, which is deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named converttree.py.

from converttree import tree_convert
from nltk.corpus import treebank
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
convert_tree_labels(treebank.parsed_sents()[2], mapping)

Output

Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']), 
Tree('NNP', ['Agnew'])]), Tree(',', [',']), 
Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']), 
Tree('NNS', ['years'])]), 
Tree('JJ', ['old'])]), Tree('CC', ['and']), 
Tree('NP', [Tree('NP', [Tree('JJ', ['former']), 
Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']), 
Tree('NP', [Tree('NNP', ['Consolidated']), 
Tree('NNP', ['Gold']), Tree('NNP', ['Fields']), 
Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]), 
Tree('VP', [Tree('VBD', ['was']),Tree('VP', [Tree('VBN', ['named']), 
Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]), 
Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']), 
Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]), 
Tree('PP', [Tree('IN', ['of']), Tree('NP', 
[Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']), 
Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])
Advertisements