How can TensorFlow and Python be used to get the code points of every word in a sentence?


To get the code points of every word in a sentence, we first check, for each character, whether it is the start of a word. Next, we find the index of each word-starting character in the flattened list of characters from all sentences. With these indices, the code points of every character in every word can be grouped using the method shown below.

The script identifiers help determine where the word boundaries should be added. A word boundary is added at the beginning of each sentence and at each character whose script differs from that of the previous character. The resulting start offsets can be used to build a RaggedTensor that contains the list of words from all sentences in the batch.
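The boundary rule described above can be sketched in plain Python, without TensorFlow. This is only an illustration: the function name find_word_starts is hypothetical, and the script IDs are hard-coded on the assumption that they match what tf.strings.unicode_script would return for the two sentences 'Hello, there.' and '世界こんにちは' (25 = Latin, 0 = Common, 17 = Han, 20 = Hiragana in ICU's numbering).

```python
# Script ID of every character, one list per sentence (assumed values, see above).
sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],  # 'Hello, there.'
    [17, 17, 20, 20, 20, 20, 20],                        # '世界こんにちは'
]


def find_word_starts(scripts_per_sentence):
    """Return the flattened indices of the characters that begin a word."""
    starts = []
    offset = 0  # position of the current sentence's first character in the flat list
    for scripts in scripts_per_sentence:
        for i, script in enumerate(scripts):
            # A word starts at the first character of a sentence,
            # or wherever the script differs from the previous character's.
            if i == 0 or script != scripts[i - 1]:
                starts.append(offset + i)
        offset += len(scripts)
    return starts


word_starts = find_word_starts(sentence_char_script)
print(word_starts)
```

The TensorFlow version later in this article computes the same offsets with vectorized ops instead of Python loops.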


Let us understand how to represent Unicode strings in Python and manipulate them using the Unicode equivalents of standard string ops. Here, the Unicode strings are separated into tokens based on script detection.

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

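import tensorflow as tf

# Input sentences (assumed from the code points in the output); decode to code points and script IDs
sentence_texts = [u'Hello, there.', u'世界こんにちは']
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

print("Check if sentence is the start of the word")
# True where a character is first in its sentence or has a different script than the previous one
sentence_char_starts_word = tf.concat(
   [tf.fill([sentence_char_script.nrows(), 1], True),
    tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
   axis=1)
print("Check if index of character starts from specific index of word in flattened list of characters from all sentences")
# Indices of the word-starting characters in the flattened list of characters
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
print("Get the code point of every character in every word")
# Group the flat code points into one row per word
word_char_codepoint = tf.RaggedTensor.from_row_starts(
   values=sentence_char_codepoint.values,
   row_starts=word_starts)
print(word_char_codepoint)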

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Check if sentence is the start of the word
Check if index of character starts from specific index of word in flattened list of characters from all sentences
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
Get the code point of every character in every word
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
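To see how the start offsets produce the rows in this output, the grouping can be mimicked in plain Python. This is a rough sketch of the idea behind tf.RaggedTensor.from_row_starts, not its actual implementation; the helper name split_by_row_starts is hypothetical, and the two input sentences are assumed from the code points printed above.

```python
def split_by_row_starts(values, row_starts):
    """Split a flat list into rows, where row i spans row_starts[i] up to the next start."""
    # The last row runs to the end of the flat list.
    row_ends = list(row_starts[1:]) + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]


# Flat list of code points for both sentences (assumed inputs).
flat_codepoints = [ord(ch) for ch in 'Hello, there.' + '世界こんにちは']

words = split_by_row_starts(flat_codepoints, [0, 5, 7, 12, 13, 15])
print(words)
```

Each start offset opens a new row, so six offsets give six words, and the punctuation runs ', ' and '.' become rows of their own because their script differs from the letters around them.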

Explanation

  • The script identifiers help in determining where the word boundaries should be added.
  • A word boundary is added at the beginning of every sentence and for each character whose script is different from its previous character.
  • Next, these start offsets can be used to build a RaggedTensor.
  • This RaggedTensor contains the list of words from all sentences in the batch.
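To make the result easier to read, the code points of each word can be encoded back into strings. The following is a minimal sketch that assumes TensorFlow 2.x with eager execution; it rebuilds the RaggedTensor from literal values (the sentences and start offsets are taken from the output above) so that it runs on its own, and the variable names are illustrative.

```python
import tensorflow as tf

# Flat code points for the two sentences, and the word start offsets from the output.
flat_codepoints = [ord(ch) for ch in 'Hello, there.' + '世界こんにちは']
word_starts = [0, 5, 7, 12, 13, 15]

# One row of code points per word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=tf.constant(flat_codepoints, dtype=tf.int32),
    row_starts=tf.constant(word_starts, dtype=tf.int64))

# Encode each row of code points back into a UTF-8 string.
words = tf.strings.unicode_encode(word_char_codepoint, 'UTF-8')

# Convert the byte strings to Python str for display.
decoded_words = [w.decode('utf-8') for w in words.numpy()]
print(decoded_words)
```

Printing the decoded list is a quick way to confirm that the script-based boundaries split the text where expected.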

Updated on: 20-Feb-2021
