How can TensorFlow and Python be used to get the code points of every word in a sentence?
TensorFlow provides Unicode handling for processing multilingual text. To get the code points of every word in a sentence, we detect word boundaries using the script identifier of each character and then group the characters' Unicode code points by word. Splitting on script changes is only a rough approximation of word segmentation, but it is enough to illustrate the technique.
The process involves three main steps: detecting word boundaries, finding the position where each word starts, and creating a RaggedTensor that holds the code points of each word.
Prerequisites
We are using Google Colaboratory to run the code below. Google Colab provides free access to GPUs and requires zero configuration for running TensorFlow code.
import tensorflow as tf
# Sample multilingual sentences with their character code points and scripts
# This assumes you have already processed text into character-level representations
# For demonstration, let's create sample data structures
# Example sentence: "Hello, there.世界こんにちは"
sentence_char_codepoint = tf.ragged.constant([
    [72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46, 19990, 30028, 12371, 12435, 12395, 12385, 12399]
])
sentence_char_script = tf.ragged.constant([
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0, 17, 17, 20, 20, 20, 20, 20]
])
print("Character code points:", sentence_char_codepoint)
print("Script identifiers:", sentence_char_script)
Character code points: <tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 116, 104, 101, 114, 101, 46, 19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
Script identifiers: <tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0, 17, 17, 20, 20, 20, 20, 20]]>
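The hard-coded tensors in this section can also be derived from raw text. The following is a minimal sketch, assuming the input is a UTF-8 string; the variable name `sentence_texts` is introduced here only for illustration. It uses `tf.strings.unicode_decode` to obtain the code points and `tf.strings.unicode_script` to look up the script code of each code point.

```python
import tensorflow as tf

# Hypothetical input for illustration; any UTF-8 string tensor should work.
sentence_texts = tf.constant(["Hello, there.世界こんにちは"])

# Decode each string into one row of Unicode code points (a RaggedTensor).
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, "UTF-8")

# Look up the ICU script code for every code point
# (for example 25 for Latin, 17 for Han, 20 for Hiragana, 0 for Common).
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

print("Character code points:", sentence_char_codepoint.to_list())
print("Script identifiers:", sentence_char_script.to_list())
```

Deriving both tensors from the same decoded input keeps them aligned character by character, which the boundary detection in the next step relies on.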
Step 1: Detecting Word Boundaries
Word boundaries occur at the first character of each sentence and wherever the script identifier changes between adjacent characters:
print("Check whether each character starts a word")
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)
print("Word boundary detection result:")
print(sentence_char_starts_word)
Check whether each character starts a word
Word boundary detection result:
<tf.RaggedTensor [[True, False, False, False, False, True, False, True, False, False, False, False, True, True, False, True, False, False, False, False]]>
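To make the boundary rule easier to follow, here is a plain-Python sketch of the same comparison for a single sentence. The helper name `char_starts_word` is hypothetical and not part of TensorFlow; it only mirrors the logic of the `tf.concat` / `tf.not_equal` expression.

```python
def char_starts_word(scripts):
    """Return one boolean per character: True where a new word begins."""
    if not scripts:
        return []
    # The first character always starts a word. Every later character starts
    # a word only if its script differs from the previous character's script.
    return [True] + [cur != prev for prev, cur in zip(scripts, scripts[1:])]


scripts = [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0, 17, 17, 20, 20, 20, 20, 20]
flags = char_starts_word(scripts)

# Positions of the True flags are the word start offsets.
starts = [i for i, flag in enumerate(flags) if flag]
print(starts)  # [0, 5, 7, 12, 13, 15]
```

Pairing each script with its predecessor via `zip(scripts, scripts[1:])` plays the same role as comparing the slices `[:, 1:]` and `[:, :-1]` in the TensorFlow version.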
Step 2: Finding Word Start Positions
Extract the positions where each word begins in the flattened character list:
print("Find the index of each word-starting character in the flattened list of characters from all sentences")
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print("Word start positions:")
print(word_starts)
Find the index of each word-starting character in the flattened list of characters from all sentences
Word start positions:
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
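The offsets index into the flattened list of characters, which matters once there is more than one sentence. The sketch below uses two made-up rows of word-start flags (not taken from the example sentence) to show how `.values`, `tf.where`, and `tf.squeeze` interact.

```python
import tensorflow as tf

# Hypothetical word-start flags for two sentences of 3 and 2 characters.
starts_word = tf.ragged.constant([[True, False, True], [True, False]])

# .values concatenates all rows into one flat vector:
# [True, False, True, True, False]
flat_flags = starts_word.values

# tf.where on a 1-D boolean tensor returns shape (num_true, 1);
# tf.squeeze drops the trailing axis to give a flat vector of offsets.
word_starts = tf.squeeze(tf.where(flat_flags), axis=1)
print(word_starts.numpy())  # [0 2 3]
```

The second sentence's first word starts at offset 3 rather than 0 because the offsets count characters across all sentences.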
Step 3: Creating RaggedTensor with Code Points
Build a RaggedTensor where each row contains the code points for one word:
print("Get the code point of every character in every word")
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print("Final result - Code points grouped by words:")
print(word_char_codepoint)
# Convert back to readable text for verification.
# tf.strings.unicode_encode turns each row of code points into one UTF-8 string.
words = tf.strings.unicode_encode(word_char_codepoint, 'UTF-8')
print("Decoded words:", words.numpy())
Get the code point of every character in every word
Final result - Code points grouped by words:
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
Decoded words: [b'Hello' b', ' b'there' b'.' b'\xe4\xb8\x96\xe7\x95\x8c' b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']
How It Works
- Script identifiers help determine where word boundaries should be added
- A word boundary is added at the beginning of every sentence and for each character whose script differs from the previous character
- Start offsets into the flattened character list are used to build a RaggedTensor containing the words from all sentences in the batch
- Each row in the final RaggedTensor represents one word with its constituent character code points
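The points above can be combined into a single function. This is a sketch rather than a definitive implementation: the name `split_by_script` is hypothetical, and it assumes every input sentence is a non-empty UTF-8 string (an empty sentence would still receive a leading word-start flag and break the offsets).

```python
import tensorflow as tf


def split_by_script(sentences):
    """Split UTF-8 sentences into words wherever the Unicode script changes.

    Hypothetical helper for illustration. Returns a flat Python list of words
    from all sentences. Assumes every sentence is non-empty.
    """
    codepoints = tf.strings.unicode_decode(sentences, "UTF-8")
    scripts = tf.strings.unicode_script(codepoints)

    # Step 1: flag the first character of each sentence and every script change.
    starts_word = tf.concat(
        [tf.fill([scripts.nrows(), 1], True),
         tf.not_equal(scripts[:, 1:], scripts[:, :-1])],
        axis=1)

    # Step 2: offsets of word starts in the flattened character list.
    word_starts = tf.squeeze(tf.where(starts_word.values), axis=1)

    # Step 3: one row of code points per word.
    word_codepoints = tf.RaggedTensor.from_row_starts(
        values=codepoints.values, row_starts=word_starts)

    # Encode each row back to a UTF-8 string for inspection.
    encoded = tf.strings.unicode_encode(word_codepoints, "UTF-8")
    return [w.decode("utf-8") for w in encoded.numpy()]


words = split_by_script(tf.constant(["Hello, there.", "世界こんにちは"]))
print(words)
```

Note that punctuation and spaces share the Common script, so a run such as a comma followed by a space comes out as its own "word".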
Conclusion
TensorFlow's Unicode support enables multilingual text processing by detecting script changes to approximate word boundaries. The RaggedTensor format stores variable-length words with their Unicode code points without padding, which makes it well suited to natural language processing tasks.
