spaCy - Architecture



This chapter describes the core data structures in spaCy and explains its main objects and their roles.

Data Structures

The central data structures in spaCy are as follows −

  • Doc − One of the most important objects in spaCy’s architecture. It owns the sequence of tokens along with all their annotations.

  • Vocab − The other central data structure in spaCy. It owns a set of look-up tables that make common information available across documents.

Centralising strings, word vectors, and lexical attributes in the Vocab saves memory, because the data is stored once instead of being copied into every document.
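As a rough illustration of this sharing, the sketch below checks that a Doc refers to the same Vocab as the pipeline that created it, and that a string is stored once and referred to by its hash. It assumes spacy.blank("en") as a stand-in pipeline so that no model package is needed; the variable names are illustrative only.

```python
import spacy

# Blank English pipeline: tokenizer and vocab only, no trained model required
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# The Doc does not copy the vocab; it holds a reference to the pipeline's Vocab
shares_vocab = doc.vocab is nlp.vocab

# Strings live in the vocab's string store and can be looked up in both directions
coffee_hash = nlp.vocab.strings["coffee"]     # string -> hash
coffee_text = nlp.vocab.strings[coffee_hash]  # hash -> string

# A token stores the hash of its text rather than a separate copy of the string
same_hash = doc[2].orth == coffee_hash

print(shares_vocab, coffee_text, same_hash)
```

Because every Doc produced by the same pipeline points at one Vocab, the string "coffee" is kept in a single place no matter how many documents contain it.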

Objects and their roles

The objects in spaCy, along with their roles and an example of each, are explained below −

Span

It is a slice of a Doc object, which we discussed above. We can create a Span object by slicing a Doc with the following syntax −

doc[start : end]

Example

An example of a span is given below −

import en_core_web_sm
# Load the small English model and process a sentence
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
# Take tokens 1 to 5 (the end index 6 is exclusive)
span = my_doc[1:6]
span

Output

is my first example.
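A Span also records where it sits inside its parent Doc. The following is a minimal sketch of a few commonly used Span attributes. It assumes spacy.blank("en") in place of the en_core_web_sm model used above, since only tokenization is needed here.

```python
import spacy

# Tokenizer-only pipeline (an assumption made here to avoid a model download)
nlp = spacy.blank("en")
doc = nlp("This is my first example.")

# Tokens 1, 2 and 3; the end index is exclusive
span = doc[1:4]

print(span.text)             # text covered by the span
print(span.start, span.end)  # token offsets into the parent Doc
print(len(span))             # number of tokens in the span
print(span.doc is doc)       # the span is a view onto the same Doc
```

The span does not copy any tokens; it only keeps the start and end offsets and a reference to the Doc it was sliced from.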

Token

As the name suggests, it represents an individual token, such as a word, punctuation mark, whitespace, or symbol.

Example

An example of a token is given below −

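import en_core_web_sm
# Load the small English model and process a sentence
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
# Index the Doc to get the fifth token (indices start at 0)
token = my_doc[4]
token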

Output

example
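Each Token also exposes attributes describing itself. As a small sketch, the loop below prints the index, text, and two lexical flags for every token in the same sentence. It assumes spacy.blank("en"), which is sufficient because these particular attributes do not depend on a trained model.

```python
import spacy

# Tokenizer-only pipeline (assumed here; no trained model is needed for these attributes)
nlp = spacy.blank("en")
doc = nlp("This is my first example.")

# Iterating over a Doc yields Token objects
for token in doc:
    # i: position in the Doc, is_alpha: alphabetic only, is_punct: punctuation
    print(token.i, token.text, token.is_alpha, token.is_punct)

token_texts = [token.text for token in doc]
```

Attributes that come from trained components, such as part-of-speech tags, would need a full model like the one loaded in the example above.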

Tokenizer

As the name suggests, the Tokenizer class segments text into words, punctuation marks, and so on.

Example

This example will create a blank tokenizer with just the English vocab −

from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp_lang = English()
blank_tokenizer = Tokenizer(nlp_lang.vocab)
blank_tokenizer

Output

<spacy.tokenizer.Tokenizer at 0x26506efc480>
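A tokenizer built this way has a vocab but no prefix, suffix, or infix rules, so it is expected to split on whitespace only. The sketch below compares it with the default tokenizer that the English class sets up. The sample sentence and variable names are chosen here for illustration.

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp_lang = English()
text = "This is my first example."

# Blank tokenizer: vocab only, no punctuation rules
blank_tokenizer = Tokenizer(nlp_lang.vocab)
blank_tokens = [token.text for token in blank_tokenizer(text)]

# Default English tokenizer: includes the language's punctuation rules
default_tokens = [token.text for token in nlp_lang.tokenizer(text)]

print(blank_tokens)
print(default_tokens)
```

The blank tokenizer keeps the final full stop attached to the last word, while the default tokenizer splits it off as a separate token.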

Language

It is a text-processing pipeline. We usually load it once per process and pass the instance around the application. An instance of this class is created when we call spacy.load().

It contains the following −

  • Shared vocabulary

  • Language data

  • Optional model data loaded from a model package

  • Processing pipeline containing components such as a tagger or parser.

Example

This example initialises an English Language object −

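from spacy.vocab import Vocab
from spacy.language import Language
from spacy.lang.en import English
# Option 1: a generic Language object with an empty vocab
nlp_lang = Language(Vocab())
# Option 2: the English subclass, which replaces the object created above
nlp_lang = English()
nlp_lang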

Output

When you run the code, you will see the following output −

<spacy.lang.en.English at 0x26503773cf8>
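The parts of a Language object listed earlier can be inspected directly. The following is a minimal sketch, assuming spacy.blank("en") as the way to obtain an English pipeline without a model package; a pipeline loaded from a model package would list components such as a tagger or parser instead of an empty pipeline.

```python
import spacy
from spacy.language import Language

# Blank English pipeline: language data and vocab, but no trained components
nlp = spacy.blank("en")

is_language = isinstance(nlp, Language)           # English is a subclass of Language
lang_code = nlp.lang                              # language code of the pipeline
pipe_names = nlp.pipe_names                       # names of the pipeline components
shared_vocab = nlp.vocab is nlp.tokenizer.vocab   # tokenizer uses the pipeline's vocab

print(is_language, lang_code, pipe_names, shared_vocab)
```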