spaCy - Architecture
This chapter describes the central data structures in spaCy and explains the main objects along with their roles.
Data Structures
The central data structures in spaCy are as follows −
Doc − This is one of the most important objects in spaCy’s architecture. It owns the sequence of tokens along with all their annotations.
Vocab − Another central data structure in spaCy is the Vocab. It owns a set of look-up tables that make common information available across documents.
By centralising strings, word vectors, and lexical attributes, this design saves memory, because spaCy avoids storing multiple copies of the same data.
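The sharing described above can be observed directly. A minimal sketch (using a blank English pipeline, so no model package needs to be downloaded) showing that all Doc objects created by one pipeline share a single Vocab, whose StringStore maps each string to a hash and back −

```python
from spacy.lang.en import English

nlp = English()
doc1 = nlp("spaCy centralises strings.")
doc2 = nlp("Both docs share one vocabulary.")

# Both Doc objects point at the very same Vocab object.
assert doc1.vocab is doc2.vocab

# The shared StringStore maps each seen string to a hash and back.
h = nlp.vocab.strings["spaCy"]
assert nlp.vocab.strings[h] == "spaCy"
```

Because every Doc holds only hashes and looks the strings up in the shared Vocab, the text of a token is stored once per process, no matter how many documents contain it.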
Objects and their role
The objects in spaCy along with their role and an example are explained below −
Span
It is a slice of a Doc object, which we discussed above. We can create a Span object from a slice with the help of the following command −
doc[start : end]
Example
An example of span is given below −
import spacy
import en_core_web_sm
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
span = my_doc[1:6]
span
Output
is my first example.
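A Span also exposes its own attributes, such as the joined text and its token offsets into the parent Doc. A small sketch (using a blank English pipeline here, so no model download is required) −

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is my first example.")
span = doc[1:5]

# .text joins the underlying tokens; .start/.end are token
# offsets into the Doc, not character offsets.
assert span.text == "is my first example"
assert (span.start, span.end) == (1, 5)
```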
Token
As the name suggests, it represents an individual token such as word, punctuation, whitespace, symbol, etc.
Example
An example of token is given below −
import spacy
import en_core_web_sm
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
token = my_doc[4]
token
Output
example
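Each Token carries lexical attributes that are available even without a statistical model. A minimal sketch using a blank English pipeline −

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is my first example.")
token = doc[4]

# Lexical attributes that do not require a trained model.
assert token.text == "example"
assert token.i == 4          # index of the token within the Doc
assert token.is_alpha        # consists of alphabetic characters only
assert not token.is_punct    # it is not a punctuation mark
```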
Tokenizer
As the name suggests, the tokenizer class segments the text into words, punctuation marks, etc.
Example
This example will create a blank tokenizer with just the English vocab −
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp_lang = English()
blank_tokenizer = Tokenizer(nlp_lang.vocab)
blank_tokenizer
Output
<spacy.tokenizer.Tokenizer at 0x26506efc480>
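The blank tokenizer created above can be called on text directly. Since it was built with only a vocab and no prefix, suffix, or infix rules, it splits on whitespace alone, which is why trailing punctuation stays attached −

```python
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp_lang = English()
blank_tokenizer = Tokenizer(nlp_lang.vocab)

doc = blank_tokenizer("Hello world!")
# No punctuation rules were supplied, so "!" is not split off.
assert [t.text for t in doc] == ["Hello", "world!"]
```

Passing prefix, suffix, and infix rules to the Tokenizer constructor (as the language classes do internally) restores the usual behaviour of splitting off punctuation.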
Language
It is a text-processing pipeline, which we need to load once per process and then pass around the application. This class is created when we call the method spacy.load().
It contains the following −
Shared vocabulary
Language data
Optional model data loaded from a model package
Processing pipeline containing components such as tagger or parser.
Example
This example will initialise an English Language object −
from spacy.vocab import Vocab
from spacy.language import Language
nlp_lang = Language(Vocab())
from spacy.lang.en import English
nlp_lang = English()
nlp_lang
Output
When you run the code, you will see the following output −
<spacy.lang.en.English at 0x26503773cf8>
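The processing pipeline held by a Language object can be inspected and extended. A small sketch, assuming spaCy v3's string-based add_pipe API (in v2 the call is nlp.add_pipe(nlp.create_pipe("sentencizer")) instead) −

```python
from spacy.lang.en import English

nlp = English()
# A blank Language object starts with an empty pipeline.
assert nlp.pipe_names == []

# Built-in components such as the rule-based "sentencizer"
# can be added by name.
nlp.add_pipe("sentencizer")
assert nlp.pipe_names == ["sentencizer"]

doc = nlp("First sentence. Second sentence.")
assert len(list(doc.sents)) == 2
```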