spaCy - Containers
In this chapter, we will learn about spaCy’s containers. Let us first look at the classes that make up these containers.
Classes
spaCy has four container classes −
Doc
Doc is a container for accessing linguistic annotations; it is a sequence of Token objects. Through the Doc class, we can access sentences as well as named entities.
We can also export annotations to NumPy arrays and serialize a Doc to a compressed binary string. The Doc object holds an array of TokenC structs, while Token and Span objects are only views of this array and do not hold any data themselves.
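As a minimal sketch of the export and view behaviour described above, the following assumes spaCy is installed and uses a blank English pipeline (tokenizer only), so no trained model is required. The sample sentence and the choice of the IS_ALPHA and LENGTH attributes are illustrative assumptions.

```python
import spacy
from spacy.attrs import IS_ALPHA, LENGTH

# Blank pipeline: only the tokenizer runs, no trained model is needed.
nlp = spacy.blank("en")
doc = nlp("Hello world !")

# Export two lexical attributes to a NumPy array (one row per token,
# one column per requested attribute).
arr = doc.to_array([IS_ALPHA, LENGTH])
print(arr.shape)

# Token and Span objects are views: both point back to the same Doc.
token = doc[0]
span = doc[0:2]
print(token.doc is doc, span.doc is doc)
```

Each row of the array corresponds to one token, so the array can be passed directly to NumPy-based code without touching the Token objects.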
Token
As the name suggests, it represents an individual token, such as a word, punctuation mark, whitespace, or symbol.
Span
It is a slice of a Doc object, which we discussed above.
Lexeme
It may be defined as an entry in the vocabulary. As opposed to a word token, a Lexeme has no string context. Because it is a word type, it does not have a PoS (Part-of-Speech) tag, dependency parse, or lemma.
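The relationship between the four classes can be sketched as follows. This is only an illustration, assuming spaCy is installed; it uses a blank English pipeline and a made-up sentence, so no trained model is involved.

```python
import spacy
from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("spaCy is fun to learn.")

token = doc[0]            # a single Token, obtained by indexing the Doc
span = doc[0:3]           # a Span, obtained by slicing the Doc
lexeme = nlp.vocab["fun"]  # a Lexeme, looked up in the vocabulary

print(type(doc).__name__, type(token).__name__,
      type(span).__name__, type(lexeme).__name__)
print(span.text)
# A Lexeme carries context-independent information only.
print(lexeme.text, lexeme.is_alpha)
```

Note that the Lexeme is retrieved from the vocabulary rather than from the document, which is why it carries no context-dependent annotations.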
Now, let us discuss all four classes in detail −
Doc Class
The arguments, serialization fields, and methods of the Doc class are explained below −
Arguments
The table below explains its arguments −
NAME | TYPE | DESCRIPTION |
---|---|---|
text | unicode | This attribute represents the document text in Unicode. |
mem | Pool | As the name implies, this attribute is the document’s local memory heap for all the C data it owns. |
vocab | Vocab | It stores all the lexical types. |
tensor | ndarray | Introduced in version 2.0, it is a container for dense vector representations. |
cats | dict | Introduced in version 2.0, this attribute maps a label to a score for categories applied to the document. Note that the label is a string, and the score should be a float value. |
user_data | - | It represents a generic storage area for custom user data. |
lang | int | Introduced in version 2.1, it represents the language of the document’s vocabulary as an integer ID. |
lang_ | unicode | Introduced in version 2.1, it represents the language of the document’s vocabulary as a string. |
is_tagged | bool | A flag that indicates whether the document has been part-of-speech tagged. It returns True if the Doc is empty. |
is_parsed | bool | A flag that indicates whether the document has been syntactically parsed. It returns True if the Doc is empty. |
is_sentenced | bool | A flag that indicates whether sentence boundaries have been applied to the document. It returns True if the Doc is empty. |
is_nered | bool | Introduced in version 2.1, it is a flag that indicates whether named entities have been set. It returns True if the Doc is empty or if any of the tokens has an entity tag set. |
sentiment | float | The document’s positivity/negativity score, if available. |
user_hooks | dict | This attribute is a dictionary allowing customization of the Doc’s properties. |
user_token_hooks | dict | This attribute is a dictionary allowing customization of properties of Token children. |
user_span_hooks | dict | This attribute is a dictionary allowing customization of properties of Span children. |
_ | Underscore | It represents the user space for adding custom attribute extensions. |
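A few of the attributes from the table can be inspected as in the sketch below. It assumes spaCy is installed and uses a blank English pipeline; the "POSITIVE" label and the "source" key are hypothetical values set by hand for illustration. The is_tagged, is_parsed, is_sentenced and is_nered flags are left out on purpose, since they belong to the 2.x API and are not available in spaCy 3.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained model
doc = nlp("This is a sentence.")

# Hypothetical category score and custom user data, set manually.
doc.cats = {"POSITIVE": 0.9}
doc.user_data["source"] = "example"

print(doc.text)                # the document text
print(doc.lang_)               # language of the vocabulary as a string
print(doc.vocab is nlp.vocab)  # the Doc shares the pipeline's vocabulary
print(doc.cats)                # label -> score mapping
print(doc.user_data)           # generic storage for custom data
print(doc.sentiment)           # float score, if any has been set
```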
Serialization fields
During serialization, spaCy exports several data fields that are used to restore different aspects of the object. We can exclude data fields from serialization by passing their names via the exclude argument.
The table below explains the serialization fields −
Sr.No. | Name & Description |
---|---|
1 | text It represents the value of the Doc.text attribute. |
2 | sentiment It represents the value of the Doc.sentiment attribute. |
3 | tensor It represents the value of the Doc.tensor attribute. |
4 | user_data It represents the value of the Doc.user_data dictionary. |
5 | user_data_keys It represents the keys of the Doc.user_data dictionary. |
6 | user_data_values It represents the values of the Doc.user_data dictionary. |
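The use of exclude can be sketched as follows. This example assumes spaCy 2.1 or later, where exclude takes a list of field names; the "note" entry in user_data is a hypothetical piece of custom data.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("Serialize me.")
doc.user_data["note"] = "temporary"  # hypothetical custom data

# Serialize the Doc to a binary string, leaving out the user_data field.
data = doc.to_bytes(exclude=["user_data"])

# Restore into a fresh Doc that shares the same vocabulary.
restored = Doc(nlp.vocab).from_bytes(data)

print(restored.text)
print("note" in restored.user_data)
```

The restored Doc keeps its text and tokens, while the excluded user data is not carried over.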
Methods
Following are the methods of the Doc class −
Sr.No. | Method & Description |
---|---|
1 | Doc.__init__ To construct a Doc object. |
2 | Doc.__getitem__ To get a token object at a particular position. |
3 | Doc.__iter__ To iterate over those token objects from which the annotations can be easily accessed. |
4 | Doc.__len__ To get the number of tokens in the document. |
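These four methods are rarely called by name; they are triggered by constructing, indexing, iterating over, and taking the length of a Doc. A minimal sketch, assuming spaCy is installed and using an empty Vocab with hand-picked words and spaces:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", "!"]
spaces = [True, False, False]  # whether each word is followed by a space

# Doc.__init__: build a Doc directly from words and trailing spaces.
doc = Doc(Vocab(), words=words, spaces=spaces)

# Doc.__getitem__: an integer index gives a Token, a slice gives a Span.
first = doc[0]
last = doc[-1]
pair = doc[0:2]

# Doc.__iter__: iterate over the Token objects.
texts = [token.text for token in doc]

# Doc.__len__: the number of tokens.
n_tokens = len(doc)

print(doc.text, first.text, last.text, pair.text, texts, n_tokens)
```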
ClassMethods
Following are the classmethods of the Doc class −
Sr.No. | Classmethod & Description |
---|---|
1 | Doc.set_extension It defines a custom attribute on the Doc. |
2 | Doc.get_extension It will look up a previously registered extension by name. |
3 | Doc.has_extension It will check whether an extension has been registered on the Doc class or not. |
4 | Doc.remove_extension It will remove a previously registered extension on the Doc class. |
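The four classmethods can be tried together as in the sketch below. It assumes spaCy is installed; the extension name "source" and its default value are hypothetical, and force=True is passed only so that the snippet can be re-run without raising an error for an already registered name.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Register a custom attribute with a default value (hypothetical name).
Doc.set_extension("source", default="unknown", force=True)
registered_before = Doc.has_extension("source")

# Look up the registration: a (default, method, getter, setter) tuple.
extension = Doc.get_extension("source")

doc = Doc(Vocab(), words=["Hello", "world"])
default_value = doc._.source  # falls back to the registered default
doc._.source = "web"          # custom attributes live under doc._
custom_value = doc._.source

# Remove the extension again.
Doc.remove_extension("source")
registered_after = Doc.has_extension("source")

print(registered_before, extension[0], default_value,
      custom_value, registered_after)
```

Custom attributes are read and written through the underscore attribute (doc._), which keeps them separate from spaCy's built-in attributes.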