spaCy - Containers

In this chapter, we will learn about spaCy’s containers. Let us first look at the classes that make up these containers.

Classes

spaCy’s containers consist of the following four classes −

Doc

Doc, a container for accessing linguistic annotations, is a sequence of Token objects. With the help of the Doc class, we can access sentences as well as named entities.

We can also export annotations to numpy arrays and serialize the Doc to a compressed binary string. The Doc object holds an array of TokenC structs, while Token and Span objects are only views of this array and hold no data themselves.
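
As a rough illustration of the points above, the sketch below builds a Doc with a blank English pipeline (an assumption made here so that no trained model has to be downloaded), exports two token attributes to a numpy array and round-trips the Doc through its binary form. Sentences and named entities are not shown, because they need a trained pipeline.

```python
import spacy
from spacy.attrs import ORTH, IS_ALPHA
from spacy.tokens import Doc

# Blank pipeline: tokenizer only, no tagger, parser or NER (assumption for this sketch)
nlp = spacy.blank("en")
doc = nlp("spaCy stores text in containers.")

# A Doc is a sequence of Token objects
token_texts = [token.text for token in doc]

# Export selected annotations to a numpy array, one row per token
annotations = doc.to_array([ORTH, IS_ALPHA])

# Serialize to a binary string and restore into a new Doc that shares the vocab
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)

print(token_texts)
print(annotations.shape)
print(restored.text == doc.text)
```

The restored Doc shares nlp.vocab with the original, which is why the token strings can be recovered after deserialization.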

Token

As the name suggests, it represents an individual token such as a word, punctuation mark, whitespace, symbol, etc.

Span

It is a slice of a Doc object, which we discussed above.

Lexeme

It may be defined as an entry in the vocabulary. As opposed to a word token, a Lexeme has no string context. It is a word type; hence, it does not have any PoS (Part-of-Speech) tag, dependency parse or lemma.
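
The relationship between the four classes can be sketched as follows. The example sentence and the blank English pipeline are assumptions chosen for illustration only.

```python
import spacy
from spacy.tokens import Doc, Span, Token
from spacy.lexeme import Lexeme

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
doc = nlp("Give it back!")

token = doc[2]               # a single Token, seen in its document context
span = doc[1:3]              # a Span is a slice of the Doc
lexeme = nlp.vocab["back"]   # a Lexeme is a context-free vocabulary entry

print(type(doc).__name__, type(token).__name__,
      type(span).__name__, type(lexeme).__name__)
print(span.text)
# The Token and the Lexeme refer to the same word type
print(token.orth == lexeme.orth)
```

The Lexeme only carries context-independent information such as its text and flags like is_alpha, whereas the Token also knows its position in the Doc.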

Now, let us discuss all four classes in detail −

Doc Class

The attributes, serialization fields and methods of the Doc class are explained below −

Attributes

The table below explains its attributes −

NAME	TYPE	DESCRIPTION
text	unicode	This attribute represents the document text in Unicode.
mem	Pool	As the name implies, this attribute is the document’s local memory heap, used for all C data it owns.
vocab	Vocab	It stores all the lexical types.
tensor	ndarray	Introduced in version 2.0, it is a container for dense vector representations.
cats	dict	Introduced in version 2.0, this attribute maps a label to a score for categories applied to the document. Note that the label is a string, and the score should be a float value.
user_data	-	It represents a generic storage area mainly for user custom data.
lang	int	Introduced in version 2.1, it represents the language of the document’s vocabulary as an integer ID.
lang_	unicode	Introduced in version 2.1, it represents the language of the document’s vocabulary as a string.
is_tagged	bool	It is a flag that indicates whether the document has been part-of-speech tagged or not. It returns True if the Doc is empty.
is_parsed	bool	It is a flag that indicates whether the document has been syntactically parsed or not. It returns True if the Doc is empty.
is_sentenced	bool	It is a flag that indicates whether the sentence boundaries have been applied to the document or not. It returns True if the Doc is empty.
is_nered	bool	Introduced in version 2.1, it is a flag that indicates whether the named entities have been set or not. It returns True if the Doc is empty or if any of the tokens has an entity tag set.
sentiment	float	It returns the document’s positivity/negativity score as a float, if one is available.
user_hooks	dict	This attribute is a dictionary allowing customization of the Doc’s properties.
user_token_hooks	dict	This attribute is a dictionary allowing customization of properties of Token children.
user_span_hooks	dict	This attribute is a dictionary allowing customization of properties of Span children.
_	Underscore	It represents the user space for adding custom attribute extensions.
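
A few of these attributes can be read and set as in the minimal sketch below. The label "POSITIVE", the key "source" and the example sentence are hypothetical values chosen for illustration; the blank English pipeline is likewise an assumption.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
doc = nlp("This movie was great.")

# cats maps a string label to a float score
doc.cats["POSITIVE"] = 0.9

# user_data is a generic storage area for custom data
doc.user_data["source"] = "example"

# lang_ is the language of the document's vocabulary as a string
language = doc.lang_

# vocab is the shared store of lexical types
same_vocab = doc.vocab is nlp.vocab

print(doc.cats, doc.user_data, language, same_vocab)
```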

Serialization fields

During serialization, spaCy exports several data fields that are used to restore various aspects of the object. We can also exclude data fields from serialization by passing their names via the exclude argument.

The table below explains the serialization fields −

Sr.No.	Name & Description
1	text It represents the value of the Doc.text attribute.
2	sentiment It represents the value of the Doc.sentiment attribute.
3	tensor It represents the value of the Doc.tensor attribute.
4	user_data It represents the value of the Doc.user_data dictionary.
5	user_data_keys It represents the keys of the Doc.user_data dictionary.
6	user_data_values It represents the values of the Doc.user_data dictionary.
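
To illustrate the exclude argument, the sketch below serializes the same Doc twice, once with all fields and once without user_data. The key "source" and the example text are hypothetical, and the behaviour shown assumes spaCy 2.1 or later, where exclude is passed as a keyword argument.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
doc = nlp("Serialize me.")
doc.user_data["source"] = "example"

# Serialize with all fields, then again without the user_data field
full = doc.to_bytes()
slim = doc.to_bytes(exclude=["user_data"])

restored_full = Doc(nlp.vocab).from_bytes(full)
restored_slim = Doc(nlp.vocab).from_bytes(slim)

print(restored_full.user_data)
print(restored_slim.user_data)
print(restored_slim.text)
```

Excluding a field only drops that piece of data; the text and tokens are still restored.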

Methods

Following are the methods used in Doc class −

Sr.No.	Method & Description
1	Doc.__init__ To construct a Doc object.
2	Doc.__getitem__ To get a token object at a particular position.
3	Doc.__iter__ To iterate over those token objects from which the annotations can be easily accessed.
4	Doc.__len__ To get the number of tokens in the document.
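
The four methods listed above can be tried out as in the sketch below, which constructs a Doc directly from a list of words. The words and the empty Vocab are assumptions chosen for illustration; in practice a Doc is usually created by calling the nlp object on a text.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# __init__: build a Doc from words; spaces marks whether a space follows each word
words = ["Hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(Vocab(), words=words, spaces=spaces)

# __getitem__: an integer index gives a Token, a slice gives a Span
first = doc[0]
last = doc[-1]
span = doc[0:2]

# __iter__: iterate over the Token objects
texts = [token.text for token in doc]

# __len__: number of tokens in the document
n_tokens = len(doc)

print(doc.text, first.text, last.text, span.text, texts, n_tokens)
```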

ClassMethods

Following are the classmethods used in Doc class −

Sr.No.	Classmethod & Description
1	Doc.set_extension It defines a custom attribute on the Doc.
2	Doc.get_extension It will look up a previously registered extension by name.
3	Doc.has_extension It will check whether an extension has been registered on the Doc class or not.
4	Doc.remove_extension It will remove a previously registered extension on the Doc class.
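
The classmethods above work together roughly as follows. The extension name "reviewed" is hypothetical, and force=True is passed only so that the sketch can be re-run without an error about an already registered extension.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Register a custom attribute with a default value ("reviewed" is a made-up name)
Doc.set_extension("reviewed", default=False, force=True)
had_extension = Doc.has_extension("reviewed")

doc = Doc(Vocab(), words=["Hello", "world"])
initial_value = doc._.reviewed   # falls back to the default
doc._.reviewed = True            # custom attributes live under the _ namespace

# get_extension returns a (default, method, getter, setter) tuple
default, method, getter, setter = Doc.get_extension("reviewed")

# remove_extension unregisters the attribute from the Doc class
Doc.remove_extension("reviewed")
still_registered = Doc.has_extension("reviewed")

print(had_extension, initial_value, default, still_registered)
```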
