spaCy - Containers



In this chapter, we will learn about spaCy’s containers. Let us first look at the classes that make up these containers.

Classes

spaCy provides four container classes −

Doc

Doc, a container for accessing linguistic annotations, is a sequence of Token objects. Through the Doc class, we can also access sentences and named entities.

We can also export annotations to NumPy arrays and serialize a Doc to a compressed binary string. The Doc object owns an array of TokenC structs, while Token and Span objects are only views of this array and hold no data themselves.
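As a minimal sketch of the points above, the following example reads named entities from a Doc and exports token attributes to a NumPy array. It assumes spaCy 2.1 or later is installed and uses a blank English pipeline, so no trained model is needed. Because a blank pipeline has no entity recognizer, the entity is set by hand here; the sentence and the ORG label are only illustrative.

```python
import spacy
from spacy.attrs import IS_ALPHA, LENGTH
from spacy.tokens import Span

# A blank pipeline only tokenizes, so no model download is required.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup.")

# Assumption for illustration: mark the first token as an organization by hand.
# A trained pipeline would normally set doc.ents for us.
doc.ents = [Span(doc, 0, 1, label="ORG")]
for ent in doc.ents:
    print(ent.text, ent.label_)

# Export two attributes per token to a NumPy array (one row per token).
arr = doc.to_array([LENGTH, IS_ALPHA])
print(arr.shape)
```

Each row of the array corresponds to one token and each column to one of the requested attributes.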

Token

As the name suggests, it represents an individual token, such as a word, punctuation mark, whitespace, or symbol.

Span

It is a slice of the Doc object discussed above.

Lexeme

It may be defined as an entry in the vocabulary. As opposed to a word token, a Lexeme has no string context. Because it is a word type, it does not have a PoS (Part-of-Speech) tag, dependency parse, or lemma.
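The relationship between Token, Span, and Lexeme can be sketched as follows. This is a small illustration, assuming spaCy is installed; it uses a blank English pipeline and an arbitrary example sentence.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup.")

token = doc[0]               # Token: a view of one position in the Doc
span = doc[0:3]              # Span: a view of a slice of the Doc
lexeme = nlp.vocab["Apple"]  # Lexeme: a context-free vocabulary entry

print(token.text, token.i)
print(span.text, span.start, span.end)
print(lexeme.text, lexeme.is_alpha)

# The token in context and the lexeme share the same word type (same ID).
print(token.orth == lexeme.orth)
```

The Token knows its position in the document, while the Lexeme only carries information that is independent of context.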

Now, let us discuss all four classes in detail −

Doc Class

The attributes, serialization fields, and methods of the Doc class are explained below −

Attributes

The table below explains its attributes −

NAME TYPE DESCRIPTION
text unicode This attribute represents the document text in Unicode.
mem Pool This attribute is the document’s local memory heap, for all C data it owns.
vocab Vocab It stores all the lexical types.
tensor ndarray Introduced in version 2.0, it is a container for dense vector representations.
cats dict Introduced in version 2.0, this attribute maps a label to a score for categories applied to the document. Note that the label is a string, and the score should be a float value.
user_data - It represents a generic storage area mainly for user custom data.
lang int Introduced in version 2.1, it represents the language of the document’s vocabulary as an integer ID.
lang_ unicode Introduced in version 2.1, it represents the language of the document’s vocabulary as a string.
is_tagged bool It is a flag that indicates whether the document has been part-of-speech tagged. It returns True if the Doc is empty.
is_parsed bool It is a flag that indicates whether the document has been syntactically parsed. It returns True if the Doc is empty.
is_sentenced bool It is a flag that indicates whether sentence boundaries have been applied to the document. It returns True if the Doc is empty.
is_nered bool Introduced in version 2.1, it is a flag that indicates whether named entities have been set. It returns True if the Doc is empty or if any of the tokens has an entity tag set.
sentiment float It returns the document’s positivity/negativity score, if one is available.
user_hooks dict This attribute is a dictionary allowing customization of the Doc’s properties.
user_token_hooks dict This attribute is a dictionary allowing customization of properties of Token children.
user_span_hooks dict This attribute is a dictionary allowing customization of properties of Span children.
_ Underscore It represents the user space for adding custom attribute extensions.
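A few of the attributes in the table can be inspected as in the sketch below. It assumes spaCy 2.1 or later and a blank English pipeline; the category label and the user_data key are hypothetical values chosen for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("spaCy is fast.")

print(doc.text)                # the original document text
print(doc.lang_)               # language code as a string
print(doc.vocab is nlp.vocab)  # the Doc shares the pipeline's vocabulary

# lang is the integer ID of the same language code.
print(doc.lang == nlp.vocab.strings[doc.lang_])

# cats maps a string label to a float score (hypothetical label and score).
doc.cats = {"POSITIVE": 0.9}

# user_data is a generic storage area for custom data (hypothetical key).
doc.user_data["source"] = "example"
print(doc.cats, doc.user_data)
```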

Serialization fields

During serialization, spaCy exports several data fields that are used to restore different aspects of the object. We can exclude data fields from serialization by passing their names through the exclude argument.

The table below explains the serialization fields −

Sr.No. Name & Description
1 text

It represents the value of the Doc.text attribute.

2 sentiment

It represents the value of the Doc.sentiment attribute.

3 tensor

It represents the value of the Doc.tensor attribute.

4 user_data

It represents the value of the Doc.user_data dictionary.

5 user_data_keys

It represents the keys of the Doc.user_data dictionary.

6 user_data_values

It represents the values of the Doc.user_data dictionary.
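The effect of the exclude argument can be sketched with a round trip through to_bytes and from_bytes. The example assumes spaCy 2.1 or later; the text and the user_data entry are hypothetical.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Give it back!")
doc.user_data["note"] = "hypothetical user data"

# Full round trip: serialize to bytes, then restore into a new Doc.
data = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(data)
print(restored.text, restored.user_data)

# Same round trip, but leave the user_data field out of the serialized bytes.
slim = doc.to_bytes(exclude=["user_data"])
restored_slim = Doc(nlp.vocab).from_bytes(slim)
print(restored_slim.text, restored_slim.user_data)
```

In the second case the text is still restored, but the custom user data is not carried over.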

Methods

Following are the methods used in Doc class −

Sr.No. Method & Description
1 Doc.__init__

To construct a Doc object.

2 Doc.__getitem__

To get a Token object at a particular position, or a Span object for a slice.

3 Doc.__iter__

To iterate over the Token objects, from which the annotations can be easily accessed.

4 Doc.__len__

To get the number of tokens in the document.
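The four methods above can be exercised together as in the following sketch. It assumes spaCy is installed; the words and spaces lists are illustrative values.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# __init__: build a Doc from words and flags for trailing spaces.
words = ["Hello", "world", "!"]
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # Hello world!

# __len__: number of tokens in the document.
print(len(doc))

# __getitem__: an integer index gives a Token, a slice gives a Span.
first = doc[0]
last = doc[-1]
head = doc[0:2]
print(first.text, last.text, head.text)

# __iter__: iterate over the Token objects.
texts = [token.text for token in doc]
print(texts)
```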

Classmethods

Following are the classmethods used in Doc class −

Sr.No. Classmethod & Description
1 Doc.set_extension

It defines a custom attribute on the Doc.

2 Doc.get_extension

It will look up a previously registered extension by name.

3 Doc.has_extension

It will check whether an extension has been registered on the Doc class or not.

4 Doc.remove_extension

It will remove a previously registered extension on the Doc class.
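The classmethods above can be sketched with a single custom attribute. The extension name reviewed is hypothetical and chosen only for illustration; the example assumes spaCy 2.1 or later.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Register the (hypothetical) extension only if it is not registered yet.
if not Doc.has_extension("reviewed"):
    Doc.set_extension("reviewed", default=False)

doc = nlp("Extensions live under the underscore.")
doc._.reviewed = True          # custom attributes are accessed through doc._
was_reviewed = doc._.reviewed
print(was_reviewed)

# get_extension returns a (default, method, getter, setter) tuple.
default, method, getter, setter = Doc.get_extension("reviewed")
print(default)

# remove_extension unregisters the attribute and returns the same kind of tuple.
removed = Doc.remove_extension("reviewed")
print(Doc.has_extension("reviewed"))
```

Checking has_extension before calling set_extension avoids an error when the same code runs more than once in a session.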
