spaCy - Getting Started



This chapter introduces the latest version of spaCy. It covers the new features and improvements in that version, its compatibility, and how to install spaCy.

Latest version

spaCy v3.0 is the latest version and is available as a nightly release. It is an experimental alpha release distributed through a separate channel named spacy-nightly. It reflects “future spaCy” and should not be used in production.

To prevent potential conflicts, try to use a fresh virtual environment.

You can install it with the following pip command −

pip install spacy-nightly --pre

New Features and Improvements

The new features and improvements in the latest version of spaCy are explained below −

Transformer-based pipelines

It features all-new transformer-based pipelines with support for multi-task learning. These pipelines bring spaCy’s accuracy to within 1% of the best available results.

Because spaCy’s transformer support interoperates with other frameworks such as PyTorch and HuggingFace transformers, you can access thousands of pretrained models for your pipeline.

New training workflow and config system

spaCy v3.0 uses a single configuration file that describes a training run.

There are no hidden defaults, which makes it easy to rerun our experiments and track changes.
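To give a feel for what such a configuration file looks like, here is a minimal sketch. The excerpt is illustrative and far from complete, and the section and key names shown ([nlp], [training], lang, pipeline, dropout, max_steps) are assumed from spaCy v3 conventions rather than taken from this chapter. spaCy itself loads the file through Thinc's config system; the standard configparser and json modules are used here only to show the shape of the file.

```python
import configparser
import json

# Illustrative excerpt only; a real training config has many more sections.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[training]
dropout = 0.1
max_steps = 20000
"""


def read_config(text):
    """Parse INI-style sections and decode each value as JSON."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        section: {key: json.loads(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


config = read_config(CONFIG_TEXT)
for section, values in config.items():
    print(section, values)
```

Because every setting lives in the file, two runs can be compared simply by diffing their config files.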

Custom models using any ML framework

The new configuration system of spaCy v3.0 makes it easy for us to customise the neural network (NN) models and implement our own architectures via the ML library Thinc.

Manage end-to-end workflows and projects

spaCy projects let us manage and share end-to-end workflows for various use cases and domains.

They also let us organise training, packaging, and serving of our custom pipelines.

In addition, we can integrate with other data science and ML tools such as DVC (Data Version Control), Prodigy, Streamlit, FastAPI, and Ray.

Parallel training and distributed computing with Ray

To speed up the training process, we can use Ray, a fast and simple framework for building and running distributed applications, to train spaCy on one or more remote machines.

New built-in pipeline components

spaCy v3.0 adds several new trainable and rule-based components that we can add to our pipeline.

These components are as follows −

  • SentenceRecognizer

  • Morphologizer

  • Lemmatizer

  • AttributeRuler

  • Transformer

  • TrainablePipe

New pipeline component API

spaCy v3.0 provides a new and improved pipeline component API and decorators, which make defining, configuring, reusing, training, and analyzing components easier and more convenient.
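As a rough illustration of the decorator-based API, the sketch below registers a small custom component and adds it to a blank pipeline. It assumes spaCy v3 is installed and skips itself otherwise. The component name token_counter and the user_data key n_tokens are hypothetical names chosen for this example.

```python
try:
    import spacy
    from spacy.language import Language
    HAVE_SPACY_V3 = int(spacy.__version__.split(".")[0]) >= 3
except ImportError:
    HAVE_SPACY_V3 = False  # spaCy is missing, so the example is skipped

if HAVE_SPACY_V3:
    # "token_counter" is a hypothetical component name used only in this sketch.
    @Language.component("token_counter")
    def token_counter(doc):
        # Store the token count on the Doc, then pass the Doc along.
        doc.user_data["n_tokens"] = len(doc)
        return doc

    nlp = spacy.blank("en")        # blank English pipeline, no model download
    nlp.add_pipe("token_counter")  # components are added by their registered name
    doc = nlp("spaCy makes pipelines easy to extend.")
    print(nlp.pipe_names, doc.user_data["n_tokens"])
```

Registering the function under a string name is what allows the component to be referenced from a pipeline definition instead of being passed around as a Python object.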

Dependency matching

spaCy v3.0 provides the new DependencyMatcher, which lets us match patterns within the dependency parse. It uses Semgrex operators.
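The following sketch shows what a DependencyMatcher pattern can look like. It assumes spaCy v3 is installed and skips itself otherwise. To avoid depending on a trained model, the sentence is annotated by hand; the words, heads, and dependency labels are made up for illustration. The operator ">" states that the left node is the direct head of the right node.

```python
try:
    import spacy
    from spacy.matcher import DependencyMatcher
    from spacy.tokens import Doc
    HAVE_SPACY_V3 = int(spacy.__version__.split(".")[0]) >= 3
except ImportError:
    HAVE_SPACY_V3 = False  # spaCy is missing, so the example is skipped

# Pattern: the verb "founded" (the anchor) with a direct nsubj child.
PATTERN = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "founded"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]

if HAVE_SPACY_V3:
    nlp = spacy.blank("en")
    # Hand-annotated parse, so no trained parser is required.
    words = ["Smith", "founded", "a", "company"]
    heads = [1, 1, 3, 1]  # index of each token's head
    deps = ["nsubj", "ROOT", "det", "dobj"]
    doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

    matcher = DependencyMatcher(nlp.vocab)
    matcher.add("FOUNDED", [PATTERN])
    matches = matcher(doc)
    # Each match pairs a match ID with the indices of the matched tokens.
    for match_id, token_ids in matches:
        print([doc[i].text for i in token_ids])
```

In a real pipeline the Doc would come from a trained parser rather than from hand-written heads and labels.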

New and updated documentation

It has new and updated documentation including −

  • A new usage guide on embeddings, transformers, and transfer learning.

  • A guide on training pipelines and models.

  • Details about the new spaCy projects and updated usage documentation on custom pipeline components.

  • New illustrations and new API reference pages documenting spaCy’s ML model architectures and expected data formats.

Compatibility

spaCy runs on all major operating systems, including Windows, macOS/OS X, and Unix/Linux. spaCy v2.x is compatible with 64-bit CPython 2.7/3.5+, while spaCy v3.0 requires Python 3.6 or later.

Installing spaCy

The different options to install spaCy are explained below −

Using package manager

The latest release versions of spaCy are available through both package managers, pip and conda. Let us see how we can use them to install spaCy −

pip − To install spaCy using pip, you can use the following command −

pip install -U spacy

To avoid modifying system state, it is recommended to install spaCy packages in a virtual environment as follows −

python -m venv .env
source .env/bin/activate
pip install spacy

conda − To install spaCy via conda-forge, you can use the following command −

conda install -c conda-forge spacy
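After installing, it can be useful to confirm which spaCy distribution the current interpreter actually sees. The sketch below is one way to do that with the standard library (Python 3.8 or later is assumed for importlib.metadata). The helper name installed_version is chosen for this example.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Check both the stable and the nightly distribution names.
for name in ("spacy", "spacy-nightly"):
    print(name, installed_version(name) or "not installed")
```

Run it with the interpreter of the virtual environment you installed into, otherwise it reports on a different environment.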

From source

You can also install spaCy by cloning its GitHub repository and building it from source. This is the usual way to make changes to the code base.

For this, you need a Python distribution that includes the following −

  • Header files

  • A compiler

  • pip

  • virtualenv

  • git
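Before starting the build, the prerequisites listed above can be checked with a short script. This is only a sketch under some assumptions: the compiler names it searches for are common examples rather than a complete list, and it looks for the standard venv module (which the commands in this chapter use) instead of the separate virtualenv package.

```python
import importlib.util
import os
import shutil
import sysconfig


def check_build_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    include_dir = sysconfig.get_paths()["include"]
    return {
        "header files": os.path.isfile(os.path.join(include_dir, "Python.h")),
        # Common compiler executables; not an exhaustive list.
        "compiler": any(shutil.which(cc) for cc in ("cc", "gcc", "clang", "cl")),
        "pip": importlib.util.find_spec("pip") is not None,
        "venv": importlib.util.find_spec("venv") is not None,
        "git": shutil.which("git") is not None,
    }


for name, found in check_build_prerequisites().items():
    print(f"{name:12} {'found' if found else 'missing'}")
```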

Use the following commands −

First, update pip as follows −

python -m pip install -U pip

Now, clone spaCy with the command given below −

git clone https://github.com/explosion/spaCy

Now, navigate into the directory using the following command −

cd spaCy

Next, create a virtual environment in .env, as shown below −

python -m venv .env

Now, activate the newly created virtual environment −

source .env/bin/activate

Next, set the Python path to the spaCy directory as follows −

export PYTHONPATH=`pwd`

Now, install all requirements as follows −

pip install -r requirements.txt

Finally, compile spaCy −

python setup.py build_ext --inplace

Ubuntu

Use the following command to install the system-level dependencies on Ubuntu −

sudo apt-get install build-essential python-dev git

macOS/OS X

macOS and OS X come with Python and git preinstalled, so we only need to install a recent version of XCode, including the Command Line Tools (CLT).

Windows

The table below lists the Visual C++ Build Tools or Visual Studio Express version that matches each official distribution of the Python interpreter. Choose one as per your requirements and install it −

Distribution    Version
Python 2.7      Visual Studio 2008
Python 3.4      Visual Studio 2010
Python 3.5+     Visual Studio 2015

Upgrading spaCy

The following points should be kept in mind while upgrading spaCy −

  • Start with a clean virtual environment.

  • When upgrading spaCy to a new major version, make sure the latest compatible models are installed.

  • There should be no old shortcut links or incompatible model packages in your virtual environment.

  • If you have trained your own models, the training and runtime inputs must match, i.e. you must retrain your models with the newer version as well.

spaCy v2.0 and above provides a validate command, which lets the user verify whether all the installed models are compatible with the installed spaCy version.

If there are any incompatible models, the validate command prints tips and installation instructions. This command can also detect out-of-sync model links created in various virtual environments.

You can use the validate command as follows −

pip install -U spacy
python -m spacy validate

In the above command, python -m is used to make sure that we are executing the correct version of spaCy.
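The same idea carries over when the check is launched from a Python script: using sys.executable ties the command to the interpreter that is running the script, and therefore to the spaCy installed for it. The sketch below only builds and prints the command. The helper names validate_command and run_validate are chosen for this example, and run_validate is defined but not called here.

```python
import subprocess
import sys


def validate_command():
    """Build the validate command for the interpreter running this script."""
    return [sys.executable, "-m", "spacy", "validate"]


def run_validate():
    """Run the validate command and return the process exit code."""
    return subprocess.run(validate_command(), check=False).returncode


print(" ".join(validate_command()))
```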

Running spaCy with GPU

spaCy v2.0 and above comes with neural network (NN) models that are implemented in Thinc. To run spaCy with Graphics Processing Unit (GPU) support, it relies on Chainer’s CuPy module, which provides a numpy-compatible interface for GPU arrays.

You can install spaCy with GPU support by specifying one of the following −

  • spaCy[cuda]

  • spaCy[cuda90]

  • spaCy[cuda91]

  • spaCy[cuda92]

  • spaCy[cuda100]

  • spaCy[cuda101]

  • spaCy[cuda102]

If you know your CUDA version, using the more explicit specifier allows cupy to be installed from a prebuilt wheel, which saves compilation time.

Use the following command for the installation −

pip install -U spacy[cuda92]
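Choosing the specifier can be expressed as a small lookup. The sketch below maps a CUDA version string to one of the specifiers listed in this chapter and falls back to the generic one otherwise. The function name spacy_gpu_requirement is hypothetical, and the set of known specifiers only reflects the list above; other spaCy releases may support different ones.

```python
# Specifiers listed in this chapter; other spaCy releases may offer different ones.
KNOWN_EXTRAS = {"cuda90", "cuda91", "cuda92", "cuda100", "cuda101", "cuda102"}


def spacy_gpu_requirement(cuda_version=None):
    """Map a CUDA version such as "10.1" to a pip requirement string."""
    if cuda_version:
        extra = "cuda" + cuda_version.replace(".", "")
        if extra in KNOWN_EXTRAS:
            return f"spacy[{extra}]"
    # Unknown or unspecified version: fall back to the generic specifier.
    return "spacy[cuda]"


print(spacy_gpu_requirement("9.2"))
print(spacy_gpu_requirement())
```

The returned string can be passed to pip install -U in place of the hard-coded specifier.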

After a GPU-enabled installation, activate it by calling spacy.prefer_gpu or spacy.require_gpu as follows −

import spacy
spacy.prefer_gpu()
nlp_model = spacy.load("en_core_web_sm")
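The two calls behave differently when no GPU is available: spacy.prefer_gpu falls back to the CPU and reports whether a GPU was activated, whereas spacy.require_gpu raises an error. The sketch below checks the return value of prefer_gpu. It assumes spaCy is installed and skips itself otherwise.

```python
try:
    import spacy
except ImportError:
    spacy = None  # spaCy is not installed, so there is nothing to demonstrate

if spacy is not None:
    # prefer_gpu() returns True if a GPU was activated and False otherwise.
    using_gpu = spacy.prefer_gpu()
    print("Running on", "GPU" if using_gpu else "CPU")
```

Call it before loading any pipeline so that the models are allocated on the selected device.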