spaCy - Introduction



This chapter covers the features, extensions, and visualisers of spaCy. It also provides a feature comparison to help readers weigh the functionality of spaCy against that of the Natural Language Toolkit (NLTK) and CoreNLP. Here, NLP refers to Natural Language Processing.

What is spaCy?

spaCy is an open-source software library for advanced NLP, developed by Matthew Honnibal and Ines Montani. It is written in Python and Cython (an extension of Python that compiles to C and is mainly designed to give C-like performance to Python programs).

spaCy is a relatively new framework, but it is one of the most powerful and advanced libraries for implementing NLP.

Features

Some of the features of spaCy that make it popular are explained below −

Fast − spaCy is specially designed to be as fast as possible.

Accuracy − spaCy's labelled dependency parser makes it one of the most accurate frameworks of its kind (within 1% of the best available).

Batteries included − The batteries included in spaCy are as follows −

  • Index preserving tokenization.

  • “Alpha tokenization” support for more than 50 languages.

  • Part-of-speech tagging.

  • Pre-trained word vectors.

  • Built-in easy and beautiful visualizers for named entities and syntax.

  • Text classification.
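As a minimal sketch of what index-preserving tokenization means in practice, the snippet below tokenizes a sentence with a blank English pipeline (so no trained model download is assumed) and checks that each token's character offset points back into the original string. The sample sentence and the variable names are illustrative only.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no trained model is needed.
nlp = spacy.blank("en")

text = "spaCy is written in Python and Cython."
doc = nlp(text)

for token in doc:
    # token.idx is the character offset of the token in the original text.
    start, end = token.idx, token.idx + len(token.text)
    assert text[start:end] == token.text
    print(token.i, token.idx, repr(token.text))

# Joining each token with its trailing whitespace reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because the offsets and trailing whitespace are kept, annotations produced later in a pipeline can always be mapped back to exact positions in the source text.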

Extensible − You can easily use spaCy with other existing tools such as TensorFlow, Gensim, and scikit-learn.

Deep learning integration − It has Thinc, a deep learning framework designed for NLP tasks.

Extensions and visualisers

Some of the free, open-source, and easy-to-use extensions and visualisers that come with spaCy are listed below −

Thinc − It is a Machine Learning (ML) library optimised for Central Processing Unit (CPU) usage. It is also designed for deep learning with text input and NLP tasks.

sense2vec − This library is for computing word similarities. It is based on Word2vec.

displaCy − It is an open-source dependency parse tree visualiser. It is built with JavaScript, CSS (Cascading Style Sheets), and SVG (Scalable Vector Graphics).

displaCy ENT − It is a built-in named entity visualiser that comes with spaCy. It is built with JavaScript and CSS. It lets users check their model’s predictions in the browser.
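The named entity visualiser can be tried out with a short sketch like the one below. It assumes displaCy's manual mode, in which the entity spans are supplied by hand as character offsets rather than predicted by a trained model; the sentence, offsets, and labels are made up for illustration.

```python
from spacy import displacy

# Hand-written entity annotations (character offsets), standing in for
# the predictions a trained model would normally provide.
example = {
    "text": "Matthew Honnibal and Ines Montani develop spaCy.",
    "ents": [
        {"start": 0, "end": 16, "label": "PERSON"},
        {"start": 21, "end": 33, "label": "PERSON"},
    ],
    "title": None,
}

# jupyter=False forces render() to return the markup as a string;
# page=False returns only the fragment, not a full HTML page.
html = displacy.render(example, style="ent", manual=True, page=False, jupyter=False)
print(html)
```

The returned string can be written to an .html file and opened in a browser. With a trained pipeline, a processed Doc object would be passed instead of the dictionary and manual=True would be dropped.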

Feature Comparison

The following table shows the comparison of the functionalities provided by spaCy, NLTK, and CoreNLP −

Features                  | spaCy | NLTK | CoreNLP
Python API                | Yes   | Yes  | No
Easy installation         | Yes   | Yes  | Yes
Multi-language support    | Yes   | Yes  | Yes
Integrated word vectors   | Yes   | No   | No
Tokenization              | Yes   | Yes  | Yes
Part-of-speech tagging    | Yes   | Yes  | Yes
Sentence segmentation     | Yes   | Yes  | Yes
Dependency parsing        | Yes   | No   | Yes
Entity recognition        | Yes   | Yes  | Yes
Entity linking            | Yes   | No   | No
Coreference resolution    | No    | No   | Yes

Benchmarks

spaCy's syntactic parser has been benchmarked as one of the fastest available, and its accuracy is within 1% of the best available.

The following table shows the parsing accuracy of spaCy alongside other systems −

System     | Year | Language          | Accuracy (%)
spaCy v2.x | 2017 | Python and Cython | 92.6
spaCy v1.x | 2015 | Python and Cython | 91.8
ClearNLP   | 2015 | Java              | 91.7
CoreNLP    | 2015 | Java              | 89.6
MATE       | 2015 | Java              | 92.5
Turbo      | 2015 | C++               | 92.4
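The "within 1% of the best available" statement can be checked against the table with a few lines of arithmetic. The sketch below copies the accuracy figures from the table and assumes that "1%" is meant as one percentage point; the variable names are illustrative.

```python
# Accuracy figures copied from the benchmark table above (percent).
accuracy = {
    "spaCy v2.x": 92.6,
    "spaCy v1.x": 91.8,
    "ClearNLP": 91.7,
    "CoreNLP": 89.6,
    "MATE": 92.5,
    "Turbo": 92.4,
}

best_system = max(accuracy, key=accuracy.get)
best_score = accuracy[best_system]

# Gap of each system to the best score, rounded to one decimal place
# to avoid floating-point noise.
gaps = {name: round(best_score - score, 1) for name, score in accuracy.items()}

# Assumption: "within 1%" is read as "at most one percentage point behind".
within_one_point = [name for name, gap in gaps.items() if gap <= 1.0]

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<12} {accuracy[name]:>5.1f}  gap to best: {gap:.1f}")
```

Under this reading, both spaCy releases in the table fall within one point of the highest score listed.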