spaCy - Models and Languages



Let us learn about the languages supported by spaCy and its statistical models.

Language Support

Currently, spaCy supports the following languages −

Language Code
Chinese zh
Danish da
Dutch nl
English en
French fr
German de
Greek el
Italian it
Japanese ja
Lithuanian lt
Multi-language xx
Norwegian Bokmål nb
Polish pl
Portuguese pt
Romanian ro
Spanish es
Afrikaans af
Albanian sq
Arabic ar
Armenian hy
Basque eu
Bengali bn
Bulgarian bg
Catalan ca
Croatian hr
Czech cs
Estonian et
Finnish fi
Gujarati gu
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Irish ga
Kannada kn
Korean ko
Latvian lv
Ligurian lij
Luxembourgish lb
Macedonian mk
Malayalam ml
Marathi mr
Nepali ne
Persian fa
Russian ru
Serbian sr
Sinhala si
Slovak sk
Slovenian sl
Swedish sv
Tagalog tl
Tamil ta
Tatar tt
Telugu te
Thai th
Turkish tr
Ukrainian uk
Urdu ur
Vietnamese vi
Yoruba yo

spaCy’s statistical models

spaCy’s models can be installed as Python packages, which means that, like any other module, they become a component of our application. These packages can be versioned and listed in a requirements.txt file.
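As a minimal sketch of pinning a model like any other dependency, the helper below builds a line that could be placed in requirements.txt. The name model_requirement is hypothetical, and the URL layout is an assumption taken from the spacy-models release archive used in the pip example later in this chapter; check the release page for the exact link before relying on it.

```python
# Assumed URL layout of the spacy-models release archives on GitHub.
RELEASE_URL = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "{name}-{version}/{name}-{version}.tar.gz"
)


def model_requirement(name, version):
    """Return a requirements.txt line pinning one model archive (hypothetical helper)."""
    return RELEASE_URL.format(name=name, version=version)


# Print the line so it can be copied into requirements.txt.
print(model_requirement("en_core_web_sm", "2.2.0"))
```

Because pip accepts a plain archive URL as a requirement, the printed line can be used as-is.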

Installing spaCy’s Statistical Models

The installation of spaCy’s statistical models is explained below −

Using Download command

Using spaCy’s download command is one of the easiest ways to download a model because it automatically finds the best-matching model compatible with our spaCy version.

You can use the download command in the following ways −

The following command will download the best-matching version of a specific model for your spaCy version −

python -m spacy download en_core_web_sm

The following command will download the best-matching default model for the given language and will also create a shortcut link −

python -m spacy download en

The following command will download the exact model version and will not create a shortcut link −

python -m spacy download en_core_web_sm-2.2.0 --direct
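The same download can also be triggered from a Python script, for example during project setup. The sketch below only assembles the command shown above and runs it with the current interpreter; build_download_command and download_model are hypothetical helper names, not part of spaCy.

```python
import subprocess
import sys


def build_download_command(model, direct=False):
    """Assemble the 'python -m spacy download' call as an argument list."""
    command = [sys.executable, "-m", "spacy", "download", model]
    if direct:
        # --direct expects the full model name including its version.
        command.append("--direct")
    return command


def download_model(model, direct=False):
    """Run the download; raises CalledProcessError if the command fails."""
    subprocess.run(build_download_command(model, direct), check=True)


# Example call (needs spaCy installed and network access):
# download_model("en_core_web_sm")
```

Using sys.executable keeps the download inside the same environment as the running script, which matters when several virtual environments are present.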

Via pip

We can also download and install a model directly via pip. For this, use pip install with the URL or local path of the archive file. If you do not have the direct link to a model, go to the model’s release page and copy it from there.

For example, the command for installing a model using pip with an external URL is as follows −

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

The command for installing a model using pip with a local file is as follows −

pip install /Users/you/en_core_web_sm-2.2.0.tar.gz

The above commands will install the particular model into your site-packages directory. Once done, we can use spacy.load() to load it via its package name.
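Loading by package name can be sketched as follows. The helper load_installed_model is hypothetical; it first checks that the package is importable, so the script fails gently when the model has not been installed yet. The model name en_core_web_sm is only an example.

```python
import importlib.util


def load_installed_model(package_name):
    """Load a pip-installed model by package name, or return None if it is missing."""
    if importlib.util.find_spec(package_name) is None:
        return None
    import spacy  # imported lazily so the check above works without spaCy

    return spacy.load(package_name)


nlp = load_installed_model("en_core_web_sm")
if nlp is None:
    print("Model not installed; run one of the pip commands above first.")
else:
    print(nlp("This is a sentence.").text)
```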

Manually

You can also download the data manually and place it in a custom directory of your choice.

Use any of the following ways to download the data manually −

  • Download the model via your browser from the latest release.

  • You can configure your own download script by using the URL (Uniform Resource Locator) of the archive file.

Once done with downloading, we can place the model package directory anywhere on our local file system. Now to use it with spaCy, we can create a shortcut link for the data directory.
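A custom download script, as mentioned in the second option above, could look like the sketch below. It uses only the Python standard library; archive_name and download_archive are hypothetical names, and the URL passed in is whatever archive link you copied from the release page.

```python
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def archive_name(url):
    """Return the file name at the end of an archive URL."""
    return PurePosixPath(urlparse(url).path).name


def download_archive(url, target_dir):
    """Download the archive at url into target_dir and return the local path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    destination = target / archive_name(url)
    urllib.request.urlretrieve(url, str(destination))
    return destination


# Example call (needs network access):
# download_archive("<archive URL from the release page>", "models")
```

The downloaded file is a .tar.gz archive, so it still has to be unpacked before the model directory can be used.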

Using models with spaCy

This section explains how to use models with spaCy.

Using custom shortcut links

We can download spaCy models manually, as discussed above, and put them in a local directory. Whenever a spaCy project needs one of these models, we can create a shortcut link so that spaCy loads the model from that directory. This way, you will not end up with duplicate data.

For this purpose, spaCy provides the link command, which can be used as follows −

python -m spacy link [package name or path] [shortcut] [--force]

In the above command, the first argument is the package name or local path. If you have installed the model via pip, you can use the package name here. Otherwise, pass the local path to the model package.

The second argument is the internal name, that is, the name you want to use for the model. The --force flag in the above command will overwrite any existing links.

The examples are given below for both the cases.

Example

Given below is an example of setting up a shortcut link to load an installed package as “en_default” −

python -m spacy link en_core_web_md en_default

An example of setting up a shortcut link to load a local model as “my_default_en” is as follows −

python -m spacy link /Users/Leekha/model my_default_en

Importing as module

We can also import an installed model as a module and call its load() method with no arguments, as shown below −

import spacy
import en_core_web_sm

# Load the installed model package directly; no arguments are needed
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
print(my_doc)

Output

This is my first example.

Using own models

You can also use a model you have trained yourself. For this, you need to save the state of your trained model using the Language.to_disk() method. To make deployment more convenient, you can also wrap it as a Python package.
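Saving a trained model can be sketched as below. The helper save_pipeline is hypothetical; it only relies on the object having a to_disk() method, as spaCy’s Language objects do. The commented usage assumes spaCy is installed and uses a blank English pipeline as a stand-in for your own trained model.

```python
from pathlib import Path


def save_pipeline(nlp, directory):
    """Write a pipeline to disk through its to_disk() method and return the path."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    nlp.to_disk(path)
    return path


# Example with spaCy installed:
# import spacy
# nlp = spacy.blank("en")              # stand-in for your trained pipeline
# model_dir = save_pipeline(nlp, "my_model")
# nlp_reloaded = spacy.load(model_dir)  # load the saved state again
```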

Naming Conventions

In general, spaCy expects all its model packages to follow the naming convention [lang]_[name].

For spaCy’s own models, the name can be further divided into the following three components −

  • Type − It reflects the capabilities of the model. For example, core is used for a general-purpose model with vocabulary, syntax, entities and word vectors. Similarly, depent is used for a model with only vocabulary, syntax and entities.

  • Genre − It shows the type of text on which the model is trained. For example, web or news.

  • Size − As the name implies, it is the model size indicator. For example, sm (for small), md (for medium), or lg (for large).
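The components above can be illustrated with a small parser. This is only a sketch: ModelName and parse_model_name are hypothetical names, and the code assumes the name has exactly four underscore-separated parts (language, type, genre, size), as in en_core_web_sm.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str
    model_type: str
    genre: str
    size: str


def parse_model_name(package_name):
    """Split a name such as 'en_core_web_sm' into its four components."""
    parts = package_name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {package_name!r}")
    return ModelName(*parts)


print(parse_model_name("en_core_web_sm"))
# ModelName(lang='en', model_type='core', genre='web', size='sm')
```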

Model versioning

The model versioning reflects the following −

  • Compatibility with spaCy.

  • Major and minor model version.

For example, a model version r.s.t translates to the following −

  • r − spaCy major version. For example, 1 for spaCy v1.x.

  • s − Model major version. Models with different major versions cannot be loaded by the same code.

  • t − Model minor version. It indicates the same model structure but different parameter values, for example, because the model was trained on different data or for a different number of iterations.
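As a rough illustration of the scheme, the sketch below splits a version string into its three parts and checks whether two versions differ only in the minor part. The function names are hypothetical, and the code assumes the version consists of exactly three dot-separated integers.

```python
def parse_model_version(version):
    """Split 'r.s.t' into spaCy major, model major and model minor numbers."""
    r, s, t = (int(part) for part in version.split("."))
    return {"spacy_major": r, "model_major": s, "model_minor": t}


def same_structure(version_a, version_b):
    """True if both versions share the spaCy major and model major parts."""
    a = parse_model_version(version_a)
    b = parse_model_version(version_b)
    return (a["spacy_major"], a["model_major"]) == (b["spacy_major"], b["model_major"])


print(parse_model_version("2.2.0"))
# {'spacy_major': 2, 'model_major': 2, 'model_minor': 0}
print(same_structure("2.2.0", "2.2.5"))  # True
print(same_structure("2.2.0", "2.3.0"))  # False
```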
