spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.
spaCy is designed to help you do real work: to build real products and gather real insights.
The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive.
Features of spaCy
Non-destructive tokenization
Named entity recognition
Support for 52+ languages
19 statistical models for 9 languages
Pre-trained word vectors
State-of-the-art speed
Easy deep learning integration
Part-of-speech tagging
Labelled dependency parsing
Syntax-driven sentence segmentation
Built-in visualizers for syntax and NER
Convenient string-to-hash mapping
Export to numpy data arrays
Efficient binary serialization
Easy model packaging and deployment
Robust, rigorously evaluated accuracy
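The string-to-hash mapping listed above is exposed through the vocabulary's StringStore. A minimal sketch; it uses a blank English pipeline, so no model download is needed:

```python
import spacy

# A blank pipeline is enough: the StringStore works without a model
nlp = spacy.blank("en")
doc = nlp("coffee")

# Every string spaCy sees is mapped to a 64-bit integer hash...
coffee_hash = nlp.vocab.strings["coffee"]
# ...and the mapping is reversible for strings the vocab has seen
coffee_text = nlp.vocab.strings[coffee_hash]
```

Internally, spaCy stores these hashes instead of the strings themselves, which keeps Doc objects compact.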
Getting started
Install spaCy
$ pip install spacy
Statistical models
Download statistical models
Predict part-of-speech tags, dependency labels, named entities and more. See the spaCy website for available models.
Download en_core_web_sm
$ python -m spacy download en_core_web_sm
Import and load
import spacy
nlp = spacy.load("en_core_web_sm")
If nlp = spacy.load("en_core_web_sm") fails with OSError: [E050] Can't find model 'en_core_web_sm', the model isn't installed or linked. Try the shortcut link instead:
$ python -m spacy download en
import spacy
nlp = spacy.load("en")
Documents, tokens and spans
Processing text
Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.
doc = nlp("This is a text")
A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.
Accessing token attributes
doc = nlp("This is a text")
# Token texts
tokens = [token.text for token in doc]
print(tokens)
Output:
['This', 'is', 'a', 'text']
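Beyond .text, tokens carry lexical attributes computed from the raw token string. A small sketch; it uses spacy.blank("en") so it runs without a downloaded model (the attributes behave the same with a loaded one):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline, no model needed
doc = nlp("Apple costs 2 dollars.")

# Lexical attributes are derived from the token text itself
info = [(t.text, t.is_alpha, t.like_num) for t in doc]
# [('Apple', True, False), ('costs', True, False),
#  ('2', False, True), ('dollars', True, False), ('.', False, False)]
```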
Spans
A slice from a Doc object.
Accessing spans
doc = nlp("This is a text")
span = doc[2:4]
span.text
Output: 'a text'
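A Span keeps a reference to its parent Doc and its token boundaries, and Doc.char_span builds one from character offsets. A sketch with a blank pipeline (no model needed):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is a text")

span = doc[2:4]
# A span knows its token boundaries in the parent Doc
start_end = (span.start, span.end)  # (2, 4)

# char_span maps character offsets to a Span
# (returns None if the offsets don't align with token boundaries)
same = doc.char_span(8, 14)         # "a text"
```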
Linguistic features
Attributes return integer label IDs. For the string labels, use the attribute names with a trailing underscore, e.g. token.pos_.
Part-of-speech tags (predicted by statistical model)
doc = nlp("This is a text.")
# Coarse-grained part-of-speech tags
[token.pos_ for token in doc]
Output: ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Fine-grained part-of-speech tags
[token.tag_ for token in doc]
Output: ['DT', 'VBZ', 'DT', 'NN', '.']
Syntactic dependencies
doc = nlp("This is a text.")
# Dependency labels
[token.dep_ for token in doc]
# ['nsubj', 'ROOT', 'det', 'attr', 'punct']
# Syntactic head token (governor)
[token.head.text for token in doc]
# ['is', 'is', 'text', 'is', 'is']
Named Entities
doc = nlp("Larry Page founded Google")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]
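Entities are just Spans with a label, and they can also be set by hand, which is useful for tests or rule-based pipelines. A sketch that mimics the model output above on a blank pipeline (so no model download is needed):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Larry Page founded Google")

# Set entities manually: Span(doc, start_token, end_token, label)
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="ORG")]

ents = [(ent.text, ent.label_) for ent in doc.ents]
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]
```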
Sentences
doc = nlp("This is a sentence. This is another one.")
# doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]
# ['This is a sentence.', 'This is another one.']
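Sentence boundaries normally come from the dependency parse, but spaCy also ships a rule-based sentencizer that needs no model. A sketch (assumes spaCy v3's string-based add_pipe):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # punctuation-based sentence splitting

doc = nlp("This is a sentence. This is another one.")
sents = [sent.text for sent in doc.sents]
# ['This is a sentence.', 'This is another one.']
```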
Base noun phrases
doc = nlp("I have a red car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
# ['I', 'a red car']
Label explanations
spacy.explain("RB")
# 'adverb'
spacy.explain("GPE")
# 'Countries, cities, states'
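spacy.explain is a plain glossary lookup, so it needs no model and returns None for labels it doesn't know:

```python
import spacy

desc = spacy.explain("VBZ")           # a fine-grained tag from the output above
unknown = spacy.explain("NOT-A-TAG")  # unknown labels return None
```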
Visualizing
If you're in a Jupyter notebook, use displacy.render. Otherwise, use displacy.serve to start a web server and show the visualization in your browser.
from spacy import displacy
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
Visualize named entities
doc = nlp("Larry Page founded Google")
displacy.render(doc, style="ent")
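Outside a notebook, displacy.render can also return the markup as a string, and with manual=True it accepts pre-computed annotations, so no model is required. A sketch (the character offsets below are written by hand for illustration):

```python
from spacy import displacy

# Pre-computed entity annotations in displacy's "manual" format
example = {
    "text": "Larry Page founded Google",
    "ents": [
        {"start": 0, "end": 10, "label": "PERSON"},
        {"start": 19, "end": 25, "label": "ORG"},
    ],
    "title": None,
}

# jupyter=False makes render return the HTML markup instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)
```

The returned string can be written to a file or embedded in a web page.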