spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.
spaCy is designed to help you do real work: to build real products and gather real insights.
The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive.
Features of spaCy
Non-destructive tokenization
Named entity recognition
Support for 52+ languages
19 statistical models for 9 languages
Pre-trained word vectors
State-of-the-art speed
Easy deep learning integration
Part-of-speech tagging
Labelled dependency parsing
Syntax-driven sentence segmentation
Built-in visualizers for syntax and NER
Convenient string-to-hash mapping
Export to numpy data arrays
Efficient binary serialization
Easy model packaging and deployment
Robust, rigorously evaluated accuracy
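The string-to-hash mapping listed above is exposed through the vocabulary's StringStore. A minimal sketch; it uses a blank English pipeline, so no model download is needed:

```python
import spacy

# A blank pipeline is enough: the StringStore works without a model
nlp = spacy.blank("en")
doc = nlp("coffee")

# Every string spaCy sees is mapped to a 64-bit integer hash...
coffee_hash = nlp.vocab.strings["coffee"]
# ...and the mapping is reversible for strings the vocab has seen
coffee_text = nlp.vocab.strings[coffee_hash]
```

Internally, spaCy stores these hashes instead of the strings themselves, which keeps Doc objects compact.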
Getting started
Install spaCy
$ pip install spacy
Statistical models
Download statistical models
Predict part-of-speech tags, dependency labels, named entities and more. See the spaCy website for available models.
Download en_core_web_sm
$ python -m spacy download en_core_web_sm
Import and load
import spacy
nlp = spacy.load("en_core_web_sm")
If nlp = spacy.load("en_core_web_sm") fails with OSError: [E050] Can't find model 'en_core_web_sm', the model isn't installed or linked. Try the shortcut link instead:
$ python -m spacy download en
import spacy
nlp = spacy.load("en")
Documents, tokens and spans
Processing text
Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.
doc = nlp("This is a text")
A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.
Accessing token attributes
doc = nlp("This is a text")
# Token texts
tokens = [token.text for token in doc]
print(tokens)
Output:
['This', 'is', 'a', 'text']
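Beyond .text, tokens carry lexical attributes computed from the raw token string. A small sketch; it uses spacy.blank("en") so it runs without a downloaded model (the attributes behave the same with a loaded one):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline, no model needed
doc = nlp("Apple costs 2 dollars.")

# Lexical attributes are derived from the token text itself
info = [(t.text, t.is_alpha, t.like_num) for t in doc]
# [('Apple', True, False), ('costs', True, False),
#  ('2', False, True), ('dollars', True, False), ('.', False, False)]
```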
Spans
A slice from a Doc object.
Accessing spans
doc = nlp("This is a text")
span = doc[2:4]
span.text
Output: 'a text'
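A Span keeps a reference to its parent Doc and its token boundaries, and Doc.char_span builds one from character offsets. A sketch with a blank pipeline (no model needed):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is a text")

span = doc[2:4]
# A span knows its token boundaries in the parent Doc
start_end = (span.start, span.end)  # (2, 4)

# char_span maps character offsets to a Span
# (returns None if the offsets don't align with token boundaries)
same = doc.char_span(8, 14)         # "a text"
```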
Linguistic features
Attributes return integer label IDs. For the string labels, use the attribute names with a trailing underscore, e.g. token.pos_.
Part-of-speech tags (predicted by statistical model)
doc = nlp("This is a text.")
# Coarse-grained part-of-speech tags
[token.pos_ for token in doc]
Output: ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Fine-grained part-of-speech tags
[token.tag_ for token in doc]
Output: ['DT', 'VBZ', 'DT', 'NN', '.']
Syntactic dependencies
doc = nlp("This is a text.")
# Dependency labels
[token.dep_ for token in doc]
# ['nsubj', 'ROOT', 'det', 'attr', 'punct']
# Syntactic head token (governor)
[token.head.text for token in doc]
# ['is', 'is', 'text', 'is', 'is']
Named Entities
doc = nlp("Larry Page founded Google")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]
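Entities are just Spans with a label, and they can also be set by hand, which is useful for tests or rule-based pipelines. A sketch that mimics the model output above on a blank pipeline (so no model download is needed):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Larry Page founded Google")

# Set entities manually: Span(doc, start_token, end_token, label)
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="ORG")]

ents = [(ent.text, ent.label_) for ent in doc.ents]
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]
```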
Sentences
doc = nlp("This is a sentence. This is another one.")
# doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]
# ['This is a sentence.', 'This is another one.']
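Sentence boundaries normally come from the dependency parse, but spaCy also ships a rule-based sentencizer that needs no model. A sketch (assumes spaCy v3's string-based add_pipe):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # punctuation-based sentence splitting

doc = nlp("This is a sentence. This is another one.")
sents = [sent.text for sent in doc.sents]
# ['This is a sentence.', 'This is another one.']
```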
Base noun phrases
doc = nlp("I have a red car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
# ['I', 'a red car']
Label explanations
spacy.explain("RB")
# 'adverb'
spacy.explain("GPE")
# 'Countries, cities, states'
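spacy.explain is a plain glossary lookup, so it needs no model and returns None for labels it doesn't know:

```python
import spacy

desc = spacy.explain("VBZ")           # a fine-grained tag from the output above
unknown = spacy.explain("NOT-A-TAG")  # unknown labels return None
```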
Visualizing
If you're in a Jupyter notebook, use displacy.render. Otherwise, use displacy.serve to start a web server and show the visualization in your browser.
from spacy import displacy
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
Visualize named entities
doc = nlp("Larry Page founded Google")
displacy.render(doc, style="ent")
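Outside a notebook, displacy.render can also return the markup as a string, and with manual=True it accepts pre-computed annotations, so no model is required. A sketch (the character offsets below are written by hand for illustration):

```python
from spacy import displacy

# Pre-computed entity annotations in displacy's "manual" format
example = {
    "text": "Larry Page founded Google",
    "ents": [
        {"start": 0, "end": 10, "label": "PERSON"},
        {"start": 19, "end": 25, "label": "ORG"},
    ],
    "title": None,
}

# jupyter=False makes render return the HTML markup instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)
```

The returned string can be written to a file or embedded in a web page.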