day 2: getting started with nlp

In [4]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

spaCy Features

In the documentation, you'll come across mentions of spaCy's features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.

  • Tokenization: Segmenting text into words, punctuation marks, etc.
  • Part-of-speech (POS) Tagging: Assigning word types to tokens, like verb or noun.
  • Dependency Parsing: Assigning syntactic dependency labels that describe the relations between individual tokens, like subject or object.
  • Lemmatization: Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".
  • Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
  • Named Entity Recognition (NER): Labelling named "real-world" objects, like persons, companies or locations.
  • Entity Linking (EL): Disambiguating textual entities to unique identifiers in a knowledge base.
  • Similarity: Comparing words, text spans and documents and how similar they are to each other.
  • Text Classification: Assigning categories or labels to a whole document, or parts of a document.
  • Rule-based Matching: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
  • Training: Updating and improving a statistical model's predictions.
  • Serialization: Saving objects to files or byte strings.
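
Which of these features are actually available depends on the components of the loaded pipeline. As a quick check (a minimal sketch, reusing the en_core_web_sm pipeline loaded above), the component names can be listed:

# List the processing components of the currently loaded pipeline
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']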

Tokenization

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.

  2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

Index:  0      1   2        3   4       5     6        7    8  9  10
Token:  Apple  is  looking  at  buying  U.K.  startup  for  $  1  billion
In [4]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
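
The exception rules described above can be observed directly. As a small check (reusing the nlp pipeline loaded earlier), a contraction is split into two tokens while an abbreviation is kept intact:

# "don't" is split into "do" and "n't" by a tokenizer exception rule,
# while "U.K." remains a single token
print([t.text for t in nlp("We don't live in the U.K. anymore.")])
# expected roughly: ['We', 'do', "n't", 'live', 'in', 'the', 'U.K.', 'anymore', '.']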

Linguistic annotations

spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether "google" is used as a verb, or refers to the website or company in a specific context.

In [6]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj

Part-of-speech tags and dependencies

After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following "the" in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

To learn more about part-of-speech tagging and rule-based morphology, and how to navigate and use the parse tree effectively, see the usage guides on part-of-speech tagging and using the dependency parse.

  • Text: The original word text.
  • Lemma: The base form of the word.
  • POS: The simple UPOS part-of-speech tag.
  • Tag: The detailed part-of-speech tag.
  • Dep: Syntactic dependency, i.e. the relation between tokens.
  • Shape: The word shape – capitalization, punctuation, digits.
  • is_alpha: Is the token an alpha character?
  • is_stop: Is the token part of a stop list, i.e. the most common words of the language?
Text     Lemma    POS    Tag  Dep       Shape  is_alpha  is_stop
Apple    apple    PROPN  NNP  nsubj     Xxxxx  True      False
is       be       AUX    VBZ  aux       xx     True      True
looking  look     VERB   VBG  ROOT      xxxx   True      False
at       at       ADP    IN   prep      xx     True      True
buying   buy      VERB   VBG  pcomp     xxxx   True      False
U.K.     u.k.     PROPN  NNP  compound  X.X.   False     False
startup  startup  NOUN   NN   dobj      xxxx   True      False
for      for      ADP    IN   prep      xxx    True      True
$        $        SYM    $    quantmod  $      False     False
1        1        NUM    CD   compound  d      False     False
billion  billion  NUM    CD   pobj      xxxx   True      False
In [7]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
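
Internally, spaCy stores these annotations as integer hash IDs; the underscore variants look the ID up in the shared string store. A minimal illustration (using the doc from above):

token = doc[0]                        # "Apple"
print(token.pos)                      # integer ID of the coarse POS tag
print(token.pos_)                     # readable string, e.g. "PROPN"
print(doc.vocab.strings[token.pos])   # manual lookup in the shared StringStore
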
In [9]:
displacy.render(doc, style='dep', jupyter=True)
[displaCy rendering of the dependency parse: arcs labelled nsubj, aux, prep, pcomp, compound, dobj, prep, quantmod, compound and pobj connect the tokens of the sentence.]
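
The dependency parse can also be navigated programmatically via each token's head and children attributes (a minimal sketch, using the doc from above):

# For every token, print its dependency label, its syntactic head and its children
for token in doc:
    print(token.text, token.dep_, token.head.text,
          [child.text for child in token.children])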

Named Entities

To learn more about entity recognition in spaCy, how to add your own entities to a document and how to train and update the entity predictions of a model, see the usage guides on named entity recognition and training pipelines.

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

  • Text: The original entity text.
  • Start: Character offset of the start of the entity in the Doc.
  • End: Character offset of the end of the entity in the Doc.
  • Label: Entity label, i.e. type.
Text        Start  End  Label  Description
Apple       0      5    ORG    Companies, agencies, institutions.
U.K.        27     31   GPE    Geopolitical entity, i.e. countries, cities, states.
$1 billion  44     54   MONEY  Monetary values, including unit.
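
The label descriptions can also be looked up programmatically via spacy.explain (a small check, reusing the spacy module imported above):

# Human-readable descriptions for entity (and other) labels
for label in ("ORG", "GPE", "MONEY"):
    print(label, "->", spacy.explain(label))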

We can print the entities of our example sentence via the ents property; then, using spaCy's built-in displaCy visualizer, we can highlight the named entities in a longer text:

In [10]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
In [11]:
text = """Apple decided to fire Tim Cook and hire somebody called John Doe as the new CEO.
They also discussed a merger with Google. On the long run it seems more likely that Apple
will merge with Amazon and Microsoft with Google. The companies will all relocate to
Austin in Texas before the end of the century. John Doe bought a Prosche."""

doc = nlp(text)
displacy.render(doc, style='ent', jupyter=True)
[displaCy entity rendering: Apple (ORG) decided to fire Tim Cook (PERSON) and hire somebody called John Doe (PERSON) as the new CEO. They also discussed a merger with Google (ORG). On the long run it seems more likely that Apple (ORG) will merge with Amazon (ORG) and Microsoft (ORG) with Google (ORG). The companies will all relocate to Austin (PERSON) in Texas (GPE) before the end of the century (DATE). John Doe (PERSON) bought a Prosche (ORG).]

Note that the predictions are not perfect: "Austin" is labelled PERSON rather than GPE, and the misspelled "Prosche" is tagged as ORG.

Word vectors and similarity

To learn more about word vectors, how to customize them and how to load your own vectors into spaCy, see the usage guide on using word vectors and semantic similarities.


To use word vectors in spaCy, you should install one of the larger models for the particular language; the default small models only come with context-sensitive tensors, not with static word vectors. The larger models can be installed as described on the spaCy models page:

python -m spacy download en_core_web_lg

The large model en_core_web_lg contains several hundred thousand unique word vectors (the exact number depends on the model version).

Let us import the necessary modules again, in particular spaCy:

In [2]:
import spacy

We can now load the English NLP pipeline and process a list of words. Since the small spaCy models only include context-sensitive tensors, we use the downloaded large model, which provides proper word vectors. We load the large model as follows:

In [3]:
nlp = spacy.load('en_core_web_lg')
#nlp = spacy.load("en_core_web_sm")
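
We can verify that this pipeline actually ships with a table of word vectors (a small check; the exact numbers depend on the model version):

# Shape of the vector table: (number of vectors, vector dimensionality)
print(nlp.vocab.vectors.shape)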

We can process a list of words with the pipeline by calling the nlp object:

In [4]:
tokens = nlp(u'dog poodle beagle cat banana apple')

As described in the spaCy chapter Word Vectors and Semantic Similarity, the resulting Doc, Span, and Token objects provide a similarity() method, which returns the similarity between two such objects:

In [5]:
for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))
dog dog 1.0
dog poodle 0.67507446
dog beagle 0.66592145
dog cat 0.80168545
dog banana 0.24327648
dog apple 0.26339024
poodle dog 0.67507446
poodle poodle 1.0
poodle beagle 0.71166867
poodle cat 0.573045
poodle banana 0.22574891
poodle apple 0.19250016
beagle dog 0.66592145
beagle poodle 0.71166867
beagle beagle 1.0
beagle cat 0.55627644
beagle banana 0.17828682
beagle apple 0.21266587
cat dog 0.80168545
cat poodle 0.573045
cat beagle 0.55627644
cat cat 1.0
cat banana 0.2815437
cat apple 0.28213844
banana dog 0.24327648
banana poodle 0.22574891
banana beagle 0.17828682
banana cat 0.2815437
banana banana 1.0
banana apple 0.5831844
apple dog 0.26339024
apple poodle 0.19250016
apple beagle 0.21266587
apple cat 0.28213844
apple banana 0.5831844
apple apple 1.0
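
For tokens that have vectors, similarity() is by default the cosine similarity of their word vectors. This can be verified manually (a minimal sketch, assuming numpy is available; the helper function cosine is only for illustration):

import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog, cat = nlp(u'dog cat')
print(dog.similarity(cat))             # spaCy's built-in similarity
print(cosine(dog.vector, cat.vector))  # should agree up to floating point precision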

We can access the vectors of these objects using the vector attribute:

In [6]:
tokens = nlp(u'dog cat banana grungle')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
dog True 7.0336733 False
cat True 6.6808186 False
banana True 6.700014 False
grungle False 0.0 True

The attribute has_vector is a boolean indicating whether the token has a vector in the model. The token grungle has no vector. It is also out-of-vocabulary (OOV), as the last column shows. Consequently, its vector norm, i.e. the length of its vector, is $0$.
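
The vector_norm attribute is the L2 norm (Euclidean length) of the token's vector, which can be checked directly (a small sketch, assuming numpy is available):

import numpy as np

token = tokens[0]                     # "dog"
print(token.vector_norm)              # norm as reported by spaCy
print(np.linalg.norm(token.vector))   # should match up to floating point precision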

In this model, each token vector has $300$ dimensions. We can print out the vector for a token:

In [7]:
n = 0
print(tokens[n].text, len(tokens[n].vector), tokens[n].vector)
dog 300 [-4.0176e-01  3.7057e-01  2.1281e-02 -3.4125e-01  4.9538e-02  2.9440e-01
 -1.7376e-01 -2.7982e-01  6.7622e-02  2.1693e+00 -6.2691e-01  2.9106e-01
 -6.7270e-01  2.3319e-01 -3.4264e-01  1.8311e-01  5.0226e-01  1.0689e+00
  1.4698e-01 -4.5230e-01 -4.1827e-01 -1.5967e-01  2.6748e-01 -4.8867e-01
  3.6462e-01 -4.3403e-02 -2.4474e-01 -4.1752e-01  8.9088e-02 -2.5552e-01
 -5.5695e-01  1.2243e-01 -8.3526e-02  5.5095e-01  3.6410e-01  1.5361e-01
  5.5738e-01 -9.0702e-01 -4.9098e-02  3.8580e-01  3.8000e-01  1.4425e-01
 -2.7221e-01 -3.7016e-01 -1.2904e-01 -1.5085e-01 -3.8076e-01  4.9583e-02
  1.2755e-01 -8.2788e-02  1.4339e-01  3.2537e-01  2.7226e-01  4.3632e-01
 -3.1769e-01  7.9405e-01  2.6529e-01  1.0135e-01 -3.3279e-01  4.3117e-01
  1.6687e-01  1.0729e-01  8.9418e-02  2.8635e-01  4.0117e-01 -3.9222e-01
  4.5217e-01  1.3521e-01 -2.8878e-01 -2.2819e-02 -3.4975e-01 -2.2996e-01
  2.0224e-01 -2.1177e-01  2.7184e-01  9.1703e-02 -2.0610e-01 -6.5758e-01
  1.8949e-01 -2.6756e-01  9.2639e-02  4.3316e-01 -4.8868e-01 -3.8309e-01
 -2.1910e-01 -4.4183e-01  9.8044e-01  6.7423e-01  8.4003e-01 -1.8169e-01
  1.7385e-01  4.1848e-01  1.6098e-01 -1.0490e-01 -4.1965e-01 -3.5660e-01
 -1.6837e-01 -6.3458e-01  3.8422e-01 -3.5043e-01  1.7486e-01  5.3528e-01
  2.0143e-01  3.7877e-02  4.7105e-01 -4.4344e-01  1.6840e-01 -1.6685e-01
 -2.4022e-01 -1.0077e-01  3.0334e-01  4.2730e-01  3.3803e-01 -4.3481e-01
  1.1343e-01  6.1958e-02  6.1808e-02 -1.4007e-01  8.2018e-02 -3.9130e-02
  5.1442e-02  2.8725e-01  5.8025e-01 -5.7641e-01 -3.4652e-01  1.0132e-01
  1.4463e-01  1.1569e-02 -3.3701e-01 -1.7586e-01 -3.5724e-01 -2.1423e-01
  1.1429e-02  4.7645e-01 -3.7463e-02 -2.9488e-01 -1.7465e-01  3.0255e-01
  6.0317e-01 -6.6790e-02 -2.7050e+00 -7.0308e-01  4.0548e-01  6.2874e-01
  6.3080e-01 -5.4513e-01 -9.6191e-03  2.6533e-01  2.3391e-01 -5.1886e-02
 -6.5759e-03  1.8573e-02 -4.5693e-01 -7.0351e-02 -3.0621e-01 -1.4018e-02
 -2.0408e-01  3.7100e-01 -3.2354e-01 -8.4646e-01  2.7092e-01 -1.1961e-01
 -9.5576e-02 -6.0464e-01  4.2409e-02  2.4656e-01  3.8445e-02 -2.5467e-02
 -9.2908e-02 -2.1356e-01  3.6120e-01  1.9113e-02  6.2741e-02 -1.3083e-01
 -1.5146e-03  5.8238e-01 -1.8956e-01  7.8105e-01  1.0477e-02  1.0928e+00
  1.0140e-01 -3.6248e-01 -1.1962e-01 -3.4462e-01 -5.5704e-01  2.5797e-01
  3.3356e-01  3.3194e-01 -3.1298e-01 -7.5547e-01 -7.5290e-01 -9.3072e-02
 -1.1173e-01 -5.7251e-01  1.6639e-01  6.3579e-01  2.4006e-01 -2.9211e-01
  9.0182e-01  1.2425e-01 -5.7751e-01  4.7986e-02 -4.2748e-01  2.4446e-01
  4.7232e-02  3.5694e-01  4.4241e-01 -2.3055e-01  6.6037e-01 -7.3983e-03
 -3.7857e-01  2.2759e-01 -3.7138e-01  3.1055e-01 -7.2105e-02 -2.4490e-01
 -3.9761e-02  5.3650e-01 -4.1478e-01  1.6563e-01  3.3707e-01  1.0920e-01
  3.7219e-01 -5.5727e-01 -7.8060e-01  1.4251e-01 -3.5828e-01  4.1638e-01
  2.1446e-01  1.8410e-01 -4.7704e-01 -2.2005e-02 -2.3634e-01 -2.2840e-01
  3.4722e-01  2.3667e-01  7.4249e-02 -8.8416e-02  2.8618e-01 -4.6942e-01
 -4.3914e-01 -2.6474e-01 -3.0690e-01 -1.5260e-01 -8.4870e-02  2.8410e-01
 -1.8481e-01 -2.2122e-01 -1.1169e-01 -2.5241e-02  4.5968e-02  3.5343e-02
  2.2467e-01  5.1556e-01 -6.5137e-04  9.9559e-02 -1.4215e-01  2.0136e-01
  2.8334e-01 -2.8772e-01  3.7766e-02 -3.7608e-01 -1.1681e-01 -6.7020e-01
 -4.6265e-02  3.8784e-01 -3.2295e-02 -5.4291e-02 -4.5384e-01  1.9552e-01
 -2.9470e-01  8.5009e-01  1.0345e-01  9.7010e-02  1.1339e-01  3.9502e-01
  5.9043e-02  2.1978e-01  1.8845e-01 -1.5891e-01 -1.0301e-01  3.3164e-01
  6.1477e-02 -2.9848e-01  4.4510e-01  4.7329e-01  2.6312e-01 -1.8495e-01
  1.4652e-01 -3.1510e-02  2.2908e-02 -2.5929e-01 -3.0862e-01  1.7545e-03
 -1.8962e-01  5.4789e-01  3.1194e-01  2.4693e-01  2.9929e-01 -7.4861e-02]

Here is another example of similarities, this time for a few well-known words:

In [8]:
tokens = nlp(u'queen king chef')

for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))
queen queen 1.0
queen king 0.72526103
queen chef 0.24236034
king queen 0.72526103
king king 1.0
king chef 0.25258547
chef queen 0.24236034
chef king 0.25258547
chef chef 1.0
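
Word vectors also support simple vector arithmetic; the classic king - man + woman example can be sketched with spaCy's vector table as follows (a hedged sketch: most_similar works on the raw vector table, and the exact neighbours depend on the model version):

import numpy as np

# Build the query vector king - man + woman
query = (nlp.vocab["king"].vector
         - nlp.vocab["man"].vector
         + nlp.vocab["woman"].vector)

# Look up the closest entries in the vector table
keys, rows, scores = nlp.vocab.vectors.most_similar(np.asarray([query]), n=5)
print([nlp.vocab.strings[int(key)] for key in keys[0]])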

Similarities in Context

In spaCy, the parsing, tagging and NER models make use of vector representations of contexts that capture the meaning of words. Such a meaning representation is an array of floats, i.e. a tensor, computed during pipeline processing. With this approach, even words that have not been seen before can be classified. spaCy uses a 4-layer convolutional network to compute these tensors, which model a context of four words to the left and right of any given word.

Let us use the example from the spaCy documentation and check the word labrador:

In [9]:
tokens = nlp(u'labrador')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
labrador True 6.850418 False

We can now test how the surrounding context affects similarity:

In [10]:
doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"The people on Labrador are Canadians.")

dog = nlp(u"dog")

# Compare each document with the single-token document "dog"
for count, doc in enumerate([doc1, doc2, doc3], start=1):
    print(str(count) + ":", doc.similarity(dog))
1: 0.6907751984080799
2: 0.5961927660740638
3: 0.5374588437026319

Using this strategy we can compute document or text similarities as well:

In [11]:
docs = ( nlp(u"Paris is the largest city in France."),
        nlp(u"Vilnius is the capital of Lithuania."),
        nlp(u"An emu is a large bird.") )

for x in range(len(docs)):
    zset = set(range(len(docs)))
    zset.remove(x)
    for y in zset:
        print(x, y, docs[x].similarity(docs[y]))
0 1 0.7554966079333336
0 2 0.6921463288355282
1 0 0.7554966079333336
1 2 0.5668025741640493
2 0 0.6921463288355282
2 1 0.5668025741640493

We can vary the word order in sentences and compare them. Since a Doc's default vector is the average of its token vectors, word order has practically no effect on the similarity score:

In [12]:
docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
        nlp(u"man dog bites"), nlp(u"cat eats mouse")]

for doc in docs:
    for other_doc in docs:
        print('"' + doc.text + '"', '"' + other_doc.text + '"', doc.similarity(other_doc))
"dog bites man" "dog bites man" 1.0
"dog bites man" "man bites dog" 0.9999999711588186
"dog bites man" "man dog bites" 1.000000047362914
"dog bites man" "cat eats mouse" 0.7096954239846529
"man bites dog" "dog bites man" 0.9999999711588186
"man bites dog" "man bites dog" 1.0
"man bites dog" "man dog bites" 1.0000000462548106
"man bites dog" "cat eats mouse" 0.709695423198237
"man dog bites" "dog bites man" 1.000000047362914
"man dog bites" "man bites dog" 1.0000000462548106
"man dog bites" "man dog bites" 1.0
"man dog bites" "cat eats mouse" 0.7096954242750528
"cat eats mouse" "dog bites man" 0.7096954239846529
"cat eats mouse" "man bites dog" 0.709695423198237
"cat eats mouse" "man dog bites" 0.7096954242750528
"cat eats mouse" "cat eats mouse" 1.0

Custom Models
