day 2: getting started with nlp¶

overview day 2

time	section	concepts	outcomes
09-10	2.1.1	spacy	install spacy packages to anaconda venv
	2.1.2	python syntax	practice data structures, conditional statements, flow control
	2.1.3	nlp doc	create our first nlp document
	break
10-11	2.2.1	features	applying nlp analyses to extract linguistic features
	2.2.2	part of speech	analyze syntactic structure, useful for finding e.g. negations
	2.2.3	entity recognition	extract named entities from text, analyze word similarity
	break
11-12	2.3.1	classify	how to use scikit-learn for sentiment analysis
	2.3.2
	2.3.3
12-13	lunch

section 2.1.1¶

check and update the anaconda installation, then install spacy into the conda environment (base in the commands below)

# update anaconda, env packages
conda update anaconda
conda update --all
# install spacy nlp package
conda install -c conda-forge -n base spacy
# download trained pipelines (language models), small and large
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
# start jupyter notebook
jupyter notebook

section 2.1.2¶

repetition, python syntax
practice data structures, conditional statements, flow control
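
a minimal warm-up sketch for this repetition, not part of the original notebook: it combines a list and a dict (data structures), an if/else (conditional statement) and a for loop (flow control). the word list and the variable names are made up for illustration.

```python
# data structures: a list of words and a dict that counts them
words = ["my", "name", "is", "nils", "my", "name"]
counts = {}

# flow control: loop over every word in the list
for word in words:
    # conditional statement: repeat or first occurrence?
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

# list comprehension: keep the words seen more than once
# (a dict preserves insertion order, so the result follows the input order)
repeated = [word for word, n in counts.items() if n > 1]

print(counts)
print(repeated)
```

the same counting pattern comes back later when we loop over the tokens of an nlp document.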

section 2.1.3¶

extend python functionality with the spacy nlp package
create our first basic nlp document

import spacy
from spacy import displacy
# english language pre-trained model, small
nlp = spacy.load("en_core_web_sm")

doc = nlp("my name is nils")
type(doc)

spacy.tokens.doc.Doc

doc
doc[0]

my

for token in doc:
    print(token.text, token.pos_, token.dep_)

my PRON poss
name NOUN nsubj
is AUX ROOT
nils NOUN attr

displacy.render(doc, jupyter=True)

How about other languages?¶

doc = nlp("jag heter nils")
displacy.render(doc, jupyter=True)

test swedish¶

from spacy.lang.sv import Swedish
nlp = Swedish()  # use directly
#nlp = spacy.blank("sv")  # blank instance

basic tokenization enabled¶

but no tagging or syntactic parsing, so token.pos_ and token.dep_ stay empty

doc = nlp("jag heter nils")
for token in doc:
    print(token.text, token.pos_, token.dep_)
#displacy.render(doc, jupyter=True)

jag  
heter  
nils

from spacy.lang.en import English
nlp = English()

doc = nlp("my name is nils")
for token in doc:
    print(token.text, token.pos_, token.dep_)

my  
name  
is  
nils
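
the empty columns above can be checked directly. the sketch below is an illustration rather than part of the original notebook. it assumes spacy v3 (where `Doc.has_annotation` is available) and uses `spacy.blank`, which needs no model download. the names `nlp_blank` and `doc_blank` are chosen here to avoid overwriting `nlp` and `doc`.

```python
import spacy

# blank pipelines contain only the tokenizer; "sv" and "en" are language codes
for lang, text in [("sv", "jag heter nils"), ("en", "my name is nils")]:
    nlp_blank = spacy.blank(lang)
    doc_blank = nlp_blank(text)

    # no tagger or parser is loaded, so the component list is empty
    print(lang, nlp_blank.pipe_names)

    # tokenization still works
    print([token.text for token in doc_blank])

    # has_annotation reports whether part-of-speech tags or dependencies were set
    print(doc_blank.has_annotation("POS"), doc_blank.has_annotation("DEP"))
```

a trained pipeline such as en_core_web_sm adds components like the tagger and parser, which is why the first document in this section did show pos_ and dep_ values.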