NLP
Vocabulary
- Token
- Bag of words
- Stemming and lemmatization
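As a rough illustration of the bag-of-words idea, the sketch below counts tokens while discarding their order. It assumes plain whitespace tokenization and lowercasing, and the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Return a token -> count mapping; word order is thrown away."""
    # Assumption: lowercase and split on whitespace. A real tokenizer
    # would also separate punctuation from words.
    tokens = text.lower().split()
    return Counter(tokens)

bow = bag_of_words("the cat sat on the mat")
print(bow["the"])  # 2
print(bow["cat"])  # 1
```

Two texts containing the same words in a different order produce the same bag, which is exactly the information this representation gives up.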
Resources
- Stanford CS224n: Natural Language Processing with Deep Learning (Winter 2019)
- NLTK book
- Jurafsky book
Parts of speech tagging
- POS, word classes, syntactic categories: nouns, verbs, adjectives, conjunctions
- Categorizing and Tagging Words (NLTK book chapter 5)
- Rachiele G. 2018 Medium article
- Three POS algorithms
- Hidden Markov model (HMM) - generative
- Maximum entropy Markov model (MEMM) - discriminative
- Neural language model - uses RNNs
- POS tagset
- Universal
- Penn Treebank
Grammar
- Sentence (S)
- Noun phrases (NP) - "The homeless man in the park that I tried to help yesterday"
- Prepositional Phrases (PP) - "with a net" contains a noun phrase complement
- Verb phrases (VP)
- Adjective phrases (AP)
- Phrasal verbs - "turn down", "rule out", "find out", "go on"
nltk syntax
- tokens = nltk.word_tokenize(text) splits text into tokens; it is a more robust alternative to text.split()
- nltk.pos_tag(tokens) returns a list of (word, tag) pairs
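A minimal sketch of the two calls, next to plain whitespace splitting for comparison. The sample sentence and the wrapper name `tag_sentence` are made up for illustration. The sketch assumes NLTK is installed and that its tokenizer and tagger data have already been fetched with nltk.download(); the data package names vary between NLTK versions, so they are not spelled out here.

```python
text = "Don't stop, please."

# Plain whitespace splitting leaves punctuation attached to the words.
naive_tokens = text.split()
print(naive_tokens)  # ["Don't", 'stop,', 'please.']

def tag_sentence(sentence):
    """Tokenize with NLTK, then tag each token with a part of speech."""
    # Imported here so the comparison above runs even without NLTK installed.
    import nltk
    # Both calls rely on NLTK data packages downloaded beforehand
    # (assumption: they are already present on this machine).
    tokens = nltk.word_tokenize(sentence)
    return nltk.pos_tag(tokens)
```

Calling tag_sentence(text) should return a list of (token, tag) pairs, with punctuation split off into separate tokens instead of staying glued to the neighbouring word.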
HMM POS tagging
- Training an HMM means estimating the transition probabilities from a tagged corpus.
- You get the transition probability matrix
- You get the observation likelihoods
- And you get the vocabulary
- You also get the initial probability distribution Pi, just by counting
- POS tagging: The states are part of speech tags, and the observations are the actual words.
- The only information used is the order of the words in the sentence
- Transition probabilities: capture the probability of moving from one part of speech to the next
- Transition probability matrix - each row must sum to 1, because a row covers every possible transition out of one part of speech
- Incoming probabilities (a column of the matrix) do not have to sum to 1
- Observation is a word emission
- Observation likelihood = how many times you see that word with a given tag, divided by how many times that tag occurs in the corpus
- Initial probability distribution - for each part of speech, what is the probability that the sentence will start with it?
- E.g., when I start the sentence, what is the probability that the part of speech will be an article.
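The counting described in this section can be sketched as follows. This is a minimal illustration, not a full tagger: the toy corpus, the tag names, and the helper names `normalize` and `train_hmm` are all made up for the example, and it assumes no smoothing and no explicit end-of-sentence state.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("barks", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def train_hmm(tagged_sentences):
    """Estimate Pi, transitions, emissions and vocabulary by counting."""
    initial = Counter()                 # tag that starts a sentence
    transitions = defaultdict(Counter)  # previous tag -> next tag
    emissions = defaultdict(Counter)    # tag -> word emitted
    vocabulary = set()
    for sentence in tagged_sentences:
        if not sentence:
            continue
        tags = [tag for _, tag in sentence]
        initial[tags[0]] += 1
        for prev_tag, next_tag in zip(tags, tags[1:]):
            transitions[prev_tag][next_tag] += 1
        for word, tag in sentence:
            emissions[tag][word] += 1
            vocabulary.add(word)
    pi = normalize(initial)
    # Tags that only ever end a sentence get no row here, because no
    # end-of-sentence state is modelled in this sketch.
    A = {tag: normalize(row) for tag, row in transitions.items()}
    B = {tag: normalize(row) for tag, row in emissions.items()}
    return pi, A, B, vocabulary

pi, A, B, vocabulary = train_hmm(corpus)

print(pi["DET"])         # 0.75, three of four sentences start with DET
print(B["NOUN"]["dog"])  # 0.5, "dog" is two of the four NOUN tokens

# Each row of A sums to 1, but the incoming probabilities for a tag need not.
row_sum = sum(A["DET"].values())
incoming_noun = sum(row.get("NOUN", 0.0) for row in A.values())
print(incoming_noun > 1)  # True
```

Here the transitions out of DET split between NOUN and ADJ and sum to 1, while the probabilities arriving at NOUN come from two different rows and add up to more than 1.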
N-grams
TF-IDF
- Term frequency-inverse document frequency
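A small sketch of one common way to compute this weight. Several variants exist; this one assumes term frequency is the count divided by document length and inverse document frequency is log(N / df), with no smoothing. The toy documents and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy documents, already tokenized.
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]

def tf_idf(term, document, all_documents):
    """Score a term in one document as tf * log(N / df)."""
    # df: number of documents that contain the term at least once.
    df = sum(1 for doc in all_documents if term in doc)
    if not document or df == 0:
        return 0.0
    tf = Counter(document)[term] / len(document)
    idf = math.log(len(all_documents) / df)
    return tf * idf

# "the" appears in every document, so its idf is log(1) = 0.
print(tf_idf("the", documents[0], documents))  # 0.0

# "cat" is rarer across documents than "sat", so it scores higher.
print(tf_idf("cat", documents[0], documents) > tf_idf("sat", documents[0], documents))  # True
```

The effect is that words occurring in every document contribute nothing, while words concentrated in a few documents are weighted up.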