- Bag of words
- Stemming and lemmatization
Parts of speech tagging
- Sentence (S)
- Noun phrases (NP) - "The homeless man in the park that I tried to help yesterday"
- Prepositional Phrases (PP) - "with a net" contains a noun phrase complement
- Verb phrases (VP)
- Adjective phrases (AP)
- Phrasal verbs - "turn down", "rule out", "find out", "go on"
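import nltk  # word_tokenize and pos_tag need NLTK's tokenizer and tagger data (see nltk.download)
tokens = nltk.word_tokenize(text)  # text is the input string; a more robust tokenizer that splits off punctuation and contractions
tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs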
HMM POS tagging
- Training an HMM means estimating its probabilities from a tagged corpus.
- You get the transition probability matrix
- You get the observation likelihoods
- And you get the vocabulary
- You also get the initial probability distribution Pi, just by counting
- POS tagging: The states are part of speech tags, and the observations are the actual words.
- The only information used is the order of the words in the sentence
- Transition probabilities: capture the probability of moving from one part of speech to the next
- Transition probability matrix - each row must sum to 1, because a row covers every possible transition out of that part of speech
- Incoming probabilities (a column) do not have to sum to 1
- An observation is a word emitted by a state (a part of speech tag)
- Observation likelihood = how many times you see that word with a given tag, divided by how many times that tag occurs in the corpus
- Initial probability distribution - for each part of speech, what's the probability that the sentence will start with it?
- E.g., when I start the sentence, what is the probability that the first part of speech will be an article?
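The HMM training notes above can be sketched as plain counting. This is a minimal illustration, not a reference implementation: the four-sentence tagged corpus, the tag names, and the helpers `normalize` and `train_hmm` are all made up for the example. It assumes no smoothing and no end-of-sentence state, so a tag that only ever ends a sentence gets no row in the transition matrix.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration only.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]


def normalize(counts):
    """Turn a Counter into a probability distribution that sums to 1."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def train_hmm(sentences):
    """Estimate Pi, transitions, emissions and vocabulary by counting."""
    initial_counts = Counter()                # which tag starts a sentence
    transition_counts = defaultdict(Counter)  # previous tag -> next tag
    emission_counts = defaultdict(Counter)    # tag -> word
    vocabulary = set()

    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        initial_counts[tags[0]] += 1
        for prev_tag, next_tag in zip(tags, tags[1:]):
            transition_counts[prev_tag][next_tag] += 1
        for word, tag in sentence:
            emission_counts[tag][word] += 1
            vocabulary.add(word)

    pi = normalize(initial_counts)
    # Each row is normalized on its own, so every row sums to 1.
    transitions = {tag: normalize(c) for tag, c in transition_counts.items()}
    # Emission: count(tag, word) / count(tag).
    emissions = {tag: normalize(c) for tag, c in emission_counts.items()}
    return pi, transitions, emissions, vocabulary


pi, transitions, emissions, vocabulary = train_hmm(tagged_sentences)

# Incoming probabilities (a column) need not sum to 1.
incoming_noun = sum(row.get("NOUN", 0.0) for row in transitions.values())

print("Pi:", pi)
print("Transitions out of DET:", transitions["DET"])
print("Incoming mass for NOUN:", incoming_noun)
print("Emissions for NOUN:", emissions["NOUN"])
```

On this toy corpus three of the four sentences start with DET, so Pi gives DET 0.75 and NOUN 0.25. The DET row splits into 2/3 for NOUN and 1/3 for ADJ, while the NOUN column adds up to more than 1 because both DET and ADJ lead into it.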
- Term frequency-inverse document frequency (TF-IDF)