From Colettapedia


  • Token
  • Bag of words
  • Stemming and lemmatization


Part-of-speech tagging


  • Sentence (S)
  • Noun phrases (NP) - "The homeless man in the park that I tried to help yesterday"
  • Prepositional phrases (PP) - "with a net": the preposition "with" takes the noun phrase "a net" as its complement
  • Verb phrases (VP)
    • Phrasal verbs - "turn down", "rule out", "find out", "go on"
  • Adjective phrases (AP)

nltk syntax

  • tokens = nltk.word_tokenize( text ) is a more robust text.split()
  • nltk.pos_tag( tokens ) returns a list of (word, tag) pairs
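
A minimal sketch of the two calls above. The helper name tokenize_and_tag and the example sentence are made up for illustration. The sketch assumes NLTK is installed and that the tokenizer and tagger data it asks for have already been downloaded, so the NLTK part is wrapped in a function and not run here.

```python
def tokenize_and_tag(text):
    """Tokenize text with NLTK, then tag each token with a part of speech.

    Hypothetical helper: assumes NLTK is installed and its tokenizer and
    tagger data have been downloaded beforehand.
    """
    import nltk

    tokens = nltk.word_tokenize(text)  # separates punctuation from words
    return nltk.pos_tag(tokens)        # list of (word, tag) pairs


# For comparison, plain str.split() leaves punctuation glued to the last word.
naive_tokens = "I caught the fish with a net.".split()
print(naive_tokens)  # ['I', 'caught', 'the', 'fish', 'with', 'a', 'net.']
```

Calling tokenize_and_tag("I caught the fish with a net.") should return one (word, tag) pair per token, with the final period as its own token.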

HMM POS tagging

  • Training an HMM means estimating its parameters from a tagged corpus by counting.
    1. You get the transition probability matrix
    2. You get the observation likelihoods
    3. You get the vocabulary
    4. You also get the initial probability distribution Pi, just by counting
  • POS tagging: The states are part of speech tags, and the observations are the actual words.
  • The only information the model uses is the order of the words in the sentence
  • Transition probabilities capture the probability of moving from one part of speech tag to the next
    • Transition probability matrix - the probabilities in each row have to sum to 1, because the row covers every possible transition out of that tag
    • Incoming probabilities (a column of the matrix) do not have to sum to 1
  • An observation is a word emitted by a state (tag)
  • Observation likelihood = how many times you see that word with a given tag, divided by how many times that tag occurs in the corpus, i.e. P(word | tag)
  • Initial probability distribution - for each part of speech, what's the probability that a sentence will start with it?
    • E.g., when I start a sentence, what is the probability that the first part of speech will be an article?
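
The counting described above can be sketched in plain Python. Everything here is an assumption for illustration: the toy corpus is made up, train_hmm and normalize are hypothetical helper names, and no smoothing is applied, so unseen words and transitions simply get no entry.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus, invented for illustration only.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("the", "DET"), ("cat", "NOUN")],
    [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def normalize(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train_hmm(sentences):
    """Estimate HMM parameters by counting (no smoothing)."""
    initial_counts = Counter()                # tag that starts a sentence
    transition_counts = defaultdict(Counter)  # previous tag -> next tag
    emission_counts = defaultdict(Counter)    # tag -> word
    vocabulary = set()

    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        initial_counts[tags[0]] += 1
        for prev_tag, next_tag in zip(tags, tags[1:]):
            transition_counts[prev_tag][next_tag] += 1
        for word, tag in sentence:
            emission_counts[tag][word] += 1
            vocabulary.add(word)

    initial = normalize(initial_counts)
    transitions = {tag: normalize(c) for tag, c in transition_counts.items()}
    emissions = {tag: normalize(c) for tag, c in emission_counts.items()}
    return initial, transitions, emissions, vocabulary


initial, transitions, emissions, vocabulary = train_hmm(tagged_sentences)

# Each row of the transition matrix sums to 1 ...
row_sums = {tag: sum(row.values()) for tag, row in transitions.items()}
# ... but the incoming probabilities for one tag (a column) need not.
incoming_noun = sum(row.get("NOUN", 0.0) for row in transitions.values())

print(initial)
print(transitions["DET"])
print(emissions["NOUN"])
print(row_sums, incoming_noun)
```

In this toy corpus every row sums to 1, while the incoming probabilities for NOUN add up to more than 1, because both DET and ADJ are usually followed by a noun.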



  • Term frequency-inverse document frequency (TF-IDF)