We present an unsupervised method for part-of-speech discovery that induces a system of word classes from the distributional properties of words in raw text. In a further step, we determine for each word the likelihood that it belongs to one or several of these classes. The assumption underlying our method is that the word pair formed by the left and right neighbors of a particular token is characteristic of the part of speech to be selected at that position. Based on this assumption, we cluster all such word pairs according to the distribution of the middle words observed between them. This yields centroid vectors that are useful both for inducing a system of word classes and for correctly classifying ambiguous words.
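The following is a minimal sketch of the idea of clustering neighbor pairs by their middle words. It is not the authors' implementation. The function names (`context_vectors`, `cosine`, `cluster`), the cosine similarity measure, the greedy single-pass clustering, and the `threshold` value are all assumptions chosen for illustration, since the abstract does not specify the clustering algorithm.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens):
    """Map each (left, right) neighbor pair to counts of the middle words seen between them."""
    vectors = defaultdict(Counter)
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        vectors[(left, right)][middle] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed similarity measure)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy single-pass clustering (hypothetical stand-in for the clustering step).

    Each cluster is a (centroid, members) tuple. The centroid is kept as the
    summed middle-word counts of its members; cosine similarity ignores scale,
    so no normalization is needed.
    """
    clusters = []
    for pair, vec in sorted(vectors.items()):
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(candidate[0], vec)
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append((Counter(vec), [pair]))
        else:
            best[0].update(vec)  # add this pair's counts to the centroid
            best[1].append(pair)
    return clusters


if __name__ == "__main__":
    tokens = "the cat sat on the mat the dog sat on the rug".split()
    for centroid, members in cluster(context_vectors(tokens)):
        print(members, dict(centroid))
```

In this toy corpus, the pairs `("cat", "on")` and `("dog", "on")` both enclose `sat`, so they fall into one cluster whose centroid is dominated by a verb-like middle word. An ambiguous word could then be scored against each centroid to estimate how strongly it belongs to each induced class.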