Abstract: | This paper describes an unsupervised framework that partially resolves four issues in developing a robust Chinese segmentation system: ambiguity, unknown words, knowledge acquisition, and efficient algorithms. It first proposes a statistical segmentation model that integrates the simplified character juncture model (SCJM) with word formation power. The advantage of this model is that it can use the affinity of characters inside or outside a word together with word formation power to resolve ambiguity, and all of its parameters can be estimated in an unsupervised way. After investigating the differences between the real and theoretical sizes of the segmentation space, we apply the A* algorithm to perform segmentation without exhaustively searching all potential segmentations. Finally, an unsupervised version of Chinese word-formation patterns for detecting unknown words is presented. Experiments show that the proposed methods are efficient. |