(1) National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, 100871 Beijing, China
Abstract:
In this paper, an improved algorithm, named STC-I, is proposed for clustering Chinese Web pages. Based on the characteristics of the Chinese language, it adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm retains the advantages of STC and outperforms STC in both precision and speed when clustering Chinese Web pages.