
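Feature Word Extraction Based on Web Page Features
Cite this article: PANG Ning. Feature word extraction based on web page features [J]. Journal of Southwest Nationalities College (Natural Science Edition), 2014(1): 137-141.
Author: PANG Ning
Affiliation: School of Applied Sciences, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024
Funding: Natural Science Foundation of Shanxi Province (2012011011-4).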
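Abstract: Feature word extraction is a practical technique for distilling the content of an entire web page, and it also provides technical support for applications such as text classification and information extraction. Working on the web page content, the method uses the semantic relations between paragraphs to derive the discourse structure of the page. On that basis, it uses the page's metadata and special tags to design a weighting function for feature words that jointly considers word frequency, word length, and position factors. Finally, experiments compare how much each type of position factor contributes to the system. The results show that the F1 value of the improved method is 15.5% higher than that of the traditional TFIDF extraction technique; among the position factors, the title, keywords, and abstract contribute the most.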

Keywords: feature word extraction  web page  metadata  weighting function

Signature word extracting retrieval based on web feature
PANG Ning. Signature word extracting retrieval based on web feature [J]. Journal of Southwest Nationalities College (Natural Science Edition), 2014(1): 137-141.
Authors: PANG Ning
Institution: The School of Applied Sciences, Taiyuan University of Science and Technology, Taiyuan 030024, P.R.C.
Abstract: Signature word extraction is a useful technique for summarizing the text of a web page, and it provides technical support for tasks such as text classification and information extraction. A hierarchical structure of the page is extracted by parsing the semantic relation between each pair of adjacent paragraphs in the web page content. On the basis of this hierarchical structure, the paper uses the HTML metadata and special tags to design a weighting function that combines the frequency, length, and location factors of a word. In addition, an initial comparative analysis is carried out of how much each position factor contributes to the system. Experimental results show that the F1 value of the improved method is 15.5% higher than that of the traditional TFIDF extraction method. Among the location factors, the title, abstract, and keywords contribute the most to the system.
Keywords:signature word extracting  web  metadata  weighting function
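
The abstract describes a weighting function that combines a word's frequency, length, and location, but it does not give the formula. The following is only a minimal sketch of how such a function might be composed. The multiplicative form, the logarithmic length factor, the `POSITION_WEIGHTS` values, and the names `score_words` and `top_k` are all illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state the paper's values.
POSITION_WEIGHTS = {"title": 3.0, "keywords": 2.5, "description": 2.0, "body": 1.0}


def score_words(fields):
    """Score words from a dict mapping a position name to a list of tokens.

    Assumed score = frequency factor * length factor * position factor.
    """
    counts = Counter()
    best_position = {}
    for position, tokens in fields.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            counts[token] += 1
            # Keep the strongest position in which the word appears.
            best_position[token] = max(best_position.get(token, 0.0), weight)

    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        tf = count / total                       # frequency factor
        length = 1.0 + math.log(len(word))       # length factor (assumed form)
        scores[word] = tf * length * best_position[word]
    return scores


def top_k(fields, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = score_words(fields)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

In this sketch a word that appears in the title or in the page's keyword metadata outranks a more frequent word that appears only in the body, which mirrors the abstract's finding that title, keyword, and abstract positions matter most. The weights themselves would have to be tuned on real data.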
This article is indexed in CNKI, VIP, and other databases.