首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 453 毫秒
1.
随着智能终端的普及,文本的主题挖掘需求也越来越广泛,主题建模是文本主题挖掘的核心,LDA生成模型是基于贝叶斯框架的概率模型,它以语义关联为基础,很好地解决了文本潜在主题的提取问题。对文本聚类过程的核心技术LDA生成模型、数据采样、模型评价等作了较为深入的阐述和解析,结合网络教育平台的2 794篇学习刊物进行了主题发现和聚类实验,建立了包含3 800个词项的词库,通过kmeans算法和合并向量算法(UVM)分两步解决了主题聚类问题。提出了文本挖掘实验的一般方法,并对层次聚类中文本距离的算法提出了改进。实验结果表明,该平台刊物的主题整体相似度比较好,但主题过于集中使得许多刊物的内容不具有辨识度,影响用户对主题的定位。  相似文献   

2.
针对传统模糊聚类算法需要预先确定初始隶属度矩阵的问题,该文提出了基于增量式模糊聚类算法(Incremental fuzzy clustering algorithm, FCLDA)的文本挖掘方法。首先根据文本集中关键词出现次数进行排序,优先选择出现次数多的关键词作为文本集的主题,然后利用隐含狄利克雷分布(Latent Dirichlet allocation, LDA)主题模型构建文档-主题概率分布组成矩阵,将该矩阵作模糊C均值聚类(FCM)算法的隶属度矩阵,并对隶属度矩阵的隶属度值增加一个权值,在FCLDA算法迭代过程中,采用模糊信息熵作为聚类数确定的标准,增加主题词,当模糊信息熵达到最小值时,聚类数确定下来,最后将FCLDA算法应用到网页的文本挖掘中,结果试验表明,相对于FCM算法和K最近邻(K-nearest neighbor)算法,FCLDA算法的运行聚类结果准确率更高,运行速度加快,更适合处理具有模糊性的文本。  相似文献   

3.
LDA主题模型是一种有效的文本语义信息提取工具,利用在文档层中实现词项的共现,将词项矩阵转化为主题矩阵,得到主题特征;然而在生成文档过程中会蕴含冗余主题。针对LDA主题模型提取主题特征时存在冗余的不足,提出一种基于邻域粗糙集的LDA主题模型约简算法NRS-LDA。利用邻域粗糙集构造主题决策系统,通过预先设定主题个数,计算出每个主题的重要度;根据重要度进行排序,将排序后重要度低的主题删除。将提出的NRS-LDA算法应用于K-means文本聚类问题上并与传统的文本特征提取算法及改进的算法进行比较,结果表明NRS-LDA方法可以得到更高的聚类精度。  相似文献   

4.
针对传统的潜在狄利克雷分析(LDA)模型在提取评论主题时存在着计算时间长、计算效率低的问题,提出基于MapReduce架构的并行LAD模型建立方法.在文本预处理的基础上,得到文档-主题分布和主题-特征词分布,分别计算主题相似度和特征词权重,结合k-均值聚类算法,实现评论主题提取的并行化.通过Hadoop并行计算平台进行实验,结果表明,该方法在处理大规模文本时能获得接近线性的加速比,对主题模型的建立效果也有提高.  相似文献   

5.
传统的基于空间向量的文本谱聚类方法容易忽略文本上下文之间的语义联系,通过图结构进行文本表示可以很好的解决这一问题,在此基础上,本文提出了基于最大公共子图的谱聚类算法——SC-MCS算法。该算法通过求解文本之间的最大公共子图来进行文本相似度的计算,最后进行文本聚类。实验结果表明,与传统的基于空间向量的文本谱聚类方法相比,该算法在准确率和召回率都取得了一定的提升。  相似文献   

6.
给出了一种针对大量新闻数据的话题检测方法.首先通过LDA(latent dirichlet allocation)模型从语义层面抽取新闻数据主题,有效降低数据分析维度,更合理地体现新闻主题特征.然后改进OPTICS(ordering point to identify the cluster structure)密度聚类算法,基于新闻话题的时间延续性给出了T-OPTICS算法.该算法继承了OPTICS算法对参数不敏感的特性,降低了参数选择对聚类结果的影响.改进了OPTICS算法中文本间相似度的计算方法,体现了话题的时间延续性.基于TDT4数据集的实验表明,该方法能够快速有效地发现新闻中的话题.  相似文献   

7.
文本聚类作为一种自动化程度较高的无监督机器学习方法,能够实现对文本信息的有效组织、摘要和导航,近年来已经广泛应用在信息检索领域。笔者针对使用向量空间模型进行聚类时对于同义词和多义词的处理存在的缺陷,提出了基于本体的文本聚类模型。首先使用WordNet词典对文档中的词进行语义标注,得到文档的概念集合;然后对每个文档的概念集合进行概念聚类,生成文档的概念主题;最后通过计算主题的相似度完成文本聚类。该模型减少了相似度计算量,改善了聚类结果和聚类性能。  相似文献   

8.
针对文本在聚类或分类时,由于数据高维稀疏导致相似度值低的问题,提出一种基于改进文本相似度计算的聚类方法.首先,利用向量空间模型VSM表示文本,采用余弦函数计算文本之间的相似度;然后,基于网络中节点的相似性传播原理,通过设置阈值找到与各个文本相似度较大的文本集合,进而使用Jaccard系数将两个文本之间相似度计算转化为两个文本集合之间的相似度计算;最后根据得到的文本相似度矩阵,利用谱聚类算法对文本进行聚类.在WebKB上的实验结果表明,与传统的K-means、谱聚类方法相比,该方法提高了聚类的准确度,召回率与F值.  相似文献   

9.
提出了基于LDA(Latent Dirichlet Allocation)主题模型的Web文本分类方法,利用MCMC方法中的Gibbs抽样获得模型参数从而获取词汇的概率分布,使隐藏于WEB文本内的不同主题与WEB文本字词建立关系。将LDA算法应用于WEB文本分类识别领域,在实验中与k均值聚类和贝叶斯网络方法进行了对比,其结果表明LDA与其他同类算法相比具有一定的优势。  相似文献   

10.
文本相似度的计算是文本挖掘的基础。传统的基于向量空间模型(VSM)的文本相似度计算方法把文本映射成词向量,再利用余弦距离公式来计算相似度,这样存在文本向量维数过高以及语义敏感度差的问题。针对以上问题,通过对词性以及权值大小的过滤可以缩减特征词规模,在一定程度上可以减少高维稀疏的情况发生,并且引入LDA模型的文本隐含主题特征,增加文本表示的语义背景,通过线性加权的方式结合VSM模型的特征词特征和LDA模型的主题特征,计算文本相似度。实验表明,与单独使用VSM模型和LDA模型比较,利用加权特征计算文本相似度有着更好的效果。  相似文献   

11.
Language markedness is a common phenomenon in languages, and is reflected from hearing, vision and sense, i.e. the variation in the three aspects such as phonology, morphology and semantics. This paper focuses on the interpretation of markedness in language use following the three perspectives, i.e. pragmatic interpretation, psychological interpretation and cognitive interpretation, with an aim to define the function of markedness.  相似文献   

12.
何延凌 《科技信息》2008,(4):258-258
Language is a means of verbal communication. People use language to communicate with each other. In the society, no two speakers are exactly alike in the way of speaking. Some differences are due to age, gender, statue and personality. Above all, gender is one of the obvious reasons. The writer of this paper tries to describe the features of women's language from these perspectives: pronunciation, intonation, diction, subjects, grammar and discourse. From the discussion of the features of women's language, more attention should be paid to language use in social context. What's more, the linguistic phenomena in a speaking community can be understood more thoroughly.  相似文献   

13.
王慧 《科技信息》2008,(10):240-240
Wuthering Heights, Emily Bronte's only novel, was published in December of 1847 under the pseudonym Ellis Bell. The book did not gain immediate success, but it is now thought one of the finest novels in the English language. Catherine is the key character of this masterpiece, because everybody and everything center on her though she had a short life. We can understand this masterpiece better if we know Catherine well.  相似文献   

14.
The Williston Basin is a significant petroleum province, containing oil production zones that include the Middle Cambrian to Lower Ordovician, Upper Ordovician, Middle Devonian, Upper Devonian and Mississippian and within the Jurassic and Cretaceous. The oils of the Williston Basin exhibit a wide range of geochemical characteristics defined as "oil families", although the geochemical signature of the Cambrian Deadwood Formation and Lower Ordovician Winnipeg reservoired oils does not match any "oil family". Despite their close stratigraphic proximity, it is evident that the oils of the Lower Palaeozoic within the Williston Basin are distinct. This suggests the presence of a new "oil family" within the Williston Basin. Diagnostic geochemical signatures occur in the gasoline range chromatograms, within saturate fraction gas chromatograms and biomarker fingerprints. However, some of the established criteria and cross-plots that are currently used to segregate oils into distinct genetic families within the basin do not always meet with success, particularly when applied to the Lower Palaeozoic oils of the Deadwood and Winnipeg Formation.  相似文献   

15.
理论推导与室内实验相结合,建立了低渗透非均质砂岩油藏启动压力梯度确定方法。首先借助油藏流场与电场相似的原理,推导了非均质砂岩油藏启动压力梯度计算公式。其次基于稳定流实验方法,建立了非均质砂岩油藏启动压力梯度测试方法。结果表明:低渗透非均质砂岩油藏的启动压力梯度确定遵循两个等效原则。平面非均质油藏的启动压力梯度等于各级渗透率段的启动压力梯度关于长度的加权平均;纵向非均质油藏的启动压力梯度等于各渗透率层的启动压力梯度关于渗透率与渗流面积乘积的加权平均。研究成果可用于有效指导低渗透非均质砂岩油藏的合理井距确定,促进该类油藏的高效开发。  相似文献   

16.
As an American modern novelist who were famous in the literary world, Hemingway was not a person who always followed the trend but a sharp observer. At the same time, he was a tragedy maestro, he paid great attention on existence, fate and end-result. The dramatis personae's tragedy of his works was an extreme limit by all means tragedy on the meaning of fearless challenge that failed. The beauty of tragedy was not produced on the destruction of life, but now this kind of value was in the impact activity. They performed for the reader about the tragedy on challenging for the limit and the death.  相似文献   

17.
正The periodicity of the elements and the non-reactivity of the inner-shell electrons are two related principles of chemistry,rooted in the atomic shell structure.Within compounds,Group I elements,for example,invariably assume the+1 oxidation state,and their chemical properties differ completely from those of the p-block elements.These general rules govern our understanding of chemical structures and reactions.Using first principles calcula-  相似文献   

18.
We have developed an adiabatic connection to formulate the ground-state exchange-correlation energy in terms of pairing matrix linear fluctuations.This formulation of the exchange-correlation energy opens a new channel for density functional approximations based on the many-body perturbation theory.We illustrate the potential of such approaches with an approximation based on the particle-particle Random Phase Approximation(pp-RPA).This re-  相似文献   

19.
正The electronic and nuclear(structural/vibrational)response of 1D-3D nanoscale systems to electric fields gives rise to a host of optical,mechanical,spectral,etc.properties that are of high theoretical and applied interest.Due to the computational difficulty of treating such large systems it is convenient to model them as infinite and periodic(at least,in first approximation).The fundamental theoretical/computational problem in doing so is that  相似文献   

20.
For molecular systems,the quantum-mechanical treatment of their responses to static electromagnetic fields usually employs a scalar-potential treatment of the electric field and a vector-potential treatment of the magnetic field.Although the potential for each field separately is associated with the choice of an(unphysical)origin,the precise choice of the origin for the electrostatic field has little consequences for the results.This is different for the  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号