Similar literature
Found 20 similar documents (search time: 62 ms)
1.
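With the spread of smart terminals, demand for text topic mining keeps growing. Topic modeling is the core of text topic mining, and the LDA generative model, a probabilistic model built on a Bayesian framework and grounded in semantic association, handles the extraction of latent topics from text well. This paper describes and analyzes in some depth the core techniques of the text clustering process: the LDA generative model, data sampling and model evaluation. Topic discovery and clustering experiments were run on 2,794 study publications from an online education platform, and a vocabulary of 3,800 terms was built. The topic clustering problem was solved in two steps, using the k-means algorithm and a merged-vector algorithm (UVM). A general method for text mining experiments is proposed, together with an improvement to the text distance computation used in hierarchical clustering. The experimental results show that the overall topic similarity of the platform's publications is fairly good, but the topics are so concentrated that many publications lack distinguishing content, which hampers users in locating topics.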

2.
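The LDA topic model is an effective tool for extracting semantic information from text. By exploiting term co-occurrence at the document level, it converts the term matrix into a topic matrix and yields topic features; the document generation process, however, produces redundant topics. To address this redundancy in the topic features extracted by LDA, a neighborhood rough set based reduction algorithm for the LDA topic model, NRS-LDA, is proposed. A topic decision system is constructed with neighborhood rough sets; with the number of topics set in advance, the importance of each topic is computed, the topics are ranked by importance, and the low-importance topics are deleted. NRS-LDA is applied to K-means text clustering and compared with traditional and improved text feature extraction algorithms. The results show that NRS-LDA achieves higher clustering accuracy.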

3.
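To address the long computation time and low efficiency of the traditional latent Dirichlet allocation (LDA) model when extracting review topics, a method for building a parallel LDA model on the MapReduce architecture is proposed. After text preprocessing, the document-topic distribution and the topic-feature word distribution are obtained; topic similarity and feature-word weights are then computed and combined with the k-means clustering algorithm to parallelize review topic extraction. Experiments on the Hadoop parallel computing platform show that the method achieves near-linear speedup on large-scale text and also improves the quality of the resulting topic model.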

4.
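Traditional vector-space-based spectral clustering methods for text tend to ignore the semantic links between textual contexts, a problem that representing text as a graph structure solves well. On this basis, this paper proposes SC-MCS, a spectral clustering algorithm based on the maximum common subgraph. The algorithm computes text similarity by solving for the maximum common subgraph between texts and then clusters the texts. Experimental results show that, compared with traditional vector-space-based spectral clustering for text, the algorithm improves both precision and recall.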

5.
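A topic detection method for large volumes of news data is presented. First, the LDA (latent Dirichlet allocation) model extracts news topics at the semantic level, which effectively reduces the dimensionality of the analysis and reflects the topic features of news more reasonably. Then the OPTICS (Ordering Points To Identify the Clustering Structure) density clustering algorithm is improved: based on the temporal continuity of news topics, the T-OPTICS algorithm is derived. T-OPTICS inherits the parameter insensitivity of OPTICS, which reduces the influence of parameter choice on the clustering result. The computation of inter-text similarity in OPTICS is also modified to reflect the temporal continuity of topics. Experiments on the TDT4 dataset show that the method discovers topics in news quickly and effectively.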

6.
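To address the low similarity values caused by high-dimensional, sparse data when texts are clustered or classified, a clustering method based on an improved text similarity computation is proposed. First, texts are represented with the vector space model (VSM) and the cosine function is used to compute the similarity between texts. Then, following the principle of similarity propagation among the nodes of a network, a threshold is set to find, for each text, the set of texts most similar to it, and the Jaccard coefficient is used to turn the similarity between two texts into the similarity between their two text sets. Finally, spectral clustering is applied to the resulting text similarity matrix. Experiments on WebKB show that, compared with traditional K-means and spectral clustering, the method improves clustering accuracy, recall and F-measure.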
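The two similarity steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice to count a text as its own neighbor, and the threshold value are all hypothetical, and the final spectral clustering step is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def neighbor_sets(vectors, threshold):
    """For each text, the indices of texts whose cosine similarity to it
    reaches the threshold (assumption: a text counts as its own neighbor)."""
    n = len(vectors)
    return [
        {j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold}
        for i in range(n)
    ]


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def set_based_similarity(vectors, threshold):
    """Similarity matrix in which entry (i, j) is the Jaccard coefficient
    of the neighbor sets of text i and text j."""
    sets = neighbor_sets(vectors, threshold)
    n = len(vectors)
    return [[jaccard(sets[i], sets[j]) for j in range(n)] for i in range(n)]


# Hypothetical toy corpus: three overlapping texts and one unrelated text.
docs = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
sim = set_based_similarity(docs, threshold=0.4)
```

In this toy corpus the direct cosine similarity between the first two texts is only about 0.5, while their neighbor sets coincide, so the set-based similarity `sim[0][1]` is 1.0. That is the effect the abstract relies on to counter sparsity.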

7.
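As a highly automated unsupervised machine learning method, text clustering enables effective organization, summarization and navigation of text information, and in recent years it has been widely applied in information retrieval. To address the shortcomings of the vector space model in handling synonyms and polysemous words during clustering, the author proposes an ontology-based text clustering model. First, the words in each document are semantically annotated with the WordNet dictionary to obtain the document's concept set. Next, concept clustering is performed on each document's concept set to generate the document's concept topics. Finally, text clustering is completed by computing the similarity between topics. The model reduces the amount of similarity computation and improves clustering results and clustering performance.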

8.
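A Web text classification method based on the LDA (Latent Dirichlet Allocation) topic model is proposed. Gibbs sampling, an MCMC method, is used to estimate the model parameters and thereby obtain the probability distribution over words, which links the different topics hidden in Web texts to the words of those texts. The LDA algorithm is applied to Web text classification and recognition and compared experimentally with k-means clustering and Bayesian network methods. The results show that LDA has certain advantages over these comparable algorithms.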

9.
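Computing text similarity is fundamental to text mining. The traditional approach based on the vector space model (VSM) maps each text to a term vector and then computes similarity with the cosine formula, which suffers from excessively high vector dimensionality and poor semantic sensitivity. To address these problems, filtering by part of speech and by weight reduces the number of feature words, which mitigates high-dimensional sparsity to some extent. Latent topic features from the LDA model are introduced to add semantic context to the text representation, and the feature-word features of the VSM model are combined with the topic features of the LDA model by linear weighting to compute text similarity. Experiments show that computing similarity from the weighted features works better than using the VSM model or the LDA model alone.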
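One possible reading of the linear weighting in this abstract is a weighted sum of two cosine similarities, one over VSM term vectors and one over LDA topic distributions. The sketch below assumes that reading; the function names and the `weight` parameter are hypothetical, and the abstract does not state the actual weighting formula or weight value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(vsm_a, vsm_b, topic_a, topic_b, weight=0.5):
    """weight * cos(VSM vectors) + (1 - weight) * cos(topic vectors).

    Assumption: 'weight' in [0, 1] is a tuning parameter chosen by the user.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * cosine(vsm_a, vsm_b)
            + (1.0 - weight) * cosine(topic_a, topic_b))
```

With this form, two texts that share no feature words but have identical topic distributions still receive a non-zero similarity, which is the semantic contribution the abstract attributes to the LDA features.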

10.
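Existing Tibetan text clustering algorithms all use the vector space model for text modeling. That model suffers from excessively high vector dimensionality and cannot represent semantic information. Drawing on the grammatical characteristics of Tibetan and on the ideas behind topic models, this paper proposes a word-vector-based method for modeling Tibetan text. The method first performs part-of-speech tagging of Tibetan text with a maximum entropy model and selects nouns and verbs as text features. It then uses the word2vec tool to train word classes and computes their probability distribution in each text. Finally, each text is represented by a word-class probability matrix, which completes the text model. Compared with VSM-based and LDA-based text modeling, the method raises the F-measure of text clustering by 10.5% and 2.4% respectively, a clear improvement.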

11.
Markedness is a common phenomenon in languages. It is reflected in hearing, vision and sense, that is, in variation in phonology, morphology and semantics. This paper interprets markedness in language use from three perspectives (pragmatic, psychological and cognitive), with the aim of defining the function of markedness.

12.
He Yanling. 科技信息 (Science & Technology Information), 2008, (4): 258
Language is a means of verbal communication, and people use it to communicate with each other. In society, no two speakers are exactly alike in the way they speak. Some differences are due to age, gender, status and personality, and gender is one of the most obvious. This paper describes the features of women's language from several perspectives: pronunciation, intonation, diction, subjects, grammar and discourse. The discussion suggests that more attention should be paid to language use in its social context, and that doing so allows the linguistic phenomena of a speech community to be understood more thoroughly.

13.
The discovery of the prolific Ordovician Red River reservoirs in southeastern Saskatchewan in 1995 was the catalyst for extensive exploration activity, which resulted in the discovery of more than 15 new Red River pools. The best Red River production to date has come from dolomite reservoirs. Understanding the processes of dolomitization is therefore crucial for predicting the connectivity, spatial distribution and heterogeneity of dolomite reservoirs. The Red River reservoirs in the Midale area consist of 3~4 thin dolomitized zones, with a total thickness of about 20 m, which occur at the top of the Yeoman Formation. Two types of replacement dolomite were recognized in the Red River reservoir: dolomitized burrow infills and dolomitized host matrix. The spatial distribution of dolomite suggests that burrowing organisms played an important role in facilitating fluid flow in the backfilled sediments, which resulted in penecontemporaneous dolomitization of the burrow infills by normal seawater. The dolomite in the host matrix is interpreted as having formed at shallow burial from evaporitic seawater during precipitation of the Lake Almar anhydrite that immediately overlies the Yeoman Formation. However, the δ18O values of the dolomitized burrow infills (-5.9‰~ -7.8‰, PDB) and matrix dolomites (-6.6‰~ -8.1‰, avg. -7.4‰, PDB) are low compared with the values estimated for late Ordovician marine dolomite. This could be attributed to modification and alteration of the dolomite at higher temperatures during deeper burial, which could also account for its 87Sr/86Sr ratios (0.7084~0.7088), higher than those suggested for late Ordovician seawater (0.7078~0.7080). The trace amounts of saddle dolomite cement in the Red River carbonates are probably related to "cannibalization" of earlier replacement dolomite during chemical compaction.

14.
Spatial databases store numerous geometric objects. An important function of a spatial database is to let users browse those objects efficiently as a map, so the database must quickly draw the objects of interest in the display window. This process involves two operations: retrieving the data from the database and drawing them on screen. To improve efficiency, we should therefore reduce the time spent on both retrieval and display. The former can be achieved with a spatial index such as the R-tree; the latter requires simplifying the objects. Simplification means showing objects with sufficient but not unnecessary detail, which depends on the browsing scale. The major problem is thus how to retrieve data efficiently at different levels of detail. This paper introduces the implementation of a multi-scale index, generalized from the R-tree, in the spatial database SISP (Spatial Information Shared Platform). The generalization differs from the R-tree in two respects. First, every node and geometric object in the generalization is assigned an importance value, and so is every vertex of each object; these values determine which data should be retrieved from disk for a query. Second, geometric objects are divided into one or more sub-blocks, and their vertices are totally ordered by importance value. With the generalized R-tree, data can easily be retrieved at different levels of detail. Experiments on real-life data compare solutions that use an ordinary spatial index and the multi-scale spatial index. The results show that the multi-scale index in SISP performs satisfactorily.
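The importance-based retrieval described in this abstract can be illustrated with a small in-memory sketch. It is not the SISP implementation: the class and function names are hypothetical, the R-tree spatial search and the sub-block disk layout are left out, and the importance values are invented for the example. It shows only why ordering vertices by importance lets a coarse view read a prefix of each object.

```python
from dataclasses import dataclass
from itertools import takewhile


@dataclass
class GeoObject:
    name: str
    importance: float
    # Each vertex is (importance, sequence_index, x, y).
    vertices: list

    def __post_init__(self):
        # Store vertices by descending importance so that a coarse view
        # only needs to read a prefix of the list.
        self.vertices = sorted(self.vertices, key=lambda v: -v[0])


def vertices_at_level(obj, min_importance):
    """Read the prefix of vertices that are important enough, then put
    them back in drawing order using the stored sequence index."""
    prefix = takewhile(lambda v: v[0] >= min_importance, obj.vertices)
    return [(x, y) for _, _, x, y in sorted(prefix, key=lambda v: v[1])]


def query(objects, min_importance):
    """Skip objects below the threshold and simplify the remaining ones."""
    return {
        obj.name: vertices_at_level(obj, min_importance)
        for obj in objects
        if obj.importance >= min_importance
    }


# Hypothetical data: a road with one minor vertex, and a minor pond.
road = GeoObject("road", 0.9, [(1.0, 0, 0, 0), (0.2, 1, 1, 1),
                               (0.6, 2, 2, 0), (1.0, 3, 3, 3)])
pond = GeoObject("pond", 0.3, [(1.0, 0, 5, 5), (1.0, 1, 6, 5),
                               (1.0, 2, 6, 6)])
coarse = query([road, pond], min_importance=0.5)
fine = query([road, pond], min_importance=0.1)
```

At the coarse level only the road is returned, with its least important vertex dropped; at the fine level both objects come back with every vertex. In a real index the threshold would be derived from the browsing scale.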

15.
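Combining theoretical derivation with laboratory experiments, a method is established for determining the threshold pressure gradient of low-permeability heterogeneous sandstone reservoirs. First, using the analogy between the flow field of a reservoir and an electric field, formulas for the threshold pressure gradient of heterogeneous sandstone reservoirs are derived. Second, a test method for the threshold pressure gradient of such reservoirs is established on the basis of the steady-state flow experiment. The results show that determining the threshold pressure gradient follows two equivalence principles. For an areally heterogeneous reservoir, the threshold pressure gradient equals the length-weighted average of the threshold pressure gradients of the individual permeability segments. For a vertically heterogeneous reservoir, it equals the average of the layers' threshold pressure gradients weighted by the product of permeability and flow cross-sectional area. The findings can guide the determination of reasonable well spacing in low-permeability heterogeneous sandstone reservoirs and promote their efficient development.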
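The two equivalence principles stated in this abstract are weighted averages and can be written out directly. The sketch below assumes that reading; the function names, the tuple layouts and the units mentioned in the comments are hypothetical, and the numbers in the usage note are invented for illustration.

```python
def areal_threshold_gradient(segments):
    """Length-weighted average for an areally heterogeneous reservoir.

    segments: list of (gradient, length) pairs, one per permeability
    segment along the flow path (e.g. MPa/m and m; units are assumptions).
    """
    total_length = sum(length for _, length in segments)
    if total_length <= 0:
        raise ValueError("total length must be positive")
    return sum(g * length for g, length in segments) / total_length


def vertical_threshold_gradient(layers):
    """Average weighted by permeability * flow area for a vertically
    heterogeneous reservoir.

    layers: list of (gradient, permeability, flow_area) triples,
    one per permeability layer.
    """
    weights = [k * area for _, k, area in layers]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(g * w for (g, _, _), w in zip(layers, weights)) / total_weight
```

For example, two segments with gradients 0.02 and 0.04 over lengths 10 and 30 give an areal value of 0.035, and two layers with gradients 0.05 and 0.01, permeabilities 1 and 3 and equal flow areas give a vertical value of 0.02. The high-permeability layer dominates the vertical average, as the weighting implies.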

16.
Wang Hui. 科技信息 (Science & Technology Information), 2008, (10): 240
Wuthering Heights, Emily Brontë's only novel, was published in December 1847 under the pseudonym Ellis Bell. The book was not an immediate success, but it is now considered one of the finest novels in the English language. Catherine is the key character of this masterpiece: despite her short life, everybody and everything in it centers on her. We can understand the novel better if we know Catherine well.

17.
The Williston Basin is a significant petroleum province, with oil-producing zones in the Middle Cambrian to Lower Ordovician, Upper Ordovician, Middle Devonian, Upper Devonian and Mississippian, as well as in the Jurassic and Cretaceous. The oils of the Williston Basin exhibit a wide range of geochemical characteristics and have been grouped into "oil families", although the geochemical signature of the oils reservoired in the Cambrian Deadwood Formation and the Lower Ordovician Winnipeg Formation does not match any established "oil family". Despite their close stratigraphic proximity, the Lower Palaeozoic oils of the Williston Basin are evidently distinct, which suggests the presence of a new "oil family" within the basin. Diagnostic geochemical signatures occur in the gasoline-range chromatograms, the saturate-fraction gas chromatograms and the biomarker fingerprints. However, some of the established criteria and cross-plots currently used to segregate oils into distinct genetic families within the basin are not always successful, particularly when applied to the Lower Palaeozoic oils of the Deadwood and Winnipeg Formations.

18.
Hemingway, a modern American novelist famous in the literary world, was not someone who followed trends but a sharp observer. He was also a master of tragedy who paid great attention to existence, fate and final outcomes. The tragedy of the characters in his works is tragedy at the extreme limit: a fearless challenge that ends in failure. The beauty of this tragedy arises not from the destruction of life; its value lies in the act of striving itself. His characters act out for the reader the tragedy of challenging the limit and death.

19.
A computer generator for randomly layered structures. YU Jia shun 1,2, HE Zhen hua 2 (1. The Institute of Geological and Nuclear Sciences, New Zealand; 2. State Key Laboratory of Oil and Gas Reservoir Geology and Exploitation, Chengdu University of Technology, China). Abstract: An algorithm is introd…

20.
In the 19th century, society was controlled by men; women were merely their appendages and had no rights or freedom. Jane, however, was an exception and showed some characteristics of an early feminist. Her feminism appears in three aspects: rebellion, equality and independence. These characteristics contributed to her success, and feminism was the only way out for women of that time.
