Similar Documents
20 similar documents were found (search time: 406 ms).
1.
In this paper, we improve trawling and point out some communities missed by it. We use a DBG (Dense Bipartite Graph) instead of a CBG (Complete Bipartite Graph) to identify the structure of a potential community. Based on the DBG, we propose a new edge-removal method to extract cores from a Web graph. Moreover, we improve the crawler so that it saves only potential pages as fans of a core, saving a large amount of disk storage space. To evaluate whether a set of cores belongs to a community, term-frequency statistics are used. The experimental dataset was crawled under the ".cn" domain. The results show that our algorithm works properly and that some new cores can be found by our method.
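The pruning step behind this kind of core extraction can be illustrated with a short sketch; it is not the authors' implementation, and the thresholds and the fan-to-center adjacency format are assumptions. Fans with too few out-links and centers with too few in-links are removed repeatedly until a dense bipartite core remains.

```python
def extract_core(fan_links, min_out=3, min_in=3):
    """Iteratively prune a bipartite fan->center graph until every remaining fan
    links to at least `min_out` centers and every remaining center is linked by
    at least `min_in` fans.

    fan_links: dict mapping fan id -> set of center ids.
    Returns (fans, centers) of the surviving core (possibly empty).
    """
    fans = {f: set(cs) for f, cs in fan_links.items()}
    changed = True
    while changed:
        changed = False
        # Drop fans with too few out-links.
        for f in list(fans):
            if len(fans[f]) < min_out:
                del fans[f]
                changed = True
        # Count in-links of each center over the surviving fans.
        in_deg = {}
        for cs in fans.values():
            for c in cs:
                in_deg[c] = in_deg.get(c, 0) + 1
        weak = {c for c, d in in_deg.items() if d < min_in}
        if weak:
            changed = True
            for f in fans:
                fans[f] -= weak
    centers = set().union(*fans.values()) if fans else set()
    return set(fans), centers
```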

2.
The Web cluster has become a popular network server architecture because of its scalability and cost-effectiveness. A cache configured in the servers can significantly increase performance. In this paper, we discuss suitable configuration strategies for caching dynamic content based on our experimental results. Considering that the system itself already supports caching of static Web pages, for example through memory caches and the disk's own cache, we adopt a special pattern that caches only dynamic Web pages in some experiments in order to enlarge the cache space. Three different replacement algorithms are introduced into our cache proxy module to test the practical effect of caching dynamic pages under different conditions. We chiefly analyze the influence of generation time and access frequency on caching dynamic Web pages, and provide detailed experimental results and the main conclusions.
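The abstract does not name the three replacement algorithms, so the sketch below shows one plausible GDSF-style policy in which a cached page's priority grows with its access frequency and generation cost, making cheap, rarely used pages the first to be evicted. Class and field names are illustrative, not the paper's.

```python
class DynamicPageCache:
    """Toy cache for generated pages; eviction priority grows with access
    frequency and generation cost (GDSF-like), so cheap, rarely used pages go first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = {}   # key -> dict(body, size, freq, cost, priority)
        self.clock = 0.0    # aging term: raised to the priority of each evicted entry

    def _priority(self, e):
        return self.clock + e["freq"] * e["cost"] / max(e["size"], 1)

    def get(self, key):
        e = self.entries.get(key)
        if e is None:
            return None
        e["freq"] += 1                      # reward frequently accessed pages
        e["priority"] = self._priority(e)
        return e["body"]

    def put(self, key, body, gen_cost):
        size = len(body)
        # Evict lowest-priority pages until the new page fits.
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k]["priority"])
            self.clock = self.entries[victim]["priority"]   # age the remaining entries
            self.used -= self.entries[victim]["size"]
            del self.entries[victim]
        e = {"body": body, "size": size, "freq": 1, "cost": gen_cost}
        e["priority"] = self._priority(e)
        self.entries[key] = e
        self.used += size
```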

3.
Improving the Search Strategy of a Web Crawler by Using Hyperlink Information (total citations: 5; self-citations: 0; cited by others: 5)
A Web crawler faces two problems when crawling the Web: 1) because the amount of information on the Internet is enormous, a search engine cannot cover all Web pages; and 2) limited by hardware resources, the number of pages a crawler can store is finite. If the crawler follows the traditional breadth-first search strategy, it treats all pages equally, and the quality of the pages it fetches is consequently low. To address this, an algorithm is presented that uses hyperlink information to improve the crawler's search strategy. The algorithm fully exploits the hyperlink information between pages and overcomes the blind crawling of the traditional breadth-first strategy. Experiments show that more than 50% of the pages crawled with this algorithm are relevant to a given topic.
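As a rough illustration of the idea (not the paper's algorithm), the sketch below replaces the FIFO frontier of breadth-first crawling with a priority queue in which a URL's priority rises with the number of already-discovered pages linking to it; `fetch` and `extract_links` are caller-supplied stubs, and all names are assumptions.

```python
import heapq
import itertools

def crawl(seeds, fetch, extract_links, max_pages=1000):
    """Best-first crawl: URLs with more discovered in-links are fetched first.

    fetch(url) -> page content (or None on failure).
    extract_links(page) -> iterable of absolute URLs.
    """
    in_links = {url: 0 for url in seeds}     # discovered in-link counts
    counter = itertools.count()              # tie-breaker for the heap
    frontier = [(0, next(counter), url) for url in seeds]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        neg_score, _, url = heapq.heappop(frontier)
        if url in visited or -neg_score < in_links.get(url, 0):
            continue                         # already crawled, or a fresher entry exists
        visited.add(url)
        page = fetch(url)
        if page is None:
            continue
        for link in extract_links(page):
            if link in visited:
                continue
            in_links[link] = in_links.get(link, 0) + 1
            # Push the updated score; stale, lower-scored entries are skipped above.
            heapq.heappush(frontier, (-in_links[link], next(counter), link))
    return visited
```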

4.
The Web offers a very convenient way to access remote information resources, and an important measure of Web service quality is how long it takes to search for and retrieve information. By caching a Web server's dynamic content, repeated database queries can be avoided and the access frequency of the original resources reduced, thus improving the server's response speed. This paper describes the concept, advantages, principles and concrete realization of a dynamic content cache module for a Web server.
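The module's internals are not given in the abstract; a minimal cache-aside sketch under assumed names shows the general pattern: the rendered page for a given set of arguments is stored with a time-to-live, so repeated requests skip the database until the entry expires.

```python
import time
import functools

def cache_dynamic(ttl_seconds=60, max_entries=1024):
    """Cache-aside decorator for page-generating functions, keyed by their arguments."""
    def decorator(render):
        store = {}                              # key -> (expires_at, page)
        @functools.wraps(render)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))
            hit = store.get(key)
            if hit and hit[0] > time.time():
                return hit[1]                   # fresh copy: skip the database
            page = render(*args, **kwargs)      # regenerate (queries the database)
            if len(store) >= max_entries:       # crude bound: drop the oldest entry
                store.pop(next(iter(store)))
            store[key] = (time.time() + ttl_seconds, page)
            return page
        return wrapper
    return decorator

# Hypothetical usage:
# @cache_dynamic(ttl_seconds=30)
# def product_page(product_id):
#     ...  # expensive database query + template rendering
```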

5.
How to integrate heterogeneous semi-structured Web records into a relational database is an important and challenging research topic. An improved conditional random field model is presented that combines learning from labeled samples with unlabeled database records in order to reduce the dependence on tediously hand-labeled training data. The proposed model is used to solve the problem of schema matching between the data source schema and the database schema. Experimental results on a large number of Web pages from diverse domains show the approach's effectiveness.
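The improved, semi-supervised CRF model itself cannot be reproduced from the abstract; as a rough point of reference only, the sketch below trains a plain supervised CRF (using the third-party sklearn-crfsuite package) to label the fields of Web records with database schema attributes. The feature names and the example label set are assumptions.

```python
# pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features for one field value in a Web record."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "has_currency": any(c in tok for c in "$¥€"),
        "position": i,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def record_to_features(record_tokens):
    return [token_features(record_tokens, i) for i in range(len(record_tokens))]

# Each training record is a token sequence plus one schema label per token, e.g.
# ["Canon", "EOS", "R8", "$1,499"] -> ["title", "title", "title", "price"].
def train_labeler(records, labels):
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit([record_to_features(r) for r in records], labels)
    return crf
```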

6.
Using the method of computing similarity between vectors, experimental analysis verifies the importance of table information in topic crawling. The results show that, compared with the whole page, the user-relevant information provided by tables accounts for more than 80% of a page's total information, so this conclusion can be fully exploited for page parsing in topic crawling. After discarding all elements other than tables and titles, the efficiency of the crawling program is improved.
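The abstract does not specify the similarity measure beyond "vector similarity"; a minimal sketch using term-frequency vectors and cosine similarity over table and title text only is shown below. Whitespace tokenization is a simplification, and Chinese text would additionally need a word segmenter.

```python
import math
from collections import Counter

def term_vector(text):
    """Very simple term-frequency vector: lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine_similarity(vec_a, vec_b):
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def table_relevance(table_text, title_text, topic_text):
    """Score a page by comparing only its tables and title against the topic."""
    page_vec = term_vector(table_text + " " + title_text)
    return cosine_similarity(page_vec, term_vector(topic_text))
```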

7.
Web search engines are very useful information service tools on the Internet. Current Web search engines produce results based on the search terms and the information they have actually collected. Since users' selections among the search results cannot affect future results, the results may not cover most people's interests. In this paper, feedback information produced by users' access lists is represented with rough sets and used to reconstruct the query string and influence the search results, so that the search engine becomes self-adaptive.

8.
With the explosive growth of information sources available on the World Wide Web, how to combine the results of multiple search engines has become an important problem. In this paper, a search strategy based on genetic simulated annealing for search engines in Web mining is proposed. According to the proposed strategy, there exists an important relationship among Web statistical studies, search engines and optimization techniques. We have verified experimentally the relevance of our approach to the presented queries by comparing the quality of the output pages with that of the originally downloaded pages; as the number of iterations increases, better results are obtained within reasonable execution time.
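The genetic simulated annealing procedure is not detailed in the abstract; the sketch below shows plain simulated annealing (without the genetic operators) optimizing per-engine weights for merging result lists, with the quality estimate `score_merged` supplied by the caller. All names and parameters are illustrative.

```python
import math
import random

def anneal_engine_weights(score_merged, n_engines, steps=500,
                          start_temp=1.0, cooling=0.995):
    """Search for per-engine weights that maximize score_merged(weights),
    a caller-supplied estimate of merged-result quality."""
    weights = [1.0 / n_engines] * n_engines
    best = list(weights)
    best_score = current_score = score_merged(weights)
    temp = start_temp
    for _ in range(steps):
        # Perturb one engine's weight and renormalize.
        candidate = list(weights)
        i = random.randrange(n_engines)
        candidate[i] = max(0.0, candidate[i] + random.gauss(0, 0.1))
        total = sum(candidate) or 1.0
        candidate = [w / total for w in candidate]
        cand_score = score_merged(candidate)
        # Accept improvements always; accept worse moves with a
        # temperature-dependent probability to escape local optima.
        if cand_score >= current_score or \
           random.random() < math.exp((cand_score - current_score) / temp):
            weights, current_score = candidate, cand_score
            if cand_score > best_score:
                best, best_score = list(candidate), cand_score
        temp *= cooling
    return best, best_score
```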

9.
Forms enhance the dynamic and interactive abilities of Web applications, but they also increase system complexity, so it is especially important to test forms completely and thoroughly. This paper therefore discusses how to carry out form testing with different methods in the related testing phases. Namely: first, forms are automatically extracted from Web pages by parsing the HTML documents; then test data are obtained with certain strategies, such as from requirement specifications, by mining users' previous input information, or through a recording mechanism; next, the testing actions are executed automatically on the basis of well-formed test cases; finally, a case study is given to illustrate the convenience and effectiveness of these methods.
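A minimal sketch of the first step, extracting forms by parsing HTML, is shown below using only the Python standard library; the test-data generation and execution steps described in the abstract are not reproduced, and the returned field layout is an assumption.

```python
from html.parser import HTMLParser

class FormExtractor(HTMLParser):
    """Collect each <form> together with its action, method and input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action", ""),
                             "method": (attrs.get("method") or "get").lower(),
                             "fields": []}
        elif self._current is not None and tag in ("input", "select", "textarea"):
            self._current["fields"].append({"tag": tag,
                                            "type": attrs.get("type", "text"),
                                            "name": attrs.get("name", "")})

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def extract_forms(html_text):
    parser = FormExtractor()
    parser.feed(html_text)
    return parser.forms
```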

10.
陈丽君 《科技资讯》2009,(16):21-21
Traditional Web crawlers process only the hyperlinks in a page and ignore the large number of valuable deep-Web search forms. This paper designs a form detector for detecting search forms, describes its functional modules and concrete implementation, and finally verifies the detector's effectiveness through experiments.

11.
This paper introduces the overall architecture of a search engine and analyzes the crawling strategies of its crawler and the update modes of its page repository. A relatively reasonable crawling and update mode and its implementation techniques are described, achieving incremental crawling of high-quality pages and improved freshness of the page repository.

12.
We propose an algorithm for learning hierarchical user interest models from the Web pages users have browsed. In this algorithm, a user's interests are represented as a tree, called a user interest tree, whose content and structure can change simultaneously to adapt to changes in the user's interests. This representation treats a user's specific and general interests as a continuum; in some sense, specific interests correspond to short-term interests, while general interests correspond to long-term interests, so the representation reflects users' interests more faithfully. The algorithm can automatically model a user's multiple interest domains, dynamically generate the interest models, and prune a user interest tree when the number of nodes in it exceeds a given value. Finally, we present experimental results on a Chinese Web site.
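The tree operations are not spelled out in the abstract; a minimal sketch under assumed names shows the general shape of such a model: topic nodes carry weights, each browsed page reinforces a general-to-specific topic path, old weights decay, and the weakest leaves are pruned once the tree exceeds a size limit.

```python
class InterestNode:
    def __init__(self, topic, weight=0.0):
        self.topic = topic
        self.weight = weight
        self.children = {}     # topic -> InterestNode

    def count(self):
        return 1 + sum(c.count() for c in self.children.values())

class UserInterestTree:
    def __init__(self, max_nodes=50, decay=0.95):
        self.root = InterestNode("ROOT")
        self.max_nodes = max_nodes
        self.decay = decay

    def observe(self, topic_path, weight=1.0):
        """Reinforce a general->specific topic path, e.g. ["sports", "football"]."""
        self._decay_all(self.root)               # older interests fade over time
        node = self.root
        for topic in topic_path:
            node = node.children.setdefault(topic, InterestNode(topic))
            node.weight += weight
        while self.root.count() > self.max_nodes:
            self._prune_weakest_leaf()

    def _decay_all(self, node):
        node.weight *= self.decay
        for child in node.children.values():
            self._decay_all(child)

    def _prune_weakest_leaf(self):
        leaves = []                               # (weight, parent, topic)
        def collect(node):
            for topic, child in node.children.items():
                if child.children:
                    collect(child)
                else:
                    leaves.append((child.weight, node, topic))
        collect(self.root)
        if leaves:
            _, parent, topic = min(leaves, key=lambda x: x[0])
            del parent.children[topic]
```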

13.
Deep Web sources contain a large amount of high-quality, query-related structured data. One of the challenges of the Deep Web is extracting the result schemas of Deep Web sources. To address this challenge, this paper describes a novel approach that extracts both the result data and the result schema of a Web database. The approach first models the query interface of a Deep Web source and fills it in with a specific query instance. The result pages of the Deep Web source are then represented as tree structures so that the subtrees containing elements of the query instance can be retrieved. Next, the result schema of the Deep Web source is extracted by matching the subtrees' nodes with the query instance, where a two-phase schema extraction method is adopted to obtain a more accurate result schema. Finally, experiments on real Deep Web sources show the utility of our approach, which achieves high precision and recall.
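The two-phase extraction is not given in the abstract; the sketch below illustrates only the core matching idea in a simplified tabular setting: columns of the extracted result rows that consistently contain a submitted query value inherit that attribute's name. Parameter names and the support threshold are assumptions.

```python
def label_result_columns(rows, query_instance, min_support=0.6):
    """rows: list of result records, each a list of cell strings.
    query_instance: dict attribute_name -> submitted value.
    Returns {column_index: attribute_name} for columns whose cells contain
    the query value in at least `min_support` of the rows."""
    if not rows:
        return {}
    n_cols = max(len(r) for r in rows)
    labels = {}
    for attr, value in query_instance.items():
        needle = str(value).lower()
        best_col, best_support = None, 0.0
        for col in range(n_cols):
            hits = sum(1 for r in rows
                       if col < len(r) and needle in r[col].lower())
            support = hits / len(rows)
            if support > best_support:
                best_col, best_support = col, support
        if best_col is not None and best_support >= min_support:
            labels[best_col] = attr
    return labels

# Example: query_instance = {"title": "database", "author": "Smith"};
# rows are extracted from the result page, and matching columns get labeled.
```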

14.
To alleviate the scalability problem caused by growing Web usage and changing user interests, this paper presents a novel Web usage mining algorithm: an incremental Web usage mining algorithm based on active ant colony clustering. First, an active movement strategy for direction selection and speed, different from the passive strategy employed by other ant colony clustering algorithms, is proposed to construct an active ant colony clustering algorithm; it avoids idle and "flying over the plane" movement and effectively improves the quality and speed of clustering on large datasets. Then a mechanism for decomposing clusters based on the above method is introduced to form new clusters when users' interests change. Empirical studies on a real Web dataset show that the active ant colony clustering algorithm performs better than previous algorithms, and that the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.

15.
Aiming at the problem of merging heterogeneous semantic taxonomies that arises in Web information integration, a method of building a Web classification ontology (WCO) is proposed. A WCO that is logically consistent with the Suggested Upper Merged Ontology (SUMO) is defined, together with the axioms needed to classify Web pages. The WCO can serve as a foundation for merging heterogeneous semantic taxonomies and can support Web information integration and classification-based Web information retrieval.

16.
A customizable focused Web crawler technique is proposed. The technique uses a simple topic description method to improve the crawler's customizability, uses link navigation based on the link structure of site pages to fetch topic-relevant information efficiently, and applies customization through configuration files, thereby building a lightweight focused crawler with low resource consumption, high data-collection accuracy and strong controllability that meets the needs of P2P search. The paper further proposes a data update mechanism for the crawler that combines incremental updates with batch updates; this hybrid mechanism reduces the implementation complexity of incremental updating and consumes fewer resources than batch updating. Experiments show that the mechanism achieves high data freshness and recall.

17.
A Parallel Algorithm for Web Page Duplicate Elimination Based on Map/Reduce (total citations: 1; self-citations: 0; cited by others: 1)
The duplicate-elimination module is an important component of a search engine system; its role is to filter the pages downloaded by the search engine's crawler and remove pages with duplicate content, thereby improving the performance of the crawler system and the quality of retrieval. A parallel duplicate-elimination algorithm and its Map/Reduce-based implementation are proposed, and experiments on real Web sites verify the algorithm's stability and its parallel performance when processing large numbers of pages.
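As a generic illustration of the Map/Reduce pattern for duplicate elimination (not the paper's algorithm), the map step below emits a content fingerprint as the key and the reduce step keeps one URL per fingerprint; a real deployment would run on Hadoop or a similar framework and would likely use near-duplicate fingerprints such as SimHash instead of exact hashes.

```python
import hashlib
from collections import defaultdict

def map_phase(pages):
    """pages: iterable of (url, html_text). Emit (fingerprint, url) pairs."""
    for url, text in pages:
        # Exact-content fingerprint; a near-duplicate scheme could be substituted.
        fingerprint = hashlib.md5(text.encode("utf-8")).hexdigest()
        yield fingerprint, url

def shuffle(pairs):
    """Group map output by key, mimicking the framework's shuffle stage."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Keep one representative URL per fingerprint; the rest are duplicates."""
    kept, duplicates = [], []
    for fingerprint, urls in groups.items():
        kept.append(urls[0])
        duplicates.extend(urls[1:])
    return kept, duplicates

# kept, duplicates = reduce_phase(shuffle(map_phase(pages)))
```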

18.
Implementation of a Topic Crawler Based on Web Page Segmentation (total citations: 1; self-citations: 0; cited by others: 1)
In view of the fact that general-purpose search engines currently return too many results that are only weakly related to the topic, an implementation of a topic crawler based on Web page segmentation is proposed, and a prototype system, Crawler1, is built. Experimental results show that the system performs well, with the topical relevance of the crawled pages above 55%.

19.
The demand for individualized teaching on E-learning websites is rapidly increasing because of the huge differences among Web learners. A method for clustering Web learners based on rough sets is proposed. The basic idea of the method is to reduce the learning attributes prior to clustering, so that the clustering of Web learners is carried out in a relatively low-dimensional space. Using this method, E-learning websites can arrange corresponding teaching content for different clusters of learners so that learners' individual requirements can be better satisfied.
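The abstract only states that learning attributes are reduced with rough set theory before clustering; the sketch below shows one naive reduction heuristic as an illustration, not the paper's algorithm: an attribute is dropped if removing it leaves the indiscernibility partition of the learner records unchanged. Attribute names in the usage comment are hypothetical.

```python
from collections import defaultdict

def partition(records, attributes):
    """Group records by their values on the given attributes (indiscernibility classes)."""
    classes = defaultdict(set)
    for i, rec in enumerate(records):
        key = tuple(rec[a] for a in attributes)
        classes[key].add(i)
    return set(frozenset(s) for s in classes.values())

def naive_reduct(records, attributes):
    """Greedily drop attributes whose removal leaves the partition unchanged."""
    kept = list(attributes)
    full = partition(records, kept)
    for attr in list(kept):
        trial = [a for a in kept if a != attr]
        if trial and partition(records, trial) == full:
            kept = trial
    return kept

# learners = [{"time_online": "high", "quiz_score": "mid", "forum_posts": "low"}, ...]
# reduced = naive_reduct(learners, ["time_online", "quiz_score", "forum_posts"])
```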

20.
A vision-based query interface annotation method is used to relate attributes to form elements in form-based Web query interfaces; this method reaches an accuracy of 82%. A user participation method is then used to tune the result: users can answer "yes" or "no" to existing annotations or manually annotate form elements. This mass feedback is added to the annotation algorithm to produce more accurate results. With this approach, query interface annotation can reach perfect accuracy.
