
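Session identification based on time intervals in Web log mining
Cite this article:ZHUANG Like,KOU Zhongbao,ZHANG Changshui.Session identification based on time intervals in Web log mining[J].Journal of Tsinghua University(Science and Technology),2005,45(1):115-118.
Authors:ZHUANG Like  KOU Zhongbao  ZHANG Changshui
Affiliation:Department of Automation, Tsinghua University, Beijing 100084
Abstract:To address the session identification problem in Web log mining, a method based on time intervals is proposed. The method splits a session wherever the interval between consecutive page accesses exceeds a threshold, and the threshold for a specific IP is defined from its frequency vector. Experiments show that the frequency vectors of proxy-server IPs and single-user IPs have different characteristics: the frequency vector of a proxy-server IP follows a power law, whereas that of a single-user IP follows a Gaussian distribution. On this basis, a method based on the Gaussian hypothesis is proposed to set the threshold for each single-user IP. Compared with the traditional approach of splitting sessions with a single a priori threshold for all IP addresses, this method is more reasonable and effective.

Keywords:database theory  Web log mining  session identification  time interval  frequency vector
Article ID:1000-0054(2005)01-0115-04
Revision received:December 5, 2003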

Session identification based on time intervals in Web log mining
ZHUANG Like,KOU Zhongbao,ZHANG Changshui.Session identification based on time intervals in Web log mining[J].Journal of Tsinghua University(Science and Technology),2005,45(1):115-118.
Authors:ZHUANG Like  KOU Zhongbao  ZHANG Changshui
Abstract:This paper presents a method for session identification based on an analysis of the time intervals in user access logs. The method separates the access logs into distinct sessions at points where the interval between consecutive accesses exceeds a threshold. The threshold for a specific IP is defined by the statistics of its frequency vector. Tests show that the frequency vectors of proxy IPs and single-user IPs differ: for a proxy IP, the frequency vector often shows a power-law distribution, whereas for a single-user IP it approximates a Gaussian distribution. A method based on the Gaussian hypothesis is proposed for computing a separate threshold for each single-user IP. Compared with the traditional approach, which applies a single a priori threshold to all IP addresses, the presented method is more reasonable and effective.
Keywords:database theory  Web log mining  session identification  access interval  frequency vector  
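
The interval-based splitting described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula that maps a frequency vector to a threshold, so the rule used here (mean plus `k` standard deviations of an IP's observed access intervals) is only an assumption motivated by the Gaussian hypothesis, not the authors' definition. The function names, the parameter `k`, and the 1800-second fallback are likewise hypothetical.

```python
from statistics import mean, pstdev


def gaussian_threshold(intervals, k=2.0, fallback=1800.0):
    """Assumed per-IP threshold: mean + k * std of the observed intervals.

    This rule is an illustration of a Gaussian-style cutoff; the paper's
    actual threshold definition is not stated in the abstract.
    `fallback` (seconds) is a hypothetical default for sparse data.
    """
    if len(intervals) < 2:
        return fallback
    return mean(intervals) + k * pstdev(intervals)


def split_sessions(timestamps, threshold):
    """Split ascending access times into sessions.

    A new session starts whenever the gap between two consecutive
    accesses exceeds `threshold`.
    """
    sessions = []
    current = []
    for t in timestamps:
        if current and t - current[-1] > threshold:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions
```

In this sketch a threshold would be estimated once per single-user IP from that IP's typical access intervals and then passed to `split_sessions` together with the IP's sorted access times. Proxy IPs, whose frequency vectors the abstract describes as power-law distributed, would need a different treatment that is not sketched here.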
This article is indexed in CNKI, Wanfang Data, and other databases.