Selections of data preprocessing methods and similarity metrics for gene cluster analysis
Authors: YANG Chunmei, WAN Baikun, GAO Xiaofeng
Affiliations: 1. Department of Biomedical Engineering and Scientific Instrumentations, Tianjin University, Tianjin 300072, China; 2. Motorola (China) Electronics Ltd., Tianjin 300457, China
Funding: Supported by the Tianjin Key Academic Subject Fund (Grant No. 2001-31)
Abstract: Clustering is one of the major exploratory techniques for gene expression data analysis. High-quality clustering results can be obtained only when a suitable similarity metric is used and the dataset is properly preprocessed. In this study, gene expression datasets with external evaluation criteria were preprocessed by normalization by line, normalization by column, or base-2 logarithm transformation, and were then clustered by hierarchical clustering, k-means clustering, and self-organizing maps (SOMs), with either the Pearson correlation coefficient or Euclidean distance as the similarity metric. The quality of the resulting clusters was evaluated with the adjusted Rand index. The results show that k-means clustering and SOMs have distinct advantages over hierarchical clustering for gene clustering, and that SOMs perform slightly better than k-means when randomly initialized. They also show that hierarchical clustering works best with the Pearson correlation coefficient as the similarity metric and with datasets normalized by line, whereas k-means clustering and SOMs produce better clusters with Euclidean distance and logarithm-transformed datasets. These results provide a useful reference for implementing gene expression cluster analysis.

Keywords: gene expression; cluster analysis; data preprocessing; similarity metrics; Rand index
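
The building blocks named in the abstract (three preprocessing options, two similarity metrics, and the adjusted Rand index) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that "line" means a row of the expression matrix (one gene) and that "normalization" means scaling to zero mean and unit standard deviation; the abstract states neither. All function names are hypothetical, and the adjusted Rand index follows the standard Hubert-Arabie pair-counting form.

```python
import math
from collections import Counter
from math import comb


def log2_transform(matrix):
    """Base-2 logarithm of every expression value (values must be > 0)."""
    return [[math.log2(v) for v in row] for row in matrix]


def normalize_rows(matrix):
    """Assumed meaning of 'normalization by line': zero mean, unit SD per row."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has no spread; map it to zeros instead of dividing by 0.
        out.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return out


def normalize_columns(matrix):
    """Same scaling applied per column, done by transposing twice."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*normalize_rows(transposed))]


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # convention chosen for this sketch: constant profile
    return cov / math.sqrt(vx * vy)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between an external partition and a clustering."""
    total = comb(len(labels_true), 2)
    if total == 0:
        return 1.0  # fewer than two items: partitions agree trivially
    # Pairs of items placed together in both, in the reference, in the result.
    sum_ij = sum(comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(n, 2) for n in Counter(labels_true).values())
    sum_b = sum(comb(n, 2) for n in Counter(labels_pred).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_ij - expected) / (max_index - expected)
```

A clustering algorithm would take the preprocessed matrix and one of the two metrics as input, and `adjusted_rand_index` would then compare its labels against the external criterion. Because the index is corrected for chance, it is 1 for identical partitions (up to relabeling) and close to 0 for a random assignment.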

YANG Chunmei, WAN Baikun, GAO Xiaofeng. Selections of data preprocessing methods and similarity metrics for gene cluster analysis [J]. Progress in Natural Science, 2006, 16(6): 607-613.
This article is indexed in CNKI, Wanfang Data, and other databases.