Similar literature
 20 similar documents found (search time: 31 ms)
1.
Classification and spatial methods can be used in conjunction to represent the individual information of similar preferences by means of groups. In the context of latent class models and using Simulated Annealing, the cluster-unfolding model for two-way two-mode preference rating data has been shown to be superior to a two-step approach of first deriving the clusters and then unfolding the classes. However, the high computational cost makes the procedure only suitable for small or medium-sized data sets, and the hypothesis of independent and normally distributed preference data may also be too restrictive in many practical situations. Therefore, an alternating least squares procedure is proposed, in which the individuals and the objects are partitioned into clusters, while at the same time the cluster centers are represented by unfolding. An enhanced Simulated Annealing algorithm in the least squares framework is also proposed in order to address the local optimum problem. Real and artificial data sets are analyzed to illustrate the performance of the model.

2.
The framework of this paper is statistical data editing, specifically how to edit or impute missing or contradictory data and how to merge two independent data sets that present some lack of information. Assuming a missing at random mechanism, this paper provides an accurate tree-based methodology for both missing data imputation and data fusion that is justified within the Statistical Learning Theory of Vapnik. It considers both an incremental variable imputation method to improve computational efficiency and boosted trees to gain prediction accuracy with respect to other methods. As a result, the best approximation of the structural risk (also known as irreducible error) is reached, thus reducing the generalization (or prediction) error of imputation to a minimum. Moreover, the methodology is distribution free: it holds independently of the underlying probability law generating the missing data values. Performance is discussed using simulation case studies and real-world applications.

3.
The objective of this paper is to develop the maximum likelihood approach for analyzing a finite mixture of structural equation models with missing data that are missing at random. A Monte Carlo EM algorithm is proposed for obtaining the maximum likelihood estimates. A well-known statistic in model comparison, namely the Bayesian Information Criterion (BIC), is used for model comparison. With the presence of missing data, the computation of the observed-data likelihood function value involved in the BIC is not straightforward. A procedure based on path sampling is developed to compute this function value. It is shown by means of simulation studies that ignoring the incomplete data with missing entries gives less accurate ML estimates. An illustrative real example is also presented.

4.
In the framework of incomplete data analysis, this paper provides a nonparametric approach to missing data imputation based on Information Retrieval. In particular, an incremental procedure based on the iterative use of a tree-based method is proposed and a suitable Incremental Imputation Algorithm is introduced. The key idea is to define a lexicographic ordering of cases and variables so that conditional mean imputation via binary trees can be performed incrementally. A simulation study and real data applications are carried out to describe the advantages and the performance with respect to standard approaches.

5.
Graphical displays which show inter-sample distances are important for the interpretation and presentation of multivariate data. Except when the displays are two-dimensional, however, they are often difficult to visualize as a whole. A device, based on multidimensional unfolding, is described for presenting some intrinsically high-dimensional displays in fewer, usually two, dimensions. This goal is achieved by representing each sample by a pair of points, say R_i and r_i, so that a theoretical distance between the i-th and j-th samples is represented twice, once by the distance between R_i and r_j and once by the distance between R_j and r_i. Self-distances between R_i and r_i need not be zero. The mathematical conditions for unfolding to exhibit symmetry are established. Algorithms for finding approximate fits, not constrained to be symmetric, are discussed and some examples are given.

6.
Starting from the problem of missing data in surveys with Likert-type scales, the aim of this paper is to evaluate a possible improvement for the imputation procedure proposed by Lavori, Dawson, and Shera (1995), here called Approximate Bayesian bootstrap with Propensity score (ABP). We propose an imputation procedure named Approximate Bayesian bootstrap with Propensity score and Nearest neighbour (ABPN), which, after the "propensity score step" of ABP, randomly selects a donor in the nonrespondent's neighbourhood, which includes cases with response patterns similar to the one of the nonrespondent to be imputed. A preliminary simulation study with single imputation on missing data in two Likert-type scales from a real data set shows that ABPN: (a) performed better than the ABP imputation, and (b) can be considered as a serious competitor of other procedures used in this context.
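
The approximate Bayesian bootstrap that ABP and ABPN build on can be sketched as follows. This is a minimal illustration of the plain two-stage resampling step as it is usually described, applied to a single group of respondents; the propensity-score step of ABP and the nearest-neighbour donor selection of ABPN are not implemented, and the function name `abb_impute` and the sample values are hypothetical.

```python
import random


def abb_impute(observed, n_missing, rng):
    """Plain approximate Bayesian bootstrap for one group of cases.

    Stage 1: resample the observed values with replacement to form a
             donor pool of the same size as the observed set.
    Stage 2: draw each imputed value with replacement from that pool.
    """
    pool = [rng.choice(observed) for _ in observed]
    return [rng.choice(pool) for _ in range(n_missing)]


# Hypothetical Likert-type responses (1 to 5) from respondents in one group.
donors = [1, 2, 2, 3, 4, 5]
rng = random.Random(0)  # seeded so that a run can be repeated
imputed = abb_impute(donors, 3, rng)
print(imputed)
```

Every imputed value is an observed response, so the imputations stay on the Likert scale; the first resampling stage is what adds between-imputation variability compared with drawing directly from the donors.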

7.
In this commentary on Don Ihde’s paper “Stretching the in-between: embodiment and beyond” I argue that perceptions and observations are based on tacit frames and these frames are expressed through pre-reflexive intuitions thus giving meaning to the perceived content of observations. However, if the objective or given information in perception is incomplete or missing our brain and nervous system will intuitively and unconsciously fill in the missing information in order to act; these particular pieces of added information may not be relevant to the decoding of the given content of perception at all.

8.
In this paper we will offer a few examples to illustrate the orientation of contemporary research in data analysis and we will investigate the corresponding role of mathematics. We argue that the modus operandi of data analysis is implicitly based on the belief that if we have collected enough and sufficiently diverse data, we will be able to answer most relevant questions concerning the phenomenon itself. This is a methodological paradigm strongly related, but not limited to, biology, and we label it the microarray paradigm. In this new framework, mathematics provides powerful techniques and general ideas which generate new computational tools. But it is missing any explicit isomorphism between a mathematical structure and the phenomenon under consideration. This methodology used in data analysis suggests the possibility of forecasting and analyzing without a structured and general understanding. This is the perspective we propose to call agnostic science, and we argue that, rather than diminishing or flattening the role of mathematics in science, the lack of isomorphisms with phenomena liberates mathematics, paradoxically making more likely the practical use of some of its most sophisticated ideas.

9.
Very large databases are a major opportunity for science and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this “philosophy” is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. They can be found in “randomly” generated, large enough databases, which, as we will prove, implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.

10.
Multiple imputation is one of the most highly recommended procedures for dealing with missing data. However, to date little attention has been paid to methods for combining the results from principal component analyses applied to a multiply imputed data set. In this paper we propose Generalized Procrustes analysis for this purpose, whose centroid solution can be used as a final estimate for the component loadings. Convex hulls based on the loadings of the imputed data sets can be used to represent the uncertainty due to the missing data. In two simulation studies, the performance of the Generalized Procrustes approach is evaluated and compared with other methods. More specifically, it is studied how these methods behave when order changes of components and sign reversals of component loadings occur, such as in the case of near-equal eigenvalues, or data having almost as many counterindicative items as indicative items. The simulations show that other proposed methods either may run into serious problems or are not able to adequately assess the accuracy due to the presence of missing data. However, when the above situations do not occur, all methods will provide adequate estimates for the PCA loadings.

11.
A latent class vector model for preference ratings (total citations: 1; self-citations: 1; citations by others: 1)
A latent class formulation of the well-known vector model for preference data is presented. Assuming preference ratings as input data, the model simultaneously clusters the subjects into a small number of homogeneous groups (or latent classes) and constructs a joint geometric representation of the choice objects and the latent classes according to a vector model. The distributional assumptions on which the latent class approach is based are analogous to the distributional assumptions that are consistent with the common practice of fitting the vector model to preference data by least squares methods. An EM algorithm for fitting the latent class vector model is described as well as a procedure for selecting the appropriate number of classes and the appropriate number of dimensions. Some illustrative applications of the latent class vector model are presented and some possible extensions are discussed. Geert De Soete is supported as “Bevoegdverklaard Navorser” of the Belgian “Nationaal Fonds voor Wetenschappelijk Onderzoek.”

12.
A common approach to deal with missing values in multivariate exploratory data analysis consists in minimizing the loss function over all non-missing elements, which can be achieved by EM-type algorithms where an iterative imputation of the missing values is performed during the estimation of the axes and components. This paper proposes such an algorithm, named iterative multiple correspondence analysis, to handle missing values in multiple correspondence analysis (MCA). The algorithm, based on an iterative PCA algorithm, is described and its properties are studied. We point out the overfitting problem and propose a regularized version of the algorithm to overcome this major issue. Finally, performances of the regularized iterative MCA algorithm (implemented in the R-package named missMDA) are assessed from both simulations and a real dataset. Results are promising with respect to other methods such as the missing-data passive modified margin method, an adaptation of the missing passive method used in Gifi’s Homogeneity analysis framework.

13.
A general set of multidimensional unfolding models and algorithms is presented to analyze preference or dominance data. This class of models, termed GENFOLD2 (GENeral UnFOLDing Analysis-Version 2), allows one to perform internal or external analysis, constrained or unconstrained analysis, conditional or unconditional analysis, metric or nonmetric analysis, while providing the flexibility of specifying and/or testing a variety of different types of unfolding-type preference models mentioned in the literature, including Carroll's (1972, 1980) simple, weighted, and general unfolding analysis. An alternating weighted least-squares algorithm is utilized and discussed in terms of preventing degenerate solutions in the estimation of the specified parameters. Finally, two applications of this new method are discussed concerning preference data for ten brands of pain relievers and twelve models of residential communication devices.

14.
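Earth science data sharing (total citations: 1; self-citations: 0; citations by others: 1)
In the 21st century, the rapid development of computer and network technology has made it practical for people to obtain all kinds of data over networks, and has made the sharing of Earth science data possible. Earth science data sharing is a demand of the times. To share Earth science data across systems and across different hardware and software platforms, action is needed at two levels. At the data level, regulations and policies on Earth science data sharing must be formulated, standards for the collection and sharing of Earth science data must be established and improved, and data and users must be classified into levels. At the technical level, a network environment suited to the characteristics of Earth science data must be selected, technologies for processing massive heterogeneous data and metadata databases must be developed, and data security must be ensured.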

15.
In this study, we consider the type of interval data summarizing the original samples (individuals) with classical point data. This type of interval data is termed interval symbolic data in a new research domain called symbolic data analysis. Most of the existing research, such as the (centre, radius) and [lower boundary, upper boundary] representations, represents an interval using only the boundaries of the interval. However, these representations hold true only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, such representations may result in not only inconsistency with the facts, since the individuals are usually not uniformly distributed in many application aspects, but also information loss for not considering the point data within the intervals during the calculation. In this study, we propose a new representation of the interval symbolic data considering the point data contained in the intervals. Then we apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in the interval do not follow a uniform distribution, the proposed method significantly outperforms the Hausdorff and city-block distance based on traditional representation in the context of dynamic clustering. Finally, we give an application example on the automobile data set.
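
The information loss described in this abstract can be illustrated with a short sketch of the traditional boundary-only representation. It uses the standard city-block and Hausdorff distances between intervals given as (lower, upper) pairs; it does not implement the new representation proposed in the paper, and the function names and sample points are hypothetical.

```python
def to_interval(points):
    """Summarize point data by its boundaries only (traditional representation)."""
    return (min(points), max(points))


def city_block(a, b):
    """City-block distance between two intervals given as (lower, upper)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hausdorff(a, b):
    """Hausdorff distance between two intervals given as (lower, upper)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


# Two samples with the same boundaries but very different distributions.
skewed_low = [1.0, 1.1, 1.2, 5.0]
skewed_high = [1.0, 4.8, 4.9, 5.0]
a = to_interval(skewed_low)
b = to_interval(skewed_high)

print(city_block(a, b), hausdorff(a, b))    # 0.0 0.0
print(city_block((1.0, 5.0), (2.0, 9.0)))   # 5.0
print(hausdorff((1.0, 5.0), (2.0, 9.0)))    # 4.0
```

Both boundary-based distances are zero for the two skewed samples, so a clustering built on them cannot tell the samples apart even though the points inside the intervals differ.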

16.
When clustering asymmetric proximity data, only the average amounts are often considered by assuming that the asymmetry is due to noise. But when the asymmetry is structural, as typically may happen for exchange flows, migration data or confusion data, this may strongly affect the search for the groups because the directions of the exchanges are ignored and not integrated in the clustering process. The clustering model proposed here relies on the decomposition of the asymmetric dissimilarity matrix into symmetric and skew-symmetric effects both decomposed in within and between cluster effects. The classification structures used here are generally based on two different partitions of the objects fitted to the symmetric and the skew-symmetric part of the data, respectively; the restricted case is also presented where the partition fits jointly both of them allowing for clusters of objects similar with respect to the average amounts and directions of the data. Parsimonious models are presented which allow for effective and simple graphical representations of the results.
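
The first step named in this abstract, splitting an asymmetric matrix into its symmetric and skew-symmetric parts, follows the standard identity D = (D + D')/2 + (D - D')/2. The sketch below shows only that split; the within-cluster and between-cluster decomposition and the fitting of partitions are not implemented, and the function name and the flow values are hypothetical.

```python
def split_asymmetric(d):
    """Split a square matrix into symmetric and skew-symmetric parts.

    sym[i][j]  = (d[i][j] + d[j][i]) / 2   (average amount exchanged)
    skew[i][j] = (d[i][j] - d[j][i]) / 2   (direction and size of imbalance)
    """
    n = len(d)
    sym = [[(d[i][j] + d[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(d[i][j] - d[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


# Hypothetical exchange flows between three objects (row = origin).
flows = [
    [0.0, 8.0, 3.0],
    [2.0, 0.0, 5.0],
    [1.0, 9.0, 0.0],
]
sym, skew = split_asymmetric(flows)
print(sym[0][1], skew[0][1], skew[1][0])  # 5.0 3.0 -3.0
```

The symmetric part carries the average amounts and the skew-symmetric part carries the directions, which is why the abstract can fit a separate partition to each.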

17.
Two algorithms for pyramidal classification (a generalization of hierarchical classification) are presented that can work with incomplete dissimilarity data. These approaches (a modification of the pyramidal ascending classification algorithm and a least squares based penalty method) are described and compared using two different types of complete dissimilarity data in which randomly chosen dissimilarities are assumed missing and the non-missing ones are subjected to random error. We also consider relationships between hierarchical classification and pyramidal classification solutions when both are based on incomplete dissimilarity data.

18.
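Based on the Web of Science database and taking highly cited papers in the big data field from 2015 to 2021 as the sample, this article uses the knowledge-mapping software VOSviewer to count keyword frequencies in the sample, generates a scientific knowledge map after manually preprocessing the keyword data, and then carries out a quantitative and cluster analysis of big data technology in terms of research hotspots, research frontiers, and evolution paths. The results show that the frontier of big data technology has three research directions: big data development and mining technology, big data analysis and management technology, and big data operations and maintenance and cloud computing technology. Digitalization, intelligentization, and networking are the future development directions of big data technology; big data operations and maintenance and cloud computing form its research frontier; and data security is its future research hotspot. As research on big data technology continues to deepen, the theoretical system and the governance system of big data will become more complete and mature, and humanity will enter a new era of the Internet of Everything led by information technology.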

19.
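A preliminary study of the data alliance model for scientific data sharing platforms (total citations: 2; self-citations: 0; citations by others: 2)
Data resource construction is one of the important components of building a scientific data sharing platform, and also one of its difficulties. This paper compares the three main models of resource construction for scientific data sharing platforms in China and points out that the data exchange model is better suited to the current goal, namely building an open and shared scientific data sharing platform. It further proposes a scientific data alliance model; a scientific data sharing platform built on this model can flexibly solve the problems of data resource construction and integration, thereby improving data utilization and service quality. It also offers two suggestions for building scientific data sharing platforms under the scientific data alliance model. Finally, the construction of the Earth System Science Data Sharing Network is used as an illustrative example. This study will provide a reference for the construction of scientific data sharing platforms in China.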

20.
This paper presents a Bayesian model based clustering approach for dichotomous item responses that deals with issues often encountered in model based clustering like missing data, large data sets and within cluster dependencies. The approach proposed will be illustrated using an example concerning Brand Strategy Research.

