
aju · 10-04-2017, 09:39 PM

Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with results competitive with conventional algorithms, these techniques unfortunately generate the final data partition from incomplete information. The underlying ensemble-information matrix records only cluster-to-data-point relations, with many entries left unknown. The article presents an analysis suggesting that this problem degrades the quality of the clustering result, and it proposes a new link-based approach that improves the conventional matrix by discovering the unknown entries through the similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Then, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
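To make the matrix refinement described above more concrete, here is a minimal Python sketch. It is not the algorithm from the article: the similarity measure below is a simplified stand-in (clusters count as similar when they overlap with the same neighbouring clusters), and the names `build_clusters`, `link_similarity`, `refine_matrix` and the `decay` parameter are all hypothetical. It only illustrates the general idea of replacing the unknown (zero) entries of the cluster-by-data-point matrix with cluster-to-cluster similarities.

```python
from itertools import combinations


def build_clusters(partitions):
    """Return one (partition_index, member_set) pair per cluster in the ensemble."""
    clusters = []
    for p_idx, labels in enumerate(partitions):
        for label in sorted(set(labels)):
            members = frozenset(i for i, l in enumerate(labels) if l == label)
            clusters.append((p_idx, members))
    return clusters


def jaccard(a, b):
    """Overlap of two member sets, in [0, 1]."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def link_similarity(clusters):
    """Simplified link-based similarity (an assumption, not the article's measure):
    two clusters are similar when they are both linked to the same third cluster."""
    n = len(clusters)
    # Edge weight between two clusters = Jaccard overlap of their members.
    w = [[jaccard(clusters[a][1], clusters[b][1]) if a != b else 0.0
          for b in range(n)] for a in range(n)]
    raw = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        score = sum(min(w[a][k], w[b][k]) for k in range(n) if k not in (a, b))
        raw[a][b] = raw[b][a] = score
    top = max(max(row) for row in raw) or 1.0  # normalise into [0, 1]
    return [[raw[a][b] / top for b in range(n)] for a in range(n)]


def refine_matrix(partitions, decay=0.9):
    """Build the data-point-by-cluster matrix and fill its unknown entries."""
    clusters = build_clusters(partitions)
    sim = link_similarity(clusters)
    matrix = []
    for i in range(len(partitions[0])):
        row = []
        for c, (p_idx, members) in enumerate(clusters):
            if i in members:
                row.append(1.0)  # known entry: point i belongs to cluster c
            else:
                # Cluster that point i belongs to in the same base clustering.
                own = next(k for k, (q, m) in enumerate(clusters)
                           if q == p_idx and i in m)
                row.append(decay * sim[c][own])  # formerly unknown entry
        matrix.append(row)
    return matrix
```

For example, `refine_matrix([[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]])` takes three base clusterings of four points and returns a 4 x 6 matrix in which membership entries stay at 1.0 and the other entries hold a similarity below 1. The rows and columns of such a matrix could then serve as the two vertex sets of the weighted bipartite graph that is partitioned in the final step.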

Data mining is the practice of automatically searching large data stores to discover patterns and trends that go beyond simple analysis. Data mining models (predictive and descriptive) are built using the following main tasks: classification, regression, clustering, summarization, dependency modelling, and change and deviation detection. Clustering groups the elements of a data set according to their similarity, so that the elements within each cluster are similar to one another, whereas elements of different clusters are dissimilar. It is used to analyse multivariate data, for example to characterise customer groups based on purchasing patterns, classify web documents, group genes and proteins that have similar functionality, or group spatial locations prone to earthquakes based on seismological data.

A cluster ensemble integrates the results of several clustering algorithms using a consensus function to obtain stable results. The idea of combining different clustering results (cluster ensemble or cluster aggregation) emerged as an alternative approach to improve the quality of clustering results. In this work we have designed and implemented a cluster ensemble approach that uses the divide-and-conquer technique to handle mixed data sets, i.e. those with both numerical and categorical attributes. The initial data set is first divided into numerical and categorical sub-sets. Next, clustering algorithms designed for numerical and categorical data are applied to produce the corresponding clusters. Finally, the clustering results from the previous step are combined as a categorical data set, on which the same categorical clustering algorithm, or any other one, can be used to produce the final output clusters.
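The divide-and-conquer steps above can be sketched roughly as follows. The post does not name the algorithms used, so this sketch assumes a tiny k-means for the numerical part and a tiny k-modes for the categorical part. All function names and the seeding rule (the first k distinct rows) are hypothetical choices made to keep the example short and deterministic.

```python
def split_columns(rows):
    """Divide each record into its numerical part and its categorical part."""
    numeric = [[v for v in r if isinstance(v, (int, float))] for r in rows]
    categorical = [[v for v in r if isinstance(v, str)] for r in rows]
    return numeric, categorical


def first_distinct(rows, k):
    """Deterministic seeding (an assumption): use the first k distinct rows."""
    seeds = []
    for r in rows:
        if list(r) not in seeds:
            seeds.append(list(r))
        if len(seeds) == k:
            break
    return seeds


def cluster(rows, k, distance, centre, iterations=10):
    """Assign-then-update loop shared by the k-means and k-modes variants."""
    centres = first_distinct(rows, k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(len(centres)), key=lambda c: distance(r, centres[c]))
                  for r in rows]
        for c in range(len(centres)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centres[c] = centre(members)
    return labels


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(members):
    return [sum(col) / len(col) for col in zip(*members)]


def mismatches(a, b):
    """Simple matching distance: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def mode(members):
    """Most frequent value per attribute, ties broken by sorted order."""
    return [max(sorted(set(col)), key=col.count) for col in zip(*members)]


def cluster_mixed(rows, k):
    numeric, categorical = split_columns(rows)                # divide
    num_labels = cluster(numeric, k, sq_euclidean, mean)      # numerical clusters
    cat_labels = cluster(categorical, k, mismatches, mode)    # categorical clusters
    # Combine both label vectors as a new, purely categorical data set.
    combined = [[f"n{a}", f"c{b}"] for a, b in zip(num_labels, cat_labels)]
    return cluster(combined, k, mismatches, mode)             # final clusters
```

A possible call is `cluster_mixed([[1.0, 1.2, "red", "small"], [0.9, 1.0, "red", "small"], [8.0, 8.5, "blue", "large"], [8.2, 7.9, "blue", "large"]], 2)`, which puts the first two records in one cluster and the last two in the other. Treating the intermediate labels as categorical attributes is what lets the final step reuse the same categorical algorithm, as the post describes.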

ambereesh · 10-04-2017, 09:39 PM

Abstract

Although attempts have been made to solve the problem of clustering categorical data via cluster ensembles, with the results being competitive with conventional algorithms, it is observed that these techniques unfortunately generate a final data partition based on incomplete information. The underlying ensemble-information matrix presents only cluster-data point relations, with many entries being left unknown. The paper presents an analysis that suggests this problem degrades the quality of the clustering result, and it presents a new link-based approach, which improves the conventional matrix by discovering unknown entries through similarity between clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph that is formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster