Introduction to Data Clustering

University of Genova, Italy

Data clustering aims to find a structure that aggregates a set of unlabeled data into groups such that data belonging to a group (or cluster) are more similar to the other data in that cluster than to data in other clusters. A more formal definition of clustering is difficult to state, as the clustering task is not a well-posed problem. Given a data set, an operational definition of a clustering procedure requires that the following elements be specified: (a) a data representation that transforms the instances of the data set into vectors of characteristics; (b) a similarity measure for comparing the instances of the data set; (c) a clustering algorithm. We will present the principal categories of clustering algorithms, including the hierarchical algorithms, which try to find data structures that can be recursively divided into substructures, and the partitive algorithms, which aim to find a single partition of the data.
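
As a rough illustration of elements (a), (b) and (c), the following minimal Python sketch clusters six made-up two-dimensional points. Everything in it is an assumption chosen for illustration and not taken from the text: the toy data, the use of Euclidean distance as the (dis)similarity measure, single-linkage agglomeration as an example of a hierarchical algorithm, k-means as an example of a partitive algorithm, and the function names distance, single_linkage and k_means.

```python
import math

# (a) Data representation: each instance is a vector of numeric features.
# These six 2-D points are made-up illustration data.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]


# (b) Similarity measure: Euclidean distance (smaller means more similar).
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (c) Clustering algorithm, hierarchical example: single-linkage agglomeration.
# Start from singletons and repeatedly merge the two closest clusters until
# only n_clusters remain. Clusters are lists of indices into data.
def single_linkage(data, n_clusters):
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(data[p], data[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# (c) Clustering algorithm, partitive example: k-means with caller-supplied
# initial centroids. Returns one cluster label per instance.
def k_means(data, initial_centroids, n_iter=10):
    centroids = list(initial_centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: distance(p, centroids[c]))
                  for p in data]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels


print(single_linkage(points, 3))                # [[0, 2], [1], [3, 4, 5]]
print(single_linkage(points, 2))                # [[0, 1, 2], [3, 4, 5]]
print(k_means(points, [points[0], points[3]]))  # [0, 0, 0, 1, 1, 1]
```

On this toy data the three-cluster result is a refinement of the two-cluster result, which is the nested structure a hierarchical method exposes, whereas k-means returns a single flat partition. Changing the representation, the distance, or the initial centroids can change the outcome, which is one practical face of the problem not being well posed.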