Implement three unsupervised clustering algorithms on categorical datasets.

kmodes(K = 1, datafile = NULL, n_init = 1, algorithm = "KMODES_HUANG",
init_method = "KMODES_INIT_RANDOM_SEEDS", seed = 1, shuffle = FALSE)

Arguments

K

Number of clusters. Default is 1.

datafile

Path to a data file.

n_init

Number of initializations. Default is 1.

algorithm

Clustering algorithm to use. Default is "KMODES_HUANG". See details for the options available.

init_method

Initialization method. Default is "KMODES_INIT_RANDOM_SEEDS". See details for the options available.

seed

Random number seed. Default is 1.

shuffle

Indicates whether to shuffle the input order. Default is FALSE.

Value

Returns a list of clustering results.

Details

Algorithms available:

  • "KMODES_HUANG": MacQueen's algorithm

  • "KMODES_HARTIGAN_WONG": Hartigan and Wong algorithm

  • "KMODES_LLOYD": Lloyd's algorithm
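As a rough illustration of what a Lloyd-style k-modes pass does, the base R sketch below alternates a batch assignment step (simple matching distance to each mode) with a mode update step. It is not the package's implementation; the helper names `hamming_to_mode`, `column_modes`, and `lloyd_kmodes_sketch`, the toy data, and the tie-breaking rules are all assumptions made for this example.

```r
# Illustrative sketch only; these helpers are hypothetical, not part of the package.

# Simple matching (Hamming) distance from every row of x to one mode vector.
hamming_to_mode <- function(x, mode) {
  rowSums(x != matrix(mode, nrow(x), ncol(x), byrow = TRUE))
}

# Most frequent category in each column (ties go to the smallest category).
column_modes <- function(x) {
  apply(x, 2, function(col) {
    tab <- table(col)
    as.integer(names(tab)[which.max(tab)])
  })
}

# Lloyd-style iteration: assign all observations, then update all modes.
lloyd_kmodes_sketch <- function(x, modes, max_iter = 100) {
  K <- nrow(modes)
  dist_to_modes <- function(m) {
    matrix(vapply(seq_len(K), function(k) hamming_to_mode(x, m[k, ]),
                  numeric(nrow(x))), nrow = nrow(x))
  }
  id <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    new_id <- apply(dist_to_modes(modes), 1, which.min)
    if (all(new_id == id)) break          # assignments stable: converged
    id <- new_id
    for (k in seq_len(K)) {
      members <- x[id == k, , drop = FALSE]
      if (nrow(members) > 0) modes[k, ] <- column_modes(members)
    }
  }
  d <- dist_to_modes(modes)
  list(cluster_id = id, modes = modes,
       criterion = sum(d[cbind(seq_len(nrow(x)), id)]))
}

# Toy integer-coded categorical data (made up for this example).
x <- rbind(c(1L, 1L, 1L), c(1L, 1L, 2L), c(1L, 2L, 1L),
           c(3L, 3L, 3L), c(3L, 3L, 2L), c(3L, 2L, 3L))
res <- lloyd_kmodes_sketch(x, modes = x[c(1, 4), , drop = FALSE])
```

The MacQueen-style and Hartigan and Wong variants differ mainly in when modes are updated and how a move between clusters is evaluated; the sketch above does not cover them.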

Initialization methods available:

  • "KMODES_INIT_RANDOM_SEEDS": Random sampling.

  • "KMODES_INIT_H97_RANDOM": Huang1997, randomized version.

  • "KMODES_INIT_HD17": Huang1997, as interpreted by de Vos, author of the Python implementation.

  • "KMODES_INIT_CLB09_RANDOM": Cao2009, randomized version.

  • "KMODES_INIT_AV07": K-means++ adapted.

  • "KMODES_INIT_AV07_GREEDY": K-means++ greedy adapted.
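To make two of these ideas concrete, here is a minimal base R sketch of random seeding and of a k-means++ style seeding adapted to categorical data. It assumes the adaptation samples each new seed with probability proportional to its simple matching distance to the nearest seed already chosen; the package may weight or select differently. The function names and toy data are hypothetical.

```r
# Illustrative sketch only; not the package's initialization code.

# Random seeds: K observations drawn without replacement as initial modes.
init_random_seeds <- function(x, K) {
  x[sample.int(nrow(x), K), , drop = FALSE]
}

# K-means++ style seeding with simple matching distance (assumed adaptation).
init_kmeanspp_hamming <- function(x, K) {
  n <- nrow(x)
  chosen <- sample.int(n, 1)
  # Distance from each observation to its nearest chosen seed so far.
  nearest <- rowSums(x != x[rep(chosen, n), , drop = FALSE])
  while (length(chosen) < K) {
    if (all(nearest == 0)) stop("fewer than K distinct observations")
    # Observations far from every current seed are more likely to be picked.
    nxt <- sample.int(n, 1, prob = nearest)
    chosen <- c(chosen, nxt)
    nearest <- pmin(nearest, rowSums(x != x[rep(nxt, n), , drop = FALSE]))
  }
  x[chosen, , drop = FALSE]
}

# Toy integer-coded categorical data (made up for this example).
x <- rbind(c(1L, 1L, 1L), c(1L, 1L, 2L), c(1L, 2L, 1L),
           c(3L, 3L, 3L), c(3L, 3L, 2L), c(3L, 2L, 3L))
set.seed(1)
seeds_random <- init_random_seeds(x, 2)
seeds_pp <- init_kmeanspp_hamming(x, 3)
```

Because an observation identical to an existing seed has distance zero, it can never be drawn again, so the seeds returned by the second function are always distinct.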

Value:

  • "best_cluster_size": Number of observations in each cluster of the best initialization.

  • "best_criterion": Optimized criterion in each cluster of the best initialization.

  • "best_cluster_id": Cluster assignment of each observation of the best initialization.

  • "best_modes": Estimated modes for each cluster of the best initialization.

  • "best_seed_index": Seed index of the best initialization.

  • "total_best_criterion": Total optimized criterion of the best initialization.

  • "clsuter_size": Number of clusters.

  • "data_dim": Dimension of input data.

  • "data": The input data.
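The relationship between these fields can be sketched in base R. The example below uses a hand-made list in place of a real result, and it assumes that cluster ids are 1-based and that the criterion is the number of mismatches between each observation and the mode of its cluster; neither assumption is confirmed by this page.

```r
# Illustrative sketch only: `res` is a hand-made stand-in, not kmodes() output.
x <- rbind(c(1L, 1L, 1L), c(1L, 1L, 2L), c(1L, 2L, 1L),
           c(3L, 3L, 3L), c(3L, 3L, 2L), c(3L, 2L, 3L))
res <- list(
  best_cluster_id = c(1L, 1L, 1L, 2L, 2L, 2L),      # assumed 1-based ids
  best_modes      = rbind(c(1L, 1L, 1L), c(3L, 3L, 3L))
)
K <- nrow(res$best_modes)

# Observations per cluster (what "best_cluster_size" describes).
sizes <- tabulate(res$best_cluster_id, nbins = K)

# Mismatches between each observation and the mode of its own cluster.
mismatches <- rowSums(x != res$best_modes[res$best_cluster_id, , drop = FALSE])

# Per-cluster and total criterion under the assumed mismatch-count objective.
per_cluster <- as.vector(tapply(mismatches,
                                factor(res$best_cluster_id, levels = seq_len(K)),
                                sum))
total <- sum(per_cluster)
```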

References

  • Lloyd S (1982). “Least squares quantization in PCM.” Information Theory, IEEE Transactions on, 28(2), 129-137.

  • MacQueen J (1967). “Some methods for classification and analysis of multivariate observations.” In Cam LML, Neyman J (eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, 281-297.

  • Huang Z (1998). “Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values.” Data Min. Knowl. Discov., 2, 283-304.

  • Huang Z (1997). “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.” Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 28, 1-8.

  • Hartigan JA (1975). Clustering Algorithms. John Wiley & Sons.

Examples

# Clustering with three initializations with default algorithm ("KMODES_HUANG")
datFile <- system.file("extdata", "zoo.int.data", package = "CClust")
res_kmodes <- kmodes(K = 5, datafile = datFile, n_init = 3, shuffle = TRUE)

# Clustering with Hartigan and Wong and K-means++ greedy adapted initialization method.
res_kmodes <- kmodes(K = 5, datafile = datFile, algorithm = "KMODES_HARTIGAN_WONG", init_method = "KMODES_INIT_AV07_GREEDY")