Clustering NGS datasets with quality scores.

Implements three unsupervised clustering algorithms (Lloyd's, MacQueen's, and Hartigan and Wong's) for NGS datasets with quality scores.

khaplotype(K = 1, datafile = NULL, n_init = 1, algorithm = "FASTQ_HW_EFFICIENT",
  seed = 0, shuffle = FALSE)

Arguments

K	Number of clusters. Default is 1.
datafile	Path to a data file. Must be a FASTQ file to cluster amplicon data. Default is NULL.
n_init	Number of initializations. Default is 1.
algorithm	Algorithm used for clustering. Default is "FASTQ_HW_EFFICIENT". See Details for the available options.
seed	Random number seed. Default is 0.
shuffle	Indicates whether to shuffle the input order. Default is FALSE.

Value

Returns a list of clustering results. See Details for its elements.

Details

Algorithms available:

"FASTQ_LLOYDS_EFFICIENT": Efficient Lloyd's algorithm
"FASTQ_HW_EFFICIENT": Efficient Hartigan and Wong algorithm
"FASTQ_MACQUEEN": MacQueen's algorithm
"FASTQ_LLOYDS": Lloyd's algorithm
"FASTQ_HW": Hartigan and Wong algorithm

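The options above can be compared on the same data by running each one with the same seed and number of initializations. The following is a minimal sketch, assuming the CClust package and its bundled sim.fastq file are installed and that "total_best_criterion" is a single number; in the example output on this page the reported best optimum is the largest log likelihood across initializations, so larger values are treated as better here.

```r
# Algorithm names taken from the list above (efficient variants plus MacQueen's).
algorithms <- c("FASTQ_LLOYDS_EFFICIENT", "FASTQ_HW_EFFICIENT", "FASTQ_MACQUEEN")

# Only run the comparison if CClust is available (assumption: it exports
# khaplotype() and ships extdata/sim.fastq, as the Examples section suggests).
if (requireNamespace("CClust", quietly = TRUE)) {
  datFile <- system.file("extdata", "sim.fastq", package = "CClust")

  # Fit each algorithm with identical K, n_init, and seed, and keep the
  # total criterion of its best initialization.
  crit <- sapply(algorithms, function(alg) {
    fit <- CClust::khaplotype(K = 5, datafile = datFile, n_init = 3,
                              algorithm = alg, seed = 0)
    fit$total_best_criterion
  })

  # Largest (least negative) log likelihood first.
  print(sort(crit, decreasing = TRUE))
}
```
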
Elements of the returned list:

"best_cluster_size": Number of observations in each cluster of the best initialization.
"best_criterion": Optimized criterion in each cluster of the best initialization.
"best_cluster_id": Cluster assignment of each observation of the best initialization.
"best_modes": Estimated modes for each cluster of the best initialization.
"total_best_criterion": Total optimized criterion of the best initialization.
"cluster_size": Number of clusters.
"data_dim": Dimension of input data.
"data": Reads of the input data.

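The returned list can be inspected with ordinary list access. The sketch below uses a hypothetical helper, summarize_khap(), which is not part of CClust; it assumes "best_cluster_id" is a vector with one entry per read and "total_best_criterion" is a single number, neither of which is stated explicitly above.

```r
# Hypothetical helper (not part of CClust): summarise a khaplotype() result
# using only list elements documented on this page.
summarize_khap <- function(res) {
  ids <- res$best_cluster_id
  list(
    n_reads = length(ids),                    # one assignment per read (assumed)
    reads_per_cluster = table(ids),           # cluster sizes recomputed from the ids
    total_criterion = res$total_best_criterion
  )
}

# Usage, assuming CClust and its bundled sim.fastq are installed.
if (requireNamespace("CClust", quietly = TRUE)) {
  datFile <- system.file("extdata", "sim.fastq", package = "CClust")
  res_khap <- CClust::khaplotype(K = 5, datafile = datFile, n_init = 3)
  print(summarize_khap(res_khap))
}
```

The recomputed table should agree with "best_cluster_size" if the assumptions about the list elements hold.
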
References

Lloyd S (1982). “Least squares quantization in PCM.” IEEE Transactions on Information Theory, 28(2), 129-137.
MacQueen J (1967). “Some methods for classification and analysis of multivariate observations.” In Cam LML, Neyman J (eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, 281-297.
Huang Z (1998). “Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values.” Data Min. Knowl. Discov., 2, 283-304.
Huang Z (1997). “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining.” Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 28, 1-8.
Hartigan JA (1975). Clustering Algorithms. John Wiley & Sons.

Examples

# Cluster an amplicon dataset, running three initializations with the default
# algorithm ("FASTQ_HW_EFFICIENT")
datFile <- system.file("extdata", "sim.fastq", package = "CClust")
res_khap <- khaplotype(K = 5, datafile = datFile, n_init = 3)
#> Minimum quality score: ( (40)
#> Maximum quality score: G (71)
#> Minimum read length: 251
#> Maximum read length: 251
#> Time cost: 1.117819 secs
#> Log likelihood in 1th initialization: -108539.22 (5 iterations: 2979 4939)
#> Time cost: 0.519498 secs
#> Log likelihood in 2th initialization: -108624.68 (2 iterations: 5074 1)
#> Time cost: 0.671407 secs
#> Log likelihood in 3th initialization: -103563.00 (3 iterations: 2894 577)
#> Time cost is: 2.308838 secs
#> Best optimum is: -103562.999033

# Cluster an amplicon dataset, running three initializations with
# MacQueen's algorithm and shuffling the input order
res_khap <- khaplotype(K = 5, datafile = datFile, n_init = 3,
  algorithm = "FASTQ_MACQUEEN", shuffle = TRUE)
#> Minimum quality score: ( (40)
#> Maximum quality score: G (71)
#> Minimum read length: 251
#> Maximum read length: 251
#> Time cost: 0.060295 secs
#> Log likelihood in 1th initialization: -229887.38 (1 iterations: 3 4401)
#> Time cost: 0.059978 secs
#> Log likelihood in 2th initialization: -202943.53 (3 iterations: 2957 4967)
#> Time cost: 0.084000 secs
#> Log likelihood in 3th initialization: -198153.01 (1 iterations: 579 2948)
#> Time cost is: 0.205126 secs
#> Best optimum is: -198153.011862

# Cluster an amplicon dataset with a different seed
res_khap <- khaplotype(K = 5, datafile = datFile, seed = 1)
#> Minimum quality score: ( (40)
#> Maximum quality score: G (71)
#> Minimum read length: 251
#> Maximum read length: 251
#> Time cost: 0.538202 secs
#> Log likelihood in 1th initialization: -136735.57 (3 iterations: 81 155)
#> Time cost is: 0.538230 secs
#> Best optimum is: -136735.565422
