Una introducción a los modelos de Clústering empleando R

Alonso C., Julio César; Hoyos B., Cristian Camilo; Largo L., María Fernanda

dc.audience	Todo Público
dc.contributor.author	Alonso C., Julio César
dc.contributor.author	Hoyos B., Cristian Camilo
dc.contributor.author	Largo L., María Fernanda
dc.coverage.spatial	Cali de Lat: 03 24 00 N degrees minutes Lat: 3.4000 decimal degrees Long: 076 30 00 W degrees minutes Long: -76.5000 decimal degrees.
dc.date.accessioned	2025-04-04T19:08:41Z
dc.date.available	2025-04-04T19:08:41Z
dc.date.issued	2025-03-01
dc.description.abstract	Este libro es una introducción clara y accesible a los modelos estadísticos y de aprendizaje de máquina aplicados al clustering. Dirigido a quienes inician su formación como científicos de datos, ofrece una guía esencial para entender cómo identificar y agrupar elementos similares, manteniendo la mayor diferencia posible entre los grupos, también llamados conglomerados, clases o clústeres. Estas técnicas no solo pueden responder preguntas de negocio por sí solas, sino que también son clave en la exploración de datos antes de desarrollar modelos complejos y probar hipótesis. A lo largo del libro, descubrirás cómo construir y analizar estos grupos con aplicaciones prácticas y enfoques fundamentales.	spa
dc.description.abstract	This book is a clear and accessible introduction to statistical and machine learning models applied to clustering. Aimed at those beginning their training as data scientists, it offers an essential guide to understanding how to identify and group similar elements while keeping the groups, also called conglomerates, classes, or clusters, as different from one another as possible. These techniques can answer business questions on their own, and they are also key to exploring data before developing complex models and testing hypotheses. Throughout the book, you will discover how to build and analyze these groups with practical applications and fundamental approaches.	eng
dc.description.tableofcontents	Prefacio -- I Conceptos fundamentales -- 1 Introducción -- Objetivos del capítulo -- 1.1 Introducción -- 1.2 Comentarios Finales -- 2 Generalidades y métricas en el clústering -- Objetivos del capítulo -- 2.1 La intuición detrás de los algoritmos de clústering -- 2.2 Medidas de similitud -- 2.3 Algoritmos para la formación de clústeres -- 2.4 Criterio para determinar el número de clústeres -- 2.5 Comentarios finales -- 2.6 Anexo: Índices de validación -- II Algoritmos jerárquicos -- 3 Clústering Jerárquico -- Objetivos del capítulo -- 3.1 Introducción -- 3.2 La intuición de los métodos aglomerativos -- 3.3 Métodos de aglomeración -- 3.4 La intuición de los métodos de división -- 3.5 Comentarios finales -- 4 Implementando los algoritmos de clústering jerárquico en R -- Objetivos del capítulo -- 4.1 Introducción -- 4.2 Los datos y la pregunta de negocio -- 4.3 Exploración y preparación de los datos -- 4.4 Construcción de clústeres jerárquicos aglomerativos y dendrograma -- 4.5 Escogiendo el número óptimo de clústeres -- 4.6 Construcción de clústeres jerárquicos de división -- 4.7 Visualización y análisis de resultados -- 4.8 Comentarios finales -- 4.9 Anexos -- III Algoritmos basados en centroides -- 5 Modelo k-means -- Objetivos del capítulo -- 5.1 Introducción -- 5.2 La intuición -- 5.3 Detalle técnico del algoritmo k-means -- 5.4 k-means en R -- 5.5 k-means++ -- 5.6 Comentarios Finales -- 6 Modelos PAM y CLARA -- Objetivos del capítulo -- 6.1 Introducción -- 6.2 El modelo PAM -- 6.3 El modelo CLARA -- 6.4 Implementación de PAM y CLARA en R -- 6.5 Calculando siluetas individuales y membrecías para los algoritmos PAM y CLARA -- 6.6 Comentarios Finales -- IV Algoritmos basados en densidad -- 7 Modelo DBSCAN -- Objetivos del capítulo -- 7.1 Introducción -- 7.2 El algoritmo DBSCAN -- 7.3 Implementación del algoritmo DBSCAN en R -- 7.4 Otro ejemplo -- 7.5 Comentarios finales -- V Algoritmos basados en distribuciones -- 8 Modelo GMM -- Objetivos del capítulo -- 8.1 Introducción -- 8.2 Formalmente -- 8.3 Implementación del modelo GMM -- 8.4 Otro ejemplo -- 8.5 Comentarios finales -- VI Algoritmos combinados -- 9 Modelo FANNY -- Objetivos del capítulo -- 9.1 Introducción -- 9.2 El algoritmo FANNY -- 9.3 Implementación de FANNY en R -- 9.4 Comentarios finales -- VII Referencias -- Referencias -- Índice alfabético -- Índice de figuras
dc.format.extent	195 páginas
dc.format.mimetype	application/pdf
dc.identifier.doi	https://doi.org/10.18046/EUI/bda.h.6
dc.identifier.isbn	978-628-7740-99-0 (eBook)
dc.identifier.uri	https://hdl.handle.net/10906/130243
dc.language.iso	spa
dc.publisher	Universidad Icesi
dc.publisher.place	Santiago de Cali
dc.relation.ispartof	Colección: Herramientas del Big Data y Analytics
dc.relation.references	Alonso, J. C. (2002). A new accelerator for the EM algorithm. Master’s thesis, Iowa State University.	spa
dc.relation.references	Alonso, J. C. (2022). Empezando a transformar bases de datos con R y dplyr. Universidad Icesi.	spa
dc.relation.references	Alonso, J. C. (2024). Introducción al Modelo Clásico de Regresión para Científico de Datos en R. Universidad Icesi.	spa
dc.relation.references	Alonso, J. C. y Arboleda, A. M. (2025). Introducción al Análisis de Canastas de Compra para analytics translators y científicos de datos (empleando R). Universidad Icesi.	spa
dc.relation.references	Alonso, J. C. y Hoyos, C. C. (2025a). Una introducción a los modelos de Clasificación empleando R. Universidad Icesi.	spa
dc.relation.references	Alonso, J. C. y Hoyos, C. C. (2025b). Una introducción a los modelos de clasificación empleando R. Universidad Icesi.	spa
dc.relation.references	Alonso, J. C. y Largo, M. F. (2023). Empezando a visualizar datos con R y ggplot2. Universidad Icesi, 2. edition.	spa
dc.relation.references	Alonso, J. C. y Ocampo, M. P. (2022). Empezando a usaR: Una guía paso a paso. Universidad Icesi.	spa
dc.relation.references	Alonso C., J. C. (2020). Herramientas del Business Analytics en R: Análisis de Componentes Principales para resumir variables. Economics Lecture Notes, (10):1–32.	spa
dc.relation.references	Arthur, D., Vassilvitskii, S., et al. (2007). k-means++: The advantages of careful seeding. In SODA, volume 7, pages 1027–1035.	spa
dc.relation.references	Baker, F. B. y Hubert, L. J. (1976). A graph-theoretic approach to goodness-of-fit in complete-link hierarchical clustering. Journal of the American Statistical Association, 71(356):870–878.	spa
dc.relation.references	Ball, G. H. y Hall, D. J. (1965). ISODATA, a novel method of data analysis and pattern classification. Technical report, Stanford Research Institute, Menlo Park, CA.	spa
dc.relation.references	Beale, E. (1969). Euclidean cluster analysis. Scientific Control Systems Limited.	spa
dc.relation.references	Bezdek, J. C. y Pal, N. R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3):301–315.	spa
dc.relation.references	Brock, G., Pihur, V., Datta, S., y Datta, S. (2008). clValid: An R package for cluster validation. Journal of Statistical Software, 25(4):1–22.	spa
dc.relation.references	Caliński, T. y Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods, 3(1):1–27.	spa
dc.relation.references	Charrad, M., Ghazzali, N., Boiteau, V., y Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6):1–36.	spa
dc.relation.references	Chen, Y., Ruys, W., y Biros, G. (2020). KNN-DBSCAN: a DBSCAN in high dimensions. arXiv preprint arXiv:2009.04552.	spa
dc.relation.references	Davies, D. L. y Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):224–227.	spa
dc.relation.references	de Vries, A. y Ripley, B. D. (2024). ggdendro: Create Dendrograms and Tree Diagrams Using ’ggplot2’. R package version 0.2.0.	spa
dc.relation.references	Dimitriadou, E., Dolničar, S., y Weingessel, A. (2002). An examination of indexes for determining the number of clusters in binary data sets. Psychometrika, 67(1):137–159.	spa
dc.relation.references	Duda, R. O., Hart, P. E., et al. (1973). Pattern classification and scene analysis, volume 3. Wiley New York.	spa
dc.relation.references	Dunn, J. C. (1974). Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4(1):95–104.	spa
dc.relation.references	Edwards, A. W. y Cavalli-Sforza, L. L. (1965). A method for cluster analysis. Biometrics, pages 362–375.	spa
dc.relation.references	Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.	spa
dc.relation.references	Fox, J. y Weisberg, S. (2019). An R Companion to Applied Regression. Sage, Thousand Oaks CA, third edition.	spa
dc.relation.references	Frey, T. y Van Groenewoud, H. (1972). A cluster analysis of the D2 matrix of white spruce stands in Saskatchewan based on the maximum-minimum principle. The Journal of Ecology, pages 873–886.	spa
dc.relation.references	Friedman, H. P. y Rubin, J. (1967). On some invariant criteria for grouping data. Journal of the American Statistical Association, 62(320):1159–1178.	spa
dc.relation.references	Fukunaga, K. y Koontz, W. L. (1970). A criterion and an algorithm for grouping data. IEEE Transactions on Computers, 100(10):917–923.	spa
dc.relation.references	Godichon-Baggioni, A. y Surendran, S. (2023). Kmedians: K-Medians. R package version 2.2.0.	spa
dc.relation.references	Gordon, A. (1999). Cluster description. University of St. Andrews Scotland.	spa
dc.relation.references	Hahsler, M., Piekenbrock, M., y Doran, D. (2019). dbscan: Fast density-based clustering with R. Journal of Statistical Software, 91(1):1–30.	spa
dc.relation.references	Halkidi, M., Batistakis, Y., y Vazirgiannis, M. (2002). Cluster validity methods. In SIGMOD, volume 31, pages 40–45.	spa
dc.relation.references	Halkidi, M. y Vazirgiannis, M. (2001). Clustering validity assessment: Finding the optimal partitioning of a data set. In Proceedings 2001 IEEE International Conference on Data Mining, pages 187–194. IEEE.	spa
dc.relation.references	Hartigan, J. A. (1975). Clustering algorithms. John Wiley & Sons, Inc.	spa
dc.relation.references	Hill, R. S. (1980). A stopping rule for partitioning dendrograms. Botanical Gazette, 141(3):321–324.	spa
dc.relation.references	Hubert, L. y Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.	spa
dc.relation.references	Hubert, L. J. y Levin, J. R. (1976). A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin, 83(6):1072.	spa
dc.relation.references	Kassambara, A. (2017). Practical guide to cluster analysis in R: Unsupervised machine learning, volume 1. STHDA.	spa
dc.relation.references	Kassambara, A. y Mundt, F. (2020). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.7.	spa
dc.relation.references	Kaufman, L. y Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.	spa
dc.relation.references	Kraemer, H. C. (2004). Biserial correlation. Encyclopedia of Statistical Sciences, 1.	spa
dc.relation.references	Lance, G. N. y Williams, W. T. (1967). Mixed-data classificatory programs I - agglomerative systems. Australian Computer Journal, 1(1):15–20.	spa
dc.relation.references	Lebart, L., Morineau, A., y Piron, M. (1995). Statistique exploratoire multidimensionnelle, volume 3. Dunod Paris.	spa
dc.relation.references	Lüdecke, D., Ben-Shachar, M. S., Patil, I., y Makowski, D. (2020). Extracting, computing and exploring the parameters of statistical models using R. Journal of Open Source Software, 5(53):2445.	spa
dc.relation.references	Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., y Hornik, K. (2022). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.4. For new features, see the ’Changelog’ file (in the package source).	spa
dc.relation.references	Marriott, F. (1971). Practical problems in a method of cluster analysis. Biometrics, pages 501–514.	spa
dc.relation.references	McClain, J. O. y Rao, V. R. (1975). Clustisz: A program to test for the quality of clustering of a set of objects. Journal of Marketing Research, pages 456–460.	spa
dc.relation.references	Milligan, G. W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45(3):325–342.	spa
dc.relation.references	Milligan, G. W. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46(2):187–199.	spa
dc.relation.references	Milligan, G. W. y Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.	spa
dc.relation.references	Murtagh, F. y Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification, 31(3):274–295.	spa
dc.relation.references	Orlóci, L. (1967). An agglomerative method for classification of plant communities. The Journal of Ecology, pages 193–206.	spa
dc.relation.references	R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.	spa
dc.relation.references	R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.	spa
dc.relation.references	Ratkowsky, D. y Lance, G. (1978). Criterion for determining the number of groups in a classification.	spa
dc.relation.references	Kaufman, L. y Rousseeuw, P. J. (1987). Clustering by means of medoids. In Proceedings of the Statistical Data Analysis Based on the L1 Norm Conference, Neuchâtel, Switzerland, volume 31.	spa
dc.relation.references	Rohlf, F. J. (1974). Methods of comparing classifications. Annual Review of Ecology and Systematics, 5(1):101–113.	spa
dc.relation.references	Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.	spa
dc.relation.references	Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen, E., Elberg, A., y Crowley, J. (2023). GGally: Extension to ’ggplot2’. R package version 2.2.0.	spa
dc.relation.references	Scott, A. J. y Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics, pages 387–397.	spa
dc.relation.references	Scrucca, L., Fraley, C., Murphy, T. B., y Raftery, A. E. (2023). Model-Based Clustering, Classification, and Density Estimation Using mclust in R. Chapman and Hall/CRC.	spa
dc.relation.references	Shah, A. (2021). Credit card customer data. Technical report.	spa
dc.relation.references	Tibshirani, R., Walther, G., y Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.	spa
dc.relation.references	Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.	spa
dc.relation.references	Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.	spa
dc.relation.references	Wickham, H., François, R., Henry, L., y Müller, K. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.8.	spa
dc.relation.references	You, K. (2023). maotai: Tools for Matrix Algebra, Optimization and Inference. R package version 0.2.5.	spa
dc.relation.references	Zadeh, L. (1965). Fuzzy sets. Information and Control, 8(3):338–353.	spa
dc.rights	EL AUTOR, expresa que la obra objeto de la presente autorización es original y la elaboró sin quebrantar ni suplantar los derechos de autor de terceros, y de tal forma, la obra es de su exclusiva autoría y tiene la titularidad sobre éste. PARÁGRAFO: en caso de queja o acción por parte de un tercero referente a los derechos de autor sobre el artículo, folleto o libro en cuestión, EL AUTOR, asumirá la responsabilidad total, y saldrá en defensa de los derechos aquí autorizados; para todos los efectos, la Universidad Icesi actúa como un tercero de buena fe. Esta autorización, permite a la Universidad Icesi, de forma indefinida, para que en los términos establecidos en la Ley 23 de 1982, la Ley 44 de 1993, leyes y jurisprudencia vigente al respecto, haga publicación de este con fines educativos. Toda persona que consulte ya sea en la biblioteca o en medio electrónico podrá copiar apartes del texto citando siempre la fuente, es decir el título del trabajo y el autor.	spa
dc.rights.accessrights	info:eu-repo/semantics/openAccess
dc.rights.coar	http://purl.org/coar/access_right/c_abf2
dc.rights.license	Attribution-NonCommercial-NoDerivatives 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.proposal	R	spa
dc.subject.proposal	Analítica	spa
dc.subject.proposal	Modelos de clústering	spa
dc.subject.proposal	Modelos de agrupamiento	spa
dc.subject.proposal	Big Data Analytics	spa
dc.subject.proposal	R	eng
dc.subject.proposal	Analytics	eng
dc.subject.proposal	Clustering Models	eng
dc.subject.proposal	Big Data Analytics	eng
dc.title	Una introducción a los modelos de Clústering empleando R
dc.title.alternative	An introduction to clustering models using R	eng
dc.type	book
dc.type.coar	http://purl.org/coar/resource_type/c_2f33
dc.type.coarversion	http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.driver	info:eu-repo/semantics/book
dc.type.local	Libro
dc.type.version	info:eu-repo/semantics/publishedVersion