Abstract Digital twins are pivotal in driving digital transformation across industries. Once digital twins are established for operational physical entities (i.e., digital twins for digital transformation, DT-DT), managing both the vast volumes of historical operational data and the continuous influx of new data becomes essential. This requirement poses challenges for digital twin data mining, including high computational cost, performance limitations, and inefficiency, and traditional algorithms struggle to handle large historical datasets and newly generated streaming data at the same time. The widely used k-means clustering algorithm requires the number of clusters to be specified in advance and selects its initial centroids at random; as the number of data points grows, its execution time and the subsequent evaluation process become significantly longer. To address these challenges, this paper proposes an adaptive k-means clustering algorithm based on grid and domain centroid weights (GDCW-AKM). The algorithm is evaluated on both real and synthetic datasets in terms of runtime and multiple clustering evaluation metrics. Experimental results demonstrate that GDCW-AKM is well suited to DT-DT. It processes large-scale datasets efficiently and supports streaming data, updating memory allocation information and saving data to disk in a single data scan. The algorithm automatically determines the optimal number of clusters k and the initial centroids. For newly arriving data, it employs an incremental update mechanism that adjusts only the affected portions, which significantly reduces memory usage. While maintaining clustering accuracy comparable to k-means, GDCW-AKM greatly improves overall efficiency, and the performance gain becomes more pronounced as data volume increases.
Furthermore, the algorithm simplifies parameter selection: the recommended settings, chosen according to the known data dimensions, are sufficient for good results, which makes it user-friendly. The method shows substantial promise for constructing DT-DT in various industries.
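
To make the single-scan and incremental-update ideas concrete, the following is a minimal sketch under stated assumptions; it is not the GDCW-AKM algorithm itself, and it does not reproduce the domain centroid weights, which the abstract does not detail. The class `GridSummary`, the parameters `cell_size` and `min_count`, and the seeding rule (a dense cell with no denser neighbouring cell becomes a seed) are hypothetical choices made for illustration. The sketch shows how per-cell counts and coordinate sums can be accumulated in one pass, how candidate initial centroids and a value of k might be read off the grid, and why a newly arriving point touches only one cell.

```python
from collections import defaultdict
from itertools import product
from math import floor


class GridSummary:
    """Per-cell counts and coordinate sums, built in one pass over the data.

    Hypothetical illustration only; not the paper's data structure.
    """

    def __init__(self, cell_size):
        self.cell_size = cell_size      # assumed uniform cell width
        self.count = defaultdict(int)   # cell index -> number of points
        self.total = {}                 # cell index -> per-dimension sums

    def add(self, point):
        """Fold one point into its cell; only that cell's statistics change."""
        cell = tuple(floor(x / self.cell_size) for x in point)
        self.count[cell] += 1
        sums = self.total.setdefault(cell, [0.0] * len(point))
        for i, x in enumerate(point):
            sums[i] += x
        return cell

    def centroid(self, cell):
        """Mean of the points that fell into the given cell."""
        n = self.count[cell]
        return tuple(s / n for s in self.total[cell])

    def seed_centroids(self, min_count):
        """Assumed seeding rule: keep cells with at least min_count points
        that no adjacent cell outranks (ties broken by cell index)."""
        seeds = []
        for cell, n in self.count.items():
            if n < min_count:
                continue
            is_peak = True
            for offset in product((-1, 0, 1), repeat=len(cell)):
                if not any(offset):
                    continue  # skip the cell itself
                nb = tuple(c + o for c, o in zip(cell, offset))
                if (self.count.get(nb, 0), nb) > (n, cell):
                    is_peak = False
                    break
            if is_peak:
                seeds.append(self.centroid(cell))
        return seeds


# Historical data: a single scan fills the grid.
grid = GridSummary(cell_size=1.0)
for p in [(0.2, 0.3), (0.4, 0.6), (0.8, 0.1),
          (10.1, 10.2), (10.5, 10.9), (10.7, 10.4), (5.5, 5.5)]:
    grid.add(p)
seeds = grid.seed_centroids(min_count=2)   # k is len(seeds)

# Streaming data: each new point updates one cell, then seeds are re-read.
for p in [(5.1, 5.2), (5.3, 5.4)]:
    grid.add(p)
updated_seeds = grid.seed_centroids(min_count=2)
```

In this sketch the resulting seeds would be passed to a standard k-means refinement as initial centroids, with k taken as the number of seeds. The grid stores only per-cell aggregates, so memory grows with the number of occupied cells and not with the number of points.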