I have implemented k-means as shown below. As far as I know, the algorithm does not guarantee k non-empty clusters when given k, but my task (https://justpaste.it/5cmte) requires me to return exactly k clusters. How should I go about that?
You are given n points.
You have to separate them into k groups (clusters).
The first line contains two numbers: n – the number of points and k – the number of groups.
The next n lines contain two numbers x and y – the coordinates of a point.
The first line of the output should contain the sum of the distances from each point to the center of its cluster.
Each of the next k lines should contain the coordinates of a cluster's center, the number of points in that cluster, and the indices of the points belonging to it (all values separated by whitespace).
1.5 3 2 1 4
3 6 1 6
7 5.66667 3 2 3 5
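The output format described above can be sketched as follows. This is only an illustration under my own assumptions: the helper names `total_distance` and `format_cluster` are hypothetical, the demo data is made up (it is not the task's sample input), and `math.dist` requires Python 3.8 or newer.

```python
import math

def total_distance(points, centers, assignments):
    """Sum of Euclidean distances from each point to its own cluster center."""
    return sum(math.dist(p, centers[a]) for p, a in zip(points, assignments))

def format_cluster(center, member_indices):
    """One output line: center coordinates, cluster size, 1-based point indices."""
    coords = ' '.join('%g' % c for c in center)
    members = ' '.join(str(i) for i in member_indices)
    return '%s %d %s' % (coords, len(member_indices), members)

# Hypothetical data, not the task's sample input.
points = [(1, 2), (2, 4), (8, 8)]
centers = [(1.5, 3.0), (8.0, 8.0)]
assignments = [0, 0, 1]  # index of the center each point belongs to

print('%g' % total_distance(points, centers, assignments))
print(format_cluster(centers[0], [1, 2]))
print(format_cluster(centers[1], [3]))
```

Each point contributes only the distance to the center it is assigned to, so the total is accumulated once per point rather than once per (point, center) pair.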
For the same input, my program sometimes produces k-1 or k-2 clusters instead of k.
So my point is this: k-means does not guarantee k clusters, but I am required to return exactly k. How should I do that?
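One workaround I have seen suggested is to re-seed any cluster that ends up empty after an assignment step. The sketch below is not part of standard k-means and not necessarily what the task expects. The names `assign` and `reseed_empty` are hypothetical, and it assumes there are at least k points. An empty cluster takes over the point that lies farthest from its current center, chosen only from clusters that hold more than one point.

```python
import math

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def reseed_empty(points, centers, assignments):
    """Move one far-away point into every empty cluster (modifies arguments in place).

    Assumes len(points) >= len(centers), so a donor cluster always exists.
    """
    k = len(centers)
    counts = [assignments.count(c) for c in range(k)]
    for c in range(k):
        if counts[c] > 0:
            continue
        # Only take points from clusters that can spare one.
        donors = [i for i, a in enumerate(assignments) if counts[a] > 1]
        far = max(donors,
                  key=lambda i: math.dist(points[i], centers[assignments[i]]))
        counts[assignments[far]] -= 1
        assignments[far] = c
        counts[c] = 1
        centers[c] = list(points[far])  # the new center sits on that point
    return centers, assignments

# Hypothetical demo: the third center is far from every point, so it starts empty.
points = [(0, 0), (1, 0), (4, 0), (10, 0)]
centers = [[1, 0], [10, 0], [50, 50]]
assignments = assign(points, centers)
centers, assignments = reseed_empty(points, centers, assignments)
print(assignments)
```

Calling `reseed_empty` after every assignment step keeps all k clusters populated; the means of the donor clusters are then recomputed in the next update step as usual.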
```python
import sys, random, math
from collections import defaultdict


def pointAvg(points):
    dimensions = len(points[0])
    newCenter = []
    for dimension in range(dimensions):
        dimSum = 0
        for p in points:
            dimSum += p[dimension]
        newCenter.append(dimSum / float(len(points)))
    return newCenter


def updateCenters(dataSet, assignments):
    newMeans = defaultdict(list)
    centers = []
    for assignment, point in zip(assignments, dataSet):
        newMeans[assignment].append(point)
    for points in newMeans.values():
        centers.append(pointAvg(points))
    return centers, newMeans


def assignPoints(dataPoints, centers):
    assignments = []
    for point in dataPoints:
        shortest = float('inf')
        shortestIndex, distances = 0, 0
        for i in range(len(centers)):
            val = distance(point, centers[i])
            distances += val
            if val < shortest:
                shortest = val
                shortestIndex = i
        assignments.append(shortestIndex)
    return (assignments, distances)


def distance(a, b):
    dimensions = len(a)
    _sum = 0
    for dimension in range(dimensions):
        differenceSq = (a[dimension] - b[dimension]) ** 2
        _sum += differenceSq
    return math.sqrt(_sum)


def generateK(dataSet, k):
    centers = []
    dimensions = len(dataSet[0])
    minMax = defaultdict(int)
    for point in dataSet:
        for i in range(dimensions):
            val = point[i]
            minKey = 'min_%d' % i
            maxKey = 'max_%d' % i
            if minKey not in minMax or val < minMax[minKey]:
                minMax[minKey] = val
            if maxKey not in minMax or val > minMax[maxKey]:
                minMax[maxKey] = val
    for _k in range(k):
        randPoint = []
        for i in range(dimensions):
            minVal = minMax['min_%d' % i]
            maxVal = minMax['max_%d' % i]
            randPoint.append(random.uniform(minVal, maxVal))
        centers.append(randPoint)
    return centers


def kMeans(dataset, k):
    kPoints = generateK(dataset, k)
    assignments = assignPoints(dataset, kPoints)[0]
    oldAssignments = None
    while assignments != oldAssignments:
        updatedCenters = updateCenters(dataset, assignments)
        newCenters = updatedCenters[0]
        means = updatedCenters[1]
        oldAssignments = assignments
        assignedPoints = assignPoints(dataset, newCenters)
        assignments = assignedPoints[0]
    sumDistances = assignedPoints[1]
    print(sumDistances)
    return (newCenters, means)


if __name__ == '__main__':
    # sys.stdin = open('input.txt', 'r')
    # sys.stdout = open('output.txt', 'w')
    i, points = 0, []
    for line in sys.stdin:
        line = line.rstrip()
        line = line.split()
        line[0], line[1] = int(line[0]), int(line[1])
        if (i == 0):
            n, k = line[0], line[1]
        else:
            points.append(line)
        i += 1
    clusters = kMeans(points, k)
    for i in range(k):
        print('%g' % clusters[0][i][0], '%g' % clusters[0][i][1], len(clusters[1][i]), end=' ')
        for j in range(len(clusters[1][i])):
            print(points.index(clusters[1][i][j]) + 1, end=' ')
        print()
```
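One likely source of the missing clusters in the code above is `generateK`: it draws centers uniformly inside the bounding box of the data, so a center can land where no point is nearest to it, and that cluster never receives any points. A possible alternative, sketched here under my own assumptions (the name `init_from_points` and the `seed` parameter are hypothetical), is to pick k distinct input points as the initial centers. Each chosen point is then at distance 0 from its own center, so every cluster is non-empty after the first assignment. A cluster can still become empty in a later iteration, so this reduces the problem rather than eliminating it.

```python
import random

def init_from_points(points, k, seed=None):
    """Pick k distinct input points as the initial centers."""
    # Deduplicate so two centers never start on the same coordinates.
    unique = sorted(set(map(tuple, points)))
    if len(unique) < k:
        raise ValueError('need at least k distinct points')
    rng = random.Random(seed)
    return [list(p) for p in rng.sample(unique, k)]

# Hypothetical demo data with one duplicate point.
demo_points = [[0, 0], [1, 1], [2, 2], [1, 1], [5, 3]]
initial_centers = init_from_points(demo_points, 3, seed=0)
print(initial_centers)
```

In the code above this would replace the call `generateK(dataset, k)` inside `kMeans`; the rest of the loop can stay as it is.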