org.apache.spark.mllib.clustering
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
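For illustration, a minimal sketch of training a model and computing this cost, assuming a SparkContext `sc` is in scope (as in spark-shell); the input path, k, and iteration count are hypothetical:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: one whitespace-separated point per line.
val data = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, k = 3, maxIterations = 20)

// Sum of squared distances of points to their nearest center;
// for a fixed k, lower cost generally means tighter clusters.
val wssse = model.computeCost(data)
println(s"Within Set Sum of Squared Errors = $wssse")
```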
Current version of model save/load format.
Total number of clusters.
Maps given points to their cluster indices.
Returns the cluster index that a given point belongs to.
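A short sketch of both predict variants, reusing the hypothetical `model` and `data` from the sketch above (the query point's dimension is assumed to match the training data):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Single point: the index of the nearest cluster center.
val index: Int = model.predict(Vectors.dense(0.5, 0.5))

// RDD of points: an RDD[Int] of cluster indices, one per input point.
val indices = model.predict(data)
indices.take(5).foreach(println)
```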
Save this model to the given path.
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
Spark context used to save model data.
Path specifying the directory in which to save this model. If the directory already exists, this method throws an exception.
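A hedged usage sketch (the path is hypothetical; as noted above, save throws if the directory already exists):

```scala
import org.apache.spark.mllib.clustering.KMeansModel

// Persist the model: JSON metadata plus Parquet data under this path.
model.save(sc, "target/tmp/KMeansModel")

// Reload later, possibly from a different application.
val sameModel = KMeansModel.load(sc, "target/tmp/KMeansModel")
```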
:: Experimental :: Export the model to a String in PMML format
:: Experimental :: Export the model to the OutputStream in PMML format
:: Experimental :: Export the model to a directory on a distributed file system in PMML format
:: Experimental :: Export the model to a local file in PMML format
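The four export variants in one sketch, reusing the hypothetical `model` from above (all paths are illustrative):

```scala
// As a PMML String (e.g. for logging or embedding).
val pmml: String = model.toPMML()

// To a local file on the driver.
model.toPMML("/tmp/kmeans.xml")

// To a distributed filesystem path via the SparkContext.
model.toPMML(sc, "hdfs:///tmp/kmeans")

// To an arbitrary OutputStream.
model.toPMML(System.out)
```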
Perform a k-means update on a batch of data.
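A minimal sketch of a single update, assuming a SparkContext `sc`; the initial centers, weights, batch values, and decay factor are all hypothetical:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical initial state: two centers with unit starting weights.
val initial = new StreamingKMeansModel(
  Array(Vectors.dense(0.0, 0.0), Vectors.dense(5.0, 5.0)),
  Array(1.0, 1.0))

// One mini-batch of new points (values are illustrative).
val batch = sc.parallelize(Seq(
  Vectors.dense(0.2, -0.1),
  Vectors.dense(4.8, 5.3)))

// One k-means iteration with decay factor a = 0.9, decaying per batch;
// returns a model with updated centers and cluster weights.
val updated = initial.update(batch, decayFactor = 0.9, timeUnit = "batches")
```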
:: Experimental ::
StreamingKMeansModel extends MLlib's KMeansModel for streaming algorithms, so it can keep track of a continuously updated weight associated with each cluster, and also update the model by doing a single iteration of the standard k-means algorithm.
The update algorithm uses the "mini-batch" KMeans rule, generalized to incorporate forgetfulness (i.e. decay). The update rule (for each cluster) is:
c_t+1 = [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t]
n_t+1 = n_t * a + m_t
Where c_t is the previously estimated centroid for that cluster, n_t is the number of points assigned to it thus far, x_t is the centroid estimated on the current batch, and m_t is the number of points assigned to that centroid in the current batch.
The decay factor 'a' scales the contribution of the clusters as estimated thus far, discounting the weight of the existing estimate when new data arrive. If a=1, all batches are weighted equally. If a=0, new centroids are determined entirely by recent data. Lower values correspond to more forgetting.
Decay can optionally be specified by a half life and an associated time unit. The time unit can either be a batch of data or a single data point. For data that arrived at time t, the half life h is defined such that at time t + h the discount applied to that data is 0.5. The definition remains the same whether the time unit is given as batches or points.
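To make the rule and the half-life definition concrete, a small plain-Scala arithmetic sketch for a single one-dimensional cluster (all numbers are hypothetical; the half-life conversion follows directly from requiring a^h = 0.5):

```scala
// Decay factor derived from a half life h: after h time units the
// discount applied to old data is exactly 0.5, since a^h = 0.5.
val h = 10.0
val a = math.exp(math.log(0.5) / h) // ~0.933

// Update rule for one cluster, in one dimension (hypothetical values).
val cT = 2.0   // c_t: previously estimated centroid
val nT = 100.0 // n_t: points assigned so far
val xT = 3.0   // x_t: centroid estimated on the current batch
val mT = 20.0  // m_t: points assigned in the current batch

val cT1 = ((cT * nT * a) + (xT * mT)) / (nT + mT) // c_t+1 ~2.06
val nT1 = nT * a + mT                             // n_t+1 ~113.3
```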