| Modifier and Type | Method and Description |
|---|---|
| scala.Tuple2<int[],double[]>[] | describeTopics(int maxTermsPerTopic): Return the topics described by weighted terms. |
| Vector | docConcentration(): Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). |
| protected java.lang.String | formatVersion(): Current version of model save/load format. |
| protected double | gammaShape(): Shape parameter for random initialization of variational parameter gamma. |
| breeze.linalg.DenseVector<java.lang.Object> | globalTopicTotals() |
| Graph<breeze.linalg.DenseVector<java.lang.Object>,java.lang.Object> | graph() |
| double[] | iterationTimes() |
| JavaRDD<scala.Tuple3<java.lang.Long,int[],int[]>> | javaTopicAssignments(): Java-friendly version of topicAssignments. |
| JavaPairRDD<java.lang.Long,Vector> | javaTopicDistributions(): Java-friendly version of topicDistributions. |
| JavaRDD<scala.Tuple3<java.lang.Long,int[],double[]>> | javaTopTopicsPerDocument(int k): Java-friendly version of topTopicsPerDocument. |
| int | k(): Number of topics. |
| static DistributedLDAModel | load(SparkContext sc, java.lang.String path) |
| double | logLikelihood(): Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs \| topics, topic distributions for docs, alpha, eta) |
| double | logPrior(): Log probability of the current parameter estimate: log P(topics, topic distributions for docs \| alpha, eta) |
| void | save(SparkContext sc, java.lang.String path): Save this model to the given path. |
| LocalLDAModel | toLocal(): Convert model to a local model. |
| scala.Tuple2<long[],double[]>[] | topDocumentsPerTopic(int maxDocumentsPerTopic): Return the top documents for each topic. |
| RDD<scala.Tuple3<java.lang.Object,int[],int[]>> | topicAssignments(): Return the top topic for each (doc, term) pair. |
| double | topicConcentration(): Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. |
| RDD<scala.Tuple2<java.lang.Object,Vector>> | topicDistributions(): For each document in the training set, return the distribution over topics for that document ("theta_doc"). |
| Matrix | topicsMatrix(): Inferred topics, where each topic is represented by a distribution over terms. |
| RDD<scala.Tuple3<java.lang.Object,int[],double[]>> | topTopicsPerDocument(int k): For each document, return the top k weighted topics for that document and their weights. |
| int | vocabSize(): Vocabulary size (number of terms or words in the vocabulary). |
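For orientation, here is a minimal, hedged Scala sketch of how a DistributedLDAModel is usually obtained and queried. The tiny in-memory corpus, the local[2] master, and the chosen k are illustrative assumptions, not part of this API; only the EM optimizer yields a DistributedLDAModel.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors

object DistributedLDAModelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lda-sketch").setMaster("local[2]"))

    // Tiny bag-of-words corpus: each document is a term-count vector over a 5-term vocabulary.
    val corpus = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 0.0, 0.0, 1.0),
      Vectors.dense(0.0, 1.0, 3.0, 1.0, 0.0),
      Vectors.dense(2.0, 0.0, 0.0, 1.0, 1.0)
    )).zipWithIndex().map(_.swap).cache()   // LDA expects (docId, termCounts) pairs

    // The EM optimizer returns a DistributedLDAModel; the online optimizer would return a LocalLDAModel.
    val model = new LDA().setK(2).setOptimizer("em").run(corpus).asInstanceOf[DistributedLDAModel]

    // describeTopics: for each topic, matching arrays of term indices and term weights.
    model.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach { case ((terms, weights), topic) =>
      println(s"topic $topic: " + terms.zip(weights).map { case (t, w) => f"$t -> $w%.3f" }.mkString(", "))
    }

    sc.stop()
  }
}
```

The later sketches assume the `sc` and `model` defined here.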
public static DistributedLDAModel load(SparkContext sc, java.lang.String path)

public Graph<breeze.linalg.DenseVector<java.lang.Object>,java.lang.Object> graph()

public breeze.linalg.DenseVector<java.lang.Object> globalTopicTotals()

public int k()
Number of topics.
Specified by: k in class LDAModel

public int vocabSize()
Vocabulary size (number of terms or words in the vocabulary).
Specified by: vocabSize in class LDAModel

public Vector docConcentration()
Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). This is the parameter to a Dirichlet distribution.
Specified by: docConcentration in class LDAModel

public double topicConcentration()
Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
Specified by: topicConcentration in class LDAModel

public double[] iterationTimes()
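As a quick illustration of the accessors above, this sketch (assuming the `model` trained in the earlier example) simply prints the fitted dimensions, the Dirichlet concentration parameters, and the recorded iteration times.

```scala
// Assumes `model` is the DistributedLDAModel from the training sketch above.
val numTopics = model.k                    // number of topics
val numTerms  = model.vocabSize            // vocabulary size
val alpha     = model.docConcentration     // prior on per-document topic distributions ("theta")
val eta       = model.topicConcentration   // symmetric prior on per-topic term distributions

println(s"k = $numTopics, vocabSize = $numTerms")
println(s"docConcentration (alpha) = $alpha")
println(s"topicConcentration (beta/eta) = $eta")
println(s"iterationTimes = ${model.iterationTimes.mkString(", ")}")
```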
protected double gammaShape()
Shape parameter for random initialization of variational parameter gamma.
Specified by: gammaShape in class LDAModel

public LocalLDAModel toLocal()
Convert model to a local model.

public Matrix topicsMatrix()
Inferred topics, where each topic is represented by a distribution over terms.
WARNING: This matrix is collected from an RDD. Beware memory usage when vocabSize, k are large.
Specified by: topicsMatrix in class LDAModel

public scala.Tuple2<int[],double[]>[] describeTopics(int maxTermsPerTopic)
Return the topics described by weighted terms.
Specified by: describeTopics in class LDAModel
Parameters: maxTermsPerTopic - Maximum number of terms to collect for each topic.

public scala.Tuple2<long[],double[]>[] topDocumentsPerTopic(int maxDocumentsPerTopic)
Return the top documents for each topic.
Parameters: maxDocumentsPerTopic - Maximum number of documents to collect for each topic.
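To make the topicsMatrix and topDocumentsPerTopic shapes concrete, here is a sketch assuming the `model` from the training example and the usual vocabSize-by-k orientation of the matrix; note the warning above, since topicsMatrix brings the whole matrix to the driver.

```scala
// Assumes `model` is the DistributedLDAModel from the training sketch above.
// topicsMatrix collects the full matrix to the driver: fine for this toy model, risky for large ones.
val topics = model.topicsMatrix            // assumed vocabSize x k: topics(term, topic) is a term weight
for (topic <- 0 until model.k) {
  val column = (0 until model.vocabSize).map(term => topics(term, topic))
  println(s"topic $topic weights: " + column.mkString(", "))
}

// topDocumentsPerTopic: per topic, matching arrays of document IDs and document weights.
model.topDocumentsPerTopic(maxDocumentsPerTopic = 2).zipWithIndex.foreach {
  case ((docIds, docWeights), topic) =>
    println(s"topic $topic top docs: " + docIds.zip(docWeights).map { case (d, w) => s"$d -> $w" }.mkString(", "))
}
```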
public RDD<scala.Tuple3<java.lang.Object,int[],int[]>> topicAssignments()
Return the top topic for each (doc, term) pair.

public JavaRDD<scala.Tuple3<java.lang.Long,int[],int[]>> javaTopicAssignments()
Java-friendly version of topicAssignments.

public double logLikelihood()
Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, alpha, eta)
Note:
- This excludes the prior; for that, use logPrior.
- Even with logPrior, this is NOT the same as the data log likelihood given the hyperparameters.

public double logPrior()
Log probability of the current parameter estimate: log P(topics, topic distributions for docs | alpha, eta)
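A small sketch of how these two quantities are typically read together, again assuming the `model` from the training example; their sum is the log joint probability of the training tokens and the parameter estimates, which, per the note above, is still not the marginal likelihood of the data given the hyperparameters.

```scala
// Assumes `model` is the DistributedLDAModel from the training sketch above.
val tokenLogLikelihood = model.logLikelihood   // log P(docs | topics, doc-topic distributions, alpha, eta)
val parameterLogPrior  = model.logPrior        // log P(topics, doc-topic distributions | alpha, eta)

println(s"logLikelihood = $tokenLogLikelihood")
println(s"logPrior      = $parameterLogPrior")
println(s"log joint of data and parameters = ${tokenLogLikelihood + parameterLogPrior}")
```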
public RDD<scala.Tuple2<java.lang.Object,Vector>> topicDistributions()
For each document in the training set, return the distribution over topics for that document ("theta_doc").

public JavaPairRDD<java.lang.Long,Vector> javaTopicDistributions()
Java-friendly version of topicDistributions.

public RDD<scala.Tuple3<java.lang.Object,int[],double[]>> topTopicsPerDocument(int k)
For each document, return the top k weighted topics for that document and their weights.
Parameters: k - (undocumented)

public JavaRDD<scala.Tuple3<java.lang.Long,int[],double[]>> javaTopTopicsPerDocument(int k)
Java-friendly version of topTopicsPerDocument.
Parameters: k - (undocumented)
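The per-document methods are illustrated below, assuming the `model` from the training example; collect() is only reasonable here because the toy training set is tiny.

```scala
// Assumes `model` is the DistributedLDAModel from the training sketch above.
// Full topic mixture ("theta_doc") for every training document.
model.topicDistributions.collect().foreach { case (docId, theta) =>
  println(s"doc $docId: $theta")
}

// Only the top 2 topics per document, as matching arrays of topic indices and weights.
model.topTopicsPerDocument(2).collect().foreach { case (docId, topicIndices, topicWeights) =>
  println(s"doc $docId: " + topicIndices.zip(topicWeights).map { case (t, w) => s"$t -> $w" }.mkString(", "))
}
```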
protected java.lang.String formatVersion()
Current version of model save/load format.
Specified by: formatVersion in interface Saveable

public void save(SparkContext sc, java.lang.String path)
Save this model to the given path.
Parameters: sc - (undocumented), path - (undocumented)
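Finally, a sketch of the persistence round trip and the conversion to a local model, assuming the `sc` and `model` from the training example; the save path is purely illustrative.

```scala
// Assumes `sc` and `model` come from the training sketch above; the path is illustrative only.
val path = "/tmp/distributedLDAModel"
model.save(sc, path)

// load restores the distributed model; toLocal keeps only the inferred topics (no per-document state).
val reloaded   = DistributedLDAModel.load(sc, path)
val localModel = reloaded.toLocal
println(s"reloaded: k = ${reloaded.k}, vocabSize = ${reloaded.vocabSize}")
println(s"local topicsMatrix: ${localModel.topicsMatrix.numRows} x ${localModel.topicsMatrix.numCols}")
```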