The space whence the words are drawn is termed the lexicon.

Formally, the model is defined as

```
For each topic k,
  phi_k ~ Dirichlet(beta)
For each document d,
  theta ~ Dirichlet(alpha)
  For each word w,
    z ~ Multinomial(theta)
    w ~ Multinomial(phi_z)
```

alpha and beta are hyperparameters of the model. The number of topics, K,
is a fixed parameter of the model, and w is observed. This package fits
the topics using collapsed Gibbs sampling (Griffiths and Steyvers, 2004).
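
To make the generative story above concrete, here is a minimal sketch of
sampling a synthetic corpus from it. This is illustrative only and not part
of this package's API; it assumes the Distributions.jl package, and the
function name generate_corpus is made up for this example.

```
# Illustrative only: sample a toy corpus from the LDA generative process.
# Assumes Distributions.jl; generate_corpus is not part of this package.
using Distributions

function generate_corpus(alpha, beta, docLengths)
    K = length(alpha)                                # number of topics
    phi = [rand(Dirichlet(beta)) for k in 1:K]       # phi_k ~ Dirichlet(beta)
    corpus = Vector{Vector{Int}}()
    for n in docLengths
        theta = rand(Dirichlet(alpha))               # theta ~ Dirichlet(alpha)
        words = Int[]
        for i in 1:n
            z = rand(Categorical(theta))             # z ~ Multinomial(theta)
            push!(words, rand(Categorical(phi[z])))  # w ~ Multinomial(phi_z)
        end
        push!(corpus, words)
    end
    return corpus
end

# e.g. 5 documents of 100 words each, 10 topics, 25-word lexicon:
# generate_corpus(fill(0.1, 10), fill(0.01, 25), fill(100, 5))
```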

We describe the functions of the package using an example. First we load
corpora from data files as follows:

```
testDocuments = readDocuments(open("cora.documents"))
testLexicon = readLexicon(open("cora.lexicon"))
```

These read files in LDA-C format. The lexicon file is assumed to have one
word per line. The document file consists of one document per line. Each
document is a collection of word:count tuples; the first entry on the line
gives the number of tuples for that document.
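
As a concrete illustration, a hypothetical toy corpus (made-up file names,
assuming the standard LDA-C layout with 0-indexed word ids) could be written
out and read back with the same functions:

```
# Hypothetical toy corpus: three lexicon words, two documents.
# Assumes standard LDA-C layout with 0-indexed word ids.
open("toy.lexicon", "w") do f
    write(f, "topic\nmodel\nword\n")
end
open("toy.documents", "w") do f
    write(f, "2 0:3 2:1\n1 1:2\n")   # leading "2" and "1" are the tuple counts
end

toyDocuments = readDocuments(open("toy.documents"))
toyLexicon = readLexicon(open("toy.lexicon"))
```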

With the documents loaded, we instantiate a model that we want to train:

```
model = Model(fill(0.1, 10), 0.01, length(testLexicon), testDocuments)
```

This is a model with 10 topics. alpha is set to a uniform Dirichlet prior
with 0.1 weight on each topic (the dimension of this variable is used to
set the number of topics). The second parameter indicates that the prior
weight on phi (i.e. beta) should be set to 0.01. The third parameter is
the lexicon size; here we use the length of the lexicon we have just read.
The fourth parameter is the collection of documents.
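
For instance, keeping the same positional arguments, a variant with 25
topics and a sparser topic prior would look like this (the name model25 and
the prior values are illustrative):

```
# Same constructor as above, different settings (illustrative values):
# 25 topics, symmetric alpha of 0.05, beta of 0.01.
model25 = Model(fill(0.05, 25), 0.01, length(testLexicon), testDocuments)
```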

```
trainModel(testDocuments, model, 30)
```

With the model defined, we can train the model on a corpus of documents.
The trainModel command takes the corpus as the first argument, the model
as the second, and the number of sampling iterations to run as the third.
The model will be mutated in place.
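
Because training mutates the model in place, it seems reasonable to run
additional sweeps by calling trainModel again on the same object; this is a
sketch of that assumption rather than documented behaviour:

```
# Assumption: a further call continues sampling from the current state.
trainModel(testDocuments, model, 100)
```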

Finally we can examine the output of the trained model using topTopicWords.

```
topWords = topTopicWords(model, testLexicon, 10)
```
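
For a quick look at the result, one way to print the per-topic word lists
(assuming topWords is indexable as one collection of word strings per topic;
the exact return type is not shown in this section):

```
# Print the most probable words for each topic
# (assumes topWords holds one collection of word strings per topic).
for k in 1:length(topWords)
    println("Topic ", k, ": ", join(topWords[k], ", "))
end
```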

This function retrieves the top words associated with each topic; this
serves as a useful summary of the model. The first parameter is the model,