Corpus hierarchy #9
v1.3 compat fixed lexicon
1) Type hierarchy for data: rooted at an abstract corpus and document, which support subtypes representing fully synthetic and real-world data.
2) Type hierarchy for MCMC: break the struct model into "model" and "state", reflecting the scope (document locality) of latent variables vs. model parameters and hyperpriors. This will facilitate clear-cut testing in the next PR, based on Grosse and Duvenaud https://arxiv.org/abs/1412.5218
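For readers skimming the thread, the two hierarchies described in this commit message might look roughly like the sketch below. All type and field names except `Model`, `assignments`, and `docSums` are hypothetical, and the field layouts are assumptions; the PR's actual definitions may differ.

```julia
# Hypothetical sketch; names and fields are assumptions, not the PR's code.
abstract type AbstractDocument end
abstract type AbstractCorpus end

# A real-world document: observed word indices into a fixed lexicon.
struct RealDocument <: AbstractDocument
    terms::Vector{Int}
end

# A synthetic document also keeps the topic assignments that generated it.
struct SyntheticDocument <: AbstractDocument
    terms::Vector{Int}
    assignments::Vector{Int}
end

struct Corpus{D<:AbstractDocument} <: AbstractCorpus
    documents::Vector{D}
    lexicon::Vector{String}
end

# Global scope: parameters and hyperpriors shared by every document.
struct Model
    alpha::Vector{Float64}   # Dirichlet prior over topics
    beta::Vector{Float64}    # Dirichlet prior over words
end

# Document-local scope: latent variables and the counts derived from them.
struct State
    assignments::Vector{Vector{Int}}
    docSums::Matrix{Float64}   # assumed layout: topics x documents
end

nTopics(m::Model) = length(m.alpha)
nDocuments(c::AbstractCorpus) = length(c.documents)
```

Keeping `Model` free of per-document fields means a test can hold the parameters fixed and vary only the `State`, which is the separation the commit message is after.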
1) Per-word topics: add a test that the corresponding conditional is consistent with the full joint.
2) Get rid of mutability on structs in src/Data.jl in favor of in-place assignment.
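A consistency test of this kind can be sketched as follows, assuming a collapsed LDA model with symmetric scalar priors. The function names are hypothetical and this is not the test added in the commit. The idea is that the per-word conditional computed from counts must equal the full joint evaluated at each candidate topic and then normalized.

```julia
# Rising factorial a(a+1)...(a+n-1), i.e. Gamma(a+n)/Gamma(a) for small n.
function rising(a::Float64, n::Int)
    r = 1.0
    for i in 0:n-1
        r *= a + i
    end
    return r
end

# Collapsed joint p(words, z) for symmetric priors α and β (assumed model).
function joint(words, z, K, V, α, β)
    p = 1.0
    topicWord = zeros(Int, K, V)
    for (d, doc) in enumerate(words)
        docTopic = zeros(Int, K)
        for (i, w) in enumerate(doc)
            docTopic[z[d][i]] += 1
            topicWord[z[d][i], w] += 1
        end
        p *= prod(rising.(α, docTopic)) / rising(K * α, length(doc))
    end
    for k in 1:K
        p *= prod(rising.(β, topicWord[k, :])) / rising(V * β, sum(topicWord[k, :]))
    end
    return p
end

# Conditional for position (d, i) computed directly from counts that exclude it.
function conditional(words, z, d, i, K, V, α, β)
    w = words[d][i]
    p = zeros(K)
    for k in 1:K
        ndk = count(j -> j != i && z[d][j] == k, eachindex(z[d]))
        nkw = 0
        nk = 0
        for (e, doc) in enumerate(words), (j, v) in enumerate(doc)
            (e == d && j == i) && continue
            if z[e][j] == k
                nk += 1
                nkw += (v == w)
            end
        end
        p[k] = (ndk + α) * (nkw + β) / (nk + V * β)
    end
    return p ./ sum(p)
end

# The same conditional obtained by brute force from the full joint.
function jointConditional(words, z, d, i, K, V, α, β)
    p = zeros(K)
    for k in 1:K
        zk = deepcopy(z)
        zk[d][i] = k
        p[k] = joint(words, zk, K, V, α, β)
    end
    return p ./ sum(p)
end
```

On a toy corpus, `conditional(...) ≈ jointConditional(...)` should hold at every position. The brute-force version is only practical for very small inputs, which is all a unit test needs.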
Cool, thanks for taking this on and sorry for taking so long to review! Just a couple of comments.
    return
end

function topTopicWords(model::Model,
I would consider moving this to a separate file where someday all the functions that visualize/inspect the model could live.
    topicSums::Vector{Float64}
    docSums::Array{Float64,2}
    assignments::Array{Array{Int64,1},1}
    conditionals::Array{Array{Float64,2},1} # the p parameter for the word assignment (cat/multinom) variable
This might not be an issue, but in the past, on large corpora where there's memory pressure, keeping the conditionals around winds up being the limiting factor (as opposed to just keeping a temporary conditional for the word currently being sampled).
(Also, there seem to be conflicts on this branch.)
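To illustrate the memory point, here is a minimal sketch of a collapsed Gibbs sweep that reuses one length-K buffer instead of storing a conditional per word. `sampleIndex`, `sweep!`, and `topicWordSums` are hypothetical names, symmetric scalar priors are assumed, and `docSums` is assumed to be laid out as topics x documents.

```julia
# Draw an index from unnormalized weights using a single uniform draw.
function sampleIndex(p::Vector{Float64})
    u = rand() * sum(p)
    acc = 0.0
    for k in eachindex(p)
        acc += p[k]
        u <= acc && return k
    end
    return lastindex(p)
end

# One in-place sweep over all word positions (hypothetical helper).
function sweep!(assignments, words, docSums, topicWordSums, topicSums, α, β)
    K, V = size(topicWordSums)
    p = zeros(K)   # the only conditional held in memory, overwritten per word
    for d in eachindex(words), i in eachindex(words[d])
        w, k = words[d][i], assignments[d][i]
        # Remove the current word from the counts.
        docSums[k, d] -= 1
        topicWordSums[k, w] -= 1
        topicSums[k] -= 1
        for j in 1:K
            p[j] = (docSums[j, d] + α) * (topicWordSums[j, w] + β) /
                   (topicSums[j] + V * β)
        end
        # Resample the topic and restore the counts.
        k = sampleIndex(p)
        assignments[d][i] = k
        docSums[k, d] += 1
        topicWordSums[k, w] += 1
        topicSums[k] += 1
    end
    return assignments
end
```

Under these assumptions the extra memory is O(K) regardless of corpus size, whereas a stored `conditionals` field grows with the number of tokens times K.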
Thanks for looking through the code! (I'll look into those conflicts and
see what's going on)
I would consider moving this to a separate file where someday all the functions that visualize/inspect the model could live.
------------------------------
Sounds good, that would also solve the issue of exports being scattered
around various files, where most of the functions are not exported.
This might not be an issue, but in the past, on large corpora where there's
memory pressure, keeping the conditionals around winds up being the limiting
factor (as opposed to just keeping a temporary conditional for the word
currently being sampled).
Yeah, I agree. Since we already have sufficient stats from the
collapsed model, I can just compute the theta samples post hoc if
necessary. In my current application the mixed membership is of interest,
but as you say, the actual class is usually the relevant observation.
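As a rough sketch of the post hoc computation, the document-topic counts already determine the Dirichlet posterior over each theta. The helper below returns the posterior mean rather than a draw (a deliberate simplification), assumes a symmetric scalar prior, and assumes `docSums` is laid out as topics x documents. The function name is hypothetical.

```julia
# Posterior mean of theta for every document, computed from counts alone.
# Column d of the result is E[theta_d | assignments].
function thetaPosteriorMean(docSums::Matrix{Float64}, α::Float64)
    K = size(docSums, 1)
    return (docSums .+ α) ./ (sum(docSums, dims=1) .+ K * α)
end
```

For example, with `docSums = [2.0 0.0; 1.0 2.0]` and `α = 0.5`, the first column is `[0.625, 0.375]`. An actual sample would instead be drawn from a Dirichlet with parameters `docSums[:, d] .+ α`.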
Sorry about the late reply (had another paper that needed to be submitted)... I think all these are safe to overwrite. TopicModels.jl has completely changed from having classes and functions to just serving as the Julia equivalent of DESCRIPTION.R under the Julia 1.3 Pkg framework.
If you want, I can do the merge by hand, I just need write access ;)
I'm not sure what you mean? You should be able to resolve the conflicts in your branch.
Ok, I see. Sorry, just getting the hang of this. Done. I'm going to move all .toml files to gitignore; these are all going to be local and platform dependent.
Hi, there are only two new commits in this branch, and the commit messages are itemized with details.
Looking toward the next PR, I'm going to start adding supervised LDA functionality. I'm looking to support two forms of document-level response: