I'm trying to apply Latent Dirichlet Allocation (LDA) to a name disambiguation project. My data set is a corpus of documents, each of which looks like:

Author, co-author, title, institution

I understand that the input for LDA should be a document-term matrix, but how do I take advantage of the structure of the data set? Should I just generate a document-term matrix, disregarding the structure?
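For concreteness, one way the structure could be kept in a bag-of-words input is to prefix every token with the field it came from, so that "mit" in a title and "mit" as an institution become different terms. This is only a minimal sketch under my own assumptions: the records, the field names, and the helpers `field_tokens` and `document_term_matrix` are hypothetical, not part of any library.

```python
from collections import Counter

# Hypothetical records; the field names mirror the layout described above.
records = [
    {"author": "J. Smith", "coauthor": "A. Lee",
     "title": "Topic models for text", "institution": "MIT"},
    {"author": "J. Smith", "coauthor": "B. Chen",
     "title": "Protein folding dynamics", "institution": "Stanford University"},
]

def field_tokens(record):
    """Turn one record into tokens, each tagged with its source field."""
    tokens = []
    for field, value in record.items():
        if field == "title":
            # Free text: split into lowercase words.
            parts = value.lower().split()
        else:
            # Names and institutions: keep each value as a single token.
            parts = [value.lower().replace(" ", "_")]
        tokens.extend(f"{field}:{part}" for part in parts)
    return tokens

def document_term_matrix(records):
    """Return (vocabulary, matrix) with one row of term counts per record."""
    counts = [Counter(field_tokens(r)) for r in records]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

vocab, dtm = document_term_matrix(records)
```

The resulting `dtm` has one row per record and one column per field-tagged term, which is the shape an LDA implementation would expect. Whether tagging terms this way actually helps the topics is something I have not verified.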

Sorry if the question seems vague. I would be happy to clarify any ambiguities.

Thank you

  • Follow-up: would it be beneficial to define some of my own topics for this? For example, for institution, each possible university would become its own topic. – casualprogrammer Apr 05 '17 at 22:47
  • It's not clear what you actually want to use these topic models to do. What are you trying to disambiguate? LDA is a bag-of-words model, so in general it has no notion of syntactic structure. That doesn't necessarily make it useless to your project; it's just hard to tell from your question. – Sean Easter Apr 06 '17 at 01:10
  • It's not clear from your post if each "document" is just the metadata you list, or if it includes actual natural language text (paragraphs etc.). – AaronDefazio Apr 06 '17 at 05:23
  • The aim is to tell whether entries with the same author name, e.g. J. Smith, are actually the same person. I want to use LDA to analyze the topics of the papers. Intuitively, if the topics of two J. Smiths' papers are similar, then they are likely to be the same person. @Sean Easter – casualprogrammer Apr 06 '17 at 22:30
  • I'm not quite sure how to even form topics from just a couple of words from the title and co-author. @AaronDefazio – casualprogrammer Apr 06 '17 at 22:31
  • I think your intuition is sound: generally, an author's works will share topics from one to the next. You might want to search for papers that cite the original LDA paper. Some models have since been developed that use it in contexts that include a supervised learning task. You might find something useful there. – Sean Easter Apr 07 '17 at 14:31
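
The intuition discussed in the comments (papers with similar topics are more likely to share an author) could be sketched as a similarity check on per-paper topic mixtures. This is a rough illustration under stated assumptions: the topic vectors in `topic_mix` are made up rather than produced by a fitted model, cosine similarity is just one possible measure, and the `THRESHOLD` value and the `likely_same_author` helper are hypothetical.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Made-up per-paper topic mixtures, standing in for what a fitted
# LDA model might assign to three papers signed "J. Smith".
topic_mix = {
    "paper_a": [0.80, 0.15, 0.05],
    "paper_b": [0.75, 0.20, 0.05],
    "paper_c": [0.05, 0.10, 0.85],
}

THRESHOLD = 0.9  # assumed cutoff; it would need tuning on labeled pairs

def likely_same_author(id1, id2):
    """Flag two papers as a probable match if their topic mixtures are close."""
    return cosine_similarity(topic_mix[id1], topic_mix[id2]) >= THRESHOLD
```

With these made-up vectors, `paper_a` and `paper_b` fall above the cutoff and `paper_a` and `paper_c` fall below it. In practice the topic signal would probably be combined with co-author and institution overlap rather than used alone, since titles give very little text per document.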

0 Answers