
I am attempting to group, for example, strings about programming with other strings about programming, strings about physics with other strings about physics, etc., for a wide range of topics. Despite the glaring theoretical linguistic aspect of the problem, I am looking to actually do this using programming/software.

The rundown: Given a large number of strings, how would I go about grouping them by semantic theme?

The particular application: I have ~200k trivia questions that I would like to categorize into common groupings (cars, computers, politics, Canada, food, Barack Obama, etc.).

What I've looked into: Wikipedia has a list of natural language processing toolkits (assuming that what I'm trying to do is actually called NLP), so I have looked at a few, but none seem to do anything close to what I need.

Notes: It has been pointed out that doing this requires additional knowledge (e.g. a Porsche being a car, C++ being a programming language). I assume then that training data is needed, but if I have only the list of questions and answers, how can I generate training data? And then how do I use training data?

More notes: In case the current formatting of my Q&As helps (although it looks like JSON, it's basically a raw text file):

// row 1: is metadata
// row 2: is a very specific kind of "category"
// row 3: is the question
// row 4: is the answer
{
  15343
  A MUSICAL PASTICHE
  Of classical music's "three B's", he was the one born in Hamburg in 1833
  Johannes Brahms
}

But before someone points out that there already exists a category, note that there are ~200k questions and answers like this, and basically as many "categories". I am trying to group these into broader groups like the ones listed above. Also, this formatting can be changed for all the questions very easily, I do it programmatically.

And more notes: I don't actually know how many categories I'll need (at least 10-20), because I haven't read through all of the questions myself. I was partly expecting the final number to be determined somehow during categorization. In any case, I can always manually create a number of categories.

Whymarrh
  • How were you using carrot? From my brief reading about it, it seems like it should easily handle 200k records. –  Mar 13 '12 at 23:25
  • It just took a lot longer than I thought it would, and forced me to increase the JVM's initial memory allotment to 1024m, and max memory to 2048m. It wasn't as bad as I may have made that sound. –  Mar 14 '12 at 01:13
  • Eh, you're doing bulk processing; giving the JVM more memory isn't really a problem. How long did it take? From where were you loading the documents? A custom source? –  Mar 14 '12 at 08:27
  • It took maybe 10 minutes, but I agree, bulk processing is by definition time-consuming and memory-intensive. Though that whole spiel about it choking wasn't the issue, more of a side note. –  Mar 14 '12 at 16:27
  • Apart from JVM memory configuration (mentioned in the comments), this question would be fit for [Linguistics.SE](http://linguistics.stackexchange.com/). –  Mar 14 '12 at 18:30
  • @OtavioMacedo I would like to find a way using software and programming to solve this problem. I feel that [Linguistics.SE](http://linguistics.stackexchange.com/) is more of a theoretical approach. –  Mar 15 '12 at 14:20
  • the java [Mallet](http://mallet.cs.umass.edu/topics-devel.php) package is what you want. the tutorial shows how to import all your texts and group them by topic but you would name the topics yourself. I've used it for grouping books by topics and it works surprisingly well, especially if you have lots of data. Your questions are the training data –  Mar 26 '12 at 06:40
  • But in the case where there could be topics (categories) that I am not aware of, or even a large number of categories, would I have to know all of them? –  Mar 26 '12 at 18:09
  • Mallet knows nothing beforehand and assumes you don't either. For example, you tell Mallet you want 25 topics and then feed in all 200k questions. Mallet gives each question a score in each topic. Run the jar on this page and you'll see: http://code.google.com/p/topic-modeling-tool/downloads/list –  Mar 26 '12 at 20:11
  • You just need enough training data, and then you should be able to classify the questions into these categories. A fully automatic approach will likely end up grouping them by other means, e.g. questions containing the word "car". You cannot learn synonyms at the same time as creating a grouping. – Has QUIT--Anony-Mousse Mar 14 '12 at 07:17
  • So, there isn't software that I can just give the questions to and get back the categorizations? How then should this be done? –  Mar 14 '12 at 16:31
  • No, the magic "solve all my problems" button has not yet been invented. Usually, a computer will not *understand* the text, just do some statistics. You however are attempting to do a grouping that requires even additional knowledge (e.g. a porsche being a car). You will have to give this information to your program. – Has QUIT--Anony-Mousse Mar 14 '12 at 17:08
  • I have updated the question to reflect your points. Maybe you could further explain using training data to accomplish my task. –  Mar 24 '12 at 23:31
  • Mallet doesn't need to know synonyms. With enough data it will notice that words like 'Porsche' and 'C++' almost never appear together but 'C++' and 'Compiler' do –  Mar 26 '12 at 06:50
  • C++ and Compiler aren't synonyms. Synonyms usually do not occur together. – Has QUIT--Anony-Mousse Mar 26 '12 at 07:32

2 Answers


You are trying to solve two problems here.

Problem 1: Categorize question strings into the proper categories.

Problem 2: Create proper categories.

The first problem can be handled by so-called supervised algorithms; many classifiers can give very good accuracy and performance. However, problem 2, creating categories out of thin air from tons of data, is much trickier. This is an unsupervised problem: given lots of data, the computer autonomously decides on categories according to some criteria. Ideally, these criteria and the algorithm should neatly organize your data into clusters, which could then be labeled. However, as this is a much more difficult task, I'd say that there is no acceptable drop-in solution here that will give a good result without a lot of tuning effort, which would most likely require experts.

So, I'm afraid there's no magic button here just yet. What you can do, however, is help the machine out a bit. For instance, you can decide on the category set yourself. Once you have decided on categories, you can create training data. In this setup, the training data is just pairs of questions and their correct categories.

The more training data, the better. However, since the goal is still to do something automatically, it doesn't make sense to start out doing everything manually. So why would you want training data at all? Accuracy evaluation. If you want good results, it is vital that you can evaluate how well a given setup is doing, and the only way to do that somewhat systematically is to manually label some questions yourself. Otherwise you're flying blind.

Then some new questions arise. First: how much training data do I need? "It depends." Without having seen your data or categories, I'm not sure I'd even take a guess; but I can give a ballpark estimate and say about 500 questions. Note that I could be off by an order of magnitude.

Does this really mean that you'd have to tag 500 questions by hand? Yes and no. It is possible to use intermediate results and some cleverness to "bootstrap" classifiers. It is still manual work, though, and when you think about it, 500 questions won't take that long to tag. Being clever here can quickly give worse results than being industrious.

When you have a sufficient amount of training data, take 75% of it and create a classifier using your favourite tool (e.g. those mentioned here or whatnot). Then let the classifier try to label the held-out 25% of the data and measure the resulting accuracy. If the result is good, pop the champagne. If not, make more training data or try another classifier.

TL;DR

To sum up, here's how I would do it.

0) Use a supervised learner.
1) Create a category set yourself. 
2) Manually label about 500 questions.
3) Use 75% of those to train a classifier.
4) Check performance.
5) If good then cheers else goto 2.
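For concreteness, steps 2-5 could be sketched roughly as follows. This is a hedged illustration, not a finished implementation: it assumes Python with scikit-learn, a hand-picked category set, and made-up example questions (repeated only so the toy model has enough rows).

```python
# Sketch of steps 2-5: label a small sample, train on 75%, evaluate on 25%.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Step 2: a hand-labeled sample (in practice ~500 question/category pairs).
questions = [
    "Of classical music's 'three B's', he was the one born in Hamburg in 1833",
    "This German marque builds the 911 sports car",
    "This language, created by Bjarne Stroustrup, extends C with classes",
    "He was the 44th President of the United States",
] * 25  # repeated here only to give the toy example enough rows
labels = ["music", "cars", "computers", "politics"] * 25

# Step 3: hold out 25% for evaluation, train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.25, random_state=0)

# Bag-of-words features plus a linear classifier is a reasonable baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Step 4: measure accuracy on the held-out questions.
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

The 75/25 split and the accuracy check mirror steps 3 and 4; in practice you would loop back to labeling more data whenever the number disappoints.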
  • One small question: You say "about 500 questions" for the training data and to manually tag those, but also "I could be off by an order of magnitude", so if I were to use 5k or 50k questions instead, would I still manually tag that many? –  Mar 26 '12 at 18:07
  • The thing is; without having seen your data nor having a very clear idea of all the minute details in your project, it is hard to give a good estimate. However, and this is important to remember, should the 500 be far too low, the tagging effort was not wasted. You still need manually labeled questions for evaluation. The more evaluation data you have, the better assessments you can make. –  Mar 26 '12 at 18:33
  • By one order of magnitude I meant 50-500-5000. I don't think you'll need to classify 50k. That's 1/4 of your entire corpus! Should 500 questions be far too low, it's possible to bootstrap classifiers. The idea here is that you train the classifier on a small initial corpus (e.g. your 500) and then tag the rest. Now, you can use some of the cases where the classifier was very confident to retrain a new, larger classifier. –  Mar 26 '12 at 18:38
  • Another important thing to keep in mind; the performance of many classifiers is not linear in the amount of training data, but will typically be a sigmoid like curve. That means 500 more tagged questions could be almost just as good a benefit as 5000. My advice is to work in small steps. –  Mar 26 '12 at 18:39
  • What details would provide additional insight into my project? I can share some example questions to show my formatting, but I am willing to tailor the format of my Q&As to fit the categorization process. I appreciate the help. –  Mar 26 '12 at 20:39
  • How many categories are you interested in having? What do the questions look like? What genre, type of language, edited or not etc. –  Mar 26 '12 at 22:09
  • I added a bit to the question, hopefully it helps. As per a specific genre, the questions are all over the map, from technology to fashion to types of trees. I'm only now realizing the magnitude of this task. –  Mar 26 '12 at 22:22
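The "bootstrapping" idea mentioned in these comments (train on a small labeled seed, auto-label the rest, keep only the confident predictions, retrain) could be sketched like this. It is an illustrative Python/scikit-learn sketch with made-up questions and an arbitrary 0.8 confidence threshold, not code from the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A small hand-labeled seed (duplicated so the toy model has enough rows).
seed_questions = ["the compiler parses C++ templates",
                  "Porsche builds the 911 sports car"] * 10
seed_labels = ["computers", "cars"] * 10
unlabeled = ["gcc is a widely used compiler",
             "Ferrari and Porsche race sports cars"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(seed_questions, seed_labels)

# Keep only the predictions the classifier is confident about
# (0.8 is an arbitrary threshold chosen for this sketch).
proba = clf.predict_proba(unlabeled)
confident = np.max(proba, axis=1) >= 0.8
new_labels = clf.predict(unlabeled)

# Retrain on the enlarged corpus and repeat as needed.
grown_q = seed_questions + [q for q, ok in zip(unlabeled, confident) if ok]
grown_y = seed_labels + [y for y, ok in zip(new_labels, confident) if ok]
clf.fit(grown_q, grown_y)
```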

This is a fairly standard problem in NLP, and the magic Google words you're looking for are "topic modeling". Although your strings are quite short, you may have some success with Latent Dirichlet Allocation, or a similar method. There's a nice blog post by Edwin Chen here, which lays out the general idea behind the algorithm. The details of implementation are covered in this note by Yi Wang.

If you're looking for an off-the-shelf solution, I recommend trying out the topicmodels package for R, as this provides a reasonably nice interface to both LDA and a more sophisticated Correlated Topic Model. There's also a good list of implementations maintained by David Mimno here.
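For readers outside R, the same idea can be tried in a few lines of Python. This is a minimal, hedged sketch using scikit-learn's LDA implementation (my substitution, not one of the packages recommended above), with invented documents and an arbitrary choice of 2 topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the compiler turns C++ source code into machine code",
    "Porsche and Ferrari both build mid-engine sports cars",
    "a linker combines object code produced by the compiler",
    "the 911 is Porsche's best known sports car",
]

# LDA works on raw term counts (bag of words), not tf-idf.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# Each row sums to 1: a soft assignment of the document over the topics.
for doc, dist in zip(docs, doc_topics):
    print(dist.round(2), doc[:40])
```

Each question ends up with a score per topic, which is exactly the kind of output the Mallet comments above describe; you would then name the topics yourself.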

Martin O'Leary
  • Thank you, Chen's blog post seems to be spot-on for what I'm trying to do. Is there any chance you have used any of what you listed/done this before? I am on completely new ground here and would appreciate a walkthrough of what I'd need to do (using one of the off-the-shelf solutions). How should I format my "documents"? Should I apply IDs to each Q&A so I can identify which document is in which group? How do I use the outputted data? Like I said, I don't understand a lot of the details. – Whymarrh Mar 29 '12 at 22:52
  • I've used the R topicmodels package quite a bit. I'd certainly recommend it over rolling your own code - there's some documentation with a worked example at http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf. The specific formatting of each document doesn't really matter, as everything's going to be reduced to a "bag of words" representation anyway. Just throw all the associated text into one string. – Martin O'Leary Mar 29 '12 at 22:57