I am trying to group strings by topic: for example, strings about programming with other strings about programming, strings about physics with other strings about physics, and so on across a wide range of topics. Although the problem has an obvious theoretical-linguistics side, I want to solve it in practice with programming/software.
The rundown: Given a large number of strings, how would I go about grouping them by semantic theme?
The particular application: I have ~200k trivia questions that I would like to categorize into common groupings (cars, computers, politics, Canada, food, Barack Obama, etc.).
What I've looked into: Wikipedia has a list of natural language processing toolkits (assuming that what I'm trying to do is actually called NLP). I have looked at a few, but none seems to do what I need.
Notes: It has been pointed out that doing this requires additional knowledge (e.g. a Porsche being a car, C++ being a programming language). I assume then that training data is needed, but if I have only the list of questions and answers, how can I generate training data? And then how do I use training data?
More notes: In case the current formatting of my Q&As helps, here is a sample (although it looks like JSON, it's basically a raw text file):
// row 1: is metadata
// row 2: is a very specific kind of "category"
// row 3: is the question
// row 4: is the answer
{
15343
A MUSICAL PASTICHE
Of classical music's "three B's", he was the one born in Hamburg in 1833
Johannes Brahms
}
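For reference, a minimal sketch of how a file in this layout could be read into records. It assumes every record sits between a line containing only `{` and a line containing only `}`, with exactly four rows in the order annotated above, and that the `//` lines are annotations rather than data. The names `Trivia` and `parse_records` are made up for illustration.

```python
from typing import Iterator, List, NamedTuple, Optional


class Trivia(NamedTuple):
    """One record; field names are hypothetical, taken from the row notes."""
    metadata: str
    category: str
    question: str
    answer: str


def parse_records(text: str) -> Iterator[Trivia]:
    """Yield one Trivia per {...} block that has exactly four rows."""
    rows: Optional[List[str]] = None  # None means "not inside a record"
    for raw in text.splitlines():
        line = raw.strip()
        if rows is None:
            # Outside a record: ignore everything (e.g. // notes) until "{".
            if line == "{":
                rows = []
        elif line == "}":
            # End of record: keep it only if it has the assumed four rows.
            if len(rows) == 4:
                yield Trivia(*rows)
            rows = None
        elif line:
            rows.append(line)


if __name__ == "__main__":
    sample = """{
15343
A MUSICAL PASTICHE
Of classical music's "three B's", he was the one born in Hamburg in 1833
Johannes Brahms
}"""
    for record in parse_records(sample):
        print(record.category, "|", record.answer)
```

Blocks with a different number of rows are silently skipped here; whether that is the right behaviour depends on how clean the real file is.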
Before someone points out that a category already exists: note that there are ~200k questions and answers like this, and basically as many "categories". I am trying to group these into broader groups like the ones listed above. Also, this formatting can easily be changed for all the questions, since I can do that programmatically.
And more notes: I don't actually know how many categories I'll need (at least 10-20), because I haven't read through all of the questions myself. I was partly expecting the number to be determined somehow during categorization. In any case, I can always create a number of categories manually.
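To illustrate what I mean by the number of groups being determined during categorization, here is a rough sketch of the kind of behaviour I have in mind. It is only an assumption about one possible approach, not something I know to be the right method: each string becomes a bag of words, and it joins the most similar existing group if the cosine similarity clears a threshold, otherwise it starts a new group. The function names and the `threshold` value are invented for the example. It compares words only, so it would not know that a Porsche is a car.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count word tokens (keeps '+' and '#' for 'c++', 'c#')."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_groups(texts: List[str], threshold: float = 0.3) -> List[List[int]]:
    """Group texts in one pass; the number of groups is not fixed in advance."""
    groups = []  # each entry: (summed word counts of the group, member indices)
    for index, text in enumerate(texts):
        vec = bag_of_words(text)
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, group[0])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append((Counter(vec), [index]))  # start a new group
        else:
            best[0].update(vec)  # fold the text into the group's word counts
            best[1].append(index)
    return [members for _, members in groups]


if __name__ == "__main__":
    texts = [
        "python programming language",
        "c++ programming language",
        "quantum physics particle",
        "particle physics experiment",
    ]
    print(greedy_groups(texts))
```

The result depends on the order of the input and on the threshold, which is part of why I am asking what the proper technique is.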