11

I am somewhat new to data mining, and I am working on a classification model for movie rating prediction.

I have collected data sets from IMDB, and I am planning to use decision tree and nearest neighbor approaches for my model. I would like to know which freely available data mining tools provide the functionality I need.

jonsca
K Hein

3 Answers

5

Weka is a free, open-source suite of machine-learning tools. It has a GUI as well as an API that you can call from your own Java code.

It includes many classification algorithms, among them several decision tree algorithms, and these are available in the GUI. Nearest neighbors are a bit trickier, and it seems you have to use the API directly.
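For a feel of what a nearest-neighbor classifier does, here is a minimal sketch in plain Python. It is not Weka's API; `knn_predict`, the choice of features (budget and runtime), and the toy rows are all made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest rows.

    `train` is a list of (features, label) pairs, where features are
    equal-length tuples of numbers. Distance is plain Euclidean.
    """
    # Sort the training rows by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (budget in $M, runtime in minutes) -> rating class.
movies = [
    ((150.0, 140.0), "high"),
    ((120.0, 130.0), "high"),
    ((100.0, 125.0), "high"),
    ((5.0, 90.0), "low"),
    ((8.0, 95.0), "low"),
    ((12.0, 100.0), "low"),
]

print(knn_predict(movies, (110.0, 128.0), k=3))
```

In practice, features on very different scales should be normalized first; otherwise the one with the largest range dominates the distance.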

I think RapidMiner probably supports this kind of thing, but I haven't used it for such purposes before.

You might also consider R, but that might require getting your hands a little dirtier.

Note that Netflix has done a ton of work on movie rating prediction. Several years ago they offered a $1 million prize (the Netflix Prize) to the team that could most improve the accuracy of their rating predictions. You might be interested in reading how various teams approached that problem.

Michael McGowan
  • Thanks Michael, I've tried Weka for decision tree algorithms, but I found that numeric values are not supported by most of the decision tree algorithms. In my data sets, I have numeric values such as rating (the class label), budget, director id, actor id, etc. So how could I handle those numeric values? (I am not sure whether I should open a new thread for this question.) Do you have any suggestions for other suitable algorithms? – K Hein Nov 23 '11 at 03:10
  • @K Hein 1) I suggest using Random Forests (RF) instead of DTs. See e.g. http://stats.stackexchange.com/questions/10001/implementations-of-the-random-forest-algorithm. 2) Numeric variables: RF can handle both numeric and discrete class labels, so you should try both approaches. director_id and actor_id are not numeric features; each is either a boolean (did this actor participate?) or a nominal (who is the main actor?). budget can be discretized, or you can let RF handle it as is, in which case the algorithm searches for the optimal split point. I suggest playing around and coming back later with more specific questions ;). – mlwida Nov 23 '11 at 08:22
  • @steffen Thanks Steffen! I'll give RF a try, but I still have a few questions regarding your comment. Let's say I want to treat actor_id as boolean: then for each unique actor_id, would I have a boolean attribute like isActor1Participated (say, for actor_id = 1)? And if I'd like to turn actor_id into a nominal attribute, how should I proceed? I would be very grateful if you could provide some explanation, as I am really new to the data mining area. – K Hein Nov 23 '11 at 15:55
  • @KHein My idea behind the nominal suggestion was to restrict the actors to the most important ones by creating features like first_actor, second_actor, etc. Anyway, how to deal with information of variable length (actors, directors, keywords, etc.) is a topic for a separate question. – mlwida Nov 24 '11 at 17:19
  • @KHein When you ask the "How to deal with information of variable length" question, please link to it here :-) – Darren Cook Nov 25 '11 at 00:36
  • @DarrenCook looks like KHein forgot it. Here it [is](http://stats.stackexchange.com/questions/18911/how-to-deal-with-information-of-variable-length/). – mlwida Nov 26 '11 at 12:07
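To make the boolean vs. nominal idea from the comments above concrete, here is a minimal sketch in plain Python. The record layout (`actor_ids` as a billing-ordered list), the helper names `top_values` and `encode`, and the sample movies are all hypothetical; only the feature names first_actor and second_actor come from the comments.

```python
from collections import Counter


def top_values(rows, key, n):
    """Return the n most frequent values found in the list-valued field `key`."""
    counts = Counter(value for row in rows for value in row[key])
    return [value for value, _ in counts.most_common(n)]


def encode(row, frequent_actors, positions=("first_actor", "second_actor")):
    """Turn one movie record into a flat dict of features."""
    features = {}
    # Boolean view: one indicator per frequently occurring actor.
    for actor in frequent_actors:
        features[f"has_actor_{actor}"] = actor in row["actor_ids"]
    # Nominal view: the first few billed actors, by position.
    billed = row["actor_ids"]
    for index, name in enumerate(positions):
        features[name] = str(billed[index]) if index < len(billed) else "none"
    return features


# Hypothetical records; actor_ids are listed in billing order.
movies = [
    {"title": "A", "actor_ids": [1, 2, 3]},
    {"title": "B", "actor_ids": [2, 1]},
    {"title": "C", "actor_ids": [2]},
]

frequent = top_values(movies, "actor_ids", n=2)
for movie in movies:
    print(movie["title"], encode(movie, frequent))
```

Restricting the indicators to the most frequent actors keeps the number of columns manageable; with one column per actor in a full IMDB extract, the feature space would become very wide and sparse.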
5

Hein,

There are a lot of tools and libraries that provide this functionality.

Which one to choose depends on whether you would like to use a GUI for your work or embed the model in another program.

Standalone data mining tools (there are others, like WEKA, with a Java interface):

  • RapidMiner
  • Orange
  • Rattle GUI for R
  • KNIME

Text based:

  • GNU R

Libs:

  • scikit-learn for Python
  • Mahout on Hadoop

If you know a programming language well enough, I would use a library for that language or give R a try. If not, you may want to try one of the GUI tools.

A tree example in R:

# we are using the iris dataset
data(iris)

# for our tree based model we use the rpart package
# to download it type install.packages("rpart")
library(rpart)

# Building the tree
fit <- rpart(Species ~ Petal.Length + Petal.Width, method="class", data=iris)

# Plot the tree
plot(fit)
text(fit)

As suggested, analysis in R requires you to write code yourself, but you will find a package for most classification tasks that works out of the box. An overview can be found in the CRAN Machine Learning Task View.
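For numeric predictors such as Petal.Length in the tree example above (or a movie's budget), a classification tree looks for a threshold that best separates the classes. The following is a rough sketch of that search in plain Python, not rpart's actual implementation. It assumes Gini impurity as the splitting criterion, and the budget figures and the names `gini` and `best_split` are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between two identical values
        # Candidate threshold: midpoint between two neighbouring values.
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        # Impurity of the two children, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical budgets (in $M) and rating classes.
budgets = [5.0, 8.0, 12.0, 100.0, 120.0, 150.0]
ratings = ["low", "low", "low", "high", "high", "high"]

threshold, impurity = best_split(budgets, ratings)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node and recurses into the two resulting subsets.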

To get started with RapidMiner, have a look at YouTube. There are some screencasts, including ones on decision trees.

audijenz
  • I'd like to downvote, but you are new, so: you simply list a set of tools (a rather generic answer) without demonstrating why they are suitable for the OP's specific task. I suggest providing more details; otherwise your answer could be replaced by http://stats.stackexchange.com/questions/2007/a-survey-of-data-mining-software-tools. No offense, please take it as friendly advice :) – mlwida Nov 22 '11 at 16:38
  • @steffen: respectfully, audijenz's receipt of 4 upvotes and 0 downvotes says otherwise. I believe s/he has answered the question nicely. It asked "which freely available data mining tool could provide functionalities that I require," and the answer gave that and more. Much more, actually, than any of the answers at the thread you linked. – rolando2 Nov 24 '11 at 13:58
  • @rolando2 I added the comment BEFORE audijenz edited the answer, and I have already upvoted the edited answer ;). – mlwida Nov 24 '11 at 14:04
  • @steffen: I stand corrected! – rolando2 Nov 25 '11 at 01:46
1

Maybe... WEKA? http://www.cs.waikato.ac.nz/ml/weka/

Orsino
  • (-1): Although Weka is indeed a data mining tool that contains implementations of NN and DT, the answer is so generic that it could be a reply to a ton of questions. If you think that Weka is suitable for the specific task of rating prediction given extremely high-dimensional, sparse data, why don't you show an example (or a link to one)? No offense, please take it as a friendly suggestion. – mlwida Nov 22 '11 at 16:35