Data mining classification competition

Question

I'm currently taking a data mining class, and for one of our projects we're required to predict the class label for an unknown data set by first building a classifier on a training data set that already provides the class label.

We only need an accuracy of 80% to get full marks on the assignment. I have already achieved this using the J48 decision tree algorithm (acc=84.08%).

There is also an ongoing competition on who can get the highest accuracy (determined by a Judge system we can't see).

I have two questions:

How can I use an ensemble method to do this?
Is there a way to optimize the parameters for each classifier?

    import java.io.FileReader;
    import java.io.PrintStream;

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class CompClassifier {

        public static void main(String[] args) throws Exception {
            // Load training data
            Instances trainingData = new Instances(new FileReader("/Users//Weka/training.arff"));

            // Load test data
            Instances testData = new Instances(new FileReader("/Users//Weka/unknown.arff"));

            // Clean up training data (the filter learns its statistics from this batch)
            ReplaceMissingValues replace = new ReplaceMissingValues();
            replace.setInputFormat(trainingData);
            Instances trainingDataFilter1 = Filter.useFilter(trainingData, replace);

            // Normalize training data (the filter learns min/max from this batch)
            Normalize norm = new Normalize();
            norm.setInputFormat(trainingDataFilter1);
            Instances processedTrainingData = Filter.useFilter(trainingDataFilter1, norm);

            // Set class attribute for pre-processed training data
            processedTrainingData.setClassIndex(processedTrainingData.numAttributes() - 1);

            // Build classifier
            J48 tree = new J48();
            tree.buildClassifier(processedTrainingData);

            // Clean up and normalize test data with the filters fitted on the training data.
            // setInputFormat is not called again: it would reset the filters, and they
            // would then recompute their statistics from the test data.
            Instances testDataFilter1 = Filter.useFilter(testData, replace);
            Instances processedTestData = Filter.useFilter(testDataFilter1, norm);

            // Set class attribute for pre-processed test data
            processedTestData.setClassIndex(processedTestData.numAttributes() - 1);

            // Output predictions to the console and to a file (closed automatically)
            try (PrintStream file = new PrintStream("/Users//Desktop/CLASSIFICATION/test.txt")) {
                for (int i = 0; i < processedTestData.numInstances(); i++) {
                    Instance currentInst = processedTestData.instance(i);
                    int predictedClass = (int) tree.classifyInstance(currentInst);
                    System.out.println(predictedClass);
                    file.println("O" + predictedClass);
                }
            }
        }
    }

On a side note, even though the Weka book is titled "Data mining", it actually is about regular machine-learning. You can even find a statement by the authors that it was supposed to be called "practical machine learning", and the publishing house changed it to "data mining" for marketing and sales reasons. All the classification material clearly belongs to AI and ML, not to data mining (no integration with database management!) — Has QUIT--Anony-Mousse, Dec 11 '11 at 10:47
Access of the originating database seems to me a logistical issue. Although in-database modeling methods exist, I believe that it is much more common to fit models in RAM, which becomes larger and cheaper every passing year. Also, there are a variety of techniques for either scaling up modeling processes, or scaling down the data (sampling, progressive sampling, data squashing, etc.). The core modeling techniques of data mining and machine learning (and inferential statistics, pattern recognition, etc.) are otherwise indistinguishable. — , Dec 19 '11 at 16:13

score 3 · Answer 1 · answered Dec 20 '11 at 14:26

An easy way to build an ensemble is to use a random forest. Weka includes a random forest implementation (weka.classifiers.trees.RandomForest), and if other tree-based models are performing well it's worth trying out.

You could also build your own ensemble by training multiple (say 50 or 100) J48 decision trees and using them to "vote" on the classification of each object. For example, if 60 trees say a given observation belongs to class "A", and 40 say it belongs to class "B", you classify the object as class "A."
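
The voting step can be sketched in plain Java, independent of Weka. This is only a minimal sketch: it assumes each tree's prediction is already available as a class index (for example, the value returned by classifyInstance cast to int). The class name MajorityVote and its method are hypothetical, and breaking ties in favor of the lowest class index is an arbitrary choice.

```java
final class MajorityVote {

    /**
     * Returns the class index that received the most votes.
     * predictions[i] is the class index predicted by the i-th model.
     * Ties go to the lowest class index (an arbitrary choice for this sketch).
     */
    static int vote(int[] predictions, int numClasses) {
        if (predictions.length == 0) {
            throw new IllegalArgumentException("need at least one prediction");
        }
        int[] counts = new int[numClasses];
        for (int p : predictions) {
            counts[p]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 60 trees vote for class 0 ("A"), 40 trees vote for class 1 ("B")
        int[] predictions = new int[100];
        for (int i = 60; i < 100; i++) {
            predictions[i] = 1;
        }
        System.out.println(vote(predictions, 2)); // prints 0
    }
}
```

Note that J48 is deterministic, so the trees only disagree if something differs between them, such as the rows or attributes each one is trained on.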

You can further improve such an ensemble by training each tree on a random sub-sample of the training data. This is called "bagging," and the random sub-samples are usually created with replacement.
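
Sampling with replacement can be sketched as follows. This is an illustration under assumptions, not a full bagging implementation: the class BootstrapSample is hypothetical, it only produces the row indices each tree would train on, and the seed value is arbitrary. In Weka itself, the existing weka.classifiers.meta.Bagging meta-classifier is likely the more practical route.

```java
import java.util.Arrays;
import java.util.Random;

final class BootstrapSample {

    /** Draws n row indices uniformly at random from 0..n-1, with replacement. */
    static int[] sampleIndices(int n, Random rng) {
        int[] indices = new int[n];
        for (int i = 0; i < n; i++) {
            indices[i] = rng.nextInt(n);
        }
        return indices;
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are repeatable
        int numTrees = 5;
        int numRows = 10;
        for (int t = 0; t < numTrees; t++) {
            // Each tree gets its own sample; some rows repeat, others are left out
            int[] sample = sampleIndices(numRows, rng);
            System.out.println("tree " + t + " trains on rows " + Arrays.toString(sample));
        }
    }
}
```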

Finally, you can additionally give each tree a random subset of variables from the training set. This is called a "random forest." While your professor will probably be impressed if you write your own random forest algorithm, it's probably best to use an existing implementation.
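
Choosing a random subset of variables for one tree might look like the sketch below. The class RandomFeatureSubset is hypothetical, and the indices are assumed to refer to input attributes only, with the class attribute excluded by the caller. It draws one subset per tree, as described above; standard random forest implementations instead draw a fresh subset at every split, which is one more reason to prefer an existing implementation.

```java
import java.util.Arrays;
import java.util.Random;

final class RandomFeatureSubset {

    /**
     * Picks k distinct attribute indices out of 0..numAttributes-1
     * using a partial Fisher-Yates shuffle, returned in ascending order.
     */
    static int[] pick(int numAttributes, int k, Random rng) {
        if (k < 0 || k > numAttributes) {
            throw new IllegalArgumentException("k must be between 0 and numAttributes");
        }
        int[] all = new int[numAttributes];
        for (int i = 0; i < numAttributes; i++) {
            all[i] = i;
        }
        for (int i = 0; i < k; i++) {
            // Swap position i with a random position from i onward
            int j = i + rng.nextInt(numAttributes - i);
            int tmp = all[i];
            all[i] = all[j];
            all[j] = tmp;
        }
        int[] subset = Arrays.copyOf(all, k);
        Arrays.sort(subset);
        return subset;
    }

    public static void main(String[] args) {
        Random rng = new Random(7); // arbitrary seed for repeatable runs
        System.out.println(Arrays.toString(pick(20, 5, rng)));
    }
}
```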

score 1 · Answer 2 · answered Dec 19 '11 at 16:10

A model ensemble is simply a collection of models whose output is combined (hopefully generating superior performance in the process). Obviously, to be of any interest, the base models must vary somehow, and there are several ways to do this: vary the model type (tree induction, neural network, discriminant function, etc.), vary the starting conditions of the model training (such as differing weight initializations for feedforward neural networks), vary the observations used (typically random samples of the entire training set), vary the candidate input variables (again, typically random samples of all those available), etc.

There are several ways to combine the base model outputs. The simplest are averaging or voting, though these may require some calibration.
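
To illustrate how averaging differs from voting, here is a minimal sketch that averages per-class probabilities across models. It assumes each base model can output a probability distribution over the classes (in Weka, distributionForInstance returns one); the class AverageProbabilities and the numbers in main are made up for the example.

```java
final class AverageProbabilities {

    /** Averages per-class probabilities; distributions[m][c] is model m's probability for class c. */
    static double[] average(double[][] distributions) {
        if (distributions.length == 0) {
            throw new IllegalArgumentException("need at least one model output");
        }
        int numClasses = distributions[0].length;
        double[] mean = new double[numClasses];
        for (double[] dist : distributions) {
            for (int c = 0; c < numClasses; c++) {
                mean[c] += dist[c] / distributions.length;
            }
        }
        return mean;
    }

    /** Returns the index of the largest value (the first one on ties). */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One confident model for class 0, two lukewarm models for class 1
        double[][] outputs = {
            {0.9, 0.1},
            {0.4, 0.6},
            {0.4, 0.6}
        };
        // A majority vote would pick class 1; averaging picks class 0
        System.out.println(argMax(average(outputs))); // prints 0
    }
}
```

The example shows why the two rules can disagree: averaging lets a confident model outweigh several uncertain ones, which only helps if the models' probabilities are reasonably well calibrated.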

score 0 · Answer 3 · answered Dec 19 '11 at 19:42

You could try the new machine learning library called ML-Flex (http://mlflex.sourceforge.net). It is designed to execute a variety of ensemble methods and can also provide side-by-side comparisons when different algorithm parameters are used (though perhaps not exactly as you desire). If you're interested, give it a try and provide any feedback you may have. Full disclosure: I am the author of this package.
