Reducing size of PMML DataDictionary and MiningSchema elements

Asked Sep 02 '11 at 16:43

Active Sep 03 '11 at 15:10


The DataDictionary and MiningSchema elements of a PMML model require quite a bit of metadata for each field. With sparse, high-dimensional data, each of these could be many times larger than either the training data or the trained model.

Are there any conventional extensions (or evil non-standard kludges, depending on how you think about it) to the PMML syntax that, for instance, just say that all fields have the same metadata? Or, even better, that specify that all fields whose names share a prefix get the same metadata?

Also, is there any typical way of having the MiningSchema just say "use everything in the DataDictionary that's not the 'predicted' feature as an input feature"?
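
To make the kind of kludge I have in mind concrete, here is a minimal Python sketch that keeps a compact prefix-to-metadata table outside the PMML and expands it into full DataDictionary and MiningSchema elements on demand. The `expand_fields` helper, the `prefix_rules` shorthand, and the field names are all hypothetical and not part of the PMML standard; the sketch only assumes the usual `DataField` attributes (`name`, `optype`, `dataType`) and `MiningField` attributes (`name`, `usageType`). It shrinks what has to be stored or maintained by hand, not the expanded PMML itself.

```python
import xml.etree.ElementTree as ET


def expand_fields(field_names, prefix_rules, target):
    """Expand a compact prefix -> metadata table into PMML elements.

    prefix_rules is a hypothetical shorthand (not standard PMML): a list of
    (prefix, optype, dataType) tuples, where the first matching prefix wins.
    """
    data_dict = ET.Element("DataDictionary", numberOfFields=str(len(field_names)))
    mining_schema = ET.Element("MiningSchema")
    for name in field_names:
        # Find the first rule whose prefix matches this field name.
        for prefix, optype, data_type in prefix_rules:
            if name.startswith(prefix):
                break
        else:
            raise ValueError(f"no prefix rule matches field {name!r}")
        ET.SubElement(data_dict, "DataField",
                      name=name, optype=optype, dataType=data_type)
        # Everything except the target field is treated as an input.
        usage = "predicted" if name == target else "active"
        ET.SubElement(mining_schema, "MiningField", name=name, usageType=usage)
    return data_dict, mining_schema


# Illustrative field names only.
fields = ["w_0", "w_1", "w_2", "label"]
rules = [("w_", "continuous", "double"), ("label", "categorical", "string")]
dd, ms = expand_fields(fields, rules, target="label")
print(ET.tostring(dd, encoding="unicode"))
print(ET.tostring(ms, encoding="unicode"))
```

Any consumer would still see ordinary, fully expanded PMML, so this works around the verbosity rather than removing it, which is why I am asking whether something equivalent exists inside the syntax.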

edited Sep 03 '11 at 15:10

asked Sep 02 '11 at 16:43

DavidDLewis

Would all the fields need to be explicitly referenced in the models as well? If so, you'd also want to "shrink" parts of the model definition, such as the [MiningSchema](http://www.dmg.org/v4-0-1/MiningSchema.html). – Derek Ploor Sep 03 '11 at 08:14
Derek - Good point. I've edited the question to make clear the same problem shows up in the MiningSchema. – DavidDLewis Sep 03 '11 at 13:28

0 Answers