Is there a way to represent datasets using Predictive Model Markup Language?

Question

This seems like a dumb question, but does PMML have a way to represent a data set? And if not, why not? There's detailed support for defining the name, type, and legal values of each feature that will appear in the data, as well as which of those features to use in the trained model, and what transformations to apply to the values of those features before using them.

However, I haven't been able to find an example, or any discussion, of representing raw data (either training sets or test data) in PMML. Or for that matter representing data in XML at all. PMML in Action (pp. 7-8) explicitly says "...raw input data is usually formatted as a flat file in which columns represent data fields and rows records or transactions." The sample datasets at the DMG website are all in .csv format.

This omission seems particularly strange, since PMML has a model format for support vector machines, which are (almost) just weighted sets of examples. So it would have taken only the tiniest bit of effort to define a format for training sets and test sets.

score 2 · Accepted Answer · answered Aug 08 '11 at 07:31

I think looking at what the letters of PMML stand for provides some insight into this. PMML is Predictive Model Markup Language. From PMML in Action, p. 5:

PMML is an XML-based language... which provides a way for applications to define statistical and data mining models and to share models between PMML-compliant applications.

Emphasis added by me to the word "model," and notice the lack of the word data here. Data storage is a separate issue.

This means that in the case of support vector machines, the PMML needs to define the support vectors, kernel function, and whatnot, but the storage of the data is a separate responsibility. Even if the support vectors are just weighted sets of examples, in general, they're not going to be the entire raw data set.
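To make the support vector point concrete, here is a minimal Python sketch of scoring with a kernel SVM. It is a generic illustration, not PMML's own scoring procedure: the RBF kernel, the function names, and all the numbers are assumptions chosen for the example. What it shows is that scoring only needs the retained support vectors, their coefficients, a bias, and the kernel parameters, not the full training set.

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial basis function kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def svm_decision(x, support_vectors, coefficients, bias, gamma):
    """Weighted sum of kernel values against the support vectors, plus a bias."""
    total = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, coefficients)
    )
    return total + bias

# Hypothetical model: two support vectors kept from a (possibly much larger)
# training set. None of these values come from a real PMML file.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
coefficients = [1.0, -1.0]
bias = 0.0
gamma = 1.0

score_near_first = svm_decision((0.0, 0.0), support_vectors, coefficients, bias, gamma)
score_near_second = svm_decision((1.0, 1.0), support_vectors, coefficients, bias, gamma)
```

A positive score puts the point on the side of the first support vector, a negative score on the side of the second. Training examples that did not end up as support vectors never appear in the calculation, which is why a model file can leave them out.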

PMML separates the responsibility of modeling from application. The consumer gets to be agnostic about the model building process—no matter how simple or complicated it may be—and only has to worry about how to read the PMML and apply it.

Exceptions where PMML Stores Data

Two exceptions where raw data can be stored in PMML are time series models and model verification.

Raw data can be stored in time series models, because these models tend to be defined from (at least a portion of) the original data from which the model was built. For example, with a moving average model, at least the first few time steps after deployment will need some of the previous data points in order to proceed.
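The moving average case can be sketched in a few lines of Python. This is an illustration under assumptions, not how PMML encodes a time series model: the function name, the window size, and the stored values are all made up. It shows why the last few observations have to travel with the model.

```python
from collections import deque

def moving_average_forecaster(history, window):
    """Build a forecaster that averages the most recent `window` observations.

    `history` stands in for the raw data points shipped alongside the model.
    """
    if len(history) < window:
        raise ValueError("need at least `window` stored observations to start")
    recent = deque(history[-window:], maxlen=window)

    def forecast(new_observation=None):
        # Fold in a fresh observation (if any), dropping the oldest one.
        if new_observation is not None:
            recent.append(new_observation)
        return sum(recent) / window

    return forecast

# Hypothetical tail of the training series carried with the model.
stored_points = [10.0, 12.0, 14.0]
forecast = moving_average_forecaster(stored_points, window=3)

first = forecast()       # uses only the stored points
second = forecast(16.0)  # one live observation replaces the oldest stored one
```

Without `stored_points`, the first forecast after deployment would have nothing to average over, so the consumer would have to wait for `window` live observations before producing any output.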

Raw data is also stored in the Model Verification element. This allows the consumer of the PMML to check a model implementation against tests provided by the PMML producer.
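As a rough sketch of how a consumer might use this, the Python below parses a hand-written ModelVerification fragment and checks a stand-in model against it. The element names (ModelVerification, VerificationFields, VerificationField, InlineTable, row) follow the PMML schema, but the fragment omits the PMML namespace for brevity, and the field names `x` and `predicted_y`, the data values, and `toy_model` are hypothetical. The tolerance check is a simplified relative comparison, not a full implementation of the specification's rules.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; a real PMML file would carry
# the PMML namespace and sit inside a model element.
PMML_FRAGMENT = """
<ModelVerification recordCount="2" fieldCount="2">
  <VerificationFields>
    <VerificationField field="x"/>
    <VerificationField field="predicted_y" precision="1E-6"/>
  </VerificationFields>
  <InlineTable>
    <row><x>1.0</x><predicted_y>3.0</predicted_y></row>
    <row><x>2.5</x><predicted_y>6.0</predicted_y></row>
  </InlineTable>
</ModelVerification>
"""

def toy_model(x):
    """Stand-in for the consumer's scoring implementation (hypothetical)."""
    return 2.0 * x + 1.0

def verify(fragment, model):
    """Return a list of (row_index, expected, actual) for rows that mismatch."""
    root = ET.fromstring(fragment)
    target = root.find("VerificationFields/VerificationField[@field='predicted_y']")
    precision = float(target.get("precision"))
    failures = []
    for index, row in enumerate(root.iter("row")):
        x = float(row.findtext("x"))
        expected = float(row.findtext("predicted_y"))
        actual = model(x)
        # Simplified relative tolerance around the expected value.
        if abs(actual - expected) > precision * abs(expected):
            failures.append((index, expected, actual))
    return failures

failures = verify(PMML_FRAGMENT, toy_model)
```

An empty `failures` list means the consumer's implementation reproduces the producer's recorded outputs on the embedded records. Note that these records are a small test sample, not the training set.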

I suspect you're right, since I haven't gotten any answers to the contrary. Still seems silly to me: in operational settings problems routinely arise because of inconsistencies between the way that data is represented during training of a model and its use. PMML is an invitation to make those problems worse by sending models off into the world detached from the context in which their training data was generated. Ah well. — DavidDLewis, Aug 09 '11 at 01:09
