11

So, for instance, here are the definitions that I get from standard textbooks:

Variable - a characteristic of a population or sample, e.g. the price of a stock or a grade on a test

Data - actual observed values

So for a two column report [Name | Income] the column names would be the variables and the actual observed values {dave | 100K} , {jim | 200K} would be the data

So if I say that the [Name] column is nominal data and that [Income] is ratio data, wouldn't it be more accurate to describe each as a type of variable rather than a type of data, as most textbooks do? I understand that this might be semantics, and that's fine if that's all there is to it. But I fear that I might be missing something here.

Glen_b
User 42
  • Doesn't strike me as a meaningful difference; I'd consider either phrasing acceptable, personally. The definition of "variable" seems a little off though. – Nick Stauner Jul 09 '14 at 22:12
  • @Nick I believe that if we translate the colloquial "characteristic" to the mathematical "real-valued function," we get part of the definition of a random variable. (The missing part, of course, is measurability with respect to a sigma field on the population.) Normally, though, we would translate "characteristic of a sample" into the technical term *statistic*: maybe that's what you are referring to as being a "little off." With these translations, variables do not have "types" at all in Stevens' sense (we can only distinguish discrete from continuous *distributions*)--but some data can. – whuber Jul 09 '14 at 22:30

2 Answers

18

Stevens' scale typology isn't necessarily some inherent characteristic of the variables, nor even data itself, but of how we treat the information - of what we're using it to mean.

In some circumstances, exactly the same value may be considered ratio, interval, ordinal or nominal, depending on what we're doing with it - it's a matter of what meaning we give the values, which can change from one analysis to the next. Stevens' typology has some value, but it doesn't do to be overly prescriptive about it.

This issue of the importance of scale as meaning dates back at least to Lord (1953), who offered an example where there were both nominal and interval interpretations of the same set of numbers.

This point was even more clearly made by Velleman and Wilkinson (1993), who offer an example of people receiving consecutive numbered tickets on entry to a reception with a prize being awarded to one of the tickets; depending on the use being made of the numbers on the tickets, they have interpretations on all four scales.

So, for example 'did I win?' is a question treating the number as nominal, while 'did I arrive too early to get the winning ticket?' is a question that treats it as ordinal; on the other hand (and I don't think this one is in the paper) using 5 random ticket numbers in order to estimate the number of people in the room would treat them as ratio (e.g. if there were 4 randomly drawn numbers that got consolation prizes, you'd have 5 random numbers altogether from which to estimate total attendance).
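As a rough illustration of the three uses just described, here is a minimal Python sketch. The function names and the ticket numbers are hypothetical, and the attendance estimator is one standard choice (the "German tank" estimator, which assumes tickets are numbered consecutively from 1 and the sample is drawn uniformly without replacement); the paper does not prescribe it.

```python
def won(ticket, winning_ticket):
    # Nominal use: only identity matters (same label or not).
    return ticket == winning_ticket


def arrived_before_winner(ticket, winning_ticket):
    # Ordinal use: only the order of the numbers matters.
    return ticket < winning_ticket


def estimate_attendance(sampled_tickets):
    # Ratio use: the magnitudes themselves carry information.
    # Assumes tickets are numbered 1..N and sampled uniformly without
    # replacement; uses the estimator m + m/k - 1, where m is the largest
    # sampled number and k is the sample size.
    k = len(sampled_tickets)
    m = max(sampled_tickets)
    return m + m / k - 1


# Hypothetical data: one winning ticket plus four consolation prizes.
tickets = [19, 40, 42, 60, 77]
winning_ticket = 42

print(won(40, winning_ticket))                    # nominal question
print(arrived_before_winner(40, winning_ticket))  # ordinal question
print(estimate_attendance(tickets))               # ratio question
```

The same five integers feed all three functions; only the operations applied to them change, which is the sense in which the scale belongs to the use rather than to the numbers.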

They argue that "good data analysis does not assume data types", "Stevens’s categories do not describe fixed attributes of data", "Stevens’s categories are insufficient to describe data scales" and "Statistics procedures cannot be classified according to Stevens’s criteria" (indeed each statement is also a section title).

Criticisms were also offered in several places by Tukey (e.g. in chapter 5 of Mosteller and Tukey's 1977 book Data analysis and regression); Mosteller and Tukey offered a typology - names, grades (ordered labels), ranks (starting from 1, which may represent either the largest or smallest), counted fractions (bounded by zero and one, these include percentages), counts (non-negative integers), amounts (non-negative real numbers), balances (unbounded, positive or negative values).

In my own work, I've seen situations where severe problems with analysis were caused by people failing to appreciate the great difference between variables relating to levels (sometimes called 'stock' variables) and flows - a simple example of these types is the difference in the kinds of analysis appropriate for the amounts of water actually in a storage tank in each of a sequence of periods, and the amount of water flowing into it. These would (in some of those cases) both be sub-categories of the Mosteller and Tukey 'amounts' type (and in those same cases, both ratio variables in Stevens' scheme), indicating that issues of typology may be quite subtle, but can still critically impact appropriate analyses.
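One way to see why levels and flows call for different treatment is a small simulation. The sketch below is only illustrative: the starting level, the zero-mean normal net flows, and the helper `lag1_autocorrelation` are all assumptions made for the example, and tank capacity limits are ignored. It relies on nothing more than the bookkeeping identity that each period's level is the previous level plus that period's net flow.

```python
import random


def lag1_autocorrelation(xs):
    # Sample correlation between each value and the next one in the series.
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical net flow into the tank in each period (independent draws).
flows = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Level at the end of each period = previous level + that period's net flow.
levels = []
level = 1000.0  # hypothetical starting level
for flow in flows:
    level += flow
    levels.append(level)

flow_r = lag1_autocorrelation(flows)
level_r = lag1_autocorrelation(levels)
print(f"lag-1 autocorrelation of flows:  {flow_r:.3f}")
print(f"lag-1 autocorrelation of levels: {level_r:.3f}")
```

Even though the flows here are generated independently, the accumulated levels are strongly serially dependent, so a method that assumes independent observations could be reasonable for one series and badly wrong for the other.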

P.F.Velleman and L.Wilkinson (1993),
"Nominal, Ordinal, Interval, and Ratio Typologies are Misleading,"
The American Statistician, vol.47 no.1 pp.65-72

(a working version seems to be available on the second author's web page)

Lord, F. (1953),
"On the statistical treatment of football numbers,"
American Psychologist, 8, pp.750-751

(The year of this paper is given wrongly in the references of the version of the Velleman and Wilkinson paper I linked to, but correctly referred to in the body of the paper)

Glen_b
  • Thanks. Very thorough answer. I was thinking along those lines but when researching this stuff many times they make it seem as if it is concrete and consensus has been reached. That's why I ended up here. – User 42 Jul 10 '14 at 19:59
  • Stevens' typology has been debated and disputed since it was first published. It's a sometimes-helpful framework, not a theorem. – Glen_b Jul 11 '14 at 05:41
  • Is there any "new favorite" besides Stevens and Mosteller? In the levels/flows example, if I understand you correctly, both have the same type, yet need to be treated differently? Can you explain this difference? And how would e.g. log transformation of a value fit into this typology? Thanks. – Erich Schubert Sep 30 '17 at 21:08
  • 1. I don't know of any recent attempts to make one -- and I think that they're not necessarily useful since they tend to shoehorn people into less appropriate analyses (see Lord's paper for a toy example but the consequences for analyses are very real -- those lists of analysis by type cause no end of terrible statistical analysis, while cutting out vast swathes of statistics from the possibility of consideration in appropriate situations). .. ctd – Glen_b Sep 30 '17 at 21:48
  • ctd... 2. One example of how levels and flows are quite different: Note that if you looked at the level each day, today's level would be the previous level plus the intervening in- or out-flow (or the sum of both, if both are possible). So level measurements are necessarily dependent, often highly so. It cannot make sense to treat them as if they were independent -- yet I see people do it all the time. 3. I'm not sure quite what you're asking with the log thing. Can you be more explicit about that one? Which typology (note that I mention more than one)? – Glen_b Sep 30 '17 at 21:55
  • I'm wondering about data transformations in general. Say we have a value to capture the earnings or value of a company. There is a meaningful zero, so by Stevens' scheme it would be a ratio variable. Yet for many analyses it may be worthwhile to use log(earnings) or log(worth) instead, as many effects don't appear to be linear in the original variable. To my understanding, this is not one of Stevens' "permissible transformations", and is thus discouraged. – Erich Schubert Oct 01 '17 at 08:13
  • In general, I'm interested in how to teach this to students: to reflect on how they treat variables, and also to try different preprocessing. So I'm interested in any material that is helpful in teaching these differences (an if-then classification like Stevens' is not the best way to make people consider and evaluate different approaches to the same data...) – Erich Schubert Oct 01 '17 at 08:15
  • Take a look at what Stevens actually says about ratio scales and logarithms ... "*All types of statistical measures are applicable to ratio scales, and only with these scales may we properly indulge in logarithmic transformations*" (p680, left column, top of page) ... – Glen_b Oct 01 '17 at 09:02
  • Interesting. Because with zero and negative values, a log transform may be undefined. But I'm more looking for the general picture. E.g., also how we might interpret the common TF-IDF transforms in this framework (and yes, I have read Spärck Jones and Robertson on this specific transform). – Erich Schubert Oct 01 '17 at 10:46
  • The extent of my replies to you in this comment thread is well beyond clarification of my answer here; they constitute a reasonably lengthy answer to another question already. Perhaps you should ask that one, and then perhaps ask this new one about TF-IDF (which is about term frequencies in text) – Glen_b Oct 01 '17 at 22:32
  • I had intended to ask a new Q, but found it to be similar to the above question. Either way, I have since been reading **Hand, D. J. (1996). Statistics and the theory of measurement. Journal of the Royal Statistical Society. Series A (Statistics in Society), 445-492.** and its discussion, which shows how philosophical and open this question is; there probably is not a "correct" answer. **Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. MIT press** chapter 2 may be a suitable starting point for teaching this without going too deep into philosophical aspects. – Erich Schubert Oct 02 '17 at 09:23
  • Thanks for those references. I'd tend to agree that there isn't really a single "correct" answer. – Glen_b Oct 02 '17 at 09:32
1

The type of the data is related but not identical to the type of the variable. In most cases they are the same, but they don't have to be.

For example, suppose you collect N samples from a normal distribution. You would think of them as numerical (ratio or scale) data. But I can also say it's a categorical variable with N different categories, with a frequency of 1 for each category. It looks stupid, but it's also a valid variable.
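A minimal Python sketch of that example, with an arbitrary sample size and seed chosen purely for illustration: the same draws are first summarised as numbers and then tabulated as if each distinct value were a category label.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed, chosen arbitrarily
samples = [rng.gauss(0.0, 1.0) for _ in range(10)]  # N = 10 normal draws

# Numerical treatment: arithmetic on the values is meaningful.
sample_mean = sum(samples) / len(samples)

# Categorical treatment: every distinct value is just a label to be counted.
category_counts = Counter(samples)

print(sample_mean)
print(len(category_counts), set(category_counts.values()))
```

With continuous draws, repeated values are essentially impossible, so the table has N categories with one observation each: a valid tabulation, but not an informative one.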

SmallChess
  • This seems a little at odds with Stevens (who is credited with formulating this typology), who wrote "the real issue is the meaning of measurement." Although you may always elect to treat such data as nominal, that does not make them nominal in Stevens' estimation. His paper is available at http://gaius.fpce.uc.pt/niips/novoplano/mip1/mip1_201314/scales/Stevens_1946.pdf . – whuber Jul 10 '14 at 13:18