2

I have this dataset of flows:

Source entity | Dest entity | Traffic | Cost | Source location | Dest location | Direction | more independent variables (mostly nominal)

Each entity also has a unit price ($ per unit of traffic), which comes from a discrete set {p1, p2, p3} and can be treated as ordinal or continuous. I want to model this unit price using regression analysis.

The problem I'm facing is that the price is assigned to an entity (which can be the source or the destination in the table above), not to a flow, which is what each row represents. I'm assuming that flows are independent of each other (i.i.d.).

Would it be wise to attribute the unit price of an entity to each of its flows?

I know there is the option of aggregating the data by entity, but I'm afraid that could be disadvantageous because:

  1. It is likely that information would be lost
  2. Dependency is introduced among rows
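
The two layouts being weighed here can be sketched in plain Python. This is only an illustration: the entity names, field names, prices, and the helper functions `attach_price` and `aggregate_by_entity` are all hypothetical and not part of the real dataset.

```python
from collections import defaultdict

# Hypothetical toy data: entity names, field names, and prices are made up.
flows = [
    {"source": "A", "dest": "B", "traffic": 10.0, "cost": 4.0},
    {"source": "A", "dest": "C", "traffic": 30.0, "cost": 9.0},
    {"source": "B", "dest": "C", "traffic": 20.0, "cost": 5.0},
]
unit_price = {"A": 1.0, "B": 2.0, "C": 3.0}


def attach_price(flows, unit_price, role="source"):
    """Option 1: copy the entity-level price onto every flow of that entity."""
    return [{**f, "price": unit_price[f[role]]} for f in flows]


def aggregate_by_entity(flows, unit_price, role="source"):
    """Option 2: one row per entity, with summed traffic and cost."""
    totals = defaultdict(lambda: {"traffic": 0.0, "cost": 0.0, "n_flows": 0})
    for f in flows:
        t = totals[f[role]]
        t["traffic"] += f["traffic"]
        t["cost"] += f["cost"]
        t["n_flows"] += 1
    return {e: {**t, "price": unit_price[e]} for e, t in totals.items()}


flow_level = attach_price(flows, unit_price)           # one row per flow
entity_level = aggregate_by_entity(flows, unit_price)  # one row per source entity
```

The `role` argument reflects that an entity can appear as either source or destination; which role the price should follow is a modelling choice the sketch does not settle.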

Also, which models could make sense? I'm inclined towards regression because of its simplicity.

Appreciate any help/references here. Thanks

PS. I'm already confused while dealing with this many nominal variables.

tool.ish
  • Within groups determined by entity, do most other variables simply stay constant or do they move around? For example, if the data were (year, person_id, first_name) then within person_id groups the first_name would not change a lot. I suspect that if you leave it disaggregated you will have some pathological dependencies within entity groups. – Bruno Apr 25 '14 at 23:55
  • @bruno within a group of source entities the variables would change. For instance, a source sends traffic to multiple entities within multiple countries. Within the groups of 'source-dest entity pairs' all other variables would essentially be constant, which can be considered as an index here. – tool.ish Apr 26 '14 at 09:32
  • Why not treat it as an ecosystem or network of interacting entities where each entity is a node and the flows between them are edges? Graph theory is well developed and designed to facilitate this kind of analysis. The associations can be captured with a visual like the one in this link: http://blog.euromonitor.com/2014/06/the-patterns-of-world-trade.html – Mike Hunter Oct 16 '15 at 14:02

2 Answers

2

In addition to loss of statistical power, aggregation also adds potentially severe bias, particularly when there are non-linear associations between variables. As a rule, I vehemently oppose aggregation, unless there is a theoretical justification for doing so.

The non-linear association issue comes up all over the place in actual data in my experience.
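
A small numeric sketch of the non-linearity point, under an assumed individual-level relationship y = x² chosen purely for illustration (the groups and values are made up): two groups with the same mean of x end up with different means of y, so the group means alone cannot recover the individual-level relationship.

```python
from statistics import mean


def f(x):
    # Assumed individual-level relationship, chosen only for illustration.
    return x ** 2


# Hypothetical groups: same mean of x, different spread.
groups = {"A": [0.0, 4.0], "B": [2.0, 2.0]}

summary = {}
for name, xs in groups.items():
    ys = [f(x) for x in xs]
    summary[name] = {
        "mean_x": mean(xs),          # aggregated predictor
        "mean_y": mean(ys),          # aggregated response
        "f_of_mean_x": f(mean(xs)),  # what the individual-level rule predicts at the mean
    }

for name, row in summary.items():
    print(name, row)
```

Both groups have a mean x of 2, yet group A has a mean y of 8 while group B has a mean y of 4; only group B matches f(2) = 4.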

Another example: aggregating individual data to census tracts is more or less meaningless because (1) inferences based on census-tract level statistics cannot be extended to non-census tracts without committing ecological, atomistic, or other cross-level fallacies, and (2) generally speaking, neither resident experiences, nor policy/planning responses are enacted at the level of census-tracts. As a contrasting example, aggregating to police precinct when drawing inferences makes much sense, since specific policing policies will vary from precinct to precinct.

Alexis
  • One may also consider an ecological relationship to be useful when there is **practical justification** for doing so. If an intervention can only be applied at the group/aggregated level, the analysis should be done at the ecological level. Otherwise, if individual-level results are used to infer ecological behaviour, one may commit the **individualistic fallacy**. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4350299/ – KubiK888 Mar 11 '16 at 23:03
  • @KubiK888 Yep. The individualistic fallacy is also sometimes called the "atomistic fallacy." See, for example, Diez-Roux, A. V. (1998). Bringing context back into epidemiology: variables and fallacies in multilevel analysis. *American Journal of Public Health*, 88(2):216–222. To paraphrase Geoffrey Rose: the causes of individual experiences are not the causes of population rates of individual experiences. – Alexis Mar 11 '16 at 23:12
0

I think you should aggregate. If the price is assigned to an entity, then the flow variation within entities will not matter. Aggregating makes sense particularly if the price is reviewed periodically. In that case, the price might reflect some aggregate measure of flow (for example, if the average flow has been high they pay more, or they pay by maximum flow, etc).

I do not think your disaggregated setup is right, because the dependent variable does not vary within certain groups of observations. That is, if the data are left disaggregated, price is perfectly dependent within each entity group.
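
A minimal sketch of that point, using made-up entity names and prices: once the entity price is copied onto each flow, it has zero variance within every entity group, so the number of distinct price observations equals the number of entities, not the number of rows.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical flow-level rows: (entity, unit price copied from the entity).
rows = [
    ("A", 1.0), ("A", 1.0), ("A", 1.0),
    ("B", 2.0), ("B", 2.0),
    ("C", 3.0),
]

# Group the copied prices by entity.
by_entity = defaultdict(list)
for entity, price in rows:
    by_entity[entity].append(price)

# Population variance of price inside each entity group.
within_var = {e: pvariance(prices) for e, prices in by_entity.items()}

n_rows = len(rows)           # rows the regression would see
n_entities = len(by_entity)  # distinct price observations actually available

print(within_var, n_rows, n_entities)
```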

Bruno