Creating robust intervals from highly skewed data?

Question

I am using factor analysis to model the underlying structure of social capital. My data consist of individual responses reporting how often each respondent interacted with other individuals in a specific year, measured by 10 different variables. For each variable, I intend to discretize the counts into five intervals, with qualitative labels ranging from "never" to "very often".

The sample size is 496 individuals. In one specific variable, for example, 74% have zero interactions per year, while 23% have between one and six interactions per year. I also have "outlier" respondents, for example one observation with 96 interactions and one with 260. The source of my confusion is how heavily skewed the sample is toward zero interactions, combined with the few outliers. I believe this prevents me from using conventional bin-sizing rules.
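To make the difficulty concrete, here is a minimal sketch on synthetic counts built to resemble the proportions above. The data are an assumption, not the real survey responses: only the sample size, the share of zeros, the 1 to 6 range, and the two extreme values (96 and 260) come from the description, and the remaining tail values are invented. The helper names `equal_width_counts` and `zero_plus_quartiles`, and the three middle labels, are hypothetical. The last helper shows one possible workaround (a dedicated zero category plus quartiles of the positive counts), offered as an illustration rather than a recommended rule.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Synthetic counts mimicking the proportions described above (an assumption,
# not the real survey data): 496 respondents, about 74% zeros, about 23% with
# 1 to 6 interactions, and a small tail that includes 96 and 260.
counts = (
    [0] * 367
    + [1 + i % 6 for i in range(114)]
    + [7 + 3 * i for i in range(13)]  # hypothetical mid-range tail values
    + [96, 260]
)


def equal_width_counts(values, k=5):
    """Count observations in k equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    tally = [0] * k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        tally[idx] += 1
    return tally


def zero_plus_quartiles(values, labels):
    """Give zero its own category, then split positive values at their quartiles."""
    positives = [v for v in values if v > 0]
    cuts = quantiles(positives, n=4, method="inclusive")  # three cut points
    return [
        labels[0] if v == 0 else labels[1 + bisect_left(cuts, v)]
        for v in values
    ]


# Only "never" and "very often" come from the question; the middle labels
# are placeholders chosen for this sketch.
LABELS = ["never", "rarely", "sometimes", "often", "very often"]

equal_width = equal_width_counts(counts)
quintile_cuts = quantiles(counts, n=5, method="inclusive")
label_counts = Counter(zero_plus_quartiles(counts, LABELS))

print("equal-width bin counts:", equal_width)
print("quintile cut points:   ", quintile_cuts)
print("zero + quartile labels:", dict(label_counts))
```

On this synthetic sample, equal-width bins place 494 of the 496 observations in the first interval, and the first three quintile cut points all equal zero, so neither rule yields five usable categories. Separating zero first avoids both problems here, although with real data heavy ties among the positive counts could still produce duplicate cut points.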

I am aware of a similar answer posted by Kevin, but I believe the problem here is different, since I want to use the interval frequencies to feed my model.

To me it seems that you're observing network effects in your data. I would suggest looking at a graphical model - your 'outliers' seem to be individuals who would be represented by nodes with high degree in the social graph. This can happen naturally in certain models where the connections between individuals are modeled as some kind of random variable. You could then try to model the distribution of the 'connection' random variable, and your problem becomes one of density estimation as a first step before further analysis. — Don Walpola, Feb 12 '20 at 00:59
Thanks for your comment, Don. Unfortunately I can't use a social networks approach because I don't have information about the specific interactions carried out among individuals. This prevents me from estimating nodal density. Instead, the factor analysis approach I am using focuses on identifying the different dimensions of social capital, based on the variance of the respondents' answers to an established set of SC questions. This is why the rescaling I want to carry out, from frequency to Likert scale, is so important to get right. — Chuy Pulido, Feb 12 '20 at 02:36


0 Answers