Cluelessness Disclaimer
I'm a statistics noob so if at all possible, please don't stone me. Also write slowly and in simple terms.
I'm wondering what the relation is between the internal consistency of a scale, as usually measured by Cronbach's $\alpha$, and unidimensionality.
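(To make sure we're all talking about the same $\alpha$: as far as I understand it, it is computed from the item variances and the variance of the total score,

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^2_{Y_i}}{\sigma^2_X}\right),$$

where $k$ is the number of items, $\sigma^2_{Y_i}$ is the variance of item $i$ and $\sigma^2_X$ is the variance of the sum score. This is also what the script at the end of this post computes, so please correct me if I already went wrong here.)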
The concrete reason I'm asking this fundamental question is that I tried to split a scale into three factors, which separated quite nicely using PCA with a direct oblimin rotation. There is almost no intercorrelation between the factors, and the worst loading in any of the resulting subscales is $.59$.
Now, while the first new scale that fell out of this process has a nice $\alpha$ of $.81$ (the total scale had $.5$), the second scale has a whoppingly bad $\alpha$ of less than frickin' $-6.5$! I've seen my fair share of negative alphas in my short statistical life, but this is unprecedented.
Now, my naïve understanding is that internal consistency measures whether there is an underlying construct at all, while unidimensionality of a scale means that this underlying construct has only one dimension. This would make internal consistency a prerequisite for unidimensionality. This SPSS FAQ item at least seems to imply this, and this section in Wikipedia, which I found while Googling, might imply anything because it's really poorly written, but it refers to this paper, which unfortunately goes over my head.
There is also this question, Assessing reliability of a questionnaire: dimensionality, problematic items, and whether to use alpha, lambda6 or some other index?, which I already found and which I feel might contain the answer I'm seeking, but it doesn't fully reveal itself to me.
Based on this understanding, however, I started with $\alpha$ and tried to work my way from there. What I did is somewhat … unorthodox (which probably translates to "wrong", but I hope we'll hear more on that in an answer to this post). I did a global optimization of $\alpha$: I went through all possible partitions of the items of my scale and computed the average $\alpha$ over the subscales for each partition. The funny thing is, the best solution for three factors (three and four factors had an identical best average $\alpha$ of $.71$) actually makes sense in terms of interpretability (the whole PCA shebang didn't). I'm pretty sure this is a coincidence, though, right?
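To illustrate what I mean by "all possible partitions" (with generic item numbers, not my actual items): for three items $\{1, 2, 3\}$ there are five partitions,

$$\{1,2,3\}, \qquad \{1\}\{2,3\}, \qquad \{2\}\{1,3\}, \qquad \{3\}\{1,2\}, \qquad \{1\}\{2\}\{3\},$$

and for each partition I compute $\alpha$ for every subset and then average those values. In the script below I skip partitions that contain a subset with fewer than two items, since $\alpha$ isn't even defined for a single item.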
Edit:
After StasK's comment, I decided to include my code for the $\alpha$ optimization in case others might find it useful. Just put your scale into a file named data.csv, with columns as variables and rows as cases.
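Just to show the layout (these numbers are completely made up), a scale with four items answered by three people would be a file like

4,3,5,4
2,2,1,3
5,4,4,5

with no header row, since the script below reads the file as plain numbers.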
This script's not pretty and it's not fast; it did the job I personally wanted it to do. The optimization of my 12 items took almost 12 minutes on my Core i3. The code could certainly be optimized, but that probably won't change the fundamental problem: optimizing a 30-item scale with this method would take not 30 minutes, as one might naïvely assume, but roughly 300 times the age of the universe. Blindly partitioning your 100-item questionnaire is therefore not a good idea with this method. The number of partitions generated by partitions() can be checked here: http://www.wolframalpha.com/input/?i=bell+number+of+12 (just replace 12 by your number of items).
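If you don't feel like going to Wolfram Alpha, here is a small sketch of my own (not part of the script) that computes the same Bell numbers via the Bell triangle:

def bell(n):
    # number of partitions of a set with n elements (Bell number B_n),
    # built row by row with the Bell triangle recurrence
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]  # each row starts with the last entry of the previous one
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
    return row[-1]  # B_n is the last entry of row n

print bell(12)  # 4213597 possible partitions of 12 items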
If you only want to generate partitions with a certain number of factors, replace for p in partitions(range(len(variances))): by for p in n_partitions(range(len(variances)), 3): for, say, 3 factors. This won't make it any faster, though (a considerable speedup would probably be possible here if the search were restricted to that number of subsets in the first place, instead of generating all partitions and filtering them afterwards).
In the long run, only something cleverer than testing all possible partitions will do for larger scales.
#!/usr/bin/env python
from numpy import *
import sys

def variance(data):
    """Population variance (denominator n) of a sequence of numbers."""
    n = len(data)
    variance = 0.0
    mean = float(sum(data))/n
    for x in data:
        variance += (x - mean)**2
    return variance/n

# from http://stackoverflow.com/q/2037327/1050373
def partitions(set_):
    """Generate every partition of the items in set_ as a list of sets."""
    if not set_:
        yield []
        return
    for i in xrange(2**len(set_)/2):
        # the bits of i decide which of two parts each item goes into;
        # parts[1] is then partitioned recursively
        parts = [set(), set()]
        for item in set_:
            parts[i&1].add(item)
            i >>= 1
        for b in partitions(parts[1]):
            yield [parts[0]]+b

def n_partitions(set_, n):
    """Generate only those partitions of set_ that consist of exactly n subsets."""
    for partition in partitions(set_):
        if len(partition) == n:
            yield partition

def alpha(data, variances, cols):
    """Cronbach's alpha of the subscale made up of the item columns in cols."""
    n = len(cols)
    cols = array(cols)
    data_cols = data.transpose()[cols]
    variances = variances[cols]
    data_rows = data_cols.transpose()
    row_sums = array([sum(row) for row in data_rows])  # sum score per case
    return n/float(n-1)*(1-sum(variances)/variance(row_sums))

data = genfromtxt('data.csv', delimiter=',')
print 'input data:'
print data

# pre-compute column (item) variances
variances = array([variance(col) for col in data.transpose()])

print 'overall alpha:'
print alpha(data, variances, range(len(variances)))

# brute force: walk through every partition of the items and keep track
# of the best average alpha over its subscales
best_alpha_average = float('-inf')
for p in partitions(range(len(variances))):
    lengths = [len(subset) for subset in p]
    if min(lengths) < 2:
        # skip partitions containing single-item subscales (alpha is undefined there)
        continue
    alphas = [alpha(data, variances, list(subset)) for subset in p]
    alpha_average = sum(alphas)/len(p)
    if alpha_average >= best_alpha_average:
        print 'new best:', alpha_average, alphas
        # +1 so that the printed item numbers are 1-based
        print 'partition of size ' + str(len(p)) + ': ' + str([list(array(list(subset))+1) for subset in p])
        best_alpha_average = alpha_average
        sys.stdout.flush()