I have two categorical variables (sample size 500+). One is a nominal independent variable which comprises 6 possible values in the dataset (the numbers 1 to 6 represent a category a respondent has been assigned to, based on the result of an existing 30-question, 5-point Likert scale used in my questionnaire). The other variable is an ordinal dependent variable where a respondent has chosen their top preference from a list of 26 items (1-26). I have run a crosstab in SPSS to find the top preference and its distribution for this ordinal variable across each of the 6 groups of the nominal independent variable. My question is: what test should I run to test for correlation between the two categorical variables? The nonparametric Spearman's rho is ordinal to ordinal, so that's out. The chi-squared test on the CROSSTABS output is telling me that over 80% of cells have expected frequencies less than 5.
Could you please explain why you consider your dependent variable to be ordinal? Is there some inherent ordering among the 26 items that isn't represented in the preferences stated by the respondents? If so, what is the nature of that ordering?
Commented Jul 24, 2019 at 20:49

There's a great answer here that discusses correlation between categorical data.
To summarize the main points from this answer (all credit goes to original poster, Alexey Grigorev):
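Observed counts:

            Male   Female   Total
City A        55       45     100
City B        20       30      50
Total         75       75     150

Expected counts if city and gender were independent (row total × column total / grand total):

            Male   Female   Total
City A        50       50     100
City B        25       25      50
Total         75       75     150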
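    # Observed counts, filled row by row (City A, then City B)
    tbl = matrix(data=c(55, 45, 20, 30), nrow=2, ncol=2, byrow=TRUE)
    dimnames(tbl) = list(City=c('A', 'B'), Gender=c('M', 'F'))

    # Pearson chi-square test without Yates' continuity correction
    chi2 = chisq.test(tbl, correct=FALSE)
    c(chi2$statistic, chi2$p.value)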
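The statistic returned by the R snippet can be checked by hand. This is a worked sketch that reads the 50/50/25/25 table as the counts expected under independence (an interpretation; the tables are not labeled in the text), with $O_{ij}$ and $E_{ij}$ as notation introduced here for the observed and expected counts:

$$
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N},
\qquad
E_{A,\text{Male}} = \frac{100 \times 75}{150} = 50
$$

$$
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= \frac{(55-50)^2}{50} + \frac{(45-50)^2}{50} + \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25}
= 0.5 + 0.5 + 1 + 1 = 3
$$

With $(2-1)(2-1) = 1$ degree of freedom, $\chi^2 = 3$ corresponds to a p-value of roughly 0.083, so this small example would not reject independence at the usual 5% level.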
Ultimately, I highly recommend visiting the link and walking through the answer yourself, as it is extremely informative.
answered Jul 24, 2019 at 18:12

Your dependent variable is not an "ordinal variable" in the sense of that word in statistical practice.
An ordinal variable is a categorical variable having some type of pre-defined order. For example, in this question an expert had pre-ranked a set of companies; regression was used to see how various characteristics of the companies might be related to that single expert's ranking. In your study, each Likert item is an ordinal variable with a pre-defined ordering such as least desirable to most desirable.
Although each of your 500+ respondents is implicitly ranking your 26 items, that doesn't make their set of responses into an ordinal variable. There is no agreed-on pre-ranking of those 26 items. What each respondent is doing is making an independent choice of 1 from among 26 possibilities.
So what you have is a contingency table for two unordered categorical variables, item versus respondent category. The answer from @TomHood is a good place to start.
You do have a potential problem with a large number of near-empty cells in your contingency table, so a standard chi-square test wouldn't be appropriate. It might be possible to perform a Fisher exact test on the contingency table. With this many categories and cases, however, the standard implementation might run out of memory on a personal computer. A simulation method, such as the simulate.p.value option of the R fisher.test function, might be needed.
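As a rough illustration of the simulation approach, here is a minimal R sketch. The data are simulated stand-ins (500 respondents assigned at random to 6 categories and 26 items), not your data, and the variable names `group`, `item` and `sparse_share` are made up for the example:

```r
# Simulated stand-in data (NOT the poster's data): 500 respondents,
# 6 respondent categories and 26 items, drawn independently at random.
set.seed(42)
n <- 500
group <- factor(sample(1:6, n, replace = TRUE), levels = 1:6)
item  <- factor(sample(1:26, n, replace = TRUE), levels = 1:26)

# 6 x 26 contingency table of respondent category versus chosen item
tbl <- table(group, item)

# Share of cells whose expected count under independence is below 5;
# this is the quantity the SPSS chi-square warning refers to.
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)
sparse_share <- mean(expected < 5)

# Fisher exact test with a Monte Carlo p-value (B simulated tables),
# which avoids enumerating every table with the observed margins.
fisher_mc <- fisher.test(tbl, simulate.p.value = TRUE, B = 10000)

c(sparse_share = sparse_share, fisher_p = fisher_mc$p.value)
```

With real data you would replace the two `sample()` calls with your own category and item columns. A larger `B` gives a more precise simulated p-value at the cost of run time.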
Having each respondent choose independently one from among 26 items is related to the multinomial distribution. You thus could consider modeling this with a multinomial regression. That might not help much if your only predictor variable was your classification of each respondent into 1 of 6 categories, but that approach might provide the flexibility to look in more detail at how your individual Likert items or scales correspond to respondent preferences, and thus might help improve your respondent-classification scheme for this application.
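A minimal sketch of what such a multinomial regression could look like in R, assuming the `multinom` function from the `nnet` package (a recommended package in standard R distributions). The data are again simulated placeholders rather than your data, and the names `d`, `fit0` and `fit1` are hypothetical:

```r
library(nnet)  # provides multinom()

# Simulated stand-in data (NOT the poster's data)
set.seed(42)
n <- 500
d <- data.frame(
  group = factor(sample(1:6, n, replace = TRUE)),   # respondent category
  item  = factor(sample(1:26, n, replace = TRUE))   # chosen item
)

# Null model: the same item probabilities for every respondent
fit0 <- multinom(item ~ 1, data = d, trace = FALSE)

# Item probabilities allowed to differ by respondent category
fit1 <- multinom(item ~ group, data = d, maxit = 500, trace = FALSE)

# Likelihood-ratio comparison of the two nested models.
# Caveat: with cells this sparse, the chi-square reference distribution
# is only a rough guide, so treat the p-value with caution.
lr_stat <- fit0$deviance - fit1$deviance
lr_df   <- fit1$edf - fit0$edf
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
```

The more interesting use is to replace `group` on the right-hand side with the underlying Likert scale scores, for example `item ~ scale1 + scale2` (hypothetical column names), to see which scales actually relate to the preferences.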