Suitable correlation test for two categorical variables

I have two categorical variables (sample 500+). One is a nominal independent variable which comprises 6 possible values in the dataset (the numbers 1 to 6 represent a category a respondent has been assigned to based on the result of an existing 30-question, 5-point Likert scale used in my questionnaire). The other variable is an ordinal dependent variable where a respondent has chosen their top preference from a list of 26 items (1-26). I have run a Crosstab in SPSS to find the top preference and distribution for this ordinal variable across each of the 6 groups of the nominal independent variable. My question is: what test should I run to test for correlation between the two categorical variables? The nonparametric Spearman's rho is ordinal to ordinal, so that's out. The chi-squared test on the CROSSTABS is telling me that over 80% of cells have expected frequencies less than 5.

correlation
categorical-data

Fergal O'Hanlon asked Jul 24, 2019 at 13:41

Could you please explain why you consider your dependent variable to be ordinal? Is there some inherent ordering among the 26 items that isn't represented in the preferences stated by the respondents? If so, what is the nature of that ordering?

Commented Jul 24, 2019 at 20:49

2 Answers

There's a great answer here that discusses correlation between categorical data.

To summarize the main points from this answer (all credit goes to original poster, Alexey Grigorev):

Whether two categorical variables are independent can be checked with the chi-squared test of independence. If the two variables are independent, the expected count in each cell of the contingency table is determined by the marginal totals (row total × column total / grand total). The test then measures how far the observed counts are from those expected counts.
In the given example, consider the two categorical variables gender (Male/Female) and city of residence (City A/City B):

              Male   Female   Total
    City A      55       45     100
    City B      20       30      50
    Total       75       75     150

Are gender and city independent? We can perform the chi-squared test to find out, using the following hypotheses:
Null hypothesis: they are independent.
Alternative hypothesis: they are associated in some way.
Under the null hypothesis, each expected count is the row total times the column total divided by the grand total (for example, 100 × 75 / 150 = 50 for males in City A), which gives the following table:

              Male   Female   Total
    City A      50       50     100
    City B      25       25      50
    Total       75       75     150
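As a small check on the expected-count table, the same values can be reproduced from the margins alone. This is a minimal R sketch; the names `observed` and `expected` are chosen here for illustration and do not come from the linked answer.

```r
# Observed counts from the example (rows: City A/B, columns: Male/Female)
observed <- matrix(c(55, 45, 20, 30), nrow = 2, byrow = TRUE,
                   dimnames = list(City = c("A", "B"), Gender = c("M", "F")))

# Expected count in each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(expected)
```

The same matrix is also available as the `expected` component of the object returned by `chisq.test(observed)`.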

Now we run the chi-squared test in R. The resulting p-value measures the strength of evidence against independence; it is not itself a measure of how strongly the two variables are associated.

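    # Observed counts: rows are cities, columns are genders
    tbl <- matrix(data = c(55, 45, 20, 30), nrow = 2, ncol = 2, byrow = TRUE)
    dimnames(tbl) <- list(City = c('A', 'B'), Gender = c('M', 'F'))
    # Pearson chi-squared test without Yates' continuity correction
    chi2 <- chisq.test(tbl, correct = FALSE)
    c(chi2$statistic, chi2$p.value)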

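The test yields a p-value of about 0.08: fairly small, but not small enough to reject the hypothesis of independence at the 5% level. Note that 0.08 is a p-value, not a correlation; quantifying the strength of the association requires an effect-size measure such as Cramér's V (about 0.14 for this table).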
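The p-value indicates how surprising the table would be under independence, not how strong the association is. As a hedged illustration, the sketch below recomputes the statistic by hand and adds Cramér's V, a common effect-size measure for contingency tables that is not part of the summarized answer; all variable names are illustrative.

```r
# Observed counts from the example and expected counts under independence
observed <- matrix(c(55, 45, 20, 30), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic, its degrees of freedom, and the p-value
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cramer's V: sqrt(chi^2 / (n * (min(rows, cols) - 1)))
n <- sum(observed)
cramers_v <- sqrt(chi_sq / (n * (min(dim(observed)) - 1)))

round(c(chi_sq = chi_sq, p_value = p_value, cramers_v = cramers_v), 3)
```

For this table the statistic is 3 on 1 degree of freedom, the p-value is about 0.08, and V is about 0.14, which would usually be read as a weak association.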

Ultimately, I highly recommend visiting the link and walking through the answer yourself, as it is extremely informative.

answered Jul 24, 2019 at 18:12

Your dependent variable is not an "ordinal variable" in the sense of that word in statistical practice.

An ordinal variable is a categorical variable having some type of pre-defined order. For example, in this question an expert had pre-ranked a set of companies; regression was used to see how various characteristics of the companies might be related to that single expert's ranking. In your study, each Likert item is an ordinal variable with a pre-defined ordering such as least desirable to most desirable.

Although each of your 500+ respondents is implicitly ranking your 26 items, that doesn't make their set of responses into an ordinal variable. There is no agreed-on pre-ranking of those 26 items. What each respondent is doing is making an independent choice of 1 from among 26 possibilities.

So what you have is a contingency table for two unordered categorical variables, item versus respondent category. The answer from @TomHood is a good place to start.

You do have a potential problem with a large number of near-empty cells in your contingency table, so a standard chi-square test wouldn't be appropriate. It might be possible to perform a Fisher exact test on the contingency table. With this many categories and cases, however, the standard implementation might run out of memory on a personal computer. A simulation method, such as the simulate.p.value option of the R fisher.test function, might be needed.
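A rough sketch of the simulation approach follows. The data are randomly generated stand-ins, not the questioner's data, and the names `group`, `item` and `tab` are illustrative; `B` is the number of Monte Carlo replicates and its value here is an arbitrary choice.

```r
set.seed(1)  # reproducible stand-in data and Monte Carlo draws

# Hypothetical stand-in data: 520 respondents, 6 groups, 26 items
n <- 520
group <- sample(1:6, n, replace = TRUE)
item  <- sample(1:26, n, replace = TRUE)
tab <- table(group, item)  # only observed values appear as rows/columns

# Share of cells with an expected count below 5 (the sparse-cell problem)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
mean(expected < 5)

# Fisher's exact test with a Monte Carlo p-value instead of full enumeration
res <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
res$p.value
```

Because the p-value is simulated, it varies slightly from run to run unless the seed is fixed; a larger `B` reduces that variation at the cost of run time.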

Having each respondent independently choose one from among 26 items is related to the multinomial distribution. You could thus consider modeling this with a multinomial regression. That might not help much if your only predictor variable is your classification of each respondent into 1 of 6 categories, but that approach might provide the flexibility to look in more detail at how your individual Likert items or scales correspond to respondent preferences, and thus might help improve your respondent-classification scheme for this application.
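One possible way to set up such a model in R is sketched below, assuming the `nnet` package (a recommended package bundled with most R installations) is available. The data frame `d` holds randomly generated stand-in data, and the choice of `multinom` and of a likelihood-ratio comparison is an illustration rather than something prescribed in this answer.

```r
library(nnet)  # assumed available; provides multinom()

set.seed(42)
n <- 520
# Hypothetical stand-in data; replace with the real group and top-choice columns
d <- data.frame(
  group = factor(sample(1:6, n, replace = TRUE)),
  item  = factor(sample(1:26, n, replace = TRUE))
)

# Intercept-only model versus a model with respondent group as predictor
fit_null  <- multinom(item ~ 1,     data = d, maxit = 500, trace = FALSE)
fit_group <- multinom(item ~ group, data = d, maxit = 500, trace = FALSE)

# Likelihood-ratio comparison of the two nested models
lr_stat <- fit_null$deviance - fit_group$deviance
lr_df   <- fit_group$edf - fit_null$edf
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(lr_stat = lr_stat, df = lr_df, p_value = p_value)
```

With many sparse cells the chi-squared reference distribution for this comparison is itself only approximate, so the result should be read with the same caution as the standard chi-square test. The formula could be extended with individual Likert items or scale scores as additional predictors.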