Inter-Rater Agreement with multiple raters and variables
Online calculator for inter-rater agreement with multiple raters, featuring Light's kappa, Fleiss's kappa, Krippendorff's alpha, and support for missing data. The marginal distributions of raters can be plotted, giving insight into how often each rater used a particular category.
The simplest measure of agreement between raters is the percentage of cases on which they agree (observed agreement). Often, the proportion of times raters would agree if they guessed on every case (expected agreement) is taken into account and corrected for (e.g. Cohen's Kappa). Also, measuring reliability or agreement with only two coders is rarely considered enough, and many studies involve more than three coders. This page offers the possibility to analyze inter-rater agreement for data:
- Using various coefficients;
- Involving more than two raters;
- Allowing for missing values;
- For multiple variables at once.
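As a minimal sketch of how observed and expected agreement combine into a chance-corrected score, the following computes Cohen's kappa for two raters. The function name `cohen_kappa` and the example labels are illustrative assumptions, not part of the calculator on this page.

```python
from collections import Counter


def cohen_kappa(r1, r2):
    """Return (observed, expected, kappa) for two raters' labels.

    Hypothetical helper for illustration; both lists must cover the
    same items in the same order.
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(r1)
    # Observed agreement: fraction of items given identical labels.
    a_obs = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement: probability that both raters pick the same
    # category by chance, given each rater's own label distribution.
    c1, c2 = Counter(r1), Counter(r2)
    a_exp = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if a_exp == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return a_obs, a_exp, float("nan")
    return a_obs, a_exp, (a_obs - a_exp) / (1 - a_exp)


rater_a = ["yes", "yes", "no", "no"]
rater_b = ["yes", "no", "no", "no"]
a_obs, a_exp, kappa = cohen_kappa(rater_a, rater_b)
print(a_obs, a_exp, kappa)  # 0.75 0.5 0.5
```

The raters agree on three of four items (0.75), half of which would be expected by chance, so the corrected score is 0.5.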
A popular measure for agreement between two coders is Cohen's Kappa [1,2]. When more raters are involved, several generalized or adapted versions of the Kappa statistic can be considered. Light expanded Cohen's kappa to multiple raters [6], an approach further generalized by Conger [7], by computing scores for all rater pairs before taking the arithmetic mean. Davies and Fleiss [5] in similar fashion proposed to average expected agreement for all rater pairs instead. A different approach was taken by Fleiss [4], who introduced a kappa-like measure which is a multi-rater generalization of Scott's pi [3]. Another commonly used measure, Krippendorff's alpha [8], is closely related but offers more flexibility.
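The two pairwise approaches described above can be sketched as follows, assuming complete data (no missing values). The function names and the example ratings are hypothetical. The sketch follows the description in the text, not necessarily the calculator's implementation.

```python
from collections import Counter
from itertools import combinations


def pair_agreement(r1, r2):
    """Observed and expected agreement for one pair of raters."""
    n = len(r1)
    a_obs = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    a_exp = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return a_obs, a_exp


def light_kappa(ratings):
    """Mean of Cohen's kappa over all rater pairs.

    Assumes no pair has expected agreement of exactly 1.
    """
    pairs = [pair_agreement(a, b) for a, b in combinations(ratings.values(), 2)]
    return sum((o - e) / (1 - e) for o, e in pairs) / len(pairs)


def davies_fleiss_kappa(ratings):
    """Average observed and expected agreement over pairs, then correct once."""
    pairs = [pair_agreement(a, b) for a, b in combinations(ratings.values(), 2)]
    mean_obs = sum(o for o, _ in pairs) / len(pairs)
    mean_exp = sum(e for _, e in pairs) / len(pairs)
    return (mean_obs - mean_exp) / (1 - mean_exp)


# Hypothetical data: three raters, four items, one label list per rater.
ratings = {
    "r1": ["a", "a", "b", "b"],
    "r2": ["a", "b", "b", "b"],
    "r3": ["a", "b", "b", "b"],
}
light = light_kappa(ratings)          # 2/3
davies = davies_fleiss_kappa(ratings)  # 7/11
```

The two scores differ here because the rater pairs do not all have the same expected agreement. When they do, the two measures coincide.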
Detailing the differences between these measures is beyond the scope of this page, but all of them consider pairwise agreement of raters. Light's kappa and that of Davies and Fleiss take the distributions of individual raters into account. The other two measures, Fleiss's kappa and Krippendorff's alpha, define the amount of agreement on a particular item as the proportion of agreeing rater pairs out of the total number of rater pairs. Importantly, for these two measures chance agreement is based on a single distribution reflecting the combined judgments of all coders.
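To illustrate the second family, here is a rough sketch of Fleiss's kappa for complete data: per-item agreement is the share of agreeing rater pairs, and chance agreement comes from one pooled category distribution. The function name `fleiss_kappa` and the data layout (one list of labels per item) are assumptions made for this example. Missing values are not handled.

```python
from collections import Counter


def fleiss_kappa(items):
    """Return (observed, expected, kappa) for a list of per-item label lists.

    Hypothetical helper; every item must be rated by the same number of raters.
    """
    n = len(items[0])  # raters per item
    if n < 2 or any(len(labels) != n for labels in items):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    pooled = Counter()
    per_item = []
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        # Ordered agreeing rater pairs divided by all ordered rater pairs.
        agreeing = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing / (n * (n - 1)))
    a_obs = sum(per_item) / len(items)
    # Chance agreement from the single pooled distribution of all judgments.
    total = n * len(items)
    a_exp = sum((c / total) ** 2 for c in pooled.values())
    return a_obs, a_exp, (a_obs - a_exp) / (1 - a_exp)


# Hypothetical data: four items, three raters each.
items = [
    ["a", "a", "a"],
    ["a", "b", "b"],
    ["b", "b", "b"],
    ["b", "b", "b"],
]
a_obs, a_exp, kappa = fleiss_kappa(items)  # 5/6, 5/9, 0.625
```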
If you have multiple variables to analyse, provide annotations for each coder in separate text files (as below, left). With one variable, provide annotations in a single file (as below, right).
The first line should contain the variable or rater names, separated by a space or tab. In case of multiple variables, make sure that this line is identical in each file. Each subsequent line contains the annotations for an item, again separated by a space or tab. The annotations can be positive numbers as well as text (as exemplified).
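A file in this format could be read along the following lines. This is only a sketch: the function name `parse_annotations` and the sample content are made up, and since the marker used for missing values is not stated here, every line is assumed to be complete.

```python
def parse_annotations(text):
    """Parse a header line plus one line per item into {name: [labels]}.

    Fields may be separated by spaces or tabs. Hypothetical helper that
    assumes complete data (no missing values).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    names = lines[0].split()  # split() handles both spaces and tabs
    if len(set(names)) != len(names):
        raise ValueError("variable or rater names must be unique")
    columns = {name: [] for name in names}
    for item_no, line in enumerate(lines[1:], start=1):
        values = line.split()
        if len(values) != len(names):
            raise ValueError(f"item {item_no}: expected {len(names)} annotations")
        for name, value in zip(names, values):
            columns[name].append(value)
    return columns


sample = "r1\tr2 r3\nyes yes no\n1 1 2\n"
columns = parse_annotations(sample)
print(columns)  # {'r1': ['yes', '1'], 'r2': ['yes', '1'], 'r3': ['no', '2']}
```

Annotations are kept as strings, which suits nominal data where numbers and text are both just category labels.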
Note that the above measures are mostly intended for nominal variables. If your data are not on a nominal scale, consider other measures instead, such as the intraclass correlation coefficient (ICC).
To calculate inter-rater agreement for your data, please: (a) Select "Reset" to start a new analysis. (b) Provide input file(s) in the format described above. (c) Select "Analyze". Apart from the final scores, observed (obs) and expected (exp) agreement (A) and disagreement (D) are listed.
3 raters and 3 cases
2 variables with 18 decisions in total
no missing data
If you would like to cite this resource, please use:
- Geertzen, J. (2012). Inter-Rater Agreement with multiple raters and variables. Retrieved July 16, 2019, from https://nlp-ml.io/jg/software/ira/
For further information on methodological aspects of measuring inter-rater agreement and reliability, please have a look at the excellent website by John Uebersax. Feel free to contact me in case of difficulties or for feedback on this page.
[1] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.
[2] Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
[3] Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19:321-325.
[4] Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378-382.
[5] Davies, M. and Fleiss, J. L. (1982). Measuring agreement for multinomial data. Biometrics, pages 1047-1051.
[6] Light, R. J. (1971). Measures of response agreement for qualitative data: some generalizations and alternatives. Psychological Bulletin, 76:365-377.
[7] Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88:322-328.
[8] Krippendorff, K. (1980). Content Analysis: An Introduction to its Methodology. Sage Publications.