InterRater Agreement with multiple raters and variables
Online calculator for interrater agreement with multiple raters, featuring Light's kappa, Fleiss's kappa, Krippendorff's alpha, and support for missing data. The marginal distributions of raters can be plotted, giving insight into how often each rater used a particular category.
Introduction
The simplest measure of agreement between raters is the percentage of cases on which they agree (observed agreement). Often, the proportion of cases on which raters would agree if they simply guessed on every case (expected agreement) is also taken into account and corrected for, as in Cohen's kappa. In addition, measuring reliability or agreement with only two coders is rarely considered sufficient, and many studies involve more than three coders. This page lets you analyze interrater agreement:
 using various coefficients;
 with more than two raters;
 with missing values allowed;
 for multiple variables at once.
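The correction for chance described above can be sketched for the two-rater case as follows. This is a minimal illustration of Cohen's kappa, not the calculator's own code; the function name `cohen_kappa` and the toy labels are assumptions made for the example.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Return (observed, expected, kappa) for two raters' nominal labels."""
    n = len(a)
    # Observed agreement: share of items that both raters labeled identically.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: probability that both raters pick the same label
    # by chance, based on each rater's own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) when p_exp == 1, i.e. both raters
    # always use one and the same label.
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)

# Illustrative labels for two raters on three items.
john = ["red", "blue", "yellow"]
mary = ["red", "purple", "yellow"]
p_obs, p_exp, kappa = cohen_kappa(john, mary)
print(f"obs={p_obs:.3f} exp={p_exp:.3f} kappa={kappa:.3f}")
```

Here the raters agree on two of three items (observed 2/3), chance agreement is 2/9, and kappa comes out at 4/7.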
A popular measure of agreement between two coders is Cohen's kappa [1,2]. When more raters are involved, several generalized or adapted versions of the kappa statistic can be considered. Light extended Cohen's kappa to multiple raters [7] by computing kappa for every pair of raters and taking the arithmetic mean, an approach that was further generalized by Conger [8]. In similar fashion, Davies and Fleiss [5] proposed averaging the expected agreement over all rater pairs instead. A different approach was taken by Fleiss [4], who introduced a kappa-like measure that is a multirater generalization of Scott's pi [3]. Another commonly used measure, Krippendorff's alpha [6], is closely related but offers more flexibility.
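The difference between averaging pairwise kappa scores (Light) and averaging pairwise agreement before computing a single score (Davies and Fleiss) can be sketched as below. This is an illustration under the reading given above, not the calculator's implementation; the function names and the `ratings` dictionary are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

def pair_agreement(a, b):
    """Observed and expected agreement for one pair of raters."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return p_obs, p_exp

def light_kappa(ratings):
    """Arithmetic mean of Cohen's kappa over all rater pairs."""
    kappas = []
    for a, b in combinations(ratings.values(), 2):
        p_obs, p_exp = pair_agreement(a, b)
        kappas.append((p_obs - p_exp) / (1 - p_exp))
    return sum(kappas) / len(kappas)

def davies_fleiss_kappa(ratings):
    """Average observed and expected agreement over pairs, then correct once."""
    pairs = [pair_agreement(a, b) for a, b in combinations(ratings.values(), 2)]
    p_obs = sum(p[0] for p in pairs) / len(pairs)
    p_exp = sum(p[1] for p in pairs) / len(pairs)
    return (p_obs - p_exp) / (1 - p_exp)

# Illustrative data: one list of labels per rater, same item order.
ratings = {
    "John": ["red", "blue", "yellow"],
    "Joe": ["orange", "blue", "orange"],
    "Mary": ["red", "purple", "yellow"],
}
light = light_kappa(ratings)
davies_fleiss = davies_fleiss_kappa(ratings)
print(f"Light: {light:.4f}  Davies-Fleiss: {davies_fleiss:.4f}")
```

Both variants use each rater's own label distribution for the chance term; they differ only in where the averaging happens.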
Detailing the differences between the measures used is beyond the scope of this page, but all of them consider pairwise agreement between raters. Light's kappa and that of Davies and Fleiss take the distributions of the individual raters into account. The other two measures, Fleiss's kappa and Krippendorff's alpha, define the amount of agreement on a particular item as the proportion of agreeing rater pairs out of the total number of rater pairs. Importantly, for these two measures chance agreement is based on a single distribution reflecting the combined judgments of all coders.
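As an illustration of item-level agreement as a proportion of agreeing rater pairs, combined with a single pooled distribution for chance agreement, here is a minimal sketch of Fleiss's kappa. It assumes every item is labeled by the same number of raters and no values are missing; the function name and the sample items are assumptions for the example.

```python
from collections import Counter

def fleiss_kappa(items):
    """items: list of label lists, one per item, each with the same length."""
    n_raters = len(items[0])
    ordered_pairs = n_raters * (n_raters - 1)
    pooled = Counter()          # combined label counts of all raters
    item_agreement = []
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        # Ordered rater pairs that agree on this item, over all ordered pairs.
        agreeing = sum(c * (c - 1) for c in counts.values())
        item_agreement.append(agreeing / ordered_pairs)
    p_obs = sum(item_agreement) / len(items)
    total = sum(pooled.values())
    # Chance agreement from the single pooled distribution.
    p_exp = sum((c / total) ** 2 for c in pooled.values())
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)

# Illustrative data: one row per item, one label per rater.
items = [
    ["red", "orange", "red"],
    ["blue", "blue", "purple"],
    ["yellow", "orange", "yellow"],
]
p_obs, p_exp, kappa = fleiss_kappa(items)
print(f"obs={p_obs:.4f} exp={p_exp:.4f} kappa={kappa:.4f}")
```

On each item one of the three rater pairs agrees, so observed agreement is 1/3; the pooled distribution gives an expected agreement of 17/81.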
Input format
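If you have multiple variables to analyze, provide the annotations of each coder in a separate text file (first example below). With one variable, provide the annotations of all coders in a single file (second example below).

Multiple variables, one file per coder:

File of the first coder:
color   shape
red     round
blue    round
yellow  oval

File of the second coder:
color   shape
orange  round
blue    round
orange  oval

File of the third coder:
color   shape
red     round
purple  round
yellow  oval

One variable, a single file with one column per coder:

John    Joe     Mary
red     orange  red
blue    blue    purple
yellow  orange  yellow
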
The first line should contain the variable or rater names, separated by a space or tab. In case of multiple variables, make sure that this line is identical in each file. Each subsequent line contains the annotations for an item, again separated by a space or tab. The annotations can be positive numbers as well as text (as exemplified).
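A minimal sketch of reading this format is shown below, together with the per-rater category counts that underlie a marginal distribution plot. The function name `parse_annotations` and the `SAMPLE` string are assumptions for the example; missing values are not handled here, since their encoding in the input is not described on this page.

```python
from collections import Counter

SAMPLE = """John Joe Mary
red orange red
blue blue purple
yellow orange yellow
"""

def parse_annotations(text):
    """Parse space- or tab-separated annotations into {column name: [values]}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    names, data = rows[0], rows[1:]
    # Every item line must provide one value per name in the header line.
    for i, values in enumerate(data, start=1):
        if len(values) != len(names):
            raise ValueError(
                f"item {i}: expected {len(names)} values, got {len(values)}"
            )
    return {name: [values[j] for values in data] for j, name in enumerate(names)}

columns = parse_annotations(SAMPLE)
# Marginal distribution: how often each rater used each category.
marginals = {name: Counter(values) for name, values in columns.items()}
for name, counts in marginals.items():
    print(name, dict(counts))
```

The same function works for the multi-variable files, where the columns are variables rather than raters and one file is parsed per coder.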
Note that the above measures are mostly intended for nominal variables. If your data is not on the nominal scale, consider using other measures instead such as the intraclass correlation coefficient (ICC).
Calculating agreement
To calculate interrater agreement for your data: (a) select "Reset" to start a new analysis; (b) provide the input file(s) in the format described above; (c) select "Analyze". In addition to the final scores, the output lists observed (obs) and expected (exp) agreement (A) and disagreement (D).
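To show how observed and expected disagreement lead to a final score, here is a sketch of Krippendorff's alpha for nominal data, computed as 1 minus observed disagreement divided by expected disagreement. It is not the calculator's own code. Representing a missing value as `None` is an assumption made for the example, as are the function name and the sample items.

```python
from collections import Counter

def krippendorff_alpha_nominal(items, missing=None):
    """items: list of label lists (one per item); `missing` marks absent values."""
    coincidences = Counter()  # (label c, label k) -> weighted pair count
    for labels in items:
        present = [x for x in labels if x != missing]
        m = len(present)
        if m < 2:
            continue  # an item with fewer than two labels cannot form a pair
        counts = Counter(present)
        for c, n_c in counts.items():
            for k, n_k in counts.items():
                pairs = n_c * (n_c - 1) if c == k else n_c * n_k
                coincidences[(c, k)] += pairs / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    # Observed disagreement: share of coincidences between different labels.
    d_obs = sum(w for (c, k), w in coincidences.items() if c != k) / n
    # Expected disagreement from the pooled label totals.
    d_exp = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    return d_obs, d_exp, 1 - d_obs / d_exp

items = [
    ["red", "orange", "red"],
    ["blue", "blue", "purple"],
    ["yellow", "orange", "yellow"],
]
d_obs, d_exp, alpha = krippendorff_alpha_nominal(items)
print(f"D_obs={d_obs:.4f} D_exp={d_exp:.4f} alpha={alpha:.4f}")
```

Because each item contributes only the labels that are present, an item with a missing value still counts through its remaining rater pairs.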
 
Data
3 raters and 3 cases
2 variables with 18 decisions in total
no missing data
1: color

If you would like to cite this resource, please use:
 Geertzen, J. (2012). InterRater Agreement with multiple raters and variables. Retrieved March 23, 2019, from https://nlpml.io/jg/software/ira/
For further information on methodological aspects of measuring interrater agreement and reliability, please have a look at the excellent website by John Uebersax. Feel free to contact me in case of difficulties or for feedback on this page.
References
 [1] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.
 [2] Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
 [3] Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19:321-325.
 [4] Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378-382.
 [5] Davies, M. and Fleiss, J. L. (1982). Measuring agreement for multinomial data. Biometrics, pages 1047-1051.
 [6] Krippendorff, K. (1980). Content Analysis: An Introduction to its Methodology. Sage Publications.
 [7] Light, R. J. (1971). Measures of response agreement for qualitative data: some generalizations and alternatives. Psychological Bulletin, 76:365-377.
 [8] Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88:322-328.