Fleiss Kappa — Multi-Rater Agreement
Evaluates agreement when three or more raters classify a set of objects into nominal categories. Cohen's κ applies only to two raters; with more raters, use Fleiss' κ. Every object must be rated by the same number of raters. For two raters, or for ordered grades, use the Kappa (with weighting) tool.
① Enter the count table
One rated object per line, one category per column, value = how many raters assigned that object to that category. The row sum (total raters m) must be the same on every row. For example "4 0 0" means all 4 raters chose category 1.
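As a rough sketch of the input rule above, the following Python snippet parses whitespace-separated counts and checks that every row has the same number of columns and the same row sum. The function name parse_count_table and the error messages are illustrative assumptions, not part of this tool.

```python
def parse_count_table(text):
    """Parse a count table: one rated object per line, one category per column."""
    rows = [[int(tok) for tok in line.split()]
            for line in text.strip().splitlines() if line.strip()]
    if not rows:
        raise ValueError("no data rows found")

    n_categories = len(rows[0])
    m = sum(rows[0])  # raters per object, taken from the first row

    for i, row in enumerate(rows, start=1):
        if len(row) != n_categories:
            raise ValueError(f"row {i}: expected {n_categories} columns, got {len(row)}")
        if any(count < 0 for count in row):
            raise ValueError(f"row {i}: counts cannot be negative")
        if sum(row) != m:
            raise ValueError(f"row {i}: row sum {sum(row)} differs from m = {m}")
    return rows, m


# "4 0 0" means all 4 raters chose category 1 for the first object.
table, m = parse_count_table("4 0 0\n2 2 0\n0 1 3")
```

A row whose sum differs from the first row's (for example "3 0 0" after "4 0 0") raises ValueError in this sketch rather than being silently accepted.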
How to use & methodology
When do I use Fleiss κ instead of Cohen κ?
Cohen's κ handles only two fixed raters. With three or more raters, use Fleiss' κ; the raters need not be the same individuals for every object, as long as each object receives the same number of ratings. For only two raters, or for ordered categories that need weights, use the Kappa (with linear/quadratic weighting) tool.
How should the data be arranged?
Arrange an 'object × category' count table: one object per row, one category per column, the cell being the number of raters who placed that object in that category. Each row sum equals the total raters m, and m must be the same on every row.
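For readers who want to see how such a table turns into κ, here is a minimal Python sketch of the standard Fleiss computation. The function name fleiss_kappa, the returned tuple, and the example counts are assumptions made for illustration; this is not necessarily how the tool itself is implemented. With N objects and m raters per object, the per-object agreement is (sum of squared counts − m) / (m(m − 1)), P̄ is its mean, Pe is the sum of squared category proportions, and κ = (P̄ − Pe) / (1 − Pe).

```python
def fleiss_kappa(table):
    """Return (kappa, p_bar, p_e) for an object x category count table."""
    n_objects = len(table)
    m = sum(table[0])  # raters per object; must match on every row
    if m < 2:
        raise ValueError("need at least 2 raters per object")
    if any(sum(row) != m for row in table):
        raise ValueError("every row must sum to the same number of raters m")

    # Observed agreement per object: share of rater pairs that agree.
    per_object = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in table]
    p_bar = sum(per_object) / n_objects

    # Category proportions over all N * m ratings, then chance agreement.
    n_categories = len(table[0])
    props = [sum(row[j] for row in table) / (n_objects * m)
             for j in range(n_categories)]
    p_e = sum(p * p for p in props)

    if p_e == 1.0:
        # Every rating fell in one category; kappa is undefined (0 / 0).
        return float("nan"), p_bar, p_e
    return (p_bar - p_e) / (1 - p_e), p_bar, p_e


# Made-up example: 4 objects, 4 raters each, 3 categories.
example = [[4, 0, 0], [2, 2, 0], [0, 1, 3], [0, 0, 4]]
kappa, p_bar, p_e = fleiss_kappa(example)
print(round(kappa, 4), round(p_bar, 4), round(p_e, 4))
```

For these example counts, P̄ = 17/24 and Pe = 47/128, so κ = 131/243, roughly 0.54.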
What κ counts as good?
The commonly cited Landis & Koch benchmarks: below 0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. These thresholds are only a guide; what counts as clinically acceptable depends on the setting.
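The benchmark bands can be expressed as a small lookup, sketched below in Python. The function name landis_koch_label is hypothetical. Two assumptions are made: each upper bound is treated as inclusive (so exactly 0.20 reads as "slight"), and negative values are labeled "poor", as in the original Landis & Koch table.

```python
def landis_koch_label(kappa):
    """Map a kappa value to a Landis & Koch descriptive label."""
    if kappa < 0:
        return "poor"
    # Upper bounds are inclusive here; that boundary handling is an assumption.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


print(landis_koch_label(0.54))
```

The label is a reading aid only; it does not replace judgment about what level of agreement is acceptable in a given setting.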
Does low κ always mean the raters are poor?
Not necessarily. Because κ = (P̄ − Pe) / (1 − Pe), a dominant category (very imbalanced prevalence) makes the expected agreement Pe high, and κ can come out low even when the raters agree on most objects (the kappa paradox). Read κ together with the category proportions and the observed agreement P̄.
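The effect can be seen by comparing two made-up tables with identical observed agreement but different prevalence. The Python sketch below uses a compact, hypothetical helper kappa_parts that follows the standard Fleiss formulas; the counts are invented for illustration only.

```python
def kappa_parts(table):
    """Return (kappa, p_bar, p_e) for an object x category count table."""
    n, m = len(table), sum(table[0])
    # P-bar: mean share of agreeing rater pairs per object.
    p_bar = sum((sum(c * c for c in row) - m) / (m * (m - 1)) for row in table) / n
    # Pe: sum of squared category proportions.
    p_e = sum((sum(col) / (n * m)) ** 2 for col in zip(*table))
    return (p_bar - p_e) / (1 - p_e), p_bar, p_e


# 10 objects, 3 raters, 2 categories; both tables contain two split rows "2 1".
imbalanced = [[3, 0]] * 8 + [[2, 1]] * 2                 # category 1 dominates
balanced = [[3, 0]] * 4 + [[0, 3]] * 4 + [[2, 1]] * 2    # categories roughly even

k_imb, pbar_imb, pe_imb = kappa_parts(imbalanced)
k_bal, pbar_bal, pe_bal = kappa_parts(balanced)

print(f"imbalanced: P-bar={pbar_imb:.3f}  Pe={pe_imb:.3f}  kappa={k_imb:.3f}")
print(f"balanced:   P-bar={pbar_bal:.3f}  Pe={pe_bal:.3f}  kappa={k_bal:.3f}")
```

With these counts both tables have P̄ of about 0.867, yet κ is about −0.07 for the imbalanced table (Pe about 0.876) and about 0.73 for the balanced one (Pe about 0.502). The raters behave identically on the split rows; only the prevalence differs.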