Abstract:
Objective:
To identify desirable characteristics, including different sample sizes and dental caries prevalences, in virtual samples that simultaneously allow high values of general percentage agreement (GPA) and Kappa coefficient (κ), with a narrow confidence interval (CI), in reproducibility studies.
Method:
A total of 384 statistical simulations of inter-examiner calibration, varying sample size (12, 15, 20, 60, 200 and 500 individuals), caries prevalence (30, 50, 60 and 90%) and percentages of positive (PA) and negative (NA) agreement (30, 50, 60 and 90%) were undertaken. GPA and κ were used to measure reproducibility and define deviation between them.
Results:
The sample of 60 individuals, under a caries prevalence of 50% and PA and NA of 90%, presented GPA and Kappa values of 90 and 80%, respectively, a relatively small confidence interval (95%CI 0.65 - 0.95) and a GPA/Kappa deviation of 10.00.
Conclusion:
A virtual sample of 60 individuals, under a caries prevalence of 50%, seems feasible for producing satisfactory interexaminer agreement under epidemiological conditions. However, epidemiological studies are necessary to corroborate or refute this assertion.
Keywords:
Sample size; Reproducibility of results; Dental health surveys; Dental caries; Calibration; Epidemiology
INTRODUCTION
Oral health surveys are needed to plan and evaluate oral health actions and services, and the methodological biases in such surveys must be controlled. According to the World Health Organization (WHO) methodology, prior training and calibration of the examiners are the initial and essential steps of an oral health survey. Calibration standardizes the interpretation of diagnostic criteria among examiners. The general percentage agreement (GPA) and Kappa statistics have been proposed for this task [1].
The GPA is the simplest way to evaluate agreement among examiners. However, it loses precision when a sample with low caries prevalence is examined. For this reason, Kappa has been the statistical method of choice for measuring reproducibility in oral health surveys [2]. The Kappa coefficient eliminates agreement due to chance, thus constituting a measurement of real agreement for nominal or ordinal data [3]. Values of 85% or above for the GPA and of at least 0.80 for Kappa are accepted for epidemiological surveys of dental caries. Such values indicate that the examiners apply the diagnostic methods precisely [1].
The WHO recommends a minimum sample size of 20 individuals for the calibration exercises, provided that they cover the whole spectrum of dental caries disease. No further details about the sample are given [1]. Besides the age group and the environmental conditions during the examinations, the prevalence and annual increment of the disease deserve special attention during the planning and execution of epidemiological studies, especially at the training and calibration stages [1-5]. These factors, if neglected, may compromise the reproducibility and validity of the diagnostic methods used, especially when the reproducibility values are very low. Spurious reproducibility results, such as a high general percentage agreement (GPA) associated with very low (even negative) Kappa values, may be observed in the scientific literature [6].
For this reason, the aim of this study was to identify desirable characteristics, including different sample sizes and dental caries prevalences, in virtual samples that simultaneously allow high values of GPA and Kappa coefficient, with a narrow confidence interval (CI), in reproducibility studies.
METHODOLOGY
Statistical computer simulations of interexaminer calibration were run, varying caries prevalence (30, 50, 60 and 90%) in hypothetical samples of different sizes (12, 15, 20, 60, 200 and 500 individuals), as well as the percentages of positive (30, 50, 60 and 90%) and negative (30, 50, 60 and 90%) agreement in these samples. In total, 384 simulations between a gold standard examiner and an examiner, both virtual, were performed using 'The SAS System 9.0 for Windows' (SAS Institute Inc., Cary, NC, USA).
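The total of 384 follows from crossing the four design factors (6 sample sizes, 4 prevalences, 4 positive and 4 negative agreement levels). The sketch below only enumerates that grid; it is written in Python for illustration, whereas the study itself used SAS, and the function and variable names are assumptions.

```python
from itertools import product

# Design factors as listed in the text (percentages given as whole numbers).
SAMPLE_SIZES = (12, 15, 20, 60, 200, 500)
PREVALENCES = (30, 50, 60, 90)          # caries prevalence, %
POSITIVE_AGREEMENTS = (30, 50, 60, 90)  # PA, %
NEGATIVE_AGREEMENTS = (30, 50, 60, 90)  # NA, %


def simulation_grid():
    """Return every (n, prevalence, PA, NA) combination as a list of tuples."""
    return list(
        product(SAMPLE_SIZES, PREVALENCES, POSITIVE_AGREEMENTS, NEGATIVE_AGREEMENTS)
    )


if __name__ == "__main__":
    grid = simulation_grid()
    print(len(grid))  # 6 * 4 * 4 * 4 scenarios
```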
A contingency table (square matrix: n × n) is necessary for the Kappa calculation. For the purposes of this study, a 2 × 2 contingency table was considered, with clinical conditions dichotomized into "decayed" (cavitation or marginal leakage around dental restorations) and "non-decayed" (Table 1). This dichotomy makes sense in Dentistry when more sensitive diagnostic methods are used.
The positive agreement refers to the percentage of cases in cell "A" of the 2 × 2 contingency table, considering the calibration between a gold standard examiner and the examiner. Cells "B" and "C" express the disagreement, while cell "D" expresses the negative agreement between these examiners (Table 1).
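The Kappa statistic is obtained by the formula:
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
Where:
- Po = proportion of agreements observed = (A+D)/N;
- Pe = proportion of agreements expected = (F1G1+F2G2)/N², with F1, F2 and G1, G2 being the marginal totals of the contingency table (Table 1).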
Prevalence (|A-D|/N) and bias (|B-C|/N) rates influence Kappa values [5,6].
Although many positive and negative agreement rates were obtained during the simulations, rates of 90% were stipulated as the ideal condition for obtaining both high GPA and high Kappa values. The deviation between the GPA and Kappa values, both expressed as percentages, is given in absolute value by the formula: |GPA - κ|.
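As a minimal sketch of the quantities defined above, the function below computes the GPA, Kappa, the GPA/Kappa deviation and the prevalence and bias rates from the four cells of a 2 × 2 table. It assumes that F1, F2 are the row totals and G1, G2 the column totals, and that a sample of 60 individuals with 50% prevalence and 90% positive and negative agreement corresponds to the cells A=27, B=3, C=3, D=27; the source does not list the cell counts, so that table is an assumption, as is the function name.

```python
def agreement_stats(a, b, c, d):
    """GPA, Cohen's kappa and related rates for a 2x2 agreement table.

    a: both examiners record "decayed"; d: both record "non-decayed";
    b and c: the two kinds of disagreement.
    """
    n = a + b + c + d
    f1, f2 = a + b, c + d  # assumed row totals
    g1, g2 = a + c, b + d  # assumed column totals
    po = (a + d) / n                         # observed agreement
    pe = (f1 * g1 + f2 * g2) / n ** 2        # agreement expected by chance
    kappa = (po - pe) / (1 - pe)             # undefined when pe == 1
    return {
        "gpa_percent": 100 * po,
        "kappa": kappa,
        "deviation": abs(100 * po - 100 * kappa),  # |GPA - kappa|, both in %
        "prevalence_rate": abs(a - d) / n,
        "bias_rate": abs(b - c) / n,
    }


if __name__ == "__main__":
    # Assumed table for n = 60, prevalence 50%, PA = NA = 90%.
    for name, value in agreement_stats(27, 3, 3, 27).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the sketch returns a GPA of 90%, a Kappa of 0.80 and a deviation of 10, in line with the values reported for that scenario.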
RESULTS
Table 2 shows the smallest deviations between the GPA and Kappa (κ) values for interexaminer reproducibility, as a function of disease prevalence in the sample and of sample size, considering positive (PA) and negative (NA) agreements of 90%.
The best GPA/Kappa relationship (highest GPA and Kappa values, deviation ≤ 10, narrow 95%CI and small sample size) was found for the sample of 60 individuals under a caries prevalence of 50% (Table 2).
The greatest deviations between the GPA and Kappa values are listed in Table 3.
A GPA above 80% can produce a GPA/Kappa deviation above |90.00| in samples of 12 and 15 individuals. A GPA of 45% and a Kappa of -100.00% produced a GPA/Kappa deviation of |145.00| in a sample of 200 individuals (Table 3).
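The source does not state how the 95%CI for Kappa was computed. As an illustration only, the sketch below uses the simple large-sample standard error sqrt(Po(1-Po) / (N(1-Pe)²)), which for Po = 0.90, Pe = 0.50 and N = 60 gives an interval that rounds to the reported 0.65 - 0.95. The function name and the choice of formula are assumptions, not a description of the SAS procedure used in the study.

```python
import math


def kappa_ci(po, pe, n, z=1.96):
    """Approximate confidence interval for Cohen's kappa.

    Uses the simple large-sample standard error
    sqrt(po * (1 - po) / (n * (1 - pe)**2)); other variance
    formulas exist and may give slightly different limits.
    """
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa - z * se, kappa + z * se


if __name__ == "__main__":
    # Same agreement pattern (Po = 0.90, Pe = 0.50) at two sample sizes.
    for n in (60, 200):
        low, high = kappa_ci(0.90, 0.50, n)
        print(f"n={n}: {low:.2f} to {high:.2f}")
```

With the same Po and Pe, the interval narrows as N grows, which is the trade-off between precision and feasibility considered when comparing the samples of 60 and 200 individuals.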
DISCUSSION
The Kappa (κ) statistic is an index that measures the reproducibility of examiners for categorical data and is widely used in the biomedical sciences. Kappa values range from -1 (total interexaminer disagreement), through 0 (agreement merely by chance), up to +1 (total interexaminer agreement). Negative values arise when Po is lower than Pe; a value of 0 denotes agreement merely by chance, where Po = Pe; positive values arise when Po is higher than Pe [3,7].
Because it expresses agreement among examiners beyond chance, Kappa is generally lower than the corresponding GPA. Nevertheless, high GPA values associated with very low (or even negative) Kappa values may be found in reproducibility studies. This fact deserves special attention from the scientific community. A negative Kappa value is not always a reflection of mathematical, typographic or computational errors or of the misuse of a diagnostic test. It may reflect the dependence of Kappa on the prevalence of the disease in the examined sample [6,8].
The situation described above can be avoided when the sample for reproducibility studies is well designed. However, even with some methodological care, significant differences between the GPA and Kappa values may be found. This can be worsened by not selecting individuals prior to the calibration phase. A clear example occurs at the stage of intraexaminer recalibration during the field phase, when 5-10% of the individuals in the sample are selected, as recommended by the WHO manual [1]. Even in this case, there is no recommendation for the prior selection and distribution of individuals according to their disease prevalence in order to obtain a controlled sample. This may therefore also produce a low caries prevalence in this group and thus compromise the reproducibility results.
Whenever possible, larger sample sizes with disease prevalence near 50% are desirable [8,9]. Larger sample sizes provide narrower confidence intervals and may give the examiner a full view of the disease spectrum, compensating for the effects of unreliability.
In the present study, the paradox of "high GPA and low Kappa" [6] was evident for the samples of 12 and 15 individuals under a dental caries prevalence of 90%. In all the situations in which the deviation between the GPA and Kappa values was high, the relationship between the values of Po and Pe was the determining factor. Very close values of Po and Pe are responsible for this paradox. Conversely, the higher the value of Po and the lower the value of Pe, the smaller the GPA/Kappa deviation. This condition is attained when the values of cells A and D are higher than the values of cells B and C, favoring positive and negative agreements above 90% [6,10].
The impact of disease prevalence on the marginal totals of the contingency table, and hence on the Kappa value, cannot be neglected. The samples of 12 and 15 individuals, under dental caries prevalences of 60 and 30%, respectively, presented a symmetrical imbalance of their marginal totals (F1≈G1 and F2≈G2). Their Po and Pe values were above 0.90 and 0.50, respectively, generating a smaller GPA/Kappa deviation associated with high values of both reproducibility measurements. In these samples, the GPA value was higher than 85%, while the Kappa value was "almost perfect", the value and classification recommended by the WHO [1]. However, the 95%CI was wide in both situations.
An ideal sample (without dubious cases) is statistically desirable, but unfeasible in real settings. Therefore, an experienced epidemiologist (not participating in the survey) is necessary to select the individuals who compose the samples and to conduct the calibration sessions in epidemiological surveys. The true prevalence of the attribute in an ideal sample is obtained by an even distribution of "diseased" (cell A) and "non-diseased" (cell D) individuals [6]. The control of cells A and D is directly related to the value of Po. A Po much higher than Pe yields high GPA and Kappa values and a smaller deviation between them. A very low or very high propensity for positive classification (P+=(F1+G1)/(2N)) also contributes to a low Kappa value [8].
The estimation of sample size and of the statistical power of the diagnostic methods is generally neglected in epidemiological studies [7]. Usually, reproducibility studies are performed with samples of 50 or fewer subjects [11], which may compromise, to some degree, the statistical power of the method used. This problem is more serious when the outcome variable is dichotomous, and it is aggravated by a low prevalence of the attribute in the studied population/sample [5,8,12]. In Dentistry, the sample sizes for reproducibility studies have ranged between 10 and 25 individuals, and no justification for these numbers has been provided. Another peculiarity in Dentistry is that caries outcomes are categorical and non-dichotomous, with various clinical conditions coexisting in a single individual [1]. This is a natural and additional source of variation among examiners.
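The role of Po and Pe in the paradox can be illustrated with two hypothetical tables that share the same observed agreement. The cell counts below are invented for illustration and are not taken from the simulations or from Table 3; the helper name is also an assumption.

```python
def po_pe_kappa(a, b, c, d):
    """Return (Po, Pe, kappa) for a 2x2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return po, pe, (po - pe) / (1 - pe)


if __name__ == "__main__":
    # Hypothetical tables with N = 20 and the same Po = 0.90.
    tables = {
        "balanced (A=9, B=1, C=1, D=9)": (9, 1, 1, 9),
        "skewed (A=18, B=1, C=1, D=0)": (18, 1, 1, 0),
    }
    for label, cells in tables.items():
        po, pe, kappa = po_pe_kappa(*cells)
        print(f"{label}: Po={po:.3f} Pe={pe:.3f} kappa={kappa:.3f}")
```

In the balanced table Pe is 0.50 and Kappa stays high, while in the skewed table Pe rises to just above Po, so Kappa becomes slightly negative even though the GPA is still 90%.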
Satisfactory GPA and Kappa values were obtained with relatively small samples (12 and 15 individuals). However, a sample of 12 individuals, for example, represents 336 teeth/1,680 dental surfaces examined. The GPA/Kappa deviation for such samples may reflect the distribution of the A and D cells, because the respective confidence intervals were not considered. Considering the lower limit of the confidence interval and the sample size, the best reproducibility and the best relationship between GPA and Kappa were found for the sample of 200 individuals (5,600 teeth/28,000 dental surfaces) under a caries prevalence of 50% and positive and negative agreements of 90%. The reproducibility results from the sample of 60 individuals (1,680 teeth/8,400 dental surfaces), under the same methodological conditions, are similar to those of the sample of 200 individuals. The advantage of the sample of 60 individuals over the sample of 200 is its feasibility: fewer individuals are needed.
In addition to the careful selection of the sample, specialists have suggested presenting Kappa values together with the p-value and the confidence interval. Other ways to evaluate interexaminer agreement have been proposed, such as the Dice index, the intraclass correlation coefficient, κmax, the prevalence and bias adjusted kappa (PABAK), the separate presentation of the proportions of positive and negative agreement and even the Kappa calculation for true positive and true negative sub-samples [9,10,12,13]. However, each of these methods has its own strengths and weaknesses, and caution is also needed when applying and interpreting them.
The present results were obtained by statistical simulations in virtual samples (designed for this purpose) with dichotomous clinical outcomes. Therefore, they may not exactly represent data from a real setting of epidemiological surveys. Although this issue has been studied by many authors, this study draws readers' attention to how such problems can affect reproducibility in epidemiological dental caries surveys. The results also help to clarify some issues raised in the literature about the Kappa statistic.
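Two of the alternatives mentioned above have simple closed forms for a 2 × 2 table: PABAK = 2Po - 1, and the proportions of positive and negative agreement, 2A/(2A+B+C) and 2D/(2D+B+C). The sketch below uses these commonly cited definitions; they are not necessarily the same quantities as the PA and NA percentages varied in the simulations, and the function name and example tables are assumptions.

```python
def alternative_indices(a, b, c, d):
    """PABAK and proportions of positive/negative agreement for a 2x2 table."""
    n = a + b + c + d
    po = (a + d) / n
    pos_den = 2 * a + b + c
    neg_den = 2 * d + b + c
    return {
        "pabak": 2 * po - 1,
        # None marks an index that is undefined for the given table.
        "p_pos": 2 * a / pos_den if pos_den else None,
        "p_neg": 2 * d / neg_den if neg_den else None,
    }


if __name__ == "__main__":
    # Hypothetical balanced and skewed tables, both with Po = 0.90.
    print(alternative_indices(27, 3, 3, 27))
    print(alternative_indices(18, 1, 1, 0))
```

For both hypothetical tables PABAK is 0.80 because it depends only on Po, whereas the separate proportions reveal that the skewed table has no agreement at all on the negative category. This illustrates why each index needs to be interpreted with caution.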
CONCLUSION
A sample of 60 individuals with a caries prevalence of 50% produced a low deviation between GPA and Kappa, with a relatively small confidence interval. In the virtual setting, such a sample is suitable for producing good reproducibility results under epidemiological conditions. Therefore, epidemiological studies that corroborate or refute this assertion are necessary to verify its feasibility under field conditions. Prior and careful selection of the individuals who compose the samples in reproducibility studies should be implemented by community health researchers.
REFERENCES
- 1. World Health Organization (WHO). Oral health surveys: basic methods. 4th ed. Geneva: WHO; 1997.
- 2. Peres MA, Traebert J, Marcenes W. Calibration of examiners for dental caries epidemiology studies. Cad Saúde Pública 2001; 17(1): 153-9.
- 3. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur 1960; 20(1): 37-46.
- 4. Frias AC, Antunes JLF, Narvai PC. Reliability and validity of oral health surveys: dental caries in the city of Sao Paulo, 2002. Rev Bras Epidemiol 2004; 7(2): 144-54.
- 5. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 2005; 85(3): 257-68.
- 6. Feinstein AR, Cicchetti DV. High agreement but low Kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990; 43(6): 543-9.
- 7. Rigby AS. Statistical methods in epidemiology. Towards an understanding of the kappa coefficient. Disabil Rehabil 2000; 22(8): 339-44.
- 8. Gwet K. Inter-rater reliability: dependency on trait prevalence and marginal homogeneity. Statistical Methods For Inter-Rater Reliability Assessment 2002; 2: 1-9.
- 9. Hoehler FK. Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J Clin Epidemiol 2000; 53(5): 499-503.
- 10. Cicchetti DV, Feinstein AR. High agreement but low Kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990; 43(6): 551-8.
- 11. Donner A. Sample size requirements for the comparison of two or more coefficients of inter-observer agreement. Stat Med 1998; 17(10): 1157-68.
- 12. Viera AJ, Garret JM. Understanding interobserver agreement: the Kappa statistic. Fam Med 2005; 37(5): 360-3.
- 13. Assaf AV, Zanin L, Meneghim MC, Pereira AC, Ambrosano GMB. Comparison of reproducibility measurements for calibration of dental caries epidemiological surveys. Cad Saúde Pública 2006; 22(9): 1901-7.
- Financial support: none
Publication Dates
- Publication in this collection: Apr-Jun 2016
History
- Received: 16 Sept 2014
- Accepted: 05 May 2015