Eight orthopedic surgeons achieved moderate to excellent reliability measuring the preoperative posterior tilt angle in 50 Garden-I and Garden-II femoral neck fractures

Background Studies of elderly patients with Garden-I and Garden-II femoral neck fractures (FNFs) suggest that a preoperative posterior tilt of the femoral head of at least 20° increases the risk of fixation failure. A recently published treatment algorithm recommended hemiarthroplasty over internal fixation for elderly patients with Garden-I and Garden-II FNFs and a preoperative posterior tilt of at least 20°. However, the reliability of the method used to measure the posterior tilt has not been assessed according to recommended standards for reliability trials. Methods Four orthopedic registrars and four consultants measured the posterior tilt angle in 50 preoperative lateral radiographs at two occasions six weeks apart. We estimated inter- and intrarater reliability by intraclass correlation coefficient (ICC). We also assessed repeatability by the repeatability coefficient (RC) and agreement by the minimal detectable change (MDC). Based on the suggested cutoff value of 20°, we reported the overall percentage and specific agreement for the choice of implant. Results Inter- and intrarater reliability for all raters was excellent with an ICC (95% CI) of 0.77 (0.69–0.85) and 0.77 (0.67–0.86), respectively. The RC was 13.9 and the MDC 14.1. Specific agreement for choosing arthroplasty was 61.3 and 54.6% for the first and second test occasion, respectively. Conclusions Eight orthopedic surgeons measured the posterior tilt in 50 Garden-I and Garden-II FNFs and achieved excellent inter- and intrarater reliability. However, variations in repeated measurements and variations in measurements made by different raters, as assessed by the RC and the MDC respectively, ranged from 13.9° to 14.1°. The variations in posterior tilt measurements should be taken into account when choosing the type of implant for elderly patients with Garden-I and Garden-II femoral neck fractures.


Background
Elderly patients with Garden-I and Garden-II femoral neck fractures (FNFs) treated with internal fixation may suffer from higher rates of complications such as fixation failure, nonunion, and avascular necrosis of the femoral head than previously acknowledged [1][2][3]. Recent trials identified a subgroup of Garden-I and Garden-II FNFs that had an increased risk of fixation failure. Those were elderly patients who presented with a posterior tilt of the femoral head of at least 20°measured on the preoperative lateral radiograph [1][2][3]. Primary arthroplasty could thus be a better alternative for this subgroup of elderly patients [4,5]. Two studies suggested that elderly patients with Garden-I and Garden-II FNFs with a posterior tilt of ≥ 20°could benefit from arthroplasty, whereas patients with a posterior tilt of < 20°may be treated with internal fixation [1,3]. However, the findings of another retrospective study contradicted these results [6], and an explanation could be a possibly poor reliability of posterior tilt measurements. Therefore, we evaluated the inter-and intrarater reliability of posterior tilt measurements according to standards for good reliability studies [7].

Study design and population
This study was part of a retrospective cohort study of elderly patients with Garden-I and Garden-II FNFs treated with two cancellous screws at Akershus University Hospital, Norway, between 2005 and 2012. The authors evaluated anteroposterior radiographs of the pelvis and classified the fractures according to the simplified Garden classification [8]. To assure that radiographs were representative, we randomly selected 50 supine cross-table lateral view radiographs from a cohort of 322 patients with Garden-I and Garden-II femoral neck fractures using computer software. Patient data from the same cohort have recently been published [3]. All lateral view radiographs were used independently of their quality to reduce the risk of selection bias.

Radiographic measurements
The posterior tilt of the femoral head was measured with the software mDesk (RSA Medical, Umeå, Sweden) using the method described by Palm et al [1]. The raters fitted a circle to the cortical contour of the femoral head, and the software calculated the center point of the circle. The raters then drew a straight line across the narrowest part of the femoral neck succeeded by two parallel lines on each side, with a distance of 5 mm to the initial line. The mid-collum line (MCL) was defined as a line through the center points of the three lines. The radius collum line (RCL) was drawn from the center of the circle to the intersection between the circle and the MCL. The posterior tilt of the femoral head was defined as the angle between the MCL and RCL (Fig. 1).
Eight orthopedic surgeons-four registrars and four senior consultants-were invited to assess lateral hip radiographs at two occasions with a washout period of at least six weeks.
The raters received individual instructions as described above for approximately 20 min before the first rating. None of the raters had any experience using the measuring method in question before the study. The raters were blinded to the clinical outcome and completed sessions independently at their pace, using the same portable computer and software. No feedback was provided between sessions, and the raters were not allowed to discuss their results. The inter-and intrarater reliabilities of measurements of posterior tilt were calculated based on the results of the first and second ratings.

Statistics
Sample size calculations were performed according to the recommendations of Donner and Rotondi [9]. The eight raters were divided into two groups of four based on their clinical experience. For interrater analysis, intraclass correlation coefficients (ICCs) were estimated by a linear mixed model with random effects for patient and rater, which corresponds to a two-way mixed model, agreement and single measure (ICC 2.1). Calculations were performed using the R package lme4 [10]. ICC was interpreted as follows [11]: excellent (> 0.75), fair to good (0.40-0.75), and poor (< 0.40). The standard error of measurement (SEM) agreement was calculated from the square root of the sum of residual, patient, and rater variance. Minimal detectable change (MDC), which estimates the smallest amount of change that can be detected beyond measurement error, was calculated using the formula 1.96 × √2 × SEM.
The recorded posterior tilt angles were also dichotomized using the suggested cutoff value of 20° [1,5] indicating the two implant options: arthroplasty ≥ 20°and internal fixation < 20°. The overall percentage agreement is the proportion of cases for which all raters agree, and the specific agreement was defined as the observed agreement for choosing arthroplasty as treatment. Percentage agreement was calculated with the R packages obs.agree [12]. Intrarater reliability (ICC intra ) was estimated by a linear mixed model with random effects for patient, which corresponds to a two-way mixed model, agreement and single measure (ICC 2.1). The means of the individual ICC intra with corresponding standard deviations (SDs) were used to compare intrarater reliability between groups of raters. Withinsubject SD was calculated using one-way analysis of variance (ANOVA) and repeatability estimated by the repeatability coefficient (RC) using the formula √2 × 1.96 × within-subject SD [13]. Statistical analyses were performed using R version 3.1.3 for Mac OS X [14].

Results
The eight raters measured posterior tilt in all 50 lateral hip radiographs at two test occasions, giving a total of 8 × 50 × 2 = 800 assessments (Appendix). The angles ranged from − 30.0 to 49.7°. Negative values denote anterior tilt of the femoral head, whereas positive values denote a posterior tilt. Using the mean angle of all eight measurements for each case from the first test occasion, 9 of 50 patients had a posterior tilt angle of at least 20°.

Interrater reliability
The pair-wise ICC values for 28 possible pairs of raters ranged from "fair to good" (0.64) to "excellent" (0.91) ( Table 1), and the overall ICC for the eight raters was "excellent" (0.77) at the first session ( Table 2). The interrater reliability for registrars was "excellent" (0.81) compared to "fair to good" (0.73) for the consultants ( Table 2), but the difference was not statistically significant (p = 0.19). Registrars achieved lower SEM and MDC values compared to the consultants ( Table 2). Paired sample t test did not show any differences in reliability between the two test occasions (data not shown).

Intrarater reliability
Individual intrarater reliability (ICC intra ) ranged from "fair to good" (0.62) to "excellent" (0.90) ( Table 1, right column). The mean intrarater reliability for all raters was "excellent" (0.77) ( Table 3). The mean ICC for the registrars was "excellent" (0.83) and for consultants "fair to good" (0.70), but the difference was not statistically significant (p = 0.12). Similar to SEM and MDC, the values for within-subject SD and RC were lower for registrars compared to the consultants (Table 3).

Agreement
The overall percentage agreement for all raters was 83.9 for the first test occasion and 82.1 for the second test occasion ( Table 4). The specific agreement for choosing arthroplasty as treatment, based on the recommended cutoff value of a posterior tilt of at least 20°, was 61.3 and 54.6 for the first and second test occasions, respectively.

Discussion
Eight orthopedic surgeons measured the posterior tilt in 50 Garden-I and Garden-II FNFs and achieved excellent interand intrarater reliability. However, the MDC ranged from 11.4 to 16.6 and the RC from 13.9 to 16.3 (Tables 2 and 3).
We estimated inter -and intrarater reliability of posterior tilt measurements based on the ratings of four registrars and four consultants in orthopedic surgery. These measurements are of clinical importance because the presence of a preoperative posterior tilt in Garden-I and Garden-II FNFs has been associated with increased risk of fixation failure. In general, these fractures are treated with internal fixation, but arthroplasty has been recommended for fractures exceeding a cutoff value of 20°posterior tilt [1,5]. To estimate variations in repeated measurements and variations in measurements made by different raters, we calculated the RC and the MDC. The RC represents the difference between two measurements made by the same rater on the same subject, and for 95% of pairs of observations, the difference will be less than the value of the RC. The MDC estimates the smallest change that can be detected beyond measurement error. We also evaluated the overall percentage agreement as well as specific agreement to provide information at a practical level.   ICC values for angular measurement were excellent, but the MDC was between 11.4 and 16.6 and the RC in the range of 13.9-16.3. These findings are relevant because variations in measurements of 15°are not inconsequential given the proposed treatment algorithm recommending internal fixation when the posterior tilt is < 20°and arthroplasty when the posterior tilt is ≥ 20° [5]. These observations could also partially explain discrepancies in the literature regarding the risk of treatment failure associated with preoperative posterior tilt [6].
Palm et al. reported excellent inter-and intrarater reliability among eight raters that evaluated posterior tilt in 17 Garden-I and Garden-II FNFs with ICC values of 0.87 (range 0.74-0.94) and 0.91 (range 0.83-0.95), respectively [5]. In the present study, the corresponding ICCs were also interpreted as excellent albeit with lower coefficients. Importantly, Palm et al. did not assess the repeatability or the MDC, but they did report inter-and intrarater kappa values for the choice of treatment and total percentage agreement for eight raters in 15 out of 17 cases (88.2%) [5]. In the present study, the total percentage agreement was between 82.1 and 83.9 for all raters. The specific agreement for choosing arthroplasty as treatment was 54.6-61.3%.
We recently reported excellent reliability of posterior tilt measurements performed by two orthopedic surgeons [3], with reliability similar to what has been published by Palm et al. In the present study, we invited eight orthopedic surgeons with no previous experience using the same measuring method and evaluated inter-and intrarater reliability. We used four registrars and four consultants to better reflect the staff of an orthopedic trauma hospital unit, and the resulting ICCs were lower than expected. The differences in reliability may also indicate that reliability could improve over time with more experience, although there was no improvement comparing the first and second measuring session.
We followed recommended guidelines for performing reliability studies [7]. The proportions of patients with a posterior tilt of at least 20°were similar in the randomly selected sample of 50 patients as compared to the cohort of 322 patients from which the sample was obtained [3]. The proportions of patients with a posterior tilt of at least 20°were 9 of 50 versus 43 of 322 (p = 0.38), and this supports the assumption that the sample of 50 patients was representative. Furthermore, there was no learning effect between the sessions as inter-and intrarater reliability was similar at both sessions, indicating that the second reading was independent of the first.
The present study has several limitations. None of the raters had any experience with the measuring procedure or the software used. Although we did not show any significant learning effect between the two rating occasions, a learning effect could still be present, accounting for higher reliability reported in previous studies. Furthermore, the software used in the present study differs from that used by Palm et al. in the original trial defining the measuring method [1]. The raters occasionally reported difficulties measuring posterior tilt due to poorly defined cortical contours. In a clinical setting, the clinician can acquire a new radiograph when image quality is poor, but we chose not to exclude lateral radiographs of poor quality to minimize the risk of selection bias. As a result, the reliability of posterior tilt measurements could be better in a clinical setting if radiographs of poor quality are replaced.
The mid-collum line could deviate substantially from the assumed central axis of the femoral neck, even though the three assisting lines were defined according to the procedure. These apparent mismatches occurred when the contours of the femoral neck were asymmetric or when radiographs demonstrated a double femoral neck contour (Fig. 2). As a result, the raters reported that they occasionally had to redefine the three assisting lines to achieve a reasonably oriented MCL.

Conclusion
In the present retrospective cohort study, interpretations of inter-and intrarater reliability of posterior tilt measurements ranged from "fair to good" to "excellent." The ICC values were lower than previously reported, and the MDC ranged from 11.4°to 16.6°. The specific agreement for choosing arthroplasty as treatment was 54.6-61.3%. The variations in posterior tilt measurements should be taken into account when choosing the type of implant for elderly patients with Garden-I and Garden-II femoral neck fractures.