Validation of a second-generation appropriateness classification system for total knee arthroplasty: a prospective cohort study

Background: To test the validity of a second-generation appropriateness system in a cohort of patients undergoing total knee arthroplasty (TKA).
Methods: We applied the RAND/UCLA Appropriateness Method to derive our second-generation system and conducted a prospective study of patients diagnosed with knee osteoarthritis in eight public hospitals in Spain. Main outcome questionnaires were the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), Short-Form-12 (SF-12), and the Knee Society Score satisfaction scale (KSS), completed before and 6 months after TKA. Baseline, changes from baseline to 6 months (journey outcome), and 6-month scores (destination outcome) were compared according to appropriateness category. Percentage of patients attaining the minimal clinically important difference (MCID) and responders according to Outcome Measures in Rheumatology-Osteoarthritis Research Society (OMERACT-OARSI) criteria were also reported.
Results: A total of 282 patients completed baseline and 6-month questionnaires. Of these, 142 (50.4%) were classified as Appropriate, 90 (31.9%) as Uncertain, and 50 (17.7%) as Inappropriate. Patients classified as Appropriate had worse preoperative pain, function, and satisfaction (p < 0.001) and had greater improvements (i.e., journey scores) than those classified as Inappropriate (p < 0.001). At 6 months, destination scores for pain, function, or satisfaction were not significantly different across appropriateness categories. The percentage of patients meeting responder criteria (p < 0.001) and attaining MCID was statistically higher in Appropriate versus Inappropriate groups in pain (p = 0.04) and function (p = 0.004).
Conclusions: The validity of our second-generation appropriateness system was generally supported. The findings highlight a critical issue in TKA healthcare: whether TKA appropriateness should be driven by the extent of improvement, by patient final state, or by both.
Supplementary Information The online version contains supplementary material available at 10.1186/s13018-021-02371-z.


Background
Utilization of total knee arthroplasty (TKA) has risen substantially over the past few decades. In the USA in 2014, for example, 723,000 TKA surgeries were performed and, based on 2000-2014 data, growth of 85% to 1.26 million procedures is projected by 2030 [1]. Consistent with variability in TKA utilization, there is also variability in recommendations for who should qualify for TKA [2]. Variation in TKA recommendations is important because about 20% of patients have a poor outcome after TKA [3].
Although it is generally accepted that TKA is an effective treatment for symptomatic knee osteoarthritis (OA), there are controversies about the indication criteria. There have been several attempts to establish criteria to recommend TKA, from the first studies [4,5] to more recent works reflecting perspectives from orthopedic surgeons [6,7], patients [8], and other stakeholders [9]. On the other hand, we found no studies applying these criteria to patients other than our first-generation appropriateness classification system for TKA [10]. The system described in that landmark paper, published in The Journal of Evaluation in Clinical Practice in 2003, lacked specificity in a variety of criteria (e.g., the symptomatology criteria did not have mutually exclusive categories), was developed two decades ago and is outdated, and did not include variables related to psychological distress or comorbidity [11,12].
It is well known that randomized clinical trials (RCTs) are the best way to assess healthcare interventions, but they are lacking in TKA [13]. One alternative to the RCT is to synthesize the opinions of experts [14]. The RAND/UCLA Appropriateness Method (RUAM) [14] has been used to evaluate appropriateness in several diagnostic and surgical procedures [15,16]. The RUAM is a consensus-based multi-step method that requires an expert panel to synthesize the evidence, identify key indicators (predictor variables) for the problem of interest, and then write a complete set of brief clinical scenarios that capture all permutations of the indicators. An independent second panel of experts then scores each of the scenarios using a well-established appropriateness ranking system [14]. More recently, the American Academy of Orthopaedic Surgeons (AAOS) has used RUAM to develop appropriateness criteria for several procedures including TKA [17].
We designed our study to fill an important gap in the translation of knee arthroplasty research evidence [18] by developing and testing a RAND-based classification system for knee arthroplasty appropriateness that was grounded in contemporary practice. For example, our newly proposed system includes psychological distress and comorbidity indicators that have not been used in other RAND-based methods.
Our objective was to test the validity of our second-generation RUAM-based TKA appropriateness system. Our approach to validity is consistent with the methods for testing validity advocated by the developers of the RUAM system [14]. We judged the presence of validity by comparing baseline, change, and final outcome scores of persons classified as Appropriate, Inappropriate, or Uncertain. Outcome scores can be compared retrospectively, by applying classification criteria to already collected patient data, as in an earlier paper [18], or prospectively, by applying them to patient data collected after a system is developed, the method used in the current study. Our analytic approach and hypotheses were grounded in consideration of pain, function, and satisfaction measures along a time-based continuum from preoperative baseline to change over time (i.e., the journey) and to final destination at 6 months (i.e., the final time point) [19]. We hypothesized that (1) patients classified as Appropriate would have greater pain and worse self-reported knee-related function as well as less satisfaction with their current knee health prior to TKA as compared to those classified as Inappropriate, (2) baseline to 6-month pain and function change scores would be greater for patients classified as Appropriate as compared to persons classified as Inappropriate, and (3) 6-month pain, function, and satisfaction scores would be approximately the same for all three classification groups [18].

Classification system development
Criteria for developing our TKA appropriateness system were based on the method developed by the RAND Corporation and the University of California in Los Angeles (RUAM) [14]. The RUAM process used to generate the TKA appropriateness system has been described in detail elsewhere [20] and appears in brief in Additional file 1.

Validity study
We carried out a prospective cohort study, which took place in eight hospitals belonging to the Spanish National Health Service. This study complies with the Declaration of Helsinki, and the corresponding Institutional Review Boards approved the study (registration ID: PI2016135, issued on 29 November 2016). All patients who agreed to participate signed an informed consent form.
Consecutive patients placed on the waiting list to undergo primary TKA for OA between January 2017 and January 2018 were eligible for the study. Patients with psychiatric diseases were excluded because of the potential for biased or incorrect responses when filling out the questionnaires. We collected data directly from patients via questionnaires. We sent the questionnaires to patients at baseline while on the waiting list and at 6 months post-surgery. The questionnaires included sociodemographic questions and pain, function, and health-related quality of life (HRQoL) instruments. A reminder letter was sent to patients who had not replied after 15 days.

Baseline and follow-up outcome questionnaires
The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) questionnaire is an arthritis-specific questionnaire developed for patients with hip or knee OA. It comprises 24 items, grouped into three dimensions: pain (5 items), stiffness (2 items), and physical function (17 items). We used the Likert version with five response levels for each item. The data were standardized to a range of values from 0 (best) to 100 (worst). The WOMAC has been validated in Spanish [21].
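As a minimal sketch of the standardization described above, the Python snippet below rescales a summed WOMAC subscale to the 0 (best) to 100 (worst) range. It assumes that each Likert item is coded 0 to 4, which the text does not state explicitly; the function name and the example responses are hypothetical and are not study data.

```python
def womac_standardized(item_scores, max_per_item=4):
    """Rescale a summed WOMAC subscale to 0 (best) .. 100 (worst).

    Assumes each item is coded 0..max_per_item (an assumption; the
    article only says the Likert version has five response levels).
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(not 0 <= s <= max_per_item for s in item_scores):
        raise ValueError("item score outside the assumed coding range")
    return 100.0 * sum(item_scores) / (max_per_item * len(item_scores))


# Hypothetical responses to the 5 pain items (14 of 20 possible points)
pain_items = [3, 2, 4, 3, 2]
pain_score = womac_standardized(pain_items)
```

The same function applies to the stiffness (2 items) and physical function (17 items) dimensions, since the rescaling depends only on the number of items.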
The Short-Form-12 (SF-12), a validated HRQoL measure, was also administered and provides scores on two summary measures, the mental (MCS) and physical (PCS) component summaries. Both scales are scored from 0 (worst status) to 100 (best status) and have been validated in Spanish [22].
The satisfaction subscale of the Knee Society Clinical Rating System (KSS), a widely used self-report questionnaire in TKA, was used to measure patient satisfaction. Satisfaction questions ask patients to rate the degree of satisfaction with their pain or function while sitting, lying in bed, getting out of bed, and performing light household activities and recreational activities, each on a five-level Likert scale, from very satisfied to very dissatisfied. The five satisfaction items are summed to calculate a total satisfaction score. The Spanish version has been validated [23].

Statistical analysis
The unit of analysis was the patient, meaning that only one knee of each patient was included in the data analysis. In cases where bilateral staged TKA was performed during the study period, only the data from the first TKA were used. Descriptive statistics included frequency tables, means, and standard deviations (SD). Baseline characteristics were compared for those with versus those without follow-up data using the t-test for continuous variables and the chi-square or Fisher's exact test for categorical variables. Our focus for the key analyses was on comparisons of baseline differences, comparisons of changes over the 6-month period (i.e., the journey outcomes), and comparisons at the final 6-month time point (i.e., the destination outcomes).
The WOMAC, SF-12, and KSS scores at baseline and at 6 months, and the changes from baseline to 6 months, were compared amongst the three appropriateness groups. Change scores were oriented so that positive values indicate improvement following surgery. We used the chi-square test, analysis of variance (ANOVA) with the Scheffé post hoc test, or the nonparametric Kruskal-Wallis test, as appropriate.
We compared the percentage of patients exceeding the WOMAC pain and function minimal clinically important difference (MCID) amongst the three appropriateness groups using the chi-square test. Previously published MCID cut-off values by quartiles of baseline severity in WOMAC were used [24]. We also compared responders according to the definition of the OMERACT-OARSI set of responder criteria [25] amongst the three appropriateness groups. The KSS satisfaction questions were dichotomized into satisfied (answers of very satisfied and satisfied) versus unsatisfied patients (answers of neutral, dissatisfied, and very dissatisfied). The percentages of patients in each group were compared according to appropriateness by chi-square tests. All effects were considered statistically significant at p < 0.05. Statistical analysis was performed using SPSS v17 and SAS for Windows statistical software, version 9.4 (SAS Institute, Inc., Cary, NC).
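Because the instruments run in opposite directions (WOMAC: 0 is best; SF-12: 100 is best), the convention that positive change means improvement requires flipping the subtraction for one of them. The short Python sketch below illustrates one way to do this; the function name and the patient values are hypothetical and do not come from the study data.

```python
def improvement(baseline, follow_up, higher_is_better):
    """Return a change score oriented so that positive means improvement."""
    if higher_is_better:
        return follow_up - baseline
    return baseline - follow_up


# Hypothetical patient: WOMAC pain falls from 60 to 20 (lower is better),
# SF-12 PCS rises from 30 to 42 (higher is better).
womac_pain_change = improvement(60.0, 20.0, higher_is_better=False)
sf12_pcs_change = improvement(30.0, 42.0, higher_is_better=True)
```

Both change scores come out positive here, so the journey outcomes of different instruments can be read in the same direction.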
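The dichotomization of the KSS satisfaction items described above can be sketched in Python as follows. The answer labels follow the five Likert levels named in the text; the function names and the example answers are hypothetical.

```python
SATISFIED_ANSWERS = {"very satisfied", "satisfied"}
UNSATISFIED_ANSWERS = {"neutral", "dissatisfied", "very dissatisfied"}


def is_satisfied(answer):
    """Map a five-level Likert answer to satisfied (True) or unsatisfied (False)."""
    label = answer.strip().lower()
    if label in SATISFIED_ANSWERS:
        return True
    if label in UNSATISFIED_ANSWERS:
        return False
    raise ValueError(f"unknown answer: {answer!r}")


def percent_satisfied(answers):
    """Percentage of patients in a group who count as satisfied."""
    flags = [is_satisfied(a) for a in answers]
    return 100.0 * sum(flags) / len(flags)


# Hypothetical answers from four patients to one satisfaction item
example_answers = ["Very satisfied", "neutral", "satisfied", "dissatisfied"]
example_percent = percent_satisfied(example_answers)
```

Note that "neutral" is counted as unsatisfied, as in the text, so the resulting percentages are conservative with respect to satisfaction.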

Sample size
We hypothesized statistically significant differences in the percentage of patients exceeding the MCID for WOMAC pain scores between the Appropriate and the Inappropriate groups, based on the results previously reported [26], where 69% were classified as Appropriate and 13% as Inappropriate. Therefore, 35 subjects classified as Inappropriate and 185 classified as Appropriate were required to detect a statistically significant difference between the percentages of patients exceeding the MCID for pain of 50% for Inappropriate and 75% for Appropriate cases. The analysis was based on an alpha risk of 0.05 and a power of 0.8 in a two-sided test.
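The calculation can be approximated with a standard normal-approximation formula for comparing two proportions with unequal group sizes. The Python sketch below is illustrative only: the text does not state which formula or software was used, so the pooled-variance formula without continuity correction is an assumption, as is the use of the 69:13 ratio as the allocation ratio. The result is close to, but need not coincide with, the reported 35 and 185.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_sample_size(p1, p2, ratio, alpha=0.05, power=0.80):
    """Sample sizes (n1, n2) for comparing two proportions, with n2 = ratio * n1.

    Normal approximation with pooled variance under the null hypothesis
    and no continuity correction (an assumption, not the authors' stated method).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + ratio * p2) / (1 + ratio)  # pooled proportion
    null_sd = sqrt((1 + 1 / ratio) * p_bar * (1 - p_bar))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2) / ratio)
    n1 = ((z_alpha * null_sd + z_beta * alt_sd) / (p1 - p2)) ** 2
    return ceil(n1), ceil(ratio * n1)


# Group 1 = Inappropriate (50% exceed the MCID), group 2 = Appropriate (75%).
# The 69:13 ratio reflects the group proportions in the earlier study [26].
n_inappropriate, n_appropriate = two_proportion_sample_size(
    0.50, 0.75, ratio=69 / 13
)
```

Adding a continuity correction or using an unpooled variance would shift the numbers somewhat, which may account for any gap between this sketch and the published figures.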

Results
A total of 334 patients who met the inclusion criteria completed the questionnaires prior to TKA. Of these, 282 patients (84.4%) returned the questionnaires at 6 months. There were no significant differences in any variable between those with and those without 6-month follow-up data (Table 1). Of the 282 participating patients with completed data, 142 were classified as Appropriate (50.4%), 90 as Uncertain (31.9%), and 50 as Inappropriate (17.7%). The mean age was 70.9 years (SD, 8.3) and 184 (65.2%) were women. Other questionnaires used in classification and score means for the total sample as well as each appropriateness subgroup appear in the supplementary material. The focus of this report is on the key outcome measures.

Baseline score comparisons
At baseline, there were significant differences amongst appropriateness groups in all WOMAC, SF-12, and KSS satisfaction scores (Table 2). Regarding the WOMAC questionnaire, gradient differences amongst the three groups were found, with worst to best from Appropriate to Uncertain to Inappropriate, respectively, in the three domains (p < 0.001). These significant gradient differences also were found for each of the five satisfaction questions and the overall satisfaction score of the KSS (p < 0.001).

Comparisons of 6-month journey (change) scores
There were differences amongst the three groups (p < 0.001) in the WOMAC domains (Table 3). The KSS satisfaction scores demonstrated differences between Appropriate and Inappropriate groups (p < 0.001), with a gradient in improvement from Appropriate to Uncertain to Inappropriate. In all cases, patients classified as Appropriate had greater improvement than those classified as Uncertain, who in turn improved more than those classified as Inappropriate. No differences amongst groups were found for the PCS, but there were differences in the MCS change scores between Appropriate and Inappropriate groups (p = 0.03).
Results regarding patients attaining the baseline adjusted WOMAC pain MCID (

Comparisons of 6-month destination (final outcome) scores
None of the scales showed statistically significant differences among the three appropriateness groups at 6 months (Table 3). Applying the OMERACT-OARSI criteria to the total sample, the percentage of responders was 90.8% for the Appropriate group, 88.9% for Uncertain, and 64.0% for the Inappropriate group (p < 0.001).
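The responder comparison above can be checked approximately with a Pearson chi-square test on a 2 x 3 table. The Python sketch below is illustrative: the responder counts are back-calculated from the reported percentages and group sizes (an assumption, since the counts themselves are not given here), and the function name is hypothetical.

```python
from math import exp


def chi_square_2xk(events, totals):
    """Pearson chi-square statistic for a 2 x k table of events vs non-events."""
    n = sum(totals)
    overall_rate = sum(events) / n
    statistic = 0.0
    for observed_yes, group_n in zip(events, totals):
        expected_yes = group_n * overall_rate
        expected_no = group_n - expected_yes
        observed_no = group_n - observed_yes
        statistic += (observed_yes - expected_yes) ** 2 / expected_yes
        statistic += (observed_no - expected_no) ** 2 / expected_no
    return statistic


# Order: Appropriate, Uncertain, Inappropriate
group_sizes = [142, 90, 50]
# Counts back-calculated from 90.8%, 88.9%, and 64.0% (an assumption)
responders = [129, 80, 32]

chi2 = chi_square_2xk(responders, group_sizes)
# Three groups give 2 degrees of freedom, for which the chi-square
# upper-tail probability has the closed form exp(-x / 2).
p_value = exp(-chi2 / 2)
```

Under these assumed counts the p-value falls below 0.001, in line with the reported result. The closed-form tail probability holds only for 2 degrees of freedom; other table sizes would need a general chi-square distribution function.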

Satisfaction: baseline and destination comparisons
Dichotomized patient ratings of satisfaction appear in Table 5. At baseline, there were statistically significant differences (p < 0.001) in the five questions. At baseline, a higher percentage of patients classified as Inappropriate were satisfied as compared to the Appropriate and Uncertain subgroups, with percentages ranging from 13.3 to 22.4% satisfied with function items and approximately 45% satisfied for pain items. At 6 months, there were no differences in the percentages of satisfied patients amongst appropriateness groups.

Discussion
The main objective of our study was to prospectively test the validity of our RAND/UCLA appropriateness system by recruiting a sample of patients undergoing TKA. Our first hypothesis was confirmed. Baseline scores for WOMAC Pain and Function as well as satisfaction with current status were statistically different amongst the three appropriateness groups. The most severely affected patients were classified as Appropriate and the least severely affected patients were classified as Inappropriate.
In addition to being statistically significant, the differences were quite large in magnitude, for example, > 30 points between Appropriate and Inappropriate groups in baseline pain and function. At baseline, the differences between Inappropriate and Appropriate groups in the percentage of patients satisfied or very satisfied with their current pain level or knee function were even larger. Differences in HRQoL, measured by SF-12, were consistent with other baseline measures, although their magnitudes were smaller, possibly because generic HRQoL measures are not designed to optimize between-patient differences. Patients classified as Appropriate had worse HRQoL relative to Uncertain or Inappropriate groups.
Data comparing Appropriate to Uncertain groups differ from a modified version of our first-generation TKA appropriateness tool where baseline scores for the Uncertain and Appropriate groups were very similar [18]. However, the baseline results of the current study are more in line with a prior study conducted in Spain [26], where we found the Appropriate group had more severe pain and functional deficits relative to the Uncertain group which, in turn, was worse than the Inappropriate group. Our suspicion is that the additional criteria added to our system (i.e., psychological factors, pain catastrophizing, and comorbidities) combined with deletion of older criteria (e.g., range of motion and knee stability) resulted in a clearer separation of Appropriate, Uncertain, and Inappropriate classifications as compared to the original (first generation) [5] or modified [18] system. We see this as a strength of the system, given that in theory, there should be clear separation and measurable baseline differences among the three appropriateness categories.
Our second hypothesis was also supported. Greater improvements occurred in patients classified as Appropriate relative to patients in the Inappropriate group for WOMAC scores and satisfaction. The percentage of patients attaining their baseline-adjusted MCID also supports this hypothesis. The Uncertain group demonstrated a pattern of improvement that mirrored the baseline findings. That is, the Uncertain group had improvements in WOMAC scores that fell approximately midway between the Appropriate and Inappropriate groups. Similar results were obtained when considering OMERACT-OARSI responder criteria, a finding also reported in prior work [27]; again there were differences amongst groups, with the Appropriate group demonstrating a higher percentage of patients considered as responders.
Finally, our hypothesis about final scores was supported. We did not find significant differences in 6-month WOMAC scores among the three classification groups. Much like our prior outcome studies [18,26], the three appropriateness groups all ended up with similar final pain and function scores. A novel and important finding in the current study is that final satisfaction scores among the three appropriateness groups also were not statistically different. We do not believe that a ceiling effect explains the lack of difference among the three groups at 6 months. All groups had additional room for improvement across all outcome measures.
When considering our results in total, our second-generation appropriateness system performed reasonably well in that the baseline differences among the three groups were more substantial than in prior studies [18,26], creating a clearer distinction among the three groups. However, 6-month changes and final destination outcome findings were more nuanced. While change scores were smaller for the Inappropriate and Uncertain groups relative to the Appropriate group, all groups ended up in approximately the same place and satisfaction at 6 months was not different among the three groups. These data suggest that patients classified as Inappropriate derived less benefit (i.e., their change scores were smaller) but they were, as a group, as satisfied as the other two appropriateness groups. This raises the question of whether TKA decision-making should be driven by the magnitude of benefit, that is, the change score from baseline, or whether it should be driven by patient satisfaction and pain and function at the final outcome time point (assessed here 6 months after surgery), or some combination of both. Our study cannot answer this question but will hopefully provide a stimulus for developing consensus on this issue. Losina and Katz posed a similar question [19]. Our study further informs deliberations on the question of appropriateness in that we found, for the first time, that satisfaction is equally high among the three appropriateness groups.
Another important finding is that the Uncertain group represented 31.9% of the sample. Baseline scores for the Uncertain group differed from those of the other two groups in nearly all dimensions and took intermediate values. It is possible that additional indication criteria variables would help to clarify whether this Uncertain group should actually be classified as Appropriate or Inappropriate. Additional work is needed to determine the reasons for this relatively large proportion of patients classified as Uncertain. An alternative explanation, given that TKA is an elective procedure, is that even with the application of a new classification system with multiple criteria for judging appropriateness, substantial uncertainty persists regarding determinations of appropriateness for surgery.
Table note (dichotomized satisfaction ratings): Scores are reported separately for the two pain items and the three function items. All satisfaction scores were dichotomized to indicate whether the patient was either satisfied (i.e., very satisfied or satisfied) or unsatisfied (neutral, dissatisfied, or very dissatisfied). The table reports the sample size (n) and percentage of patients for each appropriateness classification and for baseline and 6-month time points. TKA, total knee arthroplasty.
There are limitations to our study. The Hospital Anxiety and Depression Scale and the Pain Catastrophizing Scale are unlikely to be in common use in daily practice, and their use by surgeons may not have been optimal. While we relied on current evidence to drive the selection of these measures and their cutpoints, it may be that surgeons are unfamiliar with these measures and the new prognostic evidence that supports their use [3,28]. We believe that since they are variables supported by evidence to indicate the likely prognosis of TKA, their measurement should be incorporated, much like pain or functional capacity or radiographic OA, in a standardized way. In addition, our power analysis indicated we needed 185 patients in the Appropriate category and we ended up with 142. However, we still found significant differences among the classifications for percentage of patients meeting or exceeding the MCID for WOMAC scores (see Table 4).
Our data were collected in Spain and there is uncertainty regarding the extent to which these findings generalize to other countries. Moreover, while our validity analyses showed fairly dramatic differences among the three classification groups across baseline and change scores, there were no significant final outcome differences. Before one can conclude that this system may not demonstrate clear and strong differences among Appropriate versus Inappropriate and Uncertain groups at the final outcome, additional cohorts of patients should be studied to see if the more nuanced findings reported here are consistent across different samples.
Our outcomes were measured 6 months following surgery, and the results might differ with longer follow-up. However, data suggest that changes from 6 to 12 months are minimal for the WOMAC and likely other pain and function self-reported outcomes [29]. Finally, our study included patients who had already consented to undergo TKA, so we do not know whether variables considered important prior to obtaining patient consent for TKA, such as the extent of social support, prior treatment, or expectations, had been properly managed.
Future work on appropriateness criteria should be focused on external validation of our proposed system on patients from different countries and different healthcare systems. External validation will be important in judging the extent to which our system might impact TKA decision-making in other countries. Additionally, further development of our system should be encouraged as evidence identifies additional important indicators of outcome.

Conclusion
Our results generally supported the validity of our TKA appropriateness classification system though the clinical impact of these findings is likely to be modest. The findings highlight a critical issue in TKA decision-making going forward. Whether appropriateness should be driven primarily by the magnitude of improvement over time or by patient satisfaction and pain and functional status following recovery is unknown. Consensus development on this issue should be a high priority for stakeholders involved with TKA healthcare delivery.