Assessing responsiveness of the EQ-5D-3L, the Oxford Hip Score, and the Oxford Knee Score in the NHS patient-reported outcome measures

Background The degree to which a validated instrument is able to detect clinically significant change over time is an important issue for the better management of hip or knee replacement surgery. This study examines the internal responsiveness of the EQ-5D-3L, the Oxford Hip Score (OHS), and the Oxford Knee Score (OKS) by various methods. Data from NHS patient-reported outcome measures (PROMs) linked to the Hospital Episodes Statistics (HES) dataset (2009–2015) were analysed for patients who underwent primary hip surgery (N = 181,424) and primary knee surgery (N = 191,379).
Methods Paired data-specific univariate responsiveness was investigated using the standardized response mean (SRM), the standardized effect size (SES), and the responsiveness index (RI). Multivariate responsiveness was furthermore examined using the defined capacity of benefit score (i.e. paired data-specific MCID), adjusting for baseline covariates such as age, gender, and comorbidities in the Box-Cox regression models. The observed and predicted percentages of patient improvement were examined both as a whole and by the patients' self-assessed transition level.
Results The results showed that both the OHS and the OKS demonstrated great univariate and multivariate responsiveness. The percentages of the observed (predicted) total improvement were high: 51 (54)% in the OHS and 73 (58)% in the OKS. The OHS and the OKS showed distinctive differences in improvement by the 3-level transition, i.e. a little better vs. about the same vs. a little worse. The univariate responsiveness of the EQ-5D-3L showed moderate effects in total by Cohen's thresholds. The percentages of improvement in the EQ-5D-3L were moderate: 44 (48)% for the hip and 42 (44)% for the knee replacement population.
Conclusions Distinctive percentage differences in patients' perception of improvement were observed when the paired data-specific capacity of benefit score was applied to examine responsiveness. This is useful in clinical practice as a rationale for access to surgery at the individual-patient level. This study shows the importance of analytic methods and instruments for investigation of the health status in hip and/or knee replacement surgery. The study finding also supports the idea of using a generic measure along with the disease-specific instruments in terms of cross-validation.


Background
The responsiveness of a health-related functional state is an important issue in arthroplasty surgery. Responsiveness is the ability of an instrument to detect clinically significant change in health status and as such reflects its impact on clinical practice over time [1][2][3]. It is well recognized that measurement properties can vary according to the study population of interest. This is particularly true of the generic measures, especially those measuring responsiveness. The decision to use a generic or disease-specific instrument to detect responsiveness will also depend on the study design, objectives, and evaluation of cost-effectiveness [4]. Generic health status measures seek a broad perspective that is not specifically related to the restricted scope of the health-related functional status of a particular disease. Generic measures allow the comparison of health status across different diseases and interventions [4,5].
Assessing outcomes of hip and knee replacement surgery with both generic and specific measures is enabled by the EQ-5D-3L, the Oxford Hip Score (OHS), and the Oxford Knee Score (OKS). The EQ-5D is a well-known and widely used generic patient-reported outcome questionnaire [6]. The current UK version of the EQ-5D-3L was introduced in 1997 as a generic measure of health for clinical and economic assessment [7]. It was designed to describe and value health by providing a single summary index-based value (utility; −0.59 to 1) representing the overall health-related quality of life by quantifying a preference for the individual's health state [8]. The questionnaire consists of a self-reported descriptive system that records three-level health problems (no/some/extreme problems) on each of five dimensions: mobility (i.e. problems in walking), self-care (i.e. problems with self-care), usual activities (e.g. work, study, housework, family, or leisure activities), pain/discomfort, and anxiety/depression.
The OHS and the OKS focus on the disease being studied, allowing greater sensitivity to intervention-related change compared to generic measures [4,9]. The OHS and the OKS each consist of 12 Likert-type response items, which relate to pain and disability experienced over the past 4 weeks [10,11]. Scores from each item are summed (responses coded from 'None' = 4 to 'Severe' = 0), providing a range of 0 to 48, with a higher score indicating better health status [12,13].
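The scoring rule described above (12 items, each coded 0 to 4, summed to a 0 to 48 total) can be illustrated with a minimal sketch. The function name and the input validation are assumptions made for illustration and are not part of the official scoring instructions; handling of missing items is deliberately left out.

```python
from typing import Sequence

N_ITEMS = 12          # items per Oxford questionnaire (OHS or OKS)
MAX_ITEM_SCORE = 4    # each item is coded from 0 (worst) to 4 (best)


def oxford_score(item_responses: Sequence[int]) -> int:
    """Sum 12 item responses (each 0-4) into a 0-48 total score.

    Hypothetical helper: a higher total indicates better health status.
    """
    if len(item_responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} items, got {len(item_responses)}")
    for value in item_responses:
        if not 0 <= value <= MAX_ITEM_SCORE:
            raise ValueError(f"item response out of range: {value}")
    return sum(item_responses)
```

The change score used later as the main outcome would then be the post-operative total minus the pre-operative total.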
Husted et al. [14] defined internal responsiveness as the ability of a measure to change over a pre-specified time frame. External responsiveness was defined as the extent to which changes in a measure relate to changes in other measures of health status; it reflects the relationship between change in the measure and change in an external standard [14]. External responsiveness between independent groups and cross-validation between measuring systems were explored in previous studies. However, those studies did not specify internal responsiveness for a single group, so the ability of these instruments to detect change needs to be examined using paired group-specific statistics. The aim of this study is to evaluate the paired data-specific responsiveness of the EQ-5D-3L, the OHS, and the OKS using various analytic methods, and to discuss which analytic methods and instruments should be used for the reporting system in arthroplasty surgery.

Data sources
Responsiveness was assessed for the population in the NHS patient-reported outcome measures (PROMs) data who had undergone hip or knee surgery in the UK (ref: NIC-392690-F7H2Q). Follow-up was measured 6 months after the hip or knee surgery. The NHS PROMs data linked to Hospital Episodes Statistics (HES) (2009–2015) recorded the pre-operative and the 6-month post-operative PROMs outcomes. The outcomes include the EQ-5D-3L and the respective hip and knee Oxford scores for all individuals who underwent hip and knee surgery in England [15].
The inclusion criteria were patients who had not received revision (primary surgery only) and who had not had previous surgery, using the 'Q1_PREVIOUS_SURGERY' question (N = 575,980). In addition, patients who completed both pre- and post-operative questionnaires were included, using the 'Q1 and Q2 Complete' questions (N = 443,262) [16]. For hip surgery, patients who submitted specific data were included for both the pre- and post-operative Oxford questionnaires to derive scores with sufficient procedure, using the 'HR Q1 and HR Q2 Score Complete' questions (N = 209,761) and the 'Q1 and Q2 EQ-5D Health Scale Complete Indicators' (N = 181,424). The same approach was applied for those undergoing knee surgery (N = 191,379).

Outcome and predictor variables
The change scores (the difference between the post- and pre-operative scores) of the EQ-5D-3L, the OHS, and the OKS were used as the main outcomes. The pre-operative EQ-5D-3L, OHS, and OKS scores were used as the main predictor variables for the respective change scores. Patients' age, gender, and important clinical exposures, namely 12 individual comorbidities (heart disease, high blood pressure, stroke, circulation, lung disease, diabetes, kidney, nervous system, liver disease, cancer, depression, arthritis), were used as other prognostic variables.

Transition question
The MCID (minimal clinically important difference), which can be linked to the improvement concept, was calculated using the patients' self-assessment of the 6-month post-operative outcomes relative to the pre-operative state. The MCID allows an estimation of the probability of a relevant improvement in the instrument following an intervention [17]. The assumption of the MCID is that the mean change score needed to obtain a medium or large effect size is clinically meaningful [18]. Clinically meaningful refers to a change that indicates the efficacy of the intervention in domains of a health-related functional status instrument [4]. The MCID can be calculated at the group level (using the anchor-based transition, in which the concept of 'minimal importance' is explicitly incorporated) and also at the distribution-based individual level (using the standardized response mean (SRM) applied paired data-specific MCID) [17,19,20]. In this paper, a combined approach was used: firstly, the SRM applied paired data-specific MCID was used to estimate the threshold for improvement, and secondly, patients' perception of improvement was estimated by the level of the transition in the multivariate regression models (Table 1).
The NHS PROMs contain post-operative satisfaction and success questions. The success question was applied in this study since it is considered more objective than the satisfaction question, which asks 'How would you describe the results of your operation? Excellent/Very good/Good/Fair/Poor'.
For the paired data-specific univariate responsiveness, the SRM, the standardized effect size (SES), and the responsiveness index (RI) were calculated.

Univariate responsiveness measures
In the present study, internal responsiveness was investigated with a focus on the internal standard of an individual, using the pre- and post-operative (paired) data, and compared as a psychometric property of the EQ-5D-3L, the OHS, and the OKS. For a critical assessment, internal responsiveness was calculated with several responsiveness formulae: the SRM, the SES, and the RI for the univariate statistics.

SRM for the paired data [4,20–22]
The paired data-specific SRM is

\[
\mathrm{SRM}_{\text{paired}} = \frac{\mathrm{Mean}_{\text{change score}} / \mathrm{SD}_{\text{change score}}}{\sqrt{2} \times \sqrt{1 - r}} \qquad \text{[Eq. 1]}
\]

where r is a correlation coefficient between the pre- and post-operative scores [4]. The pre- and post-operation data-specific SRM is the ratio between the mean change score and the variability (SD) of that change score within the same group (Mean_change score / SD_change score), and the difference between means for the independent data is standardized (i.e. divided) by a value √2 × √(1 − r) (as large as would be the case were they independent) [4,21]. (The SRM for the independent data is simply Mean_change score / SD_change score between the two groups [20].)
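The calculation described above can be sketched in a few lines. This is an illustrative reading of the text (the SRM divided by √2 × √(1 − r)), not the authors' code. The function names are hypothetical, and a Pearson correlation is used as a simplifying assumption when r is not supplied; a rank-based coefficient can be passed in through the `r` argument instead.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def srm(pre, post):
    """Standardized response mean: mean change / SD of the change scores."""
    change = [b - a for a, b in zip(pre, post)]
    return mean(change) / stdev(change)


def paired_data_srm(pre, post, r=None):
    """SRM divided by sqrt(2) * sqrt(1 - r), as described in the text.

    Assumption: r defaults to the Pearson correlation of pre and post scores.
    """
    if r is None:
        r = pearson_r(pre, post)
    return srm(pre, post) / (math.sqrt(2.0) * math.sqrt(1.0 - r))
```

When r = 0.5 the divisor equals 1 and the two statistics coincide; for more strongly correlated pre- and post-operative scores the divisor is below 1, so the adjusted value is larger than the unadjusted SRM.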

SES for the paired data
The SES was calculated using the patients' self-assessed transition level, i.e. much better, a little better, about the same, a little worse, and much worse [4].
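As an illustration, the SES by transition level might be computed as follows. The sketch assumes the conventional definition of the SES (mean change score divided by the SD of the baseline scores), applied within each transition group; the record layout and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def ses_by_transition(records):
    """Return {transition_level: SES} for (level, pre_score, post_score) records.

    Assumption: SES = mean change / SD of pre-operative scores within the group.
    Groups with fewer than two patients are skipped because the SD is undefined.
    """
    groups = defaultdict(list)
    for level, pre, post in records:
        groups[level].append((pre, post))

    result = {}
    for level, pairs in groups.items():
        if len(pairs) < 2:
            continue
        pre_scores = [p for p, _ in pairs]
        changes = [q - p for p, q in pairs]
        result[level] = mean(changes) / stdev(pre_scores)
    return result
```

A group whose pre-operative scores are all identical would have an SD of zero and raise a division error, so such groups would need separate handling in practice.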

RI for the paired data
The RI was proposed as the ratio of the average change produced by a treatment to the between-subject variability of difference scores in stable subjects [23]. The RI was calculated using the patients' self-assessed transition-based (i.e. a little better vs. about the same) MCID, where the MCID is defined according to a criterion, i.e. the difference in change score between those who perceived themselves as a little better and those who perceived themselves as about the same.

In addition to the univariate responsiveness measures, the patients' perception of improvement was estimated with a modelling approach, using Box-Cox regressions based on log-likelihood while adjusting responsiveness for patient characteristics, including age, gender, and 12 individual comorbidities. For a robust analytic approach, the paired data-specific MCID was defined as the threshold for improvement in the models.
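The RI defined earlier in this section can be sketched as below. The sketch follows the verbal definition only: the MCID is the difference in mean change score between the 'a little better' and 'about the same' groups, and the 'about the same' group is taken as the stable subjects, which is an assumption. The function name is hypothetical.

```python
from statistics import mean, stdev


def responsiveness_index(change_little_better, change_about_same):
    """RI = transition-based MCID / SD of change scores in stable subjects.

    Assumption: patients answering 'about the same' are the stable subjects.
    """
    mcid = mean(change_little_better) - mean(change_about_same)
    return mcid / stdev(change_about_same)
```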

Multivariate responsiveness measures
The threshold for improvement with the MCID for the paired data
Cohen introduced the matched pairs effect size [21], which was later renamed the standardized response mean (SRM) by Liang et al. [4,20].

Multivariate responsiveness using the regression models
The percentage improvement based on the paired data-specific MCID [Eq. 4] was examined as multivariate responsiveness of the EQ-5D-3L, the OHS, and the OKS to examine which instrument is sensitive enough to detect changes of improvement for the paired data. The result was additionally examined by the patients' self-assessed transition level, i.e. much better, a little better, about the same, a little worse, and much worse. The observed and estimated percentage improvements were examined separately where regression approaches were applied, adjusting for patient baseline covariates such as age, gender, and comorbidities. Adjusting for the covariates is one of the strengths in comparison to the responsiveness statistics described in the previous sections. The 3rd and the 2nd degree Box-Cox regressions based on log-likelihood were fitted to estimate the patients' perception of improvement. The impact of baseline covariates, i.e. age (as a continuous variable), gender, and individual comorbidities, was examined in total and by the transition level population (Fig. 1).
The Box-Cox regression models were selected among other statistical average models (e.g. polynomial regressions) and median-based models (e.g. quantile regressions), after the model diagnostic assessments. The model is robust for a non-normal dependent variable, transforming it towards a normal shape. The observed and estimated percentage improvements for the average population were calculated to examine whether the instrument has a good discriminative ability. The individual-level post-operative scores were modelled as a function of the transformed variables (pre-operative linear, quadratic, and cubic terms) and of the untransformed age, gender, and individual comorbidities. In comparison to the models with only pre-operative score terms, circulation and depression (whose chi-squared statistics are greater than 2000 in the models and whose coefficients are large, i.e. greater than 200 in absolute value) were selected to be adjusted for the hip outcomes. Circulation, diabetes, and depression were selected for the knee outcomes based on the same criteria.
The 3rd degree left-hand-side-only model obtaining the maximum likelihood estimates is as below for the OHS:

\[
y^{(\theta)} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \gamma_1 z_1 + \gamma_2 z_2 + \gamma_3 z_3 + \varepsilon
\]

where ε ~ N(0, σ²). y indicates the changed-operative score, and x indicates the pre-operative score. y is subject to a Box-Cox transform with parameter θ. z_1, z_2, z_3 are untransformed age, gender, circulation, and depression [26].
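A left-hand-side-only Box-Cox model of the kind described above can be sketched in plain Python. This is not the estimation routine used in the study: the transformation parameter θ is chosen by a simple grid search over the profile log-likelihood instead of a numerical optimiser, the least-squares fit uses the normal equations, and all function and variable names are hypothetical. The outcome y must be strictly positive, and rescaling x (e.g. to the 0 to 1 range) is advisable to keep the cubic terms well conditioned.

```python
import math


def boxcox(y, theta):
    """Box-Cox transform of a positive value y with parameter theta."""
    if abs(theta) < 1e-12:
        return math.log(y)
    return (y ** theta - 1.0) / theta


def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations; returns (beta, rss)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, t in zip(rows, targets))
    return beta, rss


def profile_loglik(theta, rows, y):
    """Profile log-likelihood of theta (up to an additive constant)."""
    n = len(y)
    _, rss = fit_ols(rows, [boxcox(v, theta) for v in y])
    return -0.5 * n * math.log(rss / n) + (theta - 1.0) * sum(math.log(v) for v in y)


def fit_boxcox_lhs(x, y, covariates, thetas):
    """Grid-search theta, then fit y^(theta) on 1, x, x^2, x^3 and covariates."""
    rows = [[1.0, xi, xi ** 2, xi ** 3, *z] for xi, z in zip(x, covariates)]
    best = max(thetas, key=lambda t: profile_loglik(t, rows, y))
    beta, _ = fit_ols(rows, [boxcox(v, best) for v in y])
    return best, beta
```

The returned coefficients are on the transformed scale, so predictions need to be back-transformed before they are compared with a threshold on the original score scale.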

Demographics
In total, 181,423 patients had hip replacement surgery; over half (N = 106,493; 59%) were female, with ages ranging from 13 to 100 years (SD 10. The Spearman's rank correlation coefficients for the pre- and post-operative scores, r, are provided by the transition level in Table 4. Large correlations between the pre- and post-operative scores are observed in patients with the transition level of about the same, a little worse, and much worse.

Univariate responsiveness measures for the paired data
The OHS and the OKS showed great univariate responsiveness in total, i.e. SRM [Eq. 1], SES [Eq. 2], and RI [Eq. 3]: 1.8, 2.8, and 0.6 (~0.7) in the OHS and 1.4, 2.5, and 0.7 in the OKS. In addition, the OHS and the OKS showed distinctive differences in the SRM [Eq. 1] by the 3-level transition, in particular, a little better vs. about the same vs. a little worse. The univariate responsiveness in total for the generic instrument EQ-5D-3L were 1.1, 1.6, and 0.3 (~0.4) for of 'A little better' in the EQ-5D-3L were 0.8 and 1.4, respectively, which can be interpreted as moderate and crucial differences in the 'successful' percentage in each of the two groups (r) of 0.37 and 0.57 [28]. This implies the SRM [Eq. 1] shows a good discriminative ability for the different severities in comparison to the SES [Eq. 2], and the EQ-5D-3L is less responsive in comparison to the OHS.

Fig. 1 The OHS and EQ-5D-3L: total population (1, 3) and the transition level (2, 4). Fitted 3rd degree Box-Cox regression lines (1) for the OHS total population and (2) by the patients' self-assessed transition level. The 2nd degree Box-Cox regression estimates (3) for the EQ-5D-3L total hip surgery population and (4) by the patients' self-assessed transition level. All the graphs are additionally presented by age group. Coloured dots indicate the 50th percentile for each category, and grey dots indicate actual observations. Grey horizontal lines indicate each defined score improvement (e.g. 22 for the OHS and 0.428 for the hip EQ-5D-3L). Percentiles of the EQ-5D-3L show widely dispersed patterns across all transition levels, whereas percentiles of the OHS show dispersed patterns in the 'A little worse' and 'Much worse' transition levels. Model performance of the OKS and the knee EQ-5D-3L is provided in Supplementary Figure 1.

The paired data-specific MCID as the threshold for improvement
The paired data-specific MCID [Eq. 4] was calculated, applying the SRM [Eq. 1] as a desired ES. Multivariate responsiveness was examined using the defined capacity of benefit score as improvement (i.e. 22 for the OHS and 0.428 for the hip EQ-5D-3L; 16 for the OKS and 0.309 for the knee EQ-5D-3L), adjusting for covariates. Various ways to assess the improvement for the independent data are presented in Supplementary Table 2. Those scores are smaller than the capacity of benefit scores for the paired data. The SRM applied MCIDs for the independent data are 6 for the OHS and 0.196 for the hip EQ-5D-3L, using Cohen's medium (0.5) effect size. The MDCs (minimal detectable changes, defined as the minimal change that falls beyond the measurement error in the measurement score [29]) are 6 for the OHS and 0.234 for the hip EQ-5D-3L, with ICC 0.9. The anchor-based MCIDs are 9 for the OHS and 0.101 for the hip EQ-5D-3L, using the short distance. The mean change scores using the anchor are 6 for the OHS and 0.106 for the hip EQ-5D-3L. A greater capacity of benefit score is required for the paired data than for the independent data in order to distinguish an actual effect of the surgery from one of chance in the pre- and post-operative outcomes.
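For orientation, the MDC mentioned above is commonly derived from the standard error of measurement. The sketch below uses the widely used formulation SEM = SD × √(1 − ICC) and MDC = z × √2 × SEM with z = 1.96 for a 95% level; whether the study used exactly this formulation, and which SD it applied, are assumptions. The function name is hypothetical.

```python
import math


def minimal_detectable_change(sd, icc, z=1.96):
    """MDC = z * sqrt(2) * SEM, where SEM = SD * sqrt(1 - ICC).

    Assumptions: z = 1.96 (95% level); sd is the score SD chosen by the analyst.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    sem = sd * math.sqrt(1.0 - icc)
    return z * math.sqrt(2.0) * sem
```

With a perfectly reliable instrument (ICC = 1) the MDC is zero, and it grows as reliability falls.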

Multivariate responsiveness measures: observed and predicted improvement
The percentage improvements based on patients' perceptions were high in the OHS and the OKS (Tables 7 and 8). The percentages of the observed (predicted) total improvement were 51 (54)% in the OHS and 73 (58)% in the OKS. In addition, the OHS and the OKS showed distinctive percentage differences by the 3-level transition, i.e. a little better vs. about the same vs. a little worse. As an example, the observed percentages of the 3-level transition were 10% vs. 4% vs. 1% in the OHS and 21% vs. 6% vs. 3% in the OKS. The percentages of the observed (predicted) total improvement in the generic instrument EQ-5D-3L were 44 (48)% for the hip and 42 (44)% for the knee replacement population. The observed (predicted) percentages of the 3-level transition in the EQ-5D-3L were 39 (41)% vs. 29 (11)% vs. 21 (4)% for the hip and 39 (45)% vs. 32 (36)% vs. 26 (14)% for the knee replacement population.
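The observed percentage improvement can be illustrated as the share of patients whose change score reaches the chosen threshold, in total and by transition level. The sketch assumes that 'improved' means a change score greater than or equal to the threshold; the record layout and function names are hypothetical, and predicted percentages would use model-based change scores in place of the observed ones.

```python
from collections import defaultdict


def percent_improved(changes, threshold):
    """Percentage of change scores at or above the threshold (assumed rule)."""
    if not changes:
        raise ValueError("no change scores supplied")
    improved = sum(1 for c in changes if c >= threshold)
    return 100.0 * improved / len(changes)


def percent_improved_by_transition(records, threshold):
    """records: iterable of (transition_level, change_score) pairs."""
    groups = defaultdict(list)
    for level, change in records:
        groups[level].append(change)
    return {level: percent_improved(vals, threshold) for level, vals in groups.items()}
```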
The observed (predicted) percentage improvements applying Cohen's ES (0.5 and 0.8) are additionally provided in Supplementary Tables 3 and 4 for the independent data. The observed (predicted) percentages for the medium improvement were 93 (99)% in the OHS and 85 (98)% in the OKS. The observed (predicted) percentage improvements in the EQ-5D-3L were 75 (74)% for the hip and 60 (58)% for the knee replacement population. The observed (predicted) percentages of the 3-

A great number of patients (86% for hip and 72% for knee) answered 'much better' for the success of the surgery (Table 2). In addition, the greater capacity of benefit score was applied for the calculation of the paired data-specific percentage improvement. Therefore, the overall percentages (%) of patients' perception of improvement are lower in comparison to the improvement for the independent data. The percentage differences by the transition level were much more distinctive when the paired data-specific capacity of benefit score was applied for the calculation. Based on the question,

Table 5 Hip: the SRM, SES, and RI (with 95% CIs) for the OHS and the EQ-5D-3L (by the transition)

Model performance
The area under the ROC curve (AUC) with 95% binomial exact confidence intervals was calculated for the observational data to examine the discriminative ability of the patient rating instruments, i.e. OHS, OKS, and EQ-5D-3L, with each MCID assumed to define the true improvement status (Tables 7 and 8). There was no significant sensitivity by two-period (Supplementary Figure 2).
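The AUC itself can be illustrated with its rank-based (Mann-Whitney) interpretation: the probability that a randomly chosen improved patient has a higher score than a randomly chosen non-improved patient, with ties counted as one half. The sketch below is a simple pairwise version with a hypothetical function name; it does not compute the exact binomial confidence intervals reported in the study and would be slow for cohorts of the size analysed here.

```python
def auc(scores, improved):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores: instrument scores; improved: matching booleans for the true status.
    """
    pos = [s for s, flag in zip(scores, improved) if flag]
    neg = [s for s, flag in zip(scores, improved) if not flag]
    if not pos or not neg:
        raise ValueError("both improved and non-improved patients are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```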

Discussion
The paired data-specific sensitivity of the EQ-5D-3L, the OHS, and the OKS was investigated to detect changes in the health state over time for the population who underwent hip or knee surgery in the UK. To ensure accuracy of the health status and instrument evaluation in hip and/or knee replacement surgery, the paired data-specific SRM was examined for the univariate responsiveness. In addition, the SES and the RI were calculated using the patients' self-assessed transition. Multiple responsiveness metrics were applied, including a robust modelling approach that adjusted for significant baseline covariates to estimate percentage improvements. From the modelling approach, the paired data-specific observed (and predicted) percentages of improvement were distinctive by the transition level (Tables 7 and 8). The multivariate modelling method provided robust responsiveness statistics in terms of adjusting for the patient demographic information and comorbidities. Responsiveness from the models was interpretable on a percentage scale of improvement. A greater capacity of benefit score is applied to the calculation of improvement for paired data. Therefore, the overall percentages (%) of patients' perception of improvement are relatively low. Missing cases of predicted improvement at certain transition levels are inevitable for the Oxford questionnaires, which have ceiling effects when a large share of the study population answers 'much better' after the surgery.
Disease-specific and generic instruments are both available in the PROMs data in the UK, and they showed reasonable responsiveness as health-related instruments that measure functional state. A previous study using the NHS patient-reported outcome measures (PROMs) supports moderate correlations (0.3 to 0.6) between the EQ-5D-3L and other measures of patient-reported health changes, including the OHS and the OKS [30]. Nonetheless, there has been a lack of evidence to support the ability to discriminate. In terms of detecting clinically significant changes in arthroplasty surgery, although it has not been firmly established yet, a number of studies indicated that disease-specific instruments are more responsive than generic instruments [4,31–35]. The present study showed that, although the responsiveness was greater and more distinctive in the disease-specific instruments, the responsiveness of the EQ-5D-3L for hip and knee surgery is reasonably good. The EQ-5D would be useful in terms of short completion time and good validity [3]. Nevertheless, it may not be sufficiently sensitive to be used solely in hip and/or knee replacement surgery, either to discriminate between cases of differing severities by a transition question or to detect the changes in severity or functional status over time [21].
The accurate identification and early-stage stratification of patients undergoing hip and/or knee replacement is one of the greatest unmet needs. A robust and precise measurement instrument will be effective in the management of arthroplasty surgery for particular groups of patients. Evidence has been provided that the OHS and the OKS are able to contribute to the better management of arthroplasty surgery. In general, arthroplasty surgery is decided at the individual level in terms of a patient's expectations, symptoms, diagnoses, and degree of pain. Although the excellence of the Oxford questionnaires over other patient-reported questionnaires was examined, the Oxford questionnaires have a ceiling effect, and the threshold levels are always a trade-off between sensitivity and specificity. Moreover, the current version of the OHS or the OKS does not contain a psychological measurement such as depression or anxiety, which is also important in health outcome. Further investigation is required into their potential roles in clinical or trial use, cost-effectiveness, and their effects on referral patterns.

Strengths and limitations
The strength of this study includes the use of a large cohort dataset linked to HES on both hip and knee replacement surgery, which provided enough power to support the research outcomes. Although the sample size is large enough to validate the improvement values using complete-case analysis, validation by an external data set was not conducted. The study design may be suboptimal compared to a well-blinded randomized clinical trial. Additional care may be required in the interpretation of patients' socio-demographics, clinical/treatment, and other unobserved covariates that may not have been adjusted for. A secondary transition was not used in the study. The NHS PROMs data contain only a one-point transition measurement (6 months post-operation), and a more objective point assessment may need to be considered [36]. The mean change score using a patient-reported transition (i.e. an anchor approach) has a limitation, in that the one-point transition measurement relies on a patient's memory of global health status, and it could be a more subjective change measurement in contrast with each of the pre- and post-point assessments [36]. In addition, measurement errors should be accounted for in repeatedly measured patient-reported outcomes. There are several ways to control the errors, such as use of the MDC approach (i.e. the threshold for improvement adjusted for measurement error) or applying advanced statistical inference approaches such as Bayesian models with computational methods. A further potential limitation is that it is not easy to precisely estimate a percentage improvement using the model fitting with the EQ-5D-3L due to the nature of the real number scale (−0.59 to 1), and the scale is very dispersed (Supplementary Figure 3).