Artificial intelligence and machine learning on diagnosis and classification of hip fracture: systematic review
Journal of Orthopaedic Surgery and Research volume 17, Article number: 520 (2022)
In the emergency room, clinicians work under considerable time pressure and mental stress. In addition, fracture classification is important for determining the surgical method and restoring the patient's mobility. Recently, artificial intelligence (AI) and machine learning (ML) have made it possible to diagnose and classify hip fractures quickly with computer assistance. The purpose of this systematic review is to identify studies that diagnose and classify hip fractures using AI or ML, organize the results of each study, and analyze the usefulness of this technology and its potential future value.
PubMed Central, OVID Medline, Cochrane Collaboration Library, Web of Science, EMBASE, and AHRQ databases were searched to identify relevant studies published up to June 2022 with English language restriction. The following search terms were used [All Fields] AND (", "[MeSH Terms] OR (""[All Fields] AND "bone"[All Fields]) OR "bone fractures"[All Fields] OR "fracture"[All Fields]). The following information was extracted from the included articles: authors, publication year, study period, type of image, type of fracture, number of patients or images used, fracture classification, reference diagnosis for fracture diagnosis and classification, and augmentation method of each study. In addition, the AI name, CNN architecture type, ROI or important region labeling, data input proportion in training/validation/test, diagnosis accuracy/AUC, and classification accuracy/AUC of each study were also extracted.
In the 14 studies finally included, the accuracy of hip fracture diagnosis by AI was 79.3–98%, and the accuracy of fracture diagnosis by AI-aided humans was 90.5–97.1%. The accuracy of human fracture diagnosis was 77.5–93.5%. The AUC of fracture diagnosis by AI was 0.905–0.99. The accuracy of fracture classification by AI was 86–98.5%, and the AUC was 0.873–1.0. The forest plots showed that the mean AI diagnosis accuracy was 0.92, the mean AI diagnosis AUC was 0.969, the mean AI classification accuracy was 0.914, and the mean AI classification AUC was 0.933. Among the included studies, architectures based on the GoogLeNet or DenseNet model were the most common, with three studies each. Among the data input proportions, the lowest training proportion was 57% and the highest was 95%. Of the 14 studies, 5 used Grad-CAM to highlight important regions.
We expect that our study may help in making judgments about the use of AI in the diagnosis and classification of hip fractures. AI is clearly a tool that can help medical staff reduce the time and effort required for hip fracture diagnosis while maintaining high accuracy. Further studies are needed to determine its effect in actual clinical situations.
In the emergency room, clinicians work under considerable time pressure and mental stress. They must review many images and laboratory tests, and fatigued clinicians (especially residents) are prone to misdiagnosis. Previous studies have reported that about 2–10% of hip fractures are misdiagnosed. Early diagnosis and treatment of elderly patients with hip fracture are very important for the clinical course. Delay in diagnosis or surgery causes complications such as pneumonia and psoa in these patients and increases morbidity and mortality rates. This not only reduces the patient's quality of life but also imposes an economic burden.
Diagnosis can be defined as determining the cause and characteristics of an individual patient's disease, whereas classification mainly serves to create a relatively homogeneous population through standardized criteria, which is an important factor in disease research. In addition, fracture classification is important for determining the surgical method and restoring the patient's mobility. Since the surgical method is directly related to medical cost, several countries have provided treatment guidelines according to the classification of hip fractures. However, classifying fractures from a large amount of image information is time-consuming.
Currently, most medical institutions use digital medical imaging systems, which overcome the temporal and spatial limitations of access to image information. In addition, artificial intelligence (AI) and machine learning (ML) have recently made it possible to diagnose and classify hip fractures quickly with computer assistance. Studies applying AI or ML to hip fracture detection have used various types of image information, including computed tomography as well as radiographs, and have reported varying results on diagnostic usefulness and the accuracy of fracture classification.
Therefore, the purpose of this systematic review is to identify studies that diagnose and classify hip fractures using AI or ML, organize the results of each study, and analyze the usefulness of this technology and its potential future value.
Study eligibility criteria
Studies were selected based on the following inclusion criteria: (1) studies using AI or ML techniques for diagnosis or classification of hip fracture; (2) studies reporting the type of imaging information used; and (3) studies reporting statistical analysis of accuracy or area under the receiver operating characteristic (ROC) curve (AUC) for diagnosis or classification of hip fracture. Studies were excluded if they failed to meet the above criteria.
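The two outcome measures named in criterion (3) can be illustrated with a minimal sketch. It assumes binary reference labels (1 = fracture, 0 = no fracture) and a continuous model score per image; the function names, the 0.5 threshold, and the toy data are hypothetical and are not taken from any included study. The AUC is computed with the rank-based (Mann-Whitney) formulation, which is equivalent to the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases where the predicted label equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

print("accuracy:", accuracy(labels, preds))
print("AUC:", auc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is one reason the two measures are reported separately.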
Search methods for identification of studies
PubMed Central, OVID Medline, Cochrane Collaboration Library, Web of Science, EMBASE, and AHRQ databases were searched to identify relevant studies published up to June 2022 with English language restriction. The following search terms were used [All Fields] AND (", "[MeSH Terms] OR (""[All Fields] AND "bone"[All Fields]) OR "bone fractures"[All Fields] OR "fracture"[All Fields]). Manual search was also conducted for possibly related references. Two of us reviewed the titles, abstracts, and full texts of all potentially relevant studies independently, as recommended by the Cochrane Collaboration. Any disagreement was resolved by the third reviewer. We assessed full-text articles of the remaining studies according to the previously defined inclusion and exclusion criteria, and then selected eligible articles. The reviewers were not blinded to authors, institutions, or the publication.
The following information was extracted from the included articles: authors, publication year, study period, type of image, type of fracture, number of patients or images used, fracture classification, reference diagnosis for fracture diagnosis and classification, and augmentation method of each study. In addition, the AI name, CNN architecture type, ROI or important region labeling, data input proportion in training/validation/test, diagnosis accuracy/AUC, and classification accuracy/AUC of each study were also extracted.
The initial search identified 123 references from the selected databases and 4 references from manual searching. Eighty-two references were excluded by screening the abstracts and titles for duplicates, unrelated articles, case reports, systematic reviews, and non-comparative studies. The remaining 45 studies underwent full-text review, and 31 of them were subsequently excluded. Finally, 14 studies were included in this review [1, 7, 8, 11,12,13,14,15,16,17,18,19,20,21]. The details of the identification of relevant studies are shown in the flow chart of the study selection process (Fig. 1).
In all 14 studies, X-ray images were used for AI learning. However, one study additionally used CT images and another additionally used CT and MRI [8, 18]. Four studies included only femoral neck fractures [11, 16, 17, 21], and two studies included only intertrochanteric fractures [8, 18]. The remaining studies included both fracture types. Four studies reported the accuracy of fracture classification by AI [8, 14,15,16]. The number of images used ranged from 234 to 10,484. The demographic data, including the reference diagnosis and augmentation method of each study, are shown in Table 1.
The accuracy of hip fracture diagnosis by AI was 79.3–98%, and the accuracy of fracture diagnosis by AI-aided humans was 90.5–97.1%. The accuracy of human fracture diagnosis was 77.5–93.5%. The AUC of fracture diagnosis by AI was 0.905–0.99. The accuracy of fracture classification by AI was 86–98.5%, and the AUC was 0.873–1.0 (Table 2). Forest plots of AI accuracy and AUC for diagnosis and classification are presented in Figs. 2, 3, 4, 5. Across the included studies, the mean AI diagnosis accuracy was 0.92 (Fig. 2), the mean AI diagnosis AUC was 0.969 (Fig. 3), the mean AI classification accuracy was 0.914 (Fig. 4), and the mean AI classification AUC was 0.933 (Fig. 5).
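The text does not state how the forest-plot means were pooled. As a hedged illustration only, the sketch below shows one common approach, fixed-effect inverse-variance pooling of proportions with a normal-approximation 95% interval. The function name and the per-study values are hypothetical and are not the data of the included studies.

```python
def pooled_proportion(estimates):
    """Fixed-effect inverse-variance pooling of proportions.

    estimates: list of (proportion, sample_size) pairs, one per study.
    Returns (pooled estimate, (lower, upper)) with a 95% normal-approximation
    confidence interval.
    """
    num = 0.0
    den = 0.0
    for p, n in estimates:
        if not 0.0 < p < 1.0 or n <= 0:
            raise ValueError("need 0 < proportion < 1 and sample_size > 0")
        variance = p * (1.0 - p) / n  # binomial variance of a proportion
        weight = 1.0 / variance       # more precise studies weigh more
        num += weight * p
        den += weight
    pooled = num / den
    se = (1.0 / den) ** 0.5
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)


# Hypothetical per-study accuracies and test-set sizes, for illustration only.
studies = [(0.90, 200), (0.95, 500), (0.85, 100)]
estimate, interval = pooled_proportion(studies)
print(round(estimate, 3), tuple(round(x, 3) for x in interval))
```

A random-effects model is often preferred when studies differ in imaging protocol, fracture type, and reference standard, as they do here; the fixed-effect version is shown only because it is the shortest to write down.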
Among the included studies, architectures based on the GoogLeNet model [7, 11, 18] or the DenseNet model [13, 14, 20] were the most common, with three studies each. Among the data input proportions, the study of Adams et al. had the lowest training proportion (57%), and the study of Yamada et al. had the highest (95%). Of the 14 studies, 5 used Grad-CAM to highlight important regions [1, 8, 16, 20, 21]. The information on AI for all included studies is presented in Table 3.
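The data input proportions reported above describe how each dataset was divided into training, validation, and test subsets. A minimal sketch of such a split follows; the function name, the default proportions, and the use of patient identifiers as the unit of splitting are assumptions made for illustration, not the procedure of any included study.

```python
import random


def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle items reproducibly and split them into train/validation/test.

    The test subset receives whatever remains after train and validation.
    """
    if not 0.0 < train < 1.0 or not 0.0 <= val < 1.0 or train + val >= 1.0:
        raise ValueError("proportions must leave a non-empty share for testing")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical patient identifiers, for illustration only.
patients = [f"patient_{i:03d}" for i in range(100)]
train_set, val_set, test_set = split_dataset(patients, train=0.8, val=0.1)
print(len(train_set), len(val_set), len(test_set))
```

Splitting by patient rather than by image is a design choice worth considering when several images come from the same patient, since it keeps images of one patient from appearing in both the training and test subsets.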
Expected effects of AI in hip fracture diagnosis
As human lifespans lengthen and the elderly population grows, the socioeconomic problems associated with hip fractures and postoperative care are a public concern worldwide. Early diagnosis and treatment are essential to preserving patient function, improving quality of life, and alleviating economic burden. Rapid diagnosis of non-displaced hip fractures by humans can be difficult and sometimes requires additional radiographs, bone scans, CT, or MRI. However, these additional tests are not available in all hospitals. In addition, demineralization and overlying soft tissues may interfere with the diagnosis of hip fracture. Delayed diagnosis and treatment may lead to complications such as malunion, osteonecrosis, and arthritis. Moreover, as the total number of imaging and radiological examinations has increased, radiology departments cannot report all acquired radiographs in a timely manner. For this reason, several studies on detecting hip fractures using ML have already been reported [1, 7, 8, 11,12,13,14,15,16,17,18,19,20,21]. Early diagnosis of hip fracture by an AI algorithm in the clinical course could help reduce medical costs, facilitate further preventive practices, and increase the quality of health care. It could also improve the allocation of resources, reduce the need for unnecessary consultations, and facilitate faster patient disposition. In particular, physicians in high-volume clinics could focus on conceptually more demanding tasks. However, reports on the effectiveness of early diagnosis of hip fractures by AI algorithms appear to be insufficient, and further studies are needed.
CNN architecture used for hip fracture diagnosis
The included studies used several CNN structures to analyze radiographs for hip fracture diagnosis. CNNs based on the DenseNet or GoogLeNet architecture were used most often. Both are deep CNNs built from repeating components (inception modules in GoogLeNet and dense blocks in DenseNet). GoogLeNet is a CNN architecture with 22 layers and is widely used in image analysis such as radiographs because of its excellent ability to recognize visual patterns. In addition, GoogLeNet has 9 inception modules that include 1 × 1 convolutions, which allow various characteristics to be derived by accumulating the feature maps generated in the previous layer. This structure allows GoogLeNet to extract features from different layers without additional computational burden. DenseNet (Dense Convolutional Network) is a CNN in which each layer receives input from all previous layers through concatenation, a more advanced design than that of GoogLeNet. DenseNet has the advantage of increasing computational efficiency through a compact network and being able to train by considering more diverse feature sets in all layers. In addition, Inception-V3 and Xception, which were used in the included studies, are more advanced successors of GoogLeNet. These results suggest that researchers have applied progressively more advanced CNN architectures to hip fracture diagnosis (Table 3).
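The difference between the two building blocks can be made concrete by counting feature-map channels. The sketch below is bookkeeping only, with no actual convolutions; the function names and the channel numbers are illustrative assumptions. In a dense block every layer receives the concatenation of the block input and all earlier layer outputs, whereas an inception module concatenates the outputs of its parallel branches once.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Channel bookkeeping for a dense block.

    Each layer sees all channels produced so far (block input plus the output
    of every earlier layer) and contributes growth_rate new channels.
    Returns (input channels seen by each layer, output channels of the block).
    """
    seen_by_layer = []
    channels = in_channels
    for _ in range(num_layers):
        seen_by_layer.append(channels)
        channels += growth_rate  # new feature maps are concatenated, not summed
    return seen_by_layer, channels


def inception_output_channels(branch_channels):
    """An inception module concatenates the outputs of its parallel branches
    (for example 1x1, 3x3, 5x5 convolutions and a pooling branch)."""
    return sum(branch_channels)


# Illustrative values chosen for this example.
per_layer, block_out = dense_block_channels(in_channels=64, growth_rate=32, num_layers=6)
print(per_layer, block_out)
print(inception_output_channels([64, 128, 32, 32]))
```

The steadily growing input width in the dense block is what the text refers to as considering more diverse feature sets in all layers.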
Diagnosis accuracy in AI versus human: Can AI replace human role in hip fracture diagnosis?
In the articles included in our study, the accuracy of hip fracture diagnosis by AI algorithm was over 90%, except in the study of Beyaz et al., and the AUC of fracture diagnosis was over 0.9, which is very high. The diagnostic accuracy of AI was also higher in comparative studies of hip fracture diagnosis between AI and humans. Urakawa et al. presented an AI model that detected intertrochanteric fractures with an accuracy of 95.5% and an AUC of 0.984. This was higher than the human diagnostic accuracy of 92.2% and AUC of 0.969. Adams et al. reported a convolutional neural network model that diagnosed femoral neck fractures with an accuracy of 88.1–94.4%. These figures are comparable to the diagnostic accuracy of experts and residents (93.5% and 92.9%). In the studies of Cheng et al. and Sato et al., human diagnostic accuracy was lower than that of the AI algorithm [1, 20]. Nevertheless, it is still questionable whether AI can replace the human role in hip fracture diagnosis. Bae et al. trained a deep learning model on 4,189 images to diagnose femoral neck fractures, and the diagnostic accuracy of the AI algorithm was 97.1%. However, they reported that non-displaced fractures of the femoral neck were difficult to detect despite the high overall diagnostic accuracy. This suggests that AI reaches its diagnostic limits in cases for which it has not been trained or has insufficient training data. In addition, since none of the AI systems included in this study is integrated with other clinical information, we consider that a clinician's suspicion of occult fracture based on evaluation of the patient's overall condition cannot yet be simulated by an AI algorithm. Mawatari et al. also argued that, because the AUC values of AI-aided experts were higher than those of the AI algorithm alone, a valid diagnosis could not be obtained by the radiograph alone, and it was inevitably affected by the quality of the AI algorithm.
Thus, we believe that AI algorithms cannot yet fully replace human intelligence in the current clinical environment; however, they can complement and augment the ability and knowledge of physicians.
Growing human dependence on AI algorithms for hip fracture detection may be another issue, because it is difficult and time-consuming for doctors to form their own clinical judgment by synthesizing the results of examinations performed face-to-face with patients. To address this issue, Cheng et al. had the AI highlight and display the site of the detected hip fracture so that physicians could check the results of the AI algorithm and make the final clinical judgment. As technology develops, AI algorithms will continue to improve, and the tendency of humans to rely on AI will increase. Further research is needed to address this problem.
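Highlighting of the detected region is commonly done with Grad-CAM, which several of the included studies used. The following is a minimal sketch of the core Grad-CAM computation, assuming the feature maps of the last convolutional layer and the gradients of the fracture score with respect to those maps are already available as nested lists. In practice both would be obtained from a trained network through a deep learning framework; the toy values here are hypothetical.

```python
def grad_cam(activations, gradients):
    """Core Grad-CAM step.

    activations, gradients: K feature maps, each an H x W nested list.
    1. Weight of channel k = spatial mean of its gradient map.
    2. Heatmap = ReLU of the weighted sum of activation maps.
    3. Heatmap is scaled to [0, 1] for display over the radiograph.
    """
    k = len(activations)
    h = len(activations[0])
    w = len(activations[0][0])
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [
        [
            max(0.0, sum(weights[c] * activations[c][i][j] for c in range(k)))
            for j in range(w)
        ]
        for i in range(h)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Hypothetical 2-channel, 2 x 2 example.
acts = [
    [[1, 0], [0, 0]],  # channel 0 fires in the top-left corner
    [[0, 0], [0, 2]],  # channel 1 fires in the bottom-right corner
]
grads = [
    [[1, 1], [1, 1]],      # channel 0 supports the fracture score
    [[-1, -1], [-1, -1]],  # channel 1 argues against it
]
print(grad_cam(acts, grads))
```

Only regions whose activations push the fracture score upward survive the ReLU, which is why the resulting map can be shown to a physician as the area the model relied on.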
Efforts for AI deep learning and high diagnostic accuracy for hip fracture
Because deep learning automatically and adaptively learns features from data, large and clean datasets are required. The performance of AI in detecting hip fractures depends on the number of images available. We summarize two methods suggested by previous studies to overcome this limitation. The first is data augmentation and generation, in which data are manipulated to artificially enlarge the dataset. The number of patients visiting a single hospital is limited, and acquiring image information from other institutions may raise the problem of personal information leakage. Sato et al. created 10,484 augmented images by classifying the images of 4851 patients into fractured side and normal side according to the time they were taken, and used them for deep learning. Mutasa et al. created 9063 augmented images from 737 hip fracture images and 326 normal images in 550 patients, and Beyaz et al. generated 2106 augmented images from 234 radiographs of 65 patients [16, 17]. The second is to use various types of image information. Yu et al. reported that the distinctive fracture line or cortical angular deformity of a neck fracture is easy to detect in a single radiographic view, but that a larger sample size is required for intertrochanteric fractures, which have complex and multiple fracture lines, because the spectrum of fracture morphology is wide. Soft tissue shading or variation in femur alignment may also affect the detection of fractures by AI. To overcome this, Yamada et al. argued that the fracture detection rate could be increased by adding a lateral view to the hip AP view. Yoon et al. reported that using CT images as well as radiographs for the classification of intertrochanteric fractures reduced the time required for fracture classification and helped in planning accurate surgery. Mawatari et al. also used MRI as well as CT for hip fracture detection.
However, using additional views or imaging modalities has the disadvantages of added cost and the difficulty of obtaining a normal hip lateral view.
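Data augmentation, the first of the two methods described above, can be sketched as follows. The example treats an image as a nested list of pixel values and generates variants by horizontal flipping and small translations. The function names and the choice of transforms are assumptions for illustration; the included studies do not necessarily use these particular operations.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in img]


def shift(img, dx, dy, fill=0):
    """Translate the image by dx columns and dy rows, padding with fill."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def augment(img, max_shift=1):
    """Return every combination of (original or flipped) x (small shifts)."""
    variants = []
    for base in (img, hflip(img)):
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                variants.append(shift(base, dx, dy))
    return variants


# Hypothetical 2 x 2 "radiograph", for illustration only.
image = [[1, 2], [3, 4]]
augmented = augment(image)
print(len(augmented), "variants from 1 image")
```

With a shift range of one pixel, each source image yields 18 variants, which shows how a few hundred radiographs can be expanded into thousands of training images. If a label encodes the side of the fracture, it would need to be swapped for the flipped variants.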
Because AI can quickly process large amounts of patient information, it has great potential in diagnosing and classifying patients' diseases. In particular, the usefulness of AI is being studied in trauma prediction, where the number and severity of injuries vary widely between individuals because many external and internal factors are involved. The present study is expected to be helpful in verifying the effectiveness of AI in diagnosing these specific diseases.
There are several limitations in our study. First, we did not consider the type of AI algorithm or the degree of training of the AI algorithm. Second, we did not consider the quality of the radiographs used for deep learning. The selected images are likely to be of high quality. Also, these images can represent only the characteristics of a specific age and sex. Third, implants used for surgical treatment of hip fracture were not considered.
We expect that our study may help in making judgments about the use of AI in the diagnosis and classification of hip fractures. AI is clearly a tool that can help medical staff reduce the time and effort required for hip fracture diagnosis. Further studies are needed to determine its effect in actual clinical situations.
Availability of data and materials
All data generated or analyzed during this study are included in this published article.
Sato Y, Takegami Y, Asamoto T, Ono Y, Hidetoshi T, Goto R, et al. Artificial intelligence improves the accuracy of residents in the diagnosis of hip fractures: a multicenter study. BMC Musculoskelet Disord. 2021;22:407.
Leeper WR, Leeper TJ, Vogt KN, Charyk-Stewart T, Gray DK, Parry NG. The role of trauma team leaders in missed injuries: Does specialty matter? J Trauma Acute Care Surg. 2013;75:387–90.
Cannon J, Silvestri S, Munro M. Imaging choices in occult hip fracture. J Emerg Med. 2009;37:144–52.
Cha Y-H, Ha Y-C, Yoo J-I, Min Y-S, Lee Y-K, Koo K-H. Effect of causes of surgical delay on early and late mortality in patients with proximal hip fracture. Arch Orthop Trauma Surg. 2017;137:625–30.
Aggarwal R, Ringold S, Khanna D, Neogi T, Johnson SR, Miller A, et al. Distinctions between diagnostic and classification criteria? Arthritis Care Res. 2015;67:891–7.
Whitehouse MR, Berstock JR, Kelly MB, Gregson CL, Judge A, Sayers A, et al. Higher 30-day mortality associated with the use of intramedullary nails compared with sliding hip screws for the treatment of trochanteric hip fractures: a prospective national registry study. Bone Jt J. 2019;101-B:83–91.
Murphy EA, Ehrhardt B, Gregson CL, von Arx OA, Hartley A, Whitehouse MR, et al. Machine learning outperforms clinical experts in classification of hip fractures. Sci Rep. 2022;12:2058.
Yoon S-J, Hyong Kim T, Joo S-B, Eel OhS. Automatic multi-class intertrochanteric femur fracture detection from CT images based on AO/OTA classification using faster R-CNN-BO method. J Appl Biomed. 2020;18:97–105.
Romero Lauro G, Cable W, Lesniak A, Tseytlin E, McHugh J, Parwani A, et al. Digital pathology consultations-a new era in digital imaging, challenges and practical applications. J Digit Imaging. 2013;26:668–77.
Petrick N, Sahiner B, Armato SG, Bert A, Correale L, Delsanto S, et al. Evaluation of computer-aided detection and diagnosis systems. Med Phys. 2013;40:087001.
Adams M, Chen W, Holcdorf D, McCusker MW, Howe PD, Gaillard F. Computer vs human: deep learning versus perceptual training for the detection of neck of femur fractures. J Med Imaging Radiat Oncol. 2019;63:27–32.
Urakawa T, Tanaka Y, Goto S, Matsuzawa H, Watanabe K, Endo N. Detecting intertrochanteric hip fractures with orthopedist-level accuracy using a deep convolutional neural network. Skeletal Radiol. 2019;48:239–44.
Cheng C-T, Ho T-Y, Lee T-Y, Chang C-C, Chou C-C, Chen C-C, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol. 2019;29:5469–77.
Krogue JD, Cheng KV, Hwang KM, Toogood P, Meinberg EG, Geiger EJ, et al. Automatic hip fracture identification and functional subclassification with deep learning. Radiol Artif Intell. 2020;2:e190023.
Yu JS, Yu SM, Erdal BS, Demirer M, Gupta V, Bigelow M, et al. Detection and localisation of hip fractures on anteroposterior radiographs with artificial intelligence: proof of concept. Clin Radiol. 2020;75:237.e1-9.
Mutasa S, Varada S, Goel A, Wong TT, Rasiej MJ. Advanced deep learning techniques applied to automated femoral neck fracture detection and classification. J Digit Imaging. 2020;33:1209–17.
Beyaz S, Açıcı K, Sümer E. Femoral neck fracture detection in X-ray images using deep learning and genetic algorithm approaches. Jt Dis Relat Surg. 2020;31:175–83.
Mawatari T, Hayashida Y, Katsuragawa S, Yoshimatsu Y, Hamamura T, Anai K, et al. The effect of deep convolutional neural networks on radiologists’ performance in the detection of hip fractures on digital pelvic radiographs. Eur J Radiol. 2020;130:109188.
Yamada Y, Maki S, Kishida S, Nagai H, Arima J, Yamakawa N, et al. Automated classification of hip fractures using deep convolutional neural networks with orthopedic surgeon-level accuracy: ensemble decision-making with antero-posterior and lateral radiographs. Acta Orthop. 2020;91:699–704.
Cheng C-T, Chen C-C, Cheng F-J, Chen H-W, Su Y-S, Yeh C-N, et al. A human-algorithm integration system for hip fracture detection on plain radiography: system development and validation study. JMIR Med Inform. 2020;8:e19416.
Bae J, Yu S, Oh J, Kim TH, Chung JH, Byun H, et al. External validation of deep learning algorithm for detecting and visualizing femoral neck fracture including displaced and non-displaced fracture on plain X-ray. J Digit Imaging. 2021;34:1099–109.
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conf Comput Vis Pattern Recognit CVPR. 2016. p. 2818–26.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. 2015 IEEE Conf Comput Vis Pattern Recognit CVPR. 2015. p. 1–9.
Lin M, Chen Q, Yan S. Network in network [Internet]. arXiv; 2014 [cited 2022 Aug 26]. Available from: http://arxiv.org/abs/1312.4400.
Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks [Internet]. arXiv; 2018 [cited 2022 Aug 26]. Available from: http://arxiv.org/abs/1608.06993.
Maffulli N, Rodriguez HC, Stone IW, Nam A, Song A, Gupta M, et al. Artificial intelligence and machine learning in orthopedic surgery: a systematic review protocol. J Orthop Surg. 2020;15:478.
Kakavas G, Malliaropoulos N, Pruna R, Maffulli N. Artificial intelligence: a tool for sports trauma prediction. Injury. 2020;51(Suppl 3):S63–5.
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (Grant Number: HI22C0494).
Ethics approval and consent to participate
This study is a systematic review in which we collected data from the included studies. Ethics approval and consent to participate are not applicable.
Consent for publication
This study is a systematic review in which we collected data from the included studies. Consent for publication is not applicable.
All authors confirmed that there is no conflict of interest.
Cha, Y., Kim, JT., Park, CH. et al. Artificial intelligence and machine learning on diagnosis and classification of hip fracture: systematic review. J Orthop Surg Res 17, 520 (2022). https://doi.org/10.1186/s13018-022-03408-7