A 9 mRNAs-based diagnostic signature for rheumatoid arthritis by integrating bioinformatic analysis and machine-learning

Background Rheumatoid arthritis (RA) is an autoimmune rheumatic disease that carries a substantial burden for both patients and society. Early diagnosis of RA is essential to prevent disease progression and select an optimal therapeutic strategy. However, RA diagnosis is challenging, partly due to a lack of reliable biomarkers. Here, we aimed to explore the diagnostic signature and establish a predictive model of RA. Methods The mRNA expression profiling data of GSE17755, containing blood samples of 112 RA patients and 53 healthy control patients, were obtained from the Gene Expression Omnibus (GEO) database, followed by differential expression, GO (Gene Ontology), and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis. A PPI network was constructed to select candidate hub genes, then logistic regression and random forest models were established based on the identified genes. Results Significantly, we identified 52 differentially expressed genes (DEGs), including 16 upregulated genes and 36 downregulated genes in RA samples compared with control samples. GO and KEGG analysis showed that several immune-related cellular processes were particularly enriched. We identified nine hub genes in the PPI network, including CFL1, COTL1, ACTG1, PFN1, LCP1, LCK, HLA-E, FYN, and HLA-DRA. The logistic regression and random forest models based on the nine identified genes reliably distinguished the RA samples from the healthy samples with substantially high AUC. Conclusion The diagnostic logistic regression and random forest models based on nine hub genes reliably predicted the occurrence of RA. Our findings could provide new insights into RA diagnostics. Supplementary Information The online version contains supplementary material available at 10.1186/s13018-020-02180-w.


Introduction
Rheumatoid arthritis (RA) is an autoimmune rheumatic inflammatory disorder that influences several organs and tissues and causes chronic synovial inflammation, ultimately resulting in chronic disability, joint destruction, and decreased life expectancy [1][2][3]. RA affects nearly 0.5 to 1% of people globally, occurring more commonly in females [4]. Furthermore, RA is challenging to manage and often requires lifelong treatment once developed [5]. Detection of RA at an early stage affords a window of opportunity for effective curative responses, and this pre-clinical period may be as short as several months [6][7][8]. Accordingly, early diagnosis of RA is essential to prevent the progression of radiologic variations and select the optimal therapeutic strategy [9].
Rheumatoid factor (RF) serum biomarkers have been used as preferred diagnostic criteria for RA for decades of years [10]. However, because of the lack of sensitivity (50-90%) and specificity (50-95%) [11] of auxiliary biomarkers, anti-citrullinated protein antibody (ACPA) was included in the diagnostic criteria for RA as developed by the American College of Rheumatology (ACR)/European League Against Rheumatism (EULAR) in 2010 [12]. Existing biomarkers may be difficult to detect during the pre-clinical period. Subsequently, multiple studies have revealed an association between genetics and RA [13,14], indicating that aberrantly expressed genes may be identified as potential diagnostic biomarkers of RA. A previous study demonstrated that dysregulated circular RNAs in the peripheral blood mononuclear cells of RA patients presented diagnostic value [15]. Multiple microRNAs have been identified as effective markers for RA patients [16]. However, the development of RA is a complex process, making it particularly important to establish a diagnostic model.
In this study, we aimed to identify blood-derived mRNA-based diagnostic signatures by integrating bioinformatics analysis and machine learning algorithms based on the mRNA expression profiling data of GSE17755 from the GEO database, containing blood samples of 112 RA patients and 53 healthy control patients. We identified a total of 52 differential expression genes (DEGs) in the RA patients compared with the controls and identified nine hub genes, including CFL1, COTL1, ACTG1, PFN1, LCP1, LCK, HLA-E, FYN, and HLA-DRA. The logistic regression and random forest models based on these nine genes reliably distinguished the RA samples from the healthy control samples.

Data collection
To establish the diagnosis model of RA from blood sample, the mRNA expression profiling data of GSE17755 contained blood samples of 112 RA patients and 53 healthy controls were obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/) [17]. The mRNA expression levels of the GSE17755 data set were quantified based on the Hitachisoft AceGene Human Oligo Chip 30K 1 Chip Version.

Identification of differentially expressed genes (DEGs)
The dataset of GSE17755 was normalized by robust multi-array (RMA) and the DEGs were analyzed by using a limma R package [18]. After quantile normalization, raw signals of analyses were log2 transformed. DEGs were defined by absolute value of fold change (FC) > 2 (|log2FC| > 1) and false discovery rate (FDR) < 0.05.

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis
To analyze the functions and pathways associated with DEGs, data were merged to obtain gene symbols, then GO enrichment analysis and KEGG pathway analysis were performed by using enrichGO function and enrich-KEGG function of clusterProfiler package of R [19], respectively. Subsequently, GO enrichment results were visualized by using a GOChord function in GOplot package [20], and KEGG enrichment results were visualized by using a Barplot function in clusterProfiler package, independently. The GO included molecular function, biological process, and cellular component. The P < 0.05 was regarded as statistically significant.

PPI analysis
The protein-protein analysis (PPI) was conducted in the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (https://string-db.org/cgi/input.pl) with the threshold of confidence score ≥ 0.4 [21]. The visualization of the PPI network was presented by Cytoscape software (https://cytoscape.org/) [22]. The modular analysis of the PPI network using the molecular complex detection (MCODE) plug-in of Cytoscape software with MCODE score > 2 as the threshold [23].

Construction of logistic regression and random forest model
The logistic regression model and random forest model were established based on the identified genes in the PPI network, in which the expression of identified DEGs served as continuous variable, and the sample type (RA or not) served as a binary responsive variable. The logistic regression model was constructed using glm of R [24]. The random forest model based on the Bagging method was constructed using randomForest R package [25]. The 5-fold cross-validation was performed in the models using caret R package (https://CRAN.R-project.org/package=caret). The receiver operating characteristic curves were generated to evaluate the sensitivity and specificity of the models, and the area under the curve (AUC) was calculated to assess the reliability of the models.

Identification of DEGs
To comprehensively understand the development of rheumatoid arthritis (RA) and explore the potential diagnostic biomarkers, the mRNA expression profiling data of GSE17755, containing blood samples of 112 RA patients and 53 healthy controls, were obtained from GEO database. The dataset was normalized by robust multi-array (RMA), and we observed that the data deviation was acceptably small, which could be used for further analysis (Fig. 1a and Table S1). In order to verify the data repeatability, the principal component analysis (PCA) based on the mRNA expression value of the samples was performed, and our data revealed that the samples of RA patients and healthy controls were effectively separated (Fig. 1b), indicating that the availability of the Fig. 1 Identification of DEGs. a The dataset of GSE17755 was normalized by robust multi-array (RMA) and the result was shown in the box-plot. The x-axis was the samples and the y-axis was the gene expression levels. b The principal component analysis (PCA) based on the mRNA expression value of the samples was performed, in which the dots with different colors represented samples in different groups. The distance of the dots represented the similarity of mRNA expression of the samples. c Volcano plot filtering map displayed DEGs in the RA samples compared with the normal samples. The x-axis was the Log2fold change (FC) and the y-axis was −log10 (FDR). d The DEGs were presented by heatmap. The x-axis was samples and the y-axis was DEGs, in which red and green represented the expression level of genes, respectively data repeatability. Significantly, we identified a total of 52 DEGs, including 16 upregulated genes and 36 downregulated genes in the RA samples compared with the normal samples (Fig. 1c), in which the remarkable difference was presented by heatmap (Fig. 1d).

GO and KEGG analysis of DEGs
For primary comprehensions of these DEGs, GO [16] and KEGG pathway analysis were performed based on the identified DEGs. We enriched 102 GO terms and 41 KEGG pathways in the analysis (P < 0.05) ( Table S2). The top 10 significant biological process and cellular component [26] GO terms (Fig. 2a, b), the 11 remarkable molecular function GO terms (Fig. 2c), and the top 15 notable KEGG pathways were demonstrated (Fig. 2d), in which multiple cellular processes were associated with immune response.

PPI network construction and candidate hub gene selection
To further identify the candidate hub genes among the DEGs in the healthy cases and RA patients, we constructed PPI network based on the 52 DEGs in the STRI NG online database (https://string-db.org/cgi/input.pl), and we identified 39 genes with confidence score ≥ 0.4 in the PPI network (Fig. 2e). The network module may represent the specific biological significance and thereby is usually the core of the PPI network [27]. Accordingly, we performed the modular analysis of the PPI network using the MCODE plug-in of Cytoscape software with MCODE score > 2 as the threshold and identified Cluster1 including CFL1, COTL1, ACTG1, PFN1, and LCP1, and Cluster2 containing LCK, HLA-E, FYN, and HLA-DRA (Fig. 2e), suggesting that these nine genes may play critical roles in the development of RA.

Construction of logistic regression and random forest model
We constructed the logistic regression model and the random forest model based on the selected nine genes including CFL1, COTL1, ACTG1, PFN1, LCP1, LCK, HLA-E, FYN, and HLA-DRA in the PPI network, in which the expression of selected nine genes served as the continuous predict variable and the sample type (RA or not) served as the response variable. The 5-fold crossvalidation was performed in the model to verify the reliability of the model and we observed that the AUC of the logistic regression model (Fig. 3a) and the random forest model (Fig. 3b) was substantially high, suggesting that both models can reliably distinguish the RA samples from the healthy control samples.

Discussion
Consistent with the results of previous studies, our study indicates that RA is a disease involving a complex gene network and multiple gene contributors [28]. In this study, 39 genes were selected in the PPI network and 9 hub genes were identified after modular analysis of the PPI network, including CFL1, COTL1, ACTG1, PFN1, LCP1, LCK, HLA-E, FYN, and HLA-DRA. These genes may be significantly correlated with the progression of RA. Furthermore, we constructed a logistic regression model and random forest model based on the nine identified genes, both with a significant AUC.
Combined with previous reports, COTL1, LCK, HLA-DRA, and HLA-E, among our identified hub genes, have been reported to be associated with RA. Proteomics revealed that upregulation of COTL1 might affect the 5lipoxygenase (5LO) activity involved in leukotriene biosynthesis and mediate inflammation in RA [29]. Wholeexome sequencing defined LCK as linked to familial RA and highlighted LCK variation in the T cell receptor (TCR) signaling pathway leading to T cell activation, resulting in T cell differentiation, survival, and effector functions [30]. Bioinformatics analysis showed that HLA- Fig. 3 Construction of logistic regression and random forest model. a, b The logistic regression model and the random forest model based on the selected genes were constructed using glm of R and randomForest R package, respectively. The reliability of the model was assessed by the AUC analysis. a The 5-fold cross-validation was performed in the logistic regression model. b The 5-fold cross-validation was performed in the random forest model DRA was dysregulated in RA patients [31,32]. Furthermore, HLA-E was involved in susceptibility to RA and anti-TNF treatment in RA patients [33]. Little evidence has directly demonstrated any relation of other genes to RA, such as CFL1, ACTG1, PFN1, or FYN; however, these genes play a key role in immune regulation [34][35][36][37].
In conclusion, we selected innovative biomarkers by analyzing the critical genes that influence the molecular mechanisms of RA, and nine mRNA-based diagnostic signatures were identified. The logistic regression and random forest models based on these nine hub genes were able to reliably distinguish RA samples from healthy control samples. Meanwhile, the nine genes had immune-related functions, including T cell activation, differentiation, tolerance, and lymphocyte formation. Further exploration is warranted to validate the clinical significance of these genes in the immune disorder of RA progression.