Background

Renal cell carcinoma (RCC), which originates in the renal cortex, accounts for 90% of primary renal tumors [1]. With over 400,000 new cases and more than 180,000 deaths annually worldwide, RCC ranks among the top 14 most common malignancies [1,2,3]. Clear cell renal cell carcinoma (ccRCC), accounting for over 80% of RCC cases, is the most common and biologically heterogeneous subtype, posing challenges in both treatment and prognosis due to its variable aggressiveness and response to therapies [1, 3, 4]. The tumor-node-metastasis (TNM) staging system evaluates tumor characteristics based on three components: tumor size and local invasion (T), regional lymph node involvement (N), and distant metastasis (M) [5]. Preoperative T staging focuses on assessing tumor size, depth of invasion, and involvement of surrounding tissues. Together, these components play a crucial role in determining the scope of surgery, the need for systemic therapy, and overall prognosis. However, accurate staging often depends on postoperative pathology, limiting its applicability in early treatment planning. This underscores the significance of preoperative evaluation methods in providing timely and clinically actionable insights to guide and refine therapeutic decision-making [6].

Contrast-enhanced CT is widely used for preoperative evaluation of ccRCC, valued for its non-invasiveness, convenience, and consistency [7]. However, preoperative staging with CT is inherently observer-dependent, relying on radiologists’ subjective visual assessments. This approach is not only time-consuming and labor-intensive but also subject to inter-observer variability, which can lead to inconsistencies in staging accuracy [8, 9]. These limitations highlight the pressing need for objective, automated, and accurate diagnostic tools for ccRCC staging.

Deep learning algorithms have shown remarkable accuracy in preoperative cancer staging, often matching or exceeding radiologist assessments in cancers such as lung, gastric, and colorectal [10,11,12,13]. Recent studies have explored applying deep learning to CT images for RCC preoperative staging. However, these studies often faced limitations such as reliance on single-center or public datasets, small sample sizes, unrepresentative case distributions, binary classification approaches, and insufficient predictive accuracy [7, 14,15,16]. Moreover, the clinical interpretability of these models and their potential for collaboration with radiologists remain largely unexplored.

This study aims to address these challenges by developing and validating CT-based deep learning models using multicenter datasets comprising 1,148 ccRCC cases. Grad-CAM was integrated to explore model interpretability, and human-machine collaboration experiments were conducted to evaluate clinical utility. These comprehensive approaches enable precise and generalizable staging predictions across various clinical settings.

Methods

Patient collection

This study, approved by the Ethics Committee of the Guizhou Provincial People's Hospital (GPH) (Approval Number: KY2021165), waived the requirement for written informed consent due to its retrospective design and the anonymization of patient data. Data were retrospectively collected from five medical centers, and all cases were confirmed as ccRCC by postoperative pathology. Among them, data from GPH and Affiliated Hospital of Zunyi Medical University (ZMH) were merged and randomly divided into a training set (80%) and a testing set (20%). Data from Affiliated Hospital of Guizhou Medical University (GMH) and First Affiliated Hospital of Shihezi University (SUH) were combined to form external validation set 1, while data from the independent Guiqian International General Hospital (GQH) were used as external validation set 2. The data collection periods were as follows: GPH: July 2012 to April 2022; ZMH: April 2013 to February 2022; GMH: July 2012 to April 2022; SUH: April 2019 to April 2023; GQH: January 2021 to December 2024.

Inclusion criteria were: (1) surgical treatment, (2) postoperative pathological confirmation of ccRCC, and (3) contrast-enhanced CT scans conducted no more than thirty days prior to surgery. Exclusion criteria were: (1) lack of standard corticomedullary contrast-enhanced CT images, (2) poor-quality CT scans, including those with severe artifacts (e.g., motion artifacts, metal artifacts) or incomplete coverage of the lesion, (3) prior kidney biopsy or treatment before the CT scan, and (4) incomplete or inaccurate clinical and pathological data. The study flowchart is presented in Fig. 1.

Fig. 1
figure 1

The study flowchart

Clinical data collection

Preoperative corticomedullary contrast-enhanced CT images were retrieved from the Picture Archiving and Communication System at each center. Clinical and pathological data, including age, gender, tumor size, nuclear grade, T staging, N staging, M staging, and TNM staging, were collected through the clinical information management system. The primary prediction targets in this study were T staging and TNM staging, both crucial for determining treatment strategies and predicting outcomes. T staging was categorized into T1, T2, and T3 + T4, with T4 cases merged into T3 due to the small number of T4 cases. TNM staging was classified as stages I, II, III, or IV. At each center, two junior radiologists independently performed manual delineation of the region of interest (ROI) on corticomedullary contrast-enhanced CT images slice by slice using ITK-SNAP software for tumor segmentation. Any discrepancies between their delineations were reviewed and resolved by a senior radiologist to ensure consistency and minimize inter-rater variability.

Standardization of staging assessment

To ensure accuracy and consistency in T staging and TNM staging assessments, this study closely followed the AJCC 8th edition guidelines for renal cell carcinoma TNM staging [5]. A professional pathologist from each center independently re-evaluated the staging based on the standardized criteria to minimize variability and ensure uniformity across all data.

Network architecture

This study utilized a three-dimensional (3D) Transformer-ResNet (TR-Net) architecture, which improves classification accuracy by progressively increasing network depth [17]. To address the limited ability of the Convolutional Neural Network (CNN) modules within the 3D ResNet to capture global features, two Transformer modules were integrated into the model’s backend to enhance its ability to extract long-range tumor features [18]. The proposed 3D TR-Net model is illustrated in Fig. 2.

Fig. 2
figure 2

Architecture of 3D TR-Net. (a) illustrates the workflow of 3D TR-Net. (b) depicts the structure of the ResNetBottleneck within the workflow

In the 3D model, convolutional layers utilize a 3 × 3 × 3 kernel with a stride of 1 × 1 × 1 and padding of 1 × 1 × 1. Downsampling is achieved using convolution with a 1 × 1 × 1 kernel and a stride of 2 × 2 × 2. The final Linear layer decodes the 3D model features into classification results. The calculation of self-attention in the 3D context is defined as follows:

$$\text{Self-Attention}=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
(1)

where Q, K, and V are three feature matrices of the same size obtained by linear projection of the self-attention input, and the softmax function normalizes the results [19]. \(d_{k}\) is the feature dimension used to scale the dot products, which stabilizes gradients during network training. Self-attention is a core module of the Transformer architecture, allowing the model to learn global features of the data. However, because of its high computational complexity, 3D TR-Net employs only two such modules, placed at the backend of the network.
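For illustration, the following is a minimal PyTorch sketch of this design: a 3D ResNet-style convolutional front-end followed by two Transformer encoder blocks and a final linear classification head. The layer counts, channel widths, token pooling, and three-class head are assumptions made for readability, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class Bottleneck3D(nn.Module):
    """3D ResNet-style bottleneck: 1x1x1 -> 3x3x3 -> 1x1x1 convolutions."""

    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            # 3x3x3 kernel with padding 1, as described in the text
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        # Downsampling shortcut: 1x1x1 convolution with stride 2
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv3d(in_ch, out_ch, kernel_size=1,
                                        stride=stride, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))


class TRNet3DSketch(nn.Module):
    """CNN front-end for local features; two Transformer encoder blocks at
    the backend capture long-range (global) dependencies."""

    def __init__(self, num_classes=3, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, stride=1, padding=1),
            Bottleneck3D(32, 32, 64, stride=2),
            Bottleneck3D(64, 64, 128, stride=2),
            Bottleneck3D(128, 128, dim, stride=2),
            nn.AdaptiveAvgPool3d((4, 8, 8)),   # pool to a manageable token grid
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)   # final linear decoding layer

    def forward(self, x):                          # x: (B, 1, D, H, W)
        feat = self.cnn(x)                         # (B, C, 4, 8, 8)
        tokens = feat.flatten(2).transpose(1, 2)   # (B, 256 tokens, C)
        tokens = self.transformer(tokens)          # global self-attention (Eq. 1)
        return self.head(tokens.mean(dim=1))       # pooled tokens -> class logits


# Example: one cropped volume of 64 slices, each 128 x 128 voxels.
logits = TRNet3DSketch()(torch.randn(1, 1, 64, 128, 128))
```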

Data preprocessing

Preprocessing operations were conducted on the collected 3D CT images to improve data quality, enhance usability, and increase model classification accuracy. Given the use of multicenter data in this study, it was essential to integrate data from various sources for effective management and utilization. As previously mentioned, data from GPH and ZMH were used for model development, including training and testing sets (80%/20%), while data from GMH and SUH were used as external validation set 1, and GQH data served as an independent external validation set 2. To provide transparency on the CT protocols across the five centers, key acquisition parameters, including manufacturer, model, kVp, contrast dose, and slice thickness, are summarized in Supplementary Table S1.

To convert raw CT data into a manageable and analyzable format, data transformation, including dimensionality reduction and numerical normalization, was performed. Tumor ROI segmentation results were used as the reference to extract data blocks of size 128 × 128 × 64 from the center of the tumor ROI outward. This cropping method, rather than resizing the entire tumor ROI, preserves the morphological and textural information essential for model learning. Following extraction, a masking operation retained only the histological information of the tumor and its surrounding tissues. Standardization was then applied as follows:

$$\text{Model}_{\text{in}}=\frac{\text{cropped}-\text{mean}\left(\text{cropped}\right)}{\text{std}\left(\text{cropped}\right)}$$
(2)

where cropped refers to the tumor ROI data that has been trimmed or extracted from the original image, mean refers to the average value, and std refers to the standard deviation. To mitigate inter-center variability from diverse CT scanners and acquisition settings, we applied per-sample z-score normalization to the cropped tumor ROI data. This approach standardizes each sample individually, enabling the model to prioritize relative intensity and structural patterns over scanner-dependent absolute values, thus preserving critical tumor features for classification.
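As a concrete illustration, the NumPy sketch below mirrors these steps: a fixed-size block is cropped outward from the ROI center, non-tumor voxels are masked, and per-sample z-score normalization (Eq. 2) is applied. The function name, array layout, and the handling of tumors near the volume border are illustrative assumptions rather than the exact pipeline used in the study.

```python
import numpy as np


def preprocess(volume: np.ndarray, roi_mask: np.ndarray,
               crop_size=(64, 128, 128)) -> np.ndarray:
    """volume, roi_mask: (D, H, W) arrays; roi_mask is a binary tumor mask."""
    # Locate the ROI center from the segmentation mask.
    center = np.round(np.argwhere(roi_mask > 0).mean(axis=0)).astype(int)

    # Crop a fixed-size block outward from the ROI center (no resizing, so
    # morphology and texture are preserved); assumes the volume is at least
    # as large as the crop in every dimension.
    slices = []
    for c, size, dim in zip(center, crop_size, volume.shape):
        start = int(np.clip(c - size // 2, 0, dim - size))
        slices.append(slice(start, start + size))
    cropped = volume[tuple(slices)].astype(np.float32)
    mask = roi_mask[tuple(slices)]

    # Masking step: keep the tumor (a dilated mask could also retain the
    # immediately surrounding tissue, as described in the text).
    cropped = cropped * (mask > 0)

    # Per-sample z-score normalization (Eq. 2).
    return (cropped - cropped.mean()) / (cropped.std() + 1e-8)
```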

Network training

Cross-entropy loss was utilized to train the 3D network model for classification tasks. Given the uneven distribution of data classes [20], weighted focal loss was applied to improve the network’s performance on small-sample classes [21]. Focal Loss is defined as:

$$\mathcal{L}_{F}\left(p_{t}\right)=-\alpha_{t}\left(1-p_{t}\right)^{\gamma}\log\left(p_{t}\right)$$
(3)

where \(p_{t}\) denotes the predicted probability of the sample by the model, \(\alpha_{t}\) represents the class weights of the sample, and \(\gamma\) is the modulation factor. The cross-entropy loss also incorporates sample weights and can be represented as:

$$\mathcal{L}_{CE}\left(x,y\right)=\left\{\mathcal{L}_{1},\dots,\mathcal{L}_{N}\right\}^{T},\quad \mathcal{L}_{n}=-w_{y_{n}}\log\frac{\exp\left(x_{n,y_{n}}\right)}{\sum_{c=1}^{C}\exp\left(x_{n,c}\right)}$$
(4)

where \(x\) represents the model output, \(y\) represents the labels, \(w\) represents the weights, \(C\) represents the number of classes, and \(N\) represents the number of samples. Finally, the total training loss of the model is obtained by combining the two losses using the weight parameter \(\alpha\):

$$\mathcal{L}_{\text{Total}}=\alpha\mathcal{L}_{F}+\left(1-\alpha\right)\mathcal{L}_{CE}$$
(5)

In this study, \(\alpha\) is set to 0.5. The model parameters were optimized using the AdamW optimizer with an initial learning rate of 1e-4. The model was trained for 50 epochs on the training set, with performance monitored on the internal testing set. The model parameters that achieved the best performance on this internal testing set were then selected for final evaluation on two external validation sets. All experiments were conducted on an NVIDIA A100 Graphics Processing Unit.
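The following PyTorch sketch shows one way to combine a class-weighted focal loss (Eq. 3) with weighted cross-entropy (Eq. 4) using α = 0.5 (Eq. 5). The focusing parameter γ and the class weights shown are placeholder values, not those used in the study.

```python
import torch
import torch.nn.functional as F


def combined_loss(logits, targets, class_weights, alpha=0.5, gamma=2.0):
    """logits: (N, C); targets: (N,) integer class labels."""
    # Weighted cross-entropy term (Eq. 4).
    ce = F.cross_entropy(logits, targets, weight=class_weights)

    # Focal term (Eq. 3): down-weights easy, well-classified samples.
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
    pt = log_pt.exp()
    at = class_weights[targets]
    focal = -(at * (1 - pt) ** gamma * log_pt).mean()

    # Total training loss (Eq. 5) with alpha = 0.5, as in the study.
    return alpha * focal + (1 - alpha) * ce


# Example with three T-stage classes and a heavier weight on the minority class.
weights = torch.tensor([0.5, 1.0, 2.0])
loss = combined_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)), weights)
```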

Visualizing model attention with Gradient-weighted class activation mapping (Grad-CAM)

To enhance the interpretability of the 3D TR-Net model, the Grad-CAM technique was employed on the training set. This method visualizes the regions of CT images that the model focuses on during T staging and TNM staging predictions. Grad-CAM generates heatmaps by computing the gradients of the final convolutional layer and overlaying them onto the original images. The color bar, ranging from blue to red, indicates activation intensity from low to high. This approach provides an intuitive understanding of the model’s decision-making process and its attention to tumor-related features.
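A minimal sketch of how such 3D Grad-CAM heatmaps can be computed is shown below, assuming a PyTorch model whose last 3D convolutional stage is accessible (for example, the final bottleneck block of the architecture sketch above). The hook-based approach and the normalization step are illustrative rather than the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F


def grad_cam_3d(model, volume, target_layer, class_idx):
    """Return a (D, H, W) heatmap in [0, 1] for `class_idx` on a (1, 1, D, H, W) volume."""
    acts, grads = {}, {}

    def save_act(module, inputs, output):
        acts["a"] = output                       # activations of the target layer

    def save_grad(module, grad_in, grad_out):
        grads["g"] = grad_out[0]                 # gradients w.r.t. those activations

    h1 = target_layer.register_forward_hook(save_act)
    h2 = target_layer.register_full_backward_hook(save_grad)

    logits = model(volume)
    model.zero_grad()
    logits[0, class_idx].backward()              # gradient of the target class score
    h1.remove(); h2.remove()

    # Channel weights = spatially averaged gradients; weighted sum of activations.
    w = grads["g"].mean(dim=(2, 3, 4), keepdim=True)
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))

    # Upsample the coarse map to the input resolution and rescale to [0, 1]
    # before overlaying it on the original CT slices.
    cam = F.interpolate(cam, size=volume.shape[2:], mode="trilinear",
                        align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```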

Model evaluation

This study assessed the multiclass performance of the T staging and TNM staging models quantitatively, primarily using micro-averaged AUC (micro-AUC), macro-averaged AUC (macro-AUC), and accuracy (ACC). To provide a comprehensive evaluation of the models’ diagnostic efficacy, supplementary metrics like precision, recall, and F1-score were also employed. To mitigate potential bias from relying on a single dataset, this study employed two independent external validation sets to ensure robustness and generalizability of the models.
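For reference, the sketch below shows how these multiclass metrics can be computed with scikit-learn; `y_true` and `y_prob` are placeholder arrays standing in for stage labels and softmax outputs, not the study's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=100)            # e.g. T1 / T2 / T3+T4 labels
y_prob = rng.dirichlet(np.ones(3), size=100)     # softmax-like class probabilities
y_pred = y_prob.argmax(axis=1)

# Micro-AUC is computed on binarized labels; macro-AUC uses one-vs-rest averaging.
y_true_bin = label_binarize(y_true, classes=[0, 1, 2])
micro_auc = roc_auc_score(y_true_bin, y_prob, average="micro")
macro_auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
acc = accuracy_score(y_true, y_pred)

# Per-class precision, recall, and F1-score, as in Supplementary Table S2.
print(classification_report(y_true, y_pred, digits=3))
print(f"micro-AUC={micro_auc:.3f}, macro-AUC={macro_auc:.3f}, ACC={acc:.3f}")
```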

Human–machine collaboration evaluation

To evaluate the clinical utility of the proposed T staging model, an additional blinded evaluation was conducted using external validation set 2. Two junior radiologists (5 and 7 years of experience) and two senior radiologists (14 and 20 years of experience), all blinded to pathology results, independently performed T staging assessments. Subsequently, the radiologists reviewed the model’s staging predictions along with Grad-CAM visualizations, and revised their assessments accordingly. ACC was calculated to objectively compare performance between independent and human–machine collaborative assessments.

Statistical analysis

Analyses were performed using R software (version 4.2.2) and Python (version 3.11.11). Categorical data were analyzed using chi-square or Fisher’s exact tests, depending on sample size, while continuous data were assessed using t-tests or the Mann-Whitney U test based on the data distribution. A threshold of p < 0.05 was set to denote statistical significance. Paired categorical accuracy data (unaided vs. with model assistance) were evaluated with McNemar’s test. To ensure adequate sensitivity, post-hoc power analyses were conducted for each McNemar comparison, targeting ≥ 80% power at a significance threshold of α = 0.05.
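As an illustration, McNemar's test on paired reads can be run with statsmodels as shown below; the 2 × 2 table of concordant/discordant correctness counts is a placeholder, not the study's data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes over the same cases:
# rows = unaided (correct / incorrect), columns = model-assisted (correct / incorrect)
table = np.array([[210, 5],
                  [21, 32]])

# Exact binomial test on the discordant pairs (5 vs. 21).
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```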

Results

This study included 1,148 ccRCC patients from five centers: GPH (n = 344, 30%), ZMH (n = 221, 19%), GMH (n = 288, 25%), SUH (n = 161, 14%), and GQH (n = 134, 12%). This multicenter approach provided a diverse and representative dataset, ensuring robust generalizability of the models. Data from GPH and ZMH (n = 565) were merged and divided into a training set (80%, n = 452) and a testing set (20%, n = 113). Two external validation sets were defined: external validation set 1 (GMH + SUH, n = 449) and external validation set 2 (GQH, n = 134). Except for nuclear grade and size (P < 0.05), no significant differences were observed in age, sex, or staging distributions (T, N, M, TNM) across the three sets (all P > 0.05), indicating no systematic bias among the groups. Baseline clinicopathological characteristics are detailed in Table 1.

Table 1 Patient characteristics and clinicopathological variables of ccRCC patients across different sets

For Age and Size, data are expressed as mean ± standard deviation. All other data are presented in the form of count and percentage (n (%)).

T Staging, Tumor Staging; N Staging, Node Staging; N0, No regional lymph node metastasis; Nx, Regional lymph nodes cannot be assessed; N1, Regional lymph node metastasis present; M Staging, Metastasis Staging; M0, No distant metastasis; M1, Distant metastasis present.

For T staging prediction, the 3D TR-Net model demonstrated acceptable overall performance across all sets. In the training, testing, external validation 1, and external validation 2 sets, the micro-AUCs were 0.974, 0.946, 0.939, and 0.954, respectively; the macro-AUCs were 0.943, 0.839, 0.857, and 0.894; and the ACCs were 0.929, 0.882, 0.843, and 0.869. Although overall performance was acceptable, the AUC for T3 + T4 was moderate (external validation 1: AUC = 0.769; external validation 2: AUC = 0.795), likely due to class imbalance, whereas the model maintained strong discrimination for T1 (AUC = 0.894 and 0.933) and T2 (AUC = 0.909 and 0.955) in the two external validation sets. Precision, recall, and F1-score for each stage further confirmed this pattern, though with variability in the advanced subclasses (Supplementary Table S2). For instance, F1-scores in external validation set 1 were 0.863 for T1, 0.629 for T2, and 0.634 for T3 + T4; in set 2 they were 0.899 for T1, 0.750 for T2, and 0.716 for T3 + T4, highlighting stronger results for the majority T1 class but moderate-to-lower metrics for the minority classes (Table 2; Fig. 3).

Table 2 Performance of the 3D TR-Net model for preoperative T staging of ccRCC across different sets
Fig. 3
figure 3

ROC curves for the 3D TR-Net model for predicting T staging in the training set (a), the external validation set 1 (b) and the external validation set 2 (c)

For TNM staging prediction, the 3D TR-Net model also exhibited acceptable overall performance. In the training, testing, external validation 1, and external validation 2 sets, the micro-AUCs were 0.955, 0.918, 0.935, and 0.924, respectively; the macro-AUCs were 0.883, 0.782, 0.817, and 0.888; and the ACCs were 0.891, 0.841, 0.856, and 0.807. Although overall performance was acceptable, the AUC for TNM III was moderate (external validation 1: AUC = 0.669; external validation 2: AUC = 0.801), likely due to class imbalance, while the model maintained strong discriminative power for TNM I (AUC = 0.864 and 0.936), TNM II (AUC = 0.907 and 0.963), and TNM IV (AUC = 0.826 and 0.852) in the two external validation sets. Precision, recall, and F1-score for each stage further confirmed consistent performance (Table 3; Fig. 4).

Table 3 Performance of the 3D TR-Net model for preoperative TNM staging of ccRCC across different sets
Fig. 4
figure 4

ROC curves for the 3D TR-Net model for predicting TNM staging in the training set (a), the external validation set 1 (b) and the external validation set 2 (c)

Grad-CAM visualizations

Grad-CAM heatmaps generated from the training set illustrate the model’s attention regions in T staging and TNM staging predictions. For T staging (Fig. 5), T1 presents a deep blue low-intensity heatmap, T2 shows a broader distribution with moderate intensity, and T3 + T4 displays significant intensity concentrated in the tumor center and high-density areas. For TNM staging (Fig. 6), the heatmap for TNM I is predominantly deep blue, with low-intensity features uniformly distributed within the tumor and occasional high-weight regions at the edges. TNM II displays a blue-green heatmap with scattered yellow high-weight regions at the edges. TNM III exhibits unevenly distributed yellow and red high-weight regions, while TNM IV is characterized by concentrated red and yellow high-intensity areas in the tumor center and high-density regions. These visualizations demonstrate the model’s focus on clinically relevant regions, thereby enhancing its interpretability and clinical credibility.

Fig. 5
figure 5

Representative Grad-CAM heatmaps illustrating T stage predictions of ccRCC on CT images. T1 tumors exhibit uniformly low activation—blue (a); T2 tumors show moderate activation primarily along the lesion edges—green to yellow (b); while T3 + T4 tumors display high central activation intensity—yellow to red (c). The color bar indicates increasing activation strength from blue to red

Fig. 6
figure 6

Representative Grad-CAM heatmaps for TNM staging of ccRCC on CT images. TNM Ⅰ tumors show diffuse low activation—blue (a); TNM stage II demonstrates moderate activation concentrated at lesion boundaries—green to yellow (b); TNM stage III presents heterogeneous activation both centrally and peripherally—yellow to red (c); and TNM stage IV reveals intense central activation—yellow to red (d). Color gradients denote activation strength from low to high

Human–Machine collaboration results

On external validation set 2, the accuracy of T staging by the model alone was 0.869. The junior radiologists achieved an accuracy of 0.813 (0.767–0.860) unaided, which improved to 0.873 (0.833–0.913) with model assistance, representing an absolute improvement of 5.96 percentage points (p = 0.001, McNemar’s test; post-hoc power = 98.3%). Similarly, the senior radiologists' accuracy increased from 0.854 (0.812–0.897) unaided to 0.896 (0.859–0.932) with model assistance, representing an absolute improvement of 4.20 percentage points (p = 0.009, McNemar’s test; post-hoc power = 88.4%). These results demonstrate statistically significant improvements in diagnostic accuracy with sufficient power to detect the observed effects, particularly benefiting less-experienced radiologists.

Discussion

This study developed and externally validated two CT-based 3D TR-Net models for preoperative T staging and TNM staging in ccRCC. In independent validation cohorts, the T-staging model demonstrated acceptable discrimination and accuracy, paralleled by similarly acceptable performance from the TNM-staging model. Grad-CAM heatmaps highlighted the key tumor regions driving each prediction, enhancing model interpretability. In a human–machine collaboration study, radiologists assisted by 3D TR-Net improved their staging accuracy. Together, these results demonstrate that 3D TR-Net offers acceptable overall performance, robust generalizability for the majority classes, and practical interpretability. Despite moderate results in the advanced subclasses, it reflects real-world class distributions and supports radiologist assistance, making it a promising tool for prospective clinical evaluation in ccRCC staging.

This study addresses several limitations of prior work. First, it adhered to the latest AJCC 8th edition renal cancer TNM staging guidelines, resolving inconsistencies found in previous studies [5]. This approach improves the reliability and clinical applicability of the results. Second, the 3D TR-Net hybrid neural network model effectively integrates the global feature extraction capabilities of the Transformer module, the local detail recognition strengths of the CNN module, and the deep feature representation of ResNet to capture 3D tumor structures [17, 21, 22]. This combination provides a robust solution for predicting preoperative T and TNM staging in ccRCC. Lastly, this study addresses limitations of prior research that relied on single-center or public datasets, which often led to discrepancies between dataset case distributions and real-world epidemiological patterns [14, 16, 23]. Incorporating data from 1,148 ccRCC cases across five medical centers allowed the creation of a large and diverse dataset, enhancing the clinical relevance of the findings and supporting more accurate staging predictions.

In early explorations of deep learning for T staging prediction in ccRCC, Hadjiyski et al. pioneered the application of this technique using CT images, achieving an AUC of 0.90 [24]. Subsequently, Wu et al. utilized CT texture features in a multicenter study, attaining an AUC of 0.80 [16]. While these studies underscored the potential of intelligent diagnosis in T staging, they were limited to binary classification, failing to capture the full complexity of T staging. To overcome this limitation, Tian et al. developed a multi-class model for pathological T1-T3 staging based on data from two centers [23]. Although effective on internal test sets, the model demonstrated limited generalizability on external sets, with micro- and macro-AUCs of 0.72 and 0.78, respectively. In contrast, the 3D TR-Net model presented in this study exhibited improved performance on external validation—likely due to our large multicenter dataset—though it showed moderate results in advanced subclasses. Its overall diagnostic accuracy slightly exceeded that of previous radiologist assessments (ACC: 0.80) [25].

However, despite these improvements, the T staging model faces challenges from class imbalance and limitations in capturing complex tissue invasion. It focuses solely on the tumor, overlooking key features like tumor thrombus and perinephric invasion, which are vital for differentiating localized from advanced ccRCC, especially in T3 staging. These shortcomings limit its ability to accurately stage advanced ccRCC, particularly in T3 and T4 subgroups. Still, its performance, backed by a large multicenter dataset, demonstrates clinical potential for ccRCC T staging prediction. Future inclusion of tumor thrombus and perinephric invasion features could enhance its accuracy in advanced ccRCC stages (T3 and T4).

Regarding TNM staging, the results of this study also showed advantages over previous research, such as improved generalizability from multicenter data. Ökmen et al. laid the foundation for intelligent preoperative TNM staging prediction, achieving an accuracy of 0.85, which was slightly lower than the overall accuracy of 0.856 demonstrated by this study [26]. Talaat et al. later developed a radiomics-based model for multi-class TNM staging, reporting an accuracy of 0.99 in the validation set [27]. However, their feature selection was conducted simultaneously on both the training and testing sets, which increased the risk of overfitting. In contrast, this study utilized a large, multicenter dataset to ensure the model’s generalizability and reliability. Furthermore, in studies on binary classification models for early and mid-late RCC TNM staging, Hussain et al. achieved an accuracy of 0.83 in a single-center study [28], while Demirjian et al. reported an AUC of 0.80 in a multicenter study [29]. Neither of these models matched the accuracy and generalizability demonstrated by this study.

Although overall performance was acceptable, TNM stage III cases exhibited a noticeably lower AUC, suggesting diminished discriminative capacity for this category. This may be partially attributed to class imbalance, which can bias the model toward majority classes and reduce its sensitivity to underrepresented stages such as TNM stage III [30]. Additionally, TNM stage III often presents CT features that substantially overlap with those of TNM stages II and IV, making accurate classification difficult even for experienced radiologists [31]. Furthermore, the current architecture focuses solely on localized CT features and lacks mechanisms to incorporate broader anatomical context—such as peripheral invasion, tumor thrombus, and distant metastases—which are essential for accurate staging of advanced TNM cases. Future enhancements may include the use of context-aware modules (e.g., attention mechanisms or graph-based networks) and multimodal fusion of CT images with clinical data and radiology reports to improve the model’s ability to distinguish intermediate and advanced stage cases.

Grad-CAM visualizations offer an intuitive depiction of the tumor regions that 3D TR-Net exploits for staging prediction. Prior renal cancer studies have similarly validated its clinical utility: Zhu et al. applied Grad-CAM to a multimodal B-mode and contrast-enhanced ultrasound network, revealing distinct modality dependencies when differentiating low versus high nuclear grade RCC [32]; Moon et al. used Grad-CAM with a ResNet-18 classifier to delineate the model’s focus regions across normal, benign, and malignant renal pathological tissues [33]. In this study, Grad-CAM heatmaps exhibited a clear progression as tumor T and TNM stage increased—from low intensity, diffuse activations in T1/TNM I cases to high intensity, centrally concentrated activations in T3 + T4/TNM IV cases. This visualization confirms 3D TR-Net’s ability to capture key CT imaging features related to heterogeneity, invasiveness, and metastatic potential, enhancing model interpretability and potentially aiding diagnosis.

Human-machine collaboration using the 3D TR-Net model with Grad-CAM visualization significantly improves ccRCC T-staging accuracy, particularly benefiting junior radiologists and reducing experience-based variability in diagnosis. Similar benefits have been demonstrated in other radiological applications, such as a CT-based clinico-radiomics model that improved junior radiologists’ diagnostic performance when differentiating mediastinal lymphomas from thymic epithelial tumors [34], and a breast cancer classification model that provides interpretable insights to radiologists via Grad-CAM heatmaps, highlighting critical regions in mammograms influencing its predictions [35]. This synergistic approach enhances diagnostic workflow safety, consistency, and efficiency in clinical practice.

Despite the acceptable performance of the proposed models, several limitations remain. First, reliance on manual ROI delineation is time-consuming and subjective, limiting scalability. Future work will explore automated segmentation frameworks to improve efficiency. Second, the study used only corticomedullary contrast-enhanced CT images; integrating multiphase or multimodal imaging and clinical data may further enhance model performance. Third, due to the limited number of T4 cases, T3 and T4 stages were combined to ensure model robustness, inevitably reducing clinical granularity. Fourth, the model focused solely on intratumoral features, without fully incorporating peritumoral invasion characteristics such as vascular or renal sinus fat infiltration, which are essential for distinguishing advanced ccRCC, especially in T3 staging. Prior research has highlighted the value of integrating intratumoral and peritumoral radiomics for predicting nuclear grading and survival outcomes in ccRCC, suggesting that such peritumoral analysis has the potential to substantially improve predictive accuracy for higher T stages in future iterations [36, 37]. Fifth, inter-center CT variability was mitigated only with per-sample z-score normalization, without additional harmonization methods such as HU recalibration or ComBat; routine scanner HU calibration already ensured consistent absolute values, and ComBat is suited to radiomics features rather than end-to-end deep learning from raw images [38]. The efficacy of this approach is supported by robust external validation and by comparable multicenter CT studies [39]. Future studies with more T4 data should aim for finer subclassification. Lastly, although Grad-CAM offers some interpretability, the decision-making process of deep models remains largely opaque, potentially limiting clinical trust and adoption.

Conclusion

This multicenter study validates the CT-based 3D TR-Net models for preoperative ccRCC staging, demonstrating acceptable overall performance on external cohorts but moderate results in the advanced subclasses. The models offer potential to enhance clinical decision-making and radiologist diagnostic accuracy, especially once future improvements address class imbalance and extratumoral features.