TrustedChain

Machine Learning Research

Abstract

Code-signing certificates are frequently abused by malware authors to distribute malicious binaries under the guise of legitimate software. Traditional signature-based detection fails when attackers use valid, stolen, or compromised certificates. This work presents TrustedChain, a machine learning system that evaluates certificate reputation by analyzing cryptographic properties, issuer behavior, and historical malware associations.

We evaluate eight classification models on a dataset of 5 million labeled certificates (benign, suspicious, malicious) and achieve 97.3% accuracy with gradient boosting methods. Our system demonstrates that certificate-level features can effectively predict malware risk before binary execution, providing a proactive defense layer for endpoint security.

1. Introduction

Malware detection traditionally relies on file hashes, behavioral analysis, or static signatures. However, attackers increasingly use code-signing certificates to bypass security controls. Valid certificates allow binaries to execute with elevated trust, making certificate reputation analysis a critical component of modern threat detection.

Our approach focuses on certificate-level features rather than file-level analysis, enabling early detection based on:

  • Cryptographic algorithm choices (signature and public key algorithms)
  • Certificate authority (CA) issuer patterns and lineage
  • Public key properties (size, algorithm, vulnerabilities like ROCA)
  • Certificate extensions (can_issue, pathlen constraints)

2. Dataset

We trained models on a curated dataset of 5 million certificates sourced from Certificate Transparency logs (queried via crt.sh) and malware telemetry feeds. Each certificate was assigned one of three labels:

  • Benign: Certificates from verified, trusted issuers with no malware associations
  • Suspicious: Certificates with mixed signals (e.g., unusual crypto choices, new issuers)
  • Malicious: Certificates confirmed to have signed malware samples

Features extracted:

  • signature_hash_algo: Hash algorithm used for certificate signature
  • signature_key_algo: Public key algorithm for signature verification
  • public_key_algo: Algorithm used for the certificate's public key
  • public_key_size: Bit size of the public key
  • can_issue: Whether the certificate can issue other certificates (CA flag)
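  • pathlen: Path length constraint (maximum number of intermediate CA certificates allowed below this certificate in a chain)
  • has_roca: Whether the public key is vulnerable to ROCA (Return of Coppersmith's Attack)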
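
The features above mix categorical, numeric, and boolean values, so they need a numeric encoding before most models can use them. The following is a minimal sketch of one possible encoding, not the pipeline used in this work: the `CertFeatures` class, the vocabularies `HASH_ALGOS` and `KEY_ALGOS`, and the choice of -1 for a missing path length are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical vocabularies; the observed category values are not listed in the text.
HASH_ALGOS = ["md5", "sha1", "sha256", "sha384", "sha512"]
KEY_ALGOS = ["rsa", "dsa", "ec"]


@dataclass
class CertFeatures:
    signature_hash_algo: str
    signature_key_algo: str
    public_key_algo: str
    public_key_size: int
    can_issue: bool
    pathlen: Optional[int]  # None when no path length constraint is present
    has_roca: bool


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """One slot per known value plus a trailing 'other' slot for unseen values."""
    vec = [0.0] * (len(vocabulary) + 1)
    idx = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vec[idx] = 1.0
    return vec


def encode(cert: CertFeatures) -> List[float]:
    """Flatten one certificate into a numeric feature vector."""
    return (
        one_hot(cert.signature_hash_algo.lower(), HASH_ALGOS)
        + one_hot(cert.signature_key_algo.lower(), KEY_ALGOS)
        + one_hot(cert.public_key_algo.lower(), KEY_ALGOS)
        + [
            float(cert.public_key_size),
            float(cert.can_issue),
            # -1 marks "no constraint" so it stays distinct from pathlen=0
            float(cert.pathlen) if cert.pathlen is not None else -1.0,
            float(cert.has_roca),
        ]
    )


example = CertFeatures("sha256", "rsa", "rsa", 2048, False, None, False)
vector = encode(example)
```

Tree-based models could also consume ordinal codes instead of one-hot vectors; one-hot is used here only because it is also suitable for the linear and neural baselines.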

3. Methodology

3.1 Models Evaluated

We compared eight supervised learning models spanning a linear baseline, tree ensembles, boosting methods, and a neural network:

  • Logistic Regression: Linear baseline for interpretability
  • Random Forest: Ensemble of decision trees with bagging
  • Extra Trees: Randomized decision trees for noise robustness
  • Gradient Boosting: Sequential boosting (sklearn)
  • HistGradientBoosting: Histogram-based gradient boosting (sklearn)
  • XGBoost: Optimized gradient boosting library
  • LightGBM: Microsoft's gradient boosting framework
  • MLP: Multi-layer perceptron (neural network)

3.2 Training Procedure

All models were trained with 5-fold cross-validation. Evaluation metrics included:

  • Accuracy: Overall classification correctness
  • Precision (macro): Average precision across classes
  • Recall (macro): Average recall across classes
  • F1 Score (macro): Average of the per-class F1 scores, where each class's F1 is the harmonic mean of its precision and recall
  • ROC AUC: Area under ROC curve (one-vs-rest)

3.3 Clustering Analysis

We performed unsupervised clustering (KMeans with k=5) to identify natural groupings in certificate behavior, achieving a silhouette score of 0.9985.
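
For reference, the silhouette score averages, over all samples, the quantity (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its cluster and b is the mean distance to the members of the nearest other cluster. The sketch below is a small pure-Python version for illustration only; the toy points and labels are hypothetical and the function is not the implementation used for the reported score.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def silhouette_score(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient using Euclidean distance."""
    clusters: Dict[int, List[Point]] = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # dist(p, p) == 0, so summing over the whole cluster and
        # dividing by (size - 1) gives the mean distance to the others.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups (hypothetical toy data).
pts = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
good = silhouette_score(pts, [0, 0, 1, 1])  # close to 1
bad = silhouette_score(pts, [0, 1, 0, 1])   # negative: clusters are mixed
```

Scores near 1 indicate compact clusters that are far apart relative to their internal spread, while scores near or below 0 indicate overlapping or misassigned clusters.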

4. Results

4.1 Model Comparison

Table 1 summarizes performance across all models. Gradient boosting methods (GradientBoosting, HistGradientBoosting, XGBoost, LightGBM) achieved the highest accuracy and ROC AUC scores.

Model                   Accuracy  Precision  Recall  F1      ROC AUC  Train Time (s)
Logistic Regression     0.9047    0.5955     0.7178  0.6011  0.8809   17.56
Random Forest           0.9073    0.5968     0.7189  0.6025  0.8813   3.01
Gradient Boosting *     0.9732    0.6563     0.6140  0.6328  0.8824   19.86
HistGradientBoosting *  0.9732    0.6563     0.6140  0.6328  0.8824   4.38
Extra Trees             0.9073    0.5968     0.7189  0.6025  0.8801   2.95
MLP                     0.9478    0.5945     0.6240  0.6078  0.7431   54.04
XGBoost *               0.9731    0.6562     0.6138  0.6327  0.8823   5.51
LightGBM *              0.9732    0.6563     0.6140  0.6328  0.8824   4.14

Table 1: Performance comparison of classification models. Precision, recall, and F1 are macro-averaged. Rows marked with * are the top-performing gradient boosting methods.
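
Table 1 pairs accuracies above 0.97 with macro F1 scores near 0.63. The worked example below illustrates how that combination can arise under class imbalance. It is a sketch on a hypothetical confusion matrix, not the matrices from these experiments. It also shows that macro F1, computed as the mean of per-class F1 scores, generally differs from the harmonic mean of macro precision and macro recall.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Hypothetical counts; rows are true classes, columns are predictions.
cm = [
    [960, 5, 5],  # benign (970 samples)
    [15, 4, 1],   # suspicious (20 samples)
    [4, 1, 5],    # malicious (10 samples)
]
m = macro_metrics(cm)

# Harmonic mean of the macro averages, for comparison with f1_macro.
hm_of_macros = (
    2 * m["precision_macro"] * m["recall_macro"]
    / (m["precision_macro"] + m["recall_macro"])
)
print(m, hm_of_macros)
```

In this toy matrix the majority class dominates accuracy, while the two minority classes pull every macro-averaged metric down because each class counts equally in the average.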

4.2 Detailed Model Results

Gradient Boosting (Best Overall, tied with HistGradientBoosting and LightGBM)

Accuracy: 97.32% | F1 (macro): 0.6328 | ROC AUC: 0.8824

[Figures: Gradient Boosting confusion matrix; ROC curves for the Benign, Suspicious, and Malicious classes; precision-recall curves for the Benign, Suspicious, and Malicious classes]

XGBoost

Accuracy: 97.31% | F1 (macro): 0.6327 | ROC AUC: 0.8823

[Figures: XGBoost confusion matrix; ROC curves for the Benign, Suspicious, and Malicious classes]

LightGBM

Accuracy: 97.32% | F1 (macro): 0.6328 | ROC AUC: 0.8824 | Train Time: 4.14s

[Figures: LightGBM confusion matrix; ROC curves for the Benign, Suspicious, and Malicious classes]

Random Forest

Accuracy: 90.73% | F1 (macro): 0.6025 | ROC AUC: 0.8813

[Figures: Random Forest confusion matrix; ROC curve for the Malicious class]

4.3 Clustering Analysis

Unsupervised clustering (KMeans, k=5) revealed distinct certificate behavior patterns with a silhouette score of 0.9985.

[Figure: KMeans clustering (PCA projection)]

5. Discussion

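Gradient boosting methods (GradientBoosting, HistGradientBoosting, XGBoost, LightGBM) achieved the highest accuracy (~97%) and macro F1 (~0.63) of the models evaluated. Their macro recall (~0.61) was lower than that of Logistic Regression, Random Forest, and Extra Trees (~0.72), and ROC AUC differed little among all models except the MLP. Key observations: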

  • Feature importance: Public key algorithm and signature hash choices proved highly discriminative for malware detection.
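  • Speed vs. accuracy trade-off: LightGBM and HistGradientBoosting match GradientBoosting on every reported metric while training 4-5× faster (4.14 s and 4.38 s versus 19.86 s).
  • Class imbalance: The "suspicious" class remains challenging. Macro F1 stays near 0.63 even for models with ~97% accuracy, which suggests that the minority classes are classified far less reliably than the majority class. Future work should explore SMOTE or cost-sensitive learning.
  • Neural networks (MLP): The MLP reached 94.78% accuracy but had the lowest ROC AUC (0.7431) and the longest training time (54.04 s), likely due to limited feature engineering and the small number of tabular features.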
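
As one concrete form of the cost-sensitive learning mentioned under class imbalance, each class can be weighted inversely to its frequency. The sketch below uses the formula n_samples / (n_classes × class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The label counts are hypothetical and do not reflect the actual class distribution of the dataset, which is not reported here.

```python
from collections import Counter
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution, for illustration only.
labels: List[str] = ["benign"] * 900 + ["suspicious"] * 70 + ["malicious"] * 30

weights = balanced_class_weights(labels)

# Per-sample weights; many scikit-learn estimators accept these
# through the sample_weight argument of fit().
sample_weight = [weights[label] for label in labels]
print(weights)
```

With these weights every class contributes the same total weight to the training loss, so errors on rare classes cost more per sample than errors on the majority class.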

6. Conclusion

TrustedChain demonstrates that certificate-level features alone can achieve high accuracy in malware detection, providing a proactive defense layer before binary execution. Gradient boosting methods emerge as the strongest choice among the models evaluated, balancing accuracy, interpretability, and training efficiency.

Future work will focus on:

  • Incorporating temporal features (certificate age, revocation timing)
  • Expanding issuer lineage graph analysis
  • Real-time scoring integration with endpoint security systems
  • Addressing class imbalance in the "suspicious" category