TrustedChain | Machine Learning Research

Abstract

Code-signing certificates are frequently abused by malware authors to distribute malicious binaries under the guise of legitimate software. Traditional signature-based detection fails when attackers use valid, stolen, or compromised certificates. This work presents TrustedChain, a machine learning system that evaluates certificate reputation by analyzing cryptographic properties, issuer behavior, and historical malware associations.

We evaluate eight classification models on a dataset of 5 million labeled certificates (benign, suspicious, malicious). Gradient boosting methods achieve the highest accuracy (97.3%), although their macro-averaged F1 is 0.633, which shows that the minority classes remain harder to separate. Our results suggest that certificate-level features can predict malware risk before binary execution, providing a proactive defense layer for endpoint security.

1. Introduction

Malware detection traditionally relies on file hashes, behavioral analysis, or static signatures. However, attackers increasingly use code-signing certificates to bypass security controls. Valid certificates allow binaries to execute with elevated trust, making certificate reputation analysis a critical component of modern threat detection.

Our approach focuses on certificate-level features rather than file-level analysis, enabling early detection based on:

Cryptographic algorithm choices (signature and public key algorithms)
Certificate authority (CA) issuer patterns and lineage
Public key properties (size, algorithm, vulnerabilities like ROCA)
Certificate extensions (the basic-constraints CA flag and path-length constraint, i.e., can_issue and pathlen)
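The ROCA flaw (CVE-2017-15361) affects RSA keys generated by a vulnerable Infineon library. Each prime has the form k·M + (65537^a mod M) for a large primorial M, so the modulus N is congruent to a power of 65537 modulo M. This gives a cheap necessary-condition check: for every small prime p dividing M, N mod p must be a power of 65537 mod p. The Python sketch below applies that test to the primes 3 to 167, which divide M for every affected key size. It is a simplified illustration of the fingerprinting idea, not the method TrustedChain uses to compute has_roca (the text does not describe it). The names ROCA_PRIMES and has_roca_fingerprint are hypothetical, and a key that passes is only a candidate, since some unaffected moduli can pass by chance.

```python
from math import isqrt, prod


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, isqrt(n) + 1))


# Every ROCA modulus uses a primorial M that contains at least the primes 2..167.
# p = 2 is skipped because any odd modulus passes it trivially.
ROCA_PRIMES = [p for p in range(3, 168) if _is_prime(p)]
GENERATOR = 65537


def _powers_of(g: int, p: int) -> frozenset[int]:
    """Residues g^0, g^1, ... mod p: the subgroup of (Z/pZ)* generated by g."""
    seen, x = {1}, g % p
    while x not in seen:
        seen.add(x)
        x = x * g % p
    return frozenset(seen)


# Precompute the allowed residues once per prime.
_ALLOWED = {p: _powers_of(GENERATOR, p) for p in ROCA_PRIMES}


def has_roca_fingerprint(n: int) -> bool:
    """True if n mod p is a power of 65537 for every prime p in ROCA_PRIMES."""
    return all(n % p in _ALLOWED[p] for p in ROCA_PRIMES)


if __name__ == "__main__":
    # Build a modulus with the ROCA structure: n = k*M + (65537^a mod M).
    M = 2 * prod(ROCA_PRIMES)
    print(has_roca_fingerprint(7 * M + pow(GENERATOR, 12345, M)))
```

For a real certificate, n would be the RSA modulus of the public key, for example `public_key().public_numbers().n` when parsing with the `cryptography` package.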

2. Dataset

We trained models on a curated dataset of 5 million certificates sourced from certificate transparency logs (crt.sh) and malware telemetry feeds. Each certificate was labeled into one of three classes:

Benign: Certificates from verified, trusted issuers with no malware associations
Suspicious: Certificates with mixed signals (e.g., unusual crypto choices, new issuers)
Malicious: Certificates confirmed to have signed malware samples

Features extracted:

signature_hash_algo: Hash algorithm used for certificate signature
signature_key_algo: Public key algorithm for signature verification
public_key_algo: Algorithm used for the certificate's public key
public_key_size: Bit size of the public key
can_issue: Whether the certificate can issue other certificates (CA flag)
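pathlen: Path-length constraint, i.e., the maximum number of intermediate CA certificates that may follow this certificate in a chain
has_roca: Whether the public key is vulnerable to ROCA (Return of Coppersmith's Attack, CVE-2017-15361)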
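Most of these features are categorical or boolean, so they need a numeric encoding before models such as Logistic Regression or the MLP can use them. A minimal sketch, assuming each certificate has already been parsed into a dict keyed by the seven feature names above; the algorithm vocabularies in VOCABS and the -1 sentinel for a missing pathlen are illustrative assumptions, not TrustedChain's actual encoding.

```python
from typing import Any

# Illustrative vocabularies (assumed); unseen values fall into an "other" slot.
VOCABS = {
    "signature_hash_algo": ["md5", "sha1", "sha256", "sha384", "sha512"],
    "signature_key_algo": ["rsa", "ecdsa", "dsa"],
    "public_key_algo": ["rsa", "ec", "dsa"],
}


def one_hot(value: str, vocab: list[str]) -> list[float]:
    """One-hot encode value over vocab plus a trailing 'other' slot."""
    vec = [0.0] * (len(vocab) + 1)
    vec[vocab.index(value) if value in vocab else len(vocab)] = 1.0
    return vec


def encode(cert: dict[str, Any]) -> list[float]:
    """Turn one parsed certificate into a flat numeric feature vector."""
    features: list[float] = []
    for field, vocab in VOCABS.items():
        features += one_hot(str(cert[field]).lower(), vocab)
    features.append(float(cert["public_key_size"]))
    features.append(1.0 if cert["can_issue"] else 0.0)
    # pathlen is None when no path-length constraint is present; -1 marks that.
    features.append(-1.0 if cert["pathlen"] is None else float(cert["pathlen"]))
    features.append(1.0 if cert["has_roca"] else 0.0)
    return features


if __name__ == "__main__":
    example = {
        "signature_hash_algo": "SHA256", "signature_key_algo": "RSA",
        "public_key_algo": "RSA", "public_key_size": 2048,
        "can_issue": False, "pathlen": None, "has_roca": False,
    }
    print(encode(example))
```

Tree ensembles could consume integer category codes directly; one-hot encoding is chosen here because it also suits the linear and MLP models.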

3. Methodology

3.1 Models Evaluated

We compared eight supervised learning models spanning a linear baseline, bagged tree ensembles, boosting methods, and a neural network:

Logistic Regression: Linear baseline for interpretability
Random Forest: Ensemble of decision trees with bagging
Extra Trees: Randomized decision trees for noise robustness
Gradient Boosting: Sequential boosting (sklearn)
HistGradientBoosting: Histogram-based gradient boosting (sklearn)
XGBoost: Optimized gradient boosting library
LightGBM: Microsoft's gradient boosting framework
MLP: Multi-layer perceptron (neural network)

3.2 Training Procedure

All models were trained with 5-fold cross-validation. Evaluation metrics included:

Accuracy: Overall classification correctness
Precision (macro): Unweighted mean of per-class precision
Recall (macro): Unweighted mean of per-class recall
F1 Score (macro): Unweighted mean of per-class F1, where each class's F1 is the harmonic mean of its precision and recall
ROC AUC: Area under ROC curve (one-vs-rest)
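Macro averaging gives every class equal weight regardless of size, which is why a model can score high accuracy while its macro F1 stays low if a small class such as "suspicious" is poorly predicted. The sketch below computes these metrics from a confusion matrix; the counts in the example are hypothetical and chosen only to show the effect. It also shows that macro F1 is the mean of per-class F1 scores, not the harmonic mean of macro precision and macro recall.

```python
def macro_metrics(cm: list[list[int]]) -> dict[str, float]:
    """Accuracy and macro-averaged precision, recall and F1 from a confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    k = len(cm)
    total = sum(map(sum, cm))
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(k))  # column sum
        actual = sum(cm[c])                           # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": sum(cm[c][c] for c in range(k)) / total,
        "precision_macro": sum(precisions) / k,
        "recall_macro": sum(recalls) / k,
        "f1_macro": sum(f1s) / k,
    }


if __name__ == "__main__":
    # Hypothetical counts; rows/columns are benign, suspicious, malicious.
    cm = [[900, 10, 0], [40, 8, 2], [5, 5, 30]]
    print(macro_metrics(cm))
```

In this example the suspicious row (8 of 50 correct) barely moves accuracy but pulls macro recall and macro F1 well below it.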

3.3 Clustering Analysis

We performed unsupervised clustering (KMeans with k=5) to identify natural groupings in certificate behavior, achieving a silhouette score of 0.9985.
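The silhouette coefficient of a point is s(i) = (b - a) / max(a, b), where a is its mean distance to the rest of its own cluster and b is the smallest mean distance to any other cluster; the reported score is the mean over all points. The pure-Python sketch below implements that definition for small samples. Because s(i) is exactly 1 whenever a point coincides with the rest of its cluster, duplicate feature vectors push the score toward 1, so with seven largely categorical features it may be worth checking how many distinct vectors each cluster holds. For millions of rows, `sklearn.metrics.silhouette_score` with its `sample_size` argument computes the same quantity on a random subsample.

```python
from math import dist


def silhouette(points: list[tuple[float, ...]], labels: list[int]) -> float:
    """Mean silhouette coefficient with Euclidean distance (O(n^2), small samples only)."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance from point i to each cluster, excluding point i itself.
        mean_to = {}
        for c in clusters:
            ds = [dist(p, q) for j, (q, lq) in enumerate(zip(points, labels))
                  if lq == c and j != i]
            mean_to[c] = sum(ds) / len(ds) if ds else 0.0
        if labels.count(own) == 1:
            scores.append(0.0)  # singleton clusters score 0, as in scikit-learn
            continue
        a = mean_to[own]
        b = min(v for c, v in mean_to.items() if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two tight, well-separated clusters score close to 1.
    print(silhouette([(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)], [0, 0, 1, 1]))
```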

4. Results

4.1 Model Comparison

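Table 1 summarizes performance across all models. The gradient boosting methods (GradientBoosting, HistGradientBoosting, XGBoost, LightGBM) achieved the highest accuracy (0.973), macro precision, macro F1, and ROC AUC, although their ROC AUC lead is small (0.8824 vs. 0.8813 for Random Forest) and their macro recall (0.614) is the lowest in the table.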

Model	Accuracy	Precision (macro)	Recall (macro)	F1 (macro)	ROC AUC (OvR)	Train Time (s)
Logistic Regression	0.9047	0.5955	0.7178	0.6011	0.8809	17.56
Random Forest	0.9073	0.5968	0.7189	0.6025	0.8813	3.01
Gradient Boosting	0.9732	0.6563	0.6140	0.6328	0.8824	19.86
HistGradientBoosting	0.9732	0.6563	0.6140	0.6328	0.8824	4.38
Extra Trees	0.9073	0.5968	0.7189	0.6025	0.8801	2.95
MLP	0.9478	0.5945	0.6240	0.6078	0.7431	54.04
XGBoost	0.9731	0.6562	0.6138	0.6327	0.8823	5.51
LightGBM	0.9732	0.6563	0.6140	0.6328	0.8824	4.14

Table 1: Performance comparison of classification models. The four gradient boosting methods (Gradient Boosting, HistGradientBoosting, XGBoost, LightGBM) are the top performers.
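The training-time comparison in Table 1 is simple arithmetic on the reported values. The sketch below copies the macro F1 and train-time columns from the table and computes each model's speedup relative to Gradient Boosting, along with the set of models tied for the best macro F1. TABLE_1, speedup_vs, and best_by_f1 are illustrative names.

```python
# (macro F1, train time in seconds), copied from Table 1.
TABLE_1 = {
    "Logistic Regression": (0.6011, 17.56),
    "Random Forest": (0.6025, 3.01),
    "Gradient Boosting": (0.6328, 19.86),
    "HistGradientBoosting": (0.6328, 4.38),
    "Extra Trees": (0.6025, 2.95),
    "MLP": (0.6078, 54.04),
    "XGBoost": (0.6327, 5.51),
    "LightGBM": (0.6328, 4.14),
}


def speedup_vs(reference: str) -> dict[str, float]:
    """Training-time speedup of every model relative to `reference` (>1 means faster)."""
    ref_time = TABLE_1[reference][1]
    return {name: ref_time / t for name, (_, t) in TABLE_1.items()}


def best_by_f1() -> list[str]:
    """All models tied for the highest macro F1."""
    top = max(f1 for f1, _ in TABLE_1.values())
    return [name for name, (f1, _) in TABLE_1.items() if f1 == top]


if __name__ == "__main__":
    ranked = sorted(speedup_vs("Gradient Boosting").items(), key=lambda kv: -kv[1])
    for name, s in ranked:
        print(f"{name:22s} {s:5.2f}x")
    print("best macro F1:", best_by_f1())
```

Relative to Gradient Boosting, LightGBM trains about 4.8× faster, HistGradientBoosting about 4.5×, and XGBoost about 3.6×.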

4.2 Detailed Model Results

Gradient Boosting (best overall, tied with HistGradientBoosting and LightGBM)

Accuracy: 97.32% | F1 (macro): 0.6328 | ROC AUC: 0.8824

Figures: confusion matrix; one-vs-rest ROC curves and precision-recall curves for the benign, suspicious, and malicious classes.
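The per-class ROC curves treat one class as positive and the other two as negative (one-vs-rest). The area under each curve equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counting one half. The sketch below computes that directly and averages the per-class values; unweighted (macro) averaging of the per-class AUCs is an assumption, since the text only says one-vs-rest. For large datasets, `sklearn.metrics.roc_auc_score(y, proba, multi_class="ovr", average="macro")` computes the same quantity efficiently.

```python
def binary_auc(scores: list[float], positives: list[bool]) -> float:
    """ROC AUC as P(random positive outranks random negative); ties count 1/2.

    Assumes at least one positive and one negative. O(n_pos * n_neg).
    """
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(proba: list[list[float]], y: list[int]) -> float:
    """Unweighted mean of one-vs-rest AUCs, one per class column of proba."""
    k = len(proba[0])
    aucs = [binary_auc([row[c] for row in proba], [label == c for label in y])
            for c in range(k)]
    return sum(aucs) / k


if __name__ == "__main__":
    # Columns: benign, suspicious, malicious (hypothetical scores).
    proba = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    print(macro_ovr_auc(proba, [0, 1, 2]))
```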

XGBoost

Accuracy: 97.31% | F1 (macro): 0.6327 | ROC AUC: 0.8823

Figures: confusion matrix; one-vs-rest ROC curves for the benign, suspicious, and malicious classes.

LightGBM

Accuracy: 97.32% | F1 (macro): 0.6328 | ROC AUC: 0.8824 | Train Time: 4.14s

Figures: confusion matrix; one-vs-rest ROC curves for the benign, suspicious, and malicious classes.

Random Forest

Accuracy: 90.73% | F1 (macro): 0.6025 | ROC AUC: 0.8813

Figures: confusion matrix; one-vs-rest ROC curve for the malicious class.

4.3 Clustering Analysis

Unsupervised clustering (KMeans, k=5) revealed distinct certificate behavior patterns with a silhouette score of 0.9985.

Figure: KMeans clusters (k=5) shown in a PCA projection.

5. Discussion

Our results show that the gradient boosting methods (GradientBoosting, HistGradientBoosting, XGBoost, LightGBM) outperform the other approaches on accuracy (~97% vs. 90.5% to 94.8%), macro precision, and macro F1. Key observations:

Feature importance: Public key algorithm and signature hash choices proved highly discriminative for malware detection.
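Training speed: LightGBM (4.14 s) and HistGradientBoosting (4.38 s) match GradientBoosting's scores to four decimal places while training roughly 4.5 to 4.8× faster than its 19.86 s.
Class imbalance: The "suspicious" class remains challenging; macro F1 (at most 0.633 for any model) lags far behind accuracy, which is consistent with weak performance on the minority classes. Future work should explore SMOTE or cost-sensitive learning.
Neural networks (MLP): The MLP trailed the gradient boosting methods on accuracy (94.78%), macro F1 (0.6078), and especially ROC AUC (0.7431, the lowest of all models), although it was more accurate than Random Forest and Extra Trees. This is likely due to limited feature engineering on a small tabular feature set (seven features, mostly categorical or boolean); with 5 million certificates, dataset size is unlikely to be the bottleneck.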
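Cost-sensitive learning can be as simple as reweighting classes so that rare labels contribute as much to the training loss as common ones. The sketch below uses the "balanced" heuristic w_c = n_samples / (n_classes · n_c), the same formula scikit-learn documents for `class_weight="balanced"`; after this reweighting every class carries the same total weight. The class proportions in the example are hypothetical, and the function names are illustrative.

```python
from collections import Counter


def balanced_class_weights(y: list[str]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * n_c)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * n_c) for label, n_c in counts.items()}


def to_sample_weights(y: list[str], weights: dict[str, float]) -> list[float]:
    """Expand per-class weights into one weight per training sample."""
    return [weights[label] for label in y]


if __name__ == "__main__":
    # Hypothetical label distribution, not the paper's.
    y = ["benign"] * 90 + ["suspicious"] * 7 + ["malicious"] * 3
    print(balanced_class_weights(y))
```

The per-sample list can be passed as `sample_weight` to the `fit` method of most of the scikit-learn estimators above, which leaves the model architecture unchanged and avoids the synthetic samples that SMOTE would introduce.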

6. Conclusion

TrustedChain shows that certificate-level features alone can reach high accuracy (97.3%) in malware-risk classification, providing a proactive defense layer before binary execution, although the macro F1 of 0.633 shows that the minority classes remain difficult. Among the models evaluated, gradient boosting methods offer the best balance of accuracy and training efficiency.

Future work will focus on:

Incorporating temporal features (certificate age, revocation timing)
Expanding issuer lineage graph analysis
Real-time scoring integration with endpoint security systems
Addressing class imbalance in the "suspicious" category
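As one possible starting point for the temporal features, the sketch below derives certificate age, validity length, time to expiry, and revocation timing from the notBefore and notAfter validity dates and an optional revocation time. The feature names are hypothetical. Revocations are only counted if they are known at the moment the certificate is scored (`observed_at`), a design choice that keeps future information out of the features.

```python
from datetime import datetime, timezone
from typing import Optional

SECONDS_PER_DAY = 86_400.0


def _days(later: datetime, earlier: datetime) -> float:
    return (later - earlier).total_seconds() / SECONDS_PER_DAY


def temporal_features(
    not_before: datetime,
    not_after: datetime,
    observed_at: datetime,
    revoked_at: Optional[datetime] = None,
) -> dict[str, float]:
    """Hypothetical age and revocation-timing features for one certificate."""
    # Ignore revocations that happen after the certificate is scored.
    revoked = revoked_at is not None and revoked_at <= observed_at
    return {
        "age_days": _days(observed_at, not_before),
        "validity_days": _days(not_after, not_before),
        "days_to_expiry": _days(not_after, observed_at),
        "is_revoked": 1.0 if revoked else 0.0,
        # -1.0 marks "not revoked as of observed_at" so the column stays numeric.
        "days_issue_to_revocation": _days(revoked_at, not_before) if revoked else -1.0,
    }


if __name__ == "__main__":
    utc = timezone.utc
    print(temporal_features(datetime(2024, 1, 1, tzinfo=utc),
                            datetime(2025, 1, 1, tzinfo=utc),
                            datetime(2024, 3, 1, tzinfo=utc),
                            revoked_at=datetime(2024, 1, 11, tzinfo=utc)))
```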