Executive Summary
Certificate reputation, built like a production ML system.
TrustedChain ingests public certificate transparency data and malware telemetry, enriches it with behavioral signals, and outputs a calibrated trust score that maps to clear security actions.
Goal
Stop malicious binaries by evaluating the *issuer + certificate lineage*, not just the file hash.
Core Output
Allow ✅ / Review ⚠️ / Quarantine 🚫 with confidence scores and reasons.
Guarantee
No sensitive files are exposed publicly; the ML code stays offline unless explicitly deployed.
End-to-End Flow
From raw certificates to actionable security decisions
This flow reflects the *actual architecture* in the repo and makes no claims about features that are not implemented at runtime.
Certificate + Malware Signals
- crt.sh (certificate transparency logs)
- Malware feeds / sandbox labels
- OSINT & issuer metadata
Scraper & Normalization
- Rate-limited scraping with resume state
- Schema normalization for certificates
- Data stored in SQLite for fast iteration
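The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's scraper: the table layout, the raw field names (`id`, `sha256`, `issuer_name`, `common_name`), and the helper names `open_store`, `normalize`, and `ingest` are all assumptions. A fixed delay between records stands in for rate limiting, and a per-source checkpoint row provides the resume state.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the SQLite store and create the (hypothetical) tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS certificates (
               fingerprint TEXT PRIMARY KEY,
               issuer      TEXT NOT NULL,
               subject     TEXT NOT NULL,
               not_before  TEXT,
               not_after   TEXT)"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scrape_state (
               source  TEXT PRIMARY KEY,
               last_id INTEGER NOT NULL)"""
    )
    return conn


def normalize(raw):
    """Map one raw record (assumed field names) onto the certificate schema."""
    return {
        "fingerprint": raw["sha256"].replace(":", "").lower(),
        "issuer": " ".join(raw["issuer_name"].split()),  # collapse whitespace
        "subject": raw["common_name"].strip().lower(),
        "not_before": raw.get("not_before"),
        "not_after": raw.get("not_after"),
    }


def ingest(conn, source, records, min_interval=0.0):
    """Store records newer than the checkpoint; return how many were stored.

    Assumes `records` arrive in ascending `id` order.
    """
    row = conn.execute(
        "SELECT last_id FROM scrape_state WHERE source = ?", (source,)
    ).fetchone()
    last_id = row[0] if row else 0
    stored = 0
    for raw in records:
        if raw["id"] <= last_id:
            continue  # already ingested in an earlier run
        with conn:  # certificate and checkpoint commit together
            conn.execute(
                "INSERT OR REPLACE INTO certificates VALUES "
                "(:fingerprint, :issuer, :subject, :not_before, :not_after)",
                normalize(raw),
            )
            conn.execute(
                "INSERT OR REPLACE INTO scrape_state (source, last_id) VALUES (?, ?)",
                (source, raw["id"]),
            )
        last_id = raw["id"]
        stored += 1
        time.sleep(min_interval)  # crude throttle between records
    return stored
```

Because the certificate row and the checkpoint are written in the same transaction, an interrupted run can be restarted with the same input and will skip whatever was already committed.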
Contextual Features
- Issuer history & lineage graph
- Revocation timing + re-use frequency
- Sandbox detonation outcomes
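As a rough illustration of the revocation-timing and re-use signals listed above, the sketch below derives three simple features from plain Python records. The function names, the record keys (`issuer`, `revoked_at`), and the choice of features are assumptions made for the example; the repo's actual feature set may differ.

```python
from collections import Counter
from datetime import datetime


def days_to_revocation(issued_at, revoked_at):
    """Whole days between issuance and revocation, or None if never revoked.

    Both arguments are ISO 8601 strings such as "2024-01-01".
    """
    if revoked_at is None:
        return None
    delta = datetime.fromisoformat(revoked_at) - datetime.fromisoformat(issued_at)
    return delta.days


def reuse_counts(sightings):
    """Count distinct files signed by each certificate.

    `sightings` is an iterable of (certificate_fingerprint, file_hash) pairs;
    duplicate pairs are counted once.
    """
    unique_pairs = set(sightings)
    return Counter(fingerprint for fingerprint, _ in unique_pairs)


def issuer_revocation_rate(certs):
    """Fraction of each issuer's certificates that carry a revocation date."""
    totals, revoked = Counter(), Counter()
    for cert in certs:
        totals[cert["issuer"]] += 1
        if cert.get("revoked_at"):
            revoked[cert["issuer"]] += 1
    return {issuer: revoked[issuer] / count for issuer, count in totals.items()}
```

A certificate revoked within days of issuance, or one seen signing an unusually large number of distinct files, are the kinds of patterns these features are meant to surface for the model.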
Offline ML Pipeline
- Feature engineering + labeling
- Training & evaluation notebooks
- Model comparison + clustering outputs
ML notebooks live in machine-learning-code/ and are not executed by the production API unless explicitly wired in.
API & Dashboard
- FastAPI service exposes insights
- Dashboard visualizes trust trajectories
- Admin console tracks telemetry
Security Verdicts
- Allow / Review / Quarantine labels
- Confidence + recommended action
- Analyst-friendly explanations
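One way the mapping from a score to a verdict could look is sketched below. The thresholds (`ALLOW_AT`, `QUARANTINE_BELOW`), the `Verdict` type, the action strings, and the assumption that the trust score lies in [0, 1] are all hypothetical; in practice the cut-offs would be tuned against the calibrated model output.

```python
from dataclasses import dataclass, field

# Hypothetical cut-offs on a trust score in [0, 1]; tune against real data.
ALLOW_AT = 0.80
QUARANTINE_BELOW = 0.30


@dataclass
class Verdict:
    label: str          # "Allow", "Review", or "Quarantine"
    trust_score: float  # the score the decision was based on
    action: str         # recommended action for the operator
    reasons: list = field(default_factory=list)  # analyst-facing explanations


def decide(trust_score, reasons=()):
    """Map a trust score to a verdict with a recommended action."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be within [0, 1]")
    if trust_score >= ALLOW_AT:
        label, action = "Allow", "permit execution"
    elif trust_score < QUARANTINE_BELOW:
        label, action = "Quarantine", "isolate the binary and alert"
    else:
        label, action = "Review", "queue for analyst triage"
    return Verdict(label, trust_score, action, list(reasons))
```

For example, `decide(0.12, ["issuer revoked within 2 days of issuance"])` yields a Quarantine verdict that carries its explanation along with it, which keeps the reasoning visible to the analyst.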
Operational Notes
Performance & safety posture
Performance
Production runtime is lightweight (API + UI). ML training is offline and only runs when triggered.
Security
Ensure notebooks and outputs are not publicly served; keep training artifacts private.
Data Governance
Use rate limiting, caching, and provenance tracking to maintain compliance and reproducibility.
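Caching and provenance tracking could be combined along the lines below. This is a sketch under stated assumptions: `provenance_record` and `CachedFetcher` are hypothetical names, the fetch function is injected by the caller, and the cache is a plain in-memory dictionary rather than anything persistent.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source, payload, fetched_at=None):
    """Describe where a payload came from and fingerprint its content.

    The payload is serialized with sorted keys so that the hash does not
    depend on dictionary ordering.
    """
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    timestamp = fetched_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "sha256": hashlib.sha256(blob).hexdigest(),
        "fetched_at": timestamp.isoformat(),
    }


class CachedFetcher:
    """Wrap a fetch function so each source is retrieved once and logged."""

    def __init__(self, fetch):
        self._fetch = fetch  # caller-supplied callable: source -> payload
        self._cache = {}
        self.log = []        # one provenance record per real fetch

    def get(self, source):
        if source not in self._cache:
            payload = self._fetch(source)
            self._cache[source] = payload
            self.log.append(provenance_record(source, payload))
        return self._cache[source]
```

Serving repeat requests from the cache reduces load on upstream sources, and the logged hashes make it possible to check later that a training run used exactly the data that was fetched.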