TrustedChain

Data Flow & ML Pipeline

Executive Summary

Certificate reputation, built like a production ML system.

TrustedChain ingests public certificate transparency data and malware telemetry, enriches it with behavioral signals, and outputs a calibrated trust score that maps to clear security actions.

Goal

Stop malicious binaries by evaluating the *issuer + certificate lineage*, not just the file hash.

Core Output

Allow ✅ / Review ⚠️ / Quarantine 🚫 with confidence scores and reasons.

Guarantee

No sensitive files are exposed publicly; the ML code stays offline unless explicitly deployed.

End-to-End Flow

From raw certificates to actionable security decisions

This reflects the *actual architecture* in the repo and avoids claims about anything not implemented at runtime.

Sources

Certificate + Malware Signals

  • crt.sh (certificate transparency logs)
  • Malware feeds / sandbox labels
  • OSINT & issuer metadata

Ingestion

Scraper & Normalization

  • Rate-limited scraping with resume state
  • Schema normalization for certificates
  • Data stored in SQLite for fast iteration
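
As a rough illustration of the ingestion steps above, the sketch below pauses between requests, persists a cursor so an interrupted run can resume, and normalizes each record before writing it to SQLite. Everything named here is an assumption for illustration: the `fetch_page` callable, the table and column names, and the raw field names (`id`, `issuer_name`, `common_name`, `not_before`, `not_after`) are hypothetical and may not match the repo's scraper.

```python
import sqlite3
import time


def normalize(raw: dict) -> dict:
    """Map a raw log entry onto a flat certificate schema (hypothetical fields)."""
    return {
        "cert_id": int(raw["id"]),
        "issuer": raw.get("issuer_name", "").strip(),
        "common_name": raw.get("common_name", "").strip().lower(),
        "not_before": raw.get("not_before"),
        "not_after": raw.get("not_after"),
    }


def ingest(conn, fetch_page, min_interval=1.0, sleep=time.sleep):
    """Pull pages until one comes back empty, saving a resume cursor as we go.

    fetch_page(cursor) is an assumed callable that returns a list of raw
    entries whose id is strictly greater than cursor.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS certs ("
        "cert_id INTEGER PRIMARY KEY, issuer TEXT, common_name TEXT, "
        "not_before TEXT, not_after TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scrape_state (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM scrape_state WHERE key = 'cursor'"
    ).fetchone()
    cursor = int(row[0]) if row else 0  # resume from the last saved position

    while True:
        page = fetch_page(cursor)
        if not page:
            break
        rows = [normalize(r) for r in page]
        conn.executemany(
            "INSERT OR REPLACE INTO certs VALUES "
            "(:cert_id, :issuer, :common_name, :not_before, :not_after)",
            rows,
        )
        cursor = max(r["cert_id"] for r in rows)
        conn.execute(
            "INSERT OR REPLACE INTO scrape_state VALUES ('cursor', ?)",
            (str(cursor),),
        )
        conn.commit()  # rows and cursor are committed together
        sleep(min_interval)  # fixed pause between requests as a simple rate limit
    return cursor
```

Because the cursor is committed together with the rows it covers, an interrupted run in this sketch would lose at most the page that was in flight.
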
Enrichment

Contextual Features

  • Issuer history & lineage graph
  • Revocation timing + reuse frequency
  • Sandbox detonation outcomes
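
To make two of the contextual features above concrete, here is a minimal sketch of how revocation timing and reuse frequency might be derived. The record keys (`cert_id`, `not_before`, `revoked_at`), the `(file_hash, cert_id)` pair format, and the exact feature definitions are assumptions for illustration, not the repo's schema.

```python
from collections import Counter
from datetime import datetime


def days_to_revocation(not_before, revoked_at):
    """Days between issuance and revocation, or None if never revoked."""
    if revoked_at is None:
        return None
    delta = datetime.fromisoformat(revoked_at) - datetime.fromisoformat(not_before)
    return delta.total_seconds() / 86400.0


def enrich(certs, signed_files):
    """Attach revocation timing and reuse counts to each certificate record.

    certs: list of dicts with 'cert_id', 'not_before', optional 'revoked_at'
    signed_files: iterable of (file_hash, cert_id) pairs (hypothetical format)
    """
    # Count distinct files per certificate so duplicate sightings do not inflate reuse.
    reuse = Counter(cert_id for _, cert_id in set(signed_files))
    return [
        {
            **cert,
            "days_to_revocation": days_to_revocation(
                cert["not_before"], cert.get("revoked_at")
            ),
            "reuse_count": reuse.get(cert["cert_id"], 0),
        }
        for cert in certs
    ]
```

Under these definitions, a short time to revocation combined with a high reuse count would be the kind of signal a downstream model could weigh against the issuer's history.
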
Modeling

Offline ML Pipeline

  • Feature engineering + labeling
  • Training & evaluation notebooks
  • Model comparison + clustering outputs

ML notebooks live in machine-learning-code/ and are not executed by the production API unless explicitly wired in.

Serving

API & Dashboard

  • FastAPI service exposes insights
  • Dashboard visualizes trust trajectories
  • Admin console tracks telemetry

Decision

Security Verdicts

  • Allow / Review / Quarantine labels
  • Confidence + recommended action
  • Analyst-friendly explanations
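
The mapping from a trust score to a verdict could look roughly like the sketch below. The thresholds (`allow_at=0.8`, `quarantine_below=0.3`), the action strings, and the definition of confidence as the larger of the score and its complement are all illustrative assumptions; the source does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    label: str          # "Allow", "Review", or "Quarantine"
    confidence: float   # assumed here to be max(score, 1 - score)
    action: str         # recommended next step
    reasons: list = field(default_factory=list)


def decide(trust_score, reasons=(), allow_at=0.8, quarantine_below=0.3):
    """Map a calibrated trust score in [0, 1] to a verdict (hypothetical thresholds)."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be between 0 and 1")
    if trust_score >= allow_at:
        label, action = "Allow", "permit execution"
    elif trust_score < quarantine_below:
        label, action = "Quarantine", "isolate the binary"
    else:
        label, action = "Review", "escalate to an analyst"
    confidence = max(trust_score, 1.0 - trust_score)
    return Verdict(label, confidence, action, list(reasons))
```

For example, `decide(0.1, ["revoked within 3 days of issuance"])` would return a Quarantine verdict carrying that reason, so the analyst sees why the label was assigned.
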

Operational Notes

Performance & safety posture

Performance

Production runtime is lightweight (API + UI). ML training is offline and only runs when triggered.

Security

Ensure notebooks and outputs are not publicly served; keep training artifacts private.

Data Governance

Use rate limiting, caching, and provenance tracking to maintain compliance and reproducibility.
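
One way to combine caching with provenance tracking is sketched below: each fetched payload is stored alongside a record of where it came from, when it was fetched, and its content hash. The `ProvenanceCache` class, the record fields, and the `fetch` callable are hypothetical names introduced for illustration; a real deployment would likely persist these records rather than hold them in memory.

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(source, url, payload, fetched_at=None):
    """Describe a fetched payload so later results can be traced back to it."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "url": url,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "fetched_at": fetched_at.isoformat(),
    }


class ProvenanceCache:
    """In-memory cache keyed by URL that keeps provenance for each payload."""

    def __init__(self):
        self._entries = {}

    def get_or_fetch(self, source, url, fetch):
        """Return (payload, record), calling fetch(url) only on a cache miss."""
        if url not in self._entries:
            payload = fetch(url)  # fetch is an assumed callable returning bytes
            self._entries[url] = (payload, provenance_record(source, url, payload))
        return self._entries[url]
```

Serving repeat requests from the cache reduces load on upstream sources, and the stored hash makes it possible to check later that a training run used exactly the bytes that were fetched.
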