Replatforming Heroku’s Runtime to Kubernetes
If you had just under a year to migrate a mature PaaS runtime to Kubernetes, hit an immutable re:Invent launch date, and “burn the boats” on your legacy platform, how would you run that program? That was the challenge we took on with Heroku Fir: a full re‑platform of Heroku’s runtime from its proprietary infrastructure to Kubernetes, while simultaneously modernizing observability, networking, and billing in partnership with external cloud partners and multiple internal product lines.
Context & Goals
Heroku Fir was a nearly year‑long program to re‑platform Heroku’s runtime from its proprietary infrastructure to Kubernetes. The goals were to leverage Cloud Native primitives, reduce infrastructure costs, and unlock new workload architectures (AMD64 and ARM64). We committed to announcing at a major industry conference, and that date was immutable, which forced us to be explicit about scope and ruthless about what was in or out for launch.
Leadership made the mandate unambiguous: it was time to “burn the boats” and move away from Cedar, our legacy platform. That mandate drove deep technical and product trade‑off discussions across the company.
Technical Problems Driving the Initiative
Several structural issues with our legacy platform made Fir both necessary and strategically important:
- Legacy build artifacts: Heroku’s proprietary build format could not leverage industry‑standard container tooling (OCI).
- Limited Cloud Native integration: Legacy architecture predated Kubernetes and limited access to modern orchestration primitives.
- Observability gaps: Heroku’s logging and metrics systems were proprietary, so many customers relied on third‑party add‑ons instead; we chose to address this by natively supporting OpenTelemetry.
- IPv4 constraints and costs: Rising IPv4 costs and VPC scaling limits pushed us toward an IPv6‑first architecture.
My Role & Program Plan
I was the sole TPM for Fir, accountable for cross‑team delivery across the entire Heroku organization. I owned the overall program structure, cross‑team technical decision forums, and risk management for external partner and billing dependencies.
Cross‑Functional Teams
Fir required coordination across a broad set of internal and external stakeholders. I partnered with cross‑functional leaders across Engineering, Product, Architecture, Operations, and Go‑To‑Market, including runtime, networking, observability, billing, legal, support, and marketing.
I also reported to executive leadership (CEO/CTO/CPO) and coordinated with our external cloud provider and multiple security and SI partners.
As TPM, I focused on three things: informing architecture decisions by creating decision frameworks, aligning these groups on scope and sequencing, and ensuring we had clear owners and escalation paths for cross‑cutting risks.
Phased Plan
To derisk the re:Invent deadline and create predictable checkpoints, we structured Fir into three phases.
Phase 1 – Foundational Milestones
We first built the platform primitives needed for any meaningful customer traffic:
- OCI image build path.
- Kubernetes cluster provisioning.
- OpenTelemetry integration.
Phase 2 – Shippable Milestones
Next, we focused on complete end‑to‑end experiences that represented core Heroku workflows. This phase was about proving that the new platform could support real customer usage, not just isolated components.
Phase 3 – Pilot
Finally, we defined a Pilot that was opinionated but complete enough to onboard real customers:
- MVP scope completed.
- Sales and Support enablement in place.
- Go‑To‑Market plan aligned across GTM, Product, and Engineering.
Metrics & SLAs (Measuring Success)
Before launching the Pilot, I drove alignment on how we would define “success” for Fir. This included adoption, cost, reliability, and quality signals.
Core Metrics
- Adoption: Number of accounts, number of workloads, % of apps running on Fir for 1+ week, number of partners.
- Cost to Serve (CTS): % lower than legacy.
- Stability: Uptime, provisioning and build latency, MTTR for customer‑facing issues, incident MTTR, monitoring coverage, and % test coverage.
- Bug Health: Frequency and severity decreasing, with outflow vs. inflow trending in our favor rather than focusing on raw bug counts.
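As a rough illustration of how two of these signals could be computed, here is a minimal Python sketch. The function names, the weekly granularity, and the four-week window are assumptions made for the example; they are not the definitions we actually used on Fir.

```python
from dataclasses import dataclass


@dataclass
class WeeklyBugStats:
    opened: int  # inflow: bugs filed during the week
    closed: int  # outflow: bugs resolved during the week


def cost_to_serve_reduction(legacy_cost: float, new_cost: float) -> float:
    """Percent by which the new platform's cost to serve is lower than legacy."""
    if legacy_cost <= 0:
        raise ValueError("legacy_cost must be positive")
    return (legacy_cost - new_cost) / legacy_cost * 100.0


def net_bug_flow(weeks: list[WeeklyBugStats]) -> list[int]:
    """Outflow minus inflow per week; a positive value means the backlog shrank."""
    return [w.closed - w.opened for w in weeks]


def backlog_shrinking(weeks: list[WeeklyBugStats], window: int = 4) -> bool:
    """True if more bugs were closed than opened over the last `window` weeks.

    The window size is a hypothetical choice for this sketch.
    """
    return sum(net_bug_flow(weeks)[-window:]) > 0
```

Tracking the net flow rather than the raw count keeps the focus on direction: a large backlog that is shrinking is a healthier signal than a small one that is growing.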
Success Criteria for Pilot and GA
We also defined clear gates for Pilot and GA readiness:
- Pilot:
  - No significant incidents and no P0 bugs.
  - % of apps active on Fir.
  - % positive CSAT.
  - % of apps using OTEL.
  - Target number of customers onboarded.
- GA:
  - Target percentage of customers and Pilot apps migrated.
  - Metered billing/invoicing live.
  - Target revenue.
  - % Add‑ons supported.
  - Documentation complete.
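Gates like these are easiest to enforce when they are written down as explicit checks. The following Python sketch shows one way the Pilot gate could be encoded. Every threshold in it is a placeholder invented for the example (the real targets are not stated here), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotSnapshot:
    significant_incidents: int
    p0_bugs: int
    pct_apps_active: float
    pct_positive_csat: float
    pct_apps_using_otel: float
    customers_onboarded: int


@dataclass
class PilotTargets:
    # Placeholder thresholds for illustration only, not the actual Fir targets.
    min_pct_apps_active: float = 50.0
    min_pct_positive_csat: float = 80.0
    min_pct_apps_using_otel: float = 25.0
    min_customers_onboarded: int = 10


def failed_pilot_gates(s: PilotSnapshot, t: PilotTargets) -> list[str]:
    """Return the names of gates that are not yet met; an empty list means ready."""
    checks = {
        "no significant incidents": s.significant_incidents == 0,
        "no P0 bugs": s.p0_bugs == 0,
        "apps active on Fir": s.pct_apps_active >= t.min_pct_apps_active,
        "positive CSAT": s.pct_positive_csat >= t.min_pct_positive_csat,
        "apps using OTEL": s.pct_apps_using_otel >= t.min_pct_apps_using_otel,
        "customers onboarded": s.customers_onboarded >= t.min_customers_onboarded,
    }
    return [name for name, passed in checks.items() if not passed]
```

Returning the list of unmet gates, rather than a single pass/fail flag, makes the readiness review a conversation about specific gaps and owners.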
Key Technical Decisions & Trade‑offs
A lot of the TPM work on Fir lived in the “gray area” between product, infrastructure, and the external product roadmaps. Below are some of the highest‑impact technical decisions and trade‑offs I helped drive.
VPC Peering vs. Transit Gateway (IPv4/IPv6 Pivot)
Context: Our original connectivity plan assumed VPC peering using traditional IPv4/IPv6. As we evaluated scale, we identified several constraints.
Decision: I recommended pivoting from VPC peering to Transit Gateway, an approach I had learned about from another TPM in a different business unit. This avoided locking the platform into a non‑scalable, IPv4‑heavy design and aligned networking with our long‑term cost and scalability goals.
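One generic way to see why point-to-point peering scales poorly is to count connections. The sketch below is textbook topology arithmetic, not a description of the specific constraints we hit: it assumes every VPC must reach every other VPC, and it ignores provider quotas, routing, and pricing.

```python
def full_mesh_peering_connections(vpc_count: int) -> int:
    """Peering links needed so that every VPC can reach every other VPC directly."""
    if vpc_count < 0:
        raise ValueError("vpc_count must be non-negative")
    return vpc_count * (vpc_count - 1) // 2


def hub_attachments(vpc_count: int) -> int:
    """Attachments needed when every VPC connects once to a central hub."""
    if vpc_count < 0:
        raise ValueError("vpc_count must be non-negative")
    return vpc_count


if __name__ == "__main__":
    for n in (10, 100):
        print(n, full_mesh_peering_connections(n), hub_attachments(n))
```

Under that assumption the mesh grows quadratically while the hub model grows linearly, which is the general shape of the trade‑off behind a hub‑and‑spoke design.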
Metered Billing Integration
Context: Our central metering platform ran into provider‑level scale limits for invoicing. Because we needed to support both the new runtime and a major database billing upgrade in overlapping timeframes, this created significant dependency risk.
Decision: To give the team enough time to deliver both solutions without jeopardizing the launch date, I pushed for full usage tracking during Pilot with delayed charging. This ensured we could hit the date without blocking on billing integration while still retaining accurate usage data for potential back‑billing.
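The idea of "track everything, charge later" can be sketched in a few lines of Python. This is a simplified illustration, not our metering platform: the `UsageLedger` class, its fields, and the single flat unit price are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class UsageEvent:
    account_id: str
    resource: str
    quantity: Decimal  # for example, compute-hours consumed
    recorded_at: datetime


@dataclass
class UsageLedger:
    # Hypothetical switch: False during Pilot, flipped once invoicing is live.
    charging_enabled: bool = False
    events: list[UsageEvent] = field(default_factory=list)

    def record(self, event: UsageEvent) -> None:
        """Usage is always recorded, whether or not charging is enabled."""
        self.events.append(event)

    def total_usage(self, account_id: str) -> Decimal:
        return sum(
            (e.quantity for e in self.events if e.account_id == account_id),
            Decimal("0"),
        )

    def amount_due(self, account_id: str, unit_price: Decimal) -> Decimal:
        """Amount to charge now; zero while charging is delayed."""
        if not self.charging_enabled:
            return Decimal("0")
        return self.total_usage(account_id) * unit_price

    def back_bill_estimate(self, account_id: str, unit_price: Decimal) -> Decimal:
        """What the recorded usage would cost if billed retroactively."""
        return self.total_usage(account_id) * unit_price
```

The design point is that recording and charging are separate concerns: the ledger stays complete and accurate from day one, and the decision to charge becomes a switch rather than a launch blocker.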
EU Capacity Constraints
Context: Our architects raised future capacity constraints in several EU regions as a potential risk we should work to mitigate.
Decision: I pushed for opening a new region for net‑new customers only. I obtained estimates from Runtime and Infrastructure architects for establishing the region, and I drove an agreement that migration for existing EU customers would be handled in a later phase, decoupling that risk from the re:Invent launch.
External Product Roadmap Changes
Context: We had initially aligned on a managed autoscaling solution with our cloud provider, but a later product pivot changed the control model in ways that did not fit our multi‑tenant PaaS needs.
Decision & Trade‑off: We evaluated the new offering in early access, confirmed it wasn’t a fit for our multi‑tenant PaaS requirements, and decided to build our own autoscaling and packing solution.
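For readers unfamiliar with what "packing" means in this context, the sketch below shows first‑fit decreasing, a textbook bin‑packing heuristic, applied to placing workloads onto nodes. It is not the algorithm we built: it assumes a single integer resource dimension and identical nodes, whereas a real scheduler has to consider multiple resources, tenancy isolation, and churn.

```python
def first_fit_decreasing(demands: list[int], node_capacity: int) -> list[list[int]]:
    """Place workloads on as few nodes as the heuristic manages.

    `demands` are per-workload resource requests in arbitrary units
    (a hypothetical single dimension, e.g. memory). Returns one list of
    placed demands per node.
    """
    nodes: list[list[int]] = []
    free: list[int] = []  # remaining capacity per node, parallel to `nodes`

    # Placing the largest workloads first tends to leave fewer unusable gaps.
    for demand in sorted(demands, reverse=True):
        if demand > node_capacity:
            raise ValueError(f"demand {demand} exceeds node capacity {node_capacity}")
        for i, remaining in enumerate(free):
            if demand <= remaining:
                nodes[i].append(demand)
                free[i] -= demand
                break
        else:
            # No existing node has room, so provision a new one.
            nodes.append([demand])
            free.append(node_capacity - demand)
    return nodes
```

The trade‑off in building this kind of logic in‑house is ownership: the team controls placement policy for a multi‑tenant platform, but also takes on the maintenance that a managed offering would otherwise carry.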
Closing Thoughts & Call to Action
For TPMs, Fir is a reminder that the hardest part of a massive re‑platform is rarely the technology in isolation; it is aligning architecture, business constraints, and external partners around a non‑negotiable timeline while still making principled technical decisions.
In a world where TPMs are sometimes considered unnecessary, it shows how we can structure a large, multi‑year‑impact program by putting clear phases, metrics, and trade‑off frameworks in place and then driving decisions through the inevitable surprises.