platform-engineering

AWS Lambda MicroVMs: stateful sandboxes with full lifecycle control

AWS just shipped something that reframes what “serverless” can mean. Lambda MicroVMs are not Lambda Functions with a bigger timeout. They are a fundamentally different primitive: stateful, VM-level isolated environments with an explicit lifecycle you control. The key shift: instead of getting a recycled process for 15 minutes, you get a dedicated Firecracker microVM that lives for up to 8 hours, and you decide when it starts, suspends, resumes, and terminates....

AI agents in practice: self-learning, knowledge bases, and why fewer agents is better

Building AI agents sounds fun until you actually build one. Then a different set of problems shows up, ones nobody writes about. Here is what I have learned running agent systems in production: self-improvement conflicts with git, most knowledge bases hit a wall sooner than expected, and adding more agents almost never helps.

The self-improvement problem

One of the selling points of agents like Hermes is that they can self-reflect and improve, updating their own rules based on experience....

Terraform at scale: GitOps tools and the long apply problem

If you’ve been using Terraform Cloud for a while, you’ve probably hit at least one of these: the pricing model changed and suddenly it’s expensive, applies take 10+ minutes, or the state files have grown into something nobody wants to touch. You’re not alone — this is a recurring topic in every DevOps community right now. This post covers the main tools people are using to solve these problems in 2025–2026, with a focus on two separate issues that often get conflated: GitOps orchestration (who triggers plans, who approves applies) and state management at scale (why applies are slow and what to do about it)....

How GitHub engineers tackle platform problems

GitHub’s platform engineering team shares their approach to tackling infrastructure challenges at scale. Key strategies include:

Understanding your domain:
- Talk to neighboring teams with more experience
- Investigate old issues to understand system limitations
- Read documentation to build foundational knowledge

Platform-specific skills:
- Network fundamentals (TCP, UDP, L4 load balancing, debugging tools)
- Operating systems and hardware selection for scalability and cost
- Infrastructure as Code (Terraform, Ansible, Consul)
- Distributed systems understanding (failures are inevitable, need failover/recovery)

Impact radius considerations:...
