GitHub’s platform engineering team shares their approach to tackling infrastructure challenges at scale. Key strategies include:

Understanding your domain:

  • Talk to neighboring teams with more experience
  • Investigate old issues to understand system limitations
  • Read documentation to build foundational knowledge

Platform-specific skills:

  • Network fundamentals (TCP, UDP, L4 load balancing, debugging tools)
  • Operating systems and hardware selection for scalability and cost
  • Infrastructure as Code (Terraform, Ansible, Consul)
  • Distributed systems understanding (failures are inevitable, need failover/recovery)

Impact radius considerations:

  • Understand downstream dependencies before making changes
  • Review postmortems to understand incident impact
  • Use monitoring and telemetry (like Single Availability Metric) for quick health checks

Testing in distributed environments:

  • Use test sites as “real” machines for changes
  • Test IaC provisioning and deprovisioning operations
  • Implement end-to-end testing by directing traffic to test servers
  • Test self-healing capabilities and identify bottlenecks early
  • Roll out changes host-by-host for easier rollback

The key difference with platform engineering is the wide impact radius - changes to foundational services like DNS can affect numerous products. Testing and gradual rollouts are critical.

For the full article with detailed examples and best practices, see the original post.