
Part I - Introduction


Part I. Introduction

This section provides some high-level guidance on what SRE is and why it is different from more conventional IT industry practices.

In [52]Introduction, Ben Treynor Sloss, the senior VP overseeing technical operations at Google and the originator of the term "Site Reliability Engineering", gives his view on what SRE means, how it works, and how it compares to other ways of doing things in the industry.

In [53]The Production Environment at Google, from the Viewpoint of an SRE, we provide a guide to the production environment at Google, to help acquaint you with the wealth of new terms and systems you are about to meet in the rest of the book.


References

1. https://www.googletagmanager.com/ns.html?id=GTM-WVF23W3
2. https://www.google.com/
3. https://sre.google/sre-book/table-of-contents/
4. https://sre.google/sre-book/foreword/
5. https://sre.google/sre-book/preface/
6. https://sre.google/sre-book/part-I-introduction/
7. https://sre.google/sre-book/introduction/
8. https://sre.google/sre-book/production-environment/
9. https://sre.google/sre-book/part-II-principles/
10. https://sre.google/sre-book/embracing-risk/
11. https://sre.google/sre-book/service-level-objectives/
12. https://sre.google/sre-book/eliminating-toil/
13. https://sre.google/sre-book/monitoring-distributed-systems/
14. https://sre.google/sre-book/automation-at-google/
15. https://sre.google/sre-book/release-engineering/
16. https://sre.google/sre-book/simplicity/
17. https://sre.google/sre-book/part-III-practices/
18. https://sre.google/sre-book/practical-alerting/
19. https://sre.google/sre-book/being-on-call/
20. https://sre.google/sre-book/effective-troubleshooting/
21. https://sre.google/sre-book/emergency-response/
22. https://sre.google/sre-book/managing-incidents/
23. https://sre.google/sre-book/postmortem-culture/
24. https://sre.google/sre-book/tracking-outages/
25. https://sre.google/sre-book/testing-reliability/
26. https://sre.google/sre-book/software-engineering-in-sre/
27. https://sre.google/sre-book/load-balancing-frontend/
28. https://sre.google/sre-book/load-balancing-datacenter/
29. https://sre.google/sre-book/handling-overload/
30. https://sre.google/sre-book/addressing-cascading-failures/
31. https://sre.google/sre-book/managing-critical-state/
32. https://sre.google/sre-book/distributed-periodic-scheduling/
33. https://sre.google/sre-book/data-processing-pipelines/
34. https://sre.google/sre-book/data-integrity/
35. https://sre.google/sre-book/reliable-product-launches/
36. https://sre.google/sre-book/part-IV-management/
37. https://sre.google/sre-book/accelerating-sre-on-call/
38. https://sre.google/sre-book/dealing-with-interrupts/
39. https://sre.google/sre-book/operational-overload/
40. https://sre.google/sre-book/communication-and-collaboration/
41. https://sre.google/sre-book/evolving-sre-engagement-model/
42. https://sre.google/sre-book/part-V-conclusions/
43. https://sre.google/sre-book/lessons-learned/
44. https://sre.google/sre-book/conclusion/
45. https://sre.google/sre-book/availability-table/
46. https://sre.google/sre-book/service-best-practices/
47. https://sre.google/sre-book/incident-document/
48. https://sre.google/sre-book/example-postmortem/
49. https://sre.google/sre-book/launch-checklist/
50. https://sre.google/sre-book/production-meeting/
51. https://sre.google/sre-book/bibliography/
52. https://sre.google/sre-book/introduction/
53. https://sre.google/sre-book/production-environment/
54. https://sre.google/sre-book/preface/
55. https://sre.google/sre-book/introduction/
56. https://creativecommons.org/licenses/by-nc-nd/4.0/
