SRE in Practice

  • Create Date: 2021-03-14 03:17:48
  • Update Date: 2025-09-06
  • Author: David N. Blank-Edelman
  • ISBN: 1491978864

Summary

Site Reliability Engineering was the first book to reveal the results of the 13 years of hard work Google spent creating and nurturing the SRE idea. As a perfect follow-up to that guide, SRE in Practice shows how other organizations have applied and even extended the theories and practices from the first book. It's the ideal way to learn how to implement SRE in your own company.

If you're hungry to explore more of the SRE ideas, and curious how these ideas could work in your environment, this is definitely the book for you.

This new book:

  • Addresses the dearth of SRE-related information for people already in, or interested in, the growing SRE field
  • Gives you a way of relating to the first book and provides ideas for how SRE can be implemented outside of Google
  • Relates your existing understanding of DevOps to this new discipline
  • Enables you to start participating in the global conversation around the future of operations

Reviews

Jason

The chapter on cognitive SRE work by Allspaw and Cook was significantly crystallizing for me.

Oleg

After the first book, which described how Google runs its production systems, the term SRE became hyped and mainstream. So it is no surprise that a second book followed, focused on technical implementation and based on real examples of topics mentioned in the first one, such as configuration management and the right SLO/SLI definitions. Opinions in the industry have been split, from "Wow, let's do it" to "only big companies like Google can implement this grade of production sanity in their processes".

As usual, the truth is in the middle.

The book "Seeking SRE" is a very good attempt to understand what is behind the term SRE and how it has been adopted by other key players such as Amazon, Facebook, Netflix, and Dropbox, which run production systems at a global scale. Besides the technical aspects, there are chapters dedicated to the human aspects of being an SRE, covering skills, burnout, mindset, work-life balance, and the role in the organization. Despite the title, these could be applied to any team or person running production systems at scale.

The book consists of a set of interviews with SREs or people with a similar mindset (such as PEs, the production engineers at Facebook). They tell their stories of successes, failures, and lessons learned from running production systems.

It comes as no surprise that other successful companies have a set of practices that allow them to run production systems within a defined SLA. The key factor is that engineers, managers, business people, and developers all speak one "language" when talking about the reliability of production systems. That makes it possible to understand quickly what can be improved, and this process is continuous and based on data (metrics).

[Image: example of such a metrics flow]

Digging into production metrics is rated very highly and is considered a very important practice. The authors believe that organizations must not aim for 100% availability but should instead build resilient, scalable, and highly available systems: systems that can recover to the last known state within seconds or minutes. From this belief arose Chaos Engineering, which tests whether a production service can recover quickly with minimal or no impact on users.

To achieve that, there is usually a central engineering department. It analyzes business requirements and decides how to implement them. In contrast to other big EE organizations, the design of the implementation of such requirements is done by SREs (the teams who are RUNNING production systems and know best how to implement it). Of course, consolidating around processes and tooling the way Google or Facebook do may be considered "overkill". In this light, the transformation story of Spotify is amazing and shows that the company goes through a continuous learning process: from central Ops team(s) to the Ops-in-Squads model to the "Golden Path" model. Golden Path aims to offer the easiest way to bring new applications and services into production, to remove operational burden, and to bring consistency, while still keeping the dev teams responsible for their code in production. To keep the central organization on track and prevent it from turning into a bottleneck and showstopper, feedback is crucial. A set of metrics is defined and reviewed to understand whether velocity, productivity, people's satisfaction, etc. are increasing over time.

Another big pillar of the SRE philosophy is the definition of SLIs/SLOs (Service Level Indicators and Service Level Objectives) and how they are defined compared to the classical approach: define the desired system state in terms of business needs, with commitment from all involved teams to maintain it. There is a great chapter from Amazon SRE dedicated to this.

The relation between DevOps and SRE is carefully studied in the book. A few quotes are below:

"At PayPal, we believe that site reliability engineers are both the ultimate enablers as well as the ideal practitioners of DevOps."

"DevOps is for newer teams to Agile that need to improve tooling and culture. SRE is for established Agile teams that are looking to improve uptime, monitoring, sanity, and peace of mind."

"DevOps can and should be implemented in every organization adopting Agile or an iterative way of working so as to close the feedback loop cycle to product owners or those owning the backlog of features to ensure service management waste is known and addressed. SRE’s are compatible with organizations that have or seek to foster an engineering culture and eliminate waste and find efficiencies with engineering outcomes. Both can coexist at the same time in the same organization when the organization is big enough or old enough."

I'd like to finish this short book review by sharing two practices, both of which have dedicated chapters, and their potential benefits. The first is the blameless postmortem. The key factor here is that it is shared organization-wide so that everyone can read and learn from it. Ideally, there is one template to simplify writing and reading. The second is keeping documentation in the code repository. This enables faster searching for the runbooks or documentation needed. It reduces the number of outdated documents and adds comprehensive versioning and reviewing. In other words, documentation should be treated in the same way as code. Thanks for reading.

Rodolfo

Excellent book. Some of the later chapters go too deep into technologies and scaling issues but all the first chapters are totally worth it. I have a lot of notes to start some discussions in the office.