SRE Engineer

At sitereliabilityengineer.dev, our mission is to provide comprehensive and up-to-date information about site reliability engineering (SRE) to help individuals and organizations improve the reliability and performance of their websites and applications. We aim to be a trusted resource for SRE professionals, aspiring SREs, and anyone interested in learning more about this critical field. Through our articles, tutorials, and community resources, we strive to promote best practices, share insights and experiences, and foster a culture of continuous improvement in site reliability engineering.

Site Reliability Engineering (SRE) Cheatsheet

Welcome to the Site Reliability Engineering (SRE) cheatsheet! This reference sheet is designed to help you get started with SRE and provide you with a quick reference guide for the concepts, topics, and categories related to SRE.

What is Site Reliability Engineering (SRE)?

Site Reliability Engineering (SRE) is a discipline that combines software engineering and operations to build and run large-scale, distributed, and reliable software systems. SRE is a set of practices that help organizations improve the reliability, availability, and performance of their systems.

Key Concepts

Service Level Objectives (SLOs)

Service Level Objectives (SLOs) are one of the core concepts in SRE. An SLO defines the level of service a system should provide to its users, typically expressed as a target for availability, latency, or throughput measured over a defined window.
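
A minimal sketch of how a request-based availability SLO check might look in practice; the request counts and the 99.9% target below are illustrative, not taken from a real system, and the measured value (often called a Service Level Indicator, or SLI) is what gets compared against the target:

```python
# Minimal sketch: compare a measured availability SLI against an SLO target.
# All numbers are illustrative.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

SLO_TARGET = 0.999  # 99.9% availability over the window

sli = availability_sli(successful_requests=998_700, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}")
```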

Error Budgets

Error Budgets are another key concept in SRE. An Error Budget is the amount of unreliability a system is allowed before it violates its SLOs: the time (or fraction of requests) during which the system can be down or degraded while still meeting its targets. For example, a 99.9% availability SLO leaves a budget of 0.1% of the measurement window.
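
As a worked example, assuming a 99.9% availability SLO measured over a rolling 30-day window and an illustrative amount of downtime already spent, the budget can be computed directly:

```python
# Minimal sketch: translate an SLO target into an error budget for a window.

SLO_TARGET = 0.999
WINDOW_DAYS = 30

window_minutes = WINDOW_DAYS * 24 * 60               # 43,200 minutes in the window
budget_minutes = (1 - SLO_TARGET) * window_minutes   # allowed unreliability

downtime_minutes = 12.5  # hypothetical downtime observed so far this window
remaining = budget_minutes - downtime_minutes

print(f"Error budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")
# Error budget: 43.2 min, remaining: 30.7 min
```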

Incident Management

Incident Management is the process of detecting, responding to, and resolving incidents in a system. Incident Management is a critical part of SRE, as it helps organizations minimize the impact of incidents on their systems and users.

Monitoring and Alerting

Monitoring and Alerting are essential components of SRE. Monitoring is the process of collecting data about a system's performance and behavior. Alerting is the process of notifying the appropriate people when a system is behaving abnormally.
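
A minimal sketch of a threshold-based alert check, assuming a hypothetical `fetch_error_rate` metric query and a hypothetical `notify_on_call` paging hook; production setups usually rely on dedicated monitoring and alerting systems rather than a hand-rolled script:

```python
ERROR_RATE_THRESHOLD = 0.05  # alert if more than 5% of requests are failing

def fetch_error_rate() -> float:
    """Hypothetical: query the monitoring backend for the current error rate."""
    return 0.02  # placeholder value

def notify_on_call(message: str) -> None:
    """Hypothetical: page the on-call engineer."""
    print(f"ALERT: {message}")

def check_once() -> None:
    rate = fetch_error_rate()
    if rate > ERROR_RATE_THRESHOLD:
        notify_on_call(f"Error rate {rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.1%}")

if __name__ == "__main__":
    check_once()  # in practice this check would run on a schedule, e.g. every minute
```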

Capacity Planning

Capacity Planning is the process of determining the resources that a system needs to meet its SLOs. Capacity Planning is critical to ensuring that a system can handle its expected load and traffic.
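
A minimal back-of-the-envelope capacity estimate, assuming an expected peak load, a measured per-instance capacity, and a target utilization that leaves headroom for spikes and instance failures; all numbers are illustrative:

```python
import math

peak_qps = 12_000            # expected peak queries per second
qps_per_instance = 800       # measured capacity of a single instance
target_utilization = 0.6     # run instances at ~60% to absorb spikes and failures

instances_needed = math.ceil(peak_qps / (qps_per_instance * target_utilization))
print(f"Provision at least {instances_needed} instances")  # -> 25
```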

Change Management

Change Management is the process of making changes to a system in a controlled and predictable manner. Change Management is essential to ensuring that changes do not negatively impact a system's reliability, availability, or performance.
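
One common safeguard in this area is canary analysis: comparing a new release against the current baseline before widening the rollout. A minimal sketch, with illustrative thresholds and error rates:

```python
def canary_is_healthy(canary_error_rate: float,
                      baseline_error_rate: float,
                      allowed_regression: float = 0.01) -> bool:
    """Return True if the canary's error rate is within tolerance of the baseline."""
    return canary_error_rate <= baseline_error_rate + allowed_regression

if canary_is_healthy(canary_error_rate=0.004, baseline_error_rate=0.002):
    print("Promote the change to the next stage of the rollout")
else:
    print("Roll back the change and investigate")
```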

Key Topics

Reliability

Reliability is the ability of a system to perform its intended function without failure. Reliability is a critical aspect of SRE, as it is the primary goal of the discipline.

Availability

Availability is the ability of a system to be operational and accessible to its users. Availability is a key component of SLOs and is critical to ensuring that a system meets its users' needs.

Latency

Latency is the time it takes for a system to respond to a user's request. Latency is a critical component of SLOs, as it directly impacts a system's performance and user experience.
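
Latency targets are usually stated as percentiles rather than averages, because a small number of very slow requests can hide behind a healthy mean. A minimal sketch with an illustrative set of samples:

```python
import math

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples, in milliseconds."""
    ordered = sorted(samples)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

latencies_ms = [12, 15, 14, 13, 11, 18, 250, 16, 14, 12]
print(f"p50: {percentile(latencies_ms, 50)} ms, p99: {percentile(latencies_ms, 99)} ms")
# p50: 14 ms, p99: 250 ms
```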

Scalability

Scalability is the ability of a system to handle increasing amounts of traffic and load. Scalability is critical to ensuring that a system can meet its users' needs as it grows.

Performance

Performance is the ability of a system to perform its intended function efficiently. Performance is a critical aspect of SRE, as it directly impacts a system's reliability and user experience.

Security

Security is the protection of a system from unauthorized access, use, disclosure, disruption, modification, or destruction. Security is a critical aspect of SRE, as it helps ensure that a system is reliable and available to its users.

Key Categories

Incident Response

Incident Response is the operational side of Incident Management, described under Key Concepts: detecting, responding to, and resolving incidents so that their impact on systems and users is minimized.

Monitoring and Alerting, Capacity Planning, and Change Management

These three categories are defined under Key Concepts above; the same definitions apply here.

Disaster Recovery

Disaster Recovery is the process of recovering a system from a catastrophic event, such as a natural disaster or cyberattack. Disaster Recovery is a critical aspect of SRE, as it helps organizations minimize the impact of such events on their systems and users.

Automation

Automation is the use of technology to perform tasks without human intervention. Automation is a critical component of SRE, as it helps organizations improve the reliability and efficiency of their systems.
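
A minimal sketch of automating one routine task, restarting a service when its health check fails; the health endpoint URL and systemd unit name are hypothetical, and a production version would add logging and rate limiting:

```python
import subprocess
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical health endpoint
SERVICE_UNIT = "example-app.service"           # hypothetical systemd unit

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if not is_healthy(HEALTH_URL):
    # Assumes a systemd-managed service; a real runbook automation would
    # also record the action and avoid restart loops.
    subprocess.run(["systemctl", "restart", SERVICE_UNIT], check=True)
```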

Conclusion

This cheatsheet provides a quick reference guide for the concepts, topics, and categories related to Site Reliability Engineering (SRE). SRE is a critical discipline that helps organizations improve the reliability, availability, and performance of their systems. By understanding the key concepts, topics, and categories of SRE, you can better prepare yourself to build and run large-scale, distributed, and reliable software systems.

Common Terms, Definitions and Jargon

1. Availability: The measure of how often a system is operational and accessible to users.
2. Black box testing: A testing method that examines the functionality of a system without knowledge of its internal workings.
3. Blue/green deployment: A deployment strategy that involves switching between two identical environments to minimize downtime during updates.
4. Capacity planning: The process of determining the resources required to meet future demand for a system.
5. Chaos engineering: A methodology for testing and improving system resilience by intentionally introducing failures.
6. Circuit breaker: A design pattern that prevents cascading failures by automatically stopping requests to a failing service (a minimal sketch follows this list).
7. Configuration management: The process of managing and tracking changes to a system's configuration settings.
8. Containerization: A method of packaging and deploying applications in lightweight, isolated environments called containers.
9. Continuous delivery: A software development practice that emphasizes frequent, automated releases to production.
10. Continuous integration: A software development practice that involves merging code changes into a shared repository frequently to detect and resolve conflicts early.
11. Disaster recovery: The process of restoring a system to a functional state after a catastrophic event.
12. Elasticity: The ability of a system to dynamically adjust its resource usage to meet changing demand.
13. Fault tolerance: The ability of a system to continue operating in the event of a failure.
14. High availability: A system design approach that minimizes downtime, typically through redundancy and automatic failover.
15. Incident management: The process of responding to and resolving incidents that impact system availability or performance.
16. Infrastructure as code: A practice of managing infrastructure using code and automation tools.
17. Key performance indicators (KPIs): Metrics used to measure the performance of a system or process.
18. Latency: The time delay between a request and its response.
19. Load balancing: The process of distributing incoming traffic across multiple servers to improve performance and availability.
20. Mean time between failures (MTBF): The average time between system failures.
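
A minimal sketch of the circuit breaker pattern from item 6: after a configurable number of consecutive failures the breaker opens and subsequent calls fail fast until a cooldown elapses. A production implementation would typically add a half-open probe state; the thresholds here are illustrative:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures, then allow a retry after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failure_count = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: close the breaker and allow a trial call.
            self.opened_at = None
            self.failure_count = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0
        return result

# Usage sketch (fetch_profile is a hypothetical call to a flaky dependency):
#   breaker = CircuitBreaker()
#   breaker.call(fetch_profile, user_id)
```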
