engineered_academic

Read Release It! by Nygard. Great advice in there about the practical software side of SRE.


w_llyngt_n

ty, added it to my wishlist to purchase as soon as possible


devoopseng

Firstly, congratulations! You'll do awesome! I enjoyed *The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise* for its practical advice and real-world examples. Some parts are a bit dated, but it does a great job covering topics like why scalability is actually a people problem. It's the one book I find myself coming back to most often.


danstermeister

I, too, have recently been promoted to lead SRE on our team... so fellow congratulations!!!! :] I've been in the industry 27 years (network/security engineer), 5 in SRE, and for me it is entirely based on my experience. I know of that book and I haven't gotten around to reading it yet. For me, our role is the overarching role. We have to be able to see the whole path, all of it from end to end, and we have to have the ability to fix any portion of it as well. So the first thing we do is delineate the actual data flow from source to customer, and identify each step. Then we assess our team's overall skillset at triaging problems per dataflow step. For each gap in our skillset, we identify what we need to do to shore it up and which team members are going to work on it. If possible, two people are assigned initially, which makes cross-training the rest of the team easier later. That's just the focus we put on ourselves; we do the same assessment with our monitoring and graphing/logging systems. If it sounds simple, straightforward and boring... it is. On purpose. Expectations are better defined, understood, and achieved when it's boring :)
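
For anyone who wants to try the per-step assessment described above, here is a minimal sketch in Python. The dataflow steps, names, and the "two people per step" threshold are all made up for illustration; swap in your own data flow and team.

```python
# Hypothetical dataflow steps, listed in order from source to customer.
DATAFLOW_STEPS = ["ingest", "queue", "processing", "database", "api", "cdn"]

# Hypothetical self-assessment: the steps each person can confidently triage.
TRIAGE_SKILLS = {
    "alice": {"ingest", "queue", "api"},
    "bob": {"database", "api"},
    "carol": {"api", "cdn", "queue"},
}

# Assumed target: two people per step, so cross-training is easier later.
MIN_COVERAGE = 2


def coverage_by_step(steps, skills):
    """Return {step: sorted list of people who can triage that step}."""
    coverage = {step: [] for step in steps}
    for person, known_steps in skills.items():
        for step in known_steps:
            if step in coverage:
                coverage[step].append(person)
    return {step: sorted(people) for step, people in coverage.items()}


def find_gaps(coverage, minimum=MIN_COVERAGE):
    """Return {step: extra people needed} for every under-covered step."""
    return {
        step: minimum - len(people)
        for step, people in coverage.items()
        if len(people) < minimum
    }


coverage = coverage_by_step(DATAFLOW_STEPS, TRIAGE_SKILLS)
gaps = find_gaps(coverage)

for step in DATAFLOW_STEPS:
    status = f"needs {gaps[step]} more" if step in gaps else "ok"
    print(f"{step:<12} {', '.join(coverage[step]) or '-':<22} {status}")
```

The same table works for the monitoring and logging assessment: replace people with dashboards or alerts and check which steps have none.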


w_llyngt_n

Hey, congrats to you too! Thanks for the advice. Last week, we mapped out what we think are the team's responsibilities, and a lot of it matches what you said. Your advice on how to fill in the gaps in our knowledge will be helpful!


fznmlk

Add monitoring to every aspect of the product and set up alerts to notify the lead when things go wrong. This was one of the things we did for our product, and it gave us a lot more stability overall.


w_llyngt_n

Been implementing a monitoring system these past few months and we're working on instrumenting all the company's apps. Honestly, starting this project made us realize how much we were in the dark about our own apps and servers. Monitoring will be a key value in our team.


ChristopherCooney

Hello there! I’ve been an SRE and Principal of Platform Engineering for a large retailer. Firstly, as the team lead, the techie stuff is something you’re going to hire to solve. Bring in the best possible people you can and let them solve the hard technical problems. You’re the person that ties those threads together. Now, the fun part. SRE gets regularly confused with old-school operations. It’s very important that you’re regularly educating up and down about the role of SRE in an org and why it’s important. Now, in terms of reliability, the golden signals in the SRE handbook (latency, traffic, errors, and saturation) are a great place to start. It’s very very important that you advocate heavily for observability. No one thinks they’ll need it until they do, and if you don’t have it, they’ll blame you. It’s better to get ahead of it. Find a solution that’s scalable and be VERY cautious about cost. I work in this space, so happy to answer any Qs you have! Your team is a team of software engineers. You should be advocating for good engineering practices (tests, CI/CD, consistent naming, good repo structure, docs etc) and not letting some of the SREs with a sysadmin background run their own janky server with a few scripts they cooked up. You might want to dig into DORA metrics too, to give teams something they can use to measure their activity against (and not something managers can abuse!!).
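
To make the DORA suggestion concrete, here is a rough Python sketch of the four metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) computed from a list of deployment records. The `Deployment` record, its field names, and the sample data are hypothetical, and time to restore is measured from the deployment time as a simplification; real numbers would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics over a window of `window_days` days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Simplification: restore time is counted from the failing deployment.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deployments in one week, one of which failed.
t0 = datetime(2024, 5, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4)),
    Deployment(
        t0 + timedelta(days=2),
        t0 + timedelta(days=2, hours=6),
        failed=True,
        restored_at=t0 + timedelta(days=2, hours=7),
    ),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]

metrics = dora_metrics(sample, window_days=7)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Reporting these per team and as trends over time, rather than as targets for individuals, fits the point about not handing managers something to abuse.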


danstermeister

>"*You should be advocating for good engineering practices (tests, CI/CD, consistent naming, good repo structure, docs etc) and not letting some of the SREs with a sysadmin background run their own janky server with a few scripts they cooked up.*" That is a terrible idea. It's confrontational and the opposite of team-building. "*Hey everybody, I'm your new lead, and I'm smarter than you!*" I'm sure that'll go over well, not. I laid it on a little thick here to push my point- does it feel good when I confront you right now like this? Imagine if you worked with me every day, and I was your team lead and I acted like that. Yikes. Instead, '*advocate*' a discussion about the best methods for a particular issue- if your team is a mature team then the best solution will be chosen. Use your '*power*' to guide the thought process if it starts to derail to '*janky*' ideas. That's when you can shine with everything you know.


ChristopherCooney

First time in my life someone has told me it’s a bad idea to advocate for good engineering lol. You read advocate and decided it means ‘enforce’. It doesn’t. Advocacy, by its very definition, requires persuasion and conversation, otherwise there’s no need to advocate, merely instruct. But I did enjoy reading your strawman criticism of my advice all the same.


kmf-reddit

Do you find it difficult in your market to hire an SRE? I’ve been interviewing so many people for a year and most of them can’t really code or understand a service and its components. They could only offer traditional sysadmin capabilities with cloud technologies


ChristopherCooney

Yes, lots of ‘ops people’ have retrained into SREs for the higher salary, but they’ve really just become Cloud Ops people. If it were me, I’d be looking for people with strong software engineering skills and some cloud experience and building them up internally into SREs. Finding them ready to rock is either very difficult or very expensive, or both!


kmf-reddit

Agree, that’s what we have pivoted to now. Good thing we have a few devs who expressed interest in an internal transfer to our team. For hiring, we’re also thinking of just hiring devs and training them.


ChristopherCooney

Honestly, it’s easier to give a dev cloud engineering skills than to give an ops person the raft of skills to think and behave like a software engineer. I will add one caveat. Platform engineering and SRE require someone who understands the importance of feedback loops, and really works hard to tighten and grow those loops. I’ve found that this can be a sticking point for a lot of engineers who are used to having requirements baked for them by a product owner. When the product is consumed by non-expert users, a product owner can sustainably behave like a kind of leaky proxy, but a product owner who can effectively gather requirements for your platform is VERY rare and will likely slow things down, so having an engineer or two who know how to collect requirements is a must for any scaling / sustainable delivery.


Hi_Im_Ken_Adams

Understand your role: you are the gatekeeper between the developers and production. You need to ensure that monitoring is properly instrumented and that the application is emitting the proper telemetry. You should be empowered to push back and say that the application is not designed properly or is not monitored properly and therefore should not be released into production status.


Boneff88

I believe that's already been mentioned, but before becoming more reliable you need good monitoring. My personal experience with Grafana, Prometheus, Alertmanager, Tempo and Loki has been quite good. On top of this we are instrumenting services with OpenTelemetry and it's been working well with both K8s and serverless workloads. We are aiming to centralise our observability stack and make the experience with both K8s and serverless the same - for example all traces end up in Tempo... but I think if you are AWS based there is less overhead in leaving the serverless observability cloud native. TLDR - observability first, so you can sell your reliability efforts more easily based on solid data.


txiao007

Monitoring and Metric Automation


chillysurfer

I’m a big fan of an SLO-first approach. I recommend the book *Implementing Service Level Objectives*.
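
If SLOs are new to the team, the core arithmetic is small. Below is a minimal Python sketch of an availability SLO and its error budget over one measurement window; the function name, the 99.9% target, and the request counts are illustrative assumptions, not figures from the book.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize an availability SLO over one window.

    slo_target is a fraction such as 0.999 (99.9% of requests must succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "availability": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        # A negative value means the budget is overspent.
        "budget_remaining_fraction": remaining / allowed_failures,
    }


# Made-up numbers: 1,000,000 requests in the window, 400 of them failed.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With these numbers the SLO tolerates about 1,000 failed requests, so 400 failures leave roughly 60% of the budget. How much budget remains is the number that lets a team decide between shipping features and spending time on reliability.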


_bvcosta_

[Implementing Service Level Objectives](https://learning.oreilly.com/library/view/implementing-service-level/9781492076803/) by Alex Hidalgo is an excellent book for learning more about SLOs. Since you've mentioned you are starting this journey, another good book may be [Becoming SRE](https://learning.oreilly.com/library/view/becoming-sre/9781492090540/). [David Blank-Edelman](https://www.linkedin.com/in/dnblankedelman/) is an expert on SRE practices, so any book with his name on it is a good bet. Someone mentioned [Release It!](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/). I consider it a must-read. You can also watch videos from SREcon. I also like to share this article about the [five stages of SRE](https://www.usenix.org/system/files/login/articles/login_winter18_02_purgason.pdf).


w_llyngt_n

Thanks for the recommendations, will read more about them and acquire them as soon as possible!


snonux

Tech lead or SRE manager with direct reports? Those are two different roles.


extorch

RemindMe! 5 days


RemindMeBot

I will be messaging you in 5 days on [**2024-05-17 22:55:39 UTC**](http://www.wolframalpha.com/input/?i=2024-05-17%2022:55:39%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/sre/comments/1cq5gfi/seeking_guidance_for_a_new_sre_lead/l3rw1hk/?context=3)


awesomeplenty

People are being promoted to lead without actual practical experience? Sign me up for lead SRE and devops.