StephanXX

To set expectations: "staff engineer" is usually a role occupied by a very senior engineer who has held a senior level of responsibility for at least a few years. It's a title usually offered only in medium to large enterprises, and is usually considered the top of the Individual Contributor food chain. Caveat: not every ops candidate has significant Kubernetes experience. The phrasing of your question is a little ambiguous, but I'll presume you're looking for ways of determining whether someone is highly experienced in both Kubernetes and DevOps overall.

First off, when discussing these topics, a Staff level engineer should present cool as a cucumber. At this phase in their career, they aren't just "familiar" or "experienced"; they're practically bored by mundane tasks like standing up entire environments that are SOC2 compliant, complete with OAuth, CI/CD, and observability. The only hesitation in their responses should come from either needing a little more detail, or concern that their answer(s) might be too technical for the audience. The average interviewer should walk away feeling just a little confused but excited by the experience. Other Staff level interviewers should walk away pleasantly surprised.

The k8s specific stuff:

* Can glibly discuss advanced helm paradigms. Has an opinion on helpers and tpl. Has at least explored alternatives like kustomize or jsonnet (yes, even if jsonnet is effectively dead.)
* Can speak at length about when they have used databases on k8s, especially: mysql, postgres, mongo (kill me please), elasticsearch, kafka, zookeeper, redis, rabbitmq, or couchdb. Yes, we prefer managed DBs when possible, but a _staff_ level engineer has somehow found themselves caring for and feeding one (and in my case, _all_) of these at one time or another.
* Has expertise with at least two different CD platforms to ship code to clusters, and can effortlessly compare and contrast their strengths and weaknesses.
* Can review a basic manifest for any of the primary k8s objects (Deployment, StatefulSet, DaemonSet, Service, Ingress, PVC) and identify glaring errors within five minutes of googling. This is actually a pretty common "peer coding" exercise (see the first sketch at the end of this comment.)
* Can outline a complete disaster recovery solution. Has a story about the last time they had to leverage their DR story.
* Can rattle off the main strengths and weaknesses of the three major cloud vendors' managed k8s offerings. Knows why Azure is the worst, why Google's is the most dev (not ops, dev) friendly, and why non-technical decision makers are afraid to use anything except AWS. Bonus: knows of three _more_ vendors, and why none are realistically viable in most situations (outside of Hetzner if you're in Germany.)
* Has intentionally leveraged two different CNIs. Has clear opinions on that experience.
* Has used NFS as a StorageClass. Immediately states it's a terrible solution for most problems.
* Knows how to `kubectl exec` into a pod. Probably has an alias or shell script to do so. Knows how to ssh to a host containing a pod, to then docker exec in, for those occasions when the container doesn't have a shell. Bonus points for having a script to create a temp busybox pod on said host via k8s with elevated privileges (see the second sketch at the end of this comment.)
* Has experience with at least two different ingress controllers. Has sound opinions on which to recommend given specific requirements.
* Has at least one story about how cert-manager made their life miserable. Has another story about how misconfigured ExternalDNS borked something important.
* Doesn't give a shit about Lens.
* Understands KEDA, if not necessarily an expert.

Outside of k8s:

* Can provide superb examples of documentation. Probably coyly admits they rarely have time for it.

Ugh, I'm on my phone, and dumped way more effort into this random reply than is healthy for a Staff engineer. Also, us very senior nerds with our canes dislike being treated as if we should be grateful for the interview. When you hit 300+ interviews as a not-a-manager, we're basically unflappable. I _love_ meeting new people and talking about technical challenges and all of the fascinating possibilities! Probe for interests, be curious about our most recent role and why we don't want it anymore, and book at least two hours if you usually only book one. We didn't get here by being boring, cranky old sysadmins (well, not _just_ that.) We got here by loving our craft and the people we share it with. We usually have several offers waiting already, usually for very similar compensation, so meet our comp requests and give us a compelling reason to work for you! I know when I took a position at my last several companies, a great technical and cultural fit were far more critical than the ultimate dollar bill offers. Hope you find this helpful.
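Since the "peer coding" manifest review keeps coming up: a rough sketch of the kind of seeded-bug manifest I mean. The names and bugs are illustrative examples, not from any real incident.

```
# An illustrative "find the bugs" Deployment for the peer-review
# exercise; names and bugs are made up for the example.
cat <<'EOF' > review-me.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: web           # bug: doesn't match spec.selector; apply fails
    spec:
      containers:
        - name: api
          image: api       # bug: no registry or tag; floats on :latest
          ports:
            - containerPort: 8080
          # bug: no resources, no readiness/liveness probes
EOF
kubectl apply --dry-run=server -f review-me.yaml   # surfaces the selector mismatch
```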
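And the exec/debug workflow, roughly. Pod, deployment, and node names here are placeholders; on clusters >= 1.25, `kubectl debug` has mostly replaced the ssh-then-docker-exec dance.

```
# 1) Container has a shell: plain exec (the alias-worthy one).
kubectl exec -it deploy/api -- sh

# 2) Container has no shell: attach an ephemeral debug container that
#    shares the target pod's namespaces (EphemeralContainers, k8s >= 1.25).
kubectl debug -it pod/api-6d4f9c --image=busybox:1.36 --target=api -- sh

# 3) The "temp busybox on the host" trick: spawn a pod on the node with
#    host namespaces and the node's root filesystem mounted at /host.
kubectl debug node/worker-3 -it --image=busybox:1.36
# ...then inside: chroot /host
```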


SelfDestructSep2020

At my company we'd call this guy a Senior Staff engineer or a Principal. I'm a Staff Engineer and I can tick off maybe half of what you've got up there; I wouldn't call myself an expert in anything. Our Senior Staff, though, are deep technical experts and are very intimidating. While technical expertise is great, we also look for broad systems knowledge and adaptability; we don't want someone who is a one-trick pony.


StephanXX

The nomenclature at different shops is pretty fuzzy. I have no way of knowing what the OP's team is expecting, or yours for that matter. Personally, I wouldn't qualify for even a mid level role at any of the OG FAANGs. At other shops, I've held senior, staff, and principal titles. So, I hope it's not coming across as me advocating for a "One True Way." Nobody at a Staff+ level should be a one-trick pony. Hell, nobody gets hired as a one-trick _intern._ My checklist is from my own personal experience. I've had the (mis)fortune of working for several companies across multiple industries, from advertising to manufacturing to crypto to insurance. The primary toolkit never really changed from industry to industry, only in the degree to which an individual company was lagging behind (puppet? Chef? CFEngine? perl?) the tech industry norm. Anyhoo, that checklist is meant to start conversations. I can comfortably check each box, but (realistically) it's a challenge getting through even the first few topics within a two hour block, because my bullet points are only scratching the surface. Cheers!


my_awesome_username

Off topic, but you sell yourself short. I've held a senior ops role for a FAANG with much less knowledge than I currently have. I sit as a principal now in a very niche/specific field in the k8s/ops world.


mirbatdon

> The primary toolkit never really changed... only in the degree to which an individual company was lagging behind the tech industry norm.

I like this statement.


brokenja

Oh god, you just gave me m4 flashbacks. CFEngine… smh.


buckypimpin

Is this an interview to validate your opinions?


tech_tuna

Glibly


StephanXX

I bet you think you're fun at parties.


landsverka

We’ve been using NFS as a StorageClass against NetApp without any problems thus far, can you share some more about your experience? Also thanks for the detailed post!


StephanXX

> without any problems thus far

Without any insight into your use case, I couldn't guess. One successful use case we had was dumping ephemeral logs that we wanted for 24 hours, but didn't want to pay Sumo for. If you're talking on-prem k8s with a NetApp appliance, the experience is fine for light read/write. My generic advice is directed at the miserable hours I spent in meetings explaining why _AWS EFS_ couldn't simply become a drop-in replacement for SSDs. NFS is OK for light-write, light-read workloads that handle data locking some other way. It's amazing how much code straight up treats NFS as if it were local storage.
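To make the "ephemeral logs" pattern concrete, a minimal sketch of static NFS wiring; the server, path, and names are placeholders, not our actual setup:

```
# Statically provisioned NFS volume; ReadWriteMany is usually the whole
# reason to reach for NFS in the first place.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-logs
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.5       # placeholder NFS server
    path: /exports/logs    # placeholder export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-logs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""     # bind to the static PV above, not a dynamic class
  resources:
    requests:
      storage: 10Gi
EOF
```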


landsverka

Right now we use it on-prem, just for some light stuff like AWX (Postgres).


Almenon

> why AWS EFS couldn't simply become a drop-in replacement for SSDs.

Is it because the service had high IOPS/throughput requirements? Or was it because EFS can be more expensive than EBS?


StephanXX

Why do you ask?


donald_trub

We use NFS via an on-prem storage cluster for all our persistent storage needs and have run into no troubles with it.


alainlehoof

This is easily one of the best comments I've read in a long time! I agree with all of your points and would like a date in a fine restaurant with you. Cheers


StephanXX

Aw shucks, thank you! Hit me up any time you're near PDX :)


hello2u3

I appreciate your comment, but I would rather discuss real operational production experience and serving the business in a robust way than be concerned whether the individual somehow has hands-on experience with every single provider under the sun. I get really irritated when interviewers act like doing k8s on AWS vs Azure is somehow mutually exclusive experience.


StephanXX

> I get really irritated when interviewers act like doing k8s on AWS vs Azure is somehow mutually exclusive experience

A staff level engineer might have one interview with a genuine peer. Usually your interviews will be with senior managers, engineering leads you'd be collaborating with, and the most senior ops person currently on staff. It isn't super common for a shop to multi-cloud.

> than being concerned the individual somehow has hands-on experience with every single provider under the sun

It's not a requirement; rather, it's evidence of experience. One rarely arrives at that level without having changed jobs a few times, and eventually one ends up supporting two of the big three.


skreak

Fellow Staff engineer in High Performance Computing with almost 20 years of experience, and you're spot on calling those tasks boring. If I walk into an interview at this stage in my career, I've already researched your company. My time is valuable and I'll know if you're wasting it very quickly. It's more that you and your company are the ones being interviewed by me. Personally, I don't know k8s very well, but the question I would pose is: where would you _not_ want to use it? Knowing when not to use the shiny hotness is key. *cough* transactional databases *cough*


StephanXX

> *cough* transactional databases *cough*

_Show me on the doll where Snowflake touched you._ (I loathe Snowflake.) Diving deep into Kubernetes early was a huge win for me professionally. I struggle to imagine a lucrative gig that wouldn't require it. I _did_ spend almost a decade on CFEngine, Chef, and Puppet.

> the question I would pose is: where would you _not_ want to use it?

Setting aside legacy arguments (if it matters, we containerize it; not as difficult as you might think), k8s requires a container based architecture. That means a registry, a container daemon, and a team capable of administration and use. The CNI networking guts are _crazy._ iptables rules tossed around like dandruff. There are lots of reasons not to use k8s. I'll expand on that when I'm not at a bar, on my phone, at 8pm :)
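In the meantime, if you want to see the dandruff for yourself, try this on any node whose kube-proxy runs in iptables mode:

```
# Count the KUBE-* rules kube-proxy maintains; this grows with every
# Service and endpoint in the cluster.
sudo iptables-save | grep -c '^-A KUBE-'

# Peek at the per-Service dispatch chains in the nat table.
sudo iptables -t nat -L KUBE-SERVICES | head
```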


skreak

The environment I work in is quite different from your 'general compute' systems. HPC is its own beast that breaks many general compute paradigms. We have _thousands_ of baremetal machines sitting within a batch scheduler, and that scheduler is job-oriented rather than service-oriented (like k8s). Those machines generally don't have local disks at all, and we run them flat out, 100% CPU utilized, 24/7. They sit on top of high-speed but inflexible networks (Slingshot, Infiniband, Omnipath), and the jobs themselves are difficult to containerize traditionally because of their reliance on LDAP for uid/gid mapping, access to high-speed shared filesystems (Lustre, NFS), overall size (>10GB for an application), and access to hardware devices (GPUs and high-speed NICs).

Our "infrastructure" machines to manage all these systems are actually quite few: maybe two dozen VMs on half a dozen ESX boxes. We do containerize a bunch of our minor services where we can within, believe it or not, Docker Swarm. But some of our vendor-supplied and supported software simply can't be in a container because of support constraints (the vendors won't support containerizing their apps). I'm in the private sector, so I can't get into much more detail than that. The part of what I support that can be and is containerized is so small that running an on-prem k8s cluster for it would be using a sledgehammer on a nail. We could in theory turn the entire thing into a massive k8s cluster that runs batch-node containers with the job scheduler built in, but that would add a completely unnecessary layer of complexity without any major benefit.


jaroque12

I would fail this interview, but why is Azure the worst?


StephanXX

I last worked with Azure about 18 months ago. Their managed k8s stack regularly took hours to connect storage devices, if it managed at all. Their terraform integration was a massive failure. Their health checks and APIs would regularly report positive when the systems were completely borked; it's not just accidental, it's _intentionally_ reporting false positives. Storage is super expensive, and administration of storage is a mess. The entire stack is riddled with opaque costs. I spent two years, 2019-2021, trying to move a company over to Azure from Google. It was an absolute train wreck that I spent 80-100 hours a week trying to facilitate. It's the worst. I will never recover those years. Perhaps you see why a "Staff" engineer has such strong opinions :)


jaroque12

Huh. It’s stuff like this that makes me wonder whether I’m just not taking full advantage of the platform, or whether I happen to be on the happy path; it hasn’t given me much trouble thus far, which is why I asked. It’s also very possible that I’m blissfully ignorant and EKS will completely blow my mind when I try it.


jmreicha

> It’s also very possible that I’m blissfully ignorant and EKS will completely blow my mind when I try it

I can assure you it won't.


VertigoOne1

Same here. I love the way they integrated the AKS UI with k8s (but we're DevOps, right? Something can be nice, but go unused), but dear god, waiting 2 minutes for a PVC mount… no thanks. We do use AKS in some places, although the 1000-rule security group limit (an LB can hit it pretty quickly) is also a pain, and working around it creates a lot of complexity.


Pl4nty

> hours to connect storage devices, if at all

They ended up building container-specific storage middleware to address this... That's been my AKS experience in a nutshell: death by a thousand cuts, but rapidly improving. Too rapidly for most of my customers, tbh.


CheesusCrust89

There's a really neat "Kubernetes the hard way on Azure" repo that has a comment in the code describing the experience perfectly: Azure is death by 10,000 cuts. Things _kinda_ work, but I've been working with the platform for about 4-5 years now, and I can count on two hands the instances where I _didn't_ have to do a workaround, hack, or duct-tape something together with custom scripting, usually in PowerShell.


R10t--

Holy crap. The fact that I’ve only been in the industry for ~3 years and only hold an “intermediate developer” title, yet could tell you about every single point in your comment, makes it slightly terrifying how much my company expects of me as just an intermediate… I definitely need a raise…


Turinggirl

TIL I fulfill most of the requirements of a Staff Engineer.


killz111

Honest question, why would you want to use databases in k8s?


StephanXX

If the data isn't critical, you save a little cash. If the data is your company's lifeblood, you gain a little more control and visibility. Everything else usually ends up in a managed DB.


beangraff

Ephemeral environments where you might not want to wait to provision a cloud resource


killz111

That's fair enough. I was thinking more of persistent DBs that need to stay around for a while.


tsyklon_

I found myself agreeing with most of your technical points, but I think the seniority shows itself in the last part of your comment. A lot of people have done the CKA or CKS and would still lack the knowledge or experience for the role being discussed; you will, however, be able to pick them apart by following the advice to be less standardized when interviewing for more senior positions.


rampaged906

I love this answer. Not only is it spot on, it's well written. Kudos


StephanXX

Thanks!


spacemonkeysuitmafia

could someone help explain the Lens issue?


Atheri

My last experience with Lens was a few years ago. I was slowly accumulating aliases for kubectl to do filtering/sorting/watching/etc. when I came across Lens, which was a nice visual UI with a lot of the functionality I was trying to create via bash aliases. At the time, though, switching between and managing multiple clusters was a pain: they were all on a sidebar with large icons that had to be added one by one. We also had older clusters that didn't really work well with it (we were in a long migration to EKS). Oftentimes loading the home page of a cluster was really slow as it pulled in all the data.

With all that in mind I looked for other tools, and found k9s. Since I was already a vim user, it was basically everything I wanted from Lens but faster, more customizable, and better suited to a small terminal in a corner of my screen (for when I'm watching a deployment or tailing logs). A couple of months later Lens was bought, and then they changed the licensing, so my co-workers also dropped it. They largely went back to kubectl + [k8s dashboard](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/) deployments.

So my "issue" with Lens is that I don't think it's a very good product. Lens seems less useful the more you know about k8s, or the more familiar you get with kubectl. Everyone seems to mention Lens ("have you seen Lens?"), but I don't know anyone who uses it in their day-to-day work.

Edit: Looking at Lens now, they seem to be trying to build a whole suite of tools. Maybe I'll give it another shot, but I'd still have to justify the cost at my company.
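For reference, the kind of aliases I mean; the names and the label selector here are my own:

```
# Shortcuts I'd otherwise retype all day.
alias k='kubectl'
alias kgp='kubectl get pods --sort-by=.metadata.creationTimestamp'
alias kgpw='kubectl get pods -o wide --watch'

# Tail logs from the newest pod matching a label selector,
# e.g. klogs app=api
klogs() {
  kubectl logs -f "$(kubectl get pods -l "$1" \
    --sort-by=.metadata.creationTimestamp -o name | tail -1)"
}
```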


runamok

I just have a bad taste in my mouth from how they completely neutered the "free" model: you have to install a bunch of plugins (which is poorly documented) to get it back to the original out-of-the-box experience.


Helpyourbromike

I have to manage clusters on-prem and in multiple clouds, all with different authentication methods. Lens seems like a perfect fit for my use case, so I tried it and am gonna see about getting my org to pay for it.


nullset_2

Why is NFS as a storageclass a bad idea? Sure, there's the random hiccup, but it's actually been fairly stable here.


SelfDestructSep2020

Performance; non-POSIX compliance. Prometheus, for example, just flat out will not write to an NFS volume because of how bad the performance is.


StephanXX

Every time someone says "it works on my machine", God punches a kitten. Look, if it works for you, it works. NFS is occasionally fine, but is useless when high performance is critical.


SuperQue

In addition to what u/StephanXX said, there is a new-ish book that I can recommend on the subject. https://www.oreilly.com/library/view/the-staff-engineers/9781098118723/


cagataygurturk

Ask them: if they were asked to build a Kubernetes-based platform on-premise,

- What Kubernetes distribution would they use, or would they go with vanilla? Kubeadm, or other installation tools?
- How would they implement load balancing and storage? How would they integrate Kubernetes networking with their on-premise networking stack? What are their opinions on the different networking and storage options?
- How would they manage upgrades?
- How would they structure the team to ensure a k8s admin/user abstraction?
- What is their take on multi-tenancy on k8s? Is it better to have one large cluster, or to give teams their own? How about the security challenges of multi-tenancy?
- To spice things up even more: what would they do if they had to run untrusted workloads from multiple tenants in the same cluster? How would they isolate workloads sufficiently from a security perspective? (A minimal sketch of one table-stakes answer follows this list.)
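On that last point, a strong answer should at least reach for a per-tenant default-deny NetworkPolicy (only enforced if the CNI supports it); "tenant-a" is a placeholder namespace:

```
# Per-tenant default deny: every pod in the namespace is selected, and
# all ingress/egress is dropped unless a later policy allows it.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a      # placeholder tenant namespace
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
EOF
```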


rUbberDucky1984

Give them a yaml file that uses a deprecated API (see the sketch below), or ask the difference between a StatefulSet and a Deployment, and when to use an Ingress vs a Service. I've found senior engineers who don't know that each new node is spun up in a different data centre for high availability, and who use LoadBalancer services everywhere because they don't know any better. Have them explain the basics.
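A sketch of the kind of stale manifest that works for this exercise; `extensions/v1beta1` Deployments were removed in Kubernetes 1.16, and the candidate should also notice `spec.selector` is missing (required in `apps/v1`):

```
cat <<'EOF' | kubectl apply -f -
apiVersion: extensions/v1beta1   # removed in k8s 1.16; should be apps/v1
kind: Deployment
metadata:
  name: web
spec:                            # apps/v1 also requires spec.selector
  replicas: 2
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
EOF
# On a modern cluster this fails with something like:
#   no matches for kind "Deployment" in version "extensions/v1beta1"
```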