
For the past four years I’ve led infrastructure at a startup that has had to scale quickly. From the beginning I made some core decisions that the company has had to stick to, for better or worse. This post lists the major decisions and whether I endorse them for your startup, or regret them and advise you to pick something else.

AWS

Picking AWS over Google Cloud

🟩 Endorse

Early on, we used both GCP and AWS. During that time, I had no idea who my “account manager” was for Google Cloud, while I had regular cadence meetings with our AWS account manager. The feeling is that Google runs on robots and automation, while Amazon runs on customer focus. That support has helped us when evaluating new AWS services. Beyond support, AWS has done a great job around stability and minimizing backwards-incompatible API changes.

There was a time when Google Cloud was the obvious choice for Kubernetes clusters, especially when it was ambiguous whether AWS would invest in EKS over ECS. Now, with all the extra Kubernetes integrations for AWS services (external-dns, external-secrets, etc.), this is no longer much of an issue.

EKS

🟩 Endorse

Unless you’re penny-pinching (and your time is free), there’s no reason to run your own control plane rather than use EKS. The main advantage of an AWS alternative like ECS is its deep integration with AWS services. Luckily, Kubernetes has caught up in many ways: for example, using external-dns to integrate with Route53.

EKS managed addons

🟧 Regret

We started with EKS managed add-ons because I thought they were the “right” way to use EKS. Unfortunately, we kept running into situations where we needed to customize the installation itself: the CPU requests, the image tag, or some ConfigMap. We’ve since switched to Helm charts for what were add-ons, and things are running much better, with promotions that fit our existing GitOps pipelines.

RDS

🟩 Endorse

Data is the most critical part of your infrastructure. You lose your network: that’s downtime. You lose your data: that’s a company-ending event. The markup cost of using RDS (or any managed database) is worth it.

Redis ElastiCache

🟩 Endorse

Redis has worked very well as a cache and general-purpose tool. It’s fast, the API is simple and well documented, and the implementation is battle-tested. Unlike other cache options, like Memcached, Redis has a lot of features that make it useful for more than just caching. It’s a great swiss army knife of “do fast data thing”.

Part of me is unsure about the state of Redis among cloud providers, but it’s so widely used by AWS customers that I expect AWS will continue to support it well.

ECR

🟩 Endorse

We originally hosted on quay.io. It was a hot mess of stability problems. Since moving to ECR, things have been much more stable. The deeper permission integrations with EKS nodes and dev servers have also been a big plus.

AWS VPN

🟩 Endorse

There are Zero Trust VPN alternatives from companies like Cloudflare. I’m sure these products work well, but a VPN is just so dead simple to set up and understand (“simplicity is preferable” is my mantra). We use Okta to manage our VPN access, and it’s been a great experience.

AWS premium support

🟧 Regret

It’s super expensive: almost as much as (if not more than) the cost of another engineer. If we had very little in-house knowledge of AWS, I think it would be worth it.

Control Tower Account Factory for Terraform

🟩 Endorse

Before integrating AFT, using Control Tower was a pain, mostly because it was very difficult to automate. We’ve since integrated AFT into our stack, and spinning up accounts has worked well. AFT also makes it easier to standardize tags for our accounts. For example, our production accounts have a tag that we can then use to make peering decisions. Tags work better than organizations for us because the properties that describe an account don’t always form a tree structure.

Process

Automating the post-mortem process with a Slack bot

🟩 Endorse

Everyone is busy. It can feel like you’re the “bad guy” when reminding people to fill out the post-mortem. Making a robot be the bad guy has been great. It streamlines the process by nudging people to follow the SEV and post-mortem procedure.

It doesn’t have to be complex to start. Just the basics of “it’s been an hour of no messages; someone post an update” or “it’s been a day with no calendar invite; someone schedule the post-mortem meeting” can go a long way.
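A minimal sketch of such a nudge bot, assuming a Slack incoming webhook (the webhook URL, threshold, and message text here are hypothetical, not our actual bot):

```python
import json
import urllib.request

# Hypothetical nudge threshold: an hour of silence in the SEV channel.
SILENCE_THRESHOLD_SECONDS = 60 * 60

def needs_nudge(last_message_ts: float, now: float,
                threshold: float = SILENCE_THRESHOLD_SECONDS) -> bool:
    """True when the incident channel has been quiet for too long."""
    return (now - last_message_ts) >= threshold

def send_nudge(webhook_url: str, text: str) -> None:
    """Post a reminder into the channel via a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

# Usage, run on a schedule (with a real webhook URL):
#   if needs_nudge(last_message_ts, time.time()):
#       send_nudge(WEBHOOK_URL,
#                  "It's been an hour of no messages. Someone post an update.")
```

The point is how little logic is needed: a timestamp comparison and a webhook call already make the robot the bad guy.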

Using PagerDuty’s incident response template

🟩 Endorse

Why reinvent the wheel? PagerDuty publishes a template of what to do during an incident. We’ve customized it a bit, which is where the flexibility of Notion comes in handy, but it’s been a great starting point.

Regular alert review meetings

🟩 Endorse

Alerting for a company goes like this:

1. There are no alerts at all. We need alerts.
2. We have alerts. There are too many alerts, so we ignore them.
3. We’ve prioritized the alerts. Now only the critical ones wake me up.
4. We ignore the non-critical alerts.

We have a two-tiered alerting setup: critical and non-critical. Critical alerts wake people up. Non-critical alerts are expected to ping the on-call async (email). The problem is that non-critical alerts are often ignored. To resolve this, we have regular (usually every two weeks) PagerDuty review meetings where we go over all our alerts. For the critical alerts, we discuss whether each should stay critical. Then we iterate over the non-critical alerts (usually picking a few each meeting) and discuss what we can do to clear those out as well (usually tweaking the threshold or creating some automation).
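As a sketch, picking which non-critical alerts to discuss is just a frequency count over recent incidents. The field names and incident data below are made up for illustration, not the actual PagerDuty API schema:

```python
from collections import Counter

# Toy stand-in for an incident export (hypothetical fields).
incidents = [
    {"alert": "disk-usage-high", "urgency": "low"},
    {"alert": "disk-usage-high", "urgency": "low"},
    {"alert": "api-5xx-rate", "urgency": "high"},
    {"alert": "disk-usage-high", "urgency": "low"},
    {"alert": "cert-expiry", "urgency": "low"},
]

def review_candidates(incidents, top_n=2):
    """Pick the noisiest non-critical alerts to discuss at the review meeting."""
    counts = Counter(i["alert"] for i in incidents if i["urgency"] == "low")
    return [alert for alert, _ in counts.most_common(top_n)]

print(review_candidates(incidents))  # noisiest low-urgency alerts first
```

Starting each meeting from a list like this keeps the discussion focused on the alerts actually generating noise.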

Monthly cost tracking meetings

🟩 Endorse

Early on, I set up a monthly meeting to go over all of our SaaS costs (AWS, DataDog, etc.). Previously, this was just something reviewed from a finance perspective, but it’s hard for finance to answer general questions like “does this cost number seem right?”. During these meetings, usually attended by both finance and engineering, we go over every software-related bill we get and gut-check whether each cost sounds right. We dive into the numbers of each of the big bills and try to break them down.

For example, with AWS we group items by tag and separate them by account. These two dimensions, combined with the general service name (EC2, RDS, etc.), give us a good idea of where the major cost drivers are. With this data we can go deeper into spot instance usage or see which accounts contribute most to networking costs. But don’t stop at just AWS: dig into all the major spend sinks your company has.
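As a sketch, that tag-by-service breakdown can come from Cost Explorer’s `GetCostAndUsage` grouped by a cost-allocation tag and by `SERVICE`. The response below is a toy example in that response shape, with made-up numbers and a hypothetical `team` tag:

```python
# The raw data would come from Cost Explorer, e.g.
#   boto3.client("ce").get_cost_and_usage(..., GroupBy=[
#       {"Type": "TAG", "Key": "team"},
#       {"Type": "DIMENSION", "Key": "SERVICE"}])
# Toy response in that shape, numbers invented for illustration.
response = {
    "ResultsByTime": [{
        "Groups": [
            {"Keys": ["team$ml", "Amazon EC2"],
             "Metrics": {"UnblendedCost": {"Amount": "1200.0"}}},
            {"Keys": ["team$ml", "Amazon RDS"],
             "Metrics": {"UnblendedCost": {"Amount": "300.0"}}},
            {"Keys": ["team$web", "Amazon EC2"],
             "Metrics": {"UnblendedCost": {"Amount": "450.0"}}},
        ]
    }]
}

def cost_by_group(response):
    """Flatten a Cost Explorer-style response into {(tag, service): dollars}."""
    costs = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            tag, service = group["Keys"]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs[(tag, service)] = costs.get((tag, service), 0.0) + amount
    return costs

# Print the biggest cost drivers first for the meeting.
for (tag, service), dollars in sorted(cost_by_group(response).items(),
                                      key=lambda kv: -kv[1]):
    print(f"{tag:10s} {service:12s} ${dollars:,.2f}")
```

A sorted table like this is usually enough to anchor the “does this sound right” discussion.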

DataDog or PagerDuty for post-mortem management

🟥 Regret

Everyone should do post-mortems. Both DataDog and PagerDuty have integrations to manage writing post-mortems, and we tried each. Unfortunately, they both make it hard to customize the post-mortem process. Given how powerful wiki-ish tools like Notion are, I think it’s better to use a tool like that to manage post-mortems.

Not using Function as a Service (FaaS) more

🟥 Regret

There are no great FaaS options for running GPU workloads, which is why we could never go fully FaaS. However, many CPU workloads could be FaaS (Lambda, etc.). The biggest counter-point people bring up is cost. They’ll say something like “this EC2 instance type running 24/7 at full load is way less expensive than the equivalent Lambda usage”. That’s true, but it’s also a false comparison. Nobody runs a service at 100% CPU utilization and moves on with their life. It’s always on some scaler that says “never reach 100%; at 70%, scale up another”. And it’s always unclear when to scale back down; instead it’s a heuristic like “if we’ve been at 10% for 10 minutes, scale down”. Then people assume spot instances, which aren’t always available on the market.
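A back-of-the-envelope sketch of that comparison, with placeholder prices (the rates below are invented for illustration, not current AWS pricing, and Lambda’s per-request charge is omitted):

```python
# Hypothetical prices, for illustration only.
EC2_HOURLY = 0.10             # $/hour for an always-on instance
LAMBDA_GB_SECOND = 0.0000167  # $/GB-second of Lambda compute

def ec2_monthly_cost(instances: int, hours: int = 730) -> float:
    """Always-on instances bill for every hour, busy or idle."""
    return instances * EC2_HOURLY * hours

def lambda_monthly_cost(requests: int, seconds_per_request: float,
                        memory_gb: float) -> float:
    """Lambda bills only for the compute actually consumed."""
    return requests * seconds_per_request * memory_gb * LAMBDA_GB_SECOND

# An autoscaler targeting ~70% keeps several instances warm at low load...
always_on = ec2_monthly_cost(instances=3)
# ...while the same requests as Lambda invocations bill only for busy time.
faas = lambda_monthly_cost(requests=2_000_000, seconds_per_request=0.2,
                           memory_gb=0.5)
print(f"EC2: ${always_on:.2f}/mo, Lambda: ${faas:.2f}/mo")
```

The exact numbers don’t matter; the point is that the fair comparison is against your real, mostly-idle fleet, not a hypothetical instance pinned at 100%.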

Another hidden benefit of Lambda is that it’s very easy to track costs with high accuracy. When deploying services
in Kubernetes, cost can get hidden behind other per node objects or other services running on the same node.

GitOps

🟩 Endorse

GitOps has so far scaled pretty well, and we use it for many parts of our infrastructure: services, Terraform, and config, to name a few. The main downside is that pipeline-oriented workflows give a clear picture: “here is the box that means you did a commit, and here are the arrows from that box to the end of the pipeline”. With GitOps we’ve had to invest in tooling to help people answer questions like “I did a commit: why isn’t it deployed yet?”.

Even still, the flexibility of GitOps has been a huge win and I strongly recommend it for your company.

Prioritizing team efficiency over external demands

🟩 Endorse

Most likely, your company is not selling the infrastructure itself, but another product. This puts pressure on the team to deliver features rather than scale its own workload. But just like airplane safety briefings tell you to put your own mask on first, you need to make sure your team is efficient. With rare exception, I have never regretted prioritizing time to write some automation or documentation.

Multiple applications sharing a database

🟥 Regret

Like most tech debt, we didn’t make this decision so much as we never decided against it. Eventually, someone wants the product to do a new thing and makes a new table. This feels good because there are now foreign keys between the two tables. But since everything is owned by someone, and that someone is a row in a table, you end up with foreign keys between all objects in the entire stack.

Since the database is used by everyone, it becomes cared for by no one. Startups don’t have the luxury of a DBA,
and everything owned by no one is owned by infrastructure eventually.

The biggest problems with a shared database are:

- Crud accumulates in the database, and it’s unclear whether it can be deleted.
- When there are performance issues, infrastructure (without deep product knowledge) has to debug the database and figure out who to redirect to.
- Database users can push bad code that does bad things to the database. These bad things may trigger PagerDuty alerts for the infrastructure team (since they own the database). It feels bad to wake up one team for another team’s issue. With application-owned databases, the application team is the first responder.

All that said, I’m not against stacks that want to share a single database either. Just be aware of the tradeoffs above and have a good story for how you’ll manage them.

SaaS

Not adopting an identity platform early on

🟥 Regret

I stuck with Google Workspace at the start, using it to create groups for employees as a way to assign permissions. It just isn’t flexible enough. In retrospect, I wish we had picked up Okta much sooner. It has worked very well, has integrations for almost everything, and solves a lot of compliance and security concerns. Lean into an identity solution early on, and only accept SaaS vendors that integrate with it.

Notion

🟩 Endorse

Every company needs a place to put documentation. Notion has been a great choice and much easier to work with than things I’ve used in the past (wikis, Google Docs, Confluence, etc.). Its database concept for page organization has also allowed me to create pretty sophisticated organizations of pages.

Slack

🟩 Endorse

Thank god I don’t have to use HipChat anymore. Slack is great as a default communication tool, but to reduce stress and noise I recommend:

- Using threads to condense communication
- Communicating expectations that people may not respond quickly to messages
- Discouraging private messages and encouraging public channels

Moving off JIRA onto Linear

🟩 Endorse

Not even close. JIRA is so bloated that I’m worried running it at an AI company would turn it fully sentient. When I’m using Linear, I often think “I wonder if I can do X”, then I try, and I can!

Not using Terraform Cloud

🟩 No Regrets

Early on, I tried to migrate our Terraform to Terraform Cloud. The biggest downside was that I couldn’t justify the cost. I’ve since moved us to Atlantis, and it has worked well enough. Where Atlantis falls short, we’ve written a bit of automation in our CI/CD pipelines to make up for it.

GitHub actions for CI/CD

🟧 Endorse-ish

We, like most companies, host our code on GitHub. While we originally used CircleCI, we’ve switched to GitHub Actions for CI/CD. The marketplace of actions available for your workflows is large, and the syntax is easy to read. The main downside of GitHub Actions is that support for self-hosted runners is very limited. We’re using EKS and actions-runner-controller for our self-hosted runners, but the integration is often buggy (nothing we cannot work around, though). I hope GitHub takes Kubernetes self-hosting more seriously in the future.

Datadog

🟥 Regret

Datadog is a great product, but it’s expensive. More than just expensive, I worry its cost model is especially bad for Kubernetes clusters and for AI companies. Kubernetes clusters are most cost-effective when you can rapidly spin many nodes up and down and use spot instances. Datadog’s pricing is based on the number of instances you have, which means that even if we never have more than 10 instances up at once, spinning 20 instances up and down within the hour means we pay for 20 instances. Similarly, AI companies tend to use GPUs heavily. While a CPU node could have dozens of services running at once, spreading the per-node Datadog cost across many use cases, a GPU node is likely to have only one service on it, making the per-service Datadog cost much higher.
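The GPU point is just amortization arithmetic; with a made-up per-host monitoring price (not Datadog’s actual rate):

```python
# Hypothetical per-host monitoring price, for illustration only.
HOST_PRICE = 20.0  # $/host/month

def per_service_cost(host_price: float, services_per_host: int) -> float:
    """Per-host pricing amortized across the services running on that host."""
    return host_price / services_per_host

cpu_node = per_service_cost(HOST_PRICE, services_per_host=20)  # dense CPU node
gpu_node = per_service_cost(HOST_PRICE, services_per_host=1)   # one model per GPU node
print(f"per-service cost: CPU node ${cpu_node:.2f}, GPU node ${gpu_node:.2f}")
```

A service on a one-tenant GPU node carries the entire host price alone, which is what makes per-host pricing sting for AI workloads.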

PagerDuty

🟩 Endorse

PagerDuty is a great product and well priced. We’ve never regretted picking it.

Software

Schema migration by diff

🟧 Endorse-ish

Schema management is hard no matter how you do it, mostly because of how scary it is. Data is important, and a bad schema migration can delete it. Of all the scary ways to solve this hard problem, I’ve been very happy with the idea of checking the entire schema into git and then using a tool to generate the SQL that syncs the database to the schema.
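A toy illustration of the diff idea against SQLite (production tools such as migra or Atlas do this properly for real databases, handling columns, indexes, and drops; the schema below is made up):

```python
import sqlite3

# Desired schema, checked into git (made-up example).
DESIRED = {
    "users": "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)",
    "orders": "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
}

def live_tables(conn):
    """Table names currently present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return {name for (name,) in rows}

def diff_to_sql(conn, desired):
    """Generate the CREATE statements needed to sync the live DB to the schema."""
    missing = set(desired) - live_tables(conn)
    return [desired[t] for t in sorted(missing)]

conn = sqlite3.connect(":memory:")
conn.execute(DESIRED["users"])     # live DB already has `users`...
statements = diff_to_sql(conn, DESIRED)
print(statements)                  # ...so only `orders` needs creating
```

The appeal is that git always holds the full desired state, and the migration SQL is derived rather than hand-written.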

Ubuntu for dev servers

🟩 Endorse

Originally, I tried making the dev servers run the same base OS as our Kubernetes nodes, thinking this would keep the development environment closer to prod. In retrospect, the effort wasn’t worth it. I’m happy we’re sticking with Ubuntu for development servers. It’s a well-supported OS and has most of the packages we need.

AppSmith

🟩 Endorse

We frequently need to automate some process for an internal engineer: restart/promote/diagnose/etc. It’s easy enough for us to make APIs to solve these problems, but…