One of the most important skills I believe programmers need is to understand when to use certain tools. My typical process for solving a problem is to go through each tool in my toolbox and see which one would solve the problem the best. Over the past few years, I’ve worked on systems that process 100s of billions of data points. In doing so, I’ve added a number of tools to my personal toolbox. Here’s a glimpse of some of those tools:
Postgres is by far the tool I know best. I was previously responsible for managing a cluster of Postgres instances containing a total of about 1 petabyte of data. I’ve written 40 or so blog posts about Postgres. I even tried to start a business helping people optimize their Postgres databases, but that wound up not working out. I’ve since pivoted to Freshpaint which has been going a lot better.
Postgres is my default tool for storing data. It hits a pretty awesome sweet spot:
- Postgres is usually fast enough.
- It can handle pretty much any type of data. It doesn’t matter whether it’s time series data or geospatial data; Postgres can handle it.
- Postgres allows you to query the data in a ton of different ways. You have complete access to SQL. It can sometimes be a bit clunky to write the query you want, but there’s almost always a way to get it to work.
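As a sketch of that flexibility, here’s the kind of time-series rollup Postgres handles with plain SQL (the events table and its columns are hypothetical, just for illustration):

```sql
-- Hypothetical events table.
CREATE TABLE events (
    user_id    bigint,
    created_at timestamptz,
    properties jsonb
);

-- Daily event counts for the past week.
SELECT date_trunc('day', created_at) AS day,
       count(*) AS num_events
FROM events
WHERE created_at > now() - interval '7 days'
GROUP BY day
ORDER BY day;
```

Writing a query like this can take a few tries, but that’s the clunkiness mentioned above, not a hard limit.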
The one downside of Postgres is its learning curve. In order to use Postgres, you need to learn about tables, schema design, and indexes. Not to mention you have to learn SQL just to use it.
When should you use Postgres?
You should use Postgres if:
- You need a place to store data.
- You are either already familiar with Postgres or are willing to put in the time to learn about it.
- You aren’t dealing with Big Data™. It becomes a challenge to scale Postgres beyond a few terabytes of data. Fortunately, that’s plenty for most people.
If all those points are true, you should probably go with Postgres.
When I want to run Postgres these days, I go with Amazon RDS. RDS takes care of all the hard parts of administering Postgres. It takes care of replication, failover, and backups. I previously ran my own Postgres instances and I don’t want to ever again unless I have to.[1]
I think Redis is one of the most elegant pieces of software out there. I’ve even read the complete Redis source code from beginning to end. Of all the software I’ve used, Redis is the one that just works. (In other words, it’s the one that breaks on me the least.)
All Redis does is provide a few basic data structures. It gives you a hash table, a queue, a sorted set, and several others. That’s literally it. Because Redis does so little, it’s able to provide an incredibly simple interface. Want to use Redis for a queue? Just use RPUSH (push to the right) and LPOP (pop from the left):
> rpush mylist 10
> rpush mylist 20
> lpop mylist
"10"
> lpop mylist
"20"
Want to have a leaderboard? Just use ZADD (add to sorted set) and ZREVRANGE (list from sorted set, starting from the highest values) to get the top results:
> ZADD scoreboard 50 "George"
> ZADD scoreboard 150 "John"
> ZADD scoreboard 100 "Paul"
> ZADD scoreboard 0 "Ringo"
> ZREVRANGE scoreboard 0 1 WITHSCORES
1) "John"
2) "150"
3) "Paul"
4) "100"
It’s just that simple and straightforward to use.
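To see how directly those commands map onto everyday data structures, here’s a plain-Python stand-in for the two examples above (this is just to illustrate the semantics, not a replacement for Redis):

```python
from collections import deque

# RPUSH/LPOP is just a double-ended queue: push on the right, pop from the left.
mylist = deque()
mylist.append(10)          # RPUSH mylist 10
mylist.append(20)          # RPUSH mylist 20
first = mylist.popleft()   # LPOP mylist -> 10

# A sorted set is conceptually a score-per-member mapping kept in score order.
scoreboard = {"George": 50, "John": 150, "Paul": 100, "Ringo": 0}

# ZREVRANGE scoreboard 0 1 WITHSCORES: the top two members by score.
top_two = sorted(scoreboard.items(), key=lambda kv: kv[1], reverse=True)[:2]
```

The difference, of course, is that Redis gives you these structures over the network, shared between all of your processes.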
When should you use Redis?
You should use Redis if:
- You’re doing something that maps directly to one of the supported data structures.
- You are working with only a few GB of data. Redis stores everything in RAM so you’ll need a pretty big machine to work with anything beyond that.
If you’re interested in learning more about Redis, you should check out try.redis.io. It’s an interactive tutorial that walks you through the basic Redis data structures.
S3 is even simpler than Redis. It’s so simple, they even put “simple” in the name (Simple Storage Service). The core S3 API is made up of two operations. You can upload a file, and you can download a file. That’s really all there is to it. I like to think of S3 as “Dropbox for Programmers”.
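In boto3 (the AWS Python SDK), those two operations look roughly like this (the bucket and key names are made up, and this assumes AWS credentials are already configured, so it won’t run as-is):

```python
import boto3

s3 = boto3.client("s3")

# Upload a file...
s3.upload_file("backup.dump", "example-backups-bucket", "2020/backup.dump")

# ...and download it back.
s3.download_file("example-backups-bucket", "2020/backup.dump", "restored.dump")
```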
S3 really shines when you either need cheap storage or need an easy way to distribute a file. The three main cases where I’ve used S3 are: storing database backups, storing lots of data I won’t need to look at often, and as an easy way to share files. Right now, my go-to method for developing a front-end is to host a React app on S3. That’s how I’ve been developing freshpaint.io. If I need an API I’ll set up a quick one with AWS Lambda.
When should you use S3?
You should use S3 if:
- You need to store data cheaply or need to distribute a file.
- The data doesn’t change often.
- You don’t need to query the data in any complex way.
In the same way that Postgres is my default way of storing data, Lambda is my default approach to building APIs. All you need to do is upload your executable to AWS Lambda and Amazon will take care of running it for you.
AWS Lambda handles a ton of distracting details for you. You don’t need to worry about load balancing, server maintenance, or capacity planning if you just go with AWS Lambda. I often just want to write some code and put it out there, and Lambda is a great option for doing that.
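A Lambda-backed endpoint really is just a function. Here’s a minimal sketch of a Python handler (the `event`/`context` signature is Lambda’s standard shape; the greeting logic and field contents are made up for illustration):

```python
import json

def handler(event, context):
    # Lambda passes the parsed request in `event`. For an API Gateway request,
    # query parameters show up under "queryStringParameters".
    name = (event.get("queryStringParameters") or {}).get("name", "world")

    # Return an API-Gateway-style response: a status code and a JSON body.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

You zip this up, upload it, point Lambda at the handler function, and Amazon handles the rest.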
AWS Lambda has two big downsides that I see. First, there’s a 15 minute time limit on any function execution. This limits the kind of code you can run in Lambda to relatively short jobs. If the limit were something more like an hour, I think I would use it for a lot more tasks. The other downside is that costs can be unpredictable. Lambda uses usage-based pricing, so if you aren’t sure how much load you are going to have, you can’t really be sure of your costs.
When should you use AWS Lambda?
You should use AWS Lambda when:
- You need to run code in short bursts and you know those bursts will take less than 15 minutes.
- You are monitoring how much you are spending.
- You don’t already have an orchestration tool like Kubernetes in place.
I’ve honestly only used Docker for development and testing. I’ve never actually run Docker containers in production. That doesn’t mean Docker hasn’t made my life a lot easier.
Docker eliminates the hassle of figuring out how to install a piece of software and lets you run multiple copies of the same piece of software simultaneously. Want to run a copy of your infrastructure that uses multiple Redis instances and multiple Postgres instances on a new laptop? If you’re using Docker, it’s just a few commands away.
One case where I found Docker to be particularly handy is when I wanted to test a library I was building against the last five versions of Postgres. I was able to easily download and run all five with just a tiny bit of setup in Docker.
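A rough sketch of what that looks like (the container names, host ports, and version tags here are made up; each container maps its internal 5432 to a different host port):

```shell
# Run two Postgres versions side by side, each on its own host port.
docker run -d --name pg-11 -e POSTGRES_PASSWORD=secret -p 5411:5432 postgres:11
docker run -d --name pg-12 -e POSTGRES_PASSWORD=secret -p 5412:5432 postgres:12

# Point the test suite at localhost:5411 or localhost:5412 depending on
# which version you want to exercise.
```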
When should you use Docker?
If you need to locally run multiple independent services, Docker is a great way to accomplish that.
There’s a whole separate story around running Docker in production that I honestly don’t know enough about to comment on.
This is just a quick glance at some of the tools I use the most. There are a lot more tools out there, many of which I haven’t even heard about yet. If you have any questions at all about the post or want to share your favorite tools, feel free to reach out to me on Twitter at @mmalisper.
[1] I’ve experienced first-hand just how distracting rolling your own backups can be. You need a job that periodically pushes a backup, and monitoring in case that job fails or dies. You also need to make sure wherever your backups are being uploaded is always available; otherwise your machine will run out of disk space. It’s just a huge pain.