There is More to Programming Than Programming Languages

I hear the question all the time: “What programming languages should I be learning?” I’ve asked it many times myself. When you are first getting started, it seems like becoming a good programmer is all about the number of programming languages you know. Several years later, I realize this is not the case. If you really want to become a better programmer, instead of focusing on learning more programming languages, you should focus on the other aspects of programming.

What it Means to Learn a Programming Language

There are many different pieces to a programming language you have to learn before you really know the language. For the purposes of this post, I will separate them out into two different categories:

Language Syntax: Every language has its own way of writing if statements, for loops, function calls, etc. You obviously need to be familiar with these constructs before you can be proficient in a language. Learning the syntax largely means ingraining it into muscle memory, but it is still a necessary step in learning a programming language.

Language Concepts: Beyond the basic syntax, every programming language has its own features and concepts that set it apart from other languages. Python has constructs such as decorators and context managers. Lisp-based languages provide code generation through macros as a first-class feature. As you learn a programming language, you become aware of these features, what each of them does, and how to use them. The cool thing about learning more concepts is that they teach you new ways to approach problems, but more on that later.

The Benefits of Learning More Programming Languages

As I see it, there are two main advantages to learning more programming languages. First, you can quickly ramp up on any project that uses a language you already know. If you know Python, you can start contributing to a Python codebase almost immediately. While this is an advantage, it is only a slight one. It usually takes a programmer just a few weeks to become proficient in a new language. If this is your reason for wanting to learn many different languages, it makes more sense to wait until a project you want to work on actually forces you to learn one, since it won’t take much time to ramp up anyway.

The other, more interesting advantage of learning more programming languages is that each new language introduces you to new ways of approaching problems. As you learn a language, you learn the specific features it provides and how to use them. In a way, learning a new language “stretches your mind” and lets you think about problems in new ways. For example, if you learn a Lisp-derived language, you will learn how to use code generation to approach problems. Being familiar with code generation then allows you to recognize when it is the best approach to a problem. As you learn more languages, you learn more approaches to solving problems, which lets you choose the best approach from a larger repertoire of options.

What Really Matters

Even though learning more programming languages does give you more ways to approach problems, ultimately the choice of approach for solving a particular problem doesn’t matter much. What is vastly more important than the number of approaches to problems you know is the space of problems you know how to solve. A good programmer isn’t valuable because they can solve the same set of problems every other programmer can solve, just in different ways. A good programmer is valuable because they can solve problems that other programmers cannot.

Most programming languages and styles are designed with a similar purpose in mind: making it easy to express algorithms and procedures in a way a computer can execute. While some programming styles are better than others at expressing certain procedures, they only make solving the particular problem at hand slightly easier. For the most part, any problem that can be solved in an object-oriented style can also be solved in a functional style and vice versa. Knowing just one programming language and one programming style enables you to solve the vast majority of problems solvable in any language or style. In the larger picture, the particular choice of programming language and approach is mostly an implementation detail.

If you really want to become a better programmer and broaden the space of problems you can solve, instead of focusing on programming languages, you should focus on the parts of programming that actually enable you to solve more problems. Here are a few examples of areas outside of programming languages where knowledge broadens the space of problems you can solve:

  • Operating Systems
  • Web Development
  • Distributed Systems
  • Networking
  • Algorithms
  • Security

As an example, you should definitely learn how to set up and use a database. Nearly every non-trivial program uses some kind of database to keep track of application data. Why? Because databases solve problems around handling data that are extremely difficult to solve in pretty much any programming language. For a few examples, many databases:

  • Can work with more data than you can work with in your typical programming language.
  • Guarantee data won’t be lost, even if the power goes out.
  • Make it impossible for your data to get into an inconsistent state.
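
As a small, hypothetical sketch of that last point (the accounts table and column names here are made up for illustration, not from the original post), a couple of constraints plus a transaction is all it takes to have the database enforce consistency for you:

-- Hypothetical accounts table: the database itself guarantees that
-- balances never go negative and that every account has an owner.
CREATE TABLE accounts (
    id      bigint PRIMARY KEY,
    owner   text   NOT NULL,
    balance bigint NOT NULL CHECK (balance >= 0)
);

-- Move money between two accounts. Either both updates are applied
-- or neither is, even if the power goes out halfway through.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;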

If you are familiar with how to work with a database, you can solve all of these problems with little to no effort by just setting up a database. No matter what programming language or programming paradigm you are using, you would much rather have the database handle these problems for you. Ultimately, learning how to use a database enables you to solve way more problems than learning another programming language does.

This applies just as much to the other areas listed. Each one of them enables you to solve more problems than you would be able to otherwise. How are you supposed to build a website without knowing about web development? How are you supposed to write an application that can handle a machine crashing without knowledge of distributed systems? For each of these areas you learn, you become able to solve problems you couldn’t solve before.

This is completely unlike learning more programming languages or paradigms, which for the most part are interchangeable. There just isn’t enough difference between most programming languages and styles for a new one to let you solve many more problems than you could before. At this point, which would you rather learn: how to solve a problem in more ways, or how to solve more problems?

How to Get a Programming Job Straight Out of High School

While I was in high school, I looked at the options ahead of me. The obvious one was to go on to college, but I was also considering another: trying to get a job as a programmer and skipping college altogether. Given the number of people who have skipped college and made their way directly into the tech industry, I’m surprised at how little has been written about what it’s like to go through the process. In this post I explain, for people who are thinking about getting a job straight out of high school, what the process is like and what obstacles you’ll run into along the way.

The first thing I will say is that it’s definitely possible to get a programming job without a college degree. I did it myself, and wound up with offers from numerous companies including Google, Uber, and Dropbox, as well as several lesser-known, smaller companies. Second, if you do decide to look for a job out of high school, I highly recommend applying to college and deferring for a year. Deferring guarantees you the option to go to school in case your job search doesn’t pan out.

As for actually getting a programming job straight out of high school, there are two main hurdles. First, you have to get companies to interview you; this stage will be especially tough for you. Second, once a company agrees to interview you, you go through its interview process. For most companies, the process consists of several phone calls and an in-person interview at the end. In each round you will be asked to solve one or more programming-related problems. In most interviews, you will be asked to describe a solution; in many, you will also be asked to write code for a working solution. Once you make it through a company’s interview process, you will find yourself with a job offer.

It’s important to emphasize that, for the most part, these two stages are completely separate. Once you get your foot in the door, how good your application looks no longer matters. You could have the best application in the world, but if you fail a company’s interview, they will still reject you.

Hurdle #1: Getting Companies to Interview You

There are many different approaches you can take to get companies to interview you, and there is a lot of surprisingly bad advice out there. Here’s a list of several approaches you can try and whether or not they work, informed by my personal experience. Let’s start with the approaches I found not to work.

What Does Not Work

Applying to a Company Directly

If you submit an application directly to a company, in most cases your resume will be sent to a recruiter. The recruiter will look at your resume and decide whether it’s worth the company’s time to interview you. When a recruiter looks at your application, they will usually first look for one of two things: a college degree from a well-known school or prior experience at a well-known tech company. Since you have neither of those things, you are almost guaranteed to be rejected before you even start interviewing. You may hope that some company will take a risk and interview you even though you are straight out of high school, but as far as I know, no company actually does this.

When I tried applying directly to several companies, I did not get into the interview process for any of them. That was in spite of the fact that I had a Knuth check, several interesting side projects, and had won the Illinois statewide WYSE competition in Computer Science.

Contributing to Open Source

For some reason a lot of advice suggests that if you want to get a programming job, you should contribute to some open source software projects. As I see it, there are two things you can hope to gain out of working on open source:

  1. It looks good on your resume and you think it will help you pass the application process.
  2. You hope companies will see your open source contributions and reach out to you directly.

Regarding 1, as mentioned above, most recruiters first look for either a degree from a well-known school or prior experience at a well-known company. Without either of those, any open source contributions you have won’t matter. Most companies won’t even look at what open source projects you’ve worked on.

As for 2, companies generally only reach out to open source contributors who are core developers of a well-known project. While it is possible to get interviews this way, the amount of effort you would have to put in is way more than it’s worth. The approaches I recommend below will get you interviews with a lot less effort.


As for approaches that actually work, here are two I’ve found. Both try to bypass the traditional resume screen and get you straight into companies’ interview processes.

What Does Work

Applying Through a Company that does Recruiting

There are a number of companies that will take care of the application process for you. Most of the job offers I got out of high school came through the Recurse Center. The Recurse Center is a three-month program in New York where programmers work together on whatever projects they want. Although helping attendees find jobs isn’t the purpose of the Recurse Center, they do help every attendee looking for a job find one. After my time at the Recurse Center ended, I started looking for a job. The Recurse Center reached out to companies on my behalf and got me straight into some companies’ interview pipelines. Even so, only about a third of the companies decided to interview me, but that was enough to get several job offers.

Triplebyte is another option here. Triplebyte is a recruiting company with its own interview process. If you pass it, Triplebyte sends you straight to the final round of interviews at other companies. If you are good at technical interviews (described below), you should be able to pass Triplebyte’s process, and they will make sure you get interviews at other companies.

Networking

Another approach I’ve found successful is networking. If you have a friend at a company you want to work for and can get them to refer you, they should be able to get you past the resume screen. Unfortunately, since you are straight out of high school, you probably don’t have much of a network, so this option probably isn’t viable for you. It is probably not worth building out a network just to get a job. In that case, try the approach above and apply through a company that handles the application process for you; it will take a lot less effort on your part.

Hurdle #2: Passing the Interview

Once you make it into a company’s interview process, all you need to do to get a job offer is do well in the interviews. Programming interviews today are very similar across most companies. Usually a company will ask you multiple algorithmic problems, and you need to be able to solve them to succeed. You may also be asked to write code that implements your solution.

An example of a question you may run into is: “Write a function that takes as input a number N and returns the number of different ways to make change for N cents.”

If you aren’t familiar with algorithm problems and how to solve them, there are tons of resources available. The book Cracking the Coding Interview walks you through lots of different interview problems and how to solve them. If you want a comprehensive resource that covers everything you need to know, I really like the book Introduction to Algorithms. At over 1300 pages it is a really long book, but it does cover everything you will need to solve algorithm-related interview problems.

If you want to practice solving interview problems, the website LeetCode has many different algorithmic coding problems. The easy problems in the array section are about the difficulty you should expect in a programming interview. For any of those problems, you should be able to implement a program that solves the problem and be able to explain its runtime (in terms of big O notation) in <45 minutes.

In general, you should be familiar with the following:

  • What big O notation is and how to apply it. Given a piece of code or algorithm you should be able to easily determine its run time complexity and explain why it has that complexity.
  • All of the basic data structures (arrays, linked lists, hash tables, heaps, binary trees). For each data structure you should have all of the operations and the runtime of each operation memorized.
  • Basic algorithms (breadth first search, depth first search, quicksort, mergesort, binary search) and their runtimes.
  • Dynamic Programming. Dynamic programming is an algorithmic technique for solving various kinds of problems. I’m not sure why, but tons of companies ask problems that can be solved with dynamic programming.

There are a few companies that have started to move away from algorithm related interview problems, but the vast majority still ask them. You should mainly focus on learning how to solve algorithm problems and that should give you enough to pass the interview process at many different companies.


That’s really all there is to getting a job straight out of high school. It boils down to two things: getting your foot in the door and getting really good at algorithm problems. If you can do both, you will start getting job offers fairly quickly. It isn’t impossible to get a programming job straight out of high school; you just need to work for it.

How to Improve Your Productivity as a Working Programmer

For the past few weeks, I’ve been obsessed with improving my productivity. During this time, I’ve continuously monitored how much work I get done and experimented with changes to make myself more productive. After only a few weeks, I can now get significantly more work done than I did previously in the same amount of time.

If you had asked me my opinion on programmer productivity before I started this process, I wouldn’t have had much to say. After looking back and seeing how much more I can get done, I now think that understanding how to be more productive is one of the most important skills a programmer can have. Here are a few changes I’ve made in the past few weeks that have had a noticeable impact on my productivity:

Eliminating Distractions

One of the first and easiest changes I made was eliminating as many distractions as possible. Previously, I would spend a nontrivial portion of my day reading through Slack/email/Hacker News. Nearly all of that time could have been spent much more effectively by simply focusing on my work.

To eliminate as many distractions as possible, I first broke my habit of pulling out my phone whenever I got a marginal amount of work done. Now, as soon as I take my phone out of my pocket, I immediately put it back in. To make Slack less of a distraction, I left every Slack room that I did not derive immediate value from. Currently I’m only in a few rooms that are directly relevant to my team and the work I do. In addition, I only allow myself to check Slack at specific times throughout the day: before meetings, before lunch, and at the end of the day. I specifically do not check Slack when I first get into the office and instead start working immediately.

Getting into the Habit of Getting into Flow

Flow is the state of mind where all of your attention is focused solely on the task at hand, sometimes referred to as “the zone”. I’ve worked on setting up my environment to maximize the amount of time I’m in flow. I moved my desk to the quiet side of the office and try to set up long periods of time where I won’t be interrupted. When I want to get into flow, I’ll put on earmuffs, close all of my open tabs, and focus all of my energy on the task in front of me.

Scheduling My Day Around When I’m Most Productive

When I schedule my day, I now have two goals in mind. The first is to arrange all of my meetings back to back, to maximize the amount of time I can spend in flow. The worst possible schedule I’ve encountered is having several meetings, each 30 minutes apart. Thirty minutes isn’t enough time to get any significant work done before being interrupted by the next meeting. By aligning my meetings right next to each other instead, I go straight from one to the next, which leaves me with fewer, larger blocks of time where I can get into flow and stay in flow.

The second goal is to arrange my schedule so that I am working at the times of day when I am most productive. I usually find myself most productive in the mornings. By the time 4pm rolls around, I am typically exhausted and barely have enough energy to get any work done at all. To reduce the effect this has on my productivity, I now schedule meetings specifically at the times of day when I’m least productive. It doesn’t take a ton of energy to sit through a meeting, and scheduling my day this way lets me work when I’m most productive. Think of it this way: if I can move a single 30-minute meeting from the time when I’m most productive to the time when I’m least productive, I’ve just added 30 minutes of productive time to my day.

Watching Myself Code

One incredibly useful exercise I’ve found is to watch myself program. Throughout the week, I have a program running in the background that records my screen. At the end of the week, I’ll watch a few segments from the previous week, usually the ones where it felt like some task took a lot longer to complete than it should have. While watching them, I pay attention to exactly where the time went and figure out what I could have done better. When I first did this, I was really surprised at where all of my time was going.

For example, I previously would write all of the code for a new feature up front and then test it all together. When testing code this way, I would have to isolate which function a bug was in and then debug that individual function. After watching a recording of myself writing code, I realized I was spending about a quarter of the total time implementing the feature just tracking down which functions the bugs were in! This was completely non-obvious to me, and I wouldn’t have found it out without recording myself. Now that I know how much time I was spending isolating bugs, I test each function as I write it to make sure it works. This lets me write code a lot faster since it dramatically reduces the amount of time spent debugging.

Tracking My Progress and Implementing Changes

At the end of every day, I spend 15 minutes thinking about my day. I think about what went right, as well as what went wrong and how I could have done better. At the end of the 15 minutes, I’ll write up my thoughts. Every Saturday, I’ll reread what I wrote for the week and implement changes based on any patterns I noticed.

As an example of a simple change that came out of this, previously on weekends I would spend an hour or two every morning on my phone before getting out of bed. That was time that would have been better used doing pretty much anything else. To eliminate that problem, I put my phone far away from my bed at night. Then when I wake up, I force myself to get straight into the shower without checking my phone. This makes it extremely difficult for me to waste my morning in bed on my phone, saving me several hours every week.

Being Patient

I didn’t make all of these changes at once; I only introduced one or two at a time. If I had tried to implement them all at once, I would have quickly burned out and given up. Instead, I was able to make a lot more changes by introducing each one gradually. It only takes one or two changes a week for things to quickly snowball. After only a few weeks, I’m significantly more productive than I was before. Making any progress at all is a lot better than making no change. I think Stanford professor John Ousterhout’s quote describes this aptly: “a little bit of slope makes up for a lot of y-intercept”.

Time for An Intermission

I’ve been writing a lot these past two months. I decided I’m going to take a break for a little bit. I plan on starting to write continuously again within the next 2-4 weeks. That is all.

Postgres JSONB

JSONB is a nifty Postgres type that allows you to store unstructured data inside of Postgres. A common use case of JSONB is to represent a mapping from a set of keys to arbitrary values. JSONB is nice for this because the set of keys can be completely different for each value. It is also possible to express hierarchical data through JSONB.

As an example of where JSONB is incredibly useful, the company I work for, Heap, makes heavy use of JSONB. At Heap, we use JSONB to store events that happen on our customers’ websites. These events include pageviews, clicks, as well as custom events created by our customers. All of these different kinds of events have completely different properties. This makes JSONB a great tool for our use case. More concretely, we have an events table with a fairly simple schema:

CREATE TABLE events (
    user_id bigint,
    time bigint, 
    data jsonb
);

With JSONB, this simple schema is able to take care of most of our use cases. For example, a click event on the “login” button may look something like the following:

INSERT INTO events
SELECT 0 AS user_id,
       1498800692837 AS time,
       '{"type": "click",
         "target_text": "login",
         "page": "/login"}'::jsonb AS data;

And a pageview on the homepage may look like:

INSERT INTO events
SELECT 1 AS user_id,
       1498800692837 AS time,
       '{"type": "pageview",
         "page": "/home",
         "referrer": "www.google.com"}'::jsonb AS data;

JSONB lets us easily express all of these different kinds of events, and when we want to query the data, it’s fairly easy to pull fields out of the data column. For example, to see which pages are viewed most frequently, we can run a query such as:

SELECT (data ->> 'page'), count(*)
FROM events
WHERE (data ->> 'type') = 'pageview'
GROUP BY (data ->> 'page');

We use this same general idea to power all of the analysis Heap is able to perform. This includes funnels (of people that did A, how many later did B) as well as retention queries (of people that did A, how many people did B within N weeks).
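
As a rough sketch of the funnel idea (an illustration against the events schema above, not Heap’s actual production query), counting the users who viewed the homepage and later clicked the “login” button might look something like:

-- Illustrative two-step funnel: of the users who viewed /home,
-- how many later clicked the "login" button?
SELECT count(DISTINCT a.user_id)
FROM events a
JOIN events b
  ON b.user_id = a.user_id
 AND b.time > a.time
WHERE (a.data ->> 'type') = 'pageview'
  AND (a.data ->> 'page') = '/home'
  AND (b.data ->> 'type') = 'click'
  AND (b.data ->> 'target_text') = 'login';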

Of course, JSONB isn’t free. Due to our heavy use of it, we’ve run into a decent number of issues. One problem is that the keys need to be repeated in every event, which winds up wasting a lot of space. I ran an experiment where I pulled most of the fields out of JSONB and found that we could save ~30% of our disk usage by not using JSONB!

A much worse problem is the lack of statistics. Normally, Postgres collects statistics about the different columns of a table, including a histogram of each column and an estimate of the number of distinct elements in it. At query time, Postgres uses these statistics to decide which query plan to use. Currently, Postgres has no way of collecting statistics over JSONB, which in certain cases leads it to choose some very bad query plans. My manager goes into both of these issues in more depth in a post he wrote on our company blog.
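
One partial mitigation (my own suggestion, not something from the post above, and it only helps for the specific expressions you index) is to create an expression index on the JSONB fields you filter on most often, since Postgres does gather statistics for the values of an expression index:

-- Hypothetical index name; the expression matches the queries above.
-- ANALYZE collects statistics for the indexed expression, which gives
-- the planner better estimates for filters on (data ->> 'type').
CREATE INDEX events_type_idx ON events ((data ->> 'type'));
ANALYZE events;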

Depending on your exact needs, JSONB can be a godsend. It makes it easy to store whatever data you want in Postgres without worrying about an overarching common format for all of your data.

track_io_timing

track_io_timing is a relatively unknown but super helpful parameter when optimizing queries. As the name suggests, when it is turned on, Postgres tracks how long I/O takes. Then, when you run a query with EXPLAIN (ANALYZE, BUFFERS), Postgres displays how much time was spent just performing I/O.

You normally don’t want track_io_timing on all the time, since it incurs a significant amount of overhead. To get around this, when you want to measure how long a query spends performing I/O, you can use a transaction with SET LOCAL track_io_timing = on;, which enables the setting only for that transaction. As a specific example, here’s a simple query over a table I have lying around:

> BEGIN; SET track_io_timing = ON; EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pets; COMMIT;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Seq Scan on pets  (cost=0.00..607.08 rows=40008 width=330) (actual time=8.318..38.126 rows=40009 loops=1)
   Buffers: shared read=207
   I/O Timings: read=30.927
 Planning time: 161.577 ms
 Execution time: 42.104 ms

The I/O Timings field shows us that of the 42ms spent executing the query, ~31ms was spent performing I/O. Now if we perform the query again when the data is cached:

> BEGIN; SET track_io_timing = ON; EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pets; COMMIT;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on pets  (cost=0.00..607.08 rows=40008 width=330) (actual time=0.004..7.504 rows=40009 loops=1)
   Buffers: shared hit=207
 Planning time: 0.367 ms
 Execution time: 11.478 ms

We can see the query is just about 31ms faster! This time the query does not show any information about the I/O timing since no time was spent performing I/O, due to the data being cached.

When benchmarking queries, I make sure to use track_io_timing so I can see whether the expensive part of the query is the I/O or something else entirely.

Avoiding Torn Pages

Postgres reads and writes data to disk 8kb at a time, while most operating systems use a smaller page size, such as 4kb. If Postgres is running on one of these operating systems, an interesting edge case can occur. Since Postgres writes in units of 8kb and the OS writes in units of 4kb, if the power goes out at just the right time, it is possible that only 4kb of an 8kb write Postgres was performing actually made it to disk. This edge case is sometimes referred to as “torn pages”. Postgres does have a way of working around torn pages, but it increases the amount of I/O Postgres needs to perform.

Under normal circumstances, Postgres prevents data loss with a technique called write-ahead logging (WAL). At a high level, WAL works by writing a log on disk of all changes a transaction makes to the database before the changes themselves are persisted to disk. Since appending to a single continuous log is much cheaper than performing random writes, WAL reduces the amount of I/O Postgres needs to perform without risking data loss. If Postgres crashes, it can recover all of the changes that weren’t persisted to disk by replaying the WAL.

Although keeping track of all of the changes made does allow Postgres to recover from common crashes, in which every write either completed or didn’t happen at all, it does not let Postgres recover from a torn page. Due to specifics of the WAL implementation, in the case of a torn page, the logged changes alone do not give Postgres enough information to determine what should be applied to each half of the page.

To recover from torn pages, Postgres does something called “full-page writes”. Whenever Postgres changes a page, it writes a full copy of the page to the WAL. That way, when using the WAL to recover from a crash, Postgres does not need to pay attention to the contents of the page stored on disk; it can recover the entire state of the page from the WAL alone, sidestepping the problem of torn pages entirely! To avoid constantly writing full copies of every page, Postgres checks whether a full copy of the page was recently written to the WAL, and if so, skips writing another copy, since it can still recover the complete page from the WAL.

There is actually a parameter, full_page_writes, that lets you disable this behavior. If you care about preventing data corruption, there are very few cases in which you should do so. The only real case is if the OS or filesystem has built-in protection against torn pages. For example, the ZFS filesystem provides its own transactional guarantees and prevents torn pages, largely due to its copy-on-write design.
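
If you want to check or change the setting on your own server, it works like any other configuration parameter. A minimal sketch (on PostgreSQL 9.4+ ALTER SYSTEM can be used instead of editing postgresql.conf, and the change takes effect after a configuration reload):

-- See whether full-page writes are currently enabled.
SHOW full_page_writes;

-- Only disable this if the OS/filesystem already prevents torn pages
-- (e.g. ZFS, as discussed above).
ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();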

Postgres TOAST

TOAST, aka The Oversized Attribute Storage Technique, aka the best thing since sliced bread, is the technique Postgres uses to store large values. Under the normal Postgres data layout, every row is stored on an 8kb page, with no row spanning multiple pages. To support rows larger than 8kb, Postgres makes use of TOAST.

When Postgres is about to store a row that is over 2kb, TOAST kicks in. Postgres first attempts to shrink the row by compressing its large variable-width fields. If the row is still over 2kb after compression, Postgres then repeatedly moves the largest fields out of the row until the row is under 2kb. To do so, Postgres splits the compressed value into individual chunks of ~2kb. Each of those chunks is then stored in a “TOAST table”. Every regular Postgres table has a corresponding TOAST table in which these values are stored. A value stored in this manner is commonly referred to as “TOASTed”.

Every TOAST table has three columns. The first, chunk_id, distinguishes which TOASTed value each chunk belongs to; all chunks of the same TOASTed value have the same chunk_id. The chunk_id is what is stored in the original row, and it is what allows Postgres to find the chunks for a given value. The second column, chunk_seq, determines the ordering of the chunks with the same chunk_id: the first chunk of a TOASTed value has chunk_seq=0, the second has chunk_seq=1, and so on. The last column, chunk_data, contains the actual data for the TOASTed value.

At query time, when a TOASTed value is needed, Postgres uses an index on the TOAST table over (chunk_id, chunk_seq) to look up all of the chunks with a given chunk_id, sorted by chunk_seq. From there, it can stitch the chunks back together and decompress the result to obtain the original value.

Under certain circumstances, TOAST can actually make queries faster. If a TOASTed field isn’t needed to answer a query, Postgres doesn’t have to read the chunks for the TOASTed value, and can skip reading the value into memory. In some cases, this will dramatically reduce the amount of disk I/O Postgres needs to perform to answer a query.

You can actually access the TOAST table for a given table directly and inspect the values stored in it. As a demonstration, let’s create a table messages that has a single column message:

CREATE TABLE messages (message text);

We can then insert a few random strings to be TOASTed:

INSERT INTO messages
SELECT (SELECT 
        string_agg(chr(floor(random() * 26)::int + 65), '')
        FROM generate_series(1,10000)) 
FROM generate_series(1,10);

Now that we have a table with values we know are TOASTed, we need to find the name of the corresponding TOAST table, which we can do with the following query:

> SELECT reltoastrelid::regclass 
> FROM pg_class 
> WHERE relname = 'messages';
      reltoastrelid      
-------------------------
 pg_toast.pg_toast_59611
(1 row)

This query pulls information from pg_class, the catalog in which Postgres stores metadata about tables.

Now that we have the name of the TOAST table, we can read from it just like any other table:

> SELECT * FROM pg_toast.pg_toast_59611;
 chunk_id | chunk_seq | chunk_data
----------+-----------+------------
    59617 |         0 | \x4c4457...
    59617 |         1 | \x424d4b...
...

Note that chunk_data contains the binary of the compressed value, so it isn’t exactly human-readable.
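
If you’re curious how much of a table’s data has been pushed out into its TOAST table, you can compare the size of the main relation with the size of its TOAST relation. A small sketch using the messages table from above:

-- Compare the size of the main table with the size of its TOAST table.
SELECT pg_size_pretty(pg_relation_size('messages'))    AS main_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'messages';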

Overall, TOAST is a clever mechanism that reuses Postgres’s ordinary storage machinery to hold larger values. It’s completely transparent to the user, yet if you really want to, you can dig into your TOAST tables and see exactly how Postgres is storing your data.

The File Layout of Postgres Tables

Although Postgres may seem magical, it really isn’t. When data is stored in Postgres, Postgres in turn stores that data in regular files in the filesystem. In this blog post, we’ll take a look at how Postgres uses files to represent data stored in the database.

First of all, each table in Postgres is represented by one or more underlying files. Each 1GB chunk of the table is stored in a separate file. It is actually pretty easy to find the underlying files for a table. To do so, you first need to find the Postgres data directory, which is the directory in which Postgres keeps all of your data. You can find it by running SHOW DATA_DIRECTORY;. When I run that locally, I see the following:

> SHOW DATA_DIRECTORY;
        data_directory        
------------------------------
 /var/lib/postgresql/9.5/main
(1 row)

Now that you know where the data directory is, you need to find where the files for the specific table you are interested in are located. To do so, you can use the pg_relation_filepath function with the name of the table. The function returns the path of the table’s files relative to the data directory. Here is what I see when I run it on a table I have locally:

> SELECT pg_relation_filepath('people');
 pg_relation_filepath 
----------------------
 base/16387/51330
(1 row)

Together with the location of the data directory, this gives us the location of the files for the people table. All of the files are stored in /var/lib/postgresql/9.5/main/base/16387/. The first GB of the table is stored in a file called 51330, the second in a file called 51330.1, the third in 51330.2, and so on. You could read and write these files yourself, but I heavily suggest not doing so, as you will most likely wind up corrupting your database.

Now that we’ve found the actual files, let’s walk through how each file is laid out. Each file is broken up into 8kb chunks called “pages”. For example, a 1.5GB table will be stored across two files and 196,608 pages.

Each row is stored on a single page (except when a row is too large, in which case a technique called TOAST is used). Pages are the unit in which Postgres reads data from and writes data to the filesystem. Whenever Postgres needs a row from disk to answer a query, it reads the entire page the row is on. When Postgres writes to a row on a page, it writes out a whole new copy of the entire page at once. Postgres operates this way for numerous reasons, which are outside the scope of this post.

Pages themselves have the following format:

A page starts with a 24-byte header containing various metadata about the page, including a checksum and information needed for WAL. After the header come the row offsets: an array of pointers into the row data, with the Nth pointer pointing to the Nth row. The offsets make it possible to quickly look up an arbitrary row on a page. The rest of the page holds the free space and the rows themselves.

One thing you may not expect is that the rows are stored starting from the back of the page. That is so the offsets and the actual row data can both grow towards the middle: if a new row is inserted, we can allocate a new offset from the front of the free space and allocate the space for the row from the back of the free space.
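
If you want to see this layout for yourself, the pageinspect extension that ships with Postgres can decode a raw page. Here is a rough sketch against the people table from earlier (the column lists are abridged):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The page header: "lower" points just past the row offsets and
-- "upper" points at the start of the row data, so the gap between
-- them is the free space in the middle of the page.
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('people', 0));

-- One line per row offset: lp_off is where the row starts on the
-- page and lp_len is how long it is.
SELECT lp, lp_off, lp_len, t_xmin, t_xmax
FROM heap_page_items(get_raw_page('people', 0));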

As for the rows themselves, each has the following format:

The header of each row is 23 bytes and includes the transaction ids used for MVCC, as well as other metadata about the row. Based on the table schema, each field of the row is either a fixed-width type or a variable-width type. If the field is fixed width, Postgres already knows how long it is and stores the field data directly in the row.

If the field is variable width, there are two possibilities for how it is stored. Under normal circumstances, it is stored directly in the row with a header detailing how large the field is. In certain special cases, or when it’s impossible to store the field directly in the row, the field is stored outside of the row using a technique called TOAST, which we will take a look at in my next post.

To recap, each row is stored on an 8kb page along with several other rows. Each page in turn is part of a 1GB file. While processing a query, when Postgres needs to fetch a row from disk, Postgres will read the entire page the row is stored on. This is, at a high level, how Postgres represents data stored in it on disk.

How to Write a Postgres SQL UDF

Being able to write a Postgres UDF (user-defined function) is a simple skill that goes a long way. SQL UDFs let you give a name to part or all of a SQL query and then refer to the SQL by that name. They work just like user-defined functions in your favorite programming language.

As a simple example, in the last post we came up with a query for incrementing a counter in a table of counters:

INSERT INTO counters
SELECT <id> AS id, <amount> AS value
    ON CONFLICT (id) DO
    UPDATE SET value = counters.value + excluded.value;

When we used this query in multiple places, we had to copy and paste it each time. To avoid this, we can define a UDF that runs the query and then increment counters only through the UDF. Most of the time when you define a SQL UDF, you’ll use code like the following:

CREATE OR REPLACE FUNCTION <function name>(<arguments>)
RETURNS <return type> AS $$
  <queries to run>
$$ LANGUAGE SQL;

This will define a UDF with the given name that runs the queries in the body whenever it is called. Inside of the queries, you’ll be able to refer to any of the arguments passed to the function. If we convert the query we had for incrementing a counter into a UDF, we wind up with the following UDF definition:

CREATE OR REPLACE FUNCTION 
increment_counter(counter_id bigint, amount bigint)
-- Use void as the return type because this function 
-- returns no value.
RETURNS void AS $$
  INSERT INTO counters
  SELECT counter_id AS id, amount AS value
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + excluded.value;
$$ LANGUAGE SQL;

We can now use this UDF in place of the original query:

> SELECT * FROM counters;
 id | value 
----+-------
(0 rows)

> SELECT increment_counter(1, 10);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    10
(1 row)

> SELECT increment_counter(1, 5);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    15
(1 row)

> SELECT increment_counter(2, 5);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    15
  2 |     5
(2 rows)

> SELECT increment_counter(3, 20);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    15
  2 |     5
  3 |    20
(3 rows)

This is much better than what we had before.

One of the more interesting classes of UDFs are those that return a set of rows instead of a single result. To define such a UDF, you specify TABLE (<columns>) as the return type. For example, if we wanted a UDF that returns the top N counters, we could define one like so:

CREATE OR REPLACE FUNCTION top_counters(n bigint)
RETURNS TABLE (id bigint, value bigint) AS $$
  SELECT * FROM counters ORDER BY value DESC LIMIT n;
$$ LANGUAGE SQL;

Then we can use it like:

> SELECT * FROM top_counters(2);
 id | value 
----+-------
  3 |    20
  1 |    15
(2 rows)

You can then use the function as part of a larger SQL query. For example, if you wanted to find the sum of the values of the top 10 counters, you could do that with the following straightforward SQL query:

SELECT sum(value) FROM top_counters(10);

To recap, UDFs are a great way to simplify SQL queries. I find them to be especially useful when I am reusing the same subquery in a bunch of different places.