Postgres JSONB

JSONB is a nifty Postgres type that lets you store unstructured data in Postgres. A common use case is to represent a mapping from a set of keys to arbitrary values. JSONB works well for this because the set of keys can be completely different for each row. It can also express hierarchical data, since a value may itself be a nested object or array.

As an example of where JSONB is incredibly useful, the company I work for, Heap, makes heavy use of it. At Heap, we use JSONB to store events that happen on our customers’ websites. These events include pageviews, clicks, and custom events created by our customers. Each kind of event has completely different properties, which makes JSONB a great fit for our use case. More concretely, we have an events table with a fairly simple schema:

    user_id bigint,
    time bigint, 
    data jsonb

With JSONB, this simple schema is able to take care of most of our use cases. For example, a click event on the “login” button may look something like the following:

SELECT 0 AS user_id,
       1498800692837 AS time,
       '{"type": "click",
         "target_text": "login",
         "page": "/login"}'::jsonb AS data;

And a pageview on the homepage may look like:

SELECT 1 AS user_id,
       1498800692837 AS time,
       '{"type": "pageview",
         "page": "/home",
         "referrer": ""}'::jsonb AS data;
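To try the examples in this post against a real table, the two rows above could be loaded as follows. This is only a minimal sketch: the column names and types follow the schema listed earlier, but the exact DDL is an assumption and not Heap's production schema.

```sql
-- Assumed DDL: column names and types follow the schema listed above.
CREATE TABLE events (
    user_id bigint,
    time    bigint,  -- looks like epoch milliseconds, judging by the sample values
    data    jsonb
);

-- The two sample events from above; the string literals are coerced to jsonb.
INSERT INTO events (user_id, time, data) VALUES
    (0, 1498800692837, '{"type": "click", "target_text": "login", "page": "/login"}'),
    (1, 1498800692837, '{"type": "pageview", "page": "/home", "referrer": ""}');
```

Note that the two rows share no schema beyond the three columns; the differing keys live entirely inside data.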

JSONB lets us easily express all of these different kinds of events. Then, when we want to query the data, it’s fairly easy to get values out of the data column. For example, to see which pages are viewed most frequently, we can run a query such as:

SELECT (data ->> 'page'), count(*)
FROM events
WHERE (data ->> 'type') = 'pageview'
GROUP BY (data ->> 'page');

We use this same general idea to power all of the analysis Heap is able to perform. This includes funnels (of people that did A, how many later did B) as well as retention queries (of people that did A, how many people did B within N weeks).
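As a rough illustration of the funnel idea, here is a sketch against the events table. The two steps (a pageview of /home followed by a click on “login”) are hypothetical choices made for this example, and this is not how Heap's actual funnel queries are written.

```sql
-- Hypothetical funnel: step A is a pageview of /home, step B is a later click on "login".
WITH step_a AS (
    -- First time each user did A.
    SELECT e.user_id, min(e.time) AS first_a
    FROM events e
    WHERE e.data ->> 'type' = 'pageview'
      AND e.data ->> 'page' = '/home'
    GROUP BY e.user_id
),
step_b AS (
    -- Users from step A who did B at some point after their first A.
    SELECT DISTINCT a.user_id
    FROM step_a a
    JOIN events b ON b.user_id = a.user_id
    WHERE b.time > a.first_a
      AND b.data ->> 'type' = 'click'
      AND b.data ->> 'target_text' = 'login'
)
SELECT (SELECT count(*) FROM step_a) AS did_a,
       (SELECT count(*) FROM step_b) AS did_a_then_b;
```

A retention query would follow the same shape, with an extra upper bound on b.time relative to a.first_a.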

Of course, JSONB isn’t free. Because we use it so heavily, we’ve run into a decent number of issues. One problem is that the keys need to be repeated in every event, which winds up wasting a lot of space. I did an experiment where I pulled out most of the fields we store in JSONB and found that we could save ~30% of our disk usage by not using JSONB!
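A small-scale version of that kind of experiment might look like the sketch below. It assumes that “pulling out” a field means moving it into its own column, and the table name events_extracted and the choice of the type and page keys are purely illustrative; the numbers you get will depend entirely on your data.

```sql
-- Copy the events, moving two frequently repeated keys into real columns
-- and removing them from the JSONB value.
CREATE TABLE events_extracted AS
SELECT e.user_id,
       e.time,
       e.data ->> 'type' AS event_type,
       e.data ->> 'page' AS page,
       e.data - 'type'::text - 'page'::text AS data
FROM events e;

-- Compare on-disk size (including TOAST and indexes) of the two layouts.
SELECT pg_size_pretty(pg_total_relation_size('events'))           AS jsonb_only,
       pg_size_pretty(pg_total_relation_size('events_extracted')) AS extracted;
```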

Another, much worse problem is the lack of statistics. Normally, Postgres collects statistics about each column of a table, including a histogram of the column’s values and an estimate of the number of distinct values. At query time, Postgres uses these statistics to decide which query plan to use. Currently, Postgres has no way of collecting statistics about the individual keys and values inside a JSONB column. In certain cases, this leads Postgres to choose some very bad query plans. My manager goes into both of these issues in more depth in a blog post he wrote on our company blog.
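One way to see this for yourself is to look at what the planner actually knows about the events table and then compare its row estimate with reality. This is a sketch only; the predicate on the type key is just an example, and how far off the estimate is will vary with your data.

```sql
-- Refresh planner statistics for the table.
ANALYZE events;

-- Per-column statistics. There is one row per column, so the data column is
-- described as a whole; nothing here is broken down by JSONB key.
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'events';

-- Compare the planner's estimated row count with the actual row count
-- for a filter on a key inside the JSONB column.
EXPLAIN ANALYZE
SELECT count(*)
FROM events
WHERE data ->> 'type' = 'pageview';
```

In the EXPLAIN ANALYZE output, a large gap between the estimated and actual rows on the scan node is the symptom described above.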

Depending on your exact needs, JSONB can be a godsend. It makes it easy to store whatever data you want in Postgres without worrying about an overarching common format for all of your data.
