Suppose I have a simple table called
events like this:
|apr 5 01:05:00
|apr 5 01:04:15
|apr 5 01:06:10
|apr 5 01:04:30
|apr 5 01:03:00
|apr 5 01:03:02
|apr 5 01:03:02
|apr 5 01:01:00
Given this table, for any given interval, such as
last 10 hours and so on, I’d like to extract the following data:
Simple enough, right?
SUM(event_count) as total_action_count
GROUP BY event_id
ORDER BY total_action_count DESC
However, I’m finding that this doesn’t scale, at least not well enough for querying it every second, for multiple intervals, when dealing with millions of rows per day, and tens of thousands of unique
With a setup pretty similar to this one, using timescaledb with Postgres, a query on 2.5m events and 30k unique
event_id over a 24 hour period is taking anywhere from 2 to 5 seconds. Considering I want to run this query every second or so for 10 different intervals, this approach simply doesn’t work.
From my perspective, my bottleneck is the fact that I can’t get around having to read every single row to SUM
event_id. In this example I could in theory have one row for each
action_count instead, but for my real use case I can’t because
action_count can be a number anywhere from 1 to 10 billion.
So, my question is, is there any way I could model this in such a way that I can run these queries much faster?
Keep in mind, it’s no use for me to keep track of events on a hourly/daily timeframe, since I’m always querying on last N minutes/hours instead of hour N or day N.
With timescaledb I could bucket my data into minute/hour intervals, but since I have tens of thousands of unique
event_id I end up running into the same bottleneck of having to read and sum millions of rows at a time.
Another caveat is that rows are usually, but not always, coming in the correct order. So I may insert and event for April 5 followed by an event for March 30 and then April 5 again.
I can think of a few solutions, but they all seem extremely contrived and error prone. I could also use a memory database, but it seems to me that there’s got to be a better solution to all of this, right?
So, I’m asking you, how would you approach this?