An analysis application I'm working on allows the user to perform complex aggregations, which the application converts to SQL queries before returning the result to the user.
Example, from a list of transactions stored in the database, with the following properties:
transaction ID Customer number product_id amount spent date
We consider the following scenario which leads to a relatively complex aggregation of the final results:
- for each customer, get the median time in days between two purchases of the same product
- do that for each product
- aggregate average median time for all customers for each product
- Let's call this final result "typical time-out" for each product or TIPD.
This is good if the user is satisfied with this TIPD and continues, but if they want to use it for later analysis (for example, what is the average customer spend for products with TIPD between 10 and 20), so we want to consider storing these TIPD values somewhere to avoid recalculating each time.
Some of the constraints are:
- the stored "TIPD" values must be accessible from the database because their only point is to serve in subsequent aggregations
- the database is the only computer engine used (the tool has no engine but is based only on SQL queries)
- the database can (and will) receive additional lines over time, and we need to find a way to keep aggregations up-to-date or let the user update them on demand
If you know the BI tools, you can consider this use case as a typical BI application, used in direct connection with a database, but can also keep the results of complex aggregations for reuse. in subsequent analysis tasks.
What would be the design ideas for such a scenario?
A naive approach would simply consist in creating a new table for each of these aggregations, with (in the TIPD example) the products as a key and the TIPD as values - this raises the question of how much money can be saved. indexing and joining, as currently reads the tool. either from materialized views or from unnormalized tables (for indexing reasons).
Another very naive approach would be to create a new column with a specific name in the view / denormalization table and store the TIPD there – this would obviously involve a huge amount of duplication, but nothing more than duplication already caused by denormalization. the attributes of the products, for example.