Please bear with me through the slightly long problem description.
I am a newbie in the Cassandra world, and I am trying to migrate my current product from an Oracle-based data layer to Cassandra.
In order to support range queries, I created an entity as below:
    CREATE TABLE IF NOT EXISTS my_system.my_system_log_dated (
        id uuid,
        client_request_id text,
        tenant_id text,
        vertical_id text,
        channel text,
        event text,
        event_type text,
        created_date date,
        PRIMARY KEY ((created_date, tenant_id, vertical_id, channel, event), event_type, client_request_id, id)
    ) WITH CLUSTERING ORDER BY (created_date DESC);
Now, I came across several docs / resources / blogs that mention I should keep the size of each partition below 100 MB for an optimally performing cluster. With the amount of traffic my system handles per day, for some partition key combinations there is no way to stay below 100 MB with the partition key above.
To solve this problem, I introduced a new column called bucket_id and thought of giving it the hour-of-day value, to split partitions into even smaller pieces and keep them below 100 MB (even if that means I have to do 24 reads to fetch one day's traffic details, but I'm fine with some inefficient reads). Here is the schema with the bucket_id:
    CREATE TABLE IF NOT EXISTS my_system.my_system_log_dated (
        id uuid,
        client_request_id text,
        tenant_id text,
        vertical_id text,
        channel text,
        event text,
        bucket_id int,
        event_type text,
        created_date date,
        PRIMARY KEY ((created_date, tenant_id, vertical_id, channel, event, bucket_id), event_type, client_request_id, id)
    ) WITH CLUSTERING ORDER BY (created_date DESC);
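To be concrete about the "24 reads per day" pattern I mean: one SELECT per hourly bucket, fanned out over bucket_id 0-23. A minimal Python sketch (the helper and the literal values are illustrative, not my production code; column and table names match the schema above):

```python
def day_queries(created_date, tenant_id, vertical_id, channel, event):
    """Build one SELECT per hourly bucket_id (0-23) covering a single day."""
    template = (
        "SELECT * FROM my_system.my_system_log_dated "
        "WHERE created_date = '{d}' AND tenant_id = '{t}' "
        "AND vertical_id = '{v}' AND channel = '{c}' "
        "AND event = '{e}' AND bucket_id = {b}"
    )
    return [
        template.format(d=created_date, t=tenant_id, v=vertical_id,
                        c=channel, e=event, b=bucket)
        for bucket in range(24)
    ]

# 24 reads to cover one day's traffic for one key combination
queries = day_queries("2023-01-15", "tenant-1", "vertical-1", "web", "login")
print(len(queries))
```

In a real application these would be prepared statements executed through the driver rather than formatted strings, but the fan-out shape is the same.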
Even with that, a few partition key combinations go over 100 MB while all the other volumes stay comfortably within the limit.
Given this situation, I have the questions below:
- Is it an absolute no-no for a few of my partitions to exceed the 100 MB limit?
- With an even smaller bucket, e.g. a 15-minute window, all partition key combinations do stay below 100 MB, but this creates sharply skewed partitions: the high-volume combinations reach about 80 MB while the rest stay well below 15 MB. Is this something that will negatively impact the performance of my cluster?
- Is there a better way to solve this problem?
Here is some additional information that might be helpful, in my opinion:
- The average row size for this entity is about 200 bytes.
- I'm also factoring in future-proofing for the load, with a factor of 2, i.e. estimating for double the load.
- The maximum load for a specific combination of partition key is about 2.8 million records per day
- The same combination has about 1.4 million records in the peak traffic hour,
- and the same combination's peak 15-minute window contains about 550,000 records.
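For reference, here is the rough back-of-the-envelope arithmetic I'm doing with those numbers, assuming partition size is approximately rows × average row size, uncompressed, ignoring per-cell overhead (all of which are my simplifying assumptions), and folding in my factor-of-2 growth estimate:

```python
ROW_BYTES = 200   # measured average row size for this entity
GROWTH = 2        # my future-proofing load factor

def partition_mb(rows, row_bytes=ROW_BYTES, growth=GROWTH):
    """Rough uncompressed partition size estimate in MB."""
    return rows * row_bytes * growth / (1024 * 1024)

# Hottest partition key combination at different bucket granularities:
print(partition_mb(2_800_000))  # whole day, no bucket_id
print(partition_mb(1_400_000))  # peak hourly bucket
print(partition_mb(550_000))    # peak 15-minute bucket
```

This is why even the hourly bucket blows past 100 MB for the hottest key once the growth factor is applied.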
Thank you in advance for your inputs!!