Why do two separate queries run faster than the combined subquery?

I’m running PostgreSQL 11 on Azure.

If I run this query:

    select min(pricedate) + interval '2 days' from pjm.rtprices

It takes 0.153 sec and has the following explain:

    "Result  (cost=2.19..2.20 rows=1 width=8)"
    "  InitPlan 1 (returns $0)"
    "    ->  Limit  (cost=0.56..2.19 rows=1 width=4)"
    "          ->  Index Only Scan using rtprices_pkey on rtprices  (cost=0.56..103248504.36 rows=63502562 width=4)"
    "                Index Cond: (pricedate IS NOT NULL)"

If I run this query:

    select pricedate, hour, last_updated, count(1) as N 
    from pjm.rtprices
    where pricedate<= '2020-11-06 00:00:00'
    group by pricedate, hour, last_updated
    order by pricedate desc, hour

it takes 5 sec with the following explain:

    "GroupAggregate  (cost=738576.82..747292.52 rows=374643 width=24)"
    "  Group Key: pricedate, hour, last_updated"
    "  ->  Sort  (cost=738576.82..739570.68 rows=397541 width=16)"
    "        Sort Key: pricedate DESC, hour, last_updated"
    "        ->  Index Scan using rtprices_pkey on rtprices  (cost=0.56..694807.03 rows=397541 width=16)"
    "              Index Cond: (pricedate <= '2020-11-06'::date)"

However when I run

    select pricedate, hour, last_updated, count(1) as N 
    from pjm.rtprices
    where pricedate<= (select min(pricedate) + interval '2 days' from pjm.rtprices)
    group by pricedate, hour, last_updated
    order by pricedate desc, hour

I get impatient after 2 minutes and cancel it.

The explain on the long running query is:

    "Finalize GroupAggregate  (cost=3791457.04..4757475.33 rows=3158115 width=24)"
    "  Group Key: rtprices.pricedate, rtprices.hour, rtprices.last_updated"
    "  InitPlan 2 (returns $1)"
    "    ->  Result  (cost=2.19..2.20 rows=1 width=8)"
    "          InitPlan 1 (returns $0)"
    "            ->  Limit  (cost=0.56..2.19 rows=1 width=4)"
    "                  ->  Index Only Scan using rtprices_pkey on rtprices rtprices_1  (cost=0.56..103683459.22 rows=63730959 width=4)"
    "                        Index Cond: (pricedate IS NOT NULL)"
    "  ->  Gather Merge  (cost=3791454.84..4662729.67 rows=6316230 width=24)"
    "        Workers Planned: 2"
    "        Params Evaluated: $1"
    "        ->  Partial GroupAggregate  (cost=3790454.81..3932679.99 rows=3158115 width=24)"
    "              Group Key: rtprices.pricedate, rtprices.hour, rtprices.last_updated"
    "              ->  Sort  (cost=3790454.81..3812583.62 rows=8851522 width=16)"
    "                    Sort Key: rtprices.pricedate DESC, rtprices.hour, rtprices.last_updated"
    "                    ->  Parallel Seq Scan on rtprices  (cost=0.00..2466553.08 rows=8851522 width=16)"
    "                          Filter: (pricedate <= $1)"

Clearly, the last query is doing a very expensive Gather Merge, so how can I avoid that?

I tried a different approach here:

    with lastday as (select distinct pricedate from pjm.rtprices order by pricedate limit 3)
        select rtprices.pricedate, hour, last_updated - interval '4 hours' as last_updated, count(1) as N 
        from pjm.rtprices
        right join lastday on rtprices.pricedate=lastday.pricedate
        where rtprices.pricedate<= lastday.pricedate
        group by rtprices.pricedate, hour, last_updated
        order by rtprices.pricedate desc, hour

which took just 2 sec with the following explain:

    "GroupAggregate  (cost=2277449.55..2285769.50 rows=332798 width=32)"
    "  Group Key: rtprices.pricedate, rtprices.hour, rtprices.last_updated"
    "  CTE lastday"
    "    ->  Limit  (cost=0.56..1629038.11 rows=3 width=4)"
    "          ->  Result  (cost=0.56..105887441.26 rows=195 width=4)"
    "                ->  Unique  (cost=0.56..105887441.26 rows=195 width=4)"
    "                      ->  Index Only Scan using rtprices_pkey on rtprices rtprices_1  (cost=0.56..105725202.47 rows=64895517 width=4)"
    "  ->  Sort  (cost=648411.43..649243.43 rows=332798 width=16)"
    "        Sort Key: rtprices.pricedate DESC, rtprices.hour, rtprices.last_updated"
    "        ->  Nested Loop  (cost=0.56..612199.22 rows=332798 width=16)"
    "              ->  CTE Scan on lastday  (cost=0.00..0.06 rows=3 width=4)"
    "              ->  Index Scan using rtprices_pkey on rtprices  (cost=0.56..202957.06 rows=110933 width=16)"
    "                    Index Cond: ((pricedate <= lastday.pricedate) AND (pricedate = lastday.pricedate))"

This last one is all well and good, but if my subquery didn’t lend itself to this hack, is there a better way to get the subquery version to perform similarly to the one-query-at-a-time approach?
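The only other workaround I can think of is to compute the cutoff date in a separate round trip and substitute it into the main query as a literal, so the planner can estimate against a known value rather than a run-time parameter. A minimal sketch, assuming the psql client (its \gset metacommand stores a query result in a variable):

    -- compute the cutoff once and store it in a psql variable (sketch; assumes psql)
    select (min(pricedate) + interval '2 days')::date as cutoff from pjm.rtprices \gset

    -- then run the aggregate against a literal the planner can see at plan time
    select pricedate, hour, last_updated, count(1) as n
    from pjm.rtprices
    where pricedate <= :'cutoff'
    group by pricedate, hour, last_updated
    order by pricedate desc, hour;

Presumably the single-statement version is slow because, with the cutoff only known at run time, the planner falls back to a generic estimate of roughly a third of the table for the pricedate filter, which is why it picks the parallel seq scan and Gather Merge.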

innodb – Queries freeze in MariaDB

I have a problem where queries freeze sporadically during the day. I wrote a script to monitor the processlist and detected a pattern: while a slow query is running (a different slow query each time), otherwise-fast queries get stuck in various states, e.g. Opening tables, query end, update, etc., like this:

time (seconds)  state           query
7               Opening tables  SELECT id FROM... 
8               Opening tables  SELECT type FROM...
8               query end       UPDATE cache...
8               Opening tables  SELECT language FROM...
9               query end       INSERT INTO cache...
9               query end       INSERT INTO cache...
29              Sending data    SELECT product_id FROM...

Usually it’s write operations getting stuck, but also simple selects in the state “Opening tables”. Then something happens and the queries disappear from the list; I’m not sure what this “something” is, though. It feels like something is locking the queries, but I’m not able to pinpoint what it is.
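One thing I can do the next time it happens is snapshot the engine and lock state from a second session while a stall is in progress. A rough sketch (InnoDB/XtraDB assumed):

    SHOW FULL PROCESSLIST;
    SHOW ENGINE INNODB STATUS\G

    -- row-lock waits and open transactions, in case the stall is at the InnoDB level
    SELECT * FROM information_schema.INNODB_LOCK_WAITS;
    SELECT * FROM information_schema.INNODB_LOCKS;
    SELECT * FROM information_schema.INNODB_TRX;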

Any ideas what could be causing this problem, or how I can debug it further?

MariaDB version: 10.1.41-MariaDB-0+deb9u1

SHOW GLOBAL STATUS:
https://pastebin.com/gyZNhEsP

SHOW VARIABLES:
https://pastebin.com/YAXFcz2G

python – Does @reify execute the database query every time it’s called?

Based on this comment about reify,

It acts like @property, except that the function is only ever called once; after that, the value is cached as a regular attribute. This gives you lazy attribute creation on objects that are meant to be immutable.

I have this custom reify class:

class reify(object):
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __get__(self, inst, objtype=None):  # descriptor protocol: Python passes (instance, owner)
        if inst is None:
            return self
        val = self.wrapped(inst)
        setattr(inst, self.wrapped.__name__, val)
        return val

And it’s used like:

@reify
def user_details(self, user_id):
    try:
        # function that executes db query, returns dict
        return user_details_from_id(self._dbconn, user_id)
    except Exception as e:
        pass

Then we can use it simply as name = self.user_details.get("name").

This works as expected, but I’m not sure whether it is caching the result or executing the query every time it’s called. How can I confirm this? Is this implementation correct? (I don’t have access to a DB console.)

Python and redis. Make queries more efficient

My Python code (part of a Django site) queries data from a redis database and loads it into a Pandas dataframe. I’m new to redis, and I think I’m not using the redis library as efficiently as I could.

The database keys are timestamps in epoch format. The values are all JSON strings. Here’s a sample:

Keys:
('1622235486.006474000', '1622235486.006760000', '1622235486.114156000')
Values:
('{"timestampx": "1622235486.006474000", "length": "1416", "dscp": "0", "srcip": "172.17.4.2", "destip": "172.16.1.2"}', '{"timestampx": "1622235486.006760000", "length": "108", "dscp": "0", "srcip": "172.16.1.2", "destip": "172.17.4.2"}', '{"timestampx": "1622235486.114156000", "length": "112", "dscp": "0", "srcip": "172.17.4.2", "destip": "172.16.1.2"}')

Four questions about my code:

  1. First I get all keys with .keys(). Then I filter the list of keys. Then I use .mget(keys). Making two round trips to redis (one to get the keys, another to get the values) seems inefficient. Is there a better method?
  2. Some of the keys aren’t epoch timestamps. They start with “1_”, and they contain None values. I have to filter those out. I use a separate line of code to do that. Can I do that as part of the keys() function?
  3. I also have a line to sort the keys. Can I ask redis to return them sorted?
  4. Finally, is there an argument for .mget() that filters out None values?
    import redis
    from datetime import datetime
    from datetime import timedelta
    import pandas as pd
    import json

    INTERVAL = 5

    # redis_config is defined elsewhere; assumes decode_responses=True so keys come back as str
    r = redis.StrictRedis(**redis_config)
    cutoff = datetime.now() - timedelta(seconds=INTERVAL)

    recentkeys = []
    allkeys = r.keys(pattern="*")  # Question 1. Is the best query keys() passed to mget()?
    allkeys = (x for x in allkeys if not x.startswith('1_'))  # Question 2. Can I pass a filter to redis and have it filter the response?

    for k in allkeys:
        if float(k) > cutoff.timestamp():
            recentkeys.append(k)
    recentkeys.sort()  # Question 3. Can redis return a sorted set?

    values = r.mget(recentkeys)
    values = (x for x in values if x is not None)  # Question 4. Can redis filter out None values?

    recent_values = pd.DataFrame(map(json.loads, values))

seo – Google results showing search-query pages from other websites?

I’m not sure if this is the right place to ask, but I think the answer will be interesting, so I had to ask.

I came across something while searching for this single character:

https://www.google.com/search?q=%E3%80%90

These are my search results (from Australia):

google search results

I noticed that after the first two generic, expected results there were strange results from websites that have their own search-query URLs. I went to the third result, and it had a sort of advertisement for a place in South Korea:

┎신속하게┚●GGULFO.COM●당동출장샵❒당동오피스텔✁당동키스방✙당동출장샵✿당동스파【당동출장샵♐당동풀쌀롱✞당동휴게실✒당동출장샵

Intrigued, I continued and searched for the website itself: https://www.google.com/search?q=GGULFO.COM and got more confusing results.

google search results

What is going on here?

sql server – How can I judge which SQL queries need to be optimized based on the graph?

This is a practice question (multiple choice) that I couldn’t get right this evening, and the final exam is the day after tomorrow. My reasoning: first, the projections should be pushed down, and second, the query should be answered from an index where possible. The main goal of the optimization is to make the tables smaller than the originals before the merge operation. Pushing the projections and selections down is definitely possible; an index lookup should also work, since ID should have a B-tree index that can be used. Finally, an index nested-loop join could be used, since the animal table has an index.
This is my idea, but I don’t know what’s wrong with it, and I don’t understand the rest. Thanks a lot! Question

db2 – Cross Database Queries using IBM Data Studio

To execute any SQL statement you have to be connected to some database server; Data Studio itself does not execute SQL statements. Consequently, for the three-part name (<server>.<schema>.<object>) to work, the server you’re connected to has to know what the <server> part is.

In the simple case of accessing objects in a database that belongs to the same Db2 for LUW database instance, <server> is the other database name, and no additional setup is required.

However, if the other table is in a database managed by a different instance, or if it belongs to a different DBMS (Db2 for z/OS, Oracle, etc.), you will need to set up a federated data source, whose name you will then use for <server>.
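A minimal illustration of the two cases described above (all object names here are hypothetical):

    -- same Db2 for LUW instance: <server> is simply the other database's name
    SELECT * FROM OTHERDB.SALES.CUSTOMERS;

    -- different instance or different DBMS: <server> is the federated server definition you created
    SELECT * FROM FEDSERVER.SALES.CUSTOMERS;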

architecture – Running ad hoc queries on JSON log files

I have a situation where, let’s say, I have a folder called logs which contains N folders.
Each folder contains events for a specific event type, and each folder has N .log files, where each file has multiple lines of JSON.

Example:

event1.1.log

{"id":1, "name": "ABCD"}
{"id":2, "name": "EFGH"}
{"id":5, "name": "IJKL"}
{"id":7, "name": "MNOP"}

event1.2.log

{"id":3, "name": "ABCD"}
{"id":4, "name": "EFGH"}
{"id":6, "name": "IFKL"}
{"id":8, "name": "ABED"}

Now, each event can have its own structure, but it’s guaranteed that each log in the same event will always have the same structure.

Now, I need a way to run ad hoc queries on these: get a list of students, get the top ten students, etc.

I thought of loading them into a temporary table and then running queries on it (roughly the sketch below), but I was wondering if there was any other way to do this.
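For reference, the temporary-table idea could look roughly like this (a sketch, assuming Postgres and the psql client; the table and file names are made up):

    -- scratch table holding one event type's newline-delimited JSON, one document per row
    create temporary table event1_raw (doc jsonb);

    -- \copy is a psql metacommand; each line of the .log file becomes one jsonb value
    \copy event1_raw (doc) from 'logs/event1/event1.1.log'
    \copy event1_raw (doc) from 'logs/event1/event1.2.log'

    -- ad hoc queries can then pull fields out of the documents
    select doc->>'name' as name, count(*) as n
    from event1_raw
    group by doc->>'name'
    order by n desc
    limit 10;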

I could write an application to parse the files in memory, but the amount of data could be too large to process in memory. And every time I want to run a different query on the same dataset over the next few days, it would have to parse all the files into memory again.

Any approaches on this?

query performance – Why does a GIST index on a cube column in PostgreSQL actually make K-Nearest Neighbor (KNN) ORDER BY queries worse?

Adding a GIST index actually seems to make K-Nearest Neighbor (KNN) ORDER BY queries on cube columns worse in PostgreSQL. Why would that be, and what can be done about it?

Here’s what I mean. In a PostgreSQL database I have a table whose DDL is create table sample (id serial primary key, title text, embedding cube), where the embedding column is an embedding vector of the title obtained with a Google language model. The cube data type is provided by the cube extension, which I have installed. Incidentally, these are titles of Wikipedia articles. In any case, there are 1 million records. I then perform a KNN search with the following query, which defines distance using the Euclidean distance operator <-> (results are similar for the other two metrics). It does an ORDER BY and applies a LIMIT in order to find the 10 Wikipedia articles with the most “similar” titles (the most similar being the target title itself). That all works fine.

select sample.title, sample.embedding <-> cube('(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)') distance from sample order by distance limit 10;

What’s puzzling to me, however, is that if I put a GiST index on the embedding column, query performance actually gets worse. With the index added, the query plan changes as expected, insofar as it uses the index. But… it gets slower!

This seems to run contrary to the documentation for cube which states:

In addition, a cube GiST index can be used to find nearest neighbors using the metric operators <->, <#>, and <=> in ORDER BY clauses

They even provide an example query, which is very similar to mine.

SELECT c FROM test ORDER BY c <-> cube(array[0.5,0.5,0.5]) LIMIT 1;

Here’s the query plan and timing info before dropping the index.

----------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..6.30 rows=10 width=29)
   ->  Index Scan using sample_embedding_idx on sample  (cost=0.41..589360.33 rows=999996 width=29)
         Order By: (embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube)
(3 rows)

        title         |      distance      
----------------------+--------------------
 david petrarca       | 0.5866321762629475
 david adamski        | 0.5866321762629475
 richard ansdell      | 0.6239883862603475
 linda darke          | 0.6392124797481789
 ilias tsiliggiris    | 0.6996660649119893
 watson, jim          | 0.7059481479504834
 sk radni%c4%8dki     |   0.71718948226995
 burnham, pa          | 0.7384858030758069
 arthur (europa-park) | 0.7468462897336924
 ivan kecojevic       | 0.7488206082281348
(10 rows)

Time: 1226.457 ms (00:01.226)

And, here’s the query plan and timing info after dropping the index.

----------------------------------------------------------------------------------------------------
 Limit  (cost=74036.32..74037.48 rows=10 width=29)
   ->  Gather Merge  (cost=74036.32..171264.94 rows=833330 width=29)
         Workers Planned: 2
         ->  Sort  (cost=73036.29..74077.96 rows=416665 width=29)
               Sort Key: ((embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube))
               ->  Parallel Seq Scan on sample  (cost=0.00..64032.31 rows=416665 width=29)
(6 rows)

        title         |      distance      
----------------------+--------------------
 david petrarca       | 0.5866321762629475
 david adamski        | 0.5866321762629475
 richard ansdell      | 0.6239883862603475
 linda darke          | 0.6392124797481789
 ilias tsiliggiris    | 0.6996660649119893
 watson, jim          | 0.7059481479504834
 sk radni%c4%8dki     |   0.71718948226995
 burnham, pa          | 0.7384858030758069
 arthur (europa-park) | 0.7468462897336924
 ivan kecojevic       | 0.7488206082281348
(10 rows)

Time: 381.419 ms

Notice:

  • With Index: 1226.457 ms
  • Without Index: 381.419 ms

This is very puzzling behavior! All of it is documented in a GitHub repo so that others can try it. I’ll add documentation about how to generate the embedding vectors, but that shouldn’t be needed, as the Quick-Start shows that pre-computed embedding vectors can be downloaded from my Google Drive folder.
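As an aside, the with/without-index comparison can also be reproduced without dropping and re-creating the index, by steering the planner away from it for the current session only (just a testing convenience, not a fix):

-- make the planner ignore index and index-only scans for this session
set enable_indexscan = off;
set enable_indexonlyscan = off;

-- re-run the KNN query / EXPLAIN here, then restore the defaults
reset enable_indexscan;
reset enable_indexonlyscan;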

Addendum

It was asked in the comments below to provide the output of explain (analyze, buffers). Here it is; the steps were:

  1. I re-create the (covering) index
  2. I run the query with explain (analyze, buffers)
  3. I drop the index
  4. I run the query with explain (analyze, buffers) again
pgbench=# create index on sample using gist (embedding) include (title);
CREATE INDEX
Time: 51966.315 ms (00:51.966)
pgbench=# 
                                          QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..4.15 rows=10 width=29) (actual time=3215.956..3216.667 rows=10 loops=1)
   Buffers: shared hit=1439 read=87004 written=7789
   ->  Index Only Scan using sample_embedding_title_idx on sample  (cost=0.41..373768.39 rows=999999 width=29) (actual time=3215.932..3216.441 rows=10 loops=1)
         Order By: (embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube)
         Heap Fetches: 0
         Buffers: shared hit=1439 read=87004 written=7789
 Planning:
   Buffers: shared hit=14 read=6 dirtied=2
 Planning Time: 0.432 ms
 Execution Time: 3316.266 ms
(10 rows)

Time: 3318.333 ms (00:03.318)
pgbench=# drop index sample_embedding_title_idx;
DROP INDEX
Time: 182.324 ms
pgbench=# 
                                          QUERY PLAN
----------------------------------------------------------------------------------------------------
 Limit  (cost=74036.35..74037.52 rows=10 width=29) (actual time=6052.845..6057.210 rows=10 loops=1)
   Buffers: shared hit=70 read=58830
   ->  Gather Merge  (cost=74036.35..171265.21 rows=833332 width=29) (actual time=6052.825..6057.021 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=70 read=58830
         ->  Sort  (cost=73036.33..74077.99 rows=416666 width=29) (actual time=6002.928..6003.019 rows=8 loops=3)
               Sort Key: ((embedding <-> '(0.18936706, -0.12455666, -0.31581765, 0.0192692, -0.07364611, 0.07851536, 0.0290586, -0.02582532, -0.03378124, -0.10564457, -0.03903799, 0.08668878, -0.15357816, -0.17793414, -0.01826405, 0.01969068, 0.11386908, 0.1555583, 0.09368557, 0.13697313, -0.05610929, -0.06536788, -0.12212707, 0.26356605, -0.06004387, -0.01966437, -0.1250324, -0.16645767, -0.13525756, 0.22482251, -0.1709727, 0.28966117, -0.07927769, -0.02498624, -0.10018375, -0.10923951, 0.04770213, 0.11573371, 0.04619929, 0.05216618, 0.19176421, 0.12948817, 0.08719034, -0.16109011, -0.02411379, -0.05638905, -0.37334979, 0.31225419, 0.0744801, 0.27044332)'::cube))
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=70 read=58830
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on sample  (cost=0.00..64032.33 rows=416666 width=29) (actual time=0.024..3090.103 rows=333333 loops=3)
                     Buffers: shared read=58824
 Planning:
   Buffers: shared hit=3 read=3 dirtied=1
 Planning Time: 0.129 ms
 Execution Time: 6057.388 ms
(18 rows)

Time: 6053.284 ms (00:06.053)

query performance – Prevent Certain Types of Queries (Redshift)

Some IDEs will allow users to “peek” at data by hitting a button in a menu; under the hood the IDE performs a select * without a where clause. This can be a problem because our Redshift cluster includes a table (via Spectrum) which contains a very large number of rows, so running select * on it without a where clause bogs down the cluster unnecessarily.

One option is to train users not to run queries without where clauses on that particular table, but asking users to remember not to hit an innocuous-looking button in their IDE may not work 100% of the time.

Are there other best practices we can explore here? For example, is there a way to tell Redshift to disallow queries without where clauses on particular tables? (If for some reason the user really wants all rows, they can include where 1=1 or something.) Or is there some kind of server-side lint-like plugin, or server-side query regex filter, that would accomplish similar functionality?

Please note that this is not a security-related question. If a malicious user wants to bog down the cluster intentionally, let’s assume they will still be able to do so. Rather, the intent of this functionality would be to do a favor for users who hit the wrong button or type the wrong thing without realizing the problem.

Thanks in advance.