loading – 130ms slower startup time, or asynchronous data load?

My desktop application comes with some examples it can show on the main home page. The trouble is, those examples take ~120ms to load, which, if done synchronously, effectively delays application startup.

An alternative is to load the examples after the application has loaded, but that means the interface is not ‘static’: as soon as the data loads, the small table with examples ‘appears’, and this could be jarring.
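The toolkit isn’t stated in the question, but here is a minimal sketch of the usual compromise, reserving the table’s space up front and filling it in asynchronously so the layout doesn’t jump, using Python/Tkinter as a stand-in (the widget names and the 120 ms sleep are hypothetical):

import threading
import time
import tkinter as tk

def load_examples():
    """Stand-in for the real ~120 ms example-loading work (hypothetical)."""
    time.sleep(0.12)
    return ["Example A", "Example B", "Example C"]

root = tk.Tk()

# Reserve the table's space up front so nothing shifts when data arrives.
table = tk.Listbox(root, height=5)
table.insert(tk.END, "Loading examples…")
table.pack(fill=tk.BOTH, expand=True)

def show(examples):
    table.delete(0, tk.END)
    for name in examples:
        table.insert(tk.END, name)

def worker():
    examples = load_examples()
    # Marshal the result back to the UI thread; Tk widgets must not be
    # touched from a worker thread.
    root.after(0, show, examples)

threading.Thread(target=worker, daemon=True).start()
root.mainloop()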

What is the best experience to go with, UX-wise?

d6 system – Why are knives and fists slower than swords and other weapons?

I was going over the D6 System book on a whim to learn a new system, just to see how it handled things compared to systems I do understand, and while I was looking through it, it seemed largely simple… except for one aspect.

When I was looking at the weapons chart, I saw that each weapon had a Speed attribute and a Damage attribute. While weapons like swords and axes seemed fine, weapons such as fists and daggers (which seem to me like they’d be easier and faster to use) had a slower speed for some reason.

Why is that?

Why would a dagger or your own fists take longer to use than something like a baseball bat, a battle-axe, or even a sword?

mysql – Queries are significantly slower on a VPS than a dedicated server. Is CPU the sole bottleneck?

I moved from a dedicated server to a VPS, and queries that used to take less than a second are now taking up to seven seconds. The dedicated server had MySQL 5.6; the new one has MySQL 5.7. Both servers have 32 GB of RAM, and MySQL was using default settings on the dedicated server. Tables are all InnoDB and the data + indexes make up ~1.7 GB. innodb_buffer_pool_size is set to 3G (the VPS also hosts websites; it could be increased if needed, but I don’t think it would make a difference at this point).

Dedicated CPU info:

# grep -E "model name|processor" /proc/cpuinfo
processor : 0
model name  : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
processor : 1
model name  : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
processor : 2
model name  : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
processor : 3
model name  : Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz

VPS CPU info:

# grep -E "model name|processor" /proc/cpuinfo
processor       : 0
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 1
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 2
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 3
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 4
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 5
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 6
model name      : Intel Core Processor (Haswell, no TSX, IBRS)
processor       : 7
model name      : Intel Core Processor (Haswell, no TSX, IBRS)

A lot of the queries being run have subqueries and JOINs. EXPLAIN output for one query can be found here (pretty long, didn’t want to paste it here): https://pastebin.com/Za4pX25h

The query cache helps, but the problem is that the tables are updated pretty regularly, so the cache gets flushed a lot.
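For a fair comparison between the two servers, the query cache can be kept out of the measurement with the SQL_NO_CACHE hint (valid on both 5.6 and 5.7). A minimal timing sketch, assuming mysql.connector; the connection details and the query string are placeholders:

import time
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details -- substitute your own.
conn = mysql.connector.connect(host="127.0.0.1", user="bench",
                               password="secret", database="mydb")
cur = conn.cursor()

# SQL_NO_CACHE keeps the 5.6/5.7 query cache out of the measurement,
# so every run reflects real execution time rather than a cache hit.
query = "SELECT SQL_NO_CACHE ..."  # placeholder: the slow query under test

for run in range(5):
    start = time.perf_counter()
    cur.execute(query)
    cur.fetchall()
    print(f"run {run}: {time.perf_counter() - start:.3f} s")

cur.close()
conn.close()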

MySQL CPU usage while the query runs is 6.0% on the dedicated server, whereas on the VPS it goes up to 98-105%. If it’s not strictly a CPU problem, is there something else I could look at? Thanks in advance.

SP runs slower on new server

We migrated our databases from SQL Server 2012 to SQL Server 2019. Our ETLs are built in Visual Studio and are set up from a master package. The master package calls different packages, which are not deployed in SSIS. One of the packages calls a stored procedure, and this stored procedure calls different stored procedures. On the old server, this SP step took 4 hours. On the new server, it takes 7 hours. What could we do to speed up this process? Does the compatibility level of the database affect this process? And would it help if we deployed the package in SSIS? We are open to any suggestions.
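On the compatibility-level question: it is worth knowing that the level also selects the cardinality estimator (110 keeps the legacy SQL Server 2012 estimator, 150 enables the 2019 one), a common source of plan regressions after a migration. A minimal sketch to check the current levels, assuming pyodbc; the connection string is a placeholder:

import pyodbc  # pip install pyodbc

# Hypothetical connection string -- adjust server and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;Trusted_Connection=yes;"
)
cur = conn.cursor()

# 110 = SQL Server 2012 behaviour (legacy cardinality estimator),
# 150 = SQL Server 2019 (new estimator and intelligent query processing).
cur.execute("SELECT name, compatibility_level FROM sys.databases")
for name, level in cur.fetchall():
    print(name, level)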

Things we already tried:

  • Rebuilding indexes and updating statistics
  • Improving certain queries
  • Creating 8 files instead of 1 in tempdb (the old server has one)

Thank you for your help.
Esmee

hash – Dynamic format used in John the Ripper jumbo way slower than MDXFind

I’m currently doing some research on a pretty huge list of hashes (approx. 2 million) and thus I’d like to improve my cracking speed. The hash format is 12 rounds of SHA512(password + salt), which could be written like this: sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512($p.$s)))))))))))).
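For clarity, the chain can be expressed in a few lines of Python. One assumption here (not stated in the original): each round re-hashes the raw 64-byte digest of the previous round; if the scheme hex-encodes between rounds instead, swap digest() for hexdigest().encode(). The password/salt pair is hypothetical:

import hashlib

def sha512_12_rounds(password: bytes, salt: bytes) -> str:
    # Round 1 hashes password + salt; the remaining 11 rounds re-hash
    # the raw 64-byte digest (assumption -- see note above).
    digest = hashlib.sha512(password + salt).digest()
    for _ in range(11):
        digest = hashlib.sha512(digest).digest()
    return digest.hex()

# Hypothetical check against one hash:salt pair from the list.
print(sha512_12_rounds(b"password123", b"somesalt1234567890ab"))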

I wrote a dynamic format for use with John the Ripper:

[List.Generic:dynamic_3000]
Expression=dynamic=sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512(sha512($p.$s))))))))))))
#  Flags for this format
Flag=MGF_FLAT_BUFFERS
Flag=MGF_SALTED
Flag=MGF_INPUT_64_BYTE
#  Lengths used in this format
SaltLen=20
MaxInputLenX86=110
MaxInputLen=110
#  The functions in the script
Func=DynamicFunc__clean_input_kwik
Func=DynamicFunc__append_keys
Func=DynamicFunc__append_salt
Func=DynamicFunc__SHA512_crypt_input1_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_overwrite_input2
Func=DynamicFunc__SHA512_crypt_input2_to_output1_FINAL

Doing john --format=dynamic_3000 --test=10 gives this result:

Many salts:     403397 c/s real, 403089 c/s virtual
Only one salt:  392575 c/s real, 392664 c/s virtual

When using my dynamic format with a pretty huge list of passwords (14M) on my 2M list of hashes, John cracks ~10K hashes in >10 minutes (using --fork with the maximum number of cores on my machine, for 100% CPU usage).

I got to the same point in under 2 minutes using MDXfind with this:

mdxfind -h SHA512PASSSALT -i 12 -f 2Mhashes.txt -s 2Msalts.txt 14Mpassword.txt

My questions are:

  • Is there any way to improve my cracking speed using John? Maybe some other flags could be used? I’d like to stick with John for convenience (sessions, etc.).
  • Is there any way to improve speed via GPU? There seems to be no support for my specific use case in John (although there is a raw-SHA512-opencl format), and I don’t have the skill set required to write a custom hashcat kernel.

Any help would be greatly appreciated!

Performance of select from a 3d list – Mathematica slower than Python

I am creating a random 3D data set in Mathematica 12.1, then selecting all points that fall within a certain range on one axis.

I am doing the same in Python (same computer, Python 3.8.5, numpy 1.19.2).

RESULT:
It seems that Python is able to select much faster (1.7 sec) than Mathematica (5.2 sec). What is the reason for that?
For the selection in Mathematica I used the fastest solution, which is by Carl Woll (see here, at the bottom).

SeedRandom[1];
coordinates = RandomReal[10, {100000000, 3}];

selectedCoordinates = 
   Pick[coordinates, 
    Unitize@Clip[coordinates[[All, 1]], {6, 7}, {0, 0}], 
    1]; // AbsoluteTiming

{5.16326, Null}

Dimensions[coordinates]

{100000000, 3}

Dimensions[selectedCoordinates]

{10003201, 3}

PYTHON CODE:

import time
import numpy as np
 
np.random.seed(1)
coordinates = np.random.random_sample((100000000,3))*10

start = time.time()
selectedCoordinates = coordinates[(coordinates[:, 0] > 6) & (coordinates[:, 0] < 7)]
end = time.time()

print(end-start)

print(coordinates.shape)

print(selectedCoordinates.shape)

1.6979997158050537

(100000000, 3)

(9997954, 3)

networking – My PC internet speed is slower than it used to be while all of my devices work normally

I have around 150 Mb/s download speed, and around half a year ago I was downloading at around 20 MB/s when suddenly it went down to 1.2 MB/s. At first I thought it might have been because of corona, but sadly it is still 1.2 MB/s, and I checked with multiple speed tests from different websites and all of them say the same. Downloading anything bigger than 30 GB takes me almost half a day, maybe even more.

I have around 5 devices in my apartment and all of them get normal speed, but my PC is way slower than it should be. I use an ethernet cable that stretches through 2 rooms. The cable itself is pretty old and I’m thinking it might be the cause, but I genuinely have no idea.

Yesterday I called my internet provider (UPC), and after I gave them some information they sent a technician today to replace my router; but when he checked the internet speed through the cable, he said everything was okay and there were no problems with the modem itself. I’ve tried a lot of things: changing my DNS, fully resetting the modem, and checking whether I had some malware problem on my PC, as the technician suggested. Nothing helped, and now I’m sort of desperate because I really don’t know what is causing this. The modem itself is around 2-3 years old at most.
If anyone could help I’d be grateful.
This is the modem I use if it helps. https://www.upc.cz/televize/doplnky/hd-dvr-mediabox/
Thanks.

python – Multiplication algorithm I wrote – slower than I expected

Recently I thought of an algorithm for multiplication, decided to stop dreaming and start putting my ideas on paper, and even implemented it in code (in this case, Python 3.9.1).
I do not know if it resembles Karatsuba’s algorithm, but I glanced at it and it seems to work very differently.


The idea behind this multiplication algorithm (calculating $x \cdot y$) is to represent each number as a power of two plus some remainder, then use the distributive rule of multiplication to get:

$$x = 2^a + K, \qquad y = 2^b + T$$

$$ x \cdot y = (2^a + K) \cdot (2^b + T) = 2^{a+b} + T \cdot 2^a + K \cdot 2^b + K \cdot T$$

I chose the power to be $2$ as it would help us with bit manipulation later on.
Calculating $2^{a+b}$ is easy using bitwise operations, like so: $$ 2^{a+b} = 1 \ll (a+b)$$

But how would we find $a$ and $b$?

We want $2^a$ or $2^b$ to be the largest power of $2$ below $x$ (or $y$, correspondingly), to take as much ‘volume’ from the original number as possible, thus making the calculations easier with bit manipulations. So I just used the $\lg$ function, which from what I’ve read can run in $O(1)$ time (or at worst, $O(\lg \lg n)$). We have:

$$ a = \lfloor \lg x \rfloor, \qquad b = \lfloor \lg y \rfloor$$

We then need to find $K$, which is just the remainder when we subtract $2^a$ from $x$: $$K = x - 2^a = x - (1 \ll a)$$

However, maybe subtraction isn’t the best idea, maybe it takes too much time, so I thought about another bit manipulation. All I had to do was flip the most significant bit (the leftmost bit), which represents the greatest power of $2$ the number consists of; so I padded exactly $a$ ones and used the bitwise $\&$ operation to clear the MSB. We now have code to find $K$ and $T$ respectively:

$$ K = x \mathbin{\&} \text{int('1' * a, 2)}, \qquad T = y \mathbin{\&} \text{int('1' * b, 2)}$$
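As a quick aside on that mask (my note, not part of the original derivation): a string of $a$ ones in base 2 is just $2^a - 1$, so it can also be produced with a shift instead of building a string. A small sanity check:

a = 5
mask_from_string = int('1' * a, 2)   # the mask as built in the text
mask_from_shift = (1 << a) - 1       # same value, no string round-trip
assert mask_from_string == mask_from_shift == 0b11111

x = 45                               # 0b101101: MSB at position a = 5
assert x & mask_from_shift == 13     # 0b01101: MSB cleared, i.e. K = x - 2^a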

Finally, we can add all the factors together, calling the function recursively to compute $K \cdot T$, to get:

$$ (1 \ll (a + b)) + (T \ll a) + (K \ll b) + \overbrace{\text{mult}(K, T)}^{\text{recursive call}}$$


from math import log2

def mult(x, y):
    # Base cases: multiplying by 0 or 1 needs no recursion.
    if x == 1:
        return y
    elif y == 1:
        return x
    elif x == 0 or y == 0:
        return 0

    # a = floor(lg x), b = floor(lg y): the positions of the MSBs.
    base_x = int(log2(x))
    base_y = int(log2(y))

    # Clear the MSB to get the remainders K = x - 2^a and T = y - 2^b.
    K = x & int('1' * base_x, 2)
    T = y & int('1' * base_y, 2)

    # 2^(a+b) + T*2^a + K*2^b + K*T, the last term computed recursively.
    return (1 << (base_x + base_y)) + (T << base_x) + (K << base_y) + mult(K, T)

But oh! From what I’ve tested, this algorithm does not come anywhere near the time it takes to multiply two numbers using the plain old $\texttt{*}$ operator. Sob!

import random
import sys
import time

# mult() recurses once per set bit, which for numbers this large can
# exceed Python's default recursion limit of 1000.
sys.setrecursionlimit(10000)

times = []
for _ in range(10000):
    x = random.randint(10 ** 900, 10 ** 1000)
    y = random.randint(10 ** 900, 10 ** 1000)
    start = time.time()
    mult(x, y)
    end = time.time()
    times.append(end - start)
print(sum(times) / len(times))

This tests $10,000$ multiplications of random integers $900$ to $1000$ digits long, then prints the average time. On my machine the average is 0.01391555905342102 seconds. Python’s regular multiplication won’t even show a number, just 0.0, because it is so fast.
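Incidentally, time.time() is too coarse to resolve a single built-in multiplication, which is why it prints 0.0; timeit averages over many calls and can resolve it. A sketch reusing mult from above (the loop counts are arbitrary):

import random
import sys
import timeit

sys.setrecursionlimit(10000)  # mult() recurses once per set bit

x = random.randint(10 ** 900, 10 ** 1000)
y = random.randint(10 ** 900, 10 ** 1000)

# timeit disables garbage collection and averages over many calls,
# resolving times far below time.time()'s useful resolution.
builtin = timeit.timeit(lambda: x * y, number=100_000) / 100_000
custom = timeit.timeit(lambda: mult(x, y), number=100) / 100
print(f"built-in *: {builtin:.3e} s    mult(): {custom:.3e} s")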

From what I know, Python’s multiplication does use Karatsuba’s algorithm, which is roughly $O(n^{1.58})$. I did not analyze mine strictly, but in one sense it runs in approximately $$O(\max(\text{number of set bits in } x,\ \text{number of set bits in } y))$$
because every recursive call turns off the MSB; thus the number of recursive calls we make is the maximum number of bits that are on ($=1$) in $x$ and $y$, which is strictly smaller than the numbers themselves. Thus we can surely say it is $O(\max(x, y)) \sim O(n)$, as all the other operations in the function are $O(1)$. So it boils down to the question of ‘why?’: why is it slower? What have I done wrong in my algorithm that makes it slower, even though at first glance it seems faster?

Thank you!

performance – Postgresql uses slower constraint over faster index

I have a few tables (25M+ records) with a heavily INSERT-, UPDATE- and DELETE-biased workload (typically 50,000 to 100,000 operations a day) where the query planner seems to make an odd choice of index, favouring a constraint over other indexes. A bit of trial and error shows that using the constraint is typically somewhere between 80 and 300x slower than using one of our indexes.

As an example, let’s say our table looks something like this:

Column                | Type                   | Nullable

id                    | uuid                   | not null
device_id             | character varying(255) | not null
device_child_id       | character varying(255) | not null
device_grandchild_id  | smallint               | not null
device_data_type      | character varying(255) | not null
device_data_unit      | character varying(255) | not null
data_date             | date                   | not null
<data>

And indexes:

"devices_pkey" PRIMARY KEY, btree (id)
...
"device_date" btree (device_id, data_date)
"device_child_grandchild" btree (device_id, device_child_id, device_grandchild_id)
"device_child_grandchild_date_constraint" EXCLUDE USING gist (data_date WITH =, device_id WITH =, device_child_id WITH =, device_grandchild_id WITH =, device_data_type WITH =, device_data_unit WITH =)

Some notes about the data:

  • There are around 50,000 unique device_id.
  • Each device_id may have up to 4 device_child_ids (but most have 1)
  • Each device_child_id may have up to 6 device_grandchild_ids (but most have 1)
  • Each of the combinations of the 3 ids above has a data_date and a bunch of data in other columns. The dates are not necessarily contiguous (though most are) and range from a few days to a few years – the largest set being around 4,500 records.

We use the constraint to ensure that we don’t have more than one row of data for the combination of the 4 fields above (and 2 others that don’t change currently).

Here’s the output from a little bit of EXPLAIN, first using the constraint:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT
  id
FROM
  devices
WHERE 
device_id = '<device_id>'
  AND device_child_id = 'ABC123456789'
  AND device_grandchild_id = 1
  AND NOT (local_date <@ DATERANGE('2018-01-01', '2019-01-01', '()'));
 Index Scan using device_child_grandchild_date_constraint on public.devices  (cost=0.42..2.64 rows=1 width=1780) (actual time=2.346..1396.550 rows=760 loops=1)
   Output: id
   Index Cond: (((devices.device_id)::text = '<device_id>'::text) AND ((devices.device_child_id)::text = 'ABC123456789'::text))
   Filter: ((NOT (devices.local_date <@ '(2018-01-01,2019-01-02)'::daterange)) AND (devices.device_grandchild_id = 1))
   Rows Removed by Filter: 1
   Buffers: shared hit=17315 read=2932 dirtied=2
   I/O Timings: read=1147.917
 Planning Time: 1.007 ms
 Execution Time: 1396.691 ms
(9 rows)

And again, but removing the device_child_id column from the query to trick the planner into using an index:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT
  id
FROM
  devices
WHERE 
device_id = '<device_id>'
  -- AND device_child_id = 'ABC123456789'
  AND device_grandchild_id = 1
  AND NOT (local_date <@ DATERANGE('2018-01-01', '2019-01-01', '()'));
 Index Scan using device_child_grandchild on public.devices  (cost=0.69..844.22 rows=722 width=1780) (actual time=3.192..8.517 rows=760 loops=1)
   Output: id
   Index Cond: (((devices.device_id)::text = '<device_id>'::text) AND (devices.device_grandchild_id = 1))
   Filter: (NOT (devices.local_date <@ '(2018-01-01,2019-01-02)'::daterange))
   Rows Removed by Filter: 2
   Buffers: shared hit=753 read=20
   I/O Timings: read=7.103
 Planning Time: 0.074 ms
 Execution Time: 8.590 ms

The biggest discrepancy I can see is the estimated and actual rows from the index scan portion of the query, i.e.
(cost=0.42..2.64 rows=1 width=1780) (actual time=2.346..1396.550 rows=760 loops=1) vs (cost=0.69..844.22 rows=722 width=1780) (actual time=3.192..8.517 rows=760 loops=1)

From reading a stack of other posts, it seems there is no way to tell the planner not to use the constraint, and that the right way to address this is to improve the statistics the planner uses; but I’m entirely unsure which columns to alter, or whether it’s the index that needs altering.
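For what it’s worth, the usual levers here are per-column statistics targets and, on PostgreSQL 10+, extended statistics across the correlated id columns. A minimal sketch of both, assuming psycopg2; the DSN, the choice of columns, and the name devices_ids_stats are placeholders:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# Raise the sample size for the columns the planner misestimates.
cur.execute("ALTER TABLE devices ALTER COLUMN device_child_id SET STATISTICS 1000")
cur.execute("ALTER TABLE devices ALTER COLUMN device_grandchild_id SET STATISTICS 1000")

# PostgreSQL 10+: tell the planner the id columns are correlated.
cur.execute("""
    CREATE STATISTICS IF NOT EXISTS devices_ids_stats (dependencies)
    ON device_id, device_child_id, device_grandchild_id FROM devices
""")

# Re-sample so the new targets and extended statistics take effect.
cur.execute("ANALYZE devices")
cur.close()
conn.close()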

Any clues greatly appreciated!

mariadb – Galera cluster slightly slower than single database?

I recently set up a MariaDB Galera cluster for our production. I used sysbench to benchmark the cluster against the old database, which is on a single server.

On my PRD Galera Cluster I got the following results:

SQL statistics:
    queries performed:
        read:                            3914980
        write:                           0
        other:                           782996
        total:                           4697976
    transactions:                        391498 (1304.77 per sec.)
    queries:                             4697976 (15657.22 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.0492s
    total number of events:              391498

Latency (ms):
         min:                                    5.37
         avg:                                   12.26
         max:                                   66.20
         95th percentile:                       15.83
         sum:                              4798745.23

Threads fairness:
    events (avg/stddev):           24468.6250/414.77
    execution time (avg/stddev):   299.9216/0.01

Meanwhile, our old single-database production server got these results:

SQL statistics:
    queries performed:
        read:                            5306060
        write:                           0
        other:                           1061212
        total:                           6367272
    transactions:                        530606 (1768.51 per sec.)
    queries:                             6367272 (21222.18 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.0266s
    total number of events:              530606

Latency (ms):
         min:                                    3.87
         avg:                                    9.04
         max:                                   59.99
         95th percentile:                       12.08
         sum:                              4798278.00

Threads fairness:
    events (avg/stddev):           33162.8750/440.14
    execution time (avg/stddev):   299.8924/0.01

Now I’m wondering: why does the cluster operate a bit slower than the single database? They have the same specs: quad-core CPU, 32 GB RAM, and vm.swappiness=1. Here’s my cluster configuration (same across all 3 servers), with HAProxy load balancing between them:

max_connections = 3000

wsrep_slave_threads=4
innodb_lock_wait_timeout=8000
innodb_io_capacity=2000
innodb_buffer_pool_size=25G
innodb_buffer_pool_instances=25
innodb_log_buffer_size=256M
innodb_log_file_size=1G
innodb_flush_log_at_trx_commit=2
innodb_flush_method = O_DIRECT_NO_FSYNC

innodb_read_io_threads=8
innodb_write_io_threads=4

thread_handling = pool-of-threads
thread_stack = 192K
thread_cache_size = 4
thread_pool_size = 8
thread_pool_oversubscribe = 3

wsrep_provider_options="gcache.size=10G; gcache.page_size=10G"

I ran sysbench from a spare server; does the latency between the servers also affect the results (a quick way to check is sketched below)? I would appreciate any input, thank you.
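A note on that latency: Galera write transactions pay at least one round trip to the other nodes for certification, and the hop from the bench client to the server adds to every query, reads included. A rough TCP connect-time probe (the node addresses are placeholders; port 3306 assumes the default MySQL port):

import socket
import time

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder node IPs

for host in NODES:
    start = time.perf_counter()
    # Connect to the MySQL port as a rough round-trip measurement.
    with socket.create_connection((host, 3306), timeout=2):
        pass
    print(f"{host}: {(time.perf_counter() - start) * 1000:.2f} ms")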