Graphics Card – How to Calculate PCIe Socket Bandwidth at Runtime Using Processor Counter Monitor?

I work in deep learning and I am trying to identify a bottleneck in our GPU pipeline.

We use Ubuntu on an Intel Xeon motherboard with 4 NVIDIA Titan RTX cards. GPU utilization seems quite low even though GPU memory usage is about 97%.

I'm trying to see if the PCIe bus is the bottleneck.

I have downloaded PCM (Processor Counter Monitor) and I am running it to monitor PCIe 3.0 x16 traffic.

Processor Counter Monitor: PCIe Bandwidth Monitoring Utility
This utility measures PCIe bandwidth in real-time

PCIe event definitions (each event counts as a transfer):
  PCIe read events (PCI devices reading from memory - application writes to disk/network/PCIe device):
    PCIeRdCur* - PCIe read current transfer (full cache line)
        On Haswell Server PCIeRdCur counts both full/partial cache lines
    RFO*       - Demand Data RFO
    CRd*       - Demand Code Read
    DRd        - Demand Data Read
  PCIe write events (PCI devices writing to memory - application reads from disk/network/PCIe device):
    ItoM       - PCIe write full cache line
    RFO        - PCIe partial write
  CPU MMIO events (CPU reading/writing to PCIe devices):
    PRd        - MMIO Read [Haswell Server only] (partial cache line)
    WiL        - MMIO Write (full/partial)
...
Socket 0: 2 memory controllers detected with total number of 6 channels. 3 QPI ports detected. 2 M2M (mesh to memory) blocks detected.
Socket 1: 2 memory controllers detected with total number of 6 channels. 3 QPI ports detected. 2 M2M (mesh to memory) blocks detected.
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Link 3 is disabled
Link 3 is disabled
Socket 0
Max QPI link 0 speed: 23.3 GBytes/second (10.4 GT/second)
Max QPI link 1 speed: 23.3 GBytes/second (10.4 GT/second)
Socket 1
Max QPI link 0 speed: 23.3 GBytes/second (10.4 GT/second)
Max QPI link 1 speed: 23.3 GBytes/second (10.4 GT/second)

Detected Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz "Intel(r) microarchitecture codename Skylake-SP" stepping 4 microcode level 0x200004d
Update every 1.0 seconds
delay_ms: 54
Skt | PCIeRdCur |  RFO  |  CRd  |  DRd  |  ItoM  |  PRd  |  WiL
 0      13 K       19 K     0       0      220 K     84     588
 1       0         3024     0       0        0        0     264
-----------------------------------------------------------------------
 *      13 K       22 K     0       0      220 K     84     852
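
To make the question concrete, here is a rough sketch of how the counts in one row of the table might be turned into bytes per second. It rests on several assumptions that are not confirmed by the output above: each counted transfer is taken to be one 64-byte cache line (so partial transfers are overestimated), the `K` suffix is taken to mean 1,000, and the single `RFO` column is attributed to device writes, following the "PCIe partial write" definition in the banner. The function names are hypothetical and are not part of PCM.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted transfer = one full cache line

# Assumption: PCM's display suffixes are decimal multipliers.
SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_count(text):
    """Convert a displayed count such as '13 K' or '3024' to an integer."""
    parts = text.split()
    suffix = parts[1] if len(parts) > 1 else ""
    return int(parts[0]) * SUFFIX[suffix]


def estimate_bandwidth(row, interval_s=1.0):
    """Return (read_bytes_per_s, write_bytes_per_s) for one socket row.

    "Read" = device reading from host memory, "write" = device writing to
    host memory, as in the PCM event definitions. MMIO events (PRd, WiL)
    are left out of this estimate.
    """
    read_events = sum(parse_count(row[k]) for k in ("PCIeRdCur", "CRd", "DRd"))
    # Assumption: the RFO column is counted as (partial) device writes.
    write_events = sum(parse_count(row[k]) for k in ("ItoM", "RFO"))
    return (
        read_events * CACHE_LINE_BYTES / interval_s,
        write_events * CACHE_LINE_BYTES / interval_s,
    )


if __name__ == "__main__":
    # Socket 0 row from the table above, sampled over a 1.0 s update interval.
    socket0 = {
        "PCIeRdCur": "13 K", "RFO": "19 K", "CRd": "0", "DRd": "0",
        "ItoM": "220 K", "PRd": "84", "WiL": "588",
    }
    read_bps, write_bps = estimate_bandwidth(socket0)
    print(f"read:  {read_bps / 1e6:.3f} MB/s")
    print(f"write: {write_bps / 1e6:.3f} MB/s")
```

With the socket 0 row shown, this gives roughly 0.8 MB/s of device reads and 15.3 MB/s of device writes. Because the displayed counts are rounded to thousands, the result is only as precise as the table.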

Ignore the actual values for a moment. My real values are much larger. 🙂

How do I calculate the PCIe bandwidth per socket?

Why are there only two sockets listed?
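
For reference, whatever figure comes out would be compared against the theoretical peak of the link. A minimal sketch of that arithmetic for PCIe 3.0 x16, assuming the standard 8 GT/s per lane and 128b/130b line encoding; the function name is hypothetical, and real throughput is lower because of packet and protocol overhead:

```python
def pcie_peak_bytes_per_s(gt_per_s=8.0, lanes=16, payload_bits=128, encoded_bits=130):
    """Theoretical one-direction peak of a PCIe link, in bytes per second.

    Defaults describe PCIe 3.0 x16: 8 GT/s per lane with 128b/130b encoding.
    """
    usable_bits_per_lane = gt_per_s * 1e9 * payload_bits / encoded_bits
    return usable_bits_per_lane / 8 * lanes


if __name__ == "__main__":
    peak = pcie_peak_bytes_per_s()
    print(f"PCIe 3.0 x16 peak: {peak / 1e9:.2f} GB/s per direction")
```

Dividing a measured bytes-per-second estimate by this peak gives a rough link utilization for a single x16 slot. Note that the PCM table reports per socket, so traffic from several GPUs attached to the same socket is summed.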