I have four SQL Agent tasks that run constantly. Each job runs a stored procedure that queries recent ATM transactions, all via the same linked server. This occurs in a loop of each job with a delay of 5 seconds between executions (with WAITFOR DELAY).
Specifically, one of the four jobs uses a stored procedure and the other three use another. Each job run has its own set of parameters related to specific transaction patterns that will trigger a response. They all poll the same linked server.
Most of the time, everything works well. But occasionally (almost *) one of the jobs will fail with this error:
TCP Provider: The specified network name is no longer available. [SQLSTATE 42000] (Error 64) The OLE DB provider "SQLNCLI11" of the linked server "ATMDB" returned the message "Failed communication link". [SQLSTATE 01000] (Error 7412).
It only started after the jobs were updated to run continuously with the WAITFOR loop, rather than running once a minute as & # 39; Separate SQL Agent Job Execution. This change was made to avoid excessive logging of SQL jobs and to monitor ATM transactions much closer to real time.
There is no consistency with the frequency at which they will fail, but it is at least a few times a day.
* I say "almost" above because one of the jobs has never failed. It is the one that receives most often "occurrences". When this happens, the work step is allowed to end. So we have explicit logging in the SQL job logs. (He then returns to the first step after a 1 second delay.)
All jobs are configured to run every minute. Thus, after a breakdown, work will resume at the beginning of the next minute. This will not have a significant impact on things. It's just very annoying!
I suspect that the native SQL client and the way it implements bundling of connections are causing the problem. During testing, I updated the linked server in one of two procedures so that it was different from the other. Two different host names, but still the same server (I just use the hosts file to create a new name). This resulted in the failure of one job, but two of the other three jobs still failed intermittently. After changing the other procedure to use the same linked server again, all jobs fail again. Because connection pools are specific to the connection string, a different server name would result in the use of another pool.
With the help of the Windows Resource Monitor, I find that three TCP connections are usually open on the linked server. These remain open on the same ports for several minutes, so I guess it uses a connection pool type. When the error occurs, it coincides with the closing of one of the connections and, soon after, a new connection opens.
My current theory: Since the processes repeat themselves in loop when using the linked server, something about the connection pool causes a situation in which it tries to use a connection that has been closed for some reason. Perhaps the only work step that "ends" from time to time is refreshed one way or another and prevents the error from occurring on that specific process.
I've tried to encapsulate the procedure calls stored in sp_executesql as a workaround for "forcing" another process or execution context, but that did not help.
Both servers are Windows 2012 running on VMWare.
Any ideas on how to solve more?