最近处理一个长时间运行的job,记录一下处理过程
SQL>SELECT * FROM DBA_JOBS_RUNNING;
SID JOB FAILURES LAST_DATE LAST_SEC THIS_DATE THIS_SEC INSTANCE
------- ---------- ---------- ------------------ ------------- ------------- ------------- ----------
1054 127807 0 20-SEP-23 02:00:04 21-SEP-23 02:00:02 0
可以看到这个job当前等待在TCP Socket (KGAS) ,等待时间已经长达131239秒
SQL> select sid,serial#,sql_id,program,username,event,p1,p2,p3,state,seconds_in_wait from v$session where sid=1054;
SID SERIAL# SQL_ID PROGRAM USERNAME EVENT P1 P2 P3 STATE SECONDS_IN_WAIT
---------- ---------- ------------- ------------------------------ -------------------- -------------------- ---------- ---------- ---------- -------------------- ---------------
1054 41311 arg0arysf30mj oracle@gsi2database (J007) SKB2ADMIN TCP Socket (KGAS) 6 0 0 WAITING 131239
这是oracle官方对此事件的说明:
TCP Socket (KGAS)
A session is waiting for an external host to provide requested data over a network
socket. The time that this wait event tracks does not indicate a problem, and even a
long wait time is not a reason to contact Oracle Support. It naturally takes time for
data to flow between hosts over a network, and for the remote aspect of an application
to process any request made to it. An application that communicates with a remote
host must wait until the data it will read has arrived.
很明显,问题出在数据库之外。 上述跟踪输出中的图p1=6,表示会话正在等待远程主机发回的数据。 切换到操作系统工具。 看看是哪台远程主机,这时我们需要数据库会话的spid。
SQL> SELECT SPID FROM V$PROCESS WHERE ADDR IN (
SELECT PADDR FROM V$SESSION WHERE SID IN (SELECT SID FROM DBA_JOBS_RUNNING));
SPID
------------
17380
strace是一个有用的工具,追踪一下这个spid
[root@gsi2database ~]# strace -p 17380
Process 17380 attached - interrupt to quit
...
getrusage(RUSAGE_SELF, {ru_utime={6, 451019}, ru_stime={0, 166974}, ...}) = 0
poll([{fd=24, events=POLLIN|POLLRDNORM}, {fd=25, events=POLLIN|POLLRDNORM}], 2, 30000) = 0 (Timeout)
...
Strace提供了进程正在等待什么的线索,检查一下文件描述符 25(上面输出中的fd=25)
[oracle@gsi2database udump]$ ll /proc/17380/fd|grep 25
lrwx------ 1 oracle oinstall 64 Sep 22 13:49 25 -> socket:[35135988]
如同我们之前看到的等待事件TCP Socket (KGAS) ,这是一个套接字。 下一步是确定与套接字建立了什么连接。
[oracle@gsi2database udump]$ /usr/sbin/lsof |head -n 1|/usr/sbin/lsof|grep 35135988
oracle 17380 oracle 25u IPv4 35135988 0t0 TCP 172.19.194.2:42205->172.19.160.52:smtp (ESTABLISHED)
明显这是一个smtp 服务器问题导致数据库会话在尝试发送通知时挂起。 SMTP服务器修复后恢复正常