hung postmaster when client machine dies?
Mark Harrison
2004-01-30 01:00:27 UTC
We recently had an incident where a Linux box went down with
a kernel error. A process on the box had an open connection
to a postgres session and was in a transaction.

We noticed the situation when other processes connected to
postgres stopped responding.

We observed that there was a postmaster process connected to the IP
address of the downed machine with an "in transaction" status. Killing
that process unblocked the other processes.

Is this expected behavior? Was postgres simply waiting for
a failure from the TCP/IP layer?

We're now running a watchdog process that pings each machine for
which a postmaster process is running and kills that process if
the machine is not contactable for a certain period of time.
Thanks to whoever made the status information show up in ps
output!
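
A minimal sketch of the selection step such a watchdog might use, assuming
input lines shaped like `ps -eo pid,args` output and a backend title of the
form "postgres: user database host activity". The function name
stale_backends, the is_reachable callback, and the sample addresses are
hypothetical; probing the host and terminating the backend are left out.

```python
import re

# Assumed line shape, e.g. from `ps -eo pid,args`:
#   "1234 postgres: mark prod 10.0.0.7 idle in transaction"
# Some releases append the client port to the host: "10.0.0.7(51234)".
BACKEND_RE = re.compile(
    r"^\s*(?P<pid>\d+)\s+postgres:\s+(?P<user>\S+)\s+(?P<db>\S+)\s+"
    r"(?P<host>[^\s(]+)(?:\(\d+\))?\s+(?P<activity>.*)$"
)


def stale_backends(ps_lines, is_reachable):
    """Return (pid, host) pairs for backends that are in a transaction
    and whose client host fails the caller-supplied reachability check."""
    stale = []
    for line in ps_lines:
        match = BACKEND_RE.match(line)
        if match is None:
            continue  # not a client backend line
        host = match.group("host")
        if host == "[local]":
            continue  # Unix-socket client, nothing to ping
        if "in transaction" not in match.group("activity"):
            continue  # no open transaction, so no locks held on its behalf
        if not is_reachable(host):
            stale.append((int(match.group("pid")), host))
    return stale
```

A caller would pass a function that pings the host (ideally only reporting
failure after several missed probes) and then send the backend a normal
termination signal rather than `kill -9`.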

Unfortunately, we didn't capture the process data... if this
would help we can attempt to reproduce the situation.

Many TIA,
Mark
--
Mark Harrison
Pixar Animation Studios
Emeryville, CA


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Jeff
2004-01-30 15:18:07 UTC
On Thu, 29 Jan 2004 17:00:27 -0800
Post by Mark Harrison
> We observed that there was a postmaster process connected to the IP
> address of the downed machine with an "in transaction" status. Killing
> that process unblocked the other processes.
> Is this expected behavior? Was postgres simply waiting for
> a failure from the TCP/IP layer?
When a machine simply "goes away" (crashed, unplugged), no packets are
sent indicating the socket has closed, so PG doesn't know the socket is
closed and cannot roll back the transaction. (This is true of any
TCP-based protocol.)

If that connection had a transaction open, it'll likely have rows locked.
Killing the PG process caused it to roll back that transaction, freeing
those locks. Eventually PG would have found out the socket was dead, most
likely when it tried to write to it.
--
Jeff Trout <***@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/

Tom Lane
2004-01-30 15:42:08 UTC
Post by Jeff
> Eventually PG would have found out the socket was dead.
We do enable TCP keepalive on the client socket, so eventually the
kernel will time out and notify us of a lost connection. Unfortunately
the timeouts involved are long --- IIRC the relevant RFCs specify at
least an hour.
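
For illustration, a sketch of how a program can enable keepalive on a single
socket and, where the platform exposes the options, shorten the timers for
that socket only. RFC 1122 requires the default idle interval to be no less
than two hours. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are
not available on every platform, so the code checks for them first; the
enable_keepalive name and the timer values are arbitrary assumptions, and
this is not PostgreSQL's own code.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive on one socket and tune its timers if possible.

    Returns a dict of the per-socket timer options that were applied.
    """
    # Portable part: ask the kernel to probe this connection when idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = {}
    wanted = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue  # this platform does not expose the option
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = value
    return applied
```

With these example values, a dead peer would be detected after roughly
idle + interval * count seconds, without touching the system-wide defaults.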

regards, tom lane

Goulet, Dick
2004-01-30 15:42:42 UTC
Hope you don't mind if I disagree. Most OSes that have a TCP/IP layer also have a parameter therein called tcp_keep_alive. They also set this parameter to infinity. The purpose of tcp_keep_alive is to have the OS kernel periodically verify that all TCP/IP connections it is managing are still functioning. Basically the OS sends a probe packet down the line to the client machine. If it bounces back the connection is dead & the OS can do what it has to, which will then notify postmaster just like you manually did. I'd contact your OS vendor for information on what tcp_keep_alive is set to by default and how you can change it.

Dick Goulet
Senior Oracle DBA
Oracle Certified 8i DBA

Peter Galbavy
2004-02-07 10:54:55 UTC
Post by Goulet, Dick
> Hope you don't mind if I disagree. Most OSes that have a TCP/IP
> layer also have a parameter therein called tcp_keep_alive. They also
> set this parameter to infinity. The purpose of tcp_keep_alive is to
> have the OS kernel periodically verify that all TCP/IP connections it
> is managing are still functioning. Basically the OS sends a probe
> packet down the line to the client machine. If it bounces back the
> connection is dead & the OS can do what it has to, which will then
> notify postmaster just like you manually did. I'd contact your OS
> vendor for information on what tcp_keep_alive is set to by default
> and how you can change it.
On the other hand, I suggest that you *do not* change this value, even if
you know how, without carefully understanding what it means.

Changing the system-wide TCP keepalive time can have unforeseen effects on
other, unrelated functions of the systems concerned.

This is why most long-lived protocols layered over TCP also tend to
have their own keepalive PDUs.
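
As a rough sketch of that idea, an application-level keepalive boils down to
a timer: each side records when it last heard from its peer, sends a small
"ping" message once the line has been quiet for a while, and declares the
peer dead after a longer silence. The HeartbeatMonitor class, its method
names, and the timings are hypothetical and do not describe any particular
protocol.

```python
import time


class HeartbeatMonitor:
    """Track peer silence and decide when to ping or give up."""

    def __init__(self, ping_after=30.0, dead_after=90.0, clock=time.monotonic):
        if dead_after <= ping_after:
            raise ValueError("dead_after must be greater than ping_after")
        self.ping_after = ping_after
        self.dead_after = dead_after
        self._clock = clock
        self._last_heard = clock()

    def heard_from_peer(self):
        """Call whenever any message (data or ping reply) arrives."""
        self._last_heard = self._clock()

    def silence(self):
        """Seconds since the peer was last heard from."""
        return self._clock() - self._last_heard

    def should_ping(self):
        """True once the line has been quiet long enough to warrant a probe."""
        return self.ping_after <= self.silence() < self.dead_after

    def peer_is_dead(self):
        """True once the peer has been silent past the give-up threshold."""
        return self.silence() >= self.dead_after
```

Because the timers live in the application, they can be tuned per
connection without affecting other services on the same host.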

Peter


