Psycopg 2.2.x issue with PgBouncer

Discussion:

Jason Yan

2010-07-23 23:45:21 UTC

I'm seeing an issue with how Psycopg 2.2.x handles connections with
PgBouncer. After a connection is closed, PgBouncer spits out the
following:
  2010-07-23 16:33:46.407 5282 LOG C-0x1e9ad80:
test/postgres-savSHZN5Fh8qMp+***@public.gmane.org:33306 login successful: db=test user=postgres
  2010-07-23 16:33:46.407 5282 LOG S-0x1eb8280:
test/postgres-savSHZN5Fh8qMp+***@public.gmane.org:5432 new connection to server
  2010-07-23 16:33:46.411 5282 LOG C-0x1e9ad80:
test/postgres-savSHZN5Fh8qMp+***@public.gmane.org:33306 closing because: client close request
(age=0)
  2010-07-23 16:33:46.411 5282 LOG S-0x1eb8280:
test/postgres-savSHZN5Fh8qMp+***@public.gmane.org:5432 closing because: unclean server (age=0)

This kills PgBouncer's server backend which effectively kills
connection pooling. I've reproduced this behavior with both 2.2.1 and
2.2.2, but I've verified that 2.0.14 does not have this issue. I'm
running a clean build of PgBouncer 1.3.3 with the default
configuration on Ubuntu 9.04 64-bit with the following PostgreSQL
packages:
  ii postgresql 8.3.11-0ubuntu9.04
object-relational SQL database
(supported ve
  ii postgresql-8.3 8.3.11-0ubuntu9.04
object-relational SQL database, version
8.3
  ii postgresql-client-8.3 8.3.11-0ubuntu9.04
front-end programs for PostgreSQL 8.3
  ii postgresql-client-common 97
manager for multiple PostgreSQL client
versi
  ii postgresql-common 97
PostgreSQL database-cluster manager
  ii postgresql-contrib-8.3 8.3.11-0ubuntu9.04
additional facilities for PostgreSQL

And the small test script that I'm reproducing the issue with:
  import psycopg2
  connection = psycopg2.connect("dbname=test user=postgres
host=localhost port=6432")
  cursor = connection.cursor()
  cursor.execute("SELECT 1")
  connection.close()

Thanks,
Jason

Jan Urbański

2010-07-24 14:18:59 UTC

Permalink

Post by Jason Yan
I'm seeing an issue with how Psycopg 2.2.x handles connections with
PgBouncer.

Hi!

Thanks for the report, I think it's a psycopg2 regression. With the
merging of the green branch psycopg stopped issuing a ROLLBACK when the
connection is garbage collected while still open. PFA the patches that
fix this.

Post by Jason Yan
import psycopg2
connection = psycopg2.connect("dbname=test user=postgres
host=localhost port=6432")
cursor = connection.cursor()
cursor.execute("SELECT 1")
connection.close()

As an immediate workaround you can add connection.rollback() or
connection.commit() before your connection.close().

I fixed this problem in my repo, in the rollback_on_exit branch:

http://git.wulczer.org/?p=psycopg2.git;a=shortlog;h=ref/head/rollback_on_exit

However that triggered interesting issues with the green connection
tests. In short, when deallocating a connection, conn_close() is called.
It then tries to execute a ROLLBACK, which in case of green connections
leads to calling the user-defined wait callback. After the callback
returns, as the connection has zero references, it gets deallocated
again, recursively, which leads to all sorts of problems.

I fixed it in two ways. First, the code keeps a flag that tells if
deallocation is underway. If the flag is set, the deallocation function
is a noop (this prevents recursive invocation).

Second, if deallocation is underway, the green execution method is never
used (it always blocks). That's not ideal, but otherwise it would mean
calling the user-defined callback with the connection object as an
argument that has zero refs... Not sure how this could be fixed.

Note that with the second fix the first is not needed, as we bypass
green execution mode and so won't trigger recursive dealloc. I still
think it's useful to have a guard there, in case later on we start doing
someting in the connection_dealloc function that will cause similar issues.

Cheers,
Jan

Daniele Varrazzo

2010-07-25 00:43:33 UTC

Permalink

Post by Jan UrbaÅski
I fixed it in two ways. First, the code keeps a flag that tells if
deallocation is underway. If the flag is set, the deallocation function
is a noop (this prevents recursive invocation).
Second, if deallocation is underway, the green execution method is never
used (it always blocks). That's not ideal, but otherwise it would mean
calling the user-defined callback with the connection object as an
argument that has zero refs... Not sure how this could be fixed.
Note that with the second fix the first is not needed, as we bypass
green execution mode and so won't trigger recursive dealloc. I still
think it's useful to have a guard there, in case later on we start doing
someting in the connection_dealloc function that will cause similar issues.

Thank you for these patches. I think not dropping into the wait
callback is compulsory when the refcount of the connection is zero.
This leads to the only solution of issuing the final rollback in
blocking mode. I think it's perfectly fine, as people can always issue
a manual close before gc-ing the connection. I would add a note about
that in the documentation.

-- Daniele

Federico Di Gregorio

2010-07-25 09:50:56 UTC

Permalink

Post by Daniele Varrazzo

Merged. Given this fixes a regression I'll do a release ASAP.

federico

--
Federico Di Gregorio fog-NGVKUo/i/***@public.gmane.org
Domani si vede domani. -- Alessandra O'Hara

Marko Kreen

2010-07-28 15:28:34 UTC

Permalink

Post by Federico Di Gregorio

Post by Daniele Varrazzo

Merged. Given this fixes a regression I'll do a release ASAP.

This behavior is bad. The original user code was buggy - it closed
the connection while in mid-transaction. PgBouncer was correct
to drop the connection to force the server to do rollback.

I recently had use-case in production env where such behaviour
(making buggy client act as non-buggy client) was clearly broken:

There was COPY-to-COPY pipeline, where target side got error.
Now if I do .close() on source side, what happens? It starts
to consume data until command end... Which is silly.

Same can happen with big (async?) SELECT, where it's impossible
to get rid of the connection. Even __del__ will consume data...

Only way to get out of such situation would be:

os.close(db.cursor().fileno())

but it seems silly to force users to use such workaround.

My conclusion: .close() and .__del__() should close the
connection immediately, without any workarounds, to guarantee
predictable behavior. There is no need to drain data to
issue ROLLBACK, because mid-trancation disconnect will
do that anyway.

--
marko

Daniele Varrazzo

2010-08-02 00:33:16 UTC

Permalink

This behavior is bad. The original user code was buggy - it closed
the connection while in mid-transaction. PgBouncer was correct
to drop the connection to force the server to do rollback.
I recently had use-case in production env where such behaviour
There was COPY-to-COPY pipeline, where target side got error.
Now if I do .close() on source side, what happens? It starts
to consume data until command end... Which is silly.
Same can happen with big (async?) SELECT, where it's impossible
to get rid of the connection. Even __del__ will consume data...
os.close(db.cursor().fileno())
but it seems silly to force users to use such workaround.
My conclusion: .close() and .__del__() should close the
connection immediately, without any workarounds, to guarantee
predictable behavior. There is no need to drain data to
issue ROLLBACK, because mid-trancation disconnect will
do that anyway.

For me doing the least possible on __del__, mostly don't try to issue
queries, is a sensible approach.

OTOH, the result of the bug was exactly what Marko suggests: issuing
PQfinish without a ROLLBACK, which has been reported not playing well
with PgBouncer.

So, is this the tradeoff we should consider? trying to ROLLBACK on
close() makes us try to do a potential big amount of work in the wrong
moment; closing brutally makes pooling impossible.

I don't know PgBouncer at all so I'm asking Jason -- or whoever can
answer: is just the badly-closed connection to be forced out of the
pool or is the "unclean server" is serious enough to make it consider
the entire pool inconsistent?

On the base that you can always call rollback() before closing/gc-ing
the connection, I'm actually tempted to make the buggy behaviour "a
feature".

-- Daniele

Marko Kreen

2010-08-02 07:21:31 UTC

Permalink

Post by Daniele Varrazzo

Post by Marko Kreen
This behavior is bad. The original user code was buggy - it closed
the connection while in mid-transaction. PgBouncer was correct
to drop the connection to force the server to do rollback.
I recently had use-case in production env where such behaviour
There was COPY-to-COPY pipeline, where target side got error.
Now if I do .close() on source side, what happens? It starts
to consume data until command end... Which is silly.
Same can happen with big (async?) SELECT, where it's impossible
to get rid of the connection. Even __del__ will consume data...
os.close(db.cursor().fileno())
but it seems silly to force users to use such workaround.
My conclusion: .close() and .__del__() should close the
connection immediately, without any workarounds, to guarantee
predictable behavior. There is no need to drain data to
issue ROLLBACK, because mid-trancation disconnect will
do that anyway.

For me doing the least possible on __del__, mostly don't try to issue
queries, is a sensible approach.

Yes.

Post by Daniele Varrazzo
OTOH, the result of the bug was exactly what Marko suggests: issuing
PQfinish without a ROLLBACK, which has been reported not playing well
with PgBouncer.

For pgbouncer to "play well" with such behaviour, it would need
to have quite a lot more complex protocol state-machine.

Remember, the client connection can go down at any point during
packet exchange. When can the connection be recovered and when not?

Post by Daniele Varrazzo
So, is this the tradeoff we should consider? trying to ROLLBACK on
close() makes us try to do a potential big amount of work in the wrong
moment; closing brutally makes pooling impossible.

Well, technically you could issue ROLLBACK if you are sure the
connection is <idle in transaction> - that there are no ongoing
queries going on. This avoids the unnecessary draining problem.

But even that can suck - lets say I have 2 connections open, both
in transaction. I issue a query on one, but network goes down.
The query gets network timeout, now i close both connections.
(Or __del__ ...) Why do I need to wait for network timeout on other
connection too?

Post by Daniele Varrazzo
I don't know PgBouncer at all so I'm asking Jason -- or whoever can
answer: is just the badly-closed connection to be forced out of the
pool or is the "unclean server" is serious enough to make it consider
the entire pool inconsistent?

Only one connection would be dropped.

Post by Daniele Varrazzo
On the base that you can always call rollback() before closing/gc-ing
the connection, I'm actually tempted to make the buggy behaviour "a
feature".

You seem to forget that it can also call .commit(). And in
autocommit mode the issue should not even arise, but I'm
not sure about current code...

Anyway, there is only one sort-of-valid excuse for dropping .rollback:
to avoid network roundtrip, as postgres does it on close anyway.

But that plays badly with pooling, as you noticed.

And when db driver does it anyway, the excuse is void.

IOW, doing anything else then PQfinish in .close() [not even speaking
about .__del__()] is wrong.

--
marko

Daniele Varrazzo

2010-08-02 11:45:00 UTC

Permalink

Post by Marko Kreen

I don't know PgBouncer at all so I'm asking Jason -- or whoever can
answer: is just the badly-closed connection to be forced out of the
pool or is the "unclean server" is serious enough to make it consider
the entire pool inconsistent?

Only one connection would be dropped.

On the base that you can always call rollback() before closing/gc-ing
the connection, I'm actually tempted to make the buggy behaviour "a
feature".

You seem to forget that it can also call .commit(). And in
autocommit mode the issue should not even arise, but I'm
not sure about current code...

No, I just mentioned rollback because this would replicate, from a
data state PoV, what happens to the server both if the client closes
the connection with a straight PQfinish and what used to happen in
psycopg on close()/GC before introducing the bug. commit() would
obviously close the transaction too but replacing close() with
rollback() (or prepending a rollback() call) would keep the semantic
of existing programs untouched, while giving PgBouncer the correct
message.

If this is agreed, I can change the documentation to better explain
what happens on close() and underline the need of closing with a
closed transaction if middleware such as PgBouncer is involved. I'd
like to fix the other issue you mentioned too (infinite loop on error)
before a release.

Jason, can you confirm that adding a rollback() makes your system work
again as expected? Thank you.

-- Daniele

Daniele Varrazzo

2010-08-05 00:32:59 UTC

Permalink

Post by Marko Kreen
You seem to forget that it can also call .commit(). And in
autocommit mode the issue should not even arise, but I'm
not sure about current code...

I've checked and the autocommit mode creates no problem to pgbouncer.

If fog agrees to this change, I've prepared a patch
(http://tinyurl.com/36qufm4) dropping the currently failing attempt to
rollback and fixing the documentation for connection.close(), where
I've added:

Changed in version 2.2: previously an explicit ROLLBACK was issued by
Psycopg on close(). The command could have been sent to the backend at
an inappropriate time, so Psycopg currently relies on the backend to
implicitly discard uncommitted changes. Some middleware are known to
behave incorrectly though when the connection is closed during a
transaction (when status is STATUS_IN_TRANSACTION), e.g. PgBouncer
reports an unclean server and discards the connection. To avoid this
problem you can ensure to terminate the transaction with a
commit()/rollback()
before closing.

-- Daniele

Devrim GÜNDÜZ

2010-08-04 10:48:41 UTC

Permalink

Where are we on this? I will move to new servers next week with 2.2.X
series. I can rollback to 2.0.14, but I'd prefer to stick to 2.2.X
series. Was this fixes somewhere?

Post by Jason Yan
I'm seeing an issue with how Psycopg 2.2.x handles connections with
PgBouncer. After a connection is closed, PgBouncer spits out the
(age=0)
This kills PgBouncer's server backend which effectively kills
connection pooling. I've reproduced this behavior with both 2.2.1 and
2.2.2, but I've verified that 2.0.14 does not have this issue. I'm
running a clean build of PgBouncer 1.3.3 with the default
configuration on Ubuntu 9.04 64-bit with the following PostgreSQL
ii postgresql 8.3.11-0ubuntu9.04
object-relational SQL database
(supported ve
ii postgresql-8.3 8.3.11-0ubuntu9.04
object-relational SQL database, version
8.3
ii postgresql-client-8.3 8.3.11-0ubuntu9.04
front-end programs for PostgreSQL 8.3
ii postgresql-client-common 97
manager for multiple PostgreSQL client
versi
ii postgresql-common 97
PostgreSQL database-cluster manager
ii postgresql-contrib-8.3 8.3.11-0ubuntu9.04
additional facilities for PostgreSQL
import psycopg2
connection = psycopg2.connect("dbname=test user=postgres
host=localhost port=6432")
cursor = connection.cursor()
cursor.execute("SELECT 1")
connection.close()
Thanks,
Jason
_______________________________________________
Psycopg mailing list
http://lists.initd.org/mailman/listinfo/psycopg

--
Devrim GÃNDÃZ
PostgreSQL DanÄ±ÅmanÄ±/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz

Daniele Varrazzo

2010-08-04 11:20:42 UTC

Permalink

Post by Devrim GÃNDÃZ
Where are we on this? I will move to new servers next week with 2.2.X
series. I can rollback to 2.0.14, but I'd prefer to stick to 2.2.X
series. Was this fixes somewhere?

If you ensure to have a commit/rollback before close/GC of the
connection, PgBouncer should work fine with all the psycopg versions.
If you can make this change and run the program with the current 2.2.2
a feedback would be very appreciated.

I haven't heard from fog if he prefers to have 2.2.3 reproducing the
behavior of pre 2.2, where a rollback was automatically issued on
close(), or if drop it for the reasons discussed in this thread.

-- Daniele

Federico Di Gregorio

2010-08-17 07:17:07 UTC

Permalink

Post by Daniele Varrazzo

Post by Devrim GÃNDÃZ
Where are we on this? I will move to new servers next week with 2.2.X
series. I can rollback to 2.0.14, but I'd prefer to stick to 2.2.X
series. Was this fixes somewhere?

If you ensure to have a commit/rollback before close/GC of the
connection, PgBouncer should work fine with all the psycopg versions.
If you can make this change and run the program with the current 2.2.2
a feedback would be very appreciated.
I haven't heard from fog if he prefers to have 2.2.3 reproducing the
behavior of pre 2.2, where a rollback was automatically issued on
close(), or if drop it for the reasons discussed in this thread.

Hi *,

before going on vacancy I didn't do the release because there was a
discussion going on. From what I read it seems to me that rolling back
the changes on master and documenting the behaviour is the best way to
fix that. Right?

federico

--
Federico Di Gregorio fog-NGVKUo/i/***@public.gmane.org
Questa lista non pretende di essere il riferimento torinese per il
software libero o altre cazzate del genere. -- Silvano, su torte

Marko Kreen

2010-08-17 12:23:29 UTC

Permalink

Post by Federico Di Gregorio

Post by Daniele Varrazzo

Post by Devrim GÃNDÃZ
Where are we on this? I will move to new servers next week with 2.2.X
series. I can rollback to 2.0.14, but I'd prefer to stick to 2.2.X
series. Was this fixes somewhere?

If you ensure to have a commit/rollback before close/GC of the
connection, PgBouncer should work fine with all the psycopg versions.
If you can make this change and run the program with the current 2.2.2
a feedback would be very appreciated.
I haven't heard from fog if he prefers to have 2.2.3 reproducing the
behavior of pre 2.2, where a rollback was automatically issued on
close(), or if drop it for the reasons discussed in this thread.

Hi *,
before going on vacancy I didn't do the release because there was a
discussion going on. From what I read it seems to me that rolling back
the changes on master and documenting the behaviour is the best way to
fix that. Right?

I'm not familiar with changes in master. The behaviour I was complaining
about - unexpectedly heavy .close()/.__del__() - is seen in 2.0.x.

If you are planning to document that as official feature,
then I give -1 to that.

I'd suggest pushing forward and remove any expensive operations from
.close() and document _that_ as feature.

--
marko