This one I’m sharing here in the hope that others won’t have to go as crazy as we had to.
I have very talented people on my team, and we’re used to Big Scale: Gumtree is the #12 site in AU, only behind giants like Google, Facebook, etc… And my team is fewer than 50 all in, not hundreds or thousands. (OK, I’ll quit bragging about my team now :P )
Still, it took a toll on a group of us to find this issue, and then to fix it. I hope Google shows it to you in your time of need.
In some versions of MySQL (newer than 5.1, as far as we know; we never saw this issue in 5.1), there’s an issue where, if many connections are attempted at the same time, the server will accept some socket connections but never send its handshake. This caused big problems for our DB pool management (the great HikariCP): the “pool filler” thread would get stuck, the pool would eventually starve for connections, and the application would grind to a halt.
The solution was to upgrade to the MySQL Connector/J driver version 5.1.28 or later and set the obscure LoginTimeout.
The problem as we lived it
The way we experienced this was through instability on the API. We identified that at some point a Tomcat instance would hit an “unknown issue” and go crazy (yeah… I know, that’s the technical term). When that happened:
- Error rate of the instance went through the roof,
- Speed went to the ground, and
- More worryingly, the instance didn’t recover, even after taking it down on the Load Balancer, and therefore
- The only solution was to restart that instance.
Now… of all that, the one thing that really worried me was the “didn’t recover, even after taking it down on the Load Balancer” part. Because that’s a big problem.
Managing stability at scale
Let me share a bit about managing stability at scale: if you think that systems are deterministic, you’ve simply not pushed them hard enough. There’s a reason why they tell you to code for failure: as the complexity and scale of a system increase, it’s impossible to maintain a zero error rate.
Therefore you should aim for a system that can recover from errors and that “tends” to equilibrium.
To understand why, I like the classification of Systems that the Cynefin framework establishes: If your system is small enough you probably know how all the pieces fit together and are in the realm of a Simple/Obvious System. As it grows in features and business logic, it becomes Complicated.
But when big scale kicks in, it becomes a Complex System: “in which the relationship between cause and effect can only be perceived in retrospect, but not in advance” (from the Wiki). If you’re not capable of managing that complexity it’ll become a Chaotic System or one of Disorder… and in the world of Software Systems that means that you can’t guarantee availability.
How do you build systems that tend to equilibrium? Well, there are some architectural patterns for this: particularly useful are the Circuit Breaker (there’s a good article on the Netflix blog), retries, and simply being very conscious about all the timeouts in your system.
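To make the Circuit Breaker idea concrete, here is a deliberately minimal sketch. The names and thresholds are mine for illustration; a real implementation (like Netflix’s Hystrix, or Resilience4j) adds a proper half-open state, sliding windows, and fallbacks.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit-breaker sketch (illustrative only, not a production API).
// After `failureThreshold` consecutive failures the breaker opens and
// rejects calls until `openDuration` has elapsed, letting the dependency
// recover instead of piling up blocked threads.
public class CircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    // Returns true if the protected call may proceed.
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                // Simplified "half-open": allow a trial request through.
                state = State.CLOSED;
                consecutiveFailures = 0;
                return true;
            }
            return false;
        }
        return true;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    public synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = Instant.now();
        }
    }
}
```

The key property for “tending to equilibrium” is that an open breaker fails fast: callers get an immediate error they can handle, instead of a thread parked on a timeout.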
How we identified the issue
Having good tools to look at production issues is critical. You don’t necessarily need complex tools, but being able to connect to a production JVM with JVisualVM, do some profiling, or closely monitor GC, memory, threads, etc. is very important.
In our case three tools were really useful:
This was the first thing we did. It’s great because it can very quickly give you a direction to follow.
All right, it’s clear that ConcurrentBag.borrow() shouldn’t be topping the list. So, clearly, at least one of our issues is with DB connection pooling.
Our first suspect was a bug in HikariCP, but after a good search we couldn’t find anything that would explain this issue.
When an instance went “down”, the logs were filled with exceptions like:
Which meant that it had tried to obtain a connection from the DB connection pool and hadn’t been able to get one in 30 seconds! (This timeout is, in all honesty, a very poor default of HikariCP.)
That made me look into the stats of the DB connection pool (HikariCP has good integration with Codahale metrics), and I quickly realised that the pool was empty!
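Both of these observations map to pool configuration. A sketch of the relevant HikariCP settings (this is a configuration fragment; the URL, timeout value, and pool size are illustrative, not our production values):

```java
// Assumes HikariCP and Codahale/Dropwizard Metrics on the classpath:
// com.zaxxer.hikari.HikariConfig / HikariDataSource,
// com.codahale.metrics.MetricRegistry.
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://db-host:3306/mydb"); // hypothetical host/schema

// Fail fast instead of parking callers for the 30 s default.
config.setConnectionTimeout(5_000); // milliseconds to wait for a pooled connection

// Wiring in a MetricRegistry exposes per-pool gauges (active, idle,
// pending) so an empty pool is visible on a dashboard immediately.
config.setMetricRegistry(new MetricRegistry());

HikariDataSource dataSource = new HikariDataSource(config);
```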
Then we took a thread dump, and saw this thread:
Which made sense; I mean, HikariCP should be filling the pool. But when we took another thread dump a minute or so later, we saw that this thread was still there, locked on the same objects (same hashes). So it wasn’t just adding more connections: this thread was stuck.
At this point we were using version 5.1.16 of the MySQL Connector driver, and when we downloaded the source code and traced it to the code lines in the stack trace, we realised that it was stuck waiting for the server to send its handshake.
Why, MySQL? Why won’t you shake my hand?
Why, under certain circumstances, MySQL would accept the connection but never send the handshake was, and still is, a bit of a question mark.
We looked into many possibilities, from improperly configured parameters like the ones described in this great article from Percona, to a possible misbehaviour of the Load Balancer.
I found no direct acknowledgment of this issue in MySQL’s bug tracking system, but did find this bug report, and it was exactly what we were experiencing:
- The DriverManager.setLoginTimeout method is not honored, making it very difficult to handle situations where the MySQL Server gets in a bad state and accepts new connections but never process the handshake. I experienced this just recently, and the end result was that all my Connection threads became hung trying to connect. I tried using DriverManager.setLoginTimeout, but after looking through the code, I can its not used. I then tried using connectTimeout, but that only works if the socket creation takes longer than the timeout, its not used during the handshake. Finally, I tried socketTimeout, which worked, but the problem with this solution is that its set for the lifetime of the socket. Any further commands, such as long running queries will also results in timeouts occurring, unfortunately negating this solution.
That was it… that was exactly what we were experiencing, and the same reason why we hadn’t just introduced a low Socket Timeout as a solution to the problem.
That bug was reported against v5.1.25 of the MySQL Connector and, following the comments, the fix was merged in v5.1.28. So I downloaded the latest version at the time, 5.1.38, and sure enough these were the two methods that fixed our problem (in com.mysql.jdbc.StandardSocketFactory):
This method sets a Socket Timeout that will be effective only for the duration of the handshake.
This one sets it back to whatever it was before the handshake.
Therefore we were able to set a LoginTimeout of 1 second (it can only be set in whole seconds) and a socket timeout that we felt would be long enough even for batch-like queries.
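Put together, the fix looks roughly like this. The host, schema, and the socket timeout value in the URL are placeholders, not our production settings; the part the source confirms is that with Connector/J 5.1.28+ the login timeout is finally honoured during the handshake.

```java
import java.sql.DriverManager;

public class MysqlTimeouts {
    public static void main(String[] args) {
        // With Connector/J 5.1.28+, DriverManager.setLoginTimeout is honoured
        // for the handshake, so a server that accepts the socket but never
        // sends its handshake fails the connect after 1 second instead of
        // hanging the pool-filler thread forever. Seconds only, JVM-wide.
        DriverManager.setLoginTimeout(1);

        // socketTimeout (milliseconds) still guards established connections;
        // it should be generous enough for batch-like queries. Illustrative:
        String url = "jdbc:mysql://db-host:3306/mydb?socketTimeout=600000";

        System.out.println(DriverManager.getLoginTimeout()); // prints 1
    }
}
```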
Some stats from prod
Ok… to put this in perspective, let me share one image that shows the number of connections by state in one of the Tomcat instances over time.
So, here is how to read this graph:
Before 5/5, we used to manage separate connection pools per DB schema. This decision was taken a long time ago by another team, thinking that in the future, when scalability concerns kicked in, putting different schemas in different clusters would be a good way to scale.
Not a bad idea, but so far it has proven unnecessary. In any case, we implemented a simple wrapper around the connection pool that sets the Catalog/Schema to the one needed when borrowing a connection, and we consolidated the connection pools down to the number of clusters rather than the number of DB schemas. And that’s why, in general, the number of connections drops dramatically after 5/5.
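The wrapper is simple in spirit; here is a sketch. The class and interface names are mine for illustration, not our actual code, and a real version would also reset the catalog when the connection is returned to the pool.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Sketch of a schema-switching pool wrapper (illustrative names). One
// physical pool serves many schemas: on borrow, the wrapper points the
// connection at the schema the caller needs.
public class SchemaAwarePool {

    // Stand-in for the underlying pool's borrow operation
    // (e.g. HikariDataSource::getConnection).
    public interface ConnectionSource {
        Connection get() throws SQLException;
    }

    private final ConnectionSource pool;

    public SchemaAwarePool(ConnectionSource pool) {
        this.pool = pool;
    }

    // Borrow a connection and switch it to the requested schema.
    public Connection borrow(String schema) throws SQLException {
        Connection connection = pool.get();
        connection.setCatalog(schema); // in MySQL, catalog == schema/database
        return connection;
    }
}
```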
The red piece means stuff is going on: those are clients of HikariCP waiting for a connection. It’s hard to see what’s happening in that graph before 5/5 because those stats are aggregates of the different pools, so look at Figure 3 for a zoom-in on one connection pool alone.
What you can see here is that the pool is depleted!!! There are no connections in the pool.
After 5/6 you see all the red go away from Figure 2. That’s when we deployed to prod the upgrade to the latest MySQL Connector/J along with the LoginTimeout setting.
In the end
In the end, we came out better and stronger from this bug. We’re utilising our resources way better now and are much more resilient: the same JVM is now capable of recovering even after we bash it with an absurd number of connections in L&P for hours.
We also followed some of the advice from the About Pool Sizing article and went much more restrictive on the sizes of the DB pools and Tomcat thread pools. Thanks to all this, we now have, again, a system that tends to equilibrium :)