The "MySQL has gone away" mystery
Posted: Tue Jun 02, 2015 5:22 pm
I spent a little time this afternoon researching (again) why the MySQL server is refusing connections to New Telon after a certain period of time. I've determined there's a couple of things going on on my end, one of which I have no control but the others (high-availability, load balancing) I can spend some time trying to offer. It's good knowledge, regardless.
The info to follow is specific to the MySQL has gone away issue we see on New Telon occasionally. From this article, I can see a few scenarios why this is happening and thought I would post some thoughts on it for anyone else who experiences this non-descriptive problem.
First, the VGOEmulator problem(s) -
[quote="MySQL dot com"]The problem on Windows is that in some cases MySQL does not get an error from the OS when writing to the TCP/IP connection to the server, but instead gets the error when trying to read the answer from the connection. [/quote]
This was originally an issue long ago, and theFoof solved it by adding the mysql_ping() function to the world code.
In common/Database.cpp:
and in world/Net.cpp:
We added this ping, every 5 minutes, to keep the MySQL connection open even if there is no activity. We did not add this to LoginServer, because Login is constantly doing maintenance every minute or so, thus keeping it's MySQL connection alive.
I mention this because I noticed something specific on the Windows 7 machine (New Telon) that I do not see on the LoginServer/VGODev Server machine (Linux) and that is Windows seems to be the only one who occasionally gives this error:
[quote]16:27:57.303 D Char Loading 12 characters for account_id: 72
16:27:57.303 E Database Error running MySQL query (2006): MySQL server has gone away
SELECT `character_id`,`firstname`,`lastname`,`race`,`adventure_class`,`gender`,`adventure_level`,`current_chunk_fk`,
`hidehood`,`hidehelm`,`hidecloak`
FROM `characters`
JOIN `character_details` ON `character_id`=`char_id_fk` WHERE `account_id`=72
AND `is_deleted`=0
LIMIT 12
16:27:57.303 D Char Loaded 0 characters for account_id: 72[/quote]
Once Windows 7 gets this disconnect one time, regardless of our PingDatabase() call every 5 mins, the world never reconnects to MySQL. This is bad, and should probably be looked into (World catching this error and doing something more robust than simply trying to ping again)
The second thing I'm sure is happening (and this is environment specific):
[quote="MySQL dot com"]It is also possible to see this error if host name lookups fail (for example, if the DNS server on which your server or network relies goes down). This is because MySQL is dependent on the host system for name resolution, but has no way of knowing whether it is working—from MySQL's point of view the problem is indistinguishable from any other network timeout. [/quote]
New Telon runs on a VM on Windows 7; The MySQL database server is another stand-alone VM on another ESX host, thus a separate IP; Lastly, the web services are also their own dedicated VM, with a 3rd IP. All use an internal DNS, on yet a 4th VM and IP. In a perfect world, separation does not matter.
In my world, one piece goes down, the entire house of cards falls. Especially when it's my webserver and/or DNS. Either machine get borked, every VM on that cluster node tends to say "meh, get back to me when you get your shit together" and crashes.
The reason the DNS fails is that the VM is on the same ESX host with the Web Server, and for the same reasons that take out web services, IRC and the game machines for days (until someone tells me) this causes DNS requests to stop working. So the innocent New Telon -and- MySQL machine on the 2nd cluster node do not have an issue, but can no longer do DNS lookups to the ESX cluster that is running rampant with a crashing web server.
Action Items:
The info to follow is specific to the MySQL has gone away issue we see on New Telon occasionally. From this article, I can see a few scenarios why this is happening and thought I would post some thoughts on it for anyone else who experiences this non-descriptive problem.
First, the VGOEmulator problem(s) -
[quote="MySQL dot com"]The problem on Windows is that in some cases MySQL does not get an error from the OS when writing to the TCP/IP connection to the server, but instead gets the error when trying to read the answer from the connection. [/quote]
This was originally an issue long ago, and theFoof solved it by adding the mysql_ping() function to the world code.
In common/Database.cpp:
Code: Select all
void Database::PingDatabase() { mysql_ping(&mysql);}
Code: Select all
//Time in MS to ping the database (300000 = 5 minutes)#define MYSQL_PING_INTERVAL 300000..bool Net::Process() { PacketStruct *packet_struct; ClientPacketData *data; if (mysql_ping_timer->Check()) database.PingDatabase();
I mention this because I noticed something specific on the Windows 7 machine (New Telon) that I do not see on the LoginServer/VGODev Server machine (Linux) and that is Windows seems to be the only one who occasionally gives this error:
[quote]16:27:57.303 D Char Loading 12 characters for account_id: 72
16:27:57.303 E Database Error running MySQL query (2006): MySQL server has gone away
SELECT `character_id`,`firstname`,`lastname`,`race`,`adventure_class`,`gender`,`adventure_level`,`current_chunk_fk`,
`hidehood`,`hidehelm`,`hidecloak`
FROM `characters`
JOIN `character_details` ON `character_id`=`char_id_fk` WHERE `account_id`=72
AND `is_deleted`=0
LIMIT 12
16:27:57.303 D Char Loaded 0 characters for account_id: 72[/quote]
Once Windows 7 gets this disconnect one time, regardless of our PingDatabase() call every 5 mins, the world never reconnects to MySQL. This is bad, and should probably be looked into (World catching this error and doing something more robust than simply trying to ping again)
The second thing I'm sure is happening (and this is environment specific):
[quote="MySQL dot com"]It is also possible to see this error if host name lookups fail (for example, if the DNS server on which your server or network relies goes down). This is because MySQL is dependent on the host system for name resolution, but has no way of knowing whether it is working—from MySQL's point of view the problem is indistinguishable from any other network timeout. [/quote]
New Telon runs on a VM on Windows 7; The MySQL database server is another stand-alone VM on another ESX host, thus a separate IP; Lastly, the web services are also their own dedicated VM, with a 3rd IP. All use an internal DNS, on yet a 4th VM and IP. In a perfect world, separation does not matter.
In my world, one piece goes down, the entire house of cards falls. Especially when it's my webserver and/or DNS. Either machine get borked, every VM on that cluster node tends to say "meh, get back to me when you get your shit together" and crashes.
The reason the DNS fails is that the VM is on the same ESX host with the Web Server, and for the same reasons that take out web services, IRC and the game machines for days (until someone tells me) this causes DNS requests to stop working. So the innocent New Telon -and- MySQL machine on the 2nd cluster node do not have an issue, but can no longer do DNS lookups to the ESX cluster that is running rampant with a crashing web server.
Action Items:
- VGO Devs should investigate a more robust MySQL Error trap and connection recovery in the code[/*:m]
- I will pursue two things: splitting up web/db/game services even further, and investigate HA/LB configurations for my services (should have them regardless) for Apache/IIS and MySQL services[/*:m][/list:u]
I cannot stop "bad people" from attacking my web server, for whatever reason they feel they need to do that. I thought I had preventive measures in place, but apparently it's not enough (and I have yet to catch what it is). Hopefully once I offer HA/LB, it might be moot and the script kiddies can go bother someone elses free fun time.