ResendPackets()
Moderators: Community Managers, Developers
ResendPackets()
Take what about to say as a theory. I've resolved quite a few of the recent issues around crashing while researching this. The commits will have to wait for tomorrow, but let's share knowledge anyways.
First, the last few crashes at GetClient() (this is usually a mutex crash, but if you walk the stack, you'll run into GetClient()) are because we get the chunk from character->GetCurrentChunk() but we don't check that chunk is valid. This happens because if a users client crashes or they are DCd during chunking or login, they will have a chunk client, but not necessarily a chunk. Checking for chunk and skipping the loop or current scope will resolve this.
Second, I think we're missing something packet/opcode wise around out of sync packets and the re-sending of these. I believe OpCode 0x0400 means something like Acknowledge Resent Packets or Wrong Resent Packets Sent.
I know the VGClient is not accepting our attempt to respond to 0x0011 (Out of Order Packet, either this or 0x0019, not at PC), because when I purposefully out of sync a single packet, my particular Client object is stuck in an infinite loop of try to resend packets, sometimes the VGClient is quiet about it: it does not respond with "garbage"; sometimes it is noisy: it responds with 0x0400s and all sorts of nonsense. While the client is re-senting packets forever, it is not sending ANY other packets. This is why nothing seems to work while this is happening. This infinite loop continues until all 1000 spots in the Deque are filled and some server logic disconnects the user.
Couple other things I noticed with this 0x0400 packet and the other "garbage" that sometimes accompanies it:
The garbage always starts with an unknown uint16, and I think this is the sequence number that the client is missing. The next 2 bytes are always the same: 00 11 (which happens to be our Out of Sync OpCode, but in the wrong place in the packet. The next 2 bytes are unknown to me.
Sometimes one of these packets may be very large, but the same pattern repeats: 2 bytes, 00 11, then 2-4 bytes. I think these are either Acks to the Resent packets bunched together.
I ran out of energy tonight but will follow up with this theory tomorrow.
Oh, why is this happening now? I believe this has ALWAYS been happening, and is responsible for nearly all the random CTDs we've encountered that weren't bugged World Heartbeat related. This is as random as you get, it just takes 1 missed packet, something which will NEVER happen on local dev, but can happen every so often over WAN. Previously, the client would just be silently disconnected. Now we see more logging spam.
First, the last few crashes at GetClient() (this is usually a mutex crash, but if you walk the stack, you'll run into GetClient()) are because we get the chunk from character->GetCurrentChunk() but we don't check that chunk is valid. This happens because if a users client crashes or they are DCd during chunking or login, they will have a chunk client, but not necessarily a chunk. Checking for chunk and skipping the loop or current scope will resolve this.
Second, I think we're missing something packet/opcode wise around out of sync packets and the re-sending of these. I believe OpCode 0x0400 means something like Acknowledge Resent Packets or Wrong Resent Packets Sent.
I know the VGClient is not accepting our attempt to respond to 0x0011 (Out of Order Packet, either this or 0x0019, not at PC), because when I purposefully out of sync a single packet, my particular Client object is stuck in an infinite loop of try to resend packets, sometimes the VGClient is quiet about it: it does not respond with "garbage"; sometimes it is noisy: it responds with 0x0400s and all sorts of nonsense. While the client is re-senting packets forever, it is not sending ANY other packets. This is why nothing seems to work while this is happening. This infinite loop continues until all 1000 spots in the Deque are filled and some server logic disconnects the user.
Couple other things I noticed with this 0x0400 packet and the other "garbage" that sometimes accompanies it:
The garbage always starts with an unknown uint16, and I think this is the sequence number that the client is missing. The next 2 bytes are always the same: 00 11 (which happens to be our Out of Sync OpCode, but in the wrong place in the packet. The next 2 bytes are unknown to me.
Sometimes one of these packets may be very large, but the same pattern repeats: 2 bytes, 00 11, then 2-4 bytes. I think these are either Acks to the Resent packets bunched together.
I ran out of energy tonight but will follow up with this theory tomorrow.
Oh, why is this happening now? I believe this has ALWAYS been happening, and is responsible for nearly all the random CTDs we've encountered that weren't bugged World Heartbeat related. This is as random as you get, it just takes 1 missed packet, something which will NEVER happen on local dev, but can happen every so often over WAN. Previously, the client would just be silently disconnected. Now we see more logging spam.
- John Adams
- Retired
- Posts: 4582
- Joined: Wed Aug 28, 2013 9:40 am
- Location: Phoenix, AZ.
- Contact:
Re: ResendPackets()
Brilliant analysis and theorizing. You have my attention. Maybe Xinux can provide insight into the packet flow as well, though I am sure you get that by now yourself.
One other thing I had thought of last night, in a sorta "duh" moment; why are we not asking these disco victims to log into NT using the collector? Would that help diagnose what THEY are seeing? And secondly, is there anything we can add to the world code (or outside, wire shark?) that can capture a days worth of logging packets to World, which we can then compare to collected logs to get a better idea of the communication breakdowns?
Personally, I'd rather see World able to diagnose itself but I fear adding a shit-ton of TRACE loggers to our code will only slow things down more. By this I mean a packet dump to an actual file, not console, during a full day of tracing. Nothing we'd do forever, although I think the real servers kept extensive logs of what was going on above our basic info, debug and errors.
Thoughts?
One other thing I had thought of last night, in a sorta "duh" moment; why are we not asking these disco victims to log into NT using the collector? Would that help diagnose what THEY are seeing? And secondly, is there anything we can add to the world code (or outside, wire shark?) that can capture a days worth of logging packets to World, which we can then compare to collected logs to get a better idea of the communication breakdowns?
Personally, I'd rather see World able to diagnose itself but I fear adding a shit-ton of TRACE loggers to our code will only slow things down more. By this I mean a packet dump to an actual file, not console, during a full day of tracing. Nothing we'd do forever, although I think the real servers kept extensive logs of what was going on above our basic info, debug and errors.
Thoughts?
Re: ResendPackets()
FYI i updated the packet collector to spit out more information and also updated the opcodes if you want to use it to try and get some collects on this.
Re: ResendPackets()
Lokked/all: Great writeup and great thinking! Sounds very reasonable. Is in line with why some players (i.e. locations/computers) have more issues than others.
"Gaze in amazement adventurer"
Re: ResendPackets()
Ok this is better i swapped to WPE Pro to do the collect.
Good response before out of order packet.
Packet EA got dropped.
Client saying it got EB which is out of order should have gotten EA
Client saying it got EC which is out of order
Server resent EA,EB,EC and the client ack EC
Client saying it got EE which is out of order
Server keeps sending EE to the client when the client is still looking for ED this repeated 10 times before i stopped the collect.
Good response before out of order packet.
Code: Select all
458 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 E9 09 00 EC 81 F7 DC 01 00 00 00 6F C9 ..............o.
459 0.0.0.0:52692 :0 6 SendTo
0000: 00 15 00 E9 C4 A3 ......
Client saying it got EB which is out of order should have gotten EA
Code: Select all
475 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EB 09 00 22 8A F7 DC 01 00 00 00 D7 F7 ......".........
476 0.0.0.0:52692 :0 6 SendTo
0000: 00 11 00 EB 0D 53
Code: Select all
484 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EC 09 00 38 8E F7 DC 01 00 00 00 CB 84 ......8.........
485 0.0.0.0:52692 :0 6 SendTo
0000: 00 11 00 EC 98 F0
Code: Select all
494 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EA 09 00 07 86 F7 DC 01 00 00 00 0E 71 ...............q
495 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EB 09 00 22 8A F7 DC 01 00 00 00 D7 F7 ......".........
496 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EC 09 00 38 8E F7 DC 01 00 00 00 CB 84 ......8.........
497 0.0.0.0:52692 :0 6 SendTo
0000: 00 15 00 EC 30 2C ....0,
Code: Select all
505 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EE 09 00 6E 96 F7 DC 01 00 00 00 5A AD ......n.......Z.
506 0.0.0.0:52692 :0 6 SendTo
0000: 00 11 00 EE F9 DC ......
Code: Select all
515 :0 0.0.0.0:52692 16 RecvFrom
0000: 00 09 00 EE 09 00 6E 96 F7 DC 01 00 00 00 5A AD ......n.......Z.
516 0.0.0.0:52692 :0 6 SendTo
0000: 00 11 00 EE F9 DC ......
Re: ResendPackets()
NOTE: This post pertains to Xinux's first example, which has been replaced (I think, I might be hallucinating). Posting again to respond to new WPE Pro example.
Perfect, thanks for providing this information, Xinux. For clarification: Is that Out of Order Packet the raw packet? Or was the 04 00 stripped?
If you would like to try and collect the 04 00 (this packet occurs frequently enough that it's GOT to have a specific purpose and not just be garbage), you can guarantee to generate this packet by running the Clumsy application (http://jagt.github.io/clumsy/download) with the Drop Packets option selected. Let me know if you need help configuring it.
I find it strange that you say "we never send another packet to the client besides ack packets". The function ResendPackets() is supposed to do exactly this, and my looking over of it does seem to indicate that it's doing something. When a packet is sent, it is added to a Client's packet_sent list, and when the server receives an Ack for a packet, that packet is removed from the packet_sent list. The ResendPackets function extracts the Sequence Number from the 00 11 - Out of Order Packet (I've watched it do so) and the packet_sent list is looped from "Earliest Sequence Number Not Acked" to "Sequence Number Taken From Out of Order Packet" and sends these to the Client.
But, you are saying that the server does NOT send anything after receiving this Out of Order Packet? Perhaps something is broken with how this ResendPacket function is queuing the packets to be sent.
I am aware that while the server detects that it needs to resend packets to a Client, it is NOT sending newly requested packets, and this is by design (because the client will reject them, as their sequence number is above the internal counter it's recording). But, it should at least be sending the sent_packets set of packets.
So we agree on what's happening and what needs to be done:
Perfect, thanks for providing this information, Xinux. For clarification: Is that Out of Order Packet the raw packet? Or was the 04 00 stripped?
If you would like to try and collect the 04 00 (this packet occurs frequently enough that it's GOT to have a specific purpose and not just be garbage), you can guarantee to generate this packet by running the Clumsy application (http://jagt.github.io/clumsy/download) with the Drop Packets option selected. Let me know if you need help configuring it.
I find it strange that you say "we never send another packet to the client besides ack packets". The function ResendPackets() is supposed to do exactly this, and my looking over of it does seem to indicate that it's doing something. When a packet is sent, it is added to a Client's packet_sent list, and when the server receives an Ack for a packet, that packet is removed from the packet_sent list. The ResendPackets function extracts the Sequence Number from the 00 11 - Out of Order Packet (I've watched it do so) and the packet_sent list is looped from "Earliest Sequence Number Not Acked" to "Sequence Number Taken From Out of Order Packet" and sends these to the Client.
But, you are saying that the server does NOT send anything after receiving this Out of Order Packet? Perhaps something is broken with how this ResendPacket function is queuing the packets to be sent.
I am aware that while the server detects that it needs to resend packets to a Client, it is NOT sending newly requested packets, and this is by design (because the client will reject them, as their sequence number is above the internal counter it's recording). But, it should at least be sending the sent_packets set of packets.
So we agree on what's happening and what needs to be done:
Re: ResendPackets()
Ok, this should be enough information. Can you confirm that ED was NEVER sent by the server? After resending EA, EB, and EC, the server skips ED?
Is there a NOP (no operation) style of packet? I'm wondering what might happen if the Sequence is incremented and there is no packet formed and stored in the sent_packets list, can we send a no operation packet, in order to fill in the sequence gaps? I'm sure this isn't a good solution, but I figure it's worth an ask.
Is there a NOP (no operation) style of packet? I'm wondering what might happen if the Sequence is incremented and there is no packet formed and stored in the sent_packets list, can we send a no operation packet, in order to fill in the sequence gaps? I'm sure this isn't a good solution, but I figure it's worth an ask.
Re: ResendPackets()
[quote]Perfect, thanks for providing this information, Xinux. For clarification: Is that Out of Order Packet the raw packet? Or was the 04 00 stripped?[/quote]
This is the raw packet WPE does not strip anything off of it and when you say 04 00 do you mean this?
Which is a Multiple SOE Protocols this example has three packets in one and the data starts at 04 being the size
http://wiki.vgoemulator.net/Docs/Backup ... _Breakdown
[quote]If you would like to try and collect the 04 00 (this packet occurs frequently enough that it's GOT to have a specific purpose and not just be garbage), you can guarantee to generate this packet by running the Clumsy application (http://jagt.github.io/clumsy/download) with the Drop Packets option selected. Let me know if you need help configuring it.[/quote]
I was using the clumsy tool with the drop selected to do this.
[quote]I find it strange that you say "we never send another packet to the client besides ack packets". The function ResendPackets() is supposed to do exactly this, and my looking over of it does seem to indicate that it's doing something. When a packet is sent, it is added to a Client's packet_sent list, and when the server receives an Ack for a packet, that packet is removed from the packet_sent list. The ResendPackets function extracts the Sequence Number from the 00 11 - Out of Order Packet (I've watched it do so) and the packet_sent list is looped from "Earliest Sequence Number Not Acked" to "Sequence Number Taken From Out of Order Packet" and sends these to the Client.
But, you are saying that the server does NOT send anything after receiving this Out of Order Packet? Perhaps something is broken with how this ResendPacket function is queuing the packets to be sent.[/quote]
Ok just did another test and chat packets were still working but commands and such were not. Chat packets use a different connection (worldserver) so may have just not gotten out of sync like the chunkserver would have to spam chat and make one of them drop to see what happens.
Would need to run a collect on the server at the same time to see the packet that is being dropped to see if there is a common packet type that may not be getting sent back correctly even tho it shouldn't matter.
I'm attaching both of my logs i've done so far wpe_drop_packet is from NT and wpe_drop_2 is from my server.
This is the raw packet WPE does not strip anything off of it and when you say 04 00 do you mean this?
Which is a Multiple SOE Protocols this example has three packets in one and the data starts at 04 being the size
Code: Select all
0000: 00 03 04 00 11 01 E3 04 00 11 01 EB 04 00 11 01 ................
0010: F0 9D 13 ...
[quote]If you would like to try and collect the 04 00 (this packet occurs frequently enough that it's GOT to have a specific purpose and not just be garbage), you can guarantee to generate this packet by running the Clumsy application (http://jagt.github.io/clumsy/download) with the Drop Packets option selected. Let me know if you need help configuring it.[/quote]
I was using the clumsy tool with the drop selected to do this.
[quote]I find it strange that you say "we never send another packet to the client besides ack packets". The function ResendPackets() is supposed to do exactly this, and my looking over of it does seem to indicate that it's doing something. When a packet is sent, it is added to a Client's packet_sent list, and when the server receives an Ack for a packet, that packet is removed from the packet_sent list. The ResendPackets function extracts the Sequence Number from the 00 11 - Out of Order Packet (I've watched it do so) and the packet_sent list is looped from "Earliest Sequence Number Not Acked" to "Sequence Number Taken From Out of Order Packet" and sends these to the Client.
But, you are saying that the server does NOT send anything after receiving this Out of Order Packet? Perhaps something is broken with how this ResendPacket function is queuing the packets to be sent.[/quote]
Ok just did another test and chat packets were still working but commands and such were not. Chat packets use a different connection (worldserver) so may have just not gotten out of sync like the chunkserver would have to spam chat and make one of them drop to see what happens.
Would need to run a collect on the server at the same time to see the packet that is being dropped to see if there is a common packet type that may not be getting sent back correctly even tho it shouldn't matter.
I'm attaching both of my logs i've done so far wpe_drop_packet is from NT and wpe_drop_2 is from my server.
Re: ResendPackets()
Going to add this here also.
Missing 29
Resent 29 and then sent 2A and 2C guessing maybe 2B was dropped by program?
Ack 2A and oop 2C
Sent 2C again still need 2B
Server decided to jump ahead to 2E
Server resent 2C and 2E
Server is now sending 30
Now sent 2C, 2E and 30 and going to stop here since it just keep's going
Log file
Missing 29
Code: Select all
1179 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2A 06 00 11 00 00 00 10 00 00 00 68 00 ...*..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 28 A5 e.l.l.o.......(.
1180 0.0.0.0:53523 :0 6 SendTo
0000: 00 11 00 2A FF 75
Code: Select all
1197 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 29 06 00 11 00 00 00 10 00 00 00 68 00 ...)..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 05 66 e.l.l.o........f
1198 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2A 06 00 11 00 00 00 10 00 00 00 68 00 ...*..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 28 A5 e.l.l.o.......(.
1199 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2C 06 00 11 00 00 00 10 00 00 00 68 00 ...,..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 75 62 e.l.l.o.......ub
Code: Select all
1200 0.0.0.0:53523 :0 14 SendTo
0000: 00 03 04 00 15 00 2A 04 00 11 00 2C 0C 34 ......*....,.4
Code: Select all
1211 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2C 06 00 11 00 00 00 10 00 00 00 68 00 ...,..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 75 62 e.l.l.o.......ub
1212 0.0.0.0:53523 :0 6 SendTo
0000: 00 11 00 2C 5A 40 ...,Z@
Code: Select all
1219 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2E 06 00 11 00 00 00 10 00 00 00 68 00 ..............h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 43 E0 e.l.l.o.......C.
1221 0.0.0.0:53523 :0 6 SendTo
0000: 00 11 00 2E 3B 6C ....;l
Code: Select all
1231 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2C 06 00 11 00 00 00 10 00 00 00 68 00 ...,..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 75 62 e.l.l.o.......ub
1232 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2E 06 00 11 00 00 00 10 00 00 00 68 00 ..............h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 43 E0 e.l.l.o.......C.
1233 0.0.0.0:53523 :0 14 SendTo
0000: 00 03 04 00 11 00 2C 04 00 11 00 2E 80 FF ......,.......
Code: Select all
1252 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 30 06 00 11 00 00 00 10 00 00 00 68 00 ...0..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 63 F8 e.l.l.o.......c.
1254 0.0.0.0:53523 :0 6 SendTo
0000: 00 11 00 30 06 0F
Code: Select all
1262 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2C 06 00 11 00 00 00 10 00 00 00 68 00 ...,..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 75 62 e.l.l.o.......ub
1263 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 2E 06 00 11 00 00 00 10 00 00 00 68 00 ..............h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 43 E0 e.l.l.o.......C.
1264 :0 0.0.0.0:53523 32 RecvFrom
0000: 00 09 00 30 06 00 11 00 00 00 10 00 00 00 68 00 ...0..........h.
0010: 65 00 6C 00 6C 00 6F 00 00 00 04 00 00 00 63 F8 e.l.l.o.......c.
1265 0.0.0.0:53523 :0 19 SendTo
0000: 00 03 04 00 11 00 2C 04 00 11 00 2E 04 00 11 00 ......,.........
0010: 30 7E 88 0~.