World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 7:25 am
by John Adams
Not sure camp timers are at fault; it looks like GetCurrentChunk() is.

Stack

Code:

>	WorldServer.exe!std::_Ptr_base<ChunkServer>::_Reset<ChunkServer>(const std::_Ptr_base<ChunkServer> & _Other, bool _Throw) Line 363	C++
 	WorldServer.exe!std::shared_ptr<ChunkServer>::shared_ptr<ChunkServer><ChunkServer>(const std::weak_ptr<ChunkServer> & _Other, bool _Throw) Line 552	C++
 	WorldServer.exe!std::weak_ptr<ChunkServer>::lock() Line 1051	C++
 	WorldServer.exe!UnrealActor::GetCurrentChunk() Line 221	C++
 	WorldServer.exe!CampTimers::Check(UDPServer * udp_server, ChunkServer * chunk_server) Line 98	C++
 	WorldServer.exe!ChunkServer::Process() Line 199	C++
 	WorldServer.exe!ChunkProcessThread(void * data) Line 107	C++
 	WorldServer.exe!ThreadRun(void * arg) Line 77	C++
 	WorldServer.exe!_callthreadstart() Line 255	C
 	WorldServer.exe!_threadstart(void * ptd) Line 239	C
 	kernel32.dll!7609338a()	Unknown
 	[Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]	
 	ntdll.dll!77189f72()	Unknown
 	ntdll.dll!77189f45()	Unknown
Another weak_ptr fail?
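
For context on the weak_ptr question, here is a minimal sketch of how weak_ptr::lock() behaves when used correctly. Chunk and Actor are hypothetical stand-ins, since the real ChunkServer and UnrealActor code is not shown in this thread. The point is that lock() on an expired weak_ptr returns an empty shared_ptr rather than crashing, so a crash inside lock() may mean that the weak_ptr itself (or the object holding it) lives in memory that was already freed or corrupted. That is an inference, not something the stack trace confirms.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins; not the real ChunkServer / UnrealActor types.
struct Chunk { int id = 7; };

struct Actor {
    std::weak_ptr<Chunk> current_chunk;  // non-owning link to the actor's chunk

    // Same shape as a GetCurrentChunk()-style accessor.
    std::shared_ptr<Chunk> GetCurrentChunk() const {
        return current_chunk.lock();  // empty shared_ptr if the chunk is gone
    }
};

// Returns true if the accessor yields a usable chunk while the chunk is alive.
bool lock_succeeds_while_alive() {
    auto chunk = std::make_shared<Chunk>();
    Actor actor;
    actor.current_chunk = chunk;
    assert(!actor.current_chunk.expired());
    return actor.GetCurrentChunk() != nullptr;
}

// Returns true if the accessor yields an empty pointer after the chunk is destroyed.
bool lock_is_empty_after_expiry() {
    Actor actor;
    {
        auto chunk = std::make_shared<Chunk>();
        actor.current_chunk = chunk;
    }  // last owner gone: the Chunk is destroyed here
    return actor.GetCurrentChunk() == nullptr && actor.current_chunk.expired();
}
```

Callers of such an accessor should still check the returned pointer for null before using it.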

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 8:48 am
by Volt
I suppose I will have to learn more about shared_ptr unless we have a new expert around.

Do you have the error message that came with the crash? I think that would be really helpful to me.

Also, do we know who the camping character was? I would like to know if the character was doing anything special at the time (like trying to interrupt the camping with .rift or brute force closing the application), or if it was a "standard /camp"?

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 9:27 am
by Dakadin
It might have been me if it happened just before midnight PDT. The character was Banamur. I had just logged in. It didn't seem like I was really in the world because emotes, /who and dot commands weren't working. I could run around but not much else. I decided to log out and figured I would try using /camp. It started fine with the 20 second message but never moved past there. I eventually had to exit out. I started getting the Waiting on Data error right after, so I would be surprised if it wasn't me.

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 10:15 am
by Volt
That is useful info, thank you.

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 11:14 am
by Lokked
Generally, segfaults with Smart Pointers are created when a single object is owned by 2 independently constructed shared_ptr<>s (each built from the same raw pointer instead of being copied from the other), which you should never do (it defeats the purpose of shared_ptr<>).

A shared_ptr is a wrapper around a pointer that keeps a reference count of how many times (or you could think of as, "from how many places") the pointed-to object is being referred to. The idea is that once nothing is referring to an object, that object is deleted (as it should be).
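
The counting described above can be observed directly with use_count(). This is only an illustrative sketch; Obj and the two function names are hypothetical and not taken from the server code.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for any server-side object.
struct Obj { int value = 42; };

// Returns the owner count observed while two shared_ptr copies are alive.
long count_with_two_owners() {
    std::shared_ptr<Obj> first = std::make_shared<Obj>();  // count is 1
    std::shared_ptr<Obj> second = first;                   // copy shares the count: 2
    assert(first.get() == second.get());                   // both refer to the same Obj
    return second.use_count();
}

// Returns the owner count after one of the two copies has gone out of scope.
long count_after_inner_scope() {
    std::shared_ptr<Obj> first = std::make_shared<Obj>();
    {
        std::shared_ptr<Obj> second = first;  // count is 2 inside this scope
    }                                         // second destroyed: back to 1
    return first.use_count();
}
```

The key detail is that second is copied from first, so both share one reference count.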

If you have 2 separate shared_ptr that were each created from the same raw pointer, they don't know about each other's reference count (they are completely separate from each other). Say each holds a reference count of 1 to an object (that object is being referred to only once by each shared_ptr). If one of those shared_ptr goes out of scope or is removed, its reference count reduces to 0, and this prompts the smart pointer wrapper to delete the object (as it should). BUT WAIT! There is another shared_ptr that is holding a reference to the now deleted object! It doesn't know there is no longer an object. When it goes out of scope/is removed, it attempts to delete the object again. This double delete is undefined behavior and typically results in a Segmentation Fault.

Segfault always means that memory handling has been mismanaged. http://xkcd.com/371/

Smart Pointer Segfault example:

Code:

void Foo_Shared()
{
    Obj *p = new Obj; //New Object instantiated and pointed to by p
    shared_ptr<Obj> sp1(p); //Shared_ptr sp1 created, with 1 reference to Obj
    shared_ptr<Obj> sp2(p); //Shared_ptr sp2 created, with 1 reference to Obj
}
When this function ends, the locals are destroyed in reverse order of construction:
sp2 goes out of scope first and reduces its reference counter by 1. This reduces its count to 0, so Obj is deleted.
sp1 then reduces its own, separate reference counter by 1, which also reaches 0. When it attempts to delete Obj, the object has already been deleted. This double delete is undefined behavior and typically shows up as a segfault.
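
For comparison, here is a sketch of how the same function could avoid the double delete: create the object with make_shared and copy the shared_ptr instead of constructing a second one from the raw pointer. The destructor counter and the name Foo_Shared_Fixed are hypothetical additions for the demonstration.

```cpp
#include <cassert>
#include <memory>

// Hypothetical Obj that counts how many times its destructor runs.
struct Obj {
    static int destroyed;
    ~Obj() { ++destroyed; }
};
int Obj::destroyed = 0;

// Returns the number of times Obj was destroyed (should be exactly once).
int Foo_Shared_Fixed() {
    Obj::destroyed = 0;
    {
        std::shared_ptr<Obj> sp1 = std::make_shared<Obj>();  // one reference count is created
        std::shared_ptr<Obj> sp2 = sp1;                      // sp2 joins the same count
        assert(sp2.use_count() == 2);
    }  // sp2 drops the count to 1, then sp1 drops it to 0 and Obj is deleted once
    return Obj::destroyed;
}
```

Because both pointers share one count, only the last one to go out of scope deletes the object.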

This is the same thing with Raw Pointers:

Code:

void Foo_Raw()
{
    Obj *p1, *p2;
    p1 = new Obj; // New object instantiated and pointed to by p1
    p2 = p1;      // Point p2 to the same Obj that p1 points to

    delete p1; // Deletes Obj
    delete p2; // Double delete: undefined behavior, typically a segfault. Obj already deleted
}
I'd also like to take this time to calm the storm about smart pointers. They have done more good than harm. These exact same errors could be reproduced using raw pointers. The difference is in the skill of the programmer. "Old School" programmers will continue using raw pointers with as much success as they had before, and there's nothing wrong with that.

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 12:50 pm
by Volt
That is a very nice writeup Lokked, thank you. I did some reading on shared_ptr a few hours ago, but I think your writeup is better than what I read before.

Back to the reported issue: Seems something went wrong long before the stack dump. How do we catch and smash it?

It sounds like not all threads were running properly (based on Dakadin's report), but I can't say if that is just another symptom. Do we have a thread count, monitor or similar implemented?

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 5:41 pm
by Dakadin
Thanks Lokked. I've been getting back up to speed with pointers and you bring up some great points. There are times when multiple pointers can really help the situation but if you don't pay close attention to all the possibilities it is easy to have unexpected side effects and errors.

Volt, we need to review the code and look for potential places where pointers are referencing the same address, as Lokked said. It can be difficult to do but it is definitely doable. I will start looking through the code to see if I can find any places where it could possibly happen.
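
One pattern such a review could look for is an object wrapping its own this pointer in a new shared_ptr, which starts a second, independent reference count. This is an assumption about what might be in the code; nothing in this thread confirms the server does it. A hedged sketch of the safer alternative, using std::enable_shared_from_this with a hypothetical Session class:

```cpp
#include <cassert>
#include <memory>

// Hypothetical class; not taken from the WorldServer source.
struct Session : std::enable_shared_from_this<Session> {
    // Risky pattern to search for: return std::shared_ptr<Session>(this);
    // That would create a second, independent reference count.

    // Safer: hand out a pointer that shares the existing reference count.
    std::shared_ptr<Session> Self() { return shared_from_this(); }
};

// Returns the owner count after obtaining a second pointer through Self().
long owners_after_self() {
    auto session = std::make_shared<Session>();
    std::shared_ptr<Session> again = session->Self();
    assert(again.get() == session.get());  // same object, same count
    return session.use_count();
}
```

Note that shared_from_this() is only valid when the object is already owned by a shared_ptr, as it is here through make_shared.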

Re: World Crash: Camp Timers (?)

Posted: Thu Aug 21, 2014 11:25 pm
by Dakadin
It might be something corrupted with my actual character. I tried logging in again and it wouldn't let me enter any commands. I tried zoning and my client crashed. I am pretty sure the server crashed also. Let me know if you want to test it again sometime tomorrow. Hopefully we can track down the bug.

Re: World Crash: Camp Timers (?)

Posted: Fri Aug 22, 2014 6:29 am
by John Adams
Yup, looks like it crashed in the same place.

Code:

>	WorldServer.exe!std::_Ptr_base<ChunkServer>::_Reset<ChunkServer>(const std::_Ptr_base<ChunkServer> & _Other, bool _Throw) Line 363	C++
 	WorldServer.exe!std::shared_ptr<ChunkServer>::shared_ptr<ChunkServer><ChunkServer>(const std::weak_ptr<ChunkServer> & _Other, bool _Throw) Line 552	C++
 	WorldServer.exe!std::weak_ptr<ChunkServer>::lock() Line 1051	C++
 	WorldServer.exe!UnrealActor::GetCurrentChunk() Line 221	C++
 	WorldServer.exe!CampTimers::Check(UDPServer * udp_server, ChunkServer * chunk_server) Line 98	C++
 	WorldServer.exe!ChunkServer::Process() Line 199	C++
 	WorldServer.exe!ChunkProcessThread(void * data) Line 107	C++
 	WorldServer.exe!ThreadRun(void * arg) Line 77	C++
 	WorldServer.exe!_callthreadstart() Line 255	C
 	WorldServer.exe!_threadstart(void * ptd) Line 239	C
 	kernel32.dll!7609338a()	Unknown
 	[Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]	
 	ntdll.dll!77189f72()	Unknown
 	ntdll.dll!77189f45()	Unknown
Console:
[quote]23:03:17.060 I UDP New client connected from 173.51.246.121:4907
23:03:17.060 I UDP Received session request from 173.51.246.121:4907 with connection ID 1724868530

23:03:17.465 D Chunk control_text='HELLO REVISION=0 MINVER=3151 VER=3186'
23:03:17.652 D Chunk control_text='LOGIN'
23:03:29.165 D Chunk control_text='JOIN'
23:04:08.540 I UDP New client connected from 173.51.246.121:50600
23:04:08.540 I UDP New client connected from 173.51.246.121:43760
23:04:08.540 D UDP Received an ack with an empty sent packets list?
23:04:08.540 D UDP Received an ack with an empty sent packets list?
23:04:09.258 I UDP Client from 173.51.246.121:28276 set to disconnect : Timeout
23:04:09.414 I UDP Client from 173.51.246.121:4907 set to disconnect : Timeout
23:04:10.272 I UDP Client from 173.51.246.121:28276 has been removed.
23:04:10.428 I UDP Client from 173.51.246.121:4907 has been removed.
23:05:02.080 E Unreal Could not find unreal channel list for account id 0![/quote]
There are two things I've started seeing repeatedly since the 589 commit. The first is
[quote]D UDP Received an ack with an empty sent packets list?[/quote]
and the second, more troubling, is
[quote]E Unreal Could not find unreal channel list for account id 0![/quote]
Not sure if they are related or telling.

Re: World Crash: Camp Timers (?)

Posted: Fri Aug 22, 2014 7:40 am
by Volt
Yeah we should try and smoke this out together. Need Dakadin and John at the same time (client and server). I'd be happy to join in if we can find a time when we are all available.