The problem:
    + mono is mind mangled & doesn't expose a sane GC API
        + ie. it wants to know about your entire C stack & wander over it
        + it should push/pop start/stop tags as we invoke in/out
        + it should manage C handles with a controlled lifecycle
          mechanism: ref/unref layered on top of the GC
          => no need to worry about walking the C stack.
    + however - this is horribly broken somehow.

    We really need something like GC_start_routine(): the start_info
    would be used to invoke our fn. and cleanup would be automatic.

* RedHerring 1: what is:
        GC_push_all_stack(self, )
    called from GCThreadFunctions mono_gc_thread_vtable -> mono_gc_push_all_stacks
    + for the debugger [ libgc/include/libgc-mono-debugger.h ]

* The pthread_support.c code seems to be not much called:
    GC_segment_is_thread_stack - only from some dlopen code
    GC_register_map_entries() [ filtering out code regions ? ]
        * seems to call GC_add_roots

* /* Add a root segment.  Wizards only. */
  GC_API void GC_add_roots GC_PROTO((char * low_address, char * high_address_plus_1));
  /* Remove a root segment.  Wizards only. */
  GC_API void GC_remove_roots GC_PROTO((char * low_address, char * high_address_plus_1));

* Adding a thread has to set the GC_suspend_handler / GC_restart_handler
    + cf. pthread_stop_world, pthread_stop_init (?)
    + GC_stop_world uses the stop_world method from the vtable ...
        GCThreadFunctions *gc_thread_vtable = &pthread_thread_vtable;
    => We need thread data to ensure we have stopped all the threads properly.
    => we need to setup / install our signal handler properly in each thread.
    + The high stack pointer is grokked from the GC_approx_sp function
        - from the signal handler - just returns cur ptr.
        - signal handler executing on stack of signal receiving thread. [ fun ]

* GC_push_all_stacks [ pthread_stop_world.c ]
    + calls 'push_all_stack' for each thread.
    + called from os_dep.c (GC_default_push_other_roots) [ GC_push_other_roots ]

* Hacking libgc:

    yes, I suggest just duplicating the needed code for now
    lupus: is this a minimal-changes regimen ? or a wild-crazy-re-factoring-festival ?
    michael_: no crazy refactoring:-)
    lupus: and what does one do about a non-internal libgcj ? - or is that not a case worth worrying about ?
    michael_: we'll use the proper ifdef in the mono runtime
    sure - but it's ok to break horribly if we're not using an internal libgcj ? ;-)
    but a not internal gc is just for our testing, it's not a supported feature for people to use
    ie. no foreign-thread / OO.o impl. ?
    oh - great :-)

** How does Mono get a callback from the pthread foo ?
    + We need to add the functionality we need to mono_thread_attach

2005-04-12 log:

    lupus: so - are you utterly opposed to adding this to mono_runtime_invoke ?
    michael_: why should I be opposed?
    lupus: there is a small cost, particularly if you don't have a thread-variable,
    lupus: of course - then we could bin the mono_thread_attach or whatever;
    michael_: unmanaged->managed is not an important transition and a simple check is cheap
    of course there is no need to drop mono_thread_attach
    lupus: ok'y dokey,
    lupus: so - I'll re-work the patch to move this code into mono_runtime_invoke ?
    michael_: if your current patch works for you, I can write the equivalent code for the next mono release
    we're not going to remove the thread info at the end of _invoke
    something like the setspecific hack you had may be enough
    lupus: the problem is - we could get invoked again from lower down the stack in the same thread trivially.
    lupus: so - it's really no safer;
    lupus: and removing the thread information from the gc post invoke - [ if we added it ourself ] is cheap anyhow.
    that is not cheap, it requires locking both in the runtime and the GC
    I don't see what removing the info each time buys us
    lupus: consider A() -> B() -> C() -> runtime_invoke -> adds stack base --><--
    the code I have in mind will check the stack start registered: if it doesn't contain the current stack address, we change it
    lupus: subsequently A() -> runtime_invoke -> stack base is now --><--
    lupus: sounds fine,
    it will also register an aligned address
    lupus: sounds fine,
    at a page size so hopefully we get a good enough approximation of the thread start
    lupus: sounds like you've got it under control anyhow
    clearly manipulating the local thread stack/state should require precisely no locking in an ideal world ;-)
    but ... [ when we write a better GC - we'll make it perfectly precise - right ? ;-]
    michael_: since you need to register the thread with the GC it's not possible to avoid a lock, since it's global state
    and we won't have a perfectly precise GC
    lupus: AFAICS the GC only asks for that when a signal is fired,
    the unmanaged stack will still be conservatively scanned for a while
    michael_: the GC needs the stack boundaries when doing collection
    lupus: I appreciate that ;-)
    lupus: anyhow - thanks for taking that on.
    lupus: when is the next release ? :-)
    michael_: likely in a month or so

---------------------
Debugging my bogus (non)-GC related bug
---------------------

0x8240e70

+ Threads created outside Mono's control ...
+ GC crash - under gdb:
    + greatest_ha -

(gdb) p limit
$1 = (word *) 0x83c2104
(gdb) p *limit
Cannot access memory at address 0x83c2104
(gdb) p *(limit - 4096)
$2 = 0
(gdb) p *(limit + 4096)
Cannot access memory at address 0x83c6104

GC_mark_from mis-calculates limit, clearly it's scanning a broken / non-object;
missing the length [ with 0 bottom 2 bits] tag / marker. Who sets that up ?
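
Aside: the "bottom 2 bits" above are the tag of a libgc mark descriptor.
A minimal sketch of decoding that tag, assuming a 32-bit word and the tag
values 0..3 for LENGTH / BITMAP / PROC / PER_OBJECT (they mirror the
GC_DS_* names printed in the debug output in these notes; the helper names
descr_tag_name, describe_descr and check_samples are made up for
illustration and are not part of libgc). The sample values are descriptors
copied from the gdb / debug output in this file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Tag layout assumed to match libgc's GC_DS_* constants. */
#define DS_TAG_BITS   2
#define DS_TAGS       ((1u << DS_TAG_BITS) - 1)
#define DS_LENGTH     0u  /* rest of the word: number of bytes to scan     */
#define DS_BITMAP     1u  /* rest of the word: bitmap of pointer slots     */
#define DS_PROC       2u  /* rest of the word: mark procedure index + env  */
#define DS_PER_OBJECT 3u  /* real descriptor is found via the object       */

/* Hypothetical helper: name the tag of a 32-bit mark descriptor. */
static const char *descr_tag_name(uint32_t descr)
{
    switch (descr & DS_TAGS) {
    case DS_LENGTH: return "DS_LENGTH";
    case DS_BITMAP: return "DS_BITMAP";
    case DS_PROC:   return "DS_PROC";
    default:        return "DS_PER_OBJECT";
    }
}

/* Hypothetical helper: print one descriptor the way the notes do. */
static void describe_descr(uint32_t descr)
{
    printf("%s: 0x%x", descr_tag_name(descr), (unsigned)descr);
    if ((descr & DS_TAGS) == DS_LENGTH)
        printf(" (would scan %u bytes)", (unsigned)descr);
    putchar('\n');
}

/* Descriptors seen in this debugging session. */
static void check_samples(void)
{
    assert(strcmp(descr_tag_name(1024u), "DS_LENGTH") == 0);
    assert(strcmp(descr_tag_name(4294967279u), "DS_PER_OBJECT") == 0);
    assert(strcmp(descr_tag_name(0x20000001u), "DS_BITMAP") == 0);
    /* 136027216 has tag 0, so it is read as a (huge) byte length. */
    assert(strcmp(descr_tag_name(136027216u), "DS_LENGTH") == 0);
}
```

Calling check_samples() and then describe_descr(136027216u) from a main()
shows why a pointer-like value landing in a descriptor slot is so
destructive: its low bits are 0, so it is taken as a length of ~130MB.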

The mark stack is a simple array, and looks fine ( in places ):

$42 = {mse_start = 0x81fd800, mse_descr = 1024}
(gdb) p mark_stack[14]
$43 = {mse_start = 0x81eff00, mse_descr = 32}
(gdb) p mark_stack[15]
$44 = {mse_start = 0x81eef00, mse_descr = 160}
(gdb) p mark_stack[16]
$45 = {mse_start = 0x81fae00, mse_descr = 4294967279}
...
...
(gdb) p mark_stack[20]
$52 = {mse_start = 0x8355000, mse_descr = 4294967279}
(gdb) p mark_stack[21]
$53 = {mse_start = 0x83c2108, mse_descr = 136027216}
(gdb) p mark_stack[22]
$54 = {mse_start = 0x83acd20, mse_descr = 80}
(gdb) p mark_stack[23]
$55 = {mse_start = 0x83accd0, mse_descr = 80}

The mark stack data comes from:

    MARK_FROM_MARK_STACK
        GC_mark_stack_top = GC_mark_from(GC_mark_stack_top, \
                                         GC_mark_stack, \
                                         GC_mark_stack + GC_mark_stack_size);

    + need to watch all manipulations of the GC_mark_stack ... (?)

gc_pmark.h
    + GC_PUSH_ONE_HEAP, GC_PUSH_ONE_STACK
        (calls) -> GC_mark_and_push, GC_mark_and_push_stack

The problem comes from: GC_push_all_stacks ... (pthread_stop_world.c)

#1  0x080fbe2b in GC_push_all_eager (bottom=0xbfffc1b4 "0%G���%@", top=0xbfffce30 "\002") at mark.c:1485
#2  0x0813a7eb in pthread_push_all_stacks () at pthread_stop_world.c:260
#3  0x0813a6ad in GC_push_all_stacks () at pthread_stop_world.c:291
#4  0x080f673b in GC_push_roots (all=1, cold_gc_frame=0xbfffc254 "\200") at mark_rts.c:643
#5  0x080fd0ea in GC_mark_some (cold_gc_frame=0xbfffc254 "\200") at mark.c:306
#6  0x080ff51c in GC_stopped_mark (stop_func=0x80febe0) at alloc.c:533
#7  0x080ff7d8 in GC_try_to_collect_inner (stop_func=0x80febe0) at alloc.c:375
#8  0x080ff98d in GC_collect_or_expand (needed_blocks=1, ignore_off_page=0) at alloc.c:1035
#9  0x080ffe23 in GC_allocobj (sz=128, kind=0) at alloc.c:1112
#10 0x080f9041 in GC_generic_malloc_inner (lb=454, k=0) at malloc.c:136
#11 0x080f90b4 in GC_generic_malloc (lb=454, k=0) at malloc.c:192
#12 0x080f9361 in GC_malloc_atomic (lb=454) at malloc.c:270
#13 0x080c226a in mono_string_new_size (domain=0x81eef00, len=220) at object.c:2081
#14 0x080d1c18 in ves_icall_System_String_InternalAllocateStr (length=220) at string-icalls.c:634

* pthread_stop_world.c - do we set p->flags & MAIN_THREAD ?
    + bs_lo = p->backing_store_end
    + bs_hi = GC_save_regs_in_stack
    etc.

* GC_push_all_eager - must get the wrong end of the stick somehow (?)

    initially (fine):
        Pushing stacks from thread 0x4021abc0
        Stack for thread 0x4021abc0(0x6) = [bfffcd64,bfffd0b0)

    later (broken):
        Pushing stacks from thread 0x4021abc0
        Stack for thread 0x40b01bb0(0x2) = [40b01594,40b02000)
        Stack for thread 0x4021abc0(0x6) = [bfffc434,bfffd0b0)

    hi < lo etc. [!]...

    + How does MAIN_THREAD get set (?) - surely correct (by Mono)

    + Header from pointer: GET_HDR()... GC_find_header(p)

        hdr * GC_find_header(h)
        ptr_t h;
        + does return HDR_INNER(h)

        # define HDR_FROM_BI(bi, p) \
                ((bi)->index[((word)(p) >> LOG_HBLKSIZE) & (BOTTOM_SZ - 1)])
        # ifndef HASH_TL
        #   define BI(p) (GC_top_index \
                [(word)(p) >> (LOG_BOTTOM_SZ + LOG_HBLKSIZE)])
        #   define HDR_INNER(p) HDR_FROM_BI(BI(p),p)

        # define LOG_BOTTOM_SZ 10
        # define BOTTOM_SZ 1<<10
        # define LOG_HBLKSIZE 12

By instrumenting hb_descr setting with:

Set hb_descr to 16 on hdr 0x81eed70
Set hb_descr to -17 on hdr 0x81efe08

#0  setup_header (hhdr=0x81efe08, sz=4, kind=4, flags=0 '\0') at allchblk.c:243
#1  0x080fc1e5 in GC_allochblk_nth (sz=4, kind=4, flags=0 '\0', n=6) at allchblk.c:743
#2  0x080fbd78 in GC_allochblk (sz=4, kind=4, flags=0) at allchblk.c:558
#3  0x081018a9 in GC_new_hblk (sz=4, kind=4) at new_hblk.c:253
#4  0x081013d0 in GC_allocobj (sz=4, kind=4) at alloc.c:1103
#5  0x080f94f6 in GC_generic_malloc_inner (lb=12, k=4) at malloc.c:136
#6  0x080ff9c6 in GC_gcj_malloc (lb=12, ptr_to_struct_containing_descr=0x81b3b38) at gcj_mlc.c:157
#7  0x080fa20b in GC_local_gcj_malloc (bytes=12, ptr_to_struct_containing_descr=0x81b3b38) at pthread_support.c:387
#8  0x080c1117 in mono_object_new_alloc_specific (vtable=0x81b3b38) at object.c:2089
#9  0x080c127b in mono_object_new_specific (vtable=0x81b3b38) at object.c:2147
#10 0x080af01f in mono_type_get_object (domain=0x81f0f00, type=0x81c7e34) at reflection.c:5381
#11 0x080c2094 in mono_class_vtable (domain=0x81f0f00, class=0x81c7d88) at object.c:861
#12 0x080c20ca in mono_class_vtable (domain=0x81f0f00, class=0x81c7cb8) at object.c:859
#13 0x080c20ca in mono_class_vtable (domain=0x81f0f00, class=0x81c7be8) at object.c:859
#14 0x080c3223 in mono_object_new (domain=0x81f0f00, klass=0x81c7be8) at object.c:2108
#15 0x080af01f in mono_type_get_object (domain=0x81f0f00, type=0x81c550c) at reflection.c:5381
#16 0x080c2094 in mono_class_vtable (domain=0x81f0f00, class=0x81c5460) at object.c:861
#17 0x080c20ca in mono_class_vtable (domain=0x81f0f00, class=0x81cc9a0) at object.c:859
#18 0x080c3223 in mono_object_new (domain=0x81f0f00, klass=0x81cc9a0) at object.c:2108
#19 0x080a7a1f in mono_runtime_init (domain=0x81f0f00, start_cb=0x81036b0, attach_cb=0x8103760) at appdomain.c:96
#20 0x081053cd in mini_init (filename=0xbfffd37d "ViewSample.exe") at mini.c:10053
#21 0x0805b99b in mono_main (argc=2, argv=0xbfffce24) at driver.c:837
#22 0x0805b218 in main (argc=2, argv=0xbfffce24) at main.c:6

So:

(gdb) p GC_obj_kinds[kind].ok_descriptor
$3 = 4294967279
(gdb) p GC_obj_kinds[kind]
$4 = {ok_freelist = 0x81f4000, ok_reclaim_list = 0x8223000, ok_descriptor = 4294967279, ok_relocate_descr = 0, ok_init = 1}
(gdb) p kind
$5 = 4

Clearly ok_descriptor must be valid ... - do we not use this as a size ?

allchblk.c (setup_header):
    descr = GC_obj_kinds[kind].ok_descriptor;
    if (...) descr += WORDS_TO_BYTES(sz);

PUSH_OBJ(obj,hhdr) -> sets mse_descr to _descr

Is this:
    GC_obj_kinds[NORMAL].ok_descriptor = ((word)(-ALIGNMENT) | GC_DS_LENGTH);
responsible !?

/* Predefined kinds: */
# define PTRFREE 0
# define NORMAL  1
# define UNCOLLECTABLE 2
# ifdef ATOMIC_UNCOLLECTABLE
#   define AUNCOLLECTABLE 3
#   define STUBBORN 4
#   define IS_UNCOLLECTABLE(k) (((k) & ~1) == UNCOLLECTABLE)
# else
#   define STUBBORN 3
#   define IS_UNCOLLECTABLE(k) ((k) == UNCOLLECTABLE)
# endif

GC_new_kind_inner screwed up -17

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1075948480 (LWP 8355)]
GC_new_kind_inner (fl=0x81f4000, descr=4294967279, adjust=0, clear=1) at misc.c:1128
1128            *(int *)NULL = 0;
(gdb) bt
#0  GC_new_kind_inner (fl=0x81f4000, descr=4294967279, adjust=0, clear=1) at misc.c:1128
#1  0x080ff8f8 in GC_init_gcj_malloc (mp_index=5, mp=0x0) at gcj_mlc.c:87
#2  0x080c1886 in mono_class_compute_gc_descriptor (class=0x81cc9a0) at object.c:539
#3  0x080c1a77 in mono_class_vtable (domain=0x81f0f00, class=0x81cc9a0) at object.c:695
#4  0x080c3223 in mono_object_new (domain=0x81f0f00, klass=0x81cc9a0) at object.c:2108
#5  0x080a7a1f in mono_runtime_init (domain=0x81f0f00, start_cb=0x8103720, attach_cb=0x81037d0) at appdomain.c:96
#6  0x0810543d in mini_init (filename=0xbfffd37d "ViewSample.exe") at mini.c:10053
#7  0x0805b99b in mono_main (argc=2, argv=0xbfffce24) at driver.c:837
#8  0x0805b218 in main (argc=2, argv=0xbfffce24) at main.c:6
(gdb)

So - I hypothesize - the kinds are meant to be big/negative - but this
is compensated for by the size of the allocation.

    GC_gcj_kind = GC_new_kind_inner(
                        (void **)GC_gcjobjfreelist,
                        (((word)(-MARK_DESCR_OFFSET - GC_INDIR_PER_OBJ_BIAS))
                         | GC_DS_PER_OBJECT),
                        FALSE, TRUE);

export GC_IGNORE_GCJ_INFO=1 - and it works ...

My gamble on GC_gcj_kind being broken ...
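
Aside: the "big/negative but compensated" hypothesis can be checked with
plain 32-bit arithmetic. A sketch only, under loud assumptions: 32-bit
words, ALIGNMENT = 4, MARK_DESCR_OFFSET = 4 (one word) and
GC_INDIR_PER_OBJ_BIAS = 0x10. These values are not stated in the notes;
they are chosen because they reproduce the observed 4294967279 (-17). The
`relocate` flag stands in for the elided `if (...)` in setup_header and is
assumed to correspond to ok_relocate_descr from the gdb dump. All function
names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t word32;            /* 32-bit word, as on the box in these notes */

#define DS_LENGTH            0u
#define DS_PER_OBJECT        3u
#define ALIGNMENT_32         4u     /* assumption */
#define MARK_DESCR_OFFSET_32 4u     /* assumption: one 32-bit word */
#define INDIR_PER_OBJ_BIAS   0x10u  /* assumption */

/* Model of the descriptor fix-up quoted from setup_header. */
static word32 block_descr(word32 kind_descr, int relocate, word32 size_in_words)
{
    word32 descr = kind_descr;
    if (relocate)                                   /* the elided "if (...)" */
        descr += size_in_words * (word32)sizeof(word32); /* WORDS_TO_BYTES(sz) */
    return descr;
}

/* ((word)(-ALIGNMENT) | GC_DS_LENGTH) */
static word32 normal_kind_descr(void)
{
    return (0u - ALIGNMENT_32) | DS_LENGTH;
}

/* ((word)(-MARK_DESCR_OFFSET - GC_INDIR_PER_OBJ_BIAS) | GC_DS_PER_OBJECT) */
static word32 gcj_kind_descr(void)
{
    return (0u - MARK_DESCR_OFFSET_32 - INDIR_PER_OBJ_BIAS) | DS_PER_OBJECT;
}

static void check_observed_values(void)
{
    /* The gcj kind descriptor is the 4294967279 / -17 / 0xffffffef seen in gdb. */
    assert(gcj_kind_descr() == 4294967279u);
    assert((int32_t)gcj_kind_descr() == -17);
    /* With relocation off it reaches the block header unchanged. */
    assert(block_descr(gcj_kind_descr(), 0, 4) == 0xffffffefu);
    /* A length-style kind wraps back to a small positive byte count. */
    assert(block_descr(normal_kind_descr(), 1, 4) == 12u);
}
```

If these assumptions hold, a negative-looking per-object descriptor is
normal for the gcj kind and is never meant to be read as a length, while
the NORMAL kind's negative base is cancelled by adding the object size.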

The problem comes from:

(gdb) bt
#0  setup_header (hhdr=0x81efe08, sz=4, kind=4, flags=0 '\0') at allchblk.c:245
#1  0x080fc1ed in GC_allochblk_nth (sz=4, kind=4, flags=0 '\0', n=6) at allchblk.c:744
#2  0x080fbd80 in GC_allochblk (sz=4, kind=4, flags=0) at allchblk.c:559
#3  0x081018fd in GC_new_hblk (sz=4, kind=4) at new_hblk.c:253
#4  0x08101424 in GC_allocobj (sz=4, kind=4) at alloc.c:1103
#5  0x080f94f6 in GC_generic_malloc_inner (lb=12, k=4) at malloc.c:136
#6  0x080ffa1a in GC_gcj_malloc (lb=12, ptr_to_struct_containing_descr=0x81b3b38) at gcj_mlc.c:157
#7  0x080fa20b in GC_local_gcj_malloc (bytes=12, ptr_to_struct_containing_descr=0x81b3b38) at pthread_support.c:387
#8  0x080c1117 in mono_object_new_alloc_specific (vtable=0x81b3b38) at object.c:2089
#9  0x080c127b in mono_object_new_specific (vtable=0x81b3b38) at object.c:2147
#10 0x080af01f in mono_type_get_object (domain=0x81f0f00, type=0x81c7e34) at reflection.c:5381
#11 0x080c2094 in mono_class_vtable (domain=0x81f0f00, class=0x81c7d88) at object.c:861
#12 0x080c20ca in mono_class_vtable (domain=0x81f0f00, class=0x81c7cb8) at object.c:859
#13 0x080c20ca in mono_class_vtable (domain=0x81f0f00, class=0x81c7be8) at object.c:859
#14 0x080c3223 in mono_object_new (domain=0x81f0f00, klass=0x81c7be8) at object.c:2108
#15 0x080af01f in mono_type_get_object (domain=0x81f0f00, type=0x81c550c) at reflection.c:5381
#16 0x080c2094 in mono_class_vtable (domain=0x81f0f00, class=0x81c5460) at object.c:861
#17 0x080c20ca in mono_class_vtable (domain=0x81f0f00, class=0x81cc9a0) at object.c:859
#18 0x080c3223 in mono_object_new (domain=0x81f0f00, klass=0x81cc9a0) at object.c:2108
#19 0x080a7a1f in mono_runtime_init (domain=0x81f0f00, start_cb=0x8103700, attach_cb=0x81037b0) at appdomain.c:96
#20 0x0810541d in mini_init (filename=0xbfffd37d "ViewSample.exe") at mini.c:10053
#21 0x0805b99b in mono_main (argc=2, argv=0xbfffce24) at driver.c:837
#22 0x0805b218 in main (argc=2, argv=0xbfffce24) at main.c:6
(gdb) up

So - then - negative hb_descrs are to be expected (apparently).

process slot: 27
DS_PER_OBJECT: 0xffffffef
Object of type 'MonoType' has size 536870913
DS_BITMAP: 0x20000001
process slot: 26
DS_PER_OBJECT: 0xffffffef
Object of type 'XModel' has size 138021856
DS_LENGTH: 0x83a0be0
Assertion failure: mark.c:658
assertion failure

...

(gdb) p mark_stack_top
$1 = (mse *) 0x81e50d0
(gdb) p *mark_stack_top
$2 = {mse_start = 0x8240e70, mse_descr = 4294967279}
(gdb) p *mark_stack_top->mse_start
$3 = 137267600
(gdb) p *(MonoVTable **)mark_stack_top->mse_start
$4 = (struct MonoVTable *) 0x82e8990
(gdb) p *$4
$5 = {klass = 0x8398d38, gc_descr = 0x8398ec0, domain = 0x83a0bc0,
  interface_offsets = 0x83a0ce8, data = 0x0, type = 0x81fffa0,
  max_interface_id = 34, rank = 0 '\0', remote = 0, initialized = 1,
  vtable = 0x82e89ac}
(gdb) p *$4->klass
$6 = {image = 0x8225010, enum_basetype = 0x0, element_class = 0x8398d38,
  cast_class = 0x8398d38, rank = 0 '\0', inited = 1, init_pending = 0,
  size_inited = 1, valuetype = 0, enumtype = 0, blittable = 1, unicode = 0,
  wastypebuilder = 0, min_align = 1, packing_size = 0, ghcimpl = 0,
  has_finalize = 0, marshalbyref = 0, contextbound = 0, delegate = 0,
  gc_descr_inited = 1, has_cctor = 0, dummy = 0, has_references = 0,
  has_static_refs = 0, declsec_flags = 0, exception_type = 0,
  exception_data = 0x0, parent = 0x0, nested_in = 0x0, nested_classes = 0x0,
  type_token = 33554953, name = 0x40b62892 "XModel",
  name_space = 0x40b62899 "unoidl.com.sun.star.frame",
  supertypes = 0x83a1988, idepth = 1, interface_count = 1, interface_id = 72,
  max_interface_id = 72, interface_offsets = 0x8397128,
  interfaces = 0x8352e28, instance_size = 8, class_size = 0, vtable_size = 0,
  flags = 161, field = {first = 911, last = 911, count = 0},
  method = {first = 1137, last = 1148, count = 11},
  property = {first = 0, last = 0, count = 0},
  event = {first = 0, last = 0, count = 0}, marshal_info = 0x0, fields = 0x0,
  properties = 0x0, events = 0x0, methods = 0x83985d8,
  this_arg = {data = {klass = 0x8398d38, type = 0x8398d38, array = 0x8398d38,
      method = 0x8398d38, generic_param = 0x8398d38,
      generic_class = 0x8398d38}, attrs = 0, type = 18, num_mods = 0,
    byref = 1, pinned = 0, modifiers = 0x8398de4},
  byval_arg = {data = {klass = 0x8398d38, type = 0x8398d38, array = 0x8398d38,
      method = 0x8398d38, generic_param = 0x8398d38,
      generic_class = 0x8398d38}, attrs = 0, type = 18, num_mods = 0,
    byref = 0, pinned = 0, modifiers = 0x8398dec}, generic_class = 0x0,
  generic_container = 0x0, reflection_info = 0x0, gc_descr = 0x0,
  runtime_info = 0x82470a0, vtable = 0x0}
(gdb) p mono_class_get_parent($4->klass)
$7 = (MonoClass *) 0x0
(gdb)

with:

(gdb) p fprintf(stderr, "0x%x\n", 137989824)
0x8398ec0

ie. the gc_descr looks dead wrong.

** Frighteningly the gc_descr points to another MonoClass - instead of a GC value [!?]

With more debug printout we see:

Add interface 'XComponentContext'
Set pvt->gc_descr to 0x30000001: vtable 0x82e8888
...
Object of type 'TransparentProxy' vtable 0x82e8888 class 0x81cd9d0 currentp 0x8240e70 has size 805306369
slot 29 on stack
DS_BITMAP: 0x30000001
...
Object of type 'XModel' vtable 0x82e8888 class 0x838a538 currentp 0x8240e70 has size 137930328
slot 26 on stack
DS_LENGTH: 0x838a658

The instance seems to change type [!?]

Break in:

Hardware watchpoint 3: *$3

Old value = (MonoClass *) 0x81cd9d0
New value = (MonoClass *) 0x838a538
0x080c1545 in mono_upgrade_remote_class (domain=0x81f5f00, remote_class=0x82bdbd0, klass=0x838a538) at object.c:1108
1108            remote_class->interfaces [remote_class->interface_count-1] = klass;
(gdb) bt
#0  0x080c1545 in mono_upgrade_remote_class (domain=0x81f5f00, remote_class=0x82bdbd0, klass=0x838a538) at object.c:1108
#1  0x0809506a in mono_upgrade_remote_class_wrapper (rtype=0x7, tproxy=0x836c120) at marshal.c:6806

From TransparentProxy -> XModel ...

Then - trampling all over the GC data:

Hardware watchpoint 2: *$2

Old value = (void *) 0x30000001
New value = (void *) 0x838a658
0x080c1545 in mono_upgrade_remote_class (domain=0x81f5f00, remote_class=0x82bdbd0, klass=0x838a658) at object.c:1108
1108            remote_class->interfaces [remote_class->interface_count-1] = klass;
(gdb) bt
#0  0x080c1545 in mono_upgrade_remote_class (domain=0x81f5f00, remote_class=0x82bdbd0, klass=0x838a658) at object.c:1108
#1  0x080c4563 in mono_object_isinst_mbyref (obj=0x837ecd8, klass=0x838a658) at object.c:2862
#2  0x080c655e in ves_icall_type_IsInstanceOfType (type=0x8, obj=0x837ecd8) at icall.c:1189