Closed Bug 37639 Opened 25 years ago Closed 25 years ago

New XUL widgets (XBL) leak like a sieve!

Categories

(Core :: XBL, defect, P1)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: bugs, Assigned: waterson)

References

Details

(Keywords: memory-leak)

Attachments

(3 files)

I checked in a new navigator.xul, containing menubuttons, buttons, etc., as required for nsbeta2/mozilla skinnability. The leak count has risen to ~740K. This isn't good ;) Below is the new leak log just after my checkin: --NEW-LEAKS-------------------------------- nsBookmarksService 84 - nsTimerGtk 88 - MemoryElementSet::List 2148 - nsJSEventListener 9376 - nsJSContext 144 - LocalSearchDataSource 16 - ContentIdTestNode::Element 304 - nsXULContentUtils 12 - nsJISx4501LineBreaker 12 - nsCaseConversionImp2 12 - HTMLStyleSheetImpl 56 - nsBindingSet 1312 - CompositeDataSourceImpl 1216 - nsDOMAttributeMap 288 - Value 2960 - nsXULPrototypeDocument 440 - nsXULPDGlobalObject 96 - nsHTMLElementFactory 24 - nsXULKeyListenerImpl 28 - InternetSearchDataSource 28 - ConflictSet::MatchEntry 6912 - nsXULTemplateBuilder 8512 - CSSParserImpl 504 - CSSLoaderImpl 152 - ConflictSet::SupportEntry 7436 - nsDOMWindowController 16 - nsXULElement 71192 - DateImpl 1720 - XULSortServiceImpl 12 - nsXULElement::Slots 29276 - URLKey 288 - CSSImportantRule 300 - nsXULDocument 524 - LiteralImpl 34768 - nsXULPrototypeText 36 - nsHTTPIndex 48 - Rule 1080 - ArenaImpl 48 - nsXULPrototypeElement 36288 - XULPopupListenerImpl 48 - nsBinding 1848 - nsXULPrototypeCache 28 - Match 3280 - MemoryElementSet 656 - nsJSDOMEventListener 84 - SupportsKey 32 - nsXULCommandDispatcher 56 - nsElementMap 40 - Key 2592 - InMemoryDataSource 36 - Instantiation 1312 - LocalStoreImpl 28 - nsMenuBarListener 28 - nsRDFResource 6440 - nsWindowMediator 44 - HTMLCSSStyleSheetImpl 32 - RDFServiceImpl 48 - nsCharsetMenu 32 - nsElementMap::ContentListItem 3464 - RDFXMLDataSourceImpl 288 - FileSystemDataSource 16 - nsDateTimeFormatUnix 156 - nsRDFDOMNodeList 56 - nsXULPrototypeScript 544 - RDFPropertyTestNode::Element 480 - nsCollationUnix 88 - nsXULAttribute 69300 - nsToolbarDragListener 96 - nsBindingSet::List 3080 - RDFContainerMemberTestNode::Element 2640 - nsToolboxFrame::DragListenerDelegate 16 - nsXBLAttributeEntry 2736 - RDF_Assertion 16152 - RDFContainerInstanceTestNode::Element 960 - AttributeKey 24 - nsXULAttributes 44560 - RDFContainerUtilsImpl 12 - PropertySet 192 - MatchSet 9648 - GenericTableRule 32 - nsAtomList 7904 49300.00% nsEventListenerManager 23688 16350.00% CSSStyleSheetImpl 6432 6600.00% nsCSSSelector 47872 5653.85% nsSupportsArray 14440 4412.50% CSSStyleSheetInner 1620 4400.00% nsXBLBinding 3872 4300.00% nsClassList 2936 3977.78% nsCSSRule 14048 3890.91% nsStdURLParser 840 3400.00% CSSDeclarationImpl 54112 3290.48% nsStdURL 308 2466.67% nsVoidArray 29984 2398.67% nsAttrSelector 16640 1980.00% nsStr 80980 776.41% NameSpaceImpl 2632 754.55% NameSpaceManagerImpl 288 500.00% nsNoAuthURLParser 48 300.00% nsXMLElement 11600 262.50% nsXULControllers 60 200.00% nsXPCComponents 240 150.00% nsXPCWrappedNativeScope 140 150.00% nsHTMLInputElement 176 100.00% XMLElementFactoryImpl 24 100.00% nsGenericFactory 40 100.00% nsXBLEventHandler 11880 83.33% nsTextNode 480 60.00% nsLocalFile 812 40.00% nsCommentNode 240 33.33% HTMLAttributesImpl 600 11.11% HTMLAttribute 304 5.56% TOTAL 724200 I'd rather these problems get fixed than me get backed out because my navigator changes need to happen and from a XUL point of view are perfectly valid. I did not change any observer behaviour. The button elements that observed various back/forward broadcasters were as prior to my checkin, the change just modifying their class name.
I've looked at this bug a bit yesterday and today. I isolated the problem to ben's changes to navigator.xul, navigatorOverlay.xul, and navigator.dtd (versions 1.173, 1.76, and 1.61 respectively) at 2000-04-29 19:42. The leak is ugly. It looks the same as what radha caused a week ago. The "cause" of the leak is, I think, that the nsXULDocument does not get JS-garbage collected. That's the main difference on refcount balancer logs between the before and after pictures. I have a bunch of logs at http://dbaron.student.harvard.edu/u/david/leaks/regress/ben/ when my computer is on. These use the refcount balancer, many including my experimental COMPtr logging tools. The files beginning with good-* are the "before" picture. A filename ending in balance-* is refcount balancer output, where the * can contain ib meaning "--ignore-balance" and "ic" meaning that nsCOMPtrs (balanced in all cases for the nsXULDocument logs) are cancelled out. The leak is one of those things that looks like a circular reference (nsXULDocument::GetScriptObject makes a script object using NS_NewScriptXULDocument, which addrefs the nsXULDocument because it passed this as the second argument), except it looked the same in the "before" picture and it didn't leak then. At least, I think everything goes back to the nsXULDocument. The two JSContexts leaked (serial#s 3 and 4) are owned by two of the 10 nsXULPrototypeDocuments, which are owned by an nsXULDocument, presumably the one that leaked. cc:ing jband in case he has any insight into what causes the garbage collection not to release the nsXULDocument.
Keywords: mlk
This is the same leak that Radha introduced and then fixed (by changing the XUL back) last week. I have a very simple minimum diff to trigger that leak (if you don't have Ben's changes). The differences between the addrefs/releases for a before/after picture of that leak are at: http://dbaron.student.harvard.edu/u/david/leaks/regress/radha/refcnt.diffs
wow, this is really cool. thanks, dbaron, you rock!
Adding self & waterson to cc list. The bug I filed for the same leaks 36961 went to waterson. Maybe one s'd be closed out as dupe.
No clue. It's not menubuttons. Who knows? Handing over to LeakBitch(tm) for triage.
Assignee: hyatt → waterson
My JS reentrancy bug theory doesn't hold water (I didn't actually mention it above, but...). js_DestroyContext is never called (in the leaky world) from within js_GC. (I think it was before the leaks...) It's hard to isolate what XUL/JS caused the leak because Ben's changes were rather large. I can get it down to a chunk where removing the chunk fixes the leak - but that could easily because the JS code that expects some object (in the removed chunk) to exist, causes an error, and prevents the real problem code from being executed. However, I do know that the original navigatorOverlay.xul and navigator.dtd changes did not cause the leak, but adding the navigator.xul changes does cause the leak.
Here's some more guessing about what could be happening here. I'm getting the feeling (although maybe I'm just a pessimist) we're getting into more ugly issues (like bug 28570) relating to the ownership model of JS global objects. It's my understanding that the JS engine wasn't really designed for a situation where things that own JS contexts are being held by other JS contexts. js_DestroyContext(http://lxr.mozilla.org/seamonkey/source/js/src/jscntxt.c#131 ) does some extra-special stuff (which it seems to me is to be sure to free up everything) if the context being destroyed is the last one. Here the last contexts to be destroyed are being held by something (in this case, the XUL Prototype documents) that are not going to be released until another JS context is fully cleaned up (the one associated with the main nsXULDocument). Is it possible that a context is not fully garbage-collected (does that phrase make sense?) when it is destroyed if it is not the last context to be destroyed? If this were the case, would it be easier to change the ownership model (i.e, not have JS contexts own their global object, but make the global object, or something it owns, responsible for destroying the context first) or the JS engine? Since I think jband is on sabbatical, cc:ing brendan. Brendan: Does this theory make any sense at all? What (else) could cause something not to be garbage-collected? (BTW, I think the circular reference issues that worried me before are properly handled in DocumentViewerImpl::~DocumentViewerImpl, GlobalWindowImpl::SetNewDocument, and nsXULDocument::SetScriptGlobalObject .)
*** Bug 36961 has been marked as a duplicate of this bug. ***
Status: NEW → ASSIGNED
Priority: P3 → P1
Target Milestone: --- → M16
>I'm getting the feeling (although maybe I'm just a pessimist) we're getting into >more ugly issues (like bug 28570) relating to the ownership model of JS global >objects. Bug 28570 was fixed while I was on sabbatical, wasn't it? It was a failure of the JS_THREADSAFE mode of the JS garbage collector to cope with more than one JSContext per thread. It was easy to fix, once diagnosed. >It's my understanding that the JS engine wasn't really designed for a >situation where things that own JS contexts are being held by other JS contexts. No, the JS engine was designed (by me) for the old "classic" codebase, which had multiple contexts per thread, one context per window, just like the Mozilla modern codebase. Therefore in trees induced by the frameset-contains-frame relation, a window held strong (owning) references to its frame children, although there could be other strong refs. Each window native data structure owned its JSContext. We eliminated leaks, so the design did address the situation you describe. It's probably worth recapping how the old scheme worked. There were two heaps and two storage management schemes: the JS heap, which was garbage collected; and the libmocha (portion of the malloc) heap, which was reference-counted with strong and weak reference counts. JS heap objects on the left, libmocha on the right, strong refs are -->, weak are ==>, property 'ids' along the arrows: window context <------------------+----------------------------< MWContext | | v v JS: window object ========> libmocha: window's MochaDecoder <======+ | | | 'frames[i]' 'document' | | | | | v | | document object ====> document private data =======+ | v frames[i] (window) object ======> frames[i]'s MochaDecoder ^ ^ | | frames[i] context <---------------+----------------------------< MWContext Note the double-ended arrows between contexts and decoders -- JS contexts are not reference counted, they are manually created and destroyed by a single owner, which itself must be reference-counted in a world of multiple potential references to a window data structure. [Note that the JSContext struct contains no actual void * or otherwise-typed pointer to the decoder. Instead, there was another malloc-heap object, MWContext, corresponding to the MochaDecoder, that pointed to the JSContext and held a strong ref to the decoder that we attributed to the owning pointer (namely, MWContext::mocha_context, of type JSContext *).] The weak references are from JS objects, via their private data slots, to their corresponding libmocha data structures. These refs are weak because a JS object may become garbage and be finalized, but its underlying libmocha structure should not be destroyed until all strong and weak refs from native C/C++ code and from other JS object private data slots have gone away. In particular, references "up the tree" from descendent DOM objects' private data to the window's private data structure, the decoder, must be weak. When a document is released from a window (caused by the first packet of new document data arriving and displacing an old document already loaded in the window), then the document's containing window has its scope and GC roots cleared, and the GC is run. If one of its frame kids is now unreachable by JS strong refs (which are the only kind of JS refs), the frame object is finalized. That decrements the weak reference count (decoder->back_count) for the frame's decoder. If that count reaches 0, and if the strong ref count (forw_count) is already 0, the decoder *and its JSContext* are destroyed. When a native window (MWContext) is closed or otherwise destroyed (app exit), the strong ref held indirectly from the MWContext via the JSContext is removed, and if the strong reference count (decoder->forw_count) therefore goes to 0, then the GC roots are cleared, the context's global object is set to null (rather than clearing the window object's scope, which would frustrate attempts to discover properties set in that window object for discovery by the caller to window.open, after the window has been closed; and which would also waste cycles in the case that no one has a strong ref to the JS window object [it will be GC'd soon enough, so no need to clear its scope]), and the GC is run. Around this GC run, the decoder's back_count is elevated to keep it alive in case there are no strong or weak refs to it. Then the weak ref-count is dropped, and if zero, then *all* memory (GC roots registered with JS, other deep data structures) owned exclusively by the decoder, and the decoder memory itself, are freed. But if there are JS refs to the window (say, in a running script's variables, or stashed in another window's scope), then the weak ref from the window object's private data slot to the decoder keeps its storage alive, even though it is mostly "closed" and all-but-freed. About the only thing it is good for is handling win.name property gets and the like, and of course getting properties set by script in the opened or child window. Whew. Ok, let's update the diagram to the new world, prior to "brutal sharing" and the XUL prototype document: MWContext => webshell, MochaDecoder => GlobalWindowImpl, forw_count => mRefCnt -- oops! The JS window object no longer holds a weak ref (decoder->back_count-accounted pointer in the old world). It adds to the mRefCnt pot of the GlobalWindowImpl. That's big bad difference #1. I can't get to dbaron.student.harvard.edu right now, but based on dbaron's next to last comment about JS_DestroyContext not being called for a certain context, I'm guessing the leak here is not just an nsXULDocument, but a whole DOM window (GlobalWindowImpl). Is that true? If so, I can see why. The lack of a weak ref count in GlobalWindowImpl (back_count in the MochaDecoder in the classic codebase) means that there can be a strong ref cycle, if a JS GC root in the GlobalWindowImpl, or in another data structure reachable from its strong XPCOM refs, points back to the JS object. Note how in the classic world, because the window JS object uses the weak ref count, when there are no strong refs to the native data structure, its roots are all cleared and the GC is run. But the weak ref count keeps the window object's private data alive, albeit "closed". In the modern Mozilla world, I see no such code to clear roots. I may be missing some root clearing and/or removing somewhere in the DOM, that suffices at least for non-XUL (no brutal sharing) windows. But I believe the bug involves only stuck root(s) and refcnts. If there is a stuck GC root, we should be able to find it with js_LiveThingToFind and GC_MARK_DEBUG. I'll help waterson with this tomorrow, if he has time. >js_DestroyContext(http://lxr.mozilla.org/seamonkey/source/js/src/jscntxt.c#131 ) >does some extra-special stuff (which it seems to me is to be sure to free up >everything) if the context being destroyed is the last one. Here the last >contexts to be destroyed are being held by something (in this case, the XUL >Prototype documents) that are not going to be released until another JS context >is fully cleaned up (the one associated with the main nsXULDocument). Then which context is "last"? The answer is the one that goes through the code at the start of js_DestroyContext, which finds the runtime's context-list empty after removing the context. I believe this means that what you call the "last" context above won't be considered last, because if there are other contexts not yet destroyed, they will still be on the runtime's contextList. Anyway, all this JS engine "last context-destruction runtime cleanup" code is way below the level at which I believe this leak bug lives: the world of refcnt cycles and stuck GC roots. >Is it >possible that a context is not fully garbage-collected (does that phrase make >sense?) when it is destroyed if it is not the last context to be destroyed? A context is not a garbage-collected object, so no. Nor is it leaking any roots, as the context holds no explicit GC roots (added by JS_Add{,Named}Root), and its struct members that are roots-by-definition (e.g., globalObject) are no longer roots once the context is off the runtime's contextList (see jsgc.c). >If this were the case, would it be easier to change the ownership model (i.e, >not have JS contexts own their global object, but make the global object, or >something it owns, responsible for destroying the context first) or the JS >engine? None of this Mozilla leak bugginess requires JS engine changes; my proof is the classic codebase. Dbaron, thanks for all your work here, anyway -- JS needs scrutiny, as bug 28570 shows. It's just not at all likely to be at fault here. /be
Sorry, my bad for using notepad and overflowing the 80th column. Please don't bother reading all of my long blurb in-line in the bug -- try reading the attachment instead. /be
Blocks: 8241
Blocks: 37480
Turns out to be a circularity caused by XBL. Specifically, if an XBL event gets dispatched to anonymous content, then the anonymous content's script object gets created. This circularity is *supposed* to be broken by nsIContent::SetDocument(nsnull), but since the parent of the anonymous content isn't aware of it's anonymous children, this never gets called. Hyatt and I are working out a fix for this: I'll patch diffs soon. In the meantime, I've checked in code that comments out "urlbar.focus()" in navigator.js, which was causing this.
Attached patch fix (deleted) — Splinter Review
Sorry it took so long to get the patch up. Friggin' bugzilla. Anyway, the fix is to make sure that nsXULElement::SetDocument() tells the anonymous content nodes created from a binding to fix up *their* document backpointers. This'll cause them to do the right thing when we're tearing down a document. Hyatt sez this change is scary, so help me test it. Seems to work fine for me.
Checked in the changes. I'll wait to mark fix until I see leak stats go down.
ok, it worked! marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: