Closed
Bug 37639
Opened 25 years ago
Closed 25 years ago
New XUL widgets (XBL) leak like a sieve!
Categories
(Core :: XBL, defect, P1)
Core
XBL
Tracking
()
RESOLVED
FIXED
M16
People
(Reporter: bugs, Assigned: waterson)
References
Details
(Keywords: memory-leak)
Attachments
(3 files)
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
Details | Diff | Splinter Review |
I checked in a new navigator.xul, containing menubuttons, buttons, etc., as
required for nsbeta2/mozilla skinnability. The leak count has risen to ~740K.
This isn't good ;)
Below is the new leak log just after my checkin:
--NEW-LEAKS--------------------------------
nsBookmarksService 84 -
nsTimerGtk 88 -
MemoryElementSet::List 2148 -
nsJSEventListener 9376 -
nsJSContext 144 -
LocalSearchDataSource 16 -
ContentIdTestNode::Element 304 -
nsXULContentUtils 12 -
nsJISx4501LineBreaker 12 -
nsCaseConversionImp2 12 -
HTMLStyleSheetImpl 56 -
nsBindingSet 1312 -
CompositeDataSourceImpl 1216 -
nsDOMAttributeMap 288 -
Value 2960 -
nsXULPrototypeDocument 440 -
nsXULPDGlobalObject 96 -
nsHTMLElementFactory 24 -
nsXULKeyListenerImpl 28 -
InternetSearchDataSource 28 -
ConflictSet::MatchEntry 6912 -
nsXULTemplateBuilder 8512 -
CSSParserImpl 504 -
CSSLoaderImpl 152 -
ConflictSet::SupportEntry 7436 -
nsDOMWindowController 16 -
nsXULElement 71192 -
DateImpl 1720 -
XULSortServiceImpl 12 -
nsXULElement::Slots 29276 -
URLKey 288 -
CSSImportantRule 300 -
nsXULDocument 524 -
LiteralImpl 34768 -
nsXULPrototypeText 36 -
nsHTTPIndex 48 -
Rule 1080 -
ArenaImpl 48 -
nsXULPrototypeElement 36288 -
XULPopupListenerImpl 48 -
nsBinding 1848 -
nsXULPrototypeCache 28 -
Match 3280 -
MemoryElementSet 656 -
nsJSDOMEventListener 84 -
SupportsKey 32 -
nsXULCommandDispatcher 56 -
nsElementMap 40 -
Key 2592 -
InMemoryDataSource 36 -
Instantiation 1312 -
LocalStoreImpl 28 -
nsMenuBarListener 28 -
nsRDFResource 6440 -
nsWindowMediator 44 -
HTMLCSSStyleSheetImpl 32 -
RDFServiceImpl 48 -
nsCharsetMenu 32 -
nsElementMap::ContentListItem 3464 -
RDFXMLDataSourceImpl 288 -
FileSystemDataSource 16 -
nsDateTimeFormatUnix 156 -
nsRDFDOMNodeList 56 -
nsXULPrototypeScript 544 -
RDFPropertyTestNode::Element 480 -
nsCollationUnix 88 -
nsXULAttribute 69300 -
nsToolbarDragListener 96 -
nsBindingSet::List 3080 -
RDFContainerMemberTestNode::Element 2640 -
nsToolboxFrame::DragListenerDelegate 16 -
nsXBLAttributeEntry 2736 -
RDF_Assertion 16152 -
RDFContainerInstanceTestNode::Element 960 -
AttributeKey 24 -
nsXULAttributes 44560 -
RDFContainerUtilsImpl 12 -
PropertySet 192 -
MatchSet 9648 -
GenericTableRule 32 -
nsAtomList 7904 49300.00%
nsEventListenerManager 23688 16350.00%
CSSStyleSheetImpl 6432 6600.00%
nsCSSSelector 47872 5653.85%
nsSupportsArray 14440 4412.50%
CSSStyleSheetInner 1620 4400.00%
nsXBLBinding 3872 4300.00%
nsClassList 2936 3977.78%
nsCSSRule 14048 3890.91%
nsStdURLParser 840 3400.00%
CSSDeclarationImpl 54112 3290.48%
nsStdURL 308 2466.67%
nsVoidArray 29984 2398.67%
nsAttrSelector 16640 1980.00%
nsStr 80980 776.41%
NameSpaceImpl 2632 754.55%
NameSpaceManagerImpl 288 500.00%
nsNoAuthURLParser 48 300.00%
nsXMLElement 11600 262.50%
nsXULControllers 60 200.00%
nsXPCComponents 240 150.00%
nsXPCWrappedNativeScope 140 150.00%
nsHTMLInputElement 176 100.00%
XMLElementFactoryImpl 24 100.00%
nsGenericFactory 40 100.00%
nsXBLEventHandler 11880 83.33%
nsTextNode 480 60.00%
nsLocalFile 812 40.00%
nsCommentNode 240 33.33%
HTMLAttributesImpl 600 11.11%
HTMLAttribute 304 5.56%
TOTAL 724200
I'd rather these problems get fixed than me get backed out because my navigator
changes need to happen and from a XUL point of view are perfectly valid. I did
not change any observer behaviour. The button elements that observed various
back/forward broadcasters were as prior to my checkin, the change just modifying
their class name.
Comment 1•25 years ago
|
||
I've looked at this bug a bit yesterday and today. I isolated the problem to
ben's changes to navigator.xul, navigatorOverlay.xul, and navigator.dtd
(versions 1.173, 1.76, and 1.61 respectively) at 2000-04-29 19:42.
The leak is ugly. It looks the same as what radha caused a week ago. The
"cause" of the leak is, I think, that the nsXULDocument does not get JS-garbage
collected. That's the main difference on refcount balancer logs between the
before and after pictures. I have a bunch of logs at
http://dbaron.student.harvard.edu/u/david/leaks/regress/ben/ when my computer is
on. These use the refcount balancer, many including my experimental COMPtr
logging tools. The files beginning with good-* are the "before" picture. A
filename ending in balance-* is refcount balancer output, where the * can
contain ib meaning "--ignore-balance" and "ic" meaning that nsCOMPtrs (balanced
in all cases for the nsXULDocument logs) are cancelled out.
The leak is one of those things that looks like a circular reference
(nsXULDocument::GetScriptObject makes a script object using
NS_NewScriptXULDocument, which addrefs the nsXULDocument because it passed this
as the second argument), except it looked the same in the "before" picture and
it didn't leak then.
At least, I think everything goes back to the nsXULDocument. The two JSContexts
leaked (serial#s 3 and 4) are owned by two of the 10 nsXULPrototypeDocuments,
which are owned by an nsXULDocument, presumably the one that leaked.
cc:ing jband in case he has any insight into what causes the garbage collection
not to release the nsXULDocument.
Comment 2•25 years ago
|
||
This is the same leak that Radha introduced and then fixed (by changing the XUL
back) last week. I have a very simple minimum diff to trigger that leak (if you
don't have Ben's changes).
The differences between the addrefs/releases for a before/after picture of that
leak are at:
http://dbaron.student.harvard.edu/u/david/leaks/regress/radha/refcnt.diffs
Comment 3•25 years ago
|
||
Reporter | ||
Comment 4•25 years ago
|
||
wow, this is really cool. thanks, dbaron, you rock!
Comment 5•25 years ago
|
||
Adding self & waterson to cc list. The bug I filed for the same leaks 36961 went
to waterson. Maybe one s'd be closed out as dupe.
Comment 6•25 years ago
|
||
No clue. It's not menubuttons. Who knows? Handing over to LeakBitch(tm) for
triage.
Assignee: hyatt → waterson
Comment 7•25 years ago
|
||
My JS reentrancy bug theory doesn't hold water (I didn't actually mention it
above, but...). js_DestroyContext is never called (in the leaky world) from
within js_GC. (I think it was before the leaks...)
It's hard to isolate what XUL/JS caused the leak because Ben's changes were
rather large. I can get it down to a chunk where removing the chunk fixes the
leak - but that could easily because the JS code that expects some object (in
the removed chunk) to exist, causes an error, and prevents the real problem code
from being executed.
However, I do know that the original navigatorOverlay.xul and navigator.dtd
changes did not cause the leak, but adding the navigator.xul changes does cause
the leak.
Comment 8•25 years ago
|
||
Here's some more guessing about what could be happening here.
I'm getting the feeling (although maybe I'm just a pessimist) we're getting into
more ugly issues (like bug 28570) relating to the ownership model of JS global
objects. It's my understanding that the JS engine wasn't really designed for a
situation where things that own JS contexts are being held by other JS contexts.
js_DestroyContext(http://lxr.mozilla.org/seamonkey/source/js/src/jscntxt.c#131 )
does some extra-special stuff (which it seems to me is to be sure to free up
everything) if the context being destroyed is the last one. Here the last
contexts to be destroyed are being held by something (in this case, the XUL
Prototype documents) that are not going to be released until another JS context
is fully cleaned up (the one associated with the main nsXULDocument). Is it
possible that a context is not fully garbage-collected (does that phrase make
sense?) when it is destroyed if it is not the last context to be destroyed?
If this were the case, would it be easier to change the ownership model (i.e,
not have JS contexts own their global object, but make the global object, or
something it owns, responsible for destroying the context first) or the JS
engine?
Since I think jband is on sabbatical, cc:ing brendan. Brendan: Does this
theory make any sense at all? What (else) could cause something not to be
garbage-collected?
(BTW, I think the circular reference issues that worried me before are properly
handled in DocumentViewerImpl::~DocumentViewerImpl,
GlobalWindowImpl::SetNewDocument, and nsXULDocument::SetScriptGlobalObject .)
Assignee | ||
Updated•25 years ago
|
Status: NEW → ASSIGNED
Priority: P3 → P1
Target Milestone: --- → M16
Comment 10•25 years ago
|
||
>I'm getting the feeling (although maybe I'm just a pessimist) we're getting
into
>more ugly issues (like bug 28570) relating to the ownership model of JS global
>objects.
Bug 28570 was fixed while I was on sabbatical, wasn't it? It was a failure of
the
JS_THREADSAFE mode of the JS garbage collector to cope with more than one
JSContext
per thread. It was easy to fix, once diagnosed.
>It's my understanding that the JS engine wasn't really designed for a
>situation where things that own JS contexts are being held by other JS
contexts.
No, the JS engine was designed (by me) for the old "classic" codebase, which had
multiple contexts per thread, one context per window, just like the Mozilla
modern
codebase. Therefore in trees induced by the frameset-contains-frame relation, a
window held strong (owning) references to its frame children, although there
could
be other strong refs. Each window native data structure owned its JSContext.
We
eliminated leaks, so the design did address the situation you describe.
It's probably worth recapping how the old scheme worked. There were two heaps
and
two storage management schemes: the JS heap, which was garbage collected; and
the
libmocha (portion of the malloc) heap, which was reference-counted with strong
and
weak reference counts. JS heap objects on the left, libmocha on the right,
strong
refs are -->, weak are ==>, property 'ids' along the arrows:
window context <------------------+----------------------------<
MWContext
| |
v v
JS: window object ========> libmocha: window's MochaDecoder <======+
| | |
'frames[i]' 'document' |
| | |
| v |
| document object ====> document private data =======+
|
v
frames[i] (window) object ======> frames[i]'s MochaDecoder
^ ^
| |
frames[i] context <---------------+----------------------------<
MWContext
Note the double-ended arrows between contexts and decoders -- JS contexts are
not
reference counted, they are manually created and destroyed by a single owner,
which
itself must be reference-counted in a world of multiple potential references to
a
window data structure.
[Note that the JSContext struct contains no actual void * or otherwise-typed
pointer to the decoder. Instead, there was another malloc-heap object,
MWContext,
corresponding to the MochaDecoder, that pointed to the JSContext and held a
strong
ref to the decoder that we attributed to the owning pointer (namely,
MWContext::mocha_context, of type JSContext *).]
The weak references are from JS objects, via their private data slots, to their
corresponding libmocha data structures. These refs are weak because a JS object
may become garbage and be finalized, but its underlying libmocha structure
should
not be destroyed until all strong and weak refs from native C/C++ code and from
other JS object private data slots have gone away. In particular, references
"up
the tree" from descendent DOM objects' private data to the window's private data
structure, the decoder, must be weak.
When a document is released from a window (caused by the first packet of new
document data arriving and displacing an old document already loaded in the
window),
then the document's containing window has its scope and GC roots cleared, and
the GC
is run. If one of its frame kids is now unreachable by JS strong refs (which
are
the only kind of JS refs), the frame object is finalized. That decrements the
weak
reference count (decoder->back_count) for the frame's decoder. If that count
reaches 0, and if the strong ref count (forw_count) is already 0, the decoder
*and
its JSContext* are destroyed.
When a native window (MWContext) is closed or otherwise destroyed (app exit),
the
strong ref held indirectly from the MWContext via the JSContext is removed, and
if
the strong reference count (decoder->forw_count) therefore goes to 0, then the
GC
roots are cleared, the context's global object is set to null (rather than
clearing
the window object's scope, which would frustrate attempts to discover properties
set
in that window object for discovery by the caller to window.open, after the
window
has been closed; and which would also waste cycles in the case that no one has a
strong ref to the JS window object [it will be GC'd soon enough, so no need to
clear its scope]), and the GC is run.
Around this GC run, the decoder's back_count is elevated to keep it alive in
case
there are no strong or weak refs to it. Then the weak ref-count is dropped, and
if
zero, then *all* memory (GC roots registered with JS, other deep data
structures)
owned exclusively by the decoder, and the decoder memory itself, are freed. But
if
there are JS refs to the window (say, in a running script's variables, or
stashed
in another window's scope), then the weak ref from the window object's private
data
slot to the decoder keeps its storage alive, even though it is mostly "closed"
and
all-but-freed. About the only thing it is good for is handling win.name
property
gets and the like, and of course getting properties set by script in the opened
or
child window.
Whew.
Ok, let's update the diagram to the new world, prior to "brutal sharing" and the
XUL prototype document: MWContext => webshell, MochaDecoder => GlobalWindowImpl,
forw_count => mRefCnt -- oops! The JS window object no longer holds a weak ref
(decoder->back_count-accounted pointer in the old world). It adds to the
mRefCnt
pot of the GlobalWindowImpl. That's big bad difference #1.
I can't get to dbaron.student.harvard.edu right now, but based on dbaron's next
to last comment about JS_DestroyContext not being called for a certain context,
I'm guessing the leak here is not just an nsXULDocument, but a whole DOM window
(GlobalWindowImpl). Is that true? If so, I can see why. The lack of a weak
ref count in GlobalWindowImpl (back_count in the MochaDecoder in the classic
codebase) means that there can be a strong ref cycle, if a JS GC root in the
GlobalWindowImpl, or in another data structure reachable from its strong XPCOM
refs, points back to the JS object.
Note how in the classic world, because the window JS object uses the weak ref
count, when there are no strong refs to the native data structure, its roots are
all cleared and the GC is run. But the weak ref count keeps the window object's
private data alive, albeit "closed". In the modern Mozilla world, I see no such
code to clear roots.
I may be missing some root clearing and/or removing somewhere in the DOM, that
suffices at least for non-XUL (no brutal sharing) windows. But I believe the
bug
involves only stuck root(s) and refcnts.
If there is a stuck GC root, we should be able to find it with
js_LiveThingToFind
and GC_MARK_DEBUG. I'll help waterson with this tomorrow, if he has time.
>js_DestroyContext(http://lxr.mozilla.org/seamonkey/source/js/src/jscntxt.c#131
)
>does some extra-special stuff (which it seems to me is to be sure to free up
>everything) if the context being destroyed is the last one. Here the last
>contexts to be destroyed are being held by something (in this case, the XUL
>Prototype documents) that are not going to be released until another JS context
>is fully cleaned up (the one associated with the main nsXULDocument).
Then which context is "last"? The answer is the one that goes through the code
at the start of js_DestroyContext, which finds the runtime's context-list empty
after removing the context. I believe this means that what you call the "last"
context above won't be considered last, because if there are other contexts not
yet destroyed, they will still be on the runtime's contextList.
Anyway, all this JS engine "last context-destruction runtime cleanup" code is
way below the level at which I believe this leak bug lives: the world of refcnt
cycles and stuck GC roots.
>Is it
>possible that a context is not fully garbage-collected (does that phrase make
>sense?) when it is destroyed if it is not the last context to be destroyed?
A context is not a garbage-collected object, so no. Nor is it leaking any
roots,
as the context holds no explicit GC roots (added by JS_Add{,Named}Root), and its
struct members that are roots-by-definition (e.g., globalObject) are no longer
roots once the context is off the runtime's contextList (see jsgc.c).
>If this were the case, would it be easier to change the ownership model (i.e,
>not have JS contexts own their global object, but make the global object, or
>something it owns, responsible for destroying the context first) or the JS
>engine?
None of this Mozilla leak bugginess requires JS engine changes; my proof is the
classic codebase.
Dbaron, thanks for all your work here, anyway -- JS needs scrutiny, as bug 28570
shows. It's just not at all likely to be at fault here.
/be
Comment 11•25 years ago
|
||
Comment 12•25 years ago
|
||
Sorry, my bad for using notepad and overflowing the 80th column. Please don't
bother reading all of my long blurb in-line in the bug -- try reading the
attachment instead.
/be
Assignee | ||
Comment 13•25 years ago
|
||
Turns out to be a circularity caused by XBL. Specifically, if an XBL event gets
dispatched to anonymous content, then the anonymous content's script object gets
created. This circularity is *supposed* to be broken by
nsIContent::SetDocument(nsnull), but since the parent of the anonymous content
isn't aware of it's anonymous children, this never gets called.
Hyatt and I are working out a fix for this: I'll patch diffs soon. In the
meantime, I've checked in code that comments out "urlbar.focus()" in
navigator.js, which was causing this.
Assignee | ||
Comment 14•25 years ago
|
||
Assignee | ||
Comment 15•25 years ago
|
||
Sorry it took so long to get the patch up. Friggin' bugzilla. Anyway, the fix is
to make sure that nsXULElement::SetDocument() tells the anonymous content nodes
created from a binding to fix up *their* document backpointers. This'll cause
them to do the right thing when we're tearing down a document.
Hyatt sez this change is scary, so help me test it. Seems to work fine for me.
Assignee | ||
Comment 16•25 years ago
|
||
Checked in the changes. I'll wait to mark fix until I see leak stats go down.
Assignee | ||
Comment 17•25 years ago
|
||
ok, it worked! marking fixed.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•