Another setback with footprint tracing. With the work that jar and
thesteve had done, I thought we'd narrowed the problem down to
fragmentation. But then I remembered that the data included all of the
`dark matter' from trace-malloc's own bookkeeping. Brendan's going to
massage trace-malloc to allocate stacks and table entries from his own
`private pool', so they won't show up in the malloc()
heap.
Tried the
hoard
allocator, and discovered that it had a
similar growth rate
to vanilla glibc malloc(), which leads me to
believe that either (1) hoard has the same problems as
glibc malloc(), or (2) the data that I gave
to jar and thesteve to process was shite. I'm leaning toward the
latter.
Talked to mjudge about trace-malloc on Windows; he's struggling with some of jband's fancy macro-fu. Asked jband to help him tomorrow if there's still trouble.
On a lark,
tried using the BSD4.4 allocator
that can be built as part of NSPR. I was surprised to find that it
actually doesn't exhibit the `perpetual growth' problems that hoard
and glibc malloc() exhibit. Kooky.
Posted some more results from analyzing the lo-tech output from the BSD4.4 allocator last night, but looking at it again, I think I see a problem with this data. Specifically, since I computed `VM size' by subtracting the address of the highest pointer seen to date, it should monotonically increase. But it doesn't! Looks like I was screwing up the Excel graphing.
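A rough sketch of the monotonicity check described above. It assumes (the entry doesn't say) that `VM size' means the highest end address seen so far minus the lowest start address seen so far, and that the lo-tech log can be boiled down to (address, size) pairs. The function names are made up for illustration.

```python
def vm_size_series(allocations):
    """Return the running 'VM size' after each allocation.

    allocations: iterable of (address, size) pairs, in the order they
    were logged. ASSUMPTION: 'VM size' is the highest end address seen
    so far minus the lowest start address seen so far.
    """
    lowest = None
    highest = None
    series = []
    for address, size in allocations:
        end = address + size
        lowest = address if lowest is None else min(lowest, address)
        highest = end if highest is None else max(highest, end)
        series.append(highest - lowest)
    return series


def is_monotonic(series):
    """True if no value in the series is smaller than its predecessor."""
    return all(a <= b for a, b in zip(series, series[1:]))
```

Computed this way the series can't go down, so a dip in the graph points at the plotting step rather than the allocator.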
Trying to get a better long-term picture of URL usage with the
different allocators. I've got a
long list
of URLs, but have also discovered a problem with the `buster'
CGI. Some of the pages have onload handlers that replace
document.top, and this cuts the refresh goop off at the
knees. I'm just pruning those URLs out of the file as I find them, but
it's slow going.
Talked with jst about moving the XUL content model stuff into the layout DLL, and he's going to get heikki to see how much work it'd be to just completely separate the content model stuff into its own DLL.
Friday, January 5, 2001
Got a head-to-head comparison of some different allocators pulled together. Gotta try Doug Lea's allocator, which is apparently pretty good. I'm having some trouble getting it to work with Mozilla out of the box.
Got Doug Lea's allocator up and running. (I was having trouble because
it's not threadsafe by default!) Anyway, v2.6.6 does a bit better than
glibc, and 2.7.0 does a bit better than 2.6.6. (Well, a bit more than
`a bit': its process size was about 9MB smaller after 500 URLs!)
Anyway, I've summarized
the results
and am now trying to collect data on what the live objects are (using
lo-fi printf() technology).
Finally checked in a fix for bug 57026, which was a block-in-inline headache with out-of-flow frames and views.
Talked with jst and heikki briefly: they are moving ahead with
content.dll, so I'll wait to land my XUL merge until
they're done. Memory meeting: rickg to test SmartHeap in
winEmbed; harishd to evaluate parser node arena stuff
using lo-fi malloc instrumentation. dprice to collect
winEmbed stats using jrgm's tests on slow machine.
Collected some quick and dirty data on winEmbed
performance on a low-end (166MHz, 64MB, Win98) machine using jrgm's
stuff. Used TaskInfo2000 to look at working set size: although VM size
is equivalent, our working set size is almost five times larger than
IE5's.
Spent some time debugging winEmbed netwerk foo
along the way: discovered a
deadlock
as well as some
evil surprises
left by ruslan.
Comparing builds with and without rfg's patch to strip pure-virtual vtables. The builds end up being slightly smaller (~1-2%), which is a bit less than I'd expected.
Running trace-malloc build to determine what objects are around after loading ~500 URLs.
Somebody
posted
about new glibc equivalents of
__declspec(dll[import|export]); need to goad someone into
looking into this to see if it'd help reduce the number of symbols we
export.
Finally collected trace-malloc data on 100 URLs, and then 400 URLs. I'll dig into what the objects actually are tomorrow. Met with buster, karnaze, kmcclusk, attinasi to discuss excessive invalidates. Came up with some good stuff to work on, including:
FlushPendingReflows()
Spent some time trying to profile
tomshardware.com,
and had a hell of a time getting Quantify to work right. Must be some
crufty software that I've installed recently. Anyway, it looks like
we're thrashing the heck out of the network: a ton of time is showing
up in PR_ExitMonitor() and such. I wanna take a look at
this from a local file to see if performance is any better.
Rounded up some performance data for the Lea allocator, updated the allocator page to reflect it. Looks like the Lea allocators cause a slight slowdown in page load time. Started fiddling around with jband's idea of tracing calls to produce a file that the linker can use to order functions inside a DLL. The `prototype' works, but I'm having some trouble with static ctors that I need to sort out.
To do...
Figured out problem with tracing and static ctors (needed to save off
ecx), started to merge the changes into the build system, and
handed it off
to dprice.
Started collecting updated gross VM growth information on Doug Lea's pre7 allocator: looks to be about the same as pre6. I'm going to update the Hoard data too, and maybe BSD if I can find it, and re-publish that stuff eventually.
Started analyzing the 100- to 400-URL trace-malloc data in
earnest.
Reduced
the default size for an nsHashtable from 256 to 16. That,
plus radha's
session history limits,
look to knock the VM growth rate down by 40% with Lea's pre7 allocator
(from 10KB to 5.8KB). Sweet justice!
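A quick check of the arithmetic, as a sketch. It assumes the 10KB and 5.8KB figures are VM growth per URL loaded, which the entry implies but doesn't state, and the helper names are invented here.

```python
def growth_per_url(size_start_kb, size_end_kb, urls_loaded):
    """Average VM growth in KB per URL between two measurements."""
    return (size_end_kb - size_start_kb) / urls_loaded


def percent_reduction(before, after):
    """How much smaller 'after' is than 'before', as a percentage."""
    return 100.0 * (before - after) / before


# The two rates quoted in the entry, assumed to be KB per URL.
reduction = percent_reduction(10.0, 5.8)
```

Going from 10KB to 5.8KB works out to roughly 42%, so `40%' is a slight undersell.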
Because I couldn't resist, I cobbled together some perl hackery to
pass to the Win32 linker via /ORDER. At first blush, it
looks like it doesn't have
much effect
on the working set size, and I'm not sure why. cc'd some of the Big
Guns to see if they have any thoughts.
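For the record, the general shape of the idea, written in Python rather than the perl actually used. It assumes the trace is a text file with one decorated function name per call, which is a guess at the format. The linker's /ORDER:@filename option expects a file that lists decorated names one per line, in the desired layout order.

```python
def order_from_trace(trace_lines):
    """Return function names in first-call order, without repeats.

    trace_lines: iterable of strings, one decorated name per call.
    ASSUMPTION: this input format is hypothetical; the real tracing
    output may look different.
    """
    seen = set()
    ordered = []
    for line in trace_lines:
        name = line.strip()
        if not name or name in seen:
            continue  # skip blank lines and functions already placed
        seen.add(name)
        ordered.append(name)
    return ordered


def order_file_text(names):
    """Format names as the body of an /ORDER file, one name per line."""
    return "".join(name + "\n" for name in names)
```

Putting functions in first-call order is meant to pack the code touched at startup onto as few pages as possible. Whether that actually shrinks the working set is exactly what's in question above.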
Spent most of the day poking around at resident set size. I compared
two rebased builds, and noticed only a moderate (300KB, or 3%)
improvement with winEmbed, testing startup only.
One thing that I tried to do (but couldn't) was to strip off the
relocations (using rebase -f). I'm curious if each
DLL's static data is also being properly rebased.
Found a Pietrek article on the
Win32 Portable Executable File Format,
which described DLL and executable layout on Win32. I stumbled across
it trying to figure out what the HIGHLOW relocations
are. Found another article that goes into
rebasing
in detail.
Spent some time debugging 64929 with Bhuvan. It's a crasher that looks to have something to do with a bizarre installation config.
Spent some time trying to figure out whether RTTI bloats stuff significantly. Looks like it does (about 5% across the board), but certainly not to the level that the AOL folks think it does. Which means that there's probably another bugbear waiting in the weeds.
Jody mentioned that ``Exports, and the inability to properly control
them, are also another cause of bloat from gcc.'' I wonder if the new
glibc-2.2 attributes that
rth mentioned
would be of any use in controlling this problem? (I also asked
Jody to elaborate...)
Talked to blizzard and brendan about this: There are (apparently) a
large number of export entries in each .so's Global
Offset Table. It's possible that using the private
attribute (or somesuch) could eliminate this, but cls said last week
that bryner and wtc were working on a post-processing tool to strip
this stuff out. Need to follow up there...
Ok, wow. So I was wrong: it turns out that with
gcc-2.95.2, the RTTI stuff does generate much
larger .so files. Should get someone to look into that.
Talked with rpotts a bit about tackling the Win32 resident set size.
Couple of ``to do'' ideas:
VirtualQueryEx() and friends to walk the process's page table; sync it up with the map file.
.so bloat on gcc-2.95.2
trace-malloc data on object growth with radha's patch turned on.
Spent some time hacking on a program that walks the process's page
table with VirtualQueryEx(), but realized that this isn't
going to tell me if a page is resident or not.
Started reading the Solomon book that discusses NT memory management
in hopes that I'll be able to figure out how to track what pages are
resident. Collected trace-malloc data, did quick summary,
and
posted
it. Put together status report that describes where we are now, and
tries to estimate how far we'll get going forward.
Cached design review. Filed a few bugs based on
trace-malloc data collected yesterday.
jband criticized the way I was trying to detect resident set improvements: he suggested that we wouldn't see much difference just doing a simple startup and shutdown. So, I frittered away some time running a full-blown Mozilla build, collecting data. It might have made some difference: at one point I'd convinced myself that the resident set size was about 1.5 to 2MB lower (from just under 14MB to just over 12MB) when I ran with the ``ordered'' build. A second run to verify showed a more modest (200KB) difference.
Still no progress on determining what pages are actually in memory. I finished that chapter from Solomon, and convinced myself that what we really need to do is try to read the process's page table, and dump that out. Didn't have a chance to experiment with that today.
Gotta catch up on layout bugs so I can pick buster's brain before he bails.
Went through layout bugs; four categories: 1) performance, 2) inline
borders/margins/padding, 3) block-in-inline, and 4) text runs through
non-text leaf inline frames. Another hour-long session with buster and
crew. Some back-and-forth with rfg trying to figure out why we're
getting different results. Updated footprint estimate stuff with
phil's feedback and forwarded to embed-eng to proof-read.
Found a
small problem
with buster's frame hint stuff: a case where floaters (or probably
other absolutely positioned items) would send the
nsCSSFrameConstructor::FindFrameWithContent off into the
weeds.
Did a startup profile and forwarded it to the newsgroup.
Twiddling with more patches from rfg.