Profiling information
adderd

Could I trouble anyone to compile verge (from SVN) with profiling enabled in windows and send me the relevant info? I'm thinking there's more slowing things down in linux than just the opengl visualization. In fact, the opengl stuff doesn't seem that slow... it seems like something else is gooing it up, and a profile output from windows would really help. I think we talked before about whether Visual C++ supports good profiling but never came to a good conclusion on that. Any extra info?

Posted on 2006-05-21 22:29:08

Kildorf

Visual C++ doesn't seem to come with any profiling stuff built in. I'm not sure why... anyway.

There are some third-party plug-ins for profiling, but I've not had any luck getting them to actually work, and most aren't free, from what I've been able to figure out. Has anyone else had any success getting these things working? I may just be an incompetent buffoon. :D

Posted on 2006-05-22 09:27:08

adderd

I found out that VC6 did have a profiler, but it's only available in the Enterprise edition (the pricks...)

But AMD CodeAnalyst is available for free and it does work (for the most part) on Intel chips too. So, I'm going to use that to profile in windows and see what's different.

Posted on 2006-05-23 11:21:30

adderd

Okie dokie... Well, I got Visual Studio 2005 set up and compiled verge. Then I used CodeAnalyst to profile it. I've also compiled verge in linux and used gprof to profile it there. In both cases I've enabled normal optimizations (/O2 with MSVC, -O2 with GCC).

This is the top of the data from windows:
IP Funct PID %
"0x4be8a0","memset","","5036","3.28","5036","5036","0",
"0x4bd570","memcpy","","3164","2.06","3164","3164","0",
"0x459aa0","dd32_Blit_lucent","","3152","2.05","3152","3152","0",
"0x45a340","dd32_TBlitTile","","2738","1.78","2738","2738","0",
"0x43a730","VCCore::ProcessOperand","","2663","1.73","2663","2663","0",
"0x45cf40","dd32_AlphaBlit","","2254","1.47","2254","2254","0",
"0x457a70","dd32_HLine","","1911","1.24","1911","1911","0",
"0x4b8420","CheckBytes","","1820","1.19","1820","1820","0",
"0x43aae0","VCCore::ResolveOperand","","1804","1.17","1804","1804","0",
"0x45a200","dd32_BlitTile","","1761","1.15","1761","1761","0",
"0x4bd8e0","strncpy","","1751","1.14","1751","1751","0",
"0x4b6ec0","_heap_alloc_dbg","","1546","1.01","1546","1546","0",
"0x41a1c0","Chunk::GrabC","","1505","0.98","1505","1505","0",
"0x408f80","rawmem::resize","","1426","0.93","1426","1426","0",
"0x4b7ca0","_free_dbg_nolock","","1136","0.74","1136","1136","0",
"0x4b61b0","operator delete","","1104","0.72","1104","1104","0",
"0x4386d0","VCCore::HandleAssign","","888","0.58","888","888","0",
"0x4093b0","string::string","","780","0.51","780","780","0",
"0x426be0","std::vector<int_t *,std::allocator<int_t *>
>::operator[]","","770","0.50","770","770","0",
"0x4b7c20","_free_dbg","","756","0.49","756","756","0",
"0x41a200","Chunk::GrabD","","669","0.44","669","669","0",

The percentages above are weird because CodeAnalyst profiles the whole system not just the application. Multiply all above values by around 3 and you get the program's percentage numbers.

Linux:
  %    cumulative   self                self    total
 time    seconds   seconds      calls  s/call  s/call  name
14.16      54.00     54.00  204788829    0.00    0.00  Chunk::GrabC()
12.94     103.34     49.34   77127574    0.00    0.00  VCCore::ProcessOperand()
 7.25     130.98     27.64     976422    0.00    0.00  dd32_BlitTile(int, int, char*, image*)
 6.66     156.39     25.41   49439747    0.00    0.00  VCCore::ResolveOperand()
 6.28     180.36     23.97   21918139    0.00    0.00  dd32_HLine(int, int, int, int, image*)
 5.66     201.94     21.58   78136841    0.00    0.00  Chunk::GrabD()
 4.90     220.62     18.68   27870550    0.00    0.00  rawmem::resize(int, char const*)
 4.42     237.47     16.85     597936    0.00    0.00  dd32_TBlitTile(int, int, char*, image*)
 3.94     252.48     15.01       6391    0.00    0.00  dd32_AlphaBlit(int, int, image*, image*, image*)
 2.89     263.51     11.03       6242    0.00    0.00  dd32_Blit_lucent(int, int, image*, image*)
 2.36     272.53      9.02    9601475    0.00    0.00  VCCore::HandleAssign()
 2.04     280.30      7.77   40927397    0.00    0.00  rawmem::destroy()
 1.55     286.21      5.91   14803057    0.00    0.00  rawmem::become_string(char const*)
 1.47     291.81      5.60   12425758    0.00    0.00  VCCore::ReadInt(int, int, int)
 1.42     297.23      5.42   23512247    0.00    0.00  image::GetClip(int&, int&, int&, int&)
 1.32     302.25      5.02   18583379    0.00    0.00  rawmem::get(int, unsigned int) const
 1.30     307.22      4.97   79846031    0.00    0.00  rawmem::length() const
 1.24     311.94      4.72   13067224    0.00    0.00  rawmem::rawmem(int, char const*)
 1.07     316.01      4.07   14803595    0.00    0.00  rawmem::touch(unsigned int)
 1.04     319.97      3.96     219308    0.00    0.00  VCCore::ExecuteBlock()

It's interesting that the profiles are very different, though you can see a general pattern: several functions rank near the top in both. It's also interesting that in windows two C library calls (memset, memcpy) take up a lot of time, while in the linux version Chunk::GrabC takes up a terribly large amount of time.

Well, sorry for the long, dry post. I can provide the full output files to anyone interested in helping to optimize verge's speed.

Posted on 2006-05-23 22:30:35

mcgrue

Is col #5 the time spent in the windows log?

If so, there's quite a divergence between the two verges, no? Why would GrabC() be taking 14x the time?

...what does GrabC even *do*? :o

Posted on 2006-05-24 20:51:37

adderd

Quote:Originally posted by mcgrue

Is col #5 the time spent in the windows log?

If so, there's quite a divergence between the two verges, no? Why would GrabC() be taking 14x the time?

...what does GrabC even *do*? :o


Well, column 5 is the percentage of total CPU time spent in each function. Strangely, though, the windows profiler thinks that verge's exe is only getting 33% of the time (it's because the program also makes system calls and such, and other programs are running in the background, so there's a lot of processor time going elsewhere). So multiply by 3. That's still 14x / 3 = approx 4x slower. But it's not necessarily even 4x slower; it's all really complicated. I think the memcpy and memset that you see in the windows profile would never show up in the linux profile because they're C library calls, and those aren't getting profiled by gprof in linux. So you really can't compare the profiles at all... my bad... It would be useful if I could profile just verge, and not the whole OS, in windows. Anyway, basically I can only compare profile output from the same tool (CodeAnalyst or gprof). Comparing between the two is naughty.

Anyway, GrabC gets data, basically the data that the verge core needs for the interpreter. As it's interpreting, it uses GrabC, GrabD, GrabStr, etc. to get more data (say, the next operand, or data related to that operand). As such it gets called... A LOT. And by a lot I mean A LOT. I've made it inline and taken out the bounds checks, and that makes VergeMark about 10% faster for me in windows; it seems the same might be true in linux as well. I think we ought to have some #if's that detect whether it's a debug build, keeping the bounds checks for debug builds but leaving them out for release builds. That's easy enough to do in windows, linux, and macos, since there are defines for which kind of build is going on.
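To show what I mean by the debug-only bounds checks, here's a minimal sketch (the member names are made up for illustration, not the real Chunk layout, and it assumes the release build defines NDEBUG, which is the usual convention for MSVC and GCC release configurations):

#include <cassert>

// Sketch of an inlined GrabC whose bounds check compiles away in release
// builds (where NDEBUG is defined) but stays in for debug builds.
// The member names below are illustrative, not verge's actual Chunk layout.
class Chunk
{
public:
    unsigned char GrabC()   // small enough for the compiler to inline
    {
        assert(pos < size && "Chunk::GrabC: read past end of chunk");
        return data[pos++];
    }

private:
    unsigned char *data;
    unsigned int   size;
    unsigned int   pos;
};

You could just as well #ifdef an explicit check on _DEBUG / NDEBUG; assert just happens to do that wiring for you.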

So, the Grab* functions are optimized. I don't foresee them getting much better. The only thing that might be faster is to remove them completely and access the memory directly in the interpreter. That's not necessarily a good thing, as it makes changing the architecture hard and further obfuscates the code. I'd say it's not worth it.

Now, that leaves two main areas:
1. ProcessOperand / ResolveOperand
2. The Tile/Line/Blit functions

Both take up a lot of the time. They should; they have the most calls and basically do the most obvious work. However, let's say that we can cut the processing overhead of ProcessOperand/ResolveOperand in half, and that combined they take up 20% of the time in VergeMark. That leads to another 10% speed up.

The drawing functions factor very heavily in both the windows and linux profiles, so obviously they'd be a good place to optimize. I'm looking at them now. I see that dd32_blit uses memcpy to do blitting. If GCC doesn't optimize that into tight assembly (which VC does) then that might be why the linux version runs like a dog and the release version on windows doesn't (it turns out lots of those calls to memcpy you see in the windows profile come from my profiling a debug build by mistake). I wonder what would happen to the linux version if I replaced all those memcpy's with for loops (like what dd32_hline uses to draw a line); a for loop that just copies data from one place to another should easily be caught by any optimizer and turned into uber-code. Other than that, the drawing routines are really pretty well optimized. If you all don't mind requiring something a little more current, we can make verge take advantage of SSE2 to really pep up the performance of things like the lucent blitter. You could use MMX instead if you don't want to require SSE2-capable chips; an MMXified blitter should still be loads faster than a normal one.
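To give an idea of what the SSE2 route looks like, here's a rough sketch of a 50/50 lucent blend over a run of 32-bpp pixels using intrinsics (the function name and arguments are mine, not verge's; a real lucent blitter would also need clipping, transparent-color handling, arbitrary lucency levels, and so on):

#include <emmintrin.h>  // SSE2 intrinsics

// Blend 'count' 32-bpp pixels from src into dest at 50% lucency.
// _mm_avg_epu8 averages each byte pair ((a + b + 1) >> 1), which is a
// 50/50 blend of all four channels, four pixels per iteration.
static void Blend50_SSE2(unsigned int *dest, const unsigned int *src, int count)
{
    int i = 0;
    for (; i + 4 <= count; i += 4)
    {
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dest + i));
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), _mm_avg_epu8(d, s));
    }
    // Scalar tail for the last 0-3 pixels: common bits plus half the
    // differing bits gives a per-byte (truncating) average.
    for (; i < count; ++i)
    {
        unsigned int d = dest[i], s = src[i];
        dest[i] = (d & s) + (((d ^ s) & 0xfefefefeu) >> 1);
    }
}

With GCC you'd build that with -msse2, and either way you'd keep the existing C blitter registered as the fallback for chips without SSE2.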

Well, this is a telephone book already. Let's talk about using SSE or MMX (at least as an option) and maybe about taking calls to memcpy and memset out.

Posted on 2006-05-24 22:29:20

Vaan

I personally recommend this profiler. It will only profile the application and ignore the random windows crap. It is kind of weird to use, but it works nicely.

In all honesty though, if you want your blitting to be faster, you need to move into the hardware realm. 2D in 3D is now where it is at. Moving the whole render path to either OpenGL (for cross platform support) or Direct3D should result in a speed boost. If I get some free time, I may actually do an experiment with this.

Adderd, you said you had speed problems with OpenGL in Linux. I don't mean to sound condescending, but are you sure you were running a hardware implementation, and not MESA? Most distros don't supply drivers for cards, and instead fall back to MESA's slow-as-hell software renderer.

Posted on 2006-05-28 11:33:37

adderd

Quote:
In all honesty though, if you want your blitting to be faster, you need to move into the hardware realm. 2D in 3D is now where it is at. Moving the whole render path to either OpenGL (for cross platform support) or Direct3D should result in a speed boost. If I get some free time, I may actually do an experiment with this.


Yes, that's true. However, it's a LOT of work. Every single current blit or line function would need to be changed (well, personally I think it'd be better to just add a new batch of blitters and register them like the current various blitter types do). Also, all loading of tiles into memory would need to be changed to loading into textures. I'm not saying that it couldn't or shouldn't be done, because it *should* be done; it's just a lot of work. If the current blitters can be sped up a little in the meantime, then I think that's worth doing.


Quote:
Adderd, you said you had speed problems with OpenGL in Linux. I don't mean to sound condescending, but are you sure you were running a hardware implementation, and not MESA? Most distros don't supply drivers for cards, and instead fall back to MESA's slow-as-hell software renderer.


Yes, I'm aware of that. glxinfo tells me that that isn't the case for me. I'm making sure that X is set up to use acceleration and that I'm asking for an accelerated mode. And actually, I think I mentioned in this thread that I found out the OpenGL code isn't really the big problem with speed. It's a bit of a bottleneck, but that's because currently all blitting happens behind OGL's back, then one big blit to a texture is done and the texture is displayed. That's not a very efficient use of 3D hardware, but it's not unbearably slow either.
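For anyone who hasn't looked at that code, the current GL path amounts to something like this once per frame (a simplified sketch with made-up names; the real driver's texture setup, coordinates, and pixel format may differ):

#include <GL/gl.h>

// One frame of the "software blit everything, then one big upload" path.
// 'screen_pixels' is the 32-bpp buffer the dd32_* blitters drew into;
// 'tex', 'width' and 'height' are assumed to have been set up at init.
void PresentFrame(GLuint tex, const void *screen_pixels, int width, int height)
{
    // Push the entire software framebuffer into the texture...
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, screen_pixels);

    // ...then draw it as one screen-sized textured quad.
    glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
        glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
        glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f,  1.0f);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f,  1.0f);
    glEnd();
}

The 3D card sits idle while the CPU does all the drawing, then eats one big upload per frame, which is the inefficiency I'm talking about.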

There are other problems slowing the linux version down, not the least of which is that GCC doesn't pull tricks behind your back like MSVC does. Technically MSVC is doing the *better* thing, but it's being tricky: it will replace some function calls (memcpy, memset, probably others) with straight optimized assembly instead of making the call into the standard library. This is obviously a lot faster, but it's not *technically* the correct thing to do, so GCC won't do it; that would be cheating and would prevent the standard library functions from being called. Since GCC is generating tons of function calls, it produces slower code (not to mention that the standard library versions of memcpy and memset probably aren't as optimized as the versions MSVC replaces them with).

All in all, the performance gap sort of just plain comes down to compiler differences. MSVC is really a pretty nice compiler, and in some situations it is really hard to beat.

GCC is starting to be able to hold its own, though. If I ask GCC to generate MMX or SSE2 code automatically for me, I get a nice jump in performance in verge. I think it might be worth offering a version of verge optimized for newer machines.

Posted on 2006-05-28 22:17:34

