Slight note regarding DMA speed vs. normal data access
Displaying 1-14 of 14 total.
Omni

Comparing two arrays of 2,800,000 elements each, one being Verge's built-in array:

int samplearray[2800000];

And the other being a chunk of DMAed memory:

int sample2 = Malloc(2800000*4); //the times 4 is for a signed quad (normal Verge integer)

It actually takes longer to iterate through a loop of simple assignment with the DMA example.


while(count < 2800000) //THIS ONE IS SLOWER
{
dma.squad[sample2+(count*4)] = 30; //any value
}

while(count < 2800000)
{
samplearray[count] = 30;
}


I mean, I guess I never had any reason to believe DMA was _faster_ than Verge's default data types, but I always thought that despite the difficulty of understanding and managing memory, it was in other ways clearly superior...

On my machine, for example, the time difference was 1,358 centisecs/tics for the DMA array and 1,234 centisecs for the builtin data type.

Posted on 2005-07-11 22:52:13

gannon

Did some tests; the time difference is due to the multiplication and addition.

They are the same if you do the sample2+(count*4) calculation before the loop.

Posted on 2005-07-12 12:57:48

anonymous

I don't quite see how to perform the calculation 'before the loop', since count is constantly changing. Though I might try something like

count = sample2;

with an iterator such as 'count+=4;' instead of having to multiply count each loop.

I actually didn't think of the math at all. Good call!

--Omni

Posted on 2005-07-12 16:20:44

anonymous

First, I don't think those loops look right. For a while loop, you need to increment/decrement your accumulator somewhere, don't you?
If you wanted to test the loops independently... a few ideas just went wrong... you could cancel out the math by assigning the access value (sample2+(count*4)) as the value stored in the array. That might make the speeds match, without more trouble. Heck, this is just theory anyway, and your solution is fast enough.

Posted on 2005-07-12 16:47:42

Omni

Er. They were sample loops. I realized after the post that I forgot to put in incrementors, but I figured that would be kinda obvious :)

Posted on 2005-07-12 17:19:22

anonymous

Rewrite
sample2+(count*4)
as
count<<2+sample2

and let me know what you get. Removing the brackets can make more of a difference than you might imagine.

-basil

Posted on 2005-07-12 19:07:33

Omni

Did more tests with my laptop on AC power, so that all the times are faster now. [battery much slower]. As an aside, DMA is still slower no matter what I do...though the math indeed does matter.

Array size 2,800,000 -- loop and assign value '30'

Builtin 4.95 seconds
[increment count++]
DMA no math, Omni style 5.12 seconds
[count=DMAarrayptr, increment count+=4, location count]
DMA with math 5.17 seconds
[count=0, increment count++, location DMAarrayptr+(count*4) ]
DMA no math, Basil style 5.08 seconds
[count=0, increment count++, location count<<2+DMAarrayptr ]

I find it curious that the method I used, minus the multiplication, where count is set to DMAarrayptr and then incremented in steps of four, is actually slower than Basil's bitshift-plus-add-pointer method.

How a single addition of 4 can be slower than a bitshift plus an addition, I do not know, but Basil's method is still the fastest. On my laptop, however, they are all slower than Verge's builtin array.

But...read on...in addition to the two examples in the first post of this thread, I used the following.

Basil Style
void BasilTest()
{
int myarr = Malloc(2800000*4);
int count;

timer=0;
while(count < 2800000)
{
dma.squad[count<<2 + myarr] = 30;
count++;
}
Log('DMA bitshift - basil - '+str(timer));
}


Omni Style
void OmniTest()
{
int myarr = Malloc(2800000*4);
int count;

timer=0;
count=myarr;
while(count-myarr < 2800000*4)
{
dma.squad[count] = 30;
count+=4;
}
Log('DMA no math - '+str(timer));
}


Looking at this, I expected mine to be the superior solution, since it used less in-loop math [one addition vs. Basil's addition and bitshift]. Strangely, it lost nearly every time [once it outclassed Basil's, but that never happened again -- not sure why; an anomaly].

Then I decided that perhaps the math in the loop condition statement was slowing it down. So, I did this...

Final Style
void OmniTest2()
{
int myarr = Malloc(2800000*4);
int count;
int limit;

timer =0;
count = myarr;
limit = myarr+(2800000*4);
while(count < limit)
{
dma.squad[count] = 30;
count+=4;
}
Log('DMA Omni - '+str(timer));
}


With all math, except the incrementor, out of the loop, this was the fastest of the DMA loops. I tested them over and over again, and all times generally fluctuated near 5 seconds. The final DMA loop type, however, is not clearly superior to the builtin type loop.

2,800,000 array size
[example times, in centisecs]
Builtin - 495
DMA default - 526
DMA Omni - 510
DMA bitshift - Basil - 516
DMA Final - 496

The 'final' type and builtin type loops tend to trade places as the fastest, though I suspect that the 'final' type is marginally slower.

Just for fun I cranked up array size further.

10,000,000 array size (multiple times for multiple trials)
Builtin - 1740, 1737, 1751, 1745 centisecs
DMA Final - 1767, 1787, 1782, 1787 centisecs

So at larger sizes it is obvious that Verge's builtin array is still the speed champion. Still, the DMA solution isn't too bad...I could tolerate being off by a few scant milliseconds.

I then decided to try one last test. I wondered whether or not using Basil's bitshifting and removing his addition would be fast enough to beat builtin...

Super Combo Style
void SupaFinish()
{
int myarr = Malloc(10000000*4);
int count;
int off;
int limit;

count = myarr>>2;
limit = myarr+(10000000*4)>>2;
timer = 0;
while(count < limit)
{
dma.squad[count<<2] = 30;
count++;
}
Log('DMA Omni/Basil Hyper Combo - '+str(timer));
}


This was really kinda stupid, as I risked raping my computer's memory (seeing as how there's no guarantee that ((memptr>>2)<<2) == memptr, since the right shift truncates the low two bits) and it ended up being slower than the Final type loops anyway. Apparently a bitshift plus a simple increment [++] is slower than a single addition operation (which I guess makes sense).

Builtin - 1761
DMA Omni/Basil Hyper Combo - 1852

So, don't do the Hyper Combo :)

Posted on 2005-07-13 18:44:53

basil

Are while loop conditions checked each iteration? I suspect so, in which case part of your problem was putting a multiply in the while loop condition.

I think largely however that the issue is a lack of repetition. As you've seen, results fluctuate for a variety of reasons. I suggest you use a running average system. Something like

total = 0
repetitions = 0
while (!bored)
{
    timer = 0

    while (x < a big number)
        do the thing being tested

    total += timer
    repetitions++
    printstring(repetitions & total/repetitions)
}

if you're dealing with small times, make that total*1000/repetitions or something suitable to avoid roundoff.

This way you can run the program and sit and let it run through as many times as it needs to for the average to settle on something. It will give you a much, much better idea of the speed of operations.

Posted on 2005-07-13 19:14:49

basil

I'm apparently forbidden from editing my post, so the pseudocode remains incomprehensible. Ask if you don't follow the point I'm making.

Posted on 2005-07-13 19:17:02

Omni

Ah, well, I'll probably avoid doing that...it's kind of pointless to just hog my computer on a Verge program running...and running...and running four or five different long loops at the same time. Idea makes perfect sense though.

Posted on 2005-07-13 21:01:18

JL

> Did more tests with my laptop on AC power, so that all the times are faster now. [battery much slower]. As an aside, DMA is still slower no matter what I do...though the math indeed does matter.

I remember when I was writing Verge 2.something-or-other that I went through a bit of trouble to put in checks to make sure that DMA was reading and writing within the expected range of memory; this way any corruption could be caught and dealt with immediately, rather than causing bizarre errors later in the program's execution. With regular variables, these checks were unnecessary, since you already knew that reading or writing them would go to a recognized memory location. I don't have access to the V3 source (obviously), but I suppose it is possible that V3 has similar checks in place that degrade the performance of DMA.

Posted on 2005-07-14 18:36:14

anonymous

If it matters, DMA-ing to random parts of memory still indiscriminately crashes Verge (while raping memory) -- for example, reading an extra quad off of a malloc-ed block. Can I assume this means DMA checks are not in place and we're just supposed to be smart with it? Or is it something else? If so, I bet that's why it's slower -- I hadn't thought of that, either...

--Omni

Posted on 2005-07-14 19:20:59

vecna

The point of DMA was never speed, and certainly abusing DMA will cause it to crash; that's why it's called 'Direct Memory Access' :D

The ONLY point of DMA is as a way to do things that just can't be done any other way.

Posted on 2005-07-21 22:03:41

Omni

Noted.

Posted on 2005-07-21 22:25:49

