Slight note regarding DMA speed vs. normal data access
Displaying 1-14 of 14 total.
Omni

Comparing two arrays of 2,800,000 elements each, one being Verge's built-in array:

int samplearray[2800000];

And the other being a chunk of DMAed memory:

int sample2 = Malloc(2800000*4); //the times 4 is for a signed quad (normal Verge integer)

It actually takes longer to iterate through a loop of simple assignment with the DMA example.


while(count < 2800000) //THIS ONE IS SLOWER
{
dma.squad[sample2+(count*4)] = 30; //any value
}

while(count < 2800000)
{
samplearray[count] = 30;
}


I mean, I guess I never had any reason to believe DMA was _faster_ than Verge's default data types, but I always thought that despite the difficulty of understanding and managing memory, it was in other ways clearly superior...

On my machine, for example, the time difference was 1,358 centisecs/tics for the DMA array and 1,234 centisecs for the builtin data type.

Posted on 2005-07-11 22:52:13

gannon

Did some tests; the time difference is due to the multiplication and addition.

They are the same if you do the sample2+(count*4) calculation before the loop.

Posted on 2005-07-12 12:57:48

anonymous

I don't quite see how to perform the calculation 'before the loop', since count is constantly changing. Though I might try something like

count = sample2;

with an iterator such as 'count+=4;' instead of having to multiply count each loop.

I actually didn't think of the math at all. Good call!

--Omni

Posted on 2005-07-12 16:20:44

anonymous

First, I don't think those loops look right. For a while loop, you need to increment/decrement your accumulator somewhere, don't you?
If you wanted to test the loops independently... a few ideas just went wrong... you could cancel out the math by assigning the access value (sample2+(count*4)) as the value stored in the array. That might make the speeds match, without more trouble. Heck, this is just theory anyway, and your solution is fast enough.

Posted on 2005-07-12 16:47:42

Omni

Er. They were sample loops. I realized after the post that I forgot to put in incrementors, but I figured that would be kinda obvious :)

Posted on 2005-07-12 17:19:22

anonymous

Rewrite
sample2+(count*4)
as
count<<2+sample2

and let me know what you get. Removing the brackets can make more of a difference than you might imagine.

-basil

Posted on 2005-07-12 19:07:33

Omni

Did more tests with my laptop on AC power, so that all the times are faster now. [battery much slower]. As an aside, DMA is still slower no matter what I do...though the math indeed does matter.

Array size 2,800,000 -- loop and assign value '30'

Builtin 4.95 seconds
[increment count++]
DMA no math, Omni style 5.12 seconds
[count=DMAarrayptr, increment count+=4, location count]
DMA with math 5.17 seconds
[count=0, increment count++, location DMAarrayptr+(count*4) ]
DMA no math, Basil style 5.08 seconds
[count=0, increment count++, location count<<2+DMAarrayptr ]

I find it curious that the method I used, minus the multiplication, where count is set to DMAarrayptr and then incremented in steps of four, is actually slower than Basil's bitshift-plus-add-pointer method.

How a single addition of 4 can be slower than a bitshift plus an addition, I do not know, but Basil's method is still the fastest. On my laptop, however, they are all slower than Verge's builtin array.

But...read on...in addition to the two examples in the first post of this thread, I used the following.

Basil Style
void BasilTest()
{
int myarr = Malloc(2800000*4);
int count;

timer=0;
while(count < 2800000)
{
dma.squad[count<<2 + myarr] = 30;
count++;
}
Log('DMA bitshift - basil - '+str(timer));
}


Omni Style
void OmniTest()
{
int myarr = Malloc(2800000*4);
int count;

timer=0;
count=myarr;
while(count-myarr < 2800000*4)
{
dma.squad[count] = 30;
count+=4;
}
Log('DMA no math - '+str(timer));
}


Looking at this, I expected mine to be the superior solution, since it used less in-loop math [one addition vs. Basil's addition and bitshift]. Strangely, it lost nearly every time [once it outclassed Basil's, but that never happened again -- not sure why; an anomaly].

Then I decided that perhaps the math in the loop condition statement was slowing it down. So, I did this...

Final Style
void OmniTest2()
{
int myarr = Malloc(2800000*4);
int count;
int limit;

timer =0;
count = myarr;
limit = myarr+(2800000*4);
while(count < limit)
{
dma.squad[count] = 30;
count+=4;
}
Log('DMA Omni - '+str(timer));
}


With all math, except the incrementor, out of the loop, this was the fastest of the DMA loops. I tested them over and over again, and all times generally fluctuated near 5 seconds. The final DMA loop type, however, is not clearly superior to the builtin type loop.

2,800,000 array size
[example times, in centisecs]
Builtin - 495
DMA default - 526
DMA Omni - 510
DMA bitshift - Basil - 516
DMA Final - 496

The 'final' type and builtin type loops tend to trade places as the fastest, though I suspect that the 'final' type is marginally slower.

Just for fun I cranked up array size further.

10,000,000 array size (multiple times for multiple trials)
Builtin - 1740, 1737, 1751, 1745 centisecs
DMA Final - 1767, 1787, 1782, 1787 centisecs

So at larger sizes it is obvious that Verge's builtin array is still the speed champion. Still, the DMA solution isn't too bad...I could tolerate being off by a few scant milliseconds.

I then decided to try one last test. I wondered whether or not using Basil's bitshifting and removing his addition would be fast enough to beat builtin...

Super Combo Style
void SupaFinish()
{
int myarr = Malloc(10000000*4);
int count;
int off;
int limit;

count = myarr>>2;
limit = myarr+(10000000*4)>>2;
timer = 0;
while(count < limit)
{
dma.squad[count<<2] = 30;
count++;
}
Log('DMA Omni/Basil Hyper Combo - '+str(timer));
}


This was really kinda stupid, as I risked raping my computer's memory (seeing as how there's no guarantee that ((memptr>>2)<<2) == memptr, since the right shift truncates the low two bits) and it ended up being slower than the Final type loops anyway. Apparently a bitshift plus a simple increment [++] is slower than a single addition operation (which I guess makes sense).

Builtin - 1761
DMA Omni/Basil Hyper Combo - 1852

So, don't do the Hyper Combo :)

Posted on 2005-07-13 18:44:53

basil

Are while loop conditions checked each iteration? I suspect so, in which case part of your problem was putting a multiply in the while loop condition.

I think largely however that the issue is a lack of repetition. As you've seen, results fluctuate for a variety of reasons. I suggest you use a running average system. Something like

total = 0
repetitions = 0
while (!bored)
{
    timer = 0

    while (x < a big number)
        do the thing being tested

    total += timer
    repetitions++
    printstring(repetitions & total/repetitions)
}

if you're dealing with small times, make that total*1000/repetitions or something suitable to avoid roundoff.

This way you can run the program and sit and let it run through as many times as it needs to for the average to settle on something. It will give you a much, much better idea of the speed of operations.

Posted on 2005-07-13 19:14:49

basil

I'm apparently forbidden from editing my post, so the pseudocode remains incomprehensible. Ask if you don't follow the point I'm making.

Posted on 2005-07-13 19:17:02

Omni

Ah, well, I'll probably avoid doing that...it's kind of pointless to just hog my computer on a Verge program running...and running...and running four or five different long loops at the same time. Idea makes perfect sense though.

Posted on 2005-07-13 21:01:18

JL

> Did more tests with my laptop on AC power, so that all the times are faster now. [battery much slower]. As an aside, DMA is still slower no matter what I do...though the math indeed does matter.

I remember when I was writing Verge 2.something-or-other that I went through a bit of trouble to put in checks to make sure that DMA was reading and writing within the expected range of memory; this way any corruption could be caught and dealt with immediately, rather than causing bizarre errors later in the program's execution. With regular variables, these checks were unnecessary, since you already knew that reading or writing them would go to a recognized memory location. I don't have access to the V3 source (obviously), but I suppose it is possible that V3 has similar checks in place that degrade the performance of DMA.

Posted on 2005-07-14 18:36:14

anonymous

If it matters, DMA-ing to random parts of memory still indiscriminately crashes Verge (while raping memory) -- for example, reading an extra quad off of a malloc-ed block. Can I assume this means DMA checks are not in place and we're just supposed to be smart with it? Or is it something else? If so, I bet that's why it's slower -- I hadn't thought of that, either...

--Omni

Posted on 2005-07-14 19:20:59

vecna

The point of DMA was never speed, and certainly abusing DMA will cause it to crash; that's why it's called 'Direct Memory Access' :D

The ONLY point of DMA is as a way to do things that just can't be done any other way.

Posted on 2005-07-21 22:03:41

Omni

Noted.

Posted on 2005-07-21 22:25:49

