Did some more tests with my laptop on AC power, so all the times are faster now [on battery it's much slower]. As an aside, DMA is still slower no matter what I do... though how the math is done definitely matters.
Array size 2,800,000 -- loop and assign the value '30' to every element:

Builtin: 4.95 seconds
[increment count++]
DMA no math, Omni style: 5.12 seconds
[count=DMAarrayptr, increment count+=4, location count]
DMA with math (default): 5.17 seconds -- sketched below
[count=0, increment count++, location DMAarrayptr+(count*4)]
DMA bitshift, Basil style: 5.08 seconds
[count=0, increment count++, location (count<<2)+DMAarrayptr]
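For reference, here is a sketch of what the 'DMA with math' (default) loop looks like, reconstructed from the bracketed description above; the function name MathTest and the variable names are just mine.

Default Style
void MathTest()
{
	int myarr = Malloc(2800000*4);
	int count;

	count = 0;
	timer = 0;
	while(count < 2800000)
	{
		// the address is recomputed with a multiply on every pass
		dma.squad[myarr+(count*4)] = 30;
		count++;
	}
	Log('DMA with math - '+str(timer));
}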
I find it curious that my multiplication-free method, where count is set to DMAarrayptr and then incremented in steps of four, is actually slower than Basil's bitshift-plus-add pointer method.
How a single addition of 4 can be slower than a bitshift plus an addition, I do not know, but Basil's method is still the faster of the two. On my laptop, however, they are all slower than Verge's builtin array.
But...read on...in addition to the two examples in the first post of this thread, I used the following.
Basil Style
void BasilTest()
{
	int myarr = Malloc(2800000*4);	// 2,800,000 ints, 4 bytes each
	int count;

	count = 0;
	timer = 0;
	while(count < 2800000)
	{
		// element address = base pointer plus count*4 bytes
		dma.squad[(count<<2) + myarr] = 30;
		count++;
	}
	Log('DMA bitshift - basil - '+str(timer));
}
Omni Style
void OmniTest()
{
	int myarr = Malloc(2800000*4);
	int count;

	timer = 0;
	count = myarr;	// count doubles as the byte address itself
	while((count - myarr) < (2800000*4))	// the subtraction in the condition runs on every pass
	{
		dma.squad[count] = 30;
		count += 4;	// step one 4-byte int at a time
	}
	Log('DMA Omni - no math - '+str(timer));
}
Looking at this, I expected mine to be the superior solution, since it uses less in-loop math [one addition vs. Basil's addition plus bitshift]. Strangely, it lost nearly every time [it outclassed Basil's once, but that never happened again -- not sure why, probably an anomaly].
Then I decided that perhaps the math in the loop condition statement was slowing it down. So, I did this...
Final Style
void OmniTest2()
{
	int myarr = Malloc(2800000*4);
	int count;
	int limit;

	timer = 0;
	count = myarr;	// start at the base address
	limit = myarr+(2800000*4);	// end address computed once, outside the loop
	while(count < limit)	// condition is now a plain compare
	{
		dma.squad[count] = 30;
		count += 4;
	}
	Log('DMA Final - '+str(timer));
}
With all math except the incrementor hoisted out of the loop, this was the fastest of the DMA loops. I tested them over and over again, and all the times generally fluctuated around 5 seconds. Even so, the final DMA loop is not clearly superior to the builtin loop.
Array size 2,800,000 [example times, in centiseconds]:
Builtin - 495
DMA default - 526
DMA Omni - 510
DMA bitshift - Basil - 516
DMA Final - 496
The 'final' type and builtin type loops tend to trade places as the fastest, though I suspect that the 'final' type is marginally slower.
Just for fun I cranked up array size further.
Array size 10,000,000 [multiple trials, in centiseconds]:
Builtin - 1740, 1737, 1751, 1745
DMA Final - 1767, 1787, 1782, 1787
So at larger sizes it is obvious that Verge's builtin array is still the speed champion. Still, the DMA solution isn't too bad: the gap is roughly 0.4 seconds over ten million writes, which works out to around 40 nanoseconds of extra cost per element. I could tolerate that.
I then decided to try one last test. I wondered whether or not using Basil's bitshifting and removing his addition would be fast enough to beat builtin...
Super Combo Style
void SupaFinish()
{
	int myarr = Malloc(10000000*4);
	int count;
	int limit;

	count = myarr>>2;	// base address divided by 4...
	limit = (myarr+(10000000*4))>>2;	// ...and the end address likewise
	timer = 0;
	while(count < limit)
	{
		dma.squad[count<<2] = 30;	// shift back up to a byte address on every pass
		count++;
	}
	Log('DMA Omni/Basil Hyper Combo - '+str(timer));
}
This was really kinda stupid, as I risked trashing my computer's memory (seeing as how there's no guarantee that (memptr>>2)<<2 equals memptr, since the shift truncates the low two bits of any pointer that isn't a multiple of 4), and it ended up being slower than the Final-type loop anyway. Apparently a bitshift plus a simple increment [++] is slower than a single addition operation (which I guess makes sense).
Builtin - 1761
DMA Omni/Basil Hyper Combo - 1852
So, don't do the Hyper Combo :)
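As a footnote on the truncation point above, here is a quick way to see the risk; TruncationCheck is just an illustrative name, not one of the timed tests.

void TruncationCheck()
{
	int myarr = Malloc(16);

	// If Malloc ever hands back an address that isn't a multiple of 4, the >>2 drops
	// the low two bits and the <<2 rebuilds a pointer below the real allocation.
	Log('base: '+str(myarr)+'  rebuilt: '+str((myarr>>2)<<2));
}

If those two numbers ever differ, the Hyper Combo loop is writing its 30s into memory it doesn't own.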