Indicator speed improvement through parallelizing
Author: akuzn
Creation Date: 11/25/2013 10:47 PM

akuzn

#1
Good Day!

After seeing that the following
CODE:
Please log in to see this code.

after compilation is realized as 3 summation loops plus an additional loop dividing each sum element by 4, I tried to make some improvements: a simple for (...) loop, and versions with Parallel.For and Parallel.ForEach:

CODE:
Please log in to see this code.
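Since the original snippets are behind a login, here is a hedged sketch of what the post describes, with plain double[] arrays standing in for Wealth-Lab DataSeries and all names hypothetical: the (O + H + L + C) / 4 average computed in one sequential pass and with Parallel.For.

```csharp
using System;
using System.Threading.Tasks;

// Hedged sketch of the hidden code, reconstructed from the description above:
// (Open + High + Low + Close) / 4, computed in one sequential pass and with
// Parallel.For. Plain double[] arrays stand in for Wealth-Lab DataSeries.
static class AveragePriceSketch
{
    // One fused loop instead of three summation loops plus a dividing loop.
    public static double[] Sequential(double[] o, double[] h, double[] l, double[] c)
    {
        var avg = new double[o.Length];
        for (int i = 0; i < o.Length; i++)
            avg[i] = (o[i] + h[i] + l[i] + c[i]) / 4d;
        return avg;
    }

    // Each index is written by exactly one iteration, so there is no data race.
    public static double[] ParallelFor(double[] o, double[] h, double[] l, double[] c)
    {
        var avg = new double[o.Length];
        Parallel.For(0, o.Length, i => avg[i] = (o[i] + h[i] + l[i] + c[i]) / 4d);
        return avg;
    }
}
```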


After that I measured the execution speed:
CODE:
Please log in to see this code.
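The timing code is also login-walled; the usual .NET pattern for this kind of measurement is System.Diagnostics.Stopwatch. A minimal sketch (the MeasureMs helper is hypothetical, not from the original post):

```csharp
using System;
using System.Diagnostics;

// Hedged sketch of the hidden measurement code: the standard .NET Stopwatch
// pattern. The MeasureMs helper is hypothetical, not from the original post.
static class Timing
{
    public static long MeasureMs(Action work)
    {
        var sw = Stopwatch.StartNew();  // high-resolution timer
        work();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

A single timed run is noisy at the millisecond scale, which is worth keeping in mind when reading the figures below.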

The results, with 4 strategies on 45,600 bars and 4 strategies on 10,500 bars loaded together in one workspace, are, I would say, impressive:
QUOTE:

==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular time computing of average in milliseconds: {14}

Computing average as serie method...
Serie method time computing of average in milliseconds: {2}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {5}
==================================================================
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular time computing of average in milliseconds: {12}

Computing average as serie method...
Serie method time computing of average in milliseconds: {2}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {4}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {1}
==================================================================
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular time computing of average in milliseconds: {46}

Computing average as serie method...
Serie method time computing of average in milliseconds: {1}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {2}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {2}
==================================================================
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular time computing of average in milliseconds: {14}

Computing average as serie method...
Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {3}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular time computing of average in milliseconds: {10}

Computing average as serie method...
Serie method time computing of average in milliseconds: {4}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {10}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {69}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular time computing of average in milliseconds: {14}

Computing average as serie method...
Serie method time computing of average in milliseconds: {5}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {6}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {5}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular time computing of average in milliseconds: {11}

Computing average as serie method...
Serie method time computing of average in milliseconds: {5}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {13}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {4}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular time computing of average in milliseconds: {18}

Computing average as serie method...
Serie method time computing of average in milliseconds: {10}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {3}
==================================================================

That means
CODE:
Please log in to see this code.

is the least efficient code.
After that I rewrote some indicators in the Parallel "small loop body" style from the MSDN article (the last method).
The result was great. In general, parallel backtests run 1.5-2 times faster.
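For readers without access to the hidden code: the MSDN "small loop body" pattern mentioned here pairs Parallel.ForEach with Partitioner.Create(0, n), so each worker runs a plain local for-loop over a contiguous range instead of paying one delegate call per element. A hedged sketch (Scale is a stand-in body, not the post's indicator code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch of the MSDN "small loop body" pattern referred to above:
// Partitioner.Create(0, n) hands each worker a contiguous index range, so the
// tiny per-element body runs inside a cheap local for-loop instead of paying
// one delegate invocation per element.
static class SmallBodyLoop
{
    public static double[] Scale(double[] src, double factor)
    {
        var dst = new double[src.Length];
        if (src.Length == 0) return dst;            // Partitioner requires a non-empty range
        Parallel.ForEach(Partitioner.Create(0, src.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                dst[i] = src[i] * factor;           // deliberately tiny body
        });
        return dst;
    }
}
```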

Do you have any plans to rewrite the indicator libraries, and maybe the optimizers, in the same or a better style?

Eugene

#2
Thank you for your input Alexey. Excellent idea.

There are no short-term plans to rewrite indicators and the Optimizer using Task Parallel Library.

P.S. I remember an episode when rewriting MS123 Visualizers last year. There was a performance bottleneck in a piece of code that calculated Closed Equity across the library. At first, applying Parallel.For seemed like the natural choice as it brought considerable speed improvement. Just one thing: the end result was incorrect. No matter how hard I bashed my head against this brick wall, it didn't work out and I had to optimize the algorithm instead.

akuzn

#3
Thank you for your support.
KAMA and StdDev in particular look like good candidates for parallelizing.
I would say optimization runs much faster now. Right now 8 intraday optimizations are running with an expected time of 3 hours; yesterday it was about 7 hours.

Eugene

#4
Alexey,

Performed a quick test to verify your results on 250532 bars of IQFeed 5-min data for AAPL:

CODE:
Please log in to see this code.


Worth noting is that with the plain vanilla version of Parallel.For I noticed no breakthrough at all. However, having switched to Parallel.For with a Partitioner, there's a 2x speed improvement:

QUOTE:
==================================================================
Computing averages for: AAPL(250532 bars)
Computing regular average...
Computing regular average finshed in 31 milliseconds

Computing average using Parallel.For...
Computing Parallel average finished in 27 milliseconds

Computing average using Partitioner...
Computing Parallel average using Partitioner finished in 17 milliseconds
==================================================================


Quite interesting.

Perhaps it's because the source DataSeries, being indexed collections whose length is known, are susceptible to range partitioning.

akuzn

#5
Yes, I agree the Partitioner is best.
It was suggested in the MSDN "small loop body" example. It seems only they can know what happens in their jungles.
Any new programming construct can break your brain. :)
But what I understood from their explanation: there are some time-consuming operations at the beginning and at the end of the parallel setup; what happens in the middle is very quick.

I've rewritten some parts of my code with the ability to switch between a simple for loop and the "small loop body" version, but I can only say that with 1 or 2 strategies backtesting, the effect may be 1.5-2 times; with 4-8 strategies there seems to be no improvement at all.
By the way, I haven't noticed different results in the series computations. I've rewritten RSI, MA, and adaptive MAs.

JDardon

#6
A few questions regarding this line of thought.

45,600 bars @ (4 bytes per item) = 182,400 bytes (+ some overhead) = 178 KB per series. You are averaging 4 series = 712 KB of total data in 4 variables that are passed as parameters to 3 different functions.

Current processors typically have about 6 MB of cache each. This means your 4 data series can be fully loaded into the cache. The first call has to include the I/O time for moving the data from memory to the CPU cache; the calls that happen afterwards will not have that loading overhead, and therefore show apparently better execution times.

a) Could that be the reason for the severe discrepancy between the regular calculation and the series method? Could you test using a different order for the function calls and see whether the results stay the same?
b) Given that there was only one result in which the parallel small-body-loop method was significantly faster than the series method, couldn't this be related to other activities the computer was doing at the same time? Could you test with a lot more data (to take away the benefit of the processor cache)?

The reason I ask is that I tried to run your code to verify this, and it didn't compile:

QUOTE:

Error 1) The type System.Collections.Concurrent.Partitioner exists in both c:\program files\Fidelity Investments\Wealth-Lab Pro 6\System.Threading.dll and c:\windows\microsoft.NET\Framework64\v4.0.30319\mscorlib.dll

Error 2) The type arguments for method System.Collections.Concurrent.Partitioner.Create<TSource>(System.Collections.Generic.IList<TSource>, bool) cannot be inferred from the usage. Try specifying the type arguments explicitly.




Eugene

#7
Jorge,

Your argument makes sense. However, take the code from my post #4. Generate a huge dummy data file using Random data provider. Change the order of series. Even if the Parallel.For loop with Range Partitioner is made the first one to bear the I/O overhead, it beats the other two hands down.

akuzn

#8
1. It comes from async: in general 3-4 times faster if you can use parallelizing. It's proven; it's not only a question of cache, it's a question of the new async improvements in .NET in general.

2. I would like to add some more information.

On about 3,500,000 ticks of data, my StdDev, AMA, Weighted MA, RSI etc. methods run 50-300 times (not percent!) faster than the standard ones, thanks to parallelizing and some additional old-school tricks: avoiding excessive memory addressing, avoiding unboxing (about 3-8% time savings), etc.

By the way, it seems I found a way to improve moving average speed: only about 13 ms (down from 129) on this amount of ticks. Certainly without any parallelizing, because the next average depends on the previous value. Maybe larger amounts of ticks could be divided into parallel invocations, but it seems unnecessary for an MA.
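A minimal sketch of the rolling idea described above (the textbook sliding-window SMA, assumed rather than copied from the hidden code): each sum is derived from the previous one by adding the newest element and subtracting the oldest, so the whole series costs O(n) instead of O(n * period), but the recurrence is inherently sequential.

```csharp
// Hedged sketch of the rolling idea described above (the textbook
// sliding-window SMA, assumed rather than copied from the hidden code): each
// sum is derived from the previous one by adding the newest element and
// subtracting the oldest, so the whole series costs O(n) instead of
// O(n * period). The recurrence is inherently sequential, which is why
// parallelizing it buys nothing.
static class RollingSma
{
    public static double[] Compute(double[] ds, int period)
    {
        var sma = new double[ds.Length];
        double inv = 1d / period;                    // multiply instead of divide
        double sum = 0;
        for (int bar = 0; bar < ds.Length; bar++)
        {
            sum += ds[bar];
            if (bar >= period) sum -= ds[bar - period];
            if (bar >= period - 1) sma[bar] = sum * inv;
        }
        return sma;
    }
}
```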

If I use some additional tricks, like threading or parallel invocation of the series computations under the strategy thread, I get 50% or more additional speed improvement under the Wealth-Lab threading model.

I suppose the GA could be improved too, but that may require too many changes inside Wealth-Lab itself.
I have tested some GA, evolutionary and particle swarm optimizations; there are ways to improve the GA, I'm sure.

Right now I'm finishing a low-latency platform (10-20 ms reaction time, including optimization and machine learning) with a multi-threading model; I expect a 100-400% additional speed improvement for multi-strategy runs.
I have a bottleneck with data providers (pipe, DDE and some others): the pros at our exchange decided to use the Lua language instead of a normal API. I had never worked with it before, but as for the information above, it is tested and 100% real.

By the way, I implemented some fast drawing in WPF (async bindings etc.); some improvements can be made in drawing too. I've read that ZedGraph is not the fastest.

So the Partitioner question is just the beginning. I still want to use Wealth-Lab, and I hope it will run faster.


As for the error you are getting: it seems I had a similar error in the Wealth-Lab editor.
I would suggest calling the methods from a DLL compiled in Visual Studio; it's much more comfortable, and there are no problems at all.

Eugene

#9
Alexey,

I tried to parallelize SMA using both Parallel.For and Parallel.ForEach. I have to say, not many indicators lend themselves to parallelizing this way, due to data races and dependencies between iterations. Surprisingly, on a symbol with 300K bars, both implementations turned out slower (up to several times) than the traditional loop.

CODE:
Please log in to see this code.


CODE:
Please log in to see this code.


I considered PLINQ too, but I think it's too expensive for this kind of task. I wonder how you parallelize indicators like WMA or RSI (without the tricks you mentioned)?

akuzn

#10
Nice to see your interest.
If I understand the logic, SMA computation can be done fast enough in sync mode, without any CPU context switching. I don't have enough knowledge about it. My tests with parallelizing SMA didn't give anything either.
Maybe the data array size needs to be compared to the CPU cache size; only if it is bigger will parallelizing in .NET give something. How big should it be? Good question. It will be interesting to compare with CUDA results (about CUDA later).

I am not a professional programmer and spent about 10 years away from quantitative finance. Three years ago I returned to my old interests as a hobby, while in hospital with a broken leg. :)
So please don't criticize me too much. All I know now is what I can remember from Turbo Pascal, C, and 2 years of learning C#.

Here is the code which gave me my best improvement, 9% (usually 4-5%), on 3,800,000 ticks. No parallelizing, just precise coding. :)
CODE:
Please log in to see this code.


I have a strong opinion that C# can give superb results and that C++ is not needed. The main ideas are not to let the garbage collector run during computation, and to follow recommendations such as: don't forget to mark a working class as sealed, use structs, and use block-local variables that can be kept in CPU arithmetic registers (the old AX, BX, CX, DX; how many does a modern CPU have, and what are they called now?).

Some recommendations work better on x64. There are many benchmarks where an x64 build, under the same conditions and with similar code, gives 3-5% better results than unsafe C#, C++ or C.
I read those tests carefully and implemented their advice.

For example, one thing that surprised me a lot:

const int i is better than just int i;
in for (int j = 0; j < i; j++) this difference in C# can account for about 3-5% of the speed gap between C++ and C#!
I don't know why it changes nothing in C++.

Following the same logic, ds.Count in the loop condition is bad style. In my example, some improvement came from avoiding certain computation branches, and maybe from avoiding excessive CPU context switching. Two more years and I'll know for sure. :)
I think 2-4% of the time can be saved by replacing ds.Count with int num = ds.Count; and using num in the body of the for statement.
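The hoisting suggestion can be illustrated like this (a List<double> stands in for a DataSeries; whether it helps depends on what the JIT can prove about the Count property, so it should be measured rather than assumed):

```csharp
using System.Collections.Generic;

// Illustration of the hoisting idea above: read Count once into a local
// instead of querying the property on every iteration. A List<double> stands
// in for a DataSeries; the gain, if any, depends on the JIT, so measure it.
static class HoistCount
{
    public static double Sum(List<double> ds)
    {
        double total = 0;
        int num = ds.Count;          // hoisted out of the loop condition
        for (int j = 0; j < num; j++)
            total += ds[j];
        return total;
    }
}
```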

In one topic about C# code optimization I saw an interesting idea about loop coding, but I haven't understood what it meant and couldn't reproduce it. The idea was the following:
QUOTE:

for (int i = bigperiod; i > 0; )
{
...a*b*c * matrix[i--];
}

The author called it C# loop optimization. Maybe I missed something; in my case I got a frozen interface.

As you can see in my SMA example: a little more code, but it works a little better.
As for StdDev and the others, I'm not kidding. I tried to compare the results with AmiBroker, which runs fast enough; my results are similar. I'm not sure how they measure it: over all strategies, or as a final per-strategy figure (each strategy there can show its computation time and rendering time). Every result looks good, but as you know, we can't trust it 100% if we don't see how fairly it is measured and computed.

To be honest, I don't like AmiBroker; I just used my old 5.4 version to check.

Another test I tried 2 years ago was an optimization of 2 strategies with many StdDev and price/StdDev normalization computations. The Ami interface froze for 10-15 minutes and sometimes I couldn't tell what was going on. A single simple computation seems to run fast enough there, but in general their optimization interface and internal language are not as comfortable as Wealth-Lab and Visual Studio.

When I optimized the same ideas in Wealth-Lab, WLP looked better with 8-20 strategies under GA.
The series computation algorithms in Ami seem fast enough, but as for multithreading, I don't think they work well with the Windows model.
They say the new Ami version can use up to 32 CPU cores, but then why did it freeze in the same optimization conditions on 8 cores?
Wealth-Lab plays much more nicely with Windows; naturally born. :)

What I understand now: neither AmiBroker nor Wealth-Lab follows the WPF model, though Wealth-Lab is much closer. The COM/WPF interop would explain the black holes in Windows.

I implemented some ideas following the WPF model, and I'm very happy with them.
What I mean is graphics as fast as it can be done under WPF: I tested a 5 ms update in 40 windows with the Visual class; I would say perfect.
No freezing, no big CPU load. Sure, some additional tricks are needed, but all on the main thread; a BackgroundWorker or parallelizing, with rendering on a "computed" event, works great. No freezing at all, whereas Wealth-Lab freezes sometimes. I did this because I need to trade high-frequency intraday data and the order book, and I'm trying to use machine learning. It seems everything is possible. Even if I switch off graphics and install the program on a server near the exchange, the C# code will work great.

The last additional trick is CUDA, via CUDAfy, a free C# library.
The results are impressive, but a balance has to be found, because PLINQ and parallelizing under .NET are fast enough; the difference begins after about 100,000 values are reached. There are articles on CodeProject with computational examples. In one article by the CUDAfy developer, an NVIDIA Quadro 540 (in an $800 Acer) gave results 300 times (!) better than PLINQ on geo tasks. But with arrays below 50,000 elements, CUDA programming won't give you anything.
I studied it to plan the future work on my program. A new NVIDIA adapter, even in a notebook, can give results much better than 10 Xeons or other well-marketed products: a Quadro or Tesla plus a single i7, and nothing more.
The latest NVIDIA hardware and software (unfortunately only a release candidate) supports shared memory addressing.
If I understand correctly: unified operators, and no big difficulties transferring data arrays between different memory types; the programming interface may even become the same for computer RAM and NVIDIA card memory. There are C++ examples, but my card doesn't support this mode. Some video chips with this capability are already available. CUDAfy, which provides the C# interfaces, doesn't support it yet, but I think in half a year everything will be tested.

As for the database: if I understand correctly, a file database is much faster than SQL etc.
Simple, well-organized .CSV files are much faster. I haven't tested SQL programming personally, but I have such experience with a Moscow data provider: they supplied a static provider with the data stored in SQL. I didn't like the speed, so I don't even want to experiment with it.

So following the WPF programming model should be comfortable enough. It isn't even necessary to run every window in its own thread: the UI thread is enough, in combination with Parallel.For and LINQ if needed, plus CUDA on top for testing huge arrays.
Now I'm finishing the data provision part, and I hope to have a beta build in about 2 weeks.
If you are interested in testing, I can send you a copy.

akuzn

#11
SMA, StdDev, KAMA are the standard indicators;
SMA_OPTIM1, SMA_OPTIM2, StdDev_Naive and AMA are my versions.
Everything is done inside Wealth-Lab.
As you can see, WLB StdDev takes 2698 ms vs. 175 ms for my version;
the SMAs are similar, but in general SMA_OPTIM2 is a little faster if I run it 10 times;
KAMA vs. AMA: 4334 and 161 ms.

Correct computation is confirmed because one PlotSeries exactly covers the other.
They are computed one after another, without any additional processes, parallel execution, BackgroundWorker etc.

The results can vary even if only one strategy is loaded.
But if I load and run 2 or more strategies, the computing-time difference between the standard WLB indicators and mine increases.
In my strategies I generally use parallel tasks under the Wealth-Lab thread, so the difference increases again.
It would certainly be better to compare the average time of each series computation; there too, the difference (overall computing time divided by the number of loaded strategies) keeps increasing.
And with bigger data arrays the difference increases yet again. That is a direct comparison.
But what if I need to compute a StdDev that depends on Bollinger and AMA? That's absolutely normal research.
It is usually good style to compute the StdDev of the spread between averages (as in MACD).

WMA and the RSIs run much faster too. I did all this work because I was experimenting with RSI tuning.
For example, some modifications may use not only a moving window, but also, say, the peaks and troughs of an SMA, AMA or anything else as an adaptive window. Imagination has no limits.
It was difficult to wait for hours.



Eugene

#12
Alexey,

Thank you so much for your thoughtful and detailed reply. It's great to have such a speed optimization expert on board.

Here's what I found, having tried out the SMA code from your post #10:

QUOTE:
Computing standard SMA indicator finshed in 49 milliseconds
Computing optimized SMA indicator finshed in 79 milliseconds


These results were obtained on 1,500,000 bars of random data. Note that the "slower" standard indicator takes the burden of being executed first, and still it beats the "faster" SMA considerably. Not sure why I'm unable to duplicate your results.

So far I'm becoming skeptical about parallelizing indicators. Maybe it's not the way to go, considering that the great majority of our users will never need to run backtests on millions of bars, where the speed difference may become noticeable.

akuzn

#13
Another picture with a test.


akuzn

#14
SMA code:
CODE:
Please log in to see this code.

It's difficult to do anything more with the moving average.
The last idea, which I decided not to implement, was to push the values in the range [bar-period]..[bar] into a small array of doubles, avoiding the ds[bar-period] call: something like an internal loop cache. But it isn't really needed, and the moving average is not the weakest spot. For StdDev and WMA, the optimization difference is significant.


akuzn

#15
I have retested.


Eugene

#16
QUOTE:
Difficult to do something with moving average.

Your latest findings re: SMA are in line with my observations. Not much to optimize here, considering the typical tasks of the average WLP/D user.

Being "embarrassingly parallel", SMA/WMA is a simple matter. I'm talking about RSI, AMA and other indicators that depend on previous values. Could you please share how you overcame this obstacle (dependencies and race conditions) when parallelizing their code?

akuzn

#17


3,853,000 ticks.
I just unloaded all programs and freed all the memory I could.
StdDev: 1512 ms vs. 159 ms.
SMA: 112 ms, 116 ms, 109 ms (the 109 ms one is the version from the previous post).
KAMA: 2378 ms vs. 140 ms.
VMA: 1563 ms vs. 442 ms.

Parallelizing and code optimization are used together.

I agree that 3,000,000 can be excessive, but it is just a few trading days.
Normally, for fast mean-reverting systems, more is not needed.

But if, for example, I want to do some classification on a dataset of 500 symbols or more (k-means, correlation matrices, any portfolio optimization), fast computing is needed even for 50,000-element arrays.

In real time, series computation need not affect execution speed if you compute on each new incoming value and append the latest indicator value.
But if optimization is needed, you have no choice: you have to find faster series computation.

I think parallel GA and EV coding will give an additional 80-100% speed improvement.
I'll test it in about 5 days, I think.

akuzn

#18
Nothing special; I compare it to VMA.Series, with Volume as the weight parameter.
I tested the weighted MA with and without parallelizing, and decided to use parallelizing: it showed faster results.
I think it's because there are already 2 arrays, which can be located anywhere in memory, so the CPU switches context much more often than with SMA, and parallelizing lets the CPU optimize the computation better.
That's the only explanation I have.
CODE:
Please log in to see this code.
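Since the posted code is hidden, here is a hedged sketch in the spirit described: a volume-weighted moving average whose output bars are independent, computed over Partitioner ranges. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hedged sketch in the spirit of the hidden code: a volume-weighted moving
// average. Every output bar reads only the two input arrays, so bars can be
// computed independently over Partitioner ranges. All names are hypothetical.
static class VwmaSketch
{
    public static double[] Compute(double[] price, double[] volume, int period)
    {
        var outSeries = new double[price.Length];
        if (price.Length < period) return outSeries;   // not enough bars for one window
        Parallel.ForEach(Partitioner.Create(period - 1, price.Length), range =>
        {
            for (int bar = range.Item1; bar < range.Item2; bar++)
            {
                double pv = 0, v = 0;
                for (int k = bar - period + 1; k <= bar; k++)
                {
                    pv += price[k] * volume[k];        // price weighted by volume
                    v += volume[k];
                }
                outSeries[bar] = v != 0 ? pv / v : price[bar];
            }
        });
        return outSeries;
    }
}
```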

akuzn

#19
RSI:
It may differ from the basic RSI, but the RSI idea can have various versions.
In this example I could have missed something or made it too smooth. If you criticize it, I will much appreciate that. :)
To be honest, I was more interested in an adaptive-window RSI based on achieved highs and lows, volume-weighted RSIs, their smoothed versions, and creating adaptive averages based on the volume/price-movement relationship, and I never settled on a final version of the simple RSI.
So maybe uncommenting some lines will bring this code closer to the standard RSI graph. But the speed idea stays the same.


CODE:
Please log in to see this code.

Eugene

#20
Thank you very much for sharing. Hopefully this discussion will be useful to advanced users of Wealth-Lab.

Here's a direct comparison of built-in RSI vs. your faster RSI on 57K+ bars of intraday data...



Surprisingly, most of the time the figures are in the ballpark of 4 ms (built-in RSI) vs. 6 ms (your RSI). Tested with both debug and release builds of my DLL (x64, .NET 4.5.1). I've got a 6-core CPU. Not sure what I'm doing wrong.

akuzn

#21
It's getting... funny.
I'll retest on 3 symbols.
I call the methods from my DLL, compiled in x64 in Visual Studio.
I switched it to the .NET Framework 4.5.1.
The SMA, WMA and RSI code is compiled into the DLL and used through a reference.

I think your result comes from the small array; I would suggest using a larger one. 57K+ is nothing to measure.
If you try to run an optimization, or even a test on a data set, the result will be more significant.
With smaller arrays, some background CPU activity may be interfering; with smaller arrays I had unpredictable variation in the measured results.
A larger data array makes a fairer comparison, I suppose, especially at the limit of RAM.

You might also try AmiBroker on a 1,000,000-element array. Ami is really fast.
I read about one comparative backtest of 1000 symbols in Wealth-Lab and Ami. Ami won, and the difference was significant.
But I can't work with their visual interface from the Middle Ages.

When I was optimizing 3 months of intraday data, about 55,000 bars, the difference between the built-in and the "optimized" indicators with parallel series computing gave me 2.5-4 times faster computation.
But the interface froze sometimes.
I use a 4-core i7 CPU.

The first symbol, SRH4, has 657,808 ticks;
the second, SIH4, 2,138,804 ticks;
the third, RIH4, 3,833,410 ticks.

More data and more CPU load make for a fairer result.
So I load 3 strategies and press Run 3 times.
The interface freezes, and after that Wealth-Lab shows the results.


Here is my code in Wealth-Lab:
CODE:
Please log in to see this code.



akuzn

#22
To be fair, I loaded only one test with the small data array of 657,808 ticks.
I'm 100% sure memory was not exhausted; at least 200 MB was left free.
So it is 100% fair computing.

But I wouldn't say 3 strategies under full load is unfair.


akuzn

#23
Just the mid-size array test:
SIH4, 2,138,804 ticks (USD/RUB futures).


akuzn

#24
And the largest series:
RIH4, 3,833,410 ticks (index futures).

The larger the data series, the more the difference increases.


akuzn

#25
To finish with the testing: I have run the tests many times.
I'm posting the most repeatable results here. The improvement in percent is the same on smaller arrays.
3,833,410 ticks:

StdDev: 2182 ms vs. 176 ms
SMA: 119 ms (built-in), 116 and 108 ms
KAMA: 3064 ms vs. 138 ms
VMA: 4284 ms vs. 406 ms
RSI: 4543 ms vs. 367 ms

If I use parallel series computing in tasks and run multiple strategies, the difference increases.
Under optimization it increases even more.

So it can be tested on a DataSet with symbols of 20K to 50K bars; the result will be significant.

The next step, as I said, is parallel optimization methods and better use of the WPF model.


akuzn

#26
Eugene, could you test the following SMA code?
I'm curious whether you'll see any improvement. I wrote it to finish with SMA optimization:
one if statement fewer in the loop, and a small internal cache, on the idea that it is faster than addressing the DataSeries.
No more ideas. In my tests it gives a 3-15% improvement.

CODE:
Please log in to see this code.


The most interesting part here is:
CODE:
Please log in to see this code.

If it were possible to place the temporary cache somewhere faster than an array, that would give some additional improvement.
I suppose this array lives in the method's local storage, and I suppose the most time-consuming thing here is the memory allocation.


Eugene

#27
Sorry Alexey, no visible improvement. :( I just tried out your latest SMA code and the results (averaged over several runs) are:

QUOTE:
57K+ bars: 2ms (standard) vs. 3ms (your)
1.5 mln bars: 4ms (standard) vs. 6ms (your)


But this is twice as fast as the Parallel.For* code from post #9.

akuzn

#28
Hmm... I get a real improvement:
the larger the series, the better the result.

Maybe you have a superb processor and very fast memory, and they don't need my experiments.

akuzn

#29
To be honest, I don't know what SMA code is built in. I suppose it's the naive version, or similar to the one above but with ds.Count left in place.
The code suggested above should be the best variation on the moving-window version.
I tried this "internal method cache" (double[] ds_cache) in other data series, and it gave an additional speed improvement for 3 indicators.

I really have only one explanation: 57K fits entirely in the CPU cache, and maybe 1,500,000 does too.
That explains why parallelizing gives nothing; much larger series are needed.
The double[] ds_cache version is a little behind, but larger series should prove its worth.
It will give more only when there are many memory accesses, for example in volume-weighted or other indicators that touch different memory blocks which the CPU and compiler won't optimize (prefetch).
Anyway, it's an interesting experiment, and it proves again that Microsoft C# is well optimized for scientific computing on modern CPUs.

Maybe you could test with 3,000,000 or even 5,000,000 bars?
And, if it's not a secret, could you share your machine's specs: processor, bus, memory?
I'm seriously considering cloud tests, or an i7 + CUDA. Theoretically I'd prefer an i7 (not a Xeon) plus CUDA with the new shared-memory NVIDIA Quadro or something similar.

Eugene

#30
QUOTE:
I have really only one explanation: 57k is really fully loaded in cpu cache. And 1.500.000 may be too.

Agreed. Like your i7, my 6-core Phenom II X6 is also equipped with a 6 MB cache.

QUOTE:
May be you could test with 3.000.000 or even 5.000.000 bars?

Since this would be only of theoretical interest, I prefer not to conduct further tests. Only a small fraction of our customers might be interested in backtesting symbols with such an enormous bar count. On a related note, parallelizing the MS123 Visualizers offered an instant, perceptible performance boost regardless of the amount of data in a backtest. I'll make a note to give your code a go on an older machine, though; maybe there I'll be able to see a difference.

UPDATE: no difference noticed on the older PC.

akuzn

#31
I've done all the tests on my notebook. It's powerful enough, but certainly not as good as a desktop.
I have one inefficient 600 MB Excel file which computes an intraday bank balance, and it makes a great stress test.
No notebook I've tested has shown better results than mine. Even gamers' notebooks show the same results.
The only thing I have not tested is the latest HP workstation with 32 GB.
And I think many people use notebooks.
So I would much appreciate it if you showed me the SMA code. Maybe something in the measuring affects the results.
I think it would be fairer to compare under the same conditions.
I still can't believe that the SMA code above, based on the moving-window algorithm, is losing.

But in general it seems everything is clear.
Thank you very much for participating :)

akuzn

#32
Eugene, may I ask you for one last fair test? I still can't believe your results.
The main idea is as follows. It is a known fact that multiplication is much faster than division (4-8 times).
I would like to suggest code which gives me an 8-30% speed improvement with SMA.
The division of the sum by the period is replaced by multiplication by a ratio computed beforehand as 1d / period.
So if the code below isn't faster than the standard one, something is wrong with the time measuring in your previous tests.

CODE:
Please log in to see this code.
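The actual code requires a login to view; as a hedged sketch of the idea described above (pay for one division up front, then only multiply inside the hot loop), again in Java for illustration with hypothetical names:

```java
// Division-free SMA loop sketch (illustration only; the thread's code is C#).
public class SmaReciprocal {
    public static double[] sma(double[] src, int period) {
        double[] out = new double[src.length];
        double inv = 1d / period; // one division, computed once
        double sum = 0.0;
        for (int i = 0; i < src.length; i++) {
            sum += src[i];
            if (i >= period) sum -= src[i - period];
            // multiplication replaces the per-bar division
            if (i >= period - 1) out[i] = sum * inv;
        }
        return out;
    }
}
```

One caveat: `sum * inv` can differ from `sum / period` in the last bit of precision, since `1d / period` is itself rounded; for indicator values this is normally negligible, but it is why compilers don't perform this substitution automatically under strict floating-point rules.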


Eugene

#33
Alexey,

I tried out this version and, averaged over 3 runs, it is faster than the standard SMA by 38% (on a Release build). After testing, I plan to switch to this version in Community/TASC Indicators to speed up all indicators that depend on SMA. Great job and thank you!

Eugene

#34
New round of discoveries. Take the code from your initial post (or my post #4), create a formal DataSeries and pack it into a DLL (build configuration: Release). The same code starts working 2x-3x slower than it does in the WealthScript Strategy, i.e. even slower than the built-in AveragePrice series. Go figure.

akuzn

#35
Sounds great.
At least now no one can say I wasn't right.
I've seen some books like "C# for Financial Markets", for scientific computations, etc., and was really surprised by the highly inefficient code sometimes suggested by the "professors" :)
So I am very pleased that you have confirmed my results.
---
All the tests I've done are called from code built into a DLL.
Even the initial averaging comparison is implemented in a DLL.
I usually use WealthLab to visualize results and optimize strategy parameters, but everything is in a DLL.
I think it is not only my approach; it is the normal way to manage and research.
---
Why did I decide to optimize average price? Because under the debugger I saw 4 unneeded loops of summation and division: first loop O+H, second the previous result + L, then + C, and after that a loop dividing by 4.
So parallelizing the loop that averages 4 values coming from different parts of RAM looked great for it.
This behaviour doesn't depend on WealthLab, I think, or maybe it depends on the DataSeries summation/multiplication operators.
I supposed it would be too difficult for me to override those operators, and that it would be easier to write some new methods.
As I said, everything is inside a DLL file: strategies, indicators, etc.
Maybe there is a difference in compiler code-optimization behaviour between Visual Studio and WealthLab?
That is my only explanation. In addition, there is the question of how DataSeries are loaded in memory: how they are fragmented and aligned. Maybe you have inserted a 32 GB chip? How was memory fragmented? For example, one load of 1,500,000 bars, another of 200,000 after the first strategy is closed, then a new one opened with 1,800,000 bars... Just general questions.
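The four separate passes described above (O+H, then +L, then +C, then a division loop) come from operator-based DataSeries arithmetic, where each `+` allocates and fills an intermediate series. A fused single-pass version, sketched in Java for illustration (hypothetical names; the real code is C#), touches each array exactly once:

```java
// Fused average-price sketch (illustration only; the thread's code is C#).
public class AveragePriceSketch {
    // One pass over all four arrays, multiplying by 0.25 instead of
    // dividing by 4 -- no intermediate series are allocated.
    public static double[] averagePrice(double[] open, double[] high,
                                        double[] low, double[] close) {
        double[] out = new double[open.length];
        for (int i = 0; i < open.length; i++) {
            out[i] = (open[i] + high[i] + low[i] + close[i]) * 0.25;
        }
        return out;
    }
}
```

Besides doing a quarter of the loop iterations, the fused version avoids three temporary allocations per call, which also reduces GC pressure on long series.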

But I use these results for other tests under my WPF trading platform, where I've improved speed again :) I need to work with order book and trading data coming from different markets: spot, futures, options on stocks, currencies, fixed income. Any kind of freezing interface or weak computing affects the comfort of making decisions. It seems I have avoided that.
Btw, no SQL, ADO, EF, etc.
But logically I can't understand why packing a DataSeries into a DLL should be slower.
Could you show your code? Maybe I have misunderstood something?
---
Btw, I've finished with the database files and am now working on high-speed optimization.
I will try to test the GA in WealthLab against my implementation of evolutionary optimization methods: GA and particle swarm optimization :)
There are some ways to optimize and parallelize the sorting, or simply to use many threads.
This will be really interesting. I've read some articles concerning Matlab; it doesn't give the best results, and there is some criticism and some ways to improve it.


---

In my opinion, if you rewrite the series methods in sequential mode you'll get more improvement. On a 1,000,000-bar series I get more than a 200x improvement with StdDev, for example.
If you use some parallelizing classes you'll get even more improvement.
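The large StdDev speedup from a "sequential rewrite" is plausible if the naive version re-scans the whole window on every bar. A rolling sketch that maintains a running sum and sum of squares (Java for illustration; names are hypothetical, and this is not WealthLab's actual implementation) does O(1) work per bar:

```java
// Rolling population StdDev sketch (illustration only): O(n) total
// via running sum and sum of squares, versus O(n * period) naive.
public class RollingStdDev {
    public static double[] stdDev(double[] src, int period) {
        double[] out = new double[src.length];
        double sum = 0.0, sumSq = 0.0;
        for (int i = 0; i < src.length; i++) {
            sum += src[i];
            sumSq += src[i] * src[i];
            if (i >= period) {            // drop the bar leaving the window
                double old = src[i - period];
                sum -= old;
                sumSq -= old * old;
            }
            if (i >= period - 1) {
                double mean = sum / period;
                // var = E[x^2] - E[x]^2; clamp tiny negatives from rounding
                double var = Math.max(0.0, sumSq / period - mean * mean);
                out[i] = Math.sqrt(var);
            }
        }
        return out;
    }
}
```

One design caveat: the sum-of-squares form can lose precision when values are large relative to their spread; Welford's algorithm is the numerically safer (though slightly slower) alternative.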
If you switch off the graphics, the GUI will freeze less.
It is a known fact that dynamic arrays (and dynamic collections in general) in C# are extremely fast and that it is difficult to beat their speed even with native code (see the excellent Richter articles on MSDN), but my own List<T> or arrays give me better results than DataSeries in WealthLab. The loading speed of large DataSeries could be faster too. But there is always a balance with how comfortable things are to manage: databases are always slower than direct file access.
Maybe it is possible to use the async/await operators with Dispatcher.BeginInvoke to keep the GUI more responsive, but that is no longer a user's question. All modern applications use this technique.
And I'm not sure how the graphics are implemented in general: is everything drawn on a panel which is then scrolled? I suppose so. Drawing only the visible window would be much better.
It is possible to put up with all these delays, computers are fast now, but what I really don't like is the trade bar: if I understand correctly it will never be bar+1, yet if we want to use all the optimization tools we have to use these techniques. Also, you can't get access to the GA from code. Btw, the GA is not multithreaded and gives me errors from time to time in the errors panel.
But I am very grateful to the WLB team for the evolutionary algorithm and WFO implementation ideas, and I still use it as a general-purpose application, mostly for visualizing ideas.
I can repost it if needed.

Eugene

#36
FWIW

During preparation for the new Wealth-Lab 6.8 release, it's been determined that parallelizing does NOT offer a speed improvement. On the contrary, it demonstrated considerably worse results.