Indicator speed improvement through parallelizing

akuzn

#1

11/25/2013 10:47 PM

Good Day!

After have seen that following

after compilation is realised as 3 loops of summation with additional loop with dividing each sum element by 4 i ve tried to realise some improvements: simple loop for (..) and with Parallel.For, Parallel.Foreach:

After that i ve measured execution speed

The result on 4 strategies with 45600 bars and 4 strategies with 10500 bars loading together in one workspace would say impressive:

That means

is the most ineffective code..
After that i ve rewritten some indicators with Parallel "small body loop as in MSDN article style"( last method ).
Result was great. In general parallel backtests run 1.5-2 time faster.

Do you have any plans to rewrite indicators libraries and may be optimizers same or better style?

Size:

Color:

Good Day!

After have seen that following
[code]
DataSeries av = new DataSeries (..);
av = (High + Low + Close  + Open ) / 4;
[/code]
after compilation is realised as 3 loops of summation with additional loop with dividing each sum element by 4 i ve tried to realise some improvements: simple loop for (..) and with Parallel.For, Parallel.Foreach:

[code]
public static DataSeries Average_Serie(WealthScript ws, params DataSeries[] ds_lst)
        {
            int _num_ds = ds_lst.Length;
            DataSeries _ds_result = new DataSeries(ws.Bars, "Average Serie from" + _num_ds.ToString());

if (_num_ds > 0)
            {
                for (int bar = 0; bar < ws.Bars.Count; bar++)
                {
                    double temp = 0;
                    for (int k = 0; k < _num_ds; k++) 
                        { temp += ds_lst[k][bar]; }                        
                    _ds_result[bar] = temp / _num_ds;
                }
            }
            else
                _ds_result = ds_lst[0];

return _ds_result;
        }

//
        public static DataSeries Average_Serie_Parallel(WealthScript ws, params DataSeries[] ds_lst)
        {
            int _num_ds = ds_lst.Length;
            DataSeries _ds_result = new DataSeries(ws.Bars, "Average Serie from" + _num_ds.ToString());

if (_num_ds > 0)
            {
                Parallel.For(0, ws.Bars.Count, bar =>
                {
                    double temp = 0;
                    for (int k = 0; k < _num_ds; k++) 
                        { temp += ds_lst[k][bar]; }
                    _ds_result[bar] = temp / _num_ds;
                });
            }
            else
                _ds_result = ds_lst[0];

return _ds_result;
        }

//
        public static DataSeries Average_Serie_Parallel2(WealthScript ws, params DataSeries[] ds_lst)
        {
            int _num_ds = ds_lst.Length;
            DataSeries _ds_result = new DataSeries(ws.Bars, "Average Serie from" + _num_ds.ToString());

if (_num_ds > 0)
            {   
                //
                var rangePartitioner = Partitioner.Create(0, _ds_result.Count);
                //  
           
                Parallel.ForEach(rangePartitioner, (range, loopState) =>
                {                    
                    // Loop over each range element without a delegate invocation.
                    for (int i = range.Item1; i < range.Item2; i++)
                    {
                        double temp = 0;
                        for (int k = 0; k < _num_ds; k++) 
                            { temp += ds_lst[k][i]; }
                        _ds_result[i] = temp / _num_ds;
                    }
                });
            }
            else
                _ds_result = ds_lst[0];

return _ds_result;
        }

[/code]

After that i ve measured  execution speed
[code]

this.ws = ws;
            // General Definitions
            DataSeries av = new DataSeries(ws.Bars, "AV");
            DataSeries av1 = new DataSeries(ws.Bars, "AV1 as Serie");
            DataSeries av2 = new DataSeries(ws.Bars, "AV2 as Parallel Serie");
            DataSeries av3 = new DataSeries(ws.Bars, "AV3 as Parallel Serie small body loop");
            
            Stopwatch stopwatch = new Stopwatch();
            ws.PrintDebug("==================================================================");
            ws.PrintDebug("Computing averages for Strategy: " + ws.Bars.Symbol);
            ws.PrintDebug("Computing regular average...");
            stopwatch.Start();
            //
            av = (ws.High + ws.Low + ws.Open + ws.Close) / 4;
            stopwatch.Stop();
            ws.PrintDebug("Regular  time computing of average in milliseconds: {" + stopwatch.ElapsedMilliseconds + "}");
            //
            stopwatch.Reset();
            ws.PrintDebug("\nComputing average as serie method...");
            stopwatch.Start();
            av1 = Utilities.Average_Serie(ws, ws.High, ws.Low, ws.Open, ws.Close);
            stopwatch.Stop();
            ws.PrintDebug("Serie method  time computing of average in milliseconds: {" + stopwatch.ElapsedMilliseconds + "}");
            //
            stopwatch.Reset();
            ws.PrintDebug("\nComputing average as serie Parallel method...");
            stopwatch.Start();
            av2 = Utilities.Average_Serie_Parallel(ws, ws.High, ws.Low, ws.Open, ws.Close);
            stopwatch.Stop();
            ws.PrintDebug("Parallel Serie method time computing of average in milliseconds: {" + stopwatch.ElapsedMilliseconds + "}");
            //
            stopwatch.Reset();
            ws.PrintDebug("\nComputing average as serie Parallel method small body loop...");
            stopwatch.Start();
            av3 = Utilities.Average_Serie_Parallel2(ws, ws.High, ws.Low, ws.Open, ws.Close);
            stopwatch.Stop();
            ws.PrintDebug("Parallel Serie small body loop method time computing of average in milliseconds: {" + stopwatch.ElapsedMilliseconds + "}");
            //
            ws.PrintDebug("==================================================================");
[/code]
The result on 4 strategies with 45600 bars and 4 strategies with 10500 bars loading together in one workspace would say impressive:
[quote]
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {14}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {2}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {5}
==================================================================
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {12}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {2}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {4}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {1}
==================================================================
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {46}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {1}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {2}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {2}
==================================================================
==================================================================
Computing averages for Strategy: EU-12.13(EUZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {14}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {3}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {3}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {10}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {4}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {10}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {69}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {14}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {5}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {6}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {5}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {11}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {5}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {13}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {4}
==================================================================
==================================================================
Computing averages for Strategy: SI-12.13(SIZ3)
Computing regular average...
Regular  time computing of average in milliseconds: {18}

Computing average as serie method...
Serie method  time computing of average in milliseconds: {10}

Computing average as serie Parallel method...
Parallel Serie method time computing of average in milliseconds: {3}

Computing average as serie Parallel method small body loop...
Parallel Serie small body loop method time computing of average in milliseconds: {3}
==================================================================
[/quote]
That means 
[code]
av = (High + Low + Close  + Open ) / 4;
[/code]
is the most ineffective code..
After that i ve rewritten some indicators with Parallel "small body loop as in MSDN article style"( last method ).
Result was great. In general parallel backtests run 1.5-2 time faster.

Do you have any plans to rewrite indicators libraries and may be optimizers same or better style?

Eugene

#2

11/26/2013 4:59 AM

Thank you for your input Alexey. Excellent idea.

There are no short-term plans to rewrite indicators and the Optimizer using Task Parallel Library.

P.S. I remember an episode when rewriting MS123 Visualizers last year. There was a performance bottleneck in a piece of code that calculated Closed Equity across the library. At first, applying Parallel.For seemed like the natural choice as it brought considerable speed improvement. Just one thing: the end result was incorrect. No matter how hard I bashed my head against this brick wall, it didn't work out and I had to optimize the algorithm instead.

akuzn

#3

11/26/2013 8:22 PM

Thank you for your support.
Especially KAMA and StdDev are looking good for parallelizing.
I would say optimization is running much faster. Now 8 intraday optimizations are running with expected time 3 hours. Yesterday it was about 7 hours.

Eugene

#4

12/10/2013 3:46 AM

Alexey,

Performed a quick test to verify your results on 250532 bars of IQFeed 5-min data for AAPL:

Worth noting is that by using plain vanilla version of Parallel.For I've noticed no breakthough at all. However, having switched to Parallel.For with Partitioner there's a 2x speed improvement:

Quite interesting.

Perhaps it's because the source DataSeries, being indexed collections whose length is known, are susceptible to range partitioning.

Size:

Color:

Alexey,

Performed a quick test to verify your results on 250532 bars of IQFeed 5-min data for AAPL:

[code]using System;
using System.Collections.Generic;
using System.Text;
using System.Drawing;
using WealthLab;
using WealthLab.Indicators;
using System.Threading.Tasks;
using System.Diagnostics;
using System.Collections.Concurrent;

namespace WealthLab.Strategies
{
	public class MyStrategy : WealthScript
	{		
		public DataSeries Average_Series_Parallel(params DataSeries[] ds_lst)
		{
			int _num_ds = ds_lst.Length;
			DataSeries _ds_result = new DataSeries(Bars, "Parallel (" + _num_ds.ToString() + ")");

return _ds_result;
		}
		
		public DataSeries Average_Series_Partitioner(params DataSeries[] ds_lst)
		{
			int _num_ds = ds_lst.Length;
			DataSeries _ds_result = new DataSeries(Bars, "Partitioner(" + _num_ds.ToString() + ")");

return _ds_result;
		}
	
		protected override void Execute()
		{
			DataSeries av = new DataSeries(Bars, "Average OHLC");
			DataSeries av1 = new DataSeries(Bars, "Average OHLC (Parallel)");
			DataSeries av2 = new DataSeries(Bars, "Average OHLC (Partitioner)");
            
			Stopwatch stopwatch = new Stopwatch();
			PrintDebug("==================================================================");
			PrintDebug("Computing averages for: " + Bars.Symbol + "(" + Bars.Count + " bars)" );
			PrintDebug("Computing regular average...");
			stopwatch.Start();
			av = (High + Low + Open + Close) / 4;
			stopwatch.Stop();
			PrintDebug("Computing regular average finshed in " + stopwatch.ElapsedMilliseconds + " milliseconds");
			//
			stopwatch.Reset();
			PrintDebug("\nComputing average using Parallel.For...");
			stopwatch.Start();
			av1 = Average_Series_Parallel(High, Low, Open, Close);
			stopwatch.Stop();
			PrintDebug("Computing Parallel average finished in " + stopwatch.ElapsedMilliseconds + " milliseconds");
			//
			//
			stopwatch.Reset();
			PrintDebug("\nComputing average using Partitioner...");
			stopwatch.Start();
			av2 = Average_Series_Partitioner(High, Low, Open, Close);
			stopwatch.Stop();
			PrintDebug("Computing Parallel average using Partitioner finished in " + stopwatch.ElapsedMilliseconds + " milliseconds");
			//
			PrintDebug("==================================================================");
		}
	}
}[/code]

Worth noting is that by using plain vanilla version of Parallel.For I've noticed no breakthough at all. However, having switched to Parallel.For with Partitioner there's a 2x speed improvement:

[quote]==================================================================
Computing averages for: AAPL(250532 bars)
Computing regular average...
Computing regular average finshed in 31 milliseconds

Computing average using Parallel.For...
Computing Parallel average finished in 27 milliseconds

Computing average using Partitioner...
Computing Parallel average using Partitioner finished in 17 milliseconds
==================================================================[/quote]

Quite interesting.

Perhaps it's because the source DataSeries, being indexed collections whose length is known, are susceptible to range partitioning.

akuzn

#5

12/10/2013 5:51 AM

Yes, i agree "Partiioner " is best.
And it was suggested in MSDN example "small body loop". Seems only they can know what happens in their jungles.
Any new programming construction can break brain.)
But what i understood from their explanations - there are some consuming time operations during parallel preparation at the beginning of "parallelizing" and at the end. What happens in the middle - is very quick.

I ve rewritten some parts of code with possibility to switch between simple for and "small body loop" but can tell only if there are 1 of 2 strategies backtesting efffect may be 1,5 - 2 times, if 4-8 strategies seems no no improvement at all.
Btw i ve not noticed different results in series computing. I ve rewritten RSI, MA, adaptive ma's.

JDardon

#6

3/4/2014 3:57 PM

A few questions regarding this line of thoughts.

45,600 bars @ (4 bytes per item) = 182,400 bytes (+ some overhead) = 178 KB per series. You are averaging 4 series = 712 KB of total data in 4 variables that are passed as parameters to 3 different functions.

Current processors typically have 6 MB of cache each. This means that your 4 data series can be fully loaded into the cache. The first call has to include the I/O time for uploading the data from memory-> cpu cache. The calls that happen afterwards will not have that overhead loading time and therefore apparently better execution time.

a) Could that be the reason for the severe discrepancy between the regular calculation and the series method? Could you test using a different order for the function calls and see if the results remain the same?
b) Given that there was only one result in which the parallel small body loop method was significantly faster than the series methood, couldn't this be related to the other activities the computer might have been doing at the same time? Could you guys test but with a lot more data (to take away the benefits from the cache of the microprocessor)?

The reason I ask is because I tried to run your code to verify this and it didn't compile:

Size:

Color:

A few questions regarding this line of thoughts.

45,600 bars @ (4 bytes per item) = 182,400 bytes (+ some overhead)  = 178 KB per series.   You are averaging 4 series = 712 KB of total data in 4 variables that are passed as parameters to 3 different functions.

Current processors typically have 6 MB of cache each. This means that your 4 data series can be fully loaded into the cache.  The first call has to include the I/O time for uploading the data from memory-> cpu cache.   The calls that happen afterwards will not have that overhead loading time and therefore apparently better execution time.

a) Could that be the reason for the severe discrepancy between the regular calculation and the series method? Could you test using a different order for the function calls and see if the results remain the same?
b) Given that there was only one result in which the parallel small body loop method was significantly faster than the series methood, couldn't this be related to the other activities the computer might have been doing at the same time?  Could you guys test but with a lot more data (to take away the benefits from the cache of the microprocessor)?

The reason I ask is because I tried to run your code to verify this and it didn't compile:

[quote]
Error 1) The type System.CollectionConcurrent.Partitioner exist in both c:\program files\Fidelity Investments\Wealth-ab Pro 6\System.threading.dll and c:\windows\microsoft.NET\Framework64\v.4.0.30319\mscorlib.dll

Error 2) The type arguments for method "System.Collections.Concurrent.Partitioner.Create<Tsource>(System.Collections.Generic.IList<Tsource>,bool) can not be inferred from the usage.  Try specifying the arguments explicitly.
[/quote]

Eugene

#7

3/7/2014 3:19 PM

Jorge,

Your argument makes sense. However, take the code from my post #4. Generate a huge dummy data file using Random data provider. Change the order of series. Even if the Parallel.For loop with Range Partitioner is made the first one to bear the I/O overhead, it beats the other two hands down.

akuzn

#8

3/7/2014 6:08 PM

1. It is from async - in general 3-4 times faster if you can use parallelizing. It s proved - it s not only the question of cache it is question of new async improvements in .Net in general.

2. I would like to add some more information.

About 3.500.000 tick data, my methods StdDev, AMA, Weighted MA, RSI etc run 50-300 times! (not percent) faster than standard due to parallelizing and some additional old school tricks avoiding excessive memory adressing, unboxing (about 3-8% time economy) etc.

Btw seems i found the way to improve moving average speed - only about 13 (from 129) ms on this amount of ticks. Certainly without any parallelizing because next average depends on previous value. May be larger amounts of ticks can be divided for parallel invokations but seems it is not needed for ma.

If i use some additinal tricks like threading or parallel invokation of series computing under stategy thread i have 50% and more additional speed improvement under WealthLab threading model.

I suppose GA may be imporoved but may be it needs too many changes depending on WealthLab itself.
I have tested some GA, evolutionary and particles swarm optimizations - there are ways to improve GA im sure.

Now i am on the way to finish with LowLatency platform (10-20 ms reaction including optimization and machine learning) - i use there multi threading model - i supose 100-400% additinal speed improvement for multi strategies runs.
I have bottleneck with data providers - pipe, dde and some others - our pro from exchange decided to use Lua language instead of normal API - never worked with it before but as for above information - it is tested and 100%.

Btw i realised some fast drawings in WPF async bindings etc - some improvements can be done in drawing too. I ve read ZedGraph is not fastest.

So the question of partitioner is just beginning. But still want to use WealthLab and hope it ll run faster.

As for the error you are getting - seems i had similar error under WealthLab editor.
I would suggest you to use methods from dll compiled in Visual Studio - it s much more comfortable and no problems at all.

Size:

Color:

1. It is from async - in general 3-4 times faster if you can use parallelizing. It s proved - it s not only the question  of cache it is question of new async improvements in .Net in general.

2. I would like to add some more information.

About 3.500.000 tick data, my methods StdDev, AMA, Weighted MA, RSI etc run 50-300 times! (not percent) faster than standard due to parallelizing and some additional old school tricks avoiding excessive memory adressing, unboxing (about 3-8% time economy) etc.

Btw seems i found the way to improve moving average speed  - only about 13 (from 129) ms on this amount of ticks. Certainly without any parallelizing because next average depends on previous value.  May be larger amounts of ticks can be divided for parallel invokations but seems it is not needed for ma.

If i use some additinal tricks like threading or parallel invokation of series computing under stategy thread i have 50% and more additional speed improvement under WealthLab threading model.

I suppose GA may be imporoved but may be it needs too many changes depending on WealthLab itself.
I have tested some GA, evolutionary and particles swarm optimizations - there are ways to improve GA im sure.

Now i am on the way to finish with LowLatency platform (10-20 ms reaction including optimization and machine learning) - i use there multi threading model - i supose 100-400% additinal speed improvement for multi strategies runs. 
I have bottleneck with data providers - pipe, dde and some others - our pro from exchange decided to use Lua language instead of normal API - never worked with it before but as for above information - it is tested and 100%.

Btw i realised some fast drawings in WPF async bindings etc - some improvements can be done in drawing too. I ve read ZedGraph is not fastest.

So the question of partitioner is just beginning. But still want to use WealthLab and hope it ll run faster.

As for the error you are getting - seems i had similar error under WealthLab editor. 
I would suggest you to use methods from dll compiled in Visual Studio - it s much more comfortable and no problems at all.

Eugene

#9

3/11/2014 3:46 PM

Alexey,

I tried to parallelize SMA using both Parallel.For and Parallel.ForEach. Have to say, not many indicators allow to parallelize themselves this way due to data race conditions and parallel dependencies. Surprisingly, on a symbol with 300K bars both implementations turned up slower (up to several times) than the traditional loop.

Considered PLINQ too but I think it's too expensive for this kind of task. I wonder how do you parallelize indicators like WMA or RSI (w/o the tricks you mentioned)?

akuzn

#10

3/19/2014 7:44 PM

Nice to see your interest.
If i understand logic SMA computing may be done fast enough without any cpu context switching in sync mode. I dont have enough knowledge about it. My tests with sma parallelizing didnt give anything too.
May be it will be needed to compare data array to CPU cash size. And if bigger parallelizing in .Net will give something. How big it should be? Good question. It will be interesting co compare to CUDA results. (About CUDA later).

I am not a pro programmer and had about 10 years out of quantative finances. 3 years ago returned to my previous interests as hobby when was at hospital with broken leg)).
So pls dont critisize me to much. All i know now - all i can remember from turbo pascal, c and 2 years of C# learning.

Here is my code wich gave me best improvement 9% (usually 4-5%) on 3.800.000 ticks. No parallelizing - just precize coding).

I have strong opinion that C# may give super results and c++ is not needed. The main idea not to let GarbageCollector work during computation and follow recommendations like dont forget make working class as sealed, use structs, and internal block variables wich can be stored in AX, BX, CX, DX ( how many modern CPU has and how are they named?) arithmetic registres of CPU.

Some recommendations work better on x64. There are many tests when x64 studio compilation in same conditions and even similar code give 3-5% better results than unsafe C#, C++ or C.
So i have read carefully these tests and implemented them.

For example what surpised me much:

const int i is better than just int i;
in for (int j =0; j<i;j++) difference on C# can give about 3% -5% of speed difference between C++ and C#!
Dont know why it doesnt give anything in c++.

Following same logic ds.Count is bad style. But in my example some improvement was realised due to avoiding some computation branches and may be avoiding excessive CPU context switching. 2 years more and i ll know it).
I think 2-4% of time will be reduced by replacing ds.Count by: int num = ds.Count; and num will be placed in body of for operator.

In one topic concerning c# code optimization i have seen an interesting idea about loop coding optimization but havenot understand what it was mean and couldnt reproduce. But idea was as following

Author called it c# loop optimization. May be i have missed something. In my case i ve received freezed interface.

As you can see in my SMA example - little bit more coding but it works little bit better.
As for StdDev and others - im not kidding. I tried to compare results to amibroker - it runs fast enough - my results are similar. Not sure how they are measuring it - all strategies or final result is on time computing of each strategy. I mean each strategie there can show time computing and rendering time. Each result is good but as you know we cant be 100% trust if we dont see how fair it is measured and computed.

To be honest i dont like AmiBroker - just used my old 5.4 version to check.

Another test i ve tried 2 years ago was with optimization of 2 strategies with many stddev and normalization of price/stddev computing. Ami interface was freezing for 10-15 minutes and sometimes i couldnt understand what was going on. But one simple computing seems running there fast enough. And in general their optimization interface and internal language is not as comfortable as WealthLab and Visual Studio.

When i was optimizing same ideas under WealthLab - WLB was looking better with 8-20 strategies with GA.
Algos of series computing in ami seems fast enough, but as for multithreading - i think they dont work good with Windows model.
They tell that new Ami version can use up to 32 CPU cores, but why it was freezing in the same optimization conditions on 8 cores?
WealthLab looks much more friendly with Windows - naturally born).

What i understand now - Amibroker nor WealhLab dont follow WPF model!. Let's say WealthLab is much closer.- COM - WPF interoperation - black holes in windows can be explained.

I ve realised some ideas following WPF description model and feel very happy.
What i mean - graphics as fast it can be done under WPF. I ve tested 5 ms update in 40 windows with Visual Class - would say perfect.
No freezing no big cpu load. Sure some additional tricks - but under main thread. BackgroundWorker or parallizing - and rendering on computed event works great. No freezing at all. WealthLab freezes sometimes. I ve done it because i need to trade with high frequency intraday data, order book and trying to use machine learning. Seems everything is possible. Even if i switch off graphics and install program on server near exchange C# code will work great.

And last additinal trick is CUDA - CUDAFy - free C# library.
Result is impressive but it is needed to find balance because PLINQ and parallelizing under Net is fast enough. Difference begins after 100.000 values are reached. There are some articles on codeproject with computational examples. In one article of CUDAFy developer Nvidia quadra 540 (800 usd acer) gave result 300! better than PLINQ on Geo tasks. But with arrrays below 50.000 elements cuda programming wont give anything.
I studied it to understand how better plan future work with my program. New Nvidia adapter even on notebooks can give result much better than 10 Xeons or any other well marketed products. Quadra or Tesla and just one I7 - and nothing more.
Last version of NVIDIA hardware and software (unfortunately only release candidate) support shared memory adressing.
If i understand right unified operators and no big difficulties of transfering data arrays between different memory types. May be programming interface will be even same for Computer RAM and Nvidia Card memory. There are c++ examples but my card doesnt support this mode. Some video chips with this possibility are already available. CUDAFy wich provides C# interfaces doesnt support it yet but i think in half year everything will be tested.

As for data base if i understand right file database is much faster than any SQL etc.
So just simple well organized .CSV files are much faster. Have not tested SQL programming personally but i have such experience with moscow data provider - they were giving static provider with data stored in sql. I dont like speed so even dont want any experiment with it.

So seems following WPF programming model must be comfortable enough. Even it is not needed to run every window in personal thread. UI thread will be enough in combination with parallel For, LINQ if needed. And CUDA in addition if needed huge arrayes testing.
Now im finishing with data providing. and hope will finish with beta assembly in about 2 weeks.
If you are interested in testing - i can send you copy.

Size:

Color:

Nice to see your interest.
If i understand logic SMA computing may be done fast enough without any cpu context switching in sync mode. I dont have enough knowledge about it. My tests with sma parallelizing didnt give anything too.
May be it will be needed to compare data array to CPU cash size. And if bigger parallelizing in .Net will give something. How big it should be? Good question. It will be interesting co compare to CUDA results. (About CUDA later).

I am not a pro programmer and had about 10 years out of quantative finances. 3 years ago returned to my previous interests as hobby when was at hospital with broken leg)).
So pls dont critisize me to much. All i know now - all i can remember from turbo pascal, c and 2 years of C# learning.

Here is my code wich gave me best improvement 9% (usually 4-5%) on 3.800.000 ticks. No parallelizing - just precize coding).
[code]
public static DataSeries Simple_MA_Optim2(DataSeries ds, int period)
        {
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            
            DataSeries sma = new DataSeries(ds, "Simple_MA_Optim2(" + ds.Description + ", " + period.ToString() + ")");
            //
            double sum = 0;
            if (period > ds.Count)
            {
                for (int bar = 0; bar < ds.Count; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = bar != 0 ? sum / (bar+1) : cur_ds;
                }
            }
            else
            {
                for (int bar = 0; bar < period; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = bar != 0 ? sum / (bar + 1) : cur_ds;
                }

for (int bar = period; bar < ds.Count; bar++)
                {
                    double cur_ds = ds[bar];
                    double period_ds = ds[bar - period];
                    sum += cur_ds;

sum -= period_ds;// ds[bar - period];//
                    sma[bar] = period != 0 ? sum / period : cur_ds;
                }
            }

stopwatch.Stop();
            sma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";

return sma;
        }//End Simple_MA_Optim2
[/code]

I have strong opinion that C# may give super results and c++ is not needed. The main idea not to let GarbageCollector work during computation and follow recommendations like dont forget make working class as sealed, use structs, and internal block variables wich can be stored in AX, BX, CX, DX ( how many modern CPU has and how are they named?) arithmetic registres of CPU.

Some recommendations work better on x64. There are many tests when x64 studio compilation in same conditions and even similar code give 3-5% better results than unsafe C#, C++ or C.
So i have read carefully these tests and implemented them.

For example what surpised me much:

const int i is better than just int i;
in for (int j =0; j<i;j++) difference on C# can give about 3% -5% of speed difference between C++ and C#!
Dont know why it doesnt give anything in c++.

Following same logic ds.Count is bad style. But in my example some improvement was realised due to avoiding some computation branches and may be avoiding excessive CPU context switching. 2 years more and i ll know it).
I think 2-4% of time will be reduced by replacing ds.Count by: int num = ds.Count; and num will be placed in body of for operator.

In one topic concerning c# code optimization i have seen an interesting idea about loop coding optimization but havenot understand what it was mean and couldnt reproduce. But idea was as following
[quote]
for (int i = bigperiod; i > 0; )
{
    ...a*b*c * matrix[i--];
}
[/quote]
Author called it c# loop optimization. May be i have missed something. In my case i ve received freezed interface.

As you can see in my SMA example - little bit more coding but it works little bit better.
As for StdDev and others - im not kidding. I tried to compare results to amibroker - it runs fast enough - my results are similar. Not sure how they are measuring it - all strategies or final result is on time computing of each strategy. I mean each strategie there can show time computing and rendering time. Each result is good but as you know we cant be 100% trust if we dont see how fair it is measured and computed.

To be honest i dont like AmiBroker - just used my old 5.4 version to check.

Another test i ve tried 2 years ago was with optimization of 2 strategies with many stddev and normalization of price/stddev computing. Ami interface was freezing for 10-15 minutes and sometimes i couldnt understand what was going on. But one simple computing seems running there fast enough. And in general their optimization interface and internal language is not as comfortable as WealthLab and Visual Studio.

When i was optimizing same ideas under WealthLab - WLB was looking better with 8-20 strategies with GA.
Algos of series computing in ami seems fast enough, but as for multithreading - i think they dont work good with Windows model.
They tell that new Ami version can use up to 32 CPU cores, but why it was freezing in the same optimization conditions on 8 cores?
WealthLab looks much more friendly with Windows - naturally born).

What i understand now - Amibroker nor WealhLab dont follow WPF model!. Let's say WealthLab is much closer.- COM - WPF interoperation - black holes in windows can be explained.

I ve realised some ideas following WPF description model and feel very happy.
What i mean - graphics as fast it can be done under WPF. I ve tested 5 ms update in 40 windows with Visual Class - would say perfect.
No freezing no big cpu load. Sure some additional tricks - but under main thread. BackgroundWorker or parallizing - and rendering on computed event works great. No freezing at all. WealthLab freezes sometimes. I ve done it because i need to trade with high frequency intraday data, order book and trying to use machine learning. Seems everything is possible. Even if i switch off graphics and install program on server near exchange C# code will work great.

And last additinal trick is CUDA - CUDAFy - free C# library.
Result is impressive but it is needed to find balance because PLINQ and parallelizing under Net is fast enough. Difference begins after 100.000 values are reached. There are some articles on codeproject with computational examples. In one article of CUDAFy developer Nvidia quadra 540 (800 usd acer) gave result 300! better than PLINQ on Geo tasks. But with arrrays below 50.000 elements cuda programming wont give anything.
I studied it to understand how better plan future work with my program. New Nvidia adapter even on notebooks can give result much better than 10 Xeons or any other well marketed products. Quadra or Tesla  and just one I7 - and nothing more.
Last version of NVIDIA hardware and software (unfortunately only release candidate) support shared memory adressing.
If i understand right unified operators and no big difficulties of transfering data arrays between different memory types. May be programming interface will be even same for Computer RAM and Nvidia Card memory. There are c++ examples but my card doesnt support this mode. Some video chips with this possibility are already available. CUDAFy wich provides C# interfaces doesnt support it yet but i think in half year everything will be tested.

As for data base if i understand right file database is much faster than any SQL etc.
So just simple well organized .CSV files are much faster. Have not tested SQL programming personally but i have such experience with moscow data provider - they were giving static provider with data stored in sql. I dont like speed so even dont want any experiment with it.

So seems following WPF programming model must be comfortable enough. Even it is not needed to run every window in personal thread. UI thread will be enough in combination with parallel For, LINQ if needed. And CUDA in addition if needed huge arrayes testing.
Now im finishing with data providing. and hope will finish with beta assembly in about 2 weeks.
If you are interested in testing - i can send you copy.

akuzn

#11

3/19/2014 8:07 PM

SMA, StdDev, KAMA are standard indicators,

SMA_OPTIM1, SMA_OPTIM2,
StdDev_Naive
AMA - my version.
Everything is done under WealhLab.
So as you can see WLB StdDev 2698 ms, my version 175 ms,
SMA's are similar but in general SMA_OPTIM2 is little bit faster if i run it 10 times,
KAMA and AMA 4334 and 161 ms.

Correct computation is proved because one PlotSeries covers another.
They are computed one after another without any additional processes or parallel execution or backgroundworker etc.

Results may have variations even if just only one strategy is loaded.
But if i load and run 2 and more stategies difference between standard WLB and my indicators in time computing increases.
In general in my strategies i use parallel tasks under wealthlab thread so difference increases again.
Certainly would be better compare average time of every serie somputing - in this situation difference between average time based on overall computing time divided by numver of strategies loaded will be increasing too.
Certainly if i use bigger data arrayes difference increases again. It is direct comparison.
But if i need to compute StdDev depending on Bollinger and AMA? Absolutely normal research.
Usually will be good style to compute StdDev of spread between averages (like MACD).

As for WMA, RSI's are running much faster too. I ve done all this job because i was doing experiments with RSI tuning.
For example some modifications may use not only moving window but for exmple also peaks and throughs of SMA, AMA or anythign else as adaptive window. Imagination has no limits.
It was difficult to wait hours.

SpeedTests.png

Size:

Color:

SMA, StdDev, KAMA are standard indicators,

SMA_OPTIM1, SMA_OPTIM2,
StdDev_Naive
AMA - my version. 
Everything is done under WealhLab.
So as you can see WLB StdDev 2698 ms, my version 175 ms,
SMA's are similar but in general SMA_OPTIM2 is little bit faster if i run it 10 times,
KAMA and AMA 4334 and 161 ms.

Correct computation is proved because one PlotSeries covers another.
They are computed one after another without any additional processes or parallel execution or backgroundworker etc.

Results may have variations even if just only one strategy is loaded.
But if i load and run 2 and more stategies difference between standard WLB and my indicators in time computing increases.
In general in my strategies i use parallel tasks under wealthlab thread so difference increases again.
Certainly would be better compare average time of every serie somputing - in this situation difference between average time based on overall computing time divided by numver of strategies loaded will be increasing too.
Certainly if i use bigger data arrayes difference increases again. It is direct comparison.
But if i need to compute StdDev depending on Bollinger and AMA? Absolutely normal research.
Usually will be good style to compute StdDev of spread between averages (like MACD).

As for WMA, RSI's are running much faster too. I ve done all this job because i was doing experiments with RSI tuning.
For example some modifications may use not only moving window but for exmple also peaks and throughs of SMA, AMA or anythign else as adaptive window. Imagination has no limits.
It was difficult to wait hours.

[img]http://wl6.wealth-lab.com/handlers/DownloadHandler.ashx?id=4158&password=452ece032a904a8[/img]

Eugene

#12

3/20/2014 5:51 AM

Alexey,

Thank you so much for your thoughtful and detailed reply. It's great to have such a speed optimization expert on board.

Here's what I found, having tried out the SMA code from your post #10:

These results were obtained on 1.500.000 bars of random data. Note that the "slower" standard indicator takes the burden by being executed first, and still it beats the "faster" SMA considerably. Not sure why I'm unable to duplicate your results.

So far I'm becoming skeptical about parallelizing indicators. Maybe it's not the way to go, considering that the great majority of our users will never have the need to execute backtests on millions of bars - where the speed difference may become noticeable.

akuzn

#13

3/20/2014 6:18 AM

Another pic with test

SpeedTestsSI(USDRUR).png

akuzn

#14

3/20/2014 6:40 AM

SMA code:

Difficult to do something with moving average.
The last idea what i decided not to realize is to push values in the range [bar-period] ..[bar] in small array of doubles avoiding ds[bar-period] call - something like internal loop cash. But it is not really needed and moving average is not weakest. As for StdDev, WMA difference result of optmization is significant.

3parallel_speed_tests.png

Size:

Color:

SMA code:
[code]
public static DataSeries Simple_MA_Optim2(DataSeries ds, int period)
        {
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            
            DataSeries sma = new DataSeries(ds, "Simple_MA_Optim2(" + ds.Description + ", " + period.ToString() + ")");
            //
            double sum = 0;
            int max_num = ds.Count;
            if (period > max_num)
            {
                for (int bar = 0; bar < max_num; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = bar != 0 ? sum / (bar+1) : cur_ds;
                }
            }
            else
            {
                for (int bar = 0; bar < period; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = bar != 0 ? sum / (bar + 1) : cur_ds;
                }

for (int bar = period; bar < max_num; bar++)
                {
                    double cur_ds = ds[bar];
                    double period_ds = ds[bar - period];
                    sum += cur_ds;

sum -= period_ds;// ds[bar - period];//
                    sma[bar] = period != 0 ? sum / period : cur_ds;
                }
            }

stopwatch.Stop();
            sma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";

return sma;
        }//End Simple_MA_Optim2
[/code]
Difficult to do something with moving average. 
The last idea what i decided not to realize is to push values in the range [bar-period] ..[bar] in small array of doubles avoiding ds[bar-period] call - something like internal loop cash. But it is not really needed and moving average is not weakest. As for StdDev, WMA difference result of optmization is significant.

[img]http://wl6.wealth-lab.com/handlers/DownloadHandler.ashx?id=4160&password=42fe18d651184f1[/img]

akuzn

#15

3/20/2014 6:50 AM

I have retested.

FuturesIndexRTS.png

Eugene

#16

3/20/2014 7:09 AM

Your latest findings re: SMA are in line with my observations. Not much to optimize here, considering typical tasks of the average WLP/D user.

Being "embarrassingly parallel", SMA/WMA is simple matter. Talking about RSI, AMA and other indicators that have dependency on previous values. Could you please share how did you overcome this obstacle (dependencies and race conditions) when parallelizing their code?

akuzn

#17

3/20/2014 7:11 AM

3.853.000 ticks.
I just have unloaded all programs and freed all memory i could.
StdDev: 1512 ms and 159 ms.
SMA 112 ms, 116 ms, 109 ms. 109 ms - is version of previous topic
KAMA 2378 ms and 140 ms.
VMA 1563 ms and 442 ms.

Parallelizing and code optimization are used together.

I agree that 3000.000 can be excessive - but it is just some trading days.
Normally for fast mean reverting systems not needed anymore.

But for example if i want to do some classification on dataset of 500 symbols or even more like k-means, correlation matrixes any portfoio optimizations fast computing is needed even for 50.000 data arrayes.

In real time serie computing may not affect speed execution if you compute on new value coming and adding last indicator value.
But if optimization is needed you have no way - have to find faster series computing.

I think parallel GA and EV coding will give additional 80-100% speed improvement.
I ll test it i think in 5 days.

SpeedTest.png

akuzn

#18

3/20/2014 7:21 AM

Nothing special - i compare it to VMA.Series and Volume as weight parameter
I have tested Weighted MA with and without parallelizing and decided to use parallelizing - it showed faster results.
I think because we have already 2 arrays wich can be loaded everywhere in memory so cpu switches context much often than with SMA, so parallelizing will let cpu better optimize computing.
I have only such explanation.

Size:

Color:

Nothing special - i compare it to VMA.Series and Volume as weight parameter
I have tested Weighted MA with and without parallelizing and decided to use parallelizing - it showed faster results.
I think because we have already 2 arrays wich can be loaded everywhere in memory so cpu switches context much often than with SMA, so parallelizing will let cpu better optimize computing.
I have only such explanation.
[code]
public static DataSeries Simple_WMA_Optim(DataSeries ds, DataSeries weight, int period)
        {
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            //
            DataSeries wma = new DataSeries(ds, "Simple_WMA_Optim(" + ds.Description + ", " + period.ToString() + ")");
            //

stopwatch.Stop();
            wma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";

return wma;
        }//End Simple_WMA_Optim[/code]

akuzn

#19

3/20/2014 7:27 AM

RSI:
It may differe relative to basic RSI, but RSI idea can have various versions.
In this example i could miss something or make it too smooth. If you critisize i would much appreciate).
To be honest i was interested more in adaptive window of RSI based on achived high's and low's, Volume weighted RSI's, their smoothed versions, and creating adaptive averages based on volume price movement relationship and didnt find final verision of Simple RSI.
So may be uncommenting will make this code closer to standard rsi graph. But as for speeding idea - stlil the same.

Size:

Color:

RSI:
It may differe relative to basic RSI, but RSI idea can have various versions.
In this example i could miss something or make it too smooth. If you critisize i would much appreciate).
To be honest i was interested more in adaptive window of RSI based on achived high's and low's, Volume weighted RSI's, their smoothed versions, and creating adaptive averages based on volume price movement relationship and didnt find final verision of Simple RSI. 
So may be uncommenting will make this code closer to standard rsi graph. But as for speeding idea - stlil the same.

[code]
public static DataSeries Simple_RSI(WealthScript ws, DataSeries ds, int period)
        {
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            //
            DataSeries simple_rsi = new DataSeries(ds, "SimpleRSI(" + ds.Description + "," + period.ToString() + ")");
            //
            if (ds.Count > period)
            {
                //DataSeries up_ds = new DataSeries(ws.Bars, "up");
                //DataSeries down_ds = new DataSeries(ws.Bars, "down");

var rangePartitioner = Partitioner.Create(period, ws.Bars.Count);
                Parallel.ForEach(rangePartitioner, (range, loopState) =>
                {
                    // Loop over each range element without a delegate invocation.
                    for (int bar = range.Item1; bar < range.Item2; bar++)
                    //for (int bar = period; bar < ws.Bars.Count; bar++)
                    {
                        int start_bar = bar - period + 1;
                        double U = 0;
                        double D = 0;
                        double cur_ds = ds[start_bar];
                        double prev_ds = ds[start_bar - 1];
                        for (int i = start_bar; i <= bar; i++)
                        {
                            cur_ds = ds[i];
                            double diff = cur_ds - prev_ds;// ds[i] - ds[i - 1];
                            prev_ds = cur_ds;

if (diff > 0)//ds[i] > ds[i - 1])
                            {
                                U += diff;// ds[i] - ds[i - 1];
                            }
                            if (diff < 0)//ds[i] < ds[i - 1])
                            {
                                D += -diff;// ds[i - 1] - ds[i];
                            }

//ws.PrintDebug(
                            //   "[" + bar.ToString() + "], i = " + i.ToString()
                            //   + ", up_change:" + up_change.ToString() + ", down_change:" + down_change.ToString()
                            //   );
                        }

//ws.PrintDebug(Environment.NewLine);

U = period != 0 ? (U / period) : 0d;
                        D = period != 0 ? (D / period) : 0d;

//up_ds[bar] = up_change;
                        //down_ds[bar] = down_change;
                        //double RS = down_change != 0 ? up_change / down_change : 1d;
                        //simple_rsi[bar] = 100d  - (100d  / (1d + RS));
                        if ((U + D) == 0)
                            simple_rsi[bar] = 100d;
                        else
                            simple_rsi[bar] = 100d * U / (U + D);

}

//ChartPane rsi_pane = ws.CreatePane(40, true, false);
                    //ws.PlotSeries(rsi_pane, up_ds, Color.Green, LineStyle.Solid, 2);
                    //ws.PlotSeries(rsi_pane, down_ds, Color.Red, LineStyle.Solid, 2);
                });

}
            
            stopwatch.Stop();
            simple_rsi.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";
            
            return simple_rsi;
        } //End simple RSI[/code]

Eugene

#20

3/20/2014 8:08 AM

Thank you very much for sharing. Hopefully this discussion will be useful to advanced users of Wealth-Lab.

Here's a direct comparison of built-in RSI vs. your faster RSI on 57K+ bars of intraday data...

Surprisingly, most of the time the figures are in the ballpark of 4ms (built-in RSI) vs. 6ms (your RSI). Tested with both debug and release builds of my DLL (x64, .NET 4.5.1). I've got a 6-core CPU. Not sure what I'm doing wrong.

standard-vs-fast.png

akuzn

#21

3/20/2014 8:33 AM

It becomes ... funny.
I ll retest 3 symbols...
I call methods from my dll, compiled in x64 Studio.
I switched there to 4.5.1 .Net Framework.
Code of SMA, WMA, RSI is compiled in dll and used through reference.

I think your result is based on small array.I would suggest you to use larger array. 57k+ is nothing to measure.
But if you ll try even to run optimization or even test on data set - result will be more signigicant.
With smaller arrayes may be some cPU background activities affecting.
With smaller arrayes i had unpredictable variations in measuring results.
Larger data array more fair comparison i suppose. And especially at the limit of RAM.

And you may try Amibroker on 1.000.000 element array. Ami s really fast.
I have read about one comparative backtesting of 1000 symbols in WealthLab and Ami. Ami won. Difference was significant.
But i cant work with their visual interface of middle ages.

When i was optimizing 3 months intraday data arrayes - about 55.000 of bars - diffrence between bult int and "optmized" indicators with parallel series computing gave me 2.5-4 times faster computing.
But interface was freezing sometimes.
I use 4 cores I7 cpu.

First symbol SRH4 has 657.808 ticks,
Second SIH4 2.138.804 ticks,
Third RIH4 3.833.410 ticks

More data, more cpu load the more fair result.
So i load 3 strategies and press 3 times run.
Interface is freezing and after that WealthLab gives results.

Here is my code in WealthLab:

3parallel_speed_retest.png

Size:

Color:

It becomes ... funny.
I ll retest 3 symbols...
I call methods from my dll, compiled in x64 Studio.
I switched there to 4.5.1 .Net Framework.
Code of SMA, WMA, RSI is compiled in dll and used through reference.

I think your result is based on small array.I would suggest you to use larger array. 57k+ is nothing to measure.
But if you ll try even to run optimization or even test on data set - result will be more signigicant.
With smaller arrayes may be some cPU background activities affecting.
With smaller arrayes i had unpredictable variations in measuring results.
Larger data array more fair comparison i suppose. And especially at the limit of RAM.

And you may try Amibroker on 1.000.000 element array. Ami s really fast.
I have read about one comparative backtesting of 1000 symbols in WealthLab and Ami. Ami won. Difference was significant.
But i cant work with their visual interface of middle ages.

When i was optimizing 3 months intraday data arrayes - about 55.000 of bars - diffrence between bult int and "optmized" indicators with parallel series computing gave me 2.5-4 times faster computing.
But interface was freezing sometimes.
I use 4 cores I7 cpu.

First symbol SRH4 has 657.808 ticks, 
Second SIH4 2.138.804 ticks,
Third RIH4 3.833.410 ticks

More data, more cpu load the more fair result.
So i load 3 strategies and press 3 times run.
Interface is freezing and after that WealthLab gives results.

Here is my code in WealthLab:
[code]
using System;
using System.Collections.Generic;
using System.Text;
using System.Drawing;
using WealthLab;
using WealthLab.Indicators;

using System.Diagnostics;
using Studies_Lib;
using Studies_Lib.Adaptive;

namespace WealthLab.Strategies
{
	public sealed class MyStrategy : WealthScript
	{
		StrategyParameter PERIOD;
		
		public MyStrategy()
		{
			PERIOD = CreateParameter("Period", 20, 20, 500, 1);
		}
		
		protected override void Execute()
		{
			int period = PERIOD.ValueInt;
			
			Stopwatch stopwatch = new Stopwatch();
			stopwatch.Start();
			DataSeries sma = SMA.Series(Close, period);
			stopwatch.Stop();
			sma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";
			
			stopwatch.Start();
			DataSeries std_dev = StdDev.Series(Close, period, WealthLab.Indicators.StdDevCalculation.Population);
			stopwatch.Stop();
			std_dev.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";
			
			stopwatch.Start();
			DataSeries kama = KAMA.Series(Close, period);
			stopwatch.Stop();
			kama.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";
			
			DataSeries sma_optim = Utilities.Simple_MA_Optim(Close, period);
			DataSeries sma_optim2 = Utilities.Simple_MA_Optim2(Close, period);
			DataSeries std_dev_optim = Utilities.StdDev_Optim(Close, period);
			DataSeries ama = Utilities.AMA(Close, period);
			
			stopwatch.Start();
			DataSeries vma = VMA.Series(Close, period);
			stopwatch.Stop();
			vma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";
			DataSeries wma_optim = Utilities.Simple_WMA_Optim(Close, Volume, period);
			
			
			stopwatch.Start();
			DataSeries rsi = RSI.Series(Close, period);
			stopwatch.Stop();
			rsi.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";
			DataSeries rsi_optim = Utilities.Simple_RSI(this, Close, period);
			
			ChartPane rsi_pane = CreatePane (40, true, false);
			
			ChartPane wma_pane = CreatePane (40, true, false);
			ChartPane ma_pane = CreatePane(40, true, false );
			ChartPane stddev_pane = CreatePane(40, true, false );
						
			PlotSeries ( ma_pane, sma, Color.Blue, WealthLab.LineStyle.Solid, 2 );
			PlotSeries ( ma_pane, sma_optim, Color.SpringGreen, WealthLab.LineStyle.Solid, 2 );
			PlotSeries ( ma_pane, sma_optim2, Color.DarkGreen, WealthLab.LineStyle.Solid, 2 );
			PlotSeries ( ma_pane, kama, Color.LightBlue, WealthLab.LineStyle.Solid, 2 );
			PlotSeries ( ma_pane, ama, Color.Cyan, WealthLab.LineStyle.Solid, 2 );
			
			PlotSeries (stddev_pane, std_dev, Color.Orange, WealthLab.LineStyle.Solid, 2 );
			PlotSeries (stddev_pane, std_dev_optim, Color.SpringGreen, WealthLab.LineStyle.Solid, 2 );		
			
			PlotSeries(wma_pane, vma, Color.Yellow, WealthLab.LineStyle.Solid, 2 );
			PlotSeries(wma_pane, wma_optim, Color.OrangeRed, WealthLab.LineStyle.Solid, 2 );
			
			
			PlotSeries(rsi_pane, rsi, Color.Red, WealthLab.LineStyle.Solid, 2 );
			PlotSeries(rsi_pane, rsi_optim, Color.DarkRed, WealthLab.LineStyle.Solid, 2 );
			
			
		}
	}
}
[/code]

[img]http://wl6.wealth-lab.com/handlers/DownloadHandler.ashx?id=4164&password=b292128ba97741e[/img]

akuzn

#22

3/20/2014 8:56 AM

To be fair i ve loaded only one test with small data array with 657.808 ticks,
Im 100% sure that no full memeory usage at least 200 mb is left.
So 100% just fair computing.

But wouldnt say 3 strategies with full load is unfair

SRH4SpeedTest.png

akuzn

#23

3/20/2014 10:09 AM

Just mid array test
SIH4 2.138.804 ticks. (USD/RUR Futures)

SIH4SpeedTest.png

akuzn

#24

3/20/2014 10:13 AM

And the largest serie
RIH4 3.833.410 ticks (Futures on INDEX)

Larger dataserie - difference increases

RIH4SpeedTest.png

akuzn

#25

3/20/2014 11:23 AM

To finish with testing - i have run many times tests.
I put here the most repeatable results. Improvement in percent is the same on smaller arrayes.
3.833.410 ticks

StdDev: 2182 ms and 176 ms
SMA: 119 ms(built in), 116 and 108 ms
KAMA 3064 ms and 138 ms
VMA 4284 ms and 406 ms
RSI 4543 ms and 367 ms.

If i use parallel series computing in tasks and run multiple strategies difference increases.
Under optimization increases more.

So it can be tested on DataSet with symbols of 20k to 50 k.
result will be significant.

Next step as i told parallel optimization methods and better usage of WPF model.

RIH4SpeedTest_improved.png

akuzn

#26

3/20/2014 2:24 PM

Eugene, could you test following code of SMA.
Interesting will you have any improvement? I ve done it to finish with sma optimization.
-1 if operator in loop and internal small cache with an idea that it will be faster than adress to DataSerie.
No more ideas. In my tests it gives 3-15% improvement.

Here the most interesting thing is:

If it could be possible to find where to place temporal cache faster than in array it will give some additinal improvement..
I suppose this array is placed in cache of method and i suppose the most time consuming is memory allocation here.

Size:

Color:

Eugene, could you test following code of SMA.
Interesting will you have any improvement? I ve done it to finish with sma optimization.
-1 if operator in loop and internal small cache with an idea that it will be faster than adress to DataSerie.
No more ideas. In my tests it gives 3-15% improvement.

[code]
public static DataSeries Simple_MA_Optim3(DataSeries ds, int period)
        {
           
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();

DataSeries sma = new DataSeries(ds, "Simple_MA_Optim3(" + ds.Description + ", " + period.ToString() + ")");
            if (period == 0) return sma;

//
            double sum = 0;
            int max_count = ds.Count;
            if (period > max_count)
            {
                for (int bar = 0; bar < max_count; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = bar != 0 ? sum / (bar + 1) : cur_ds;
                }
            }
            else
            {
                //
                double [] ds_cache = new double[period];
                int cache_index = 0;
                int prev_cache_index = 0;
                int max_index = period - 1;
                //
                for (int bar = 0; bar < period; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = sum / (bar + 1);
                    //
                    ds_cache[cache_index] = cur_ds;
                    cache_index++;
                }
                
                for (int bar = period; bar < max_count; bar++)
                {
                    double cur_ds = ds[bar];
                    //double period_ds = ds[bar - period];
                    sum += cur_ds;
                    sum -= ds_cache[prev_cache_index];// period_ds;// ds[bar - period];//
                    sma[bar] = sum / period;

//
                    cache_index++;
                    if (cache_index > max_index) cache_index = 0;
                    prev_cache_index++;
                    if (prev_cache_index > max_index) prev_cache_index = 0;
                    ds_cache[cache_index] = cur_ds;
                }
            }

stopwatch.Stop();
            sma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";

return sma;
        }//End Simple_MA_Optim3        [/code]

Here the most interesting thing is:
[code]
double [] ds_cache = new double[period];
[/code]
If it could be possible to find where to place temporal cache faster than in array it will give some additinal improvement..
I suppose this array is placed in cache of method and i suppose the most time consuming is memory allocation here.

Eugene

#27

3/20/2014 3:49 PM

Sorry Alexey, no visible improvement. :( I just tried out your latest SMA code and the results (averaged over several runs) are:

But this is twice as fast as if I'd take the Parallel.For* code from post #9.

akuzn

#28

3/20/2014 3:54 PM

Hmm...i have real improvement.
larger series - better result.

May be you have super processor and very fast memory and they dont need experiments.

akuzn

#29

3/21/2014 4:08 AM

To be honest i dont know what kind of sma code is built in. I suppose it should be naive version or similar with above but with ds.count left.
Code sujjested above should be best modification of moving window version.
I have tried this "internal method cache" double[] ds_cache in other dataseries and it gave additional speed improvement for 3 indicators.

I have really only one explanation: 57k is really fully loaded in cpu cache. And 1.500.000 may be too.
That explains why parallelizing is not giving anything. Much larger series are needed.
Version of double [] ds_cache is little bit behind. But larger series should prove it.
It will give more only if there are many memory calls for example volume weighted or other idicator wich may adress to different memory blocks and CPU and compilator wont optimize (predict) them.
Anyway interesting experiment and it proves again that Microsoft C# is well optmized for scientifical computing on modern CPUs.

May be you could test with 3.000.000 or even 5.000.000 bars?
And if not a secret could you share your CPU description: processor, bus, memory.
I m really thinking about cloud tests or I7 + CUDA. Theoretically would prefer I7 not Xeon + CUDA with new type of shared memory NVIDIA Quadra or something similar.

Eugene

#30

3/21/2014 6:43 AM

Agreed. Like your i7, my 6-core Phenom II X6 is also equipped with 6Mb cache.

Since this would present only a theoretical interest, I prefer not to conduct further tests. Only a small fraction of our customers might be interested in backtesting on symbols with such enormous bar count. On a related note, parallelizing MS123 Visualizers offered an instant, perceptible performance boost regardless of data amount in a backtest. I'll mark to give your code a go on an older machine, though -- maybe there I'll be able to see a difference.

UPDATE: no difference noticed on the older PC

akuzn

#31

3/21/2014 10:17 AM

I ve done all tests on my notebook. It s powerfull enough but certainly not as good as desktop.
I have one ineffective 600mb excel file wich computes intraday bank balance but it represents super test.
No one notebook i ve tested didnt show better results than mine. Even gamer's notebooks show same results.
Just what i have not tested - last version of HP workstation with 32 gb.
And i think many people use notebooks.
So will much appreciate if you show me sma code. May be some measuring affects results.
I think would be more fair to compare in the same conditions.
I still cant believe that above SMA code based on moving window algorithm is loosing.

But in general seems everything is clear.
Thank you very much for participation))

akuzn

#32

3/24/2014 4:23 AM

Eugene, may i ask you for the last fair test. I still cant believe your results.
Main idea is as follows. There is known fact: operation of multiplication is much faster than dividing (4-8 times).
I would like to suggest you code wich gives me 8-30% of speed improvement with SMA.
Operation of dividing of sum by period is replaced by multiplacation ratio computed earlier as 1d/ period.
So if code below wont be faster than standard something is wrong with time measuring in your previous tests.

Size:

Color:

Eugene, may i ask you for the last fair test. I still cant believe your results.
Main idea is as follows. There is known fact: operation of  multiplication is much faster than dividing (4-8 times).
I would like to suggest you code wich gives me 8-30% of speed improvement with SMA.
Operation of dividing of sum by period is replaced by multiplacation ratio computed earlier as 1d/ period.
So if code below wont be faster than standard something is wrong with time measuring in your previous tests.

[code]
public static DataSeries Simple_MA_Optim3(DataSeries ds, int period)
        {           
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();

DataSeries sma = new DataSeries(ds, "Simple_MA_Optim3(" + ds.Description + ", " + period.ToString() + ")");
            if (period == 0) return sma;

//
            double sum = 0;
            int max_count = ds.Count;
            if (period > max_count)
            {
                for (int bar = 0; bar < max_count; bar++)
                {
                    double cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = sum / (bar + 1);
                }
            }
            else
            {
                //
                double [] ds_cache = new double[period];
                int cache_index = 0;
                int prev_cache_index = 0;
                int max_index = period - 1;
                //
                double cur_ds = 0;
                //
                for (int bar = 0; bar < period; bar++)
                {
                    cur_ds = ds[bar];
                    sum += cur_ds;
                    sma[bar] = sum / (bar + 1);
                    //
                    ds_cache[cache_index] = cur_ds;
                    cache_index++;
                }

double period_mul = 1d / period;
                for (int bar = period; bar < max_count; bar++)
                {
                    cur_ds = ds[bar];

sum += cur_ds;
                    sum -= ds_cache[prev_cache_index];
                    sma[bar] = sum * period_mul ;/// period;//

//
                    cache_index++;
                    if (cache_index > max_index) cache_index = 0;

prev_cache_index++;
                    if (prev_cache_index > max_index) prev_cache_index = 0;
                    ds_cache[cache_index] = cur_ds;
                }
            }

stopwatch.Stop();
            sma.Description += " Time:" + stopwatch.ElapsedMilliseconds.ToString() + " (ms)";

return sma;
        }//End Simple_MA_Optim3        
[/code]

Eugene

#33

3/24/2014 8:21 AM

Alexey,

I tried out this version and averaged over 3 runs, it is faster than the standard SMA by 38% (on Release build). After testing, I plan to switch to this version in Community/TASC Indicators to speed up all indicators that depend on SMA. Great job and thank you!

Eugene

#34

3/27/2014 6:48 AM

New round of discoveries. Take the code from your initial post (or my post #4), create a formal DataSeries and pack it into a DLL (build configuration: Release). The same code starts working 2x-3x slower than it's working in the WealthScript Strategy i.e. even slower than the built-in AveragePrice series. Go figure.

akuzn

#35

3/29/2014 7:01 AM

Sounds great.
At least now noone could say that i wasnt right.
I ve seen some books like C# for financial markets, for scientific computations etc and was really surprised by highly ineffective code suggested sometimes by "professors":)
So i am very pleased that you have confirmed my results.
---
All tests i ve done are done by calls from dll builded code.
Even ininital comparison of averaging is realised in dll.
And i usually use WealthLab to vizualize results and optimize strategies parameters but everything is in dll.
I think it is not only my approach. It is normal way to manage and research.
---
Why i decided to optimize average price - because under debugger i ve seen 4 unneded loops of summation and dividing: First loop: O+H, second: previous result + L, ..+C and after that loop of dividing by 4.
So parallelizing loop of averaging of 4 values coming from different parts of RAM was looking great for it.
This beahaviour doesnt depend on WealthLab i think - or may be depends on DataSeries operators of summation/multiplication etc.
I supposed that it was too difficult for me to override these operators and would be easier to write some new methods.
As i told everything is under dll file: strategies, indicators etc.
May be there is difference in compilator behavior in code optimization between Visual Studio and WealthLab?
Only this explanation. And in addition how DataSeries are loaded in memory. How are they fragmented and aligned. May be you have inserted 32 gb chip? How memory was fragmented? For example one load of 1.500.000 bars, another 200.000 after first strategy is closed and opened new with 1.800.000 bars....Just general questions.

But I use these results for another tests under my WPF trading platform where i ve improved speed again) : i need to work with order book and trading data coming from different markets: spot, futrures, options on stocks, currencies, fixed income. So any kind of freezing interface or weak computing is affecting comfort of making decisions. Seems i have avoided it.
Btw no SQL, ADO, EF etc .
But logically i cant understand why packing dataseries in dll should be slower?
Could you show your code? May be i have not understand something?
---
Btw i ve finished with database files and working now on high speed optimization.
Will try to test GA in WealthLab with my implementation of Evolutionary optimization methods, GA and particles swarm optimization.)
There are some ways to optimize and to parallelize sorting or simply use many threads.
This will be really interesting. I ve read some articles concerning matlab - it doesnt give best results. There are some critics and ways to improve.

---

As for my opinion if you rewrite series methods in sequential mode you ll get more improvement. In 1000.000 series i have more than 200x improvement with StdDev for example.
If you will use some parallelizing classes you ll get more improvement.
If you switch off graphics GUI will freeze less.
There is known fact that dynamic arrayes (and dynamic collections in general) in C# are hyper fast and difficult to improve their speed even using native code (you may find super Richter articles on MSDN) but my own List<T> or arrayes give me better result than DataSeries in WelathLab. Loading speed of large dataseries could be faster too. But there is always balance between how comfortable to manage - databases are always slower than direct file access.
May be it is possible to use async / await operators with Dispatcher.BeginInvoke to let GUI be more responsive but it is not already user's question. All modern applications use only this technic.
And not sure how graphics are realised in general - is there drawing in panel and this panel is srolled? I suppose yes. Much better would be drawing of only visible Window.
It is possible to wait for all these delayes - computers are fast now, but what i really dont like - trade bar. If i understand right it will never be bar+1, but if we want to use all optimizations tools we have to use these technics. But you cant get acces to GA from code, Btw GA is not multithreaded and provides me errors from time to time in errors panel.
But i am very gratefull to WLB team for evolutionary algorithm and wfo implementation ideas and still using as general purpose application mostly for visualization of some ideas.
I can repost it if needed.

Size:

Color:

Sounds great. 
At least now noone could say that i wasnt right.
I ve seen some books like C# for financial markets, for scientific computations etc and was really surprised by highly ineffective code suggested sometimes by "professors":) 
So i am very pleased that you have confirmed my results.
---
All tests i ve done are done by calls from dll builded code.
Even ininital comparison of averaging is realised in dll.
And i usually use WealthLab to vizualize results and optimize strategies parameters but everything is in dll.
I think it is not only my approach. It is normal way to manage and research.
---
Why i decided to optimize average price - because under debugger i ve seen 4 unneded loops of summation and dividing: First loop: O+H, second: previous result + L, ..+C and after that loop of dividing by 4. 
So parallelizing loop of averaging of 4 values coming from different parts of RAM was looking great for it.
This beahaviour doesnt depend on WealthLab i think - or may be depends on DataSeries operators of summation/multiplication etc.
I supposed that it was too difficult for me to override these operators and would be easier to write some new methods.
As i told everything is under dll file: strategies, indicators etc.
May be there is difference in compilator behavior in code optimization between Visual Studio and WealthLab?
Only this explanation. And in addition how DataSeries are loaded in memory. How are they fragmented and aligned. May be you have inserted 32 gb chip? How memory was fragmented? For example one load of 1.500.000 bars, another 200.000 after first strategy is closed and opened new with 1.800.000 bars....Just general questions.

But I use these results for another tests under my WPF trading platform where i ve improved speed again) : i need to work with order book and trading data coming from different markets: spot, futrures, options on stocks, currencies, fixed income. So any kind of freezing interface or weak computing is affecting comfort of making decisions. Seems i have avoided it.
Btw no SQL, ADO, EF etc .
But logically i cant understand why packing dataseries in dll should be slower?
Could you show your code? May be i have not understand something?
---
Btw i ve finished with database files and working now on high speed optimization.
Will try to test GA in WealthLab with my implementation of Evolutionary optimization methods, GA and particles swarm optimization.)
There are some ways to optimize and to parallelize sorting or simply use many threads.
This will be really interesting. I ve read some articles concerning matlab - it doesnt give best results. There are some critics and ways to improve.

---

As for my opinion if you rewrite series methods in sequential mode you ll get more improvement. In 1000.000 series i have more than 200x improvement with StdDev for example.
If you will use some parallelizing classes you ll get more improvement.
If you switch off graphics GUI will freeze less.
There is known fact that dynamic arrayes (and dynamic collections in general) in C# are hyper fast and difficult to improve their speed even using native code (you may find super Richter articles on MSDN) but my own List<T> or arrayes give me better result than DataSeries in WelathLab. Loading speed of large dataseries could be faster too. But there is always balance between how comfortable to manage - databases are always slower than direct file access.
May be it is possible to use async / await operators with Dispatcher.BeginInvoke to let GUI be more responsive but it is not already user's question. All modern applications use only this technic.
And not sure how graphics are realised in general - is there drawing in panel and this panel is srolled? I suppose yes. Much better would be drawing of only visible Window.
It is possible to wait for all these delayes - computers are fast now, but what i really dont like - trade bar. If i understand right it will never be bar+1, but if we want to use all optimizations tools we have to use these technics. But you cant get acces to GA from code, Btw GA is not multithreaded and provides me errors from time to time in errors panel.
But i am very gratefull to WLB team for evolutionary algorithm and wfo implementation ideas and still using as general purpose application mostly for visualization of some ideas.
I can repost it if needed.

Eugene

#36

10/5/2014 9:46 AM

FWIW

During preparation for new Wealth-Lab release 6.8, it's been determined that parallelizing does NOT offer speed improvement. On the contrary, it demonstrated considerably worse results.

Wealth-Lab® Links

WealthSignals Links

Disclaimer Policy

Follow Us

A FREE account will allow you access to our knowledge-base resources, customer support, WealthSignals services, and a trial version of our software Wealth Lab

Almost done.