Any plans for multicore support?
Author: SunriseMan
Creation Date: 4/11/2011 7:57 PM

SunriseMan

#1
Are there any plans to add multicore support to WL? Almost every computer being sold (even netbooks) is now multicore, so I'm sure this would be popular. I'd really like to see the speed-up that could occur with backtesting and optimization if all four of my cores were utilized.

I saw a question of whether this would be included in WL6 (obviously not), but I haven't seen addressed whether it's on the roadmap for a future release.

Eugene

#2
Let's say it's not going to appear in version 6.2; its highlight will be multi-system backtesting (a new type of strategy called Combination Strategy).

Everyone interested, please consider calling your Fidelity rep and telling them you need multi-core support in WL6.

joannakim

#3
Please excuse my ignorance, but what would multi-core support do for WLP? I have a quad-core Intel chip in my PC. I would love to know more about how multi-core operation would enhance WLP's performance. Thanks

Eugene

#4
To quote the topic starter:
QUOTE:
"I'd really like to see the speed-up that could occur with backtesting and optimization if all four of my cores were utilized."

dan_rozenberg

#5
Would WL really benefit from multi-core support (is that the same as multi-threading)? Aren't most of WL's calculations done in a linear fashion?

Eugene

#6
Right now WL 6.2 doesn't utilize all of a CPU's multiple cores. Consequently, the typical CPU load of a heavy backtest on a quad core is 25% vs. 100% with multi-core support. Add to that parallel optimizations, or parallelizing a single optimization run.

hlh

#7
In the good old WL4 days one could copy WL.exe to WL1.exe, WL2.exe, ... to run 4 instances of WL on a quad-core CPU and split long optimization runs over the 4 processes. Can I do the same with WL6 (6.2 Developer, 64-bit)? I thought I read somewhere here in the forum or the Wiki that starting WL as another user gives you another instance. That did not work out for me (I tried starting the very same exe). Is there a way to do that?

Eugene

#8
Yes you can. Cumbersome, but it should work:

''Workspaces'' provide a quick and convenient way to do what might otherwise require multiple instances. Still, it's possible to run several copies of Wealth-Lab 6 under different Windows user names.

dan_rozenberg

#9
Eugene, are there any plans to add multi-core support to Exhaustive backtesting in the next version (6.3)?

The backtester could run each run of the optimization on a different core, if it was coded right, correct?

Eugene

#10
QUOTE:
The backtester could run each run of the optimization on a different core, if it was coded right, correct?

It is coded right. Only .NET 4.0 brings true parallelism (well, .NET 3.5 with Parallel Extensions, to be precise), but we're on .NET 2.0.

Furthermore, the existing Optimizer API is not sufficient to create your own optimizers that support parallel optimization (been there, done that), unless you go out of your way (but please don't ask me).

dan_rozenberg

#11
Ok, understood. Thanks!

hlh

#12
As long as WL still does not use multiple cores, I wonder whether hyper-threading on an Intel i7-2600K, for example, would actually be slower than with HT turned off, or than simply using an Intel i5-2500K, which does not support HT at all.

As I need to upgrade to 16GB of RAM to hopefully increase WL's capability to handle intraday data over some symbols and years, I am thinking of upgrading the mainboard and CPU of the development computer as well. I intend to go for the i7-2600K (and not the i5-2500K) anyway, but then I realized that HT is also part of the difference between those CPUs, and the question popped into my mind whether HT wouldn't actually slow down single-core apps like WL.

Is this Turbo Boost feature something which would/could speed up a single process? Is it recommended to use or turn off HT for WL?

And: please make WL support multicore a.s.a.p. From my side, and from the lot of other traders and WL users I am working with, this would be one of the two most important issues! The other one is stability when running over more data! Way more important than some fancy features for the time being. Thanks a lot!

abegy

#13
Agree with you! This must be one of the top priorities.

Eugene

#14
What does the phrase "make WL support multicore" really mean to you?

If it means "magically parallelize my Strategy's calculations" then it's not realistic to expect.

Besides the obvious multi-core optimization enhancement, which would speed things up, in which tool do you expect to get a boost from parallel extensions, if at some point they appeared by virtue of .NET 4.0?

jalalfeghhi1

#15
Eugene,
I have a 12-core processor and am getting about 8% CPU utilization. I have toyed with the idea of writing my own optimizer which can leverage the new multi-core designs. It would also solve another problem that I have: because I develop code in C#, the built-in optimizer does not allow me to remove some of the parameters to speed up the process. This is very annoying and requires me to change my code.

One issue that I see is the built-in 1-parameter and 2-parameter graphs, which are really nice. While I can develop the calculations rather quickly, I have no idea how to re-create the built-in graphs that work with sliders.

Do you have any suggestions for me? Is there any way I can gain access to the built-in graphics capabilities? Any other ideas?

-thanks, J

Eugene

#16
J,

The built-in graphic capabilities (i.e. the part of the Exhaustive optimizer that does the charting) are proprietary Fidelity code (closed source). Consequently, anyone wishing to recreate them in their own optimizer would have to do it from scratch. You might want to raise this question in a different thread, as it's getting too specific for this general one.

Cone

#17
If I'm not mistaken, Wealth-Lab uses TeeChart for those graphs. Price charting, on the other hand, comes from proprietary Wealth-Lab code.

Eugene

#18
Right, but the TeeChart controls are licensed for use in Wealth-Lab and therefore won't work in 3rd party assemblies right away (without purchasing a license).

Eugene

#19
I think that "add multi-core CPU support" sounds too generic. Based on user demand, it would be a milestone if the product offered exhaustive optimization boosted by multi-core support. From any practical standpoint, though, updating the .NET framework version to 4.0 first is a prerequisite, because it brings support for parallelizing tasks. I'd vote for these 2 items, .NET4 and parallelized optimization, for 6.4 or 6.5.

jalalfeghhi1

#20
Eugene, Cone,
Thanks much for your feedback. I did a bit of research and I agree that WLP support of .NET 4.0 is essential to implementing parallel exhaustive optimization. I will give Fidelity a call to push it from our side.

-Best, J

festipower

#21
Hello to all.

I have been working for a long time on multithreaded applications using the .NET platform, and I can say that .NET 4.0 isn't required at all to parallelize Wealth-Lab backtests and optimizations. It could make it easier (or not, depending on the approach), but it isn't necessary at all.

On the other hand, I think the best strategy to accelerate execution would be multi-level parallelization:

------1.-Execution of the strategy on each symbol of a Data Set in parallel (strategy-level parallelization):
This parallelization seems a priori a good candidate to be done using multiple threads, with semaphores or other structures to synchronize access to the data structures shared by all threads (the Data Set?). With proper design, I suspect that the level of parallelization could be very high. The solution does not seem excessively complicated (see the sketch at the end of this post):
A)-A thread reads the Data Set.
B)-As it reads each symbol, it launches another thread executing the strategy for that symbol.
C)-When all the threads have finished executing the strategy, another thread is responsible for conducting the 'position sizing' tasks (this cannot be easily parallelized).

There should be a way (perhaps an overridable property on the WealthScript class) to enable or disable the parallelization of a strategy, as there may be cases where the strategy should be executed in the traditional mode (sequentially).

------2.-Execution of the backtests composing an optimization in parallel (optimization-level parallelization):
This parallelization may be implemented using Background Workers (or .NET 4 Tasks) very easily, in my opinion.

The best type of parallelization would, in my view, be type 1 (strategy-level parallelization), especially for backtests on Data Sets with a large number of symbols. In addition, this parallelization would also directly benefit the optimizers.
Type 2 parallelization would be good especially for optimizing strategies that use a single symbol or Data Sets with few symbols.

Ideally both parallelizations should be implemented in Wealth-Lab.
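
A minimal sketch of the type 1 idea, kept to .NET 2.0-era constructs since no newer framework is required. ExecuteStrategy and SizePositions are hypothetical placeholders, not real Wealth-Lab API calls:

using System;
using System.Collections.Generic;
using System.Threading;

static class ParallelBacktestSketch
{
    public static void Run(IList<string> symbols)
    {
        if (symbols.Count == 0) return;
        int pending = symbols.Count;
        using (ManualResetEvent allDone = new ManualResetEvent(false))
        {
            foreach (string symbol in symbols)
            {
                string s = symbol; // capture a copy for the anonymous delegate
                ThreadPool.QueueUserWorkItem(delegate
                {
                    ExecuteStrategy(s); // must touch only per-symbol state
                    if (Interlocked.Decrement(ref pending) == 0)
                        allDone.Set();
                });
            }
            allDone.WaitOne(); // block until every symbol has been processed
        }
        SizePositions(); // position sizing stays sequential, as noted in C)
    }

    static void ExecuteStrategy(string symbol) { /* run the strategy for one symbol */ }
    static void SizePositions() { /* apply position sizing across all trades */ }
}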


Eugene

#22
Carlos,

Thank you for your suggestions. We will forward them to the developers.

Re: .NET4. Using the .NET4 features rather than dealing with the thread pool and locks, the resulting code would be more efficient. And upgrading to .NET 4.0 would bring more than just the new parallel programming features: we would be able to utilize .NET4 assemblies in WL6.x and benefit from the enhanced garbage collection, to name a few.

festipower

#23
Eugene,

Yes, you are totally right about .NET4. I think the same.

I just wanted to say that .NET4 isn't required in order to make the program multithreaded.

The problem that Wealth-Lab solves when backtesting and optimizing, and the way it achieves the solution, is easy to parallelize regardless of the .NET version used. The time spent executing those tasks would be greatly reduced on modern multicore hardware.

Regards.

dan_rozenberg

#24
Hi Eugene,

Is there an ETA for version 6.3?

Eugene

#25
Hi Dan,

The features we're talking about here (the topic says "multicore support") will not make it into version 6.3.

Cone

#26
ETA next week, unless something changes.

However, while speaking of changes, the Strategy Monitor enhancement (with respect to Fidelity data) had to be delayed until 6.4. Consequently, 6.3 is essentially a maintenance-only release, i.e., bug fixes.

skalman99

#27
How to utilize several cores of the CPU when doing exhaustive optimization:
----------------------------------------------------------------------

1. Put several copies of the (almost) same strategy in the same C# file. They need to have separate GUIDs and names:

namespace Strategies
{
public class Strategy1 : WealthScript ...

public class Strategy2 : WealthScript ...

}

2. Assume you need to test StrategyParameter strapParam1 for values 1 to 100. In Strategy1 include the line
strapParam1 = CreateParameter("param1", 1, 1, 50, 1);
In Strategy2 use this line instead:
strapParam1 = CreateParameter("param1", 51, 51, 100, 1);

3. Now open 2 separate workspaces. In the first workspace open Strategy1, in the second open Strategy2. Click Optimize in each
of the workspaces. Voila, CPU utilization will be twice as big. (This can of course be repeated for the number of cores you have.)

Regards Jon Brewer

Eugene

#28
Alternative ways:

A. Launch two (three...) WL instances with the same strategy under different Windows accounts (usernames). Tweak the Strategy Parameters so they don't overlap.

B. Use Genetic Optimizer.

dan_rozenberg

#29
Now that 6.3 is out... what are our chances for multicore support in 6.4? This is the one thing I pray for every night before I go to bed!!

Cone

#30
It's not in the cards for 6.4, but at least 6.4 should be built on .NET 4.0. Baby steps...

hlh

#31
Reading that garbage collection would work better in .NET4, it would be worth stopping everything else at Fidelity (including their core brokerage business) until it is done in WL.

Multicore usage is a NO BRAINER and A MUST for software whose core functionality is to loop through a bunch of data (series) and do mathematical operations. Whoever has not used the Optimizer extensively has not developed or back-tested a trading strategy. So even if WL would only allow starting n optimizations on an n-core machine without this WL-instance trick (which I did not manage to get to work), this would be a huge step in the right (multi-core) direction.

All the fancy stuff is nice, but first of all a stable and fast backtesting engine (which, from time to time, gives back some of the enormous gigabytes of RAM it consumes) is key (for me at least). Being able to use .NET4 stuff would be very cool too (but let us not forget that some bugs need to be fixed as well).

P.S.: I once asked but got no answer so far: on some of these Intel CPUs there is this Hyper-Threading (or whatever it's called). For the ol' single-core WL versions 6.3 and 6.4, is it - theoretically and/or practically - better to switch that off, so that WL would not use only half of a physical core, making it even slower (or does Windows use the full core if required anyway, even if HT is on)? Thx!

Eugene

#32
Well, then our definitions of "no brainer" are drastically different. It's not a no-brainer to add multi-core support to a mature application of this scale. What may be natural these days with the advent of the Task Parallel Library in .NET 4.0 was nowhere near as easy during Wealth-Lab's development in 2007, when .NET 2.0 was pretty much new.

Furthermore, don't forget 1) that multi-core CPUs were not installed in every PC in 2007-2008, and 2) that C# Strategies by themselves offered a 2-10x speed boost compared to slow, interpreted ChartScripts.

20/20 hindsight is always easy. ;)

Eugene

#33
P.S. Having multi-core CPU support would be a blessing, but GPGPU support might even surpass it speed-wise...

hlh

#34
I do not argue with the history (and in hindsight I am always good; I think I have that in common with all the financial news which, after the close, can always tell us why it went up, down or sideways today).

I am speaking about the present, today, where I let freeware run on my PC using cores I wasn't even aware I had and, as you mentioned, also use my video card to do crazy fast calculations.

Even my cell phone, once invented for making calls (not sure if it still can do that), nowadays has a quad-core CPU to multitask flawlessly between Angry Birds and the Facebook app. So I am looking forward to WL catching up ;-)

And still, anyone's opinion on that Hyper-Threading question is very much appreciated.

Thanks!

HendersonTrader

#35
Regarding hyper-threading technology:
1) A Pentium 4 (e.g. one physical core, two logical cores) with Windows XP and hyper-threading enabled in the BIOS was slower for W/L 5.x.
2) A current i7 (e.g. four physical cores, eight logical cores) with Windows 7 runs the current W/L release in the same elapsed time whether HT is enabled or disabled in the BIOS.
The current Intel implementation of hyper-threading comes very close to eliminating the performance hit. Current HT does not dispatch the second logical core on a physical core while W/L is utilizing a high percentage of that physical core.

dirkp

#36
Hi guys,

it's been quiet in this thread. I just wanted to get an update on the multicore support issue. WL 6.4 has recently been released. Will the next versions 6.5/6.6 have multicore support?

Thanks!!

Eugene

#37
Wealth-Lab's Strategy backtesting is fairly fast in its present state, and honestly, I do not think it requires multicore support.

The natural target for enhancement by parallel execution (i.e. multicore support) is the Optimization tool, especially since 6.4 is based on .NET 4 and can utilize Tasks. Unfortunately, the Optimizer API design in Version 6 is absolutely not suitable for developing 3rd party multicore-enabled optimizers.

If only we were able to get the Optimizer API enhanced, that could move us forward. I've been trying to lobby this point of view for a long time, but unfortunately, I don't have any good news to tell you at this point.

dan_rozenberg

#38
Eugene, whom should I contact to lobby for multi-core optimization as well?

Cone

#39
We (MS123) are your Wealth-Lab Developer contact. Fidelity Wealth-Lab Pro customers should call their reps. Ultimately, changes to the thick client (Wealth-Lab) are business decisions made by Fidelity; they can be demand-driven, but they compete for the same resources allocated by other planning. It's a balancing act.

dirkp

#40
Thanks for the update! My question concerned the optimization tool, as this can take a long time. Anyway, hopefully you can lobby this for us, Eugene. Good luck!!

kribel

#41
Hello Eugene, Cone,

It has been a year since the last update in this thread. What is the current state? Could you please give us an update?

I found this website: http://www2.wealth-lab.com/WL5Wiki/Print.aspx?Page=OpenIssues

As I can see, there is the following topic:
(98348) Optimizer API not sufficient to create parallelized, multi-threaded optimizers. Leverage benefits of multi-core CPUs in Optimizer.

And it is bold? What does it mean? What does bold stand for? Does it mean that it is going to be implemented in the next release?

Cheers,
Konstantin

PS:
Where can I find release notes for each WealthLab release?

Eugene

#42
QUOTE:
It has been a year since the last update in this thread. What is the current state? Could you please give us an update?

No, not at this time. Wealth-Lab modifications are in the hands of Fidelity, with regard to their business interests and business planning. But we made sure they're aware of how high the demand is for speedy, parallelized optimizations.

QUOTE:
Where can I find release notes for each WealthLab release?

We don't keep them anymore, if you mean the Change History. For what's new in the latest build, you can always look in the User Guide > What's new.

QUOTE:
And it is bold? What does it mean? What stand bold for? Does it mean that it is going to be implemented in the next release?

The list contains high-priority bugs along with deferred low-priority ones, in no particular order. Bold does not mean anything in particular, except that we may consider those to be issues to focus on in the first place.

kribel

#43
Hello Eugene,

many thanks for your quick reply! Here are a few further questions.

QUOTE:
No, not at this time. Wealth-Lab modifications are in the hands of Fidelity, with regard to their business interests and business planning. But we made sure they're aware how high is demand for speedy, parallelized optimizations.


How can we reach Fidelity to find out more about the implementation status? Would it be helpful if the users asking for multi-core support signed a petition and we presented it to Fidelity management?

QUOTE:
For what's new in latest build, you can always see it in the User Guide > What's new.

Can I also see it before I install it?

Many thanks,
Konstantin

Eugene

#44
1 - No / No.
2 - Although the complete change log is only available after installation, we always put a brief "what's new" note on the "Home Page" tool in Wealth-Lab Pro/Developer whenever a newer build becomes available.

sourkraut

#45
Here's a look at the other side:

Just how much improvement could we expect from multi-core support?

Do you remember the slowest element from computers, the I/O (disks, keyboard, screen)?
CPUs have only a single data bus and address bus. Both buses are used simultaneously, by a one core at a time. Usually they will tie up the busses for several CPU cycles at a time. During that time, the other cores must wait for their turn.

So if a CPU process takes one cycle to complete, but the core requires three cycles to load the data, and after its one cycle process, must wait for three to ten cycles to write the results back to RAM, this one-cycle process might have taken 14 cycles.

Sure, in the mean time, the other cores did something similar. But supposing the same scenario for each core, it might take 50+ cycles to perform four similar operations in parallel, while a single core processor might complete them in only 28 cycles (3 in, 1 inside, 3 out each operation).
This is of course all speculation, and there are several ways to truely improve speed, but do not expect a fourfold increas from a four core processor.

Perhaps if you have a multi-bus (super) computer you could see such improvements. Where each core has its own data and address bus, you only have to worry about your particular RAM location being in use by another CPU. But that would be one heck of an expensive machine.


I too would like to see speed improvements. However, much more I would like to see the bugs corrected that make WL a crash-prone goonybird. True, the editor is much improved, but new sources of crashes are appearing faster than fixes for old ones (Data Manager, Extension Manager).

To me, fixes for the crash problems are much more important than the promise of questionable results from multi-core support.


Eb

kribel

#46
@sourkraut:

As you already said:
QUOTE:
This is of course all speculation...


Speculations remain speculations until they get tested. Therefore I do not see any point in looking for excuses to postpone this improvement.

Cheers,
Konstantin

Cone

#47
sourkraut doesn't work for Fidelity or MS123, so it wasn't an excuse. Multi-core support isn't even on the table right now. Sorry to disappoint, but WFO (walk-forward optimization) is next.

kribel

#48
Hi Cone,

I am aware of that.

QUOTE:
Wealth-Lab modifications are in the hands of Fidelity, with regard to their business interests and business planning. But we made sure they're aware how high is demand for speedy, parallelized optimizations.


Therefore I am not losing hope. ;)

JDardon

#49
So it's been a while now and WFO is already a reality. Eugene, any news on plans for multi-core optimization?

Cone

#50
There are no current plans.

Eugene

#51
However, at MS123 we are investigating the idea of parallelizing the code of Community Indicators (where possible), and maybe some of the most-used indicators (like SMA) on which the existing code of the library depends. If it works out, existing Strategy code that uses them may run faster in the Optimizer if the data is big enough. Currently we're making no promises that this will be done, and no ETA exists.

You might want to call Fidelity and let them know that yet another customer wants faster multi-core optimizations, though.
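
To give a rough, hypothetical illustration of per-bar indicator parallelization with the .NET 4 Task Parallel Library: since each SMA output bar only reads its input window, the bars can be computed independently. Plain arrays stand in for Wealth-Lab's DataSeries here; this is a sketch, not library code:

using System.Threading.Tasks;

static double[] ParallelSma(double[] close, int period)
{
    double[] sma = new double[close.Length];
    Parallel.For(period - 1, close.Length, bar =>
    {
        double sum = 0.0;
        for (int i = bar - period + 1; i <= bar; i++)
            sum += close[i];          // sum the window ending at 'bar'
        sma[bar] = sum / period;      // no shared writes between iterations
    });
    return sma;
}

Note the trade-off: a sequential rolling sum does less total work per bar, so the parallel version only wins when the series is large, which matches the "if the data is big enough" caveat above.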

JDardon

#52
Sigh... OK, I will call Fidelity and push them a little. I remember seeing some time ago a page with development requests on which customers could cast a limited number of votes. Does that still exist? That would really help make the point to Fidelity that SO MANY people require multi-core capabilities in the optimization engine (which just about every other trading tool on the market already provides).

Is there any existing guideline on how to take advantage of parallelization in one's own indicators (such as the ones you will implement later this year)? We should be able to optimize our own code already with such a guideline.


Eugene

#53
QUOTE:
Is there any existing guideline on how to take advantage of parallelization in one's own indicators (such as the ones you will implement later this year)? We should be able to optimize our own code already with such a guideline.

Indicator speed improvement through parallelizing

superticker

#54
QUOTE:
[by festipower:] ... best strategy to accelerate ... [WL] would be a multi-level parallelization:
----1) Execution of the strategy on each symbol of a Data Set in parallel (strategy-level parallelization):
This parallelization seems a priori a good candidate to be done using multiple threads, with semaphores or other structures to synchronize access to data structures shared by all threads (Data Sets?). With proper design, I suspect that the level of parallelization could be very high. The solution does not seem excessively complicated:
A)-A thread reads the Data Set.
B)-As it reads each symbol, it launches another thread executing the strategy for that symbol.
C)-When all the threads have finished executing the strategy, another thread is responsible for conducting the "position sizing" tasks (This cannot be easily parallelized).

I totally agree with festipower's comments, with the exception of the semaphore part. Record locking is a database problem, not an application problem. If WL must control data access, it should do so through database record locking, not "directly" using semaphores. (The use of semaphores is for the database to do.)

But what makes WL fast today is its simplicity. If you add database record locking to the mix, you'll take a major speed hit because you're tying the hands of the Windows scheduler. That is a very bad idea in a simulation program, which must command minimal context-switch overhead.

Moreover, most strategies only need write access to the cache entries they create for a given symbol, not across all symbols. So if the WL DataSeries cache incorporated a hash key that includes the symbol name (which I think it already does), that would be protection enough against write conflicts with other threads/tasks crunching different symbols for that particular strategy. The exception would be a strategy writing to an external symbol (or cache entry), where threaded tasks could have cache write conflicts.
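
A sketch of the kind of symbol-keyed cache described above, using a lock-free dictionary; the key format and the GetOrCreate helper are hypothetical, not the real DataSeries cache:

using System;
using System.Collections.Concurrent;

class SymbolKeyedCache
{
    private readonly ConcurrentDictionary<string, double[]> cache =
        new ConcurrentDictionary<string, double[]>();

    public double[] GetOrCreate(string symbol, string indicator, int period,
                                Func<double[]> calc)
    {
        // e.g. "MSFT|SMA|20" - keys for different symbols can never collide,
        // so threads working on different symbols never contend for an entry
        string key = symbol + "|" + indicator + "|" + period;
        return cache.GetOrAdd(key, _ => calc());
    }
}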

QUOTE:
... [There] should be a way (perhaps an override property at class WealthScript) to enable or disable the parallelization of strategy, as there may be cases where the strategy should be executed ... sequentially.

And that's the point. For those few strategies that require write access to external symbols, one has to disallow parallelism altogether. This way WL preserves its simplicity and avoids database record locking (or semaphores). Honestly, doing any kind of heavy database activity in a simulation program is a bad idea because of the overhead.

QUOTE:
----2) Execution of the backtests composing an optimization in parallel (optimization level parallelization):
This parallelization may be implemented using Background Workers (or .NET 4 tasks); very easily in my opinion.

The best type of parallelization would be ... the parallelization of type 1 (strategy level parallelization), especially for backtests in Data Sets with a large number of symbols. In addition, this parallelization would also directly benefit the optimizers. Type 2 parallelization would be good especially for optimization of strategies using single symbol or Data Sets with few symbols.

Agreed. Optimizers may need to be optionally tweaked to take full advantage of this.

-----
There is a dark side to multi-threading to achieve more multi-core utilization. Does the processor chip have enough cache memory to keep all the processor cores stoked with data? That is, is there enough cache to fit this multi-core problem on chip? I think with daily bars the answer is "yes"; otherwise the answer is "no". So WL needs a method to disable core parallelism when the resolution of "scale" goes beyond daily bars, to minimize simulation time. This adjustment needs to be made while the WL parameter optimizer is running, by monitoring Intel-chip cache-miss rates. Some assembly is required (because C# code won't work here).

I don't believe the Windows OS supports getting into the supervisory mode of the processor chip; that's for device drivers to do. So one would need to write a read-only (monitoring-only) Windows device driver. WL could then use that device driver as a window into the processor's cache-miss rate. If the miss rate gets too high, I would disable the parallelism altogether for that bar scale and range.

JDardon

#55
It's been a few years since the last update on this subject. Is there any news?

Eugene

#56

JDardon

#57
Clear on parallelizing indicators. But multi-core support goes beyond just having more efficient indicator run time.
What about utilizing multiple cores during optimizations on multiple symbols (say, sending each symbol's data set to a different core)? Clearly this has no semaphore or dependency constraints or considerations.

Optimizing on an 8-core computer over multiple symbols could be done in 1/8 the time of running on a single core.

Eugene

#58
Sorry, this is not on the radar.

superticker

#59
QUOTE:
... the multi-core support goes beyond just having more efficient indicator run time. What about utilizing multiple cores during optimizations on multiple symbols (say sending each symbol data set to a different core).
I totally agree....

The run-time is reasonably efficient now (although it doesn't employ multi-core methods), but splitting the problem up so that each symbol has its own core in an optimization would be possible. But now you have a processor cache management problem, which the Windows OS scheduler isn't smart enough to regulate on its own. For example, each symbol may require 0.6GB of cache, but your 8-core system only has 2GB of L3 cache. So you need to tell the Windows scheduler to use only 3 of its 8 cores for this problem; otherwise things will get really slow. There are provisions for such resource management in supercomputers (via Linux OS scheduler extensions), but not in the Windows OS. It's a problem.

There are hardware accelerator cards (based on NVIDIA CUDA-core GPUs; Google it!) with 16GB cache that may allow some problem-size management. But not all WL users have a $1600 accelerator card and a 600-watt power supply with 3 fans to support this configuration.

However, if there are enough WL users interested in building server-class machines to do this, I would be interested, and I could help with the engineering side. Look for traders that are doing serious Bitcoin mining and using WL, because they would already have this hardware configuration. But it's still a niche audience.

UPDATE, to better work around the niche-audience problem: You might be able to get WL to restrict, at the application level, the number of symbols it tries to optimize simultaneously. So in the example above with 2GB of L3 cache, you could tell the WL optimizer to only solve (optimize) 3 symbols at a time, because trying to optimize more would be significantly slower. This is a reasonable workaround for the OS's lack of resource management. Obviously, you would want to be using an Intel i7-class or Xeon-class processor chip with the largest possible L3 cache. (In this case, the number of cores isn't that important; it's the L3 cache size that's the weakest link.)

festipower

#60
I have built a plugin for WL that executes backtests and optimizations using multiple threads, and with a lot of profiling/optimization of the code I have been able to execute backtests on large datasets of thousands of symbols and thousands of trades around 20x FASTER than the default WL method on a broad selection of strategies. The larger the dataset and the number of resulting trades, the larger the speedup. The same speedup has been achieved for optimizations. Currently I am working on WFO, and the same level of speedup should be achieved in this tool also.
The plugin works with almost every WealthScript and compiled strategy, without the need to modify them.

The tests have been done on a 6-core machine from 2014 with 16GB of RAM. On more modern machines with a larger number of cores, an even larger speedup should be achievable.

It is all preliminary work, and if there is enough interest in the community, my intention is to build a tool that integrates seamlessly with WL and executes backtests and optimizations at the push of a button. I already have a prototype of this tool that works for backtests and optimizations.

Eugene

#61
Carlos, this sounds very impressive. I wonder what approach you use, and why you highlight that "almost every" WS strategy is compatible?

superticker

#62
QUOTE:
... around 20X FASTER ... tests have been done in a 6 core machine
So you're saying your theoretical top speed improvement is 6x (for 6 cores), but you're getting 20x in practice. Can you tell us what other things you're doing, besides multi-core execution, that achieve this 20x improvement? There might be something holding WL back that has nothing to do with multi-core execution.

How much L3 processor cache does this machine have? Can you post a screenshot of your processor utilization for all 6 individual cores with Process Explorer (Sysinternals)? I just want to see how well the processor utilization is being distributed between all 6 cores.

festipower

#63
Eugene & Mark:

A complete redesign of the WL simulation engine has been carried out, parallelizing everything possible and analyzing the performance in detail to optimize execution to the maximum.

The simulation engine has been rewritten from scratch, so the WL TradingSystemExecutor class and others are not used. Therefore, apart from the benefit of the parallelization, you get the benefit of an optimized sequential execution of the other parts of the code.

Things like avoiding over-parallelization, avoiding excessive garbage collection, etc. have been taken into account to obtain the best possible performance.

On the other hand, the user is given finer control over the indicator cache, so in many circumstances simulations can be carried out without recalculating all the indicators.

Several levels of parallelization have been implemented that can be activated/deactivated by the user at any time. More than one parallelization mode can be active at the same time, and in this case the system avoids over-parallelization that would waste resources on unnecessary context switches.

Level 1.- Parallelism at the symbol level. This parallelization mode brings the biggest execution-speed benefits and is used both in backtests and optimizations. It has the downside that certain strategies may not work without modifications (for example, symbol rotation).

Level 2.- Parallelism at the level of composite strategies: each of the strategies that make up the composite strategy can be executed in parallel.

Level 3.- Parallelism at the optimization level: several optimization steps can be executed in parallel (a sketch follows below). For now this only works with the exhaustive optimization method.

Level 4.- Fine-grained parallelization: in addition to the previous parallelizations, certain internal loops and other structures have been parallelized, so that even when the other parallelism levels are deactivated, multiple CPU cores are used for the parallel execution of certain parts of the code.
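
As a hypothetical sketch of what Level 3 amounts to: with exhaustive optimization every parameter combination is an independent backtest, so the runs can be farmed out with Parallel.ForEach. ParameterSet, Result and RunBacktest are placeholders, not the plugin's real internals:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParameterSet { /* one combination of strategy parameter values */ }
class Result { /* scorecard metrics for one backtest run */ }

static class ExhaustiveSketch
{
    static IEnumerable<Result> Optimize(IEnumerable<ParameterSet> grid)
    {
        var results = new ConcurrentBag<Result>();
        Parallel.ForEach(grid, parameters =>
        {
            Result r = RunBacktest(parameters); // each run touches only its own state
            results.Add(r);                     // thread-safe result collection
        });
        return results;
    }

    static Result RunBacktest(ParameterSet parameters)
    {
        return new Result(); // stand-in for a full backtest of one combination
    }
}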

superticker

#64
QUOTE:
... redesign of the WL simulation engine has been carried out, parallelizing everything possible and analyzing the performance in detail to optimize the execution to the maximum.
Thank you very much for this effort.

QUOTE:
Several levels of parallelization have been implemented that can be activated/deactivated by the user ... Level 1 - Parallelism at the Symbol level.... It has the downside that certain strategies may not work without modifications
And that's to be expected. I would set the defaults to "no parallelism" for cases where there might not be upward compatibility. Users can then enable a parallel execution level as they rewrite their code.

QUOTE:
over-parallelization, avoiding excessive garbage collection, etc. have been taken into account
There is a WL 6.9.19.0 bug in how the optimizer handles the indicator cache that really slows optimization down after around 150 to 200 symbols when using the Particle Swarm optimizer. This is a separate issue (perhaps I can start a service ticket to discuss it further), but indicator-cache garbage collection may be partly involved.

There's one companion feature you might add to this: a method to throttle the parallelism down so the parallel problem fits entirely in the L3 cache of the processor. Understand, the front-side bus may operate at 333MHz even though the processor runs at 4GHz. That means a processor cache miss is going to incur a 12x speed loss (4G/333M = 12) if it has to go off chip to DIMM (SDRAM) memory. So if you can limit the parallelism (say by letting only 3 symbols, using 3 of 6 cores, be optimized at once), you're likely to get more L3 cache hits, and therefore better speed. This is a case where less (cores) is more (speed). The goal is to fit the entire parallel problem into the L3 processor cache to maximize performance.
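
In TPL terms the throttle is essentially a one-liner; a sketch, with the 3-worker cap taken from the example above and OptimizeSymbol as a hypothetical placeholder:

using System.Collections.Generic;
using System.Threading.Tasks;

static class ThrottledSketch
{
    static void OptimizeThrottled(IEnumerable<string> symbols)
    {
        // cap the worker count so the combined working set of the symbols
        // being optimized concurrently stays inside the L3 cache
        var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };
        Parallel.ForEach(symbols, options, symbol => OptimizeSymbol(symbol));
    }

    static void OptimizeSymbol(string symbol) { /* full optimization of one symbol */ }
}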

I'm happy to answer any hardware questions. Happy computing to you, Superticker

festipower

#65
Yes, Mark. Processor cache is something to be taken into account also.

I'm preparing some videos to explain the tools (they will be called BTUtils) and I plan to upload them to YouTube shortly.

Stay tuned! ;-)

superticker

#66
QUOTE:
Processor cache is something to be taken into account ...
One "additional" enhancement would be to design the "indicator cache" so that it only caches single-precision numbers. You would then cast those back to double precision when reading from the indicator cache, so everything remains double precision during execution. Whether you retain roughly 7 significant digits (single) or 15-16 (double) in a fuzzy problem (stock trading) isn't going to matter.

I say "additional" because I see this as an enhancement, not as a requirement for the initial release. Resist the temptation to add this for the initial release. Single-precision indicator caching will improve processor cache hits. The downside is that it will increase CLR garbage collection. The solution there is to create your own garbage collector (much like the streaming classes do) so you're not relying on the CLR garbage collector, which was never designed for block-oriented garbage collection (as block indicator caching or disk cluster caching requires).

The caches of the processor (L2, L3) are block-oriented (associative caching for hardware speed), but the CLR garbage collector isn't, because program objects have different sizes.
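
A minimal sketch of the idea: store indicator values as floats (halving memory traffic, so roughly twice as much of a series fits in processor cache) while callers keep reading and writing doubles. FloatBackedSeries is a hypothetical wrapper, not the real DataSeries class:

class FloatBackedSeries
{
    private readonly float[] values;

    public FloatBackedSeries(int bars) { values = new float[bars]; }

    public double this[int bar]
    {
        get { return values[bar]; }          // widen float -> double on read
        set { values[bar] = (float)value; }  // narrow on write, ~7 digits kept
    }
}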

festipower

#67
Changing the underlying type of the indicator cache is problematic because the cache mechanism is integrated into the DataSeries class, and that is difficult to change.

The first release will include a large speedup of backtests/optimizations/WFO, with some additional improvements in the overall tools and workflows.

If there is enough community interest, a later release may add further speed improvements and things such as dynamic scorecards (with metrics defined at runtime using a formula editor), better graphical representation of the optimization process, etc.

festipower

#68
Mark:

When I say "dynamic scorecards", I refer to a scorecard that allows the creation of custom metrics using a formula editor, in order to use those metrics in the optimization/WFO process. This would open up new possibilities, such as creating composite metrics (whose components would be weighted metrics) as an approximation to multi-objective optimization. I would also consider implementing a proper multi-objective optimizer using the Pareto frontier concept.

Integrating R statistical analysis into Wealth-Lab as you suggest would be a very powerful improvement to WL.
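
For the composite-metric idea, the formula such a dynamic scorecard would evaluate boils down to a weighted sum; a sketch with made-up metric names and weights:

using System.Collections.Generic;

static class CompositeMetricSketch
{
    // e.g. weights = { "Sharpe": 0.7, "ProfitFactor": 0.3 }
    static double CompositeScore(IDictionary<string, double> metrics,
                                 IDictionary<string, double> weights)
    {
        double score = 0.0;
        foreach (KeyValuePair<string, double> w in weights)
            score += w.Value * metrics[w.Key]; // weighted sum of scorecard metrics
        return score;
    }
}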




festipower

#69

The set of tools will be called BTUtils and will comprise several utilities:
-BTSimulator: backtest utility.
-BTOptimizer: optimization utility.
-BTWalker: walk-forward optimization utility.
-BTWatchLists: watch-list creation utility with some improvements.
-BTScorecard: a speed-optimized scorecard that includes all of the most useful metrics of other scorecards plus several additional features.


As for the state of development of BTUtils, right now I am working on BTWalker (the WFO tool) and have it half implemented. The other tools are almost completely finished, needing only a few minor changes and bug fixes.

superticker

#70
QUOTE:
... referring to a scorecard that allows the creation of custom metrics in the Optimization/WFO process. This would allow ... creating composite metrics (whose components would be weighted metrics) as an approximation to a multi-objective optimization.
It's a good idea, but only if the composite metrics are created with orthogonal terms; otherwise you'll never converge on a unique solution (the condition number of the system matrix goes bad because there's no unique solution to find).

What we do in practice is dump the correlation (or covariance) matrix and remove all the highly correlated terms that aren't orthogonal. Afterwards, it's possible to solve for a "unique solution" with the remaining orthogonal terms. I say "unique solution" because this is a fuzzy problem, so there really isn't a truly unique solution to find. It's solved by "best effort" (like the Pareto frontier concept).

R supports all-possible-subsets regression analysis (MATLAB's stats toolbox only does step-wise regression), so it will tease out the highly correlated terms and leave you with a par composite model. For a fuzzy system (shudder), I fear anything short of this approach won't work. But if you have a better approach that might work, I'm interested. If you're thinking about including all-possible-subsets regression or step-wise regression as part of your weighted-composite scorecard solution, yes, I would be interested in that too. Then I wouldn't need R. But with the R.NET interface, it should be easy to call R from WL.

The problem is that when you start using MATLAB or R, this becomes a research problem. So you need to formulate the final solution into a WL indicator that everyone can use.

WL does have some nice composite indicators today. I use TII (Trend Intensity Index) in all my production strategies.

festipower

#71
Thanks for your ideas, Mark. I will take them into account if/when I try to implement BTOptimizer improvements in a later release.

For now I must concentrate my available time on finishing the first version of BTUtils.

Domintia-Carlos

#72
Dear WL Users:

BTUtils for Wealth-Lab, our toolset to dramatically speed up backtests and optimizations in Wealth-Lab Dev/Pro, has been released.

Check out the videos to get an idea of how it works.

You can also take a look at BTAnalytics, our tool to store and analyze backtests and improve strategies.


Carlos Pérez
https://www.domintia.com