CUDA, GPU processing to increase speed of optimizations?
Author: lanatmwan
Creation Date: 12/21/2017 7:26 PM

lanatmwan

#1
I'm generally curious whether Wealth-Lab is still being actively developed or is more in bug-fix/minor-improvement mode. I didn't do much with it, but I tried it out a few years ago and it seems basically the same now. I recently saw that a company is offering a compute library for GPUs, and the author specifically mentioned it can be used for Monte Carlo calculations. Does Wealth-Lab have any plans to add something like that to greatly increase the speed of WealthScript optimizations?

http://blog.quantalea.com/radically-simplified-gpu-programming-with-csharp/

Eugene

#2
Wealth-Lab is a very extensible platform. If somebody has interest in GPU programming, supporting it in his custom optimizer would be an excellent addition.

Meanwhile please take a look at these tips and make sure you have installed the recent Optimizer additions:

Optimization is slow and takes much time. Is it possible to speed up?

Specifically, these extensions greatly increase the speed of optimizations:

Exhaustive to Local Maximum Optimizer
Swarm Optimizer Addin

lanatmwan

#3
I shall infer from your answer that it's in maintenance mode. :)

That is super cool that it is extensible in that way, though. I will definitely check those other optimizers out! I probably don't have the mathematical background to make a GPU optimizer, but hopefully someone will feel inspired...

lanatmwan

#4
And if some motivated soul comes along, there are a lot of code samples:

http://www.aleagpu.com/release/3_0_4/doc/gallery.html#

LenMoz

#5
QUOTE:
Wealth-Lab is a very extensible platform. If somebody has interest in GPU programming, supporting it in his custom optimizer would be an excellent addition.
Do I hear "Hint, hint," Eugene?

A deterrent to employing the GPU hardware is that a WealthLab optimizer is a cooperative process controlled by native WealthLab code. An optimizer add-on only tells the WealthLab host which set of parameters to try next. The WealthLab optimizer host code runs that set of parameters and returns the result to the add-on. Host code also handles the results and their presentation in the user interface.

The Exhaustive Optimizer simply loops through every possible combination of parameters. Faster optimizers use algorithms to intelligently choose a subset of the exhaustive search space, learning which combinations achieved good results and proposing combinations "near" those previous good results. The interface between host and optimizer would need to be changed so that the WealthLab host would accept an array of parameter combinations and return an array of results to the optimizer. The GPU implementation would need to be made in the host code. The Genetic and Swarm optimizers could work in this fashion, passing parameter sets a generation at a time.
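To make that concrete, here is a hypothetical sketch of the shift from a one-at-a-time contract to a batch contract. These interface and method names are invented for illustration; they are not the actual WealthLab API.

CODE:
// Hypothetical sketch only -- invented names, not the actual WealthLab API.

// Today's contract, conceptually: one parameter set per round trip.
public interface IOptimizerAddon
{
    double[] NextParameterSet();                             // add-on proposes one combination
    void ReportResult(double[] parameters, double fitness);  // host returns one result
}

// A batch-oriented contract a GPU-backed host could exploit:
public interface IBatchOptimizerAddon
{
    // The add-on proposes a whole generation of candidates at once...
    double[][] NextParameterSets(int maxBatchSize);

    // ...and the host hands back one fitness per candidate, free to compute
    // them in parallel (e.g. one backtest per GPU work item).
    void ReportResults(double[][] parameterSets, double[] fitnesses);
}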

Bottom line - Implementation would be a complex cooperative effort with a fair amount of risk and uncertain ROI. I doubt it will happen.

Len (author of the Particle Swarm Optimizer add-on)

Eugene

#6
Excellent answer, thanks Len.

superticker

#7
QUOTE:
The interface between host and optimizer would need to be changed such that the Wealth Lab host would accept an array of parameter combinations and return an array of results to the optimizer.
And the other problem is that you would have to host Wealth Lab on a Xeon-class (server) processor with an excess of 4 to 8 MBytes of on-chip L2 and L3 cache memory to fit the "parallel" optimization problem. Remember, you only run at full processor speed if your problem fits on-chip. If you have to make an off-chip access to main memory (DIMM memory), then you drop down to front-side bus speeds (333 MHz) instead of 4 GHz processor-chip speeds.

And the GPU may be even more constrained by problem size (cache size) than a Xeon-class processor. Where the GPU will shine is with smaller-memory signal-processing problems. Image processing (which is what the GPU was designed for) is really many parallel 1-dimensional signal-processing problems stitched together into a 2- or 3-dimensional image result.

There's some signal processing in Wealth Lab [FIR or SMA, IIR or EMA, derivative evaluation (e.g. Momentum, Kalman), integration, vector arithmetic], but not as extensive as in many other applications such as image processing.

It's possible Neural Network evaluation might be assisted by a GPU in "some" ways (i.e. all the nodes in a given NN layer could be evaluated in parallel). But I think you could build a custom AI processor that could do that much better.

lanatmwan

#8
QUOTE:
"host Wealth Lab on a Xeon class (server) processor with an excess of 4 to 8GBytes of on-chip L2 and L3 cache memory to fit the "parallel" optimization problem"


Which brings me to my next desire. I'm pretty sure it wouldn't be very hard, relatively speaking, to create various cloud services out of some of the pieces of Wealth-Lab. I already have a lot of ideas in this space, but backtesting/optimizing is a slam-dunk use case for it. Right now my poor laptop is running a Monte Carlo optimization that will take 5 days (yes, I know, too many variables, blah blah); I'd much rather be able to upload that as a job to the cloud. All my accounts are currently tied to Fidelity, so I'd love to use the tool I'm (indirectly) paying for, but I'd also love to see development keeping up with some of the tools that are coming out now.

QUOTE:
"And the GPU may be further constrained by problem size (cache size) than a Xeon Pentium processor. Where the GPU processor will rain is with smaller-memory signal-processing problems. Imagine processing (which is what the GPU was designed for) is really many parallel 1-dimensional signal-processing problems stitched together into a 2- or 3-dimensional image result."


I would encourage you to check out the videos at the site I posted previously. The math is over my head, and the GPU certainly has limitations, but they do speak to some specific use cases for financial applications and say that they see 100x speed improvements.

Eugene

#9
QUOTE:
and say that they see 100x speed improvements.

Six years ago we released a version of the Monte Carlo Lab visualizer which offered multi-CPU support for greater speed. Soon we had to retract it because of the corrupt data it produced. It turned out that the Wealth-Lab backtesting engine itself wasn't designed with parallelizing a backtest in mind. Something isn't thread-safe there, which may affect "derivative works".

Check out some existing forum threads on parallelization and multi-core CPUs. Some authors succeeded to a degree, some experiments showed that parallelization offers fewer benefits than we initially expected, and some went their own way, developing a custom engine for their needs.
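To show in miniature what "not thread-safe" can mean, here is a generic C# sketch (nothing Wealth-Lab-specific): an unsynchronized counter quietly loses updates under Parallel.For, the same class of silent corruption, with no exception thrown.

CODE:
using System;
using System.Threading;
using System.Threading.Tasks;

class RaceDemo
{
    static void Main()
    {
        int unsafeCount = 0, safeCount = 0;

        // 'unsafeCount++' is a read-modify-write; concurrent iterations lose
        // updates, so the total comes out wrong -- silently.
        Parallel.For(0, 1000000, i => { unsafeCount++; });

        // Interlocked makes the increment atomic, and the result is correct.
        Parallel.For(0, 1000000, i => Interlocked.Increment(ref safeCount));

        Console.WriteLine($"unsafe: {unsafeCount}  safe: {safeCount}");
        // Typical run: 'unsafe' lands well below 1000000; 'safe' is exact.
    }
}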

superticker

#10
QUOTE:
... they do speak to some specific use cases for financial stuff [with the GPU] and say that they see 100x speed improvements.
That 100x speedup is based on array-processing parallelism in parallel FOR-loop blocks (Gpu.Parallel.For), not vector processing "within" a FOR loop. And as post #5 said, you would need to redesign the Wealth Lab optimizer API to support it. Moreover, it's doubtful "consumer" GPUs have enough local memory for that many large parallel problems simultaneously.
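For reference, that parallel FOR-loop pattern looks roughly like this in Alea GPU, the C# library linked in post #1. This follows their published samples; exact signatures may differ between versions, so treat it as a sketch.

CODE:
using System.Linq;
using Alea;            // Alea GPU -- http://www.aleagpu.com
using Alea.Parallel;   // provides the gpu.For extension method

class GpuForDemo
{
    [GpuManaged]  // lets Alea manage the host<->device memory transfers
    static void Main()
    {
        const int n = 1000000;
        var a = Enumerable.Range(0, n).Select(i => (float)i).ToArray();
        var b = Enumerable.Range(0, n).Select(i => (float)(i * 2)).ToArray();
        var c = new float[n];

        // Each index i becomes one GPU work item; the lambda body must be
        // branch-light and self-contained for the big speedups to apply.
        Gpu.Default.For(0, n, i => c[i] = a[i] + b[i]);
    }
}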

Also appreciate that most GPUs don't do floating-point arithmetic, though about 30% do. Moreover, only 5% of "GPUs" in 2017 even do double-precision floating point. However, that high-end 5% is meant for something more than graphics. :-)

I just found one of these high-end GPU graphics cards on Amazon for $3000: the Nvidia Tesla K80 24GB GDDR5 CUDA Cores Graphic Card. Check it out! And yes, 24GB is enough memory for some large, parallel optimizer problems--it will work. This is what this thread is really about.
https://www.amazon.com/Nvidia-Tesla-GDDR5-Cores-Graphic/dp/B00Q7O7PQA

---
From a vector-processing perspective (within a FOR loop), you'll get some speedup of your SMA (FIR filter), EMA (IIR filter), and Momentum (FIR derivative calculation) indicators. These are signal-processing vector operations. But that's only going to speed up your overall simulation (backtest) by 20%.

What really hurts you in vector processing are branch instructions. In signal or image processing, you won't have any branch instructions when operating on vectors. But in a stock-trading simulation, there are lots of events and branching. That's going to break all the operand pipelining within the vector-processor chip. The chip's subprocessing units will barely have their pipes filled before there's a branch that forfeits everything accumulated in the prefetch and intermediate-calculation pipes. In contrast, with no-branch vector arithmetic, nothing is forfeited.

What you really need to do is remove all the branching in your trading strategy so everything looks like a vector or image-processing operation. Then you'll get your speedup.
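Here's a small generic C# illustration of the point (not WL code): the same signal written with a branch, then rewritten as straight-line arithmetic that pipelines cleanly.

CODE:
using System;

static class BranchDemo
{
    // Branchy form: the conditional breaks the pipeline on every bar.
    public static float[] SignalBranchy(float[] fast, float[] slow)
    {
        var s = new float[fast.Length];
        for (int i = 0; i < fast.Length; i++)
        {
            if (fast[i] > slow[i]) s[i] = 1f;  // long
            else s[i] = 0f;                    // flat
        }
        return s;
    }

    // Branch-free form: the comparison becomes arithmetic, so every bar
    // executes the identical instruction sequence and pipelines cleanly.
    public static float[] SignalVectorized(float[] fast, float[] slow)
    {
        var s = new float[fast.Length];
        for (int i = 0; i < fast.Length; i++)
            s[i] = Convert.ToSingle(fast[i] > slow[i]);  // 1.0 or 0.0, no jump
        return s;
    }
}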

---
Now when evaluating a given layer in a Neural Network (NN), you're just taking the inner product of two (maybe three) vectors. And within that given layer, one operand won't depend on the outcome of another. Moreover, there's little branching. So there's some serious opportunity to do parallel processing with an array processor (or GPU) here. Now you're not going to speed your NN up by 100x (ha, ha; you wish), but you could speed it up by 5x without too much effort (assuming there are 5+ nodes in each NN layer and you have 5+ floating-point units (FPUs) in your array processor). Appreciate that the FPUs probably have a 2- or 3-stage pipeline, so you'll need to "unroll your code" so you can stripe-execute through the nodes, just as you would on a supercomputer, to keep all FPU stages stoked. If you employ the arithmetic code library that comes with your array processor, you should be fine; look for an inner-product operation in the array processor's library.
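In code, one dense layer is just this (a generic C# sketch with hypothetical layer shapes, no particular NN library): every output node is an independent inner product, which is exactly the shape parallel hardware likes.

CODE:
using System;
using System.Threading.Tasks;

static class NnLayer
{
    // One dense layer: outputs[j] = activation(dot(weights[j], inputs) + bias[j]).
    // Every j is independent of the others, so all nodes in the layer
    // can be evaluated in parallel, as described above.
    public static float[] EvalLayer(float[][] weights, float[] bias, float[] inputs)
    {
        var outputs = new float[weights.Length];
        Parallel.For(0, weights.Length, j =>
        {
            float sum = bias[j];
            for (int k = 0; k < inputs.Length; k++)  // the inner product
                sum += weights[j][k] * inputs[k];
            outputs[j] = Math.Max(0f, sum);          // ReLU, one example activation
        });
        return outputs;
    }
}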

Now how a 5x speedup in your NN would affect the overall speed of your trading simulation is another question. I don't know.

lanatmwan

#11
You forgot to drop the mic. :)

superticker

#12
QUOTE:
WL community of HW specialists, is there any value to harnessing the capabilities of 1,000+ core GPUs for WLD?
The short answer is "yes" when it comes to running the WL optimizer (e.g. the Particle Swarm Optimizer, if it's redesigned for massively parallel multi-core). But there are many barriers.

Hardware setup: You need a gaming workstation with a 500-600 watt power supply, and I would recommend liquid-based chip coolers. Also, these CUDA-core accelerator PCI Express cards can run $2000 for the card alone. I realize not everyone is a computer engineer, but everyone probably knows a gaming junkie on their block who can set up this kind of workstation for you. I would attend a local gamers' meeting and ask them about their setups. I'm sure they can help you build your custom workstation. I would budget $4500 (including the cost of the CUDA accelerator card) for your workstation. CUDA overview

Software setup: You need access to the optimizer source code and some of the core WL code to vectorize and parallelize this code and compile it around the CUDA numerical-analysis C++/FORTRAN library. If you skip this step, you won't get the speedup. Some experience with programming supercomputers would be helpful (but not required), since the issues are the same. Visual Studio lets you mix C# and C++ assemblies.
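On that last point, the usual bridge between a C# host and a C++/CUDA assembly is P/Invoke. A sketch with entirely made-up names (optimizer_cuda.dll and RunBatch are hypothetical, shown only to illustrate the mechanism):

CODE:
using System.Runtime.InteropServices;

static class NativeOptimizer
{
    // Hypothetical export from a C++/CUDA assembly. The native side would
    // launch the kernels and fill 'results' with one fitness per parameter set.
    [DllImport("optimizer_cuda.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int RunBatch(
        double[] flatParameters,  // batchSize * paramsPerSet values, row-major
        int batchSize,
        int paramsPerSet,
        double[] results);        // length batchSize, filled by the native code
}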

Size of user base: Now if you were doing pattern matching of DNA sequences (great supercomputer problem), there would be plenty of user base. But for just speeding up WL optimizations, I'm not so sure. That's the biggest barrier. But done right, you could get a WL optimizer speedup of 40-200 times. It wouldn't be worth it for other WL operations, although vectorizing some of the WL indicators with the CUDA library "might" be worth it.

There would also be some opportunity to vectorize (and "maybe" parallelize) a neural network (NN) implementation, since those are coded with nested FOR loops and can be reduced to linear-system operations that the CUDA library would support. I would ask on a CUDA help forum which NNs have already been ported to CUDA and go from there, improving an existing port. (Don't reinvent the wheel.)

Carova

#13
superticker,

I was hoping that you would weigh in on this issue! ;)

I, for one, do have the hardware setup since I run CNNs regularly, but these SW packages are predesigned to take full advantage of CUDA. And I doubt that you would necessarily need to go to a "30-Class" GPU card to take full advantage of the capability. The "10" and "20" class cards are quite affordable, and there would only be a 2-3x reduction in performance.

I am not knowledgeable enough about the code design, which is why I asked. I suspect that a backtesting and trading system product that could boast a 40-200x speedup in optimizations would be a big deal in the marketplace, but I will leave that to Glitch to decide. Also, as you point out, perhaps there is potential in vectorizing the calculation of indicators, which would be another plus.

Maybe I am in the small minority here, but I find that I am constantly waiting for major operations (i.e. optimization) on my system. I wonder if that is the case for many of the other users of WLD.

Vince

superticker

#14
QUOTE:
I find that I am constantly waiting for major operations (i.e. optimization) on my system.
Yes, that's the slowest step for me too. I re-optimize about 5% of the worst-performing stocks each Saturday morning. To optimize all my stocks would take two days.

I'm sure those using NNs in their WL strategies would enjoy the speedup too.

QUOTE:
... processors with 16 cores will soon be very common, and AMD is planning to deliver one with 64 cores next year ...
Yes, but these processors only have enough on-chip cache memory to keep about 4 (maybe 6) cores busy in a "large simulation" problem. If you try using more cores than that, things will slow down because you'll get too many cache misses. The bottleneck here is the cache memory capacity, not the number of cores.
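The practical counterpart in .NET is to cap the worker count instead of defaulting to all cores; a generic sketch (ExecuteRun is a hypothetical stand-in for one optimization run):

CODE:
using System.Threading.Tasks;

static class CacheFriendlyRunner
{
    static void ExecuteRun(int run) { /* one optimization run (stub) */ }

    public static void RunAll(int runCount)
    {
        // Cap the workers at what the on-chip cache can actually feed
        // (4 here, per the argument above), rather than using every core.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
        Parallel.For(0, runCount, options, run => ExecuteRun(run));
    }
}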

Now the CUDA acceleration cards have their own privately shared memory systems, so none of these main-processor caching issues apply to them.

Cone

#15
My takeaway from all this is that Carova runs CNN. Can you do something about all the fake news out there? :D

Carova

#16
QUOTE:
My takeaway from all this is that Carova runs CNN. Can you do something about all the fake news out there? :D


I only do simulation on CNN. :) You need to speak with someone else for the real deal!

superticker

#17
QUOTE:
My take away out of this is that Carova runs CNN.
I'm trying to decide whether he's using CNN for image processing or it has something to do with stock trading. He never said....

I've been thinking about the "politics" of his suggestion. This really isn't about acceleration as much as it is about taking market share away from all the other trading simulation platforms out there. His suggestion revolves around differentiating WL7 from everyone else so they switch--very sneaky. He should be working for Apple.

My political suggestion would be to release a beta version of WL7 without the CUDA acceleration. At that point, it might be a consideration to include it for the strategy optimization step and see how many WL7 beta users really buy the gaming hardware for their platforms to even use it. You're really targeting the hardware junkies for this beta release. They are buying it because it's a new hardware "toy" to try out and not so much for the acceleration.

*If* it's received well by the hardware junkies, then the next step would be to "standardize" the hardware platform so MS123 can support it. I would subcontract with one of the gaming hardware providers (that do custom builds) to support the hardware side of this for MS123 customers. Remember, you're not Apple (which provides both hardware and software support). You're trying to sell software, and your customers are not hardware junkies. That's a serious problem, but don't try to become Apple either.

If the WL customer base bites--and that depends on your execution of the plan--then I would consider adding NN/CNN acceleration. Users aren't going to switch until there's enough incentive, and adding NN/CNN support "might" tip the scales. There's some risk here, but we are all stock traders, so we know about risk.

Many WL indicators are "time series calculations" that can be vectorized for the CUDA libraries, but you'll only get a speedup of about 3 times. It's hardly worth it. In addition, "unstable" indicators like ATR cannot be vectorized completely, because one needs the result of the previous bar before the next bar's computation can start. That is not going to fly well with the pipelining hardware*, which works as an assembly line to compute stages of the calculation. Users would have to give up some of their indicators and replace them with others that can be vectorized.

*Grain of salt added here, because some pipelining hardware may support wrapping the operand of the last stage back to the previous stage to facilitate IIR filter calculations (e.g. EMA). Check the hardware programming model for a bus that wraps back in the pipeline. FIR filter calculations (e.g. SMA) won't have this problem.
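The FIR-versus-IIR distinction in code form (plain C#, not WL's actual indicator implementations): each SMA output depends only on the inputs, so bars can be computed independently; the EMA recurrence needs the previous output before the next bar can start.

CODE:
static class IndicatorDemo
{
    // SMA (FIR): each output depends only on raw inputs, so every bar is
    // independent and the work can be vectorized or farmed out per bar.
    public static double Sma(double[] close, int bar, int period)
    {
        // assumes bar >= period - 1
        double sum = 0;
        for (int k = bar - period + 1; k <= bar; k++) sum += close[k];
        return sum / period;
    }

    // EMA (IIR): ema[i] needs ema[i - 1], so bars must be computed in
    // order -- the recurrence that defeats straightforward vectorization.
    public static double[] Ema(double[] close, int period)
    {
        var ema = new double[close.Length];
        double alpha = 2.0 / (period + 1);
        ema[0] = close[0];
        for (int i = 1; i < close.Length; i++)
            ema[i] = alpha * close[i] + (1 - alpha) * ema[i - 1];
        return ema;
    }
}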

Carova

#18
QUOTE:
I'm trying to decide if he's using CNN for image processing or if it's has something to do with stock trading? He never said....


FYI...I don't do image processing. ;)

Vince
