GPU processing to increase speed of optimizations?
Author: lanatmwan
Creation Date: 12/21/2017 7:26 PM

lanatmwan

#1
I'm generally curious whether Wealth-Lab is still being actively developed or is more in bug-fix/minor-improvement mode. I didn't do much with it, but I tried it out a few years ago and it basically seems the same now. I recently saw that a company is offering a compute library for GPUs where the guy specifically mentions it can be used for Monte Carlo calculations. Does Wealth-Lab have any plans to add something like that to greatly increase the speed of WealthScript optimizations?

http://blog.quantalea.com/radically-simplified-gpu-programming-with-csharp/

Eugene

#2
Wealth-Lab is a very extensible platform. If somebody is interested in GPU programming, supporting it in a custom optimizer would be an excellent addition.

Meanwhile please take a look at these tips and make sure you have installed the recent Optimizer additions:

Optimization is slow and takes much time. Is it possible to speed up?

Specifically, these extensions greatly increase the speed of optimizations:

Exhaustive to Local Maximum Optimizer
Swarm Optimizer Addin

lanatmwan

#3
I shall infer from your answer that it's in maintenance mode. :)

That is super cool that it is extensible in that way, though. I will definitely check those other optimizers out! I probably don't have the mathematical background to write a GPU optimizer, but hopefully someone will feel inspired...

lanatmwan

#4
And if some motivated soul comes along, there are a lot of code samples:

http://www.aleagpu.com/release/3_0_4/doc/gallery.html#

LenMoz

#5
QUOTE:
Wealth-Lab is a very extensible platform. If somebody is interested in GPU programming, supporting it in a custom optimizer would be an excellent addition.
Do I hear, "Hint, hint.", Eugene?

A deterrent to employing the GPU hardware is that a WealthLab optimizer is a cooperative process controlled by native WealthLab code. An optimizer add-on only tells the WealthLab host which set of parameters to try next. The WealthLab optimizer host code runs that set of parameters and returns the result to the add-on. Host code also handles the results and their presentation in the user interface.
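
To make that contract concrete, here is a rough sketch of the one-at-a-time arrangement (hypothetical C# names only, not the actual Wealth-Lab optimizer API):

// Hypothetical shape of the cooperative contract described above.
// The add-on only proposes the next parameter set; the host runs the backtest.
public abstract class OptimizerAddOn
{
    // Called by the host: return the next parameter combination to try,
    // or null when the optimizer has nothing more to suggest.
    public abstract double[] NextParameterSet();

    // Called by the host after it has backtested that combination,
    // so the add-on can learn from the result (e.g. net profit).
    public abstract void ReportResult(double[] parameters, double fitness);
}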

The Exhaustive Optimizer simply loops through every possible combination of parameters. Faster optimizers use algorithms to intelligently choose a subset of the exhaustive space by learning which combinations achieved good results and proposing combinations "near" these previous good results. The interface between host and optimizer would need to be changed such that the WealthLab host would accept an array of parameter combinations and return an array of results to the optimizer. The GPU implementation would need to be made in the host code. The Genetic and Swarm optimizers could work in this fashion, passing parameter sets a generation at a time.
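
A batched variant of that contract might look roughly like this (again hypothetical names; the host, not the add-on, would decide how to run each batch, whether across CPU cores or on a GPU):

using System.Collections.Generic;

// Hypothetical batched contract: the add-on proposes a whole generation of
// parameter sets at once, and the host returns one fitness value per set.
public abstract class BatchedOptimizerAddOn
{
    // Propose an entire generation (e.g. a swarm or GA population) at once.
    public abstract IList<double[]> NextGeneration();

    // Receive the fitness of each proposed parameter set, in the same order.
    public abstract void ReportResults(IList<double[]> generation, IList<double> fitness);
}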

Bottom line - Implementation would be a complex cooperative effort with a fair amount of risk and uncertain ROI. I doubt it will happen.

Len (author of the Particle Swarm Optimizer add-on)

Eugene

#6
Excellent answer, thanks Len.

superticker

#7
QUOTE:
The interface between host and optimizer would need to be changed such that the Wealth Lab host would accept an array of parameter combinations and return an array of results to the optimizer.
And the other problem is that you would have to host Wealth Lab on a Xeon class (server) processor with an excess of 4 to 8GBytes of on-chip L2 and L3 cache memory to fit the "parallel" optimization problem. Remember, you only run at full processor speed if your problem fits on-chip. If you have to make an off-chip access to main memory (DIMM memory), then you drop down to front-side bus speeds (333MHz) instead of 4GHz processor chip speeds.

And the GPU may be more constrained by problem size (cache size) than a Xeon-class processor. Where the GPU processor will shine is with smaller-memory signal-processing problems. Image processing (which is what the GPU was designed for) is really many parallel 1-dimensional signal-processing problems stitched together into a 2- or 3-dimensional image result.

There's some signal processing in Wealth Lab [FIR or SMA, IIR or EMA, derivative evaluation (e.g. Momentum, Kalman), integration, vector arithmetic], but not as extensive as many other applications like image processing.

It's possible Neural Network evaluation might be assisted by a GPU in "some" ways (i.e. all the nodes in a given NN layer could be evaluated in parallel). But I think you could build a custom AI processor that could do that much better.

lanatmwan

#8
QUOTE:
"host Wealth Lab on a Xeon class (server) processor with an excess of 4 to 8GBytes of on-chip L2 and L3 cache memory to fit the "parallel" optimization problem"


Which brings me to my next desire. I'm pretty sure it wouldn't be very hard, relatively speaking, to create various cloud services out of some of the pieces of Wealth-Lab. I've already got a lot of ideas around this space, but backtesting/optimizing is a slam-dunk use case for it. Right now my poor laptop is running a Monte Carlo optimization that will take 5 days (yes, I know, too many variables, blah blah); I'd much rather be able to upload that as a job to the cloud. All my accounts are currently tied to Fidelity, so I'd love to use the tool I'm (indirectly) paying for, but I'd also love to see its development keeping up with some of the tools that are coming out now.

QUOTE:
"And the GPU may be further constrained by problem size (cache size) than a Xeon Pentium processor. Where the GPU processor will rain is with smaller-memory signal-processing problems. Imagine processing (which is what the GPU was designed for) is really many parallel 1-dimensional signal-processing problems stitched together into a 2- or 3-dimensional image result."


I would encourage you to check out the videos at the same site I posted previously. The math is over my head and the GPU certainly has limitations but they do speak to some specific use cases for financial stuff and say that they see 100x speed improvements.

Eugene

#9
QUOTE:
and say that they see 100x speed improvements.

Six years ago we released a version of the Monte Carlo Lab visualizer which offered multi-CPU support for greater speed. Soon we had to retract it because of the corrupt data it produced. It turned out that the Wealth-Lab backtesting engine itself wasn't designed with parallelizing a backtest in mind. Something there isn't thread-safe, which may affect 'derivative works'.

Check out some of the existing forum threads on parallelization and multi-core CPUs. Some authors succeeded to a degree, some experiments showed that parallelization offers fewer benefits than we initially expected, and some users went their own way and developed a custom engine for their needs.

superticker

#10
QUOTE:
... they do speak to some specific use cases for financial stuff [with the GPU] and say that they see 100x speed improvements.
That 100X speedup comes from array-processing parallelism in parallel FOR-loop blocks (Gpu.Parallel.For), not from vector processing "within" a FOR loop. And as post #5 said, you would need to redesign the Wealth Lab optimizer API to support it. Moreover, it's doubtful "consumer" GPUs have enough local memory for that many large parallel problems simultaneously.
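
For illustration, the kind of parallel FOR loop those libraries provide looks something like this trivial Alea-style sketch (the exact entry point varies by Alea version, and a real Wealth-Lab backtest couldn't simply be dropped into the lambda):

using Alea;
using Alea.Parallel;

public static class GpuForSketch
{
    // Square every element on the GPU. Each iteration is independent,
    // which is the loop-level parallelism the 100X claims rely on.
    public static double[] SquareAll(double[] input)
    {
        var output = new double[input.Length];
        Gpu.Default.For(0, input.Length, i => output[i] = input[i] * input[i]);
        return output;
    }
}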

Also appreciate that most GPUs don't do floating-point arithmetic, but about 30% do. Moreover, only 5% of "GPUs" in 2017 even do double-precision floating point. However, that high-end 5% is meant for something more than graphics. :-)

I just found one of these high-end GPU graphics cards on Amazon for $3,000: the Nvidia Tesla K80 24GB GDDR5 CUDA Cores Graphic Card. Check it out! And yes, 24GB is enough memory for some large, parallel optimizer problems--it will work. So this is what this thread is really about.
https://www.amazon.com/Nvidia-Tesla-GDDR5-Cores-Graphic/dp/B00Q7O7PQA

---
From a vector-processing perspective (within a FOR loop), you'll get some speedup of your SMA (FIR filter), EMA (IIR filter), and Momentum (FIR derivative calculation) indicators. These are signal-processing vector operations. But that's only going to speed up your overall simulation (backtest) by 20%.
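
For instance, an SMA really is just an FIR filter: a fixed-weight dot product slid along the price series, which is exactly the shape of work a vector unit likes (a plain C# sketch, not Wealth-Lab's own indicator code):

public static class SmaSketch
{
    // Simple moving average as an FIR filter: each output bar is a
    // fixed-weight dot product over the previous 'period' prices.
    // (Bars before the first full window are simply left at zero.)
    public static double[] Sma(double[] price, int period)
    {
        var result = new double[price.Length];
        for (int bar = period - 1; bar < price.Length; bar++)
        {
            double sum = 0.0;
            for (int k = 0; k < period; k++)
                sum += price[bar - k];
            result[bar] = sum / period;   // pure arithmetic, no data-dependent branches
        }
        return result;
    }
}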

What really hurts you in vector processing are branch instructions. Now in signal or image processing, we won't have any branch instructions when operating on vectors. But in a stock-trading simulation, there are lots of events and branching. That's going to break all the operand pipelining within the vector-processor chip. The chip's subprocessing units will barely have their pipes filled before there's a branch that forfeits everything accumulated in the prefetch and intermediate-calculation pipes. In contrast, with no-branch vector arithmetic, nothing is forfeited.

What you really need to do is remove all the branching in your trading strategy so everything looks like a vector or image-processing operation. Then you'll get your speedup.
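
As a trivial example of what removing a branch means, compare a branchy signal flag with an arithmetic version that a vector unit can pipeline (plain C#, nothing GPU-specific; the series names are made up):

using System;

public static class BranchlessSketch
{
    // Branchy version: the per-bar 'if' interrupts vector pipelining.
    public static double[] SignalBranchy(double[] fast, double[] slow)
    {
        var signal = new double[fast.Length];
        for (int i = 0; i < fast.Length; i++)
        {
            if (fast[i] > slow[i]) signal[i] = 1.0;
            else signal[i] = 0.0;
        }
        return signal;
    }

    // Branch-free version: the comparison is folded into arithmetic,
    // so every bar performs the same operations regardless of the data.
    public static double[] SignalVectorized(double[] fast, double[] slow)
    {
        var signal = new double[fast.Length];
        for (int i = 0; i < fast.Length; i++)
            signal[i] = Convert.ToDouble(fast[i] > slow[i]);   // 1.0 or 0.0, no explicit branch
        return signal;
    }
}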

---
Now when evaluating a given layer in a Neural Network (NN), you're just taking the inner product of two (maybe three) vectors. And within that given layer, one operand won't depend on the outcome of another. Moreover, there's little branching. So there's a serious opportunity to do some parallel processing with an array processor (or GPU) here. Now, you're not going to speed up your NN by 100X (ha, ha; you wish), but you could speed it up by 5X without too much effort (assuming there are 5+ nodes in each NN layer and you have 5+ floating-point units (FPUs) in your array processor). Appreciate that the FPUs probably have a 2- or 3-stage pipeline, so you'll need to "unroll your code" so you can stripe-execute through the nodes, just like you would on a supercomputer, to keep all FPU stages stoked. If you employ the arithmetic code library that comes with your array processor, you should be fine; look for an inner-product operation in the array processor's library.
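
For concreteness, evaluating one NN layer really is just an inner product per node, and the nodes don't depend on each other, so the outer loop is the natural place to parallelize (a plain C# sketch; the weights, bias, and activation are made-up names):

using System;

public static class NnLayerSketch
{
    // Evaluate one layer: each output node is the inner product of the input
    // vector with that node's weight row, plus a bias, through an activation.
    // The outer loop over nodes has no cross-node dependency, so it could be
    // handed to an array processor or GPU a node (or a stripe of nodes) at a time.
    public static double[] EvaluateLayer(double[,] weights, double[] bias, double[] input)
    {
        int nodes = weights.GetLength(0);
        int inputs = weights.GetLength(1);
        var output = new double[nodes];
        for (int n = 0; n < nodes; n++)          // parallelizable across nodes
        {
            double sum = bias[n];
            for (int j = 0; j < inputs; j++)     // the inner product
                sum += weights[n, j] * input[j];
            output[n] = Math.Tanh(sum);          // example activation
        }
        return output;
    }
}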

Now how a 5X speed up in your NN is going to affect the overall speed of your trading simulation is another question. I don't know.

lanatmwan

#11
You forgot to drop the mic. :)