How to create indicator sector/industry averages?
Author: LenMoz
Creation Date: 12/17/2017 12:44 PM

LenMoz

#1
I'm looking for a way to build DataSeries of the sector or industry average of an indicator.

Eugene

#2
* Wealth-Lab Development Guide > Create a Custom Index
* MS123 IndexDefinitions > Attachments > Download Project Source (demo)

LenMoz

#3
If I understand this, I then need to also create a portfolio and custom index for each sector or industry of interest. Is that correct?


Eugene

#4
Yes. The built-in "Aggregate Indicator" in Index-Lab may be sufficient.

superticker

#5
If all you want to do is aggregate price grouped by sector and industry (based on US GICS definitions), that's already available for Fidelity data download. See https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance for the ticker symbols broken down by sector. Click on the INDUSTRY link to get the US GICS symbols for the industries.

There's some WL code written to scrape those pages for their symbols at https://www.wealth-lab.com/Forum/Posts/Find-the-symbol-for-an-industry-or-sector-index-38910.

If you're interested in crossing individual stocks with their corresponding GICS sector and industry symbol, I've written some code for that. Let me know if you want me to post my GICS cross referencing class to the link above.

---
On a separate note, from a "trading" perspective, I haven't seen much "immediate" price cross-correlation between small- and mid-cap stocks and their respective sector or industry indices. Long-term behavior and large-cap stocks might be different; I can't speak for those.

LenMoz

#6
Eugene/superticker, thanks. All good stuff. Gives me some new ideas to work on these winter nights.

QUOTE:
I haven't seen much "immediate" price cross-correlation...

My intent is to use, say, "MomentumPct of a stock vs its industry" as a Neural Network input. My focus is a subset of Russell 2000 tech stocks.


superticker

#7
QUOTE:
... intent is to use, "MomentumPct of a stock vs its industry"
If you want to average the industry prices first and take their MomentumPct second, then the Fidelity data download of the industry averages is still useful to you. If you want the reverse order, Index-Lab is your only option: it can aggregate (average) the MomentumPct results across all the symbols of a DataSet.
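To make the order-of-operations point concrete, here is a small sketch (in Python/NumPy for brevity rather than WealthScript; the two "stocks" and their prices are made up) showing that the two orders generally give different results:

```python
import numpy as np

def momentum_pct(prices, period):
    """Percent change over `period` bars: 100 * (p[t] - p[t-period]) / p[t-period]."""
    p = np.asarray(prices, dtype=float)
    return 100.0 * (p[period:] - p[:-period]) / p[:-period]

# Two made-up industry members at very different price levels
stock_a = np.array([10.0, 11, 12, 11, 13, 14])
stock_b = np.array([200.0, 198, 205, 210, 208, 215])
period = 2

# Order 1: average the prices first, then take MomentumPct
# (what a pre-built industry price index gives you)
mom_of_avg = momentum_pct((stock_a + stock_b) / 2, period)

# Order 2: take MomentumPct per stock, then average (the Index-Lab route)
avg_of_mom = (momentum_pct(stock_a, period) + momentum_pct(stock_b, period)) / 2

# The two differ: order 1 is dominated by the high-priced stock_b
print(mom_of_avg)
print(avg_of_mom)
```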

The signal-processing question remains: which order is best? Please post your answer when you figure that out. I'm thinking the latter, Index-Lab order would best preserve the energies of the MomentumPct, but that doesn't necessarily mean it would produce the highest cross correlation between the stock and industry MomentumPct series; that's another question.

---
On a separate note, I looked in MathNet for a cross-correlation function in its signal processing library, but didn't find one. The only thing it has is an FIR filter, which could work, but you would have to flip the convolution kernel (the second, industry vector) around before calling it, and it may not perform the zero padding on the ends that you would expect from a real cross-correlation implementation. I guess you could code your own. I found a solution that could be turned into a WL cross-correlation indicator component: https://stackoverflow.com/questions/46419323/cross-correlation-using-mathdotnet-c-sharp The cross-correlation function should return a double[] array with zero-lag time at the center of the abscissa.
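As a reference point, here is a minimal sketch of such a cross-correlation (in Python/NumPy rather than MathNet or C#, purely to illustrate the shape of the result), with zero padding at the ends and the zero-lag point at the center of the abscissa:

```python
import numpy as np

def cross_correlation(x, y):
    """Linear cross-correlation of two equal-length series.

    Returns (lags, values) with lags running from -(n-1) to +(n-1),
    so the zero-lag point sits at the center of the abscissa; the ends
    are implicitly zero-padded by mode="full".
    """
    xc = np.asarray(x, float) - np.mean(x)   # remove means so the peak
    yc = np.asarray(y, float) - np.mean(y)   # reflects co-movement, not level
    values = np.correlate(xc, yc, mode="full")
    lags = np.arange(-(len(xc) - 1), len(xc))
    return lags, values

# Synthetic check: y repeats x three bars later, so the peak should sit
# three bars off center (a negative lag means the first series leads,
# under NumPy's correlate sign convention)
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = np.roll(x, 3)
lags, values = cross_correlation(x, y)
print(lags[np.argmax(values)])  # -3
```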

LenMoz

#8
Thanks again. To your question...
QUOTE:
The signal processing question remains, which order is best?
I see little value in MomentumPct of the average price because it weights high-priced stocks more heavily (and high-priced does not equal high market cap). So I agree, the latter order.

QUOTE:
... flip the convolution kernel around (the second, industry vector) before calling it. And it may not perform the zero padding on the ends like you would expect to see in a real cross-correlation implementation.
I don't understand any of that, but that's ok. I'll use my veteran correlation WL script to find NN inputs that correlate to future gain/loss. No external components required.


superticker

#9
QUOTE:
I don't understand ...
The primary purpose of the cross-correlation operation is to determine what came first, the chicken or the egg. In other words, it's to identify leading indicators. For example, if I cross correlate the price of gold with the price of oil, and find up spikes in gold prices lag oil by four bars, then I can say oil is a leading indicator of gold by four days. This leading relationship is likely to always hold true, so if you immediately buy gold when oil spikes, you're sure to make a profit--and you can bet the farm on it. And you can perform the same analysis on a stock and its industry index. Which is the leading indicator, the stock or its industry index?

But the value of the highest point in the cross correlation is also a measure of how strong the correlation is between two time vectors. And that's how you might use the cross correlation in this case. You can also integrate the area around the highest point in the cross correlation to get a similar measure of correlation strength between two time vectors.

We use cross correlation all the time in time series analysis (signal processing), but I don't see it used much in financial analysis. I don't know why.

In contrast, the correlations we perform in statistics, such as R or R-squared, are not time dependent like the cross correlation. These statistical correlations are measured against time-independent values--time is not a factor here. Don't confuse the two.
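To illustrate the distinction, a small synthetic sketch (Python/NumPy; the "oil" and "gold" series here are random stand-ins, not real prices): plain Pearson R at zero shift sees almost nothing, while the cross-correlation peak recovers both the relationship and its timing:

```python
import numpy as np

# Two synthetic series with a strong but *delayed* relationship:
# "gold" echoes "oil" four bars later.
rng = np.random.default_rng(2)
oil = rng.standard_normal(300)
gold = np.concatenate([np.zeros(4), oil[:-4]])

# Time-independent statistical correlation: Pearson R at zero shift
r_zero_lag = np.corrcoef(oil, gold)[0, 1]

# Time-dependent cross-correlation: scan all relative shifts
c = np.correlate(oil - oil.mean(), gold - gold.mean(), mode="full")
lags = np.arange(-(len(oil) - 1), len(oil))
best_lag = lags[np.argmax(c)]

print(round(r_zero_lag, 2))  # close to zero: R alone sees no relationship
print(best_lag)              # -4: oil leads gold by four bars (NumPy's sign convention)
```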

LenMoz

#10
Well, ok, now. Thanks for the explanation. I've been doing something similar without knowing it was called cross correlation. For a variable under study, my correlator script computes R for multiple shifts of future gain. The shifts I use are
CODE:
Please log in to see this code.

superticker

#11
I think you would be better off performing single-bar shifts so you don't miss the best time-shift correlations. Also, computing the correlation coefficient, R, for each shift is overkill for screening purposes, which is what we are doing here. Just multiplying the two time vectors together at each shift, as the cross correlation does, is enough for screening. Once the cross correlation identifies the lag (or lead) time with the highest value (the peak), you can run further statistics at that ideal lag time (such as a correlation coefficient, R) to evaluate how strong a solution you actually have.

In summary, [1] use the cross correlation to screen for the best lag time, then [2] compute a correlation coefficient, R, (or some other statistic) to evaluate how good that best lag time really is. Happy computing.
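A rough sketch of that two-step procedure (Python/NumPy for brevity; the series and the `max_lag` window are hypothetical choices, not anything from the scripts above):

```python
import numpy as np

def screen_then_confirm(x, y, max_lag=30):
    """[1] Screen: raw products of the shifted, mean-removed series
    (an unnormalized cross-correlation) to find the best lag.
    [2] Confirm: compute Pearson R once, at that winning lag only."""
    xc = np.asarray(x, float) - np.mean(x)
    yc = np.asarray(y, float) - np.mean(y)
    n = len(xc)

    # [1] one multiply-and-sum per candidate lag (y trailing x by `lag` bars)
    scores = {lag: float(np.dot(xc[:n - lag], yc[lag:])) for lag in range(max_lag + 1)}
    best_lag = max(scores, key=scores.get)

    # [2] the expensive statistic, run only at the screened lag
    r = np.corrcoef(xc[:n - best_lag], yc[best_lag:])[0, 1]
    return best_lag, r

# Synthetic check: y echoes x seven bars later, plus noise
rng = np.random.default_rng(3)
x = rng.standard_normal(500)
y = np.concatenate([np.zeros(7), x[:-7]]) + 0.2 * rng.standard_normal(500)
lag, r = screen_then_confirm(x, y)
print(lag, round(r, 2))  # lag 7, with R close to 1
```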

LenMoz

#12
QUOTE:
I think you would be better off performing single-bar shifts so you don't miss the best time-shift correlations.
Let me clarify. I'm trying to determine whether an indicator is short term (5 bars) or longer term (150 bars) predictive. My next step is to gather all the best 5-bar predictors as inputs to a 5-bar predictive Neural Network. Similarly for the other bar intervals.

Finding the best correlation, bar by bar, also raises the red flag of overfitting.


superticker

#13
I get the idea now. I like the idea of using a Neural Network because it will fit a non-linear, discontinuous function, which we have here. So far I've stuck with linear, parametric methods because they are more reproducible. But theoretically, a NN should work better for this fuzzy problem.

If you're cross correlating 700+ bar time-series vectors (~3 yrs of daily bars), I wouldn't worry about overfitting as far as the lag-time determination goes. The evaluation on the NN side of the model may be a different story. I'm not even sure how you determine the number of degrees of freedom (DF) in an NN model.... Perhaps counting the nodes in the center NN layer would give you the "minimum" model DF, but that's not that helpful. You really need the worst-case (maximum) DF of the model to compute the DF for error.

If you have a DF of at least five for random error, then you're not overfitting.

(Total observations [could be the #of NN inputs in your case]) - (DF of the model) = (DF of error)

In your case, the number of inputs into your NN system "could be" the choke point for independent observations; I can't be sure because I don't know the configuration. But if you have 5 center nodes in your NN (5+ DF for the model), and you need 5 DF for error, then you need at least 10 inputs to your NN so you're not overfitting. And this is a really rough estimate.
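The DF budget above works out as a trivial check (the node and DF counts are this post's rough estimates, not measured values):

```python
# Worked version of the rough degrees-of-freedom budget
def df_for_error(total_observations, model_df):
    """(Total observations) - (DF of the model) = (DF of error)."""
    return total_observations - model_df

model_df = 5         # e.g. 5 center-layer nodes, taken as a minimum model DF
needed_error_df = 5  # rule of thumb: at least 5 DF left over for random error
min_inputs = model_df + needed_error_df  # -> at least 10 inputs

print(min_inputs, df_for_error(min_inputs, model_df))  # 10 5
```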

I think I now understand why you're trying to build an NN model with more inputs. Interesting, I never thought about that before.

superticker

#14
Now that I better understand what you're doing, I have a question for you. On the code below ...
CODE:
Please log in to see this code.
why aren't you using all prime numbers to avoid all the harmonic interactions (autocorrelation of harmonics)?

For example, if there's an attribute with a 5-day lag correlation, it's also going to correlate at 10, 20, 50, etc. days as well. You could band-stop filter those harmonics out (and we often do), but in this application you have the power to select your lag times, so simply select prime numbers for all of them. That would save you all that band-stop filtering--a big hassle. Why don't you try:
CODE:
Please log in to see this code.
With these prime-number choices, you avoid the harmonics confounding your variables without any band-stop notch filtering. Disclaimer: I'm not suggesting my choices of prime numbers will "maximize" the orthogonality of your lag variable components; you should talk to a mathematician who studies prime numbers about that. We're really off topic, and should move this discussion to a signal processing forum or private email.
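A quick way to see the harmonic-collision point (Python sketch; since the actual lag lists above are behind a login, both lists here are hypothetical stand-ins):

```python
# Harmonic-collision check: with composite lags, a genuine 5-bar cycle also
# registers at 10, 20, 50... because those lags are exact multiples of 5;
# prime lags never divide one another, so those collisions vanish.
composite_lags = [5, 10, 20, 50, 100, 150]  # hypothetical, harmonically related
prime_lags = [5, 11, 23, 53, 101, 151]      # hypothetical prime choices

def harmonic_collisions(lags):
    """Pairs (a, b) where b is an exact multiple of a, i.e. b sits on a harmonic of a."""
    return [(a, b) for a in lags for b in lags if a < b and b % a == 0]

print(harmonic_collisions(composite_lags))  # many pairs, e.g. (5, 10), (5, 20), ...
print(harmonic_collisions(prime_lags))      # []
```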

LenMoz

#15
You are correct in that this is off topic.
QUOTE:
why aren't you using all prime numbers ...
Intuitively (and only that) I don't think the signal processing model fits this problem.