How to create indicator sector/industry averages?

Author: LenMoz

Creation Date: 12/17/2017 12:44 PM

* Wealth-Lab Development Guide > Create a Custom Index

* MS123 IndexDefinitions > Attachments > *Download Project Source (demo)*

If I understand this, I then need to also create a portfolio and custom index for each sector or industry of interest. Is that correct?

If all you want to do is aggregate price grouped by sector and industry (based on US GICS definitions), that's already available for Fidelity data download. See https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/si_performance.jhtml?tab=siperformance for the ticker symbols broken down by sector. Click on the INDUSTRY link to get the US GICS symbols for the industries.

There's some WL code written to scrape those pages for their symbols at https://www.wealth-lab.com/Forum/Posts/Find-the-symbol-for-an-industry-or-sector-index-38910.

If you're interested in crossing individual stocks with their corresponding GICS sector and industry symbol, I've written some code for that. Let me know if you want me to post my GICS cross referencing class to the link above.

---

On a separate note, from a "trading" perspective, I haven't seen much "immediate" price cross-correlation between small- and mid-cap stocks and their respective sector or industry. Long-term and large-cap stocks might be different; I can't speak for those.

Eugene/superticker, Thanks. All good stuff. Gives me some new ideas to work on these winter nights.

QUOTE:

I haven't seen much "immediate" price cross-correlation...

My intent is to use, say, "MomentumPct of a stock vs its industry" as a Neural Network input. My focus is a subset of Russell 2000 tech stocks.

QUOTE:

... intent is to use, "MomentumPct of a stock vs its industry"

If you want to take the average of the industry prices, there are two possible orders: average the prices first and then take MomentumPct of that average, or take MomentumPct of each stock first and then average those values (the Index-Lab order).

The signal processing question remains, which order is best? Please post your answer when you figure that out. I'm thinking the latter Index-Lab order would preserve the energies of the MomentumPct best, but that doesn't necessarily mean it would produce the highest cross correlation of the stock and industry MomentumPct series vectors; that's another question.

---

On a separate note, I looked in MathNet for a cross-correlation function in its signal processing library, but didn't find one. The only thing it has is an FIR filter, which could work, but you would have to flip the convolution kernel around (the second, industry vector) before calling it. And it may not perform the zero padding on the ends like you would expect to see in a real cross-correlation implementation. I guess you could code your own; I found a solution that could be turned into a WL cross-correlation function.
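For anyone curious about the mechanics, here's a rough sketch (in Python/NumPy rather than WL C#, and the function name is my own) of how flipping the kernel turns a convolution into a cross-correlation, with "full" mode supplying the zero padding on the ends:

```python
import numpy as np

def cross_correlate(x, y):
    """Full cross-correlation of x against y.

    Flipping y turns the convolution into a correlation, and "full"
    mode zero-pads both ends, which a plain FIR-filter call would
    not do for you automatically.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.convolve(x, y[::-1], mode="full")

# Sanity check against NumPy's own correlate in "full" mode.
x = [1.0, 2.0, 3.0]
y = [0.0, 1.0, 0.5]
result = cross_correlate(x, y)
assert np.allclose(result, np.correlate(x, y, mode="full"))
```

A C# version for WL would follow the same two steps: reverse the second vector, then run the full (zero-padded) convolution.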

Thanks again. To your question...

QUOTE:

The signal processing question remains, which order is best?

I see little value in MomentumPct of the average price because it weights high-priced stocks more heavily (and high-priced does not equal high market cap). So I agree, the latter order.

QUOTE:

... flip the convolution kernel around (the second, industry vector) before calling it. And it may not perform the zero padding on the ends like you would expect to see in a real cross-correlation implementation.

I don't understand any of that, but that's ok. I'll use my veteran correlation WL script to find NN inputs that correlate to future gain/loss. No external components required.

QUOTE:

I don't understand ...

The primary purpose of the cross-correlation operation is to determine what came first, the chicken or the egg. In other words, it's to identify leading indicators. For example, if I cross correlate the price of gold with the price of oil, and find that up spikes in gold prices consistently lead up spikes in oil prices, then gold is a leading indicator for oil.

But the value of the highest point in the cross correlation is also a measure of how strong the correlation is between two time vectors. And that's how you might use the cross correlation in this case. You can also integrate the area around the highest point in the cross correlation to get a similar measure of correlation strength between two time vectors.
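A quick illustrative sketch of those two strength measures (Python/NumPy for brevity; `xcorr_strength` and the window width are my own choices, not a standard API):

```python
import numpy as np

def xcorr_strength(x, y, window=2):
    """Return (peak lag, peak height, area around the peak) of the
    normalized cross-correlation of two equal-length series."""
    xc = np.asarray(x, float) - np.mean(x)
    yc = np.asarray(y, float) - np.mean(y)
    cc = np.correlate(xc, yc, mode="full")
    cc /= np.linalg.norm(xc) * np.linalg.norm(yc)   # scale roughly to [-1, 1]
    lags = np.arange(-(len(yc) - 1), len(xc))       # lag of x relative to y
    peak = int(np.argmax(np.abs(cc)))
    lo, hi = max(0, peak - window), min(len(cc), peak + window + 1)
    return int(lags[peak]), float(cc[peak]), float(np.sum(cc[lo:hi]))

# A series versus a 3-bar-delayed copy of itself should peak at lag 3.
rng = np.random.default_rng(0)
base = rng.standard_normal(200)
lag, height, area = xcorr_strength(base[:-3], base[3:])
```

Here the peak height answers "how strong," the peak position answers "who leads by how much," and the windowed sum is the area-around-the-peak variant.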

We use cross correlation all the time in time series analysis (signal processing), but I don't see it used much in financial analysis. I don't know why.

In contrast, the correlations we perform in statistics, such as R or R-squared, are computed at a single, fixed alignment of the two vectors, so they measure how strong a relationship is but say nothing about which series leads.

Well, ok, now. Thanks for the explanation. I've been doing something similar without knowing it was called cross correlation. For a variable under study, my correlator script computes R for multiple shifts of future gain. The shifts I use are

CODE:

Please log in to see this code.

I think you would be better off performing single-bar shifts so you don't miss the best time-shift correlations. Also, computing the correlation coefficient, R, for each shift is overkill for *screening* purposes, which is what we are doing here. Just multiplying the two time vectors together (as you're shifting) is enough for screening, which is exactly what the cross correlation does. Once the cross correlation identifies the lag (or lead) time with the highest value (peak max), then you can run further statistics on that ideal lag time (such as a correlation coefficient, R) to evaluate how strong a solution you actually have.

In summary, [1] use the cross correlation to screen for the best lag time, then [2] compute a correlation coefficient, R, (or some other statistic) to evaluate how good that best lag time really is. Happy computing.
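That two-step recipe can be sketched in a few lines (Python/NumPy stand-in for a WL script; `screen_then_r` is a made-up name, and the simulated data simply assumes an indicator that leads gain by 7 bars):

```python
import numpy as np

def screen_then_r(indicator, future_gain, max_lag=20):
    """[1] Screen candidate lags with a cheap shifted dot product
    (the cross-correlation idea), then [2] compute Pearson R only
    at the winning lag."""
    x = np.asarray(indicator, float)
    y = np.asarray(future_gain, float)
    scores = []
    for lag in range(1, max_lag + 1):
        a, b = x[:-lag], y[lag:]          # indicator leads gain by `lag` bars
        scores.append(np.dot(a - a.mean(), b - b.mean()))
    best_lag = 1 + int(np.argmax(np.abs(scores)))
    a, b = x[:-best_lag], y[best_lag:]
    r = float(np.corrcoef(a, b)[0, 1])    # the full statistic, computed once
    return best_lag, r

# Simulated example: an indicator that predicts gain 7 bars ahead.
rng = np.random.default_rng(1)
ind = rng.standard_normal(500)
gain = np.concatenate([rng.standard_normal(7),
                       ind[:-7] + 0.1 * rng.standard_normal(493)])
best_lag, r = screen_then_r(ind, gain)
```

The screening loop is just multiply-and-sum per shift; the expensive statistic runs only once, at the winner.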

QUOTE:

I think you would be better off performing single-bar shifts so you don't miss the best time-shift correlations.

Let me clarify. I'm trying to determine whether an indicator is short-term (5 bars) or longer-term (150 bars) predictive. My next step is to gather all the best 5-bar predictors as inputs to a 5-bar predictive Neural Network. Similarly for the other bar intervals.

Finding the best correlation, bar by bar, also raises the red flag of overfitting.

I get the idea now. I like the idea of using a Neural Network because it will fit a non-linear, discontinuous function, which we have here. So far I've stuck with linear, parametric methods because they are more reproducible. But theoretically, an NN should work better for this fuzzy problem.

If you're cross correlating 700+ bar time-series vectors (~3 yrs of daily bars), I wouldn't worry about overfitting as far as the lag-time determination part of it. The evaluation on the NN side of the model may be a different story, though. I'm not even sure how you determine the number of degrees of freedom (DF) in an NN model. Perhaps counting the nodes in the center NN layer would give you the "minimum" model DF, but that's not very helpful. You really need to determine the worst-case (maximum) DF for the model to compute the DF for error.

If you have a DF of at least five for random error, then you're not overfitting.

(Total observations [could be the # of NN inputs in your case]) - (DF of the model) = (DF of error)

In your case, the number of inputs into your NN system "could be" the choke point for independent observations; I can't be sure because I don't know the configuration. But if you have 5 center nodes in your NN (5+ DF for the model), and you need 5 DF for error, then you need at least 10 inputs to your NN so you're not overfitting. And this is a really rough estimate.

I think I now understand why you're trying to build an NN model with more inputs. Interesting, I never thought about that before.

Now that I better understand what you're doing, I have a question for you. On the code below ... why aren't you using all prime numbers to avoid all the harmonic interactions (autocorrelation of harmonics)?

For example, if there's an attribute with a 5-day lag correlation, it's also going to correlate at 10-, 20-, 50-, etc. days as well. You could band-stop filter those harmonics out (and we often do that), but in this application you have the power to select your lag times, so simply select prime numbers for all of them. That would save you all that band-stop filtering--big hassle. Why don't you try:

CODE:

Please log in to see this code.

With these prime number choices, you avoid the harmonics confounding all your variables without the band-stop notch filtering. Disclaimer: I'm not suggesting my choices in prime numbers will "maximize" the orthogonality of your lag variable components. You should talk to a mathematician who studies prime numbers about that. We're really off topic, and should move this discussion to a signal processing forum or private email.
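A tiny check of the prime-lag idea (Python sketch; `has_harmonic_overlap` is my own helper, and "harmonic overlap" here just means one lag being an integer multiple of another):

```python
def has_harmonic_overlap(lags):
    """True if any lag is an integer multiple of a smaller lag,
    i.e. the two lag variables would share harmonics and confound
    each other."""
    s = sorted(lags)
    return any(b % a == 0 for i, a in enumerate(s) for b in s[i + 1:])

# Round-number lags share harmonics; distinct primes never do,
# since no prime divides another.
assert has_harmonic_overlap([5, 10, 20, 50])
assert not has_harmonic_overlap([5, 11, 23, 47])
```

Of course, avoiding exact multiples doesn't guarantee full orthogonality of the lag variables, as the disclaimer above says.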