Data Tool - utility for data validation, truncation, DataSet cleanup
Author: Eugene
Creation Date: 9/1/2011 4:23 AM

Eugene

#1
The Data Tool is a handy new utility that integrates itself into Wealth-Lab's Data Manager, making it effortless to perform the following DataSet operations:

* Truncate last N bars of data
* Wipe the entire DataSet data
(to help reload the data for an entire DataSet from scratch)
* Remove inactive (dead) symbols
* Remove selected symbols
* Change symbol
* Data validity check


Check out the online guide in the Wealth-Lab Wiki:

Data Tool

To install into Wealth-Lab 6.3+:

Direct link

Feel free to leave your suggestions w.r.t. this new tool here.

ecorderob

#2
Nice tool,

Indeed I have to say that yesterday you released quite nice things. Thanks.

One suggestion about it. Could it be possible to select only some of the symbols in a data set and remove their data (not the whole dataset)?
I was thinking that it would be nice to remove the symbols from the datasets which have not been updated in the last year (but have some data), so one option could be to delete their data and then remove symbols without data.

Just an idea ;)

Eugene

#3
Thanks.

It sounds possible (to remove selected symbols' data only), but I think there's a better, single-step solution: allow "Remove inactive" to exclude such symbols from the DataSet, in addition to wiping out the symbols with 0 bars.

My only concern is how to define a dead/inactive symbol without introducing more GUI options. When exactly does a stock stop being "very illiquid" or in a "trading halt" and become "delisted": a month, a quarter, a year since its last trade?

ecorderob

#4
One option could be to add a date to "remove inactive", so that if no updates have been received since that date, the symbol would be deleted. The user would be the one who decides to delete those, being aware of the consequences.

Obviously it is not the most complete solution, but it would probably be the easiest for you.

Eugene

#5
..."without introducing more GUI options".

Cone

#6
What's the concern of adding more GUI?

A right click context menu would be handy for deleting and changing symbols.

1. You could sort by last date, right click, "Delete Selected items"
2. For a single selection, "Change Symbol".

Just throwing it out there (but I'd rather you started working on the "Top Secret" Visualizer)



Eugene

#7
The idea behind not adding more GUI is simplicity where possible.

Thank you both for your suggestions, I'll try incorporating these features when time comes.

Eugene

#8
QUOTE:
A right click context menu would be handy for deleting and changing symbols.

...and perhaps reloading their data.

Eugene

#9
OK, let there be more GUI. Following ecorderob's suggestion, the tool will provide a cutoff date input field to allow cleaning DataSets from symbols that stopped trading before the specified date.

Cone's ideas will also be added:
QUOTE:
1. You could sort by last date, right click, "Delete Selected items"
2. For a single selection, "Change Symbol".

Eugene

#10
Data Tool has been updated to version 2011.11.

Highlights:

* Added a context menu with two options:

1. Change symbol - renames a single highlighted symbol.
2. Remove selected symbols - removes highlighted symbols from a given DataSet without affecting their data.

* Improved usability of remove inactive symbols feature by adding a DateTime control for specifying a cutoff date. Here's how it affects the tool's behavior:

1. When Date filter is inactive (default/old behavior): clears selected DataSet of any existing symbols with 0 bars
2. When a cutoff date was specified and the checkbox is engaged: additionally, removes any symbols that stopped trading before that date.
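
The two behaviors above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this post, not the tool's actual code; the function name `remove_inactive` and the dictionary layout are assumptions made for the example.

```python
from datetime import date
from typing import Dict, List, Optional


def remove_inactive(last_bar_dates: Dict[str, Optional[date]],
                    cutoff: Optional[date] = None) -> List[str]:
    """Return the symbols that stay in the DataSet.

    last_bar_dates maps each symbol to the date of its last bar,
    or to None when the symbol has 0 bars (hypothetical layout).
    """
    kept = []
    for symbol, last_date in last_bar_dates.items():
        if last_date is None:
            continue  # 0 bars: removed with or without a date filter
        if cutoff is not None and last_date < cutoff:
            continue  # date filter engaged: stopped trading before the cutoff
        kept.append(symbol)
    return kept


# Made-up symbols and dates, for illustration only.
dataset = {
    "AAA": date(2011, 10, 31),  # still trading
    "BBB": None,                # no data at all
    "CCC": date(2009, 5, 1),    # last traded in 2009
}

default_result = remove_inactive(dataset)
filtered_result = remove_inactive(dataset, cutoff=date(2011, 1, 1))
```

With the filter off, only the empty symbol is dropped; with a cutoff of 1 January 2011, the symbol that last traded in 2009 goes as well.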

More details in the Wiki.

Thanks for your suggestions guys!

thodder

#11
I have a few cosmetic suggestions...

* When I click on a dataset that has a lot of symbols, it takes a while to respond -- approx 30 seconds for a daily list of S&P 500 symbols. I didn't get an hour glass to show it was busy...

1) Please show hourglass so I know it's doing something;
2) Status bar -- which is shown on Data Sets tab while loading Symbol Details;
3) Optional: cancel button if it can take too long. 2000 symbols (Russell 2000) with 5 min bars may take quite a while.

* Dataset highlight went away when I clicked on the grid. Assuming this is a ListView or TreeView, you may want to set HideSelection = false (why it defaults to true I don't know) so I know what dataset I'm working with.

Overall a very nice tool! These were just some suggestions that I thought would improve the user experience. Nice job!!

Eugene

#12
Thank you for your suggestions Tim. I'll add these items to my queue.

P.S. I only wonder why it might take so long for you. With a typical 7200rpm HDD, it takes a few seconds for me on large DataSets, just like "Symbol Details" would, because showing DataSet details uses pretty standard API calls and apparently there's hardly anything to optimize there.

In what kind of environment is WLP running?

thodder

#13
Eugene,

I have Windows XP (32-bit) with Pentium 4 3.4Ghz with 3G RAM. It took 4 or 5 secs for S&P 100 or NASDAQ 100 daily lists to open the first time. Subsequent times it was fairly instantaneous. The S&P 500 daily was over 20 secs, but fairly quick on subsequent tries.

I think the HDD is 7200rpm.

I didn't expect you to be able to optimize loading the symbols, but showing the status would help a user be patient if it was taking a while to load them.

Eugene

#14
Tim,

I will show hourglass (but skip status bar) and fix HideSelection.

Frankly, PC specs like these are what I expected to see. While optimizing performance is a worthy task and we shall not waste resources, you'll definitely be amazed at what a (groundbreaking) difference a modern CPU makes. I'd expect a ten-bagger speed increase if you load 64-bit WL6 on an i7 with 6+ GB RAM compared to that P4 CPU. Nonetheless, loading S&P 500 on something in the ballpark of a 3-year-old Core 2 Duo is not an issue either.

thodder

#15
Eugene,

I know my computer is a bit old. I got it in 2005, but I've upgraded RAM and HDD since then. For most of the things that I do, this has been fine, but I've been considering getting a new machine. If I get a laptop, I may even consider getting an SSD, which should make loads from disk screaming fast. ;-)

Btw, have you noticed a difference between i3, i5 and i7 processors? I haven't decided on the processor for the new machine, but it would have at least 6G RAM.

Eugene

#16
My hunch tells me you're not after high-octane number crunching with a P4, otherwise you'd have upgraded long ago (of course, I stand corrected if you've outsourced those jobs to CUDA or some other GPGPU flavor). So if I'm right, an i7 might not be cost-effective for your tasks. Something like i3-2100K (or maybe even i5-2xxx) will bring quite a bit of joy and will most likely satisfy you for another half a decade.

thodder

#17
Thanks Eugene, That's good to know.

Eugene

#18
I'm considering introducing a data check feature here in the long run. Something along the lines of EODDataCheck from Community Components.

thodder

#19
On the Truncate button, does it truncate bars for only the selected symbols or all the symbols in the DataSet? You might want to put up a message box stating "# bars from # symbols truncated" as feedback to the user.

Eugene

#20
"Truncate" always truncates the last N bars from all symbols of a selected DataSet.

Selection was a later addition.

I think that a dialog box would be excessive because the table immediately reflects the updated state of the selected DataSet.

thodder

#21
Minor bug: I created a new Yahoo DataSet in Data Manager then updated prices. When I click on Data Tool, I can not see the new DataSet. I have to close Data Manager and reopen it for the Data Tool to reflect the new DataSet.

Very minor issue as I would not expect to do maintenance on a new DataSet right away but I thought I'd mention it when I noticed it. In this case I was just checking the bar count on some of the DataSets.

Btw, I think this is a great addition!

Eugene

#22
QUOTE:
Minor bug:

It's not a bug. Please read documentation before posting: scroll to "Important!"

Eugene

#23
Data Tool has been updated to version 2012.01.

Maintenance release; no highlights.

* Fixed: hourglass cursor missing on lengthy operations (requested by thodder)
* Fixed: DataSet highlight goes away when clicking on the grid (thanks thodder)

fastrade

#24
Selecting a symbol, right-clicking and choosing "Change symbol" doesn't open the new symbol screen (version 2012.01).

Eugene

#25
Because there is no "new symbol screen", and has never been.

Please re-read the online documentation, "Right-click menu", that says:
QUOTE:
Simply type in the new symbol name and hit "Enter".

In other words, the context menu that pops up after a right-click has an input box that contains the highlighted symbol. Type in a new symbol name, hit Enter, voila.

Eugene

#26
Data Tool has been updated to version 2012.02.

Change summary:

* New: Truncate whole "days", not just "bars" (to help reload bad intraday data)
* New: Data validity check feature
* Change: regrouped options into several tabs (redesign)
* Change: by request, made it possible to install in WL 6.1+

The "Data check" takes the best from Cone's "Bad History Check" and my/fundtimer's "EODDataCheck" strategies, adding a Spike Detection option. For the complete list of detectable data errors please see the online guide in the Wiki.

The idea is to make all data-related operations (that WL6 doesn't already have) available where one expects to find them naturally: in the Data Manager, and without having to resort to a Strategy to validate a DataSet or to use Explorer to delete a bad file, and so on.
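
To illustrate the difference between truncating "days" and truncating "bars" on intraday data, here is a minimal Python sketch. It is not the Data Tool's implementation; `truncate_last_days` is a hypothetical helper, and bars are reduced to their timestamps for brevity.

```python
from datetime import datetime
from typing import List


def truncate_last_days(timestamps: List[datetime], days: int) -> List[datetime]:
    """Drop every bar that belongs to the last `days` calendar days present."""
    if days <= 0:
        return list(timestamps)
    distinct_days = sorted({ts.date() for ts in timestamps})
    if days >= len(distinct_days):
        return []  # nothing left to keep
    first_removed = distinct_days[-days]
    return [ts for ts in timestamps if ts.date() < first_removed]


# Three sessions with two 5-minute bars each (made-up sample data).
bars = [
    datetime(2012, 2, 1, 9, 35), datetime(2012, 2, 1, 9, 40),
    datetime(2012, 2, 2, 9, 35), datetime(2012, 2, 2, 9, 40),
    datetime(2012, 2, 3, 9, 35), datetime(2012, 2, 3, 9, 40),
]

after_one_day = truncate_last_days(bars, 1)  # the whole last session is gone
after_one_bar = bars[:-1]                    # "last 1 bar" leaves a partial session
```

Truncating one bar leaves the last session half-filled, whereas truncating one day removes it entirely, so the next update can reload that session in full.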

Sammy_G

#27
I got a chance to test drive the Data Tool today after it became v6.1 compatible. It's a nice tool, but it can be made better.

I have the following observations to make about its Truncate feature:
- If you have a few datasets with non-overlapping symbols - and all of them are currently active - then, and *only* then, it will do its job well as deleting the last bar of data is synonymous with removing the last day's data.
- If you have inactive symbols in any dataset, since the Truncate feature is *not* date sensitive, running it on that dataset will delete the last bar of those symbols also with no hope of recovery as when symbols stop trading their data is not available for download any more. Since some users like to keep inactive symbols for backtesting purposes, this is a problem.
- If you use the Truncate last bar (or 'n' bars) feature across multiple datasets and they have some overlapping symbols, then the bar deletion is incremental (additive) e.g. if you delete last bar data from Dow30 and then again from Nasdaq100 datasets, then you have deleted 1+1 = 2 bars of data from the shared symbols (currently CSCO, INTC, MSFT).

All of these limitations can be avoided by adding the ability to Truncate data AFTER a specified date, as opposed to specifying number of bars; that way, inactive symbols' data will be preserved and there will be no deletion of any more bars than is necessary for active-but-overlapping symbols also. Besides, there are other valid reasons for deleting data forward from a date.
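
The difference between the two approaches can be shown with a small Python sketch. It assumes, as described above, that each symbol has one shared history no matter how many DataSets list it; the function names, the symbol "OLD" and all dates are made up for the example, and "after a date" is taken to mean that the bar on the cutoff date itself is kept.

```python
from datetime import date
from typing import Dict, Iterable, List

Store = Dict[str, List[date]]  # symbol -> sorted bar dates, one history per symbol


def truncate_last_n(store: Store, symbols: Iterable[str], n: int) -> None:
    """Drop the last n bars of each listed symbol, in place."""
    for symbol in symbols:
        bars = store[symbol]
        del bars[max(len(bars) - n, 0):]


def truncate_after(store: Store, symbols: Iterable[str], cutoff: date) -> None:
    """Drop every bar dated later than the cutoff, in place."""
    for symbol in symbols:
        store[symbol][:] = [d for d in store[symbol] if d <= cutoff]


def make_store() -> Store:
    active = [date(2012, 3, d) for d in (12, 13, 14, 15, 16)]
    return {
        "IBM": list(active),                          # only in the first DataSet
        "MSFT": list(active),                         # shared by both DataSets
        "OLD": [date(2009, 1, 5), date(2009, 1, 6)],  # hypothetical inactive symbol
    }


dow30 = ["IBM", "MSFT", "OLD"]
nasdaq100 = ["MSFT"]

by_bars = make_store()
truncate_last_n(by_bars, dow30, 1)
truncate_last_n(by_bars, nasdaq100, 1)   # MSFT loses a second bar

by_date = make_store()
cutoff = date(2012, 3, 15)
truncate_after(by_date, dow30, cutoff)
truncate_after(by_date, nasdaq100, cutoff)  # repeating it changes nothing
```

After the bar-based run, MSFT is two bars short and the inactive symbol has lost a bar it can never get back. After the date-based run, every active symbol ends on the cutoff date and the inactive symbol is untouched.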

I would concurrently note that there are valid reasons for removing data BEFORE a particular date also - but only for individual symbols as opposed to for an entire dataset. Currently, the Data Tool only offers the ability to remove inactive symbols, as opposed to removing data prior to a date for active symbols.

As "The idea is to make all data-related operations...available where one expects to find them naturally..." I would like to suggest that you please add the ability to Truncate data Before/After a user-specified date as well, the former (Truncate Before a date) probably just confined to individual symbols rather than to an entire dataset (for safety reasons) while the latter (Truncate After a date) can be applied to either a symbol or to an entire dataset. This should make the Data Tool more robust and a more comprehensive solution for all data-related needs.

Eugene

#28
Thank you for your observations. First things first, I'll put the 'Truncate After Date' option on my list for evaluation. (The 'Truncate Before' is too exotic and I won't be considering this.)

A couple of remarks:
QUOTE:
Since some users like to keep inactive symbols for backtesting purposes, this is a problem.

This is a problem only if inactive symbols are mixed with tradeable symbols in a single DataSet. Create a DataSet that completely excludes the symbols which stopped trading, and their precious data won't be affected when one massages the actual symbols' data with the Data Tool.
QUOTE:
- If you use the Truncate last bar (or 'n' bars) feature across multiple datasets and they have some overlapping symbols, then the bar deletion is incremental (additive)

Of course. It's up to you to care about that. The Data Tool operates with the raw data in individual DataSets, and there's no way for it to think what's going on 'outside'. Analogy: you can have dozens of Windows shortcuts to a file (read: DataSets), but if the file (read: a .WL file on HDD) was deleted, they all have become obsolete.

Sammy_G

#29
To know when a symbol becomes inactive is not an easy task, especially when you are talking of 1000's of symbols. It is further compounded by the fact that even active symbols become temporarily inactive due to a variety of reasons - lack of interest, pending FDA approval (for drug companies), and so on. No one can, or should, be expected to stay on top of every symbol. At least, some of us have a life outside trading.

Data truncation using date is pretty much the accepted norm. You can offer truncation using bars as an additional method but not in lieu thereof.

Eugene

#30
'Truncate After Date' will be implemented as an option (next release).

Sammy_G

#31
Good to hear that!

Eugene

#32
Data Tool has been updated to version 2012.03.

Change summary:

* New: "Truncate after date" option (requested by Sammy_G)

Eugene

#33
Data Tool has been updated to version 2012.05. Update to Wealth-Lab 6.3+ to be able to install/update the extension.

It comes with a considerable usability enhancement: reopening the Data Manager or restarting Wealth-Lab is no longer required after making certain changes to DataSets.

hankt

#34
I need to delete erroneous pricing data from Yahoo for many tickers. I have been here and found the reverse of what I need (truncate, delete data forward):
http://www2.wealth-lab.com/WL5Wiki/DataTool.ashx?HL=data

How do I do the opposite of truncate, delete a couple of years of daily pricing from a date to the start date of that data set?

Also, a logic check for price reported with no volume should be in the requirements list for the next version of Data Tool.

My previous need is not applicable to a data set, but on a few tickers within a data set.

Thanks,

Hank

Eugene

#35
QUOTE:
How do I do the opposite of truncate, delete a couple of years of daily pricing from a date to the start date of that data set?

This exact option is not considered for the Data Tool, but the closest thing is the "Remove all data" button. It will wipe out the data for the instrument so you could reload it from scratch next time you do a regular DataSet update in the Data Manager.

For Yahoo provider, there's an alternative way: specify the Starting Date for the new DataSet (at the time of creation) so that it won't include the erroneous period.


Eugene

#36
QUOTE:
Also, a logic check for price reported with no volume should be in the requirements list for the next version of Data Tool.

Thank you for the suggestion. An existing rule in the data validity checker considers a bar erroneous when its Volume < 0. I may add "No volume i.e. Volume = 0" to the list of erroneous bar rules in a future version.
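
As a rough illustration of the two rules mentioned here (Python, not the Data Tool's own code), a checker could flag negative volume unconditionally and treat zero volume as an optional rule. The function name `flag_volume_errors` and the `flag_zero` switch are assumptions made for this sketch.

```python
from typing import List, Sequence


def flag_volume_errors(volumes: Sequence[float], flag_zero: bool = False) -> List[int]:
    """Return the indices of bars whose volume breaks a rule.

    Volume < 0 is always an error; Volume == 0 only when flag_zero is set.
    """
    flagged = []
    for index, volume in enumerate(volumes):
        if volume < 0 or (flag_zero and volume == 0):
            flagged.append(index)
    return flagged


# Made-up volume series: one negative bar and two zero-volume bars.
volumes = [125000, 0, 98000, -1, 0, 110500]

strict_only = flag_volume_errors(volumes)
with_zero_rule = flag_volume_errors(volumes, flag_zero=True)
```

Keeping the zero-volume rule optional means bars with legitimately missing or zero volume are reported only when the user asks for them.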

Sammy_G

#37
hankt, you may wish to look at this thread: Data Truncation

hankt

#38
QUOTE:

How do I do the opposite of truncate, delete a couple of years of daily pricing from a date to the start date of that data set?

This exact option is not considered for the Data Tool, but the closest thing is the "Remove all data" button. It will wipe out the data for the instrument so you could reload it from scratch next time you do a regular DataSet update in the Data Manager.

For Yahoo provider, there's an alternative way: specify the Starting Date for the new DataSet (at the time of creation) so that it won't include the erroneous period.


That won't work as it is a ticker issue, not a data set issue.

QUOTE:

Also, a logic check for price reported with no volume should be in the requirements list for the next version of Data Tool.

Thank you for the suggestion. An existing rule in the data validity checker considers a bar erroneous when its Volume < 0. I may add "No volume i.e. Volume = 0" to the list of erroneous bar rules in a future version.


Yes, not picking up erroneous tick data as volume = 0 would help in most cases. I notice that when tickers/symbols are recycled for IPOs, the old data is nearly always at zero volume.


hankt

#39
Eugene,

I just looked at some of the data that I previously purchased from EOD Data. Volume is sometimes missing for an entire year's worth of data, so we may want to remove selectively rather than exclude across the board.

Eugene

#40
Hank,

Missing volume is a nuisance but the OHLC data is there. In a future build, zero volume will be an option of the data check procedure, but selective removing is not on the radar. Zero volume is possible, it's not a definite error in the data.

hankt

#41
Eugene,

Logically we can have no new price if zero volume. I understand the need to work around failures in data providers however.

I'm looking at the chart code that Sammy G wrote, which enables a truncate from a point in time backward, to help me. I haven't gotten it to work on a ticker yet, but it looks like it should work.

Eugene

#42
QUOTE:
Logically we can have no new price if zero volume.

Good point, but it could be a limit up/down day rather than purely a data glitch, in which case you might be throwing the baby out with the bathwater. Some feeds are guilty (for example, the upcoming Morningstar data provider has fake bars with zero volume), but with the upcoming ability to detect zero-volume bars in the Data Tool, you'll have the list of bars to delete manually.

Eugene

#43
What's new in the Data Tool version 2012.06:

* Change: added "No trading (Volume = 0)" rule to the erroneous bar rule list (suggested by hankt)

Eugene

#44
sedelstein asked:

QUOTE:
In the Data Tool there is a section on "Supported" static data and "Compatible" data in Green and "Incompatible" data in red

I am unclear on Supported vs Unsupported and Compatible vs Incompatible. In the example on the page, GOOG and AAPL are compatible but MSFT is not

I would think that the data for MSFT would be compatible

Compatible: BBFree, CBOE, Fidelity, Finam, Forexite, Google, IQFeed, Market Sentiment, MSN, QuoteMedia, PiTrading, Random, TradingBlox, Yahoo

since the data can easily be gotten from Fidelity. I'm confused as to why it is in red. I am trying to understand the sources of the data for my datasets.

Eugene

#45
You're confusing "compatibility" with "highlighting". ;)

Only data providers can be compatible or incompatible (most are compatible). Particular stocks do not have anything to do with Data Tool compatibility. Think of the Quotes pane as an easy screener of a DataSet's performance on the last bar. GOOG and AAPL are green on the Wiki screenshot simply because their latest close was up (or unchanged), while MSFT is red because it lost a few cents that day.
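
The highlighting rule boils down to a comparison of the last two closes. A minimal Python sketch, with a hypothetical function name and made-up prices:

```python
def quote_highlight(previous_close: float, latest_close: float) -> str:
    """Green when the latest close is up or unchanged, red when it is down."""
    return "green" if latest_close >= previous_close else "red"


# Made-up closing prices, for illustration only.
up = quote_highlight(100.00, 101.25)
flat = quote_highlight(50.00, 50.00)
down = quote_highlight(30.10, 30.02)
```

The color says nothing about which provider the data came from.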

sedelstein

#46
Got it

The reason for the confusion is that the word "Incompatible" is in red (and so is Microsoft) and the word "Compatible" is in green.


Eugene

#47
What's new in the Data Tool version 2014.05:

* New: added "Truncate before date" option (requested by customers)


Eugene

#48
What's new in the Data Tool version 2014.11:

* Change: option to use U.S. market holidays when making data check for missing bars
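
To show why a holiday calendar matters for a missing-bar check, here is a small Python sketch for daily data. It is not the Data Tool's algorithm; `missing_trading_days` is a hypothetical helper, and the caller supplies the holiday set rather than the code deriving it.

```python
from datetime import date, timedelta
from typing import Iterable, List, Optional, Set


def missing_trading_days(bar_dates: Iterable[date],
                         holidays: Optional[Set[date]] = None) -> List[date]:
    """List weekdays between the first and last bar that have no bar.

    Dates in `holidays` are not reported as missing.
    """
    present = set(bar_dates)
    if not present:
        return []
    skipped = holidays or set()
    day, last = min(present), max(present)
    missing = []
    while day <= last:
        if day.weekday() < 5 and day not in skipped and day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Thanksgiving week 2014: the market was closed on Thursday, 27 November.
bars = [date(2014, 11, 24), date(2014, 11, 25), date(2014, 11, 26), date(2014, 11, 28)]

without_calendar = missing_trading_days(bars)
with_calendar = missing_trading_days(bars, holidays={date(2014, 11, 27)})
```

Without the calendar the holiday is reported as a gap; with it the series checks out clean.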


gbullr

#49
Is the Data Tool still available?

Cannot find it in data manager tab.

Version 6.8.10.0

Thank you.


Eugene

#50
Sure it's still available and will also get upgraded to v2015.06 shortly to fix a minor issue.

If you cannot find it, you either haven't installed the extension (see link in the top post) or are subject to the scenario explained in this FAQ: Extension installed in Wealth-Lab doesn't show up after restarting application.

Eugene

#51
What's new in the Data Tool version 2015.06:

* Fix: "Remove inactive symbols" and "Change Symbol" were deleting all symbols from the DataSet if the provider does not support DataSet modification

kendalab

#52
Recently started using Data Tool to remove symbols with too few bars causing an index issue, however when I remove the symbols from the dataset and then rerun the script they still cause an error until I restart WL. Is there a way to refresh the buffered data without restarting WL?

I realize this isn't unique to Data Tool, because I can replace symbols in a dataset by cut and paste under the "Data Sets" tab and have the same problem of scripts using the old buffered symbols.

Eugene

#53
This can happen if you did not close the Strategy window being backtested on that DataSet before applying changes. Everything works correctly if you close the Strategy window(s) before removing inactive symbols with the Data Tool.

kendalab

#54
Thanks, I thought I had deleted symbols before and remembered it working, so that would explain why it appeared inconsistent.

mkbryan

#55
Hi Eugene, The "truncate before date" feature hangs on one minute data. I experienced this with a 1m Fidelity dataset containing one symbol (.SPX) - the date was 7/1/2013. Please note this was done on a fresh install of WLP 6.9 and a fresh install of the DataTool extension v2015.06

Please advise. Thanks.

Eugene

#56
Hi Marc,

There must be a million or more bars of data to process, so it should be a very time-consuming operation (if it succeeds), especially if you're not an SSD user and/or your PC isn't fast enough. I wouldn't use it on huge intraday data files. There's nothing to optimize in the code, so if it doesn't work, try it with EOD instead.

mkbryan

#57
Thanks for the reply. The Fidelity .SPX 1m file has about 867K bars - not more than a million. I let the Truncate operation run for well over an hour. The WL file was never modified (based on the modified date in Windows Explorer), WLP was using only 50% of the CPU, and disk activity was nil (except for the occasional Windows OS 8.1 services). Even the page file (which is on a separate partition), never grew - nor did memory. In short, there was no sign of progress.

My goal is a base 1m dataset for the S&P500 index without data gaps. Fidelity's 1m bars for .SPX have gaps (I have a complete list) - and unfortunately Fidelity has made no fixes (yes, I tried them first). Without good 1m data, all higher scales have errors (although some may be "inside" the range of higher scale bars so the data gaps are effectively hidden). The last 1m data gap in Fidelity .SPX is 8/7/2015 - less than a year ago. I intend to run strategy verification for intraday trading on at least one year of data (and waiting until this August is a bit too far away even assuming there will be no more gaps in Fidelity's data from today forward). I am hoping to use Google static 1m data to add missing bars to Fidelity 1m data should my strategy detect any more gaps after 8/7/2015.

Any ideas on what is happening with the apparent Truncate "hang"?

Thanks.

Eugene

#58
As a workaround, try creating a new Bars object in a .WL file using a WealthScript Strategy. See QuickRef (Bars object > SaveToFile & LoadFromFile) and this thread: How to create a DataSet from SaveToFile() *.WL data.

mkbryan

#59
Got it. Thanks.

Eugene

#60
What's new in the Data Tool version 2015.11:

* Change: Spike detection (Data validity check) is disabled on Yahoo datasets
* Change: possible to install in WL 6.8+, requires .NET 4.5