I use web scraping to collect financial data from a number of websites (Yahoo, Nasdaq,…). Web scraping is more complex than the direct data importing I described in Part 2, but it:
- Offers far greater flexibility: it allows you to automatically browse to the page and pinpoint only the data you need.
- Allows you to collect data that is not presented in table format.
- Allows you to read data behind what the site is showing (e.g., for ratings, to transform stars into numbers).
- Allows you to include your login and password in the script, so you are granted access to your data just as if you were logging in by hand.
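To illustrate the "data behind what the site is showing" point: many rating widgets display star icons but carry the numeric value in an HTML attribute. Here is a minimal Python sketch of the idea; the HTML fragment and attribute format are hypothetical, invented for this example, so you will need to inspect the actual markup of your site.

```python
import re

# Hypothetical HTML fragment: the page shows stars, but the numeric
# rating is hidden in the title attribute.
html = '<span class="stars" title="Rated 4.5 out of 5">★★★★½</span>'

def extract_rating(fragment):
    """Pull the numeric rating out of a 'Rated X out of Y' attribute."""
    match = re.search(r'Rated\s+([\d.]+)\s+out of\s+(\d+)', fragment)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

print(extract_rating(html))  # (4.5, 5.0)
```

The same principle applies with the crawler: instead of capturing the visible star icons, you point it at the attribute that holds the number.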
This is not for everyone: it requires some technical skills. That said, rather than writing a lengthy report I am not very good at, I am going to adopt a much more practical approach: I will show you how to get the crawler tool and provide you with some example scripts. You need to have some experience with Excel and understand a little HTML. Download the crawler and the examples, try them by yourself, explore them, modify them, and then create new ones.
You will need to spend some time figuring out how the crawler tool works. Take your time. There is not much help available out there, although there is not much to ask for, either. In any case, the examples I provide contain enough information to create your own programs. You can ask me for help, as a last resort, if you need to.
- I work with Windows and use Chrome as the web browser for this application.
- You may also use Firefox (I don't).
- The crawler I use is supposed to work with IE as well, but I don't use it.
Downloading the web data extraction software:
I use DataTool (http://datatoolbar.com/), the version for Chrome. There are other crawlers, and I do not know whether this is the best one. It was the first one I found and, so far, it has allowed me to get all the information I need from all the websites I work with.
You may start by using the free version. It does the same as the paid version, with a capacity limitation but no registration and no ads. Pay for it (around $24, one time) later on if you want to, or if you need more data than the limit allows.
You may download the DataTool free version for Chrome and Firefox by clicking "Download" here: http://datatoolbar.com/datatoobar30.html
Downloading the sample scripts:
SA colleagues and I have created several sample scripts, which you can run and modify to fit your own requirements. They show you how to collect data from Yahoo, NASDAQ, FINVIZ, Gurufocus, RTTnews, WSJ, and others.
These examples are uploaded to www.box.com. To download them, you need to open an account (it is free) if you don't have one; then you will be able to access the folder I created and all the example files within it. Once you have created your account, just follow this link: https://app.box.com/s/q4kr92t3xslj5cn1kk5qx3jkm91az8il
Scripts Database
As suggested by SDS (see comments), this storage folder welcomes SA readers who may want to contribute to a collection of DataToolBar scripts that will be useful to the SA community.
SDS has kindly provided several scripts to collect dividend declaration information from wsj.com, RTTnews.com, and others.
If you want to contribute, let me know and I will upload your script, or I will give you editor access to the SeekingAlpha Box.com folder.
- Every website takes a different approach, so there is no standard script. You will need to study each case and find the best strategy for the crawler.
- Scripts are very much dependent on the website's format, so an upgrade affecting the structure of the site (basically its HTML code) will most likely require some adjustment to the script you wrote for that site.
- DataTool will place the data in an Excel file whose name and location are indicated in "Project Properties". Before running the script, go there and change it.
- The data collected by the crawler arrives in Excel as plain text. If you get numbers, you need to convert them to numeric format, which can be done simply by multiplying by one (=cell*1).
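If you prefer to do this cleanup outside Excel, the same text-to-number conversion can be scripted. Below is a minimal Python sketch using only the standard library; the column names and sample rows are made up for illustration, not taken from any real crawler output.

```python
import csv
import io

# Simulated crawler output: every cell arrives as text, the same
# situation the =cell*1 trick fixes inside Excel. Column names and
# values here are invented for the example.
raw = "Ticker,InstOwnPct,Holders\nAAPL,61.2,5804\nMSFT,73.5,6412\n"

def to_number(text):
    """Convert a text cell to int or float; leave it unchanged if not numeric."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    rows.append({key: to_number(value) for key, value in row.items()})

print(rows[0])  # {'Ticker': 'AAPL', 'InstOwnPct': 61.2, 'Holders': 5804}
```

To use it on a real file, replace `io.StringIO(raw)` with `open("yourfile.csv", newline="")`.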
- "Output file" can also be set to CSV Unicode, SQL Script, XML, or HTML. You choose it by clicking the square button (...) on the right.
- Remember not to use the data directly from the Excel file created by the crawler; use an intermediate sheet instead, as I indicate in Part 1.
- A convenient way to run all or part of your scripts is to create a bat file. You just run the bat file and it will run the scripts and store the Excel files for you. I have created and uploaded "batchSA.bat" so you can download it and inspect it (use Windows Notepad or any simple editor).
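If you are more comfortable in Python than in bat files, the same "run a list of jobs in sequence" idea can be sketched as below. I do not know DataTool's exact command-line syntax, so the executable path and arguments here are placeholders you would replace with the lines from your own bat file; the point of the sketch is the loop that runs each job and keeps going even if one fails.

```python
import subprocess

# Placeholder commands: substitute the real invocations from your bat file.
SCRIPTS = [
    ["C:\\path\\to\\crawler.exe", "NasdaqInstitutions.project"],
    ["C:\\path\\to\\crawler.exe", "YahooQuotes.project"],
]

def run_all(commands):
    """Run each command in turn, recording failures instead of stopping."""
    results = []
    for cmd in commands:
        try:
            completed = subprocess.run(cmd, check=True)
            results.append((cmd[-1], completed.returncode))
        except (subprocess.CalledProcessError, FileNotFoundError) as err:
            results.append((cmd[-1], err))
    return results

for name, outcome in run_all(SCRIPTS):
    print(name, outcome)
```

Compared with a plain bat file, this makes it easy to add logging or to rename the output Excel files after each run.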
- I will be happy to help you. Leave here a comment or send me an email.
- I can create more scripts following your suggestions and share them, which could be useful for the whole SA community.
In this annex I will show what the NasdaqInstitutions script looks like; it is used to collect the number and percentage of institutional owners.
First, here is the screen from Nasdaq, with the fields we want to collect:
The following is the DataTool screen. Here you can see the selected fields inserted in the script:
The script will start its job when you hit "Get Data". Once it finishes, it will display this:
Now you hit "Save", and the result is converted into an Excel file and saved: