To design a simple AI system for the market, one builds at least two models: one that accurately reflects the market, and a second that represents your desires for the market.
Although a stock quote has both a price and a timestamp, there is no real information in a price by itself. The information is in the difference: the difference between it and the last quote, between it and the price at some other time, or between it and some other ticker. Probably the best is the difference from the last quote for the same ticker, though I would also consider using the difference from the opening price. Volume is secondary data, as most trades are broken into pieces, but I still use it.
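To make the idea of working with differences concrete, here is a minimal sketch. The Quote record and the differences function are hypothetical names of my own for illustration, and the timestamp format is an assumption; it only shows the two differences mentioned above (against the last quote and against the opening price) for a single ticker.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    ticker: str
    timestamp: float  # assumed: seconds since the epoch
    price: float


def differences(quotes):
    """For one ticker's quotes in time order, return a list of
    (timestamp, diff_vs_last_quote, diff_vs_open) tuples.

    The first quote is treated as the opening price, so it produces
    no row of its own.
    """
    if not quotes:
        return []
    open_price = quotes[0].price
    prev_price = open_price
    rows = []
    for q in quotes[1:]:
        rows.append((q.timestamp, q.price - prev_price, q.price - open_price))
        prev_price = q.price
    return rows
```

The raw prices never leave this function; everything downstream would see only the differences.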
I consider it cheating to have tables of information, such as membership in an ETF. One can deduce algorithmically what ETF a ticker is in, and if you can't, then you shouldn't use the information.
I think of the US market as segments. For example, the first minute after the opening bell has different rules than the market once it has been open five minutes. The actual "close" price I consider sort of its own thing too. When quotes are looked at as differences, there is really an entire night of missing quotes, and so the difference between close and open represents the sum of all the ups and downs that happened overnight, but without any knowledge of what it was that happened.
I think of the model that represents the market as the substrate. The substrate is simply the rules, whether they are the national tax laws, the exchange rules, the known paradigm that many are following, or the rules about the time of day the market is open. There are also the different types of buyers, be they algorithmic or individual. Individual buyers are so different from algorithmic buyers that it makes sense to attempt to create an algorithm to separate the two types of trades in the data feed. One of the most important ideas of the substrate is liquidity, and by extension market impact.
There is no completely right answer for modeling the substrate as there are some subjective calls to be made. So in practice, your model would be different than mine. One assumption I have made is that the biggest algorithmic players do not play stocks with a low market cap. What I have found is that there are only about four thousand tickers that appear to be algorithmically played by large funds, and the rest are not. And of those four thousand tickers, only two thousand are heavily played. So my model is now reduced to just about four thousand tickers. There are many assumptions like this I have made, which I won't go into.
The substrate itself is not coded, and only a portion of it is used to design the algorithms. For example, because I am pretty limited in what I have for compute power, I try to filter out the trades from individuals so as to throw that data away. If I do it correctly, I only have algorithmic trades represented in my model and the trades of individuals are thrown out. I do this by looking at the substrate and determining what rules I might find to distinguish a computer-initiated trade from an individual's.
It takes months or more of examination to determine enough to have a basic algorithmic model of a subset of the stock market. The constraints on data collection also have to be considered. For my system, because I only had sixteen cores, I further attempted to separate the market neutral automated trades from the other algorithmic trades using rules I had put together in the substrate model.
For my initial model of the market, I decided that the algorithmic traders were very powerful and knew everything knowable about the company. Therefore, the only thing they couldn't know is the new information that has just broken about a ticker. In this perfect world where they know everything, there is really only one variable that is indeterminate, and this is the news.
Consider the saying about how a rising tide lifts all boats, with the market being the tide. For the model presented below, the rising tide is intentionally left out, and what is measured is whether all boats are floating with it. For example, if the market went up one hundred points and the indexes all went up perfectly together, then they would all score zero in my system. It is only the ticker that doesn't float, or floats when it shouldn't, that is being measured.
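One simple way to picture leaving the tide out is to subtract the average move from every ticker's move. This is only an illustration of the zero-score example above and is not how my model actually computes its scores; the function name remove_tide and the use of a plain cross-sectional mean are assumptions made for the sketch.

```python
def remove_tide(changes):
    """changes maps ticker -> change over one interval (same units for all).

    Returns ticker -> change minus the average change across all tickers,
    so anything that simply moved with the group scores zero.
    """
    if not changes:
        return {}
    tide = sum(changes.values()) / len(changes)
    return {ticker: change - tide for ticker, change in changes.items()}


# Every index rises together: nothing stands out.
together = remove_tide({"XLB": 1.0, "XLE": 1.0, "XLF": 1.0})

# One index moves more than the rest: only the deviations remain.
one_outlier = remove_tide({"XLB": 1.0, "XLE": 4.0, "XLF": 1.0})
```

In the first case every score is zero; in the second, XLE scores positive and the other two score slightly negative, which is the "boat that doesn't float with the tide" being picked out.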
I thought about calling this the Probability of Information, but that name turned out to already be taken. So I call it the News Scores: the presumed information, not market wide, that is affecting a ticker or a sector. My core algorithm does not read the news, and news scores are derived solely from ticker data. ("News(x)" is a function or equation.)
The first chart is simply the news scores of the major indexes for about 13 months.
If you average these nine series together and multiply by the constant C = 1.6, you get the next chart.
This model attempts to consider only market neutral trades done by algorithm and, in my opinion, this is why it appears to be zero sum.
Below is the same general idea, but it shows the components of one of the indexes above. These are the news scores of the top weighted tickers in the XLE index.
And again, here is the average of those tickers' news scores, also multiplied by C, compared to the news scores of the index itself.
If I take News(NYSEARCA:SPY), for example, to be the news scores for the SPY over an interval, and j to be the interval counter, then I can treat the results as a series and say
News(SPY)j = Average(News(NYSEARCA:XLB)j, News(NYSEARCA:XLE)j, News(NYSEARCA:XLF)j, News(NYSEARCA:XLI)j, News(NYSEARCA:XLK)j, News(NYSEARCA:XLP)j, News(NYSEARCA:XLU)j, News(NYSEARCA:XLV)j, News(NYSEARCA:XLY)j) * C
As well as
News(XLE)j = Average(News(NYSE:XOM)j, News(NYSE:CVX)j, News(NYSE:SLB)j, News(NYSE:COP)j, News(NYSE:OXY)j, News(NYSE:HAL)j, News(NYSE:APA)j, News(NYSE:NOV)j, News(BHI)j, News(NYSE:DVN)j, News(NYSE:EOG)j, News(NYSE:CHK)j) * C
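The averaging step in these two relations can be sketched in a few lines. The function name composite and the dictionary layout are hypothetical, chosen for illustration; the input series are assumed to be news scores that have already been computed by some other means, one value per interval j.

```python
C = 1.6  # the constant quoted for the index chart above


def composite(series_by_ticker, c=C):
    """Average the component series interval by interval, then scale by c.

    series_by_ticker maps ticker -> list of scores, one per interval j.
    All series must cover the same intervals.
    """
    series = list(series_by_ticker.values())
    if not series:
        raise ValueError("need at least one component series")
    n_intervals = len(series[0])
    if any(len(s) != n_intervals for s in series):
        raise ValueError("all series must cover the same intervals")
    return [
        c * sum(s[j] for s in series) / len(series)
        for j in range(n_intervals)
    ]
```

Calling composite on the nine sector series would give the right-hand side of the SPY relation, and calling it on the twelve XLE components would give the right-hand side of the XLE relation.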
Given that this is the case, then one could use basic arithmetic manipulation to break things down and isolate any ticker on the left side of the equation.
I.e., News(CVX) = ...
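As a sketch of that manipulation, suppose the XLE relation held exactly, with N = 12 components T_1 through T_N as listed and the same factor C applied to the average (as described for the charts). Both of those are assumptions, since in practice the averaged series only approximates the index. Multiplying through by N/C and moving the other eleven terms across gives:

```latex
\begin{align*}
\mathrm{News}(\mathrm{XLE})_j
  &= \frac{C}{N} \sum_{i=1}^{N} \mathrm{News}(T_i)_j \\
\frac{N}{C}\,\mathrm{News}(\mathrm{XLE})_j
  &= \sum_{i=1}^{N} \mathrm{News}(T_i)_j \\
\mathrm{News}(\mathrm{CVX})_j
  &= \frac{N}{C}\,\mathrm{News}(\mathrm{XLE})_j
     - \sum_{T_i \neq \mathrm{CVX}} \mathrm{News}(T_i)_j
\end{align*}
```

The same steps would isolate any other component, or any sector in the SPY relation with N = 9.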
In this model I have attempted to isolate only those trades that are executed by market neutral algorithms, using rules I found in the substrate. Hopefully I have included enough information that you understand what I did and the concepts involved, but I also intended to leave out enough information to not give away how I did this particular model.
I like to think that I have isolated the concept of "news" in my algorithm too. However, it turns out that "news" has properties, is variable, and is not a single idea. For that reason, I refer to it as having a subjective/objective property, or SOP, or more simply the news property.
The more subjective the news is, the more erratic and long-lasting the results. I consider HLF to be an example of a ticker with a lot of subjective news, and a staunch, hardly changing power company might be an example of the other end of the spectrum.
Another measurement attempts to approximate Market Efficiency. It is simply a measure of how long a particular ticker takes to move to a new position, and is not exactly the same thing as what is meant by EMH. There is a calculable Efficiency for each ticker.
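To give a feel for what "how long a ticker takes to move to a new position" could mean in code, here is one possible reading. It is not my actual Efficiency calculation; the function name settling_intervals, the idea of a tolerance band around the final price, and the 1% default are all assumptions made for this sketch.

```python
def settling_intervals(prices, tolerance=0.01):
    """Count the intervals a price series needs before it stays within
    a band around its final value.

    prices: prices at consecutive intervals, starting when the move begins.
    tolerance: half-width of the band as a fraction of the final price.
    Returns the index of the first interval from which every later price
    is inside the band (0 if the series never leaves it).
    """
    if not prices:
        raise ValueError("need at least one price")
    final = prices[-1]
    band = abs(final) * tolerance
    # Walk backwards to find the last price that is still outside the band.
    for j in range(len(prices) - 1, -1, -1):
        if abs(prices[j] - final) > band:
            return j + 1
    return 0
```

Under this reading, a smaller count would correspond to a more efficient ticker, since it reaches its new position in fewer intervals.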
Now that I have a basic core, I can create as many secondary algorithms on top of it as I care to.
Disclosure: I have no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours.