You can find me on Metaculus at https://www.metaculus.com/accounts/profile/116023/.
Ah, I see. I missed that part of the post for some reason.
In this setup the update you're doing is fine, but I think measuring the evidence for the hypothesis in terms of "bits" can still mislead people here. You've tuned your example so that the likelihood ratio is equal to two and there are only two possible outcomes, while in general there's no reason for those two values to be equal.
This is a rather pedantic remark that doesn't have much relevance to the primary content of the post (EDIT: it's also based on a misunderstanding of what the post is actually doing - I missed that an explicit prior is specified which invalidates the concern raised here), but
If such a coin is flipped ten times by someone who doesn't make literally false statements, who then reports that the 4th, 6th, and 9th flips came up Heads, then the update to our beliefs about the coin depends on what algorithm the not-lying[1] reporter used to decide to report those flips in particular. If they always report the 4th, 6th, and 9th flips independently of the flip outcomes - if there's no evidential entanglement between the flip outcomes and the choice of which flips get reported - then reported flip-outcomes can be treated the same as flips you observed yourself: three Headses is 3 * 1 = 3 bits of evidence in favor of the hypothesis that the coin is Heads-biased. (So if we were initially 50:50 on the question of which way the coin is biased, our posterior odds after collecting 3 bits of evidence for a Heads-biased coin would be 2^3:1 = 8:1, or a probability of 8/(1 + 8) ≈ 0.89 that the coin is Heads-biased.)
is not how Bayesian updating would work in this setting. As I've explained in my post about Laplace's rule of succession, if you start with a uniform prior over $[0, 1]$ for the probability of the coin coming up heads and you observe a sequence of $n$ heads in succession, you would update to a posterior of $\mathrm{Beta}(n + 1, 1)$, which has mean $(n + 1)/(n + 2)$. For $n = 3$ that would be $0.8$ rather than $0.89$.
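A minimal sketch of the two calculations being contrasted, using exact fractions. The function names are made up for illustration, and the two numbers answer slightly different questions: the first is the probability of the "Heads-biased" hypothesis under a two-hypothesis prior with a likelihood ratio of 2 per head, while the second is the posterior mean of the heads probability under a uniform prior.

```python
from fractions import Fraction


def two_hypothesis_posterior(n_heads, likelihood_ratio=2, prior_odds=Fraction(1)):
    """Probability of 'Heads-biased' after n_heads heads, when each observed
    head multiplies the odds by a fixed likelihood ratio (1 bit if it is 2)."""
    odds = prior_odds * Fraction(likelihood_ratio) ** n_heads
    return odds / (1 + odds)


def laplace_posterior_mean(n_heads, n_tails=0):
    """Mean of Beta(n_heads + 1, n_tails + 1), i.e. the posterior obtained
    from a uniform prior over the heads probability."""
    return Fraction(n_heads + 1, n_heads + n_tails + 2)


print(two_hypothesis_posterior(3))  # 8/9
print(laplace_posterior_mean(3))    # 4/5
```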
I haven't formalized this, but one problem with the entropy approach here is that the distinct bits of information you get about the coin are actually not independent, so they are worth less than one bit each. They aren't independent because if you know some of them came up heads, your prior that the other ones also came up heads will be higher, since you'll infer that the coin is likely to have been biased in the direction of coming up heads.
To not leave this totally up in the air, if you think of the $k$th heads having an information content of

$$\log_2\left(\frac{k + 1}{k}\right)$$

bits, then the total information you get from $n$ heads is something like

$$\sum_{k=1}^{n} \log_2\left(\frac{k + 1}{k}\right) = \log_2(n + 1)$$

bits instead of $n$ bits. Neglecting this effect leads you to make much more extreme inferences than would be justified by Bayes' rule.
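One way to make the diminishing value of each successive head concrete, as a sketch under the uniform-prior assumption: after $k - 1$ heads the predictive odds of heads are $k:1$, and after $k$ heads they are $(k + 1):1$, so the $k$th head multiplies the odds by $(k + 1)/k$. The helper names below are hypothetical.

```python
import math


def bits_from_kth_head(k):
    """Log-odds shift (in bits) contributed by the k-th consecutive head,
    assuming a uniform prior: predictive odds move from k:1 to (k + 1):1."""
    return math.log2((k + 1) / k)


def total_bits(n):
    """Total log-odds shift from n consecutive heads; telescopes to log2(n + 1)."""
    return sum(bits_from_kth_head(k) for k in range(1, n + 1))


for n in (1, 3, 10):
    # Compare with the naive count of n bits for n heads.
    print(n, total_bits(n), math.log2(n + 1))
```

Only the first head is worth a full bit under this accounting; every later head is worth strictly less.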
Yeah, Neyman's proof of Laplace's version of the rule of succession is nice. The reason I think this kind of approach can't give the full strength of the conjugate prior approach is that I think there's a kind of "irreducible complexity" to computing the Beta function $B(a, b)$ for non-integer values of $a$ and $b$. The only easy proof I know goes through the connection to the gamma function. If you stick only to integer values there are easier ways of doing the computation, and the linearity of expectation argument given by Neyman is one way to do it.
One concrete example of the rule being used in practice I can think of right now is this comment by SimonM on Metaculus.
Answering your questions in order:
What matters is that it's something you can invest in. Choosing the S&P 500 is not really that important in particular. There doesn't have to be a single company whose stock is perfectly correlated with the S&P 500 (though nowadays we have ETFs which more or less serve this purpose) - you can simply create your own value-weighted stock index and rebalance it on a daily or weekly basis to adjust for the changing weights over time, and nothing will change about the main arguments. This is actually what the authors of The Rate of Return on Everything do in the paper, since we don't really have good value-weighted benchmark indices for stocks going back to 1870.
The general point (which I hint at but don't make in the post) is that we persistently see high Sharpe ratios in asset markets. The article I cite at the start of the post also has data on real estate returns, for example, which exhibit an even stronger puzzle because they are comparable to stock returns in real terms but have half the volatility.
I don't know the answer to your exact question, but a lot of governments have bonds which are quite risky and so this comparison wouldn't be appropriate for them. If you think of the real yield of bonds as consisting of a time preference rate plus some risk premium (which is not a perfect model but not too far off), the rate of return on any one country's bonds puts an upper bound on the risk-free rate of return. Therefore we don't need to think about investing in countries whose bonds are risky assets in order to put a lower bound on the size of the equity premium relative to a risk-free benchmark.
This only has a negligible effect because the returns are inflation-adjusted, and over long time horizons any real exchange rate deviation from the purchasing power parity benchmark is going to be small relative to the size of the returns we're talking about. Phrased another way: inflation-adjusted stock prices are not stationary whereas real exchange rates are, so over a long enough time horizon you can ignore exchange rate effects as long as you perform inflation adjustment.
This is an interesting question and I don't know the answer to it. Partly this is because we don't really understand where the equity premium is coming from to begin with, so thinking about how some hypothetical change in the human condition would alter its size is not trivial. I think different models of the equity premium actually make different predictions about what would happen in such a situation.
It's important, though, to keep in mind that the equity premium is not about the rate of time preference: risk-free rates of return are already quite low in our world of mortal people. It's more about the volatility of marginal utility growth, and there's no logical connection between that and the time for which people are alive. One of the most striking illustrations of that is Campbell and Cochrane's habit formation model of the equity premium, which produces a long-run equity premium even at infinite time horizons, something a lot of other models of the equity premium struggle with.
I think in the real world if people became immortal the long-run (or average) equity premium would fall, but the short-run equity premium would still sometimes be high, in particular in times of economic difficulty.
Over 20 years that's possible (and I think it's in fact true), but the paper I cite in the post gives some data which makes it unlikely that the whole past record is outperformance. It's hard to square 150 years of over 6% mean annual equity premium with 20% annual standard deviation with the idea that the true stock return is actually the same as the return on T-bills. The "true" premium might be lower than 6% but not by too much, and we're still left with more or less the same puzzle even if we assume that.
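A back-of-the-envelope version of this argument, as a sketch that assumes annual excess returns are independent draws with the stated mean and standard deviation and that a normal approximation is adequate. Neither assumption comes from the cited paper.

```python
import math

years = 150           # length of the historical record
mean_premium = 0.06   # mean annual equity premium
annual_sd = 0.20      # annual standard deviation of returns

# Standard error of the sample mean under the i.i.d. assumption.
standard_error = annual_sd / math.sqrt(years)

# How many standard errors the observed premium sits above zero.
z_score = mean_premium / standard_error

# One-sided tail probability under the normal approximation.
p_value = 0.5 * math.erfc(z_score / math.sqrt(2))

print(f"standard error: {standard_error:.4f}")
print(f"z-score: {z_score:.2f}")
print(f"one-sided p-value: {p_value:.1e}")
```

Under these assumptions the observed premium is more than three standard errors above zero, which is why pure luck is a hard explanation to sustain over the full record.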
That's alright, it's partly on me for not being clear enough in my original comment.
I think information aggregation from different experts is in general a nontrivial and context-dependent problem. If you're trying to actually add up different forecasts to obtain some composite result it's probably better to average probabilities; but aside from my toy model in the original comment, "field data" from Metaculus also backs up the idea that on single binary questions median forecasts or log-odds averages consistently beat probability averages.
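For reference, the three aggregation rules mentioned here can be sketched as follows. The function names and the example forecasts are illustrative only and are not taken from Metaculus.

```python
import math
import statistics


def mean_probability(probs):
    """Arithmetic mean of the forecast probabilities."""
    return statistics.fmean(probs)


def median_probability(probs):
    """Median of the forecast probabilities."""
    return statistics.median(probs)


def mean_log_odds(probs):
    """Average the forecasts in log-odds space, then map back to a probability.
    Equivalent to taking the geometric mean of the odds."""
    avg = statistics.fmean(math.log(p / (1 - p)) for p in probs)
    return 1 / (1 + math.exp(-avg))


forecasts = [0.5, 0.9, 0.99]  # made-up expert forecasts
print(mean_probability(forecasts))
print(median_probability(forecasts))
print(mean_log_odds(forecasts))
```

Inputs to `mean_log_odds` must lie strictly between 0 and 1, since a forecast of exactly 0 or 1 has infinite log odds.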
I agree with SimonM that the question of which aggregation method is best has to be answered empirically in specific contexts and theoretical arguments or models (including mine) are at best weakly informative about that.
I don't know what you're talking about here. You don't need any nonlinear functions to recover the probability. The probability implied by $M_t$ is just $M_t$, and the probability you should forecast having seen $M_t$ is therefore

$$\mathbb{E}[M_T \mid M_t] = M_t \quad \text{for } T \geq t$$

since $M$ is a martingale.
I think you don't really understand what my example is doing. $M$ is not a Brownian motion and its increments are not Gaussian; it's a nonlinear transform of a drift-diffusion process by a sigmoid which takes values in $(0, 1)$. $M$ itself is already a martingale so you don't need to apply any nonlinear transformation to $M$ on top of that in order to recover any probabilities.
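The explicit definition is that you take an underlying drift-diffusion process $Y$ following

$$dY_t = \frac{\sigma^2}{2} \tanh\left(\frac{Y_t}{2}\right) dt + \sigma \, dW_t$$

and let $M_t = 1/(1 + e^{-Y_t})$. You can check that this is a martingale by using Ito's lemma.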
If you're still not convinced, you can actually use my Python script in the original comment to obtain calibration data for the experts using Monte Carlo simulations. If you do that, you'll notice that they are well calibrated and not overconfident.
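As a rough illustration of such a check (this is not the script from the original comment), here is a minimal Monte Carlo sketch. It assumes the sigmoid is the logistic function and that the underlying process has constant volatility `sigma`; under those assumptions Ito's lemma requires the drift of the underlying process to be `(sigma**2 / 2) * tanh(y / 2)` for the transformed process to be driftless. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_terminal_m(m0, rng, sigma=1.0, horizon=1.0, steps=100):
    """Euler-Maruyama simulation of the underlying process y, started at the
    log odds of m0; returns the logistic transform of y at the horizon."""
    dt = horizon / steps
    y = math.log(m0 / (1 - m0))
    for _ in range(steps):
        # Drift chosen so that the logistic transform of y has zero drift.
        drift = 0.5 * sigma ** 2 * math.tanh(y / 2)
        y += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return 1 / (1 + math.exp(-y))


def average_terminal_m(m0, n_paths=2000, seed=0):
    """Monte Carlo estimate of the expected terminal forecast given a
    starting forecast of m0."""
    rng = random.Random(seed)
    return sum(simulate_terminal_m(m0, rng) for _ in range(n_paths)) / n_paths


for m0 in (0.2, 0.7):
    # If the process is a martingale, the average should be close to m0.
    print(m0, average_terminal_m(m0))
```

The averages should match the starting forecasts up to sampling noise and a small discretization error, which is the martingale property that calibration relies on.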
Thanks for the comment - I'm glad people don't take what I said at face value, since it's often not correct...
What I actually maximized is (something like, though not quite) the expected value of the logarithm of the return, i.e. what you'd do if you used the Kelly criterion. This is the correct way to maximize long-run expected returns, but it's not the same thing as maximizing expected returns over any given time horizon.
My computation of is correct, but the problem comes in elsewhere. Obviously if your goal is to just maximize expected return then we have
and to maximize this we would just want to push as high as possible as long as , regardless of the horizon at which we would be rebalancing. However, it turns out that this is perfectly consistent with
where is the ideal leveraged portfolio in my comment and is the actual one, both with k-fold leverage. So the leverage decay term is actually correct, the problem is that we actually have
and the leverage decay term is just the second term in the sum multiplying . The actual leveraged portfolio we can achieve follows
which is still good enough for the expected return to be increasing in $k$. On the other hand, if we look at the logarithm of this, we get
so now it would be optimal to choose something like if we were interested in maximizing the expected value of the logarithm of the return, i.e. in using Kelly.
The fundamental problem is that is not the good definition of the ideally leveraged portfolio, so trying to minimize the gap between and is not the same thing as maximizing the expected return of . I'm leaving the original comment up anyway because I think it's instructive and the computation is still useful for other purposes.
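To illustrate the difference between the two objectives, here is a sketch under textbook assumptions that are not necessarily those of the original comment: the underlying asset follows a geometric Brownian motion with drift `mu` and volatility `sigma`, the borrowing rate is zero, and the portfolio is rebalanced continuously to `k`-fold leverage. In that setting the expected return is `exp(k * mu * T)` while the expected log return is `(k * mu - k**2 * sigma**2 / 2) * T`.

```python
import math


def expected_return(k, mu, horizon=1.0):
    """Expected gross return of a continuously rebalanced k-fold leveraged
    portfolio (GBM asset, zero borrowing rate)."""
    return math.exp(k * mu * horizon)


def expected_log_return(k, mu, sigma, horizon=1.0):
    """Expected log return of the same portfolio; the quadratic term in k is
    the penalty from volatility."""
    return (k * mu - 0.5 * (k * sigma) ** 2) * horizon


def kelly_leverage(mu, sigma):
    """Leverage that maximizes the expected log return under these assumptions."""
    return mu / sigma ** 2


mu, sigma = 0.06, 0.20  # illustrative parameters only
for k in (1.0, kelly_leverage(mu, sigma), 3.0):
    print(k, expected_return(k, mu), expected_log_return(k, mu, sigma))
```

The expected return keeps growing with `k` whenever `mu` is positive, while the expected log return peaks at a finite leverage, which is the distinction between maximizing expected return and using Kelly.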
The experts in my model are designed to be perfectly calibrated. What do you mean by "they are overconfident"?
To elaborate on the information acquisition cost point: small pieces of information won't be worth tying up a large amount of capital for.
Suppose a company is worth $1 billion and you have very good insider info that a project of theirs, which the market implicitly values at $10 million, is going to flop. If the only way you can express that opinion is to short the stock of the whole company, that's likely not even worth it. Even with 10% margin you'd be at best making a 10% return on capital over the time horizon that the market figures out the project is bad (maybe O(1) years), and that mean return would come with way more risk than just buying into the S&P 500, so your Sharpe would be much worse.
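The arithmetic behind the 10% figure, spelled out as a small sketch. It ignores borrowing costs, fees, and any other news that moves the stock, and the variable names are illustrative.

```python
company_value = 1_000_000_000  # market value of the whole company
project_value = 10_000_000     # value the market implicitly assigns to the project
margin_requirement = 0.10      # fraction of the short position posted as capital

# If the project turns out to be worthless, the stock falls by this fraction.
price_move = project_value / company_value

# Return on the capital actually tied up in the short position.
return_on_capital = price_move / margin_requirement

print(f"price move: {price_move:.1%}")
print(f"return on capital: {return_on_capital:.1%}")
```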
In general this kind of trading is only worth it if your edge over the market is big enough. Just knowing something the market doesn't know is not very useful unless you can find someone to bet on that exact thing, rather than having to involve a ton of other variance in your trades. Even if you try to do that, people can figure out what you're up to and refuse to take the other side of your trades anyway.