机器学习在金融量化的失败原因

量化 2023年10月27日

机器学习正在彻底改变我们生活的方方面面。直到现在,依然是只有专家才能通过使用机器学习算法来完成任务。就金融领域而言,采用这样一个具有颠覆性的,并将改变人们坚持了几十年的投资方法的技术,令人尤为心潮澎湃。这本书将对我在过去 20 年中使用的可靠的机器学习工具进行科学解释,而这些工具也帮助我为那些需求最复杂的机构投资者管理了大规模的基金

有关投资的书大多分为两类。一类书的作者用极其简化的数学方式描述了在现实中不存在的情况,对其阐述的东西并没有实践过。只是因为在逻辑上是正确的原理,不一定意味着在现实世界中就能行得通。另类书的作者提供的解释缺乏逻辑严密的学术理论的支撑,不能使用正确的数学工具来解释实际观察,他们的模型在实现时就会出现过拟合并失败的结果。学术性的调研和报告与金融市场的实际应用脱节,很多交易或投资领域中的应用也不以正确的科学方法为基础

作者写本书的动机源于很多投资者无法理解机器学习在投资应用中的复杂性。这一点在那些拥有完全支配权的投资公司向“量化”投资领域转型时尤为突出。我担心他们的高预期很难达到,不是因为机器学习失效了,而是因为他们用错了机器学习。在未来几年,很多公司可能会利用来自学术机构或者硅谷的现成的机器学习算法进行投资,我预测他们将赔钱(相对于更好的机器学习解决方案而言)。战胜群众的智慧比识别人脸或者驾驶汽车难得多。通过本书,我希望你能学会如何应对一些挑战,这些挑战使得金融成为机器学习算法难以攻克的领域,比如回测的过拟合问题。金融机器学习已成为一门学科,与标准机器学习既有关联又有区别,本书将为你全面介绍金融机器学习。

西西弗斯范式

Wherever I have seen that formula applied to quantitative or ML projects, it hasled to disaster. The boardroom's mentality is, let us do with quants what hasworked with discretionary PMs. Let us hire 50 PhDs and demand that each ofthem produce an investment strategy within six months. This approach alwaysbackfires, because each PhD will frantically search for investment opportunitiesand eventually settle for (1) a false positive that looks great in an overfit backtestor (2) standard factor investing, which is an overcrowded strategy with a lowSharpe ratio, but at least has academic support. Both outcomes will disappointthe investment board, and the project will be cancelled. Even if 5 of those PhDsidentified a true discovery, the profits would not suffice to cover for the expensesof 50, so those 5 will relocate somewhere else, searching for a proper reward.

很多公司采用全权委托投资组合管理人模式去做量化或机器学习的项目,最终结果都很糟糕。老板的心态是,全权委托投资组合管理人都在用这种方式,那我们就用这种模式做量化吧。雇用50个博士,要求他们每人在6 个月内制定一套投资策略。这个方法往往事与愿违,因为每个博士都会疯狂地寻找投资机会,结果通常是:(1) 拥有亮丽回测结果的过拟合;

(2) 标准的因子投资,一种低夏普比率(SR)的过饱和投资,但至少有理论支撑。

这两种结果都会让投资委员会失望,致使项目最终被取消。即使有 5 个博士发现了有效的投资策略5个人的收益也无法支付 50 个博士的费用,因此这5个博士也要另谋高就这就是所谓的让每个员工日复一日搬石头上山的西西弗斯 (Sisyphus)范式,这种范式的投入产出比极低。


元策略范式

If you have been asked to develop ML strategies on your own, the odds arestacked against you. It takes almost as much effort to produce one trueinvestment strategy as to produce a hundred, and the complexities areoverwhelming: data curation and processing, HPC infrastructure, softwaredevelopment, feature analysis, execution simulators, backtesting, etc. Even if thefirm provides you with shared services in those areas, you are like a worker at aBMW factory who has been asked to build an entire car by using all theworkshops around you. One week you need to be a master welder, another weekan electrician, another week a mechanical engineer, another week a painter ...You will try, fail, and circle back to welding. How does that make sense?

个人构建一套成功机器学习策略是非常难的,你会不停的尝试 和失败,构建上百套策略,团队协作的成功率大于单打独斗。

过拟合

There are two reasons. First, backtest overfitting is arguably the most importantopen problem in all of mathematical finance. It is our equivalent to “p versusNp” in computer science. lf there was a precise method to prevent backtestoverfitting, we would be able to take backtests to the bank. A backtest would bealmost as good as cash, rather than a sales pitch. Hedge funds would allocatefunds to portfolio managers with confidence. Investors would risk less, andwould be willing to pay higher fees. Regulators would grant licenses to hedgefund managers on the basis of reliable evidence of skill and knowledge, leavingno space for charlatans. In my opinion, an investments book that does notaddress this issue is not worth your time. Why would you read a book that dealswith CAPM,APT, asset allocation techniques, risk management, etc. when theempirical results that support those arguments were selected without determiningtheir false discovery probabilities?

The second reason is that ML is a great weapon in your research arsenal, and adangerous one to be sure. lf backtest overfitting is an issue in econometricanalysis, the flexibility of ML makes it a constant threat to your work. This isparticularly the case in finance, because our datasets are shorter, with lowersignal-to-noise ratio, and we do not have laboratories where we can conductexperiments while controlling for all environmental variables (Lopez de Prado2015]). An ML book that does not tackle these concerns can be moredetrimental than beneficial to your career.

首先,过拟合回测可以说是数学金融领域中最重要的开放性问题。这相当于计算机科学中的“P与NP”问题。如果有一种精确的方法来防止过度拟合回测,我们就能将回测结果视为现金,而不仅仅是推辞。对冲基金会有信心地分配资金给投资组合经理。投资者风险更低,愿意支付更高的费用。这会解决很多纸上谈兵或者诈骗的情况。

第二个原因是机器学习是你研究工具箱中的一个强大武器,但也是一个危险的武器。如果回测过度拟合在计量经济学分析中是一个问题,那么机器学习的灵活性就会对你的工作产生持续威胁。在金融领域尤其如此,因为我们的数据集更短,信噪比更低,并且我们没有实验室可以在控制所有环境变量的情况下进行实验(Lopez de Prado 2015)。一本不解决这些问题的机器学习书可能对你的量化生涯造成更多的伤害而非受益。

Tags