Machine Learning: Is More Data Always Better?


Data science and machine learning are advancing at a rapid pace. They’re now being applied in areas as diverse as healthcare, retail, marketing, and finance. However, a key question that still needs to be answered is: how much data do you need to train these models?

The answer, it turns out, is not always "more". In some cases, using too much data can actually hurt the performance of your machine learning models. Reference [1] argues that more data is not always better:

Managers often believe that collecting more data will continually improve the accuracy of their machine learning models. However, we argue in this paper that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data. In addition, we argue that increasing the stock of data by including older datasets may, in fact, damage the model’s accuracy. Expectedly, the model’s accuracy improves by increasing the flow of data (defined as data collection rate); however, it requires other tradeoffs in terms of refreshing or retraining machine learning models more frequently.

Subscribe to newsletter https://harbourfrontquant.beehiiv.com/subscribe Newsletter Covering Trading Strategies, Risk Management, Financial Derivatives, Career Perspectives, and More

The paper also points out that the value of a firm does not scale with its stock of data:

This result, coupled with the fact that older datasets may deteriorate models’ accuracy, suggests that created business value doesn’t scale with the stock of available data unless the firm offloads less relevant data from its data repository. Consequently, a firm’s growth policy should incorporate a balance between the stock of historical data and the flow of new data.

What implication does this paper have for trading and portfolio management? Should we use more data?

The short answer is probably no. Financial data is noisy, and much of it is irrelevant to current market conditions; as the paper argues, older observations lose relevance over time. A model trained on too long a history can end up fitting noise and stale patterns rather than the relationships that hold today, which leads to sub-optimal results.
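The paper's point that older data can lose relevance can be illustrated with a toy simulation. The sketch below is not taken from Reference [1]; it assumes a hypothetical signal whose true slope drifts linearly from 1.0 to 3.0 over the sample, and compares a least-squares fit on the full history with a fit on only the most recent observations. The function names, the window length, and the drift pattern are all illustrative assumptions.

```python
import numpy as np


def slope_through_origin(x, y):
    """Least-squares slope for the model y ~ b * x (no intercept)."""
    return float(np.dot(x, y) / np.dot(x, x))


def compare_windows(n_obs=2000, window=200, seed=0):
    """Fit on full history vs. a recent window when the true slope drifts.

    Assumption (illustrative only): the true slope moves linearly
    from 1.0 to 3.0, so early observations describe a relationship
    that no longer holds at the end of the sample.
    """
    rng = np.random.default_rng(seed)
    true_slopes = np.linspace(1.0, 3.0, n_obs)
    x = rng.normal(size=n_obs)
    y = true_slopes * x + rng.normal(size=n_obs)  # unit-variance noise

    full_fit = slope_through_origin(x, y)
    recent_fit = slope_through_origin(x[-window:], y[-window:])
    current = float(true_slopes[-1])

    return {
        "true_slope": current,
        "full_fit": full_fit,
        "recent_fit": recent_fit,
        "full_error": abs(full_fit - current),
        "recent_error": abs(recent_fit - current),
    }


result = compare_windows()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this setup the full-history estimate is pulled toward the average slope over the whole sample, while the recent-window estimate stays closer to the current slope at the cost of a noisier fit. Real market data will not drift this cleanly, so the right window length is an empirical question.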

So how do we use data for trading? Let us know in the comments below.

References

[1] Valavi, Ehsan, Joel Hestness, Newsha Ardalani, and Marco Iansiti. "Time and the Value of Data." Harvard Business School Working Paper No. 21-016, August 2020 (revised November 2021).
