
Working on an AI Project? Here’s How Much Data You’ll Need.

January 24, 2022


Considering that roughly 80% of the work in an AI project goes into collecting and preparing data, determining how much data you will need is a critical first step in estimating the effort and cost of the whole project. Rather than settling for the common answer of ‘it depends’, we have put together a list of general recommendations from industry experts to kickstart your project.

Making Educated Guesses

While there are no hard and fast rules for a minimum or recommended amount of data, you can arrive at feasible points by applying the following rules of thumb:

Rules of thumb

1. Estimate using the rule of 10: For an initial estimate of the amount of data required, you can apply the rule of 10, which recommends that the amount of training data be 10 times the number of parameters (or degrees of freedom) in the model. The rationale is that the training set should cover the range of outputs the model can produce when its parameters are combined. (A minimal estimation sketch for rules 1 through 4 follows this list.)

2. Supervised deep learning rule of thumb: In their book Deep Learning, Goodfellow, Bengio, and Courville suggest that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained on at least 10 million labeled examples.

3. Computer vision rule of thumb: When using deep learning for image classification, a good baseline is 1,000 images per class. Pete Warden analyzed entries in the ImageNet classification challenge, whose dataset had 1,000 categories with slightly fewer than 1,000 images each. That dataset was large enough to train early image classifiers such as AlexNet, so he concluded that roughly 1,000 images per class is a reasonable baseline for computer vision algorithms.

4. 20% of a training set is typically used for validation: Another recommendation from the Deep Learning book is to use about 80% of the data for training and 20% for validation, the subset used to guide the selection of hyperparameters. Applied to our context: if you have carried out a successful validation or proof of concept of your algorithm, we suggest roughly quadrupling the amount of data you used when developing the final product.

5. Plotting learning curves: To judge whether your machine learning algorithm would benefit from more data, plot a learning curve of success rate against sample size. For an adequately trained model, the curve resembles a log function. If the last two points at your current sample size still show a positive slope, enlarging the dataset should improve the success rate; as the slope approaches zero, adding more data is unlikely to help. (A plotting sketch follows the figure below.)
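To make rules 1 through 4 concrete, here is a minimal back-of-the-envelope sketch in Python; the parameter counts and class counts used in the example are illustrative assumptions, not figures from any particular project.

```python
# Back-of-the-envelope dataset-size estimates based on the rules of thumb above.
# The example inputs (parameter count, class count) are illustrative assumptions.

def rule_of_10(num_parameters: int) -> int:
    """Rule 1: roughly 10 training examples per model parameter."""
    return 10 * num_parameters

def supervised_dl_estimate(num_classes: int, per_class: int = 5_000) -> int:
    """Rule 2: ~5,000 labeled examples per category for acceptable performance."""
    return num_classes * per_class

def image_classification_estimate(num_classes: int, per_class: int = 1_000) -> int:
    """Rule 3: ~1,000 images per class as a computer-vision baseline."""
    return num_classes * per_class

def train_validation_split(total_examples: int, validation_fraction: float = 0.2):
    """Rule 4: hold out ~20% of the data for validation."""
    n_validation = int(total_examples * validation_fraction)
    return total_examples - n_validation, n_validation

if __name__ == "__main__":
    print(rule_of_10(num_parameters=50_000))               # 500000 examples
    print(supervised_dl_estimate(num_classes=10))          # 50000 labeled examples
    print(image_classification_estimate(num_classes=20))   # 20000 images
    print(train_validation_split(total_examples=100_000))  # (80000, 20000)
```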

 

Figure: Learning curve of a machine learning model against the size of the dataset used for training and testing.
Source: https://www.researchgate.net/figure/Learning-Curve-of-machine-learning-model-with-the-size-of-dataset-used-for-testing-and_fig7_320592670
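In the spirit of the figure above, the following sketch plots a learning curve with scikit-learn and matplotlib; the synthetic dataset and the logistic regression model are assumptions chosen only to keep the example self-contained.

```python
# Plot accuracy against training-set size to see whether more data still helps.
# The synthetic dataset and the logistic regression model are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy",
)

plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), marker="o", label="validation accuracy")
plt.xlabel("number of training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# If the validation curve is still rising at the largest training size, more data is
# likely to help; once the slope flattens toward zero, extra data yields diminishing returns.
```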

The dangers of too little training data

While it may seem obvious that you should collect enough diverse, high-quality data to train your model, even some of the largest players in the industry fail to do so. Perhaps one of the most expensive and well-covered AI failures is the Oncology Expert Advisor system built by IBM Watson and the University of Texas MD Anderson Cancer Center. The system was designed with the ambitious scope to ‘eradicate cancer’, the first step being to ‘uncover valuable information for the cancer centre's rich patient and research database’.

 

Unfortunately, the project ended up costing $62 million only to be cancelled after the algorithm recommended unsafe and dangerous treatments. The main issue was that Watson had been trained on a small, narrow dataset, which left it performing poorly and ignoring other significant manifestations in cancer patients. On natural language processing tasks, the system achieved accuracy scores of 90 to 96 percent for clear concepts like diagnosis, but only 63 to 65 percent for time-dependent information like therapy timelines.

Amazon’s Rekognition had a similar failure. This computer vision system is used by multiple government agencies, including US Immigration, to detect and analyze faces. When Rekognition was trained using 25,000 publicly available arrest photos, the software incorrectly matched 28 members of Congress with people who had been arrested for a crime. Of these false matches, 40% were members of Congress of color, even though they make up only 20% of Congress.

What to do if you need more datasets

To prevent these types of failures, we recommend the following methods for enlarging datasets:

  1. Open Datasets: these are great sources of data from reputable institutions and can be incorporated into your project when you find relevant datasets:
    1. Google Public Data Explorer – as expected, Google Public Data Explorer aggregates data from multiple reputable sources and provides visualization tools with a time dimension.
    2. Registry of Open Data on AWS (RODA) - This repository contains public datasets from AWS resources such as the Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and the Amazon Sustainability Data Initiative.
    3. DBpedia - DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. The DBpedia dataset contains around 4.58 million entities, of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases.
    4. European Union Open Data Portal – contains EU-related data for domains such as economy, employment, science, environment, and education. It is extensively used by European agencies, including EuroStat.
    5. Data.gov – The US government provides open data on topics such as agriculture, climate, energy, local government, maritime, ocean, and elderly health.

 

  2. Synthetic Data – generated data that has the same characteristics and schema as ‘real’ data. It is particularly useful for transfer learning, where a model is first trained on synthetic data and then re-trained on real-world data. A good example is computer vision for self-driving algorithms: a self-driving AI system can be taught to recognize objects and navigate a simulated environment built with a video game engine. The advantages of synthetic data include being able to produce data efficiently once the synthetic environment is defined, perfectly accurate labels on the generated data, and the absence of sensitive information such as personal data. Synthetic Minority Over-sampling Technique (SMOTE) is another way to extend an existing dataset: it randomly selects a data point from the minority class, finds its nearest neighbours, randomly selects one of them, and synthesises a new data point at a random position on the straight line between the two selected points (see the sketch after this list).

 

  3. Augmented Data – a technique that applies transformations to an existing dataset to repurpose it as new data. The clearest example is in computer vision, where images can be transformed with a variety of operations, including rotation, cropping, flipping, color shifting, and more (see the sketch after this list).
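For the SMOTE step described under Synthetic Data, here is a minimal NumPy sketch of the interpolation idea; it is a simplified illustration under assumed inputs, not the reference implementation (libraries such as imbalanced-learn ship a production-ready SMOTE).

```python
# Simplified SMOTE-style oversampling: pick a minority-class point, pick one of its
# k nearest minority neighbours at random, and synthesise a new point on the line
# between them. Illustrative sketch only, not the reference SMOTE implementation.
import numpy as np

def smote_sample(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        point = minority[rng.integers(len(minority))]
        # distances from the chosen point to every minority point
        distances = np.linalg.norm(minority - point, axis=1)
        neighbours = np.argsort(distances)[1:k + 1]   # skip the point itself
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()                            # random position on the segment
        synthetic.append(point + gap * (neighbour - point))
    return np.array(synthetic)

# Example: grow a 20-point minority class with 40 synthetic points.
minority_class = np.random.default_rng(1).normal(size=(20, 4))
augmented = np.vstack([minority_class, smote_sample(minority_class, n_new=40)])
print(augmented.shape)  # (60, 4)
```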
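For Augmented Data, here is a short Pillow sketch of the transformations mentioned above; the input path is a placeholder assumption, and each operation yields a new training variant of the same image.

```python
# Basic image augmentations with Pillow: rotation, horizontal flip, crop, colour shift.
# "input.jpg" is a placeholder path, not a file referenced by the article.
from PIL import Image, ImageEnhance, ImageOps

def augment(image: Image.Image) -> list:
    width, height = image.size
    return [
        image.rotate(15, expand=True),                    # small rotation
        ImageOps.mirror(image),                           # horizontal flip
        image.crop((10, 10, width - 10, height - 10)),    # centre crop
        ImageEnhance.Color(image).enhance(1.4),           # colour shift (saturation boost)
    ]

if __name__ == "__main__":
    original = Image.open("input.jpg")
    for i, variant in enumerate(augment(original)):
        variant.save(f"augmented_{i}.jpg")
```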

Conclusion

Every AI project begins with a guess. Educating that guess will be the difference between a successful outcome and a costly one.

 




iMerit is recognized as a leader in delivering high-quality data to power the development of AI technology in autonomous vehicles, healthcare, GIS applications, agriculture, e-commerce, finance & insurance.
