
Working on an AI Project? Here’s How Much Data You’ll Need.

January 24, 2022


Considering that roughly 80% of the work in an AI project goes into collecting and preparing data, determining how much data you will need is a critical first step in estimating the effort and cost of the whole project. Rather than settling for the common answer of ‘it depends’, we have put together a list of general recommendations from industry experts to kickstart your project.

Making Educated Guesses

While there are no hard and fast rules for a minimum or recommended amount of data, you can arrive at feasible points by applying the following rules of thumb:

Rules of thumb

1. Estimate using the rule of 10: For an initial estimate of the amount of data required, you can apply the rule of 10, which says you need roughly 10 times as many training examples as there are parameters – or degrees of freedom – in the model. For example, a model with 1,000 trainable parameters would call for roughly 10,000 training examples under this rule. The recommendation originated as a way of accounting for the combinatorial space of outputs that the defined parameters can produce.

2. Supervised deep learning rule of thumb: In their book Deep Learning, Goodfellow, Bengio and Courville suggest that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained on a dataset containing at least 10 million labeled examples.

3. Computer vision rule of thumb: When using deep learning for image classification, a good baseline is 1,000 images per class. Pete Warden analyzed entries in the ImageNet classification challenge, whose dataset had 1,000 categories with slightly fewer than 1,000 images each. That dataset was large enough to train the early generations of image classifiers such as AlexNet, so he concluded that roughly 1,000 images per class is a reasonable baseline for computer vision algorithms.

4. 20% of the dataset is typically reserved for validation: Another recommendation from the Deep Learning book is to use about 80% of the data for training and 20% for validation. The validation set is the subset of data used to guide the selection of hyperparameters. Applied to our context: if you have carried out a successful proof of concept on a small dataset, plan for roughly four times that amount of data to develop the final product, mirroring the 80:20 ratio (a minimal split example follows this list).

5. Plotting learning curves: To judge whether your algorithm would benefit from more data, plot the learning curve of success rate against sample size. If the algorithm is learning adequately, the curve will look similar to a log function. If the last points plotted at your current sample size still show a positive slope, increasing the dataset should improve the success rate; as the slope approaches zero, additional data is unlikely to help (a plotting sketch appears after the figure below).
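To make the 80:20 convention from rule 4 concrete, here is a minimal sketch of holding out a validation set with scikit-learn. The digits dataset is only a stand-in for your own data, and the 0.2 ratio simply mirrors the rule of thumb.

```python
# Minimal sketch: reserve 20% of the data for validation (rule of thumb 4).
# The digits dataset is only a stand-in for your own data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# 80% for training, 20% held out to guide hyperparameter selection.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_train), "training examples,", len(X_val), "validation examples")
```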

 

Figure: Learning curve of a machine learning model as a function of the dataset size used for training and testing.

Source: https://www.researchgate.net/figure/Learning-Curve-of-machine-learning-model-with-the-size-of-dataset-used-for-testing-and_fig7_320592670
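The figure above shows the typical shape of such a curve. Below is a minimal sketch of how you might generate one with scikit-learn’s learning_curve utility; the digits dataset and logistic regression model are stand-ins for your own data and model, and the scoring choice is illustrative.

```python
# Minimal sketch: plot a learning curve (validation score vs. training set size).
# The dataset, model, and scoring metric below are illustrative stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # 10% .. 100% of the training data
    cv=5,                                  # 5-fold cross-validation
    scoring="accuracy",
)

plt.plot(train_sizes, train_scores.mean(axis=1), marker="s", label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), marker="o", label="validation accuracy")
plt.xlabel("Number of training examples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```

If the validation curve is still climbing at the largest training size, collecting more data is likely to pay off; once it flattens, additional data offers diminishing returns.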

The dangers of too little training data

While it may seem obvious that you need enough diverse, high-quality data to train your model, even some of the largest players in the industry fail to collect it. Perhaps one of the most expensive and well-covered AI failures is IBM Watson and the University of Texas MD Anderson Cancer Center’s Oncology Expert Advisor system. The system was designed with the ambitious scope to ‘eradicate cancer’, with the first step being to ‘uncover valuable information from the cancer center’s rich patient and research database’.

 

Unfortunately, the product ended up costing $62 million, only to be cancelled after the algorithm recommended unsafe and dangerous treatments. The root cause was that Watson was trained on a small, narrow dataset, which led to poor performance and caused the system to ignore other significant manifestations in cancer patients. Although built on natural language processing, the system achieved accuracy scores of 90 to 96 percent only on clear concepts such as diagnoses, and just 63 to 65 percent on time-dependent information such as therapy timelines.

Amazon’s Rekognition had a similar failure. This computer vision system is used by multiple government agencies, including US immigration authorities, to detect and analyze faces. When Rekognition was run against a database of 25,000 publicly available arrest photos, the software incorrectly matched 28 members of Congress with people who had been arrested for a crime. Of these false matches, 40% were members of Congress of color, even though people of color make up only about 20% of Congress.

What to do if you need more data

To prevent these types of failures, we recommend the following methods for enlarging datasets:

  1. Open Datasets: these datasets are published by reputable institutions and can be incorporated into your project whenever they contain data relevant to your problem:
    1. Google Public Data Explorer – as expected, Google Public Data Explorer aggregates data from multiple reputable sources and provides visualization tools with a time dimension.
    2. Registry of Open Data on AWS (RODA) – this repository contains public datasets from AWS resources such as Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and the Amazon Sustainability Data Initiative.
    3. DBpedia – DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. There are around 4.58 million entities in the DBpedia dataset, of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases.
    4. European Union Open Data Portal – contains EU-related data for domains such as economy, employment, science, environment, and education. It is extensively used by European agencies, including EuroStat.
    5. Data.gov – The US government provides open data on topics such as agriculture, climate, energy, local government, maritime, ocean, and elderly health.

 

  2. Synthetic Data – synthetic data is generated data that has the same characteristics and schema as ‘real’ data. It is particularly useful in the context of transfer learning, where a model can be trained on synthetic data and then re-trained on real-world data. A good example is computer vision, specifically self-driving algorithms: a self-driving AI system can be taught to recognize objects and navigate a simulated environment built with a video game engine. Advantages of synthetic data include being able to produce data efficiently once the synthetic environment is defined, having perfectly accurate labels on the generated data, and avoiding sensitive information such as personal data. Synthetic Minority Over-sampling Technique (SMOTE) is another way of extending an existing dataset. SMOTE randomly selects a data point from the minority class, finds its nearest neighbours, and randomly selects one of them; a new data point is then synthesized at a random position on the straight line between the two selected points (a minimal sketch of this procedure follows this list).

 

  3. Augmented Data – augmentation applies transformations to an existing dataset to repurpose it as new data. The clearest example is in computer vision, where images can be transformed with a variety of operations, including rotation, cropping, flipping, color shifting, and more (a short pipeline example follows below).
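To make SMOTE’s interpolation step (point 2 above) concrete, the sketch below implements the core idea with numpy and scikit-learn’s NearestNeighbors. It is a simplified illustration under assumed array names, not a full implementation; in practice a maintained library such as imbalanced-learn provides a tested SMOTE.

```python
# Simplified sketch of SMOTE's core idea: synthesize a new minority-class
# point on the line segment between a minority sample and one of its
# nearest minority-class neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, seed=0):
    """Synthesize one new point between a minority sample and a random neighbour."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # 1. Randomly pick a minority-class point.
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # 2. Find its k nearest neighbours (index 0 is the point itself) and pick one at random.
    _, idx = nn.kneighbors(x.reshape(1, -1))
    neighbour = X_minority[rng.choice(idx[0][1:])]
    # 3. Interpolate at a random position on the line between the two points.
    lam = rng.random()
    return x + lam * (neighbour - x)

# Illustrative minority-class data: 20 points in 2-D.
X_min = np.random.default_rng(42).normal(size=(20, 2))
print(smote_sample(X_min))
```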

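As a concrete illustration of the augmentation operations mentioned in point 3 (rotation, cropping, flipping, color shifting), the sketch below assembles a simple image augmentation pipeline with torchvision; the transform parameters and the file path are arbitrary choices for illustration.

```python
# Sketch of an image augmentation pipeline covering the operations mentioned
# above: rotation, cropping, flipping, and color shifting.
# Parameter values are arbitrary and should be tuned per dataset.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),        # small random rotations
    transforms.RandomResizedCrop(size=224),       # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(p=0.5),       # flip half of the images
    transforms.ColorJitter(brightness=0.2,        # mild color shifting
                           contrast=0.2,
                           saturation=0.2),
])

image = Image.open("example.jpg")   # placeholder path
augmented = augment(image)          # each call yields a new random variant
augmented.save("example_augmented.jpg")
```

Applying such a pipeline on the fly during training effectively multiplies the variety of the dataset without collecting a single new image.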
Conclusion

Every AI project begins with a guess. Educating that guess will be the difference between a successful outcome and a costly one.

 



