
Working on an AI Project? Here’s How Much Data You’ll Need.

January 24, 2022

Considering that roughly 80% of the work in an AI project goes into collecting and preparing data, determining how much data you will need is a critical first step in estimating the effort and cost of the whole project. Rather than settling for the common conclusion that ‘it depends’, we have put together a list of general recommendations from industry experts to kickstart your project.

Making Educated Guesses

While there are no hard and fast rules for a minimum or recommended amount of data, you can arrive at a feasible starting point by applying the following rules of thumb:

Rules of thumb

1.     Estimate using the rule of 10: For an initial estimate of the amount of data required, you can apply the rule of 10, which recommends that the amount of training data be 10 times the number of parameters – or degrees of freedom – in the model. This recommendation came about as a way of covering the combinatorial space of outputs that the model’s parameters can produce (a back-of-the-envelope sketch follows this list).

2.     Supervised deep learning rule of thumb: In their book Deep Learning, Goodfellow, Bengio, and Courville suggest that around 5,000 labeled examples per category is generally enough for a supervised deep learning algorithm to achieve acceptable performance. To match or exceed human performance, they recommend a dataset of at least 10 million labeled examples.

3.     Computer vision rule of thumb: When using deep learning for image classification, a good baseline to start from is 1,000 images per class. Pete Warden analyzed entries in the ImageNet classification challenge, whose dataset had 1,000 categories, each containing slightly fewer than 1,000 images. The dataset was large enough to train early generations of image classifiers such as AlexNet, so he concluded that roughly 1,000 images per class is a good baseline for computer vision algorithms.

4.     20% of the dataset is typically used for validation: Another recommendation from the Deep Learning book is to use about 80% of the data for training and 20% for validation, the validation set being the subset of data used to guide the selection of hyperparameters. Applied to our context: if you have carried out a successful validation or proof of concept of your algorithm, we suggest quadrupling the amount of data you used there when developing the final product (see the sketch after this list).

5.     Plotting learning curves: To determine whether your machine learning algorithm would benefit from more data, try plotting a learning curve of success rate (for example, accuracy) against sample size. If the algorithm is training adequately, the curve will look similar to a log function. If the last two points plotted at your current sample size still show a positive slope, increasing the dataset is likely to improve the success rate; as the slope approaches zero, adding more data is unlikely to help (a plotting sketch follows the figure below).
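To make rules 1, 2, and 4 concrete, here is a minimal back-of-the-envelope sketch in Python. The parameter count, class count, and split ratio below are illustrative assumptions, not values from any particular project.

```python
# Back-of-the-envelope estimates based on the rules of thumb above.
# All input numbers below are illustrative assumptions.

n_parameters = 50_000        # trainable parameters in your model (assumption)
n_classes = 12               # number of categories to predict (assumption)

# Rule of 10: ~10 training examples per model parameter.
rule_of_10_estimate = 10 * n_parameters

# Supervised deep learning rule of thumb: ~5,000 labeled examples per category.
per_class_estimate = 5_000 * n_classes

# Take the larger of the two as a conservative starting point.
target_dataset_size = max(rule_of_10_estimate, per_class_estimate)

# 80/20 split: reserve about 20% of the data for validation.
n_train = int(0.8 * target_dataset_size)
n_validation = target_dataset_size - n_train

print(f"Rule of 10 estimate:    {rule_of_10_estimate:,} examples")
print(f"Per-class estimate:     {per_class_estimate:,} examples")
print(f"Suggested dataset size: {target_dataset_size:,} examples")
print(f"  training set (80%):   {n_train:,}")
print(f"  validation set (20%): {n_validation:,}")
```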

 

Figure: Learning curve of a machine learning model as a function of the size of the dataset used for training and testing.

Source: https://www.researchgate.net/figure/Learning-Curve-of-machine-learning-model-with-the-size-of-dataset-used-for-testing-and_fig7_320592670
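As a companion to rule 5 and the figure above, here is a sketch of how such a learning curve can be plotted with scikit-learn. The synthetic dataset and logistic regression model are stand-ins chosen purely for illustration.

```python
# Plot a learning curve: model accuracy as a function of training set size.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data purely for illustration.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

train_sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # 10% .. 100% of the training data
    cv=5,
    scoring="accuracy",
)

mean_val = val_scores.mean(axis=1)
plt.plot(train_sizes, mean_val, marker="o")
plt.xlabel("Number of training examples")
plt.ylabel("Cross-validated accuracy")
plt.title("Learning curve: is more data still helping?")
plt.show()

# If the slope between the last two points is still clearly positive,
# collecting more data is likely to improve the success rate.
slope = (mean_val[-1] - mean_val[-2]) / (train_sizes[-1] - train_sizes[-2])
print(f"Slope over the last two points: {slope:.2e} accuracy per extra example")
```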

The dangers of too little training data

While it may seem obvious that you should collect enough diverse, high-quality data to train your model, even some of the largest players in the industry have failed to do so. Perhaps one of the most expensive and well-covered AI failures is the Oncology Expert Advisor system built by IBM Watson and the University of Texas MD Anderson Cancer Center. The system was designed with the ambitious scope to ‘eradicate cancer’, with the first step being to uncover valuable information from the cancer center’s rich patient and research databases.

 

Unfortunately, the product ended up costing $62 million only to be cancelled after the algorithm recommended unsafe and dangerous treatments. The main issue was that Watson had been trained on a small and narrow dataset, which led to poor performance and caused it to ignore other significant manifestations in cancer patients. Built around natural language processing, the system achieved accuracy scores of 90 to 96 percent when dealing with clear concepts such as diagnoses, but only 63 to 65 percent for time-dependent information such as therapy timelines.

Amazon’s Rekognition had a similar failure. This computer vision system is used to detect and analyze faces by multiple government agencies, including US immigration authorities. When Rekognition compared photos of members of Congress against 25,000 publicly available arrest photos, the software incorrectly matched 28 of them with people who had been arrested for a crime. Of these false matches, 40% were members of Congress of color, even though they make up only about 20% of Congress.

What to do if you need more datasets

To prevent these types of failures, we recommend the following methods for enlarging datasets:

  1. Open Datasets – curated sources of data from reputable institutions that can be incorporated into your project if you find relevant data:
    1. Google Public Data Explorer – as the name suggests, Google Public Data Explorer aggregates data from multiple reputable sources and provides visualization tools with a time dimension.
    2. Registry of Open Data on AWS (RODA) – this repository contains public datasets from AWS resources such as the Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Facebook Data for Good, the NASA Space Act Agreement, NIH STRIDES, the NOAA Big Data Program, the Space Telescope Science Institute, and the Amazon Sustainability Data Initiative.
    3. DBpedia – a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. There are around 4.58 million entities in the DBpedia dataset, of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places, 123,000 music albums, 87,000 films, 19,000 video games, 241,000 organizations, 251,000 species, and 6,000 diseases.
    4. European Union Open Data Portal – contains EU-related data for domains such as economy, employment, science, environment, and education. It is extensively used by European agencies, including Eurostat.
    5. Data.gov – the US government provides open data on topics such as agriculture, climate, energy, local government, maritime, ocean, and elderly health.

 

  2. Synthetic Data – generated data that has the same characteristics and schema as ‘real’ data. It is particularly useful in the context of transfer learning, where a model can be pre-trained on synthetic data and then fine-tuned on real-world data. A good example for understanding synthetic data is its application to computer vision, specifically self-driving algorithms: a self-driving AI system can be taught to recognize objects and navigate a simulated environment built with a video game engine. Advantages of synthetic data include being able to produce data efficiently once the synthetic environment is defined, having perfectly accurate labels on the generated data, and the absence of sensitive information such as personal data. Synthetic Minority Over-sampling Technique (SMOTE) is another technique for extending an existing dataset: SMOTE randomly selects a data point from the minority class, finds its nearest neighbours, and randomly selects one of them; the new data point is then synthesised at a random position on the straight line between the two selected points (see the SMOTE sketch after this list).

 

  3. Augmented Data – a technique that applies transformations to an existing dataset to repurpose it as new data. The clearest example of data augmentation is in computer vision applications, where images can be transformed with a variety of operations, including rotation, cropping, flipping, color shifting, and more (see the augmentation sketch below).
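For the SMOTE technique mentioned under Synthetic Data, here is a minimal sketch using the third-party imbalanced-learn package (assumed to be installed); the imbalanced dataset is generated on the fly purely for illustration.

```python
# Oversample a minority class with SMOTE, using the imbalanced-learn package
# (assumed to be installed: pip install imbalanced-learn).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, heavily imbalanced dataset purely for illustration (95% vs 5%).
X, y = make_classification(
    n_samples=2_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
print("Before SMOTE:", Counter(y))

# SMOTE interpolates new minority-class points between existing neighbours.
X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("After SMOTE: ", Counter(y_resampled))
```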
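And for Augmented Data, a small sketch of image augmentation using torchvision and Pillow (both assumed to be installed); the file name photo.jpg is a placeholder, not a file from any particular dataset.

```python
# Image augmentation sketch: rotation, crop, flip, and colour shifting.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                  # small random rotation
    transforms.RandomResizedCrop(size=224),                 # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror half the time
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # colour shifting
])

# "photo.jpg" is a placeholder path; each call produces a different variant,
# so one labelled image can be turned into many training examples.
original = Image.open("photo.jpg")
augmented_variants = [augment(original) for _ in range(5)]
for i, img in enumerate(augmented_variants):
    img.save(f"photo_augmented_{i}.jpg")
```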

Conclusion

Every AI project begins with a guess. Educating that guess will be the difference between a successful outcome and a costly failure.

 




iMerit is recognized as a leader in delivering high-quality data to power the development of AI technology in autonomous vehicles, healthcare, GIS applications, agriculture, e-commerce, finance & insurance.
