Quality Assurance in GenAI Projects - Evolution from API Testing to Model Response Verification | nasscom | The Official Community of Indian IT Industry

Sanju Dalla

May 31, 2024

Digital Transformation

In a rapidly evolving technology era, the emergence of Generative Artificial Intelligence (GenAI) has opened a new area in software development and quality assurance.

Traditionally, for digital transformation projects built around a microservices architecture, API testing has been of paramount importance because it provides fast feedback on the functionality and performance of software applications. With the introduction of AI into these systems, however, the focus has shifted.

While API testing remains important, the critical phase in testing GenAI-based projects is verifying the responses generated by the applications or AI models.

In this whitepaper, we will look at the digital assurance design of GenAI-based projects, highlighting the shift from API testing to model response verification.

Shift Left

Requirement Understanding

Model behavior for use case
- Quality Assurance teams should be well versed in the application being built and should know every functionality whose outcome can be affected by model customizations.
Understanding of Knowledge base
- to be used for text-based GenAI applications

Architecture and Design

The architecture components and the communication between them should be understood
Functioning of the RAG process, embeddings, vector search, and other important parameters such as temperature
Prompt templates - used to create a relevant and rich context so that the model returns the right response
Tokenization, rate limit, and token limit of the model
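
To make the relationship between prompt templates and the token limit concrete, here is a minimal sketch in Python. The template wording, the `build_prompt` and `rough_token_count` helpers, and the word-count estimate are all assumptions for illustration; a real project would count tokens with the tokenizer of the model it actually uses.

```python
from string import Template

# Hypothetical template; a real project would maintain its own wording.
PROMPT_TEMPLATE = Template(
    "Answer the question using only the context below.\n"
    "Context:\n$context\n"
    "Question: $question\n"
    "Answer:"
)


def rough_token_count(text: str) -> int:
    """Rough estimate: whitespace-separated words. Real tokenizers are model-specific."""
    return len(text.split())


def build_prompt(question: str, chunks: list[str], token_limit: int) -> str:
    """Add retrieved chunks in ranked order until the estimated token budget is used up."""
    selected: list[str] = []
    for chunk in chunks:
        candidate = PROMPT_TEMPLATE.substitute(
            context="\n".join(selected + [chunk]), question=question
        )
        if rough_token_count(candidate) > token_limit:
            break  # adding this chunk would exceed the budget
        selected.append(chunk)
    return PROMPT_TEMPLATE.substitute(context="\n".join(selected), question=question)
```

A tester can use a helper like this to check what happens at the boundary: which retrieved chunks are dropped when the limit is tight, and whether the application still answers sensibly with a reduced context.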

Code Quality - Static code analysis and Unit testing

Implement static code analysis and unit testing for early feedback
Ensure the focus is on correctness, security, maintainability, and performance

Automation Testing and Manual Validations

Manual Model evaluation and validation

Human Evaluation (human in the loop, HIL)
- Implement a human-in-the-loop framework in which domain experts review the output of models for various data sets.
Domain/Business-specific Assessment Framework
- A domain expert should create an assessment framework that highlights the attributes contributing to the success, failure, or accuracy percentage of a model response.
Testing type, Test Data and Scenarios
- Test Data - The domain SME should prepare test data for each round of testing and check accuracy using the framework.
- Prompt and Response testing
  - General Scenarios - The domain SME should create several different sets of questions or prompts and have them reviewed by the POs.
  - For text-based applications, GenAI itself can be used to draft a series of questions and model answers from a chunk of text on a particular subject. These must be carefully checked by hand before use.
  - Prompt Variation
    - Empty input or excessively long sentences
    - The same input producing widely varying outputs
    - Multiple choice questions
    - Edge cases or challenging examples that may push the model's limits
    - Relevance over time (add new knowledge and new checks, then re-check relevancy)
  - If the application is related to code generation
    - Incomplete code, simple code, complex code (nested structure), code with comments and documentation, code with external libraries, code with deliberate errors, code with exception handling, code in various formats, code with multithreading
  - Adversarial testing - to assess the robustness of AI models against unexpected or malicious inputs.
    - Ensure the model is trained for adversarial inputs so that it can generate the right response
      - Explicit or Implicit input
        - Malicious, toxic, or ambiguous
        - Inconsistent, inaccurate, or non-existent
        - Bias - age/race/gender
        - Negation
      - Intensity Variation
        - Adjusting the tone, sentiment, or emphasis
        - Variation of output or information loss with intensity
        - Ask for concise information, then ask whether the original and the concise information have the same meaning
Error handling
Compatibility of functionality with various versions of the model
Metrics
- Use accuracy metrics to assess the responses given by the model
  - Accuracy: The ratio of correct responses to the total number of responses. Formula: Accuracy = Correct responses / Total responses
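
As a worked illustration of the accuracy formula, the sketch below computes overall accuracy and accuracy per scenario from reviewer verdicts. The scenario names and the sample verdicts are made up for the example; the only thing taken from the text is the formula itself.

```python
from collections import defaultdict


def accuracy(verdicts: list[bool]) -> float:
    """Accuracy = correct responses / total responses."""
    if not verdicts:
        raise ValueError("no reviewed responses")
    return sum(verdicts) / len(verdicts)


def accuracy_by_scenario(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """Group (scenario, is_correct) pairs and compute accuracy for each scenario."""
    grouped: dict[str, list[bool]] = defaultdict(list)
    for scenario, is_correct in reviews:
        grouped[scenario].append(is_correct)
    return {scenario: accuracy(flags) for scenario, flags in grouped.items()}


# Illustrative verdicts recorded by a domain expert (hypothetical data).
reviews = [
    ("general", True), ("general", True), ("general", False),
    ("adversarial", True), ("adversarial", False),
]
overall = accuracy([ok for _, ok in reviews])  # 3 correct out of 5
per_scenario = accuracy_by_scenario(reviews)
```

Splitting the figure by scenario is a design choice, not something the formula requires; it simply makes it easier to see whether a drop in accuracy comes from general prompts or from adversarial ones.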

Automation Testing

GenAI Application Response verification through automation

Design the automation and create a Python framework that supports the following:

Capture the inputs and responses marked as accurate or validated by the domain expert
Keep all validated responses in a file, and read the responses from that file
Create a function in the automation framework to calculate the similarity between the actual output and the expected output
Execute the tests and create a report
Regression should be executed on a regular basis to ensure that the model remains consistent over time
Regression suites should be categorized so as to minimize the cost of model usage. Service virtualization can be used where only the API test matters and model accuracy is not being measured.
Monitor drift through regression and ensure there is no deviation
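
The framework steps above can be sketched roughly as follows. This is a minimal illustration, not a reference design: the JSON file layout, the function names, the 0.8 threshold, and the use of `difflib` as the similarity function are all assumptions. `stub_model` stands in for a virtualized service in runs where model accuracy is not being measured.

```python
import json
from difflib import SequenceMatcher
from typing import Callable


def similarity(actual: str, expected: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for an embedding-based score."""
    return SequenceMatcher(None, actual.lower().strip(), expected.lower().strip()).ratio()


def load_validated(path: str) -> list[dict]:
    """Read expert-validated cases stored as [{"prompt": ..., "expected": ...}, ...]."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def run_regression(cases: list[dict], ask_model: Callable[[str], str],
                   threshold: float = 0.8) -> list[dict]:
    """Call the model for each case and record the score and a pass/fail verdict."""
    report = []
    for case in cases:
        actual = ask_model(case["prompt"])
        score = similarity(actual, case["expected"])
        report.append({"prompt": case["prompt"], "score": round(score, 3),
                       "passed": score >= threshold})
    return report


def stub_model(prompt: str) -> str:
    """Virtualized model for runs where only the API path matters (no model cost)."""
    canned = {"What is the capital of France?": "Paris is the capital of France."}
    return canned.get(prompt, "I do not know.")
```

Because `ask_model` is passed in as a function, the same regression run can point at the real model for accuracy and drift checks or at the stub for low-cost API-only runs. Comparing the scores in successive reports is one simple way to watch for drift.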
There is no “one size fits all” approach to choosing an evaluation metric, so depending on the use case the following can be used:

Metrics - Exact Match & Similarity Score

For exact or near-exact match - Use the evaluate library from Hugging Face to compute the following scores
- BLEU Score: Measures the n-gram precision of the generated output against the reference answers.
- ROUGE Score: Evaluates the overlap of n-grams between the generated and reference text.
- METEOR Score: Takes into account precision, recall, and alignment of generated and reference text.
Sentence/Text Similarity
- Compute the dot product between the embeddings of the generated and reference answers, using the Sentence Transformers library from Hugging Face
- Compute the cosine similarity between the same embeddings, which normalizes the dot product by the vector lengths
Factual Consistency - Assess whether the generated answer is factually consistent with the reference answer.
- Precision, Recall, F1 Score: Compare the generated facts to the ground truth facts.
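
To show what these scores measure, here is a simplified sketch using only the Python standard library. It is not the Hugging Face implementation: the n-gram overlap below is a bare-bones proxy for BLEU-style precision and ROUGE-N-style recall (real BLEU adds a brevity penalty and combines several n-gram orders), and the vectors passed to `cosine_similarity` are assumed to come from a sentence embedding model.

```python
import math
from collections import Counter


def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product divided by the product of the vector norms; 0.0 for a zero vector."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def ngrams(text: str, n: int) -> Counter:
    """Count lowercase whitespace-token n-grams."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(generated: str, reference: str, n: int = 1) -> dict[str, float]:
    """Clipped n-gram overlap: precision (BLEU-like), recall (ROUGE-N-like) and F1."""
    gen, ref = ngrams(generated, n), ngrams(reference, n)
    matched = sum((gen & ref).values())  # each n-gram counted at most as often as in both
    precision = matched / sum(gen.values()) if gen else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, comparing "the cat sat" against the reference "the cat sat on the mat" gives full precision (every generated word appears in the reference) but only half recall (half of the reference words are missing), which is why precision and recall are usually reported together.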

API and UI tests Automation

As with digital transformation or microservices-based projects, API testing plays an important role in GenAI-based applications, which also communicate through microservices in the backend.

APIs should be tested, and the tests automated, to ensure the functional correctness of the microservices, reduce regression effort, and catch bugs early in development.
The UI should be tested for user experience and functionality, and those tests should be automated
Integration tests should be written and automated to ensure that communication between microservices is correct and complete enough to support the end-to-end functionality of the system
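
One small building block for automated API tests is a response contract check. The sketch below assumes a hypothetical chat endpoint that returns JSON with the fields listed in `REQUIRED_FIELDS`; the field names are illustrative and not taken from any specific product.

```python
import json

# Hypothetical response contract for a chat endpoint; field names are illustrative.
REQUIRED_FIELDS = {"answer": str, "sources": list,
                   "prompt_tokens": int, "completion_tokens": int}


def contract_violations(status_code: int, body: str) -> list[str]:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    if status_code != 200:
        problems.append(f"unexpected status {status_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return problems + ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return problems + ["body is not a JSON object"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems
```

A check like this validates the shape of the response, not the quality of the answer, so it pairs naturally with a virtualized model: the API path is exercised without paying for model calls.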

Non functional requirements

Performance Testing

Models have a limit on the number of tokens they can process in a single step and once this limit is reached, models may start to “forget” previous information.

Parameters and error codes to be monitored

Rate limit reached for requests - indicates that too many requests were sent in a short period of time and exceeded the number of requests allowed. The rate limit expects requests to be evenly distributed over a one-minute period, so a burst of requests can return a 429 response even though the per-minute limit has not been reached.
Total number of tokens - the sum of prompt tokens and completion tokens, which must stay within the model's token limit
max_tokens parameter - sets the maximum number of tokens in the model's response, to avoid consuming more tokens than intended
Time to first token rendered, measured from submission of the user prompt, at multiple percentiles
Requests Per Second (RPS) for the LLM
Tokens rendered per second
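
A few of these measurements can be sketched in Python as follows. `RateLimitError` is a hypothetical exception that a client wrapper would raise on an HTTP 429 response, and the backoff schedule and the chosen percentiles are assumptions for illustration rather than recommended values.

```python
import time
from statistics import quantiles
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical exception a client wrapper raises on an HTTP 429 response."""


def min_interval_seconds(requests_per_minute: int) -> float:
    """Spacing that spreads requests evenly across one minute."""
    return 60.0 / requests_per_minute


def call_with_backoff(send: Callable[[], str], max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry on 429 with exponential backoff: base_delay, then 2x, 4x, ..."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the 429 to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """p50, p90 and p99 of time-to-first-token samples (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Throughput of a single response."""
    return token_count / elapsed_seconds
```

In a load test, the number of retries triggered in `call_with_backoff` is itself worth recording, since it shows how often the request pattern is hitting the rate limit.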

Security

Proactive Feature and Architecture review
SCA and SAST using Sonar
DAST using ZAP against the OWASP Top 10 risks
PenTest to discover vulnerabilities
Prompt injection testing - inject malicious inputs, including Unicode-based prompts, to confirm that the application resists injected threats.
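
One possible reading of Unicode-based prompt testing is sketched below: generate obfuscated copies of a known injection phrase and check that the application's input filter still recognizes them. The filter (`is_blocked`), the blocked phrase list, and the two obfuscation techniques are assumptions for illustration; a real test suite would cover many more variants.

```python
import unicodedata

ZERO_WIDTH = "\u200b"  # zero-width space


def to_fullwidth(text: str) -> str:
    """Map printable ASCII (except space) to its fullwidth Unicode look-alike."""
    return "".join(chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c for c in text)


def unicode_variants(prompt: str) -> list[str]:
    """Obfuscated copies of an injection prompt to feed into the application under test."""
    return [to_fullwidth(prompt), ZERO_WIDTH.join(prompt)]


def normalize(text: str) -> str:
    """NFKC folds fullwidth forms back to ASCII; zero-width spaces are then dropped."""
    return unicodedata.normalize("NFKC", text).replace(ZERO_WIDTH, "")


def is_blocked(user_input: str, blocked_phrases: list[str]) -> bool:
    """Hypothetical filter: block input that contains a known injection phrase."""
    cleaned = normalize(user_input).lower()
    return any(phrase in cleaned for phrase in blocked_phrases)
```

The point of the test is the contrast: a filter that compares raw strings misses both variants, while one that normalizes first catches them.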
