Quality Assurance in GenAI Projects - Evolution from API Testing to Model Response Verification | nasscom | The Official Community of Indian IT Industry

Terms of use

Terms of Use

The use of this site and the content contained therein is governed by the Terms of Use. When you use this site you acknowledge that you have read the Terms of Use and that you accept and will be bound by the terms hereof and such terms as may be modified from time to time.

All text, graphics, audio, design and other works on the site are the copyrighted works of nasscom unless otherwise indicated. All rights reserved.
Content on the site is for personal use only and may be downloaded provided the material is kept intact and there is no violation of the copyrights, trademarks, and other proprietary rights. Any alteration of the material or use of the material contained in the site for any other purpose is a violation of the copyright of nasscom and / or its affiliates or associates or of its third-party information providers. This material cannot be copied, reproduced, republished, uploaded, posted, transmitted or distributed in any way for non-personal use without obtaining the prior permission from nasscom.
The nasscom Members login is for the reference of only registered nasscom Member Companies.
nasscom reserves the right to modify the terms of use of any service without any liability. nasscom reserves the right to take all measures necessary to prevent access to any service or termination of service if the terms of use are not complied with or are contravened or there is any violation of copyright, trademark or other proprietary right.
From time to time nasscom may supplement these terms of use with additional terms pertaining to specific content (additional terms). Such additional terms are hereby incorporated by reference into these Terms of Use.

Disclaimer

The Company information provided on the nasscom web site is as per data collected by companies. nasscom is not liable on the authenticity of such data.
nasscom has exercised due diligence in checking the correctness and authenticity of the information contained in the site, but nasscom or any of its affiliates or associates or employees shall not be in any way responsible for any loss or damage that may arise to any person from any inadvertent error in the information contained in this site. The information from or through this site is provided "as is" and all warranties express or implied of any kind, regarding any matter pertaining to any service or channel, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement are disclaimed. nasscom and its affiliates and associates shall not be liable, at any time, for any failure of performance, error, omission, interruption, deletion, defect, delay in operation or transmission, computer virus, communications line failure, theft or destruction or unauthorised access to, alteration of, or use of information contained on the site. No representations, warranties or guarantees whatsoever are made as to the accuracy, adequacy, reliability, completeness, suitability or applicability of the information to a particular situation.
nasscom or its affiliates or associates or its employees do not provide any judgments or warranty in respect of the authenticity or correctness of the content of other services or sites to which links are provided. A link to another service or site is not an endorsement of any products or services on such site or the site.
The content provided is for information purposes alone and does not substitute for specific advice whether investment, legal, taxation or otherwise. nasscom disclaims all liability for damages caused by use of content on the site.
All responsibility and liability for any damages caused by downloading of any data is disclaimed.
nasscom reserves the right to modify, suspend / cancel, or discontinue any or all sections, or service at any time without notice.

For any grievances under the Information Technology Act 2000, please get in touch with Grievance Officer, Mr. Anirban Mandal at data-query@nasscom.in.

New

See all

No notification found.

Quality Assurance in GenAI Projects - Evolution from API Testing to Model Response Verification

Sanju Dalla

@Sanju Dalla

May 31, 2024

Digital Transformation

132

In the rapidly evolving technology era, the emergence of Generative Artificial Intelligence (GenAI) has brought a new area in software development and quality assurance.

Traditionally, for digital transformation projects around microservices architecture, API testing is of paramount importance to provide fast feedback on the functionality and performance of software applications. However, with the emergence of AI into these systems, the focus has shifted.

While API testing remains important, the critical phase in testing GenAI-based projects requires verifying the responses generated by the applications or AI models.

In this whitepaper, we will look at the digital assurance design of GenAI-based projects, highlighting the shift from API testing to model response verification.

Shift Left

Requirement Understanding

Model behavior for use case
- Quality Assurance teams should be well versed with applications being built. All the functionalities that can impact the outcome due to model customizations should be known
Understanding of Knowledge base
- to be used for text based GenAI applications

Architecture and Design

Architecture components and communications between these components should be understood
Functioning of RAG process, Embedings, Vector search or any other important parameters like temperature
Prompt templates - prompt templates for creating a relevant and rich context to fetch right response from Model
Tokenization, Rate limit and token limit of model

Code Quality - Static code analysis and Unit testing

Implementation of code analysis and unit testing for early feedback
Ensure correctness, security, maintenance and performance is focused

Automation Testing and Manual Validations

Manual Model evaluation and validation

Human Evaluation (HIL)
- Implement human in the loop framework where domain experts will review the output of models for various data sets.
Domain/Business specific Assessment Framework
- Domain expert should create an assessment framework to highlight attributes that contributes to the success or failure or accuracy % of model response
Testing type, Test Data and Scenarios
- Test Data - Domain SME should prepare test data for each round of test and check the accuracy using framework
- Prompt and Response testing
  - General Scenarios - Domain SME should create multiple different set of questions or prompts , get it reviewed by POs
  - For text based applications, utilize GenAI itself to document a series of questions and model answers from a chunk of text on a particular subject and it is important that these are carefully checked manually before their use.
  - Prompt Variation
    - empty input or excessively long sentences
    - Same input but a lot of variation in outputs.
    - Multiple choice questions
    - edge cases or challenging examples that may push the model's limits
    - Relevance over time (add new knowledge, new checks and check the relevancy)
  - If application is related to Code generation
    - Incomplete code,Simple code,Complex code (nested structure),Code with comments and documentation,Code with external libraries,Code with deliberate errors,Code with exception handling ,Code with various format,Code with multithreading
  - Adversarial testing - to assess the robustness of AI models against unexpected or malicious inputs.
    - Ensure model is trained for adversarial inputs so it can generate right response
      - Explicit or Implicit input
        
        Malicious or Toxic or Ambiguous
        
        Inconsistent or Inaccurate or Non existing
        
        Biasness - age/race/gender
        
        Negation
      - Intensity Variation
        
        Adjusting the tone, sentiment, or emphasis
        
        Variation of output or information loss with intensity
        
        Ask for concise information and then ask if original information and concise information are having same meaning
Error handling
Compatibility of functionality with various versions of the model
Metrics
- Use Accuracy metrics to assess the accuracy of responses given by a machine learning model
  - Accuracy: The ratio of correctly predicted instances to the total number of instances. Formula: Accuracy= Correct responses/Total responses

Automation Testing

GenAI Application Response verification through automation

Design Automation and Create Python framework supporting following

Capture the inputs and responses marked as accurate or validated by Domain expert
Keep all the validated responses into a file, Read the responses from file
Create a function in the automation framework to calculate similarity between actual output and the expected output
Execute and create Report
Regression should be executed on a regular basis to ensure that the model remains consistent over time
Regression should be categorized in a manner to minimize cost of model usage. Service virtualization can be used for such cases where only API test is important and model accuracy is not being measured.
Monitor drift through regression and ensure there is no deviation
There is no “one size fits all” approach to choosing an evaluation metric so depending upon the use case following can be used

Metrics - Exact Match & Similarity Score

For exact match - Use evaluate library from hugging face to calculate similarity and get following scores
- BLEU Score: Measures the similarity between the generated output and the reference answers.
- ROUGE Score: Evaluates the overlap of n-grams between the generated and reference text.
- METEOR Score: Takes into account precision, recall, and alignment of generated and reference text.
Sentence/Text Similarity
- Compute the dot product between the embeddings of the generated and reference answers, use sentence transformer library from hugging face
- cosine similarity from hugging face
Factual Consistency - Assess whether the generated answer is factually consistent with the reference answer.
- Precision, Recall, F1 Score: Compare the generated facts to the ground truth facts.

API and UI tests Automation

Like we do for digital transformation projects or Microservices based projects, API testing plays an important role in GenAI based applications too which are communicating through microservices in the backend.

APIs should be tested and should be automated to ensure correctness of functional behavior of microservices, to reduce effort of regression and to catch bugs early in the development.
UI should be tested for user experience and functionality and should be automated
Integration tests should be written and automated to ensure the correctness and completeness of communication between microservices to support the end to end functionality of the system

Non functional requirements

Performance Testing

Models have a limit on the number of tokens they can process in a single step and once this limit is reached, models may start to “forget” previous information.

Parameters or error code to be monitored

Rate limit reached for requests - indicates that too many requests are sent in a short period of time and have exceeded the number of requests allowed.The rate limit expects that requests will be evenly distributed over a one-minute period. And you will receive a 429 response if it is not maintained even though the limit isn't met.
Total number of tokens, sum of prompt tokens and completion tokens a model is allowed to respond with
Max_tokens parameter: It determines the max tokens to be output as the model's response to avoid consuming more tokens.
Time to first token render from submission of the user prompt, measured at multiple percentiles.
Requests Per Second (RPS) for the LLM
Tokens rendered per second

Security

Proactive Feature and Architecture review
SCA and SAST using Sonar
DAST using ZAP for top 10 OWASP principles
PenTest to discover vulnerabilities
Prompt injection to test injection of malicious inputs - Unicode based prompts can be introduced to avoid threat injection.

Disclaimer

That the contents of third-party articles/blogs published here on the website, and the interpretation of all information in the article/blogs such as data, maps, numbers, opinions etc. displayed in the article/blogs and views or the opinions expressed within the content are solely of the author's; and do not reflect the opinions and beliefs of NASSCOM or its affiliates in any manner. NASSCOM does not take any liability w.r.t. content in any manner and will not be liable in any manner whatsoever for any kind of liability arising out of any act, error or omission. The contents of third-party article/blogs published, are provided solely as convenience; and the presence of these articles/blogs should not, under any circumstances, be considered as an endorsement of the contents by NASSCOM in any manner; and if you chose to access these articles/blogs , you do so at your own risk.

Sanju Dalla

Leveraging AI for Smarter Credit De...

Anaptyss

BFSI

23 Jul 2025

Cloud Hosting vs Traditional Hostin...

Cyfuture Cloud

Cloud Computing

22 Jul 2025

Crypto Sniper Bot Development for F...

Agnaljohn

Blockchain

22 Jul 2025

New RDI Scheme: A Strategic Catalys...

Kuhu Singh

Current Issues

19 Jul 2025

The rise of contextual tech, from g...

YASH Technologies

Industry Trends

18 Jul 2025

How AI and Managed Services Enhance...

Anaptyss

BFSI

17 Jul 2025

Choosing the Right Cloud Hosting St...

Cyfuture Cloud

Cloud Computing

17 Jul 2025

AI-Driven Personalization in Wealth...

NuSummit

AI

15 Jul 2025

Critical Success Factors for Financ...

NuSummit

Digital Transfo..

15 Jul 2025

Building Client Loyalty with Data a...

NuSummit

Digital Transfo..

15 Jul 2025

How Cloud-Native Platforms Are Impr...

NuSummit

Cloud Computing

15 Jul 2025

How Cloud Infrastructure is Transfo...

NuSummit

582

BFSI

15 Jul 2025

Leveraging AI for Smarter Credit Decisions Across the Lending Lifecycle

Anaptyss

@Anaptyss

23 Jul 2025

BFSI Fintech

In today's fiercely competitive financial landscape, banking leaders face a triad of pressures, including compressing margins, escalating credit risk, and rapidly evolving regulatory expectations. For banks and financial institutions, navigating…

Cloud Hosting vs Traditional Hosting: A Detailed Comparison

Cyfuture Clou..

@cyfuturecloud

22 Jul 2025

Cloud Computing Digital Transformation

Choosing between cloud hosting and traditional hosting is a pivotal decision for any business or website owner. The right hosting solution not only affects your site's performance, reliability, and scalability but also has a major impact on your…

Crypto Sniper Bot Development for Fast Execution and Competitive Market Advantage

Agnaljohn

@agnaljohn

22 Jul 2025

Blockchain

Crypto trading has moved far beyond manual charts and late-night alerts. Today, milliseconds can determine whether you win big or miss out. As the digital asset space becomes more competitive, speed, precision, and automation have become…

New RDI Scheme: A Strategic Catalyst or A Repackaged Initiative?

Kuhu Singh

@Kuhu

19 Jul 2025

Current Issues Digital Transformation

On July 1, 2025, the union cabinet approved the Research Development and Innovation (RDI) Scheme worth a corpus of Rs 1 trillion (~US $12 billion). The new Scheme outlines the Indian government’s intent to catalyze private sector research and…

The rise of contextual tech, from generic to precise

YASH Technolo..

@yash.technologies

18 Jul 2025

Industry Trends ESG & Sustainability

The shift toward industry-focused technology A quiet pressure has been seeping through boardrooms and C-suite conversations: technology stacks are being refactored so that sectors, rather than vendors, dictate first principles. Instead of debating…

How AI and Managed Services Enhance Portfolio Stability for Banks Navigating Complex Covenants

Anaptyss

@Anaptyss

17 Jul 2025

BFSI Fintech

Covenants serve as an important mechanism for lenders to monitor borrower behavior and manage credit exposure. They include both financial measures—such as leverage or interest coverage ratios—and non-financial terms like reporting timelines or…

New

Quality Assurance in GenAI Projects - Evolution from API Testing to Model Response Verification

Sanju Dalla