Topics In Demand
Notification
New

No notification found.

Quality Assurance in GenAI Projects - Evolution from API Testing to Model Response Verification
Quality Assurance in GenAI Projects - Evolution from API Testing to Model Response Verification

May 31, 2024

48

0

In the rapidly evolving technology era, the emergence of Generative Artificial Intelligence (GenAI) has brought a new area in software development and quality assurance.

Traditionally, for digital transformation projects around microservices architecture, API testing is of paramount importance to provide fast feedback on the functionality and performance of software applications. However, with the emergence of AI into these systems, the focus has shifted. 

While API testing remains important, the critical phase in testing GenAI-based projects requires verifying the responses generated by the applications or AI models.

In this whitepaper, we will look at the digital assurance design of GenAI-based projects, highlighting the shift from API testing to model response verification.

 

Shift Left

Requirement Understanding

  • Model behavior for use case

    • Quality Assurance teams should be well versed with applications being built. All the functionalities that can impact the outcome due to model customizations should be known
  • Understanding of Knowledge base
    • to be used for text based GenAI applications

Architecture and Design

  • Architecture components and communications between these components should be understood
  • Functioning of RAG process, Embedings, Vector search or any other important parameters like temperature
  • Prompt templates - prompt templates for creating a relevant and rich context to fetch right response from Model
  • Tokenization, Rate limit and token limit of model

Code Quality - Static code analysis and Unit testing

 

  • Implementation of code analysis and unit testing for early feedback
  • Ensure correctness, security, maintenance and performance is focused

 

Automation Testing and Manual Validations

Manual Model evaluation and validation

  • Human Evaluation (HIL)

    • Implement human in the loop framework where domain experts will review the output of models for various data sets. 
  • Domain/Business specific Assessment Framework
    • Domain expert should create an assessment framework to highlight attributes that contributes to the success or failure or accuracy % of model response
  • Testing type, Test Data and Scenarios
    • Test Data - Domain SME should prepare test data for each round of test and check the accuracy using framework
    • Prompt and Response testing
      • General Scenarios - Domain SME should create multiple different set of questions or prompts , get it reviewed by POs
      • For text based applications, utilize GenAI itself to document a series of questions and model answers from a chunk of text on a particular subject and it is important that these are carefully checked manually before their use.
      • Prompt Variation
        • empty input or excessively long sentences
        • Same input but a lot of variation in outputs. 
        • Multiple choice questions
        • edge cases or challenging examples that may push the model's limits
        • Relevance over time (add new knowledge, new checks and check the relevancy)
      • If application is related to Code generation
        • Incomplete code,Simple code,Complex code (nested structure),Code with comments and documentation,Code with external libraries,Code with deliberate errors,Code with exception handling ,Code with various format,Code with multithreading
      • Adversarial testing - to assess the robustness of AI models against unexpected or malicious inputs.
        • Ensure model is trained for adversarial inputs so it can generate right response
          • Explicit or Implicit input
            • Malicious or Toxic or Ambiguous
            • Inconsistent or Inaccurate or Non existing
            • Biasness - age/race/gender
            • Negation
          • Intensity Variation
            • Adjusting the tone, sentiment, or emphasis
            • Variation of output or information loss with intensity
            • Ask for concise information and then ask if original information and concise information are having same meaning
  • Error handling
  • Compatibility of functionality with various versions of the model
  • Metrics
    • Use Accuracy metrics to assess the accuracy of responses given by a machine learning model
      • Accuracy: The ratio of correctly predicted instances to the total number of instances. Formula: Accuracy= Correct responses/Total responses

Automation Testing

GenAI Application Response verification through automation

Design Automation and Create Python framework supporting following

  • Capture the inputs and responses marked as accurate or validated by Domain expert
  • Keep all the validated responses into a file, Read the responses from file
  • Create a function in the automation framework to calculate similarity between actual output and the expected output
  • Execute and create Report
  • Regression should be executed on a regular basis to ensure that the model remains consistent over time 
  • Regression should be categorized in a manner to minimize cost of model usage. Service virtualization can be used for such cases where only API test is important and model accuracy is not being measured.
  • Monitor drift through regression and ensure there is no deviation
  • There is no “one size fits all” approach to choosing an evaluation metric so depending upon the use case following can be used

Metrics - Exact Match & Similarity Score 

  • For exact match - Use evaluate library from hugging face to calculate similarity and get following scores
    • BLEU Score: Measures the similarity between the generated output and the reference answers.
    • ROUGE Score: Evaluates the overlap of n-grams between the generated and reference text.
    • METEOR Score: Takes into account precision, recall, and alignment of generated and reference text.
  • Sentence/Text Similarity
    • Compute the dot product between the embeddings of the generated and reference answers, use sentence transformer library from hugging face 
    • cosine similarity from hugging face
  • Factual Consistency - Assess whether the generated answer is factually consistent with the reference answer.
    • Precision, Recall, F1 Score: Compare the generated facts to the ground truth facts.

 

API and UI tests Automation

Like we do for digital transformation projects or Microservices based projects, API testing plays an important role in GenAI based applications too which are communicating through microservices in the backend.

  • APIs should be tested and should be automated to ensure correctness of functional behavior of microservices, to reduce effort of regression and to catch bugs early in the development.
  • UI should be tested for user experience and functionality and should be automated
  • Integration tests should be written and automated to ensure the correctness and completeness of communication between microservices to support the end to end functionality of the system

 

Non functional requirements

Performance Testing

Models have a limit on the number of tokens they can process in a single step and once this limit is reached, models may start to “forget” previous information.

 

Parameters or error code to be monitored

  • Rate limit reached for requests - indicates that too many requests are sent in a short period of time and have exceeded the number of requests allowed.The rate limit expects that requests will be evenly distributed over a one-minute period. And you will receive a 429 response if it is not maintained even though the limit isn't met.
  • Total number of tokens, sum of prompt tokens and completion tokens a model is allowed to respond with
  • Max_tokens parameter: It determines the max tokens to be output as the model's response to avoid consuming more tokens.
  • Time to first token render from submission of the user prompt, measured at multiple percentiles.
  • Requests Per Second (RPS) for the LLM
  • Tokens rendered per second

 

Security

  • Proactive Feature and Architecture review
  • SCA and SAST using Sonar
  • DAST using ZAP for top 10 OWASP principles
  • PenTest to discover vulnerabilities
  • Prompt injection to test injection of malicious inputs - Unicode based prompts can be introduced to avoid threat injection.

That the contents of third-party articles/blogs published here on the website, and the interpretation of all information in the article/blogs such as data, maps, numbers, opinions etc. displayed in the article/blogs and views or the opinions expressed within the content are solely of the author's; and do not reflect the opinions and beliefs of NASSCOM or its affiliates in any manner. NASSCOM does not take any liability w.r.t. content in any manner and will not be liable in any manner whatsoever for any kind of liability arising out of any act, error or omission. The contents of third-party article/blogs published, are provided solely as convenience; and the presence of these articles/blogs should not, under any circumstances, be considered as an endorsement of the contents by NASSCOM in any manner; and if you chose to access these articles/blogs , you do so at your own risk.


© Copyright nasscom. All Rights Reserved.