This post was co-written with Vishal Singh, Data Engineering Leader on the Data & Analytics team at GoDaddy.
Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, invoking LLMs one request or API call at a time doesn’t scale well for many production applications.
With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.
This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company that seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers, insights that were previously available only to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.
Solution overview
GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:
GoDaddy used an out-of-the-box Meta Llama 2 model to generate product categories for six million products, where each product is identified by a SKU. The generated categories were often incomplete or mislabeled, and employing an LLM for individual product categorization proved costly. Recognizing these limitations, GoDaddy sought a more accurate and cost-efficient approach to product categorization to improve their customer experience.
This solution uses the following components to categorize products more accurately and efficiently:
The key steps are illustrated in the following figure:
- A JSONL file containing product data is uploaded to an S3 bucket, triggering the first Lambda function. Amazon Bedrock batch processes this single JSONL file, where each row contains input parameters and prompts. It generates an output JSONL file with a new model_output value appended to each row, corresponding to the input data.
- The Lambda function spins up an Amazon Bedrock batch processing endpoint and passes the S3 file location.
- The Amazon Bedrock endpoint performs the following tasks:
  - It reads the product name data and generates a categorized output, including category, subcategory, season, price range, material, color, product line, gender, and year of first sale.
  - It writes the output to another S3 location.
- The second Lambda function performs the following tasks:
  - It monitors the batch processing job on Amazon Bedrock.
  - It shuts down the endpoint when processing is complete. (A minimal sketch of both Lambda functions follows this list.)
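The following is a minimal sketch of the two Lambda handlers described above; the role ARN, model ID, bucket names, and handler names are placeholder assumptions rather than GoDaddy's actual implementation:

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client used for batch inference jobs


def start_batch_job_handler(event, context):
    """First Lambda function (sketch): triggered when the input JSONL file lands in S3."""
    record = event["Records"][0]["s3"]
    input_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"
    response = bedrock.create_model_invocation_job(
        jobName=f"product-categorization-{context.aws_request_id}",
        roleArn="arn:aws:iam::111122223333:role/BedrockBatchRole",  # placeholder IAM role
        modelId="anthropic.claude-instant-v1",                      # placeholder model choice
        inputDataConfig={"s3InputDataConfig": {"s3Uri": input_uri}},
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://example-bucket/output/"}},  # placeholder
    )
    return {"jobArn": response["jobArn"]}


def monitor_batch_job_handler(event, context):
    """Second Lambda function (sketch): checks the batch job and reports its status."""
    job = bedrock.get_model_invocation_job(jobIdentifier=event["jobArn"])
    return {"status": job["status"]}
```

The Amazon Bedrock batch inference API calls themselves are covered in more detail in the Batch inference section of this post.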
The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.
We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.
The distribution of the number of words or tokens per SKU shows only mild outliers, which makes it practical to bundle many products into a single prompt for categorization and potentially yields more efficient model responses.
The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.
In the following sections, we look at the key components of the solution in more detail.
Batch inference
We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. The API returns a response containing the jobArn.
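The following is a minimal boto3 sketch of this call; the job name, IAM role ARN, model ID, and S3 URIs are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_invocation_job(
    jobName="product-categorization-batch-job",                    # unique job name (placeholder)
    roleArn="arn:aws:iam::111122223333:role/BedrockBatchRole",     # placeholder IAM role
    modelId="anthropic.claude-instant-v1",                          # placeholder model ID
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://example-bucket/input/products.jsonl"}  # input JSONL
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://example-bucket/output/"}              # output prefix
    },
)
job_arn = response["jobArn"]
```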
We can monitor the job status using the GetModelInvocationJob API with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:

- Submitted – The job is marked Submitted when the JSON file is ready to be processed by Amazon Bedrock for inference.
- InProgress – The job is marked InProgress when Amazon Bedrock starts processing the JSON file.
- Failed – The job is marked Failed if there was an error while processing. The error can be written into the JSON file as part of modelOutput. If it was a 4xx error, it’s written in the metadata of the job.
- Completed – The job is marked Completed when the output JSON file is generated for the input JSON file and has been uploaded to the S3 output path submitted as part of the CreateModelInvocationJob in outputDataConfig.
- Stopped – The job is marked Stopped when a StopModelInvocationJob API is called on a job that is InProgress. A terminal state job (Succeeded or Failed) can’t be stopped using StopModelInvocationJob.
The following is example code for the GetModelInvocationJob API:
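This is a minimal polling sketch, assuming the bedrock client and job_arn from the previous example; the polling interval is arbitrary:

```python
import time

# Poll the batch job until it reaches a terminal state
while True:
    job = bedrock.get_model_invocation_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)  # wait before checking again
```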
When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files:
- json.out – The following code shows an example of the format:
- <file_name>.jsonl.out – The following screenshot shows an example of the output, containing the successfully processed records. For each record, modelOutput contains the list of categories for the given product name in JSON format.
We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product (a sketch of the schema and parser setup follows below). Because we enable n-packing, we wrap the schema in a list, as defined in List_of_CCData.
We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
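The following is a minimal sketch of how the CCData and List_of_CCData schemas and the LangChain parsers could be wired together. The field descriptions mirror the output schema shown later in this post; the Bedrock output record keys, file names, and the fixer model choice are assumptions, and depending on your LangChain version you may need to import the Pydantic classes from langchain_core.pydantic_v1 instead.

```python
import json
from typing import List

import pandas as pd
from langchain.output_parsers import OutputFixingParser, PydanticOutputParser
from langchain_community.llms import Bedrock
from pydantic import BaseModel, Field


class CCData(BaseModel):
    """Categories generated for a single product (SKU)."""
    product_name: str = Field(description="product name, which will be given as input")
    brand: str = Field(description="Brand of the product inferred from the product name")
    color: str = Field(description="Color of the product inferred from the product name")
    material: str = Field(description="Material of the product inferred from the product name")
    price: str = Field(description="Price of the product inferred from the product name")
    category: str = Field(description="Category of the product inferred from the product name")
    sub_category: str = Field(description="Sub-category of the product inferred from the product name")
    product_line: str = Field(description="Product Line of the product inferred from the product name")
    gender: str = Field(description="Gender of the product inferred from the product name")
    year_of_first_sale: str = Field(description="Year of first sale inferred from the product name")
    season: str = Field(description="Season of the product inferred from the product name")


class List_of_CCData(BaseModel):
    """Wrapper schema: with n-packing, one LLM response covers several SKUs."""
    list_of_dict: List[CCData]


parser = PydanticOutputParser(pydantic_object=List_of_CCData)

# Fall back to an LLM-assisted fixer when the first parsing attempt fails
fixing_llm = Bedrock(model_id="anthropic.claude-instant-v1")  # assumed fixer model
fixing_parser = OutputFixingParser.from_llm(parser=parser, llm=fixing_llm)

rows = []
with open("products.jsonl.out") as f:               # downloaded from the S3 output path
    for line in f:
        record = json.loads(line)
        model_output = record["modelOutput"]         # assumed key layout of the batch output
        # For Anthropic Claude the generated text is under "completion"; adjust for other models
        generated_text = model_output.get("completion", str(model_output))
        try:
            parsed = parser.parse(generated_text)
        except Exception:
            parsed = fixing_parser.parse(generated_text)
        rows.extend(item.dict() for item in parsed.list_of_dict)

pd.DataFrame(rows).to_csv("categories.csv", index=False)
```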
Prompt engineering
Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to efficiently use LLMs for diverse applications. Essentially, prompt engineering is about effectively interacting with an LLM. The most effective prompt engineering strategy varies with the specific task and data; here, the task is data card (category) generation for GoDaddy SKUs.
Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.
In this post, we organize the prompt engineering solution into two steps: output generation and format parsing.
Output generation
The following are best practices and considerations for output generation:
- Provide simple, clear, and complete instructions – This is the general guideline for prompt engineering work.
- Use separator characters consistently – In this use case, we use the newline character \n as the separator.
- Handle default output values such as missing – For this use case, we don’t want special values such as N/A or missing, so we include multiple instructions in the prompt aimed at excluding default or missing values.
- Use few-shot prompting – Also termed in-context learning, few-shot prompting involves providing a handful of examples, which can be beneficial in helping LLMs understand the output requirements more effectively. In this use case, 0–10 in-context examples were tested for both Llama 2 and Anthropic’s Claude models.
- Use packing techniques – We combined multiple SKU and product names into one LLM query, so that some prompt instructions can be shared across different SKUs for cost and latency optimization. In this use case, 1–10 packing numbers were tested for both Llama 2 and Anthropic’s Claude models.
- Test for good generalization – You should keep a hold-out test set and correct responses to check if your prompt modifications generalize.
- Use additional techniques for Anthropic’s Claude model families – We incorporated the following techniques (illustrated together in the sketch after this list):
- Enclosing examples in XML tags:
- Using the Human and Assistant annotations:
- Guiding the assistant prompt:
- Use additional techniques for Llama model families – For Llama 2 model families, you can enclose examples in [INST] tags:
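The following illustrative sketch combines these techniques for both model families; the product example and exact wording are assumptions, not GoDaddy’s actual templates:

```python
# Illustrative prompt skeletons only -- the product example and wording are assumptions.

# Anthropic Claude: Human/Assistant annotations, XML-tagged few-shot examples,
# and a guided start to the Assistant turn so the reply begins with JSON
claude_prompt = """

Human: You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.

<examples>
<example>
Product name: Nike Air Jordan sneakers
Output: {"product_name": "Nike Air Jordan sneakers", "brand": "Nike", "category": "Footwear"}
</example>
</examples>

Categorize the following products and return the result as JSON:
Classic cotton t-shirt

Assistant: {"""

# Llama 2 chat: instructions and few-shot examples enclosed in [INST] tags
llama2_prompt = """<s>[INST] <<SYS>>
You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.
<</SYS>>

Example:
Product name: Nike Air Jordan sneakers
Output: {"product_name": "Nike Air Jordan sneakers", "brand": "Nike", "category": "Footwear"}

Categorize the following products and return the result as JSON:
Classic cotton t-shirt [/INST]"""
```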
Format parsing
The following are best practices and considerations for format parsing:
- Refine the prompt with modifiers – Refinement of task instructions typically involves altering the instruction, task, or question part of the prompt. The effectiveness of these techniques varies based on the task and data. Some beneficial strategies in this use case include:
- Role assumption – Ask the model to assume it’s playing a role. For example:
You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.
- Prompt specificity – Being very specific and providing detailed instructions to the model can help generate better responses for the required task.
EVERY category information needs to be filled based on BOTH product name AND your best guess. If you forget to generate any category information, leave it as missing or N/A, then an innocent people will die.
- Output format description – We provided the JSON format instructions through a JSON string directly, as well as through the few-shot examples indirectly.
- Pay attention to few-shot example formatting – The LLMs (Anthropic’s Claude and Llama) are sensitive to subtle formatting differences. Parsing time was significantly improved after several iterations on few-shot examples formatting. The final solution is as follows:
- Use additional techniques for Anthropic’s Claude model families – For the Anthropic’s Claude model, we instructed it to format the output in JSON format:
- Use additional techniques for Llama 2 model families – For the Llama 2 model, we instructed it to format the output in JSON format as follows:
Format your output in the JSON format (ensure to escape special character):
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]}
is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}}
is not well-formatted.
Here is the output schema:
{"properties": {"list_of_dict": {"title": "List Of Dict", "type": "array", "items": {"$ref": "#/definitions/CCData"}}}, "required": ["list_of_dict"], "definitions": {"CCData": {"title": "CCData", "type": "object", "properties": {"product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"}, "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"}, "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"}, "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"}, "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"}, "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"}, "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"}, "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"}, "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"}, "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"}, "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}}}}}
Models and parameters
We used the following prompting parameters:
- Number of packings – 1, 5, 10
- Number of in-context examples – 0, 2, 5, 10
- Format instruction – JSON format pseudo example (shorter length), JSON format full example (longer length)
For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:
For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:
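The tuned parameter values aren’t listed above; the following illustrative sketch shows commonly used inference parameters for both model families, where the values are assumptions rather than the settings used in the experiments:

```python
# Illustrative inference parameters (values are assumptions, not the tuned settings)

llama2_params = {
    "max_gen_len": 2048,   # Llama 2 output length limit discussed later in this post
    "temperature": 0.1,    # low temperature for more deterministic categorization
    "top_p": 0.9,
}

claude_params = {
    "max_tokens_to_sample": 4096,
    "temperature": 0.1,
    "top_p": 0.9,
    "stop_sequences": ["\n\nHuman:"],
}
```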
The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.
Evaluations
The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics (a minimal sketch of the coverage computations follows this list):

- Content coverage – Measures the proportion of missing values in the output generation step.
- Parsing coverage – Measures the proportion of missing samples in the format parsing step:
  - Parsing recall on product name – An exact match serves as a lower bound for parsing completeness (parsing coverage is the upper bound) because in some cases, two virtually identical product names need to be normalized and transformed to be an exact match (for example, “Nike Air Jordan” and “nike. air Jordon”).
  - Parsing precision on product name – Similar to parsing recall on exact matches, but using precision instead of recall.
- Final coverage – Measures the proportion of missing values across both the output generation and format parsing steps.
- Human evaluation – Focuses on holistic quality evaluation such as accuracy, relevance, and comprehensiveness (richness) of the text generation.
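The following is a minimal pandas sketch of how the coverage-style metrics could be computed; the DataFrame columns follow the parsed output described earlier, while the input_skus list, the placeholder values treated as missing, and the exact-match handling are assumptions:

```python
import pandas as pd


def coverage_metrics(df: pd.DataFrame, input_skus: list) -> dict:
    """Simple coverage-style metrics over the parsed categorization output."""
    category_cols = [c for c in df.columns if c != "product_name"]

    # Content/final coverage: share of category fields that are actually filled
    filled = df[category_cols].notna() & ~df[category_cols].isin(["", "N/A", "missing"])
    content_coverage = filled.to_numpy().mean()

    # Parsing coverage: share of input SKUs that survived format parsing
    parsing_coverage = len(df) / len(input_skus)

    # Parsing recall/precision on product name, using exact matches
    parsed_names = set(df["product_name"])
    exact_matches = sum(name in parsed_names for name in input_skus)
    parsing_recall = exact_matches / len(input_skus)
    parsing_precision = exact_matches / len(df)

    return {
        "content_coverage": content_coverage,
        "parsing_coverage": parsing_coverage,
        "parsing_recall": parsing_recall,
        "parsing_precision": parsing_precision,
    }
```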
Results
The following are the approximate sample input and output lengths under some best performing settings:
- Input length for Llama 2 model family – 2,068 tokens for 10-shot, 1,585 tokens for 5-shot, 1,319 tokens for 2-shot
- Input length for Anthropic’s Claude model family – 1,314 tokens for 10-shot, 831 tokens for 5-shot, 566 tokens for 2-shot, 359 tokens for zero-shot
- Output length with 5-packing – Approximately 500 tokens
Quantitative results
The following table summarizes our consolidated quantitative results.
- To be concise, the table contains only some of our final recommendations for each model type.
- The metrics used are latency and accuracy.
- The best model and results are shown in bold.
| Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement (test set = 5k) | Near-real-time latency, 1-packing | Recall on parsing exact match | Final content coverage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock batch inference | Llama2-13b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a |
| Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 65.4s | 1704s | 3600s | 72/20 = 3.6s | 92.60% | 53.90% |
| Amazon Bedrock batch inference | Llama2-70b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a |
| Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 139.6s | 5299s | 3600s | 156/20 = 7.8s | 98.30% | 61.50% |
| **Amazon Bedrock batch inference** | **Claude-v1 (Instant)** | **zero-shot (template6)** | **29s** | **723s** | **3600s** | **44.8/20 = 2.24s** | **98.50%** | **96.80%** |
| Amazon Bedrock batch inference | Claude-v1 (Instant) | 5-shot (template12) | 30.3s | 644s | 3600s | 51/20 = 2.6s | 99% | 84.40% |
| Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 104/20 = 5.2s | 99% | 84.40% |
| Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 49.1s | 1323s | 3600s | 104/20 = 5.2s | 99.40% | 90.10% |
The following tables summarize the scaling effect in batch inference.
- When scaling from 5,000 to 100,000 samples, only eight times more computation time was needed.
- Performing categorization with individual LLM calls for each product would have increased the inference time for 100,000 products by approximately 40 times compared to the batch processing method.
- The accuracy in coverage remained stable, and cost scaled approximately linearly.
| Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement (test set = 5k) | Batch latency, 5-packing (test set = 100k) | Near-real-time latency, 1-packing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock batch | Claude-v1 (Instant) | zero-shot (template6) | 29s | 723s | 3600s | 5733s | 44.8/20 = 2.24s |
| Amazon Bedrock batch | Anthropic’s Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 7689s | 104/20 = 5.2s |
| Batch process service | Model | Near-real-time latency, 1-packing | Parsing recall on product name (test set = 5k) | Parsing recall on product name (test set = 100k) | Final content coverage (test set = 5k) | Final content coverage (test set = 100k) |
| --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock batch | Claude-v1 (Instant) | 44.8/20 = 2.24s | 98.50% | 98.40% | 96.80% | 96.50% |
| Amazon Bedrock batch | Anthropic’s Claude-v2 | 104/20 = 5.2s | 99% | 98.80% | 84.40% | 97% |
The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 tokens and fits up to around 20 packing; Anthropic’s Claude has a higher limit. We tested on 20 ground truth samples for 1, 5, and 10 packing and selected results from all models and prompt templates. The scaling effect on latency was more obvious in the Anthropic’s Claude model family than in Llama 2, and Anthropic’s Claude showed better generalizability than Llama 2 when extending the packing numbers in the output. We only tried few-shot prompts with Llama 2 models, which showed improved accuracy over zero-shot.
| Batch process service | Model | Prompt | Latency (test set = 20), npack = 1 | Latency (test set = 20), npack = 5 | Latency (test set = 20), npack = 10 | Final coverage, npack = 1 | Final coverage, npack = 5 | Final coverage, npack = 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 72s | 65.4s | 65s | 95.90% | 93.20% | 88.90% |
| Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 156s | 139.6s | 150s | 85% | 97.70% | 100% |
| Amazon Bedrock batch inference | Claude-v1 (Instant) | zero-shot (template6) | 45s | 29s | 27s | 99.50% | 99.50% | 99.30% |
| Amazon Bedrock batch inference | Claude-v1 (Instant) | 5-shot (template12) | 51.3s | 30.3s | 27.4s | 99.50% | 99.50% | 100% |
| Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 104s | 82.2s | 67s | 85% | 97.70% | 94.50% |
| Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 104s | 49.1s | 43.5s | 97.70% | 100% | 99.80% |
Qualitative results
We noted the following qualitative results:
- Human evaluation – The categories generated were evaluated qualitatively by GoDaddy SMEs. The categories were found to be of good quality.
- Learnings – We used an LLM in two separate calls: output generation and format parsing. We observed the following:
  - For this use case, Llama 2 didn’t perform well in format parsing but was relatively capable in output generation. To be consistent and make a fair comparison, we required the LLM used in both calls to be the same: the API calls in both steps should all be invoked to llama2-13b-chat-v1, or they should all be invoked to anthropic.claude-instant-v1. However, GoDaddy chose Llama 2 as the LLM for category generation. For this use case, we found that using Llama 2 only in output generation and using Anthropic’s Claude in format parsing was suitable due to Llama 2’s relatively lower model capability.
  - Format parsing is improved through prompt engineering (the JSON format instruction is critical), which reduces latency. For example, with Anthropic’s Claude-Instant on a 20-sample test set and averaging over multiple prompt templates, latency can be reduced by approximately 77% (from 90 seconds to 20 seconds). This directly eliminates the need for a JSON fine-tuned version of the LLM.
- Llama 2 – We observed the following:
  - Llama2-13b and Llama2-70b models both need the full instruction as format_instruction() in zero-shot prompts.
  - Llama2-13b seems to be worse in content coverage and formatting (for example, it can’t correctly escape characters such as \\"), which can incur significant parsing time and cost and also degrade accuracy.
  - Llama 2 shows clear performance drops and instability when the packing number varies among 1, 5, and 10, indicating poorer generalizability compared to the Anthropic’s Claude model family.
- Anthropic’s Claude – We observed the following:
  - Anthropic’s Claude-Instant and Claude-v2, regardless of using zero-shot or few-shot prompting, need only partial format instruction instead of the full format_instruction(). This shortens the input length and is therefore more cost-effective. It also shows Anthropic’s Claude’s better capability in following instructions.
  - Anthropic’s Claude generalizes well when varying packing numbers among 1, 5, and 10.
Business takeaways
We had the following key business takeaways:
- Improved latency – Our solution categorizes 5,000 products in 12 minutes, which is 80% faster than GoDaddy’s requirement of 5,000 products in 1 hour. Using batch inference in Amazon Bedrock demonstrated efficient batch processing capabilities, and further scalability is anticipated as AWS plans to deploy more cloud instances. The expansion will lead to increased time and cost savings.
- More cost-effectiveness – The solution built by the Generative AI Innovation Center using Anthropic’s Claude-Instant is 8% more affordable than the existing proposal using Llama2-13b while also providing 79% more coverage.
- Enhanced accuracy – The deliverable produces 97% category coverage on both the 5,000 and 100,000 hold-out test sets, exceeding GoDaddy’s target of 90%. The comprehensive framework facilitates future iterative improvements over the current model parameters and prompt templates.
- Qualitative assessment – The generated categories are of satisfactory quality, based on human evaluation by GoDaddy SMEs.
Technical takeaways
We had the following key technical takeaways:
- The solution features both batch inference and near real-time inference (2 seconds per product) capability and multiple backend LLM selections.
- Anthropic’s Claude-Instant with zero-shot is the clear winner:
- It was best in latency, cost, and accuracy on the 5,000 hold-out test set.
- It showed better generalizability to higher packing numbers (number of SKUs in one query), with potentially more cost and latency improvement.
- Iteration on prompt templates shows improvement on all these models, suggesting that good prompt engineering is a practical approach for the categorization generation task.
- Input-wise, increasing to 10-shot may further improve performance, as observed in small-scale science experiments, but it also increases the cost by around 30%. Therefore, we tested at most 5-shot in large-scale batch experiments.
- Output-wise, increasing to 10-packing or even 20-packing (Anthropic’s Claude only; Llama 2 has a 2,048-token output length limit) might further improve latency and cost, because more SKUs can share the same input instructions.
- For this use case, we saw Anthropic’s Claude model family having better accuracy and generalizability, for example:
- Final category coverage performance was better with Anthropic’s Claude-Instant.
- When increasing the packing number from 1 to 5 to 10, Anthropic’s Claude-Instant showed improvement in latency and stable accuracy compared to Llama 2.
- To achieve the final categories for the use case, Anthropic’s Claude required a shorter prompt input to follow the instructions and supported a longer output length limit, allowing a higher packing number.
Next steps for GoDaddy
The following are the recommendations that the GoDaddy team is considering as a part of future steps:
- Dataset enhancement – Aggregate a larger set of ground truth examples and expand programmatic evaluation to better monitor and refine the model’s performance. Relatedly, if product names can be normalized using domain knowledge, the cleaner input also helps produce better LLM responses. For example, the product name “<product_name> Power t-shirt, ladyfit vest or hoodie” can prompt the LLM to respond for multiple SKUs instead of one SKU (similarly, “<product_name> – $5 or $10 or $20 or $50 or $100”).
- Human evaluation – Increase human evaluations to provide higher generation quality and alignment with desired outcomes.
- Fine-tuning – Consider fine-tuning as a potential strategy for enhancing category generation when a more extensive training dataset becomes available.
- Prompt engineering – Explore automatic prompt engineering techniques to enhance category generation, particularly when additional training data becomes available.
- Few-shot learning – Investigate techniques such as dynamic few-shot selection and crafting in-context examples based on the model’s parameter knowledge to enhance the LLMs’ few-shot learning capabilities.
- Knowledge integration – Improve the model’s output by connecting LLMs to a knowledge base (internal or external database) and enabling it to incorporate more relevant information. This can help to reduce LLM hallucinations and enhance relevance in responses.
Conclusion
In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic’s Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve categorization with LLMs and found that the Anthropic’s Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated from the recommended approaches.
If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.
Security Best Practices
About the Authors
Vishal Singh is a Data Engineering Leader on the Data and Analytics team at GoDaddy. His key focus area is building data products and generating insights from them through the application of data engineering tools and generative AI.
Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.
Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.
Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.
Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.