Manas Singh
Feb 26, 2024
Introducing MyMagic AI - Scalable and Affordable LLM Batch Inference
Since the launch of ChatGPT in 2022, our lives have been transformed by thin wrappers around OpenAI APIs that assist us in shopping, booking flights, planning vacations, and even finding the perfect emoji. GPUs, with their hundreds of cores enabling parallel processing of millions of customer queries, are essential for fast training and inference. However, they are becoming increasingly expensive to rent from existing cloud providers like Azure and AWS, making it difficult to train and deploy these agents in production. For instance, running a customer support chatbot with 5K daily active users (typical for SMBs) would cost $48K annually on AWS EC2 (g5.xlarge) coupled with Hugging Face-based model management. While this might appear to be a simple GPU supply issue, there are deeper underlying issues that need further consideration.
Underutilized GPUs
A recent analysis by TechInsights, a leading semiconductor research agency, revealed that data centers operated by companies like AWS and Azure are built for peak demand. Due to ineffective demand scheduling and resource stranding, the load on these GPUs surges simultaneously, increasing peak load. Furthermore, GPUs don't function like CPUs when it comes to overprovisioning: an entire server is made available to the customer during training or inference, exacerbating the underutilization problem. According to TechInsights' estimate, if AWS’ GPU clusters were fully utilized year-round, they would generate 15%, or $1B, more in revenue.
Live Streaming Inference Leads to Slower Token Processing and Higher Cost
All major AI agents (YouChat, JasperChat, Zoho SalesIQ's Zobot) are chat-based applications that process your data analysis or information retrieval queries in real time. Offline batch processing of user queries is an alternative that can be optimized for higher token processing speed by grouping queries before routing them to GPUs for inference, resulting in lower costs with comparable accuracy. While consumer-facing chat applications, customer support, and fraud detection require real-time inference, several use cases, including analysis of large volumes of data, periodic reporting, web scraping, and embedding generation, are more cost-efficient with offline batch inference.
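The grouping step described above can be sketched as follows. This is a minimal illustration and not MyMagic AI's actual implementation: `batch_queries`, `process_offline`, the `fake_model` stand-in, and the batch size are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Iterable, Iterator, List


def batch_queries(queries: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `batch_size` queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for query in queries:
        batch.append(query)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, group
        yield batch


def process_offline(
    queries: Iterable[str],
    run_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Send queries to the model one batch at a time, keeping results in order."""
    results: List[str] = []
    for batch in batch_queries(queries, batch_size):
        results.extend(run_batch(batch))
    return results


def fake_model(batch: List[str]) -> List[str]:
    # Hypothetical stand-in for a real batched inference call.
    return [query.upper() for query in batch]


if __name__ == "__main__":
    print(process_offline(["summarize a", "classify b", "embed c"], fake_model, batch_size=2))
```

In a real deployment `run_batch` would wrap a single batched inference call, so the GPU processes a full group of queries at once instead of one query per request.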
Introducing MyMagic AI
MyMagic AI is an LLM batch inference platform that utilizes GPU-powered distributed computing with offline batch processing. It responds to user queries much faster and at a lower cost than existing cloud-based inference platforms. If you are looking to extract specific information from your massive datasets, MyMagic AI offers a web-based UI and APIs to integrate with your existing datasets across cloud-based storage applications and answer queries using natively integrated open-source LLMs (Llama 2 70B, Mixtral 8x7B, CodeLlama 70B, etc.). Additionally, you can analyze your data using popular ML use cases including sentiment analysis, summarization, classification, and embedding, all with a click or a single MyMagic API call.
Unlike most AI applications, which use real-time inference for user requests, MyMagic AI uses batch processing and inference to resolve user queries for use cases such as data analysis and reporting. Through extensive experimentation, we have determined an optimal GPU-based LLM infrastructure for batch processing, including instance types, LLM quantization, data parallelization, and auto-scaling algorithms. This enables faster token processing and allows us to offer the lowest prices in the market.
Is your data stored across a single or multi-cloud environment? MyMagic AI has you covered! We have integrations with all major cloud providers and their storage services, including S3 buckets from AWS, Azure Blob from Microsoft Azure, and Google Cloud Storage on Google Cloud Platform.
Competitive Benchmarking
Several cloud computing platforms in the market are trying to make LLM inference faster and more affordable. Below are some critical insights on how MyMagic AI fares compared to existing solutions in the market.
We Offer the Lowest Inference Price Across All Platforms
Across several open-source LLMs, MyMagic offers the lowest inference price compared to all leading LLM inference platforms. For instance, our inference cost for Llama-2-70B is 6x lower than the mean price for input and generation across the platforms. This has been enabled through our high throughput and better GPU utilization.
Our Throughput is the Highest Among Existing Platforms (Benchmarked for Sentence Completion Case)
MyMagic AI inference API throughput is the highest among all existing platforms. This is primarily achieved by saturating each server with parallel GPUs, ensuring they are always fully loaded with no idle time, which maximizes the tokens generated. Our methodology measures how many tokens we can process (squeeze in) within one hour.
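As a rough illustration of the tokens-per-hour measure, the sketch below converts the per-second figures listed in the Appendix into hourly totals and a relative throughput ratio. It assumes the listed tokens/second figures are sustained rates; the function names are hypothetical and this is not the benchmarking code used for the comparison.

```python
SECONDS_PER_HOUR = 3600


def tokens_per_hour(tokens_per_second: float) -> float:
    """Tokens processed in one hour at a sustained per-second rate."""
    return tokens_per_second * SECONDS_PER_HOUR


def relative_throughput(rate_a: float, rate_b: float) -> float:
    """How many times faster rate_a is than rate_b."""
    if rate_b <= 0:
        raise ValueError("rate_b must be positive")
    return rate_a / rate_b


if __name__ == "__main__":
    # Figures taken from the Appendix: MyMagic AI 230 tokens/s, Together AI 68 tokens/s.
    print(tokens_per_hour(230))
    print(round(relative_throughput(230, 68), 2))
```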
Few Platforms Focus on Offline Batch Inference
Only a few platforms offer batch inference on your data, and fewer still offer offline batch inference, which is ideal for data analysis and reporting use cases. OpenAI doesn’t allow batching of chat completion API calls into a single HTTP request, although there are workarounds for sending concurrent requests. Together AI, Fireworks AI, and Friendli AI allow sending multiple API requests at once and batch them before serving. MosaicML and Anyscale AI offer multiple batching options, including continuous batching and iteration-level scheduling. With continuous batching, once a sequence in the batch has completed generation, a new sequence can be inserted in its place, yielding higher GPU utilization than static batching. This has enabled a 23x throughput improvement in LLM inference. MyMagic AI utilizes such batching techniques to increase throughput and reduce inference costs, along with other proprietary deep tech optimizations.
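To see why continuous batching improves utilization, the toy simulation below counts decode steps for static and continuous batching over the same workload. It is a simplified sketch under stated assumptions: every decode step costs the same regardless of how many slots are occupied, prefill time is ignored, and sequences are admitted in arrival order. The function names and the sample workload are hypothetical.

```python
import heapq
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A finished sequence is replaced immediately by the next waiting one."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    # Min-heap of the step at which each slot becomes free.
    free_at = [0] * slots
    for length in lengths:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + length)
    return max(free_at)


def utilization(lengths: List[int], slots: int, steps: int) -> float:
    """Fraction of slot-steps that produced a token."""
    return sum(lengths) / (slots * steps) if steps else 0.0


if __name__ == "__main__":
    workload = [10, 2, 2, 2, 10, 2, 2, 2]  # tokens to generate per sequence
    for name, fn in [("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)]:
        steps = fn(workload, 2)
        print(name, steps, round(utilization(workload, 2, steps), 2))
```

On this workload with two slots, static batching needs 24 steps because short sequences wait for the long one in their batch, while continuous batching needs 16 steps and keeps both slots busy throughout.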
Most Platforms Focus on Improving the Algorithms or Infrastructure
Current platforms use several algorithmic optimizations to enable LLMs to generate tokens more efficiently. Some of the most popular techniques include PyTorch’s Flash-Decoding, PagedAttention, and FlashAttention. Flash-Decoding dramatically speeds up attention during decoding, with up to 8x faster generation for very long sequences, while FlashAttention helps with the time to the first token. Infrastructure-level optimization includes using CUDA graphs, which save time by launching multiple GPU operations through a single CPU operation, and optimizing memory bandwidth utilization (the percentage of total memory bandwidth (TB/sec) utilized to load parameters onto the GPU).
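The memory bandwidth utilization figure mentioned above can be estimated with simple arithmetic. The sketch below assumes batch size 1, that every generated token requires reading all model parameters once, and that KV cache traffic is ignored; the function name and the example numbers (a 7B-parameter model in 16-bit precision, 50 tokens/second, 2 TB/sec peak bandwidth) are hypothetical and not measurements from any platform.

```python
def memory_bandwidth_utilization(
    param_count: float,
    bytes_per_param: float,
    tokens_per_second: float,
    peak_bandwidth_bytes_per_s: float,
) -> float:
    """Fraction of peak memory bandwidth spent streaming parameters during decoding."""
    if peak_bandwidth_bytes_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    # Assumption: each decode step reads every weight once (batch size 1, no KV cache).
    bytes_per_token = param_count * bytes_per_param
    achieved_bandwidth = bytes_per_token * tokens_per_second
    return achieved_bandwidth / peak_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    mbu = memory_bandwidth_utilization(7e9, 2, 50, 2e12)
    print(f"{mbu:.0%}")
```

With these illustrative inputs the estimate comes out to roughly 35%, which indicates how much headroom remains before decoding becomes limited by memory bandwidth.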
What’s Next?
To further drive down costs while improving throughput, we are currently testing several algorithmic optimizations in collaboration with UC Berkeley researchers and leading industry CUDA engineers. The research focuses on model quantization, CUDA-level optimizations of different kernels, and transformer-based architectural changes. Subscribe to our newsletter to stay tuned!
Want to try out MyMagic AI?
To begin, sign up for a MyMagic AI account, obtain your API key, and run your first batch job. You can also click on "How It Works" and follow the video demo of the product. If you have any questions, feel free to book some time with us on our calendar.
Appendix
OpenAI GPT-3.5 Turbo - $0.001 / 1k tokens; 10,000 RPM (Pricing HERE, Rate Limit HERE)
OpenAI GPT-4 - $0.01 / 1k tokens; 10,000 RPM (Pricing HERE, Rate Limit HERE)
MosaicML - $0.005 / 1k tokens; 47 tokens / second (Pricing HERE, Throughput HERE)
Together AI - $0.0009 / 1k tokens; 68 tokens / second; 100 QPM (All Data HERE)
Fireworks AI - $0.0007 / 1k tokens; 28 tokens / second; 100 RPM (All Data HERE, Throughput HERE)
Anyscale AI - $0.001 / 1k tokens; 33 tokens / second; 30 concurrent requests (Price HERE, Throughput HERE, Rate Limit HERE)
Friendli AI - $0.0008 / 1k tokens; 32 tokens / second (4x that of vLLM); 5000 RPM (All Data HERE)
MyMagic AI - $0.0005 / 1k tokens; 230 tokens / second
Perplexity AI - $0.0007 / 1k tokens; 48 tokens / second (Data HERE)
OpenRouter - $0.0008 / 1k tokens; 31.5 tokens / second (Data HERE)
DeepInfra - $0.0007 / 1k tokens; 27 tokens / second (Data HERE)
OpenAI [No batching of requests, no offline batching option] - Does not send requests in batches, but all prompts can be collated into one request as a message (JSON) and the chat API can be asked to answer all of them. More details HERE.
Together AI [Batching of requests possible, no offline batching option] - Batching (concurrent requests) is enabled for their API, but there is no offline batching option.
Anyscale AI [Batching of requests possible, offline batching option present] - Allows 30 concurrent requests and has an offline batching option.
MosaicML [Batching of requests possible, offline batching option present] - Batching can be done, and offline batching is available as an option HERE.
Fireworks AI [No batching of requests, no offline batching option]
Friendli AI [Batching of requests possible, offline batching option present] - Offline batch processing is present. Patented iterative batching technique HERE and PeriFlow.