API Platforms

Groq Platform - Fast AI Inference

Ultra-fast AI inference platform with LPU architecture, free API access, and real-time processing capabilities

Overview

Groq is an AI inference platform built on its proprietary Language Processing Unit (LPU) architecture. It delivers exceptionally high speed and efficiency for running large language models, making it well suited to real-time applications and high-throughput scenarios.

LPU Architecture

Specialized hardware designed specifically for sequential AI workloads

Ultra-Low Latency

Response times measured in milliseconds for most queries

Free Tier Access

Generous free usage limits with no credit card required to get started

Getting Started

API Key Setup

  1. Visit the Groq Console (console.groq.com) and create an account
  2. Navigate to the API Keys section and generate a new key
  3. Copy the key securely, as it won't be shown again
  4. Review the usage limits and supported models, then verify the key with a quick test call (see the sketch below)
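
Before writing application code, it helps to confirm the key works end to end. A minimal sketch, assuming the groq Python SDK is installed and the key is stored in the GROQ_API_KEY environment variable; the model ID is only an example and may change:

# Verify the API key with a small test request
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello in one word."}],
    model="mixtral-8x7b-32768",  # example model ID; check the console for current IDs
    max_tokens=10,
)
print(resp.choices[0].message.content)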

Platform Features

  • Lightning Speed: Among the fastest commercially available inference speeds
  • Multiple Models: Access to Mixtral, Llama, and other open-source models
  • Simple Pricing: Transparent pay-per-use pricing model
  • REST API: Standard HTTP endpoints with JSON responses (see the sketch below)
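
Because the endpoints follow OpenAI-compatible REST conventions, you can also call the API without the SDK. A minimal sketch using the requests library; the endpoint matches the JavaScript examples later on this page, and the model ID is an assumption:

# Call the chat completions endpoint directly over HTTP
import os

import requests

response = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "model": "mixtral-8x7b-32768",  # example model ID
        "max_tokens": 64,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])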

Available Models

Mixtral 8x7B

  • Speed: ~480 tokens/second
  • Context: 32K tokens
  • Use Cases: General purpose, coding, reasoning
  • Specialty: Mixture of Experts architecture

Llama 2 70B

  • Speed: ~300 tokens/second
  • Context: 4K tokens
  • Use Cases: Research, analysis, content creation
  • Specialty: Large-scale reasoning tasks

Code Llama

  • Speed: ~400 tokens/second
  • Context: 16K tokens
  • Use Cases: Code generation, debugging, review
  • Specialty: Programming and technical tasks
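
Model availability changes over time, so it is safer to discover current model IDs at runtime than to hard-code them. A minimal sketch, assuming the SDK exposes a models.list() helper that mirrors the REST /openai/v1/models endpoint:

# List the model IDs currently available to your account
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

for model in client.models.list().data:
    print(model.id)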

API Integration

Python Implementation

# Install Groq Python package
pip install groq

# Basic implementation
import os

from groq import Groq

# The client reads the key from the GROQ_API_KEY environment variable by default;
# pass api_key explicitly if you prefer.
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def chat_completion(prompt):
    try:
        completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            model="mixtral-8x7b-32768",
            temperature=0.5,
            max_tokens=1024,
            top_p=1,
            stream=False,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error: {e}")
        return None

# Example usage
response = chat_completion("Explain quantum computing")
print(response)

JavaScript Implementation

// Using fetch API directly
async function groqChat(prompt) {
    const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({
            messages: [{ role: 'user', content: prompt }],
            model: 'mixtral-8x7b-32768',
            temperature: 0.7,
            max_tokens: 1024,
            stream: false
        })
    });

    if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
    }

    const data = await response.json();
    return data.choices[0].message.content;
}

// Example usage
groqChat("What is machine learning?")
    .then(response => console.log(response))
    .catch(error => console.error('Error:', error));

Streaming Responses

# Streaming implementation in Python
def stream_chat(prompt):
    stream = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="mixtral-8x7b-32768",
        temperature=0.7,
        max_tokens=1024,
        top_p=1,
        stream=True
    )
    
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)

// JavaScript streaming
async function streamGroqChat(prompt, onChunk) {
    const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({
            messages: [{ role: 'user', content: prompt }],
            model: 'mixtral-8x7b-32768',
            stream: true,
            temperature: 0.7
        })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    
    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        
        const chunk = decoder.decode(value, { stream: true });  // preserves multi-byte characters split across chunks
        const lines = chunk.split('\n');
        
        for (const line of lines) {
            if (line.startsWith('data: ') && line !== 'data: [DONE]') {
                try {
                    const data = JSON.parse(line.slice(6));
                    const content = data.choices[0]?.delta?.content;
                    if (content) onChunk(content);
                } catch (e) {
                    // Continue processing other lines
                }
            }
        }
    }
}

Performance Benchmarks

Model          Tokens/Second   Latency   Throughput
Mixtral 8x7B   480 t/s         ~200ms    High
Llama 2 70B    300 t/s         ~350ms    Medium
Code Llama     400 t/s         ~250ms    High
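
These numbers depend on prompt length, model load, and region, so it is worth measuring on your own workload. A minimal sketch that times one request and derives tokens/second from the usage field, assuming an OpenAI-style usage object on the response; note that this measures total generation time rather than time to first token:

# Rough end-to-end latency and throughput measurement for one prompt
import os
import time

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

start = time.perf_counter()
resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize the benefits of fast inference."}],
    model="mixtral-8x7b-32768",  # example model ID
    max_tokens=512,
)
elapsed = time.perf_counter() - start

completion_tokens = resp.usage.completion_tokens
print(f"Total time: {elapsed * 1000:.0f} ms")
print(f"Throughput: {completion_tokens / elapsed:.0f} tokens/second")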

Use Case Recommendations

  • Real-time Chat: Mixtral for balanced performance
  • Code Generation: Code Llama for programming tasks
  • Research: Llama 2 70B for complex reasoning
  • High Throughput: Mixtral for batch processing

Best Practices

Optimizing for Speed

  • Use appropriate temperature settings (lower values, around 0.1-0.3, for more deterministic outputs)
  • Set reasonable max_tokens to avoid unnecessary computation
  • Implement request batching for multiple queries (see the concurrency sketch below)
  • Use streaming for real-time user experiences
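
For the batching point above, one way to raise throughput across many independent prompts is to send requests concurrently while staying within your rate limits. A minimal sketch using a thread pool with the synchronous client; the worker count and model ID are assumptions to tune for your account:

# Send independent prompts concurrently with a small thread pool
import os
from concurrent.futures import ThreadPoolExecutor

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def complete(prompt):
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="mixtral-8x7b-32768",  # example model ID
        max_tokens=256,
    )
    return resp.choices[0].message.content

prompts = ["Define latency.", "Define throughput.", "Define batching."]

# Keep max_workers modest to respect rate limits
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(complete, prompts))

for prompt, result in zip(prompts, results):
    print(prompt, "->", result)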

Error Handling

import time

from groq import Groq

class GroqClient:
    def __init__(self, api_key):
        self.client = Groq(api_key=api_key)
        self.retry_delay = 1  # base delay in seconds for exponential backoff
        
    def robust_completion(self, prompt, max_retries=3):
        for attempt in range(max_retries):
            try:
                response = self.client.chat.completions.create(
                    messages=[{"role": "user", "content": prompt}],
                    model="mixtral-8x7b-32768",
                    max_tokens=1024
                )
                return response.choices[0].message.content
                
            except Exception as e:
                if "rate limit" in str(e).lower():
                    time.sleep(self.retry_delay * (2 ** attempt))
                    continue
                elif "server error" in str(e).lower():
                    time.sleep(5)
                    continue
                else:
                    raise e
                    
        raise Exception("Max retries exceeded")
        
    def batch_process(self, prompts, batch_size=5):
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            batch_results = []
            for prompt in batch:
                try:
                    result = self.robust_completion(prompt)
                    batch_results.append(result)
                except Exception as e:
                    batch_results.append(f"Error: {str(e)}")
            results.extend(batch_results)
            time.sleep(0.1)  # Small delay between batches
        return results
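
Example usage of the wrapper above; reading the key from an environment variable is an assumption, so adapt it to your configuration:

# Example usage of the GroqClient wrapper
import os

groq_client = GroqClient(api_key=os.environ["GROQ_API_KEY"])

answers = groq_client.batch_process([
    "Summarize the LPU architecture in one sentence.",
    "List two use cases for low-latency inference.",
])
for answer in answers:
    print(answer)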