API Platforms

Groq Platform - Fast AI Inference

Ultra-fast AI inference platform with LPU architecture, free API access, and real-time processing capabilities

Overview

Groq is an AI inference platform built on its proprietary Language Processing Unit (LPU) architecture. It delivers exceptionally high speed and efficiency for running large language models, making it well suited to real-time applications and high-throughput scenarios.

LPU Architecture

Specialized hardware designed specifically for sequential AI workloads

Ultra-Low Latency

Response times measured in milliseconds for most queries

Free Tier Access

Generous free usage limits with no credit card required to get started

Getting Started

API Key Setup

  1. Visit the Groq Console (console.groq.com) and create an account
  2. Navigate to the API Keys section and generate a new key
  3. Copy the key securely, as it won't be shown again
  4. Review the usage limits and supported models, then verify the key with a quick test call (see the sketch below)
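
Before writing application code, it helps to confirm the key works end to end. A minimal sketch, assuming the groq Python SDK is installed and the key is stored in the GROQ_API_KEY environment variable; the model ID is only an example and may change:

# Verify the API key with a small test request
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello in one word."}],
    model="mixtral-8x7b-32768",  # example model ID; check the console for current IDs
    max_tokens=10,
)
print(resp.choices[0].message.content)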

Platform Features

  • Lightning Speed: Among the fastest commercially available inference speeds
  • Multiple Models: Access to Mixtral, Llama, and other open-source models
  • Simple Pricing: Transparent pay-per-use pricing model
  • REST API: Standard HTTP endpoints with JSON responses (see the sketch below)
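
Because the endpoints follow OpenAI-compatible REST conventions, you can also call the API without the SDK. A minimal sketch using the requests library; the endpoint matches the JavaScript examples later on this page, and the model ID is an assumption:

# Call the chat completions endpoint directly over HTTP
import os

import requests

response = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "model": "mixtral-8x7b-32768",  # example model ID
        "max_tokens": 64,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])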

Available Models

Mixtral 8x7B

  • Speed: ~480 tokens/second
  • Context: 32K tokens
  • Use Cases: General purpose, coding, reasoning
  • Specialty: Mixture of Experts architecture

Llama 2 70B

  • Speed: ~300 tokens/second
  • Context: 4K tokens
  • Use Cases: Research, analysis, content creation
  • Specialty: Large-scale reasoning tasks

Code Llama

  • Speed: ~400 tokens/second
  • Context: 16K tokens
  • Use Cases: Code generation, debugging, review
  • Specialty: Programming and technical tasks
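
Model availability changes over time, so it is safer to discover current model IDs at runtime than to hard-code them. A minimal sketch, assuming the SDK exposes a models.list() helper that mirrors the REST /openai/v1/models endpoint:

# List the model IDs currently available to your account
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

for model in client.models.list().data:
    print(model.id)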

API Integration

Python Implementation

# Install Groq Python package
pip install groq

# Basic implementation
import os

from groq import Groq

# The client reads the key from the GROQ_API_KEY environment variable by default;
# pass api_key explicitly if you prefer.
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def chat_completion(prompt):
    try:
        completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            model="mixtral-8x7b-32768",
            temperature=0.5,
            max_tokens=1024,
            top_p=1,
            stream=False,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error: {e}")
        return None

# Example usage
response = chat_completion("Explain quantum computing")
print(response)

JavaScript Implementation

// Using fetch API directly
async function groqChat(prompt) {
    const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({
            messages: [{ role: 'user', content: prompt }],
            model: 'mixtral-8x7b-32768',
            temperature: 0.7,
            max_tokens: 1024,
            stream: false
        })
    });

    if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
    }

    const data = await response.json();
    return data.choices[0].message.content;
}

// Example usage
groqChat("What is machine learning?")
    .then(response => console.log(response))
    .catch(error => console.error('Error:', error));

Streaming Responses

# Streaming implementation in Python
def stream_chat(prompt):
    stream = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="mixtral-8x7b-32768",
        temperature=0.7,
        max_tokens=1024,
        top_p=1,
        stream=True
    )
    
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)

// JavaScript streaming
async function streamGroqChat(prompt, onChunk) {
    const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({
            messages: [{ role: 'user', content: prompt }],
            model: 'mixtral-8x7b-32768',
            stream: true,
            temperature: 0.7
        })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    
    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        
        const chunk = decoder.decode(value, { stream: true });  // preserves multi-byte characters split across chunks
        const lines = chunk.split('\n');
        
        for (const line of lines) {
            if (line.startsWith('data: ') && line !== 'data: [DONE]') {
                try {
                    const data = JSON.parse(line.slice(6));
                    const content = data.choices[0]?.delta?.content;
                    if (content) onChunk(content);
                } catch (e) {
                    // Continue processing other lines
                }
            }
        }
    }
}

Performance Benchmarks

Model          Tokens/Second   Latency   Throughput
Mixtral 8x7B   480 t/s         ~200ms    High
Llama 2 70B    300 t/s         ~350ms    Medium
Code Llama     400 t/s         ~250ms    High
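
These numbers depend on prompt length, model load, and region, so it is worth measuring on your own workload. A minimal sketch that times one request and derives tokens/second from the usage field, assuming an OpenAI-style usage object on the response; note that this measures total generation time rather than time to first token:

# Rough end-to-end latency and throughput measurement for one prompt
import os
import time

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

start = time.perf_counter()
resp = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize the benefits of fast inference."}],
    model="mixtral-8x7b-32768",  # example model ID
    max_tokens=512,
)
elapsed = time.perf_counter() - start

completion_tokens = resp.usage.completion_tokens
print(f"Total time: {elapsed * 1000:.0f} ms")
print(f"Throughput: {completion_tokens / elapsed:.0f} tokens/second")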

Use Case Recommendations

  • Real-time Chat: Mixtral for balanced performance
  • Code Generation: Code Llama for programming tasks
  • Research: Llama 2 70B for complex reasoning
  • High Throughput: Mixtral for batch processing

Best Practices

Optimizing for Speed

  • Use appropriate temperature settings (lower values, around 0.1-0.3, for more deterministic outputs)
  • Set reasonable max_tokens to avoid unnecessary computation
  • Implement request batching for multiple queries (see the concurrency sketch below)
  • Use streaming for real-time user experiences
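
For the batching point above, one way to raise throughput across many independent prompts is to send requests concurrently while staying within your rate limits. A minimal sketch using a thread pool with the synchronous client; the worker count and model ID are assumptions to tune for your account:

# Send independent prompts concurrently with a small thread pool
import os
from concurrent.futures import ThreadPoolExecutor

from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def complete(prompt):
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="mixtral-8x7b-32768",  # example model ID
        max_tokens=256,
    )
    return resp.choices[0].message.content

prompts = ["Define latency.", "Define throughput.", "Define batching."]

# Keep max_workers modest to respect rate limits
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(complete, prompts))

for prompt, result in zip(prompts, results):
    print(prompt, "->", result)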

Error Handling

import time

from groq import Groq

class GroqClient:
    def __init__(self, api_key):
        self.client = Groq(api_key=api_key)
        self.retry_delay = 1  # base delay in seconds for exponential backoff
        
    def robust_completion(self, prompt, max_retries=3):
        for attempt in range(max_retries):
            try:
                response = self.client.chat.completions.create(
                    messages=[{"role": "user", "content": prompt}],
                    model="mixtral-8x7b-32768",
                    max_tokens=1024
                )
                return response.choices[0].message.content
                
            except Exception as e:
                if "rate limit" in str(e).lower():
                    time.sleep(self.retry_delay * (2 ** attempt))
                    continue
                elif "server error" in str(e).lower():
                    time.sleep(5)
                    continue
                else:
                    raise e
                    
        raise Exception("Max retries exceeded")
        
    def batch_process(self, prompts, batch_size=5):
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            batch_results = []
            for prompt in batch:
                try:
                    result = self.robust_completion(prompt)
                    batch_results.append(result)
                except Exception as e:
                    batch_results.append(f"Error: {str(e)}")
            results.extend(batch_results)
            time.sleep(0.1)  # Small delay between batches
        return results
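
Example usage of the wrapper above; reading the key from an environment variable is an assumption, so adapt it to your configuration:

# Example usage of the GroqClient wrapper
import os

groq_client = GroqClient(api_key=os.environ["GROQ_API_KEY"])

answers = groq_client.batch_process([
    "Summarize the LPU architecture in one sentence.",
    "List two use cases for low-latency inference.",
])
for answer in answers:
    print(answer)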