Groq Platform - Fast AI Inference
Ultra-fast AI inference platform with LPU architecture, free API access, and real-time processing capabilities
Overview
Groq is an AI inference platform built on a proprietary Language Processing Unit (LPU) architecture. It delivers very high speed and efficiency for running large language models, which makes it well suited for real-time applications and high-throughput scenarios.
- LPU Architecture: Specialized hardware designed specifically for sequential AI workloads
- Ultra-Low Latency: Response times measured in milliseconds for most queries
- Free Tier Access: Generous free usage limits with no credit card required initially
Getting Started
API Key Setup
- Visit Groq Console and create an account
- Navigate to API Keys section and generate a new key
- Copy the key securely as it won't be shown again
- Review the usage limits and supported models (a key-loading sketch follows below)
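Once the key exists, it is safer to load it from an environment variable than to paste it into source code. A minimal setup sketch; the GROQ_API_KEY variable name is a common convention, not something the console enforces:

import os
from groq import Groq

# Load the key from an environment variable instead of hardcoding it.
# GROQ_API_KEY is a conventional name here, not enforced by the console.
api_key = os.environ.get("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("Set the GROQ_API_KEY environment variable first")

client = Groq(api_key=api_key)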
Platform Features
- Lightning Speed: Among the fastest commercially available inference speeds
- Multiple Models: Access to Mixtral, Llama, and other open-source models
- Simple Pricing: Transparent pay-per-use pricing model
- REST API: Standard HTTP endpoints with JSON responses (see the raw request sketch below)
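Because the endpoints use an OpenAI-compatible HTTP/JSON format, the API can also be called without the SDK. A rough sketch using requests; the URL and model ID match the ones used in the examples below, and GROQ_API_KEY is assumed to be set:

# Raw REST call without the SDK, assuming GROQ_API_KEY is set.
import os
import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "mixtral-8x7b-32768",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 64,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])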
Available Models
Mixtral 8x7B
- Speed: ~480 tokens/second
- Context: 32K tokens
- Use Cases: General purpose, coding, reasoning
- Specialty: Mixture of Experts architecture
Llama 2 70B
- Speed: ~300 tokens/second
- Context: 4K tokens
- Use Cases: Research, analysis, content creation
- Specialty: Large-scale reasoning tasks
Code Llama
- Speed: ~400 tokens/second
- Context: 16K tokens
- Use Cases: Code generation, debugging, review
- Specialty: Programming and technical tasks
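Model availability changes over time, so it is worth listing the IDs the API currently serves before hardcoding one. A hedged sketch; the /openai/v1/models path is assumed from the OpenAI-compatible chat completions URL used later on this page:

# List currently served model IDs. The /openai/v1/models path is assumed
# from the OpenAI-compatible chat completions endpoint used on this page.
import os
import requests

resp = requests.get(
    "https://api.groq.com/openai/v1/models",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])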
API Integration
Python Implementation
# Install Groq Python package
pip install groq
# Basic implementation
from groq import Groq

client = Groq(api_key="your-api-key-here")

def chat_completion(prompt):
    try:
        completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            model="mixtral-8x7b-32768",
            temperature=0.5,
            max_tokens=1024,
            top_p=1,
            stream=False,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(f"Error: {e}")
        return None

# Example usage
response = chat_completion("Explain quantum computing")
print(response)
JavaScript Implementation
// Using fetch API directly
async function groqChat(prompt) {
  const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      model: 'mixtral-8x7b-32768',
      temperature: 0.7,
      max_tokens: 1024,
      stream: false
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP error! status: ${response.status}`);
  }

  const data = await response.json();
  return data.choices[0].message.content;
}

// Example usage
groqChat("What is machine learning?")
  .then(response => console.log(response))
  .catch(error => console.error('Error:', error));
Streaming Responses
# Streaming implementation in Python
def stream_chat(prompt):
    stream = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="mixtral-8x7b-32768",
        temperature=0.7,
        max_tokens=1024,
        top_p=1,
        stream=True
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)
// JavaScript streaming
async function streamGroqChat(prompt, onChunk) {
  const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.GROQ_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      messages: [{ role: 'user', content: prompt }],
      model: 'mixtral-8x7b-32768',
      stream: true,
      temperature: 0.7
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value);
    const lines = chunk.split('\n');

    for (const line of lines) {
      if (line.startsWith('data: ') && line !== 'data: [DONE]') {
        try {
          const data = JSON.parse(line.slice(6));
          const content = data.choices[0]?.delta?.content;
          if (content) onChunk(content);
        } catch (e) {
          // Skip lines that are not complete JSON and continue
        }
      }
    }
  }
}
Performance Benchmarks
| Model | Tokens/Second | Latency | Throughput |
|---|---|---|---|
| Mixtral 8x7B | 480 t/s | ~200ms | High |
| Llama 2 70B | 300 t/s | ~350ms | Medium |
| Code Llama | 400 t/s | ~250ms | High |
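These numbers vary with load and model version, so it helps to measure them directly. A rough timing sketch that reuses the client from the Python example above; the whitespace-based word count is only a proxy for tokens:

# Rough timing sketch, reusing `client` from the Python example above.
# Measures time to first token and overall words/second for one request;
# the whitespace split is a crude proxy, not the API's tokenizer.
import time

def measure(prompt, model="mixtral-8x7b-32768"):
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    stream = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            pieces.append(content)
    total = time.perf_counter() - start
    words = len("".join(pieces).split())
    ttft_ms = (first_token_at - start) * 1000 if first_token_at else float("nan")
    print(f"first token: {ttft_ms:.0f} ms, ~{words / total:.0f} words/s over {total:.2f} s")

measure("Explain the LPU architecture in two sentences")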
Use Case Recommendations
- Real-time Chat: Mixtral for balanced performance
- Code Generation: Code Llama for programming tasks
- Research: Llama 2 70B for complex reasoning
- High Throughput: Mixtral for batch processing
Best Practices
Optimizing for Speed
- Use lower temperature settings (for example 0.1-0.3) when you need more deterministic outputs
- Set a reasonable max_tokens value to avoid unnecessary computation
- Implement request batching for multiple queries (a concurrency sketch follows below)
- Use streaming for real-time user experiences
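The batch_process helper shown below under Error Handling works sequentially; independent prompts can also be issued concurrently for higher throughput. A rough sketch using a thread pool and the chat_completion helper from the Python example above; the worker count is an arbitrary starting point to tune against your rate limits:

# Issue several requests concurrently with a thread pool.
# Reuses chat_completion() from the Python example above; max_workers is
# an arbitrary starting point, tune it against your rate limits.
from concurrent.futures import ThreadPoolExecutor

def parallel_completions(prompts, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of prompts in the returned results
        return list(pool.map(chat_completion, prompts))

answers = parallel_completions([
    "Summarize the LPU architecture in one sentence",
    "List three use cases for streaming responses",
])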
Error Handling
import time

from groq import Groq

class GroqClient:
    def __init__(self, api_key):
        self.client = Groq(api_key=api_key)
        self.retry_delay = 1

    def robust_completion(self, prompt, max_retries=3):
        for attempt in range(max_retries):
            try:
                response = self.client.chat.completions.create(
                    messages=[{"role": "user", "content": prompt}],
                    model="mixtral-8x7b-32768",
                    max_tokens=1024
                )
                return response.choices[0].message.content
            except Exception as e:
                if "rate limit" in str(e).lower():
                    # Exponential backoff on rate-limit errors
                    time.sleep(self.retry_delay * (2 ** attempt))
                    continue
                elif "server error" in str(e).lower():
                    time.sleep(5)
                    continue
                else:
                    raise e
        raise Exception("Max retries exceeded")

    def batch_process(self, prompts, batch_size=5):
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            batch_results = []
            for prompt in batch:
                try:
                    result = self.robust_completion(prompt)
                    batch_results.append(result)
                except Exception as e:
                    batch_results.append(f"Error: {str(e)}")
            results.extend(batch_results)
            time.sleep(0.1)  # Small delay between batches
        return results
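A short usage sketch for the class above; the key is read from the environment and the prompts are placeholders:

import os

groq = GroqClient(api_key=os.environ["GROQ_API_KEY"])

# Single request with retries, then a small sequential batch.
single = groq.robust_completion("Explain the LPU architecture briefly")
batch = groq.batch_process([
    "Summarize Mixtral 8x7B's strengths",
    "When is streaming preferable to a blocking call?",
])
print(single)
print(batch)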