API Reference
Complete API endpoint documentation
Complete reference for the HiveOps AI Inference API. Compatible with OpenAI's API specification.
Base URL
https://ai.hiveops.io
All API requests must be made over HTTPS. Requests made over plain HTTP will be rejected.
Authentication
HiveOps uses API keys for authentication. Include your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Example
curl https://ai.hiveops.io/v1/chat/completions \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3:8b-instruct-q8_0", "messages": [{"role": "user", "content": "Hello!"}]}'
Rate Limits
To ensure fair usage and system stability, we enforce the following limits per API key:
| Limit Type | Value |
|---|---|
| Requests per minute | 60 |
| Tokens per minute | 150,000 |
When you exceed rate limits, you'll receive a 429 Too Many Requests response. Implement exponential backoff in your application to handle rate limits gracefully.
Rate Limit Headers
Every API response includes rate limit information:
X-RateLimit-Limit-Requests: 60
X-RateLimit-Remaining-Requests: 45
X-RateLimit-Reset-Requests: 2026-03-20T12:30:00Z
X-RateLimit-Limit-Tokens: 150000
X-RateLimit-Remaining-Tokens: 145000
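These headers make it straightforward to pause until the window resets instead of retrying blindly. The sketch below assumes the header names and ISO-8601 reset timestamp shown above; `seconds_until_reset` is a hypothetical helper, not part of any SDK.

```python
from datetime import datetime, timezone

def seconds_until_reset(headers, now=None):
    """Seconds to pause before the next request, based on the
    X-RateLimit-* headers. Returns 0 while request quota remains."""
    if int(headers.get("X-RateLimit-Remaining-Requests", "1")) > 0:
        return 0.0
    # Parse the ISO-8601 reset time (the "Z" suffix means UTC).
    reset = datetime.fromisoformat(
        headers["X-RateLimit-Reset-Requests"].replace("Z", "+00:00")
    )
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset - now).total_seconds())
```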
Endpoints
Chat Completions
Create a chat completion with conversational context.
Endpoint: POST /v1/chat/completions
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | ID of the model to use (see Models) |
| messages | array | Yes | Array of message objects |
| temperature | number | No | Sampling temperature (0-2, default: 1) |
| top_p | number | No | Nucleus sampling (0-1, default: 1) |
| max_tokens | integer | No | Maximum tokens to generate |
| stream | boolean | No | Enable streaming responses (default: false) |
| stop | string/array | No | Stop sequences |
| presence_penalty | number | No | Penalize new tokens (-2 to 2, default: 0) |
| frequency_penalty | number | No | Penalize repeated tokens (-2 to 2, default: 0) |
| response_format | object | No | Specify output format (e.g., JSON mode) |
| tools | array | No | Available tools for function calling |
| tool_choice | string/object | No | Control tool usage behavior |
Message Object
{
  "role": "user|assistant|system|tool",
  "content": "Message content",
  "name": "Optional function/user name",
  "tool_calls": []  // Only on assistant messages that call tools
}
Example Request
curl https://ai.hiveops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -d '{
    "model": "llama3:8b-instruct-q8_0",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant that speaks like a pirate."
      },
      {
        "role": "user",
        "content": "Tell me about the ocean."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
Example Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711065600,
  "model": "llama3:8b-instruct-q8_0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Arrr, matey! The ocean be a vast and mysterious place..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
Streaming Example
from openai import OpenAI

client = OpenAI(
    api_key="sk-YOUR-API-KEY",
    base_url="https://ai.hiveops.io/v1"
)

stream = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Completions
Create a text completion (legacy endpoint; Chat Completions is recommended for new applications).
Endpoint: POST /v1/completions
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID |
| prompt | string/array | Yes | Text prompt(s) |
| max_tokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0-2) |
| top_p | number | No | Nucleus sampling (0-1) |
| stream | boolean | No | Enable streaming |
| stop | string/array | No | Stop sequences |
Example Request
curl https://ai.hiveops.io/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -d '{
    "model": "llama3:8b-instruct-q8_0",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'
List Models
Get a list of available models.
Endpoint: GET /v1/models
Example Request
curl https://ai.hiveops.io/v1/models \
  -H "Authorization: Bearer sk-YOUR-API-KEY"
Example Response
{
  "object": "list",
  "data": [
    {
      "id": "llama3:8b-instruct-q8_0",
      "object": "model",
      "created": 1711065600,
      "owned_by": "meta"
    },
    {
      "id": "llama-3-70b-instruct",
      "object": "model",
      "created": 1711065600,
      "owned_by": "meta"
    }
  ]
}
Get Model Info
Retrieve detailed information about a specific model.
Endpoint: GET /v1/model/info?model={model_id}
Example Request
curl "https://ai.hiveops.io/v1/model/info?model=llama3:8b-instruct-q8_0" \
  -H "Authorization: Bearer sk-YOUR-API-KEY"
Models
Llama 3 8B Instruct
Model ID: llama3:8b-instruct-q8_0
- Context Window: 8,192 tokens
- Input Pricing: $0.010 / 1M tokens
- Output Pricing: $0.020 / 1M tokens
- Best For: General-purpose tasks, fast responses
- Provider: Meta AI
Llama 3 70B Instruct
Model ID: llama-3-70b-instruct
- Context Window: 16,384 tokens
- Input Pricing: $0.100 / 1M tokens
- Output Pricing: $0.200 / 1M tokens
- Best For: Complex reasoning, high-quality outputs
- Provider: Meta AI
Gemma 2 9B IT
Model ID: gemma-2-9b-it
- Context Window: 8,192 tokens
- Input Pricing: $0.005 / 1M tokens
- Output Pricing: $0.010 / 1M tokens
- Best For: Balanced performance and cost
- Provider: Google
Mistral 7B Instruct
Model ID: mistral-7b-instruct-v0.3
- Context Window: 4,096 tokens
- Input Pricing: $0.001 / 1M tokens
- Output Pricing: $0.002 / 1M tokens
- Best For: Budget-conscious applications, high volume
- Provider: Mistral AI
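When choosing between these models, context window and input price usually decide. The sketch below encodes the model cards above in a small catalog and picks the cheapest model that fits a given prompt size; `cheapest_model` and the `MODELS` dict are illustrative helpers, not part of the API.

```python
# Catalog mirroring the context windows and input prices listed above.
MODELS = {
    "llama3:8b-instruct-q8_0":  {"context": 8192,  "input_per_1m": 0.010},
    "llama-3-70b-instruct":     {"context": 16384, "input_per_1m": 0.100},
    "gemma-2-9b-it":            {"context": 8192,  "input_per_1m": 0.005},
    "mistral-7b-instruct-v0.3": {"context": 4096,  "input_per_1m": 0.001},
}

def cheapest_model(min_context_tokens):
    """Cheapest (by input price) model whose context window is at least
    min_context_tokens; None if no listed model is large enough."""
    fits = {m: v for m, v in MODELS.items() if v["context"] >= min_context_tokens}
    if not fits:
        return None
    return min(fits, key=lambda m: fits[m]["input_per_1m"])
```

For example, a prompt near 8,000 tokens rules out Mistral 7B, leaving Gemma 2 9B as the cheapest fit.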
Advanced Features
JSON Mode
Force the model to output valid JSON:
response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
        {"role": "user", "content": "List 3 colors in JSON format."}
    ],
    response_format={"type": "json_object"}
)
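Even in JSON mode it is worth validating the reply before using it. A minimal sketch, assuming `response_format={"type": "json_object"}` yields a single JSON object in `message.content`; `parse_json_reply` is a hypothetical helper:

```python
import json

def parse_json_reply(content):
    """Parse a JSON-mode reply, raising a clear error if it is not
    the single JSON object that response_format requested."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return data
```

Usage: `colors = parse_json_reply(response.choices[0].message.content)`.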
Function Calling
Define functions the model can call:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)
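When the assistant decides to call a tool, the response carries `tool_calls` rather than text; your application runs the function and sends the result back as a `role: "tool"` message. A minimal sketch of that dispatch step, with the tool call shown as a plain dict matching the wire format (the SDK exposes it as an object with the same fields) and `get_weather` stubbed out:

```python
import json

def get_weather(location, unit="celsius"):
    # Stub: a real application would call an actual weather service here.
    return {"location": location, "unit": unit, "temperature": 18}

AVAILABLE_TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call):
    """Execute one tool call from an assistant message and build the
    role="tool" message to append to the conversation for the next turn."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    result = AVAILABLE_TOOLS[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

Append the returned message (after the assistant message that requested the call) and make a second `chat.completions.create` request so the model can produce its final answer.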
Error Codes
| Code | Description | Solution |
|---|---|---|
| 400 | Bad Request | Check request format and parameters |
| 401 | Unauthorized | Verify API key is correct and active |
| 403 | Forbidden | Insufficient balance - add funds |
| 429 | Rate Limit Exceeded | Implement exponential backoff |
| 500 | Internal Server Error | Retry with exponential backoff |
| 503 | Service Unavailable | Temporarily overloaded - retry later |
Error Response Format
{
  "error": {
    "message": "Invalid API key provided",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
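A small sketch of how a client might act on this, assuming the error body and status codes documented above; `describe_error` and `RETRYABLE_STATUS` are illustrative names, not part of any SDK:

```python
import json

# Status codes from the table above that are worth retrying with backoff.
RETRYABLE_STATUS = {429, 500, 503}

def describe_error(status, body):
    """Summarize an error response using the documented error format,
    and report whether the request should be retried."""
    err = json.loads(body).get("error", {})
    summary = f"{status} {err.get('type', 'unknown')}: {err.get('message', '')}"
    return summary, status in RETRYABLE_STATUS
```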
See Error Handling Guide for detailed troubleshooting.
Best Practices
1. Use Streaming for Chat UIs
Streaming provides a better user experience by showing responses as they're generated:
stream = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=messages,
    stream=True
)
2. Implement Retry Logic
Handle transient errors gracefully:
import time

from openai import OpenAI, APIError

client = OpenAI(api_key="sk-YOUR-API-KEY", base_url="https://ai.hiveops.io/v1")

max_retries = 3
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(...)
        break
    except APIError:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
        else:
            raise
3. Monitor Token Usage
Track usage to manage costs:
response = client.chat.completions.create(...)
usage = response.usage
print(f"Tokens used: {usage.total_tokens}")

# Price prompt and completion tokens separately (Llama 3 8B rates:
# $0.010 / 1M input, $0.020 / 1M output).
cost = (usage.prompt_tokens * 0.010 + usage.completion_tokens * 0.020) / 1_000_000
print(f"Estimated cost: ${cost:.6f}")
4. Set Max Tokens
Prevent unexpectedly long (and expensive) responses:
response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=messages,
    max_tokens=500  # Limit response length
)
SDK Support
HiveOps is compatible with official OpenAI SDKs and many third-party libraries:
- ✅ Python: `openai` (`pip install openai`)
- ✅ JavaScript/TypeScript: `openai` (`npm install openai`)
- ✅ Go: `go-openai`
- ✅ .NET: `Azure.AI.OpenAI`
- ✅ Java: `openai-java`
- ✅ CLI: `openai` command-line tool
See our SDK guides for language-specific examples.
Support
Questions? We're here to help:
- 📚 Documentation
- 💬 Discord Community
- 📧 Email: [email protected]