API Reference
Complete API endpoint documentation
Complete reference for the HiveOps AI Inference API. Compatible with OpenAI's API specification.
Base URL
https://ai.hiveops.io
All API requests must be made over HTTPS. Requests made over plain HTTP will be rejected.
Authentication
HiveOps uses API keys for authentication. Include your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
Example
curl https://ai.hiveops.io/v1/chat/completions \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3:8b-instruct-q8_0", "messages": [{"role": "user", "content": "Hello!"}]}'
Rate Limits
To ensure fair usage and system stability, we enforce the following limits per API key:
| Limit Type | Value |
|---|---|
| Requests per minute | 60 |
| Tokens per minute | 150,000 |
When you exceed rate limits, you'll receive a 429 Too Many Requests response. Implement exponential backoff in your application to handle rate limits gracefully.
Rate Limit Headers
Every API response includes rate limit information:
X-RateLimit-Limit-Requests: 60
X-RateLimit-Remaining-Requests: 45
X-RateLimit-Reset-Requests: 2026-03-20T12:30:00Z
X-RateLimit-Limit-Tokens: 150000
X-RateLimit-Remaining-Tokens: 145000
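These headers make it straightforward to pause until the window resets instead of retrying blindly. The sketch below assumes the header names and ISO-8601 reset timestamp shown above; `seconds_until_reset` is a hypothetical helper, not part of any SDK.

```python
from datetime import datetime, timezone

def seconds_until_reset(headers, now=None):
    """Seconds to pause before the next request, based on the
    X-RateLimit-* headers. Returns 0 while request quota remains."""
    if int(headers.get("X-RateLimit-Remaining-Requests", "1")) > 0:
        return 0.0
    # Parse the ISO-8601 reset time (the "Z" suffix means UTC).
    reset = datetime.fromisoformat(
        headers["X-RateLimit-Reset-Requests"].replace("Z", "+00:00")
    )
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset - now).total_seconds())
```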
Endpoints
Chat Completions
Create a chat completion with conversational context.
Endpoint: POST /v1/chat/completions
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | ID of the model to use (see Models) |
| messages | array | Yes | Array of message objects |
| temperature | number | No | Sampling temperature (0-2, default: 1) |
| top_p | number | No | Nucleus sampling (0-1, default: 1) |
| max_tokens | integer | No | Maximum tokens to generate |
| stream | boolean | No | Enable streaming responses (default: false) |
| stop | string/array | No | Stop sequences |
| presence_penalty | number | No | Penalize new tokens (-2 to 2, default: 0) |
| frequency_penalty | number | No | Penalize repeated tokens (-2 to 2, default: 0) |
| response_format | object | No | Specify output format (e.g., JSON mode) |
| tools | array | No | Available tools for function calling |
| tool_choice | string/object | No | Control tool usage behavior |
Message Object
{
  "role": "user|assistant|system|tool",
  "content": "Message content",
  "name": "Optional function/user name",
  "tool_calls": []  // Only on assistant messages that call tools
}
Example Request
curl https://ai.hiveops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -d '{
    "model": "llama3:8b-instruct-q8_0",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant that speaks like a pirate."
      },
      {
        "role": "user",
        "content": "Tell me about the ocean."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
Example Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711065600,
  "model": "llama3:8b-instruct-q8_0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Arrr, matey! The ocean be a vast and mysterious place..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
Streaming Example
from openai import OpenAI

client = OpenAI(
    api_key="sk-YOUR-API-KEY",
    base_url="https://ai.hiveops.io/v1"
)

stream = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
Completions
Create a text completion (legacy endpoint; Chat Completions is recommended for new applications).
Endpoint: POST /v1/completions
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID |
| prompt | string/array | Yes | Text prompt(s) |
| max_tokens | integer | No | Maximum tokens to generate |
| temperature | number | No | Sampling temperature (0-2) |
| top_p | number | No | Nucleus sampling (0-1) |
| stream | boolean | No | Enable streaming |
| stop | string/array | No | Stop sequences |
Example Request
curl https://ai.hiveops.io/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -d '{
    "model": "llama3:8b-instruct-q8_0",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'
List Models
Get a list of available models.
Endpoint: GET /v1/models
Example Request
curl https://ai.hiveops.io/v1/models \
  -H "Authorization: Bearer sk-YOUR-API-KEY"
Example Response
{
  "object": "list",
  "data": [
    {
      "id": "llama3:8b-instruct-q8_0",
      "object": "model",
      "created": 1711065600,
      "owned_by": "meta"
    },
    {
      "id": "llama-3-70b-instruct",
      "object": "model",
      "created": 1711065600,
      "owned_by": "meta"
    }
  ]
}
Get Model Info
Retrieve detailed information about a specific model.
Endpoint: GET /v1/model/info?model={model_id}
Example Request
curl "https://ai.hiveops.io/v1/model/info?model=llama3:8b-instruct-q8_0" \
  -H "Authorization: Bearer sk-YOUR-API-KEY"
Models
Llama 3 8B Instruct
Model ID: llama3:8b-instruct-q8_0
- Context Window: 8,192 tokens
- Input Pricing: $0.010 / 1M tokens
- Output Pricing: $0.020 / 1M tokens
- Best For: General-purpose tasks, fast responses
- Provider: Meta AI
Llama 3 70B Instruct
Model ID: llama-3-70b-instruct
- Context Window: 16,384 tokens
- Input Pricing: $0.100 / 1M tokens
- Output Pricing: $0.200 / 1M tokens
- Best For: Complex reasoning, high-quality outputs
- Provider: Meta AI
Gemma 2 9B IT
Model ID: gemma-2-9b-it
- Context Window: 8,192 tokens
- Input Pricing: $0.005 / 1M tokens
- Output Pricing: $0.010 / 1M tokens
- Best For: Balanced performance and cost
- Provider: Google
Mistral 7B Instruct
Model ID: mistral-7b-instruct-v0.3
- Context Window: 4,096 tokens
- Input Pricing: $0.001 / 1M tokens
- Output Pricing: $0.002 / 1M tokens
- Best For: Budget-conscious applications, high volume
- Provider: Mistral AI
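When choosing between these models, context window and input price usually decide. The sketch below encodes the model cards above in a small catalog and picks the cheapest model that fits a given prompt size; `cheapest_model` and the `MODELS` dict are illustrative helpers, not part of the API.

```python
# Catalog mirroring the context windows and input prices listed above.
MODELS = {
    "llama3:8b-instruct-q8_0":  {"context": 8192,  "input_per_1m": 0.010},
    "llama-3-70b-instruct":     {"context": 16384, "input_per_1m": 0.100},
    "gemma-2-9b-it":            {"context": 8192,  "input_per_1m": 0.005},
    "mistral-7b-instruct-v0.3": {"context": 4096,  "input_per_1m": 0.001},
}

def cheapest_model(min_context_tokens):
    """Cheapest (by input price) model whose context window is at least
    min_context_tokens; None if no listed model is large enough."""
    fits = {m: v for m, v in MODELS.items() if v["context"] >= min_context_tokens}
    if not fits:
        return None
    return min(fits, key=lambda m: fits[m]["input_per_1m"])
```

For example, a prompt near 8,000 tokens rules out Mistral 7B, leaving Gemma 2 9B as the cheapest fit.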
Advanced Features
JSON Mode
Force the model to output valid JSON:
response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
        {"role": "user", "content": "List 3 colors in JSON format."}
    ],
    response_format={"type": "json_object"}
)
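Even in JSON mode it is worth validating the reply before using it. A minimal sketch, assuming `response_format={"type": "json_object"}` yields a single JSON object in `message.content`; `parse_json_reply` is a hypothetical helper:

```python
import json

def parse_json_reply(content):
    """Parse a JSON-mode reply, raising a clear error if it is not
    the single JSON object that response_format requested."""
    data = json.loads(content)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return data
```

Usage: `colors = parse_json_reply(response.choices[0].message.content)`.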
Function Calling
Define functions the model can call:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)
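When the assistant decides to call a tool, the response carries `tool_calls` rather than text; your application runs the function and sends the result back as a `role: "tool"` message. A minimal sketch of that dispatch step, with the tool call shown as a plain dict matching the wire format (the SDK exposes it as an object with the same fields) and `get_weather` stubbed out:

```python
import json

def get_weather(location, unit="celsius"):
    # Stub: a real application would call an actual weather service here.
    return {"location": location, "unit": unit, "temperature": 18}

AVAILABLE_TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call):
    """Execute one tool call from an assistant message and build the
    role="tool" message to append to the conversation for the next turn."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    result = AVAILABLE_TOOLS[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

Append the returned message (after the assistant message that requested the call) and make a second `chat.completions.create` request so the model can produce its final answer.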
Error Codes
| Code | Description | Solution |
|---|---|---|
| 400 | Bad Request | Check request format and parameters |
| 401 | Unauthorized | Verify API key is correct and active |
| 403 | Forbidden | Insufficient balance - add funds |
| 429 | Rate Limit Exceeded | Implement exponential backoff |
| 500 | Internal Server Error | Retry with exponential backoff |
| 503 | Service Unavailable | Temporarily overloaded - retry later |
Error Response Format
{
  "error": {
    "message": "Invalid API key provided",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
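A small sketch of how a client might act on this, assuming the error body and status codes documented above; `describe_error` and `RETRYABLE_STATUS` are illustrative names, not part of any SDK:

```python
import json

# Status codes from the table above that are worth retrying with backoff.
RETRYABLE_STATUS = {429, 500, 503}

def describe_error(status, body):
    """Summarize an error response using the documented error format,
    and report whether the request should be retried."""
    err = json.loads(body).get("error", {})
    summary = f"{status} {err.get('type', 'unknown')}: {err.get('message', '')}"
    return summary, status in RETRYABLE_STATUS
```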
See Error Handling Guide for detailed troubleshooting.
Best Practices
1. Use Streaming for Chat UIs
Streaming provides a better user experience by showing responses as they're generated:
stream = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=messages,
    stream=True
)
2. Implement Retry Logic
Handle transient errors gracefully:
import time

from openai import OpenAI, APIError

client = OpenAI(api_key="sk-YOUR-API-KEY", base_url="https://ai.hiveops.io/v1")

max_retries = 3
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(...)
        break
    except APIError:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
        else:
            raise
3. Monitor Token Usage
Track usage to manage costs:
response = client.chat.completions.create(...)
usage = response.usage
print(f"Tokens used: {usage.total_tokens}")

# Price prompt and completion tokens separately (Llama 3 8B rates:
# $0.010 / 1M input, $0.020 / 1M output).
cost = (usage.prompt_tokens * 0.010 + usage.completion_tokens * 0.020) / 1_000_000
print(f"Estimated cost: ${cost:.6f}")
4. Set Max Tokens
Prevent unexpectedly long (and expensive) responses:
response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=messages,
    max_tokens=500  # Limit response length
)
SDK Support
HiveOps is compatible with official OpenAI SDKs and many third-party libraries:
- ✅ Python: `openai` (`pip install openai`)
- ✅ JavaScript/TypeScript: `openai` (`npm install openai`)
- ✅ Go: `go-openai`
- ✅ .NET: `Azure.AI.OpenAI`
- ✅ Java: `openai-java`
- ✅ CLI: `openai` command-line tool
See our SDK guides for language-specific examples.
Support
Questions? We're here to help:
- 📚 Documentation
- 💬 Discord Community
- 📧 Email: [email protected]