
API Reference

Complete reference for the HiveOps AI Inference API. Compatible with OpenAI's API specification.

Base URL

https://ai.hiveops.io

All API requests must be made over HTTPS. Requests made over plain HTTP will be rejected.


Authentication

HiveOps uses API keys for authentication. Include your API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY

Example

curl https://ai.hiveops.io/v1/chat/completions \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3:8b-instruct-q8_0", "messages": [{"role": "user", "content": "Hello!"}]}'

Rate Limits

To ensure fair usage and system stability, we enforce the following limits per API key:

Limit Type           Value
Requests per minute  60
Tokens per minute    150,000

When you exceed rate limits, you'll receive a 429 Too Many Requests response. Implement exponential backoff in your application to handle rate limits gracefully.

Rate Limit Headers

Every API response includes rate limit information:

X-RateLimit-Limit-Requests: 60
X-RateLimit-Remaining-Requests: 45
X-RateLimit-Reset-Requests: 2026-03-20T12:30:00Z
X-RateLimit-Limit-Tokens: 150000
X-RateLimit-Remaining-Tokens: 145000
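The headers above can drive client-side throttling before you ever hit a 429. A minimal sketch, assuming your HTTP client exposes response headers as a dict (the `headers` dict below is a stand-in for e.g. `response.headers`):

```python
# Illustrative helper: read the remaining request/token budget from the
# rate-limit headers documented above.

def remaining_budget(headers: dict) -> dict:
    """Extract remaining request/token budget from HiveOps response headers."""
    return {
        "requests_remaining": int(headers.get("X-RateLimit-Remaining-Requests", 0)),
        "tokens_remaining": int(headers.get("X-RateLimit-Remaining-Tokens", 0)),
        "reset_at": headers.get("X-RateLimit-Reset-Requests"),
    }

# Sample headers, copied from the example above.
headers = {
    "X-RateLimit-Limit-Requests": "60",
    "X-RateLimit-Remaining-Requests": "45",
    "X-RateLimit-Reset-Requests": "2026-03-20T12:30:00Z",
    "X-RateLimit-Limit-Tokens": "150000",
    "X-RateLimit-Remaining-Tokens": "145000",
}
budget = remaining_budget(headers)
print(budget["requests_remaining"])  # 45
```

When `requests_remaining` or `tokens_remaining` approaches zero, pause until `reset_at` instead of sending a request you know will be rejected.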

Endpoints

Chat Completions

Create a chat completion with conversational context.

Endpoint: POST /v1/chat/completions

Request Body

Field              Type           Required  Description
model              string         Yes       ID of the model to use (see Models)
messages           array          Yes       Array of message objects
temperature        number         No        Sampling temperature (0-2, default: 1)
top_p              number         No        Nucleus sampling (0-1, default: 1)
max_tokens         integer        No        Maximum tokens to generate
stream             boolean        No        Enable streaming responses (default: false)
stop               string/array   No        Stop sequences
presence_penalty   number         No        Penalize new tokens (-2 to 2, default: 0)
frequency_penalty  number         No        Penalize repeated tokens (-2 to 2, default: 0)
response_format    object         No        Specify output format (e.g., JSON mode)
tools              array          No        Available tools for function calling
tool_choice        string/object  No        Control tool usage behavior

Message Object

{
  "role": "user|assistant|system|tool",
  "content": "Message content",
  "name": "Optional function/user name",
  "tool_calls": [] // For assistant messages with tool calls
}

Example Request

curl https://ai.hiveops.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -d '{
    "model": "llama3:8b-instruct-q8_0",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant that speaks like a pirate."
      },
      {
        "role": "user",
        "content": "Tell me about the ocean."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

Example Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711065600,
  "model": "llama3:8b-instruct-q8_0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Arrr, matey! The ocean be a vast and mysterious place..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}

Streaming Example

from openai import OpenAI

client = OpenAI(
    api_key="sk-YOUR-API-KEY",
    base_url="https://ai.hiveops.io"
)

stream = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Completions

Create a text completion. This is a legacy endpoint; the Chat Completions endpoint is recommended for new applications.

Endpoint: POST /v1/completions

Request Body

Field        Type          Required  Description
model        string        Yes       Model ID
prompt       string/array  Yes       Text prompt(s)
max_tokens   integer       No        Maximum tokens to generate
temperature  number        No        Sampling temperature (0-2)
top_p        number        No        Nucleus sampling (0-1)
stream       boolean       No        Enable streaming
stop         string/array  No        Stop sequences

Example Request

curl https://ai.hiveops.io/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-YOUR-API-KEY" \
  -d '{
    "model": "llama3:8b-instruct-q8_0",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'

List Models

Get a list of available models.

Endpoint: GET /v1/models

Example Request

curl https://ai.hiveops.io/v1/models \
  -H "Authorization: Bearer sk-YOUR-API-KEY"

Example Response

{
  "object": "list",
  "data": [
    {
      "id": "llama3:8b-instruct-q8_0",
      "object": "model",
      "created": 1711065600,
      "owned_by": "meta"
    },
    {
      "id": "llama-3-70b-instruct",
      "object": "model",
      "created": 1711065600,
      "owned_by": "meta"
    }
  ]
}

Get Model Info

Retrieve detailed information about a specific model.

Endpoint: GET /v1/model/info?model={model_id}

Example Request

curl "https://ai.hiveops.io/v1/model/info?model=llama3:8b-instruct-q8_0" \
  -H "Authorization: Bearer sk-YOUR-API-KEY"

Models

Llama 3 8B Instruct

Model ID: llama3:8b-instruct-q8_0

  • Context Window: 8,192 tokens
  • Input Pricing: $0.010 / 1M tokens
  • Output Pricing: $0.020 / 1M tokens
  • Best For: General-purpose tasks, fast responses
  • Provider: Meta AI

Llama 3 70B Instruct

Model ID: llama-3-70b-instruct

  • Context Window: 16,384 tokens
  • Input Pricing: $0.100 / 1M tokens
  • Output Pricing: $0.200 / 1M tokens
  • Best For: Complex reasoning, high-quality outputs
  • Provider: Meta AI

Gemma 2 9B IT

Model ID: gemma-2-9b-it

  • Context Window: 8,192 tokens
  • Input Pricing: $0.005 / 1M tokens
  • Output Pricing: $0.010 / 1M tokens
  • Best For: Balanced performance and cost
  • Provider: Google

Mistral 7B Instruct

Model ID: mistral-7b-instruct-v0.3

  • Context Window: 4,096 tokens
  • Input Pricing: $0.001 / 1M tokens
  • Output Pricing: $0.002 / 1M tokens
  • Best For: Budget-conscious applications, high volume
  • Provider: Mistral AI
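The pricing above translates directly into a per-request cost estimate. A minimal sketch, with the prices hard-coded from the tables above (verify current rates in your dashboard before relying on this):

```python
# Per-model pricing in USD per 1M tokens, copied from the Models section.
PRICING = {
    "llama3:8b-instruct-q8_0":  {"input": 0.010, "output": 0.020},
    "llama-3-70b-instruct":     {"input": 0.100, "output": 0.200},
    "gemma-2-9b-it":            {"input": 0.005, "output": 0.010},
    "mistral-7b-instruct-v0.3": {"input": 0.001, "output": 0.002},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost in USD from a response's usage counts."""
    p = PRICING[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

# Example: the usage block from the sample chat response above (25 in, 150 out).
cost = estimate_cost("llama3:8b-instruct-q8_0", 25, 150)
```

In practice, feed `response.usage.prompt_tokens` and `response.usage.completion_tokens` into `estimate_cost` after each call.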

Advanced Features

JSON Mode

Force the model to output valid JSON:

response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
        {"role": "user", "content": "List 3 colors in JSON format."}
    ],
    response_format={"type": "json_object"}
)
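In JSON mode the assistant's message content is still a JSON *string*, which your client must parse. A small sketch (the sample string below is hypothetical; in practice you would pass `response.choices[0].message.content`):

```python
import json

def parse_json_content(content: str) -> dict:
    """Parse a JSON-mode response body; raise ValueError on invalid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model did not return valid JSON: {e}") from e

# Hypothetical content a model might return for the request above.
data = parse_json_content('{"colors": ["red", "green", "blue"]}')
```

Wrapping the parse lets you catch malformed output and retry the request rather than crashing downstream code.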

Function Calling

Define functions the model can call:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools
)
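When the model chooses to call a tool, the response carries `tool_calls` instead of text; your code executes the named function and sends the result back as a `"role": "tool"` message. A minimal dispatch sketch (`get_weather` is a stub here, and the name/arguments shape mirrors `tool_call.function` in the OpenAI SDK):

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    # Stub: a real implementation would query a weather service.
    return f"18 degrees {unit} in {location}"

# Map tool names from the `tools` definition to local callables.
LOCAL_TOOLS = {"get_weather": get_weather}

def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call; `arguments_json` is the model-provided JSON string."""
    args = json.loads(arguments_json)
    return LOCAL_TOOLS[name](**args)

# Shape mirrors tool_call.function.name / tool_call.function.arguments.
result = run_tool_call("get_weather", '{"location": "Paris"}')
# `result` is then appended as {"role": "tool", "content": result, ...}
# and the conversation is sent back to the model for a final answer.
```

Note the model only *requests* the call; nothing runs until your code dispatches it, so validate arguments before executing anything sensitive.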

Error Codes

Code  Description            Solution
400   Bad Request            Check request format and parameters
401   Unauthorized           Verify API key is correct and active
403   Forbidden              Insufficient balance - add funds
429   Rate Limit Exceeded    Implement exponential backoff
500   Internal Server Error  Retry with exponential backoff
503   Service Unavailable    Temporarily overloaded - retry later

Error Response Format

{
  "error": {
    "message": "Invalid API key provided",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
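The `type` and `code` fields are the most useful parts of this body for logging and branching. A small sketch (`describe_error` is an illustrative helper, not part of any SDK):

```python
import json

def describe_error(body: str) -> str:
    """Format an error response body (shape shown above) as a log line."""
    err = json.loads(body)["error"]
    return f'{err["type"]}/{err["code"]}: {err["message"]}'

# The sample error body from above.
body = ('{"error": {"message": "Invalid API key provided", '
        '"type": "invalid_request_error", "code": "invalid_api_key"}}')
line = describe_error(body)
```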

See Error Handling Guide for detailed troubleshooting.


Best Practices

1. Use Streaming for Chat UIs

Streaming provides a better user experience by showing responses as they're generated:

stream = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=messages,
    stream=True
)

2. Implement Retry Logic

Handle transient errors gracefully:

import time
from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError

max_retries = 3
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(...)
        break
    except (RateLimitError, APIConnectionError, InternalServerError):
        # Retry only transient failures (429, connection errors, 5xx);
        # 400/401/403 will not succeed on retry.
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff
        else:
            raise

3. Monitor Token Usage

Track usage to manage costs:

response = client.chat.completions.create(...)
usage = response.usage
print(f"Tokens used: {usage.total_tokens}")
# Input and output tokens are priced separately (see Models above).
cost = (usage.prompt_tokens * 0.010 + usage.completion_tokens * 0.020) / 1_000_000
print(f"Estimated cost (llama3:8b-instruct-q8_0): ${cost:.6f}")

4. Set Max Tokens

Prevent unexpectedly long (and expensive) responses:

response = client.chat.completions.create(
    model="llama3:8b-instruct-q8_0",
    messages=messages,
    max_tokens=500  # Limit response length
)

SDK Support

HiveOps is compatible with official OpenAI SDKs and many third-party libraries:

  • Python: openai (pip install openai)
  • JavaScript/TypeScript: openai (npm install openai)
  • Go: go-openai
  • .NET: Azure.AI.OpenAI
  • Java: openai-java
  • CLI: openai command-line tool

See our SDK guides for language-specific examples.


Support

Questions? We're here to help: