

Overview

Use the Pinecone Sink Destination to stream data from your Kafka topics into Pinecone indexes. This connector is useful for building real-time vector search applications, powering recommendation systems, and maintaining synchronized vector embeddings. Since Pinecone requires pre-computed vectors, you must provide embeddings in your Kafka records using the Field Vector Strategy.

Prerequisites

  • A Pinecone account with an existing index
  • Pinecone API key
  • The name of your target Pinecone index (must be pre-created)
  • Pre-computed embedding vectors in your Kafka records
  • Understanding of your ID strategy (see below)
Unlike some vector databases, Pinecone does not generate embeddings automatically. You must provide pre-computed vectors in your Kafka records using the Field Vector Strategy.

Key Concepts

Indexes & Namespaces

Pinecone organizes vectors into indexes, and each index can contain multiple namespaces. Each Kafka topic is mapped to a Pinecone namespace within your configured index. By default, a Kafka topic named users maps to a namespace named users. You can customize this mapping using the Namespace Mapping setting.
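
A minimal sketch of how a mapping pattern might expand (the connector's actual substitution logic is internal to Streamkap; the `expand_namespace` helper below is purely illustrative):

```python
def expand_namespace(pattern: str, topic: str) -> str:
    """Substitute the Kafka topic name into a namespace mapping pattern."""
    return pattern.replace("${topic}", topic)

# Default mapping: the topic name becomes the namespace name
expand_namespace("${topic}", "users")          # "users"
# Custom prefixes/suffixes wrap the topic name
expand_namespace("prod_${topic}_v2", "users")  # "prod_users_v2"
```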

Document IDs & Upsert Operations

Pinecone uses string-based IDs to uniquely identify vectors. The connector supports multiple ID strategies:
  • NoIdStrategy (default) – Generates a new UUID for each record, always creating new vectors (INSERT semantics)
  • FieldIdStrategy – Uses a field from your Kafka record as the vector ID, enabling upserts
  • KafkaIdStrategy – Uses the Kafka message key as the vector ID
When using FieldIdStrategy, specify the field name (e.g., id, user_id) in the Document ID Field setting. When using KafkaIdStrategy or FieldIdStrategy, the connector generates a deterministic UUID from the key or field value, enabling idempotent upserts.
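
One way to picture the deterministic ID derivation (the connector's exact UUID scheme is not documented here; this sketch assumes a name-based `uuid.uuid5` for illustration):

```python
import uuid

def deterministic_vector_id(key_or_field_value: str) -> str:
    """Derive a stable UUID from a record key or field value so that
    re-processing the same record upserts the same vector."""
    return str(uuid.uuid5(uuid.NAMESPACE_OID, key_or_field_value))

# The same input always yields the same ID, which makes upserts idempotent
assert deterministic_vector_id("user-42") == deterministic_vector_id("user-42")
```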

Vectors (Bring Your Own Vectors)

Pinecone requires vectors to be provided with each record. Use the Field Vector Strategy and specify the field in your Kafka records that contains the embedding vector. The connector supports multiple vector input formats:
  • Direct float arrays
  • Lists/collections of numbers
  • JSON string format (e.g., "[0.1, 0.2, 0.3]")
The No Vector Strategy is not supported for Pinecone. You must configure Field Vector Strategy and specify a vector field.
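
The three accepted input shapes can all be normalized to a single list of floats, as in this sketch (the `normalize_vector` helper is illustrative, not the connector's API):

```python
import json
from typing import Sequence, Union

def normalize_vector(value: Union[str, Sequence[float]]) -> list:
    """Coerce a direct float array, a list/collection of numbers, or a
    JSON string like "[0.1, 0.2, 0.3]" into a plain list of floats."""
    if isinstance(value, str):
        value = json.loads(value)          # JSON string format
    return [float(x) for x in value]       # arrays / lists of numbers

normalize_vector("[0.1, 0.2, 0.3]")  # [0.1, 0.2, 0.3]
normalize_vector([1, 2, 3])          # [1.0, 2.0, 3.0]
```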

Metadata

All non-vector fields from your Kafka records are automatically stored as Pinecone metadata. This metadata can be used for filtering during queries. Pinecone metadata is schemaless — any field types are accepted, including nested structures, arrays, and maps.

Delete Operations

If Delete Enabled is set to true, records with null values or a __deleted=true field are treated as deletes. For example, when a record is deleted in the source and a tombstone record is sent to Kafka, the connector will delete the corresponding vector from Pinecone.
Delete operations require KafkaIdStrategy to be configured as the Document ID Strategy.
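
Conceptually, the delete check inspects the record value and the `__deleted` flag; a minimal sketch (the `is_delete` function is illustrative):

```python
from typing import Optional

def is_delete(record_value: Optional[dict]) -> bool:
    """A tombstone (null value) or a __deleted=true field marks a delete."""
    if record_value is None:           # Kafka tombstone record
        return True
    return record_value.get("__deleted") in (True, "true")

assert is_delete(None)                                  # tombstone
assert is_delete({"id": 1, "__deleted": "true"})        # soft-delete flag
assert not is_delete({"id": 1, "name": "Ada"})          # normal upsert
```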

Pinecone Setup

Before configuring the connector, prepare your Pinecone environment:

Create an Index

Pinecone indexes must be created before streaming data. Create an index via the Pinecone console or API:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

# Create a serverless index
pc.create_index(
    name="my-index",
    dimension=1536,       # Must match your embedding dimensions
    metric="cosine",      # Options: cosine, euclidean, dotproduct
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)
The index dimension must match the dimensionality of the embedding vectors in your Kafka records. A mismatch will cause upsert failures.

Gather Connection Details

  • API Key: Available in your Pinecone console under API Keys
  • Index Name: The name of the index you created

Prepare Embedding Vectors

Ensure your Kafka records contain pre-computed embedding vectors. Common embedding sources include:
  • OpenAI Embeddings API
  • Cohere Embed API
  • Sentence Transformers
  • Custom ML models
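
For example, a Kafka record value ready for this connector might look like the following (the field names `id` and `embedding` are placeholders that would match your Document ID Field and Vector Field Name settings):

```python
# A record value as it would arrive from Kafka (JSON-decoded).
# "embedding" holds the pre-computed vector; every other field
# becomes Pinecone metadata.
record = {
    "id": "user-42",                     # used by Field ID Strategy
    "name": "Ada",
    "plan": "pro",
    "embedding": [0.12, -0.03, 0.98],    # dimension must match the index
}

vector = record["embedding"]
metadata = {k: v for k, v in record.items() if k != "embedding"}
```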

Streamkap Setup

  1. Navigate to Destinations and choose Pinecone.
  2. Fill in the fields:
    1. Name – A memorable identifier for this Destination.
    2. Pinecone API Key – Your Pinecone API key for authentication.
    3. Pinecone Index Name – The name of your pre-created Pinecone index.
    4. Namespace Mapping – Pattern to map Kafka topics to Pinecone namespaces. Default: ${topic}. Examples:
      • ${topic} – Topic users → Namespace users
      • pinecone_${topic} – Topic users → Namespace pinecone_users
      • prod_${topic}_v2 – Topic users → Namespace prod_users_v2
    5. Document ID Strategy – Choose how to assign IDs to vectors:
      • No ID Strategy – Generate new UUID for each record (always inserts)
      • Field ID Strategy – Use a field from the record as the vector ID
      • Kafka ID Strategy – Use the Kafka message key as the vector ID
    6. Document ID Field (if Field ID Strategy selected) – The field name containing the ID (e.g., id, user_id). This field must exist in your Kafka records.
    7. Vector Field Name – The field in your Kafka records containing the embedding vector (e.g., embedding, vector). Must be an array of numbers.
    8. Delete Enabled – If true, null-valued records and records with __deleted=true are treated as deletes. Requires Kafka ID Strategy.
    9. Schema Evolution (default: basic) – Controls automatic namespace management:
      • basic (default) – Namespaces are automatically created on first upsert.
      • none – Namespaces must exist before the connector starts. Use this for strict control in production.
    10. Batch Size – Number of vectors to batch before sending to Pinecone (default: 100). Larger batches improve throughput; smaller batches reduce latency.
    11. Max Retries – Maximum retry attempts on connection/timeout errors (default: 3).
    12. Retry Interval (ms) – Delay between retries in milliseconds (default: 1000).
  3. Click Save.

How It Works

  1. Record Ingestion – Records from Kafka are received by the connector and grouped by namespace.
  2. Deduplication – Records are deduplicated by document ID within each batch to avoid redundant upserts.
  3. ID Generation – Based on the ID strategy, a vector ID is assigned or extracted.
  4. Vector Extraction – The embedding vector is extracted from the configured vector field.
  5. Metadata Extraction – All non-vector fields are converted to Pinecone metadata (protobuf Struct format).
  6. Batching – Vectors are accumulated and flushed when the Batch Size is reached or on explicit flush.
  7. Upsert to Pinecone – The batch is sent to Pinecone for upsert.
  8. Delete Handling – If enabled, null records or __deleted=true records trigger vector deletion by ID.
  9. Error Handling – Failed batches are retried up to Max Retries times with backoff. If error unrolling is enabled, individual records from a failed batch are retried separately and routed to a dead letter queue on failure.
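
The grouping, deduplication, and batching steps above can be sketched roughly as follows (function and variable names are illustrative, not Streamkap internals):

```python
from collections import OrderedDict

def process(records, batch_size=100):
    """Group records by namespace, dedupe by document ID within the
    batch (last write wins), and emit upsert batches of at most
    batch_size vectors each."""
    by_namespace = {}
    for rec in records:
        ns = rec["namespace"]
        # Keyed by ID: a later record with the same ID replaces the
        # earlier one, avoiding redundant upserts in the same batch.
        by_namespace.setdefault(ns, OrderedDict())[rec["id"]] = rec
    batches = []
    for ns, deduped in by_namespace.items():
        vectors = list(deduped.values())
        for i in range(0, len(vectors), batch_size):
            batches.append((ns, vectors[i : i + batch_size]))
    return batches
```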

Limitations & Best Practices

Limitations

  • Index Must Be Pre-Created – Pinecone indexes cannot be created by the connector. You must create the index in Pinecone before starting the connector.
  • Vectors Are Required – Pinecone does not auto-generate embeddings. Every record must contain a vector field with a pre-computed embedding.
  • Vector Dimensions Must Match – The dimensionality of vectors in your Kafka records must match the index dimension configured in Pinecone.
  • Delete Requires KafkaIdStrategy – Delete operations only work when using Kafka ID Strategy as the document ID strategy.
  • Single Index per Connector – Each connector instance writes to a single Pinecone index. Use multiple connectors for multiple indexes.

Best Practices

  1. Use Field ID or Kafka ID Strategy for Idempotency – If your source has unique identifiers, use them as vector IDs to enable idempotent upserts and avoid duplicate vectors.
  2. Pre-compute High-Quality Embeddings – Since Pinecone doesn’t generate embeddings, ensure your embedding pipeline produces consistent, high-quality vectors before they reach Kafka.
  3. Match Vector Dimensions – Double-check that your embedding model output dimensions match your Pinecone index dimensions (e.g., OpenAI text-embedding-3-small = 1536 dimensions).
  4. Use Namespaces for Data Isolation – Leverage namespace mapping to organize data by topic, environment, or tenant within a single index.
  5. Tune Batch Size for Throughput – Start with the default batch size of 100 and increase for higher throughput workloads. Monitor Pinecone’s rate limits.
  6. Enable Deletion for CDC Pipelines – If your source is a database with CDC, enable Delete Enabled with Kafka ID Strategy to propagate deletes.
  7. Monitor Pinecone Quotas – Keep track of your Pinecone plan’s vector count and storage limits to avoid hitting quota errors.

Troubleshooting

Namespace Not Found Error

Problem: Connector fails with a namespace validation error.
Solution:
  • This occurs when Schema Evolution is set to none and the namespace doesn’t exist yet. Switch to basic to allow automatic namespace creation.
  • Alternatively, upsert at least one vector into the namespace manually via the Pinecone API before starting the connector.

Vector Dimension Mismatch

Problem: Upsert fails with a dimension mismatch error.
Solution:
  • Verify that the embedding vectors in your Kafka records have the same dimensionality as your Pinecone index.
  • Check your embedding model configuration — different models produce different dimensions (e.g., OpenAI text-embedding-3-small = 1536, text-embedding-3-large = 3072).
  • Ensure the Vector Field Name points to the correct field in your records.
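
A quick client-side check before producing records to Kafka can catch mismatches early. This sketch assumes you know your index dimension (e.g., 1536); the helper name is hypothetical:

```python
INDEX_DIMENSION = 1536  # must equal the dimension your index was created with

def check_dimension(vector, expected=INDEX_DIMENSION):
    """Fail fast, before the record ever reaches Pinecone, if the
    vector length does not match the index dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"vector has {len(vector)} dimensions, index expects {expected}"
        )
    return vector
```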

Authentication Failures

Problem: “Unauthorized” or “Invalid API key” errors.
Solution:
  • Verify your Pinecone API key is correct and active.
  • Ensure the API key has permissions to write to the target index.
  • Check that the API key matches the correct Pinecone project and environment.

Missing Vectors Error

Problem: Connector fails with an error about missing or null vectors.
Solution:
  • Pinecone requires every record to have a vector. Ensure the Vector Field Name is correctly configured.
  • Verify that your Kafka records contain the vector field and it is not null.
  • Check the vector format — it must be an array of numbers, a list of numbers, or a JSON string like "[0.1, 0.2, 0.3]".

High Latency or Rate Limiting

Problem: Connector is slow or receiving rate limit errors from Pinecone.
Solution:
  • Reduce Batch Size if you’re hitting Pinecone rate limits.
  • Increase Retry Interval to allow more time between retries.
  • Check your Pinecone plan’s rate limits and upgrade if needed.
  • Monitor the Pinecone dashboard for throttling indicators.

Delete Operations Not Working

Problem: Deleted source records are not being removed from Pinecone.
Solution:
  • Ensure Delete Enabled is set to true.
  • Delete operations require Kafka ID Strategy — verify this is configured as your Document ID Strategy.
  • Confirm that tombstone records (null values) are being produced to Kafka by your source connector.

Security Notes

  • API Keys – Stored encrypted. Never expose them in logs or in publicly shared config files.
  • HTTPS – All communication with Pinecone uses HTTPS by default.
  • Metadata – Be mindful of sensitive data stored as Pinecone metadata, as it is accessible via query results.

Next Steps

  1. Create a Pinecone index matching your embedding dimensions
  2. Ensure your source pipeline produces records with embedding vectors
  3. Test the connector with a small Kafka topic first
  4. Monitor upsert throughput and error rates in the Pinecone dashboard
  5. Adjust batch size based on observed performance and rate limits
  6. Set up alerts for connector task failures