S3 Vectors GA: cheaper, faster vector search at scale

AI in Cloud Computing & Data Centers · By 3L3C

Amazon S3 Vectors is GA with 2B vectors per index and ~100ms queries. Learn how it cuts RAG cost/complexity and how to adopt it safely.

Tags: S3 Vectors, Vector Search, RAG, AI Infrastructure, Cloud Storage, AWS, Data Center Optimization

A real number to anchor this: in a little over four months of preview usage, teams created 250,000+ vector indexes, ingested 40+ billion vectors, and ran 1+ billion queries. That’s not “experimentation.” That’s production pressure showing up early.

Amazon S3 Vectors is now generally available with a big jump in scale and performance—up to 2 billion vectors per index and up to 20 trillion vectors per vector bucket—plus latency improvements that make it viable for interactive AI apps. For anyone building retrieval augmented generation (RAG), semantic search, or multi-agent systems, this is one of those infrastructure shifts that changes how you design your stack.

This post is part of our AI in Cloud Computing & Data Centers series, where we look at the unglamorous layer that actually makes AI work: storage, networking, and resource allocation. S3 Vectors matters because vector workloads don’t just challenge models—they challenge data centers: memory, CPU cycles for indexing/search, network egress, and operational overhead. When you can store and query vectors directly in object storage, the cost and complexity profile changes.

Why vector storage is becoming a data center problem

Vector search is where modern AI systems quietly spend money. Not only on compute for embeddings and inference, but on the always-on infrastructure required to keep vector databases responsive.

Here’s the pattern I see most companies run into:

  • They start with a specialized vector database to ship a RAG prototype.
  • The prototype becomes a “real” feature.
  • Data volume grows, indexes get sharded, and operational load rises.
  • Costs become hard to predict because you’re paying for provisioned capacity to stay fast, even when query volume is uneven.

That’s a classic cloud infrastructure optimization challenge: burstiness. Your AI assistant might be quiet at 2 a.m. and slammed at 10 a.m. after a product announcement. Data centers don’t like spiky workloads; they like smooth ones. Serverless services exist to absorb that mismatch.

Amazon’s positioning for S3 Vectors is blunt: reduce total cost of storing and querying vectors by up to 90% compared to specialized vector database solutions. Whether you hit that number depends on your workload shape and feature needs, but the underlying bet is clear—object storage economics + managed indexing/query can win for many RAG and semantic search deployments.

What changed at GA: the scale jump is the headline

The single biggest practical improvement is the per-index ceiling.

  • Preview: 50 million vectors per index
  • GA: 2 billion vectors per index (a 40× increase)

That’s not just a bragging-rights number. It changes architecture:

Sharding goes from “required” to “optional”

When you’re capped at tens of millions per index, sharding becomes inevitable. Then you’re building (and debugging) logic like:

  • routing queries to the right shard
  • querying multiple shards and merging results
  • rebalancing shards as data grows
  • duplicating metadata filters across indexes

With 2 billion vectors in a single index, a lot of teams can consolidate. Less code. Fewer failure modes. Lower tail latency because you’re not fanning out queries.
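
Here's what that consolidation looks like in code. This is a minimal sketch using the boto3 `s3vectors` client; the bucket, index, and shard names are made up, and the request/response field names follow my reading of the API, so check them against the current SDK docs:

```python
import boto3

# Illustrative only: client and operation shapes may differ slightly in your SDK version.
s3v = boto3.client("s3vectors")

# Before consolidation: fan out to every shard index and merge by distance.
def query_sharded(bucket, shard_indexes, query_vec, top_k=10):
    candidates = []
    for index_name in shard_indexes:
        resp = s3v.query_vectors(
            vectorBucketName=bucket,
            indexName=index_name,
            queryVector={"float32": query_vec},
            topK=top_k,
            returnDistance=True,
        )
        candidates.extend(resp["vectors"])
    # Merge step: keep the global top_k across shards (smaller distance = closer).
    return sorted(candidates, key=lambda v: v["distance"])[:top_k]

# After consolidation: one index, one call, no routing or merge logic to maintain.
def query_consolidated(bucket, index_name, query_vec, top_k=10):
    resp = s3v.query_vectors(
        vectorBucketName=bucket,
        indexName=index_name,
        queryVector={"float32": query_vec},
        topK=top_k,
        returnDistance=True,
    )
    return resp["vectors"]
```

The merge helper is exactly where subtle ranking and pagination bugs tend to hide, which is why deleting it matters more than the line count suggests.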

Bigger buckets support the “AI memory” problem

S3 Vectors supports up to 20 trillion vectors per vector bucket. If you’re building multi-agent systems—where agents create and retrieve memories, tool outputs, and traces—your “vector footprint” grows fast. You want a storage layer that scales like storage, not like a cluster you babysit.

This is exactly where AI infrastructure meets data center realities: when state grows without bound, the winning design is usually cheap storage + fast-enough query.

Performance: fast enough for chat, agents, and RAG

S3 Vectors performance improvements target interactive use cases:

  • Infrequent queries: still under 1 second
  • Frequent queries: now around 100 ms or less
  • Results per query: up to 100 (up from 30)

The 100 ms figure matters because it’s the difference between:

  • “This chatbot feels sluggish.”
  • “This chatbot feels instant.”

And the jump to 100 results is a RAG quality knob. Many production RAG systems pull more candidates than they ultimately use so they can:

  • re-rank results,
  • apply metadata/business filters,
  • assemble multi-source context,
  • or support multi-hop retrieval.

More candidates per query means you can be less aggressive with chunking strategies and still assemble strong context—especially for knowledge bases with lots of short, similar passages.
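
As a rough sketch of how that knob gets used in practice (same caveats as above; `rerank_fn` is a stand-in for whatever cross-encoder or business re-ranker you already run, and the `doc_type` field is illustrative):

```python
def retrieve_context(s3v, bucket, index_name, query_vec, rerank_fn, final_k=8):
    # Pull a wide candidate set; GA raised the per-query ceiling to 100 results.
    resp = s3v.query_vectors(
        vectorBucketName=bucket,
        indexName=index_name,
        queryVector={"float32": query_vec},
        topK=100,
        returnMetadata=True,
        returnDistance=True,
    )
    candidates = resp["vectors"]

    # Apply business rules the similarity score knows nothing about.
    candidates = [
        c for c in candidates
        if c.get("metadata", {}).get("doc_type") != "archived"
    ]

    # rerank_fn should return candidates ordered best-first; only the survivors
    # make it into the prompt context.
    return rerank_fn(candidates)[:final_k]
```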

Write throughput: the underappreciated feature

GA also improves write performance, including up to 1,000 PUT transactions per second for streaming single-vector updates.

That’s the difference between “batch nightly updates” and “updates are searchable right after they happen.” If you’re building:

  • a customer support assistant that must reflect policy changes immediately,
  • an internal doc assistant where new pages should show up right away,
  • a security or ops assistant that embeds incident notes in real time,

…then write throughput becomes a product requirement, not a backend detail.
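
A minimal sketch of that streaming path, assuming an `embed_fn` you already have and the same hedged `s3vectors` client shapes as earlier; the `doc_id#chunk_no` key scheme is just one reasonable convention:

```python
def on_document_updated(s3v, embed_fn, bucket, index_name, doc_id, chunks):
    """Re-embed the changed chunks and write each one as soon as it is ready,
    instead of queueing everything for a nightly batch job."""
    for chunk_no, text in enumerate(chunks):
        s3v.put_vectors(
            vectorBucketName=bucket,
            indexName=index_name,
            vectors=[{
                "key": f"{doc_id}#{chunk_no}",   # stable key: re-runs overwrite
                "data": {"float32": embed_fn(text)},
                "metadata": {"doc_id": doc_id, "chunk_no": chunk_no},
            }],
        )
```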

How S3 Vectors fits into AI infrastructure optimization

S3 Vectors is serverless: no cluster sizing, no node types, no replica planning. You pay for storage and API usage.

From an AI-in-cloud perspective, that’s an infrastructure optimization story:

1) Better alignment between cost and demand

Vector workloads are notoriously variable:

  • Embedding ingestion spikes during backfills, migrations, or large document imports.
  • Query spikes track user activity, launches, and seasonal cycles.

A serverless vector storage layer can absorb this without requiring you to provision for peak all year.

2) Less operational drag means faster iteration

Most teams underestimate how much engineering time goes into:

  • index lifecycle management,
  • capacity planning,
  • shard rebalancing,
  • incident response for degraded search latency.

In practice, that time comes out of your AI roadmap. I’d rather see teams spend those cycles on:

  • improving retrieval quality,
  • tightening evaluation,
  • building safer agent toolchains,
  • and instrumenting cost per answer.

3) Cleaner separation of concerns in the stack

A common production architecture ends up looking like this:

  • Object storage for raw documents and artifacts
  • Vector layer for similarity search
  • Search/analytics for text queries, dashboards, filters, and observability

S3 Vectors supports that split directly, especially with its GA integration options.

Where S3 Vectors plugs in: Bedrock Knowledge Bases and OpenSearch

Two integrations moving from preview to GA are particularly relevant for teams building on AWS.

Bedrock Knowledge Bases + S3 Vectors

If you’re using Amazon Bedrock Knowledge Bases for RAG, S3 Vectors can be the vector storage engine behind it. The typical flow is:

  1. Store source content (PDFs, docs, HTML exports) in general-purpose object storage.
  2. Split content into chunks.
  3. Generate embeddings using an embeddings model.
  4. Store vectors + metadata.
  5. Query vectors to retrieve context for generation.

The practical upside is speed to production: you’re using managed ingestion patterns and letting the platform handle the heavy lifting.
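
At query time, retrieval against the knowledge base is then a single call. Here's a sketch using the `bedrock-agent-runtime` Retrieve API; the knowledge base ID is a placeholder:

```python
import boto3

bedrock_rt = boto3.client("bedrock-agent-runtime")

def retrieve_for_rag(kb_id: str, question: str, k: int = 10):
    resp = bedrock_rt.retrieve(
        knowledgeBaseId=kb_id,                      # placeholder ID
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": k}
        },
    )
    # Each result carries the chunk text plus a source location for citations.
    return [
        (r["content"]["text"], r.get("location"))
        for r in resp["retrievalResults"]
    ]
```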

OpenSearch + S3 Vectors

S3 Vectors can also act as a vector storage layer while OpenSearch provides search and analytics features. I like this because it maps to how many teams actually work:

  • Product teams want relevance tuning, filters, dashboards, and analytics.
  • Platform teams want scalable storage and predictable cost.

Using each tool where it’s strongest is usually better than trying to force one system to do everything.

Metadata, filtering, and the “RAG correctness” problem

Vector similarity alone is rarely enough. Real apps need constraints: tenant boundaries, document types, regions, dates, access control labels.

S3 Vectors supports up to 50 metadata keys per vector, with up to 10 marked as non-filterable.

Here’s how I’d use that split:

  • Filterable metadata: small, structured fields you’ll use to narrow results (tenant ID, ACL group, doc type, language, product area, created_at range).
  • Non-filterable metadata: larger context blobs you want to retrieve but don’t need to index for filtering (chunk text, JSON payloads, long summaries).

That design directly improves both cost and relevance. You keep the searchable index lean while still returning rich context to the model.

A useful rule: filterable metadata is for “should this result be allowed?” Non-filterable metadata is for “what does the model need to read?”
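
Here's roughly how that split looks at write time. Field names are illustrative, and the non-filterable keys themselves are declared when the index is created (see the sizing sketch further down):

```python
def put_chunk(s3v, embed_fn, bucket, index_name, chunk):
    s3v.put_vectors(
        vectorBucketName=bucket,
        indexName=index_name,
        vectors=[{
            "key": chunk["id"],
            "data": {"float32": embed_fn(chunk["text"])},
            "metadata": {
                # Filterable: small fields that decide whether a result is allowed.
                "tenant_id": chunk["tenant_id"],
                "doc_type": chunk["doc_type"],
                "language": chunk["language"],
                # Non-filterable (declared at index creation): the rich context
                # the model reads but you never filter on.
                "chunk_text": chunk["text"],
            },
        }],
    )
```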

A practical adoption plan (what I’d do in the next 2 weeks)

If you’re considering S3 Vectors for production RAG or semantic search, a structured rollout beats a big-bang migration.

Step 1: Pick one workload and define success metrics

Choose one of these:

  • internal doc assistant
  • support article retrieval
  • product knowledge base
  • code search for a single repo set

Define metrics you can actually measure (see the measurement sketch after this list):

  • p50/p95 retrieval latency
  • cost per 1,000 queries
  • answer groundedness rate (human-graded or automated checks)
  • top-k hit rate against a labeled test set
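
A small harness along these lines is enough to track p50/p95 latency and top-k hit rate; it assumes a labeled set of (question, relevant chunk keys) pairs and reuses the hedged `s3vectors` request shapes from earlier:

```python
import statistics
import time

def evaluate_retrieval(s3v, embed_fn, bucket, index_name, labeled_queries, k=10):
    """labeled_queries: list of (question, set_of_relevant_chunk_keys)."""
    latencies, hits = [], 0
    for question, relevant_keys in labeled_queries:
        start = time.perf_counter()
        resp = s3v.query_vectors(
            vectorBucketName=bucket,
            indexName=index_name,
            queryVector={"float32": embed_fn(question)},
            topK=k,
        )
        latencies.append(time.perf_counter() - start)
        returned = {v["key"] for v in resp["vectors"]}
        hits += bool(returned & relevant_keys)
    latencies.sort()
    return {
        "p50_ms": 1000 * statistics.median(latencies),
        "p95_ms": 1000 * latencies[int(0.95 * (len(latencies) - 1))],
        "top_k_hit_rate": hits / len(labeled_queries),
    }
```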

Step 2: Model your index sizing early

Decisions to lock in:

  • embedding dimension (must match your embedding model)
  • distance metric: cosine vs euclidean
  • metadata fields and which are filterable

These choices impact cost and quality more than most teams expect.
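
Sketched as a create-index call; parameter names follow my reading of the `s3vectors` CreateIndex API and may differ slightly, and the values are examples, not recommendations:

```python
import boto3

s3v = boto3.client("s3vectors")

s3v.create_index(
    vectorBucketName="kb-vectors",            # placeholder names
    indexName="support-articles",
    dataType="float32",
    dimension=1024,                           # must match your embedding model
    distanceMetric="cosine",                  # or "euclidean"
    metadataConfiguration={
        # Large blobs you only want returned, never filtered on:
        "nonFilterableMetadataKeys": ["chunk_text"],
    },
)
```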

Step 3: Start with “single index, strong metadata boundaries”

Given the 2 billion vectors per index capacity, default to consolidation unless you have a hard boundary requirement.

Then add boundaries via metadata filters (a filtered-query sketch follows this list):

  • tenant_id
  • environment (prod vs staging)
  • corpus_id (docs vs tickets vs runbooks)
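
A tenant-scoped query then looks something like this; the filter operators are based on my reading of the metadata filtering docs, so verify the exact syntax before relying on it:

```python
def query_tenant_scoped(s3v, query_embedding, tenant_id):
    return s3v.query_vectors(
        vectorBucketName="kb-vectors",        # placeholder names
        indexName="support-articles",
        queryVector={"float32": query_embedding},
        topK=20,
        # Boundaries live in the filter rather than in separate per-tenant indexes.
        filter={"$and": [
            {"tenant_id": tenant_id},
            {"environment": "prod"},
            {"corpus_id": "runbooks"},
        ]},
        returnMetadata=True,
    )
```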

Step 4: Plan for updates, not just ingestion

Most RAG systems fail in the messy middle: documents change. If your content changes daily, you need:

  • an update strategy (upsert vs delete+reinsert)
  • idempotent keys
  • monitoring for ingestion lag

The 1,000 PUT/s improvement is a direct enabler here.
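
A sketch of the upsert path, reusing the deterministic `doc_id#chunk_no` keys from the earlier write-throughput sketch so re-runs are idempotent:

```python
def reindex_document(s3v, embed_fn, bucket, index_name, doc_id, chunks, old_count):
    # Overwrite in place: same key, new embedding and metadata.
    for chunk_no, text in enumerate(chunks):
        s3v.put_vectors(
            vectorBucketName=bucket,
            indexName=index_name,
            vectors=[{
                "key": f"{doc_id}#{chunk_no}",
                "data": {"float32": embed_fn(text)},
                "metadata": {"doc_id": doc_id, "chunk_no": chunk_no},
            }],
        )
    # Delete-then-reinsert is only needed for chunks that no longer exist,
    # e.g. when a document shrinks after an edit.
    if old_count > len(chunks):
        s3v.delete_vectors(
            vectorBucketName=bucket,
            indexName=index_name,
            keys=[f"{doc_id}#{i}" for i in range(len(chunks), old_count)],
        )
```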

Step 5: Put network and security controls in place

S3 Vectors supports operational building blocks teams expect in enterprise environments:

  • encryption controls (including KMS options)
  • tagging for cost allocation and access control
  • private connectivity options
  • infrastructure as code support

Treat these as Day 1 requirements if you’re aiming for production.
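
For example, encryption is configured when the vector bucket is created. Treat this as a sketch only: parameter names follow my reading of the CreateVectorBucket API, the key ARN is a placeholder, and for production you would express the same settings in your infrastructure-as-code tool:

```python
import boto3

s3v = boto3.client("s3vectors")

s3v.create_vector_bucket(
    vectorBucketName="kb-vectors",            # placeholder name
    encryptionConfiguration={
        "sseType": "aws:kms",
        "kmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    },
)
```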

Pricing signals to watch (so you don’t get surprised)

S3 Vectors pricing is based on:

  • PUT pricing (logical GB uploaded, including metadata and keys)
  • Storage costs (logical storage across indexes)
  • Query charges (a per-request charge plus a $/TB charge based on index size, excluding non-filterable metadata)

Two practical implications (a back-of-envelope cost sketch follows):

  1. Metadata discipline is cost discipline. Don’t bloat filterable metadata with large strings or unnecessary fields.
  2. Index scale affects query $/TB. As indexes grow beyond 100,000 vectors, the $/TB pricing decreases—so consolidating can help both architecture and unit economics.
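
If you want to sanity-check a workload before committing, a rough model mirroring those three dimensions is enough. Every rate argument below is a placeholder to fill in from the current pricing page; none of them are actual AWS prices:

```python
def monthly_vector_cost(put_gb, stored_gb, queries, index_tb_scanned,
                        put_rate_per_gb, storage_rate_per_gb,
                        query_rate_per_request, query_rate_per_tb):
    """Rough monthly estimate across the three pricing dimensions above.
    All *_rate_* arguments are placeholders, not published prices."""
    ingestion = put_gb * put_rate_per_gb
    storage = stored_gb * storage_rate_per_gb
    query = (queries * query_rate_per_request
             + queries * index_tb_scanned * query_rate_per_tb)
    return ingestion + storage + query
```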

Where this lands in the “AI in Cloud Computing & Data Centers” series

S3 Vectors is a strong example of a broader trend: AI features are pushing cloud platforms to move intelligence closer to foundational services—storage, networking, scheduling—so the data center can allocate resources more efficiently.

When vector search becomes a native capability of object storage, you get a simpler mental model:

  • store everything in one durable place,
  • query similarity without running a separate always-on cluster,
  • scale as the product grows,
  • and keep cost tied to usage.

If your 2026 roadmap includes agents, conversational interfaces, or organization-wide semantic search, the question isn’t whether you’ll manage vectors—it’s whether you want to manage infrastructure to manage vectors.

What would change in your architecture if you could stop sharding indexes and treat vector storage like any other cloud storage primitive?