Amazon S3 Vectors is GA with 2B vectors per index and ~100ms queries. Here’s what it means for RAG scale, cost, and AI infrastructure ops.

Most companies blame their RAG quality problems on the model.
But when I dig into what’s actually happening, the failure is usually infrastructure math: vectors are exploding in volume, teams shard indexes to keep up, query paths get complicated, and latency creeps past what interactive apps can tolerate.
That’s why Amazon S3 Vectors going generally available (GA) matters for the AI in Cloud Computing & Data Centers story. It’s not “another vector store.” It’s a clear signal that cloud storage is being rebuilt around AI workloads—where embeddings are first-class data, not an add-on.
AWS is claiming up to 90% lower total cost versus specialized vector databases for storing and querying vectors, and the GA release comes with hard scale and performance numbers: 2 billion vectors per index, 20 trillion vectors per vector bucket, and query latencies that can land around 100 ms for frequent access patterns. Those are data-center-grade design targets, not hackathon targets.
Why S3 Vectors changes the storage-vs-database conversation
S3 Vectors’ biggest idea is simple: treat vector search as a native capability of object storage.
Traditionally, teams build AI search like this:
- Store raw docs in object storage (often S3).
- Generate embeddings.
- Push embeddings into a separate vector database.
- Keep metadata, permissions, retention, and lifecycle rules consistent across two systems.
That last bullet is where the cost shows up—sometimes in dollars, but more often in engineering time and operational risk.
With S3 Vectors, AWS is betting that a lot of “vector database” value for common RAG and semantic search use cases can be delivered where your data already lives. If you’re already an S3-heavy shop, this matters because:
- Fewer moving parts: fewer clusters, fewer scaling knobs, fewer failure modes.
- Cleaner governance: encryption, tagging, IAM patterns, and lifecycle policies can live closer to the data.
- Better alignment with AI infrastructure optimization: serverless primitives shift the burden of capacity planning away from your team and onto AWS’s fleet management.
In a series focused on AI in cloud computing and data centers, that last point is the headline: the infrastructure layer is adapting so AI workloads don’t require bespoke storage stacks just to function.
The scale jump that removes sharding as a default
During preview, S3 Vectors supported 50 million vectors per index. GA raises that to 2 billion vectors per index—a 40× increase.
That’s not a vanity metric. It changes architecture choices.
At 50 million, many teams end up with:
- Sharding by tenant
- Sharding by time window
- Sharding by document type
- Query federation across shards
Each of those strategies makes retrieval quality harder to reason about. When your data’s split into many indexes, you often either:
- Search fewer shards (faster, but you miss context), or
- Search all shards (better recall, but latency and cost climb)
At 2 billion vectors in one index, you can consolidate far more data into a single retrieval surface area. Less federation logic. Fewer “why didn’t it find that doc?” incidents. And simpler observability.
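To make the federation tax concrete, here's a minimal sketch (Python, with a hypothetical `query_index` helper standing in for whatever client you use) of what fan-out retrieval across shards looks like next to a single consolidated index:

```python
# Sketch only: contrasts fan-out retrieval across sharded indexes with a single
# consolidated index. `query_index` is a hypothetical stand-in for your vector
# store's query call; the federation/merging logic is the point.
import random
from concurrent.futures import ThreadPoolExecutor

def query_index(index_name, query_embedding, top_k):
    # Placeholder: pretend each index returns top_k scored hits.
    return [{"index": index_name, "key": f"doc-{i}", "score": random.random()}
            for i in range(top_k)]

def federated_query(shards, query_embedding, top_k=10):
    # Fan out to every shard, then merge and re-sort by score.
    # Latency is gated by the slowest shard; cost scales with shard count.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda s: query_index(s, query_embedding, top_k), shards)
    merged = [hit for shard_hits in per_shard for hit in shard_hits]
    return sorted(merged, key=lambda h: h["score"], reverse=True)[:top_k]

def consolidated_query(index_name, query_embedding, top_k=10):
    # With a 2-billion-vector index, the same recall target is a single call.
    return query_index(index_name, query_embedding, top_k)

if __name__ == "__main__":
    q = [0.1] * 1024
    print(len(federated_query([f"tenant-{i}" for i in range(8)], q)))
    print(len(consolidated_query("all-tenants", q)))
```

Every shard you remove is merging code, retry logic, and a class of recall bugs you no longer have to own.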
Performance that actually fits interactive AI
The GA release includes performance improvements that target a practical requirement: humans won’t wait.
AWS states:
- Infrequent queries: results in under one second
- Frequent queries: latencies of ~100 ms or less
- Top-k results per query: up to 100 (up from 30)
- Write throughput: up to 1,000 PUT transactions/second for single-vector streaming updates
Here’s how these numbers map to real systems.
What ~100 ms retrieval enables
When retrieval takes 700–1200 ms, teams compensate by:
- Caching aggressively (often incorrectly)
- Returning fewer chunks (hurting accuracy)
- Precomputing answers (limiting freshness)
At ~100 ms, you can support:
- Conversational AI where retrieval happens every turn
- Multi-agent workflows where several agents query in parallel
- Tool-using assistants that “think → retrieve → act” repeatedly
A useful rule: once retrieval is consistently below ~200 ms, your LLM inference time becomes the dominant factor again. That’s a good place to be because LLM inference is where teams are already investing (model choice, prompt design, batching, caching).
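Here's a tiny sketch of that budget in a chat loop. `retrieve` and `generate` are placeholders rather than any specific SDK, and the sleep times simply stand in for the latencies discussed above:

```python
# Sketch: a per-turn latency budget for a retrieval-augmented chat loop.
# `retrieve` and `generate` are hypothetical stand-ins for your vector query
# and LLM call; the point is where the milliseconds go once retrieval is fast.
import time

def retrieve(query):
    time.sleep(0.1)          # assume ~100 ms vector retrieval
    return ["chunk-1", "chunk-2"]

def generate(query, context):
    time.sleep(1.2)          # assume ~1.2 s LLM inference
    return f"answer using {len(context)} chunks"

def answer_turn(query):
    t0 = time.perf_counter()
    context = retrieve(query)
    t_retrieve = time.perf_counter() - t0
    answer = generate(query, context)
    t_total = time.perf_counter() - t0
    # With retrieval under ~200 ms, inference dominates the turn,
    # so optimization effort shifts back to the model side.
    print(f"retrieval: {t_retrieve * 1000:.0f} ms of {t_total * 1000:.0f} ms total")
    return answer

if __name__ == "__main__":
    answer_turn("How do I rotate the KMS key on a vector bucket?")
```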
Why top-k=100 is a quiet but important change
Many RAG implementations fail because they retrieve too little context, not too much.
Raising the limit to 100 results is valuable when you use:
- Rerankers (retrieve broad, rerank precisely)
- Hybrid search patterns (vector similarity + metadata filters)
- Multi-step retrieval (first broad, then narrow)
In practice, this gives you room to design retrieval that prioritizes recall first, then uses metadata and reranking to improve precision.
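A minimal sketch of that recall-first, precision-second pattern, with `vector_search` and `rerank_score` as hypothetical placeholders for your retrieval call and reranking model:

```python
# Sketch: recall-first retrieval (top-k = 100) followed by precision-oriented
# reranking down to a handful of chunks. `vector_search` and `rerank_score`
# are hypothetical placeholders, not a specific SDK.
def vector_search(query_embedding, top_k=100):
    # Broad candidate set: with the GA limit, one query can return up to 100 hits.
    return [{"key": f"chunk-{i}", "text": f"text {i}", "score": 1.0 - i / 100}
            for i in range(top_k)]

def rerank_score(query, text):
    # Placeholder for a cross-encoder or hosted reranker; here, a trivial overlap heuristic.
    return float(len(set(query.lower().split()) & set(text.lower().split())))

def retrieve_for_prompt(query, query_embedding, keep=8):
    candidates = vector_search(query_embedding, top_k=100)   # recall first
    reranked = sorted(candidates,
                      key=lambda c: rerank_score(query, c["text"]),
                      reverse=True)
    return reranked[:keep]                                    # precision second

if __name__ == "__main__":
    top = retrieve_for_prompt("vector bucket encryption", [0.0] * 1024)
    print([c["key"] for c in top])
```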
Write throughput: freshness becomes realistic
The ability to handle up to 1,000 PUT/s for streaming single-vector updates is a big deal for “freshness-sensitive” use cases:
- Support tickets that must be searchable minutes after creation
- Security findings that need immediate triage context
- Product catalogs with frequent attribute changes
If your system requires “write now, search now,” write throughput is often the limiting factor. This GA improvement makes S3 Vectors more viable for near-real-time knowledge.
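As a rough illustration, a single-vector streaming write might look like the sketch below. It assumes the boto3 `s3vectors` client and the `put_vectors` request shape shown in AWS's launch material; treat the parameter names as something to verify against the current SDK docs, and the bucket/index names as placeholders.

```python
# Sketch: streaming one freshly-embedded vector as soon as the source record
# changes. Assumes the boto3 "s3vectors" client and the put_vectors request
# shape from the launch docs; verify names and parameters against the current
# SDK before relying on this.
import boto3

s3vectors = boto3.client("s3vectors")

def index_ticket(ticket_id, embedding, tenant):
    # One PUT per updated record; GA supports up to ~1,000 such PUTs per second.
    s3vectors.put_vectors(
        vectorBucketName="support-knowledge",        # assumed bucket name
        indexName="tickets",                         # assumed index name
        vectors=[{
            "key": ticket_id,
            "data": {"float32": embedding},
            "metadata": {"tenant": tenant, "source": "ticketing"},
        }],
    )
```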
Building blocks: Bedrock Knowledge Bases and OpenSearch integration
The most practical part of the GA announcement is that two preview integrations are now generally available:
- S3 Vectors as vector storage for Amazon Bedrock Knowledge Bases
- S3 Vectors integration with Amazon OpenSearch
This is where the “AI in cloud infrastructure” narrative gets concrete: teams want managed paths from documents → chunks → embeddings → retrieval, and they want it to sit inside existing governance boundaries.
Bedrock Knowledge Bases + S3 Vectors: the clean RAG pipeline
A common RAG workflow looks like this:
- Put documents in a general-purpose S3 bucket
- Chunk them
- Generate embeddings
- Store vectors + metadata
- Query using a query embedding
With Knowledge Bases, AWS handles chunking and embedding generation (for example, using Titan Text Embeddings), then stores vectors and metadata in S3 Vectors.
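On the query side, a retrieval call against a Knowledge Base backed by S3 Vectors might look roughly like this sketch, assuming the `bedrock-agent-runtime` Retrieve API and a placeholder knowledge base ID:

```python
# Sketch: querying a Bedrock Knowledge Base that uses S3 Vectors as its vector
# store. Assumes the bedrock-agent-runtime Retrieve API shape; the knowledge
# base ID is a placeholder.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve(
    knowledgeBaseId="KB_ID_PLACEHOLDER",             # your knowledge base ID
    retrievalQuery={"text": "How do we rotate encryption keys?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {"numberOfResults": 10}
    },
)
for result in response["retrievalResults"]:
    print(result["content"]["text"][:120], result.get("score"))
```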
A detail I like here: S3 Vectors supports up to 50 metadata keys per vector, with up to 10 marked as non-filterable. That enables a useful split:
- Filterable metadata: tenant ID, region, doc type, access group, effective date
- Non-filterable metadata: larger text payloads (like the original chunk) that you want returned but not indexed for filtering
That design pattern is underrated. It keeps your searchable index slimmer while still returning rich context to the LLM.
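A sketch of that split in practice, assuming the boto3 `s3vectors` `create_index` and `query_vectors` shapes from AWS's launch material (field names are worth double-checking against the current SDK):

```python
# Sketch: filterable metadata (small, structured keys) vs. a non-filterable key
# carrying the original chunk text. Assumes the boto3 "s3vectors" create_index
# and query_vectors shapes from the launch docs; verify field names before use.
import boto3

s3vectors = boto3.client("s3vectors")

# Mark the large text payload as non-filterable so it is returned with results
# but kept out of the metadata the index has to evaluate for filtering.
s3vectors.create_index(
    vectorBucketName="corp-knowledge",               # assumed bucket name
    indexName="policies",
    dataType="float32",
    dimension=1024,
    distanceMetric="cosine",
    metadataConfiguration={"nonFilterableMetadataKeys": ["source_text"]},
)

# Filter on the small structured keys; get the chunk text back for the prompt.
results = s3vectors.query_vectors(
    vectorBucketName="corp-knowledge",
    indexName="policies",
    queryVector={"float32": [0.12] * 1024},          # placeholder embedding
    topK=25,
    filter={"tenant_id": "acme", "doc_type": "policy"},
    returnMetadata=True,
    returnDistance=True,
)
```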
OpenSearch + S3 Vectors: analytics and search on top
If you need dashboards, aggregation, keyword search, or operational analytics, OpenSearch is often part of the stack.
Using OpenSearch for search/analytics while S3 Vectors holds the vectors can be appealing when:
- You want one storage plane for embeddings
- You still need classic search features (filters, facets, aggregations)
- You’re building experiences where vector search is only one part of the query
The architecture choice is less about “which one is better” and more about keeping each component doing what it’s good at.
What this means for cloud and data center operations
S3 Vectors is serverless, and that’s the operational headline: no clusters to size, patch, or rebalance.
For AI infrastructure teams, this shifts work from:
- Capacity planning
- Index sharding strategies
- Cluster upgrades
- Hot shard mitigation
…to:
- Cost governance (tagging, budgets, chargeback)
- Security posture (encryption keys, access boundaries)
- Data lifecycle (retention, deletion, re-embedding)
That’s a healthier trade for most orgs, especially heading into 2026, when AI workloads keep multiplying.
Security and governance features that reduce friction
GA includes operational capabilities you’ll care about in production:
- Encryption configuration at creation time
- Option to override bucket encryption at the index level with a custom KMS key
- Resource tagging for cost allocation and access control
- AWS PrivateLink for private connectivity
- CloudFormation support for deployment consistency
This is the unglamorous stuff that determines whether an AI pilot becomes a real service. If you can’t deploy it repeatedly, lock it down, and allocate spend, it won’t survive contact with finance and security.
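For illustration only, here's what encryption set at creation time, with a customer KMS key, could look like through the SDK. Every field name below is an assumption inferred from the GA feature list rather than a verified signature, so check the s3vectors API reference before using it.

```python
# Sketch only: encryption configured at creation time, with an assumed
# per-index override using a different KMS key. All parameter shapes here are
# assumptions based on the GA feature list, not verified signatures.
import boto3

s3vectors = boto3.client("s3vectors")

s3vectors.create_vector_bucket(
    vectorBucketName="corp-knowledge",
    encryptionConfiguration={                        # assumed parameter shape
        "sseType": "aws:kms",
        "kmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
    },
)

s3vectors.create_index(
    vectorBucketName="corp-knowledge",
    indexName="hr-policies",
    dataType="float32",
    dimension=1024,
    distanceMetric="cosine",
    # Assumed: per-index override of the bucket's encryption with a custom key.
    encryptionConfiguration={
        "sseType": "aws:kms",
        "kmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/HR-ONLY",
    },
)
```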
Practical guidance: when S3 Vectors is the right fit (and when it’s not)
S3 Vectors is a strong default when your goal is cost-effective vector storage and retrieval at massive scale with minimal ops.
Great fits
- RAG over large document collections (policies, manuals, tickets, research)
- Semantic search inside internal tools
- Agentic systems that make frequent, parallel retrieval calls
- Multi-tenant platforms where sharding complexity is killing you
Situations to evaluate carefully
- Highly specialized vector DB features you depend on (custom indexing algorithms, advanced reranking pipelines built-in, niche distance functions)
- Cross-cloud portability requirements where S3-native features may increase lock-in
- Ultra-low latency constraints where every millisecond matters and you’re willing to run specialized infra to get it
The honest stance: for a large slice of enterprise RAG, teams don’t need exotic vector database features—they need predictable performance, sane cost, and less operational burden. S3 Vectors is aimed squarely at that reality.
A migration approach that keeps risk low
If you’re considering S3 Vectors, don’t start with a “big bang” replacement. Start like this:
- Pick one corpus (e.g., internal runbooks or a single product line).
- Define success metrics: p95 retrieval latency, answer accuracy (human eval), cost per 1,000 queries.
- Run parallel retrieval: current vector DB vs S3 Vectors.
- Test governance: encryption keys, IAM boundaries, PrivateLink, audit requirements.
- Scale the index deliberately to validate pricing behavior at your expected vector count.
This avoids the most common migration mistake: declaring victory after a small demo index performs well.
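A small harness for the parallel-retrieval step might look like this sketch, where `query_current` and `query_s3_vectors` are hypothetical adapters for your existing vector DB and an S3 Vectors index:

```python
# Sketch: a parallel-evaluation harness for the migration steps above.
# `query_fn` adapters are hypothetical; the p95 latency and hit-rate logic
# is the point, not any particular client.
import statistics
import time

def measure(query_fn, labeled_queries, top_k=10):
    latencies, hits = [], []
    for q in labeled_queries:
        t0 = time.perf_counter()
        results = query_fn(q["embedding"], top_k)
        latencies.append((time.perf_counter() - t0) * 1000)
        # Recall proxy: did the known-relevant document show up at all?
        hits.append(q["expected_key"] in {r["key"] for r in results})
    p95 = statistics.quantiles(latencies, n=20)[18]
    return {"p95_ms": round(p95, 1), "hit_rate": sum(hits) / len(hits)}

def compare(query_current, query_s3_vectors, labeled_queries):
    # Run the same labeled query set against both systems and compare.
    return {
        "current_db": measure(query_current, labeled_queries),
        "s3_vectors": measure(query_s3_vectors, labeled_queries),
    }
```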
The bigger trend: AI workloads are reshaping storage primitives
AWS shared adoption stats from preview that explain why this moved fast: in just over four months, customers created 250,000+ vector indexes, ingested 40+ billion vectors, and ran 1+ billion queries (as of Nov 28).
That level of usage pressure forces cloud platforms to evolve. Data centers are being tuned for embedding-heavy workloads—high write rates during ingestion, bursty query patterns during chat interactions, and large logical datasets that don’t fit comfortably in traditional database shapes.
S3 Vectors is one of the clearest examples of the convergence happening across cloud computing and AI infrastructure: storage isn’t passive anymore. It’s becoming AI-aware.
If you’re building RAG, semantic search, or agentic applications for 2026 roadmaps, the question worth debating internally is straightforward: Do you want your embeddings to live in yet another specialized system, or in the same storage plane where your data lifecycle already lives?