AWS DataSync Enhanced mode now speeds on‑prem NFS/SMB transfers to S3. Learn how it helps AI datasets, data lakes, and hybrid migrations.

Move On‑Prem Files to S3 Faster with DataSync
Most hybrid cloud projects don’t fail because the cloud part is hard. They stall because moving the data is slow, brittle, and full of hidden limits—especially when you’re dealing with millions (or billions) of small files sitting on NFS or SMB shares.
AWS just removed one of the most common blockers here: AWS DataSync Enhanced mode now supports transfers between on‑premises NFS/SMB file servers and Amazon S3, with higher performance and scalability than Basic mode and without file count limitations. That sounds like a feature note. Practically, it changes what’s feasible for AI training pipelines, data lake synchronization, and large-scale modernization projects.
This post is part of our “AI in Cloud Computing & Data Centers” series, so I’ll frame this update the way infrastructure teams actually experience it: not as “faster copies,” but as less waiting, fewer transfer failures, and more predictable data movement—the stuff that makes AI workloads and hybrid operations run on schedule.
Why “file transfer” is still the bottleneck in AI-ready hybrid clouds
Answer first: If you can’t move data quickly and reliably, AI and analytics end up scheduled around transfer windows instead of business needs.
Teams often plan GPU capacity, model training cycles, and data lake storage—but then discover their on‑prem file servers are the real pace-setter. A few common patterns show up:
- Millions of small files (images, logs, medical slices, CAD assets) overload metadata operations long before you hit raw throughput limits.
- Nightly sync windows become sacred, and everything else queues behind them.
- Homegrown scripts (rsync/robocopy + cron) work until they don’t—then you spend days figuring out what didn’t copy.
- Unclear progress and missing metrics mean you can’t forecast completion time, which wrecks downstream planning.
In data centers, this creates a nasty spiral: to compensate, teams throw more compute at the problem (more agents, more retries, more parallel jobs), which increases contention and energy use. The result is the opposite of “AI-optimized infrastructure.” It’s brute force.
What DataSync Enhanced mode changes for on‑prem NFS/SMB to S3
Answer first: Enhanced mode brings parallelism and scalability to on‑prem file transfers, removing file-count ceilings and improving monitoring.
AWS DataSync has long been a go-to for secure data movement, but the pain point for many enterprise file environments has been scale: not terabytes, but file counts and metadata overhead.
With this December 2025 update, Enhanced mode now supports transfers between on‑premises NFS or SMB file servers and Amazon S3. The notable improvements (based on the AWS announcement) are:
Parallel processing designed for “too many files” problems
Enhanced mode uses parallel processing to push more work through at once. In real environments, this matters because the transfer isn’t just copying bytes; it’s constantly listing, checking, and verifying file system objects.
When you’re migrating a research share with 200 million small files, the system that handles metadata efficiently wins. Parallelism is the difference between “we can start training next week” and “we’re still enumerating files.”
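To make this concrete, here is a minimal boto3 sketch of wiring an on‑prem NFS export to an S3 prefix and creating an Enhanced mode task. The hostnames, paths, and ARNs are placeholders for your environment, and TaskMode="ENHANCED" is the create_task parameter as I understand the current DataSync API, so verify against the SDK docs before building on it.

```python
# Minimal sketch: an on-prem NFS source, an S3 destination, and an Enhanced
# mode DataSync task via boto3. All hostnames, paths, and ARNs are placeholders.
import boto3

datasync = boto3.client("datasync")

# On-prem NFS export, reached through a DataSync agent you have already deployed.
nfs_location = datasync.create_location_nfs(
    ServerHostname="filer01.example.internal",        # placeholder
    Subdirectory="/exports/research",                  # placeholder
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0abc"]},
)

# S3 destination prefix, with an IAM role DataSync can assume to write objects.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-training-data",  # placeholder
    Subdirectory="/raw/research",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3-access"},
)

# Enhanced mode is requested per task via TaskMode at creation time.
task = datasync.create_task(
    SourceLocationArn=nfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Name="research-share-to-s3",
    TaskMode="ENHANCED",
)
print(task["TaskArn"])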
No file count limitations (the practical headline)
AWS calls out that Enhanced mode removes file count limitations. This is the part that infrastructure leads care about because it changes planning:
- You don’t need to split datasets into awkward directory shards just to fit a tool’s constraints.
- You can treat S3 as a realistic destination for massive file archives—not only big media files.
- You reduce operational risk because you’re not juggling dozens of partial jobs.
Better transfer metrics for monitoring and management
Enhanced mode also provides detailed transfer metrics. That’s not just nice-to-have—it’s how you operate hybrid data movement like a production service.
Good metrics let you answer questions that always come up mid-migration:
- “Are we bottlenecked on the source filer, the network, or the destination?”
- “If we keep this pace, when will it finish?”
- “Which directories are slowest, and why?”
In the “AI in cloud computing & data centers” context, this is a foundational capability: you can’t optimize what you can’t see.
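To make “you can’t optimize what you can’t see” concrete, here is a small sketch that polls a running task execution and turns the raw counters into the progress numbers you would put on a dashboard. The execution ARN is a placeholder and the field names follow DescribeTaskExecution as I know it, so treat this as a starting point rather than a drop-in monitor.

```python
# Minimal sketch: snapshot progress of a running DataSync task execution.
import boto3

datasync = boto3.client("datasync")

def progress_snapshot(task_execution_arn: str) -> dict:
    ex = datasync.describe_task_execution(TaskExecutionArn=task_execution_arn)
    est_bytes = ex.get("EstimatedBytesToTransfer", 0)
    est_files = ex.get("EstimatedFilesToTransfer", 0)
    done_bytes = ex.get("BytesTransferred", 0)
    done_files = ex.get("FilesTransferred", 0)
    return {
        "status": ex["Status"],  # e.g. LAUNCHING, PREPARING, TRANSFERRING, SUCCESS
        "files": f"{done_files}/{est_files or '?'}",
        "gib_done": round(done_bytes / 2**30, 1),
        "pct_bytes": round(100 * done_bytes / est_bytes, 1) if est_bytes else None,
    }

print(progress_snapshot(
    "arn:aws:datasync:us-east-1:111122223333:task/task-0abc/execution/exec-0def"  # placeholder
))
```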
Where this matters most: AI training, data lakes, and modernization
Answer first: Enhanced mode makes it easier to feed cloud AI and analytics systems from on‑prem file servers without redesigning everything first.
AWS highlights three use cases that map cleanly to what teams are building right now.
1) Faster AI dataset staging (especially for generative AI)
Generative AI teams regularly underestimate how long dataset staging will take. The training cluster is ready; the data isn’t.
Enhanced mode helps when:
- Your training corpus lives on departmental SMB shares
- Your feature store sources are on NFS
- You’re moving image/audio/doc datasets with huge file counts
A practical workflow I’ve seen work well (sketched in code after this list):
- Sync raw datasets from on‑prem NFS/SMB to S3
- Run validation and profiling jobs in the cloud (checksums, schema, duplicates)
- Convert into training-friendly layouts (for example: sharded archives or columnar formats) after landing in S3
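In code, that workflow is three stages chained on the task’s success. The sketch below assumes an existing Enhanced mode task; run_validation_job and convert_to_training_layout are hypothetical hooks for whatever you actually use (Batch, Glue, Step Functions, a notebook), and the ARNs and bucket paths are placeholders.

```python
# Minimal sketch of the stage-then-process pattern: sync raw data to S3,
# then run validation/conversion only after the transfer reports SUCCESS.
import time
import boto3

datasync = boto3.client("datasync")
TASK_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0abc"  # placeholder

def run_validation_job(prefix: str) -> None: ...          # hypothetical: checksums, schema, duplicates
def convert_to_training_layout(prefix: str) -> None: ...  # hypothetical: shards / columnar formats

execution = datasync.start_task_execution(TaskArn=TASK_ARN)
arn = execution["TaskExecutionArn"]

while True:
    status = datasync.describe_task_execution(TaskExecutionArn=arn)["Status"]
    if status == "SUCCESS":
        break
    if status == "ERROR":
        raise RuntimeError(f"Transfer failed: {arn}")
    time.sleep(60)

run_validation_job("s3://example-training-data/raw/research/")        # placeholder bucket
convert_to_training_layout("s3://example-training-data/raw/research/")
```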
The stance I’ll take: if you’re serious about AI velocity, you should treat data transfer like a pipeline, not a one-off task.
2) Data lake synchronization for analytics pipelines
A lot of organizations have a “half-built” data lake: cloud analytics tools exist, but the data is still born on-prem.
Enhanced mode supports a cleaner cadence:
- On-prem file shares keep doing what they’re good at (local workflows, legacy apps)
- S3 becomes the analytics landing zone
- You build cloud-based processing pipelines that assume the data shows up reliably
That shift matters because it reduces the number of “special cases” your pipeline needs. Analysts and ML engineers don’t want to care whether a dataset started life in a data center or in the cloud.
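One way to encode that cadence is to let DataSync own the schedule instead of cron plus scripts. A minimal sketch, assuming the Schedule parameter on create_task and placeholder location ARNs (see the earlier setup sketch):

```python
# Minimal sketch: a nightly delta sync from an on-prem share into the S3
# landing zone, scheduled by DataSync itself.
import boto3

datasync = boto3.client("datasync")

datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-src0abc",   # placeholder
    DestinationLocationArn="arn:aws:datasync:us-east-1:111122223333:location/loc-dst0def",  # placeholder
    Name="nightly-landing-zone-sync",
    TaskMode="ENHANCED",
    # Transfer only what changed since the last run, every night at 01:00 UTC.
    Options={"TransferMode": "CHANGED"},
    Schedule={"ScheduleExpression": "cron(0 1 * * ? *)"},
)
```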
3) Large-scale migrations and archival modernization
Most “lift-and-shift” modernization efforts have a messy middle period where you’re living in both places.
Enhanced mode helps you move big, messy file trees into S3 so you can:
- Retire aging storage arrays on a schedule
- Reduce dependency on legacy file server scaling
- Create a more standard foundation for governance and lifecycle policies
And yes—this can also support sustainability goals. Efficient transfers reduce repeated retries and overprovisioned “migration armies” of servers running 24/7.
How to run an on‑prem to S3 transfer like an infrastructure team (not a hero project)
Answer first: Treat DataSync as a managed data movement layer with measurable SLAs, not a script you run once and hope for the best.
Here’s a pragmatic checklist that keeps migrations predictable.
Define success metrics before you start
Pick metrics that an ops team can stand behind:
- Time to first usable dataset (not just “transfer finished”)
- Steady-state sync duration (e.g., nightly delta completes in < 2 hours)
- Error rate and retry volume
- Source impact (CPU/IO utilization on the file server during transfer windows)
Enhanced mode’s detailed transfer metrics are what make this measurable instead of guesswork.
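Most of those numbers can be pulled straight from the task’s execution history. A rough sketch, assuming the DescribeTaskExecution result shape I’m familiar with (durations reported in milliseconds) and a placeholder task ARN:

```python
# Minimal sketch: summarize recent executions of one task to track
# steady-state sync duration and error rate over time.
import boto3

datasync = boto3.client("datasync")
TASK_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0abc"  # placeholder

executions = datasync.list_task_executions(TaskArn=TASK_ARN, MaxResults=20)["TaskExecutions"]
durations, errors = [], 0

for item in executions:
    detail = datasync.describe_task_execution(TaskExecutionArn=item["TaskExecutionArn"])
    if detail["Status"] == "ERROR":
        errors += 1
    result = detail.get("Result", {})
    if "TotalDuration" in result:              # milliseconds
        durations.append(result["TotalDuration"] / 1000 / 60)

if durations:
    print(f"avg duration: {sum(durations) / len(durations):.0f} min over {len(durations)} runs")
print(f"errors in last {len(executions)} runs: {errors}")
```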
Decide whether you’re staging for AI or migrating for retirement
These are different goals:
- AI staging: you may accept partial availability early (get 80% of the data there fast), then improve completeness.
- Retirement migration: you need completeness, repeatability, and auditability.
If you don’t clarify this, stakeholders will argue about progress for weeks.
Design for hybrid operations, not a “big bang”
Most teams will run in hybrid for months. Plan for it:
- Keep a clear source of truth during the transition
- Use regular sync cycles to reduce final cutover time
- Document how applications discover the data (paths, prefixes, permissions model)
This is where I see AI-enhanced operations starting to show up: teams build automation that watches transfer metrics, predicts completion times, and schedules downstream jobs accordingly.
The AI angle: what this enables for intelligent workload management
Answer first: Better data movement improves AI workload scheduling, cost control, and even energy efficiency because you can stop overprovisioning “just in case.”
The connection to AI in cloud computing isn’t that DataSync is “AI-powered.” It’s that AI operations depend on consistent, observable data logistics.
When transfers are unpredictable, teams do three inefficient things:
- Overbook GPU clusters (idle time waiting on data)
- Run transfers longer than needed (extra retries, re-scans, re-copies)
- Delay automation because they don’t trust the pipeline
With Enhanced mode supporting on‑prem file servers and offering stronger scalability plus detailed metrics, you can build smarter control loops:
- Schedule training jobs when the dataset landing completes (not at 2 a.m. because “that’s the window”)
- Auto-throttle transfers if the on‑prem filer is under pressure
- Predict completion time and pre-warm downstream compute only when needed
That’s intelligent resource allocation in practice: fewer idle cycles, fewer fire drills, and more predictable hybrid cloud operations.
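A control loop like that doesn’t need to be fancy. Here’s a rough sketch that estimates completion from the observed transfer rate and pre-warms downstream compute only once the ETA drops under a threshold; warm_up_training_cluster is a hypothetical hook for whatever provisions your compute, and the execution ARN is a placeholder.

```python
# Minimal sketch: predict when a transfer will finish from its observed rate,
# and pre-warm downstream compute only when completion is close.
import time
import boto3

datasync = boto3.client("datasync")
EXECUTION_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-0abc/execution/exec-0def"  # placeholder
PREWARM_THRESHOLD_MIN = 30

def warm_up_training_cluster() -> None: ...  # hypothetical: scale up GPUs, start the pipeline, etc.

prewarmed = False
last_bytes, last_time = 0, time.time()

while True:
    ex = datasync.describe_task_execution(TaskExecutionArn=EXECUTION_ARN)
    if ex["Status"] in ("SUCCESS", "ERROR"):
        break
    now, done = time.time(), ex.get("BytesTransferred", 0)
    rate = (done - last_bytes) / max(now - last_time, 1)   # bytes/sec since last poll
    remaining = ex.get("EstimatedBytesToTransfer", 0) - done
    if rate > 0 and remaining > 0:
        eta_min = remaining / rate / 60
        if eta_min < PREWARM_THRESHOLD_MIN and not prewarmed:
            warm_up_training_cluster()
            prewarmed = True
    last_bytes, last_time = done, now
    time.sleep(300)
```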
What to do next (if you’re planning a Q1 2026 migration or AI push)
Answer first: Start with one high-value dataset, run Enhanced mode transfers to S3, and instrument the workflow so you can forecast timing and cost.
A simple plan that works for most teams:
- Pick a “painful” share (high file count, frequent sync needs, or blocking AI work)
- Transfer it to S3 using DataSync Enhanced mode (a scoped pilot sketch follows this list)
- Measure: throughput, file enumeration time, error rates, source impact
- Operationalize: define a schedule, alerts, and ownership
- Then scale out to additional shares once the pattern is repeatable
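For the pilot itself, you don’t have to sync the whole filer to learn something useful. A hedged sketch: scope a single run to the painful subtree with an include filter, then capture the numbers you’ll report back. The task ARN and path pattern are placeholders.

```python
# Minimal sketch: run a scoped pilot against one directory of a larger share.
import boto3

datasync = boto3.client("datasync")

execution = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-0abc",  # placeholder
    # Limit this run to the subtree that is blocking AI or sync work today.
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/projects/genomics*"}],  # placeholder path
)
print("pilot execution:", execution["TaskExecutionArn"])
# Measure with the monitoring/summary sketches above: throughput, enumeration
# time, error count, and source-side load during the window.
```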
If you’re building AI capabilities on top of hybrid infrastructure, this update is a quiet win: it makes the boring part—data logistics—less of a constraint. And that’s usually the constraint.
What would change for your AI roadmap if dataset staging stopped being the long pole in the tent?