Accurately tracking data transfer progress is a critical task for any Linux engineer or DevOps architect. It provides visibility into expected completion times across diverse infrastructure.
The ubiquitous rsync tool synchronizes data between systems and underpins replication, backups and processing pipelines. However, its progress output takes some experience to interpret correctly.
In this comprehensive guide, you will learn:
- rsync internals for calculating progress percentages
- Usage patterns that stall progress
- Techniques to integrate rsync with system dashboards
- Compatibility for progress options across versions
- Glob expressions for flexible progress tracking
- Common troubleshooting fixes for stuck transfers
I will cover key methods I employ for production pipelines moving terabytes of data.
Inner Workings of rsync Progress Calculation
To understand how to best measure progress, you need to know what rsync is doing under the hood. There are two core algorithms rsync utilizes to synchronize data.
A. rsync delta transfer algorithm
By default, for transfers between hosts, rsync sends only the differences between the old and new versions of a file, which it finds using a rolling checksum. Before any data moves, rsync must build a file list and compare it against the destination to decide which files need updating.
Progress flows through these phases:
- Scanning phase – Build the file list and compare source and destination to identify files that differ
- Delta phase – Transfer the changed portions of each file
- Verification phase – Confirm each transferred file with a whole-file checksum
The scanning phase can take a long time on large directory trees, which makes early progress figures unreliable. Progress cannot reflect the actual data transferred until rsync reaches the delta phase.
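In rsync 3.x, the counter at the end of each progress line hints at which phase is running: rsync prints `ir-chk=` while incremental recursion is still building the file list and switches to `to-chk=` once the full list is known. The sketch below classifies a captured progress line on that basis. The function name `scan_state` and the sample line are illustrative assumptions, not rsync features.

```shell
# Classify one line of rsync progress output.
# Prints "scanning" while the file list is still being built (ir-chk),
# "transferring" once the full list is known (to-chk), "unknown" otherwise.
scan_state() {
    case "$1" in
        *ir-chk=*) echo "scanning" ;;
        *to-chk=*) echo "transferring" ;;
        *)         echo "unknown" ;;
    esac
}

# Illustrative line in the shape rsync prints with --info=progress2
scan_state "  1,234,567  12%   10.50MB/s    0:00:03 (xfr#4, ir-chk=1021/1500)"
```

Passing `--no-inc-recursive` makes rsync build the whole list up front, so the percentage is meaningful from the first line, at the cost of a longer silent start.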
B. Whole file rsync algorithm
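An alternative mode simply copies whole files without calculating deltas, and it is already the default when both source and destination are local paths. Progress tends to be steadier because rsync skips the per-file block checksum comparison, although it still builds the file list first.
To enable whole-file copying, use rsync --whole-file or -W.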
Here is a comparison showing the pros and cons of each algorithm method:
Aspect | Delta Transfer | Whole File |
---|---|---|
Speed | Very fast after initial scan | Slower end-to-end |
Network usage | Very low, only diffs sent | Large, entire files sent |
Progress calculation | Inconsistent early with long scan | Smooth from start to end |
Use cases | Frequent rsync runs with minimal changes | Initial seeding of data |
Now when you see inconsistent early progress, you can identify it is likely still scanning for changes before transferring.
Integration with Centralized Dashboards
To provide wider visibility into key transfer jobs across teams, I integrate rsync progress into centralized dashboards. This guides operational decisions to throttle activities based on observed transfer rates.
Popular open source options include:
- cAdvisor – Container resource usage collector
- Prometheus – Time-series database for aggregating metrics
- Grafana – Analytics and dashboard visualization
For example, rsync stats can be fed into Prometheus and the progress graphed over time in a Grafana dashboard.
Dashboards provide real-time progress visibility for all environments, avoiding the need to connect to individual jump boxes.
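As a minimal sketch of the Prometheus side, the function below writes a single gauge in the Prometheus text exposition format to a `.prom` file, which node_exporter's textfile collector can read from its configured directory. The metric name `rsync_transfer_progress_percent`, the `job` label, and the function name are assumptions for illustration and are not part of rsync or Prometheus.

```shell
# Write an rsync progress gauge for node_exporter's textfile collector.
# $1 = output directory (assumed to be the collector's textfile directory)
# $2 = job label, $3 = percent complete (0-100)
write_progress_metric() {
    out_dir=$1; job=$2; percent=$3
    tmp_file="$out_dir/rsync_$job.prom.$$"
    {
        echo "# HELP rsync_transfer_progress_percent Overall rsync progress."
        echo "# TYPE rsync_transfer_progress_percent gauge"
        echo "rsync_transfer_progress_percent{job=\"$job\"} $percent"
    } > "$tmp_file"
    # Rename into place so the collector never reads a half-written file
    mv "$tmp_file" "$out_dir/rsync_$job.prom"
}

# Demo against a throwaway directory
metrics_dir=$(mktemp -d)
write_progress_metric "$metrics_dir" nightly_backup 42
cat "$metrics_dir/rsync_nightly_backup.prom"
```

Grafana can then chart the gauge per `job` label once Prometheus scrapes the node.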
Tracking Progress for Sets of Files
When transferring groups of files, glob expressions provide a flexible mechanism to segment progress reporting.
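For instance, use brace expansion to transfer specific subdirectories in one run (square brackets would match single characters, not directory names):
rsync -a --progress /source/{dir1,dir2} /destination/
The shell expands the braces before rsync starts. With --progress, rsync reports each file as it transfers, so you can follow the files under each subdirectory and compare their transfer rates.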
You can also use wildcards to get composite progress across types of files:
rsync --progress /source/{*.sql,*.csv} /destination/
Additional examples:
# CSV files named f plus exactly two characters (for example f01.csv)
rsync --progress /source/f??.csv /destination/
# All files starting with a, b or c
rsync --progress /source/[abc]* /destination/
Adapt the glob expressions to fit your specific use cases for tracking groups of files.
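Because the shell, not rsync, expands these patterns, it can help to preview what a pattern matches before starting a long transfer. The sketch below does that in a throwaway directory; the helper name `preview_glob` and the file names are made up for the example.

```shell
# Show what the shell expands a glob pattern to inside a directory.
# $1 = directory, $2 = pattern (pass it quoted so it is expanded here, not by the caller)
preview_glob() {
    # $2 is deliberately unquoted so the shell performs pathname expansion
    ( cd "$1" && echo $2 )
}

# Throwaway demo files (names are made up for the example)
demo_dir=$(mktemp -d)
touch "$demo_dir/f01.csv" "$demo_dir/f02.csv" "$demo_dir/fabc.csv" \
      "$demo_dir/alpha.sql" "$demo_dir/beta.txt"

preview_glob "$demo_dir" 'f??.csv'      # f plus exactly two characters
preview_glob "$demo_dir" '[abc]*'       # names starting with a, b or c
preview_glob "$demo_dir" '*.sql *.csv'  # two patterns, expanded one after the other
```

If a pattern matches nothing, the shell by default passes it through unchanged, so rsync receives the literal pattern and reports a missing file.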
rsync Version Compatibility for Progress Options
When dealing with diverse environments, having compatibility context on supported progress options across rsync versions is handy:
rsync Version | --progress | --stats | --info=progress2 |
---|---|---|---|
2.6.0 | Yes | Yes | No |
3.0.0 | Yes | Yes | No |
3.1.0 | Yes | Yes | Yes |
So for the most up-to-date progress capabilities, rsync 3.1.0+ is preferred, since --info=progress2 first appeared in that release. But --progress works across all versions.
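In mixed environments a wrapper can pick the flag at run time. The following is a minimal sketch, assuming the first line of `rsync --version` carries the version number as its third field (as it does on stock rsync builds); the function name `supports_progress2` is hypothetical.

```shell
# Return success if the given rsync version supports --info=progress2 (3.1.0+).
supports_progress2() {
    version=$1
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }
}

# Pick the progress flag for the locally installed rsync, if there is one.
if command -v rsync >/dev/null 2>&1; then
    installed=$(rsync --version | awk 'NR==1 {print $3}')
    if supports_progress2 "$installed" 2>/dev/null; then
        echo "rsync $installed: use --info=progress2"
    else
        echo "rsync $installed: fall back to --progress"
    fi
fi
```

Only the major and minor numbers are compared, which is enough for the 3.1.0 cutoff.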
Troubleshooting Stalled Progress Issues
Despite best efforts measuring progress, sometimes transfers stall. Here are troubleshooting tips for common cases:
1. Network drops – Brief network dips will pause transfers. Enable rsync --partial to keep partially transferred files so a rerun can resume after resets.
2. Remote source bottlenecks – Slow disk or high load on the source system can hinder transfers. Monitor remote system performance during transfers and optimize.
3. Integrity checks enabled – Features like --checksum verify integrity but slow transfers, because every file is read in full on both sides. Disable if speed is the priority.
4. Rate-limited destination – If pushing to a rate-limited external target stalls, run rsync on the destination and pull the files instead, so the receiver controls the pace of writes.
5. System resource exhaustion – Confirm there is sufficient disk space, memory and IOPS to support transfers. Look for resource saturation.
6. Inode exhaustion – Many small files can exhaust the available inodes on a filesystem. Check usage with df -i; on filesystems such as ext4 the inode count is fixed when the filesystem is created, so plan for it up front.
There are many potential sources of delay in complex pipelines. Methodically rule out each cause to keep transfers moving.
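The disk space and inode checks can be scripted with `df`. The sketch below reads the use percentage from POSIX-format output (`df -P`); the `-i` option for inode counts is available in GNU df and may differ elsewhere. The function names and the 90% threshold are arbitrary assumptions.

```shell
# Read `df -P` style output on stdin and print the use percentage
# of the first filesystem as a bare number (0 if it is not numeric).
use_percent() {
    awk 'NR==2 { v = $5; sub(/%/, "", v); if (v !~ /^[0-9]+$/) v = 0; print v }'
}

# Warn when the target is nearly out of space or inodes.
check_target() {
    target=$1
    blocks=$(df -P "$target" 2>/dev/null | use_percent)
    inodes=$(df -Pi "$target" 2>/dev/null | use_percent)
    if [ "${blocks:-0}" -ge 90 ]; then
        echo "WARNING: $target is ${blocks}% full"
    fi
    if [ "${inodes:-0}" -ge 90 ]; then
        echo "WARNING: $target has used ${inodes}% of its inodes"
    fi
    return 0
}

check_target /tmp
```

Running a check like this before a large transfer catches the resource and inode cases before rsync stalls partway through.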
Sample Fixes to Preserve Progress
Here are some examples of fixes and tweaks I employ:
Fix network drops
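rsync -a --progress --partial --append-verify /source /backup
--partial keeps interrupted files, and --append-verify resumes them by appending and then verifies the whole file by checksum, resending any file that fails the check.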
Expand capacity
# mount larger volume
mount -t ext4 /dev/sdx3 /backup
# remove old checkpoints
rm -r /checkpoint-archives/*
Add disk capacity or archive old data to make room for transfers.
Pull from Destination
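rsync -a --remove-source-files --progress /staging/ /destination
Run this from the destination side to pull content instead of pushing it; --remove-source-files deletes each source file once it has transferred successfully.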
Helper Script for Persistent Progress
I utilize this handy bash script to persist rsync progress between invocations in case transfers get interrupted:
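#!/bin/bash
# Persist the last overall percentage reported by rsync between runs.
PROGRESS_FILE=/tmp/rsync-progress
LOG_FILE=$(mktemp /tmp/rsync-output.XXXXXX)
# Get last progress or set to 0
if [ -f "$PROGRESS_FILE" ]; then
    read -r LAST_PROGRESS < "$PROGRESS_FILE"
else
    LAST_PROGRESS=0
fi
echo "Previous run reached ${LAST_PROGRESS}%"
# Run rsync with overall progress (needs rsync 3.1.0+) and keep a copy of the output
rsync --partial --info=progress2 "$@" | tee "$LOG_FILE"
# Extract the last percentage; progress2 redraws its line with carriage returns
CURRENT_PROGRESS=$(tr '\r' '\n' < "$LOG_FILE" | grep -oE '[0-9]+%' | tail -n 1 | tr -d '%')
# Save progress, falling back to the previous value if none was printed
echo "${CURRENT_PROGRESS:-$LAST_PROGRESS}" > "$PROGRESS_FILE"
# Delete the temporary log
rm -f "$LOG_FILE"
This records the last overall percentage in a small state file so the next invocation can report where the previous run stopped. rsync itself skips files that already transferred when you rerun it with the same arguments. Adapt this to your pipelines.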
Visualizing Progress Over Time
For extensive transfers, visualizing progress charts provides a quick trend overview.
For example, pipe rsync output into awk to extract the percentage, then feed it to a plotting tool:
rsync -a --info=progress2 /source/ /destination/ | tr '\r' '\n' | awk '/xfr#/ {print $2}' | tee rsync-progress.csv
# Plot progress csv
python plot-progress.py rsync-progress.csv
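Because rsync redraws its progress line with carriage returns, the output generally needs splitting into lines before awk sees separate records. The following is a minimal sketch, assuming the `--info=progress2` line layout of bytes, percentage, rate and elapsed time; the function name `progress_to_csv` and the sample data are illustrative.

```shell
# Convert rsync --info=progress2 output on stdin into "percent,rate" CSV rows.
progress_to_csv() {
    tr '\r' '\n' | awk '
        # Progress lines carry a percentage such as "45%" in the second field
        $2 ~ /^[0-9]+%$/ { sub(/%/, "", $2); print $2 "," $3 }
    '
}

# Illustrative input in the shape rsync emits (carriage-return separated)
printf '  1,048,576  10%%  5.00MB/s  0:00:01\r  5,242,880  50%%  6.25MB/s  0:00:02\r 10,485,760 100%%  6.10MB/s  0:00:03 (xfr#1, to-chk=0/1)\n' | progress_to_csv
```

In a real run the function would sit at the end of the pipeline, for example `rsync -a --info=progress2 /source/ /destination/ | progress_to_csv > rsync-progress.csv`.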
Tunnels for Remote Progress Tracking
When pushing data between networks, directly accessing rsync servers may not be possible.
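SSH tunnels enable proxying rsync traffic for remote progress tracking:
# Server cmd: open a reverse tunnel so port 2222 on testserver forwards to the rsync daemon (port 873) here
ssh -R 2222:localhost:873 testserver
# Local cmd: on testserver, point rsync at the tunnel endpoint
# ("backup" is assumed to be a module exported by the rsync daemon)
rsync -a --progress /src rsync://user@127.0.0.1:2222/backup
This carries the rsync daemon protocol inside the encrypted SSH connection between the networks while still measuring progress.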
Conclusion
I hope this guide has provided extensive techniques and mitigation strategies for accurately tracking rsync progress from an advanced Linux engineering perspective.
Key topics included:
- Interpreting algorithm calculations affecting progress
- Expanding tracking capabilities with glob expressions
- Integrating with monitoring and graphics for visibility
- Addressing compatibility across rsync versions
- Troubleshooting and problem isolation for stalled transfers
- Tunneling protocols for remote server access
With broad file synchronization demands in modern pipelines, having robust progress visibility ensures efficient parallel operations.
Let me know if you have any other questions arising from production rsync workflows!