As a seasoned Linux engineer and DevOps architect, I treat accurate tracking of data transfer progress as a critical task. It provides visibility into expected completion times across diverse infrastructure.

The ubiquitous rsync tool for synchronizing data between systems enables vital replication, backups and processing pipelines. However, its progress output takes some experience to interpret correctly.

In this comprehensive guide, you will learn:

  • rsync internals for calculating progress percentages
  • Usage patterns that stall progress
  • Techniques to integrate rsync with system dashboards
  • Compatibility for progress options across versions
  • Glob expressions for flexible progress tracking
  • Common troubleshooting fixes for stuck transfers

I will cover key methods I employ for production pipelines moving terabytes of data.

Inner Workings of rsync Progress Calculation

To understand how to best measure progress, you need to know what rsync is doing under the hood. rsync has two transfer modes for synchronizing data.

A. rsync delta transfer algorithm

By default, for transfers over a network, rsync sends only the differences between the old and new versions of a file, which it finds with a rolling checksum method. Before any data moves, it has to build a file list and compare both sides to decide which files need updating.

Progress flows through these phases:

  1. Scanning phase – Build the file list and compare source and destination to identify files that differ
  2. Delta phase – Transfer the changed portions of each file
  3. Verification phase – Check each transferred file against a whole-file checksum as it is reconstructed

The scanning phase is often the longest on large trees, which leads to inconsistent progress early on. Progress cannot reflect actual files transferred until rsync reaches the delta phase. From rsync 3.0.0 onward the file list is built incrementally, so scanning and transfer can overlap and the reported totals keep changing until the scan finishes.
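One way to tell whether rsync is still scanning is the counter at the end of each progress line: rsync prints ir-chk while incremental recursion is still discovering files, and to-chk once the full file list is known. The sketch below classifies a single progress line on that basis. The function names scan_state and chk_counts are hypothetical helpers, not part of rsync.

```shell
# Classify one rsync progress line by its trailing counter.
# ir-chk = file list still being built, to-chk = file list complete.
scan_state() {
    case "$1" in
        *ir-chk=*) echo "scanning" ;;
        *to-chk=*) echo "list-complete" ;;
        *)         echo "unknown" ;;
    esac
}

# Print "remaining total" from the counter, e.g. "169 396".
chk_counts() {
    printf '%s\n' "$1" | sed -n 's/.*-chk=\([0-9]*\)\/\([0-9]*\)).*/\1 \2/p'
}

# Example:
#   scan_state '  1,238,099 100%  146.38kB/s  0:00:08 (xfr#5, ir-chk=169/396)'
```

While scan_state reports "scanning", the total in the counter can still grow, so any percentage derived from it should be treated as provisional.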

B. Whole file rsync algorithm

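An alternative mode simply copies whole file contents without calculating deltas. rsync still builds a file list first, but it skips the per-file checksum matching, so progress tends to be steadier once the transfer starts.

To enable whole file copying use rsync --whole-file or -W. It is already the default when both source and destination are local paths.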

Here is a comparison showing the pros and cons of each algorithm method:

                       Delta Transfer                          Whole File
Speed                  Very fast after initial scan            Slower end-to-end
Network usage          Very low, only diffs sent               Large, entire files sent
Progress calculation   Inconsistent early with long scan       Smooth from start to end
Use cases              Frequent rsync runs, minimal changes    Initial seeding of data

Now when you see inconsistent early progress, you can identify it is likely still scanning for changes before transferring.

Integration with Centralized Dashboards

To provide wider visibility into key transfer jobs across teams, I integrate rsync progress into centralized dashboards. This guides operational decisions to throttle activities based on observed transfer rates.

Popular open source options include:

cAdvisor – Containers resource usage collector

Prometheus – Time-series database for aggregating metrics

Grafana – Analytics and dashboard visualization

For example, rsync stats can be fed into Prometheus and graphed over time in a Grafana dashboard:

[Figure: Grafana dashboard visualizing rsync progress]

Dashboards provide real-time progress visibility for all environments, avoiding the need to connect individually to jump boxes.
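As a rough sketch of how a progress line could reach Prometheus, the function below turns one line of rsync progress output into a sample in the Prometheus text format. The metric name rsync_progress_percent, the job label and the function name are all assumptions made for illustration; they are not defined by rsync or Prometheus.

```shell
# Convert the percentage on an rsync progress line into one Prometheus sample.
# $1 = job label (hypothetical), $2 = one line of rsync progress output
progress_to_prom() {
    job=$1
    percent=$(printf '%s\n' "$2" | grep -oE '[0-9]+%' | tail -n 1 | tr -d '%')
    # Lines without a percentage (headers, file names) produce no sample
    [ -n "$percent" ] || return 1
    printf 'rsync_progress_percent{job="%s"} %s\n' "$job" "$percent"
}

# Example:
#   progress_to_prom nightly '  1,234,567  42%  10.00MB/s  0:00:12 (xfr#3, to-chk=7/10)'
```

One common way to expose such a sample is to write it to a .prom file in the directory watched by node_exporter's textfile collector, writing to a temporary file first and renaming it so the collector never reads a partial file.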

Tracking Progress for Sets of Files

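When transferring groups of files, glob expressions provide a flexible mechanism to segment progress reporting. The shell expands these patterns before rsync runs, so rsync simply receives the list of matching paths.

For instance, use brace expansion to transfer specific subdirectories (square brackets match single characters, not directory names):

rsync -a --progress /source/{dir1,dir2} /destination/

This displays per-file progress for everything under those subdirectories. To compare their transfer rates individually, run one rsync per subdirectory.

You can also combine braces with wildcards to select types of files. Use --info=progress2 (rsync 3.1.0+) to get one composite percentage across all of them:

rsync --info=progress2 /source/{*.sql,*.csv} /destination/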

Additional examples:

# CSV files named f plus exactly two characters, e.g. f01.csv
rsync --progress /source/f??.csv /destination/

# All files starting with a, b or c
rsync --progress /source/[abc]* /destination/

Adapt the glob expressions to fit your specific use cases for tracking groups of files.
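Because the shell expands glob patterns before rsync starts, it can help to preview what a pattern matches before launching a long transfer. A minimal sketch, where preview_sources is a hypothetical helper name:

```shell
# Print each existing path the shell produced from a glob, one per line.
# This is the list rsync would receive as source arguments.
preview_sources() {
    for path in "$@"; do
        # A pattern that matches nothing is passed through literally; skip it
        if [ -e "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
    return 0
}

# Example:
#   preview_sources /source/[abc]*
```

If the preview prints nothing, the pattern matched no files, and rsync would have been handed the literal pattern instead.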

rsync Version Compatibility for Progress Options

When dealing with diverse environments, it helps to know which progress options each rsync version supports:

rsync Version    --progress    --stats    --info=progress2
2.6.0            Yes           Yes        No
3.0.0            Yes           Yes        No
3.1.0            Yes           Yes        Yes

So for the most up-to-date progress capabilities, rsync 3.1.0+ is preferred, since that release introduced the --info option. But --progress works across all versions.
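A script that runs on mixed hosts can check the installed version before choosing a progress flag. The sketch below assumes that --info=progress2 needs rsync 3.1.0 or later and that the first line of rsync --version has the usual "rsync  version X.Y.Z  protocol version N" shape. Both function names are hypothetical.

```shell
# Extract X.Y.Z from the first line of `rsync --version` output.
parse_rsync_version() {
    printf '%s\n' "$1" | sed -n 's/^rsync  *version v\{0,1\}\([0-9][0-9.]*\).*/\1/p'
}

# Succeed when the given X.Y.Z version is 3.1.0 or newer.
supports_progress2() {
    version=$1
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }; then
        return 0
    fi
    return 1
}

# Example:
#   v=$(parse_rsync_version "$(rsync --version | head -n 1)")
#   if supports_progress2 "$v"; then flag=--info=progress2; else flag=--progress; fi
```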

Troubleshooting Stalled Progress Issues

Despite best efforts measuring progress, sometimes transfers stall. Here are troubleshooting tips for common cases:

1. Network drops – Brief network dips can pause or abort transfers. Use rsync --partial so partially transferred files are kept and a rerun can resume from them.

2. Remote source bottlenecks – Slow disk or high load on source system can hinder transfers. Monitor remote system performance during transfers and optimize.

3. Integrity checks enabled – --checksum makes rsync read and checksum every file on both sides to decide what to transfer, which slows large runs considerably. Drop it if speed is the priority; rsync still verifies each file it transfers.

4. Rate-limited destination – If pushing to a rate-limited external target keeps stalling, run rsync on the receiving side to pull the files instead, and cap throughput with --bwlimit so the receiver controls the pace.

5. System resource exhaustion – Confirm sufficient disk space, memory, IOPS to support transfers. Look for resource saturation.

6. Inode exhaustion – Many small files can exhaust the inodes on a filesystem even when free space remains. Check with df -i. On filesystems such as ext4 the inode count is fixed when the filesystem is created, so the fix is to free inodes or recreate the filesystem with more.

There are many potential sources of delays in complex pipelines. Rule each one out methodically to keep transfers moving.
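The disk space and inode checks can be scripted as a pre-flight step. This is a minimal sketch, assuming a df that accepts -P together with -k and -i (as GNU df does); the function names and the 90% default threshold are illustrative choices, not recommendations from rsync.

```shell
# Read `df -P`-style output on stdin and print the usage percentage
# of the first data row, without the % sign.
df_percent() {
    awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Warn when block or inode usage under a path reaches a threshold (default 90%).
check_capacity() {
    path=$1
    limit=${2:-90}
    blocks=$(df -Pk "$path" | df_percent)
    inodes=$(df -Pi "$path" | df_percent)
    for value in "$blocks" "$inodes"; do
        case "$value" in
            ''|*[!0-9]*) continue ;;   # some filesystems report "-" for inodes
        esac
        if [ "$value" -ge "$limit" ]; then
            echo "WARNING: $path is at ${value}% (limit ${limit}%)"
            return 1
        fi
    done
    return 0
}

# Example:
#   check_capacity /backup 90 || echo "free up space before starting rsync"
```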

Sample Fixes to Preserve Progress

Here are some examples of fixes and tweaks I employ:

Fix network drops

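rsync -a --progress --partial --append-verify /source /backup

--partial keeps interrupted files, and --append-verify (rsync 3.0.0+) resumes them by appending the missing data and then verifying the whole file, resending any file that fails the check.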
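On a flaky link it is convenient to wrap the command in a retry loop so a dropped connection does not need manual attention. The sketch below is a generic wrapper, not an rsync feature; the function name retry and its argument order are assumptions.

```shell
# Run a command until it succeeds or the attempt limit is reached.
# $1 = maximum attempts, $2 = seconds to wait between attempts, rest = command
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example:
#   retry 5 30 rsync -a --progress --partial --append-verify /source /backup
```

Each retry picks up the partial files left by the previous attempt, so the loop only repeats the work that was not finished.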

Expand capacity

# mount larger volume
mount -t ext4 /dev/sdx3 /backup

# remove old checkpoints  
rm -r /checkpoint-archives/*  

Add disk capacity or clear out old archives so the destination has room for the transfer.

Pull from Destination

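rsync -a --remove-source-files --progress /staging/ /destination

Flip direction so the destination pulls the content. --remove-source-files deletes each source file once it has transferred successfully, leaving the emptied directories in place.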

Helper Script for Persistent Progress

I utilize this handy bash script to persist rsync progress between invocations in case transfers get interrupted:

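#!/bin/bash

PROGRESS_FILE=/tmp/rsync-progress
# Separate temporary log so rsync output never overwrites the saved progress
LOG_FILE=$(mktemp /tmp/rsync-output.XXXXXX)

# Get last progress or set to 0
if [ -f "$PROGRESS_FILE" ]; then
    read -r LAST_PROGRESS < "$PROGRESS_FILE"
else
    LAST_PROGRESS=0
fi
echo "Previous run stopped at ${LAST_PROGRESS}%"

# Run rsync with overall progress (rsync 3.1.0+), keeping a copy of the output
rsync --partial --info=progress2 "$@" | tee "$LOG_FILE"
RSYNC_STATUS=${PIPESTATUS[0]}

# Extract the last overall percentage; progress updates are separated by carriage returns
CURRENT_PROGRESS=$(tr '\r' '\n' < "$LOG_FILE" | grep -oE '[0-9]+%' | tail -n 1 | tr -d '%')

# Save progress, falling back to the previous value if nothing was reported
echo "${CURRENT_PROGRESS:-$LAST_PROGRESS}" > "$PROGRESS_FILE"

# Delete the temporary output log
rm -f "$LOG_FILE"

exit "$RSYNC_STATUS"

This records the overall percentage each run reached, so after an interruption you can see how far the previous attempt got before rerunning the same command. Note that --info=progress2 reports progress for the current run only, not across runs. Adapt this to your pipelines.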

Visualizing Progress Over Time

For extensive transfers, charting progress provides a quick trend overview.

For example, pipe rsync into awk to extract the percentage, then feed it into a plotting tool:

rsync -a --info=progress2 /source/ /destination/ | tr '\r' '\n' | awk '/xfr#/ {print $2}' | tee rsync-progress.csv

# Plot progress csv
python plot-progress.py rsync-progress.csv

Sample demo output:

[Figure: line graph of rsync progress percent over time]
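For plotting, it is often handy to keep the byte count and rate alongside the percentage. The following sketch converts raw rsync progress output into CSV rows of bytes, percent and rate. It assumes the default (non-human-readable) number format with comma separators, and progress_to_csv is a hypothetical name.

```shell
# Read rsync progress output on stdin and emit "bytes,percent,rate" rows.
progress_to_csv() {
    # Progress updates are separated by carriage returns, so split them first
    tr '\r' '\n' | awk '
        # Progress lines look like: <bytes> <percent>% <rate> <time> [...]
        $2 ~ /^[0-9]+%$/ {
            bytes = $1; gsub(/,/, "", bytes)
            pct = $2;   sub(/%/, "", pct)
            print bytes "," pct "," $3
        }'
}

# Example:
#   rsync -a --info=progress2 /source/ /destination/ | progress_to_csv > rsync-progress.csv
```

Lines that are not progress updates, such as file names or the summary, are skipped because their second field is not a percentage.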

Tunnels for Remote Progress Tracking

When pushing data between networks, directly accessing rsync servers may not be possible.

SSH tunnels enable proxying rsync traffic for remote progress tracking:

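# Open the tunnel: forward local port 2222 to the rsync daemon (port 873) on testserver
ssh -N -L 2222:localhost:873 testserver

# In another terminal, point rsync at the local end of the tunnel
# ("backup" is assumed to be a module defined in the server's rsyncd.conf)
rsync -a --progress /src rsync://user@127.0.0.1:2222/backup/

This tunnels the rsync daemon protocol securely between the networks while still measuring progress.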

Conclusion

I hope this guide has provided extensive techniques and mitigation strategies for accurately tracking rsync progress from an advanced Linux engineering perspective.

Key topics included:

  • Interpreting algorithm calculations affecting progress
  • Expanding tracking capabilities with glob expressions
  • Integrating with monitoring and graphics for visibility
  • Addressing compatibility across rsync versions
  • Troubleshooting and problem isolation for stalled transfers
  • Tunneling protocols for remote server access

With broad file synchronization demands in modern pipelines, having robust progress visibility ensures efficient parallel operations.

Let me know if you have any other questions arising from production rsync workflows!
