Fault-Tolerant SFTP scripting - Retry Failed Transfers Automatically

Introduction

The whole of modern networking is built upon an unreliable medium. Routing equipment has free license to discard, corrupt, reorder, or duplicate data which it forwards. The understanding of the IP layer in TCP/IP is that there are no guarantees of accuracy. No IP network can claim to be 100% reliable.

The TCP layer acts as a guardian atop IP, ensuring data that it produces is correct. This is achieved with a number of techniques that sometimes purposely lose data in order to determine network limits. As most might know, TCP provides a connection-based network with guaranteed delivery atop an IP connectionless network that can and does discard traffic at will.

How curious it is that our file transfer tools are not similarly robust in the face of broken TCP connections. The SFTP protocol resembles both its ancestors and peers in that no effort is made to recover from TCP errors that cause connection closure. There are tools to address failed transfers (reget and reput), but these are not triggered automatically in a regenerated TCP session (those requiring this property might normally turn to NFS, but this requires both privilege and architectural configuration). Users and network administrators alike might be rapt with joy should such tools suddenly become pervasive.

What SFTP is able to provide is a return status, an integer that signals success when its value is zero. It does not return a meaningful status for file transfers by default, but only does so when called in batch mode. This return status can be captured by a POSIX shell, and the transfer retried when it is non-zero. This check can even be done on Windows with Microsoft's port of OpenSSH with the help of Busybox (or even PowerShell, with restricted functionality). The POSIX shell script is deceptively simple, but uncommon. Let's change that.

Failure Detection with the POSIX Shell

The core implementation of SFTP fault tolerance is not particularly large, but batch mode assurance and standard input handling add some length and complexity, as demonstrated below in a Windows environment.

C:\Users\BillG>type sftpft
#!/bin/sh

set -eu                                                      # Shell strict mode

tvar=1

for param                                              # Confirm SFTP batch mode
do case "$param" in [-]b*) tvar=;; esac
done

[ -n "$tvar" ] && { printf '%s: must be called with -b\n' "${0##*/}" >&2; exit 2; }

if [ -t 0 ]                                    # Save stdin unless on a terminal
then tvar=/dev/null
else tvar="$(mktemp -t sftpft-XXXXXX)"
     cat > "$tvar"
     if [ -s "$tvar" ]                          # Save only if stdin isn't empty
     then trap "rm -v \"$tvar\"" EXIT ABRT INT TERM   # Erase at exit (KILL cannot be trapped)
     else rm "$tvar" 
          tvar=/dev/null
     fi
fi

until sftp "$@" < "$tvar"
do echo "failed: $? $param"                   # Report failed transfer and retry
   sleep 15
done

There is some subtlety in the usage of this SFTP wrapper, in that the return of detectable errors is not the default. In order for the until to trigger a retry on a data error, the -b option must be passed, and further controls are available within the associated batch command script to configure the error response. Without -b, a transfer that fails due to inadequate permissions still reports a zero (successful) status, as is easily demonstrated:

~ $ echo 'put foobar.txt /var' | sftp -i secret_key [email protected]; echo $?
Connected to 10.11.12.13.
sftp> put foobar.txt /var
Uploading foobar.txt to /var/foobar.txt
remote open("/var/foobar.txt"): Permission denied
0

Detecting a transfer that has not taken place requires the -b option to SFTP; without it, only initial connection errors will be reported. An easy fix would be to add -b - for standard input:

~ $ echo 'put foobar.txt /var' | sftp -i secret_key -b - [email protected]; echo $?
sftp> put foobar.txt /var
remote open("/var/foobar.txt"): Permission denied
1

The script explicitly confirms that the -b parameter is present.

Most users of POSIX (and derivative) shells in a scripting context are more familiar with the if [ construct above for conditions. However, the majority of UNIX systems have a program in /bin/[ which will evaluate a POSIX test and return a status. We could instead write if /bin/[ or if /bin/test to call either program directly with the full path (the original Bourne shell always did so, but most modern shells implement [ as a “builtin” for speed). Both if and until can execute any program, including SFTP, but if is for branching, while until is for looping. When there is a transfer problem, we want to loop.
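
As a minimal sketch of that looping behavior, independent of SFTP, the function below retries any command until it returns zero. The names retry_until_ok and RETRY_DELAY are illustrative assumptions and are not part of the sftpft script.

```shell
#!/bin/sh
# Hypothetical helper: run the given command until it exits with status zero.
# RETRY_DELAY is an assumed variable name (seconds to wait between attempts).
retry_until_ok() {
    until "$@"
    do  echo "failed: $? (will retry)" >&2   # $? is the status of the command
        sleep "${RETRY_DELAY:-15}"
    done
}
```

It could be called as retry_until_ok some_command arg1 arg2, where some_command is any program that reports failure through its exit status.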

The arguments sent to sftp are exactly the same as those provided to the parent script, via the $@ shell variable, as described best in the Korn shell documentation:

$@       Same as $*, unless it is used inside double quotes, in which case
         a separate word is generated for each positional parameter.  If
         there are no positional parameters, no word is generated.  $@ can
         be used to access arguments, verbatim, without losing NULL
         arguments or splitting arguments with spaces.
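
The difference between the quoted and unquoted forms can be seen in a small sketch. The function names here are hypothetical and exist only for the demonstration.

```shell
#!/bin/sh
# count_args prints how many words it received.
count_args() { echo "$#"; }

# forward_quoted passes its parameters on with "$@": one word per parameter.
forward_quoted() { count_args "$@"; }

# forward_unquoted uses a bare $*, which is split again on whitespace.
forward_unquoted() { count_args $*; }

forward_quoted   -i "my key" -b batch    # prints 4
forward_unquoted -i "my key" -b batch    # prints 5
```

An argument containing a space, such as a key file path, survives only in the quoted form.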

The scripting inside the until block (between the do and the done) is never triggered when the SFTP session functions correctly; it is called only when either the TCP connection fails, or a) SFTP is used in batch mode and b) a non-ignored command fails (as documented below). The error message combines both the (non-zero) return code held in the $? shell variable, and the last argument on the command line. Let's demonstrate on a Windows system with Busybox, where as a test I disconnect the server's Ethernet cable, call the transfer and wait for two failures, then reconnect:

C:\Users\BillG>type sbatch
dir

C:\Users\BillG>busybox sh

~ $ ./sftpft -i secret_key -b sbatch [email protected]
ssh: connect to host sftp.macrofirm.com port 22: Connection timed out
Connection closed
failed: 255 [email protected]
ssh: connect to host sftp.macrofirm.com port 22: Connection timed out
Connection closed
failed: 255 [email protected]

sftp> dir
CP046020-iLO.scexe ...

~ $ exit

C:\Users\BillG>ssh -V
OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2
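
The two exit statuses observed so far (1 from a failed batch command, 255 from a failed connection) suggest that a wrapper could treat them differently, since retrying a permissions failure is unlikely to help. The following is only a sketch: classify_status is a hypothetical helper, and the meaning assigned to each status is what was seen in these sessions rather than a documented guarantee.

```shell
#!/bin/sh
# Hypothetical helper: map an sftp exit status to a coarse category.
# The categories reflect only the statuses observed in this article.
classify_status() {
    case "$1" in
        0)   echo success ;;
        255) echo connection ;;   # transport failed: a retry may succeed
        *)   echo command ;;      # a batch command failed: a retry may not help
    esac
}
```

A caller might keep looping on "connection" but stop and report on "command".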

The documentation for SFTP (extracted from RedHat v9) reveals additional controls and subtlety that will impact the reporting of errors:

$ man sftp | sed -n '/batchfile$/,/bsd/p'
-b batchfile
        Batch mode reads a series of commands from an input batchfile in‐
        stead of stdin.  Since it lacks user interaction it should be
        used in conjunction with non-interactive authentication to obvi‐
        ate the need to enter a password at connection time (see sshd(8)
        and ssh-keygen(1) for details).

        A batchfile of ‘-’ may be used to indicate standard input.  sftp
        will abort if any of the following commands fail: get, put,
        reget, reput, rename, ln, rm, mkdir, chdir, ls, lchdir, chmod,
        chown, chgrp, lpwd, df, symlink, and lmkdir.

        Termination on error can be suppressed on a command by command
        basis by prefixing the command with a ‘-’ character (for example,
        -rm /tmp/blah*).  Echo of the command may be suppressed by pre‐
        fixing the command with a ‘@’ character.  These two prefixes may
        be combined in any order, for example -@ls /bsd.

The SFTP batch script that we used above consisted of the single command dir (which is an alias for the ls command). With multiple SFTP commands, we are free to allow some to fail, but detect and retry others in a more sophisticated usage.

Forcing a successful return on our previous failed transfer just requires another dash:

~ $ echo '-put foobar.txt /var' | sftp -i secret_key -b - [email protected]; echo $?
sftp> -put foobar.txt /var
remote open("/var/foobar.txt"): Permission denied
0
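
Mixing tolerated and enforced commands in one batch file might look like the sketch below. The helper write_batch and the directory name incoming are hypothetical; only foobar.txt comes from the examples above.

```shell
#!/bin/sh
# Hypothetical helper: write an SFTP batch file to the path given as $1.
# Commands prefixed with '-' may fail without aborting the session;
# the unprefixed put must succeed, or sftp -b returns non-zero.
write_batch() {
    cat > "$1" <<'EOF'
-mkdir incoming
put foobar.txt incoming/
-chmod 640 incoming/foobar.txt
EOF
}
```

Here a failing mkdir (for instance because the directory already exists) is ignored, while a failed put still produces the non-zero status that triggers a retry.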

For those transferring large files over connections that are prone to TCP resets, the reget and reput commands will retain the successfully transmitted partial content, instead of restarting the failed transfer completely. It is important to ensure in this case that the reget is not issued for mismatched files (ensure that the target file does not exist before initially calling reget or reput in sftpft).
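
One way to enforce that precaution is a check run once before the first attempt. This is a sketch only; start_clean is a hypothetical name, and it covers the local (reget) side alone.

```shell
#!/bin/sh
# Hypothetical pre-flight check, run once before the first reget: refuse to
# proceed if a local file of the same name already exists, because reget
# would treat it as a partial download of the remote file.
start_clean() {
    if [ -e "$1" ]
    then printf '%s: already exists, refusing to reget onto it\n' "$1" >&2
         return 1
    fi
}
```

It could guard the wrapper in the form start_clean file.iso && ./sftpft followed by the usual options, so that the retry loop only ever resumes content it downloaded itself.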

Here is an example session of an ISO image transfer that is repeatedly interrupted by disconnecting the server's network cable. First, the batch file will check for the ISO, then issue a reget:

~ $ cat sbatch
!dir OracleLinux-R8-U2-x86_64-dvd.iso
reget OracleLinux-R8-U2-x86_64-dvd.iso

This ISO file is a total of 7.8 GiB. The transfer was interrupted by disconnecting the network cable when 817 MiB had been downloaded, then again after 2.5 GiB had been locally stored.

~ $ ./sftpft -i secret_key -b sbatch [email protected]
sftp> !dir OracleLinux-R8-U2-x86_64-dvd.iso
 Volume in drive C is OSDisk
 Volume Serial Number is E44B-22EC

 Directory of C:\Users\BillG

File Not Found

sftp> reget OracleLinux-R8-U2-x86_64-dvd.iso
client_loop: send disconnect: Connection reset
Connection closed
failed: 255 [email protected]

sftp> !dir OracleLinux-R8-U2-x86_64-dvd.iso
 Volume in drive C is OSDisk
 Volume Serial Number is E44B-22EC

 Directory of C:\Users\BillG

12/20/2022  10:14 AM       857,309,184 OracleLinux-R8-U2-x86_64-dvd.iso
               1 File(s)    857,309,184 bytes
               0 Dir(s)  115,628,781,568 bytes free

sftp> reget OracleLinux-R8-U2-x86_64-dvd.iso
client_loop: send disconnect: Connection reset
Connection closed
failed: 255 [email protected]
ssh: connect to host 10.11.12.13 port 22: Connection timed out
Connection closed
failed: 255 [email protected]
ssh: connect to host 10.11.12.13 port 22: Connection timed out
Connection closed
failed: 255 [email protected]

sftp> !dir OracleLinux-R8-U2-x86_64-dvd.iso
 Volume in drive C is OSDisk
 Volume Serial Number is E44B-22EC

 Directory of C:\Users\BillG

12/20/2022  10:17 AM     2,638,348,288 OracleLinux-R8-U2-x86_64-dvd.iso
               1 File(s)  2,638,348,288 bytes
               0 Dir(s)  113,851,338,752 bytes free

sftp> reget OracleLinux-R8-U2-x86_64-dvd.iso

After the transfer completed, SHA-256 checksums were computed on both sides, and they matched.

~ $ ls -l OracleLinux-R8-U2-x86_64-dvd.iso
-rw-rw-r-- 1 BillG BillG 8337227776 Dec 20 10:27 OracleLinux-R8-U2-x86_64-dvd.iso

~ $ sha256sum OracleLinux-R8-U2-x86_64-dvd.iso
67568941e976efb26a3d61cdbf98c5a46cd0b3463ec750992f305eee20957a6e  OracleLinux-R8-U2-x86_64-dvd.iso

~ $ ssh -i secret_key [email protected]
Last login: Mon Dec 19 15:36:37 2022

$ sha256sum OracleLinux-R8-U2-x86_64-dvd.iso
67568941e976efb26a3d61cdbf98c5a46cd0b3463ec750992f305eee20957a6e  OracleLinux-R8-U2-x86_64-dvd.iso

One scripting concern is the handling of standard input. When the -b - option is used, SFTP completely consumes stdin on the first iteration of the until, leaving nothing for later runs. To enable the repeated use of an SFTP batch command script on the standard input (which must not be an interactive terminal), the input must be saved in a temporary file, applied on each run, then unlinked by an exit trap:

~ $ echo 'put foobar.txt' | ./sftpft -i secret_key -b - [email protected]
ssh: connect to host sftp.macrofirm.com port 22: Connection timed out
Connection closed
failed: 255 [email protected]
ssh: connect to host sftp.macrofirm.com port 22: Connection timed out
Connection closed
failed: 255 [email protected]
sftp> put foobar.txt
removed 'C:/Users/BillG/AppData/Local/Temp/sftpft-a01472'

Note that the Windows port of Busybox only handles EXIT and ERR traps; all other signals will be ignored.

One other question is the best choice of shells to execute this script. A very good option is Debian dash, which is small and fast. The MirBSD mksh has an enormous install base on Android. Both ksh-93 and bash are quite large, and the manual page for bash does describe it as “too big and too slow.” All of these POSIX-compliant shells can be used (and many more); choose the one that you like.

For problem WAN connections, this SFTP approach is a boon for reliability. It will end after-hours support calls for critical data, which will be delayed but not discarded. It also can be instrumented with dates, times, traceroutes and other network diagnostics to identify specific failures with your provider.
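
A hedged sketch of such instrumentation follows. The helper log_failure and the variable SFTPFT_LOG are assumed names, and the log format is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped line per failed attempt.
# $1 is the exit status, $2 the target; SFTPFT_LOG is an assumed variable.
log_failure() {
    printf '%s status=%s target=%s\n' \
        "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$1" "$2" \
        >> "${SFTPFT_LOG:-sftpft.log}"
}
```

Called as the first command inside the until loop, as log_failure "$?" "$param", it would record when each failure occurred; the output of a route trace could be appended to the same file in the same way.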

No IP network is (or can be) reliable. TCP successfully creates a facade of reliability, but networks fail. Network transfers in general, including SFTP, should gracefully recover from TCP failures. Your network administrator will thank you for pervasive use of these techniques.

Special thanks are owed here to Ron Yorston, the maintainer of the Windows Busybox port, for his advice on elements of this section.

PowerShell

While the POSIX shell was mostly present in the features of the 1988 version of the Korn shell, and completely defined in the 1992 POSIX.2 standard, PowerShell is a much younger language that is still gaining core feature equivalence.

A basic SFTP retry can be implemented in PowerShell with the form of:

do {sftp -i secret_key -b sbatch [email protected]} while (-not $?)

Incorporating a delay (for all transfers, even successful) could be accomplished with:

do {sftp -i secret_key -b sbatch [email protected];$n=$?;sleep 15} while (-not $n)

It appears that early versions of PowerShell have no implementation of $@, so a generalized script can be much more challenging. Such an effort is left as an exercise for the reader.

It also appears that $?, the exit status, is boolean instead of an integer type. This is sufficient to force SFTP to retry on fail, but less granular than a POSIX shell's reporting. Note that we have seen two distinct SFTP error codes above, 1 for a permissions failure on /var, and 255 for TCP connection failures. Microsoft's bundled curl returns over 90 different exit status codes as reported in the RedHat 9 documentation (which includes an earlier version of curl than is currently present in Windows 10) with varied meanings that are opaque to PowerShell's $? status boolean.

Perhaps the PowerShell maintainers might consider retroactively defining $¿ as an integer type that reports the actual exit status of a completed program. This would be useful in a variety of scenarios.

Microsoft has been highly responsive to requests for current software, as their ports of curl and OpenSSH prove. Additional functionality for PowerShell will surely be carefully considered if recent diligence continues.

Other Usage

There are many other utilities that return similar status, and can be configured to retry on failure. Considering Windows users, the prime focus should be upon:

  • Curl, which Microsoft now bundles into modern versions of Windows, and

  • psftp from PuTTY, which has been the most popular SSH client on Windows for many years, and offers a few features not found within OpenSSH.

Microsoft's curl port has been present for some time, with support for several transfer protocols (but oddly not SFTP):

C:\Users\BillG>curl --version
curl 7.83.1 (Windows) libcurl/7.83.1 Schannel
Release-Date: 2022-05-13
Protocols: dict file ftp ftps http https imap imaps pop3 pop3s smtp smtps telnet tftp                                                                                   
Features: AsynchDNS HSTS IPv6 Kerberos Largefile NTLM SPNEGO SSL SSPI UnixSockets
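
A retry loop in the same style could be wrapped around curl. This is a sketch under assumptions: fetch_with_retry and RETRY_DELAY are hypothetical names, and the URL is supplied by the caller. The --fail option is what makes curl return a non-zero status on HTTP errors.

```shell
#!/bin/sh
# Hypothetical helper: download the URL in $1 into the current directory,
# retrying until curl exits with status zero.
# RETRY_DELAY is an assumed variable name (seconds between attempts).
fetch_with_retry() {
    until curl --fail --silent --show-error --remote-name "$1"
    do  echo "failed: $? $1" >&2
        sleep "${RETRY_DELAY:-15}"
    done
}
```

Note that a permanent error (a missing file, for example) would keep this loop running indefinitely. curl also has its own --retry option, which may be preferable where it suffices.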

The PuTTY utilities will also report status, and have password-handling functionality that OpenSSH explicitly omits (which can be abused):

C:\Users\BillG>psftp --version
psftp: Release 0.76
Build platform: 64-bit x86 Windows
Compiler: clang 13.0.0 (https://github.com/llvm/llvm-project/ ab5ee342b92b4661cfec3cdd647c9a5c18e346dd), emulating Visual Studio 2013 (12.0), _MSC_VER=1800
Source commit: 1fd7baa7344bb38d62a024e5dba3a720c67d05cf

SCP

The SCP utility is more straightforward in error reporting, and by default returns an exit status indicative of the success of a transfer. It is easier to adapt to fault-tolerant scripting, and historically has been somewhat faster than SFTP.

However, it is advisable to avoid SCP in critical transfers for reasons described below.

OpenSSH 8.7 first introduced a modified SCP that uses SFTP as the wire protocol. OpenSSH 8.8 then gave notice that the legacy SCP protocol was being deprecated, and the default handling switched in OpenSSH 9.0 (although the -O option to the SCP client can revert to the legacy protocol if the server allows it).

A partial list of drawbacks to SCP:

  • Historic flaws in SCP, discussion of which heightens concern.
  • Several CVEs, which cannot be addressed within the limitations of the current protocol.
  • Risks in the utilization of older clients and servers.
  • Protocol ambiguity with the mix of old and new versions.
  • The comparative ease of applying chroot() to SFTP on the server.

It might be advisable to consider the PuTTY pscp utility in preference to OpenSSH SCP in any form, due to PuTTY's long history of dealing with SCP security problems (UNIX ports of the PuTTY utility suite exist and work well). PuTTY implemented an SFTP backend for pscp in 2002 with the 0.52 release and “only falls back to the old scp1 form if SFTP can't be found.” At the time of the 0.52 release, the changelog noted that “scp1's implementation of server-side wildcards is inherently unsafe.” Forcing SFTP mode when using pscp is a best practice; in particular, avoid the -unsafe mode that gives a free hand to a malicious SCP server.

Stability and security favor SFTP. Unfortunately, these benefits come at a cost.

SFTP Performance Benchmarks

SFTP has suffered from performance problems in some situations, as it does not exploit TCP sliding windows and thus can be outperformed by the FTP protocol that it has largely replaced.

In an effort to quantify this performance penalty, tests were performed between two HPE DL380 servers running the Oracle UEK Linux kernel. The sshd was configured with Ciphers aes128-gcm@openssh.com to restrict all transfers to the fastest AEAD cipher that is able to exploit the AES-NI acceleration of the server CPUs. Transfer trials were first conducted over a corporate firewall that was handling unrelated traffic outside testing control, and then by directly connected Ethernet using a crossover cable.

The results of these performance tests are somewhat surprising.

A 32 GiB file was used as a test transfer for all trials:

# ll backup_file.dat 
-rw-r--r--. 1 BillG BillG 34344017920 Jan 25  2021 backup_file.dat
# ll -h backup_file.dat 
-rw-r--r--. 1 BillG BillG 32G Jan 25  2021 backup_file.dat

First, SFTP retrieval was performed with a POSIX shell “here document,” and the wall-clock time recorded:

time sftp -q [email protected] <<-''EndOfSFTP
get backup_file.dat
quit
EndOfSFTP

real	8m35.846s
real	8m43.507s
real	8m43.883s
real	8m44.184s
real	8m46.229s

When the direct-wired network interfaces were used, the transfer time was nearly halved:

real	4m53.244s
real	4m54.488s
real	4m54.497s

SCP was then attempted, which appeared to exhibit a small performance advantage:

time scp -q [email protected]:backup_file.dat .
real	8m4.313s
real	8m8.493s
real	8m12.369s
real	8m14.518s
real	8m16.414s

The small difference between SFTP and SCP vanished when the direct-wired interfaces were used instead:

real	4m53.007s
real	4m53.578s
real	4m53.671s

Directly pulling the file over the standard output of an ssh session appeared to offer a minuscule improvement:

f=backup_file.dat
time ssh [email protected] "cat $f" > $f
real	7m46.035s
real	8m4.932s
real	8m6.543s
real	8m12.891s
real	8m14.345s

This advantage also disappeared over the direct connection:

real	4m55.404s
real	4m55.705s
real	4m57.028s

Finally, a direct TCP connection was established with a pair of Netcats, which should have similar performance to cleartext FTP, indicating the overall performance cost of the SSH connection:

nc --send-only -l 65432 < backup_file.dat
time nc --recv-only sftp.macrofirm.com 65432 > delme
real	7m44.775s
real	7m46.638s
real	7m49.271s
real	7m49.669s
real	7m51.181s

This lead also collapsed with the direct connection:

real	4m53.328s
real	4m54.608s
real	4m54.577s

A best guess on these patterns might be that SFTP is less able to control latency in the data stream, and this appears to impact bandwidth as the connection grows more distant.
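
To compare these trials as throughput rather than elapsed time, the wall-clock figures can be converted with a little arithmetic. The helper names below are hypothetical; the byte count and times are taken from the listings above.

```shell
#!/bin/sh
# to_seconds converts the "real" format printed by time (e.g. 8m35.846s)
# into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# mib_per_s BYTES SECONDS prints the average rate in MiB per second.
mib_per_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1048576 }'
}

size=34344017920                                    # bytes in backup_file.dat
mib_per_s "$size" "$(to_seconds 8m35.846s)"         # prints 63.5
mib_per_s "$size" "$(to_seconds 4m53.244s)"         # prints 111.7
```

By this measure the fastest SFTP run through the firewall averaged roughly 63.5 MiB/s, against roughly 111.7 MiB/s over the direct link.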

Conclusion

SFTP does not appear to be designed as an easy migration from either FTP or SCP. It does improve upon FTP in many ways, at some performance cost. The SFTP error reporting is also disabled for data failures by default, and detecting critical transfer interruptions is now much more cumbersome than with SCP.

It does offer strong security benefits, as OpenSSH has reworked it to run in a chroot(), bringing the whole of the SFTP server into the main SSHD. Combined with strong encryption, privilege separation (where supported) and reget/reput, SFTP offers far greater security and flexibility than the legacy protocols that it has replaced.

Despite its foibles, SFTP has gained preeminent status as a secure file transfer agent. With the addition of fault-tolerant handling techniques, there is little reason to fall back to legacy protocols, despite the penalties imposed. The perfect is the enemy of the good, and the industry has deemed SFTP good enough; little can be gained by rowing against this current.

Charles Fisher has an electrical engineering degree from the University of Iowa and works as a systems and database administrator for a Fortune 500 mining and manufacturing corporation.
