Transferring Lots Of Small Data

I recently needed to transfer about 500 GB of data from one HP-UX box to another.

Unfortunately most of these data were in small files – the largest probably a megabyte or so.

I ran some speed tests. We had gigabit Ethernet connecting the two boxes on a little-used LAN.

I wrote a small script to start 6 copy processes in parallel and another to periodically measure how much disk space had been used on the target machine.
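A measuring script along those lines might look like the sketch below. This is not the script I actually used; the function names (used_kb, rate_mb_per_s, monitor) are made up for illustration, and it assumes that sampling a directory with du -sk is an acceptable way of tracking how much data has arrived.

```shell
#!/bin/sh
# Hypothetical helpers for watching a copy arrive on the target machine.

# used_kb DIR: print the size of a directory tree in kilobytes.
used_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# rate_mb_per_s START_KB END_KB SECONDS: average rate in megabytes per second.
rate_mb_per_s() {
    awk -v a="$1" -v b="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", (b - a) / 1024 / t }'
}

# monitor DIR INTERVAL SAMPLES: print one rate per interval, SAMPLES times.
monitor() {
    dir=$1; interval=$2; samples=$3
    prev=$(used_kb "$dir")
    i=0
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        now=$(used_kb "$dir")
        echo "$(rate_mb_per_s "$prev" "$now" "$interval") MB/s"
        prev=$now
        i=$((i + 1))
    done
}
```

Running something like `monitor target_directory 60 10` on the receiving box would print ten one-minute averages. Note that du itself takes a while on a tree with a very large number of small files, so short intervals will understate the rate.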

I found that 6 simultaneous scp commands ran at about 15 MB/s (megabytes per second). scp has the overhead of encrypting the data stream before it is transferred, so I tried plain old rcp as well. That gave me worse results than scp.

Thinking that the large number of small files was slowing things down, I tried using tar to batch up the files. I wrote a tiny script to pipe the output of the tar command across to the target machine, directly into a receiving tar process which unbundled the files.

The tar approach almost doubled the rate of transfer to about 30 MB/s.
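To put those two rates in perspective, here is a back-of-envelope sketch of how long the whole copy would take at each. The function name transfer_hours is hypothetical, and the sum assumes 1 GB = 1024 MB and a constant sustained rate, neither of which holds exactly in practice.

```shell
#!/bin/sh
# transfer_hours SIZE_GB RATE_MB_PER_S
# Prints the estimated transfer time in hours, to one decimal place.
# Assumes 1 GB = 1024 MB and a steady rate for the whole copy.
transfer_hours() {
    awk -v gb="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", gb * 1024 / rate / 3600 }'
}
```

`transfer_hours 500 15` prints 9.5 and `transfer_hours 500 30` prints 4.7, so on these assumptions doubling the rate saves the better part of five hours.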

Here are the tar commands. Note that you’ll need to enable r-commands by creating .rhosts files in the home directories of each machine (remove these after the copy is complete – they are notorious security problems). Note also that, as usual, HP-UX is awkward – whereas the rest of the world uses ‘rsh’ for the remote-shell command, HP-UX uses ‘remsh’. ‘rsh’ is some kind of restricted shell in HP parlance.

box1> cd source_directory && tar cf - . | remsh box2 "cd target_directory && tar xf - "

The first tar command writes its archive to ‘-’, a special token which in this case means ‘standard output’ rather than a file on disk. The archive contains all the files in the current directory (.) plus all subdirectories (tar is recursive by default). The archive is piped into the remsh command, which sends it to the box2 machine. On box2 I first change to the proper receiving directory, then I extract the incoming files from ‘-’, which on that side means ‘standard input’.

I had 6 of these tar commands running simultaneously to ensure the network link was saturated with data, although I suspect that disk access may have been the limiting factor.
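For what it's worth, here is one way the parallel runs could be scripted. It is a sketch, not what I ran: the function names (copy_entries, parallel_copy) and the REMOTE_SHELL variable are invented for illustration, and the round-robin split over top-level entries is just one possible way of dividing the work. It also assumes the top-level names contain no whitespace, and because it uses plain ls it skips hidden entries at the top level.

```shell
#!/bin/sh
# Remote shell command: remsh on HP-UX, rsh on most other systems.
REMOTE_SHELL=${REMOTE_SHELL:-remsh}

# copy_entries SRC HOST DEST NAME...
# Stream the named entries of SRC into DEST on HOST through one tar pipe.
copy_entries() {
    src=$1; host=$2; dest=$3; shift 3
    ( cd "$src" && tar cf - "$@" ) |
        $REMOTE_SHELL "$host" "cd '$dest' && tar xf -"
}

# parallel_copy SRC HOST DEST JOBS
# Deal the top-level entries of SRC out to JOBS tar pipes, then wait.
parallel_copy() {
    src=$1; host=$2; dest=$3; jobs=$4
    n=0
    while [ "$n" -lt "$jobs" ]; do
        # Every JOBS-th entry, offset by n (names assumed free of whitespace).
        entries=$(cd "$src" && ls | awk -v n="$n" -v j="$jobs" '(NR - 1) % j == n')
        if [ -n "$entries" ]; then
            # Deliberately unquoted so each name becomes its own argument.
            copy_entries "$src" "$host" "$dest" $entries &
        fi
        n=$((n + 1))
    done
    wait    # block until every background pipe has finished
}
```

Something like `parallel_copy source_directory box2 target_directory 6` would then start up to six pipes at once. Splitting by entry count rather than by size means the jobs can finish at very different times if the subdirectories are uneven.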
