Quozl's Copy In Place

| quozl@us.netrek.org | up |

Scope: Linux Copy In Place Program.

2008-07-08

Why

A television receiving computer (DVB-T) creates large transport stream files, which are available over rsync or NFS to the playback and archiving computers.

During archiving, an incremental rsync (--partial) slows dramatically if any portion of the output file already exists. No idea why that is, but what was needed was a way to copy just the remaining portion of a file over NFS.

How

Imagine this starting condition: some of the file has been copied, and the rest of the file is to be copied:

A : 0123456789012345678901234567890123456789 (input file)
B : 0123456789012345678                      (partial output file)
C :                    901234567890123456789 (to be copied)
So what on Linux can do this? Tell Quozl.
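
The starting condition above can be sketched in Python with in-memory buffers standing in for the two files. This is only an illustration of the offset arithmetic, not a finished tool; the names copy_remaining, src and dst are made up for the example.

```python
import io
import os


def copy_remaining(src, dst, chunk_size=8):
    """Append to dst whatever part of src lies beyond dst's current length.

    src and dst are hypothetical binary file-like objects.
    Returns the number of bytes copied.
    """
    dst.seek(0, os.SEEK_END)        # output length is where copying resumes
    offset = dst.tell()
    src.seek(offset, os.SEEK_SET)   # skip the part that is already copied
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:               # empty read means end of input
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


a = io.BytesIO(b"0123456789" * 4)   # A: input file, 40 bytes
b = io.BytesIO()
b.write(b"0123456789012345678")     # B: partial output file, 19 bytes
copied = copy_remaining(a, b)       # C: the 21 bytes still to be copied
print(copied, b.getvalue() == a.getvalue())
```

The sketch assumes the existing part of the output is a correct prefix of the input; nothing is compared or checksummed, only the lengths are used.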

Results

2008-08-19
The test case is a 1,427,621,240 byte file, of which roughly 50% (exactly 713,809,920 bytes) had already been transferred. Here is a summary, or read the test results.

| command | comment | test time |
|---|---|---|
| cp | no relevant flag | |
| scp | no relevant flag | |
| dd | too hard, but possible with seek= | 1m2s |
| wget --continue http://tv/file | works, but HTTP isn't in use | |
| wget --continue ftp://tv/file | works, but FTP isn't in use | |
| rsync rsync://tv/file | slows to 2.5Mbit/sec over 100Mbit/sec link, version 3.0.2 | 4m18s |
| rsync --inplace rsync://tv/file | slows to 2.5Mbit/sec over 100Mbit/sec link, version 3.0.2 | 4m18s |
| rsync --append rsync://tv/file | does not slow, but does checksum the current output file first | 1m29s |
| cp-inplace | does not slow, does not checksum | 1m6s |
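
To put the test times in perspective, the following sketch turns them into an average transfer rate. It assumes that only the missing bytes crossed the link during each run and that a Mbit is 10^6 bits; both are assumptions of this example, and the helper names are hypothetical.

```python
import re

TOTAL_BYTES = 1427621240      # size of the test file
ALREADY_COPIED = 713809920    # bytes present in the output before the test


def parse_duration(text):
    """Convert a time such as '4m18s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError("unexpected duration: %r" % text)
    return int(match.group(1)) * 60 + int(match.group(2))


def effective_mbit_per_s(duration):
    """Average rate, assuming only the missing bytes were transferred."""
    remaining = TOTAL_BYTES - ALREADY_COPIED
    return remaining * 8 / parse_duration(duration) / 1e6


for command, duration in [("rsync", "4m18s"),
                          ("rsync --append", "1m29s"),
                          ("cp-inplace", "1m6s")]:
    print("%-15s %6s  %5.1f Mbit/sec"
          % (command, duration, effective_mbit_per_s(duration)))
```

Under those assumptions cp-inplace averages about 86 Mbit/sec and plain rsync about 22 Mbit/sec. The rsync figure is an average over the whole run, not the 2.5Mbit/sec it slows to.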

Code

Therefore, between main course and dessert last night, Quozl wrote something to do it.

For reading here:

#!/usr/bin/env python3
""" copy the uncopied portion of a file """
import os, sys, time

r = open(sys.argv[1], 'rb') # input file, read as raw bytes
a = open(sys.argv[2], 'ab') # output file, may already exist

# seek input to end to determine current size
r.seek(0, os.SEEK_END)
rs = r.tell()
print(rs, "size of input.")

# seek output to end
a.seek(0, os.SEEK_END)

# get output file length
asize = a.tell()
uncopied = rs - asize
print(asize, "size of output,", uncopied, "to be moved.")

# position input to current output position
r.seek(asize, os.SEEK_SET)

start = time.time()
copied = 0
# chunk size is a tenth of the remainder, but at least 8 MiB
size = max(uncopied // 10, 8192*1024)

# loop reading and writing until end of file on input
chunk = r.read(size)
while chunk:
    a.write(chunk)
    copied += len(chunk)
    print(r.tell(), "moved", len(chunk), "chunk", copied, "copied")
    chunk = r.read(size)

# generate summary
elapsed = time.time() - start
bps = int(copied / elapsed) if elapsed > 0 else 0
print(r.tell(), "eof", copied, "copied", bps, "bytes per second")

r.close()
a.close()
