Discussion Forums
[ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: greg13070
Posted on: Sep 23, 2006 8:48 PM
This is a Ruby program that easily transfers directories between a local
directory and an S3 bucket:prefix. It behaves somewhat, but not precisely, like
the rsync program.

One benefit over some other comparable tools is that s3sync goes out of its way
to mirror the directory structure on S3.  Meaning you don't *need* to use s3sync
later in order to view your files on S3.  You can just as easily use an S3
shell, a web browser (if you used the --public-read option), etc.  Note that
s3sync is NOT necessarily going to be able to read files you uploaded via some
other tool.  This includes things uploaded with the old perl version!  For best
results, start fresh!

s3sync runs happily on Linux, probably other *ix, and also Windows (except that
symlinks and permissions management features don't do anything on Windows).

For more information, check out:
    http://s3.amazonaws.com/ServEdge_pub/s3sync/README.txt
and to download s3sync along with its assorted ruby libraries:
    http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz

Let me know what you think, how it works for you, what features are missing, etc.

I hope one day to support incremental backups to S3, but I haven't figured out a satisfactory way to do that yet (it really needs key renaming to work well). 

G
Replies: 243 | Pages: 10 | Last Post: Jan 13, 2009 10:18 AM by: m3networks
Replies
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: Bryan Pendleton
Posted on: Sep 25, 2006 12:30 AM
in response to: greg13070
This looks pretty good, but I haven't gotten it to work yet. I keep getting "S3 command failed: .... With result 400 Bad Request". Any idea where to look for what's going wrong? The key/secret key both appear to be right.

Also - not sure if this will show up with the ruby implementation, but I got stuck doing a backup with the perl implementation because it was following symlinked directories in recursion. Somewhere in my system, there was a stupid directory that contained a symlink to itself... I haven't fixed that symlink yet (I hesitate to delete symlinks in system directories that I didn't create), so, once I get the ruby version going, I'll be looking to see if that still goes wrong.

Thanks, though! This looks really great!
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: greg13070
Posted on: Sep 25, 2006 1:01 AM
in response to: Bryan Pendleton
I didn't get that error at all during my testing.  A 400 usually means a badly formatted HTTP request.  Do you have the capability to trace the tcp connection with tcpdump or similar and send it to me?  (Make sure you don't use --ssl, or else it will all be encrypted and impossible to analyze).

If you do this, also make sure to send the command line you're using and any env variables you have set (except your secret key, don't ever tell anyone that!)

As for symlinks.. No recent version of s3sync (ruby or perl > 0.3) should have ever "followed" symbolic links.  Only the original implementation did that, before I got to it.  If something is chasing symlinks, then it's definitely a bug.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: Bryan Pendleton
Posted on: Sep 25, 2006 3:41 PM
in response to: greg13070
Seems to have something to do with symlinks. I've had it happen several times on different source files, all of which are symlinks. The example I PM'd you stopped on the first file because the first file was, in fact, a symlink. Other attempts I've made make it further, but eventually get stuck on some symlink along the way.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: greg13070
Posted on: Sep 25, 2006 9:51 PM
in response to: Bryan Pendleton
I can't tell anything from the tcpdump because the capture was only keeping a few bytes of each packet body.  Please use -s 0 so the packet body is not truncated.
Re: [ANNOUNCE] s3sync 1.0.1
Posted by: greg13070
Posted on: Sep 28, 2006 11:10 PM
in response to: greg13070
2006-09-29:
Added support for --expires and --cache-control. Eg:
--expires="Thu, 01 Dec 2007 16:00:00 GMT"
--cache-control="no-cache"

Thanks to Charles for pointing out the need for this, and supplying a patch
proving that it would be trivial to add =) Apologies for not including the short
form (-e) for the expires. I have a rule that options taking arguments should
use the long form.

http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
Re: [ANNOUNCE] s3sync 1.0.1
Posted by: J. Levine
Posted on: Oct 2, 2006 1:53 PM
in response to: greg13070
Thanks for this awesome tool; honestly, it's the only Linux command-line S3 backup tool I could get to work in anything resembling a reliable fashion, and for that I thank you doubly.

One thing: in synchronizing a directory of files, s3sync.rb is consistently failing on one file that's 33 MB in size; I get the following error:

<blockquote> put jlevine-backup provolone/mysqlbackup/20061002/spamassassin.sql.bz2 #<S3::S3Object:0xb7e81fe4> Content-Type application/x-bzip2 Content-Length 33731296
With result 400 Bad Request
</blockquote>
Is there a known problem with large files, or am I experiencing another kind of error and incorrectly ascribing it to the file size?
Re: [ANNOUNCE] s3sync 1.0.1
Posted by: greg13070
Posted on: Oct 2, 2006 3:16 PM
in response to: J. Levine
I think there is a "known but not understood yet" bug that can cause 400 errors.  Can you help me test by backing up just the directory containing that file?  Make sure you do not use SSL (or else the connections will be impossible for me to inspect), and run this during the test:

tcpdump -p -s 0 -w tcpdump.cap -i eth0

(assuming your net card interface is called eth0; edit appropriately)

If you can post, PM, or email me a zip of the log file produced by tcpdump, I can hopefully find out what I'm doing wrong that creates a malformed request (which is what the 400 error means).

<span style="color: #ff0000">Please be advised that any confidential information in the files you transfer will be plainly visible to anyone who views the tcpdump capture file.</span>

Your help is appreciated :)
Re: [ANNOUNCE] s3sync 1.0.1
Posted by: J. Levine
Posted on: Oct 2, 2006 4:29 PM
in response to: greg13070
I just replied to you privately, with a link to the tcpdump file.  It appears that Amazon is timing out the connection, and that s3sync isn't successful in re-establishing the connection...
[ANNOUNCE] s3sync 1.0.2
Posted by: greg13070
Posted on: Oct 2, 2006 8:29 PM
in response to: J. Levine
New version is out; it contains a fix for fail/retry situations.  I also turned off the debug messages about HTTP streaming, which I'd forgotten to remove before.

I recommend that all users update; make sure your new version reads 1.0.2 or greater (it's in the upper right corner of the 'usage' output and also near the beginning of the s3sync.rb file).

The archive is still hosted at:
http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
Re: [ANNOUNCE] s3sync 1.0.2
Posted by: "santacruztech"
Posted on: Oct 4, 2006 11:17 PM
in response to: greg13070
Thanks, your work is much appreciated. :)
[ANNOUNCE] s3sync 1.0.4
Posted by: greg13070
Posted on: Oct 5, 2006 8:36 AM
in response to: greg13070
A few bugs fixed; everyone should update.
http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz

By the way, now that AWS has addressed their keepalive issue, I'm proud to say that s3sync supports persistent connections.  This will measurably decrease latency, especially with SSL sessions.

It's possible that it will uncover more bugs as well of course :)

There are two known issues right now that I haven't addressed yet.   They seem to be more nuisance than show-stopper, so I probably won't get to them immediately.  This is what they look like:

<blockquote> Create node servers/mail.servedge.com/roundcubemail-svn/skins/default/images/buttons/.svn/props/ldap_pas.png.svn-work
/usr/lib/ruby/1.8/net/protocol.rb:133:in `sysread': Connection reset by peer (Errno::ECONNRESET)
        from /usr/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
        <snip>
        from ./s3sync.rb:520

Create node servers/mail.servedge.com/phpmyadmin
S3 command failed:
put ServEdge mail_Wed/home/servers/mail.servedge.com/phpmyadmin #<S3::S3Object:0x4065a21c> Content-Length 38
With result 400 Bad Request
</blockquote>
The first one seems self-explanatory.  I'm not catching connection resets.  The second one only occurs when trying to create a node describing a directory, and only for certain directories.  The contents (recursively) of that directory still get stored fine.
Re: [ANNOUNCE] s3sync 1.0.4
Posted by: Bryan Pendleton
Posted on: Oct 6, 2006 11:42 AM
in response to: greg13070
1.0.4 is going really great... A couple of things:

Bugs/issues:
1) What's the nature of the issue on the directories not getting stored? Will I (I assume) lose directory permissions on restores if they never get stored?
2) There's still a problem with following symlinks. One of my directories has a symlink to itself (subdirname -> .), which is getting followed recursively. This means that the backup will both never terminate, and that I end up storing a lot of copies of that directory in S3. Hrm.

Features/wants:
1) While I'm at it - multiple simultaneous transfers? It seems like S3 often has an incoming I/O limit, which is presumably bypassable with parallel sends. Maybe allow a settable number of simultaneous transfers to be started?
2) gzip compression on send? If you set the metadata right, it sounds like most browsers will transparently decompress gzip-compressed files, but it, of course, will often take less space to store files in s3 this way. Faster transfers/less storage cost sounds like a win, to me.

Thanks!
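(To illustrate the gzip idea: a file could be compressed in memory before upload and stored with Content-Encoding: gzip so browsers decompress it transparently. The helper below is purely hypothetical, not part of s3sync; it's just a sketch of the compression step using Ruby's standard Zlib.)

```ruby
require 'zlib'
require 'stringio'

# Compress data in memory before upload.  The stored object would then
# carry Content-Encoding: gzip metadata.  (Hypothetical helper, not an
# actual s3sync function.)
def gzip_for_upload(data)
  buf = StringIO.new
  gz = Zlib::GzipWriter.new(buf)
  gz.write(data)
  gz.close
  buf.string
end

# Inverse, e.g. for verifying a round trip locally.
def gunzip(data)
  Zlib::GzipReader.new(StringIO.new(data)).read
end
```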
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: watercannon
Posted on: Oct 7, 2006 4:44 PM
in response to: greg13070
Hi,

It wasn't clear from reading the README and the source whether s3sync copies all specified files in full, or whether it only transfers changed files (whether judged on the last time s3sync was successfully run, or by comparing the modification times of the local and s3 files).
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: J. Levine
Posted on: Oct 7, 2006 7:29 PM
in response to: watercannon
I certainly won't answer for Greg, but I can't imagine that s3sync is able to only transfer file changes, since that would involve one of two things:

- a sync-like service running at Amazon S3's end that's able to cooperate with your local s3sync to determine what's changed in each file, or
- s3sync would have to download each file, check each for changes, and then upload those changes.

The first doesn't exist. The second wouldn't be of any assistance, since you'd actually *increase* the bandwidth being used -- you'd transfer the file in full back from S3, and then add on the transfer back to S3 of any file changes.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: Tomas Markauskas
Posted on: Oct 8, 2006 1:20 AM
in response to: J. Levine
But you can compare the timestamps, so only the older files will be overwritten.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: watercannon
Posted on: Oct 8, 2006 3:03 AM
in response to: J. Levine
Thanks for the reply.

Although, as you say, a proper diff would not be possible or efficient, it'd save a lot of bandwidth if either of the two methods I mentioned were coded:

1. A local file records the timestamp of the last successful s3sync run, and s3sync automatically filters the provided list of files & directories so that only files created or modified after that timestamp are sent to S3, or

2. Each file in the S3 archive has an update timestamp that can be retrieved and compared to the last modified time on the local version of the file.

The second method is more complex, but can handle partial syncs.

I'm looking to use s3sync as a daily backup of my most important data, and it'd take 1000 times the bandwidth to transfer all data daily compared to transferring just the files I've added and changed that day.

I know Ruby, so can help code up an option that enables use of Method 1, but I don't currently know enough about S3 to work out whether Method 2 is feasible or easy to implement.

How does s3sync compare to http://www.jungledisk.com ?
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: Martin Kochanski
Posted on: Oct 8, 2006 9:01 AM
in response to: watercannon
Cardbox uses your Method 2 for its backups and it works very well.

The only special thing we had to do was define our own x-amz- header for the update timestamp, since S3 only stores the time at which the backup was written to S3: in pathological cases this might not be enough.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: S. Matzke
Posted on: Oct 8, 2006 9:40 AM
in response to: Martin Kochanski
Another method one could use to see if a file was changed is the MD5 hash (in the Etag header). I think a combination of both (modified and etag) would be the best way, as it would catch things like "touch /a/b/c" (date is modified, but not the content) too.


Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: Martin Kochanski
Posted on: Oct 9, 2006 3:44 AM
in response to: S. Matzke
Is the content of the Etag header guaranteed to be always exactly an MD5 hash of the entire content of the object and of nothing else? I seem to remember, the last time I looked at the HTTP specifications, that Etag could be anything the server happened to find interesting at the time, as long as it was going to identify the object uniquely. MD5 happens to be one way of doing this. But my understanding is that if Amazon suddenly started using a different hash, and incorporating (let's say) the creation time in it as well, this would still be entirely valid in HTTP terms.

If S3 finds it convenient to use the MD5 as an Etag then it isn't safe to use it; if it undertakes to use the MD5 as an Etag (and nothing else, ever) then we can use it ourselves.

But otherwise it might be better to add another x-amz- header to contain our very own MD5.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: S. Matzke
Posted on: Oct 9, 2006 3:56 AM
in response to: Martin Kochanski
Do as you like... I just wanted to point out that only using the modified-date of a file might not be enough to recognize if the file changed or not. Use the modified date, the filesize and any kind of (secure) hash over the contents and you're pretty much on the safe side.
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: greg13070
Posted on: Oct 9, 2006 3:52 PM
in response to: watercannon
watercannon:
s3sync compares the Etag of the S3 object with the MD5 sum of the local object.  This is incredibly efficient, but if S3 stops using md5 for its etags, then another approach will be needed.  Modification time would be one obvious way.


I know it's using an unsupported feature this way, but you know, I am just a dangerous kinda guy that way. :)
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: John D. Eberly
Posted on: Oct 10, 2006 4:53 PM
in response to: greg13070
I put together a post on how I used s3sync and the java GUI "cockpit" to automate and monitor my backups to Amazon S3.
http://blog.eberly.org/2006/10/09/how-automate-your-backup-to-amazon-s3-using-s3sync/

Obviously most of you reading this forum wouldn't need something like this, but hopefully it could help someone like myself who was looking for a simple low-level way to automate their backups using s3sync. 

Thanks again Greg for s3sync.

John
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: greg13070
Posted on: Oct 11, 2006 10:38 AM
in response to: John D. Eberly
John,
This is a great walkthrough.  We definitely need things like this so new users have some good end-to-end resources available. 
Re: [ANNOUNCE] s3sync 1.0.0 using Ruby!
Posted by: "ardent-acct"
Posted on: Oct 12, 2006 5:12 PM
in response to: greg13070
Hello.  I have Ruby 1.8.4; I signed up for S3 and have it working fine with Jungle Disk.

Now I wanted to create a sync of a file structure (/foo for now)

so my command is:
s3sync.rb -d -n -v -r /foo someNewBucket:/backups

Doing so gives me a 404 Not Found.  When I tried using the same bucket names as in Jungle Disk, I got 403 errors.  I'm missing something very obvious here; any help appreciated.

Thanks