
Add a file to a compressed tarfile


9 Comments; published: Wed, 30 Apr 2008 10:09:00 GMT

Hi,

I'm trying to write a function that adds a file-like object to a compressed tarfile... e.g. ".tar.gz" or ".tar.bz2".

I've had a look at the tarfile module, but the append mode doesn't support compressed tarfiles... :(

Any thoughts on what I can do to get around this?

Cheers!

All Comments
    • On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson <djdennie69.python.todaysummary.com.hotmail.com> wrote:

      > Hi,
      >
      > I'm trying to write a function that adds a file-like object to a compressed tarfile... e.g. ".tar.gz" or ".tar.bz2".
      >
      > I've had a look at the tarfile module, but the append mode doesn't support compressed tarfiles... :(
      >
      > Any thoughts on what I can do to get around this?
      >
      > Cheers!

      From the tarfile docs in Python 2.3:

      New in version 2.3.

      The tarfile module makes it possible to read and create tar archives. Some facts and figures:

      reads and writes gzip and bzip2 compressed archives.

      creates POSIX 1003.1-1990 compliant or GNU tar compatible archives.

      reads GNU tar extensions longname, longlink and sparse.

      stores pathnames of unlimited length using GNU tar extensions.

      handles directories, regular files, hardlinks, symbolic links, fifos, character devices and block devices, and is able to acquire and restore file information like timestamp, access permissions and owner.

      can handle tape devices.

      open([name[, mode[, fileobj[, bufsize]]]])

      Return a TarFile object for the pathname name. For detailed information on TarFile objects, see TarFile Objects (section 7.19.1).

      mode has to be a string of the form 'filemode[:compression]'; it defaults to 'r'. Here is a full list of mode combinations:

      mode action

      'r' Open for reading with transparent compression (recommended).

      'r:' Open for reading exclusively without compression.

      'r:gz' Open for reading with gzip compression.

      'r:bz2' Open for reading with bzip2 compression.

      'a' or 'a:' Open for appending with no compression.

      'w' or 'w:' Open for uncompressed writing.

      'w:gz' Open for gzip compressed writing.

      'w:bz2' Open for bzip2 compressed writing.

      Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.

      If fileobj is specified, it is used as an alternative to a file object opened for name.
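
      For illustration, a quick check on a modern Python (a sketch; the exact exception type and message vary across versions):

      import tarfile

      # Asking for compressed append fails up front:
      try:
          tarfile.open("archive.tar.gz", "a:gz")
      except (ValueError, tarfile.CompressionError) as e:
          print(e)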

      HTH,

      Martin.

      Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

      #1; Wed, 30 Apr 2008 10:10:00 GMT
    • On Fri, 05 Nov 2004 13:26:22 -0000, Martin Franklin <mfranklin1.python.todaysummary.com.gatwick.westerngeco.slb.com> wrote:

      > On Sat, 06 Nov 2004 00:13:16 +1100, Dennis Hotson <djdennie69.python.todaysummary.com.hotmail.com> wrote:

      <snip - useless info from myself>

      Sorry I just re-read your message after sending my reply...

      #2; Wed, 30 Apr 2008 10:11:00 GMT
    • On Fri, 05 Nov 2004 13:40:22 +0000, Martin Franklin wrote:

      > On Fri, 05 Nov 2004 13:26:22 -0000, Martin Franklin <mfranklin1.python.todaysummary.com.gatwick.westerngeco.slb.com> wrote:
      >
      > <snip - useless info from myself>
      >
      > Sorry I just re-read your message after sending my reply...

      Ahh ok... Yeah, I've already seen the docs... thanks anyway! :D

      I'm currently trying to read all of the files inside the tarfile... then writing them all back. Bit of a kludge, but it should work...
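
      On a modern Python that kludge comes out roughly like the sketch below (the helper name and signature are made up for illustration):

      import shutil
      import tarfile

      def append_to_targz(archive, name, fileobj, size):
          """Add one file-like object to a .tar.gz by rewriting the
          whole archive: decompress everything, write it all back."""
          tmp = archive + ".tmp"
          with tarfile.open(archive, "r:gz") as old, \
               tarfile.open(tmp, "w:gz") as new:
              # copy every existing member across unchanged
              for member in old.getmembers():
                  new.addfile(member, old.extractfile(member))
              # then append the new file-like object
              info = tarfile.TarInfo(name)
              info.size = size
              new.addfile(info, fileobj)
          shutil.move(tmp, archive)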

      Cheers!

      Dennis

      #3; Wed, 30 Apr 2008 10:12:00 GMT
    • Dennis Hotson <djdennie69.python.todaysummary.com.hotmail.com> writes:

      > I'm currently trying to read all of the files inside the tarfile... then writing them all back. Bit of a kludge, but it should work...

      There isn't really any other way. A tar file is terminated by two empty blocks. In order to append to a tar file you simply append a new tar file two blocks from the end of the original. If it was uncompressed you just seek back from the end and write, but if it's compressed you can't find that point without decompressing[1]. In some cases a more time-efficient but less space-efficient method would be to just compress individual files in a directory and then tar them up before the final distribution (or whatever you do with your tar file).

      Eddie

      [1] I think, unless there's a clever way of just decompressing the last few blocks.
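
      For the uncompressed case, the stdlib already does that seek-back: plain append mode positions the file over the two terminating zero blocks before writing. A minimal sketch (file names made up):

      import io
      import tarfile

      payload = b"hello world\n"
      info = tarfile.TarInfo("greeting.txt")
      info.size = len(payload)

      # 'a' seeks back over the end-of-archive blocks and rewrites them
      # after the new member; it only works on uncompressed archives.
      with tarfile.open("archive.tar", "a") as tar:
          tar.addfile(info, io.BytesIO(payload))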

      #4; Wed, 30 Apr 2008 10:13:00 GMT
    • eddie.python.todaysummary.com.holyrood.ed.ac.uk (Eddie Corns) wrote:

      > Dennis Hotson <djdennie69.python.todaysummary.com.hotmail.com> writes:
      >
      > There isn't really any other way. A tar file is terminated by two empty blocks. In order to append to a tar file you simply append a new tar file two blocks from the end of the original. If it was uncompressed you just seek back from the end and write, but if it's compressed you can't find that point without decompressing[1]. In some cases a more time-efficient but less space-efficient method would be to just compress individual files in a directory and then tar them up before the final distribution (or whatever you do with your tar file).
      >
      > Eddie
      >
      > [1] I think, unless there's a clever way of just decompressing the last few blocks.

      I am not aware of any such method. I am fairly certain gzip (and the associated zlib) does the following:

      while bytes remaining:
          reset/initialize state
          while state is not crappy and bytes remaining:
              compress portion of remaining bytes
              update state

      Even if one could discover the last reset/initialization of state, one would still need to decompress the data from then on in order to discover the two empty blocks.
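
      In other words, locating the two empty blocks means inflating the whole stream. A sketch of that search (the helper is hypothetical, not from the thread):

      import gzip

      def tar_data_end(path):
          """Decompress a .tar.gz fully and return the offset, in the
          uncompressed stream, of the two 512-byte zero blocks that
          terminate the tar archive."""
          with gzip.open(path, "rb") as f:
              data = f.read()
          end_marker = b"\0" * 1024
          # tar archives are 512-byte block aligned, so step by 512
          for off in range(0, len(data) - 1023, 512):
              if data[off:off + 1024] == end_marker:
                  return off
          return len(data)  # no terminator: archive is truncated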

      A 'resume compression friendly' algorithm would necessarily need to describe its internal state at the end of the byte stream. In the case of gzip (or other similar compression algorithms), really the only way this is reasonable is to just give an offset in the file to the last reset/initialization. Of course the internal state must still be regenerated from the remaining portion of the file (which may be the entire file), so it isn't really a win over just processing the entire file again with an algorithm that discovers when/where to pick up where it left off before.

      - Josiah

      #5; Wed, 30 Apr 2008 10:15:00 GMT
    • On Friday, 5 November 2004 at 19:19, Josiah Carlson wrote:

      > I am not aware of any such method. I am fairly certain gzip (and the associated zlib) does the following:
      >
      > while bytes remaining:
      >     reset/initialize state
      >     while state is not crappy and bytes remaining:
      >         compress portion of remaining bytes
      >         update state
      >
      > Even if one could discover the last reset/initialization of state, one would still need to decompress the data from then on in order to discover the two empty blocks.

      This is not entirely true... There is a full flush which is done every n bytes (n > 100000 bytes, IIRC), and can also be forced by the programmer. In case you do a full flush, the block which you read is complete as-is up to the point you did the flush.

      From the documentation:

      """flush([mode])

      All pending input is processed, and a string containing the remaining compressed output is returned. mode can be selected from the constants Z_SYNC_FLUSH, Z_FULL_FLUSH, or Z_FINISH, defaulting to Z_FINISH. Z_SYNC_FLUSH and Z_FULL_FLUSH allow compressing further strings of data and are used to allow partial error recovery on decompression, while Z_FINISH finishes the compressed stream and prevents compressing any more data. After calling flush() with mode set to Z_FINISH, the compress() method cannot be called again; the only realistic action is to delete the object."""

      Anyway, the state is reset to the initial state after the full flush, so that the next block of data is independent from the block that was flushed. So, you might start writing after the full flush, but you'd have to make sure that the compressed stream was of the same format specification as the one previously written (see the compression level parameter of compress/decompress), and you'd also have to make sure that the gzip header is suppressed, and that the FINISH compression block correctly reflects the data that was appended (because you basically overwrite the finish block of the first compress).

      Little example (the interactive input lines were lost in the archive; only the outputs survive):

      <zlib.Compress object at 0xb7e39de0>
      'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
      '\x03\x00^\x84^9'
      'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
      '\x03\x00^\x84^9'
      480 # Two times 240 = 480.
      'haha...' # Rest stripped for clarity.

      So, as far as this goes, it works. But:

      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      zlib.error: Error -3 while decompressing: incorrect data check

      You see here that if you append the new end-of-stream marker of the second block (which is written by x.flush(zlib.Z_FINISH)), the data checksum is broken, as the data checksum is always written for the entire data, but leaving out the end-of-stream marker doesn't cause data-decompression to fail.
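
      Since the original input lines are lost above, here is a reconstruction of the kind of session being described, on a modern Python (a sketch; the bytes literals and the header-stripping detail are assumptions, not the original code):

      import zlib

      # First stream: ended with a full flush instead of Z_FINISH.
      c1 = zlib.compressobj()
      p1 = c1.compress(b"haha" * 60) + c1.flush(zlib.Z_FULL_FLUSH)

      # Second stream: also full-flushed, with its FINISH block kept aside.
      c2 = zlib.compressobj()
      p2 = c2.compress(b"haha" * 60) + c2.flush(zlib.Z_FULL_FLUSH)
      t2 = c2.flush(zlib.Z_FINISH)  # empty final block + adler32 of c2's data

      # Strip c2's 2-byte zlib header so the deflate blocks splice cleanly.
      spliced = p1 + p2[2:]

      d = zlib.decompressobj()
      print(len(d.decompress(spliced)))   # 480: the data itself survives

      # But the checksum in t2 only covers the second block's 240 bytes,
      # while zlib verifies it against all 480 decompressed bytes:
      zlib.decompress(spliced + t2)       # zlib.error: incorrect data check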

      I know too little about the internal format of a gzip file (which appends more header data, but otherwise is just a zlib compressed stream) to tell whether an approach such as this one would also work on gzip files, but I presume it should.

      Hope this little explanation helps!

      Heiko.

      #6; Wed, 30 Apr 2008 10:15:00 GMT
    • Thanks Heiko, that's really interesting...

      To tell you the truth though, I'm not that familiar with the structure of tar or gzip files. I've got a much better idea of how it works now though. :D

      I managed to get my function working... although it decompresses everything and then compresses it back... Not the best, but good enough I think.

      Speed isn't a huge issue in my case anyway, because this is for a web app I'm writing... It's a directory tree which allows people to download and upload files into/from directories, as well as compressed archives.

      Anyway... thanks a lot for your help. I really appreciate it. Cheers mate! :)

      #7; Wed, 30 Apr 2008 10:16:00 GMT
    • Heiko Wundram <heikowu.python.todaysummary.com.ceosg.de> wrote:

      > On Friday, 5 November 2004 at 19:19, Josiah Carlson wrote:
      >
      > This is not entirely true... There is a full flush which is done every n bytes (n > 100000 bytes, IIRC), and can also be forced by the programmer. In case you do a full flush, the block which you read is complete as-is up to the point you did the flush.

      [snip explanation]

      Thank you for the great information!

      So it seems that one would still need to do the following in order to get tgz appending done (a sketch follows the list):

      1. Find the last compressed section of the tar file.
      2. Invert the checksum (CRC32 is easy) to the end of the usable tarfile.
      3. Take note of and adjust the size provided in the gzip footer.
      4. Seek to the end of the usable tarfile.
      5. Write a Z_FULL_FLUSH to start on a new block.
      6. Write the new compressed data, and make sure you keep track of the checksum (either by injecting it into zlib and/or gzip some way, or manually computing it).
      7. Write a Z_FINISH, and update/write the checksum and size trailers.
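
      The trailer mechanics of steps 3 and 7, the CRC32 and size (ISIZE) fields, can be seen by building a gzip member by hand. A hedged sketch (header fields per RFC 1952; this is not the full appending plan, just the bookkeeping it would have to update):

      import struct
      import time
      import zlib

      def gzip_member(data):
          """Build a single gzip member by hand: fixed 10-byte header,
          raw deflate body, then the CRC32 + ISIZE trailer."""
          header = (b"\x1f\x8b"       # magic number
                    b"\x08"           # CM: deflate
                    b"\x00"           # FLG: no optional fields
                    + struct.pack("<I", int(time.time()))  # MTIME
                    + b"\x00"         # XFL
                    + b"\xff")        # OS: unknown
          comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate
          body = comp.compress(data) + comp.flush(zlib.Z_FINISH)
          trailer = struct.pack("<II",
                                zlib.crc32(data) & 0xffffffff,  # CRC32
                                len(data) & 0xffffffff)         # ISIZE
          return header + body + trailer

      Appending in place means splicing new deflate data before the final block and then rewriting exactly that trailer, which is the fiddly part of steps 2, 3 and 7.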

      All in all, it doesn't look too hard. I think such a thing could be done in an afternoon, and would be a truly nifty addition to the Python standard library.

      BZip2, on the other hand, looks to be nice because of the block structure, but each block is Huffman coded, so it may not be possible to discover the final block very easily (also, the file format isn't leaping out at me from the BZip2 docs).

      - Josiah

      #8; Wed, 30 Apr 2008 10:18:00 GMT
    • Dennis Hotson wrote:

      > I managed to get my function working... although it decompresses everything and then compresses it back... Not the best, but good enough I think.

      If you want a solution that allows appending files to an archive while still compressing them, take a look at FileNode, a module that has been added to the latest PyTables package (www.pytables.org). You can see the documentation (and tutorials) for the module here:

      http://pytables.sourceforge.net/html-doc/c3616.html

      It supports the zlib, ucl and lzo compressors, as well as the shuffle compression pre-conditioner.
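
      The module has moved around between PyTables releases; with a current PyTables the equivalent usage looks roughly like this (a sketch of the modern tables.nodes.filenode API, which may differ from the interface described above):

      import tables
      from tables.nodes import filenode

      # an HDF5 container whose contents are zlib-compressed
      h5 = tables.open_file("archive.h5", "w",
                            filters=tables.Filters(complib="zlib", complevel=5))

      # each stored file is a writable, appendable file-like node
      node = filenode.new_node(h5, where="/", name="readme")
      node.write(b"hello\n")
      node.close()
      h5.close()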

      HTH,

      Francesc Altet

      #9; Wed, 30 Apr 2008 10:18:00 GMT