EPITOME(1)                 OpenBSD Reference Manual                 EPITOME(1)

NAME
     epitome - deduplication services

DESCRIPTION
     The epitome suite consists of several discrete pieces that provide stor-
     age deduplication services.  Deduplication is defined as the elimination
     of redundant data.

     epitome provides a number of services to enable three major archiving
     technologies: CAS, SIS, and DEDUP.  Since these three terms are often
     (ab)used, epitome defines them as follows:

     CAS (Content Addressable Storage)

        CAS, also referred to as associative storage, is a mechanism for
        storing information that can be retrieved based on its content, not
        its storage location.  It is typically used for high-speed storage
        and retrieval of fixed content, such as medical content and documents
        stored for compliance with government regulations.

        CAS is a method to archive content and provide the issuer a UUID for
        identification at a later time.  The user or application using this
        service is responsible for maintaining the UUID to content mapping
        (e.g. UUID -> file).

        CAS is not traditionally associated with dedup technologies; however,
        in epitome it is a trivial addition.  An added benefit when using CAS
        is that content is never stored more than once on the physical media
        because the chunks that make up the content are automatically deduped.
        Inherent to this mechanism is also a data integrity element.  The
        unique fingerprint that identifies the data doubles as a hash for the
        actual content.

        In an ideal world CAS is a mountable write-once filesystem that
        operates transparently to the user and lets the user associate
        policy with the content.  All stored content is immutable.
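
        The identifier-doubles-as-hash property described above can be
        illustrated with a minimal sketch.  It is not epitome's
        implementation: the class name ContentStore, the use of SHA-256 as
        the fingerprint, and the in-memory dictionary are all assumptions
        made for illustration.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, not epitome's API)."""

    def __init__(self):
        self._objects = {}  # fingerprint -> content bytes

    def put(self, content: bytes) -> str:
        # The identifier is derived from the content itself (SHA-256 is an
        # assumption; the manual does not name a hash function).
        fingerprint = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._objects.setdefault(fingerprint, content)
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        content = self._objects[fingerprint]
        # Integrity check: stored bytes must still hash to their identifier.
        if hashlib.sha256(content).hexdigest() != fingerprint:
            raise ValueError("stored content does not match its fingerprint")
        return content

    def __len__(self) -> int:
        return len(self._objects)
```

        Storing the same bytes twice returns the same identifier and leaves
        a single stored object, and every read re-verifies the content
        against that identifier.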

     SIS (Single Instance Storage)

        SIS is essentially application-level deduplication and is best
        explained with an example.  Consider user A sending user B an email;
        minus the mail header, every copy of the email is identical, so a
        client (in this case the mail server) that uses SIS would save the
        content only once.  Now imagine this email being sent to 100 people;
        at that point the savings are considerable.

        The big difference between SIS and the other techniques is that the
        application using it must conform to an API and be written specifical-
        ly to interface with the dedup system.
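
        The mail example can be sketched as follows.  This is only an
        illustration of the idea, assuming a hypothetical
        SingleInstanceMailStore class and SHA-256 as the body key; it does
        not reflect the actual epitome API.

```python
import hashlib


class SingleInstanceMailStore:
    """Hypothetical mail store that keeps each distinct body once."""

    def __init__(self):
        self.bodies = {}     # body hash -> body bytes (stored once)
        self.mailboxes = {}  # recipient -> list of (header, body hash)

    def deliver(self, recipient: str, header: str, body: bytes) -> None:
        key = hashlib.sha256(body).hexdigest()
        # Only the first delivery of a given body consumes body storage.
        self.bodies.setdefault(key, body)
        # The per-recipient header is kept separately and points at the body.
        self.mailboxes.setdefault(recipient, []).append((header, key))

    def read(self, recipient: str, index: int):
        header, key = self.mailboxes[recipient][index]
        return header, self.bodies[key]
```

        Delivering one message to 100 recipients through deliver() leaves
        100 header entries but a single entry in bodies.  Note that the
        application itself has to call into the store, which is the
        distinguishing trait of SIS described above.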

     DEDUP (Deduplication)

        DEDUP is really the underlying technique for all incarnations of
        hash-based storage.  It uses a chunk-based hash to determine what is
        a duplicate.  Where CAS has a content <-> hash association, DEDUP is
        purely a hash for some arbitrary block of some arbitrary size.
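
        A minimal sketch of chunk-based deduplication follows.  The
        fixed 4096-byte chunk size, the SHA-256 hash, and the function
        names dedup_store and restore are assumptions chosen for
        illustration; the manual does not specify how epitome chunks or
        hashes data.

```python
import hashlib


def dedup_store(data: bytes, chunks: dict, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks, store unseen chunks, return the hash list."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        # A chunk whose hash is already present is a duplicate and is skipped.
        chunks.setdefault(digest, chunk)
        recipe.append(digest)
    return recipe


def restore(recipe: list, chunks: dict) -> bytes:
    """Rebuild the original data from its ordered list of chunk hashes."""
    return b"".join(chunks[digest] for digest in recipe)
```

        The returned list of hashes is enough to rebuild the data, while
        the chunks dictionary holds each distinct chunk only once.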

     The limits of any of these technologies are directly linked to the amount
     of metadata that is saved.  The more metadata is saved, the more
     capabilities can be implemented.  For example, to make a CAS system there
     needs to be some sort of UUID -> blocks association (e.g. a catalog).
     The design of epitome allows for flexibility on the back-end so that one
     or more front-ends can be used, provided that storage space for the
     additional metadata is available and that the computational overhead is
     acceptable.
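
     The UUID -> blocks association mentioned above can be sketched as a
     small catalog layered over a chunk store.  Everything here is
     hypothetical: the Catalog class, the random UUIDs, the SHA-256 chunk
     hashes, and the default chunk size are assumptions, not epitome's
     metadata format.

```python
import hashlib
import uuid


class Catalog:
    """Hypothetical CAS front-end: UUID -> ordered chunk hashes over a dedup back-end."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # back-end: chunk hash -> chunk bytes
        self.entries = {}  # extra metadata: UUID -> list of chunk hashes

    def archive(self, data: bytes) -> uuid.UUID:
        hashes = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # duplicates are not re-stored
            hashes.append(digest)
        ident = uuid.uuid4()  # handed back to the caller, who tracks UUID -> file
        self.entries[ident] = hashes
        return ident

    def retrieve(self, ident: uuid.UUID) -> bytes:
        return b"".join(self.chunks[digest] for digest in self.entries[ident])
```

     The entries dictionary is the additional metadata: it costs space per
     archived object, but without it the back-end holds only anonymous
     chunks and cannot return a whole object by identifier.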

     The idea behind epitome is to provide a WORM (write once, read many)
     based archive/backup mechanism that is lossless and offers permanent
     storage with inherent data protection properties.  Additionally, epitome
     provides several metadata formats and back-ends to suit different usage
     models.

SEE ALSO
     epitomize(1), eprepare(1), epitome(3)

HISTORY
     epitome first appeared in OpenBSD 4.5.

AUTHORS
     The epitome suite was written by Marco Peereboom <marco@peereboom.us>.

CAVEATS
     The epitome suite is currently considered experimental.

     Not everything mentioned in this manual has been implemented yet.

OpenBSD 4.4                     October 6, 2008