## Storage Model
The storage model is simple:
```
Entity {
  Id
  Revision {
    Revision-Id,
    Property*
  }+
}*
```

The store consists of entities that each have an id and a set of properties. Each entity can have multiple revisions.

An entity is uniquely identified by:

* Resource + Id

The additional revision identifies a specific instance/version of the entity.

URI scheme:
```
sink://resource/id:revision
```

## Store Entities
Each entity can be as normalized/denormalized as is useful; it is not necessary to have one solution that fits everything.

Denormalized:

* the priority is that the MIME message stays intact (signatures/encryption)

```
Mail {
  id
  mimeMessage
}
```

Normalized:

* the priority is that we can access individual members efficiently
* we don't care about exact reproducibility of e.g. an iCal file

```
Event {
  id
  subject
  startDate
  attendees
  ...
}
```

Of course any combination of the two can be used, including duplicating data into individual properties while keeping the complete struct intact.

#### Optional Properties
For each domain type, we want to define a set of required and a set of optional properties. The required properties are the minimum bar for each resource, and are required for applications to work as expected. Optional properties may only be shown by the UI if they are actually supported by the backend.

However, we'd like to be able to support local-only storage for resources that don't support an optional property. Each entity therefore has a "local" buffer that provides default local-only storage. This local-only buffer provides storage for all properties of the respective domain type.
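The URI scheme above can be illustrated with a small parser. This is a hypothetical helper for illustration only (`EntityRef` and `parse_sink_uri` are not part of sink's actual API); it shows how the resource, entity id, and optional revision are encoded:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityRef:
    resource: str
    entity_id: str
    revision: Optional[int]  # None means "the latest revision"


def parse_sink_uri(uri: str) -> EntityRef:
    """Split a 'sink://resource/id:revision' URI into its components."""
    prefix = "sink://"
    if not uri.startswith(prefix):
        raise ValueError("not a sink URI: " + uri)
    rest = uri[len(prefix):]
    resource, _, tail = rest.partition("/")
    entity_id, sep, rev = tail.partition(":")
    return EntityRef(resource, entity_id, int(rev) if sep else None)
```

For example, `parse_sink_uri("sink://imap.example/mail42:7")` yields the entity `mail42` of resource `imap.example` at revision 7, while omitting the `:revision` suffix refers to the latest revision.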
Each resource can freely define how the properties are split, though it should push as many as possible into the synchronized side so they can be synchronized. Note that the resource is free to add more properties to its synchronized buffer even though they may not be required by the specification.

The advantage of this is that a resource only needs to specify a minimal set of properties, while everything else is taken care of by the local-only buffer. This is supposed to make it easier for resource implementers to get something working.

### Value Format
Each entity value in the key-value store consists of the following individual buffers:

* Metadata: metadata that is required for every entity (revision, ...)
* Local: the default storage buffer, which is domain-type specific.
* Resource: the buffer defined by the resource (additional properties, values that help with synchronization)

## Database
### Database Layout
Storage is split up into multiple named databases that reside in the same database environment.

```
$DATADIR/storage/$RESOURCE_IDENTIFIER/
```

* $BUFFERTYPE.main: the primary store for a type
* $BUFFERTYPE.index.$PROPERTY: secondary indexes
* revisionType: allows looking up the type by revision, to find the correct primary or secondary databases.
* revisions: allows looking up the entity id by revision

The resource can be effectively removed from disk (besides its configuration) by deleting the directories matching `$RESOURCE_IDENTIFIER*` and everything they contain.

#### Design Considerations
The stores are split by buffer type so that a full scan (which is done by type) doesn't require filtering by type first. The downside is that an additional lookup is required to get from a revision to the data.

### Revisions
Every operation (create/delete/modify) leads to a new revision.
The revision is a monotonically increasing number for the complete store.

Each entity is stored under a key consisting of its id and the revision. This way it is possible to look up older revisions.

Removing an entity simply results in a new revision of the entity recording the removal.

Secondary indexes always refer to the latest revision.

#### Design Considerations
By having one revision for the complete store instead of one per type, change replay always works across all types. This is especially important for the write-back mechanism that replays the changes to the source.

### Revision cleanup
Because the store would otherwise grow indefinitely, old revisions need to be removed.
The resource maintains a "lower bound revision", which is the lowest revision of any change-replaying component (such as clients and write-back).
Below that lower bound, the cleanup removes any revision that:

* is a delete command (the entity no longer exists)
* has a newer revision for the same entity (the revision is outdated)

By doing cleanups continuously, we avoid keeping outdated data.

### BLOB properties
Files are used to handle opaque large properties that should not end up in memory. BLOB properties are by their nature never queryable (extract parts of them into other properties if indexes are required).

For reading:

Resources...

* store the file in `$DATADIR/storage/$RESOURCE_IDENTIFIER.files/`
* store the filename in the BLOB property.
* delete the file when the corresponding entity is deleted.

Queries...

* copy the requested property to `/tmp/sink/client_files/` and provide the path in the property.
* the file is guaranteed to exist for the lifetime of the query result.

Clients...

* load the file from disk and use it as they wish (moving it is fine too).

For writing:

Clients...
* request a path from sink and store the file there.
* store the path of the written file in the property.

Resources...

* move the file to `$DATADIR/storage/$RESOURCE_IDENTIFIER.files/`
* store the new path in the entity.

#### Design Considerations
Using regular files as the interface has these advantages:

* Existing mechanisms can be used to stream data directly to disk.
* The necessary file operations can be handled efficiently depending on the OS and filesystem.
* We avoid reinventing the wheel.

The copy is necessary to guarantee that the file remains available for the client/resource even if the resource removes the file on its side as part of a sync.
The copy could be optimized by using hardlinks, although that is not a portable solution. On some next-generation copy-on-write filesystems copying is a very cheap operation.

A downside of the file-based design is that it is not possible to stream directly from a remote resource into application memory; the data always has to go via a file.

## Database choice
By design we're interested in key-value stores or perhaps document databases. This is because a fixed schema is not useful for this design, which makes SQL not very useful (it would just be a very slow key-value store). While document databases would allow for indexes on certain properties (which is something we need), we have not yet found any contenders that looked useful for this system.

### Requirements
* portable; minimally Linux, Windows, and Mac OS X
* multi-thread and multi-process concurrency with a single writer and multiple readers.
    * This is required so we don't block clients while a resource is writing, and is deemed essential for performance and to reduce complexity.
* reasonably fast, so we can implement all necessary queries with sufficient performance
* can deal with large amounts of data
* on-disk storage with ACID properties
* memory consumption suitable for a desktop system (no in-memory stores)

Other useful properties:

* suitability for implementing some indexes (the fewer tools we need the better)
* support for transactions
* small overhead in on-disk size

### Contenders
* LMDB
    * support for mmapped values
    * good read performance, OK write performance
    * fairly complex API
    * up to double the storage size due to paging (with a 4k page size, 4001-byte values are the worst case)
    * size limit of 4GB on 32-bit systems, virtually no limit on 64-bit (leading to 2GB of actual payload with write amplification)
    * limited key-search capabilities
    * ACID transactions
    * MVCC concurrency
    * no compaction; the database always grows (pages get reused but are never freed)
* Berkeley DB (BDB)
    * performance is supposedly worse than LMDB (LMDB was written as a successor to BDB for OpenLDAP)
    * Oracle sits behind it (it has an AGPL licence though)
* RocksDB
    * => no multi-process support
* Kyoto Cabinet <http://fallabs.com/kyotocabinet/>
    * fast, low on-disk overhead, simple API
    * => no multi-process support
    * GPL
* hamsterdb
    * => no multi-process support
* SQLite4
    * not yet released
* BangDB
    * not yet released as open source; looks promising on paper
* Redis
    * => loads everything into memory
    * => not embeddable
* CouchDB
    * MVCC concurrency
    * document store
    * not embeddable (unless we write sink in Erlang ;)
* <https://github.com/simonhf/sharedhashfile>
    * not portable (i.e. Windows; it's a mostly-Linux thing)
* <http://sphia.org/architecture.html>
    * => no multi-process support
* LevelDB
    * => no multi-process support
* EJDB <http://ejdb.org/#ejdb-c-library>
    * modified version of Kyoto Cabinet
    * => multi-process access requires locking, so effectively no multi-process support
    * is more of a document store
    * no updates since September 2013
* <http://unqlite.org>
    * bad performance with a large database (looks like O(n))
    * like LMDB, roughly 2\*datasize on disk
    * includes a document store
    * mmapped read access
    * reads at about 30% of the speed of LMDB
    * slow writes with transactions

### Result
LMDB was chosen as one of the few contenders that is embeddable and has true multi-process support. It also outperformed UnQLite significantly, although its write performance and disk usage aren't ideal.

### Indexes
In addition to the primary store, indexes are required for efficient lookups.

Since indexes always need to be updated, they directly affect how fast we can write data. When reading, typically only a subset of the available indexes is used, so a slow index doesn't affect all queries.

#### Contenders
* Xapian:
    * fast full-text searching
    * no MVCC concurrency
    * only supports one writer at a time
    * if a reader is reading blocks that have since been changed by a writer, it throws a DatabaseModifiedException. This means most of the Xapian code needs to be in `while (1) { try { .. } catch () }` blocks and needs to be able to start from scratch.
    * wildcard searching (as of 2015-01) isn't ideal. It works by expanding the word into all other matching words in the query, which typically makes the query size huge. This huge query is then sent to the database. Baloo has had to configure this expansion of terms so that it consumes less memory.
    * non-existent UTF support: it does not support text normalization or splitting terms at custom characters such as '\_'.
* LMDB:
    * sorted keys
    * sorted duplicate keys
    * no FTS
    * MVCC concurrency
* SQLite:
    * SQL
    * updates lock the database for readers
    * concurrent reading is possible
    * requires duplicating the data: once in a column as data, and then again in the FTS index.
* lucenePlusPlus
    * fast full-text searching
    * MVCC concurrency

### Result
For regular secondary indexes LMDB is used as well, because it is sufficient for key lookups, and by using the same database we can store the indexed data in the same transaction.

No solution for full-text indexes has been chosen yet. Baloo implements a full-text index on top of LMDB, though.

## Useful Resources
* LMDB
    * Wikipedia for a good overview: <https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database>
    * Benchmarks: <http://symas.com/mdb/microbench/>
    * Tradeoffs: <http://symas.com/is-lmdb-a-leveldb-killer/>
    * Disk space benchmark: <http://symas.com/mdb/ondisk/>
    * LMDB instead of Kyoto Cabinet as Redis backend: <http://www.anchor.com.au/blog/2013/05/second-strike-with-lightning/>
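As a closing illustration, the bookkeeping described under "Revisions" and "Revision cleanup" can be sketched with an in-memory stand-in for the key-value store. This is a minimal sketch only; all names here are hypothetical, and the real store is backed by LMDB:

```python
class RevisionStore:
    """In-memory stand-in for the revision-keyed entity store."""

    def __init__(self):
        self.revision = 0   # one ever-increasing counter for the complete store
        self.entities = {}  # (entity_id, revision) -> value, or None for a removal
        self.latest = {}    # entity_id -> latest revision

    def _record(self, entity_id, value):
        # Every operation (create/delete/modify) leads to a new revision;
        # the key consists of the entity id and the revision.
        self.revision += 1
        self.entities[(entity_id, self.revision)] = value
        self.latest[entity_id] = self.revision
        return self.revision

    def create(self, entity_id, value):
        return self._record(entity_id, value)

    def modify(self, entity_id, value):
        return self._record(entity_id, value)

    def remove(self, entity_id):
        # Removal is just a new revision recording the removal.
        return self._record(entity_id, None)

    def cleanup(self, lower_bound):
        """Drop revisions below lower_bound that are removals or outdated."""
        for (entity_id, rev) in list(self.entities):
            if rev >= lower_bound:
                continue  # still needed by some change-replaying component
            outdated = rev < self.latest[entity_id]
            is_removal = self.entities[(entity_id, rev)] is None
            if outdated or is_removal:
                del self.entities[(entity_id, rev)]
```

After creating two entities, modifying one, and removing the other, a `cleanup` with a lower bound above the current revision leaves only the latest non-removal revision behind, matching the two cleanup rules above.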