## Storage Model
The storage model is simple:
```
Entity {
  Id
  Revision {
    Revision-Id,
    Property*
  }+
}*
```

The store consists of entities that each have an id and a set of properties. Each entity can have multiple revisions.

An entity is uniquely identified by:

* Resource + Id

The additional revision identifies a specific instance/version of the entity.

URI scheme:
```
sink://resource/id:revision
```

## Store Entities
Each entity can be as normalized/denormalized as is useful; it is not necessary to have one solution that fits everything.

Denormalized:

* the priority is that the MIME message stays intact (signatures/encryption)

```
Mail {
  id
  mimeMessage
}
```

Normalized:

* the priority is that we can access individual members efficiently
* we don't care about exact reproducibility of e.g. an iCal file

```
Event {
  id
  subject
  startDate
  attendees
  ...
}
```

Of course any combination of the two can be used, including duplicating data into individual properties while keeping the complete struct intact.

#### Optional Properties
For each domain type, we want to define a set of required and a set of optional properties. The required properties are the minimum bar for each resource, and are required for applications to work as expected. Optional properties may only be shown by the UI if they are actually supported by the backend.

However, we'd like to be able to support local-only storage for resources that don't support an optional property. Each entity therefore has a "local" buffer that provides default local-only storage. This local-only buffer provides storage for all properties of the respective domain type.
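The URI scheme above can be illustrated with a small parser. This is a hypothetical helper for illustration only (`EntityRef` and `parse_sink_uri` are not part of sink's actual API); it shows how the resource, entity id, and optional revision are encoded:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityRef:
    resource: str
    entity_id: str
    revision: Optional[int]  # None means "the latest revision"


def parse_sink_uri(uri: str) -> EntityRef:
    """Split a 'sink://resource/id:revision' URI into its components."""
    prefix = "sink://"
    if not uri.startswith(prefix):
        raise ValueError("not a sink URI: " + uri)
    rest = uri[len(prefix):]
    resource, _, tail = rest.partition("/")
    entity_id, sep, rev = tail.partition(":")
    return EntityRef(resource, entity_id, int(rev) if sep else None)
```

For example, `parse_sink_uri("sink://imap.example/mail42:7")` yields the entity `mail42` of resource `imap.example` at revision 7, while omitting the `:revision` suffix refers to the latest revision.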
Each resource can freely define how the properties are split, though it should push as many as possible into the synchronized side so they can be synchronized. Note that the resource is free to add more properties to its synchronized buffer even though they may not be required by the specification.

The advantage of this is that a resource only needs to specify a minimal set of properties, while everything else is taken care of by the local-only buffer. This is supposed to make it easier for resource implementers to get something working.

### Value Format
Each entity value in the key-value store consists of the following individual buffers:

* Metadata: metadata that is required for every entity (revision, ...)
* Local: the default storage buffer, which is domain-type specific.
* Resource: the buffer defined by the resource (additional properties, values that help with synchronization)

## Database
### Database Layout
Storage is split up into multiple named databases that reside in the same database environment.

```
$DATADIR/storage/$RESOURCE_IDENTIFIER/
```

* $BUFFERTYPE.main: the primary store for a type
* $BUFFERTYPE.index.$PROPERTY: secondary indexes
* revisionType: allows looking up the type by revision, to find the correct primary or secondary databases.
* revisions: allows looking up the entity id by revision

The resource can be effectively removed from disk (besides its configuration) by deleting the directories matching `$RESOURCE_IDENTIFIER*` and everything they contain.

#### Design Considerations
The stores are split by buffer type so that a full scan (which is done by type) doesn't require filtering by type first. The downside is that an additional lookup is required to get from a revision to the data.

### Revisions
Every operation (create/delete/modify) leads to a new revision.
The revision is a monotonically increasing number for the complete store.

Each entity is stored under a key consisting of its id and the revision. This way it is possible to look up older revisions.

Removing an entity simply results in a new revision of the entity recording the removal.

Secondary indexes always refer to the latest revision.

#### Design Considerations
By having one revision for the complete store instead of one per type, change replay always works across all types. This is especially important for the write-back mechanism that replays the changes to the source.

### Revision cleanup
Because the store would otherwise grow indefinitely, old revisions need to be removed.
The resource maintains a "lower bound revision", which is the lowest revision of any change-replaying component (such as clients and write-back).
Below that lower bound, the cleanup removes any revision that:

* is a delete command (the entity no longer exists)
* has a newer revision for the same entity (the revision is outdated)

By doing cleanups continuously, we avoid keeping outdated data.

### BLOB properties
Files are used to handle opaque large properties that should not end up in memory. BLOB properties are by their nature never queryable (extract parts of them into other properties if indexes are required).

For reading:

Resources...

* store the file in `$DATADIR/storage/$RESOURCE_IDENTIFIER.files/`
* store the filename in the BLOB property.
* delete the file when the corresponding entity is deleted.

Queries...

* copy the requested property to `/tmp/sink/client_files/` and provide the path in the property.
* the file is guaranteed to exist for the lifetime of the query result.

Clients...

* load the file from disk and use it as they wish (moving it is fine too).

For writing:

Clients...
* request a path from sink and store the file there.
* store the path of the written file in the property.

Resources...

* move the file to `$DATADIR/storage/$RESOURCE_IDENTIFIER.files/`
* store the new path in the entity.

#### Design Considerations
Using regular files as the interface has these advantages:

* Existing mechanisms can be used to stream data directly to disk.
* The necessary file operations can be handled efficiently depending on the OS and filesystem.
* We avoid reinventing the wheel.

The copy is necessary to guarantee that the file remains available for the client/resource even if the resource removes the file on its side as part of a sync.
The copy could be optimized by using hardlinks, although that is not a portable solution. On some next-generation copy-on-write filesystems copying is a very cheap operation.

A downside of the file-based design is that it is not possible to stream directly from a remote resource into application memory; the data always has to go via a file.

## Database choice
By design we're interested in key-value stores or perhaps document databases. This is because a fixed schema is not useful for this design, which makes SQL not very useful (it would just be a very slow key-value store). While document databases would allow for indexes on certain properties (which is something we need), we have not yet found any contenders that looked useful for this system.

### Requirements
* portable; minimally Linux, Windows, and Mac OS X
* multi-thread and multi-process concurrency with a single writer and multiple readers.
    * This is required so we don't block clients while a resource is writing, and is deemed essential for performance and to reduce complexity.
* reasonably fast, so we can implement all necessary queries with sufficient performance
* can deal with large amounts of data
* on-disk storage with ACID properties
* memory consumption suitable for a desktop system (no in-memory stores)

Other useful properties:

* suitability for implementing some indexes (the fewer tools we need the better)
* support for transactions
* small overhead in on-disk size

### Contenders
* LMDB
    * support for mmapped values
    * good read performance, OK write performance
    * fairly complex API
    * up to double the storage size due to paging (with a 4k page size, 4001-byte values are the worst case)
    * size limit of 4GB on 32-bit systems, virtually no limit on 64-bit (leading to 2GB of actual payload with write amplification)
    * limited key-search capabilities
    * ACID transactions
    * MVCC concurrency
    * no compaction; the database always grows (pages get reused but are never freed)
* Berkeley DB (BDB)
    * performance is supposedly worse than LMDB (LMDB was written as a successor to BDB for OpenLDAP)
    * Oracle sits behind it (it has an AGPL licence though)
* RocksDB
    * => no multi-process support
* Kyoto Cabinet <http://fallabs.com/kyotocabinet/>
    * fast, low on-disk overhead, simple API
    * => no multi-process support
    * GPL
* hamsterdb
    * => no multi-process support
* SQLite4
    * not yet released
* BangDB
    * not yet released as open source; looks promising on paper
* Redis
    * => loads everything into memory
    * => not embeddable
* CouchDB
    * MVCC concurrency
    * document store
    * not embeddable (unless we write sink in Erlang ;)
* <https://github.com/simonhf/sharedhashfile>
    * not portable (i.e. Windows; it's a mostly-Linux thing)
* <http://sphia.org/architecture.html>
    * => no multi-process support
* LevelDB
    * => no multi-process support
* EJDB <http://ejdb.org/#ejdb-c-library>
    * modified version of Kyoto Cabinet
    * => multi-process access requires locking, so effectively no multi-process support
    * is more of a document store
    * no updates since September 2013
* <http://unqlite.org>
    * bad performance with a large database (looks like O(n))
    * like LMDB, roughly 2\*datasize on disk
    * includes a document store
    * mmapped read access
    * reads at about 30% of the speed of LMDB
    * slow writes with transactions

### Result
LMDB was chosen as one of the few contenders that is embeddable and has true multi-process support. It also outperformed UnQLite significantly, although its write performance and disk usage aren't ideal.

### Indexes
In addition to the primary store, indexes are required for efficient lookups.

Since indexes always need to be updated, they directly affect how fast we can write data. When reading, typically only a subset of the available indexes is used, so a slow index doesn't affect all queries.

#### Contenders
* Xapian:
    * fast full-text searching
    * no MVCC concurrency
    * only supports one writer at a time
    * if a reader is reading blocks that have since been changed by a writer, it throws a DatabaseModifiedException. This means most of the Xapian code needs to be in `while (1) { try { .. } catch () }` blocks and needs to be able to start from scratch.
    * wildcard searching (as of 2015-01) isn't ideal. It works by expanding the word into all other matching words in the query, which typically makes the query size huge. This huge query is then sent to the database. Baloo has had to configure this expansion of terms so that it consumes less memory.
    * non-existent UTF support: it does not support text normalization or splitting terms at custom characters such as '\_'.
* LMDB:
    * sorted keys
    * sorted duplicate keys
    * no FTS
    * MVCC concurrency
* SQLite:
    * SQL
    * updates lock the database for readers
    * concurrent reading is possible
    * requires duplicating the data: once in a column as data, and then again in the FTS index.
* lucenePlusPlus
    * fast full-text searching
    * MVCC concurrency

### Result
For regular secondary indexes LMDB is used as well, because it is sufficient for key lookups, and by using the same database we can store the indexed data in the same transaction.

No solution for full-text indexes has been chosen yet. Baloo implements a full-text index on top of LMDB, though.

## Useful Resources
* LMDB
    * Wikipedia for a good overview: <https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database>
    * Benchmarks: <http://symas.com/mdb/microbench/>
    * Tradeoffs: <http://symas.com/is-lmdb-a-leveldb-killer/>
    * Disk space benchmark: <http://symas.com/mdb/ondisk/>
    * LMDB instead of Kyoto Cabinet as Redis backend: <http://www.anchor.com.au/blog/2013/05/second-strike-with-lightning/>
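As a closing illustration, the bookkeeping described under "Revisions" and "Revision cleanup" can be sketched with an in-memory stand-in for the key-value store. This is a minimal sketch only; all names here are hypothetical, and the real store is backed by LMDB:

```python
class RevisionStore:
    """In-memory stand-in for the revision-keyed entity store."""

    def __init__(self):
        self.revision = 0   # one ever-increasing counter for the complete store
        self.entities = {}  # (entity_id, revision) -> value, or None for a removal
        self.latest = {}    # entity_id -> latest revision

    def _record(self, entity_id, value):
        # Every operation (create/delete/modify) leads to a new revision;
        # the key consists of the entity id and the revision.
        self.revision += 1
        self.entities[(entity_id, self.revision)] = value
        self.latest[entity_id] = self.revision
        return self.revision

    def create(self, entity_id, value):
        return self._record(entity_id, value)

    def modify(self, entity_id, value):
        return self._record(entity_id, value)

    def remove(self, entity_id):
        # Removal is just a new revision recording the removal.
        return self._record(entity_id, None)

    def cleanup(self, lower_bound):
        """Drop revisions below lower_bound that are removals or outdated."""
        for (entity_id, rev) in list(self.entities):
            if rev >= lower_bound:
                continue  # still needed by some change-replaying component
            outdated = rev < self.latest[entity_id]
            is_removal = self.entities[(entity_id, rev)] is None
            if outdated or is_removal:
                del self.entities[(entity_id, rev)]
```

After creating two entities, modifying one, and removing the other, a `cleanup` with a lower bound above the current revision leaves only the latest non-removal revision behind, matching the two cleanup rules above.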