
The resource consists of:

* the synchronizer process
* a plugin providing the client-api facade
* a configuration of the filters

## Synchronizer
The synchronizer process is responsible for processing all commands, executing synchronizations with the source, and replaying changes to the source.

Processing of commands happens in the pipeline, which executes all preprocessors before the entity is persisted.

The synchronizer process has the following primary components:

* Command Queues: Queues that hold all incoming commands. Persisted over reboots.
* Command Processor: A processor that empties the command queues by pushing commands through the pipeline.
* Listener: Opens a socket and listens for incoming connections. On connection, all incoming commands are read and entered into the command queues. Control commands (e.g. a sync) don't require persistence and are therefore processed directly.
* Synchronization: Handles synchronization with the source, as well as change replay to the source. The modification commands generated by the synchronization enter the command queue as well.
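
As a rough illustration of how these components interact, here is a minimal, in-memory sketch; all type names are hypothetical, and the real queues are persisted while the listener is socket-based:

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

struct Command {
    std::string type;    // e.g. "create", "modify", "delete"
    std::string payload; // serialized entity buffer
};

struct CommandQueue {
    std::deque<Command> pending; // persisted over reboots in the real system

    void enqueue(Command cmd) { pending.push_back(std::move(cmd)); }
};

// The command processor empties the queue by pushing every command
// through the pipeline (all preprocessors, then persistence).
void processQueue(CommandQueue &queue,
                  const std::function<void(const Command &)> &pipeline)
{
    while (!queue.pending.empty()) {
        pipeline(queue.pending.front());
        queue.pending.pop_front();
    }
}
```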

A resource can:

* provide a full mirror of the source.
* provide metadata for efficient access to the source.

In the former case the local mirror is fully functional locally, and changes can be replayed to the source once a connection is established again.
In the latter case the resource is only functional if a connection to the source is available (which is not a problem if, for instance, the source is a local maildir on disk).

## Preprocessors
Preprocessors are small processors that are guaranteed to run before a new/modified/deleted entity reaches storage. They can therefore be used for various tasks that need to be executed on every entity.

Use cases:

* Update indexes
* Detect spam/scam mail and set appropriate flags
* Email filtering to different folders or resources

The following kinds of preprocessors exist:

* filtering preprocessors that can potentially move an entity to another resource
* passive preprocessors that extract data that is stored externally (e.g. indexers)
* flag extractors that produce data stored with the entity (spam detection)

Preprocessors are typically read-only, e.g. so as not to break the signatures of emails. Extra flags that are accessible through the sink domain model can therefore be stored in the local buffer of each resource.

### Requirements
* A preprocessor must work with batch processing. Because batch processing is vital for efficient writing to the database, all preprocessors have to be included in the batch processing.
* Preprocessors need to be fast, since they directly affect how fast a message is processed by the system.

### Design
Commands are processed in batches. Each preprocessor thus has the following workflow:

* startBatch is called: The preprocessor can do the necessary preparation steps for the batch (like starting a transaction on an external database).
* add/modify/remove is called for every command in the batch: The preprocessor executes the desired actions.
* endBatch is called: If the preprocessor wrote to an external database, it can now commit the transaction.
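
A minimal sketch of such a batch-oriented interface; the type names are illustrative, not Sink's actual API:

```cpp
#include <string>

// Stand-in for the stored entity representation.
struct Entity {
    std::string uid;
};

// Batch-oriented preprocessor interface as described above.
class Preprocessor {
public:
    virtual ~Preprocessor() = default;

    // Called once before the batch, e.g. to start a transaction
    // on an external database.
    virtual void startBatch() {}

    // Called for every command in the batch.
    virtual void add(const Entity &entity) = 0;
    virtual void modify(const Entity &oldEntity, const Entity &newEntity) = 0;
    virtual void remove(const Entity &entity) = 0;

    // Called once after the batch, e.g. to commit the transaction.
    virtual void endBatch() {}
};
```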

### Generic Preprocessors
Most preprocessors will likely be used by several resources, and are either completely generic or domain specific (such as mail-only).
It is therefore desirable to have default implementations for common preprocessors that are ready to be plugged in.

The domain type adaptors provide a generic interface to access most properties of the entities, on top of which generic preprocessors can be implemented.
That way it is trivial to, for example, implement a preprocessor that populates a hierarchy index of collections, as sketched below.
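
A minimal sketch of such a generic hierarchy indexer, assuming a hypothetical adaptor interface that exposes properties by name:

```cpp
#include <map>
#include <string>

// Hypothetical adaptor interface: generic read access to entity properties.
struct DomainAdaptor {
    virtual ~DomainAdaptor() = default;
    virtual std::string property(const std::string &name) const = 0;
};

// Generic preprocessor that maintains a parent -> child hierarchy index
// for any entity type exposing "parent" and "uid" properties.
class HierarchyIndexer {
    std::multimap<std::string, std::string> index; // parent uid -> child uid
public:
    void add(const DomainAdaptor &entity) {
        index.insert({entity.property("parent"), entity.property("uid")});
    }
};
```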

### Preprocessors generating additional entities
A preprocessor, such as an email threading preprocessor, might generate additional entities (a thread entity is a regular entity, just like the mail that spawned the thread).

In such a case the preprocessor must invoke the complete pipeline for the new entity.


## Indexes
Most indexes are implemented as preprocessors to guarantee that they are always updated together with the data.

Index types:

* fixed value indexes (e.g. uid)
    * Input: key-value pair where the key is the indexed property and the value is the uid of the entity
    * Lookup: by key; the value is always zero or more uids
* fixed value indexes where we want to do smaller/greater-than comparisons (like start date)
    * Input:
    * Lookup: by key with comparator (greater, equal range)
    * Result: zero or more uids
* range indexes (like the date range an event affects)
    * Input: start and end of range and uid of entity
    * Lookup: by key with comparator. The value denotes start or end of range.
    * Result: zero or more uids
* group indexes (like tree hierarchies as nested sets)
    * could be the same as fixed value indexes, which would then just require a recursive query.
    * Input:
* sort indexes (e.g. sorted by date)
    * Could also be a lookup in the range index (increase the date range until sufficient matches are available)

### Default implementations
Since only properties of the domain types can be queried, default implementations for commonly used indexes can be provided. These indexes are populated by generic preprocessors that use the domain-type interface to extract properties from individual entities.

### Example index implementations
* uid lookup
    * add:
        * add uid + entity id to index
    * update:
        * remove old uid + entity id from index
        * add uid + entity id to index
    * remove:
        * remove uid + entity id from index
    * lookup:
        * query for entity id by uid
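
A minimal in-memory sketch of these operations, using a multimap as a stand-in for the persistent key-value store:

```cpp
#include <map>
#include <string>
#include <vector>

struct UidIndex {
    std::multimap<std::string, std::string> index; // uid -> entity id

    void add(const std::string &uid, const std::string &entityId) {
        index.insert({uid, entityId});
    }

    void update(const std::string &oldUid, const std::string &newUid,
                const std::string &entityId) {
        remove(oldUid, entityId); // remove old uid + entity id ...
        add(newUid, entityId);    // ... then add the new pair
    }

    void remove(const std::string &uid, const std::string &entityId) {
        auto range = index.equal_range(uid);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == entityId) {
                index.erase(it);
                break;
            }
        }
    }

    // Query for entity ids by uid; yields zero or more results.
    std::vector<std::string> lookup(const std::string &uid) const {
        std::vector<std::string> result;
        auto range = index.equal_range(uid);
        for (auto it = range.first; it != range.second; ++it)
            result.push_back(it->second);
        return result;
    }
};
```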

* mail folder hierarchy
    * parent folder uid is a property of the folder
    * store parent-folder-uid + entity id
    * lookup:
        * query for entity id by parent-folder uid

* mails of mail folder
    * parent folder uid is a property of the email
    * store parent-folder-uid + entity id
    * lookup:
        * query for entity id by parent-folder uid

* email threads
    * Thread objects should be created as dedicated entities
    * the thread uid

* email date sort index
    * the date of each email is indexed as a timestamp

* event date range index
    * the start and end date of each event is indexed as timestamps (floating date-times would change sorting based on the current timezone, so the index would have to be refreshed)
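
A minimal sketch of such a range index, with timestamps as integers and an in-memory multimap as a stand-in for the real store:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct EventRangeIndex {
    struct Entry { int64_t end; std::string entityId; };
    std::multimap<int64_t, Entry> byStart; // start timestamp -> entry

    void add(int64_t start, int64_t end, const std::string &entityId) {
        byStart.insert({start, {end, entityId}});
    }

    // All events overlapping [from, to].
    std::vector<std::string> overlapping(int64_t from, int64_t to) const {
        std::vector<std::string> result;
        // Events starting after 'to' cannot overlap; anything starting
        // earlier overlaps iff it ends at or after 'from'.
        for (auto it = byStart.begin();
             it != byStart.end() && it->first <= to; ++it) {
            if (it->second.end >= from)
                result.push_back(it->second.entityId);
        }
        return result;
    }
};
```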

### On-demand indexes
To avoid building all indexes initially, and assuming not all indexes are necessarily regularly used for the complete data-set, it should be possible to omit updating an index and instead mark it as outdated. The index can then be built on demand when the first query requires it.

Building the index on demand is a matter of replaying the relevant dataset and using the usual indexing methods. This should typically be a process that doesn't take too long, and it should provide status information, since it will block the query.

The index's status information can be recorded as the latest revision the index has been updated with.
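
A minimal sketch of this strategy, with a stand-in for the actual replay of revisions:

```cpp
#include <cstdint>

// The index records the last revision it was updated with; a query first
// brings it up to date by replaying the missing revisions through the
// normal indexing methods.
struct OnDemandIndex {
    int64_t indexedRevision = 0; // latest revision applied to the index

    void ensureUpToDate(int64_t storeRevision) {
        for (int64_t rev = indexedRevision + 1; rev <= storeRevision; ++rev)
            replayRevision(rev);
        indexedRevision = storeRevision;
    }

    // Stand-in: re-run the indexer's add/modify/remove handlers for the
    // entities of one revision of the store.
    void replayRevision(int64_t /*revision*/) {}
};
```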

# Pipeline
A pipeline is an assembly of a set of preprocessors with a defined order. A modification is always persisted at the end of the pipeline, once all preprocessors have run.
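
A minimal sketch of this structure; the types are illustrative, and the real pipeline distinguishes creation, modification, and deletion:

```cpp
#include <functional>
#include <string>
#include <vector>

// A pipeline as an ordered list of preprocessor steps; the modification
// is persisted only after every step has run.
struct Pipeline {
    using Step = std::function<void(const std::string & /*entity*/)>;
    std::vector<Step> steps;                          // defined order
    std::function<void(const std::string &)> persist; // storage write

    void process(const std::string &entity) {
        for (const auto &step : steps)
            step(entity);    // all preprocessors run first ...
        if (persist)
            persist(entity); // ... persistence always happens last
    }
};
```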

# Synchronization
The synchronization can either:

* Generate a full diff directly on top of the db. The diffing process can work against a single revision/snapshot (using transactions). It then generates the necessary changeset for the store.
* If the source supports incremental changes, the changeset can be generated directly from that information.

The changeset is then simply inserted into the regular modification queue and processed like all other modifications. The synchronizer has to ensure that only changes that didn't already come from the source are replayed to it. This is done by marking changes that don't require change replay to the source.
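
A minimal sketch of this marking, with hypothetical types:

```cpp
#include <string>
#include <vector>

// Each modification carries a flag saying whether it still needs to be
// replayed to the source.
struct Modification {
    std::string entityId;
    bool replayToSource; // false for changes that came from the source
};

// The synchronizer enqueues its changeset with replayToSource = false,
// so the later change replay skips these modifications.
std::vector<Modification> changesetFromSync(
        const std::vector<std::string> &changedIds)
{
    std::vector<Modification> changeset;
    for (const auto &id : changedIds)
        changeset.push_back({id, /*replayToSource=*/false});
    return changeset;
}
```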

## Synchronization Store
To track the progress of the synchronization, the synchronizer needs to maintain a separate store. It needs to be separate from the main store to properly decouple the synchronization from the Command Processor, which enables the two parts to run concurrently (we can't have two threads writing to the same store).

While the synchronization store can contain any information useful for a resource to synchronize, a typical example looks like this:

* changereplay: Contains the last replayed revision. Used by the change replay to know what has already been replayed to the source.
* remoteid.mapping.$BUFFERTYPE: Contains the mapping of a remote identifier to a local identifier. Necessary to track what has already been synchronized, and to replay changes to the remote entity.
* localid.mapping.$BUFFERTYPE: Reverse mapping of the remoteid.mapping.

The remoteid mapping has to be updated in two places:

* New entities that are synchronized immediately get a localid assigned, which is then recorded together with the remoteid. This is required to be able to reference other entities directly in the command queue (e.g. for parent folders).
* Entities created by clients get a remoteid assigned during change replay, so the entity can be recognized during the next sync.
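
A minimal sketch of this mapping and the two update paths, with in-memory maps as stand-ins for the per-buffer-type databases and a placeholder for local-id generation:

```cpp
#include <map>
#include <string>

struct RemoteIdMap {
    std::map<std::string, std::string> remoteToLocal; // remoteid.mapping.$BUFFERTYPE
    std::map<std::string, std::string> localToRemote; // localid.mapping.$BUFFERTYPE

    // During sync: resolve a remote id, creating a local id on first sight
    // so other queued commands can already reference the entity.
    std::string resolveRemoteId(const std::string &remoteId) {
        auto it = remoteToLocal.find(remoteId);
        if (it != remoteToLocal.end())
            return it->second;
        const std::string localId = createLocalId();
        remoteToLocal[remoteId] = localId;
        localToRemote[localId] = remoteId;
        return localId;
    }

    // During change replay: record the remote id assigned to a
    // client-created entity so the next sync recognizes it.
    void recordRemoteId(const std::string &localId, const std::string &remoteId) {
        localToRemote[localId] = remoteId;
        remoteToLocal[remoteId] = localId;
    }

    // Placeholder for proper uuid generation.
    std::string createLocalId() { return "local-" + std::to_string(++counter); }
    int counter = 0;
};
```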

## Change Replay
To replay local changes to the source, the synchronizer replays all revisions of the store and maintains the current replay state in the synchronization store.
Changes that already came from the source via the synchronizer are not replayed to the source again.
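
A minimal sketch of the replay loop, with stand-ins for store access and source I/O:

```cpp
#include <cstdint>

// Walk the revisions past the last replayed one, pushing each to the
// source while skipping changes that came from the source originally.
struct ChangeReplay {
    int64_t lastReplayedRevision = 0; // persisted in the synchronization store

    void replay(int64_t topRevision) {
        for (int64_t rev = lastReplayedRevision + 1; rev <= topRevision; ++rev) {
            if (needsReplay(rev))       // false for synchronizer-originated changes
                replayToSource(rev);
            lastReplayedRevision = rev; // progress survives restarts
        }
    }

    // Stand-ins for reading the change flag and talking to the source:
    bool needsReplay(int64_t /*revision*/) { return true; }
    void replayToSource(int64_t /*revision*/) {}
};
```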

# Testing / Inspection
Resources have to be tested, which often requires inspecting the current state of the resource. This is difficult in an asynchronous system where the whole backend logic is encapsulated in a separate process, at least without running the tests in a setup vastly different from how the system runs in production.

To alleviate this, inspection commands are introduced. Inspection commands are special commands that the resource processes just like all other commands, and whose sole purpose is to inspect the current resource state. Because an inspection command is processed with the same mechanism as other commands, we can rely on command ordering: a prior command is guaranteed to have been executed once the inspection command is processed.

A typical inspection command could, for example, verify that a file has been created at the expected path after a create command.
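
A minimal sketch of such an inspection check; the file-existence test is purely illustrative:

```cpp
#include <filesystem>
#include <string>

// Because this runs through the same queue as other commands, the prior
// create command is guaranteed to have been executed by the time it runs.
bool inspectFileExists(const std::string &expectedPath)
{
    return std::filesystem::exists(expectedPath);
}
```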

# Capabilities
Resources can have various capabilities. Each capability is a contract that the resource guarantees to fulfill.

## Storage
* The storage capability guarantees that the resource stores entities (of a supported type) given to it.

## Mailtransport
* A mailtransport resource transports any mail that it receives to the indicated destination.
* As long as the mail has not been transported it must be queryable, modifiable, and removable.
* Once the mail has been transported it should be moved to the target sent-mail folder and be removed from the resource.

## Drafts
* A resource that supports the drafts capability must store any mail that is marked as a draft in a suitable drafts folder.
* The resource must guarantee that storage succeeds (as soon as it accepts the request), so it must create a suitable folder if none is available.