sheets/doc/FormulaEngine.dox

0001 /**
0002 \page formulaengine Formula Engine
0003 \author Ariya Hidayat (<a href="mailto:ariya@kde.org">ariya@kde.org</a>)
0004 \date 2004
0005
0006 \par Status:
0007     FINISHED; NEEDS UPDATE (inline arrays)
0008
0009 <p>This formula engine is just an expression evaluator. To offer better
0010 performance, the expression is first compiled into byte codes which will
0011 be executed later by a virtual machine.</p>
0012
0013 <p>Before compilation, the expression is separated into pieces, called tokens.
0014 This step, which is also known as lexical analysis, takes places at once
0015 and will produce sequence of tokens. They are however not stored and used only
0016 for the purpose of the subsequent step.
0017 Tokens are supplied to the parser, also known as syntax analyzer. In this
0018 design, the parser is also a code generator. It involve the generation
0019 of byte codes which represents the expression.
0020 Evaluating the formula expression is now basically running the virtual
0021 machine to execute compiled byte codes. No more scanning or parsing
0022 are performed during evaluation, this saves time a lot.</p>
0023
0024 <p>The virtual machine itself (and of course the byte codes) are designed to be
0025 as simple as possible. This is supposed to be stack-based, i.e. the virtual
0026 machine has an execution stack of values which would be manipulated
0027 as each byte code is processed. Beside the stack, there will be a list of
0028 constant (sometimes also called as "constants pool") to hold Boolean,
0029 integer, floating-point or string values. When a certain byte code needs
0030 a constant for the operand, an index is specified which should be used
0031 to look up the constant in the constants pool.</p>
0032
0033 <p>There are only few byte code, sufficient enough to perform calculation.
0034 Yes, this is really minimalist but yet does the job fairly well.
0035 The following provides brief description for each type of bytecode.</p>
0036
0037 <div style="margin-left: 4em">
0038
0039 <p><i>Nop</i> means no operation.</p>
0040
0041 <p><i>Load</i> means loads a constant and push it to the stack. The constant can
0042 be found at constant pools, at position by 'index', it could be a Boolean,
0043 integer, floating-point or string value.</p>
0044
0045 <p><i>Ref</i> means gets a value from a reference. Member variable 'index' will
0046 refers to a string value in the constant pools, i.e. the name of the reference.
0047 Typically the reference is either a cell (e.g. A1), range of cells (A1:B10)
0048 or possibly function name. Example: expression A2+B2 will be compiled as:<br/>
0049 Constants:<br/>
0050 #0: "A2"<br/>
0051 #1: "B2"<br/>
0052 Codes:<br/>
0053 Ref #0<br/>
0054 Ref #1<br/>
0055 Add
0056 </p>
0057
0058 <p><i>Function</i>.
0059 Example: expression "sin(x)" will be compiled as:<br/>
0060 Constants:<br/>
0061 #0: "sin"<br/>
0062 #1: "x"<br/>
0063 Codes:<br/>
0064 Ref #0<br/>
0065 Ref #1<br/>
0066 Function 1
0067 </p>
0068
0069 <p><i>Neg</i> is a unary operator, a value is popped from stack and negated and then
0070 pushed back to the stack. If it is not number (Boolean or string), it
0071 will be converted first.</p>
0072
0073 <p><i>Add, Sub, Mul, Div and Pow</i> are binary operators, two values are popped from
0074 stack and processed (added, subtracted, multiplied, divided, or power) and
0075 the result is pushed to the stack.</p>
0076
0077 <p><i>Concat</i> is string operation, two values are popped from stack (and converted
0078 to string if they are not string values), concatenated, and the result is
0079 pushed to the stack.</p>
0080
0081 <p><i>Not</i> is a logical operation, a value is popped from stack and its Boolean
0082 not is pushed into the stack. When it is not Boolean value, there will be
0083 a cast.</p>
0084
0085 <p><i>Equal, Less, and Greater</i> are comparison operators, two values are
0086 popped from stack and compared appropriately. The result, which is a Boolean
0087 value, is pushed into the stack. To simplify, there no &quot;not equal&quot;
0088 comparison because it can be regarded as &quot;equal&quot; followed by
0089 &quot;not&quot; byte codes. Same goes for &quot;less than or equal to&quot; and
0090 &quot;greater than or equal to&quot;.</p>
0091
0092 </div>
0093
0094 <p>The expression scanner is based on finite state acceptor. The state denotes
0095 the position of cursor, e.g. inside a cell token, inside an identifier, etc.
0096 State transition is following by emitting the associated token to the
0097 result buffer. Rather than showing the state diagrams here, it is much more
0098 convenience and less complicated to browse the scanner source code and try
0099 to follow its algorithm from there.</p>
0100
0101 <p>The parser is designed using one of bottom-up parsing technique, namely
0102 based on Polish notation. Instead of ordering the tokens in suffix Polish
0103 form, the parser (which is also the code generator) simply outputs
0104 byte codes. In its operation, the parser requires the knowledge of operator
0105 precedence to correctly translate unparenthesized infix expression and
0106 thus requires the use of a syntax stack.</p>
0107
0108 <p>The parser algorithm is given as follows:</p>
0109
0110 <div style="margin-left: 4em">
0111 Repeat the following steps:<br/>
0112 Step 1: Get next token<br/>
0113 Step 2: If it is an identifier<br/>
0114 - push it to syntax stack<br/>
0115 - generated "Ref"<br/>
0116 Step 3: If it is a Boolean, integer, float or string value<br/>
0117 - push it to syntax stack<br/>
0118 - generated "Load"<br/>
0119 Step 4: If it is an operator<br/>
0120 - check for reduce rules<br/>
0121 - when no more rules applies, push token to the syntax stack<br/>
0122 </div>
0123
0124 <p>The reduce rules are:</p>
0125
0126 <p>Rule A: <i>function argument</i>:
0127 if token is semicolon or right parenthesis,
0128 if syntax stack looks as:
0129 <ul type="square">
0130 <li>non-operator &lt;--- top</li>
0131 <li>operator ;</li>
0132 <li>non-operator</li>
0133 <li>operator (</li>
0134 <li>identifier</li>
0135 </ul>
0136 then reduce to
0137 <ul type="circle">
0138 <li>non operator</li>
0139 <li>operator (</li>
0140 <li>identifier</li>
0141 <li>increase number of function arguments</li>
0142 </ul>
0143 </p>
0144
0145 <p>Rule B: last function argument<br>
0146 if syntax stack looks as:<br>
0147 <ul type="square">
0148 <li>operator )</li>
0149 <li>non-operator</li>
0150 <li>operator (</li>
0151 <li>identifier</li>
0152 </ul>
0153 then reduce to:<br>
0154 <ul type="circle">
0155 <li>non-operator</li>
0156 <li>generated "Function" + number of function arguments</li>
0157 </ul>
0158 </p>
0159
0160 <p>Rule C: function without argument<br>
0161 if syntax stack looks as:<br>
0162 <ul type="square">
0163 <li>operator )</li>
0164 <li>operator (</li>
0165 <li>identifier</li>
0166 </ul>
0167 then reduce to:<br>
0168 <ul type="circle">
0169 <li>non-operator (dummy)</li>
0170 </ul>
0171 </p>
0172
0173 <p>Rule D: parenthesis removal<br>
0174 if syntax stack looks as:<br>
0175 <ul type="square">
0176 <li>operator (</li>
0177 <li>non-operator</li>
0178 <li>operator )</li>
0179 </ul>
0180 then reduce to:<br>
0181 <ul type="circle">
0182 <li>non-operator</li>
0183 </ul>
0184 </p>
0185
0186 <p>Rule E: binary operator<br>
0187 if syntax stack looks as:<br>
0188 <ul type="square">
0189 <li>non-operator</li>
0190 <li>binary operator</li>
0191 <li>non-operator</li>
0192 <li>and if the precedence of the binary operator in the syntax stack
0193 is greater or equals to the precedence of token</li>
0194 </ul>
0195 then reduce to:<br>
0196 <ul type="circle">
0197 <li>non-operator</li>
0198 <li>and generated appropriate byte code for the binary operator</li>
0199 </ul>
0200 </p>
0201
0202 <p>Rule F: unary operator<br>
0203 if syntax stack looks as:<br>
0204 <ul type="square">
0205 <li>non-operator</li>
0206 <li>unary operator</li>
0207 <li>operator</li>
0208 </ul>
0209 then reduce to:<br>
0210 <ul type="circle">
0211 <li>operator</li>
0212 <li>and generated "Neg" if unary operator is '-'</li>
0213 </ul>
0214 </p>
0215
0216 <p>Percent operator is a special case and not handled the above mentioned rule.
0217 When the parser finds the percent operator, it checks whether there's a non-operator
0218 token right before the percent. If yes, then the following code is generated:
0219 <tt>load 0.01</tt> followed by <tt>multiply</tt>.</p>
0220