# Itinerary data extraction engine

The itinerary data extraction engine extracts travel-related information from input in various forms,
from PDF documents to ticket barcodes, from emails to calendar events, and provides it in a machine-readable way.

## Users

* [KDE Itinerary](https://apps.kde.org/itinerary)
* [KMail](https://kontact.kde.org/components/kmail/) (via the Itinerary plug-in)
* [Nextcloud Mail](https://github.com/nextcloud/mail)

## Architecture

For linked class names read this in [the API docs](https://api.kde.org/kdepim/kitinerary/html/index.html).

### Data model

The data model used in here follows the [schema.org](https://schema.org) ontology, and for historic
reasons some of [Google's extensions](https://developers.google.com/gmail/markup/reference/) to it.

Various QML-compatible value classes based on that can be found in the `src/lib/datatypes` sub-directory.
Those do not implement the schema.org ontology one to one though, but focus on a subset relevant
for the current consumers. Any avoidable complexity of the ontology is omitted, which mainly
shows in a significantly flattened inheritance hierarchy, and stricter property types. This
is done to make data processing and display easier.

There is one notable extension to the schema.org model: all date/time values support
explicit IANA timezone identifiers, something that JSON cannot model out of the box.

De/serialization is provided via KItinerary::JsonLdDocument.

### Document model

Input data is transformed into a tree of document nodes (KItinerary::ExtractorDocumentNode).
This allows handling of arbitrarily nested data, such as an email with a PDF attached to it
which contains an image that contains a barcode with a UIC 918.3 ticket container, without
extractors having to consider all possible combinations.

A document node consists of a MIME type and its corresponding data, and potentially a number
of child nodes.

Data extraction is then performed on that document tree starting at the leaf nodes, with results
propagating upwards towards the root node.

Supported types of data are listed below. Additional data formats can be added via
KItinerary::ExtractorDocumentProcessor and KItinerary::ExtractorDocumentNodeFactory.

#### Generic document formats

* PDF documents, represented as KItinerary::PdfDocument.
* Emails, represented as KMime::Message.
* Apple Wallet passes, represented as KPkPass::Pass.
* iCal calendars and iCal calendar events, represented as KCalendarCore::Calendar and KCalendarCore::Event.
* HTML and XML documents, represented as KItinerary::HtmlDocument.

#### Specialized ticket barcode formats

* UIC 918.3/918.9 ticket barcodes, represented as KItinerary::Uic9183Parser.
* European Railway Agency (ERA) FCB ticket barcodes, represented as KItinerary::Fcb::UicRailTicketData.
* European Railway Agency (ERA) SSB ticket barcodes, represented as KItinerary::SSBv1Ticket,
  KItinerary::SSBv2Ticket and KItinerary::SSBv3Ticket.
* IATA boarding pass barcodes, represented as KItinerary::IataBcbp.
* VDV eTicket barcodes, represented as KItinerary::VdvTicket.

#### Technical data types

These are primarily needed for internal use.

* Images, represented as QImage.
* Apple property lists (plist), represented as KItinerary::PListReader.
* HTTP responses, represented as KItinerary::HTTPResponse.

#### Generic data types

These capture everything not handled above.

* JSON, represented as QJsonArray.
* Plain textual data, represented as a QString.
* Arbitrary binary data, represented as a QByteArray.
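
To make the document model described in this section more concrete, the following sketch builds a tree
for the nested example (an email containing a PDF containing an image containing a barcode) and walks it
leaves-first. It uses plain JavaScript objects only: `makeNode` and `extractBottomUp` are hypothetical
helpers invented for this illustration and are not part of the KItinerary API, the `image/png` type is an
assumption, and the way results are handed upwards is a deliberate simplification of what the engine does.

```javascript
// Hypothetical stand-in for a document node: a MIME type, its content and child nodes.
function makeNode(mimeType, content, children = []) {
    return { mimeType, content, children, result: [] };
}

// Visit child nodes first, so their results are available when the parent is processed.
function extractBottomUp(node, extractors) {
    for (const child of node.children) {
        extractBottomUp(child, extractors);
    }
    const extractor = extractors[node.mimeType];
    const own = extractor ? extractor(node) : [];
    // Simplification: a node without results of its own passes on what its children found.
    const inherited = node.children.flatMap(child => child.result);
    node.result = own.length > 0 ? own : inherited;
    return node.result;
}

// The nested example: email -> PDF -> image -> barcode text.
const barcode = makeNode("text/plain", "F12345678");
const image = makeNode("image/png", null, [barcode]);
const pdf = makeNode("application/pdf", null, [image]);
const email = makeNode("message/rfc822", null, [pdf]);

// A single made-up extractor that only knows about the barcode text.
const extractors = {
    "text/plain": node => [{ "@type": "TrainReservation", reservationNumber: node.content }],
};

const results = extractBottomUp(email, extractors);
console.log(JSON.stringify(results));
```

The extractor only ever sees the barcode node, yet its result ends up on the root node, which is why
individual extractors do not need to know how deeply their input was nested.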

### Data extraction

Data extraction is performed on the document tree starting at the leaf nodes, with results
propagating upwards towards the root node. This means that results from child nodes are available
to the extraction process, and can be extended/augmented there for example.

The entry point for data extraction is KItinerary::ExtractorEngine.

There are a number of built-in generic extractors for the following cases:
* The various ticket barcode types (IATA, UIC 918.3/9, ERA FCB, ERA SSB).
* Structured data in JSON-LD or XML microdata format included in HTML documents or iCal events.
* PDF flight boarding passes.
* Apple Wallet passes for flights, trains or events.
* iCal calendar events (depends on KItinerary::ExtractorEngine::ExtractGenericIcalEvents).
* ActivityPub events and places.

To cover anything not handled by this, there are vendor-specific extractor scripts. Those
can produce complete results or merely fix or augment what the generic extraction has produced.

Extractor scripts consist of two basic parts, the filter defining when it should be triggered
and the script itself (see KItinerary::ScriptExtractor). This is necessary as running all extractor
scripts against a given input would be too expensive. Filters therefore don't need to be perfect
(noticing in the script that it triggered on the wrong document is fine), but they do need to be fast.

### Data post-processing and augmentation

A number of additional processing steps are applied to extracted data
(see KItinerary::ExtractorPostprocessor).

#### Normalization

* Simplify whitespace in human-readable strings.
* Separate postal codes in addresses.
* Remove name prefixes.
* Convert human-readable country names into ISO 3166-1 alpha 2 country codes.
* Apply timezones to date/time values.
* Identify IATA airport codes based on airport names.

#### Augmentation

* Geographic coordinates based on IATA airport codes as well as a number of
  train station codes.
* Timezones based on geographic coordinates, or, where sufficiently unique, on
  country/region information.
* Countries and regions based on geographic coordinates.
* Countries based on international phone numbers (needs libphonenumber).

Most of this data is obtained from [OpenStreetMap](https://openstreetmap.org)
and [Wikidata](https://wikidata.org) and provided as part of this library. No
online operations are performed during extraction or post-processing.

#### Merging

If the result set contains multiple elements, merging elements referring
to the same incidence is attempted. Two cases are considered:

* Elements that are considered to refer to exactly the same incidence
  are folded into one.
* An element referring to a location change from A to B and two elements
  referring to a location change from A to C and C to B are considered
  to refer to the same trip, with the first one providing a lower level
  of detail. The first element is folded into the other two in that case.

#### Validation

In the final step all results are checked for containing a bare minimum of information
(e.g. time and name for an event), and for being self-consistent (e.g. start time before end time).
Invalid results are discarded. See KItinerary::ExtractorValidator.


## Creating extractor scripts

Extractor scripts are searched for in two locations:
* In the file system at `$XDG_DATA_DIRS/kitinerary/extractors`.
* Compiled into the binary at `:/org.kde.pim/kitinerary/extractors`.

Those locations are searched for JSON files containing one or more extractor script
declarations.

```json
{
    "mimeType": "application/pdf",
    "filter": [ { ... } ],
    "script": "my-extractor-script.js",
    "function": "extractTicket"
}
```

The above example shows a single script declaration. To declare multiple scripts in one
file, this can also be a JSON array of such objects. The individual fields are documented below.

### Extractor filters

Extractor filters are evaluated against document nodes. This can be the node the extractor
script wants to process, but also a descendant or ancestor node.

An extractor script filter consists of the following four properties:
* `mimeType`: the type of the node to match
* `field`: the property of the node content to match. This is ignored for nodes containing
  basic types such as plain text or binary data.
* `match`: a regular expression
* `scope`: this defines the relation to the node the script should be run on (Current, Parent,
  Children, Ancestors or Descendants).

#### Examples

Anything attached to an email sent by "booking@example-operator.com". The field matched against here
is the `From` header of the MIME message.

```json
{
    "mimeType": "message/rfc822",
    "field": "From",
    "match": "^booking@example-operator.com$",
    "scope": "Ancestors"
}
```

Documents containing a barcode of the format "FNNNNNNNN". Note that the scope here is `Descendants`
rather than `Children` as the direct child nodes tend to be the images containing the barcode.
Also note that backslashes in the regular expression have to be escaped in JSON.

```json
{
    "mimeType": "text/plain",
    "match": "^F\\d{8}$",
    "scope": "Descendants"
}
```

PDF documents containing the string "My Ferry Booking" anywhere. This should be used as a last resort
only, as matching against the full PDF document content can be expensive. An imprecise trigger on a
barcode is preferable to this.

```json
{
    "mimeType": "application/pdf",
    "field": "text",
    "match": "My Ferry Booking",
    "scope": "Current"
}
```

Apple Wallet passes issued by "org.kde.travelAgency".

```json
{
    "mimeType": "application/vnd.apple.pkpass",
    "field": "passTypeIdentifier",
    "match": "org.kde.travelAgency",
    "scope": "Current"
}
```

iCal events with an organizer email address of the "kde.org" domain. Note that the field here accesses
a property of a property. This works at arbitrary depth, as long as the corresponding types are
introspectable by Qt.

```json
{
    "mimeType": "internal/event",
    "field": "organizer.email",
    "match": "@kde.org$",
    "scope": "Current"
}
```

A (PDF) document containing an IATA boarding pass barcode of the airline "AB". Triggering
vendor-specific UIC or ERA railway tickets can be done very similarly, matching on the corresponding
carrier ids.

```json
{
    "mimeType": "internal/iata-bcbp",
    "field": "operatingCarrierDesignator",
    "match": "AB",
    "scope": "Descendants"
}
```

A node that already has results containing a reservation from "My Transport Operator".
This is useful for scripts that want to augment or fix schema.org annotations already provided by
the source. Note that the mimeType "application/ld+json" is special here as it doesn't only trigger
on the document node content itself, but also matches against the result of nodes of any type.

```json
{
    "mimeType": "application/ld+json",
    "field": "reservationFor.provider.name",
    "match": "My Transport Operator",
    "scope": "Current"
}
```

### Extractor scripts

Extractor scripts are defined by the following properties:
* `script`: The name of the script file.
* `function`: The name of the JS function that is called as the entry point into the script.
* `mimeType`: The MIME type the script can handle.
* `filter`: A list of extractor filters as described above.

Extractor scripts are run against a document node if all of the following conditions are met:
* The `mimeType` of the script matches that of the node.
* At least one of the extractor `filter` entries of the script matches the node.

The script entry point is called with three arguments (this being JS, some of those can be omitted
by the script and are then silently ignored):
* The first argument is the content of the node that is processed. The data type of that argument
  depends on the node type as described in the document model section above. This is usually
  what extractor scripts are most concerned with.
* The second argument is the document node being processed (see KItinerary::ExtractorDocumentNode).
  This can be useful to access already extracted results on a node (e.g. coming from generic extraction)
  in order to augment those.
* The third argument is the document node that matched the filter. This can be the same as the second
  argument (for filters with `scope` = Current), but it doesn't have to be. This is most useful when
  triggering on descendant nodes such as barcodes, the content of which will then be incorporated into
  the extraction result by the script.

The script entry point function is expected to return one of the following:
* A JS object following the schema.org ontology with a single extraction result.
* A JS array containing one or more such objects.
* Anything else (including empty arrays and script errors) is considered an empty result.

### Extractor scripts runtime environment

Extractor scripts are run inside a QJSEngine, i.e. that's the JS subset to work with.
There is some additional API available to extractor scripts (see the KItinerary::JsApi namespace).

API for supporting schema.org output:
* KItinerary::JsApi::JsonLd: factory functions for schema.org objects, date/time parsing, etc.

API for handling specific types of input data:
* KItinerary::JsApi::ByteArray: functions for dealing with byte-aligned binary data,
  including decompression, Base64 decoding, Protocol Buffer decoding, etc.
* KItinerary::JsApi::BitArray: functions for dealing with non byte-aligned binary data,
  such as reading numerical data at arbitrary bit offsets.
* KItinerary::JsApi::Barcode: functions for manual barcode decoding. This should be rarely
  needed nowadays, with the extractor engine doing this automatically and creating corresponding
  document nodes.

API for interacting with the extractor engine itself:
* KItinerary::JsApi::ExtractorEngine: this allows performing extraction recursively.
  This can be useful for elements that need custom decoding in an extractor script first,
  but that contain otherwise generally supported data formats. Standard barcodes encoded
  in URL arguments are such an example.

### Script development

[KItinerary Workbench](https://commits.kde.org/kitinerary-workbench) allows interactive development
of extractor scripts.

### Examples

Let's assume we want to create an extractor script for a railway ticket which comes with a simple
tabular layout for a single leg per page, and contains a QR code with a 10 digit number for each leg.

```
City A -> City B (Central Station)
Departure: 21 Jun 18:42
Arrival: 21 Jun 23:12
...
```

As a filter we'd use something similar to example 2 above, triggering on the barcode content.
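
For this ticket, such a filter could look roughly like the object in the sketch below. This is an
assumption derived from example 2 rather than a filter taken from a real extractor: the pattern
`^\d{10}$` simply encodes "exactly 10 digits", and the filter is written as a JavaScript object here
only so that the pattern can be tried out with the standard `RegExp` class.

```javascript
// Hypothetical filter for the example ticket, written as a JS object for illustration.
// In the JSON declaration file the same pattern is written as "^\\d{10}$".
const filter = {
    mimeType: "text/plain",
    match: "^\\d{10}$",
    scope: "Descendants",
};

// Quick check of the pattern against sample barcode contents.
const pattern = new RegExp(filter.match);
const matchesTicket = pattern.test("0123456789");
const matchesOther = pattern.test("F12345678");
console.log(matchesTicket, matchesOther);
```

A pattern this loose will also trigger on unrelated 10 digit barcodes. That is acceptable as long as
the script itself gives up when the document text does not look as expected.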

```js
function extractTicket(pdf, node, barcode)
{
    // text for the PDF page containing the barcode that triggered this
    const text = pdf.pages[barcode.location].text;

    // empty http://schema.org/TrainReservation object for the result
    let res = JsonLd.newTrainReservation();

    // when using regular expressions, matching on things that don't change in different
    // language variants is usually preferable, but might not always be possible
    // when creating regular expressions consider that various special characters might occur in names
    // of people or locations (in the above example spaces and parentheses)
    const leg = text.match(/(.*) -> (.*)/);

    // this can throw an error if the regular expression didn't match
    // that's fine though, the script is aborted here and considered not to have any result
    // i.e. handling this case explicitly is unnecessary here
    res.reservationFor.departureStation.name = leg[1];
    res.reservationFor.arrivalStation.name = leg[2];

    // date/time parsing can recover missing year numbers from context, if available
    // In our example it would consider the PDF creation time for that, and the resulting
    // date would be the first occurrence of the given day and month following that.
    res.reservationFor.departureTime = JsonLd.toDateTime(text.match(/Departure: (.*)/)[1], 'dd MMM hh:mm', 'en');
    // for supporting different language formats, both the format string and the locale
    // argument can be lists. All combinations are then tried until one yields a valid result.
    res.reservationFor.arrivalTime = JsonLd.toDateTime(text.match(/(?:Arrival|Arrivé|Ankunft): (.*)/)[1],
        ['dd MMM hh:mm', 'dd MMM hh.mm'], ['en', 'fr', 'de']);

    // the node that triggered this script (the barcode) can be accessed and integrated into the result
    res.reservedTicket.ticketToken = 'qrCode:' + barcode.content;

    return res;
}
```

The above example produces an entirely new result. Another common case is a script that
merely augments an existing result. Let's assume an Apple Wallet pass for a flight, where the
automatically extracted result is correct but misses the boarding group. The filter for
this would be similar to example 4 above, triggering on the pass issuer.

```js
// unused arguments can be omitted
function extractBoardingPass(pass, node)
{
    // use the existing result as a starting point
    // generally this can be more than one, but specific types of documents
    // might only produce a deterministic amount (like 1 in this case).
    let res = node.result[0];

    // modify the result as necessary
    // (the boarding group is the value of the pass field, not its label)
    res.boardingGroup = pass.field["group"].value;

    // returning a result here will replace the existing results for this node
    return res;
}
```

A large number of real-world examples can also be found in the `src/lib/scripts` folder of the source code
or browsed [here](https://invent.kde.org/pim/kitinerary/-/tree/master/src/lib/scripts).

## Using the extractor engine

### C++ API

Using the C++ API is the most flexible and efficient way to use this. This consists of three steps:
* Extraction: This will attempt to find relevant information in the given input documents, its
  output however can still contain duplicate or invalid results.
  There are some options to customize this step, e.g.
  trading more expensive image processing against
  finding more results, depending on how certain you are the input data is going to contain such data.
  See KItinerary::ExtractorEngine.
* Post-processing: This step merges duplicate or split results, but its output can still contain
  invalid elements.
  The main way to customize this step is in what you feed into it. For best results this should be all
  extractor results that can possibly contain information for a specific incidence.
  See KItinerary::ExtractorPostprocessor.
* Validation: This will remove any remaining incomplete or invalid results, or results of undesired types.
  For this step you typically want to specify the set of types your application can handle. Letting incomplete
  results pass can be useful if you do have an existing set of data you want to apply those to.
  See KItinerary::ExtractorValidator.

Example:
```c++
#include <KItinerary/ExtractorEngine>
#include <KItinerary/ExtractorPostprocessor>
#include <KItinerary/ExtractorValidator>
#include <KItinerary/JsonLdDocument>
#include <KItinerary/Reservation>

#include <QFile>

#include <algorithm>

using namespace KItinerary;

// Create an instance of the extractor engine
// use engine.setHints(...)
// to control its behavior
ExtractorEngine engine;

// feed raw data into the extractor engine
// passing a file name or MIME type in addition to the data is optional
// but can help with identifying the type of data passed in
// should you already have data in decoded form, see engine.setContent() instead
QFile f("my-document.pdf");
f.open(QFile::ReadOnly);
engine.setData(f.readAll(), f.fileName());

// perform the extraction
const auto extractedData = engine.extract();

// post process the extracted result
ExtractorPostprocessor postproc;

// ExtractorPostprocessor::process() can be called multiple times
// to accumulate a single merged result set
// the extractor output is JSON-LD, which is decoded into data type objects first
postproc.process(JsonLdDocument::fromJson(extractedData));
auto result = postproc.result();

// select the type of data you can consume
ExtractorValidator validator;
validator.setAcceptedTypes<TrainReservation, BusReservation>();
validator.setAcceptOnlyCompleteElements(true);

// remove invalid results
result.erase(std::remove_if(result.begin(), result.end(), [&validator](const auto &r) {
    return !validator.isValidElement(r);
}), result.end());
```

### Command line extractor

In cases where integrating with the C++ API isn't possible or desirable, there's also a command
line interface to this, `kitinerary-extractor`.

This reads input data from stdin and outputs schema.org JSON with the results.

For easier deployment, the command line extractor can also be built entirely statically. This
is available directly from the GitLab CI/CD pipeline on demand.
Nightly Flatpak builds are
also available from KDE's nightly Flatpak repository:

```
flatpak remote-add --if-not-exists kdeapps --from https://distribute.kde.org/kdeapps.flatpakrepo
flatpak install org.kde.kitinerary-extractor
```

## Contributing

Contributions of new extractor scripts as well as improvements to the extractor engine are very welcome,
preferably as merge requests for this repository.

Another way to contribute is by donating sample data. Unlike similar proprietary solutions our data
extraction runs entirely on your device, so we never get to see user documents and thus rely on donated
material to test and improve the extractor.

Samples can be sent to vkrause@kde.org and will not be published. Anything vaguely looking like a
train, bus, boat, flight, rental car, hotel, event or restaurant booking/ticket/confirmation/cancellation/etc.
is relevant, even when it is seemingly already extracted correctly (in many cases there are non-obvious details
we don't cover correctly yet). If possible, please provide material in its original unaltered form.
For emails the easiest way is "Forward As Attachment", as inline forwarding can destroy relevant details.

Feel free to join us in the [KDE Itinerary Matrix channel](https://matrix.to/#/#itinerary:kde.org)!