0001 /****************************************************************************
0002 **                                MIT License
0003 **
0004 ** Copyright (C) 2020-2022 Klarälvdalens Datakonsult AB, a KDAB Group company, info@kdab.com, author Marc Mutz <marc.mutz@kdab.com>
0005 **
0006 ** This file is part of KDToolBox (https://github.com/KDAB/KDToolBox).
0007 **
0008 ** Permission is hereby granted, free of charge, to any person obtaining a copy
0009 ** of this software and associated documentation files (the "Software"), to deal
0010 ** in the Software without restriction, including without limitation the rights
0011 ** to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
0012 ** copies of the Software, and to permit persons to whom the Software is
0013 ** furnished to do so, subject to the following conditions:
0014 **
0015 ** The above copyright notice and this permission notice (including the next paragraph)
0016 ** shall be included in all copies or substantial portions of the Software.
0017 **
0018 ** THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
0019 ** IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
0020 ** FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
0021 ** AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
0022 ** LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
0023 ** ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
0024 ** DEALINGS IN THE SOFTWARE.
0025 ****************************************************************************/
0026 
0027 #include "qstringtokenizer.h"
0028 #include "qstringalgorithms.h"
0029 
0030 /*!
0031     \class QStringTokenizer
0032     \brief The QStringTokenizer class splits strings into tokens along given separators
0033     \reentrant
0034 
0035     Splits a string into substrings wherever a given separator occurs,
0036     and returns a (lazy) list of those strings. If the separator does
0037     not match anywhere in the string, produces a single-element list
0038     containing this string.  If the separator is empty,
0039     QStringTokenizer produces an empty string, followed by each of the
0040     string's characters, followed by another empty string. The two
0041     enumerations Qt::SplitBehavior and Qt::CaseSensitivity further
0042     control the output.
0043 
0044     QStringTokenizer drives QStringView::tokenize(), but, at least with a
0045     recent compiler, you can use it directly, too:
0046 
0047     \code
0048     for (auto token : QStringTokenizer{string, separator})
0049         use(token);
0050     \endcode
0051 
0052     \note You should never, ever, name the template arguments of a
0053     QStringTokenizer explicitly.  If you can use C++17 Class Template
0054     Argument Deduction (CTAD), you may write
0055     \c{QStringTokenizer{string, separator}} (without template
0056     arguments).  If you can't use C++17 CTAD, you must use the
0057     QStringView::split() or QLatin1String::split() member functions
0058     and store the return value only in \c{auto} variables:
0059 
0060     \code
0061     auto result = string.split(sep);
0062     \endcode
0063 
0064     This is because the template arguments of QStringTokenizer have a
0065     very subtle dependency on the specific string and separator types
0066     with which they are constructed, and they don't usually
0067     correspond to the actual types passed.
0068 
0069     \section Lazy Sequences
0070 
0071     QStringTokenizer acts as a so-called lazy sequence, that is, each
0072     next element is only computed once you ask for it. Lazy sequences
0073     have the advantage that they only require O(1) memory. They have
0074     the disadvantage that, at least for QStringTokenizer, they only
0075     allow forward, not random-access, iteration.
0076 
0077     The intended use-case is that you just plug it into a ranged for loop:
0078 
0079     \code
0080     for (auto token : QStringTokenizer{string, separator})
0081         use(token);
0082     \endcode
0083 
0084     or a C++20 ranged algorithm:
0085 
0086     \code
0087     std::ranges::for_each(QStringTokenizer{string, separator},
0088                           [] (auto token) { use(token); });
0089     \endcode
0090 
0091     \section End Sentinel
0092 
0093     The QStringTokenizer iterators cannot be used with classical STL
0094     algorithms, because those require iterator/iterator pairs, while
0095     QStringTokenizer uses sentinels, that is, it uses a different
0096     type, QStringTokenizer::sentinel, to mark the end of the
0097     range. This improves performance, because the sentinel is an empty
0098     type. Sentinels are supported from C++17 (for ranged for)
0099     and C++20 (for algorithms using the new ranges library).
0100 
0101     QStringTokenizer falls back to a non-sentinel end iterator
0102     implementation if the compiler doesn't support separate types for
0103     begin and end iterators in ranged for loops
0104     (\link{https://wg21.link/P0184}{P0184}), in which case traditional
0105     STL algorithms will \em appear to be supported, but as you migrate
0106     to a compiler that supports P0184, such code will break.  We
0107     recommend using only the C++20 \c{std::ranges} algorithms, or, if
0108     you're stuck on C++14/17 for the time being,
0109     \link{https://github.com/ericniebler/range-v3}{Eric Niebler's
0110     Ranges v3 library}, which has the same semantics as the C++20
0111     \c{std::ranges} library.
0112 
0113     \section Temporaries
0114 
0115     QStringTokenizer is very carefully designed to avoid dangling
0116     references. If you construct a tokenizer from a temporary string
0117     (an rvalue), that argument is stored internally, so the referenced
0118     data isn't deleted before it is tokenized:
0119 
0120     \code
0121     auto tok = QStringTokenizer{widget.text(), u','};
0122     // return value of `widget.text()` is destroyed, but content was moved into `tok`
0123     for (auto e : tok)
0124         use(e);
0125     \endcode
0126 
0127     If you pass named objects (lvalues), then QStringTokenizer does
0128     not store a copy. You are responsible for keeping the named object's
0129     data around for longer than the tokenizer operates on it:
0130 
0131     \code
0132     auto text = widget.text();
0133     auto tok = QStringTokenizer{text, u','};
0134     text.clear();      // destroy content of `text`
0135     for (auto e : tok) // ERROR: `tok` references deleted data!
0136         use(e);
0137     \endcode
0138 
0139     \sa QStringView::split(), QLatin1String::split(), Qt::SplitBehavior, Qt::CaseSensitivity
0140 */
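
// The lazy-sequence and end-sentinel ideas described above can be sketched
// without Qt. This is a minimal illustration, not the QStringTokenizer
// implementation: LazyTokenizer and collectTokens() are hypothetical names,
// the sketch works on std::string_view only, it assumes a non-empty
// separator, and it needs C++17 for the differing begin()/end() types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for QStringTokenizer, restricted to std::string_view
// and to non-empty separators.
class LazyTokenizer
{
public:
    struct sentinel {}; // empty type that marks the end of the range

    class iterator
    {
    public:
        iterator(std::string_view haystack, std::string_view needle)
            : m_haystack(haystack), m_needle(needle)
        {
            advance(); // compute only the first token up front
        }

        std::string_view operator*() const { return m_token; }
        iterator &operator++() { advance(); return *this; }
        // Comparing against the sentinel only asks "are we done yet?".
        bool operator!=(sentinel) const { return !m_done; }

    private:
        void advance()
        {
            if (m_pos > m_haystack.size()) { // previous token was the last one
                m_done = true;
                return;
            }
            const std::size_t hit = m_haystack.find(m_needle, m_pos);
            if (hit == std::string_view::npos) {
                m_token = m_haystack.substr(m_pos);
                m_pos = m_haystack.size() + 1; // one past the end: no more tokens
            } else {
                m_token = m_haystack.substr(m_pos, hit - m_pos);
                m_pos = hit + m_needle.size();
            }
        }

        std::string_view m_haystack;
        std::string_view m_needle;
        std::string_view m_token;
        std::size_t m_pos = 0;
        bool m_done = false;
    };

    LazyTokenizer(std::string_view haystack, std::string_view needle)
        : m_haystack(haystack), m_needle(needle)
    {
        assert(!needle.empty()); // the sketch does not handle empty separators
    }

    iterator begin() const { return iterator{m_haystack, m_needle}; }
    sentinel end() const { return {}; }

private:
    std::string_view m_haystack;
    std::string_view m_needle;
};

// Helper for demonstration: drains the lazy sequence into owned strings.
std::vector<std::string> collectTokens(std::string_view haystack, std::string_view needle)
{
    std::vector<std::string> result;
    for (std::string_view token : LazyTokenizer{haystack, needle})
        result.emplace_back(token);
    return result;
}
```

// Each token is computed only when the loop advances, so the tokenizer itself
// needs O(1) memory; only collectTokens() allocates, and only because it
// deliberately materializes the sequence.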
0141 
0142 /*!
0143     \typedef QStringTokenizer::value_type
0144 
0145     Alias for \c{const QStringView} or \c{const QLatin1String},
0146     depending on the tokenizer's \c Haystack template argument.
0147 */
0148 
0149 /*!
0150     \typedef QStringTokenizer::difference_type
0151 
0152     Alias for qsizetype.
0153 */
0154 
0155 /*!
0156     \typedef QStringTokenizer::size_type
0157 
0158     Alias for qsizetype.
0159 */
0160 
0161 /*!
0162     \typedef QStringTokenizer::reference
0163 
0164     Alias for \c{value_type &}.
0165 
0166     QStringTokenizer does not support mutable references, so this is
0167     the same as const_reference.
0168 */
0169 
0170 /*!
0171     \typedef QStringTokenizer::const_reference
0172 
0173     Alias for \c{value_type &}.
0174 */
0175 
0176 /*!
0177     \typedef QStringTokenizer::pointer
0178 
0179     Alias for \c{value_type *}.
0180 
0181     QStringTokenizer does not support mutable iterators, so this is
0182     the same as const_pointer.
0183 */
0184 
0185 /*!
0186     \typedef QStringTokenizer::const_pointer
0187 
0188     Alias for \c{value_type *}.
0189 */
0190 
0191 /*!
0192     \typedef QStringTokenizer::iterator
0193 
0194     This typedef provides an STL-style const iterator for
0195     QStringTokenizer.
0196 
0197     QStringTokenizer does not support mutable iterators, so this is
0198     the same as const_iterator.
0199 
0200     \sa const_iterator
0201 */
0202 
0203 /*!
0204     \typedef QStringTokenizer::const_iterator
0205 
0206     This typedef provides an STL-style const iterator for
0207     QStringTokenizer.
0208 
0209     \sa iterator
0210 */
0211 
0212 /*!
0213     \typedef QStringTokenizer::sentinel
0214 
0215     This typedef provides an STL-style sentinel for
0216     QStringTokenizer::iterator and QStringTokenizer::const_iterator.
0217 
0218     \sa const_iterator
0219 */
0220 
0221 /*!
0222     \fn QStringTokenizer(Haystack haystack, String needle, Qt::CaseSensitivity cs, Qt::SplitBehavior sb)
0223     \fn QStringTokenizer(Haystack haystack, String needle, Qt::SplitBehavior sb, Qt::CaseSensitivity cs)
0224 
0225     Constructs a string tokenizer that splits the string \a haystack
0226     into substrings wherever \a needle occurs, and allows iteration
0227     over those strings as they are found. If \a needle does not match
0228     anywhere in \a haystack, a single element containing \a haystack
0229     is produced.
0230 
0231     \a cs specifies whether \a needle should be matched case
0232     sensitively or case insensitively.
0233 
0234     If \a sb is Qt::SkipEmptyParts, empty entries don't
0235     appear in the result. By default, empty entries are included.
0236 
0237     \sa QStringView::split(), QLatin1String::split(), Qt::CaseSensitivity, Qt::SplitBehavior
0238 */
0239 
0240 /*!
0241     \fn QStringTokenizer::const_iterator QStringTokenizer::begin() const
0242 
0243     Returns a const \l{STL-style iterators}{STL-style iterator}
0244     pointing to the first token in the list.
0245 
0246     \sa end(), cbegin()
0247 */
0248 
0249 /*!
0250     \fn QStringTokenizer::const_iterator QStringTokenizer::cbegin() const
0251 
0252     Same as begin().
0253 
0254     \sa cend(), begin()
0255 */
0256 
0257 /*!
0258     \fn QStringTokenizer::sentinel QStringTokenizer::end() const
0259 
0260     Returns a const \l{STL-style iterators}{STL-style sentinel}
0261     pointing to the imaginary token after the last token in the list.
0262 
0263     \sa begin(), cend()
0264 */
0265 
0266 /*!
0267     \fn QStringTokenizer::sentinel QStringTokenizer::cend() const
0268 
0269     Same as end().
0270 
0271     \sa cbegin(), end()
0272 */
0273 
0274 /*!
0275     \fn QStringTokenizer::toContainer(Container &&c) const &
0276 
0277     Convenience method to convert the lazy sequence into a
0278     (typically) random-access container.
0279 
0280     This function is only available if \c Container has a \c value_type
0281     matching this tokenizer's value_type.
0282 
0283     If you pass in a named container (an lvalue), then that container
0284     is filled, and a reference to it is returned.
0285 
0286     If you pass in a temporary container (an rvalue, incl. the default
0287     argument), then that container is filled, and returned by value.
0288 
0289     \code
0290     // assuming tok's value_type is QStringView, then...
0291     auto tok = QStringTokenizer{~~~};
0292     // ... rac1 is a QVector<QStringView>:
0293     auto rac1 = tok.toContainer();
0294     // ... rac2 is std::pmr::vector<QStringView>:
0295     auto rac2 = tok.toContainer<std::pmr::vector<QStringView>>();
0296     auto rac3 = QVarLengthArray<QStringView, 12>{};
0297     // appends the token sequence produced by tok to rac3
0298     //  and returns a reference to rac3 (which we ignore here):
0299     tok.toContainer(rac3);
0300     \endcode
0301 
0302     This gives you maximum flexibility in how you want the sequence to
0303     be stored.
0304 */
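
// The "lvalue container is filled and returned by reference, rvalue container
// is returned by value" behaviour described above can be obtained with a
// forwarding reference as the return type. The following is a minimal sketch
// under that assumption, not the actual toContainer() code: splitInto() is a
// hypothetical free function, and it splits on a single char for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// LContainer deduces to `C &` for a named container and to `C` for a
// temporary; with no argument, the default template argument is used.
template <typename LContainer = std::vector<std::string_view>>
LContainer splitInto(std::string_view haystack, char separator, LContainer &&c = {})
{
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = haystack.find(separator, pos);
        if (hit == std::string_view::npos) {
            c.push_back(haystack.substr(pos)); // last token
            break;
        }
        c.push_back(haystack.substr(pos, hit - pos)); // append, never overwrite
        pos = hit + 1;
    }
    // Returns a reference for lvalue containers, a moved value otherwise.
    return std::forward<LContainer>(c);
}
```

// Because tokens are appended, a container that already holds elements keeps
// them, which matches the "appends the token sequence" wording in the example
// above.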
0305 
0306 /*!
0307     \fn QStringTokenizer::toContainer(Container &&c) const &&
0308     \overload
0309 
0310     In addition to the constraints on the lvalue-this overload, this
0311     rvalue-this overload is only available when this QStringTokenizer
0312     does not store the haystack internally, as this could create a
0313     container full of dangling references:
0314 
0315     \code
0316     auto tokens = QStringTokenizer{widget.text(), u','}.toContainer();
0317     // ERROR: cannot call toContainer() on rvalue
0318     // 'tokens' references the data of the copy of widget.text()
0319     // stored inside the QStringTokenizer, which has since been deleted
0320     \endcode
0321 
0322     To fix, store the QStringTokenizer in a named variable:
0323 
0324     \code
0325     auto tokenizer = QStringTokenizer{widget.text(), u','};
0326     auto tokens = tokenizer.toContainer();
0327     // OK: the copy of widget.text() stored in 'tokenizer' keeps the data
0328     // referenced by 'tokens' alive.
0329     \endcode
0330 
0331     You can force this function into existence by passing a view instead:
0332 
0333     \code
0334     func(QStringTokenizer{QStringView{widget.text()}, u','}.toContainer());
0335     // OK: compiler keeps widget.text() around until after func() has executed
0336     \endcode
0337 */
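
// One way to make an rvalue-this call unavailable is a deleted ref-qualified
// overload. The sketch below illustrates that idea only; the real class
// constrains the overload on whether the haystack is stored internally, and
// OwningTokenizer and has_to_container are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical tokenizer that always owns its haystack.
class OwningTokenizer
{
public:
    explicit OwningTokenizer(std::string haystack) : m_haystack(std::move(haystack)) {}

    // Allowed on lvalues: the tokenizer outlives the call, so the views stay valid.
    std::vector<std::string_view> toContainer() const &
    {
        std::vector<std::string_view> result;
        const std::string_view view = m_haystack;
        std::size_t pos = 0;
        for (;;) {
            const std::size_t hit = view.find(',', pos);
            if (hit == std::string_view::npos) {
                result.push_back(view.substr(pos));
                break;
            }
            result.push_back(view.substr(pos, hit - pos));
            pos = hit + 1;
        }
        return result;
    }

    // Rejected on rvalues: the views would dangle once the temporary dies.
    std::vector<std::string_view> toContainer() const && = delete;

private:
    std::string m_haystack;
};

// Detects whether toContainer() can be called on an object of category T.
template <typename T, typename = void>
struct has_to_container : std::false_type {};
template <typename T>
struct has_to_container<T, std::void_t<decltype(std::declval<T>().toContainer())>>
    : std::true_type {};

static_assert(has_to_container<OwningTokenizer &>::value, "lvalue call compiles");
static_assert(!has_to_container<OwningTokenizer &&>::value, "rvalue call is rejected");
```

// With this in place, calling toContainer() directly on a temporary
// OwningTokenizer fails to compile, while the two-step form with a named
// tokenizer works.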
0338 
0339 /*!
0340     \fn qTokenize(Haystack &&haystack, Needle &&needle, Flags...flags)
0341     \relates QStringTokenizer
0342 
0343     Factory function for QStringTokenizer. You can use this function
0344     if your compiler doesn't yet support C++17 Class Template
0345     Argument Deduction (CTAD), but we recommend direct use of
0346     QStringTokenizer with CTAD instead.
0347 */