 * Line boundary analysis determines where a text string can be broken
 * when line-wrapping. The mechanism correctly handles punctuation and
 * hyphenated words.
 *
 * Sentence boundary analysis allows selection with correct
 * interpretation of periods within numbers and abbreviations, and
 * trailing punctuation marks such as quotation marks and parentheses.
 *
 * Word boundary analysis is used by search and replace functions, as
 * well as within text editing applications that allow the user to
 * select words with a double click. Word selection provides correct
 * interpretation of punctuation marks within and following
 * words. Characters that are not part of a word, such as symbols or
 * punctuation marks, have word-breaks on both sides.
 *
 * Character boundary analysis allows users to interact with
 * characters as they expect to, for example, when moving the cursor
 * through a text string. Character boundary analysis provides correct
 * navigation through character strings, regardless of how the
 * character is stored. For example, an accented character might be
 * stored as a base character and a diacritical mark. What users
 * consider to be a character can differ between languages.
 *
 * The text boundary positions are found according to the rules
 * described in Unicode Standard Annex #29, Text Boundaries, and
 * Unicode Standard Annex #14, Line Breaking Properties. These
 * are available at http://www.unicode.org/reports/tr14/ and
 * http://www.unicode.org/reports/tr29/.
 *
 * In addition to the C++ API defined in this header file, a
 * plain C API with equivalent functionality is defined in the
 * file ubrk.h
 *
 * Code snippets illustrating the use of the Break Iterator APIs
 * are available in the ICU User Guide,
 * https://unicode-org.github.io/icu/userguide/boundaryanalysis/
 * and in the sample program icu/source/samples/break/break.cpp
 *
 */
class U_COMMON_API BreakIterator : public UObject {
public:
    /**
     * destructor
     * @stable ICU 2.0
     */
    virtual ~BreakIterator();

    /**
     * Return true if another object is semantically equal to this
     * one. The other object should be an instance of the same subclass of
     * BreakIterator. Objects of different subclasses are considered
     * unequal.
     *
     * Return true if this BreakIterator is at the same position in the
     * same text, and is the same class and type (word, line, etc.) of
     * BreakIterator, as the argument. Text is considered the same if
     * it contains the same characters, it need not be the same
     * object, and styles are not considered.
     * @stable ICU 2.0
     */
    virtual bool operator==(const BreakIterator&) const = 0;

    /**
     * Returns the complement of the result of operator==
     * @param rhs The BreakIterator to be compared for inequality
     * @return the complement of the result of operator==
     * @stable ICU 2.0
     */
    bool operator!=(const BreakIterator& rhs) const { return !operator==(rhs); }

    /**
     * Return a polymorphic copy of this object. This is an abstract
     * method which subclasses implement.
     * @stable ICU 2.0
     */
    virtual BreakIterator* clone() const = 0;

    /**
     * Return a polymorphic class ID for this object. Different subclasses
     * will return distinct unequal values.
     * @stable ICU 2.0
     */
    virtual UClassID getDynamicClassID(void) const override = 0;

    /**
     * Return a CharacterIterator over the text being analyzed.
     * @stable ICU 2.0
     */
    virtual CharacterIterator& getText(void) const = 0;

    /**
     * Get a UText for the text being analyzed.
     * The returned UText is a shallow clone of the UText used internally
     * by the break iterator implementation. It can safely be used to
     * access the text without impacting any break iterator operations,
     * but the underlying text itself must not be altered.
     *
     * @param fillIn A UText to be filled in. If nullptr, a new UText will be
     *           allocated to hold the result.
     * @param status receives any error codes.
     * @return   The current UText for this break iterator. If an input
     *           UText was provided, it will always be returned.
     * @stable ICU 3.4
     */
    virtual UText *getUText(UText *fillIn, UErrorCode &status) const = 0;

    /**
     * Change the text over which this operates. The text boundary is
     * reset to the start.
     *
     * The BreakIterator will retain a reference to the supplied string.
     * The caller must not modify or delete the text while the BreakIterator
     * retains the reference.
     *
     * @param text The UnicodeString used to change the text.
     * @stable ICU 2.0
     */
    virtual void setText(const UnicodeString &text) = 0;

    /**
     * Reset the break iterator to operate over the text represented by
     * the UText. The iterator position is reset to the start.
     *
     * This function makes a shallow clone of the supplied UText. This means
     * that the caller is free to immediately close or otherwise reuse the
     * UText that was passed as a parameter, but that the underlying text itself
     * must not be altered while being referenced by the break iterator.
     *
     * All index positions returned by break iterator functions are
     * native indices from the UText. For example, when breaking UTF-8
     * encoded text, the break positions returned by next(), previous(), etc.
     * will be UTF-8 string indices, not UTF-16 positions.
     *
     * @param text The UText used to change the text.
     * @param status receives any error codes.
     * @stable ICU 3.4
     */
    virtual void setText(UText *text, UErrorCode &status) = 0;

    /**
     * Change the text over which this operates. The text boundary is
     * reset to the start.
     * Note that setText(UText *) provides similar functionality to this function,
     * and is more efficient.
     * @param it The CharacterIterator used to change the text.
     * @stable ICU 2.0
     */
    virtual void adoptText(CharacterIterator* it) = 0;

    enum {
        /**
         * DONE is returned by previous() and next() after all valid
         * boundaries have been returned.
         * @stable ICU 2.0
         */
        DONE = (int32_t)-1
    };

    /**
     * Sets the current iteration position to the beginning of the text, position zero.
     * @return The offset of the beginning of the text, zero.
     * @stable ICU 2.0
     */
    virtual int32_t first(void) = 0;

    /**
     * Set the iterator position to the index immediately BEYOND the last character in the text being scanned.
     * @return The index immediately BEYOND the last character in the text being scanned.
     * @stable ICU 2.0
     */
    virtual int32_t last(void) = 0;

    /**
     * Set the iterator position to the boundary preceding the current boundary.
     * @return The character index of the previous text boundary or DONE if all
     * boundaries have been returned.
     * @stable ICU 2.0
     */
    virtual int32_t previous(void) = 0;

    /**
     * Advance the iterator to the boundary following the current boundary.
     * @return The character index of the next text boundary or DONE if all
     * boundaries have been returned.
     * @stable ICU 2.0
     */
    virtual int32_t next(void) = 0;

    /**
     * Return the character index of the current iterator position within the text.
     * @return The boundary most recently returned.
     * @stable ICU 2.0
     */
    virtual int32_t current(void) const = 0;

    /**
     * Advance the iterator to the first boundary following the specified offset.
     * The value returned is always greater than the offset, or is
     * the value BreakIterator::DONE.
     * @param offset the offset to begin scanning.
     * @return The first boundary after the specified offset.
     * @stable ICU 2.0
     */
    virtual int32_t following(int32_t offset) = 0;

    /**
     * Set the iterator position to the first boundary preceding the specified offset.
     * The value returned is always smaller than the offset, or is
     * the value BreakIterator::DONE.
     * @param offset the offset to begin scanning.
     * @return The first boundary before the specified offset.
     * @stable ICU 2.0
     */
    virtual int32_t preceding(int32_t offset) = 0;

    /**
     * Return true if the specified position is a boundary position.
     * As a side effect, the current position of the iterator is set
     * to the first boundary position at or following the specified offset.
     * @param offset the offset to check.
     * @return True if "offset" is a boundary position.
     * @stable ICU 2.0
     */
    virtual UBool isBoundary(int32_t offset) = 0;

    /**
     * Set the iterator position to the nth boundary from the current boundary
     * @param n the number of boundaries to move by. A value of 0
     * does nothing. Negative values move to previous boundaries
     * and positive values move to later boundaries.
     * @return The new iterator position, or
     * DONE if there are fewer than |n| boundaries in the specified direction.
     * @stable ICU 2.0
     */
    virtual int32_t next(int32_t n) = 0;

    /**
     * For RuleBasedBreakIterators, return the status tag from the break rule
     * that determined the boundary at the current iteration position.
     *
     * For break iterator types that do not support a rule status,
     * a default value of 0 is returned.
     *
     * @return the status from the break rule that determined the boundary at
     *         the current iteration position.
     * @see RuleBasedBreakIterator::getRuleStatus()
     * @see UWordBreak
     * @stable ICU 52
     */
    virtual int32_t getRuleStatus() const;

    /**
     * For RuleBasedBreakIterators, get the status (tag) values from the break rule(s)
     * that determined the boundary at the current iteration position.
     *
     * For break iterator types that do not support rule status,
     * no values are returned.
     *
     * The returned status value(s) are stored into an array provided by the caller.
     * The values are stored in sorted (ascending) order.
     * If the capacity of the output array is insufficient to hold the data,
     * the output will be truncated to the available length, and a
     * U_BUFFER_OVERFLOW_ERROR will be signaled.
     *
     * @see RuleBasedBreakIterator::getRuleStatusVec
     *
     * @param fillInVec an array to be filled in with the status values.
     * @param capacity  the length of the supplied vector. A length of zero causes
     *                  the function to return the number of status values, in the
     *                  normal way, without attempting to store any values.
     * @param status    receives error codes.
     * @return          The number of rule status values from rules that determined
     *                  the boundary at the current iteration position.
     *                  In the event of a U_BUFFER_OVERFLOW_ERROR, the return value
     *                  is the total number of status values that were available,
     *                  not the reduced number that were actually returned.
     * @see getRuleStatus
     * @stable ICU 52
     */
    virtual int32_t getRuleStatusVec(int32_t *fillInVec, int32_t capacity, UErrorCode &status);

    /**
     * Create BreakIterator for word-breaks using the given locale.
     * Returns an instance of a BreakIterator implementing word breaks.
     * WordBreak is useful for word selection (e.g. double click).
     * @param where the locale.
     * @param status the error code
     * @return A BreakIterator for word-breaks. The UErrorCode& status
     * parameter is used to return status information to the user.
     * To check whether the construction succeeded or not, you should check
     * the value of U_SUCCESS(status). If you wish more detailed information, you
     * can check for informational error results which still indicate success.
     * U_USING_FALLBACK_WARNING indicates that a fall back locale was used. For
     * example, 'de_CH' was requested, but nothing was found there, so 'de' was
     * used. U_USING_DEFAULT_WARNING indicates that the default locale data was
     * used; neither the requested locale nor any of its fall back locales
     * could be found.
     * The caller owns the returned object and is responsible for deleting it.
     * @stable ICU 2.0
     */
    static BreakIterator* U_EXPORT2
    createWordInstance(const Locale& where, UErrorCode& status);

    /**
     * Create BreakIterator for line-breaks using specified locale.
     * Returns an instance of a BreakIterator implementing line breaks. Line
     * breaks are logically possible line breaks, actual line breaks are
     * usually determined based on display width.
     * LineBreak is useful for word wrapping text.
     * @param where the locale.
     * @param status The error code.
     * @return A BreakIterator for line-breaks. The UErrorCode& status
     * parameter is used to return status information to the user.
     * To check whether the construction succeeded or not, you should check
     * the value of U_SUCCESS(status). If you wish more detailed information, you
     * can check for informational error results which still indicate success.
     * U_USING_FALLBACK_WARNING indicates that a fall back locale was used. For
     * example, 'de_CH' was requested, but nothing was found there, so 'de' was
     * used. U_USING_DEFAULT_WARNING indicates that the default locale data was
     * used; neither the requested locale nor any of its fall back locales
     * could be found.
     * The caller owns the returned object and is responsible for deleting it.
     * @stable ICU 2.0
     */
    static BreakIterator* U_EXPORT2
    createLineInstance(const Locale& where, UErrorCode& status);

    /**
     * Create BreakIterator for character-breaks using specified locale
     * Returns an instance of a BreakIterator implementing character breaks.
     * Character breaks are boundaries of combining character sequences.
     * @param where the locale.
     * @param status The error code.
     * @return A BreakIterator for character-breaks. The UErrorCode& status
     * parameter is used to return status information to the user.
     * To check whether the construction succeeded or not, you should check
     * the value of U_SUCCESS(status).
     * If you wish more detailed information, you
     * can check for informational error results which still indicate success.
     * U_USING_FALLBACK_WARNING indicates that a fall back locale was used. For
     * example, 'de_CH' was requested, but nothing was found there, so 'de' was
     * used. U_USING_DEFAULT_WARNING indicates that the default locale data was
     * used; neither the requested locale nor any of its fall back locales
     * could be found.
     * The caller owns the returned object and is responsible for deleting it.
     * @stable ICU 2.0
     */
    static BreakIterator* U_EXPORT2
    createCharacterInstance(const Locale& where, UErrorCode& status);

    /**
     * Create BreakIterator for sentence-breaks using specified locale
     * Returns an instance of a BreakIterator implementing sentence breaks.
     * @param where the locale.
     * @param status The error code.
     * @return A BreakIterator for sentence-breaks. The UErrorCode& status
     * parameter is used to return status information to the user.
     * To check whether the construction succeeded or not, you should check
     * the value of U_SUCCESS(status). If you wish more detailed information, you
     * can check for informational error results which still indicate success.
     * U_USING_FALLBACK_WARNING indicates that a fall back locale was used. For
     * example, 'de_CH' was requested, but nothing was found there, so 'de' was
     * used. U_USING_DEFAULT_WARNING indicates that the default locale data was
     * used; neither the requested locale nor any of its fall back locales
     * could be found.
     * The caller owns the returned object and is responsible for deleting it.
     * @stable ICU 2.0
     */
    static BreakIterator* U_EXPORT2
    createSentenceInstance(const Locale& where, UErrorCode& status);

#ifndef U_HIDE_DEPRECATED_API
    /**
     * Create BreakIterator for title-casing breaks using the specified locale
     * Returns an instance of a BreakIterator implementing title breaks.
     * The iterator returned locates title boundaries as described for
     * Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
     * please use a word boundary iterator. See {@link #createWordInstance }.
     *
     * @param where the locale.
     * @param status The error code.
     * @return A BreakIterator for title-breaks. The UErrorCode& status
     * parameter is used to return status information to the user.
     * To check whether the construction succeeded or not, you should check
     * the value of U_SUCCESS(status). If you wish more detailed information, you
     * can check for informational error results which still indicate success.
     * U_USING_FALLBACK_WARNING indicates that a fall back locale was used. For
     * example, 'de_CH' was requested, but nothing was found there, so 'de' was
     * used. U_USING_DEFAULT_WARNING indicates that the default locale data was
     * used; neither the requested locale nor any of its fall back locales
     * could be found.
     * The caller owns the returned object and is responsible for deleting it.
     * @deprecated ICU 64 Use createWordInstance instead.
     */
    static BreakIterator* U_EXPORT2
    createTitleInstance(const Locale& where, UErrorCode& status);
#endif /* U_HIDE_DEPRECATED_API */

    /**
     * Get the set of Locales for which TextBoundaries are installed.
     * Note: this will not return locales added through the register
     * call. To see the registered locales too, use the getAvailableLocales
     * function that returns a StringEnumeration object