uax 14
1.0.0
Implementation of the Unicode Standard Annex #14 line breaking algorithm
About UAX-14
This is an implementation of the Unicode Standard Annex #14 line breaking algorithm. It provides a fast and convenient way to determine line breaking opportunities in text.
Note that this algorithm does not support break opportunities that require morphological analysis. In order to handle such cases, please consult a system that provides this kind of capability, such as a hyphenation algorithm.
Also note that this system is completely unaware of layout decisions. Decisions such as which breaks to pick, how to space words, how to handle bidirectionality, and what to do when an overfull line contains no break opportunities are left up to the user.
The system passes all tests offered by the Unicode standard.
How To
The system will compile binary database files on first load. Should anything go wrong during this process, a note is produced on load. If you would like to prevent this automated loading, push uax-14-no-load to *features* before loading. You can then manually load the database files when convenient through load-databases.
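As a rough sketch of the manual route, the snippet below opts out of the automatic load and loads the databases explicitly later. It assumes that the feature is the keyword :uax-14-no-load and that the ASDF system is named "uax-14"; neither spelling is confirmed by the text above, so adjust both to your setup. ensure-databases is a hypothetical helper, and falling back to compile-databases on a warning is only one possible policy.

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  ;; Assumption: the feature is a keyword, as is usual for *features*.
  (pushnew :uax-14-no-load *features*)
  ;; Assumption: the ASDF system is named "uax-14".
  (asdf:load-system "uax-14"))

(defun ensure-databases ()
  "Hypothetical helper: load the database files, recompiling them if loading warns."
  (handler-case (org.shirakumo.alloy.uax-14:load-databases)
    (warning (condition)
      ;; LOAD-DATABASES warns when a database file is missing.
      (format *error-output* "~&Database load failed: ~a~%" condition)
      ;; COMPILE-DATABASES calls LOAD-DATABASES itself when it succeeds.
      (org.shirakumo.alloy.uax-14:compile-databases))))
```

Calling (ensure-databases) once at application start-up, before any breaker is created, keeps the loading cost at a point you control.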
Once loaded, you can produce a list of line breaks for a string with list-breaks or break a string at every opportunity with break-string. Typically, however, you will want to scan for the next break as you move along the string during layout. To do so, create a breaker with make-breaker and call next-break whenever the next line break opportunity is required.
In pseudo-code, that could look something like this. We assume the local nickname uax-14 for org.shirakumo.alloy.uax-14 here.
(loop with breaker = (uax-14:make-breaker string)
      with start = 0 and last = 0
      do (multiple-value-bind (pos mandatory) (uax-14:next-break breaker)
           ;; NEXT-BREAK returns NIL once all opportunities are consumed.
           (unless pos (return))
           (loop while (beyond-extents-p start pos)
                 do (let ((next (if (< start last)
                                    last ; Break at the last opportunity on this line.
                                    ;; Force a break if we are overfull.
                                    (find-last-fitting-cluster start))))
                      (insert-break next)
                      (setf start next)))
           (when mandatory
             (insert-break pos)
             (setf start pos))
           (setf last pos)))
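To make the loop above concrete, here is a minimal self-contained sketch of greedy wrapping. It is not part of the library: wrap-greedy is a hypothetical helper, it measures width in characters (as if the font were monospaced, trailing spaces included), and it consumes a list of (position mandatory) entries in the form documented for list-breaks instead of driving a breaker directly.

```lisp
(defun wrap-greedy (string breaks width)
  "Split STRING into lines of at most WIDTH characters.
BREAKS is a list of (position mandatory) entries in ascending order."
  (check-type width (integer 1))
  (let ((lines '())
        (start 0)   ; index at which the current line begins
        (last 0))   ; most recent break opportunity seen
    (flet ((emit (end)
             (push (subseq string start end) lines)
             (setf start end)))
      (loop for (pos mandatory) in breaks
            do ;; While the text up to POS does not fit, break at the last
               ;; opportunity on this line, or force a break if there is none.
               (loop while (> (- pos start) width)
                     do (if (> last start)
                            (emit last)
                            (emit (+ start width))))
               (when mandatory (emit pos))
               (setf last pos))
      ;; Emit whatever remains after the final opportunity.
      (when (< start (length string))
        (emit (length string))))
    (nreverse lines)))

;; The break list here is written by hand for illustration.
(wrap-greedy "aaa bbb ccc" '((4 nil) (8 nil) (11 t)) 8)
;; => ("aaa bbb " "ccc")
```

A real layout engine would replace the character count with a measurement of the rendered text and would force breaks at cluster boundaries, not at arbitrary character indices.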
External Files
The following files are from their corresponding external sources, last accessed on 2019.09.03:
LineBreak.txt
https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt
LineBreakTest.txt
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.txt
At the time, Unicode 12.1 was considered the latest version.
Acknowledgements
The code in this project is largely based on the linebreak project by Devon Govett et al.
System Information
Definition Index
-
ORG.SHIRAKUMO.ALLOY.UAX-14
No documentation provided.
-
EXTERNAL SPECIAL-VARIABLE *LINE-BREAK-DATABASE-FILE*
Variable containing the absolute path of the line break database file.
See LOAD-DATABASES
See COMPILE-DATABASES
-
EXTERNAL SPECIAL-VARIABLE *PAIR-TABLE-FILE*
Variable containing the absolute path of the pair table file.
See LOAD-DATABASES
See COMPILE-DATABASES
-
EXTERNAL STRUCTURE BREAKER
Contains line breaking state.
An instance of this is only useful for passing to MAKE-BREAKER and NEXT-BREAK. It contains internal state that manages the line breaking algorithm.
See MAKE-BREAKER
See NEXT-BREAK
-
EXTERNAL FUNCTION BREAK-STRING (STRING &OPTIONAL MANDATORY-ONLY BREAKER)
Returns a list of the pieces obtained by splitting the string at its line break opportunities.
If MANDATORY-ONLY is T, the string is only split at mandatory line break opportunities, otherwise it is split at every opportunity.
See MAKE-BREAKER
See NEXT-BREAK
-
EXTERNAL FUNCTION COMPILE-DATABASES
Compiles the database files from their sources.
This will load an optional part of the system and compile the database files to an efficient byte representation. If the compilation is successful, LOAD-DATABASES is called automatically.
See *LINE-BREAK-DATABASE-FILE*
See *PAIR-TABLE-FILE*
See LOAD-DATABASES
-
EXTERNAL FUNCTION LIST-BREAKS (STRING &OPTIONAL BREAKER)
Returns a list of all line break opportunities in the string.
The list has the following form:
  LIST  ::= ENTRY+
  ENTRY ::= (position mandatory)
This is equivalent to constructing a breaker and collecting the values of NEXT-BREAK in a loop.
See MAKE-BREAKER
See NEXT-BREAK
-
EXTERNAL FUNCTION LOAD-DATABASES
Loads the databases from their files into memory.
If one of the files is missing, a warning of type NO-DATABASE-FILES is signalled. If the loading succeeds, T is returned.
See *LINE-BREAK-DATABASE-FILE*
See *PAIR-TABLE-FILE*
See NO-DATABASE-FILES
-
EXTERNAL FUNCTION MAKE-BREAKER (STRING &OPTIONAL BREAKER)
Returns a breaker that can find line break opportunities in the given string.
If the optional breaker argument is supplied, the supplied breaker is modified and reset to work with the new string instead. This allows you to re-use a breaker.
Note that while you may pass a non-simple string, modifying this string without resetting any breaker using it will result in undefined behaviour.
See BREAKER
-
EXTERNAL FUNCTION NEXT-BREAK (BREAKER)
Returns the next line breaking opportunity of the breaker, if any.
Returns two values:
  POSITION  --- The character index in the string at which the break is located, or NIL if no further breaks are possible.
  MANDATORY --- Whether the break must be made at this location.
Note that there is always at least one break opportunity, namely at the end of the string. However, after consuming this break opportunity, NEXT-BREAK will return NIL.
Note that you may have to insert additional line breaks as required by the layout constraints.
See BREAKER
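The contract of MAKE-BREAKER and NEXT-BREAK can be sketched as follows: drain a breaker until NEXT-BREAK returns NIL, then reset the same breaker for the next string. collect-breaks and breaks-per-string are hypothetical helper names, not part of the library, and the sketch assumes the system is already loaded with its databases available. The full package name is spelled out so that no local nickname is required.

```lisp
(defun collect-breaks (breaker)
  "Drain BREAKER, returning its opportunities as (position mandatory) lists."
  (loop for (position mandatory) = (multiple-value-list
                                    (org.shirakumo.alloy.uax-14:next-break breaker))
        ;; A NIL position means every opportunity has been consumed.
        while position
        collect (list position mandatory)))

(defun breaks-per-string (strings)
  "Return one break list per string in STRINGS, re-using a single breaker."
  (when strings
    (let ((breaker (org.shirakumo.alloy.uax-14:make-breaker (first strings))))
      (cons (collect-breaks breaker)
            (loop for string in (rest strings)
                  ;; Passing the old breaker resets it to work on STRING.
                  do (org.shirakumo.alloy.uax-14:make-breaker string breaker)
                  collect (collect-breaks breaker))))))
```

Re-using one breaker this way avoids allocating fresh state for every string, which may matter when breaking many short strings in a row.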
-