Automated Bulgarian Hyphenation

One peculiarity of Bulgarian is that its words are, on average, longer than English words. Hyphenation therefore matters more when typesetting Bulgarian text than when typesetting English text. Knuth's line-breaking algorithm needs no hyphenation at all in most English paragraphs. With Bulgarian text, however, even Knuth's algorithm will hyphenate in most paragraphs. Hyphenation becomes an absolute necessity if we want nicely justified paragraphs from software with a naive line-breaking algorithm, such as LibreOffice.

According to Decree 936 of the Council of Ministers promulgated on 27 November 1950, the Institute for Bulgarian Language at the Bulgarian Academy of Sciences is authorised to publish the rules of the orthography of the Bulgarian language (within certain limits).

Hyphenation rules between 1945 and 1983

Between 1945 and 1983 Bulgarian used syllable hyphenation with two morphological exceptions: hyphenation was preferred between a prefix and a stem and at the boundary between the parts of a compound word. The following rules governed the hyphenation:

In some rare cases the correct application of rule 9 depends on the meaning of the word. For example, пре-дреша /pre-dresha/ 'change clothes' but пред-реша /pred-resha/ 'predetermine', or прес-пите /pres-pite/ 'the snow-drifts' but пре-спите /pre-spite/ 'sleep for a while/overnight'.

Hyphenation rules between 1983 and 2012

The Orthographic Dictionary published by the Institute for Bulgarian Language in 1983 introduced new hyphenation rules. The complexity of the previous rules was the main reason for the change. The new rules had two objectives: simplicity and unambiguity.

Because these rules disregard morphology completely, they lead to some strange results. For example, пре-дизвестие /pre-dizvestie/ is permitted and пред-известие /pred-izvestie/ is forbidden, зад-вижвам /zad-vizhvam/ is permitted and за-движвам /za-dvizhvam/ is forbidden, авток-луб /avtok-lub/ is permitted and авто-клуб /avto-klub/ is forbidden, вакуу-мапарат /vakuu-maparat/ is permitted and вакуум-апарат /vakuum-aparat/ is forbidden. Because of this, the new rules were not universally accepted. The old rules are still mentioned in various places on the Internet and are even included in some grammar books published by the publishing houses of the Ministry of Education and of Sofia University. Software developers, however, quickly took a liking to the new hyphenation rules.

Hyphenation rules after 2012

In 2012 new rules came into force. There are two differences with respect to the previous rules:

Good hyphenation is a complex matter and it seems the linguists at the Institute for Bulgarian Language have recognised this. They no longer attempt to provide universal rules about everything. Instead, they provide some very permissive rules, while the good application of these rules is left to the discretion and experience of printers and developers of hyphenation software.

It makes sense to use at least two different sets of hyphenation rules for Bulgarian. In most cases a more restrictive version should be used, one which attempts to eliminate the controversial cases of hyphenation. When typesetting a Bulgarian text in a narrow newspaper column, however, it will be appropriate to use more liberal hyphenation rules. It should be noted that one of the reasons for the hyphenation reform in 1983 was the desire to fix the chaotic hyphenation in the Bulgarian newspapers at that time.

Computer implementations

Mathematical analysis of the Bulgarian hyphenation

The earliest mathematical analysis of the Bulgarian hyphenation rules is due to Veska Noncheva.² In 1988 she proposed a mathematical formalisation of the hyphenation rules as a table with 22 rows.³

In the same year Eugene Belogay⁴ proposed an alternative formalisation with only 9 rules.⁵ Belogay proved that his rules are consistent and that they form a minimal set. The rules of Belogay are negative in character: every hyphenation which is not forbidden by some rule is permitted.

Here А denotes an arbitrary vowel letter, Б denotes an arbitrary consonant letter (including ь and й), ТТ denotes a sequence of two equal consonant letters and the letters й, ь, д and ж denote themselves. For example, the rule "Б-А" says that a consonant letter may not be separated from an immediately following vowel letter.

The eighth rule of Belogay says that hyphenation is forbidden before the first and after the last vowel letter. The ninth rule of Belogay says that hyphenation is forbidden immediately after the first or immediately before the last letter of the word.
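The eighth and ninth rules are simple enough to state directly in code. The following Python sketch is only an illustration and not Belogay's Pascal implementation; the function name and the convention that a break position `i` means a split between `word[:i]` and `word[i:]` are assumptions made here.

```python
VOWELS = set("аеиоуъюя")  # the eight Bulgarian vowel letters


def allowed_by_rules_8_and_9(word):
    """Return the break positions that rules 8 and 9 do not forbid.

    A position i stands for a split between word[:i] and word[i:].
    The other seven rules are not modelled in this sketch.
    """
    lower = word.lower()
    allowed = []
    for i in range(1, len(lower)):
        # Rule 9: no break right after the first or right before the last letter.
        if i == 1 or i == len(lower) - 1:
            continue
        # Rule 8: no break before the first or after the last vowel letter,
        # which means that both parts must contain at least one vowel.
        has_vowel_before = any(ch in VOWELS for ch in lower[:i])
        has_vowel_after = any(ch in VOWELS for ch in lower[i:])
        if has_vowel_before and has_vowel_after:
            allowed.append(i)
    return allowed
```

For a one-syllable word such as бридж /bridzh/ the function returns an empty list, because every possible break leaves one part without a vowel.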

Notice that it is very easy to translate the rules of Belogay into the form required by the hyphenation algorithm of Knuth and Liang used in TeX.⁶ Recall that this algorithm matches the word against a set of string patterns in which an odd number says that hyphenation is permitted at this position and an even number says that hyphenation is forbidden. When two patterns give conflicting numbers for the same position, the greater number wins.

First, since the rules of Belogay are negative (they say where hyphenation is forbidden, not where it is permitted), we have to permit the hyphenation everywhere:

Since no Bulgarian word starts with more than four consonants and no Bulgarian word ends with more than three consonants, the eighth rule of Belogay can be translated in the following way:

The ninth rule of Belogay means that the left and right hyphen minimums (\lefthyphenmin and \righthyphenmin in TeX) should be set to 2.
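The pattern matching described above can be sketched in Python as follows. This is a simplified illustration of the scheme of Knuth and Liang, not the implementation used in TeX (which stores the patterns in a trie); the function names and the toy pattern set are assumptions. The toy set permits a break before every letter and then encodes only the rule "Б-А" as patterns of the form Б2А, so it is far from a complete set of Bulgarian patterns.

```python
VOWELS = "аеиоуъюя"
CONSONANTS = "бвгджзйклмнпрстфхцчшщь"


def parse_pattern(pattern):
    """Split a pattern such as 'с2е' into its letters and inter-letter digits."""
    letters = []
    digits = [0]  # digits[k] is the value for the gap before letter k
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


def hyphenation_points(word, patterns, left_min=2, right_min=2):
    """Return positions i such that a break between word[:i] and word[i:] is allowed."""
    dotted = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(dotted) + 1)   # scores[k] is the gap before dotted[k]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:
            for offset, digit in enumerate(digits):
                # When patterns conflict, the greater number wins.
                scores[start + offset] = max(scores[start + offset], digit)
            start = dotted.find(letters, start + 1)
    # Odd numbers permit a break; the hyphen minimums trim both ends.
    return [i for i in range(left_min, len(word) - right_min + 1)
            if scores[i + 1] % 2 == 1]


# Toy patterns (illustrative only): permit a break before every letter ...
toy_patterns = ["1" + letter for letter in VOWELS + CONSONANTS]
# ... and forbid separating a consonant from an immediately following vowel.
toy_patterns += [c + "2" + v for c in CONSONANTS for v in VOWELS]

print(hyphenation_points("сестра", toy_patterns))  # prints [2, 3, 4]
```

With the hyphen minimums set to 2, the positions next to the first and last letter are never examined, which is exactly the effect of the ninth rule.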

The work of Eugene Belogay was not limited to a mathematical analysis of the Bulgarian hyphenation rules. In his paper he also published a short algorithm in Pascal which implements these rules. It did not take long for this algorithm to be adopted in various text-processing software. The algorithm of Belogay remained famous for many years. As late as 1997, the author of one book about TeX gave no explanation at all and simply wrote about "the algorithm of Belogay" as something well known to the reader.⁷

Bulgarian hyphenation in TeX

One unfortunate design decision of Knuth was that the hyphenation algorithm of TeX applied the hyphenation patterns not to the input character codes but to the internal codes of the glyphs in the font. This created a problem for the Cyrillic languages because the Cyrillic fonts in TeX did not have a standardised encoding. Perhaps this is one of the reasons why the earliest implementations of Bulgarian hyphenation in TeX did not rely on the internal hyphenation algorithm of TeX. Instead, external tools were used to insert soft hyphens in all Bulgarian words. For example, such a tool would replace the word сричкопренасяне /srichkoprenasyane/ with срич\-коп\-ре\-на\-ся\-не /srich\-kop\-re\-na\-sya\-ne/. The saying "To every disadvantage there is a corresponding advantage" holds here: since Cyrillic and Latin letters use different character codes, an external tool could easily insert soft hyphens in all Bulgarian words while leaving the TeX commands intact.
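A tool of this kind can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the code of any of the historical tools: the function name is an assumption, and the caller is assumed to supply a function `break_points` that returns the permitted break positions for a word. The character class covers the basic Cyrillic letters U+0410 to U+044F, which include all thirty Bulgarian letters.

```python
import re

# Runs of basic Cyrillic letters; TeX commands use Latin letters and are skipped.
CYRILLIC_WORD = re.compile(r"[А-Яа-я]+")


def insert_soft_hyphens(tex_source, break_points):
    """Insert the TeX soft hyphen \\- at every permitted break in Cyrillic words.

    break_points(word) must return a sorted list of positions i, each meaning
    a permitted break between word[:i] and word[i:].
    """
    def hyphenate(match):
        word = match.group(0)
        pieces = []
        previous = 0
        for i in break_points(word):
            pieces.append(word[previous:i])
            previous = i
        pieces.append(word[previous:])
        return "\\-".join(pieces)

    return CYRILLIC_WORD.sub(hyphenate, tex_source)
```

Because the regular expression matches only Cyrillic letters, a command such as `\textbf` passes through unchanged while its Bulgarian argument receives soft hyphens.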

The earliest known attempt to use the hyphenation algorithm of TeX for Bulgarian was made by Ognyan Tonev in 1990.⁸ He described his work as "a not very good translation of the rules. I work in this direction. But I don't have a 100% working complect of patterns. So, the copy I send to you⁹ is only a beta-version." The hyphenation patterns of Tonev don't work correctly and it seems he never completed his work.

The first usable Bulgarian hyphenation patterns for TeX were developed by Georgi Boshnakov¹⁰ in 1994. To solve the encoding problem, Boshnakov developed TeX fonts supporting the MIK encoding (the prevalent encoding in Bulgaria at that time). This allowed him to introduce a fully working implementation only a few months after LaTeX2e became the official LaTeX version. Later Boshnakov adapted his work to the Babel system. The hyphenation patterns of Boshnakov did their job well enough that, for almost a quarter of a century after their creation, they remained the only Bulgarian hyphenation patterns in the standard distributions of TeX and on CTAN.

There are some similarities between the patterns of Boshnakov and the patterns of Belogay. The following are the main differences.

First, Boshnakov used an ingenious and more compact implementation of the second and the third rule. Instead of {А2ББ, Б2ТТ, ТТ2Б}, or 8×22×22+22×22+22×22=4840 patterns in total, Boshnakov has patterns of the form 2Б3Б2 and 4Т3Т4, or only 22×22=484 in total, with the same effect.
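The two pattern counts can be checked by enumerating the patterns. The Python sketch below assumes the usual split of the thirty Bulgarian letters into 8 vowels and 22 consonants (with й and ь counted as consonants, as above), and it assumes that the compact form uses 4Т3Т4 for a doubled consonant and 2Б3Б2 for every other pair, which is one reading that reproduces the total of 484.

```python
VOWELS = "аеиоуъюя"
CONSONANTS = "бвгджзйклмнпрстфхцчшщь"

# Direct translation of the rules: А2ББ, Б2ТТ and ТТ2Б.
naive = set()
for b in CONSONANTS:
    for c in CONSONANTS:
        for a in VOWELS:
            naive.add(a + "2" + b + c)  # А2ББ
        naive.add(b + "2" + c + c)      # Б2ТТ
        naive.add(c + c + "2" + b)      # ТТ2Б

# Compact form: one pattern per ordered pair of consonants.
compact = set()
for b in CONSONANTS:
    for c in CONSONANTS:
        if b == c:
            compact.add("4" + b + "3" + c + "4")  # 4Т3Т4
        else:
            compact.add("2" + b + "3" + c + "2")  # 2Б3Б2

print(len(naive), len(compact))  # prints 4840 484
```

The sketch only confirms the counts; it does not verify that the two sets produce the same hyphenation.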

The second main difference between the patterns of Boshnakov and the patterns of Belogay concerns the letter combination дж /dzh/. In Bulgarian this letter combination can denote either a single consonant or a sequence of two consonants, and the hyphenation rules change accordingly. Unfortunately, it is impossible to know the meaning of дж /dzh/ without a dictionary. The solution of Belogay was a cautious one: his rules hyphenate in a way which is correct regardless of whether дж /dzh/ is a single consonant or a sequence of two consonants. The approach of Boshnakov, on the other hand, is a bold one: since дж /dzh/ is more often a single consonant, his rules assume that it is always a single consonant. The number of cases in which this decision leads to bad hyphenation is insignificant in comparison with the number of cases in which the hyphenation is improved.

The third main difference between the patterns of Boshnakov and the patterns of Belogay concerns the eighth rule. Its implementation in the rules of Boshnakov is rather limited, which leads to wrong hyphenations like бри-дж /bri-dzh/. A full implementation of this rule would require 11660 patterns in total, which would have been too many for the computers of 1994.

Later developments

In 1995 Atanas Topalov defended a Master's thesis in the Faculty of Mathematics and Informatics at Sofia University titled "Algorithms and software about text processing".¹¹ One of the main topics of his thesis was Bulgarian hyphenation. Topalov vehemently criticised the official hyphenation rules and their total disregard of morphology. He wrote:

Topalov proposed his own hyphenation algorithm. The hyphenation it generated was smooth and easy to read. One obvious defect of the algorithm of Topalov was that it contradicted the official hyphenation rules at that time. One can argue, however, that his algorithm is compatible with the current hyphenation rules.

In 1999 Svetla Koeva¹² wrote a paper about automated Bulgarian hyphenation.¹³ At that time she was a junior member of the Department of Computational Linguistics at the Institute for Bulgarian Language, but now she is director of the whole institute. The paper of Koeva contains a list of hyphenation patterns which can be used as a basis for automated hyphenation. In 2004, with the help of Stoyan Mihov,¹⁴ the rules of Koeva were formalised with regular relations and rewriting rules. They were implemented in a software product named ItaEst which provided Bulgarian hyphenation and grammar checking for various software products of Microsoft and Apple.

The main difference between the hyphenation of Koeva and the official hyphenation rules effective after 2012 is that a long sequence of consonants between two vowels is separated according to the rules valid before 1983. For example, се-стра /se-stra/ and ай-сберг /ay-sberg/ are permitted. The main difference between the hyphenation of Koeva and the official hyphenation rules effective before 1983 is that the rules of Koeva disregard the morphology of the words. One rule is peculiar to Koeva: in a sequence of two sonorant consonants between two vowels, we are permitted to separate the first vowel from the first consonant, for example материа-лна /materia-lna/.

In 2000 Anton Zinoviev¹⁵ created new hyphenation patterns for TeX. He didn't know about the previous work of Boshnakov and he didn't bother to make his work available in the various TeX distributions and CTAN. His work was used mostly by the local Linux enthusiasts and the colleagues of Zinoviev. In 2001 Radostin Radnev¹⁶ created a free grammar dictionary of Bulgarian¹⁷ where he used the hyphenation patterns of Zinoviev. From there the work of Zinoviev propagated to OpenOffice, LibreOffice and various online dictionaries, including http://bg.wiktionary.org and http://rechnik.chitanka.info.

The following are the main differences between the hyphenation of Zinoviev and the hyphenation of Boshnakov.

Second, the rules of Zinoviev try to detect when the letters дж /dzh/ (and дз /dz/) denote a single consonant and when they denote a sequence of two consonants. By default, however, Zinoviev (like Boshnakov) assumes that дж /dzh/ is a single consonant and hyphenates accordingly.

At the start of his work on Bulgarian hyphenation, Zinoviev had the opportunity to discuss hyphenation with Svetla Koeva. He remembers that Koeva suggested some cases of unpleasant hyphenation to him. Unfortunately, he took no notes, so he no longer knows which cases were suggested by Koeva and which are his own findings.

The present work

Motivation

The present work was carried out on the initiative of the leader of the Bulgarian localisation team of Mozilla, who contacted Zinoviev, Boshnakov and the maintainers of the TeX hyphenation patterns.¹⁸ This work pursues the following main objectives:

The current official hyphenation rules for Bulgarian are rather liberal. Very often, in a long sequence of consonants, we are permitted to split the word at any position, for example аген-т-с-т-во /agen-t-s-t-vo/. This invites many unusual and unexpected divisions that break the reader's attention or deceive his expectations as his eyes move to the next line. On the other hand, nicely justified paragraphs do not need so many hyphenation possibilities. It would be sufficient to permit just one division point between any two syllables.

Therefore, it makes sense to use a more restrictive version of the Bulgarian hyphenation, one which eliminates the controversial cases of hyphenation. A more liberal version is appropriate only when typesetting Bulgarian text in a very narrow newspaper column. It should be noted that some specialised English dictionaries also separate the word-division positions into two categories: preferred positions and less recommended positions.

There are two methods to determine the optimal division within a sequence of consonants between two vowels:

Hyphenation according to the syllables in the word

Let us look at the properties of the Bulgarian syllables. All syllables have the following structure:

The nucleus in Bulgarian is always a vowel. Both the onset and the coda are (possibly empty) sequences of consonants.

The Bulgarian syllables adhere to the Sonority Sequencing Principle. According to this principle, the consonants within the onset have rising sonority and the consonants within the coda have decreasing sonority.

Several grammar books agree that the following sonority scale is valid for Bulgarian:

According to the investigations of the author, the only exception to this law is due to the letter в /v/, which is a voiced obstruent but can also be used as a voiceless obstruent. This exception is due to a spelling peculiarity of the Bulgarian language. Whenever the letter в /v/ seemingly violates the Sonority Sequencing Principle, in the spoken language this letter is read as ф /f/, that is, as a voiceless obstruent (for example, the word отвсякъде /otvsyakade/ is read as отфсякъде /otfsyakade/).¹⁹

The author has found that the sonorant consonants in Bulgarian have their own sonority scale:

Only a few words such as жанр /zhanr/ and химн /himn/ violate this scale. Such words are always loan-words and their pronunciation is somewhat problematic for the native Bulgarian speakers.

In addition to the Sonority Sequencing Principle, the consonant clusters within the Bulgarian syllable adhere to the following additional principles:

From all these properties of the Bulgarian syllable we can deduce the following hyphenation rules:

With so many prohibitive rules, a question arises: if we apply all these rules, aren't we going to eliminate too many hyphenation possibilities? The answer is no. It can be demonstrated that between any two consecutive syllables at least one separation point will be permitted.

Hyphenation according to the morphology

Between 1983 and 2012 the official orthographic rules of the Bulgarian language forbade morphologically based hyphenation. After 2012 such hyphenation is permitted (but not obligatory).

The most important case in which morphologically based hyphenation is very desirable is that of compound words. Divisions such as авток-луб /avtok-lub/ and вакуу-мапарат /vakuu-maparat/ are extremely irritating even if they are formally correct. Unfortunately, we do not have a dictionary of Bulgarian compound words that would permit us to produce rules for automated hyphenation. Therefore, the current Bulgarian hyphenation patterns do not attempt to apply morphological hyphenation to such words.

Second in importance (but far more significant in number) is the case of word prefixes. While the reader's eyes are still at the start of the word, the word is still unknown to him. At this point it is very important not to deceive his expectations. For example, when the reader sees над- /nad-/ at the end of the line, he will expect the prefix над- /nad-/ with the meaning 'attain more than'. This expectation is frustrated if над- /nad-/ is not really a prefix but a misleading (though formally correct) hyphenation of the word надремя /nadremya/ 'have dozed enough', where the real prefix is not над- /nad-/ but на- /na-/ with the meaning 'achieve a state after accumulation'. Such hyphenation distracts the reader and makes reading more difficult.

Third in importance is the case with the word suffixes. With respect to the hyphenation rules we can divide the suffixes into three categories:

Even though morphological hyphenation is possible with the suffixes of the third category, it turns out to be less useful there than it is with prefixes. When the reader's eyes have reached this part of the word, the word is already more or less known to him. Therefore, at this point morphological hyphenation does not provide any significant advantage over the simpler hyphenation based only on the syllables of the word. Consider for example the word геройс-тво /geroys-tvo/ with the suffix -ство /-stvo/. When the reader sees геройс- /geroys-/ at the end of the line, this gives him an early clue that the suffix of the word is -ство /-stvo/. Such non-morphological hyphenation does not deceive the expectations of the reader. On the contrary, it makes reading easier because it gives the reader clues about what follows on the next line.

Because of these considerations, the current Bulgarian hyphenation patterns do not attempt to use morphological hyphenation with respect to the suffixes of the words. It would nevertheless be useful to implement rules about the suffixes of the second category. Hopefully, some future version will have such rules.

Occasionally,²¹ a fourth morphological requirement is stated: that hyphenation should respect the boundary between the word and the definite articles -та /-ta/ and -те /-te/ (which are postfixed in Bulgarian). There is no need to pay special attention to this rule because it appears to be satisfied naturally. The author searched a dictionary with over 860000 Bulgarian words for cases in which the hyphenation rules would hyphenate badly with respect to the definite article. He was unable to find even one such case with the hyphenation rules valid after 1983 and found only about 10 cases with the rules valid before 1983 (one of them is живопи-ста /zhivopi-sta/ instead of живопис-та /zhivopis-ta/).

One unavoidable characteristic of any morphologically based automated hyphenation is that it can create wrong hyphenations. Because of this, one useful option is to use the morphology in a safe way – to use it in order to forbid bad hyphenations but to create no new hyphenation possibilities solely on the basis of the morphology.

Take for example the word дозрея /dozreya/ 'ripen fully'. According to the phonological rules, we should hyphenate it as доз-рея /doz-reya/. According to the morphology, however, we should hyphenate it as до-зрея /do-zreya/ because this word is formed with the prefix до- /do-/ with the meaning 'complete or supplement', and this meaning would be lost if the reader saw доз- /doz-/ at the end of the line. Therefore, there are three methods to hyphenate this word:

The option to use the morphology in a safe way is very attractive when the software uses a smart line-breaking algorithm which can produce good results even with fewer hyphenation possibilities. TeX is one such program. It should be noted that this option does not eliminate too many hyphenation possibilities because most of the time the morpheme boundaries are also syllable boundaries.
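The safe use of morphology can be sketched as a filter over the phonological break positions. The Python sketch below is an interpretation, not the logic of the actual patterns: it assumes that a phonological break is "bad" when a morpheme boundary lies at a different position within the same group of consonants between two vowels, and the function names and the position convention (a break at `i` splits `word[:i]` from `word[i:]`) are hypothetical.

```python
VOWELS = set("аеиоуъюя")


def cluster_bounds(word, pos):
    """Return the first and last break positions of the consonant group around pos."""
    lo = pos
    while lo > 0 and word[lo - 1] not in VOWELS:
        lo -= 1
    hi = pos
    while hi < len(word) and word[hi] not in VOWELS:
        hi += 1
    return lo, hi


def safe_breaks(word, phonological, morpheme_boundaries):
    """Keep only phonological breaks that do not contradict a morpheme boundary.

    No break is ever added: a morpheme boundary that is not already in the
    phonological list stays forbidden.
    """
    word = word.lower()
    safe = []
    for p in phonological:
        lo, hi = cluster_bounds(word, p)
        rivals = [b for b in morpheme_boundaries if lo <= b <= hi and b != p]
        if not rivals:
            safe.append(p)
    return safe


# дозрея: the phonological break is доз-рея (position 3), while the prefix
# boundary is до-зрея (position 2), so the safe variant permits neither.
print(safe_breaks("дозрея", [3], [2]))  # prints []
```

The result illustrates the trade-off: the misleading division доз-рея is suppressed, and the line-breaking algorithm has one possibility fewer to work with.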

The following statistics describe the quality of the morphological rules (the number after the ± sign is the estimated standard deviation):

Notice that the morphological patterns produce a different hyphenation in only about 10% of the words. This surprising fact can be explained as follows. First, the natural evolution of human languages tends to simplify complex sequences of consonants. Therefore, no morpheme contains a complex sequence of consonants. Second, Bulgarian orthography is morphological: the morphemes are written according to their actual pronunciation, but the simplifications that occur in the spoken language at morpheme boundaries are not reflected in the spelling. Together, these two independent factors mean that most of the time the morpheme boundaries coincide with the conventional syllable boundaries. The main exception is a morpheme that starts with a vowel: its syllable will include one or more consonants of the preceding morpheme. The second exception is a morpheme that ends with a vowel when the next morpheme starts with a sequence of two or more consonants.

Usage of the script hyph-bg.sh

hyph-bg.sh is an all-in-one script which can generate both the documentation (this text) and the Bulgarian hyphenation patterns. When given the option --help, the script prints short usage instructions:

The following are the recommended ways to generate hyphenation patterns by this script:

Notice that some specialised English dictionaries separate the word-division positions into two categories: preferred positions and less recommended positions. It would be best if the Bulgarian online dictionaries could do the same. For example, a hyphen "-" can be used to display the preferred positions and a dot "." to display the less recommended positions. If a word-division position is permitted only by the patterns of hyph-bg.sh --permissible, then this position is less recommended.
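A dictionary could render the two categories with a small helper like the following Python sketch. The function name and the example positions are hypothetical; in practice the two position lists would come from hyphenating the word once with the restrictive patterns and once with the patterns of hyph-bg.sh --permissible.

```python
def mark_positions(word, preferred, permissible):
    """Show preferred breaks as '-' and merely permissible breaks as '.'.

    A position i stands for a break between word[:i] and word[i:].
    """
    marks = {i: "." for i in permissible}
    marks.update({i: "-" for i in preferred})  # preferred wins over permissible
    out = []
    for i, ch in enumerate(word):
        if i in marks:
            out.append(marks[i])
        out.append(ch)
    return "".join(out)


# Illustrative positions only, not the output of the real patterns.
print(mark_positions("сестра", preferred=[3], permissible=[2, 3, 4]))
# prints се.с-т.ра
```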

¹ In several publications this rule is formulated with the additional restriction that the sequence of consonants begins with an obstruent. I believe this restriction is unintentional. It makes no sense to forbid a hyphenation of the form AB-A but to permit ABB-A (A denotes a vowel and B a consonant).
² http://www.researchgate.net/profile/Veska_Noncheva
³ Нончева В. Алгоритъм за автоматично пренасяне на думи в българския език. Математика и математическо образование. Сб. доклади на 17. ПК на СМБ. С., БАН, 1988, 479-482.
⁴ http://www.linkedin.com/in/belogay
⁵ Белогай Е. Алгоритъм за автоматично пренасяне на думи. Компютър за вас (1988) 3, 12-14.
⁶ Liang, Franklin Mark. Word Hy-phen-a-tion by Com-put-er (Doctoral Dissertation). Stanford University, 1983.
⁷ Василев В. Ултимативният ТеХ. Удоволствието да правим предпечатна подготовка сами. София, Интела, 1997, 36.
⁸ The author of this text was unable to find current information about Ognyan Tonev on the Internet. Apparently in 1990 he worked in the Center of Informatics and Computer Technology of the Bulgarian Academy of Sciences.
⁹ To Yannis Haralambous, http://perso.telecom-bretagne.eu/yannisharalambous
¹⁰ http://www.maths.manchester.ac.uk/~gb/
¹¹ The thesis of Atanas Topalov can be accessed at the author's website http://www.mind-print.com
¹² http://dcl.bas.bg/svetla_koeva/
¹³ Коева, Светла. Правила за пренасяне на части от думите на нов ред. Български език. 1999/2000, 1, 84-86.
¹⁴ http://lml.bas.bg/~stoyan/
¹⁵ The author of this text.
¹⁶ http://bg.linkedin.com/in/radostinradnev
¹⁷ http://bgoffice.sourceforge.net/
¹⁸ http://hyphenation.org
¹⁹ No Primitive Slavonic word contains the phoneme ф /f/. Therefore, we can safely assume that in the Primitive Slavonic language the consonant ф /f/ was a positional variant of the consonant в /v/.
²⁰ Actually, the letter в /v/ is not a real exception because in all such cases this letter denotes two different consonants, в /v/ and ф /f/. Only in the Russian loan-word взвод /vzvod/ do the two letters в /v/ denote a repeated consonant в /v/.
²¹ Правописен и правоговорен наръчник. Състав. Иван Хаджов, Цв. Минков; Ред. Ив. Хаджов и др. София, Бълг. кн., 1945.

Automated Bulgarian Hyphenation

Anton Zinoviev

21 October 2017

Principles of the Bulgarian hyphenation

Hyphenation rules between 1945 and 1983

Hyphenation rules between 1983 and 2012

Hyphenation rules after 2012

Computer implementations

Mathematical analysis of the Bulgarian hyphenation

Bulgarian hyphenation in TeX

Later developments

The present work

Motivation

Hyphenation according to the syllables in the word

Hyphenation according to the morphology

Usage of the script `hyph-bg.sh`
