3

I am currently trying to build a database of English words and their hyphenations (end-of-line divisions) (en-US, if it matters), and in doing so I have come across some words for which I have found contradictory hyphenations. If those words were exotic, I would not be wondering about it, but some of them are frequently used. For example:

  • Germany: Merriam-Webster - Ger-ma-ny; Hunspell (which by far is the most dominant spell checker and hyphenator in the open source scene, driving applications like LibreOffice, OpenOffice, Firefox, Thunderbird and the like) - Ger-many

  • freely: Merriam-Webster - free-ly; Hunspell - freely

  • rapid: Merriam-Webster - rap-id; Hunspell - rapid

I have read a lot of articles (most of them on this site) about hyphenation. The general consensus seems to be that we should look up the respective word and its hyphenation in authoritative sources. But what if those sources contradict each other?

Another piece of advice that was often given is that we should simply hyphenate between syllables. Since I am not a native English speaker, this is extremely difficult for me. While I would have gotten Germany and freely right, I would never have gotten rapid right (in my world, it would have been ra-pid).

I always have considered the Oxford English Dictionary to be the most authoritative English dictionary. Imagine my surprise when I saw that they neither show hyphenation nor syllabication. The Wiktionary does show hyphenation, but only for some words; the examples mentioned above, being very common words, are not among them, so it's worthless in this respect.

Could somebody please give me a hint what I should do if two important sources which both can (somehow) be considered authoritative show contradicting hyphenations, and even more important, could somebody please tell me if there is a reliable method to identify words which are suspect in this respect in the first place?

To explain the latter: I am currently using the hunspell data to build my database semi-automatically; otherwise, I couldn't handle it. The hunspell data is the only one I have found to be usable to get the hyphenation of a word quite easily.

As a second step, I would like to be able to identify and separate suspect words, which I then could look up manually in different sources (hoping that only about 5% of the words are suspect).
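One way to narrow down the suspect words is to compare the break points reported by two sources and flag every word where they differ. The following is only a minimal sketch under assumptions: the function names `break_points` and `find_suspects` are hypothetical, both sources are assumed to be available as plain word-to-hyphenation mappings using `-` as the break marker, and the `table` entry is invented purely to show a word on which both sources agree.

```python
def break_points(hyphenated: str, sep: str = "-") -> frozenset:
    """Return the offsets (in the unhyphenated word) at which a break is allowed."""
    points = set()
    offset = 0
    # Every part except the last one ends at a break point.
    for part in hyphenated.split(sep)[:-1]:
        offset += len(part)
        points.add(offset)
    return frozenset(points)


def find_suspects(source_a: dict, source_b: dict) -> dict:
    """Map each word present in both sources to its two hyphenations if they differ."""
    suspects = {}
    for word in source_a.keys() & source_b.keys():
        if break_points(source_a[word]) != break_points(source_b[word]):
            suspects[word] = (source_a[word], source_b[word])
    return suspects


# The first three entries come from the examples above; "table" is an
# invented entry (assumption) showing a word that is not flagged.
hunspell = {"germany": "ger-many", "freely": "freely", "rapid": "rapid", "table": "ta-ble"}
reference = {"germany": "ger-ma-ny", "freely": "free-ly", "rapid": "rap-id", "table": "ta-ble"}

print(sorted(find_suspects(hunspell, reference)))
```

Words that contain a real hyphen (compounds) would need a different break marker in this sketch, and the comparison only helps for words that the second source actually covers.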

EDIT 1

As a reaction to one of the comments, I now have found a word where at least 3 characters are left at each side after hyphenation, but where different "authorities" hyphenate differently:

Microsoft Word 2010 hyphenates inconceivable as in-con-ceiv-a-ble, where Merriam-Webster has in-con-ceiv-able.

Another one: Merriam-Webster says cli-ent, where hunspell says client, i.e. does not hyphenate that word at all.

EDIT 2

@Hot Licks has pointed out that the dictionaries are showing syllable boundaries, not hyphenation points (if any). However, at least in case of Merriam-Webster, this is the same. From their dictionary API documentation:

<hw>...</hw>    (text = boldface)
    HEADWORD
    - This is the first bold word in an entry
    - contains "syllable" break points (that is, 
      end-of-line hyphenation points) here indicated 
      by asterisks, which will translate to raised dot, 
      {point} in Merriam-Webster font. 
    - may contain superscript homograph numbers 
      {h,1}, {h,2}, etc., in the same font (bold)
    - single word space after <hw> field

Please note the text following the second hyphen. IMHO, that means that each syllable boundary is a hyphenation point, and vice versa.
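For illustration, a headword in the asterisk notation described above could be split into the plain word and its break offsets roughly as follows. This is a sketch, not Merriam-Webster's own code: `parse_headword` is a hypothetical helper, the input string is a hand-written example rather than an actual API response, and homograph markers such as `{h,1}` are not handled.

```python
def parse_headword(hw: str, marker: str = "*"):
    """Split a headword like 'cal*cu*la*tion' into (word, break offsets)."""
    parts = hw.split(marker)
    word = "".join(parts)
    points = []
    offset = 0
    # Each marker sits right after the part that precedes it.
    for part in parts[:-1]:
        offset += len(part)
        points.append(offset)
    return word, points


# Hand-written example input (assumption), not copied from an API response.
word, points = parse_headword("cal*cu*la*tion")
print(word, points)
```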

EDIT 3

I have found more precise information. From Merriam-Webster's guide to pronunciation:

Hyphens are used to separate syllables in pronunciation transcriptions. [...]

The centered dots in boldface entry words indicate potential end-of-line division points and not syllabication. [...] As a result, the hyphens indicating syllable breaks and the centered dots indicating end-of-line division often do not fall in the same places.

Binarus
  • 165
  • 2
    Generally speaking, you should never hyphenate a word and leave fewer than 3 characters on either side. – Hot Licks Aug 04 '18 at 11:38
  • I see. I should have mentioned that you can tell the hunspell utilities how many characters it should leave. For learning purposes and to compare results with other sources, I have allowed it to hyphenate at any position, i.e. to leave only one character if possible. But even with this configuration, it does not hyphenate where other sources do. It might be interesting to add other examples where the hyphenation differs between sources where more than two characters are left at either side. I'll try to find such examples and add them to my question. – Binarus Aug 04 '18 at 11:55
  • 3
    What the dictionaries show is not the hyphenation points but the syllable boundaries. – Hot Licks Aug 04 '18 at 12:00
  • 1
    Please correct me if I am wrong, but Merriam-Webster seems to show the hyphenations. For example, consider https://www.merriam-webster.com/dictionary/calculation. Directly under the title (in giant letters), there are three entries. The left denotes the type of the word (in this case, noun), the second is what I have considered to be the hyphenation, and the third is the pronunciation, and I always thought the syllable boundaries are part of the pronunciation. Please correct me if I am wrong (which may very well be the case). – Binarus Aug 04 '18 at 12:11
  • 1
    MW is showing the syllable boundaries. Generally, hyphenation occurs on syllable boundaries, but there are limits as to which boundaries can be used – Hot Licks Aug 04 '18 at 12:29
  • OK, thank you very much. Do you know of a source of hyphenation patterns, at least for the most common 10,000 words (or the like)? By the way, I now have an "EDIT 1" in my question ... – Binarus Aug 04 '18 at 13:05
  • 2
    @Hot Licks Please forgive me, but it seems you are wrong regarding what the dictionaries show, at least in the case of MW. Please take a look at my EDIT 3. – Binarus Aug 04 '18 at 14:08
  • 1
    @Binarus Refer to Merriam-Webster's athe·ism as an example. It's clearly not showing syllable boundaries. But ˈā-thē-ˌi-zəm, on the other hand, does. – Jason Bassford Aug 04 '18 at 15:30
  • 3
    Never assume that, given two opinions on hyphenation, one is right and one is wrong. No program written has ever done as good a job at hyphenation of English text as a professional editor (read: a human being). The best advice for when to split a word at the end of a line is never if you can get away with it, but in “a sensible place” if you can't—but therein lies the rub. See also https://sesquiotic.com/2013/01/09/hyphe-nation-hyphen-ation/ Related and possible duplicates: https://english.stackexchange.com/q/385/2085 https://english.stackexchange.com/q/21529/2085 – tchrist Aug 04 '18 at 15:40
  • 1
    @Binarus There is further info in the Merriam-Webster explanatory notes. As you say, it clearly states that the centre dots indicate end-of-line division. It further mentions the existence of acceptable alternative end-of-line divisions which it doesn't have space to include in the dictionary. – S Conroy Aug 05 '18 at 01:56
  • @S Conroy Thank you very much. Actually, I had already come across that document, but I didn't find the relevant section because I was always looking for the keyword "hyphenation". I had no success in doing so because MW (and probably others) call this end-of-line division - a typical problem of a non-native speaker ... – Binarus Aug 05 '18 at 07:52
  • 1
    @Binarus if you must re-invent a wheel, why not choose a round one, at least? You could spend a lifetime trying to build the database you describe, and still not improve one jot or tittle on anything currently in use.

    I only spent 20 years in publishing and I’ve never heard of hyphenation as end-of-line division. D’you perhaps mean “justification” in typography, as at https://en.wikipedia.org/wiki/Typographic_alignment?

    All mainstream commercial publishing software includes routines for handling that, with great precision and plenty of room for specifying user preferences.

    – Robbie Goodwin Aug 19 '18 at 23:22
  • 1
    Take, for instance, Hot Licks's “Generally speaking, you should never hyphenate a word and leave fewer than 3 characters on either side.” I agree strongly, but that’s a personal choice. Any number of professionals think we’re wrong, and all major publishing software allows either preference.

    How would you accommodate that in a database, in English or any other language?

    – Robbie Goodwin Aug 19 '18 at 23:23
  • @RobbieGoodwin Thanks for taking the time. Actually, this database is not for fun. From time to time, I publish articles in the www, using full justification with end-of-line-division (in the link @S Conroy gave, Merriam-Webster calls it exactly that). I have written an HTML parser (yes, I am crazy) which pulls the text out of my HTML pages, hyphenates it (this is where the database is involved), and re-assembles the pages. Alternatively, there is a 3rd-party Javascript solution which hyphenates on-the-fly, but I don't trust it; I'd like to be in control of the hyphenation patterns myself. – Binarus Aug 20 '18 at 14:09
  • @Binarus, that sounds like a magnificent effort and still, why would you want to re-invent the wheel? More, why would you want to make your wheel anything other than round?

    This is no joke. The theory you describe is counter to centuries of experience of uncounted thousands of experts.

    Does that matter to you, or not, please?

    I've no idea what your experience is. Mine is 20-odd years printing 100 or more newspapers, magazines and journals and never once hearing any phrase anything like "end-of-line-division"

    – Robbie Goodwin Aug 22 '18 at 20:11
  • @RobbieGoodwin Actually, I really don't care if we call it end-of-line-division, hyphenation or something else; by mentioning the issue, I simply wanted to justify that I hadn't found that part of MW's documentation myself (I am trying to first do my homework, then ask) because MW calls it end-of-line-division - this is fact, unfortunately. So I am with you: It should not be called end-of-line-division, but MW calls it exactly that, which was the reason why I didn't find it in MW's documentation. There is nothing more to it; it does not have anything to do with my actual problem. – Binarus Aug 23 '18 at 06:15
  • @RobbieGoodwin Regarding the other part of your comment: It would be great if you could give me a specific hint how to avoid reinventing the wheel. For example, writing HTML pages in a word processor is no option; tapping into MS Word's hyphenation is not possible for technical and legal reasons (although they once had an API); hunspell (which I currently deploy as a basis) does it wrong with about one third of the words I have tried so far; and so on ... so if you know about a free hyphenation pattern database or a method which simplifies my task, I would gratefully adopt it. – Binarus Aug 23 '18 at 06:27
  • Either you’re going to spend a very long time working by yourself to achieve something that’s already been done, or you need to care whether it's end-of-line-division, hyphenation or what… or both.

    Sorry and when you ask for a specific hint how to avoid reinventing the wheel you reveal that neither you nor - so much worse - your tutors has much idea what the wheel is… I assume you have tutors because you said you were trying to do homework; no?

    There is one need for justification in typesetting and that is purely to match a particular printer’s or publisher’s or author’s style.

    More…

    – Robbie Goodwin Aug 24 '18 at 23:13
  • Further… What you miss is first that “look it up in authoritative sources” is meant for writers who don’t have their own rules and even then that a style like “look up the respective word and its hyphenation in authoritative sources” won’t make anything easier.

    Of course, writing HTML pages in a word processor is no option but why would you have thought it might have been? What would HTML - much less pages - have to do with building any database? Displaying it, perhaps but how building?

    Writing your own HTML parser is one thing; doing the job efficiently quite another.

    More…

    – Robbie Goodwin Aug 24 '18 at 23:18
  • Further… All commercial publishing software uses algorithms - not databases - that can be adapted to suit many styles. Most obviously, many users will not accept Ger-ma-ny or free-ly or rap-id or even Ger-many but insist on Germ-any. I happen to agree with Hot Licks, “… never hyphenate and leave fewer than 3 characters on either side” but however well-informed that’s a choice, not a fixed rule…

    The choices don’t matter; only the ability to adapt to manage them, and the fact that’s well-established as being handled by flexible algorithms, not rigid databases.

    More…

    – Robbie Goodwin Aug 24 '18 at 23:23
  • Further… why not ask yourself how many words - even root words - there are in any of the major English dictionaries. Last time I looked it was about 300,000.

    Why would you want to build a database that large?

    – Robbie Goodwin Aug 24 '18 at 23:25
  • @RobbieGoodwin OK, last attempt to explain my situation: I don't have a tutor. I am an experienced developer. I am not a native English speaker. I do not have my own hyphenation rules because this is much beyond my English skills, so I take MW as reference when it comes to hyphenations. I publish articles as HTML in English language from time to time, using full justification and hyphenation. I do not trust algorithms for doing the hyphenation, since I didn't see any algorithm yet which even came close to how MW hyphenates; if I did, I could use hunspell or the on-the-fly Javascript solution. – Binarus Aug 25 '18 at 11:26
  • @RobbieGoodwin And finally, I think we are talking about different things. There must be a misunderstanding. I never will get why you insist on making a difference between end-of-line-division and hyphenation. Once again, all I want is to break words correctly at the end of a line, using a hyphen at the place where the word is broken, because without this, the full justification would look ugly. I do not care if we denote this sort of word-breaking as hyphenation or as end-of-line-division; regardless of what term I have used, I am meaning it in the sense described above. – Binarus Aug 25 '18 at 11:33
  • @RobbieGoodwin As a last remark, my task is not that big. In all English articles I have written so far, I have used about 4000 words plus a few field-specific terms. I expect to not use more than 10000 different words in future articles until I stop "publishing", so the whole thing is not that terrible. Nevertheless, it's too much to be processed manually. – Binarus Aug 25 '18 at 11:42
  • Jolly good. Sorry you diverted everything with "homework" as though you were following a tutor's instructions. Even so, you're trying to re-invent what long ago became standard processes… and which at the end of the day are purely questions of writing style in any language you choose, and nothing really to do with English language. Look, if you want to work with 10,000 - even 4,000 words then either you do it the normal way, or you have at least 4,000 potential questions. Please take all of those - and this - to Chat… – Robbie Goodwin Aug 25 '18 at 17:51
  • 1
    The Chicago Manual of Style calls this issue word division (16th edition: section 7.31, page 358). The section starts: Dictionary word division. For end-of-line word breaks...Chicago turns to Webster's as its primary guide. The dots between syllables in Webster's indicate where breaks may be made... – Shoe Dec 02 '18 at 08:11
  • Thanks for the insight. I also have the impression that MW is considered sort of "authoritative" by many native speakers, which is why I would like the hyphenations in my text to follow that source. Unfortunately, they don't offer an API, at least not for the "full" version of their dictionary, and it's not free. – Binarus Dec 02 '18 at 09:13
  • 1
    The reason the OED doesn’t have hyphenation is quite simply that it’s too big and too old. Many parts of it have not been updated since 1888. Oxford does publish a hyphenation dictionary, however, which I would guess will almost certainly disagree with M-W on many words. Hyphenation in English is hopelessly arcane and preferential. I kind of disagree with @tchrist here: even professional editors rarely do a really good job of hyphenating English text; it basically requires highly specialised education in hyphenation. – Janus Bahs Jacquet Dec 02 '18 at 11:13
  • @JanusBahsJacquet You have made me curious, so I have searched for the Oxford hyphenation dictionary you have mentioned. But I couldn't find it. Do they have it online, or do they sell it in printed form? Could you please give a link? – Binarus Dec 02 '18 at 16:26
  • 2
    @Binarus We have it at work (in physical form). I think it’s actually called the (New) Oxford Spelling Dictionary, but since spelling is quite easy to find in more accessible online dictionaries, we only really use it for hyphenation purposes. I don’t think it’s available in electronic/online form, unfortunately. – Janus Bahs Jacquet Dec 02 '18 at 16:29
  • @JanusBahsJacquet I see. Thanks for the link. So it's in their academic / university line of products, which might be the reason why I haven't found it before. The price would be OK, but unfortunately, I would need it in electronic form ... – Binarus Dec 02 '18 at 16:37
  • No, this is not a duplicate question, and the link provided does not answer my question. I am not interested in the rules, because my knowledge of the English language is far from sufficient to apply them correctly. I just wanted to know if there is a definitive source (preferably online with an API), why some well-known dictionaries contradict each other even with the easiest words, and which one to choose if there are such contradictions. So my question couldn't be more different from that other one ... – Binarus Aug 06 '19 at 16:06

4 Answers

1

If you search hunspell hyphenation you should find an end-of-line hyphenation library (import from TeX) that should suit your needs. The min right and left lengths are variables.

I don't know if this can detect part-of-speech such as (verb) pro-ject vs (noun) proj-ect.
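To make the idea of minimum left and right lengths concrete, here is a small sketch of how such limits filter candidate break points. It does not reproduce Hunspell's or TeX's actual implementation: `apply_minimums` and its parameter names are hypothetical, and the candidate break offsets are assumed to come from some other source.

```python
def apply_minimums(word: str, points: list, left_min: int, right_min: int) -> list:
    """Keep only break offsets that leave enough characters on both sides."""
    return [
        p for p in points
        if p >= left_min and len(word) - p >= right_min
    ]


# Candidate breaks for Ger-ma-ny (offsets 3 and 5) and rap-id (offset 3),
# taken from the dictionary divisions quoted in the question.
print(apply_minimums("germany", [3, 5], left_min=3, right_min=3))
print(apply_minimums("rapid", [3], left_min=3, right_min=3))
print(apply_minimums("rapid", [3], left_min=2, right_min=2))
```

With both limits set to 3, the break before "ny" and the break in "rapid" are dropped because fewer than three characters would remain on the right; with both set to 2, the break in "rapid" survives.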

AmI
  • 3,662
  • I don't believe this answers the question, since Hunspell is already mentioned in the question. – Laurel Nov 02 '18 at 03:10
  • Hunspell was being used with partial success and I'm suggesting an add-on that should complete the task, rather than re-doing everything from scratch. – AmI Nov 02 '18 at 03:20
  • @AmI In fact, this is what I currently use for the automatic part. The problem is that Hunspell's hyphenation differs a lot from Merriam-Webster's, for example. Hence, whenever I feel that Hunspell may have missed a hyphenation point, I manually look the word up in other dictionaries. This is still quite painful, but better than nothing. The most difficult part (for me) is to determine if Hunspell might have missed something, so I sometimes end up unnecessarily looking up 10 words in a row manually just because I can't trust Hunspell completely ... What a pity that MW does not offer an API. – Binarus Nov 20 '18 at 08:30
  • I'm sorry -- I didn't realize that you were already using hyphen.tex. Because it is rule based rather than a full dictionary, it can't reliably handle breaks leaving less than 3 letters. It does have an exception list at the end where you can add on, but change the file name if you customize it. You could also build hyph_en_US.dic and customize that. – AmI Nov 20 '18 at 18:34
  • 1
    @AmI Thanks ... What I did: I installed Hunspell in Cygwin. It includes an executable "example.exe" which I use directly to look up hyphenations based on hyph_en_US.dic. I didn't know that you can customize the latter; I don't need this because each word goes into a database anyway. Hunspell / example.exe has a configuration file where you can tell it how many letters must remain at each side after hyphenation, but that didn't work. Hence, I changed the threshold from three to two letters in the source code and compiled it myself, which works reliably. But there are still great differences from MW ... – Binarus Dec 02 '18 at 09:00
1

The first thing you have to understand is that hyphenation in English is done on different principles:

  • An "American" system, which derives from this "Hyphens are used to separate syllables in pronunciation transcriptions." This involves two basic fallacies: pronunciation transcriptions are a rare special case of the use of hyphenation, which is normally used for texts that are to be read, not recited; and even if you wanted to use this as a base, there are lots of differences in syllabication between regional dialects.

  • A "British" system, which breaks words according to their etymological components (prefixes and suffixes etc.). This makes the word breaks easier to follow, and should be preferred. Thus: con-ceiv-able. But this puts you in conflict with Microsoft and the like, of course.

  • Thank you very much and +1 for bringing the two systems to my attention. – Binarus Apr 13 '19 at 13:14
  • 1
    Both American and British dictionaries that give hyphenations use pronunciation *and* etymology to find their breaks. The British may weight etymology higher, but they're using essentially the same principles, and making different judgment calls. Note that Merriam-Webster, which recommends con-ceiv-able, is an American dictionary, and Microsoft, recommending con-ceiv-a-ble, is an American company. – Peter Shor Aug 02 '19 at 10:47
1

This answer gives the general principles behind hyphenating words in English.

There is no single source for hyphenation in English. While all the sources follow the same principles, different sources make different judgment calls, so it's not surprising that they give different results.

No respectable source (this would include dictionaries and Hunspell) should give you an unacceptable hyphenation, so it's fine to pick one and use it. You should note, however, that some words like project have different hyphenations depending on whether they are a noun or a verb, and some, like debris, have different hyphenations in British and American English. This is because hyphenation sometimes depends on pronunciation, and pronunciation varies.
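If the hyphenations end up in a database, one possible way to accommodate such words is to key the entries on the word together with its part of speech. The sketch below is only an assumption about how that could look: `HYPHENATIONS` and `lookup` are hypothetical names, and the two entries use the noun and verb divisions of project mentioned elsewhere on this page (proj-ect and pro-ject).

```python
from typing import Optional

# Hypothetical table keyed on (word, part of speech).
HYPHENATIONS = {
    ("project", "noun"): "proj-ect",
    ("project", "verb"): "pro-ject",
}


def lookup(word: str, pos: Optional[str] = None) -> list:
    """Return the known hyphenations of a word, optionally for one part of speech."""
    word = word.lower()
    if pos is not None:
        entry = HYPHENATIONS.get((word, pos))
        return [entry] if entry is not None else []
    # Without a part of speech, return every distinct variant so the
    # caller can see that the word is ambiguous.
    return sorted({h for (w, _), h in HYPHENATIONS.items() if w == word})


print(lookup("project", "noun"))
print(lookup("project"))
```

A lookup that returns more than one variant signals a word that needs a decision from context; the same scheme could carry a regional key for cases like debris.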

Peter Shor
  • 88,407
-1

An important point is being missed:

  1. Do not divide proper nouns or proper adjectives.

English Plus

Hot Licks
  • 27,508