
Do you know of a list provided by some academic institution? I did find some lists, but I am unable to judge the quality and/or completeness of these:

Background: I am trying to program a random name generator for project working titles, using the approach outlined here: extracting graphemes from these downloadable free corpus samples and feeding them to some kind of Markov chain.
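
To make the extraction step concrete, here is a rough sketch of how a word could be split into graphemes by greedy longest match. The grapheme set below is a tiny, hypothetical placeholder (not taken from any of the lists mentioned), and `split_graphemes` is just an illustrative name; a real run would load a much fuller inventory.

```python
# Hypothetical, deliberately tiny inventory of multi-letter graphemes.
# A real run would load a fuller list; anything not listed here is
# treated as a single-letter grapheme.
MULTI_LETTER_GRAPHEMES = {
    "tch", "igh", "sh", "ch", "th", "ph", "ng", "ck", "ee", "oo", "ea", "ou",
}
MAX_LEN = max(len(g) for g in MULTI_LETTER_GRAPHEMES)


def split_graphemes(word):
    """Split a word into graphemes, preferring the longest match at each position."""
    word = word.lower()
    result = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and fall back to a single letter.
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if size == 1 or chunk in MULTI_LETTER_GRAPHEMES:
                result.append(chunk)
                i += size
                break
    return result


print(split_graphemes("watching"))
```

Greedy matching is only an approximation: it will happily read the "th" in "hothouse" as one grapheme, so the counts it produces are noisy.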

UPDATE:

I used the Wikipedia list as suggested by @tchrist and the free COCA sample corpus referenced above. The approach worked quite well for my purposes. Here is a small random set of generated words for anyone interested:

Wanstasy, Indricis, Voformer, Colutove, Ingerstr, Tottione, Lspheres,
Umandsam, Extivelo, Pironoba, Zofiropr, Bingernt, Kitleron, Viewinef,
Juntialt, Enabbyth, Uplpofor, Everopeo, Heventri, Ntozzler, Buncener,
Granalse, Nocosacc, Randeren, Randantu, Caredyou, Ftedowla, Ncesnarr,
Ulilkien, Factitur, Grontoft, Noughtoo, Lackeded, Zofricsp, Viewedon,
Tuartand, Dossions, Kifreaps, Xicatage, Evertsom, Emorever, Manksgis,
Ponkiold, Nsualina, Atofficl, Mallitsi, Spmethir, Dayspeed, Anditout,
Xatofrse, Izamedoo, Bupleati, Plitteni, Failitha, Hinglood, Dcoveyou

Reto Höhener

2 Answers


The various spellings listed for each phoneme in the “Sound to Spelling Correspondences” section of Wikipedia’s article on English Orthography may help.

I’ve looked at both your PDF sources: the Wikipedia section is better than either of those. Your task is harder than you may realize.

tchrist
  • I ended up using this Wikipedia list. It worked quite well for my purposes. Thanks again. – Reto Höhener Apr 12 '15 at 23:40
  • @Zalumon Your results are quite good, and I bet you could make them even better. Some of the initial and final sequences don't work. I think that if you include a special empty element to represent the beginning and end of the word when feeding your Markov chains, that issue might go away. – tchrist Apr 13 '15 at 03:20
  • Yes, I see your point. Currently I treat all graphemes identically. I should definitely have separate probabilities for word boundaries. – Reto Höhener Apr 13 '15 at 13:48
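
For later readers, the boundary-element idea from these comments might look roughly like the sketch below. It is not the code actually used for the words in the question; `START`, `END`, `build_chain` and `generate` are made-up names, and the chain is a plain first-order one over already-split graphemes.

```python
import random
from collections import defaultdict

# Hypothetical boundary markers; they stand for "word start" and "word end"
# and must not collide with any real grapheme.
START, END = "^", "$"


def build_chain(words):
    """Count grapheme-to-grapheme transitions, including the word boundaries."""
    chain = defaultdict(lambda: defaultdict(int))
    for graphemes in words:
        sequence = [START, *graphemes, END]
        for current, following in zip(sequence, sequence[1:]):
            chain[current][following] += 1
    return chain


def generate(chain, rng, max_len=12):
    """Walk the chain from START until END is drawn or max_len graphemes are emitted."""
    out = []
    current = START
    while len(out) < max_len:
        options = chain[current]
        current = rng.choices(list(options), weights=list(options.values()))[0]
        if current == END:
            break
        out.append(current)
    return "".join(out).capitalize()


# Tiny made-up training set; each word is a list of graphemes.
training = [["n", "igh", "t"], ["l", "igh", "t"], ["sh", "i", "p"]]
chain = build_chain(training)
print(generate(chain, random.Random()))
```

Because the first grapheme is drawn from the transitions out of `START` and a word normally stops only when `END` is drawn, word-initial and word-final sequences get their own probabilities. A word cut off by `max_len` can still end badly, so such words may be worth discarding.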

I get what you mean, because there are officially more than 44 once you include both American and British English, but I don't know where to find a full list, and I'm looking for one as well. :/ But check page R45 of the Oxford Advanced Learner's Dictionary; you'll find them there, with an example for each. If you don't want to use a dictionary, then I'm sorry.

jimm101