
Has anyone constructed letter frequency charts broken down by the assumed age of the reader or speaker, and by spoken versus written language?

One would expect that letter usage would be different in books targeted at (for example) a 3rd grader vs a college student. Also, people often write differently than they speak.

My first-grade daughter loves the show "Pokemon". In that show the Pokemon characters only speak in sounds made from pieces of their own name. For example, a Pikachu only speaks words made from combinations of the sounds "Pi", "Ka", and "Chu". She thought it would be cool to make a real Pikachu language, and I think it's a good opportunity to teach her about encoding schemes.

The obvious choice is encoding letters of the alphabet using these three sounds. Ideally one would want the length of the words to be minimized. We have three sounds, therefore a ternary Huffman code based on an English letter frequency chart would provide an optimal code.
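To make the idea concrete, here is a minimal sketch of a ternary Huffman construction in Python. The function names (`ternary_huffman`, `encode`) and the weights in `toy_freqs` are made up for illustration and are not taken from any real frequency chart. One detail worth noting: a 3-ary Huffman tree needs the number of leaves minus one to be a multiple of 2, so the sketch pads with zero-weight dummy entries when that is not the case (for 26 letters, one dummy is needed).

```python
import heapq
from itertools import count

SOUNDS = ("Pi", "Ka", "Chu")

def ternary_huffman(freqs, symbols=SOUNDS):
    """Return {letter: tuple of sounds} for a k-ary Huffman code (k = len(symbols))."""
    k = len(symbols)
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(weight, next(tie), letter) for letter, weight in freqs.items()]
    # Pad with zero-weight dummies until (n - 1) is a multiple of (k - 1),
    # so that every merge combines exactly k nodes.
    while (len(heap) - 1) % (k - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(k)]
        total = sum(child[0] for child in children)
        heapq.heappush(heap, (total, next(tie), [child[2] for child in children]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):      # internal node: one branch per sound
            for sound, child in zip(symbols, node):
                walk(child, prefix + (sound,))
        elif node is not None:          # real letter (None marks a dummy)
            codes[node] = prefix

    walk(heap[0][2], ())
    return codes

def encode(word, codes):
    """Spell a word letter by letter; letters are separated by spaces."""
    return " ".join("".join(codes[ch]) for ch in word.upper())

# Made-up weights for five letters, purely for illustration.
toy_freqs = {"E": 12, "T": 9, "A": 8, "O": 7, "Z": 1}
codes = ternary_huffman(toy_freqs)
print(codes)
print(encode("tea", codes))
```

Passing a full 26-letter table in place of `toy_freqs` would give a complete alphabet; frequent letters end up with fewer sounds than rare ones.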

I have seen many charts (like this one) ...

http://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html

Based on that chart, here is the code I came up with so far: [image of the code table]

... but this chart is based on generic data and therefore wouldn't be optimal in terms of the words a first grader would choose to speak.

This would be a spoken language only, so there is no need to encode special symbols or different letter cases. I am only interested in the frequency of English letters A-Z.

The ideal table would come from research based on recordings of spoken interactions of elementary school children in the United States (preferably first grade). But if that's not available then tables based on books targeted at those age ranges would be the next best choice.
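If no published table turns up, building one from whatever text is available (transcripts or age-appropriate books) is mostly a counting exercise. Below is a rough sketch, assuming the source is plain English text; `letter_frequencies` is a hypothetical helper name and the sample sentence is a placeholder, not real corpus data.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Relative frequency of A-Z in text, ignoring case and all non-letters."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in string.ascii_uppercase}
    return {letter: counts[letter] / total for letter in string.ascii_uppercase}

# Placeholder text; a real table would use a large transcript or book corpus.
sample = "The cat sat on the mat."
freqs = letter_frequencies(sample)

# Print letters from most to least frequent, skipping those that never occur.
for letter, share in sorted(freqs.items(), key=lambda item: -item[1]):
    if share > 0:
        print(f"{letter}: {share:.3f}")
```

For spoken recordings this only works on a written transcript, so the result reflects the spelling of the words children say rather than the sounds themselves.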

Anton
user4574
  • This is a huge area to cover. You could consult https://en.wikipedia.org/wiki/List_of_children%27s_speech_corpora as a starting point. – Anton Nov 26 '20 at 11:33
  • or look at https://arxiv.org/ftp/arxiv/papers/1605/1605.07735.pdf I suspect the main problem here is that to access any of the various (expensively obtained) corpus databases like this you will have to pay. – Anton Nov 26 '20 at 11:49
  • @Anton Those look like good references if I wanted to build my own table from scratch. In the end that might be what I have to do. The end-product I need to find is a list of 26 letters and their frequency, for a certain age of the speaker. I would be surprised if no-one has already published something like this. – user4574 Nov 26 '20 at 16:12

0 Answers