I'm building a parser that uses a few dictionaries (English words, place names and acronyms) to split a domain name into a set of candidate phrases, and then tries to decide which phrase most likely matches what the person who registered the domain intended.
My current rules are:
- Place names must be at least 4 characters long.
- A phrase can contain at most 2 place names, and they must be adjacent.
- Acronyms must be at least 3 characters long.
- Candidates are ranked by, in order of priority:
  - fewest words
  - highest average word length
  - highest maximum word length
For example, "refundslondon" splits into "refunds london", "refunds lon don" and "refunds lond on".
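The rules and the example above can be sketched roughly as follows. This is a minimal Python illustration, not the actual parser: the three dictionaries are toy stand-ins holding just enough entries for the example (treating "lon" and "lond" as acronyms is an assumption), and the names `classify`, `segmentations`, `places_ok`, `rank_key` and `candidates` are hypothetical.

```python
from typing import Iterator, List, Optional

# Toy dictionaries (assumed contents, only enough for the example).
WORDS = {"refunds", "on", "don"}
PLACES = {"london"}
ACRONYMS = {"lon", "lond"}

MIN_PLACE_LEN = 4    # place names need at least 4 characters
MIN_ACRONYM_LEN = 3  # acronyms need at least 3 characters
MAX_PLACES = 2       # at most 2 place names per phrase


def classify(token: str) -> Optional[str]:
    """Return the category of a token, or None if no dictionary accepts it."""
    if token in WORDS:
        return "word"
    if token in PLACES and len(token) >= MIN_PLACE_LEN:
        return "place"
    if token in ACRONYMS and len(token) >= MIN_ACRONYM_LEN:
        return "acronym"
    return None


def segmentations(name: str, start: int = 0) -> Iterator[List[str]]:
    """Yield every way to split name[start:] into dictionary tokens."""
    if start == len(name):
        yield []
        return
    for end in range(start + 1, len(name) + 1):
        token = name[start:end]
        if classify(token) is not None:
            for rest in segmentations(name, end):
                yield [token] + rest


def places_ok(phrase: List[str]) -> bool:
    """At most MAX_PLACES place names, and two place names must be adjacent."""
    idx = [i for i, t in enumerate(phrase) if classify(t) == "place"]
    if len(idx) > MAX_PLACES:
        return False
    return len(idx) < 2 or idx[1] - idx[0] == 1


def rank_key(phrase: List[str]):
    """Sort key: fewest words, then highest average length, then longest word."""
    lengths = [len(t) for t in phrase]
    return (len(phrase), -sum(lengths) / len(lengths), -max(lengths))


def candidates(name: str) -> List[List[str]]:
    """All valid phrases for a domain name, best candidate first."""
    valid = [p for p in segmentations(name) if p and places_ok(p)]
    return sorted(valid, key=rank_key)


for phrase in candidates("refundslondon"):
    print(" ".join(phrase))
```

With these toy dictionaries, "refunds london" ranks first because it has the fewest words; the other two candidates tie on all three criteria, so their relative order just reflects the order in which they were generated.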
It works reasonably well at the moment, but from an English-language perspective I'd like to know whether there is a way to determine (a) whether the sentence structure is valid and (b) whether certain words appear out of place, particularly words that can't really go at the end of a sentence.
Examples:
- "my" probably shouldn't go at the end of a sentence
- "and" probably shouldn't go at the start or end of a sentence
...and so forth.
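To make the kind of check I mean concrete, here is a minimal Python sketch. It is only a hand-maintained word list, not a real test of sentence structure: the two sets contain just the examples above, and the name `position_penalty` is hypothetical.

```python
from typing import List

# Words that look wrong in a given position (only the examples above;
# a real list would need many more entries).
BAD_AT_START = {"and"}
BAD_AT_END = {"my", "and"}


def position_penalty(phrase: List[str]) -> int:
    """Count how many position rules a candidate phrase violates (0, 1 or 2)."""
    if not phrase:
        return 0
    penalty = 0
    if phrase[0] in BAD_AT_START:
        penalty += 1
    if phrase[-1] in BAD_AT_END:
        penalty += 1
    return penalty


print(position_penalty(["refunds", "my"]))   # "my" at the end
print(position_penalty(["my", "refunds"]))   # no rule violated
```

A penalty like this could be used as one more ranking criterion, or to discard candidates outright, but it would not catch anything that isn't explicitly listed.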
Any assistance would be appreciated!