
I'm in the process of creating a parser which, using a few dictionaries (English-language words, places and acronyms), splits a domain name into a set of potential phrases and attempts to decide which phrase most likely matches the intentions of the person who registered the domain.

My current rules are:

  • A place name must be at least 4 characters long.
  • A maximum of 2 place names can occur in a phrase, and they must be adjacent.
  • An acronym must be at least 3 characters long.
  • Priority (in order) is:
    1. fewest words
    2. highest average word length
    3. highest maximum word length

For example, refundslondon extracts to refunds london, refunds lon don and refunds lond on.
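For reference, the splitting-and-ranking step works roughly like the sketch below. It's a simplified, illustrative version only: WORDS, PLACES and ACRONYMS stand in for my real (much larger) dictionaries, and the recursion is naive but fine for strings as short as domain names.

```python
# Illustrative stand-ins for the real dictionaries (mine are far larger).
WORDS = {"refunds", "lon", "don", "lond", "on"}
PLACES = {"london"}
ACRONYMS = set()

def splits(s):
    """Yield every segmentation of s whose pieces appear in one of the dictionaries."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head, tail = s[:i], s[i:]
        kinds = []
        if head in WORDS:
            kinds.append("word")
        if head in PLACES and len(head) >= 4:    # place names must be at least 4 chars
            kinds.append("place")
        if head in ACRONYMS and len(head) >= 3:  # acronyms must be at least 3 chars
            kinds.append("acronym")
        for kind in kinds:
            for rest in splits(tail):
                yield [(head, kind)] + rest

def valid(candidate):
    """At most 2 place names per phrase, and they must be adjacent."""
    places = [i for i, (_, kind) in enumerate(candidate) if kind == "place"]
    if len(places) > 2:
        return False
    return len(places) < 2 or places[1] == places[0] + 1

def rank_key(candidate):
    """Fewest words, then highest average word length, then highest maximum word length."""
    lengths = [len(word) for word, _ in candidate]
    return (len(lengths), -sum(lengths) / len(lengths), -max(lengths))

def ranked_phrases(domain):
    return sorted((c for c in splits(domain) if valid(c)), key=rank_key)

for candidate in ranked_phrases("refundslondon"):
    print(" ".join(word for word, _ in candidate))
# refunds london   <- ranked first (2 words, average length 6.5)
# refunds lon don
# refunds lond on
```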

It doesn't seem to work too badly at the moment, but from an English-language perspective, it would be nice to know whether there is a way to determine (a) the validity of the sentence structure and (b) whether certain words appear out of place, particularly with respect to words that can't really go at the end of a sentence.

Examples:

  • my probably shouldn't go at the end of a sentence
  • and probably shouldn't go at the start or end of a sentence

...and so forth.
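In other words, something like the crude positional filter below is what I have in mind, though I'm not sure what word lists it should be built from (the BAD_INITIAL and BAD_FINAL sets here are just guesses on my part):

```python
# Illustrative only: hand-picked word lists, not a real rule set.
BAD_INITIAL = {"and", "or", "but"}
BAD_FINAL = {"my", "and", "the", "of", "a", "an"}

def positionally_plausible(phrase):
    """Reject phrases whose first or last word looks out of place."""
    words = phrase.lower().split()
    if not words:
        return False
    return words[0] not in BAD_INITIAL and words[-1] not in BAD_FINAL

print(positionally_plausible("refunds london"))  # True
print(positionally_plausible("track my"))        # False: "my" at the end
```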

Any assistance would be appreciated!

  • You'd be surprised how difficult it is to make hard and fast rules like these. It might even make you say, "oh my!" And why, exactly, should and not go at the start of a sentence? :) – Kosmonaut Oct 25 '10 at 02:44
  • You are right in a broader sense, but it's easier to generalise when limiting the context to something such as domain names. – Nathan Ridley Oct 25 '10 at 02:51
  • You think so? I would think that domain names would be more likely to stretch the limits of usage than other contexts. But at least the domain names are going to be made up of a very limited number of words on average; that is definitely in your favor. – Kosmonaut Oct 25 '10 at 03:19
  • That's a challenging task. I am not quite sure how a parser could possibly deduce "the intentions of the person who registered the domain" from the domain name alone, without parsing at least the home page as well. There are lots of domain names where even humans have trouble figuring out the correct reading, you probably already know the most famous examples: whorepresents, expertsexchange, penisland, therapistfinder, molestationnursery, powergenitalia... Your rules would not work for the last two. – RegDwigнt Oct 25 '10 at 10:06
  • Also, do you take TLDs into account? .me, .do, .it etc. are often (mis)used as words in their own right; .ly, .ng, .ee, .de etc. can be (mis)used in constructing the last word; and there are even .co.ck, .co.at, .co.il etc. After all, "icio" is not a word (though, admittedly, even "del.icio.us" doesn't really tell you what that site is all about). – RegDwigнt Oct 25 '10 at 11:08
  • Ughh.. now it's even harder than I thought! Ok well my parser does extract all possibilities and simply ranks them. Perhaps you could offer suggestions which could help me eliminate possibilities from the result set? – Nathan Ridley Oct 25 '10 at 12:03
  • @Reg I never did understand why they used del.icio.us instead of just delicio.us – Nathan Ridley Oct 25 '10 at 13:52
  • @Nathan: it did look pretty clever, which attracted a lot of us geeks to the site quickly. Also anything to differentiate your brand is usually good, as long as it's not too hard to remember. Same reason Flickr dropped the 'e'. Del.icio.us was a bit too hard to remember, though, so they set up a "delicious.com" redirect and eventually rebranded the site. – Stefan Monov Oct 25 '10 at 18:19
  • This question would actually be more appropriate for StackOverflow or Theoretical Computer Science even! – Noldorin Dec 03 '10 at 23:12
  • I hadn't really thought of it before, but this kind of makes for a return to how things used to be done. Early writings (very early) didn't have spaces between words (in some languages, like Thai, still don't). Website names have taken us back to space-less phrases. (not sure why, the technology is there to support spaces in web addresses). – Xantix Dec 13 '12 at 05:32

1 Answer


Tag the words with their parts of speech. Use a Markov model to see which parts of speech are most likely to follow other parts of speech (NOUN - VERB - ADVERB is probably more likely than ADVERB - NOUN - VERB). Use the probabilities to help determine which sentence is correct.
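As a rough sketch of the idea in Python with NLTK (assuming the nltk package plus its brown, universal_tagset and averaged_perceptron_tagger data are installed; the candidate segmentations are just placeholders taken from the question):

```python
import nltk
from nltk.corpus import brown

# One-time data downloads (uncomment on first run):
# nltk.download("brown"); nltk.download("universal_tagset")
# nltk.download("averaged_perceptron_tagger")

# Estimate P(next tag | current tag) from the Brown corpus, using the coarse
# "universal" tagset (NOUN, VERB, ADV, ...).
tag_pairs = (
    (t1, t2)
    for sent in brown.tagged_sents(tagset="universal")
    for (_, t1), (_, t2) in nltk.bigrams(sent)
)
cfd = nltk.ConditionalFreqDist(tag_pairs)
cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)

def tag_sequence_score(words):
    """Product of tag-transition probabilities for one candidate phrase."""
    tags = [tag for _, tag in nltk.pos_tag(words, tagset="universal")]
    score = 1.0
    for t1, t2 in nltk.bigrams(tags):
        score *= cpd[t1].prob(t2)
    return score

candidates = [
    ["refunds", "london"],
    ["refunds", "lon", "don"],
    ["refunds", "lond", "on"],
]
print(max(candidates, key=tag_sequence_score))
```

Keep in mind that a plain maximum-likelihood bigram model assigns zero probability to unseen tag transitions and tends to favour candidates with fewer words (fewer transitions), which happens to line up with your own ranking rule; smoothing and length normalisation would make the scores more comparable across candidates.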

Claudiu
  • For best results, the statistical probabilities of tag sequences should be gathered by examining a large collection of manually split/labeled website names. As with almost every natural language processing project, greater accuracy is achieved by tailoring the process to the problem set. – Xantix Dec 13 '12 at 05:29