
I'm in the process of creating a parser which, using a few dictionaries (English-language words, places and acronyms), splits a domain name into a set of potential phrases and attempts to decide which phrase most likely matches the intentions of the person who registered the domain.

My current rules are:

  • A place name must be at least 4 characters long.
  • A maximum of 2 place names can occur in a phrase, and they must be adjacent.
  • An acronym must be at least 3 characters long.
  • Priority (in order) is:
    1. fewest words
    2. highest average word length
    3. highest maximum word length

For example, refundslondon extracts to refunds london, refunds lon don and refunds lond on.
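For reference, the splitting-and-ranking step works roughly like the sketch below. It's a simplified, illustrative version only: WORDS, PLACES and ACRONYMS stand in for my real (much larger) dictionaries, and the recursion is naive but fine for strings as short as domain names.

```python
# Illustrative stand-ins for the real dictionaries (mine are far larger).
WORDS = {"refunds", "lon", "don", "lond", "on"}
PLACES = {"london"}
ACRONYMS = set()

def splits(s):
    """Yield every segmentation of s whose pieces appear in one of the dictionaries."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head, tail = s[:i], s[i:]
        kinds = []
        if head in WORDS:
            kinds.append("word")
        if head in PLACES and len(head) >= 4:    # place names must be at least 4 chars
            kinds.append("place")
        if head in ACRONYMS and len(head) >= 3:  # acronyms must be at least 3 chars
            kinds.append("acronym")
        for kind in kinds:
            for rest in splits(tail):
                yield [(head, kind)] + rest

def valid(candidate):
    """At most 2 place names per phrase, and they must be adjacent."""
    places = [i for i, (_, kind) in enumerate(candidate) if kind == "place"]
    if len(places) > 2:
        return False
    return len(places) < 2 or places[1] == places[0] + 1

def rank_key(candidate):
    """Fewest words, then highest average word length, then highest maximum word length."""
    lengths = [len(word) for word, _ in candidate]
    return (len(lengths), -sum(lengths) / len(lengths), -max(lengths))

def ranked_phrases(domain):
    return sorted((c for c in splits(domain) if valid(c)), key=rank_key)

for candidate in ranked_phrases("refundslondon"):
    print(" ".join(word for word, _ in candidate))
# refunds london   <- ranked first (2 words, average length 6.5)
# refunds lon don
# refunds lond on
```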

It doesn't seem to work too badly at the moment, but from an English-language perspective, it would be nice to know whether there is a way to determine (a) the validity of the sentence structure and (b) whether certain words appear out of place, particularly with respect to words that can't really go at the end of a sentence.

Examples:

  • my probably shouldn't go at the end of a sentence
  • and probably shouldn't go at the start or end of a sentence

...and so forth.
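In other words, something like the crude positional filter below is what I have in mind, though I'm not sure what word lists it should be built from (the BAD_INITIAL and BAD_FINAL sets here are just guesses on my part):

```python
# Illustrative only: hand-picked word lists, not a real rule set.
BAD_INITIAL = {"and", "or", "but"}
BAD_FINAL = {"my", "and", "the", "of", "a", "an"}

def positionally_plausible(phrase):
    """Reject phrases whose first or last word looks out of place."""
    words = phrase.lower().split()
    if not words:
        return False
    return words[0] not in BAD_INITIAL and words[-1] not in BAD_FINAL

print(positionally_plausible("refunds london"))  # True
print(positionally_plausible("track my"))        # False: "my" at the end
```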

Any assistance would be appreciated!

  • You'd be surprised how difficult it is to make hard and fast rules like these. It might even make you say, "oh my!" And why, exactly, should and not go at the start of a sentence? :) – Kosmonaut Oct 25 '10 at 02:44
  • You are right in a broader sense, but it's easier to generalise when limiting the context to something such as domain names. – Nathan Ridley Oct 25 '10 at 02:51
  • You think so? I would think that domain names would be more likely to stretch the limits of usage than other contexts. But at least the domain names are going to be made up of a very limited number of words on average; that is definitely in your favor. – Kosmonaut Oct 25 '10 at 03:19
  • That's a challenging task. I am not quite sure how a parser could possibly deduce "the intentions of the person who registered the domain" from the domain name alone, without parsing at least the home page as well. There are lots of domain names where even humans have trouble figuring out the correct reading, you probably already know the most famous examples: whorepresents, expertsexchange, penisland, therapistfinder, molestationnursery, powergenitalia... Your rules would not work for the last two. – RegDwigнt Oct 25 '10 at 10:06
  • Also, do you take TLDs into account? .me, .do, .it etc. are often (mis)used as words in their own right; .ly, .ng, .ee, .de etc. can be (mis)used in constructing the last word; and there are even .co.ck, .co.at, .co.il etc. After all, "icio" is not a word (though, admittedly, even "del.icio.us" doesn't really tell you what that site is all about). – RegDwigнt Oct 25 '10 at 11:08
  • Ughh.. now it's even harder than I thought! Ok well my parser does extract all possibilities and simply ranks them. Perhaps you could offer suggestions which could help me eliminate possibilities from the result set? – Nathan Ridley Oct 25 '10 at 12:03
  • @Reg I never did understand why they used del.icio.us instead of just delicio.us – Nathan Ridley Oct 25 '10 at 13:52
  • @Nathan: it did look pretty clever, which attracted a lot of us geeks to the site quickly. Also anything to differentiate your brand is usually good, as long as it's not too hard to remember. Same reason Flickr dropped the 'e'. Del.icio.us was a bit too hard to remember, though, so they set up a "delicious.com" redirect and eventually rebranded the site. – Stefan Monov Oct 25 '10 at 18:19
  • This question would actually be more appropriate for StackOverflow or Theoretical Computer Science even! – Noldorin Dec 03 '10 at 23:12
  • I hadn't really thought of it before, but this kind of makes for a return to how things used to be done. Early writings (very early) didn't have spaces between words (in some languages, like Thai, still don't). Website names have taken us back to space-less phrases. (not sure why, the technology is there to support spaces in web addresses). – Xantix Dec 13 '12 at 05:32

1 Answer


Tag the words with their parts of speech. Use a Markov model to see which parts of speech are most likely to follow other parts of speech (NOUN - VERB - ADVERB is probably more likely than ADVERB - NOUN - VERB). Use the probabilities to help determine which sentence is correct.
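As a rough sketch of the idea in Python with NLTK (assuming the nltk package plus its brown, universal_tagset and averaged_perceptron_tagger data are installed; the candidate segmentations are just placeholders taken from the question):

```python
import nltk
from nltk.corpus import brown

# One-time data downloads (uncomment on first run):
# nltk.download("brown"); nltk.download("universal_tagset")
# nltk.download("averaged_perceptron_tagger")

# Estimate P(next tag | current tag) from the Brown corpus, using the coarse
# "universal" tagset (NOUN, VERB, ADV, ...).
tag_pairs = (
    (t1, t2)
    for sent in brown.tagged_sents(tagset="universal")
    for (_, t1), (_, t2) in nltk.bigrams(sent)
)
cfd = nltk.ConditionalFreqDist(tag_pairs)
cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)

def tag_sequence_score(words):
    """Product of tag-transition probabilities for one candidate phrase."""
    tags = [tag for _, tag in nltk.pos_tag(words, tagset="universal")]
    score = 1.0
    for t1, t2 in nltk.bigrams(tags):
        score *= cpd[t1].prob(t2)
    return score

candidates = [
    ["refunds", "london"],
    ["refunds", "lon", "don"],
    ["refunds", "lond", "on"],
]
print(max(candidates, key=tag_sequence_score))
```

Keep in mind that a plain maximum-likelihood bigram model assigns zero probability to unseen tag transitions and tends to favour candidates with fewer words (fewer transitions), which happens to line up with your own ranking rule; smoothing and length normalisation would make the scores more comparable across candidates.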

Claudiu
  • For best results, the statistical probabilities of tag sequences should be gathered by examining a large collection of manually split/labeled website names. As with almost every natural language processing project, greater accuracy is achieved by tailoring the process to the problem set. – Xantix Dec 13 '12 at 05:29