42

Is there an EBNF (Extended Backus–Naur form) that covers all of English, and if so, what is it?

David
  • 12,625
Alex
  • 523
  • 10
    We should be so lucky! If anything like that existed all those programmers working on writing a natural language parser would probably be out of work! – FumbleFingers Jul 01 '11 at 21:30
  • I will give you a draft New York Times English EBNF--- it's not hard to produce. It will be missing the construction of this question: http://english.stackexchange.com/questions/60643/jameson-whiskey-commercial-construction-with-implicit-verb (but this can probably be introduced by hand by a scan after parsing), and it will be too generous in generating output that is not identified as grammatical (see here): http://english.stackexchange.com/questions/60119/can-if-while-whenever-when-recurse-deeply-how-deeply – Ron Maimon Mar 10 '12 at 04:00
  • 5
    @FumbleFingers: None of those programmers would be out of work, in fact, they probably already have this construction, but in a highly non-canonical form (I played with an online sentence parser). Their problem is that the parsing you get from a typical EBNF is ambiguous, and in many more ways than it is supposed to be ambiguous. You get too many parsings for each sentence. I will give tricks for reducing the number of parsings to the correct number. I am not sure which tricks are the best right now. – Ron Maimon Mar 10 '12 at 04:03
  • @Ron Maimon: As I understand it, the big guns (i.e. - Google) are switching more resources to statistical machine translation these days. Personally, I don't think algorithm-based parsing is likely to get far in my lifetime, but approaches based on neural nets sampling huge volumes of data are already making considerable headway. – FumbleFingers Mar 10 '12 at 15:24
  • 3
    @FumbleFingers: These approaches are due to the fundamental matching problem: "The man ate the chicken on the bed with a fork" has an ambiguity of whether the chicken is on the bed or the man, and whether the chicken has the fork or the man does. These things are ubiquitous, so Google is trying to avoid the problem by using non-context-free regular grammars. This is not likely to work great, but it's better than semantic-less context-free grammars, because "chicken with a fork" does not return many hits, while "eating with a fork" does. The only way to do it right is context-free + semantics. – Ron Maimon Mar 10 '12 at 20:44
  • 3
    "doing it right" would really mean addressing a specific problem, not trying to generalize the "English Language" (whatever is meant by that). A generalized solution seems silly, considering the amount of AI required to process a natural language. You must apply lots of assumptions and restrictions, led by a specific problem definition you're trying to solve. – tenfour Mar 12 '12 at 00:42
  • 2
    sadly, this question did not result in any links to examples of (E)BNF for any nontrivial sets of English. I think there are some in the examples in the source for javaCC. Also I vaguely remember a text on the history of English that had an explicit BNF grammar for each of OE, ME, and modern English. – Mitch Mar 12 '12 at 01:49
  • 2
    For technical clarification, BNF (Backus-Naur Form) is a grammar formalism that is context-free (that is, all and only context-free grammars can be produced with BNF). Now EBNF is Extended BNF, where you have the same context-free rules of BNF plus some transformation rules that go beyond context-free (allowing some context-sensitive rules). Since English (and most human languages) are considered to be beyond context-free, BNF wouldn't suffice, but maybe EBNF could. – Mitch Jun 03 '12 at 15:24

5 Answers

31

The answer is best approximated as "yes", although there are some strictly non-context-free components of English. The approximation "English grammar is context free" is more true than false, in that the vast majority of the sentences you will encounter will be parsable by a simple EBNF, and all of them will be parsable by a mild extension of EBNF that I will describe below, one which does not require anything more than a stack plus some list-variables to keep track of permutable objects.

The permutable list-variables you have to keep track of in the parsing algorithm are (probably) not strictly context-free constructions, but they are not deep modifications of the context-free grammar idea, and there will be a preferred ordering of each list, the one most commonly occurring in the New York Times, and that preferred ordering is strictly context free.

I did not thoroughly bug-check the grammar, but it is mostly okay.

The Generating problem and the Parsing Problem

There are two different problems in a formal grammar, the problem of generating all outputs and the problem of parsing a given output. To illustrate, consider the following BNF:

SENTENCE: ALIST | BLIST
ALIST: "a" ALIST | "a" BLIST | ""
BLIST: "b" BLIST | "b" ALIST | "ba" ALIST | ""

The very simple rules for generating the productions of the grammar are given above. The vertical line is read "or", and it separates options for expanding the objects in capitals. The stuff in quotes is actual letters, and the empty string "" means that you produce nothing. To start, you have one symbol

SENTENCE

Then you replace SENTENCE with one of the options to the right of the colon. For illustration, take ALIST

ALIST

Now you replace ALIST with one of its options (removing the quote marks),

a BLIST

Now replace BLIST with one of its options, and so on, until you run out of capital-letter things. The capital-letter things are called "nonterminals", and the lowercase things are called "terminals" or "leaves". The result in this case is a string of a's and b's:

ababbabbab

Each finished product is a sentence of the language. Since I have given the rules explicitly, all productions are generatable by a simple algorithm.
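
To make this concrete, here is a minimal sketch of such a generation algorithm in Python (my illustration, not part of the original answer; the production table transcribes the toy grammar above, and the random choice of options is just one way to walk the productions):

import random

# Production table transcribed from the toy grammar above.
GRAMMAR = {
    "SENTENCE": [["ALIST"], ["BLIST"]],
    "ALIST": [["a", "ALIST"], ["a", "BLIST"], []],
    "BLIST": [["b", "BLIST"], ["b", "ALIST"], ["ba", "ALIST"], []],
}

def generate(symbol="SENTENCE"):
    """Expand nonterminals until only terminals (leaves) remain."""
    if symbol not in GRAMMAR:                  # a terminal: emit it
        return symbol
    option = random.choice(GRAMMAR[symbol])    # pick one production
    return "".join(generate(s) for s in option)

print(generate())  # e.g. "ababbabbab" (or "", or "ba", ...)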

The problem of parsing is going backwards--- given a sentence, figure out which rules were used to construct it. Parsing is harder than generation, because it is usually ambiguous. The sentence

abaa

could have come from a-ba-a using the third BLIST option ("ba" ALIST), or from a-b-a-a. This is the main difficulty in parsing, and different presentations of the same language sometimes have different amounts of ambiguity. For special grammars, there is no ambiguity, and these are the kind most often used in computer languages. The C programming language is presented in an LALR ("look-ahead LR", a class that can be parsed deterministically in a single left-to-right pass) BNF that can be automatically parsed by the UNIX classics lex and yacc (or their free software versions, flex and bison).
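
To see the ambiguity mechanically, here is a brute-force sketch (again my illustration, reusing the production table from the previous sketch) that counts how many distinct derivations the grammar has for a given string:

# GRAMMAR is the same production table as in the previous sketch.
GRAMMAR = {
    "SENTENCE": [["ALIST"], ["BLIST"]],
    "ALIST": [["a", "ALIST"], ["a", "BLIST"], []],
    "BLIST": [["b", "BLIST"], ["b", "ALIST"], ["ba", "ALIST"], []],
}

def count_parses(symbols, s):
    """Count the ways the symbol sequence derives exactly the string s."""
    if not symbols:
        return 1 if s == "" else 0
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:       # terminal: must match a prefix of s
        return count_parses(rest, s[len(first):]) if s.startswith(first) else 0
    return sum(count_parses(list(option) + rest, s)  # try every expansion
               for option in GRAMMAR[first])

print(count_parses(["SENTENCE"], "abaa"))  # more than 1: "abaa" is ambiguous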

The main type of grammar that people study is the context-free grammar, where the stuff on the left is just one symbol in capital letters, one nonterminal. General (unrestricted) grammars can produce the output of any conceivable computer program, and so are too general to be useful. There are also other classes, which I will completely ignore, because the context-free class is the closest one to linguistics.

The prototypical example of a nontrivial context free grammar is parentheses matching of two different kinds of parentheses:

SENTENCE: S
S: "(" S ")" | "[" S "]" | ""

This grammar generates all sentences which are properly balanced nested parentheses of two kinds, square or round. A more interesting example is

SENTENCE: EXPR
EXPR : EXPR "+" EXPR
| EXPR "*" EXPR
| "(" EXPR ")"
| NUMBER
NUMBER : "0" | NONZERODIGIT DIGITLIST
NONZERODIGIT: "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
DIGIT: "0" | NONZERODIGIT
DIGITLIST: DIGIT DIGITLIST | ""

You read this left to right as follows: "a SENTENCE is an EXPR", "an EXPR is an EXPR plus an EXPR, or an EXPR times an EXPR, or an EXPR in parentheses, or a NUMBER", etc. This grammar generates every arithmetic expression you use in grammar school. The C programming language is described by a simple context-free grammar (up to the well-known typedef ambiguity noted in the comments), as are nearly all programming languages in use today.
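
Since the EXPR rules as written are ambiguous ("1+2*3" can be globbed either way), a hand-written parser typically restructures them to build in precedence. A minimal recursive-descent sketch in Python (my illustration, not part of the answer's grammar):

import re

def tokenize(s):
    # NUMBER, "+", "*", "(" and ")" are the only terminals.
    return re.findall(r"\d+|[+*()]", s)

def parse(tokens):
    def expr(i):            # EXPR: TERM ("+" TERM)*
        node, i = term(i)
        while i < len(tokens) and tokens[i] == "+":
            rhs, i = term(i + 1)
            node = ("+", node, rhs)
        return node, i
    def term(i):            # TERM: ATOM ("*" ATOM)*
        node, i = atom(i)
        while i < len(tokens) and tokens[i] == "*":
            rhs, i = atom(i + 1)
            node = ("*", node, rhs)
        return node, i
    def atom(i):            # ATOM: "(" EXPR ")" | NUMBER
        if tokens[i] == "(":
            node, i = expr(i + 1)
            return node, i + 1          # skip the ")"
        return int(tokens[i]), i + 1
    return expr(0)[0]

print(parse(tokenize("1+2*(3+4)")))  # ('+', 1, ('*', 2, ('+', 3, 4)))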

The context free grammars are distinguished from the regular grammars, which have at most one nonterminal on the left and on the right. The regular grammar describes a finite state language, which can be "lexed" by a "lexer" rather than "parsed" by a "parser". An example of a lexed regular language is the NUMBER grammar in the example above. You only need to remember a finite number of things in the past to determine if a number is well formed, and you never need to look at a far away place, as you do when you are trying to check if parentheses are properly balanced.
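
For instance, because the NUMBER sub-grammar is regular, a single regular expression recognizes it; this is exactly what a lexer does (a sketch using Python's re module):

import re

# "0", or a nonzero digit followed by any run of digits.
NUMBER = re.compile(r"0|[1-9][0-9]*")

for s in ["0", "42", "007"]:
    print(s, bool(NUMBER.fullmatch(s)))  # 0 True, 42 True, 007 False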

This stuff is standard, and is discussed in detail on Wikipedia and elsewhere. I only give a quick review to establish some examples, and for completeness' sake. It is related to a fascinating theory of finite state machines and stack machines, which is good to understand.

The EBNF

A BNF is a set of rules for rewriting a nonterminal in terms of other nonterminals. I will give a trivial linguistic example.

SENTENCE: STATEMENT | QUESTION | COMMAND
STATEMENT: SUBJECT VERB ADVERBLIST OBJECT "."
QUESTION: "Hey," DO SUBJECT VERB ADVERBLIST OBJECT "?"
| DO SUBJECT VERB ADVERBLIST OBJECT "?"
| "Hey, dood, " DO SUBJECT VERB ADVERBLIST OBJECT "?"
COMMAND: VERB ADVERBLIST | VERB ADVERBLIST OBJECT
ADVERBLIST: ADVERB ADVERBLIST | ""
ADVERB: "quickly" | "slowly"
DO: "do"
VERB: "run" | "walk" | "hop"
OBJECT: "here" | "there" | "everywhere"
SUBJECT: "we" | "they" | "the bunnies"

This grammar is kind-of stupid (it is also regular, although the presentation above does not make it obvious at first glance). It generates a ridiculously tiny subset of English consisting of sentences, questions, and commands about bunnies and plural people who go places by various means of locomotion:

"hey, dood, do the bunnies run quickly quickly slowly here?"
"we hop slowly quickly here?"

The sentences are dumb. The only recursive part of the grammar is the part that generates the adverb list. This pattern, making a list, occurs very often:

ADVERBLIST : ADVERB ADVERBLIST | ""

I will abbreviate this nonterminal "ADVERBLIST" by calling it "ADVERB[]", with the brackets meaning a list of zero or more adverbs, and omitting the rule which forms it. This means I am using an extension of BNF, so it is EBNF (extended BNF).

Further, there are lots of cut-and-paste operations in the BNF for QUESTION above, corresponding to a certain thing being present in one form or another. I will abbreviate options using a nested parentheses language inside the BNF itself:

QUESTION : ("hey" | "hey, dood," | "") DO SUBJECT ADVERB[] VERB "?"

Hopefully this is self-explanatory. This means that one BNF line can actually stand for very many replacement rules, since

QUESTION: ("hey" | "hey, dood," | "") DO ("" | "tell if" | "tell me if") SUBJECT ADVERB[] VERB, ("yo?" | "huh?" | "huh, yo?" | "?")

generates 36 different sentences.

Further, I will use the standard EBNF abbreviation of square brackets for an optional construction, [X] is the same as (X | ""). These types of EBNF are no good for lex and yacc, but you can manually expand them out to a real BNF, so they are still context free constructions.

QUESTION: ["hey" [",dood]] DO [ "tell" ["me"] "if" ] SUBJECT ADVERB[] VERB ["huh"]["yo"]"?"

The above is equivalent to the previous, but more compact.

Commutative extended BNF

The main problem with a BNF description is that it generates a completely different structure for the sentences

"to the store I went"

and

"I went to the store"

when their semantics are identical. The first is just a poetic rearrangement of the second, and a speaker will understand it just fine. But the BNF will glob "to the store" together into a structure, which I will call an "ADVERB" (I really don't care what other people call it), in which "to" attaches to "the store". The problem is that

ADVERB I went

and

I went ADVERB

attach in different ways in the formal grammar, because you need to go through the "I" in the first case. Further, the following production

I to the store went

is also marginally poetically acceptable. If you think this sounds forced, consider

"Without a fork, I, by a stream and on a chair, ate the chicken"

This is perfectly fine, and equivalent to

"I ate the chicken by a stream and on a chair without a fork."

In a normal BNF or EBNF description, the different ADVERBs in the sentence attach to the VERB in different ways depending on their relation to the verb, whether to the left or to the right, and even worse, in different ways depending on their relation to the subject, whether they are to the left or the right of the subject.

To fix this, I will leave the class of context-free languages (maybe; perhaps this doesn't actually leave the class, I haven't proved it one way or the other, and it really doesn't matter for the practical purposes of linguistics). I will introduce another extension to the BNF, namely a "+" sign. A nonterminal with a plus sign attached can freely commute with all nonterminals not enclosed in parentheses by moving to the right. So during productions, it can slide around parenthesized units to the right as many times as it likes:

STATEMENT: ADVERB+[] SUBJECT (VERB OBJECT)

means that there is a list of ADVERB+ objects, each of which can slide past SUBJECT, or past the unit (VERB OBJECT). This construction is a pain in the neck to code in general, and it isn't covered by lex and yacc. There might be a simple way to replace it with different rules, for example

STATEMENT: ADVERB[] SUBJECT ADVERB[] (VERB OBJECT) ADVERB[]

But this sort of thing leads to a proliferation of different kinds of nodes (since each different kind of merge operation of two things into one in a basic description produces a different kind of node), and this proliferation of node types (when their semantics are obviously the same) is the bugaboo of natural language generation. I will not put things in standard EBNF form, and I will use the "+" construction above.

While you can't feed this into yacc, it is still not too hard to code a parser by hand for commutative EBNFs (at least the ones that come up in English). Perhaps somebody will write an automated parser for commutative BNFs, but if the grammar is simple enough, a hand coding might be sufficient, as in the sketch below.
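
As a concrete illustration of such a hand coding (my sketch with a toy word list, not a full parser): collect the floating ADVERB-like phrases into a list wherever they occur, so that every permutation yields the same structure:

# Toy single-word "adverb" lexicon, assumed for illustration only.
ADVERBS = {"quickly", "slowly", "yesterday"}

def parse_statement(words):
    """Parse ADVERB+[] SUBJECT (VERB OBJECT), letting adverbs float."""
    adverbs, core = [], []
    for w in words:
        (adverbs if w in ADVERBS else core).append(w)
    subject, verb, obj = core[0], core[1], " ".join(core[2:])
    # Adverb order is semantically irrelevant, so normalize it.
    return {"subject": subject, "verb": verb,
            "object": obj or None, "adverbs": sorted(adverbs)}

# Both word orders produce the identical parse:
print(parse_statement("yesterday we walked there slowly".split()))
print(parse_statement("we walked there slowly yesterday".split()))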

Nonobvious context-freeness of English

When you have a complex sentence, there are adjective-like and adverb-like phrases that can occur as embedded phrases:

"With a sword, tall and gallant, on a rock, breathless and encumbered, by a stream, over the field, armored in chains, without mercy, John slew the Jabberwock!" or

"Tall and gallant, by a stream, John, on a rock, breathless and encumbered, over the field, slew the Jabberwock, (John was) armored in chains, without mercy!"

I placed "(John was)" to indicate that the adjective-like phrase should not attach to "Jabberwock", which is the parsing native speakers prefer. The attachment to John is also an okay parsing, but less salient.

The adjective-like phrases "tall and gallant" and "armored in chains" occur with no particular order preference in the list of adjective and adverb phrases. This means that as you parse the sentence, you keep track of the adjective list (each of which might recurse down several times) and the adverb list (each of these might recurse), and when you are done, you attach the adjectives to the subject, and the adverbs to the verb. The verb object is the exception--- it is tightly attached to the verb. You can't move this guy around, or stuff things between it and the verb.

I should point out that in terms of style, you are generally supposed to do it like this:

ADVERB, ADVERB ... ADVERB, ADJECTIVE, ADJECTIVE ..., ADJECTIVE SUBJECT ADJECTIVE ADJECTIVE VERB ADVERB, ADVERB...

So that it will appear:

"on the stream, by the river, without mercy, tall and gallant John, breathless and beaming, armored in chains, slew the Jabberwock over the field."

In this order, you can make it context free for sure, because you can glob up the initial adverbs and the final adverbs first, then the middle adjectives to the subject, then the adverbs, subject, and verb into a sentence. I don't like this, because to a speaker all permutations are semantically equivalent, since they attach the same things to the same things.

A subtle point appears here: while you are allowed to say

The man with a limp shot the deer.

If you say,

With a limp, the man shot the deer.

the semantics changes--- the phrase becomes an adverb for sure. In the original position, right after "the man", it is more likely that you would bind "with a limp" as an adjective to "the man" rather than to "shot the deer" as an adverb.

Recursive Skeleton English

The main commutative EBNF rules are:

SENTENCE: STATEMENT | QUESTION | COMMAND
STATEMENT: ADJ+[] ADVERB+[] (SUBJECT) ADJECTIVE+[] (VERB)
QUESTION: ADJ+[] ADVERB+[] (DO SUBJECT) ADJECTIVE+[] (VERB)
COMMAND: ADJ+[] ADVERB+[] (VERB)

ADVERB: ADV
| WHEN (STATEMENT)
| IF (STATEMENT)
| GERUND
| PREPLY NP
ADJECTIVE: GERUND | PREPISH NP | ADJ
WHEN: "when" | "whereby" | "until" | "unless"
IF: ["not"] ("if" | "only if" | "because")
AND: "and" | "but" | "or"
PREPLY: ["not"] ("to"|"onto"|"into"|"of"|"out of"|"in"|"within"|"by"|"with"|"without")
PREPISH: ["not"] ("to"|"of"|"in"|"by"|"with"|"without")
ADJ: ADJ ("and"|",") ADJ | GERUND
NP: ("a"|"the") ADJ+[] SNP [WHICH CLAUSE]
CLAUSE: ADJECTIVE+[] ADVERB+[] SUBJECT VERB
| ADJECTIVE+[] ADVERB+[] SUBJECT VERB PREP
SUBJECT: NP
OBJECT: NP
WHICH: "that" | ""
GERUND: FOGHORN GER
FOGHORN: (PREPLY "-" NP)[] "-" OBJECT
ADJ: ADJ AND ADJ|
"purple"| "smelly"| "happy"| "windy" | "unbelievable" etc.
ADV: ADV AND ADV |
"quicky" | "slowly" | "happily"
NOUN: "cat" | "dog" | "man" | etc.
VERB: ("berate" [OBJECT]) | ("stop" [OBJECT]) | ("flee" [OBJECT]) | ("put" OBJECT) LOCATION+ ...
GER: "berating" | "stopping" | "fleeing" | "putting" etc.

Plurals, tenses, passive constructions, subject-verb agreement, and a bunch of other features of English are not incorporated in this BNF, and neither are the required-argument rules; this is just the recursive structure of the phrases. The remaining stuff is complicated annoyance that doesn't touch the embedding structure. This embedding structure is what is more-or-less universal across all old-world modern languages.

Disclaimer: This is a draft skeleton, and I have not debugged it. I am sure it is full of nonsense, since it is late at night.

Unambiguous English

The main ambiguity problem with parsing English is that sometimes the "with a fork" attaches to nouns and sometimes to verbs. For example:

I ate the Chicken with one leg with a fork

might be parsed as

ate(subject=me, object=Chicken(the: with one leg, with a fork))
ate(subject=me, object=Chicken(the: with one leg), with a fork)
ate(subject=me(: with one leg), object=Chicken(the, singular), with a fork)

etc., depending on whether the eating was done with one leg, or with a fork, or whether these are attributes of myself, or of the chicken. This ambiguity makes it difficult to decide how to attach each ADVERB and ADJECTIVE argument to the proper thing.

In order to eliminate this ambiguity, and make an unambiguously binding English-like language with the exact same word order, you can introduce the prepositions

with-ish of-ish on-ish to-ish

etc., to indicate that these attach to a noun phrase, not to a verb. I will write all ADJECTIVEs which modify the subject to the far left, and all adjectives which modify the various objects next to those objects. Further, there is the ambiguity of nesting level--- which depth of stack you are supposed to be on for each ADVERB and ADJECTIVE. I will indicate this using parentheses level. Then English becomes unambiguous. Further, these rules generally match the preferred style of the New York Times.

With-ish one leg, I ate (the Chicken) with a fork.

i.e. one-legged me ate the chicken using a fork.

I ate (the Chicken with-ish one leg) with a fork

i.e., I ate the one-legged chicken using a fork

I ate (the Chicken with-ish one leg with-ish a fork)

i.e., I ate the one-legged chicken which possesses a fork, which perhaps it uses as a prosthesis to replace its missing limb.

Unambiguous English might be a good target for starting semantics. It shouldn't require much to tell a computer that chickens can often have one leg, but more rarely have a fork. The semantic web construction from the nouns and the verbs and their attachments is not a hopeless project, but it is a difficult one, which requires ingenuity. The grammar problem is much easier, and a variation on the stuff above should solve it completely.
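
As a cartoon of what that semantic step might look like (entirely made-up weights, just to show the shape of the computation, not the author's method): give each candidate attachment a plausibility score and keep the best one:

# Made-up plausibility weights; a real system would estimate these
# from a corpus or a knowledge base.
PLAUSIBILITY = {
    ("chicken", "with one leg"): 0.9,  # chickens often have one leg
    ("chicken", "with a fork"): 0.1,   # chickens rarely possess forks
    ("ate", "with a fork"): 0.9,       # eating with a fork is common
    ("ate", "with one leg"): 0.2,
}

def best_attachment(phrase, candidate_heads):
    """Attach the phrase to whichever head is most plausible."""
    return max(candidate_heads,
               key=lambda head: PLAUSIBILITY.get((head, phrase), 0.0))

print(best_attachment("with a fork", ["chicken", "ate"]))   # ate
print(best_attachment("with one leg", ["chicken", "ate"]))  # chicken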

  • 11
    This is like coming to a medical research center and saying "I wrote a guide that can be used to diagnose 100% of medical ailments." To a doctor this will seem like a naive pursuit (which just implies that it should be more modestly described), and without even looking, an incomplete result at very best. Just saying... – tenfour Mar 11 '12 at 20:28
  • 2
    @tenfour: It is not like coming to a medical center, because you and everyone else already have an algorithm in their heads which decides if a sentence is grammatical. It's not very complicated, I just described the major recursive constructions explicitly. I left out "the apples picked in 2003" and other that-omitting past-tense constructions (but the present tense analog is included). If you refuse to read the content of this post, which describes how to do natural language generation, you won't understand English grammar. – Ron Maimon Mar 12 '12 at 02:54
  • 1
    Also omitted is verb ellipsis, which I admittedly am not sure how to handle yet. I was thinking of introducing a blank verb, but the syntax of this construction still eludes me. – Ron Maimon Mar 12 '12 at 07:22
  • tl;dr but: Can it correctly parse Buffalo? – musiKk Mar 12 '12 at 08:12
  • 3
    "I just described the major recursive constructions"... of course you can parse a subset of English with EBNF. The question is about "all" of English. And don't forget that "English" is more than just grammar, which many have already told you in other posts. – tenfour Mar 12 '12 at 09:18
  • @tenfour: This isn't a subset of the recursive sentence structure, it's all the sentence structure. I found a "bug": PREPISH adjectives are not allowed to be distant from the NP they modify in English, but this grammar allows it. Unlike these, ADJ lists are allowed to be distant from the subject. "I put her picture on my desk, which means I remember her" discussed here binds differently than what I described (but it can be fixed by making "which means" an IF). I didn't deal with tenses or subject-object matching, but these are local finite things which do not touch the recursion. – Ron Maimon Mar 12 '12 at 16:19
  • @musiKk: Buffalo is trivial for a formal grammar to generate (or parse), it is only difficult for humans. You can skip the first sections if you already know generative grammar. – Ron Maimon Mar 12 '12 at 16:22
  • 2
    @tenfour: English is not much more than the grammar and binding rules. Once you have this, you can start discussing semantics, because you know what can bind to what. I am annoyed that nobody is actually dealing with the content of the post, which is a construction of the complete recursive structure of a major natural language. – Ron Maimon Mar 12 '12 at 16:27
  • 1
    As it has been demonstrated already by others, you need real world semantic understanding even to parse grammar properly. – tenfour Mar 13 '12 at 14:56
  • 3
    @tenfour: You are completely wrong. This is only true to the extent that you need to resolve ambiguity, as explained above for "The man ate the chicken with the fork". The exact opposite has been demonstrated: any consistent grammatical parsing can be recognized as a correct parse-tree for English sentences, even when it makes no sense. Once you point out the different parsings, speakers will recognize that the chicken could conceivably have a fork. Chomsky's "Colorless green ideas sleep furiously" is an embedding free example. – Ron Maimon Mar 13 '12 at 19:58
  • Then what's the goal of the parsing at all, if in the end a human needs to refer to the original text? Why not just boil down all your parsing to "block of English text"? I think it's almost pointless to try and tackle this without any consideration for a specific problem to solve. – tenfour Mar 14 '12 at 09:31
  • 2
    @tenfour: The moment you have a parsing, you have automated flawless translation. I don't understand your statement: parsing is understanding all the relations implied by a sentence, it gives you what did what to what and how. I don't need to motivate this. – Ron Maimon Mar 14 '12 at 15:56
  • That ellipsis thing from a few comments up there.. Did you ever end up settling on something? The best I've got so far is "the unnamed elements of the implied pattern between left and right" (like an infix operator, I suppose?) ---- Ex: [1,2,...,9] => ... == [3, 4, 5, 6, 7, 8] ----- that kinda thing..? Anyway, would appreciate knowing if you ended up expanding on this (3yrs since post). Thanks. – Inversus Feb 05 '15 at 12:48
  • Also, this answer was great! Just saying – Inversus Feb 05 '15 at 12:52
  • @Inversus: The main idea here is the commutative context free grammar, which I explained in more detailed in a linguistics question/answer I wrote. That answer also found a reference to what is essentially the same idea (although somewhat more/too general) in the then recent linguistic literature from Poland (~2012). It's only been two years, but I haven't thought about it actively. I think I understand the "John Jameson/Hawk" ellipsis now in terms of a model, but it's not presented in terms of grammars. I can say why, but I don't like SE anymore, so not here. – Ron Maimon Feb 07 '15 at 22:12
  • @RonMaimon Hi, thanks for the response, and again for this answer. So many are so quick to throw a "No!" and the rulebook at ya. I've got one too, where I was giving a PyQt5 tutorial in my answer to a question that was closed after being accepted (ugh). So I can see why you might not like SE lol. Anyway, I'm taking a stab at the grammar and would definitely be interested in that more detailed explanation. Where should I go? Thanks! – Inversus May 09 '15 at 15:50
  • @Inversus: Perhaps start a facebook chat group and invite me? All you need to do is sit down with a newspaper for about 100 hours and look at all the sentences and make sure they get generated. You need to also make sure that all the generations are ok intuitively. Once you have commutativity of adverb lists and adjective lists the hard problem is essentially solved, because this removes the combinatorial explosion. The rest is fiddling with the details, which is necessary but boring. – Ron Maimon May 12 '15 at 18:40
  • should S: "(" S ")" | "[" S "]" be S: "(" S ")" | "[" S "]" | "" to terminate sentences? – philcolbourn May 17 '17 at 11:46
  • How can you have a reputation so low when you created this amazing thing – Inversus Aug 14 '18 at 11:57
  • This answer is great! There is one little mistake in there, however, which is the statement that the C programming language is context-free: It is not. (See the Lexer Hack on Wikipedia for more info). – Qqwy Oct 06 '18 at 04:14
  • If I'm reading this right, then English is unambiguous in the absence of prepositions. Is that correct? – wanderingmathematician Jun 15 '20 at 19:50
24

No.

It's been well demonstrated in the linguistic literature that natural human languages, including English, cannot be captured in a context-free grammar.

Here's a link for you (PDF): Evidence against the context-freeness of natural language

JSBձոգչ
  • 54,843
  • 18
    This answer is 180 degrees opposite of the truth, and links to an irrelevant paper that uses a Swiss German construction which does not parallel any English construction. The answer for English is yes, and it does not put any parsing people out of work. The EBNF is only the start of the parsing project. They have pretty good ones today, but their problem is not that they don't capture all English sentences, but that they are too nonunique, they give too many interpretations to a given sentence. This answer is lousy, and it should not be accepted or upvoted. – Ron Maimon Mar 10 '12 at 04:05
  • 1
    just curious, why does it use Swiss German as opposed to English? (the paper is written in English, Swiss German is probably obscure to most English reader) – Louis Rhys Apr 28 '12 at 08:32
  • 5
    Geoffrey Pullum disagrees: http://dl.acm.org/citation.cfm?id=970173 – Wes Nov 22 '13 at 15:42
  • 5
    There is a machine-readable dialect of English called Attempto Controlled English: https://en.wikipedia.org/wiki/Attempto_Controlled_English – Anderson Green Jul 15 '14 at 06:41
  • 1
    @LouisRhys: https://en.wikipedia.org/wiki/Cross-serial_dependencies – ribamar Mar 04 '18 at 22:03
  • There is also such a thing as context sensitivity in culture https://en.wikipedia.org/wiki/High-context_and_low-context_cultures and the more you go East, the more relative sentences become. It may impact the number of possible parsings. English has probably very low number of possible parsings which makes @RonMaimon 's answer more realist. – v.oddou May 08 '19 at 10:40
9

If by "English" you mean "any utterance that might be written or uttered by a native English speaker", then English — and indeed any language for that matter — is far too complex probably for such a thing to exist. (And as JSBangs points out, there are reasons for believing that such a thing couldn't even exist in principle.)

If by "English" you're prepared to accept a highly constrained subset of English, then you might consider the rewrite rules that exist, for example, in early Chomskyan models of syntax. In symbolic form, these express things like "PP → P NP" ("a prepositional phrase is rewritten as/comprises a preposition plus noun phrase").

In reality, applications that need to do syntax parsing usually end up employing an algorithm that allows multiple candidate analyses to be assessed in parallel. There are several variants of this approach that essentially fall under the umbrella of "dynamic programming" — look in particular at something called the Viterbi Algorithm.
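
For a flavor of that dynamic-programming approach (a toy sketch with invented probabilities; real systems estimate them from a tagged corpus), here is the Viterbi algorithm picking the best part-of-speech sequence for a two-word sentence:

# Toy hidden-Markov model: two tags, invented probabilities.
states = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}
trans = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2}
emit = {("NOUN", "dogs"): 0.5, ("VERB", "dogs"): 0.1,
        ("NOUN", "bark"): 0.2, ("VERB", "bark"): 0.6}

def viterbi(words):
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start[t] * emit.get((t, words[0]), 1e-9), [t]) for t in states}
    for w in words[1:]:
        best = {t: max(((p * trans[(prev, t)] * emit.get((t, w), 1e-9),
                         path + [t])
                        for prev, (p, path) in best.items()),
                       key=lambda pair: pair[0])
                for t in states}
    return max(best.values(), key=lambda pair: pair[0])[1]

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']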

Neil Coffey
  • 19,622
  • This is complete nonsense. You can give an EBNF for your entire answer, and the question, and all the stuff on this stackexchange, minus some unusual constructions that hardly ever occur. – Ron Maimon Mar 10 '12 at 04:06
  • 12
    Well, do publish... – Neil Coffey Mar 10 '12 at 05:35
  • It would not be accepted in any journal, because it is too trivial. It is also probably well known to all those folks who do the NYT parsing challenge every year. Instead I will post it here. There might be bugs, I haven't programmed it into a computer to test all the productions, but if people check it by hand, these bugs will be easily fixable, because the general idea is correct. – Ron Maimon Mar 10 '12 at 05:56
  • 4
    OK, excellent. Until proven wrong, I think I stick by the essence of my argument, though: an EBNF grammar would only cover a fairly constrained subset of English and it isn't usually the technique used in most parsing applications. – Neil Coffey Mar 10 '12 at 15:30
  • You are totally wrong, an EBNF covers all of your sentences, and it is the standard technique used in all parsing applications. For modern translation software, there has been a change to regular matching of patterns, because the binding ambiguities in context free grammars are too great: "I ate the chicken with the fork on the bed" can mean that the Chicken had the fork, the Chicken was on the bed, or any permutation of me/Chicken having these attributes, and without semantics, it's impossible to tell. – Ron Maimon Mar 10 '12 at 20:40
  • 2
    I wonder if we're arguing about a problem of nomenclature here? I think you could argue that some form of context-free rules form a very basic-level component of many parsing systems. But as far as I'm aware, in most systems these are really a very low-level component then sit underneath other components (as you say, to resolve ambiguities among other things), including in many cases some kind of statistical model -- which in my mind constitutes "context". Maybe the post you promised might help clear this up? – Neil Coffey Mar 11 '12 at 06:05
  • I am working on it, I'll post in the next 24 hours. It will only handle present tense, and I am sure it is missing a bunch of stuff, but it will get the general idea across. If people help and say what is missing, I'll try to fix it. It differs in an essential way from previous attempts, and this modification makes it more compact and intuitively "correct" as far as the algorithm it implies for parsing. – Ron Maimon Mar 11 '12 at 06:19
  • @RonMaimon: Any progress? If you published I don't see where. – Brett Holman May 23 '22 at 02:44
7

Attempts have been made:

  • from Robert Peters, A Linguistic History of English (Houghton-Mifflin, 1968): at the end of the chapter on Late Modern English (presumably contemporary English) there are a couple of pages of CFG (which is essentially BNF) plus some transformations (making it EBNF?):

CFG and xforms of Late Modern English

Is this complete? Most likely not. Is it at least comprehensive? I would expect lots more rules, but maybe I'm mistaken or maybe this is just a first good pass.

  • There is the current (academic) state-of-the-art Stanford Parser. Unfortunately this is not rule-based (EBNF-based) but rather corpus-based and statistical.

  • Likewise the recently popular SyntaxNet. These last two don't have explicit EBNF rule sets.

  • Link grammar has a working parser. It was developed by computer scientists rather than linguists, so it may not address linguistic concerns.

I vaguely remember there being a parser for English as example code in JavaCC, the Java compiler-compiler (it compiles a parser from a grammar given to it). But all these examples are designed by hand, in the sense that humans came up with the explicit grammar rules.

There exist many examples of part-of-speech tagging and shallow parsing procedures for many languages, for applications where deep analysis is not needed, just situations where you want to tell whether a word acts like a noun or an adjective, or whether it comes under the scope of a negation word.

Mitch
  • 71,423
2

In Syntactic Structures by Noam Chomsky, there is a chapter called "Limitations of Phrase Structure Description" (Phrase structure grammar = context-free grammar, correct me if I'm wrong), in which he gives examples of things not captured by CFG's.

The example he gives is "The scene of the movie and of the play was in Chicago". This sentence is created by combining two sentences together: "The scene of the movie was in Chicago" (called S1) and "The scene of the play was in Chicago" (called S2). Chomsky says that in order to form the combined sentence (called S3) from S1 and S2, "we must know not only the actual form of S1 and S2 but also their constituent structures...their 'history of derivation'".

He later says that "even if this form for grammar [CFG] is not literally inapplicable to English, it is certainly inadequate...[another kind of rule] leads to a considerable simplification of the grammar." I would agree with him on this point. Especially when it comes to subject-verb agreement, it really does not seem like a CFG is appropriate. The most natural way to write a CFG starts with this simple rule:

S -> NP VP

A sentence is a noun-phrase followed by a verb-phrase. But in this rule, the derivation of the noun-phrase is independent of the verb-phrase. There's no way to constrain the verb to agree with the noun.

I won't claim that it's impossible to write a CFG to handle this, but it's certainly not the most natural approach. If I were writing a parser, I would use a CFG to generate the skeleton of the sentence, and then go over it with another pass to check for agreement.
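
For illustration (my sketch, not Chomsky's), agreement can be forced in a pure CFG by splitting every agreeing nonterminal by number. This shows both that it is possible and why it feels unnatural, since the rule count multiplies with every feature that has to agree:

S -> NP_SG VP_SG
S -> NP_PL VP_PL
NP_SG -> "the dog" | "the man"
NP_PL -> "the dogs" | "the men"
VP_SG -> "barks" | "runs"
VP_PL -> "bark" | "run"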