Enumerating a context-free language

May 2, 2008algorithms, code, haskell

Here is a familiar context-free grammar for arithmetic expressions:

S      ::= add
add    ::= mul | add + mul
mul    ::= term | mul * term
term   ::= number | ( S )
number ::= digit | digit number
digit  ::= 0 | 1 | ... | 9

I have a challenge for you: write a program (in a language of your choice) which enumerates all strings accepted by this grammar. That is, your program runs forever, but if I have a string this grammar accepts, then your program should output it in a finite amount of time.

(This really is a fun challenge, so I encourage interested readers to try it themselves)

This is a tricky problem. Indeed it is sufficiently tricky that I cannot think of a clean imperative solution to it (which I’m sure is also related to being very out-of-practice in imperative programming). I’m interested to see any such solutions that people came up with. (And I’ll try it myself in the meantime)

But I’m going to present a functional solution using a neat little monad called Omega which I just uploaded to Hackage [deprecated by weighted-search 2013].

Let’s step back and consider a simpler motivating example. Consider the following list comprehension (and recall that list comprehensions are just the list monad behind a little bit of sugar):

pairs = [ (x,y) | x <- [0..], y <- [0..] ]

This looks like it generates all pairs of naturals, but it does not. It generates the list [(0,0), (0,1), (0,2), ...], so the first element of the pair will never get to 1. If Haskell allowed us to use ordinal numbers as indices then we could: pairs !! ω == (1,0). :-)

Conceptually what we have is a lattice of pairs that we’re “flattening” poorly:

(0,0)  (0,1)  (0,2)  (0,3)  ...
(1,0)  (1,1)  (1,2)  (1,3)  ...
(2,0)  (2,1)  (2,2)  (2,3)  ...
  .      .      .      .

We’re flattening it by taking the first row, then the second row, and so on. That’s what concat does. But anybody who’s had a brief introduction to set theory knows that that’s not how you enumerate lattices! You take the positive diagonals: (0,0), (1,0), (0,1), (2,0), (1,1), (0,2), .... That way you hit every element in a finite amount of time. This is the trick used to show that there are only countably many rational numbers.

Omega is the monad that comes from this concept. We define Omega‘s join to be this “diagonal” function. What we get back is a monad with a nice property:

If x occurs at a finite position in xs and y occurs at a finite position in f x, then y occurs at a finite position in f =<< xs

It’s hard to know what that means without knowing what =<< is supposed to mean. It means if you have a (possibly infinite) list of items, and to each one you apply a function which generates another (possibly infinite) list, and you merge them all together, you’ll be able to reach every result in a finite time.

More intuitively, it means if you write a multiple branching recursive function where each branch recurses infinitely (generating values), you will not get stuck in the first branch, but rather generate values from all branches.

And that is exactly what we need to write the context-free enumerator. Given a simple data structure representing a context-free grammar:

data Symbol a
    = Terminal a
    | Nonterminal [[Symbol a]] -- a disjunction of juxtapositions

Then we can write an enumerate function in a straightforward depth-first way:

enumerate (Terminal a) = return [a]
enumerate (Nonterminal alts) = do
    alt <- each alts          -- for each alternative
      -- (each is the Omega constructor :: [a] -> Omega a)
    rep <- mapM enumerate alt -- enumerate each symbol in the sequence
    return $ concat rep       -- and concatenate the results

But the Omega monad will do some heavy lifting and stagger those generators for us. Defining the grammar above:

arithGrammar = s
    where
    s      = Nonterminal [[add]]
    add    = Nonterminal [[mul], [add, Terminal '+', mul]]
    mul    = Nonterminal [[term], [mul, Terminal '*', term]]
    term   = Nonterminal [[number], [Terminal '(', s, Terminal ')']]
    digit  = Nonterminal $ map (map Terminal . show) [0..9]
    number = Nonterminal [[digit], [digit, number]]

And then running our enumerator:

runOmega $ enumerate arithGrammar

We get a very encouraging list of results:

0
1
0+0
0*0
0+1
(0)
1+0
0*1
0+0*0
00
...

Notice how each type of node is represented in that initial prefix. That’s a good sign that there won’t be degenerate repetitive behavior. But of course we know there won’t be by the guarantee Omega makes (unless I implemented it wrong).

37 thoughts on “Enumerating a context-free language”

Tac-Tics says:

May 2, 2008 at 8:16 am

Omega seems like a pretty cool monad. I remember running into this issue when I started using Haskell, and it frustrated me. This solution is just so wonderful.
Rafael says:

May 2, 2008 at 9:06 am

One solution is to simply
generate a list of strings of length n
output every string on the list that can be parsed with the given grammar
although perhaps this can’t be called clean.

But my point is that I don’t quite understand what the enumerating pairs has to do with it. Since the alphabet is finite, you don’t have to use diagonalization.
Black Meph says:

May 2, 2008 at 9:41 am

I’d like to see your Omega monad, since it sounds suspiciously like a ZipList/stream.
Tac-Tics says:

May 2, 2008 at 9:42 am

@Rafael
I’m pretty sure that the above way is a much more efficient way of searching a context free language. With your method is entirely feasible, of course, but it requires parsing obviously nonsensical strings like “—” and “01-+3”. The above method never has to consider these, as they aren’t produced by the production rules.

The reason for using the diagonalization monad is that the naive approach to enumerating the production using the list monad is depth-first, and because the S-production rule here is unbounded (an expression can be as long as you want to make it), the tree is infinitely deep.

From what I’m understanding, omega is simply a breadth-first variation on the list monad.
The Plaid Mentat says:

May 2, 2008 at 10:31 am

I would tend to agree with Tac-Tics, it should be quite simple as a depth first search. I could probably hack something together by this evening, but here would be my general approach (in C++):

1) Create a queue of objects representing strings (or lists) of terminals and non-terminals.
2) Push the Start symbol on the queue.
3) At each iteration, pop the queue. If it is all terminals, print it. Otherwise apply every possible production to it and push the results onto the queue.
The Plaid Mentat says:

May 2, 2008 at 10:33 am

Ehrm, s/depth (first search)/breadth $1/

This is in pseudo-Python:

res := {S}
for s in res:
  for A -> X in productions:
    for each position p of S
        such that s[p] = A
      s' := s[p <- X]
      if s' not in res:
        res := res U s'
        if s' has no nonterminals:
          yield s'

Mike says:

May 2, 2008 at 11:23 pm

How about Goedel numbers in reverse?
Luke Palmer says:

May 3, 2008 at 1:35 am

Black Meph: the Omega monad is literally just a list with “diagonal” instead of “concat” for join.

pb, yours and The Plaid Mentat’s solutions are quite similar. I guess the reason I didn’t see it is because of too “typed” a world view; putting nonterminals and terminals together in list didn’t occur to me… even though I did that to construct the grammar in the first place.

Mike: I know what a Gödel number is, but I have no idea what you’re talking about.
Mike says:

May 3, 2008 at 2:59 am

Sorry Luke, I mean (i) write a function that maps natural numbers to expressions, then (ii) iterate through them. Perhaps less elegant in practice than I first thought, but it’s definitely possible.
Luke Palmer says:

May 3, 2008 at 5:34 am

A function like (enumerate arithGrammar !!)? :-)

That’s kind of the whole point of Omega, to enumerate recursive things.
Mike says:

May 3, 2008 at 5:50 am

Me again. It’s certain easy enough my way (a few dozen lines of my amateurish Ruby) to generate a promising-looking expression sequence, but to do it without duplicates is another matter entirely. Thanks, interesting problem, yours is by far the neater solution!
anonymous says:

May 3, 2008 at 11:14 am

Have you tried to make your Omega monad an instance of MonadPlus? The obvious instances (modulo newtype wrapping: mzero = [], mplus = merge, with merge merging two lists) fail to work, consider for example:

do x <- each [0..] y <- each [0..] guard $ y == 3 return (x,y)
Anon2 says:

May 3, 2008 at 1:06 pm

I believe Omega falls under the rubric of fair backtracking a.k.a. interleaving monad transformers in the literature.
MonadPlus says:

May 3, 2008 at 1:10 pm

@Anonymous:

Trivial.
instance MonadPlus Omega where mzero = Omega [] mplus (Omega x) (Omega y) = Omega $ diagonal [x,y]
anonymous says:

May 3, 2008 at 1:43 pm

@MonadPlus: The problem is not quite as trivial as it seems from your post: The definitions of mzero and mplus you’ve given are very obvious, of course.

But they fail to work properly! If you try my example, do { x <- each [0..]; y <- each [0..]; guard (y == 3); return (x,y) }, you’ll see that it goes infinite loop, instead of lazily returning the infinite list [(0,3),(1,3),(2,3),…].
MonadPlus says:

May 3, 2008 at 2:57 pm

You’re right! It’s not quite so trivial. On the other hand, this is entirely well-known territory, albeit arguably without a generic one-size-fits-all solution.

Hint: diagonal . (diagonal `fmap`) == diagonal . diagonal, except when it’s not.

The Oleg-Shan axis has a different approach, I believe.
p says:

May 3, 2008 at 3:56 pm

A solution in python (sorry for the ugly combine function).

http://pastebin.com/f4084533a
Luke Palmer says:

May 3, 2008 at 5:28 pm
p: Very nice solution in python. The use of generators really cleans things up.

anonymous: Indeed that sort of thing was exactly what I originally created this for (long ago). My solution was to define:
```
newtype Omega a = Omega [Maybe a]
```
With mzero defined as Omega [Nothing].

I found that quite ugly. It also was not very efficient. I have yet to find a better solution, though. The hackage module does not do this…

Another rather hackish implementation.

import re

def gen(P, start=''):
  t = {start: 1}
  ns = [start]
  while ns != []:
    n = ns.pop(0)
    for A, Xs in P.items():
      for X in Xs:
        for i in range(len(n)):
          for match in re.finditer('', n):
            m = n[:match.start()] + X + n[match.end():]
            if not t.has_key(m):
              t[m] = 1
              if '<' not in m:
                yield m
              else:
                ns.append(m)

prods = {
  'S': ['', ''],
  'Digit': '0123456789',
}

for x in gen(prods):
  print x

pb says:

May 3, 2008 at 8:34 pm

Sorry, should be

##
‘S’: [”, ”],
jav says:

May 5, 2008 at 2:57 am

I implemented Plaid Mentat’s solution in Haskell (see http://hpaste.org/7407), but I believe the strategy only works in theory. The reason: While it is a breadth first algorithm of sorts, it will quickly fill the queue with long combinations of non-terminals that produce even longer combinations and so on and you never reach a finished state in a reasonable time. The above program prints the digits 0 to 9 and then just runs with no output for quite some time.
The Plaid Mentat says:

May 5, 2008 at 7:55 am

Jav – I had that problem as well with my initial solution, but I found that if a little more care is taken about how you fill the queue, you can get around this problem.

If you apply all possible productions to the first non-terminal, then to the second non-terminal, etc, you get the queue explosion you’re talking about (as well as duplicates) because you leave the first one unchanged when you move to the second one. One solution I considered was to work recursively here, but I took the easy way out and simply applied all possible productions to the /first/ non-terminal and stopped. This made that step of the algorithm much faster, made the queue fill much more slowly, and worked pretty well overall (I got all ~10 character expressions in about 2-3 minutes).

You’ll excuse my jumping to conclusions if this isn’t what yours does, as I can’t read haskell quite well yet – but I hope that helps!
jav says:

May 6, 2008 at 2:41 am

Plaid Mentat, thx for your comment. I came to a similar conclusion, even though I’m pretty confident that I don’t have any duplicates. Whenever I touch an element of the queue I in fact apply productions to all non-terminals but I generate all possible combinations of all possible productions right away and then never touch that element again. I check the first 10000 elements and don’t have duplicates there. I tried to apply your strategy of only working on the first non-terminal but it wouldn’t work for me.
jav says:

May 6, 2008 at 2:45 am

In any case. I tried another solution. This one is inspired by the Python solution posted earlier as that program was pretty functional already. It uses the idea of alternating between productions which traverses the tree of possible derivations in a much more spread-out manner.

Initial code: http://hpaste.org/7432
Made more compact: http://hpaste.org/7433
jav says:

May 6, 2008 at 2:45 am

And the idea further generalized so that it should work with any grammar: http://hpaste.org/7434

As it turns out the last code example is actually pretty similar to Luke’s solutions . But instead of using the Omega Monad it uses the ‘alternate’ function.
Luke Palmer says:

May 6, 2008 at 4:21 am
jav: hehe, it was fun watching you get closer and closer to the Omega solution.

And as you noticed, I think if you replaced the occurrences of alternate with diagonal from Control.Monad.Omega, you get exactly my solution. You could implement a BreadthFirst monad or something using alternate in the same way. Hmm… I wonder if I can factor diagonal out of that monad an parameterize it on the join function (the answer is yes).

After all this, here’s a summary of using list combining functions as traversals:
- concat – depth first
- alternate – breadth first
- diagonal – diagonal?
The reason the diagonal traversal is important is that it still traverses the whole tree even if it has an infinitely deep path or infinitely wide branch (where depth- and breadth-first traversals fail, resp.).
MonadPlus says:

May 6, 2008 at 5:59 pm

A riddle:
concat = concat . id alternate = concat . transpose diagonal = concat . ???
Luke Palmer says:

May 6, 2008 at 6:05 pm
MonadPlus: stripe (approximately as used in diagonal’s implementation).
```
stripe [[1, 2, 3]
       ,[4, 5, 6]
       ,[7, 8, 9]]
= [[1],[4,2],[7,5,3],[8,6],[9]]
```
Another answer is (:[]) . diagonal :-)
MonadPlus says:

May 7, 2008 at 12:37 pm

The Force is with you, young Palmer.
But you are not a Jedi yet.

There is more than one stripe function, if you consider renaming the one you gave as rotate45.

What does this tell you, if anything, about the free theorem(s) associated with the type [[a]] -> [[a]]?
apfelmus says:

May 8, 2008 at 4:08 am
Hello Luke,

good work, but I have the suspicion that Omega doesn’t fulfill the monad laws, at least not in the sense that the lists are equal, only in the sense that they have the same elements (as sets). In other words, I think that
```
  join . join == join . fmap join
```
with join = diagonal doesn’t hold with list equality, it only works with set equality.
Luke Palmer says:

May 9, 2008 at 5:58 am

Daft, you’re right! As an introduction to Isabelle I was going to attempt to prove the monad laws for Omega, but I was too lazy.

Nor does the alternate function we were discussing, for that matter.

I wonder, is concat the only function :: [[a]] -> [a] that is a proper join?

I suspect that the effect of the monad laws not holding on Omega is that (>=>) is not associative. That seems like a major problem. Time to add a disclaimer to the hackage module.
apfelmus says:

May 13, 2008 at 7:14 am

I’m not sure whether concat is the only proper join, but at least I can say that there is no proper join for infinite lists.

Proof: Every join for infinite lists / diagonalization corresponds to a bijective function $\circ:\mathbb{N}\times\mathbb{N}\to\mathbb{N}$ .
Now, the third monad law corresponds to the associativity of this binary operation $\circ$ , i.e. the equation $a \circ (b \circ c) = (a \circ b) \circ c$ . But since it’s supposed to be bijective, we have $a = a \circ b$ , a contradiction.

(PS: Can you make this undersized comment box bigger? And a preview would be cook, too :-) )
Jake McArthur says:

November 2, 2009 at 8:42 pm

@apfelmus

Are you sure about that? What about the Stream monad (which I’m pretty sure follows the monad laws)?
Haskeller without a Hat says:

November 3, 2009 at 4:26 pm

This is the best thing after sliced bread and factory produced butter!
Dan Rosén says:

November 5, 2009 at 3:24 pm

This was really inspiring. I had to go implement my own omega monad:

http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11871#a11871

(suggestions are appreciated)

Thanks for a nice article!
Ambroz Bizjak says:

February 13, 2011 at 12:47 am

I’ve designed an algorithm for enumerating the words of a context-free language given a grammar. It employs dynamic programming to find all possible words of each length for every symbol in the grammar (including the start symbol, which is of interest).

It requires the input grammar to have the following property: if there is at least one variable on the RHS of a production, there must also be at least one terminal there. It is possible to transform any grammar to have it; e.g. the Greibach normal form has this property.

Algorithm:

For every i>=0, find all i-long words each of the variables (V) can expand to. This is done like that: For each production of variable V, find all possible ways of splitting an i-long string into the variables and terminals on the RHS.

Because the terminals are of known size, this is equivalent to splitting an (i-t) long string into v substrings, where t is the number of terminals on the RHS, and v is the number of variables on the RHS. A splitting is identified by a tuple (N1, …, Nv), meaning the first variable on the RHS will expand to a N1-long string, …, and the last to a Nv-long string. In other words, find all possible N1, …, Nv >=0 such that, N1 + … + Nv = (i-t).

Example: if i=4, and we’re working the production A->xBC, possible splittings are (0,3), (1,2), (2,1) and (3,0). In other words: if xBC is of length 4, then BC must be of length 3. Then either B is of length 0 and C is of length 3, or B is of length 1 and C is of length 2, …

Now for each of the splittings (N1, …, Nv), find all possible assignments of words to the variables on the RHS. Let W(x) represent all x-long strings that the variable W can expand to. Now all possible assignments are elements of the cross product V1(N1) x … x Vv(Nv), where V1, …, Vv are variables on the RHS. For each combination, instantiate the variables on the RHS and remember the resulting word (making it part of V(i), V still being the LHS variable). The sets Vx(Nx) are known from previous iterations; the grammar restriction mentioned guarantees that all Nx < i.

Finally, the words of the language are the union S(0) U S(1) U S(2) U …, where S is the starting variable.

Example implementation in PHP:
http://193.77.101.149:54800/~ambro/enum.php.src
http://pastebin.com/na6L9932

Share this:

Related

37 thoughts on “Enumerating a context-free language”