« Back
in parser-combinators scala recursive-descent-parser lookahead lookbehind fastparse read.
Lookaheads and Recursive Descent Parser implementation with Scala Parser Combinators (Fastparse)

Lookaheads and Recursive Descent Parser implementation with Scala Parser Combinators (Fastparse).

While you are looking for parser combinators and how to use them around the internets you might crash into a bunch of hello worlds. As a developer, you will be pretty frustrated to find an exemplary positive and negative lookahead conditions and you might find yourself struggling with them. If you are writing code that is using backtracking heavily you are in the right place.

Let's offer a use case for our initiative:

We are going to parse actor nodes from an output and we have a case class that will carry the data of ActorNodes.

case class ActorNode(  
  name: String,
  `type`: String,
  `class`: String,
  status: Int,
  children: Int
) extends Ordered[ActorNode] {
  def compare(that: ActorNode): Int = this.children compare that.children
}

This structure will be filled in with lines of below data supplied per line:

-> / LocalActorRefProvider$$anon$1 class akka.actor.LocalActorRefProvider$Guardian status=0 2 children
-> system LocalActorRef class akka.actor.LocalActorRefProvider$SystemGuardian status=0 4 children
-> deadLetterListener RepointableActorRef class akka.event.DeadLetterListener status=0 no children
-> eventStreamUnsubscriber-2 RepointableActorRef class akka.event.EventStreamUnsubscriber status=0 no children
-> log1-Logging$DefaultLogger RepointableActorRef class akka.event.Logging$DefaultLogger status=0 no children
-> testProbe-2 RepointableActorRef class akka.testkit.TestActor status=0 no children
-> user LocalActorRef class akka.actor.LocalActorRefProvider$Guardian status=0 1 children
-> $a RepointableActorRef class akka.testkit.TestActors$ForwardActor status=0 no children

So let's see the line below as an example:

-> system LocalActorRef class akka.actor.LocalActorRefProvider$SystemGuardian status=0 4 children

We want to parse:

  • The name of the Actor (system)
  • Type of its reference (LocalActorRef)
  • Class of the actor that is instantiated (akka.actor.LocalActorRefProvider$Guardian)
  • Current status of Actor (extracting 0 from status=0)
  • Current children count of an Actor (extracting 4 from 4 children if it is no children return 0)

Normally, you can write a bunch of regexes that can be applied iteratively or have a single regex for word matching or carried regexes that act as a single parser and parses through the incoming data (which can be named recursive descent parsers).

So for capture groups, we need to specify descendency by defining what is going to be captured in which conditions:

val strChars = P( CharsWhile(_ != ' ').! )  

So all chars that is delimited by a space means strings to us.

Secondly, what can be the first parser that can be the initiator for our parser combinator?
We should tell to the parser that line should start with -> sign and continue with one space and then one or more characters with a lookahead of space. But we don't want to take last space so lookahead will check and set the last position to the last character of the sequence.

def leafName =  
P("-> ") ~ P( strChars ~ &( " " ) ).!  
^^^^^^^^
Look for the initial sequence  
           ^^^^^^^^^^^^
           Bunch of string chars
                         ^^^^^^^^
                         Space lookahead
                                   ^^^
                                   Exclamation mark is capture group

So index after successful parsing is going to be like:

-> system LocalActorRef class ak.......
 ~~~~~~~~^
         The second parser should continue from the first space character...
         So it can descend to the combined second parser that we are going to pass.

One thing about parser combinators is that they only continue to the next one when the previous one successfully parsed. So second one will be curried now.

How can we define the second one? We came to the last character of the first parsed string.

def typeName = P( " " ) ~ P( strChars ~/ &( " class" ) ).!  

Above is the second parser, and this is its steps to parse down:

  • Look for " " character
  • Look for many characters that are followed by " class" (In here ampersand is lookahead indicator)
  • ~/ after parsing descend into the position that is after multiple characters.

~/ character is syntactic sugar for don't backtrack the rest of the combined parsers behind this point. AKA cut in here, descend to this place...

And rest of the parsers basically look same:

def className = P( " class " ) ~ P( strChars ~/ &( " status=" ) ).!

val digits    = P( CharsWhileIn("0123456789"))  
def status    = (P( " status=" ) ~ P( digits ~/ &( " " ) ).!).map(_.toInt)  

As you can see in the status you can pass a function to the captured group and transform them into whatever you are looking for.

And below parser for parsing children. If there are no children instead of 0 we get no children so we need to look for both of the inputs and parse them as an integer.

  def childrenCount = (P( " " ) ~ P( ("no" | digits) ~/ &( " children" ) ).!).map { elem =>
    if (elem == "no") {
      0
    } else {
      elem.toInt
    }
  }

Now it is time for the beautiful part, combining the parsers. Parsers can be combined with tilde operator. So they can descend to next position after previously curried parser successfully executed.

def node = P(leafName ~ typeName ~ className ~ status ~ childrenCount)

def parseAll = node.parse(input)

parseAll(data) match {  
        case Parsed.Success(data, _) =>

        /**
         * Here is your node
         */
        val node = ActorNode.tupled(data)
        case Parsed.Failure(msg, i, _) => println("FAILURE: " + msg + " INDEX " + i.toString)
}

And that is all that you need to know how you can write recursive descent parsers.
How lookaheads can be combined and how you can write capture groups in Scala's fastparse and also make faster parser implementations with cut.

Thanks for all the 🐠 .

comments powered by Disqus