Improving vCard parsing with megaparsec

Last time I wrote about nested parsing of birthdays in vCards using megaparsec. That solution worked fine as far as I could tell, but there must’ve been a better way. And indeed there was! I’ll explore it and also refactor the vCard parser in this article. The changes (still based on the proof of concept) are at https://github.com/eunikolsky/b2c/tree/poc/full; there are individual commits for every distinct change described here.

Fascinating! I came up with a couple of these improvement ideas while bicycling for a few hours.

Also, I didn’t mention it before, but ideally, vcardParser’s result is:

Just Contact — for a contact with a birthday,
Nothing — for a valid vCard that doesn’t have a birthday,
a parse error — for any other error (including all kinds of parsing errors).

Parsing birthday immediately

The reason I initially started with parsing content lines into a list of key-value pairs and then looking for data there is I was confused by the fact that they are unordered, so it made sense to get a list of them with someTill.

But then I realized that megaparsec’s Parsec is a monadic parser — which means I can parse the name in a content line and decide what to do next based on that (a monad is required to do this, whereas an applicative functor is not enough). This means, I can parse a string as a value for the formatted name “FN” field and parse a birthday directly for the “BDAY” field! Let’s get started.

Well, it still makes sense to parse the content lines one by one because they have a regular structure, but now instead of returning a parsed content line for each one, I need to modify an intermediary value. It would be a temporary “mutable” value that would keep interesting data for a single vCard — something closer to the finished model I need (Contact). I call it ContactBuilder (in d5831c) and it Maybe has the name and birthday fields:

data ContactBuilder = ContactBuilder
  { cbName :: Maybe Name
  , cbBirthday :: Maybe Birthday
  }

In order to update this builder, I’ll use the StateT monad transformer. The Abstractions in Context book has a chapter on the monad transformers “pattern”; now I went to look at it to find a quote and turns out the book talks about megaparsec (it’s been a few months since I read it):

… If, for instance, you wanted to have parsers that keep track of how many times some token has appeared while parsing, and change the output based on that count, using StateT gives you a way to do so. More importantly, the author of megaparsec didn’t have to provide any special hooks to allow you to do that; through the use of a common abstraction (monads), we’re able to add this capability to our code orthogonally from the parsing library implementation.

This idea of extending a library without changing it is a very interesting view on monad transformers!

Now, there are two ways of composing State and Parsec: ParsecT (State x) and StateT x Parsec. At first, I went with the former option because it seemed logical: state inside a parser.

`State` in `ParsecT`

So the type is now:

type ContactParser = ParsecT Void Text (State ContactBuilder)

contentline :: ContactParser ()
contentline = do
  clName <- name
  …
  case clName of
    "FN" -> do
      fn <- value
      lift $ modify (\cb -> cb { cbName = Just . Name . T.pack $ fn })
    …

I updated the contentline parser, but didn’t touch the birthdayParser because it doesn’t need this fancy state. Now there are two places where I need to “convert” between two different types of parsers:

vcardParser :: Parser (Maybe Contact) calls contentline :: ContactParser (), and
contentline :: ContactParser () calls birthdayParser :: Parser Birthday.

I started with the latter, which was easy to do, although I still couldn’t get away from running the parser manually and dealing with the returned Either:

"BDAY" -> do
  state <- getParserState
  let (newState, bday) = runParser' (birthdayParser Nothing) state
  setParserState newState
  case bday of
    Left errorBundle -> parseError . NE.head . bundleErrors $ errorBundle
    Right bday -> lift $ modify (\cb -> cb { cbBirthday = Just bday })

But calling contentline from vcardParser doesn’t make sense — the types when applying runParserT' don’t match up:

-- runParserT' :: Monad m => ParsecT e s m a -> State s e -> m (State s e, Either (ParseErrorBundle s e) a)
-- ( m = ContactState = Control.Monad.State ContactBuilder )
-- runParserT' @ContactState :: ParsecT e s ContactState a -> State s e -> ContactState (State s e, Either …)

`Parsec` in `StateT`

Flipping the monad stack is the correct solution, which is also described in the megaparsec tutorial (but who reads manuals?! :) ). Layers of monad transformers work in reverse; there is more information about that in the Book of Monads.

type ContactParser = StateT ContactBuilder Parser

With this, composing different types of parser is so much easier:

-- m = ContactState = Control.Monad.StateT ContactBuilder (Parsec e s)
-- runStateT :: StateT s m a -> s -> m (a, s)
-- runStateT (_ :: ContactState ()) :: ContactBuilder -> Parser ((), ContactBuilder)
-- execStateT (_ :: ContactState ()) :: ContactBuilder -> Parser ContactBuilder
(ContactBuilder { cbName = Just name, cbBirthday = Just birthday }) <- execStateT (someTill contentline (string "END:VCARD" *> eol)) def
pure . pure $ Contact (name, birthday)

After some type juggling and fitting, I restored the MaybeT monad transformer in vcardParser in order to handle a missing birthday gracefully:

vcardParser = runMaybeT $ do
  …
  (ContactBuilder { cbName = Just name, cbBirthday = maybeBirthday :: Maybe Birthday })
    <- lift $ execStateT (someTill contentline (string "END:VCARD" *> eol)) def
  birthday :: Birthday <- MaybeT $ pure maybeBirthday
  pure $ Contact (name, birthday)

And I got rid of runParser'! Since the type of parsers is Parser everywhere (either plain or wrapped in a StateT), birthday parsing is simplified:

  "BDAY" -> do
    bday <- lift $ birthdayParser Nothing
    modify (\cb -> cb { cbBirthday = Just bday })

Difference in behavior

There is a difference in sequentiality between the previous and current versions of the parser. Given the same input where an invalid content line appeared after an invalid birthday, the previous version would fail on the later line:

λ> parseTest vcardParser "BEGIN:VCARD\nFN:Adam Ford\nBDAY:198x-01-01\nFOO\nEND:VCARD\n"
4:4:
  |
4 | FOO
  |    ^
unexpected newline
expecting '-', ':', ';', or alphanumeric character

whereas the current version (more appropriately) fails on the birthday field:

λ> parseTest vcardParser "BEGIN:VCARD\nFN:Adam Ford\nBDAY:198x-01-01\nFOO\nEND:VCARD\n"
3:9:
  |
3 | BDAY:198x-01-01
  |         ^
unexpected 'x'
expecting digit

Simplifying the monad stack

Since the parser doesn’t currently read the ContactBuilder while parsing, only modifies it, it’s better to make that more obvious by replacing the read-write StateT with a write-only WriterT. A common example of using a Writer is to compose a list of values, but in fact it can work with any Monoid, so in order to combine different parts of the builder, it needs to be a Monoid. How to implement that?

I have two fields, both are Maybes, and the use case is if I start with an empty builder and append a builder with Just a name to it, I should get a previous builder with the new name. It’s easy to implement a function that picks a Just value out of two Maybes, but it’s already implemented for us! You need to know about it though since it’s not a function like pickMaybe :: Maybe a -> Maybe a -> Maybe a, but a monoid called Last. It works like this:

λ> Last (Just 0) <> Last (Just 1)
Last {getLast = Just 1}

Since I’m lazy and don’t want to wrap both arguments in Last manually, I can conveniently use foldMap:

λ> getLast $ foldMap Last [Just 0, Just 1]
Just 1

With the change, the parser is updated this way:

type ContactParser = WriterT ContactBuilder Parser

instance Semigroup ContactBuilder where
  ContactBuilder lName lBirthday <> ContactBuilder rName rBirthday = ContactBuilder
    { cbName = getLast $ foldMap Last [lName, rName]
    , cbBirthday = getLast $ foldMap Last [lBirthday, rBirthday]
    }

instance Monoid ContactBuilder where
  mempty = ContactBuilder { cbName = Nothing, cbBirthday = Nothing }

contentline = do
  …
  "FN" -> do
    …
    tell $ ContactBuilder { cbName = Just . Name . T.pack . concat $ (fn:fnParts), cbBirthday = def }
  "BDAY" -> do
    …
    tell $ ContactBuilder { cbBirthday = Just bday, cbName = def }

or skipping the default values:

contentline = do
  …
  "FN" -> do
    …
    tell $ mempty { cbName = Just . Name . T.pack . concat $ (fn:fnParts) }
  "BDAY" -> do
    …
    tell $ mempty { cbBirthday = Just bday }

Verifying vCard is valid

According to RFC 2426:

A vCard object MUST include the VERSION, FN and N types.

I didn’t implement this before, but why not now? It’s very easy to extend our ContactBuilder to track whether the parser has seen the required fields:

data ContactBuilder = ContactBuilder
  { cbName :: Maybe Name
  , cbBirthday :: Maybe Birthday
  , cbVersionCorrect :: Bool -- whether "VERSION:3.0" is present, required for all vCards
  }

The contentline parser is looking for “VERSION”:

  "VERSION" -> do
    string "3.0"
    tell $ mempty { cbVersionCorrect = True }

Since the fields can come up in any place, I’ll verify the “VERSION”’s existence at the end of a vCard parsing:

vcardParser = runMaybeT $ do
  …
  ContactBuilder
    { cbName = Just name
    , cbBirthday = maybeBirthday :: Maybe Birthday
    , cbVersionCorrect = versionCorrect
    }
    <- lift $ execWriterT (someTill contentline (string "END:VCARD" *> eol))
  guard versionCorrect

Then there is a series of improvements here:

guard versionCorrect is easy to use and works for our monad, but it returns a Nothing when there is no version, which doesn’t match my expected behavior;
lift $ guard versionCorrect now propagates the stop condition to the parser instead of MaybeT — this works as expected, but it reports an “unknown parse error”;
lift $ fail is better; it needs to be wrapped in the unless function: lift $ unless versionCorrect $ fail "vCard must have a VERSION"`
fixing the location of the error so that it’s reported on the last line of a vCard, not after that line; this is done by moving the eol parser from someTill to after the version verification.

The “N” type is verified in a very similar way.

Don’t store “contentlines”

The refactoring in this post has updated the type of the contentline parser to ContactParser () — modifying a ContactBuilder, but not returning any meaningful result directly. Thus, there is no reason to receive a list of those ()s from someTill. This is implemented with the skipSomeTill combinator:

vcardParser = runMaybeT $ do
  …
  ContactBuilder …
    <- lift $ execWriterT (skipSomeTill contentline (string "END:VCARD"))

Unfolding support

One feature that got less supported in this version of the parser is unfolding. Previously, the parser only uniformly parsed strings out of content lines, so I could add this support to values (which was incomplete because unfolding can happen anywhere).

It was easy to add back the unfolding for string values. But I couldn’t come up with a general “unfolding” combinator to use with the custom birthday parser. Because count 4 digitChar expects four digits and not a newline and a space among those. I’ll keep this at the back of my head.

Conclusion

I’m astounded by the result of the refactoring and the fact that it was relatively easy! The combinator and monad transformer patterns are extremely flexible! As I’m new to Haskell (practically-speaking), this is still bending my mind.

KISS 🇺🇦

Stop the war!

Improving vCard parsing with megaparsec

Parsing birthday immediately

`State` in `ParsecT`

`Parsec` in `StateT`

Difference in behavior

Simplifying the monad stack

Verifying vCard is valid

Don’t store “contentlines”

Unfolding support

Conclusion

Comments

Parsing birthday immediately

State in ParsecT

Parsec in StateT

Difference in behavior

Simplifying the monad stack

Verifying vCard is valid

Don’t store “contentlines”

Unfolding support

Conclusion

Comments

`State` in `ParsecT`

`Parsec` in `StateT`