KISS

Keep It Simple Stupid

Inner parsing with megaparsec

| comments

For a small project, I need to parse the contact information from vcf files, which contain vCards (contact information). Specifically, the vCard 3.0 format defined by RFC 2426. And I only need contact’s name and birthday if any.

It’s a good project for me to learn more about parser combinators in Haskell. I’ve implemented a proof of concept using megaparsec because it has a great tutorial and produces good error messages. The code in this post is truncated and somewhat edited for (hopefully) easier understanding, but may not always make total sense, so you can take a look at the full working code in the PoC repository at https://github.com/eunikolsky/b2c/tree/poc/full.

Parsing content lines

I started with simple parsing of a single vCard:

1
2
3
BEGIN:VCARD
contentlines
END:VCARD

based on the formal grammar. The contentlines in the code is a list of key-value pairs, so

1
2
3
4
BEGIN:VCARD
FN:Adam Ford
BDAY:1984-01-01
END:VCARD

should be parsed into

1
2
3
[ ("FN", "Adam Ford")
, ("BDAY", "1984-01-01")
] :: [KeyValue]

where type KeyValue = (Text, Text).

(I know, the spec says “A vCard object MUST include the VERSION, FN and N types”, but I haven’t implemented these checks in the PoC yet.)

The content lines may be in any order; otherwise I would skip lines until the expected fields come up and return their values directly. Parsing worked great once I realized how to parse some lines until an end token: someTill. Here’s how:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
vcardParser :: Parser [KeyValue]
vcardParser = do
  string "BEGIN:VCARD" *> eol
  someTill contentline (string "END:VCARD" *> eol)

  where
    contentline :: Parser KeyValue
    contentline = do
      name <- name
      char ':'
      value <- value
      eol
      pure (T.pack name, T.pack value)
    name = some $ alphaNumChar <|> anySingleBut ':'
    value = many printChar

Testing this in ghci:

1
2
3
4
5
$ stack ghci b2c:lib
λ> :m +Text.Megaparsec
λ> :set -XOverloadedStrings
λ> parseTest vcardParser  "BEGIN:VCARD\nFN:Adam Ford\nBDAY:1984-01-01\nEND:VCARD\n"
[("FN","Adam Ford"),("BDAY","1984-01-01")]

Birthday parsing

Now that I can find a tuple where the first key is "BDAY" and get its value ("1984-01-01" here), how do I convert it to a Day? String extraction because the format is pretty simple? Regular expressions? Wait, I already have megaparsec for parsing, so I should use it!

Conceptually I need a function with the type signature like Text -> Maybe Day, however Parser Day encapsulates this idea and is a higher-level type — a parser that needs to be “run” and can return a lot of extra information.

The birthday parser is simple:

1
2
3
4
5
6
7
8
9
10
newtype Birthday = Birthday Day

birthdayParser :: Parser Birthday
birthdayParser = do
  year <- read <$> count 4 digitChar <?> "year"
  optional $ char '-'
  month <- read <$> count 2 digitChar <?> "month"
  optional $ char '-'
  day <- read <$> count 2 digitChar <?> "day"
  pure . Birthday $ fromGregorian year month day

Now I don’t want to get an unstructured [KeyValue] for a vCard, but a Contact, my domain model type:

1
2
type Name = Text
newtype Contact = Contact (Name, Birthday)

The parser is updated to produce this type:

1
2
3
4
5
6
7
8
9
10
11
vcardParser :: Parser Contact
vcardParser = do
  string "BEGIN:VCARD" *> eol
  keyValues <- someTill contentline (string "END:VCARD" *> eol)

  let (Just fName) = snd <$> find ((== "FN") . fst) keyValues
  let (Just bDay) = snd <$> find ((== "BDAY") . fst) keyValues
  let bDayResult = parse (birthdayParser Nothing) "" bDay
  case bDayResult of
    Left errorBundle -> parseError . NE.head . bundleErrors $ errorBundle
    Right bDay -> pure $ Contact (Name fName, bDay)

The key here is that I use the parse function to run an inner parsing of the birthday; if it fails, I stop the outer (vCard’s) parsing with the error using parseError, otherwise I produce a complete Contact.

1
2
λ> parseTest vcardParser "BEGIN:VCARD\nFN:Adam Ford\nBDAY:1984-01-01\nEND:VCARD\n"
Contact (Name "Adam Ford",Full 1984-01-01)

It works, but errors are displayed in the locations of the original string:

1
2
3
4
5
6
7
λ> parseTest vcardParser "BEGIN:VCARD\nFN:Adam Ford\nBDAY:198x-01-01\nEND:VCARD\n"
1:4:
  |
1 | BEGIN:VCARD
  |    ^
unexpected 'x'
expecting digit

This is fine for a PoC, but I wanted to do better.

Correct error positions

Is there a good way to run an inner parser with Megaparsec? hinted at the solution. I need to get getSourcePos and getOffset just before parsing content line’s value and save those values with the parsed value. The PoC’s code was updated before this change to support parameters in content lines and the type VCValue = Text type is used to store the line’s value (e.g. "1984-01-01" in our example); now the type is updated to store the position metadata:

1
2
3
4
5
data VCValue = VCValue
  { valText :: Text
  , valPos :: SourcePos
  , valOffset :: Int
  }

The vCard parser saves the source position and offset into a VCValue:

1
2
3
4
5
6
7
8
9
10
11
12
vcardParser = runMaybeT $ do
  
  where
    contentline :: Parser VCContentLine
    contentline = do
      
      char ':'
      pos <- getSourcePos
      offset <- getOffset
      clValue <- value
      eol
      pure $ VCContentLine (T.pack clName) maybeParam (VCValue (T.pack clValue) pos offset)

Finally I need to prepare the correct state for the inner parser so that it would report errors in the correct position:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
vcardParser = runMaybeT $ do
  
  curPosState :: PosState Text <- fmap statePosState . MaybeT . fmap Just $ getParserState
  let
    bDayCLValue = clValue bDayLine
    bDayParserState = State
      { stateInput = valText bDayCLValue
      , stateOffset = valOffset bDayCLValue
      , statePosState = curPosState
        { pstateInput = valText bDayCLValue
        , pstateOffset = valOffset bDayCLValue
        , pstateSourcePos = valPos bDayCLValue
        }
      , stateParseErrors = mempty
      }
    bDayResult = snd $ runParser' (birthdayParser maybeBDayOmittedYear) bDayParserState
  

Note that the code now uses runParser' in order to provide the custom initial state with the position information.

Now the inner parsing error is displayed in the correct place within the original string:

1
2
3
4
5
6
7
λ> parseTest vcardParser "BEGIN:VCARD\nFN:Adam Ford\nBDAY:198x-01-01\nEND:VCARD\n"
3:9:
  |
3 | BDAY:198x-01-01
  |         ^
unexpected 'x'
expecting digit

I’m sure there are better ways of implementing this, but I haven’t come across them yet.

Comments