This is a post about version 0.6.0 of my small program podtime, published this week. It may be interesting (although not very likely) to someone else using gPodder as a podcast client: I have a big queue of podcasts, and it’s thrilling to see the total duration of new episodes over time. The full readme is at https://github.com/eunikolsky/podtime/blob/master/README.adoc.
The prototype in shell is short and fast (and even works in parallel thanks to xargs!), but isn’t necessarily accurate because the duration of some MP3 files is only an estimate. To get the accurate duration, you need to parse the file and count its MP3 frames, which is much slower. I wanted to implement such a parser in pure Haskell and use it to write the program.
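Counting frames works because each MP3 frame covers a fixed number of audio samples, so its duration depends only on the sample rate. A minimal sketch of the arithmetic, assuming MPEG-1 Layer III frames (1152 samples each; the function names here are mine, not podtime’s):

```haskell
-- An MPEG-1 Layer III frame always contains 1152 audio samples.
samplesPerFrame :: Int
samplesPerFrame = 1152

-- Duration of one frame in seconds, given the sample rate from its header;
-- e.g. at 44100 Hz a frame lasts 1152 / 44100 ≈ 0.026 s.
frameDuration :: Int -> Double
frameDuration sampleRate = fromIntegral samplesPerFrame / fromIntegral sampleRate

-- The file's duration is the sum over the sample rates of all parsed frames.
totalDuration :: [Int] -> Double
totalDuration = sum . map frameDuration

main :: IO ()
main = print (totalDuration (replicate 100 44100))  -- 100 CBR frames at 44.1 kHz
```

The real parser additionally has to decode the bitrate and padding bit from each 4-byte frame header to know where the next frame starts, but the duration itself is just this sum.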
When doing the initial proof of concept in Haskell, I tried megaparsec to parse MP3 files. It worked well and produced helpful error messages. Its readme says attoparsec is designed for parsing binary protocols and is faster, so this year I decided to TDD the parser from scratch with attoparsec. Its API is very similar to megaparsec’s, but it has fewer public functions and doesn’t provide a public way to get the current input position.
A surprising inconvenience is the automatic backtracking on failure, which hides nested errors. Here’s an example: an MP3 file may start with an ID3 tag, followed by a sequence of MP3 frames. If the file starts with the bytes ID3, that’s the beginning of an ID3 tag, so the ID3 parser expects the rest of the header. Suppose the header uses an unsupported feature, so the parser fails with a descriptive error. However, this parser is wrapped in the optional combinator (because the tag is optional), so attoparsec backtracks and tries to parse the same bytes as an MP3 frame, which of course fails, but with a less helpful message; the ID3 parser’s error is lost. I couldn’t find an easy way to make this work.
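The failure mode can be sketched with a tiny hand-rolled parser over Either (so it runs with base alone); the parser names and error messages are made up, but attoparsec’s optional discards nested errors in the same way:

```haskell
-- A tiny hand-rolled parser type, just to show the failure mode.
type Parser a = String -> Either String (a, String)

-- On failure, backtrack and succeed with Nothing; the nested
-- error message is thrown away right here.
optional' :: Parser a -> Parser (Maybe a)
optional' p s = case p s of
  Left _          -> Right (Nothing, s)
  Right (x, rest) -> Right (Just x, rest)

-- Recognizes the "ID3" magic, then fails as if the tag header
-- used an unsupported feature (made-up error message).
id3Tag :: Parser ()
id3Tag s = case splitAt 3 s of
  ("ID3", _) -> Left "ID3: unsupported extended header"
  _          -> Left "not an ID3 tag"

-- Recognizes an MP3 frame by its leading 0xFF sync byte (simplified).
mp3Frame :: Parser ()
mp3Frame ('\xff' : rest) = Right ((), rest)
mp3Frame _               = Left "no MP3 frame sync byte"

-- An MP3 file: an optional ID3 tag, then (at least) a frame.
file :: Parser ()
file s = do
  (_, rest) <- optional' id3Tag s
  mp3Frame rest

main :: IO ()
main = print (file "ID3\x04rest")
-- prints Left "no MP3 frame sync byte": the helpful ID3 error is gone
```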
On the other hand, my attoparsec-based MP3 parser is indeed significantly faster than the initial megaparsec-based one. An example: 212 MiB in 1.1 s with attoparsec (~193 MiB/s) vs 193 MiB in 7.1 s with megaparsec (~27 MiB/s), roughly a 7× difference in throughput.
As usual, property-based tests with QuickCheck are amazing at uncovering failing cases. Given the number of different generators needed for different properties, I haven’t decided yet which approach to generators makes more sense to me: plain genFoo functions, or newtype Foo wrappers with Arbitrary instances. And I haven’t properly figured out shrinking yet.
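For illustration, here are the two shapes side by side (ValidFrameSize and its range are made up for this sketch, not taken from podtime):

```haskell
import Test.QuickCheck

-- Approach 1: a standalone generator, plugged into each property via forAll.
genValidFrameSize :: Gen Int
genValidFrameSize = choose (96, 1441)

prop_withForAll :: Property
prop_withForAll = forAll genValidFrameSize $ \n -> n > 0

-- Approach 2: a newtype wrapper whose Arbitrary instance is the generator,
-- so the property's type signature picks it up automatically.
newtype ValidFrameSize = ValidFrameSize Int
  deriving Show

instance Arbitrary ValidFrameSize where
  arbitrary = ValidFrameSize <$> choose (96, 1441)

prop_withNewtype :: ValidFrameSize -> Bool
prop_withNewtype (ValidFrameSize n) = n > 0

main :: IO ()
main = quickCheck prop_withForAll >> quickCheck prop_withNewtype
```

The forAll style keeps a generator next to the one property that uses it; the newtype style composes with the existing Arbitrary machinery and gives shrinking a natural home in a shrink implementation on the wrapper.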
The integration test was extremely helpful for ensuring that my parser worked correctly: it went through all my downloaded podcast episodes and verified that the parsed duration matched the duration reported by sox. The success rate kept growing as I added more details and fixed parser bugs.
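The shape of that check can be sketched like this (soxi -D prints a file’s duration in seconds; the tolerance and the names are mine, not podtime’s):

```haskell
import System.Process (readProcess)  -- the process package, ships with GHC

-- Ask sox for the duration of a file, in seconds.
soxDuration :: FilePath -> IO Double
soxDuration file = read <$> readProcess "soxi" ["-D", file] ""

-- Hypothetical comparison: the two durations are considered equal
-- if they differ by less than a tenth of a second.
durationsMatch :: Double -> Double -> Bool
durationsMatch mine theirs = abs (mine - theirs) < 0.1

main :: IO ()
main = print (durationsMatch 1932.04 1932.11)  -- prints True
```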
Conduit is an excellent library for data streaming in Haskell: standard lazy IO doesn’t run in constant memory and has weird gotchas. I use it only in a few places, but I was already able to implement the takeLastC combinator (tests) to drop all lines except the last N. I’d previously thought Conduit was a framework for streaming and the whole program would need to be written with it, but in fact it’s much simpler than that: it’s a library that can be used just in the places where you want to process data in constant space.
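The core of such a combinator is just a bounded buffer; here is the same idea as a conduit-free sketch over lists (Data.Sequence comes from containers, which ships with GHC; a streaming version replaces the fold with await/yield):

```haskell
import Data.Foldable (toList)
import Data.List (foldl')
import qualified Data.Sequence as Seq

-- Keep only the last n elements, using O(n) space regardless of input
-- length: append each element and evict from the front when the buffer
-- grows past n.
takeLast :: Int -> [a] -> [a]
takeLast n = toList . foldl' step Seq.empty
  where
    step acc x
      | Seq.length acc' > n = Seq.drop 1 acc'
      | otherwise           = acc'
      where acc' = acc Seq.|> x

main :: IO ()
main = print (takeLast 3 ([1..10] :: [Int]))  -- prints [8,9,10]
```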
It’s the first project where I’ve used GHCup (to install GHC and HLS) and Haskell Language Server to provide some IDE-like features in vim via CoC. It works pretty well, especially compared to not having those features at all: automatic completion is cool, in-place hlint suggestions show up (although applying the fix almost never works), and imports are updated (sometimes annoyingly). However, HLS’s file state often gets out of sync with the real files, maybe because I tend to use git commands to undo changes, apply changes, and switch branches.
The great ghcid is still there to provide very fast feedback with tests, running in a split tmux pane; sometimes I use stack test --file-watch instead, for the cases when I use a new GHC extension without updating the .cabal file (GHCi doesn’t complain about that).
This project has no package.yaml file for hpack; I update the cabal file manually. https://vrom911.github.io/blog/common-stanzas provides a good guide on removing duplication with common stanzas. I’ve never used hpack deliberately; it’s just that the default project skeleton generated by stack includes this setup. Having to update the cabal file by hand is slightly annoying, but no big deal.
The pre-commit hook works very well! Most of the time, I TDD a pure function, so I mostly work with unit tests; when I change the public interface of my library, the program target will likely break, and I wouldn’t discover that for several commits. The hook ensures all targets build, unit tests pass and hlint doesn’t offer any suggestions. The hook can be installed with gln -srv .git-pre-commit .git/hooks/pre-commit where gln is the GNU version of ln (brew install coreutils), which has the -r option to create a relative symlink — it’s more convenient than creating an absolute symlink or cding into .git/hooks/ and creating a relative symlink from there.
Speaking of which, a Makefile is a very convenient way to define small project tasks, such as the project build commands.
I haven’t got around to setting up an automatic code formatter in vim/HLS/git hook yet.