Punctuation restoration demo

This service restores punctuation in unsegmented English text. It works best on Europarl-style text. If you have any questions or problems, send an e-mail to ottokar.tilk@phon.ioc.ee. The service can also be used directly by sending the text with HTTP POST, e.g.:

curl -d "text=hello%20world" http://bark.phon.ioc.ee/punctuator
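The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The endpoint URL and the `text` form field come from the curl example above. The function names `build_request` and `punctuate` are hypothetical, and treating the response body as UTF-8 plain text is an assumption, not something this page documents.

```python
from urllib import parse, request

# Endpoint taken from the curl example above.
PUNCTUATOR_URL = "http://bark.phon.ioc.ee/punctuator"


def build_request(text, url=PUNCTUATOR_URL):
    """Build a form-encoded POST request carrying a single `text` field."""
    # urlencode escapes spaces and special characters, so raw text is safe to pass.
    body = parse.urlencode({"text": text}).encode("utf-8")
    return request.Request(url, data=body, method="POST")


def punctuate(text, timeout=30):
    """Send the text to the service and return the response body as a string.

    Assumption: the response is UTF-8 encoded plain text.
    """
    with request.urlopen(build_request(text), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `punctuate("hello world")` sends the same form data as the curl command. Building the request separately from sending it makes the encoding step easy to inspect without network access.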

Model

The service uses a bidirectional recurrent neural network model with an attention mechanism. The source code is available here. The model restores commas, periods, question marks, exclamation marks, colons, semicolons and dashes.

Training data

We used roughly the first 80% of the lines from the Europarl v7 monolingual English corpus as training data, the next 10% as development data and the last 10% as test data (preprocessing script here). The training set contained about 40 million words. The corpus was obtained from the IWSLT 2012 TED task web page.
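As an illustration of this kind of sequential split, here is a minimal Python sketch. It is not the preprocessing script mentioned above: the function name `split_corpus` and the exact cut-off rule (integer division on the line count) are assumptions, since the text only says the proportions were rough.

```python
def split_corpus(lines, train_pct=80, dev_pct=10):
    """Split lines sequentially into train, dev and test parts.

    The first `train_pct` percent of lines go to training, the next
    `dev_pct` percent to development, and the remainder to test.
    No shuffling is done, so the original line order is preserved.
    """
    n = len(lines)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = n * train_pct // 100
    dev_end = n * (train_pct + dev_pct) // 100
    return lines[:train_end], lines[train_end:dev_end], lines[dev_end:]
```

With the default percentages, a 10-line input yields 8 training lines, 1 development line and 1 test line.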

Try an example of a few random sentences from our test set.