I released my first CPAN module.

I finally played with libcrm114, a C library that implements several text classification algorithms. It is a potential replacement for the mailreaver.crm tool, which is the basis for my SpamAssassin plugin.

Having a C library removes the need to fork a crm interpreter process for every single classification (and, in most cases, the need to re-read the learned feature data each time). It also makes it possible to use the classifier from other languages: so far I know of modules for PHP and Python, and mine seems to be the first one for Perl.
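To illustrate why a process per classification is costly, here is a minimal sketch in Python. It does not use CRM114 at all: `classify_in_process`, `classify_by_fork`, and the one-word "model" are hypothetical stand-ins, and the child process is simply another Python interpreter. The point is only that starting an interpreter for each text adds a fixed cost that an in-process call avoids.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for learned feature data, loaded once and kept in memory.
MODEL = {"viagra"}

# Code run by each child process; it has to set up its own "model" every time.
CHILD_CODE = (
    "import sys; "
    "sys.stdout.write('spam' if 'viagra' in sys.stdin.read().split() else 'ham')"
)


def classify_in_process(text):
    """Classify using the model that is already in memory."""
    return "spam" if any(word in MODEL for word in text.split()) else "ham"


def classify_by_fork(text):
    """Classify by starting a fresh interpreter process for this one text."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def time_calls(func, texts):
    """Return the labels and the wall-clock seconds needed to classify all texts."""
    start = time.perf_counter()
    labels = [func(t) for t in texts]
    return labels, time.perf_counter() - start


texts = ["cheap viagra now", "meeting at noon"] * 5
in_proc_labels, in_proc_secs = time_calls(classify_in_process, texts)
fork_labels, fork_secs = time_calls(classify_by_fork, texts)

print(f"in-process:  {in_proc_secs * 1000 / len(texts):.3f} ms per text")
print(f"per-process: {fork_secs * 1000 / len(texts):.3f} ms per text")
```

Both variants return the same labels; only the per-text cost differs, and the absolute numbers will depend on the machine.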

A first, unscientific benchmark against my mailbox confirms the expected performance improvement: the time per text drops from about 11 ms to under 1 ms. I use Text::AI for the old fork-and-interpret model (source) and Text::AI::CRM114 for the new in-memory library model (source).

~> time perl test_ai_crm114.pl
classified 9111 texts in 100.86 seconds (11.070 millisec per text)
Spam Texts: 68
 Ham Texts: 9043
30.717u 64.212s 1:41.50 93.5%   200+8161k 9361+0io 0pf+0w

~> time perl test_text_ai_crm114.pl
classified 9111 texts in 7.89 seconds (0.866 millisec per text)
Spam Texts: 68
 Ham Texts: 9043
6.719u 0.485s 0:08.41 85.4%     9+2691k 0+0io 0pf+0w
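The per-text figures and the resulting speedup can be rechecked from the totals printed above. The following is a small sketch of that arithmetic; the labels and the helper name `millis_per_text` are my own, and only the text count and the two wall-clock totals are taken from the runs.

```python
# Figures copied from the two benchmark runs above.
TEXTS = 9111
TOTAL_SECONDS = {
    "fork-and-interpret": 100.86,
    "in-memory library": 7.89,
}


def millis_per_text(total_seconds, n_texts):
    """Average classification time per text, in milliseconds."""
    return total_seconds / n_texts * 1000.0


per_text = {
    name: millis_per_text(seconds, TEXTS)
    for name, seconds in TOTAL_SECONDS.items()
}
speedup = TOTAL_SECONDS["fork-and-interpret"] / TOTAL_SECONDS["in-memory library"]

for name, ms in per_text.items():
    print(f"{name}: {ms:.3f} ms per text")
print(f"speedup: {speedup:.1f}x")
```

This reproduces the 11.070 ms and 0.866 ms per text reported by the scripts and gives a speedup of roughly 12.8x on wall-clock time.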
