Text::AI::CRM114
I released my first CPAN module.
I finally played with libcrm114, a C library that implements several text classification algorithms. It is a potential replacement for the mailreaver.crm tool, which is the basis for my SpamAssassin plugin.
Having a C library removes the need to fork a crm interpreter process for every single classification, and in most cases also the need to re-read the learned feature data each time. It also makes it possible to bind the classifier into other languages: so far I know of modules for PHP and Python, and mine seems to be the first one for Perl.
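To make the cost of the fork-per-classification model concrete, here is a minimal sketch using only core Perl modules. It does not use crm114 at all: `classify_in_process` is a hypothetical stand-in that merely counts words, and `classify_via_fork` does the same work in a freshly forked child. A real crm run would additionally exec the interpreter and load its feature files, so this sketch understates the difference.

```perl
use strict;
use warnings;
use POSIX ();
use Time::HiRes qw(time);

# Hypothetical stand-in for a classifier: it only counts words.
sub classify_in_process {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# The same work, done in a new child process for every text.
sub classify_via_fork {
    my ($text) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: compute, write the result to the pipe, leave immediately.
        close $reader;
        print {$writer} classify_in_process($text);
        close $writer;
        POSIX::_exit(0);
    }
    # Parent: read the child's answer and reap the child.
    close $writer;
    my $result = <$reader>;
    close $reader;
    waitpid($pid, 0);
    return $result;
}

# Average wall-clock seconds per text for a given classifier.
sub seconds_per_text {
    my ($classifier, $texts) = @_;
    my $start = time();
    $classifier->($_) for @$texts;
    return (time() - $start) / scalar @$texts;
}

my @texts = ("cheap pills buy now limited offer") x 200;
printf "in process: %.3f millisec per text\n",
    1000 * seconds_per_text(\&classify_in_process, \@texts);
printf "via fork:   %.3f millisec per text\n",
    1000 * seconds_per_text(\&classify_via_fork, \@texts);
```

The absolute numbers depend on the machine, but the forked variant pays for process creation and teardown on every text, which is the overhead a linked-in library avoids.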
A first, unscientific benchmark against my mailbox confirms the expected performance improvement. I use Text::AI for the old fork-and-interpret model (source) and Text::AI::CRM114 for the new in-memory library model (source).
~> time perl test_ai_crm114.pl
classified 9111 texts in 100.86 seconds (11.070 millisec per text)
Spam Texts: 68
Ham Texts: 9043
30.717u 64.212s 1:41.50 93.5% 200+8161k 9361+0io 0pf+0w

~> time perl test_text_ai_crm114.pl
classified 9111 texts in 7.89 seconds (0.866 millisec per text)
Spam Texts: 68
Ham Texts: 9043
6.719u 0.485s 0:08.41 85.4% 9+2691k 0+0io 0pf+0w
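The figures above can be put side by side with a few lines of arithmetic. The sketch below only re-uses the numbers printed by the two runs (the `u` and `s` fields of the shell's `time` output are user and system CPU seconds). The hash keys `forked` and `in_memory` are labels chosen here for illustration.

```perl
use strict;
use warnings;

# Figures copied from the two benchmark runs above.
my $texts = 9111;
my %run = (
    forked    => { wall => 100.86, user => 30.717, sys => 64.212 },
    in_memory => { wall => 7.89,   user => 6.719,  sys => 0.485  },
);

for my $name (sort keys %run) {
    my $r = $run{$name};
    # Wall-clock time per text, and the share of CPU time spent in the kernel.
    printf "%-9s %.3f millisec per text, %.0f%% of CPU time in the kernel\n",
        $name,
        $r->{wall} / $texts * 1000,
        100 * $r->{sys} / ($r->{user} + $r->{sys});
}

# Overall wall-clock speedup of the in-memory run over the forked run.
printf "speedup: %.1fx\n", $run{forked}{wall} / $run{in_memory}{wall};
```

The wall-clock speedup comes out at roughly 12.8x. The forked run spends about two thirds of its CPU time in the kernel, which fits the picture of process creation dominating the cost, while the in-memory run spends well under a tenth there.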