I live in Japan. Recently there has been a lot of spam coming from China with messages written in Chinese. As SpamAssassin does not contain rules for Chinese, most of those emails pass with a low score.
I would like to identify when an email is written only in Chinese. As most Japanese kanji fall within the CJK ideograph range shared with Chinese (U+4E00 to U+9FFF), one way to identify Japanese is to look for Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF). If the message contains either Hiragana or Katakana, I can safely assume it is Japanese; otherwise it is Chinese.
If I test individual characters, for example: あ
or ア
they match correctly, but when I use ranges they don't work. This is what I have tried:
body CHINESE /[\xe4-\xe9]/ <--- this form seems to work fine
body JAPANESE /[\x30-\x31]/ <--- not sure what is actually matching
body JAPANESE /(あ|え)/ <---- this matches single character just fine
body JAPANESE /[あ-ん]/ <--- doesn't work
body JAPANESE /[U+3040-U+30FF]/ <--- doesn't work
body JAPANESE /[\xe3\x81\x81-\xe3\x82\x96]/ <--- doesn't work
body JAPANESE /[\x{3040}-\x{30FF}]/ <--- doesn't work
I really don't know what I am doing anymore. I know some of the above make no sense...
What is the correct way to specify those ranges?
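For reference, here is a minimal sketch of how the ranges might be written, assuming the body is matched as raw UTF-8 bytes rather than as Unicode characters. Under that assumption a class such as [あ-ん] or [\xe3\x81\x81-\xe3\x82\x96] is read as a set of single bytes, and \x{3040} never matches because no single byte has that value. Each range therefore has to be spelled out byte by byte. The rule names __HAS_KANA, __HAS_HAN and CHINESE_ONLY and the score are placeholders, not stock rules.

```
# Assumption: rules see raw UTF-8 bytes, not decoded characters.

# Hiragana + Katakana, U+3040..U+30FF = bytes E3 81 80 .. E3 83 BF
body     __HAS_KANA     /\xe3[\x81-\x83][\x80-\xbf]/

# CJK ideographs, U+4E00..U+9FFF = bytes E4 B8 80 .. E9 BF BF
body     __HAS_HAN      /(?:\xe4[\xb8-\xbf]|[\xe5-\xe9][\x80-\xbf])[\x80-\xbf]/

# Ideographs present but no kana anywhere: probably Chinese, not Japanese
meta     CHINESE_ONLY   (__HAS_HAN && !__HAS_KANA)
describe CHINESE_ONLY   Body has CJK ideographs but no Hiragana/Katakana
score    CHINESE_ONLY   2.0
```

This only helps for mail that reaches the rules as UTF-8. Messages sent in GB2312 or Big5 would need different byte patterns unless SpamAssassin is set up to normalize charsets first.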
Have you tried using Mail::SpamAssassin::Plugin::TextCat (language detector)?
IMHO you should evaluate it first.
You can modify it to match "only one language detected/guessed" or certain mixes of languages.
WARNING: Make sure the plugin is loaded by your SpamAssassin configuration.
On Debian Linux, it is configured in the file /etc/spamassassin/v310.pre.
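A minimal sketch of what that setup might look like, assuming the stock rule set defines UNWANTED_LANGUAGE_BODY for this plugin. The language list and the score are placeholder choices, not recommendations:

```
# In v310.pre (or another .pre file): make sure this line is not commented out
loadplugin Mail::SpamAssassin::Plugin::TextCat

# In local.cf: languages you expect to receive (placeholder choice)
ok_languages ja en

# Score applied when the guessed language is not in ok_languages (placeholder value)
score UNWANTED_LANGUAGE_BODY 2.5
```

After changing the configuration, running spamassassin --lint should report any syntax problems before the change goes live.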