CentOS
Is there an easy way to convert HTML special entities from a data stream? I'm passing data to a bash script and sometimes that data includes special entities. For example:
"test" & test $test ! test @ # $ % ^ & *
I'm not sure why some characters show up fine and others don't, but unfortunately I don't have control over the incoming data.
I'm thinking I might be able to use sed here, but that seems cumbersome and possibly prone to false positives. Is there a Linux command I could pipe to that specializes in decoding this type of data?
Perl is (as always) your friend. I think this will do it:
E.g.:
With output:
PHP is well suited to this. This example requires PHP 5:
recode seems to be available in the default package repositories of the major GNU/Linux distributions. For example, to decode HTML entities into UTF-8:
With Python 3:
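A minimal sketch of the Python 3 approach, assuming the standard-library `html.unescape` function (available since Python 3.4). The function name `decode_entities` and the sample string are made up for illustration:

```python
import html


def decode_entities(text: str) -> str:
    """Replace named and numeric HTML character references with their characters."""
    # html.unescape handles forms such as &amp;, &quot;, &#36; and &#x21;
    return html.unescape(text)


if __name__ == "__main__":
    # Hypothetical sample input mixing named and numeric entities
    sample = "&quot;test&quot; &amp; test &#36;test &#33; test"
    print(decode_entities(sample))
```

Because the decoding is done by a real entity table rather than hand-written patterns, this avoids the false positives the question worries about with sed.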
This takes a text file from stdin:
It probably needs bash version 4 or later.
I use this script. Save it as html2utf.py, and use it like this: echo $some_html | html2utf.py
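The script itself did not survive above, so the following is only a sketch of what a stdin-to-stdout filter of this kind could look like, not the poster's actual code. It assumes Python 3 and the standard-library `html.unescape`; the `main` function and its stream parameters are illustrative choices:

```python
#!/usr/bin/env python3
import html
import sys


def main(stream_in=sys.stdin, stream_out=sys.stdout) -> None:
    """Copy stream_in to stream_out, decoding HTML entities on each line."""
    for line in stream_in:
        # Entities never span a newline, so decoding line by line is safe
        stream_out.write(html.unescape(line))


if __name__ == "__main__":
    main()
```

Saved under the name used above and made executable, it could be called as echo "$some_html" | ./html2utf.py. Quoting the variable keeps the shell from word-splitting or globbing the data before it reaches the script.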