HTML, Unicode, and Perl

JD Austin plug-discuss@lists.plug.phoenix.az.us
Mon, 30 Jun 2003 16:30:55 -0700


Matt Alexander wrote:

>Does anyone know how I would take an HTML unicode character and convert it
>to the actual unicode character in a text file using Perl?  For example,
>let's say I have López.  I'd like the ó to be converted to the
>character with the o and the accent over it and saved to a plain text
>file.
>Thanks,
>~M
>
There are a number of Perl modules that deal with Unicode.
I haven't used it, but a quick glance through the CPAN module index
http://www.cpan.org/modules/01modules.index.html
turned up Unicode-Lite-0.12 by AMICHAUER <http://www.cpan.org/authors/id/A/AM/AMICHAUER>
http://www.cpan.org/authors/id/A/AM/AMICHAUER/Unicode-Lite-0.12.tar.gz
which has a function that appears to do what you want.
Using this module you can likely read the file in line by line,
call this function to convert the characters,
and write the result out to a new file.
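
That read/convert/write loop might look roughly like the sketch below. It uses only core Perl rather than Unicode::Lite, since I have not verified that module's exact call signature. It assumes the input is plain ASCII containing numeric references such as &#243; or &#xF3;. The sub names and the two command-line file names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace numeric character references (&#243; or &#xF3;) with the
# characters they name, in a single pass so nothing is decoded twice.
# Named entities such as &oacute; are left untouched.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

# Read $in_path line by line and write the decoded text to $out_path,
# encoding the output as UTF-8. Assumes the input file is plain ASCII.
sub convert_file {
    my ($in_path, $out_path) = @_;
    open(my $in, '<', $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} decode_numeric_entities($line);
    }
    close($in);
    close($out) or die "Cannot close $out_path: $!";
}

# Only runs when two file names are given: input file, then output file.
convert_file(@ARGV) if @ARGV == 2;
```

This does not handle named entities like &oacute;. The decode_entities function in the HTML::Entities module from CPAN covers those as well as the numeric forms.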

Here's a snippet of the README file:

FUNCTIONS
    convertor SRC_CP DST_CP [FLGS] [CHAR]
        Creates convertor function and returns reference to her, for further
        fast direct call.

        The param FLGS operates replacing by SBCS->SBCS converting if any
        char from SRC_CP is absent at DST_CP. The order of search of
        substitution:

         UL_7BT - to equivalent 7bit char or sequence of 7bit chars
         UL_SEQ - to equivalent char or sequence of chars
         UL_EQV - to equivalent char

         UL_ENT - to entity - &#0000;
         UL_CHR - to [CHAR].
         UL_ALL - UL_SEQ or UL_EQV and UL_ENT or UL_CHR



JD

-- 
JD Austin
Database Administrator
Maricopa Community Colleges
john.austin@domail.maricopa.edu
480.731.8759