HTML::Encoding
==============
This module determines the character encoding of HTML and
XHTML files. It reports only information about the encoding
that is given explicitly. It tries to read
* the 'charset' parameter of the Content-Type header in
  an HTTP::Headers or Mail::Header object
* the XML declaration of XHTML files
* the byte order mark at the beginning of the file
* a meta element that declares the encoding
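As a rough illustration of what such checks involve, here is a
minimal sketch in plain Perl. It is not the module's
implementation and does not use its API; the function names
(encoding_from_bom, encoding_from_xml_decl, encoding_from_meta)
are hypothetical, and the regular expressions assume the
declaration itself is written in US-ASCII compatible bytes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Encoding.
# Returns the encoding implied by a leading byte order mark.
sub encoding_from_bom {
    my ($bytes) = @_;
    # Test the 4-byte UTF-32 marks before the 2-byte UTF-16
    # marks: the UTF-32LE mark begins with the UTF-16LE mark.
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

# Hypothetical helper: encoding named in an XML declaration.
sub encoding_from_xml_decl {
    my ($bytes) = @_;
    return $1
        if $bytes =~ /\A(?:\xEF\xBB\xBF)?<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/;
    return;
}

# Hypothetical helper: 'charset' value found in a meta element.
sub encoding_from_meta {
    my ($bytes) = @_;
    return $1
        if $bytes =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z][A-Za-z0-9._-]*)/i;
    return;
}
```

Each function returns nothing (undef in scalar context) when the
document carries no such declaration, which mirrors the case
where the author supplied no encoding information.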
You always need to know the encoding of (X)HTML files if
you are trying to process them, e.g. parsing them with
HTML::Parser or extracting links with HTML::LinkExtor.
Assuming a default encoding such as US-ASCII or ISO-8859-1
is not safe, and HTML 4 forbids it. Documents need not use
an 8-bit character encoding at all: they may use UTF-16, or
an encoding that is not compatible with US-ASCII, such as
EBCDIC. Assuming some US-ASCII compatible encoding can
therefore fail and even break the document. Suppose you
retrieve a UTF-8 encoded file and pass it to some other
application, e.g. a web browser, labelled as ISO-8859-1:
the user will see lots of strange characters.
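The mislabelling problem can be reproduced in a few lines. This
is only a sketch using the core Encode module, independent of
HTML::Encoding; the sample character and variable names are
chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, U+00F6 (o with diaeresis), stored as UTF-8:
# this produces the two bytes 0xC3 0xB6.
my $bytes = encode('UTF-8', "\x{F6}");

# Decoding with the correct label recovers the one character.
my $right = decode('UTF-8', $bytes);

# Decoding with the wrong label turns each byte into its own
# character, U+00C3 and U+00B6, which a browser would display
# as two unrelated symbols.
my $wrong = decode('ISO-8859-1', $bytes);

printf "correct label: %d character(s)\n", length $right;
printf "wrong label:   %d character(s)\n", length $wrong;
```

No error is raised in the mislabelled case, because every byte
sequence is valid ISO-8859-1; the damage only shows up when the
text is displayed or processed further.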
This module provides an easy-to-use method to avoid these
problems. It may, however, fail if the page author did not
supply any character encoding information. This is a real
problem: if this module cannot determine the encoding, no
one can, and the document is said to be broken.

INSTALLATION

To install this module type the following:

   perl Makefile.PL
   make
   make test
   make install

COPYRIGHT AND LICENCE
This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
Copyright (C) 2001 Björn Höhrmann