>>> On 1/16/2008 at 3:40 PM, in message <fmm14k$lnc$1@news.tiscali.fr>,
Colin Booth <colinsbooth@gmail.com> wrote:
>> Are there advantages to choosing, say, IBM-1252 over UTF-8? If my PC
>> application uses code page 1252 will it perform better because no code
[quoted text clipped - 39 lines]
> from the UK and my address contains accents
I question your comment "the applications must coded for UTF-8". I just
wrote an OpenCobol application with embedded DB2. No special "UTF-8"
coding, whatever that might mean. All it does is connect to the database,
retrieve the "string" and "hex" values of a set of VARCHAR(25) columns, and
display those values.
I run this against two databases:
TEST1 is a database defined as codeset IBM-1252.
UTFDB is a database defined as codeset UTF-8.
Here are the results:
CONNECT TO test1
5B544553545D
+0006: [TEST]
7C544553547C
+0006: |TEST|
A654455354A6
+0006: ¦TEST¦
80
+0001: €
CONNECT TO utfdb
5B544553545D
+0006: [TEST]
7C544553547C
+0006: |TEST|
C2A654455354C2A6
+0006: ¦TEST¦
E282AC
+0001: €
(+0001: € <== that actually shows as the euro symbol in Notepad.)
As you can see, for the UTF-8 database the euro symbol was stored as
x'E282AC'. But since my application used code page 1252, DB2 was smart
enough to translate it to x'80', which is the value for the euro in code
page 1252.
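For anyone who wants to check those hex values without a database, here is a
minimal sketch that uses Python's built-in codecs as a stand-in for the
conversion DB2 performs between server and client. The helper name hex_bytes
and the variable names are made up for illustration; nothing here calls DB2.

```python
# Stand-in for the server/client code page conversion, using Python codecs.
# "\u00a6" is the broken bar and "\u20ac" is the euro sign.
samples = ["[TEST]", "|TEST|", "\u00a6TEST\u00a6", "\u20ac"]


def hex_bytes(text, codec):
    """Return the uppercase hex of `text` encoded with `codec`."""
    return text.encode(codec).hex().upper()


for s in samples:
    # ascii() keeps the printed label plain ASCII on any console.
    print(ascii(s), "cp1252=" + hex_bytes(s, "cp1252"),
          "utf-8=" + hex_bytes(s, "utf-8"))

# Round trip: the bytes as stored in the UTF-8 database, re-encoded for a
# client that runs in code page 1252.
stored = bytes.fromhex("E282AC")
client_bytes = stored.decode("utf-8").encode("cp1252")
print(client_bytes.hex().upper())
```

The ASCII rows come out identical in both encodings, while the broken bar and
the euro grow from one byte to two and three bytes respectively in UTF-8.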
Now of course, when a symbol exists in UTF-8 but not in 1252, there will
be a problem.
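To make that failure case concrete, here is a small sketch, again using
Python codecs only as an analogy. The Greek capital omega is my own choice of
a character that UTF-8 can hold and code page 1252 cannot. How DB2 itself
reports or substitutes such a character is not shown in my test above, so
treat the "replace" behaviour below as Python's, not DB2's.

```python
# A character that exists in Unicode but has no slot in code page 1252.
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

utf8_hex = omega.encode("utf-8").hex().upper()  # two bytes in UTF-8

try:
    omega.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    # Strict encoding refuses: there is no 1252 byte for this character.
    representable = False

# With errors="replace", Python substitutes a question mark instead of failing.
substituted = omega.encode("cp1252", errors="replace")

print(utf8_hex, representable, substituted)
```

Either way the original character is lost once it has been pushed through
the 1252 conversion.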
I guess your point is, and it's a good one, that if a CHAR or VARCHAR column
is defined in a UTF-8 database then you, in a sense, have to "over define"
the length to take into account the possibility of multi-byte characters?
For instance, a 1 character field that could possibly contain a multi-byte
UTF-8 character (such as the euro symbol) would have to be defined in the
database as, say, CHAR(3).
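A rough way to size such columns, assuming the declared length counts bytes
rather than characters (which is what the CHAR(3) example implies), is to
measure the UTF-8 encoded length of the data. The function names utf8_len and
fits are hypothetical helpers, not part of any DB2 interface.

```python
def utf8_len(text):
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits(text, declared_bytes):
    """True if `text` fits a column declared with `declared_bytes` bytes."""
    return utf8_len(text) <= declared_bytes


euro = "\u20ac"        # euro sign: 1 byte in 1252, 3 bytes in UTF-8
e_acute = "\u00e9"     # e with acute accent: 1 byte in 1252, 2 bytes in UTF-8

for label, ch in (("A", "A"), ("e-acute", e_acute), ("euro", euro)):
    print(label, utf8_len(ch), "byte(s)")

print(fits(euro, 1))   # a single euro sign against CHAR(1)
print(fits(euro, 3))   # the same character against CHAR(3)
```

Running this kind of check over real data gives an actual byte requirement
instead of a blanket "multiply by three" rule.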
This does bring to mind a question I have been pondering. Is there any harm
in defining 'string' fields to be much larger than the largest string length
that you would ever expect? Like an address line. It might be 50 or so
characters. Is there harm in defining it as VARCHAR(250) or even
VARCHAR(32000)? Does it waste space or any other resource?
Thanks for your help.
Frank