
Database Forum / DB2 Topics / January 2008

choosing a server codeset

Frank Swarbrick - 16 Jan 2008 00:33 GMT
Are there advantages to choosing, say, IBM-1252 over UTF-8?  If my PC
application uses code page 1252 will it perform better because no code page
translation is required?  I assume so.  What type of performance hit might I
expect when connecting to a UTF-8 database?  What advantages would I get by
using a UTF-8 database?  Obviously it can store the entire Unicode 'plane'
(or whatever that's called), but if my PC can't display it anyway what do I
really care?  And I guess that storing XML data requires UTF-8?  But I don't
think we plan on utilizing this.

What else should we know to make our decision?

Thanks,
Frank
Dan van Ginhoven - 16 Jan 2008 18:03 GMT
Hi Frank!

If the database contains national characters other than A-Z and a-z, then in
a UTF-8 database a table column declared as CHAR(8) will have room for only
4-8 characters, since characters like ÅÄÖÉÜ take 2 bytes each in UTF-8. If you
don't work with multiple national languages, go for a character set that
suits your situation. If you need to work with XML data, put it in a
separate database.
/dg
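
The 4-8 range can be checked outside the database. The following is a minimal Python sketch, not DB2 code: the helper `chars_that_fit` is a hypothetical name, and Python's `cp1252` codec is assumed to be a close enough stand-in for IBM-1252 for these characters.

```python
# Byte length of each character in UTF-8 versus code page 1252.
for ch in "AÅÄÖÉÜ":
    print(ch, len(ch.encode("utf-8")), len(ch.encode("cp1252")))


def chars_that_fit(text: str, max_bytes: int, encoding: str = "utf-8") -> int:
    """Count how many leading characters of text fit within max_bytes.

    Hypothetical helper: it models a byte-limited column such as CHAR(8).
    """
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > max_bytes:
            break
        used += size
        count += 1
    return count


ascii_only = chars_that_fit("ABCDEFGHIJ", 8)     # unaccented letters only
accented_only = chars_that_fit("ÅÄÖÉÜÅÄÖÉÜ", 8)  # two-byte letters only
print(ascii_only, accented_only)
```

Unaccented text fills all 8 positions, while text made up entirely of two-byte letters fits only 4, which matches the range given above.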

> Are there advantages to choosing, say, IBM-1252 over UTF-8?  If my PC
> application uses code page 1252 will it perform better because no code page
[quoted text clipped - 9 lines]
> Thanks,
> Frank
Colin Booth - 16 Jan 2008 22:40 GMT
> Are there advantages to choosing, say, IBM-1252 over UTF-8?  If my PC
> application uses code page 1252 will it perform better because no code
[quoted text clipped - 13 lines]
> Thanks,
> Frank

Hi

Some characters that may be single byte in 1252 are multi-byte in UTF-8. With
a standard UK keyboard I think that there are 3 or 4 characters that are
multi-byte in UTF-8.

I like and prefer UTF-8, but the applications must be coded for UTF-8. E.g. if
you have an 8-byte character column and an 8-byte (1252) entry field, and you
fill the entry field using at least one of the UTF-8 multi-byte characters, you
will get a data truncation error. You also need to be careful about the
number of characters in a column, as the byte count is not necessarily the
character count.
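
A rough way to see the truncation risk is sketched below in Python rather than in any DB2 client code. The names `fits_in_column` and `UK_SYMBOLS` are hypothetical, and the symbol list is only an assumption about which UK-keyboard characters are affected.

```python
def fits_in_column(value: str, column_bytes: int, codeset: str) -> bool:
    """Return True if value, encoded in codeset, fits in column_bytes bytes."""
    return len(value.encode(codeset)) <= column_bytes


# Assumed examples of symbols reachable from a UK keyboard; each is a single
# byte in code page 1252 but more than one byte in UTF-8.
UK_SYMBOLS = "£¬¦€"
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in UK_SYMBOLS}
print(utf8_sizes)

entry = "£1234.50"  # exactly 8 characters typed into an 8-character field
print(len(entry))
print(fits_in_column(entry, 8, "cp1252"))  # 8 bytes in 1252
print(fits_in_column(entry, 8, "utf-8"))   # 9 bytes in UTF-8
```

An application that validates the entry field by character count alone would accept this value, yet it no longer fits an 8-byte column once stored as UTF-8.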

Things are becoming much more global. I have moved to France but still have
some accounts and investments in the UK. I also purchase some things from
the UK, and my address contains accents.

Colin
Frank Swarbrick - 18 Jan 2008 18:04 GMT
>>> On 1/16/2008 at 3:40 PM, in message <fmm14k$lnc$1@news.tiscali.fr>, Colin Booth <colinsbooth@gmail.com> wrote:

>> Are there advantages to choosing, say, IBM-1252 over UTF-8?  If my PC
>> application uses code page 1252 will it perform better because no code
[quoted text clipped - 39 lines]
> from the UK and my address contains accents

I question your comment "the applications must be coded for UTF-8".  I just
wrote an OpenCobol application with embedded DB2.  No special "UTF-8"
coding, whatever that might mean.  All it does is connect to the database,
retrieve the "string" and "hex" values of a set of VARCHAR(25) columns, and
display those values.

I run this against two databases:
TEST1 is a database defined as codeset IBM-1252.
UTFDB is a database defined as codeset UTF-8.

Here are the results:

CONNECT TO test1  
5B544553545D            
+0006: [TEST]
7C544553547C            
+0006: |TEST|
A654455354A6            
+0006: ¦TEST¦
80                      
+0001: €

CONNECT TO utfdb  
5B544553545D            
+0006: [TEST]
7C544553547C            
+0006: |TEST|
C2A654455354C2A6        
+0006: ¦TEST¦
E282AC                  
+0001: €

(+0001: € <== that actually shows as the euro symbol in Notepad.)

As you can see, for the UTF-8 database the euro symbol was stored as
x'E282AC'.  But since my application used code page 1252, DB2 was smart
enough to translate it to x'80', which is the value for the euro symbol in
code page 1252.
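
The hex values in the listing can be reproduced independently of DB2. This is a small Python sketch that uses Python's `cp1252` codec as an assumed stand-in for IBM-1252; `hex_in` is a hypothetical helper name.

```python
def hex_in(text: str, codeset: str) -> str:
    """Return the uppercase hex of text as encoded in the given codeset."""
    return text.encode(codeset).hex().upper()


for sample in ("[TEST]", "|TEST|", "¦TEST¦", "€"):
    print(sample, hex_in(sample, "cp1252"), hex_in(sample, "utf-8"))

# The same kind of conversion the client saw: UTF-8 bytes as stored in the
# database, re-encoded into the application's code page 1252.
stored = bytes.fromhex("E282AC")
as_seen_by_client = stored.decode("utf-8").encode("cp1252")
print(as_seen_by_client.hex().upper())
```

The ASCII-only rows are byte-for-byte identical in both codesets; only the broken bar and the euro symbol differ.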

Now of course when there is a symbol that exists in UTF-8 and not in 1252
then there will be a problem.
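
To illustrate that failure case, here is a hedged Python sketch. It shows only how Python's own `cp1252` codec reacts to a character outside the code page; what DB2 does in the same situation (for example, which substitution character it uses) is not shown in this thread and may differ. `to_cp1252` is a hypothetical helper.

```python
from typing import Optional


def to_cp1252(text: str) -> Optional[bytes]:
    """Encode text in code page 1252, or return None if it cannot be represented."""
    try:
        return text.encode("cp1252")
    except UnicodeEncodeError:
        return None


print(to_cp1252("€"))  # the euro symbol exists in 1252
print(to_cp1252("Ω"))  # the Greek capital omega does not

# A lossy conversion replaces the unrepresentable character instead of failing.
lossy = "Ω".encode("cp1252", errors="replace")
print(lossy)
```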

I guess your point is, and it's a good one, that if a CHAR or VARCHAR column
is defined in a UTF-8 database then you, in a sense, have to "over define"
the length to take into account the possibility of multi-byte characters?
For instance, a 1 character field that could possibly contain a multi-byte
UTF-8 character (such as the euro symbol) would have to be defined in the
database as, say, CHAR(3).
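
How much to "over define" can be estimated with a short Python sketch. It assumes the data is limited to characters that exist in code page 1252 (via Python's `cp1252` codec); characters from elsewhere in Unicode can need up to 4 bytes in UTF-8. The names `cp1252_repertoire`, `WORST_CASE`, and `utf8_column_bytes` are hypothetical.

```python
def cp1252_repertoire() -> list:
    """Return every character that Python's cp1252 codec can decode."""
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # byte value not assigned in this code page
    return chars


# Largest number of UTF-8 bytes needed by any single 1252 character.
WORST_CASE = max(len(ch.encode("utf-8")) for ch in cp1252_repertoire())


def utf8_column_bytes(n_chars: int) -> int:
    """Bytes needed to hold n_chars characters from 1252 in the worst case."""
    return n_chars * WORST_CASE


print(WORST_CASE)
print(utf8_column_bytes(1))
print(utf8_column_bytes(50))
```

Under that assumption the worst case is 3 bytes per character, so a 1-character field needs 3 bytes, consistent with the CHAR(3) estimate above.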

This does bring to mind a question I have been pondering.  Is there any harm
in defining 'string' fields to be much larger than the largest string length
that you would ever expect?  Like an address line.  It might be 50 or so
characters.  Is there harm in defining it as VARCHAR(250) or even
VARCHAR(32000)?  Does it waste space or any other resource?

Thanks for your help.

Frank