Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General DB Topics / DB Theory / February 2004

Tip: Looking for answers? Try searching our database.

Squeezing spaces out of a string

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
--CELKO-- - 01 Feb 2004 00:09 GMT
This problem comes up on newsgroup about once a year.  Given a
VARCHAR(n) column with words in it, how do you squeeze out the extra
spaces, that each word is separated by only one space?

You can nest function calls up to 32 levels deep in SQL Server and
they have REPLACE (<target string>, <old string>, <new string>)
function, so answer is simply:

UPDATE Foobar
  SET col_x
      = REPLACE (
         REPLACE (
          REPLACE (
           ..
           REPLACE(col_x, '  ', ' ')
           ..
         '  ', ' ')
      '   ', ' ')
'     ', ' ');

This is a LOT faster than hanging in a loop and it is pure SQL, not
proprietary, procedural code.  But now a matb problem for you: let
col_x be VARCHAR(n).  What is the optimal mix of replace function
calls and what should they look like for the general case of (n)?
There can be more than one word in the string, so you can have what
typeseters call "rivers" in the string.

Prove it.    

Let (j->k) mean "replaces (j) spaces with (k) spaces"  

a) (2->1) repeated log2(n) times?
b) Is it best to always have (k->1)?
c) (floor(sqrt(n)) -> 1) as a starter?
d) Fibonnaci series?

I am in a weird mood this afternoon, so I'll offer a book as a prize
for a proof.

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in
your schema are.
Lennart Jonsson - 04 Feb 2004 09:56 GMT
> This problem comes up on newsgroup about once a year.  Given a
> VARCHAR(n) column with words in it, how do you squeeze out the extra
[quoted text clipped - 34 lines]
> I am in a weird mood this afternoon, so I'll offer a book as a prize
> for a proof.

Sorry, no proof. Just some reflections. If we think of the string as a
sequence of spacecontainers:

S = {s_1, s_2, ..., s_n}

Since I dont think there is one optimal sequense of replacement
operators for all strings, we will have to investigate the length of
all s_i to find the optimal sequence. But this problem is somewhat
similar to the original one. Therefore, I would go for alternative a)

Kind regards
/Lennart
--CELKO-- - 04 Feb 2004 18:37 GMT
Assume '<' and '>' do not appear in col_x.

UPDATE Foobar
  SET col_x
       = REPLACE (
          REPLACE (
            REPLACE(col_x, SPACE(1), '<> '),
          '><', SPACE(0)),
         '<>', SPACE(1));

This is due to a guy named Carnegie; Steve Kass found it in an old
posting.  The only problem is that it fails if the the first function
call overflows the maximum string length.  You might get errors, or
truncation depending on your SQL.
John Gilson - 13 Feb 2004 07:57 GMT
> Assume '<' and '>' do not appear in col_x.
>
[quoted text clipped - 10 lines]
> call overflows the maximum string length.  You might get errors, or
> truncation depending on your SQL.

Very clever.  If one doesn't want to be that clever and also not worry
about string overflow or whether a certain character is present or not
in the string, then one would want to stay within the realm of space
replacement.  As an upper bound, by simply doing a 2-to-1 space
replacement with each call to the REPLACE function, for a VARCHAR(N)
we would need CEILING(LOG2(N))nested REPLACE calls.  SQL Server's
VARCHAR can have a maximum length of 8000 characters so 13 (2^13 =
8192) successive 2-to-1 space replacements will always be sufficient.
But this is an upper bound.  Replacing more than 2 spaces at each step
will yield a better result.  But how much better?  And what number of
spaces should be replaced at each step to guarantee a correct answer?

Here's a stored procedure, written in T-SQL, that will return all
sequences of divisors that will reduce a VARCHAR of a given length.
The procedure can be called to find all such sequences, regardless of
length, for a VARCHAR of a given length or simply the shortest
sequences.

CREATE TABLE DivisorSequences
(
divisor_sequence VARCHAR(100) NOT NULL,
number_of_divisors INT NOT NULL CHECK (number_of_divisors >= 1),
first_divisor INT NOT NULL CHECK (first_divisor >= 2),
max_dividend INT NOT NULL CHECK (max_dividend >= 2),
PRIMARY KEY (number_of_divisors, divisor_sequence)
)

CREATE VIEW Digits (d)
AS
SELECT 0
UNION ALL
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
UNION ALL
SELECT 4
UNION ALL
SELECT 5
UNION ALL
SELECT 6
UNION ALL
SELECT 7
UNION ALL
SELECT 8
UNION ALL
SELECT 9

CREATE TABLE NonnegativeIntegers
(
n INT NOT NULL PRIMARY KEY CHECK (n >= 0)
)

INSERT INTO NonnegativeIntegers (n)
SELECT Ones.d + 10 * Tens.d + 100 * Hundreds.d + 1000 * Thousands.d +
      10000 * TenThousands.d
FROM Digits AS Ones
    CROSS JOIN
    Digits AS Tens
    CROSS JOIN
    Digits AS Hundreds
    CROSS JOIN
    Digits AS Thousands
    CROSS JOIN
    Digits AS TenThousands

CREATE PROCEDURE InsertDivisorSequences
@target_dividend INT,
@shortest_only CHAR(1) = 'Y' -- 'N' for all sequences regardless of
length
AS
IF @target_dividend >= 2
BEGIN -- IF
 DECLARE @level_number INT
 DECLARE @max_level_number INT
 SET @level_number = 1
 SET @max_level_number =
     CAST(CEILING(LOG(@target_dividend)/LOG(2)) AS INT)
 INSERT INTO DivisorSequences
 (divisor_sequence, number_of_divisors, first_divisor, max_dividend)
 VALUES ('2', @level_number, 2, 2)
 WHILE @level_number < @max_level_number
 BEGIN -- WHILE
   IF @shortest_only = 'Y'
   BEGIN -- IF
     IF EXISTS (SELECT *
                FROM DivisorSequences
                WHERE number_of_divisors = @level_number AND
                      max_dividend >= @target_dividend)
       BREAK
     ELSE
       INSERT INTO DivisorSequences
       (divisor_sequence, number_of_divisors, first_divisor,
max_dividend)
       SELECT CAST(N.n AS VARCHAR) + '.' + divisor_sequence,
              D.number_of_divisors + 1,
              N.n,
              (D.max_dividend - N.n + 2) * N.n + N.n - 2
              -- unsimplified version of above is
              -- ((((D.max_dividend + 1) - (N.n - 1)) * N.n) + (N.n -
1)) - 1
       FROM DivisorSequences AS D
            INNER JOIN
            NonnegativeIntegers AS N
            ON D.number_of_divisors = @level_number AND
               N.n BETWEEN D.first_divisor AND max_dividend + 1
   END -- IF
   ELSE
   BEGIN -- ELSE
     INSERT INTO DivisorSequences
     (divisor_sequence, number_of_divisors, first_divisor,
max_dividend)
     SELECT CAST(N.n AS VARCHAR) + '.' + divisor_sequence,
            D.number_of_divisors + 1,
            N.n,
            (D.max_dividend - N.n + 2) * N.n + N.n - 2
     FROM DivisorSequences AS D
          INNER JOIN
          NonnegativeIntegers AS N
          ON D.number_of_divisors = @level_number AND
             D.max_dividend < @target_dividend AND
             N.n BETWEEN D.first_divisor AND max_dividend + 1
   END -- ELSE
   SET @level_number = @level_number + 1
 END -- WHILE
END -- IF

-- Find all sequences of divisors to reduce a VARCHAR(10)
EXEC InsertDivisorSequences 10, 'N'

SELECT divisor_sequence, number_of_divisors, max_dividend
FROM DivisorSequences
WHERE max_dividend >= 10
ORDER BY number_of_divisors, divisor_sequence

divisor_sequence number_of_divisors max_dividend
3.2.2 3 10
3.3.2 3 10
4.2.2 3 10
4.3.2 3 10
2.2.2.2 4 16
3.2.2.2 4 22
4.2.2.2 4 26
5.2.2.2 4 28
5.5.2.2 4 28
5.5.3.2 4 28
6.2.2.2 4 28
6.5.2.2 4 28
6.5.3.2 4 28
7.2.2.2 4 26
7.5.2.2 4 26
7.5.3.2 4 26
8.2.2.2 4 22
8.5.2.2 4 22
8.5.3.2 4 22
9.2.2.2 4 16
9.5.2.2 4 16
9.5.3.2 4 16

We see that there are solutions of length 3 and 4 that can reduce a
VARCHAR(10).
The divisor_sequence 4.3.2, for example, gives the divisors 4, 3, and
2 in that order.  To see only the shortest sequences

DELETE FROM DivisorSequences

EXEC InsertDivisorSequences 10

For a maximum-length VARCHAR, size 8000, the shortest sequences are of
length 6.  And there are over 200,000 to choose from!  By the way,
finding the
shortest sequences for length 8000 took under 30 seconds to execute on
a 1.7 GHz P4 with 512 MB so investigation is not prohibitive.

Regards,
jag
Paul - 17 Feb 2004 11:49 GMT
> Assume '<' and '>' do not appear in col_x.
>
[quoted text clipped - 10 lines]
> call overflows the maximum string length.  You might get errors, or
> truncation depending on your SQL.

That is very clever, yes. I've spent some time in the past thinking
about a similar problem and didn't get such an elegant solution. So
easy when you see how! You can get round the string overflow problem
though:

UPDATE Foobar
  SET col_x
    = REPLACE(
        REPLACE (
          REPLACE (
            REPLACE(col_x, SPACE(2), '<>'),
          '><', SPACE(0)),
        '<>', SPACE(1)),
      SPACE(2), SPACE(1))

still the problem of if you have the '<' or '>' chars in the string
already though.

Paul.
Paul - 17 Feb 2004 12:25 GMT
> Assume '<' and '>' do not appear in col_x.
>
[quoted text clipped - 10 lines]
> call overflows the maximum string length.  You might get errors, or
> truncation depending on your SQL.

OK, here's something to get round the problem of the string already
containing '>' or '<' characters:

UPDATE Foobar
 SET col_x
  = REPLACE (
      REPLACE (
        REPLACE (
          REPLACE (
            REPLACE (
              REPLACE (
                REPLACE (
                  REPLACE (col_x, '>', '\>\'),
                  '<', '\<\'),
                SPACE(2), '<>'),
              '><', SPACE(0)),
            '<>', SPACE(1)),
          SPACE(2), SPACE(1)),
        '\>\', '>'),
      '\<\', '<')

Basically you're "escaping" the '>' and '<' on the right and left by
protecting them with the '\' character. I'm fairly sure this works for
all possible combinations of '\', '<' and '>' characters in the
initial string.

Doing this does expand the string slightly though, so you do get back
to the possibility of overflowing the maximum string size. Maybe
there's a way of simplifying the combination of these two operations
somehow?

Paul.
Ernst-Udo Wallenborn - 04 Feb 2004 22:18 GMT
> This is a LOT faster than hanging in a loop and it is pure SQL, not
> proprietary, procedural code.  But now a matb problem for you: let
> col_x be VARCHAR(n).  What is the optimal mix of replace function
> calls and what should they look like for the general case of (n)?

What is optimal? Smallest nesting depth? Or lowest number of string
operations?

> a) (2->1) repeated log2(n) times?
> b) Is it best to always have (k->1)?
> c) (floor(sqrt(n)) -> 1) as a starter?
> d) Fibonnaci series?

Let's consider n=6, so the string may be 64 chars long. Then the following
series of operations will all result in single spaced strings:

1) (2->1),(2->1),(2->1),(2->1),(2->1),(2->1)
2) (3->1),(3->1),(3->1),(3->1),(2->1)
3) (55->1),(34->1),(21->1)(13->1)(8->1),(5->1),(3->1),(2->1)
4) (64->1),(32->1),(16->1),(8->1),(4->1),(2->1)
5) (65->1),(33->1),(17->1),(9->1),(5->1),(3->1),(2->1)

As is easily seen, (2->1) will prune the string in at most log2(n) rounds. This
is not always the lowest number of rounds. (3->1) will arrive at a string with
at most two consecutive spaces in floor(log3(2^n))+1 rounds (thats 4 rounds in
this case, which reduce a string of 63 spaces to 21, then 7, then 3, then 1
space, and a string of 64 spaces to 22, then 8, then 4, then 2). An additional
(2->1) step then removes the remaining consecutive spaces.

Which poses the question: what is the number of string replacements needed by
the algorithms above to reduce a string of k spaces (1<=k<=2^n) to a single
space? This can easily be evaluated and it is clear that 1) is O(n) and 2)
O(n/2) but 3)-5) are nearly constant. In fact, the average number of string
replacement operations for reducing a string of length k with 1<=k<=64 to a
single space are:

1) 31.50
2) 16.00
3)  2.39
4)  3.09
5)  3.00

This seems to support the Fibonacci series theory. However, real life strings don't
consist of spaces only. The spaces in real life strings are not randomly
distributed either. But let's assume they were. Let's construct strings that
consist of characters that are ' ' with probability p and not ' ' with
probability 1-p. Then running through 10000 randomly created 64 character long
strings, the five algorithms above need the following number of string
replacement operations (mean with sdev in parens)

         p=0.1          p=0.25         p=0.5          p=0.75         p=0.9
1)        0.63 (0.84)    3.95 (2.29)   15.74 (4.42)   35.40 (5.36)   51.04 (4.30)
2)        0.57 (0.74)    3.17 (1.67)   10.54 (2.55)   20.38 (2.53)   27.07 (1.85)
3)        0.57 (0.74)    3.13 (1.63)    9.40 (2.17)   13.13 (2.01)   10.22 (2.47)
4)        0.57 (0.73)    3.03 (1.56)    9.01 (2.04)   13.60 (1.98)   11.60 (2.71)
5)        0.57 (0.74)    3.13 (1.64)    9.48 (2.19)   13.78 (2.04)   11.30 (2.63)

The more spaces there are in a string, the worse are 1) and 2), for an obvious
reason. It is very inefficient to prune a string of n spaces 2 or 3 spaces at
a time. The other three seem to be similar, with the Fibonacci series getting
a slight edge for large strings. Why? Well, the minimum number of string
replacement operations in a string with n substrings consisting of more than
one space is of course n, with the series:

6) (64->1), (63->1), (62->1), (61->1), ..., (2->1)

at the cost of having n-1 nested rounds, instead of O(log(n)). Fibonacci has
more rounds, and the Fibonacci numbers are denser than 2^k, so the probability
of a particular substring being pruned to ' ' in only very few steps is higher.
So it all comes down to the relative costs of nesting depth vs. string
operations.

Signature

Ernst-Udo Wallenborn

--CELKO-- - 16 Feb 2004 21:06 GMT
>> What is optimal? Smallest nesting depth? Or lowest number of string
operations? ... the Fibonacci series getting a slight edge for large
strings ... comes down to the relative costs of nesting depth vs.
string operations. <<

Very nice work! I am impressed!   I would assume that string
operations would be the expensive part since nesting can be unrolled
in the compiler.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.