Database Forum / General DB Topics / DB Theory / February 2004
Squeezing spaces out of a string
|
|
Thread rating:  |
--CELKO-- - 01 Feb 2004 00:09 GMT This problem comes up on newsgroup about once a year. Given a VARCHAR(n) column with words in it, how do you squeeze out the extra spaces, that each word is separated by only one space?
You can nest function calls up to 32 levels deep in SQL Server and they have REPLACE (<target string>, <old string>, <new string>) function, so answer is simply:
UPDATE Foobar SET col_x = REPLACE ( REPLACE ( REPLACE ( .. REPLACE(col_x, ' ', ' ') .. ' ', ' ') ' ', ' ') ' ', ' ');
This is a LOT faster than hanging in a loop and it is pure SQL, not proprietary, procedural code. But now a matb problem for you: let col_x be VARCHAR(n). What is the optimal mix of replace function calls and what should they look like for the general case of (n)? There can be more than one word in the string, so you can have what typeseters call "rivers" in the string.
Prove it.
Let (j->k) mean "replaces (j) spaces with (k) spaces"
a) (2->1) repeated log2(n) times? b) Is it best to always have (k->1)? c) (floor(sqrt(n)) -> 1) as a starter? d) Fibonnaci series?
I am in a weird mood this afternoon, so I'll offer a book as a prize for a proof.
--CELKO-- =========================== Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are.
Lennart Jonsson - 04 Feb 2004 09:56 GMT > This problem comes up on newsgroup about once a year. Given a > VARCHAR(n) column with words in it, how do you squeeze out the extra [quoted text clipped - 34 lines] > I am in a weird mood this afternoon, so I'll offer a book as a prize > for a proof. Sorry, no proof. Just some reflections. If we think of the string as a sequence of spacecontainers:
S = {s_1, s_2, ..., s_n}
Since I dont think there is one optimal sequense of replacement operators for all strings, we will have to investigate the length of all s_i to find the optimal sequence. But this problem is somewhat similar to the original one. Therefore, I would go for alternative a)
Kind regards /Lennart
--CELKO-- - 04 Feb 2004 18:37 GMT Assume '<' and '>' do not appear in col_x.
UPDATE Foobar SET col_x = REPLACE ( REPLACE ( REPLACE(col_x, SPACE(1), '<> '), '><', SPACE(0)), '<>', SPACE(1));
This is due to a guy named Carnegie; Steve Kass found it in an old posting. The only problem is that it fails if the the first function call overflows the maximum string length. You might get errors, or truncation depending on your SQL.
John Gilson - 13 Feb 2004 07:57 GMT > Assume '<' and '>' do not appear in col_x. > [quoted text clipped - 10 lines] > call overflows the maximum string length. You might get errors, or > truncation depending on your SQL. Very clever. If one doesn't want to be that clever and also not worry about string overflow or whether a certain character is present or not in the string, then one would want to stay within the realm of space replacement. As an upper bound, by simply doing a 2-to-1 space replacement with each call to the REPLACE function, for a VARCHAR(N) we would need CEILING(LOG2(N))nested REPLACE calls. SQL Server's VARCHAR can have a maximum length of 8000 characters so 13 (2^13 = 8192) successive 2-to-1 space replacements will always be sufficient. But this is an upper bound. Replacing more than 2 spaces at each step will yield a better result. But how much better? And what number of spaces should be replaced at each step to guarantee a correct answer?
Here's a stored procedure, written in T-SQL, that will return all sequences of divisors that will reduce a VARCHAR of a given length. The procedure can be called to find all such sequences, regardless of length, for a VARCHAR of a given length or simply the shortest sequences.
CREATE TABLE DivisorSequences ( divisor_sequence VARCHAR(100) NOT NULL, number_of_divisors INT NOT NULL CHECK (number_of_divisors >= 1), first_divisor INT NOT NULL CHECK (first_divisor >= 2), max_dividend INT NOT NULL CHECK (max_dividend >= 2), PRIMARY KEY (number_of_divisors, divisor_sequence) )
CREATE VIEW Digits (d) AS SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
CREATE TABLE NonnegativeIntegers ( n INT NOT NULL PRIMARY KEY CHECK (n >= 0) )
INSERT INTO NonnegativeIntegers (n) SELECT Ones.d + 10 * Tens.d + 100 * Hundreds.d + 1000 * Thousands.d + 10000 * TenThousands.d FROM Digits AS Ones CROSS JOIN Digits AS Tens CROSS JOIN Digits AS Hundreds CROSS JOIN Digits AS Thousands CROSS JOIN Digits AS TenThousands
CREATE PROCEDURE InsertDivisorSequences @target_dividend INT, @shortest_only CHAR(1) = 'Y' -- 'N' for all sequences regardless of length AS IF @target_dividend >= 2 BEGIN -- IF DECLARE @level_number INT DECLARE @max_level_number INT SET @level_number = 1 SET @max_level_number = CAST(CEILING(LOG(@target_dividend)/LOG(2)) AS INT) INSERT INTO DivisorSequences (divisor_sequence, number_of_divisors, first_divisor, max_dividend) VALUES ('2', @level_number, 2, 2) WHILE @level_number < @max_level_number BEGIN -- WHILE IF @shortest_only = 'Y' BEGIN -- IF IF EXISTS (SELECT * FROM DivisorSequences WHERE number_of_divisors = @level_number AND max_dividend >= @target_dividend) BREAK ELSE INSERT INTO DivisorSequences (divisor_sequence, number_of_divisors, first_divisor, max_dividend) SELECT CAST(N.n AS VARCHAR) + '.' + divisor_sequence, D.number_of_divisors + 1, N.n, (D.max_dividend - N.n + 2) * N.n + N.n - 2 -- unsimplified version of above is -- ((((D.max_dividend + 1) - (N.n - 1)) * N.n) + (N.n - 1)) - 1 FROM DivisorSequences AS D INNER JOIN NonnegativeIntegers AS N ON D.number_of_divisors = @level_number AND N.n BETWEEN D.first_divisor AND max_dividend + 1 END -- IF ELSE BEGIN -- ELSE INSERT INTO DivisorSequences (divisor_sequence, number_of_divisors, first_divisor, max_dividend) SELECT CAST(N.n AS VARCHAR) + '.' + divisor_sequence, D.number_of_divisors + 1, N.n, (D.max_dividend - N.n + 2) * N.n + N.n - 2 FROM DivisorSequences AS D INNER JOIN NonnegativeIntegers AS N ON D.number_of_divisors = @level_number AND D.max_dividend < @target_dividend AND N.n BETWEEN D.first_divisor AND max_dividend + 1 END -- ELSE SET @level_number = @level_number + 1 END -- WHILE END -- IF
-- Find all sequences of divisors to reduce a VARCHAR(10) EXEC InsertDivisorSequences 10, 'N'
SELECT divisor_sequence, number_of_divisors, max_dividend FROM DivisorSequences WHERE max_dividend >= 10 ORDER BY number_of_divisors, divisor_sequence
divisor_sequence number_of_divisors max_dividend 3.2.2 3 10 3.3.2 3 10 4.2.2 3 10 4.3.2 3 10 2.2.2.2 4 16 3.2.2.2 4 22 4.2.2.2 4 26 5.2.2.2 4 28 5.5.2.2 4 28 5.5.3.2 4 28 6.2.2.2 4 28 6.5.2.2 4 28 6.5.3.2 4 28 7.2.2.2 4 26 7.5.2.2 4 26 7.5.3.2 4 26 8.2.2.2 4 22 8.5.2.2 4 22 8.5.3.2 4 22 9.2.2.2 4 16 9.5.2.2 4 16 9.5.3.2 4 16
We see that there are solutions of length 3 and 4 that can reduce a VARCHAR(10). The divisor_sequence 4.3.2, for example, gives the divisors 4, 3, and 2 in that order. To see only the shortest sequences
DELETE FROM DivisorSequences
EXEC InsertDivisorSequences 10
For a maximum-length VARCHAR, size 8000, the shortest sequences are of length 6. And there are over 200,000 to choose from! By the way, finding the shortest sequences for length 8000 took under 30 seconds to execute on a 1.7 GHz P4 with 512 MB so investigation is not prohibitive.
Regards, jag
Paul - 17 Feb 2004 11:49 GMT > Assume '<' and '>' do not appear in col_x. > [quoted text clipped - 10 lines] > call overflows the maximum string length. You might get errors, or > truncation depending on your SQL. That is very clever, yes. I've spent some time in the past thinking about a similar problem and didn't get such an elegant solution. So easy when you see how! You can get round the string overflow problem though:
UPDATE Foobar SET col_x = REPLACE( REPLACE ( REPLACE ( REPLACE(col_x, SPACE(2), '<>'), '><', SPACE(0)), '<>', SPACE(1)), SPACE(2), SPACE(1))
still the problem of if you have the '<' or '>' chars in the string already though.
Paul.
Paul - 17 Feb 2004 12:25 GMT > Assume '<' and '>' do not appear in col_x. > [quoted text clipped - 10 lines] > call overflows the maximum string length. You might get errors, or > truncation depending on your SQL. OK, here's something to get round the problem of the string already containing '>' or '<' characters:
UPDATE Foobar SET col_x = REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE ( REPLACE (col_x, '>', '\>\'), '<', '\<\'), SPACE(2), '<>'), '><', SPACE(0)), '<>', SPACE(1)), SPACE(2), SPACE(1)), '\>\', '>'), '\<\', '<')
Basically you're "escaping" the '>' and '<' on the right and left by protecting them with the '\' character. I'm fairly sure this works for all possible combinations of '\', '<' and '>' characters in the initial string.
Doing this does expand the string slightly though, so you do get back to the possibility of overflowing the maximum string size. Maybe there's a way of simplifying the combination of these two operations somehow?
Paul.
Ernst-Udo Wallenborn - 04 Feb 2004 22:18 GMT > This is a LOT faster than hanging in a loop and it is pure SQL, not > proprietary, procedural code. But now a matb problem for you: let > col_x be VARCHAR(n). What is the optimal mix of replace function > calls and what should they look like for the general case of (n)? What is optimal? Smallest nesting depth? Or lowest number of string operations?
> a) (2->1) repeated log2(n) times? > b) Is it best to always have (k->1)? > c) (floor(sqrt(n)) -> 1) as a starter? > d) Fibonnaci series? Let's consider n=6, so the string may be 64 chars long. Then the following series of operations will all result in single spaced strings:
1) (2->1),(2->1),(2->1),(2->1),(2->1),(2->1) 2) (3->1),(3->1),(3->1),(3->1),(2->1) 3) (55->1),(34->1),(21->1)(13->1)(8->1),(5->1),(3->1),(2->1) 4) (64->1),(32->1),(16->1),(8->1),(4->1),(2->1) 5) (65->1),(33->1),(17->1),(9->1),(5->1),(3->1),(2->1)
As is easily seen, (2->1) will prune the string in at most log2(n) rounds. This is not always the lowest number of rounds. (3->1) will arrive at a string with at most two consecutive spaces in floor(log3(2^n))+1 rounds (thats 4 rounds in this case, which reduce a string of 63 spaces to 21, then 7, then 3, then 1 space, and a string of 64 spaces to 22, then 8, then 4, then 2). An additional (2->1) step then removes the remaining consecutive spaces.
Which poses the question: what is the number of string replacements needed by the algorithms above to reduce a string of k spaces (1<=k<=2^n) to a single space? This can easily be evaluated and it is clear that 1) is O(n) and 2) O(n/2) but 3)-5) are nearly constant. In fact, the average number of string replacement operations for reducing a string of length k with 1<=k<=64 to a single space are:
1) 31.50 2) 16.00 3) 2.39 4) 3.09 5) 3.00
This seems to support the Fibonacci series theory. However, real life strings don't consist of spaces only. The spaces in real life strings are not randomly distributed either. But let's assume they were. Let's construct strings that consist of characters that are ' ' with probability p and not ' ' with probability 1-p. Then running through 10000 randomly created 64 character long strings, the five algorithms above need the following number of string replacement operations (mean with sdev in parens)
p=0.1 p=0.25 p=0.5 p=0.75 p=0.9 1) 0.63 (0.84) 3.95 (2.29) 15.74 (4.42) 35.40 (5.36) 51.04 (4.30) 2) 0.57 (0.74) 3.17 (1.67) 10.54 (2.55) 20.38 (2.53) 27.07 (1.85) 3) 0.57 (0.74) 3.13 (1.63) 9.40 (2.17) 13.13 (2.01) 10.22 (2.47) 4) 0.57 (0.73) 3.03 (1.56) 9.01 (2.04) 13.60 (1.98) 11.60 (2.71) 5) 0.57 (0.74) 3.13 (1.64) 9.48 (2.19) 13.78 (2.04) 11.30 (2.63)
The more spaces there are in a string, the worse are 1) and 2), for an obvious reason. It is very inefficient to prune a string of n spaces 2 or 3 spaces at a time. The other three seem to be similar, with the Fibonacci series getting a slight edge for large strings. Why? Well, the minimum number of string replacement operations in a string with n substrings consisting of more than one space is of course n, with the series:
6) (64->1), (63->1), (62->1), (61->1), ..., (2->1)
at the cost of having n-1 nested rounds, instead of O(log(n)). Fibonacci has more rounds, and the Fibonacci numbers are denser than 2^k, so the probability of a particular substring being pruned to ' ' in only very few steps is higher. So it all comes down to the relative costs of nesting depth vs. string operations.
 Signature Ernst-Udo Wallenborn
--CELKO-- - 16 Feb 2004 21:06 GMT >> What is optimal? Smallest nesting depth? Or lowest number of string operations? ... the Fibonacci series getting a slight edge for large strings ... comes down to the relative costs of nesting depth vs. string operations. <<
Very nice work! I am impressed! I would assume that string operations would be the expensive part since nesting can be unrolled in the compiler.
|
|
|