5

I need to properly format some European addresses. One of the steps is to capitalize the first letter, but avoiding some specific words, like 'on', 'upon', 'von', 'van', 'di', 'in', 'sul'. Therefore, although my skills are scarce, I thought it was a good idea to use a RegEx-based function.

After some googling I found this here:

CREATE FUNCTION InitialCap
(
    @String nvarchar(max)
)
    RETURNS nvarchar(max)
AS
    BEGIN 
        DECLARE @Position INT;

        SELECT 
            @String   = STUFF(LOWER(@String),1,1,UPPER(LEFT(@String,1))) COLLATE Latin1_General_Bin,
            @Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);

        WHILE @Position > 0
            SELECT 
                @String   = STUFF(@String,@Position,2,UPPER(SUBSTRING(@String,@Position,2))) COLLATE Latin1_General_Bin,
                @Position = PATINDEX('%[^A-Za-z][a-z]%',@String COLLATE Latin1_General_Bin);

        RETURN @String;
    END

That seems to search for a sequence of a 'non-letter' + a lowercase 'letter'

[^A-Za-z][a-z]

OK, I think I've understood how it works, and I modify it to best match my needs.

I supposed it was better to search for a space or ' or - and a lowercase letter, therefore I changed it to

[\s'-][\w]

Then, after several attempt, I built this RegEx on regexr.com, that seems to catch the required sequence:

[\s](?!di\s|in\s|sul\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]

But when I put that into the above function, the result is not as expected.

What's wrong?

Andriy M
  • 20,973
  • 6
  • 52
  • 93
Joe
  • 349
  • 2
  • 11
  • 3
    Each time I see regex questions I remember this [legendary answer](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). SQL Server stores and manipulates data, but is not meant to format it, do that on your application level. – Yaroslav Dec 07 '17 at 10:09
  • Usually I do it, but since the stored procedure input is an XML document that is parsed and stored in several tables.. the T-SQL solution is preferable. – Joe Dec 07 '17 at 10:18
  • Unfortunately I do not agree. as far as I know, STRING_SPLIT is available from 2016 Version and STRING_AGG from 2017 version. The "UPON" problem is this: NEWCASTLE UPON TYNE -> Newcastle upon Tyne, while now returns Newcastle UPON Tyne, but I add `LOWER` to `... case when ca.words=part then LOWER(part) else ...` and it seems it is solved, while still have to solve the Uppercase after `'` issue: santacroce sull'arno -> Santacroce sull'Arno not Santacroce Sull'arno as returns now – Joe Dec 15 '17 at 08:30
  • Joe, the test cases for "santacroce sull'arno" and "NEWCASTLE UPON TYNE" should be added to the question along with the expected results. Also, just FYI: @KumarHarsh most likely didn't see your message since you didn't include his `@name` in it. When commenting, the OP of the question/answer being commented on is notified, and when there's only 1 other person commenting they will get notified, but as soon as a 2nd, non-OP person comments, then only the OP is notified unless the `@name` is used, and only one can be used in a comment. I wasn't notified of your comment to me on Kumar's answer. – Solomon Rutzky Dec 15 '17 at 16:26
  • @Joe,check my edited answer,hope i covered all your points. – KumarHarsh Dec 15 '17 at 16:58
  • @SolomonRutzky,this is not clear to me "stored procedure input is an XML document that is parsed and stored in several tables.."does this mean xml data type is pass from .net application to store proc ?from where I am getting XML doc and via which source it reach to SQL server ?Say, My XML input come from application to DB then I do all regex work and title case work including word such as "Upon" in application itself then pass it to proc,proc will use EXCLUDEWORD table to compare and reverse back to "upon" and store it. – KumarHarsh Dec 16 '17 at 10:24
  • We receive a SOAP message that contains an XML payload with some 250 fields, that we pass directly to the Stored Procedure. The fields to Capitalize are only 4-5: Name, Street(2 or 3) and CityName Yes, actually we could parse this fields, apply the regex and push back into the xml the correct value... – Joe Dec 17 '17 at 18:14

2 Answers2

10

SQL Server does not internally support Regular Expressions. LIKE and PATINDEX both support very limited wildcards, including a single-character range match [...] or exclude [^...] that are of similar syntax to RegEx, and are to a degree functionally similar, but are certainly not RegEx.

If you want / need Regular Expressions in SQL Server, you need to use SQLCLR. You can either code this yourself, or use pre-built functions such as the ones available in SQL# (which I wrote). Most of the RegEx functions are available in the Free version. I think you could use RegEx_Matches to return a result set of words that are not in your exclude list, and then combine that with the String_ToTitleCase4k function (also available in the Free version) to do the InitCap.

For example:

DECLARE @Input NVARCHAR(MAX) =
               N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce',
        @Expression NVARCHAR(4000) =
               N'[\s](?!di\s|in\s|sull\s|on\s|upon\s|von\s|uber\s|ueber\s)[\w]';

-- show matches for debugging
SELECT word.[StartPos],
       word.[EndPos],
       word.[Value] AS [Original],
       SQL#.String_ToTitleCase4k(word.[Value], N'') AS [TitleCased]
FROM   SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;


SELECT @Input = STUFF(@Input,
                      word.[StartPos],
                      ((word.[EndPos] - word.[StartPos]) + 1),
                      SQL#.String_ToTitleCase4k(word.[Value], N'')
                                                COLLATE Latin1_General_100_BIN2)
FROM SQL#.RegEx_Matches(@Input, @Expression, 1, N'ignorecase') word;


SELECT @Input AS [Fixed];

Returns:

StartPos    EndPos    Original    TitleCased
--------    ------    --------    ----------
11          12        s           S
21          22        o           O
32          33        s           S
37          38        s           S
46          47        g           G


Fixed
-------------------------
santacroce Sull'arno O'sullivan SuLL SUlLiVan Gsantacroce

The reason that it doesn't work entirely correctly is due to your regular expression being incorrect:

  1. It only matches a single letter.
  2. It won't exclude any of the fragments if they are at the end of the string, but that might be ok if that never occurs in actual use.
  3. It won't include the first word of the string (due to the required white-space to the left), but that might be ok if that never occurs in actual use.

UPDATE:
I was able to fix your regular expression by changing it to be as follows:

\b(?!(?:di|in|sull|on|upon|von|uber|ueber)\b)\w+

Main differences from the original:

  1. I am using \b (word boundary) instead of \s (white-space) as it handles beginning and ending of line/string. It also doesn't capture the white-space, which \s does, and if it isn't visually obvious, each of the matched strings above is prefixed with the space that matched. While that space wouldn't affect the replacement as it would still be a space, it did prevent the first word in the group from matching, unless the entire string was preceded by some white-space. In the case of using this with addresses, if they always start with a number then perhaps there will always be that preceding space, but best not to include it in the match.
  2. I added the + (one or more) quantifier to the \w so that it will pick up more than just the first character
  3. I simplified the exclusion list by moving the common \b at the end of each fragment to just outside of a new, inner non-capturing group. This is a non-functional difference. It just makes it easier to read and deal with.

New output:

StartPos    EndPos    Original      TitleCased
--------    ------    --------      ----------
1           10        santacroce    Santacroce
17          20        arno          Arno
22          22        o             O
24          31        sullivan      Sullivan
38          45        sUlLiVan      Sullivan
47          57        gsantacroce   Gsantacroce

Fixed
-------------------------
Santacroce sull'Arno O'Sullivan suLL Sullivan Gsantacroce

UPDATE 2:

If the desire is to have a list of exclude words that is updateable without needing to update the function containing the regular expression, that is fairly easy to accomplish by doing the following:

  1. Create a table to hold the exclusion words:

    CREATE TABLE #ExcludeWords ([Word] NVARCHAR(50) NOT NULL);
    
  2. Populate / manage-over-time the words in that table:

    INSERT INTO #ExcludeWords ([Word]) VALUES
       (N'di'), (N'in'), (N'sull'), (N'on'), (N'upon'), (N'von'), (N'uber'), (N'ueber');
    
  3. In whatever code you have that does this data-cleansing, dynamically build the regular expression from the table of words to exclude:

    DECLARE @Expression NVARCHAR(4000) = N'\b(?!(?:';
    
    SELECT @Expression += ex.[Word] + N'|'
    FROM   @ExcludeWords ex
    
    SET @Expression = LEFT(@Expression, LEN(@Expression) - 1) + N')\b)\w+';
    
    SELECT @Expression; -- JUST FOR DEBUG
    --\b(?!(?:di|in|sull|on|upon|von|uber|ueber)\b)\w+
    

UPDATE 3:

Originally no sample data was provided, so both answers tested with a simple list of words, some of which were the exclusion words. But now some test cases have been provided and in testing with one of those, I found a few issues with my implementation:

  1. I had forgotten to specify the "case-insensitive" option for the RegEx function
  2. I was replacing all occurrences of the entire match within the original string. This works fine when no matches are substrings of other strings within the entire string. But when including "O'Sullivan", the "O" is a substring of a few items and hence produced erroneous results.

So, I have adjusted the code and test case and results above to account for these issues. The main differences are:

  1. Added the RegEx option of N'ignorecase'
  2. Switched the REPLACE function with STUFF which allows me to use the starting and ending positions of each match to replace just that one item

PLEASE NOTE:

  1. There is a minor discrepancy between the exclusion list in the question and the sample data provided in a comment on the question: question uses sul while the comment uses sull. I have adjusted my answer to work with the sull (two "L"s) shown in the test case provided in the comment.
  2. This is reason number 5,235,948,567 for why it is critically important to have and/or provide actual test data ;-).
Solomon Rutzky
  • 65,256
  • 6
  • 135
  • 267
  • Sorry for late answer but have been out for a while. This seems the ultimate solution but how can avoid to import all the 6MB of SQL# Functions: surely can be useful in many other situations, but currently I need only this 2 (small) functions.. – Joe Dec 14 '17 at 12:52
  • 1
    @Joe you do not need to create everything. Towards the top of the SQL# script are several BIT variables that determine which Assemblies get installed. I think you only need the main [SQL#](https://SQLsharp.com/?ref=db_192684) Assembly. Yes, that still includes a lot of functions, and those cannot be removed from the Assembly, but you can remove all of the T-SQL wrapper objects via `DROP` statements. I would script out the 2 `CREATE FUNCTION` statements for what you want, then drop everything but the Assembly, then issue those 2 `CREATE` statements. Would that work? – Solomon Rutzky Dec 14 '17 at 18:22
  • 1
    Really thank you for your offer, but I think I can manage (or have to learn how to do it). yes I thought I could drop the function I would not use for now, My fear was that the 2 could call internal some other and so on, but some test will let me understand (a little) how your work... works. – Joe Dec 14 '17 at 21:05
  • 1
    @Joe Sorry if I misspoke. I wasn't offering anything ;-). I was asking if my suggestion of having you drop all related objects and re-create just those 2 would satisfy your concerns about having too many functions. As I said, you cannot remove the .NET methods within the Assembly, so the other objects could be re-created if someone knew the CREATE commands. But without the T-SQL wrapper objects being created, those other functions in the .NET Assembly cannot be accessed. What "internal call" are you concerned about? The Assembly can stay marked as SAFE and so never reach outside of the DB. – Solomon Rutzky Dec 14 '17 at 21:10
  • Any concern about safety. For "internal call" I meant that the 2 functions we require could call other SQL# functions, therefore some of the dropped could be necessary. – Joe Dec 15 '17 at 08:08
  • @Joe Ah, ok. No, that's not anything to worry about. What you're dropping are merely wrapper objects that expose the .NET functions to T-SQL so that they can be accessed. You can't remove any code inside an Assembly without recompiling it. And, if there's a functional dependency between Assemblies, it won't let you drop the dependency without first dropping the Assembly that is referencing the other (like an FK that doesn't have `ON DELETE CASCADE`). So, just set all `@InstallSQL#...` variables to 0 (starting at around line 133), install, and then drop the unwanted wrapper objects. – Solomon Rutzky Dec 15 '17 at 16:02
  • Thanks a lot.. will test this solution. to answer to your second point in Update 3: test data would be some 5 Million addresses, and issues raise in sequence.. – Joe Dec 16 '17 at 15:19
1

IMHO, I think above UDF cannot be implemented with your requirement.

My approach will be to,

  1. Create Split String UDF
  2. Create Exclude Word dictionary table.

It will work like this. I have one split string UDF. You can create or download your own split string function.

        declare @Excludeword table(word varchar(50))--this should be permanenet table
insert into @Excludeword VALUES ('on'), ('upon'),('von'),('van'),('di'),('in'),('sul')

    DECLARE @String nvarchar(max)='santacroce sull''arno'--'caT  UPON the wet floor'

SELECT stuff((
            SELECT ' ' + CASE 
                    WHEN ca.word = item
                        THEN ca.word
                    WHEN charindex('''', item) > 0
                        THEN STUFF(LOWER(item), charindex('''', item) + 1, 1, UPPER(substring(item, charindex('''', item) + 1, 1))) COLLATE Latin1_General_Bin
                    ELSE STUFF(LOWER(item), 1, 1, UPPER(LEFT(item, 1))) COLLATE Latin1_General_Bin
                    END
            FROM dbo.DelimitedSplit8K(@String, ' ')
            OUTER APPLY (
                SELECT word
                FROM @Excludeword E
                WHERE e.word = item
                ) ca
            WHERE item <> ''
            FOR XML PATH('')
            ), 1, 1, '') Item

Output, Cat on The Wet Floor or Santacroce sull'Arno or Cat upon The Wet Floor

You should mention all your business rules in your question itself with suitable examples.

Also how are you are planning to use it? Like will you pass a variable or you will process whole table.

Note :

This is just rough idea.

If this is what you are looking for then Split String can be customized suiting your requirement and a few things can be encapsulated in UDf itself.

Download [dbo].[DelimitedSplit8K]

Michael Green
  • 22,481
  • 12
  • 46
  • 87
KumarHarsh
  • 1,554
  • 10
  • 10
  • Hi! Sorry for late answer but have been out for a while... hmm.. I'm using a different function to split the string, but it seems me something is missing since the `item` is defined at the end while is required in the inner select(s)... Or I've not understood your suggestion. Thanks – Joe Dec 14 '17 at 12:56
  • my split UDF return column name Item.my script above is working very fine as it is.may be put extra space " dbo.DelimitedSplit2K(@String,' ')".Also what i want to say is very clear,you can make it running. – KumarHarsh Dec 14 '17 at 14:24
  • executing the above TSQL on SSMS I got the following response: `Msg 207, Level 16, State 1, Line 10 Invalid column name 'item'. Msg 207, Level 16, State 1, Line 11 Invalid column name 'item'. Msg 207, Level 16, State 1, Line 7 Invalid column name 'item'. Msg 207, Level 16, State 1, Line 7 Invalid column name 'item'. Msg 207, Level 16, State 1, Line 8 Invalid column name 'item'. Msg 207, Level 16, State 1, Line 8 Invalid column name 'item'` I'm quite sure it is not a problem of the splitting function, but can you supply it so to fully test? thanks! – Joe Dec 14 '17 at 14:46
  • you can use your own split string udf and in place of item put the column name of your udf that it return.or you can download split string "[dbo].[DelimitedSplit8K] " from here http://www.sqlservercentral.com/articles/Tally+Table/72993/ – KumarHarsh Dec 14 '17 at 14:58
  • yes I found the udf meanwhile, and with that it works. Thanks! – Joe Dec 14 '17 at 15:01
  • yup i go with my solution ,instead of relying on REGEX,exclude word may keep changing,so you hv to keep changing regex every time. – KumarHarsh Dec 14 '17 at 15:04
  • REGEX is more elegant, but has to be kept update anyway. I think Solomon Rutzky's solution is the ultimate one, but I prefer not to create so many function to use only 2 at present. – Joe Dec 14 '17 at 15:34
  • 2
    I have removed many comments on this post. If you think you need access to those comments which were deleted, kindly send a message to the moderators. Or reply to me here, which will also get deleted once I respond to you. – jcolebrand Dec 15 '17 at 16:22
  • @Joe and Kumar: now that actual test data has been provided, please test with the following string: `N'santacroce sull''arno o''sullivan suLL sUlLiVan gsantacroce'`. I have updated my answer to account for some issues my solution had with this string. Please see **UPDATE 3** for that info. – Solomon Rutzky Dec 15 '17 at 18:38