MBS Plugin Advent calendar: 19 - RegEx
Door 19 - RegEx
Fact of the day |
---|
Regular expressions originally come from mathematics. In 1951, the mathematician Stephen Kleene described "regular events", a notation similar to today's RegEx. |
Do you want to search a text for mail addresses and extract only those addresses? Or do you want to replace all Internet addresses in a text with one of your own? Regular expressions are a good solution for tasks like these. In this door, I will show you what regular expressions are and how you can use them in FileMaker.
What are regular expressions?
With regular expressions you can search for certain patterns in a text or check whether a string meets certain criteria, e.g. whether a chosen password contains upper and lower case letters, at least one number and one special character, and is at least 8 characters long. Even an ordinary text search is, strictly speaking, a search for a regular expression. For example, if we enter the word "Miss", we search the text for the letters M-i-s-s directly following each other. We find the word Miss but also the word Mississippi, and if we tell the program to ignore upper and lower case, we also find the word missed. So we are looking for a pattern in which the 4 letters appear in exactly this order. This is already a regular expression.
But we can take this much further. Take a mail address: the local part is separated from the domain by an @ sign, and the domain is followed by the domain tail (the top-level domain), separated by a dot. Both abc@xyz.com and a2x4y@abc.de are valid mail addresses. We now want to develop a regular expression that finds both mail addresses in a text, so we have to formulate what the pattern looks like. The local part can be composed of upper and lower case letters, numbers and the characters .!#$%&'*+-/=?^_`{|}~ and it has no fixed length. Let's see how to define something like this. If we want to specify a set from which a character can be taken, we write the characters in square brackets. In the following example, every character that is an a, b or c would be found.
[abc]
If we want the character we are looking for to be any lowercase letter, we do not have to list all 26 letters but can write [a-z] instead. The dash itself is then not matched, because here it only marks a range. If capital letters should be found as well, we write [A-Za-z], and to include the digits 0-9 we extend the set to [A-Za-z0-9]. We can add any further characters to the set. The set for the local part of the mail address then looks like this: [a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]. Keep in mind that a dash between two characters always marks a range: in this set, +-/ is read as "from + to /", which covers the intended +, -, . and / but also the comma. To match a literal dash and nothing more, place it at the end of the set or escape it as \-.
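To try these sets out, here is a minimal sketch in Python, whose built-in `re` module understands the same bracket syntax. Python only stands in for the plugin here for illustration, and the names `LOCAL_CHARS`, `LOCAL_CHARS_STRICT` and `in_set` are made up for this example. The sketch also checks how the dash between + and / in the local-part set behaves.

```python
import re

# [abc] matches exactly one character that is an a, b or c.
abc_hits = re.findall(r"[abc]", "cab Zebra 42")
print(abc_hits)  # ['c', 'a', 'b', 'b', 'a']

# In a range such as a-z the dash only marks the range,
# so the literal dash in the text is not matched.
range_hits = re.findall(r"[A-Za-z0-9]", "a-Z 7")
print(range_hits)  # ['a', 'Z', '7']

# The set for the local part, copied from the text.
LOCAL_CHARS = r"[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]"
# Hypothetical variant with the dash moved to the end, so it is a literal dash.
LOCAL_CHARS_STRICT = r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]"

def in_set(char_set, char):
    """Return True if the single character belongs to the bracketed set."""
    return re.fullmatch(char_set, char) is not None

print(in_set(LOCAL_CHARS, ","))         # True: +-/ is a range that includes the comma
print(in_set(LOCAL_CHARS_STRICT, ","))  # False
print(in_set(LOCAL_CHARS_STRICT, "-"))  # True
```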
If we use this set on its own as a regular expression, it always matches exactly one character from the set. To indicate that these characters can occur several times in a row, we have several options. First, there is the star, which indicates that a character can occur any number of times. With [a-z]* we could find a string of lowercase letters of any length. But the star is not what we would choose for the mail address, because the star also allows the string to have a length of 0. We want our local part to consist of at least one character. For this we have the +, which is placed after the set. If the string had to consist of at least 3 and at most 20 characters, we could give this range in curly brackets: [a-z]{3,20}. For three or more characters with no upper limit, we would simply remove the 20 and write [a-z]{3,}. So our definition for the local part now looks like this: [a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+. We add an @ as the separator between the local and the domain part.
[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@
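The difference between the star, the plus and the curly brackets is easy to check. The following is a small sketch in Python's `re` module, used here only as a convenient way to test the patterns. The helper `matches_fully` and the name `LOCAL_AT` are assumptions of this example.

```python
import re

def matches_fully(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# The star allows any number of repetitions, including none at all.
print(matches_fully(r"[a-z]*", ""))             # True
# The plus requires at least one character.
print(matches_fully(r"[a-z]+", ""))             # False
print(matches_fully(r"[a-z]+", "hello"))        # True
# {3,20} requires between 3 and 20 characters, {3,} has no upper limit.
print(matches_fully(r"[a-z]{3,20}", "ab"))      # False
print(matches_fully(r"[a-z]{3,20}", "a" * 21))  # False
print(matches_fully(r"[a-z]{3,}", "a" * 21))    # True

# Local part followed by the @ sign, as built up in the text.
LOCAL_AT = r"[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@"
hit = re.search(LOCAL_AT, "write to a2x4y@abc.de")
print(hit.group(0))  # a2x4y@
```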
Since the @ character should occur exactly once, we don't need a quantifier here. The domain part again needs its own definition of which characters are allowed and how many of them it contains. Let's assume that we limit the characters to upper and lower case letters, numbers and the underscore. For this set there is already a predefined shorthand called \w. Pay attention to upper and lower case here, because \W is the negation: it matches every character except the ones just mentioned. Since there should be at least 1 character here, too, it looks like this:
[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@\w+
Now only the domain tail is missing. This can be .de or .com, but also .info. That means we have a dot followed by 2 to 4 upper or lower case letters. Because an unescaped dot stands for any character in a regular expression, we have to escape it with a preceding backslash if we really mean the dot character. Then the domain tail follows. The regular expression looks like this:
[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@\w+\.[a-zA-Z]{2,4}
If we leave the regular expression as it is and our text contains, for example, the misspelled mail address xyz@abc.oinfo, then the text xyz@abc.oinf would be found, because we have not specified that the match has to end at a word boundary. We can specify this with \b before and after the expression.
\b[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@\w+\.[a-zA-Z]{2,4}\b
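To see the effect of the word boundaries, here is a short sketch that runs both versions of the pattern in Python's `re` module (again only as a stand-in for testing). The names `MAIL` and `MAIL_UNBOUNDED` and the sample sentences are assumptions of this example.

```python
import re

# Pattern without and with \b, copied from the text.
MAIL_UNBOUNDED = re.compile(r"[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@\w+\.[a-zA-Z]{2,4}")
MAIL = re.compile(r"\b[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@\w+\.[a-zA-Z]{2,4}\b")

text = "Write to abc@xyz.com or a2x4y@abc.de for details."
valid_hits = MAIL.findall(text)
print(valid_hits)  # ['abc@xyz.com', 'a2x4y@abc.de']

misspelled = "Contact xyz@abc.oinfo today"
# Without \b the pattern cuts the address off after four letters.
unbounded_hits = MAIL_UNBOUNDED.findall(misspelled)
print(unbounded_hits)  # ['xyz@abc.oinf']
# With \b the match would have to end at a word boundary, so nothing is found.
bounded_hits = MAIL.findall(misspelled)
print(bounded_hits)  # []
```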
How to Implement this in FileMaker
Now we want to implement such a search in FileMaker. So that you can also try out your own regular expressions later, we create the following fields in FileMaker: a field in which you write the regular expression (Search), a field for the text to be searched (Text), a field that indicates whether the pattern matched anywhere in the text at all (FindMatches), a field that holds the number of matches (MatchCount) and a field in which the matches are listed (Catches).
With the RegEx.Match function, we first test whether there is at least one passage in the text that matches our pattern. If this is the case, the function returns a 1. As parameters, we first pass the pattern, followed by the text we want to search. This is followed by the compile options, which control how the pattern is compiled. The following options are available:
Compile option | Number | Description |
---|---|---|
Caseless | 1 | Do caseless matching |
Multiline | 2 | ^ and $ match newlines within data |
Dot All | 4 | . matches anything including NL |
Extended | 8 | Ignore white space and # comments |
Anchored | 16 | Force pattern anchoring |
Dollar End Only | 32 | $ not to match newline at end |
Ungreedy | 512 | Invert greediness of quantifiers |
No Auto Capture | 4096 | Disable numbered capturing parentheses (named ones available) |
Auto Callout | 16384 | Compile automatic callouts |
FirstLine | 262144 | Force matching to be before newline |
Dup Names | 524288 | Allow duplicate names for subpatterns |
Newline CR | 1048576 | Set CR as the newline sequence |
Newline LF | 2097152 | Set LF as the newline sequence |
Newline CRLF | 3145728 | Set CRLF as the newline sequence |
Newline Any | 4194304 | Recognize any Unicode newline sequence |
Newline Any CRLF | 5242880 | Recognize CR, LF, and CRLF as newline sequences |
BSR Any CRLF | 8388608 | \R matches only CR, LF, or CRLF |
BSR Unicode | 16777216 | \R matches all Unicode line endings |
JavaScript Compatible | 33554432 | JavaScript compatibility |
No start optimize | 67108864 | Disable match-time start optimizations |
If you want to combine these options, you can add the individual values together. In this example, we have selected the options 512 and 1. The 512 makes our pattern evaluation Ungreedy, which means that quantifiers match as little as possible, and the 1 means that upper and lower case are ignored in the evaluation.
The function takes one more parameter, which contains the options for the execution. Here, too, we can choose from various values and combine them.
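The arithmetic behind combining options can be sketched in a few lines of Python. The dictionary below holds only a handful of the single-bit values from the compile options table, and the helpers `combine` and `decompose` are hypothetical names for this illustration, not plugin functions.

```python
# A few single-bit compile options copied from the table.
COMPILE_OPTIONS = {
    "Caseless": 1,
    "Multiline": 2,
    "Dot All": 4,
    "Extended": 8,
    "Ungreedy": 512,
}

def combine(*names):
    """Add the values of the named options; each option is counted once."""
    return sum(COMPILE_OPTIONS[name] for name in set(names))

def decompose(value):
    """Return the names of the listed options contained in a combined value."""
    return [name for name, bit in COMPILE_OPTIONS.items() if value & bit]

options = combine("Ungreedy", "Caseless")
print(options)             # 513, the same as 512+1
print(decompose(options))  # ['Caseless', 'Ungreedy']
```

Because each of these values occupies its own bit, adding them never causes two options to collide. The newline options in the table are an exception: Newline CRLF (3145728) is exactly the sum of Newline CR and Newline LF, so those values select one setting rather than stack up.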
Execute option | Number | Description |
---|---|---|
Anchored | 16 | Force pattern anchoring |
Not BOL | 128 | Subject string is not the beginning of a line |
Not EOL | 256 | Subject string is not the end of a line |
Not Empty | 1024 | An empty string is not a valid match |
Partial | 32768 | Allow partial results. |
Newline CR | 1048576 | Set CR as the newline sequence |
Newline LF | 2097152 | Set LF as the newline sequence |
Newline CRLF | 3145728 | Set CRLF as the newline sequence |
Newline Any | 4194304 | Recognize any Unicode newline sequence |
Newline Any CRLF | 5242880 | Recognize CR, LF, and CRLF as newline sequences |
BSR Any CRLF | 8388608 | \R matches only CR, LF, or CRLF |
BSR Unicode | 16777216 | \R matches all Unicode line endings |
No start optimize | 67108864 | Disable match-time start optimizations |
Partial Hard | 134217728 | Return partial result if found before . |
Not Empty At Start | 268435456 | An empty string at the start of the subject is not a valid match |
UCP | 536870912 | Use Unicode properties for \d, \w, etc. |
In our case, we do not want to enter a specific value here and therefore enter a 0.
Set Field [ DoorNineteen::FindMatches ; MBS( "RegEx.Match"; DoorNineteen::Search; DoorNineteen::Text; 512+1; 0 ) ]
Now we also want to know what these hits look like. To do this, we first compile the pattern with the RegEx.Compile function. We pass our search pattern to the function, along with the compile options that we have seen above. The function returns a reference to the compiled pattern, which we can continue to work with in the RegEx.FindMatches function. This function returns a list of the results that have been found for the pattern.
Set Variable [ $regex ; Value: MBS("RegEx.Compile"; DoorNineteen::Search; 512+1) ]
Set Variable [ $Match ; Value: MBS("RegEx.FindMatches"; $regex; DoorNineteen::Text; 0; 1) ]
Set Field [ DoorNineteen::Catches ; $Match ]
Using this list, we can also use the ValueCount function in FileMaker to determine how many hits we have in the text.
Set Variable [ $count ; Value: ValueCount ( $Match ) ]
Set Field [ DoorNineteen::MatchCount ; $count ]
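For readers who want to experiment with the same three results outside FileMaker, here is a rough Python analogue. It is only a sketch: `search_text` is a hypothetical helper, Python's `re` module stands in for the plugin, and because `re` has no Ungreedy flag only the caseless option is carried over.

```python
import re

def search_text(pattern, text, caseless=True):
    """Return (found, catches, count), loosely mirroring the three fields."""
    flags = re.IGNORECASE if caseless else 0
    regex = re.compile(pattern, flags)  # compile once, reuse below
    catches = [m.group(0) for m in regex.finditer(text)]
    found = 1 if catches else 0         # 1 if there is at least one match
    return found, catches, len(catches)

found, catches, count = search_text("miss", "Miss Mississippi missed")
print(found)    # 1
print(catches)  # ['Miss', 'Miss', 'miss']
print(count)    # 3
```

With `caseless=False`, the same call finds only the lowercase miss in missed.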
Last but not least, we need to release the references that we have created. We use ReleaseAll for this.
We can not only search for the entries in a text, we can also replace them. For example, if you want to delete the mail addresses from a text for data protection reasons or replace them with a placeholder, you can use the RegEx.Replace and RegEx.ReplaceAll functions. With the RegEx.Replace function, we can replace individual hits by specifying the index, whereas RegEx.ReplaceAll replaces all hits. In this example, we have a field in which we can enter the text that should replace all hits. We then write the resulting text to a separate field.
Set Field [ DoorNineteen::Result ; MBS( "RegEx.ReplaceAll"; DoorNineteen::Text; DoorNineteen::Search; DoorNineteen::Replace ) ]
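The two kinds of replacement can be illustrated with a small Python sketch. The helpers `replace_all` and `replace_at` are hypothetical and not the plugin's implementation. In particular, the zero-based index in `replace_at` is an assumption of this example, so check the plugin documentation for how RegEx.Replace counts its hits.

```python
import re

# Mail pattern from the first part of this door.
MAIL = r"\b[a-zA-Z0-9.!#$%&'*+-/=?^_`{|}~]+@\w+\.[a-zA-Z]{2,4}\b"

def replace_all(text, pattern, replacement):
    """Replace every hit; the lambda inserts the replacement text literally."""
    return re.sub(pattern, lambda m: replacement, text)

def replace_at(text, pattern, replacement, index):
    """Replace only the hit at the given zero-based index (an assumption here)."""
    hits = list(re.finditer(pattern, text))
    if not 0 <= index < len(hits):
        return text  # no such hit, leave the text unchanged
    hit = hits[index]
    return text[:hit.start()] + replacement + text[hit.end():]

text = "Write to abc@xyz.com or a2x4y@abc.de today."
print(replace_all(text, MAIL, "[removed]"))    # Write to [removed] or [removed] today.
print(replace_at(text, MAIL, "[removed]", 1))  # Write to abc@xyz.com or [removed] today.
```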
I hope you enjoyed this door too and we'll see you again tomorrow.