MBS Claris FileMaker Blog - MBS Plugin Advent calendar: 19

MBS Plugin Advent calendar: 19 - RegEx

Door 19 - RegEx

Fact of the day
Regular expressions originally come from mathematics. In 1951, the mathematician Stephen Kleene described "regular events", a notation similar to today's RegEx.

Do you want to search a text for mail addresses and extract only those addresses? Or do you want to replace all Internet addresses in a text with an address of your own? Regular expressions are a good solution for tasks like these. In this door I will show you what regular expressions are and how you can use them in FileMaker.

What are regular expressions?

With regular expressions you can search for certain patterns in a text or check whether a string meets certain criteria, e.g. whether a chosen password contains upper and lower case letters, at least one number and one special character, and is at least 8 characters long.

Whenever we search for something in a text, we are actually already searching for a regular expression. For example, if we enter the word "Miss", we search for a pattern in which the letters M-i-s-s follow each other directly. We find the word Miss but also the word Mississippi, and if we tell the program that we don't care about upper and lower case, we also find the word missed. So we are looking for a pattern where the 4 letters appear in exactly this order. This is already a regular expression.

But we can build on this. Take a mail address: the local part is separated from the domain by an @ sign, and the domain is followed by the domain tail, separated by a dot. Both abc@xyz.com and a2x4y@abc.de are valid mail addresses. We now want to develop a regular expression that finds both mail addresses in a text, so we have to formulate what the pattern looks like.

The local part can be composed of upper and lower case letters, numbers and the characters .!#$%&'*+-/=?^_`{|}~ and it has no fixed length. Let's see how to define something like this. If we specify a set from which a character can be taken, we write the characters in square brackets. In this example, every character that is an a, b or c would be found.

[abc]

If we now want the character we are looking for to be any lowercase letter, we do not have to list all 26 letters but can write [a-z] instead. The dash itself will then not be found because it only acts as an indicator for a range. If we also want capital letters to be found, we enter [A-Za-z]. If we want the digits 0-9 to be found as well, we write [A-Za-z0-9]. We can add any other characters to this set. A dash that should be matched literally has to stand at the end of the set (or be escaped), because otherwise +-/ would be read as the range from + to /. If we define the set for the local part of the mail address, it looks like this: [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-].

If we now use this as a regular expression, exactly one character from this set would be found each time. To indicate that these characters can occur several times in a row, we have different possibilities. First we can use the star, which indicates that a character can occur any number of times. With [a-z]* we could find a string of lowercase letters of any length. But the star is not what we would choose for the mail address, because the star also allows the string to have a length of 0. We want our local part to consist of at least one character. For this we have the +, which is placed behind the set. If the string should consist of at least 3 and at most 20 characters, we can specify this range in curly brackets: [a-z]{3,20}. For three to infinity, we would simply remove the 20: [a-z]{3,}. So our definition for the local part now looks like this: [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+. We add an @ as the separator between the local part and the domain part.

[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@
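Before we continue with the domain part, here is a minimal sketch for trying out the quantifiers described above. It uses the RegEx.Match call that is explained further down in this door, with the arguments in the same order as there (pattern; text; compile options; execute options). The variable names and sample texts are made up for illustration and are not part of the example database.

```
# Hypothetical test values; both option parameters are 0 (no options)
# [a-z]* may match zero characters, so it even fits an empty spot in "12345"
Set Variable [ $star ; Value: MBS( "RegEx.Match"; "[a-z]*"; "12345"; 0; 0 ) ]
# [a-z]+ needs at least one lowercase letter, so nothing in "12345" fits
Set Variable [ $plus ; Value: MBS( "RegEx.Match"; "[a-z]+"; "12345"; 0; 0 ) ]
# [a-z]{3,20} skips "ab" (too short) and fits "abcd"
Set Variable [ $range ; Value: MBS( "RegEx.Match"; "[a-z]{3,20}"; "ab abcd"; 0; 0 ) ]
```

Comparing $star and $plus shows why + is the safer choice for the local part.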

Since the @ character should occur exactly once, we don't need a quantifier here. The domain part again has its own definition of which characters are allowed and how many characters it contains. Let's assume that we limit the character set to upper and lower case letters, digits and the underscore. For this set of characters there is already a predefined class called \w. Pay attention to upper and lower case, because \W matches every character except the ones just mentioned. So it is the negation. Since there should be at least 1 character here, too, it looks like this:

[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@\w+

Now only the domain tail is missing. This can be .de, .com but also .info. That means we have a dot followed by 2 to 4 upper or lower case letters. Because an unescaped dot in a regular expression stands for any character, we have to escape it with a preceding backslash if we really mean the dot character. Then follows the domain tail. The regular expression looks like this:

[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@\w+\.[a-zA-Z]{2,4}

If we now leave the regular expression as it is and our text contains, for example, the misspelled mail address xyz@abc.oinfo, then the text xyz@abc.oinf would be found, because we have not specified that the match has to end at a word boundary. We can specify this with \b before and after the expression.

\b[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@\w+\.[a-zA-Z]{2,4}\b
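If you would like to check the finished pattern without setting up fields first, a small sketch could look like the following. It uses the RegEx.Match function that is introduced in the next section. The variable names and the sample texts are assumptions for illustration. The backslashes are doubled on the assumption that the pattern is typed as quoted text in a calculation, where FileMaker treats the backslash as an escape character; if you type the pattern into a field, single backslashes are used. The literal dash stands at the end of the character set so that it is not read as a range.

```
# Pattern as quoted text: \\b becomes \b, \\w becomes \w, \\. becomes \.
Set Variable [ $pattern ; Value: "\\b[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@\\w+\\.[a-zA-Z]{2,4}\\b" ]
# Contains a valid address, so the pattern should find abc@xyz.com
Set Variable [ $valid ; Value: MBS( "RegEx.Match"; $pattern; "Write to abc@xyz.com today"; 0; 0 ) ]
# Domain tail has 5 letters, so with \b at the end nothing should be found
Set Variable [ $typo ; Value: MBS( "RegEx.Match"; $pattern; "xyz@abc.oinfo"; 0; 0 ) ]
```

Removing the trailing \\b from $pattern and running the second test again shows the partial hit xyz@abc.oinf described above.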

How to Implement this in FileMaker

Now we want to implement such a search in FileMaker. So that you can also try your own patterns later, we create the following fields in FileMaker: a field in which you write the regular expression (Search), a field in which we specify the text to be searched (Text), a field that indicates whether there were any matches for the pattern in the text at all (FindMatches), a field that shows how many matches there are (MatchCount) and a field in which the matches are listed (Catches).

With the RegEx.Match function, we first test whether there is at least one passage in the text that matches our pattern. If this is the case, the function returns 1. As parameters, we first pass the pattern, followed by the text we want to search. Next come the compile options, which control how the pattern is compiled. The following options are available:

Compile option	Number	Description
Caseless	1	Do caseless matching
Multiline	2	^ and $ match newlines within data
Dot All	4	. matches anything including NL
Extended	8	Ignore white space and # comments
Anchored	16	Force pattern anchoring
Dollar End Only	32	$ not to match newline at end
Ungreedy	512	Invert greediness of quantifiers
No Auto Capture	4096	Disable numbered capturing parentheses (named ones available)
Auto Callout	16384	Compile automatic callouts
FirstLine	262144	Force matching to be before newline
Dup Names	524288	Allow duplicate names for subpatterns
Newline CR	1048576	Set CR as the newline sequence
Newline LF	2097152	Set LF as the newline sequence
Newline CRLF	3145728	Set CRLF as the newline sequence
Newline Any	4194304	Recognize any Unicode newline sequence
Newline Any CRLF	5242880	Recognize CR, LF, and CRLF as newline sequences
BSR Any CRLF	8388608	\R matches only CR, LF, or CRLF
BSR Unicode	16777216	\R matches all Unicode line endings
JavaScript Compatible	33554432	JavaScript compatibility
No start optimize	67108864	Disable match-time start optimizations

If you want to combine these options, you add the individual values together. In this example, we have selected the options 512 and 1, so we pass 512+1. Ungreedy (512) means that we want the smallest possible match to be returned, and Caseless (1) means that we don't care about upper and lower case in the evaluation.

The function then has another parameter and this contains the options that we set for the execution. Here, too, we can choose from various values and combine them.

Execute option	Number	Description
Anchored	16	Force pattern anchoring
Not BOL	128	Subject string is not the beginning of a line
Not EOL	256	Subject string is not the end of a line
Not Empty	1024	An empty string is not a valid match
Partial	32768	Allow partial results.
Newline CR	1048576	Set CR as the newline sequence
Newline LF	2097152	Set LF as the newline sequence
Newline CRLF	3145728	Set CRLF as the newline sequence
Newline Any	4194304	Recognize any Unicode newline sequence
Newline Any CRLF	5242880	Recognize CR, LF, and CRLF as newline sequences
BSR Any CRLF	8388608	\R matches only CR, LF, or CRLF
BSR Unicode	16777216	\R matches all Unicode line endings
No start optimize	67108864	Disable match-time start optimizations
Partial Hard	134217728	Return partial result if found before .
Not Empty At Start	268435456	An empty string at the start of the subject is not a valid match
UCP	536870912	Use Unicode properties for \d, \w, etc.

In our case, we do not want to enter a specific value here and therefore enter a 0.

Set Field [ DoorNineteen::FindMatches ; MBS( "RegEx.Match"; 
	DoorNineteen::Search; DoorNineteen::Text; 512+1; 0 ) ]
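To see the effect of a single compile option in isolation, you could compare two calls that differ only in the third parameter. This is just a sketch with made-up variable names and the "Miss" example from the beginning of this door; it is not part of the example database.

```
# Compile options 0: upper and lower case matter, "miss" does not occur in "Mississippi"
Set Variable [ $caseSensitive ; Value: MBS( "RegEx.Match"; "miss"; "Mississippi"; 0; 0 ) ]
# Compile options 1 (Caseless): "Miss" at the start of "Mississippi" now fits the pattern
Set Variable [ $caseless ; Value: MBS( "RegEx.Match"; "miss"; "Mississippi"; 1; 0 ) ]
```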

Now we also want to know what these hits look like. To do this, we first compile the pattern with the RegEx.Compile function. We pass our search pattern to the function, followed by the compile options that we have seen above. The function returns a reference to the compiled pattern, which we can then use in the RegEx.FindMatches function. This function returns a list of the results that were found for the pattern.

Set Variable [ $regex ; Value: MBS("RegEx.Compile"; DoorNineteen::Search; 512+1) ] 
Set Variable [ $Match ; Value: MBS("RegEx.FindMatches"; $regex; DoorNineteen::Text; 0; 1) ] 
Set Field [ DoorNineteen::Catches ; $Match ]

Using this list, we can also use the ValueCount function in FileMaker to determine how many hits we have in the text.

Set Variable [ $count ; Value: ValueCount ( $Match ) ] 
Set Field [ DoorNineteen::MatchCount ; $count ]
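If you want to process the hits one by one instead of only counting them, a loop over the list is one possible approach. The sketch below assumes that $Match contains one hit per line, as the use of ValueCount above suggests; the variable names $i and $address are made up for illustration.

```
Set Variable [ $i ; Value: 1 ]
Loop
	# Stop as soon as all hits have been visited
	Exit Loop If [ $i > $count ]
	# Pick the hit number $i from the list
	Set Variable [ $address ; Value: GetValue ( $Match ; $i ) ]
	# Process $address here, e.g. write it to a related record
	Set Variable [ $i ; Value: $i + 1 ]
End Loop
```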

Last but not least, we need to release the references that we have created. We use ReleaseAll for this.

Not only can we search for the entries in a text, we can also replace them. For example, if you want to delete the mail addresses from the text for data protection reasons or replace them with a placeholder, you can use the RegEx.Replace and RegEx.ReplaceAll functions. With the RegEx.Replace function, we can replace individual hits by specifying the index, whereas RegEx.ReplaceAll replaces all hits. In this example, we have a field in which we can enter the text that should replace all hits. We then write the resulting text to a separate field.

Set Field [ DoorNineteen::Result ; MBS( "RegEx.ReplaceAll"; 
	DoorNineteen::Text; DoorNineteen::Search; DoorNineteen::Replace ) ]
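For the data protection case mentioned above, the replacement does not have to come from a field. A minimal sketch with a fixed placeholder could look like this; the placeholder text and the variable name are assumptions for illustration, and the parameter order is the same as in the call above.

```
# Replace every hit of the pattern in Search with a fixed placeholder text
Set Variable [ $anonymized ; Value: MBS( "RegEx.ReplaceAll"; 
	DoorNineteen::Text; DoorNineteen::Search; "[address removed]" ) ]
Set Field [ DoorNineteen::Result ; $anonymized ]
```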

I hope you enjoyed this door too and we'll see you again tomorrow.

