FileMaking
Text Parsing With FileMaker
Pardon my extended absence from this column. Life just has a way of getting away from one at times, and the work load piles up. But hopefully I’m back to a normal schedule again.
As mentioned in the last column I wrote regarding the direction of this tutorial series, we’ll now cover some basic text parsing functions. For beginning FileMakers, parsing text is one of the most useful skills to have as well as one of the first that questions are asked about.
What Is Text Parsing
Text parsing is the process of applying a calculation to a string to pull out some other string within it that is useful or to determine if some condition is true regarding the string.
This comes up many times in FileMaker, or for that matter, any database application. I’m currently working on a project where data is received from an outside vendor as a text file, where each line is a record and each field in the record is a particular number of characters in the line (such as, characters 13 through 20 store the creation date of the record, 21 through 30 the product ID, etc.). This format is called a fixed format file because each field is stored in a fixed number of characters. Text parsing calculations allow me to extract the data into separate fields in my database.
Name Extraction
For example, proper database design dictates that information about people should have their different names (first, middle, last, etc.) separated into different fields: FirstName, MiddleName, LastName. However, perhaps you need to import data from another system where all three of these pieces of information were placed in a single field and you want to break them out to their individual components.
FileMaker includes a wide range of text functions that can be used to help with such chores. You can view a list of these text functions by creating a calculation field and choosing “Text Functions” from the View menu.
However, a better way to become familiar with the available functions is to look through the online help documents that come with FileMaker. While in FileMaker, click Command-Shift-/ to bring up the help system, and then click on Contents in the top left of the window, followed by Function Reference Category List, and then Text Functions. Click the link for each of the available text functions to become familiar with them.
For text parsing, the most useful functions will be Left(), Right(), Middle(), Length(), PatternCount, and Position().
So let’s begin by establishing some rules on how one would extract individual names from a full name field. My own name is a good example of a simple full name, with a single word for each of the first, middle, and last names: “Charles Edward Ross.” It is made up of a single word for each name (a word in FileMaker is a string of characters separated by whitespace, such as a space, return, or tab character). We will begin by extracting just the first name. In this case, the first name can be extracted by getting all of the characters before the first space. So the first thing we need to figure out is where the first space occurs, which is done with the Position() function.
Position( FullName; " "; 1; 1 )
Assuming we have a text field called FullName, the above will return the number of the character that has the first occurrence of a space. In the case of my name, this will be 8.
Now that we know where the first space occurs, we want to get all of the characters before it. For this we use the Left() function, which returns characters in a string beginning from the left side for the number we specify. The number of characters we want is 1 less than the position of that first space, so our calculation becomes:
Left( FullName; Position( FullName; " "; 1; 1 ) - 1 )
If Charles Edward Ross is in FullName, the above will return Charles. If FullName contains Chuck Ed Ross, the above will return Chuck.
Now, FileMaker does have more powerful functions available to us that we could use. For instance, I could do the above with the following as well:
LeftWords( FullName; 1 )
However, this tutorial is not only about how to do things in FileMaker, but also how to think about what you’re trying to do. I want to show you how to think about the result, and using both Left() and Position() were more instructive toward this than using LeftWords().
Debugging Our Name Extraction
As we have our FirstName calculation, it will work so long as the first name is the first word. But what if it isn’t? What if the FullName field contains Mr. Charles Edward Ross? Our function will return Mr. instead of Charles. So let’s next account for the possible, but not necessary, presence of a title before the first name.
First, let’s consider which titles we want to account for. Off the top of my head, I can think of, Mr., Mrs., Ms., Miss, Dr., and Rev.. We want to check if the first word of FullName is any of these, and if so, get the second word. If it’s not present, get the first word.
We’ll begin by creating a variable to store the list of titles. Unfortunately, FileMaker doesn’t have a built-in list or array construct, but we can fake one by using a return separated string.
Let( { Titles = "Mr.¶Mrs.¶Ms.¶Miss¶Dr.¶Rev."; FirstWord = LeftWords( FullName; 1 ); SecondWord = MiddleWords( FullName; 2; 1 ) }; Case( Position( Titles; FirstWord; 1; 1 ) > 0; SecondWord; FirstWord ) )
We begin this calculation by setting a number of variables, including a list of the possible titles that a full name could begin with. For ease of readability, we also set the FirstWord and SecondWord variables.
Then we use a Case() function to determine which word, the first or the second, we should use. The Position() function will return a 0 if the search string (the first word) is not in our searched string (the list of titles). Take a close look at the parameters of the Position() function by checking the documentation. We are looking for the first word of the full name being in our list of titles.
We can simplify this calculation a bit. First of all, our test expression in the Case() function doesn’t have to check for the value being greater than 0. FileMaker interprets any numerical value that is not 0 as being True, so the first thing to simplify the function is to remove the inequality expression:
Let( { Titles = "Mr.¶Mrs.¶Ms.¶Miss¶Dr.¶Rev."; FirstWord = LeftWords( FullName; 1 ); SecondWord = MiddleWords( FullName; 2; 1 ) }; Case( Position( Titles; FirstWord; 1; 1 ); SecondWord; FirstWord ) )
The second thing we can do is realize that using Position as a test for one string simply containing another string is probably going to be a common need. That implies that we could create a custom function for this purpose, called perhaps Contains(). We’ll create this custom function to take two parameters, a text parameter and a searchString parameter (see the article on creating custom functions). Here’s the definition of our custom function:
PatternCount( text; searchString; 1; 1)
After defining this, we have the following for our current FirstName calculation:
Let( { Titles = "Mr.¶Mrs.¶Ms.¶Miss¶Dr.¶Rev."; FirstWord = LeftWords( FullName; 1 ); SecondWord = MiddleWords( FullName; 2; 1 ) }; Case( Contains( Titles; FirstWord ); SecondWord; FirstWord ) )
Parsing Out the Last Name
Now that we can extract the first name with some reliability, let’s work on the last name. We’ll begin with a calculation similar to that of the first name calculation, accounting for a possible suffix. In this case, we’ll assume that any existing suffix appears after a comma, as in Charles Edward Ross, Jr. or Charles Edward Ross, Ph.D.. For the time being, we’ll also assume that there can only be one suffix.
Let( { LastWord = RightWords( FullName; 1 ); SecondToLastWord = MiddleWords( FullName; WordCount( FullName ) - 1; 1) }; Case( Contains( FullName; "," ); LastWord; SecondToLastWord ) )
If FullName contains a comma, then the above will return the second to last word in FullName. If it does not have a comma, then the last word is returned.
Note the use of WordCount() to determine which word is the second-to-last word. As you would expect, WordCount() return the number of words in a string, so WordCount( "Charles Edward Ross, Jr." ) would return 4. Subtracting 1 from this gives us word 3, and using MiddleWords() to begin at word 3 and get 1 word returns "Ross".
Refining Last Name Extraction
Again, there’s an exception to the rules we’ve covered for last names: Not all last names are a single word. Some last names are two words, as in Mc Gowan (my wife’s maiden name) or von Neumann. Other last names might begin with de or O (without an apostrophe). Again, off the top of my head, I can think of Mc, Mac, O, De, and Von. I’m sure there are others, but that will suffice for now.
So here is our newly defined rule for last names. Assume that the full name has n words in it. If word n is not a listed suffix, and word n–1 is not a listed last name word, the last name is word n. If word n–1 is a listed last name word, the last name is words n–1 through n. If there is a suffix, shift the last name (and the tests for a separate last name word) to the left by one word.
Sounds complicated. Let’s see what it looks like. First let’s add a variable to hold the possible words that a last name can begin with:
Let( { LNamePrefixes = " Mc ¶ Mac ¶ O ¶ De ¶ Von "; LastWord = RightWords( FullName; 1 ); SecondLastWord = MiddleWords( FullName; WordCount( FullName ) - 1; 1); ThirdLastWord = MiddleWords( FullName; WordCount( FullName ) - 2; 1 ) }; Case( Contains( FullName; ","); Case( Contains( LNamePrefixes; ThirdLastWord ); ThirdLastWord & " " & SecondLastWord; SecondLastWord ); Case( Contains( LNamePrefixes; SecondLastWord ); SecondLastWord & " " & LastWord; LastWord ) ) )
We’ve surrounded our last name prefixes with spaces because we only want to find them if they are separate words. We don’t want to catch the word before the final last name if it’s O’Grady, only if it’s O Grady.
Note that this isn’t a perfect extraction of first name and last name. There are many other exceptions we could account for, a suffix without a comma, or three-word last names (de la Hoya). But this should give you plenty to begin with for your own text parsing routines.
Email Verification
Text parsing isn’t always pulling out a piece of a string. Sometimes you need to know if a string satisfies a condition. A classic example is if the string is a valid e-mail address. You can’t verify that it’s valid in the sense that e-mail sent to the address will arrive at an e-mail server someplace, but you can verify that it looks like a valid address.
A valid e-mail address looks something like xxx@xxx.xxx. So, it has some characters, followed by an @ symbol, followed by some more characters, a period, and two or three more (i.e., .com or .uk). There are some more restrictions, but these will suffice for our tutorial purposes.
First, an e-mail address is valid if it has the @ character in it only once, and if it’s not the first character.
( Position( EmailAddress, "@", 1, 1 ) > 1 ) and ( PatternCount( EmailAddress, "@" ) = 1 )
For an e-mail address to be valid, the third-to-last or the fourth-to-last character must be a period. (For now, we’ll ignore .name and other newer domains.)
( Position( EmailAddress; "@"; 1; 1 ) > 1 ) and ( PatternCount( EmailAddress; "@" ) = 1 ) and ( ( Middle( EmailAddress; Length( EmailAddress ) - 3 ) = "."; 1 ) or ( Middle( EmailAddress; Length( EmailAddress ) - 2 ) = "."; 1 ) )
The first line checks that the first @ symbol is not the first character. The second condition checks that there is only one @ symbol in the string. Parentheses enclose a compound third condition made up of two separate conditions, one checking that the fourth-to-last character is a period and the other checking that the third-to-last character is a period. If either of these is true, the third (compound) condition returns true. So an e-mail like chuck@chivalrysoftware.com will validate, but @chivalrysoftware.com, chuck@chivalry@software.com and check@chivalrysoftware.c will not validate.
Again, this isn’t a complete validation of e-mail addresses, but it should give you an idea of the thought process needed and the functions required to do a more complex validation.
• • •
I think that will do it for us this month. As an exercise in using the skills covered in this column, I would suggest writing the routine to extract the middle name from a full name, including the possibility that the middle name isn’t provided (i.e., return an empty string when FullName contains Charles Ross). Until next month, happy FileMaking!
Also in This Series
- Incremental Progress · November 2008
- Getting Relational · April 2008
- FileMaker 9 · March 2008
- FileMaker 8.5 · September 2006
- Script Parameters and Results · July 2006
- “Real” Programming with FileMaker · June 2006
- Text Parsing With FileMaker · May 2006
- Welcome to FileMaker 8! · November 2005
- Common Functions · October 2005
- Complete Archive
Reader Comments (16)
Let ( [ $var[1] = "Mr." ; $var[2] = "Mrs." ] ; ... )
However given that there are no built in methods of filling the repeating variable then using a value list as you did is simpler and more effective.
I searched the net and most of the hits were "Buy a subscription and I'll tell you how to do it". I would have paid someone to help, but those sites looked a bit fishy.
Then I ran into yours. Thank you, once again, for empowering me. You have no idea how thrilled I am that I could do this -- though to many it would seem so simple.
You rock!
You need to count the words with the built-in FileMaker function WordCount.
Assume that full names can have two, three or four names. For simplicity, I'm not considering prefixes (i.e., "Dr.") or suffixes (i.e., "Jr.").
The first name will always be the first word:
FirstName = LeftWords( FullName; 1)
The last name will always be the last word:
LastName = RightWords( FullName; 1 )
The MiddleNameA will be the second word if there are three or more words, and blank otherwise:
MiddleNameA = Case( WordCount( FullName ) >= 3; MiddleWords( FullName; 2; 1 ); "" )
Finally, MiddleNameB will be the third word if there are four words, and blank otherwise.
MiddleNameB = Case( WordCount( FullName ) = 4; MiddleWords( FullName; 3; 1 ); "" )
I hope this helps if you need to extend this to more possibilities.
Thanks,
Chuck
You are, of course, correct, and I don't know how I could have missed that while writing the article. Thank you very much for pointing out the variable arrays.
Chuck
Thanks so much for the kind words. You have no idea how much they are appreciated. It's a great satisfaction to know that the articles are proving useful to people.
Thanks,
Chuck
Contact me off-site at [email protected].
Thanks,
Chuck
I'm pleased with this info. However, I have a different type of problem. Maybe you can help me solve this using Filemaker.
What I want to be able to do is to have Filemaker check if a text-field contains a specific string. For example, I want to check whether the text-field contains the word "trousers" just somewhere in the text.
I was looking for a function that does something like: IF FIELD CONTAINS "trousers" THEN ... etc.
However, I do not seem to find such a simple CONTAIN type of function.
Could you help me please?
Kind regards,
Silvester Draaijer
You're looking for the PatternCount function.
PatternCount ( FieldName ; "trousers" ) will return the number of times 'trousers' appears in the field.
The drawback is that it only looks for the pattern as a string not a word. So if you were looking for the word 'File' you would get a result if the word 'FileMaker' was in the field. There are a number of techniques you could use if you need a word match. I'd consider adding a leading and trailing return, substituting all the spaces and punctation for returns, and then PatternCounting for the word bracketed by returns:
Let (
_fieldWordList = ¶ & Substitute ( FieldName ; [ " " ; ¶ ] ; [ "." ; ¶ ] ; [ "?" ; ¶ ] ; [ "!" ; ¶ ] ) & ¶ ;
PatternCount ( _fieldWordList ; ¶ & "trousers & ¶ )
)
Note PatternCount is NOT case sensitive. Substitute IS case sensitive.
Hope that helps a little. :)
Paul Turnbull
I will have a go at this PatternCount function. Very helpful indeed.
Kind regards,
Silvester Draaijer
For more complex issues such as that brought up by Paul, where you want to find a word that is by itself and not within another word, I use the SmartPill PHP plugin that offers the availability of regular expressions. However, using it does require a knowledge, of course, of both PHP and regular expressions.
Thanks,
Chuck
I'm wondering if there is a way to split a paragraph with several sentences into several different fields. Strange idea, perhaps, but here's why:
I am using FMP to download a weather forecast from the web... from there that information is exported into a markup language and displayed on digital signage. The problem is that sometimes the forecast is too long to fit on just one screen. Since this is all automated, I need a way to break up the information into more than one chunk through a script or calculation.
(Of course, I can do a calculation based on word count, but this breaks the information in the middle of a sentence.)
Thanks for any assistance.
Jon,
If the sentences don't contain abbreviations, this can be relatively easily done. Without abbreviations, you can assume that a period terminates the sentence. You'll then want to find the last period before the maximum length of your substring and get everything before that. The calculation would look something like this.
If your strings do have abbreviations, this will get a lot more complicated, but is still possible.
Thanks,
Chuck
I am trying to help various users around the world to define a path to a picture that will be inserted into their container. So a user inserts a picture (reference only) from a shared Dropbox folder, and now I want other users to be able to insert that same picture by changing the filepath to reflect their personal path to Dropbox. One user may have a filepath of imagewin://C:users/JohnSmith/Dropbox.... and another might be imagemac:/Y:/Dropbox....and many other variations on that theme.
I have a calculation field (Image Filepath Set) that uses: MiddleWords (Image Filepath Set; 5; 25 ) and that works for everyone who has the first filepath (the same number of folders), but not for anyone else. I just used 25 in the calculation because it was a big number, and I wanted to include all possibilities. Overkill, I know.
I just want something that finds "Dropbox" wherever it occurs, and then deletes everything in front of it. It could delete Dropbox, too, I could build that into the calculation.
Any thoughts?
Right ( $filepath ; Length ( $filepath ) - Position ( $filepath ; "DropBox" ; 0 ; 1 ) - Length ( "DropBox" ) + 1 )
So the first thing you want to do is find the occurrence of "Dropbox" in the filepath string.
$dpPos = Position( $filepath; "Dropbox"; 1; 1 )
You also need to know the length of the string.
$strLen = Length( $filepath )
Next, you want to get everything from the beginning of the occurrence of "Dropbox" to the end of the string.
$subStr = Right( $filepath; $strLen - $dpPos + 1)
Putting this all together in a single calculation:
Let(
[
$dpPos = Position( $filepath; "Dropbox"; 1; 1 );
$strLen = Length( $filepath );
$subStr = Right( $filepath; $strLen - $dpPos + 1)
];
$subStr
)
I find calculations like this (using Let variables), with intermediate variables easier to understand, than a single long equation. The version provided by Paul does work, but also removes "Dropbox" from the resulting string.
Testing this in FileMaker, I got "Dropbox/image.jpg" for "imagewin://C:users/JohnSmith/Dropbox/image.jpg" and "Dropbox/otherimage.jpg" for "imagemac:/Y:/Dropbox/otherimage.jpg".
Add A Comment