AppleScript, SQL Statements, “do shell script” and International Formats

17 Nov 2011 - 16:58

When creating queries for a database or issuing some “do shell script”, text formats of several data types in AppleScript are an issue. This is still haunting me. All these problems are caused by software designers. The type of designers who know, what is good for you. Converting a value to a string depending on system format settings may be good for a “display dialog“ command, but it is more than nasty for somebody who needs to enter date into an SQL database using AppleScript (or doing a lot of other “do shell script” actions).

I am talking mainly about creating commands here. Hence the topic is “AppleScript data to string” and not vice versa in most of this page. (Naturally, I am thinking about a reverse strategy to get data from a database in the proper format. But this requires some more thinking, particularly regarding the date types.)

(The most important 8 handlers here in a separate AppleScript file)

This page got lengthier than I wanted it to be. Preparing AppleScript data types for an SQL database has several aspects. I suggest: Fetch a cup of tea of coffee and read the entire stuff. Hopefully it has some ideas that did not occur to you. I start with an outline of problems I found, suggestions to solve them follow as AppleScript snippets.

Conceptual Differences: Data in AppleScript and in Databases

AppleScript and SQL databases have some different concepts about numbers and particularly about date and time.

Regarding numbers, AppleScript knows about integers and reals. The range for integers is from -229 to 229 - 1. If some code uses an integer beyond that range, AppleScript will use a real automatically.

Real numbers in AppleScript have 53 dual digits. They can represent any integer correctly in the range from -253 to 253. Beyond that range, not every integer has a representation, only some of them.

Real numbers that look very simple in decimal notation (e.g. 0.1) may be repeating in dual notation. (“Repeating” means an infinite number of digits after the dual / decimal separator. Take 1/3 in decimal notation: It is 0.3333333… ad infinitum.) Actually, the chance for a number to be repeating in dual notation is pretty high. Such numbers can never be represented exactly in a dual system with a limited number of digits. Even with 53 internal digits, there is no way to have an exact dual representation of decimal 0.1. To put it bluntly: 0.1 does not exist in AppleScript (or other scripting languages). Only a number close to it.

Databases know a lot of detailed settings for numbers. Their representation is not optimised for calculation but for retrieval without any modification. Typically, the precision can be set for every number column of a table. Often they are represented in the decimal system to avoid rounding errors the user would not expect. A MONEY type often uses integers. $150.98 is actually stored as 15098, that is in cent. Just the display adds the separator again. Using AppleScript, you are limited to the things, AppleScript can do.

For date and time, databases typically know timestamps, date and time. A time zone may be included or not. (A point in time without a known time zone is an interesting kind of devil. It comes pretty close to specifying a temperature as “80”, but skipping the detail, if this is Kelvin, °Celsius of °Fahrenheit. Still, SQL standards require a server to support it.) Time types may have a double meaning: They can be used to indicate a time of a day without specifying the date (“Usually, we have lunch at 2 pm”), or it can be used as a time span or duration (“the movie lasts 1 hour, 24 minutes and 6 seconds”). In the real world, dates and durations do not have a time zone. In every country where Christmas is celebrated, it is December 25th (possibly Julian, but that is another point), and the time zones of New Zealand or Hawaii do not come in. The length of the movie is the same everywhere as well. But points in time have a time zone or they are ill defined. Trivial example: The next lunar eclipse by the time of writing this is at its height at December 10, 2011 14:31:49 UTC. If you want to watch it in New Zealand, you need to convert the time information or you well be out at the wrong hour.

The AppleScript date object is a combination of date and time. Any date without a time or a time without a date is alien to AppleScript. The AppleScript concept of a time span is simply a number, interpreted as seconds. The difference between two dates in AS is the number of seconds between them. Time zones are an interesting issue with AppleScript. Compare the following examples:

set x1 to date ("5-5-2011")
set x2 to date ("12-12-2011")
set diff to (x2 - x1) / days

(a code that should run with American and European system settings, change the date strings, if needed) will return 221.0.

That is OK if you think in categories of calendar dates. But imagine a variation:

set x1 to date ("5-5-2011")
set x2 to date ("12-12-2011")
set diff to (x2 - x1) / hours

The result is 5304.0. On my system (Olson time zone: Europe/Berlin), x1 is in daylight saving time, x2 is not. The result (clearly intended as a duration) is plainly wrong. It should be one hour more. Think about the semantics of your data and apply corrections if necessary.

It should be noted that strings usually do not cause any problem. If possible, “do shell script” talks UTF8. The communication between a database client and the database can be configured to use UTF8, either by some special command or by a SQL “SET NAMES” statement. This works even if your database uses some more limited encoding internally, like latin 1 – provided you do not use characters outside that encoding. For any strategy using prepared statements (i.e. a kind of SQL statement template with values fed into that template), characters with a special SQL meaning are no problem either. Prepared statements should never open a hole for SQL injections. Just be sure to use AppleScript’s “quoted form of” for each and every argument.

(SQLite is an exception. It does not know much about encoding. It basically stores bytes. This is OK for storing and retrieving data, if the clients are using UTF8 consistently. But you will need special affords for searching data: Normalise them in special search columns e.g. by converting everything to lower case, removing diacritical marks and so on. Apple’s Address Book does the same.)

Some Notes on Time Formatting with Time Zones

A time stamp with a time zone will typically take some form like:

  1. 2011-10-29T14:56:01 PST
  2. 2011-10-29T14:56:01+03:00
  3. 2011-10-29T14:56:01B
  4. 2011-10-29T14:56:01 America/New_York

There are some conceptual differences:

Example 1 (PST) simply says that the time is to be interpreted as Pacific Standard Time.

Example 2 (+3:00) says that the time is to be interpreted with an offset of plus three hours from UTC. Note that in this notation, French standard time and British daylight saving time look the same.

Example 3 (B) stands for military time zone Bravo, i.e. UTC plus two hours.

Example 4 is the most interesting one: That is an Olson time zone. Such a time zone is defined by the fact that in its region, time follows the same rules through the entire year, and it did so since 1970. Olson zones know about the history of the rules. France, Netherlands and Germany all currently use the same rules for time. Still, the Olson zones Europe/Paris, Europe/Amsterdam and Europe/Berlin are different: They introduced or changed rules for daylight saving time in different years. Conversions based on Olson zones can take these changes into account from 1970 to the present.

Not all databases support all notations, and some may have minor variants (e.g. not having “T” between date and time, but a space). But the basic concepts are pretty similar.

Issues with String conversions of Numbers

For integer numbers, everything seems to be fine, when converting to string. 12345 as text will return "12345", whatever your locale settings. Actually the reverse may not be true for negative integers: A simple line like (-1234 as string) as integer will cause an error when run under Arabic settings. -1234 as string returns "-1234", which is OK for a database. But AppleScript can not convert that string to a number.

Real problems start when your numbers are reals (no pun intended). This happens in two cases: a) You use a decimal separator or b) your integer is beyond the AppleScript maximum integer value.

Example:

set x to 2 ^ 29
class of x

returns real although no decimal places are involved. x is simply beyond the possible integer values.

AppleScript international behaviour of real to text conversion and vice versa is a mess – for “do shell script”ers at least. The more you need locale independent behaviour, the messier it gets.

Look at the following example:

set x to 2 ^ 29
x as text

In case, the format settings on your system are Hindi, this will return

"५.३६८७०९१२E+८"

Gujarati format yields "૫.૩૬૮૭૦૯૧૨E+૮", Arab (Jordan) yields "٥٫٣٦٨٧٠٩١٢اس+٨", Punjabi yields "੫.੩੬੮੭੦੯੧੨E+੮" and so on.

None of that is exactly what your database or any shell program will like!

Europeans use the word “Arabic Numerals”. According to the texts I read, Arabs traditionally call them “Indian Numerals”. And that is quite correct. The idea of ten digits representing numbers in a positional notation came from one place in India and spread across the world. Hence many of different sets of characters for those ten digits evolved. Quite a lot of them are found in India (Hindi, Punjabi, Gujarati…) for obvious reasons.

When OS X looks for a way to transform a real to a string, it seems to look for the following things (“it seems” because precise documentation is not really one of Apple’s strengths, this is what I found by experiments):

  • Digits: The glyphs used for the digits are taken from the current locale settings of the system
  • Decimal Separator: The decimal separator is taken from the current locale settings. I heard about three of them: comma, dot and Arabic.
  • Exponent mark: In scientific / technical notation, numbers are composed of a simple real number and an exponent. In English, “1.234E-3” means “0.001234” (that is: 1.234 * 10-3). The notation for the “E” is taken from the locale settings as well. It may be more than one character.
  • Sign: E.g. Persian has a different minus sign: − rather than -. The position (preceding or following the digits) varies as well.

I hope, one thing is clear: If x is a number, “x as text” will not always return something, an SQL statement can use.

Issues with String Conversions of Dates

Conversion between dates and strings is entirely dependent on the system settings. That is not only true for the order of year, day and month, but the calendar used for conversion may not be Gregorian at all. Anyhow, the situation is not as bad as with reals: Date objects have a lot of properties that behave reliable: year, month, day and (as of system 10.6) hours, minutes and seconds. Hence, preparing any standard format is quite possible.

A quite General Solution for Numbers and Dates with Time

The following handler is slow compared to suggestions that follow, but it has a particularly interesting behaviour. It lets System Events create the code for a property list and looks for the converted value in it.

on value2string(input_val)
	tell application "System Events"
		set the p_list to make new property list item ¬
			with properties {value:input_val}
		set s to text of p_list
	end tell
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ¬
		{"<real>", "</real>", "<integer>", "</integer>", "<date>", "</date>"}
	set s to text item 2 of s
	set AppleScript's text item delimiters to old_delims
	return s
end value2string

When preparing a plist, System Events needs to represent data in a standard form. This handler first creates the text of a plist, and afterwards it extracts the text between the integer, real or date tags.

There are some special things about the behaviour of this handler:

Floating point numbers are converted from dual representation to decimal representation without any rounding. Since simple decimal numbers are repeating in dual representation in many cases, a lot of decimals are possible:

value2string(3.6) returns "3.6000000000000001", value2string(4.6) returns "4.5999999999999996".

Both are simply the exact decimal representation of the internal dual number. Depending on your database field, that may be good or not. If you are storing e.g. currency values, the rounded expression (see handlers num2str below) would be appropriate.

value2string(10 ^ 16) returns "10000000000000000", while value2string(10 ^ 17) returns "1e+17". This makes sense because in AppleScript reals can represent all integers exactly up to 253. That is between 1015 and 1016. Hence, in this conversion you get the full range of integers.

value2string(current date) returns something like "2011-10-29T14:56:01Z".

Note the “Z” at the end. This marks the “Zulu” time zone, military lingo for Universal Time Coordinated (or “Greenwich Mean Time”, as it was called by grandma or grandpa). The time preceding it is actually converted to UTC. I run the script with German daylight saving time at 16:56:01.

(I tried with different years: Apple seems to convert to UTC using the Olson time zone setting of your system. These time zones now about changes of the rules for years from 1970 to present. Hence when I run the code for a date in 1970, the correction will be 1 hour only even in summer. The Europe/Berlin timezone did not have daylight saving time in those years.)

Postgres will accept this version. The MySQL manual states that the format should be something like "2011-10-29 14:56:01", and this is understood as UTC. A minor modification of the handler will yield this:

on value2string(input_val) -- MySQL version
	set tl to {input_val}
	tell application "System Events"
		set the parent_dictionary to make new property list item with properties ¬
			{kind:list, value:tl}
		set s to text of parent_dictionary
	end tell
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ¬
		{"<real>", "</real>", "<integer>", "</integer>", "<date>", "</date>"}
	set s to text item 2 of s
	set AppleScript's text item delimiters to "T"
	set temp to every text item of s
	set AppleScript's text item delimiters to " "
	set s to temp as text
	set AppleScript's text item delimiters to "Z"
	set s to text item 1 of s
	set AppleScript's text item delimiters to old_delims
	return s
end value2string

Since numbers neither have a Z or a T, those conversions are not affected. It should be possible to create other idioms without much additional work.

AppleScript date objects always have a time. This may cause problems when a date is intended as a “pure” date without time:

On my machine, value2string(date ("5-5-2012")) returns "2012-05-04T22:00:00Z"! Midnight has been converted to UTC!

Since calling System Events is time consuming, a handler that takes an entire list of values will be more efficient:

on values2strings(list_arg)
	-- takes a list of integers, reals, strings and / or dates.
	-- returns a list of string representation.
	-- reals have exact decimal representation for internal value. That is:
	-- 0.1 will become "0.10000000000000001"
	-- dates are ISO compatible, converted to UTC
	-- Other data will be ignored.
	if class of list_arg is not list then
		set list_arg to {list_arg}
	end if
	tell application "System Events"
		set plist to (make new property list item with properties {value:list_arg})
		set strg to text of plist
	end tell
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {" ", tab, return, character id 10}
	set temp_list to every text item of strg
	set AppleScript's text item delimiters to ""
	set strg to temp_list as text
	set AppleScript's text item delimiters to {"<integer>", "</integer>", "<real>", "</real>", "<date>", "</date>"}
	set temp_list to every text item of strg
	set r_list to text items 2 thru -2 of strg
	set AppleScript's text item delimiters to old_delims
	set r to {}
	repeat with C from 2 to count temp_list by 2
		copy item C of temp_list to end of r
	end repeat
	return r
end values2strings

Other Solutions for Numbers

Converting Integers

If a real is used as an integer (values two large for AppleScript integers but in the range that is OK for reals), the following handler yields a text representation. It is much faster than using System Events. It avoids exponential notation and should work with all system settings. It yields a sequence of digits, optionally preceded by a minus sign:

on num2intString(n)
	set sign to ""
	if n < 0 then
		set sign to "-"
		set n to n * -1
	end if
	-- round, just in case
	set n to (n + 0.5) div 1
	set l to {}
	-- extract digits one by one
	repeat
		copy ((n mod 10) as integer) as text to beginning of l
		set n to n div 10
		if (n = 0) then
			exit repeat
		end if
	end repeat
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ""
	set r to sign & l as text
	set AppleScript's text item delimiters to old_delims
	return r
end num2intString

A European Solution

Both the “European” and the “Indian” version are hacks. But I never was opposed to pragmatic solutions, if they are sufficient for the given task. But “beautiful” code is something different.

For all European (i.e. Latin script) cases, the only difference I found is in the decimal separator. Some countries use the dot, others the comma.

The following handler helps:

on num2str(the_real) -- European version
	set the_real to the_real as real -- cause an error if not possible
	set r to the_real as text
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ","
	set temp to every text item of r
	set AppleScript's text item delimiters to "."
	set r to temp as text
	set AppleScript's text item delimiters to old_delims
	return r
end num2str

This simply replaces a comma by a dot, which will work in most number formats in European scripts. (If there is a dot already, it will do nothing.)

Main difference to the plist solution: Some reasonable rounding takes place. Hence num2str(4.6) will return "4.6".

An Indian Version (and some others)

Different parts of India use different glyphs for digits, but the concept for reals is based on English concepts. Hence, just replacing the digits will work in many cases. I use the fact that AppleScript can convert a digit glyph from any script system into an integer:

on num2str(the_real) -- Indian version (and others)
	set the_real to the_real as real -- cause an error if not possible
	set s to the_real as text
	set r to ""
	set temp to every character of s
	repeat with c in temp
		try
			set c to (c as integer) as text
		end try
		set r to r & c
	end repeat
	return r
end num2str

This simply leaves the number to string conversion to AppleScript. Afterwards, it changes any character that represents a digit to the European one. Everything else is left unchanged. The main difference to the plist solution is rounding again.

Real to String if Number of Decimals is known

If the number of required decimal places is known (currency is an example), a mathematically clean strategy can be used. It should not be dependent on locale settings and is simply an extension of the integer solution “num2intString”:

on num2realStr(the_real, digits)
	set shift to 10 ^ digits
	set n to the_real * shift
	set sign to ""
	if n < 0 then
		set sign to "-"
		set n to n * -1
	end if
	-- round
	set n to (n + 0.5) div 1
	set l to {}
	set counter to digits
	repeat
		copy ((n mod 10) as integer) as text to beginning of l
		set n to n div 10
		if (n = 0) and (counter < 1) then
			exit repeat
		end if
		set counter to counter - 1
	end repeat
	set c to count l
	set l1 to items 1 thru (c - digits) of l
	set l2 to items (c - digits + 1) thru -1 of l
	set old_delims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ""
	set r to sign & (l1 as text) & "." & (l2 as text)
	--set r to l2 as text
	set AppleScript's text item delimiters to old_delims
	return r
end num2realStr

This handler takes the number to convert and the number of decimal places. num2str(123.4, 4) will return "123.4000".

(The handler looks complicated. But compared to the plist solution using System Events, it is some 100 times faster! That is in the same league as the simple “European version”.)

Other Solutions for Dates

Formatting dates as strings is not such a mess as formatting reals. String conversions take as much from the system formats as with reals. The main issues are the conceptual ones mentioned at the beginning. In current Mac systems, all necessary properties of a date can be retrieved as an integer to set up the desired format. They are numbers (or constants that can be used as numbers) and they are in the Gregorian calendar. None of the following techniques take the time zone into account. Everything is relative to your local time zone.

Every reasonable database will accept a date in ISO format: year-month-day, where year has at least 4 digits, month and day have two.

All of the versions that follow do nothing about time zones, as opposed the first handler on this page using property lists.

on date2iso(dt)
	set {y, m, d} to {year of dt, month of dt, day of dt}
	set y to y as text
	if (count y) < 4 then
		set y to text -4 thru -1 of ("000000" & y)
	end if
	set m to text -2 thru -1 of ((m + 100) as text)
	set d to text -2 thru -1 of ((d + 100) as text)
	return y & "-" & m & "-" & d
end date2iso

returns an ISO string for dt, something that every reasonable database should accept. (This is the version for people planning for long periods: It accepts years beyond 9999.)

on date2time(dt)
	set {h, m, s} to {hours of dt, minutes of dt, seconds of dt}
	set h to text -2 thru -1 of ((h + 100) as text)
	set m to text -2 thru -1 of ((m + 100) as text)
	set s to text -2 thru -1 of ((s + 100) as text)
	return h & ":" & m & ":" & s
end date2time

returns the time part of a date in ISO form.

on date_time2iso(dt)
	set {y, m, d, h, min, s} to {year of dt, month of dt, day of dt, hours of dt, minutes of dt, seconds of dt}
	set y to y as text
	if (count y) < 4 then
		set y to text -4 thru -1 of ("000000" & y)
	end if
	set m to text -2 thru -1 of ((m + 100) as text)
	set d to text -2 thru -1 of ((d + 100) as text)
	set h to text -2 thru -1 of ((h + 100) as text)
	set min to text -2 thru -1 of ((min + 100) as text)
	set s to text -2 thru -1 of ((s + 100) as text)
	return y & "-" & m & "-" & d & " " & h & ":" & min & ":" & s
	-- or with T for marking the time part:
	-- return y & "-" & m & "-" & d & "T" & h & ":" & min & ":" & s
end date_time2iso

combines both to an ISO time stamp format.

Usually, databases (e.g. PostgreSQL) allow to add a timezone according to the Olson standard. This looks like: "2011-10-21 23:21:12 Europe/Paris". (Cf. PostgreSQL manual.)

One possible handler getting the Olson timezone marker from Mac OS X is:

on get_tz()
	set cmd to "$x = @date_create(); print ($x->getTimezone()->GetName());"
	set cmd to quoted form of cmd
	set r to do shell script "php -r " & cmd
	return r
end get_tz

A time span is a duration or a difference between two points in time. For a database it typically takes the form hours:minutes:seconds, where hours may be beyond 24 and negative. In MySQL, possible values are in the range from -838:59:59 to 838:59:59. Check with your database.

on date_diff2timespan(start_date, end_date)
	set delta to end_date - start_date
	set sign to ""
	if delta < 0 then
		set sign to "-"
		set delta to delta * -1
	end if
	-- uncomment following lines if errors for
	-- MySQL should occur if beyond range
	(*
	if delta > 3020399 then -- limit in MySQL
		error "Can not convert time span for MySQL"
	end if
	*)
	set s to delta mod 60
	set s to text 2 thru 3 of ((s + 100) as text)
	set delta to delta div 60
	set m to delta mod 60
	set m to text 2 thru 3 of ((m + 100) as text)
	set delta to delta div 60
	set h to delta as text
	return sign & h & ":" & m & ":" & s
end date_diff2timespan

yields such a difference.