Basic astronomy databases with FSharp

<– Back to main page…

This is a short post about a small experiment migrating some Python code for working with astronomical databases to FSharp.  The basic task that I want to perform is to take an object identifier for an object from the Sloan Digital Sky Survey (SDSS) DR10 and look up what it is known to be. For example, say I am presented an image like the following example from the GalaxyZoo image set:

An SDSS image of a galaxy.

Given just the image and its object identifier (1237645941297709234), we might be curious to learn a few things about it:

Answering these questions requires a bit of work.  First, we need to query the SDSS database to retrieve the (ra, dec) coordinates of the object.  Once we have this, it is possible to then go to another database like SIMBAD to learn if it is a known astronomical object, and if so, what kind of object it is.

Both the SDSS and SIMBAD databases are accessible via HTTP queries, making programmatic access easy.  In this post I’m using their basic interfaces.  SDSS offers a number of access methods, so there is likely to be a cleaner one than I’m using here - I’m ignoring that for the moment.  SIMBAD on the other hand presents a relatively simple interface that seems to predate modern web services, so instead of well structured JSON or some other format, dealing with its response is an exercise in string parsing.

To start off, I defined a few types that are used to represent responses from SIMBAD.

type SimbadObject =
  | S_Galaxy 
  | S_PlanetaryNebula 
  | S_HerbigHaro 
  | S_Star 
  | S_RadioGalaxy
  | S_GalaxyInGroup 
  | S_GalaxyInCluster 
  | S_Unknown of string

type SimbadResponse  = 
  | SimbadValid of SimbadObject
  | SimbadError of string
  | SimbadEmpty

Interpreting the SIMBAD object type responses was a simple exercise of matching strings to the corresponding constructors from the SimbadObject discriminated union.

let interpret_simbad_objstring s =
  match s with
  | "PN" -> S_PlanetaryNebula
  | "HH" -> S_HerbigHaro
  | "Star" -> S_Star
  | "RadioG" -> S_RadioGalaxy
  | "Galaxy" -> S_Galaxy
  | "GinGroup" -> S_GalaxyInGroup
  | "GinCl" -> S_GalaxyInCluster
  | _ -> S_Unknown s

Before moving on, I needed a few helper functions. The first two exist to safely probe lists to extract either the first or second element. By “safely”, I mean that in the case that a first or second element doesn’t exist, a reasonable value is returned. Specifically, I make use of the usual option type (Maybe for the Haskell crowd).

let getfirst l =
  match l with
  | [] -> None
  | (x::xs) -> Some x

let getsecond l =
  match l with
  | [] -> None
  | (x::xs) -> getfirst xs

I could roll these into a single function, “get_nth”, but as I said, this was a quick exercise and these functions play a minor role in things so I didn’t care much about it. Another utility function that is required is one to take a single string that contains multiple lines and turn it into a list of lines, excluding all empty lines. This function should also be agnostic about line terminators: CR, LF, CR-LF all should work.

let multiline_string_to_lines (s:string) =
  s.Split([|'\r'; '\n'|])
  |> Array.filter (fun s -> s.Length > 0)
  |> Array.toList

With these helpers, we can finally write the code to query SIMBAD. This code assumes that the FSharp.Data package is available (this is accessible via Nuget in VS and under mono on Mac/Linux). Given a coordinate (ra,dec), we can define the following function:

let simple_simbad_query (ra:float) (dec:float) =
  let baseurl = "http://simbad.u-strasbg.fr/simbad/sim-script"
  let script = "format object \"%OTYPE(S)\"\nquery coo "+string(ra)+" "+string(dec)

  let rec find_data (lines:string list) = 
    match lines with
    | [] -> SimbadEmpty
    | (l::ls) -> if l.StartsWith("::data::") then
                    match getfirst ls with
                    | None -> SimbadEmpty
                    | Some s -> SimbadValid (interpret_simbad_objstring s)
                 elif l.StartsWith("::error::") then
                    match getsecond ls with
                    | None -> SimbadEmpty
                    | Some s -> SimbadError s
                 else
                    find_data ls

  Http.RequestString(baseurl,query=["script",script],httpMethod="GET")
  |> multiline_string_to_lines 
  |> find_data

The first two lines of the function body are related to the SIMBAD query - the base URL to aim the request at, and the script that will be sent to the server to execute and extract the information that we care about. The script is parameterized with the ra and dec coordinates that were passed in. Following those initial declarations, we have a recursive function that spins over a SIMBAD response looking for the information that we wanted. When all goes well, at some point in the SIMBAD response a line that looks like “::data::::::::” will appear, immediately followed by a line containing the information we were actually looking for. If something goes wrong, such as querying for a (ra,dec) that SIMBAD doesn’t know about, we will have to look for an error case that follows a line starting with “::error::::::”. In the error case, the information we are looking for is actually the second line following the error sentinel.

In the end, the find_data helper function will yield a value from a discriminated union representing SIMBAD responses:

type SimbadResponse = 
  | SimbadValid of SimbadObject
  | SimbadError of string
  | SimbadEmpty

This encodes valid responses, empty responses, and error responses in a nice type where the parameter represents the relevant information depending on the circumstance.

With all of this, the simple_simbad_query function body is formed from a simple pipeline in which an HTTP request is formed from the base URL and the query script. This is fed into the function to turn a multiline string into a string list, and then the recursive find_data call is invoked to scan for the data or error sentinels and act accordingly. Nothing terribly subtle here. What is nice though is that, in the end we get a well typed response that has been interpreted and brought into the FSharp type system as much as possible. For example, if an object was a galaxy, the result would be a value “SimbadValid S_Galaxy”.

A similar process is used to query the SDSS database to look up the (ra, dec) coordinates of the object given just its identifier.

let simple_sdss_query (objid: string) =
    let sdss_url = "http://skyserver.sdss3.org/public/en/tools/search/x_sql.aspx"
    let response = Http.RequestString(sdss_url,
                                      query=["format","json";
                                             "cmd","select * from photoobj where objid = "+objid],
                                      httpMethod="GET")
    let jr = JsonValue.Parse(response)

    let elt = jr.AsArray() |> Array.toList |> List.head

    let first_row = elt.["Rows"].AsArray() |> Array.toList |> List.head
    let ra, dec = (first_row.["ra"].AsFloat()) , (first_row.["dec"].AsFloat())
    ra, dec

As before, we form the request to the given URL. Fortunately, SDSS presents a reasonable output format - instead of a weird textual representation, we can ask for JSON and take advantage of the JSON parser available in FSharp.Data. Of course, I immediately abuse the well structured format by making a couple dangerous but, in this case, acceptable assumptions about what I get back. Specifically, I immediately turn the response into a list and extract the first element since that represents the table of results that were returned for my SQL query. I then extract the rows from that table, and again collapse them down to a list and take the first element since I only care about the first row. What is missing from this is error handling for the case when I asked for an object ID that doesn’t exist. I’m ignoring that for now.

Once we have the row corresponding to the object it becomes a simple task of extracting the “ra” and “dec” fields and turning them into floating point numbers. These then are returned as a pair.

Given this machinery, it then becomes pretty simple to ask both SDSS and SIMBAD about SDSS objects. Here is a simple test harness that asks about a few of them and prints the results out.

let main argv = 
    let objids = ["1237646585561219107"; "1237645941297053860"; "1237646586102349867"; "1237646588244918500"; "1237646647297376587"; "1237660558135787607"; "1237646586638827702"; "1237657608571125897";
      "1237656539664089169"]

    for objid in objids do
        let ra, dec = simple_sdss_query objid
        let objtype = simple_simbad_query ra dec
        let objstring = match objtype with
                        | SimbadValid s -> "OBJTYPE="+(sprintf "%A" s)
                        | SimbadError s -> "ERROR="+s
                        | SimbadEmpty   -> "Empty Simbad response"
        printfn "RA=%f DEC=%f %s" ra dec objstring

    0

The resulting output is:

RA=77.538556 DEC=-0.946330 OBJTYPE=S_Galaxy
RA=62.098838 DEC=-1.075872 OBJTYPE=S_Galaxy
RA=87.319430 DEC=-0.591992 ERROR=No astronomical object found :
RA=76.117486 DEC=1.164739 ERROR=No astronomical object found :
RA=71.760602 DEC=0.066689 OBJTYPE=S_GalaxyInCluster
RA=69.348799 DEC=25.042414 OBJTYPE=S_PlanetaryNebula
RA=86.456666 DEC=-0.086512 OBJTYPE=S_HerbigHaro
RA=122.733827 DEC=36.829334 OBJTYPE=S_RadioGalaxy
RA=345.357080 DEC=-8.465958 OBJTYPE=S_Galaxy

A few closing thoughts. My goal with doing this was to take advantage of the type system that FSharp provides to bring things encoded as strings or enumerated values into a form where the code can be statically checked at compile time. For example, we have three possible SIMBAD responses: valid, error, or empty. Using discriminated unions allows me to avoid things like untyped Nil values or empty strings, neither of which capture useful semantics in the data. I’ve also isolated the code that maps the ad-hoc string representations used in the database responses in specific functions, outside of which the string-based nature of the response is hidden such that the responses can be consumed in a type safe and semantically meaningful manner. An unanticipated response will immediately become apparent due to an error in the string interpretation functions, instead of potentially percolating out into a function that consumes the responses leading to hard to debug situations.

Of course, there are likely better ways to achieve this - either better FSharp idioms to clean up the code, or better interfaces to the web-based databases that would allow me to use proper WSDL type providers or LINQ database queries. I’m satisfied with this little demo though for a two hour exercise on a Saturday night.