Bullet Proof Data Science in Scala
In this post, we go over some typical aspects and challenges that occur in typical data science projects in order to extract some requirements for data analysis in the broad sense of the word. We then illustrate how we tackle these requirements in typical data science projects using Scala.
Introduction
Joining and Annotations
The reality is that often times several data sources need to be aggregated. Aggregation can be in two ways/directions:
- Vertical: New data, similar to what we already have, and
- Horizontal: Additional information about data we already obtained.
Parsing Libraries
Last week I was parsing a tab-separated file with information about genomic variants (aka a VCF file). The parsing was done using a custom library created by a colleague of mine. I had uses the library before, but now suddenly it did not work anymore. All I got was a confusing error about some exception
And then, the fun starts. How to find out where things go wrong and how to make sure you don’t have to rewrite (part of) the parser? It turned out additional fields can be added to a VCF format file, which the parser does not take into account.
Missing Values
A similar issue occurs when dealing with missing values, or missing columns in the data. It’s very easy to end up with exceptions in the Scala/Java world or equivalent in other languages.
The challenge of missing data becomes even more concrete when additional data is aggregated. Suppose we have additional annotations about a subset of the data. There needs to be a way to cover situations like this.
Solving the Challenge in Scala
In other words, you don’t have control over the input in most cases. In what follows, we describe an approach to data analysis in Scala that takes into account the above challenges. The approach is heavily based on principles of Functional Programming while capturing the data model in an object model.
Bullet Proof Data Science
Missing Values
On important aspect of the above challenges is missing values. Say you’re a service organization that keeps records of potential customers. Furthermore, say you want to analyze people’s hobbies. You would like to allow for a distinction between 3 situations:
- There is no information about the customer’s hobbies
- The customer does not have any hobbies
- The customer has 1 or more hobbies
Say you encode the hobbies as a List
of String
(free form), then (2) corresponds to an empty list and (3) corresponds to a list of n hobbies. But what does (1) correspond to? In R, one usually gets NA
. In Java, one would often sees the occurrence of null
, but using null
is not a good habit.
Instead, we use the Option
type. It encapsulates whatever other data structure you want. The above 3 situations then correspond to:
None
for no hobbies known about this personSome(List())
for this person does not have hobbiesSome(List(hobby1, hobby2, ...))
for the hobbies for this person
For the FP (Functional Programming) people among us, the option is a Monad. But let’s not go there yet in order not to scare off the others…
Start with the End in Mind
In our experience, it makes sense to define a good model for the data after aggregation and processing and capture this in an object model. This model need to be the one that can be used as input for a machine learning library as such, but it should capture the logic of the application domain.
And Scala’s case
classes come in very handy. We will not cover the specifics or benefits of case classes here, but remember this: always put case
in front of your class definition. Oh, and while you’re at it, add val
before every class parameter as well.
For instance, and to be in line with the discussion above, we could define a Name
class definition as follows:
case class Name(val firstName:Option[String], val lastName:Option[String])
One can add safe getters and setters if necessary:
case class Name(val firstName:Option[String],
val lastName:Option[String]) {
def getSafeFirstName = firstName.getOrElse("First name not known")
def getSafeLastName = lastName.getOrElse("Last name not known")
}
Please note that the safe getters return a String
, even if the value is not available.
Safe transformation
Once we encapsulated the data in an Option
, we can safely process this data as well. There are multiple way to do this, but all come down to the same principle:
- Only process values that should be processed, so don’t process missing entries, and
- Make sure the processing itself is safe by catching exceptions where necessary.
Coming back to our example: In practice one seldom gets a dataset with first name and last name clearly split. We could store input as well, and define a companion object in order for people to easily use the API:
case class Name(val unparsed:Option[String],
val firstName:Option[String]=None,
val lastName:Option[String]=None)
object Name {
def apply(unparsedStr: String) = new Name(Some(unparsedStr))
def apply(unparsedStr: String, fnStr:String, lnStr:String) = {
new Name(Some(unparsedStr), Some(fnStr), Some(lnStr))
}
}
We used default values for the first and last name. Imagine now what you can do with an object model like this:
val name1BeforeParsing = Name("John Doo")
val name2BeforeParsing = Name("Bar Foo")
val name3BeforeParsing = Name("Franz Octupus")
// A very crude parsing function
def parseName(in:Name):Name = {
val names = in.unparsed.map(_.split(" "))
.copy(firstName = names.map(_(0)), lastName = names.map(_(1)))
in}
val name1 = parseName(name1BeforeParsing)
val name2 = parseName(name2BeforeParsing)
val listOfPersons = List(name1, name2, name3BeforeParsing)
Now, obviously there are lots of problems with this approach to parsing the name. One could improve this in many ways. For instance, we did not take into account a middle name. One approach to improve the parsing itself could be to use a database of common first names and last names. But that’s not the aim of this post. So let us continue with the important stuff here. Let us convert all first names to initials in a safe way:
def getInitials(s: String):String = s.toCharArray.head.toString.toUpperCase + "."
def firstNameToInitials(name: Name):Name = name.firstName match {
case Some(nameString) => name.copy(firstName = Some(getInitials(nameString))
)
case None => name
}
The result is as follows:
.map(firstNameToInitials).foreach{println}
listOfPersonsName(Some(John Doo),Some(J.),Some(Doo))
Name(Some(Bar Foo),Some(B.),Some(Foo))
Name(Some(Franz Octupus),None,None)
See how the function gracefully managed to pass over nonexistent values! The initials of nothing is still nothing. It does not end here, though, because we managed to cope with missing values, but we did not yet make the transformation fully bullet proof! See what happens in the following case:
firstNameToInitials(parseName(Name("Jacobus")))
.util.NoSuchElementException: next on empty iterator
java.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at scala.collection.mutable.ArrayOps$ofChar.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:222)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayOps$ofChar.head(ArrayOps.scala:222)
at scala.firstNameToInitials(<console>:10)
at ... 33 elided
An easy way to solve this, and one that is compatible with our approach thus far is by using Try
in Scala. You have two choices now, which depend on how you want the API to work:
- When an exception occurs during the transformation, let the result be
None
. So in other words, let it correspond to a missing value. - When an exception occurs, insert a default value
Both scenarios are shown below:
import scala.util.Try
val DEFAULT:String = ""
def getInitialsDefault(s: String):Option[String] = {
Some(Try(s.toCharArray.head.toString.toUpperCase + ".").toOption.getOrElse(DEFAULT))
}
def getInitialsOption(s: String):Option[String] = {
Try(s.toCharArray.head.toString.toUpperCase + ".").toOption
}
def firstNameToInitials(name: Name,
:String=>Option[String]):Name = {
getInitialF.firstName match {
namecase Some(nameString) => name.copy(firstName = getInitialF(nameString))
case None => name
}
}
And use it as follows:
firstNameToInitials(Name("def", "", "def"), getInitialsDefault _)
: Name = Name(Some(def),Some(),Some(def))
res
firstNameToInitials(Name("def", "", "def"), getInitialsOption _)
: Name = Name(Some(def),None,Some(def)) res
By now, the FP adepts give us 1 point for keeping our data structures immutable: the above functions do not mutate the name
object, but rather instantiate a new one.
But we receive a negative score on omitting to acknowledge that Option
is a Monad and that a much better way of writing the above exists:
def getInitialsDefault2(name: Option[String]):Option[String] =
.map( s => Try(s.toCharArray.head.toString.toUpperCase + ".").getOrElse(DEFAULT))
name
def getInitialsOption2(name: Option[String]):Option[String] =
.flatMap( s => Try(s.toCharArray.head.toString.toUpperCase + ".").toOption)
name
def firstNameToInitials2(name: Name,
:Option[String]=>Option[String]):Name = {
getInitialF.copy(firstName = getInitialF(name.firstName))
name}
These functions do not exactly do the same as the earlier functions, but they illustrate a very important concept: map
or flatMap
on None
results in None
. So, there is no need to explicitly use pattern matching here. In order to parse the list of names, we simply map
over it (indentation added for ease of reading):
.map(firstNameToInitials2(_, getInitialsDefault2))
listOfPersons: List[Name] = List(
resName(Some(John Doo),Some(J.),Some(Doo)),
Name(Some(Bar Foo),Some(B.),Some(Foo)),
Name(Some(Franz Octupus),None,None))
For a nice and graphical introduction to Monads and related concepts, I recommend the following blog post. Transformations like the above can be composed in a bullet-proof way, once a missing value or exception occurs, we either continue with a default value or with None
.
We can go a bit further still for the FP fanatics among us. The function firstNameToInitials
in fact is the setter part of what is called a Lens
. We will come back to this later in this post.
Immutability and Lenses
Please note the examples above do not mutate any objects. We use the copy
method to create an updated version of an object. We will not discuss the benefits of this kind of programming, but just mention the use of a Lens in order to update an immutable data structure.
Several libraries exist for working with Lenses, some based on Scala macros others more high-level. For the sake of the argument, we lift the Lens
definition out the Scalaz project:
case class Lens[A,B](get: A => B, set: (A,B) => A) extends Function1[A,B] with Immutable {
def apply(whole: A): B = get(whole)
def updated(whole: A, part: B): A = set(whole, part) // like on immutable maps
def mod(a: A, f: B => B) = set(a, f(this(a)))
def compose[C](that: Lens[C,A]) = Lens[C,B](
=> this(that(c)),
c (c, b) => that.mod(c, set(_, b))
)
def andThen[C](that: Lens[B,C]) = that compose this
}
A lens for first name can then be defined as such:
val aLens = Lens[Name,Option[String]](
.firstName,
_(o, value) => o.copy(firstName = value))
So, you provide two functions: a getter and a setter.
.get(listOfPersons(1))
aLens: Option[String] = Some(Bar)
res
.set(listOfPersons(1), Some("Barby"))
aLens: Name = Name(Some(Bar Foo),Some(Barby),Some(Foo)) res
In this case the usage of a lens is almost trivial because using the copy method on a class is an easy thing to do. But when you start to nest classes to create a more complicated model, multiple copy calls are required. Lenses are a good alternative in that case.
Bullet Proof Reading of Data Sources
The foundation of the bullet proof approach thus far clearly is the Option
type, aka the Maybe
Monad in Haskell. We already used Try
to catch exceptions in a user-friendly and safe way.
Using Try
while reading a file allows us to cope with missing values, columns or when transforming numbers encoded in strings to integers or floating point numbers. In every case, we have the option to cast the possible exceptions to None
(missing value) or a default value.
Bullet Proof Aggregation of Data Sources
As mentioned already it is seldom the case that all data is available from one input file/database. Sometimes additional annotations need to be added (horizontal aggregation), coming from a different source. Sometimes, more data should be added which lacks certain features that the already parsed data contains (vertical aggregation).
There are several approaches to this. One is to go from one class-representation to another. But in this case, classes should be closely matched to the source of the data. We use a different approach, also making use of the (… tada! …) Option
type.
Since a) the data ready for analytics should be in denormalized form, and b) we already have a model for that data that is able to cope with missing values, it is not hard to start from the data source that contains the most information about the denormalized form of the data. All fields/features that are not available in this first data source remain None
(aka the default).
Adding additional features later can easily be done by updating the features from None
to Some(...)
1. Adding additional data vertically can be done in the same way. It’s perfectly fine to end up with a data structure where most of the rows have None
for a certain feature but some contain more information. And since our transformations are bullet proof, all runs safe.
Summary
To summarize:
- Start out with the object model that would be used when all available would be loaded
- Use
Option
types for all fields in order to - Use default values
None
for data that is only later to be processed or loaded - Use the Monadic property of the
Option
type for easy manipulation - Use
Try
in combination withOption
s for safe reading and transformation of data - From beginning to end, work with the same data model (using case classes) and fill in the blanks by updating
None
toSome(...)
.
… and of course …
- Update the data model, transformations, aggregations, etc. whenever necessary
Footnotes
This even reads nice!↩︎