Transform SVG to PNG
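The original snippet did not survive the migration. One way to do this (a minimal sketch using the cairosvg package, which is my assumption, not necessarily the library the post used):

```python
import cairosvg

# A tiny inline SVG; in practice pass url='input.svg' to read from a file
# and write_to='output.png' to save the result to disk.
SVG = (b'<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
       b'<rect width="10" height="10" fill="red"/></svg>')

# svg2png returns the PNG as bytes when write_to is omitted.
png_bytes = cairosvg.svg2png(bytestring=SVG, output_width=100)
print(png_bytes[:8])  # PNG files start with the magic bytes \x89PNG\r\n\x1a\n
```

Command-line tools such as Inkscape or ImageMagick are alternatives if you prefer not to add a Python dependency.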


Posted in Uncategorized

Compare 2 PNG images using Python

The code is mainly stolen from here.
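The code itself was lost in the migration; a minimal sketch of the usual Pillow approach (`ImageChops.difference`), which may or may not match the original:

```python
import os
import tempfile

from PIL import Image, ImageChops

def images_equal(path_a, path_b):
    """Return True if both images have the same size and identical pixels."""
    with Image.open(path_a) as a, Image.open(path_b) as b:
        if a.size != b.size:
            return False
        # getbbox() returns None when the difference image is all black,
        # i.e. when every pixel matches.
        return ImageChops.difference(a.convert('RGB'), b.convert('RGB')).getbbox() is None

# Quick demonstration with generated images.
tmp = tempfile.mkdtemp()
red_a, red_b, blue = (os.path.join(tmp, n) for n in ('a.png', 'b.png', 'c.png'))
Image.new('RGB', (4, 4), 'red').save(red_a)
Image.new('RGB', (4, 4), 'red').save(red_b)
Image.new('RGB', (4, 4), 'blue').save(blue)
print(images_equal(red_a, red_b), images_equal(red_a, blue))
```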

Posted in Uncategorized

Generate List Of Gray Coded Binary Strings In Python

The Python code below will generate a list of 4 bit binary strings/vectors as shown on Wikipedia using the sympy package:
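The code block did not survive. The original used sympy (its GrayCode class generates the same sequence); a stdlib equivalent needs only the classic bit trick g = i ^ (i >> 1):

```python
def gray_codes(bits):
    """List of all n-bit binary strings in Gray-code order (g = i ^ (i >> 1))."""
    return [format(i ^ (i >> 1), f'0{bits}b') for i in range(2 ** bits)]

# Print the decimal / binary / Gray table for 4 bits.
for i, g in enumerate(gray_codes(4)):
    print(f'{i:2d} {i:04b} {g}')
```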

Decimal Binary Gray
0 0000 0000
1 0001 0001
2 0010 0011
3 0011 0010
4 0100 0110
5 0101 0111
6 0110 0101
7 0111 0100
8 1000 1100
9 1001 1101
10 1010 1111
11 1011 1110
12 1100 1010
13 1101 1011
14 1110 1001
15 1111 1000


Posted in Uncategorized

Using AWS Athena And Python To Scale Evolutionary Algorithms

Evolutionary algorithms (EAs) are powerful and versatile multi-objective optimisers that can be applied to virtually any problem. Data-science-specific applications include:

  • Model extraction
  • Feature extraction/selection (please note that this can also be performed implicitly during model extraction)
  • Meta learning (e.g. find the best model structure for a nonlinear mixed effect modelling approach)
  • Optimisation of neural network hyperparameters (e.g. for deep learning models)
  • Utilising the extracted models to optimise a company’s KPIs. Example: use demand prediction models to optimise the prices of sold products whilst finding good trade-offs between revenue and margin.

The following points summarise some of the advantages of EAs:

  • Perform model selection without predefining a model. For the latter, one usually uses genetic programming, a specific type of EA.
  • Utilise prior knowledge or improve existing (stale) models.
  • Select/extract features implicitly.
  • Optimise multiple, potentially conflicting modelling objectives (e.g. sensitivity, specificity, complexity, novelty).
  • Present decision makers with a set of trade-off solutions after each run (e.g. ranging from complex/black-box models to highly interpretable simple models).
  • Evolve models expressed in domain-specific languages, making them readily understandable (e.g. rule systems).

Given these advantages, EAs should be part of every data scientist’s toolkit. So how can EAs be used to extract models from data?

In a nutshell: by implementing evolution and thus simulating the Darwinian principle of the survival of the fittest. Instead of evolving organisms, however, a population of models is evolved (if modelling is the application). The fitness of an individual/model often captures how well the model fits the data, but also how novel and simple it is. Incorporating model simplicity into the fitness serves at least two purposes: preventing overfitting and allowing decision makers to understand the evolved models. So where is the catch?

Like the real thing, evolving models is slow. To establish a model's fitness, its fit to the given dataset has to be estimated (e.g. by populating a confusion matrix and then calculating the sensitivity and specificity). This is time consuming even if only a sample of the dataset is used to estimate the fit, and keep in mind that a whole population of models is evolved over several generations.
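The loop just described can be sketched in a few lines. This is a toy illustration, not code from the post: bitstrings stand in for models, and counting ones stands in for the (expensive) fitness evaluation.

```python
import random

def fitness(bits):
    """Toy fitness: number of ones (a stand-in for a model's fit to data)."""
    return sum(bits)

def evolve(pop_size=20, length=16, generations=50, seed=0):
    rng = random.Random(seed)
    # Start from a random population of candidate "models".
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: the fittest of each random triple survives.
        parents = [max(rng.sample(pop, 3), key=fitness) for _ in range(pop_size)]
        pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = rng.randrange(1, length)  # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Rare bit-flip mutation keeps the search exploring.
                pop.append([bit ^ (rng.random() < 0.01) for bit in child])
    return max(pop, key=fitness)

best = evolve()
print(fitness(best), best)
```

Note that `fitness` is called for every individual in every generation, which is exactly why a slow fitness function dominates the runtime.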

This is where Athena can help by utilising AWS's scalable cloud infrastructure. The slow fitness evaluation of a model is translated into SQL queries, which can be executed against big data: SQL is used to calculate the model's fit to the data. Depending on the modelling task, the SQL query could, for example, calculate the entries of a confusion matrix from which the sensitivity and specificity are derived (or it could calculate these values directly). Alternatively, SQL could estimate the area under the ROC curve, the mean squared error or, in fact, any imaginable loss function. Hence good old SQL, which most data scientists are familiar with, can be used without having to resort to more complex approaches such as Spark. This avoids potentially steep learning curves, as the implementation of parallel algorithms can be a challenging task.
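As a concrete illustration, a fitness query of this kind could count the confusion-matrix cells for a candidate rule. The rule, table and column names below are my assumptions for the Iris setup described later, and the helper shows how sensitivity and specificity follow from the counts:

```python
# Hypothetical fitness query for the rule "petal_length < 2.5 => Iris-setosa";
# table and column names are assumptions, not taken from the original post.
QUERY = """
SELECT
  SUM(CASE WHEN petal_length <  2.5 AND species =  'Iris-setosa' THEN 1 ELSE 0 END) AS tp,
  SUM(CASE WHEN petal_length <  2.5 AND species <> 'Iris-setosa' THEN 1 ELSE 0 END) AS fp,
  SUM(CASE WHEN petal_length >= 2.5 AND species =  'Iris-setosa' THEN 1 ELSE 0 END) AS fn,
  SUM(CASE WHEN petal_length >= 2.5 AND species <> 'Iris-setosa' THEN 1 ELSE 0 END) AS tn
FROM analysisdata.iris
"""

def sensitivity_specificity(tp, fp, fn, tn):
    """Derive the two modelling objectives from the confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# Example: the four counts a query like the above might return.
print(sensitivity_specificity(45, 2, 5, 98))
```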

I will now explain how this can be implemented using Python. Other programming languages (e.g. C# or Java) could also be used, as Athena is language agnostic.

The first step is to upload a dataset into an S3 bucket. I used the famous Iris dataset, which can be obtained here as a tab-separated flat file.

The created S3 bucket is called:

The characters after christiandata are a GUID, which ensures that the bucket name is unique (otherwise you may spend a long time searching for a unique name). You can generate GUIDs using this site. One thing to mention is that you should use the same region as the one used for Athena, in this case US East (N. Virginia). The created bucket is shown here:

I then created a folder called Iris in this bucket and uploaded the aforementioned tab separated file.

Now it is time to deal with Athena. For this I created a database called analysisdata and defined a table manually.

This is the address used:

Pick TSV for the tab-separated file:

Next bulk add the columns using this expression:

You can now preview the data, which generates this query:

Running this query should show the top 10 rows of the iris dataset. The following Python code can then be used to run the same query as a proof of concept.

Please note that the query had to be slightly adapted, as the original query threw an error:

The code uses the pyathena package. You will have to use your aws_access_key_id and aws_secret_access_key; for instructions, follow this link. To determine the region name, please refer to this site.
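The code block itself was lost; a sketch of a pyathena-based proof of concept matching the description (credential and staging-bucket values are placeholders you must replace):

```python
# The preview query, adapted as described above.
QUERY = 'SELECT * FROM analysisdata.iris LIMIT 10'

def run_preview(access_key, secret_key, staging_dir, region='us-east-1'):
    """Run the preview query via pyathena and return the fetched rows."""
    from pyathena import connect  # pip install pyathena

    cursor = connect(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        s3_staging_dir=staging_dir,  # e.g. 's3://your-athena-staging-bucket/'
        region_name=region,
    ).cursor()
    cursor.execute(QUERY)
    return cursor.fetchall()
```

Calling `run_preview(...)` with your own credentials should print the same ten rows as the Athena console preview.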

That’s it. You can now run Athena SQL queries from Python. Adapting the queries to calculate loss functions such as the mean squared error, or the values of a confusion matrix given some expression (e.g. a model formula or rule system), should be quite straightforward.

Posted in Uncategorized

Creating A Scatter Plot Matrix In SAS

This post creates a scatter plot matrix from the Iris dataset. For accessibility I have included the dataset via datalines. The result looks like this:

This is the code:
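The embedded code block did not survive the migration. A sketch of what it likely looked like, with a shortened datalines section and PROC SGSCATTER (variable names are assumptions):

```sas
data iris;
  input SepalLength SepalWidth PetalLength PetalWidth Species $;
  datalines;
5.1 3.5 1.4 0.2 Setosa
7.0 3.2 4.7 1.4 Versicolor
6.3 3.3 6.0 2.5 Virginica
;
run;

proc sgscatter data=iris;
  matrix SepalLength SepalWidth PetalLength PetalWidth
         / group=Species diagonal=(histogram);
run;
```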

Posted in SAS, Scatter plot

Prompting for dynamic data in SAS Enterprise Guide

Creating prompts in SAS is really easy; a sufficient intro can be found here. I followed these instructions and created a prompt that asks for two mandatory integers: FirstNumber and SecondNumber. These parameters can then be used in code like so:
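The code block itself was lost; a minimal sketch of what it would look like (the prompt names become macro variables):

```sas
data _null_;
  Sum = &FirstNumber. + &SecondNumber.;
  put 'The sum of both numbers is: ' Sum=;
run;
```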

The code just adds up both numbers and writes the result to the log.

Posted in Prompts, SAS

Translate pubmed publications into various reference formats

There are several reference formats out there: BibTeX, EndNote and Microsoft’s Word 2007 XML, to name only a few. This site:

can be used to transform PubMed publications/references, given their unique identifier (PMID), into these reference formats.

Posted in General

Search all tables for string value

A little script I am using to find tables that contain a specific string value. I did not write this; sorry, I do not remember where I got it from:
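The script itself was lost in the migration. The widely circulated versions build dynamic SQL over INFORMATION_SCHEMA.COLUMNS; a compact cursor-based sketch (my reconstruction, not the original) looks like this:

```sql
DECLARE @search nvarchar(100) = N'SearchValue';
DECLARE @sql    nvarchar(max);

-- Build one EXISTS probe per character column and print the columns that match.
DECLARE col_cursor CURSOR FAST_FORWARD FOR
  SELECT 'IF EXISTS (SELECT 1 FROM ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)
       + ' WHERE '  + QUOTENAME(COLUMN_NAME) + ' LIKE ''%' + @search + '%'') '
       + 'PRINT ''' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME + ''''
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE DATA_TYPE IN ('char', 'varchar', 'nchar', 'nvarchar');

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @sql;
WHILE @@FETCH_STATUS = 0
BEGIN
  EXEC sp_executesql @sql;
  FETCH NEXT FROM col_cursor INTO @sql;
END;
CLOSE col_cursor;
DEALLOCATE col_cursor;
```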


Posted in TSql

Bulk insert tab separated flat file

This code can be used to insert data from a tab-separated text file whose first row contains the column names:
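The snippet is gone; the standard pattern (table name and file path are placeholders) is BULK INSERT with FIRSTROW = 2 to skip the header:

```sql
BULK INSERT dbo.TargetTable
FROM 'C:\Data\input.txt'
WITH (
    FIELDTERMINATOR = '\t',  -- tab separated
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2      -- skip the row with the column names
);
```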

Posted in TSql

Hello world!

Just some notes on my work and interests. Hope this may be of some use for others too.

Posted in General