I demonstrate a widely applicable pattern which integrates Prolog as a critical
component in a data science analysis. Analytic methods are used to generate
properties about the data under study and Prolog is used to reason about the
data via the generated properties. The post includes some examples of piece-wise
regression on timeseries data by symbolic reasoning. I also discuss the general
pattern of application a bit.

Given some data, the bulk of “data science” for me is the study of what the data
implies and whether it can be coerced into the context specific role usually
decided by someone other than me. Dependent on the context, the goal may be to
find support for a hypothesis, to forecast, to predict a latent variable, to
explain a phenomena, and so on. The majority of the value I bring rests in
taking as input the data and the role it is expected to play, and producing as
output a computationally tractable problem the solution to which enables that
role for the data.

In my own practice – at least in its mature form – I find myself mostly relying
on reasoning with simple methods such as Linear Regression, kNN, kMeans and PCA.
I think this is mostly because I can understand what the results imply for the
“shape” of the data, meanwhile reasoning about the results does the heavy lifting.
Despite the importance of reasoning in my work, I never made it explicit, nor did
I consciously separate it from its subject matter. When I became aware of this,
it seemed to me that I should make my reasoning explicit at least so that I could
(1) reason more accurately, and (2) directly learn to reason better.

It is easier to demonstrate what it means to make reasoning explicit and the
effect it has on the structure of an analysis. What follows are a couple of
illustrative examples. I use SWI Prolog as the symbolic reasoning engine and
logic language, and Python for everything else.

I’ll start with a contrived example meant to illustrate the separation between
the reasoning element and the rest. Below (black line) is some mock 2D data
constructed by joining together 5 randomly selected lines into a series. The
x-axis is just an index which runs from 0-999. The coloured segments are a
piecewise linear fitting derived symbolically using Prolog. I’ll next describe
how.

Say I do not know anything about the construction of the data series above except
that it is made up of lines and the aim of the analysis is to find all the linear
sections. My general scheme will be to use Python to create symbols which
represent facts about the data, and then to use Prolog to reason about those
symbols. One broadly applicable way of generating symbols is to partition the
data at random along the explanatory variables (the (x) in this case) and fit
linear regressions to each partition segment. This is then repeated many times
in order to capture views of the data across many partitions. The result is lots
of overlapping segments for which I can record information such as:

  • Where the segment starts and ends.
  • The fitting parameters.
  • Correlation with the data.

Given some data (X,Y), the code below will produce rounds of random partitions
(4-10) with segments at least 20 units wide, until there are at least (N)
qualifying (perfectly linear) segments.

import numpy as np
from sklearn.linear_model import LinearRegression

i = 0
np.random.seed(10)
while i < N:
    knots = make_knots(
        np.random.randint(4,10),
        20,
        np.min(X),
        np.max(X))

    for start, stop in zip(knots[:-1], knots[1:]):
        cond = X[(X>=start) & (X<=stop)]
        x, y = X[cond].reshape(-1,1), Y[cond]
        m = LinearRegression().fit(x,y)
        p = m.predict(x)
        i += 1
        coef = np.round(m.coef_[0],4)
        score = m.score(x,y)
        if score ==1.0:
            print(f"segment(i{i},{start},{stop},{coef},{score}).")

Note that the code prints out strings in the following format:

segment(Id,Start,End,Gradient,Score).

I copy/paste these directly into Prolog for demonstration purposes, but there
are less manual ways.

The simple procedure above produces plenty of symbols to reason about. For
example, segments may be separate, overlapping, subsuming, adjacent, etc.
Gradients may be equal, similar, opposing, and so on. For these problems,
I will only need to use a fraction of the information available. To produce the
segmentation illustrated in the graph above, it is enough to note that if
segments are perfectly linear and they overlap, then they must be the same line.
I can use this assertion as a way of merging segments together in order to create
bigger continuous linear spans. In Prolog I could express this as follows:

overlap(I1,I2) :-
    segment(I1,Start1,End1,_,_),
    segment(I2,Start2,End2,_,_),
    Start2 > Start1,
    End1 > Start2,
    End2 > End1.

:- table reach/2.

reach(I1,I2) :-
    overlap(I1,I2).
reach(I1,I2) :-
    overlap(I1,X),
    reach(X,I2).

That is to say, segment (I1) overlaps with segment (I2) if it begins outside of
(I2) but ends inside of it. A segment (I2) can be “reached” from segment (I1)
iff there is a daisy chain of overlapping segments connecting them. Based on
the relationships above, the following query would produce all reachable spans
(daisy chains of segments):

?- setof((I1,I2),reach(I1,I2),Spans).

Prolog digression: overlap is intentionally non-commutative to prevent
backtracking from looping. I use
tabling on the reach
predicate because SLD resolution is very inefficient in this case.

To calculate an optimal piece-wise linear fitting, I need to find the subset of
all the non-overlapping spans generated by reach which result in the maximal
coverage of the data. This can be as simple as sorting the spans by length in
descending order and adding them to a list whilst making sure new additions do
not overlap with the old ones. In Prolog:

use_module(library(apply)).
use_module(library(pairs)).

span_overlap((I1,I2),(I3,I4)) :-
    segment(I1,Start1,_,_,_),
    segment(I2,_,End1,_,_),
    segment(I3,Start2,_,_,_),
    segment(I4,_,End2,_,_),
    ((End1 >= Start2, End2 >= End1)
    ;(End2 >= Start1, End1 >= End2)
    ;(Start1 >= Start2, End2 >= Start1)
    ;(Start2 >= Start1, End1 >= Start2)).

not_span_overlap((I1,I2),(I3,I4)) :-
    + span_overlap((I1,I2),(I3,I4)).

span_length((I1,I2),R) :-
    segment(I1,Start,_,_,_),
    segment(I2,_,End,_,_),
    R is Start-End.

span_set([],R,R).
span_set([(X,Y)|T],A,R) :-
    maplist(not_span_overlap((X,Y)),A),
    (X=Y;reach(X,Y)),!,
    span_set(T,[(X,Y)|A],R).
span_set([_|T],A,R) :-
    span_set(T,A,R).

spans([],R,R).
spans([(I1,I2)|T],A,R):-
    segment(I1,Start,_,_,_),
    segment(I2,_,End,_,_),
    spans(T,[(Start,End)|A],R).

possible(X,Y) :-
    segment(X,Start1,_,_,_),
    segment(Y,Start2,_,_,_),
    Start2 >= Start1.

find_cover(SortSpan) :-
    findall((X,Y),possible(X,Y),Possible),
    map_list_to_pairs(span_length,Possible,Pairs),
    keysort(Pairs,Sorted),
    pairs_values(Sorted,BySpan),
    span_set(BySpan,[],SpanById),
    spans(SpanById,[],Span),
    sort(Span,SortSpan).

The span_set predicate does the important work, whilst the find_cover
predicate is the entry point. I could have used reach directly, but it
massively over generates because there are usually many ways of getting
between two segments. It turns out to be much more efficient to list out
the possible spans and test whether they are reachable directly as above.
Finally, I can just query for the cover:

?- find_cover(Cover).

which outputs the segmentation in the plot above:

Cover = [
  (0, 110), (112, 201), (204, 336),
  (343, 647), (653, 758), (760, 998)].

The above example is contrived and easy to solve by more direct means but
hopefully it made the approach fairly clear. In this example I apply much the
same method but this time to perform a piece-wise isotonic fitting of the
FTSE 100 stock index. The graph below shows the last 5 years of the FTSE-100
index at weekly intervals. The segmentation was created by reasoning about
random segments just as in the example above, except an Isotonic Regression
was fitted instead of a Linear Regression.

The following information was recorded:

segment(Id,Start,Stop,Direction,Fit)

I made the following minimal adjustment to the Prolog code:

same_direction(I1,I2) :-
    segment(I1,_,_,G1,_),
    segment(I2,_,_,G2,_),
    G1 = G2.

:- table reach/2.

reach(I1,I2) :-
    same_direction(I1,I2),
    overlap(I1,I2).
reach(I1,I2) :-
    same_direction(I1,X),
    same_direction(X,I2),
    overlap(I1,X),
    reach(X,I2).

I have added the same_direction predicate which requires that the direction of
the isotonic regression is the same in overlapping segments for them to be
reachable. I have also relaxed the random partitioning condition to a correlation
of (ge 0.9). I can now query for the cover as before:

?- find_cover(Cover).

which outputs the segmentation in the plot above:

Cover = [(1, 22), (23, 88), (97, 260)].

Despite the importance of reasoning in my work, I never made it explicit, nor
did I consciously separate it from its subject matter. Adding Prolog to the mix
enabled explicit reasoning. A very general modus operandi is to generate facts
about your data and reason about them in Prolog. This came natural to me because
it is how I work anyway. However, I note that Prolog greatly enhances the
possibilities. For example, I would not have pursued the solutions above in
practice for lack of easy way to express them. Yet the solutions above are a
very direct expression of my intuition regarding what a piece-wise fitting ought
to capture, which I would have otherwise compromised on for practical reasons.

The random partitioning technique explained above is one of many broadly
applicable ways to generate facts about data. It should be noted that similar
methods can be applied to multivariate data and to non-timeseries data with
some tweaking. For example, I have a multivariate dataset containing many
explanatory variables and one response variable. The coverage across explanatory
variables is not uniform in the dataset and I am using Prolog to work out the
number and location of the models to be fitted if each model should have
explanatory variables in a continuous range. For the subsets of the space which
do not have explanatory variable coverage, I am using Prolog to reason about
which combination of adjacent models should be used to interpolate the space.

I intend to write more on this topic as I become more experienced with Prolog.

Read More