I demonstrate a widely applicable pattern which integrates Prolog as a critical

component in a data science analysis. Analytic methods are used to generate

properties about the data under study and Prolog is used to reason about the

data via the generated properties. The post includes some examples of piece-wise

regression on timeseries data performed via symbolic reasoning. I also discuss the general

pattern of application a bit.

Given some data, the bulk of “data science” for me is the study of what the data implies and whether it can be coerced into the context-specific role usually decided by someone other than me. Depending on the context, the goal may be to find support for a hypothesis, to forecast, to predict a latent variable, to explain a phenomenon, and so on. The majority of the value I bring rests in

taking as input the data and the role it is expected to play, and producing as

output a computationally tractable problem the solution to which enables that

role for the data.

In my own practice – at least in its mature form – I find myself mostly relying

on reasoning with simple methods such as Linear Regression, kNN, kMeans and PCA.

I think this is mostly because I can understand what the results imply for the

“shape” of the data, while reasoning about the results does the heavy lifting.

Despite the importance of reasoning in my work, I never made it explicit, nor did

I consciously separate it from its subject matter. When I became aware of this,

it seemed to me that I should make my reasoning explicit at least so that I could

(1) reason more accurately, and (2) directly learn to reason better.

It is easier to demonstrate than to describe what it means to make reasoning explicit and

effect it has on the structure of an analysis. What follows are a couple of

illustrative examples. I use SWI Prolog as the symbolic reasoning engine and

logic language, and Python for everything else.

I’ll start with a contrived example meant to illustrate the separation between

the reasoning element and the rest. Below (black line) is some mock 2D data

constructed by joining together 5 randomly selected lines into a series. The

x-axis is just an index running from 0 to 999. The coloured segments are a piece-wise linear fitting derived symbolically using Prolog. I’ll next describe

how.

Say I do not know anything about the construction of the data series above except

that it is made up of lines, and that the aim of the analysis is to find all the linear

sections. My general scheme will be to use Python to create symbols which

represent facts about the data, and then to use Prolog to reason about those

symbols. One broadly applicable way of generating symbols is to partition the

data at random along the explanatory variables (the (x) in this case) and fit

linear regressions to each partition segment. This is then repeated many times

in order to capture views of the data across many partitions. The result is lots

of overlapping segments for which I can record information such as:

- Where the segment starts and ends.
- The fitting parameters.
- Correlation with the data.

Given some data (X,Y), the code below will produce rounds of random partitions

(4-10) with segments at least 20 units wide, until there are at least (N)

qualifying (perfectly linear) segments.

```
import numpy as np
from sklearn.linear_model import LinearRegression

# N is the desired number of qualifying segments (defined elsewhere).
np.random.seed(10)
i = 0
while i < N:
    # make_knots (a helper, not shown) returns sorted random partition
    # boundaries between min(X) and max(X), at least 20 units apart.
    knots = make_knots(
        np.random.randint(4, 10),
        20,
        np.min(X),
        np.max(X))
    for start, stop in zip(knots[:-1], knots[1:]):
        cond = (X >= start) & (X <= stop)
        x, y = X[cond].reshape(-1, 1), Y[cond]
        m = LinearRegression().fit(x, y)
        coef = np.round(m.coef_[0], 4)
        score = m.score(x, y)
        if score == 1.0:  # keep only perfectly linear segments
            i += 1
            print(f"segment(i{i},{start},{stop},{coef},{score}).")
```

Note that the code prints out strings in the following format:

`segment(Id,Start,End,Gradient,Score).`

I copy/paste these directly into Prolog for demonstration purposes, but there

are less manual ways.
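
For instance, instead of printing, the facts could be collected and written out as a Prolog source file. A minimal sketch; the `facts` list is a hypothetical accumulator replacing the `print` call in the loop above:

```
# A hypothetical accumulator which replaces the print() call above:
# facts.append(f"segment(i{i},{start},{stop},{coef},{score}).")
facts = []

# Write the collected facts out as a Prolog source file, which can
# then be loaded in SWI Prolog with: ?- consult(segments).
def write_facts(facts, path="segments.pl"):
    with open(path, "w") as f:
        f.write("\n".join(facts) + "\n")

write_facts(facts)
```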

The simple procedure above produces plenty of symbols to reason about. For

example, segments may be separate, overlapping, subsuming, adjacent, etc.

Gradients may be equal, similar, opposing, and so on. For these problems,

I will only need to use a fraction of the information available. To produce the

segmentation illustrated in the graph above, it is enough to note that if

segments are perfectly linear and they overlap, then they must be the same line.

I can use this assertion as a way of merging segments together in order to create

bigger continuous linear *spans*. In Prolog I could express this as follows:

```
overlap(I1,I2) :-
    segment(I1,Start1,End1,_,_),
    segment(I2,Start2,End2,_,_),
    Start2 > Start1,
    End1 > Start2,
    End2 > End1.

:- table reach/2.

reach(I1,I2) :-
    overlap(I1,I2).
reach(I1,I2) :-
    overlap(I1,X),
    reach(X,I2).
```

That is to say, segment (I1) overlaps with segment (I2) if it begins outside of

(I2) but ends inside of it. A segment (I2) can be “reached” from segment (I1)

*iff* there is a daisy chain of overlapping segments connecting them. Based on

the relationships above, the following query would produce all reachable spans

(daisy chains of segments):

`?- setof((I1,I2),reach(I1,I2),Spans).`
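
For instance, given a few hypothetical facts where segment *a* overlaps *b* and *b* overlaps *c*, *reach* connects *a* to *c* even though those two never overlap directly:

```
% Hypothetical facts for illustration.
segment(a, 0, 50, 0.5, 1.0).
segment(b, 40, 90, 0.5, 1.0).
segment(c, 85, 130, 0.5, 1.0).

% ?- setof((I1,I2), reach(I1,I2), Spans).
% Spans = [(a,b), (a,c), (b,c)].
```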

**Prolog digression**: *overlap* is intentionally non-commutative to prevent backtracking from looping. I use tabling on the *reach* predicate because plain SLD resolution is very inefficient in this case.

To calculate an optimal piece-wise linear fitting, I need to find the subset of the spans generated by *reach* which is non-overlapping and results in the maximal

coverage of the data. This can be as simple as sorting the spans by length in

descending order and adding them to a list whilst making sure new additions do

not overlap with the old ones. In Prolog:

```
:- use_module(library(apply)).
:- use_module(library(pairs)).

span_overlap((I1,I2),(I3,I4)) :-
    segment(I1,Start1,_,_,_),
    segment(I2,_,End1,_,_),
    segment(I3,Start2,_,_,_),
    segment(I4,_,End2,_,_),
    ( (End1 >= Start2, End2 >= End1)
    ; (End2 >= Start1, End1 >= End2)
    ; (Start1 >= Start2, End2 >= Start1)
    ; (Start2 >= Start1, End1 >= Start2)
    ).

not_span_overlap((I1,I2),(I3,I4)) :-
    \+ span_overlap((I1,I2),(I3,I4)).

% Start-End is negative, so an ascending keysort puts the
% longest spans first.
span_length((I1,I2),R) :-
    segment(I1,Start,_,_,_),
    segment(I2,_,End,_,_),
    R is Start-End.

span_set([],R,R).
span_set([(X,Y)|T],A,R) :-
    maplist(not_span_overlap((X,Y)),A),
    (X=Y;reach(X,Y)),!,
    span_set(T,[(X,Y)|A],R).
span_set([_|T],A,R) :-
    span_set(T,A,R).

spans([],R,R).
spans([(I1,I2)|T],A,R) :-
    segment(I1,Start,_,_,_),
    segment(I2,_,End,_,_),
    spans(T,[(Start,End)|A],R).

possible(X,Y) :-
    segment(X,Start1,_,_,_),
    segment(Y,Start2,_,_,_),
    Start2 >= Start1.

find_cover(SortSpan) :-
    findall((X,Y),possible(X,Y),Possible),
    map_list_to_pairs(span_length,Possible,Pairs),
    keysort(Pairs,Sorted),
    pairs_values(Sorted,BySpan),
    span_set(BySpan,[],SpanById),
    spans(SpanById,[],Span),
    sort(Span,SortSpan).
```

The *span_set* predicate does the important work, whilst the *find_cover*

predicate is the entry point. I could have used *reach* directly, but it

massively over-generates because there are usually many ways of getting

between two segments. It turns out to be much more efficient to list out

the possible spans and test whether they are reachable directly as above.

Finally, I can just query for the cover:

`?- find_cover(Cover).`

which outputs the segmentation in the plot above:

```
Cover = [
(0, 110), (112, 201), (204, 336),
(343, 647), (653, 758), (760, 998)].
```
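
To reproduce a plot like the one above, the cover can be pasted back into Python and drawn over the data. A minimal sketch, assuming the `X` and `Y` arrays from the generation code:

```
import matplotlib.pyplot as plt

cover = [(0, 110), (112, 201), (204, 336),
         (343, 647), (653, 758), (760, 998)]

plt.plot(X, Y, color="black", linewidth=1)
for start, stop in cover:
    mask = (X >= start) & (X <= stop)
    # Each span is drawn over the data in its own colour.
    plt.plot(X[mask], Y[mask], linewidth=3)
plt.show()
```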

The above example is contrived and easy to solve by more direct means but

hopefully it made the approach fairly clear. In the next example I apply much the same method, but this time to perform a piece-wise isotonic fitting of the FTSE 100 stock index. The graph below shows the last 5 years of the FTSE 100

index at weekly intervals. The segmentation was created by reasoning about

random segments just as in the example above, except an Isotonic Regression

was fitted instead of a Linear Regression.

The following information was recorded:

`segment(Id,Start,Stop,Direction,Fit)`
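
The fitting step changes only slightly. A sketch of the inner loop using scikit-learn's IsotonicRegression; the `inc`/`dec` direction atoms are my own choice, and the qualifying condition uses the relaxed correlation threshold described below:

```
from sklearn.isotonic import IsotonicRegression

# Replaces the LinearRegression fit inside the partition loop;
# x and y are the values for the current partition segment.
xr = x.ravel()  # IsotonicRegression expects a 1-D x
m = IsotonicRegression(increasing="auto").fit(xr, y)
score = m.score(xr, y)  # R^2 of the isotonic fit
direction = "inc" if m.increasing_ else "dec"  # my own atom names
if score >= 0.9:  # relaxed from the perfect fit required before
    print(f"segment(i{i},{start},{stop},{direction},{score}).")
```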

I made the following minimal adjustment to the Prolog code:

```
same_direction(I1,I2) :-
    segment(I1,_,_,G1,_),
    segment(I2,_,_,G2,_),
    G1 = G2.

:- table reach/2.

reach(I1,I2) :-
    same_direction(I1,I2),
    overlap(I1,I2).
reach(I1,I2) :-
    same_direction(I1,X),
    same_direction(X,I2),
    overlap(I1,X),
    reach(X,I2).
```

I have added the *same_direction* predicate which requires that the direction of

the isotonic regression is the same in overlapping segments for them to be

reachable. I have also relaxed the random partitioning condition from a perfect fit to a correlation of at least 0.9. I can now query for the cover as before:

`?- find_cover(Cover).`

which outputs the segmentation in the plot above:

`Cover = [(1, 22), (23, 88), (97, 260)].`

As I said at the outset, reasoning is central to my work, yet I had never made it explicit, nor consciously separated it from its subject matter. Adding Prolog to the mix enabled explicit reasoning. A very general *modus operandi* is to generate facts about your data and then reason about them in Prolog. This came naturally to me because it is how I work anyway. However, I note that Prolog greatly enhances the

possibilities. For example, I would not have pursued the solutions above in

practice for lack of an easy way to express them. Yet the solutions above are a

very direct expression of my intuition regarding what a piece-wise fitting ought

to capture, which I would have otherwise compromised on for practical reasons.

The random partitioning technique explained above is one of many broadly

applicable ways to generate facts about data. It should be noted that similar

methods can be applied to multivariate data and to non-timeseries data with

some tweaking. For example, I have a multivariate dataset containing many

explanatory variables and one response variable. The coverage across explanatory

variables is not uniform in the dataset, and I am using Prolog to work out the number and location of the models to be fitted, given that each model should cover a continuous range of the explanatory variables. For the subsets of the space which

do not have explanatory variable coverage, I am using Prolog to reason about

which combination of adjacent models should be used to interpolate the space.
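
As a flavour of what that looks like, a hypothetical sketch: models are facts covering ranges of one explanatory variable, a *gap* is an uncovered interval, and its interpolators are the models adjacent to it:

```
% Hypothetical model facts: model(Id, Low, High).
model(m1, 0, 40).
model(m2, 55, 90).

% An uncovered interval between the end of one model and the
% start of another, with no third model overlapping it.
gap(Lo, Hi) :-
    model(M1, _, Lo),
    model(M2, Hi, _),
    M1 \= M2,
    Hi > Lo,
    \+ (model(_, S, E), S < Hi, E > Lo).

% Interpolate a gap from the models on either side of it.
interpolators(Lo, Hi, M1, M2) :-
    gap(Lo, Hi),
    model(M1, _, Lo),
    model(M2, Hi, _).

% ?- interpolators(Lo, Hi, A, B).
% Lo = 40, Hi = 55, A = m1, B = m2.
```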

I intend to write more on this topic as I become more experienced with Prolog.
