Keywords: WhatsApp, Data Analysis, Statistical Consulting, Data Science

# Introduction

In the following we build a mathematical model of a two-person WhatsApp chat.

I use the following Ansatz: let $$t_{j,i}$$ be the time at which Person A or Person B sends the $$i$$-th message of conversation $$j$$ in the WhatsApp chat. The time stamps are ordered as $$0=t_{1,1}<t_{1,2}<\cdots<t_{1,a_1}<t_{2,1}<t_{2,2}<\cdots<t_{2,a_2}<\cdots<t_{n,1}<\cdots<t_{n,a_n}$$ so the chat consists of $$n$$ “conversations” between the two people, where conversation $$j$$ contains $$a_j$$ messages. My modeling Ansatz is that consecutive conversations are separated by a pause $$P_j$$:

$$t_{1,a_1}+P_1 = t_{2,1}$$

$$t_{2,a_2}+P_2 = t_{3,1}$$

$$\cdots$$

$$t_{n-1,a_{n-1}}+P_{n-1} = t_{n,1}$$

I have checked all the distributional assumptions below with the Kolmogorov-Smirnov test. We have

$$P_j \sim Exp(\lambda_P)$$

$$d_{j,i} = t_{j,i+1}-t_{j,i} \sim Exp(\lambda_d)$$ “interarrival times”

$$a_j \sim Pois(\lambda_a)$$

One can think of this as a “nested Poisson process”: one Poisson process governs when the conversations occur, and within each conversation the messages follow a homogeneous Poisson process.
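To make the model concrete, here is a minimal R sketch that simulates time stamps from it. The function name `simulate_chat` and the parameter values are illustrative assumptions, not part of the original analysis. The sketch also forces at least one message per conversation, because a Poisson draw for $$a_j$$ can be 0.

```r
# Simulate message time stamps from the two-level model.
# n: number of conversations; lambda_P, lambda_d: rates of pauses and
# interarrival times; lambda_a: Poisson mean of messages per conversation.
simulate_chat <- function(n, lambda_P, lambda_d, lambda_a) {
  times <- numeric(0)
  conv  <- integer(0)
  now   <- 0
  for (j in seq_len(n)) {
    # Assumption: at least one message per conversation.
    a_j <- max(1L, rpois(1, lambda_a))
    # First message at `now`, followed by a_j - 1 exponential gaps.
    gaps <- rexp(a_j - 1, rate = lambda_d)
    t_j  <- now + c(0, cumsum(gaps))
    times <- c(times, t_j)
    conv  <- c(conv, rep(j, a_j))
    # Exponential pause before the next conversation starts.
    now <- t_j[a_j] + rexp(1, rate = lambda_P)
  }
  data.frame(time = times, conversation = conv)
}

set.seed(1)
chat <- simulate_chat(n = 50, lambda_P = 1/600, lambda_d = 1/2, lambda_a = 8)
head(chat)
```

In simulated data the `conversation` column is known, which makes it possible to check how well any calibration procedure recovers the conversation boundaries.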

In reality, however, we cannot observe when a conversation starts or ends. So the question is: given the data $$t_1 < \cdots < t_m$$, is it possible to calibrate the above model to find out how many conversations there are in this chat and when each conversation starts and ends, or are there too many parameters in the model that need to be estimated?

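Summing the pauses and the interarrival times, and using $$t_{1,1}=0$$, we have

$$t_{n,a_n} = \sum_{j=1}^{n-1} P_j + \sum_{j=1}^n\sum_{i=1}^{a_j-1}d_{j,i}$$

From this I have computed the expected value and the variance of $$t_{n,a_n}$$. For large $$n$$ I count $$n$$ pauses instead of $$n-1$$, and I ignore the small probability that $$a_j = 0$$:

$$E(t_{n,a_n}) \approx n/\lambda_P + n(\lambda_a-1)/\lambda_d$$

$$Var(t_{n,a_n}) \approx n/\lambda_P^2 + n(2\lambda_a-1)/\lambda_d^2$$

The variance term follows from the random-sum formula, since the number of messages $$a_j$$ is itself random: assuming independence, each conversation contributes $$E(a_j-1)\,Var(d_{j,i}) + Var(a_j)\,E(d_{j,i})^2 = (\lambda_a-1)/\lambda_d^2 + \lambda_a/\lambda_d^2$$.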

Now the question is how to estimate the parameters $$n, \lambda_P, \lambda_d, \lambda_a$$ from the data $$t_1<\cdots<t_m$$.

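Let $$d_i = t_{i+1}-t_i$$ be the observed gaps between consecutive messages, and suppose we had a cutoff value $$\widehat{d}$$. Now let $$n =$$ number of times we have $$d_i > \widehat{d}$$. (Strictly speaking this counts the pauses, of which there are $$n-1$$, but the difference is negligible for large $$n$$.)

Suppose that the above procedure can distinguish between a conversation and a pause. Then we have $$E(m) = \sum_{i=1}^nE(a_i) = n \lambda_a$$, hence we can estimate $$\lambda_a$$ as $$\widehat{\lambda_a} = m / n$$. On the other hand, we can estimate $$\lambda_P$$ as the reciprocal of the mean pause length: $$\widehat{\lambda_P} = \frac{1}{\frac{1}{n} \sum_{d_j>\widehat{d}}d_j}$$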

Equating the observed total duration $$t_m$$ with the expected value of $$t_{n,a_n}$$, i.e. the Ansatz

$$t_m = n/\widehat{\lambda_P}+n(\widehat{\lambda_a}-1)/\widehat{\lambda_d}$$

gives the following estimate of $$\lambda_d$$:

$$\widehat{\lambda_d} = \frac{m/n-1}{t_m/n-1/n \sum_{d_j>\widehat{d}}d_j}$$
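As a sketch of how these three estimators could be computed for a given cutoff, consider the following R function. The name `estimate_rates` and the toy time stamps are assumptions made purely for illustration, and the function takes $$t_m$$ to be the time elapsed since the first message.

```r
# Estimate n, lambda_a, lambda_P and lambda_d from sorted time stamps
# and a given cutoff, following the formulas above.
estimate_rates <- function(times, cutoff) {
  m    <- length(times)
  d    <- diff(times)               # observed gaps d_i
  long <- d[d > cutoff]             # gaps classified as pauses
  n    <- length(long)              # number of gaps above the cutoff
  t_m  <- times[m] - times[1]       # total duration of the chat
  lambda_a <- m / n
  lambda_P <- 1 / mean(long)        # 1 / ((1/n) * sum of pause gaps)
  lambda_d <- (m / n - 1) / (t_m / n - sum(long) / n)
  c(n = n, lambda_a = lambda_a, lambda_P = lambda_P, lambda_d = lambda_d)
}

# Toy example: three bursts of messages separated by two long gaps.
times <- c(0, 1, 2, 3, 103, 104, 105, 205, 206)
est <- estimate_rates(times, cutoff = 10)
est
```

If no gap exceeds the cutoff, `n` is 0 and the estimates are undefined, so a real implementation would need to guard against that case.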

Now the question is how to find the cutoff value $$T = \widehat{d}$$. Consider the following scenario: we have Bernoulli-distributed variables $$X_1,\cdots,X_m$$ with success probability $$p$$. Let $$D_1,\cdots,D_m \sim Exp(\lambda_d)$$ and $$P_1,\cdots,P_m \sim Exp(\lambda_p)$$, set $$d_i = X_i D_i + (1-X_i) P_i$$, and suppose that $$\lambda_d \gg \lambda_p$$. We want to find a threshold $$T$$ such that $$d_i > T$$ indicates $$X_i=0$$, i.e. $$P_i$$ was chosen, and $$d_i \le T$$ indicates $$X_i = 1$$, i.e. $$D_i$$ was chosen.

One way to do this is to assume that we know $$p,\lambda_d,\lambda_p$$ and then to minimize the following misclassification probability:

$$(1-p)\mathbb{P}(P_i \le T) + p \mathbb{P}(D_i > T) = (1-p)(1-e^{-\lambda_p T}) + p e^{-\lambda_d T} \approx (1-p)\lambda_p T + p e^{-\lambda_d T}$$ where the last approximation holds because we assume that $$T \ll 1/\lambda_p$$. Taking the derivative with respect to $$T$$, setting it equal to $$0$$ and solving for $$T$$, we get:

$$T = -\frac{1}{\lambda_d} \log\left(\frac{(1-p)\lambda_p}{p\lambda_d}\right)$$
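The closed-form threshold can be checked numerically. The following R sketch compares it with a direct minimization of the exact misclassification probability (without the linearization) using `optimize`. The function names and the parameter values are illustrative assumptions.

```r
# Closed-form threshold from the linearized misclassification probability.
optimal_threshold <- function(p, lambda_d, lambda_p) {
  -1 / lambda_d * log((1 - p) * lambda_p / (p * lambda_d))
}

# Exact misclassification probability as a function of the threshold thr.
misclass <- function(thr, p, lambda_d, lambda_p) {
  (1 - p) * (1 - exp(-lambda_p * thr)) + p * exp(-lambda_d * thr)
}

# Illustrative values only: short gaps are far more frequent and far shorter.
p <- 0.9; lambda_d <- 0.5; lambda_p <- 1/600

T_closed <- optimal_threshold(p, lambda_d, lambda_p)
T_exact  <- optimize(misclass, interval = c(0, 100), p = p,
                     lambda_d = lambda_d, lambda_p = lambda_p)$minimum
c(closed_form = T_closed, numerical = T_exact)
```

When $$\lambda_d \gg \lambda_p$$, as here, the two values should agree closely, which supports using the simpler closed form inside an iteration.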

So the idea is to start with the mean gap as the threshold, $$T = \widehat{d} := \frac{1}{m} \sum_{i=1}^m d_i$$, then to estimate $$\lambda_d,\lambda_p,p$$ from the resulting split as described above, to recompute $$T$$ from the formula, and to iterate this procedure, say 10 times.

(Simulations suggest that this procedure does not always converge to the actual $$\lambda_d,\lambda_p,p$$, but it is close enough for practical applications.)

The estimates can then be calibrated with the following R code, which implements the procedure described above. Here `dt` is the vector of observed gaps $$d_i$$:

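    # One iteration step: compute the threshold from the current estimates,
    # split the gaps at it and re-estimate the parameters from the two groups.
    threshold <- function(di, pu, lambdaD, lambdaP){
      # T = -1/lambda_d * log((1-p) * lambda_p / (p * lambda_d))
      T <- -1/lambdaD * log((1-pu)*lambdaP / (pu*lambdaD))
      A <- di[di > T]     # gaps classified as pauses
      B <- di[di <= T]    # gaps classified as within a conversation
      lp <- 1/mean(A)
      ld <- 1/mean(B)
      p <- 1 - length(A)/length(di)
      return( c(T, lp, ld, p) )
    }

    # dt: vector of gaps between consecutive messages
    calibrate <- function(dt){
      # initial split at the mean gap
      A <- dt[dt > mean(dt)]
      B <- dt[dt <= mean(dt)]
      lp <- 1/mean(A)
      ld <- 1/mean(B)
      Pu <- 1 - length(A)/length(dt)

      # iterate the threshold step 10 times
      for( i in seq(1,10)){
        y <- threshold(dt, Pu, ld, lp)
        T <- y[1]
        lp <- y[2]
        ld <- y[3]
        Pu <- y[4]
      }
      return( c(lp, ld, Pu) )
    }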

# Application:

As an application of the above model, WhatsApp could implement a reminder to write to somebody you have not written to for a “long” time.

I recommend sending the reminder when the current time in minutes minus the time of the last conversation is greater than $$-\log(1-0.999)/\widehat{\lambda_P}$$, which is the 99.9% quantile of the fitted pause distribution.
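A minimal sketch of this reminder rule in R could look as follows. The function name `reminder_due` and the example rate are hypothetical. The rule uses `qexp`, the quantile function of the exponential distribution, which returns $$-\log(1-q)/\widehat{\lambda_P}$$ for quantile level $$q$$.

```r
# Return TRUE if the time since the last conversation exceeds the
# q-quantile of an Exp(lambda_P_hat) pause.
reminder_due <- function(minutes_since_last, lambda_P_hat, q = 0.999) {
  limit <- qexp(q, rate = lambda_P_hat)   # equals -log(1 - q) / lambda_P_hat
  minutes_since_last > limit
}

# Illustrative rate: an average pause of 600 minutes between conversations.
lambda_P_hat <- 1/600
reminder_due(5000, lambda_P_hat)
reminder_due(1000, lambda_P_hat)
```

A lower quantile level `q` would trigger the reminder earlier, at the cost of more reminders during ordinary pauses.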

Thanks go to Bjørn Kjos-Hanssen for pointing out how to simplify the model and Anthony Quas for pointing out how to find a threshold $$\widehat{d}$$ for $$d_i$$.