
Modeling the Extinction of the Catalan Language

Applying existing literature to a practical case

Photo by Brett Jordan on Unsplash

Can we predict the extinction of a language? It doesn’t sound easy, and it isn’t, but that shouldn’t stop us from trying to model it.

I recently became interested in this topic and started reviewing some of the existing literature. I came across one article[1] that I enjoyed and wanted to share.

So, in this post, I’ll share the insights of that paper, translated into (hopefully) a simple read and applied to a practical case so we can see data science and mathematical modeling in action.

Introduction

I am Catalan and, for those who don’t know, Catalan is a co-official language, along with Spanish, in Catalonia, the Valencian Community, and the Balearic Islands (Spain). It’s also the official language of Andorra, and it is spoken in the south of France and even in Alghero (Italy).

We often see on local TV or in the media that the Catalan language is at risk of extinction. Focusing only on Catalonia, we can easily dig deeper into the case because the government studies the use of the language through what’s called the "survey of linguistic uses of the population" (Enquesta d’usos lingüístics de la població)[2].

Let’s pick 2018 and analyze the mother tongue ratio per language of the surveyed population (people over 14):

Surveyed population by mother tongue (2018) – Image from GenCat (open source)

The first three columns relate to Catalan (blue), Spanish (red), and both Catalan and Spanish (green). While both Spanish and Catalan are official, we see Spanish standing out above Catalan.

That made me think: if it were a competition where only one could survive, Spanish would obviously be the winner. But how long would it take for Catalan to disappear?

This is what we’ll check now.

Disclaimer: I’m not advocating the extinction of Catalan; quite the opposite, in fact. I am concerned about the decline in its use, and I love using it. This post also aims to raise awareness of this beautiful language and promote its use and learning.

Let’s talk math

For Catalan to disappear, all the Catalan speakers would have to stop using it (duh). In other words, we’d need Catalan speakers to transition to another language (let’s assume Spanish).

We’ll define Pyx(x,s) as the probability of an individual converting from language Y (Catalan) to X (Spanish), where x is the ratio of X – Spanish – speakers and s is a measure of X’s relative status (between 0 and 1). If we want to model the language change, here’s the proposed equation:

Language change model – Image by the author

What we do is multiply the ratio of Catalan speakers times the probability of a Catalan speaker transitioning to Spanish and subtract the result of multiplying the ratio of Spanish speakers and the probability of a Spanish speaker transitioning to Catalan.
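
In symbols, writing y for the ratio of Catalan speakers and following the notation of [1], that description can be sketched as below. This is a reconstruction from the wording above, so treat it as an assumption rather than a copy of the original figure:

```latex
\frac{dx}{dt} = y \, P_{yx}(x, s) - x \, P_{xy}(x, s)
```

The first term is the flow of speakers from Catalan to Spanish, and the second is the flow in the opposite direction.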

To keep the math simple, we’ll assume the competition is between these two languages only. Then, the ratio of Catalan speakers is y = 1-x. Due to this symmetry, we can define the following: Pxy(x,s) = Pyx(1-x, 1-s).

Also, we can assume a language without speakers (x=0) or a null language status (s=0) makes the probability of transitioning equal to 0. We can redefine the equation as such:

Language change model based on Pyx – Image by the author
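
Substituting y = 1-x and Pxy(x,s) = Pyx(1-x, 1-s), the model depends on a single transition function. A sketch of the resulting form, reconstructed from the text (the boundary conditions are the two assumptions stated above):

```latex
\frac{dx}{dt} = (1 - x)\, P_{yx}(x, s) - x\, P_{yx}(1 - x,\, 1 - s),
\qquad P_{yx}(0, s) = 0, \quad P_{yx}(x, 0) = 0
```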

OK, cool. But what does the transition probability function look like? It could take many shapes and forms, but the ones the authors picked were:

Transition functions – Image by the author
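
For reference, the functional form proposed in [1], written in the notation used in this post, is:

```latex
P_{yx}(x, s) = c \, x^{a} \, s,
\qquad
P_{xy}(x, s) = c \, (1 - x)^{a} \, (1 - s)
```

The second expression is simply the first one evaluated at (1-x, 1-s), as the symmetry assumption requires.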

Let’s dissect this function, focusing on the Pyx(x,s) representation. First of all, c is a constant scaling factor and a is an exponent that modifies the influence of x, the proportion of Spanish speakers.

The value of a will determine the function’s behavior:

  • If a=1, it will be linear.
  • If a > 1, then it’ll be convex (meaning the transition probability increases more rapidly as x increases)
  • If 0 < a < 1, then the function follows a concave behavior (meaning that the transition probability increases more slowly as x increases).
  • If a=0, the function doesn’t depend on x (very unlikely).
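
To get a feel for those four cases, here is a minimal sketch that evaluates Pyx(x,s) = c * x**a * s at a few values of x. The helper name p_yx and the status value s = 0.5 are illustrative assumptions, not values from the paper:

```python
def p_yx(x, s, a, c=1.0):
    """Transition probability from Y to X, assuming the form c * x**a * s."""
    return c * x**a * s


s = 0.5  # illustrative status value (an assumption, not from the article)
xs = [0.25, 0.50, 0.75]  # a few proportions of X speakers

# a = 0 ignores x, a < 1 is concave, a = 1 is linear, a > 1 is convex
for a in (0.0, 0.5, 1.0, 1.31, 2.0):
    values = [round(p_yx(x, s, a), 3) for x in xs]
    print(f"a={a}: {values}")
```

Comparing the middle value with the average of the two outer ones shows the curvature: above the average for a < 1, equal to it for a = 1, and below it for a > 1.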

This function was chosen because it can model a language shift in which the likelihood of individuals adopting a new language (and leaving another behind) depends on how prevalent that language already is and on its status. So, the higher the status, the higher the probability of people switching to language X.

Furthermore, we could apply this type of function to model market transitions in economics, where x could represent the market share and s an economic incentive, such as the amount of advertising for a product.

Going back to our case, the problem now resides in computing parameters c, s, a, and x(0). Well, it is not our problem, it was the researchers’… Until they acquired the data to effectively estimate those parameters through least absolute-values regression[3].

Data on the number of speakers of endangered languages in 42 regions in Peru, Scotland, Wales, Bolivia, Ireland, and Alsace-Lorraine was collected and used to fit the model.
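
To make the idea of a least absolute-values fit concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the "observations" are synthetic (generated from the model itself with a known status plus a small perturbation), only s is estimated while c, a, and x(0) are held fixed, the search is a plain grid search, and the integration is a hand-rolled fixed-step RK4. It is not the procedure used in [1].

```python
def dxdt(x, c, a, s):
    """Model right-hand side: speakers gained from Y minus speakers lost to Y."""
    return c * ((1 - x) * x**a * s - x * (1 - x)**a * (1 - s))


def simulate(x0, c, a, s, n_steps, dt=1.0, substeps=20):
    """Integrate with fixed-step RK4; return x at t = 0, dt, ..., n_steps * dt."""
    h = dt / substeps
    x, out = x0, [x0]
    for _ in range(n_steps):
        for _ in range(substeps):
            k1 = dxdt(x, c, a, s)
            k2 = dxdt(x + 0.5 * h * k1, c, a, s)
            k3 = dxdt(x + 0.5 * h * k2, c, a, s)
            k4 = dxdt(x + h * k3, c, a, s)
            x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        out.append(x)
    return out


# Synthetic observations (hypothetical data): model output for a known
# status, with a small alternating perturbation standing in for noise.
true_s = 0.6
clean = simulate(0.466, 1.0, 1.31, true_s, n_steps=10)
observed = [x + (0.005 if i % 2 else -0.005) for i, x in enumerate(clean)]


def lad_loss(s):
    """Sum of absolute deviations between the model and the observations."""
    model = simulate(0.466, 1.0, 1.31, s, n_steps=10)
    return sum(abs(m - o) for m, o in zip(model, observed))


# Grid search over candidate status values 0.05, 0.06, ..., 0.95
candidates = [i / 100 for i in range(5, 96)]
s_hat = min(candidates, key=lad_loss)
print(f"estimated s = {s_hat:.2f}")
```

Minimizing absolute rather than squared deviations makes the fit less sensitive to a few outlying data points, which matters when the historical records are sparse.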

Unexpectedly, they found that the exponent a was roughly constant across cultures, with a = 1.31 ± 0.25.

Of the four estimated parameters, the most informative is the status (s), as it can serve as a useful measure of the threat to a given language.

For example, they found that Quechua still had many speakers in Huanuco, Peru, but its low status was driving a rapid shift to Spanish.

Unfortunately, we don’t have enough data available at the moment to fit the model to our Catalan case. We can, however, approximate it using the results Daniel M. Abrams and Steven H. Strogatz already shared.

So, if we assume:

  • a = 1.31 – the mean they found
  • x(0) = 0.466 – the current ratio of Spanish speakers (by mother tongue only, because roughly 99% of people can speak it)
  • s = 0.40 – the estimated status of a language like Catalan (assuming Catalan’s status in Catalonia is similar to that of Welsh in Wales or Scottish Gaelic in Scotland). Because s in the model is the status of X (Spanish), the symmetry above puts Spanish at 1-0.40 = 0.60.
  • c = 1 – unit constant (for simplicity).

And we work the maths out for Equation 1, using the probability function, we get:

Final equation – Image by the author
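
Spelled out, and assuming the transition functions from [1], Pyx = c * x^a * s and Pxy = c * (1-x)^a * (1-s), with s read as the status of X (Spanish), substituting them into the model gives the expression below. It is a reconstruction from the text, so treat it as a sketch:

```latex
\begin{aligned}
\frac{dx}{dt} &= c \left[ (1 - x)\, x^{a} s \;-\; x\, (1 - x)^{a} (1 - s) \right] \\
              &= c \left[ x^{a} s \;-\; x^{a+1} s \;-\; x\, (1 - x)^{a} (1 - s) \right]
\end{aligned}
```

Both x = 0 and x = 1 make the right-hand side vanish, which matches the intuition that a language with no speakers cannot recruit any.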

This is almost the final form for our language transitioning model… But we should solve this differential equation and add a time dimension to it. To do so, we’ll resort to Python.

Time to Code

We’ll only be using three modules: numpy, scipy and matplotlib. So make sure you install them if you want to use this code. For now, let’s just import them:

import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

The next step is defining the differential equation that will take care of computing, over time, this population shift from Catalan to Spanish:

def language_change_model(t, x, c, a, s):
    # dx/dt = (1 - x) * Pyx(x, s) - x * Pxy(x, s),
    # with Pyx = c * x**a * s and Pxy = c * (1 - x)**a * (1 - s)
    return c * ((1 - x) * x**a * s - x * (1 - x)**a * (1 - s))

We’ll also define our parameters, as listed above, and add the time span:

c = 1.0  # unit constant
a = 1.31  # mean value reported in [1]
s = 1 - 0.4  # status of X (Spanish): 1 minus Catalan's estimated status of 0.4
x0 = 0.466  # ratio of X (Spanish) mother-tongue speakers in Catalonia as of now

t_span = (1, 10)  # integrate from period 1 to period 10
t_eval = np.linspace(t_span[0], t_span[1], 100)

Very simple so far. Now it’s time to do the math, and we’ll use the solve_ivp function from the scipy package[4], which numerically integrates a system of ordinary differential equations given an initial value.

We have to choose an integration method, and we’ll stick with the default: an explicit Runge-Kutta method of order 5(4)[5].

sol = solve_ivp(language_change_model, t_span, [x0], args=(c, a, s), t_eval=t_eval, method='RK45')

And the last step is to plot our solution:

# Plot the solution
plt.plot(sol.t, sol.y[0], label=f'c={c}, a={a}, s={s}, x0={x0}')
plt.xlabel('Time')
plt.ylabel('Proportion of people speaking Spanish')
plt.legend()
plt.title('Language Change Model Over Time')
plt.show()

Just in case you want to try it out yourself, with different parameters, time frames, or methods, here’s the full code:

import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

# Define the differential equation:
# dx/dt = (1 - x) * Pyx(x, s) - x * Pxy(x, s)
def language_change_model(t, x, c, a, s):
    return c * ((1 - x) * x**a * s - x * (1 - x)**a * (1 - s))

# Parameters
c = 1.0  # unit constant
a = 1.31  # mean value reported in [1]
s = 1 - 0.4  # status of X (Spanish): 1 minus Catalan's estimated status of 0.4
x0 = 0.466  # initial ratio of Spanish mother-tongue speakers

# Time span for the solution
t_span = (1, 10)  # integrate from period 1 to period 10
t_eval = np.linspace(t_span[0], t_span[1], 100)

# Solve the differential equation
sol = solve_ivp(language_change_model, t_span, [x0], args=(c, a, s), t_eval=t_eval, method='RK45')

# Plot the solution
plt.plot(sol.t, sol.y[0], label=f'c={c}, a={a}, s={s}, x0={x0}')
plt.xlabel('Time')
plt.ylabel('Proportion of people speaking Spanish')
plt.legend()
plt.title('Language Change Model Over Time')
plt.show()

Want to see the result?

Catalan Population using Spanish as mother tongue over time – Image by the author

The time axis counts generations rather than years, because mother tongues are established at birth or during infancy and change from one generation to the next, not within a person’s lifespan.

Conclusions

If we assume the data and parameters are right, the model suggests that Spanish would displace Catalan as a mother tongue within roughly 10 generations.
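
To read a takeover time off the model instead of eyeballing a plot, one option is to step the equation forward until x crosses a threshold. The sketch below makes several assumptions: it takes the model as gain minus loss, sets s = 0.6 (Spanish’s status taken as 1 minus Catalan’s estimated 0.40), starts the clock at 0, and uses a hand-rolled fixed-step RK4 instead of solve_ivp so it has no dependencies. The helper name time_to_reach is hypothetical.

```python
def dxdt(x, c=1.0, a=1.31, s=0.6):
    """Model right-hand side; s=0.6 assumes Spanish's status is 1 - 0.40."""
    return c * ((1 - x) * x**a * s - x * (1 - x)**a * (1 - s))


def time_to_reach(target, x0=0.466, h=0.01, t_max=100.0):
    """Advance x with fixed-step RK4 until it reaches `target`.

    Returns the elapsed model time, or None if t_max is hit first.
    """
    x, t = x0, 0.0
    while t < t_max:
        if x >= target:
            return t
        k1 = dxdt(x)
        k2 = dxdt(x + 0.5 * h * k1)
        k3 = dxdt(x + 0.5 * h * k2)
        k4 = dxdt(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return None


for target in (0.75, 0.90, 0.99):
    elapsed = time_to_reach(target)
    print(f"x reaches {target:.2f} after about {elapsed:.1f} periods")
```

Because x = 1 is only approached asymptotically, "extinction" has to be defined through a threshold such as 90% or 99%, and the answer depends on which one we pick.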

But let’s not despair, my fellow Catalans. This model isn’t very robust.

Our model’s flaws

This model has several flaws, which is why I didn’t spend much time on the conclusions.

To start, one of the basic premises is that one language competes with another. While this competition premise might make sense somewhere, Catalans are bilingual: two languages coexist.

Yes, there will almost always be only one mother tongue (as we saw, a very small percentage of Catalans have both Spanish and Catalan as their initial language), but that doesn’t mean the other language will disappear.

This is happening all over the world, in fact. English has made it to almost every country on Earth but it isn’t replacing local languages… Just becoming an extra asset for their population.

Another problem with this model is that it’s based on research performed over 20 years ago. Yes, the math is correct, but there might be more advanced and modern literature to go through and hopefully build a more robust version.

Also, we used estimated data: we don’t know the real a, s, and c parameters.

But don’t let these flaws detract from the relevance of the topic or the results here. Catalan, like many other minority languages, is becoming less and less spoken as time goes by. Maybe not in the 10 generations we estimated, but it could disappear at some point.

Unless we do something… And that includes politicians and governments, who should design strategies to boost its usage, increase its number of speakers, and overcome the danger of extinction.

We should all aim to use as many languages as possible, not the opposite.

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more machine learning-based posts I'm preparing.


@polmarin

Resources

[1] Abrams, D., Strogatz, S. Modelling the dynamics of language death. Nature 424, 900 (2003). https://doi.org/10.1038/424900a

[2] Enquesta d’usos lingüístics de la població de Catalunya. GenCat. https://llengua.gencat.cat/ca/serveis/dades_i_estudis/poblacio/Enquesta-EULP/

[3] Wikipedia contributors. (2023, November 29). Least absolute deviations. In Wikipedia, The Free Encyclopedia. Retrieved 07:05, June 28, 2024, from https://en.wikipedia.org/w/index.php?title=Least_absolute_deviations&oldid=1187408818

[4] solve_ivp. SciPy v1.14.0 Manual.

[5] J. R. Dormand, P. J. Prince, "A family of embedded Runge-Kutta formulae", Journal of Computational and Applied Mathematics, Vol. 6, No. 1, pp. 19–26, 1980.

