Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings

Code for the paper accepted at the 10th Workshop on Slavic Natural Language Processing 2025 (SlavicNLP 2025).

TLDR

We analyze semantic change in the Croatian language over 25 years using word embeddings trained on 9.5 million news articles.

Overview

We investigate how word meanings evolve by training skip-gram embeddings on Croatian news articles grouped into five-year periods (2000-2024). Our analysis captures linguistic shifts related to major events such as COVID-19, EU accession, and technological change. We also find evidence that embeddings from the post-2020 period encode increased positivity in sentiment analysis tasks.
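
As an illustration of the five-year split described above, the following minimal sketch groups articles by period before any training. The exact period boundaries (2000-2004 through 2020-2024), the `(year, text)` input shape, and the names `period_label` and `group_by_period` are assumptions made for this example; they are not taken from the scripts in this repository.

```python
from collections import defaultdict

# Assumed boundaries: five consecutive five-year periods covering 2000-2024.
START_YEAR = 2000
END_YEAR = 2024
PERIOD_LENGTH = 5


def period_label(year):
    """Return the label of the five-year period containing `year`, e.g. '2010-2014'."""
    if not START_YEAR <= year <= END_YEAR:
        raise ValueError(f"year {year} is outside {START_YEAR}-{END_YEAR}")
    start = START_YEAR + ((year - START_YEAR) // PERIOD_LENGTH) * PERIOD_LENGTH
    return f"{start}-{start + PERIOD_LENGTH - 1}"


def group_by_period(articles):
    """Group an iterable of (year, text) pairs into a dict keyed by period label."""
    buckets = defaultdict(list)
    for year, text in articles:
        buckets[period_label(year)].append(text)
    return dict(buckets)
```

Each bucket would then serve as the training corpus for one period-specific model.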

Embeddings

We release the embeddings trained on each five-year period in this repository, along with a model trained on the whole 25-year period.

To obtain the embeddings:

1. Navigate to the 5ysplits_models folder.
2. Run the data retrieval script:
```
./get_data.sh
```

This will download and set up all the trained embedding models.
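
Once the models are available, one simple way to look for meaning shifts is to compare a word's nearest neighbors in two periods. The sketch below is a hedged illustration, not the method used in the paper: it assumes the vectors can be exported to the word2vec text format (a header line with count and dimension, then one word and its values per line), which this README does not confirm, and it uses tiny made-up vectors in place of the released models. The function names are hypothetical.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format from an open text handle into {word: [floats]}."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, word, k=3):
    """Return the k words most similar to `word` within one embedding space."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def neighbor_overlap(vectors_a, vectors_b, word, k=3):
    """Jaccard overlap of a word's top-k neighbors in two periods (low = more change)."""
    a = set(nearest_neighbors(vectors_a, word, k))
    b = set(nearest_neighbors(vectors_b, word, k))
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy vectors for illustration only; these are NOT values from the released models.
OLD_TOY = "4 2\nkorona 1.0 0.0\nkruna 0.9 0.1\nvirus 0.0 1.0\npivo 0.8 0.2\n"
NEW_TOY = "4 2\nkorona 0.0 1.0\nkruna 1.0 0.0\nvirus 0.1 0.9\npivo 0.9 0.1\n"

old_vectors = load_text_vectors(io.StringIO(OLD_TOY))
new_vectors = load_text_vectors(io.StringIO(NEW_TOY))
print(nearest_neighbors(old_vectors, "korona", k=1))
print(nearest_neighbors(new_vectors, "korona", k=1))
print(neighbor_overlap(old_vectors, new_vectors, "korona", k=1))
```

Comparing neighbor sets avoids aligning the two vector spaces, since each ranking is computed within a single model. With real files, `io.StringIO(...)` would be replaced by an opened file handle.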

