ChatGPT as Speechwriter for the French Presidents
Abstract
1 Introduction
Based on large language models (LLMs) (Zhao et al., 2023; Liu et al., 2023) trained on huge
corpora, a machine can generate short answers to users’ requests. The produced messages
are clear, coherent, plausible, and free of spelling errors. Faced with such successes, Bubeck et al.
(2023) assert that GPT “produces outputs that are essentially indistinguishable from (even better
than) what humans could produce”. The news, however, also reports examples of such
generated texts that include both incoherencies and inaccuracies (called hallucinations). In this
context, the current study focuses on the written style adopted by ChatGPT [1] when asked to
produce short political messages.
[1] To be precise, GPT is the foundation model employed to generate the next token based on the preceding ones, while ChatGPT is the fine-tuned application (based on GPT) designed to maintain a dialogue with the user.
Currently, several LLMs coexist, such as Google’s Bard, Gemini, and PaLM 2, or Meta’s Llama 2, but
OpenAI’s GPT is the most well known. The initial objective is to produce a chatbot able to
maintain a dialogue, for example, to help the user identify and resolve a problem.
Trained on massive web-text data (e.g., Wikipedia) as well as newspaper [2] and book corpora, these
systems produce short messages, and generating a longer passage requires many
interactions with the writing assistant. The target text genre is therefore a short reply within
a question-answering system, although each reply takes account of the previous interactions.
Against this background, this study focuses on relatively short messages generated by
ChatGPT and compares their stylistic features with those of genuine ones.
More precisely, the main aim of this study is to empirically test whether ChatGPT can adopt the
style of a given author (Savoy, 2020). Moreover, we want to analyze the stylistic differences
between the texts generated by ChatGPT and the messages written by humans. Do such
dissimilarities really occur? If so, are they observable? In addition, can we characterize some
stylistic features of ChatGPT compared with true authors?
In the remainder of this article, we first present some related work, while Section 3
describes the corpus used. Section 4 analyzes some of ChatGPT’s stylistic features by comparing its
POS distribution to the one occurring in speeches written by French presidents. Section 5 presents
analyses based on the vocabularies and the most frequent words occurring in messages written
by humans and those generated by the machine. Section 6 reports additional experiments based on the
mean sentence length and its distribution, while Section 7 verifies whether a machine
can detect a text generated by ChatGPT. Finally, the conclusion reports the main findings of
this study.
[2] On Dec. 27, 2023, the New York Times sued OpenAI and Microsoft for copyright infringement over millions of its articles.
[3] A short tutorial on such models can be found at https://www.datacamp.com/tutorial/how-transformers-work.
2 Related Work
To keep things simple (Wolfram, 2023), the foundation model GPT [3] is able, from a sequence
of initial tokens, to produce a ranked list of plausible next tokens (e.g., words or punctuation
symbols), a list derived from the training documents. For example, after the sequence “the
prime minister of”, the model can propose the next token from the list {UK, India, England, Canada,
France, Australia, …}.
From this list, and depending on some parameters, the system can select the most probable
token (e.g., “UK” in our case), choose uniformly at random among the top k ranked tokens
(e.g., “Canada”), or draw a token at random according to its probability of occurrence in the
training texts (e.g., “India”). This non-deterministic process [4] guarantees that the same request
can produce distinct messages. It is important to note that the choice of the next token is only
reasonable or plausible according to the training sample. This does not mean that the produced
sequence exists in the training documents or that the resulting statement is true. Thus, as with
all LLMs, GPT’s output may include hallucinations (incorrect information) in its answers. The
writing assistant merely provides a reasonable or plausible next token. Moreover, the sources
exploited to produce the text remain unknown to the user.
[4] This random aspect is under the control of the temperature parameter.
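To make these selection mechanisms concrete, the following minimal Python sketch contrasts the three strategies on a toy distribution; the token probabilities are invented for illustration, and the temperature-based rule is a standard formulation, not a description of OpenAI’s internal implementation.

```python
import random

# Toy next-token distribution after "the prime minister of";
# the probabilities are invented for illustration only.
next_tokens = {"UK": 0.40, "India": 0.25, "England": 0.15,
               "Canada": 0.12, "France": 0.05, "Australia": 0.03}

def greedy(dist):
    # Deterministic choice: always the most probable token ("UK").
    return max(dist, key=dist.get)

def top_k_uniform(dist, k=3):
    # Uniform choice among the k most probable tokens.
    top = sorted(dist, key=dist.get, reverse=True)[:k]
    return random.choice(top)

def temperature_sample(dist, temperature=1.0):
    # Draw a token according to probabilities re-scaled by the
    # temperature: low values approach greedy decoding, high
    # values flatten the distribution (more surprising choices).
    weights = [p ** (1.0 / temperature) for p in dist.values()]
    return random.choices(list(dist), weights=weights, k=1)[0]

print(greedy(next_tokens))                  # UK
print(top_k_uniform(next_tokens, k=3))      # UK, India or England
print(temperature_sample(next_tokens, 0.7)) # usually UK or India
```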
When working with such writing assistants, can we discriminate between texts written by a
machine and by a human being? Several studies demonstrate the effectiveness of learning strategies
able to discriminate between responses generated by GPT-3 and those written by human beings
(Guo et al., 2023). With trained black-box classifiers (e.g., RoBERTa), the recognition rate is rather
high (around 95% to 98%). Such effectiveness is obtained even when the target language is not
English (e.g., French in (Antoun et al., 2023)). This high accuracy can drop when facing
a new and unknown domain or when tokens are substituted with misspelled words (in such cases,
the achieved accuracy varies from 28% to 60%). In a related study, Gao et al. (2023)
indicate that human beings are less effective at detecting machine-generated passages.
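The cited studies rely on fine-tuned transformer classifiers (e.g., RoBERTa); purely as a rough stand-in, the sketch below shows the general shape of such a supervised detector with scikit-learn, using hypothetical training texts in place of a real corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled corpus: 1 = machine-generated, 0 = human-written.
texts = ["chers compatriotes, cette annee fut riche en defis ...",
         "mes chers compatriotes, ce soir, je veux d'abord ..."]
labels = [1, 0]

# Word and bigram features feeding a linear classifier; a far cry
# from RoBERTa, but the same train-then-predict workflow.
detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)
print(detector.predict(["bonne annee a toutes et a tous ..."]))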
More problematic is the use of such writing assistants to produce scientific papers. Such applications
are not new (Labbé & Labbé, 2013), nor is the generation of tortured phrases (Cabanac et al.,
2021). According to Gao et al. (2023), scientific abstracts generated by GPT are hard to
detect by experts in the field (success rate around 68%, high-impact journals). In this case, GPT
abstracts appear vague and superficial, focus on details, and may employ alternative spellings of
words. According to Picazo-Sanchez & Ortiz-Martin (2024), GPT has been used in around
10% of the 45,000 papers from 3,009 journals analyzed by four automatic detectors. Recently,
Soto et al. (2024) found that different LLMs exhibit similar and consistent styles.
The main drawback of those LLMs is that they require a large number of texts for training. Of course,
their knowledge is limited in time to the period covered by the training set. One can assume that
the English language represents the largest part of the training sample, implying a better
performance in this language. Moreover, to achieve very good performance, the training sample must
be similar to the test one, with similar topics and text genre. When facing a new domain, a new
text genre, or incorrectly spelled passages (Antoun et al., 2023; Soto et al., 2024), the
performance drops clearly. In addition, such detectors must take account of LLMs updated in
response to their known weaknesses. Therefore, there is a clear need to acquire a style
description, or to derive some stylistic features associated with such LLMs and, in this study, with
GPT.
3 Corpus Overview
To compare the written style of recent French presidents with speeches generated by a machine,
ChatGPT (chat.openai.com) was asked to generate the end-of-the-year address of four
presidents, namely Chirac, Sarkozy, Hollande, and Macron. For each leader and each year, we
asked ChatGPT to generate an end-of-the-year address after submitting to it the corresponding
natural text (NT) as a model.
In our corpus, all NTs and GPTs [5] were corrected and labeled according to the principles specified
by Muller (1977). This implies that an orthographic standardization was applied to ensure that
any differences observed between NTs and GPTs do not stem from fluctuations in word spelling.
Lemmatization adds a label (lemma) to each word in the text, namely its dictionary entry (for
example, the infinitive for a verb) and its grammatical category. The characteristics of
ChatGPT's vocabulary (see Section 4) suggest that ChatGPT does not really know French
grammar or the full vocabulary, but that it works with a wide variety of tokens (graphic forms)
and seems to have some difficulty with certain homographs.
To give an overview of our corpus, Table 1 provides the presidents’ names and Nnt, the number
of words in each presidential message, followed by Ngpt, the number of words of the
corresponding generated text. As we do not know exactly how ChatGPT divides messages into
tokens, we have decided to divide them into words. For example, "aujourd'hui" (today) or
"parce que" (because) are single words in French, because one can find them as entries in
language dictionaries.
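The paper’s lemmatization follows Muller (1977) and was applied with the authors’ own tools; purely as an illustration of the word/lemma/POS labeling described above, a sketch with spaCy’s small French model (an assumption, not the pipeline actually used) would look as follows.

```python
import spacy

# Assumes the model has been installed beforehand:
#   python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Ce soir, nous préparons ensemble la nouvelle année.")
for token in doc:
    # Each word receives its lemma (dictionary entry) and its
    # grammatical category, as in the labeling described above.
    print(f"{token.text:12s} {token.lemma_:12s} {token.pos_}")
```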
As shown in Table 1, the GPTs are, on average, roughly half as long as their models. Only once
(Hollande’s message in 2012) is the generated text slightly longer than the natural one. As a
result, more than half of these generated texts are less than 1,000 words long. Finally, one notes
that the lengths of the generated texts fluctuate considerably, suggesting that, in addition to a
certain instability inherent in the method, ChatGPT draws on resources other than the texts
submitted by the user.
Two questions arise. First, what are the stylistic and vocabulary idiosyncrasies of the GPTs
relative to the NTs used as models? This question will be examined at three levels, namely
parts-of-speech (POS), vocabulary, and style. Second, can GPTs be detected, or are they close
enough to the NTs provided as models to escape a detection procedure based on the computation
of intertextual distances and on automatic classification?
[5] In this paper, we denote by NTs all natural (or real) addresses and by GPTs all messages generated by ChatGPT.
President / Year        Nnt       Ngpt    Ngpt/Nnt
Chirac
 1   2002             1,026        552      0.54
 2   2003             1,179        639      0.54
 3   2004             1,304        446      0.34
 4   2005               980        786      0.80
 5   2006             1,142        420      0.37
Total Chirac          5,631      2,843      0.50
Sarkozy
 6   2007             1,552        822      0.53
 7   2008             1,158        770      0.66
 8   2009             1,182        530      0.45
 9   2010             1,216        693      0.57
10   2011             1,309      1,129      0.86
Total Sarkozy         6,417      3,944      0.61
Hollande
11   2012             1,214      1,434      1.18
12   2013             1,639      1,023      0.62
13   2014             1,547      1,213      0.78
14   2015             1,443        509      0.35
15   2016             1,426      1,056      0.74
Total Hollande        7,269      5,235      0.72
Macron
16   2017             2,387        850      0.36
17   2018             2,417        705      0.29
18   2019             2,510        787      0.31
19   2020             2,175        552      0.25
20   2021             2,129      1,783      0.84
Total Macron          9,818      4,677      0.49
Overall total        30,935     16,699      0.54
Table 1. Lengths in words of the presidents’ addresses (Nnt) and of the texts generated by
the machine (Ngpt)
4 Part-Of-Speech
The word labeling produced by the lemmatizer allows us to observe how ChatGPT employs the main
grammatical categories, and whether the characteristics of the generated messages match those
of the texts given to it as models. To answer this question, we grouped
all the GPTs into a single corpus, which is compared with all the presidential messages merged
together. The results of this comparison are displayed in Table 2.
In this table, the first column indicates the part-of-speech, subdivided according to some morphological
information (e.g., present tense for verbs). The second column displays the frequency per mille in
the messages written by the presidents (Fnt), while the third column (Fgpt) presents the same
information for the generated texts. The fourth column shows the percentage difference
relative to the NTs.
To interpret the values appearing in Table 2, consider the first row: in the presidential
addresses (NTs), the density of verbs is 154.1 per thousand words, while in the generated texts it is
151.9, a difference of -1.5%. The last column indicates whether such a deviation can be asserted
with less than a 1% risk of being wrong.
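The significance criterion of the last column can be illustrated with a standard two-proportion z-test on the per-word densities (a common formulation given here as an assumption; the paper does not detail its exact test, and the counts below are back-computed from the quoted densities and the corpus sizes of Table 1).

```python
import math

def density_ztest(count_nt, n_nt, count_gpt, n_gpt):
    """Two-proportion z-test: |z| > 2.58 means a risk of error
    below 1% (two-tailed) when asserting that the densities differ."""
    p_nt, p_gpt = count_nt / n_nt, count_gpt / n_gpt
    pooled = (count_nt + count_gpt) / (n_nt + n_gpt)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_nt + 1 / n_gpt))
    return (p_gpt - p_nt) / se

# Verb densities from Table 2 (154.1 vs. 151.9 per mille), with the
# corpus sizes of Table 1 (30,935 and 16,699 words) to recover counts.
n_nt, n_gpt = 30_935, 16_699
z = density_ztest(round(0.1541 * n_nt), n_nt, round(0.1519 * n_gpt), n_gpt)
print(f"z = {z:.2f}")
```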
Except for verbs (notably futures, past participles, and infinitives), proper nouns (which
ChatGPT reuses from the original texts without any addition), and demonstrative determiners
(e.g., this), the deviations are significant with a risk of error below 1%.
Leading the overused categories are present participles (an invariable verbal form in French),
possessive determiners (my, your, …), numbers (in fact, the year provided in the prompt, which
ChatGPT tends to repeat too often), adjectives, coordinating conjunctions, and common nouns.
On the other hand, the underused categories are all linked to the verb: pronouns, adverbs, and
subordinating conjunctions. For the verb, rare and complicated tenses and moods are avoided,
such as the imperfect, or even ignored altogether, like the simple past. The
complexity of French verb usage may explain ChatGPT’s relative parsimony. The same goes for
subordinate clauses, which ChatGPT seems to avoid as much as possible. In both cases, this
complexity results in multiple inflections (and therefore as many distinct tokens) and
consequently in low probabilities of occurrence.
Another example tends to confirm this finding. The main subordinating conjunction is “que”
(that), a homograph of the pronoun (who, which) endowed with a high degree of ubiquity
in French sentences. This again results in low probabilities of occurrence for each of the
possible combinations.
To provide a better overview, Table 3 combines the categories into two groups, namely the POS
linked to verbs and those linked to nouns.
5 Vocabularies
In this section, the analysis focuses on lemmas (dictionary entries) belonging first to the verb
group, then to the noun group. In each case, we compare the most frequently used lemmas in
the presidential and generated speeches using their rank and frequency. Since the two samples
are of unequal length, the absolute frequencies are converted into relative ones (expressed per
thousand words). The last column of the following tables displays, as a percentage, the difference
in frequency between the natural texts and the generated ones.
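This conversion is straightforward; the sketch below reproduces the figures of the first row of Table 4 from illustrative counts (the raw counts are back-computed assumptions, only the densities come from the table).

```python
def per_mille(count, corpus_size):
    # Relative frequency expressed per thousand words.
    return 1000 * count / corpus_size

def pct_difference(f_nt, f_gpt):
    # Last column of Tables 4 to 7: (Fgpt - Fnt) / Fnt, in percent.
    return 100 * (f_gpt - f_nt) / f_nt

# "être" (to be): 31.68 per mille among the presidents (30,935 words),
# 22.03 per mille in the GPTs (16,699 words).
f_nt = per_mille(980, 30_935)   # ~31.68
f_gpt = per_mille(368, 16_699)  # ~22.03
print(f"{f_nt:.2f} {f_gpt:.2f} {pct_difference(f_nt, f_gpt):+.0f}%")
```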
Looking at the first row of Table 4, the verb most frequently used by the presidents is “être” (to be),
with a frequency of 31.68 per thousand words. In the texts generated by ChatGPT, it also ranks first,
but with a lower frequency of 22.03‰, giving a difference of -30% compared with the presidents, as
displayed in the last column.
In any natural French text, “être” (to be), “avoir” (to have), and “faire” (to do) are the three most
frequently used verbs, in that order. In the GPTs, “devoir” (which has several meanings in French:
a legal or moral obligation, or a probable achievement in the future) takes the place of “faire”.
Moreover, its frequency doubles in the GPTs compared with the NTs (while “faire” drops by 11%). In
the presidential addresses, it expresses the modality of the probable (“next year should be better”).
ChatGPT has a little trouble with simple verbs, but even more with verbal constructions (such
as “pouvoir faire” (to be able to do)). Three cases are particularly striking. First, “aller” (to go)
(-62%), which in French indicates the near future, the main subject of an end-of-the-year
message, in expressions such as “the next year is going to be better” or “some event is
going to happen next year”. Second, the verb “falloir” (must) has almost completely
disappeared (-93%). Third, one can notice the considerable decline of “dire” (to say) (-78%).
Among the presidents, most of its uses occur in expressions such as “il faut dire” (one must say)
or “ce qui (ne) veut (pas) dire” (which is (not) to say).
In these three cases, the main difficulty stems from the impersonal third-person pronoun
“il” (it), which is the obligatory subject of “falloir” (and frequent before “aller”) and is often
found after a punctuation mark or after the subordinating conjunction “que” (that), another
difficulty for ChatGPT.
Clearly, ChatGPT has a poor grasp of this type of complex construction, no doubt because their
great variability results in low probabilities of occurrence for each of the constructions used
by the presidents. As a result, the same difficulties are amplified for pronouns.
To analyze this question, Table 5 depicts the most frequently used pronouns. One can see that
the pronoun “nous” (we) is the one employed most, both by the presidents (16.10 per thousand words)
and by ChatGPT (24.28‰), an increase of 51% compared with the presidents.
NT Rank  Lemma                      Fnt (‰)   GPT Rank  Fgpt (‰)   (Fgpt-Fnt)/Fnt %
1 nous (we, us) 16.10 1 24.28 +51
2 je (I, me) 15.97 2 14.98 -6
3 qui (who, which) 13.58 3 10.47 -23
4 ce (it, this one) 11.35 6 4.07 -64
5 il (he, she, it) 8.54 9 3.27 -62
6 vous (you) 7.89 4 8.72 +11
7 se (himself, herself…) 5.88 5 4.87 -17
8 que (that) 5.33 7 4.07 -24
9 le (him, her, it) 4.14 11 1.38 -67
10 celui (this one) 3.85 8 2.69 -30
11 tout (everything) 2.94 10 2.47 -16
12 chacun (each one) 2.59 12 1.31 -49
13 ils (they) 2.46 13 0.65 -74
14 y (there) 2.26 14 0.87 -62
15 en (it, some) 2.20 16 0.80 -64
16 cela (this) 1.13 14 0.95 -16
17 autre (other) 0.84 18 0.36 -57
18 on (one) 0.84 29 0.07 -92
19 dont (whose, of which) 0.78 18 0.36 -54
20 lequel (who, which) 0.78 17 0.65 -17
Table 5. Pronouns most frequently used by presidents compared with GPT
(frequencies per thousand words)
Table 5 calls for two general comments. First, apart from “on” (one), which leaves the
GPTs’ list of the twenty most frequent pronouns, the other nineteen are common to both lists. ChatGPT’s
output therefore stays quite close to the provided speech. But, as mentioned earlier in the analysis
of grammatical categories, the texts generated by ChatGPT contain 22% fewer pronouns
than the texts they are supposed to emulate. Consequently, if ChatGPT had respected the hierarchy
of the presidential pronouns, the last column should contain around -22% in each row.
Against this benchmark, the overuse of “we” and “you” stands out. The study of contexts shows that “we”
and “you” are mostly used not as subjects of a verb, but as complements (“I present you my best
wishes”) or as subjects of a relative clause. In addition, they have no homographs.
Second, the personal pronouns “il” (he) and “ils” (they), of which “elle” and “elles” (she, they) are
inflected forms, as well as the pronoun “on” (one), are generally the subject of a complicated verb
such as “falloir” (must), “aller” (to go), or “venir” (to come). They thus typically appear at the head
of a sentence or after an internal punctuation mark, which could explain their high level of under-use.
NT Rank  Lemma                      Fnt (‰)   GPT Rank  Fgpt (‰)   (Fgpt-Fnt)/Fnt %
1 plus (more) 7.44 1 6.62 -11
2 ne (don't) 7.34 2 3.85 -48
3 pas (not) 4.20 4 2.47 -41
4 aussi (also) 3.49 10 0.87 -75
5 où (where) 1.52 8 1.02 -33
6 encore (still) 1.10 15 0.51 -54
7 beaucoup (many) 1.07 28 0.22 -79
7 tout (all) 1.07 6 1.24 +16
9 alors (then) 1.03 18 0.44 -57
10 enfin (finally) 1.00 14 0.58 -42
10 là (there) 1.00 28 0.22 -78
10 toujours (always) 1.00 11 0.80 -20
13 bien (well) 0.94 20 0.36 -62
13 mieux (better) 0.94 20 0.36 -62
15 davantage (more) 0.84 15 0.51 -39
15 jamais (never) 0.84 18 0.44 -48
17 même (even) 0.74 37 0.07 -91
18 aujourd'hui (today) 0.71 13 0.73 +3
18 pourquoi (why) 0.71 28 0.22 -69
20 ensemble (together) 0.68 5 1.31 +93
(…)
37 également (also) 0.29 3 3.71 +1,178
Table 6. Adverbs most frequently used by presidents compared with ChatGPT
(frequencies per thousand words)
In addition, “ce” (that), the object pronoun “le” (him, her, it), “tout” (everything), the clitic
pronoun “en” (of it), and “autre” (other) are homographs of determiners or prepositions. This means
a single token covers two lemmas and many possible constructions. Moreover, these determiners
and prepositions are much more frequent in French than their pronoun homographs.
Clearly, ChatGPT drastically avoids these difficulties. This is also reflected in the use of adverbs,
the last grammatical category of the verb group. Table 6 presents them according to their
frequencies of occurrence in the presidential speeches. The first fifteen are present in both lists,
although not in the same order. One must also recall that this category is clearly under-used by
ChatGPT, on average -32% compared with the natural texts used as models (see Table 2). This
proportion should be borne in mind when interpreting the last column of Table 6.
In any French text, the negative construction (“ne + verb + pas” (verb + not)) is extremely frequent.
It is almost halved in the ChatGPT-generated texts, because it is employed exclusively with a verb
whose subject is often a pronoun, as in “ce n'est pas” (it is not). On the other hand, adverbs usually
employed outside the verb group (e.g., “today”) or within a noun group (e.g., “all”) do not suffer the
same erosion, or are even overused (such as “all” or “together”). The most striking case is
“également” (also), which ranks third among ChatGPT's favorite adverbs, whereas it is only 37th
among the presidents. In fact, all its occurrences appear within a noun group (generally before
an adjective). As the analysis of grammatical categories has shown, ChatGPT favors
words belonging to the noun group, including such adverbs (which would better be called
"adnouns").
NT Rank  Lemma                      Fnt (‰)   GPT Rank  Fgpt (‰)   (Fgpt-Fnt)/Fnt %
1 année (year) 5.04 1 8.87 +76
2 pays (country) 3.04 3 3.27 +8
3 compatriote (compatriot) 2.42 13 2.11 -13
4 emploi (job) 2.17 5 2.69 +24
5 vie (life) 2.07 17 1.74 -16
6 monde (world) 1.94 7 2.47 +27
7 travail (work) 1.88 18 1.60 -15
8 avenir (future) 1.68 6 2.54 +51
9 république (republic) 1.52 4 2.91 +91
10 gouvernement (government) 1.45 41 0.87 -40
11 soir (evening) 1.36 169 0.29 -79
12 nation (nation) 1.29 10 2.25 +74
13 crise (crisis) 1.26 16 1.89 +50
14 face (face) 1.26 10 2.25 +79
15 réforme (reform) 1.23 18 1.60 +30
16 voeu (wish) 1.20 13 2.11 +76
17 mois (month) 1.16 73 0.58 -50
18 service (service) 1.13 41 0.87 -23
19 besoin (need) 1.10 103 0.44 -60
20 confiance (trust) 1.07 49 0.80 -25
… … … … … …
72 défi (challenge) 0.52 2 4.14 +696
108 concitoyen (fellow citizen) 0.39 21 1.31 +238
Table 7. Common nouns most frequently used by presidents compared with GPTs
(frequencies per thousand words)
Conversely, the under-use of the noun “soir” (evening) can be explained in part by the difficulties
ChatGPT encounters with some homographs such as “ce”, which can be a pronoun (that) or a
demonstrative determiner (this). The presidents always say “ce soir” (this evening), yet GPT makes
very little use of “soir”, because after “ce” there is a greater likelihood of encountering a verb, as in
“ce sera” (that will be) or “ce sont” (those are). Of course, a grammatical analysis would easily
separate the determiner “ce” (this) from its homograph pronoun (that), but ChatGPT seems to
struggle with this.
Adjectives are associated with nouns, and Table 8 shows that, among the presidential
addresses, the most frequently used one is “nouveau” (new), with a frequency of 2.46 per thousand
words. In the GPTs, it also ranks first, but with a frequency of 2.76‰, +12% compared with the
presidents. One must keep in mind that ChatGPT uses 13% more adjectives than the presidents do
(see Table 2). With this information, one can infer that the increases in the density of “new” or
“public” are not really significant.
Combining this table with the previous one shows that ChatGPT overuses the main
presidential syntagms: “wishes for the new year”, “better economic and social situation”.
first with an almost equal frequency (110.07‰). The difference between the two (-1%) is not
significant.
6 Sentence Analysis
It is well known that the style of an author can be measured by several indicators (Savoy, 2020). In
this study, after focusing on the POS distribution and the most frequently used lemmas, the
analysis of sentence lengths can reveal some pertinent aspects of ChatGPT’s style. More
precisely, does ChatGPT succeed in producing sentences that are formally similar
to those of the author it has been given as a model? To answer this question, the set of presidential
texts is compared with the set of texts generated by ChatGPT (see also (Monière et al., 2008)). Our
analysis is restricted to sentence lengths.
Second, we consider the ratio between the sentence lengths at the upper bounds of the first and
ninth deciles, i.e., over an interval comprising 80% of the text (cutting both tails of the distribution).
In the presidential texts, this interval is {12.0 - 57.9 words}, a relative spread of 3.84 (interval
width divided by its lower bound). This can be compared with the GPTs’ interval of
{14.4 - 41.2 words}, a spread of 1.88, or half that of the presidents.
In other words, the generated sentence lengths are significantly less diverse and less spread out than
in the natural texts. The generator has difficulty deviating from the average to produce the very
short or very long sentences that are familiar in natural language.
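As a minimal sketch of this computation (assuming numpy, and reading the reported ratio as the interval width divided by its lower bound, which matches the quoted figures):

```python
import numpy as np

def decile_spread(lengths):
    """Upper bounds of the first and ninth deciles, and the relative
    spread (d9 - d1) / d1 over the interval keeping 80% of the
    sentences (both tails of the distribution removed)."""
    d1, d9 = np.percentile(lengths, [10, 90])
    return d1, d9, (d9 - d1) / d1

# Synthetic stand-in lengths; the real corpora give {12.0, 57.9}
# (spread 3.84) for the presidents and {14.4, 41.2} (spread 1.88)
# for the GPTs.
rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=3.2, sigma=0.55, size=1000)
print(decile_spread(lengths))
```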
Figure 1. Distribution of sentence lengths (in words) in the presidential addresses and in the
GPTs (relative frequencies in %)
ChatGPT favors “ordinary” sentences (from 15 to 39 words) and avoids “extraordinary” ones,
producing fewer very short sentences (fewer than 15 words) and fewer very long ones (more than
39 words). ChatGPT respects the mean sentence length of the texts given to it as models, but
distributes these lengths randomly around the mean, as evidenced by the near-equality of the three
central values (mode, median, mean), a characteristic of a Gaussian distribution. By contrast, the
inequality mode < median < mean is the main characteristic of sentence lengths in natural texts
(whether written or oral).
Furthermore, the coefficient of relative variation (CV% in Table 10) indicates that the dispersion
around the mean is lower in the corpus of generated texts than in the presidential one. This
finding explains why, in the figure, the mode of the sentences produced by ChatGPT is clearly
higher than that of the natural sentences. Finally, the absence of very long sentences in the GPTs
may be linked to the low frequency of complex constructions, notably subordinate clauses.
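Both diagnostics, the ordering of the three central values and the CV%, can be computed with Python’s standard library; the sentence lengths below are hypothetical stand-ins, not the corpus data.

```python
import statistics

def style_indicators(lengths):
    """Central values and relative dispersion of sentence lengths.
    Natural texts show mode < median < mean; near-equality of the
    three signals the Gaussian-like behaviour of the generated texts."""
    mean = statistics.mean(lengths)
    return {
        "mode": statistics.mode(lengths),
        "median": statistics.median(lengths),
        "mean": round(mean, 1),
        "CV%": round(100 * statistics.stdev(lengths) / mean, 1),
    }

# Hypothetical sentence lengths (in words), for illustration only.
print(style_indicators([12, 18, 22, 22, 25, 31, 44, 58]))
```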
Regarding style, one conclusion is obvious: ChatGPT treats punctuation marks like words (hence
the roughly identical central values). The generator has therefore identified which words are
most likely to be followed by a comma, a period, etc.
7 Can a Machine Detect Texts Generated by ChatGPT?
Figure 2. Hierarchical ascending classification by presidents
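Figure 2 shows a hierarchical ascending (agglomerative) classification; the sketch below illustrates how such a tree can be built from a matrix of intertextual distances (the matrix values and labels are invented placeholders, not the paper’s measurements).

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Invented, symmetric matrix of intertextual distances between four
# speeches; zero diagonal, placeholder labels.
labels = ["Chirac NT", "Chirac GPT", "Macron NT", "Macron GPT"]
d = np.array([[0.00, 0.18, 0.25, 0.27],
              [0.18, 0.00, 0.26, 0.24],
              [0.25, 0.26, 0.00, 0.17],
              [0.27, 0.24, 0.17, 0.00]])

# Agglomerative ("ascending") clustering with average linkage.
tree = linkage(squareform(d), method="average")
dendrogram(tree, labels=labels)  # drawing requires matplotlib
```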
This conclusion only applies to the end-of-the-year addresses and would need to be confirmed
on all the speeches.
All this shows that ChatGPT is a good imitator. Intertextual distance is therefore no longer able
to identify texts generated by ChatGPT (as it could with older text generators), at least if
users have taken care to submit to the generator a single homogeneous text to be “emulated” [6].
[6] We view this term as preferable to “copied” or “imitated”.
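One common formulation of the intertextual distance (Labbé & Labbé, 2006) rescales the lemma frequencies of the longer text to the size of the shorter one before summing the absolute differences; the sketch below implements this formulation as an assumption, not necessarily the paper’s exact procedure.

```python
from collections import Counter

def intertextual_distance(lemmas_a, lemmas_b):
    """Relative intertextual distance between two lemmatized texts:
    0 = identical vocabulary use, 1 = no lemma in common."""
    fa, fb = Counter(lemmas_a), Counter(lemmas_b)
    na, nb = sum(fa.values()), sum(fb.values())
    if na > nb:                        # make A the shorter text
        fa, fb, na, nb = fb, fa, nb, na
    scale = na / nb                    # rescale B to the size of A
    diff = sum(abs(fa[l] - scale * fb[l]) for l in set(fa) | set(fb))
    return diff / (2 * na)

print(intertextual_distance("le pays aller mieux".split(),
                            "le pays devoir aller mieux".split()))
```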
8 Conclusion
ChatGPT generates short texts, certainly because its prime vocation is to be a chatbot, not a public
writer. Regarding vocabulary and sentences, these generated texts display many singular
characteristics, namely an under-use of verbs (especially when conjugated in tenses other than the
present), of pronouns (especially third persons and indefinites), and of adverbs and subordinating
conjunctions. On the other hand, ChatGPT tends to overuse common nouns, adjectives, possessive
determiners, and prepositions.
When looking at sentence length and its distribution, the generator seems incapable of reproducing
the diversity of sentence lengths that characterizes natural texts. Of course, these obvious
limitations are not insurmountable and may be mitigated in future versions of the generator.
However, despite these limitations, when ChatGPT is placed in the best possible conditions (a
single and short model, a known author, a date, a simple genre), it manages to reproduce the
main formal characteristics of a natural text given to it as a model. Moreover, the
produced text makes sense to the reader who consults it. If this reader has any suspicion, there
is currently no computerized tool to help confirm it. In fact, the techniques used until now
to detect plagiarism or texts generated by paper mills no longer seem appropriate. Provided
that users have taken care to submit homogeneous texts to ChatGPT, intertextual distance seems
inoperative. Similarly, conventional plagiarism detection systems are likely to be ineffective,
since they analyze n-grams (sequences of words), whereas GPT rearranges the vocabulary of the
model in the generated text.
Further experiments will be needed to determine which features could be used to detect generated
texts. Moreover, experiments with other languages could confirm our findings. Finally,
low-resource languages could represent a real challenge for LLMs, owing to the limited text
corpora available.
References
Antoun, W., Mouilleron, V., Sagot, B., and Seddah, D. (2023). Towards a robust detection of
language model-generated text: Is ChatGPT that easy to detect? CORIA-TALN Conference,
Paris, June 2023, 1–14.
Barthélemy, J.P., and Guénoche, A. (1991). Trees and Proximity Representations. New York:
John Wiley.
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T.,
Lundberg, S., Nori, H., Palangi, H., Ribeiro, M.T., and Zhang, Y. (2023). Sparks of artificial
general intelligence: Early experiments with GPT-4. arXiv:2303.12712v5.
Byrne, J.A., and Labbé, C. (2017). Striking similarities between publications from China describing
single gene knockdown experiments in human cancer cell lines. Scientometrics,
110(3):1471–1493.
Cabanac, G., Labbé, C., and Magazinov, A. (2021). Tortured phrases: A dubious writing style
emerging in science. Evidence of critical issues affecting established journals.
arXiv:2107.06751.
Cover, T.M., and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on
Information Theory, 13(1):21–27.
Gao, C.A., Howard, F.M., Markov, N.S., Dyer, E.C., Rameh, S., Luo, Y., and Pearson, A.T.
(2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors
and blinded human reviewers. npj Digital Medicine, 6(75):1–5.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. Cambridge, MA: The MIT Press.
Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., and Wu, Y. (2023). How
close is ChatGPT to human experts? Comparison corpus, evaluation and detection.
arXiv:2301.07597.
Labbé, C., and Labbé, D. (2006). A tool for literary studies. Literary & Linguistic Computing,
21(3):311–326.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English.
Journal of Quantitative Linguistics, 14(1), 33–80.
Labbé, C. and Labbé, D. (2013). Duplicate and fake publications in the scientific literature: how
many SCIgen papers in computer science? Scientometrics, 94(1):379–396.
Liu, Z., Lin, Y., and Sun, M. (2023). Representation Learning for Natural Language Processing.
Singapore: Springer Verlag.
Monière, D., Labbé, C., and Labbé, D. (2008). Les styles discursifs des premiers ministres
québécois de Jean Lesage à Jean Charest. Canadian Journal of Political Science, 41(1):43–69.
Muller, C. (1977). Principes et Méthodes de Statistique Lexicale. Paris: Hachette.
Picazo-Sanchez, P., and Ortiz-Martin, L. (2024). Analysing the impact of ChatGPT in research.
Applied Intelligence, https://doi.org/10.1007/s10489-024-05298-0.
Savoy, J. (2018). Is Starnone really the author behind Ferrante? Digital Scholarship in the
Humanities, 33(4):902–918.
Soto, R., Koch, K., Khan, A., Chen, B., Bishop, M., and Andrews, N. (2024). Few-shot detection
of machine-generated text using style representations. arXiv:2401.06712v1.
Savoy, J. (2020). Machine Learning Methods for Stylometry. Authorship Attribution and Author
Profiling. Cham: Springer.
Van Noorden, R. (2014). Publishers withdraw more than 120 gibberish papers. Nature, 24 February
2014. https://doi.org/10.1038/nature.2014.14763.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and
Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing
Systems, 30.
Wolfram, S. (2023). What Is ChatGPT Doing … and Why Does It Work? Champaign, IL:
Wolfram Media.
Zhao, W., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z.,
Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, P., Nie, J.Y.,
and Wen, J.R. (2023). A survey of large language models. arXiv:2303.18223v11.