Chaos analysis of natural languages

To date, the main focus of our research has been the study of chaotic dynamics from the viewpoint of nonlinear physics and non-equilibrium statistical mechanics, together with large deviation statistical analysis of random walks on networks. In these contexts, we are often interested in the distribution of local averages of some randomly or chaotically varying quantity.

The rate function quantifies the rate at which such a distribution converges to the normal distribution, in conformity with the Central Limit Theorem. We can elucidate the characteristics of the fluctuations of a time series through large deviation analysis employing the rate function and other statistical quantities. One of our research topics is to deepen our understanding of large deviation techniques of this kind, and a particular application is the study of the fluctuations hidden in natural languages. One approach to this problem is to regard the words appearing in natural language text as forming a time series and to apply large deviation analysis in the conventional manner.
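As a minimal illustrative sketch (not taken from our actual analysis), the following Python code estimates the empirical rate function I(x) ≈ −(1/n) log P(local average ≈ x) from block averages of a synthetic binary sequence; the function names, bin count, and block length are all hypothetical choices for illustration.

```python
import math
import random

def local_averages(series, n):
    """Split the series into non-overlapping blocks of length n
    and return the average of each block."""
    return [sum(series[i:i + n]) / n
            for i in range(0, len(series) - n + 1, n)]

def empirical_rate_function(series, n, bins=10):
    """Estimate I(x) ~ -(1/n) log P(local average near x) from a
    histogram of the block averages."""
    avgs = local_averages(series, n)
    lo, hi = min(avgs), max(avgs)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for a in avgs:
        counts[min(int((a - lo) / width), bins - 1)] += 1
    total = len(avgs)
    return {lo + (k + 0.5) * width: -math.log(c / total) / n
            for k, c in enumerate(counts) if c > 0}

random.seed(0)
coin = [random.randint(0, 1) for _ in range(100_000)]
I = empirical_rate_function(coin, n=50)
# The estimated rate function is smallest near the mean (1/2)
# and grows for averages that deviate from it.
```

By the Central Limit Theorem, the local averages concentrate around the mean, so the minimum of the rate function sits at the typical value, while its growth away from the minimum quantifies how rare large deviations are.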

Until now, the quantitative analysis of text has mostly involved investigating the statistics of “static” quantities, such as sentence length and the usage frequency of each grammatical part of speech, to name just two of many examples.

Among the applications of this type of analysis is authorship authentication for literary works. Considering English text, for example, this analysis can be based on sequences of numbers that represent the numbers of letters in (i.e., the length of) each word. Here we consider such analysis in some detail. The conventional approach in this context is to analyze statistical quantities regarding word length, such as the average, the mode and the frequency distribution of word lengths.

However, we would like to interpret a piece of text as something that transpires in time, with successive words appearing at successive times in a series. Then, the word lengths in a piece of text can be regarded as forming a random time series. With this understanding, we could apply large deviation analysis to text, just as we would to an ordinary time series.

With such a treatment, we could investigate many quantities, in particular the distribution of local time averages taken over finite intervals. In contrast with the conventional approach, which treats a sentence “statically” and ignores the sequential relations within it, this type of analysis would allow us to capture characteristics of the flow within sentences. We believe that fluctuations hidden in sentences in this sense reflect each author’s particular style and the naturalness of their writing.
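A minimal sketch of this idea in Python, using a short hypothetical text fragment (the function names are our own illustrative choices): the text is converted into a word-length time series, and local averages are then taken over sliding windows of fixed size.

```python
import re
from statistics import mean

def word_length_series(text):
    """Regard the text as a time series: the n-th value is the
    number of letters in the n-th word."""
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]

def local_average_distribution(lengths, window):
    """Local time averages of word length over sliding windows
    of a fixed size."""
    return [mean(lengths[i:i + window])
            for i in range(len(lengths) - window + 1)]

text = ("In the beginning God created the heaven and the earth. "
        "And the earth was without form, and void.")
series = word_length_series(text)       # e.g. starts [2, 3, 9, ...]
avgs = local_average_distribution(series, window=5)
```

The resulting list of local averages is exactly the kind of quantity whose distribution a large deviation analysis would examine, while still respecting the order in which the words appear.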

Another approach to the analysis of text is to analyze networks representing the interrelations among words. It has been found that there exists a universal statistical law regarding the frequency of word usage: an inverse power law known as Zipf’s law. In connection with this empirical law, it has been conjectured that a kind of preferential word selection is hidden in the process of language acquisition, and that this acts as the principle from which a scale-free network structure emerges in the learning process. This scale-free network represents the connections between words, and these connections influence word selection. With this underlying idea, the analysis of natural language from the point of view of complex networks is being pursued.
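As a toy illustration of the rank–frequency analysis behind Zipf’s law (the corpus below is artificial, constructed so that frequency is exactly proportional to 1/rank):

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, frequency) pairs sorted by descending frequency.
    Zipf's law predicts frequency roughly proportional to 1/rank."""
    counts = Counter(words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# A toy corpus following Zipf's law exactly:
# "a" appears 6 times, "b" 3 times, "c" 2 times (6/1, 6/2, 6/3).
corpus = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
pairs = rank_frequency(corpus)
products = [r * f for r, f in pairs]
# Under Zipf's law, rank * frequency is roughly constant.
```

For a real corpus one would plot rank against frequency on log–log axes and look for an approximately straight line of slope −1.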

Here, we note that phenomena obeying Zipf’s law and those exhibiting 1/f spectra are often regarded as interrelated. For example, the empirical findings that the frequency and magnitude of earthquakes conform to a Zipf-like statistical law and that the distribution of intervals between aftershocks obeys a power law akin to a 1/f spectrum are regarded as corresponding phenomena. However, in the study of natural languages, there has so far been very little analysis of the 1/f-spectrum type.

Also, it is interesting that the situation regarding the intervals between words in the Old Testament appears to be quite complicated. Specifically, according to preliminary studies, the intervals between the appearances of certain words exhibit different types of distributions, for example exponential distributions and power law distributions, depending on the part of speech and the case.

If we consider networks of words and random walks on them, then we can interpret the interval between consecutive appearances of a given word as the recurrence time of the corresponding node in such a random walk. Focusing on the statistical characteristics of such quantities, it should be possible to acquire an understanding of natural language in terms of 1/f spectra.
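A minimal sketch, assuming a small hypothetical word network (the four-word graph below is purely illustrative): the recurrence time of a node is read off as the interval between consecutive appearances of the corresponding word in a random walk on the graph.

```python
import random

def recurrence_intervals(sequence, item):
    """Intervals between consecutive appearances of `item` in the
    sequence -- the recurrence times of the corresponding node."""
    positions = [i for i, x in enumerate(sequence) if x == item]
    return [b - a for a, b in zip(positions, positions[1:])]

# A hypothetical 4-word network; edges point to possible next words.
graph = {"the": ["cat", "dog"], "cat": ["sat", "the"],
         "dog": ["sat"], "sat": ["the"]}

random.seed(1)
node = "the"
walk = [node]
for _ in range(10_000):
    node = random.choice(graph[node])  # unbiased random walk step
    walk.append(node)

intervals = recurrence_intervals(walk, "the")
```

The same `recurrence_intervals` function applies directly to a real text, with `sequence` being the word stream and `item` the word under study; the shape of the resulting interval distribution (exponential versus power law) is then the quantity of interest.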

Language is used in many ways. In some of its uses, it acts as more than a means to straightforwardly convey ideas. Here I give two such examples.

I am certified as an industrial counselor. This work consists mainly of providing psychological support through spoken interaction. In this interaction, the use of language somehow transcends the role of expressing individual ideas, as information is also conveyed by the manner in which those ideas are expressed. Language can also be used in ways that are apparently very far removed from the conveyance of ideas.

For example, several medical facilities have recently begun using a cutting-edge technique known as “Diagnostic Aid for Depression Using an Optical Topography Test” in the diagnosis of depression. In this test, the patient is asked, for example, to utter as many nouns as they can think of that begin with the letter “A.” During this time, measurements are made of the blood flow in the brain. The results of this test then provide information that is useful in the diagnosis of disease.

These two examples illustrate how language can play important roles in wide-ranging and perhaps unexpected contexts. It would be interesting to investigate the relationship between these aspects of language and the chaotic aspects of language.


Lecturer Syuji Miyazaki


Nonlinear Physics Group, Department of Advanced Mathematical Sciences, Graduate School of Informatics, Kyoto University