ARE WE DIFFERENT? MALE AND FEMALE “REGISTERS” IN CHINESE CORPUS DATA

The aim of this paper is to examine language data with regard to potential differences between male and female registers. Corpus linguistics is used as the basic (mostly quantitative) methodological approach to the topic, and the Hanku corpus, more particularly its subcorpus Litchi, is used as the primary source of language data. The data are presented in the form of tables together with a brief analysis. The results indicate considerable variation between male and female registers in some areas (lexicon, part-of-speech proportions, etc.). However, in other areas (e.g. prosody) there is no deviation at all. These indicators of variation will be the subject of further, more detailed research.

As can be seen, the male authors prefer to use longer sentences. For comparison, the average sentence length is 29 tokens in the subcorpus zh-law and 32 tokens in the web corpus web-zh. 5
4 For this purpose we use the following queries. Number of tokens in the male or female subcorpora: [word=".*"] within <doc gender="F"/> within <doc authors_origin="CN"/>. Number of sentences: <s/> within <doc gender="M"/> within <doc authors_origin="CN"/>.
5 See more in GAJDOŠ, Ľ. Chinese legal texts - Quantitative Description. In Acta Linguistica Asiatica, 2017, Vol. 7, No. 1, pp. 77-87.

Length of Words
Now, let us compare the length of words in different corpora, namely web-zh and zh-lit-1.1 (Litchi). As tokens of punctuation (PU), numbers (CD, OD) and foreign words (FW) might influence the length preferences, they are excluded from the queries ([tag!="PU|CD|OD|FW"]). It is obvious from the table above that there are some differences between the subcorpora. However, the percentages for male and female authors hardly vary at all; that is to say, there is no evidence of a word-length preference that distinguishes male from female authors. Again, the differences are only slight: the use of punctuation, numbers and foreign words (6%) and the use of monosyllabic words (2%). Now let us compare the language of translations (for the sake of simplicity, all literary works by authors not of Chinese origin are considered translations, i.e. authors_origin!="CN|TW") with that of works of Chinese origin (authors_origin="CN|TW"). As in the previous comparison, there are no pronounced differences here. This is just one parameter of many, and further research may reveal the existence of the register variation observed in other languages.
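The word-length comparison described above can be illustrated with a minimal sketch. It assumes that the tokens have been exported from the corpus as (word, tag) pairs and that word length is measured in characters, which for Chinese approximates the number of syllables. The function name word_length_distribution and the toy sample are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# Tags excluded in the query [tag!="PU|CD|OD|FW"]:
# punctuation, cardinal and ordinal numbers, foreign words.
EXCLUDED_TAGS = {"PU", "CD", "OD", "FW"}


def word_length_distribution(tokens):
    """Return {length_in_characters: percentage} for (word, tag) pairs.

    Tokens whose tag is in EXCLUDED_TAGS are skipped before counting.
    """
    lengths = Counter(
        len(word) for word, tag in tokens if tag not in EXCLUDED_TAGS
    )
    total = sum(lengths.values())
    if total == 0:
        return {}
    return {n: 100.0 * count / total for n, count in sorted(lengths.items())}


# Hypothetical toy sample for illustration only, not corpus data.
sample = [("我", "PN"), ("喜欢", "VV"), ("语言学", "NN"), ("。", "PU"), ("三", "CD")]
distribution = word_length_distribution(sample)
```

Running the same function on the male and the female token lists separately would give two distributions that can be compared length by length.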

Part of Speech
Now let us observe part-of-speech (POS) variation. There are 52 tags in the Hanku corpus which correspond to parts of speech. 6 As Petrovčič states, the key doc.gender represents the gender of the author, not the translator. That is why only male and female authors from CN 7 are chosen.
The following table shows the proportions of the POS tags in the subcorpus zh-lit-1.1 for Chinese authors only (CN). Note also that the absolute frequencies are calculated for the first 10 million tokens in the subcorpus. The differences in IPM (Instances Per Million) 8 and the relative differences 9 are calculated in the following table. Let us calculate the relative percentage difference df of the frequencies for each POS tag. The results are presented in the following table. It is necessary to bear in mind that the difference df (%) does not take into account the absolute frequency of a given POS tag. The table above reveals some discrepancies (according to gender) in the proportions of the POS tags in the Litchi subcorpus.
8 The POS percentage is the IPM divided by 10,000. The difference in percentages is calculated by subtraction.
9 To calculate the relative percentage difference between the male and female IPM frequencies, we have modified the formula for relative difference: as we have only male or female values ("N/A" texts are still by male or female authors), we use the arithmetic mean in the denominator.
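The two measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the exact procedure used for the tables: the sign convention (male minus female) and the function names ipm and relative_percentage_difference are assumptions, and the counts are hypothetical. The only detail taken from the text is that the arithmetic mean of the two IPM values serves as the denominator.

```python
def ipm(count, subcorpus_size):
    """Instances per million tokens of the given subcorpus."""
    return count * 1_000_000 / subcorpus_size


def relative_percentage_difference(male_ipm, female_ipm):
    """Relative percentage difference df with the arithmetic mean
    of the two values in the denominator.

    Assumed sign convention: positive means more frequent in male texts.
    """
    mean = (male_ipm + female_ipm) / 2
    if mean == 0:
        return 0.0  # the tag or token occurs in neither subcorpus
    return (male_ipm - female_ipm) / mean * 100


# Hypothetical counts for one POS tag in two 10-million-token samples.
male = ipm(150_000, 10_000_000)
female = ipm(180_000, 10_000_000)
ipm_diff = male - female
df = relative_percentage_difference(male, female)
```

Because df is normalised by the mean frequency, a rare tag and a frequent tag can yield the same df, which is why the absolute frequency has to be kept in mind when reading the table.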
To conclude, from the above data one may assume that female authors (compared to male authors) use more punctuation (PU), sentence particles (SP), pronouns (PN), adjectives (VA), etc., and this might be described, in the context of lexis, as more emotional or personal. The differences are more noticeable in function words (xūcí 虚词) and less so in the group of notional words (shící 实词).

Most frequent verbs and adjectives
When comparing concrete words (tokens), two different metrics are used: the IPM difference (frequency difference) and the relative percentage difference df. The former might (to some extent) be compared to keywords in Sketch Engine 10: the redder the word in the difference row, the more it is used by male authors, and vice versa. The latter may in some cases be more relevant (see for example the token ài 爱 [to love]). In this and the following sections the IPM measure is used.
Let us start with verbs and adjectives 11, which are the most frequent POS of the whole corpus. CQL query: [tag="VV|VA|VC|VE"] within <doc authors_origin="CN"/>
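To illustrate why the two metrics may rank words differently, consider the following minimal sketch. The IPM values are invented for illustration and are not taken from the corpus; the function names and the male-minus-female sign convention are likewise assumptions.

```python
def ipm_difference(male_ipm, female_ipm):
    """Plain frequency difference in instances per million."""
    return male_ipm - female_ipm


def relative_percentage_difference(male_ipm, female_ipm):
    """Difference relative to the arithmetic mean of both values, in %."""
    mean = (male_ipm + female_ipm) / 2
    return 0.0 if mean == 0 else (male_ipm - female_ipm) / mean * 100


def rank_tokens(freqs, metric):
    """Order tokens by the absolute value of the chosen metric, largest first."""
    return sorted(freqs, key=lambda t: abs(metric(*freqs[t])), reverse=True)


# Hypothetical (male IPM, female IPM) pairs, not corpus data.
freqs = {
    "说": (9000.0, 9600.0),
    "爱": (300.0, 700.0),
    "看": (4000.0, 4100.0),
}

by_ipm = rank_tokens(freqs, ipm_difference)
by_df = rank_tokens(freqs, relative_percentage_difference)
```

With these invented values the high-frequency verb 说 leads the IPM ranking, whereas the less frequent 爱 leads the df ranking, which shows how a relative measure can bring out a contrast that the raw frequency difference ranks lower.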

Most frequent nouns
As in the previous section, let us compare concrete nouns in the male and female subcorpora. To avoid tokens specific to particular literary works, the following CQL query is used: [tag="NN|NT"] within <doc authors_origin="CN"/>. The original intention was to compare the 100 most frequent nouns in both subcorpora. However, because of some noisy data (inadequate tokenization and POS annotation), only 99 are presented. As already stated, the lexicon is topic-related, and this point will be developed in further, more detailed research.

Rest of the POS tags
For the rest of the tokens (except punctuation PU, numbers CD, OD, proper nouns NR), the following CQL query is used: [tag="AD|PN|M|AS|P|DEC|DEG|VA|DT|LC|JJ|CC|SP|DEV|BA|MSP|CS|DER|SB|LB|ETC|IJ|FW"] within <doc authors_origin="CN"/>. We also set the frequency limit to 5000 (absolute frequency in zh-lit-1.1).
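Queries of this shape can be assembled programmatically, which helps to avoid typing errors in long tag lists. The sketch below only builds the query string and filters a frequency list; it does not contact any corpus manager. The helper names build_cql and apply_frequency_limit are hypothetical, and treating the limit of 5000 as a minimum absolute frequency is an assumption.

```python
def build_cql(tags, origin="CN", gender=None, negate=False):
    """Build a CQL query string in the form used in this paper.

    tags   -- POS tags joined into one alternation, e.g. "NN|NT"
    negate -- use tag!= instead of tag= to exclude the listed tags
    gender -- optional value for the doc.gender key ("M" or "F")
    """
    operator = "!=" if negate else "="
    alternation = "|".join(tags)
    query = f'[tag{operator}"{alternation}"]'
    if gender is not None:
        query += f' within <doc gender="{gender}"/>'
    query += f' within <doc authors_origin="{origin}"/>'
    return query


def apply_frequency_limit(counts, minimum=5000):
    """Keep only tokens whose absolute frequency reaches the limit."""
    return {token: n for token, n in counts.items() if n >= minimum}


noun_query = build_cql(["NN", "NT"])
female_rest_query = build_cql(["AD", "PN", "SP"], gender="F")
# Hypothetical absolute frequencies, for illustration only.
kept = apply_frequency_limit({"的": 120_000, "啊": 4_200})
```

Here noun_query reproduces the noun query from the previous section, and the gender argument adds the same within clause that is used for the male and female subcorpora.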

Absolute IPM difference
When searching for the most frequent tokens in zh-lit-1.1 (Word list User Interface), the following CQL query is used: [tag!="PU|CD|OD|NR"] within <doc authors_origin="CN"/>. It should be noted that the IPM is calculated for the male and female subcorpora separately and not for the whole corpus zh-lit-1.1. Also, the IPM for, e.g., a particular verb may differ from the IPM presented in the following table. The tokens in the table below are not restricted to any tag, i.e. a certain verb may belong to two or more part-of-speech tags.

Conclusion
To conclude, from the language data presented in this paper, it is apparent that there are several differences between the male and female registers of literary texts. Let us highlight these. It is worth noting that the results and conclusions support our basic assumption, but further research on a larger data set must be conducted to confirm it.
Finally, in answer to the question "are we different?": yes, to some extent…