Learning to read starts with recognizing the individual words in a text. Gough (1984) described visual word recognition as the foundation of the reading process. Much emphasis has been placed on the automaticity of visual word recognition: rapid visual word recognition is considered necessary to take full advantage of one’s reading capabilities in a given language (Roberts et al., 2011). Lack of automatic word recognition has been identified as a major impediment to reading fluency and a main cause of reading disorders (Fletcher et al., 2007). In second language acquisition as well, vocabulary knowledge is a good predictor of language proficiency (Cheng & Matthews, 2018; De Wilde et al., 2020; Milton, 2013). Consequently, visual word recognition is one of the most researched processes in cognitive psychology (Lupker, 2005), studied with a variety of experimental tasks (Rastle, 2016; Roberts et al., 2011).

Several factors have been investigated as determinants of visual word recognition performance, among which are word frequency, i.e. the number of times a given word is encountered in the language (Brysbaert et al., 2011; Brysbaert et al., 2018; van Heuven et al., 2014); age of acquisition (AoA), the age at which a word is typically acquired (Brysbaert, Stevens, et al., 2014a); word familiarity, the extent to which language users know and use the word in their everyday life (Gernsbacher, 1984); neighbourhood density, the number of orthographically and phonologically similar words a word has (Yarkoni et al., 2008); word length, the number of orthographic (letters), phonological (phonemes or syllables) or morphological (morphemes) units a word comprises (Ferrand et al., 2011); concreteness, the degree to which a word refers to an entity that can be experienced through the senses (Brysbaert et al., 2014a, b); valence, whether the word is perceived as negative or positive; and arousal, whether a word is perceived as calming or exciting (Kuperman et al., 2014; Warriner et al., 2013).

These word characteristics are correlated with each other, making it difficult to decide which one is the most important. Still, there is broad agreement that if one has to be chosen, it is word frequency (Balota et al., 2007; Brysbaert et al., 2018; Mandera et al., 2020). Preston (1935) first reported that high-frequency words are recognized faster and more accurately than low-frequency words. Murray and Forster (2004) likewise identified word frequency as one of the most important stimulus variables for predicting the time needed to recognize a word. In large-scale visual word recognition studies, word frequency accounts for up to 30–40% of the variance in reaction times (RTs), of which 5% cannot be explained by the other variables (Brysbaert et al., 2011; Brysbaert et al., 2016). AoA and word length each explain 2% of unique variance; the other variables usually less than 1%. Together, the variables explain 40–60% of the RT variance in various databases.

For a long time, the standard measure of word frequency was the frequency of a word per million words (fpmw), with low-frequency words defined as those below 5 fpmw and high-frequency words as those above 100 fpmw. However, van Heuven et al. (2014) pointed out problems with fpmw as a frequency measure and proposed the Zipf score instead. The Zipf scale is a 7-point logarithmic scale on which values typically range between 1 (i.e. 1 per 100 million words) and 7 (10,000 per million words); words are categorized as low-frequency when the Zipf score lies between 1 and 3, and as high-frequency when it lies between 4 and 7 (Brysbaert et al., 2018; van Heuven et al., 2014).

Given the impact of the variables mentioned above, it is clear that any study using words as stimuli, either to investigate the process of visual word recognition itself or to investigate other mental processes such as attention, perception, memory or emotion, must take these variables into account (Balota et al., 2007). Therefore, researchers interested in using words as stimuli need information about these variables, and especially about word frequency, in order to control for them or, in some cases, to manipulate them in factorial designs.

It is therefore unsurprising that researchers have compiled large databases with information about lexical variables, such as word frequency and word length, for many languages. The language with the most resources is English, which has CELEX (Baayen et al., 1995), the MRC Psycholinguistic Database (Coltheart, 1981), the English Lexicon Project (Balota et al., 2007), SUBTLEX (Brysbaert & New, 2009; van Heuven et al., 2014), and the English Crowdsourcing Project (Mandera et al., 2020). French has Lexique (New et al., 2004) and Megalex (Ferrand et al., 2018). Spanish has EsPal (Duchon et al., 2013) and SPALEX (Aguasvivas et al., 2020). German has dlexDB (Heister et al., 2011). More recently, lexical databases have been compiled for languages that were traditionally less studied, like Modern Greek (Ktori et al., 2008; Kyparissiadis et al., 2017), Modern Standard Arabic (Boudelaa & Marslen-Wilson, 2010), Malay (Yap et al., 2010), and Mandarin Chinese (Sun et al., 2018).

The case of Hindi

One of the hitherto understudied languages from the psycholinguistic perspective is Hindi. Hindi is one of the official languages of India and is spoken by over 43% of the Indian population, amounting to over 520 million individuals residing in India, of whom over 320 million identify Hindi as their native language, as per the Language Census of India 2011. Hindi is written in the Devanagari script, which many researchers classify as an alphasyllabary: it resembles a purely syllabic script like the Japanese Kana, in that each grapheme or akshara represents one syllable (i.e. a CV combination), but at the same time it shares features with alphabetic scripts, as it treats consonants and vowels separately within an akshara (Kandhadai & Sproat, 2010; Nag & Snowling, 2012; Rao & Singh, 2015; Vaid & Gupta, 2002). Recently there has been some debate about the classification of the script. On the one hand, it has been proposed that it is an alphasyllabary script (Rimzhim et al., 2014); on the other hand, it has been argued that the peculiar characteristics of the Devanagari script place it in a category of its own, better described as an “abugida” (Bright, 2000; Share & Daniels, 2016).

Several features of Hindi, in addition to its number of speakers, make it an interesting candidate for visual word recognition research. Each consonant symbol in Hindi carries an inherent vowel, which is pronounced unless it is specifically deleted. For example, a consonant such as म (/m/) carries the inherent vowel (the /a/ sound) and is pronounced /ma/ unless the vowel is specifically omitted from the pronunciation (Kandhadai & Sproat, 2010). The rules governing the articulation of this inherent schwa vowel have been discussed in a range of studies, and authors have offered competing accounts. A detailed discussion of the rules for schwa deletion is beyond the scope of the current paper; readers are referred to Ohala (1983, 1987), Pandey (1990, 2014), and Choudhary and Basu (2002). Further, in Hindi, vowels are often written as diacritics on the consonant symbols in a non-linear fashion, i.e. after (e.g. मी, /m/ + /i:/), above (e.g. मै, /m/ + /ae/), below (e.g. मू, /m/ + /u:/) or even before (e.g. मि, /m/ + [I]) the consonant. This arrangement of vowels is similar to that of the Korean script Hangul, although, unlike in Hangul, the vowel diacritics can be treated as subordinate elements and generally appear smaller than the consonant symbols (Kandhadai & Sproat, 2010). Consonants can also be written as ligatures or subscripts, as half consonants, e.g. क् + म = क्म (/km/) in the word हुक्म (/hukm/) (Kandhadai & Sproat, 2010). Together, these features make Devanagari a script that contains visually simple as well as visually complex words (Rao & Singh, 2015) and one that offers several interesting properties to investigate.

A small number of studies have been published on visual word processing in Hindi (Das et al., 2009; Das et al., 2011; Kandhadai & Sproat, 2010; Rao & Singh, 2015; Vaid & Gupta, 2002). Psycholinguists have been interested in several unique features of Devanagari. For instance, in words like कीमत (/ki:mət/), the /i:/ is written after the consonant and is also pronounced after it; in words like किस्मत (/ki/ + /sm/ + /ət/), however, the /i/ appears visually before the consonant क (/k/), i.e. ि + क is written कि, but is pronounced after the consonant symbol. This nonlinearity or visuospatial incongruity has been studied for its effects on naming latencies and errors (Vaid & Gupta, 2002), phonemic awareness (Kandhadai & Sproat, 2010) and visuospatial complexity (Rao & Singh, 2015). Other studies have investigated the cortical areas involved in reading Hindi (Winskel et al., 2013).

A limiting factor for research in Hindi is that there is no validated list of word frequency norms. As a result, many studies did not mention the source of word frequency information for their stimuli (Das et al., 2009; Das et al., 2011; Kumar et al., 2010; Vaid & Gupta, 2002). Other studies relied on lab-based ratings of subjective frequency (Rao & Singh, 2015).

One notable study looking at the effect of word frequency in Hindi (more precisely, unigram and bigram frequency) is Husain et al. (2015). Using eye tracking, they examined how word characteristics affect reading times and sentence comprehension. They found that word frequency affected first-pass reading times, regression path duration, total reading times and outgoing saccade length. They also found a word length effect in reading (longer words led to longer fixations). Word frequency was estimated from the beta version of the Hindi-Urdu treebank data, which contained 400,000 words. The authors used the same word frequency measure to control for word frequency in a more recent paper investigating the role of expectation and working memory constraints in Hindi comprehension (Agrawal et al., 2017).

These findings are encouraging because they are among the first reports of frequency and graphemic complexity effects in Hindi word processing. At the same time, they illustrate the paucity of visual word recognition research in Hindi, most likely caused by the lack of lexical resources of the kind mentioned earlier for other languages. In the absence of well-controlled information about lexical factors, it is difficult for interested researchers to plan and run large experiments investigating visual word recognition in Hindi.

Currently available lexical resources for Hindi

To the best of our knowledge, at the moment there is no existing database that provides information about the various lexical variables in Hindi. However, there are a few Hindi corpora which have been compiled by different groups of researchers, both in India and abroad, that can be used to compute some measures. Here is a brief overview of the available corpora:

EMILLE

The Enabling Minority Language Engineering (EMILLE) project, undertaken by the Universities of Lancaster and Sheffield, UK, is by far the most notable effort, compiling and documenting corpora for 14 South Asian languages and aggregating over 96 million words from both spoken and written sources. The project also developed a parallel corpus of English and five other languages (Baker et al., 2002).

The EMILLE project was mainly aimed at building corpora for South Asian languages, extending the General Architecture for Text Engineering (GATE) to these languages, and creating basic language engineering (LE) tools. More specifically, the project focused on facilitating translation from these South Asian languages into English. It was primarily meant for the language engineering community rather than for psychologists or psycholinguists. Nonetheless, the EMILLE project compiled written corpora for 13 Indian languages, namely Hindi, Bengali, Punjabi, Gujarati, Tamil, Urdu, Marathi, Oriya, Assamese, Kashmiri, Malayalam, Kannada and Telugu. The monolingual written corpora in EMILLE incorporated the corpora collected by the Central Institute of Indian Languages, Mysore, India. The project further included material from news websites and spoken corpora for five Indian languages (Hindi, Bengali, Urdu, Gujarati and Punjabi). Most relevant to the current discussion, the monolingual written corpus for Hindi contains about 12.3 million words and the spoken corpus about 5.8 million words.

The EMILLE corpus does not provide explicit word frequencies or, for that matter, any other lexical variables that may be useful to researchers in the psycholinguistic or psychology community. The corpus can be obtained for research purposes, but further processing must be done to get the relevant information. As a result, the resource has been underused.

English-Hindi parallel corpus by the Indian Institute of Technology Bombay

The IIT-Bombay corpus is a collection of 1.49 million parallel Hindi-English segments and is mainly aimed at facilitating Hindi-English machine translation (Kunchukuttan et al., 2017). The corpus was compiled from a variety of resources, including judicial domain corpora, the Hindi-English WordNet, the Gyan Nidhi corpus, TED talks and a few other sources of bilingual text, mainly formal material linked to the Government of India’s administrative bodies. Again, the corpus does not aim at providing information useful to researchers from psychology or psycholinguistics.

Uppsala Hindi Corpus

The Uppsala Hindi Corpus arose mainly from the researchers’ efforts to create a parallel Hindi-English-Swedish corpus for use primarily in linguistics (Saxena et al., 2008). It is a very small corpus (~ 0.1–0.2 million words), although it contains information about parts of speech, morphology, chunking, etc., which is useful for linguists wanting to investigate processing at the sentence level.

Worldlex

Gimenes and New (2016) reported word frequencies for 66 languages, one of which was Hindi. The corpora were based on blogs, newspapers and Twitter feeds. For languages for which other sources of information were available (French, English, Dutch, Malay, Chinese), the authors reported good performance of their word frequency measures. For Hindi specifically, the data were limited to blogs and newspapers, with a total size of 13.7 million words: 6.7 million from blogs and 7 million from newspapers. As no reaction time studies were available in Hindi, the frequencies could not be validated for this language. A further limitation of the Worldlex database is that it only provides word frequencies, not the underlying corpus. This makes it impossible to extract additional information from the texts (e.g., the frequency of word bigrams or part-of-speech information).

The creation of Shabd

Although some resources are available, it is clear that further improvements are possible. First, the existing corpora are quite small. Second, none of the frequency measures has been validated against empirical data. Third, there is no information about parts of speech or word bigrams. To address these limitations for researchers working on Hindi, we decided to compile a new corpus of Hindi texts.

The first choice to make when one wants to compile a corpus for a language is which source to draw words from. Traditionally, corpora have been sourced from written texts such as books. This was the case for Brulex (Content et al., 1990) and Frantext (New et al., 2001) in French, for Celex in English, Dutch and German (Baayen et al., 1993), and for Kučera & Francis in English (Kučera & Francis, 1967).

More recently, movie subtitles have been suggested as a useful source for estimating word frequency, because they come closer to the language undergraduate students have been exposed to. Subtitle-based frequencies were first compiled for French by New, Brysbaert, Véronis & Pallier (2007), and were shown to better predict reaction times of participants than book-based frequencies. Similar results were reported for English (Brysbaert & New, 2009; van Heuven et al., 2014), Dutch (Keuleers et al., 2010), Chinese (Cai & Brysbaert, 2010), Greek (Dimitropoulou et al., 2010), Spanish (Cuetos et al., 2012), German (Brysbaert et al., 2011), and Polish (Mandera et al., 2015).

Unfortunately, the usefulness of subtitles in Hindi is less clear. First, India has several different languages. While a little less than half of the Indian population speaks Hindi (~43.6% as per the 2011 Language Census of India), Bengali, Marathi, Tamil, Telugu, Gujarati, Kannada, Oriya, Malayalam and Punjabi are other prominent languages (~8% to ~3%, in decreasing order, as per the 2011 Language Census of India). Hence, the entertainment market provides content in many different languages. Second, Indian audiences and content providers have a strong preference for dubbing over subtitling, because dubbing makes the content available to viewers who do not read fluently. This is particularly an issue for Hindi, because many students are educated in English and as a result are less fluent in reading Hindi, unless they have a personal interest in doing so. Consequently, it is not easy to find a sizeable and representative corpus of Hindi subtitles.

A final large source of language corpora consists of internet web pages. As Gimenes and New (2016) pointed out, the language on the internet is likely to be more varied than that in books and is arguably also read more by students. Here again, however, we are confronted with elements particular to Hindi. For instance, a large proportion of Hindi-speaking social media users prefer to write Hindi in the Roman script also used for writing English. In addition, a considerable part of the user base aged 15–35 years in urban India uses English as a means of communication. It is only recently that mobile phones and social media platforms have started supporting Devanagari fonts, allowing users to communicate in Hindi. Even so, the availability of Devanagari on social media platforms has not substantially changed the language of choice for many users, as only the most motivated among them choose to type Hindi in Devanagari. As a result, it is not straightforward to obtain a representative corpus of pure Hindi content from social media without considerable manual checking and cleaning. Finally, many modern social media platforms no longer allow researchers to extract messages.

In the end, we decided that the best source of Hindi language would be popular newspapers and news websites, which are readily available on the internet. By using a large selection of newspapers and news sites, differing in the level of editing and finesse, we tried to obtain a representative sample of written Hindi language that native speakers are exposed to. A further advantage of using newspaper-based word frequencies, in our view, is that the contents are very much the same on the website as in the printed edition of the newspaper, increasing the representativeness of the corpus for the type of written language Hindi speakers are exposed to.

Shabd

Corpus compilation, cleaning and processing

The compilation of the corpus started with downloading Hindi articles from a number of Hindi newspapers and news websites, using an open-source Python-based scraping tool (https://scrapy.org/), for a total of 1,808,616 documents. All special symbols and other formatting were removed, leaving us with 1,469,243,645 (~1.4 billion) word tokens, representing 2,363,994 (~2.3 million) unique word types.

To make the list more useful for psycholinguists, we first cleaned it by excluding word types with a frequency of less than 100 in the dataset (i.e., roughly 1 per 15 million words), which left us with 96,122 word types. The second cleaning step consisted of excluding names and words predominantly belonging to other languages (in particular English and Urdu). For this purpose, we administered the list of remaining words to proficient native Hindi readers in such a manner that every word was read by at least three different readers. The readers were instructed to screen the words for spelling mistakes, language membership and names of cities or persons. This further cleaning left us with 34,122 words that are likely to be of primary interest to researchers of word processing, similar to the list of 40 thousand English words compiled by Brysbaert et al. (2014a, b) for the collection of concreteness ratings.
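The frequency-threshold step can be illustrated with a minimal Python sketch (the variable documents is a hypothetical placeholder for an iterable of tokenized texts; the actual cleaning pipeline may have differed in its details):

    from collections import Counter

    def frequent_types(documents, min_count=100):
        """Count word tokens over all documents and keep only word types
        occurring at least min_count times (100 in the cleaning step described
        above, i.e. roughly 1 per 15 million words in this corpus)."""
        counts = Counter(token for tokens in documents for token in tokens)
        return {word: count for word, count in counts.items() if count >= min_count}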

The three lists (2.3 million, 96 thousand and 34 thousand) are available at [https://osf.io/fygme/].

Frequency measures

Several different frequency measures were extracted from the corpus.

Word frequency

Word frequency has traditionally been measured in frequency per million words (fpmw), but as mentioned above, van Heuven et al. (2014) pointed out potential problems in the use of the fpmw values, and proposed a new scale of measurement of word frequencies, the Zipf scale. We provide both values, i.e. fpmw and Zipf values, for the words in the corpus (Fig. 1).

Fig. 1

Distribution of word frequencies in the Shabd corpus as expressed in Zipf values. Note that in the first two lists (34k, 96k) there are no words in the 0–1 Zipf range because the cleaned lists only include words with frequencies of at least 100 in the corpus
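As an illustration, the conversion from a raw corpus count to fpmw and to the Zipf scale can be sketched in a few lines of Python (a minimal sketch; corrections for very rare or unseen words are omitted):

    import math

    def zipf_score(count, corpus_tokens):
        """Zipf = log10(frequency per billion words) = log10(fpmw) + 3."""
        fpmw = count / (corpus_tokens / 1_000_000)   # frequency per million words
        return math.log10(fpmw) + 3

    # A word occurring 100 times per million words has a Zipf value of 5:
    print(zipf_score(100, 1_000_000))   # -> 5.0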

Contextual diversity

Contextual diversity (CD) is a measure of the number of contexts (documents) in which a word appears; it was first proposed by Adelman et al. (2006) as an important predictor of participants’ reaction times to words. Subsequent evidence for the usefulness of CD has been mixed. While Brysbaert and New (2009) reported that CD was a better predictor of RTs than total word frequency, van Heuven et al. (2014) found it to be about as useful as word frequency. More recently, Gimenes and New (2016) concluded that CD is not a better predictor of participants’ reaction times than word frequency. Arguably the strongest criticism came from Niedtner et al. (2010) and Hollis (2020), who showed that a random permutation of a corpus yields the same CD ‘advantage’ as the usual division into coherent documents (see also Cevoli et al., 2021). These mixed results notwithstanding, we make the CD values for the words in the Shabd corpus available, so that researchers can further explore the usefulness of the measure in Hindi.
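As an illustration, CD can be computed with a few lines of Python (assuming, hypothetically, that the corpus is available as an iterable of tokenized documents):

    from collections import Counter

    def contextual_diversity(documents):
        """For each word type, count the number of documents in which it occurs at least once."""
        cd = Counter()
        for tokens in documents:
            cd.update(set(tokens))   # each document contributes at most one count per word
        return cd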

Bigram frequency

In recent years, it has become clear that pairs of adjacent words (word bigrams) are of interest as well, for researchers looking for patterns of word combinations and co-occurrences (Arnon & Snider, 2010; Baayen et al., 2011). For this reason, we also computed the word-bigram frequencies for the entire corpus, which are likewise available for download.
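A minimal sketch of such a bigram count, under the same hypothetical representation of the corpus as above, could look as follows:

    from collections import Counter

    def bigram_frequencies(documents):
        """Count adjacent word pairs (word bigrams) within each document;
        pairs do not cross document boundaries."""
        counts = Counter()
        for tokens in documents:
            counts.update(zip(tokens, tokens[1:]))
        return counts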

Length measures

In addition to word frequency, we also calculated various measures of word length.

Number of aksharas (nAksh)

Several researchers have argued that the akshara is the basic orthographic unit of the Devanagari script used to write Hindi (for a detailed discussion, see Bright, 2000; Daniels, 2001; Vaid & Gupta, 2002; Rimzhim et al., 2014; Share & Daniels, 2016). The number of aksharas in a given word is reported in the variable nAksh. A whole akshara is counted as one, and the half-aksharas that appear in many Hindi words are counted as halves; any matra/ligature attached to an akshara is not included in this count.
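The counting rule can be illustrated with a rough Python sketch (the Unicode ranges and the treatment of special signs are our simplifying assumptions; the exact scheme used for Shabd may differ in detail):

    VIRAMA = "\u094D"   # halant, marking a half (conjunct) consonant
    CONSONANTS = {chr(c) for c in range(0x0915, 0x093A)} | {chr(c) for c in range(0x0958, 0x0960)}
    INDEPENDENT_VOWELS = {chr(c) for c in range(0x0905, 0x0915)}

    def count_aksharas(word):
        """Full consonant or independent vowel symbols count as 1; a consonant
        followed by a virama (half consonant) counts as 0.5; matras are ignored."""
        n = 0.0
        for i, ch in enumerate(word):
            if ch in CONSONANTS:
                n += 0.5 if i + 1 < len(word) and word[i + 1] == VIRAMA else 1.0
            elif ch in INDEPENDENT_VOWELS:
                n += 1.0
        return n

    # Under these rules, count_aksharas("हुक्म") returns 2.5 (ह + half क + म).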

Number of matras (nMatra)

Several vowels in the Devanagari script are written as ligature marks (matras) that can be attached at various positions on the akshara. We counted these ligature marks and report them in the variable nMatra. The number of matras can be important for anyone interested in the orthographic processing of Hindi words, as matras are a salient visual feature of the script and can be used to quantify the visual complexity of a Hindi word. In combination with the number of aksharas, the variable nMatra may help to characterise the visual encoding of Hindi orthography.
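A corresponding sketch for the matra count (again assuming a simple Unicode range for the dependent vowel signs; the exact set of signs counted for Shabd may differ):

    # Devanagari dependent vowel signs (matras): U+093E .. U+094C, e.g. ा ि ी ु ू े ै ो ौ
    MATRAS = {chr(c) for c in range(0x093E, 0x094D)}

    def count_matras(word):
        """Number of dependent vowel signs attached to the aksharas of a word."""
        return sum(1 for ch in word if ch in MATRAS)

    # e.g. count_matras("किस्मत") -> 1 and count_matras("हुक्म") -> 1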

Number of phonemes (nPhon)

Hindi words were decomposed into phonemes and syllables using an algorithm developed and described by Pandey (2007) and illustrated in later papers (Pandey, 2014; Pandey & Roy, 2017); a description of the algorithm is beyond the scope of the current paper. The number of phonemes in a given word is stored in the variable nPhon, and the phonemes of each word are also listed in IPA format, so that readers can ascertain the counting scheme and judge it for themselves.

Number of syllables (nSyll)

The number of syllables in a given word is stored in the variable nSyll. As with the phonemes, the syllables of each word are listed in IPA format, so that readers can ascertain the counting scheme and correct possible errors arising from the automatic processing.

Part-of-speech frequencies

Knowing the grammatical categories of the words in the corpus can help researchers in several ways. For instance, it allows psycholinguists to design experiments that only include words of a specific category, such as nouns or verbs (van Heuven et al., 2014). It can also help researchers to select appropriate words for rating studies (e.g. Kuperman et al., 2012).

Information about grammatical categories can only be included if the corpus has been parsed (i.e., sentences decomposed into their constituent parts) and tagged (i.e., the words assigned to their correct part of speech). Nowadays, most languages have dedicated computer programs that can parse the sentences of the language and provide part-of-speech tagging for the words in the language. Often-used programs in English are the CLAWS tagger from the University of Lancaster (Garside & Smith, 1987) and the Stanford Tagger (Toutanova et al., 2000; The Stanford Natural Language Processing Group), with accuracies up to 97% (van Heuven et al., 2014).

For Hindi we used the shallow parser developed by the Language Technologies Research Center (LTRC) at the International Institute of Information Technology Hyderabad (IIIT-H). The parser has an accuracy of up to 93% (Gadde & Yeleti, 2008). It is freely available on the website of LTRC IIIT-Hyderabad and was used without modification. It should be noted that the shallow parser for Hindi is under active development and is updated with every successive version. We downloaded Version 4.0, which includes a tokenizer, morphological analyser, part-of-speech tagger, chunker and other useful modules.

Using the LTRC shallow parser, we provide part-of-speech tags and part-of-speech frequencies for each word. Because the parser is not perfectly accurate, these values are better treated as useful guidelines than as error-free norms. We provide up to eight different part-of-speech tags per word, together with the frequency with which the word appeared as each part of speech.

Correlations between the Shabd word frequencies and other word frequencies

A first way to examine the usefulness of word frequency norms is to examine how well the new norms correlate with existing ones. As indicated above, there are two existing sources that are of interest: the EMILLE corpus and the Worldlex word frequencies.

Although the EMILLE corpus does not provide word frequencies per se, it is possible to calculate them by downloading the corpus. We calculated the frequencies for this corpus in exactly the same way as we did for the Shabd corpus. The Hindi Worldlex frequencies are available for blogs and newspapers.

Table 1 shows the correlations between the various frequency estimates (logarithmic values). As can be seen, the Shabd frequencies are highly correlated with the EMILLE and Worldlex frequencies. The highest correlations are found with the Worldlex_News frequencies, as can be expected given that both are based largely on (internet) newspapers.

Table 1 Correlations between Hindi word frequency estimates (between brackets: the number of common word types on which the correlation is based). *correlation with the 34k-word list, **correlation with the 96k-word list, and ***correlation with the 2.3M-word list
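For readers who want to compute similar correlations themselves, the calculation on log frequencies can be sketched as follows (the column names are hypothetical placeholders for the per-source raw frequencies):

    import numpy as np
    import pandas as pd

    def log_frequency_correlations(df: pd.DataFrame) -> pd.DataFrame:
        """Pearson correlations between log10 frequencies; words missing from a
        source are dropped pairwise, so each cell is based on the shared words."""
        logged = np.log10(df.filter(like="freq_"))   # e.g. freq_shabd, freq_emille, ...
        return logged.corr(method="pearson")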

In the next sections we describe two experiments to test the usefulness of the Shabd frequency norms to predict participants’ performance in responding to written Hindi words.

Experiment 1

In this experiment we investigated the effects of word frequency and word length in a lexical decision task. Both effects are clearly established in most languages tested.

Method

Stimuli

In a 2 × 2 design, we selected 125 high-frequency long words, 125 high-frequency short words, 125 low-frequency long words and 125 low-frequency short words. We used the Shabd Zipf frequencies (van Heuven et al., 2014). The low-frequency words had a Zipf value close to 2 (M = 1.97, SD = 0.09); the high-frequency words had a Zipf value between 4 and 7 (M = 4.61, SD = 0.61). Word length of the short words was between 2 and 4 aksharas (M = 2.4, SD = 0.47); length of the long words was between 5 and 8 aksharas (M = 5.2, SD = 0.3). For each word selected for the experiment, a non-word was generated by randomly transposing the aksharas in that word, along with their matras (ligatures). The stimulus list, together with the data, can be found at https://osf.io/xfbhd/.
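The non-word generation procedure can be illustrated with a rough Python sketch; the akshara segmentation below is a simplification based on Unicode character classes, and the segmentation actually used to build the stimuli may have been more sophisticated:

    import random

    VIRAMA = "\u094D"
    MATRAS_AND_SIGNS = {chr(c) for c in range(0x093E, 0x094D)} | {"\u0901", "\u0902", "\u0903"}

    def split_aksharas(word):
        """Greedy split into akshara-like units: matras, nasalisation signs and
        virama-joined consonants stay attached to the preceding base character."""
        units = []
        for ch in word:
            if units and (ch in MATRAS_AND_SIGNS or ch == VIRAMA or units[-1].endswith(VIRAMA)):
                units[-1] += ch
            else:
                units.append(ch)
        return units

    def make_nonword(word, rng, max_tries=20):
        """Transpose the akshara units (each matra stays with its akshara)."""
        units = split_aksharas(word)
        shuffled = units[:]
        for _ in range(max_tries):          # give up if the word cannot be rearranged
            rng.shuffle(shuffled)
            if shuffled != units:
                break
        return "".join(shuffled)

    # e.g. make_nonword("किस्मत", random.Random(1)) shuffles the units ["कि", "स्म", "त"].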

Participants

A group of 33 native Hindi-speaking undergraduate students from the Indian Institute of Technology Kanpur participated in the experiment. All participants were right-handed and had normal or corrected-to-normal vision. Participants were in the age range of 19–34 years (M = 25.6, SD = 4.8, N = 24; see Footnote 1). Participants’ self-rated Hindi proficiency, obtained on a 5-point Likert scale, was as follows: speaking (M = 4.45, SD = 0.65), understanding (M = 4.7, SD = 0.46), reading (M = 4.7, SD = 0.62) and writing (M = 4.2, SD = 0.97). They spoke Hindi for almost 62% (SD = 0.13) of their daily interactions. Their self-rated English proficiency on the same 5-point Likert scale was: speaking (M = 3.75, SD = 0.84), understanding (M = 4.16, SD = 0.7), reading (M = 4.54, SD = 0.58) and writing (M = 3.91, SD = 0.71). Their proportion of English use in a typical day was about 43% (SD = 0.13). Their language use with various interlocutors was: family (Hindi = 67%, mixed Hindi-English = 33%), friends (Hindi = 25%, mixed = 75%), classmates (Hindi = 17%, mixed = 83%) and teachers (Hindi = 8%, mixed = 92%). In short, these participants reported using Hindi or a mixture of Hindi and English in their conversations, with Hindi dominating in informal settings and English dominating in classroom settings or interactions with teachers.

Procedure

The selected words and non-words were randomly assigned to three blocks. Each block contained 166/167 words and 166/167 non-words, with an equal number of words from the four categories (i.e. hf-long, hf-short, lf-long, lf-short). Within each block, the presentation order of the stimuli was randomized. The blocks were presented in three different permutations, and each participant was assigned to one of the three permutations. Each trial started with the presentation of a blank screen for 1000 ms. Next, two horizontal lines with a gap in between were presented at the centre of the screen for 500 ms, after which the word/non-word stimulus was presented between the two lines for a maximum of 2500 ms. Participants were instructed to press the “Z” key if they saw a word and the “M” key if they saw a non-word. The assignment of response keys was counterbalanced across participants. Participants completed six practice trials followed by 332/334 experimental trials. The experiment had two breaks, after the first and the second block of trials, and lasted about 40 minutes in total.

Results

Reaction time (RT) and accuracy were the two dependent variables. Mean accuracy was calculated for each participant, and the data of five participants were removed because they had a mean accuracy of less than 0.60. We also calculated the mean accuracy for each word used in the experiment, and 74 words with mean accuracy less than 0.66 were removed from the RT analysis.

The mean RT for all words (correct responses only) was 793 ms (SD = 268 ms). Mean accuracy for all words was 0.83 (SD = 0.37). For non-words, the mean RT was 942 ms (SD = 315 ms), and the mean accuracy was 0.81 (SD = 0.39). Mean RTs and accuracies were also calculated for the four categories of words separately, i.e. hf-long, hf-short, lf-long and lf-short, as shown in Table 2.

Table 2 Mean RTs and accuracy for different word categories in Experiment 1

We tested for the effects of frequency (high vs. low), word length (short vs. long) and their interaction on reaction times using a linear mixed-effects model with participants and stimuli as random effects, fitted with the R lme4 package (Bates et al., 2015).

The dependent variable was the raw (untransformed) reaction time. The model used a Gaussian distribution with an identity link function. The model for the RT analysis was: lmer(RT ~ length*frequency + (length+frequency|participant) + (1|word)). More detailed information about the model can be found in the OSF project [https://osf.io/xfbhd/].

The short/low-frequency and long/high-frequency conditions of length and frequency were coded as −1 and +1, respectively. We found a main effect of frequency on RT: high-frequency words were responded to faster than low-frequency words (estimate = 58.7, t-value = 10.5, p < 0.001). (Because the lme4 package does not provide exact p-values with the model fit, we report p-values estimated with type II Wald chi-square tests.) We also found a main effect of word length on RT: short words were responded to faster than long words (estimate = 53.6, t-value = 5.6, p < 0.001). Finally, we found some evidence for an interaction between frequency and word length (estimate = −8.2, t-value = −2.2, p = 0.029): the difference in RT between long and short words was, surprisingly, larger for the high-frequency words than for the low-frequency words.

A similar analysis of the accuracy data was carried out using a generalized linear mixed-effects model. The dependent variable was raw accuracy (coded 1 for a correct and 0 for an incorrect response). The model used a binomial distribution with a logit link function. The model for the accuracy analysis was: glmer(Accuracy ~ length*frequency + (length+frequency|participant) + (1|word), family = binomial). More details about the model can be found in the OSF project [https://osf.io/xfbhd/]. We found a main effect of frequency such that accuracy was higher for high-frequency words than for low-frequency words (estimate = −0.75, z-value = −8.2, p < 0.001). There was also a main effect of length such that accuracy was higher for long words than for short words (estimate = −0.32, z-value = −2.5, p = 0.009); this is opposite to the expected effect. There was no interaction effect.

To compare the various frequency measures, we examined the correlation matrix. In order not to bias the findings, we only used the words present in all four corpora (N = 310 for the RT analysis and N = 338 for the accuracy analysis), although the results remained very similar when the cells were based on the maximum number of words. As can be seen in Table 3, the four word frequency measures performed very similarly, with the Shabd and EMILLE frequencies doing slightly better for accuracy and the Worldlex frequencies slightly better for RTs. Averaging the Zipf values of the four corpora did not have a major impact. There were no large differences between word frequency and contextual diversity (shown for Shabd in Table 3).

Table 3 Correlations between the variables of Experiment 1 for words present in all four corpora (N = 310 for RT, N = 338 for accuracy); p < .01

An unexpected finding was the low correlation of accuracy in the lexical decision task with all other variables, including the RTs for words with accuracy > .66. For instance, some words had accuracy rates of only 70% even though they have frequencies of 100 per million words in Shabd. It is not clear what caused this.

Discussion

In Experiment 1, we were able to show a word frequency effect for Hindi words: high-frequency words were responded to faster than low-frequency words. The Shabd word frequencies accounted for .65² = 43.1% of the variance in the participants’ mean RTs in the lexical decision task. The Shabd word frequencies performed better than the EMILLE word frequencies, probably because of the larger size of the Shabd corpus, but not better than the Worldlex word frequencies. Although better performance could be expected for the Worldlex blog frequencies (Brysbaert & New, 2009), we found a similar advantage for the newspaper frequencies. Little was gained by using contextual diversity instead of word frequency or by averaging the frequencies across the four databases.

The other significant factor in Experiment 1 was word length, which accounted for .61² = 37.1% of the variance in the mean RTs. We separately analysed the various measures that could underlie the word length effect (i.e. the number of phonemes, matras or syllables; analyses available at https://osf.io/xfbhd/), but the number of aksharas was the best predictor.

There was an interaction between word frequency and word length, but opposite to the one we expected. The usual pattern is that the effect of another variable is larger for low-frequency words than for high-frequency words. Here, however, we saw the opposite effect: A larger length effect for the high-frequency words than for the low-frequency words, even though the length groups were matched on word frequency. The long words were also responded to more accurately than the short words. Further research will have to indicate the origin of this surprising finding.

Experiment 2

A less fortunate aspect of Experiment 1 is that the words were selected on the basis of the Shabd word frequencies, which may have given the Shabd frequencies an advantage over the other frequency measures. Therefore, in the second experiment we directly compared the frequencies from the Shabd corpus with the frequencies from the EMILLE corpus (described earlier).

Method

Stimuli

We followed the design described by Mandera et al. (2015). To maximise the information gained from the experiment, we searched for words for which the EMILLE and Shabd corpora gave divergent frequency estimates. To select those words, we performed a linear regression of the Shabd frequencies on the EMILLE frequencies (Zipf values). This allowed us to select three lists of 166 words each: (a) words with a higher frequency in Shabd than predicted on the basis of EMILLE, (b) words with a lower frequency in Shabd than predicted on the basis of EMILLE, and (c) randomly sampled words with the same frequencies in both corpora. Due to this selection procedure, the correlation between the Shabd and the EMILLE_News frequencies in our sample was only r = −0.10. The correlation between the Shabd frequency and the Worldlex_Blogs+Newspapers frequency was r = 0.46, and the correlation between the EMILLE_News frequency and the Worldlex_Blogs+Newspapers frequency was r = 0.51. As in Experiment 1, a non-word was generated for each word by randomly transposing the aksharas of that word. Again, all stimuli and data can be found at https://osf.io/xfbhd/.
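The residual-based selection can be sketched as follows (a hypothetical Python sketch; the column names and the size of the pool of near-zero-residual words are our assumptions):

    import numpy as np
    import pandas as pd

    def split_by_residual(df, n=166, seed=0):
        """Regress Shabd Zipf values on EMILLE Zipf values and return
        (a) the n words with the most positive residuals (higher in Shabd than predicted),
        (b) the n words with the most negative residuals, and
        (c) n words sampled from those with near-zero residuals."""
        slope, intercept = np.polyfit(df["zipf_emille"], df["zipf_shabd"], deg=1)
        resid = df["zipf_shabd"] - (intercept + slope * df["zipf_emille"])
        order = resid.sort_values()
        low, high = df.loc[order.index[:n]], df.loc[order.index[-n:]]
        neutral = df.loc[resid.abs().sort_values().index[: 10 * n]].sample(n, random_state=seed)
        return high, low, neutral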

Participants

A new group of 32 native speakers of Hindi participated in the experiment. They were mainly undergraduate students from the Indian Institute of Technology Kanpur. All participants were right-handed with normal or corrected-to-normal vision. Participants were in the age range of 19–43 years (M = 26.06, SD = 5.34, N = 32; see Footnote 2). Not all participants could be reached, but we have no reason to believe that the sample is unrepresentative of the participants tested. Participants’ self-rated Hindi proficiency, obtained on a 5-point Likert scale, was as follows: speaking (M = 4.43, SD = 0.61), understanding (M = 4.65, SD = 0.48), reading (M = 4.71, SD = 0.58) and writing (M = 4.18, SD = 0.89). They spoke Hindi for almost 59% (SD = 0.15) of their daily interactions. Their self-rated English proficiency on the same 5-point Likert scale was: speaking (M = 3.71, SD = 0.77), understanding (M = 4.15, SD = 0.67), reading (M = 4.56, SD = 0.61) and writing (M = 3.84, SD = 0.67). Their proportion of English use in a typical day was about 46% (SD = 0.15). Their language use with various interlocutors was: family (Hindi = 78%, mixed Hindi-English = 22%), friends (Hindi = 22%, mixed = 78%), classmates (Hindi = 19%, mixed = 81%) and teachers (Hindi = 6%, mixed = 94%). In short, these participants reported using Hindi or a mixture of Hindi and English in their conversations, with Hindi dominating in informal settings and English dominating in classroom settings or interactions with teachers.

Procedure

The selected words and non-words were randomly assigned to three blocks. Each block contained 166 words and 166 non-words, with equal proportions from the three selected lists (described above). Within each block, the sequence of stimuli was randomized. The blocks were presented in three different permutations, and each participant was assigned to one of the three permutations.

The presentation procedure was the same as in Experiment 1.

Results

Reaction times and accuracy were recorded for every participant. Data from five participants with a mean accuracy of less than 0.60 were removed from the RT analysis. Further, we calculated mean correct RTs for each word and removed from the RT analysis the words (n = 92) for which participants had a mean accuracy of less than 0.66. Because the linear regression analyses on mean RT and mean accuracy are the most informative, we limit the discussion to these analyses.

The mean accuracy and mean RT for the different word categories are summarised in Table 4. Table 5 shows the correlations between the various word frequencies and RT/accuracy. Because of the strong effect of word length in Experiment 1, we included this variable as well. Observations were limited to the words for which we had frequencies in all four databases (N = 407 for accuracy, N = 354 for RT). The correlations looked very similar for the total database.

Table 4 Summary statistics of reaction time and accuracy for words used in Experiment 2
Table 5 Correlations between different word frequencies and RT and accuracy

From Table 5, we can conclude that the Shabd frequencies correlated more strongly with RT than the EMILLE frequencies did (accuracy: z = 1.0, p = .15; RT: z = 3.1, p < 0.001, according to the Vuong test for non-nested models; Vuong, 1989). At the same time, the Worldlex blog frequencies tended to outperform the Shabd frequencies (Vuong z = −3.1, p = .001 for accuracy and z = −1.2, p = 0.10 for RT), even though these frequencies had not been used in the selection of the stimuli. See also Figs. 2 and 3.

Fig. 2

Mean reaction times as predicted by word frequencies obtained from EMILLE and Shabd corpora. R-squared value for lm(.~Zipf_Shabd) is 0.125 and for lm(.~Zipf_EMILLE) is 0.005

Fig. 3

Mean reaction times as predicted by Worldlex blog frequency. R-squared value is 0.181

As before, multiple regression analysis was used to examine the combined effects of frequency and length. For accuracy, the two variables together accounted for 6.3% of the variance when the Shabd frequencies were used, with no evidence for an interaction (variables centred). For RT, Shabd frequency and word length together accounted for 43% of the variance; again, there was no evidence for an interaction. When the Worldlex blog frequencies were used, the percentages of variance explained increased to 14% for accuracy and 46% for RT, again without evidence for an interaction. As in Experiment 1, there was no large difference between word frequency and contextual diversity for Shabd, and averaging across Zipf values did not improve the fit of the model relative to the best predictor (the blog frequencies). As mentioned earlier, the details of the analyses can be found in the OSF project [https://osf.io/xfbhd/].
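The item-level regression can be sketched as follows (a hypothetical Python sketch using statsmodels; the data frame and column names are our assumptions, not the analysis script used for the paper):

    import statsmodels.formula.api as smf

    def item_regression(item_means, dv="mean_rt"):
        """OLS on item means with centred frequency and length and their interaction."""
        data = item_means.copy()
        data["zipf_c"] = data["zipf"] - data["zipf"].mean()
        data["len_c"] = data["n_aksh"] - data["n_aksh"].mean()
        model = smf.ols(f"{dv} ~ zipf_c * len_c", data=data).fit()
        return model.rsquared, model.params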

Discussion

Experiment 2 confirmed the superiority of the Shabd frequencies over the EMILLE frequencies in an experiment that directly pitted both measures against each other. As in Experiment 1, accuracy correlated little with word frequency, although this time we obtained a healthy negative correlation between accuracy and RT (Table 5).

In Experiment 2 we also saw that Worldlex blogs outperformed Shabd, even though this variable had not been used in stimulus selection. This is further evidence for the importance of word register (Brysbaert & New, 2009; Gimenes & New, 2016).

General discussion

Word recognition research in Hindi has suffered from the lack of a comprehensive, validated database providing information about lexical variables such as word frequency, length, neighbourhood, contextual diversity, and so on. In the current paper, we present Shabd – A Psycholinguistic Database for Hindi. There are two versions: (1) a cleaned database with 34k Hindi words, and (2) a full database with 96k words. The former will be of particular interest to researchers who want to select stimulus materials, because each entry is a usable stimulus (e.g., Brysbaert et al., 2014a, b). The latter is more interesting for researchers who want frequency information about a stimulus set they have collected; for this purpose, the more entries the better, as it increases the chances of finding a match in the list.

The two Shabd databases include the following information:

  • word: the stimulus word for which the information is provided

  • frequency: number of times the word was observed in the corpus

  • frpm: number of observations of the word, per million words in the corpus

  • Zipf: the Zipf frequency as defined by van Heuven et al. (2014)

  • AkshCount: number of aksharas in the given word.

  • MatraCount: number of matras (ligatures) in the given word.

  • Total_length: the length of the word (AkshCount + MatraCount)

  • pos_tag1: the most dominant part of speech taken by the word according to the PoS tagger we used.

  • pos_fr1: the frequency of the most important part of speech

  • pos_tag2-pos_tag8: other parts of speech taken by the word

  • pos_fr2-pos_fr8: frequencies of the other parts of speech

  • CD: number of documents in which the word appears (contextual diversity)

  • LOG_CD: Log10 of CD

  • OLD20: mean orthographic Levenshtein distance to the 20 closest words, as defined by Yarkoni et al. (2008)

  • Prosodic Label (PLSB): Prosodic Label of each word with stress-marked syllables, which may be useful to speech production and perception researchers.

  • I-level Syllabification and Syllcount: Syllabification strategy that takes into account the inherent schwa vowel in each akshara of a Hindi word, and syllable count as per this syllabification scheme.

  • Phoneme_Level_ASCII Syllables and Syllcount: This syllabification is the more commonly used syllabification scheme based on ASCII notations of the words, and it operates as per the commonly used schwa deletion. For most purposes this measure of syllabification may be used.

  • IPA_Equivalent Phoneme Separation and Phoneme Count: Phoneme separation based on stress patterns and phoneme count based on these. May be more useful to phonology researchers.

  • Phonemes_Level_IPA and Phoneme Count: Phoneme separation and phoneme count based on the more commonly used IPA notations. This measure may be the more commonly used measure for phoneme counts for most researchers.

  • frequency_EMILLE, frpm_EMILLE and Zipf_EMILLE: frequency in the EMILLE corpus expressed as raw frequencies, frpm and Zipf scores.

  • frequency_WLblogs, frpm_WLblogs, and Zipf_WLblogs: frequency in the Worldlex blogs corpus expressed as raw frequencies, frpm and Zipf scores.

  • Wordlex Blogs CD and Wordlex Blogs CDPc: the CD measure as provided in the Worldlex blogs corpus, and the percentage of Worldlex blog documents in which the word appears.

  • frequency_WLnews, frpm_WLnews and Zipf_WLnews: frequency in the Worldlex news corpus, expressed as raw scores, frpm and Zipf scores.

As the two validation experiments showed, the Worldlex blog frequencies in particular add useful information to the Shabd frequencies. A limitation is that these frequencies are not available for all words. Missing blog frequencies can be replaced by a Zipf value of 1.5, which is 0.3 lower than the lowest observed blog Zipf value.
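In practice, this imputation amounts to a one-line operation (a sketch with hypothetical file and column names):

    import pandas as pd

    shabd = pd.read_csv("shabd_34k.csv")   # hypothetical file name for the 34k database
    # Impute missing Worldlex blog Zipf values with 1.5 (0.3 below the lowest observed value)
    shabd["Zipf_WLblogs"] = shabd["Zipf_WLblogs"].fillna(1.5)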

In addition to the Shabd databases, we make two more lists available. The first is the list of all 2.36M word types observed in the Shabd corpus. The second list contains the 2.5M highest-frequency word bigrams observed in the Shabd corpus. For these two lists, we only give the frequencies of occurrence in the Shabd corpus and the corresponding Zipf values.

The validity of the Shabd measures was tested in two lexical decision experiments. In each experiment, we found clear effects of word frequency and word length, in line with findings in other languages. The percentage of variance in RT explained by the two variables was around 40%, also in line with other languages. Against our expectations, the percentage of variance in accuracy accounted for by word frequency was low (less than 10%). We do not have a clear explanation for this observation, but some hypotheses may be worth pursuing. First, the RTs and error rates in our experiments are rather high. We think this is because most Indian students with Hindi as their mother tongue have an English education and read many English texts. Even when they write Hindi words, they tend to use the Roman script, as this is more practical on computer keyboards. As a result, young Hindi speakers are often not very familiar with the Hindi script, which is likely to cause problems. Second, some of the high-frequency words are also used as names (similar to the word Archer in English). This may have caused interpretation uncertainty in some of our participants, all the more because Hindi does not distinguish between upper-case and lower-case letters (whereas archer and Archer are visually distinct in English). The best way to address these (and other) hypotheses is to run a megastudy on all 34 thousand Shabd words in the cleaned database, rather than on the 1000 words included in the present studies.

A megastudy could also address the questions of which measure of word length is best for the Devanagari script and whether there is a genuine interaction between word length and word frequency. The number of aksharas performed well for the words we used, but this needs to be replicated with larger word samples. In addition, better-controlled experiments such as the one run by Husain et al. (2015) are likely to shed further light on the issue. It is our hope that the Shabd database will help Hindi researchers in this respect and that the database can be upgraded when additional information becomes available. Given that Hindi is spoken by more than 520 million people in India alone, the lack of such information has been a major hindrance for researchers across a wide spectrum of sciences.