Procura-PALavras (P-PAL)

P-PAL is a web-based interface for a new European Portuguese (EP) lexical database. It provides researchers with an entirely new set of word attributes and lexical and sublexical statistics at different grain sizes (the word as a whole, syllables, trigrams, bigrams/biphones, letters/phones), drawn from a large and diversified corpus in both lemmatized and non-lemmatized form (for more information about P-PAL corpus sampling, click here [PDF]).

When entering the application, a dialogue box appears asking the user to choose which of the available word queries to perform:

  • Analyze word lists: analyze previously selected word lists on the specific attributes and lexical and/or sublexical statistics that the user defines in the menu of analyses, in either the lemma or the word form database.
  • Generate word lists: generate word lists that meet the specific word requirements that the user defines in the menu of analyses, in either the lemma or the word form database.

Then, the menu of analyses is displayed. In P-PAL, words’ attributes and statistics (lexical and sublexical) are organized into four main fields:

  • Word frequency measures: e.g., raw counts, per million word frequency, Log10 of the raw and per million frequencies, logarithmic Zipf scale.
  • Morphosyntactic measures: e.g., Parts-of-Speech [PoS], grammatical gender and number, dominant PoS, frequency and relative frequency of the dominant and non-dominant PoS.
  • Orthographic measures: information of different grain sizes comprising the word orthographic structure as a whole (e.g., number of letters, consonant (C)-vowel (V) structure), and word neighborhood statistics (e.g., density and frequency of orthographic neighbors, phonographic neighbors, Orthographic Levenshtein Distance (OLD20), orthographic uniqueness point (OUP)), as well as other sublexical statistics concerning syllables (e.g., number of orthographic syllables, type and token syllable positional statistics), trigrams (e.g., summed trigram frequencies, type and token trigram positional statistics), bigrams (e.g., summed bigram frequency, type and token bigram positional statistics), and letters within words (e.g., summed letter frequency, mean letter frequency).
  • Phonological measures: information of different grain sizes comprising the word phonological structure as a whole (e.g., pronunciation, number of phonemes, stress pattern), and word neighborhood statistics (e.g., density and frequency of phonological neighbors, transposed and phonographic neighbors), as well as other sublexical statistics regarding syllables (e.g., number of phonological syllables, type and token syllable positional statistics), biphones (e.g., summed biphone frequency, type and token biphone positional statistics), and phonemes within words (e.g., summed phoneme frequency, mean phoneme frequency).
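To make a few of the measures listed above concrete, here is a minimal Python sketch. It derives per-million frequency, Log10 values, and a Zipf value from a raw count, and computes an OLD20-style statistic over a toy lexicon. The function names and the toy inputs are assumptions made for illustration and do not describe how P-PAL computes its values. The Zipf value is taken here as Log10 of the per-million frequency plus 3; P-PAL may apply smoothing or other conventions.

```python
import math


def frequency_measures(raw_count, corpus_size):
    """Derive common frequency measures from a raw count (illustrative only).

    raw_count must be greater than zero, since log10(0) is undefined.
    """
    per_million = raw_count * 1_000_000 / corpus_size
    return {
        "raw": raw_count,
        "per_million": per_million,
        "log10_raw": math.log10(raw_count),
        "log10_per_million": math.log10(per_million),
        # Zipf scale: log10 of frequency per billion words = log10(per million) + 3
        "zipf": math.log10(per_million) + 3,
    }


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def old_n(word, lexicon, n=20):
    """Mean Levenshtein distance from word to its n closest other words in lexicon."""
    distances = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = distances[:n]
    return sum(closest) / len(closest)
```

For example, `frequency_measures(250, 1_000_000)` gives a per-million frequency of 250.0, and `levenshtein("menino", "menina")` is 1 because a single substitution separates the two forms. With a realistic lexicon, `old_n(word, lexicon)` uses the 20 nearest words, which is where the "20" in OLD20 comes from.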

The only measure selected by default in the application is the per million word frequency, given the importance of this variable in research using verbal stimuli. Any other word attributes/statistics the user is interested in should be selected by ticking the checkbox to the left of each word property.

If the user intends to conduct a ‘generate word lists’ query, he/she should additionally define the requirements that the words should meet, using the constraint fields (minimum and/or maximum values) associated with each of the selected word attributes/statistics. If he/she intends to conduct an ‘analyze word lists’ query instead, he/she should upload a file (.txt or .xls) containing the words to be analyzed on the attributes/statistics selected in the menu of analyses.
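The logic of the two query types can be sketched in a few lines of Python. This is only an illustration of the idea, not the application's code: the record layout, the attribute names (`per_million`, `n_letters`), the function names, and the per-million values are all invented for the example.

```python
# Toy database records. The per-million values are invented for illustration.
RECORDS = [
    {"word": "casa", "per_million": 310.0, "n_letters": 4},
    {"word": "costa", "per_million": 80.0, "n_letters": 5},
    {"word": "animal", "per_million": 120.0, "n_letters": 6},
    {"word": "comboio", "per_million": 25.0, "n_letters": 7},
    {"word": "adivinha", "per_million": 3.0, "n_letters": 8},
]


def generate_word_list(records, constraints):
    """Return the words whose attributes fall within every (minimum, maximum) range.

    Either bound may be None, meaning that side of the range is left open.
    """
    selected = []
    for record in records:
        meets_all = True
        for attribute, (minimum, maximum) in constraints.items():
            value = record[attribute]
            if minimum is not None and value < minimum:
                meets_all = False
            if maximum is not None and value > maximum:
                meets_all = False
        if meets_all:
            selected.append(record["word"])
    return selected


def analyze_word_list(words, records, attributes):
    """Look up the selected attributes for each word; None marks a word not found."""
    by_word = {record["word"]: record for record in records}
    results = {}
    for word in words:
        record = by_word.get(word)
        if record is None:
            results[word] = None
        else:
            results[word] = {name: record[name] for name in attributes}
    return results


# 'Generate word lists': at least 10 occurrences per million, 4 to 6 letters.
generated = generate_word_list(RECORDS, {"per_million": (10.0, None), "n_letters": (4, 6)})

# 'Analyze word lists': the words would normally come from the uploaded file.
analyzed = analyze_word_list(["costa", "xyz"], RECORDS, ["per_million"])
```

In this toy run, `generated` contains casa, costa, and animal, while `analyzed` returns the per-million value for costa and `None` for the unknown string.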

For more information about the P-PAL web-based interface, click here (soon).

Please cite the P-PAL web-based interface in your research as: Soares, A. P., Iriarte, A., Almeida, J. J., Simões, A., Costa, A., Machado, J., & Perea, M. (submitted). Procura-PALavras (P-PAL): A web-based interface for a new European Portuguese lexical database. Behavior Research Methods.

Click here to access the application

A wordform is the actual occurrence of a word as it appears in a corpus, with distinctive orthographic, phonological and grammatical features. For instance, cantar [sing], canto [(I) sing], cantava [used to sing], cantei [(I) sang], cantamos [(we) sing], etc. are inflected forms of the same word, conventionally represented by the lemma 'cantar' [sing].

A lemma is a graphic form chosen by convention to represent all inflected forms of a word. In the case of verbs, the infinitive [e.g. 'ser' (be)] is the canonical form chosen to represent all inflected forms of the verbal paradigm [e.g. sou (I am), és (you are), é (is), era (was)]. Nouns and adjectives are represented by the masculine singular form [e.g. menino (boy), bonito (pretty)], which stands for the whole nominal paradigm [e.g. menino (boy), menina (girl), meninos (boys), meninas (girls)] or adjectival paradigm [e.g. bonito (pretty - masc., sing.), bonita (pretty - fem., sing.), bonitos (pretty - masc., plur.), bonitas (pretty - fem., plur.)]. For strictly masculine or strictly feminine nouns, the singular form is used (e.g. animal [animal], comboio [train], costa [coast], adivinha [riddle]). Feminine singular words whose stem differs from that of the masculine have also been included as separate lexical entries (e.g. homem [man] / mulher [woman]).
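The relationship between wordforms and lemmas can be illustrated with a small Python sketch that pools wordform counts under their lemma. The mapping below uses only the examples given in the text; the function name, the fallback rule for unlisted forms, and the counts in the usage example are assumptions for illustration. A plain one-to-one mapping also ignores homographs (canto, for instance, is also a noun), which real lemmatization resolves with Part-of-Speech information.

```python
from collections import Counter

# Wordform -> lemma pairs taken from the examples in the text.
LEMMA_OF = {
    "cantar": "cantar", "canto": "cantar", "cantava": "cantar",
    "cantei": "cantar", "cantamos": "cantar",
    "menino": "menino", "menina": "menino",
    "meninos": "menino", "meninas": "menino",
    # Feminine forms with a different stem are their own lexical entries.
    "homem": "homem", "mulher": "mulher",
}


def lemma_frequencies(wordform_counts, lemma_of):
    """Sum wordform counts under their lemma.

    Wordforms missing from the mapping are treated as their own lemma,
    which is an assumption of this sketch.
    """
    totals = Counter()
    for wordform, count in wordform_counts.items():
        lemma = lemma_of.get(wordform, wordform)
        totals[lemma] += count
    return totals
```

With invented counts such as `{"canto": 3, "cantei": 2, "menina": 5, "meninos": 1, "mulher": 4}`, the function pools the two verb forms under cantar and the two noun forms under menino, while mulher keeps its own entry.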