In this project, we developed STARK, a tool for the statistical analysis of dependency-parsed corpora that extracts dependency trees from a parsed corpus and returns them as frequency lists. In the configuration file, the user can set several parameters that define the characteristics of the extracted trees, such as the number of nodes in a tree, the type of information represented by the nodes (from concrete words to abstract grammatical characteristics), and whether full subtrees or all possible subtrees are extracted.
In addition to this kind of inductive extraction (without any predefined linguistic assumptions), the tool can extract trees based on additional restrictions and predefined tree structures.
The results are presented in a table that lists the tree structures, the tree nodes, the corpus frequency data and the strength of the statistical association between nodes (based on different collocation measures).
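To illustrate what a collocation measure for a pair of nodes looks like, here is a minimal Python sketch of two widely used ones, pointwise mutual information and logDice. It is an illustration only: this description does not list the measures STARK implements, and the function and parameter names below are hypothetical, not part of the tool.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information (base 2) of two co-occurring items.

    pair_count  -- how often the two items occur together
    left_count  -- how often the first item occurs overall
    right_count -- how often the second item occurs overall
    total       -- size of the sample the counts were taken from
    All names are illustrative assumptions, not STARK parameters.
    """
    if min(pair_count, left_count, right_count, total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(pair_count * total / (left_count * right_count))


def log_dice(pair_count, left_count, right_count):
    """logDice score: 14 + log2 of the Dice coefficient of the pair."""
    if min(pair_count, left_count, right_count) <= 0:
        raise ValueError("all counts must be positive")
    return 14 + math.log2(2 * pair_count / (left_count + right_count))


if __name__ == "__main__":
    # Hypothetical counts: a pair seen 10 times, its parts 20 and 50 times,
    # in a sample of 1000 observations.
    print(round(pmi(10, 20, 50, 1000), 3))
    print(round(log_dice(10, 20, 50), 3))
```

Higher values indicate that the two nodes occur together more often than their individual frequencies would suggest. Unlike pointwise mutual information, logDice does not depend on the corpus size, which makes its values easier to compare across corpora.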
The tool processes dependency-parsed corpora in the CoNLL-U input format. This means it can be used with the Slovene corpora ssj500k and Gigafida as well as with any other corpus in this format (available for more than 70 different languages).
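As a rough illustration of the input format and of what a frequency list of dependency trees means, the following Python sketch reads CoNLL-U text and counts the simplest possible trees: a head node and one dependent, both represented by their part-of-speech tags. It is not STARK's implementation; the function names and the sample sentence are assumptions made for this example, and only the ten-column, tab-separated CoNLL-U layout is taken from the format itself.

```python
from collections import Counter

# A tiny hand-made CoNLL-U sample (hypothetical, for illustration only).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = """\
# text = The dog barks.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_

"""


def read_sentences(text):
    """Yield each sentence as a list of token dicts."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment line
            continue
        cols = line.split("\t")
        # Keep only ordinary tokens: skip multiword ranges such as "1-2",
        # empty nodes such as "1.1", and rows without a numeric head.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append({
            "id": int(cols[0]),
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def count_two_node_trees(text):
    """Count (head UPOS, relation, dependent UPOS) triples in the text."""
    counts = Counter()
    for sentence in read_sentences(text):
        upos_by_id = {tok["id"]: tok["upos"] for tok in sentence}
        for tok in sentence:
            head_upos = upos_by_id.get(tok["head"])
            if head_upos is None:     # root (head 0) or dangling head
                continue
            counts[(head_upos, tok["deprel"], tok["upos"])] += 1
    return counts


if __name__ == "__main__":
    for tree, freq in count_two_node_trees(SAMPLE).most_common():
        print(freq, tree)
```

Representing nodes by part-of-speech tags rather than word forms corresponds to the more abstract end of the node types mentioned above; swapping `cols[3]` for `cols[1]` or `cols[2]` would count trees over word forms or lemmas instead.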
STARK is a command-line tool, available under the open Apache 2.0 license at https://gitea.cjvt.si/lkrsnik/STARK and in the CLARIN.SI repository:
Krsnik, Luka; Dobrovoljc, Kaja and Robnik-Šikonja, Marko, 2019, Dependency tree extraction tool STARK 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1284.