{"id":2371,"date":"2020-06-02T12:26:48","date_gmt":"2020-06-02T10:26:48","guid":{"rendered":"https:\/\/www.cjvt.starkmat.si\/?page_id=2371"},"modified":"2020-11-05T22:32:24","modified_gmt":"2020-11-05T21:32:24","slug":"lists-textbooks","status":"publish","type":"page","link":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/","title":{"rendered":"Keywords and n-grams from a textbook corpus"},"content":{"rendered":"<div class='flex_column_table av-equal-height-column-flextable -flextable' style='margin-top:0px; margin-bottom:0px; '><div class=\"flex_column av_two_third  flex_column_table_cell av-equal-height-column av-align-middle av-zero-column-padding first  avia-builder-el-0  el_before_av_one_third  avia-builder-el-first  \" style='border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/CreativeWork\" ><div class='avia_textblock  '   itemprop=\"text\" ><h2>Compiling keyword lists and n-grams from a textbook corpus for different school years and subjects<\/h2>\n<\/div><\/section><\/div>\n<div class='av-flex-placeholder'><\/div><div class=\"flex_column av_one_third  flex_column_table_cell av-equal-height-column av-align-middle av-zero-column-padding   avia-builder-el-2  el_after_av_two_third  el_before_av_two_third  \" style='border-radius:0px; '><p><div  class='avia-button-wrap avia-button-left  avia-builder-el-3  el_before_av_button  avia-builder-el-first  gumb-sodelavci-levo' title=\"Sloleks lexicon accentuation\"><a href='https:\/\/www.cjvt.si\/en\/infrastructure-support\/sloleks-lexicon-accentuation\/'  class='avia-button  av-button-notext   avia-icon_select-yes-left-icon avia-color-theme-color avia-size-small avia-position-left '   ><span class='avia_button_icon avia_button_icon_left ' aria-hidden='true' data-av_icon='\ue87c' data-av_iconfont='entypo-fontello'><\/span><span class='avia_iconbox_title' ><\/span><\/a><\/div><br \/>\n<div  class='avia-button-wrap avia-button-left  
avia-builder-el-4  el_after_av_button  avia-builder-el-last  gumb-sodelavci-desno' title=\"LIST \u2013 efficient Slovene corpus analysis tool\"><a href='https:\/\/www.cjvt.si\/en\/infrastructure-support\/the-list-corpus-extraction-tool\/'  class='avia-button  av-button-notext   avia-icon_select-yes-right-icon avia-color-theme-color avia-size-small avia-position-left '   ><span class='avia_iconbox_title' ><\/span><span class='avia_button_icon avia_button_icon_right' aria-hidden='true' data-av_icon='\ue87d' data-av_iconfont='entypo-fontello'><\/span><\/a><\/div><\/p><\/div><\/div><!--close column table wrapper. Autoclose: 1 --><div class=\"flex_column av_two_third  flex_column_div av-zero-column-padding first  avia-builder-el-5  el_after_av_one_third  el_before_av_one_third  column-top-margin\" style='margin-top:36px; margin-bottom:0px; border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/CreativeWork\" ><div class='avia_textblock  '   itemprop=\"text\" ><p><!--Prijavitelj: Iztok Kosem, FF UL--><\/p>\n<p>In this project, we built a corpus of textbooks for primary and secondary school and extracted lists of words, n-grams and keywords. The source texts were converted from PDF and HTML into text, which was then corrected and annotated for structure. The corpus contains 5 million tokens from 127 textbooks for 16 subjects. The lists were then extracted according to several criteria, and all lists were checked manually. The following lists were compiled:<\/p>\n<p>A general word list of words that appear in at least 8 of the 16 subjects. 
It contains the lemma, word type, frequency (overall and per school subject) and subject count.<\/p>\n<p>A general word list by school year (class), containing the lemma, word type, frequency (overall and per school year) and subject count (out of 16).<\/p>\n<p>A list of 2- to 5-grams, containing the n-gram type, lemma, word type, morphological annotations, subject count and frequency.<\/p>\n<p>The lists are available under the CC BY licence at:<\/p>\n<p>Kosem, Iztok; Pori, Eva and Arhar Holdt, \u0160pela, 2019, Keywords and n-grams from a textbook corpus, Slovenian language resource repository CLARIN.SI, <a href=\"http:\/\/hdl.handle.net\/11356\/1215\">http:\/\/hdl.handle.net\/11356\/1215<\/a>.<\/p>\n<\/div><\/section><\/div><\/p>\n<div class=\"flex_column av_one_third  flex_column_div   avia-builder-el-7  el_after_av_two_third  avia-builder-el-last  column-top-margin\" style='margin-top:36px; margin-bottom:0px; background: #f0f0f0; padding:30px; background-color:#f0f0f0; border-radius:0px; '><section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/CreativeWork\" ><div class='avia_textblock  '   itemprop=\"text\" ><h3 class=\"zn_text_box-title zn_text_box-title--style1 text-custom\">LINKS AND CONTACT<\/h3>\n<\/div><\/section><br \/>\n<div  style='height:20px' class='hr hr-invisible   avia-builder-el-9  el_after_av_textblock  el_before_av_textblock '><span class='hr-inner ' ><span class='hr-inner-style'><\/span><\/span><\/div><br \/>\n<section class=\"av_textblock_section \"  itemscope=\"itemscope\" itemtype=\"https:\/\/schema.org\/CreativeWork\" ><div class='avia_textblock  '   itemprop=\"text\" ><p>Dr. 
Iztok Kosem<br \/>Centre for Language Resources and Technologies at the University of Ljubljana<br \/>Faculty of Computer and Information Science<br \/>Ve\u010dna pot 113, SI-1000 Ljubljana<\/p>\n<ul>\n<li>e-mail: <a href=\"&#x6d;&#x61;&#x69;&#x6c;&#116;&#111;:iz&#x74;&#x6f;&#x6b;&#x2e;&#107;&#111;sem&#x40;&#x63;&#x6a;&#x76;&#116;&#46;&#115;i\">&#x69;&#122;t&#x6f;&#107;&#46;&#x6b;&#111;s&#x65;&#x6d;&#64;&#x63;&#x6a;&#118;t&#x2e;&#115;i<\/a><\/li>\n<\/ul>\n<\/div><\/section><br \/>\n<div  class='avia-button-wrap avia-button-right  avia-builder-el-11  el_after_av_textblock  avia-builder-el-last ' ><a href='http:\/\/hdl.handle.net\/11356\/1215'  class='avia-button   avia-icon_select-no avia-color-theme-color avia-size-medium avia-position-right '   ><span class='avia_iconbox_title' >Word lists<\/span><\/a><\/div><\/p><\/div>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":3,"featured_media":0,"parent":985,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"_relevanssi_hide_post":"","_relevanssi_hide_content":"","_relevanssi_pin_for_all":"","_relevanssi_pin_keywords":"","_relevanssi_unpin_keywords":"","_relevanssi_related_keywords":"","_relevanssi_related_include_ids":"","_relevanssi_related_exclude_ids":"","_relevanssi_related_no_append":"","_relevanssi_related_not_related":"","_relevanssi_related_posts":"","_relevanssi_noindex_reason":"","inline_featured_image":false,"episode_type":"","audio_file":"","podmotor_file_id":"","podmotor_episode_id":"","cover_image":"","cover_image_id":"","duration":"","filesize":"","filesize_raw":"","date_recorded":"","explicit":"","block":"","itunes_episode_number":"","itunes_title":"","itunes_season_number":"","itunes_episode_type":"","footnotes":""},"class_list":["post-2371","page","type-page","status-publish","hentry"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ 
-->\n<title>Keywords and n-grams from a textbook corpus - CJVT<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Keywords and n-grams from a textbook corpus - CJVT\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/\" \/>\n<meta property=\"og:site_name\" content=\"CJVT\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/centerzajezikovnevireintehnologije\" \/>\n<meta property=\"article:modified_time\" content=\"2020-11-05T21:32:24+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/infrastructure-support\\\/lists-textbooks\\\/\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/infrastructure-support\\\/lists-textbooks\\\/\",\"name\":\"Keywords and n-grams from a textbook corpus - 
CJVT\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/#website\"},\"datePublished\":\"2020-06-02T10:26:48+00:00\",\"dateModified\":\"2020-11-05T21:32:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/infrastructure-support\\\/lists-textbooks\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.cjvt.si\\\/en\\\/infrastructure-support\\\/lists-textbooks\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/infrastructure-support\\\/lists-textbooks\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Infrastructure Support\",\"item\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/infrastructure-support\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Keywords and n-grams from a textbook corpus\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/#website\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/\",\"name\":\"CJVT\",\"description\":\"Center za jezikovne vire in 
tehnologije\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/#organization\",\"name\":\"CJVT\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.cjvt.si\\\/wp-content\\\/uploads\\\/2020\\\/06\\\/CJVT-logo-red.jpg\",\"contentUrl\":\"https:\\\/\\\/www.cjvt.si\\\/wp-content\\\/uploads\\\/2020\\\/06\\\/CJVT-logo-red.jpg\",\"width\":1300,\"height\":683,\"caption\":\"CJVT\"},\"image\":{\"@id\":\"https:\\\/\\\/www.cjvt.si\\\/en\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/centerzajezikovnevireintehnologije\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Keywords and n-grams from a textbook corpus - CJVT","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/","og_locale":"en_US","og_type":"article","og_title":"Keywords and n-grams from a textbook corpus - CJVT","og_url":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/","og_site_name":"CJVT","article_publisher":"https:\/\/www.facebook.com\/centerzajezikovnevireintehnologije","article_modified_time":"2020-11-05T21:32:24+00:00","twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/","url":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/","name":"Keywords and n-grams from a textbook corpus - CJVT","isPartOf":{"@id":"https:\/\/www.cjvt.si\/en\/#website"},"datePublished":"2020-06-02T10:26:48+00:00","dateModified":"2020-11-05T21:32:24+00:00","breadcrumb":{"@id":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/lists-textbooks\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.cjvt.si\/en\/"},{"@type":"ListItem","position":2,"name":"Infrastructure Support","item":"https:\/\/www.cjvt.si\/en\/infrastructure-support\/"},{"@type":"ListItem","position":3,"name":"Keywords and n-grams from a textbook corpus"}]},{"@type":"WebSite","@id":"https:\/\/www.cjvt.si\/en\/#website","url":"https:\/\/www.cjvt.si\/en\/","name":"CJVT","description":"Center za jezikovne vire in 
tehnologije","publisher":{"@id":"https:\/\/www.cjvt.si\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.cjvt.si\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.cjvt.si\/en\/#organization","name":"CJVT","url":"https:\/\/www.cjvt.si\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.cjvt.si\/en\/#\/schema\/logo\/image\/","url":"https:\/\/www.cjvt.si\/wp-content\/uploads\/2020\/06\/CJVT-logo-red.jpg","contentUrl":"https:\/\/www.cjvt.si\/wp-content\/uploads\/2020\/06\/CJVT-logo-red.jpg","width":1300,"height":683,"caption":"CJVT"},"image":{"@id":"https:\/\/www.cjvt.si\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/centerzajezikovnevireintehnologije"]}]}},"_links":{"self":[{"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/pages\/2371","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/comments?post=2371"}],"version-history":[{"count":6,"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/pages\/2371\/revisions"}],"predecessor-version":[{"id":3566,"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/pages\/2371\/revisions\/3566"}],"up":[{"embeddable":true,"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/pages\/985"}],"wp:attachment":[{"href":"https:\/\/www.cjvt.si\/en\/wp-json\/wp\/v2\/media?parent=2371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}