SoNaR
  
  
    
    
      
        Persistent Identifier (auto)
        http://hdl.handle.net/21.11114/COLL-0000-000B-CABA-0
       
    
    
    
      Description (0-1)
      
        
          
            SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and in the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results of the D-Coi and Corea projects, which were awarded funding in the first call for proposals within the STEVIN programme. (SOURCE: CLARIN).
The SoNaR Corpus consists of three parts:
- SoNaR-500: 500 M words, automatically tokenized, lemmatized and POS-tagged
- SoNaR-1: 1 M words with hand-checked semantic annotations: named entities, co-reference relations, semantic roles and spatio-temporal relations
- SoNaR New Media Corpus: subpart of SoNaR-500 containing the material from new media only (tweets, chats, SMS)
          
         
       
     
    
    
      
  
    LandingPage (1)
    
      
        http://hdl.handle.net/11372/LRT-1498
      
    
   
    
      
  
    Title(s) (1-n)
    
      
        
        
          [1]: 
          SoNaR, 
        
        
        
          [2]: 
          SoNaR Reference Corpus
        
        
      
     
   
    
      
  
    
      
  
    Genre(s) (0-n)
    
      
        
        
          newspaper-article, other, fiction, social-media-texts, non-academic-non-fiction, academic-nonfiction
          
        
        
      
     
   
    
      
  
    
      
  
    Domain(s) (0-n)
    
      
        
        
          A large reference corpus of written Dutch is invaluable for linguistic research and the development of profitable services that require advanced language technology (SOURCE: N. Oostdijk et al.)
          
        
        
      
     
   
    
      
  
    Language(s) (1-n)
    
      
        
        
          Flemish [nld], Dutch (Northern) [nld]
          
        
        
      
     
   
    
      
  
    CLARIN centre (0-1)
    
      
        HLT Centre (INL)
      
    
   
    
      
  
    
      
  
    
      
  
    Creator(s) (0-n)
    
      
        
        
          [1]: 
          Dr. N. Oostdijk (CLST, Radboud University Nijmegen), 
        
        
        
          [2]: 
          Dr. N. Oostdijk (Tilburg University (ILK)-Utrecht University (UiL-OTS)-Hogeschool Gent-Katholieke Universiteit Leuven (CCL)-Twente University)
        
        
      
     
   
    
      
  
    Project(s) (0-n)
    
      
        
        
          
          SoNaR-corpus site (Funder: NTU STEVIN)
         
        
      
     
   
    
    
    
      Resource(s) (1-n)
      
        
          
          
            Resource 1
            
              
                  
                  
                    Description (0-1)
                    
                      
                        
                          SoNaR500 is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. 
SoNaR500 was automatically tokenized, lemmatized and POS-tagged (except for the new media part).
                        
                       
                     
                   
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
    Annotation (0-n)
    
      
        
        
          [1]: 
          [syntacticAnnotation] [mixed] [text/xml], 
        
        
        
          [2]: 
          [lemmatization] [mixed] [text/xml], 
        
        
        
          [3]: 
          [posTagging] [mixed] [text/xml]
        
        
      
     
   
                  
                    
  
                  
                    
  
                  
               
             
           
          
          
            Resource 2
            
              
                  
                  
                    Description (0-1)
                    
                      
                        
                          SoNaR-1 is a dataset comprising one million words. Although largely a subset of SoNaR-500, SoNaR-1 includes far fewer text types. SoNaR-1 provides several types of semantic annotation, viz. named entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All annotations have been manually verified.
                        
                       
                     
                   
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
    Annotation (0-n)
    
      
        
        
          [1]: 
          [semanticAnnotation-coreference] [manual] [text/html], 
        
        
        
          [2]: 
          [semanticAnnotation-namedEntities] [manual] [text/xml], 
        
        
        
          [3]: 
          [semanticAnnotation-roles] [manual] [text/xml], 
        
        
        
          [4]: 
          [semanticAnnotation-relations] [manual] [other]
        
        
      
     
   
                  
                    
  
                  
               
             
           
          
          
            Resource 3
            
              
                  
                  
                    Description (0-1)
                    
                      
                        
                          SoNaR New Media texts (tweets, chats and SMS messages) were also collected in the STEVIN project SoNaR, but they are not part of the SoNaR corpus 1.0 and can be obtained separately as the SoNaR New Media Corpus.
                        
                       
                     
                   
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
                  
                    
  
    Annotation (0-n)
    
      
        
        
          
          [noAnnotation] [unknown] [other]
        
        
      
     
   
                  
                    
  
                  
               
             
           
          
        
       
     
    
    
      Provenance(s) (0-n)
      
        
          
            Provenance 1
            
              
                  
                    
  
                  
                    
  
                  
                    
  
    Country (0-1)
    
      
        Netherlands (the) NL
      
    
   
                  
               
             
           
        
       
     
    
    
    
    
      Accessibility (0-1)
      
        
          Accessibility
          
            
              
                
                  
  
                
                  
  
                
                  
  
    License name(s) (0-n)
    
      
        
        
          License via HLT-centre, Dutch Language Union
          
        
        
      
     
   
                
                  
  
    Licence URL(s) (0-n)
    
      
        
        
          http://tst-centrale.org/nl/tst-materialen/corpora/sonar-corpus-detail
          
        
        
      
     
   
                
                  
  
    Non-commercial usage (0-1)
    
      
        yes
      
    
   
                
                  
  
    Website(s) (0-n)
    
      
        
        
          http://lands.let.ru.nl/projects/SoNaR/intro.html
          
        
        
      
     
   
                
                  
  
                
                  
  
                
                  
  
    Contact(s) (0-n)
    
      
        
        
          Dr. Nelleke Oostdijk: Centre for Language and Speech (n.oostdijk@let.ru.nl)
          
        
        
      
     
   
                
                  
  
                
              
             
           
         
       
     
    
    
      Documentation (0-1)
      
        
          Documentation
          
            
              
                
                  
  
                
                  
  
                
                  
  
                
                  
  
    URL(s) (0-n)
    
      
        
        
          http://ticclops.uvt.nl/SoNaR_end-user_documentation_v.1.0.4.pdf
          
        
        
      
     
   
                
              
             
           
         
       
     
    
    
     
   
  
  
    
  