[GoCN Cool Go Picks] gse: a high-performance multilingual NLP and word segmentation library for Go

Posted by GoCN

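What is gse?

gse is a high-performance, multilingual NLP and word segmentation library for Go. It supports English, Chinese, Japanese, and other languages, and can be plugged into Elasticsearch and bleve. gse is a Go implementation of jieba (结巴分词), and it also tries to add NLP features and further properties.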

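Features

  • Multiple segmentation modes: normal, search engine, full mode, precise mode, and HMM mode
  • Custom dictionaries, embedded dictionaries, part-of-speech tagging, stop words, and trimming and analysis of segmentation results
  • Multilingual support: English, Chinese, Japanese, and others
  • Traditional Chinese support
  • NLP and TensorFlow support (in progress)
  • Named entity recognition (in progress)
  • Integration with Elasticsearch and bleve
  • Can run as a JSON RPC service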
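
Algorithms

  • The dictionary is implemented with a double-array trie.
  • The segmenter finds the shortest path based on word frequency using dynamic programming, and also offers DAG-based and HMM-based segmentation.
  • HMM segmentation uses the Viterbi algorithm.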

Segmentation speed

  • Single thread: 9.2 MB/s
  • Concurrent goroutines: 26.8 MB/s
  • HMM mode, single thread: 3.2 MB/s (measured on a dual-core, four-thread MacBook Pro)

Quick start

    package main

    import (
        "fmt"
        "regexp"

        "github.com/go-ego/gse"
        "github.com/go-ego/gse/hmm/pos"
    )

    var (
        seg    gse.Segmenter
        posSeg pos.Segmenter

        // Segmenter created with gse.New; the error is ignored for brevity
        new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

        text = "你好世界, Hello world, Helloworld."
    )

    func main() {
        // Load the default dictionary
        seg.LoadDict()
        // Load the default embedded dictionary
        // seg.LoadDictEmbed()
        //
        // Load the Simplified Chinese dictionary
        // seg.LoadDict("zh_s")
        // seg.LoadDictEmbed("zh_s")
        //
        // Load the Traditional Chinese dictionary
        // seg.LoadDict("zh_t")
        //
        // Load the Japanese dictionary
        // seg.LoadDict("jp")
        //
        // Load a dictionary file
        // seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

        cut()

        segCut()
    }

    func cut() {
        hmm := new.Cut(text, true)
        fmt.Println("cut use hmm: ", hmm)

        hmm = new.CutSearch(text, true)
        fmt.Println("cut search use hmm: ", hmm)
        fmt.Println("analyze: ", new.Analyze(hmm, text))

        hmm = new.CutAll(text)
        fmt.Println("cut all: ", hmm)

        reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
        text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
        hmm = seg.CutDAG(text1, reg)
        fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
    }

    func analyzeAndTrim(cut []string) {
        a := seg.Analyze(cut, "")
        fmt.Println("analyze the segment: ", a)

        cut = seg.Trim(cut)
        fmt.Println("cut all: ", cut)

        fmt.Println(seg.String(text, true))
        fmt.Println(seg.Slice(text, true))
    }

    func cutPos() {
        po := seg.Pos(text, true)
        fmt.Println("pos: ", po)
        po = seg.TrimPos(po)
        fmt.Println("trim pos: ", po)

        posSeg.WithGse(seg)
        po = posSeg.Cut(text, true)
        fmt.Println("pos: ", po)

        po = posSeg.TrimWithPos(po, "zg")
        fmt.Println("trim pos: ", po)
    }

    func segCut() {
        // Text to segment
        tb := []byte("山达尔星联邦共和国联邦政府")

        // Handle the segmentation results
        fmt.Println("segmentation result as a string, search mode: ", seg.String(string(tb), true))
        fmt.Println("segmentation result as a slice: ", seg.Slice(string(tb)))

        segments := seg.Segment(tb)
        // Handle the segmentation result, normal mode
        fmt.Println(gse.ToString(segments))

        segments1 := seg.Segment([]byte(text))
        // Search mode
        fmt.Println(gse.ToString(segments1, true))
    }

Output:

    cut use hmm:  [你好 世界 ,  hello   world ,  helloworld .]
    cut search use hmm:  [你好 世界 ,  hello   world ,  helloworld .]
    analyze:  [0 6 0 0  你好 725 l 6 12 1 0  世界 34387 n 25 27 2 0  ,  0  27 32 3 0  hello 0  26 27 4 0    0  32 37 5 0  world 0  12 14 6 0  ,  0  27 37 7 0  helloworld 0  37 38 8 0  . 0 ]
    cut all:  [你好 世界 ,   h e l l o   w o r l d ,   h e l l o w o r l d .]
    Cut with hmm and regexp:  [헬로월드   헬로   서울 ,  2021年 09月 10日 ,  3.14] 헬로월드 2021年
    segmentation result as a string, search mode:  山/n 达尔/nrt 星/n 联邦/n 共和/nz 国/zg 共和国/ns 联邦/n 政府/n 联邦政府/nt 
    segmentation result as a slice:  [山 达尔 星 联邦 共和国 联邦政府]
    山/n 达尔/nrt 星/n 联邦/n 共和国/ns 联邦政府/nt 
    你好/l 世界/n ,/x  /x hello/x  /x world/x ,/x  /x helloworld/x ./x 

For more usage, see the official examples on GitHub.

References

  • https://github.com/go-ego/gse/blob/master/README_zh.md


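Call for contributors to Cool Go Picks (《酷Go推荐》)

The GoCN community is launching a new column, Cool Go Picks, similar to the GoCN daily news. Each week it recommends one library or good project, with a short write-up of how to use it or what its strengths are, so that readers can really get to know new libraries and learn how to use them.

The rules are roughly the same as for the daily news: if enough people sign up, each contributor's turn comes about once a month. Sign-ups are welcome!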


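[GoCN Cool Go Picks] gopsutil: a handy system monitoring tool for Go

Introduction

What is gopsutil? To talk about gopsutil, we first need to look at psutil. As the name suggests, psutil = process and system utilities. It is a cross-platform Python library that makes it easy to retrieve information on running processes and system utilization, including CPU, memory, disk, and network. gopsutil is the Go port of psutil.

Why use gopsutil? Compared with calling the corresponding system functions directly through syscall, gopsutil hides the differences between operating systems, which makes it highly portable.

Quick start

Install:

    go get github.com/shirou/gopsutil

Usage: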

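    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/shirou/gopsutil/cpu"
    )

    func main() {
        // Detailed information for every CPU
        info, err := cpu.Info()
        if err != nil {
            log.Fatal(err)
        }

        // Per-CPU usage sampled over one second (second argument: per CPU)
        per, err := cpu.Percent(1*time.Second, true)
        if err != nil {
            log.Fatal(err)
        }

        // %f applied to a slice formats each element
        fmt.Printf("CPU Percent: %f\n", per)

        fmt.Println(info)
    }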

Output:

    CPU Percent: [4.040404 4.000000 5.050505 6.930693]
    [{"cpu":0,"vendorId":"AuthenticAMD","family":"23","model":"49","stepping":0,"physicalId":"0","coreId":"0","cores":1,"modelName":"AMD EPYC 7K62 48-Core Processor","mhz":2595.124,"cacheSize":512,"flags":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","sse4_1","sse4_2","x2apic","movbe","popcnt","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","topoext","ibpb","vmmcall","fsgsbase","bmi1","avx2","smep","bmi2","rdseed","adx","smap","clflushopt","sha_ni","xsaveopt","xsavec","xgetbv1","arat"],"microcode":"0x1000065"} {"cpu":1,"vendorId":"AuthenticAMD","family":"23","model":"49","stepping":0,"physicalId":"0","coreId":"1","cores":1,"modelName":"AMD EPYC 7K62 48-Core Processor","mhz":2595.124,"cacheSize":512,"flags":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","sse4_1","sse4_2","x2apic","movbe","popcnt","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","topoext","ibpb","vmmcall","fsgsbase","bmi1","avx2","smep","bmi2","rdseed","adx","smap","clflushopt","sha_ni","xsaveopt","xsavec","xgetbv1","arat"],"microcode":"0x1000065"} {"cpu":2,"vendorId":"AuthenticAMD","family":"23","model":"49","stepping":0,"physicalId":"0","coreId":"2","cores":1,"modelName":"AMD EPYC 7K62 48-Core 
Processor","mhz":2595.124,"cacheSize":512,"flags":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","sse4_1","sse4_2","x2apic","movbe","popcnt","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","topoext","ibpb","vmmcall","fsgsbase","bmi1","avx2","smep","bmi2","rdseed","adx","smap","clflushopt","sha_ni","xsaveopt","xsavec","xgetbv1","arat"],"microcode":"0x1000065"} {"cpu":3,"vendorId":"AuthenticAMD","family":"23","model":"49","stepping":0,"physicalId":"0","coreId":"3","cores":1,"modelName":"AMD EPYC 7K62 48-Core Processor","mhz":2595.124,"cacheSize":512,"flags":["fpu","vme","de","pse","tsc","msr","pae","mce","cx8","apic","sep","mtrr","pge","mca","cmov","pat","pse36","clflush","mmx","fxsr","sse","sse2","ht","syscall","nx","mmxext","fxsr_opt","pdpe1gb","rdtscp","lm","rep_good","nopl","cpuid","extd_apicid","tsc_known_freq","pni","pclmulqdq","ssse3","fma","cx16","sse4_1","sse4_2","x2apic","movbe","popcnt","aes","xsave","avx","f16c","rdrand","hypervisor","lahf_lm","cmp_legacy","cr8_legacy","abm","sse4a","misalignsse","3dnowprefetch","osvw","topoext","ibpb","vmmcall","fsgsbase","bmi1","avx2","smep","bmi2","rdseed","adx","smap","clflushopt","sha_ni","xsaveopt","xsavec","xgetbv1","arat"],"microcode":"0x1000065"}]

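Clear division of labor

gopsutil splits its functionality into separate subpackages, mainly cpu, disk, docker, host, mem, net, process, and winservices. To use a feature, import the corresponding subpackage. In the code above, for example, we want CPU information, so we import the cpu subpackage. The example retrieves the usage of each CPU and detailed information about all CPUs.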

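A side note

I am currently writing a demo of performance monitoring in Go, and will follow up with introductions or comparisons on this topic.

References

https://github.com/shirou/gopsutil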

