spaCy plugin for Transformers, Udify, ELMo, etc.

Overview

Camphr - spaCy plugin for Transformers, Udify, ELMo, etc.

Camphr is a Natural Language Processing library that integrates a wide variety of techniques, from state-of-the-art to conventional ones, into spaCy. You can use Transformers, Udify, ELMo, etc. on spaCy.

Check the documentation for more information.

(For Japanese: https://qiita.com/tamurahey/items/53a1902625ccaac1bb2f)

Features

  • A spaCy plugin - Easy integration of a wide variety of methods
  • Transformers with spaCy - Fine-tuning pretrained models with Hydra; embedding vectors
  • Udify - BERT-based multitask model for 75 languages
  • ELMo - Deep contextualized word representations
  • Rule-based matching with Aho-Corasick and regex
  • (for Japanese) KNP
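
As a rough illustration of the rule-based matching idea, the following pure-Python sketch implements Aho-Corasick multi-pattern search. It is not Camphr's implementation; the function names `build_automaton` and `find_all` are hypothetical and chosen only for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists (hypothetical helper)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix that is a trie prefix
    out = [[]]    # out[node] -> patterns that end at this node
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)

    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes follow the parent's link.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


print(find_all("ushers", ["he", "she", "his", "hers"]))
```

All patterns are found in a single pass over the text, which is why this kind of matcher is a common choice for large dictionaries.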

License

Camphr is licensed under Apache 2.0.

Comments
  • NER Problem

    Hello!

    First of all, I would like to thank you for the great work on Camphr. It's been very useful to me! Could you help me with a question? I used the library to train a named entity recognition (NER) model, but when I load the model with nlp = spacy.load("~/outputs/2020-04-30/22-28-36/models/9") and pass it a text (doc = nlp("I live in Brazil")), no entities are recognized (doc.ents returns ()). Could you tell me why this is happening?

    opened by gabrielluz07 9
  • Gender and number subtags generation

    I was comparing the default morpho-syntactic tags generated by camphr-udify and https://github.com/Hyperparticle/udify.

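    import spacy
    import stanza
    from spacy_conll import ConllFormatter

    # Load the Udify pipeline and append the CoNLL-U formatter as the last pipe
    nlp = spacy.load("en_udify")
    conllformatter = ConllFormatter(nlp)
    nlp.add_pipe(conllformatter, last=True)

    # Parse one sentence and print it in CoNLL-U format
    doc = nlp("Mother Teresa devoted her entire life to helping others")
    print(doc._.conll_str)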
    
    
    1	Mother	Mother	PROPN		_	2	compound	_	_
    2	Teresa	Teresa	PROPN		_	3	nsubj	_	_
    3	devoted	devote	VERB		_	0	root	_	_
    4	her	her	PRON		_	6	nmod:poss	_	_
    5	entire	entire	ADJ		_	6	amod	_	_
    6	life	life	NOUN		_	3	obj	_	_
    7	to	to	SCONJ		_	8	mark	_	_
    8	helping	help	VERB		_	3	advcl	_	_
    9	others	other	NOUN		_	8	obj	_	SpaceAfter=No
    
    

    Tags returned by https://github.com/Hyperparticle/udify, for the same input.

    prediction:  1  Mother  Mother  PROPN   _       Number=Sing     2       compound        _       _
    2       Teresa  Teresa  PROPN   _       Number=Sing     3       nsubj   _       _
    3       devoted devote  VERB    _       Mood=Ind|Tense=Past|VerbForm=Fin        0       root    _       _
    4       her     her     PRON    _       Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs   6       nmod:poss      _                                               _
    5       entire  entire  ADJ     _       Degree=Pos      6       amod    _       _
    6       life    life    NOUN    _       Number=Sing     3       obj     _       _
    7       to      to      SCONJ   _       _       8       mark    _       _
    8       helping help    VERB    _       VerbForm=Ger    3       advcl   _       _
    9       others  other   NOUN    _       Number=Plur     8       obj     _       _
    

    Gender and number subtags are missing in camphr-udify. Could we have those included by default please?

    thanks, Ranjita
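
One way to make this comparison concrete is to list the tokens whose FEATS column is empty. The sketch below assumes tab-separated, 10-column CoNLL-U rows like those above; `tokens_missing_feats` is a hypothetical helper, not part of camphr or spacy_conll, and the sample rows are abbreviated from the two outputs quoted in this issue.

```python
def tokens_missing_feats(conll_text):
    """Return (id, form) pairs whose FEATS column (6th field) is empty or '_'."""
    missing = []
    for line in conll_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 10:
            continue  # skip rows that are not 10-column CoNLL-U
        token_id, form, feats = fields[0].strip(), fields[1], fields[5]
        if feats in ("", "_"):
            missing.append((token_id, form))
    return missing


# Two rows from the camphr-udify output (XPOS empty, FEATS "_").
camphr_rows = (
    "1\tMother\tMother\tPROPN\t\t_\t2\tcompound\t_\t_\n"
    "7\tto\tto\tSCONJ\t\t_\t8\tmark\t_\t_"
)
# The same two tokens from the Hyperparticle/udify output.
udify_rows = (
    "1\tMother\tMother\tPROPN\t_\tNumber=Sing\t2\tcompound\t_\t_\n"
    "7\tto\tto\tSCONJ\t_\t_\t8\tmark\t_\t_"
)

print(tokens_missing_feats(camphr_rows))
print(tokens_missing_feats(udify_rows))
```

Tokens reported for the first output but not the second are the ones where morphological features were dropped rather than genuinely absent.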

    enhancement 
    opened by ranjita-naik 6
  • Camphr+KNP returns an incorrect dependency tag when using a specific adposition.

    Hello. I am reporting a problem that occurs when analyzing universal dependencies in Japanese text using KNP. When I use the adposition “から”, Camphr returns the following wrong result: it shows the conj dependency tag on NOUN→VERB, but the expected result is the obl dependency tag on VERB→NOUN.

    Example 1, Example 2

    (Note that "再結晶" and "留去" are words I added manually, but other VERB words that existed in the original dictionary, such as "除去" and "撹拌", generate similarly incorrect results.) The same problem sometimes occurs with the adposition "と".

    With other adpositions, such as “より” and “にて”, Camphr returns a correct result.

    Example 3, Example 4

    Environment:

    • Docker(python:3.7-buster)
    • spacy = 2.3.2
    • camphr = 0.6.5
    • pyknp = 0.4.5
    • Juman++ ver.1.02
    • KNP ver.4.19
    opened by undermakingbook 6
  • Python 3.8

    Camphr is currently pinned at python < 3.8. Is there a specific reason for this, and if so, what can we do to help?

    Edit: Sorry, I just saw #19. Still, what can we do to help?

    opened by Evpok 5
  • Support multi labels textcat pipe for transformers

    closes #9

    • Add TrfForMultiLabelSequenceClassification for multilabel text classification.
      • pipe name: transformers_multilabel_sequence_classifier
    • Add docs for fine-tuning the multilabel textcat pipe
      • https://github.com/PKSHATechnology-Research/camphr/blob/feature%2Fmulti-textcat/docs/source/notes/finetune_transformers.rst#multilabel-text-classification
    enhancement 
    opened by tamuhey 5
  • unofficial-udify, allennlp, and transformers conflicting dependencies

    I'm trying to install udify on WSL as shown below.

    $ pip install unofficial-udify==0.3.0 [email protected]://github.com/PKSHATechnology-Research/camphr_models/releases/download/0.7.0/en_udify-0.7.tar.gz

    ERROR: Cannot install unofficial-udify and unofficial-udify==0.3.0 because these package versions have conflicting dependencies.

    The conflict is caused by:
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.3.0 depends on transformers<4.1 and >=4.0
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.2 depends on transformers<3.6 and >=3.4
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.1 depends on transformers<3.5 and >=3.1
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.2.0 depends on transformers<3.5 and >=3.1
        unofficial-udify 0.3.0 depends on transformers<3.0.0 and >=2.3.0
        allennlp 1.1.0 depends on transformers<3.1 and >=3.0

    Is this a known issue? Could you suggest a workaround, please?
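
The resolver error can be read as a set of version ranges with no common point. The following sketch checks this by hand with plain tuples; the helper names are hypothetical, and the ranges are copied from the error message above. It only handles simple `>=lower,<upper` ranges, not full pip version specifiers.

```python
def parse_version(text, width=3):
    """Turn '3.4' into (3, 4, 0) so versions of different lengths compare correctly."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def ranges_overlap(a, b):
    """Each range is (inclusive_lower, exclusive_upper) given as version strings."""
    a_low, a_high = parse_version(a[0]), parse_version(a[1])
    b_low, b_high = parse_version(b[0]), parse_version(b[1])
    return max(a_low, b_low) < min(a_high, b_high)


# transformers range required by unofficial-udify 0.3.0
udify_range = ("2.3.0", "3.0.0")

# transformers ranges required by each allennlp release named in the error
allennlp_ranges = {
    "1.3.0": ("4.0", "4.1"),
    "1.2.2": ("3.4", "3.6"),
    "1.2.1": ("3.1", "3.5"),
    "1.2.0": ("3.1", "3.5"),
    "1.1.0": ("3.0", "3.1"),
}

compatible = [
    release
    for release, required in allennlp_ranges.items()
    if ranges_overlap(udify_range, required)
]
print(compatible)
```

An empty list means none of the listed allennlp releases accepts a transformers version that unofficial-udify 0.3.0 also accepts, which matches what the resolver reports.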

    bug 
    opened by ranjita-naik 3
  • Missing tag information

    I noticed that the spaCy tag field is empty. Is this a known issue? It looks like Udify supports some level of ufeats tagging (https://universaldependencies.org/u/feat/index.html). I wonder whether I'm supposed to be getting any of this in spaCy and have a bug in my setup, or whether it just isn't implemented yet. Would it be sourced in token.tag, as I'm thinking (if it does exist)?

    I also noticed that displaCy doesn't render the POS info. Could that be related?

    BTW, just have to say that this is awesome.

    opened by tslater 3
  • ImportError: cannot import name 'load_udify' from 'camphr.pipelines' following the example

    I followed the example here: https://camphr.readthedocs.io/en/latest/notes/udify.html

    I only saw the 0.7.0 model, so I went with that instead. The German and English examples work great, but the Japanese one gives me this error:

    >>> from camphr.pipelines import load_udify
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: cannot import name 'load_udify' from 'camphr.pipelines' (/home/tyler/camphr/env/lib/python3.8/site-packages/camphr/pipelines/__init__.py)
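
When an example import fails like this, it can help to check whether the installed version exposes the name at all. A minimal sketch using only the standard library; `find_attr` is a hypothetical helper, and whether `load_udify` exists depends on the installed camphr version.

```python
import importlib


def find_attr(module_name, attr_name):
    """Return module.attr if the module imports and defines attr, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the module itself is not installed or failed to import
    return getattr(module, attr_name, None)


load_udify = find_attr("camphr.pipelines", "load_udify")
if load_udify is None:
    print("camphr.pipelines does not provide load_udify in this environment")
```

If the name is missing, the documentation being followed probably targets a different release than the one installed.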
    
    opened by tslater 3
  • doc.ents empty, doc.is_nered == False

    I followed the documentation to fine-tune the bert-base-cased (en) ner model and then made a spacy doc with text "Bob Jones and Barack Obama went up the hill in Wisconsin." but the resulting doc has doc.ents = () and doc.is_nered = False.

    Am I missing something?

    Thank you!

    opened by jack-rory-staunton 3
  • Improvement for サ変 of KNP

    Inside _get_child_dep(c), pos for 名詞,サ変名詞 is changed into VERB when it is followed by AUX. So now I think that _get_dep(tag[0]) should be done after _get_child_dep(c).

    opened by KoichiYasuoka 3
  • Bump transformers from 3.0.2 to 4.1.1

    Bumps transformers from 3.0.2 to 4.1.1.

    Release notes

    Sourced from transformers's releases.

    Patch release: better error message & invalid trainer attribute

    This patch release introduces:

    • A better error message when trying to instantiate a SentencePiece-based tokenizer without having SentencePiece installed. #8881
    • Fixes an incorrect attribute in the trainer. #8996

    Transformers v4.0.0: Fast tokenizers, model outputs, file reorganization

    Transformers v4.0.0-rc-1: Fast tokenizers, model outputs, file reorganization

    Breaking changes since v3.x

    Version v4.0.0 introduces several breaking changes that were necessary.

    1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default.

    The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set. The main breaking change is the handling of overflowing tokens between the python and rust tokenizers.

    How to obtain the same behavior as v3.x in v4.x

    In version v3.x:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xxx")

    to obtain the same in version v4.x:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xxx", use_fast=False)

    2. SentencePiece is removed from the required dependencies

    The requirement on the SentencePiece dependency has been lifted from the setup.py. This is done so that we may have a channel on anaconda cloud without relying on conda-forge. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation.

    This includes the slow versions of:

    • XLNetTokenizer
    • AlbertTokenizer
    • CamembertTokenizer
    • MBartTokenizer
    • PegasusTokenizer
    • T5Tokenizer
    • ReformerTokenizer
    • XLMRobertaTokenizer

    How to obtain the same behavior as v3.x in v4.x

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 2
  • Bump certifi from 2021.5.30 to 2022.12.7 in /packages/camphr_pattern_search

    Bumps certifi from 2021.5.30 to 2022.12.7.

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump numpy from 1.21.0 to 1.22.0 in /packages/camphr_pattern_search

    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    dependencies 
    opened by dependabot[bot] 0
Releases (0.7.0)