Models¶

The base classes PreTrainedModel, TFPreTrainedModel, and FlaxPreTrainedModel implement the common methods for loading/saving a model either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from HuggingFace’s AWS S3 repository).

PreTrainedModel and TFPreTrainedModel also implement a few methods which are common among all the models to:

  • resize the input token embeddings when new tokens are added to the vocabulary

  • prune the attention heads of the model.

The other methods that are common to each model are defined in ModuleUtilsMixin (for the PyTorch models) and TFModuleUtilsMixin (for the TensorFlow models) or, for text generation, GenerationMixin (for the PyTorch models), TFGenerationMixin (for the TensorFlow models) and FlaxGenerationMixin (for the Flax/JAX models).
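As a rough sketch of how these shared methods fit together (the added tokens and the save directory below are illustrative, not part of the library):

from transformers import BertModel, BertTokenizer

# Load a pretrained checkpoint and its tokenizer (downloaded and cached on first use).
model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Add new tokens, then resize the input embeddings so model and tokenizer vocabularies match.
tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])
model.resize_token_embeddings(len(tokenizer))

# Prune attention heads 0 and 2 in layer 1.
model.prune_heads({1: [0, 2]})

# Save the modified model and its configuration to a local directory.
model.save_pretrained("./my-resized-bert/")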

PreTrainedModel¶

class transformers.PreTrainedModel(config: transformers.configuration_utils.PretrainedConfig, *inputs, **kwargs)[source]¶

Base class for all models.

PreTrainedModel takes care of storing the configuration of the models and handles methods for loading, downloading and saving models as well as a few methods common to all models to:

  • resize the input embeddings,

  • prune heads in the self-attention layers.

Class attributes (overridden by derived classes):

  • config_class (PretrainedConfig) – A subclass of PretrainedConfig to use as configuration class for this model architecture.

  • load_tf_weights (Callable) – A python method for loading a TensorFlow checkpoint in a PyTorch model, taking as arguments:

    • model (PreTrainedModel) – An instance of the model on which to load the TensorFlow checkpoint.

    • config (PretrainedConfig) – An instance of the configuration associated to the model.

    • path (str) – A path to the TensorFlow checkpoint.

  • base_model_prefix (str) – A string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.

  • is_parallelizable (bool) – A flag indicating whether this model supports model parallelization.

property base_model¶

The main body of the model.

Type

torch.nn.Module

property dummy_inputs¶

Dummy inputs to do a forward pass in the network.

Type

Dict[str, torch.Tensor]
classmethod from_pretrained(pretrained_model_name_or_path: Optional[Union[str, os.PathLike]], *model_args, **kwargs)[source]¶

Instantiate a pretrained PyTorch model from a pre-trained model configuration.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

Parameters
  • pretrained_model_name_or_path ( or , optional) –

    Can be either:

    • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like , or namespaced under a user or organization name, like .

    • A path to a directory containing model weights saved using , e.g., .

    • A path or url to a tensorflow index checkpoint file (e.g, ). In this case, should be set to and a configuration object should be provided as argument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.

    • A path or url to a model folder containing a flax checkpoint file in .msgpack format (e.g, containing ). In this case, should be set to .

    • if you are both providing the configuration and state dictionary (resp. with keyword arguments and ).

  • model_args (sequence of positional arguments, optional) – All remaining positional arguments will be passed to the underlying model’s method.

  • config (, optional) –

    Can be either:

    Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:

    • The model is a model provided by the library (loaded with the model id string of a pretrained model).

    • The model was saved using and is reloaded by supplying the save directory.

    • The model is loaded by supplying a local directory as and a configuration JSON file named config.json is found in the directory.

  • state_dict (, optional) –

    A state dictionary to use instead of a state dictionary loaded from saved weights file.

    This option can be used if you want to create a model from a pretrained configuration but load your own weights. In this case though, you should check if using and is not a simpler option.

  • cache_dir (, optional) – Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • from_tf (, optional, defaults to ) – Load the model weights from a TensorFlow checkpoint save file (see docstring of argument).

  • from_flax (, optional, defaults to ) – Load the model weights from a Flax checkpoint save file (see docstring of argument).

  • ignore_mismatched_sizes (, optional, defaults to ) – Whether or not to raise an error if some of the weights from the checkpoint do not have the same size as the weights of the model (if for instance, you are instantiating a model with 10 labels from a checkpoint with 3 labels).

  • force_download (, optional, defaults to ) – Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.

  • resume_download (, optional, defaults to ) – Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.

  • proxies () – A dictionary of proxy servers to use by protocol or endpoint, e.g., . The proxies are used on each request.

  • output_loading_info (bool, optional, defaults to False) – Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.

  • local_files_only (, optional, defaults to ) – Whether or not to only look at local files (i.e., do not try to download the model).

  • use_auth_token ( or bool, optional) – The token to use as HTTP bearer authorization for remote files. If , will use the token generated when running (stored in ).

  • revision (, optional, defaults to ) – The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so can be any identifier allowed by git.

  • mirror (, optional) – Mirror source to accelerate downloads in China. If you are from China and have an accessibility problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. Please refer to the mirror site for more information.

  • _fast_init (bool, optional, defaults to True) – Whether or not to disable fast initialization.

  • low_cpu_mem_usage (bool, optional, defaults to False) – Tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. This is an experimental feature and subject to change at any moment.

  • torch_dtype ( or , optional) –

    Override the default torch.dtype and load the model under this dtype. If "auto" is passed, the dtype will be automatically derived from the model’s weights.

    Warning

    One should only disable _fast_init to ensure backwards compatibility with for seeded model initialization. This argument will be removed at the next major version. See pull request for more information.

  • kwargs (remaining dictionary of keyword arguments, optional) –

    Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., ). Behaves differently depending on whether a is provided or automatically loaded:

    • If a configuration is provided with , will be directly passed to the underlying model’s method (we assume all relevant updates to the configuration have already been done)

    • If a configuration is not provided, will be first passed to the configuration class initialization function (). Each key of that corresponds to a configuration attribute will be used to override said attribute with the supplied value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model’s function.

Note

Passing use_auth_token=True is required when you want to use a private model.

Note

Activate the special “offline-mode” to use this method in a firewalled environment.
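A minimal sketch of offline mode, assuming the TRANSFORMERS_OFFLINE environment variable supported by recent library versions and a checkpoint that is already in the local cache:

import os

# Must be set before importing transformers so the library only reads from the local cache.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import BertModel

# Succeeds only if 'bert-base-uncased' has been downloaded and cached beforehand.
model = BertModel.from_pretrained("bert-base-uncased")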

Examples:

>>> from transformers import BertConfig, BertModel
>>> # Download model and configuration from huggingface.co and cache.
>>> model = BertModel.from_pretrained('bert-base-uncased')
>>> # Model was saved using `save_pretrained('./test/saved_model/')` (for example purposes, not runnable).
>>> model = BertModel.from_pretrained('./test/saved_model/')
>>> # Update configuration during loading.
>>> model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
>>> assert model.config.output_attentions == True
>>> # Loading from a TF checkpoint file instead of a PyTorch model (slower, for example purposes, not runnable).
>>> config = BertConfig.from_json_file('./tf_model/my_tf_model_config.json')
>>> model = BertModel.from_pretrained('./tf_model/my_tf_checkpoint.ckpt.index', from_tf=True, config=config)
>>> # Loading from a Flax checkpoint file instead of a PyTorch model (slower)
>>> model = BertModel.from_pretrained('bert-base-uncased', from_flax=True)
get_input_embeddings() → torch.nn.modules.module.Module[source]¶

Returns the model’s input embeddings.

Returns

A torch module mapping vocabulary to hidden states.

Return type
get_output_embeddings() → torch.nn.modules.module.Module[source]¶

Returns the model’s output embeddings.

Returns

A torch module mapping hidden states to vocabulary.

Return type
(flag:bool=True)[source]¶

Deactivates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

(flag:bool=True)[source]¶

Activates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

init_weights()[source]¶

If needed prunes and maybe initializes weights.

prune_heads(heads_to_prune: Dict[int, List[int]])[source]¶

Prunes heads of the base model.

Parameters

heads_to_prune (Dict[int, List[int]]) – Dictionary with keys being selected layer indices (int) and associated values being the list of heads to prune in said layer (list of int). For instance {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.
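For instance, the pruning pattern described above can be applied like this (the checkpoint name is only an example):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.
model.prune_heads({1: [0, 2], 2: [2, 3]})

# The pruned heads are recorded in the configuration and re-applied when the model is reloaded.
print(model.config.pruned_heads)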

push_to_hub(repo_path_or_name: Optional[str] = None, repo_url: Optional[str] = None, use_temp_dir: bool = False, commit_message: Optional[str] = None, organization: Optional[str] = None, private: Optional[bool] = None, use_auth_token: Optional[Union[bool, str]] = None) → str¶

Upload the model checkpoint to the 🤗 Model Hub while synchronizing a local clone of the repo in repo_path_or_name.

Parameters
  • repo_path_or_name (, optional) – Can either be a repository name for your model in the Hub or a path to a local folder (in which case the repository will have the name of that local folder). If not specified, will default to the name given by and a local directory with that name will be created.

  • repo_url (, optional) – Specify this in case you want to push to an existing repository in the hub. If unspecified, a new repository will be created in your namespace (unless you specify an ) with .

  • use_temp_dir (, optional, defaults to ) – Whether or not to clone the distant repo in a temporary directory or in inside the current working directory. This will slow things down if you are making changes in an existing repo since you will need to clone the repo before every push.

  • commit_message (, optional) – Message to commit while pushing. Will default to .

  • organization (, optional) – Organization in which you want to push your model (you must be a member of this organization).

  • private (, optional) – Whether or not the repository created should be private (requires a paying subscription).

  • use_auth_token ( or , optional) – The token to use as HTTP bearer authorization for remote files. If , will use the token generated when running (stored in ). Will default to if is not specified.

Returns

The url of the commit of your model in the given repository.

Return type

Examples:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

# Push the model to your namespace with the name "my-finetuned-bert" and have a local clone in the
# `my-finetuned-bert` folder.
model.push_to_hub("my-finetuned-bert")

# Push the model to your namespace with the name "my-finetuned-bert" with no local clone.
model.push_to_hub("my-finetuned-bert", use_temp_dir=True)

# Push the model to an organization with the name "my-finetuned-bert" and have a local clone in the
# `my-finetuned-bert` folder.
model.push_to_hub("my-finetuned-bert", organization="huggingface")

# Make a change to an existing repo that has been cloned locally in `my-finetuned-bert`.
model.push_to_hub("my-finetuned-bert", repo_url="https://huggingface.co/sgugger/my-finetuned-bert")
resize_token_embeddings(new_num_tokens: Optional[int] = None) → torch.nn.modules.sparse.Embedding[source]¶

Resizes the input token embeddings matrix of the model if new_num_tokens != config.vocab_size.

Takes care of tying weights embeddings afterwards if the model class has a tie_weights() method.

Parameters

new_num_tokens (, optional) – The number of new tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end. If not provided or , just returns a pointer to the input tokens module of the model without doing anything.

Returns

Pointer to the input tokens Embeddings Module of the model.

Return type

torch.nn.Embedding
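A short illustration of the resizing behaviour (the checkpoint name and the increment of 10 are only examples):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
old_vocab_size = model.config.vocab_size

# Grow the vocabulary by 10 entries; the return value is the resized input embedding module.
embeddings = model.resize_token_embeddings(old_vocab_size + 10)
print(embeddings.num_embeddings)  # old_vocab_size + 10
print(model.config.vocab_size)    # kept in sync with the new embedding size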
save_pretrained(save_directory: Union[str, os.PathLike], save_config: bool = True, state_dict: Optional[dict] = None, save_function: Callable = torch.save, push_to_hub: bool = False, **kwargs)[source]¶

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

Parameters
  • save_directory ( or ) – Directory to which to save. Will be created if it doesn’t exist.

  • save_config (, optional, defaults to ) – Whether or not to save the config of the model. Useful when in distributed training like TPUs and need to call this function on all processes. In this case, set only on the main process to avoid race conditions.

  • state_dict (nested dictionary of ) – The state dictionary of the model to save. Will default to , but can be used to only save parts of the model or if special precautions need to be taken when recovering the state dictionary of a model (like when using model parallelism).

  • save_function () – The function to use to save the state dictionary. Useful on distributed training like TPUs when one need to replace by another method.

  • push_to_hub (, optional, defaults to ) –

    Whether or not to push your model to the Hugging Face model hub after saving it.

    Warning

    Using will synchronize the repository you are pushing to with , which requires to be a local clone of the repo you are pushing to if it’s an existing folder. Pass along to use a temporary directory instead.

  • kwargs – Additional key word arguments passed along to the method.
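A typical save/reload round trip looks like this (the directory path is illustrative):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Writes the weights (pytorch_model.bin) and the configuration (config.json) into the directory.
model.save_pretrained("./saved/bert-base-uncased/")

# The same directory can then be handed back to from_pretrained.
reloaded = BertModel.from_pretrained("./saved/bert-base-uncased/")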

set_input_embeddings(value: torch.nn.modules.module.Module)[source]¶

Set model’s input embeddings.

Parameters

value () – A module mapping vocabulary to hidden states.

tie_weights()[source]¶

Tie the weights between the input embeddings and the output embeddings.

If the torchscript flag is set in the configuration, TorchScript can’t handle parameter sharing, so we clone the weights instead.

Model Instantiation dtype¶

Under PyTorch a model normally gets instantiated in torch.float32 format. This can be an issue if one tries to load a model whose weights are in fp16, since it’d require twice as much memory. To overcome this limitation, you can either explicitly pass the desired dtype using the torch_dtype argument:

model=T5ForConditionalGeneration.from_pretrained("t5",torch_dtype=torch.float16)

or, if you want the model to always load in the most optimal memory pattern, you can use the special value "auto", and then dtype will be automatically derived from the model’s weights:

model=T5ForConditionalGeneration.from_pretrained("t5",torch_dtype="auto")

Models instantiated from scratch can also be told which dtype to use with:

config = T5Config.from_pretrained("t5")
model = AutoModel.from_config(config)

Due to PyTorch design, this functionality is only available for floating dtypes.
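To check what was actually loaded, the resulting dtype can be inspected on the model; a small sketch using a public checkpoint:

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small", torch_dtype=torch.float16)

# All floating-point parameters are now in half precision.
print(model.dtype)                     # torch.float16
print(next(model.parameters()).dtype)  # torch.float16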

ModuleUtilsMixin¶

class transformers.modeling_utils.ModuleUtilsMixin[source]¶

A few utilities for torch.nn.Module, to be used as a mixin.

add_memory_hooks()[source]¶

Add a memory hook before and after each sub-module forward pass to record increase in memory consumption.

Increase in memory consumption is stored in a mem_rss_diff attribute for each module and can be reset to zero with model.reset_memory_hooks_state().
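A minimal sketch of the hook workflow, assuming the mem_rss_diff attribute mentioned above (the input ids are arbitrary):

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.add_memory_hooks()

# Run one forward pass so the hooks record the memory increase per sub-module.
model(torch.tensor([[101, 2023, 2003, 1037, 3231, 102]]))
print(model.encoder.layer[0].mem_rss_diff)

# Zero the counters again.
model.reset_memory_hooks_state()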

property device¶

The device on which the module is (assuming that all the module parameters are on the same device).

Type

torch.device

property dtype¶

The dtype of the module (assuming that all the module parameters have the same dtype).

Type

torch.dtype
estimate_tokens(input_dict: Dict[str, Union[torch.Tensor, Any]]) → int[source]¶

Helper function to estimate the total number of tokens from the model inputs.

Parameters

inputs () – The model inputs.

Returns

The total number of tokens.

Return type
floating_point_ops(input_dict: Dict[str, Union[torch.Tensor, Any]], exclude_embeddings: bool = True) → int[source]¶

Get number of (optionally, non-embeddings) floating-point operations for the forward and backward passes of a batch with this transformer model. The default approximation neglects the quadratic dependency on the number of tokens (valid if ) as laid out in this paper. Should be overridden for transformers with parameter re-use, e.g. Albert or Universal Transformers, or if doing long-range modeling with very high sequence lengths.

Parameters
  • batch_size () – The batch size for the forward pass.

  • sequence_length () – The number of tokens in each line of the batch.

  • exclude_embeddings (, optional, defaults to ) – Whether or not to count embedding and softmax operations.

Returns

The number of floating-point operations.

Return type
get_extended_attention_mask(attention_mask: torch.Tensor, input_shape: Tuple[int], device: torch.device) → torch.Tensor[source]¶

Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

Parameters
  • attention_mask () – Mask with ones indicating tokens to attend to, zeros for tokens to ignore.

  • input_shape () – The shape of the input to the model.

  • device – (): The device of the input to the model.

Returns

The extended attention mask, with the same dtype as attention_mask.

get_head_mask(head_mask: Optional[torch.Tensor], num_hidden_layers: int, is_attention_chunked: bool = False) → torch.Tensor[source]¶

Prepare the head mask if needed.

Parameters
  • head_mask ( with shape or , optional) – The mask indicating if we should keep the heads or not ( for keep, for discard).

  • num_hidden_layers () – The number of hidden layers in the model.

  • is_attention_chunked – (, optional, defaults to ): Whether or not the attentions scores are computed by chunks or not.

Returns

with shape or list with for each layer.

invert_attention_mask(encoder_attention_mask: torch.Tensor) → torch.Tensor[source]¶

Invert an attention mask (e.g., switches 0. and 1.).

Parameters

encoder_attention_mask () – An attention mask.

Returns

The inverted attention mask.

Return type
num_parameters(only_trainable: bool = False, exclude_embeddings: bool = False) → int[source]¶

Get number of (optionally, trainable or non-embeddings) parameters in the module.

Parameters
  • only_trainable (, optional, defaults to ) – Whether or not to return only the number of trainable parameters

  • exclude_embeddings (, optional, defaults to ) – Whether or not to return only the number of non-embeddings parameters

Returns

The number of parameters.

Return type

int
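For instance (the checkpoint name is only an example):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

print(model.num_parameters())                         # all parameters
print(model.num_parameters(only_trainable=True))      # trainable parameters only
print(model.num_parameters(exclude_embeddings=True))  # without the embedding parameters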
reset_memory_hooks_state()[source]¶

Reset the mem_rss_diff attribute of each module (see add_memory_hooks()).

TFPreTrainedModel¶

class transformers.TFPreTrainedModel(*args, **kwargs)[source]¶

Base class for all TF models.

TFPreTrainedModel takes care of storing the configuration of the models and handles methods for loading, downloading and saving models as well as a few methods common to all models to:

  • resize the input embeddings,

  • prune heads in the self-attention layers.

Class attributes (overridden by derived classes):

  • config_class (PretrainedConfig) – A subclass of PretrainedConfig to use as configuration class for this model architecture.

  • base_model_prefix (str) – A string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.

compile(optimizer='rmsprop', loss='passthrough', metrics=None, loss_weights=None, weighted_metrics=None, run_eagerly=None, steps_per_execution=None, **kwargs)[source]¶

This is a thin wrapper that sets the model’s loss output head as the loss if the user does not specify a loss function themselves.
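A sketch of how this wrapper is typically used, falling back to the model’s internal loss because no loss argument is given (the train_dataset variable is assumed to exist and to yield dicts containing a "labels" key):

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# No `loss` argument: the wrapper keeps the model's own loss output head as the loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))

# train_dataset = ...  # a tf.data.Dataset of tokenized batches with labels
# model.fit(train_dataset, epochs=3)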

property dummy_inputs¶

Dummy inputs to build the network.

Returns

The dummy inputs.

Return type
classmethod from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)[source]¶

Instantiate a pretrained TF model from a pre-trained model configuration.

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

Parameters
  • pretrained_model_name_or_path (, optional) –

    Can be either:

    • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like , or namespaced under a user or organization name, like .

    • A path to a directory containing model weights saved using , e.g., .

    • A path or url to a PyTorch state_dict save file (e.g, ). In this case, should be set to and a configuration object should be provided as argument. This loading path is slower than converting the PyTorch model in a TensorFlow model using the provided conversion scripts and loading the TensorFlow model afterwards.

    • if you are both providing the configuration and state dictionary (resp. with keyword arguments and ).

  • model_args (sequence of positional arguments, optional) – All remaining positional arguments will be passed to the underlying model’s method.

  • config (, optional) –

    Can be either:

    Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:

    • The model is a model provided by the library (loaded with the model id string of a pretrained model).

    • The model was saved using and is reloaded by supplying the save directory.

    • The model is loaded by supplying a local directory as and a configuration JSON file named config.json is found in the directory.

  • from_pt – (, optional, defaults to ): Load the model weights from a PyTorch state_dict save file (see docstring of argument).

  • ignore_mismatched_sizes (, optional, defaults to ) – Whether or not to raise an error if some of the weights from the checkpoint do not have the same size as the weights of the model (if for instance, you are instantiating a model with 10 labels from a checkpoint with 3 labels).

  • cache_dir (, optional) – Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • force_download (, optional, defaults to ) – Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.

  • resume_download (, optional, defaults to ) – Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.

  • proxies – (): A dictionary of proxy servers to use by protocol or endpoint, e.g., . The proxies are used on each request.

  • output_loading_info (bool, optional, defaults to False) – Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.

  • local_files_only (bool, optional, defaults to False) – Whether or not to only look at local files (i.e., do not try to download the model).

  • use_auth_token ( or bool, optional) – The token to use as HTTP bearer authorization for remote files. If , will use the token generated when running (stored in ).

  • revision (, optional, defaults to ) – The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so can be any identifier allowed by git.

  • mirror (, optional) – Mirror source to accelerate downloads in China. If you are from China and have an accessibility problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. Please refer to the mirror site for more information.

  • kwargs (remaining dictionary of keyword arguments, optional) –

    Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., ). Behaves differently depending on whether a is provided or automatically loaded:

    • If a configuration is provided with , will be directly passed to the underlying model’s method (we assume all relevant updates to the configuration have already been done)

    • If a configuration is not provided, will be first passed to the configuration class initialization function (). Each key of that corresponds to a configuration attribute will be used to override said attribute with the supplied value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model’s function.

Note

Passing use_auth_token=True is required when you want to use a private model.

Examples:

>>> from transformers import BertConfig, TFBertModel
>>> # Download model and configuration from huggingface.co and cache.
>>> model = TFBertModel.from_pretrained('bert-base-uncased')
>>> # Model was saved using `save_pretrained('./test/saved_model/')` (for example purposes, not runnable).
>>> model = TFBertModel.from_pretrained('./test/saved_model/')
>>> # Update configuration during loading.
>>> model = TFBertModel.from_pretrained('bert-base-uncased', output_attentions=True)
>>> assert model.config.output_attentions == True
>>> # Loading from a Pytorch model file instead of a TensorFlow checkpoint (slower, for example purposes, not runnable).
>>> config = BertConfig.from_json_file('./pt_model/my_pt_model_config.json')
>>> model = TFBertModel.from_pretrained('./pt_model/my_pytorch_model.bin', from_pt=True, config=config)
get_bias() → Union[None, Dict[str, tensorflow.python.ops.variables.Variable]][source]¶

Dict of bias attached to an LM head. The key represents the name of the bias attribute.

Returns

The weights representing the bias, None if not an LM model.

Return type
get_input_embeddings() → keras.engine.base_layer.Layer[source]¶

Returns the model’s input embeddings layer.

Returns

The embeddings layer mapping vocabulary to hidden states.

Return type
get_lm_head() → keras.engine.base_layer.Layer[source]¶

The LM Head layer. This method must be overwritten by all the models that have an LM head.

Returns

The LM head layer if the model has one, None if not.

Return type
get_output_embeddings() → Union[None, keras.engine.base_layer.Layer][source]¶

Returns the model’s output embeddings.

Returns

The new weights mapping vocabulary to hidden states.

Return type
get_output_layer_with_bias() → Union[None, keras.engine.base_layer.Layer][source]¶

Get the layer that handles a bias attribute in case the model has an LM head with weights tied to the embeddings.

Returns

The layer that handles the bias, None if not an LM model.

Return type
get_prefix_bias_name() → Union[None, str][source]¶

Get the concatenated _prefix name of the bias from the model name to the parent layer.

Returns

The _prefix name of the bias.

Return type
prune_heads(heads_to_prune)[source]¶

Prunes heads of the base model.

Parameters

heads_to_prune (Dict[int, List[int]]) – Dictionary with keys being selected layer indices (int) and associated values being the list of heads to prune in said layer (list of int). For instance {1: [0, 2], 2: [2, 3]} will prune heads 0 and 2 on layer 1 and heads 2 and 3 on layer 2.

push_to_hub(repo_path_or_name: Optional[str] = None, repo_url: Optional[str] = None, use_temp_dir: bool = False, commit_message: Optional[str] = None, organization: Optional[str] = None, private: Optional[bool] = None, use_auth_token: Optional[Union[bool, str]] = None) → str¶

Upload the model checkpoint to the 🤗 Model Hub while synchronizing a local clone of the repo in repo_path_or_name.

Parameters
  • repo_path_or_name (, optional) – Can either be a repository name for your model in the Hub or a path to a local folder (in which case the repository will have the name of that local folder). If not specified, will default to the name given by and a local directory with that name will be created.

  • repo_url (, optional) – Specify this in case you want to push to an existing repository in the hub. If unspecified, a new repository will be created in your namespace (unless you specify an ) with .

  • use_temp_dir (, optional, defaults to ) – Whether or not to clone the distant repo in a temporary directory or in inside the current working directory. This will slow things down if you are making changes in an existing repo since you will need to clone the repo before every push.

  • commit_message (, optional) – Message to commit while pushing. Will default to .

  • organization (, optional) – Organization in which you want to push your model (you must be a member of this organization).

  • private (, optional) – Whether or not the repository created should be private (requires a paying subscription).

  • use_auth_token ( or , optional) – The token to use as HTTP bearer authorization for remote files. If , will use the token generated when running (stored in ). Will default to if is not specified.

Returns

The url of the commit of your model in the given repository.

Return type

Examples:

from transformers import TFAutoModel

model = TFAutoModel.from_pretrained("bert-base-cased")

# Push the model to your namespace with the name "my-finetuned-bert" and have a local clone in the
# `my-finetuned-bert` folder.
model.push_to_hub("my-finetuned-bert")

# Push the model to your namespace with the name "my-finetuned-bert" with no local clone.
model.push_to_hub("my-finetuned-bert", use_temp_dir=True)

# Push the model to an organization with the name "my-finetuned-bert" and have a local clone in the
# `my-finetuned-bert` folder.
model.push_to_hub("my-finetuned-bert", organization="huggingface")

# Make a change to an existing repo that has been cloned locally in `my-finetuned-bert`.
model.push_to_hub("my-finetuned-bert", repo_url="https://huggingface.co/sgugger/my-finetuned-bert")
resize_token_embeddings(new_num_tokens=None) → tensorflow.python.ops.variables.Variable[source]¶

Resizes the input token embeddings matrix of the model if new_num_tokens != config.vocab_size.

Takes care of tying weights embeddings afterwards if the model class has a tie_weights() method.

Parameters

new_num_tokens (, optional) – The number of new tokens in the embedding matrix. Increasing the size will add newly initialized vectors at the end. Reducing the size will remove vectors from the end. If not provided or , just returns a pointer to the input tokens module of the model without doing anything.

Returns

Pointer to the input tokens Embeddings Module of the model.

Return type
save_pretrained(save_directory, saved_model=False, version=1, push_to_hub=False, **kwargs)[source]¶

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

Parameters
  • save_directory () – Directory to which to save. Will be created if it doesn’t exist.

  • saved_model (, optional, defaults to ) – If the model has to be saved in saved model format as well or not.

  • version (, optional, defaults to 1) – The version of the saved model. A saved model needs to be versioned in order to be properly loaded by TensorFlow Serving as detailed in the official documentation https://www.tensorflow.org/tfx/serving/serving_basic

  • push_to_hub (, optional, defaults to ) –

    Whether or not to push your model to the Hugging Face model hub after saving it.

    Warning

    Using will synchronize the repository you are pushing to with , which requires to be a local clone of the repo you are pushing to if it’s an existing folder. Pass along to use a temporary directory instead.

  • kwargs – Additional key word arguments passed along to the method.
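For example, to also export a versioned SavedModel for TensorFlow Serving alongside the usual tf_model.h5 and config.json (paths are illustrative):

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")

# Writes the H5 weights and config, plus a SavedModel under ./exported/saved_model/1/.
model.save_pretrained("./exported/", saved_model=True, version=1)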

serving(inputs)[source]¶

Method used for serving the model.

Parameters

inputs () – The input of the saved model as a dictionary of tensors.

serving_output(output)[source]¶

Prepare the output of the saved model. Each model must implement this function.

Parameters

output () – The output returned by the model.

set_bias(value)[source]¶

Set all the bias in the LM head.

Parameters

value () – All the new bias attached to an LM head.

set_input_embeddings(value)[source]¶

Set model’s input embeddings.

Parameters

value () – The new weights mapping vocabulary to hidden states.

set_output_embeddings(value)[source]¶

Set model’s output embeddings.

Parameters

value () – The new weights mapping hidden states to vocabulary.

test_step(data)[source]¶

A modification of Keras’s default test_step that cleans up the printed metrics when we use a dummy loss.

train_step(data)[source]¶

A modification of Keras’s default train_step that cleans up the printed metrics when we use a dummy loss.

TFModelUtilsMixin¶

class transformers.modeling_tf_utils.TFModelUtilsMixin[source]¶

A few utilities for tf.keras.Model, to be used as a mixin.

num_parameters(only_trainable: bool = False) → int[source]¶

Get the number of (optionally, trainable) parameters in the model.

Parameters

only_trainable (, optional, defaults to ) – Whether or not to return only the number of trainable parameters

Returns

The number of parameters.

Return type

FlaxPreTrainedModel¶

class transformers.FlaxPreTrainedModel(config: transformers.configuration_utils.PretrainedConfig, module: flax.linen.module.Module, input_shape: Tuple = (1, 1), seed: int = 0, dtype: numpy.dtype = jax.numpy.float32)[source]¶

Base class for all models.

FlaxPreTrainedModel takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

Class attributes (overridden by derived classes):

  • config_class (PretrainedConfig) – A subclass of PretrainedConfig to use as configuration class for this model architecture.

  • base_model_prefix (str) – A string indicating the attribute associated to the base model in derived classes of the same architecture adding modules on top of the base model.

classmethod from_pretrained(pretrained_model_name_or_path: Union[str, os.PathLike], dtype: numpy.dtype = jax.numpy.float32, *model_args, **kwargs)[source]¶

Instantiate a pretrained Flax model from a pre-trained model configuration.

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

Parameters
  • pretrained_model_name_or_path ( or ) –

    Can be either:

    • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like , or namespaced under a user or organization name, like .

    • A path to a directory containing model weights saved using , e.g., .

    • A path or url to a pt index checkpoint file (e.g, ). In this case, from_pt should be set to True.

  • model_args (sequence of positional arguments, optional) – All remaining positional arguments will be passed to the underlying model’s method.

  • config (, optional) –

    Can be either:

    Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when:

    • The model is a model provided by the library (loaded with the model id string of a pretrained model).

    • The model was saved using and is reloaded by supplying the save directory.

    • The model is loaded by supplying a local directory as and a configuration JSON file named config.json is found in the directory.

  • cache_dir (, optional) – Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • from_pt (, optional, defaults to ) – Load the model weights from a PyTorch checkpoint save file (see docstring of argument).

  • ignore_mismatched_sizes (, optional, defaults to ) – Whether or not to raise an error if some of the weights from the checkpoint do not have the same size as the weights of the model (if for instance, you are instantiating a model with 10 labels from a checkpoint with 3 labels).

  • force_download (, optional, defaults to ) – Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.

  • resume_download (, optional, defaults to ) – Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.

  • proxies () – A dictionary of proxy servers to use by protocol or endpoint, e.g., . The proxies are used on each request.

  • local_files_only (, optional, defaults to ) – Whether or not to only look at local files (i.e., do not try to download the model).

  • revision (, optional, defaults to ) – The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so can be any identifier allowed by git.

  • kwargs (remaining dictionary of keyword arguments, optional) –

    Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., ). Behaves differently depending on whether a is provided or automatically loaded:

    • If a configuration is provided with , will be directly passed to the underlying model’s method (we assume all relevant updates to the configuration have already been done)

    • If a configuration is not provided, will be first passed to the configuration class initialization function (). Each key of that corresponds to a configuration attribute will be used to override said attribute with the supplied value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model’s function.

Examples:

>>> from transformers import BertConfig, FlaxBertModel
>>> # Download model and configuration from huggingface.co and cache.
>>> model = FlaxBertModel.from_pretrained('bert-base-cased')
>>> # Model was saved using `save_pretrained('./test/saved_model/')` (for example purposes, not runnable).
>>> model = FlaxBertModel.from_pretrained('./test/saved_model/')
>>> # Loading from a PyTorch checkpoint file instead of a PyTorch model (slower, for example purposes, not runnable).
>>> config = BertConfig.from_json_file('./pt_model/config.json')
>>> model = FlaxBertModel.from_pretrained('./pt_model/pytorch_model.bin', from_pt=True, config=config)
push_to_hub(repo_path_or_name: Optional[str] = None, repo_url: Optional[str] = None, use_temp_dir: bool = False, commit_message: Optional[str] = None, organization: Optional[str] = None, private: Optional[bool] = None, use_auth_token: Optional[Union[bool, str]] = None) → str¶

Upload the model checkpoint to the 🤗 Model Hub while synchronizing a local clone of the repo in repo_path_or_name.

Parameters
  • repo_path_or_name (, optional) – Can either be a repository name for your model in the Hub or a path to a local folder (in which case the repository will have the name of that local folder). If not specified, will default to the name given by and a local directory with that name will be created.

  • repo_url (, optional) – Specify this in case you want to push to an existing repository in the hub. If unspecified, a new repository will be created in your namespace (unless you specify an ) with .

  • use_temp_dir (, optional, defaults to ) – Whether or not to clone the distant repo in a temporary directory or in inside the current working directory. This will slow things down if you are making changes in an existing repo since you will need to clone the repo before every push.

  • commit_message (, optional) – Message to commit while pushing. Will default to .

  • organization (, optional) – Organization in which you want to push your model (you must be a member of this organization).

  • private (, optional) – Whether or not the repository created should be private (requires a paying subscription).

  • use_auth_token ( or , optional) – The token to use as HTTP bearer authorization for remote files. If , will use the token generated when running (stored in ). Will default to if is not specified.

Returns

The url of the commit of your model in the given repository.

Return type

Examples:

from transformers import FlaxAutoModel

model = FlaxAutoModel.from_pretrained("bert-base-cased")

# Push the model to your namespace with the name "my-finetuned-bert" and have a local clone in the
# `my-finetuned-bert` folder.
model.push_to_hub("my-finetuned-bert")

# Push the model to your namespace with the name "my-finetuned-bert" with no local clone.
model.push_to_hub("my-finetuned-bert", use_temp_dir=True)

# Push the model to an organization with the name "my-finetuned-bert" and have a local clone in the
# `my-finetuned-bert` folder.
model.push_to_hub("my-finetuned-bert", organization="huggingface")

# Make a change to an existing repo that has been cloned locally in `my-finetuned-bert`.
model.push_to_hub("my-finetuned-bert", repo_url="https://huggingface.co/sgugger/my-finetuned-bert")
save_pretrained(save_directory: Union[str, os.PathLike], params=None, push_to_hub=False, **kwargs)[source]¶

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

Parameters
  • save_directory ( or ) – Directory to which to save. Will be created if it doesn’t exist.

  • push_to_hub (, optional, defaults to ) –

    Whether or not to push your model to the Hugging Face model hub after saving it.

    Warning

    Using will synchronize the repository you are pushing to with , which requires to be a local clone of the repo you are pushing to if it’s an existing folder. Pass along to use a temporary directory instead.

  • kwargs – Additional key word arguments passed along to the method.

Generation¶

class transformers.generation_utils.GenerationMixin[source]¶

A class containing all of the functions supporting generation, to be used as a mixin in PreTrainedModel.

adjust_logits_during_generation(logits: torch.FloatTensor, **kwargs) → torch.FloatTensor[source]¶

Implement in subclasses of PreTrainedModel for custom behavior to adjust the logits in the generate method.

beam_sample(input_ids:torch.LongTensor, beam_scorer:transformers.generation_beam_search.BeamScorer, logits_processor:Optional[transformers.generation_logits_process.LogitsProcessorList]=None, stopping_criteria:Optional[transformers.generation_stopping_criteria.StoppingCriteriaList]=None, logits_warper:Optional[transformers.generation_logits_process.LogitsProcessorList]=None, max_length:Optional[int]=None, pad_token_id:Optional[int]=None, eos_token_id:Optional[int]=None, output_attentions:Optional[bool]=None, output_hidden_states:Optional[bool]=None, output_scores:Optional[bool]=None, return_dict_in_generate:Optional[bool]=None, synced_gpus:Optional[bool]=None, **model_kwargs) → Union[transformers.generation_utils.BeamSampleEncoderDecoderOutput, transformers.generation_utils.BeamSampleDecoderOnlyOutput, torch.LongTensor][source]¶

Generates sequences for models with a language modeling head using beam search with multinomial sampling.

Parameters
  • input_ids ( of shape ) – The sequence used as a prompt for the generation.

  • beam_scorer (BeamScorer) – A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.

  • logits_processor (, optional) – An instance of . List of instances of class derived from used to modify the prediction scores of the language modeling head applied at each generation step.

  • stopping_criteria (, optional) – An instance of . List of instances of class derived from used to tell if the generation loop should stop.

  • logits_warper (, optional) – An instance of . List of instances of class derived from used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step.

  • max_length (, optional, defaults to 20) – DEPRECATED. Use or directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (, optional) – The id of the padding token.

  • eos_token_id (, optional) – The id of the end-of-sequence token.

  • output_attentions (, optional, defaults to False) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more details.

  • output_hidden_states (, optional, defaults to False) – Whether or not to return the hidden states of all layers. See under returned tensors for more details.

  • output_scores (, optional, defaults to False) – Whether or not to return the prediction scores. See under returned tensors for more details.

  • return_dict_in_generate (, optional, defaults to False) – Whether or not to return a instead of a plain tuple.

  • synced_gpus (, optional, defaults to ) – Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

  • model_kwargs – Additional model specific kwargs will be forwarded to the function of the model. If model is an encoder-decoder model the kwargs should include .

Returns

A BeamSampleDecoderOnlyOutput, BeamSampleEncoderDecoderOutput or torch.LongTensor: A torch.LongTensor containing the generated tokens (default behaviour), or a BeamSampleDecoderOnlyOutput if model.config.is_encoder_decoder=False and return_dict_in_generate=True, or a BeamSampleEncoderDecoderOutput if model.config.is_encoder_decoder=True.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

>>> # lets run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True)
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList([
...     MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id)
... ])
>>> # instantiate logits warpers (the temperature value here is illustrative; the original elides it)
>>> logits_warper = LogitsProcessorList([
...     TopKLogitsWarper(50),
...     TemperatureLogitsWarper(0.7),
... ])

>>> outputs = model.beam_sample(
...     input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, **model_kwargs
... )

>>> print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))
beam_search(input_ids:torch.LongTensor, beam_scorer:transformers.generation_beam_search.BeamScorer, logits_processor:Optional[transformers.generation_logits_process.LogitsProcessorList]=None, stopping_criteria:Optional[transformers.generation_stopping_criteria.StoppingCriteriaList]=None, max_length:Optional[int]=None, pad_token_id:Optional[int]=None, eos_token_id:Optional[int]=None, output_attentions:Optional[bool]=None, output_hidden_states:Optional[bool]=None, output_scores:Optional[bool]=None, return_dict_in_generate:Optional[bool]=None, synced_gpus:Optional[bool]=None, **model_kwargs) → Union[transformers.generation_utils.BeamSearchEncoderDecoderOutput, transformers.generation_utils.BeamSearchDecoderOnlyOutput, torch.LongTensor][source]¶

Generates sequences for models with a language modeling head using beam search decoding.

Parameters
  • input_ids ( of shape ) – The sequence used as a prompt for the generation.

  • beam_scorer (BeamScorer) – A derived instance of BeamScorer that defines how beam hypotheses are constructed, stored and sorted during generation. For more information, the documentation of BeamScorer should be read.

  • logits_processor (, optional) – An instance of . List of instances of class derived from used to modify the prediction scores of the language modeling head applied at each generation step.

  • stopping_criteria (, optional) – An instance of . List of instances of class derived from used to tell if the generation loop should stop.

  • max_length (, optional, defaults to 20) – DEPRECATED. Use or directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (, optional) – The id of the padding token.

  • eos_token_id (, optional) – The id of the end-of-sequence token.

  • output_attentions (, optional, defaults to False) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more details.

  • output_hidden_states (, optional, defaults to False) – Whether or not to return the hidden states of all layers. See under returned tensors for more details.

  • output_scores (, optional, defaults to False) – Whether or not to return the prediction scores. See under returned tensors for more details.

  • return_dict_in_generate (, optional, defaults to False) – Whether or not to return a instead of a plain tuple.

  • synced_gpus (, optional, defaults to ) – Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

  • model_kwargs – Additional model specific kwargs will be forwarded to the function of the model. If model is an encoder-decoder model the kwargs should include .

Returns

A BeamSearchDecoderOnlyOutput, BeamSearchEncoderDecoderOutput or torch.LongTensor: A torch.LongTensor containing the generated tokens (default behaviour), or a BeamSearchDecoderOnlyOutput if model.config.is_encoder_decoder=False and return_dict_in_generate=True, or a BeamSearchEncoderDecoderOutput if model.config.is_encoder_decoder=True.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids
Source: https://huggingface.co/transformers/main_classes/model.html

Transformers

  • ALBERT (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.

  • BART (from Facebook) released with the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.

  • BARThez (from École polytechnique) released with the paper BARThez: a Skilled Pretrained French Sequence-to-Sequence Model by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.

  • BEiT (from Microsoft) released with the paper BEiT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong, Furu Wei.

  • BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

  • BERT For Sequence Generation (from Google) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

  • BigBird-RoBERTa (from Google Research) released with the paper Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.

  • BigBird-Pegasus (from Google Research) released with the paper Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.

  • Blenderbot (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.

  • BlenderbotSmall (from Facebook) released with the paper Recipes for building an open-domain chatbot by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston.

  • BORT (from Alexa) released with the paper Optimal Subarchitecture Extraction For BERT by Adrian de Wynter and Daniel J. Perry.

  • ByT5 (from Google Research) released with the paper ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.

  • CamemBERT (from Inria/Facebook/Sorbonne) released with the paper CamemBERT: a Tasty French Language Model by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

  • CANINE (from Google Research) released with the paper CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.

  • CLIP (from OpenAI) released with the paper Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.

  • ConvBERT (from YituTech) released with the paper ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

  • CPM (from Tsinghua University) released with the paper CPM: A Large-scale Generative Chinese Pre-trained Language Model by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.

  • CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher.

  • DeBERTa (from Microsoft) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.

  • DeBERTa-v2 (from Microsoft) released with the paper DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.

  • DeiT (from Facebook) released with the paper Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.

  • DETR (from Facebook) released with the paper End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.

  • DialoGPT (from Microsoft Research) released with the paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.

  • DistilBERT (from HuggingFace), released together with the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into DistilGPT2, RoBERTa into DistilRoBERTa, Multilingual BERT into DistilmBERT and a German version of DistilBERT.

  • DPR (from Facebook) released with the paper Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.

  • EncoderDecoder (from Google Research) released with the paper Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

  • ELECTRA (from Google Research/Stanford University) released with the paper ELECTRA: Pre-training text encoders as discriminators rather than generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.

  • FlauBERT (from CNRS) released with the paper FlauBERT: Unsupervised Language Model Pre-training for French by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.

  • FNet (from Google Research) released with the paper FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.

  • Funnel Transformer (from CMU/Google Brain) released with the paper Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.

  • GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.

  • GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.

  • GPT-J (from EleutherAI) released in the repository kingoflolz/mesh-transformer-jax by Ben Wang and Aran Komatsuzaki.

  • GPT Neo (from EleutherAI) released in the repository EleutherAI/gpt-neo by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.

  • Hubert (from Facebook) released with the paper HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.

  • I-BERT (from Berkeley) released with the paper I-BERT: Integer-only BERT Quantization by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.

  • LayoutLM (from Microsoft Research Asia) released with the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.

  • LayoutLMv2 (from Microsoft Research Asia) released with the paper LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou.

  • LayoutXLM (from Microsoft Research Asia) released with the paper LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.

  • LED (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

  • Longformer (from AllenAI) released with the paper Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

  • LUKE (from Studio Ousia) released with the paper LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.

  • LXMERT (from UNC Chapel Hill) released with the paper LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering by Hao Tan and Mohit Bansal.

  • M2M100 (from Facebook) released with the paper Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

  • MarianMT Machine translation models trained using OPUS data by Jörg Tiedemann. The Marian Framework is being developed by the Microsoft Translator Team.

  • MBart (from Facebook) released with the paper Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.

  • MBart-50 (from Facebook) released with the paper Multilingual Translation with Extensible Multilingual Pretraining and Finetuning by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.

  • Megatron-BERT (from NVIDIA) released with the paper Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.

  • Megatron-GPT2 (from NVIDIA) released with the paper Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.

  • MPNet (from Microsoft Research) released with the paper MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.

  • MT5 (from Google AI) released with the paper mT5: A massively multilingual pre-trained text-to-text transformer by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.

  • Pegasus (from Google) released with the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.

  • ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

  • Reformer (from Google Research) released with the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.

  • RemBERT (from Google Research) released with the paper Rethinking embedding coupling in pre-trained language models by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.

  • RoBERTa (from Facebook), released together with the paper RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.

  • RoFormer (from ZhuiyiTechnology), released together with the paper RoFormer: Enhanced Transformer with Rotary Position Embedding by Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, Yunfeng Liu.

  • SpeechEncoderDecoder

  • SpeechToTextTransformer (from Facebook), released together with the paper fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.

  • SpeechToTextTransformer2 (from Facebook), released together with the paper Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.

  • Splinter (from Tel Aviv University), released together with the paper Few-Shot Question Answering by Pretraining Span Selection by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.

  • SqueezeBert (from Berkeley) released with the paper SqueezeBERT: What can computer vision teach NLP about efficient neural networks? by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.

  • T5 (from Google AI) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.

  • T5v1.1 (from Google AI) released in the repository google-research/text-to-text-transfer-transformer by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.

  • TAPAS (from Google AI) released with the paper TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.

  • Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.

  • Vision Transformer (ViT) (from Google AI) released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

  • VisualBERT (from UCLA NLP) released with the paper VisualBERT: A Simple and Performant Baseline for Vision and Language by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.

  • Wav2Vec2 (from Facebook AI) released with the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

  • XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.

  • XLM-ProphetNet (from Microsoft Research) released with the paper ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.

  • XLM-RoBERTa (from Facebook AI), released together with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

  • XLNet (from Google/CMU) released with the paper ​XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.

  • XLSR-Wav2Vec2 (from Facebook AI) released with the paper Unsupervised Cross-Lingual Representation Learning For Speech Recognition by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.

  Source: https://huggingface.co/transformers/

    BERT¶

    Overview¶

    The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It’s a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.

    The abstract from the paper is the following:

    We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

    BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

    Tips:

    • BERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

    • BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.

    This model was contributed by thomwolf. The original code can be found here.
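
    To make the first tip concrete, here is a minimal sketch (assuming the standard bert-base-uncased checkpoint is reachable) of batching sequences of different lengths with the default right-side padding:

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> # BERT tokenizers pad on the right by default (padding_side="right"),
    >>> # which matches the advice above for absolute position embeddings
    >>> batch = tokenizer(["A short sentence.", "A slightly longer example sentence."], padding=True, return_tensors="pt")
    >>> batch["input_ids"].shape[-1]  # both sequences are padded to the longest length in the batch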

    BertConfig¶

    class transformers.BertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, classifier_dropout=None, **kwargs)[source]¶

    This is the configuration class to store the configuration of a or a . It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.

    Configuration objects inherit from and can be used to control the model outputs. Read the documentation from for more information.

    Parameters
    • vocab_size (, optional, defaults to ) – Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the passed when calling or .

    • hidden_size (, optional, defaults to ) – Dimensionality of the encoder layers and the pooler layer.

    • num_hidden_layers (, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.

    • num_attention_heads (, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

    • intermediate_size (, optional, defaults to ) – Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

    • hidden_act ( or , optional, defaults to ) – The non-linear activation function (function or string) in the encoder and pooler. If string, , , and are supported.

    • hidden_dropout_prob (, optional, defaults to ) – The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

    • attention_probs_dropout_prob (, optional, defaults to ) – The dropout ratio for the attention probabilities.

    • max_position_embeddings (, optional, defaults to ) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., or or ).

    • type_vocab_size (, optional, defaults to 2) – The vocabulary size of the passed when calling or .

    • initializer_range (, optional, defaults to ) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

    • layer_norm_eps (, optional, defaults to 1e) – The epsilon used by the layer normalization layers.

    • position_embedding_type (, optional, defaults to ) – Type of position embedding. Choose one of , , . For positional embeddings use . For more information on , please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on , please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

    • use_cache (, optional, defaults to ) – Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if .

    • classifier_dropout (, optional) – The dropout ratio for the classification head.

    Examples:

    >>> from transformers import BertModel, BertConfig

    >>> # Initializing a BERT bert-base-uncased style configuration
    >>> configuration = BertConfig()

    >>> # Initializing a model from the bert-base-uncased style configuration
    >>> model = BertModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
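
    A related sketch, not from the original docs: the configuration can also be customized before building the model, for example to create a smaller, randomly initialized BERT (the sizes below are hypothetical and chosen only for illustration):

    >>> from transformers import BertModel, BertConfig
    >>> # hypothetical, smaller-than-default sizes, picked only for illustration
    >>> small_config = BertConfig(num_hidden_layers=4, hidden_size=256, num_attention_heads=4, intermediate_size=1024)
    >>> small_model = BertModel(small_config)  # randomly initialized, not pretrained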

    BertTokenizer¶

    class transformers.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]¶

    Construct a BERT tokenizer. Based on WordPiece.

    This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

    Parameters
    • vocab_file () – File containing the vocabulary.

    • do_lower_case (, optional, defaults to ) – Whether or not to lowercase the input when tokenizing.

    • do_basic_tokenize (, optional, defaults to ) – Whether or not to do basic tokenization before WordPiece.

    • never_split (Iterable, optional) – Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.

    • unk_token (, optional, defaults to ) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

    • sep_token (, optional, defaults to ) – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

    • pad_token (, optional, defaults to ) – The token used for padding, for example when batching sequences of different lengths.

    • cls_token (, optional, defaults to ) – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

    • mask_token (, optional, defaults to ) – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

    • tokenize_chinese_chars (, optional, defaults to ) –

      Whether or not to tokenize Chinese characters.

      This should likely be deactivated for Japanese (see this issue).

    • strip_accents (bool, optional) – Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for lowercase (as in the original BERT).
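
    As a short usage sketch (assuming bert-base-uncased is available), encoding a sentence pair produces the input IDs, token type IDs and attention mask consumed by the models below:

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> encoding = tokenizer("How old are you?", "I'm 6 years old.")
    >>> sorted(encoding.keys())
    ['attention_mask', 'input_ids', 'token_type_ids']
    >>> # token_type_ids are 0 for the first sentence and 1 for the second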

    build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]¶

    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

    • single sequence: [CLS] X [SEP]

    • pair of sequences: [CLS] A [SEP] B [SEP]

    Parameters
    • token_ids_0 () – List of IDs to which the special tokens will be added.

    • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

    Returns

    List of input IDs with the appropriate special tokens.

    Return type

    List[int]

    create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]¶

    Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

    If token_ids_1 is None, this method only returns the first portion of the mask (0s).

    Parameters
    • token_ids_0 () – List of IDs.

    • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

    Returns

    List of token type IDs according to the given sequence(s).

    Return type

    List[int]

    get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]¶

    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer method.

    Parameters
    • token_ids_0 () – List of IDs.

    • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

    • already_has_special_tokens (, optional, defaults to ) – Whether or not the token list is already formatted with special tokens for the model.

    Returns

    A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

    Return type

    List[int]

    save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str][source]¶

    Save only the vocabulary of the tokenizer (vocabulary + added tokens).

    This method won’t save the configuration and special token mappings of the tokenizer. Use to save the whole state of the tokenizer.

    Parameters
    • save_directory () – The directory in which to save the vocabulary.

    • filename_prefix (str, optional) – An optional prefix to add to the name of the saved files.

    Returns

    Paths to the files saved.

    Return type

    Tuple[str]
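
    A hedged sketch of saving and reloading only the vocabulary (the directory name is arbitrary):

    >>> import os
    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    >>> os.makedirs("./bert-vocab", exist_ok=True)
    >>> vocab_files = tokenizer.save_vocabulary("./bert-vocab")  # tuple of saved file paths
    >>> reloaded = BertTokenizer(vocab_file=vocab_files[0])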

    BertTokenizerFast¶

    class transformers.BertTokenizerFast(vocab_file=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]¶

    Construct a “fast” BERT tokenizer (backed by HuggingFace’s tokenizers library). Based on WordPiece.

    This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

    Parameters
    • vocab_file () – File containing the vocabulary.

    • do_lower_case (, optional, defaults to ) – Whether or not to lowercase the input when tokenizing.

    • unk_token (, optional, defaults to ) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

    • sep_token (, optional, defaults to ) – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

    • pad_token (, optional, defaults to ) – The token used for padding, for example when batching sequences of different lengths.

    • cls_token (, optional, defaults to ) – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

    • mask_token (, optional, defaults to ) – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

    • clean_text (, optional, defaults to ) – Whether or not to clean the text before tokenization by removing any control characters and replacing all whitespaces by the classic one.

    • tokenize_chinese_chars (, optional, defaults to ) – Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this issue).

    • strip_accents (bool, optional) – Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for lowercase (as in the original BERT).

    • wordpieces_prefix (str, optional, defaults to "##") – The prefix for subwords.

    build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]¶

    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

    • single sequence: [CLS] X [SEP]

    • pair of sequences: [CLS] A [SEP] B [SEP]

    Parameters
    • token_ids_0 () – List of IDs to which the special tokens will be added.

    • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

    Returns

    List of input IDs with the appropriate special tokens.

    Return type

    List[int]

    create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]¶

    Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

    If token_ids_1 is None, this method only returns the first portion of the mask (0s).

    Parameters
    • token_ids_0 () – List of IDs.

    • token_ids_1 (, optional) – Optional second list of IDs for sequence pairs.

    Returns

    List of token type IDs according to the given sequence(s).

    Return type

    List[int]

    save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None) → Tuple[str][source]¶

    Save only the vocabulary of the tokenizer (vocabulary + added tokens).

    This method won’t save the configuration and special token mappings of the tokenizer. Use to save the whole state of the tokenizer.

    Parameters
    • save_directory () – The directory in which to save the vocabulary.

    • filename_prefix (str, optional) – An optional prefix to add to the name of the saved files.

    Returns

    Paths to the files saved.

    Return type

    Tuple[str]

    slow_tokenizer_class¶

    alias of transformers.models.bert.tokenization_bert.BertTokenizer
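
    As an illustrative sketch of a capability specific to the fast tokenizer (assuming bert-base-uncased is available), character offsets back into the original string can be requested:

    >>> from transformers import BertTokenizerFast
    >>> tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    >>> enc = tokenizer("Hello, my dog is cute", return_offsets_mapping=True)
    >>> enc["offset_mapping"][:3]  # (start, end) character spans; (0, 0) marks special tokens
    [(0, 0), (0, 5), (5, 6)]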

    Bert specific outputs¶

    class transformers.models.bert.modeling_bert.BertForPreTrainingOutput(loss: Optional[torch.FloatTensor] = None, prediction_logits: torch.FloatTensor = None, seq_relationship_logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)[source]¶

    Output type of BertForPreTraining.

    Parameters
    • loss (optional, returned when is provided, of shape ) – Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.

    • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

    • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) –

      Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) –

      Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    class transformers.models.bert.modeling_tf_bert.TFBertForPreTrainingOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, prediction_logits: tensorflow.python.framework.ops.Tensor = None, seq_relationship_logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Union[Tuple[tensorflow.python.framework.ops.Tensor], tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Union[Tuple[tensorflow.python.framework.ops.Tensor], tensorflow.python.framework.ops.Tensor]] = None)[source]¶

    Output type of TFBertForPreTraining.

    Parameters
    • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

    • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) –

      Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) –

      Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    class transformers.models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput(prediction_logits: jax._src.numpy.lax_numpy.ndarray = None, seq_relationship_logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]¶

    Output type of FlaxBertForPreTraining.

    Parameters
    • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

    • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) –

      Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) –

      Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    replace(**updates)¶

    Returns a new object replacing the specified fields with new values.

    BertModel¶

    class transformers.BertModel(config, add_pooling_layer=True)[source]¶

    The bare Bert Model transformer outputting raw hidden-states without any specific head on top.

    This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

    This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

    Parameters

    config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of cross-attention is added between the self-attention layers, following the architecture described in Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    To behave as a decoder the model needs to be initialized with the is_decoder argument of the configuration set to True. To be used in a Seq2Seq model, the model needs to be initialized with both the is_decoder argument and add_cross_attention set to True; an encoder_hidden_states input is then expected in the forward pass.
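
    A minimal sketch of the decoder setup described above (assuming the bert-base-uncased checkpoint):

    >>> from transformers import BertConfig, BertModel
    >>> config = BertConfig.from_pretrained("bert-base-uncased")
    >>> config.is_decoder = True
    >>> config.add_cross_attention = True
    >>> decoder = BertModel.from_pretrained("bert-base-uncased", config=config)
    >>> # encoder_hidden_states from another model can now be passed to decoder(...)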

    forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

    The forward method overrides the __call__ special method.

    Note

    Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

    Parameters
    • input_ids ( of shape ) –

      Indices of input sequence tokens in the vocabulary.

      Indices can be obtained using . See and for details.

      What are input IDs?

    • attention_mask ( of shape , optional) –

      Mask to avoid performing attention on padding token indices. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

      What are attention masks?

    • token_type_ids ( of shape , optional) –

      Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

      • 0 corresponds to a sentence A token,

      • 1 corresponds to a sentence B token.

      What are token type IDs?

    • position_ids ( of shape , optional) –

      Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

      What are position IDs?

    • head_mask ( of shape or , optional) –

      Mask to nullify selected heads of the self-attention modules. Mask values selected in :

      • 1 indicates the head is not masked,

      • 0 indicates the head is masked.

    • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

    • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

    • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

    • encoder_hidden_states ( of shape , optional) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

    • encoder_attention_mask ( of shape , optional) –

      Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

    • past_key_values ( of length with each tuple having 4 tensors of shape ) –

      Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

      If are used, the user can optionally input only the last (those that don’t have their past key value states given to this model) of shape instead of all of shape .

    • use_cache (, optional) – If set to , key value states are returned and can be used to speed up decoding (see ).

    Returns

    A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

    • last_hidden_state ( of shape ) – Sequence of hidden-states at the output of the last layer of the model.

    • pooler_output ( of shape ) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

    • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    • cross_attentions (, optional, returned when and is passed or when ) – Tuple of (one for each layer) of shape .

      Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

    • past_key_values (, optional, returned when is passed or when ) – Tuple of of length , with each tuple having 2 tensors of shape ) and optionally if 2 additional tensors of shape .

      Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if in the cross-attention blocks) that can be used (see input) to speed up sequential decoding.

    Return type

    BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)

    Example:

    >>> from transformers import BertTokenizer, BertModel
    >>> import torch

    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> model = BertModel.from_pretrained('bert-base-uncased')

    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> outputs = model(**inputs)

    >>> last_hidden_states = outputs.last_hidden_state
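
    As a follow-up to the example above, the output shapes can be inspected directly; for this single-sentence input the tokenizer produces 8 tokens (including [CLS] and [SEP]):

    >>> last_hidden_states.shape  # (batch_size, sequence_length, hidden_size)
    torch.Size([1, 8, 768])
    >>> outputs.pooler_output.shape  # (batch_size, hidden_size)
    torch.Size([1, 768])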

    BertForPreTraining¶

    class transformers.BertForPreTraining(config)[source]¶

    Bert Model with two heads on top as done during the pretraining: a masked language modeling head and a next sentence prediction (classification) head.

    This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

    This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

    Parameters

    config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

    forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, next_sentence_label=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

    The forward method overrides the __call__ special method.

    Note

    Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

    Parameters
    • input_ids ( of shape ) –

      Indices of input sequence tokens in the vocabulary.

      Indices can be obtained using . See and for details.

      What are input IDs?

    • attention_mask ( of shape , optional) –

      Mask to avoid performing attention on padding token indices. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

      What are attention masks?

    • token_type_ids ( of shape , optional) –

      Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

      • 0 corresponds to a sentence A token,

      • 1 corresponds to a sentence B token.

      What are token type IDs?

    • position_ids ( of shape , optional) –

      Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

      What are position IDs?

    • head_mask ( of shape or , optional) –

      Mask to nullify selected heads of the self-attention modules. Mask values selected in :

      • 1 indicates the head is not masked,

      • 0 indicates the head is masked.

    • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

    • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

    • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

    • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

    • next_sentence_label ( of shape , optional) –

      Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see docstring) Indices should be in :

      • 0 indicates sequence B is a continuation of sequence A,

      • 1 indicates sequence B is a random sequence.

    • kwargs (, optional, defaults to {}) – Used to hide legacy arguments that have been deprecated.

    Returns

    A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

    • loss (optional, returned when is provided, of shape ) – Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.

    • prediction_logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

    • seq_relationship_logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Example:

    >>> from transformers import BertTokenizer, BertForPreTraining
    >>> import torch

    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> model = BertForPreTraining.from_pretrained('bert-base-uncased')

    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> outputs = model(**inputs)

    >>> prediction_logits = outputs.prediction_logits
    >>> seq_relationship_logits = outputs.seq_relationship_logits
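
    For reference, prediction_logits has shape (batch_size, sequence_length, config.vocab_size) and seq_relationship_logits has shape (batch_size, 2), matching the output fields described above.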
    Return type

    BertForPreTrainingOutput or tuple(torch.FloatTensor)

    BertLMHeadModel¶

    class transformers.BertLMHeadModel(config)[source]¶

    Bert Model with a language modeling head on top for CLM fine-tuning.

    This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

    This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

    Parameters

    config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

    forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, past_key_values=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

    The forward method overrides the __call__ special method.

    Note

    Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

    Parameters
    • input_ids ( of shape ) –

      Indices of input sequence tokens in the vocabulary.

      Indices can be obtained using . See and for details.

      What are input IDs?

    • attention_mask ( of shape , optional) –

      Mask to avoid performing attention on padding token indices. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

      What are attention masks?

    • token_type_ids ( of shape , optional) –

      Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

      • 0 corresponds to a sentence A token,

      • 1 corresponds to a sentence B token.

      What are token type IDs?

    • position_ids ( of shape , optional) –

      Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

      What are position IDs?

    • head_mask ( of shape or , optional) –

      Mask to nullify selected heads of the self-attention modules. Mask values selected in :

      • 1 indicates the head is not masked,

      • 0 indicates the head is masked.

    • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

    • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

    • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

    • encoder_hidden_states ( of shape , optional) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

    • encoder_attention_mask ( of shape , optional) –

      Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

    • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

    • past_key_values ( of length with each tuple having 4 tensors of shape ) –

      Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

      If are used, the user can optionally input only the last (those that don’t have their past key value states given to this model) of shape instead of all of shape .

    • use_cache (, optional) – If set to , key value states are returned and can be used to speed up decoding (see ).

    Returns

    A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

    • loss ( of shape , optional, returned when is provided) – Language modeling loss (for next-token prediction).

    • logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    • cross_attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

      Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads.

    • past_key_values (, optional, returned when is passed or when ) – Tuple of tuples of length , with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if .

      Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see input) to speed up sequential decoding.

    Example:

    >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
    >>> import torch

    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    >>> config = BertConfig.from_pretrained("bert-base-cased")
    >>> config.is_decoder = True
    >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)

    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    >>> outputs = model(**inputs)

    >>> prediction_logits = outputs.logits
    Return type

    CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)

    BertForMaskedLM¶

    class transformers.BertForMaskedLM(config)[source]¶

    Bert Model with a language modeling head on top.

    This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

    This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

    Parameters

    config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

    forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

    The forward method overrides the __call__ special method.

    Note

    Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

    Parameters
    • input_ids ( of shape ) –

      Indices of input sequence tokens in the vocabulary.

      Indices can be obtained using . See and for details.

      What are input IDs?

    • attention_mask ( of shape , optional) –

      Mask to avoid performing attention on padding token indices. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

      What are attention masks?

    • token_type_ids ( of shape , optional) –

      Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

      • 0 corresponds to a sentence A token,

      • 1 corresponds to a sentence B token.

      What are token type IDs?

    • position_ids ( of shape , optional) –

      Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

      What are position IDs?

    • head_mask ( of shape or , optional) –

      Mask to nullify selected heads of the self-attention modules. Mask values selected in :

      • 1 indicates the head is not masked,

      • 0 indicates the head is masked.

    • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

    • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

    • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

    • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

    Returns

    A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

    • loss ( of shape , optional, returned when is provided) – Masked language modeling (MLM) loss.

    • logits ( of shape ) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Return type

    MaskedLMOutput or tuple(torch.FloatTensor)

    Example:

    >>> from transformers import BertTokenizer, BertForMaskedLM
    >>> import torch

    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> model = BertForMaskedLM.from_pretrained('bert-base-uncased')

    >>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    >>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

    >>> outputs = model(**inputs, labels=labels)
    >>> loss = outputs.loss
    >>> logits = outputs.logits
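
    Continuing the example above, a short sketch (not part of the original docs) for reading off the model's prediction at the masked position:

    >>> mask_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    >>> predicted_id = logits[0, mask_index].argmax(dim=-1)
    >>> tokenizer.decode(predicted_id)  # expected to be "paris" for this prompt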

    BertForNextSentencePrediction¶

    class transformers.BertForNextSentencePrediction(config)[source]¶

    Bert Model with a next sentence prediction (classification) head on top.

    This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

    This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

    Parameters

    config () – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

    forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs)[source]¶

    The forward method overrides the __call__ special method.

    Note

    Although the recipe for forward pass needs to be defined within this function, one should call the instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

    Parameters
    • input_ids ( of shape ) –

      Indices of input sequence tokens in the vocabulary.

      Indices can be obtained using . See and for details.

      What are input IDs?

    • attention_mask ( of shape , optional) –

      Mask to avoid performing attention on padding token indices. Mask values selected in :

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

      What are attention masks?

    • token_type_ids ( of shape , optional) –

      Segment token indices to indicate first and second portions of the inputs. Indices are selected in :

      • 0 corresponds to a sentence A token,

      • 1 corresponds to a sentence B token.

      What are token type IDs?

    • position_ids ( of shape , optional) –

      Indices of positions of each input sequence tokens in the position embeddings. Selected in the range .

      What are position IDs?

    • head_mask ( of shape or , optional) –

      Mask to nullify selected heads of the self-attention modules. Mask values selected in :

      • 1 indicates the head is not masked,

      • 0 indicates the head is masked.

    • inputs_embeds ( of shape , optional) – Optionally, instead of passing you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    • output_attentions (, optional) – Whether or not to return the attentions tensors of all attention layers. See under returned tensors for more detail.

    • output_hidden_states (, optional) – Whether or not to return the hidden states of all layers. See under returned tensors for more detail.

    • return_dict (, optional) – Whether or not to return a instead of a plain tuple.

    • labels ( of shape , optional) –

      Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see docstring). Indices should be in :

      • 0 indicates sequence B is a continuation of sequence A,

      • 1 indicates sequence B is a random sequence.

    Returns

    A or a tuple of (if is passed or when ) comprising various elements depending on the configuration () and inputs.

    • loss ( of shape , optional, returned when is provided) – Next sequence prediction (classification) loss.

    • logits ( of shape ) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

    • hidden_states (, optional, returned when is passed or when ) – Tuple of (one for the output of the embeddings + one for the output of each layer) of shape .

      Hidden-states of the model at the output of each layer plus the initial embedding outputs.

    • attentions (, optional, returned when is passed or when ) – Tuple of (one for each layer) of shape .

      Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Example:

    >>> from transformers import BertTokenizer, BertForNextSentencePrediction
    >>> import torch

    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

    >>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
    >>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
    >>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')

    >>> outputs = model(**encoding, labels=torch.LongTensor([1]))
    >>> logits = outputs.logits
    >>> assert logits[0, 0] < logits[0, 1]  # next sentence was random
    Return type

    NextSentencePredictorOutput or tuple(torch.FloatTensor)

    BertForSequenceClassification¶

    class transformers.BertForSequenceClassification(config)[source]¶

    Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

    This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

    This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

    Parameters

    config (BertConfig) – Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

    forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

    The BertForSequenceClassification forward method overrides the __call__() special method.

    Note

    Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

    Parameters
    • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) –

      Indices of input sequence tokens in the vocabulary.

      Indices can be obtained using BertTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

      What are input IDs?

    • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) –

      Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

      • 1 for tokens that are not masked,

      • 0 for tokens that are masked.

      What are attention masks?

    • token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) –

      Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

      • 0 corresponds to a sentence A token,

      • 1 corresponds to a sentence B token.

      What are token type IDs?

    • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) –

      Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

      What are position IDs?

    • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) –

      Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:

      • 1 indicates the head is not masked,

      • 0 indicates the head is masked.

    • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert indices into associated vectors than the model’s internal embedding lookup matrix.

    • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

    • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
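
    Example (a minimal usage sketch; the checkpoint name, input sentence and label below are purely illustrative):

    >>> from transformers import BertTokenizer, BertForSequenceClassification
    >>> import torch

    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> model = BertForSequenceClassification.from_pretrained('bert-base-uncased')  # classification head is newly initialized and should be fine-tuned

    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors='pt')
    >>> labels = torch.tensor([1])  # batch size 1, class index 1
    >>> outputs = model(**inputs, labels=labels)
    >>> loss = outputs.loss
    >>> logits = outputs.logits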

    Source: https://huggingface.co/transformers/model_doc/bert.html

    Introduction

    Welcome to the 🤗 Course!

    This course will teach you about natural language processing (NLP) using libraries from the Hugging Face ecosystem — 🤗 Transformers, 🤗 Datasets, 🤗 Tokenizers, and 🤗 Accelerate — as well as the Hugging Face Hub. It’s completely free and without ads.

    What to expect?

    Here is a brief overview of the course:

    Brief overview of the chapters of the course.
    • Chapters 1 to 4 provide an introduction to the main concepts of the 🤗 Transformers library. By the end of this part of the course, you will be familiar with how Transformer models work and will know how to use a model from the Hugging Face Hub, fine-tune it on a dataset, and share your results on the Hub!
    • Chapters 5 to 8 teach the basics of 🤗 Datasets and 🤗 Tokenizers before diving into classic NLP tasks. By the end of this part, you will be able to tackle the most common NLP problems by yourself.
    • Chapters 9 to 12 dive even deeper, showcasing specialized architectures (memory efficiency, long sequences, etc.) and teaching you how to write custom objects for more exotic use cases. By the end of this part, you will be ready to solve complex NLP problems and make meaningful contributions to 🤗 Transformers.


    After you’ve completed this course, we recommend checking out DeepLearning.AI’s Natural Language Processing Specialization, which covers a wide range of traditional NLP models like naive Bayes and LSTMs that are well worth knowing about!

    Who are we?

    About the authors:

    Matthew Carrigan is a Machine Learning Engineer at Hugging Face. He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we’re going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless.

    Lysandre Debut is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API.

    Sylvain Gugger is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. Previously he was a Research Scientist at fast.ai, and he co-wrote Deep Learning for Coders with fastai and PyTorch with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources.

    Are you ready to roll? In this chapter, you will learn:

    • How to use the pipeline() function to solve NLP tasks such as text generation and classification (see the short sketch after this list)
    • About the Transformer architecture
    • How to distinguish between encoder, decoder, and encoder-decoder architectures and use cases
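
    As a small taste of what is ahead, here is an illustrative sketch of the pipeline() function applied to text generation (the default model is chosen by the library, and the generated text will vary):

    >>> from transformers import pipeline

    >>> generator = pipeline('text-generation')
    >>> generator("In this course, we will teach you how to", max_length=30)
    [{'generated_text': 'In this course, we will teach you how to ...'}]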
    Source: https://huggingface.co/course

    🤗 Transformers




    State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow

    Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more, in a wide range of languages. Its aim is to make cutting-edge NLP easier to use for everyone.

    Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

    Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration between them. It's straightforward to train your models with one before loading them for inference with the other.

    Online demos

    You can test most of our models directly on their pages from the model hub. We also offer private model hosting, versioning, & an inference API for public and private models.

    Here are a few examples:

    Write With Transformer, built by the Hugging Face team, is the official demo of this repo’s text generation capabilities.

    If you are looking for custom support from the Hugging Face team, check out the HuggingFace Expert Acceleration Program.

    Quick tour

    To immediately use a model on a given text, we provide the pipeline API. Pipelines group together a pretrained model with the preprocessing that was used during that model's training. Here is how to quickly use a pipeline to classify positive versus negative texts:

    >>> from transformers import pipeline

    # Allocate a pipeline for sentiment-analysis
    >>> classifier = pipeline('sentiment-analysis')
    >>> classifier('We are very happy to introduce pipeline to the transformers repository.')
    [{'label': 'POSITIVE', 'score': ...}]

    The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here the answer is "positive" with a high confidence score.

    Many NLP tasks have a pre-trained pipeline ready to go. For example, we can easily extract question answers given context:

    >>> from transformers import pipeline

    # Allocate a pipeline for question-answering
    >>> question_answerer = pipeline('question-answering')
    >>> question_answerer({
    ...     'question': 'What is the name of the repository ?',
    ...     'context': 'Pipeline has been included in the huggingface/transformers repository'
    ... })
    {'score': ..., 'start': 34, 'end': 58, 'answer': 'huggingface/transformers'}

    In addition to the answer, the pretrained model used here returned its confidence score, along with the start position and end position of the answer in the tokenized sentence. You can learn more about the tasks supported by the pipeline API in this tutorial.

    To download and use any of the pretrained models on your given task, all it takes is three lines of code. Here is the PyTorch version:

    >>> from transformers import AutoTokenizer, AutoModel

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> model = AutoModel.from_pretrained("bert-base-uncased")

    >>> inputs = tokenizer("Hello world!", return_tensors="pt")
    >>> outputs = model(**inputs)

    And here is the equivalent code for TensorFlow:

    >>> from transformers import AutoTokenizer, TFAutoModel

    >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    >>> model = TFAutoModel.from_pretrained("bert-base-uncased")

    >>> inputs = tokenizer("Hello world!", return_tensors="tf")
    >>> outputs = model(**inputs)

    The tokenizer is responsible for all the preprocessing the pretrained model expects, and can be called directly on a single string (as in the above examples) or a list. It will output a dictionary that you can use in downstream code or simply directly pass to your model using the ** argument unpacking operator.
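
    For instance, here is a minimal sketch (the sentences are made up; padding and truncation are standard tokenizer arguments) of calling the tokenizer on a list of strings and unpacking the resulting dictionary into the model:

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    # Tokenize a batch of sentences, padding them to the same length
    batch = tokenizer(
        ["Hello world!", "Transformers is backed by Jax, PyTorch and TensorFlow."],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    print(batch.keys())  # input_ids, token_type_ids, attention_mask

    # The ** operator unpacks the dictionary into keyword arguments
    outputs = model(**batch)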

    The model itself is a regular PyTorch nn.Module or a TensorFlow tf.keras.Model (depending on your backend) which you can use normally. This tutorial explains how to integrate such a model into a classic PyTorch or TensorFlow training loop, or how to use our Trainer API to quickly fine-tune on a new dataset.
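
    As a rough sketch of the Trainer route (train_dataset here is a hypothetical, already-tokenized dataset; the checkpoint and hyperparameters are illustrative only):

    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    training_args = TrainingArguments(output_dir="test-trainer", num_train_epochs=1)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,  # hypothetical tokenized dataset
        tokenizer=tokenizer,
    )
    trainer.train()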

    Why should I use transformers?

    1. Easy-to-use state-of-the-art models:

      • High performance on NLU and NLG tasks.
      • Low barrier to entry for educators and practitioners.
      • Few user-facing abstractions with just three classes to learn.
      • A unified API for using all our pretrained models.
    2. Lower compute costs, smaller carbon footprint:

      • Researchers can share trained models instead of always retraining.
      • Practitioners can reduce compute time and production costs.
      • Dozens of architectures with thousands of pretrained models, some covering many different languages.
    3. Choose the right framework for every part of a model's lifetime:

      • Train state-of-the-art models in 3 lines of code.
      • Move a single model between TF/PyTorch frameworks at will.
      • Seamlessly pick the right framework for training, evaluation and production.
    4. Easily customize a model or an example to your needs:

      • We provide examples for each architecture to reproduce the results published by its original authors.
      • Model internals are exposed as consistently as possible.
      • Model files can be used independently of the library for quick experiments.

    Why shouldn't I use transformers?

    • This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
    • The training API is not intended to work on any model but is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library.
    • While we strive to present as many use cases as possible, the scripts in our examples folder are just that: examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs.
    Source: https://github.com/huggingface/transformers

    Model sharing and uploading¶

    On this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on the model hub.

    Note

    You will need to create an account on huggingface.co for this.

    Optionally, you can join an existing organization or create a new one.

    We have seen in the training tutorial how to fine-tune a model on a given task. You have probably done something similar on your own task, either using the model directly in your own training loop or using the Trainer/TFTrainer class. Let’s see how you can share the result on the model hub.

    Model versioning¶

    The model hub has built-in model versioning based on git and git-lfs. It is based on the paradigm that one model is one repo.

    This allows:

    • built-in versioning

    • access control

    • scalability

    This is built around revisions, which is a way to pin a specific version of a model, using a commit hash, tag or branch.

    For instance:

    >>> model = AutoModel.from_pretrained(
    >>>     "julien-c/EsperBERTo-small",
    >>>     revision="v"  # tag name, or branch name, or commit hash
    >>> )

    Push your model from Python¶

    Preparation¶

    The first step is to make sure your credentials to the hub are stored somewhere. This can be done in two ways. If you have access to a terminal, you can just run the following command in the virtual environment where you installed 🤗 Transformers:
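
    transformers-cli login  # the same login command used in the terminal and Colab workflows below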

    It will store your access token in the Hugging Face cache folder.

    If you don’t have easy access to a terminal (for instance in a Colab session), you can find a token linked to your account by going to huggingface.co, clicking on your avatar in the top left corner, then on Edit profile on the left, just beneath your profile picture. In the submenu API Tokens, you will find your API token that you can just copy.

    Directly push your model to the hub¶

    Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a fine-tuned model you saved in save_directory by calling:

    finetuned_model.push_to_hub("my-awesome-model")

    If your API token is not stored in the cache, you will need to pass it explicitly. This will also be the case for all the examples below, so we won’t mention it again.

    This will create a repo named my-awesome-model in your namespace, so anyone can now run:

    from transformers import AutoModel

    model = AutoModel.from_pretrained("your_username/my-awesome-model")

    Even better, you can combine this push to the hub with the call to save_pretrained():

    finetuned_model.save_pretrained(save_directory, push_to_hub=True, repo_name="my-awesome-model")

    If you are a premium user and want your model to be private, pass the argument that marks the repo as private to this call.

    If you are a member of an organization and want to push the model inside the organization’s namespace instead of yours, pass the organization’s name in this call as well.

    Add new files to your model repo¶

    Once you have pushed your model to the hub, you might want to add the tokenizer, or a version of your model for another framework (TensorFlow, PyTorch, Flax). This is super easy to do! Let’s begin with the tokenizer. You can add it to the repo you created before like this

    tokenizer.push_to_hub("my-awesome-model")

    If you know its URL, you can also do:

    tokenizer.push_to_hub(repo_url=my_repo_url)

    And that’s all there is to it! It’s also a very easy way to fix a mistake if one of the files online had a bug.

    To add a model for another backend, it’s also super easy. Let’s say you have fine-tuned a TensorFlow model and want to add the pytorch model files to your model repo, so that anyone in the community can use it. The following allows you to directly create a PyTorch version of your TensorFlow model:

    from transformers import AutoModel

    model = AutoModel.from_pretrained(save_directory, from_tf=True)

    You can also replace save_directory by the identifier of your model (your_username/my-awesome-model) if you don’t have a local save of it anymore. Then, just do the same as before:

    model.push_to_hub("my-awesome-model")

    or

    model.push_to_hub(repo_url=my_repo_url)

    Use your terminal and git¶

    Basic steps¶

    In order to upload a model, you’ll need to first create a git repo. This repo will live on the model hub, allowing users to clone it and you (and your organization members) to push to it.

    You can create a model repo directly from the huggingface.co/new page on the website.

    Alternatively, you can use the transformers-cli. The next steps describe that process:

    Go to a terminal and run the following command. It should be in the virtual environment where you installed 🤗 Transformers, since that command comes from the library.
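
    transformers-cli login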

    Once you are logged in with your model hub credentials, you can start building your repositories. To create a repo:

    transformers-cli repo create your-model-name

    If you want to create a repo under a specific organization, you should add a --organization flag:

    transformers-cli repo create your-model-name --organization your-org-name

    This creates a repo on the model hub, which can be cloned.

    # Make sure you have git-lfs installed
    # (https://git-lfs.github.com/)
    git lfs install
    git clone https://huggingface.co/username/your-model-name

    When you have your local clone of your repo and lfs installed, you can then add/remove from that clone as you would with any other git repo.

    # Commit as usual
    cd your-model-name
    echo "hello" >> README.md
    git add . && git commit -m "Update from $USER"

    We are intentionally not wrapping git too much, so that you can go on with the workflow you’re used to and the tools you already know.

    The only learning curve you might have compared to regular git is the one for git-lfs. The documentation at git-lfs.github.com is decent, but we’ll work on a tutorial with some tips and tricks in the coming weeks!

    Additionally, if you want to change multiple repos at once, the change_config.py script can probably save you some time.

    Make your model work on all frameworks¶

    You probably have your favorite framework, but so will other users! That’s why it’s best to upload your model with both PyTorch and TensorFlow checkpoints to make it easier to use (if you skip this step, users will still be able to load your model in another framework, but it will be slower, as it will have to be converted on the fly). Don’t worry, it’s super easy to do (and in a future version, it might all be automatic). You will need to install both PyTorch and TensorFlow for this step, but you don’t need to worry about the GPU, so it should be very easy. Check the TensorFlow installation page and/or the PyTorch installation page to see how.

    First check that your model class exists in the other framework, that is, try to import the same model by either adding or removing TF. For instance, if you trained a DistilBertForSequenceClassification, try to type

    >>> from transformers import TFDistilBertForSequenceClassification

    and if you trained a TFDistilBertForSequenceClassification, try to type

    >>> from transformers import DistilBertForSequenceClassification

    This will give back an error if your model does not exist in the other framework (something that should be pretty rare since we’re aiming for full parity between the two frameworks). In this case, skip this and go to the next step.

    Now, if you trained your model in PyTorch and have to create a TensorFlow version, adapt the following code to your model class:

    >>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True)
    >>> tf_model.save_pretrained("path/to/awesome-name-you-picked")

    and if you trained your model in TensorFlow and have to create a PyTorch version, adapt the following code to your model class:

    >>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True)
    >>> pt_model.save_pretrained("path/to/awesome-name-you-picked")

    That’s all there is to it!

    Check the directory before pushing to the model hub.¶

    Make sure there are no garbage files in the directory you’ll upload. It should only have:

    • a config.json file, which saves the configuration of your model;

    • a pytorch_model.bin file, which is the PyTorch checkpoint (unless you can’t have it for some reason);

    • a tf_model.h5 file, which is the TensorFlow checkpoint (unless you can’t have it for some reason);

    • a special_tokens_map.json, which is part of your tokenizer save;

    • a tokenizer_config.json, which is part of your tokenizer save;

    • files named vocab.json, vocab.txt, merges.txt, or similar, which contain the vocabulary of your tokenizer, part of your tokenizer save;

    • maybe an added_tokens.json, which is part of your tokenizer save.

    Other files can safely be deleted.
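
    For instance, a quick way to eyeball the contents of your local clone before pushing (the file listing below is illustrative):

    import os

    print(sorted(os.listdir("path/to/repo/clone/your-model-name")))
    # e.g. ['.git', '.gitattributes', 'config.json', 'pytorch_model.bin',
    #       'special_tokens_map.json', 'tf_model.h5', 'tokenizer_config.json', 'vocab.txt']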

    Uploading your files¶

    Once the repo is cloned, you can add the model, configuration and tokenizer files. For instance, saving the model and tokenizer files:

    >>> model.save_pretrained("path/to/repo/clone/your-model-name")
    >>> tokenizer.save_pretrained("path/to/repo/clone/your-model-name")

    Or, if you’re using the Trainer API

    >>> trainer.save_model("path/to/awesome-name-you-picked")
    >>> tokenizer.save_pretrained("path/to/repo/clone/your-model-name")

    You can then add these files to the staging environment and verify that they have been correctly staged with the git status command:
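
    # stage everything in your clone, then check what will be committed
    git add --all
    git status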

    Finally, the files should be committed:

    git commit -m "First version of the your-model-name model and tokenizer."

    And pushed to the remote:
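
    git push  # pushes your commits (and the LFS-tracked weight files) to the model hub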

    This will upload the folder containing the weights, tokenizer and configuration we have just prepared.

    Add a model card¶

    To make sure everyone knows what your model can do, what its limitations, potential bias or ethical considerations are, please add a README.md model card to your model repo. You can just create it, or there’s also a convenient button titled “Add a README.md” on your model page. Model card documentation and a template can be found here (meta-suggestions are welcome).

    Note

    Model cards used to live in the 🤗 Transformers repo under model_cards/, but for consistency and scalability we migrated every model card from the repo to its corresponding huggingface.co model repo.

    If your model is fine-tuned from another model coming from the model hub (all 🤗 Transformers pretrained models do), don’t forget to link to its model card so that people can fully trace how your model was built.

    Using your model¶

    Your model now has a page on huggingface.co/models 🔥

    Anyone can load it from code:

    >>> tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
    >>> model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")

    You may specify a revision by using the revision flag in the from_pretrained() method:

    >>> tokenizer = AutoTokenizer.from_pretrained(
    >>>     "julien-c/EsperBERTo-small",
    >>>     revision="v"  # tag name, or branch name, or commit hash
    >>> )

    Workflow in a Colab notebook¶

    If you’re in a Colab notebook (or similar) with no direct access to a terminal, here is the workflow you can use to upload your model. You can execute each of the commands below in a cell by adding a ! at the beginning.

    First you need to install git-lfs in the environment used by the notebook:

    sudo apt-get install git-lfs

    Then you can either create a repo directly from huggingface.co, or use the transformers-cli to create it:

    transformers-cli login
    transformers-cli repo create your-model-name

    Once it’s created, you can clone it and configure it (replace username by your username on huggingface.co):

    git lfs install
    git clone https://username:password@huggingface.co/username/your-model-name

    # Alternatively if you have a token,
    # you can use it instead of your password
    git clone https://username:token@huggingface.co/username/your-model-name

    cd your-model-name
    git config --global user.email "email@example.com"
    # Tip: using the same email as for your huggingface.co account will link your commits to your profile
    git config --global user.name "Your name"

    Once you’ve saved your model inside it, and your clone is set up with the right remote URL, you can add it and push it with usual git commands.

    git add .
    git commit -m "Initial commit"
    git push

    Source: https://huggingface.co/transformers/model_sharing.html
