Filter for confidential information
DOI: https://doi.org/10.18372/1990-5548.78.18256

Keywords: large language models, confidential information filter, word embedding, prompt injection, jailbreaking, NLP model, SBERT

Abstract
This research addresses the prevention of various types of attacks on large language models (LLMs), as well as the leakage of confidential data when an LLM works with local text databases. The research is carried out by implementing a filter and testing it on an example task of filtering requests to the model. Rather than blocking a request to the LLM, the proposed filter removes parts of it; this is considerably faster and makes it impossible for an attacker to iteratively refine a request, since the filter destroys the request's structure. The filter uses word embeddings to evaluate requests to the LLM, which, combined with a hash table of forbidden topics, speeds up its operation. To protect against attacks such as prompt injection and prompt leaking, the filter applies a random sequence enclosure method. Testing showed significant improvements in maintaining the security of the data used by the LLM. The use of such filters in production projects and startups is an extremely important step, yet ready-made filter implementations with similar properties are lacking. The uniqueness of the filter lies in its independence from the LLM and its use of semantic similarity as a fine-grained way of classifying queries.
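As a rough illustration of the mechanism the abstract describes (a hash table of forbidden topics, similarity-based removal of request parts rather than blocking, and random sequence enclosure), a minimal Python sketch might look as follows. The topic lists, the threshold, and the bag-of-words stand-in for SBERT embeddings are illustrative assumptions, not the paper's implementation; a real deployment would use an actual sentence-embedding model.

```python
from collections import Counter
import math
import secrets

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for an SBERT sentence vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hash table of forbidden topics mapped to reference "embeddings"
# (hypothetical example topics, not taken from the paper).
FORBIDDEN_TOPICS = {
    "credentials": embed("password secret api key credentials"),
    "payroll": embed("employee salary payroll records"),
}

def filter_request(request: str, threshold: float = 0.4) -> str:
    """Remove, rather than block, the parts of a request that are semantically
    close to a forbidden topic, destroying the structure of a crafted attack."""
    kept = []
    for part in request.split("."):
        part = part.strip()
        if part and all(cosine(embed(part), ref) < threshold
                        for ref in FORBIDDEN_TOPICS.values()):
            kept.append(part)
    return ". ".join(kept)

def enclose_randomly(user_text: str) -> str:
    """Sketch of random sequence enclosure: wrap user input in a random,
    unguessable delimiter so injected instructions cannot break out of it."""
    tag = secrets.token_hex(8)
    return f"[{tag}]\n{user_text}\n[/{tag}]"
```

Because the delimiter in `enclose_randomly` is freshly generated per request, an attacker cannot pre-compute a closing sequence to smuggle into the prompt, which is the intuition behind the enclosure defense.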
References
[1] ChatGPT Question Filter. [Online]. Available: https://github.com/derwiki/llm-prompt-injection-filtering (accessed 30.09.2023).
[2] D. Kang et al., "Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks," arXiv preprint arXiv:2302.05733, 2023.
[3] J. Ni et al., "Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models," arXiv preprint arXiv:2108.08877, 2021. https://doi.org/10.18653/v1/2022.findings-acl.146
[4] Using GPT-Eliezer against ChatGPT Jailbreaking. [Online]. Available: https://www.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking (accessed 30.09.2023).
[5] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
License
Authors who publish with this journal agree to the following terms:
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).