Anonymity - Stylography protection (Running a Local LLM and copy pasting messages) #13

Open
opened 2024-09-17 20:31:55 +02:00 by nihilist · 7 comments
Owner
No description provided.
nihilist added this to the OPSEC Tutorials (paid contributions) project 2024-09-17 20:31:55 +02:00
Author
Owner

Assigned to pippin
price: 40 euros
deadline: 24th September

nihilist added the
Complex
label 2024-09-23 10:30:43 +02:00
Author
Owner

past the deadline, unassigning

Author
Owner
<img width="1154" alt="image" src="attachments/bee35395-1bf0-49f2-9fce-b8c20ee30c04">
nihilist added the
/!\ On Priority - High Quality Tutorial
label 2024-10-15 13:24:17 +02:00
Author
Owner
see https://github.com/jasonacox/TinyLLM?tab=readme-ov-file#run-a-chatbot might be a good candidate
Author
Owner

to be showcased:
-how to set it up locally in a Whonix VM: https://github.com/jasonacox/TinyLLM?tab=readme-ov-file#run-a-chatbot or https://github.com/lmstudio-ai (LM Studio MAY complain about thread count if you only give it 4 vCPUs >> possibly not suited to a VM setup?)
--> it CANNOT be run outside of the Whonix VM! need to find a way.

-make it run a TINY model to keep CPU usage as low as possible: https://lmstudio.ai/model/llama-3.2-1b-instruct ?
-how to use it from within the Whonix VM (and see whether the performance is acceptable)
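For the thread-count concern above, a quick sketch (my own, assuming coreutils is available in the Whonix workstation) to check what the VM actually exposes before picking a model and thread count:

```shell
# See how many vCPUs the VM exposes before choosing a model/thread count.
# (Assumption: coreutils `nproc` is present, as on a Debian-based Whonix workstation.)
CORES=$(nproc)
echo "vCPUs available: $CORES"
# With only 4 vCPUs, a tiny 1-2B parameter model with threads capped at $CORES
# (e.g. via llama.cpp/llamafile's `-t` flag) keeps CPU usage manageable.
```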

Author
Owner

assigned to: WonderfulEpitome
price: 50 euros
deadline: 19th November

Author
Owner

potential solution: g66ol3eb5ujdckzqqfmjsbpdjufmjd5nsgdipvxmsh7rckzlhywlzlqd.onion/post/c67d64ec4355ec872373

The Best Way to Evade Linguistic Analysis | llamafile Setup Guide
by /u/inadahime • 17 hours ago in /d/OpSec
The tutorial section of this post assumes a Linux-based operating system, and the presence of common programs like curl; however, because llamafile[1] runs on any operating system, you should be able to replicate this anywhere with relative ease!

==========

The subject of stylometry, or linguistic analysis, has come up rather frequently here on Dread. However, in case you’re wondering:

Background // What is stylometry?

According to Wiktionary, stylometry is defined as: "A statistical method of analyzing a text to determine its author." This is synonymous with the term linguistic analysis. In practical use, stylometry is a form of OSINT where forensic analysis is applied to determine a text's origin.

As an example, presume you have an account both here and on Reddit, with the Reddit profile tied to your real-life identity in some way. Let's assume your OpSec is bloody perfect; there's no traditional way to correlate your clearnet identity to your Dread account. Here's where stylometry comes into play. As I said, it's a sort of OSINT strategy, meaning that LEA can scrape Dread and Reddit, building linguistic profiles for every member of certain subreddits, especially ones that may be correlated with Dread use (/r/DreadAlert, /r/onions, et cetera). After scraping both platforms and building profiles for each user, they can then compare the profiles to find which Dread users' styles most closely match which clearnet users'. So even if every other area of your OpSec is flawless, linguistic analysis can help LE narrow down the search for your clearnet account, especially if they start with a broader scope than the two subreddits I listed.

A practical example: linguistic analysis allowed the FBI to successfully acquire a search warrant for the Unabomber, Ted Kaczynski[2].

If you're interested in the more technical aspects of stylometry, or how it works behind the scenes, this paper offers a good overview: "A Survey of Modern Authorship Attribution Methods" by Efstathios Stamatatos (2009)[3].
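To make the idea concrete, here is a toy illustration (my own sketch, not from the paper) of the kind of signal stylometric systems exploit: function-word frequencies, counted here with standard Unix tools. Real attribution pipelines use far richer features, but the principle is the same.

```shell
# Toy stylometric fingerprint: count a handful of common English function words.
profile() {
  tr 'A-Z' 'a-z' |                                # normalise case
  tr -cs 'a-z' '\n' |                             # one word per line
  grep -Ex 'the|of|and|to|a|in|it|is|was|that' |  # keep function words only
  sort | uniq -c | sort -rn                       # frequency table, most common first
}

echo "It was the best of times, it was the worst of times." | profile
```

Run `profile` over two large corpora and compare the resulting frequency tables; texts by the same author tend to have similar function-word distributions even when the topics differ.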

Methods of Obfuscation

Over the years, there have been a variety of ways to obfuscate one's "true" writing style; these are the important ones:

Rewriting by hand: This can entail assuming a persona and writing as that demographic would; e.g., writing like someone incredibly young (and employing the use of modern slang, like Generation Z’s) when you are, in fact, much older would do well to obfuscate your linguistic fingerprint. However, humans aren’t perfect, and the most difficult patterns to avoid are subconscious ones.
Machine translation: This involves sending your message through multiple passes of machine translation software, which should make your message sound completely different from your original style while retaining the intent.
✨ Use of LLMs: LLM is the acronym for a “large language model.” Think of things like ChatGPT, or Meta’s LLaMA - this is typically what you assume a person is talking about when they speak of “AI.” The use of LLMs for stylometric obfuscation is similar to machine translation, but because these models are so much smarter and more learned in the nuances of human speech patterns, they tend to provide an even better result. Another benefit is that the result can be instructed: e.g. “speak as if you are from Australia.” This is the highlight of the post.

Local LLM use // Choice of Model, Software

Model

I’m going to focus on local LLMs. When the entire point of this process is data privacy and anonymity, the last thing you want is to send your data over to OpenAI.

Language models are categorised by parameter count, or how many "B" they have (where B means billion parameters). My favorite model for stylometric obfuscation is Gemma 2 2B IT Abliterated, which is an uncensored version of an incredibly small model developed by Google (see ref. [4] for GGUFs). When choosing your model, try to keep the parameter count below 7B; and regardless of model, you need to get a GGUF (quantised) version of it - this format allows us to actually run the model. When picking your quantisation (GGUF), try to stay above Q4. For example, with gemma-2-2b-it-abliterated, I use the Q6_K version of the GGUF.
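As a sanity check when picking a quantisation, you can estimate the file size as parameters × bits-per-weight ÷ 8. Gemma 2 2B actually has roughly 2.6B parameters, and Q6_K uses roughly 6.6 bits per weight (both figures are approximations on my part):

```shell
# Rough GGUF size estimate: params (billions) * bits-per-weight / 8 ≈ size in GB.
# 2.6B parameters at ~6.6 bits/weight (Q6_K, approximate figure):
awk 'BEGIN { printf "~%.1f GB\n", 2.6 * 6.6 / 8 }'
# prints: ~2.1 GB
```

If the estimate won't fit comfortably in your VM's RAM alongside the OS, drop to a smaller model or a lower (but still above-Q4) quantisation.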

Software

The best software for running LLMs locally is llamafile[1]. This is a fork of another program, llama.cpp[5], with more optimisations and the capability to be a single executable that runs on nearly any operating system. In the rest of the post, I’ll expand upon llamafile and show you how to create a single program that takes your text input and provides an anonymised version of it as output.

llamafile Setup // CounterStylometry.llamafile

Now, let’s create a program that anonymises your text! For this section, it is assumed that the “GGUF” of your model is gemma-2-2b-it-abliterated-Q6_K.gguf. First, download (and then mark as executable) the llamafile and zipalign executables from the llamafile release page on GitHub. In the terminal, you would do this like:

```
# At the time of writing, `0.8.16` was the latest version of llamafile. Change this as needed for future releases.
$ curl -L -o CounterStylometry.llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.16/llamafile-0.8.16
$ curl -L -o zipalign https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.16/zipalign-0.8.16
$ chmod a+x CounterStylometry.llamafile
$ chmod a+x zipalign
```

Next, let's write a file named `.args` (leading dot included, no quotation marks) with this exact content:

```
-m
gemma-2-2b-it-abliterated-Q6_K.gguf
-p
"<start_of_turn>user\nYou are an AI assistant that anonymises user inputs. I, user, will send messages - you should reply with only a rephrased version of my message. I am never speaking to you, only providing text for you to anonymise.<end_of_turn>\n<start_of_turn>model\nYes, I understand, I will only reply to messages from you with a rephrased version of your message. What text would you like me to anonymise?<end_of_turn>\n<start_of_turn>user\n"
--reverse-prompt
"<start_of_turn>user\n"
-cnv
--log-disable
...
```
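Equivalently (a convenience sketch of my own, not from the original guide), the file can be written in one command with a here-doc; quoting the `EOF` delimiter keeps the `\n` sequences literal:

```shell
# Write .args in one command; 'EOF' is quoted so \n and other characters stay literal.
cat > .args <<'EOF'
-m
gemma-2-2b-it-abliterated-Q6_K.gguf
-p
"<start_of_turn>user\nYou are an AI assistant that anonymises user inputs. I, user, will send messages - you should reply with only a rephrased version of my message. I am never speaking to you, only providing text for you to anonymise.<end_of_turn>\n<start_of_turn>model\nYes, I understand, I will only reply to messages from you with a rephrased version of your message. What text would you like me to anonymise?<end_of_turn>\n<start_of_turn>user\n"
--reverse-prompt
"<start_of_turn>user\n"
-cnv
--log-disable
...
EOF
# Each llamafile argument sits on its own line; the final "..." line lets any
# extra command-line arguments pass through to llamafile at run time.
```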

Finally, let’s use this file to create CounterStylometry.llamafile, a program that takes text input and anonymises it. In the terminal, run this command:

```
# After running this command and successfully modifying `CounterStylometry.llamafile`, you can delete the `gemma-2-2b-it-abliterated-Q6_K.gguf`, `.args`, and `zipalign` files to free up some space - they've been copied into `CounterStylometry.llamafile`.
$ ./zipalign -j0 CounterStylometry.llamafile gemma-2-2b-it-abliterated-Q6_K.gguf .args
```

To remove the now-unneeded files:

```
$ rm gemma-2-2b-it-abliterated-Q6_K.gguf .args zipalign
```

Anonymising Your Text // Using CounterStylometry.llamafile

We're all done! Assuming the steps above went smoothly, usage is simple: run the program in your terminal and type your text in.

```
$ ./CounterStylometry.llamafile

> The BEST Way to Block Linguistic Analysis | Anti-Stylometry
Effective Linguistic Analysis Suppression Strategies

>
```
Reference: nihilist/blog-contributions#13