haqistan

⋔ Home ⋔ About ⋔ Hire Me ⋔ Links ⋔ License ⋔ Archives ⋔ RSS ⋔

Offline Search for Privacy

3043 words by attila written on 2025-03-04, last edit: 2025-04-07, tags: anticap, anticorp, google, offline, privacy, search, web ⋔ Previous post: Goddamn Android, part I: Transfer All Thee Things ⋔ Next post: BlueSky's Followers Problem

The idea is simple: if your search drops no logs, it is perfectly private. What Google wants, above all else, are the logs of what everyone is searching for... Google and every other commercial search outfit.

I should preface this all by saying I worked at the first commercially self-sustaining search engine, Lycos. I'm one of the originals. We proved it was possible to run a commercial shop supported by advertising online and monetizing people's privacy, sadly. However, it took millions of dollar of Uncle Sam's money (aka OUR money) getting dumped on Stanford to turn that insight into the monster that is StanFoogle, er I mean Google.

Okay, so what would offline search look like in practice? I have two answers: in the small, and in the large.

In the small: Boutique Search

For "boutique" sites, e.g. blogs, static web sites, newsletters, even small shopping sites: we write an indexer that you run on your server (or that integrates into your stack, e.g. ghost.io, wordpress, etc) that produces a catalog of your web site in JSON: stems, positions, document ids, etc. Think: card catalog in a library. The card catalog is much, much smaller than the whole library, but very effective in helping you find things. Likewise, even if you have a blog with thousands of pages, a JSON catalog suitable for searching your site will still be tiny, and we can compress it as well.

The actual search functionality is in JS (really TS) and lives in the browser. When the search function is first invoked the catalog is pulled opportunistically from your site (say, in /catalog.jzon), unpacked and then used to actually search. We implement TFIDF in the browser, where the actual search code runs: no logs. Not even you know what your users are looking for.

I should note that the traffic Google gets from everyone's little "search my site" box is in some ways more valuable to them than regular front-door searches, because they're being given an additional piece of information. It's not just that you searched for "salami surprise," it's that you searched this particular set of pages for "salami surprise." They got that piece of information for free, and it is surely more valuable than that. Offline search in the small is both about protecting your users' privacy and denying Google that piece of information, in the aggregate.

In the large: The Whole Enchilada

Most of this section comes from a proposal of mine NLnet rejected in 2015.

As we've said already, the most private search is the one that never hits the wire: what if you could plug cheap, external storage (say, a USB stick) into your device of choice and search a credible, usable catalog of the web without a packet ever being sent or received? No logs for TLAs to pick over post-hoc, no censorship to deal with other than the explicit editorial policy used to build the catalog.

The vision in the large is to produce a catalog of the web that fits in something like 1TB and could be carried around on a USB stick, shared via Sneaker-Net and dead drops for those without 1st world connectivity and spread via BitTorrent by those who do. The catalog would have to be updated regularly; cranking out a new version every 30 days would be a reasonable goal to start with, the idea being that the number of days could be reduced if more resources were available. The canonical users would be journalists and activists, rural people and others with limited or no connectivity, privacy advocates and communities who wish to have their own view of the web for whatever reasons (religious, cultural, political).

Some prospective user communities might be interested in cooking their own catalogs, for whatever reason. A twin goal, therefore, is to release the software used to build the catalog under the ISC license (a simplified version of the BSD license). The software will be designed so that groups who are interested in producing catalogs can collaborate so long as they agree on certain basic parameters. The more the merrier: let a thousand flowers bloom. If you have the resources to crawl the web you can run the whole stack yourself.

If you don't want to do the crawling but are interested in tweaking a catalog in some other way you can subscribe to another group's crawl and build your own catalog. There are good reasons why you'd want to do both of those things. Of course, subscribing to someone else's crawl presumes that you already agree with enough of that crawl's parameters. For instance, a group of religious fundamentalists who wanted to make a catalog that does not contain any material antagonistic to their beliefs would probably not want to subscribe to a crawl produced by a group of hacktivists.

In any event, writing and operating a crawler would be something for a later phase of the project, to start with Common Crawl would be acceptable.

The catalog will come with a reference implementation of a search tool, command-line only to start with, written in Python, Perl or some other language something that could be packaged up in a universally consumable way. More complex, user-friendly tools to make use of the catalog could be written by anyone. I intend to use Xapian for the indexing back end, and the files we would distribute would be in the format that the Xapian client libraries expect. This means any programming language with Xapian bindings could be used to build search tools. Xapian's architecture allows for custom back-ends to be built; if it becomes necessary in order to reduce the size of the catalog we can do whatever we need to do without sacrificing the rest of Xapian's features, such as its extensible query language that already supports a rich set of search operators, and which can be extended via their API.

Editorial Policies for Catalogs

The specific choice of 1TB is arbitrary: it was previously unimaginable amount of storage which became commonplace. The terabyte is the new gigabyte, just like always. The actual size is more or less irrelevant because I'm talking about editing (a catalog of) the web to make it more manageable, so fundamentally the idea is to trade time for space. There hasn't been too much attention paid to the idea of minimizing space in the search and indexing world - mostly everyone wants to minimize time and is willing to trade space for time, especially given how cheap storage has become relative to other resources.

Our paradigm is different: we instead wish to put an upper bound on the size of the catalog and will trade both run-time (in the catalog build and in the end user experience) and functionality, when necessary, to shrink it. This decision colors everything else in many ways.

Whatever the specific maximum size ends up being, it is still obvious that something will have to be left out of The Whole Enchilada. If we are to believe pundits the web's ever-expanding hugeness will continue on an exponential arc until the heat death of the universe. Although it is true that the total size of the web in several dimensions continues to grow, it is not true that everyone wants to search all of it all of the time.

As a straw-man I propose the following editorial policy for a catalog of the web that might fit in 1TB:

Drop the following:

Porn;
Social Media (Anything on Facebook, LinkedIn, G+, ...);
Content in anything but a single language (in this case English);
SEO Garbage (pages that only exist to game PageRank).

It might be that some prospective user communities won't agree with my assessment, not just of Facebook, but in general: if you're a journalist researching a piece on porn then you do in fact want to find porn (not to be overly puritanical). This means that at least the catalog I would like to use will not necessarily be suitable for everyone. This is why the editorial machinery used to drop things from the catalog is not fixed: in fact this is the most important piece of the puzzle that does not yet exist.

Nonetheless my short kill file-style policy is just a sketch: without some numbers I don't really know if the goal of 1TB is achievable this way. This is why I propose to first do a survey of the web in order to produce numbers that can be used to do what-if analyses like:

What are all the largest top-level components of the web in various languages? e.g. Porn, Sports, Shopping, Social Media, ...;
How big are they all in various dimensions (urls, bytes, stems), and what does the Venn Diagram of these components look like?
What is the effect on the final size of the Xapian files if we drop or include specific components?

Answering the latter question requires a little modeling and experimentation but the idea should be clear: there might be very "large" areas of the web that don't contribute proportionally to the size of the index.

Once the first survey is done the next step is to come up with a final size constraint that is sensible given (a) the state of portable media at the time and (b) what we've learned from the survey.

In doing the survey it is certain that issues will arise that will require code to be written, design assumptions to be revisited, etc. For this reason I think it only makes sense to think about the architecture in general terms and not get too specific too early. My overall vision for the catalog builder is analogous to a snake that eats its own tail; intermediate results and content from the web can only be partially cached due to size constraints, so estimating the high watermark for temporary storage during a build is crucial. This is as opposed to a system such as Google's, which is more like a large herd of goats romping around in an effectively infinite field, a.k.a. their cache of the web. We can improve performance by running multiple ouroboros instances but Google's herd of goats is always going to win on speed.

Search Interfaces and Caching

Our overriding concern is privacy; the specific catalog we wish to produce is one geared towards activists, journalists and others on the front line of the war against privacy being waged by the largest, most aggressive governments on the planet. To this end we also wish to explore a few other ideas in the user interface. Although this is a secondary area we will be providing a reference tool with the catalog that can be used by itself to search, and which provides us with a platform to experiment with privacy-related features in this context.

Users of most online search engines have gotten used to a slew of features that can easily violate their privacy or otherwise leak information about them. Search results pages frequently contain bugs of many kinds; they also include summaries and extracts from web pages that appear in them, etc. Our catalogs, being limited in space, will surely not contain enough information by themselves to mimic these features without code to reach out and pre-fetch results to produce these kinds of summaries, snapshots, etc.

Any instance of our catalog will also have an associated cache, which will normally ship as an empty folder. As a user interacts with the catalog, subject to user preferences, this cache could accrue up-to-date summary information on results as they come out of searches, so that over time more user-friendly results could be displayed if cached information were available or if the user agreed to allow on-the-fly network access from the tool being used to search the catalog.

We propose that the reference client have three modes:

Catalog only: only information that appears in the catalog is displayed, which will result in fairly minimalist information for search hits. The cache is ignored in this mode, no summaries or snapshots are available;
Catalog + Cache: information in the cache that is not deemed too old (user settable) will be merged into search results, so that some hits might have more elaborate summaries, images, etc. available;
Full Cache: the cache is filled on the fly for all search hits that are displayed. Cache entries that are too old are refreshed.

In the first two modes no network traffic is generated by searching; only in the last mode will information about what the user is searching for leak onto the wire.

A journalist who was preparing to go into a war-torn area where getting online is to be avoided if possible could use the Full Cache mode to pre-load their cache over Tor by running a set of queries that loads the cache with summaries and snapshots for all search hits, before they go. They could then use their catalog with its associated cache to see reasonably rich results (assuming the cache covered them), and which allowed them to avoid using the network as much as possible while in inclement circumstances.

A brief word on reality as we know it

I grew up in a world without the web. I was privileged to be one of a host of people funded by NSF, DARPA, etc. at institutions like CMU, Berkeley and MIT, tasked with the overriding brief of make the Internet go. All through the 80s and the 90s, tremendous amounts of energy and money were put into making this whole Internet thing a reality, but the Web is not the 'net. It was search that created the Web, or maybe just revealed it, but in any event this is veil of perception-type stuff. The fact is, until you could find things the 'net wasn't much use to anyone but people like me.

Lycos and other attempts at giving regular people a way of finding things were a low-key revelation: all of a sudden the "killer app" for the Internet was staring us all right in the face.

Given the source of the funding (very corporate and corporate-friendly), it is not surprising that the first, second and third models anyone tried to turn this web thing into money was advertising, but there was always a sense among practitioners, at least some of us, that it was not a foregone conclusion that this was the only model worth trying.

In today's world, over 30 years later, it seems inevitable that advertising dominates all other models for making money on the web. It has also become clear that this is very much to the liking of C-suite-level people, venture capitalists and investors in general. But there are also other models possible, co-existing on the web, such as subscription-based sites and crowdfunding.

What these other models lack is the power to produce corporate bonus-levels of cash, but they have been shown to be viable for sustaining creators and developers. In short, the more affluent levels of society are addicted to that Mad Man cash, and a lot of poorer people who create content are as well, albeit in a different way. Gaming and maintaining one's SEO rankings is a sub-industry in and of itself, and more importantly has occupied an alarmingly large footprint in the minds of at least two generations of people born into this mess.

So I'm not totally unaware of the fact that what I'm proposing - log-free search that is fundamentally un-monetizable - is going to sound unhinged to a lot of people. Anyone who has worked in privacy or infosec at all is used to being looked at funny by "regular" people when we point out that we should do this or that to protect our privacy and even personal security when this or that sound basically impossible to normal people:

stop using chrome (many people cannot do this);
stop relying on Google, Bing, etc (again, ditto);
move towards open-source versions of proprietary software and stop messing around with malfeasant corporates like Microsoft and Adobe (many jobs require their use);
avoid corporate "largess" in general (a lot of people depend on free stuff to survive).

The fact is that all the pundits who have been telling two (going on three) generations of people that privacy is dead are also making out handsomely on the proposition themselves. Recent events e.g. the United States have put issues like privacy online and the security of one's personal tech on a lot more peoples' front burners than they had been.

What I'm scoping out here, both in the small and in the large, is clearly meant to be a volunteer-led/oriented project to start with; as I noted, I did propose a small ($30k/one year) idea to NLnet based on this framework a decade ago, but the idea was never to produce the next corporate search behemoth. It is instead that search, as a whole, is too important to be left to corporates. Decent, useful web search is not going to come from the profit motive. Everyone is familiar with the concept of Enshitification and how it has manifested in Google's search results (and everyone else's).

I am pinning my hopes on a swing back to the left that is fundamentally anti-corporate in nature and based on the idea of supporting each other and our communities online without the usual background radiation of competition as a motivation.

In this context, offline search makes sense and could be a powerful anti-corporate, anti-capitalist tool. I am working on implementing the idea in the small, as a first step: nogoog git repository

If this interests you please drop me a line or a pull request on codeberg.