Computing Profession Viewpoint

The Web Is Missing an Essential Part of Infrastructure: An Open Web Index

A proposal for building an index of the Web that separates the infrastructure part of the search engine—the index—from the services part that will form the basis for myriad search engines and other services utilizing Web data on top of a public infrastructure open to everyone.

By Dirk Lewandowski

Posted Apr 1 2019

The Web as it currently exists would not be possible without search engines. They are an integral part of the Web and can also be seen as a part of the Web’s infrastructure. Google alone now serves over two trillion search queries per year.¹¹ While there seems to be a multitude of search engines on the market, only a few of them are relevant in the sense of having their own index (the database of Web pages underlying a search engine). Other search engines pull results from one of these search engines (for example, Yahoo pulls results from Bing), and should therefore not be considered search engines in the true sense of the word. Globally, the major search engines with their own indexes are Google, Bing, Yandex, and Baidu. Other independent search engines may have their own indexes, but these are not large enough to make them competitive in the global search engine market.

While the search engine market in the U.S. is split roughly two-thirds to one-third between Google and Bing (with its partner Yahoo),¹⁰ in most European countries, Google accounts for more than 90% of the market. As this situation has been stable over at least the last 10 years, there have been discussions about how much power Google has over what users get to see from the Web, as well as about anticompetitive business practices, most notably in the context of the European Commission’s competitive investigation into the search giant.³

Search Engine Bias?

From the users’ point of view, search engines are reliable and trustworthy sources, providing fair and unbiased results.⁸ However, it has been found that search results simply should not be considered “neutral.” Some scholars argue that an unbiased search engine is simply not possible, as there is no ideal result set against which a bias can be measured.⁵,⁶ Therefore, I argue that every search engine presents its own algorithmically generated view of the Web’s content. Every such view can be different, and none of them are the definitive or correct one.

Problems that may arise from search engines’ interpreting the world in certain ways include: reinforcing stereotypes, for example, toward women;⁷ influencing public opinion in the context of political elections (see, for example, Epstein and Robertson²); and preferring dramatic interpretations of rather harmless health-related symptoms.¹³

It seems, therefore, unreasonable to have only one (or a few) dominant search engines imposing their view on the Web’s content, which is, on closer inspection, really only one of many possible views. Therefore, I argue for building an index of the Web that will form the basis for a multitude of search engines and other services that are based on Web data.

Three Major Problems

There are three major problems resulting from a search engine market where only a few competitors are equipped with their own index of Web pages:

1. A search engine provides only one of many possible algorithmic interpretations of the Web’s content. At least for informational queries (see Broder¹), there is no correct set of results, let alone one single correct result. For these queries, we usually find a multitude of results of comparable quality. While a search engine’s ranking might provide some relevant results on the highest positions, there may be many more (or to some users, even better) results on lower positions.
2. Every search engine faces a conflict of interest when it also acts as a content provider and shows results from its own offerings on its results pages (for example, Google showing results from its subsidiary YouTube). This problem gets exacerbated when one search engine has a large market share, as it is able to increase both its influence on its users as well as its suppression of its competitors’ offerings.
3. The more users rely on a single search engine, the higher the influence of search engine optimization (SEO) on the search results, and therefore, on what users get to see from the Web. The aim of SEO is to optimize Web pages so they get ranked higher in search engines (that is, influencing a search engine’s results). Taken together with the fact that SEO is now a multibillion-dollar industry,¹² we can see huge external influences on search engine results.

A Lack of Plurality

Considering these three problems, we can see that in the current market situation, we are far from plurality, not only in terms of the number of search engine providers but also in the number of search results. A 2011 Yahoo study showed that while we can regard a search engine as a possible window to all of the Web’s content, more than 80% of all user clicks were found to go to only 10,000 different domains.⁴ We can assume these numbers are comparable for other search engines. Taken together, search engines have a huge influence on what we as users get to see on the results pages, and consequently, what we select from.

Why Are There No Alternatives?

Why are there no real alternatives to the few popular search engine index providers? Firstly, index providers face huge technical difficulties due to the large number of documents and the ever-changing nature of the Web. A second, significant issue is the cost of hardware, infrastructure, maintenance, and staff. Thirdly, the Web is huge, and a search engine index needs to cover as large a part of it as possible. While we know that no search engine can cover the Web in total, modern search engines know of trillions of existing pages.⁹ And indexing these pages is only the start. A search engine must keep its index current, meaning it needs to update at least a part of it every minute. This is an important requirement that is not being met by any of the current projects (such as Common Crawl) aiming at indexing snapshots of (parts of) the Web.

Separating Index and Services

I am proposing an idea for a missing part of the Web’s infrastructure, namely a searchable index. The idea is to separate the infrastructure part of the search engine (the index) from the services part, thereby allowing for a multitude of services, whether existing as search engines or otherwise, to be run on a shared infrastructure.

The accompanying figure shows how the public infrastructure is responsible for crawling the Web, for indexing its content, and for providing an interface/API to the services built upon the index. The indexing stage is divided between basic indexing and advanced indexing. Basic indexing provides the data in a form that services built on top of the index can process easily and rapidly. So, while services are allowed to do their own further indexing to prepare documents, some advanced indexing is also provided by the open infrastructure. This adds information to the indexed documents (for example, semantic annotations). For this, an extensive infrastructure for data mining and processing is needed. Services should, however, be able to decide for themselves to what extent they want to rely on the preprocessing infrastructure provided by the Open Web Index. A design principle should be to allow services a maximum of flexibility.

Figure. Separating services from infrastructure.
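
The layering described above can be illustrated with a minimal sketch. It is not a specification of the Open Web Index: the class names `OpenWebIndex` and `IndexedDocument`, the `lookup` method, and the `with_annotations` flag are all hypothetical, and whitespace tokenization merely stands in for basic indexing. The point is only that a service can take the basic layer alone or opt in to the advanced layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class IndexedDocument:
    """One document as a hypothetical index API might expose it."""
    url: str
    tokens: List[str]                                           # basic indexing output
    annotations: Dict[str, str] = field(default_factory=dict)   # advanced indexing output


class OpenWebIndex:
    """Toy stand-in for the shared infrastructure (an assumption, not a real API)."""

    def __init__(self) -> None:
        self._docs: Dict[str, IndexedDocument] = {}
        self._postings: Dict[str, Set[str]] = {}   # term -> URLs containing the term

    def add(self, url: str, text: str,
            annotations: Optional[Dict[str, str]] = None) -> None:
        # Basic indexing: lowercase whitespace tokens plus an inverted index.
        tokens = text.lower().split()
        self._docs[url] = IndexedDocument(url, tokens, dict(annotations or {}))
        for term in set(tokens):
            self._postings.setdefault(term, set()).add(url)

    def lookup(self, term: str, with_annotations: bool = True) -> List[IndexedDocument]:
        # Each service decides whether it wants the advanced layer or only the basics.
        hits = []
        for url in sorted(self._postings.get(term.lower(), set())):
            doc = self._docs[url]
            hits.append(doc if with_annotations else IndexedDocument(doc.url, doc.tokens))
        return hits


index = OpenWebIndex()
index.add("https://example.org/a", "Open web index", {"topic": "search"})
index.add("https://example.org/b", "Web search engines")
for doc in index.lookup("web", with_annotations=False):
    print(doc.url, doc.tokens)
```

A service that prefers its own preprocessing would call `lookup(..., with_annotations=False)` and build on the tokens; another could rely on the annotations as delivered.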

As modern search engines rely heavily on usage data, this data (most prominently search queries routed to the index) is collected and made available for reuse. The OWI Usage Data Index allows for this data to be collected, stored, and queried. So, while each service can collect and query its own usage data, every service that wants to access usage data from the OWI Usage Data Index should be required to share anonymized usage data with the other services, so that every service profits from the amassed data. It is clear that existing search engines like Google and Bing have a huge lead compared to new providers, as they have a solid user base and have already amassed large amounts of usage data. However, sharing usage data between the services could at least lessen the cold start problem.
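
The share-to-read rule for usage data can be sketched as follows. Everything here is an assumption made for illustration: the class `UsageDataIndex`, its methods, and the log format of `(user_id, query)` pairs are hypothetical, and simply dropping the user identifier is a crude placeholder for real anonymization, since query text alone can identify people.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


class UsageDataIndex:
    """Toy shared store of query counts with a share-to-read rule (hypothetical)."""

    def __init__(self) -> None:
        self._query_counts: Counter = Counter()
        self._contributors: Set[str] = set()

    def contribute(self, service: str, log: Iterable[Tuple[str, str]]) -> None:
        # Each log entry is (user_id, query). The user_id is discarded before
        # storage; only the normalized query text is counted.
        shared = 0
        for _user_id, query in log:
            self._query_counts[" ".join(query.lower().split())] += 1
            shared += 1
        if shared:
            self._contributors.add(service)

    def top_queries(self, service: str, n: int = 10) -> List[Tuple[str, int]]:
        # Only services that have shared data may read the pooled data.
        if service not in self._contributors:
            raise PermissionError(f"{service} must share usage data before reading it")
        return self._query_counts.most_common(n)


usage = UsageDataIndex()
usage.contribute("engine_a", [("u1", "Open  Web Index"), ("u2", "open web index"),
                              ("u3", "common crawl")])
print(usage.top_queries("engine_a", 1))
```

A new service with few users of its own would contribute its small log and in return read the pooled counts, which is how sharing could soften the cold start problem.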

Benefits

The main benefit of such an index would be that all interested parties could develop their own applications without having to create their own index of the Web, which is currently an impossible endeavor, especially for small- and medium-size enterprises and for non-commercial bodies.

Given a considerable uptake for such an index, it would foster plurality not only in the use of Web content by developers but also in the variety of content that users get to see. We can rightly assume each search engine using the index would apply its own ranking function, and therefore, produce different results. Users would benefit in that they would not have to rely on only one or at best a few search engines but could choose from a variety of engines serving their different purposes. In that way, an Open Web Index would foster plurality and restrict the power of single companies dictating which content is shown to and consumed by users.
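
To make the point about ranking functions concrete, here is a small sketch in which two hypothetical services score the same candidates from a shared index and arrive at different orderings. The `Candidate` fields, the sample records, and both scoring functions are invented for illustration and do not describe any real search engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    """A document as two services might receive it from the same shared index."""
    url: str
    term_hits: int   # how often the query term occurs (made-up signal)
    age_days: int    # days since last update (made-up signal)


CANDIDATES = [
    Candidate("https://example.org/old-long", term_hits=5, age_days=900),
    Candidate("https://example.org/new-short", term_hits=2, age_days=3),
    Candidate("https://example.org/mid", term_hits=3, age_days=120),
]


def score_by_term_hits(doc: Candidate) -> float:
    # Service A: more occurrences of the query term rank higher.
    return float(doc.term_hits)


def score_by_freshness(doc: Candidate) -> float:
    # Service B: more recently updated pages rank higher.
    return -float(doc.age_days)


def search(candidates: List[Candidate], score: Callable[[Candidate], float]) -> List[str]:
    # Same index data, different ranking function, different result list.
    return [doc.url for doc in sorted(candidates, key=score, reverse=True)]


print(search(CANDIDATES, score_by_term_hits))
print(search(CANDIDATES, score_by_freshness))
```

With this sample data, the first service puts the old, term-heavy page on top, while the second puts the fresh page on top, although both read exactly the same index.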

Another benefit would be that the index would be open to everyone, and therefore, would allow for investigating its transparency. However, search engines built on top of the index could still be “black boxes” in that they would not need to make their ranking functions open to anybody.

Possible Applications

While the Open Web Index would first and foremost make the development of new Web search engines feasible and financially attractive, it could also form the basis for a variety of other applications, whether related to search or not.

In the field of search, the Open Web Index would also allow for vertical search engines (like image search, video search, or search in specific areas and on specific topics) to be built. In vertical search applications, OWI data could also be used to amend proprietary data. For instance, a provider of company information could amend its company profiles with Web data.

Apart from search, the OWI could also form the basis for data analysis and for topic detection and tracking. Examples of applications are opinion-mining tools and market research applications.

In the field of artificial intelligence, the Open Web Index could be used as a basis for large-scale machine learning. Likely applications in this area are machine translation, question-answering, and conversational applications.

Last but not least, an Open Web Index would provide a rich data source for researchers in many different fields, ranging from computer science and computational linguistics to computational social sciences and research evaluation.

It is clear that this short list of ideas is far from complete and serves only illustrative purposes. It shows, however, the huge potential of making Web data open to all interested parties.

Alternative Approaches

Some alternative solutions have been proposed for fostering plurality in the search engine market. The first and probably most obvious solution is to wait for commercial market players to develop alternatives. However, as we have seen in the last 15 years or so, Bing has been the only search engine capable of gaining considerable market share. Other search engines have failed, have been acquired by larger search companies, or have focused on niche markets. All new search engine providers face the problem of having to build their own index, which is, as has been described earlier, a very costly undertaking. Furthermore, what would be gained if we had one or two, even three more search engines on the market? From my point of view, the problem lies not in having a few more search engines, but in providing real search plurality.

The second line of argumentation says Google should be forced to provide fair and unbiased results. This is what the European Commission’s competitive investigation against Google has been all about. However, as ranking results are always based on interpretations (and human assumptions inherent in the ranking algorithms), there is no such thing as an unbiased result set. Only a multitude of different algorithmic interpretations can help bring about search plurality.

The third line of argumentation calls for Google to open its index to third parties. Then, it would be possible to build (search) applications on top of Google’s index. However, the control over the index—and over what third parties would be able to get from the index—would still lie in the hands of a private company, the index would still not be transparent, and there would still be no influence on how the index is composed.

The fourth, and already widely discussed, solution is building a publicly funded search engine as an alternative to the commercial enterprises. However, this again would only add one more search engine to the market, instead of fostering plurality.

Conclusion

The main idea I presented in this Viewpoint is to foster building search engines and other services needing Web data on top of a public infrastructure that is open to everyone. A multitude of such services would foster plurality not only on the search engine market (with the result of having more than a few search engines to choose from) but, even more importantly, with regard to the results users get to see when using search engines.

Search results as a basis for knowledge acquisition in society seem too important to be left solely in the hands of a few commercial enterprises. The Open Web Index is comparable to other public services such as constructing roads and railroad tracks, supporting public broadcasting and, most notably, building a library system. An Open Web Index could be one of the main building blocks of the library of the 21st century.

An Open Web Index is a project that cannot and should not be undertaken by a single company or institution. On the contrary, I envision building such an index as a task of society and for society, meaning we should build the index involving all actors and interest groups relevant to society at large. Those that benefit from the index should have their say in building it.

A question that remains is funding. As a considerable amount of money is needed, I argue for public funding not by a single state, but rather by a larger entity such as the European Union. This, however, does not mean a governmental body should also be the operator of the Open Web Index. Rather, it should be run by an organization that is relatively free from state intervention. One could think of a foundation running it or a model similar to public broadcasting. Whatever the mode of operation, as a project of and for society, funding should be applied for the greater good.

DOI: 10.1145/3312479

Published: April 1, 2019, in the April 2019 issue, Vol. 62 No. 4, Page 24
