Summary of the approach (p10):

"1) Public curation mechanisms for quality;

2) Transparency, telling users exactly how the information originated;

3) Open data access to metadata, giving users the exact data source of the information;

4) Protected user privacy, with their searching protected by strict privacy controls;

5) No advertising, which assures the free flow of information and a complete separation from commercial interests;

6) Internalization, which emphasizes community building and the sharing of information instead of a top-down approach."

My first thought: How will transparency impact SEO? Will spammers be able to better game the algorithm when they know its internals?

However, I am excited at the prospect of a Wikipedia-like public curation system for the entire web. I admit I'm flabbergasted that Wikipedia's model ever worked at all, but it does.



1) Public curation mechanisms for quality;

The Mozilla / Open Directory Project tried this. Curation doesn't scale and often assumes a single unifying ontology. This is particularly problematic in a cross-cultural context. Besides, 'quality' is not a unidimensional metric in a result set: consider timeliness, authority, notability, uniqueness, comprehensibility, etc.

2) Transparency, telling users exactly how the information originated;

Most search engines already include a URL. I can see a [crawldate] button, like the [cache] or [translate] buttons on each hit, adding some information, but it would be of dubious additional utility for most searches.

3) Open data access to metadata, giving users the exact data source of the information;

As above.

4) Protected user privacy, with their searching protected by strict privacy controls;

We have DuckDuckGo already; more entrants are welcome, but this is hardly a unique offering, nor a trustworthy one given Snowden's revelations about the scale of systematic Five Eyes traffic monitoring and recording.

5) No advertising, which assures the free flow of information and a complete separation from commercial interests;

DDG or Google or Bing with plugins can supply this. Not groundbreaking.

6) Internalization, which emphasizes community building and the sharing of information instead of a top-down approach.

This is so amorphous as to be a non-point.

So out of six points, 2 things (33%) are only useful in edge cases, 1 thing (16%) is too vague to be useful, and the other 3 things (50%) are currently implemented by others and have been tried before.

I would like to see the input of the former Blekko guys on this, https://news.ycombinator.com/user?id=ChuckMcM + https://news.ycombinator.com/user?id=greglindahl


> Curation doesn't scale and often assumes a single unifying ontology

Wikipedia is a pretty big exception to that assertion. Perhaps DMOZ (a clone of Yahoo circa 1996) is not the only way to do curation. Perhaps Wikipedia could apply what has worked for Wikipedia, i.e. develop a set of POV-neutral criteria for organizing collections of links and then invite everyone to participate.

It's really easy to be negative. But that's something that might at least be an interesting research project for the #1 open-curation system in the world.


You make a fair point. I'm not rubbishing Wikipedia, just questioning the supposed USP. I would also point out in response to your argument that a Wikipedia article and a set of search results are apples and oranges.

The article is written once then modified or evolved occasionally by (almost exclusively) humans, but very frequently read. It is intended to be intelligible, being structured and based in natural language. It has a very well defined scope within a flat namespace, and often clear relations to multiple formal ontologies. It is structured to be consumed in part or in whole, and may contain rich media and strong supporting contextual information (related pages).

By contrast a search result summarizes a set of potential information sources that may answer a search query in whole or in part, to various definitions of "answer". It is generally written once, by a computer, and thrown away after some period of caching. It is intended to be concise. Each component result has relatively poor context, relying upon the searcher to interpret timeliness, authority, notability, uniqueness, comprehensibility, etc. with the limited information presented, typically a very short content excerpt. It is structured to be scanned, classically in a ranked fashion from "best hit" to "worst hit", and is generally a wall of text.

Wikipedia successfully attracts people to contribute to the former, but the latter - where the information product is generated on the fly and lasting impact is amorphous (nothing particularly concrete for contributors to point to and say "I did that! Warm and fuzzies!") - is a very different beast.

I too believe there is room for innovation ... there is potentially low-hanging fruit like inter-linguistic semantic queries (not keyword search) ... but no such key problem areas are identified in the paper's summary.


The other big problem is that curating search results is inherently about prioritising a position rather than establishing a sourced and reasonably neutral version of the truth.

I'm imagining the edit wars and debates that take place on contentious wordings or facts in some parts of Wikipedia, but on a much wider scale involving hundreds of SEO consultants each aware that changing a particular criterion will have a quantifiable impact on their clients' bottom line. It doesn't sound like it would be fun to police.


Wikipedia already curates links to some extent on every page under "External Links". So there is a seed there.

And even the page text is not immune from the problem you describe. Grading and prioritizing sources is a fundamental part of producing a "reasonably neutral version of the truth." It's what determines what gets cited and how prominently it influences the article.

So while I wouldn't equate text and links in terms of the difficulty of managing POV-neutrality, I would say they sit on a spectrum.


There was a remark recently that most Jeopardy answers are Wikipedia titles. Consider Wikipedia as an ontology, with Wikipedia titles as the vocabulary. A search engine could associate articles with relevant Wikipedia titles, and try to do the same with queries. The first step of search is then relatively straightforward.
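
A minimal sketch of that first step, in Python (the title set, documents, and function names are illustrative assumptions, not anyone's actual implementation): tag each document with the Wikipedia titles it mentions, map the query onto the same vocabulary, and rank by overlap.

    # Toy "ontology": a handful of Wikipedia titles used as a controlled vocabulary.
    WIKIPEDIA_TITLES = {"Gravitational wave", "LIGO", "General relativity"}

    def extract_titles(text):
        """Very naive entity linking: a title 'matches' if it appears verbatim."""
        return {t for t in WIKIPEDIA_TITLES if t.lower() in text.lower()}

    documents = {
        "doc1": "LIGO announced the detection of a gravitational wave.",
        "doc2": "An introduction to general relativity for undergraduates.",
    }
    index = {doc_id: extract_titles(text) for doc_id, text in documents.items()}

    def search(query):
        query_titles = extract_titles(query)
        # Rank documents by how many query concepts they share.
        return sorted(index, key=lambda d: len(index[d] & query_titles), reverse=True)

    print(search("gravitational wave detection by LIGO"))  # doc1 first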


> We have duckduckgo already

DuckDuckGo is a meta-search engine! It relies mainly on the Yahoo BOSS API, which uses Bing search (for most countries). Yahoo BOSS went from free to expensive in early 2015, and the future of Yahoo (the tech company, not the Alibaba stock) is uncertain.

We definitely need more search engines; only 6-7 exist that cover a wide (international) range. Search HN to retrieve the list; we have had this discussion before.


I agree with you, but this quote was in the context of the claim of a privacy USP. You have taken it out of context.


BOSS is being discontinued on March 31st, so I imagine that DDG has some sort of idea what they're going to be doing after that date.

Any insight?


They could use the Bing API directly or just stick with Yandex.

http://datamarket.azure.com/dataset/bing/search


They are using Yandex now with a bit of their own crawling.


Curation scales for some topics. It was hard to build a curated list of Linux sites; they come and go. But there are only a couple of good, comprehensive new health websites per year.

I was Blekko's founder/CTO. And it's worth noting that our founding team was the Open Directory Project's founding team. Blekko's curation data was even better than DMOZ's in its day. Check it out: https://github.com/blekko/slashtag-data


Duckduckgo is ad free? I never knew this. How do they make money?


My mistake. You are right, they do have ads; I just always use a blocker. I've updated the text.


DDG has an option to disable ads; they just ask that you help promote them.


> 4) Protected user privacy, with their searching protected by strict privacy controls;

They'll have to keep the servers outside the USA then. It's illegal for European organisations to transfer personal data to the USA now that Safe Harbour is invalid.


"Public Curation" doesn't make quality. It makes a mob-rule system where only the most popular ideas flourish.


Nonsense. It means content will be filtered through the lens of one or more individuals, and the results vary dramatically. Mob rule is one possibility. Yahoo Directory used to be a great example of mid-level quality: it gave a nice starting point and surfaced obscure stuff people overlooked. At the high end, the link below shows the Stanford Encyclopedia of Philosophy set a pretty awesome precedent for high-quality curation:

https://news.ycombinator.com/item?id=10266103


I'm pretty sure that the spammer argument is just an excuse used by Google to allow them to keep their business practices out of public scrutiny. Google search results are biased in favour of content produced by those who have money and power.

Google ranks everything based on popularity - Not based on quality. Popularity and quality are two independent concepts and not necessarily related. That's something which Wikimedia understands but which Google doesn't.


Google does take quality into account. That's the whole point behind static ranking algorithms. However, quality isn't some universal concept. I'd say the inevitable paper on gravitational wave detection is the highest-quality content on that topic, but it certainly isn't popular, because it's impenetrable to the lay masses, unlike, say, a Wikipedia article, which falls into look-at-me-I'm-oh-so-smart territory when it gets into double gradients and other math formulas with more letters than digits.


If you think "spam" isn't the defining problem of web search then you've never tried building a search engine. It's 90% of the problem.

Google does take plenty of quality features into account[1]. PageRank is one, of course, but that isn't some corporate conspiracy; it's simply a good feature.

[1] https://moz.com/search-ranking-factors/correlations


I'd ask you to cite your claims, but we both know you can't. It's a pity your issues with Google cause you to pollute discussions with BS.


Do a Google search for anything even slightly obscure, and you're likely to find the first page or so of results filled with highly-SEO'd sites that offer little in the way of deep, detailed content. The smaller sites which do have that content, but just haven't been SEO'd much, have been eclipsed. They're still there, but rendered nearly inaccessible.

Interesting discussion on a search engine that does sort of the opposite of Google: https://news.ycombinator.com/item?id=3910304


I hear this a lot, but I'd love to see an example (especially including the sites that should rank).


This may or may not be what you're looking for, but my freelance site just can't hit the front page in my industry, while dozens of highly funded agencies dominate it. I've published a 50k-word industry-specific book on my site, have SEO'd it as much as possible, and have an older domain than those better-funded competitors. Won't link to it here, but it seems to be a real issue to me.

If a search engine let you differentiate and sort between content match vs. PageRank match vs. AdWords spend, we might be able to mitigate the issue somewhat.
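
To make that concrete, here is a minimal sketch in Python (the field names, weights, and URLs are hypothetical) of a reranker that exposes the signals separately and lets the searcher pick the weights:

    def rerank(results, w_content=1.0, w_pagerank=0.0, w_ad_spend=0.0):
        """results: list of dicts with 'content_match', 'pagerank', 'ad_spend' scores in [0, 1]."""
        def score(r):
            # Ad spend is subtracted so the searcher can penalise heavily advertised results.
            return (w_content * r["content_match"]
                    + w_pagerank * r["pagerank"]
                    - w_ad_spend * r["ad_spend"])
        return sorted(results, key=score, reverse=True)

    results = [
        {"url": "https://agency.example.com", "content_match": 0.4, "pagerank": 0.9, "ad_spend": 0.8},
        {"url": "https://freelancer.example.com", "content_match": 0.9, "pagerank": 0.2, "ad_spend": 0.0},
    ]
    # A content-first weighting surfaces the smaller but more relevant site.
    print(rerank(results, w_content=1.0, w_pagerank=0.2, w_ad_spend=0.5))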


btw, your site's redirect to the HTTPS version doesn't seem to work correctly in Firefox, Safari, or IE. After reading your comment, I was curious to learn more. When I typed in just your domain name plus CMD+ENTER (which adds "www." and ".com" to the address bar text in Firefox), I got a 404 page, not the 301 redirect to the HTTPS site. When I add "http://", the redirect seems to work.


Thanks for letting me know, and for the more detailed info. The redirect worked for the basic permutations I tested (www.domain.com, http://domain.com, etc.), but I am redirecting both non-www and www traffic to https://www. in Nginx. (Solution found, see edit below.)

I'd not heard of the CMD+ENTER method before, so thanks for the heads up. Still not entirely sure what Firefox is submitting in that case. Will test.

I wasn't referring to this site in the parent comment, but to my freelance site. I'll put that site in my profile for 24 hours just in case anyone wants to take a look.

EDIT: Fixed, as noted in reply to nl's comments. A recent change led to a redirect line being mistakenly commented out.


This.

I don't work in this area, but I'd say that 99% of the time I hear someone complaining about how Google is favoring sites that pay for advertising over them, I find that they are making these incredibly basic errors.

For me, http://www.linguaquote.com/ gives a 404. It's only when I go to https://www.linguaquote.com/ that it works.


That's all well and good, and thanks for checking, but I wasn't referring to that site in the parent comment. The other site does have SSL enabled, but only recently and via Let's Encrypt. The issue is much longer standing than this.

So I'm afraid it's not quite as simple as you make out in this case.

In other news, I've just pinpointed the missing line in the recently changed Nginx config for Linguaquote: the http://www block had its redirect commented out. Still, for this site in particular, Google Webmaster Tools is set up for the https version, where no errors have been reported, and SSL Labs gave an A+ for the stapling, PFS, Heartbleed, etc. efforts I went to. I don't think this redirect was having an adverse effect on ranking, but I don't expect this site to hit the front page just yet - much more content to add before aiming for that.


Is it bad form to quote myself?

I'd love to see an example (especially including the sites that should rank).


I put the link to the freelance site in my profile, which I only mentioned in my reply to cpeterso above - apologies for not making it clearer.

Will leave it there a bit longer in case you do come back to this thread, as I'm still genuinely interested in your opinion on the matter.


One example I've experienced is searching for iphone jailbreak related stuff. Perhaps that's to be expected.


If you've ever read about the original PageRank algorithm, the parent post is a pretty reasonable way to describe it.

I have no idea what the current algorithm looks like but I'd be shocked if it somehow switched to evaluating the 'quality' of content, however one might do that with an algorithm.


Well, they do have some quality metrics, like duplication with other content, words used, and so on. I suspect more, e.g. writing-style measurements correlated with other things that are found useful. There is a lot that could be done without actually understanding the content, although of course it can be gamed.
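
As one concrete example of such a metric, near-duplicate detection needs no understanding of the content at all. This is a rough sketch using word shingles and Jaccard similarity, not a claim about Google's actual implementation:

    def shingles(text, k=3):
        """Break text into overlapping k-word tuples."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    page_a = "the quick brown fox jumps over the lazy dog"
    page_b = "the quick brown fox jumps over a sleeping dog"

    # A high score suggests near-duplicate (scraped or spun) content.
    print(jaccard(shingles(page_a), shingles(page_b)))  # 0.4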


The primary measurement used for Google's PageRank algorithm is the number and "value" of backlinks that a page has. The "value" of a backlink is determined by the cumulative number of descendant sub-backlinks that it has. This is common knowledge among SEO professionals.

Basically, when judging quality, Google is making assumptions like: "This page has a lot of backlinks, and those backlinks themselves have a lot of backlinks... Therefore this page is of high quality." This approach puts all the power in the hands of content providers (bloggers) who are funded by big companies (or well-funded startups) and who serve the interests of those companies.
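
For illustration, the originally published PageRank formulation can be sketched as power iteration over a link graph. The toy graph, damping factor, and iteration count below are just examples, not Google's production setup (which is far more elaborate and handles dangling pages, personalisation, etc.):

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, targets in links.items():
                if not targets:
                    continue  # dangling pages ignored in this simplified sketch
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # 'c', with the most (and best-linked) backlinks, ends up with the highest rank.
    print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["a", "c"]}))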

Google wrongly assumes that content-providers serve the interest of consumers and that they can be trusted - Which is not the case.



