Friday, May 12, 2017

Well THAT'S F#@!*d Up

It looks like all my old Examiner.com stuff, including all of the original Fast and Furious reports, is now being blocked from Wayback Machine retrieval "due to robots.txt."

I hope it's just a temporary glitch over at the Internet Archive. Otherwise, that's years of stuff that I often still need to refer to because it's the only documentation out there.

Yes, I still have Word documents, but those do not contain revisions or links added in the publishing tool on the journey from draft to final. It would literally take thousands of hours to reconstruct everything, a task I'll never be able to even attempt without giving up everything else.

I need to look into this, but not today. I need to focus on what's in front of me and channel the negativity (putting it mildly) I'm feeling over this into some serious workaround calculating.

God, those people were/are douchebags.

11 comments:

Anonymous said...

David,

I volunteer to help in any way I'm able. I'm pretty good at keyboarding, and tech solutions if that helps.

HinMO

Anonymous said...

Ransomware attack world-wide

David Codrea said...

Thanks HinMO-- I'm not even ready to look at this today-- perhaps this is a hiccup and in a day or two things will be normal again.

And in re the ransomware attack, beats me. It's terrifying how vulnerable all the systems are, and that's the point.

Anonymous said...

Read the Congressional EMP Commission's findings. Not good.

FedUp said...

If it comes back, you need to get somebody to set up an automated script to download it all and save it to a thumb drive, then make copies of the thumb drive.

John Otis Comeau said...

you can pull up http://examiner.com/robots.txt yourself in a browser. it has:

User-agent: Mediapartners-Google
User-agent: SemrushBot-SA
User-agent: Googlebot-Image
User-agent: Googlebot
User-agent: Googlebot-mobile
Disallow:

User-agent: *
Disallow: /

so it's allowing Googlebot and a few other related bots access, but disallowing everything else. theoretically you could retrieve the page by changing your User-Agent string to one of the above, but I did that using wget and still got a `403 Forbidden` error: `wget -O- --user-agent=Googlebot https://web.archive.org/web/20120206005948/http://www.examiner.com/gun-rights-in-national/a-journalist-s-guide-to-project-gunwalker-part-two`

so this is something the owners of examiner.com did, not something archive.org changed; archive.org has been honoring robots.txt for a long time, maybe from the beginning.

but that doesn't explain why I couldn't access it using "Googlebot" as user-agent.

David Codrea said...

Thank you John. Afraid you're trying to explain calculus to a Neanderthal.

Unknown said...

Couldn't find an email address, hence this post. Sorry to be off topic.

Here's one that could be titled 'we're the only ones liquefied enough':


HAMILTON COUNTY, Ohio -- Deputy Bobby Colwell was drunk and causing a scene at Froggy's in Monroe last April, records show. He told police he was intoxicated, disorderly and had two knives -- and they should arrest him.

One officer told Colwell an arrest would hurt his law enforcement career.
According to a police report, Colwell replied: "It's Hamilton County. I won't get fired."
He was right.

Colwell was arrested, booked into the Butler County Jail and pleaded guilty to disorderly conduct. But as he predicted, Sheriff Jim Neil did not fire him. In fact, Neil didn’t even suspend the six-year veteran.

An I-Team investigation found six Hamilton County sheriff's employees -- five deputies and one civilian -- have been charged with OVI since Neil took office in 2013. Two were suspended. Four received written reprimands. No one was fired.

http://www.wcpo.com/news/insider/drunken-deputies-how-hamilton-county-sheriff-handles-alcohol-related-discipline

Steve said...

Actually, it does explain it, John. The robots.txt applies to you hitting examiner.com directly. Changing your user agent when you hit the Wayback Machine doesn't matter. The new file doesn't allow the Wayback Machine user agent (ia_archiver), so that effectively blocks all examiner.com content, even though it's already been archived.

See Prevent The Wayback Machine from Archiving Your Pages (and Delete All History!)

David, unless they change it to allow the Wayback Machine access, I'm afraid your content will not be accessible. This is something the current owners of examiner.com have done. The Wayback Machine is just honoring robots.txt, which is kinda like a doorman, for things that honor it.

Whenever something makes a request to a website, one of the items included in the HTTP request is the user agent, or program, making the request. All of the web browsers have them, as well as all the search bots and other things that crawl the web. For the ones that honor robots.txt, they will read that file and see if they have been granted access. If not, they'll leave.

This section means things identifying themselves with these user agents are not blocked from any content.

User-agent: Mediapartners-Google
User-agent: SemrushBot-SA
User-agent: Googlebot-Image
User-agent: Googlebot
User-agent: Googlebot-mobile
Disallow:

In this case, the Wayback Machine identifies itself as ia_archiver, and it doesn't have an explicit allow like the user agents above do. It sees this section:

User-agent: *
Disallow: /

And that tells it that it doesn't have access.
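
For anyone who wants to check this for themselves, here is a minimal sketch using Python's standard `urllib.robotparser` module. It feeds the rules quoted above straight into the parser (no network access needed) and asks whether a given user agent may fetch a page. The `ROBOTS_TXT` and `PAGE` constants and the `may_fetch` helper are names I made up for illustration; the page path comes from John's wget command.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted in this thread (copied by hand, not fetched live).
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
User-agent: SemrushBot-SA
User-agent: Googlebot-Image
User-agent: Googlebot
User-agent: Googlebot-mobile
Disallow:

User-agent: *
Disallow: /
"""

# Example page, taken from the wget command earlier in the comments.
PAGE = ("http://www.examiner.com/gun-rights-in-national/"
        "a-journalist-s-guide-to-project-gunwalker-part-two")


def may_fetch(user_agent, url=PAGE):
    """Return True if the quoted rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "ia_archiver"):
        print(agent, may_fetch(agent))
```

With these rules, `may_fetch("Googlebot")` comes back True and `may_fetch("ia_archiver")` comes back False. Keep in mind this only models how a well-behaved crawler reads the file; how archive.org applies the answer to pages it already holds is its own policy.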

This is all dependent on robots.txt being honored. There are plenty of things out there that don't care and will do whatever they want. Site owners have other options if they want to block those things. The web server can be configured to block requests from certain user agents, or it can be done in code.
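
The "done in code" option might look roughly like this: a minimal sketch of a Python WSGI application that refuses any request whose User-Agent header contains a blocked name. The block list and the function names are hypothetical, and I'm not claiming this is how examiner.com or archive.org actually do it.

```python
# Hypothetical block list; a real site would choose its own entries.
BLOCKED_AGENTS = ("ia_archiver", "examplebadbot")


def is_blocked(user_agent):
    """Return True if the User-Agent string contains a blocked name."""
    ua = user_agent.lower()
    return any(name in ua for name in BLOCKED_AGENTS)


def app(environ, start_response):
    """WSGI entry point: 403 for blocked agents, 200 for everyone else."""
    # WSGI exposes the User-Agent request header under this key.
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if is_blocked(user_agent):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome\n"]
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should serve it. Unlike robots.txt, this check is enforced by the server, although a client can still get around it by sending a different User-Agent string.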

Anonymous said...

This is something I've been wondering about for a long time, ever since archive.org started letting robots.txt block my access to sites I use for research, even though they're inactive. I just hadn't the time to research the new (and frustrating) system of censorship.

Thanks John and Steve
MadMagyar

Anonymous said...

I wish I could help David but I'm no good at stuff like this. Let's hope that it is just a glitch and you'll be able to access it. Good Luck & Best Wishes.