Friday, May 12, 2017

Well THAT'S F#@!*d Up

It looks like all my old Examiner.com stuff, including all of the original Fast and Furious reports, is now being blocked from Wayback Machine retrieval "due to robots.txt."

I hope it's just a temporary glitch over at the Internet Archive. Otherwise, that's years of stuff that I often still need to refer to because it's the only documentation out there.

Yes, I still have Word documents, but those do not contain revisions or links added in the publishing tool on the journey from draft to final. It would literally take thousands of hours to reconstruct everything, a task I'll never be able to even attempt without giving up everything else.

I need to look into this, but not today. I need to focus on what's in front of me and channel the negativity (putting it mildly) I'm feeling over this into some serious workaround calculating.

God, those people were/are douchebags.

11 comments:

  1. David,

    I volunteer to help in any way I'm able. I'm pretty good at keyboarding, and tech solutions if that helps.

    HinMO

  2. Ransomware attack world-wide

  3. Thanks HinMo-- I'm not even ready to look at this today-- perhaps this is a hiccup and in a day or two things will be normal again.

    And in re the ransomware attack, beats me. It's terrifying how vulnerable all the systems are, and that's the point.

  4. Read the Congressional EMP Commission's findings. Not good.

  5. If it comes back, you need to get somebody to set up an automated script to download it all and save it to a thumb drive, then make copies of the thumb drive.

  6. you can pull up http://examiner.com/robots.txt yourself in a browser. it has:

    User-agent: Mediapartners-Google
    User-agent: SemrushBot-SA
    User-agent: Googlebot-Image
    User-agent: Googlebot
    User-agent: Googlebot-mobile
    Disallow:

    User-agent: *
    Disallow: /

    so it's allowing Googlebot and a few other related bots access, but disallowing everything else. theoretically you could retrieve the page by changing your User-Agent string to one of the above, but I did that using wget and still got a `403 Forbidden` error: `wget -O- --user-agent=Googlebot https://web.archive.org/web/20120206005948/http://www.examiner.com/gun-rights-in-national/a-journalist-s-guide-to-project-gunwalker-part-two`

    so this is something the owners of examiner.com did, not something archive.org changed; archive.org has been honoring robots.txt for a long time, maybe from the beginning.

    but that doesn't explain why I couldn't access it using "Googlebot" as user-agent.

  7. Thank you John. Afraid you're trying to explain calculus to a Neanderthal.

  8. Couldn't find an email address, hence this post. Sorry to be off topic.

    Here's one that could be titled 'we're the only ones liquefied enough':


    HAMILTON COUNTY, Ohio -- Deputy Bobby Colwell was drunk and causing a scene at Froggy's in Monroe last April, records show. He told police he was intoxicated, disorderly and had two knives -- and they should arrest him.

    One officer told Colwell an arrest would hurt his law enforcement career.
    According to a police report, Colwell replied: "It's Hamilton County. I won't get fired."
    He was right.

    Colwell was arrested, booked into the Butler County Jail and pleaded guilty to disorderly conduct. But as he predicted, Sheriff Jim Neil did not fire him. In fact, Neil didn’t even suspend the six-year veteran.

    An I-Team investigation found six Hamilton County sheriff's employees -- five deputies and one civilian -- have been charged with OVI since Neil took office in 2013. Two were suspended. Four received written reprimands. No one was fired.

    http://www.wcpo.com/news/insider/drunken-deputies-how-hamilton-county-sheriff-handles-alcohol-related-discipline

  9. Actually, it does explain it, John. The robots.txt applies to you hitting examiner.com directly. Changing your user agent when you hit the Wayback Machine doesn't matter, because the Wayback Machine checks the site's robots.txt against its own user agent, not yours. The new file doesn't allow the Wayback Machine user agent (ia_archiver), so that effectively blocks all examiner.com content, even though it's already been archived.

    See Prevent The Wayback Machine from Archiving Your Pages (and Delete All History!)

    David, unless they change it to allow the Wayback Machine access, I'm afraid your content will not be accessible. This is something the current owners of examiner.com have done. The Wayback Machine is just honoring robots.txt, which is kinda like a doorman, for things that honor it.

    Whenever something makes a request to a website, one of the items included in the HTTP request is the user agent, or program, making the request. All of the web browsers have them, as well as all the search bots and other things that crawl the web. For the ones that honor robots.txt, they will read that file and see if they have been granted access. If not, they'll leave.

    This section means things identifying themselves with these user agents are not blocked from any content.

    User-agent: Mediapartners-Google
    User-agent: SemrushBot-SA
    User-agent: Googlebot-Image
    User-agent: Googlebot
    User-agent: Googlebot-mobile
    Disallow:

    In this case, The Wayback Machine identifies itself as ia_archiver, and it doesn't have an explicit allow like the user agents above do. It sees this section:

    User-agent: *
    Disallow: /

    And that tells it that it doesn't have access.

    This is all dependent on robots.txt being honored. There are plenty of things out there that don't care and will do whatever they want. Site owners have other options if they want to block those things. The web server can be configured to block requests from certain user agents, or it can be done in code.

  10. This is something I've been wondering about for a long time, ever since archive.org started letting robots.txt block my access to sites I use for research, even though they're inactive. I just hadn't the time to research the new (and frustrating) system of censorship.

    Thanks John and Steve
    MadMagyar

  11. I wish I could help David but I'm no good at stuff like this. Let's hope that it is just a glitch and you'll be able to access it. Good Luck & Best Wishes.

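The robots.txt rules quoted in comments 6 and 9 can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser` module; it only models how a well-behaved crawler would read those rules and is not the Wayback Machine's actual code. The names `ROBOTS_TXT`, `ARTICLE_URL`, `is_allowed`, and the agent `SomeOtherBot` are made up for illustration; the rules and the article URL are copied from the comments.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules as quoted in the comments (not fetched live).
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
User-agent: SemrushBot-SA
User-agent: Googlebot-Image
User-agent: Googlebot
User-agent: Googlebot-mobile
Disallow:

User-agent: *
Disallow: /
"""

# Article URL taken from the wget command in comment 6.
ARTICLE_URL = (
    "http://www.examiner.com/gun-rights-in-national/"
    "a-journalist-s-guide-to-project-gunwalker-part-two"
)


def is_allowed(user_agent: str, url: str = ARTICLE_URL) -> bool:
    """Return True if a crawler honoring ROBOTS_TXT may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical crawler name with no entry of its own.
    for agent in ("Googlebot", "ia_archiver", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(agent) else "blocked"
        print(f"{agent}: {verdict}")
```

Under these rules Googlebot matches the first group, whose empty `Disallow:` permits everything, while ia_archiver and any other unlisted agent fall through to `User-agent: *` and `Disallow: /`. That is the distinction comment 9 describes: the check is made against the crawler's own identity, so changing the User-Agent on a request to web.archive.org has no effect on it.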
