Why extracting data from PDFs is still a nightmare for data experts

For years, businesses, governments, and researchers have struggled with a persistent problem: How to extract usable data from Portable Document Format (PDF) files. These digital documents serve as containers for everything from scientific research to government records, but their rigid formats often trap the data inside, making it difficult for machines to read and analyze.

“Part of the problem is that PDFs are a creature of a time when print layout was a big influence on publishing software, and PDFs are more of a ‘print’ product than a digital one,” Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, wrote in an email to Ars Technica. “The main issue is that many PDFs are simply pictures of information, which means you need Optical Character Recognition software to turn those pictures into data, especially when the original is old or includes handwriting.”

Computational journalism is a field where traditional reporting techniques merge with data analysis, coding, and algorithmic thinking to uncover stories that might otherwise remain hidden in large datasets, which makes unlocking that data a particular interest for Willis.

NCI employees can’t publish information on these topics without special approval

This story was originally published by ProPublica.

Employees at the National Cancer Institute, which is part of the National Institutes of Health, received internal guidance last week to flag manuscripts, presentations or other communications for scrutiny if they addressed “controversial, high profile, or sensitive” topics. Among the 23 hot-button issues, according to internal records reviewed by ProPublica: vaccines, fluoride, peanut allergies, autism.

While it’s not uncommon for the cancer institute to outline a couple of administration priorities, the scope and scale of the list is unprecedented and highly unusual, said six employees who spoke on the condition of anonymity because they were not authorized to comment publicly. All materials must be reviewed by an institute “clearance team,” according to the records, and could be examined by officials at the NIH or its umbrella agency, the US Department of Health and Human Services.

How Trump’s EPA hopes to avoid greenhouse gas regulations

A document that was first issued in 2009 would seem an unlikely candidate for making news in 2025. Yet the past few weeks have seen a steady stream of articles about an analysis first issued by the Environmental Protection Agency (EPA) in the early years of Obama’s first term: the endangerment finding on greenhouse gases.

The basics of the document are almost mundane: greenhouse gases are warming the climate, and this will have negative consequences for US citizens. But it took a Supreme Court decision to get written in the first place, and it has played a role in every attempt by the EPA to regulate greenhouse gas emissions across multiple administrations. And, while the first Trump administration left it in place, the press reports we’re seeing suggest that an attempt will be made to eliminate it in the near future.

The only problem: The science on which the endangerment finding is based is so solid that any ensuing court case will likely leave its opponents worse off in the long run, which is likely why the earlier Trump administration didn’t challenge it.

Former Google CEO Eric Schmidt is the new leader of Relativity Space

Another Silicon Valley investor is getting into the rocket business.

Former Google chief executive Eric Schmidt has taken a controlling interest in the Long Beach, California-based Relativity Space. The New York Times first reported that the change had become official after Schmidt told employees at an all-hands meeting on Monday.

Schmidt’s involvement with Relativity has been quietly discussed among space industry insiders for a few months. Multiple sources told Ars that he has largely been bankrolling the company since the end of October, when the company’s previous fundraising dried up.

Gmail gains Gemini-powered “Add to calendar” button

Google has a new mission in the AI era: to add Gemini to as many of the company’s products as possible. We’ve already seen Gemini appear in search results, text messages, and more. In Google’s latest update to Workspace, Gemini will be able to add calendar appointments from Gmail with a single click. Well, assuming Gemini gets it right the first time, which is far from certain.

The new calendar button will appear at the top of emails, right next to the summarize button that arrived last year. The calendar option will show up in Gmail threads with actionable meeting chit-chat, allowing you to mash that button to create an appointment in one step. The Gemini sidebar will open to confirm the appointment was made, which is a good opportunity to double-check the robot. There will be a handy edit button in the Gemini window in the event it makes a mistake. However, the robot can’t invite people to these events yet.

The effect of using the button is the same as opening the Gemini panel and asking it to create an appointment. The new functionality is simply detecting events and offering the button as a shortcut of sorts. You should not expect to see this button appear on messages that already have calendar integration, like dining reservations and flights. Those already pop up in Google Calendar without AI.

Elon Musk blames X outages on “massive cyberattack”

After DownDetector reported that tens of thousands of users globally experienced repeated X (formerly Twitter) outages, Elon Musk confirmed the issues are due to an ongoing cyberattack on the platform.

“There was (still is) a massive cyberattack against X,” Musk wrote on X. “We get attacked every day, but this was done with a lot of resources. Either a large, coordinated group and/or a country is involved.”

Details remain vague beyond Musk’s post, but rumors were circulating that X was under a distributed denial-of-service (DDoS) attack.

Firmware update bricks HP printers, makes them unable to use HP cartridges

HP, along with other printer brands, is infamous for issuing firmware updates that brick already-purchased printers that have tried to use third-party ink. In a new form of frustration, HP is now being accused of issuing a firmware update that broke customers’ laser printers—even though the devices are loaded with HP-brand toner.

The firmware update in question is version 20250209, which HP issued on March 4 for its LaserJet MFP M232-M237 models. Per HP, the update includes “security updates,” a “regulatory requirement update,” “general improvements and bug fixes,” and fixes for IPP Everywhere. Looking back to older updates’ fixes and changes, which the new update includes, doesn’t reveal anything out of the ordinary. The older updates mention things like “fixed print quality to ensure borders are not cropped for certain document types,” and “improved firmware update and cartridge rejection experiences.” But there’s no mention of changes to how the printers use or read toner.

However, users have been reporting sudden problems using HP-brand toner in their M232–M237 series printers since their devices updated to 20250209. Users on HP’s support forum say they see Error Code 11 and the hardware’s toner light flashing when trying to print. Some said they’ve cleaned the contacts and reinstalled their toner but still can’t print.

Last of Us S2 trailer features wintry war with the zombies

Pedro Pascal returns as Joel in The Last of Us S2.

HBO released a one-minute teaser of the hotly anticipated second season of The Last of Us—based on Naughty Dog’s hugely popular video game franchise—during CES in January. We now have a full trailer, unveiled at SXSW after the footage leaked over the weekend, chock-full of Easter eggs for gaming fans of The Last of Us Part II.

(Spoilers for S1 below.)

The series takes place in the 20-year aftermath of a deadly outbreak of mutant fungus (Cordyceps) that turns humans into monstrous zombie-like creatures (the Infected, or Clickers). The world has become a series of separate totalitarian quarantine zones and independent settlements, with a thriving black market and a rebel militia known as the Fireflies making life complicated for the survivors. Joel (Pedro Pascal) is a hardened smuggler tasked with escorting the teenage Ellie (Bella Ramsey) across the devastated US, battling hostile forces and hordes of zombies, to a Fireflies unit outside the quarantine zone. Ellie is special: She is immune to the deadly fungus, and the hope is that her immunity holds the key to beating the disease.

Yes, you get used to the grille: The 2025 BMW 430i Gran Coupe review

Like life itself, BMWs seemed less complicated last century. You didn’t need a crib sheet to understand the badge, and body styles were mostly just sedans, with a smattering of station wagons, two-door coupes, and convertibles. That was before it helped kickstart the SUV craze; now instead of 3, 5, 7, the series run 2–8 and X1 through X7. And don’t get me started on individual model names. Like the 2025 430i xDrive Gran Coupe.

At first glance, if you’re middle-aged like the average Ars reader, your brain probably says “this is a 3 Series sedan.” After all, it has a pair of doors on either side. But there is no requirement for a coupe to only have two doors: the name is derived from the French “couper,” meaning cut. And indeed, the roofline is cut down more than 2 inches lower than the actual 3 Series.

There’s also a hatch at the rear, rather than a trunk lid. So, technically it’s a fastback body style, which BMW has decided to call Gran Coupe the way it calls station wagons Tourings. Pedantic pigeonholing of body style will probably take a back seat to discussion of the front grille, though.

Developer convicted for “kill switch” code activated upon his termination

A 55-year-old software developer faces up to 10 years in prison for deploying malicious code that sabotaged his former employer’s network, allegedly costing hundreds of thousands of dollars in losses.

The US Department of Justice announced Friday that Davis Lu was convicted by a jury after “causing intentional damage to protected computers” reportedly owned by the Ohio- and Dublin-based power management company Eaton Corp.

Lu had worked at Eaton Corp. for about 11 years when he apparently became disgruntled by a corporate “realignment” in 2018 that “reduced his responsibilities,” the DOJ said.

What’s behind the recent string of failures and delays at SpaceX?

It has been an uncharacteristically messy start to the year for the world’s leading spaceflight company, SpaceX.

Let’s start with the company’s most recent delay. The latest launch date for a NASA mission to survey the sky and better understand the early evolution of the Universe comes Monday night. The launch window for this SPHEREx mission opened on February 28, but a series of problems with integrating the rocket and payloads has delayed the mission by nearly two weeks.

Then there are the Falcon 9 first stage issues. Last week, a Falcon 9 rocket launched nearly two dozen Starlink satellites into low-Earth orbit. However, one of the rocket’s nine engines suffered a fuel leak during ascent. Due to a lack of oxygen in the thinning atmosphere, the fuel leak did not preclude the satellites from reaching orbit. But when the first stage returned to Earth, it caught fire after landing on a droneship, toppling over. This followed a similar issue in August, when there was a fire in the engine compartment. After nearly three years without a Falcon 9 landing failure, SpaceX had two in six months.

Google Pixel 4a’s painful “update” was due to battery overheating risk

Google didn’t explain exactly why it shipped a mandatory software update to the Pixel 4a, an Android phone from 2020, earlier this year. The nature of that update, which gave some models all but unusable battery life, provided some clues, as did later software analysis. But now, Australian authorities have provided a more concrete answer: battery overheating and fire risk.

The Australian Competition and Consumer Commission’s (ACCC) Product Safety arm issued a recall for the Pixel 4a late last week. The reason, the commission said, is that Google’s firmware update and battery changes served to “mitigate the risk of overheating” because “an overheating battery could pose a risk of fire and/or burns to a user.”

Image: Product Safety Recall notice, with red border and triangle symbol, asking consumers “Do you own this product?” with an image of a Google Pixel 4a. Credit: ACCC Product Safety

In the US and elsewhere, Google’s messaging did not use the term “recall.” Google stated on its “Pixel 4a Battery Performance Program” page that “certain” Pixel 4a models “require a software update to improve the stability of their battery’s performance,” which also “reduces available battery capacity and impacts charging performance.” Google said it is still safe to charge a Pixel 4a.

DOJ: Google must sell Chrome, Android could be next

Google has gotten its first taste of remedies that Donald Trump’s Department of Justice plans to pursue to break up the tech giant’s monopoly in search. In the first filing since Trump allies took over the department, government lawyers backed off a key proposal submitted by the Biden DOJ. The government won’t ask the court to force Google to sell off its AI investments, and the way it intends to handle Android is changing. However, the most serious penalty is intact—Google’s popular Chrome browser is still on the chopping block.

“Google’s illegal conduct has created an economic goliath, one that wreaks havoc over the marketplace to ensure that—no matter what occurs—Google always wins,” the DOJ filing says. To that end, the government maintains that Chrome must go if the playing field is to be made level again.

The DOJ is asking the court to force Google to promptly and fully divest itself of Chrome, along with any data or other assets required for its continued operation. It is essentially aiming to take the Chrome user base—consisting of some 3.4 billion people—away from Google and hand it to a competitor. The government will vet any potential buyers to ensure the sale does not pose a national security threat. During the term of the judgment, Google would not be allowed to release any new browsers. However, it may continue to contribute to the open source Chromium project.

Spark 2 adds AI, doubles its DSP power to help your guitar rock out

The Spark 2 from Positive Grid looks like a miniature old-school amp, but it is, essentially, a computer with some knobs and a speaker. It has Bluetooth, USB-C, and an associated smartphone app. It needs firmware updates, which can brick the device—ask me how I found this out—and it runs code on DSP chips. New guitar tones can be downloaded into the device, where they run as software rather than as analog electrical circuits in an amp or foot pedal.

In other words, the Spark 2 is the latest example of the “software-ization” of music.

Forget the old image of a studio filled with a million-dollar, 48-track mixing board from SSL or API and bursting with analog amps, vintage mics, and ginormous plate reverbs. Studios today are far more likely to be digital, where people record “in the box” (i.e., they track and mix on a computer running software like Pro Tools or Logic Pro) using digital models of classic (and expensive) amplifiers, coded by companies like NeuralDSP and IK Multimedia. These modeled amp sounds are then run through convolution software that relies on digital impulse responses captured from different speakers and speaker cabinets. They are modified with effects like chorus and distortion, which are all modeled, too. The results can be world-class, and they’re increasingly showing up on records.
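
For readers curious what “convolution software” is doing under the hood, here is a minimal sketch in Python. It is an illustration only, not how any particular product implements it: the function name `convolve` and the tiny sample lists are made up for the example, and real plugins work on much longer impulse responses and typically use faster FFT-based methods rather than this direct loop.

```python
def convolve(signal, impulse_response):
    """Direct discrete convolution of two lists of audio samples.

    Each input sample triggers a scaled copy of the impulse response;
    the output is the sum of all those overlapping copies.
    """
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out


# Hypothetical toy data: four "dry" amp samples and a three-sample
# speaker-cabinet impulse response (real ones run to thousands of samples).
dry = [1.0, 0.5, 0.0, -0.5]
cab_ir = [0.6, 0.3, 0.1]

wet = convolve(dry, cab_ir)
print(wet)
```

Feeding a single click (`[1.0]`) through `convolve` returns the impulse response itself, which is exactly how such responses are captured from a real speaker cabinet in the first place.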

Study: Megalodon’s body shape was closer to a lemon shark

The giant extinct shark species known as the megalodon has captured the interest of scientists and the general public alike, even inspiring the 2018 blockbuster film The Meg. The species went extinct some 3.6 million years ago, and no complete skeleton has yet been found. So there has been considerable debate among paleobiologists about megalodon’s size, body shape and swimming speed, among other characteristics.

While some researchers have compared megalodon to a gigantic version of the stocky great white shark, others believe the species had a more slender body shape. A new paper published in the journal Palaeontologia Electronica bolsters the latter viewpoint, also drawing conclusions about the megalodon’s body mass, swimming speed (based on hydrodynamic principles), and growth patterns.

As previously reported, the largest shark alive today, reaching up to 20 meters long, is the whale shark, a sedate filter feeder. As recently as 4 million years ago, however, sharks of that scale likely included the fast-moving predator megalodon (formally Otodus megalodon). Due to incomplete fossil data, we’re not entirely sure how large megalodons were and can only make inferences based on some of their living relatives.

Huh? The valuable role of interjections

Listen carefully to a spoken conversation and you’ll notice that the speakers use a lot of little quasi-words—mm-hmm, um, huh? and the like—that don’t convey any information about the topic of the conversation itself. For many decades, linguists regarded such utterances as largely irrelevant noise, the flotsam and jetsam that accumulate on the margins of language when speakers aren’t as articulate as they’d like to be.

But these little words may be much more important than that. A few linguists now think that far from being detritus, they may be crucial traffic signals to regulate the flow of conversation as well as tools to negotiate mutual understanding. That puts them at the heart of language itself—and they may be the hardest part of language for artificial intelligence to master.

“Here is this phenomenon that lives right under our nose, that we barely noticed,” says Mark Dingemanse, a linguist at Radboud University in the Netherlands, “that turns out to upend our ideas of what makes complex language even possible in the first place.”

New research shows bigger animals get more cancer, defying decades-old belief

A longstanding scientific belief about a link between cancer prevalence and animal body size has been tested for the first time in our new study ranging across hundreds of animal species.

If larger animals have more cells, and cancer comes from cells going rogue, then the largest animals on Earth—like elephants and whales—should be riddled with tumours. Yet, for decades, there has been little evidence to support this idea.

Many species seem to defy this expectation entirely. For example, budgies are notorious among pet owners for being prone to renal cancer despite weighing only 35 g. Yet cancer only accounts for around 2 percent of mortality among roe deer (up to 35 kg).

Blood Typers is a terrifically tense, terror-filled typing tutor

When you think about it, the keyboard is the most complex video game controller in common use today, with over 100 distinct inputs arranged in a vast grid. Yet even the most complex keyboard-controlled games today tend to only use a relative handful of all those available keys for actual gameplay purposes.

The biggest exception to this rule is a typing game, which by definition asks players to send their fingers flying across every single letter on the keyboard (and then some) in quick succession. By default, though, typing games tend to take the form of extremely basic typing tutorials, where the gameplay amounts to little more than typing out words and sentences by rote as they appear on screen, maybe with a few cute accompanying animations.


Image: Typing “gibbon” quickly has rarely felt this tense or important. Credit: Outer Brain Studios

Blood Typers adds some much-needed complexity to that basic type-the-word-you-see concept, layering its typing tests on top of a full-fledged survival horror game reminiscent of the original PlayStation era. The result is an amazingly tense and compelling action adventure that also serves as a great way to hone your touch-typing skills.

NASA officials undermine Musk’s claims about ‘stranded’ astronauts

Over the last month there has been something more than a minor kerfuffle in the space industry over the return of two NASA astronauts from the International Space Station.

The fate of Butch Wilmore and Suni Williams, who launched on the first crewed flight of Boeing’s Starliner spacecraft on June 5, 2024, has become a political issue after President Donald Trump and SpaceX founder Elon Musk said the astronauts’ return was held up by the Biden White House.

In February, Trump and Musk appeared on FOX News. During the joint interview, the subject of Wilmore and Williams came up. They remain in space today after NASA decided it would be best they did not fly home in their malfunctioning Starliner spacecraft—but would return in a SpaceX-built Crew Dragon.

The X-37B spaceplane lands after helping pave the way for “maneuver warfare”

The US military’s robotic mini-space shuttle dropped out of orbit and glided to a runway in California late Thursday, ending a 434-day mission that pioneered new ways of maneuvering in space.

The X-37B spaceplane touched down on Runway 12 at Vandenberg Space Force Base, California, at 11:22 pm local time Thursday (2:22 am EST Friday), capping its high-flying mission with an automated reentry and landing on the nearly three-mile-long runway at the West Coast’s spaceport.

The Space Force did not publicize the spacecraft’s return ahead of time, keeping with the Pentagon’s policy of secrecy surrounding the X-37B program. This was the seventh flight of an X-37B spaceplane, or Orbital Test Vehicle, since its first foray into orbit in 2010.
