How to get the full news story as plain text in a CSV

I have fetched the news stories into a dataframe and saved it to a CSV. The stories contain HTML tags. Is there a way I can get each story in plain text?


import eikon as ek
import pandas as pd

# Assumes ek.set_app_key(...) has already been called with a valid app key.
syntax1 = "(Hospital OR Health Center OR Medical center OR health system OR university hospital OR Emergency Department OR Inpatient OR Rehabilitat OR ICU ) AND ( build OR reopen OR construct OR expansion OR upgrade OR develop OR repurpose OR modern )"
df = ek.get_news_headlines(syntax1, 100, date_from="2021-03-25T00:00:00", date_to="2021-04-10T00:00:00")
# Collect the stories in a list, then build the DataFrame once.
rows = []
for index, headline_row in df.iterrows():
    story = ek.get_news_story(headline_row['storyId'])
    rows.append({'DATE': index, 'STORY': story})
stories = pd.DataFrame(rows, columns=['DATE', 'STORY']).set_index('DATE')
result = pd.concat([df, stories], axis=1)
result.to_csv("news.csv")

The result dataframe looks like this. I want to get rid of the HTML tags.

[Image: 1617811496770.png]

Tags: eikon, eikon-data-api, workspace, workspace-data-api, refinitiv-dataplatform-eikon, python

@alankar.gupta Our news stories are delivered as HTML, so you can use a package like Beautiful Soup (bs4) to strip the HTML tags, hyperlinks, etc. from the text. Please see this article on how to do it. I hope this helps.
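
As a rough illustration of the Beautiful Soup approach, here is a minimal sketch. The helper name `html_to_text` and the sample HTML are made up for this example, and the actual markup returned by `ek.get_news_story` may look different.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML fragment as plain text."""
    # Non-string values (e.g. NaN for a missing story) pass through unchanged.
    if not isinstance(html, str):
        return html
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags whose contents are not meant to be read as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text nodes with a space and trim whitespace.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample standing in for a story returned by ek.get_news_story().
sample = (
    '<div class="storyContent"><p>New <b>hospital</b> wing to open.</p>'
    '<a href="https://example.com">Read more</a></div>'
)
print(html_to_text(sample))
```

With the dataframe from the question, this could be applied with `result['STORY'] = result['STORY'].apply(html_to_text)` before calling `result.to_csv("news.csv")`.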

A simple solution is to use the html2text library to extract the text from the HTML:

import html2text
...

result = pd.concat([df, stories], axis=1)

result['STORY'] = result['STORY'].apply(html2text.html2text)

result.to_csv("news.csv")

