Migrate Process Newspaper3k provides a Migrate process plugin that lets you request and extract data from Newspaper3k, the Python-based article download framework.
Newspaper3k Features
- Multi-threaded article download framework
- News URL identification
- Text extraction from HTML
- Top image extraction from HTML
- All image extraction from HTML
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, …)
Prerequisites
A LAMP server with Python 3 support.
See https://github.com/2dareis2do/newspaper3k-php-wrapper for installation and setup instructions.
Example Usage
process:
  'body/value':
    -
      plugin: migrate_process_js_redirect_link
      source: link
    -
      plugin: migrate_process_newspaper3k
    -
      plugin: skip_on_empty
      method: row
      message: 'migrate_process_newspaper3k import failed'
    -
      plugin: extract
      index:
        - summary
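To illustrate what the last two steps of this pipeline do with the array returned by migrate_process_newspaper3k, here is a minimal Python sketch. It is not the Drupal implementation: `SkipRow`, `skip_on_empty`, `extract` and `run_pipeline` are hypothetical stand-ins written for this example, and Python truthiness only approximates PHP's notion of "empty".

```python
class SkipRow(Exception):
    """Hypothetical stand-in for the exception Drupal raises to skip a row."""


def skip_on_empty(value, message):
    # Mirrors `method: row`: an empty value aborts the whole row.
    if not value:
        raise SkipRow(message)
    return value


def extract(value, index):
    # Walk into the nested structure one key at a time, as `index` lists them.
    for key in index:
        value = value[key]
    return value


def run_pipeline(article):
    # `article` plays the role of the array produced by the newspaper3k step.
    value = skip_on_empty(article, "migrate_process_newspaper3k import failed")
    return extract(value, ["summary"])


if __name__ == "__main__":
    print(run_pipeline({"title": "Example", "summary": "A short summary."}))
```

With an empty result from the Newspaper3k step, `run_pipeline` raises `SkipRow` with the configured message instead of writing an empty body.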
Schema Keys
See the included stub JSON for more information. Keys of note include:
summary (string)
source_url (string)
url (string)
title (string)
top_img (string)
top_image (string)
meta_img (string)
imgs (array)
images (array)
movies (array)
text (string)
keywords (array)
meta_keywords (array)
tags (array)
authors (array)
publish_date (string)
summary (string)
html (string)
meta_data (array object)
meta_description (string)
article_html (string)
top_node (string)
doc (string)
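The keys above appear to mirror attributes of Newspaper3k's Article object. As a rough sketch of how such an object could be flattened into a JSON-friendly structure with these keys, the following Python uses a stub object rather than a real download. The function name `article_to_payload` and the stub values are assumptions for illustration, not the wrapper's actual code, and `top_node` and `doc` are left out for brevity.

```python
import json
from datetime import datetime
from types import SimpleNamespace

# Keys taken from the list above, grouped by the type given there.
STRING_KEYS = ["source_url", "url", "title", "top_img", "top_image", "meta_img",
               "text", "summary", "html", "meta_description", "article_html"]
ARRAY_KEYS = ["imgs", "images", "movies", "keywords", "meta_keywords",
              "tags", "authors"]


def article_to_payload(article):
    """Build a JSON-serialisable dict from an Article-like object (hypothetical helper)."""
    payload = {key: str(getattr(article, key, "") or "") for key in STRING_KEYS}
    for key in ARRAY_KEYS:
        value = getattr(article, key, None) or []
        # Sets are sorted so the output is stable; lists keep their order.
        payload[key] = sorted(value) if isinstance(value, (set, frozenset)) else list(value)
    published = getattr(article, "publish_date", None)
    payload["publish_date"] = published.isoformat() if isinstance(published, datetime) else ""
    payload["meta_data"] = dict(getattr(article, "meta_data", None) or {})
    return payload


# Stub standing in for a parsed article; all values are made up.
stub = SimpleNamespace(
    url="https://example.com/a",
    title="Example",
    summary="Short summary.",
    tags={"b", "a"},
    authors=["Jane Doe"],
    publish_date=datetime(2024, 1, 2),
)
payload = article_to_payload(stub)
print(json.dumps(payload, indent=2))
```

Missing attributes fall back to an empty string, list or object, so every key is present in the output.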
DDEV support
If you run DDEV locally, you can install Newspaper3k and the natural language data sets (corpora) it needs by adding the following to the config.yaml file in your project's .ddev folder.
hooks:
  post-start:
    - exec: 'pip3 install newspaper3k'
    - exec: 'curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3'
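Once the container has started, one quick way to confirm that the install hook worked is to check that the Python modules can be found. The sketch below uses only the standard library. The helper name `module_available` is made up for this example, and checking `nltk` assumes it was pulled in as a Newspaper3k dependency.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # `newspaper` is the import name of the newspaper3k package.
    for name in ("newspaper", "nltk"):
        print(name, "OK" if module_available(name) else "missing")
```

Run it with python3 inside the web container, where the hooks install the packages.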
More info
For more info on how Newspaper3k parses articles please see https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...
Date Created
9 months 2 weeks ago