
Migrate Process Newspaper3k

Submitted by daniel on

Migrate Process Newspaper3k provides a Migrate process plugin that lets you request article data from Newspaper3k, the Python-based article download framework, and extract it during a migration.

Newspaper3k Features

  1. Multi-threaded article download framework
  2. News URL identification
  3. Text extraction from HTML
  4. Top image extraction from HTML
  5. All image extraction from HTML
  6. Keyword extraction from text
  7. Summary extraction from text
  8. Author extraction from text
  9. Google trending terms extraction
  10. Works in 10+ languages (English, Chinese, German, Arabic, …)

Prerequisites

A LAMP server with Python 3 support.

See https://github.com/2dareis2do/newspaper3k-php-wrapper for installation and setup instructions.

Example Usage

process:
   'body/value':
     -
       plugin: migrate_process_js_redirect_link
       source: link
     -
       plugin: migrate_process_newspaper3k
     -
       plugin: skip_on_empty
       method: row
       message: 'migrate_process_newspaper3k import failed'
     -
       plugin: extract
       index:
         - summary
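
The same pipeline can target other keys. As a minimal sketch, assuming the plugin returns an array keyed as listed under Schema Keys and that the destination is the node title, the mapping below pulls out title instead of summary. The default option belongs to the core extract plugin; the fallback string is purely illustrative. In a real migration this mapping would sit under the existing process key.

```yaml
process:
  title:
    -
      plugin: migrate_process_js_redirect_link
      source: link
    -
      plugin: migrate_process_newspaper3k
    -
      plugin: skip_on_empty
      method: row
      message: 'migrate_process_newspaper3k import failed'
    -
      plugin: extract
      index:
        - title
      # Illustrative fallback, used only when the title key is missing.
      default: 'Untitled article'
```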

Schema Keys

See the stub JSON file included with the module for the full schema. Keys of note include:

summary (string)
source_url (string)
url (string)
title (string)
top_img (string)
top_image (string)
meta_img (string)
imgs (array)
images (array)
movies (array)
text (string)
keywords (array)
meta_keywords (array)
tags (array)
authors (array)
publish_date (string)
html (string)
meta_data (array object)
meta_description (string)
article_html (string)
top_node (string)
doc (string)
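
Keys marked as arrays can be reached with a nested index. The sketch below assumes that authors is a zero-indexed list of strings and that field_author_name is a hypothetical text field on the destination; neither is confirmed by the module itself. It takes the first author and falls back to an empty string when none is present.

```yaml
process:
  # Hypothetical destination field, named for illustration only.
  field_author_name:
    -
      plugin: migrate_process_js_redirect_link
      source: link
    -
      plugin: migrate_process_newspaper3k
    -
      plugin: skip_on_empty
      method: row
      message: 'migrate_process_newspaper3k import failed'
    -
      plugin: extract
      # Each index entry descends one level: authors, then element 0.
      index:
        - authors
        - 0
      default: ''
```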

DDEV support

If you are running DDEV locally, you can add the following to the config.yaml file in your project's .ddev folder to install Newspaper3k and the natural language data sets (corpora) it needs.

hooks:
    post-start:
        - exec: 'pip3 install newspaper3k'
        - exec: 'curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3'
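
To confirm that the install succeeded, one option is an extra hook that imports the package on each start. This is a sketch rather than part of the module's documented setup: the third exec line is an addition, and it assumes the package exposes a __version__ attribute, as current Newspaper3k releases do.

```yaml
hooks:
    post-start:
        - exec: 'pip3 install newspaper3k'
        - exec: 'curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3'
        # Added check (assumption): fails visibly if the package cannot be imported.
        - exec: 'python3 -c "import newspaper; print(newspaper.__version__)"'
```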

More info

For more info on how Newspaper3k parses articles, please see https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#pa...

Download

https://www.drupal.org/project/migrate_process_newspaper3k
 

Date Created
2 months ago